# Introduction and credits

This notebook is an application of techniques from the fantastic fast.ai course *[A Code-First Introduction to Natural Language Processing](https://www.fast.ai/2019/07/08/fastai-nlp/)* by Rachel Thomas and Jeremy Howard to the kaggle competition *[Real or Not? NLP with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started)*. I also want to point out Jeremy Howard's [kaggle kernel](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline) from the earlier *[Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)* which served as an inspiration for this notebook.

While many other notebooks in this competition focus on the application of complex deep learning architectures, such as recurrent neural networks and transformers with pre-trained weights, I want to present another approach by training a fairly simple classification model from scratch that nevertheless delivers surprisingly good results.

The model I will use was introduced in this [paper](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf) by Sida Wang and Christopher D. Manning. The paper shows that while Naive Bayes (NB) usually performs better on text classification problems with short input sequences and Support Vector Machines (SVM) are better on longer sequences, a combined approach that uses NB log-count ratios as input features to an SVM performs consistently well on a variety of classification tasks and datasets with different sequence lengths. In line with the above-mentioned kernel, I will use a Logistic Regression model instead of an SVM, which according to the paper should give similar results.

The notebook will have the following structure:
1. Importing libraries and data
2. Exploratory data analysis
3. Tokenization and training/validation split
4. Naive Bayes with bag-of-words
5. Naive Bayes with tf-idf
6. Logistic Regression with NB features
7. Inference and submission

# Importing libraries and data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import os, re, string

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [None]:
path = Path('../input/nlp-getting-started')
os.listdir(path)

In [None]:
train_df = pd.read_csv(path/'train.csv')
test_df = pd.read_csv(path/'test.csv')

In [None]:
print(f'Training set size: {len(train_df)}')
print(f'Test set size: {len(test_df)}')

In [None]:
train_df.head()

In [None]:
print(f'Missing values in training set: {train_df.text.isna().sum()}')
print(f'Missing values in test set: {test_df.text.isna().sum()}')

# Exploratory data analysis

Before I start pre-processing the texts, I will do a short exploratory data analysis. When dealing with NLP datasets, two things I *always* look at first are the distribution of target classes and the distribution of lengths within each class.

In [None]:
plt.bar(['No disaster', 'Disaster'], [len(train_df[train_df.target==0]), len(train_df[train_df.target==1])], color=['darkblue', 'darkorange'])
plt.xlabel('Dependent variable', fontsize=12)
plt.ylabel('Number of tweets', fontsize=12)
plt.title('Class distribution', fontsize=16)
plt.show()

In [None]:
print(f'Average target in training set: {np.round(train_df.target.mean(),2)}')

In [None]:
diff = len(train_df[train_df.target==0])-len(train_df[train_df.target==1]) 
pct_diff = np.round(diff/len(train_df),2)
diff, pct_diff

With 1071 more no disaster tweets than disaster tweets, a gap equal to 14% of the training set, we have a slightly imbalanced dataset. However, such a modest imbalance should not be a problem for our model and thus there is no need for rebalancing.
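The size of an imbalance can be expressed in more than one way, and the numbers differ quite a bit. The following minimal sketch uses a made-up toy target column (not the competition data) to show the absolute gap, the gap as a share of the whole dataset, and the gap relative to the smaller class. All `toy_` names are hypothetical.

```python
import pandas as pd

# Toy target column: 6 "no disaster" (0) and 4 "disaster" (1) labels, made up for illustration
toy_target = pd.Series([0] * 6 + [1] * 4)
toy_counts = toy_target.value_counts()

toy_diff = toy_counts.loc[0] - toy_counts.loc[1]   # absolute gap between the classes
toy_share = toy_diff / len(toy_target)             # gap as a share of the whole dataset
toy_relative = toy_diff / toy_counts.loc[1]        # gap relative to the smaller class

print(toy_diff, toy_share, toy_relative)
```

The second number is what `diff/len(train_df)` in the cell above computes; the third answers the question "how many more tweets does the larger class have, in percent of the smaller one".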

Next I will calculate the lengths by character for all tweets in training set and test set as well as for all no disaster and disaster tweets in the training set separately.

In [None]:
lengths_trn = train_df.text.str.len()
lengths_tst = test_df.text.str.len()
lengths_trn0 = train_df[train_df.target==0].text.str.len()
lengths_trn1 = train_df[train_df.target==1].text.str.len()
print('Avg length, min length, max length')
print('**********************************')
print(f'For training set: {int(lengths_trn.mean())}, {lengths_trn.min()}, {lengths_trn.max()}')
print(f' - no disaster tweets: {int(lengths_trn0.mean())}, {lengths_trn0.min()}, {lengths_trn0.max()}')
print(f' - disaster tweets: {int(lengths_trn1.mean())}, {lengths_trn1.min()}, {lengths_trn1.max()}')
print(f'For test set: {int(lengths_tst.mean())}, {lengths_tst.min()}, {lengths_tst.max()}')
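A more compact way to get the per-class statistics is a groupby on the target column. This is only a sketch on a made-up toy frame; `toy_df` and its rows are assumptions for illustration, not part of the competition data.

```python
import pandas as pd

# Toy stand-in for train_df (made-up rows)
toy_df = pd.DataFrame({
    'text': ['ok', 'hello', 'fire!!', 'flood now'],
    'target': [0, 0, 1, 1],
})

# Character length per tweet, then mean/min/max within each target class
toy_lengths = toy_df.text.str.len()
toy_stats = toy_lengths.groupby(toy_df.target).agg(['mean', 'min', 'max'])
print(toy_stats)
```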

In [None]:
fig, axs = plt.subplots(2, 2, sharex='row', figsize=(10,10))

axs[0, 0].hist(lengths_trn, color='darkgrey')
axs[0, 0].set_title('Training set', fontsize=16)
axs[0, 0].set_ylabel('Number of tweets', fontsize=12)
axs[0, 1].hist(lengths_tst, color='lightgrey')
axs[0, 1].set_title('Test set', fontsize=16)
axs[1, 0].hist(lengths_trn0, color='darkblue')
axs[1, 0].set_title('Training set (no disaster)', fontsize=16)
axs[1, 0].set_ylabel('Number of tweets', fontsize=12)
axs[1, 0].set_xlabel('Character lengths', fontsize=12)
axs[1, 1].hist(lengths_trn1, color='darkorange')
axs[1, 1].set_title('Training set (disaster)', fontsize=16)
axs[1, 1].set_xlabel('Character lengths', fontsize=12)

plt.show()

The length distributions of training and test sets look very similar. Because Twitter limits the number of characters, the overall variation in lengths is not too large for this dataset. Interestingly, there is a spike in all four charts shortly before the maximum. When looking into the separate target classes, the average length for no disaster tweets is a bit lower than for disaster tweets. It's not surprising that there are fewer disaster tweets which are very short, as you would probably use more than only a few words to describe a disaster.

# Tokenization and training/validation split

After having experimented with different tokenizers (e.g. nltk and spaCy), I arrived at the conclusion that a simple regular expression tokenizer that can be defined in two lines of code (from Jeremy Howard's [kaggle kernel](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline)) works best in this setting. Also, none of the many pre-processing techniques that I tried (cleaning urls and html tags, removing stopwords and punctuation, stemming and lemmatization) actually improved the performance of the NB-LR algorithm on this specific dataset.

The tokenizer splits a string of text into separate tokens based on a set of common punctuations, converts all tokens into lowercase and then returns a list of the resulting tokens. Let's see how it works...

In [None]:
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤€‘’])')
def re_tokenizer(s): return re_tok.sub(r' \1 ', s).lower().split()

In [None]:
train_df.text[6]

In [None]:
print(re_tokenizer(train_df.text[6]))
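To make the behaviour concrete without depending on the dataset, here is a small sketch on a made-up sentence. It uses an ASCII-only variant of the tokenizer (the extra typographic characters are left out), so the `_ascii` names are assumptions introduced for this example.

```python
import re, string

# ASCII-only variant of the tokenizer above; typographic quotes etc. are omitted here
re_tok_ascii = re.compile(f'([{string.punctuation}])')
def re_tokenizer_ascii(s): return re_tok_ascii.sub(r' \1 ', s).lower().split()

# Made-up example sentence
toy_tokens = re_tokenizer_ascii("Forest fire near La Ronge, Sask. #Canada!")
print(toy_tokens)
```

Every punctuation character becomes its own token, which means that a hashtag is split into `#` and the word that follows it.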

Before moving on to Naive Bayes, we need to create a separate validation set for our model in order to evaluate how well it performs on data it has not seen during training. The validation set will be composed of 10% randomly selected tweets from the training set.

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train_df.text, train_df.target, test_size=0.1, random_state=42)

In [None]:
print(f'Training set size: {len(X_train)}')
print(f'Validation set size: {len(X_valid)}')

# Naive Bayes with bag-of-words

Because we cannot feed the tokens directly into a machine learning model, the inputs need to be converted into numbers first. CountVectorizer creates a *bag-of-words* model of the tokenized texts. It builds a vocabulary of all tokens that appear in at least 4 documents (min_df) but in no more than 80 percent of all documents (max_df). Tokens above that upper threshold are treated as corpus-specific stop words and are therefore removed. We replace the default tokenizer with our regular expression tokenizer from above, strip accents from characters (strip_accents) and deactivate lowercasing (which is already done by the tokenizer).

In addition to unigrams (i.e. single tokens), I also include bigrams, which are pairs of adjacent tokens (e.g. heavy and rain --> heavy rain). The min_df threshold applies to them as well, so a bigram must appear in at least four documents to enter the vocabulary.

In [None]:
vec = CountVectorizer(ngram_range=(1,2), tokenizer=re_tokenizer, min_df=4, max_df=0.8, strip_accents='unicode', lowercase=False)

CountVectorizer returns separate term-document matrices for training and validation sets. The rows in the matrix correspond to documents and the columns to vocabulary items. The values in the matrix represent how often each token appears in each document. Since most tokens don't appear in a given document, the term-document matrices are sparse, i.e. they contain mostly zeros.

It is important to only call transform (not fit_transform) on the validation set, because the same vocabulary that was created from the training set needs to be used to create the term-document matrix for the validation set.

In [None]:
train_term_doc = vec.fit_transform(X_train)
valid_term_doc = vec.transform(X_valid)
train_term_doc.shape, valid_term_doc.shape
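The difference between fit_transform and transform is easiest to see on a tiny example. The sketch below uses made-up sentences and a CountVectorizer with default tokenizer and no min_df/max_df filtering, so it is not the configuration used in this notebook; the `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
toy_train = ['heavy rain floods city', 'heavy rain tonight', 'sunny day in city']
toy_valid = ['heavy snow tonight']

toy_vec = CountVectorizer(ngram_range=(1, 2))      # unigrams and bigrams, default tokenizer
toy_train_td = toy_vec.fit_transform(toy_train)    # learns the vocabulary from the training docs
toy_valid_td = toy_vec.transform(toy_valid)        # reuses that vocabulary, learns nothing new

print('heavy rain' in toy_vec.vocabulary_)         # bigram seen during fitting
print('snow' in toy_vec.vocabulary_)               # token that only occurs in the validation doc
print(toy_train_td.shape, toy_valid_td.shape)      # same number of columns in both matrices
print(toy_valid_td.sum())                          # only 'heavy' and 'tonight' are counted
```

Tokens and bigrams that were not seen during fitting are silently dropped from the validation matrix, which is why both matrices end up with the same columns.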

In [None]:
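# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from version 1.0
vocab = vec.get_feature_names_out()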
print(f'Vocabulary size: {len(vocab)}')

The **log-count ratio** $r$ for feature $f$ is the following:

$r = \log \frac{\text{ratio of feature $f$ in disaster tweets}}{\text{ratio of feature $f$ in no disaster tweets}}$

where features $f$ in the case of NLP are the tokens in our vocabulary (unigrams and bigrams). We will slowly build up to that equation in the next lines of code in order to understand better what is going on.

In [None]:
# Rename term-document matrices for convenience and convert labels from pandas series into numpy arrays
x_train = train_term_doc
y_train = y_train.values
x_valid = valid_term_doc
y_valid = y_valid.values

The first step is to calculate how often each token appears in disaster tweets and no disaster tweets. This is done by only selecting the tweets that correspond to target==1/target==0 and summing up along the columns (since each column represents one token).

In [None]:
p1 = np.squeeze(np.asarray(x_train[y_train==1].sum(0)))
p0 = np.squeeze(np.asarray(x_train[y_train==0].sum(0)))

In [None]:
p1.shape, p1[:10]

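The next step calculates the ratio of each feature in disaster and no disaster tweets by dividing the token counts from above by the number of disaster/no disaster tweets. (Note: the ones are added as smoothing, so that a token that never appears in one class still gets a non-zero ratio and the later division and logarithm stay well-defined.)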

In [None]:
pr1 = (p1+1) / ((y_train==1).sum()+1)
pr0 = (p0+1) / ((y_train==0).sum()+1)

In [None]:
pr1.shape, pr1[:10]

Before moving on let's interpret one of these ratios.

In [None]:
vocab[2160:2170]

Printing out the ratios for the word "crisis" in disaster and no disaster tweets, we can see that the token appears in *around* 0.6 percent of all disaster tweets, but only in 0.005 percent of all no disaster tweets.

In [None]:
pr1[2164], pr0[2164]

Note that I wrote "*around* 0.6 percent" before because the actual ratio is slightly different: we add 1 to both numerator and denominator when calculating the ratios. However, the difference is so small that we can ignore it.

In [None]:
p1[2164]/(y_train==1).sum(), (p1[2164]+1)/((y_train==1).sum()+1)

Even though the individual ratios can be slightly different, dividing one ratio by the other gives a good measure of how often a given token appears in disaster vs. no disaster tweets. The word "crisis" undoubtedly appears way more often in disaster tweets.

In [None]:
pr1[2164] / pr0[2164]

The log-count ratios are obtained by doing exactly that for all tokens and taking the log:

In [None]:
r = np.log(pr1/pr0)
r.shape, r[:10]
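The whole chain from counts to log-count ratios can be checked by hand on a tiny example. The sketch below uses a made-up dense term-document matrix with four documents and two tokens; all `toy_` names are hypothetical and the numbers have nothing to do with the competition data.

```python
import numpy as np

# Made-up term-document matrix: rows are documents, columns are two tokens
toy_x = np.array([[1, 0],
                  [1, 1],
                  [0, 1],
                  [0, 1]])
toy_y = np.array([1, 1, 0, 0])   # first two documents are "disaster" tweets

toy_p1 = toy_x[toy_y == 1].sum(0)                    # token counts in disaster docs: [2, 1]
toy_p0 = toy_x[toy_y == 0].sum(0)                    # token counts in no disaster docs: [0, 2]
toy_pr1 = (toy_p1 + 1) / ((toy_y == 1).sum() + 1)    # smoothed ratios: [1, 2/3]
toy_pr0 = (toy_p0 + 1) / ((toy_y == 0).sum() + 1)    # smoothed ratios: [1/3, 1]
toy_r = np.log(toy_pr1 / toy_pr0)                    # log([3, 2/3])

print(toy_r)
```

The first token only occurs in disaster documents and gets a positive log-count ratio, while the second token is more common in no disaster documents and gets a negative one. Thanks to the added ones, the zero count for the first token does not lead to a division by zero.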

To finalize our Naive Bayes model we also need to calculate the bias b, which is the log of the ratio of disaster tweets to no disaster tweets. Since only about 43 percent of the tweets in the training set are disaster tweets, this ratio is below one and b is slightly negative.

In [None]:
b = np.log((y_train==1).mean() / (y_train==0).mean()); b

Let's now put it all together. In our setting, **Bayes' theorem** is defined in the following way:


$ p(\textrm{target} \mid \textrm{tokens}) \displaystyle \propto {p(\textrm{tokens} \mid \textrm{target}) \cdot p(\textrm{target})} $

* $ p(\textrm{target} \mid \textrm{tokens}) $ is the *posterior*, i.e. the conditional probability of a target given a set of tokens 
* $ p(\textrm{tokens} \mid \textrm{target}) $ is the *likelihood*, i.e. the conditional probability of a set of tokens given the target 
* $ p(\textrm{target}) $ is the *prior*, i.e. the overall probability of each target in the dataset

In other words, we want to get the posterior - which is proportional to the likelihood times the prior - to predict the target from tokens.

Without further diving into the mathematical details, we can compute the **Naive Bayes predictions** like this:

$ \textrm{predictions} = \textrm{term-document} \ \textrm{matrix} \cdot \textrm{r} + \textrm{b}$

If the result is larger than zero, the model classifies a tweet as disaster tweet. The training process was actually already done before by calculating the log-count ratios and bias b on the training set. So we can now move on to evaluate the model on the validation set.
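For readers who want the skipped step, here is a short sketch of where the formula comes from. It relies on the usual naive assumption that tokens are independent given the target, and it writes $x_f$ for the entry of the term-document matrix for feature $f$ and $r_f$ for the corresponding log-count ratio; both symbols are introduced here only for this derivation. Taking the log of the ratio of the two posteriors, the normalizing constant cancels and Bayes' theorem gives:

$ \log \frac{p(\textrm{disaster} \mid \textrm{tokens})}{p(\textrm{no disaster} \mid \textrm{tokens})} = \log \frac{p(\textrm{tokens} \mid \textrm{disaster})}{p(\textrm{tokens} \mid \textrm{no disaster})} + \log \frac{p(\textrm{disaster})}{p(\textrm{no disaster})} $

Under the independence assumption the likelihood ratio factorizes over the features, so its log turns into a sum:

$ \log \frac{p(\textrm{tokens} \mid \textrm{disaster})}{p(\textrm{tokens} \mid \textrm{no disaster})} \approx \sum_f x_f \cdot \log \frac{p(f \mid \textrm{disaster})}{p(f \mid \textrm{no disaster})} = \sum_f x_f \cdot r_f $

The first term is therefore the term-document matrix multiplied by r, and the second term is the bias b. The sum is positive exactly when the disaster class has the higher posterior, which is why zero is the decision threshold.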


In [None]:
preds = (x_valid @ r + b) > 0

In [None]:
print(f'Validation accuracy: {(preds == y_valid).mean()}')
print(f'Validation F1 score: {f1_score(y_valid, preds)}')
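Since both metrics are reported throughout the notebook, it may help to see how they differ. The sketch below computes accuracy and the F1 score by hand on made-up labels and compares the result with sklearn's f1_score; the `toy_` arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels and predictions for illustration
toy_true = np.array([1, 1, 1, 0, 0, 0])
toy_pred = np.array([1, 1, 0, 1, 0, 0])

tp = ((toy_pred == 1) & (toy_true == 1)).sum()   # disaster tweets that were found
fp = ((toy_pred == 1) & (toy_true == 0)).sum()   # false alarms
fn = ((toy_pred == 0) & (toy_true == 1)).sum()   # disaster tweets that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
accuracy = (toy_pred == toy_true).mean()

print(accuracy, f1_manual, f1_score(toy_true, toy_pred))
```

Unlike accuracy, the F1 score ignores correctly classified no disaster tweets and only looks at how well the positive class is handled.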

# Naive Bayes with tf-idf

The results from above are not great, so I will try to get some improvement by using a tf-idf model instead of a simple bag-of-words. Tf-idf stands for *term frequency - inverse document frequency*. It weights how often a token appears in a document by how rare the token is across the entire corpus (assuming that rare tokens are more important than frequent ones).

The only change in the code is that I now use sklearn's TfidfVectorizer and include sublinear term frequency scaling (which replaces tf with 1 + log(tf) and gives slightly better results). Everything else stays the same.

In [None]:
vec_tfidf = TfidfVectorizer(ngram_range=(1,2), tokenizer=re_tokenizer, lowercase=False,
               min_df=4, max_df=0.8, strip_accents='unicode', sublinear_tf=True)

In [None]:
train_term_doc_tfidf = vec_tfidf.fit_transform(X_train)
valid_term_doc_tfidf = vec_tfidf.transform(X_valid)
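Both effects, the idf weighting and the sublinear scaling, can be seen in isolation on toy inputs. This is a sketch with made-up documents and deliberately different settings from the vectorizer above (default tokenizer, and for the second part `use_idf=False` and `norm=None` so that only the term frequency scaling remains); the `toy_`, `idf_` and `tf_` names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# 1) idf: 'fire' is in every document, 'flood' only in the first one
toy_docs = ['fire flood', 'fire storm', 'fire quake']
idf_vec = TfidfVectorizer()
idf_td = idf_vec.fit_transform(toy_docs)
w_fire = idf_td[0, idf_vec.vocabulary_['fire']]
w_flood = idf_td[0, idf_vec.vocabulary_['flood']]
print(w_fire, w_flood)     # the rarer token gets the larger weight

# 2) sublinear tf: a token that occurs 4 times gets 1 + log(4) instead of 4
tf_vec = TfidfVectorizer(sublinear_tf=True, use_idf=False, norm=None)
tf_td = tf_vec.fit_transform(['fire fire fire fire flood'])
tf_fire = tf_td[0, tf_vec.vocabulary_['fire']]
print(tf_fire, 1 + np.log(4))
```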

You can use exactly the same cells as in the section above to calculate the log-count ratios. However, I decided to demonstrate another way to calculate r faster and with fewer lines of code. The formula to do this is also from the above-cited [kaggle kernel](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline).

In [None]:
def pr(y_i, y, x):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

In [None]:
r = np.squeeze(np.asarray(np.log(pr(1, y=y_train, x=train_term_doc_tfidf)/pr(0, y=y_train, x=train_term_doc_tfidf))))

b is the same as above, so there is no need to calculate it again.

In [None]:
preds = (valid_term_doc_tfidf @ r + b) > 0

In [None]:
print(f'Validation accuracy: {(preds == y_valid).mean()}')
print(f'Validation F1 score: {f1_score(y_valid, preds)}')

# Logistic Regression with NB features

Using tf-idf improved the results, but it turns out we can achieve even better performance by using the Naive Bayes log-count ratios as input features to a Logistic Regression model. The features are obtained by multiplying each row of the tf-idf term-document matrices for training and validation sets element-wise by the log-count ratios r from above.

In [None]:
nb_train = train_term_doc_tfidf.multiply(r)
nb_valid = valid_term_doc_tfidf.multiply(r)
nb_train.shape, nb_valid.shape
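Note that `multiply` on a sparse matrix is an element-wise operation, not a matrix product. The following sketch on a made-up 2x3 matrix shows that every column is scaled by the matching entry of the ratio vector; `toy_td`, `toy_r` and `toy_nb` are hypothetical names.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up term-document matrix (2 documents, 3 tokens) and made-up log-count ratios
toy_dense = np.array([[1., 0., 2.],
                      [0., 3., 0.]])
toy_td = csr_matrix(toy_dense)
toy_r = np.array([0.5, -1.0, 2.0])

toy_nb = toy_td.multiply(toy_r)     # scales column j by toy_r[j], result stays sparse
print(toy_nb.toarray())
print(toy_nb.shape == toy_td.shape)
```

The shape does not change, which is why the Logistic Regression model below still sees one feature per vocabulary item.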

I will fit a simple Logistic Regression model on these features. Hyperparameter C controls the inverse regularization strength (default is L2 regularization with C=1), where a higher value means that the model is *less* regularized. Even though we have a small dataset, the model doesn't seem to overfit much, so I can decrease regularization to get better results.

In [None]:
# Setting up the model
model = LogisticRegression(C=4, solver='liblinear')
# Fitting the model on the training data
model.fit(nb_train, y_train)
# Getting predictions for the validation set
preds = model.predict(nb_valid)

In [None]:
print(f'Validation accuracy: {(preds == y_valid).mean()}')
print(f'Validation F1 score: {f1_score(y_valid, preds)}')
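To get a feeling for what C does, the sketch below fits the same kind of model with different values of C on a synthetic dataset and compares the size of the learned coefficients. The data comes from sklearn's make_classification and is unrelated to the tweets; the values of C other than 4 and the `_toy` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, only used to illustrate the effect of C
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

coef_norms = {}
for C in [0.01, 1, 4]:
    clf = LogisticRegression(C=C, solver='liblinear').fit(X_toy, y_toy)
    coef_norms[C] = np.linalg.norm(clf.coef_)    # overall size of the learned weights

print(coef_norms)
```

A small C penalizes large weights heavily and shrinks the coefficients, while a larger C gives the model more freedom to fit the training data.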

# Inference and submission

Since the dataset is rather small, I will retrain the model on the entire training set (including the previous validation set) and then make predictions on the unlabeled test set for submitting to kaggle.

In [None]:
train_term_doc_tfidf = vec_tfidf.fit_transform(train_df.text)
test_term_doc_tfidf = vec_tfidf.transform(test_df.text)
y_train = train_df.target.values

In [None]:
r = np.squeeze(np.asarray(np.log(pr(1, y=y_train, x=train_term_doc_tfidf)/pr(0, y=y_train, x=train_term_doc_tfidf))))

In [None]:
model = LogisticRegression(C=4, solver='liblinear')
model.fit(train_term_doc_tfidf.multiply(r), y_train)
preds = model.predict(test_term_doc_tfidf.multiply(r))

In [None]:
submit = pd.read_csv(path/'sample_submission.csv')

In [None]:
submit.columns

In [None]:
assert all(submit.id == test_df.id)
assert len(submit) == len(test_df) == len(preds)

In [None]:
submit.target = preds

In [None]:
submit.head()

In [None]:
# Save submissions
submit.to_csv('submission_060320.csv', index=False)