![Logo](https://raw.githubusercontent.com/marcelo-campa/Analytica-Kaggle-NLP/main/img/logo_desafio.png)

# Natural Language Processing with Disaster Tweets Challenge

The Brazilian Portuguese version of this notebook can be found at [this link](https://github.com/marcelo-campa/Analytica-Kaggle-NLP).

The [challenge](https://www.kaggle.com/c/nlp-getting-started) is a text classification problem: given a tweet, predict whether it refers to a real disaster or not. To solve it, we will use Natural Language Processing (NLP) tools together with Machine Learning.

In this notebook, we'll cover:

- "Tweet Tokenizer" for tweet tokenization
- "Word Vectors"
- SVC and Logistic Regression as classifiers

We'll also list possible future improvements at the end of this document. Some parts were inspired by [this notebook](https://www.kaggle.com/pranjalchatterjee/word-vectors-svc-on-nlp-with-disaster-tweets) from @pranjalchatterjee.

In [None]:
# Importing Libs

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy.sparse import csr_matrix
from tqdm import tqdm

In [None]:
# Loading train and Test datasets

train = pd.read_csv('../input/nlp-getting-started/train.csv')
test = pd.read_csv('../input/nlp-getting-started/test.csv')

In [None]:
train.head(3)

In [None]:
test.head(3)

In [None]:
train.info()

In [None]:
test.info()

The `keyword` column has some potential, as it has few missing values. However, we will start our analysis with solutions focused on the content of the documents, that is, the tweet text.

## Default Tokenizer

Tokenizers are tools that help us split sentences into separate words. As an example, the sentence `'Three people died from the heat wave so far'`, picked from the train dataset, is transformed into a list in which every word is a token. The result is

```['Three', 'people', 'died', 'from', 'the', 'heat', 'wave', 'so', 'far']```


In [None]:
stop_words_nltk = list(stopwords.words('english'))
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(train['text'].values)

In [None]:
csr_matrix(count_train).toarray()

In [None]:
# Picking a sentence to be tokenized

train['text'].values[10]

In [None]:
# Checking the tokenized sentence

word_tokenize(train['text'][10])

## Tweet tokenizer

The tokenizer mentioned above was made with common texts in mind. However, we know that tweets have their own characteristics, such as the use of emojis, hashtags and various abbreviations.

With that in mind, we opted to test the Tweet Tokenizer, a tokenizer in the NLTK library optimized for analyzing Twitter texts.
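
To make the difference concrete, here is a minimal, simplified sketch of tweet-aware tokenization using only a regular expression. It is not NLTK's `TweetTokenizer` implementation: the pattern and the helper names `simple_tweet_tokenize` and `naive_tokenize` are assumptions made for illustration, and the example tweet is made up. It only shows why keeping hashtags, mentions and simple emoticons as single tokens matters.

```python
import re

# Hypothetical, simplified pattern (an assumption for illustration, not NLTK's rules).
# Alternatives are tried from left to right at each position.
TWEET_TOKEN = re.compile(
    r"https?://\S+"          # URLs kept whole
    r"|[@#]\w+"              # mentions and hashtags kept whole
    r"|[:;=][-']?[()DPp]"    # a few simple emoticons such as :) ;-) :D
    r"|\w+(?:'\w+)?"         # words, optionally with an apostrophe (don't)
    r"|[^\w\s]"              # any other single punctuation character
)

def simple_tweet_tokenize(text):
    """Split a tweet into tokens, keeping hashtags, mentions and emoticons intact."""
    return TWEET_TOKEN.findall(text)

def naive_tokenize(text):
    """Generic tokenizer that separates every punctuation mark from the words."""
    return re.findall(r"\w+|[^\w\s]", text)

tweet = "Forest fire near #LaRonge :( @user"
tweet_tokens = simple_tweet_tokenize(tweet)
naive_tokens = naive_tokenize(tweet)

print(tweet_tokens)
print(naive_tokens)
```

The generic tokenizer breaks `#LaRonge` into `#` and `LaRonge`, so the hashtag is lost as a feature, while the tweet-aware version keeps it as one token.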

In [None]:
from nltk.tokenize import TweetTokenizer

def tweet_tokenize_column(df, column):
    """
        This function gets the DataFrame and the name of a column (String) containing texts (Strings) and returns
        a list of lists containing the tokenized text. It also turns every token into its lowercase form and
        excludes stopwords.

        Input: Pandas DataFrame, String
        Return: Nested List
    """

    tweet_tokenizer = TweetTokenizer()

    # Build the stopword set once; a set makes each membership test fast
    stop_words = set(stopwords.words('english'))

    # List of tokenized sentences
    list_sent = [tweet_tokenizer.tokenize(sent) for sent in df[column].values]

    # Lowercase each token before the stopword check, so that capitalized
    # stopwords such as 'The' are excluded as well
    list_sent_no_stop = [[token.lower()
                           for token in sent
                           if token.lower() not in stop_words]
                           for sent in list_sent]

    return list_sent_no_stop

In [None]:
# Using the function on train and test datasets

tokenized_sent_train = tweet_tokenize_column(train,'text')
tokenized_sent_test = tweet_tokenize_column(test,'text')

We can see the difference between the tokenizers. With this one, the tokenization of hashtags is improved: each hashtag is kept as a single token instead of being split from its `#`.

In [None]:
tokenized_sent_train[:2]

In [None]:
tokenized_sent_test[:2]

We will create a list of lists containing all tokenized tweets from both the training and test sets. This ensures a unified analysis by TF-IDF, which is explained in the next section.

In [None]:
tokenized_sent_all = tokenized_sent_train + tokenized_sent_test

## TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency) is a numerical statistic that reflects the importance of a given word in a document relative to a set of documents (corpus). Intuitively, a word (token) has a high TF-IDF score in a document if

- It has a high number of occurrences in that document, and
- It occurs in few documents of the corpus

This score is the product of the 'TF' and the 'IDF', which are calculated as follows

\begin{align}
    \operatorname{tf}(t, d)&=\frac{f_{t, d}}{\sum_{t^{\prime} \in d} f_{t^{\prime}, d}}\\\\
    \operatorname{idf}(t, D)&=\log \frac{N}{|\{d \in D: t \in d\}|}\\\\
    \operatorname{tfidf}(t, d, D)&=\operatorname{tf}(t, d) \cdot \operatorname{idf}(t, D)
\end{align}

All terms can be seen in depth on [this link](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
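
As a worked illustration of the three formulas above, the following sketch computes TF, IDF and TF-IDF by hand on a tiny made-up corpus of already tokenized documents. The corpus and the function names are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` uses raw counts for TF, a smoothed IDF and L2 row normalization by default, so its numbers will differ from this textbook version.

```python
import math

# Tiny hypothetical corpus of tokenized documents (made up for illustration).
corpus = [
    ["forest", "fire", "near", "town"],
    ["fire", "fire", "everywhere"],
    ["nice", "day", "near", "lake"],
]

def tf(term, doc):
    """Relative frequency of `term` within one document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing `term`).

    Assumes `term` appears in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    """Product of the term frequency and the inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

# "fire" is frequent in the second document but also appears in the first one,
# while "everywhere" appears in a single document only.
fire_score = tfidf("fire", corpus[1], corpus)
everywhere_score = tfidf("everywhere", corpus[1], corpus)

print(f"fire: {fire_score:.4f}, everywhere: {everywhere_score:.4f}")
```

Even though "fire" occurs twice in the second document, "everywhere" gets the higher score there because it is rarer across the corpus.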

Now, let's apply the TF-IDF.

In [None]:
# Helper function to bypass the tokenizer, as this step has already been done

def identity_tokenizer(text):
    return text

tfidf_all = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)    
tfidf_all_fit = tfidf_all.fit_transform(tokenized_sent_all)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
tfidf_all.get_feature_names_out()[1000:1002]


Note that because TF-IDF was fitted on all documents from both the train and test datasets, the scores are calculated by treating this aggregated list as a single corpus. The advantage is that the scores are based on more data, which tends to give more representative document frequencies.


In [None]:
# Creating a unified dataframe. The first 'n' rows hold the data from the train dataset, and the following 'm'
# rows hold the data from the test dataset. 'n' is the number of train documents and 'm' is the number of test
# documents.

tfidf_all_df = pd.DataFrame(tfidf_all_fit.toarray(), columns=tfidf_all.get_feature_names_out())

In [None]:
tfidf_all_df.head()

In [None]:
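# Splitting the aggregated dataframe into train and test.
# .copy() makes the train part an independent dataframe, so adding the target
# column to it later does not trigger pandas' SettingWithCopyWarning

tfidf_train_df = tfidf_all_df[:len(train)].copy()

tfidf_test_df = tfidf_all_df[len(train):]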


In [None]:
# Including target column on the TF-IDF train dataset

tfidf_train_df["target_column"] = train['target']

## Classifiers

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X = tfidf_train_df.drop("target_column", axis=1)
y = tfidf_train_df["target_column"]

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=16)

clf = LogisticRegression(random_state=16)

scores_logistic = cross_val_score(clf, X, y, cv=5)

In [None]:
scores_logistic.mean()

In [None]:
from sklearn.metrics import accuracy_score

clf.fit(X,y)

y_pred = clf.predict(X)

print('Training accuracy is {}'.format(accuracy_score(y, y_pred)))

We noticed a considerable gap between the cross-validation scores and the training accuracy. The training accuracy is measured on the same data the model was fitted on, so the gap suggests overfitting when we train on the entire training dataset. The lower cross-validation scores may also reflect some underfitting, since each fold trains on a smaller portion of the data.
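
Why accuracy measured on the fitting data is an optimistic estimate can be seen in a toy setting. The sketch below is purely illustrative and does not use the tweet data: it builds a hypothetical classifier that memorizes its training set (1-nearest-neighbour on a single made-up numeric feature) and compares its accuracy on the data it was fitted on with the accuracy from a hand-written 5-fold cross-validation. The data, the function names and the choice of classifier are all assumptions.

```python
import random

random.seed(16)

# Hypothetical noisy data: the label is 1 when x > 0.5, with about 20% of labels flipped.
points = []
for _ in range(200):
    x = random.random()
    label = int(x > 0.5)
    if random.random() < 0.2:
        label = 1 - label
    points.append((x, label))

def predict_1nn(train, x):
    """Return the label of the training point closest to x (pure memorization)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, evaluation):
    """Share of evaluation points whose label is predicted correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in evaluation)
    return hits / len(evaluation)

# Accuracy on the data used for fitting: every point is its own nearest neighbour.
train_acc = accuracy(points, points)

# 5-fold cross-validation: each fold is scored by a model that never saw it.
k = 5
fold_scores = []
for i in range(k):
    held_out = points[i::k]
    rest = [p for j, p in enumerate(points) if j % k != i]
    fold_scores.append(accuracy(rest, held_out))
cv_acc = sum(fold_scores) / k

print(f"training accuracy: {train_acc:.3f}")
print(f"cross-validated accuracy: {cv_acc:.3f}")
```

The memorizing model looks perfect on its own training data, while the cross-validated estimate is noticeably lower, which is the same kind of gap observed above.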

The submission to Kaggle gave us a public score of 78.547%.

In [None]:
# Submission 0.78547

sample_submission = pd.read_csv('../input/nlp-getting-started/sample_submission.csv')

y_sub = clf.predict(tfidf_test_df)

In [None]:
sub = sample_submission.copy()
sub['target'] = y_sub
sub.set_index('id',inplace=True)

In [None]:
sub.head()

In [None]:
sub.to_csv("./sub_01.csv")

## Feature Selection

To try to reduce overfitting and improve the performance of our classifier, we chose to use the $\chi^{2}$ and Mutual Information scores to select the most "important" variables for our model, which also reduces the dimensionality of the problem.
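
To give an intuition for what the $\chi^{2}$ score measures, here is a minimal sketch that computes it by hand for non-negative features and a binary target. It compares the observed per-class feature totals with the totals expected if feature and class were independent. The toy matrix, the labels and the function name `chi2_scores` are assumptions for illustration. The sketch is intended to follow the same idea as `sklearn.feature_selection.chi2`, but it is not that implementation.

```python
# Toy TF-IDF-like matrix: rows are documents, columns are features (made-up values).
feature_names = ["fire", "flood", "the"]
X_toy = [
    [0.9, 0.0, 0.3],
    [0.8, 0.1, 0.3],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 0.3],
    [0.1, 0.0, 0.3],
    [0.0, 0.0, 0.3],
]
y_toy = [1, 1, 1, 0, 0, 0]  # 1 = disaster, 0 = not a disaster

def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative feature against the class labels.

    Assumes every feature has a non-zero total over the documents.
    """
    classes = sorted(set(y))
    n_samples = len(y)
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            # Observed: total feature mass inside class c.
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected under independence: class share times total feature mass.
            expected = feature_total * y.count(c) / n_samples
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

scores = chi2_scores(X_toy, y_toy)
ranking = sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A feature such as "the", which is spread evenly over both classes, scores close to zero, while features concentrated in one class rank at the top and would be kept by the selection.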

In [None]:
from sklearn.feature_selection import mutual_info_classif, chi2

# mi = mutual_info_classif(tfidf_train_df_int.drop("target_column", axis=1), tfidf_train_df_int["target_column"])
# mi = pd.Series(mi)
# mi.index = intersect_columns
# mi.sort_values(ascending=False, inplace=True) 

chi = chi2(X,y)
chi = pd.Series(chi[0])
chi.index = X.columns
chi.sort_values(ascending=False, inplace=True)    


In [None]:
chi[:5]

In [None]:
chi.to_csv("./chi.csv")

In [None]:
atts = np.linspace(100,10000,100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=16)



In [None]:
clf.fit(X_train[chi[:3800].index],y_train)

y_pred = clf.predict(X_test[chi[:3800].index])

acc = accuracy_score(y_test , y_pred)

In [None]:
y_sub_chi = clf.predict(tfidf_test_df[chi[:3800].index])

In [None]:
sub_chi = sample_submission.copy()

sub_chi['target'] = y_sub_chi

sub_chi.set_index('id',inplace=True)

In [None]:
# Chi^2 feature selection submission

sub_chi.to_csv("./sub_chi.csv")

### Support Vector Classifier (SVC)

In [None]:
from sklearn.svm import SVC

clf_svc = SVC()
clf_svc.fit(X_train[chi[:3800].index],y_train)
y_pred = clf_svc.predict(X_test[chi[:3800].index])
acc = accuracy_score(y_test , y_pred)

# acc was computed on the held-out split (X_test), so it is a validation accuracy
print('Validation accuracy is {}'.format(acc))

In [None]:
clf_svc.fit(tfidf_train_df[chi[:3800].index],y)

In [None]:
y_sub_svc = clf_svc.predict(tfidf_test_df[chi[:3800].index])

In [None]:
sub_svc = sample_submission.copy()
sub_svc['target'] = y_sub_svc
sub_svc.set_index('id',inplace=True)

sub_svc.to_csv("./sub_svc_overfit.csv")

In [None]:
# atts = [1000,3000,5000]
# list_scores_svc = []

# for att in tqdm(atts):
#     clf_svc.fit(X_train[chi[:int(att)].index],y_train)
#     y_pred = clf_svc.predict(X_test[chi[:int(att)].index])
#     acc = accuracy_score(y_test , y_pred)
    
#     list_scores_svc.append(acc)

## Word Vectors

With the TF-IDF solution in mind, let's now move on to another type of abstraction: Word Vectors. This abstraction maps words to multidimensional vectors that represent a word not only in terms of counts, as TF-IDF does, but also take the context in which it appears into account.

As an example, the image below shows the two most important coordinates of a series of 300-dimensional word vectors (which represent words in a given context).

<img src="https://raw.githubusercontent.com/marcelo-campa/Analytica-Kaggle-NLP/main/img/word_vectors_map.png" alt="Word vectors" width="600"/>

Credits to [this website](https://dzone.com/articles/introduction-to-word-vectors).
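
To illustrate the idea, the sketch below uses made-up 3-dimensional vectors (the spaCy vectors used in this notebook have 300 dimensions) to show two things: words used in similar contexts get a high cosine similarity, and a document can be represented by the average of its word vectors, which is how spaCy's `Doc.vector` behaves by default. The vocabulary, the numbers and the function names are hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors (made-up numbers for illustration).
toy_vectors = {
    "fire":     [0.9, 0.1, 0.0],
    "wildfire": [0.8, 0.2, 0.1],
    "picnic":   [0.0, 0.2, 0.9],
    "sunny":    [0.1, 0.1, 0.8],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def document_vector(tokens, vectors):
    """Average the vectors of the known tokens, component by component."""
    known = [vectors[t] for t in tokens if t in vectors]
    return [sum(component) / len(known) for component in zip(*known)]

sim_related = cosine_similarity(toy_vectors["fire"], toy_vectors["wildfire"])
sim_unrelated = cosine_similarity(toy_vectors["fire"], toy_vectors["picnic"])
doc_vec = document_vector(["sunny", "picnic"], toy_vectors)

print(f"fire vs wildfire: {sim_related:.3f}")
print(f"fire vs picnic:   {sim_unrelated:.3f}")
print("document vector:", doc_vec)
```

The averaged document vector has the same number of dimensions as each word vector, which is why every tweet below becomes a single fixed-length row that a classifier can consume.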


In [None]:
import spacy 

nlp = spacy.load('en_core_web_lg')

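# Disable every pipeline component (tagger, parser, ...): the document vector only
# needs the tokenizer and the model's word vectors, so this speeds up processing.
# Calling disable_pipes() with no arguments disables nothing.
with nlp.disable_pipes(*nlp.pipe_names):
    train_vecs = pd.DataFrame(np.array([nlp(text).vector for text in train.text])) # doc vectors for training set
    test_vecs = pd.DataFrame(np.array([nlp(text).vector for text in test.text])) # doc vectors for testing set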

In [None]:
mi = mutual_info_classif(train_vecs,train.target)
mi = pd.Series(mi)
mi.index = train_vecs.columns
mi.sort_values(ascending=False, inplace=True)    

In [None]:
X_word_vec_train, X_word_vec_test, y_word_vec_train, y_word_vec_test = train_test_split(train_vecs, train.target.values, test_size=0.33, random_state=16)

In [None]:
svc = SVC()

atts = np.linspace(1, 299, 299)
list_scores_svc = []

for att in tqdm(atts):
    svc.fit(X_word_vec_train[mi[:int(att)].index].values, y_word_vec_train)
    y_pred = svc.predict(X_word_vec_test[mi[:int(att)].index].values)
    acc = accuracy_score(y_word_vec_test , y_pred)
    
    list_scores_svc.append(acc)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

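int_atts = [int(att) for att in atts]

sns.set()
plt.figure(figsize=(14,7))
sns.lineplot(y=list_scores_svc, x=int_atts)
plt.xlabel('Number of selected features (ranked by mutual information)')
plt.ylabel('Validation accuracy')
plt.show()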

In [None]:
svc.fit(train_vecs.values, train.target)
y_pred = svc.predict(test_vecs.values)

sub_svc = sample_submission.copy()
sub_svc['target'] = y_pred
sub_svc.set_index('id',inplace=True)

sub_svc.to_csv("./sub_svc_word_vec.csv")

## To-Do

- Test other models (NaiveBayes, RidgeClassifier, ...)
- Use GridSearch to tune Hyperparameters
