# <center>Natural Language Processing with Disaster Tweets</center>

The objective of this work is to present my solution for the Natural Language Processing with Disaster Tweets competition.

<p style='text-align: justify;'>The vast majority of the winning solutions in this competition use neural networks. In this work I present a solution that achieves good performance using only the Multinomial Naive Bayes model. For text vectorization I use the Bag of Words approach (CountVectorizer), which seems to give better results here than the TfidfVectorizer approach.</p>
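
To make the difference between the two vectorizers concrete, here is a minimal sketch on a toy corpus. The three sentences are made up for illustration and are not taken from the competition data. `CountVectorizer` stores raw term counts, while `TfidfVectorizer` rescales them by inverse document frequency and, with its default settings, normalizes each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus (not from the competition data)
docs = [
    "forest fire near the town",
    "fire fire everywhere",
    "i love this town",
]

# Bag of Words: each column is a vocabulary term, each cell a raw count
bow = CountVectorizer()
counts = bow.fit_transform(docs)
vocab = sorted(bow.vocabulary_, key=bow.vocabulary_.get)
print(vocab)
print(counts.toarray())

# TF-IDF: same vocabulary, but counts are reweighted and rows are L2-normalized
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))
```

Multinomial Naive Bayes models word counts directly, which may be one reason the raw counts work well with it, though that is only a plausible explanation and not something tested here.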

More information about this competition is available at https://www.kaggle.com/competitions/nlp-getting-started/overview

In [None]:
# Some basic imports

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import spacy
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split , KFold
from sklearn.metrics import f1_score
from IPython.display import Image

## Reading Training and Test Files

In [None]:
train = pd.read_csv('../input/nlp-getting-started/train.csv')
test = pd.read_csv('../input/nlp-getting-started/test.csv')

In [None]:
# Training Dataset Dimensions

train.shape

In [None]:
# Test Dataset Dimensions

test.shape

In [None]:
# Viewing the first 5 lines of the training Dataset

train.head()

In [None]:
# Data Types

train.dtypes

In [None]:
# % of Real Disasters in the Training Dataset

(train['target'].sum()/train['target'].count())*100

In my experiments, including the "keyword" and "location" variables in the models gave worse results, so only the "text" variable is pre-processed. For the same reason, the missing values in "keyword" and "location" do not need to be handled.

## Text preprocessing

The function below receives a string and returns a string containing only the tokens made up entirely of alphabetic characters (as judged by Python's `isalpha()`). It then applies lemmatization to the remaining tokens.

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
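def text_preprocessing(string):
    # First pass: tokenize and keep only purely alphabetic tokens
    doc = nlp(string)
    alphabetic_text = " ".join(token.text for token in doc if token.text.isalpha())

    # Second pass: re-parse the cleaned text and replace each token with its lemma
    cleaned_doc = nlp(alphabetic_text)
    return " ".join(token.lemma_ for token in cleaned_doc)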
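
The alphabetic filter can be illustrated without spaCy. The sketch below is a simplified stand-in, not the function above: it splits on whitespace only, whereas spaCy's tokenizer also separates punctuation from words, so the two will not keep exactly the same tokens. The helper name `keep_alphabetic` and the sample sentence are hypothetical.

```python
def keep_alphabetic(text):
    """Keep whitespace-separated tokens that consist only of letters."""
    return " ".join(tok for tok in text.split() if tok.isalpha())


# Hypothetical sample text, loosely in the style of a tweet
sample = "Forest fire near La Ronge Sask. Canada #wildfire 13,000 http://example.com café"
cleaned = keep_alphabetic(sample)
print(cleaned)
# Forest fire near La Ronge Canada café
```

Note that `isalpha()` accepts any Unicode letter, so accented words such as "café" are kept, while tokens containing digits, punctuation, or symbols are dropped.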

The text_preprocessing function is applied to the "text" column of the training and test datasets, and the result is stored in a new "pre-processed text" column.

In [None]:
train['pre-processed text'] = train['text'].apply(text_preprocessing)

test['pre-processed text'] = test['text'].apply(text_preprocessing)

In [None]:
# Taking a look at the preprocessing result

train[['text' , 'pre-processed text']].head()

## Cross-Validation

In [None]:
# Multinomial Naive-Bayes

model = MultinomialNB() 

In [None]:
vectorizer = CountVectorizer(lowercase = True , stop_words = 'english')

In [None]:
X = np.array(train['pre-processed text'])
y = np.array(train['target'])

In [None]:
# Stratified 70/30 hold-out split
# Note: the K-fold loop below reassigns X_train, X_test, y_train and y_test, so this split is not used

X_train , X_test , y_train , y_test = train_test_split(X , y , test_size = 0.3 , stratify = y , random_state = 12)

In [None]:
kf = KFold(n_splits = 10 , shuffle = True , random_state = 451)

In [None]:
# Average F1 score over the 10 cross-validation folds

scores = []
for train_index , test_index in kf.split(X):
    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]
    # Fit the vocabulary on the training fold only, then reuse it on the held-out fold
    bow_train = vectorizer.fit_transform(X_train)
    bow_test = vectorizer.transform(X_test)
    model.fit(bow_train , y_train)
    pred = model.predict(bow_test)
    scores.append(f1_score(y_test , pred))
np.mean(scores)

<p style='text-align: justify;'>An average F1 score of 74.63% is a good cross-validation result for such a simple model. Fitting the Bag of Words on the entire training set produces a larger vocabulary, so there is a good chance of a better score when submitting to Kaggle. Let's see what score we get in Kaggle when we submit the predictions of a Multinomial Naive-Bayes model with the default Scikit-Learn configuration.</p>
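
The manual fold loop used for cross-validation can also be written with a scikit-learn pipeline and `cross_val_score`. This is a minimal sketch, not the code used for the scores reported here. The toy `texts` and `labels` arrays are hypothetical stand-ins for `train['pre-processed text']` and `train['target']`, and only 3 folds are used because the toy set is tiny.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = disaster, 0 = not a disaster)
texts = np.array([
    "forest wildfire spread near town",
    "earthquake damage report city",
    "flood water rise evacuate street",
    "wildfire smoke evacuate home",
    "storm damage power line",
    "rescue crew save family flood",
    "love new song concert",
    "great day beach friend",
    "movie night popcorn friend",
    "new phone camera great",
    "coffee morning love day",
    "game night pizza great",
])
labels = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# The pipeline refits the vectorizer inside every fold, so the
# held-out fold never contributes to the vocabulary
pipeline = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)

cv = KFold(n_splits=3, shuffle=True, random_state=451)
fold_scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1")
print(fold_scores.mean())
```

Bundling the vectorizer and the model into one estimator keeps the per-fold fitting logic in one place and makes it harder to fit the vocabulary on held-out data by accident.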

## First Submission to Kaggle

In [None]:
bow_train = vectorizer.fit_transform(X)

bow_test = vectorizer.transform(test['pre-processed text'])

In [None]:
# Training MultinomialNB()

model.fit(bow_train , y)

In [None]:
# Predictions

test['target'] = model.predict(bow_test)

In [None]:
test[['id' , 'target']].to_csv('submission1.csv' , index = False)

In [None]:
Image('../input/imagem1/kaggle1.png')

We got a score of 0.79895 on Kaggle. To improve it to 0.79926, we need to tune the model's hyperparameters. Let's see how to do this:

## Hyperparameter Tuning 

In [None]:
# Stratified 70/30 hold-out split
# Note: the K-fold loop below reassigns X_train, X_test, y_train and y_test, so this split is not used

X_train , X_test , y_train , y_test = train_test_split(X , y , test_size = 0.3 , stratify = y , random_state = 12)

In [None]:
kf = KFold(n_splits = 10 , shuffle = True , random_state = 451)

In [None]:
# Observing the average of the cross-validation scores for 100 different values of the alpha parameter

alpha_scores = []
alphas = np.arange(0.1 , 10.1 , 0.1)
for alpha in alphas :
    scores = []
    for train_index , test_index in kf.split(X):
        X_train = X[train_index]
        X_test = X[test_index]
        y_train = y[train_index]
        y_test = y[test_index]
        bow_train = vectorizer.fit_transform(X_train)
        bow_test = vectorizer.transform(X_test)
        model = MultinomialNB(alpha = alpha)
        model.fit(bow_train , y_train)
        pred = model.predict(bow_test)
        scores.append(f1_score(y_test , pred))
    alpha_scores.append(np.mean(scores))
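
For context, `alpha` in `MultinomialNB` is the additive smoothing parameter: the probability of term i given class y is estimated as (N_yi + alpha) / (N_y + alpha * n), where N_yi is the count of term i in class y, N_y is the total term count in class y, and n is the vocabulary size. The sketch below checks this formula against a fitted model on a small, made-up count matrix. The matrix and labels are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents, 3 vocabulary terms
counts = np.array([
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 0],
])
labels = np.array([1, 1, 0, 0])

alpha = 2.1
nb = MultinomialNB(alpha=alpha).fit(counts, labels)

# Manual estimate per class: (N_yi + alpha) / (N_y + alpha * n_terms)
n_terms = counts.shape[1]
manual = []
for cls in nb.classes_:
    term_counts = counts[labels == cls].sum(axis=0)
    manual.append((term_counts + alpha) / (term_counts.sum() + alpha * n_terms))
manual = np.array(manual)

# Compare with the probabilities learned by scikit-learn
matches = np.allclose(manual, np.exp(nb.feature_log_prob_))
print(matches)
```

Larger values of `alpha` pull the estimated term probabilities toward a uniform distribution, which reduces the influence of rare words.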

In [None]:
# Looking at alpha values sorted by their respective average cross-validation scores

dic = {'Alphas' : alphas , 'Average Score Cross-Validation' : alpha_scores}

df = pd.DataFrame(dic)

df.sort_values(by = ['Average Score Cross-Validation'] , ascending = False)

The best average cross-validation scores were obtained for alpha = 2.2 and alpha = 2.1. When submitting to Kaggle, the best result was obtained with alpha = 2.1.
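
The same kind of search over alpha can be expressed with `GridSearchCV`. This is a hedged sketch of an alternative to the nested loops, not the procedure that produced the results reported here. The toy data and the short alpha grid are hypothetical, and the parameter name `multinomialnb__alpha` follows from the step names that `make_pipeline` generates.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = disaster, 0 = not a disaster)
texts = np.array([
    "wildfire spread near town",
    "earthquake damage city",
    "flood water evacuate street",
    "storm damage power line",
    "rescue crew family flood",
    "wildfire smoke evacuate home",
    "love new song concert",
    "great day beach friend",
    "movie night popcorn",
    "new phone camera great",
    "coffee morning love",
    "game night pizza",
])
labels = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

pipeline = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)

# Short illustrative grid; the notebook sweeps 0.1 to 10.0 in steps of 0.1
alphas = [0.1, 0.5, 1.0, 2.1, 5.0]
param_grid = {"multinomialnb__alpha": alphas}

cv = KFold(n_splits=3, shuffle=True, random_state=451)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

After fitting, `search.cv_results_` holds the mean score for every alpha, which plays the same role as the sorted table of alphas and average scores.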

## Second Submission to Kaggle

In [None]:
bow_train = vectorizer.fit_transform(X)

bow_test = vectorizer.transform(test['pre-processed text'])

In [None]:
model = MultinomialNB(alpha = 2.1)

model.fit(bow_train , y)

In [None]:
# Predictions

test['target'] = model.predict(bow_test)

In [None]:
test[['id' , 'target']].to_csv('submission2.csv' , index = False)

In [None]:
Image('../input/imagem12/kaggle12.png')

## Final Comments


<p style='text-align: justify;'>I hope I have been able to show a simple yet effective approach to text processing problems. Of course, there are other approaches that allow for a better score, but most of them use neural networks. A suggestion for those looking for a better ranking is to use BERT (Bidirectional Encoder Representations from Transformers), a deep learning model developed by Google that typically leads to a better score.</p>