# NLP Disaster Tweets

This analysis uses Twitter data to predict whether a given tweet really describes a disaster or is a fake (non-disaster) tweet. The Kaggle competition link is: https://www.kaggle.com/c/nlp-getting-started/overview

Importing the necessary modules for the analysis:

In [1]:
import pandas as pd
import io
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.util import ngrams
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score
from sklearn.pipeline import FeatureUnion
import time

Downloading the necessary NLTK packages:

In [2]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\toviv\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\toviv\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\toviv\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Loading the `train` and `test` datasets:

In [3]:
train = pd.read_csv('train.csv')
print(train.head())
print(train.shape)

test = pd.read_csv('test.csv')
t1 = test.copy()   # keeping an independent copy of the test dataset
test.head()

   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  
0       1  
1       1  
2       1  
3       1  
4       1  
(7613, 5)


   id keyword location                                               text
0   0     NaN      NaN                 Just happened a terrible car crash
1   2     NaN      NaN  Heard about #earthquake is different cities, s...
2   3     NaN      NaN  there is a forest fire at spot pond, geese are...
3   9     NaN      NaN           Apocalypse lighting. #Spokane #wildfires
4  11     NaN      NaN      Typhoon Soudelor kills 28 in China and Taiwan
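
A note on keeping a backup of a dataframe: in pandas, plain assignment only binds a second name to the same object, while `.copy()` creates an independent one. A minimal sketch on a made-up two-row frame (the frame and its contents are illustrative, not the competition data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the test set
df = pd.DataFrame({"id": [0, 2], "text": ["car crash", "earthquake"]})

alias = df            # same object under a second name
backup = df.copy()    # independent copy

df["target"] = [1, 1]  # mutate the original in place

print("target" in alias.columns)   # the alias sees the new column
print("target" in backup.columns)  # the copy does not
```

This matters later in the notebook, where a `target` column is assigned to `test` before each submission file is written.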


Checking the number of NULL values in the `train` dataframe:

In [4]:
train.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

Out of 7613 observations, the `location` variable has 2533 NULL values, so I decided to delete the entire column.

The `keyword` variable has only 61 NULL values, so I decided to delete just the observations where `keyword` is NULL.

After these two steps, the dataset is free of NULL values:

In [5]:
train.drop('location', axis=1, inplace=True)
train.dropna(inplace=True)
train.shape

(7552, 4)
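
As a quick sanity check on these counts, the share of missing values per column and the expected row count after dropping can be worked out directly from the numbers printed above (no data file is needed for this sketch):

```python
# Counts copied from the isnull().sum() output above
n_rows = 7613
null_counts = {"keyword": 61, "location": 2533}

# Percentage of rows with a missing value in each column
null_share = {col: round(100 * n / n_rows, 2) for col, n in null_counts.items()}
print(null_share)

# Rows left after dropping the rows with a missing keyword
rows_after_drop = n_rows - null_counts["keyword"]
print(rows_after_drop)
```

Roughly a third of `location` is missing, against under 1% of `keyword`, which is why dropping the column in one case and the rows in the other is a reasonable trade-off.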

In [6]:
train = train.reset_index(drop=True) # resetting the index of the dataframe

Now, choosing the features for the train/test split, and also keeping a copy of the `x_train` variable

In [7]:
x = train['text']
y = train['target']
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
x_train_copy = x_train
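
Because `train_test_split` shuffles randomly, the scores reported below will vary from run to run. A minimal sketch of a reproducible, class-balanced split, assuming a fixed seed is acceptable (the seed value 42 and the toy data are arbitrary choices, not part of the original analysis):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 texts with balanced labels
x_demo = pd.Series([f"tweet {i}" for i in range(10)])
y_demo = pd.Series([0, 1] * 5)

def split(seed):
    # stratify keeps the class ratio the same in both parts
    return train_test_split(x_demo, y_demo, test_size=0.2,
                            random_state=seed, stratify=y_demo)

a_train, a_test, _, _ = split(42)
b_train, b_test, _, _ = split(42)

# Same seed, same split
print(list(a_test.index) == list(b_test.index))
```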

Defining functions for the following operations:

- Tokenization: lowercasing each tweet and splitting it into word tokens
- Stop words: removing stop words and punctuation marks
- Lemmatization: converting each token to its lemma (dictionary form) and joining the tokens back into a single string

In [8]:
def tokenize(text):
  text = text.lower()
  word_tokens = word_tokenize(text)
  return word_tokens

def remove_words(text):

  # set of stop words
  stop_words = set(stopwords.words('english'))

  punc=['?',':','!','.',',',';','#','@','$','-','(',')','_',"'"]
  punctuations = set(punc)

  filtered_sentence = []

  for w in text: 
      if w not in stop_words: 
        if w not in punctuations:
          filtered_sentence.append(w) 

  return filtered_sentence

def lemmatize(text):
  wordnet_lemmatizer = WordNetLemmatizer()
  lemmatized_token = []
  for token in text:
    w = wordnet_lemmatizer.lemmatize(token)
    lemmatized_token.append(w)
  return ' '.join(lemmatized_token)
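
To get a feel for what the first two steps do to a tweet, here is a dependency-free approximation. It is only a sketch: `simple_tokenize`, `drop_stop_words` and the tiny `STOP_WORDS` set are hypothetical stand-ins for `word_tokenize` and NLTK's English stop-word list, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Tiny illustrative stop-word list (an assumption; the notebook uses NLTK's list)
STOP_WORDS = {"the", "of", "are", "this", "is", "a", "in", "to"}

def simple_tokenize(text):
    # Lowercase and keep alphanumeric runs; a rough stand-in for word_tokenize
    return re.findall(r"[a-z0-9]+", text.lower())

def drop_stop_words(tokens):
    # Keep only the tokens that are not in the stop-word set
    return [t for t in tokens if t not in STOP_WORDS]

tweet = "Forest fire near La Ronge Sask. Canada"
tokens = drop_stop_words(simple_tokenize(tweet))
print(tokens)
```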

Applying these functions to the `x_train` sample:

In [9]:
x_train = x_train.apply(tokenize)
x_train = x_train.apply(remove_words)
x_train = x_train.apply(lemmatize)

The feature-extraction methods applied to the `x_train` dataset are:
- CountVectorizer()
- TfidfVectorizer()
- N-Grams

CountVectorizer() was run on both the preprocessed (`x_train`) and raw (`x_train_copy`) datasets; the other methods were run on the preprocessed data.

## 1.a. CountVectorizer on preprocessed data `x_train`

In [10]:
vectorizer = CountVectorizer()

# Generate matrix of word vectors
x_train_bow = vectorizer.fit_transform(x_train)

# Transform X_test
x_test_bow = vectorizer.transform(x_test)
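
As a toy illustration of what `CountVectorizer` produces, the sketch below fits it on two made-up sentences (the sentences and the `demo_` names are illustrative only). Each row of the result is a document and each column counts one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["forest fire near town", "fire fire everywhere"]
demo_vectorizer = CountVectorizer()
demo_bow = demo_vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = demo_vectorizer.vocabulary_  # maps each word to its column index
print(sorted(vocab))
print(demo_bow.toarray())
```

The word "fire" appears twice in the second sentence, so its column holds a 2 in that row.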

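The first analysis was run on multiple classifiers to compare their results, mainly the F1 score. `cross_val_predict` is called with `cv=5`, i.e. 5-fold cross-validation, so every training observation is predicted by a model that was not fitted on it. This gives a more reliable estimate of the scores than a single split.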

In [11]:
names = ["Gaussian Naive Bayes", "Multinomial Naive Bayes","Bernoulli Naive Bayes", "LogisticRegression"]
classifiers = [GaussianNB(), MultinomialNB(), BernoulliNB(), LogisticRegression()]

for name, clf in zip(names, classifiers):
  #Cross validation prediction, and we measure fitting time 
  start = time.time()
  preds = cross_val_predict(clf,x_train_bow.toarray(),y_train,cv=5)
  end = time.time()
  #Metrics
  acc = accuracy_score(y_train,preds)
  precision = precision_score(y_train,preds)
  recall = recall_score(y_train,preds)
  f1 = f1_score(y_train,preds)
  cm = confusion_matrix(y_train,preds)
  #Printing results
  print (name, 'Accuracy  :  ', "%.2f" %(acc*100),'%', ', Precision',"%.3f" %precision, 'Recall :' , "%.3f" %recall ,'F1-Score : ',"%.3f" %f1)
  print('The confusion Matrix : ' )
  print(cm)
  #Now we check how long did it take
  print('Time used :', "%.3f" %(end - start), 'seconds')
  print(' *-----------------------------------------------------------------------------------------------------*')

Gaussian Naive Bayes Accuracy  :   61.12 % , Precision 0.530 Recall : 0.802 F1-Score :  0.638
The confusion Matrix : 
[[1620 1837]
 [ 512 2072]]
Time used : 12.020 seconds
 *-----------------------------------------------------------------------------------------------------*
Multinomial Naive Bayes Accuracy  :   79.51 % , Precision 0.774 Recall : 0.736 F1-Score :  0.755
The confusion Matrix : 
[[2900  557]
 [ 681 1903]]
Time used : 10.300 seconds
 *-----------------------------------------------------------------------------------------------------*
Bernoulli Naive Bayes Accuracy  :   79.69 % , Precision 0.849 Recall : 0.639 F1-Score :  0.729
The confusion Matrix : 
[[3164  293]
 [ 934 1650]]
Time used : 12.650 seconds
 *-----------------------------------------------------------------------------------------------------*
LogisticRegression Accuracy  :   79.77 % , Precision 0.808 Recall : 0.692 F1-Score :  0.745
The confusion Matrix : 
[[3032  425]
 [ 797 1787]]
Time used : 29.282 seconds

It can be seen that the Multinomial Naive Bayes Classifier provides the best F1 score among all the classifiers. Hence, I will proceed with Multinomial Naive Bayes Classifier for further analysis.
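
The reported metrics can be re-derived by hand from the confusion matrix, which is a useful check on how they relate to each other. The sketch below uses the Multinomial Naive Bayes matrix printed above and scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
# Confusion matrix for Multinomial Naive Bayes, copied from the output above
tn, fp = 2900, 557   # true class 0: predicted 0, predicted 1
fn, tp = 681, 1903   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted disasters that are real
recall = tp / (tp + fn)      # share of real disasters that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.3f} {recall:.3f} {f1:.3f}")
```

The values agree with the printed accuracy, precision, recall and F1 score for this classifier.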

With the validation done, the next step is to predict the `target` variable for the `test` dataset.

The predictions are merged with the `test` dataset, the unnecessary variables are removed, and the resulting dataframe is saved as a CSV file, as shown:

In [12]:
clf = MultinomialNB()
clf.fit(x_train_bow.toarray(),y_train)

y_test = test['text']
y_test = y_test.apply(tokenize)
y_test = y_test.apply(remove_words)
y_test = y_test.apply(lemmatize)

test_bow = vectorizer.transform(y_test)

y_pred = clf.predict(test_bow)

pred = pd.DataFrame(y_pred)

test['target'] = pred

test = test.drop(['keyword','location','text'], axis=1)

test.to_csv('bowMNB.csv', index=False)
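
Assigning a separate prediction dataframe to `test['target']` relies on both objects sharing the same default index. An alternative sketch, under the assumption that only the `id` and `target` columns are needed, builds the output frame directly from the ids and the prediction array (`test_demo` and `y_pred_demo` are made-up stand-ins):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the test set and the model output
test_demo = pd.DataFrame({"id": [0, 2, 3], "text": ["a", "b", "c"]})
y_pred_demo = np.array([1, 0, 1])

# Build the two-column frame directly; values are matched by position
submission = pd.DataFrame({"id": test_demo["id"].to_numpy(),
                           "target": y_pred_demo})
print(submission)
```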

Submitting these predictions resulted in a score of 0.79754

![picture](https://drive.google.com/uc?id=1jAQfnlr2EnpSqEB5XPjlCWCvofX_nmqX)

## 1.b. CountVectorizer on raw data `x_train_copy`

In [13]:
vectorizer = CountVectorizer()

# Generate matrix of word vectors
x_train_bow = vectorizer.fit_transform(x_train_copy)

# Transform X_test
x_test_bow = vectorizer.transform(x_test)

for name, clf in zip(names, classifiers):
  #Cross validation prediction, and we measure fitting time 
  start = time.time()
  preds = cross_val_predict(clf,x_train_bow.toarray(),y_train,cv=5)
  end = time.time()
  #Metrics
  acc = accuracy_score(y_train,preds)
  precision = precision_score(y_train,preds)
  recall = recall_score(y_train,preds)
  f1 = f1_score(y_train,preds)
  cm = confusion_matrix(y_train,preds)
  #Printing results
  print (name, 'Accuracy  :  ', "%.2f" %(acc*100),'%', ', Precision',"%.3f" %precision, 'Recall :' , "%.3f" %recall ,'F1-Score : ',"%.3f" %f1)
  print('The confusion Matrix : ' )
  print(cm)
  #Now we check how long did it take
  print('Time used :', "%.3f" %(end - start), 'seconds')
  print(' *-----------------------------------------------------------------------------------------------------*')

Gaussian Naive Bayes Accuracy  :   61.86 % , Precision 0.537 Recall : 0.796 F1-Score :  0.641
The confusion Matrix : 
[[1681 1776]
 [ 528 2056]]
Time used : 12.700 seconds
 *-----------------------------------------------------------------------------------------------------*
Multinomial Naive Bayes Accuracy  :   79.57 % , Precision 0.785 Recall : 0.719 F1-Score :  0.751
The confusion Matrix : 
[[2949  508]
 [ 726 1858]]
Time used : 10.972 seconds
 *-----------------------------------------------------------------------------------------------------*
Bernoulli Naive Bayes Accuracy  :   79.79 % , Precision 0.840 Recall : 0.651 F1-Score :  0.734
The confusion Matrix : 
[[3137  320]
 [ 901 1683]]
Time used : 13.501 seconds
 *-----------------------------------------------------------------------------------------------------*


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression Accuracy  :   79.84 % , Precision 0.804 Recall : 0.699 F1-Score :  0.748
The confusion Matrix : 
[[3018  439]
 [ 779 1805]]
Time used : 39.409 seconds
 *-----------------------------------------------------------------------------------------------------*


Trying the LogisticRegression classifier, which has the highest accuracy here (79.84%) and an F1 score (0.748) close to that of Multinomial Naive Bayes (0.751):

In [14]:
clf = LogisticRegression()
clf.fit(x_train_bow.toarray(),y_train)

test = t1
y_test = test['text']

test_bow = vectorizer.transform(y_test)

y_pred = clf.predict(test_bow)

pred = pd.DataFrame(y_pred)

test['target'] = pred

test = test.drop(['keyword','location','text'], axis=1)

test.to_csv('rawLR.csv', index=False)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
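
The warning above says the solver stopped at its iteration limit. One common remedy, as the message itself suggests, is to raise `max_iter`. The sketch below shows this on made-up sentences; the value 1000 is an arbitrary assumption, not a tuned setting. It also passes the sparse matrix directly, since `LogisticRegression` accepts sparse input and does not need `.toarray()`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 1 = disaster, 0 = not a disaster
texts = ["forest fire near town", "huge earthquake hits city",
         "i love this song", "what a lovely day"]
labels = [1, 1, 0, 0]

demo_vec = CountVectorizer()
X_demo = demo_vec.fit_transform(texts)      # kept sparse

demo_lr = LogisticRegression(max_iter=1000)  # higher iteration limit
demo_lr.fit(X_demo, labels)

demo_preds = demo_lr.predict(demo_vec.transform(["earthquake near the city"]))
print(demo_preds, demo_lr.n_iter_)
```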


## 2. TfidfVectorizer() on preprocessed data `x_train`

In [15]:
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
x_train_tf = vectorizer.fit_transform(x_train)
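
To illustrate how TF-IDF differs from raw counts, the sketch below fits `TfidfVectorizer` with its default settings on two made-up sentences. A word that occurs in every document gets the lowest inverse document frequency, and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["fire fire forest", "fire city"]
demo_tfidf = TfidfVectorizer()
weights = demo_tfidf.fit_transform(docs)

vocab = demo_tfidf.vocabulary_
idf = demo_tfidf.idf_  # learned inverse document frequency per term

# "fire" occurs in both documents, "forest" in only one
print(idf[vocab["fire"]], idf[vocab["forest"]])

# Each row is L2-normalised by default
print(np.linalg.norm(weights.toarray(), axis=1))
```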

In [16]:
# Transform X_test
x_test_tf = vectorizer.transform(x_test)

for name, clf in zip(names, classifiers):
  #Cross validation prediction, and we measure fitting time 
  start = time.time()
  preds = cross_val_predict(clf,x_train_tf.toarray(),y_train,cv=5)
  end = time.time()
  #Metrics
  acc = accuracy_score(y_train,preds)
  precision = precision_score(y_train,preds)
  recall = recall_score(y_train,preds)
  f1 = f1_score(y_train,preds)
  cm = confusion_matrix(y_train,preds)
  #Printing results
  print (name, 'Accuracy  :  ', "%.2f" %(acc*100),'%', ', Precision',"%.3f" %precision, 'Recall :' , "%.3f" %recall ,'F1-Score : ',"%.3f" %f1)
  print('The confusion Matrix : ' )
  print(cm)
  #Now we check how long did it take
  print('Time used :', "%.3f" %(end - start), 'seconds')
  print(' *-----------------------------------------------------------------------------------------------------*')

Gaussian Naive Bayes Accuracy  :   60.82 % , Precision 0.529 Recall : 0.765 F1-Score :  0.626
The confusion Matrix : 
[[1696 1761]
 [ 606 1978]]
Time used : 12.616 seconds
 *-----------------------------------------------------------------------------------------------------*
Multinomial Naive Bayes Accuracy  :   79.87 % , Precision 0.848 Recall : 0.646 F1-Score :  0.733
The confusion Matrix : 
[[3157  300]
 [ 916 1668]]
Time used : 3.422 seconds
 *-----------------------------------------------------------------------------------------------------*
Bernoulli Naive Bayes Accuracy  :   79.69 % , Precision 0.849 Recall : 0.639 F1-Score :  0.729
The confusion Matrix : 
[[3164  293]
 [ 934 1650]]
Time used : 6.584 seconds
 *-----------------------------------------------------------------------------------------------------*
LogisticRegression Accuracy  :   79.16 % , Precision 0.852 Recall : 0.621 F1-Score :  0.718
The confusion Matrix : 
[[3178  279]
 [ 980 1604]]
Time used : 13.301 seconds

In [17]:
clf = MultinomialNB()
clf.fit(x_train_tf.toarray(),y_train)

test = t1
y_test = test['text']
y_test = y_test.apply(tokenize)
y_test = y_test.apply(remove_words)
y_test = y_test.apply(lemmatize)

test_bow = vectorizer.transform(y_test)

y_pred = clf.predict(test_bow)

pred = pd.DataFrame(y_pred)

test['target'] = pred

test = test.drop(['keyword','location','text'], axis=1)

test.to_csv('tf_MNB.csv', index=False)

The submission's F1 score improved slightly using the TF-IDF method (the cross-validated F1 score of Multinomial Naive Bayes, by contrast, dropped from 0.755 to 0.733):
![picture](https://drive.google.com/uc?id=1CJSKV0q40RvZcMXoU6qm7Z7Ihegd6Y1T)

## 3. N-Gram Method on preprocessed data

In [18]:
vectorizer_ng = CountVectorizer(ngram_range=[1,3])

# Generate matrix of word vectors
x_train_ng = vectorizer_ng.fit_transform(x_train)

# Transform X_test
x_test_ng = vectorizer_ng.transform(x_test)

for name, clf in zip(names, classifiers):
  #Cross validation prediction, and we measure fitting time 
  start = time.time()
  preds = cross_val_predict(clf,x_train_ng.toarray(),y_train,cv=5)
  end = time.time()
  #Metrics
  acc = accuracy_score(y_train,preds)
  precision = precision_score(y_train,preds)
  recall = recall_score(y_train,preds)
  f1 = f1_score(y_train,preds)
  cm = confusion_matrix(y_train,preds)
  #Printing results
  print (name, 'Accuracy  :  ', "%.2f" %(acc*100),'%', ', Precision',"%.3f" %precision, 'Recall :' , "%.3f" %recall ,'F1-Score : ',"%.3f" %f1)
  print('The confusion Matrix : ' )
  print(cm)
  #Now we check how long did it take
  print('Time used :', "%.3f" %(end - start), 'seconds')
  print(' *-----------------------------------------------------------------------------------------------------*')

Gaussian Naive Bayes Accuracy  :   65.12 % , Precision 0.566 Recall : 0.787 F1-Score :  0.659
The confusion Matrix : 
[[1901 1556]
 [ 551 2033]]
Time used : 147.340 seconds
 *-----------------------------------------------------------------------------------------------------*
Multinomial Naive Bayes Accuracy  :   78.27 % , Precision 0.741 Recall : 0.756 F1-Score :  0.748
The confusion Matrix : 
[[2775  682]
 [ 631 1953]]
Time used : 172.228 seconds
 *-----------------------------------------------------------------------------------------------------*
Bernoulli Naive Bayes Accuracy  :   73.10 % , Precision 0.936 Recall : 0.398 F1-Score :  0.559
The confusion Matrix : 
[[3387   70]
 [1555 1029]]
Time used : 257.327 seconds
 *-----------------------------------------------------------------------------------------------------*
LogisticRegression Accuracy  :   79.49 % , Precision 0.836 Recall : 0.647 F1-Score :  0.730
The confusion Matrix : 
[[3130  327]
 [ 912 1672]]
Time used : 312.777 seconds
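
The much longer run times in this section are consistent with the larger feature space that n-grams create. A small sketch of what `ngram_range=(1, 3)` generates from a single made-up three-word sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams, bigrams and trigrams, as in the cell above
demo_ng = CountVectorizer(ngram_range=(1, 3))
demo_ng.fit(["forest fire near"])

features = sorted(demo_ng.vocabulary_)
print(features)
```

Three words already yield six features (three unigrams, two bigrams, one trigram), and converting such a matrix to a dense array with `.toarray()` adds further cost.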

Submitting these predicted values on Kaggle gave a score of 0.7965

## Combining CountVectorizer() and TfidfVectorizer()

In [19]:
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
tf_transformer = TfidfVectorizer(use_idf=True)
combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tf_transformer)]).fit_transform(x_train)

classifier = MultinomialNB()
classifier.fit(combined_features, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
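
`FeatureUnion` fits each vectorizer and stacks their outputs side by side. A minimal sketch on made-up sentences, assuming the union object is kept in a variable (here the hypothetical name `union`) so that the same fitted object can later transform new text:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

train_docs = ["forest fire near town", "i love this song"]
new_docs = ["fire in the forest"]

union = FeatureUnion([("counts", CountVectorizer()),
                      ("tfidf", TfidfVectorizer())])

train_features = union.fit_transform(train_docs)  # fit both, stack the columns
new_features = union.transform(new_docs)          # reuse the fitted vocabularies

print(train_features.shape, new_features.shape)
```

Both vectorizers learn the same seven-word vocabulary here, so the combined matrix has fourteen columns, and new text is mapped onto exactly those columns.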

In [20]:
for name, clf in zip(names, classifiers):
  #Cross validation prediction, and we measure fitting time 
  start = time.time()
  preds = cross_val_predict(clf,combined_features.toarray(),y_train,cv=5)
  end = time.time()
  #Metrics
  acc = accuracy_score(y_train,preds)
  precision = precision_score(y_train,preds)
  recall = recall_score(y_train,preds)
  f1 = f1_score(y_train,preds)
  cm = confusion_matrix(y_train,preds)
  #Printing results
  print (name, 'Accuracy  :  ', "%.2f" %(acc*100),'%', ', Precision',"%.3f" %precision, 'Recall :' , "%.3f" %recall ,'F1-Score : ',"%.3f" %f1)
  print('The confusion Matrix : ' )
  print(cm)
  #Now we check how long did it take
  print('Time used :', "%.3f" %(end - start), 'seconds')
  print(' *-----------------------------------------------------------------------------------------------------*')

Gaussian Naive Bayes Accuracy  :   61.79 % , Precision 0.537 Recall : 0.774 F1-Score :  0.634
The confusion Matrix : 
[[1732 1725]
 [ 583 2001]]
Time used : 30.231 seconds
 *-----------------------------------------------------------------------------------------------------*
Multinomial Naive Bayes Accuracy  :   79.27 % , Precision 0.782 Recall : 0.714 F1-Score :  0.747
The confusion Matrix : 
[[2944  513]
 [ 739 1845]]
Time used : 8.382 seconds
 *-----------------------------------------------------------------------------------------------------*
Bernoulli Naive Bayes Accuracy  :   80.28 % , Precision 0.840 Recall : 0.666 F1-Score :  0.743
The confusion Matrix : 
[[3130  327]
 [ 864 1720]]
Time used : 16.538 seconds
 *-----------------------------------------------------------------------------------------------------*
LogisticRegression Accuracy  :   79.47 % , Precision 0.798 Recall : 0.697 F1-Score :  0.744
The confusion Matrix : 
[[3001  456]
 [ 784 1800]]
Time used : 49.932 seconds

In [21]:
clf = MultinomialNB()
clf.fit(combined_features.toarray(),y_train)

test = t1
y_test = test['text']
y_test = y_test.apply(tokenize)
y_test = y_test.apply(remove_words)
y_test = y_test.apply(lemmatize)

test_bow = FeatureUnion([("counts", count_vectorizer), ("tfidf", tf_transformer)]).transform(y_test)

y_pred = clf.predict(test_bow)

pred = pd.DataFrame(y_pred)

test['target'] = pred

test = test.drop(['keyword','location','text'], axis=1)

test.to_csv('cv_tf_MNB.csv', index=False)

The F1 score of this submission improved to 80.26%:

![picture](https://drive.google.com/uc?id=1cSdLFZQzzOwmqpMdNfXqtGP2Ke-pLdKg)

In [22]:
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
tf_transformer = TfidfVectorizer(use_idf=True)
ngram = CountVectorizer(ngram_range=[1,3])

combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tf_transformer), ("ngram", ngram)]).fit_transform(x_train)
classifier.fit(combined_features, y_train)

for name, clf in zip(names, classifiers):
  #Cross validation prediction, and we measure fitting time 
  start = time.time()
  preds = cross_val_predict(clf,combined_features.toarray(),y_train,cv=5)
  end = time.time()
  #Metrics
  acc = accuracy_score(y_train,preds)
  precision = precision_score(y_train,preds)
  recall = recall_score(y_train,preds)
  f1 = f1_score(y_train,preds)
  cm = confusion_matrix(y_train,preds)
  #Printing results
  print (name, 'Accuracy  :  ', "%.2f" %(acc*100),'%', ', Precision',"%.3f" %precision, 'Recall :' , "%.3f" %recall ,'F1-Score : ',"%.3f" %f1)
  print('The confusion Matrix : ' )
  print(cm)
  #Now we check how long did it take
  print('Time used :', "%.3f" %(end - start), 'seconds')
  print(' *-----------------------------------------------------------------------------------------------------*')

Gaussian Naive Bayes Accuracy  :   65.77 % , Precision 0.576 Recall : 0.756 F1-Score :  0.654
The confusion Matrix : 
[[2020 1437]
 [ 631 1953]]
Time used : 262.687 seconds
 *-----------------------------------------------------------------------------------------------------*
Multinomial Naive Bayes Accuracy  :   79.06 % , Precision 0.758 Recall : 0.749 F1-Score :  0.754
The confusion Matrix : 
[[2840  617]
 [ 648 1936]]
Time used : 52.544 seconds
 *-----------------------------------------------------------------------------------------------------*
Bernoulli Naive Bayes Accuracy  :   77.88 % , Precision 0.912 Recall : 0.534 F1-Score :  0.674
The confusion Matrix : 
[[3324  133]
 [1203 1381]]
Time used : 174.226 seconds
 *-----------------------------------------------------------------------------------------------------*
LogisticRegression Accuracy  :   79.85 % , Precision 0.821 Recall : 0.676 F1-Score :  0.742
The confusion Matrix : 
[[3077  380]
 [ 837 1747]]
Time used : 302.130 seconds

This prediction gives an F1 score of 0.79243.

In [23]:
clf = LogisticRegression()
clf.fit(combined_features.toarray(),y_train)

test = t1
y_test = test['text']
y_test = y_test.apply(tokenize)
y_test = y_test.apply(remove_words)
y_test = y_test.apply(lemmatize)

test_bow = FeatureUnion([("counts", count_vectorizer), ("tfidf", tf_transformer), ("ngram", ngram)]).transform(y_test)

y_pred = clf.predict(test_bow)

pred = pd.DataFrame(y_pred)

test['target'] = pred

test = test.drop(['keyword','location','text'], axis=1)

test.to_csv('cv_tf_ng_LR.csv', index=False)

This prediction gives an F1 score of 0.79652

## Conclusions

The top scores obtained are shown in the screenshot below:

![picture](https://drive.google.com/uc?id=180PknEPk91NDto9hPqhykyT2xQ2wdI89)

The following conclusions can be made:
- The F1 score of the model can increase when feature transformations are combined
    * In this case, the combination of CountVectorizer() and TfidfVectorizer() produced the highest F1 score
- Other models like BERT can be implemented to check for improvements over these results
- The data can be analysed from a different perspective, for example by testing the hypothesis that FAKE tweets usually have:
    * a higher count of NOUNS compared to the average number of NOUNS per tweet in the dataset
    * longer sentences compared to REAL tweets

As future work, the average number of NOUNS per tweet in the dataset can be calculated and compared with the number of NOUNS in FAKE tweets. Similarly, the length of FAKE tweets can be compared with that of REAL tweets to see whether the hypothesis holds.
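
The length comparison could start from something like the sketch below. It uses a made-up four-row frame with the same `text` and `target` columns as the training data, and the word count per tweet is just one possible definition of length. Counting nouns would follow the same pattern with a part-of-speech tagger, which is not shown here.

```python
import pandas as pd

# Hypothetical toy data: 1 = REAL disaster tweet, 0 = FAKE tweet
demo = pd.DataFrame({
    "text": ["Forest fire near La Ronge", "Earthquake hits city",
             "I love this song so much today", "What a day"],
    "target": [1, 1, 0, 0],
})

# Number of whitespace-separated words in each tweet
demo["n_words"] = demo["text"].str.split().str.len()

# Average tweet length per class
mean_len = demo.groupby("target")["n_words"].mean()
print(mean_len)
```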