# **Real Disaster Prediction from Disaster Tweets using Natural Language Processing**



### Objective

1. Preprocess the data
2. Predict whether a given tweet is about a real disaster or not
3. Build the model and run it on the test data
4. Evaluate the quality of the trained models

---------

### About Dataset

This dataset is taken from the Kaggle competition [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started/overview).

The dataset was created by the company Figure Eight and originally shared on its '[Data For Everyone](https://appen.com/data-for-everyone/)' website.



#### Importing Datasets

In [None]:
# install the opendatasets package
!pip install opendatasets

import opendatasets as od

# download the dataset (this is a Kaggle dataset)
# during download you will be asked for your Kaggle username and API key

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


In [None]:
od.download("https://www.kaggle.com/competitions/nlp-getting-started/data")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: aravindhmp
Your Kaggle Key: ··········
Downloading nlp-getting-started.zip to ./nlp-getting-started


100%|██████████| 593k/593k [00:00<00:00, 74.1MB/s]


Extracting archive ./nlp-getting-started/nlp-getting-started.zip to ./nlp-getting-started





#### Import and install Libraries

In [None]:
!pip install gensim
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.2-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.7/98.7 MB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: catboost
Successfully installed catboost-1.2.2


In [None]:
import gensim
from gensim.models import Word2Vec, KeyedVectors
from gensim.utils import simple_preprocess

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import sent_tokenize
from nltk import FreqDist
from nltk.tokenize import word_tokenize

import tensorflow as tf

from tensorflow.keras.layers import Embedding, Flatten, Dense, Reshape
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.models import Sequential

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier


from xgboost import XGBClassifier
from catboost import CatBoostClassifier

from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from tqdm import tqdm
%matplotlib inline

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

#### Load the Disaster tweets Dataset

In [None]:
df_train=pd.read_csv('/content/nlp-getting-started/train.csv')
df_train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [None]:
df_test=pd.read_csv('/content/nlp-getting-started/test.csv')
df_test.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


In [None]:
print('Training Set Shape = {}'.format(df_train.shape))
print('Training Set Memory Usage = {:.2f} MB'.format(df_train.memory_usage().sum() / 1024**2))
print('Test Set Shape = {}'.format(df_test.shape))
print('Test Set Memory Usage = {:.2f} MB'.format(df_test.memory_usage().sum() / 1024**2))

Training Set Shape = (7613, 5)
Training Set Memory Usage = 0.29 MB
Test Set Shape = (3263, 4)
Test Set Memory Usage = 0.10 MB


#### Explore the dataset

In [None]:
df_train["length"] = df_train["text"].apply(lambda x : len(x))
df_test["length"] = df_test["text"].apply(lambda x : len(x))

print("Train Length Stat")
print(df_train["length"].describe())
print()

print("Test Length Stat")
print(df_test["length"].describe())

Train Length Stat
count    7613.000000
mean      101.037436
std        33.781325
min         7.000000
25%        78.000000
50%       107.000000
75%       133.000000
max       157.000000
Name: length, dtype: float64

Test Length Stat
count    3263.000000
mean      102.108183
std        33.972158
min         5.000000
25%        78.000000
50%       109.000000
75%       134.000000
max       151.000000
Name: length, dtype: float64


In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
 5   length    7613 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 357.0+ KB


For more information about the data, see [this starter notebook](https://www.kaggle.com/code/alexia/kerasnlp-starter-notebook-disaster-tweets?scriptVersionId=138734217&cellId=10).

Note that all the tweets are in English.

#### Preprocess the data

In [None]:
len(df_train['id'])

7613

In [None]:
lemmatizer=WordNetLemmatizer() # Initialize the lemmatizer

In [None]:
def preprocess(df):
  corpus=[]
  words=[]
  stop_words=set(stopwords.words('english'))            ## Load the stop word list once, not once per word
  for i in range(0, len(df)):                           ## Clean and lemmatize each tweet
    review = re.sub('[^a-zA-Z0-9]', ' ', df['text'][i])
    review = review.lower()
    review = review.split()

    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

  for sent in corpus:                                   ## Tokenize the words of each tweet
    sent_token=sent_tokenize(sent)
    for s in sent_token:
      words.append(simple_preprocess(s))

  return corpus, words

In [None]:
corpus_train,words_train=preprocess(df_train)
corpus_test,words_test=preprocess(df_test)

In [None]:
corpus_train[1]

'forest fire near la ronge sask canada'

In [None]:
## Finding the vocabulary size
lst = [word for sentence in corpus_train for word in word_tokenize(sentence)]
freq_dist = FreqDist(lst)
print(freq_dist.most_common(20))
rare_threshold = 10000  ## No word occurs this often, so the list below holds every distinct word
rare_words_train = [word for word, freq in freq_dist.items() if freq < rare_threshold]
len(rare_words_train)

[('co', 4745), ('http', 4721), ('fire', 356), ('like', 350), ('amp', 344), ('u', 261), ('get', 255), ('new', 228), ('via', 220), ('2', 217), ('one', 209), ('people', 201), ('news', 200), ('year', 177), ('video', 175), ('time', 166), ('disaster', 161), ('emergency', 159), ('body', 155), ('day', 147)]


20187
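
The two most frequent tokens, `co` and `http`, look like fragments of shortened links: the cleaning regex replaces `:`, `/` and `.` with spaces, so a link such as `http://t.co/...` is split into separate tokens. A minimal sketch of one way to drop URLs before cleaning, assuming links always start with `http://` or `https://`; the helper `strip_urls_then_clean` is hypothetical and not part of this notebook:

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls_then_clean(text):
    """Remove URLs, then apply the same character filter and lowercasing as preprocess()."""
    without_urls = URL_PATTERN.sub(" ", text)
    alnum_only = re.sub(r"[^a-zA-Z0-9]", " ", without_urls)
    return " ".join(alnum_only.lower().split())

# Made-up tweet for illustration; the link is not a real one.
example = "Forest fire near La Ronge http://t.co/abc123 #wildfire"
cleaned = strip_urls_then_clean(example)
print(cleaned)
```

Stop word removal and lemmatization would still follow this step, as in `preprocess`.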

In [None]:
lst = [word for sentence in corpus_test for word in word_tokenize(sentence)]
freq_dist = FreqDist(lst)
print(freq_dist.most_common(20))
rare_threshold = 10000  ## No word occurs this often, so the list below holds every distinct word
rare_words_test = [word for word, freq in freq_dist.items() if freq < rare_threshold]
len(rare_words_test)


[('co', 2069), ('http', 2059), ('fire', 167), ('amp', 166), ('like', 146), ('get', 126), ('u', 106), ('via', 105), ('new', 103), ('2', 96), ('news', 93), ('one', 90), ('people', 85), ('would', 77), ('year', 76), ('emergency', 73), ('3', 69), ('attack', 69), ('time', 66), ('disaster', 65)]


11431

In [None]:
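## Test vocabulary size plus the number of test words missing from the training vocabulary
## (the full train+test vocabulary would start from len(rare_words_train) instead)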
len(rare_words_test)+len([word for word in rare_words_test if word not in rare_words_train])

17334
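
The combined vocabulary of two corpora is the union of their token sets. A minimal sketch with toy token lists (made up for illustration, not taken from the dataset), showing that the union size equals the training vocabulary plus the words that appear only in the test split:

```python
train_tokens = ["fire", "forest", "near", "canada"]
test_tokens = ["fire", "earthquake", "city", "near"]

train_vocab = set(train_tokens)
test_vocab = set(test_tokens)

# The union counts every distinct word once, whichever split it appears in.
total_vocab = train_vocab | test_vocab

# Equivalent count: all training words plus the test words unseen in training.
test_only = test_vocab - train_vocab
print(len(total_vocab), len(train_vocab) + len(test_only))
```

Set operations also avoid the quadratic cost of checking list membership word by word.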

In [None]:
## Finding the longest cleaned tweet (length measured in characters)

max([[i,j] for i,j in zip(list(map(len,corpus_train)),corpus_train)])

[137,
 'bomb crash loot riot emergency pipe bomb nuclear chemical spill gas ricin leak violence drug cartel cocaine marijuana heroine kidnap bust']

In [None]:
max([[i,j] for i,j in zip(list(map(len,corpus_test)),corpus_test)])

[132,
 'harvardu 90blks amp 8whts colluded 2 take wht f usagov auth hostage amp 2 make look blk w bioterrorism amp use lgl org idis id still']
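
The lengths above are counted in characters, whereas the padding length used later is counted in tokens. A minimal sketch of measuring token counts instead, using a few toy sentences written in the style of the cleaned corpus (an assumption, not actual rows):

```python
corpus = [
    "forest fire near la ronge sask canada",
    "heard earthquake different city stay safe everyone",
    "apocalypse lighting spokane wildfire",
]

# Number of whitespace-separated tokens in each cleaned tweet.
token_counts = [len(sentence.split()) for sentence in corpus]
longest = max(token_counts)
print(token_counts, longest)
```

Running the same count over `corpus_train` would show how many tweets a given `sent_length` truncates.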

In [None]:
## Defining the vocabulary size, sentence length, and embedding vector dimension
voc_size=18000
sent_length=30
embedding_vector_features=512

In [None]:
onehot_repr_train=[one_hot(words,voc_size) for words in corpus_train]
onehot_repr_train[1]

[5910, 2348, 11997, 16631, 11941, 12149, 12691]

In [None]:
corpus_train[1]

'forest fire near la ronge sask canada'

In [None]:
onehot_repr_test=[one_hot(words,voc_size) for words in corpus_test]
onehot_repr_test[1]

[13041, 9572, 4390, 7315, 3632, 6067, 14549]

In [None]:
corpus_test[1]

'heard earthquake different city stay safe everyone'
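
`one_hot` does not build a vocabulary. As described in the Keras documentation, it hashes each word into an index below `voc_size`, so distinct words can collide, and because it uses Python's built-in `hash`, the indices are not guaranteed to be the same across interpreter sessions. A minimal stdlib sketch of the same hashing idea with a stable hash (MD5), assuming whitespace tokenization; `stable_word_index` and `encode` are hypothetical helpers, not Keras functions:

```python
import hashlib

def stable_word_index(word, vocab_size):
    """Map a word to an index in [1, vocab_size - 1] with a session-independent hash."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (vocab_size - 1) + 1  # index 0 stays free for padding

def encode(sentence, vocab_size):
    """Encode a cleaned sentence as a list of hashed word indices."""
    return [stable_word_index(w, vocab_size) for w in sentence.split()]

indices = encode("forest fire near la ronge sask canada", 18000)
print(indices)
```

A stable mapping matters when the training and test sets are encoded in different sessions.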

In [None]:
embedded_doc_train=pad_sequences(onehot_repr_train,padding='pre',maxlen=sent_length)
embedded_doc_train[1]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,  5910,  2348, 11997, 16631,
       11941, 12149, 12691], dtype=int32)
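
With `padding='pre'`, zeros are added on the left until the sequence reaches `maxlen`; with the default truncation setting, sequences that are too long keep their last `maxlen` items. A minimal pure-Python sketch of that behaviour; `pad_pre` is a hypothetical helper, not the Keras implementation:

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`; if the sequence is too long, keep its last `maxlen` items."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

encoded = [5910, 2348, 11997, 16631, 11941, 12149, 12691]  # indices shown above
padded = pad_pre(encoded, maxlen=30)
print(padded)
```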

In [None]:
embedded_doc_test=pad_sequences(onehot_repr_test,padding='pre',maxlen=sent_length)

In [None]:
X=np.array(embedded_doc_train)
Y=np.array(df_train['target'])

X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.40, random_state=0)

X_test=np.array(embedded_doc_test)

In [None]:
X_train.shape

(4567, 30)
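
The 4567 training rows follow from `test_size=0.40`: scikit-learn rounds the held-out count up and gives the remaining rows to training. A quick check of that arithmetic (a sketch of the size calculation only, not of the shuffling):

```python
import math

n_samples = 7613   # rows in train.csv
test_size = 0.40   # fraction passed to train_test_split

n_val = math.ceil(test_size * n_samples)  # held-out rows, rounded up
n_train = n_samples - n_val               # everything else is used for training
print(n_train, n_val)
```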

In [None]:
X_test.shape

(3263, 30)

In [None]:
## Build the word embedding model
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model.add(Reshape((sent_length * embedding_vector_features,)))
model.compile('adam','mse')

In [None]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 30, 512)           9216000   
                                                                 
 reshape_4 (Reshape)         (None, 15360)             0         
                                                                 
Total params: 9216000 (35.16 MB)
Trainable params: 9216000 (35.16 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
X_train=model.predict(X_train)
X_train.shape



(4567, 15360)

In [None]:
X_train[0]

array([ 0.04543694, -0.04390713,  0.03865236, ...,  0.03854943,
        0.03323985,  0.0237692 ], dtype=float32)
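
Because the model is compiled but never fitted, `model.predict` only looks up each word index in the randomly initialized embedding table and concatenates the rows. A minimal pure-Python sketch of that lookup-and-flatten step, with tiny sizes and a uniform initializer in [-0.05, 0.05] assumed for illustration:

```python
import random

vocab_size, embed_dim = 10, 4   # toy sizes, far smaller than the notebook's
rng = random.Random(0)

# One random row per word index; the rows never change because no training happens.
table = [[rng.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

def embed_and_flatten(indices):
    """Look up each index and concatenate the rows into one flat feature vector."""
    return [value for idx in indices for value in table[idx]]

features = embed_and_flatten([0, 7, 2])
print(len(features))  # sequence length * embed_dim
```

Two tweets that share a word therefore share the same random sub-vector, but only when the word sits at the same position in the padded sequence.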

In [None]:
X_val=model.predict(X_val)
X_val.shape



(3046, 15360)

In [None]:
X_test=model.predict(X_test)
X_test.shape



(3263, 15360)

In [None]:
## Standardizing the data and reducing its dimensionality

scaler=StandardScaler()
pca=PCA(n_components=300)  ## 300 components, matching the output shapes shown below

In [None]:
X_train=scaler.fit_transform(X_train)
X_val=scaler.transform(X_val)
X_test=scaler.transform(X_test)      ## Reuse the scaler fitted on the training data

In [None]:
X_train=pca.fit_transform(X_train)
X_val=pca.transform(X_val)
X_test=pca.transform(X_test)         ## Reuse the PCA fitted on the training data

In [None]:
X_train.shape,X_test.shape

((4567, 300), (3263, 300))
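
Scaling and PCA parameters should be learned from the training split alone and then reused for the validation and test splits, so that all three end up in the same feature space. A minimal single-feature sketch of that fit/transform separation, using the population standard deviation as `StandardScaler` does; the `SimpleScaler` class and the numbers are hypothetical:

```python
import statistics

class SimpleScaler:
    """Standardize one feature with statistics learned in fit()."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values) or 1.0  # avoid division by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
test_values = [3.0, 6.0]

toy_scaler = SimpleScaler().fit(train_values)     # statistics come from train only
train_scaled = toy_scaler.transform(train_values)
test_scaled = toy_scaler.transform(test_values)   # reuse train mean/std, no refit
print(test_scaled)
```

Refitting on the test split would instead centre the test data on its own mean, so the same raw value would map to different features in the two splits.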

#### Building, Training and Evaluating the Model

In [None]:
models ={
    'Logistic Regression': LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'Random Forest' : RandomForestClassifier(),
    'AdaBoost' : AdaBoostClassifier(),
    'SVM' : SVC(),
    'Gradient Boosting' : GradientBoostingClassifier(),
    'Decision Tree' : DecisionTreeClassifier(),
    'XGBoost' : XGBClassifier(),
    'CatBoost' : CatBoostClassifier()
}

In [None]:
parameter = {
    'Logistic Regression': {
        'C':10.0**np.arange(-2,3),
        'penalty':['l1', 'l2', 'elasticnet', None],
        'solver':['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],
        'max_iter':np.arange(1000,6000,1000)
        },
    'KNN': {
            'n_neighbors':[2,3,4,5,6,7,8,9,10,11],
             'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute'],
                },
    'Random Forest' : {
                    'criterion':['gini', 'entropy', 'log_loss'],
                    'max_features':['sqrt','log2',None],
                    'n_estimators': [8,16,32,64,128,256]

                },
    'AdaBoost' :{
                'learning_rate':[.1,.01,0.5,.001],
                'algorithm': ['SAMME','SAMME.R'],
                'n_estimators': [8,16,32,64,128,256]
                },
    'SVM' : {
        'kernel':['linear', 'poly', 'rbf', 'sigmoid']
    },
    'Gradient Boosting' : {
                          'loss': ['log_loss', 'exponential'],
                          'learning_rate':[.1,.01,.05,.001],
                          'criterion':['squared_error', 'friedman_mse'],
                          'max_features':['sqrt','log2'],
                          'n_estimators': [8,16,32,64,128,256]
                },
    'Decision Tree' : {
                  'criterion':['gini', 'entropy', 'log_loss'],
                  'splitter':['best','random'],
                  'max_features':['auto','sqrt','log2'],
                },
    'XGBoost' : {
                    'learning_rate':[.1,.01,.05,.001],
                    'n_estimators': [8,16,32,64,128,256]
                },
    'CatBoost' : {
                    'depth': [6,8,10],
                    'learning_rate': [0.01, 0.05, 0.1],
                    'iterations': [30, 50, 100]
                }
}
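
Exhaustive grid search fits one model per parameter combination per fold, so the cost grows multiplicatively. A small sketch that counts the candidates for the logistic regression grid above under 5-fold cross-validation; the list lengths are copied from that grid, and combinations that a solver rejects are still counted because the grid enumerates them:

```python
import math

logreg_grid_sizes = {
    "C": 5,          # 10.0 ** np.arange(-2, 3)
    "penalty": 4,
    "solver": 6,
    "max_iter": 5,   # np.arange(1000, 6000, 1000)
}
n_folds = 5

n_candidates = math.prod(logreg_grid_sizes.values())
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```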

In [None]:
def evaluate_models(X_train, y_train,X_test,y_test,models,param):         ## Tune each model with grid search, then score it
  report = {}

  for name, model in list(models.items())[:-1]:         ## Skips the last model in the dict (CatBoost)
    para=param[name]
    print(model,para)

    cv=KFold(n_splits=5, random_state=None, shuffle=False)
    gs = GridSearchCV(model,para,cv=cv)
    gs.fit(X_train,y_train)

    model.set_params(**gs.best_params_)                 ## Refit the model with the best parameters found
    model.fit(X_train,y_train)

    y_train_pred = model.predict(X_train)

    y_test_pred = model.predict(X_test)

    train_model_score = accuracy_score(y_train, y_train_pred)

    test_model_score = accuracy_score(y_test, y_test_pred)
    print('Train Accuracy:',train_model_score)
    print('Test Accuracy:',test_model_score)

    report[name] = test_model_score

  return report
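
With `shuffle=False`, `KFold` cuts the rows into consecutive blocks and uses each block once as the held-out fold. A pure-Python sketch of that splitting rule, in which the first `n_samples % n_splits` folds take one extra row; `kfold_indices` is a hypothetical helper, not the scikit-learn implementation:

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=12, n_splits=5))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)
```

Since the rows were already shuffled by `train_test_split`, consecutive folds are a reasonable choice here.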

In [None]:
def evaluate_models(X_train, y_train,X_test,y_test,models,param):         ## Baseline without tuning; this definition replaces the one above
  report = {}

  for name, model in list(models.items())[:-1]:         ## Skips the last model in the dict (CatBoost)
    print(model)

    model.fit(X_train,y_train)                          ## Default hyperparameters; param is unused

    y_train_pred = model.predict(X_train)

    y_test_pred = model.predict(X_test)

    train_model_score = accuracy_score(y_train, y_train_pred)

    test_model_score = accuracy_score(y_test, y_test_pred)
    print('Train Accuracy:',train_model_score)
    print('Test Accuracy:',test_model_score)

    report[name] = test_model_score

  return report

In [None]:
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

In [None]:
model_report:dict=evaluate_models(X_train=X_train,y_train=y_train,X_test=X_val,y_test=y_val,
                                             models=models,param=parameter)

LogisticRegression()
Train Accuracy: 0.9402233413619444
Test Accuracy: 0.6943532501641497
KNeighborsClassifier()
Train Accuracy: 0.7845412743595358
Test Accuracy: 0.6671043992120814
RandomForestClassifier()
Train Accuracy: 0.9975914166849135
Test Accuracy: 0.6808929743926461
AdaBoostClassifier()
Train Accuracy: 0.7359316838186993
Test Accuracy: 0.6483913328956008
SVC()
Train Accuracy: 0.8679658419093497
Test Accuracy: 0.6996060407091267
GradientBoostingClassifier()
Train Accuracy: 0.8887672432669148
Test Accuracy: 0.6733420879842417
DecisionTreeClassifier()
Train Accuracy: 0.9975914166849135
Test Accuracy: 0.6037426132632961
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interacti

In [None]:
## Get the name of the model with the highest validation accuracy
best_model_name = max(model_report, key=model_report.get)
best_model_score = model_report[best_model_name]

best_model = models[best_model_name]

print(best_model_name)

SVM


In [None]:
y_pred=best_model.predict(X_val) ## Predicting the validation data

In [None]:
confusion_matrix(y_val,y_pred)  ## Confusion matrix for validation data

array([[818,  68],
       [215, 422]])

In [None]:
accuracy_score(y_val,y_pred)

0.814182534471438

In [None]:
print(classification_report(y_val,y_pred))

              precision    recall  f1-score   support

           0       0.79      0.92      0.85       886
           1       0.86      0.66      0.75       637

    accuracy                           0.81      1523
   macro avg       0.83      0.79      0.80      1523
weighted avg       0.82      0.81      0.81      1523
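
The accuracy and the class-1 row of the report can be recomputed by hand from the confusion matrix above, where rows are true labels and columns are predicted labels. A short worked check:

```python
# Confusion matrix values from the cell above.
tn, fp = 818, 68     # true class 0: predicted 0, predicted 1
fn, tp = 215, 422    # true class 1: predicted 0, predicted 1

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision_1 = tp / (tp + fp)   # of the tweets flagged as disasters, how many were real
recall_1 = tp / (tp + fn)      # of the real disaster tweets, how many were flagged
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The lower recall for class 1 reflects the 215 real disaster tweets that were predicted as class 0.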



#### Generate the submission file

For each tweet in the test set, we predict whether it is about a real disaster: 1 if it is, 0 if it is not.

The submission.csv file uses the following format: id,target
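
A minimal stdlib sketch of that two-column layout, written to an in-memory buffer. The ids are the first three from `test.csv` shown earlier; the labels are made up for illustration and are not model predictions:

```python
import csv
import io

ids = [0, 2, 3]           # first ids from test.csv
predictions = [1, 1, 0]   # hypothetical labels, for illustration only

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "target"])        # header row expected by the competition
writer.writerows(zip(ids, predictions))  # one row per test tweet

submission_text = buffer.getvalue()
print(submission_text)
```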

In [None]:
sample_submission = pd.read_csv("/content/nlp-getting-started/sample_submission.csv")
sample_submission.head()

In [None]:
sample_submission["target"] = best_model.predict(X_test)

In [None]:
sample_submission.describe()

In [None]:
sample_submission.to_csv("submission.csv", index=False)