## Disaster detection from tweets
This project follows a Kaggle tutorial on deep learning for NLP. The aim is to predict, from the content of a tweet, whether the tweet is about an actual disaster or not.

[Tutorial link](https://www.kaggle.com/philculliton/nlp-getting-started-tutorial)

[LSA tutorial](http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/)

[Count Vectorizer, TF-IDF vectorizer](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

In [101]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

import sklearn
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

import tensorflow as tf

import warnings
warnings.filterwarnings("ignore")

In [41]:
# Read the data and look at one example of each class
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print('Non-disaster tweet: ', end='')
print(train_df[train_df["target"] == 0]["text"].values[1])

print('Disaster tweet: ', end='')
print(train_df[train_df["target"] == 1]["text"].values[1])

Non-disaster tweet: I love fruits
Disaster tweet: Forest fire near La Ronge Sask. Canada


### Get statistics of the dataset
* How many training examples are there in each category?

In [43]:
disaster = (train_df['target'] == 1).sum()
no_disaster = (train_df['target'] == 0).sum()
total = disaster + no_disaster

print('===== Training dataset ======')
print(f'Disaster tweets = {disaster}, no disaster tweets = {no_disaster} out of {total} tweets')
# Accuracy obtained by always predicting the more frequent class
print(f'Majority-class baseline accuracy: {np.max((disaster, no_disaster))/total*100:0.2f}')

# f1-score obtained by labelling every tweet as a disaster tweet
y_true = train_df['target']
y_pred = np.ones_like(y_true)
print('\nIf we just predicted every tweet as being about a disaster, the f1-score is:', end=" ")
print(f'{sklearn.metrics.f1_score(y_true, y_pred)*100:0.2f}')

print('So we should do better than this....')

Disaster tweets = 3271, no disaster tweets = 4342 out of 7613 tweets
Majority-class baseline accuracy: 57.03

If we just predicted every tweet as being about a disaster, the f1-score is: 60.11
So we should do better than this....
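
As a sanity check, both baseline numbers can be reproduced by hand from the class counts alone. The sketch below assumes only the counts printed above. When every tweet is labelled as a disaster, recall is 1 and precision equals the fraction of disaster tweets.

```python
# Class counts taken from the training-set statistics printed above
disaster, no_disaster = 3271, 4342
total = disaster + no_disaster

# Always predicting the more frequent class (non-disaster)
majority_accuracy = max(disaster, no_disaster) / total

# Always predicting "disaster": every real disaster is found (recall = 1)
# and precision is the fraction of tweets that really are disasters
precision = disaster / total
recall = 1.0
f1_all_positive = 2 * precision * recall / (precision + recall)

print(f"Majority-class accuracy: {majority_accuracy * 100:0.2f}")
print(f"All-positive f1-score:   {f1_all_positive * 100:0.2f}")
```

Since the models below are scored with f1, the all-positive f1-score is the more relevant number to beat.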


### Building vectors
Let's first make a basic classifier that uses the words contained in a tweet to determine whether the tweet is about a real disaster or not.

Below I report f1-scores for the following methods:

* Using scikit-learn's CountVectorizer to count the number of occurrences of every word and then classifying with a ridge classifier (`RidgeClassifier`)
* Using the TF-IDF Vectorizer
* Using Latent Semantic Analysis (LSA)
* Using a Neural Network
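
To make the word-count representation concrete, here is a minimal sketch on two made-up sentences (the `toy_tweets` list is purely illustrative and not part of the dataset). With its default settings, `CountVectorizer` lowercases the text, drops single-character tokens such as "I", and orders the vocabulary alphabetically.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, used only for illustration
toy_tweets = ["fire fire near the forest", "I love the forest"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_tweets)

# Vocabulary sorted by column index, then the dense count matrix
toy_vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocabulary)
print(toy_counts.toarray())
```

Each row is one tweet and each column is one word of the vocabulary, so the first row has a 2 in the "fire" column.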


In [67]:
df_scores = pd.DataFrame(columns=['Method', 'f1'])  # collects the mean f1-score of each method
clf = linear_model.RidgeClassifier()
# clf = linear_model.Lasso(alpha=1e-6); 

##### 1) Count Vectorizer + Ridge Classifier

In [72]:
count_vectorizer = feature_extraction.text.CountVectorizer()
train_vectors = count_vectorizer.fit_transform(train_df["text"])
test_vectors = count_vectorizer.transform(test_df["text"])
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"].astype(int), cv=3, scoring="f1")
scores

# Record the mean cross-validated f1-score
df_scores = pd.concat([df_scores, pd.DataFrame([{'Method': 'CountVectorizer+Ridge', 'f1': np.mean(scores)}])], ignore_index=True)

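This is not much better than the 57% accuracy obtained by always predicting the majority class. Since the score here is an f1-score, the fairer comparison is the f1-score of about 60% obtained by labelling every tweet as a disaster.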

##### 2) TF-IDF Vectorizer + Ridge Classifier
Count Vectorizer weighs each word equally, but some words occur often without providing much information about the content, e.g., 'a', 'the', etc. To account for that, we can use a different vectorizer called TF-IDF, where each term's frequency $(tf)$ is multiplied by its inverse document frequency $(idf)$. If a word occurs in a lot of documents (here, tweets), then its idf is low, and vice versa.

In short, the fewer tweets a word appears in, the more weight each of its occurrences gets.
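
A minimal sketch of this effect on a made-up corpus (the `toy_docs` list is illustrative only). It reads the learned idf weights from `TfidfVectorizer`, using scikit-learn's default smoothed idf. "the" appears in all three documents, "fire" in two and "flood" in one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, used only for illustration
toy_docs = [
    "the fire is spreading",
    "the flood is rising",
    "the concert was fire",
]

toy_tfidf = TfidfVectorizer()
toy_tfidf.fit(toy_docs)

# Map every term to its learned idf weight
idf = {term: toy_tfidf.idf_[col] for term, col in toy_tfidf.vocabulary_.items()}

for term in ["the", "fire", "flood"]:
    print(f"{term:>6}: idf = {idf[term]:0.3f}")
```

The word that occurs in every document gets the lowest idf, and the word that occurs in a single document gets the highest.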

In [73]:
vectorizer = feature_extraction.text.TfidfVectorizer()
train_vectors_tfidf = vectorizer.fit_transform(train_df["text"])
scores = model_selection.cross_val_score(clf, train_vectors_tfidf, train_df["target"].astype(int), cv=3, scoring="f1")
scores
# Record the mean cross-validated f1-score
df_scores = pd.concat([df_scores, pd.DataFrame([{'Method': 'TfidfVectorizer+Ridge', 'f1': np.mean(scores)}])], ignore_index=True)

##### 3) Latent Semantic Analysis (LSA)
LSA is essentially TF-IDF followed by dimensionality reduction with a truncated SVD. The SVD keeps the directions along which the TF-IDF vectors vary the most (here, the top 100 components) and discards the remaining low-variance directions.

Interestingly, the performance is lower with LSA, suggesting that the discarded low-variance directions carry information that is important for classification.
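
One way to see how much information the reduction throws away is to look at the variance retained by the SVD components. The sketch below runs on a made-up corpus with only 2 components (both are assumptions for illustration). It chains TF-IDF, `TruncatedSVD` and a `Normalizer` in a pipeline, as the LSA tutorial linked at the top does; the notebook cell below applies the SVD without the normalization step.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Made-up documents, used only for illustration
toy_corpus = [
    "forest fire spreading near the town",
    "fire crews evacuate the town",
    "flood water rising near the river",
    "river flood warning issued",
    "I love summer fruits",
    "fresh fruits at the market",
]

# TF-IDF -> truncated SVD -> unit-length rows
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
toy_lsa = lsa.fit_transform(toy_corpus)

# Fraction of the TF-IDF variance kept by the 2 components
retained = lsa.named_steps["truncatedsvd"].explained_variance_ratio_.sum()
print(toy_lsa.shape)
print(f"Variance retained: {retained * 100:0.1f}%")
```

Running the same check on the fitted `svd` object from the cell below (`svd.explained_variance_ratio_.sum()`) would show what share of the variance the 100 components keep on the real tweets.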

In [80]:
svd = TruncatedSVD(n_components=100)  # keep the top 100 SVD components
train_vectors_tfidf_svd = svd.fit_transform(train_vectors_tfidf)
scores = model_selection.cross_val_score(clf, train_vectors_tfidf_svd, train_df["target"].astype(int), cv=3, scoring="f1")
scores

# Record the mean cross-validated f1-score
df_scores = pd.concat([df_scores, pd.DataFrame([{'Method': 'LSA+Ridge', 'f1': np.mean(scores)}])], ignore_index=True)



##### 4) Simple Neural Network

In [None]:

def create_model():
    model = Sequential()
    # Input size = vocabulary size of the count vectors (21637 for this dataset)
    model.add(Dense(12, input_dim=train_vectors.shape[1], activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

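# Fix the random seeds for reproducibility
seed = 7
np.random.seed(seed)
tf.random.set_seed(seed)  # also seed TensorFlow (assumes the TensorFlow 2.x API)
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# Reorder the samples by target and convert the sparse count matrix to a dense array for Keras
idx_sort_tgt = np.argsort(train_df['target'])
sort_values = train_df['target'][idx_sort_tgt].values
sort_train = train_vectors[idx_sort_tgt,:].toarray()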

scores = model_selection.cross_val_score(model, sort_train, sort_values, cv=3,scoring="f1")
scores

# Record the mean cross-validated f1-score
df_scores = pd.concat([df_scores, pd.DataFrame([{'Method': 'MLP', 'f1': np.mean(scores)}])], ignore_index=True)


In [None]:
df_scores