## Disaster Tweets Classification

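The goal of this notebook is to classify whether a tweet describes a real disaster event (`target` = 1) or not (`target` = 0), using the tweet's `text` and `keyword`.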

### Importing Libraries

In [1]:
import os
import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import RFE, RFECV
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Ridge, RidgeCV
from sklearn.metrics import make_scorer
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    ShuffleSplit,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    PolynomialFeatures,
    StandardScaler,
)
from sklearn.svm import SVC, SVR

%matplotlib inline

### Data Preprocessing

In [2]:
df = pd.read_csv("tweets.csv", usecols=["keyword", "text", "target", "location"])
train_df, test_df = train_test_split(df, test_size=0.2, random_state=2)
train_df.head(10)

| | keyword | location | text | target |
|---|---|---|---|---|
| 3289 | debris | NaN | Unfortunately, both plans fail as the 3 are im... | 0 |
| 2672 | crash | SLC | I hope this causes Bernie to crash and bern. S... | 0 |
| 2436 | collide | NaN | —pushes himself up from the chair beneath to r... | 0 |
| 9622 | suicide%20bomb | NaN | Widow of CIA agent killed in 2009 Afghanistan ... | 1 |
| 8999 | screaming | Azania | As soon as God say yes they'll be screaming we... | 0 |
| 9895 | survived | NaN | you've no idea the suffering and horrors that ... | 0 |
| 7294 | mass%20murder | United States | Oh wait, lets' not forget Anders Brevik, that ... | 1 |
| 30 | ablaze | NaN | Marivan, Kurdistan Province Monday, Jan 13th, ... | 1 |
| 2713 | crashed | Amphoe Mueang Nakhon Ratchasim | imagine: 15x09 airs. dean and cas share a kiss... | 0 |
| 9385 | snowstorm | NaN | 494. On account of the snowstorm, all the trai... | 1 |


In [3]:
X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]
X_test, y_test = test_df.drop(columns=["target"]), test_df["target"]

In [4]:
train_df['target'].value_counts()

0    7395
1    1701
Name: target, dtype: int64

As we can see above, only 1701 of the 9096 training examples are actual disaster tweets. Because of this class imbalance, accuracy alone would be misleading, so we use scoring metrics that focus on how well the model captures the positive label (the tweet is about a real disaster event).
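To make the imbalance concrete, the short sketch below reuses the counts printed by `value_counts()` and computes the accuracy a model would reach by always predicting the majority class. The variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the value_counts() output above
n_negative = 7395  # target == 0
n_positive = 1701  # target == 1
n_total = n_negative + n_positive

# Share of real-disaster tweets in the training split
positive_rate = n_positive / n_total

# Accuracy of a model that always predicts 0, even though it
# never identifies a single disaster tweet
majority_accuracy = n_negative / n_total

print(f"positive rate: {positive_rate:.3f}")
print(f"always-predict-0 accuracy: {majority_accuracy:.3f}")
```

A model can therefore score above 80% accuracy without detecting any disaster tweet, which is why accuracy is not used for evaluation here.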

In [5]:
# Scoring metric to evaluate all the models

scoring = ['precision', 'f1', 'recall', 'roc_auc']

Because of the significant class imbalance, `accuracy` is not a meaningful scoring metric, so we use other metrics to evaluate our models. For our use case, we want to minimise false negatives, since we don't want to classify an actual disaster tweet as a non-disaster tweet. `recall` is therefore a suitable metric, as a higher recall means fewer false negatives. However, we don't want to sacrifice `precision` while increasing recall, because flagging non-disaster tweets as disasters (false positives) is also undesirable. A better single metric is therefore `f1`, which balances precision and recall. We also report the `roc_auc` score to show how well the model distinguishes between the two classes.
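As a minimal sketch of how these metrics relate, the example below computes precision, recall, and F1 on a handful of made-up labels (the `y_true` and `y_pred` values are purely illustrative, not taken from the tweets data).

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (made up for illustration): 1 = real disaster, 0 = not
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Here TP = 2, FN = 2, FP = 1
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The two missed disaster tweets lower recall, while the single false alarm lowers precision; F1 penalises both.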

In [6]:
# 'Location' feature

train_df['location'].describe(include = 'all')

count              6370
unique             3746
top       United States
freq                 80
Name: location, dtype: object

In [10]:
train_df['location'].head(50)

3289                                NaN
2672                                SLC
2436                                NaN
9622                                NaN
8999                             Azania
9895                                NaN
7294                      United States
30                                  NaN
2713     Amphoe Mueang Nakhon Ratchasim
9385                                NaN
4355                                NaN
8                          Accra, Ghana
8219                     Lagos, Nigeria
6774                                NaN
9608                   Rohnert Park, CA
4381                           Brighton
1927        Hell,Hades,Mictlan,Tartarus
8310                     Pittsburgh, PA
391                       Mumbai, India
3774                                NaN
8280                      New York City
5331                          FT. Myers
8935                               Hell
10621                       Quezon City
240                       Osun, Nigeria


As we can see above, the `location` column is challenging to use because it is quite messy. It contains missing values, emoji (flags), text in different languages, unrelated information (e.g., "she/her"), and free-text comments (e.g., "I dont know where i am. Help."). Here are a few reasons why we do not use the `location` feature for model training:

1) The `location` column has many null values (NaN), which would have to be handled.
2) Most `location` values are not in a consistent format (they include special characters and emoji).
3) Countries and cities are mixed together, with no standardization.
4) Some values are not locations at all and cannot be used.
5) There are 3746 unique `location` values, so applying a transformation such as one-hot encoding to this column would be very expensive and inefficient.
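The points above can be checked with a few pandas calls. The sketch below uses a small made-up Series as a stand-in for the `location` column; on the real training split, 6370 of 9096 values are non-null (roughly 30% missing), and one-hot encoding would produce one column per unique value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `location` column (values are illustrative)
location = pd.Series(
    ["United States", np.nan, "Lagos, Nigeria", "Hell", np.nan, "New York City"]
)

missing_share = location.isna().mean()  # fraction of NaN entries
n_unique = location.nunique()           # distinct non-null values

# One-hot encoding creates one column per distinct non-null value
one_hot = pd.get_dummies(location)

print(f"missing share: {missing_share:.2f}")
print(f"unique values: {n_unique}")
print(f"one-hot columns: {one_hot.shape[1]}")
```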

### Identifying features and building Transformer

In [11]:
print(train_df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9096 entries, 3289 to 7336
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   keyword   9096 non-null   object
 1   location  6370 non-null   object
 2   text      9096 non-null   object
 3   target    9096 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 355.3+ KB
None


In [12]:
print(train_df["keyword"].value_counts())

thunderstorm    74
flattened       74
sirens          73
drown           71
stretcher       71
                ..
blown%20up      11
siren           10
rainstorm       10
deluged          7
tsunami          6
Name: keyword, Length: 219, dtype: int64


In [13]:
# Creating column transformer
categorical_features = ["keyword"]
drop_features = ["location"]
text_feature = "text"
target = "target"

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),
    (CountVectorizer(stop_words="english", lowercase=False), text_feature),
    ("drop", drop_features),
)

We drop the `location` feature because of its missing values, lack of standardization, and unreliable content. We apply `CountVectorizer` to the `text` column to convert each tweet into a numeric bag-of-words vector that the model can work with. We apply one-hot encoding to the `keyword` column so that the model can also use the disaster-related keywords when making predictions.
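To illustrate what this kind of transformer produces, here is a minimal sketch on a three-row toy DataFrame with the same column names. The rows and the `toy_preprocessor` name are made up for illustration, and `CountVectorizer` is used with its default settings rather than the options chosen above.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy data with the same column names as the tweets (rows are made up)
toy = pd.DataFrame(
    {
        "keyword": ["fire", "flood", "fire"],
        "location": ["Somewhere", None, "Elsewhere"],
        "text": [
            "forest fire near town",
            "flood warning issued",
            "this mixtape is fire",
        ],
    }
)

toy_preprocessor = make_column_transformer(
    # A list of columns passes 2D input, as OneHotEncoder expects
    (OneHotEncoder(handle_unknown="ignore"), ["keyword"]),
    # A single column name passes 1D input, as CountVectorizer expects
    (CountVectorizer(), "text"),
    ("drop", ["location"]),
)

X_toy = toy_preprocessor.fit_transform(toy)

# One column per keyword category plus one per vocabulary word
print(X_toy.shape)
```

Note that the text column is given as a plain string while the categorical column is given as a list; this mirrors how `text_feature` and `categorical_features` are defined in the cell above.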

### Model Training

In [16]:
results = {}

In [17]:
# Function to report mean cross validation scores for different models

def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation scores

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data
    **kwargs :
        extra keyword arguments passed on to cross_validate

    Returns
    ----------
        pandas Series of "mean (+/- std)" strings, one per cross_validate output
    """

    scores = pd.DataFrame(cross_validate(model, X_train, y_train, **kwargs))

    mean_scores = scores.mean()
    std_scores = scores.std()

    # Format each entry as "mean (+/- std)"; zip avoids positional
    # indexing into a label-indexed Series
    out_col = [
        "%0.3f (+/- %0.3f)" % (mean, std)
        for mean, std in zip(mean_scores, std_scores)
    ]

    return pd.Series(data=out_col, index=mean_scores.index)

#### Dummy Classifier (Baseline)

In [19]:
dummy = DummyClassifier(strategy="stratified")
results["dummy"] = mean_std_cross_val_scores(
    dummy, X_train, y_train, return_train_score=True, scoring=scoring
)
pd.DataFrame(results)

| | dummy |
|---|---|
| fit_time | 0.001 (+/- 0.001) |
| score_time | 0.004 (+/- 0.002) |
| test_precision | 0.189 (+/- 0.022) |
| train_precision | 0.190 (+/- 0.010) |
| test_f1 | 0.190 (+/- 0.019) |
| train_f1 | 0.188 (+/- 0.009) |
| test_recall | 0.192 (+/- 0.018) |
| train_recall | 0.187 (+/- 0.009) |
| test_roc_auc | 0.501 (+/- 0.015) |
| train_roc_auc | 0.496 (+/- 0.005) |


#### Logistic Regression

In [None]:
pipe_lr = make_pipeline(preprocessor, LogisticRegression(max_iter=2000))
results["logistic regression"] = mean_std_cross_val_scores(
    pipe_lr, X_train, y_train, return_train_score=True, scoring=scoring
)
pd.DataFrame(results)