## Disaster Tweets Classification

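The goal is to classify tweets as describing a real disaster event (`target` = 1) or not (`target` = 0), using the tweet `text` together with its `keyword` and `location` metadata.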

### Importing Libraries

In [1]:
import os
import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import RFE, RFECV
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Ridge, RidgeCV
from sklearn.metrics import make_scorer
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    ShuffleSplit,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    PolynomialFeatures,
    StandardScaler,
)
from sklearn.svm import SVC, SVR

%matplotlib inline

### Data Preprocessing

In [2]:
df = pd.read_csv("tweets.csv", usecols=["keyword", "text", "target", "location"])
train_df, test_df = train_test_split(df, test_size=0.2, random_state=2)
train_df.head(10)

| | keyword | location | text | target |
|---|---|---|---|---|
| 3289 | debris | NaN | Unfortunately, both plans fail as the 3 are im... | 0 |
| 2672 | crash | SLC | I hope this causes Bernie to crash and bern. S... | 0 |
| 2436 | collide | NaN | —pushes himself up from the chair beneath to r... | 0 |
| 9622 | suicide%20bomb | NaN | Widow of CIA agent killed in 2009 Afghanistan ... | 1 |
| 8999 | screaming | Azania | As soon as God say yes they'll be screaming we... | 0 |
| 9895 | survived | NaN | you've no idea the suffering and horrors that ... | 0 |
| 7294 | mass%20murder | United States | Oh wait, lets' not forget Anders Brevik, that ... | 1 |
| 30 | ablaze | NaN | Marivan, Kurdistan Province Monday, Jan 13th, ... | 1 |
| 2713 | crashed | Amphoe Mueang Nakhon Ratchasim | imagine: 15x09 airs. dean and cas share a kiss... | 0 |
| 9385 | snowstorm | NaN | 494. On account of the snowstorm, all the trai... | 1 |


In [3]:
X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]
X_test, y_test = test_df.drop(columns=["target"]), test_df["target"]

In [4]:
train_df['target'].value_counts()

0    7395
1    1701
Name: target, dtype: int64

As shown above, only 1701 of the 9096 training examples (about 19%) are actual disaster tweets. Because of this class imbalance, accuracy is a poor guide; we instead use scoring metrics that focus on how well the model captures the positive label (the tweet describes a real disaster event).

In [5]:
# Scoring metric to evaluate all the models

scoring = ['precision', 'f1', 'recall', 'roc_auc']

Because of the significant class imbalance, `accuracy` is not a sensible scoring metric, so we evaluate the models with other metrics. For our use case we want to minimise `False Negatives`, since we don't want to classify an actual disaster tweet as a non-disaster tweet. `recall` is therefore a natural choice, as a higher recall means fewer `False Negatives`. However, we don't want to sacrifice `precision` to gain recall, because a model that raises many `False Positives` is not useful either. A better single metric is `f1`, which balances the two. We also report `roc_auc` to show how well the model distinguishes between the two classes.
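To make the argument concrete, here is a minimal sketch using the class counts printed above (7395 negatives, 1701 positives). The helper `precision_recall_f1` and the "always predict 0" classifier are illustrative assumptions, not part of the notebook's pipeline.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Class counts taken from train_df['target'].value_counts() above
n_neg, n_pos = 7395, 1701

# Hypothetical classifier that labels every tweet as non-disaster
tp, fp, fn, tn = 0, 0, n_pos, n_neg

accuracy = (tp + tn) / (n_neg + n_pos)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Accuracy comes out at roughly 0.81 even though not a single disaster tweet is detected, which is why recall and f1 are more informative here.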

In [6]:
# 'Location' feature

train_df['location'].describe(include = 'all')

count              6370
unique             3746
top       United States
freq                 80
Name: location, dtype: object

In [10]:
train_df['location'].head(50)

3289                                NaN
2672                                SLC
2436                                NaN
9622                                NaN
8999                             Azania
9895                                NaN
7294                      United States
30                                  NaN
2713     Amphoe Mueang Nakhon Ratchasim
9385                                NaN
4355                                NaN
8                          Accra, Ghana
8219                     Lagos, Nigeria
6774                                NaN
9608                   Rohnert Park, CA
4381                           Brighton
1927        Hell,Hades,Mictlan,Tartarus
8310                     Pittsburgh, PA
391                       Mumbai, India
3774                                NaN
8280                      New York City
5331                          FT. Myers
8935                               Hell
10621                       Quezon City
240                       Osun, Nigeria


As we can see above, the `location` column is quite messy: it contains missing values, emoticons (flags), different languages, unrelated information (she/her), and free-text comments (e.g., "I dont know where i am. Help."). Here are a few reasons why it would not be a good idea to use the `location` feature for model training:

1) The `location` column has many null values (NaN), which would have to be handled.
2) Many `location` values are not in a usable format (they include special characters and emojis).
3) Countries and cities are mixed together, with no standardization.
4) Some values are not locations at all and cannot be used.
5) There are 3746 unique `location` values, so a transformation like one-hot encoding on this column would be very expensive and inefficient.
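The two quantitative points in the list (missing values and high cardinality) can be checked with a few lines. This is a small sketch on a hypothetical toy sample; `summarize_column` is an illustrative helper, not something the notebook defines.

```python
import math


def summarize_column(values):
    """Return (fraction of missing values, number of unique non-missing values)."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    present = [v for v in values if not is_missing(v)]
    missing_fraction = 1 - len(present) / len(values)
    return missing_fraction, len(set(present))


# Toy sample (hypothetical), mimicking the kind of values seen above
sample = ["SLC", None, "Accra, Ghana", "Hell", None,
          "Mumbai, India", "Hell", float("nan")]
missing_fraction, n_unique = summarize_column(sample)
print(missing_fraction, n_unique)

# The same missing fraction from the counts reported by describe() and info()
train_missing_fraction = 1 - 6370 / 9096
print(round(train_missing_fraction, 3))
```

With the counts from the notebook output, about 30% of the training rows have no location at all.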

### Identifying features and building Transformer

In [11]:
print(train_df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9096 entries, 3289 to 7336
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   keyword   9096 non-null   object
 1   location  6370 non-null   object
 2   text      9096 non-null   object
 3   target    9096 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 355.3+ KB
None


In [12]:
print(train_df["keyword"].value_counts())

thunderstorm    74
flattened       74
sirens          73
drown           71
stretcher       71
                ..
blown%20up      11
siren           10
rainstorm       10
deluged          7
tsunami          6
Name: keyword, Length: 219, dtype: int64


In [13]:
# Creating column transformer
categorical_features = ["keyword"]
drop_features = ["location"]
text_feature = "text"
target = "target"

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),
    (CountVectorizer(stop_words="english", lowercase=False), text_feature),
    ("drop", drop_features),
)

We drop the `location` feature because of its NULL values, its non-standardized values, and the unreliability of its contents. We apply the `CountVectorizer` transformation to the `text` column to convert the text into numeric count vectors that the model can work with. We apply one-hot encoding to the `keyword` column so that the model can also use the prominent disaster-related keywords in its predictions.
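As a rough illustration of what a count vectorizer produces, here is a simplified pure-Python stand-in. It is a sketch, not scikit-learn's implementation: it skips stop-word removal, the functions `tokenize`, `fit_vocabulary` and `transform` are hypothetical names, and the token pattern (words of two or more characters) is chosen to resemble `CountVectorizer`'s default.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into tokens of two or more word characters (case preserved)."""
    return re.findall(r"\b\w\w+\b", text)


def fit_vocabulary(texts):
    """Map each distinct token to a column index, in sorted order."""
    tokens = {tok for text in texts for tok in tokenize(text)}
    return {tok: i for i, tok in enumerate(sorted(tokens))}


def transform(texts, vocab):
    """Return one row of token counts per text, with columns ordered as in vocab."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


# Toy documents (hypothetical)
docs = ["fire near the fire station", "storm warning"]
vocab = fit_vocabulary(docs)
matrix = transform(docs, vocab)

print(list(vocab))
print(matrix)
```

Each tweet becomes a row of counts with one column per vocabulary word, which is why the fitted vocabulary size determines the width of the feature matrix.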

### Model Training

In [16]:
results = {}

In [17]:
# Function to report mean cross validation scores for different models

def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data

    Returns
    ----------
        pandas Series with mean scores from cross_validation
    """

    scores = cross_validate(model, X_train, y_train, **kwargs)

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()

    # Format each metric as "mean (+/- std)"; zip avoids position-based
    # indexing on a label-indexed Series
    out_col = [
        f"{mean:0.3f} (+/- {std:0.3f})"
        for mean, std in zip(mean_scores, std_scores)
    ]

    return pd.Series(data=out_col, index=mean_scores.index)

#### Dummy Classifier (Baseline)

In [19]:
dummy = DummyClassifier(strategy="stratified")
results["dummy"] = mean_std_cross_val_scores(
    dummy, X_train, y_train, return_train_score=True, scoring=scoring
)
pd.DataFrame(results)

| | dummy |
|---|---|
| fit_time | 0.001 (+/- 0.001) |
| score_time | 0.004 (+/- 0.002) |
| test_precision | 0.189 (+/- 0.022) |
| train_precision | 0.190 (+/- 0.010) |
| test_f1 | 0.190 (+/- 0.019) |
| train_f1 | 0.188 (+/- 0.009) |
| test_recall | 0.192 (+/- 0.018) |
| train_recall | 0.187 (+/- 0.009) |
| test_roc_auc | 0.501 (+/- 0.015) |
| train_roc_auc | 0.496 (+/- 0.005) |
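The baseline numbers above are close to the positive-class rate (1701 / 9096, about 0.187). A stratified dummy draws its predictions at random according to the class frequencies, independently of the input, so both precision and recall are expected to land near that rate. The simulation below is a minimal sketch of this reasoning; `simulate_stratified_dummy` is a hypothetical helper and does not call scikit-learn.

```python
import random


def simulate_stratified_dummy(pos_rate, n_samples, seed=0):
    """Simulate labels and independent random predictions; return (precision, recall)."""
    rng = random.Random(seed)
    tp = fp = fn = 0
    for _ in range(n_samples):
        y_true = rng.random() < pos_rate
        y_pred = rng.random() < pos_rate  # drawn independently of y_true
        if y_true and y_pred:
            tp += 1
        elif y_pred:
            fp += 1
        elif y_true:
            fn += 1
    return tp / (tp + fp), tp / (tp + fn)


# Positive-class rate from the training split above
pos_rate = 1701 / 9096
precision, recall = simulate_stratified_dummy(pos_rate, n_samples=200_000)

print(f"pos_rate={pos_rate:.3f} precision={precision:.3f} recall={recall:.3f}")
```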


#### Logistic Regression

In [21]:
pipe_lr = make_pipeline(preprocessor, LogisticRegression(max_iter=2000))
results["logistic regression"] = mean_std_cross_val_scores(
    pipe_lr, X_train, y_train, return_train_score=True, scoring=scoring
)
pd.DataFrame(results)

| | dummy | logistic regression |
|---|---|---|
| fit_time | 0.001 (+/- 0.001) | 0.297 (+/- 0.046) |
| score_time | 0.004 (+/- 0.002) | 0.047 (+/- 0.001) |
| test_precision | 0.189 (+/- 0.022) | 0.804 (+/- 0.017) |
| train_precision | 0.190 (+/- 0.010) | 0.997 (+/- 0.001) |
| test_f1 | 0.190 (+/- 0.019) | 0.616 (+/- 0.027) |
| train_f1 | 0.188 (+/- 0.009) | 0.971 (+/- 0.002) |
| test_recall | 0.192 (+/- 0.018) | 0.499 (+/- 0.030) |
| train_recall | 0.187 (+/- 0.009) | 0.946 (+/- 0.003) |
| test_roc_auc | 0.501 (+/- 0.015) | 0.890 (+/- 0.012) |
| train_roc_auc | 0.496 (+/- 0.005) | 0.999 (+/- 0.000) |


### Hyperparameter Optimization

In [22]:
# Accessing the Vocabulary of the count vectorizer

pipe_lr.fit(X_train, y_train)
vocab = (
    pipe_lr.named_steps["columntransformer"]
    .named_transformers_["countvectorizer"]
    .get_feature_names_out()
)
len(vocab)

28010

In [23]:
from scipy.stats import loguniform, randint

param_dist = {
    "columntransformer__countvectorizer__max_features": randint(5_000, len(vocab)),
    "logisticregression__C": loguniform(1e-3, 1e3),
    "logisticregression__class_weight": ["balanced", None],
}

random_search = RandomizedSearchCV(
    pipe_lr,
    param_distributions=param_dist,
    n_iter=100,
    verbose=1,
    n_jobs=-1,
    scoring="f1",
    random_state=123,
)
random_search.fit(X_train, y_train);

Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [24]:
random_search.best_params_

{'columntransformer__countvectorizer__max_features': 26323,
 'logisticregression__C': 0.38474818517522286,
 'logisticregression__class_weight': 'balanced'}

In [25]:
random_search.best_score_

0.6646678202469128

In [26]:
best_max_feats = random_search.best_params_[
    "columntransformer__countvectorizer__max_features"
]
best_c = random_search.best_params_["logisticregression__C"]
best_class_weight = random_search.best_params_["logisticregression__class_weight"]

In [28]:
results["logistic regression (optimized params)"] = mean_std_cross_val_scores(
    random_search.best_estimator_,
    X_train,
    y_train,
    return_train_score=True,
    scoring=scoring,
)
pd.DataFrame(results)

| | dummy | logistic regression | logistic regression (optimized params) |
|---|---|---|---|
| fit_time | 0.001 (+/- 0.001) | 0.297 (+/- 0.046) | 0.273 (+/- 0.073) |
| score_time | 0.004 (+/- 0.002) | 0.047 (+/- 0.001) | 0.050 (+/- 0.004) |
| test_precision | 0.189 (+/- 0.022) | 0.804 (+/- 0.017) | 0.670 (+/- 0.013) |
| train_precision | 0.190 (+/- 0.010) | 0.997 (+/- 0.001) | 0.906 (+/- 0.004) |
| test_f1 | 0.190 (+/- 0.019) | 0.616 (+/- 0.027) | 0.665 (+/- 0.017) |
| train_f1 | 0.188 (+/- 0.009) | 0.971 (+/- 0.002) | 0.946 (+/- 0.003) |
| test_recall | 0.192 (+/- 0.018) | 0.499 (+/- 0.030) | 0.660 (+/- 0.030) |
| train_recall | 0.187 (+/- 0.009) | 0.946 (+/- 0.003) | 0.990 (+/- 0.002) |
| test_roc_auc | 0.501 (+/- 0.015) | 0.890 (+/- 0.012) | 0.891 (+/- 0.012) |
| train_roc_auc | 0.496 (+/- 0.005) | 0.999 (+/- 0.000) | 0.998 (+/- 0.000) |


### Feature Engineering

In [29]:
import nltk

nltk.download("vader_lexicon")
nltk.download("punkt")
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/snehajhaveri/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/snehajhaveri/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [30]:
# Creating new features relevant to the data and the problem 

def get_relative_length(text, TWITTER_ALLOWED_CHARS=280.0):
    """
    Returns the relative length of text.

    Parameters:
    ------
    text: (str)
    the input text

    Keyword arguments:
    ------
    TWITTER_ALLOWED_CHARS: (float)
    the denominator for finding relative length

    Returns:
    -------
    relative length of text: (float)

    """
    return len(text) / TWITTER_ALLOWED_CHARS

def get_length_in_words(text):
    """
    Returns the length of the text in words.

    Parameters:
    ------
    text: (str)
    the input text

    Returns:
    -------
    length of tokenized text: (int)

    """
    return len(nltk.word_tokenize(text))


def get_sentiment(text):
    """
    Returns the VADER compound sentiment score of the text, ranging from
    -1 (most extreme negative) to +1 (most extreme positive).
    The compound score is computed by summing the valence scores of each
    word in the lexicon and normalizing the sum to that range.

    Parameters:
    ------
    text: (str)
    the input text

    Returns:
    -------
    sentiment of the text: (float)
    """
    scores = sid.polarity_scores(text)
    return scores["compound"]

In [31]:
train_df = train_df.assign(n_words=train_df["text"].apply(get_length_in_words))
train_df = train_df.assign(vader_sentiment=train_df["text"].apply(get_sentiment))
train_df = train_df.assign(rel_char_len=train_df["text"].apply(get_relative_length))

test_df = test_df.assign(n_words=test_df["text"].apply(get_length_in_words))
test_df = test_df.assign(vader_sentiment=test_df["text"].apply(get_sentiment))
test_df = test_df.assign(rel_char_len=test_df["text"].apply(get_relative_length))

In [32]:
from spacymoji import Emoji
import en_core_web_md  # pre-trained model
import spacy

nlp = en_core_web_md.load()

In [34]:
nlp.add_pipe("emoji", first=True);
def get_emojis(text):
    """
    Returns the number of emojis in the given text
    
    Parameters:
    ------
    text: (str)
    the input text
    
    Returns:
    ------
    count of emojis in the text : int
    """
    doc = nlp(text)
    return len(doc._.emoji)
train_df['num_emojis']=train_df["text"].apply(get_emojis)
test_df['num_emojis']=test_df["text"].apply(get_emojis)

**Reasoning**

In our dataset, most formal tweets about disaster events contain very few emojis, whereas fake reviews and fake tweets tend to contain comparatively more. We therefore create a new feature that counts the number of emojis per tweet, on the expectation that the more emojis a tweet contains, the less likely it is to be a disaster tweet.

In [35]:
# Adding a new feature for Mention Count

train_df['mention_count'] = train_df['text'].apply(lambda x: len([c for c in str(x) if c == '@']))
test_df['mention_count'] = test_df['text'].apply(lambda x: len([c for c in str(x) if c == '@']))

**Reasoning**

Formal tweets about disasters usually contain mentions, which start with '@', and most formal disaster posts mention important city authorities. So tweets that contain '@' may be more likely to be authentic disaster-related tweets.
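Counting raw '@' characters also counts e-mail addresses and stand-alone '@' signs. If only handle-style mentions are of interest, a regular expression is one possible alternative. The sketch below is not what the notebook uses; the pattern, the helper `count_mentions`, and the example tweet are all assumptions for illustration.

```python
import re

# '@' followed by word characters, and not preceded by a word character
# (so the '@' inside an e-mail address is skipped)
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")


def count_mentions(text):
    """Count handle-style mentions such as '@user' in the given text."""
    return len(MENTION_PATTERN.findall(str(text)))


# Hypothetical tweet mixing mentions, an e-mail address and a bare '@'
example = "@NWS reports flooding, contact help@city.org or @RedCross @ 5pm"

raw_count = example.count("@")
mention_count = count_mentions(example)
print(raw_count, mention_count)
```

Whether the stricter count helps the model is an empirical question; it would need to be checked with the same cross-validation setup.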

In [36]:
# Adding a new feature for Punctuation Count 

train_df['punctuation_count'] = train_df['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))
test_df['punctuation_count'] = test_df['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))

**Reasoning**

Formal tweets about disasters are usually written with proper punctuation compared to fake tweets and movie reviews. Non-disaster tweets, which mostly come from individual users, may contain more typos and less punctuation than disaster tweets.

In [37]:
train_df.head()

| | keyword | location | text | target | n_words | vader_sentiment | rel_char_len | mention_count | num_emojis | punctuation_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 3289 | debris | NaN | Unfortunately, both plans fail as the 3 are im... | 0 | 22 | -0.765 | 0.425 | 0 | 0 | 2 |
| 2672 | crash | SLC | I hope this causes Bernie to crash and bern. S... | 0 | 18 | -0.5697 | 0.267857 | 0 | 0 | 7 |
| 2436 | collide | NaN | —pushes himself up from the chair beneath to r... | 0 | 21 | 0.0 | 0.439286 | 0 | 0 | 6 |
| 9622 | suicide%20bomb | NaN | Widow of CIA agent killed in 2009 Afghanistan ... | 1 | 20 | -0.946 | 0.428571 | 0 | 0 | 5 |
| 8999 | screaming | Azania | As soon as God say yes they'll be screaming we... | 0 | 14 | 0.296 | 0.203571 | 0 | 2 | 1 |


### Building pipeline with new features

In [38]:
train_df.columns

Index(['keyword', 'location', 'text', 'target', 'n_words', 'vader_sentiment',
       'rel_char_len', 'mention_count', 'num_emojis', 'punctuation_count'],
      dtype='object')

In [39]:
X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]
X_test, y_test = test_df.drop(columns=["target"]), test_df["target"]

In [40]:
# Segregating the features based on the transformations

drop_features = ['location']
keyword_features = "keyword"
text_features = "text"
numeric_features = ['n_words', 'vader_sentiment', 'rel_char_len', 'num_emojis', 'mention_count', 'punctuation_count']

In [42]:
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    # In scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`
    (OneHotEncoder(handle_unknown="ignore", sparse=False), [keyword_features]),
    (CountVectorizer(stop_words="english", max_features=best_max_feats), text_features),
    ("drop", drop_features),
)

In [43]:
# Model with new features

pipe_lr = make_pipeline(
    preprocessor,
    LogisticRegression(max_iter=1000, class_weight=best_class_weight, C=best_c),
)

results["logistic regression (new features)"] = mean_std_cross_val_scores(
    pipe_lr, X_train, y_train, return_train_score=True, scoring=scoring
)
pd.DataFrame(results)

| | dummy | logistic regression | logistic regression (optimized params) | logistic regression (new features) |
|---|---|---|---|---|
| fit_time | 0.001 (+/- 0.001) | 0.297 (+/- 0.046) | 0.273 (+/- 0.073) | 0.375 (+/- 0.045) |
| score_time | 0.004 (+/- 0.002) | 0.047 (+/- 0.001) | 0.050 (+/- 0.004) | 0.060 (+/- 0.001) |
| test_precision | 0.189 (+/- 0.022) | 0.804 (+/- 0.017) | 0.670 (+/- 0.013) | 0.657 (+/- 0.007) |
| train_precision | 0.190 (+/- 0.010) | 0.997 (+/- 0.001) | 0.906 (+/- 0.004) | 0.880 (+/- 0.007) |
| test_f1 | 0.190 (+/- 0.019) | 0.616 (+/- 0.027) | 0.665 (+/- 0.017) | 0.678 (+/- 0.016) |
| train_f1 | 0.188 (+/- 0.009) | 0.971 (+/- 0.002) | 0.946 (+/- 0.003) | 0.928 (+/- 0.005) |
| test_recall | 0.192 (+/- 0.018) | 0.499 (+/- 0.030) | 0.660 (+/- 0.030) | 0.701 (+/- 0.036) |
| train_recall | 0.187 (+/- 0.009) | 0.946 (+/- 0.003) | 0.990 (+/- 0.002) | 0.982 (+/- 0.003) |
| test_roc_auc | 0.501 (+/- 0.015) | 0.890 (+/- 0.012) | 0.891 (+/- 0.012) | 0.898 (+/- 0.012) |
| train_roc_auc | 0.496 (+/- 0.005) | 0.999 (+/- 0.000) | 0.998 (+/- 0.000) | 0.996 (+/- 0.000) |


After adding the new features, cross-validated recall improves (0.660 to 0.701), and so does the f1 score (0.665 to 0.678), compared to the optimized model with the original features. Precision drops slightly (0.670 to 0.657), and the roc_auc score is marginally better (0.891 to 0.898).

### Analyzing Feature Coefficients

In [44]:
pipe_lr.fit(X_train, y_train)

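# Note: OneHotEncoder.get_feature_names() produces the x0_-prefixed names shown
# below; it was removed in scikit-learn 1.2 in favour of get_feature_names_out()
feature_names = (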
    numeric_features
    + list(
        pipe_lr.named_steps["columntransformer"]
        .named_transformers_["onehotencoder"]
        .get_feature_names()
    )
    + list(
        pipe_lr.named_steps["columntransformer"]
        .named_transformers_["countvectorizer"]
        .get_feature_names_out()
    )
)

data = {
    "coefficient": pipe_lr.named_steps["logisticregression"].coef_[0].tolist(),
    "magnitude": np.absolute(
        pipe_lr.named_steps["logisticregression"].coef_[0].tolist()
    ),
}
coef_df = pd.DataFrame(data, index=feature_names).sort_values(
    "magnitude", ascending=False
)
coef_df[:20]



| | coefficient | magnitude |
|---|---|---|
| thunderstorm | 1.603396 | 1.603396 |
| x0_windstorm | 1.452117 | 1.452117 |
| died | 1.449866 | 1.449866 |
| survived | 1.44063 | 1.44063 |
| rescued | 1.439215 | 1.439215 |
| collision | 1.372419 | 1.372419 |
| ukrainian | 1.327828 | 1.327828 |
| road | 1.309697 | 1.309697 |
| x0_buildings%20on%20fire | 1.255956 | 1.255956 |
| hitchin | 1.1698 | 1.1698 |


The top features shown above make sense. Features like `windstorm`, `rescued`, `thunderstorm`, `died`, `survived`, and `carried` seem to be important in the context of real tweets about disastrous events.

In [45]:
# Evaluating the importance of newly added features

extracted_feats = ['n_words', 'vader_sentiment', 'rel_char_len', 'num_emojis', 'mention_count', 'punctuation_count']

coef_df.loc[extracted_feats].sort_values("coefficient", ascending=False)

| | coefficient | magnitude |
|---|---|---|
| rel_char_len | 0.645789 | 0.645789 |
| punctuation_count | -0.043281 | 0.043281 |
| mention_count | -0.048048 | 0.048048 |
| num_emojis | -0.083662 | 0.083662 |
| vader_sentiment | -0.406554 | 0.406554 |
| n_words | -0.667181 | 0.667181 |


Some of the coefficients make sense:

1) The presence of more emojis seems to drive predictions in the non-disaster direction.
2) The coefficient of the `vader_sentiment` feature is negative, suggesting that a larger sentiment score (i.e., more positive sentiment) pushes the prediction towards a non-disaster tweet.
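One way to read these numbers is as odds ratios: in logistic regression, raising a feature by one unit multiplies the odds of the positive class by `exp(coefficient)`, holding the other features fixed. Because the numeric features pass through `StandardScaler`, one unit here corresponds to one standard deviation. The sketch below applies this to three coefficients copied from the table above; the dictionary names are illustrative.

```python
import math

# Coefficients copied from the table of engineered features above
coefficients = {
    "rel_char_len": 0.645789,
    "vader_sentiment": -0.406554,
    "n_words": -0.667181,
}

# exp(coefficient) is the multiplicative change in the odds of the
# disaster class per +1 standard deviation of the (scaled) feature
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.3f} per +1 standard deviation")
```

For example, a one-standard-deviation increase in `vader_sentiment` multiplies the odds of a disaster label by roughly two thirds.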

### Model Evaluation

In [47]:
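from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

# Predict once and reuse the predictions for every metric
y_pred = pipe_lr.predict(X_test)

print('F1 Score')
print(f1_score(y_test, y_pred))

print('Precision Score')
print(precision_score(y_test, y_pred))

print('Recall Score')
print(recall_score(y_test, y_pred))

# Note: this ROC AUC is computed from hard class predictions, so it reflects a
# single decision threshold; pipe_lr.predict_proba(X_test)[:, 1] would give the
# threshold-free score that matches the cross-validation `roc_auc` scorer
print('ROC AUC Score')
print(roc_auc_score(y_test, y_pred))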

F1 Score
0.7262313860252003
Precision Score
0.6891304347826087
Recall Score
0.7675544794188862
ROC AUC Score
0.8453570355181481
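The ROC AUC reported here is computed from hard 0/1 predictions, while the cross-validation `roc_auc` values earlier are based on continuous scores, so the two are not directly comparable. The toy example below sketches why the input matters. It uses a hand-written pairwise AUC (`roc_auc` is a hypothetical helper, and the labels and probabilities are made up), not scikit-learn.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.6, 0.4, 0.9]

# Thresholding at 0.5 discards the ranking information
hard_labels = [int(p >= 0.5) for p in probabilities]

auc_from_scores = roc_auc(y_true, probabilities)
auc_from_labels = roc_auc(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)
```

In the notebook, the analogous change would be to pass `pipe_lr.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead of the predicted labels.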
