# Hi!

This notebook implements an RNN model to classify whether a given tweet is about a real disaster, preceded by some exploratory data analysis (EDA).

## Problem Statement

Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:

##### Sentence 1: Forest fire near la ronge sask.
##### Sentence 2: This video was fire!

You'll notice that both sentences contain the word "fire", but only one of them is about an actual disaster. This is clear to a human right away, but it is far less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified.

Importing Dependencies

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go

from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from keras.initializers import Constant
import keras.metrics as metrics
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

import warnings
import string
import re
warnings.filterwarnings('ignore')
sns.set()

Importing Data

In [None]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

In [None]:
train.head()

# Understanding the data

In [None]:
train.shape, test.shape

Training dataset has 7613 rows and 5 columns, and testing dataset has 3263 rows and 4 columns. The missing column is of course the target variable which we want to predict. 

In [None]:
train.isnull().sum().sum(), test.isnull().sum().sum()

Training dataset has 2594 null values, and test dataset has 1131 null values. Many algorithms cannot deal with missing values, hence we would need to either drop these values or impute them. 

In [None]:
train.id.isin(test.id).value_counts()

None of the ids in the training dataset appear in the testing dataset.

In [None]:
train.text.isin(test.text).value_counts()

There are 127 texts from training dataset in the testing dataset.

In [None]:
disaster_tweets = pd.concat([train, test], axis=0)
disaster_tweets.info()

# Exploratory Data Analysis

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(x=disaster_tweets.target, palette='Blues')  # pass the column as x=; newer seaborn rejects it positionally
plt.title("Distribution of Target Counts", size=25, weight='bold')
plt.xlabel("Target Label", size=14)
plt.ylabel("Frequency", size=14)
plt.show()

As displayed, the dataset is imbalanced towards the negative class (not a disaster). This matters because it can have a significant effect on the classifier: accuracy becomes a misleading metric, since a model that leans towards the dominant class can still score well. In this case the imbalance is not severe, so we will not apply any oversampling or undersampling, but it is still worth noting.
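
To make the point about accuracy concrete, here is a small dependency-free sketch. The label counts (57 negatives, 43 positives) are made up for illustration and are not taken from the dataset, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def positive_recall(y_true, y_pred):
    """Share of the true positives (label 1) that the predictions recover."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    return sum(p == 1 for _, p in positives) / len(positives)


# Illustrative labels only: 57 negatives and 43 positives.
y_true = [0] * 57 + [1] * 43
# A "classifier" that always predicts the dominant (negative) class.
y_majority = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_majority)
baseline_recall = positive_recall(y_true, y_majority)
print(baseline_accuracy, baseline_recall)
```

The always-negative baseline looks acceptable on accuracy while finding no disaster tweets at all, which is why a per-class metric is the safer thing to watch here.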

In [None]:
missing_values = pd.DataFrame(dict(
    round(100 * disaster_tweets.drop('target', axis=1).isnull().sum() /
          len(disaster_tweets), 2)), index=[0])
missing_values = missing_values.melt(var_name = 'Features', value_name = 'Percentage')

plt.figure(figsize=(20,10))
sns.barplot(x='Features', y='Percentage', data=missing_values, palette='Blues')
plt.title("Percentage of Missing Values", size=25, weight='bold')
plt.xlabel("Features", size=14)
plt.ylabel("Percentage", size=14)
plt.show()

About 0.8% of the values are missing in the 'keyword' feature and about 33.45% in the 'location' feature. The latter is expected, since users can opt out of disclosing their location when tweeting.

In [None]:
unique_values = pd.DataFrame(dict(disaster_tweets.drop('target', axis=1).nunique()),
             index=[0]).melt(var_name = 'Features', value_name = 'Frequency')

plt.figure(figsize=(20,10))
sns.barplot(x='Features', y='Frequency', data=unique_values, palette='Blues')
plt.title("Count of Unique Values", size=25, weight='bold')
plt.xticks(size=18)
plt.xlabel("Features", size=18)
plt.ylabel("Frequency", size=14)
plt.show()

In [None]:
print("Number of Duplicated Tweets in Train and Test: {}".format(
    len(disaster_tweets[disaster_tweets.text.duplicated()])))
print("Number of Duplicated Tweets in Train: {}".format(
    (disaster_tweets[disaster_tweets.text.duplicated()].target.notna().sum())))
print("Number of Duplicated Tweets in Test: {}".format(
    (disaster_tweets[disaster_tweets.text.duplicated()].target.isna().sum())))

This is important to consider: the dataset was labelled by hand, so if the same tweet appears more than once, human error may have given identical tweets different target labels.

In [None]:
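# keep=False keeps every copy of a duplicated tweet; the default drops the first
# occurrence, which can hide a conflicting label on that first copy.
duplicate_values = train[train.text.duplicated(keep=False)][['text','target']]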
duplicate_values[duplicate_values.text=='To fight bioterrorism sir.']

I took one example from the training dataset and found that the same tweet carries multiple labels. This will confuse our model and lead to inconsistent results.

In [None]:
different_target_tweets_indexes = []
for index, target in enumerate(duplicate_values.groupby(['text']).agg(list).target):
    if len(list(set(target)))>1:
        different_target_tweets_indexes.append(index)
        
duplicate_values.groupby('text').agg(list).iloc[different_target_tweets_indexes]

Hence it is possible that the test dataset Kaggle uses for evaluation also contains incorrect labels like these. However, I cannot be totally sure.
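
The idea behind the check above can be sketched without pandas: group the labels by tweet text and flag any text that received more than one distinct label. The rows below are invented stand-ins, not rows from the dataset.

```python
from collections import defaultdict

# Hypothetical (text, label) pairs standing in for rows of the training set.
rows = [
    ("forest fire near the lake", 1),
    ("forest fire near the lake", 1),
    ("this video was fire", 0),
    ("this video was fire", 1),
    ("what a sunny day", 0),
]

# Collect the set of labels seen for each distinct text.
labels_by_text = defaultdict(set)
for text, label in rows:
    labels_by_text[text].add(label)

# A text is ambiguous when it was given more than one distinct label.
conflicting = sorted(t for t, labels in labels_by_text.items() if len(labels) > 1)
print(conflicting)
```

Note that every copy of a tweet takes part in the grouping here, including the first one, so a conflict between the first and second copy is still caught.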

In [None]:
location_values = pd.DataFrame(dict(disaster_tweets.location.value_counts()), 
             index=[0]).melt(var_name="Country", value_name="#Unique")
location_values[:10]

The 'location' feature has different names for the same location, e.g. USA and United States. Locations can also be city names, e.g. Los Angeles, CA and London. This is worth noting and would need cleaning, but we can simply drop this column since about 33% of its values are null as well. I will still replace a few values just for the sake of exploring.

In [None]:
replace_locations= {
    "United States": "USA",
    "London": "UK",
    "New York": "USA",
    "Los Angeles, CA": "USA",
    "Washington, DC": "USA",
    "Mumbai": "India"
}

countries = disaster_tweets["location"].replace(replace_locations).value_counts()
location_values = pd.DataFrame(dict(zip(countries.index, countries.values)), 
                               index=[0]).melt(var_name="Country", value_name="#Unique")

data = {
    # .loc slicing is inclusive, so :9 selects exactly the top 10 rows
    "locations": location_values.loc[:9, "Country"],
    "locationmode": "country names",
    "z": location_values.loc[:9, "#Unique"],
    "colorscale": "blugrn",
    "text": location_values.loc[:9, "Country"],
    "type": "choropleth",
    "colorbar": {"title": "#Unique", "len": 200, "lenmode":"pixels"}
}

layout = go.Layout(title_text="<b>Top 10 Countries based on # Tweets Location</b>",
                   geo=dict(scope="world"))

fig = go.Figure(data=[data], layout=layout)
fig.update_layout(title_x=0.5)
fig.show()

In [None]:
df = disaster_tweets.replace(replace_locations)
df = df[df.location.isin(location_values.Country.head(9))]
df = df.sort_values(by='location')

plt.figure(figsize=(20,10))
seaborn_plot = sns.countplot(x=df.location, hue=df.target, palette='Blues')
for p in seaborn_plot.patches:
    seaborn_plot.annotate(format(p.get_height(), '.2f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center',
                   va = 'center', xytext = (0, 9), textcoords = 'offset points')
plt.title("# Tweets for Top10 Countries", size=25, weight='bold')
plt.xticks(size=18)
plt.xlabel("Countries", size=25, weight='bold')
plt.ylabel("Frequency", size=14)
plt.legend(("Not Disastrous", "Disastrous"))
plt.show()

Since I replaced only a few locations, these counts may appear wrong, but I hope you get the gist.

In [None]:
all_keywords = " ".join(keyword for keyword in disaster_tweets[disaster_tweets.target==1].keyword.dropna())
word_cloud= WordCloud(width=1250, height=625, max_font_size=350, 
                      random_state=42).generate(all_keywords)
plt.figure(figsize=(20, 10))
plt.title("Words used for Disastrous Tweets", size=20, weight="bold")
plt.imshow(word_cloud)
plt.axis("off")
plt.show()

In [None]:
all_keywords = " ".join(keyword for keyword in disaster_tweets[disaster_tweets.target==0].keyword.dropna())
word_cloud= WordCloud(width=1250, height=625, max_font_size=350, 
                      random_state=42).generate(all_keywords)
plt.figure(figsize=(20, 10))
plt.title("Words used for Non-Disastrous Tweets", size=20, weight="bold")
plt.imshow(word_cloud)
plt.axis("off")
plt.show()

It can be seen that disastrous tweets more frequently have keywords like 'outbreak', 'derailment' and 'wreckage', while non-disastrous tweets have keywords like 'body', 'armageddon' and 'bags'.

In [None]:
stats_values = disaster_tweets.copy()
STOPWORDS = set(stopwords.words("english"))

stats_values['char_len'] = stats_values.text.apply(lambda x: len(x))
stats_values['word_len'] = stats_values.text.apply(lambda x: len(x.split()))
stats_values['count_stopwords'] = stats_values.text.apply(
    lambda x: len(([w for w in str(x).lower().split() if w in STOPWORDS])))
stats_values['count_punctuation'] = stats_values.text.apply(
    lambda x: len([c for c in str(x) if c in string.punctuation]))

stats_values.head()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,10))
ax = sns.distplot(stats_values[stats_values.target==0].char_len, 
                  ax = axes[0], bins=40, color='green')
ax.set_title("Non-Disastrous Tweets", size=25)
ax.set_xlabel('Character Length', size=18)
ax = sns.distplot(stats_values[stats_values.target==1].char_len, 
                  ax = axes[1], bins=40, color='red')
ax.set_title("Disastrous Tweets", size=25)
ax.set_xlabel('Character Length', size=18)
plt.show()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,10))
ax = sns.distplot(stats_values[stats_values.target==0].word_len, 
                  ax=axes[0], color='green')
ax.set_title("Non-Disastrous Tweets", size=25)
ax.set_xlabel('Word Length', size=18)
ax = sns.distplot(stats_values[stats_values.target==1].word_len, 
                  ax=axes[1], color='red')
ax.set_title("Disastrous Tweets", size=25)
ax.set_xlabel('Word Length', size=18)
plt.show()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,10))
ax = sns.distplot(stats_values[stats_values.target==0].count_stopwords, 
                  ax=axes[0], color='green', bins=15)
ax.set_title("Non-Disastrous Tweets", size=25)
ax.set_xlabel('Count of StopWords', size=18)
ax = sns.distplot(stats_values[stats_values.target==1].count_stopwords, 
                  ax=axes[1], color='red', bins=15)
ax.set_title("Disastrous Tweets", size=25)
ax.set_xlabel('Count of StopWords', size=18)
plt.show()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,10))
ax = sns.distplot(stats_values[stats_values.target==0].count_punctuation, 
                  ax=axes[0], color='green')
ax.set_title("Non-Disastrous Tweets", size=25)
ax.set_xlabel('Count of Punctuation', size=18)
ax = sns.distplot(stats_values[stats_values.target==1].count_punctuation, 
                  ax=axes[1], color='red')
ax.set_title("Disastrous Tweets", size=25)
ax.set_xlabel('Count of Punctuation', size=18)
plt.show()

Realistically, I think this dataset is too small to infer real-world behaviour, but here is a summary of my findings:

1. Disastrous tweets more frequently have keywords like 'outbreak', 'derailment' and 'wreckage', while non-disastrous tweets have keywords like 'body', 'armageddon' and 'bags'.

2. People generally tend to type more characters when tweeting about a disaster.

3. People generally type more words when tweeting about a disaster.

4. People generally use more StopWords when tweeting about a disaster.

5. People generally use more punctuation when tweeting about a disaster.

Pretty strange right?!

<center> <img src="https://tenor.com/view/president-donald-trump-tweets-mad-gif-8149740.gif"> </center>

# Data Cleaning

In [None]:
# .copy() gives independent frames, so adding columns later does not act on a slice
train = disaster_tweets.iloc[:len(train), :].copy()
test = disaster_tweets.iloc[len(train):, :].drop('target', axis=1).copy()

train.shape, test.shape

In [None]:
def clean_text(tweet):
    tweet = str(tweet)
    tweet = re.sub(r'^RT[\s]+', '', tweet) #remove RT or Retweet symbols (done before lowercasing, since the pattern is uppercase)
    tweet = tweet.lower() #text to lowercase
    tweet = re.sub(r'\$\w*', '', tweet) #remove stock market symbols
    tweet = re.sub(r'https?://\S+', '', tweet) #remove links without deleting the text that follows them
    tweet = re.sub(r'#', '', tweet) #remove # or Hashtag symbols
    return tweet

def clean_punctuation(tweet):
    punctuation = set(string.punctuation) #build the lookup set once per call
    tweet = "".join(char for char in tweet if char not in punctuation)
    return tweet

def clean_stopwords(tweet):
    STOPWORDS = set(stopwords.words("english"))
    tweet = " ".join(word for word in tweet.split() if word not in STOPWORDS)
    return tweet

def clean_tweet(tweet):
    tweet = clean_text(tweet)
    tweet = clean_punctuation(tweet)
    tweet = clean_stopwords(tweet)
    return tweet

def manual_label_target(df):
    df.loc[df["text"]==
           "#Allah describes piling up #wealth thinking it would last #forever as the description of the people of #Hellfire in Surah Humaza. #Reflect",
           "target"]= 0
    df.loc[df["text"]==
          "#foodscare #offers2go #NestleIndia slips into loss after #Magginoodle #ban unsafe and hazardous for #humanconsumption",
          "target"] = 0
    df.loc[df["text"]==
          ".POTUS #StrategicPatience is a strategy for #Genocide; refugees; IDP Internally displaced people; horror; etc. https://t.co/rqWuoy1fm4",
          "target"] = 0
    df.loc[df["text"]==
          "CLEARED:incident with injury:I-495  inner loop Exit 31 - MD 97/Georgia Ave Silver Spring",
          "target"] = 1
    df.loc[df["text"]==
          "He came to a land which was engulfed in tribal war and turned it into a land of peace i.e. Madinah. #ProphetMuhammad #islam",
          "target"] = 0
    df.loc[df["text"]==
          "Hellfire is surrounded by desires so be careful and donÛªt let your desires control you! #Afterlife",
          "target"] = 0
    df.loc[df["text"]==
          "The Prophet (peace be upon him) said 'Save yourself from Hellfire even if it is by giving half a date in charity.'",
          "target"] = 0
    df.loc[df["text"]=="To fight bioterrorism sir.", "target"] = 0
    df.loc[df["text"]==
          "that horrible sinking feeling when youÛªve been at home on your phone for a while and you realise its been on 3G this whole time",
          "target"] = 0
    return df

In [None]:
train = manual_label_target(train)

train['clean_tweet'] = train.text.apply(clean_tweet)
test['clean_tweet'] = test.text.apply(clean_tweet)

In [None]:
train.head()

In [None]:
train[train.text == 'To fight bioterrorism sir.']

In [None]:
train.drop(['id', 'keyword', 'location', 'text'], axis=1, inplace=True)
test.drop(['id', 'keyword', 'location', 'text'], axis=1, inplace=True)

I cleaned the dataset by converting the text to lowercase and removing stock-market symbols, RT markers, links and hashtag symbols (#). I then removed punctuation marks and stopwords, and finally relabelled by hand the duplicated tweets found earlier.
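
As a rough illustration of what these steps do to a single tweet, here is a self-contained sketch. The sample tweet and the tiny stopword set are made up (the notebook itself uses NLTK's English stopword list), and `demo_clean` is a hypothetical helper, not the function used above. In this sketch the retweet marker is removed before lowercasing, because a pattern anchored on uppercase `RT` cannot match lowercased text.

```python
import re
import string

# Tiny stand-in stopword list (assumption); the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"the", "is", "a", "in", "on", "at"}
PUNCTUATION = set(string.punctuation)


def demo_clean(tweet):
    """Apply the cleaning steps in order to a single tweet string."""
    tweet = re.sub(r"^RT\s+", "", tweet)        # drop a leading retweet marker
    tweet = re.sub(r"\$\w*", "", tweet)         # drop stock tickers such as $TSLA
    tweet = re.sub(r"https?://\S+", "", tweet)  # drop URLs, keep the text after them
    tweet = tweet.replace("#", "")              # keep hashtag words, drop the symbol
    tweet = tweet.lower()
    tweet = "".join(ch for ch in tweet if ch not in PUNCTUATION)
    return " ".join(w for w in tweet.split() if w not in DEMO_STOPWORDS)


sample = "RT Forest #fire is near the lake! https://t.co/abc123 $TSLA stay safe"
cleaned = demo_clean(sample)
print(cleaned)  # forest fire near lake stay safe
```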

In [None]:
train.shape, test.shape

In [None]:
train.head()

# Preprocessing

I will use GloVe embeddings, as they represent words using their context and combine matrix factorization with window-based methods. If you are not familiar with GloVe embeddings, I have made another beginner-friendly notebook that explains why they are useful and how to use them with deep learning models such as RNNs.
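
To give a feel for the "window-based" part, here is a minimal sketch of counting word co-occurrences inside a context window, which is the kind of statistic GloVe is trained on. It is only the counting step on a made-up toy corpus; actual GloVe training additionally weights pairs and fits vectors to these counts, which is not shown. The function name and the corpus are assumptions for illustration.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Toy corpus, invented for this example.
toy_corpus = [
    ["forest", "fire", "near", "town"],
    ["fire", "near", "forest"],
]
counts = cooccurrence_counts(toy_corpus, window=1)
print(counts[("fire", "near")], counts[("forest", "town")])
```

Words that often share a window ("fire" and "near" here) end up with high counts, while words that never appear close together get zero.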

Link: https://www.kaggle.com/shivam017arora/imdb-sentiment-analysis

In [None]:
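corpus = []
for text in train['clean_tweet']:
    words = [word.lower() for word in word_tokenize(text)] 
    corpus.append(words)
# Note: len(corpus) is the number of tweets, not the number of distinct words;
# it is reused below as the vocabulary cap passed to Tokenizer.
num_words = len(corpus)
print(num_words)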

In [None]:
X = train['clean_tweet']
y = train['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
max_len = 32
tokenizer = Tokenizer(num_words)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(X_train, maxlen=max_len, truncating='post', padding='post')
X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=max_len, truncating='post', padding='post')
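
For readers new to padding, the effect of `padding='post'` and `truncating='post'` with `maxlen` can be mimicked in plain Python. This is a sketch of the behaviour on one sequence, with a hypothetical helper name and made-up token ids; it is not how Keras implements it.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Cut a sequence to max_len from the end, or right-pad it with pad_value."""
    trimmed = sequence[:max_len]
    return trimmed + [pad_value] * (max_len - len(trimmed))


short = pad_or_truncate([5, 12, 7], max_len=5)
long = pad_or_truncate([5, 12, 7, 3, 9, 4, 8], max_len=5)
print(short, long)  # [5, 12, 7, 0, 0] [5, 12, 7, 3, 9]
```

Every tweet therefore becomes a fixed-length row of token ids, which is what the embedding layer expects.
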

In [None]:
word_index = tokenizer.word_index
print("Number of unique words: {}".format(len(word_index)))

In [None]:
embedding = {}
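with open("/kaggle/input/glovetwitter27b100dtxt/glove.twitter.27B.100d.txt", encoding="utf-8") as file:
    for line in file:
        values = line.split()
        word = values[0]  # first field is the token
        vectors = np.asarray(values[1:], 'float32')  # remaining fields are the 100-d vector
        embedding[word] = vectors
# no explicit close needed: the with-block closes the file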

In [None]:
embedding_matrix = np.zeros((num_words+1, 100))
for i, word in tokenizer.index_word.items():
    if i < (num_words+1):
        vector = embedding.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

# Modelling

Evaluation Rules: 
Submissions are evaluated using F1 score between the predicted and expected answers.
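
As a reminder of how the F1 score is built from precision and recall, here is a small worked sketch. The counts of true positives, false positives and false negatives are invented for illustration, and `f1_from_counts` is a hypothetical helper rather than the scikit-learn function.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for one class from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # how many predicted positives were correct
    recall = tp / (tp + fn)     # how many actual positives were found
    return 2 * precision * recall / (precision + recall)


# Made-up counts: precision = 60/80 = 0.75, recall = 60/100 = 0.60.
score = f1_from_counts(tp=60, fp=20, fn=40)
print(round(score, 4))  # 0.6667
```

Because F1 is the harmonic mean of the two, it stays low unless both precision and recall are reasonably high.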

To evaluate my model, I held out 30% of the training data as a test set.

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

Initializing Model

In [None]:
model = Sequential()

model.add(Embedding(input_dim=num_words+1, output_dim=100, 
                    embeddings_initializer=Constant(embedding_matrix), 
                    input_length=32, trainable=False))
model.add(LSTM(100, dropout=0.1))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.summary()

Training Model

In [None]:
history = model.fit(X_train, y_train, epochs=5, batch_size=256, 
                    validation_data=(X_test, y_test))

Plotting loss while training

In [None]:
plt.figure(figsize=(16,5))
epochs = range(1, len(history.history['accuracy'])+1)
plt.plot(epochs, history.history['loss'], label='Training Loss', color='red')  # dropped 'b', which conflicted with color='red'
plt.plot(epochs, history.history['val_loss'], 'b', label='Validation Loss')
plt.legend()
plt.show()

Plotting accuracy while training

In [None]:
plt.figure(figsize=(16,5))
epochs = range(1, len(history.history['accuracy'])+1)
plt.plot(epochs, history.history['accuracy'], label='Training Accuracy', color='red')  # dropped 'b', which conflicted with color='red'
plt.plot(epochs, history.history['val_accuracy'], 'b', label='Validation Accuracy')
plt.legend()
plt.show()

In [None]:
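# predict_classes is not available in recent Keras versions;
# threshold the sigmoid output at 0.5 instead
y_pred = (model.predict(X_test) > 0.5).astype("int32").ravel()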
print(classification_report(y_test, y_pred))

In [None]:
labels= ["Negative", "Positive"]
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(cm, annot=True, fmt="g", ax=ax)
plt.xlabel("Predicted Labels")
plt.ylabel("Actual Labels")
plt.title("Confusion Matrix", size=20, weight="bold")
plt.yticks(ticks=[0.5, 1.5], labels=labels)
plt.show()

# Conclusion

Considering that this dataset is small and that I only used the 'text' feature, I got a good F1 score on the negative (dominant) class but only a decent score on the positive class. The confusion matrix also shows more false predictions for the positive class (265) than for the negative class (175). In the future, I plan to apply the Synthetic Minority Oversampling Technique (SMOTE) and upload a new version with improvements (hopefully!), and to evaluate more models for better understanding.
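
For context on the oversampling idea, the core step of SMOTE is to create a synthetic minority sample somewhere on the line between an existing minority sample and one of its neighbours. The sketch below shows only that interpolation step on made-up two-dimensional points; the full technique also searches for nearest neighbours, and it would need numeric features (not raw token ids) to make sense for tweets. The function name is hypothetical.

```python
import random


def smote_like_sample(point, neighbour, rng):
    """Return a synthetic point on the segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]


rng = random.Random(42)  # seeded so the sketch is repeatable
minority_point = [1.0, 2.0]
minority_neighbour = [3.0, 6.0]
synthetic = smote_like_sample(minority_point, minority_neighbour, rng)
print(synthetic)
```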

If you enjoyed my work, please follow, upvote and leave your feedback in the comments, as I am a beginner and open to learning.