# Classify Disaster Tweets
Our goal in this notebook is to build an RNN (Recurrent Neural Network) classifier that can discern whether a tweet is a report of a disaster or not. This is a many-to-one sequential data problem: the length of the tweet (in words) varies, while the output is a single binary label (1 for disaster, 0 for not a disaster). We will look at the efficacy of a vanilla RNN, an RNN with an LSTM layer, and an RNN with a GRU layer. Our outputs will be scored using the F1 score, which is the harmonic mean of precision and recall.

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

$$\text{precision} = \frac{TP}{TP+FP} \quad \quad \text{recall} = \frac{TP}{TP+FN}$$
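As a quick sanity check on these formulas, the sketch below computes precision, recall, and F1 from raw confusion counts. The function name `f1_from_counts` and the example counts are illustrative assumptions, not values from this competition.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 80 true positives, 20 false positives, 40 false negatives.
precision, recall, f1 = f1_from_counts(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a model cannot score well by maximizing precision or recall alone.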

This notebook uses the Natural Language Processing with Disaster Tweets Kaggle competition dataset (https://www.kaggle.com/c/nlp-getting-started/data), which is a publicly hosted derivative of the original dataset published by Figure Eight (now owned by Appen, appen.com). Predictions are scored by submitting them to the ongoing competition. Manual submission has achieved a score of 78.87%.

* There is a data leak for this competition that enables 100% accuracy. This notebook isn't striving for said accuracy, but instead looks to explore the efficacy of differing model architectures on the data. 

Work to do:
1. Import Data and Libraries
2. EDA (Exploratory Data Analysis)
3. Data Preprocessing
4. Build Model(s) and Evaluate Results
   - Simple RNN
   - LSTM
   - GRU
5. Make Predictions on Validation Set
6. Submit Predictions for Scoring
7. Explain with Markup Code and Comments

## TLDR Outcomes:


In [None]:
import pandas as pd
outcomes = pd.DataFrame({"Model":['Vanilla RNN','Vanilla RNN','Vanilla RNN','LSTM','LSTM','GRU' ],
'Decision Threshold':[.2,.34,.5,.5,.5,.5],
"Leaderboard Score":[.70364,.73521,.75574,.77352,.77597,.78853]})
outcomes

# Import Data and Libraries
Let's get started by importing some libraries, and ensuring our NLP tools work.

In [None]:
 
import warnings
warnings.filterwarnings('ignore')
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import string
import tensorflow as tf
AUTOTUNE = tf.data.AUTOTUNE
import keras
from keras import layers
import matplotlib.pyplot as plt
 
import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename)) 

In [None]:
#test our NLP tools before we move on...
!pip install nltk
!pip install spacy
#spacy.load("en_core_web_sm")
# nltk for stopwords 
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwordset = set(stopwords.words('english'))
print("Stopwords:",stopwordset)



In [None]:
# spacy for lemmatization 
import spacy
!python -m spacy download en_core_web_sm
print("\n\nSpacy:",spacy.__version__)
nlp = spacy.load('en_core_web_sm', disable=['parser','ner'])

sentence = "The striped bats are hanging on their feet for best sleeping style."
print("Sample Sentence: ",sentence)
doc = nlp(sentence)
sent = " ".join([token.lemma_ for token in doc])
print("Lemmatized: ",sent)
sen = [x if x not in stopwordset else "" for x in sent.split(" ")]
print("Sans Stopwords:",sen)

# Import Data

With NLP tools at the ready we can import the data and characterize it.

In [None]:
dftr = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
print("Length of Training Data:", len(dftr))
print(dftr.info())
dftr.head()

#ID, text, and target have 0 null values

In [None]:
dfval = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
print("Length of Validation Data:", len(dfval))
print(dfval.info())
dfval.head()

#ID, & text have 0 null values

In [None]:
dfsub = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')
print("Length of Submission.csv",len(dfsub))
print("Submission Length Match Validation Length?", len(dfval)==len(dfsub))
dfsub.head()


Great. At this point we know we have 7613 training observations with labels in train.csv. We also have 3263 validation observations without labels in test.csv. The sample submission length matches the test.csv length, so we can be reasonably sure the test.csv data is the validation set.

# Exploratory Data Analysis

Let's characterize our data. I read through the tweets and kept a list of the things I saw. And behold, the rows below are not properly labeled...




In [None]:
ids_targeted_error = [328,443,513,2619,3640,3900,4342,5781,6552,6554,6570,6701,6702,6729,6861,7226]
dftr[dftr['id'].isin(ids_targeted_error)]

In [None]:
# fix them
print("initial sum of targets:",dftr.target.sum())
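# assign through a single .loc call so the write lands on dftr itself (avoids chained assignment)
dftr.loc[dftr.id.isin(ids_targeted_error), 'target'] = 0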
print("final sum of targets:",dftr.target.sum())

dftr.target.value_counts()

Those 16 targets are fixed, and we've definitely not got a balanced dataset. But we can move on.

We'll begin by visualizing the length of the tweets (in words), getting the words from the tweets (dictionary), defining the word counts, and then understanding which words in the validation set are not present in the training set. 

In [None]:
print("Max Words in Train Tweets", max([len(x.split(" ")) for x in dftr.text]))
plt.hist([len(x.split(" ")) for x in dftr.text])
plt.xlabel("Words Per Tweet")
plt.ylabel("Occurrences")
plt.show()

In [None]:
print("Max Words in Test Tweets", max([len(x.split(" ")) for x in dfval.text]))
plt.hist([len(x.split(" ")) for x in dfval.text])
plt.xlabel("Words Per Tweet")
plt.ylabel("Occurrences")
plt.show()

In [None]:
# function to lemmatize sentence and remove stop words
def lemmat(sentence):
    doc = nlp(sentence)
    lemmed = " ".join([token.lemma_ for token in doc])
    nostop = [x if x not in stopwordset else "" for x in lemmed.split(" ")]
    return nostop

# function to remove numbers, punctuation, and make lowercase
def depunc(word):
    # raw string so the backslash is kept literally instead of being read as an escape
    punc = r'''0123456789!()-[]{};:'"\,<>./?@#$%^&*_~'''
    for ele in word:
        if ele in punc:
            word = word.replace(ele, "")
    word = word.lower()
    return word

In [None]:
# define wordlist dictionaries
trdict = {}
tedict = {}

# make columns in dataframes for cleaned data
dftr['wordlist'] = ''
dfval['wordlist'] = ''

# get the training text by line
for i in range(len(dftr.text)):
    # get the words split out lemmatized, and remove stop words
    frag = lemmat(dftr.text[i])
    #make an interim list for dataframe insertion
    ww=[]
    # process each word
    for j in frag:
        # remove the punctuation, numbers, flip to lowercase
        w = depunc(j)
        # if wordlength > 0
        if len(w)>0:
            #add word to interim list
            ww.append(w)
            # add to the dictionary
            if w in trdict.keys():
                trdict[w] += 1
            else: 
                trdict[w] = 1
    #put wordlist in dataframe
    dftr.wordlist.iat[i]= ww        

# get the validation text by line
for i in range(len(dfval.text)):
    # get the words split out lemmatized, and remove stop words
    frag = lemmat(dfval.text[i])
    #make an interim list for dataframe insertion
    ww=[]
    # process each word
    for j in frag:
        # remove the punctuation, numbers, flip to lowercase
        w = depunc(j)
        # if wordlength > 0
        if len(w)>0:
            #add word to interim list
            ww.append(w)
            # add to the dictionary
            if w in tedict.keys():
                tedict[w] += 1
            else: 
                tedict[w] = 1
                
    #put wordlist in dataframe
    dfval.wordlist.iat[i]= ww 
    
print('Word Dictionaries and Dataframe Populated')

In [None]:
# # delete new line dictionary items (not exhaustive)
# trdict.pop('\n'), tedict.pop('\n')
# trdict.pop('\n\n'), tedict.pop('\n\n')

In [None]:
print("Training Dictionary Size:",len(trdict.keys()))
top20 = sorted([[v,k] for k,v in trdict.items()],reverse=True)[0:20]
print("Top 20 Words by Count:", top20)
plt.barh([k for v,k in top20],[v for v,k in top20])
plt.gca().invert_yaxis()
plt.show()

print("\n\nValidation Dictionary Size:",len(tedict.keys()))
top20v = sorted([[v,k] for k,v in tedict.items()],reverse=True)[0:20]
print("Top 20 Words by Count:", top20v)
plt.barh([k for v,k in top20v],[v for v,k in top20v])
plt.gca().invert_yaxis()
plt.show()



In [None]:
# collect validation words that never appear in the training dictionary
# note: tedict keys are unique, so every word is recorded exactly once here;
# look a word up in tedict for its true validation frequency
val_unique_words = {}
for i in tedict.keys():
    if i not in trdict.keys():
        if i in val_unique_words.keys():
            val_unique_words[i] +=1
        else:
            val_unique_words[i] = 1

print("Words in Validation but NOT in Training:", len(val_unique_words.keys()))
top20diff = sorted([(v,k) for k,v in val_unique_words.items()],reverse=False)
print("Top 20 by Occurrence:",top20diff[0:20])
occurences = [v for k,v in val_unique_words.items()]
avgoccurences = sum(occurences)/len(occurences)
print("Average Number of Occurrences in List:", avgoccurences)
plt.barh([k for v,k in top20diff][0:20],[v for v,k in top20diff][0:20])
plt.gca().invert_yaxis()
plt.show()

    

This is good. There are ~20K words in the training set dictionary and ~11K words in the validation set dictionary. About 5.9K validation words have no match in the training set dictionary. The cell above records each of them once, because it iterates over unique dictionary keys; their actual frequencies are stored in `tedict`. These unmatched words are where we would expect most of the typos, URLs, and other made-up words to land.
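To see how often the validation-only words actually occur, one option is to carry their frequencies over from the validation counts. This is a minimal sketch with `collections.Counter`; the token lists `train_tokens` and `val_tokens` are made-up placeholders, not the notebook's data.

```python
from collections import Counter

# Tiny made-up token lists standing in for the cleaned tweets.
train_tokens = ["fire", "flood", "fire", "storm"]
val_tokens = ["fire", "quake", "quake", "tsunami"]

train_counts = Counter(train_tokens)
val_counts = Counter(val_tokens)

# Keep the validation frequency of every word the training set never saw.
unseen = Counter({w: c for w, c in val_counts.items() if w not in train_counts})

print(unseen.most_common(2))
print("average occurrences:", sum(unseen.values()) / len(unseen))
```

With real data, `unseen.most_common(20)` would rank the unmatched words by how often they appear in the validation tweets.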

# Data Preprocessing
We have already lemmatized and removed the stop words above (common words that aren't very helpful) and populated the dataframes with word lists (tokens). Let's ensure that worked well before moving on.

In [None]:
dftr.head()

In [None]:
dfval.head()

Great. Our wordlist columns seem to have populated with lemmatized tokens and no stopwords. The depunc function took out the punctuation and the numbers. Let's move on to making embeddings.

For our embedding indices, we take both of the dictionaries above and make sets from their keys. Combining the sets removes duplicates by construction. We then turn the final product back into a list so that each word has a fixed position, which serves as its integer index.

In [None]:
# make sets
trset = set(trdict.keys())
teset = set(tedict.keys())

# make finset
finset = trset.copy()
for i in teset:
    finset.add(i)

# back to a list
finlist = list(finset)
print(len(finlist))
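One caveat worth noting: the iteration order of a set of strings can differ between Python sessions, so the index a word receives from `list(set)` is not guaranteed to be the same on every run. If stable indices matter, sorting the union is one option. The sketch below uses made-up dictionaries, and `vocab` and `word_to_index` are hypothetical names.

```python
# Made-up stand-ins for the two word-count dictionaries.
train_counts = {"fire": 3, "flood": 1}
val_counts = {"fire": 2, "quake": 1}

# The | operator unions the two key sets; sorted() fixes the order.
vocab = sorted(set(train_counts) | set(val_counts))

# A dictionary gives constant-time word -> index lookups.
word_to_index = {word: i for i, word in enumerate(vocab)}

print(vocab)
print(word_to_index["quake"])
```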

With our ~25K-word list, we simply map each word in the `wordlist` column of each dataframe to its index in the list.

In [None]:
# make columns
dftr['embeddings'] = ''
dfval['embeddings'] = ''

# build a word -> index lookup once; finlist.index() would rescan the list for every word
word_to_idx = {word: idx for idx, word in enumerate(finlist)}

# for each row of training data
for i in range(len(dftr.wordlist)):
    # create an interim list holding the index of each word in the row
    ems = [word_to_idx[j] for j in dftr.wordlist[i]]
    # push the list into the dataframe
    dftr.embeddings.iat[i] = ems

# for each row of validation data
for i in range(len(dfval.wordlist)):
    # create an interim list holding the index of each word in the row
    ems = [word_to_idx[j] for j in dfval.wordlist[i]]
    # push the list into the dataframe
    dfval.embeddings.iat[i] = ems
    
    

In [None]:

dftr.head()


In [None]:
dfval.head()

In [None]:
#ensure that we can decode our numbers to words 
print(f"finlist[{dftr.embeddings[2][3]}] == '{dftr.wordlist[2][3]}'?")
finlist[dftr.embeddings[2][3]] == dftr.wordlist[2][3]

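Great, each token now maps to an integer index that decodes back to its word.

Now we just need datasets to feed the model. We'll build those here: the first 6090 training rows (about 80%) become the train split and the remaining rows become the held-out test split. This is a simple positional slice, so class balance is not re-checked at this step. Note that the cell below encodes tokens with `tf.strings.to_hash_bucket_fast` rather than with the index lists built above.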

In [None]:
# Detect hardware and light up the GPUs/TPUs
try:
     # detect and init the TPU
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()

    # instantiate a distribution strategy
    tf.tpu.experimental.initialize_tpu_system(tpu)
    tpu_strategy = tf.distribute.TPUStrategy(tpu)
 
    # tell us what happened
    print('Running on TPU ', tpu.cluster_spec().as_dict())

except ValueError: # If TPU not found
    tpu = None
    tpu_strategy = tf.distribute.get_strategy() # Default strategy that works on CPU and single GPU
    print('No TPU found; using the default strategy (CPU or single GPU)')

print("Number of accelerators: ", tpu_strategy.num_replicas_in_sync)
print("TPU: ", tpu)


In [None]:
# Train Set
sentences = tf.constant([" ".join(x) for x in dftr.wordlist[0:6090]])
is_question = tf.constant([x for x in dftr.target[0:6090]])

# Preprocess the input strings.
hash_buckets = len(finlist)
words = tf.strings.split(sentences, ' ')
hashed_words = tf.strings.to_hash_bucket_fast(words, hash_buckets)

# Test set
testin = tf.constant([" ".join(x) for x in dftr.wordlist[6090:]])
ans_question = tf.constant([x for x in dftr.target[6090:]])

# Preprocess the input strings. 
words2 = tf.strings.split(testin, ' ')
hashed_words2 = tf.strings.to_hash_bucket_fast(words2, hash_buckets)
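The hashing step above maps each token to one of `hash_buckets` integer ids without storing a vocabulary, which means distinct words can end up sharing an id (a collision). The sketch below illustrates the idea in plain Python with `hashlib.sha256`. That choice is an assumption made for illustration and is not the hash function TensorFlow uses, so these ids will not match the output of `tf.strings.to_hash_bucket_fast`. The helper name `to_bucket` is hypothetical.

```python
import hashlib


def to_bucket(word, num_buckets):
    """Map a word to a stable bucket id in [0, num_buckets)."""
    digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


tokens = ["forest", "fire", "near", "la", "ronge"]
ids = [to_bucket(t, num_buckets=1000) for t in tokens]
print(ids)

# With very few buckets, different words are forced to share ids.
tiny = {t: to_bucket(t, num_buckets=2) for t in tokens}
print(tiny)
```

Setting the bucket count to the vocabulary size, as this notebook does, keeps collisions fairly rare but does not rule them out.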



# Build Models

To begin, we'll use a very simple RNN layout in TensorFlow. We have already initialized the hardware, so we just use the strategy defined above: the model has to be built inside the strategy scope to use the hardware (GPU/TPU) during training and testing.

# - Simple RNN Model

In [None]:
with tpu_strategy.scope():
        # Build the Keras model.
    keras_model = tf.keras.Sequential([
            tf.keras.layers.Input(batch_input_shape=[1,None], dtype=tf.int64, ragged=True),
            tf.keras.layers.Embedding(len(finlist), 16),
            tf.keras.layers.SimpleRNN(256),

            tf.keras.layers.Dense(256),
            tf.keras.layers.Activation(tf.nn.relu),
            tf.keras.layers.Dense(1)
        ])

# summary() prints the table itself and returns None, so no print() is needed
keras_model.summary()

keras_model.compile(loss=tf.keras.losses.BinaryFocalCrossentropy(
                                apply_class_balancing=False,
                                alpha=0.25,
                                gamma=2.5,
                                from_logits=True,
                                label_smoothing=0.0,
                                axis=-1,
                                reduction= tf.keras.losses.Reduction.AUTO,
                                name='binary_focal_crossentropy'
                                    ), 
                            optimizer=tf.optimizers.legacy.RMSprop(
                                learning_rate=0.0012
                                    ), 
                            metrics=['accuracy']
                                   )

# define model save location
checkpoint_filepath = '/kaggle/working/RNN/'
!mkdir -p {checkpoint_filepath}
                                           
checkpoints  = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=False,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

# Run the Model        
hist = keras_model.fit(
            x= hashed_words,
            y= is_question, 
            epochs=22,
            callbacks = [checkpoints],
            validation_data=(hashed_words2, ans_question)
            )
print(keras_model.predict(hashed_words))

In [None]:
plt.plot(hist.history['accuracy'])
plt.plot(hist.history['val_accuracy'])
plt.legend(['train','test'],loc='upper left')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.show()

plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.legend(['train','test'],loc='upper left')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.show()


We have a working model, but it overfits the training data quickly. The best iteration was saved by the callback, so we can load that below and submit it for scoring. This model only scores about 75% on the leaderboard, though.
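The checkpoint callback keeps the epoch with the highest `val_accuracy`. The same selection can be reproduced from a history dictionary, which helps when reading the curves. The numbers below are invented for illustration and are not results from this notebook.

```python
# Invented per-epoch metrics shaped like `hist.history`.
history = {
    "accuracy": [0.70, 0.82, 0.90, 0.96, 0.99],
    "val_accuracy": [0.68, 0.75, 0.77, 0.76, 0.74],
}

val_acc = history["val_accuracy"]

# Index of the epoch with the highest validation accuracy.
best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__)

# Gap between train and validation accuracy at each epoch; a gap that
# keeps widening after the best epoch is the overfitting pattern.
gaps = [round(t - v, 2) for t, v in zip(history["accuracy"], val_acc)]

print(f"best epoch (0-based): {best_epoch}, val_accuracy={val_acc[best_epoch]}")
print("train/val gap per epoch:", gaps)
```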



Let's move on to an LSTM model.

# - LSTM Model

In [None]:
with tpu_strategy.scope():
        # Build the Keras model.
    keras_model2 = tf.keras.Sequential([
            tf.keras.layers.Input(batch_input_shape=[1,None], dtype=tf.int64, ragged=True),
            tf.keras.layers.Embedding(len(finlist), 16), 
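            # note: the Keras defaults are activation='tanh' and recurrent_activation='sigmoid';
            # the two are swapped in this configuration
            tf.keras.layers.LSTM(256, 
                                 activation='sigmoid',
                                 recurrent_activation='tanh',
                                 use_bias=False,
                                ), 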
            tf.keras.layers.Dense(256),
            tf.keras.layers.Activation(tf.nn.relu),
            tf.keras.layers.Dense(1)
        ])

# summary() prints the table itself and returns None, so no print() is needed
keras_model2.summary()

keras_model2.compile(loss=tf.keras.losses.BinaryFocalCrossentropy(
                                apply_class_balancing=False,
                                alpha=0.25,
                                gamma=2.5,
                                from_logits=True,
                                label_smoothing=0.0,
                                axis=-1,
                                reduction= tf.keras.losses.Reduction.AUTO,
                                name='binary_focal_crossentropy'
                            ), 
                            optimizer=tf.optimizers.legacy.RMSprop(
                                learning_rate=0.0012,
                                #jit_compile=True
                            ), 
                            metrics=['accuracy']
                           )

# define model save location
checkpoint_filepath = '/kaggle/working/LSTM/'
!mkdir -p {checkpoint_filepath}
                                           
checkpoints  = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=False,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

# Run the Model        
hist2 = keras_model2.fit(
            hashed_words, 
            is_question, 
            epochs=22,
            callbacks = [checkpoints],
            validation_data=(hashed_words2, ans_question)
            )
print(keras_model2.predict(hashed_words))

In [None]:
plt.plot(hist2.history['accuracy'])
plt.plot(hist2.history['val_accuracy'])
plt.legend(['train','test'],loc='upper left')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.show()

plt.plot(hist2.history['loss'])
plt.plot(hist2.history['val_loss'])
plt.legend(['train','test'],loc='upper left')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.show()


Great. We have trained an LSTM model. Once again, the model is powerful enough to overfit the data, and we can see this in the graphs: test set loss came way down, but then started rising again. The callback saved the best model for scoring below.

This model scored a little bit better than the first. The LSTM implementation did achieve a higher accuracy than the vanilla RNN above, but not by much. I tried a number of different layouts including 32, 64, 128, 256, 512, and 1024 units. I may need to add more complexity to the model to see better results.

Let's move on to the GRU builds.

# - GRU Model

In [None]:
with tpu_strategy.scope():
        # Build the Keras model.
    keras_model3 = tf.keras.Sequential([
            tf.keras.layers.Input(batch_input_shape=[1,None], dtype=tf.int64, ragged=True),
            tf.keras.layers.Embedding(len(finlist), 16),
            tf.keras.layers.GRU(256,
                                           return_sequences=False,
                                           return_state=False),
            tf.keras.layers.Dense(256),
            tf.keras.layers.Activation(tf.nn.relu),
            tf.keras.layers.Dense(1)
        ])

# summarize the GRU model built above (summary() prints the table itself)
keras_model3.summary()

keras_model3.compile(loss=tf.keras.losses.BinaryFocalCrossentropy(
                                apply_class_balancing=False,
                                alpha=0.25,
                                gamma=2.5,
                                from_logits=True,
                                label_smoothing=0.0,
                                axis=-1,
                                reduction= tf.keras.losses.Reduction.AUTO,
                                name='binary_focal_crossentropy'
                            ), 
                            optimizer=tf.optimizers.legacy.RMSprop(
                                learning_rate=0.0012,
                                #jit_compile=True
                            ), 
                            metrics=['accuracy']
                              )
# define model save location
checkpoint_filepath = '/kaggle/working/GRU/'
!mkdir -p {checkpoint_filepath}
                                           
checkpoints  = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=False,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

# Run the Model        
hist3 = keras_model3.fit(
            hashed_words, 
            is_question, 
            epochs=22,
            callbacks = [checkpoints],
            validation_data=(hashed_words2, ans_question)
            )
print(keras_model3.predict(hashed_words))

In [None]:
plt.plot(hist3.history['accuracy'])
plt.plot(hist3.history['val_accuracy'])
plt.legend(['train','test'],loc='upper left')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.show()

plt.plot(hist3.history['loss'])
plt.plot(hist3.history['val_loss'])
plt.legend(['train','test'],loc='upper left')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.show()


Again we see overfitting in the graphs. But at least this model got the highest validation accuracy. Since the models are saved, we'll move ahead and make some submissions to see which one is best.

# Generate Submissions

To generate a submission, we need to make sure that all rows in the validation set have words after cleaning. Then we will move on to loading a model, making predictions, populating the predictions into the dataframe, and finally outputting the dataframe to a csv. I've done this manually for each model above, and will leave the best one for the final run.

In [None]:
# Find validation rows with zero words and input 'no'
empty_rows = [row for row, words in enumerate(dfval.wordlist) if len(words) == 0]
for row in empty_rows:
    dfval.wordlist.iat[row] = ['no']
    print("Fixed a row.")

In [None]:
# Preprocess the input strings 
valts = tf.constant([" ".join(x) for x in dfval.wordlist])
hash_buckets = len(finlist)
words22 = tf.strings.split(valts, ' ')
hashed_words22 = tf.strings.to_hash_bucket_fast(words22, hash_buckets)

# copy the dataframe to a submission frame
dfsub = dfval.copy()

In [None]:
#load the model
modelp = tf.keras.models.load_model('/kaggle/working/GRU/')
# modelp = tf.keras.models.load_model('/kaggle/working/LSTM/')
# modelp = tf.keras.models.load_model('/kaggle/working/RNN/')

#make predictions and append them to the dataframe
outs = modelp.predict(hashed_words22)
# predict() returns shape (n, 1); flatten it to one value per row
dfsub['preds'] = outs.ravel()
dfsub.head()

In [None]:
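#set a threshold for predictions, and then make an integer prediction 
# note: the final Dense(1) layer has no activation, so preds are logits and this cutoff is on the logit scale
dfsub['predsfin'] = [1 if x > .5 else 0 for x in dfsub['preds']]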
dfsub.head()
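Because the final `Dense(1)` layer has no activation and the loss is built with `from_logits=True`, the values in `preds` are logits rather than probabilities. A cutoff of 0.5 on a logit corresponds to a probability of about 0.62, not 0.5. The sketch below shows the conversion with a plain-Python sigmoid. The helper names and the example logits are made up for illustration.

```python
import math


def sigmoid(z):
    """Convert a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(prob):
    """Inverse of sigmoid: the logit cutoff matching a probability cutoff."""
    return math.log(prob / (1.0 - prob))


example_logits = [-2.0, 0.0, 0.3, 0.5, 1.5]  # made-up model outputs
probs = [sigmoid(z) for z in example_logits]

# Thresholding logits at 0.5 versus thresholding probabilities at 0.5.
by_logit = [1 if z > 0.5 else 0 for z in example_logits]
by_prob = [1 if p > 0.5 else 0 for p in probs]

print([round(p, 3) for p in probs])
print(by_logit, by_prob)
print(round(sigmoid(0.5), 3), logit(0.5))
```

The two rules disagree for logits between 0 and 0.5, so the decision thresholds reported in this notebook are best read as logit cutoffs.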

In [None]:
# reset the dataframe to the correct size and column names and then output the csv.
dfout = dfsub[['id','predsfin']]
dfout.columns = ['id','target']
dfout.to_csv('submission.csv', index=False)
!head submission.csv

# Conclusion

This was a good challenge. People use words very differently, and a computer is only able to discern the meaning 3/4 of the time with the tools used in this notebook. With a best score of 78%, the GRU model architecture seems to be the most apt for this job. Not only is the training faster, the results are better too. 

What I found interesting is the ability of each of the models (Simple RNN, LSTM, and GRU) to overfit the data incredibly rapidly. Turning down the learning rate, reducing the units in the recurrent layer (1024, 512, 256, 128, 64, and 32 units were tried for each model), or changing the optimizer only helped a little. This tells me that the RNN model architecture is incredibly powerful but hard to dial in. In the future I'd like to apply the Optuna module to tune those hyperparameters.

I did find that the best results came from models that used the focal loss function. This loss down-weights examples the model already classifies confidently, which concentrates training on the hard cases near the boundary between the classes. This seemed to really help the classifier make good decisions. In the future I would add the hyperparameters for this loss function to my Optuna optimization routine.
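To make the effect of the focal term concrete, here is a minimal sketch that assumes the standard form: binary cross-entropy multiplied by $(1 - p_t)^\gamma$, where $p_t$ is the probability given to the true class. It leaves out the `alpha` class weighting, which as far as I understand Keras only applies when `apply_class_balancing=True`. The function name `focal_bce` and the example probabilities are illustrative, and this is not the Keras implementation.

```python
import math


def focal_bce(y_true, prob, gamma=2.5):
    """Binary cross-entropy scaled by the focal factor (1 - p_t) ** gamma."""
    p_t = prob if y_true == 1 else 1.0 - prob  # probability given to the true class
    bce = -math.log(p_t)
    return (1.0 - p_t) ** gamma * bce


easy = focal_bce(y_true=1, prob=0.95)  # confident and correct
hard = focal_bce(y_true=1, prob=0.30)  # wrong side of the boundary

print(f"easy example loss: {easy:.6f}")
print(f"hard example loss: {hard:.6f}")
print(f"plain BCE ratio hard/easy: {math.log(0.30) / math.log(0.95):.1f}")
print(f"focal ratio hard/easy:     {hard / easy:.1f}")
```

With `gamma=0` the focal factor is 1 and the loss reduces to plain binary cross-entropy; larger values of `gamma` shrink the contribution of easy examples more aggressively.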