# Lab 6: Recurrent neural networks

In this lab you will use a recurrent neural network to predict whether a *tweet* is about a real disaster. To do this, we will use *Kaggle.com*'s competition [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started). Please follow the competition directions to obtain the data and evaluate your final model, noting the extra requirements below. **There is no requirement to actually submit your results to the competition.**

**Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.**

**Requirements**
 - Keras's `TextVectorization` functionality must be used, although it need not be part of the model
 - `train.csv` should be split into training and validation sets 
 - the heart of your model must only use recurrent layers chosen from those [available in Keras](https://keras.io/api/layers/#recurrent-layers)
 - an embedding layer should be used; this can be learned along with the main task or use the [GloVe](https://github.com/stanfordnlp/GloVe) or [word2vec]() pretrained word embeddings
 - the evaluation metric for this dataset is the [F1-Score](https://www.kaggle.com/c/nlp-getting-started/overview/evaluation)

**Grading:** 

 - 50% of the grade will come from FINAL, error-free code written in Python/Keras that accomplishes all the steps outlined  
 - 50% will come from descriptive comments associated with that code, where the comments explain what the code is doing and why it is important to the overall objective; see example below
 
```
def one_hot_encode_token(token):
    """This function can be used to convert integer encoded vectors to one-hot-encoded vectors.
    It processes one integer at a time and requires that vocabulary indexing already be done. 
    input: 
        token: an integer, e.g., 3
    return:
        vector: a one-hot vector of vocabulary length, [0, 0, 0, 1, 0,...]
    """
    vector = np.zeros((len(vocabulary),))
    vector[token] = 1
    return vector
```


**What to submit:**
- a copy of this notebook with:
    - final, well-commented, error-free code in Python/Keras
    - all code cells executed and output visible
- a `submission.csv` file containing the predictions of your final model on the `test.csv` data
- the final version of your model saved as a `Group_#_Lab_6.keras` file

**What NOT to submit:**
 - data files



## Group 10


Keshav Yadav - 0770087

Sri Sankeerth Koduru - 0768993

Dilpreet Singh - 0771612

Siva Sai Chaitanya Varma Sykam - 0770796

In [58]:
# Importing packages (unused and duplicate imports removed)
import string

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Dropout, Embedding, Flatten, LSTM, TextVectorization
from tensorflow.keras.models import Sequential

In [59]:
# Reading the dataset and looking at the top 5 values of the dataset
data = pd.read_csv('train.csv')
data.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


#### Cleaning the data

In [60]:
""" Cleaning the data by dropping the id, keyword, and location columns from the dataframe.
    input: 
        The entire training dataset
    return:
        Dataset with only the text and target columns """
data = data.drop(['id','keyword','location'],axis=1)

In [61]:
""" Creating the Vectorizer class with a standardize function. The standardize function removes all punctuation
    and converts all letters to lowercase, so that only plain, normalized text remains.
    input: 
        string: All residents asked to 'shelter in place' are
    return:
        string: all residents asked to shelter in place are """
class Vectorizer:
    def standardize(self, input_text):
        text = input_text.lower()
        return "".join(char for char in text if char not in string.punctuation)

v1 = Vectorizer()
data['text'] = data['text'].apply(lambda x : v1.standardize(x))
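
As a quick check of what `standardize` does to a tweet, the snippet below re-creates the same logic as a plain function (the function form and the sample strings are illustrative, not part of the lab). Punctuation is deleted rather than replaced by a space, so hyphenated words are glued together. The `TextVectorization` layer configured later with `standardize='lower_and_strip_punctuation'` performs a similar cleanup, so this manual step is probably redundant.

```python
import string


def standardize(input_text):
    """Lowercase the text and drop ASCII punctuation, mirroring Vectorizer.standardize."""
    text = input_text.lower()
    return "".join(char for char in text if char not in string.punctuation)


# Illustrative sample strings (not taken from the dataset)
cleaned = standardize("13,000 people receive #wildfires evacuation orders!")
glued = standardize("shelter-in-place")

print(cleaned)
print(glued)
```

Only the characters in `string.punctuation` are removed, so non-ASCII symbols such as curly quotes or emoji would pass through unchanged.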

In [62]:
# Looking at the top 5 rows of the clean data
data.head()

|   | text | target |
|---|------|--------|
| 0 | our deeds are the reason of this earthquake ma... | 1 |
| 1 | forest fire near la ronge sask canada | 1 |
| 2 | all residents asked to shelter in place are be... | 1 |
| 3 | 13000 people receive wildfires evacuation orde... | 1 |
| 4 | just got sent this photo from ruby alaska as s... | 1 |


#### Vectorizing the text

In [63]:
""" Converting the text column of the dataframe into a tf.data.Dataset so that the vectorization layer can be adapted on the text data.
    input: 
        dataframe column: data['text']
    return:
        dataset: text """
text = tf.data.Dataset.from_tensor_slices((data['text']))

In [64]:
""" The following steps are done in this code chunk:
    The vectorization layer is created using TextVectorization. It builds the vocabulary and tokenizes the text.
    The .adapt function is called on the text to build the word dictionary that maps each word to an integer token.
    vectorize_layer is applied to data['text'] to convert the text to numeric data.
    The sequences are padded with trailing zeros to a common length so that every tweet has the same shape and no tokens are dropped.
    The padded sequence length is stored in `length` for use by the embedding layer.
    input: 
        string: i am good
    return:
        numpy array:[15,27,1302]  """
vectorize_layer = TextVectorization(output_mode='int',max_tokens=20000, standardize='lower_and_strip_punctuation',split='whitespace')
vectorize_layer.adapt(text)
vectorized_text = vectorize_layer(data['text'])
Final_Text = tf.keras.preprocessing.sequence.pad_sequences(vectorized_text, maxlen=None, dtype='int64', padding='post')
length = Final_Text.shape[1]
length

31
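
To make the integer encoding less of a black box, here is a minimal pure-Python sketch of the idea behind an integer-mode vectorizer: count tokens, assign smaller indices to more frequent tokens, and reserve index 0 for padding and index 1 for out-of-vocabulary words. This is an illustration only, not Keras's implementation; the toy tweets, the alphabetical tie-breaking, and the helper `encode` are assumptions made for the example.

```python
from collections import Counter

# Toy corpus standing in for data['text'] (illustrative only)
tweets = [
    "forest fire near la ronge",
    "fire near the forest",
    "just got sent this photo",
]

# Count every whitespace-separated token across the corpus
counts = Counter(token for tweet in tweets for token in tweet.split())

# Most frequent tokens first; ties broken alphabetically here (an assumption,
# the real layer may order ties differently)
ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))

# Index 0 is reserved for padding, index 1 for unknown tokens
vocab = ["", "[UNK]"] + ordered
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def encode(text):
    """Map each token to its integer id, using 1 for unseen tokens."""
    return [token_to_id.get(tok, 1) for tok in text.split()]


encoded = encode("forest fire near spokane")
print(encoded)
```

Because "spokane" never appears in the toy corpus it maps to the unknown-token index, which is also what happens to test-set words that were not seen when `adapt` was called.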

#### Importing the Embedding 

In [65]:
""" The glove.6B.50d.txt file is read and a dataframe holding the 50-dimensional embedding of every word in it is created.
    input: 
        text file: glove.6B.50d.txt
    return:
        dataframe: glove_df  """
glove_dict = {}

# Each line holds a word followed by its 50 embedding values, separated by spaces
with open("glove.6B.50d.txt", encoding="utf-8") as file:
    for line in file:
        vec = line.rstrip().split()
        glove_dict[vec[0]] = vec[1:]

# One row per word, one column per embedding dimension
glove_df = pd.DataFrame(data=glove_dict).transpose() 

glove_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49
the,0.418,0.24968,-0.41242,0.1217,0.34527,-0.044457,-0.49688,-0.17862,-0.00066023,-0.6566,0.27843,-0.14767,-0.55677,0.14658,-0.0095095,0.011658,0.10204,-0.12792,-0.8443,-0.12181,-0.016801,-0.33279,-0.1552,-0.23131,-0.19181,-1.8823,-0.76746,0.099051,-0.42125,-0.19526,4.0071,-0.18594,-0.52287,-0.31681,0.00059213,0.0074449,0.17778,-0.15897,0.012041,-0.054223,-0.29871,-0.15749,-0.34758,-0.045637,-0.44251,0.18785,0.0027849,-0.18411,-0.11514,-0.78581
",",0.013441,0.23682,-0.16899,0.40951,0.63812,0.47709,-0.42852,-0.55641,-0.364,-0.23938,0.13001,-0.063734,-0.39575,-0.48162,0.23291,0.090201,-0.13324,0.078639,-0.41634,-0.15428,0.10068,0.48891,0.31226,-0.1252,-0.037512,-1.5179,0.12612,-0.02442,-0.042961,-0.28351,3.5416,-0.11956,-0.014533,-0.1499,0.21864,-0.33412,-0.13872,0.31806,0.70358,0.44858,-0.080262,0.63003,0.32111,-0.46765,0.22786,0.36034,-0.37818,-0.56657,0.044691,0.30392
.,0.15164,0.30177,-0.16763,0.17684,0.31719,0.33973,-0.43478,-0.31086,-0.44999,-0.29486,0.16608,0.11963,-0.41328,-0.42353,0.59868,0.28825,-0.11547,-0.041848,-0.67989,-0.25063,0.18472,0.086876,0.46582,0.015035,0.043474,-1.4671,-0.30384,-0.023441,0.30589,-0.21785,3.746,0.0042284,-0.18436,-0.46209,0.098329,-0.11907,0.23919,0.1161,0.41705,0.056763,-6.3681e-05,0.068987,0.087939,-0.10285,-0.13931,0.22314,-0.080803,-0.35652,0.016413,0.10216
of,0.70853,0.57088,-0.4716,0.18048,0.54449,0.72603,0.18157,-0.52393,0.10381,-0.17566,0.078852,-0.36216,-0.11829,-0.83336,0.11917,-0.16605,0.061555,-0.012719,-0.56623,0.013616,0.22851,-0.14396,-0.067549,-0.38157,-0.23698,-1.7037,-0.86692,-0.26704,-0.2589,0.1767,3.8676,-0.1613,-0.13273,-0.68881,0.18444,0.0052464,-0.33874,-0.078956,0.24185,0.36576,-0.34727,0.28483,0.075693,-0.062178,-0.38988,0.22902,-0.21617,-0.22562,-0.093918,-0.80375
to,0.68047,-0.039263,0.30186,-0.17792,0.42962,0.032246,-0.41376,0.13228,-0.29847,-0.085253,0.17118,0.22419,-0.10046,-0.43653,0.33418,0.67846,0.057204,-0.34448,-0.42785,-0.43275,0.55963,0.10032,0.18677,-0.26854,0.037334,-2.0932,0.22171,-0.39868,0.20912,-0.55725,3.8826,0.47466,-0.95658,-0.37788,0.20869,-0.32752,0.12751,0.088359,0.16351,-0.21634,-0.094375,0.018324,0.21048,-0.03088,-0.19722,0.082279,-0.09434,-0.073297,-0.064699,-0.26044
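
The model defined in the next section learns its embedding from scratch, so the GloVe vectors loaded here are not actually wired in. If they were to be used, one common approach is to build an embedding matrix whose row `i` holds the GloVe vector of the word with vocabulary index `i`. The sketch below shows that step with NumPy on toy data; `toy_glove`, `toy_vocabulary`, and the 3-dimensional vectors are hypothetical stand-ins for `glove_dict`, the vocabulary of `vectorize_layer`, and the real 50-dimensional vectors.

```python
import numpy as np

embedding_dim = 3  # toy size; glove.6B.50d.txt has 50 dimensions

# Hypothetical stand-in for glove_dict (values are strings, as read from the file)
toy_glove = {
    "fire": ["0.1", "0.2", "0.3"],
    "forest": ["0.4", "0.5", "0.6"],
}

# Hypothetical stand-in for the vectorizer vocabulary:
# index 0 is padding, index 1 is the unknown token
toy_vocabulary = ["", "[UNK]", "fire", "forest", "spokane"]

# Words without a GloVe vector keep an all-zero row
embedding_matrix = np.zeros((len(toy_vocabulary), embedding_dim), dtype="float32")
for index, word in enumerate(toy_vocabulary):
    vector = toy_glove.get(word)
    if vector is not None:
        embedding_matrix[index] = np.asarray(vector, dtype="float32")

print(embedding_matrix)
```

A matrix like this could then be handed to the `Embedding` layer, for example through `embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)` with `trainable=False`, provided `input_dim` and `output_dim` match its shape.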


#### Model Creation

In [66]:
""" The following functions were created:
    recall_m: calculates the recall score, which is used to calculate the f1 score
    precision_m: calculates the precision score, which is used to calculate the f1 score
    f1: calculates the f1 score (the harmonic mean of precision and recall), which is used to measure the effectiveness of the model"""
def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
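
The arithmetic behind these three functions can be checked by hand on a small example. The NumPy sketch below uses made-up labels and probabilities (they are not model outputs) and follows the same steps: round the probabilities, count true positives, then combine precision and recall.

```python
import numpy as np

# Made-up labels and predicted probabilities for six tweets
y_true = np.array([1, 1, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1])

# Round probabilities to hard 0/1 predictions
y_pred = np.round(y_prob)

true_positives = np.sum(y_true * y_pred)       # predicted 1 and actually 1
precision = true_positives / np.sum(y_pred)    # share of predicted positives that are correct
recall = true_positives / np.sum(y_true)       # share of actual positives that were found
f1_score = 2 * precision * recall / (precision + recall)

print(precision, recall, f1_score)
```

Here two of the three predicted positives are correct and two of the three real disasters are found, so precision, recall, and F1 all equal 2/3. One caveat: when a function like `f1` is passed to `metrics`, Keras evaluates it batch by batch and averages the results, so the reported value can differ somewhat from the F1 computed over the whole dataset at once.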

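In [67]:
""" Creating and training a recurrent neural network (LSTM) model.
    input: 
        Final_Text and data['target']
    return:
        loss, accuracy, and f1 score on the validation set """
# Hold out 20% of train.csv as a validation set
x_train, x_val, y_train, y_val = train_test_split(Final_Text, data['target'],test_size= 0.2,random_state=411)

model = Sequential()
# Learned 50-dimensional embedding for each of the 20,000 vocabulary tokens
model.add(Embedding(input_dim =  20000, output_dim = 50, input_length=length))
# Recurrent layer that returns its output at every time step
model.add(LSTM(64,activation='relu',return_sequences=True))
# Flatten the per-step outputs into a single vector for the dense classifier
model.add(Flatten())
model.add(Dense(64, activation='relu'))
# Dropout to reduce overfitting
model.add(Dropout(0.2))
model.add(Dense(32, activation='relu'))
# Single sigmoid unit: probability that the tweet is about a real disaster
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', 
              loss='binary_crossentropy',
              metrics=['acc',f1])

model.fit(x_train, y_train, epochs=10)

# Returns [loss, accuracy, f1] on the validation set
model.evaluate(x_val,y_val)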

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[4839.9765625, 0.7478660345077515, 0.7126083970069885]

#### Predicting testing data

In [76]:
# Reading the test dataset and looking at its top 5 rows
data = pd.read_csv('test.csv')
data.head()

|   | id | keyword | location | text |
|---|----|---------|----------|------|
| 0 | 0 |  |  | Just happened a terrible car crash |
| 1 | 2 |  |  | Heard about #earthquake is different cities, s... |
| 2 | 3 |  |  | there is a forest fire at spot pond, geese are... |
| 3 | 9 |  |  | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 |  |  | Typhoon Soudelor kills 28 in China and Taiwan |


In [77]:
""" Cleaning the data by dropping the id, keyword, and location columns from the dataframe.
    input: 
        The entire test dataset
    return:
        Dataset with only the text column """
data = data.drop(['id','keyword','location'],axis=1)
data1 = data.copy()

In [78]:
""" Applying the same standardize function used on the training data (the Vectorizer instance v1 created earlier).
    It removes all punctuation and converts all letters to lowercase.
    input: 
        string: All residents asked to 'shelter in place' are
    return:
        string: all residents asked to shelter in place are """
data1['text'] = data1['text'].apply(v1.standardize)

In [79]:
# Looking at the top 5 rows of the clean data
data1.head()

|   | text |
|---|------|
| 0 | just happened a terrible car crash |
| 1 | heard about earthquake is different cities sta... |
| 2 | there is a forest fire at spot pond geese are ... |
| 3 | apocalypse lighting spokane wildfires |
| 4 | typhoon soudelor kills 28 in china and taiwan |


In [80]:
""" Converting the text column of the test dataframe into a tf.data.Dataset.
    input: 
        dataframe column: data1['text']
    return:
        dataset: text """
text = tf.data.Dataset.from_tensor_slices((data1['text']))

In [81]:
""" The following steps are done in this code chunk:
    vectorize_layer (already adapted on the training text) is applied to data1['text'] to convert the text to numeric data.
    The sequences are padded with trailing zeros, or truncated, to the training length so that they match the model's input shape.
    input: 
        string: i am good
    return:
        numpy array:[15,27,1302]  """
vectorized_text = vectorize_layer(data1['text'])
# maxlen=length keeps the test sequences the same width as the training sequences
Final_Text_padd = tf.keras.preprocessing.sequence.pad_sequences(vectorized_text, maxlen=length, dtype='int64', padding='post', truncating='post')
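
The model was built for sequences of exactly `length` tokens, so the test sequences have to end up with that same width even if the longest test tweet is shorter or longer than the longest training tweet. The NumPy sketch below illustrates post-padding and post-truncation to a fixed length; the helper `pad_or_truncate` and the toy sequences are hypothetical and only mimic what a padding utility does.

```python
import numpy as np


def pad_or_truncate(sequences, target_length, pad_value=0):
    """Return an int64 array where every row has exactly target_length entries.

    Shorter sequences get trailing pad_value entries; longer ones lose their tail.
    """
    out = np.full((len(sequences), target_length), pad_value, dtype="int64")
    for row, seq in enumerate(sequences):
        kept = seq[:target_length]
        out[row, :len(kept)] = kept
    return out


# Toy token-id sequences: one shorter and one longer than the target length
fixed = pad_or_truncate([[5, 9], [3, 7, 2, 8, 4]], target_length=4)
print(fixed)
```

Padding with zeros is consistent with the vectorizer's convention that index 0 means "no token".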

In [82]:
""" Predicting the target variable
    input: 
        the text of the testing data
    return:
        Prediction  """
pred = model.predict(Final_Text_padd)

In [83]:
""" Creating a csv with the text and the predicted class (0 or 1) for each test tweet.
    input: 
        text data as well as the pred data
    return:
        Group10.csv"""
# Threshold the sigmoid outputs at 0.5 to get class labels
pred_sigmoid = np.where(pred > 0.5, 1, 0)
# Reshape the text column to (n, 1) so it can be joined with the predictions
text_col = np.asarray(data['text'])
text_col = np.expand_dims(text_col, axis=1)
csv_pred = np.concatenate((text_col, pred_sigmoid), axis = 1)
pd.DataFrame(csv_pred).to_csv("Group10.csv", index=False)
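
The file written above pairs each tweet's text with its prediction. The lab asks for a `submission.csv`, and the competition's sample submission uses two columns, `id` and `target`, so a file in that layout would need the `id` column from `test.csv`, which this notebook drops during cleaning. The sketch below shows one way such a frame could be assembled; `toy_ids` and `toy_probabilities` are made-up stand-ins for the real ids and for the output of `model.predict`.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-ins: in the notebook the ids would come from test.csv
# and the probabilities from model.predict
toy_ids = [0, 2, 3]
toy_probabilities = np.array([[0.91], [0.12], [0.50]])

submission = pd.DataFrame({
    "id": toy_ids,
    # Same 0.5 threshold as the notebook; ravel() flattens the (n, 1) predictions
    "target": (toy_probabilities.ravel() > 0.5).astype(int),
})

# Written to an in-memory buffer here; a real run would pass "submission.csv"
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that a probability of exactly 0.5 is mapped to class 0 under this threshold.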

In [85]:
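# Saving the final model. In recent Keras releases the .keras extension selects the native
# Keras format and the save_format argument is deprecated, so it is omitted here.
model.save('Group_10_Lab_6.keras')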