# Lab 6: Recurrent neural networks

In this lab you will use a recurrent neural network to predict whether a *tweet* is about a real disaster. To do this, we will use *Kaggle.com*'s competition [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started). Please follow the competition directions to obtain the data and evaluate your final model, noting the extra requirements below. **There is no requirement to actually submit your results to the competition.**

**Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.**

**Requirements**
 - Keras's `TextVectorization` functionality must be used, although it need not be part of the model
 - `train.csv` should be split into training and validation sets
 - the heart of your model must only use recurrent layers chosen from those [available in Keras](https://keras.io/api/layers/#recurrent-layers)
 - an embedding layer should be used; this can be learned along with the main task or use the [GloVe](https://github.com/stanfordnlp/GloVe) or [word2vec]() pretrained word embeddings
 - the evaluation metric for this dataset is the [F1-Score](https://www.kaggle.com/c/nlp-getting-started/overview/evaluation)
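
For reference, the F1-score is the harmonic mean of precision and recall. Written in terms of true positives ($TP$), false positives ($FP$) and false negatives ($FN$), and assuming the positive class is `target = 1` (a real disaster):

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$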

**Grading:**

 - 50% of the grade will come from FINAL, error-free code written in Python/Keras that accomplishes all the steps outlined  
 - 50% will come from descriptive comments associated with that code, where the comments explain what the code is doing and why it is important to the overall objective; see example below

```
def one_hot_encode_token(token):
    """This function can be used to convert integer encoded vectors to one-hot-encoded vectors.
    It processes one integer at a time and requires that vocabulary indexing already be done.
    input:
        token: an integer, e.g., 3
    return:
        vector: a one-hot vector of vocabulary length, [0, 0, 0, 1, 0,...]
    """
    vector = np.zeros((len(vocabulary),))
    vector[token] = 1
    return vector
```


**What to submit:**
- a copy of this notebook with:
    - final, well-commented, error-free code in Python/Keras
    - all code cells executed and output visible
- a `submission.csv` file containing the predictions of your final model on the `test.csv` data
- the final version of your model saved as a `Group_#_Lab_6.keras` file

**What NOT to submit:**
 - data files



In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv("./data/train.csv")

In [None]:
df.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


### Data cleaning

In [None]:
import re
import string

# Regular-expression helpers used to clean the raw tweet text.

def remove_url(text):
    """Remove URLs (starting with http://, https:// or www.) from a string."""
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

def remove_mentions(text):
    """Remove user mentions, i.e. an @ followed by letters or digits."""
    mention = re.compile(r'@[A-Za-z0-9]+')
    return mention.sub(r'', text)

def remove_punct(text):
    """Remove every character listed in string.punctuation."""
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

def remove_html(text):
    """Remove HTML tags such as <b> or </a>."""
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

def remove_alphanumeric(text):
    """Remove every token that contains at least one digit (e.g. 13000, abc123)."""
    alnum = re.compile(r'\w*\d\w*')
    return alnum.sub(r'', text)

def remove_newline(text):
    """Replace newline characters with a space."""
    newline = re.compile(r'\n')
    return newline.sub(r' ', text)

In [None]:
def clean(df):
    """Apply all cleaning steps to the tweets in a dataframe.

    input:
          a dataframe with a text column containing tweets
    return:
          the same dataframe (modified in place) after cleaning the text column
    """

    # Convert everything to lowercase
    df['text'] = df['text'].apply(lambda x: x.lower())

    # Remove URLs
    df['text'] = df['text'].apply(remove_url)

    # Remove @mentions
    df['text'] = df['text'].apply(remove_mentions)

    # Remove HTML tags; this runs before punctuation removal, which would
    # otherwise strip the < and > characters that the tag pattern relies on
    df['text'] = df['text'].apply(remove_html)

    # Remove punctuation
    df['text'] = df['text'].apply(remove_punct)

    # Remove tokens that contain digits
    df['text'] = df['text'].apply(remove_alphanumeric)

    # Replace newlines with spaces
    df['text'] = df['text'].apply(remove_newline)

    return df

df_clean = clean(df)
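
To see what the cleaning steps do to a single tweet, here is a condensed sketch of the same steps applied to one string. The function `clean_tweet` and the sample tweet are hypothetical and only meant for illustration; they are not part of the lab code.

```python
import re
import string

def clean_tweet(text):
    """Hypothetical single-string version of the cleaning steps."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
    text = re.sub(r'@[A-Za-z0-9]+', '', text)           # @mentions
    text = re.sub(r'<.*?>', '', text)                   # HTML tags
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\w*\d\w*', '', text)                # tokens containing digits
    text = re.sub(r'\n', ' ', text)                     # newlines
    return text

# Made-up sample tweet (an assumption, not taken from the dataset verbatim)
sample = "13,000 people receive #wildfires evacuation orders http://t.co/abc @user"
cleaned = clean_tweet(sample)
print(repr(cleaned))
```

The number `13,000` disappears entirely because punctuation removal turns it into `13000`, which the digit pattern then deletes. The leftover extra spaces are harmless, since the vectorization layer splits on whitespace.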

In [None]:
#dropping unwanted columns
df_clean= df_clean.drop(['id', 'keyword', 'location'], axis = 1)

df_clean.head()

Unnamed: 0,text,target
0,our deeds are the reason of this earthquake ma...,1
1,forest fire near la ronge sask canada,1
2,all residents asked to shelter in place are be...,1
3,people receive wildfires evacuation orders in...,1
4,just got sent this photo from ruby alaska as s...,1


### Vectorization and padding (using the Keras `TextVectorization` layer)

In [None]:
import tensorflow as tf

tf_data = tf.data.Dataset.from_tensor_slices(df_clean['text'])

In [None]:
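#initializing the keras text vectorization layer with output mode 'int', so each token is
#replaced by its integer index in the vocabulary; ngrams=2 adds word pairs alongside single words

text_vectorization = tf.keras.layers.TextVectorization(output_mode='int',
    max_tokens=None, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=2)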

In [None]:
#calling the adapt method on the layer to learn the vocabulary from the input text
text_vectorization.adapt(tf_data)

In [None]:
vocab_length = len(text_vectorization.get_vocabulary())
print(vocab_length)

72136
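
With `ngrams=2`, the layer builds its vocabulary from single words and from adjacent word pairs, which helps explain why the vocabulary is so large for a corpus of 7,613 tweets. Below is a plain-Python sketch of that token expansion. The helper `unigrams_and_bigrams` is hypothetical, is not part of Keras, and does not reproduce the layer's vocabulary lookup.

```python
def unigrams_and_bigrams(text):
    """Return the whitespace tokens of `text` followed by its adjacent-word pairs."""
    tokens = text.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = unigrams_and_bigrams("forest fire near la ronge")
print(features)
print(len(features))  # 5 unigrams + 4 bigrams
```

A tweet of n words therefore becomes 2n - 1 integer tokens. That would be consistent with the padded width of 61 reported further down if the longest cleaned tweet has 31 words.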


In [None]:
#converting the text data to vectors containing integers using the adapted layer
text_vectorized = text_vectorization(df_clean['text'])

In [None]:
#padding the vectorized data so that all the samples are of the same length before passing to the model
text_padded = tf.keras.preprocessing.sequence.pad_sequences(
    text_vectorized, maxlen=None, dtype='int32', padding='post',
    truncating='post')

In [None]:
text_padded.shape

(7613, 61)

In [None]:
max_length = text_padded.shape[1]
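
As an illustration of what `padding='post'` and `truncating='post'` mean, here is a small NumPy sketch. The helper `pad_post` is hypothetical and only mimics the behaviour for a fixed `maxlen`; it is not the Keras implementation. The fill value 0 matches the index that `TextVectorization` reserves for padding.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad (or right-truncate) each sequence to exactly `maxlen` entries."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[:maxlen]          # truncating='post' drops the tail
        padded[row, :len(trimmed)] = trimmed  # padding='post' leaves the fill value at the end
    return padded

batch = pad_post([[5, 2, 9], [7], [4, 4, 4, 4, 4]], maxlen=4)
print(batch)
```

Every row ends up with the same width, which is what the model needs in order to process the tweets as one batch.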

### Create and evaluate the model

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(text_padded, df_clean['target'],test_size= 0.2)
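
The split above is random and unseeded, so the validation scores will vary from run to run. `train_test_split` accepts `random_state` and `stratify` arguments for a reproducible split that keeps the class ratio equal in both sets. The NumPy sketch below shows the idea behind a stratified split; `stratified_split` is a hypothetical helper and the labels are made up for the example.

```python
import numpy as np

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with the same class proportions in both parts."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the positions of this class, then cut off the validation share
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_val = int(round(len(idx) * val_fraction))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(train_idx), np.array(val_idx)

# Made-up labels: 60 negatives and 40 positives
labels = np.array([0] * 60 + [1] * 40)
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), labels[val_idx].mean())
```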

In [None]:
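from tensorflow.keras import backend as K

"""
Defining functions to calculate recall, precision and F1 score.

These are custom metric functions (not callbacks) that are passed to model.compile. Keras
evaluates them on every batch during fit and evaluate and reports the running average, in
addition to the regular accuracy, so the reported F1 only approximates the dataset-level F1.
"""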
def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
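
Because a function metric like this is computed per batch and then averaged, its value can differ from the F1 computed once over the whole dataset. The NumPy sketch below uses made-up labels and predictions (an assumption for illustration only) to show the gap.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 for the positive class, computed from hard 0/1 labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up example data, split into two "batches" of four samples each
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])

global_f1 = f1_score(y_true, y_pred)
batch_f1 = np.mean([f1_score(y_true[:4], y_pred[:4]),
                    f1_score(y_true[4:], y_pred[4:])])
print(round(global_f1, 4), round(batch_f1, 4))
```

This is one reason why an F1 value logged by Keras can differ slightly from a report computed over the full validation set.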

In [None]:
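from tensorflow.keras.layers import Dense, Flatten, Embedding, LSTM, Dropout
from tensorflow.keras.models import Sequential

model = Sequential()
# Embedding: map each token id to an 8-dimensional vector learned along with the main task
model.add(Embedding(input_dim =  vocab_length, output_dim = 8, input_length=max_length))
# LSTM: the recurrent core; return_sequences=True outputs the 32-unit hidden state at every timestep
model.add(LSTM(32,activation='relu',return_sequences=True))
# Flatten the per-timestep outputs into a single vector for the classifier
model.add(Flatten())
# One sigmoid unit: the predicted probability that the tweet describes a real disaster
model.add(Dense(1, activation='sigmoid'))
# Binary cross-entropy loss; track accuracy and the custom F1 metric defined above
model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['acc',f1])
print(model.summary())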

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 61, 8)             577088    
                                                                 
 lstm (LSTM)                 (None, 61, 32)            5248      
                                                                 
 flatten (Flatten)           (None, 1952)              0         
                                                                 
 dense (Dense)               (None, 1)                 1953      
                                                                 
Total params: 584,289
Trainable params: 584,289
Non-trainable params: 0
_________________________________________________________________
None
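
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer sizes, using the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer. The variable names are chosen for this example only.

```python
vocab_length, embed_dim, lstm_units, timesteps = 72136, 8, 32, 61

# One embed_dim-sized vector per vocabulary entry
embedding_params = vocab_length * embed_dim

# Four gates, each with input weights, recurrent weights and a bias per unit
lstm_params = 4 * ((embed_dim + lstm_units + 1) * lstm_units)

# The flattened LSTM output (timesteps * lstm_units values) feeds one unit plus its bias
dense_params = timesteps * lstm_units + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```

Almost all of the trainable parameters sit in the embedding table, which follows directly from the large bigram vocabulary.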


In [None]:
model.fit(x_train, y_train, epochs=10, validation_data = (x_val, y_val), verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1d994cb3ee0>

In [None]:
model.evaluate(x_val,y_val)



[0.6983233690261841, 0.7426132559776306, 0.7175071835517883]

>### The F1 score on validation dataset is 0.7175

In [None]:
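# Predicted probabilities for the validation set
val_pred = model.predict(x_val)

# Threshold at 0.5 to turn the probabilities into 0/1 class labels
val_sigmoid = np.where(val_pred > 0.5, 1, 0)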

In [None]:
from sklearn import metrics

# Per-class precision, recall and F1 on the validation set, computed over all samples at once
report = metrics.classification_report(list(y_val),val_sigmoid)

print(report)

              precision    recall  f1-score   support

           0       0.80      0.72      0.76       849
           1       0.68      0.78      0.73       674

    accuracy                           0.74      1523
   macro avg       0.74      0.75      0.74      1523
weighted avg       0.75      0.74      0.74      1523



### Predicting on the provided test data

In [None]:
input_text = pd.read_csv("./data/test.csv")
input_text

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
...,...,...,...,...
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...
3259,10865,,,Storm in RI worse than last hurricane. My city...
3260,10868,,,Green Line derailment in Chicago http://t.co/U...
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...


In [None]:
input_text = input_text.drop(['location', 'id', 'keyword'], axis = 1)

In [None]:
input_text_clean = clean(input_text)

input_text_clean.head()

Unnamed: 0,text
0,just happened a terrible car crash
1,heard about earthquake is different cities sta...
2,there is a forest fire at spot pond geese are ...
3,apocalypse lighting spokane wildfires
4,typhoon soudelor kills in china and taiwan


In [None]:
input_text_vectorized = text_vectorization(input_text_clean['text'])

In [None]:
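# Pad or truncate to the training width (max_length); the model expects exactly that many timesteps
input_text_padded = tf.keras.preprocessing.sequence.pad_sequences(
    input_text_vectorized, maxlen=max_length, dtype='int32', padding='post',
    truncating='post')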

In [None]:
preds = model.predict(input_text_padded)

In [None]:
pred_sigmoid = np.where(preds > 0.5, 1, 0)

In [None]:
text_arr = np.asarray(input_text['text'])

text_arr = np.expand_dims(text_arr, axis=1)

text_arr.shape

(3263, 1)

In [None]:
csv_output = np.concatenate((text_arr, pred_sigmoid), axis = 1)

In [None]:
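# Name the columns so the file has a readable header instead of 0 and 1
pd.DataFrame(csv_output, columns=["text", "target"]).to_csv("submission.csv", index=False)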
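
The competition's sample submission uses two columns, `id` and `target`, whereas the file written above stores the cleaned text next to each prediction. Below is a minimal sketch of the Kaggle layout. The ids and probabilities are hypothetical stand-ins; in the notebook they would come from the `id` column of `test.csv` (read before that column is dropped) and from `model.predict`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the test ids and the model's sigmoid outputs
test_ids = np.array([0, 2, 3, 9, 11])
preds = np.array([[0.91], [0.80], [0.65], [0.30], [0.97]])

submission = pd.DataFrame({
    "id": test_ids,
    # Flatten the (n, 1) prediction array and threshold at 0.5
    "target": (preds.ravel() > 0.5).astype(int),
})

# to_csv without a path returns the CSV text, which is handy for a quick check
csv_text = submission.to_csv(index=False)
print(csv_text)
```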

### Saving the model

In [None]:
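# Save in the native Keras format implied by the .keras extension (TensorFlow 2.12 or newer);
# passing save_format='h5' here would write an HDF5 file under a .keras name
model.save('Group_11_Lab_6.keras')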