This notebook tries to classify tweets as either being about a **real** disaster or not. It is intended only for learning and practicing NLP techniques.

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 data-toggle="list"  role="tab" aria-controls="home"><p style="font-size : 25px">NLP Disaster Tweets:</p></h3>
    
0. [Imports](#0)
    
1. [Data Understanding and Feature Selection](#1)      
    
2. [Text Cleaning + Glove Embeddings](#2)
    
3. [Neural Networks](#3)
    - 3.1 [Bidirectional LSTM](#3.1)
    - 3.2 [1D CNN + Bidirectional LSTM](#3.2)
    - 3.3 [Bidirectional LSTM + FC Network on numerical features](#3.3)
    - 3.4 [Bert](#3.4)  
    
4. [Submission](#4)

5. [Conclusions](#5)

# <font size="+2" color="black"><b>0. Imports </b></font><br><a id="0"></a>

In [None]:
from nltk.corpus import stopwords    
from nltk.tokenize import word_tokenize
from textblob import TextBlob

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, SpatialDropout1D, MaxPooling1D, Conv1D, Concatenate, Bidirectional, GlobalMaxPool1D, ActivityRegularization, BatchNormalization
from keras.models import Model
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

from tqdm import tqdm

import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import re
import string

%matplotlib inline
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

nr_examples = 75

# <font size="+2" color="black"><b>1. Data Understanding and Feature Selection </b></font><br><a id="1"></a>

**Let's take our first look at the data.**

In [None]:
train_df = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test_df = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
train_df.head()

**There seem to be a lot of NaN in keyword and location. Let's take a closer look at these columns in order to see if we can use them as features or not.**

# Does location matter?

In [None]:
fig, ax = plt.subplots(figsize = (20, 5))

ax.plot(train_df['location'].value_counts().head(nr_examples).index, train_df['location'].value_counts().head(nr_examples).values)
ax.set(xlabel = 'locations', ylabel = '# of appearances')

plt.xticks(rotation = 'vertical')
plt.tight_layout()
plt.show()

train_df['haslocation'] = np.where(train_df['location'].isnull(), 0, 1)
"{:.2f}% of the entries have a location".format((len(train_df[(train_df['haslocation'] == 1)]) / len(train_df) * 100))

**66% is pretty low and we don't have a good way to replace the missing data. Let's see whether the presence of the location could be a useful feature instead of the location itself.**

In [None]:
"{:.2f}% of the entries which represent actual disasters have a location".format((len(train_df[(train_df['haslocation'] == 1) & (train_df['target'] == 1)]) / len(train_df[(train_df['target'] == 1)]) * 100)) 

In [None]:
"{:.2f}% of the entries which do not represent disasters have a location".format((len(train_df[(train_df['haslocation'] == 1) & (train_df['target'] == 0)]) / len(train_df[(train_df['target'] == 0)]) * 100)) 

**The presence/absence of the location does not seem to be correlated with the target. Let's drop location and haslocation.**

In [None]:
train_df.drop(["location", "haslocation"], axis = 1, inplace = True)
test_df.drop(["location"], axis = 1, inplace = True)
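**The same presence-vs-target comparison can be written more compactly with a boolean mask and `groupby`. This is only a sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions, not the competition data).**

```python
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration.
toy = pd.DataFrame({
    "location": ["Paris", None, "Rome", None, "Oslo", "Lima"],
    "target":   [1, 1, 0, 0, 1, 0],
})

# True where a location is present, False where it is missing.
has_location = toy["location"].notna()

# Share of rows with a location, overall and per target class.
overall_rate = has_location.mean()
rate_by_target = has_location.groupby(toy["target"]).mean()

print(f"overall: {overall_rate:.2%}")
print(rate_by_target)
```

**If the per-class rates are close to each other, as they are in this toy example, the presence of a location carries little information about the target.**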

# But what about the keyword?

In [None]:
fig, ax = plt.subplots(figsize = (20, 5))

ax.bar(np.arange(nr_examples), train_df.where(train_df['target'] == 1)['keyword'].value_counts().head(nr_examples), 0.35, color = "C0")
ax.set_ylabel('# appearances')
ax.set_title('Keyword')
ax.set_xticks(np.arange(nr_examples))
ax.set_xticklabels(train_df.where(train_df['target'] == 1)['keyword'].value_counts().head(nr_examples).index, rotation = 'vertical')

plt.tight_layout()
plt.show()

train_df['haskeyword'] = np.where(train_df['keyword'].isnull(), 0, 1)
"{:.2f}% of the entries have a keyword".format((len(train_df[(train_df['haskeyword'] == 1)]) / len(train_df) * 100)) 

**Now that's a percentage we can work with. Let's see what kinds of keywords the real disaster tweets have compared to the non-disaster ones.**

In [None]:
fig, (ax, bx) = plt.subplots(nrows = 2, ncols = 1, figsize = (20, 10))

ax.bar(np.arange(nr_examples), train_df.where(train_df['target'] == 1)['keyword'].value_counts().head(nr_examples), 0.35, color = "C1")
ax.set_ylabel('# appearances')
ax.set_title('Disaster')
ax.set_xticks(np.arange(nr_examples))
ax.set_xticklabels(train_df.where(train_df['target'] == 1)['keyword'].value_counts().head(nr_examples).index, rotation='vertical')

bx.bar(np.arange(nr_examples), train_df.where(train_df['target'] == 0)['keyword'].value_counts().head(nr_examples), 0.35, color = "C0")
bx.set_ylabel('# appearances')
bx.set_title('Non-disaster')
bx.set_xticks(np.arange(nr_examples))
bx.set_xticklabels(train_df.where(train_df['target'] == 0)['keyword'].value_counts().head(nr_examples).index, rotation='vertical')

plt.tight_layout()
plt.show()

**In both cases the keyword tends to be a negative word, sometimes misspelled, so it is not very helpful. Let us convince ourselves that the keyword matters little by comparing the sentiment of the keywords for disaster vs non-disaster tweets. While we're at it, we might as well analyze the sentiment of the text too.**

In [None]:
train_df['keyword_sentiment'] = train_df['keyword'].apply(lambda x: (TextBlob(x).sentiment[0] + 1) / 2 if type(x) == str else None)
train_df['tweet_sentiment'] = train_df['text'].apply(lambda x: (TextBlob(x).sentiment[0] + 1) / 2)
test_df['tweet_sentiment'] = test_df['text'].apply(lambda x: (TextBlob(x).sentiment[0] + 1) / 2)

In [None]:
disaster = train_df.where(train_df['target'] == 1)['keyword_sentiment'].dropna().values
not_disaster = train_df.where(train_df['target'] == 0)['keyword_sentiment'].dropna().values

fig, ax = plt.subplots(figsize = (20, 5))
ax.set_title('Disaster keyword sentiment vs Non-disaster keyword sentiment')
ax.boxplot([disaster, not_disaster])

plt.tight_layout()
plt.show()

In [None]:
disaster = train_df.where(train_df['target'] == 1)['tweet_sentiment'].dropna().values
not_disaster = train_df.where(train_df['target'] == 0)['tweet_sentiment'].dropna().values

fig, ax = plt.subplots(figsize = (20, 5))
ax.set_title('Disaster overall sentiment vs Non-disaster overall sentiment')
ax.boxplot([disaster, not_disaster])

plt.tight_layout()
plt.show()

**The keyword sentiment does not appear to matter that much. The overall tweet sentiment, however, seems to be somewhat important. Let's keep that one and drop the keyword sentiment. Because the dataset is unfortunately small, though, we will add the keyword to the text.**

In [None]:
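# Replace missing keywords with an empty string first; otherwise astype(str) would prepend the literal word 'nan'.
train_df['text'] = train_df['keyword'].fillna('').astype(str) + ' ' + train_df['text'].astype(str)
test_df['text'] = test_df['keyword'].fillna('').astype(str) + ' ' + test_df['text'].astype(str)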

train_df.drop(["keyword", "haskeyword", "keyword_sentiment"], axis = 1, inplace = True)
test_df.drop(["keyword"], axis = 1, inplace = True)

# So the location and the keyword did not help us very much, but the tweet sentiment seems to. Let's try and find more valuable features to add. We will normalize everything in order for the future neural net to learn better.

**Number of characters.**

In [None]:
train_df['nr_of_char'] = train_df['text'].str.len()
train_df['nr_of_char'] = train_df['nr_of_char'] / train_df['nr_of_char'].max()

disaster = train_df.where(train_df['target'] == 1)['nr_of_char'].dropna().values
not_disaster = train_df.where(train_df['target'] == 0)['nr_of_char'].dropna().values

fig, ax = plt.subplots(figsize = (20, 5))
ax.set_title('Disaster # of chars vs Non-disaster # of chars')
ax.boxplot([disaster, not_disaster])

plt.tight_layout()
plt.show()

test_df['nr_of_char'] = test_df['text'].str.len()
test_df['nr_of_char'] = test_df['nr_of_char'] / test_df['nr_of_char'].max()

**Number of words.**

In [None]:
train_df['nr_of_words'] = train_df['text'].str.split().str.len()
train_df['nr_of_words'] = train_df['nr_of_words'] / train_df['nr_of_words'].max()

disaster = train_df.where(train_df['target'] == 1)['nr_of_words'].dropna().values
not_disaster = train_df.where(train_df['target'] == 0)['nr_of_words'].dropna().values

fig, ax = plt.subplots(figsize = (20, 5))
ax.set_title('Disaster # of words vs Non-disaster # of words')
ax.boxplot([disaster, not_disaster])

plt.tight_layout()
plt.show()

test_df['nr_of_words'] = test_df['text'].str.split().str.len()
test_df['nr_of_words'] = test_df['nr_of_words'] / test_df['nr_of_words'].max()

**Number of unique words.**

In [None]:
train_df['nr_of_unique_words'] = train_df['text'].apply(lambda x: len(set(x.split())))
train_df['nr_of_unique_words'] = train_df['nr_of_unique_words'] / train_df['nr_of_unique_words'].max()

disaster = train_df.where(train_df['target'] == 1)['nr_of_unique_words'].dropna().values
not_disaster = train_df.where(train_df['target'] == 0)['nr_of_unique_words'].dropna().values

fig, ax = plt.subplots(figsize = (20, 5))
ax.set_title('Disaster # of unique words vs Non-disaster # of unique words')
ax.boxplot([disaster, not_disaster])

plt.tight_layout()
plt.show()

test_df['nr_of_unique_words'] = test_df['text'].apply(lambda x: len(set(x.split())))
test_df['nr_of_unique_words'] = test_df['nr_of_unique_words'] / test_df['nr_of_unique_words'].max()

**Number of punctuation marks.**

In [None]:
train_df['nr_of_punctuation'] = train_df['text'].str.split(r"\?|,|\.|\!|\"|'").str.len()
train_df['nr_of_punctuation'] = train_df['nr_of_punctuation'] / train_df['nr_of_punctuation'].max()

disaster = train_df.where(train_df['target'] == 1)['nr_of_punctuation'].dropna().values
not_disaster = train_df.where(train_df['target'] == 0)['nr_of_punctuation'].dropna().values

fig, ax = plt.subplots(figsize = (20, 5))
ax.set_title('Disaster # of punctuation signs vs Non-disaster # of punctuation signs')
ax.boxplot([disaster, not_disaster])

plt.tight_layout()
plt.show()

test_df['nr_of_punctuation'] = test_df['text'].str.split(r"\?|,|\.|\!|\"|'").str.len()
test_df['nr_of_punctuation'] = test_df['nr_of_punctuation'] / test_df['nr_of_punctuation'].max()
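**Note that splitting on punctuation counts the resulting segments, which is always one more than the number of marks. A possible alternative, sketched here on a made-up `texts` series, is to count the matches directly with `str.count`.**

```python
import pandas as pd

# Invented example tweets.
texts = pd.Series(["Fire! Run, now.", "no punctuation here", "What?!"])

# Same set of marks as in the split pattern above, written as a character class.
pattern = r"[?,.!\"']"

n_marks = texts.str.count(pattern)               # number of punctuation marks
n_segments = texts.str.split(pattern).str.len()  # number of pieces after splitting

print(pd.DataFrame({"marks": n_marks, "segments": n_segments}))
```

**After dividing by the maximum the two versions are no longer identical, since (n + 1) / (max + 1) differs from n / max, so the choice slightly changes the feature.**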

**Number of stopwords.**

In [None]:
stop_words = set(stopwords.words('english'))
train_df['nr_of_stopwords'] = train_df['text'].str.split().apply(lambda x: len(set(x) & stop_words))
train_df['nr_of_stopwords'] = train_df['nr_of_stopwords'] / train_df['nr_of_stopwords'].max()

disaster = train_df.where(train_df['target'] == 1)['nr_of_stopwords'].dropna().values
not_disaster = train_df.where(train_df['target'] == 0)['nr_of_stopwords'].dropna().values

fig, ax = plt.subplots(figsize = (20, 5))
ax.set_title('Disaster # of stopwords vs Non-disaster # of stopwords')
ax.boxplot([disaster, not_disaster])

plt.tight_layout()
plt.show()

test_df['nr_of_stopwords'] = test_df['text'].str.split().apply(lambda x: len(set(x) & stop_words))
test_df['nr_of_stopwords'] = test_df['nr_of_stopwords'] / test_df['nr_of_stopwords'].max()
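**The cells above scale each test column by the test set's own maximum. A common alternative is to reuse the training maximum, so that the same raw value maps to the same scaled value in both sets. The sketch below uses small made-up frames (`train`, `test`) to show the difference.**

```python
import pandas as pd

# Invented raw character counts.
train = pd.DataFrame({"nr_of_char": [40, 100, 140]})
test = pd.DataFrame({"nr_of_char": [70, 35]})

# Compute the scaling constant on the training data only.
train_max = train["nr_of_char"].max()

train_scaled = train["nr_of_char"] / train_max
test_scaled = test["nr_of_char"] / train_max

# For comparison: scaling the test set by its own maximum.
test_scaled_own = test["nr_of_char"] / test["nr_of_char"].max()

print(test_scaled.tolist(), test_scaled_own.tolist())
```

**With the shared constant, a 70-character tweet is scaled to 0.5 in both sets; with per-set maxima it becomes 1.0 in the test set, which the network would read as a much longer tweet.**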

In [None]:
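# Restrict to numeric columns: recent pandas versions raise an error when corr() meets text columns.
train_df.select_dtypes(include='number').corr().iplot(kind='heatmap',colorscale="Reds",title="Feature Correlation Matrix")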

**It seems like none of our features are highly correlated with the target. Nevertheless, we will try to use them in a network as an experiment.**

# <font size="+2" color="black"><b>2. Text Cleaning + Glove Embeddings </b></font><br><a id="2"></a>

**First, we'll load the Glove embeddings and see how much of the text they already cover.**

In [None]:
glove_embeddings = np.load('/kaggle/input/embfile/emb/glove.840B.300d.pkl', allow_pickle=True)

# credit to 
def build_vocab(X):
    
    tweets = X.apply(lambda s: s.split()).values      
    vocab = {}
    
    for tweet in tweets:
        for word in tweet:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1                
    return vocab

def check_embeddings_coverage(X, embeddings):
    
    vocab = build_vocab(X)    
    
    covered = {}
    oov = {}    
    n_covered = 0
    n_oov = 0
    
    for word in vocab:
        try:
            covered[word] = embeddings[word]
            n_covered += vocab[word]
        except KeyError:
            n_oov += vocab[word]
            
    vocab_coverage = len(covered) / len(vocab)
    text_coverage = (n_covered / (n_covered + n_oov))
    
    return vocab_coverage, text_coverage

train_glove_vocab_coverage, train_glove_text_coverage = check_embeddings_coverage(train_df['text'], glove_embeddings)
print('GloVe Embeddings cover {:.2%} of vocabulary and {:.2%} of text in Training Set'.format(train_glove_vocab_coverage, train_glove_text_coverage))
test_glove_vocab_coverage, test_glove_text_coverage = check_embeddings_coverage(test_df['text'], glove_embeddings)
print('GloVe Embeddings cover {:.2%} of vocabulary and {:.2%} of text in Test Set'.format(test_glove_vocab_coverage, test_glove_text_coverage))

**It seems pretty bad. We can make those percentages better by cleaning the text.**
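**To make the two percentages concrete: vocabulary coverage counts each distinct word once, while text coverage weights each word by how often it occurs. Below is a minimal sketch with `collections.Counter`; the `coverage` helper, `toy_embeddings` and `toy_texts` are made up for illustration.**

```python
from collections import Counter

def coverage(texts, embeddings):
    """Return (vocabulary coverage, text coverage) for whitespace-split texts."""
    counts = Counter(word for text in texts for word in text.split())
    covered = {word: n for word, n in counts.items() if word in embeddings}
    vocab_cov = len(covered) / len(counts)                  # distinct words found
    text_cov = sum(covered.values()) / sum(counts.values())  # occurrences found
    return vocab_cov, text_cov

# Invented embeddings and tweets.
toy_embeddings = {"fire": [0.1], "the": [0.2]}
toy_texts = ["the fire", "the the #wildfire", "fire omg"]

vocab_cov, text_cov = coverage(toy_texts, toy_embeddings)
print(f"vocabulary: {vocab_cov:.2%}, text: {text_cov:.2%}")
```

**Here 2 of 4 distinct words are covered, but those 2 words account for 5 of the 7 tokens, which is why text coverage is usually the higher of the two numbers.**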

In [None]:
def clean(tweet): 
            
    # Special characters
    tweet = re.sub(r"\x89Û_", "", tweet)
    tweet = re.sub(r"\x89ÛÒ", "", tweet)
    tweet = re.sub(r"\x89ÛÓ", "", tweet)
    tweet = re.sub(r"\x89ÛÏWhen", "When", tweet)
    tweet = re.sub(r"\x89ÛÏ", "", tweet)
    tweet = re.sub(r"China\x89Ûªs", "China's", tweet)
    tweet = re.sub(r"let\x89Ûªs", "let's", tweet)
    tweet = re.sub(r"\x89Û÷", "", tweet)
    tweet = re.sub(r"\x89Ûª", "", tweet)
    tweet = re.sub(r"\x89Û\x9d", "", tweet)
    tweet = re.sub(r"å_", "", tweet)
    tweet = re.sub(r"\x89Û¢åÊ", "", tweet)  # longer pattern first, otherwise it can never match
    tweet = re.sub(r"\x89Û¢", "", tweet)
    tweet = re.sub(r"fromåÊwounds", "from wounds", tweet)
    tweet = re.sub(r"åÊ", "", tweet)
    tweet = re.sub(r"åÈ", "", tweet)
    tweet = re.sub(r"JapÌ_n", "Japan", tweet)    
    tweet = re.sub(r"Ì©", "e", tweet)
    tweet = re.sub(r"å¨", "", tweet)
    tweet = re.sub(r"SuruÌ¤", "Suruc", tweet)
    tweet = re.sub(r"åÇ", "", tweet)
    tweet = re.sub(r"å£3million", "3 million", tweet)
    tweet = re.sub(r"åÀ", "", tweet)
    
    # Contractions
    tweet = re.sub(r"he's", "he is", tweet)
    tweet = re.sub(r"there's", "there is", tweet)
    tweet = re.sub(r"We're", "We are", tweet)
    tweet = re.sub(r"That's", "That is", tweet)
    tweet = re.sub(r"won't", "will not", tweet)
    tweet = re.sub(r"they're", "they are", tweet)
    tweet = re.sub(r"Can't", "Cannot", tweet)
    tweet = re.sub(r"wasn't", "was not", tweet)
    tweet = re.sub(r"don\x89Ûªt", "do not", tweet)
    tweet = re.sub(r"aren't", "are not", tweet)
    tweet = re.sub(r"isn't", "is not", tweet)
    tweet = re.sub(r"What's", "What is", tweet)
    tweet = re.sub(r"haven't", "have not", tweet)
    tweet = re.sub(r"hasn't", "has not", tweet)
    tweet = re.sub(r"There's", "There is", tweet)
    tweet = re.sub(r"He's", "He is", tweet)
    tweet = re.sub(r"It's", "It is", tweet)
    tweet = re.sub(r"You're", "You are", tweet)
    tweet = re.sub(r"I'M", "I am", tweet)
    tweet = re.sub(r"shouldn't", "should not", tweet)
    tweet = re.sub(r"wouldn't", "would not", tweet)
    tweet = re.sub(r"i'm", "I am", tweet)
    tweet = re.sub(r"I\x89Ûªm", "I am", tweet)
    tweet = re.sub(r"I'm", "I am", tweet)
    tweet = re.sub(r"Isn't", "is not", tweet)
    tweet = re.sub(r"Here's", "Here is", tweet)
    tweet = re.sub(r"you've", "you have", tweet)
    tweet = re.sub(r"you\x89Ûªve", "you have", tweet)
    tweet = re.sub(r"we're", "we are", tweet)
    tweet = re.sub(r"what's", "what is", tweet)
    tweet = re.sub(r"couldn't", "could not", tweet)
    tweet = re.sub(r"we've", "we have", tweet)
    tweet = re.sub(r"it\x89Ûªs", "it is", tweet)
    tweet = re.sub(r"doesn\x89Ûªt", "does not", tweet)
    tweet = re.sub(r"It\x89Ûªs", "It is", tweet)
    tweet = re.sub(r"Here\x89Ûªs", "Here is", tweet)
    tweet = re.sub(r"who's", "who is", tweet)
    tweet = re.sub(r"I\x89Ûªve", "I have", tweet)
    tweet = re.sub(r"y'all", "you all", tweet)
    tweet = re.sub(r"can\x89Ûªt", "cannot", tweet)
    tweet = re.sub(r"would've", "would have", tweet)
    tweet = re.sub(r"it'll", "it will", tweet)
    tweet = re.sub(r"we'll", "we will", tweet)
    tweet = re.sub(r"wouldn\x89Ûªt", "would not", tweet)
    tweet = re.sub(r"We've", "We have", tweet)
    tweet = re.sub(r"he'll", "he will", tweet)
    tweet = re.sub(r"Y'all", "You all", tweet)
    tweet = re.sub(r"Weren't", "Were not", tweet)
    tweet = re.sub(r"Didn't", "Did not", tweet)
    tweet = re.sub(r"they'll", "they will", tweet)
    tweet = re.sub(r"they'd", "they would", tweet)
    tweet = re.sub(r"DON'T", "DO NOT", tweet)
    tweet = re.sub(r"That\x89Ûªs", "That is", tweet)
    tweet = re.sub(r"they've", "they have", tweet)
    tweet = re.sub(r"i'd", "I would", tweet)
    tweet = re.sub(r"should've", "should have", tweet)
    tweet = re.sub(r"You\x89Ûªre", "You are", tweet)
    tweet = re.sub(r"where's", "where is", tweet)
    tweet = re.sub(r"Don\x89Ûªt", "Do not", tweet)
    tweet = re.sub(r"we'd", "we would", tweet)
    tweet = re.sub(r"i'll", "I will", tweet)
    tweet = re.sub(r"weren't", "were not", tweet)
    tweet = re.sub(r"They're", "They are", tweet)
    tweet = re.sub(r"Can\x89Ûªt", "Cannot", tweet)
    tweet = re.sub(r"you\x89Ûªll", "you will", tweet)
    tweet = re.sub(r"I\x89Ûªd", "I would", tweet)
    tweet = re.sub(r"let's", "let us", tweet)
    tweet = re.sub(r"it's", "it is", tweet)
    tweet = re.sub(r"can't", "cannot", tweet)
    tweet = re.sub(r"don't", "do not", tweet)
    tweet = re.sub(r"you're", "you are", tweet)
    tweet = re.sub(r"i've", "I have", tweet)
    tweet = re.sub(r"that's", "that is", tweet)
    tweet = re.sub(r"i'll", "I will", tweet)
    tweet = re.sub(r"doesn't", "does not", tweet)
    tweet = re.sub(r"i'd", "I would", tweet)
    tweet = re.sub(r"didn't", "did not", tweet)
    tweet = re.sub(r"ain't", "am not", tweet)
    tweet = re.sub(r"you'll", "you will", tweet)
    tweet = re.sub(r"I've", "I have", tweet)
    tweet = re.sub(r"Don't", "do not", tweet)
    tweet = re.sub(r"I'll", "I will", tweet)
    tweet = re.sub(r"I'd", "I would", tweet)
    tweet = re.sub(r"Let's", "Let us", tweet)
    tweet = re.sub(r"you'd", "You would", tweet)
    tweet = re.sub(r"It's", "It is", tweet)
    tweet = re.sub(r"Ain't", "am not", tweet)
    tweet = re.sub(r"Haven't", "Have not", tweet)
    tweet = re.sub(r"Could've", "Could have", tweet)
    tweet = re.sub(r"youve", "you have", tweet)  
    tweet = re.sub(r"donå«t", "do not", tweet)   
           
    # Urls
    tweet = re.sub(r"https?:\/\/t\.co\/[A-Za-z0-9]+", "", tweet)
        
    # Words with punctuations and special characters
    punctuations = '@#!?+&*[]-%.:/();$=><|{}^' + "'`"
    for p in punctuations:
        tweet = tweet.replace(p, '')
        
    # ... and ..
    tweet = tweet.replace('...', ' ... ')
    if '...' not in tweet:
        tweet = tweet.replace('..', ' ... ')
        
    #Spaces
    tweet = tweet.replace('  ', ' ')
    tweet = tweet.replace('   ', ' ')
        
    tweet = tweet.lower()
    
    tweet = " ".join(tweet.split())
    
    return tweet

train_df['text_cleaned'] = train_df['text'].apply(lambda s : clean(s))
test_df['text_cleaned'] = test_df['text'].apply(lambda s : clean(s))

train_glove_vocab_coverage, train_glove_text_coverage = check_embeddings_coverage(train_df['text_cleaned'], glove_embeddings)
print('GloVe Embeddings cover {:.2%} of vocabulary and {:.2%} of text in Training Set'.format(train_glove_vocab_coverage, train_glove_text_coverage))
test_glove_vocab_coverage, test_glove_text_coverage = check_embeddings_coverage(test_df['text_cleaned'], glove_embeddings)
print('GloVe Embeddings cover {:.2%} of vocabulary and {:.2%} of text in Test Set'.format(test_glove_vocab_coverage, test_glove_text_coverage))
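**The long list of contraction substitutions could also be driven by a lookup table and a single compiled pattern. This is only a sketch of that idea, not a drop-in replacement: the `CONTRACTIONS` table is a small hypothetical subset, it handles straight apostrophes only, and it returns lowercase expansions (the cleaning function lowercases the text at the end anyway).**

```python
import re

# Hypothetical subset of the contractions handled above.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}

# One alternation over all keys, matched case-insensitively on word boundaries.
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace every known contraction with its lowercase expansion."""
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

example = expand_contractions("It's late and I can't go, DON'T ask")
print(example)
```

**Adding a new contraction then means adding one dictionary entry instead of one more `re.sub` call.**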

**That's more like it. Now let's prepare our train and test corpus.**

In [None]:
def create_corpus(df):
    
    corpus = []
    
    for tweet in tqdm(df['text_cleaned']):
        words = [word.lower() for word in word_tokenize(tweet) if((word.isalpha() == 1) and word not in stop_words)]
        corpus.append(words)
        
    return corpus

train_corpus = create_corpus(train_df)
test_corpus = create_corpus(test_df)

# Fit one tokenizer on both corpora so that a word maps to the same index in
# train_pad, test_pad and the embedding matrix built from word_index below.
tokenizer_obj_train = Tokenizer()
tokenizer_obj_train.fit_on_texts(train_corpus + test_corpus)
seq_train = tokenizer_obj_train.texts_to_sequences(train_corpus)
seq_test = tokenizer_obj_train.texts_to_sequences(test_corpus)

# Pad every sequence to the length of the longest tweet in either set.
MAX_WORDS = max(len(seq) for seq in seq_train + seq_test)

train_pad = pad_sequences(seq_train, maxlen = MAX_WORDS, truncating = 'post', padding = 'post')
test_pad = pad_sequences(seq_test, maxlen = MAX_WORDS, truncating = 'post', padding = 'post')

In [None]:
word_index = tokenizer_obj_train.word_index
print('Number of unique words:',len(word_index))
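**Why a single word index for train and test matters can be shown without Keras. The sketch below uses a hypothetical `build_word_index` helper that, like a frequency-based tokenizer, gives lower indices to more frequent words (the alphabetical tie-breaking is an assumption of this sketch, not Keras' behaviour).**

```python
from collections import Counter

def build_word_index(corpus):
    """Map each word to an integer, most frequent first; 0 is left for padding."""
    counts = Counter(word for doc in corpus for word in doc)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def to_sequences(corpus, word_index):
    """Convert tokenised documents to lists of indices, skipping unknown words."""
    return [[word_index[w] for w in doc if w in word_index] for doc in corpus]

# Invented tokenised tweets.
toy_train = [["fire", "forest"], ["fire", "storm"]]
toy_test = [["storm", "storm", "flood"]]

# Separate indices: the same integer means different words in each set.
separate_train = build_word_index(toy_train)
separate_test = build_word_index(toy_test)
print(separate_train["fire"], separate_test["storm"])

# Shared index: every word has exactly one integer everywhere.
shared = build_word_index(toy_train + toy_test)
print(to_sequences(toy_test, shared))
```

**With separate indices, index 1 is "fire" in the training data but "storm" in the test data, so an embedding row learned for one word would be applied to the other at prediction time.**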

In [None]:
num_words = len(word_index) + 1
embedding_matrix = np.zeros((num_words, 300))

counter = 0

for word, i in tqdm(word_index.items()):
    emb_vec = glove_embeddings.get(word)
    if emb_vec is not None:
        embedding_matrix[i] = emb_vec
    else:
        counter += 1

del glove_embeddings
counter

**We are missing about 25% of the words in the corpus, which is kind of high, but getting to a better percentage would require a large amount of manual work on the dataset and that is not our purpose.**
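**How the embedding matrix and the missing-word count relate can be seen on a toy example. The vectors, the 3-dimensional size and the word index below are all made up; the notebook itself uses 300-dimensional vectors.**

```python
import numpy as np

EMB_DIM = 3  # illustrative size only

# Invented pretrained vectors and word index.
toy_vectors = {
    "fire": np.array([1.0, 0.0, 0.0]),
    "storm": np.array([0.0, 1.0, 0.0]),
}
toy_word_index = {"fire": 1, "storm": 2, "omg": 3}

# Row 0 stays all-zero for the padding index; row i holds the vector of word i.
toy_matrix = np.zeros((len(toy_word_index) + 1, EMB_DIM))
missing = []
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[i] = vec
    else:
        missing.append(word)  # out-of-vocabulary words keep a zero row

oov_rate = len(missing) / len(toy_word_index)
print(toy_matrix)
print(f"missing: {missing}, OOV rate: {oov_rate:.2%}")
```

**Words without a pretrained vector start from a zero row, so the network initially cannot tell them apart from padding.**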

# <font size="+2" color="black"><b>3. Neural Networks </b></font><br><a id="3"></a>

**We will try 3 types of models: Bidirectional LSTM, 1D CNN + Bidirectional LSTM and Bidirectional LSTM + FC Network on numerical features.**

**First, let's split the original training set into training and validation.**

In [None]:
x_train, x_test, y_train, y_test = train_test_split(train_pad, train_df['target'].values, test_size = 0.25, random_state = 42)

print("Shape of train", x_train.shape)
print("Shape of Validation", x_test.shape)

<font size="+1" color="black"><b>3.1 Bidirectional LSTM </b></font><br><a id="3.1"></a>

**Now, for our first model. We will try a simple bidirectional LSTM with a lot of regularization since the dataset is so small.**

In [None]:
inp = Input(shape = (MAX_WORDS, ))
x = Embedding(num_words, 300, weights = [embedding_matrix])(inp)
x = Bidirectional(LSTM(MAX_WORDS, dropout = 0.2, recurrent_dropout = 0.2, return_sequences = True))(x)
x = ActivityRegularization(l2 = 0.1)(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.2)(x)
x = BatchNormalization()(x)
x = Dense(4, activation = "relu")(x)
x = ActivityRegularization(l2 = 0.1)(x)
x = Dropout(0.2)(x)
x = Dense(1, activation = "sigmoid")(x)

model_LSTM = Model(inputs = inp, outputs = x)
model_LSTM.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

print(model_LSTM.summary())

In [None]:
checkpoint = ModelCheckpoint(
    'model_LSTM.h5', 
    monitor = 'val_loss', 
    verbose = 1, 
    save_best_only = True
)

reduce_lr = ReduceLROnPlateau(
    monitor = 'val_loss', 
    factor = 0.2, 
    verbose = 1, 
    patience = 5,                        
    min_lr = 1e-5  # must be below Adam's default learning rate of 0.001, otherwise the callback can never lower it
)

model_LSTM.fit(x_train, y_train, batch_size = 32, epochs = 15, validation_data = (x_test, y_test), callbacks = [reduce_lr, checkpoint])
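**To see how a reduce-on-plateau schedule interacts with its lower bound, here is a simplified simulation. It is not Keras' implementation (no cooldown, no minimum delta); `simulate_reduce_lr` and the loss values are hypothetical.**

```python
def simulate_reduce_lr(val_losses, lr, factor=0.2, patience=5, min_lr=0.0):
    """Return the learning rate in effect after each epoch of a plateau schedule."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Shrink the rate, but never below the floor.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# One improving epoch followed by five epochs without improvement.
losses = [1.0, 1.1, 1.1, 1.1, 1.1, 1.1]

floor_at_start = simulate_reduce_lr(losses, lr=0.001, min_lr=0.001)
floor_below = simulate_reduce_lr(losses, lr=0.001, min_lr=1e-5)
print(floor_at_start[-1], floor_below[-1])
```

**Adam's default learning rate in Keras is 0.001, so a floor equal to that value leaves the rate unchanged no matter how long the validation loss stalls.**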

**We'll finish by checking the performance on the validation set.**

In [None]:
model_LSTM.load_weights('model_LSTM.h5')
pred_LSTM = model_LSTM.predict(x_test)

acc_LSTM = accuracy_score(y_test, np.where(pred_LSTM > 0.5, 1, 0))
f1_LSTM = f1_score(y_test, np.where(pred_LSTM > 0.5, 1, 0))

print(acc_LSTM, f1_LSTM)
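**For reference, this is what the two reported numbers measure once the sigmoid outputs are thresholded at 0.5. The pure-Python sketch below is an illustration with made-up labels and probabilities, not scikit-learn's implementation.**

```python
def accuracy_and_f1(y_true, probs, threshold=0.5):
    """Threshold probabilities, then compute accuracy and F1 for the positive class."""
    y_pred = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Invented labels and predicted probabilities.
toy_true = [1, 0, 1, 1, 0, 0]
toy_probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]

toy_acc, toy_f1 = accuracy_and_f1(toy_true, toy_probs)
print(toy_acc, toy_f1)
```

**Accuracy counts all correct predictions, while F1 only looks at the disaster class, which makes it more sensitive to missed or falsely flagged disasters.**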

<font size="+1" color="black"><b>3.2 1D CNN + Bidirectional LSTM </b></font><br><a id="3.2"></a>

**For our second model, we will add a 1D convolutional layer with a kernel size of 3 hoping that it will help the model understand meaning from tri-grams.**
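**The tri-gram intuition can be checked with a small NumPy sketch: with a kernel size of 3 and the default "valid" padding, every output position is computed from 3 consecutive word vectors, and pooling with size 2 and stride 2 then halves the sequence length. All sizes and the random weights below are made up, and the bias term is left out.**

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, n_filters, kernel = 8, 4, 2, 3

x = rng.normal(size=(seq_len, emb_dim))            # one tweet: 8 word vectors
w = rng.normal(size=(kernel, emb_dim, n_filters))  # convolution weights

# Valid 1D convolution: slide a window of 3 consecutive words over the tweet.
conv = np.stack([
    np.tensordot(x[i:i + kernel], w, axes=([0, 1], [0, 1]))
    for i in range(seq_len - kernel + 1)
])
conv = np.maximum(conv, 0)  # ReLU activation

# Max pooling with pool size 2 and stride 2: keep the larger of each pair.
n_pool = conv.shape[0] // 2
pooled = conv[: n_pool * 2].reshape(n_pool, 2, n_filters).max(axis=1)

print(conv.shape, pooled.shape)
```

**An 8-word tweet therefore yields 6 tri-gram positions and 3 pooled time steps, which is the shorter sequence the LSTM receives.**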

In [None]:
inp = Input(shape = (MAX_WORDS, ))
x = Embedding(num_words, 300, weights = [embedding_matrix])(inp)
x = Conv1D(MAX_WORDS, 3)(x)
x = Activation('relu')(x)
x = MaxPooling1D(pool_size=2, strides=2)(x)
x = Bidirectional(LSTM(MAX_WORDS, dropout = 0.2, recurrent_dropout = 0.2, return_sequences = True))(x)
x = ActivityRegularization(l2 = 0.1)(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.2)(x)
x = BatchNormalization()(x)
x = Dense(4, activation = "relu")(x)
x = ActivityRegularization(l2 = 0.1)(x)
x = Dropout(0.2)(x)
x = Dense(1, activation = "sigmoid")(x)

model_CNN_LSTM = Model(inputs = inp, outputs = x)
model_CNN_LSTM.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

print(model_CNN_LSTM.summary())

In [None]:
checkpoint = ModelCheckpoint(
    'model_CNN_LSTM.h5', 
    monitor = 'val_loss', 
    verbose = 1, 
    save_best_only = True
)

model_CNN_LSTM.fit(x_train, y_train, batch_size = 64, epochs = 10, validation_data = (x_test, y_test), callbacks = [reduce_lr, checkpoint])

**As with the first model, we'll finish by checking the performance on the validation set.**

In [None]:
model_CNN_LSTM.load_weights('model_CNN_LSTM.h5')
pred_CNN_LSTM = model_CNN_LSTM.predict(x_test)

acc_CNN_LSTM = accuracy_score(y_test, np.where(pred_CNN_LSTM > 0.5, 1, 0))
f1_CNN_LSTM = f1_score(y_test, np.where(pred_CNN_LSTM > 0.5, 1, 0))

print(acc_CNN_LSTM, f1_CNN_LSTM)

<font size="+1" color="black"><b>3.3 Bidirectional LSTM + FC Network on numerical features </b></font><br><a id="3.3"></a>

**Here, we will try using the numerical features that we collected in the first chapter. In order to do that, we need to fetch them.**

In [None]:
numerical_features = train_df[['tweet_sentiment', 'nr_of_char', 'nr_of_words', 'nr_of_unique_words', 'nr_of_punctuation', 'nr_of_stopwords']].to_numpy()

In [None]:
nlp_inp = Input(shape = (MAX_WORDS, ))
x = Embedding(num_words, 300, weights = [embedding_matrix])(nlp_inp)
x = Bidirectional(LSTM(MAX_WORDS,dropout = 0.2, recurrent_dropout = 0.2, return_sequences = True))(x)
x = ActivityRegularization(l2 = 0.1)(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.2)(x)
x = BatchNormalization()(x)
x = Dense(4, activation = "relu")(x)
x = ActivityRegularization(l2 = 0.1)(x)

num_features_inp = Input(shape = (6, ), name = 'num_features_inp')
y = Dense(4, activation = "relu")(num_features_inp)
y = ActivityRegularization(l2 = 0.1)(y)

z = Concatenate()([x, y])
z = Dropout(0.2)(z)
z = Dense(1, activation = "sigmoid")(z)

model_LSTM_FC = Model(inputs = [nlp_inp, num_features_inp], outputs = z)
model_LSTM_FC.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

print(model_LSTM_FC.summary())

**Also, we will have to re-create the training and validation data since we also need the numerical features for this model.**

In [None]:
x_train_text, x_test_text, x_train_num, x_test_num, y_train, y_test = train_test_split(train_pad, numerical_features, train_df['target'].values, test_size=0.25, random_state = 42)

print("Shape of train (text)", x_train_text.shape)
print("Shape of Validation (text)", x_test_text.shape)
print("Shape of train (numerical)", x_train_num.shape)
print("Shape of Validation (numerical)", x_test_num.shape)

In [None]:
checkpoint = ModelCheckpoint(
    'model_LSTM_FC.h5', 
    monitor = 'val_loss', 
    verbose = 1, 
    save_best_only = True
)

model_LSTM_FC.fit([x_train_text, x_train_num], y_train, batch_size=32, epochs=15, validation_data=([x_test_text, x_test_num], y_test), callbacks = [reduce_lr, checkpoint])

**As with the first two models, we'll finish by checking the performance on the validation set.**

In [None]:
model_LSTM_FC.load_weights('model_LSTM_FC.h5')
pred_LSTM_FC = model_LSTM_FC.predict([x_test_text, x_test_num])

acc_LSTM_FC = accuracy_score(y_test, np.where(pred_LSTM_FC > 0.5, 1, 0))
f1_LSTM_FC = f1_score(y_test, np.where(pred_LSTM_FC > 0.5, 1, 0))

print(acc_LSTM_FC, f1_LSTM_FC)

**Now that we have the accuracies and f1-scores for all of our first 3 models, let's compare them.**

In [None]:
print("Accuracy and F1-Score of the LSTM: ", acc_LSTM, f1_LSTM)
print("Accuracy and F1-Score of the CNN_LSTM: ", acc_CNN_LSTM, f1_CNN_LSTM)
print("Accuracy and F1-Score of the LSTM_FC: ", acc_LSTM_FC, f1_LSTM_FC)

**Unfortunately, the performance is not really what we would have wanted, and it doesn't differ much between the 3 models. I suspect this happens because of the small dataset and because many tweets have very few words, while about 25% of the words in the corpus are not even in the embeddings. To get good performance with the 3 models we have tried so far, we would need to clean the data much better. As I previously mentioned, this would be a tedious task and is not the goal of this notebook. Therefore, let's try Google's Bert and see whether it improves on our approaches.**

<font size="+1" color="black"><b>3.4 Bert </b></font><br><a id="3.4"></a>

**We are going to use the SimpleTransformers library in order to try and train Bert for our problem. In order to use it, we need to install it and then put the data in a specific format.**

In [None]:
!pip install transformers
!pip install simpletransformers
!pip uninstall -y pyarrow
!pip install 'pyarrow>=1.0.0, <5.0.0'
!pip uninstall -y tqdm
!pip install 'tqdm==v4.43.0'
!pip uninstall -y tokenizers
!pip install 'tokenizers>=0.10.1,<0.11'

In [None]:
columns = train_df[["text_cleaned", "target"]]
train_df_V2 = columns.copy()

In [None]:
rename = {"text_cleaned": "text", "target": "labels"}
train_df_V2.rename(columns = rename, inplace=True)

In [None]:
train_x_y = train_df_V2.sample(frac = 0.75, random_state = 42)
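# Hold out the remaining 25% by index; drop_duplicates(keep=False) would also discard tweets that are duplicated in the data itself.
test_x_y = train_df_V2.drop(train_x_y.index)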

In [None]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs


model_args = ClassificationArgs()
model_args.use_early_stopping = True
model_args.early_stopping_delta = 0.01
model_args.early_stopping_metric = "mcc"
model_args.early_stopping_metric_minimize = False
model_args.early_stopping_patience = 5
model_args.evaluate_during_training_steps = 1000
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.no_save = True

model_bert = ClassificationModel('bert', 'bert-base-uncased', args=model_args, use_cuda=False)
model_bert.train_model(train_x_y)

**As with the first three models, we'll finish by checking the performance on the validation set. Admittedly, it is not the same validation set as before, but it should be just as useful to show the model's performance.**

In [None]:
pred_bert, out_bert = model_bert.predict(test_x_y['text'].tolist())

acc_bert = accuracy_score(test_x_y['labels'].to_numpy(), pred_bert)
f1_bert = f1_score(test_x_y['labels'].to_numpy(), pred_bert)

print(acc_bert, f1_bert)

**Now that seems a bit better.**

# <font size="+2" color="black"><b>4. Submission </b></font><br><a id="4"></a>

**We will train the Bert model on the whole training dataset and use it to make the final submission.**

In [None]:
model_bert = ClassificationModel("bert", "bert-base-uncased", args=model_args, use_cuda=False)
model_bert.train_model(train_df_V2)

final_pred_bert, final_out_bert = model_bert.predict(test_df['text_cleaned'].tolist())

submit = pd.read_csv('../input/nlp-getting-started/sample_submission.csv')
submit['target'] = final_pred_bert
submit.to_csv('submission.csv', index=False)

# <font size="+2" color="black"><b>5. Conclusions </b></font><br><a id="5"></a>

**In conclusion, as expected, Bert does a better job at the task at hand than our previous simpler models. I'm pretty sure that with enough data cleaning, the simpler models would have done just fine. Furthermore, with more data cleaning and some fine-tuning, the Bert model could have achieved a much higher score, but as previously stated this is not the goal of this notebook. The goal was to practice investigating datasets, visualisations and model creation. In my opinion, the goal has been achieved.**