# Introduction

This competition is a nice introduction to Natural Language Processing (NLP): classifying tweets into those that relate to disasters and those that don't. The first and simplest strategy in NLP is usually to represent the texts as bag-of-words or TF-IDF (term frequency - inverse document frequency) vectors and fit a linear classification model. A more sophisticated approach is to use BERT (Bidirectional Encoder Representations from Transformers). This is a nice introduction to BERT: https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/.

In this notebook, I'll try both methods.

In [None]:
# Use the official tokenization script created by the Google team
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub

import tokenization

# text processing libraries
import re
import string
import nltk
from nltk.corpus import stopwords

# sklearn 
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV,StratifiedKFold,RandomizedSearchCV

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
 
import warnings
warnings.filterwarnings('ignore')

# Load and Prepare Data

In [None]:
#Training data
train = pd.read_csv('../input/nlp-getting-started/train.csv')
print('Training data shape: ', train.shape)
train.head()

In [None]:
# Testing data 
test = pd.read_csv('../input/nlp-getting-started/test.csv')
print('Testing data shape: ', test.shape)
test.head()

# EDA

In [None]:
#Missing values in training set
train.isnull().sum()

In [None]:
#Missing values in test set
test.isnull().sum()

In [None]:
train['target'].value_counts()

We've got a relatively balanced dataset. Lots of location values are missing, but only a relatively small number of keywords are missing.

It's possible to use Keyword and Location as meta-features for our linear model. I won't explore that in this notebook (I might come back and add it later). Location is very sparsely populated, so I left it alone. I did try just appending the keyword to the text to see if that improved the results, but it didn't.
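
The keyword experiment mentioned above can be sketched roughly as follows. This is an illustrative reconstruction, not the exact code that was run: `append_keyword` is a hypothetical helper, the two rows are made up, and leaving the tweet unchanged when the keyword is missing is an assumption.

```python
import math

def append_keyword(text, keyword):
    """Return the tweet text with its keyword appended (hypothetical helper).

    Missing keywords (None or NaN, which is how pandas represents them)
    leave the text unchanged; that choice is an assumption.
    """
    if keyword is None or (isinstance(keyword, float) and math.isnan(keyword)):
        return text
    return f"{text} {keyword}"

# Made-up rows standing in for the keyword and text columns
rows = [
    {"keyword": "earthquake", "text": "the ground is shaking"},
    {"keyword": float("nan"), "text": "nice weather today"},
]
augmented = [append_keyword(r["text"], r["keyword"]) for r in rows]
print(augmented)
```

On the real dataframe the same helper could be applied row-wise, for example with `train.apply(lambda r: append_keyword(r['text'], r['keyword']), axis=1)`.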

# Model 1: Traditional NLP - Bag of Words + Linear Model

# Data Preprocessing

In [None]:
# take copies of the data to leave the originals for BERT
train1 = train.copy()
test1 = test.copy()

**Data cleaning:** In summary, we strip emojis, then send the text through a round of cleaning where we turn all characters to lower case and remove text in square brackets, URLs, HTML tags, punctuation, words containing numbers, etc. Finally we tokenize the text and remove common stopwords. This is a vital step for the bag-of-words + linear model.

In [None]:
# Applying a first round of text cleaning techniques

def clean_text(text):
    '''Make text lowercase, remove text in square brackets,remove links,remove punctuation
    and remove words containing numbers.'''
    text = text.lower() # make text lower case
    text = re.sub(r'\[.*?\]', '', text) # remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text) # remove URLs
    text = re.sub(r'<.*?>+', '', text) # remove html tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # remove punctuation
    text = re.sub(r'\n', '', text) # remove newlines
    text = re.sub(r'\w*\d\w*', '', text) # remove words containing numbers
    text = re.sub('[‘’“”…]', '', text) # remove curly quotes and ellipses

    return text
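
To see what each cleaning step does, the sketch below applies the same substitutions one at a time to a made-up tweet and prints every intermediate result. `clean_steps` is a hypothetical helper written only for this illustration; it repeats the patterns from `clean_text` so that it can run on its own.

```python
import re
import string

def clean_steps(text):
    """Apply the same substitutions as clean_text, recording each intermediate result."""
    steps = [
        ("lowercase", lambda t: t.lower()),
        ("square brackets", lambda t: re.sub(r'\[.*?\]', '', t)),
        ("urls", lambda t: re.sub(r'https?://\S+|www\.\S+', '', t)),
        ("html tags", lambda t: re.sub(r'<.*?>+', '', t)),
        ("punctuation", lambda t: re.sub('[%s]' % re.escape(string.punctuation), '', t)),
        ("newlines", lambda t: re.sub(r'\n', '', t)),
        ("words with digits", lambda t: re.sub(r'\w*\d\w*', '', t)),
        ("curly quotes", lambda t: re.sub('[‘’“”…]', '', t)),
    ]
    trace = []
    for name, fn in steps:
        text = fn(text)
        trace.append((name, text))
    return trace

# Made-up tweet, not taken from the dataset
sample = "Forest FIRE near the lake!! http://t.co/abc123 [photo] #wildfire 2015"
trace = clean_steps(sample)
for name, text in trace:
    print(f"{name:>18}: {text!r}")
cleaned = trace[-1][1]
```

Note that the removals leave runs of spaces behind; these disappear later because the text is tokenized and re-joined with single spaces.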

In [None]:
# emoji removal
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Applying the de-emojifying function to both test and training datasets
train1['text'] = train1['text'].apply(lambda x: remove_emoji(x))
test1['text'] = test1['text'].apply(lambda x: remove_emoji(x))

In [None]:
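# text preprocessing function
# Build the stopword set and tokenizer once: calling stopwords.words('english')
# for every token is very slow. Requires nltk.download('stopwords') beforehand.
stop_words = set(stopwords.words('english'))
tokenizer_reg = nltk.tokenize.RegexpTokenizer(r'\w+')

def text_preprocessing(text):
    """
    Clean the text, tokenize it, drop English stopwords and re-join the tokens.

    """
    nopunc = clean_text(text)
    tokenized_text = tokenizer_reg.tokenize(nopunc)
    remove_stopwords = [w for w in tokenized_text if w not in stop_words]
    combined_text = ' '.join(remove_stopwords)
    return combined_text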

# Applying the cleaning function to both test and training datasets
train1['text'] = train1['text'].apply(lambda x: text_preprocessing(x))
test1['text'] = test1['text'].apply(lambda x: text_preprocessing(x))

# Let's take a look at the updated text
train1['text'].head()

# Bag of Words Vectorizer

Here we're only going to use unigrams and add every word that appears to the vocabulary. Surprisingly, adding 2- and 3-grams didn't improve the score.

In [None]:
#count_vectorizer = CountVectorizer()
count_vectorizer = CountVectorizer(ngram_range = (1,1), min_df = 1)
train_vectors = count_vectorizer.fit_transform(train1['text'])
test_vectors = count_vectorizer.transform(test1["text"])

# The result is a sparse matrix that stores only the non-zero counts
train_vectors.shape
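
To make the bag-of-words representation concrete, here is a minimal pure-Python sketch of what a unigram count vectorizer does conceptually: build a sorted vocabulary, then count each vocabulary word per document. The two-document corpus and the helper name `bag_of_words` are made up for illustration; `CountVectorizer` handles tokenization itself and returns a sparse matrix rather than nested lists.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocab, matrix

docs = ["fire fire near forest", "flood near town"]  # toy, already-cleaned texts
vocab, matrix = bag_of_words(docs)
print(vocab)   # one column per vocabulary word
print(matrix)  # one row per document
```

Each row has one entry per vocabulary word, so most entries are zero for short texts such as tweets, which is why a sparse format is used in practice.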

# TF-IDF Vectorizer

Here we use 1- and 2-grams, where each term has to appear in at least two tweets, and we ignore terms appearing in over 50% of the tweets.

In [None]:
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df = 2, max_df = 0.5)
train_tfidf = tfidf.fit_transform(train1['text'])
test_tfidf = tfidf.transform(test1["text"])

train_tfidf.shape
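
The `min_df` / `max_df` filtering and the TF-IDF weighting can be illustrated in plain Python. The sketch below assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and skips the L2 row normalization that `TfidfVectorizer` applies by default. The corpus and function names are invented, and only unigrams are used to keep it short.

```python
import math
from collections import Counter

def document_frequency(tokenized_docs):
    """Number of documents each term appears in."""
    df = Counter()
    for toks in tokenized_docs:
        df.update(set(toks))
    return df

def filter_vocab(df, n_docs, min_df=2, max_df=0.5):
    """Keep terms with min_df <= df <= max_df * n_docs."""
    return sorted(t for t, c in df.items() if min_df <= c <= max_df * n_docs)

def tfidf_row(tokens, vocab, df, n_docs):
    """Unnormalized TF-IDF weights: raw count * smoothed idf."""
    counts = Counter(tokens)
    return [counts[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]

# Toy corpus: "near" is in every document, most other words appear only once
docs = [
    "fire near forest",
    "fire near town",
    "flood near town",
    "storm near coast",
]
tokenized = [d.split() for d in docs]
df = document_frequency(tokenized)
vocab = filter_vocab(df, len(docs))
row = tfidf_row(tokenized[0], vocab, df, len(docs))
print(vocab)
print(row)
```

In this toy corpus "near" is dropped for being too common and the one-off words are dropped for being too rare, leaving only the terms that appear in exactly two of the four documents.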

# Models

Fit Logistic Regression and Multinomial Naive Bayes (MNB) models with BoW and TF-IDF, giving four models in total. This is not an exhaustive list of vectorization options and models, and I won't tune any of the models. It's more of an example framework for linear models in NLP and (spoiler) BERT is going to beat whatever linear model we can come up with.

In [None]:
# Fitting a simple Logistic Regression on BoW
logreg_bow = LogisticRegression(C=1.0)
scores = model_selection.cross_val_score(logreg_bow, train_vectors, train["target"], cv=5, scoring="f1")
scores.mean()

In [None]:
# Fitting a simple Logistic Regression on TFIDF
logreg_tfidf = LogisticRegression(C=1.0)
scores = model_selection.cross_val_score(logreg_tfidf, train_tfidf, train["target"], cv=5, scoring="f1")
scores.mean()

In [None]:
# Fitting a simple Naive Bayes on BoW
NB_bow = MultinomialNB()
scores = model_selection.cross_val_score(NB_bow, train_vectors, train["target"], cv=5, scoring="f1")
scores.mean()

In [None]:
# Fitting a simple Naive Bayes on TFIDF
NB_tfidf = MultinomialNB()
scores = model_selection.cross_val_score(NB_tfidf, train_tfidf, train["target"], cv=5, scoring="f1")
scores.mean()

The best score comes from MNB on the bag-of-words vectors. It gives a mean cross-validation F1 score of 0.6585 and a leaderboard score of 0.7945.

Bag of Words is significantly better than TF-IDF in both models, and it's a little bit surprising that 1-grams with no minimum limit seem to give the best results. I think this might be partly due to the nature of the data. Tweets are usually pretty short and probably don't have much of a 'standard' layout or structure. This might be why a fairly simple BoW model does really well compared to TF-IDF or higher-order n-gram models.
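
The cross-validation scores above use `scoring="f1"`, so it may help to see how that metric is computed from predictions. This is a minimal sketch with made-up labels; `f1_binary` is a hypothetical helper, and in practice `sklearn.metrics.f1_score` (already imported above) does this for you.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0 (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0]  # made-up labels
y_pred = [1, 1, 0, 1, 0, 0]  # made-up predictions
score = f1_binary(y_true, y_pred)
print(score)
```

Unlike accuracy, F1 ignores true negatives, so it only rewards a model for finding the positive class without raising too many false alarms.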

In [None]:
NB_bow.fit(train_vectors, train["target"])

In [None]:
sample_submission = pd.read_csv('../input/nlp-getting-started/sample_submission.csv')
sample_submission["target"] = NB_bow.predict(test_vectors)

import os
os.chdir('/kaggle/working')
    
sample_submission.to_csv("submission1.csv", index=False)

# BERT Model

# Helper Functions

In [None]:
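# The encoding function takes the text column from the train or test dataframe, the tokenizer,
# and the maximum sequence length (in tokens) as input.

# Outputs:
# Tokens - token ids, wrapped in [CLS] ... [SEP] and zero-padded to max_len
# Pad masks - 1 for real tokens, 0 for padding, so the model can ignore the padded positions
# Segment ids - all 0, since each input is a single sentence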
def bert_encode(texts, tokenizer, max_len = 512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
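
To see what `bert_encode` produces, the sketch below runs the same truncate, add special tokens, and pad logic on one short text with a toy whitespace tokenizer. `ToyTokenizer` and its six-entry vocabulary are stand-ins invented for illustration (the real `FullTokenizer` uses WordPiece and BERT's own vocabulary file); only the padding and masking logic mirrors the function above.

```python
class ToyTokenizer:
    """Hypothetical stand-in for tokenization.FullTokenizer (whitespace split, tiny vocab)."""
    vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "forest": 4, "fire": 5}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(tok, self.vocab["[UNK]"]) for tok in tokens]

def encode_one(text, tokenizer, max_len=8):
    """Same steps as bert_encode, for a single text."""
    tokens = tokenizer.tokenize(text)[:max_len - 2]  # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)
    ids = tokenizer.convert_tokens_to_ids(sequence) + [0] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len       # 1 = real token, 0 = padding
    segments = [0] * max_len                         # single-sentence input
    return ids, mask, segments

ids, mask, segments = encode_one("Forest fire nearby", ToyTokenizer())
print(ids)
print(mask)
print(segments)
```

All three lists always have length `max_len`, which is what lets the encoded texts be stacked into fixed-shape arrays.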

In [None]:
# Build and compile the model

def build_model(bert_layer, max_len = 512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model
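
The classification head in `build_model` is a single sigmoid unit applied to the vector at position 0 (the `[CLS]` token). Here is a minimal plain-Python sketch of that last step for one example. The numbers, `weights` and `bias` are invented, and a real BERT-Large vector would have 1024 entries rather than 3.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(sequence_output, weights, bias):
    """Dense(1, sigmoid) applied to the first token's vector of one example."""
    cls_vector = sequence_output[0]  # per-example equivalent of sequence_output[:, 0, :]
    logit = sum(w * x for w, x in zip(weights, cls_vector)) + bias
    return sigmoid(logit)

# One example: 4 token positions, hidden size 3 (toy values)
sequence_output = [
    [0.5, -1.0, 2.0],  # vector at the [CLS] position
    [0.1, 0.2, 0.3],
    [0.0, 0.0, 0.0],
    [0.9, 0.8, 0.7],
]
weights, bias = [1.0, 0.5, 0.25], -0.5  # hypothetical learned parameters
prob = classify(sequence_output, weights, bias)
print(prob)
```

Only the first position feeds the classifier; the other token vectors influence the prediction indirectly, through the attention layers that produced the `[CLS]` vector.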

# Load and Prepare Data

In [None]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

**Data Cleaning** - I've found that a relatively quick and generic data cleaning, like the one I did for the BoW + linear model, does not improve the result. The best results seem to be achieved with a painstakingly detailed clean-up of the train and test text, which isn't particularly realistic in practice. Some simple data cleaning code is in the hidden cells below.

In [None]:
# def decontracted(phrase):
#     # specific
#     phrase = re.sub(r"won\'t", "will not", phrase)
#     phrase = re.sub(r"can\'t", "can not", phrase)

#     # general
#     phrase = re.sub(r"n\'t", " not", phrase)
#     phrase = re.sub(r"\'re", " are", phrase)
#     phrase = re.sub(r"\'s", " is", phrase)
#     phrase = re.sub(r"\'d", " would", phrase)
#     phrase = re.sub(r"\'ll", " will", phrase)
#     phrase = re.sub(r"\'t", " not", phrase)
#     phrase = re.sub(r"\'ve", " have", phrase)
#     phrase = re.sub(r"\'m", " am", phrase)
#     return phrase

In [None]:
# import spacy
# import re
# nlp = spacy.load('en')
# def preprocessing(text):
#   text = text.replace('#','')
#   text = decontracted(text)
#   text = re.sub('\S*@\S*\s?','',text)
#   text = re.sub('http[s]?:(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+','',text)

#   token=[]
#   result=''
#   text = re.sub('[^A-Za-z]', ' ',text.lower())
  
#   text = nlp(text)
#   for t in text:
#     if not t.is_stop and len(t)>2:  
#       token.append(t.lemma_)
#   result = ' '.join([i for i in token])

#   return result.strip()

In [None]:
# train.text = train.text.apply(lambda x : preprocessing(x))
# test.text = test.text.apply(lambda x : preprocessing(x))

In [None]:
#train.head()

# Modelling

In [None]:
# Download BERT architecture
# BERT-Large uncased: 24-layer, 1024-hidden-nodes, 16-attention-heads, 340M parameters

module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

In [None]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [None]:
train_input = bert_encode(train.text.values, tokenizer, max_len=160)
test_input = bert_encode(test.text.values, tokenizer, max_len=160)
train_labels = train.target.values

In [None]:
model = build_model(bert_layer, max_len=160)
model.summary()

In [None]:
checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True)

train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=3,
    callbacks=[checkpoint],
    batch_size=16
)

The BERT model gives a validation score of 0.8345 and a leaderboard score of 0.83542.

So it improves on the BoW + MNB model, but not by an insane amount given how much more complex the BERT model is. This is a bit of a general principle in NLP: relatively simple models can give really good results. Deep learning models do tend to perform better, but not always by as much as you might expect.

# Predictions and Submission

In [None]:
model.load_weights('model.h5')
test_pred = model.predict(test_input)

In [None]:
submission['target'] = test_pred.round().astype(int)
import os
os.chdir('/kaggle/working')
    
submission.to_csv("submission2.csv", index=False)