# Introduction
Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g., disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter. In this competition, you’re challenged to build a machine learning model that predicts which tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified.

In [None]:
import numpy as np
import pandas as pd
from nltk.stem import PorterStemmer
ps = PorterStemmer()
from nltk.stem.lancaster import LancasterStemmer
lc = LancasterStemmer()
from nltk.stem import SnowballStemmer
sb = SnowballStemmer("english")
import gc
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, GRU, Conv1D, BatchNormalization
from keras.layers import Bidirectional, GlobalMaxPooling1D, Concatenate, SpatialDropout1D
from keras.optimizers import Adam
from keras.models import Sequential
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.initializers import Constant
import spacy
import pickle
import re
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")

# Exploratory Data Analysis

Let's look at the dataset. Each sample in the training set has the following information:

- `id` - a unique identifier for each tweet
- `text` - the text of the tweet
- `location` - the location the tweet was sent from (may be blank)
- `keyword` - a particular keyword from the tweet (may be blank)
- `target` - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

In [None]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
train.head()

Now, let's take a look at each variable more closely.

## Keyword

In [None]:
train['keyword'].value_counts().sort_values(ascending=False)

Check whether the missingness is associated with the target variable.

In [None]:
pd.crosstab(train['keyword'].isnull(), train['target'])

Keywords are missing for 0.4% of class 0 tweets and 1% of class 1 tweets. Both rates are very low, so missingness does not appear to be related to the target.
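The per-class missing rates can also be read off directly instead of being worked out by hand from the crosstab counts. The sketch below assumes a small made-up frame named `toy` (not the competition data) and groups the missingness indicator by the target:

```python
import pandas as pd

# Toy stand-in for the training data (made-up values, not the real dataset)
toy = pd.DataFrame({
    'keyword': ['fire', None, 'flood', 'quake', None, 'storm'],
    'target':  [1, 1, 0, 0, 0, 0],
})

# isnull() gives True/False per row; the mean of a boolean column within
# each target class is the fraction of rows with a missing keyword
missing_rate = toy['keyword'].isnull().groupby(toy['target']).mean()
print(missing_rate)
```

On the real data, the same expression with `train` in place of `toy` should reproduce the percentages quoted above.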

## Location

In [None]:
train['location'].value_counts().sort_values(ascending=False)[0:20]

It seems like most reported locations are from the US.

The proportion of missing values is:

In [None]:
train['location'].isnull().sum() / train.shape[0]

Check whether the missingness is associated with the target variable.

In [None]:
pd.crosstab(train['location'].isnull(), train['target'])

Locations are missing for 33.6% of class 0 tweets and 32.9% of class 1 tweets. The two rates are nearly identical, so missingness does not appear to be related to the target.

## Target
Let's look at the frequency of each class.

In [None]:
train['target'].value_counts()

It's a fairly balanced dataset.

## Text
The code below is adapted from:
- [Sentiment Analysis with Text Mining](https://towardsdatascience.com/sentiment-analysis-with-text-mining-13dd2b33de27)
- [NLP with Disaster Tweets - Read Before Start EDA](https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-read-before-start-eda)

In [None]:
def count_regex(pattern, tweet):
    return len(re.findall(pattern, tweet))
    
for df in [train, test]:
    df['words_count'] = df['text'].apply(lambda x: count_regex(r'\w+', x))
    df['unique_words_count'] = df['text'].apply(lambda x: len(set(str(x).split())))
    df['mean_word_length'] = df['text'].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
    df['chars_count'] = df['text'].apply(lambda x: len(str(x)))
    df['mentions_count'] = df['text'].apply(lambda x: count_regex(r'@\w+', x))
    df['hashtags_count'] = df['text'].apply(lambda x: count_regex(r'#\w+', x))
    df['capital_words_count'] = df['text'].apply(lambda x: count_regex(r'\b[A-Z]{2,}\b', x))
    df['excl_quest_marks_count'] = df['text'].apply(lambda x: count_regex(r'!|\?', x))
    df['urls_count'] = df['text'].apply(lambda x: count_regex(r'http.?://[^\s]+[\s]?', x))
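To see what these patterns actually count, here is a small sketch that applies a few of them to a single made-up tweet (the text in `sample` is hypothetical, not from the dataset). It repeats `count_regex` so that it runs on its own:

```python
import re

def count_regex(pattern, tweet):
    return len(re.findall(pattern, tweet))

# Hypothetical tweet used only for illustration
sample = "BREAKING: fire near @CityNews HQ!! Is it safe? #wildfire http://t.co/abc"

counts = {
    'words_count': count_regex(r'\w+', sample),
    'mentions_count': count_regex(r'@\w+', sample),
    'hashtags_count': count_regex(r'#\w+', sample),
    'capital_words_count': count_regex(r'\b[A-Z]{2,}\b', sample),
    'excl_quest_marks_count': count_regex(r'!|\?', sample),
    'urls_count': count_regex(r'http.?://[^\s]+[\s]?', sample),
}
print(counts)
```

Note that `\w+` splits the URL into `http`, `t`, `co` and `abc`, so URL fragments are counted as words in `words_count`.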

In [None]:
new_features = ['words_count', 'unique_words_count', 'mean_word_length', 'chars_count', 'mentions_count', 
                'hashtags_count', 'capital_words_count', 'excl_quest_marks_count', 'urls_count']
disaster_tweets_idx = train['target'] == 1
fig, axes = plt.subplots(ncols=2, nrows=len(new_features), figsize=(20, 50), dpi=100)

for i, feature in enumerate(new_features):
    sns.distplot(train.loc[~disaster_tweets_idx][feature], label='Not Disaster', ax=axes[i][0], color='green')
    sns.distplot(train.loc[disaster_tweets_idx][feature], label='Disaster', ax=axes[i][0], color='red')

    sns.distplot(train[feature], label='Train', ax=axes[i][1])
    sns.distplot(test[feature], label='Test', ax=axes[i][1])
    
    for j in range(2):
        axes[i][j].set_xlabel('')
        axes[i][j].tick_params(axis='x', labelsize=12)
        axes[i][j].tick_params(axis='y', labelsize=12)
        axes[i][j].legend()
    
    axes[i][0].set_title(f'{feature} Target Distribution in Training Set', fontsize=13)
    axes[i][1].set_title(f'{feature} Training & Test Set Distribution', fontsize=13)

plt.show()

- All the features have a similar distribution in the training set and the test set.
- Some features seem to have different distributions for disaster and non-disaster tweets: `words_count`, `unique_words_count`, `mean_word_length`, `chars_count` and `excl_quest_marks_count`. I tried including them in my model, but the score went down, so I left them out in the end.

# Model building
It seems like the keyword is already included in the tweet and I wasn't sure how to use the `location` variable, so I didn't use them in my model.

In [None]:
train_text = train['text']
test_text = test['text']
target = train['target'].values

# combine the text from the training dataset and the test dataset
text_list = pd.concat([train_text, test_text])

# number of training samples
num_train_data = target.shape[0]

Quite a few tweets contain the string `&amp;`. For example,

In [None]:
text_list.iloc[171]

I looked it up online and found that it is a character reference for `&` in HTML. I will just replace it with the word `and`. I will also replace `w/` with its long form `with`. We will take care of other data cleaning issues in the next section.

In [None]:
text_list = text_list.apply(lambda x: re.sub('&amp;', ' and ', x))
text_list = text_list.apply(lambda x: re.sub('w/', 'with', x))
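As a possible variation, the standard library's `html.unescape` decodes all HTML character references (such as `&gt;` and `&lt;`), not only `&amp;`. The sketch below is an alternative under that assumption, not the cleaning used for the results in this notebook. The helper name `clean_tweet` and the sample text are made up. It also restricts the `w/` replacement to standalone occurrences, since the plain pattern `w/` also matches inside `w/o`.

```python
import html
import re

def clean_tweet(text):
    # decode every HTML character reference, e.g. &amp; -> & and &gt; -> >
    text = html.unescape(text)
    # spell out the ampersand, as the notebook does
    text = text.replace('&', ' and ')
    # expand "w/" only at the start of a word and not when it is part of "w/o"
    text = re.sub(r'\bw/(?!o\b)', 'with ', text)
    # collapse the extra spaces introduced by the replacements
    return re.sub(r'\s+', ' ', text).strip()

sample = "Fire &amp; smoke w/ high winds, roads w/o power &gt; 2 hrs"
cleaned = clean_tweet(sample)
print(cleaned)
# Fire and smoke with high winds, roads w/o power > 2 hrs
```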

## Embeddings
We will use two pre-trained word embedding models. But first, we have to tokenize the text. We will also remove punctuation in the process.

In [None]:
# https://www.kaggle.com/wowfattie/3rd-place
nlp = spacy.load('en_core_web_lg')
nlp.vocab.add_flag(lambda s: s.lower() in spacy.lang.en.stop_words.STOP_WORDS, spacy.attrs.IS_STOP)
docs = nlp.pipe(text_list, n_threads = 2)

# convert words to integers and save the results in word_sequences
word_sequences = []

# store the mapping in word_dict
word_dict = {}
lemma_dict = {}

# store the frequency of each word
word_freq = {}

word_index = 1
for doc in docs:
    word_seq = []
    for word in doc:
        try:
            word_freq[word.text] += 1
        except KeyError:
            word_freq[word.text] = 1
        # compare strings with != rather than `is not`, which tests object identity
        if (word.text not in word_dict) and (word.pos_ != 'PUNCT'):
            word_dict[word.text] = word_index
            word_index += 1
            lemma_dict[word.text] = word.lemma_
        # do not include punctuation in word_dict
        # this essentially removes hashtags and mentions
        if word.pos_ != 'PUNCT':
            word_seq.append(word_dict[word.text])
    word_sequences.append(word_seq)
del docs
gc.collect()

# maximum number of words per tweet in the dataset
max_length = max([len(s) for s in word_sequences])

# number of unique words
# add 1 because 0 is reserved for padding
vocab_size = len(word_dict) + 1

train_word_sequences = word_sequences[:num_train_data]
test_word_sequences = word_sequences[num_train_data:]

# add zeros at the end of each word sequence so that their lengths are fixed at max_length
train_word_sequences = pad_sequences(train_word_sequences, maxlen=max_length, padding='post')
test_word_sequences = pad_sequences(test_word_sequences, maxlen=max_length, padding='post')
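To make the effect of `padding='post'` concrete, here is a minimal pure-Python sketch of the same idea. The helper `pad_post` and the toy sequences are made up for illustration. It assumes no sequence is longer than `maxlen`, which holds here because `max_length` is the longest sequence in the data.

```python
def pad_post(sequences, maxlen, value=0):
    """Append `value` to each sequence until it has `maxlen` items."""
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]

# Toy integer-encoded tweets (index 0 is reserved for padding)
toy_sequences = [[4, 7, 2], [5], [1, 3]]
toy_max_length = max(len(seq) for seq in toy_sequences)

padded = pad_post(toy_sequences, toy_max_length)
print(padded)
# [[4, 7, 2], [5, 0, 0], [1, 3, 0]]
```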

Now, build an embedding matrix that maps each word's integer index to its vector representation. If the pre-trained embeddings don't contain a word, we fall back to modified versions of it: lowercase, uppercase, capitalized, stemmed and lemmatized. Words that are still not found get a vector of -1s.

In [None]:
def load_embeddings(embeddings_index, word_dict, lemma_dict):
    embed_size = 300
    vocab_size = len(word_dict) + 1
    embedding_matrix = np.zeros((vocab_size, embed_size), dtype=np.float32)
    unknown_vector = np.zeros((embed_size,), dtype=np.float32) - 1.
    for key in word_dict:
        word = key
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[word_dict[key]] = embedding_vector
            continue
        word = key.lower()
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[word_dict[key]] = embedding_vector
            continue
        word = key.upper()
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[word_dict[key]] = embedding_vector
            continue
        word = key.capitalize()
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[word_dict[key]] = embedding_vector
            continue
        word = ps.stem(key)
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[word_dict[key]] = embedding_vector
            continue
        word = lc.stem(key)
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[word_dict[key]] = embedding_vector
            continue
        word = sb.stem(key)
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[word_dict[key]] = embedding_vector
            continue
        word = lemma_dict[key]
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[word_dict[key]] = embedding_vector
            continue
        embedding_matrix[word_dict[key]] = unknown_vector                    
    return embedding_matrix
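The fallback chain in `load_embeddings` could also be written as a loop over a list of transformations. The sketch below shows that idea on a toy index with made-up 2-d vectors. The function name `lookup_with_fallback` is hypothetical, and only the case variants are included. The three stemmers and the lemma lookup would be appended to the list in the same order as above.

```python
def lookup_with_fallback(word, embeddings_index, transforms):
    """Return the first vector found for a variant of `word`, else None."""
    for transform in transforms:
        vector = embeddings_index.get(transform(word))
        if vector is not None:
            return vector
    return None

# Toy index with made-up vectors (the real embeddings are 300-d)
toy_index = {'flood': [0.1, 0.2], 'USA': [0.3, 0.4]}

# Tried in order: the word itself, lowercase, uppercase, capitalized
transforms = [lambda w: w, str.lower, str.upper, str.capitalize]

print(lookup_with_fallback('FLOOD', toy_index, transforms))  # [0.1, 0.2]
print(lookup_with_fallback('usa', toy_index, transforms))    # [0.3, 0.4]
print(lookup_with_fallback('zzz', toy_index, transforms))    # None
```

Each transformation is applied only when the earlier ones have failed, so the lookup order matches the chain of `if` blocks.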

In [None]:
def load_pickle_file(path):
    with open(path, 'rb') as f:
        file = pickle.load(f)
    return file

# the asterisk below allows the function to accept an arbitrary number of inputs
def get_coefs(word, *arr):
    """ parse the fields of one embedding-file line into a (word, vector) pair """
    return word, np.asarray(arr, dtype='float32')
                                                  
path_glove = '/kaggle/input/pickled-glove840b300d-for-10sec-loading/glove.840B.300d.pkl'
path_paragram = '/kaggle/input/paragram-300-sl999/paragram_300_sl999.txt'

embeddings_index_glove = load_pickle_file(path_glove)
# the asterisk below unpacks the list
embeddings_index_paragram = dict(get_coefs(*o.split(" ")) for o in open(path_paragram, encoding="utf8", errors='ignore') if len(o) > 100)

embedding_matrix_glove = load_embeddings(embeddings_index_glove, word_dict, lemma_dict)
embedding_matrix_paragram = load_embeddings(embeddings_index_paragram, word_dict, lemma_dict)

# stack the two pre-trained embedding matrices
embedding_matrix = np.concatenate((embedding_matrix_glove, embedding_matrix_paragram), axis=1)
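Concatenating along `axis=1` keeps one row per vocabulary entry and places the two vectors side by side, so two 300-d embeddings give a 600-d one. A minimal shape check on toy matrices (the names and sizes below are made up for illustration):

```python
import numpy as np

toy_vocab_size, toy_dim = 4, 3  # toy sizes; the notebook uses 300-d vectors

glove_like = np.ones((toy_vocab_size, toy_dim), dtype=np.float32)
paragram_like = np.zeros((toy_vocab_size, toy_dim), dtype=np.float32)

# each row becomes [first embedding | second embedding]
stacked = np.concatenate((glove_like, paragram_like), axis=1)
print(stacked.shape)  # (4, 6)
```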

## Model Training

The model I use is similar to the one posted here: https://www.kaggle.com/c/quora-insincere-questions-classification/discussion/80568. I trained the model 5 times and averaged the predicted probabilities.

In [None]:
embedding_size = 600
learning_rate = 0.001
batch_size = 32
num_epoch = 5

def build_model(embedding_matrix, vocab_size, max_length, embedding_size=300):
    model = Sequential([
        Embedding(vocab_size, embedding_size, embeddings_initializer=Constant(embedding_matrix), 
                  input_length=max_length, trainable=False),
        SpatialDropout1D(0.3),
        Bidirectional(LSTM(128, return_sequences=True)),
        Conv1D(64, kernel_size=2), 
        GlobalMaxPooling1D(),
        Dense(1, activation='sigmoid')
    ])
    # use the learning_rate hyperparameter defined above instead of a hardcoded value
    model.compile(optimizer=Adam(lr=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [None]:
reps = 5
pred_prob = np.zeros((len(test_word_sequences),), dtype=np.float32)
for r in range(reps):
    model = build_model(embedding_matrix, vocab_size, max_length, embedding_size)
    model.fit(train_word_sequences, target, batch_size=batch_size, epochs=num_epoch, verbose=2)
    pred_prob += np.squeeze(model.predict(test_word_sequences, batch_size=batch_size, verbose=2) / reps)

Make a submission.

In [None]:
submission = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')
submission['target'] = pred_prob.round().astype(int)
submission.to_csv('submission.csv', index=False)