## Real or Not? NLP with Disaster Tweets

This kernel uses **Plotly** to visualize how the tweets are distributed with respect to their characteristics, and it applies some simple text-cleaning techniques to the tweets. I combined word-based **tf-idf** features with the 512-dimension embeddings from **Google's Universal Sentence Encoder**, because the combination gave better accuracy than either feature set used on its own. Finally, I used **H2O.ai's AutoML** module to fit the transformed data and achieved an **F1 score** of **82.413%** on the test data. If you find this kernel useful, please **upvote** and share your valuable feedback.

My other kernels (Please **upvote** if you like the implementation):

* [House Sales Price Prediction](https://www.kaggle.com/gauthampughazh/house-sales-price-prediction-svr)
* [Titanic Survival Prediction](https://www.kaggle.com/gauthampughazh/titanic-survival-prediction-pandas-plotly-keras)

In [None]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import tensorflow as tf
import tensorflow_hub as hub
import h2o
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from h2o.frame import H2OFrame
from h2o.automl import H2OAutoML, get_leaderboard
from IPython.display import FileLink


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 200)
# nnlm_embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")
use_embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

In [None]:
train_df = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test_df = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
submission_df = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')

## Exploratory Data Analysis

**Peeking at the data**

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
test_df.info()

Checking the first 100 tweets with a missing ***location*** feature

In [None]:
train_df[train_df['location'].isnull()].head(100)

Checking the tweets in class 0 with missing ***keyword*** feature

In [None]:
train_df[(train_df['target'] == 0) & (train_df['keyword'].isnull())]

Checking the tweets in class 1 with missing ***keyword*** feature

In [None]:
train_df[(train_df['target'] == 1) & (train_df['keyword'].isnull())]

In [None]:
print(f"No. of unique keywords in class 0: {train_df[(train_df['target'] == 0)]['keyword'].value_counts().shape[0]}")
print(f"No. of unique keywords in class 1: {train_df[(train_df['target'] == 1)]['keyword'].value_counts().shape[0]}")
print(f"Total no. of unique keywords: {train_df['keyword'].value_counts().shape[0]}")

**Conclusion**: Keywords present in class 1 are also present in class 0
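
The counts above only show how many distinct keywords each class has. A minimal sketch of checking the overlap directly with set operations is given below; the `rows` list is made-up toy data standing in for the `keyword` and `target` columns of `train_df`.

```python
# Toy (keyword, target) pairs standing in for train_df[['keyword', 'target']];
# the values are invented for illustration only.
rows = [
    ("ablaze", 1), ("ablaze", 0), ("accident", 1),
    ("aftershock", 0), ("wreckage", 1), ("accident", 0),
]

keywords_0 = {kw for kw, target in rows if target == 0}
keywords_1 = {kw for kw, target in rows if target == 1}

shared = keywords_0 & keywords_1   # keywords seen in both classes
only_0 = keywords_0 - keywords_1   # keywords seen only in class 0
only_1 = keywords_1 - keywords_0   # keywords seen only in class 1

print(f"shared: {sorted(shared)}")
print(f"only in class 0: {sorted(only_0)}")
print(f"only in class 1: {sorted(only_1)}")
```

On the real data the two sets could be built with something like `set(train_df.loc[train_df['target'] == 0, 'keyword'].dropna())`; if `only_1` comes out empty, every class 1 keyword also occurs in class 0.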

**Checking the distribution of tweets among the classes**

In [None]:
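# sort_index() orders the counts as class 0, class 1 so they line up with the labels
class_counts = train_df['target'].value_counts().sort_index()
fig = px.bar(x=['Non-disaster tweet', 'Disaster tweet'], y=class_counts.values)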
fig.update_layout(title_text='Distribution of tweets', xaxis_title='Tweet type', yaxis_title='Count')

**Visualizing the distribution of punctuation in the tweets**

In [None]:
URL_PATTERN = re.compile(r"(https:\/\/\S+)|(http:\/\/\S+)|(www\.\S+)")
HTML_TAGS_PATTERN = re.compile(r'<.*>')
ALPHA_NUMERIC_PATTERN = re.compile(r"\w*[:,-]*(\d+[:,-]*)+\d*\w*")
PUNCTUATION_PATTERN = re.compile(r'[^a-zA-Z ]')
MENTIONS_PATTERN = re.compile(r'@[\w]*')
HASH_TAGS_PATTERN = re.compile(r'#\S+')
UNWANTED_WORDS_PATTERN = re.compile(r'&amp;|RT: \S+:|RT \S+:|FYI|CAD|RT |GMT|UTC|JST|\s[b-zB-Z]\s|ST|nsfw')
STOPWORDS = set(stopwords.words('english'))

def get_punctuations(text):
    """ returns every character that is neither an ASCII letter nor a space
    """
    return PUNCTUATION_PATTERN.findall(text)

# one list of matched characters per tweet, flattened and counted
punctuations = train_df['text'].apply(get_punctuations).explode().value_counts()

fig = px.bar(x=punctuations.index, y=punctuations.values)
fig.update_layout(title_text='Punctuation count', xaxis_title='Punctuations', yaxis_title='Count')
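
Note that `PUNCTUATION_PATTERN` matches anything that is not an ASCII letter or a space, so digits, newlines and non-ASCII characters are counted in the plot as well. A small sketch on a made-up tweet, with a stricter count based on `string.punctuation` shown for comparison (the stricter variant is an illustration, not something this kernel uses):

```python
import re
import string
from collections import Counter

PUNCTUATION_PATTERN = re.compile(r'[^a-zA-Z ]')  # same pattern as in the cell above

sample = "Fire at 5pm!! #wildfire\nStay safe..."  # made-up tweet

# What the notebook's pattern counts: every non-letter, non-space character.
counts = Counter(PUNCTUATION_PATTERN.findall(sample))

# Stricter alternative: only ASCII punctuation characters.
strict = Counter(ch for ch in sample if ch in string.punctuation)

print(counts)
print(strict)
```

The first counter includes the digit `5` and the newline; the second keeps only `!`, `#` and `.`.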

**Visualizing the distribution of unwanted words across the different classes of the tweets**

In [None]:
def has_unwanted_words(text):
    
    return 'Tweets with unwanted words' if re.search(UNWANTED_WORDS_PATTERN, text)\
                                        else 'Tweets without unwanted words'

# a separate name for the result keeps the has_unwanted_words function usable if the cell is re-run
unwanted_words_df = train_df['text'].apply(has_unwanted_words).to_frame()
unwanted_words_df.columns = ['Unwanted words']
unwanted_words_df['target'] = train_df['target']

px.histogram(unwanted_words_df, x='Unwanted words', color='target', barmode='group')

**Visualizing the distribution of HTML tags across the different classes of the tweets**

In [None]:
def has_html_tags(text):
    
    return 'Tweets with HTML tags' if re.search(HTML_TAGS_PATTERN, text) else 'Tweets without HTML tags'

has_tags = train_df.apply(lambda x: has_html_tags(x['text']), axis=1).to_frame()
has_tags.columns = ['HTML tags'] 
has_tags['target'] = train_df['target']

px.histogram(has_tags, x='HTML tags', color='target', barmode='group')
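
One thing to keep in mind when reading this plot: `<.*>` is greedy and only needs a `<` followed somewhere later by a `>`, so tweets that use the signs as comparisons are flagged too. The sketch below compares it with a narrower pattern; `NARROW` is a hypothetical alternative that requires a letter right after `<` or `</`, and the sample tweets are made up.

```python
import re

GREEDY = re.compile(r'<.*>')                 # pattern used in this notebook
NARROW = re.compile(r'</?[a-zA-Z][^<>]*>')   # hypothetical stricter alternative

samples = [
    "Crowd size <b>huge</b> downtown",   # real markup
    "temp < 30 and wind > 40 mph",       # comparison signs, no markup
]

for text in samples:
    print(bool(GREEDY.search(text)), bool(NARROW.search(text)))

# The greedy pattern also swallows the text between the first '<' and the last '>'.
print(repr(GREEDY.sub('', samples[0])))
print(repr(NARROW.sub('', samples[0])))
```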

**Visualizing the distribution of URLs across the different classes of the tweets**

In [None]:
def has_urls(text):
    
    return 'Tweets with URLs' if re.search(URL_PATTERN, text) else 'Tweets without URLs'

# a separate name for the result keeps the has_urls function usable if the cell is re-run
urls_df = train_df['text'].apply(has_urls).to_frame()
urls_df.columns = ['URLs']
urls_df['target'] = train_df['target']

px.histogram(urls_df, x='URLs', color='target', barmode='group')

**Visualizing the distribution of alphanumeric words across the different classes of the tweets**

In [None]:
def has_alnums(text):
    
    return 'Tweets with Alphanumeric words' if re.search(ALPHA_NUMERIC_PATTERN, text) else 'Tweets without Alphanumeric words'

# a separate name for the result keeps the has_alnums function usable if the cell is re-run
alnums_df = train_df['text'].apply(has_alnums).to_frame()
alnums_df.columns = ['Alnums']
alnums_df['target'] = train_df['target']

px.histogram(alnums_df, x='Alnums', color='target', barmode='group')
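
To make it clearer what counts as an "alphanumeric word" here, the sketch below runs `ALPHA_NUMERIC_PATTERN` on a made-up tweet. Because the pattern contains a capturing group, `findall` would return only the group contents, so the full matches are collected with `finditer` instead.

```python
import re

# Same pattern as defined earlier in the notebook.
ALPHA_NUMERIC_PATTERN = re.compile(r"\w*[:,-]*(\d+[:,-]*)+\d*\w*")

sample = "Fire at 10:30pm on I-95 today"  # made-up tweet

# group(0) is the whole match; findall() would return the inner group only.
matches = [m.group(0) for m in ALPHA_NUMERIC_PATTERN.finditer(sample)]
print(matches)

# What the cleaning step leaves behind after removing those tokens.
print(repr(ALPHA_NUMERIC_PATTERN.sub('', sample)))
```

Any token that contains a digit is removed as a whole, including the letters attached to it, which is why times and road numbers disappear from the cleaned tweets.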

**Visualizing the distribution of mentions across the different classes of the tweets**

In [None]:
def has_mentions(text):
    
    return 'Tweets with mentions' if re.search(MENTIONS_PATTERN, text) else 'Tweets without mentions'
    
# a separate name for the result keeps the has_mentions function usable if the cell is re-run
mentions_df = train_df['text'].apply(has_mentions).to_frame()
mentions_df.columns = ['Mentions']
mentions_df['target'] = train_df['target']

px.histogram(mentions_df, x='Mentions', color='target', barmode='group')

**Visualizing the distribution of hashtags across the different classes of the tweets**

In [None]:
def has_hash_tags(text):
    
    return 'Tweets with hash tags' if re.search(HASH_TAGS_PATTERN, text) else 'Tweets without hash tags'
    
# a separate name for the result keeps the has_hash_tags function usable if the cell is re-run
hash_tags_df = train_df['text'].apply(has_hash_tags).to_frame()
hash_tags_df.columns = ['Hash tags']
hash_tags_df['target'] = train_df['target']

px.histogram(hash_tags_df, x='Hash tags', color='target', barmode='group')

**Using a word cloud to view the 200 most frequent words**

In [None]:
# join all tweets into one lowercase string for the word cloud
tweets = " ".join(train_df['text']).lower()
wordcloud = WordCloud(max_words=200, width=1000, height=600, background_color='white').generate(tweets)

plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
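
The word cloud is essentially a picture of a word-frequency table. As a rough sketch of the same idea in numbers, the block below counts words with `collections.Counter`; the tweets and the tiny stopword set are made up, and `WordCloud` applies its own tokenization and stopword list, so the real ranking may differ.

```python
import re
from collections import Counter

# Made-up tweets standing in for train_df['text'].
tweets = [
    "Forest fire near the lake",
    "Fire crews fight the forest fire",
    "I love the lake in summer",
]

# A small illustrative stopword set; WordCloud ships its own, larger list.
stopwords = {"the", "in", "i", "near"}

tokens = re.findall(r"[a-z']+", " ".join(tweets).lower())
freq = Counter(t for t in tokens if t not in stopwords)

# The three most frequent remaining words.
print(freq.most_common(3))
```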

## Cleaning the tweets

**Replacing the contractions**

In [None]:
# Source: https://www.kaggle.com/prachichitnis/stack-tfidf-embedding-xgboost

def replace_contractions(tweet):
    tweet = re.sub(r"he's", "he is", tweet)
    tweet = re.sub(r"there's", "there is", tweet)
    tweet = re.sub(r"We're", "We are", tweet)
    tweet = re.sub(r"That's", "That is", tweet)
    tweet = re.sub(r"won't", "will not", tweet)
    tweet = re.sub(r"they're", "they are", tweet)
    tweet = re.sub(r"Can't", "Cannot", tweet)
    tweet = re.sub(r"wasn't", "was not", tweet)
    tweet = re.sub(r"don\x89Ûªt", "do not", tweet)
    tweet = re.sub(r"aren't", "are not", tweet)
    tweet = re.sub(r"isn't", "is not", tweet)
    tweet = re.sub(r"What's", "What is", tweet)
    tweet = re.sub(r"haven't", "have not", tweet)
    tweet = re.sub(r"hasn't", "has not", tweet)
    tweet = re.sub(r"There's", "There is", tweet)
    tweet = re.sub(r"He's", "He is", tweet)
    tweet = re.sub(r"It's", "It is", tweet)
    tweet = re.sub(r"You're", "You are", tweet)
    tweet = re.sub(r"I'M", "I am", tweet)
    tweet = re.sub(r"shouldn't", "should not", tweet)
    tweet = re.sub(r"wouldn't", "would not", tweet)
    tweet = re.sub(r"i'm", "I am", tweet)
    tweet = re.sub(r"I\x89Ûªm", "I am", tweet)
    tweet = re.sub(r"I'm", "I am", tweet)
    tweet = re.sub(r"Isn't", "Is not", tweet)
    tweet = re.sub(r"Here's", "Here is", tweet)
    tweet = re.sub(r"you've", "you have", tweet)
    tweet = re.sub(r"you\x89Ûªve", "you have", tweet)
    tweet = re.sub(r"we're", "we are", tweet)
    tweet = re.sub(r"what's", "what is", tweet)
    tweet = re.sub(r"couldn't", "could not", tweet)
    tweet = re.sub(r"we've", "we have", tweet)
    tweet = re.sub(r"it\x89Ûªs", "it is", tweet)
    tweet = re.sub(r"doesn\x89Ûªt", "does not", tweet)
    tweet = re.sub(r"It\x89Ûªs", "It is", tweet)
    tweet = re.sub(r"Here\x89Ûªs", "Here is", tweet)
    tweet = re.sub(r"who's", "who is", tweet)
    tweet = re.sub(r"I\x89Ûªve", "I have", tweet)
    tweet = re.sub(r"y'all", "you all", tweet)
    tweet = re.sub(r"can\x89Ûªt", "cannot", tweet)
    tweet = re.sub(r"would've", "would have", tweet)
    tweet = re.sub(r"it'll", "it will", tweet)
    tweet = re.sub(r"we'll", "we will", tweet)
    tweet = re.sub(r"wouldn\x89Ûªt", "would not", tweet)
    tweet = re.sub(r"We've", "We have", tweet)
    tweet = re.sub(r"he'll", "he will", tweet)
    tweet = re.sub(r"Y'all", "You all", tweet)
    tweet = re.sub(r"Weren't", "Were not", tweet)
    tweet = re.sub(r"Didn't", "Did not", tweet)
    tweet = re.sub(r"they'll", "they will", tweet)
    tweet = re.sub(r"they'd", "they would", tweet)
    tweet = re.sub(r"DON'T", "DO NOT", tweet)
    tweet = re.sub(r"That\x89Ûªs", "That is", tweet)
    tweet = re.sub(r"they've", "they have", tweet)
    tweet = re.sub(r"i'd", "I would", tweet)
    tweet = re.sub(r"should've", "should have", tweet)
    tweet = re.sub(r"You\x89Ûªre", "You are", tweet)
    tweet = re.sub(r"where's", "where is", tweet)
    tweet = re.sub(r"Don\x89Ûªt", "Do not", tweet)
    tweet = re.sub(r"we'd", "we would", tweet)
    tweet = re.sub(r"i'll", "I will", tweet)
    tweet = re.sub(r"weren't", "were not", tweet)
    tweet = re.sub(r"They're", "They are", tweet)
    tweet = re.sub(r"Can\x89Ûªt", "Cannot", tweet)
    tweet = re.sub(r"you\x89Ûªll", "you will", tweet)
    tweet = re.sub(r"I\x89Ûªd", "I would", tweet)
    tweet = re.sub(r"let's", "let us", tweet)
    tweet = re.sub(r"it's", "it is", tweet)
    tweet = re.sub(r"can't", "cannot", tweet)
    tweet = re.sub(r"don't", "do not", tweet)
    tweet = re.sub(r"you're", "you are", tweet)
    tweet = re.sub(r"i've", "I have", tweet)
    tweet = re.sub(r"that's", "that is", tweet)
    tweet = re.sub(r"i'll", "I will", tweet)
    tweet = re.sub(r"doesn't", "does not", tweet)
    tweet = re.sub(r"i'd", "I would", tweet)
    tweet = re.sub(r"didn't", "did not", tweet)
    tweet = re.sub(r"ain't", "am not", tweet)
    tweet = re.sub(r"you'll", "you will", tweet)
    tweet = re.sub(r"I've", "I have", tweet)
    tweet = re.sub(r"Don't", "Do not", tweet)
    tweet = re.sub(r"I'll", "I will", tweet)
    tweet = re.sub(r"I'd", "I would", tweet)
    tweet = re.sub(r"Let's", "Let us", tweet)
    tweet = re.sub(r"you'd", "you would", tweet)
    tweet = re.sub(r"It's", "It is", tweet)
    tweet = re.sub(r"Ain't", "am not", tweet)
    tweet = re.sub(r"Haven't", "Have not", tweet)
    tweet = re.sub(r"Could've", "Could have", tweet)
    tweet = re.sub(r"youve", "you have", tweet)  
    tweet = re.sub(r"donå«t", "do not", tweet)
    
    return tweet
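
A long chain of `re.sub` calls works, but every casing variant needs its own line. A more compact variant, sketched here as an alternative rather than as what this kernel runs, keeps the mapping in a dictionary and matches case-insensitively. `CONTRACTIONS` and `expand_contractions` are hypothetical names and the mapping is deliberately incomplete.

```python
import re

# A few entries only; a full mapping would mirror the substitutions above.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "don't": "do not",
    "i'm": "i am",
    "it's": "it is",
    "they're": "they are",
}

# Longest keys first so a longer contraction is never shadowed by a shorter one.
_KEYS = sorted(CONTRACTIONS, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in _KEYS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction, matching case-insensitively."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("DON'T panic, it's fine and they're safe"))
```

The replacements come out in lower case, which is acceptable here only because the tweets are lower-cased later in the cleaning step. The mis-encoded apostrophe variants (such as `\x89Ûª`) would still need their own entries.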

**Creating a custom sklearn transformer to perform basic cleaning operations on the tweets**

In [None]:
class TweetTransformer(BaseEstimator, TransformerMixin):
    
    
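    def __init__(self, remove_stopwords=False):
        """ remove_stopwords: whether transform() should also drop English stopwords
        """
        # Stored with a trailing underscore because the plain name is already taken
        # by the remove_stopwords method. sklearn's get_params()/clone() look the
        # parameter up under its own name, so this transformer should not be cloned.
        self.remove_stopwords_ = remove_stopwords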
        
    def replace_contractions(self, tweet):
        """ replaces the contractions in the tweet
        """
        
        return replace_contractions(tweet)

    def replace_urls(self, tweet):
        """ replaces the URLs in the tweet
        """

        return URL_PATTERN.sub('', tweet)
    
    def replace_unwanted_words(self, tweet):
        """ replaces the unwanted words in the tweet
        """
        
        return UNWANTED_WORDS_PATTERN.sub(' ', tweet)

    def replace_mentions(self, tweet):
        """ replaces the mentions in the tweet
        """

        return MENTIONS_PATTERN.sub('', tweet)

    def replace_alpha_nums(self, tweet):
        """ replaces the alphanumeric words in the tweet
        """

        return ALPHA_NUMERIC_PATTERN.sub('', tweet)

    def remove_emoji_characters(self, tweet):
        """ removes emoji and every other non-ASCII character from the tweet
        """

        return tweet.encode('ascii', 'ignore').decode('ascii')

    def replace_punctuations(self, tweet):
        """ replaces the punctuations in the tweet
        """

        return PUNCTUATION_PATTERN.sub(' ', tweet)
    
    def modify_empty_and_rare_utterances(self, tweet):
        """ handles empty tweets and other rare cases
        """
        
        if tweet.strip() == '':
            tweet = 'no text'
            
        tweet = tweet.lower()
        # \b anchors stop the short patterns from firing inside longer words
        # (e.g. 'rd' in 'hard', 'cal' in 'call' or in 'california' itself)
        tweet = re.sub(r'\blo+l\b', 'laughing out loud', tweet)
        tweet = re.sub('coo+l', 'cool', tweet)
        tweet = re.sub(r'\b(calif|cal)\b', 'california', tweet)
        tweet = re.sub(r'\brd\b', 'road', tweet)
        tweet = tweet.replace('nyc', 'new york city')
        tweet = tweet.replace('sismo', 'earthquake')
        # also matches 'detectado', the usual Spanish spelling
        tweet = re.sub('det[ae]ctado', 'detected', tweet)
        tweet = re.sub(' +', ' ', tweet)
        tweet = re.sub('^ +', '', tweet)
        
        return tweet
    
    def remove_stopwords(self, tweet):
        """ removes the stopwords from the tweet
        """
        
        # build a new list instead of deleting while iterating,
        # which would skip the token that follows each deletion
        tokens = [token for token in word_tokenize(tweet) if token not in STOPWORDS]

        return " ".join(tokens)
    
    def transform_helper(self, tweet):
        """ makes a call to the preprocessing functions
        """
        
        tweet = self.replace_contractions(tweet)
        tweet = self.replace_urls(tweet)
        tweet = self.replace_unwanted_words(tweet)
        tweet = self.replace_mentions(tweet)
        tweet = self.replace_alpha_nums(tweet)
        tweet = self.remove_emoji_characters(tweet)
        tweet = self.replace_punctuations(tweet)
        tweet = self.modify_empty_and_rare_utterances(tweet)
        
        if self.remove_stopwords_:
            tweet = self.remove_stopwords(tweet)
        
        return tweet
    
    def fit(self, X, y=None):
        
        return self
    
    def transform(self, X):
        """ returns a pd.Series of transformed tweets
        """
        
        return X.map(lambda x: self.transform_helper(x))
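
Stopword filtering is easy to get subtly wrong: deleting from a list while enumerating it shifts the remaining items left, so the token right after each deletion is never checked. The sketch below contrasts that with building a new list; the tiny `STOPWORDS` set is for illustration and is not NLTK's list.

```python
STOPWORDS = {"the", "a", "is", "in"}  # tiny illustrative set, not NLTK's list

def drop_in_place(tokens):
    """Deletes while enumerating: the item after each deletion is skipped."""
    tokens = list(tokens)
    for i, token in enumerate(tokens):
        if token in STOPWORDS:
            del tokens[i]
    return tokens

def drop_with_filter(tokens):
    """Builds a new list, so every token is checked exactly once."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["the", "a", "fire", "is", "in", "town"]
print(drop_in_place(tokens))     # some stopwords survive
print(drop_with_filter(tokens))  # all stopwords removed
```

With consecutive stopwords such as "the a" or "is in", the in-place version leaves the second one behind.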

**Creating a custom scikit-learn transformer to generate Google's NNLM 128-dimension embeddings**

In [None]:
class NNLMEmbeddingsTransformer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        
        return self
    
    def transform(self, X):
        # needs the commented-out `nnlm_embed = hub.load(...)` line in the first cell
        return nnlm_embed(X.values)

**Creating a custom scikit-learn transformer to generate Google's USE 512-dimension embeddings**

In [None]:
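class USEEmbeddingsTransformer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        
        return self
    
    def transform(self, X):
        # convert the TensorFlow tensor to a NumPy array so it stacks cleanly with the tf-idf matrix
        return use_embed(X.values).numpy()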
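
The transformer above sends every tweet to the encoder in a single call. If memory becomes a problem, one option is to embed in batches. This is a minimal sketch under assumptions: `batched`, `embed_in_batches` and `fake_embed` are hypothetical helpers, and `fake_embed` merely stands in for the real encoder so the example runs without TensorFlow.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=256):
    """Call `embed_fn` on each batch and collect the resulting rows in order."""
    rows = []
    for batch in batched(texts, batch_size):
        rows.extend(embed_fn(batch))
    return rows

# Stand-in for the real encoder: returns a 2-dimensional "embedding" per text.
def fake_embed(batch):
    return [[float(len(t)), float(t.count(" "))] for t in batch]

texts = ["fire downtown", "all clear", "flood warning issued", "ok"]
vectors = embed_in_batches(texts, fake_embed, batch_size=3)
print(len(vectors), vectors[0])
```

With the real model, `embed_fn` would presumably be something like `lambda batch: use_embed(batch).numpy()`, which returns one 512-dimension row per tweet.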

**Creating custom scikit-learn pipelines to transform the tweets**

In [None]:
# Switching to USE embeddings, as they capture the context of a whole sentence rather than of individual words

# nnlm_embeddings_pipeline = Pipeline([
#     ('tweet_transformer', TweetTransformer()),
#     ('nnlm_embeddings_transformer', NNLMEmbeddingsTransformer())
# ])

# USE embeddings pipeline
use_embeddings_pipeline = Pipeline([
    ('tweet_transformer', TweetTransformer()),
    ('use_embeddings_transformer', USEEmbeddingsTransformer())
])

# tf-idf pipeline
tfidf_pipeline = Pipeline([
    # stopwords are removed for tf-idf (remove_stopwords=True)
    # because tf-idf does not take the context into account
    
    ('tweet_transformer', TweetTransformer(remove_stopwords=True)),
    ('tfidf_vectorizer', TfidfVectorizer())
])

# Combining the USE and tf-idf pipelines
data_prep_pipeline = FeatureUnion([
    ('use_embeddings', use_embeddings_pipeline),
    ('tfidf', tfidf_pipeline)
])

# Transforming the training data
X_train = data_prep_pipeline.fit_transform(train_df['text'])
# Target labels
y_train = train_df['target']
# Transforming the test data
X_test = data_prep_pipeline.transform(test_df['text'])
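
To make the combination step concrete, here is a small pure-Python sketch of what the union produces: one tf-idf row per tweet, concatenated with that tweet's embedding. The tf-idf part uses the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization, which is what scikit-learn documents as the `TfidfVectorizer` defaults; the whitespace tokenizer and the 3-dimension embeddings are simplifications made up for the example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed tf-idf with L2-normalised rows over a sorted vocabulary."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({t for toks in tokenised for t in toks})
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(t for toks in tokenised for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

docs = ["forest fire", "fire alarm", "sunny day"]
vocab, tfidf = tfidf_rows(docs)

# Hypothetical 3-dimensional sentence embeddings (the real USE vectors have 512).
embeddings = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1], [0.9, 0.0, 0.2]]

# Feature union: per-row concatenation of the two feature blocks.
combined = [emb + row for emb, row in zip(embeddings, tfidf)]
print(vocab)
print(len(combined[0]))
```

In the real pipeline the width of each row is 512 plus the size of the tf-idf vocabulary, which is the second number that `X_train.shape` reports.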

In [None]:
# Checking the shape of the transformed data
X_train.shape

## Modelling

**Initialize H2O**

In [None]:
h2o.init()

**Convert the transformed data into H2OFrames**

In [None]:
train_h2o = H2OFrame(X_train.todense())
x = train_h2o.columns
y = 'target'
train_h2o[y] = H2OFrame(y_train.values).asfactor()
X_test_h2o = H2OFrame(X_test.todense())
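
Calling `.todense()` turns the mostly-zero tf-idf columns into a full matrix, so it is worth estimating the size before converting. A back-of-the-envelope sketch, where `dense_megabytes` is a hypothetical helper and the shape is invented for illustration (the actual shape is whatever `X_train.shape` reports):

```python
def dense_megabytes(n_rows, n_cols, bytes_per_value=8):
    """Approximate size of a dense float64 matrix in MiB."""
    return n_rows * n_cols * bytes_per_value / 1024 ** 2

# Hypothetical shape: 7,500 rows, 512 embedding columns plus 15,000 tf-idf columns.
estimate = dense_megabytes(7_500, 512 + 15_000)
print(f"{estimate:.1f} MiB")
```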

In [None]:
# Checking the shape of the training data

train_h2o.shape

**Training the models**

In [None]:
# Training up to 20 models within a maximum total run time of 5 hours
# (max_runtime_secs=18000) on the training data

aml = H2OAutoML(max_models=20, max_runtime_secs=18000, seed=1)
aml.train(x=x, y=y, training_frame=train_h2o)

**Displaying the model results**

In [None]:
# Displaying the leaderboard

lb = aml.leaderboard
lb.head(rows=lb.nrows)
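
The leaderboard reports metrics such as AUC and logloss, and for a binary problem it is typically ranked by AUC unless `sort_metric` is set, whereas the score quoted at the top of this kernel is F1. As a reminder of how F1 is derived from predictions, here is a small self-contained sketch with made-up labels; `f1_score` below is a hand-written helper, not the scikit-learn function of the same name.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(f1_score(y_true, y_pred))
```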

**Making Predictions**

In [None]:
# Predicting the test data

predictions = aml.predict(X_test_h2o)

**Submission**

In [None]:
# Generating the submission file

submission_df['target'] = predictions['predict'].as_data_frame().values
submission_df.to_csv('submissions.csv', index=False)
FileLink('submissions.csv')

In [None]:
# Checking the submission data frame

submission_df.head()