# Introduction

The *Real or Not? NLP with Disaster Tweets* competition offers a neat opportunity to see how different approaches to natural language processing compare to one another. In this notebook, we'll look at how to start examining NLP data and performing some rudimentary first-order feature engineering. Here's a breakdown of what this notebook covers:

1. Perform an initial exploration of some simple fields.
2. Clean and normalize the data set.
3. Extract first-order features and examine how useful they are.
4. Perform rudimentary natural language processing on the text field.
5. Evaluate our natural language model.
6. Use the model and make predictions that we can submit to the competition.

# 1. Importing the Data

The first step in the process is to import our training data so we can see what kinds of information we have to work with. For this project, we'll start by importing the entire training dataset into a single Pandas dataframe.

In [None]:
import time
import pandas as pd
import numpy as np
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv("../input/nlp-getting-started/train.csv")
display(train)

test = pd.read_csv("../input/nlp-getting-started/test.csv")
display(test)

# 1.1 Eliminating Duplicates

One thing we should do is check to see if we have duplicated or conflicting data. Here's an easy way to check for textual duplicates against the `target` - which is the class we're trying to predict.

In [None]:
duplicates = pd.concat(x for _, x in train.groupby(["text"]) if len(x) > 1)
with pd.option_context("display.max_rows", None, "max_colwidth", 240):
    display(duplicates[["id", "target", "text"]])

It looks like we have quite a few duplicates. In some instances, the duplicates resolve to the same target class, but in others such as duplicate indexes `5620` and `5641`, we have the same tweet belonging to two different classes. For those instances where the tweet belongs to the same class, we can simply delete the duplicates.

In [None]:
train.drop(
    [
        6449, 7034, 3589, 3591, 3597, 3600, 3603, 
        3604, 3610, 3613, 3614, 119, 106, 115,
        2666, 2679, 1356, 7609, 3382, 1335, 2655, 
        2674, 1343, 4291, 4303, 1345, 48, 3374,
        7600, 164, 5292, 2352, 4308, 4306, 4310, 
        1332, 1156, 7610, 2441, 2449, 2454, 2477,
        2452, 2456, 3390, 7611, 6656, 1360, 5771, 
        4351, 5073, 4601, 5665, 7135, 5720, 5723,
        5734, 1623, 7533, 7537, 7026, 4834, 4631, 
        3461, 6366, 6373, 6377, 6378, 6392, 2828,
        2841, 1725, 3795, 1251, 7607
    ], inplace=True
)
duplicates = pd.concat(x for _, x in train.groupby(["text"]) if len(x) > 1)
with pd.option_context("display.max_rows", None, "max_colwidth", 240):
    display(duplicates[["id", "target", "text"]])

Now we're facing a challenge. We could keep one duplicate with one target class, but we don't have access to the method the dataset creators used to mark up real versus not real disaster tweets. They may have had access to more information than us, so we have to be careful if we alter the dataset - we could introduce personal bias. While it may be tempting to try to keep some of the data (e.g. `that horrible sinking feeling when you've been at home on your phone for a while and you realise its been on 3G this whole time` seems like it should be marked as `not real`), the better approach is to simply delete the offending duplicates. While this cuts our training size down, we ensure we haven't inadvertently introduced bias to the dataset.

In [None]:
train.drop(
    [
        4290, 4299, 4312, 4221, 4239, 4244, 2830, 
        2831, 2832, 2833, 4597, 4605, 4618, 4232, 
        4235, 3240, 3243, 3248, 3251, 3261, 3266, 
        4285, 4305, 4313, 1214, 1365, 6614, 6616, 
        1197, 1331, 4379, 4381, 4284, 4286, 4292, 
        4304, 4309, 4318, 610, 624, 630, 634, 3985,
        4013, 4019, 1221, 1349, 6091, 6094, 
        6103, 6123, 5620, 5641
    ], inplace=True
)
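
The hard-coded index lists above come from reading the duplicate table by hand. As a rough sketch of how the same two rules could be expressed programmatically, here is a version that runs on a tiny made-up dataframe (the `toy` rows are hypothetical, not competition data). It keeps one copy of each consistently labeled duplicate and drops every copy of a tweet whose labels conflict.

```python
import pandas as pd

# Hypothetical stand-in for the training frame: two duplicated texts,
# one labeled consistently and one labeled inconsistently.
toy = pd.DataFrame({
    "text": ["fire downtown", "fire downtown", "so hungry", "so hungry", "flood warning"],
    "target": [1, 1, 0, 1, 1],
})

# Number of distinct labels attached to each distinct text.
labels_per_text = toy.groupby("text")["target"].transform("nunique")

# Texts carrying conflicting labels: every copy gets dropped.
conflicting = labels_per_text > 1

# Texts repeated with a single label: every copy after the first gets dropped.
redundant = toy.duplicated(subset="text", keep="first") & ~conflicting

deduped = toy[~(conflicting | redundant)]
print(deduped)
```

Whether this selects exactly the same rows as the manual lists has not been checked here, so treat it as an illustration of the rule rather than a drop-in replacement.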

# 1.2 Keyword and Location Normalization

It looks like we need to do a little cleanup here. Both the `keyword` and `location` fields are meant to be interpreted as strings. While we're at it, we should convert them to lowercase for ease of processing. We'll also fill all missing values (`<NA>` values) in `keyword` and `location` with the placeholder string `<empty>`. If we look at the `keyword` strings, we find that some entries have `%20` instead of a space, so we'll replace those with real spaces. We should also stem the `keyword` field so we can collapse similar keywords into a single keyword (for example, `death` and `deaths` would become `death`). Let's go ahead and make those changes to the dataframe.

In [None]:
from gensim.parsing.preprocessing import stem_text

def clean_location_keyword(df):
    # Cast to the pandas string dtype, lowercase, and mark missing values with a placeholder.
    df["location"] = df["location"].astype("string").str.lower().fillna("<empty>")
    df["keyword"] = (
        df["keyword"].astype("string").str.lower()
        .str.replace("%20", " ", regex=False)  # decode URL-encoded spaces
        .fillna("<empty>")
    )
    # Stem each keyword so variants such as "death" and "deaths" collapse together.
    df["keyword"] = df["keyword"].apply(stem_text)

clean_location_keyword(train)
clean_location_keyword(test)
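
To make the effect of these steps concrete, here's a small sketch on three made-up keyword values (the `raw` series is an assumption for illustration). It covers the lowercasing, `%20` decoding, and placeholder filling only. Stemming is left out so that the snippet depends on nothing but pandas.

```python
import pandas as pd

# Hypothetical raw keyword values, including a missing entry.
raw = pd.Series(["Body%20Bags", None, "Deaths"], dtype="string")

cleaned = (
    raw.str.lower()                          # normalize case
    .str.replace("%20", " ", regex=False)    # decode URL-encoded spaces
    .fillna("<empty>")                       # placeholder for missing values
)
print(cleaned.tolist())
```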

# 2. Looking at Class Imbalance

The raw training file has 7,613 samples (slightly fewer now that we've removed duplicates). Let's see how many tweets are examples of real disasters versus those that are not, so we can tell whether the two classes are balanced.

In [None]:
# Name the counts explicitly so the column label does not depend on the pandas version.
counts = train["target"].value_counts().rename("Samples").to_frame()
counts.rename(index={0: "Not Real", 1: "Real"}, inplace=True)
ax = sns.barplot(x=counts.index, y=counts["Samples"])
for p in ax.patches:
    height = p.get_height()
    ax.text(
        x=p.get_x()+(p.get_width()/2),
        y=height,
        s=round(height),
        ha="center"
    )

For this particular set of data, it looks like we have a slightly skewed distribution between the two classes. We'll have to be careful with any machine learning algorithm we use, since we have more tweets that do not pertain to disasters than tweets that describe real ones.
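
If we'd rather have a number than a bar chart, a quick sketch like the following puts a figure on the skew. The labels below are made up (three `0`s and two `1`s), so the printed shares are for illustration only and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical labels: 0 = not a real disaster, 1 = real disaster.
target = pd.Series([0, 0, 0, 1, 1])

# Share of each class, and how many majority samples exist per minority sample.
share = target.value_counts(normalize=True)
imbalance_ratio = share.max() / share.min()

print(share)
print("majority/minority ratio:", imbalance_ratio)
```

On the real data, the same two lines can be pointed at `train["target"]`.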

# 3. Looking at Keywords

Let's take a closer look at what kind of information we have in the `keyword` field, specifically what unique values we have.

In [None]:
with pd.option_context("display.max_rows", None):
    display(train["keyword"].unique())

There are a few things we can collapse. For example, `arson` and `arsonist` can be collapsed to `arson`. Let's go ahead and make a few of these changes.

In [None]:
def collapse_keywords(x):
    if x == "arsonist":
        return "arson"
    if x == "blaze":
        return "ablaz"
    if x == "bloodi":
        return "blood"
    if x == "build burn" or x == "burn build":
        return "build on fire"
    if x == "blew up":
        return "blown up"
    if x == "colli":
        return "collid"
    if x == "explo":
        return "explod"
    if x == "hailstorm":
        return "hail"
    if x == "injuri":
        return "injur"
    if x == "panick":
        return "panic"
    if x == "suicid bomber":
        return "suicid bomb"
    if x == "wildfir":
        return "wild fire"
    return x

train["keyword"] = train["keyword"].apply(collapse_keywords)
test["keyword"] = test["keyword"].apply(collapse_keywords)
with pd.option_context("display.max_rows", None):
    display(train["keyword"].unique())

Now let's look at the keywords in relation to whether their target is real (`1`) or not real (`0`). Let's just take a look at the first 50 rows or so.

In [None]:
with pd.option_context("display.max_rows", None):
    display(pd.DataFrame(data=train[["id", "keyword", "target"]].groupby(["keyword", "target"]).count()).rename(columns={"id": "count"}).head(50))

We can see that there are certain keywords that are strongly tied to one class. For example, `airplane accident` is very strongly associated with the real disaster target - we see it appear 34 times, and 29 of those times it is a real disaster, while only 5 times it is not. This is good news, since it suggests there are likely keywords here that will provide separation between classes.
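
One way to rank keywords by how strongly they lean toward the real class is to compute the share of real tweets per keyword. The sketch below rebuilds the `airplane accident` counts quoted above (34 tweets, 29 real) and adds a second, entirely made-up keyword for contrast.

```python
import pandas as pd

# 34 "airplane accident" tweets with 29 real ones, as quoted above.
# The "aftershock" rows are hypothetical and only there for contrast.
toy = pd.DataFrame({
    "keyword": ["airplane accident"] * 34 + ["aftershock"] * 4,
    "target": [1] * 29 + [0] * 5 + [0] * 4,
})

# Mean of a 0/1 target is the fraction of real-disaster tweets for that keyword.
summary = (
    toy.groupby("keyword")["target"]
    .agg(count="count", real_rate="mean")
    .sort_values("real_rate", ascending=False)
)
print(summary)
```

A `real_rate` near 0 or 1 on a reasonably large `count` is the kind of separation we're hoping for.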

# 4. Looking at Location

Let's take a look at the first 500 entries in the location field and see what we're working with.

In [None]:
print([location for location in train["location"]][:500])

In some instances we have countries, in others states, and in others cities. Yet others contain junk data such as `global` or `Twitter Lockout in progress`. We'll need a way to clean and normalize this data so that it's a little more useful to us. Normalizing it may turn out to be beneficial, but given how messy the field is, it may not be worthwhile to spend huge amounts of time trying to clean it.

As it stands, let's see if we can use the Python package `pycountry` to help us sort real locations from ones that are not real. We can compare what is in the `location` field to subdivision data from `pycountry`. If we get a match, we'll save the state and country to some new columns on the dataframe. If we don't get a match, we'll try a little more processing: if the field holds two floating point numbers, we probably have a set of geo coordinates, so we can mark that as not being location spam. Other than that, we can't do much with the data, so we'll flag it as probable spam.

In [None]:
import re

from pycountry import subdivisions

def clean_state_country(df):
    subs = [subdivision.name.lower() for subdivision in subdivisions]
    countries = [subdivision.country_code for subdivision in subdivisions]
    country = []
    state = []
    location_spam = []
    for _, row in df.iterrows():
        match_found = False
        is_spam = 0
        country_str = "<none>"
        state_str = "<none>"
        if row["location"] != "<empty>":  # skip the placeholder used for missing locations
            for index, subdivision in enumerate(subs):
                if subdivision in row["location"]:
                    country_str = countries[index]
                    state_str = subdivision
                    match_found = True
                    break
            if not match_found:
                split_data = row["location"].replace(" ", "").split(",")
                is_spam = 1
                if len(split_data) == 2:
                    if re.match(r"[\-]*[0-9]+\.[0-9]+", split_data[0]) and re.match(r"[\-]*[0-9]+\.[0-9]+", split_data[1]):
                        is_spam = 0
        location_spam.append(is_spam)
        country.append(country_str)
        state.append(state_str)
    df["country"] = country
    df["state"] = state
    df["location_spam"] = location_spam
    
clean_state_country(train)
clean_state_country(test)
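
The coordinate check is the easiest part of this logic to get subtly wrong, so here it is pulled out as a standalone sketch. The helper name `looks_like_coordinates` is made up for this example, and it uses `fullmatch`, which is a little stricter than the `re.match` call in the cell above because each half must be nothing but a decimal number.

```python
import re

# An optionally negative decimal number such as 40.7128 or -74.0060.
COORD = re.compile(r"-?[0-9]+\.[0-9]+")

def looks_like_coordinates(location: str) -> bool:
    """Return True when the string is two comma-separated decimal numbers."""
    parts = location.replace(" ", "").split(",")
    # Both halves must match, which is why each part is checked separately.
    return len(parts) == 2 and all(COORD.fullmatch(part) for part in parts)

for candidate in ["40.7128, -74.0060", "global", "paris, france", "12.5, abc"]:
    print(candidate, "->", looks_like_coordinates(candidate))
```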

Now that we have country and state information, we should be able to look at those fields the same way we examined keywords. Let's see what happens.

In [None]:
with pd.option_context("display.max_rows", None):
    display(pd.DataFrame(data=train[["id", "country", "target"]].groupby(["country", "target"]).count()).rename(columns={"id": "count"}).head(50))

There are quite a number of entries for which we have no country information. The first row shows us the number of rows without country information. The total is 5,353 entries, which is more than half of our available training data. For entries with countries, we're seeing somewhat equal splits between real and not real disasters. This is to be expected, as people geotag tweets from all countries whether or not they are actually disasters. It's unlikely that real disasters would exclusively be geotagged. This is probably the same for state information. Let's take a look.

In [None]:
with pd.option_context("display.max_rows", None):
    display(pd.DataFrame(data=train[["id", "state", "target"]].groupby(["state", "target"]).count()).rename(columns={"id": "count"}).head(50))

As predicted, we're missing the same amount of state information. Looking at the low counts for each state, we're probably not going to get very useful information from this field, but we'll keep it intact for now.

# 5. Simple Feature Engineering

Before we look directly at the text as a feature, let's think about some of the other first-order information we can extract from it. Here are a few features that may be informative:

* Total length of the text
* Average word length
* Number of `@` mentions
* Number of hashtags
* Number of numeric values in the text (excluding timestamps)
* Number of URLs in the text
* Number of timestamps in the text
* Hashtags in the text
* `@` mentions in the text
* Emojis in the text

Let's go ahead and extract these fields.

In [None]:
import re
import emoji

def engineer_features(df):
    df["total_length"] = df["text"].apply(len)
    df["avg_word_length"] = df["text"].apply(lambda x: round(sum(len(word) for word in x.split()) / len(x.split())))
    df["num_ats"] = df["text"].apply(lambda x: x.count("@"))
    df["num_hashtags"] = df["text"].apply(lambda x: x.count("#"))
    df["num_numeric"] = df["text"].apply(lambda x: len(re.findall(r"\w[0-9,]+\w", x)))
    df["num_urls"] = df["text"].apply(lambda x: x.count("http"))
    df["num_timestamps"] = df["text"].apply(lambda x: len(re.findall(r"[0-9]+:[0-9]+", x)))
    df["hashtags"] = df["text"].apply(lambda x: " ".join([z.lower() for z in re.findall(r'#(\w+)', x)]) or "<none>")
    df["mentions"] = df["text"].apply(lambda x: " ".join([z.lower() for z in re.findall(r'@(\w+)', x)]) or "<none>")
    # emoji.get_emoji_regexp() was removed in emoji 2.0, so count emojis instead.
    df["has_emojis"] = df["text"].apply(lambda x: 1 if emoji.emoji_count(x) > 0 else 0)

engineer_features(train)
engineer_features(test)
train
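
To sanity check what these features look like for a single tweet, here's a sketch that applies a handful of the same rules to one made-up string. The `count_features` helper and the sample tweet are assumptions for illustration. The emoji and numeric features are left out to keep the snippet dependency-free, and an empty-text guard is added that the cell above doesn't have.

```python
import re

def count_features(text):
    """Compute a subset of the first-order features for a single tweet."""
    words = text.split()
    return {
        "total_length": len(text),
        # Guard against empty text to avoid dividing by zero.
        "avg_word_length": round(sum(len(w) for w in words) / len(words)) if words else 0,
        "num_ats": text.count("@"),
        "num_hashtags": text.count("#"),
        "num_urls": text.count("http"),
        "num_timestamps": len(re.findall(r"[0-9]+:[0-9]+", text)),
        "hashtags": " ".join(h.lower() for h in re.findall(r"#(\w+)", text)) or "<none>",
        "mentions": " ".join(m.lower() for m in re.findall(r"@(\w+)", text)) or "<none>",
    }

sample = "Fire at 10:30 near #Downtown @cityfd http://t.co/abc"
features = count_features(sample)
for name, value in features.items():
    print(name, "=", value)
```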

# 6. Pre-processing Text

For our textual analysis to be useful, we'll have to perform some pre-processing on the text first to make it easier to work with. First, let's check out the text fields and see what we're dealing with in more detail.

In [None]:
for _, row in train["text"].head(50).items():
    print(row)

We are going to do some simple cleanup to help out with our word analysis. First, let's convert to lowercase, and fix contractions.

In [None]:
import re

from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_multiple_whitespaces, strip_numeric, stem_text

def fix_contractions(x):
    x = x.replace("&amp;", "and")
    x = x.replace("&lt;", "<")
    x = x.replace("&gt;", ">")
    # Raw strings keep the regex escapes intact, and the dot in "ave." is escaped to match a literal period.
    x = re.sub(r"(\W|^)hwy\.(\W)", r"\1highway\2", x)
    x = re.sub(r"(\W|^)ave\.(\W)", r"\1avenue\2", x)
    x = re.sub(r"(\W|^)fyi(\W)", r"\1for your information\2", x)
    x = re.sub(r"(\W|^)ain't(\W)", r"\1am not\2", x)
    x = re.sub(r"(\W|^)can't(\W)", r"\1cannot\2", x)
    x = re.sub(r"(\W|^)cant(\W)", r"\1cannot\2", x)
    x = x.replace("g'day", "good day")
    x = x.replace("giv'n", "given")
    x = x.replace("let's", "let us")
    x = x.replace("ma'am", "madam")
    x = x.replace("ne'er", "never")
    x = x.replace("o'clock", "of the clock")
    x = x.replace("o'er", "over")
    x = x.replace("ol'", "old")
    x = x.replace("shan't", "shall not")
    x = x.replace("y'all", "you all")
    x = x.replace("'tis", "it is")
    x = re.sub(r"\W'twas", " it was", x)
    x = re.sub(r"\W'cause", " because", x)
    x = re.sub(r"(\w)'ve", r"\1 have", x)
    x = re.sub(r"(\w)n't", r"\1 not", x)
    x = re.sub(r"(\w)'s", r"\1 is", x)
    x = re.sub(r"(\w)'d", r"\1 had", x)
    x = re.sub(r"(\w)'ll", r"\1 will", x)
    x = re.sub(r"(\w)'re", r"\1 are", x)
    x = re.sub(r"(\w)'m", r"\1 am", x)
    x = x.replace("...", " ")
    x = strip_punctuation(x)
    x = strip_multiple_whitespaces(x)
    return x.strip()

def lower_expand(df):
    df["new_text"] = df["text"].apply(lambda x: x.lower())
    df["new_text"] = df["new_text"].apply(fix_contractions)
    
lower_expand(train)
lower_expand(test)
for _, row in train["new_text"].head(50).items():
    print(row)
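
Here's a compact sketch of the same idea using a table of rules. It's only an illustration: the `RULES` list covers four contractions, and it uses `\b` word boundaries instead of the `(\W|^)...(\W)` groups used above, so it isn't a drop-in replacement for `fix_contractions`.

```python
import re

# (pattern, replacement) pairs applied in order. Specific rules come first
# so that "can't" becomes "cannot" rather than "ca not".
RULES = [
    (r"\bcan't\b", "cannot"),
    (r"(\w)n't\b", r"\1 not"),
    (r"(\w)'ll\b", r"\1 will"),
    (r"(\w)'re\b", r"\1 are"),
]

def expand(text):
    """Lowercase the text and expand a few common contractions."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

expanded = expand("We can't stop, they'll say you're wrong and it isn't over")
print(expanded)
```

Keeping the rules in a list makes the ordering explicit, which matters whenever one pattern is a special case of another.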

Looks like there are certain tweets that have lots of personal pronouns, and some that don't. Let's see if there is any separation that may be useful.

In [None]:
def scan_for_pronouns(x):
    # Flag tweets that contain a first- or second-person pronoun.
    words = x.split(" ")
    if "i" in words or "me" in words or "you" in words or "my" in words:
        return 1
    return 0

def has_personal_pronouns(df):
    df["has_personal_pronouns"] = df["new_text"].apply(scan_for_pronouns)

has_personal_pronouns(train)
has_personal_pronouns(test)

display(pd.DataFrame(data=train[["id", "has_personal_pronouns", "target"]].groupby(["has_personal_pronouns", "target"]).count()).rename(columns={"id": "count"}).head(50))
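
Raw counts can be hard to compare when the two groups differ in size, so a row-normalized cross-tabulation may be easier to read. The sketch below uses eight made-up rows (hypothetical values, not the competition data) to show the shape of the output.

```python
import pandas as pd

# Hypothetical flags and labels for eight tweets.
toy = pd.DataFrame({
    "has_personal_pronouns": [1, 1, 1, 0, 0, 0, 0, 1],
    "target": [0, 0, 1, 1, 1, 0, 1, 0],
})

# Each row sums to 1: the share of not-real (0) and real (1) tweets
# among tweets without (0) and with (1) personal pronouns.
rates = pd.crosstab(toy["has_personal_pronouns"], toy["target"], normalize="index")
print(rates)
```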

There may be some separation here that can work in our favor. Let's keep this categorical field of personal pronouns and move on to some more processing. Note that we may want to revisit word categories in the future. Specifically, part-of-speech tagging may give us insights into different distributions of words that we may be able to make use of.

Here's what we're going to do (punctuation was already removed in the previous step):

* Remove words that carry little value, such as `the`, `of`, and `and` (stopword removal)
* Strip out any links, since they are `t.co` shortened by Twitter
* Remove the numerics
* Collapse multiple whitespaces
* Stem the text

In [None]:
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_multiple_whitespaces, strip_numeric, stem_text

def normalize_text(df):
    normalized_text = []

    for _, row in df.iterrows():
        new_text = row["new_text"]
        # Strip t.co link remnants first, so stopword removal cannot disturb the "t co <id>" pattern.
        new_text = re.sub(r"\bt co [\w]+", "", new_text)
        new_text = remove_stopwords(new_text)
        new_text = strip_numeric(new_text)
        new_text = strip_multiple_whitespaces(new_text)
        new_text = stem_text(new_text)
        normalized_text.append(new_text.strip())

    df["normalized_text"] = normalized_text

normalize_text(train)
normalize_text(test)
train
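
To show what the link and stopword steps do to a single string, here's a dependency-free sketch. It's a simplification: `STOPWORDS` is a tiny made-up set rather than gensim's list, and stemming is skipped entirely, so the output won't match `normalize_text` exactly.

```python
import re

# Tiny illustrative stopword set (an assumption, not gensim's list).
STOPWORDS = {"the", "of", "and", "a", "is"}

def normalize(text):
    """Drop t.co link remnants, numerics, and stopwords from cleaned text."""
    text = re.sub(r"\bt co \w+", "", text)   # "t co abc123" is what is left of t.co/abc123
    text = re.sub(r"[0-9]+", "", text)       # remove digits
    words = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(words)                   # split/join also collapses whitespace

normalized = normalize("the fire is out of control http t co abc123 at 5pm")
print(normalized)
```

Note that the leftover `http` token survives, since only the `t co <id>` part of the link is matched.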

# 7. Training and Validating the Classifier

Now it's time to actually build a classifier and see how well it does. To evaluate how well our features are working, we're going to split our training data into a training set and a validation set. We'll do this 5 times for a 5-fold cross validation. The difference in scores between folds gives us an idea of how robust our model actually is to variations in the training data.

To handle our textual data, we'll use CatBoost, as it can handle textual fields very nicely for us. It also allows us to conveniently handle categorical data, without having to resort to label encoding it ourselves. We'll set up CatBoost to automatically determine the best balance of each of the data fields for us, so we don't have to worry about a poorly performing field drowning out informative fields. We'll also set CatBoost's evaluation metric to F1 to match the competition's scoring, so we can get a better idea of how well it's performing during our training phase.
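
Before running the real thing, here's a small sketch of what the stratified split does. The labels are made up (60 zeros and 40 ones, an assumption for illustration), and the features are a dummy array since only the labels affect the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels with a 60/40 class split.
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)

# Fraction of positive samples in each validation fold.
fold_rates = [y[valid_idx].mean() for _, valid_idx in skf.split(X, y)]
print(fold_rates)
```

Every validation fold keeps the 40% positive rate of the full set, so a score difference between folds isn't just an artifact of one fold having more disaster tweets than another.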

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from catboost import CatBoostClassifier

skf = StratifiedKFold(n_splits=5, random_state=2020, shuffle=True)

features = [
    "keyword", "state", "location_spam", "total_length", 
    "avg_word_length", "num_ats", "num_hashtags", "num_numeric", 
    "num_urls", "num_timestamps", "normalized_text",
    "hashtags", "mentions", "has_emojis", "has_personal_pronouns"
]

cat_params = {
    "cat_features": ["location_spam", "has_emojis", "has_personal_pronouns"],
    "text_features": ["keyword", "state", "normalized_text", "hashtags", "mentions"],
    "verbose": 100,
    "learning_rate": 0.05,
    "iterations": 700,
    "eval_metric": "F1",
    "random_state": 2020,
    "depth": 9,
    "auto_class_weights": "Balanced",
}

best_score = 0.0
best_model = None

for fold, (train_index, test_index) in enumerate(skf.split(train, train["target"])):
    print("-------> fold {} <--------".format(fold + 1))
    x_train, x_valid = pd.DataFrame(train.iloc[train_index]), pd.DataFrame(train.iloc[test_index])
    y_train, y_valid = train["target"].iloc[train_index], train["target"].iloc[test_index]
    
    x_train_features = pd.DataFrame(x_train[features])
    x_valid_features = pd.DataFrame(x_valid[features])

    print(": Build CatBoost model")
    model = CatBoostClassifier(
        **cat_params
    )
    model.fit(
        x_train_features, 
        y_train,
        eval_set=[(x_valid_features, y_valid)],
        verbose=100,
    )

    train_predictions = model.predict(x_valid_features)
    
    print(model.get_feature_importance(prettified=True))
    print(classification_report(y_valid, train_predictions, target_names=["Not Real", "Real"]))
    # Note: score() reports accuracy on the validation fold, not the F1 used as eval_metric.
    score = model.score(x_valid_features, y_valid)
    if score > best_score:
        print("--> This model is the best so far {:0.5}".format(score))
        best_model = model
        best_score = score
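
Since the competition is scored on F1 while `model.score` reports accuracy, it's worth seeing how the two can differ. Here's a minimal pure-Python sketch on made-up labels (the `f1_score_binary` helper and both lists are assumptions for illustration).

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical validation labels and predictions.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_score_binary(y_true, y_pred)
print("accuracy:", accuracy, "F1:", f1)
```

Here accuracy is 0.75 but F1 is only 0.5, because half of the real positives were missed. Picking the best fold by F1 instead of accuracy would be one way to line the selection up with the competition metric.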

# 8. Feature Performance

We can take a look and see how our various features are performing.

In [None]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

plt.figure(figsize=(14, 35))
_ = sns.barplot(x="Importances", y="Feature Id", data=best_model.get_feature_importance(prettified=True))

As we can see, the text of the tweet and the keyword field are the major driving forces behind correct categorization.

# 9. Building and Submitting the Final Model

Let's go ahead and build a model that uses all of the data, and takes all the features we've examined so far. Once the model is built, we can submit the result.

In [None]:
train_features = pd.DataFrame(train[features])

print(": Build CatBoost model")
model = CatBoostClassifier(
    **cat_params
)
model.fit(
    train_features, 
    train["target"],
)


Here is the code to run the predictions on the test data, and build the submission file.

In [None]:
test_features = pd.DataFrame(test[features])
predictions = model.predict(test_features)
submission = pd.DataFrame({"id": test["id"], "target": predictions})
submission.to_csv("submission.csv", index=False)
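
Before uploading, a quick format check can catch simple mistakes. The sketch below builds a tiny made-up submission (the ids and predictions are hypothetical) and writes it to an in-memory buffer instead of a file. The expected header and the 0/1 values follow the `id` and `target` columns used above.

```python
import io

import pandas as pd

# Hypothetical predictions for three test ids.
submission = pd.DataFrame({"id": [0, 2, 3], "target": [1, 0, 1]})

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Basic checks: expected header, binary targets, and no missing values.
header_ok = csv_text.splitlines()[0] == "id,target"
targets_ok = set(submission["target"].unique()) <= {0, 1}
complete = not submission.isna().any().any()
print(header_ok, targets_ok, complete)
```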