# Introduction

The *Real or Not? NLP with Disaster Tweets* competition offers a neat opportunity to compare different approaches to natural language processing. In this notebook, we'll use TF-IDF vectors and a support vector machine to classify tweets. We will:

* Clean and normalize the data set.
* Perform rudimentary natural language processing on the text field.
* Evaluate various hyper-parameters for an `SVC` model.
* Use the model to make predictions that we can submit to the competition.

## Credits

This notebook is similar in approach to [Disaster Tweets 80.263% accuracy using SVC](https://www.kaggle.com/sauravjoshi23/disaster-tweets-80-263-accuracy-using-svc) by Saurav Joshi. It differs in that it removes duplicate tweets and uses `gensim` functions to perform most of the cleaning. It also performs simple hyper-parameter tuning by searching over a few `TfidfVectorizer` options and a few values of `C` for `SVC`. Otherwise, the approach is the same. Please examine his notebook and upvote it!

# 1. Importing the Data

The first step in the process is to import our training data so we can see what kinds of information we have to work with. For this project, we'll load the entire training dataset into a single Pandas dataframe, and do the same for the test set.

In [None]:
import pandas as pd
import numpy as np

train = pd.read_csv("../input/nlp-getting-started/train.csv")
display(train)

test = pd.read_csv("../input/nlp-getting-started/test.csv")
display(test)

## 1.1 Eliminating Duplicates

Before going further, we should check whether the data contains duplicated or conflicting rows. Here's an easy way to list tweets with identical text alongside their `target`, the class we're trying to predict.

In [None]:
duplicates = pd.concat(x for _, x in train.groupby(["text"]) if len(x) > 1)
with pd.option_context("display.max_rows", None, "max_colwidth", 240):
    display(duplicates[["id", "target", "text"]])

It looks like we have quite a few duplicates. In some cases the duplicates share the same target class, but in others, such as rows `5620` and `5641`, the same tweet is assigned to two different classes. Where the tweet belongs to the same class, we can simply delete the duplicates.

In [None]:
train.drop(
    [
        6449, 7034, 3589, 3591, 3597, 3600, 3603, 
        3604, 3610, 3613, 3614, 119, 106, 115,
        2666, 2679, 1356, 7609, 3382, 1335, 2655, 
        2674, 1343, 4291, 4303, 1345, 48, 3374,
        7600, 164, 5292, 2352, 4308, 4306, 4310, 
        1332, 1156, 7610, 2441, 2449, 2454, 2477,
        2452, 2456, 3390, 7611, 6656, 1360, 5771, 
        4351, 5073, 4601, 5665, 7135, 5720, 5723,
        5734, 1623, 7533, 7537, 7026, 4834, 4631, 
        3461, 6366, 6373, 6377, 6378, 6392, 2828,
        2841, 1725, 3795, 1251, 7607
    ], inplace=True
)
duplicates = pd.concat(x for _, x in train.groupby(["text"]) if len(x) > 1)
with pd.option_context("display.max_rows", None, "max_colwidth", 240):
    display(duplicates[["id", "target", "text"]])

Now we're facing a challenge. We could keep one copy of each remaining tweet and pick a target class for it, but we don't know the method the dataset creators used to label tweets as real or not real disasters. They may have had access to more information than we do, so we have to be careful when altering the dataset: we could introduce personal bias. While it may be tempting to try to keep some of the data (e.g. `that horrible sinking feeling when you've been at home on your phone for a while and you realise its been on 3G this whole time` seems like it should be marked as `not real`), the better approach is to simply delete the conflicting duplicates. This cuts our training set down, but it ensures we haven't inadvertently introduced bias into the dataset.

In [None]:
train.drop(
    [
        4290, 4299, 4312, 4221, 4239, 4244, 2830, 
        2831, 2832, 2833, 4597, 4605, 4618, 4232, 
        4235, 3240, 3243, 3248, 3251, 3261, 3266, 
        4285, 4305, 4313, 1214, 1365, 6614, 6616, 
        1197, 1331, 4379, 4381, 4284, 4286, 4292, 
        4304, 4309, 4318, 610, 624, 630, 634, 3985,
        4013, 4019, 1221, 1349, 6091, 6094, 
        6103, 6123, 5620, 5641
    ], inplace=True
)
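
The two hard-coded index lists above were assembled by reading the duplicate table. As a rough sketch of how a similar split could be derived programmatically, the snippet below uses a small made-up dataframe (the rows and the names `toy`, `labels_per_text`, `conflicting`, and `deduped` are illustrative assumptions, not part of the competition data). It drops every copy of a tweet whose duplicates disagree on `target` and keeps one copy of each tweet whose duplicates agree. Whether this reproduces the exact lists above has not been verified here.

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical rows).
toy = pd.DataFrame(
    {
        "text": ["fire downtown", "fire downtown", "so hot today", "so hot today", "flood warning"],
        "target": [1, 1, 0, 1, 1],
    }
)

# Number of distinct labels attached to each distinct tweet text,
# broadcast back to every row of the frame.
labels_per_text = toy.groupby("text")["target"].transform("nunique")

# Tweets whose copies disagree on the label: drop every copy.
conflicting = labels_per_text > 1
deduped = toy[~conflicting]

# Tweets whose copies agree: keep only the first copy.
deduped = deduped.drop_duplicates(subset="text", keep="first")

print(deduped)
```

On the real data, `toy` would be replaced by `train`; the original index labels are preserved, so the dropped rows can still be compared against the lists above.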

# 2. Pre-processing Text

For our textual analysis to be useful, we'll first pre-process the text to make it easier to work with. We'll convert it to lowercase, expand contractions and abbreviations, replace mis-encoded characters, and remove `t.co` links. We'll also strip the punctuation, collapse repeated whitespace, and stem the text.

In [None]:
import re

from gensim.parsing.preprocessing import strip_punctuation, strip_multiple_whitespaces, stem_text

def fix_text_issues(x):
    # Lowercase first; every pattern below is written in lowercase to match.
    x = x.lower()
    # Decode HTML entities.
    x = x.replace("&amp;", "and")
    x = x.replace("&lt;", "<")
    x = x.replace("&gt;", ">")
    # Expand whole-word abbreviations and irregular contractions. Raw strings
    # keep the regex escapes intact, literal dots are escaped, and (\W|$)
    # also catches a match at the very end of the tweet.
    x = re.sub(r"(\W|^)hwy\.(\W|$)", r"\1highway\2", x)
    x = re.sub(r"(\W|^)ave\.(\W|$)", r"\1avenue\2", x)
    x = re.sub(r"(\W|^)fyi(\W|$)", r"\1for your information\2", x)
    x = re.sub(r"(\W|^)ain't(\W|$)", r"\1am not\2", x)
    x = re.sub(r"(\W|^)can't(\W|$)", r"\1cannot\2", x)
    x = re.sub(r"(\W|^)cant(\W|$)", r"\1cannot\2", x)
    x = re.sub(r"(\W|^)rt(\W|$)", r"\1retweet\2", x)
    x = x.replace("g'day", "good day")
    x = x.replace("giv'n", "given")
    x = x.replace("let's", "let us")
    x = x.replace("ma'am", "madam")
    x = x.replace("ne'er", "never")
    x = x.replace("o'clock", "of the clock")
    x = x.replace("o'er", "over")
    x = x.replace("ol'", "old")
    x = x.replace("shan't", "shall not")
    x = x.replace("y'all", "you all")
    x = x.replace("'tis", "it is")
    x = x.replace("'hood", "neighborhood")
    # Replace mis-encoded characters. The text is already lowercase here,
    # so the debris is matched in its lowercased form.
    x = x.replace("can ªt", "cannot")
    x = x.replace("åê", " ")
    x = x.replace("ûï", " ")
    x = x.replace("?û", " ")
    x = x.replace("û_", " ")
    x = x.replace("û÷", " ")
    x = x.replace("ûò", " ")
    x = x.replace("û", " ")
    # Expand the remaining regular contractions by suffix.
    x = re.sub(r"\W'twas", " it was", x)
    x = re.sub(r"\W'cause", " because", x)
    x = re.sub(r"(\w)'ve", r"\1 have", x)
    x = re.sub(r"(\w)n't", r"\1 not", x)
    x = re.sub(r"(\w)'s", r"\1 is", x)
    x = re.sub(r"(\w)'d", r"\1 had", x)
    x = re.sub(r"(\w)'ll", r"\1 will", x)
    x = re.sub(r"(\w)'re", r"\1 are", x)
    x = re.sub(r"(\w)'m", r"\1 am", x)
    # Remove shortened links.
    x = re.sub(r"https?://t\.co/\S+", "", x)
    x = x.replace("...", " ")
    # strip_punctuation replaces punctuation with spaces, so collapse
    # whitespace afterwards rather than before.
    x = strip_punctuation(x)
    x = strip_multiple_whitespaces(x)
    x = stem_text(x)
    return x.strip()

def clean_text(df):
    df["text"] = df["text"].apply(fix_text_issues)

clean_text(train)
clean_text(test)
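
To make the regular-expression steps in `fix_text_issues` easier to follow, here is a small standalone sketch that uses only the standard library. The helper name `expand_contractions` and the sample tweet are made up for illustration, and only a handful of rules in the spirit of those above are reproduced. The patterns use raw strings and a `(\W|$)` tail so that a match at the very end of the string is also caught.

```python
import re

def expand_contractions(text):
    """Apply a small, illustrative subset of the substitution rules to one tweet."""
    text = text.lower()
    # Whole-word abbreviation: "rt" must be bounded by non-word characters
    # (or the start/end of the string), so the "rt" inside "art" is left alone.
    text = re.sub(r"(\W|^)rt(\W|$)", r"\1retweet\2", text)
    # Irregular contraction, handled before the generic n't rule.
    text = re.sub(r"(\W|^)can't(\W|$)", r"\1cannot\2", text)
    # Generic suffix rules: the captured character is the letter before the suffix.
    text = re.sub(r"(\w)n't", r"\1 not", text)
    text = re.sub(r"(\w)'re", r"\1 are", text)
    # Remove shortened links; the dot is escaped so it only matches "t.co".
    text = re.sub(r"https?://t\.co/\S+", "", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

sample = "RT @user: We're safe but can't leave, art didn't burn http://t.co/abc123"
print(expand_contractions(sample))
# prints: retweet @user: we are safe but cannot leave, art did not burn
```

The order of the rules matters: if the generic `n't` rule ran first, `can't` would become `ca not` instead of `cannot`.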

# 3. Training and Validating the Classifier

Let's build a classifier to separate real disaster tweets from the rest. We'll use the `SVC` classifier for the task, and `GridSearchCV` to perform an exhaustive search over some common parameters of the `TfidfVectorizer` and `SVC` classes.

In [None]:
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('tfidf', TfidfVectorizer(decode_error="ignore")), ('clf', SVC(random_state=2020))])
parameters = {
    'tfidf__ngram_range': ((1,1), (1,2), (2,2)),
    'tfidf__use_idf': (True, False), 
    'tfidf__smooth_idf': (True, False),
    'tfidf__sublinear_tf': (True, False),
    'clf__C': (1.5, 1.7, 1.9),
}

grid = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=3, cv=3)
grid_result = grid.fit(train["text"], train["target"])

print("Best score: {:0.5f}".format(grid_result.best_score_))
print("Best parameters: {}".format(grid_result.best_params_))
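
An exhaustive search fits one model per parameter combination per fold, so it helps to know how large the grid is before running it. The sketch below counts the combinations with the standard library only; `grid_values` is a hypothetical copy of the `parameters` dictionary above, and `cv_folds` mirrors `cv=3`.

```python
from itertools import product

# Hypothetical copy of the search space used above.
grid_values = {
    "tfidf__ngram_range": ((1, 1), (1, 2), (2, 2)),
    "tfidf__use_idf": (True, False),
    "tfidf__smooth_idf": (True, False),
    "tfidf__sublinear_tf": (True, False),
    "clf__C": (1.5, 1.7, 1.9),
}
cv_folds = 3  # matches cv=3 in the GridSearchCV call

# Every combination of one value per parameter.
candidates = [dict(zip(grid_values, combo)) for combo in product(*grid_values.values())]
total_fits = len(candidates) * cv_folds

print("candidates:", len(candidates))        # 72
print("cross-validation fits:", total_fits)  # 216
```

By default `GridSearchCV` also refits the best candidate once on all the data it was given. Since `smooth_idf` only changes how the IDF term is computed, the candidates with `use_idf=False` should not differ between its two settings, so the grid could probably be trimmed.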

Let's take the best parameters from the search and evaluate them on a train/validation split to see more detailed performance.

In [None]:
x_train, x_valid, y_train, y_valid = train_test_split(
    train["text"], train["target"], test_size=0.2, random_state=2020
)

vectorizer = TfidfVectorizer(
    decode_error="ignore", 
    ngram_range=(1,2), 
    smooth_idf=False, 
    sublinear_tf=True, 
    use_idf=True
)
x_train_tfidf = vectorizer.fit_transform(x_train)
x_valid_tfidf = vectorizer.transform(x_valid)

model = SVC(random_state=2020, C=1.7)
model.fit(x_train_tfidf, y_train)

# These are predictions on the held-out validation split.
valid_predictions = model.predict(x_valid_tfidf)

print(classification_report(y_valid, valid_predictions, target_names=["Not Real", "Real"]))
score = model.score(x_valid_tfidf, y_valid)
print("--> Mean accuracy {:0.5f}".format(score))
print("")
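
To make the `smooth_idf` and `sublinear_tf` options used above more concrete, the following sketch computes the weights by hand using the formulas given in the scikit-learn documentation: the IDF is `ln(n / df) + 1` without smoothing and `ln((1 + n) / (1 + df)) + 1` with smoothing, and sublinear scaling replaces a raw count `tf` with `1 + ln(tf)`. The helper names and the numbers are made up for illustration, and the L2 normalization that `TfidfVectorizer` applies by default is left out.

```python
import math

def idf(n_docs, doc_freq, smooth):
    """Inverse document frequency, following the documented scikit-learn formulas."""
    if smooth:
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return math.log(n_docs / doc_freq) + 1

def term_weight(tf, n_docs, doc_freq, smooth=False, sublinear=True):
    """Unnormalized TF-IDF weight of one term in one document."""
    if sublinear and tf > 0:
        tf = 1 + math.log(tf)  # dampen repeated occurrences
    return tf * idf(n_docs, doc_freq, smooth)

# A term that appears 3 times in one tweet and in 10 of 1000 tweets overall.
print(term_weight(3, 1000, 10, smooth=False, sublinear=True))
print(term_weight(3, 1000, 10, smooth=False, sublinear=False))
```

With sublinear scaling, a word repeated three times counts for roughly twice as much as a single occurrence rather than three times as much, which is a plausible fit for short texts like tweets.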

# 4. Building and Submitting the Final Model

Let's go ahead and build a model that uses all of the remaining training data. Once the model is built, we can submit the result.

In [None]:
vectorizer = TfidfVectorizer(
    decode_error="ignore", 
    ngram_range=(1,2), 
    smooth_idf=False, 
    sublinear_tf=True, 
    use_idf=True
)
train_tfidf = vectorizer.fit_transform(train["text"])
model = SVC(random_state=2020, C=1.7)
model.fit(train_tfidf, train["target"])

Here is the code to run the predictions on the test data, and build the submission file.

In [None]:
# `test` was already cleaned in section 2; cleaning it a second time would stem it twice.
test_tfidf = vectorizer.transform(test["text"])
predictions = model.predict(test_tfidf)
submission = pd.DataFrame({"id": test["id"], "target": predictions})
submission.to_csv("submission.csv", index=False)
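
The vectorizer and the classifier above are fitted and applied as two separate objects. One possible alternative, sketched here on a tiny made-up corpus, is to wrap both in a scikit-learn `Pipeline` so that a single `fit` and `predict` call handles raw text. The texts, labels, and variable names are illustrative assumptions; the hyper-parameters are copied from the cells above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny hypothetical corpus standing in for the cleaned, stemmed tweets.
toy_texts = [
    "forest fire near the town",
    "flood water rise downtown",
    "earthquake damag build",
    "love thi new song",
    "great pizza for lunch",
    "my phone batteri die again",
]
toy_targets = [1, 1, 1, 0, 0, 0]

final_model = Pipeline(
    [
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), smooth_idf=False, sublinear_tf=True, use_idf=True)),
        ("clf", SVC(random_state=2020, C=1.7)),
    ]
)
final_model.fit(toy_texts, toy_targets)

# The pipeline vectorizes unseen text with the fitted vocabulary before predicting.
toy_predictions = final_model.predict(["fire and flood downtown", "pizza and a new song"])
print(toy_predictions)
```

Because `GridSearchCV` refits the best pipeline on all the data it was given by default, `grid.best_estimator_` from section 3 is an object of this kind and could be used for prediction in the same way.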