# Introduction

This is my first attempt at working with NLP. The aim is simply to get a reasonable submission in the easiest and fastest way possible.

## Contents

* Exploratory Data Analysis
* Data Cleansing
* Classifier and Predictions
* Evaluation and Submission

In [None]:
import os

## Importing required libraries and reading in our CSV files

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import string
import re
!pip install pyspellchecker
from spellchecker import SpellChecker

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

In [None]:
# Printing the head of the DataFrame to get an overview of what it looks like
print(train_df.head())

After printing the head of our training DataFrame we can see that this is a very simple data set, containing an ID, keyword, location, the tweet text, and the classification. It is concerning that none of the visible rows contain values for keyword or location. Let's dig a little deeper.

In [None]:
print(train_df.info())
print(test_df.info())

Thankfully, most of the data is populated, although the location column is missing over 30% of its values. This may become a problem if our model makes use of this field.
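
The missing share per column can also be read off directly rather than worked out from the `info()` output. The snippet below is a minimal sketch on a made-up `toy_df` (a hypothetical stand-in for `train_df`, so it runs without the Kaggle files); the same `isna().mean()` call should work on the real frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df
toy_df = pd.DataFrame({
    "keyword": ["fire", np.nan, "flood", "storm"],
    "location": [np.nan, "London", np.nan, np.nan],
    "text": ["a", "b", "c", "d"],
})

# isna() flags missing cells; the mean of those booleans is the missing fraction per column
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data this would be `train_df.isna().mean()`.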

# Some basic EDA with comparisons between the Disaster/Non-Disaster tweets

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='target', data=train_df)

# normalize expects a boolean; True returns class proportions instead of raw counts
train_df['target'].value_counts(normalize=True)

We have more non-disaster tweets than disaster tweets at a balance of 57% vs. 43%.

Next let's see if we can identify any obvious differences between the tweets in terms of length, characters, punctuation, etc.

In [None]:
train_df['word_count']=train_df['text'].str.split().str.len()
train_df['char_count']=train_df['text'].str.len()

grid = sns.FacetGrid(train_df, col='target')

grid.map(plt.hist, 'word_count')

print('The average word count for Non-Disaster tweets is {}'.format(train_df[train_df['target']==0]['word_count'].mean()))
print('The average word count for Disaster tweets is {}'.format(train_df[train_df['target']==1]['word_count'].mean()))

In [None]:
grid = sns.FacetGrid(train_df, col='target')

grid.map(plt.hist, 'char_count')

print('The average character count for Non-Disaster tweets is {}'.format(train_df[train_df['target']==0]['char_count'].mean()))
print('The average character count for Disaster tweets is {}'.format(train_df[train_df['target']==1]['char_count'].mean()))

Overall there isn't much to take away other than the following:

On average, tweets relating to a disaster have a higher word count and character count than those not relating to a disaster. However, both types of tweets seem to be limited by Twitter's 140 character maximum (the limit in force when these tweets were written), shown by the bunching of data points at the high end. Perhaps a more recent data set that allows up to 280 characters (current Twitter limit) would widen the gap between these statistics.
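
The per-class averages and the maximum length can be pulled into a single table, which makes both the gap between classes and the proximity to the character cap easier to see. This is a minimal sketch on a hypothetical `toy_df` with invented tweets, not the competition data.

```python
import pandas as pd

# Hypothetical toy tweets standing in for train_df
toy_df = pd.DataFrame({
    "text": [
        "Forest fire near the town",
        "I love this song",
        "Flood warning issued for the whole county",
        "lol",
    ],
    "target": [1, 0, 1, 0],
})

toy_df["word_count"] = toy_df["text"].str.split().str.len()
toy_df["char_count"] = toy_df["text"].str.len()

# One row per class, with mean and max for each length measure
summary = toy_df.groupby("target")[["word_count", "char_count"]].agg(["mean", "max"])
print(summary)
```

Applied to `train_df`, the `max` column for `char_count` would show how close each class gets to the limit.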


# Cleaning the data

Raw tweet text is noisy, so we need to do some cleaning before we go anywhere near modelling. We will clean the data in the following ways:
* Removing HTML/URLs
* Removing Unnecessary Punctuation
* Spelling Correction

In [None]:
# Order matters here: URLs and HTML tags are removed first, because stripping
# punctuation first would delete the '://', '.', '<' and '>' characters that
# the two regexes below rely on.

# Removing URLs
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

train_df['text'] = train_df['text'].apply(remove_URL)
test_df['text'] = test_df['text'].apply(remove_URL)


# Removing HTML tags
def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

train_df['text'] = train_df['text'].apply(remove_html)
test_df['text'] = test_df['text'].apply(remove_html)


# Removing Punctuation
def remove_punctuation(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

train_df['text'] = train_df['text'].apply(remove_punctuation)
test_df['text'] = test_df['text'].apply(remove_punctuation)


# Correct Spelling
spell = SpellChecker()

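def correct_spellings(text):
    corrected_text = []
    words = text.split()
    misspelled_words = spell.unknown(words)
    for word in words:
        if word in misspelled_words:
            # correction() can return None when no candidate is found
            # (seen in newer pyspellchecker releases), so fall back to the original word
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)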


#train_df['text']=train_df['text'].apply(lambda x : correct_spellings(x))
#test_df['text']=test_df['text'].apply(lambda x : correct_spellings(x))
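
The order in which the punctuation and URL steps run affects the result, because the URL regex depends on characters such as `://` and `.` that punctuation removal deletes. The sketch below illustrates this on a single made-up tweet; the `sample` string and its `t.co` link are hypothetical, and the two helpers are re-declared so the snippet runs on its own.

```python
import re
import string

def remove_url(text):
    # Same pattern as the cleaning cell: http(s) links and www. links
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Hypothetical example tweet
sample = "Wildfire update: http://t.co/abc123 stay safe!"

# Punctuation first: the URL loses '://', '.' and '/', so the regex no longer matches
punct_first = remove_url(remove_punctuation(sample))

# URL first: the link is removed cleanly, then punctuation is stripped
url_first = remove_punctuation(remove_url(sample))

print(punct_first)
print(url_first)
```

With punctuation stripped first, the link survives as a single junk token that would end up in the vocabulary.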

# Creating our classifier and making predictions

In this first attempt, we'll use the simple bag-of-words method to represent our text for machine learning. This method discards information about grammar and word order and works only with how often each word occurs. This is far from optimal, but it is a quick and easy way to get our first submission in the books.

The CountVectorizer() that we'll be using goes through a three-step process:
1. First, it tokenizes all of the strings
2. Second, it builds a "vocabulary" of the words that occur
3. Third, it counts the occurrences of each vocabulary token in every string
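
The three steps above can be seen on a tiny example. This is a minimal sketch using a hypothetical `toy_corpus` of three invented sentences and the default `CountVectorizer` settings, which lowercase the text and keep tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the competition data
toy_corpus = [
    "fire in the forest",
    "the forest is calm",
    "fire fire everywhere",
]

vec = CountVectorizer()
# fit_transform tokenizes, builds the vocabulary, and counts tokens in one call
counts = vec.fit_transform(toy_corpus)

# vocabulary_ maps each token to its column index in the count matrix
print(sorted(vec.vocabulary_))
# One row per document, one column per vocabulary token
print(counts.toarray())
```

Each row is the bag-of-words representation of one sentence; word order is gone, and only the counts remain.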

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_df['text'], train_df['target'], random_state=1)

In [None]:
pl = Pipeline([
        ('vec', CountVectorizer()),
        ('clf', LogisticRegression())
    ])

In [None]:
pl.fit(X_train, y_train)

accuracy = pl.score(X_test, y_test)

print(accuracy)
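
Note that `pl.score` reports plain accuracy. As far as I can tell, this competition's leaderboard is scored with F1, so it may be worth looking at that metric on the held-out split as well. The sketch below uses small hand-made label lists (`y_true` and `y_pred` are hypothetical) to show how the two numbers can differ.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 1 = disaster, 0 = not a disaster
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

# Accuracy: share of predictions that match the true label
acc = accuracy_score(y_true, y_pred)
# F1: harmonic mean of precision and recall for the positive class
f1 = f1_score(y_true, y_pred)

print(acc, f1)
```

In this notebook the equivalent call would be `f1_score(y_test, pl.predict(X_test))`.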

# Evaluation and submission

In [None]:
# Finally we'll input our predictions into the sample submission and submit to Kaggle for final scoring

submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

submission["target"] = pl.predict(test_df['text'])

submission.to_csv("submission.csv", index=False)