# A first glance at the data.

#### Hi guys :) I'm very excited for this competition. It's my first experience with anything related to NLP and my first public notebook, and it's still a W.I.P. Hope it helps!

## Imports:

In [None]:
import os

print(os.listdir('../input/nlp-getting-started'))

In [None]:
import pandas as pd
import numpy as np

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

import re, string

In [None]:
train = pd.read_csv('../input/nlp-getting-started/train.csv')
test = pd.read_csv('../input/nlp-getting-started/test.csv')
sample = pd.read_csv('../input/nlp-getting-started/sample_submission.csv')

In [None]:
train.head()

In [None]:
train = train.drop(columns='id')

In [None]:
test_ids = test.id
test = test.drop(columns='id')

## Distribution of target variable:

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='target', data=train).set_title('Target variable distribution')

Seems like there's a similar number of tweets in each class.
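
The bar chart gives a rough picture; the exact share of each class can be read off with `value_counts(normalize=True)`. A minimal sketch on made-up labels (`toy_target` is a hypothetical stand-in for `train.target`, not the competition data):

```python
import pandas as pd

# Toy stand-in for train.target (invented labels, for illustration only).
toy_target = pd.Series([0, 0, 0, 1, 1], name='target')

# Share of each class, in percent.
class_share = 100.0 * toy_target.value_counts(normalize=True)
print(class_share)
```

On the real data, swapping `toy_target` for `train.target` should give the actual class split.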

## Now let's take a look into these NaN values!

In [None]:
(100.0 * train.isna().sum() / train.shape[0]).to_frame(name='percentage').sort_values(by='percentage')

Seems like location is missing about 33% of the time, whereas keyword is missing only 0.8% of the time. I wonder how much the keyword correlates with disasters.

So let's take a look, for each class, at the percentage of tweets whose text contains their keyword.

In [None]:
def is_keyword_in(data):
    # Exact, case-sensitive match against the whitespace-separated tokens
    return int(data.keyword in data.text.split())

In [None]:
train['keyword_appears'] = train[['keyword', 'text']].dropna().apply(is_keyword_in, axis=1)

In [None]:
print('Percentage of keyword appearance in disasters')
100.0 * train[train.target == 1].keyword_appears.value_counts(normalize=True).to_frame(name='percentage')

In [None]:
train[train.target == 1].keyword_appears.value_counts(normalize=True).plot(kind='bar').set_title('Does keyword appear in real disasters?')

In [None]:
print('Percentage of keyword appearance in non-disasters')
100.0 * train[train.target == 0].keyword_appears.value_counts(normalize=True).to_frame(name='percentage')

In [None]:
train[train.target == 0].keyword_appears.value_counts(normalize=True).plot(kind='bar').set_title('Does keyword appear in non disasters?')

In [None]:
pd.crosstab(train.target, train.keyword_appears)

Not much of a difference =/. So whether the keyword appears in the text may not be a good enough hint that a tweet is about a real disaster.
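
One caveat: the check in `is_keyword_in` is an exact, case-sensitive match on whitespace-separated tokens, so `ABLAZE,` would not count as containing `ablaze`. A more lenient variant might look like the sketch below. The function name `keyword_in_text` is hypothetical, and the idea that multi-word keywords may be stored URL-encoded (e.g. `forest%20fire`) is an assumption worth checking against the data.

```python
from urllib.parse import unquote

def keyword_in_text(keyword, text):
    """Case-insensitive substring match; decodes URL-encoded keywords first."""
    # Assumption: multi-word keywords may be stored URL-encoded,
    # e.g. 'forest%20fire' standing for 'forest fire'.
    decoded = unquote(keyword).lower()
    return int(decoded in text.lower())

print(keyword_in_text('forest%20fire', 'Forest fire near the lake!'))
print(keyword_in_text('ablaze', 'Building set ABLAZE, crews on site'))
print(keyword_in_text('flood', 'Nice weather today'))
```

Substring matching is deliberately loose (it would also match `fire` inside `fired`), so this trades precision for recall. Rows with a missing keyword still need to be dropped first, as in the original cell.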

## How about location? Which locations have the most frequent disasters?

In [None]:
train.location.dropna().value_counts().to_frame(name='count')

Oh wow! Right off the bat it's clear that this location data is really dirty.
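
A cheap first pass at tidying it up could be trimming whitespace and lowercasing before counting. This is only a sketch on invented values (`toy_location` is a hypothetical stand-in for `train.location`), and it will not merge variants like `london` and `london, uk`:

```python
import pandas as pd

# Toy stand-in for train.location (invented values, for illustration only).
toy_location = pd.Series(['USA', 'usa ', ' Usa', 'London', 'london, UK', None])

# Light normalization: drop NaN, trim whitespace, lowercase.
normalized = toy_location.dropna().str.strip().str.lower()
location_counts = normalized.value_counts()
print(location_counts)
```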

## Let's look at some correlation between some text features and the target variable

In [None]:
def get_num_words(data):
    return len(data.split())

In [None]:
# Number of characters
train['num_chars'] = train.text.apply(len)

# Number of words
train['num_words'] = train.text.apply(get_num_words)

In [None]:
train.num_chars.describe()

In [None]:
sns.boxplot(x='target', y='num_chars', data=train[['num_chars', 'target']]).set_title('Number of characters')

In [None]:
train.num_words.describe()

In [None]:
sns.boxplot(x='target', y='num_words', data=train[['num_words', 'target']]).set_title('Number of words')

Seems like disaster tweets tend to have more characters on average, whereas the number of words is almost identical.
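
To put numbers next to the boxplots, the lengths can be summarised per class with a `groupby`. A minimal sketch on a made-up frame (the `toy` DataFrame and its tweets are invented for illustration):

```python
import pandas as pd

# Toy stand-in for train (invented tweets, for illustration only).
toy = pd.DataFrame({
    'text': [
        'Forest fire spreading near the highway',
        'I love this song',
        'Flood warning issued for the county',
        'so bored lol',
    ],
    'target': [1, 0, 1, 0],
})

# Same two features as above: characters and whitespace-separated words.
toy['num_chars'] = toy.text.str.len()
toy['num_words'] = toy.text.str.split().str.len()

# Mean and median of each feature, per class.
length_summary = toy.groupby('target')[['num_chars', 'num_words']].agg(['mean', 'median'])
print(length_summary)
```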

Do people mention others a lot? Let's see:

In [None]:
mentions = 0

for tweet in train.text.values:
    words = tweet.split()
    for w in words:
        if w[0] == '@':
            mentions += 1

print('Number of mentions:', mentions)
print('Number of tweets:', train.shape[0])

Oh wow, actually a lot of mentions! Maybe those could help a model tell whether a tweet is about a disaster or not, for instance if it mentions the Twitter username of a famous person or organisation. That could be looked into.
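
One way to start looking into it would be to pull out the handles and count the most frequent ones. A rough sketch on invented tweets, assuming a handle is `@` followed by word characters (the pattern `MENTION_RE` is my own and will also pick up things like the domain part of an email address):

```python
import re
from collections import Counter

# Assumption: a handle is '@' followed by word characters (letters, digits, underscore).
MENTION_RE = re.compile(r'@(\w+)')

# Toy stand-in for train.text (invented tweets, for illustration only).
toy_tweets = [
    '@bbcnews reports flooding downtown',
    'thanks @alice and @bob for the ride',
    'heavy smoke reported by @BBCNews',
    'no mentions here',
]

# Count handles case-insensitively across all tweets.
handle_counts = Counter(
    handle.lower()
    for tweet in toy_tweets
    for handle in MENTION_RE.findall(tweet)
)
print(handle_counts.most_common(3))
```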

In [None]:
def has_mention(data):
    # Despite the name, this returns the number of mentions, not a flag
    mentions = 0
    for word in data.text.split():
        if word[0] == '@':
            mentions += 1

    return mentions

In [None]:
train['mention'] = train.apply(has_mention, axis=1)

In [None]:
print('Percentage of mentions in disasters')
100.0 * train[train.target == 1].mention.value_counts(normalize=True).to_frame(name='percentage')

In [None]:
train[train.target == 1].mention.value_counts(normalize=True).plot(kind='bar').set_title('Do mentions appear in disasters?')

In [None]:
print('Percentage of mentions in non-disasters')
100.0 * train[train.target == 0].mention.value_counts(normalize=True).to_frame(name='percentage')

In [None]:
train[train.target == 0].mention.value_counts(normalize=True).plot(kind='bar').set_title('Do mentions appear in non-disasters?')

Seems like disasters tend to have fewer mentions, but the difference is not that big.
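
Since the `mention` column holds a count, the bar charts show the distribution of counts. If the question is simply "does the tweet mention anyone?", a yes/no flag compared across classes may be easier to read. A sketch on invented tweets (the `toy` frame and the `any_mention` column are hypothetical names):

```python
import pandas as pd

# Toy stand-in for train (invented tweets, for illustration only).
toy = pd.DataFrame({
    'text': [
        '@user did you see the fire?',
        'Wildfire evacuation ordered',
        '@a @b lol',
        'great day',
        'Earthquake hits the coast',
    ],
    'target': [1, 1, 0, 0, 1],
})

# True if any whitespace-separated token starts with '@'.
toy['any_mention'] = toy.text.apply(
    lambda t: any(w.startswith('@') for w in t.split())
)

# Percentage of tweets with at least one mention, per class.
mention_share = 100.0 * toy.groupby('target')['any_mention'].mean()
print(mention_share)
```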

## Data Cleaning

The majority of the code here was taken from: https://www.kaggle.com/shahules/tweets-complete-eda-and-basic-modeling. I added the mention removal part.

In [None]:
def remove_URL(data):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',data)

def remove_html(data):
    html = re.compile(r'<.*?>')
    return html.sub(r'',data)

# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(data):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # very broad range: also covers CJK and other non-emoji characters
                           "]+", flags=re.UNICODE)

    return emoji_pattern.sub(r'', data)

def remove_punct(data):
    table=str.maketrans('','',string.punctuation)
    return data.translate(table)

def make_lower(data):
    return data.lower()

def remove_mentions(data):
    words = data.split()
    
    words = [word for word in words if word[0] != '@']
    return ' '.join(words)

def clean_data(data, drop=False, test=False, lowercase=False, correct=False, rmv_mentions=False):
    data.text = data.text.apply(remove_URL)
    data.text = data.text.apply(remove_html)
    data.text = data.text.apply(remove_emoji)

    # Mentions must go before punctuation: remove_punct strips the '@'
    # that remove_mentions relies on
    if rmv_mentions:
        data.text = data.text.apply(remove_mentions)

    data.text = data.text.apply(remove_punct)

    if lowercase:
        data.text = data.text.apply(make_lower)

    if correct:
        # correct_spellings is not defined in this notebook yet;
        # define it before calling with correct=True
        data.text = data.text.apply(correct_spellings)

    if drop and test:
        return data[['text']]
    elif drop:
        return data[['text', 'target']]

    return data

In [None]:
%%time
train = clean_data(train, drop=True, lowercase=True, rmv_mentions=True)

In [None]:
train.head()
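
One thing worth checking in a cleaning pipeline like this is the order of the steps. `string.punctuation` contains `@`, so if punctuation is stripped before mentions are removed, the mention filter has nothing left to match and the username stays in the text. A small sketch showing both orders on an invented tweet (`strip_punct` and `strip_mentions` are simplified stand-ins for `remove_punct` and `remove_mentions`):

```python
import string

def strip_punct(text):
    # Drop every character listed in string.punctuation (this includes '@').
    return text.translate(str.maketrans('', '', string.punctuation))

def strip_mentions(text):
    # Drop whitespace-separated tokens that start with '@'.
    return ' '.join(w for w in text.split() if not w.startswith('@'))

tweet = '@user the fire is spreading!'

# Punctuation first: the '@' is gone, so the username survives as a plain word.
punct_first = strip_mentions(strip_punct(tweet))

# Mentions first: the whole '@user' token is dropped before punctuation goes.
mentions_first = strip_punct(strip_mentions(tweet))

print(punct_first)
print(mentions_first)
```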

## Model

W.I.P.