In [2]:
import pandas as pd
import numpy as np

from modules.text_cleaning import text_clean

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


# Data Exploration

First we import the training/validation data. I've stored these locally in my `raw_data` folder; they can be downloaded from Kaggle [here](https://www.kaggle.com/c/nlp-getting-started/data).

In [3]:
train_data = pd.read_csv('raw_data/train.csv')

The training/validation data consist of 7613 samples. Excluding the `id` column, each row has three features--`keyword`, `location`, and `text`--and one target value, which is 1 if the text is about a real disaster, and 0 if it is not.

In [4]:
print(train_data.shape)
train_data.head()

(7613, 5)


| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


We see below that although most of these tweets are **not** about real disasters, the two classes are still roughly balanced (about 57% to 43%). When we do our modeling, it may be worth checking whether rebalancing the data set improves results.

In [5]:
counts = train_data['target'].value_counts()                # rows per class
shares = train_data['target'].value_counts(normalize=True)  # fraction per class
print(f"Not about real disaster: {counts[0]} ({round(100 * shares[0], 1)}%)")
print(f"About real disaster: {counts[1]} ({round(100 * shares[1], 1)}%)")

Not about real disaster: 4342 (57.0%)
About real disaster: 3271 (43.0%)
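One measure of the kind mentioned above is to weight the classes rather than resample them. The sketch below is only an illustration, not something this notebook commits to: it applies the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, to the two counts printed above. The function name `balanced_class_weights` is hypothetical.

```python
# Class counts copied from the output above (0 = not a real disaster, 1 = real disaster).
counts = {0: 4342, 1: 3271}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count).

    Hypothetical helper for illustration; rarer classes get larger weights.
    """
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these weights each class contributes the same total weight, so the minority class (real disasters) is weighted up slightly and the majority class is weighted down. Whether this helps would have to be checked at modeling time.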


The average tweet in the training dataset is 101 characters long:

In [6]:
round(train_data['text'].str.len().mean(), 2)

101.04

In terms of missing data, we see that there are no missing target values and no missing text, but a small share of keywords (0.8%) and a third of locations (33.3%) are missing. Since we are (somewhat artificially) imposing the challenge of using only text, these missing values will not be relevant, and there will be no need to impute anything.

In [7]:
train_data.isnull().mean().apply(lambda x : f"{round(100 * x, 1)}% missing")

id           0.0% missing
keyword      0.8% missing
location    33.3% missing
text         0.0% missing
target       0.0% missing
dtype: object

Before wrapping up our analysis, it is worth looking at what our text-cleaning function does. This cleaning will be especially useful in our bag-of-words models. In short, the function takes the text of a tweet, removes stopwords, lemmatizes the remaining words, and joins them with single spaces. As the illustration below shows, the output is also lowercased, and the punctuation and the URL are gone:

In [8]:
print(f"TEXT IN:\n{train_data['text'][7445]}\n\nTEXT OUT:\n{text_clean(train_data['text'][7445])}")

TEXT IN:
Crack in the path where I wiped out this morning during beach run. Surface wounds on left elbow and right knee. http://t.co/yaqRSximph

TEXT OUT:
crack path wiped morning beach run surface wound left elbow right knee
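To make the steps concrete, here is a minimal, standard-library-only sketch of a cleaner in the same spirit. It is not the `text_clean` function imported above, whose source is not shown here; the NLTK downloads at the top suggest the real function relies on NLTK's stopword list, tokenizer, and WordNet lemmatizer. Everything named below (`TOY_STOPWORDS`, `toy_lemmatize`, `toy_text_clean`) is hypothetical, the stopword list is hand-picked, and the plural-stripping rule is only a crude stand-in for lemmatization.

```python
import re

# Hypothetical, hand-picked stopword list (assumption); a real list is much longer.
TOY_STOPWORDS = {
    "a", "an", "and", "are", "during", "i", "in", "is", "of",
    "on", "out", "that", "the", "this", "to", "where",
}

# Matches http:// and https:// links up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def toy_lemmatize(word):
    """Crude stand-in for lemmatization: drop a plural 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def toy_text_clean(text):
    """Strip URLs, lowercase, keep alphabetic tokens, drop stopwords, lemmatize."""
    text = URL_PATTERN.sub(" ", text).lower()
    tokens = re.findall(r"[a-z]+", text)  # also discards punctuation and digits
    kept = [toy_lemmatize(token) for token in tokens if token not in TOY_STOPWORDS]
    return " ".join(kept)


tweet = (
    "Crack in the path where I wiped out this morning during beach run. "
    "Surface wounds on left elbow and right knee. http://t.co/yaqRSximph"
)
print(toy_text_clean(tweet))
```

On this particular tweet the sketch yields the same string as the TEXT OUT shown above, but that agreement should not be expected in general: the naive plural rule will mangle words such as "this"-like non-plurals that are not in the stopword list, and it knows nothing about irregular forms.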
