This notebook is a clone of the 'NLP Getting Started Tutorial' kernel on Kaggle, made for NLP modeling practice.

### NLP Tutorial

NLP - or Natural Language Processing - is shorthand for a wide array of techniques designed to help machines learn from text. Natural Language Processing powers everything from chatbots to search engines, and is used in diverse tasks like sentiment analysis and machine translation.

In this tutorial we'll look at this competition's dataset, use a simple technique to process it, build a machine learning model, and submit predictions for a score!

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

### Step 1. Data Load and EDA

In [2]:
train_df = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')

#### A quick look at our data

Let's look at our data... first, an example of what is NOT a disaster tweet.

In [3]:
train_df[train_df['target']==0]['text'].values[1]

'I love fruits'

In [6]:
# Let's look at this code more closely
train_df.head()


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


The text column contains the tweet itself; the five rows shown here are all disaster tweets (target = 1).

In [7]:
train_df[train_df['target']==0].head()

Unnamed: 0,id,keyword,location,text,target
15,23,,,What's up man?,0
16,24,,,I love fruits,0
17,25,,,Summer is lovely,0
18,26,,,My car is so fast,0
19,28,,,What a goooooooaaaaaal!!!!!!,0


These tweets (target = 0) are not about disasters.

In [9]:
train_df[train_df['target']==0]['text'].values

array(["What's up man?", 'I love fruits', 'Summer is lovely', ...,
       'These boxes are ready to explode! Exploding Kittens finally arrived! gameofkittens #explodingkittens\x89Û_ https://t.co/TFGrAyuDC5',
       'Sirens everywhere!',
       'I just heard a really loud bang and everyone is asleep great'],
      dtype=object)

In [10]:
train_df[train_df['target']==0]['text'].values[0]

"What's up man?"

In [11]:
train_df[train_df['target']==0]['text'].values[1]

'I love fruits'
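The chained expression used in the cells above can be unpacked one step at a time. This is a minimal sketch on a toy DataFrame; `toy_df` and its three rows are made up for illustration and only stand in for `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df; the rows are illustrative, not the real dataset.
toy_df = pd.DataFrame({
    "text": ["Forest fire near town", "What's up man?", "I love fruits"],
    "target": [1, 0, 0],
})

mask = toy_df["target"] == 0          # boolean Series, True where target is 0
non_disaster = toy_df[mask]           # keep only the rows where the mask is True
texts = non_disaster["text"].values   # NumPy array of the remaining tweet texts
second = texts[1]                     # positional index into that array

print(second)
```

Because `.values` returns a plain NumPy array, the `[1]` at the end is a position in the filtered result, not a label from the original index.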

OK, on to the next step.

------------
And one that is a disaster tweet:

In [12]:
train_df[train_df['target']==1]['text'].values[1]

'Forest fire near La Ronge Sask. Canada'

**Building vectors**

The theory behind the model we'll build in this notebook is pretty simple: the words contained in each tweet are a good indicator of whether it's about a real disaster or not (this is not entirely correct, but it's a great place to start).

We'll use scikit-learn's CountVectorizer to count the words in each tweet and turn them into data our machine learning model can process.

Note: a vector is, in this context, a set of numbers that a machine learning model can work with. We'll look at one in just a second.

In [13]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df['text'][0:5])

In [15]:
## we use .todense() here because these vectors are 'sparse' (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


----------
Let's look at the code in more detail.

In [18]:
example_train_vectors[0]

<1x54 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [21]:
example_train_vectors

<5x54 sparse matrix of type '<class 'numpy.int64'>'
	with 61 stored elements in Compressed Sparse Row format>
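The summary above reports 61 stored elements for a 5x54 matrix, that is, 61 of 270 cells. A small sketch of the same idea with SciPy's CSR format follows; the 2x6 `dense` array is a made-up example, not data from the competition.

```python
import numpy as np
from scipy import sparse

# Hypothetical 2x6 count matrix, mostly zeros like the tweet vectors.
dense = np.array([[0, 0, 1, 0, 2, 0],
                  [0, 0, 0, 0, 0, 1]])
csr = sparse.csr_matrix(dense)

stored = csr.nnz                      # number of explicitly stored (non-zero) entries
total = csr.shape[0] * csr.shape[1]   # number of cells in the dense equivalent
round_trip = csr.toarray()            # back to a regular NumPy array

print(f"{stored} stored of {total} cells")
```

`.toarray()` returns a NumPy array, while `.todense()` (used in the notebook) returns a `numpy.matrix`; both are fine for inspecting a single row.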

In [37]:
train_df

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


In [35]:
test_df

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
...,...,...,...,...
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...
3259,10865,,,Storm in RI worse than last hurricane. My city...
3260,10868,,,Green Line derailment in Chicago http://t.co/U...
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...


--------
OK, moving on.

The vector output above tells us that:

1. There are 54 unique words (or 'tokens') in the first five tweets.
2. The first tweet contains only some of those unique tokens - all of the non-zero counts above are the tokens that DO exist in the first tweet.
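
To see which token each column stands for, the fitted vocabulary can be inspected. This is a sketch on two made-up tweets (`toy_tweets` is not from the dataset), using `CountVectorizer`'s `vocabulary_` attribute, which maps each token to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two illustrative tweets; not taken from the competition data.
toy_tweets = ["Forest fire near the forest", "I love fruits"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_tweets)

# vocabulary_ maps token -> column index; sort the tokens by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
first_row = counts[0].toarray()[0].tolist()

for token, count in zip(vocab, first_row):
    print(token, count)
```

With the default settings the text is lowercased and single-character tokens such as "I" are dropped. Note that "forest" gets a count of 2 in the first row: the vector holds word counts, not just presence flags.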

Now let's create vectors for all of our tweets.

### Step 2. Text -> Vectorization (Bag of Words)

In [23]:
train_vectors = count_vectorizer.fit_transform(train_df['text'])

## Note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that the tokens in the train vectors are the only ones mapped to the test vectors -
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test_df['text'])
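
The effect of calling `.transform()` instead of `.fit_transform()` on the test data can be checked on a tiny example. The sentences below are invented for illustration; the point is only that tokens unseen during fitting do not get a column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative sentences, not from the dataset.
toy_train = ["forest fire near town", "summer is lovely"]
toy_test = ["fire and flood near town"]

vec = CountVectorizer()
train_X = vec.fit_transform(toy_train)   # learns the vocabulary from train only
test_X = vec.transform(toy_test)         # reuses it; "and" and "flood" are ignored

print(train_X.shape, test_X.shape)       # both have the same number of columns
print("flood" in vec.vocabulary_)        # unseen test token has no column
```

Had we called `fit_transform` on the test text, the columns would be rebuilt from the test vocabulary and would no longer line up with the weights learned on the training vectors.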

**Our model**

As we mentioned above, we think the words contained in each tweet are a good indicator of whether it's about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.

What we're assuming here is a *linear* connection. So let's build a linear model and see!

In [24]:
## Our vectors are really big, so we want to push our model's weights
## toward 0 without completely discounting different words - ridge regression
## is a good way to do this.
clf = linear_model.RidgeClassifier()
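
How strongly the weights are pushed toward 0 is controlled by `RidgeClassifier`'s `alpha` parameter (default 1.0). The sketch below uses small synthetic data, not the tweet vectors, and the two `alpha` values are arbitrary choices picked only to make the contrast visible.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

# Synthetic binary features standing in for word counts (assumption, not real data).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(40, 10)).astype(float)
X[0, :2] = 0.0   # guarantee at least one negative example
X[1, :2] = 1.0   # guarantee at least one positive example
y = (X[:, 0] + X[:, 1] > 0).astype(int)

weak = RidgeClassifier(alpha=0.01).fit(X, y)     # little regularization
strong = RidgeClassifier(alpha=100.0).fit(X, y)  # heavy regularization

weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
print(weak_norm, strong_norm)
```

A larger `alpha` gives a smaller overall weight vector, but individual weights are shrunk rather than set exactly to zero, which matches the "without completely discounting different words" idea in the comment above.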


Let's test our model and see how well it does on the training data. For this we'll use cross-validation, where we train on a portion of the known data, then validate it with the rest. If we do this several times (with different portions) we can get a good idea of how a particular model or method performs.

The metric for this competition is F1, so let's use that here.
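
As a reminder of what F1 measures, here is a small worked example: F1 is the harmonic mean of precision and recall for the positive class. The labels below are made up, and the manual calculation is compared against `sklearn.metrics.f1_score`.

```python
from sklearn.metrics import f1_score

# Made-up labels: 1 = disaster tweet, 0 = not a disaster tweet.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

print(f1_manual, f1_sklearn)
```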

### Step 3. Cross Validation Check

In [25]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df['target'], cv=3, scoring='f1')
scores

array([0.59421842, 0.5642787 , 0.64082434])
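
The three fold scores can be summarized as a mean and a spread. A short sketch, with the values copied from the output above:

```python
import numpy as np

# Fold scores copied from the cross_val_score output above.
scores = np.array([0.59421842, 0.5642787, 0.64082434])

mean_f1 = scores.mean()
std_f1 = scores.std()
print(f"F1: {mean_f1:.3f} +/- {std_f1:.3f}")
```

The spread between folds is a rough indication of how much the score might move on unseen data.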

The above scores aren't terrible! It looks like this approach will score roughly 0.65 on the leaderboard. There are lots of ways to potentially improve on this (TF-IDF, LSA, LSTMs / RNNs, the list is long!) - give any of them a shot!
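
As one possible starting point for the TF-IDF idea, the sketch below swaps `CountVectorizer` for scikit-learn's `TfidfVectorizer` inside a pipeline. It runs on four invented sentences (`toy_texts`, `toy_targets`), so it shows the wiring only and says nothing about whether the score on the real data would improve.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

# Invented examples standing in for train_df['text'] and train_df['target'].
toy_texts = ["forest fire near town", "flood warning issued for town",
             "i love fruits", "summer is lovely"]
toy_targets = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together, so the
# vocabulary is always learned from the same data the model is trained on.
model = make_pipeline(TfidfVectorizer(), RidgeClassifier())
model.fit(toy_texts, toy_targets)

train_preds = model.predict(toy_texts).tolist()
print(train_preds)
```

Passing a pipeline like this to `cross_val_score` with the raw text would also refit the vectorizer inside each fold, which avoids learning the vocabulary from the validation portion.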

In the meantime, let's fit the model on the full training set, make predictions on the test set, and build a submission for the competition.

In [41]:
# check some data
print(train_vectors[0].todense())
print(train_vectors[0].todense().shape)

[[0 0 0 ... 0 0 0]]
(1, 21637)


In [48]:
train_vectors

<7613x21637 sparse matrix of type '<class 'numpy.int64'>'
	with 111497 stored elements in Compressed Sparse Row format>

In [46]:
print(train_df['target'].values)
print(train_df['target'].values.shape)

[1 1 1 ... 1 1 1]
(7613,)


Both have the same length (7613), so each row of train_vectors lines up with one target label.

### Step 4. Modeling

In [26]:
clf.fit(train_vectors, train_df['target'])

RidgeClassifier()

In [27]:
clf.predict(test_vectors)

array([0, 1, 1, ..., 1, 1, 0])

### Step 5. Save / Submission (Kaggle)

In [28]:
sample_submission = pd.read_csv('./data/sample_submission.csv')

In [29]:
sample_submission.head()

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0


In [30]:
sample_submission['target'].unique()

array([0])

In [31]:
sample_submission['target'] = clf.predict(test_vectors)

In [32]:
sample_submission.head()

Unnamed: 0,id,target
0,0,0
1,2,1
2,3,1
3,9,0
4,11,1


In [34]:
sample_submission.to_csv('submission.csv', index=False)
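
Before uploading, it can be worth reading the file back and checking its shape. The sketch below does this on a small hypothetical frame (`toy_submission`) written to a temporary directory, so it does not touch the real `submission.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for sample_submission after predictions are filled in.
toy_submission = pd.DataFrame({"id": [0, 2, 3], "target": [0, 1, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    toy_submission.to_csv(path, index=False)   # index=False avoids an extra index column
    reloaded = pd.read_csv(path)

columns_ok = list(reloaded.columns) == ["id", "target"]
labels_ok = bool(reloaded["target"].isin([0, 1]).all())
print(columns_ok, labels_ok, len(reloaded))
```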