## NLP (Natural Language Processing) Tutorial - Sarah's Edited Copy

In this notebook, we're going to build a model that classifies tweets as disaster or non-disaster. The steps below are based on the NLP tutorial and Dylan's notebook.

## 1. First, we will import the necessary libraries and packages

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

## 2. Next, we will store our data source into data frames
We have one training set and one test set.

In [2]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

## 3. Let's review the training dataset

In [3]:
# SPC - View variable types in the dataset
train_df.info() #7613 training records

# SPC - Preview random 20 records from the data source
train_df.sample(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
id          7613 non-null int64
keyword     7552 non-null object
location    5080 non-null object
text        7613 non-null object
target      7613 non-null int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


Unnamed: 0,id,keyword,location,text,target
7366,10546,windstorm,Georgia ? Tennessee,When I breathe it sounds like a windstorm. Hah...,0
4251,6039,heat%20wave,"Greenfield, Massachusetts",Many thx for share and your comment Alex Light...,0
1908,2743,crushed,,So many Youtube commenters saying the Dothraki...,1
5112,7291,nuclear%20disaster,,#Nuclear policy of #Japan without responsibili...,1
5107,7286,nuclear%20disaster,,The president spoke of Kennedy's diplomacy to ...,1
1220,1759,buildings%20burning,"Savannah, GA",Sinking ships burning buildings &amp; Falling ...,1
7020,10061,typhoon,Whole World,Global precipitation measurement satellite cap...,1
3757,5338,fire,å_,WCW @catsandsyrup THA BITCH IS FIRE,0
5792,8264,rioting,Upstate New York,I think Twitter was invented to keep us insomn...,0
1864,2679,crush,San Fransokyo,I have the biggest crush on you &amp; I dont k...,0


## 4. Now let's review the test dataset

In [4]:
test_df.info() #3263 test records
test_df.sample(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
id          3263 non-null int64
keyword     3237 non-null object
location    2158 non-null object
text        3263 non-null object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB


Unnamed: 0,id,keyword,location,text
2974,9843,trauma,"California, USA",@AngstAttack Dating a woman does NOT make you ...
1746,5880,hailstorm,USA,@Haley_Whaley Hailstorm Clash ofClans Gems Giv...
2675,8919,snowstorm,,Long Island is technically the leftover dirt f...
506,1656,bombing,Kill Devil Hills,'Japan Marks 70th Anniversary of Hiroshima Ato...
182,593,arson,EARTH,#LGBTQ News ?? Owner of Chicago-Area Gay Bar A...
2639,8821,sirens,Atrapada en el mundo.,Sleeping With Sirens - 2 Chord
3038,10046,twister,,@bamagaga Best part of today? Your song I'M TO...
3057,10131,upheaval,,Bat four reasons en route to upheaval versus i...
2198,7353,obliterate,,@BattyAfterDawn @DrawLiomDraw he's a good cute...
975,3230,deluged,,Businesses are deluged with invoices. Make you...


## 5. Text Pre-Processing (Training Data Clean-up)
In this step, we will clean up the text by doing the following:
1. Convert all text to lowercase
2. Remove special characters such as '@', '#', '*'
3. Remove common articles such as 'a', 'an', 'the'

In [5]:
# convert to lowercase by applying the lambda function
train_df['text'] = train_df['text'].apply(lambda x: x.lower())
train_df.sample(20)

Unnamed: 0,id,keyword,location,text,target
5333,7612,pandemonium,"VONT ISLAND, LAGOS",pandemonium in aba as woman delivers baby with...,0
824,1199,blizzard,,peanut butter cookie dough blizzard is ???????...,0
21,32,,,london is cool ;),0
5347,7633,pandemonium,Los Angeles,pandemonium in aba as woman delivers baby with...,0
7553,10798,wrecked,,#news cramer: iger's 3 words that wrecked disn...,0
6987,10020,twister,,brain twister let drop up telly structuring ca...,0
842,1222,blizzard,,i call it a little bit of your blizzard?,1
881,1276,blood,Buenos Aires,*se pone a cantar crying lightning*,0
3434,4910,explode,,toronto going crazy for the blue jays. can you...,0
4017,5706,floods,"21.462446,-158.022017",floods cause damage and death across asia | al...,1


In [6]:
# remove special characters
train_df['text'] = train_df['text'].apply(lambda x: x.replace('@','')) # remove '@'
train_df['text'] = train_df['text'].apply(lambda x: x.replace('#','')) # remove '#'
train_df['text'] = train_df['text'].apply(lambda x: x.replace('*','')) # remove '*'

# remove common articles
# .replace() cannot be used here: it would also strip 'a', 'an' and 'the' from inside other words,
# so we split each tweet on spaces and drop only the tokens that exactly match an article
articles = {'a', 'an', 'the'}
train_df['text'] = train_df['text'].apply(lambda tweet: " ".join(word for word in tweet.split(" ") if word not in articles))

# preview 20 random records
train_df.sample(20)

Unnamed: 0,id,keyword,location,text,target
544,791,avalanche,World,avalanche city - sunset http://t.co/48h3tlvlxr...,1
5355,7643,pandemonium,California,truly scene of chaos unprecedented in frenzy. ...,1
5431,7750,police,,world news qld police wrap billy gordon invest...,1
7059,10113,upheaval,,look at state actions year after fergusonûªs ...,0
4257,6049,heat%20wave,,creationsbykole cork city in ireland...we got ...,1
2035,2921,danger,Atlanta Georgia,therealrittz fettilootch is slanglucci oppress...,0
6220,8878,smoke,cigarknub@gmail.com,smoke it all http://t.co/79upydcemp,0
4574,6504,injuries,North West London,likeavillasboas rich_chandler being' injury pr...,0
302,444,apocalypse,Tokyo,enjoyed live-action attack on titan but every ...,0
3467,4959,exploded,,worked at fast food joint. poured burnt hot oi...,0
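
The three clean-up steps above can also be bundled into a single function, which makes it easier to apply the same cleaning to any other text column later. The following is a minimal sketch, not part of the original notebook: `clean_text` and `ARTICLES` are hypothetical names, and the sample sentence is made up.

```python
import re

# Hypothetical helper (an assumption, not from the original notebook).
ARTICLES = {"a", "an", "the"}

def clean_text(text: str) -> str:
    """Lowercase, strip '@', '#', '*', and drop the articles 'a', 'an', 'the'."""
    text = text.lower()
    text = re.sub(r"[@#*]", "", text)
    # Split on single spaces, as the cells above do, so only whole-word articles are removed.
    tokens = [tok for tok in text.split(" ") if tok not in ARTICLES]
    return " ".join(tokens)

print(clean_text("The #Fire near @LaRonge is *huge* and a threat"))
```

Under these assumptions, `train_df['text'].apply(clean_text)` would reproduce the cleaning done in cells 5 and 6 in one pass.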


## 6. A quick look at our data

Following the NLP tutorial, we will review the first 5 non-disaster tweets.

In [7]:
train_df[train_df["target"] == 0]["text"].values[0:5]

array(["what's up man?", 'i love fruits', 'summer is lovely',
       'my car is so fast', 'what goooooooaaaaaal!!!!!!'], dtype=object)

And we will also review the first 5 disaster tweets.

In [8]:
train_df[train_df["target"] == 1]["text"].values[0:5]

array(['our deeds are reason of this earthquake may allah forgive us all',
       'forest fire near la ronge sask. canada',
       "all residents asked to 'shelter in place' are being notified by officers. no other evacuation or shelter in place orders are expected",
       '13,000 people receive wildfires evacuation orders in california ',
       'just got sent this photo from ruby alaska as smoke from wildfires pours into school '],
      dtype=object)

## 7. Building vectors

Following the NLP tutorial, the model we'll build rests on one starting assumption: the words contained in each tweet indicate whether or not it is about a real disaster. We will start by using scikit-learn's `CountVectorizer` to count the words in each tweet and store the counts as vectors. Note: a `vector` is, in this context, a set of numbers that a machine learning model can work with.

In [9]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first single tweet in the data and view it
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:1])
print(example_train_vectors)

  (0, 8)	1
  (0, 3)	1
  (0, 2)	1
  (0, 9)	1
  (0, 7)	1
  (0, 10)	1
  (0, 4)	1
  (0, 6)	1
  (0, 1)	1
  (0, 5)	1
  (0, 11)	1
  (0, 0)	1


Each line of the output above has the form `(tweet index, word index) count`. There are 12 lines, so the first tweet contains 12 unique words after pre-processing/cleaning. Below, we display the first 5 tweets.

In [10]:
train_df['text'][0:5]

0    our deeds are reason of this earthquake may al...
1               forest fire near la ronge sask. canada
2    all residents asked to 'shelter in place' are ...
3    13,000 people receive wildfires evacuation ord...
4    just got sent this photo from ruby alaska as s...
Name: text, dtype: object

In [13]:
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 12)
[[1 1 1 1 1 1 1 1 1 1 1 1]]


We will now create vectors based on a larger sample: the first 25 tweets.

In [14]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 25 tweets in the data and view them
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:25])
print(example_train_vectors)

  (0, 104)	1
  (0, 42)	1
  (0, 13)	1
  (0, 111)	1
  (0, 98)	1
  (0, 136)	1
  (0, 48)	1
  (0, 90)	1
  (0, 10)	1
  (0, 61)	1
  (0, 143)	1
  (0, 9)	1
  (1, 60)	1
  (1, 54)	1
  (1, 94)	1
  (1, 80)	1
  (1, 116)	1
  (1, 118)	1
  (1, 29)	1
  (2, 13)	2
  (2, 9)	1
  (2, 113)	1
  (2, 18)	1
  (2, 138)	1
  (2, 123)	2
  :	:
  (15, 88)	1
  (16, 86)	1
  (16, 63)	1
  (17, 78)	1
  (17, 131)	1
  (17, 87)	1
  (18, 78)	1
  (18, 126)	1
  (18, 30)	1
  (18, 92)	1
  (18, 53)	1
  (19, 148)	1
  (19, 67)	1
  (20, 136)	1
  (20, 78)	1
  (20, 114)	1
  (21, 78)	1
  (21, 83)	1
  (21, 35)	1
  (22, 86)	1
  (22, 124)	1
  (23, 148)	1
  (23, 150)	1
  (23, 40)	1
  (24, 84)	1


Now we will use `.todense()` to expand the sparse row for a single tweet into a full vector with one entry per vocabulary word, zeros included.

In [17]:
print(example_train_vectors[20].todense().shape)
print(example_train_vectors[20].todense())

(1, 152)
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0]]


The above tells us that there are 152 unique words (or "tokens") across the 25 sample tweets; the tweet at index 20 contains only 3 of them. Next we will split the training dataset into a training split (67%) and a held-out test split (33%).

In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(pd.DataFrame(train_df['text']), pd.DataFrame(train_df['target']), test_size=0.33, random_state=42)
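
As a rough illustration of what `test_size=0.33` does, the sketch below splits a made-up 100-row frame (`toy_df` is a stand-in, not the real data). It differs from the cell above in two ways, both of which are assumptions rather than the notebook's method: it passes Series instead of DataFrames so the target stays one-dimensional, and it adds the optional `stratify` argument to keep the class balance the same in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for train_df (values are made up).
toy_df = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(100)],
    "target": [i % 2 for i in range(100)],
})

# Series in, Series out; stratify is an optional extra not used in the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df["text"], toy_df["target"],
    test_size=0.33, random_state=42, stratify=toy_df["target"],
)

print(len(X_tr), len(X_te))      # roughly two thirds vs one third of the rows
print(y_tr.mean(), y_te.mean())  # share of disaster tweets in each split
```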

We will now view a sample of 5 tweets from the training dataset.

In [22]:
X_train.sample(5)

Unnamed: 0,text
6084,share large sinkhole swallows entire pond in l...
3766,justin_ling i promise not to tax pancakes or r...
7349,reuters top news: photos: rocky fire has grown...
1077,crew on enolagay had nuclear bomb on board dis...
7578,jt_ruff23 cameronhacker and i wrecked you both


Now we're going to create vectors for all of our tweets (both the training split and the held-out test split).

In [23]:
train_vectors = count_vectorizer.fit_transform(X_train["text"])

## note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that the tokens in the train vectors are the only ones mapped to the test vectors - 
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(X_test["text"])
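
The difference between `.fit_transform()` and `.transform()` is easier to see on a tiny example. This sketch uses made-up sentences and illustrative names (`vec`, `train_m`, `test_m`); it only demonstrates the behaviour described in the comment above.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["fire in the forest", "lovely summer day"]
test_docs = ["forest fire and earthquake"]  # 'and' and 'earthquake' never appear in training

vec = CountVectorizer()
train_m = vec.fit_transform(train_docs)  # learns the vocabulary from the training text only
test_m = vec.transform(test_docs)        # reuses that vocabulary; unseen words are dropped

print(train_m.shape, test_m.shape)       # same number of columns in both
print(vec.inverse_transform(test_m))     # only the known tokens survive
```

Because both matrices share the same columns, a model fitted on `train_m` can score `test_m` directly.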

We will now view one row of the `train_vectors` that we just created.

In [25]:
print(train_vectors[10].todense().shape)
print(train_vectors[10])
# .shape is (rows, columns); index 1 selects the column count, i.e. the size of the vocabulary learned from X_train
print(f"Total vocabulary learned from the training split is {train_vectors[10].todense().shape[1]} words.")

(1, 16405)
  (0, 10492)	1
  (0, 749)	1
  (0, 752)	1
  (0, 2282)	1
  (0, 10466)	1
  (0, 12303)	1
  (0, 12304)	1
  (0, 6789)	1
  (0, 5221)	1
  (0, 8627)	1
  (0, 4899)	1
  (0, 11430)	1
  (0, 2372)	1
  (0, 6682)	1
  (0, 8199)	1
  (0, 1197)	1
Total vocabulary learned from the training split is 16405 words.


### Our model

We will assume that there is a linear relationship between the word counts of a tweet and whether or not it is about a real disaster. As a start, I apply logistic regression instead of the Ridge Classifier because our output is binary ('is a disaster tweet' vs 'is NOT a disaster tweet') and logistic regression models the probability of a binary outcome directly.

In [32]:
## Our vectors are really big, so we want to push our model's weights
## toward 0 without completely discounting different words - ridge regression
## does this with an L2 penalty, and LogisticRegression applies an L2 penalty by default too.
# clf = linear_model.RidgeClassifier()

# Import logistic regression
from sklearn.linear_model import LogisticRegression

# Use logistic regression
model = LogisticRegression()

# Fit the model on the training vectors; y_train["target"] is a 1-D Series,
# which avoids the column-vector warning raised when the whole DataFrame is passed
model.fit(train_vectors, y_train["target"])


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Now let's test our model for accuracy on the held-out test split.

In [34]:
# scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
#scores = model_selection.cross_val_score(model, train_vectors, train_df["target"], cv=3, scoring="f1")
#scores
accuracy = model.score(test_vectors, y_test['target'])
print("Accuracy:", accuracy)

Accuracy: 0.8030242737763629
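
Accuracy from a single split can be complemented with a cross-validated F1 score, which is what the commented-out lines above were aiming at. The sketch below is one possible way to do that, not the notebook's method: it wraps the vectorizer and the classifier in a scikit-learn pipeline and runs on a tiny made-up corpus (`texts` and `labels` are invented for illustration).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; 1 = disaster, 0 = not a disaster.
texts = [
    "forest fire spreading fast", "earthquake shakes city", "flood destroys homes",
    "wildfire evacuation ordered", "storm damage reported", "bridge collapse injures many",
    "i love fruits", "summer is lovely", "my car is so fast",
    "what a great goal", "this song is fire", "crushed my workout today",
]
labels = [1] * 6 + [0] * 6

# The pipeline refits the vectorizer inside each fold, so the vocabulary
# is always learned from that fold's training portion only.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="f1")
print(scores, scores.mean())
```

With the real data, the same call would presumably take `train_df["text"]` and `train_df["target"]` in place of the toy lists.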


Now we refit the model on the training split, passing the target column as a 1-D Series.

In [36]:
model.fit(train_vectors, y_train["target"])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
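
Once the model is fitted, scoring new tweets follows the same pattern as the held-out split: vectorize with `.transform()`, then call `.predict()`. The sketch below shows this on made-up sentences with illustrative names (`vec`, `clf`, `new_texts`). For the competition's `test_df`, one would presumably apply the same text cleaning first and then use `count_vectorizer.transform(test_df["text"])`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data; 1 = disaster, 0 = not a disaster.
train_texts = ["forest fire near town", "earthquake hits coast",
               "i love fruits", "summer is lovely"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = LogisticRegression()
clf.fit(vec.fit_transform(train_texts), train_labels)

# New, unseen tweets: reuse the fitted vocabulary with .transform().
new_texts = ["fire near the coast", "lovely fruits"]
new_vectors = vec.transform(new_texts)

preds = clf.predict(new_vectors)        # hard 0/1 labels
probs = clf.predict_proba(new_vectors)  # columns follow clf.classes_, here [0, 1]
print(preds)
print(probs)
```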