# Real or Not? NLP with Disaster Tweets
In this notebook I'll use NLP to predict whether tweets are about real disasters or not. The dataset comes from a Kaggle competition: https://www.kaggle.com/c/nlp-getting-started/overview.

The competition is scored on F1, so that's the metric we'll work to optimize.
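
For reference, F1 is the harmonic mean of precision and recall. Writing TP, FP and FN for the true positives, false positives and false negatives of the positive class (here, disaster tweets), the standard definitions are:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

True negatives don't appear anywhere in the formula, so a model can't score well on F1 just by always predicting the majority class.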

In [132]:
# imports
import pandas as pd
import numpy as np
import math

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import RidgeClassifier

In [74]:
# import data
data = pd.read_csv('./dataset/train.csv')

# EDA

In [147]:
# train test split
x_train, x_test, y_train, y_test = train_test_split(data[['keyword','location','text']]
                                                   ,data[['target']]
                                                   ,random_state=42)

# targets as 1d array
y_train_array = y_train.values.reshape(-1)
y_test_array = y_test.values.reshape(-1)

In [121]:
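# number of rows
print(f'Rows of data: {x_train.shape[0]}')

# share of disaster tweets (target == 1) in the training set
positive_rate = round(y_train['target'].mean(), 2)
print(f'Share of disaster tweets: {positive_rate}')

# baseline accuracy: always predict the majority class (not a disaster)
baseline = round(max(positive_rate, 1 - positive_rate), 2)
print(f'Baseline accuracy: {baseline}')

# sneak peek at data
data.head()

Rows of data: 5709
Share of disaster tweets: 0.43
Baseline accuracy: 0.57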


| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [122]:
# show attributes as a grid so more values can be scanned at once
# (arr_to_grid is a helper that is not defined in this section)
keywords = np.array(x_train.groupby('keyword').count().index)
keywords_grid = arr_to_grid(keywords)

locations = np.array(x_train.groupby('location').count().index)
locations_grid = arr_to_grid(locations)

# print count of unique keywords and locations
print(f'Count of unique keywords: {len(keywords)}')
print(f'Count of unique locations: {len(locations)}')

Count of unique keywords: 221
Count of unique locations: 2619
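
The `arr_to_grid` helper is called above but its definition isn't shown. A minimal sketch of what it might look like follows. The signature, the 10-column layout and the 10-row cap are assumptions inferred from the 10x10 grids displayed below, not the author's actual implementation.

```python
import numpy as np
import pandas as pd


def arr_to_grid(arr, n_cols=10, n_rows=10):
    """Hypothetical helper: lay out the first n_rows * n_cols values as a grid."""
    # keep only as many values as fit in the grid (assumed behaviour)
    values = list(arr)[: n_rows * n_cols]
    # pad the last row with empty strings so every row has n_cols entries
    remainder = len(values) % n_cols
    if remainder:
        values += [''] * (n_cols - remainder)
    grid = np.array(values, dtype=object).reshape(-1, n_cols)
    return pd.DataFrame(grid)


# small demo: five values laid out three per row
demo = arr_to_grid(['ablaze', 'accident', 'aftershock', 'ambulance', 'arson'], n_cols=3)
print(demo)
```

With the default arguments, only the first 100 values are shown, which would explain why the grids below stop well short of the 221 keywords and 2619 locations.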


In [123]:
keywords_grid

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ablaze | accident | aftershock | airplane accident | ambulance | annihilated | annihilation | apocalypse | armageddon | army |
| 1 | arson | arsonist | attack | attacked | avalanche | battle | bioterror | bioterrorism | blaze | blazing |
| 2 | bleeding | blew up | blight | blizzard | blood | bloody | blown up | body bag | body bagging | body bags |
| 3 | bomb | bombed | bombing | bridge collapse | buildings burning | buildings on fire | burned | burning | burning buildings | bush fires |
| 4 | casualties | casualty | catastrophe | catastrophic | chemical emergency | cliff fall | collapse | collapsed | collide | collided |
| 5 | collision | crash | crashed | crush | crushed | curfew | cyclone | damage | danger | dead |
| 6 | death | deaths | debris | deluge | deluged | demolish | demolished | demolition | derail | derailed |
| 7 | derailment | desolate | desolation | destroy | destroyed | destruction | detonate | detonation | devastated | devastation |
| 8 | disaster | displaced | drought | drown | drowned | drowning | dust storm | earthquake | electrocute | electrocuted |
| 9 | emergency | emergency plan | emergency services | engulfed | epicentre | evacuate | evacuated | evacuation | explode | exploded |


In [124]:
locations_grid

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,,"Melbourne, Australia",News,å_,"616 Û¢ Kentwood , MI",? ??????? ? ( ?? å¡ ? ? ? å¡),?currently writing a book?,Alberta,Alex/Mika/Leo|18|he/she/they,"BC, US, Asia or Europe."
1,Baku & Erzurum,"Eugene, Oregon",Jariana Town,"Little Rock, AR",Miami Beach,New Delhi,New England,Nxgerxa,"Quantico Marine Base, VA.","Queensland, Australia"
2,Road to the Billionaires Club,Somewhere.,The World,Tropical SE FLorida,snapchat // fvck_casper,|IG: imaginedragoner,"#1 Vacation Destination,HAWAII",#????? Libya#,#Bummerville otw,#EngleWood CHICAGO
3,#FLIGHTCITY UK,#ForeverWithBAP 8,#Gladiator Û¢860Û¢757Û¢,#HAMont,#HarleyChick#PJNT#RunBenRun,#KaumElite;#F?VOR;#SMOFC,#NewcastleuponTyne #UK,#ODU,#PhanTrash,#RedSoxNation
4,#SOUTHAMPTON ENGLAND,#SandraBland,#WhereverI'mAt,#freegucci,#goingdownthetoilet Illinois,#iminchina,#keepthefaith J&J,#otrakansascity,#partsunknown,'Merica
5,'SAN ANTONIOOOOO',"( ?å¡ ?? ?å¡),",(RP),(Spain),"-6.152261,106.775995",-?s?s?j??s-,... -.- -.--,//??//,//RP\ ot @Mort3mer\\,1/10 Taron squad
6,10 Steps Ahead. Cloud 9,107-18 79TH STREET,11/4/14,11202,"11th dimension, los angeles","1313 W.Patrick St, Frederick",14/cis/istj,140920-21 & 150718-19 BEIJING,17-Feb,17th Dimension
7,18 Û¢ CC,"19.600858, -99.047821",1D | 5SOS | AG,"2,360 miles away",2005 |-/,"204, 555 11 Ave. S.W.",21 | PNW,"21, Porto","21.462446,-158.022017",23 countries and counting!
8,253,"261 5th Avenue New York, NY","2B Hindhede Rd, Singapore",3.28.15|7.20.15|7.25.15,302,302???? 815,304,36 & 38,3?3?7?SLOPelousas??2?2?5?,3???2???????
9,3rd Eye Chakra,401 livin',412 NW 5th Ave. Portland OR,434,"46.950109,7.439469","48.870833,2.399227",5/5 access / rt link please x,"570 Vanderbilt; Brooklyn, NY",60th St (SS),617-BTOWN-BEATDOWN


In [174]:
# sample of disaster
print(f'This is a disaster tweet: {data[data.target==1].iloc[0].text}\n')

# sample of non-disaster
print(f'This is a non-disaster tweet: {data[data.target==0].iloc[0].text}')

This is a disaster tweet: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all

This is a non-disaster tweet: What's up man?


The two examples above are easy to tell apart. Other tweets will be less obvious, especially when unconventional words are used to describe a disaster or when a disaster word is used casually. For example, "I wonder if earthquakes are common in the North Pole" contains a 'disaster' word but doesn't report a disaster.

# Modeling

In [127]:
# use count vectorizer to vectorize text
cv = CountVectorizer()
train_vectors = cv.fit_transform(x_train['text'])

# use .transform() rather than .fit_transform() on the test set so the test vectors use
# the vocabulary learned from the training set; tokens unseen in training are dropped.
test_vectors = cv.transform(x_test['text'])
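
To make the fit/transform distinction concrete, here is a small sketch on a made-up corpus (the example sentences and the `*_demo` names are illustrative, not from the dataset). The test document contains a token, `wildfire`, that never appears in training, and it is left out of the test vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy documents, invented for illustration
train_docs = ['forest fire near town', 'what a lovely day']
test_docs = ['wildfire near town']

cv_demo = CountVectorizer()
train_demo = cv_demo.fit_transform(train_docs)  # learns the vocabulary from train only
test_demo = cv_demo.transform(test_docs)        # reuses that vocabulary unchanged

print(sorted(cv_demo.vocabulary_))   # tokens learned from the training documents
print(train_demo.shape, test_demo.shape)
print(test_demo.toarray())           # only 'near' and 'town' are counted
```

Both matrices end up with the same number of columns, which is what lets a model trained on `train_vectors` score `test_vectors`.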

In [148]:
# we'll start with a linear model
# the count vectors have one column per token, so there are a lot of features. We want to
# push the model's weights toward 0 without discounting any word completely, which is what
# ridge (L2) regularization does.
ridge = RidgeClassifier()

# cross validate
scores = cross_val_score(ridge, train_vectors, y_train_array, cv=3, scoring='f1')
print(f'Avg cross validation score: {np.mean(scores)}')

Avg cross validation score: 0.7197266631555141
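
As a rough illustration of how ridge regularization pushes weights toward 0, the sketch below fits `RidgeClassifier` on a tiny made-up corpus with three values of `alpha` and prints the size of the weight vector each time. The sentences, labels and alpha values are assumptions chosen for the demo. The notebook itself uses the default `alpha=1.0`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier

# toy corpus, invented for illustration (1 = disaster, 0 = not)
docs = [
    'forest fire spreading near town',
    'earthquake damage reported downtown',
    'flood waters rising fast',
    'what a lovely sunny day',
    'my new phone is fire',
    'this mixtape is a disaster',
]
labels = [1, 1, 1, 0, 0, 0]

# dense array is fine for a corpus this small
X_demo = CountVectorizer().fit_transform(docs).toarray()

norms = {}
for alpha in (0.01, 1.0, 100.0):
    model = RidgeClassifier(alpha=alpha).fit(X_demo, labels)
    # L2 norm of the learned weights; larger alpha should give a smaller norm
    norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f'alpha={alpha}: weight norm = {norms[alpha]:.3f}')
```

The weights shrink as `alpha` grows but are not set exactly to zero, so every token keeps some influence on the prediction.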


Our first model gets an average cross-validated F1 score of 0.72, which isn't too bad. Let's see how it does on the held-out test set before trying more complex models for a better F1 score.

In [149]:
ridge.fit(train_vectors, y_train_array)

RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, random_state=None,
                solver='auto', tol=0.001)

In [165]:
# predictions on test set
pred = ridge.predict(test_vectors)
# note: .score() returns mean accuracy for classifiers, not F1
score = ridge.score(test_vectors, y_test_array)

# score
print(f'Test accuracy: {score}')

# confusion matrix: rows are actual labels, columns are predicted labels,
# both ordered 0 (not disaster) then 1 (disaster)
cm = pd.DataFrame(confusion_matrix(y_test, pred))
cm.columns = ['pred not disaster','pred disaster']
cm.index = ['actual not disaster','actual disaster']

cm

Test accuracy: 0.7909663865546218


| | pred not disaster | pred disaster |
|---|---|---|
| actual not disaster | 929 | 162 |
| actual disaster | 236 | 577 |


In [166]:
# classification report
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.80      0.85      0.82      1091
           1       0.78      0.71      0.74       813

    accuracy                           0.79      1904
   macro avg       0.79      0.78      0.78      1904
weighted avg       0.79      0.79      0.79      1904
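
As a sanity check, the class-1 (disaster) row of the report can be recomputed by hand from the four confusion matrix counts. The sketch below reads the matrix rows as actual label 0 then 1, which is how `confusion_matrix` orders them and which agrees with the supports of 1091 and 813 in the report.

```python
# counts from the confusion matrix output, with disaster (label 1) as the positive class
tn, fp = 929, 162   # actual 0: predicted 0, predicted 1
fn, tp = 236, 577   # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}')
# prints: precision=0.78 recall=0.71 f1=0.74 accuracy=0.79
```

The test F1 of 0.74 for the disaster class is the number to compare with the cross-validated F1 of 0.72, since the competition is scored on F1 rather than accuracy.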

