# Tweet Analysis - Natural Disaster or not

## Overview
The dataset, from Kaggle, is a collection of tweets labelled as either describing a real natural disaster or merely containing disaster-related words. Our goal is to build a classifier that can tell the two apart. One application would be to flag an ongoing natural disaster in a particular location.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
train = pd.read_csv('train.csv', index_col='id')
test = pd.read_csv('test.csv', index_col='id')

## EDA

In [5]:
train.head()

id,keyword,location,text,target
1,,,Our Deeds are the Reason of this #earthquake M...,1
4,,,Forest fire near La Ronge Sask. Canada,1
5,,,All residents asked to 'shelter in place' are ...,1
6,,,"13,000 people receive #wildfires evacuation or...",1
7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [6]:
train.isna().sum()

keyword       61
location    2533
text           0
target         0
dtype: int64

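Location is missing for 2533 of the 7613 training rows (about a third), so it should probably not be used as a differentiator. Keyword is missing for only 61 rows and seems like an interesting feature.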
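Raw counts are easier to judge as a share of the rows. A minimal sketch on a made-up frame (the `toy` data is hypothetical; on the real data the same call would be made on `train`):

```python
import pandas as pd

# Hypothetical stand-in for `train`; only the missing-value pattern matters here.
toy = pd.DataFrame({
    "keyword": ["fire", None, "flood", "storm"],
    "location": [None, None, "USA", None],
    "text": ["a", "b", "c", "d"],
})

# isna() gives a boolean frame; the column mean is the fraction of missing values.
missing_share = toy.isna().mean()
print(missing_share)
```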

In [12]:
train.target.value_counts(normalize=True)

0    0.57034
1    0.42966
Name: target, dtype: float64

About 57% of the tweets do not describe a real disaster, so the classes are only mildly imbalanced.
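The class balance also gives a reference point for accuracy: a classifier that always predicts the majority class would already be right about 57% of the time. A small sketch of that baseline, using hypothetical labels with roughly the same balance as the output above:

```python
from collections import Counter

# Hypothetical labels with roughly the same balance as the training set (57% / 43%).
labels = [0] * 57 + [1] * 43

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = majority_count / len(labels)
print(majority_class, baseline_accuracy)
```

Any model score is best read against this number rather than against 50%.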

In [13]:
train.location.value_counts()

USA                             104
New York                         71
United States                    50
London                           45
Canada                           29
                               ... 
Your notifications                1
Bristol, UK                       1
Horsemind, MI                     1
Derbyshire, United Kingdom        1
Planet Eyal, Shandral System      1
Name: location, Length: 3341, dtype: int64

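The most frequent locations are in the US, but `location` is free text: it has 3341 distinct values, and 'USA' and 'United States' are counted separately.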

In [15]:
train.keyword.value_counts()

fatalities               45
armageddon               42
deluge                   42
damage                   41
harm                     41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: keyword, Length: 221, dtype: int64

These keywords appear to be a pre-existing tag attached to each tweet based on its text. Multi-word keywords are URL-encoded, with `%20` standing for a space.

Let's see how the terms differ according to the target.
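Several keywords contain `%20`, which is the percent-encoding of a space. If the keyword is going to be used as a feature or shown in a report, it can be decoded with the standard library. A minimal sketch, using keywords copied from the output above:

```python
from urllib.parse import unquote

# Keywords copied from the value_counts() output above.
raw_keywords = ["forest%20fire", "radiation%20emergency", "epicentre"]

# unquote() turns percent-escapes such as %20 back into the characters they encode.
decoded = [unquote(k) for k in raw_keywords]
print(decoded)
```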

In [16]:
positive_keywords = train[train.target == 1].keyword.value_counts()
positive_keywords

wreckage       39
outbreak       39
derailment     39
debris         37
oil%20spill    37
               ..
ruin            1
body%20bag      1
body%20bags     1
blazing         1
epicentre       1
Name: keyword, Length: 220, dtype: int64

In [17]:
negative_keywords = train[train.target == 0].keyword.value_counts()
negative_keywords

body%20bags          40
armageddon           37
harm                 37
wrecked              36
ruin                 36
                     ..
outbreak              1
suicide%20bombing     1
typhoon               1
suicide%20bomber      1
oil%20spill           1
Name: keyword, Length: 218, dtype: int64
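The two listings above are easier to compare as a single table with one row per keyword. Because the target is 0/1, its mean per keyword is the share of tweets with that keyword that describe a real disaster. A sketch on hypothetical rows (the keywords are taken from the output above, the counts are made up):

```python
import pandas as pd

# Hypothetical rows; the keywords are real but the counts are made up.
toy = pd.DataFrame({
    "keyword": ["wreckage", "wreckage", "body%20bags", "body%20bags", "body%20bags", "harm"],
    "target":  [1, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 target per keyword = share of that keyword's tweets that are real disasters.
keyword_stats = (
    toy.groupby("keyword")["target"]
       .agg(disaster_rate="mean", n_tweets="count")
       .sort_values("disaster_rate", ascending=False)
)
print(keyword_stats)
```

Keeping `n_tweets` next to the rate makes it clear when a rate rests on only a handful of tweets.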

We could look at a sample from each type as well

In [18]:
positive_sample = train[train.target == 1].text.sample()
positive_sample.item()

'DLH issues Hazardous Weather Outlook (HWO)  http://t.co/a0Ad8z5Vsr #WX'

In [19]:
negative_sample = train[train.target == 0].text.sample()
negative_sample.item()

"@jamienye u can't blame it all on coaching management penalties defence or injuries. Cursed is probably a good way to put it! #riders"

In [20]:
train.size

30452

In [21]:
test.size

9789
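Note that `DataFrame.size` counts cells (rows × columns), not rows: 30452 is 7613 rows × 4 columns for the training set, and 9789 is 3263 rows × 3 columns for the test set. A small sketch of the difference on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame with the same four columns as `train`.
toy = pd.DataFrame({
    "keyword": [None, "fire"],
    "location": [None, "USA"],
    "text": ["first tweet", "second tweet"],
    "target": [0, 1],
})

n_rows, n_cols = toy.shape   # (rows, columns)
print(len(toy), toy.shape, toy.size)
```

`len(df)` or `df.shape[0]` is the usual way to get the number of rows.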

## Preprocessing

In [23]:
from nltk.tokenize.regexp import regexp_tokenize
import re

In [24]:
def lower(text):
    return text.lower()

In [25]:
def filter_letters(text):
    return re.sub(string=text, repl='', pattern=r'[^a-z\s]') # filter out anything that is not a letter or a space

In [26]:
def tokenize(text):
    return regexp_tokenize(text, pattern=r'\s+', gaps=True)  # raw string avoids an invalid-escape warning

In [27]:
cleaned_train = train.copy()
cleaned_train.text = cleaned_train.text.map(lower)

In [28]:
cleaned_train.head()

id,keyword,location,text,target
1,,,our deeds are the reason of this #earthquake m...,1
4,,,forest fire near la ronge sask. canada,1
5,,,all residents asked to 'shelter in place' are ...,1
6,,,"13,000 people receive #wildfires evacuation or...",1
7,,,just got sent this photo from ruby #alaska as ...,1


In [29]:
cleaned_train.text = cleaned_train.text.map(filter_letters)
cleaned_train.head()

id,keyword,location,text,target
1,,,our deeds are the reason of this earthquake ma...,1
4,,,forest fire near la ronge sask canada,1
5,,,all residents asked to shelter in place are be...,1
6,,,people receive wildfires evacuation orders in...,1
7,,,just got sent this photo from ruby alaska as s...,1
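One side effect of keeping only letters is that links collapse into a single meaningless token. The sketch below shows this on the positive sample tweet from the EDA section and adds an optional step that drops URLs and @mentions first. `strip_urls_and_mentions` is a hypothetical helper, not part of the notebook:

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")

def strip_urls_and_mentions(text):
    """Remove links and @mentions (a hypothetical extra step, not in the notebook)."""
    text = URL_PATTERN.sub(" ", text)
    return MENTION_PATTERN.sub(" ", text)

def filter_letters(text):
    # Same rule as the notebook: keep only lowercase letters and whitespace.
    return re.sub(r"[^a-z\s]", "", text)

tweet = "DLH issues Hazardous Weather Outlook (HWO)  http://t.co/a0Ad8z5Vsr #WX".lower()

without_step = filter_letters(tweet).split()
with_step = filter_letters(strip_urls_and_mentions(tweet)).split()
print(without_step)
print(with_step)
```

Without the extra step the link survives as the token `httptcoaadzvsr`, which is almost unique per tweet and only inflates the vocabulary.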


In [30]:
cleaned_train.text = cleaned_train.text.map(tokenize)
cleaned_train.head()

id,keyword,location,text,target
1,,,"[our, deeds, are, the, reason, of, this, earth...",1
4,,,"[forest, fire, near, la, ronge, sask, canada]",1
5,,,"[all, residents, asked, to, shelter, in, place...",1
6,,,"[people, receive, wildfires, evacuation, order...",1
7,,,"[just, got, sent, this, photo, from, ruby, ala...",1


In [31]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

In [32]:
def remove_stopwords(tokens):
    tokens = [token for token in tokens if token not in stop]
    return tokens

In [33]:
cleaned_train.text = cleaned_train.text.map(remove_stopwords)
cleaned_train.head()

id,keyword,location,text,target
1,,,"[deeds, reason, earthquake, may, allah, forgiv...",1
4,,,"[forest, fire, near, la, ronge, sask, canada]",1
5,,,"[residents, asked, shelter, place, notified, o...",1
6,,,"[people, receive, wildfires, evacuation, order...",1
7,,,"[got, sent, photo, ruby, alaska, smoke, wildfi...",1


In [34]:
from nltk.stem import WordNetLemmatizer, PorterStemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

In [35]:
def stem_and_lemmatize(tokens):
    tokens = [lemmatizer.lemmatize(stemmer.stem(token)) for token in tokens]
    return tokens

In [36]:
cleaned_train.text = cleaned_train.text.map(stem_and_lemmatize)
cleaned_train.head()

id,keyword,location,text,target
1,,,"[deed, reason, earthquak, may, allah, forgiv, u]",1
4,,,"[forest, fire, near, la, rong, sask, canada]",1
5,,,"[resid, ask, shelter, place, notifi, offic, ev...",1
6,,,"[peopl, receiv, wildfir, evacu, order, califor...",1
7,,,"[got, sent, photo, rubi, alaska, smoke, wildfi...",1
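The cleaning steps above are applied one cell at a time, and the same sequence has to be repeated for the test set. One way to keep the two in sync is to bundle the steps into a single function. The sketch below is dependency-free, so it uses a tiny hypothetical stop word set (`STOP`) in place of NLTK's list and leaves out stemming and lemmatizing. `STEPS` and `preprocess` are illustrative names:

```python
import re

# Hypothetical stand-in for NLTK's English stop word list, to keep the sketch dependency-free.
STOP = {"the", "of", "this", "are", "our", "is", "a"}

def lower(text):
    return text.lower()

def filter_letters(text):
    return re.sub(r"[^a-z\s]", "", text)

def tokenize(text):
    return text.split()

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOP]

# Order matters: each step consumes the previous step's output.
STEPS = [lower, filter_letters, tokenize, remove_stopwords]

def preprocess(text, steps=STEPS):
    """Run every cleaning step in order on a single tweet."""
    result = text
    for step in steps:
        result = step(result)
    return result

tokens = preprocess("Our Deeds are the Reason of this #earthquake")
print(tokens)
```

With such a function, both data sets could be cleaned with one call each, for example `train.text.map(preprocess)` and `test.text.map(preprocess)`.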


In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [38]:
tfidf = TfidfVectorizer(ngram_range=(1, 2))

In [39]:
features = tfidf.fit_transform(cleaned_train.text.map(lambda alist: ' '.join(alist))).toarray()

In [40]:
features.shape

(7613, 68544)
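A dense float64 array of this shape needs about 4.2 GB, because `.toarray()` stores every zero explicitly. scikit-learn's linear models and naive Bayes accept SciPy sparse matrices directly, so the conversion can usually be skipped. A minimal sketch on a hypothetical mini corpus:

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus standing in for the cleaned tweets.
corpus = [
    "forest fire near la rong",
    "peopl receiv wildfir evacu order",
    "love thi song",
    "got sent photo smoke wildfir",
]

tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_sparse = tfidf.fit_transform(corpus)   # SciPy sparse matrix, no .toarray()

# Size a dense float64 array of the notebook's shape would need.
dense_bytes = 7613 * 68544 * 8
print(sparse.issparse(X_sparse), X_sparse.shape, X_sparse.nnz)
print(f"dense equivalent of (7613, 68544): {dense_bytes / 1e9:.2f} GB")
```

Sparse matrices support row indexing, so fold indices from a cross-validation splitter can be used on them in the same way as on a dense array.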

In [41]:
from sklearn.model_selection import StratifiedKFold

In [49]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

In [44]:
kf = StratifiedKFold(n_splits=3, shuffle=True)
y = cleaned_train['target']

In [58]:
for train_index, val_index in kf.split(features, y):
    # Split features and labels into this fold's training and validation parts
    X_train, X_val = features[train_index], features[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]
    #model = MultinomialNB()
    #model = LogisticRegression()
    model = LinearSVC(C=10)
    model.fit(X_train, y_train)
    print(model.score(X_val, y_val))  # accuracy on the held-out fold

0.7915681639085894
0.7970843183609141
0.7899093417422152
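Two details of this evaluation are worth keeping in mind. The TF-IDF vectorizer was fit on all training rows before the split, so its vocabulary and IDF weights have already seen each validation fold. The shuffle is also unseeded, so the scores change from run to run. A sketch of one way to address both, assuming a scikit-learn `Pipeline` and a fixed `random_state` (the texts and labels are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical cleaned tweets and labels; replace with the real joined tokens and `target`.
texts = [
    "forest fire near town", "wildfir evacu order issu", "earthquak hit citi",
    "flood warn river rise", "storm damag home", "train derail injur",
    "love thi song", "great day beach", "new phone arriv",
    "coffe friend morn", "watch movi tonight", "team won game",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer sits inside the pipeline, so it is refit on each training fold only.
pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC(C=10))

# A fixed random_state makes the fold assignment reproducible.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv)
print(scores, scores.mean())
```

The scores on this toy data say nothing about the real task; the point is only the structure of the evaluation.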


In [59]:
test

id,keyword,location,text
0,,,Just happened a terrible car crash
2,,,"Heard about #earthquake is different cities, s..."
3,,,"there is a forest fire at spot pond, geese are..."
9,,,Apocalypse lighting. #Spokane #wildfires
11,,,Typhoon Soudelor kills 28 in China and Taiwan
...,...,...,...
10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...
10865,,,Storm in RI worse than last hurricane. My city...
10868,,,Green Line derailment in Chicago http://t.co/U...
10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...


In [60]:
cleaned_test = test.copy()
cleaned_test.text = cleaned_test.text.map(lower)
cleaned_test.text = cleaned_test.text.map(filter_letters)
cleaned_test.text = cleaned_test.text.map(tokenize)
cleaned_test.text = cleaned_test.text.map(remove_stopwords)
cleaned_test.text = cleaned_test.text.map(stem_and_lemmatize)
X_test = tfidf.transform(cleaned_test.text.map(lambda alist: ' '.join(alist))).toarray()
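# NOTE: `model` is the classifier from the last CV fold (fit on about two thirds of the training rows),
# and this assignment overwrites the training labels stored in `y`.
y = model.predict(X_test)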
y

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)
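After the cross-validation loop, `model` holds whatever was fit on the last fold, so these predictions come from a classifier that saw only part of the labelled data. A common alternative is to refit once on all training rows before predicting. A minimal sketch under that assumption, with hypothetical data and variable names:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical cleaned data; names differ from the notebook's on purpose.
train_texts = ["forest fire near town", "earthquak hit citi", "love thi song", "great day beach"]
train_labels = [1, 1, 0, 0]
test_texts = ["fire near citi", "love thi beach"]

final_model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC(C=10))
final_model.fit(train_texts, train_labels)      # all labelled rows, not one fold

# A separate name keeps the training labels intact.
test_predictions = final_model.predict(test_texts)
print(test_predictions)
```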

In [61]:
test['pred'] = y

In [62]:
test.to_csv('predictions.csv')

In [63]:
sample_submission = pd.read_csv('sample_submission.csv')

In [64]:
sample_submission['target'] = y

In [65]:
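# Write to a separate file so the sample is not overwritten; index=False avoids an extra unnamed column
sample_submission.to_csv('submission.csv', index=False)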
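By default `to_csv` also writes the row index as an unnamed first column. Assuming the expected submission has exactly an `id` and a `target` column, the output can be checked in memory before writing a file. A small sketch with hypothetical ids and predictions:

```python
import io
import pandas as pd

# Hypothetical ids and predictions standing in for the real test set.
submission = pd.DataFrame({"id": [0, 2, 3], "target": [1, 0, 1]})

# index=False keeps the row index out of the file, leaving only `id` and `target`.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```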