# Disaster Tweets Notebook

The goal is to predict whether a given tweet is about a real disaster: predict 1 if it is and 0 if it is not.

### Types of Disaster
* Geophysical (e.g. Earthquakes, Landslides, Tsunamis and Volcanic Activity)
* Hydrological (e.g. Avalanches and Floods)
* Climatological (e.g. Extreme Temperatures, Drought and Wildfires)
* Meteorological (e.g. Cyclones and Storms/Wave Surges)
* Biological (e.g. Disease Epidemics and Insect/Animal Plagues)

In [23]:
disaster_list = ['tsunami', 'disasters', 'volcano', 'tornado', 'avalanche', 'earthquake', 
                 'blizzard', 'drought', 'bushfire', 'tremor', 'dust storm', 'storm', 'magma',
                 'twister', 'windstorm', 'heat wave', 'cyclone', 'forest fire', 'flood', 'fire',
                 'hailstorm', 'lava', 'lightning', 'high-pressure', 'hail', 'hurricane', 
                 'seismic', 'erosion', 'whirlpool', 'Richter scale', 'whirlwind', 'dark cloud', 
                 'thunderstorm', 'barometer', 'gale', 'blackout', 'gust', 'force', 'low-pressure',
                 'volt', 'snowstorm', 'rainstorm', 'nimbus', 'violent storm', 'sandstorm',
                 'casualty', 'Beaufort scale', 'fatal', 'fatality', 'cumulonimbus', 'death', 'lost',
                 'destruction', 'tension', 'cataclysm', 'damage', 'uproot', 'underground', 'destroy',
                 'arsonist', 'wind scale', 'arson', 'rescue', 'permafrost', 'fault', 'drown']
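
The notebook does not show how `disaster_list` is used later. One possible use, sketched here purely as an assumption, is a naive keyword baseline that flags a tweet when it contains any listed term. The helper `mentions_disaster` and the shortened `terms` list are hypothetical and not part of the original notebook.

```python
import re

# Shortened, hypothetical subset of disaster_list for illustration
terms = ['earthquake', 'forest fire', 'flood', 'Richter scale']

# One case-insensitive pattern; \b keeps 'flood' from matching inside 'floodlight'
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(t) for t in terms) + r')\b',
    flags=re.IGNORECASE,
)

def mentions_disaster(text):
    """Return 1 if any listed term appears in the text, else 0."""
    return int(pattern.search(text) is not None)

print(mentions_disaster('Forest fire near La Ronge Sask. Canada'))
print(mentions_disaster('Just got sent this photo from Ruby'))
```

A rule like this is only a rough reference point: many tweets use disaster words figuratively, which is the reason a trained model is needed.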

# Libraries

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [37]:
import string, re
import nltk
from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords 

# Data

In [3]:
train = pd.read_csv('train.csv')
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [4]:
test = pd.read_csv('test.csv')
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [6]:
submission = pd.read_csv('sample_submission.csv')
submission.head()

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0


# Exploration

In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [8]:
train.describe()

Unnamed: 0,id,target
count,7613.0,7613.0
mean,5441.934848,0.42966
std,3137.11609,0.49506
min,1.0,0.0
25%,2734.0,0.0
50%,5408.0,0.0
75%,8146.0,1.0
max,10873.0,1.0
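
For a 0/1 column the mean equals the share of 1s, so the `target` mean of about 0.43 in the summary above says that roughly 43% of the training tweets are labelled as disasters. The sketch below shows the same check with `value_counts(normalize=True)` on a small toy series; the toy values are an assumption standing in for `train['target']`.

```python
import pandas as pd

# Toy stand-in for train['target'] (the real column has 7613 rows)
target = pd.Series([1, 1, 0, 0, 0], name='target')

# Share of each class; for a 0/1 column the share of 1s equals the mean
balance = target.value_counts(normalize=True)
print(balance)
print(target.mean())
```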


In [9]:
train.keyword.value_counts()

fatalities               45
deluge                   42
armageddon               42
body%20bags              41
sinking                  41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: keyword, Length: 221, dtype: int64

In [10]:
train.keyword.unique()

array([nan, 'ablaze', 'accident', 'aftershock', 'airplane%20accident',
       'ambulance', 'annihilated', 'annihilation', 'apocalypse',
       'armageddon', 'army', 'arson', 'arsonist', 'attack', 'attacked',
       'avalanche', 'battle', 'bioterror', 'bioterrorism', 'blaze',
       'blazing', 'bleeding', 'blew%20up', 'blight', 'blizzard', 'blood',
       'bloody', 'blown%20up', 'body%20bag', 'body%20bagging',
       'body%20bags', 'bomb', 'bombed', 'bombing', 'bridge%20collapse',
       'buildings%20burning', 'buildings%20on%20fire', 'burned',
       'burning', 'burning%20buildings', 'bush%20fires', 'casualties',
       'casualty', 'catastrophe', 'catastrophic', 'chemical%20emergency',
       'cliff%20fall', 'collapse', 'collapsed', 'collide', 'collided',
       'collision', 'crash', 'crashed', 'crush', 'crushed', 'curfew',
       'cyclone', 'damage', 'danger', 'dead', 'death', 'deaths', 'debris',
       'deluge', 'deluged', 'demolish', 'demolished', 'demolition',
       'derail', 'der

Note on `%20` in the keywords: it is the URL-encoded (percent-encoded) form of a space. The space character has ASCII code 32, which is 20 in hexadecimal, so `body%20bags` stands for "body bags".
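
The `%20` sequences can be decoded back into spaces with the standard library's `urllib.parse.unquote`. This is a minimal sketch on a few keywords copied from the output above, not a step the original notebook performs.

```python
from urllib.parse import unquote

# A few keywords copied from the output above
keywords = ['body%20bags', 'forest%20fire', 'ablaze']

# unquote turns percent-encoded sequences such as %20 back into characters
decoded = [unquote(k) for k in keywords]
print(decoded)  # ['body bags', 'forest fire', 'ablaze']
```

On the full column, `train['keyword'].str.replace('%20', ' ', regex=False)` should give the same result for these keywords while leaving the missing values as NaN.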

In [13]:
train.text[0]

'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

In [14]:
train.text[1]

'Forest fire near La Ronge Sask. Canada'

In [15]:
train.text[2]

"All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected"

In [16]:
train.keyword.count()

7552

In [17]:
train.keyword.count()/len(train)

0.9919873899908052

In [20]:
test.keyword.count()/len(test)

0.9920318725099602
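
The two ratios above are the fraction of rows with a non-missing `keyword`. The same figure can be computed for every column at once with `notna().mean()`. The sketch uses a small toy frame as an assumed stand-in for `train`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train; the real frame comes from train.csv
df = pd.DataFrame({
    'keyword': [np.nan, 'ablaze', 'flood', 'fire'],
    'location': [np.nan, np.nan, 'Canada', 'UK'],
    'text': ['a', 'b', 'c', 'd'],
})

# Fraction of non-missing values per column
coverage = df.notna().mean()
print(coverage)
```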

In [25]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB


# Cleaning

In [29]:
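# Keep alphabetic words (optionally with an apostrophe suffix, e.g. "don't");
# digits, '#' and other punctuation are dropped.
# NOTE: regexp_tokenize expects a single string, so passing the whole Series
# raises the TypeError shown below; it has to be applied row by row.
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
train['text_tokens_raw'] = nltk.regexp_tokenize(train['text'], pattern)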

TypeError: expected string or bytes-like object
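
The error arises because the tokenizer receives a pandas Series rather than a string. A minimal sketch of the row-by-row alternative follows. To keep it dependency-light it swaps in Python's built-in `re.findall` with the same pattern, which should yield the same tokens for this pattern; with NLTK the equivalent would be `train['text'].apply(lambda t: nltk.regexp_tokenize(t, pattern))`. The toy frame `df` is an assumption standing in for `train`.

```python
import re

import pandas as pd

# Same pattern as in the cell above
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"

# Toy stand-in for train, using two tweets from the data
df = pd.DataFrame({'text': [
    'Forest fire near La Ronge Sask. Canada',
    '13,000 people receive #wildfires evacuation orders in California ',
]})

# Tokenize one tweet at a time instead of passing the whole Series
df['text_tokens_raw'] = df['text'].apply(lambda t: re.findall(pattern, t))
print(df['text_tokens_raw'].tolist())
```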

In [45]:
sample = train.text.head()

In [70]:
sample[3]

'13,000 people receive #wildfires evacuation orders in California '

In [48]:
type(sample[0])

str

In [71]:
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
token_pattern = nltk.regexp_tokenize(sample[3], pattern)
token_pattern

['people', 'receive', 'wildfires', 'evacuation', 'orders', 'in', 'California']

In [72]:
token_pattern_lower = [word.lower() for word in token_pattern]
token_pattern_lower

['people', 'receive', 'wildfires', 'evacuation', 'orders', 'in', 'california']

In [73]:
token_pattern_lower_stopless = [word for word in token_pattern_lower if word not in stopwords_list]
token_pattern_lower_stopless

['people', 'receive', 'wildfires', 'evacuation', 'orders', 'california']
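
The three steps tried on the sample tweet (tokenize, lowercase, drop stopwords) can be wrapped in one function. This is a sketch under assumptions: `clean_tweet` is a hypothetical helper name, the tokenizer is `re.findall` rather than `nltk.regexp_tokenize`, and `STOPWORDS` is a tiny illustrative set instead of NLTK's English list.

```python
import re

# Small hypothetical stopword set; the notebook uses NLTK's English list
STOPWORDS = {'in', 'the', 'of', 'are', 'this', 'our'}
PATTERN = re.compile(r"[a-zA-Z]+(?:'[a-z]+)?")

def clean_tweet(text, stopwords=STOPWORDS):
    """Tokenize, lowercase and drop stopwords from a single tweet."""
    tokens = PATTERN.findall(text)
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in stopwords]

print(clean_tweet('13,000 people receive #wildfires evacuation orders in California '))
# ['people', 'receive', 'wildfires', 'evacuation', 'orders', 'california']
```

Because the function takes a single string, it can be passed directly to `train['text'].apply(clean_tweet)`.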

In [75]:
type(train.text[0])

str

In [76]:
from nltk.tokenize import word_tokenize
train['tokenized_text'] = train['text'].apply(word_tokenize) 
train.head()

Unnamed: 0,id,keyword,location,text,target,text_lower,text_stopped,tokenized_text
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,our deeds are the reason of this #earthquake m...,our deeds are the reason of this #earthquake m...,"[Our, Deeds, are, the, Reason, of, this, #, ea..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask. canada,forest fire near la ronge sask. canada,"[Forest, fire, near, La, Ronge, Sask, ., Canada]"
2,5,,,All residents asked to 'shelter in place' are ...,1,all residents asked to 'shelter in place' are ...,all residents asked to 'shelter in place' are ...,"[All, residents, asked, to, 'shelter, in, plac..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive #wildfires evacuation or...","13,000 people receive #wildfires evacuation or...","[13,000, people, receive, #, wildfires, evacua..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,just got sent this photo from ruby #alaska as ...,just got sent this photo from ruby #alaska as ...,"[Just, got, sent, this, photo, from, Ruby, #, ..."


### Also found this on the web
* `from nltk.tokenize import TweetTokenizer`
* `tt = TweetTokenizer()`
* `df['Text'].apply(tt.tokenize)`
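
A short sketch of the `TweetTokenizer` idea on one tweet from the test set. Unlike `word_tokenize`, which splits `#` from the word in the output above, `TweetTokenizer` should keep hashtags as single tokens. The toy frame `df` and the column name `tweet_tokens` are assumptions for illustration.

```python
import pandas as pd
from nltk.tokenize import TweetTokenizer

tt = TweetTokenizer()

# Toy stand-in for the real data, using one tweet from test.csv
df = pd.DataFrame({'text': ['Apocalypse lighting. #Spokane #wildfires']})

# tt.tokenize takes one string, so it is applied row by row
df['tweet_tokens'] = df['text'].apply(tt.tokenize)
print(df['tweet_tokens'][0])
```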

In [33]:
# Lowercase every tweet
train['text_lower'] = train['text'].str.lower()

In [40]:
train.head()

Unnamed: 0,id,keyword,location,text,target,text_tokens,text_lower,text_stopped
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,our deeds are the reason of this #earthquake m...,our deeds are the reason of this #earthquake m...,our deeds are the reason of this #earthquake m...
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask. canada,forest fire near la ronge sask. canada,forest fire near la ronge sask. canada
2,5,,,All residents asked to 'shelter in place' are ...,1,all residents asked to 'shelter in place' are ...,all residents asked to 'shelter in place' are ...,all residents asked to 'shelter in place' are ...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive #wildfires evacuation or...","13,000 people receive #wildfires evacuation or...","13,000 people receive #wildfires evacuation or..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,just got sent this photo from ruby #alaska as ...,just got sent this photo from ruby #alaska as ...,just got sent this photo from ruby #alaska as ...


In [38]:
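# Build the stopword list: English stopwords, punctuation and digits
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)
stopwords_list += ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

# NOTE: this loops over whole tweets, not over the words inside them, so in
# practice nothing is removed and text_stopped is a copy of text_lower.
train['text_stopped'] = [word for word in train['text_lower'] if word not in stopwords_list]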

In [42]:
train.head()

Unnamed: 0,id,keyword,location,text,target,text_lower,text_stopped
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,our deeds are the reason of this #earthquake m...,our deeds are the reason of this #earthquake m...
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask. canada,forest fire near la ronge sask. canada
2,5,,,All residents asked to 'shelter in place' are ...,1,all residents asked to 'shelter in place' are ...,all residents asked to 'shelter in place' are ...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive #wildfires evacuation or...","13,000 people receive #wildfires evacuation or..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,just got sent this photo from ruby #alaska as ...,just got sent this photo from ruby #alaska as ...


In [85]:
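# NOTE: this compares each whole token list with the stopwords, so nothing is
# filtered and the lists come back unchanged (see the output below).
train['token_text_stopped'] = [word for word in train['tokenized_text'] if word not in stopwords_list]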
train.head()

Unnamed: 0,id,keyword,location,text,target,text_lower,text_stopped,tokenized_text,token_text_stopped
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,our deeds are the reason of this #earthquake m...,our deeds are the reason of this #earthquake m...,"[Our, Deeds, are, the, Reason, of, this, #, ea...","[Our, Deeds, are, the, Reason, of, this, #, ea..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask. canada,forest fire near la ronge sask. canada,"[Forest, fire, near, La, Ronge, Sask, ., Canada]","[Forest, fire, near, La, Ronge, Sask, ., Canada]"
2,5,,,All residents asked to 'shelter in place' are ...,1,all residents asked to 'shelter in place' are ...,all residents asked to 'shelter in place' are ...,"[All, residents, asked, to, 'shelter, in, plac...","[All, residents, asked, to, 'shelter, in, plac..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive #wildfires evacuation or...","13,000 people receive #wildfires evacuation or...","[13,000, people, receive, #, wildfires, evacua...","[13,000, people, receive, #, wildfires, evacua..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,just got sent this photo from ruby #alaska as ...,just got sent this photo from ruby #alaska as ...,"[Just, got, sent, this, photo, from, Ruby, #, ...","[Just, got, sent, this, photo, from, Ruby, #, ..."
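
In the output above, `token_text_stopped` is identical to `tokenized_text` because the comprehension iterates over rows rather than over the tokens inside each row. A minimal sketch of a per-row filter follows, assuming a toy frame and a short hypothetical `stopwords_list` in place of the NLTK-based one. Tokens are lowercased before the comparison because the NLTK stopwords are lowercase.

```python
import pandas as pd

# Small hypothetical stopword list; the notebook builds stopwords_list from NLTK
stopwords_list = ['our', 'are', 'the', 'of', 'this', '.', '#']

# Toy stand-in for train['tokenized_text']
df = pd.DataFrame({'tokenized_text': [
    ['Our', 'Deeds', 'are', 'the', 'Reason', 'of', 'this', '#', 'earthquake'],
    ['Forest', 'fire', 'near', 'La', 'Ronge', 'Sask', '.', 'Canada'],
]})

stop_set = set(stopwords_list)  # set lookup is faster than a list

# Filter inside each row's token list; lowercase before the comparison
df['token_text_stopped'] = df['tokenized_text'].apply(
    lambda tokens: [t.lower() for t in tokens if t.lower() not in stop_set]
)
print(df['token_text_stopped'].tolist())
```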


# Model

# Conclusion

# Future Work