# Disaster Tweets Notebook

The goal is to predict whether a given tweet is about a real disaster: predict 1 if it is and 0 if it is not.

### Types of Disaster
* Geophysical (e.g. Earthquakes, Landslides, Tsunamis and Volcanic Activity)
* Hydrological (e.g. Avalanches and Floods)
* Climatological (e.g. Extreme Temperatures, Drought and Wildfires)
* Meteorological (e.g. Cyclones and Storms/Wave Surges)
* Biological (e.g. Disease Epidemics and Insect/Animal Plagues)

In [1]:
disaster_list = ['tsunami', 'disasters', 'volcano', 'tornado', 'avalanche', 'earthquake', 
                 'blizzard', 'drought', 'bushfire', 'tremor', 'dust storm', 'storm', 'magma',
                 'twister', 'windstorm', 'heat wave', 'cyclone', 'forest fire', 'flood', 'fire',
                 'hailstorm', 'lava', 'lightning', 'high-pressure', 'hail', 'hurricane', 
                 'seismic', 'erosion', 'whirlpool', 'Richter scale', 'whirlwind', 'dark cloud', 
                 'thunderstorm', 'barometer', 'gale', 'blackout', 'gust', 'force', 'low-pressure',
                 'volt', 'snowstorm', 'rainstorm', 'nimbus', 'violent storm', 'sandstorm',
                 'casualty', 'Beaufort scale', 'fatal', 'fatality', 'cumulonimbus', 'death', 'lost',
                 'destruction', 'tension', 'cataclysm', 'damage', 'uproot', 'underground', 'destroy',
                 'arsonist', 'wind scale', 'arson', 'rescue', 'permafrost', 'fault', 'drown']
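
The list above is defined but not used later in the notebook. One possible use, sketched here as an assumption rather than as part of the original analysis, is a simple count of how many list terms appear in a tweet. The function name `count_disaster_terms` and the shortened `disaster_terms` list are hypothetical and exist only so the block runs on its own.

```python
import re

# Short excerpt of disaster_list, repeated so this block is self-contained
disaster_terms = ['earthquake', 'forest fire', 'fire', 'flood', 'Richter scale']

def count_disaster_terms(text, terms):
    """Count how many distinct terms occur in the text as whole words or phrases."""
    lowered = text.lower()
    hits = 0
    for term in set(terms):
        # \b keeps 'fire' from matching inside 'fired'
        if re.search(r'\b' + re.escape(term.lower()) + r'\b', lowered):
            hits += 1
    return hits

print(count_disaster_terms('Forest fire near La Ronge Sask. Canada', disaster_terms))  # 2
```

Both 'forest fire' and 'fire' match in this tweet, so overlapping terms are counted separately.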

# Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
import string, re
import nltk
from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords 

In [4]:
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

# Data

In [6]:
train = pd.read_csv('data/train.csv')
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [7]:
test = pd.read_csv('data/test.csv')
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [8]:
test.describe()

Unnamed: 0,id
count,3263.0
mean,5427.152927
std,3146.427221
min,0.0
25%,2683.0
50%,5500.0
75%,8176.0
max,10875.0


In [9]:
submission = pd.read_csv('data/sample_submission.csv')
submission.head()

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0


In [10]:
submission.describe()

Unnamed: 0,id,target
count,3263.0,3263.0
mean,5427.152927,0.0
std,3146.427221,0.0
min,0.0,0.0
25%,2683.0,0.0
50%,5500.0,0.0
75%,8176.0,0.0
max,10875.0,0.0


# Exploration

In [11]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [12]:
train.describe()

Unnamed: 0,id,target
count,7613.0,7613.0
mean,5441.934848,0.42966
std,3137.11609,0.49506
min,1.0,0.0
25%,2734.0,0.0
50%,5408.0,0.0
75%,8146.0,1.0
max,10873.0,1.0


In [13]:
train.keyword.value_counts()

fatalities               45
deluge                   42
armageddon               42
sinking                  41
damage                   41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: keyword, Length: 221, dtype: int64

In [14]:
train.keyword.unique()

array([nan, 'ablaze', 'accident', 'aftershock', 'airplane%20accident',
       'ambulance', 'annihilated', 'annihilation', 'apocalypse',
       'armageddon', 'army', 'arson', 'arsonist', 'attack', 'attacked',
       'avalanche', 'battle', 'bioterror', 'bioterrorism', 'blaze',
       'blazing', 'bleeding', 'blew%20up', 'blight', 'blizzard', 'blood',
       'bloody', 'blown%20up', 'body%20bag', 'body%20bagging',
       'body%20bags', 'bomb', 'bombed', 'bombing', 'bridge%20collapse',
       'buildings%20burning', 'buildings%20on%20fire', 'burned',
       'burning', 'burning%20buildings', 'bush%20fires', 'casualties',
       'casualty', 'catastrophe', 'catastrophic', 'chemical%20emergency',
       'cliff%20fall', 'collapse', 'collapsed', 'collide', 'collided',
       'collision', 'crash', 'crashed', 'crush', 'crushed', 'curfew',
       'cyclone', 'damage', 'danger', 'dead', 'death', 'deaths', 'debris',
       'deluge', 'deluged', 'demolish', 'demolished', 'demolition',
       'derail', 'der
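
The `%20` sequences in these keywords are URL-encoded spaces. A minimal sketch of decoding them with the standard library's `urllib.parse.unquote` follows; the three sample keywords are copied from the output above.

```python
from urllib.parse import unquote

# Sample keywords taken from the value_counts / unique output
keywords = ['forest%20fire', 'radiation%20emergency', 'ablaze']

# unquote turns percent-escapes such as %20 back into the characters they encode
decoded = [unquote(k) for k in keywords]
print(decoded)  # ['forest fire', 'radiation emergency', 'ablaze']
```

On the DataFrame itself, something like `train['keyword'].str.replace('%20', ' ')` should give the same result while leaving the missing keywords as NaN.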

"A space is assigned number 32, which is 20 in hexadecimal. When you see “%20,” it represents a space in an encoded URL"

In [15]:
train.text[0]

'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

In [16]:
train.text[1]

'Forest fire near La Ronge Sask. Canada'

In [17]:
train.text[2]

"All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected"

In [18]:
train.keyword.count()

7552

In [19]:
train.keyword.count()/len(train)

0.9919873899908052

In [20]:
test.keyword.count()/len(test)

0.9920318725099602

In [21]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB


# Cleaning

In [22]:
# Word tokenize didn't work
#from nltk.tokenize import word_tokenize
#train['tokenized_text'] = train['text'].apply(word_tokenize) 
#train.head()

In [23]:
# TweetTokenizer didn't work either
#from nltk.tokenize import TweetTokenizer
#tt = TweetTokenizer()
#train['token_tweets'] = train['text'].apply(tt.tokenize)
#train.head()

In [24]:
# Remove all hyphens and quotes - needs to be in a loop!
# pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
# train['text_tokens_raw'] = nltk.regexp_tokenize(train['text'], pattern)
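
The commented-out call above most likely fails because `nltk.regexp_tokenize` expects a single string, while `train['text']` is a whole column. A minimal sketch of the per-row approach is below. It swaps in `re.findall` for the nltk call so it runs without nltk, which is an assumption of equivalence for this particular pattern, and the helper name `tokenize` is hypothetical.

```python
import re

# Same pattern as in the notebook, without the outer capturing group
PATTERN = r"[a-zA-Z]+(?:'[a-z]+)?"

def tokenize(text):
    """Return the alphabetic word tokens found in one tweet."""
    return re.findall(PATTERN, text)

tweets = [
    "All residents asked to 'shelter in place'",
    "Forest fire near La Ronge Sask. Canada",
]

# Call the tokenizer once per tweet instead of once on the whole collection
tokens = [tokenize(t) for t in tweets]
print(tokens[0])  # ['All', 'residents', 'asked', 'to', 'shelter', 'in', 'place']
```

With pandas, the same function could be applied to every row with `train['text'].apply(tokenize)`.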

In [25]:
sample1 = train.text.head()
sample1

0    Our Deeds are the Reason of this #earthquake M...
1               Forest fire near La Ronge Sask. Canada
2    All residents asked to 'shelter in place' are ...
3    13,000 people receive #wildfires evacuation or...
4    Just got sent this photo from Ruby #Alaska as ...
Name: text, dtype: object

In [26]:
sample = train.text[40:44]
sample

40    Check these out: http://t.co/rOI2NSmEJJ http:/...
41    on the outside you're ablaze and alive\nbut yo...
42    Had an awesome time visiting the CFC head offi...
43         SOOOO PUMPED FOR ABLAZE ???? @southridgelife
Name: text, dtype: object

In [27]:
token_https = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', sample[40], 
                         flags=re.MULTILINE)
token_https

'Check these out:     #nsfw'

In [28]:
# Trying to separate all the words
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
token_pattern = nltk.regexp_tokenize(token_https, pattern)
token_pattern

['Check', 'these', 'out', 'nsfw']

In [29]:
# Make all words lower case
token_pattern_lower = [word.lower() for word in token_pattern]
token_pattern_lower

['check', 'these', 'out', 'nsfw']

In [30]:
# Remove all stopwords, punctuation, and numbers
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)
stopwords_list += ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

token_pattern_lower_stopless = [word for word in token_pattern_lower if word not in stopwords_list]
token_pattern_lower_stopless

['check', 'nsfw']

In [31]:
# Regex separation
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"

# Remove all stopwords, punctuation, and numbers
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)
stopwords_list += ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

# Create new column
new_list = []

# Loop through df
for i in range(len(train.text)):
    token_https = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', train.text[i], 
                         flags=re.MULTILINE)
    token_pattern = nltk.regexp_tokenize(token_https, pattern)
    token_pattern_lower = [word.lower() for word in token_pattern]
    token_pattern_lower_stopless = [word for word in token_pattern_lower if word not in stopwords_list]
    new_list.append(token_pattern_lower_stopless)
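
The loop above could also be wrapped in one function, so that the same cleaning can later be reused on `test.text`. This is a sketch under a few assumptions: the URL pattern is a simplified stand-in for the one in the notebook, `STOPWORDS` is a tiny hypothetical substitute for nltk's English list so the block runs without downloads, and `clean_tweet` is a made-up name.

```python
import re

URL_RE = re.compile(r'https?://\S+')          # simplified URL pattern (assumption)
WORD_RE = re.compile(r"[a-zA-Z]+(?:'[a-z]+)?")
# Tiny stand-in for stopwords.words('english'); a set makes lookups fast
STOPWORDS = {'i', 'to', 'with', 'my', 'but', 'not', 'the', 'of', 'this', 'are', 'our'}

def clean_tweet(text, stopwords=STOPWORDS):
    """Strip URLs, tokenize, lowercase, and drop stopwords."""
    no_urls = URL_RE.sub('', text)
    tokens = [w.lower() for w in WORD_RE.findall(no_urls)]
    return [w for w in tokens if w not in stopwords]

tweet = 'I wanted to set Chicago ablaze with my preaching... But not my hotel! http://t.co/o9qknbfOFX'
print(clean_tweet(tweet))
# ['wanted', 'set', 'chicago', 'ablaze', 'preaching', 'hotel']
```

In the notebook this would be used as `train['text'].apply(clean_tweet)`, with the full nltk stopword list passed in as a set.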

In [32]:
# Add column to df
train['cleaned_text'] = new_list
train.head()

Unnamed: 0,id,keyword,location,text,target,cleaned_text
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[deeds, reason, earthquake, may, allah, forgiv..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[forest, fire, near, la, ronge, sask, canada]"
2,5,,,All residents asked to 'shelter in place' are ...,1,"[residents, asked, shelter, place, notified, o..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[people, receive, wildfires, evacuation, order..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[got, sent, photo, ruby, alaska, smoke, wildfi..."


In [33]:
# Add an empty column
train = train.reindex(columns = train.columns.tolist() + ['new_text'])
train.head()                      

Unnamed: 0,id,keyword,location,text,target,cleaned_text,new_text
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[deeds, reason, earthquake, may, allah, forgiv...",
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[forest, fire, near, la, ronge, sask, canada]",
2,5,,,All residents asked to 'shelter in place' are ...,1,"[residents, asked, shelter, place, notified, o...",
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[people, receive, wildfires, evacuation, order...",
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[got, sent, photo, ruby, alaska, smoke, wildfi...",


In [34]:
# Join each token list back into one string.
# Assigning the whole column at once avoids the SettingWithCopyWarning raised by train['new_text'][i] = ...
train['new_text'] = train['cleaned_text'].apply(", ".join)


In [35]:
train.head()

Unnamed: 0,id,keyword,location,text,target,cleaned_text,new_text
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[deeds, reason, earthquake, may, allah, forgiv...","deeds, reason, earthquake, may, allah, forgive..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[forest, fire, near, la, ronge, sask, canada]","forest, fire, near, la, ronge, sask, canada"
2,5,,,All residents asked to 'shelter in place' are ...,1,"[residents, asked, shelter, place, notified, o...","residents, asked, shelter, place, notified, of..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[people, receive, wildfires, evacuation, order...","people, receive, wildfires, evacuation, orders..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[got, sent, photo, ruby, alaska, smoke, wildfi...","got, sent, photo, ruby, alaska, smoke, wildfir..."


In [62]:
train.new_text[44]

'wanted, set, chicago, ablaze, preaching, hotel'

In [64]:
train.cleaned_text[44]

['wanted', 'set', 'chicago', 'ablaze', 'preaching', 'hotel']

In [63]:
train.text[44]

'I wanted to set Chicago ablaze with my preaching... But not my hotel! http://t.co/o9qknbfOFX'

In [66]:
total_words = []
for i in range(len(train.cleaned_text)):
    total_words += train.cleaned_text[i]

In [67]:
len(total_words)

70348

In [69]:
word_freqdist = FreqDist(total_words)
word_freqdist.most_common(100)

[('like', 348),
 ('amp', 344),
 ('fire', 254),
 ("i'm", 240),
 ('get', 229),
 ('new', 228),
 ('via', 220),
 ('news', 213),
 ('people', 198),
 ('one', 197),
 ('video', 166),
 ('disaster', 158),
 ('emergency', 158),
 ('police', 143),
 ('u', 136),
 ('time', 132),
 ('would', 132),
 ('still', 129),
 ('body', 129),
 ('us', 128),
 ('burning', 121),
 ('crash', 120),
 ('day', 120),
 ('back', 120),
 ('storm', 120),
 ('suicide', 119),
 ('california', 117),
 ('man', 116),
 ('got', 114),
 ('know', 113),
 ('rt', 112),
 ('buildings', 111),
 ('first', 109),
 ('see', 105),
 ('bomb', 105),
 ('going', 104),
 ('world', 104),
 ('nuclear', 104),
 ('pm', 103),
 ('love', 102),
 ('two', 102),
 ('fires', 102),
 ('attack', 101),
 ('go', 100),
 ('dead', 99),
 ('killed', 99),
 ('year', 98),
 ('youtube', 98),
 ('w', 97),
 ('car', 94),
 ('gt', 94),
 ('full', 94),
 ('hiroshima', 94),
 ('life', 93),
 ('train', 93),
 ('war', 92),
 ('old', 91),
 ('today', 90),
 ('may', 89),
 ('accident', 89),
 ('good', 89),
 ('families'

In [70]:
train.head()

Unnamed: 0,id,keyword,location,text,target,cleaned_text,new_text
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[deeds, reason, earthquake, may, allah, forgiv...","deeds, reason, earthquake, may, allah, forgive..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[forest, fire, near, la, ronge, sask, canada]","forest, fire, near, la, ronge, sask, canada"
2,5,,,All residents asked to 'shelter in place' are ...,1,"[residents, asked, shelter, place, notified, o...","residents, asked, shelter, place, notified, of..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[people, receive, wildfires, evacuation, order...","people, receive, wildfires, evacuation, orders..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[got, sent, photo, ruby, alaska, smoke, wildfi...","got, sent, photo, ruby, alaska, smoke, wildfir..."


# Model 1

In [106]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [107]:
vectorizer = TfidfVectorizer()

In [108]:
tf_idf_data_train = vectorizer.fit_transform(train.text)

In [109]:
# I can't get the vectorize to work on my cleaned lists!
#tf_idf_data_train_cleaned = vectorizer.fit_transform(train.cleaned_text)
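
The call above most likely fails because `TfidfVectorizer` expects each document to be a string by default and tries to lowercase and tokenize it itself. One way around this, sketched here with three made-up token lists, is to pass a callable `analyzer` that returns the tokens unchanged. The helper name `identity` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(tokens):
    """Return the token list as-is; the documents are already tokenized."""
    return tokens

# Stand-in for train.cleaned_text: one list of tokens per tweet
docs = [
    ['forest', 'fire', 'near', 'canada'],
    ['residents', 'asked', 'shelter', 'place'],
    ['forest', 'fire', 'evacuation'],
]

# A callable analyzer replaces the built-in preprocessing and tokenizing steps
vectorizer = TfidfVectorizer(analyzer=identity)
matrix = vectorizer.fit_transform(docs)
print(matrix.shape)  # (3, 9): three documents, nine distinct tokens
```

Joining the tokens back into a string, as Model 3 does with `new_text`, is the other common route and gives the vectorizer its usual input type.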

In [110]:
tf_idf_data_test = vectorizer.transform(test.text)

In [111]:
tf_idf_data_train.shape

(7613, 21637)

In [112]:
tf_idf_data_test.shape

(3263, 21637)

In [113]:
# Average number of non-zero TF-IDF entries per tweet (row)
non_zero_cols = tf_idf_data_train.nnz / float(tf_idf_data_train.shape[0])
print("Average Number of Non-Zero Elements in Vectorized Tweets: {}".format(non_zero_cols))

# Fraction of each row's columns that are zero, on average
percent_sparse = 1 - (non_zero_cols / float(tf_idf_data_train.shape[1]))
print('Fraction of columns containing 0: {}'.format(percent_sparse))

Average Number of Non-Zero Elements in Vectorized Tweets: 14.645606199921188
Fraction of columns containing 0: 0.9993231221426297


In [114]:
nb_classifier = MultinomialNB()
rf_classifier = RandomForestClassifier(n_estimators=100)

In [115]:
target = train.target

In [116]:
# Naive Bayes
nb_classifier.fit(tf_idf_data_train, target)
nb_train_preds = nb_classifier.predict(tf_idf_data_train)
nb_test_preds = nb_classifier.predict(tf_idf_data_test)

In [117]:
# Random Forest
rf_classifier.fit(tf_idf_data_train, target)
rf_train_preds = rf_classifier.predict(tf_idf_data_train)
rf_test_preds = rf_classifier.predict(tf_idf_data_test)

In [119]:
# The Kaggle test set has no target column, so only training accuracy can be scored here
nb_train_score = accuracy_score(train.target, nb_train_preds)
rf_train_score = accuracy_score(train.target, rf_train_preds)

print("Multinomial Naive Bayes")
print("Training Accuracy: {:.4}".format(nb_train_score))
print("")
print('-'*70)
print("")
print('Random Forest')
print("Training Accuracy: {:.4}".format(rf_train_score))

Multinomial Naive Bayes
Training Accuracy: 0.8853

----------------------------------------------------------------------

Random Forest
Training Accuracy: 0.9965


In [None]:
# My test data doesn't have a target, so test accuracy can't be computed on it.
# Model 2 holds out part of the training data instead.
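
Even without labels, the predictions on the test set can still be written into the submission format loaded earlier. The sketch below uses small stand-in values for the notebook's `submission` frame and `nb_test_preds` array, and the output filename mentioned afterwards is an assumption.

```python
import pandas as pd

# Stand-ins for the notebook's `submission` DataFrame and `nb_test_preds` array
submission = pd.DataFrame({'id': [0, 2, 3], 'target': [0, 0, 0]})
test_preds = [1, 1, 0]

# Overwrite the placeholder zeros with the model's predictions
submission['target'] = test_preds

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = submission.to_csv(index=False)
print(csv_text)
```

In the notebook, `submission.to_csv('submission.csv', index=False)` would write the file to disk instead.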

# Model 2

In [120]:
# Set the target
y = train.target

In [122]:
# Set the features (raw tweet text)
X = train.text

In [123]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

In [124]:
vectorizer = TfidfVectorizer()

In [125]:
tf_idf_data_train2 = vectorizer.fit_transform(X_train)

In [127]:
tf_idf_data_test2 = vectorizer.transform(X_test)

In [128]:
tf_idf_data_train2.shape

(5709, 17725)

In [129]:
tf_idf_data_test2.shape

(1904, 17725)

In [130]:
nb_classifier = MultinomialNB()
rf_classifier = RandomForestClassifier(n_estimators=100)

In [131]:
# Naive Bayes
nb_classifier.fit(tf_idf_data_train2, y_train)
nb_train_preds2 = nb_classifier.predict(tf_idf_data_train2)
nb_test_preds2 = nb_classifier.predict(tf_idf_data_test2)

In [133]:
# Random Forest
rf_classifier.fit(tf_idf_data_train2, y_train)
rf_train_preds2 = rf_classifier.predict(tf_idf_data_train2)
rf_test_preds2 = rf_classifier.predict(tf_idf_data_test2)

In [142]:
nb_train_score2 = accuracy_score(y_train, nb_train_preds2)
nb_test_score2 = accuracy_score(y_test, nb_test_preds2)

rf_train_score2 = accuracy_score(y_train, rf_train_preds2)
rf_test_score2 = accuracy_score(y_test, rf_test_preds2)

print("Multinomial Naive Bayes")
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(nb_train_score2, nb_test_score2))
print("")
print('-'*70)
print("")
print('Random Forest')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(rf_train_score2, rf_test_score2))

Multinomial Naive Bayes
Training Accuracy: 0.8932 		 Testing Accuracy: 0.7994

----------------------------------------------------------------------

Random Forest
Training Accuracy: 0.9967 		 Testing Accuracy: 0.7773


# Model 3

In [149]:
# Set the target
y = train.target

In [150]:
# Set the features (cleaned, re-joined text)
X = train.new_text

In [151]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

In [152]:
vectorizer = TfidfVectorizer()

In [153]:
tf_idf_data_train3 = vectorizer.fit_transform(X_train)

In [154]:
tf_idf_data_test3 = vectorizer.transform(X_test)

In [155]:
tf_idf_data_train3.shape

(5709, 13472)

In [156]:
tf_idf_data_test3.shape

(1904, 13472)

In [157]:
nb_classifier = MultinomialNB()
rf_classifier = RandomForestClassifier(n_estimators=100)

In [159]:
# Naive Bayes
nb_classifier.fit(tf_idf_data_train3, y_train)
nb_train_preds3 = nb_classifier.predict(tf_idf_data_train3)
nb_test_preds3 = nb_classifier.predict(tf_idf_data_test3)

In [160]:
# Random Forest
rf_classifier.fit(tf_idf_data_train3, y_train)
rf_train_preds3 = rf_classifier.predict(tf_idf_data_train3)
rf_test_preds3 = rf_classifier.predict(tf_idf_data_test3)

In [161]:
nb_train_score3 = accuracy_score(y_train, nb_train_preds3)
nb_test_score3 = accuracy_score(y_test, nb_test_preds3)

rf_train_score3 = accuracy_score(y_train, rf_train_preds3)
rf_test_score3 = accuracy_score(y_test, rf_test_preds3)

print("Multinomial Naive Bayes")
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(nb_train_score3, nb_test_score3))
print("")
print('-'*70)
print("")
print('Random Forest')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(rf_train_score3, rf_test_score3))

Multinomial Naive Bayes
Training Accuracy: 0.9073 		 Testing Accuracy: 0.8051

----------------------------------------------------------------------

Random Forest
Training Accuracy: 0.9876 		 Testing Accuracy: 0.7883


# Conclusion

I first tried to split each tweet into separate words and strip out numbers and symbols. I then spent some time removing the http/https links. Finally, I made all the words lowercase and removed stopwords. I tried two different tweet tokenizers (word_tokenize and TweetTokenizer), but neither of them worked, perhaps because the text had already been changed from a string to a list.

Feeding the TfidfVectorizer my nltk-cleaned tokens, joined back into one string per tweet, improved the Naive Bayes testing accuracy from 79.94% to 80.51%. The Random Forest testing accuracy also went up, from 77.73% to 78.83%. Neither is great and both are only slight improvements, but they are heading in the right direction. The Random Forest also scores far higher on the training split (98.76%) than on the test split, which suggests it is overfitting. I still have some cleaning issues to deal with, since a number of the remaining tokens are just single letters such as 'u' and 'w'.
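
The frequency list earlier includes 'amp' and 'gt', which look like leftovers of the HTML entities `&amp;` and `&gt;`, alongside single letters such as 'u' and 'w'. A possible next cleaning step, sketched here as an assumption rather than something the notebook already does, is to decode HTML entities before tokenizing and to drop one-letter tokens. The function name `tokenize_unescaped` and the sample sentence are made up.

```python
import html
import re

WORD_RE = re.compile(r"[a-zA-Z]+(?:'[a-z]+)?")

def tokenize_unescaped(text, min_len=2):
    """Decode HTML entities, then keep lowercase tokens of at least min_len letters."""
    decoded = html.unescape(text)  # '&amp;' becomes '&', '&gt;' becomes '>'
    return [w.lower() for w in WORD_RE.findall(decoded) if len(w) >= min_len]

print(tokenize_unescaped('Fire &amp; smoke w/ u &gt; 2 miles'))
# ['fire', 'smoke', 'miles']
```

Without these two steps the same sentence would also produce the tokens 'amp', 'w', 'u' and 'gt'.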

# Future Work