## NLP Challenge: Disaster, real or not real?
-------------------------------------------------------------------
### Context:

Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.
However, it’s not always clear whether a person’s words are actually announcing a disaster. Our goal is to build a supervised learning model that predicts whether or not a tweet describes a real disaster.

### Content:
This dataset contains 7613 tweets, each labeled as being about a real disaster (1) or not (0).

To process the tweets, we'll use natural language processing techniques to prepare the text data. Now let's start!

### I. Predict disaster tweets with Bag of Words:

In [1]:
# Import libraries:
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter
import nltk
from imblearn.over_sampling import SMOTE
import itertools
flatten = itertools.chain.from_iterable
from sklearn.model_selection import cross_val_score

from sklearn import ensemble
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

In [2]:
# Load the dataset:
df = pd.read_csv('train.csv', usecols=['text', 'target'])

# Print out the shape of the dataset:
print(df.shape)

(7613, 2)


In [3]:
# Check the first 10 rows of the train dataset:
df.head(10)

Unnamed: 0,text,target
0,Our Deeds are the Reason of this #earthquake M...,1
1,Forest fire near La Ronge Sask. Canada,1
2,All residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,Just got sent this photo from Ruby #Alaska as ...,1
5,#RockyFire Update => California Hwy. 20 closed...,1
6,#flood #disaster Heavy rain causes flash flood...,1
7,I'm on top of the hill and I can see a fire in...,1
8,There's an emergency evacuation happening now ...,1
9,I'm afraid that the tornado is coming to our a...,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 2 columns):
text      7613 non-null object
target    7613 non-null int64
dtypes: int64(1), object(1)
memory usage: 119.1+ KB


In [5]:
# Utility function for text cleaning:
def text_cleaner(text):
    # Raw strings avoid invalid-escape warnings; steps that had no effect were dropped.
    text = re.sub(r'--', ' ', text)                     # double dashes -> space
    text = re.sub(r'http\S+', '', text)                 # strip URLs
    text = re.sub(r'[?|!\'"#]', '', text)               # drop ? | ! ' " #
    text = re.sub(r'[.,()/]', ' ', text)                # . , ( ) / -> space
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                 # drop [bracketed] text
    text = re.sub(r'[()\[\]{}\',.`?:;!&^]', '', text)   # remaining punctuation
    text = re.sub(r'\w*\d\w*', '', text)                # tokens containing digits
    text = re.sub(r'[ûò]', '', text)                    # encoding debris
    text = re.sub(r'[^a-zA-Z]+', ' ', text)             # any other non-letters -> space
    return " ".join(text.split())                       # trim and collapse whitespace
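
To see what this kind of cleaning does to a single tweet, here is a condensed sketch. `clean_tweet` is a hypothetical helper, not the `text_cleaner` above: it keeps only the URL removal, lowercasing, apostrophe removal, digit-token removal, and letters-only steps, so its output may differ from the notebook's function on edge cases. The sample tweet and its URL are made up.

```python
import re

URL_RE = re.compile(r'http\S+')            # links such as http://t.co/...
DIGIT_TOKEN_RE = re.compile(r'\w*\d\w*')   # any token that contains a digit
NON_LETTER_RE = re.compile(r'[^a-z]+')     # everything except a-z (applied after lowercasing)

def clean_tweet(text):
    """Condensed, illustrative cleaner (hypothetical helper)."""
    text = URL_RE.sub('', text)
    text = text.lower()
    text = text.replace("'", "")           # "i'm" -> "im", as in the notebook's cleaner
    text = DIGIT_TOKEN_RE.sub('', text)
    text = NON_LETTER_RE.sub(' ', text)
    return text.strip()

sample = "13,000 people receive #wildfires evacuation orders! http://t.co/abc123"
print(clean_tweet(sample))
```

Note that stripping `&` and `;` this way turns an HTML entity such as `&amp;` into the token `amp`, which is one plausible reason `amp` can end up among frequent words.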

In [6]:
# Apply text_cleaner function:
df['parsed_text'] = df['text'].apply(text_cleaner)

In [7]:
df.head()

Unnamed: 0,text,target,parsed_text
0,Our Deeds are the Reason of this #earthquake M...,1,our deeds are the reason of this earthquake ma...
1,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada
2,All residents asked to 'shelter in place' are ...,1,all residents asked to shelter in place are be...
3,"13,000 people receive #wildfires evacuation or...",1,people receive wildfires evacuation orders in ...
4,Just got sent this photo from Ruby #Alaska as ...,1,just got sent this photo from ruby alaska as s...


In [8]:
# Make a copy of df to later use for tf-idf:
df1 = df.copy()

In [9]:
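# Parse the documents with spaCy. The 'en' shortcut no longer works in spaCy 3+,
# so load the model by its full name (assumed here to be the small English model):
nlp = spacy.load('en_core_web_sm')
df['parsed_text'] = df['parsed_text'].apply(nlp)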

In [10]:
# Utility function to create a bag of words:
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]

In [11]:
# Let's look at the list of the most common words:
common_words = bag_of_words(list(flatten(df['parsed_text'])))
common_words[0:20]

['not',
 's',
 'like',
 'fire',
 'be',
 'amp',
 'new',
 'go',
 'get',
 'people',
 'news',
 'kill',
 'burn',
 'year',
 'video',
 'bomb',
 'crash',
 'disaster',
 'emergency',
 'come']

In [12]:
# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences['parsed_text']
    df['text_source'] = sentences['target']
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

In [13]:
# Apply bow_features function:
word_counts = bow_features(df[['parsed_text','target']], common_words)

Processing row 0
Processing row 50
Processing row 100
...

In [14]:
word_counts.head()

Unnamed: 0,not,s,like,fire,be,amp,new,go,get,people,...,lmfao,newyork,credit,tbt,demand,muslim,raid,palestine,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(our, deeds, are, the, reason, of, this, earth...",1
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(forest, fire, near, la, ronge, sask, canada)",1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(all, residents, asked, to, shelter, in, place...",1
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,"(people, receive, wildfires, evacuation, order...",1
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,"(just, got, sent, this, photo, from, ruby, ala...",1


#### Supervised learning models (using bag of words)

In [15]:
# Random forest model:
rfc = ensemble.RandomForestClassifier(n_estimators=100)

Y = word_counts['text_source']
# Passing the axis positionally to drop() is no longer supported in recent pandas.
X = np.array(word_counts.drop(columns=['text_sentence', 'text_source']))

In [16]:
rfc_scores = cross_val_score(rfc, X, Y, cv=10)
print("Cross-validation scores: ",rfc_scores)
print("Mean score: ",rfc_scores.mean())

Cross-validation scores:  [0.70209974 0.44488189 0.47900262 0.42838371 0.52299606 0.59658344
 0.59001314 0.5781866  0.69382392 0.61760841]
Mean score:  0.5653579521350895
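
The fold scores above range from about 0.43 to 0.70, which is a wide spread. With an integer `cv` and a classifier, `cross_val_score` uses stratified folds without shuffling, so if the rows of `train.csv` are ordered in some way (an assumption, not something checked here), consecutive folds can differ systematically. Below is a minimal sketch of shuffled stratified folds plus a majority-class baseline, run on synthetic data rather than the tweet features. `X_demo`, `y_demo`, and the model settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the feature matrix and labels (not the tweet data).
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# Shuffle before splitting so each fold is a random sample of the rows.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

baseline = DummyClassifier(strategy='most_frequent')  # always predicts the majority class
model = LogisticRegression(max_iter=1000)

baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=cv)
model_scores = cross_val_score(model, X_demo, y_demo, cv=cv)

print("Baseline mean accuracy:", baseline_scores.mean())
print("Model mean accuracy:   ", model_scores.mean())
```

Comparing against the baseline shows how much of the accuracy comes from the features rather than from the class proportions alone.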


In [19]:
# Logistic Regression model:
lr = LogisticRegression(penalty='l1',multi_class='auto',solver='liblinear')

lr_scores = cross_val_score(lr, X, Y, cv=10)
print("Cross-validation scores: ",lr_scores)
print("Mean score: ",lr_scores.mean())

Cross-validation scores:  [0.71391076 0.5656168  0.48950131 0.44283837 0.51642576 0.60183968
 0.57555848 0.57950066 0.66491459 0.71222076]
Mean score:  0.5862327163112495


In [20]:
# Gradient Boosting model:
clf = ensemble.GradientBoostingClassifier()

clf_scores = cross_val_score(clf, X, Y, cv=10)
print("Cross-validation scores: ",clf_scores)
print("Mean score: ",clf_scores.mean())

Cross-validation scores:  [0.65354331 0.64829396 0.4855643  0.57161629 0.46780552 0.58344284
 0.58607096 0.63600526 0.65571616 0.73455979]
Mean score:  0.6022618394776869


### II. Predict disaster tweets with Tf-idf:

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define our TfidfVectorizer:
vectorizer = TfidfVectorizer(max_df=0.5,  # drop words that occur in more than half of the tweets
                             min_df=3,  # only use words that appear in at least 3 tweets
                             stop_words='english',
                             lowercase=True,  # convert everything to lower case
                             use_idf=True,  # weight terms by inverse document frequency
                             norm='l2',  # scale each tweet's vector to unit length so longer and shorter tweets are treated equally
                             smooth_idf=True  # adds 1 to all document frequencies, as if an extra document contained every term once; prevents division by zero
                            )
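
To make the `use_idf`, `smooth_idf`, and `norm` settings concrete, here is a small worked sketch in plain Python. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies by the raw term count, and then scales the vector to unit l2 length. The toy documents and the helper names `smooth_idf` and `tfidf_vector` are illustrative, not part of the notebook.

```python
import math
from collections import Counter

# Toy tokenized documents (illustrative, not from the dataset).
docs = [
    ["forest", "fire", "near", "town"],
    ["fire", "crews", "fight", "fire"],
    ["love", "this", "song"],
]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def smooth_idf(term):
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc):
    counts = Counter(doc)
    weights = {t: c * smooth_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # l2 norm of the vector
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(docs[0])
print(vec)
```

In the first document, "fire" receives a lower weight than "forest" because it also appears in the second document.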

In [22]:
# Applying the vectorizer to our dataset
tfidf = vectorizer.fit_transform(df1['parsed_text'])
print("Number of features: %d" % tfidf.get_shape()[1])

Number of features: 4032


In [23]:
tfidf.shape

(7613, 4032)

In [24]:
# Define explanatory and target variable:
X = tfidf
Y = df1['target']

print(X.shape, Y.shape)

(7613, 4032) (7613,)
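
In the cells above, the vectorizer is fitted on all tweets before cross-validation, so the vocabulary and idf weights have already seen each held-out fold. One possible alternative, sketched here on made-up toy texts, wraps the vectorizer and the classifier in a pipeline so that the vectorizer is refitted inside every training fold. The texts, labels, and `cv=3` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-ins for cleaned tweets and their labels (illustrative only).
texts = [
    "forest fire near town", "flood warning issued", "earthquake hits city",
    "wildfire evacuation orders", "tornado damages homes", "bomb blast kills people",
    "i love this song", "my exam was a disaster lol", "this movie is the bomb",
    "traffic is killing me", "new album is fire", "flooded with emails today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is a pipeline step, so it only ever sees the training folds.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    LogisticRegression(solver='liblinear'),
)

scores = cross_val_score(pipeline, texts, labels, cv=3)
print(scores)
```

On a dataset of this size the effect on the scores is usually small, but keeping feature extraction inside the pipeline makes the evaluation easier to trust.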


#### Supervised learning models (using tf-idf)

In [25]:
# Random forest model:
rfc = ensemble.RandomForestClassifier(n_estimators=100)

rfc_scores = cross_val_score(rfc, X, Y, cv=10)
print("Cross-validation scores: ",rfc_scores)
print("Mean score: ",rfc_scores.mean())

Cross-validation scores:  [0.71128609 0.54855643 0.47900262 0.46911958 0.54270696 0.58738502
 0.61366623 0.61498029 0.64257556 0.71353482]
Mean score:  0.5922813606906233


In [26]:
# Logistic Regression model:
lr = LogisticRegression(penalty='l2',multi_class='auto',solver='saga')

lr_scores = cross_val_score(lr, X, Y, cv=10)
print("Cross-validation scores: ",lr_scores)
print("Mean score: ",lr_scores.mean())

Cross-validation scores:  [0.7335958  0.65091864 0.59973753 0.60709593 0.63731932 0.66491459
 0.68068331 0.6544021  0.73587385 0.76609724]
Mean score:  0.673063830227529


In [27]:
# Gradient Boosting model:
clf = ensemble.GradientBoostingClassifier()

clf_scores = cross_val_score(clf, X, Y, cv=10)
print("Cross-validation scores: ",clf_scores)
print("Mean score: ",clf_scores.mean())

Cross-validation scores:  [0.65485564 0.6351706  0.49343832 0.62417871 0.51116951 0.59921156
 0.59395532 0.61760841 0.63206307 0.73587385]
Mean score:  0.6097525013709686
