# Real Or Not? NLP with Disaster Tweets

### Competition description:

Twitter has become an important communication channel in times of emergency.
The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (e.g. disaster relief organizations and news agencies).

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified.


The following is my solution to this problem:

Libraries to be used:

In [1]:
import pandas as pd
import numpy as np

import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


Read in training and testing data

In [2]:
training_data = pd.read_csv("train.csv")
testing_data = pd.read_csv("test.csv")

### Getting to know the data

The training data contains five columns: id, keyword, location, text, and the target (label). The first five rows of the dataset already show that the keyword and location features have missing values.

In [3]:
training_data.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


There are 7,613 rows in the dataframe. 

In [4]:
training_data.describe()

| | id | target |
|---|---|---|
| count | 7613.0 | 7613.0 |
| mean | 5441.934848 | 0.42966 |
| std | 3137.11609 | 0.49506 |
| min | 1.0 | 0.0 |
| 25% | 2734.0 | 0.0 |
| 50% | 5408.0 | 0.0 |
| 75% | 8146.0 | 1.0 |
| max | 10873.0 | 1.0 |


The following function reports the number and percentage of missing values for each feature. Of the 7,613 tweets, 2,533 (33.3%) are missing the location feature and 61 (0.8%) are missing the keyword feature.

In [5]:
def identify_number_of_missing_values(dataframe):
    """
    Prints a table representing the number and percent of missing values each feature has
    
    Parameter:
        dataframe: dataframe with features and values
    """
    total = dataframe.isnull().sum().sort_values(ascending = False)
    percent_1 = dataframe.isnull().sum()/dataframe.isnull().count()*100
    percent_2 = (round(percent_1, 1)).sort_values(ascending = False)
    missing_data = pd.concat([total, percent_2], axis = 1, keys = ['Total', '%'])
    print(missing_data)
    
identify_number_of_missing_values(training_data)

          Total     %
location   2533  33.3
keyword      61   0.8
target        0   0.0
text          0   0.0
id            0   0.0
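
The same summary can be computed more compactly, because the mean of a boolean mask is the fraction of missing entries. The following is a minimal sketch on a made-up four-row frame; `toy` and `missing_value_summary` are hypothetical names and the values are not taken from `train.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for training_data
toy = pd.DataFrame({
    "keyword": ["fire", np.nan, "flood", "storm"],
    "location": [np.nan, np.nan, "Canada", np.nan],
    "text": ["a", "b", "c", "d"],
})

def missing_value_summary(dataframe):
    """Return the count and percentage of missing values per column."""
    is_missing = dataframe.isnull()
    summary = pd.DataFrame({
        "Total": is_missing.sum(),
        "%": (is_missing.mean() * 100).round(1),
    })
    # Columns with the most missing values come first
    return summary.sort_values("Total", ascending=False)

summary = missing_value_summary(toy)
print(summary)
```

Returning the table instead of printing it lets the result be reused, for example to pick which columns to drop.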


### Data Cleaning

The id column is dropped because it is only a row identifier and carries no information about the tweet. The location and keyword columns are also dropped due to missing values.

In [6]:
training_data = training_data.drop(['id', 'location', 'keyword'], axis = 1)
testing_data = testing_data.drop(['id', 'location', 'keyword'], axis = 1)

Next, the words contained in the text feature are normalized by changing all letters to lowercase, removing numbers, removing punctuation, removing websites, and stripping leading and trailing whitespace. This is done for both the training and test sets.

In [7]:
def text_normalizing(text):
    """Normalizes a tweet by changing all letters to lowercase, removing numbers, removing punctuation,
       removing websites, and stripping leading and trailing whitespace.
        :param text: raw tweet text (string)
        :return: normalized text (string)
    """
    #changes all letters to be lowercase
    lowercase = text.lower()

    #removes numbers
    no_digits = re.sub(r'\d+', '', lowercase)

    #removes punctuation
    no_punctuation = "".join([c for c in no_digits if c not in string.punctuation])

    #removes websites; punctuation is already gone at this point, so a URL is a
    #single token starting with "http", which this pattern still matches
    no_websites = re.sub(r'http\S+', '', no_punctuation)

    #removes leading and trailing whitespace
    no_white_space = no_websites.strip()

    return no_white_space

training_data['text'] = training_data['text'].apply(text_normalizing)
testing_data['text'] = testing_data['text'].apply(text_normalizing)
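
To see what these steps do to a single tweet, the sketch below repeats them in the same order on a made-up example. The sample tweet and its URL are hypothetical, and `normalize` is only a compact restatement of `text_normalizing` so the block runs on its own.

```python
import re
import string

def normalize(text):
    # Same five steps, in the same order, as text_normalizing above
    lowercase = text.lower()
    no_digits = re.sub(r'\d+', '', lowercase)
    no_punctuation = ''.join(c for c in no_digits if c not in string.punctuation)
    no_websites = re.sub(r'http\S+', '', no_punctuation)
    return no_websites.strip()

# Hypothetical tweet with digits, a hashtag, punctuation, and a link
sample = "13,000 People receive #wildfires evacuation orders: http://t.co/abc123 "
print(normalize(sample))  # people receive wildfires evacuation orders
```

Because punctuation is removed before the link, the URL reaches the last regular expression as `httptcoabc`, which still starts with `http` and is therefore removed. Only leading and trailing whitespace is stripped, so a removed number can leave two spaces inside the text; the tokenizer in the next step is not affected by that.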

Next, the text is tokenized.

In [8]:
def tokenize_text(text):
    """Tokenizes text.
        :param text: normalized tweet text (string)
        :return: list of word tokens
    """
    #instantiate tokenizer that keeps runs of word characters
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    tokenized_text = tokenizer.tokenize(text)

    return tokenized_text

training_data['text'] = training_data['text'].apply(tokenize_text)
testing_data['text'] = testing_data['text'].apply(tokenize_text)
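
For the pattern `\w+`, the tokenizer returns every maximal run of word characters. The sketch below illustrates this with the standard library's `re.findall`, on the assumption that it behaves the same as `RegexpTokenizer(r'\w+')` for this pattern; `tokenize_words` is a hypothetical helper used only to keep the example free of the NLTK dependency.

```python
import re

def tokenize_words(text):
    # Every maximal run of letters, digits, or underscores becomes one token
    return re.findall(r'\w+', text)

tokens = tokenize_words("forest fire near la ronge sask canada")
print(tokens)

# Repeated spaces left over from normalization do not produce empty tokens
print(tokenize_words("people  receive wildfires"))
```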

Finally, stopwords are removed and remaining words are stemmed. 

In [9]:
#the stopword set and the stemmer are built once and reused for every tweet
english_stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def stemmer_and_remove_stop_words(tokenized_text):
    """Removes stopwords and stems remaining words
        :param tokenized_text: list of word tokens
        :return: string of stemmed words joined by spaces
    """
    tweet = [stemmer.stem(word) for word in tokenized_text if word not in english_stop_words]
    stemmed = ' '.join(tweet)

    return stemmed

training_data['text'] = training_data['text'].apply(stemmer_and_remove_stop_words)
testing_data['text'] = testing_data['text'].apply(stemmer_and_remove_stop_words)

The tweets are now ready to be used in machine learning algorithms. They now look like the following:

In [10]:
training_data.head()

| | text | target |
|---|---|---|
| 0 | deed reason earthquak may allah forgiv us | 1 |
| 1 | forest fire near la rong sask canada | 1 |
| 2 | resid ask shelter place notifi offic evacu she... | 1 |
| 3 | peopl receiv wildfir evacu order california | 1 |
| 4 | got sent photo rubi alaska smoke wildfir pour ... | 1 |


### Data Modeling

The data is now ready to be modeled. Code for three of the best-performing models has been included below.

A CountVectorizer is instantiated and fitted to the training tweets. It learns the vocabulary of the training data and returns the document-term matrix, in which each row is a tweet and each column holds the count of one token.

In [11]:
count_vectorizer = CountVectorizer()
bag_of_words = count_vectorizer.fit_transform(training_data['text']).toarray()

Transform tweets for testing data to document-term matrix.

In [12]:
bag_of_words_test = count_vectorizer.transform(testing_data['text']).toarray()
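
The difference between `fit_transform` and `transform` is easier to see on a tiny corpus. The following sketch uses made-up, already-cleaned tweets (not rows from the dataset): the vocabulary is learned from the training tweets only, and a word that appears only in the test tweet gets no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-cleaned tweets
train_tweets = ["forest fire near canada", "fire fire evacu order"]
test_tweets = ["flood near canada"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_tweets)  # learns the vocabulary
test_matrix = vectorizer.transform(test_tweets)        # reuses it unchanged

# vocabulary_ maps each token to its column index
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)                 # ['canada', 'evacu', 'fire', 'forest', 'near', 'order']
print(train_matrix.toarray())  # [[1 0 1 1 1 0], [0 1 2 0 0 1]]
print(test_matrix.toarray())   # [[1 0 0 0 1 0]]; 'flood' was never seen, so it is ignored
```

Both calls return sparse matrices. The estimators used in this notebook accept sparse input as well, so converting with `.toarray()` is a choice rather than a requirement.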

Split into features and labels for training and testing data.

In [13]:
training_features = bag_of_words
training_labels = training_data['target']
testing_features = bag_of_words_test
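
Since `test.csv` has no labels, the models can only be scored by uploading predictions. `train_test_split` is imported at the top of the notebook but never called; one possible use is to hold out part of the labeled data for a local estimate before submitting. The sketch below runs on randomly generated stand-in data, and the F1 metric is used on the assumption that it matches the competition's scoring.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for training_features / training_labels
rng = np.random.default_rng(21)
features = rng.integers(0, 3, size=(200, 10))
labels = (features[:, 0] > 0).astype(int)  # label depends on the first column only

# Hold out 20% of the labeled rows, keeping the class ratio the same in both parts
X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=21
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
val_f1 = f1_score(y_val, model.predict(X_val))
print(f"validation F1: {val_f1:.3f}")
```

With the real data, `features` and `labels` would be replaced by `training_features` and `training_labels`, and the same split could be reused to compare the three models below.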

#### K-Nearest Neighbor

In [15]:
k_nearest_neighbor = KNeighborsClassifier(n_neighbors = 8, weights = 'distance')
k_nearest_neighbor.fit(training_features, training_labels)
k_predictions = k_nearest_neighbor.predict(testing_features)
k_predictions1 = pd.DataFrame(k_predictions)
k_predictions1.to_csv('k_predictions.csv')

#### Logistic Regression

In [20]:
logistic_regression = LogisticRegression(penalty = 'l2', solver = 'saga', random_state = 21, max_iter = 1000)
logistic_regression.fit(training_features, training_labels)
l_prediction = logistic_regression.predict(testing_features)
l_predictions1 = pd.DataFrame(l_prediction)
l_predictions1.to_csv('l_predictions.csv')

#### Support Vector Machine

In [17]:
#degree only applies to the 'poly' kernel, so it is omitted for the linear kernel
svm = SVC(kernel = 'linear', max_iter = 100000)
svm.fit(training_features, training_labels)
svm_prediction = svm.predict(testing_features)
svm_predictions1 = pd.DataFrame(svm_prediction)
svm_predictions1.to_csv('svm_predictions.csv')
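
The CSV files written above contain only a row index and the predictions. A submission file normally pairs each prediction with the tweet's id, which was dropped from `testing_data` during cleaning, so the ids would have to be saved before that step. The sketch below assumes the expected columns are named `id` and `target`, as in the training data; `test_ids` and `predictions` are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins: ids saved before the `id` column was dropped,
# and predictions from any of the fitted models
test_ids = pd.Series([0, 2, 3], name="id")
predictions = [1, 0, 1]

submission = pd.DataFrame({"id": test_ids, "target": predictions})

# index=False keeps the row index out of the file, avoiding an extra unnamed column
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name, for example `submission.to_csv('submission.csv', index=False)`, writes the same content to disk.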

### Results

The results when uploaded to Kaggle were:

K-Nearest Neighbor: 72.448%

Logistic Regression: 79.497%

Support Vector Machine: 77.505%

Logistic Regression ended up being the best-performing model of the ones tried in this notebook. This is probably due to the simplicity of the dataset, for which a linear model on word counts appears to be sufficient.