# Feature Engineering and Regression Modeling

This notebook uses a provided dataset to classify tweets as either "Relevant" or "Not Relevant" using Doc2Vec and Logistic Regression.

## Table of Contents

1. [Collecting Tweets](01-Gathering-Data.ipynb)
1. [Feature Engineering with TF-IDF](02-Feature-Engineering.ipynb)
1. [Benchmark Model](03-Benchmark-Model.ipynb)
1. [Feature Engineering & Model Tuning with Doc2Vec](04-Model-Tuning.ipynb)
1. [Making Predictions on Test Data](05-Making-Predictions.ipynb)
1. [Visualizing a Disaster Event](06-Time-Series-Analysis.ipynb)

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn import utils
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.linear_model import LogisticRegression
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.tokenize.regexp import RegexpTokenizer
from nltk.corpus import stopwords
import regex as re
import time
import pickle
import gensim
import multiprocessing
cores = multiprocessing.cpu_count()

### Load Data

Our training data consists of 10,876 disaster-related tweets published by [Figure Eight](https://www.figure-eight.com/data-for-everyone/) on September 4th, 2015.
 - "Contributors looked at over 10,000 tweets culled with a variety of searches like 'ablaze', 'quarantine', and 'pandemonium', then noted whether the tweet referred to a disaster event (as opposed to a joke with the word or a movie review or something non-disastrous)."




In [2]:
#read in data
df = pd.read_csv('../data/datasets/socialmedia-disaster-tweets-DFE.csv', encoding='ISO-8859-1')

In [3]:
df.head()

| | _unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | choose_one | choose_one:confidence | choose_one_gold | keyword | location | text | tweetid | userid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 778243823 | True | golden | 156 | | Relevant | 1.0 | Relevant | | | Just happened a terrible car crash | 1.0 | |
| 1 | 778243824 | True | golden | 152 | | Relevant | 1.0 | Relevant | | | Our Deeds are the Reason of this #earthquake M... | 13.0 | |
| 2 | 778243825 | True | golden | 137 | | Relevant | 1.0 | Relevant | | | Heard about #earthquake is different cities, s... | 14.0 | |
| 3 | 778243826 | True | golden | 136 | | Relevant | 0.9603 | Relevant | | | there is a forest fire at spot pond, geese are... | 15.0 | |
| 4 | 778243827 | True | golden | 138 | | Relevant | 1.0 | Relevant | | | Forest fire near La Ronge Sask. Canada | 16.0 | |


In [4]:
#number of observations
df.shape[0]

10876

In [5]:
# drop rows labeled "Can't Decide" for our target variable (copy so the later column assignment is safe)
df = df[df['choose_one'] != "Can't Decide"].copy()

In [6]:
#relevant variable
df['choose_one'].value_counts(normalize = True)

Not Relevant    0.569705
Relevant        0.430295
Name: choose_one, dtype: float64

In [7]:
df.shape[0]

10860

In [8]:
#create binary target column
df['target'] = df['choose_one'].map(lambda x: 1 if x == 'Relevant' else 0)

- After dropping tweets that could not be classified as either "Relevant" or "Not Relevant", we are left with 10,860 observations.
- The baseline accuracy for our model is 57%, the share of the majority class ("Not Relevant").
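
The baseline is what a model would score by always predicting the most common label. A minimal sketch of that calculation follows, using a small made-up label column rather than the real dataset (the `labels` values are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for df['choose_one']; the real column has 10,860 rows.
labels = pd.Series(['Not Relevant'] * 4 + ['Relevant'] * 3)

# Binary target, equivalent to the lambda-based map used above.
target = (labels == 'Relevant').astype(int)

# Majority-class baseline: accuracy of always predicting the most common label.
baseline_accuracy = target.value_counts(normalize=True).max()
print(round(baseline_accuracy, 3))
```

Any trained classifier should beat this number before its accuracy is considered meaningful.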

### Train Test Split

In [9]:
#train test split keeping target variable (separate y variable not necessary)
train, test = train_test_split(df[['text','target']], stratify = df['target'], random_state=42)
train.head()

| | text | target |
|---|---|---|
| 4987 | This whole podcast explosion thing has been we... | 0 |
| 3658 | If Shantae doesn't get in Smash I will destroy... | 0 |
| 9745 | Robert Gagnon reviews the catastrophe of impos... | 1 |
| 9060 | [CLIP] Top-down coercion - The structural weak... | 0 |
| 4408 | Achievement Unlocked: Replaced Light Socket; D... | 0 |
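
Passing `stratify` keeps the Relevant/Not Relevant ratio the same in both splits, and leaving `test_size` unset holds out scikit-learn's default of 25% of the rows. The sketch below checks both on a synthetic frame (the 400-row `toy` frame is an assumption for illustration, not the tweet data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with a 57/43 class balance, loosely mirroring the tweets.
toy = pd.DataFrame({
    'text': ['tweet %d' % i for i in range(400)],
    'target': [0] * 228 + [1] * 172,
})

toy_train, toy_test = train_test_split(
    toy, stratify=toy['target'], random_state=42)

# Share of positive rows in each split; stratification keeps them aligned.
train_share = toy_train['target'].mean()
test_share = toy_test['target'].mean()
print(len(toy_train), len(toy_test), train_share, test_share)
```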


### Preprocess Text

In [10]:
#instantiate lemmatizer and tokenizer
lemmatizer = WordNetLemmatizer()
tokenizer = RegexpTokenizer(r'\w+')

#create set of stopwords from NLTK and add a few more words
stops = set(stopwords.words('english'))
more_stops = ['xb','amp']
stops.update(more_stops)

#function to clean text
def to_words(raw_text):
    #remove links (raw string avoids invalid-escape warnings)
    raw_text = re.sub(r'(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$', '', raw_text)
    #tokenize
    words = tokenizer.tokenize(raw_text.lower())
    #remove stop words and lemmatize
    meaningful_words = [lemmatizer.lemmatize(w) for w in words if w not in stops]
    #returns a list of words
    return meaningful_words
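
The link pattern in `to_words` ends with `$`, and its optional path group `(\/.*)?` can swallow everything after the first slash. It is worth checking what it actually removes. The sketch below applies the same pattern to three made-up strings. It uses the standard-library `re` module on the assumption that it behaves like `regex` for this pattern, and `strip_links` is a hypothetical helper, not part of the notebook:

```python
import re

# Same pattern as in to_words, written as a raw string.
LINK_PATTERN = (r'(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?'
                r'[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}'
                r'(:[0-9]{1,5})?(\/.*)?$')

def strip_links(text):
    # Hypothetical helper isolating the link-removal step of to_words.
    return re.sub(LINK_PATTERN, '', text)

# A link at the end of the tweet is removed.
trailing = strip_links('Forest fire http://t.co/abc123')      # 'Forest fire '

# A link with a path in the middle takes the rest of the tweet with it.
mid_text = strip_links('see http://t.co/abc for details')     # 'see '

# A bare domain that is not at the end is left untouched.
bare_domain = strip_links('visit example.com today')          # unchanged

print(repr(trailing), repr(mid_text), repr(bare_domain))
```

If words after a mid-tweet link matter, a pattern without the `$` anchor and with a path group limited to non-whitespace characters may be a better fit.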

In [28]:
#create and export clean training and test data

train_clean = pd.DataFrame(train['text'].map(lambda x: to_words(x)),columns = ['text'])
train_clean['target'] = train['target']
test_clean = pd.DataFrame(test['text'].map(lambda x: to_words(x)), columns = ['text'])
test_clean['target'] = test['target']

train_clean.to_csv('../data/datasets/train_clean.csv')
test_clean.to_csv('../data/datasets/test_clean.csv')
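
Because the cleaned `text` column holds Python lists, `to_csv` stores each one as its string representation, and reading the file back yields strings rather than lists. One way to recover the token lists is sketched here with an in-memory buffer in place of the real files (the two-row `clean` frame is made up for illustration):

```python
import ast
import io
import pandas as pd

# Made-up stand-in for train_clean: a column of token lists plus a target.
clean = pd.DataFrame({'text': [['forest', 'fire'], ['car', 'crash']],
                      'target': [1, 1]})

# Write to and read from an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
clean.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

# After the round trip each cell is a string such as "['forest', 'fire']".
raw_is_string = isinstance(restored['text'].iloc[0], str)

# ast.literal_eval safely parses that string back into a list.
restored['text'] = restored['text'].map(ast.literal_eval)
print(raw_is_string, restored['text'].tolist())
```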

In [11]:
# create TaggedDocument object array for our train and test data
train_tagged = train.apply(
    lambda r: TaggedDocument(words=to_words(r['text']), tags=[r['target']]), axis=1)
test_tagged = test.apply(
    lambda r: TaggedDocument(words=to_words(r['text']), tags=[r['target']]), axis=1)

In [12]:
train_tagged[:5]

4987    ([whole, podcast, explosion, thing, weird, rep...
3658    ([shantae, get, smash, destroy, wii, u, shanta...
9745    ([robert, gagnon, review, catastrophe, imposin...
9060    ([clip, top, coercion, structural, weakness, e...
4408    ([achievement, unlocked, replaced, light, sock...
dtype: object

#### Doc2Vec Distributed Bag of Words (DBOW) model
- We use Doc2Vec to engineer paragraph vectors for tweets.
- The following procedure is adapted from Susan Li's example: [Multi-Class Text Classification with Doc2Vec & Logistic Regression](https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4)
- "DBOW is the doc2vec model analogous to Skip-gram model in word2vec. The paragraph vectors are obtained by training a neural network on the task of predicting a probability distribution of words in a paragraph given a randomly-sampled word from the paragraph."



In [13]:
#function accepts a trained doc2vec model and a Series of TaggedDocument objects
def get_vectors(model, tagged_docs):
    tweets = tagged_docs.values
    #`epochs` is the current name of the older `steps` keyword, which gensim 4 no longer accepts
    tag, vector = zip(*[(t.tags[0], model.infer_vector(t.words, epochs=20)) for t in tweets])
    #returns a tuple of tags and a tuple of paragraph vectors
    return tag, vector
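
`get_vectors` builds a list of `(tag, vector)` pairs and then uses `zip(*pairs)` to split it into one tuple of tags and one tuple of vectors. The same idiom is shown in isolation below, with a hypothetical `fake_infer` function standing in for `model.infer_vector` so that no trained model is needed:

```python
def fake_infer(words):
    # Stand-in for model.infer_vector: returns a 2-number "vector"
    # (token count, total characters) instead of a learned embedding.
    return [len(words), sum(len(w) for w in words)]

# Made-up (words, tag) pairs mimicking TaggedDocument contents.
docs = [
    (['forest', 'fire'], 1),
    (['podcast', 'explosion', 'thing'], 0),
]

# Build (tag, vector) pairs, then transpose them with zip(*...).
pairs = [(tag, fake_infer(words)) for words, tag in docs]
tags, vectors = zip(*pairs)

print(tags)     # (1, 0)
print(vectors)  # ([2, 10], [3, 21])
```

Both results are tuples, and scikit-learn converts such sequences to arrays when they are passed to `fit` or `predict`.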

In [14]:
#gridsearch doc2vec model
# t0 = time.time()
# sizes = [50,100,200,300]
# windows = [5,8,10,15,20]
# counts = [1,5,10]
# summaries_dbow = []

# for size in sizes:
#     for window in windows:
#         for count in counts:
#             model_dbow = Doc2Vec(dm=0, vector_size=size, negative=15, window=window, 
#                                  hs=0, min_count=count, sample = 0, workers=cores)
#             model_dbow.build_vocab([x for x in train_tagged.values])

#             model_dbow.train(utils.shuffle([x for x in train_tagged.values]), 
#                                  total_examples=len(train_tagged.values), epochs=15)            

#             y_train, X_train = get_vectors(model_dbow, train_tagged)
#             y_test, X_test = get_vectors(model_dbow, test_tagged)
#             logreg = LogisticRegression(n_jobs=1, C=1e5)
#             logreg.fit(X_train, y_train)

#             y_train_pred = logreg.predict(X_train)
#             y_test_pred = logreg.predict(X_test)

#             summary = {}
#             summary['Size']      = size
#             summary['Window']    = window
#             summary['Count']     = count
#             summary['Train_Acc'] = accuracy_score(y_train, y_train_pred)
#             summary['Test_Acc']  = accuracy_score(y_test, y_test_pred)
#             summary['Train_F1']  = f1_score(y_train, y_train_pred)
#             summary['Test_F1']   = f1_score(y_test, y_test_pred)
#             summary['CV_Score']  = cross_val_score(logreg, X_train, y_train, cv = 3).mean()
#             summary['Train_Report'] = classification_report(y_train,y_train_pred)
#             summary['Test_Report'] = classification_report(y_test,y_test_pred)
            
#             summaries_dbow.append(summary)

# print(time.time()-t0)
# summaries_dbow_df = pd.DataFrame(summaries_dbow)
# summaries_dbow_df.sort_values(by='Test_Acc',ascending=False).head(10)
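
The commented-out search above visits every combination of vector size, window, and minimum count. A compact way to enumerate the same grid is `itertools.product`. The sketch below only lists the combinations and does not train anything:

```python
from itertools import product

sizes = [50, 100, 200, 300]
windows = [5, 8, 10, 15, 20]
counts = [1, 5, 10]

# One dict per hyperparameter combination, in the same order as the nested loops.
grid = [
    {'Size': size, 'Window': window, 'Count': count}
    for size, window, count in product(sizes, windows, counts)
]

print(len(grid))
```

Each combination trains a Doc2Vec model and a classifier, so the grid size is a rough guide to how long the full search will take.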

- After doing a "GridSearch" to tune our Doc2Vec hyperparameters, we instantiate and train our Doc2Vec and Logistic Regression models below:

### Finalizing and Exporting Models

In [15]:
logreg = LogisticRegression(n_jobs=1, C=1e5)

df_tagged = df.apply(
    lambda r: TaggedDocument(words=to_words(r['text']), tags=[r['target']]), axis=1)

model_dbow = Doc2Vec(dm=0, vector_size=100, negative=15, window=15, 
                     hs=0, min_count=5, sample = 0, workers=cores)
model_dbow.build_vocab([x for x in df_tagged.values])

model_dbow.train(utils.shuffle([x for x in df_tagged.values]), 
                                 total_examples=len(df_tagged.values), epochs=15)            


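#infer paragraph vectors for the training split and fit the classifier
#(accuracy and F1 below are measured on the same data the classifier was fit on)
y, X = get_vectors(model_dbow, train_tagged)
logreg.fit(X, y)
y_pred = logreg.predict(X)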

print('Final model accuracy %s' % accuracy_score(y, y_pred))
print('Final model F1 score: {}'.format(f1_score(y, y_pred, average='weighted')))
print('Final model CV score: {}'.format(cross_val_score(logreg, X, y, cv = 3).mean()))

Final model accuracy 0.8379373848987108
Final model F1 score: 0.8359878347374353
Final model CV score: 0.8281131312914726
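
The accuracy and F1 figures above are computed on the same `X` and `y` the classifier was fit on, so only the cross-validation score holds data out from the classifier. A held-out check could look like the following sketch, which uses synthetic features from `make_classification` as a stand-in for the inferred paragraph vectors (all data here is generated, not the tweet vectors):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the paragraph vectors and binary targets.
X_all, y_all = make_classification(n_samples=500, n_features=20, random_state=42)
X_fit, X_held, y_fit, y_held = train_test_split(
    X_all, y_all, stratify=y_all, random_state=42)

# Same regularization strength as the notebook; max_iter raised to help convergence.
clf = LogisticRegression(C=1e5, max_iter=1000)
clf.fit(X_fit, y_fit)

# Compare the score on fitted data with the score on held-out data.
fit_acc = accuracy_score(y_fit, clf.predict(X_fit))
held_acc = accuracy_score(y_held, clf.predict(X_held))
held_f1 = f1_score(y_held, clf.predict(X_held), average='weighted')
print(fit_acc, held_acc, held_f1)
```

In this notebook the held-out features would come from `get_vectors(model_dbow, test_tagged)`, as in the grid search cell.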


In [16]:
#export package
models = {'model_dbow':model_dbow,
         'logreg':logreg}

with open('../data/pickles/models.pk', 'wb') as f:
    pickle.dump(models, f)
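
To reuse the exported models in a later notebook, the dictionary can be loaded back with `pickle.load`. A minimal round-trip sketch follows, using a temporary file and plain placeholder values instead of the real trained models:

```python
import os
import pickle
import tempfile

# Placeholder values standing in for the trained Doc2Vec and LogisticRegression objects.
models = {'model_dbow': 'doc2vec placeholder', 'logreg': 'logreg placeholder'}

# Temporary location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'models.pk')

with open(path, 'wb') as f:
    pickle.dump(models, f)

with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(sorted(loaded))
```

Loading real pickled models requires the same libraries (and ideally the same versions) to be installed as when they were saved.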