# Making Predictions

We use the trained Doc2Vec and logistic regression models to classify unseen tweets as "Relevant" or "Not Relevant".

## Table of Contents

1. [Collecting Tweets](01-Gathering-Data.ipynb)
1. [Feature Engineering with TF-IDF](02-Feature-Engineering.ipynb)
1. [Benchmark Model](03-Benchmark-Model.ipynb)
1. [Feature Engineering & Model Tuning with Doc2Vec](04-Model-Tuning.ipynb)
1. [Making Predictions on Test Data](05-Making-Predictions.ipynb)
1. [Visualizing a Disaster Event](06-Time-Series-Analysis.ipynb)

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
from gensim.models import Doc2Vec
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.tokenize.regexp import RegexpTokenizer
from nltk.corpus import stopwords
import regex as re
import gensim
import time
import pickle

### Load Models

In [2]:
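#load the trained models (only unpickle files from a trusted source)
with open('../data/pickles/models.pk', 'rb') as f:
    models = pickle.load(f)
model_dbow = models['model_dbow']   #trained Doc2Vec model
logreg = models['logreg']           #fitted logistic regression classifier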
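
The loading cell assumes that `models.pk` holds a dictionary keyed by model name. A minimal sketch of that round trip follows; the placeholder strings stand in for the real Doc2Vec and logistic regression objects, and the temporary path is hypothetical.

```python
import os
import pickle
import tempfile

# Placeholder values stand in for the trained models (assumption for illustration)
models = {'model_dbow': 'doc2vec placeholder', 'logreg': 'classifier placeholder'}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'models.pk')  # hypothetical location
    # Save the dictionary the same way the loading cell expects to read it
    with open(path, 'wb') as f:
        pickle.dump(models, f)
    # Read it back; the context manager closes the file handle afterwards
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(sorted(loaded))  # ['logreg', 'model_dbow']
```

Keeping both models in one dictionary means the vectorizer and the classifier that were trained together are always loaded together.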

#### We collected tweets from two date ranges using the search terms "wildfire" and "forest fire":
- July 12th - 13th, 2018
  - "In mid-July to August 2018, a series of large wildfires erupted across California, mostly in the northern part of the state, including the destructive Carr Fire and the Mendocino Complex Fire. On August 4, 2018, a national disaster was declared in Northern California, due to the extensive wildfires burning there." [Wikipedia](https://en.wikipedia.org/wiki/2018_California_wildfires)
- January 15th - 16th, 2019
  - No current or recent wildfire emergencies



### Load Data

In [3]:
#read in tweets collected during the July 2018 wildfires
fire_tweets = pd.read_csv('../data/datasets/2018-07-12_2018-07-13_wildfire OR forest+fire.csv')
#use the tweet timestamp as the index
fire_tweets.set_index('date', inplace=True)
#drop rows with missing text
fire_tweets.dropna(inplace=True)
fire_tweets.head()

date,text
2018-07-12 16:59:55,You are a forest fire
2018-07-12 16:59:05,"@ fireadaptednet has a new blog ""Co-Managing W..."
2018-07-12 16:58:54,Ever wonder what type of questions you should ...
2018-07-12 16:58:50,Ever wonder what type of questions you should ...
2018-07-12 16:58:26,Updated: Evacuation order given as Kamloops wi...


In [4]:
#read in tweets collected in January 2019 (no active wildfire emergency)
recent_tweets = pd.read_csv('../data/datasets/2019-01-15_2019-01-16_wildfire OR forest+fire.csv')
#use the tweet timestamp as the index
recent_tweets.set_index('date', inplace=True)
#drop rows with missing text
recent_tweets.dropna(inplace=True)
recent_tweets.head()

date,text
2019-01-15 15:59:53,Gavin Newsom Begins Tackling California's Wild...
2019-01-15 15:58:55,Wildfire . . . . . . # ilikeithot # writersoft...
2019-01-15 15:58:51,*$6b from wildfire damages in 2017
2019-01-15 15:58:48,"@ AciowaKsanti @ Starshinedraws, # Gillette di..."
2019-01-15 15:58:18,Trump says he’ll pull FEMA funds for Californi...


### Preprocessing Data

In [5]:
#instantiate lemmatizer, tokenizer, and stemmer
lemmatizer = WordNetLemmatizer()
tokenizer = RegexpTokenizer(r'\w+')
p_stemmer = PorterStemmer()

#create set of stopwords from NLTK and add more words
stops = set(stopwords.words('english'))
more_stops = ['xb','amp']
stops.update(more_stops)

#function to clean text; returns a list of lowercase, lemmatized tokens
def to_words(raw_text):
    ##remove links (the pattern is anchored to the end of the text)
    raw_text = re.sub(r'(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$', '', raw_text)
    ##remove numbers
    #raw_text = re.sub(r'\d+', '', raw_text)
    #tokenize
    words = tokenizer.tokenize(raw_text.lower())
    ##remove stop words and lemmatize
    meaningful_words = [lemmatizer.lemmatize(w) for w in words if w not in stops]
    ##alternative: stem instead of lemmatizing
    ##meaningful_words = [p_stemmer.stem(w) for w in words if w not in stops]

    return meaningful_words
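
To see what the cleaning step does to a single tweet, here is a simplified, standard-library-only sketch. It reuses the link pattern from `to_words`, but `demo_to_words`, `DEMO_STOPS`, and the example tweets are hypothetical: the stop-word list is a tiny stand-in for NLTK's, `re.findall` stands in for `RegexpTokenizer`, and lemmatization is skipped.

```python
import re

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's
# English stop words plus 'xb' and 'amp'
DEMO_STOPS = {'a', 'the', 'is', 'in', 'at', 'are', 'you', 'xb', 'amp'}

# Same link pattern as in to_words; the trailing $ anchors it to the end of the text
LINK_PATTERN = re.compile(
    r'(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?'
    r'[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$'
)

def demo_to_words(raw_text):
    # Remove the link, lowercase, split into word tokens, drop stop words
    text = LINK_PATTERN.sub('', raw_text)
    words = re.findall(r'\w+', text.lower())
    return [w for w in words if w not in DEMO_STOPS]

print(demo_to_words('Evacuation order issued in Kamloops https://example.com/news'))
# ['evacuation', 'order', 'issued', 'kamloops']
```

Because of the end anchor, a link with a path removes everything after it (`'read https://example.com/a more text'` becomes `['read']`), while a bare domain in the middle of a tweet is left in place and tokenized (`'visit example.com today'` becomes `['visit', 'example', 'com', 'today']`).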

In [6]:
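#takes trained doc2vec model, fitted logistic regression model, and text series
def make_preds(doc2vec, lr, text):
    words = text.map(to_words)
    #infer a vector for each tokenized tweet
    #(`steps` is the gensim 3.x argument name; gensim 4.x calls it `epochs`)
    vectors = [doc2vec.infer_vector(w, steps=20) for w in words]
    preds = lr.predict(vectors)
    #probability of the positive class (label 1)
    probas = [proba[1] for proba in lr.predict_proba(vectors)]
    #returns a tuple of (predicted labels, positive-class probabilities)
    return (preds, probas)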

### Making predictions for both date ranges

### Predicting relevance during a real wildfire event

In [7]:
fire_preds, fire_probas = make_preds(model_dbow, logreg, fire_tweets['text'])
fire_tweets['pred'] = fire_preds
fire_tweets['proba'] = fire_probas

In [8]:
fire_tweets.head()

date,text,pred,proba
2018-07-12 16:59:55,You are a forest fire,0,0.491512
2018-07-12 16:59:05,"@ fireadaptednet has a new blog ""Co-Managing W...",0,0.325324
2018-07-12 16:58:54,Ever wonder what type of questions you should ...,0,0.153826
2018-07-12 16:58:50,Ever wonder what type of questions you should ...,0,0.184275
2018-07-12 16:58:26,Updated: Evacuation order given as Kamloops wi...,1,0.929461
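
The `pred` and `proba` columns in this output are consistent with a 0.5 cutoff on the positive-class probability, which is how a binary scikit-learn `LogisticRegression` behaves by default (assuming `logreg` is such a model). A minimal sketch that checks this against the five rows shown above; `label_from_proba` is a hypothetical helper, not part of the notebook.

```python
# (truncated text, pred, proba) copied from the fire_tweets.head() output
rows = [
    ('You are a forest fire', 0, 0.491512),
    ('@ fireadaptednet has a new blog', 0, 0.325324),
    ('Ever wonder what type of questions', 0, 0.153826),
    ('Ever wonder what type of questions', 0, 0.184275),
    ('Updated: Evacuation order given', 1, 0.929461),
]

def label_from_proba(proba, threshold=0.5):
    # Label 1 when the positive-class probability exceeds the threshold
    return int(proba > threshold)

# Every row's label matches the thresholded probability
matches = [label_from_proba(proba) == pred for _, pred, proba in rows]
print(all(matches))  # True
```

The first row sits just under the cutoff (0.491512), so a different threshold applied to the stored `proba` column would change its label without re-running the model.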


In [9]:
fire_tweets.shape[0]

1831

In [15]:
fire_tweets.pred.value_counts(normalize = True)

1    0.673403
0    0.326597
Name: pred, dtype: float64

In [17]:
fire_tweets.to_csv('../data/predictions/2018-07-12_2018-07-13_wildfire OR forest+fire.csv')

### Predicting relevance when a wildfire is not occurring

In [11]:
recent_preds, recent_probas = make_preds(model_dbow, logreg, recent_tweets['text'])
recent_tweets['pred'] = recent_preds
recent_tweets['proba'] = recent_probas

In [12]:
recent_tweets.head()

date,text,pred,proba
2019-01-15 15:59:53,Gavin Newsom Begins Tackling California's Wild...,1,0.93547
2019-01-15 15:58:55,Wildfire . . . . . . # ilikeithot # writersoft...,1,0.549091
2019-01-15 15:58:51,*$6b from wildfire damages in 2017,1,0.625218
2019-01-15 15:58:48,"@ AciowaKsanti @ Starshinedraws, # Gillette di...",1,0.521022
2019-01-15 15:58:18,Trump says he’ll pull FEMA funds for Californi...,1,0.986043


In [13]:
recent_tweets.shape[0]

2389

In [16]:
recent_tweets.pred.value_counts(normalize = True)

0    0.516953
1    0.483047
Name: pred, dtype: float64
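
The normalized counts for the two date ranges can be converted back into approximate tweet counts using the row totals reported by `shape[0]`. This sketch uses only the numbers printed above and assumes that label 1 corresponds to "Relevant".

```python
# Row totals and share of pred == 1, copied from the outputs above
fire_total, fire_share = 1831, 0.673403
recent_total, recent_share = 2389, 0.483047

# Approximate number of tweets predicted as label 1 in each date range
fire_relevant = round(fire_total * fire_share)
recent_relevant = round(recent_total * recent_share)
print(fire_relevant, recent_relevant)  # 1233 1154

# Difference in the predicted-relevant share between the two ranges
share_gap = fire_share - recent_share
print(round(share_gap, 3))  # 0.19
```

The share of tweets predicted as label 1 is about 19 percentage points higher during the July 2018 wildfires, even though the January 2019 sample contains more tweets overall.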

### Save Predictions

In [18]:
recent_tweets.to_csv('../data/predictions/2019-01-15_2019-01-16_wildfire OR forest+fire.csv')