# Lab 11: Predictive model with Logistic Regression

As always, we create a SparkContext/HiveContext.

In [None]:
# Set up Spark Context
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *

SparkContext.setSystemProperty('spark.executor.memory', '2g')
conf = SparkConf()
conf.set('spark.executor.instances', 15)
conf.set('spark.sql.autoBroadcastJoinThreshold', 100*1024*1024)  # 100MB for broadcast join
sc = SparkContext('yarn-client', 'Spark-lab11', conf=conf)

from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql("use demo")

Let's load the feature matrix created in lab 10 into a Spark DataFrame called 'fm', using the DataFrameReader API:

In [None]:
fm = hc.read.format("orc").load("/tmp/fm")
fm.limit(5).toPandas()

Split the dataset into a training set and a test set as follows:
1. Use the years 2011-2013 for training your model.
2. Use the year 2014 as your test set.

In [None]:
trainData = fm.<YOUR CODE HERE>
testData = fm.<YOUR CODE HERE>

print trainData.count(), testData.count()

Using Spark ML's pipeline API, create the components of an end-to-end pipeline as follows:
1. Use the StringIndexer() transformation to convert each string variable (category, dayofweek, district, neighborhood) into a numeric column of category indices.
2. Similarly, convert the "resolved" variable to a categorical variable called "label". This is needed because Spark ML Logistic Regression requires a categorical target variable, whereas "resolved" is a numerical variable with values 0.0 and 1.0.
3. Use VectorAssembler to create an output column called "features" that combines all the features of the model: month, hour, prcp, tmin, tmax, and the indexed categorical variables.
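To make the indexing step concrete, here is a minimal pure-Python sketch of the idea behind StringIndexer: each distinct string is mapped to a float index, with the most frequent value receiving index 0.0. The helper `index_labels`, the sample values, and the alphabetical tie-breaking are assumptions made for illustration; this is not Spark's implementation.

```python
from collections import Counter

def index_labels(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically here
    # (an assumption, chosen only to make the sketch deterministic).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample values for a string column such as "dayofweek"
days = ["Mon", "Tue", "Mon", "Wed", "Mon", "Tue"]
mapping = index_labels(days)
indexed = [mapping[d] for d in days]

print(mapping)
print(indexed)
```

The indexed column is numeric, so it can be combined with the other numeric columns into a single feature vector.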

In [None]:
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline

<YOUR CODE HERE>

Create a Logistic Regression classifier with reasonable parameter settings such as maxIter=30 and regParam=0.01:

In [None]:
from pyspark.ml.classification import LogisticRegression
lr = <YOUR CODE HERE>
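As a rough illustration of what the two parameters control, the sketch below fits a one-feature logistic regression with plain gradient descent and an L2 penalty. It assumes a simplified objective (average log loss plus `reg_param / 2 * w**2`); the function `fit_logistic`, the `step` size, and the toy data are hypothetical, and Spark's own solver and scaling differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, max_iter=30, reg_param=0.01, step=0.5):
    """Fit weight w and intercept b for a single feature by gradient descent."""
    w, b = 0.0, 0.0
    n = float(len(xs))
    for _ in range(max_iter):  # max_iter caps the number of update steps
        # Gradient of the average log loss with respect to w and b
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        # The L2 penalty adds reg_param * w to the weight gradient,
        # pulling the weight toward zero
        grad_w += reg_param * w
        w -= step * grad_w
        b -= step * grad_b
    return w, b

# Hypothetical toy data: negative x values belong to class 0, positive to class 1
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0]

w_weak, b_weak = fit_logistic(xs, ys, reg_param=0.01)
w_strong, b_strong = fit_logistic(xs, ys, reg_param=1.0)
print(w_weak > w_strong)
```

In this sketch a larger regularization value yields a smaller weight, which is the trade-off regParam controls: less risk of overfitting, at the cost of a less flexible fit.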

Create the Spark ML pipeline that combines all the processing steps. Then train the model using the training set:

In [None]:
pipeline_lr = <YOUR CODE HERE>
model_lr = pipeline_lr.fit(trainData)

Compute the predictions using the testData:

In [None]:
results = model_lr.<YOUR CODE HERE>

We have created a Python function that, given a Pandas DataFrame with two columns (label, prediction), computes the precision, recall, and overall accuracy:

In [None]:
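def eval_metrics(lap):
    # Count the four confusion-matrix cells; float() avoids integer division in Python 2
    tp = float(len(lap[(lap['label']==1) & (lap['prediction']==1)]))
    tn = float(len(lap[(lap['label']==0) & (lap['prediction']==0)]))
    fp = float(len(lap[(lap['label']==0) & (lap['prediction']==1)]))
    fn = float(len(lap[(lap['label']==1) & (lap['prediction']==0)]))
    total = tp + tn + fp + fn
    # Return 0.0 instead of raising ZeroDivisionError when a denominator is empty,
    # e.g. when the model never predicts the positive class
    precision = tp / (tp+fp) if (tp+fp) > 0 else 0.0
    recall = tp / (tp+fn) if (tp+fn) > 0 else 0.0
    accuracy = (tp+tn) / total if total > 0 else 0.0
    return {'precision': precision, 'recall': recall, 'accuracy': accuracy}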
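The same three formulas can be checked by hand on a small example. The sketch below uses a hypothetical list of (label, prediction) pairs instead of a Pandas DataFrame, so it runs without any dependencies.

```python
# Hypothetical (label, prediction) pairs for eight test rows
pairs = [(1, 1), (1, 0), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]

# Confusion-matrix cells, as floats so the divisions below are exact in Python 2 as well
tp = float(sum(1 for label, pred in pairs if label == 1 and pred == 1))
tn = float(sum(1 for label, pred in pairs if label == 0 and pred == 0))
fp = float(sum(1 for label, pred in pairs if label == 0 and pred == 1))
fn = float(sum(1 for label, pred in pairs if label == 1 and pred == 0))

precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all rows predicted correctly

print(precision)
print(recall)
print(accuracy)
```

Here two of the four actual positives are found, so recall is 0.5 even though accuracy is 0.625.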

Create a Pandas DataFrame from your results DataFrame, and use the eval_metrics function to compute the precision, recall and accuracy of the current model:

In [None]:
lap = results.<YOUR CODE HERE>
print eval_metrics(lap)

With Logistic Regression, you can print the trained model's weights (one coefficient per feature) and its intercept.

In [None]:
print model_lr.stages[-1].weights
print model_lr.stages[-1].intercept
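To see how these numbers are used, the sketch below computes a predicted probability from a weight vector and an intercept: the weighted sum of the features plus the intercept is passed through the sigmoid function, and the row is classified as 1.0 when the probability exceeds a threshold (0.5 here). The weights, intercept, and feature values are hypothetical, not taken from the trained model.

```python
import math

def predict_probability(features, weights, intercept):
    """Return P(label = 1) for one row under a logistic regression model."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical model parameters and a single hypothetical feature row
weights = [0.8, -0.5]
intercept = 0.1
features = [1.0, 2.0]

p = predict_probability(features, weights, intercept)
prediction = 1.0 if p > 0.5 else 0.0

print(p)
print(prediction)
```

A positive weight raises the predicted probability as its feature grows, and a negative weight lowers it.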

Note that the recall is relatively low. One possible cause is that our categorical variables enter the regression model as numeric indices, which implies an ordering among the categories that does not actually exist. Create a different Spark ML pipeline that uses OneHotEncoder to transform some of these categorical variables into dummy variables, and re-run the logistic regression model.

Did the results improve?
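For intuition about dummy variables, the sketch below turns a category index into a vector of 0/1 values in pure Python. The helper `one_hot` is hypothetical and returns a dense list for readability, whereas Spark produces sparse vectors; the `drop_last` flag mirrors the behaviour of OneHotEncoder's `dropLast` option, which by default omits the last category so that it is encoded as all zeros.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0.0/1.0 dummy variables."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    i = int(index)
    if i < size:  # with drop_last, the last category stays all zeros
        vec[i] = 1.0
    return vec

# Hypothetical column with three categories indexed 0.0, 1.0 and 2.0
first = one_hot(0.0, 3)
last = one_hot(2.0, 3)
last_full = one_hot(2.0, 3, drop_last=False)

print(first)
print(last)
print(last_full)
```

With this encoding each category gets its own weight in the regression, instead of all categories sharing one weight applied to an arbitrary index.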

In [None]:
from pyspark.ml.feature import OneHotEncoder

<YOUR CODE HERE>