<h1>Initialize PySpark</h1>

In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Assignment 8 ML Basics").getOrCreate()

<h1>Modify necessary imports</h1>


In [2]:
from pyspark.sql.types import *
from pyspark.sql.functions import col
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

<h1>Read the data and split into training and testing sets</h1>

In [3]:
df = spark.read.format('csv').options(header='false').load('cal_housing.data')
train,test = df.randomSplit((0.8,0.2),seed=1234)

<h1>Write the prepareData function</h1>
<li>Same as what we did in class, so it's done for you</li>

In [4]:
from pyspark.sql.functions import udf
def prepareData(df):
    df = df.toDF("Longitude","Latitude","MedianAge","TotalRooms","TotalBedrooms","Population","Households",
            "MedianIncome","MedianHomeValue")
    df = df.select(*(col(c).cast("float").alias(c) for c in df.schema.names))
    df = df.withColumn("MedianHomeValue",col("MedianHomeValue")/100000)
    df = df.withColumn("RoomsPerHouse", col("TotalRooms")/col("Households"))\
        .withColumn("PeoplePerHouse", col("Population")/col("Households"))\
        .withColumn("BedroomsPerHouse", col("TotalBedrooms")/col("Households"))
    df_analysis = df.select("MedianHomeValue", 
              "MedianAge", 
              "Population", 
              "Households", 
              "MedianIncome", 
              "RoomsPerHouse", 
              "PeoplePerHouse", 
              "BedroomsPerHouse")
    return df_analysis




<h1>Transform the dependent variable into two buckets</h1>
<li>Use QuantileDiscretizer with two buckets</li>
<li><a href="https://spark.apache.org/docs/latest/ml-features#quantilediscretizer">https://spark.apache.org/docs/latest/ml-features#quantilediscretizer</a></li>


In [5]:
discretizer = QuantileDiscretizer(numBuckets=2, inputCol="MedianHomeValue", outputCol="label")
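
A minimal sketch of what two-bucket discretization amounts to, written in plain Python so it runs without Spark. It assumes the split point is the exact median and that values equal to the split land in the upper bucket; QuantileDiscretizer computes approximate quantiles, so the split it finds on the real data may differ slightly. The function name and the sample values are hypothetical.

```python
from statistics import median

def two_bucket_labels(values):
    """Label each value 0.0 (below the median) or 1.0 (at or above it)."""
    split = median(values)
    return [1.0 if v >= split else 0.0 for v in values]

# Hypothetical home values in units of $100,000 (not taken from the dataset)
home_values = [0.9, 1.4, 1.8, 2.6, 3.1, 4.5]
labels = two_bucket_labels(home_values)
print(labels)
```

With two buckets the label column is roughly balanced, which is why a binary classifier fits the problem.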

<h1>Transform independent variable features into a vector</h1>
<li>Same as class, so done for you</li>

In [6]:
assembler = VectorAssembler()\
    .setInputCols(("MedianAge","Population", "Households", 
                        "MedianIncome", "RoomsPerHouse", "PeoplePerHouse", "BedroomsPerHouse"))\
    .setOutputCol("features")



<h1>Build two models</h1>
<li>A Logistic Regression Model</li>
<li>A Random Forest Classifier</li>
<li><a href="https://spark.apache.org/docs/latest/ml-classification-regression.html">https://spark.apache.org/docs/latest/ml-classification-regression.html</a></li>

In [10]:
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8,labelCol="label", featuresCol="features")

In [11]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)

<h1>Create pipelines</h1>
<li>Create p1 with two stages: the bucketizer that uses QuantileDiscretizer and the assembler</li>
<li>Create p2 with two stages: p1 and logistic regression</li>
<li>Create p3 with two stages: p1 and random forest classifier</li>

In [12]:
p1 = Pipeline().\
    setStages((discretizer,assembler))

In [13]:
p2 = Pipeline().\
    setStages((p1,lr))

In [34]:
p3 = Pipeline().\
    setStages((p1,rf))

<h1>Run the two pipelines with training data</h1>

In [15]:
lr_model = p2.fit(prepareData(train))

In [35]:
rf_model = p3.fit(prepareData(train))

<h1>Get predictions for both models</h1>
<li>On training as well as testing data sets</li>

In [17]:
lr_training_predictions = lr_model.transform(prepareData(train))
lr_testing_predictions = lr_model.transform(prepareData(test))

In [25]:
lr_training_predictions.printSchema()

root
 |-- MedianHomeValue: double (nullable = true)
 |-- MedianAge: float (nullable = true)
 |-- Population: float (nullable = true)
 |-- Households: float (nullable = true)
 |-- MedianIncome: float (nullable = true)
 |-- RoomsPerHouse: double (nullable = true)
 |-- PeoplePerHouse: double (nullable = true)
 |-- BedroomsPerHouse: double (nullable = true)
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [26]:
lr_training_predictions.select("label","features","rawPrediction","probability","prediction").show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|[19.0,1431.0,608....|[0.09115704430670...|[0.52277349336246...|       0.0|
|  0.0|[17.0,333.0,117.0...|[0.07747723998945...|[0.51935962676014...|       0.0|
|  0.0|[27.0,117.0,34.0,...|[0.06174127728250...|[0.51543041792146...|       0.0|
|  0.0|[20.0,624.0,262.0...|[0.06750215798802...|[0.51686913457431...|       0.0|
|  0.0|[14.0,515.0,226.0...|[0.02140424061302...|[0.50535085586738...|       0.0|
|  0.0|[25.0,1841.0,633....|[0.04014254935353...|[0.51003428991637...|       0.0|
|  0.0|[29.0,671.0,239.0...|[0.01586899387397...|[0.50396716521623...|       0.0|
|  0.0|[34.0,3134.0,1056...|[0.05828766765427...|[0.51456779269881...|       0.0|
|  0.0|[41.0,375.0,158.0...|[0.07538833192058...|[0.51883816175685...|       0.0|
|  0.0|[21.0,118
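
The rawPrediction and probability columns above appear to be related by the logistic (sigmoid) function. The sketch below checks this for the first displayed row; the input is the truncated value shown in the table, so the comparison is only approximate. The function name is an assumption for illustration.

```python
import math

def sigmoid(margin):
    """Map a raw margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

# First component of rawPrediction in the first displayed row (truncated in the output)
raw_class0 = 0.09115704430670
p_class0 = sigmoid(raw_class0)
print(p_class0)
```

The result agrees with the first component of the probability column for that row to the digits shown, which suggests the model is only weakly confident in these predictions.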

In [23]:
lr_testing_predictions.printSchema()

root
 |-- MedianHomeValue: double (nullable = true)
 |-- MedianAge: float (nullable = true)
 |-- Population: float (nullable = true)
 |-- Households: float (nullable = true)
 |-- MedianIncome: float (nullable = true)
 |-- RoomsPerHouse: double (nullable = true)
 |-- PeoplePerHouse: double (nullable = true)
 |-- BedroomsPerHouse: double (nullable = true)
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [24]:
lr_testing_predictions.select("label","features","rawPrediction","probability","prediction").show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|[15.0,1015.0,472....|[0.08320172252195...|[0.52078843963663...|       0.0|
|  0.0|[19.0,1129.0,463....|[0.07132332842781...|[0.51782327713772...|       0.0|
|  0.0|[17.0,83.0,45.0,1...|[0.07876916255589...|[0.51968211508478...|       0.0|
|  0.0|[16.0,2434.0,824....|[0.04003701634317...|[0.51000791726168...|       0.0|
|  0.0|[20.0,1135.0,303....|[0.07789210974938...|[0.51946318787569...|       0.0|
|  0.0|[24.0,227.0,139.0...|[0.10493510633797...|[0.52620973056278...|       0.0|
|  0.0|[18.0,3424.0,283....|[0.07841979649255...|[0.51959490830958...|       0.0|
|  0.0|[32.0,623.0,169.0...|[0.07349229946248...|[0.51836480973411...|       0.0|
|  0.0|[19.0,649.0,173.0...|[0.01680063092517...|[0.50420005893895...|       0.0|
|  0.0|[15.0,209

In [36]:
rf_training_predictions = rf_model.transform(prepareData(train))
rf_testing_predictions = rf_model.transform(prepareData(test))

In [37]:
rf_training_predictions.printSchema()

root
 |-- MedianHomeValue: double (nullable = true)
 |-- MedianAge: float (nullable = true)
 |-- Population: float (nullable = true)
 |-- Households: float (nullable = true)
 |-- MedianIncome: float (nullable = true)
 |-- RoomsPerHouse: double (nullable = true)
 |-- PeoplePerHouse: double (nullable = true)
 |-- BedroomsPerHouse: double (nullable = true)
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [38]:
rf_training_predictions.select("label","features","rawPrediction","probability","prediction").show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|[19.0,1431.0,608....|[8.08651947434545...|[0.80865194743454...|       0.0|
|  0.0|[17.0,333.0,117.0...|[8.45449075178655...|[0.84544907517865...|       0.0|
|  0.0|[27.0,117.0,34.0,...|[8.46597719490942...|[0.84659771949094...|       0.0|
|  0.0|[20.0,624.0,262.0...|[8.38972095370328...|[0.83897209537032...|       0.0|
|  0.0|[14.0,515.0,226.0...|[6.56939150065826...|[0.65693915006582...|       0.0|
|  0.0|[25.0,1841.0,633....|[6.70123655003746...|[0.67012365500374...|       0.0|
|  0.0|[29.0,671.0,239.0...|[6.52567347987796...|[0.65256734798779...|       0.0|
|  0.0|[34.0,3134.0,1056...|[8.01534226754972...|[0.80153422675497...|       0.0|
|  0.0|[41.0,375.0,158.0...|[8.35896499978746...|[0.83589649997874...|       0.0|
|  0.0|[21.0,118
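
For the random forest, the displayed rows are consistent with probability being rawPrediction divided by the number of trees (for example 8.0865... / 10 = 0.80865...). A minimal sketch of that normalization follows, assuming each tree contributes a class distribution that sums to 1 so the raw vector sums to numTrees. The vote totals used here are hypothetical, not taken from the output.

```python
def normalize(raw_prediction):
    """Scale a vector of per-class totals so its entries sum to 1."""
    total = sum(raw_prediction)
    return [v / total for v in raw_prediction]

num_trees = 10
# Hypothetical per-class totals for one row; they sum to num_trees
raw = [8.0, 2.0]
probability = normalize(raw)
print(probability)
```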

In [39]:
rf_testing_predictions.printSchema()

root
 |-- MedianHomeValue: double (nullable = true)
 |-- MedianAge: float (nullable = true)
 |-- Population: float (nullable = true)
 |-- Households: float (nullable = true)
 |-- MedianIncome: float (nullable = true)
 |-- RoomsPerHouse: double (nullable = true)
 |-- PeoplePerHouse: double (nullable = true)
 |-- BedroomsPerHouse: double (nullable = true)
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [40]:
rf_testing_predictions.select("label","features","rawPrediction","probability","prediction").show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|[15.0,1015.0,472....|[8.61532576260426...|[0.86153257626042...|       0.0|
|  0.0|[19.0,1129.0,463....|[8.61532576260426...|[0.86153257626042...|       0.0|
|  0.0|[17.0,83.0,45.0,1...|[8.57743437358305...|[0.85774343735830...|       0.0|
|  0.0|[16.0,2434.0,824....|[6.75082007322346...|[0.67508200732234...|       0.0|
|  0.0|[20.0,1135.0,303....|[8.62814873986489...|[0.86281487398648...|       0.0|
|  0.0|[24.0,227.0,139.0...|[8.14929542178839...|[0.81492954217883...|       0.0|
|  0.0|[18.0,3424.0,283....|[8.69080992696938...|[0.86908099269693...|       0.0|
|  0.0|[32.0,623.0,169.0...|[8.60323615245231...|[0.86032361524523...|       0.0|
|  0.0|[19.0,649.0,173.0...|[7.74383107977677...|[0.77438310797767...|       0.0|
|  0.0|[15.0,209

<h1>Print testing stats for each model</h1>
<li>Accuracy</li>
<li>Area under ROC</li>
<li>Area under Precision-Recall curve</li>

#### Logistic Regression stats

In [64]:
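# Note: BinaryClassificationEvaluator defaults to metricName="areaUnderROC", so the
# value below is the ROC area of the hard 0/1 predictions, not classification accuracy
lr_evaluator = BinaryClassificationEvaluator()\
  .setLabelCol("label")\
  .setRawPredictionCol("prediction")
lr_test_accuracy = lr_evaluator.evaluate(lr_testing_predictions)
lr_test_accuracy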

0.7412937432595157
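
Accuracy and area under ROC are different quantities, and an ROC area computed from hard 0/1 predictions reduces to the average of the true positive rate and the true negative rate. The sketch below illustrates the difference in plain Python. The function names and the (label, prediction) pairs are hypothetical and are not drawn from the notebook's data.

```python
def accuracy(pairs):
    """Fraction of (label, prediction) pairs that agree."""
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)

def hard_prediction_auc(pairs):
    """ROC area when the score is a hard 0/1 prediction: (TPR + TNR) / 2."""
    tp = sum(1 for label, pred in pairs if label == 1.0 and pred == 1.0)
    fn = sum(1 for label, pred in pairs if label == 1.0 and pred == 0.0)
    tn = sum(1 for label, pred in pairs if label == 0.0 and pred == 0.0)
    fp = sum(1 for label, pred in pairs if label == 0.0 and pred == 1.0)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2

# Hypothetical (label, prediction) pairs: 3 positives, 5 negatives
pairs = [(1.0, 1.0), (1.0, 1.0), (1.0, 0.0),
         (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
acc = accuracy(pairs)
auc_hard = hard_prediction_auc(pairs)
print(acc, auc_hard)
```

The two values differ whenever the classes are unbalanced or the error rates differ between classes, so each should be reported under its own name.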

#### Random Forest stats

In [79]:
rf_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
rf_test_accuracy = rf_evaluator.evaluate(rf_testing_predictions)
rf_test_accuracy

0.7962028358567652

<h1>Print training stats for each model</h1>
<li>Accuracy</li>
<li>Area under ROC</li>
<li>Area under Precision-Recall curve</li>

#### Logistic Regression stats

In [69]:
lr_train_accuracy = lr_evaluator.evaluate(lr_training_predictions)
lr_train_accuracy

0.7447346580520845

In [80]:
lr_train_sum = lr_model.stages[1].summary

In [81]:
print("areaUnderROC: " + str(lr_train_sum.areaUnderROC))

areaUnderROC: 0.8305139145196938


#### Random Forest stats

In [72]:
rf_train_accuracy = rf_evaluator.evaluate(rf_training_predictions)
rf_train_accuracy

0.8052066266156926

In [84]:
rf_train_sum = rf_model.stages[1]

<h1>Coefficients in the logistic regression</h1>
<li>Notice how this is different from the linear regression coefficients</li>
<li>Still dominated by income but some other flavors have crept in</li>
<li>Print this out nicely formatted. Each line should contain the feature and its coefficient</li>

In [33]:
# Print the coefficients and intercept for logistic regression
lrModel = lr_model.stages[1]
print("Coefficients: ",lrModel.coefficients)
print("Intercept: ",lrModel.intercept)

Coefficients:  (7,[3],[0.036392134132822884])
Intercept:  -0.13755701445839436
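
One possible way to produce the formatted listing is sketched below in plain Python. It expands the sparse vector printed above (size 7, one non-zero entry at index 3) and pairs each entry with the feature names in the order given to the assembler. The variable names are assumptions; in the notebook the dense values could instead come from lrModel.coefficients.toArray().

```python
# Feature names in the same order as the VectorAssembler input columns
feature_names = ["MedianAge", "Population", "Households", "MedianIncome",
                 "RoomsPerHouse", "PeoplePerHouse", "BedroomsPerHouse"]

# Sparse vector as printed above: (size, indices, values)
size, indices, values = 7, [3], [0.036392134132822884]

# Expand to a dense list, filling unlisted positions with zero
dense = [0.0] * size
for i, v in zip(indices, values):
    dense[i] = v

# One line per feature: left-aligned name, right-aligned coefficient
lines = [f"{name:<18}{coef:>12.6f}" for name, coef in zip(feature_names, dense)]
print("\n".join(lines))
```

The only non-zero coefficient sits at index 3, which corresponds to MedianIncome.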


<h1>Draw Training ROC and Precision-Recall Curves (LR Model only)</h1>
<li>include the area under roc and area under pr curves in your plots</li>

In [None]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

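# Pair each training row's positive-class probability (score) with its label;
# BinaryClassificationMetrics expects an RDD of (score, label) tuples
predictionAndLabels = lr_training_predictions.select("probability", "label")\
    .rdd.map(lambda row: (float(row["probability"][1]), float(row["label"])))

# Instantiate metrics object
metrics = BinaryClassificationMetrics(predictionAndLabels)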

# Area under precision-recall curve
print("Area under PR = %s" % metrics.areaUnderPR)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)

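<li><a href="https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification">https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification</a></li>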

<h1>Draw testing ROC</h1>
<li>For this, the easiest is to use sklearn's roc and pr functionalities</li>
<li>The predictions contain two useful columns, the probability and the dependent value bucket</li>
<li>Extract these into a data frame</li>
<li>Note that the probability is expressed as a pair (probability of 0 and probability of 1)</li>
<li>Extract the probability of 1 from this (convert the df into an rdd, it will be easier!)</li>
<li>Also convert all values to float from Spark DoubleType (sklearn won't understand DoubleType)</li>
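
The steps listed above can be sketched without Spark or sklearn as follows. The rows stand in for what collecting the "probability" and "label" columns might return; they are hypothetical values, and the helper names are assumptions. In the notebook itself, sklearn's roc_curve and auc would normally replace the two hand-written functions.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) points, sweeping the threshold from high to low."""
    pos = sum(1 for y in labels if y == 1.0)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every row tied at this threshold before emitting a point
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1.0:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical (probability pair, label) rows; probability is (P(0), P(1))
rows = [([0.2, 0.8], 1.0), ([0.4, 0.6], 1.0), ([0.55, 0.45], 0.0),
        ([0.3, 0.7], 0.0), ([0.9, 0.1], 0.0), ([0.65, 0.35], 1.0)]

# Keep only the probability of class 1 and convert everything to plain floats
scores = [float(prob[1]) for prob, _ in rows]
labels = [float(label) for _, label in rows]

points = roc_points(scores, labels)
area = auc(points)
print(points)
print(area)
```

The list of points can be passed directly to a plotting library, with the area shown in the plot title or legend.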