### University of Virginia
### DS 7200: Distributed Computing
### Lab: Supervised Learning
### Last Updated: August 20, 2023

---

#### Instructions

This project has two parts:
- Part I: Classification - build and apply a logistic regression model on the Wisconsin Breast Cancer dataset.
- Part II: Regression - build and apply a linear regression model on the California Housing dataset.

**Total Possible Points: 10**

---

#### Submission

Zack Gottesman
qdw5jf

Credit to Prof. Tashman for code inspired by class notebooks.

---

#### Part I: Classification (5 POINTS)

Here are the specifications and grading breakdown:

- the target variable is `diagnosis`
- use `f1`, `f2` as predictors (1 PT)
- split data into 60% training set, 40% test set 
- standardize the predictors (1 PT)
- use seed=314 whenever a seed is needed
- fit a Logistic Regression model with an intercept (1 PT)
- compute and show the area under the ROC curve for the test set (2 PTS)

In [3]:
import os
import pandas as pd

from pyspark.sql import SparkSession

In [5]:
DATA_FILEPATH = 'wisc_breast_cancer_w_fields.csv'

spark = SparkSession \
    .builder \
    .appName("Wisc BRCA") \
    .getOrCreate()

/opt/conda/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/09/26 22:30:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


#### Enter code and solution

## Prepare data

We'll apply row-wise cleaning (type casts, label indexing) before splitting into train and test sets, so that both sets receive identical modifications. Anything that learns statistics from the data, such as scaling, has to wait until after the split, because we don't want to leak information about the test data during training. We'll build a pipeline to handle those steps.

In [75]:
df = spark.read.csv(DATA_FILEPATH, header=True)

In [78]:
kept_data = df.select(["f1", "f2", "diagnosis"])
kept_data.show(3)

+-----+-----+---------+
|   f1|   f2|diagnosis|
+-----+-----+---------+
|17.99|10.38|        M|
|20.57|17.77|        M|
|19.69|21.25|        M|
+-----+-----+---------+
only showing top 3 rows



In [79]:
kept_data.printSchema()

root
 |-- f1: string (nullable = true)
 |-- f2: string (nullable = true)
 |-- diagnosis: string (nullable = true)



In [80]:
kept_data.select("diagnosis").distinct().show()

+---------+
|diagnosis|
+---------+
|        B|
|        M|
+---------+



### Clean up data and convert to correct types

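First, we need to cast the predictor columns from strings to a numeric type. (The new columns carry an `_int` suffix, but they are cast to floats.)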

In [81]:
import pyspark.sql.types as typ

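# cast the string predictors to floats; each column is referenced from the DataFrame being extended
converted_data = kept_data.withColumn("f1_int", kept_data["f1"].cast(typ.FloatType()))
converted_data = converted_data.withColumn("f2_int", converted_data["f2"].cast(typ.FloatType()))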
converted_data.show(3)

+-----+-----+---------+------+------+
|   f1|   f2|diagnosis|f1_int|f2_int|
+-----+-----+---------+------+------+
|17.99|10.38|        M| 17.99| 10.38|
|20.57|17.77|        M| 20.57| 17.77|
|19.69|21.25|        M| 19.69| 21.25|
+-----+-----+---------+------+------+
only showing top 3 rows



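Now we need to index the label column. This is label indexing rather than one-hot encoding: `StringIndexer` maps each distinct string to a numeric index, by default in order of descending frequency, so here `B` maps to 0.0 and `M` maps to 1.0.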

In [82]:
from pyspark.ml.feature import StringIndexer

idxer = StringIndexer(inputCol="diagnosis", outputCol="diagnosis_idx")
converted_data = idxer.fit(converted_data).transform(converted_data)
converted_data.show(3)

+-----+-----+---------+------+------+-------------+
|   f1|   f2|diagnosis|f1_int|f2_int|diagnosis_idx|
+-----+-----+---------+------+------+-------------+
|17.99|10.38|        M| 17.99| 10.38|          1.0|
|20.57|17.77|        M| 20.57| 17.77|          1.0|
|19.69|21.25|        M| 19.69| 21.25|          1.0|
+-----+-----+---------+------+------+-------------+
only showing top 3 rows
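To make the label mapping explicit, here is a plain-Python sketch of the frequency-descending rule that `StringIndexer` applies by default. The helper name `frequency_desc_index` and the toy labels are made up for illustration; this is not Spark's implementation.

```python
from collections import Counter

def frequency_desc_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

# Made-up labels for illustration: "B" is the majority class.
toy_labels = ["B", "M", "B", "B", "M"]
mapping = frequency_desc_index(toy_labels)
print(mapping)
```

With the majority class indexed as 0.0, the positive class for the evaluator later on is `M` (1.0).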



In [83]:
clean_data = converted_data.select("f1_int", "f2_int", "diagnosis_idx")
clean_data.show(5)

+------+------+-------------+
|f1_int|f2_int|diagnosis_idx|
+------+------+-------------+
| 17.99| 10.38|          1.0|
| 20.57| 17.77|          1.0|
| 19.69| 21.25|          1.0|
| 11.42| 20.38|          1.0|
| 20.29| 14.34|          1.0|
+------+------+-------------+
only showing top 5 rows



### Split to train/test

In [84]:
train_df, test_df = clean_data.randomSplit([0.6, 0.4], seed=314)

In [85]:
train_df.show(3)

+------+------+-------------+
|f1_int|f2_int|diagnosis_idx|
+------+------+-------------+
| 6.981| 13.43|          0.0|
| 7.691| 25.44|          0.0|
| 7.729| 25.49|          0.0|
+------+------+-------------+
only showing top 3 rows



In [86]:
test_df.show(3)

+------+------+-------------+
|f1_int|f2_int|diagnosis_idx|
+------+------+-------------+
|  7.76| 24.54|          0.0|
| 8.219|  20.7|          0.0|
| 8.598| 20.98|          0.0|
+------+------+-------------+
only showing top 3 rows
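One thing worth keeping in mind: as far as I understand it, `randomSplit` assigns each row to a split by an independent random draw, so the 60/40 weights are targets rather than exact proportions. The sketch below mimics that idea in plain Python; `bernoulli_split` and the 1000 placeholder rows are assumptions for illustration, and it will not reproduce Spark's actual row assignment.

```python
import random

def bernoulli_split(rows, train_weight, seed):
    """Send each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # One independent draw per row, so the final sizes are only approximately on target.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))  # hypothetical row ids
train_rows, test_rows = bernoulli_split(rows, train_weight=0.6, seed=314)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which is why the lab asks for seed=314.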



### Prepare Data Transformations
We use `VectorAssembler` to prepare the predictors and `StandardScaler` to scale them.

In [87]:
from pyspark.ml.feature import VectorAssembler, StandardScaler

vec_ass = VectorAssembler(inputCols=["f1_int", "f2_int"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
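A note on what the scaler does with its default settings: `StandardScaler` defaults to `withStd=True` and `withMean=False`, so it divides each feature by its sample standard deviation but does not subtract the mean. A minimal plain-Python sketch of that arithmetic, using the five `f1` values from the preview above as illustrative input (the helper names are hypothetical):

```python
import statistics

def fit_unit_std(column):
    """Learn the sample standard deviation (n - 1 denominator) of one column."""
    return statistics.stdev(column)

def scale(column, std):
    """Divide by the learned standard deviation; no centering is applied."""
    return [x / std for x in column]

f1_values = [17.99, 20.57, 19.69, 11.42, 20.29]  # illustrative values from the preview above
std = fit_unit_std(f1_values)
scaled = scale(f1_values, std)

print("std of scaled column:", statistics.stdev(scaled))
print("mean of scaled column:", statistics.mean(scaled))
```

The scaled column has unit standard deviation but a nonzero mean. If centering is also wanted, `StandardScaler` accepts `withMean=True`.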

### Create Model

In [88]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol='diagnosis_idx',
                        featuresCol='scaled_features',
                        fitIntercept=True,
                       )

### Predict on test data

The test data must go through the same transformations as the training data before the model can score it, using the scaling statistics learned from the training set. Pipelines are made for that!

In [89]:
from pyspark.ml import Pipeline

pipe = Pipeline(stages=[vec_ass, scaler, lr])
pipe_model = pipe.fit(train_df)
preds = pipe_model.transform(test_df)
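Conceptually, `fit` learns every stage from the training data only, and `transform` then replays the fitted stages on whatever data it is given. The toy classes below are a hypothetical sketch of that pattern in plain Python; they are not Spark's `Pipeline` API.

```python
class ToyScaler:
    """Hypothetical stage: learns a maximum on fit, divides by it on transform."""
    def fit(self, values):
        self.max_ = max(values)
        return self

    def transform(self, values):
        return [v / self.max_ for v in values]

class ToyPipeline:
    """Fits stages in order on training data, then reuses them unchanged."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            # Each stage is fitted on the output of the previous stage.
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values

train_values = [2.0, 4.0, 8.0]
test_values = [4.0, 16.0]
toy_model = ToyPipeline([ToyScaler()]).fit(train_values)
test_out = toy_model.transform(test_values)
print(test_out)
```

The test value 16.0 is divided by the training maximum (8.0), not by its own maximum, which is the no-leakage behavior we want from the real pipeline.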

In [90]:
# look at some example predictions
preds.select("rawPrediction", "probability", "prediction", "diagnosis_idx") \
     .sample(fraction=0.05) \
     .show(truncate=False)

+----------------------------------------+------------------------------------------+----------+-------------+
|rawPrediction                           |probability                               |prediction|diagnosis_idx|
+----------------------------------------+------------------------------------------+----------+-------------+
|[5.884640921984568,-5.884640921984568]  |[0.9972258722478972,0.0027741277521028396]|0.0       |0.0          |
|[3.054710803703724,-3.054710803703724]  |[0.9549854701951765,0.045014529804823455] |0.0       |0.0          |
|[3.117204964756265,-3.117204964756265]  |[0.9575968805319068,0.042403119468093164] |0.0       |0.0          |
|[2.682368750679835,-2.682368750679835]  |[0.9359782129093271,0.06402178709067285]  |0.0       |0.0          |
|[2.945410642126358,-2.945410642126358]  |[0.9500461338150327,0.04995386618496733]  |0.0       |0.0          |
|[3.1261572168294727,-3.1261572168294727]|[0.957958902129706,0.042041097870293975]  |0.0       |0.0          |
[output truncated]
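The two columns above are related through the logistic (sigmoid) function: for a binary model, the class-0 probability is the sigmoid of the first `rawPrediction` entry, and the class-1 probability is its complement. A quick check in plain Python against the first row of the output above:

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

raw_class0 = 5.884640921984568  # first rawPrediction entry in the first row above
p0 = sigmoid(raw_class0)        # probability of class 0.0
p1 = 1.0 - p0                   # probability of class 1.0
print(p0, p1)
```

Both values agree with the `probability` column for that row up to floating-point rounding.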

In [91]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# area under ROC is default metric, but we specify anyway for clarity
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
                                          labelCol="diagnosis_idx",
                                          metricName="areaUnderROC"
                                         )

auc = evaluator.evaluate(preds)

print("Area under ROC:", auc)

Area under ROC: 0.9540261654689511


Nice!
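For intuition about the metric: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up scores. It is the pairwise definition only, not how `BinaryClassificationEvaluator` computes the value internally, and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up scores and labels for illustration.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_truth = [0, 0, 1, 1]
toy_auc = pairwise_auc(toy_scores, toy_truth)
print(toy_auc)
```

Read this way, an AUC of about 0.954 means the model ranks a malignant case above a benign one roughly 95% of the time on the test set.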

#### Part II: Regression (5 POINTS)

In this project, you will work with the California Home Price dataset to train a regression model and predict median home prices. Here are the specifications and grading breakdown:

- Scale the response variable median_house_value, dividing by 100000 (1 PT)

- Split data into train set (80%), test set (20%) using seed=314 (1 PT)

- Add new predictor: `rooms_per_household`

- In the training set, select all of these features and standardize them: (1 PT)

feats = ["total_bedrooms", 
         "population", 
         "households", 
         "median_income", 
         "rooms_per_household"]

- Fit a linear regression model on the training set with these parameters:

  - maxIter=10
  - regParam=0.3
  - elasticNetParam=0.8  


- Compute the MSE on the test set (2 PTS)

In [4]:
spark = SparkSession.builder.getOrCreate()


In [32]:
DATA_FILEPATH2 = 'cal_housing_data_preproc_w_header.txt'

In [33]:
from pyspark.sql.types import StructType, StructField, FloatType

# specify the schema explicitly: without a schema (or inferSchema=True),
# spark.read.csv loads every column as a string
house_schema = StructType([
    StructField("median_house_value", FloatType()),
    StructField("median_income", FloatType()),    
    StructField("housing_median_age", FloatType()),    
    StructField("total_rooms", FloatType()),    
    StructField("total_bedrooms", FloatType()),    
    StructField("population", FloatType()),
    StructField("households", FloatType()),        
    StructField("latitude", FloatType()),        
    StructField("longitude", FloatType())
])

In [34]:
df2 = spark.read.csv(DATA_FILEPATH2, header=True, schema=house_schema)

In [35]:
df2.show(3)

+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
|median_house_value|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
|          452600.0|       8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|
|          358500.0|       8.3014|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|   37.86|  -122.22|
|          352100.0|       7.2574|              52.0|     1467.0|         190.0|     496.0|     177.0|   37.85|  -122.24|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
only showing top 3 rows



In [36]:
df2.printSchema()

root
 |-- median_house_value: float (nullable = true)
 |-- median_income: float (nullable = true)
 |-- housing_median_age: float (nullable = true)
 |-- total_rooms: float (nullable = true)
 |-- total_bedrooms: float (nullable = true)
 |-- population: float (nullable = true)
 |-- households: float (nullable = true)
 |-- latitude: float (nullable = true)
 |-- longitude: float (nullable = true)



#### Enter code and solution

### Data preprocessing

In [37]:
SCALE_FACTOR = 100000
df2_scaled = df2.withColumn("median_house_val_scaled", df2["median_house_value"] / SCALE_FACTOR)
df2_scaled.select("median_house_val_scaled", "median_house_value").show(3)

+-----------------------+------------------+
|median_house_val_scaled|median_house_value|
+-----------------------+------------------+
|                  4.526|          452600.0|
|                  3.585|          358500.0|
|                  3.521|          352100.0|
+-----------------------+------------------+
only showing top 3 rows



In [38]:
df2_scaled_room_per_hh = df2_scaled.withColumn("rooms_per_hh", df2_scaled["total_rooms"] / df2_scaled["households"])
df2_scaled_room_per_hh.show(3)

+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+-----------------------+-----------------+
|median_house_value|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|median_house_val_scaled|     rooms_per_hh|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+-----------------------+-----------------+
|          452600.0|       8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|                  4.526|6.984126984126984|
|          358500.0|       8.3014|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|   37.86|  -122.22|                  3.585|6.238137082601054|
|          352100.0|       7.2574|              52.0|     1467.0|         190.0|     496.0|     177.0|   37.85|  -122.24|                  3.521|8.288135593220339|
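+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+-----------------------+-----------------+
only showing top 3 rows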

In [39]:
train_df2, test_df2 = df2_scaled_room_per_hh.randomSplit([0.8, 0.2], seed=314)

In [40]:
feats = ["total_bedrooms", "population", "households", "median_income", "rooms_per_hh"]

In [41]:
from pyspark.ml.feature import VectorAssembler, StandardScaler

vec_ass2 = VectorAssembler(inputCols=feats, outputCol="features")
scaler2 = StandardScaler(inputCol="features", outputCol="scaled_features")

### Model

In [42]:
from pyspark.ml.regression import LinearRegression

lin_reg = LinearRegression(labelCol='median_house_val_scaled',
                           featuresCol='scaled_features',
                           maxIter=10,
                           regParam=0.3,
                           elasticNetParam=0.8
                          )
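For reference, the Spark ML documentation describes the regularization term as `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2^2)`, so `elasticNetParam=0.8` puts most of the penalty on the L1 (lasso) part. A small sketch of that term on hypothetical weights (the function name and the weights are made up for illustration):

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a weighted mix of the L1 norm and half the squared L2 norm."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)

toy_weights = [1.0, -2.0]  # hypothetical coefficients
penalty = elastic_net_penalty(toy_weights, reg_param=0.3, elastic_net_param=0.8)
print(penalty)  # 0.3 * (0.8 * 3 + 0.1 * 5), approximately 0.87
```

With these settings, each unit of absolute coefficient size costs 0.24 from the L1 part alone, which can shrink coefficients noticeably when the response is on a scale of roughly 0 to 5.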

In [43]:
from pyspark.ml import Pipeline

pipe2 = Pipeline(stages=[vec_ass2, scaler2, lin_reg])
pipe_model2 = pipe2.fit(train_df2)
preds2 = pipe_model2.transform(test_df2)

In [50]:
# look at some example predictions
preds2.select("prediction", "median_house_val_scaled") \
     .sample(fraction=0.05) \
     .show()

+------------------+-----------------------+
|        prediction|median_house_val_scaled|
+------------------+-----------------------+
|1.2181898031384244|                  0.225|
|  1.34823728251112|                  0.427|
|1.5283115231160926|                  0.475|
| 1.521477100832544|                  0.516|
|1.4010033343161403|                  0.517|
|1.4824352091423285|                  0.524|
| 1.403050900171543|                  0.526|
| 1.498483611805543|                  0.538|
|1.4619872908242135|                  0.554|
|1.5981774347782942|                  0.587|
|1.5651121950566944|                  0.594|
|1.5996992885291248|                  0.603|
|1.6352271091456378|                  0.606|
| 1.632626189904221|                   0.62|
|1.4989816496438495|                  0.627|
|1.4690430742530616|                  0.666|
| 1.720837105496604|                   0.67|
|1.9039826948846805|                  0.673|
|1.2965780034477739|                  0.675|
| 1.631740

In [46]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluatorMSE = RegressionEvaluator(labelCol="median_house_val_scaled", predictionCol="prediction", metricName="mse")
mse = evaluatorMSE.evaluate(preds2)
print("MSE", mse)

MSE 0.7551749818714476
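Because the response was divided by 100000, the MSE is in squared units of $100,000. Taking the square root and undoing the scale factor gives an error in dollars that is easier to interpret. A short sketch, with a toy example of the MSE formula itself (the helper name and the toy numbers are illustrative):

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy check of the formula: squared errors are 0, 0 and 4.
toy_mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

reported_mse = 0.7551749818714476     # value printed above
rmse_scaled = math.sqrt(reported_mse)  # in units of $100,000
rmse_dollars = rmse_scaled * 100_000   # undo the SCALE_FACTOR
print(toy_mse, rmse_scaled, rmse_dollars)
```

That puts the typical prediction error at roughly $87,000 on the test set.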
