 <img src="uva_seal.png">   

## MLlib Intro and Classifiers

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: September 17, 2023

---  


### SOURCES 
Learning Spark: Machine Learning with MLlib

Logistic Regression using the DataFrame API  
https://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression




### OBJECTIVES
- Introduce classification examples using MLlib, including logistic regression

### CONCEPTS

- Supervised learning
- Binary and multiclass classification
- Logistic Regression
- Naive Bayes
- Tree Methods
---

**Game Plan**

This notebook will:

1. Briefly summarize some of the major classification models
2. Illustrate how to work with some of these models using MLlib

---  

### 1) Machine Learning in Spark 
MLlib is Spark's machine learning library. It has two interfaces:

1) A newer DataFrame-based API, which is being actively built out

2) An older RDD-based API, which is still maintained but is no longer growing  
  For supervised learning tasks*, the RDD API uses a `LabeledPoint` object to bundle labels with predictors.
  
  For unsupervised learning tasks, since there is no label, the `LabeledPoint` object is not used.  
  Examples of unsupervised learning tasks include clustering methods like k-means.

Some functionality is only available in the RDD-based API.  
We will discuss both APIs in this course. 


(\*) In *supervised learning* tasks, each observation has a label (ground truth) indicating the correct answer.  
Unsupervised learning tasks do NOT have this label. Most data in the wild is unlabeled.

---  

### 2) Introduction to Classification

Classification is a common form of supervised learning.  
In supervised learning, the training examples include labels.  
After training the model, the purpose of the task is to predict labels for new examples.  

A task is a *classification problem* when the target variable $Y$ is discrete.  
Binary classification is the most common case. Examples include predicting fraud (or not), loan default, survival, claim filing, and spam.

A continuous $Y$ variable results in a regression problem (next topic).

In the RDD API, classification and regression both use the `LabeledPoint` class.  
To remind ourselves, a `LabeledPoint` consists of a label and a feature vector.  

Follow this convention for labels:  
- For binary classification, use labels $0$ and $1$  
- For multiclass classification, use labels $0$, $1$, …, $C-1$ where $C$ is the number of classes  
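
When raw labels are strings, they need to be mapped to this convention before training. The sketch below does the mapping in plain Python for a hypothetical list of labels; the names `raw_labels` and `label_to_index` are illustrative and not part of MLlib. In Spark, `StringIndexer` from `pyspark.ml.feature` plays a similar role, although by default it orders labels by frequency rather than alphabetically.

```python
# Hypothetical raw labels for a three-class problem (illustrative data only)
raw_labels = ["spam", "ham", "promo", "ham", "spam"]

# Assign each distinct class an index 0, 1, ..., C-1 (alphabetical order here)
classes = sorted(set(raw_labels))
label_to_index = {name: float(i) for i, name in enumerate(classes)}

# Encode the labels as floats, the numeric type Spark uses for labels
encoded = [label_to_index[name] for name in raw_labels]

print(label_to_index)  # {'ham': 0.0, 'promo': 1.0, 'spam': 2.0}
print(encoded)         # [2.0, 0.0, 1.0, 0.0, 2.0]
```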



Spark supports several popular models for classification including:  
- Logistic regression  
- Naive Bayes  
- Tree methods (e.g., decision tree, random forest)  
- Support Vector Machines  



### 3) Logistic regression 

Logistic regression is one of the most widely used methods for binary classification.  
It is a generalized linear model that uses a linear decision boundary (a hyperplane) to separate positive and negative examples.  
Although the model is relatively simple, the results can be very competitive.  
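
As a minimal sketch of the idea, the plain-Python snippet below computes a linear score, passes it through the logistic function, and thresholds the result at 0.5. The intercept, weights, and inputs are hypothetical values chosen for illustration, not fitted to any data. Points whose score is exactly zero lie on the linear decision boundary and receive probability 0.5.

```python
import math

def logistic(z):
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and coefficients (illustrative, not fitted)
intercept = -4.0
weights = [1.5, 0.5]

def predict_proba(x):
    """Probability of the positive class for feature vector x."""
    score = intercept + sum(w * xi for w, xi in zip(weights, x))
    return logistic(score)

for x in [[1.0, 1.0], [2.0, 2.0], [3.0, 2.0]]:
    p = predict_proba(x)
    # Predict class 1 when the probability reaches 0.5 (score >= 0)
    print(x, round(p, 3), int(p >= 0.5))
```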

Below is an example of some data with a fitted logistic curve. The probability of passing ($Y = 1$) is modeled as a function of hours studying, $X$. Notice that the observed $Y$ values are either 0 or 1.



<img src="logreg_img2.png">

**Multiclass Problems**  
For a problem with $K$ classes, the RDD-based algorithm outputs a multinomial logistic regression model, which contains $K-1$ binary logistic regression models regressed against the first class. Given a new data point, the $K-1$ models are run, and the class with the largest probability is chosen as the predicted class.
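
The following plain-Python sketch shows one way to turn the scores of such models into class probabilities, treating the first class (index 0) as the reference. The function name `multinomial_probs` and the scores are hypothetical and for illustration only. With a single score, the formula reduces to ordinary binary logistic regression.

```python
import math

def multinomial_probs(scores):
    """Class probabilities from K-1 linear scores, with class 0 as the reference.

    scores[k-1] is the log-odds of class k relative to class 0.
    """
    exp_scores = [math.exp(s) for s in scores]
    denom = 1.0 + sum(exp_scores)  # the reference class contributes exp(0) = 1
    return [1.0 / denom] + [e / denom for e in exp_scores]

# Hypothetical scores from K-1 = 2 binary models (K = 3 classes)
probs = multinomial_probs([0.5, -1.0])

# The predicted class is the one with the largest probability
predicted_class = max(range(len(probs)), key=lambda k: probs[k])

print([round(p, 3) for p in probs])
print(predicted_class)
```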

In [1]:
# MODULES, CONTEXT, AND PATHING
from pyspark.sql import SparkSession
import os

spark = SparkSession.builder \
        .master("local") \
        .appName("mllib_classifier") \
        .getOrCreate()
sc = spark.sparkContext

/opt/conda/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/09/20 14:22:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/09/20 14:22:23 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


#### Logistic Regression with RDD API

We will load the data, train a model, and make predictions.

In [2]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint

data = sc.textFile('sample_svm_data.txt')

In [3]:
data.take(2)

                                                                                

['1 0 2.52078447201548 0 0 0 2.004684436494304 2.000347299268466 0 2.228387042742021 2.228387042742023 0 0 0 0 0 0',
 '0 2.857738033247042 0 0 2.619965104088255 0 2.004684436494304 2.000347299268466 0 2.228387042742021 2.228387042742023 0 0 0 0 0 0']

In [4]:
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

In [5]:
parsedData = data.map(parsePoint)

# Print a record to understand the data structure
print(parsedData.take(1))

[LabeledPoint(1.0, [0.0,2.52078447201548,0.0,0.0,0.0,2.004684436494304,2.000347299268466,0.0,2.228387042742021,2.228387042742023,0.0,0.0,0.0,0.0,0.0,0.0])]


In [6]:
# Build the model using the L-BFGS optimizer
model = LogisticRegressionWithLBFGS.train(parsedData)

                                                                                

24/09/20 14:22:58 WARN Instrumentation: [6792f81e] Initial coefficients will be ignored! Its dimensions (1, 16) did not match the expected size (1, 16)


In [7]:
# Evaluating the model on training data. For each record, create tuple of (label, prediction)
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
print(labelsAndPreds.take(3))

# Source: https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression

[(1.0, 1), (0.0, 1), (0.0, 0)]


#### Logistic Regression with DataFrame API

This is the more common approach.

**The concepts and ideas are used for other ML models as well**

In the DataFrame API, the `LabeledPoint` object is NOT used.  
Instead, the requirement is to package all predictor columns into a single column.  
The ML model takes the name of the predictor column and the name of the target column as inputs (`featuresCol` and `labelCol`).

There are two transformations we should discuss right away, as they are very helpful:

`VectorAssembler`  
This will package the DataFrame predictor columns into a single column.


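`StandardScaler`  
This scales the assembled feature vector, and it can be applied after the `VectorAssembler` step. By default it rescales each feature to unit standard deviation without centering it.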
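
To make the two steps concrete, here is a plain-Python sketch of what they compute, using a few hypothetical rows (the values are illustrative and are not taken from the housing file). It assumes the default `StandardScaler` settings (`withStd=True`, `withMean=False`) and uses the sample standard deviation; it is a conceptual illustration, not Spark code.

```python
import statistics

# Hypothetical rows with two predictor columns (illustrative values)
rows = [
    {"median_income": 8.0, "total_rooms": 900.0},
    {"median_income": 4.0, "total_rooms": 1500.0},
    {"median_income": 6.0, "total_rooms": 300.0},
]
input_cols = ["median_income", "total_rooms"]

# "Assemble": pack the chosen columns into one feature vector per row
features = [[row[c] for c in input_cols] for row in rows]

# "Scale": divide each feature by its sample standard deviation.
# Values are rescaled but not centered, mirroring withMean=False.
stds = [statistics.stdev(col) for col in zip(*features)]
scaled = [[x / s for x, s in zip(vec, stds)] for vec in features]

print(features[0])  # [8.0, 900.0]
print(stds)         # [2.0, 600.0]
print(scaled[0])    # [4.0, 1.5]
```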

**EXAMPLE**  
This small dataset has target variable *high_price*.

In [8]:
# load the data into DF

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("ml_classifier").getOrCreate()

# Load training data
filename = "sample_housing_data.csv"

# read data into dataframe
training = spark.read.csv(filename,  inferSchema=True, header = True)
training.show(2)

24/09/20 14:23:03 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
+----------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
|high_price|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|
+----------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
|         1|       8.5552|              40.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|
|         1|       8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|
+----------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
only showing top 2 rows



Use `VectorAssembler` to package some variables into a feature column:

In [9]:
from pyspark.ml.feature import VectorAssembler

# inputCols takes a list of column names
# outputCol is the name of the new column (arbitrary; conventionally "features")

assembler = VectorAssembler(inputCols=["median_income", "total_rooms"],
                            outputCol="features")

tr = assembler.transform(training)
tr.select("*").show(2, truncate=False)

+----------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+--------------+
|high_price|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|features      |
+----------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+--------------+
|1         |8.5552       |40.0              |880.0      |129.0         |322.0     |126.0     |37.88   |-122.23  |[8.5552,880.0]|
|1         |8.3252       |41.0              |880.0      |129.0         |322.0     |126.0     |37.88   |-122.23  |[8.3252,880.0]|
+----------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+--------------+
only showing top 2 rows



Use `StandardScaler` to scale the features:

In [10]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(tr)
scaledData = scalerModel.transform(tr)

scaledData.select("high_price","features","scaledFeatures").show(2, truncate=False)

+----------+--------------+--------------------------------------+
|high_price|features      |scaledFeatures                        |
+----------+--------------+--------------------------------------+
|1         |[8.5552,880.0]|[4.544281109921356,0.3640968833079785]|
|1         |[8.3252,880.0]|[4.422111592518852,0.3640968833079785]|
+----------+--------------+--------------------------------------+
only showing top 2 rows



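Set up and fit the model. The parameters are discussed later; briefly, `maxIter` caps the number of optimizer iterations, `regParam` sets the regularization strength, and `elasticNetParam` mixes the L1 and L2 penalties (0 is pure L2, 1 is pure L1).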

In [11]:
from pyspark.ml.classification import LogisticRegression

# instantiate the model
lr = LogisticRegression(labelCol='high_price',
                        featuresCol='scaledFeatures',
                        maxIter=10, 
                        regParam=0.3, 
                        elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(scaledData)

# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

Coefficients: [0.07027733010720466,0.0]
Intercept: -0.954705346212893


Measure the model fit

In [12]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# compute predictions. this appends the columns "rawPrediction", "probability", and "prediction" to the dataframe
lrPred = lrModel.transform(scaledData)
lrPred.select('probability','prediction').show(5,truncate=False)

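# set up evaluator
# NOTE: rawPredictionCol is set to the hard 0/1 "prediction" column here, so the
# metric is computed from thresholded labels; passing "rawPrediction" instead
# would evaluate the model's continuous scores, which is the more typical choice
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction",
                                          labelCol="high_price",
                                          metricName="areaUnderPR")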

# pass to evaluator the DF with predictions, labels
aupr = evaluator.evaluate(lrPred)

print("Area under PR Curve:", aupr)

+----------------------------------------+----------+
|probability                             |prediction|
+----------------------------------------+----------+
|[0.6537005259491084,0.34629947405089156]|0.0       |
|[0.6556415610201259,0.34435843897987406]|0.0       |
|[0.6558421210389079,0.34415787896109207]|0.0       |
|[0.664584376177192,0.335415623822808]   |0.0       |
|[0.6778813168214649,0.3221186831785351] |0.0       |
+----------------------------------------+----------+
only showing top 5 rows

Area under PR Curve: 0.3333333333333333
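
A precision-recall curve is traced by sweeping a decision threshold over the model's scores, and the area under that curve summarizes the trade-off. The plain-Python sketch below computes the precision and recall at each threshold for a handful of hypothetical labels and scores. The function name and data are illustrative only, and this is not how Spark computes the metric internally.

```python
def precision_recall_points(labels, scores):
    """Return (threshold, precision, recall) for each distinct score threshold."""
    points = []
    total_pos = sum(labels)
    for t in sorted(set(scores), reverse=True):
        # Labels of the observations predicted positive at this threshold
        predicted_pos = [y for y, s in zip(labels, scores) if s >= t]
        tp = sum(predicted_pos)
        precision = tp / len(predicted_pos)
        recall = tp / total_pos
        points.append((t, precision, recall))
    return points

# Hypothetical true labels and predicted probabilities of class 1
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

for t, p, r in precision_recall_points(labels, scores):
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

If the scores are replaced by hard 0/1 predictions, only two distinct thresholds remain, so the curve is built from far fewer points.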


### 4) Naive Bayes

Naive Bayes (NB) is a relatively simple model, yet the performance can be quite good.  This has led to its popularity.  

NB supports multiclass classification. It is commonly used in text classification where the input features are count variables.

At a high level, the count of a word on a page can adjust the probability that the page belongs to a given class.  For example, the presence of the word “tacos” will increase the probability that the page belongs to a **restaurant** relative to a **florist**.

The algorithm computes the conditional probability distribution of each feature given a label, and then it applies Bayes’ theorem to compute the conditional probability distribution of a label given an observation.

**Why "naive"?**  
The term “naive” comes from the simplifying assumption that every pair of features is independent given the class label. This assumption greatly simplifies the model, and the model often performs well even when the assumption holds only approximately.
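
The computation described above can be sketched in plain Python for the multinomial case. The tiny training set, the class names, and the helper names `fit` and `predict` are hypothetical and purely illustrative; this is a conceptual sketch with additive smoothing, not the MLlib implementation.

```python
import math
from collections import Counter

# Hypothetical training pages: (class label, word counts). Illustrative only.
training = [
    ("restaurant", Counter({"tacos": 3, "menu": 2})),
    ("restaurant", Counter({"menu": 1, "delivery": 1})),
    ("florist", Counter({"roses": 4, "delivery": 2})),
]
vocab = sorted({w for _, counts in training for w in counts})
smoothing = 1.0  # additive (Laplace) smoothing

def fit(training):
    """Estimate log P(class) and log P(word | class) with additive smoothing."""
    class_docs = Counter(label for label, _ in training)
    log_prior, log_lik = {}, {}
    for c in class_docs:
        log_prior[c] = math.log(class_docs[c] / len(training))
        totals = Counter()
        for label, counts in training:
            if label == c:
                totals.update(counts)
        denom = sum(totals.values()) + smoothing * len(vocab)
        log_lik[c] = {w: math.log((totals[w] + smoothing) / denom) for w in vocab}
    return log_prior, log_lik

def predict(counts, log_prior, log_lik):
    """Bayes' theorem in log space: prior plus count-weighted word likelihoods."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored
        word_terms = sum(n * log_lik[c][w] for w, n in counts.items() if w in log_lik[c])
        scores[c] = log_prior[c] + word_terms
    return max(scores, key=scores.get)

log_prior, log_lik = fit(training)
print(predict(Counter({"tacos": 2, "delivery": 1}), log_prior, log_lik))  # restaurant
```

Because the features are treated as independent given the class, the per-word log-likelihoods simply add up, which is what keeps the model cheap to fit and evaluate.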


**Naive Bayes Implementation**  

Several variants are supported, including:

- multinomial naive Bayes
- complement naive Bayes
- Bernoulli naive Bayes
- Gaussian naive Bayes

**Parameters**  
The model type is selected with the optional `modelType` parameter, which accepts “multinomial”, “complement”, “bernoulli” or “gaussian”, with “multinomial” as the default. 

For document classification, the input feature vectors should usually be sparse vectors.


**Naive Bayes Example**

We will load the data, train a model, and make predictions.

In [13]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load training data
data = spark.read.format("libsvm") \
    .load("./sample_libsvm_data.txt")

# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 314)
train = splits[0]
test = splits[1]

# set up the model with some parameter values
nb = NaiveBayes(labelCol='label',
                featuresCol='features',
                smoothing=1.0, 
                modelType="multinomial")

# train the model
model = nb.fit(train)

# make predictions
predictions = model.transform(test)
predictions.show()

# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

24/09/20 14:23:08 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
+-----+--------------------+--------------------+-----------+----------+
|label|            features|       rawPrediction|probability|prediction|
+-----+--------------------+--------------------+-----------+----------+
|  0.0|(692,[121,122,123...|[-225289.31301264...|  [1.0,0.0]|       0.0|
|  0.0|(692,[122,123,148...|[-179927.31719147...|  [1.0,0.0]|       0.0|
|  0.0|(692,[123,124,125...|[-202007.18900696...|  [1.0,0.0]|       0.0|
|  0.0|(692,[124,125,126...|[-277661.89434484...|  [1.0,0.0]|       0.0|
|  0.0|(692,[124,125,126...|[-247993.96658158...|  [1.0,0.0]|       0.0|
|  0.0|(692,[126,127,128...|[-205425.55655384...|  [1.0,0.0]|       0.0|
|  0.0|(692,[127,128,129...|[-210393.27785331...|  [1.0,0.0]|       0.0|
|  0.0|(692,[150,151,152...|

### 5) Tree Methods

Tree methods can be used for both classification and regression.  

The simplest method is a decision tree, which is intuitively appealing because it makes a series of binary decisions (e.g., Male/Female, Age greater than 30 or not).  

Key properties of tree methods:  
- Can handle missing values (in many implementations), categorical data, and continuous data  
- Need minimal preprocessing  
- Perform feature selection as part of the algorithm (the best feature is used, then the next best, …)  
- Do not require scaling  
- Handle non-linear interactions  
- Handle multiclass classification  
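
To illustrate how a tree picks a split and why scaling does not matter, the plain-Python sketch below searches for the binary split of one feature that minimizes the weighted Gini impurity. The data and the helper names `gini` and `best_split` are hypothetical, feature values are assumed to be distinct, and this is a simplified illustration rather than Spark's implementation.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (k, impurity): the k smallest values go left, the rest go right."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    sorted_labels = [labels[i] for i in order]
    n = len(values)
    best = None
    for k in range(1, n):
        left, right = sorted_labels[:k], sorted_labels[k:]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or impurity < best[1]:
            best = (k, impurity)
    return best

# Hypothetical feature values and binary labels (illustrative only)
age = [22, 25, 31, 38, 45, 52]
label = [0, 0, 0, 1, 1, 1]

print(best_split(age, label))  # (3, 0.0)
# Rescaling the feature leaves the chosen partition unchanged
print(best_split([a / 100.0 for a in age], label))  # (3, 0.0)
```

Only the ordering of the feature values affects the candidate partitions, so any monotonic rescaling yields the same split. Repeating the search across all features and keeping the best one is what makes feature selection part of the algorithm.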

Code examples can be found in the documentation. Random forest code, for example, can be found here:

https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier

---

**TRY FOR YOURSELF (UNGRADED EXERCISE)**

Copy all of the DataFrame API logistic regression code into the cell below, modify the input columns in `VectorAssembler`, and refit the model.  
Compute and print the area under the ROC curve.