# Introduction
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It includes:

* ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
* Featurization: feature extraction, transformation, dimensionality reduction, and selection
* Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
* Persistence: saving and loading algorithms, models, and Pipelines
* Utilities: linear algebra, statistics, data handling, etc.

“Spark ML” is not an official name, but it is occasionally used to refer to the MLlib DataFrame-based API. The RDD-based API is now in maintenance mode. For more information, read the [Machine Learning Library (MLlib) Guide](https://spark.apache.org/docs/latest/ml-guide.html).

In [1]:
!pip install -q pyspark


In [2]:
data_filename = "sample_data/california_housing_train.csv"
test_filename = "sample_data/california_housing_test.csv"

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Enable eager evaluation so DataFrames render as formatted tables in the notebook
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)
print(spark)

<pyspark.sql.session.SparkSession object at 0x7a189175ea40>


In [4]:
df = spark.read.csv(data_filename, header=True, sep=",", inferSchema=True)

In [5]:
df.show(5, truncate=False)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|-114.31  |34.19   |15.0              |5612.0     |1283.0        |1015.0    |472.0     |1.4936       |66900.0           |
|-114.47  |34.4    |19.0              |7650.0     |1901.0        |1129.0    |463.0     |1.82         |80100.0           |
|-114.56  |33.69   |17.0              |720.0      |174.0         |333.0     |117.0     |1.6509       |85700.0           |
|-114.57  |33.64   |14.0              |1501.0     |337.0         |515.0     |226.0     |3.1917       |73400.0           |
|-114.57  |33.57   |20.0              |1454.0     |326.0         |624.0     |262.0     |1.925        |65500.0           |
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+

In [6]:
df.columns

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value']

In [7]:
from pyspark.ml.feature import VectorAssembler

# Create a feature vector assembler
assembler = VectorAssembler(inputCols=["median_income", "total_rooms", "housing_median_age"], outputCol="features")

In [8]:
# Transform the DataFrame using the assembler
training = assembler.transform(df)

In [9]:
training.show(5, truncate=False)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|features            |
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+
|-114.31  |34.19   |15.0              |5612.0     |1283.0        |1015.0    |472.0     |1.4936       |66900.0           |[1.4936,5612.0,15.0]|
|-114.47  |34.4    |19.0              |7650.0     |1901.0        |1129.0    |463.0     |1.82         |80100.0           |[1.82,7650.0,19.0]  |
|-114.56  |33.69   |17.0              |720.0      |174.0         |333.0     |117.0     |1.6509       |85700.0           |[1.6509,720.0,17.0] |
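|-114.57  |33.64   |14.0              |1501.0     |337.0         |515.0     |226.0     |3.1917       |73400.0           |[3.1917,1501.0,14.0]|
|-114.57  |33.57   |20.0              |1454.0     |326.0         |624.0     |262.0     |1.925        |65500.0           |[1.925,1454.0,20.0] |
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+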
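
The `features` column above is simply the three input columns packed, in the order given by `inputCols`, into one vector per row. A minimal plain-Python sketch of that per-row step follows; `assemble` is a hypothetical helper written for illustration, not part of PySpark, and the real `VectorAssembler` produces a Spark ML vector rather than a Python list.

```python
# Hypothetical stand-in for what VectorAssembler does to a single row.
# Assumption: every input column is numeric, as in this dataset.
def assemble(row, input_cols):
    """Concatenate the named fields of `row`, in order, into one list of floats."""
    return [float(row[col]) for col in input_cols]

# First row of the training data, copied from the df.show() output.
row = {
    "longitude": -114.31,
    "latitude": 34.19,
    "housing_median_age": 15.0,
    "total_rooms": 5612.0,
    "total_bedrooms": 1283.0,
    "population": 1015.0,
    "households": 472.0,
    "median_income": 1.4936,
    "median_house_value": 66900.0,
}

input_cols = ["median_income", "total_rooms", "housing_median_age"]
features = assemble(row, input_cols)
print(features)  # [1.4936, 5612.0, 15.0]
```

The order of `input_cols` matters: it fixes which coefficient of the fitted model lines up with which column.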

In [10]:
from pyspark.ml.regression import GeneralizedLinearRegression

# Create an instance of GeneralizedLinearRegression
# Generalized linear models (GLMs) are specifications of linear models where the
# response variable Yi follows some distribution from the exponential family of
# distributions.
glr = GeneralizedLinearRegression(family="gaussian", link="identity",
                                  labelCol="median_house_value",
                                  featuresCol="features")
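
With `family="gaussian"` and `link="identity"`, the maximum-likelihood fit of a GLM coincides with ordinary least squares. As a rough illustration of what that fit means, here is a plain-Python sketch of the closed-form solution for a single feature. The function name `fit_simple_ols` and the toy data are made up for this example; Spark solves the multi-feature problem with its own solver.

```python
# Closed-form ordinary least squares for one feature: y ~ slope * x + intercept.
# Hypothetical helper for illustration only; not part of PySpark.
def fit_simple_ols(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the sample covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data (invented): points lying exactly on y = 2x + 1.
toy_x = [0.0, 1.0, 2.0, 3.0]
toy_y = [1.0, 3.0, 5.0, 7.0]
slope, intercept_toy = fit_simple_ols(toy_x, toy_y)
print(slope, intercept_toy)
```

Because the toy points lie on a straight line, the fit recovers that line; with noisy data it returns the line that minimizes the sum of squared residuals.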

In [11]:
# Fit the model to the data
model = glr.fit(training)

In [12]:
# Print the coefficients and intercept
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))

Coefficients: [42719.258930532815,3.769951994447242,1970.2176607228666]
Intercept: -24896.402073033052
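
Because the link is the identity, a prediction is just the intercept plus the dot product of the coefficients with the feature vector. The sketch below redoes that arithmetic by hand for the first training row, using the values printed above; `predict` is a hypothetical helper, not a PySpark function.

```python
# Values copied from the printed model output.
coefficients = [42719.258930532815, 3.769951994447242, 1970.2176607228666]
intercept = -24896.402073033052

def predict(features):
    """Linear predictor with identity link: intercept + coefficients . features."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# First training row: [median_income, total_rooms, housing_median_age].
first_row = [1.4936, 5612.0, 15.0]
prediction = predict(first_row)
print(prediction)  # roughly 89619.32
```

Read this way, each extra unit of `median_income` (which the dataset's values suggest is a scaled quantity) moves the predicted house value by about 42,719, holding the other two features fixed.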


In [13]:
summary = model.summary
print("Coefficient Standard Errors: " + str(summary.coefficientStandardErrors))
print("T Values: " + str(summary.tValues))
print("P Values: " + str(summary.pValues))
print("Dispersion: " + str(summary.dispersion))
print("Null Deviance: " + str(summary.nullDeviance))
print("Residual Degree Of Freedom Null: " + str(summary.residualDegreeOfFreedomNull))
print("Deviance: " + str(summary.deviance))
print("Residual Degree Of Freedom: " + str(summary.residualDegreeOfFreedom))
print("AIC: " + str(summary.aic))
print("Deviance Residuals: ")
summary.residuals().show()

Coefficient Standard Errors: [330.3365970486282, 0.30796694928352264, 52.66431486362351, 2377.780515757901]
T Values: [129.32039414404997, 12.24141747423852, 37.41086665277673, -10.470437413394944]
P Values: [0.0, 0.0, 0.0, 0.0]
Dispersion: 6480216865.236034
Null Deviance: 228674518990668.44
Residual Degree Of Freedom Null: 16999
Deviance: 110137765841551.64
Residual Degree Of Freedom: 16996
AIC: 432314.2464896329
Deviance Residuals: 
+-------------------+
|  devianceResiduals|
+-------------------+
| -22719.31856929169|
| -39026.91749179254|
|  3863.111836325683|
| -71292.40184933401|
|-36724.034782626244|
|-106313.49151614401|
| -67269.20219814702|
| -83421.03306712484|
| -94796.38829543884|
|-116866.58092091762|
| -48705.26906590874|
|-31391.629129224093|
| -118090.7671055483|
|-112336.54613648751|
|-1790.2371176319866|
| -21717.58830593113|
| -68371.09038059862|
|-28258.723519262494|
| -21302.42661294744|
|  -35528.4776051994|
+-------------------+
only showing top 20 rows
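
Several of the summary numbers above can be cross-checked against each other with plain arithmetic. The sketch below assumes the last standard error belongs to the intercept (which is how Spark documents the ordering when an intercept is fitted) and that the AIC follows the usual Gaussian form with the variance estimated as deviance divided by the row count. Treat it as a consistency check on this output, not as a description of Spark's internals.

```python
import math

# Values copied from the model and summary output above.
coefficients = [42719.258930532815, 3.769951994447242, 1970.2176607228666]
intercept = -24896.402073033052
std_errors = [330.3365970486282, 0.30796694928352264,
              52.66431486362351, 2377.780515757901]  # last entry: intercept
deviance = 110137765841551.64
null_deviance = 228674518990668.44
residual_dof = 16996
null_dof = 16999

# t value = estimate / standard error, intercept last.
estimates = coefficients + [intercept]
t_values = [est / se for est, se in zip(estimates, std_errors)]

# The null model fits only an intercept, so it uses one degree of freedom.
n_rows = null_dof + 1
# Number of fitted parameters: three coefficients plus the intercept.
rank = n_rows - residual_dof

# For the Gaussian family the dispersion is the residual variance estimate.
dispersion = deviance / residual_dof

# Share of the null deviance removed by the model (R^2 for a Gaussian model).
explained = 1.0 - deviance / null_deviance

# Assumed Gaussian AIC: -2 * log-likelihood + 2 * (rank + 1).
aic = n_rows * (math.log(2.0 * math.pi * deviance / n_rows) + 1.0) + 2.0 + 2.0 * rank

print(t_values)
print(n_rows, rank, dispersion)
print(explained, aic)
```

The explained share comes out a little above one half, so these three features account for roughly half of the variance in `median_house_value` on the training data.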



In [14]:
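# Note: this reuses the training DataFrame (df), so these are in-sample
# predictions; test_filename is defined above but is not loaded here.
testing = df.select("median_income", "total_rooms", "housing_median_age")
new_df = assembler.transform(testing)
predictions = model.transform(new_df)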

In [15]:
# Show the predictions
predictions.select("features", "prediction").show()
new_df.show(5)

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[1.4936,5612.0,15.0]| 89619.31856929169|
|  [1.82,7650.0,19.0]|119126.91749179254|
| [1.6509,720.0,17.0]| 81836.88816367432|
|[3.1917,1501.0,14.0]|  144692.401849334|
| [1.925,1454.0,20.0]|102224.03478262624|
|[3.3438,1387.0,29.0]|180313.49151614401|
|[2.6768,2907.0,25.0]|149669.20219814702|
| [1.7083,812.0,41.0]|131921.03306712484|
|[2.1782,4789.0,34.0]|153196.38829543884|
|[2.1908,1497.0,46.0]|164966.58092091762|
|[2.6797,3741.0,16.0]|135205.26906590874|
| [1.625,1988.0,21.0]|  93391.6291292241|
|[2.1571,1291.0,48.0]| 166690.7671055483|
| [3.212,2478.0,31.0]| 182736.5461364875|
|[0.8585,1448.0,15.0]| 46790.23711763199|
|[1.6991,2556.0,17.0]| 90817.58830593113|
|[2.9653,1678.0,28.0]|163271.09038059862|
|  [0.8571,44.0,21.0]|53258.723519262494|
|[1.2049,1388.0,17.0]| 65302.42661294744|
|  [1.2656,97.0,17.0]|  63028.4776051994|
+--------------------+------------------+
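
These predictions are made on the same rows the model was fitted to, so comparing them with the known labels gives in-sample error only. As a small sketch of how such an error is computed, the block below takes the first five labels from the `df.show(5)` output and the matching predictions from the table above. Five training rows are far too few for a real evaluation; the point is only to show the arithmetic, and the variable names are chosen for this example.

```python
import math

# Labels (median_house_value) for the first five rows of the training data.
labels = [66900.0, 80100.0, 85700.0, 73400.0, 65500.0]
# Predictions for the same rows, copied from the predictions table.
shown_predictions = [89619.31856929169, 119126.91749179254, 81836.88816367432,
                     144692.401849334, 102224.03478262624]

# Residual = observed label minus prediction. With a Gaussian family these
# line up with the deviance residuals listed in the summary output.
residuals = [y - p for y, p in zip(labels, shown_predictions)]

# Root mean squared error and mean absolute error over these five rows.
rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
mae = sum(abs(r) for r in residuals) / len(residuals)

print(residuals)
print(rmse, mae)
```

For an honest estimate of generalization error, the same computation would be applied to rows the model has not seen, such as those in the test file named at the top of the notebook.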