# ML Packages

**https://spark.apache.org/docs/2.0.1/api/python/pyspark.mllib.html**

## ML vs MLlib


|Spark MLlib (`spark.mllib`)|Spark ML (`spark.ml`)|
|-|-|
|`spark.mllib` contains the legacy API **built on top of RDDs**.|`spark.ml` provides a **higher-level API built on top of DataFrames for constructing ML pipelines**.|


## Why is MLlib switching to the DataFrame-based API?

DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.

The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.

DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.

In [38]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

In [39]:
spark = SparkSession.builder.appName('Pyspark ML Algorithms').getOrCreate()

In [40]:
dataframe = spark.read.csv("Admission_Prediction.csv", header=True)

In [41]:
type(dataframe)

pyspark.sql.dataframe.DataFrame

In [42]:
dataframe.show()

+---------+-----------+-----------------+----+----+----+--------+---------------+
|GRE Score|TOEFL Score|University Rating| SOP| LOR|CGPA|Research|Chance of Admit|
+---------+-----------+-----------------+----+----+----+--------+---------------+
|   337.00|     118.00|                4|4.50|4.50|9.65|    1.00|           0.92|
|   324.00|     107.00|                4|4.00|4.50|8.87|    1.00|           0.76|
|     null|     104.00|                3|3.00|3.50|8.00|    1.00|           0.72|
|   322.00|     110.00|                3|3.50|2.50|8.67|    1.00|           0.80|
|   314.00|     103.00|                2|2.00|3.00|8.21|    0.00|           0.65|
|   330.00|     115.00|                5|4.50|3.00|9.34|    1.00|           0.90|
|   321.00|     109.00|             null|3.00|4.00|8.20|    1.00|           0.75|
|   308.00|     101.00|                2|3.00|4.00|7.90|    0.00|           0.68|
|   302.00|     102.00|                1|2.00|1.50|8.00|    0.00|           0.50|
|   323.00|     

In [43]:
dataframe.describe()

DataFrame[summary: string, GRE Score: string, TOEFL Score: string, University Rating: string, SOP: string, LOR: string, CGPA: string, Research: string, Chance of Admit: string]

In [44]:
dataframe.printSchema()

root
 |-- GRE Score: string (nullable = true)
 |-- TOEFL Score: string (nullable = true)
 |-- University Rating: string (nullable = true)
 |-- SOP: string (nullable = true)
 |-- LOR: string (nullable = true)
 |-- CGPA: string (nullable = true)
 |-- Research: string (nullable = true)
 |-- Chance of Admit: string (nullable = true)



In [45]:
dataframe.columns

['GRE Score',
 'TOEFL Score',
 'University Rating',
 'SOP',
 'LOR',
 'CGPA',
 'Research',
 'Chance of Admit']

# Convert to column

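`spark.read.csv` reads every column as `string` unless a schema is supplied or `inferSchema=True` is set. Numeric columns therefore need to be cast to a numeric type such as `float` or `integer` before they can be used for modelling.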

In [46]:
from pyspark.sql.functions import col

for column in dataframe.columns:
    print(col(column))

Column<'GRE Score'>
Column<'TOEFL Score'>
Column<'University Rating'>
Column<'SOP'>
Column<'LOR'>
Column<'CGPA'>
Column<'Research'>
Column<'Chance of Admit'>


In [47]:
dataframe.select(*(col(c) for c in dataframe.columns)).show()

+---------+-----------+-----------------+----+----+----+--------+---------------+
|GRE Score|TOEFL Score|University Rating| SOP| LOR|CGPA|Research|Chance of Admit|
+---------+-----------+-----------------+----+----+----+--------+---------------+
|   337.00|     118.00|                4|4.50|4.50|9.65|    1.00|           0.92|
|   324.00|     107.00|                4|4.00|4.50|8.87|    1.00|           0.76|
|     null|     104.00|                3|3.00|3.50|8.00|    1.00|           0.72|
|   322.00|     110.00|                3|3.50|2.50|8.67|    1.00|           0.80|
|   314.00|     103.00|                2|2.00|3.00|8.21|    0.00|           0.65|
|   330.00|     115.00|                5|4.50|3.00|9.34|    1.00|           0.90|
|   321.00|     109.00|             null|3.00|4.00|8.20|    1.00|           0.75|
|   308.00|     101.00|                2|3.00|4.00|7.90|    0.00|           0.68|
|   302.00|     102.00|                1|2.00|1.50|8.00|    0.00|           0.50|
|   323.00|     

In [48]:
dataframe.printSchema()

root
 |-- GRE Score: string (nullable = true)
 |-- TOEFL Score: string (nullable = true)
 |-- University Rating: string (nullable = true)
 |-- SOP: string (nullable = true)
 |-- LOR: string (nullable = true)
 |-- CGPA: string (nullable = true)
 |-- Research: string (nullable = true)
 |-- Chance of Admit: string (nullable = true)



# Cast column to Number

In [49]:
dataframe.select(*(col(c).cast("float") for c in dataframe.columns)).printSchema()

root
 |-- GRE Score: float (nullable = true)
 |-- TOEFL Score: float (nullable = true)
 |-- University Rating: float (nullable = true)
 |-- SOP: float (nullable = true)
 |-- LOR: float (nullable = true)
 |-- CGPA: float (nullable = true)
 |-- Research: float (nullable = true)
 |-- Chance of Admit: float (nullable = true)



In [50]:
new_dataframe = dataframe.select(*(col(c).cast("float") for c in dataframe.columns))
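
The cast changes the column type but keeps missing values missing: a `null` string is still `null` after `cast("float")`, so the missing-value handling that follows is still required. The plain-Python sketch below only mimics that behaviour for illustration. `to_float_or_none` is a hypothetical helper, not a Spark function, and Python's `float()` does not parse strings in exactly the same way Spark does.

```python
def to_float_or_none(text):
    """Mimic a nullable string-to-float cast: None stays None."""
    if text is None:
        return None
    try:
        return float(text)
    except ValueError:
        # Assumption: unparseable text becomes a missing value instead of raising.
        return None

# Sample values copied from the "GRE Score" column shown above.
raw_column = ["337.00", "324.00", None, "322.00"]
casted = [to_float_or_none(v) for v in raw_column]
```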

# Handle Missing value

In [51]:
from pyspark.sql.functions import when, col, count

**when(condition, value)**

Evaluates a list of conditions and returns one of multiple possible result expressions.

- `condition`: a boolean `Column` expression.
- `value`: a literal value, or a `Column` expression.

df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()

[Row(age=3), Row(age=4)]

df.select(when(df.age == 2, df.age + 1).alias("age")).collect()

[Row(age=3), Row(age=None)]

In [52]:
dataframe.select([count(when(col(c).isNull(), c)) for c in dataframe.columns]).show()

+-------------------------------------------------------+-----------------------------------------------------------+-----------------------------------------------------------------------+-------------------------------------------+-------------------------------------------+---------------------------------------------+-----------------------------------------------------+-------------------------------------------------------------------+
|count(CASE WHEN (GRE Score IS NULL) THEN GRE Score END)|count(CASE WHEN (TOEFL Score IS NULL) THEN TOEFL Score END)|count(CASE WHEN (University Rating IS NULL) THEN University Rating END)|count(CASE WHEN (SOP IS NULL) THEN SOP END)|count(CASE WHEN (LOR IS NULL) THEN LOR END)|count(CASE WHEN (CGPA IS NULL) THEN CGPA END)|count(CASE WHEN (Research IS NULL) THEN Research END)|count(CASE WHEN (Chance of Admit IS NULL) THEN Chance of Admit END)|
+-------------------------------------------------------+---------------------------------------------------

In [53]:
dataframe.select([count(when(col(c).isNull(), c)).alias(c) for c in dataframe.columns])

DataFrame[GRE Score: bigint, TOEFL Score: bigint, University Rating: bigint, SOP: bigint, LOR: bigint, CGPA: bigint, Research: bigint, Chance of Admit: bigint]

In [54]:
dataframe.select([count(when(col(c).isNull(), c)).alias(c) for c in dataframe.columns]).show()

+---------+-----------+-----------------+---+---+----+--------+---------------+
|GRE Score|TOEFL Score|University Rating|SOP|LOR|CGPA|Research|Chance of Admit|
+---------+-----------+-----------------+---+---+----+--------+---------------+
|       15|         10|               15|  0|  0|   0|       0|              0|
+---------+-----------+-----------------+---+---+----+--------+---------------+
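
The counting expression works because `when(...)` without `otherwise(...)` yields `null` for rows that do not match, and `count` skips `null` values, so only the rows where the column is null are counted. A plain-Python sketch of the same logic follows; the `count_nulls` helper and the three sample rows are made up for illustration.

```python
def count_nulls(rows, columns):
    """Count missing (None) values per column in a list of dict rows."""
    counts = {}
    for c in columns:
        # Like when(col(c).isNull(), c): keep a marker only for null cells.
        markers = [c if row[c] is None else None for row in rows]
        # Like count(...): count only the non-null markers.
        counts[c] = sum(1 for m in markers if m is not None)
    return counts

rows = [
    {"GRE Score": "337.00", "University Rating": "4"},
    {"GRE Score": None, "University Rating": "3"},
    {"GRE Score": "321.00", "University Rating": None},
]
null_counts = count_nulls(rows, ["GRE Score", "University Rating"])
```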



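## alias

`.alias(c)` renames each aggregated column to the original column name `c`. Without it, Spark generates headers such as `count(CASE WHEN (GRE Score IS NULL) THEN GRE Score END)`.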

In [55]:
dataframe.select([count(when(col(c).isNull(), c)).alias(c) for c in dataframe.columns]).show()

+---------+-----------+-----------------+---+---+----+--------+---------------+
|GRE Score|TOEFL Score|University Rating|SOP|LOR|CGPA|Research|Chance of Admit|
+---------+-----------+-----------------+---+---+----+--------+---------------+
|       15|         10|               15|  0|  0|   0|       0|              0|
+---------+-----------+-----------------+---+---+----+--------+---------------+



## Impute missing value

In [56]:
from pyspark.ml.feature import Imputer

In [58]:
imputer = Imputer(inputCols=["GRE Score","TOEFL Score","University Rating"],
                 outputCols=["GRE Score","TOEFL Score","University Rating"], strategy='mean')
model = imputer.fit(new_dataframe)
imputed_data = model.transform(new_dataframe)

In [59]:
imputed_data.show()

+---------+-----------+-----------------+---+---+----+--------+---------------+
|GRE Score|TOEFL Score|University Rating|SOP|LOR|CGPA|Research|Chance of Admit|
+---------+-----------+-----------------+---+---+----+--------+---------------+
|    337.0|      118.0|              4.0|4.5|4.5|9.65|     1.0|           0.92|
|    324.0|      107.0|              4.0|4.0|4.5|8.87|     1.0|           0.76|
|316.55878|      104.0|              3.0|3.0|3.5| 8.0|     1.0|           0.72|
|    322.0|      110.0|              3.0|3.5|2.5|8.67|     1.0|            0.8|
|    314.0|      103.0|              2.0|2.0|3.0|8.21|     0.0|           0.65|
|    330.0|      115.0|              5.0|4.5|3.0|9.34|     1.0|            0.9|
|    321.0|      109.0|        3.1216495|3.0|4.0| 8.2|     1.0|           0.75|
|    308.0|      101.0|              2.0|3.0|4.0| 7.9|     0.0|           0.68|
|    302.0|      102.0|              1.0|2.0|1.5| 8.0|     0.0|            0.5|
|    323.0|      108.0|              3.0

In [60]:
imputed_data.select([count(when(col(c).isNull(), c)).alias(c) for c in imputed_data.columns]).show()

+---------+-----------+-----------------+---+---+----+--------+---------------+
|GRE Score|TOEFL Score|University Rating|SOP|LOR|CGPA|Research|Chance of Admit|
+---------+-----------+-----------------+---+---+----+--------+---------------+
|        0|          0|                0|  0|  0|   0|       0|              0|
+---------+-----------+-----------------+---+---+----+--------+---------------+
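
With `strategy='mean'`, each missing value is replaced by the mean of the non-missing values in the same column, which is why previously empty cells now hold values such as `316.55878`. A plain-Python sketch of that idea follows; the `mean_impute` helper and the sample values are illustrative only and are not Spark's implementation.

```python
def mean_impute(values):
    """Replace None entries with the mean of the non-missing entries."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to average, leave the column unchanged
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

gre_scores = [337.0, 324.0, None, 322.0]
filled = mean_impute(gre_scores)
```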



In [61]:
features = imputed_data.drop('Chance of Admit')

In [64]:
features

DataFrame[GRE Score: float, TOEFL Score: float, University Rating: float, SOP: float, LOR: float, CGPA: float, Research: float]

In [62]:
features.columns

['GRE Score',
 'TOEFL Score',
 'University Rating',
 'SOP',
 'LOR',
 'CGPA',
 'Research']

In [63]:
imputed_data['Chance of Admit']

Column<'Chance of Admit'>

# Merge feature columns into a vector for training

`VectorAssembler` combines the feature columns of each row into a single vector column. `Chance of Admit` is the target variable, so it is left out: `inputCols=features.columns` does not contain it because it was dropped above.

In [69]:
# Assemble the feature columns into a single vector column named "features"
assembler = VectorAssembler(inputCols=features.columns, outputCol="features")

In [66]:
output = assembler.transform(imputed_data)
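
Conceptually, the transform appends one new column whose value is the row's feature values gathered in `inputCols` order, while the original columns (including the target) are kept. A plain-Python sketch with a hypothetical `assemble` helper and the first row of the data:

```python
def assemble(row, input_cols, output_col="features"):
    """Return a copy of the row with the input columns gathered into one list."""
    assembled = dict(row)
    assembled[output_col] = [row[c] for c in input_cols]
    return assembled

feature_cols = ["GRE Score", "TOEFL Score", "University Rating",
                "SOP", "LOR", "CGPA", "Research"]
row = {"GRE Score": 337.0, "TOEFL Score": 118.0, "University Rating": 4.0,
       "SOP": 4.5, "LOR": 4.5, "CGPA": 9.65, "Research": 1.0,
       "Chance of Admit": 0.92}
out = assemble(row, feature_cols)
```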

In [67]:
output.show()

+---------+-----------+-----------------+---+---+----+--------+---------------+--------------------+
|GRE Score|TOEFL Score|University Rating|SOP|LOR|CGPA|Research|Chance of Admit|            features|
+---------+-----------+-----------------+---+---+----+--------+---------------+--------------------+
|    337.0|      118.0|              4.0|4.5|4.5|9.65|     1.0|           0.92|[337.0,118.0,4.0,...|
|    324.0|      107.0|              4.0|4.0|4.5|8.87|     1.0|           0.76|[324.0,107.0,4.0,...|
|316.55878|      104.0|              3.0|3.0|3.5| 8.0|     1.0|           0.72|[316.558776855468...|
|    322.0|      110.0|              3.0|3.5|2.5|8.67|     1.0|            0.8|[322.0,110.0,3.0,...|
|    314.0|      103.0|              2.0|2.0|3.0|8.21|     0.0|           0.65|[314.0,103.0,2.0,...|
|    330.0|      115.0|              5.0|4.5|3.0|9.34|     1.0|            0.9|[330.0,115.0,5.0,...|
|    321.0|      109.0|        3.1216495|3.0|4.0| 8.2|     1.0|           0.75|[321.0,109.0

In [68]:
output.select('features').toPandas()

Unnamed: 0,features
0,"[337.0, 118.0, 4.0, 4.5, 4.5, 9.64999961853027..."
1,"[324.0, 107.0, 4.0, 4.0, 4.5, 8.86999988555908..."
2,"[316.55877685546875, 104.0, 3.0, 3.0, 3.5, 8.0..."
3,"[322.0, 110.0, 3.0, 3.5, 2.5, 8.67000007629394..."
4,"[314.0, 103.0, 2.0, 2.0, 3.0, 8.21000003814697..."
...,...
495,"[332.0, 108.0, 5.0, 4.5, 4.0, 9.02000045776367..."
496,"[337.0, 117.0, 5.0, 5.0, 5.0, 9.86999988555908..."
497,"[330.0, 120.0, 5.0, 4.5, 5.0, 9.5600004196167,..."
498,"[312.0, 103.0, 4.0, 4.0, 5.0, 8.43000030517578..."


In [70]:
output.select('features').toPandas().values

array([[DenseVector([337.0, 118.0, 4.0, 4.5, 4.5, 9.65, 1.0])],
       [DenseVector([324.0, 107.0, 4.0, 4.0, 4.5, 8.87, 1.0])],
       [DenseVector([316.5588, 104.0, 3.0, 3.0, 3.5, 8.0, 1.0])],
       [DenseVector([322.0, 110.0, 3.0, 3.5, 2.5, 8.67, 1.0])],
       [DenseVector([314.0, 103.0, 2.0, 2.0, 3.0, 8.21, 0.0])],
       [DenseVector([330.0, 115.0, 5.0, 4.5, 3.0, 9.34, 1.0])],
       [DenseVector([321.0, 109.0, 3.1216, 3.0, 4.0, 8.2, 1.0])],
       [DenseVector([308.0, 101.0, 2.0, 3.0, 4.0, 7.9, 0.0])],
       [DenseVector([302.0, 102.0, 1.0, 2.0, 1.5, 8.0, 0.0])],
       [DenseVector([323.0, 108.0, 3.0, 3.5, 3.0, 8.6, 0.0])],
       [DenseVector([325.0, 106.0, 3.0, 3.5, 4.0, 8.4, 1.0])],
       [DenseVector([327.0, 111.0, 4.0, 4.0, 4.5, 9.0, 1.0])],
       [DenseVector([316.5588, 112.0, 4.0, 4.0, 4.5, 9.1, 1.0])],
       [DenseVector([307.0, 109.0, 3.0, 4.0, 3.0, 8.0, 1.0])],
       [DenseVector([311.0, 104.0, 3.0, 3.5, 2.0, 8.2, 1.0])],
       [DenseVector([314.0, 105.0, 3.0, 3

In [71]:
data = output.select("features", "Chance of Admit")

In [72]:
data.show(5)

+--------------------+---------------+
|            features|Chance of Admit|
+--------------------+---------------+
|[337.0,118.0,4.0,...|           0.92|
|[324.0,107.0,4.0,...|           0.76|
|[316.558776855468...|           0.72|
|[322.0,110.0,3.0,...|            0.8|
|[314.0,103.0,2.0,...|           0.65|
+--------------------+---------------+
only showing top 5 rows



# Train Test split

In [73]:
train_df, test_df = data.randomSplit([0.7, 0.3])
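
`randomSplit` assigns rows at random, so the 70/30 proportions are only approximate and, without a seed, the split changes from run to run. It accepts an optional `seed` argument (for example `data.randomSplit([0.7, 0.3], seed=42)`) to make the split repeatable on the same input. The plain-Python sketch below illustrates the general idea of a weighted random split; `random_split` is a hypothetical helper and does not reproduce Spark's sampling.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split using one uniform random draw per row."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.7, 1.0].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(list(range(500)), [0.7, 0.3], seed=42)
```

Split sizes vary around the requested proportions, so it is safer not to rely on exact counts.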

In [74]:
train_df.show()

+--------------------+---------------+
|            features|Chance of Admit|
+--------------------+---------------+
|[290.0,100.0,1.0,...|           0.47|
|[290.0,104.0,4.0,...|           0.45|
|[293.0,97.0,2.0,2...|           0.64|
|[294.0,93.0,1.0,1...|           0.46|
|[294.0,95.0,1.0,1...|           0.49|
|[295.0,93.0,1.0,2...|           0.46|
|[295.0,96.0,2.0,1...|           0.47|
|[295.0,99.0,1.0,2...|           0.37|
|[295.0,99.0,2.0,2...|           0.57|
|[296.0,97.0,2.0,1...|           0.49|
|[296.0,99.0,2.0,3...|           0.47|
|[297.0,96.0,2.0,2...|           0.43|
|[297.0,96.0,2.0,2...|           0.34|
|[297.0,98.0,2.0,2...|           0.59|
|[297.0,101.0,3.0,...|           0.57|
|[298.0,92.0,1.0,2...|           0.51|
|[298.0,97.0,3.121...|           0.45|
|[298.0,98.0,2.0,1...|           0.44|
|[298.0,98.0,2.0,4...|           0.34|
|[298.0,100.0,3.0,...|           0.58|
+--------------------+---------------+
only showing top 20 rows



In [75]:
test_df.show()

+--------------------+---------------+
|            features|Chance of Admit|
+--------------------+---------------+
|[295.0,101.0,2.0,...|           0.69|
|[296.0,95.0,2.0,3...|           0.44|
|[296.0,99.0,2.0,2...|           0.61|
|[296.0,101.0,1.0,...|            0.6|
|[297.0,99.0,4.0,3...|           0.54|
|[297.0,100.0,1.0,...|           0.52|
|[298.0,99.0,1.0,1...|           0.53|
|[298.0,105.0,3.0,...|           0.69|
|[299.0,96.0,2.0,1...|           0.54|
|[299.0,97.0,3.0,5...|           0.38|
|[299.0,100.0,1.0,...|           0.59|
|[299.0,100.0,2.0,...|           0.51|
|[300.0,99.0,1.0,1...|           0.58|
|[300.0,100.0,3.0,...|           0.64|
|[300.0,100.0,3.12...|           0.62|
|[300.0,105.0,1.0,...|           0.58|
|[301.0,100.0,3.0,...|           0.67|
|[301.0,102.0,3.0,...|           0.68|
|[302.0,99.0,1.0,2...|           0.57|
|[304.0,97.0,2.0,1...|           0.47|
+--------------------+---------------+
only showing top 20 rows



# Linear Regressor

In [77]:
linear_reg = LinearRegression(featuresCol="features", labelCol="Chance of Admit")
linear_model = linear_reg.fit(train_df)

In [79]:
print("coefficient : ", linear_model.coefficients)
print("Intercept: ", linear_model.intercept)

coefficient :  [0.0027044541071400766,0.002347259402374242,0.0028634849945772937,0.0034685505284830923,0.016291101677516847,0.11833324746460916,0.024150038964725402]
Intercept:  -1.4932994837564946
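
A fitted linear model predicts the intercept plus the sum of each coefficient multiplied by its feature, with the coefficients in the same order as the assembled feature vector. As a rough sanity check, the sketch below applies the printed coefficients (which are specific to this run's random split) to the first row of the data; `predict_linear` is a hypothetical helper.

```python
# Values copied from the printed output above.
coefficients = [0.0027044541071400766, 0.002347259402374242,
                0.0028634849945772937, 0.0034685505284830923,
                0.016291101677516847, 0.11833324746460916,
                0.024150038964725402]
intercept = -1.4932994837564946

def predict_linear(features, coefficients, intercept):
    """Linear prediction: intercept plus the dot product of weights and features."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

# First row of the data: GRE, TOEFL, rating, SOP, LOR, CGPA, research.
first_row = [337.0, 118.0, 4.0, 4.5, 4.5, 9.65, 1.0]  # actual label: 0.92
estimate = predict_linear(first_row, coefficients, intercept)
```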


In [81]:
train_summary = linear_model.summary
print("RMSE: ", train_summary.rootMeanSquaredError)
print("r2 score: ", train_summary.r2)
print("Adjusted r2 score: ", train_summary.r2adj)

RMSE:  0.05732066225528161
r2 score:  0.8398643510037533
Adjusted r2 score:  0.8365479322375589
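
For reference, these summary metrics can be reproduced from labels and predictions. The sketch below uses the standard definitions, with adjusted R² computed as `1 - (1 - R²)(n - 1)/(n - p - 1)` for `n` rows and `p` features, assuming a model fitted with an intercept. The function name and the sample numbers are made up for illustration.

```python
import math

def regression_metrics(y_true, y_pred, num_features):
    """Return (RMSE, R2, adjusted R2) for paired labels and predictions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - num_features - 1)
    return rmse, r2, r2_adj

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
rmse, r2, r2_adj = regression_metrics(y_true, y_pred, num_features=1)
```

Adjusted R² penalizes the number of features, so it is never larger than R² when at least one feature is used.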


## Prediction

In [84]:
prediction = linear_model.transform(test_df)
prediction.select("features", "Chance of Admit", "prediction").show()
prediction

+--------------------+---------------+-------------------+
|            features|Chance of Admit|         prediction|
+--------------------+---------------+-------------------+
|[295.0,101.0,2.0,...|           0.69| 0.5186675680260435|
|[296.0,95.0,2.0,3...|           0.44| 0.4953061204459819|
|[296.0,99.0,2.0,2...|           0.61| 0.5449396588385975|
|[296.0,101.0,1.0,...|            0.6| 0.5134996181600522|
|[297.0,99.0,4.0,3...|           0.54| 0.5453631702617401|
|[297.0,100.0,1.0,...|           0.52| 0.5201305066994182|
|[298.0,99.0,1.0,1...|           0.53|0.48471216742619005|
|[298.0,105.0,3.0,...|           0.69|  0.655550794797741|
|[299.0,96.0,2.0,1...|           0.54| 0.5142805369142496|
|[299.0,97.0,3.0,5...|           0.38| 0.5324011773288315|
|[299.0,100.0,1.0,...|           0.59| 0.5243560553547211|
|[299.0,100.0,2.0,...|           0.51| 0.5277705124802525|
|[300.0,99.0,1.0,1...|           0.58| 0.5453245582132811|
|[300.0,100.0,3.0,...|           0.64|  0.66607949364822

DataFrame[features: vector, Chance of Admit: float, prediction: double]

## Prediction evaluation

In [85]:
from pyspark.ml.evaluation import RegressionEvaluator
pred_evaluator = RegressionEvaluator(predictionCol='prediction',labelCol='Chance of Admit', metricName='r2')
print('R2 score on test data = ', pred_evaluator.evaluate(prediction))

R2 score on test data =  0.7637091005124877


# Random Forest

In [86]:
random_forest_reg = RandomForestRegressor(featuresCol="features", labelCol="Chance of Admit")
model = random_forest_reg.fit(train_df)

In [87]:
predictions = model.transform(test_df)

In [88]:
predictions.show()

+--------------------+---------------+-------------------+
|            features|Chance of Admit|         prediction|
+--------------------+---------------+-------------------+
|[295.0,101.0,2.0,...|           0.69|0.48777921884619957|
|[296.0,95.0,2.0,3...|           0.44| 0.4372186373436279|
|[296.0,99.0,2.0,2...|           0.61| 0.5135236457242093|
|[296.0,101.0,1.0,...|            0.6| 0.5063290284075517|
|[297.0,99.0,4.0,3...|           0.54| 0.5446274179792636|
|[297.0,100.0,1.0,...|           0.52| 0.5218365715579162|
|[298.0,99.0,1.0,1...|           0.53| 0.4687097167552025|
|[298.0,105.0,3.0,...|           0.69|  0.592479256146192|
|[299.0,96.0,2.0,1...|           0.54|  0.500007005204133|
|[299.0,97.0,3.0,5...|           0.38|  0.553782877903234|
|[299.0,100.0,1.0,...|           0.59| 0.5580090378804548|
|[299.0,100.0,2.0,...|           0.51| 0.5209516475422984|
|[300.0,99.0,1.0,1...|           0.58| 0.5744862151526577|
|[300.0,100.0,3.0,...|           0.64| 0.676864623848491

In [89]:
from pyspark.ml.evaluation import RegressionEvaluator
pred_evaluator = RegressionEvaluator(predictionCol='prediction',labelCol='Chance of Admit', metricName='r2')
print('R2 score on test data = ', pred_evaluator.evaluate(predictions))

R2 score on test data =  0.7503766064229214
