<a href="https://colab.research.google.com/github/rohandawar/pyspark/blob/main/Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

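In this notebook, I build a machine-learning pipeline in PySpark: a VectorAssembler feeding a RandomForestRegressor, tuned with CrossValidator on the Boston dataset.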

In [3]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285398 sha256=e531e2b775edf5f8ed410a86a5764453c2a83f1c047565d9cdbb1d73df98b6f7
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1


In [23]:
# Import Libs

# Pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Google
from google.colab import drive

In [5]:
# Mount the drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
# Start the Spark session
spark = SparkSession.builder.appName('Pipeline').getOrCreate()

In [15]:
df = spark.read.csv('/content/drive/MyDrive/DataSets_Pyspark_GoogleColab_Primer/Boston.csv', inferSchema=True, header=True).drop('_c0')
df.show(5)

+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|   crim|  zn|indus|chas|  nox|   rm| age|   dis|rad|tax|ptratio| black|lstat|medv|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|0.00632|18.0| 2.31|   0|0.538|6.575|65.2|  4.09|  1|296|   15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07|   0|0.469|6.421|78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|
|0.02729| 0.0| 7.07|   0|0.469|7.185|61.1|4.9671|  2|242|   17.8|392.83| 4.03|34.7|
|0.03237| 0.0| 2.18|   0|0.458|6.998|45.8|6.0622|  3|222|   18.7|394.63| 2.94|33.4|
|0.06905| 0.0| 2.18|   0|0.458|7.147|54.2|6.0622|  3|222|   18.7| 396.9| 5.33|36.2|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
only showing top 5 rows



In [20]:
# train & test split
train_df, test_df = df.randomSplit([0.7,0.3], seed=42)
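
`randomSplit` assigns rows at random, so the 70/30 proportions hold only approximately, and the `seed` makes the assignment repeatable. The pure-Python sketch below illustrates that idea under simplifying assumptions. `random_split` is a hypothetical helper written for this note, not Spark's implementation.

```python
import random
from itertools import accumulate

def random_split(rows, weights, seed):
    """Toy weighted random split (hypothetical helper, not Spark code)."""
    total = sum(weights)
    # Normalise the weights and turn them into cumulative upper bounds,
    # e.g. [0.7, 0.3] -> [0.7, 1.0].
    bounds = list(accumulate(w / total for w in weights))
    bounds[-1] = 1.0  # guard against floating-point round-off
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform draw in [0, 1)
        # The row goes to the first split whose upper bound exceeds the draw.
        idx = next(i for i, b in enumerate(bounds) if u < b)
        splits[idx].append(row)
    return splits

train, test = random_split(range(1000), [0.7, 0.3], seed=42)
print(len(train), len(test))  # close to, but rarely exactly, 700 and 300
```

Because the split is random per row, it is worth checking `train_df.count()` and `test_df.count()` rather than assuming exact proportions.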

In [21]:
# create a list of columns
col_list = df.columns

# remove the target variable
col_list.remove('medv')
print('List of columns passed to the VectorAssembler: ', col_list)

List of columns passed to the VectorAssembler:  ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat']


In [24]:
# Instantiate the vector assembler (packs the feature columns into one vector column)
vec_assembler = VectorAssembler(inputCols=col_list, outputCol='features')

# Instantiate the random forest regressor
regressor = RandomForestRegressor(labelCol='medv', featuresCol='features')

# Build the pipeline: assemble features first, then fit the regressor
pipeline = Pipeline(stages=[vec_assembler, regressor])
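
A `Pipeline` chains stages: fitting it runs each stage in order on the training data, and the fitted result applies the same stages to new data. The toy classes below mimic that fit/transform pattern in plain Python. All class and function names here are made up for illustration and are not PySpark's.

```python
class Assemble:
    """Transformer-like stage: packs selected columns into a 'features' list."""
    def __init__(self, input_cols):
        self.input_cols = input_cols

    def transform(self, rows):
        return [{**r, "features": [r[c] for c in self.input_cols]} for r in rows]


class MeanRegressor:
    """Estimator-like stage: 'fits' by memorising the mean label."""
    def __init__(self, label_col):
        self.label_col = label_col

    def fit(self, rows):
        mean = sum(r[self.label_col] for r in rows) / len(rows)
        return MeanModel(mean)


class MeanModel:
    """Fitted model: a transformer that adds a 'prediction' column."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [{**r, "prediction": self.mean} for r in rows]


def fit_pipeline(stages, rows):
    """Fit stages in order, feeding each stage the output of the previous one."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):  # estimator: fit it and keep the fitted model
            stage = stage.fit(rows)
        fitted.append(stage)
        rows = stage.transform(rows)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order to new rows."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


# Made-up rows, only to exercise the sketch.
train = [{"rm": 6.0, "medv": 20.0}, {"rm": 7.0, "medv": 30.0}]
model = fit_pipeline([Assemble(["rm"]), MeanRegressor("medv")], train)
print(transform_pipeline(model, [{"rm": 6.5, "medv": 24.0}]))
```

Wrapping both stages in one pipeline also means the cross-validator later refits the whole chain on each fold, not just the regressor.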

In [26]:
# Create the parameter grid: 4 values of numTrees x 4 values of maxDepth = 16 combinations
paramGrid = ParamGridBuilder()\
            .addGrid(regressor.numTrees,[3,5,10,15])\
            .addGrid(regressor.maxDepth,[3,5,10,15])\
            .build()
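
The grid is the Cartesian product of the value lists, so every `numTrees` value is paired with every `maxDepth` value. A plain-Python sketch of the same enumeration, using ordinary dicts with string keys as a stand-in for the parameter maps:

```python
from itertools import product

num_trees = [3, 5, 10, 15]
max_depth = [3, 5, 10, 15]

# Every (numTrees, maxDepth) pairing becomes one candidate setting.
grid = [{"numTrees": n, "maxDepth": d} for n, d in product(num_trees, max_depth)]

print(len(grid))  # 16
print(grid[0])    # {'numTrees': 3, 'maxDepth': 3}
```

In the notebook itself, `len(paramGrid)` gives the same count, since `build()` returns a list of parameter maps.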

In [27]:
# Instantiate the evaluator (RMSE: lower is better)
evaluator = RegressionEvaluator(labelCol='medv', metricName='rmse')
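
RMSE is the square root of the mean squared difference between label and prediction, so it is expressed in the same units as `medv`. A minimal plain-Python version of the formula, with made-up numbers:

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Each made-up prediction is off by exactly 2, so the RMSE is 2.0.
print(rmse([20.0, 30.0, 40.0], [22.0, 28.0, 42.0]))  # 2.0
```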

In [28]:
# Set up 10-fold cross-validation over the parameter grid
Crossvalidator = CrossValidator(estimator=pipeline,
                                estimatorParamMaps=paramGrid,
                                evaluator=evaluator,
                                numFolds=10)
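
With 16 parameter settings and 10 folds, fitting this cross-validator trains the pipeline 160 times, plus one final refit of the best setting on the full training set. The loop below is a toy sketch of that procedure in plain Python. The "model" is a made-up shrunken mean, the data is invented, and folds are assigned by row index, whereas Spark assigns rows to folds at random.

```python
import math

def rmse(labels, predictions):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels))

def fit_shrunk_mean(labels, shrink):
    """Toy 'model': a constant prediction, pulled toward 0 as shrink grows."""
    return sum(labels) / (len(labels) + shrink)

def cross_validate(labels, shrink_grid, num_folds):
    """Return (best_shrink, avg_metrics, number_of_fits)."""
    totals = [0.0] * len(shrink_grid)
    fits = 0
    for fold in range(num_folds):
        # Row i is held out when i % num_folds == fold (deterministic here).
        train = [y for i, y in enumerate(labels) if i % num_folds != fold]
        valid = [y for i, y in enumerate(labels) if i % num_folds == fold]
        for j, shrink in enumerate(shrink_grid):
            constant = fit_shrunk_mean(train, shrink)
            fits += 1
            totals[j] += rmse(valid, [constant] * len(valid))
    avg_metrics = [t / num_folds for t in totals]
    # For RMSE smaller is better, so pick the minimum average.
    best = shrink_grid[avg_metrics.index(min(avg_metrics))]
    return best, avg_metrics, fits

labels = [20.0, 22.0, 19.0, 25.0, 21.0, 23.0, 18.0, 24.0, 20.0, 22.0]
best, avg_metrics, fits = cross_validate(labels, [0.0, 5.0, 50.0], num_folds=5)
print(best, fits)  # 0.0 15
```

The fit count grows as folds times grid size, which is why large grids combined with `numFolds=10` can take a long time on a single Colab machine.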

In [29]:
# Train the cross-validator
tuned_model = Crossvalidator.fit(train_df)

In [30]:
# Predictions on the held-out test set
predictions = tuned_model.transform(test_df)

rmse = evaluator.evaluate(predictions)
print(f"RMSE:{rmse}")

RMSE:3.7997402862623955
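
The test RMSE alone does not show which grid setting won. A fitted `CrossValidatorModel` keeps the average validation metric for each parameter map in `avgMetrics`, in the same order as `getEstimatorParamMaps()`. The helper below is a hypothetical sketch for ranking them. It is demonstrated on made-up numbers so it runs without Spark.

```python
def rank_param_maps(param_maps, avg_metrics, larger_is_better=False):
    """Pair each parameter setting with its average CV metric, best first."""
    rows = []
    for param_map, metric in zip(param_maps, avg_metrics):
        # PySpark keys are Param objects with a .name attribute; plain strings
        # are accepted too so the helper can be tried without Spark.
        settings = {getattr(k, "name", k): v for k, v in param_map.items()}
        rows.append((settings, metric))
    return sorted(rows, key=lambda r: r[1], reverse=larger_is_better)

# Made-up numbers, only to show the shape of the result.
demo_maps = [{"numTrees": 3, "maxDepth": 3}, {"numTrees": 10, "maxDepth": 5}]
demo_metrics = [4.2, 3.9]
for settings, metric in rank_param_maps(demo_maps, demo_metrics):
    print(settings, metric)
```

In this notebook the call would look like `rank_param_maps(cv_model.getEstimatorParamMaps(), cv_model.avgMetrics)`, where `cv_model` stands for the object returned by `Crossvalidator.fit(train_df)`. The winning fitted pipeline is available as `cv_model.bestModel`, and its last stage is the fitted random forest.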
