<a href="https://colab.research.google.com/github/tyri0n11/distributed-system/blob/main/7_2_regression_spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from pyspark import SparkContext
sc = SparkContext(master = 'local[4]')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName("Python Spark SQL basic example") \
          .config("spark.some.config.option", "some-value") \
          .getOrCreate()

# Linear regression without cross-validation

In [2]:
ad = spark.read.csv('./Advertising.csv', header=True, inferSchema=True)
ad.show(5)

+-----+-----+---------+-----+
|   TV|Radio|Newspaper|Sales|
+-----+-----+---------+-----+
|230.1| 37.8|     69.2| 22.1|
| 44.5| 39.3|     45.1| 10.4|
| 17.2| 45.9|     69.3|  9.3|
|151.5| 41.3|     58.5| 18.5|
|180.8| 10.8|     58.4| 12.9|
+-----+-----+---------+-----+
only showing top 5 rows


In [3]:
type(ad)

## Transform data structure

In [4]:
from pyspark.ml.linalg import Vectors
ad_df = ad.rdd.map(lambda x: [Vectors.dense(x[0:3]), x[-1]]).toDF(['features', 'label'])
ad_df.show(5)

+-----------------+-----+
|         features|label|
+-----------------+-----+
|[230.1,37.8,69.2]| 22.1|
| [44.5,39.3,45.1]| 10.4|
| [17.2,45.9,69.3]|  9.3|
|[151.5,41.3,58.5]| 18.5|
|[180.8,10.8,58.4]| 12.9|
+-----------------+-----+
only showing top 5 rows


## Build linear regression model

In [5]:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol = 'features', labelCol = 'label')

In [6]:
type(lr)

## Fit the model

In [7]:
lr_model = lr.fit(ad_df)

In [8]:
type(lr_model)

In [9]:
lr_model

LinearRegressionModel: uid=LinearRegression_432ec5b39bf6, numFeatures=3

## Prediction

In [10]:
pred = lr_model.transform(ad_df)
pred.show(5)

+-----------------+-----+------------------+
|         features|label|        prediction|
+-----------------+-----+------------------+
|[230.1,37.8,69.2]| 22.1| 20.52397440971517|
| [44.5,39.3,45.1]| 10.4|12.337854820894362|
| [17.2,45.9,69.3]|  9.3|12.307670779994238|
|[151.5,41.3,58.5]| 18.5| 17.59782951168913|
|[180.8,10.8,58.4]| 12.9|13.188671856831299|
+-----------------+-----+------------------+
only showing top 5 rows


## Model evaluation

In [11]:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='label')
evaluator.setMetricName('r2').evaluate(pred)

0.897210638178952

In [12]:
# reuse the evaluator created above; only the metric name changes
evaluator.setMetricName('mse').evaluate(pred)

2.784126314510938

In [13]:
evaluator.setMetricName('mae').evaluate(pred)

1.2520112296870693
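
For reference, the three metrics above can be reproduced by hand. The sketch below is plain Python, not Spark code, and it uses only the five rows shown in the prediction table, so its numbers will differ from the full-dataset values above. The helper name `regression_metrics` is hypothetical.

```python
# Labels and predictions copied from the first five rows of `pred.show(5)`.
labels = [22.1, 10.4, 9.3, 18.5, 12.9]
preds = [20.52397440971517, 12.337854820894362, 12.307670779994238,
         17.59782951168913, 13.188671856831299]


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R^2 for two equal-length sequences (hypothetical helper)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n          # mean squared error
    mae = sum(abs(e) for e in errors) / n         # mean absolute error
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                      # coefficient of determination
    return {"mse": mse, "mae": mae, "r2": r2}


metrics = regression_metrics(labels, preds)
print(metrics)
```

A perfect model gives `r2 == 1`, and a model that always predicts the mean of the labels gives `r2 == 0`.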

### Exercise

1. Read the Spark documentation to study and practice other regression models.
2. Investigate other metrics for evaluating regression models in Spark.
3. Practice on the `Advertising` dataset.
4. Repeat the steps above, but split the data into `train` and `test` datasets: fit the model on `train`, then evaluate it on `test`.

# Linear regression with cross-validation in Spark

In [14]:
training, test = ad_df.randomSplit([0.8, 0.2], seed=123)

In [15]:
##=====build cross-validation model======

# estimator
lr = LinearRegression(featuresCol = 'features', labelCol = 'label')

# parameter grid
from pyspark.ml.tuning import ParamGridBuilder
param_grid = ParamGridBuilder().\
    addGrid(lr.regParam, [0, 0.5, 1]).\
    addGrid(lr.elasticNetParam, [0, 0.5, 1]).\
    build()

# evaluator
evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='label', metricName='r2')

# cross-validation model
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=4)
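
To get a feel for the cost of this search, the sketch below counts how many models the cross-validator has to fit. It is plain Python that mirrors the grid and `numFolds` values used above; the variable names are illustrative and not part of the Spark API.

```python
from itertools import product

# Same candidate values as the ParamGridBuilder above.
reg_params = [0, 0.5, 1]
elastic_net_params = [0, 0.5, 1]
num_folds = 4

# Every combination of the two parameters is one candidate model.
grid = [
    {"regParam": r, "elasticNetParam": e}
    for r, e in product(reg_params, elastic_net_params)
]

# Each candidate is fitted once per fold during the search ...
fits_during_search = len(grid) * num_folds
# ... and the best candidate is refitted once on the whole training set.
total_fits = fits_during_search + 1

print(len(grid), fits_during_search, total_fits)
```

Adding a third parameter with three values would triple the count, which is why grids are usually kept small.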

In [16]:
type(cv)

In [17]:
cv_model = cv.fit(training)

In [18]:
type(cv_model)

In [19]:
cv_model

CrossValidatorModel_9515a65de413

In [20]:
pred_training_cv = cv_model.transform(training)
pred_test_cv = cv_model.transform(test)

In [21]:
# performance on training data
evaluator.setMetricName('r2').evaluate(pred_training_cv)

0.8952845631627804

In [22]:
# performance on test data
evaluator.setMetricName('r2').evaluate(pred_test_cv)

0.9013819610158472

## Intercept and coefficients

In [23]:
print('Intercept: ', cv_model.bestModel.intercept, "\n",
     'coefficients: ', cv_model.bestModel.coefficients)

Intercept:  2.9592600706772787 
 coefficients:  [0.04613729524909818,0.19200356629524312,-0.006269704193266422]
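
The intercept and coefficients fully describe the fitted model: a prediction is the intercept plus the dot product of the coefficients with the feature vector. As a minimal sketch in plain Python, here is that calculation for the first row of the data (TV, Radio, Newspaper). The result need not match the prediction table earlier in the notebook, because that table came from a model fitted on the full dataset.

```python
# Values printed by cv_model.bestModel above.
intercept = 2.9592600706772787
coefficients = [0.04613729524909818, 0.19200356629524312, -0.006269704193266422]

# First row of the Advertising data: TV, Radio, Newspaper.
features = [230.1, 37.8, 69.2]

# Linear model: y_hat = intercept + sum_i(w_i * x_i)
prediction = intercept + sum(w * x for w, x in zip(coefficients, features))
print(prediction)
```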


In [24]:
ad_df.show(5)

+-----------------+-----+
|         features|label|
+-----------------+-----+
|[230.1,37.8,69.2]| 22.1|
| [44.5,39.3,45.1]| 10.4|
| [17.2,45.9,69.3]|  9.3|
|[151.5,41.3,58.5]| 18.5|
|[180.8,10.8,58.4]| 12.9|
+-----------------+-----+
only showing top 5 rows


## Get parameter values from the best model

In [25]:
# Spark 3.0+ exposes the parameter getters on the fitted model, so the private _java_obj is not needed
print('best regParam: ' + str(cv_model.bestModel.getRegParam()) + "\n" +
     'best ElasticNetParam:' + str(cv_model.bestModel.getElasticNetParam()))

best regParam: 0.0
best ElasticNetParam:0.0
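
A `regParam` of 0 means the selected model is unregularized, that is, ordinary least squares. As I understand the Spark documentation, the penalty added to the loss has the elastic-net form `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2^2)`. The sketch below evaluates that expression in plain Python; the function name is hypothetical.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty for a weight vector (hypothetical helper, not a Spark call)."""
    l1 = sum(abs(w) for w in weights)        # L1 norm (lasso part)
    l2_squared = sum(w * w for w in weights)  # squared L2 norm (ridge part)
    return reg_param * (
        elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2_squared
    )


weights = [0.04613729524909818, 0.19200356629524312, -0.006269704193266422]

# With regParam = 0 the penalty vanishes, whatever elasticNetParam is.
print(elastic_net_penalty(weights, 0.0, 0.0))
# elasticNetParam = 1 is pure lasso, elasticNetParam = 0 is pure ridge.
print(elastic_net_penalty(weights, 0.5, 1.0))
print(elastic_net_penalty(weights, 0.5, 0.0))
```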


### Exercise

Wrap the code that builds the cross-validation model in a Python class.

### Exercise

Fit a regression model to forecast `inside_sales` in this dataset: https://github.com/maks-p/restaurant_sales_forecasting/blob/master/csv/CSV_for_EDA_NEW.csv

In [88]:
df_sales = spark.read.csv('./CSV_for_EDA_NEW.csv', header=True, inferSchema=True)
df_sales.show(5)
df_sales.printSchema()

+----------+------------+-------------+-------------+--------------+---------------+-------------+---------------+--------------+---------------+--------------------+--------+--------------------+-------------------+-----------+-----------+--------+-------------------+-----------+
|      date|inside_sales|outside_sales|inside_covers|outside_covers|reserved_covers|walkin_covers|waitlist_covers|no_show_covers|no_show_parties|apparent_temperature|humidity|precip_intensity_max|    precip_max_time|precip_prob|precip_type|pressure|            summary|temperature|
+----------+------------+-------------+-------------+--------------+---------------+-------------+---------------+--------------+---------------+--------------------+--------+--------------------+-------------------+-----------+-----------+--------+-------------------+-----------+
|2017-01-02|    13159.84|          0.0|          174|             0|            106|           26|             42|            17|              6|         

In [89]:
num_cols = [
    "reserved_covers", "walkin_covers", "waitlist_covers",
    "no_show_covers", "no_show_parties",
    "apparent_temperature", "temperature", "humidity",
    "precip_intensity_max", "precip_prob", "pressure",
    "day_of_week", "month"
]
cat_cols = ["precip_type", "summary", "is_weekend"]
label_col = "inside_sales"
target_leakage_cols = ["outside_sales", "outside_covers", "inside_covers"]


In [90]:
df_sales = df_sales.drop(*target_leakage_cols)


In [91]:
from pyspark.sql.functions import dayofweek, month, when, col

df_sales = (
    df_sales
    .withColumn("day_of_week", dayofweek(col("date")))
    .withColumn("month", month(col("date")))
    .withColumn(
        "is_weekend",
        when(dayofweek(col("date")).isin([1, 7]), 1).otherwise(0)
    )
)
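
Spark's `dayofweek` numbers the days from 1 (Sunday) to 7 (Saturday), which is why the weekend test above checks for 1 and 7. The plain-Python sketch below reproduces that convention with the standard `datetime` module so the mapping can be checked without a Spark session; both function names are hypothetical.

```python
from datetime import date


def spark_style_dayofweek(d):
    """Return 1 for Sunday through 7 for Saturday, mirroring Spark's dayofweek."""
    # isoweekday() is 1 (Monday) .. 7 (Sunday); shift it to Sunday-first numbering.
    return d.isoweekday() % 7 + 1


def is_weekend(d):
    """Return 1 for Saturday or Sunday, else 0, as in the is_weekend column."""
    return 1 if spark_style_dayofweek(d) in (1, 7) else 0


for d in (date(2017, 1, 1), date(2017, 1, 2), date(2017, 1, 7)):
    print(d, spark_style_dayofweek(d), is_weekend(d))
```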


In [95]:
from pyspark.sql.functions import col

df_sales = df_sales.withColumn(
    "date_ts",
    col("date").cast("timestamp").cast("long")
)

cutoff_ts = df_sales.approxQuantile(
    "date_ts", [0.8], 0.0
)[0]
train = df_sales.filter(col("date_ts") <= cutoff_ts)
test  = df_sales.filter(col("date_ts") > cutoff_ts)
train.select("date").orderBy("date").show(3)
test.select("date").orderBy("date").show(3)

print("Train max date:", train.selectExpr("max(date)").first()[0])
print("Test min date:", test.selectExpr("min(date)").first()[0])


+----------+
|      date|
+----------+
|2017-01-02|
|2017-01-03|
|2017-01-04|
+----------+
only showing top 3 rows
+----------+
|      date|
+----------+
|2019-01-02|
|2019-01-03|
|2019-01-04|
+----------+
only showing top 3 rows
Train max date: 2018-12-31
Test min date: 2019-01-02
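
The split above is chronological rather than random: every training row is older than every test row, so the model is never evaluated on dates that precede its training data. The sketch below shows the same idea in plain Python on ten made-up dates (not the real dataset). The cutoff rule used here, the smallest value whose rank covers 80% of the rows, is my assumption about how an exact quantile behaves.

```python
import math
from datetime import date, timedelta

# Toy data: ten consecutive days, purely illustrative.
dates = [date(2017, 1, 2) + timedelta(days=i) for i in range(10)]


def chronological_split(sorted_dates, train_fraction=0.8):
    """Split sorted dates at the train_fraction quantile (hypothetical helper)."""
    cutoff_index = math.ceil(train_fraction * len(sorted_dates)) - 1
    cutoff = sorted_dates[cutoff_index]
    train = [d for d in sorted_dates if d <= cutoff]
    test = [d for d in sorted_dates if d > cutoff]
    return train, test


train_dates, test_dates = chronological_split(sorted(dates))
print("Train max date:", max(train_dates))
print("Test min date:", min(test_dates))
```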


In [96]:
from pyspark.ml.feature import StringIndexer

indexers = [
    StringIndexer(
        inputCol=c,
        outputCol=f"{c}_idx",
        handleInvalid="keep"
    ) for c in cat_cols
]


In [97]:
from pyspark.ml.feature import OneHotEncoder

encoders = [
    OneHotEncoder(
        inputCol=f"{c}_idx",
        outputCol=f"{c}_ohe"
    ) for c in cat_cols
]


In [98]:
from pyspark.ml.feature import VectorAssembler

feature_cols = num_cols + [f"{c}_ohe" for c in cat_cols]

assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features"
)


In [99]:
from pyspark.ml.regression import GBTRegressor

gbt = GBTRegressor(
    labelCol=label_col,
    featuresCol="features",
    maxIter=100,
    maxDepth=5,
    stepSize=0.1,
    seed=42
)


In [100]:
from pyspark.ml import Pipeline

pipeline = Pipeline(
    stages=indexers + encoders + [assembler, gbt]
)


In [101]:
df_sales.printSchema()


root
 |-- date: date (nullable = true)
 |-- inside_sales: double (nullable = true)
 |-- reserved_covers: integer (nullable = true)
 |-- walkin_covers: integer (nullable = true)
 |-- waitlist_covers: integer (nullable = true)
 |-- no_show_covers: integer (nullable = true)
 |-- no_show_parties: integer (nullable = true)
 |-- apparent_temperature: double (nullable = true)
 |-- humidity: double (nullable = true)
 |-- precip_intensity_max: double (nullable = true)
 |-- precip_max_time: timestamp (nullable = true)
 |-- precip_prob: double (nullable = true)
 |-- precip_type: string (nullable = true)
 |-- pressure: double (nullable = true)
 |-- summary: string (nullable = true)
 |-- temperature: double (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- is_weekend: integer (nullable = false)
 |-- date_ts: long (nullable = true)



In [102]:
model = pipeline.fit(train)


In [103]:
pred = model.transform(test)
pred.select("inside_sales", "prediction").show(10)


+------------+------------------+
|inside_sales|        prediction|
+------------+------------------+
|    12685.24|14380.140563571931|
|    13580.96|14638.092184099107|
|    16002.75|15642.942736773186|
|    16463.62|16974.901304761686|
|    12405.14|14784.952597362397|
|    14222.67| 13889.11679392458|
|    10811.32|12664.754689285679|
|    12985.09| 14578.97117472451|
|    13196.76|15164.272387610847|
|    14699.25|18205.927344249176|
+------------+------------------+
only showing top 10 rows


In [104]:
from pyspark.ml.evaluation import RegressionEvaluator

rmse_eval = RegressionEvaluator(
    labelCol="inside_sales",
    predictionCol="prediction",
    metricName="rmse"
)

r2_eval = RegressionEvaluator(
    labelCol="inside_sales",
    predictionCol="prediction",
    metricName="r2"
)

print("RMSE:", rmse_eval.evaluate(pred))
print("R2:", r2_eval.evaluate(pred))


RMSE: 2742.785601890637
R2: -0.03688928934671609
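
A negative R2 means the model does worse on the test period than a constant prediction equal to the mean of the test labels. The sketch below makes that comparison explicit in plain Python, using only the ten rows shown in the prediction table above, so it illustrates the idea rather than reproducing the full-test-set score.

```python
# Labels and predictions copied from the ten rows shown above.
labels = [12685.24, 13580.96, 16002.75, 16463.62, 12405.14,
          14222.67, 10811.32, 12985.09, 13196.76, 14699.25]
preds = [14380.140563571931, 14638.092184099107, 15642.942736773186,
         16974.901304761686, 14784.952597362397, 13889.11679392458,
         12664.754689285679, 14578.97117472451, 15164.272387610847,
         18205.927344249176]


def sum_squared_error(y_true, y_pred):
    """Sum of squared differences between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


# Baseline: always predict the mean of the observed labels.
mean_label = sum(labels) / len(labels)
baseline_preds = [mean_label] * len(labels)

sse_model = sum_squared_error(labels, preds)
sse_baseline = sum_squared_error(labels, baseline_preds)

# R2 compares the model's squared error with the baseline's.
r2_model = 1 - sse_model / sse_baseline
print("model SSE > baseline SSE:", sse_model > sse_baseline)
print("R2 on these rows:", r2_model)
```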


# Generalized linear regression

In [27]:
cuse = spark.read.csv('./cuse_binary.csv', header=True, inferSchema=True)
cuse.show(5)

+---+---------+---------+---+
|age|education|wantsMore|  y|
+---+---------+---------+---+
|<25|      low|      yes|  0|
|<25|      low|      yes|  0|
|<25|      low|      yes|  0|
|<25|      low|      yes|  0|
|<25|      low|      yes|  0|
+---+---------+---------+---+
only showing top 5 rows


In [28]:
cuse.columns[0:3]
# cuse.select('age').distinct().show()
cuse.select('age').rdd.countByValue()
# cuse.select('education').rdd.countByValue()

defaultdict(int,
            {Row(age='<25'): 397,
             Row(age='25-29'): 404,
             Row(age='30-39'): 612,
             Row(age='40-49'): 194})

In [29]:
# string-index each categorical string column
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
indexers = [StringIndexer(inputCol=column, outputCol="indexed_"+column) for column in ('age', 'education', 'wantsMore')]
pipeline = Pipeline(stages=indexers)
indexed_cuse = pipeline.fit(cuse).transform(cuse)
indexed_cuse.select('age', 'indexed_age').distinct().show(5)

+-----+-----------+
|  age|indexed_age|
+-----+-----------+
|30-39|        0.0|
|  <25|        2.0|
|25-29|        1.0|
|40-49|        3.0|
+-----+-----------+
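
By default `StringIndexer` assigns indices in descending order of label frequency, so the most common label gets 0.0. The plain-Python sketch below rebuilds the mapping shown above from the age counts printed earlier. Breaking ties alphabetically is an assumption here; there are no ties in this column.

```python
from collections import Counter

# Counts taken from the countByValue() output above.
age_counts = Counter({"<25": 397, "25-29": 404, "30-39": 612, "40-49": 194})

# Most frequent label first; alphabetical order as the (assumed) tie-breaker.
ordered_labels = sorted(age_counts, key=lambda label: (-age_counts[label], label))
index_of = {label: float(i) for i, label in enumerate(ordered_labels)}

print(index_of)
```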



In [30]:
# one-hot encode each indexed categorical column
from pyspark.ml.feature import OneHotEncoder
columns = indexed_cuse.columns[0:3]
onehotencoders = [OneHotEncoder(inputCol="indexed_"+column, outputCol="onehotencode_"+column) for column in columns]
pipeline = Pipeline(stages=onehotencoders)
onehotencode_columns = ['onehotencode_age', 'onehotencode_education', 'onehotencode_wantsMore', 'y']
onehotencode_cuse = pipeline.fit(indexed_cuse).transform(indexed_cuse).select(onehotencode_columns)
onehotencode_cuse.distinct().show(5)

+----------------+----------------------+----------------------+---+
|onehotencode_age|onehotencode_education|onehotencode_wantsMore|  y|
+----------------+----------------------+----------------------+---+
|   (3,[1],[1.0])|             (1,[],[])|         (1,[0],[1.0])|  0|
|   (3,[2],[1.0])|         (1,[0],[1.0])|             (1,[],[])|  1|
|   (3,[0],[1.0])|         (1,[0],[1.0])|         (1,[0],[1.0])|  0|
|       (3,[],[])|         (1,[0],[1.0])|         (1,[0],[1.0])|  1|
|   (3,[2],[1.0])|             (1,[],[])|         (1,[0],[1.0])|  0|
+----------------+----------------------+----------------------+---+
only showing top 5 rows
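
The output uses Spark's sparse vector notation `(size, [indices], [values])`. By default `OneHotEncoder` drops the last category, so a column with four categories becomes a vector of size 3 and the last category is encoded as all zeros, `(3,[],[])`. The sketch below mimics that behavior in plain Python; the function name is hypothetical.

```python
def one_hot_drop_last(index, num_categories):
    """Encode a category index as (size, indices, values), dropping the last category."""
    size = num_categories - 1
    if index < size:
        return (size, [index], [1.0])
    # The last category is the reference level: an all-zero vector.
    return (size, [], [])


# age has four categories (indices 0..3).
print(one_hot_drop_last(2, 4))  # '<25' has index 2
print(one_hot_drop_last(3, 4))  # '40-49' has index 3, the dropped category
# education has two categories (indices 0..1).
print(one_hot_drop_last(1, 2))
```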


In [31]:
# assemble all feature columns into one single vector column
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['onehotencode_age', 'onehotencode_education', 'onehotencode_wantsMore'], outputCol='features')
cuse_df_2 = assembler.transform(onehotencode_cuse).withColumnRenamed('y', 'label')
cuse_df_2.show(5)

+----------------+----------------------+----------------------+-----+-------------------+
|onehotencode_age|onehotencode_education|onehotencode_wantsMore|label|           features|
+----------------+----------------------+----------------------+-----+-------------------+
|   (3,[2],[1.0])|             (1,[],[])|         (1,[0],[1.0])|    0|(5,[2,4],[1.0,1.0])|
|   (3,[2],[1.0])|             (1,[],[])|         (1,[0],[1.0])|    0|(5,[2,4],[1.0,1.0])|
|   (3,[2],[1.0])|             (1,[],[])|         (1,[0],[1.0])|    0|(5,[2,4],[1.0,1.0])|
|   (3,[2],[1.0])|             (1,[],[])|         (1,[0],[1.0])|    0|(5,[2,4],[1.0,1.0])|
|   (3,[2],[1.0])|             (1,[],[])|         (1,[0],[1.0])|    0|(5,[2,4],[1.0,1.0])|
+----------------+----------------------+----------------------+-----+-------------------+
only showing top 5 rows
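
`VectorAssembler` concatenates its input vectors in order, so the indices of each later vector are shifted by the total size of the vectors before it. The plain-Python sketch below applies that rule to the first row above, using `(size, indices, values)` tuples in place of Spark's sparse vectors; the function name is hypothetical.

```python
def assemble(sparse_vectors):
    """Concatenate (size, indices, values) tuples into one sparse tuple."""
    total_size, indices, values = 0, [], []
    for size, idx, vals in sparse_vectors:
        # Shift this vector's indices past everything already assembled.
        indices.extend(total_size + i for i in idx)
        values.extend(vals)
        total_size += size
    return (total_size, indices, values)


row = [
    (3, [2], [1.0]),  # onehotencode_age
    (1, [], []),      # onehotencode_education
    (1, [0], [1.0]),  # onehotencode_wantsMore
]
print(assemble(row))
```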


In [87]:
# split data into training and test datasets
training, test = cuse_df_2.randomSplit([0.8, 0.2], seed=1234)
training.show(5)

+----------------+----------------------+----------------------+-----+---------+
|onehotencode_age|onehotencode_education|onehotencode_wantsMore|label| features|
+----------------+----------------------+----------------------+-----+---------+
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|
+----------------+----------------------+----------------------+-----+---------+
only showing top 5 rows


In [33]:
test.show(5)

+----------------+----------------------+----------------------+-----+---------+
|onehotencode_age|onehotencode_education|onehotencode_wantsMore|label| features|
+----------------+----------------------+----------------------+-----+---------+
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|
+----------------+----------------------+----------------------+-----+---------+
only showing top 5 rows


In [34]:
## ======= build cross validation model ===========

# estimator
from pyspark.ml.regression import GeneralizedLinearRegression
glm = GeneralizedLinearRegression(featuresCol='features', labelCol='label', family='binomial')

# parameter grid
from pyspark.ml.tuning import ParamGridBuilder
param_grid = ParamGridBuilder().\
    addGrid(glm.regParam, [0, 0.5, 1, 2, 4]).\
    build()

# evaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction')

# build cross-validation model
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=glm, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=4)

In [35]:
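# fit model
# NOTE: the model is fitted on the full dataset here, so the test score below
# is not a true out-of-sample estimate; use the commented line for that.
# cv_model = cv.fit(training)
cv_model = cv.fit(cuse_df_2)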

In [36]:
type(cv_model)

In [37]:
# prediction
pred_training_cv = cv_model.transform(training)
pred_test_cv = cv_model.transform(test)

pred_training_cv.show(5)
pred_test_cv.show(5, truncate=False)

+----------------+----------------------+----------------------+-----+---------+------------------+
|onehotencode_age|onehotencode_education|onehotencode_wantsMore|label| features|        prediction|
+----------------+----------------------+----------------------+-----+---------+------------------+
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|0.5140024065151407|
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|0.5140024065151407|
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|0.5140024065151407|
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|0.5140024065151407|
|       (3,[],[])|             (1,[],[])|             (1,[],[])|    0|(5,[],[])|0.5140024065151407|
+----------------+----------------------+----------------------+-----+---------+------------------+
only showing top 5 rows
+----------------+----------------------+----------------------+-----+------

In [38]:
cv_model.bestModel.coefficients

DenseVector([-0.2806, -0.7999, -1.1892, 0.325, -0.833])

In [39]:
cv_model.bestModel.intercept

0.05602427516928616
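
With `family='binomial'` and the default logit link, the `prediction` column is a probability: the logistic function applied to the intercept plus the weighted features. The plain-Python sketch below checks this against the output above: for the all-zero feature vector `(5,[],[])` only the intercept contributes. The coefficients are the rounded values as displayed, so results for other rows are approximate.

```python
import math

# Values printed by cv_model.bestModel above (coefficients are rounded as displayed).
intercept = 0.05602427516928616
coefficients = [-0.2806, -0.7999, -1.1892, 0.325, -0.833]


def predict_probability(features):
    """Logit-link GLM prediction: sigmoid(intercept + w . x)."""
    eta = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-eta))


# Reference row: every one-hot vector is empty, i.e. features = (5,[],[]).
baseline_probability = predict_probability([0, 0, 0, 0, 0])
print(baseline_probability)

# A row with features (5,[2,4],[1.0,1.0]); approximate because of rounding.
print(predict_probability([0, 0, 1, 0, 1]))
```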

In [40]:
evaluator.evaluate(pred_training_cv)

0.6716478245974649

In [41]:
evaluator.evaluate(pred_test_cv)

0.6830864197530864
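
The default metric of `BinaryClassificationEvaluator` is the area under the ROC curve. One way to read that number: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way in plain Python on four made-up examples; Spark's own computation works from the ROC curve and may differ slightly on large data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (hypothetical helper)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy data, purely illustrative.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.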

### Exercise

1. Fit a generalized linear regression model to forecast `inside_sales` in this dataset: https://github.com/maks-p/restaurant_sales_forecasting/blob/master/csv/CSV_for_EDA_NEW.csv

2. Wrap your code in a pipeline as a Python class.