## Exercise: ML Workflows

Perform a train/test split on a dataset, create a baseline model, and evaluate the result.  Optionally, try to beat the baseline by training a linear regression model.

Run the following cell to set up our environment.

In [0]:
%run "./Includes/Classroom-Setup"

In [0]:
%fs ls dbfs:/databricks-datasets/bikeSharing/data-001/hour.csv

path,name,size,modificationTime
dbfs:/databricks-datasets/bikeSharing/data-001/hour.csv,hour.csv,1156736,1455505275000


### Step 1: Train/Test Split

Import the bike sharing dataset and take a look at what's in it.  This dataset contains the number of bikes rented (`cnt`) by season, year, month, and hour, along with a number of weather conditions.

In [0]:
bikeDF = (spark
  .read
  .option("header", True)
  .option("inferSchema", True)
  .csv("dbfs:/databricks-datasets/bikeSharing/data-001/hour.csv")
  .drop("instant", "dteday", "casual", "registered", "holiday", "weekday") # Drop unnecessary features
)

display(bikeDF)

season,yr,mnth,hr,workingday,weathersit,temp,atemp,hum,windspeed,cnt
1,0,1,0,0,1,0.24,0.2879,0.81,0.0,16
1,0,1,1,0,1,0.22,0.2727,0.8,0.0,40
1,0,1,2,0,1,0.22,0.2727,0.8,0.0,32
1,0,1,3,0,1,0.24,0.2879,0.75,0.0,13
1,0,1,4,0,1,0.24,0.2879,0.75,0.0,1
1,0,1,5,0,2,0.24,0.2576,0.75,0.0896,1
1,0,1,6,0,1,0.22,0.2727,0.8,0.0,2
1,0,1,7,0,1,0.2,0.2576,0.86,0.0,3
1,0,1,8,0,1,0.24,0.2879,0.75,0.0,8
1,0,1,9,0,1,0.32,0.3485,0.76,0.0,14


Perform a train/test split.  Put 70% of the data into `trainBikeDF` and 30% into `testBikeDF`.  Use a seed of `42` so you have the same split every time you perform the operation.
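
As rough intuition for why the split sizes come out only approximately 70/30 (and why the test cell checks row-count ranges rather than exact numbers), the sketch below mimics a seeded, per-row random assignment in plain Python. It is an illustration under that assumption, not Spark's actual implementation; `bernoulli_split` and the 10,000 stand-in rows are hypothetical.

```python
import random


def bernoulli_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test with an independent, seeded random draw."""
    rng = random.Random(seed)  # fixed seed makes the draws repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


# Hypothetical stand-in for the rows of a DataFrame
rows = list(range(10000))
train_rows, test_rows = bernoulli_split(rows)

# Sizes are close to 70/30 but not exact, because each row is drawn independently
print(len(train_rows), len(test_rows))
```

Running the function twice with the same seed and the same row order yields the same split. In Spark, the same caveat about input order applies: a fixed seed reproduces the split only when the underlying data and its partitioning are unchanged.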

In [0]:
# TODO
trainBikeDF, testBikeDF = bikeDF.randomSplit([0.7, 0.3], seed=42)

In [0]:
# Define the dbTest function
def dbTest(test_name, expected, result):
    assert result == expected, f"Test {test_name} failed: expected {expected}, got {result}"

# TEST - Run this cell to test your solution
_traincount = trainBikeDF.count()
_testcount = testBikeDF.count()

dbTest("ML1-P-03-01-01", True, _traincount < 13000 and _traincount > 12000)
dbTest("ML1-P-03-01-02", True, _testcount < 5500 and _testcount > 4800)

print("Tests passed!")

Tests passed!


### Step 2: Create a Baseline Model

Calculate the average of the column `cnt` in the training set and save it to the variable `avgTrainCnt`.  Then create a new DataFrame `bikeTestPredictionDF` that appends to the test set a new column `prediction` holding the value of `avgTrainCnt`.

### Baseline Model Concept

**Purpose of a Baseline Model:**

The baseline model serves as a benchmark to assess whether more complex models provide meaningful improvements.
In the context of regression, this benchmark prediction can be as simple as using the average of the target variable across all observations in the training data.
By evaluating this simple prediction, practitioners get an idea of how well the model can perform without incorporating patterns or relationships between the input features and the target variable.

**When to Use a Baseline Model:**

- At the start of the modeling process, to establish a straightforward reference for future comparisons.
- To understand the variance in the target variable: if the baseline performs well, it may indicate limited room for improvement with more complex models.
- In regression tasks, calculating the baseline as the average value of the target variable is particularly common and effective.
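
To make the link between the baseline and the variance of the target concrete, here is a small plain-Python sketch. It uses only the first ten `cnt` values from the sample displayed in Step 1, purely as an illustration; `mse_of_constant` is a hypothetical helper, not part of the exercise. On the data used to compute it, the mean's MSE equals the population variance, and any other constant prediction scores worse.

```python
from statistics import mean, pvariance

# First ten `cnt` values from the displayed sample of bikeDF (illustration only)
cnt_sample = [16, 40, 32, 13, 1, 1, 2, 3, 8, 14]


def mse_of_constant(values, constant):
    """Mean squared error when every prediction equals `constant`."""
    return sum((v - constant) ** 2 for v in values) / len(values)


baseline = mean(cnt_sample)
baseline_mse = mse_of_constant(cnt_sample, baseline)
shifted_mse = mse_of_constant(cnt_sample, baseline + 3)  # any other constant

print("baseline prediction:", baseline)
print("MSE of baseline:", baseline_mse)
print("population variance:", pvariance(cnt_sample))
print("MSE of a shifted constant:", shifted_mse)
```

This is why the baseline's error is a useful yardstick: a model that uses the features is only worthwhile if it explains some of that variance.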

In [0]:
from pyspark.sql.functions import avg, lit

# Assuming trainBikeDF and testBikeDF are already defined
# Calculate the average count from the training dataset
avgTrainCnt = trainBikeDF.select(avg("cnt")).first()[0]

# Create a baseline model by adding a column with the average count to the test dataset
bikeTestPredictionDF = testBikeDF.withColumn("prediction", lit(avgTrainCnt))

# Display the resulting DataFrame
display(bikeTestPredictionDF)

season,yr,mnth,hr,workingday,weathersit,temp,atemp,hum,windspeed,cnt,prediction
1,0,1,0,0,1,0.1,0.0758,0.42,0.3881,25,188.89937878044796
1,0,1,0,0,1,0.24,0.2879,0.81,0.0,16,188.89937878044796
1,0,1,0,0,2,0.18,0.197,0.51,0.1642,25,188.89937878044796
1,0,1,0,0,2,0.2,0.197,0.47,0.2239,17,188.89937878044796
1,0,1,0,1,1,0.12,0.1364,0.5,0.194,14,188.89937878044796
1,0,1,0,1,1,0.14,0.1212,0.59,0.2836,7,188.89937878044796
1,0,1,0,1,1,0.14,0.1667,0.59,0.1045,12,188.89937878044796
1,0,1,0,1,1,0.22,0.197,0.44,0.3582,5,188.89937878044796
1,0,1,0,1,2,0.16,0.1364,0.69,0.2836,9,188.89937878044796
1,0,1,0,1,2,0.2,0.197,0.64,0.194,17,188.89937878044796


In [0]:
# TEST - Run this cell to test your solution
dbTest("ML1-P-03-02-01", True, avgTrainCnt < 195 and avgTrainCnt > 180)
dbTest("ML1-P-03-02-02", True, "prediction" in bikeTestPredictionDF.columns)

print("Tests passed!")

Tests passed!


### Step 3: Evaluate the Result

Evaluate the result using `mse` as the error metric.  Save the result to `testError`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Your baseline prediction will not be very accurate.  Store the MSE itself in `testError`, since the test cell checks that value.  To interpret the error in the original units (that is, bike counts), take the square root of the MSE.

In [0]:
# TODO
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(labelCol="cnt", predictionCol="prediction", metricName="mse")
testError = evaluator.evaluate(bikeTestPredictionDF)
print(testError)

33098.35537800063
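
To read this number in the original units, the short sketch below takes the square root of the printed MSE and compares it with the baseline prediction from Step 2. Both constants are copied from the outputs shown in this notebook; the variable names are illustrative only.

```python
import math

mse = 33098.35537800063         # MSE printed by the evaluator above
mean_cnt = 188.89937878044796   # baseline prediction shown in Step 2

rmse = math.sqrt(mse)  # back in units of bikes rented per hour
print(f"RMSE: {rmse:.2f}")
print(f"RMSE relative to the baseline prediction: {rmse / mean_cnt:.2f}")
```

The RMSE comes out at about 182 bikes, nearly as large as the baseline prediction itself, which shows how coarse a single constant prediction is. `RegressionEvaluator` also accepts `metricName="rmse"` if you prefer to compute this directly in Spark.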


In [0]:
# TEST - Run this cell to test your solution
dbTest("ML1-P-03-03-01", True, testError > 33000 and testError < 35000)

print("Tests passed!")

Tests passed!


### Step 4 (Optional): Beat the Baseline

Use a linear regression model (explored in the previous lesson) to beat the baseline model score.