# Module 5 - Exercise overview


### Fabric Prerequisites

You need to have a Lakehouse enabled and connected.

Links to the Lakehouse (replace these strings with your own):
- Tables: `abfss://Fabric_2024@onelake.dfs.fabric.microsoft.com/LK_flights.Lakehouse/Tables`
- Files: `abfss://Fabric_2024@onelake.dfs.fabric.microsoft.com/LK_flights.Lakehouse/Files`
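If you would rather build these paths in code than edit the strings by hand, a small helper along the following lines may be useful. This is only a sketch: the function name `onelake_path` and its parameters are hypothetical, and the URI layout is copied from the two example links above rather than from any official specification.

```python
def onelake_path(workspace: str, lakehouse: str, area: str, relative: str = "") -> str:
    """Build an ABFSS path that follows the pattern of the example links above.

    workspace and lakehouse are the names you would substitute into the links;
    area is either "Tables" or "Files"; relative is an optional sub-path.
    """
    if area not in ("Tables", "Files"):
        raise ValueError("area must be 'Tables' or 'Files'")
    base = (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/{area}"
    )
    relative = relative.strip("/")
    return f"{base}/{relative}" if relative else base


# Reproduce the two example links from this section.
tables_root = onelake_path("Fabric_2024", "LK_flights", "Tables")
files_root = onelake_path("Fabric_2024", "LK_flights", "Files")
print(tables_root)
print(files_root)
```

Replace `Fabric_2024` and `LK_flights` with your own workspace and Lakehouse names.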

You will also need:
- A PySpark notebook connected to the Fabric standard session

## Exercise 1: Test Model Predictions and Perform Inference Using PySpark

In this exercise, you will test the predictions of a previously trained machine learning model and perform inference on new data using PySpark.

Step-by-Step Instructions:

1) Load the Trained Model:
    - In Microsoft Fabric, load the trained model from the previous exercise (e.g., a saved Linear Regression model) using PySpark.

2) Load New Data for Inference:
    - Load new data into the notebook, which the model will use for inference (i.e., predicting values based on unseen data).

3) Perform Inference and Evaluate Predictions:
    - Use the model to generate predictions on the new data, and evaluate the predictions using metrics like RMSE (Root Mean Squared Error) and R-squared.
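To make the two metrics in step 3 concrete, here is a minimal plain-Python sketch of the standard textbook definitions of RMSE and R-squared. The function names and the four sample values are made up for illustration; in the exercise itself the metrics are computed by Spark's `RegressionEvaluator`.

```python
import math


def rmse(labels, predictions):
    """Root Mean Squared Error: square root of the mean squared residual."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(labels, predictions):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not taken from the course data.
labels = [3.0, 5.0, 7.0, 9.0]
preds = [2.5, 5.5, 7.0, 8.0]

print(f"RMSE: {rmse(labels, preds):.4f}")
print(f"R-squared: {r_squared(labels, preds):.4f}")
```

A lower RMSE means smaller typical errors in the units of the label, while an R-squared closer to 1 means the model explains more of the variance in the label.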

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegressionModel
from pyspark.ml.evaluation import RegressionEvaluator

# Initialize Spark Session
spark = SparkSession.builder.appName("ModelInference").getOrCreate()

# Step 1: Load the trained Linear Regression model from disk (assuming it was saved earlier)
model_path = "path/to/saved_model"
lr_model = LinearRegressionModel.load(model_path)

# Step 2: Load new data for inference (Assuming new sales data)
new_data_path = "Files/Users/new_sales_data.csv"
new_df = spark.read.csv(new_data_path, header=True, inferSchema=True)

# Feature engineering on new data (assuming same transformations as in training)
from pyspark.sql.functions import col, month, year  # col is needed for col("date") below
new_df = new_df.withColumn("year", year(col("date"))).withColumn("month", month(col("date")))

# Assuming feature vector is created (similar to how it was done during model training)
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["year", "month", "region_encoded"], outputCol="features")
new_df_transformed = assembler.transform(new_df)

# Step 3: Perform Inference (Generate predictions)
predictions = lr_model.transform(new_df_transformed)

# Step 4: Evaluate predictions using RegressionEvaluator
evaluator = RegressionEvaluator(labelCol="sales_amount", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})

print(f"Root Mean Squared Error (RMSE) on new data: {rmse}")
print(f"R-squared on new data: {r2}")

# Show sample predictions
predictions.select("features", "sales_amount", "prediction").show(5)
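The inference code above assumes that the new CSV already contains the `date`, `region_encoded`, and `sales_amount` columns. A quick pre-flight check can make a missing column fail early with a readable message. The helper below is a hypothetical sketch in plain Python; in the notebook you would pass `new_df.columns` as the first argument.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in the order given."""
    present = set(available)
    return [name for name in required if name not in present]


# Hypothetical column list; in the notebook this would be new_df.columns.
available = ["date", "region", "sales_amount"]
required = ["date", "region_encoded", "sales_amount"]

missing = missing_columns(available, required)
if missing:
    print(f"Missing columns: {missing}")
else:
    print("All required columns are present.")
```

If the check reports `region_encoded` as missing, the encoding step from training has to be applied to the new data before the features are assembled.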


## Exercise 2: Job Scheduling and Running Jobs for Model Inference

In this exercise, you will automate the inference process by scheduling a job to run the inference notebook periodically using Microsoft Fabric's job scheduling capabilities.

Step-by-Step Instructions:

1) Create a Notebook for Inference:
    - Create a notebook that loads the trained model, performs inference on new data, and saves the predictions to a Lakehouse.

2) Schedule the Job in Microsoft Fabric:
    - Use the Job Scheduling feature in Microsoft Fabric to schedule the notebook for periodic execution (e.g., daily or weekly).

3) Monitor and Manage Scheduled Jobs:
    - Monitor the job's performance, check logs, and ensure the notebook is running successfully at scheduled intervals.
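Monitoring is easier when the notebook itself writes clear log lines for each step. The sketch below uses only Python's standard `logging` module. The `run_step` helper and the logger name are hypothetical, and the example assumes that an unhandled exception causes the scheduled run to be reported as failed rather than silently succeeding.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("scheduled_inference")


def run_step(name, func):
    """Run one step, log how long it took, and re-raise any error."""
    start = time.perf_counter()
    logger.info("Starting step: %s", name)
    try:
        result = func()
    except Exception:
        # Log the traceback, then re-raise so the failure is not swallowed.
        logger.exception("Step failed: %s", name)
        raise
    logger.info("Finished step: %s (%.2f s)", name, time.perf_counter() - start)
    return result


# Stand-in for a real step such as loading the model or writing predictions.
row_count = run_step("count rows", lambda: len([1, 2, 3]))
print(row_count)
```

In the inference notebook, each stage (load model, load data, predict, save) could be wrapped in `run_step` so that the run output shows which stage was reached and how long it took.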

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegressionModel
from pyspark.ml.evaluation import RegressionEvaluator

# Initialize Spark Session
spark = SparkSession.builder.appName("ScheduledModelInference").getOrCreate()

# Step 1: Load the trained model
model_path = "path/to/saved_model"
lr_model = LinearRegressionModel.load(model_path)

# Step 2: Load new data for inference
new_data_path = "Files/Users/new_sales_data.csv"
new_df = spark.read.csv(new_data_path, header=True, inferSchema=True)

# Perform same feature engineering as training
from pyspark.sql.functions import col, month, year  # col is needed for col("date") below
new_df = new_df.withColumn("year", year(col("date"))).withColumn("month", month(col("date")))

# Assemble features for prediction
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["year", "month", "region_encoded"], outputCol="features")
new_df_transformed = assembler.transform(new_df)

# Perform inference
predictions = lr_model.transform(new_df_transformed)

# Save the predictions to the Lakehouse for further analysis
predictions.write.format("delta").mode("overwrite").save("Tables/Predictions_Lakehouse")

# Print confirmation
print("Scheduled job executed successfully, predictions saved to Lakehouse.")


## Exercise 3: Model Retraining Using MLflow and Job Scheduling