# Module 5

### Fabric Prerequisites

You need a Lakehouse that is enabled and connected.

Lakehouse paths (replace these strings with your own):
- Tables: `abfss://Fabric_2024@onelake.dfs.fabric.microsoft.com/LK_flights.Lakehouse/Tables`
- Files: `abfss://Fabric_2024@onelake.dfs.fabric.microsoft.com/LK_flights.Lakehouse/Files`
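
Both paths follow the same pattern, so when replacing the strings it can help to build them from the workspace and Lakehouse names. A minimal sketch, assuming the pattern visible in the two paths above also holds for your workspace; `onelake_path` is a hypothetical helper for illustration, not part of any Fabric API:

```python
def onelake_path(workspace: str, lakehouse: str, area: str) -> str:
    """Build an abfss:// path to the Tables or Files area of a Lakehouse."""
    if area not in ("Tables", "Files"):
        raise ValueError("area must be 'Tables' or 'Files'")
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/{area}"
    )


# Workspace and Lakehouse names taken from the example paths above
tables_path = onelake_path("Fabric_2024", "LK_flights", "Tables")
files_path = onelake_path("Fabric_2024", "LK_flights", "Files")
print(tables_path)
print(files_path)
```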

You will also need:
- A PySpark notebook connected to the Fabric standard session

Data:
- Delta tables created from the flights data

## Step 1: Set Up MLFlow in Microsoft Fabric

MLFlow will allow us to track, compare, and manage experiments for our machine learning models.

Initialize MLFlow in your notebook environment.

In [None]:
import mlflow
import mlflow.spark

# Set the experiment name for MLFlow tracking
mlflow.set_experiment("FlightDelayPrediction")


## Step 2: Load Flight Data from the Lakehouse

We’ll use flight data stored in a Delta table within your Lakehouse. The process of loading the data will remain the same as in previous modules.

In [None]:
# Get the Spark session (Fabric notebooks already provide one; getOrCreate() reuses it)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FlightDelayPrediction").getOrCreate()

# Path to the Delta table; replace the placeholders with your own values,
# for example the Tables path from the prerequisites followed by the table name
lakehouse_table_path = "abfss://<your-container>@<your-storage-account>.dfs.core.windows.net/delta/flight_data"

# Read the flight data from the Delta table
df_flight = spark.read.format("delta").load(lakehouse_table_path)

# Display the first few rows of the flight data
df_flight.show(5)


## Step 3: Quick and Simple EDA

Check for missing data and count the flights in each class of the target column `is_delay`.

In [None]:
from pyspark.sql.functions import col, count, when

# Check for missing data by counting null values in each column
missing_data = df_flight.select([count(when(col(c).isNull(), c)).alias(c) for c in df_flight.columns])
missing_data.show()

In [None]:
df_flight.groupBy("is_delay").count().show()
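
The class counts are worth a look because a strongly imbalanced target makes a model look better than it is: always predicting the majority class is already right for that share of the flights. A small sketch of the arithmetic in plain Python; the counts are made-up placeholders, not results from the flights data, and `class_shares` is a hypothetical helper:

```python
def class_shares(counts: dict) -> dict:
    """Turn per-class row counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must not be empty")
    return {label: n / total for label, n in counts.items()}


# Placeholder counts in the shape returned by groupBy("is_delay").count()
counts = {0: 800, 1: 200}
shares = class_shares(counts)

# Share of rows a "always predict the majority class" rule would get right
majority_share = max(shares.values())
print(shares, majority_share)
```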

## Step 4: Data Preparation for the Model

Before building our machine learning model, we need to prepare the data by transforming it into a format suitable for modeling.

Index categorical columns and assemble features:

In [None]:
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Index categorical columns (e.g., carrier, origin, destination)
indexer_carrier = StringIndexer(inputCol="carrier", outputCol="carrier_index")
indexer_origin = StringIndexer(inputCol="origin", outputCol="origin_index")
indexer_dest = StringIndexer(inputCol="destination", outputCol="dest_index")

df_flight = indexer_carrier.fit(df_flight).transform(df_flight)
df_flight = indexer_origin.fit(df_flight).transform(df_flight)
df_flight = indexer_dest.fit(df_flight).transform(df_flight)

# Assemble all features into a single vector column
assembler = VectorAssembler(inputCols=["carrier_index", "origin_index", "dest_index", "departure_time"], outputCol="features")
df_flight = assembler.transform(df_flight)

# Select relevant columns for modeling
df_flight = df_flight.select("features", "is_delay")
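
To see what the indexing step does to a column: by default `StringIndexer` orders labels by descending frequency, so the most common value gets index 0.0. The plain-Python sketch below mimics that mapping on a toy list. `frequency_index` is a hypothetical helper and the carrier codes are made up, so treat it as a mental model and not as Spark's implementation:

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically for a stable order
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


carriers = ["AA", "DL", "AA", "UA", "DL", "AA"]
mapping = frequency_index(carriers)
print(mapping)  # {'AA': 0.0, 'DL': 1.0, 'UA': 2.0}
```

Because the mapping depends on the data the indexer was fitted on, the same fitted indexer has to be applied to any data the model scores later.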


## Step 5: Build a Machine Learning Model

In this step, we’ll build a Logistic Regression model to predict whether a flight will be delayed.

Split Data into Training and Test Sets:

In [None]:
# Split the data into training (80%) and testing (20%) sets
train_df, test_df = df_flight.randomSplit([0.8, 0.2], seed=42)


Define the Logistic Regression Model:

In [None]:
from pyspark.ml.classification import LogisticRegression

# Initialize Logistic Regression model
lr_model = LogisticRegression(featuresCol="features", labelCol="is_delay")


## Step 6: Run Experiments Using MLFlow

Now, we’ll use MLFlow to log model parameters, track metrics, and save the trained models.

Start an MLFlow Experiment:

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluator for AUC (Area Under the ROC Curve); it is reused in Step 7
evaluator = BinaryClassificationEvaluator(labelCol="is_delay", metricName="areaUnderROC")

# Start an MLFlow run for tracking; keep a handle so the run ID can be read
with mlflow.start_run() as run:
    # Train the Logistic Regression model
    lr_fitted = lr_model.fit(train_df)

    # Log model parameters
    mlflow.log_param("model_type", "Logistic Regression")

    # Make predictions on the test data
    predictions = lr_fitted.transform(test_df)

    # Evaluate on the test set and log the metric
    auc = evaluator.evaluate(predictions)
    mlflow.log_metric("AUC", auc)

    # Log the trained model
    mlflow.spark.log_model(lr_fitted, "logistic_regression_model")

    # The run ID is needed to register this model in Step 8
    print(f"Run {run.info.run_id} complete with AUC: {auc}")


Explanation:

* _mlflow.start_run()_ starts an MLFlow run, which is a session to log the parameters, metrics, and models.
* _mlflow.log_param()_ logs model parameters like model type.
* _mlflow.log_metric()_ tracks metrics like AUC (Area Under the ROC Curve).
* _mlflow.spark.log_model()_ saves the trained model into the MLFlow tracking system.
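
To make the AUC metric concrete: it can be read as the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen on-time flight. The sketch below computes that pairwise definition in plain Python on a handful of made-up labels and scores. It illustrates the metric only and is not how `BinaryClassificationEvaluator` computes it internally; `pairwise_auc` is a hypothetical helper:

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = delayed, 0 = on time
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.8]
auc_value = pairwise_auc(labels, scores)
print(round(auc_value, 4))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the logged AUC values can be read.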

## Step 7: Compare Models Across Experiments

MLFlow automatically keeps track of all your experiments, so you can compare multiple models to find the best-performing one.



Track Multiple Experiments:

Let’s run another experiment with a Decision Tree Classifier and compare the results with our Logistic Regression model.

In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

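# Initialize the Decision Tree Classifier.
# Indexed categorical features need maxBins >= their number of distinct values
# (the default is 32), so set maxBins if, for example, there are more than 32 origins.
dt_model = DecisionTreeClassifier(featuresCol="features", labelCol="is_delay")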

# Start a new MLFlow run for the Decision Tree model
with mlflow.start_run():
    # Train the Decision Tree model
    dt_fitted = dt_model.fit(train_df)
    
    # Log the model type as Decision Tree
    mlflow.log_param("model_type", "Decision Tree")
    
    # Make predictions on the test data
    predictions = dt_fitted.transform(test_df)
    
    # Calculate AUC and log the metric
    auc = evaluator.evaluate(predictions)
    mlflow.log_metric("AUC", auc)
    
    # Log the trained model
    mlflow.spark.log_model(dt_fitted, "decision_tree_model")
    
    print(f"Experiment complete with AUC: {auc}")


You can now compare the AUC for both models to determine which one performed better.
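
If you collect each run's model type and AUC as you go, picking the winner is a one-liner. A minimal sketch in plain Python, assuming you keep the results in a list of dictionaries; the AUC values are placeholders and not results from the flights data, and `best_run` is a hypothetical helper:

```python
def best_run(runs, metric="AUC"):
    """Return the run record with the highest value for the given metric."""
    if not runs:
        raise ValueError("no runs to compare")
    return max(runs, key=lambda run: run[metric])


# Placeholder records mirroring what each run logs (model_type and AUC)
runs = [
    {"model_type": "Logistic Regression", "AUC": 0.70},
    {"model_type": "Decision Tree", "AUC": 0.65},
]
winner = best_run(runs)
print(f"Best model: {winner['model_type']} (AUC {winner['AUC']})")
```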

## Step 8: Model Versioning and Productionization

Once you have identified the best model, you can register it for production, track its versions, and use it in different environments (test, dev, prod).

1) Register the Model in MLFlow for versioning and deployment.

In [None]:
# Register the Logistic Regression model as a versioned model in MLFlow.
# Replace <run-id> with the ID of the run that logged the model.
mlflow.register_model("runs:/<run-id>/logistic_regression_model", "FlightDelayPredictionModel")

2) Promote the Best Model:

Use MLFlow to promote the best model to the production environment. This helps in managing models across test, dev, and prod environments.

## Step 9: Inference and Predictions Using the Trained Model

Once your model is registered in MLFlow, you can load it for inference and make predictions on new flight data.

1) Load the Registered Model for Inference:

In [None]:
# Load the model from MLFlow for inference
import mlflow.spark

# Load the registered model as a Spark pipeline so it can score Spark DataFrames.
# This URI resolves only once a model version has been promoted to production (Step 8).
model_uri = "models:/FlightDelayPredictionModel/production"
loaded_model = mlflow.spark.load_model(model_uri)

# Example: Predict delay for a new flight.
# departure_time must be numeric and encoded like the training data (1600 is a placeholder).
new_flight_data = spark.createDataFrame([
    ("AA", "LAX", "JFK", 1600)
], ["carrier", "origin", "destination", "departure_time"])

# Fit the indexers on the training source, not on the new rows, so each
# category maps to the same index the model saw during training
df_source = spark.read.format("delta").load(lakehouse_table_path)
for indexer in (indexer_carrier, indexer_origin, indexer_dest):
    new_flight_data = indexer.fit(df_source).transform(new_flight_data)
new_flight_data = assembler.transform(new_flight_data)

# Predict using the loaded model
new_flight_predictions = loaded_model.transform(new_flight_data)
new_flight_predictions.select("features", "prediction", "probability").show()
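
One caveat for new data: `VectorAssembler` accepts only numeric, boolean, and vector columns, so a departure time that arrives as text such as `"16:00"` has to be converted to a number before the features are assembled. A minimal sketch, assuming an `HH:MM` input string and an integer `HHMM` encoding; `hhmm_to_int` is a hypothetical helper, and the encoding must match whatever the training table actually uses:

```python
def hhmm_to_int(value: str) -> int:
    """Convert an 'HH:MM' string to an integer such as 1600."""
    hours_text, minutes_text = value.split(":")
    hours, minutes = int(hours_text), int(minutes_text)
    if not (0 <= hours <= 23 and 0 <= minutes <= 59):
        raise ValueError(f"not a valid time: {value!r}")
    return hours * 100 + minutes


print(hhmm_to_int("16:00"))  # 1600
print(hhmm_to_int("07:05"))  # 705
```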
