# Track Machine Learning experiments and models

A machine learning model is a file that has been trained to recognize certain types of patterns. You train a model on a dataset by giving it an algorithm it can use to learn from that data. Once trained, the model can make predictions about data it has not seen before.
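
As a minimal sketch of the train-then-predict cycle described above, the following fits a one-feature linear model with NumPy least squares and evaluates it on rows held out from training. The synthetic data, the 150/50 split, and all variable names are assumptions made for illustration; none of them come from the notebook's dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration): profit is roughly 0.2 * sales plus noise.
rng = np.random.default_rng(0)
sales = rng.uniform(10.0, 500.0, size=200)
profit = 0.2 * sales + rng.normal(0.0, 5.0, size=200)

# Hold out the last 50 rows so the model is evaluated on data it has not seen.
train_x, test_x = sales[:150], sales[150:]
train_y, test_y = profit[:150], profit[150:]

# "Training": solve for slope and intercept by ordinary least squares.
design = np.column_stack([train_x, np.ones_like(train_x)])
(slope, intercept), *_ = np.linalg.lstsq(design, train_y, rcond=None)

# "Prediction": apply the learned parameters to the unseen rows.
pred_y = slope * test_x + intercept

# Mean squared error and R^2 on the held-out rows.
mse = np.mean((test_y - pred_y) ** 2)
r2 = 1.0 - np.sum((test_y - pred_y) ** 2) / np.sum((test_y - test_y.mean()) ** 2)
print(f"slope={slope:.3f} intercept={intercept:.3f} mse={mse:.2f} r2={r2:.3f}")
```

Scoring on held-out rows is a deliberate choice in this sketch: metrics computed on the training rows describe how well the model fits, not how well it predicts new data.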

In this notebook, you will learn the basic steps to run an experiment, track the run's metrics and parameters, and register a model as a new model version.


In [None]:
df = spark.sql("SELECT * FROM lakehouseTraining.`superstore sales dataset` LIMIT 15")
display(df)

In [1]:
import mlflow
import mlflow.sklearn
import pandas as pd
import re
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from mlflow.models.signature import infer_signature
from sklearn.ensemble import IsolationForest

# Initialize Spark session
spark = SparkSession.builder.appName("LakehouseTraining").getOrCreate()

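# Enable Arrow-based transfer for toPandas(); this is the Spark 3.x key
# (the older "spark.sql.execution.arrow.enabled" name is deprecated)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")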

# Load the table into a Spark DataFrame
df = spark.sql("SELECT * FROM lakehouseTraining.`superstore sales dataset` LIMIT 1000")

# Convert Spark DataFrame to Pandas DataFrame
pandas_df = df.toPandas()

# Clean the 'Sales' column: keep only digits and the decimal point
# (drops currency symbols and thousands separators)
pandas_df['Sales'] = pandas_df['Sales'].apply(lambda x: re.sub(r'[^0-9.]', '', str(x)))

# Convert 'Sales' and 'Profit' columns to numeric
pandas_df['Sales'] = pd.to_numeric(pandas_df['Sales'], errors='coerce')
pandas_df['Profit'] = pd.to_numeric(pandas_df['Profit'], errors='coerce')

# Drop rows with NaN values
pandas_df.dropna(subset=['Sales', 'Profit'], inplace=True)

# Check if the DataFrame is empty after cleaning
if pandas_df.empty:
    print("The DataFrame is empty after cleaning. Please check your data.")
else:
    # Use 'Sales' as the feature and 'Profit' as the label
    X = pandas_df[['Sales']].values
    y = pandas_df['Profit'].values

    # Set given experiment as the active experiment. If an experiment with this name does not exist, a new experiment with this name is created.
    mlflow.set_experiment("your-experiment-name")

    # Start your training job with `start_run()`
    with mlflow.start_run() as run:
        # Set tags on this run
        mlflow.set_tag("project", "MLTest")
        mlflow.set_tag("author", "Your Name")

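        # Fit the model and predict on the training data
        lr = LinearRegression()
        lr.fit(X, y)
        y_pred = lr.predict(X)

        # Infer the input/output schema from the features and the model's predictions
        signature = infer_signature(X, y_pred)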

        # Log metrics (computed on the training data, so they measure fit, not generalization)
        mlflow.log_metric("mean_squared_error", mean_squared_error(y, y_pred))
        mlflow.log_metric("r2_score", r2_score(y, y_pred))

        # Log the fitted intercept and coefficients as parameters (converted to plain Python types)
        mlflow.log_param("intercept", float(lr.intercept_))
        mlflow.log_param("coef", lr.coef_.tolist())

        # Log model
        mlflow.sklearn.log_model(lr, "MLTest", signature=signature)
        print(f"Model saved in run_id={run.info.run_id}")

        # Register the logged model as "MLTest"; registering an existing name creates a new version
        mlflow.register_model(f"runs:/{run.info.run_id}/MLTest", "MLTest")
        print("All done")

    # Initialize the Isolation Forest model; contamination=0.01 flags about 1% of rows as anomalies
    iso_forest = IsolationForest(contamination=0.01, random_state=42)

    # Fit the model on Sales and Profit and label each row
    pandas_df['anomaly'] = iso_forest.fit_predict(pandas_df[['Sales', 'Profit']])

    # -1 indicates anomaly, 1 indicates normal
    anomalies = pandas_df[pandas_df['anomaly'] == -1]

    print("Number of anomalies detected:", len(anomalies))
    print(anomalies)

Registered model 'MLTest' already exists. Creating a new version of this model...
2024/10/21 16:40:21 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: MLTest, version 4
Created version '4' of model 'MLTest'.


Model saved in run_id=1e16900c-607f-4ab2-bbf0-c952ee926e68
All done
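
Once a version is registered, it can be loaded back by name and version. The following is a hedged sketch rather than part of the original notebook: the helper names `registered_model_uri` and `predict_with_registered_model` are hypothetical, and it assumes the session can reach the same model registry that produced the log output above.

```python
def registered_model_uri(name: str, version: int) -> str:
    """Build an MLflow Model Registry URI of the form models:/<name>/<version>."""
    return f"models:/{name}/{version}"


def predict_with_registered_model(name: str, version: int, features):
    """Load a registered model version and return its predictions for `features`."""
    # Imported here so the URI helper above also works where MLflow is not installed.
    import mlflow.pyfunc

    model = mlflow.pyfunc.load_model(registered_model_uri(name, version))
    return model.predict(features)


# Model name and version taken from the log output above.
print(registered_model_uri("MLTest", 4))
```

In a session connected to that registry, `predict_with_registered_model("MLTest", 4, features)` would return predictions, where `features` is assumed to be a NumPy array of shape `(n, 1)` to match the single `Sales` feature used in training.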




Number of anomalies detected: 9
           Order_ID  Order_Date   Ship_Date       Ship_Mode Customer_ID  \
5    CA-2019-107146   6/17/2019   6/19/2019     First Class    LC-16885   
154  US-2019-129469   9/23/2019   9/27/2019  Standard Class    KL-16555   
169  CA-2020-107629  12/14/2020  12/14/2020        Same Day    DB-13060   
287  CA-2020-140151   3/23/2020   3/25/2020     First Class    RB-19360   
470  CA-2019-117121  12/17/2019  12/21/2019  Standard Class    AB-10105   
551  US-2019-140158   10/4/2019   10/8/2019  Standard Class    DR-12940   
662  CA-2020-118892   8/17/2020   8/22/2020    Second Class    TP-21415   
801  CA-2019-133711  11/26/2019  11/29/2019     First Class    MC-17425   
893  CA-2020-128363   8/13/2020   8/18/2020  Standard Class    DC-12850   

      Customer_Name      Segment        Country          City         State  \
5    Lena Creighton     Consumer  United States      Longmont      Colorado   
154   Kelly Lampkin    Corporate  United States     Fairfie