# Track Machine Learning experiments and models

A machine learning model is a file that has been trained to recognize certain types of patterns. You train a model on a set of data by giving it an algorithm that it uses to learn from that data. Once the model is trained, you can use it to make predictions about data it hasn't seen before.
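
The train-then-predict cycle described above can be sketched without any libraries. This is only an illustration, not the notebook's method: the helper names `fit_line` and `predict` and the toy numbers are made up for this example, and the fit is an ordinary least-squares line through one feature.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, xs):
    """Apply a fitted (slope, intercept) pair to new inputs."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


# Toy training data (hypothetical values that follow y = 2x + 1 exactly)
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

# "Training" produces the model: here, just two numbers
model = fit_line(train_x, train_y)

# The trained model can now score inputs it never saw during training
unseen_x = [10.0, 20.0]
predictions = predict(model, unseen_x)
print(model, predictions)
```

The fitted `(slope, intercept)` pair plays the role of the model file: it is all that needs to be stored to make predictions later.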

In this notebook, you will learn the basic steps to run an experiment, track the run's metrics and parameters, and register a model. Each time the notebook runs, a new version of the registered model is added.


In [None]:
df = spark.sql("SELECT * FROM lakehouseTraining.`superstore sales dataset` LIMIT 15")
display(df)

In [22]:
import mlflow
import mlflow.sklearn
import pandas as pd
import re
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from mlflow.models.signature import infer_signature

# Initialize Spark session
spark = SparkSession.builder.appName("LakehouseTraining").getOrCreate()

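# Enable Arrow optimization for the Spark-to-pandas conversion
# (the older "spark.sql.execution.arrow.enabled" key is deprecated since Spark 3.0)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")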

# Load the table into a Spark DataFrame
df = spark.sql("SELECT * FROM lakehouseTraining.`superstore sales dataset` LIMIT 1000")

# Convert Spark DataFrame to Pandas DataFrame
pandas_df = df.toPandas()

# Clean the 'Sales' column by removing non-numeric characters
pandas_df['Sales'] = pandas_df['Sales'].apply(lambda x: re.sub(r'[^0-9.]', '', str(x)))

# Convert 'Sales' and 'Profit' columns to numeric
pandas_df['Sales'] = pd.to_numeric(pandas_df['Sales'], errors='coerce')
pandas_df['Profit'] = pd.to_numeric(pandas_df['Profit'], errors='coerce')

# Drop rows with NaN values
pandas_df.dropna(subset=['Sales', 'Profit'], inplace=True)

# Check if the DataFrame is empty after cleaning
if pandas_df.empty:
    print("The DataFrame is empty after cleaning. Please check your data.")
else:
    # Use 'Sales' as the feature and 'Profit' as the label
    X = pandas_df[['Sales']].values
    y = pandas_df['Profit'].values

    # Set the active experiment. MLflow creates it if no experiment with this name exists.
    mlflow.set_experiment("your-experiment-name")

    # Start your training job with `start_run()`
    with mlflow.start_run() as run:
        # Set tags on the run
        mlflow.set_tag("project", "MLTest")
        mlflow.set_tag("author", "Your Name")

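        # Fit the model and predict on the training data
        lr = LinearRegression()
        lr.fit(X, y)
        y_pred = lr.predict(X)

        # Infer the input/output schema from the features and the model's predictions
        signature = infer_signature(X, y_pred)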

        # Log metrics (computed on the training data, so they measure fit, not generalization)
        mlflow.log_metric("mean_squared_error", mean_squared_error(y, y_pred))
        mlflow.log_metric("r2_score", r2_score(y, y_pred))

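        # Log the fitted intercept and slope as plain floats
        # (logging lr.coef_ directly would store the array's string form)
        mlflow.log_param("intercept", float(lr.intercept_))
        mlflow.log_param("coef", float(lr.coef_[0]))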

        # Log model
        mlflow.sklearn.log_model(lr, "MLTest", signature=signature)
        print(f"Model saved in run_id={run.info.run_id}")

        # Register the model
        mlflow.register_model(f"runs:/{run.info.run_id}/MLTest", "MLTest")
        print("All done")


Model saved in run_id=b4104c42-411d-482b-b69c-7491723c59d4
All done


Registered model 'MLTest' already exists. Creating a new version of this model...
2024/10/17 16:15:15 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: MLTest, version 3
Created version '3' of model 'MLTest'.
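
Once a version exists in the registry, it can be loaded back by name and version number. The sketch below assumes the `MLTest` model and version `3` reported in the output above. The helper names `registered_model_uri` and `predict_with_registered_model` are hypothetical, introduced only for this example; `mlflow.pyfunc.load_model` and the `models:/<name>/<version>` URI scheme are standard MLflow.

```python
def registered_model_uri(name, version):
    """Build the 'models:/<name>/<version>' URI MLflow uses for registered models."""
    return f"models:/{name}/{version}"


def predict_with_registered_model(name, version, features):
    """Load a registered model version and score a 2-D array of features."""
    # Imported lazily so the URI helper above works even without MLflow installed
    import mlflow.pyfunc

    model = mlflow.pyfunc.load_model(registered_model_uri(name, version))
    return model.predict(features)


# Name and version taken from the registration output above
uri = registered_model_uri("MLTest", 3)
print(uri)
```

In a session that can reach the model registry, calling `predict_with_registered_model("MLTest", 3, X)` with a NumPy array of `Sales` values shaped like the training features should return the predicted `Profit` values. Pinning the version number keeps results reproducible even as later runs register newer versions.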
