# Track Machine Learning experiments and models

A machine learning model is a file that has been trained to recognize certain types of patterns. You train a model on a set of data using an algorithm that learns from that data. Once the model is trained, you can use it to make predictions about data it hasn't seen before.
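
To make the train-then-predict idea concrete, here is a minimal sketch using scikit-learn's `IsolationForest`, the same estimator this notebook uses later. The data is synthetic, and the names `X_train`, `X_new`, `new_labels`, and `train_flagged` are illustrative assumptions, not part of the Superstore dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic two-column data standing in for (Sales, Profit); values are made up.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=[100.0, 10.0], scale=[20.0, 5.0], size=(1000, 2))

# Train: contamination=0.01 asks the model to flag roughly 1% of training rows.
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(X_train)

# Predict on rows the model has not seen: 1 means normal, -1 means anomaly.
X_new = np.array([[100.0, 10.0],        # close to the training mean
                  [10000.0, -5000.0]])  # far outside the training range
new_labels = model.predict(X_new)

# Count how many of the training rows the fitted model flags.
train_flagged = int((model.predict(X_train) == -1).sum())

print("Labels for unseen rows:", new_labels)
print("Training rows flagged:", train_flagged)
```

Passing NumPy arrays to both `fit` and `predict` keeps the inputs consistent, so scikit-learn has no feature names to compare between the two calls.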

In this notebook, you will learn the basic steps to run an experiment, add a model version to track run metrics and parameters, and register a model.


In [None]:
df = spark.sql("SELECT * FROM lakehouseTraining.`superstore sales dataset` LIMIT 15")
display(df)
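
The next cell cleans the `Sales` column with the regular expression `[^0-9.]` before converting it to numbers. As a rough sketch of what that pattern does, the snippet below applies it to a few made-up strings (the sample values and the helper `clean_number` are hypothetical, not taken from the dataset or the notebook).

```python
import re

# Hypothetical raw values showing the kinds of formats the pattern handles.
samples = ["$1,234.50", "261.96", " 22.37 USD", "n/a"]

def clean_number(raw):
    """Keep only digits and the decimal point, then parse as a float."""
    digits = re.sub(r"[^0-9.]", "", str(raw))
    try:
        return float(digits)
    except ValueError:
        # Nothing numeric is left; plays the role of NaN from errors='coerce'.
        return None

cleaned = [clean_number(s) for s in samples]
print(cleaned)

# Caveat: the pattern also strips minus signs, so negatives lose their sign.
print(clean_number("-5.00"))
```

Because the pattern drops the minus sign, it suits a column such as `Sales` that is expected to be non-negative; applying it to a signed column such as `Profit` would silently flip losses into gains.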

In [9]:
import pandas as pd
import re
from pyspark.sql import SparkSession
from sklearn.ensemble import IsolationForest
import warnings

# Suppress the specific warning
warnings.filterwarnings("ignore", message="X does not have valid feature names")

# Initialize Spark session
spark = SparkSession.builder.appName("LakehouseTraining").getOrCreate()

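# Enable Arrow optimization for Spark-to-pandas conversion
# (spark.sql.execution.arrow.enabled is deprecated since Spark 3.0)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")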

# Load the table into a Spark DataFrame
df = spark.sql("SELECT * FROM lakehouseTraining.`superstore sales dataset`")

# Convert Spark DataFrame to Pandas DataFrame
pandas_df = df.toPandas()

# Clean the 'Sales' column by removing every character except digits and the decimal point
pandas_df['Sales'] = pandas_df['Sales'].apply(lambda x: re.sub(r'[^0-9.]', '', str(x)))

# Convert 'Sales' and 'Profit' columns to numeric
pandas_df['Sales'] = pd.to_numeric(pandas_df['Sales'], errors='coerce')
pandas_df['Profit'] = pd.to_numeric(pandas_df['Profit'], errors='coerce')

# Drop rows with NaN values
pandas_df.dropna(subset=['Sales', 'Profit'], inplace=True)

# Initialize the Isolation Forest model
iso_forest = IsolationForest(contamination=0.01, random_state=42)

# Fit the model
iso_forest.fit(pandas_df[['Sales', 'Profit']])

# Predict anomalies
pandas_df['anomaly'] = iso_forest.predict(pandas_df[['Sales', 'Profit']])

# -1 indicates anomaly, 1 indicates normal
anomalies = pandas_df[pandas_df['anomaly'] == -1]

print("Number of anomalies detected:", len(anomalies))
print(anomalies)

Number of anomalies detected: 58
            Order_ID  Order_Date   Ship_Date       Ship_Mode Customer_ID  \
5     CA-2019-107146   6/17/2019   6/19/2019     First Class    LC-16885   
12    CA-2019-112340  10/21/2019  10/27/2019  Standard Class    NM-18520   
97    US-2020-149510   12/3/2020  12/10/2020  Standard Class    MC-17575   
107   CA-2019-130778  11/19/2019  11/25/2019  Standard Class    ND-18370   
121   US-2020-106705  12/26/2020    1/1/2021  Standard Class    PO-18850   
123   CA-2019-121671   7/17/2019   7/22/2019  Standard Class    AA-10480   
138   CA-2019-116799    3/3/2019    3/6/2019     First Class    JG-15310   
154   US-2019-129469   9/23/2019   9/27/2019  Standard Class    KL-16555   
156   CA-2019-125738  10/15/2019  10/21/2019  Standard Class    PB-18805   
169   CA-2020-107629  12/14/2020  12/14/2020        Same Day    DB-13060   
171   CA-2020-131618   6/17/2020   6/20/2020     First Class    LS-17200   
183   CA-2019-105760   6/19/2019   6/20/2019     First 