# Track Machine Learning experiments and models

##### This notebook demonstrates an anomaly detection process on a sales dataset using Spark, Pandas, and Isolation Forest. The steps include:
###### 1. Initializing a Spark session and enabling Arrow optimization for efficient data transfer between Spark and Pandas.
###### 2. Loading the entire sales dataset from the lakehouse into a Spark DataFrame.
###### 3. Converting the Spark DataFrame to a Pandas DataFrame for further processing.
###### 4. Cleaning the 'Sales' column by removing non-numeric characters and converting 'Sales' and 'Profit' columns to numeric types.
###### 5. Dropping rows with NaN values in the 'Sales' and 'Profit' columns.
###### 6. Normalizing the 'Sales' and 'Profit' data using StandardScaler.
###### 7. Initializing and fitting an Isolation Forest model to detect anomalies in the normalized data.
###### 8. Predicting anomalies and classifying data points as normal or anomalous.
###### 9. Visualizing the anomalies using a scatter plot with improved readability and aesthetics.
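
Steps 4 and 5 above can be sketched on a few made-up values to show what the cleaning actually does. This is a minimal illustration, not the notebook's data: the sample rows and the helper name `clean_numeric_columns` are assumptions. One thing it makes visible is that the pattern `[^0-9.]` also strips a leading minus sign, which is harmless only as long as 'Sales' is never negative.

```python
import re

import pandas as pd

# Hypothetical sample values, not taken from the sales dataset
raw = pd.DataFrame({
    "Sales": ["$1,234.50", "99", "N/A", "-12.5"],
    "Profit": ["10.5", "abc", "3", "-4"],
})


def clean_numeric_columns(df):
    """Hypothetical helper mirroring steps 4 and 5 on a copy of the input."""
    out = df.copy()
    # Keep only digits and the decimal point in 'Sales' (this drops '$', ',' and '-')
    out["Sales"] = out["Sales"].apply(lambda x: re.sub(r"[^0-9.]", "", str(x)))
    # Values that cannot be parsed become NaN
    out["Sales"] = pd.to_numeric(out["Sales"], errors="coerce")
    out["Profit"] = pd.to_numeric(out["Profit"], errors="coerce")
    # Drop rows where either column failed to parse
    return out.dropna(subset=["Sales", "Profit"])


cleaned = clean_numeric_columns(raw)
print(cleaned)
```

In this sample, the row with 'N/A' in 'Sales' and the row with 'abc' in 'Profit' are dropped, and the '-12.5' sale is kept as a positive 12.5 because its sign was removed by the regex.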


In [4]:
import pandas as pd
import re
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Load data into pandas DataFrame from CSV file
pandas_df = pd.read_csv("SuperStore Sales DataSet.csv")

# Clean the 'Sales' column by removing non-numeric characters
pandas_df['Sales'] = pandas_df['Sales'].apply(lambda x: re.sub(r'[^0-9.]', '', str(x)))

# Convert 'Sales' and 'Profit' columns to numeric
pandas_df['Sales'] = pd.to_numeric(pandas_df['Sales'], errors='coerce')
pandas_df['Profit'] = pd.to_numeric(pandas_df['Profit'], errors='coerce')

# Drop rows with NaN values
pandas_df.dropna(subset=['Sales', 'Profit'], inplace=True)

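# Standardize 'Sales' and 'Profit' to zero mean and unit variance.
# Note: this overwrites the original values in place, so the CSV saved below contains the scaled values.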
scaler = StandardScaler()
pandas_df[['Sales', 'Profit']] = scaler.fit_transform(pandas_df[['Sales', 'Profit']])

# Initialize the Isolation Forest model; contamination=0.0125 means about 1.25% of rows are expected to be flagged
iso_forest = IsolationForest(contamination=0.0125, random_state=42)

# Fit the model
iso_forest.fit(pandas_df[['Sales', 'Profit']].values)

# Predict on the same feature array used for fitting (-1 = anomaly, 1 = normal)
pandas_df['anomaly'] = iso_forest.predict(pandas_df[['Sales', 'Profit']].values)

# Keep only the rows labeled normal (1), which removes the anomalies (-1)
cleaned_df = pandas_df[pandas_df['anomaly'] == 1]

# Drop the 'anomaly' column as it's no longer needed
cleaned_df = cleaned_df.drop(columns=['anomaly'])

# Save the cleaned data back to a CSV file
cleaned_df.to_csv("Cleaned_SuperStore_Sales_DataSet.csv", index=False)

print("Number of entries after removing anomalies:", cleaned_df.shape)

Number of entries after removing anomalies: (5827, 20)
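
To make the role of `contamination` and the `-1`/`1` labels more concrete, here is a small sketch on synthetic data. The data, the seed, and the variable names are assumptions for illustration only. It also shows one possible variation on the cell above: scaling into a separate array so that the rows that are kept stay in their original units.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 'Sales' and 'Profit' columns (not the real dataset)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Sales": rng.normal(200.0, 50.0, size=800),
    "Profit": rng.normal(20.0, 10.0, size=800),
})

# Scale into a separate array so the DataFrame keeps its original values
X = StandardScaler().fit_transform(df[["Sales", "Profit"]])

# Same settings as the notebook cell; fit_predict returns -1 (anomaly) or 1 (normal)
model = IsolationForest(contamination=0.0125, random_state=42)
labels = model.fit_predict(X)

n_flagged = int((labels == -1).sum())
kept = df[labels == 1]

print("Flagged rows:", n_flagged, "of", len(df))
print("Rows kept:", len(kept))
```

With `contamination=0.0125`, the model sets its decision threshold so that roughly 1.25% of the training rows are labeled `-1`. Whether those rows are genuine errors or just unusual but valid sales is a separate question that the model does not answer.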
