<a href="https://colab.research.google.com/github/subhashpolisetti/Automated-ML-with-PyCaret/blob/main/Anomaly_Analysis_Wholesale_Customers_PyCaret.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**PyCaret Anomaly Detection**


# Overview of PyCaret

**PyCaret** is a low-code, open-source machine learning library in Python that streamlines machine learning workflows. It serves as a comprehensive tool for machine learning and model management, drastically reducing the time needed for experiments and increasing productivity.

Compared with other open-source libraries, PyCaret takes a low-code approach: a few lines of PyCaret can replace a much longer script, which shortens the time needed to set up and run an experiment. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks, including scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, and Ray.

The design and simplicity of PyCaret cater to the rise of **citizen data scientists**—a term first introduced by Gartner. These users can handle both basic and moderately complex analytical tasks that would have previously required more specialized technical skills.


In this notebook, I demonstrate anomaly detection on the Wholesale Customers dataset, along with the requirements for running the code.

### Installation

PyCaret is tested and supported on the following 64-bit systems:

- Python 3.7 – 3.10
- Python 3.9 for Ubuntu only
- Ubuntu 16.04 or later
- Windows 7 or later

You can install PyCaret using Python's pip package manager:

```bash
pip install pycaret
```

In [9]:
!pip install pycaret

Collecting pycaret
  Downloading pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)
Collecting scipy<=1.11.4,>=1.6.1 (from pycaret)
  Downloading scipy-1.11.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting joblib<1.4,>=1.2.0 (from pycaret)
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting scikit-learn>1.4.0 (from pycaret)
  Downloading scikit_learn-1.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting pyod>=1.1.3 (from pycaret)
  Downloading pyod-2.0.2.tar.gz (165 kB)
  Preparing metadata (setup.py) ... done
Collecting category-encoders>=2.4.0 (from pycaret)
  Downloading category_encoders-2.6.3-py2.py3-none-any.whl.metadata (8

In [10]:
# check installed version
import pycaret
pycaret.__version__

'3.3.2'

## 🚀 **Quick Start**

The **Anomaly Detection Module in PyCaret** is an unsupervised machine learning tool designed to detect unusual data points that stand out significantly from the majority of the dataset.

These anomalies can represent issues such as fraudulent transactions, structural faults, medical abnormalities, or errors.

PyCaret's anomaly detection module comes with various preprocessing options to ready your data for modeling through the `setup` function. The module includes over 10 pre-configured algorithms and several visualizations to assess model performance.

A standard workflow in PyCaret's anomaly detection module involves the following steps:

**Setup** ➡️ **Create Model** ➡️ **Assign Labels** ➡️ **Analyze Model** ➡️ **Prediction** ➡️ **Save Model**
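
The steps above can be sketched as a single helper. This is a minimal sketch rather than the notebook's own code: the function name `run_anomaly_workflow` and its default arguments are hypothetical, and the PyCaret import sits inside the function so that the definition itself loads even where PyCaret is not installed.

```python
def run_anomaly_workflow(df, model_id="iforest", fraction=0.05,
                         model_name="anomaly_model"):
    """Setup -> Create Model -> Assign Labels -> Save Model on one DataFrame.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Imported lazily so defining this function does not require PyCaret.
    from pycaret.anomaly import setup, create_model, assign_model, save_model

    # Setup: configure preprocessing (z-score normalization) and a fixed seed.
    setup(data=df, normalize=True, session_id=123, verbose=False)

    # Create Model: `fraction` is the expected share of outliers in the data.
    model = create_model(model_id, fraction=fraction)

    # Assign Labels: adds Anomaly and Anomaly_Score columns to the training data.
    labeled = assign_model(model)

    # Save Model: writes the preprocessing pipeline and model to <model_name>.pkl.
    save_model(model, model_name)
    return model, labeled
```

Calling `run_anomaly_workflow(customer_dataset)` would run the same sequence that the cells below perform one step at a time.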


In [1]:

from google.colab import files
uploaded = files.upload()

Saving Wholesale customers data.csv to Wholesale customers data.csv


In [3]:
# Importing pandas library and reading the dataset from a CSV file
import pandas as pd

customer_dataset = pd.read_csv('Wholesale customers data.csv')

# Display the first five rows of the dataset
customer_dataset.head()


Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
0,2,3,12669,9656,7561,214,2674,1338
1,2,3,7057,9810,9568,1762,3293,1776
2,2,3,6353,8808,7684,2405,3516,7844
3,1,3,13265,1196,4221,6404,507,1788
4,2,3,22615,5410,7198,3915,1777,5185


In [4]:
# Display the shape of the dataset (number of rows and columns)
customer_dataset.shape


(440, 8)

In [23]:
# Sample 95% of the dataset for modeling
data = customer_dataset.sample(frac=0.95, random_state=786)

# Create unseen data by dropping the sampled data
data_unseen = customer_dataset.drop(data.index)

# Reset the index of both datasets
data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

# Print the shapes of the datasets
print('Data for the Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))


Data for the Modeling: (418, 8)
Unseen Data For Predictions: (22, 8)
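
The two shapes printed above follow from the sampling fraction. A quick arithmetic check, assuming pandas rounds `frac * n_rows` to the nearest integer when sampling by fraction:

```python
n_rows = 440   # rows in the full dataset, from customer_dataset.shape
frac = 0.95    # fraction passed to DataFrame.sample

n_modeling = round(frac * n_rows)  # rows kept for modeling
n_unseen = n_rows - n_modeling     # rows left after drop(data.index)

print(n_modeling, n_unseen)
```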


In [6]:

from pycaret.anomaly import *

In [7]:
# Initialize the setup for anomaly detection using PyCaret
wholesale_anomalies = setup(
    data=customer_dataset,  # Full dataset (all 440 rows), including the rows later held out as data_unseen
    normalize=True,         # Enable normalization of the data for better model performance
    session_id=123          # Set a session ID for reproducibility of results
)



Unnamed: 0,Description,Value
0,Session id,123
1,Original data shape,"(440, 8)"
2,Transformed data shape,"(440, 8)"
3,Numeric features,8
4,Preprocess,True
5,Imputation type,simple
6,Numeric imputation,mean
7,Categorical imputation,mode
8,Normalize,True
9,Normalize method,zscore
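
With `normalize=True` and the `zscore` method, each feature is rescaled to zero mean and unit variance. The sketch below illustrates that transformation using only the standard library. It takes the five `Fresh` values from the `head()` preview, so the numbers will not match what PyCaret computes over all 440 rows.

```python
from statistics import fmean, pstdev

# First five values of the Fresh column, copied from the head() preview.
fresh = [12669, 7057, 6353, 13265, 22615]

mean = fmean(fresh)
std = pstdev(fresh)  # population standard deviation (ddof=0), as StandardScaler uses

# z-score: distance from the mean, measured in standard deviations.
z_scores = [(x - mean) / std for x in fresh]

for raw, z in zip(fresh, z_scores):
    print(f"{raw:>6} -> {z:+.3f}")
```

After this rescaling, a column with large raw values such as `Fresh` is on the same scale as the other features.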


In [8]:
# Create an Isolation Forest model for anomaly detection
isolation_forest_model = create_model('iforest')




In [9]:
# Print the details of the Isolation Forest model
print(isolation_forest_model)  # Display the trained Isolation Forest model's summary and parameters


IForest(behaviour='new', bootstrap=False, contamination=0.05,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
    random_state=123, verbose=0)
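
The `contamination=0.05` parameter controls how many training points end up labeled as anomalies. As I understand PyOD's behavior, the detector sets its threshold at the `100 * (1 - contamination)` percentile of the training scores and flags anything above it. The sketch below reproduces that idea on made-up scores; the helper names `percentile` and `label_outliers` are hypothetical and are not part of PyCaret or PyOD.

```python
def percentile(values, q):
    """Linearly interpolated percentile of `values`, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def label_outliers(scores, contamination=0.05):
    """Flag scores above the (1 - contamination) percentile as outliers (1)."""
    threshold = percentile(scores, 100 * (1 - contamination))
    return [1 if s > threshold else 0 for s in scores], threshold


# Hypothetical anomaly scores: higher means more anomalous.
scores = [i / 100 for i in range(100)]
labels, threshold = label_outliers(scores, contamination=0.05)

print(sum(labels), "of", len(scores), "flagged")
```

By the same logic, a contamination of 0.05 on 440 rows would be expected to flag roughly 22 rows.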


In [10]:
# Create a Support Vector Machine model for anomaly detection
svm_model = create_model('svm', fraction=0.025)  # fraction sets the expected proportion of outliers (contamination) to 2.5%



In [11]:
# Print the details of the Support Vector Machine model
print(svm_model)


OCSVM(cache_size=200, coef0=0.0, contamination=0.025, degree=3, gamma='auto',
   kernel='rbf', max_iter=-1, nu=0.5, shrinking=True, tol=0.001,
   verbose=False)


In [12]:
# List all available models for anomaly detection
available_models = models()
print(available_models)


                                        Name  \
ID                                             
abod            Angle-base Outlier Detection   
cluster       Clustering-Based Local Outlier   
cof         Connectivity-Based Local Outlier   
iforest                     Isolation Forest   
histogram  Histogram-based Outlier Detection   
knn             K-Nearest Neighbors Detector   
lof                     Local Outlier Factor   
svm                   One-class SVM detector   
pca             Principal Component Analysis   
mcd           Minimum Covariance Determinant   
sod               Subspace Outlier Detection   
sos             Stochastic Outlier Selection   

                                                  Reference  
ID                                                           
abod                                  pyod.models.abod.ABOD  
cluster    pycaret.internal.patches.pyod.CBLOFForceToDouble  
cof                                     pyod.models.cof.COF  
iforest          

In [13]:
# Assign anomaly labels and anomaly scores to the dataset using the Isolation Forest model
iforest_results = assign_model(isolation_forest_model)

# Display the first few rows of the results to inspect the assigned labels
iforest_results.head()


Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen,Anomaly,Anomaly_Score
0,2,3,12669,9656,7561,214,2674,1338,0,-0.138161
1,2,3,7057,9810,9568,1762,3293,1776,0,-0.155732
2,2,3,6353,8808,7684,2405,3516,7844,0,-0.075376
3,1,3,13265,1196,4221,6404,507,1788,0,-0.176917
4,2,3,22615,5410,7198,3915,1777,5185,0,-0.068065


In [20]:
# Plot the Isolation Forest results using PyCaret's default plot
plot_model(isolation_forest_model)



In [21]:

!pip install pycaret[analysis]

Collecting shap~=0.44.0 (from pycaret[analysis])
  Downloading shap-0.44.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (24 kB)
Collecting interpret>=0.2.7 (from pycaret[analysis])
  Downloading interpret-0.6.3-py3-none-any.whl.metadata (1.1 kB)
Collecting umap-learn>=0.5.2 (from pycaret[analysis])
  Downloading umap_learn-0.5.6-py3-none-any.whl.metadata (21 kB)
Collecting ydata-profiling>=4.3.1 (from pycaret[analysis])
  Downloading ydata_profiling-4.10.0-py2.py3-none-any.whl.metadata (20 kB)
Collecting explainerdashboard>=0.3.8 (from pycaret[analysis])
  Downloading explainerdashboard-0.4.7-py3-none-any.whl.metadata (3.8 kB)
Collecting fairlearn==0.7.0 (from pycaret[analysis])
  Downloading fairlearn-0.7.0-py3-none-any.whl.metadata (7.3 kB)
Collecting dash-auth (from explainerdashboard>=0.3.8->pycaret[analysis])
  Downloading dash_auth-2.3.0-py3-none-any.whl.metadata (10 kB)
Collecting dash-bootstrap-components>=1 (fro

In [24]:
# Predict on unseen data using the Isolation Forest model
anomaly_predictions = predict_model(isolation_forest_model, data=data_unseen)
anomaly_predictions.head()


Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen,Anomaly,Anomaly_Score
0,1.448652,0.590668,-0.204806,0.334067,-0.297637,-0.496155,-0.228138,-0.026224,0,-0.138925
1,1.448652,0.590668,0.438987,-0.173259,-0.352839,-0.413666,-0.130709,0.212691,0,-0.135865
2,1.448652,0.590668,2.273476,-0.255056,-0.218626,-0.523789,-0.061837,-0.087639,0,-0.058653
3,-0.690297,0.590668,2.474854,-0.104621,0.017459,0.668172,-0.273493,4.553279,1,0.014821
4,-0.690297,0.590668,3.489423,-0.310943,0.100578,3.084264,-0.294281,0.345461,0,-0.004218
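
The feature columns in this output are normalized values, so each number reads as a distance from the column mean in standard deviations. One rough way to inspect a flagged row is to find the feature furthest from zero. This is a reading aid only, not how Isolation Forest reaches its decision. The sketch below applies it to the five preview rows shown above, using only the standard library.

```python
import csv
import io

# The five preview rows shown above (index column omitted).
preview = """Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen,Anomaly,Anomaly_Score
1.448652,0.590668,-0.204806,0.334067,-0.297637,-0.496155,-0.228138,-0.026224,0,-0.138925
1.448652,0.590668,0.438987,-0.173259,-0.352839,-0.413666,-0.130709,0.212691,0,-0.135865
1.448652,0.590668,2.273476,-0.255056,-0.218626,-0.523789,-0.061837,-0.087639,0,-0.058653
-0.690297,0.590668,2.474854,-0.104621,0.017459,0.668172,-0.273493,4.553279,1,0.014821
-0.690297,0.590668,3.489423,-0.310943,0.100578,3.084264,-0.294281,0.345461,0,-0.004218
"""

spend_features = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]

flagged = []
for i, row in enumerate(csv.DictReader(io.StringIO(preview))):
    if row["Anomaly"] == "1":
        # Spending feature whose normalized value is furthest from zero.
        extreme = max(spend_features, key=lambda f: abs(float(row[f])))
        flagged.append((i, extreme, float(row[extreme])))

print(flagged)
```

In this preview, the only flagged row is row 3, and its most extreme feature is `Delicassen`, at about 4.55 standard deviations above the mean.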


In [25]:
# Predict on the unseen data again (this repeats the previous cell; pass data=data to score the modeling data instead)
predictions_results = predict_model(isolation_forest_model, data=data_unseen)
predictions_results.head()


Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen,Anomaly,Anomaly_Score
0,1.448652,0.590668,-0.204806,0.334067,-0.297637,-0.496155,-0.228138,-0.026224,0,-0.138925
1,1.448652,0.590668,0.438987,-0.173259,-0.352839,-0.413666,-0.130709,0.212691,0,-0.135865
2,1.448652,0.590668,2.273476,-0.255056,-0.218626,-0.523789,-0.061837,-0.087639,0,-0.058653
3,-0.690297,0.590668,2.474854,-0.104621,0.017459,0.668172,-0.273493,4.553279,1,0.014821
4,-0.690297,0.590668,3.489423,-0.310943,0.100578,3.084264,-0.294281,0.345461,0,-0.004218


In [26]:
# Save the Isolation Forest model for future use
save_model(isolation_forest_model, 'Final_IForest_Model')


Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['Channel', 'Region', 'Fresh',
                                              'Milk', 'Grocery', 'Frozen',
                                              'Detergents_Paper', 'Delicassen'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=[],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('normalize', TransformerWrapper(transformer=StandardScaler())),
                 ('trained_model',
                  IForest(behaviour='new', bootstrap=False, contamination=0.05,
     max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
     random_state=123, verbose=0))]),
 'Final_IForest_Model.pkl')

In [27]:
# Load the saved Isolation Forest model
loaded_iforest_model = load_model('Final_IForest_Model')
loaded_iforest_model


Transformation Pipeline and Model Successfully Loaded
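
Once loaded, the pipeline can score new records without calling `setup` again, because the saved file contains the preprocessing steps as well as the model. A minimal sketch, assuming the new data has the same columns as the training data; the helper name `score_with_saved_model` is hypothetical, and the import is deferred so the definition loads without PyCaret installed.

```python
def score_with_saved_model(new_data, model_name="Final_IForest_Model"):
    """Load a saved PyCaret anomaly pipeline and score `new_data` with it.

    Hypothetical helper for illustration; `new_data` is assumed to be a
    DataFrame with the same feature columns used during training.
    """
    # Imported lazily so defining this function does not require PyCaret.
    from pycaret.anomaly import load_model, predict_model

    pipeline = load_model(model_name)  # reads <model_name>.pkl
    # Returns the input rows with Anomaly and Anomaly_Score columns appended.
    return predict_model(pipeline, data=new_data)
```

For example, `score_with_saved_model(data_unseen)` would score the held-out rows with the model saved above.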
