**Anomaly Detection Using PyCaret**

PyCaret’s Anomaly Detection Module is an unsupervised machine learning module used for identifying rare items, events, or observations that raise suspicion by differing significantly from the majority of the data. Typically, the anomalous items translate to some kind of problem, such as bank fraud, a structural defect, a medical problem, or an error. The module provides several pre-processing features that prepare the data for modeling through the setup function. It has over 12 ready-to-use algorithms and several plots for analyzing the results of trained models.






**Setting up Environment**
setup(data, categorical_features = None, categorical_imputation = 'constant', ordinal_features = None, high_cardinality_features = None, numeric_features = None, numeric_imputation = 'mean', date_features = None, ignore_features = None, normalize = False, normalize_method = 'zscore', transformation = False, transformation_method = 'yeo-johnson', handle_unknown_categorical = True, unknown_categorical_method = 'least_frequent', pca = False, pca_method = 'linear', pca_components = None, ignore_low_variance = False, combine_rare_levels = False, rare_level_threshold = 0.10, bin_numeric_features = None, remove_multicollinearity = False, multicollinearity_threshold = 0.9, group_features = None, group_names = None, supervised = False, supervised_target = None, n_jobs = -1, html = True, session_id = None, log_experiment = False, experiment_name = None, log_plots = False, log_profile = False, log_data = False, silent = False, verbose = True, profile = False)

Description:

This function initializes the environment in PyCaret. setup() must be called before executing any other function in PyCaret. It takes one mandatory parameter: data, a dataframe {array-like, sparse matrix}.

Code
#import the dataset from pycaret repository
from pycaret.datasets import get_data
anomaly = get_data('anomaly')
#import anomaly detection module
from pycaret.anomaly import *
#initialize the setup
exp_ano = setup(anomaly)
 

Output
‘anomaly’ is a pandas DataFrame.

Parameters:

data : dataframe
{array-like, sparse matrix}, shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features in dataframe.

categorical_features: string, default = None
If the inferred data types are not correct, categorical_features can be used to overwrite the inferred type. For example, if the type of ‘column1’ is inferred as numeric instead of categorical when running setup, this parameter can be used to overwrite the type by passing categorical_features = [‘column1’].

categorical_imputation: string, default = ‘constant’
If missing values are found in categorical features, they will be imputed with a constant ‘not_available’ value. The other available option is ‘mode’, which imputes the missing value using the most frequent value in the training dataset.

ordinal_features: dictionary, default = None
When the data contains ordinal features, they must be encoded differently using the ordinal_features param. If the data has a categorical variable with values of ‘low’, ‘medium’, ‘high’ and it is known that low < medium < high, then it can be passed as ordinal_features = { ‘column_name’ : [‘low’, ‘medium’, ‘high’] }. The list sequence must be in increasing order from lowest to highest.
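
As a rough illustration of what an ordered encoding means, the sketch below maps each level to its position in the list. The helper `encode_ordinal` is hypothetical and is not part of PyCaret; it only shows why the list must run from lowest to highest.

```python
# Hypothetical helper (not a PyCaret function): map ordered levels to integer ranks.
def encode_ordinal(values, ordered_levels):
    # The position in ordered_levels becomes the numeric code, so order matters.
    rank = {level: i for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]

levels = ["low", "medium", "high"]  # increasing order, lowest first
encoded = encode_ordinal(["medium", "low", "high", "low"], levels)
print(encoded)  # [1, 0, 2, 0]
```

Passing the levels in the wrong order would silently reverse or scramble the ranks, which is why the sequence requirement matters.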

high_cardinality_features: string, default = None
When the data contains features with high cardinality, they can be compressed into fewer levels by passing them as a list of column names with high cardinality. Features are compressed using frequency distribution. As such, the original features are replaced with the frequency distribution and converted into numeric variables.
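
A minimal sketch of frequency-based compression, assuming each level is simply replaced by how often it occurs. The helper `frequency_encode` is hypothetical, and whether PyCaret uses raw counts or relative frequencies is not stated in the text above.

```python
from collections import Counter

# Hypothetical helper (not a PyCaret function): replace each level by its count.
def frequency_encode(values):
    counts = Counter(values)
    return [counts[v] for v in values]

cities = ["NY", "LA", "NY", "SF", "NY", "LA"]
city_codes = frequency_encode(cities)
print(city_codes)  # [3, 2, 3, 1, 3, 2]
```

The categorical column becomes a single numeric column, regardless of how many distinct levels it had.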

numeric_features: string, default = None
If the inferred data types are not correct, numeric_features can be used to overwrite the inferred type. For example, if the type of ‘column1’ is inferred as categorical instead of numeric when running setup, this parameter can be used to overwrite the type by passing numeric_features = [‘column1’].

numeric_imputation: string, default = ‘mean’
If missing values are found in numeric features, they will be imputed with the mean value of the feature. The other available option is ‘median’ which imputes the value using the median value in the training dataset.

date_features: string, default = None
If the data has a DateTime column that is not automatically detected when running setup, this parameter can be used by passing date_features = ‘date_column_name’. It can work with multiple date columns. Date columns are not used in modeling. Instead, feature extraction is performed and date columns are dropped from the dataset. If the date column includes a time stamp, features related to time will also be extracted.
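
To make the idea of date feature extraction concrete, here is a small sketch using only the standard library. The function `extract_date_features` and the particular features it returns are assumptions for illustration; the text above does not list which features PyCaret derives.

```python
from datetime import datetime

# Hypothetical helper (not a PyCaret function): derive simple features from a date.
def extract_date_features(ts):
    features = {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "weekday": ts.weekday(),  # Monday = 0, Sunday = 6
    }
    # Only add a time-related feature when the value carries a time stamp.
    if (ts.hour, ts.minute, ts.second) != (0, 0, 0):
        features["hour"] = ts.hour
    return features

date_features_example = extract_date_features(datetime(2020, 3, 15, 14, 30))
print(date_features_example)
```

The original date column would then be dropped and only the derived numeric columns kept for modeling.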

ignore_features: string, default = None
If any feature should be ignored for modeling, it can be passed to the param ignore_features. The ID and DateTime columns, when inferred, are automatically set to be ignored for modeling.

normalize: bool, default = False
When set to True, the feature space is transformed using the normalize_method param. Generally, linear algorithms perform better with normalized data. However, the results may vary and it is advised to run multiple experiments to evaluate the benefit of normalization.

normalize_method: string, default = ‘zscore’
Defines the method to be used for normalization. By default, normalize_method is set to ‘zscore’. The standard zscore is calculated as z = (x - u) / s, where u is the mean and s is the standard deviation of the feature. The other available options are:

‘minmax’ : scales and translates each feature individually such that it is in the range of 0 – 1.
‘maxabs’ : scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
‘robust’ : scales and translates each feature according to the Interquartile range. When the dataset contains outliers, robust scaler often gives better results.
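
The four normalization options can be compared side by side with a small pure-Python sketch. The function names are hypothetical and this is not PyCaret's implementation. Two details are assumptions: the zscore uses the population standard deviation, and the robust variant centers on the median before dividing by the interquartile range.

```python
from statistics import mean, pstdev, quantiles

def zscore(xs):
    u, s = mean(xs), pstdev(xs)  # assumption: population standard deviation
    return [(x - u) / s for x in xs]

def minmax(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def maxabs(xs):
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]  # no shifting, so zeros stay zero

def robust(xs):
    q1, q2, q3 = quantiles(xs, n=4, method="inclusive")
    return [(x - q2) / (q3 - q1) for x in xs]  # assumption: center on the median

feature = [1, 2, 3, 4, 100]  # 100 is an outlier
print(minmax(feature))
print(maxabs(feature))
print(robust(feature))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

With the outlier present, minmax and maxabs squeeze the four ordinary values into a narrow band near zero, while the robust version keeps them evenly spread and leaves the outlier clearly separated.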

transformation: bool, default = False
When set to True, a power transformation is applied to make the data more normal / Gaussian-like. This is useful for modeling issues related to heteroscedasticity or  other situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

transformation_method: string, default = ‘yeo-johnson’
Defines the method for transformation. By default, the transformation method is set to ‘yeo-johnson’. The other available option is ‘quantile’ transformation. Both transformations make the feature set follow a Gaussian-like or normal distribution. Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.

handle_unknown_categorical: bool, default = True
When set to True, unknown categorical levels in new / unseen data are replaced by the most or least frequent level as learned in the training data. The method is  defined under the unknown_categorical_method param.

unknown_categorical_method: string, default = ‘least_frequent’
Method used to replace unknown categorical levels in unseen data. Method can be set to ‘least_frequent’ or ‘most_frequent’.
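
The replacement of unseen levels can be sketched as follows. The helper names are hypothetical and this is only an illustration of the idea, not PyCaret's code: the replacement level is learned from the training data and then applied to new data.

```python
from collections import Counter

# Hypothetical helpers (not PyCaret functions).
def fit_replacement(train_values, method="least_frequent"):
    ordered = Counter(train_values).most_common()  # most frequent first
    return ordered[-1][0] if method == "least_frequent" else ordered[0][0]

def replace_unknown(new_values, known_levels, replacement):
    return [v if v in known_levels else replacement for v in new_values]

train_colors = ["red", "red", "red", "blue", "blue", "green"]
known_colors = set(train_colors)

least = fit_replacement(train_colors)                  # 'green'
most = fit_replacement(train_colors, "most_frequent")  # 'red'
cleaned = replace_unknown(["red", "purple"], known_colors, least)
print(cleaned)  # ['red', 'green']
```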

pca: bool, default = False
When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in the pca_method param. In supervised learning, PCA is generally performed when the feature space is high-dimensional and memory is a constraint. Note that not all datasets can be decomposed efficiently using a linear PCA technique and that applying PCA may result in loss of information. As such, it is advised to run multiple experiments with different pca_method values to evaluate the impact.

pca_method: string, default = ‘linear’
The ‘linear’ method performs Linear dimensionality reduction using Singular Value Decomposition. The other available options are:

‘kernel’ : dimensionality reduction through the use of an RBF kernel.
‘incremental’ : replacement for ‘linear’ PCA when the dataset to be decomposed is too large to fit in memory.

pca_components: int/float, default = 0.99
Number of components to keep. If pca_components is a float, it is treated as a target percentage for information retention. When pca_components is an integer, it is treated as the number of features to be kept. pca_components must be strictly less than the original number of features in the dataset.

ignore_low_variance: bool, default = False
When set to True, all categorical features with statistically insignificant variances are removed from the dataset. The variance is calculated using the ratio of unique  values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.

combine_rare_levels: bool, default = False
When set to True, all levels in categorical features below the threshold defined in rare_level_threshold param are combined together as a single level. There must be at least two levels under the threshold for this to take effect. rare_level_threshold represents the percentile distribution of level frequency. Generally, this technique is applied to limit a sparse matrix caused by high numbers of levels in categorical features.

rare_level_threshold: float, default = 0.1
Percentile distribution below which rare categories are combined. Only comes into effect when combine_rare_levels is set to True.
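
A minimal sketch of combining rare levels, assuming the threshold is compared against each level's share of the rows. The helper `combine_rare_levels_sketch` and the label "rare" are hypothetical; the exact rule PyCaret applies may differ.

```python
from collections import Counter

# Hypothetical helper (not a PyCaret function).
def combine_rare_levels_sketch(values, threshold=0.10, rare_label="rare"):
    counts = Counter(values)
    total = len(values)
    rare = {level for level, c in counts.items() if c / total < threshold}
    # At least two levels must fall under the threshold for merging to happen.
    if len(rare) < 2:
        return list(values)
    return [rare_label if v in rare else v for v in values]

levels_in = ["a"] * 12 + ["b"] * 6 + ["c", "d"]  # c and d each cover 5% of rows
combined = combine_rare_levels_sketch(levels_in)
print(Counter(combined))  # a: 12, b: 6, rare: 2
```

When only one level falls below the threshold, the column is returned unchanged, which mirrors the "at least two levels" condition described above.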

bin_numeric_features: list, default = None
When a list of numeric features is passed, they are transformed into categorical features using K-Means, where values in each bin have the same nearest center of a 1D k-means cluster. The number of clusters is determined using the ‘sturges’ method, which is only optimal for Gaussian data and underestimates the number of bins for large non-Gaussian datasets.
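
For reference, Sturges' rule chooses the number of bins from the sample count alone, as k = ceil(log2(n)) + 1. The short sketch below (with a hypothetical helper name) shows what this gives for a dataset of 1000 rows, the size of the example data used in this article.

```python
import math

# Hypothetical helper: Sturges' rule, k = ceil(log2(n)) + 1.
def sturges_bins(n_samples):
    return math.ceil(math.log2(n_samples)) + 1

print(sturges_bins(1000))  # 11
```

Because the bin count grows only logarithmically with n, large datasets with skewed or multi-modal features get relatively few bins, which is the underestimation mentioned above.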

remove_multicollinearity: bool, default = False
When set to True, the variables with inter-correlations higher than the threshold defined under the multicollinearity_threshold param are dropped. When two features are highly correlated with each other, the feature with higher average correlation in the feature space is dropped.

multicollinearity_threshold: float, default = 0.9
Threshold used for dropping the correlated features. Only comes into effect when remove_multicollinearity is set to True.
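
The dropping rule can be sketched in plain Python. This is an illustrative approximation under stated assumptions, not PyCaret's implementation: it uses absolute Pearson correlation, and for each pair above the threshold it drops the feature whose average correlation with all other features is higher. All names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def drop_multicollinear(columns, threshold=0.9):
    names = list(columns)
    # Absolute pairwise correlations between every pair of distinct features.
    corr = {a: {b: abs(pearson(columns[a], columns[b])) for b in names if b != a}
            for a in names}

    def avg_corr(name):
        return sum(corr[name].values()) / len(corr[name])

    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr[a][b] > threshold:
                # Drop the feature with the higher average correlation.
                dropped.add(a if avg_corr(a) >= avg_corr(b) else b)
    return [n for n in names if n not in dropped]

toy = {
    "x": [1, 2, 3, 4, 5],
    "x_twice": [2, 4, 6, 8, 10.5],  # almost perfectly correlated with x
    "noise": [3, 1, 4, 1, 5],
}
kept = drop_multicollinear(toy)
print(kept)  # ['x', 'noise']
```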

group_features: list or list of list, default = None
When a dataset contains features that have related characteristics, the group_features param can be used for statistical feature extraction. For example, if a dataset has numeric features that are related to each other (e.g. ‘Col1’, ‘Col2’, ‘Col3’), a list containing the column names can be passed under group_features to extract statistical information such as the mean, median, mode and standard deviation.

group_names: list, default = None
When group_features is passed, a name for the group can be passed into the group_names param as a list containing strings. The length of the group_names list must equal the length of group_features. When the lengths don’t match or no name is passed, new features are named sequentially, such as group_1, group_2, etc.
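
A small sketch of row-wise statistics over a feature group is shown below. The helper `group_statistics` and the output column names are assumptions for illustration only; the mode is left out to keep the example short.

```python
from statistics import mean, median, pstdev

# Hypothetical helper (not a PyCaret function): per-row statistics over a group.
def group_statistics(rows, group_features, group_name="group_1"):
    out = []
    for row in rows:
        vals = [row[c] for c in group_features]
        out.append({
            f"{group_name}_mean": mean(vals),
            f"{group_name}_median": median(vals),
            f"{group_name}_std": pstdev(vals),
        })
    return out

rows = [{"Col1": 1.0, "Col2": 2.0, "Col3": 6.0}]
stats = group_statistics(rows, ["Col1", "Col2", "Col3"])
print(stats[0]["group_1_mean"], stats[0]["group_1_median"])  # 3.0 2.0
```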

supervised: bool, default = False
When set to True, supervised_target column is ignored for transformation. This param is only for internal use.

supervised_target: string, default = None
Name of supervised_target column that will be ignored for transformation. Only applicable when tune_model() function is used. This param is only for internal use.

n_jobs: int, default = -1
The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor, set n_jobs to None.

html: bool, default = True
If set to False, prevents runtime display of the monitor. This must be set to False when using an environment that doesn’t support HTML.

session_id: int, default = None
If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
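
The effect of fixing a seed can be illustrated with the standard library's random module. This is only an analogy for how a session_id makes an experiment repeatable. It does not show PyCaret internals, and `sample_with_seed` is a hypothetical helper.

```python
import random

# Hypothetical helper: draw a few pseudo-random numbers from a seeded generator.
def sample_with_seed(session_id, n=3):
    rng = random.Random(session_id)
    return [rng.random() for _ in range(n)]

first_run = sample_with_seed(4616)
second_run = sample_with_seed(4616)
print(first_run == second_run)  # True: same seed, same sequence
```

Passing the same session_id to setup on a later run serves the same purpose: every randomized step starts from the same state.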

log_experiment: bool, default = False
When set to True, all metrics and parameters are logged on the MLflow server.

experiment_name: str, default = None
Name of experiment for logging. When set to None, ‘clf’ is by default used as alias for the experiment name.

log_plots: bool, default = False
When set to True, specific plots are logged in MLflow as a png file. By default, it is set to False.

log_profile: bool, default = False
When set to True, the data profile is also logged on MLflow as an HTML file. By default, it is set to False.

silent: bool, default = False
When set to True, confirmation of data types is not required. All preprocessing will be performed assuming automatically inferred data types. Not recommended for direct use except for established pipelines.

verbose: bool, default = True
Information grid is not printed when verbose is set to False.

profile: bool, default = False
If set to True, a data profile for Exploratory Data Analysis will be displayed in an interactive HTML report.

Returns:

Information Grid: Information grid is printed.

Environment: This function returns various outputs that are stored in a variable as a tuple. They are used by other functions in PyCaret.

In [2]:
!pip install pycaret
#import the dataset from pycaret repository
from pycaret.datasets import get_data
anomaly = get_data('anomaly')
#import anomaly detection module
from pycaret.anomaly import *
#initialize the setup
exp_ano = setup(anomaly)

| | Description | Value |
|---|---|---|
| 0 | session_id | 4616 |
| 1 | Original Data | (1000, 10) |
| 2 | Missing Values | False |
| 3 | Numeric Features | 10 |
| 4 | Categorical Features | 0 |
| 5 | Ordinal Features | False |
| 6 | High Cardinality Features | False |
| 7 | High Cardinality Method | |
| 8 | Transformed Data | (1000, 10) |
| 9 | CPU Jobs | -1 |


In [3]:
anomaly

| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 |
| 1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 |
| 2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 |
| 3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 |
| 4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 0.305055 | 0.656837 | 0.331665 | 0.822525 | 0.907127 | 0.882276 | 0.855732 | 0.584786 | 0.808640 | 0.242762 |
| 996 | 0.812627 | 0.864258 | 0.616604 | 0.167966 | 0.811223 | 0.938071 | 0.418462 | 0.472306 | 0.348347 | 0.671129 |
| 997 | 0.250967 | 0.138627 | 0.919703 | 0.461234 | 0.886555 | 0.869888 | 0.800908 | 0.530324 | 0.779433 | 0.234952 |
| 998 | 0.502436 | 0.936820 | 0.580062 | 0.540773 | 0.151995 | 0.059452 | 0.225220 | 0.242755 | 0.279385 | 0.538755 |


In [4]:
# create an Isolation Forest model
iforest = create_model('iforest')
# plot the trained model
plot_model(iforest)

In [5]:
# create a K-Nearest Neighbors detector model
knn = create_model('knn')
# plot the trained model
plot_model(knn)

In [6]:
# generate predictions using trained model
knn_predictions = predict_model(knn, data = anomaly)

In [7]:
knn_predictions

| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Anomaly | Anomaly_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 | 0 | 0.558927 |
| 1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 | 0 | 0.477482 |
| 2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 | 0 | 0.676207 |
| 3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 | 1 | 0.804769 |
| 4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 | 0 | 0.630836 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 0.305055 | 0.656837 | 0.331665 | 0.822525 | 0.907127 | 0.882276 | 0.855732 | 0.584786 | 0.808640 | 0.242762 | 0 | 0.266822 |
| 996 | 0.812627 | 0.864258 | 0.616604 | 0.167966 | 0.811223 | 0.938071 | 0.418462 | 0.472306 | 0.348347 | 0.671129 | 0 | 0.403480 |
| 997 | 0.250967 | 0.138627 | 0.919703 | 0.461234 | 0.886555 | 0.869888 | 0.800908 | 0.530324 | 0.779433 | 0.234952 | 0 | 0.337727 |
| 998 | 0.502436 | 0.936820 | 0.580062 | 0.540773 | 0.151995 | 0.059452 | 0.225220 | 0.242755 | 0.279385 | 0.538755 | 0 | 0.300265 |
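
Once predictions are available, a natural next step is to pull out the flagged rows and rank observations by score. The sketch below does this with the standard library on the first five rows of the output above, abbreviated to the Anomaly and Anomaly_Score columns. The column name "index" is added here only to label the rows. With the full DataFrame, the same filtering would normally be done directly in pandas.

```python
import csv
import io

# First five prediction rows from the output above, abbreviated to two columns.
sample = """index,Anomaly,Anomaly_Score
0,0,0.558927
1,0,0.477482
2,0,0.676207
3,1,0.804769
4,0,0.630836
"""

pred_rows = list(csv.DictReader(io.StringIO(sample)))

# Rows labeled as anomalies (Anomaly == 1).
flagged = [r["index"] for r in pred_rows if r["Anomaly"] == "1"]

# All rows ranked from highest to lowest anomaly score.
ranked = sorted(pred_rows, key=lambda r: float(r["Anomaly_Score"]), reverse=True)

print(flagged)                       # ['3']
print([r["index"] for r in ranked])  # ['3', '2', '4', '0', '1']
```

In this sample, the only flagged row is also the one with the highest Anomaly_Score.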
