# Machine Learning Modeling in PyCaret

Tomás von Bischoffshausen Gariazzo

January, 2022

***


### 1. Sample and Split

**1.1 Train Test Split** 

**Machine Learning Goal**
The goal of machine learning is to build a model that generalizes well to new data. Hence, during a supervised machine learning experiment, the dataset is split into a train dataset and a test dataset. The test dataset serves as a proxy for new data. 

**Training set**
In PyCaret, evaluation of a trained machine learning model and optimization of its hyperparameters are performed using k-fold cross validation on the train dataset only. 

**Test Set**
The test dataset (also known as the hold-out set) is not used in training and hence can be passed to the predict_model function to evaluate metrics and determine whether the model has over-fitted the data. 

**Train Size**
By default, PyCaret uses 70% of the dataset for training, which can be changed using the train_size parameter within setup. This functionality is only available in the pycaret.classification and pycaret.regression modules. 

In [None]:
# Importing dataset
from pycaret.datasets import get_data
insurance = get_data('insurance')

# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = insurance, target = 'charges', train_size = 0.5)
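
The split itself can be pictured without PyCaret. The following is a minimal sketch in plain Python, not PyCaret's implementation; the helper `split_rows` and its `seed` argument are hypothetical names used only for illustration.

```python
import random

def split_rows(rows, train_size=0.7, seed=0):
    """Shuffle the rows and split them into a train part and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # fixed seed for reproducibility
    cut = round(len(shuffled) * train_size)    # number of rows kept for training
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))                          # stand-in for 10 dataset rows
train, test = split_rows(rows, train_size=0.7)
print(len(train), len(test))
```

Every row ends up in exactly one of the two parts, so the test part is never seen during training.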


**1.2 Sampling**

**Why sampling?**
Sometimes, you may want to choose a smaller sample size to train models faster.

In [None]:
# Importing dataset
from pycaret.datasets import get_data
bank = get_data('bank')
 
# Importing module and initializing setup
# ('deposit' is a yes/no target, so the classification module is used)
from pycaret.classification import *
clf1 = setup(data = bank, target = 'deposit') #sampling = True


***

### 2. Pre-processing

**2.1 Missing Value Imputation**

**Missing values importance**
For various reasons, datasets may have missing values or empty records, often encoded as blanks or NaN. Most machine learning algorithms are not capable of dealing with missing or blank values. 
 
**Treat them, not delete them**
Removing samples with missing values is a basic strategy that is sometimes used, but it comes at the cost of losing potentially valuable data and the associated information or patterns. A better strategy is to impute the missing values. 

In [None]:
# Importing dataset
from pycaret.datasets import get_data
hepatitis = get_data('hepatitis')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = hepatitis, target = 'Class',
             # numeric_imputation = 'mean',
             # categorical_imputation = 'constant',
             )
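
To make the idea of imputation concrete, here is a minimal sketch in plain Python of the two strategies named in the commented parameters: mean imputation for a numeric feature and constant imputation for a categorical one. It is not PyCaret's code; the function names and the `"not_available"` placeholder are assumptions made for the example.

```python
from statistics import mean

def impute_numeric_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_categorical_constant(values, fill="not_available"):
    """Replace None entries with a constant placeholder level."""
    return [fill if v is None else v for v in values]

ages = [30, None, 50, 40]
sexes = ["male", None, "female", "male"]
print(impute_numeric_mean(ages))
print(impute_categorical_constant(sexes))
```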

**2.2 Change Data Types**

**Data Types**
Each feature in the dataset has an associated data type such as numeric feature, categorical feature or date-time feature. 

**Why ensure data types are correct?**
Several downstream processes depend on the data type of the features, for example: missing value imputation for numeric and categorical features should be performed separately. 

In [None]:
# Importing dataset
from pycaret.datasets import get_data
hepatitis = get_data('hepatitis')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = hepatitis, 
             target = 'Class', 
             categorical_features = ['AGE'],
             # numeric_features = ['column1'],
             )

In [None]:
# Importing dataset
from pycaret.datasets import get_data
pokemon = get_data('pokemon')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = pokemon, 
             target = 'Legendary', 
             ignore_features = ['#', 'Name'])

**2.3 One Hot Encoding**

**Why encoding?**
Machine learning algorithms cannot work directly with categorical data, so it must be transformed into numeric values before training a model.

**What is one hot encoding?**
The most common type of categorical encoding is One Hot Encoding (also known as dummy encoding), where each categorical level becomes a separate feature in the dataset containing binary values (1 or 0). 


**One hot encoding in PyCaret**
Since this is an imperative step to perform an ML experiment, PyCaret will transform all categorical features in the dataset using one hot encoding. This is ideal for features having nominal categorical data, i.e. data that cannot be ordered. 


In [None]:
# Importing dataset
from pycaret.datasets import get_data
pokemon = get_data('pokemon')
 
# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = pokemon,
             target = 'Legendary')
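
As an illustration of what one hot encoding produces, the sketch below builds one 0/1 column per level in plain Python. It is only a toy version of the idea; `one_hot` is a hypothetical helper and the sample values are made up.

```python
def one_hot(values):
    """Return one 0/1 column per distinct level, keyed by level name."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}

types = ["Fire", "Water", "Grass", "Fire"]
encoded = one_hot(types)
for level, column in encoded.items():
    print(level, column)
```

Each row has exactly one 1 across the new columns, which is where the name comes from.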

**2.4 Ordinal Encoding**
 
**Why ordinal encoding?**
When categorical features in the dataset contain variables with an intrinsic natural order, such as Low, Medium and High, they must be encoded differently from nominal variables (where there is no intrinsic order, e.g. Male or Female). 

In [None]:
# Importing dataset
from pycaret.datasets import get_data
employee = get_data('employee')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = employee, 
             target = 'left', 
             ordinal_features = {'salary' : ['low', 'medium', 'high']})
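
The mapping passed in ordinal_features can be read as "position in the list is the encoded value". A minimal sketch of that idea, assuming integer codes starting at 0 (the helper name `ordinal_encode` is hypothetical):

```python
def ordinal_encode(values, ordered_levels):
    """Map each level to its position in the ordered list of levels."""
    rank = {level: i for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]

salary = ["low", "high", "medium", "low"]
codes = ordinal_encode(salary, ["low", "medium", "high"])
print(codes)
```

Unlike one hot encoding, the resulting single column preserves the order low < medium < high.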

**2.5 Cardinal Encoding**

**Why cardinal encoding?**
When categorical features in the dataset contain variables with many levels (also known as high cardinality features), typical One Hot Encoding creates a very large number of new features, making the experiment slow and potentially introducing noise for certain machine learning algorithms. 

In [None]:
# Importing dataset
from pycaret.datasets import get_data
income = get_data('income')
 
# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = income, 
             target = 'income >50K', 
             high_cardinality_features = ['native-country'],
             # high_cardinality_method = 'frequency',  # or 'clustering'
             )
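
One way to picture the 'frequency' method is to replace each level by how often it occurs, so a feature with many levels becomes a single numeric column. The sketch below assumes relative frequencies are used; it is an illustration, not PyCaret's implementation, and `frequency_encode` is a hypothetical name.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each level by its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

countries = ["US", "US", "Mexico", "US", "India"]
encoded_country = frequency_encode(countries)
print(encoded_country)
```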

**2.6 Handle Unknown Levels**

**Why handle unknown levels?**
When unseen data has new levels in a categorical feature that were not present at the time of training the model, the trained algorithm may have problems generating accurate predictions. 

**What can be done?**
One way to deal with such data points is to reassign them to a known level of the categorical feature, i.e. a level present in the training dataset. 

In [None]:
# Importing dataset
from pycaret.datasets import get_data
insurance = get_data('insurance')

# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = insurance, target = 'charges', 
             handle_unknown_categorical = True, 
             unknown_categorical_method = 'most_frequent')  # or 'least_frequent'
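
The reassignment described above can be sketched in a few lines of plain Python. This is a simplified illustration of the 'most_frequent' option, with hypothetical helper names and made-up sample values.

```python
from collections import Counter

def fit_known_levels(train_values):
    """Remember the levels seen in training and the most frequent one."""
    counts = Counter(train_values)
    most_frequent = counts.most_common(1)[0][0]
    return set(counts), most_frequent

def replace_unknown(values, known, fallback):
    """Reassign any level not seen in training to the fallback level."""
    return [v if v in known else fallback for v in values]

train_region = ["southwest", "southeast", "southeast", "northwest"]
new_region = ["southeast", "northeast"]   # 'northeast' was not seen in training
known, fallback = fit_known_levels(train_region)
fixed_region = replace_unknown(new_region, known, fallback)
print(fixed_region)
```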


***

### 3. Scale and Transform

**3.1 Normalization**

**What is normalization?**
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to rescale the values of numeric columns in the dataset without distorting differences in the ranges of values or losing information. 

**Normalization methods**

zscore : the standard z-score is calculated as z = (x – u) / s, where u is the mean and s is the standard deviation of the feature. 

minmax : scales and translates each feature individually such that it is in the range of 0 – 1. 

maxabs : scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. 

robust : scales and translates each feature according to the Interquartile range. When the dataset contains outliers, robust scaler often gives better results.
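
The four methods can be compared side by side on a small column. The sketch below is plain Python, not PyCaret's code; it assumes the population standard deviation for zscore and assumes that robust subtracts the median before dividing by the interquartile range.

```python
from statistics import mean, median, pstdev, quantiles

def zscore(xs):
    """z = (x - u) / s with u the mean and s the standard deviation."""
    u, s = mean(xs), pstdev(xs)
    return [(x - u) / s for x in xs]

def minmax(xs):
    """Rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def maxabs(xs):
    """Divide by the largest absolute value so it becomes 1.0."""
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]

def robust(xs):
    """Center on the median and scale by the interquartile range."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]

values = [1.0, 2.0, 3.0, 4.0, 100.0]   # 100.0 acts as an outlier
for scaler in (zscore, minmax, maxabs, robust):
    print(scaler.__name__, [round(v, 3) for v in scaler(values)])
```

With the outlier present, minmax and maxabs squeeze the first four values close together, while robust keeps them evenly spaced.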

In [None]:
# Importing dataset
from pycaret.datasets import get_data
pokemon = get_data('pokemon')
 
# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = pokemon, 
             target = 'Legendary', 
             normalize = True, 
             # normalize_method = 'zscore',
             )


**3.2 Transformation**

**What is transformation?**
While normalization rescales the data within new limits to reduce the impact of magnitude in the variance, Transformation is a more radical technique. Transformation changes the shape of the distribution such that the transformed data can be represented by a normal or approximate normal distribution. In general, data must be transformed when using ML algorithms that assume normality or a gaussian distribution in the residuals. Examples of such models are Logistic Regression, Linear Discriminant Analysis (LDA) and Gaussian Naive Bayes. 

**Transformation Methods**

**transformation: bool, default = False** When set to True, a power transformation is applied to make the data more normal / Gaussian-like. This is useful for modeling issues related to heteroscedasticity or other situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood. 

**transformation_method: string, default = ‘yeo-johnson’** Defines the method for transformation. By default, the transformation method is set to ‘yeo-johnson’. The other available option is ‘quantile’ transformation. Both methods transform the feature set to follow a Gaussian-like or normal distribution. Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.
 

In [None]:
# Importing dataset
from pycaret.datasets import get_data
pokemon = get_data('pokemon')
 
# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = pokemon, 
             target = 'Legendary', 
             transformation = True)
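
For reference, the Yeo-Johnson transform for a single value can be written directly from its definition. The sketch below uses a fixed, hand-picked `lmbda`; in practice the parameter is estimated from the data (by maximum likelihood, as noted above), which is not shown here.

```python
import math

def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transform of one value for a given lambda."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)                      # log(x + 1)
        return ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)                        # -log(-x + 1)
    return -((-x + 1) ** (2 - lmbda) - 1) / (2 - lmbda)

skewed = [0.0, 1.0, 10.0, 100.0, 1000.0]
# lmbda = 0 is chosen by hand here; it compresses the long right tail
print([round(yeo_johnson(x, 0), 3) for x in skewed])
```

With lmbda = 1 the transform leaves values unchanged, which is a quick sanity check of the formula.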

**3.3 Target Transformation**

Target transformation is similar to feature transformation: it changes the shape of the distribution of the target variable. The target must be transformed when linear algorithms such as Linear Regression or Linear Discriminant Analysis are used for modeling. 
 
**transform_target: bool, default = False** When set to True, target variable is transformed using the method defined in transform_target_method param. Target transformation is applied separately from feature transformations. 

**transform_target_method: string, default = ‘box-cox’**
The ‘box-cox’ and ‘yeo-johnson’ methods are supported. Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data. When transform_target_method is ‘box-cox’ and the target variable contains negative values, the method is internally forced to ‘yeo-johnson’ to avoid exceptions.


In [None]:
# Importing dataset
from pycaret.datasets import get_data
diamond = get_data('diamond')
 
# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = diamond, target = 'Price', transform_target = True)
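
A transformed target is only useful if predictions can be mapped back to the original scale. The sketch below shows the Box-Cox transform and its inverse for a hand-picked `lmbda`; it is a plain Python illustration, not how PyCaret applies the transformation internally.

```python
import math

def box_cox(y, lmbda):
    """Box-Cox transform; only defined for strictly positive values."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(y) if lmbda == 0 else (y ** lmbda - 1) / lmbda

def inverse_box_cox(z, lmbda):
    """Map a transformed value back to the original scale."""
    return math.exp(z) if lmbda == 0 else (lmbda * z + 1) ** (1 / lmbda)

price = 3500.0
z = box_cox(price, 0.5)
print(z, inverse_box_cox(z, 0.5))
```

The strict positivity check mirrors the restriction described above, which is why a target with negative values needs Yeo-Johnson instead.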

### 4. Feature Engineering


**4.1 Feature Interaction**

In machine learning experiments it is often seen that two features combined through an arithmetic operation become more significant in explaining variance in the data than the same two features separately. Creating a new feature through the interaction of existing features is known as feature interaction. It can be achieved in PyCaret using the feature_interaction and feature_ratio parameters within setup. Feature interaction creates new features by multiplying two variables (a * b), while feature ratio creates new features by calculating the ratios of existing features (a / b).

**feature_interaction: bool, default = False** When set to True, it will create new features by interacting (a * b) for all numeric variables in the dataset including polynomial and trigonometric features (if created). This feature is not scalable and may not work as expected on datasets with large feature space.

**feature_ratio: bool, default = False** When set to True, it will create new features by calculating the ratios (a / b) of all numeric variables in the dataset. This feature is not scalable and may not work as expected on datasets with large feature space.

**interaction_threshold: float, default = 0.01** Similar to polynomial_threshold, it is used to compress a sparse matrix of newly created features through interaction. Features whose importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.

In [None]:
# Importing dataset
from pycaret.datasets import get_data
insurance = get_data('insurance')
 
# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = insurance, target = 'charges', feature_interaction = True, feature_ratio = True)
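
To see why this does not scale, it helps to generate the new features by hand. The sketch below creates a product and a ratio for every pair of numeric features in one row; the generated column names are hypothetical, and only one ordering (a / b) is computed per pair to keep the example short.

```python
from itertools import combinations

def interaction_and_ratio(row):
    """Build a * b and a / b features for every pair of numeric features."""
    new = {}
    for a, b in combinations(row, 2):
        new[f"{a}_multiply_{b}"] = row[a] * row[b]
        if row[b] != 0:                      # skip ratios with a zero denominator
            new[f"{a}_divide_{b}"] = row[a] / row[b]
    return new

row = {"age": 40, "bmi": 25.0, "children": 2}
new_features = interaction_and_ratio(row)
print(new_features)
```

Three input features already yield six new ones; the number of pairs grows quadratically with the number of features.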


**4.2 Polynomial Features**

In machine learning experiments the relationship between the dependent and independent variables is often assumed to be linear; however, this is not always the case. Sometimes the relationship is more complex. Creating new polynomial features can help capture a relationship that might otherwise go unnoticed. PyCaret can create polynomial features from existing features using the polynomial_features parameter within setup.

**polynomial_features: bool, default = False** When set to True, new features are created based on all polynomial combinations that exist within the numeric features in a dataset to the degree defined in polynomial_degree param.

**polynomial_degree: int, default = 2** Degree of polynomial features. For example, if an input sample is two dimensional and of the form [a, b], the polynomial features with degree = 2 are: [1, a, b, a^2, ab, b^2].

**polynomial_threshold: float, default = 0.1** This is used to compress a sparse matrix of polynomial and trigonometric features. Polynomial and trigonometric features whose feature importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.

In [None]:
# Importing dataset
from pycaret.datasets import get_data
juice = get_data('juice')
 
# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = juice, target = 'Purchase', polynomial_features = True)
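
The [1, a, b, a^2, ab, b^2] expansion mentioned above can be reproduced in a few lines. This is an illustrative sketch in plain Python with a hypothetical helper name, not the routine PyCaret uses.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_features(sample, degree=2):
    """All products of the inputs up to the given degree, lowest degree first."""
    features = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(sample, d):
            features.append(prod(combo))     # the empty product is 1
    return features

a, b = 2, 3
print(polynomial_features([a, b], degree=2))
```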


**4.3 Trigonometry Features**

Similar to Polynomial Features, PyCaret also allows creating new trigonometry features from the existing features. It is achieved using the trigonometry_features parameter within setup.

**trigonometry_features: bool, default = False** When set to True, new features are created based on all trigonometric combinations that exist within the numeric features in a dataset to the degree defined in the polynomial_degree param.


In [None]:
# Importing dataset
from pycaret.datasets import get_data
insurance = get_data('insurance')

# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = insurance, target = 'charges', trigonometry_features = True)


**4.4 Group Features**

When a dataset contains features that are related to each other in some way, for example features recorded at fixed time intervals, new statistical features such as mean, median, variance and standard deviation for a group of such features can be created from existing features using the group_features parameter within setup.

**group_features: list or list of list, default = None** When a dataset contains features that have related characteristics, the group_features param can be used for statistical feature extraction. For example, if a dataset has numeric features that are related with each other (i.e. ‘Col1’, ‘Col2’, ‘Col3’), a list containing the column names can be passed under group_features to extract statistical information such as the mean, median, mode and standard deviation.

**group_names: list, default = None** When group_features is passed, a name of the group can be passed into the group_names param as a list containing strings. The length of the group_names list must equal the length of group_features. When the length doesn’t match or the name is not passed, new features are sequentially named such as group_1, group_2 etc.


In [None]:
from pycaret.datasets import get_data
credit = get_data('credit')
 
# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = credit, target = 'default', group_features = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'])
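
Conceptually, group features are row-wise statistics over the listed columns. The sketch below computes a few of them for one row in plain Python; the output column names and the use of population variance are assumptions made for the example.

```python
from statistics import mean, median, pstdev, pvariance

def group_statistics(row, columns, name="group_1"):
    """Row-wise summary statistics over a group of related columns."""
    values = [row[c] for c in columns]
    return {
        f"{name}_mean": mean(values),
        f"{name}_median": median(values),
        f"{name}_variance": pvariance(values),
        f"{name}_std": pstdev(values),
    }

row = {"BILL_AMT1": 100, "BILL_AMT2": 200, "BILL_AMT3": 300}
stats = group_statistics(row, ["BILL_AMT1", "BILL_AMT2", "BILL_AMT3"])
print(stats)
```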


**4.5 Bin Numeric Features**

Feature binning is a method of turning continuous variables into categorical values using a pre-defined number of bins. It is effective when a continuous feature has too many unique values or a few extreme values outside the expected range. Such extreme values influence the trained model, thereby affecting its prediction accuracy. In PyCaret, continuous numeric features can be binned into intervals using the bin_numeric_features parameter within setup. PyCaret uses the ‘sturges’ rule to determine the number of bins and K-Means clustering to convert continuous numeric features into categorical features.

bin_numeric_features: list, default = None
When a list of numeric features is passed they are transformed into categorical features using K-Means, where values in each bin have the same nearest center of a 1D k-means cluster. The number of clusters is determined based on the ‘sturges’ method. The ‘sturges’ method is only optimal for gaussian data and underestimates the number of bins for large non-gaussian datasets.


In [None]:
# Importing dataset
from pycaret.datasets import get_data
income = get_data('income')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = income, target = 'income >50K', bin_numeric_features = ['age'])
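
Sturges' rule itself is a one-line formula, k = ceil(log2(n)) + 1, where n is the number of samples. The sketch below only computes the number of bins; the K-Means step that assigns values to bins is not shown.

```python
import math

def sturges_bins(n_samples):
    """Number of bins suggested by Sturges' rule: ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n_samples)) + 1

for n in (8, 1000, 1_000_000):
    print(n, sturges_bins(n))
```

The bin count grows only logarithmically with the sample size, which is why the rule tends to give too few bins for large datasets.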


**4.6 Combine Rare Levels**

Sometimes a dataset can have a categorical feature (or multiple categorical features) with a very high number of levels (i.e. high cardinality features). If such a feature is encoded into numeric values, the resultant matrix is sparse. This not only makes the experiment slow, due to the manifold increase in the number of features and hence the size of the dataset, but also introduces noise in the experiment. A sparse matrix can be avoided by combining the rare levels in the high cardinality feature (or features). This can be achieved in PyCaret using the combine_rare_levels parameter within setup.

**combine_rare_levels: bool, default = False** When set to True, all levels in categorical features below the threshold defined in rare_level_threshold param are combined together as a single level. There must be at least two levels under the threshold for this to take effect. rare_level_threshold represents the percentile distribution of level frequency. Generally, this technique is applied to limit a sparse matrix caused by high numbers of levels in categorical features.

**rare_level_threshold: float, default = 0.1** Percentile distribution below which rare categories are combined. Only comes into effect when combine_rare_levels is set to True.


In [None]:
# Importing dataset
from pycaret.datasets import get_data
income = get_data('income')
 
# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = income, target = 'income >50K', combine_rare_levels = True)
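
The sketch below illustrates the idea on a single column. It interprets the threshold as a relative frequency, which is a simplifying assumption, and the label used for the merged level is a hypothetical placeholder.

```python
from collections import Counter

def combine_rare_levels(values, threshold=0.1, other="others_infrequent"):
    """Merge levels whose relative frequency is below the threshold."""
    counts = Counter(values)
    n = len(values)
    rare = {level for level, c in counts.items() if c / n < threshold}
    if len(rare) < 2:                        # at least two rare levels are required
        return list(values)
    return [other if v in rare else v for v in values]

countries = ["US"] * 18 + ["Peru", "Chile"]
combined = combine_rare_levels(countries, threshold=0.1)
print(Counter(combined))
```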


### 5. Feature Selection


**5.1 Feature Importance**

Feature Importance is a process used to select the features in the dataset that contribute the most to predicting the target variable. Working with selected features instead of all the features reduces the risk of over-fitting, improves accuracy, and decreases the training time. In PyCaret, this can be achieved using the feature_selection parameter. It uses a combination of several supervised feature selection techniques to select the subset of features that are most important for modeling. The size of the subset can be controlled using the feature_selection_threshold parameter within setup.

**feature_selection: bool, default = False** When set to True, a subset of features is selected using a combination of various permutation importance techniques including Random Forest, Adaboost and Linear correlation with the target variable. The size of the subset is dependent on the feature_selection_threshold param. Generally, this is used to constrain the feature space in order to improve efficiency in modeling. When polynomial_features and feature_interaction are used, it is highly recommended to define the feature_selection_threshold param with a lower value.

**feature_selection_threshold: float, default = 0.8** Threshold used for feature selection (including newly created polynomial features). A higher value will result in a higher feature space. It is recommended to do multiple trials with different values of feature_selection_threshold, especially in cases where polynomial_features and feature_interaction are used. Setting a very low value may be efficient but could result in under-fitting.


In [None]:
# Importing dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')
 
# Importing module and initializing setup
# ('Class variable' is a binary target, so the classification module is used)
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable', feature_selection = True)


**5.2 Remove Multicollinearity**

Multicollinearity (also called collinearity) is a phenomenon in which one feature variable in the dataset is highly linearly correlated with another feature variable in the same dataset. Multicollinearity increases the variance of the coefficients, thus making them unstable and noisy for linear models. One way to deal with multicollinearity is to drop one of the two features that are highly correlated with each other. This can be achieved in PyCaret using the remove_multicollinearity parameter within setup.

**remove_multicollinearity: bool, default = False** When set to True, the variables with inter-correlations higher than the threshold defined under the multicollinearity_threshold param are dropped. When two features are highly correlated with each other, the feature that is less correlated with the target variable is dropped.

**multicollinearity_threshold: float, default = 0.9** Threshold used for dropping the correlated features. Only comes into effect when remove_multicollinearity is set to True.


In [None]:
# Importing dataset
from pycaret.datasets import get_data
concrete = get_data('concrete')

# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = concrete, target = 'strength', remove_multicollinearity = True, multicollinearity_threshold = 0.6)
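
The dropping rule described above can be sketched with a hand-written Pearson correlation. This is a simplified illustration on made-up numbers, not PyCaret's implementation; the function names and the toy columns are assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def drop_collinear(features, target, threshold=0.9):
    """For each highly correlated pair, drop the feature less correlated with the target."""
    kept = dict(features)
    names = list(features)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in kept and b in kept and abs(pearson(features[a], features[b])) > threshold:
                weaker = min((a, b), key=lambda n: abs(pearson(features[n], target)))
                del kept[weaker]
    return kept

features = {
    "cement": [1, 2, 3, 4, 5],
    "cement_copy": [2, 4, 6, 8, 11],     # nearly a multiple of 'cement'
    "water": [5, 3, 4, 1, 2],
}
strength = [1.1, 2.0, 2.9, 4.2, 5.0]
kept = drop_collinear(features, strength, threshold=0.9)
print(sorted(kept))
```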


**5.3 Principal Component Analysis**

Principal Component Analysis (PCA) is an unsupervised technique used in machine learning to reduce the dimensionality of data. It compresses the feature space by identifying a subspace that captures most of the information in the complete feature matrix, and projects the original feature space into that lower dimensional space. This can be achieved in PyCaret using the pca parameter within setup.

**pca: bool, default = False** When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in pca_method param. In supervised learning pca is generally performed when dealing with high feature space and memory is a constraint. Note that not all datasets can be decomposed efficiently using a linear PCA technique and that applying PCA may result in loss of information. As such, it is advised to run multiple experiments with different pca_methods to evaluate the impact.

**pca_method: string, default = ‘linear’** The ‘linear’ method performs linear dimensionality reduction using Singular Value Decomposition. The other available options are:

kernel : dimensionality reduction through the use of an RBF kernel.

incremental : replacement for ‘linear’ pca when the dataset to be decomposed is too large to fit in memory.

**pca_components: int/float, default = 0.99** Number of components to keep. If pca_components is a float, it is treated as a target percentage for information retention. When pca_components is an integer it is treated as the number of features to be kept. pca_components must be strictly less than the original number of features in the dataset.


In [None]:
# Importing dataset
from pycaret.datasets import get_data
income = get_data('income')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = income, target = 'income >50K', pca = True, pca_components = 10)


**5.4 Ignore Low Variance**

Sometimes a dataset may have a categorical feature with multiple levels, where the distribution of such levels is skewed and one level dominates the others. This means there is not much variation in the information provided by such a feature. For an ML model, such a feature may not add much information and can thus be ignored for modeling. This can be achieved in PyCaret using the ignore_low_variance parameter within setup. Both conditions below must be met for a feature to be considered a low variance feature.

Count of unique values in a feature / sample size < 10%

Count of most common value / count of second most common value > 20 times
 
ignore_low_variance: bool, default = False
When set to True, all categorical features with statistically insignificant variances are removed from the dataset. The variance is calculated using the ratio of unique values to the number of samples, and the ratio of the frequency of the most common value to the frequency of the second most common value.


In [None]:
# Importing dataset
from pycaret.datasets import get_data
mice = get_data('mice')
 
# Filter the column to demonstrate example
mice = mice[mice['Genotype'] == 'Control']
 
# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = mice, target = 'class', ignore_low_variance = True)
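
The two conditions for a low variance feature can be checked directly on a single column. The sketch below is a plain Python illustration; treating a column with only one level as low variance is an assumption made here, since the second ratio is undefined in that case.

```python
from collections import Counter

def is_low_variance(values, unique_ratio_max=0.10, dominance_min=20):
    """Check both low variance conditions for one categorical column."""
    counts = Counter(values).most_common()
    unique_ratio = len(counts) / len(values)
    if len(counts) == 1:
        return True                      # assumption: a constant column counts as low variance
    dominance = counts[0][1] / counts[1][1]
    return unique_ratio < unique_ratio_max and dominance > dominance_min

skewed = ["Control"] * 98 + ["Ts65Dn"] * 2
balanced = ["Control"] * 50 + ["Ts65Dn"] * 50
print(is_low_variance(skewed), is_low_variance(balanced))
```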


### 5. (v2) Unsupervised Learning


**5.1 (v2) Create clusters**

Creating clusters from the existing features of the data is an unsupervised ML technique to engineer new features. It uses an iterative approach to determine the number of clusters using a combination of the Calinski-Harabasz and Silhouette criteria. Each data point with the original features is assigned to a cluster. The assigned cluster label is then used as a new feature in predicting the target variable. This can be achieved in PyCaret using the create_clusters parameter within setup.

**create_clusters: bool, default = False** When set to True, an additional feature is created where each instance is assigned to a cluster. The number of clusters is determined using a combination of the Calinski-Harabasz and Silhouette criteria.

**cluster_iter: int, default = 20** Number of iterations used to create a cluster. Each iteration represents cluster size. Only comes into effect when create_clusters param is set to True.


In [None]:
# Importing dataset
from pycaret.datasets import get_data
insurance = get_data('insurance')
 
# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = insurance, target = 'charges', create_clusters = True)


**5.2 (v2) Remove Outliers**

The Remove Outliers function in PyCaret allows you to identify and remove outliers from the dataset before training the model. Outliers are identified through PCA linear dimensionality reduction using the Singular Value Decomposition technique. It can be achieved using the remove_outliers parameter within setup. The proportion of outliers is controlled through the outliers_threshold parameter.

**remove_outliers: bool, default = False** When set to True, outliers from the training data are removed using PCA linear dimensionality reduction using the Singular Value Decomposition technique.

**outliers_threshold: float, default = 0.05** The percentage / proportion of outliers in the dataset can be defined using the outliers_threshold param. By default, 0.05 is used which means 0.025 of the values on each side of the distribution’s tail are dropped from training data.


In [None]:
# Importing dataset
from pycaret.datasets import get_data
insurance = get_data('insurance')

# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = insurance, target = 'charges', remove_outliers = True)


*** 

### 6. Models

**6.1 Classification**

**Classification setup functions**

The setup function initializes the training environment and creates the transformation pipeline. It must be called before executing any other function. It takes two mandatory parameters: data and target. All the other parameters are optional.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')

In [None]:
pycaret.classification.setup(data: pandas.core.frame.DataFrame, 
                             target: str, 
                             train_size: float = 0.7, 
                             test_data: Optional[pandas.core.frame.DataFrame] = None, 
                             preprocess: bool = True, 
                             imputation_type: str = 'simple', 
                             iterative_imputation_iters: int = 5, 
                             categorical_features: Optional[List[str]] = None, 
                             categorical_imputation: str = 'constant', 
                             categorical_iterative_imputer: Union[str, Any] = 'lightgbm', 
                             ordinal_features: Optional[Dict[str, list]] = None, 
                             high_cardinality_features: Optional[List[str]] = None, 
                             high_cardinality_method: str = 'frequency', 
                             numeric_features: Optional[List[str]] = None, 
                             numeric_imputation: str = 'mean', 
                             numeric_iterative_imputer: Union[str, Any] = 'lightgbm', 
                             date_features: Optional[List[str]] = None, 
                             ignore_features: Optional[List[str]] = None, 
                             normalize: bool = False, 
                             normalize_method: str = 'zscore', 
                             transformation: bool = False, 
                             transformation_method: str = 'yeo-johnson', 
                             handle_unknown_categorical: bool = True, 
                             unknown_categorical_method: str = 'least_frequent', 
                             pca: bool = False, 
                             pca_method: str = 'linear', 
                             pca_components: Optional[float] = None, 
                             ignore_low_variance: bool = False, 
                             combine_rare_levels: bool = False, 
                             rare_level_threshold: float = 0.1, 
                             bin_numeric_features: Optional[List[str]] = None, 
                             remove_outliers: bool = False, 
                             outliers_threshold: float = 0.05, 
                             remove_multicollinearity: bool = False, 
                             multicollinearity_threshold: float = 0.9, 
                             remove_perfect_collinearity: bool = True, 
                             create_clusters: bool = False, 
                             cluster_iter: int = 20, 
                             polynomial_features: bool = False, 
                             polynomial_degree: int = 2, 
                             trigonometry_features: bool = False, 
                             polynomial_threshold: float = 0.1, 
                             group_features: Optional[List[str]] = None, 
                             group_names: Optional[List[str]] = None, 
                             feature_selection: bool = False, 
                             feature_selection_threshold: float = 0.8, 
                             feature_selection_method: str = 'classic', 
                             feature_interaction: bool = False, 
                             feature_ratio: bool = False, 
                             interaction_threshold: float = 0.01, 
                             fix_imbalance: bool = False, 
                             fix_imbalance_method: Optional[Any] = None, 
                             data_split_shuffle: bool = True, 
                             data_split_stratify: Union[bool, List[str]] = False, 
                             fold_strategy: Union[str, Any] = 'stratifiedkfold', 
                             fold: int = 10, fold_shuffle: bool = False, 
                             fold_groups: Optional[Union[str, pandas.core.frame.DataFrame]] = None, 
                             n_jobs: Optional[int] = - 1, 
                             use_gpu: bool = False, 
                             custom_pipeline: Optional[Union[Any, Tuple[str, Any], List[Any], List[Tuple[str, Any]]]] = None,
                             html: bool = True,
                             session_id: Optional[int] = None,
                             log_experiment: bool = False,
                             experiment_name: Optional[str] = None,
                             log_plots: Union[bool, list] = False,
                             log_profile: bool = False,
                             log_data: bool = False,
                             silent: bool = False,
                             verbose: bool = True,
                             profile: bool = False, 
                             profile_kwargs: Optional[Dict[str, Any]] = None)

data: pandas.DataFrame
Shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.

target: str
Name of the target column to be passed in as a string. The target variable can be either binary or multiclass.

train_size: float, default = 0.7
Proportion of the dataset to be used for training and validation. Should be between 0.0 and 1.0.

test_data: pandas.DataFrame, default = None
If not None, test_data is used as a hold-out set and train_size parameter is ignored. test_data must be labelled and the shape of data and test_data must match.

preprocess: bool, default = True
When set to False, no transformations are applied except for train_test_split and custom transformations passed in custom_pipeline param. Data must be ready for modeling (no missing values, no dates, categorical data encoding), when preprocess is set to False.

imputation_type: str, default = ‘simple’
The type of imputation to use. Can be either ‘simple’ or ‘iterative’.

iterative_imputation_iters: int, default = 5
Number of iterations. Ignored when imputation_type is not ‘iterative’.

categorical_features: list of str, default = None
If the inferred data types are not correct or the silent param is set to True, categorical_features param can be used to overwrite or define the data types. It takes a list of strings with column names that are categorical.

categorical_imputation: str, default = ‘constant’
Missing values in categorical features are imputed with a constant ‘not_available’ value. The other available option is ‘mode’.

categorical_iterative_imputer: str, default = ‘lightgbm’
Estimator for iterative imputation of missing values in categorical features. Ignored when imputation_type is not ‘iterative’.

ordinal_features: dict, default = None
Encode categorical features as ordinal. For example, a categorical feature with ‘low’, ‘medium’, ‘high’ values where low < medium < high can be passed as ordinal_features = { ‘column_name’ : [‘low’, ‘medium’, ‘high’] }.

high_cardinality_features: list of str, default = None
When a categorical feature contains many levels, it can be compressed into fewer levels using this parameter. It takes a list of strings with column names that are categorical.

high_cardinality_method: str, default = ‘frequency’
Categorical features with high cardinality are replaced with the frequency of values in each level occurring in the training dataset. Other available method is ‘clustering’ which trains the K-Means clustering algorithm on the statistical attribute of the training data and replaces the original value of feature with the cluster label. The number of clusters is determined by optimizing Calinski-Harabasz and Silhouette criterion.
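As a rough illustration of the ‘frequency’ method, the sketch below replaces each level with the number of times it occurs in the training data. It is plain Python, not PyCaret's implementation: the names `frequency_encode` and `train_levels` are made up for this example, and mapping unseen levels to 0 is an assumption.

```python
from collections import Counter


def frequency_encode(train_levels, values):
    """Replace each categorical level with its count in the training data."""
    counts = Counter(train_levels)
    # Levels never seen in training get 0 (an assumption of this sketch).
    return [counts.get(v, 0) for v in values]


train_levels = ["NY", "LA", "NY", "SF", "NY", "LA"]
encoded = frequency_encode(train_levels, ["NY", "SF", "TX"])
print(encoded)  # [3, 1, 0]
```

The counts are learned from the training levels only, so the same mapping can be reused on new data.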

numeric_features: list of str, default = None
If the inferred data types are not correct or the silent param is set to True, numeric_features param can be used to overwrite or define the data types. It takes a list of strings with column names that are numeric.

numeric_imputation: str, default = ‘mean’
Missing values in numeric features are imputed with the ‘mean’ value of the feature in the training dataset. The other available options are ‘median’ and ‘zero’.
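The three simple strategies can be sketched in a few lines of plain Python. This is only an illustration of the idea, not PyCaret's code: the helper `impute` is hypothetical and missing values are represented as `None`.

```python
from statistics import mean, median


def impute(train_values, values, method="mean"):
    """Fill None entries with a statistic learned from the training data."""
    observed = [v for v in train_values if v is not None]
    if method == "mean":
        fill = mean(observed)
    elif method == "median":
        fill = median(observed)
    elif method == "zero":
        fill = 0
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


train = [1.0, None, 3.0, 8.0]
print(impute(train, train, "mean"))    # [1.0, 4.0, 3.0, 8.0]
print(impute(train, train, "median"))  # [1.0, 3.0, 3.0, 8.0]
print(impute(train, train, "zero"))    # [1.0, 0, 3.0, 8.0]
```

The fill value is always computed from the training values, even when it is applied to other data.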

numeric_iterative_imputer: str, default = ‘lightgbm’
Estimator for iterative imputation of missing values in numeric features. Ignored when imputation_type is set to ‘simple’.

date_features: list of str, default = None
If the inferred data types are not correct or the silent param is set to True, date_features param can be used to overwrite or define the data types. It takes a list of strings with column names that are DateTime.

ignore_features: list of str, default = None
ignore_features param can be used to ignore features during model training. It takes a list of strings with column names that are to be ignored.

normalize: bool, default = False
When set to True, it transforms the numeric features by scaling them to a given range. Type of scaling is defined by the normalize_method parameter.

normalize_method: str, default = ‘zscore’
Defines the method for scaling. By default, normalize_method is set to ‘zscore’. The standard zscore is calculated as z = (x - u) / s, where u is the mean and s is the standard deviation of the feature. Ignored when normalize is not True. The other options are:

minmax: scales and translates each feature individually such that it is in the range of 0 - 1.

maxabs: scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.

robust: scales and translates each feature according to the Interquartile range. When the dataset contains outliers, robust scaler often gives better results.
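To compare the four scaling methods, here is a minimal plain-Python sketch applied to a single feature that contains one outlier. The function names are chosen for this example only. The population standard deviation and the inclusive quartile method are assumptions meant to resemble the usual scikit-learn scalers, not a copy of PyCaret's internals.

```python
from statistics import mean, pstdev, quantiles


def zscore(xs):
    # z = (x - u) / s with the population standard deviation
    u, s = mean(xs), pstdev(xs)
    return [(x - u) / s for x in xs]


def minmax(xs):
    # Rescale to the range 0 - 1
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def maxabs(xs):
    # Divide by the largest absolute value; the data is not shifted
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]


def robust(xs):
    # Center on the median and scale by the interquartile range
    q1, q2, q3 = quantiles(xs, n=4, method="inclusive")
    return [(x - q2) / (q3 - q1) for x in xs]


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an outlier
print(minmax(values))
print(robust(values))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

With minmax the outlier squeezes the other four values into a narrow band near 0, while robust keeps them evenly spread around the median.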

transformation: bool, default = False
When set to True, it applies the power transform to make data more Gaussian-like. Type of transformation is defined by the transformation_method parameter.

transformation_method: str, default = ‘yeo-johnson’
Defines the method for transformation. By default, the transformation method is set to ‘yeo-johnson’. The other available option for transformation is ‘quantile’. Ignored when transformation is not True.

handle_unknown_categorical: bool, default = True
When set to True, unknown categorical levels in unseen data are replaced by the most or least frequent level as learned in the training dataset.

unknown_categorical_method: str, default = ‘least_frequent’
Method used to replace unknown categorical levels in unseen data. Method can be set to ‘least_frequent’ or ‘most_frequent’.
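The replacement of unknown levels can be pictured with the following plain-Python sketch. The helper names are hypothetical and ties between equally frequent levels are broken arbitrarily here; PyCaret's own handling may differ.

```python
from collections import Counter


def fit_replacement(train_levels, method="least_frequent"):
    """Pick the level that will stand in for unseen categories."""
    ranked = Counter(train_levels).most_common()  # sorted by descending count
    return ranked[-1][0] if method == "least_frequent" else ranked[0][0]


def handle_unknown(train_levels, values, method="least_frequent"):
    known = set(train_levels)
    replacement = fit_replacement(train_levels, method)
    return [v if v in known else replacement for v in values]


train_levels = ["red", "red", "red", "blue", "blue", "green"]
print(handle_unknown(train_levels, ["red", "purple"]))            # ['red', 'green']
print(handle_unknown(train_levels, ["purple"], "most_frequent"))  # ['red']
```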

pca: bool, default = False
When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in pca_method parameter.

pca_method: str, default = ‘linear’
The ‘linear’ method performs linear dimensionality reduction using Singular Value Decomposition. Other options are:

kernel: dimensionality reduction through the use of RBF kernel.

incremental: replacement for ‘linear’ pca when the dataset is too large.

pca_components: int or float, default = None
Number of components to keep. If pca_components is a float, it is treated as a target percentage for information retention. When pca_components is an integer it is treated as the number of features to be kept. pca_components must be less than the original number of features. Ignored when pca is not True.

ignore_low_variance: bool, default = False
When set to True, all categorical features with insignificant variances are removed from the data. The variance is calculated using the ratio of unique values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.

combine_rare_levels: bool, default = False
When set to True, levels in categorical features whose frequency percentile falls below a certain threshold are combined into a single level.

rare_level_threshold: float, default = 0.1
Percentile distribution below which rare categories are combined. Ignored when combine_rare_levels is not True.

bin_numeric_features: list of str, default = None
To convert numeric features into categorical, bin_numeric_features parameter can be used. It takes a list of strings with column names to be discretized. It does so by using the ‘sturges’ rule to determine the number of clusters and then applying the KMeans algorithm. Original values of the feature are then replaced by the cluster label.
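For reference, Sturges' rule in its common textbook form sets the number of bins from the number of samples n as ceil(log2(n)) + 1. The sketch below only computes that count. It is an assumption that PyCaret uses exactly this variant of the formula, and the KMeans step is not shown.

```python
import math


def sturges_bins(n_samples):
    """Number of bins suggested by Sturges' rule for n_samples observations."""
    return math.ceil(math.log2(n_samples)) + 1


print(sturges_bins(64))    # 7
print(sturges_bins(1000))  # 11
```

The count grows only logarithmically, so even large datasets end up with a modest number of bins.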

remove_outliers: bool, default = False
When set to True, outliers from the training data are removed using the Singular Value Decomposition.

outliers_threshold: float, default = 0.05
The percentage outliers to be removed from the training dataset. Ignored when remove_outliers is not True.

remove_multicollinearity: bool, default = False
When set to True, features with the inter-correlations higher than the defined threshold are removed. When two features are highly correlated with each other, the feature that is less correlated with the target variable is removed. Only considers numeric features.

multicollinearity_threshold: float, default = 0.9
Threshold for correlated features. Ignored when remove_multicollinearity is not True.

remove_perfect_collinearity: bool, default = True
When set to True, perfect collinearity (features with correlation = 1) is removed from the dataset. When two features are 100% correlated, one of them is randomly removed from the dataset.

create_clusters: bool, default = False
When set to True, an additional feature is created in training dataset where each instance is assigned to a cluster. The number of clusters is determined by optimizing Calinski-Harabasz and Silhouette criterion.

cluster_iter: int, default = 20
Number of iterations for creating cluster. Each iteration represents cluster size. Ignored when create_clusters is not True.

polynomial_features: bool, default = False
When set to True, new features are derived using existing numeric features.

polynomial_degree: int, default = 2
Degree of polynomial features. For example, if an input sample is two dimensional and of the form [a, b], the polynomial features with degree = 2 are: [1, a, b, a^2, ab, b^2]. Ignored when polynomial_features is not True.
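The [a, b] example above can be reproduced with a short plain-Python sketch. The function name `polynomial_features` is made up for this illustration and is not the PyCaret or scikit-learn API.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(sample, degree=2):
    """All products of the input values up to the given total degree."""
    features = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(sample, d):
            # The empty combination (d = 0) yields the constant term 1
            features.append(prod(combo))
    return features


# With a = 2 and b = 3 the order is [1, a, b, a^2, ab, b^2]
print(polynomial_features([2, 3]))  # [1, 2, 3, 4, 6, 9]
```

Two inputs already produce six features at degree 2, which is why the feature space can grow quickly with more columns or a higher degree.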

trigonometry_features: bool, default = False
When set to True, new features are derived using existing numeric features.

polynomial_threshold: float, default = 0.1
When polynomial_features or trigonometry_features is True, new features are derived from the existing numeric features. This may sometimes result in too large a feature space. polynomial_threshold parameter can be used to deal with this problem. It does so by using a combination of Random Forest, AdaBoost and Linear correlation. All derived features that fall within the percentile distribution are kept and the rest of the features are removed.

group_features: list or list of list, default = None
When the dataset contains features with related characteristics, group_features parameter can be used for feature extraction. It takes a list of strings with column names that are related.

group_names: list, default = None
Group names to be used in naming new features. When the length of group_names does not match with the length of group_features, new features are named sequentially group_1, group_2, etc. It is ignored when group_features is None.

feature_selection: bool, default = False
When set to True, a subset of features is selected using a combination of various permutation importance techniques including Random Forest, AdaBoost and Linear correlation with target variable. The size of the subset is dependent on the feature_selection_threshold parameter.

feature_selection_threshold: float, default = 0.8
Threshold value used for feature selection. When polynomial_features or feature_interaction is True, it is recommended to keep the threshold low to avoid large feature spaces. Setting a very low value may be efficient but could result in under-fitting.

feature_selection_method: str, default = ‘classic’
Algorithm for feature selection. ‘classic’ method uses permutation feature importance techniques. Other possible value is ‘boruta’ which uses boruta algorithm for feature selection.

feature_interaction: bool, default = False
When set to True, new features are created by interacting (a * b) all the numeric variables in the dataset. This feature is not scalable and may not work as expected on datasets with large feature space.

feature_ratio: bool, default = False
When set to True, new features are created by calculating the ratios (a / b) between all numeric variables in the dataset. This feature is not scalable and may not work as expected on datasets with large feature space.
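To see why feature_interaction and feature_ratio do not scale well, the plain-Python sketch below builds both kinds of features for a single row. The column names and the naming scheme of the new features are invented for this example, and skipping zero denominators is an assumption.

```python
from itertools import combinations, permutations


def interaction_features(row):
    """Products a * b for every unordered pair of numeric columns."""
    return {f"{a}_multiply_{b}": row[a] * row[b] for a, b in combinations(row, 2)}


def ratio_features(row):
    """Ratios a / b for every ordered pair, skipping zero denominators."""
    return {
        f"{a}_divide_{b}": row[a] / row[b]
        for a, b in permutations(row, 2)
        if row[b] != 0
    }


row = {"age": 30, "bmi": 25, "children": 2}
print(interaction_features(row))  # 3 new features
print(len(ratio_features(row)))   # 6
```

With n numeric columns there are n(n-1)/2 products and up to n(n-1) ratios, so 100 columns would already yield 4,950 products and 9,900 ratios.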

interaction_threshold: float, default = 0.01
Similar to polynomial_threshold, it is used to compress a sparse matrix of newly created features through interaction. Features whose importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.

fix_imbalance: bool, default = False
When training dataset has unequal distribution of target class it can be balanced using this parameter. When set to True, SMOTE (Synthetic Minority Over-sampling Technique) is applied by default to create synthetic datapoints for minority class.

fix_imbalance_method: obj, default = None
When fix_imbalance is True, ‘imblearn’ compatible object with ‘fit_resample’ method can be passed. When set to None, ‘imblearn.over_sampling.SMOTE’ is used.

data_split_shuffle: bool, default = True
When set to False, prevents shuffling of rows during ‘train_test_split’.

data_split_stratify: bool or list, default = False
Controls stratification during ‘train_test_split’. When set to True, will stratify by target column. To stratify on any other columns, pass a list of column names. Ignored when data_split_shuffle is False.

fold_strategy: str or sklearn CV generator object, default = ‘stratifiedkfold’
Choice of cross validation strategy. Possible values are:

‘kfold’

‘stratifiedkfold’

‘groupkfold’

‘timeseries’

a custom CV generator object compatible with scikit-learn.

fold: int, default = 10
Number of folds to be used in cross validation. Must be at least 2. This is a global setting that can be over-written at function level by using fold parameter. Ignored when fold_strategy is a custom object.
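The effect of the fold parameter can be pictured with a small plain-Python sketch of unshuffled k-fold splitting. The generator `kfold_indices` is written for this example and only resembles the behaviour of scikit-learn's KFold; it is not the code PyCaret runs.

```python
def kfold_indices(n_samples, n_splits=10):
    """Yield (train_indices, validation_indices) for each fold, without shuffling."""
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    indices = list(range(n_samples))
    # The first n_samples % n_splits folds receive one extra sample
    sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in kfold_indices(6, n_splits=3):
    print(train, validation)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Every row is used for validation exactly once, and the score grid reports one row of metrics per fold.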

fold_shuffle: bool, default = False
Controls the shuffle parameter of CV. Only applicable when fold_strategy is ‘kfold’ or ‘stratifiedkfold’. Ignored when fold_strategy is a custom object.

fold_groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when ‘GroupKFold’ is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.

n_jobs: int, default = -1
The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor, set n_jobs to None.

use_gpu: bool or str, default = False
When set to True, it will use GPU for training with algorithms that support it, and fall back to CPU if they are unavailable. When set to ‘force’, it will only use GPU-enabled algorithms and raise exceptions when they are unavailable. When False, all algorithms are trained using CPU only.

GPU enabled algorithms:

Extreme Gradient Boosting, requires no further installation

CatBoost Classifier, requires no further installation (GPU is only enabled when data > 50,000 rows)

Light Gradient Boosting Machine, requires GPU installation https://lightgbm.readthedocs.io/en/latest/GPU-Tutorial.html

Logistic Regression, Ridge Classifier, Random Forest, K Neighbors Classifier, Support Vector Machine, requires cuML >= 0.15 https://github.com/rapidsai/cuml

custom_pipeline: (str, transformer) or list of (str, transformer), default = None
When passed, the custom transformers are appended to the preprocessing pipeline and are applied on each CV fold separately and on the final fit. All the custom transformations are applied after ‘train_test_split’ and before pycaret’s internal transformations.

html: bool, default = True
When set to False, prevents runtime display of monitor. This must be set to False when the environment does not support IPython. For example, command line terminal, Databricks Notebook, Spyder and other similar IDEs.

session_id: int, default = None
Controls the randomness of experiment. It is equivalent to ‘random_state’ in scikit-learn. When None, a pseudo random number is generated. This can be used for later reproducibility of the entire experiment.

log_experiment: bool, default = False
When set to True, all metrics and parameters are logged on the MLflow server.

experiment_name: str, default = None
Name of the experiment for logging. Ignored when log_experiment is not True.

log_plots: bool or list, default = False
When set to True, certain plots are logged automatically in the MLflow server. To change the type of plots to be logged, pass a list containing plot IDs. Refer to documentation of plot_model. Ignored when log_experiment is not True.

log_profile: bool, default = False
When set to True, data profile is logged on the MLflow server as a html file. Ignored when log_experiment is not True.

log_data: bool, default = False
When set to True, dataset is logged on the MLflow server as a csv file. Ignored when log_experiment is not True.

silent: bool, default = False
Controls the confirmation input of data types when setup is executed. When executing in completely automated mode or on a remote kernel, this must be True.

verbose: bool, default = True
When set to False, the information grid is not printed.

profile: bool, default = False
When set to True, an interactive EDA report is displayed.

profile_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the ProfileReport method used to create the EDA report. Ignored if profile is False.

Returns
Global variables that can be changed using the set_config function.

**Classification compare models function**

This function trains and evaluates the performance of all estimators available in the model library using cross validation. The output of this function is a score grid with average cross validated scores. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
best_model = compare_models()

In [None]:
pycaret.classification.compare_models(include: Optional[List[Union[str, Any]]] = None, 
                                      exclude: Optional[List[str]] = None, 
                                      fold: Optional[Union[int, Any]] = None, 
                                      round: int = 4, 
                                      cross_validation: bool = True, 
                                      sort: str = 'Accuracy', n_select: int = 1, 
                                      budget_time: Optional[float] = None, turbo: bool = True, errors: str = 'ignore', 
                                      fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, 
                                      probability_threshold: Optional[float] = None, verbose: bool = True)

include: list of str or scikit-learn compatible object, default = None
To train and evaluate only selected models, pass a list containing model IDs or scikit-learn compatible objects in the include param. To see a list of all models available in the model library use the models function.

exclude: list of str, default = None
To omit certain models from training and evaluation, pass a list containing model id in the exclude parameter. To see a list of all models available in the model library use the models function.

fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the setup function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the setup function.

round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

cross_validation: bool, default = True
When set to False, metrics are evaluated on holdout set. fold param is ignored when cross_validation is set to False.

sort: str, default = ‘Accuracy’
The sort order of the score grid. It also accepts custom metrics that are added through the add_metric function.

n_select: int, default = 1
Number of top_n models to return. For example, to select top 3 models use n_select = 3.

budget_time: int or float, default = None
If not None, will terminate execution of the function after budget_time minutes have passed and return results up to that point.

turbo: bool, default = True
When set to True, it excludes estimators with longer training times. To see which algorithms are excluded use the models function.

errors: str, default = ‘ignore’
When set to ‘ignore’, will skip the model with exceptions and continue. If ‘raise’, will break the function when exceptions are raised.

fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.

groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when ‘GroupKFold’ is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.

probability_threshold: float, default = None
Threshold for converting predicted probability to class label. It defaults to 0.5 for all classifiers unless explicitly defined in this parameter. Only applicable for binary classification.
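The role of the threshold can be shown with a tiny plain-Python sketch. The helper `to_labels` is hypothetical, and treating a probability equal to the threshold as the positive class is an assumption of this example.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert positive-class probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


probabilities = [0.2, 0.45, 0.6, 0.9]
print(to_labels(probabilities))                 # [0, 0, 1, 1]
print(to_labels(probabilities, threshold=0.4))  # [0, 1, 1, 1]
```

Lowering the threshold labels more samples as positive, which typically raises recall at the cost of precision.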

verbose: bool, default = True
Score grid is not printed when verbose is set to False.

Returns
Trained model or list of trained models, depending on the n_select param.

**Classification create model function**

This function trains and evaluates the performance of a given estimator using cross validation. The output of this function is a score grid with CV scores by fold. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions. All the available models can be accessed using the models function.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
lr = create_model('lr')

In [None]:
pycaret.classification.create_model(estimator: Union[str, Any], 
                                    fold: Optional[Union[int, Any]] = None, round: int = 4, 
                                    cross_validation: bool = True, fit_kwargs: Optional[dict] = None, 
                                    groups: Optional[Union[str, Any]] = None, 
                                    probability_threshold: Optional[float] = None, 
                                    verbose: bool = True, **kwargs)

fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the setup function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the setup function.

round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

cross_validation: bool, default = True
When set to False, metrics are evaluated on holdout set. fold param is ignored when cross_validation is set to False.

fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.

groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.

probability_threshold: float, default = None
Threshold for converting predicted probability to class label. It defaults to 0.5 for all classifiers unless explicitly defined in this parameter. Only applicable for binary classification.

verbose: bool, default = True
Score grid is not printed when verbose is set to False.

**kwargs:
Additional keyword arguments to pass to the estimator.

Returns
Trained Model

**classification tune_model function**

This function tunes the hyperparameters of a given estimator. The output of this function is a score grid with CV scores by fold of the best selected model based on the optimize parameter. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
lr = create_model('lr')
tuned_lr = tune_model(lr)

In [None]:
pycaret.classification.tune_model(estimator, 
                                  fold: Optional[Union[int, Any]] = None, 
                                  round: int = 4, 
                                  n_iter: int = 10, 
                                  custom_grid: Optional[Union[Dict[str, list], Any]] = None, 
                                  optimize: str = 'Accuracy', 
                                  custom_scorer=None, search_library: str = 'scikit-learn', 
                                  search_algorithm: Optional[str] = None, 
                                  early_stopping: Any = False, 
                                  early_stopping_max_iters: int = 10, 
                                  choose_better: bool = False, 
                                  fit_kwargs: Optional[dict] = None, 
                                  groups: Optional[Union[str, Any]] = None, 
                                  return_tuner: bool = False, 
                                  verbose: bool = True, 
                                  tuner_verbose: Union[int, bool] = True, 
                                  **kwargs)

estimator: scikit-learn compatible object
Trained model object

fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the setup function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the setup function.

round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

n_iter: int, default = 10
Number of iterations in the grid search. Increasing ‘n_iter’ may improve model performance but also increases the training time.

custom_grid: dictionary, default = None
To define custom search space for hyperparameters, pass a dictionary with parameter name and values to be iterated. Custom grids must be in a format supported by the defined search_library.
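To make the relation between custom_grid and n_iter concrete, the sketch below enumerates a full grid and then samples n_iter candidates from it, which is the basic idea behind a random grid search. It is plain Python rather than PyCaret's tuner; the parameter names in `custom_grid` are illustrative and the helper functions are hypothetical.

```python
import itertools
import random

# Illustrative search space (hypothetical values)
custom_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "fit_intercept": [True, False],
}


def full_grid(grid):
    """Every combination of the listed values (what a grid search would try)."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in itertools.product(*(grid[k] for k in keys))]


def random_candidates(grid, n_iter=10, seed=0):
    """n_iter distinct combinations drawn at random from the full grid."""
    candidates = full_grid(grid)
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_iter, len(candidates)))


print(len(full_grid(custom_grid)))          # 16
print(len(random_candidates(custom_grid)))  # 10
```

A grid search evaluates all 16 combinations, while a random search with n_iter = 10 evaluates only 10 of them, trading coverage for training time.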

optimize: str, default = ‘Accuracy’
Metric name to be evaluated for hyperparameter tuning. It also accepts custom metrics that are added through the add_metric function.

custom_scorer: object, default = None
A custom scoring strategy can be passed to tune hyperparameters of the model. It must be created using sklearn.metrics.make_scorer. It is equivalent to adding a custom metric using the add_metric function and passing the name of the custom metric in the optimize parameter. Will be deprecated in future.

search_library: str, default = ‘scikit-learn’
The search library used for tuning hyperparameters. Possible values:

‘scikit-learn’ - default, requires no further installation
https://github.com/scikit-learn/scikit-learn

‘scikit-optimize’ - pip install scikit-optimize
https://scikit-optimize.github.io/stable/

‘tune-sklearn’ - pip install tune-sklearn ray[tune]
https://github.com/ray-project/tune-sklearn

‘optuna’ - pip install optuna
https://optuna.org/

search_algorithm: str, default = None
The search algorithm depends on the search_library parameter. Some search algorithms require additional libraries to be installed. If None, will use search library-specific default algorithm.

‘scikit-learn’ possible values:
‘random’ : random grid search (default)

‘grid’ : grid search

‘scikit-optimize’ possible values:
‘bayesian’ : Bayesian search (default)

‘tune-sklearn’ possible values:
‘random’ : random grid search (default)

‘grid’ : grid search

‘bayesian’ : pip install scikit-optimize

‘hyperopt’ : pip install hyperopt

‘optuna’ : pip install optuna

‘bohb’ : pip install hpbandster ConfigSpace

‘optuna’ possible values:
‘random’ : randomized search

‘tpe’ : Tree-structured Parzen Estimator search (default)

early_stopping: bool or str or object, default = False
Use early stopping to stop fitting to a hyperparameter configuration if it performs poorly. Ignored when search_library is scikit-learn, or if the estimator does not have ‘partial_fit’ attribute. If False or None, early stopping will not be used. Can be either an object accepted by the search library or one of the following:

‘asha’ for Asynchronous Successive Halving Algorithm

‘hyperband’ for Hyperband

‘median’ for Median Stopping Rule

early_stopping_max_iters: int, default = 10
Maximum number of epochs to run for each sampled configuration. Ignored if early_stopping is False or None.

choose_better: bool, default = False
When set to True, the input model is returned whenever tuning does not improve on it, so the returned object never performs worse than the input. The metric used for comparison is defined by the optimize parameter.

fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the tuner.

groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.

return_tuner: bool, default = False
When set to True, will return a tuple of (model, tuner_object).

verbose: bool, default = True
Score grid is not printed when verbose is set to False.

tuner_verbose: bool or int, default = True
If True or above 0, will print messages from the tuner. Higher values print more messages. Ignored when verbose param is False.

**kwargs:
Additional keyword arguments to pass to the optimizer.

Returns
Trained Model and Optional Tuner Object when return_tuner is True.

**classification ensemble_model**

This function ensembles a given estimator. The output of this function is a score grid with CV scores by fold. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
dt = create_model('dt')
bagged_dt = ensemble_model(dt, method = 'Bagging')

In [None]:
pycaret.classification.ensemble_model(estimator, method: str = 'Bagging', 
                                      fold: Optional[Union[int, Any]] = None, 
                                      n_estimators: int = 10, round: int = 4, 
                                      choose_better: bool = False, optimize: str = 'Accuracy', 
                                      fit_kwargs: Optional[dict] = None, 
                                      groups: Optional[Union[str, Any]] = None, 
                                      probability_threshold: Optional[float] = None, 
                                      verbose: bool = True)

estimator: scikit-learn compatible object
Trained model object

method: str, default = ‘Bagging’
Method for ensembling base estimator. It can be ‘Bagging’ or ‘Boosting’.

fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the setup function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the setup function.

n_estimators: int, default = 10
The number of base estimators in the ensemble. In case of perfect fit, the learning procedure is stopped early.

round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

choose_better: bool, default = False
When set to True, the returned object is always better performing. The metric used for comparison is defined by the optimize parameter.

optimize: str, default = ‘Accuracy’
Metric to compare for model selection when choose_better is True.

fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.

groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.

probability_threshold: float, default = None
Threshold for converting predicted probability to class label. It defaults to 0.5 for all classifiers unless explicitly defined in this parameter. Only applicable for binary classification.

verbose: bool, default = True
Score grid is not printed when verbose is set to False.

Returns
Trained Model
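
The ‘Bagging’ method can be illustrated with a small pure-Python sketch of the generic technique: fit the same base estimator on several bootstrap samples and combine the predictions by majority vote. This is not PyCaret's implementation. The helpers `fit_stump`, `bagging_fit` and `bagging_predict` are hypothetical names, and the base estimator is a one-feature threshold rule chosen only to keep the example short.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a 1-D threshold rule: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for t in sorted(set(xs)):
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

def bagging_fit(xs, ys, n_estimators=10, seed=0):
    """Fit one stump per bootstrap sample (sampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def bagging_predict(thresholds, x):
    """Majority vote over the stumps; on a tie, the label seen first wins."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy data: the label switches from 0 to 1 between x = 4 and x = 5
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = bagging_fit(xs, ys, n_estimators=10)
```

Each stump sees a slightly different resample of the data, so individual thresholds vary while the vote stays stable, which is the reason bagging tends to reduce variance for unstable estimators such as decision trees.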

**classification blend_models functions**

This function trains a Soft Voting / Majority Rule classifier for the selected models passed in the estimator_list param. The output of this function is a score grid with CV scores by fold. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
top3 = compare_models(n_select = 3)
blender = blend_models(top3)

In [None]:
pycaret.classification.blend_models(estimator_list: list, 
                                    fold: Optional[Union[int, Any]] = None, round: int = 4, 
                                    choose_better: bool = False, optimize: str = 'Accuracy', 
                                    method: str = 'auto', 
                                    weights: Optional[List[float]] = None, 
                                    fit_kwargs: Optional[dict] = None, 
                                    groups: Optional[Union[str, Any]] = None, 
                                    probability_threshold: Optional[float] = None, verbose: bool = True)

estimator_list: list of scikit-learn compatible objects
List of trained model objects

fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the setup function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the setup function.

round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

choose_better: bool, default = False
When set to True, the returned object is always better performing. The metric used for comparison is defined by the optimize parameter.

optimize: str, default = ‘Accuracy’
Metric to compare for model selection when choose_better is True.

method: str, default = ‘auto’
‘hard’ uses predicted class labels for majority rule voting. ‘soft’ predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers. The default value, ‘auto’, will try to use ‘soft’ and fall back to ‘hard’ if the former is not supported.

weights: list, default = None
Sequence of weights (float or int) to weight the occurrences of predicted class labels (hard voting) or class probabilities before averaging (soft voting). Uses uniform weights when None.

fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.

groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.

probability_threshold: float, default = None
Threshold for converting predicted probability to class label. It defaults to 0.5 for all classifiers unless explicitly defined in this parameter. Only applicable for binary classification.

verbose: bool, default = True
Score grid is not printed when verbose is set to False.

Returns
Trained Model
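
The difference between ‘hard’ and ‘soft’ voting, and the effect of the weights parameter, can be shown with a minimal pure-Python sketch. It is an illustration of the two voting rules as described above, not PyCaret's code. The functions `hard_vote` and `soft_vote` and the probabilities are made up for the example.

```python
from collections import defaultdict

def hard_vote(predicted_labels, weights=None):
    """Weighted majority vote over one predicted label per model."""
    weights = weights or [1.0] * len(predicted_labels)
    totals = defaultdict(float)
    for label, w in zip(predicted_labels, weights):
        totals[label] += w
    return max(totals, key=totals.get)

def soft_vote(predicted_probas, weights=None):
    """Argmax of the weighted sum of class probabilities (one dict per model)."""
    weights = weights or [1.0] * len(predicted_probas)
    totals = defaultdict(float)
    for proba, w in zip(predicted_probas, weights):
        for label, p in proba.items():
            totals[label] += w * p
    return max(totals, key=totals.get)

# Three hypothetical models scoring one row of the juice data (classes 'CH' and 'MM')
probas = [{"CH": 0.55, "MM": 0.45}, {"CH": 0.60, "MM": 0.40}, {"CH": 0.10, "MM": 0.90}]
labels = [max(p, key=p.get) for p in probas]

hard_label = hard_vote(labels)                      # two of the three models vote 'CH'
soft_label = soft_vote(probas)                      # summed probabilities favour 'MM'
weighted_label = hard_vote(labels, weights=[1, 1, 3])  # the third model now outvotes the others
```

The two rules disagree here because the third model is far more confident than the other two, which is exactly the information hard voting discards.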

**classification stack_models functions**

This function trains a meta model over the selected estimators passed in the estimator_list parameter. The output of this function is a score grid with CV scores by fold. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
top3 = compare_models(n_select = 3)
stacker = stack_models(top3)

In [None]:
pycaret.classification.stack_models(estimator_list: list, 
                                    meta_model=None, meta_model_fold: Optional[Union[int, Any]] = 5, 
                                    fold: Optional[Union[int, Any]] = None, round: int = 4, 
                                    method: str = 'auto', 
                                    restack: bool = True, 
                                    choose_better: bool = False, 
                                    optimize: str = 'Accuracy', 
                                    fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, 
                                    probability_threshold: Optional[float] = None, 
                                    verbose: bool = True)

estimator_list: list of scikit-learn compatible objects
List of trained model objects

meta_model: scikit-learn compatible object, default = None
When None, Logistic Regression is trained as a meta model.

meta_model_fold: integer or scikit-learn compatible CV generator, default = 5
Controls internal cross-validation. Can be an integer or a scikit-learn CV generator. If set to an integer, will use (Stratified)KFold CV with that many folds. See scikit-learn documentation on Stacking for more details.

fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the setup function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the setup function.

round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

method: str, default = ‘auto’
When set to ‘auto’, it will invoke, for each estimator, ‘predict_proba’, ‘decision_function’ or ‘predict’ in that order. Otherwise, manually pass one of ‘predict_proba’, ‘decision_function’ or ‘predict’.

restack: bool, default = True
When set to False, only the predictions of estimators will be used as training data for the meta_model.

choose_better: bool, default = False
When set to True, the returned object is always better performing. The metric used for comparison is defined by the optimize parameter.

optimize: str, default = ‘Accuracy’
Metric to compare for model selection when choose_better is True.

fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.

groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.

probability_threshold: float, default = None
Threshold for converting predicted probability to class label. It defaults to 0.5 for all classifiers unless explicitly defined in this parameter. Only applicable for binary classification.

verbose: bool, default = True
Score grid is not printed when verbose is set to False.

Returns
Trained Model
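
The effect of restack on what the meta model sees can be sketched as follows. This is a simplified illustration under the assumption that restack = True means the original features are passed along with the base estimators' predictions, while restack = False passes the predictions only. `build_meta_features` is a hypothetical helper, not a PyCaret function.

```python
def build_meta_features(X, base_predictions, restack=True):
    """Assemble the training rows for the meta model.

    X                : list of original feature rows
    base_predictions : one list of predictions per base estimator
    restack          : if True, keep the original features next to the predictions
    """
    meta_rows = []
    for i, row in enumerate(X):
        preds = [model_preds[i] for model_preds in base_predictions]
        meta_rows.append(list(row) + preds if restack else preds)
    return meta_rows

# Two samples with two features, scored by two hypothetical base estimators
X = [[1.0, 2.0], [3.0, 4.0]]
base_predictions = [[0.2, 0.8], [0.3, 0.7]]

with_features = build_meta_features(X, base_predictions, restack=True)
predictions_only = build_meta_features(X, base_predictions, restack=False)
```

In practice the base predictions fed to the meta model are produced out-of-fold (see meta_model_fold) so that the meta model does not learn from predictions made on rows the base estimators were trained on.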

**classification.plot_model**

This function analyzes the performance of a trained model on holdout set. It may require re-training the model in certain cases.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
lr = create_model('lr')
plot_model(lr, plot = 'auc')

In [None]:
pycaret.classification.plot_model(estimator, plot: str = 'auc', 
                                  scale: float = 1, 
                                  save: bool = False, 
                                  fold: Optional[Union[int, Any]] = None, 
                                  fit_kwargs: Optional[dict] = None, 
                                  plot_kwargs: Optional[dict] = None, 
                                  groups: Optional[Union[str, Any]] = None, 
                                  use_train_data: bool = False, 
                                  verbose: bool = True, 
                                  display_format: Optional[str] = None)

estimator: scikit-learn compatible object
Trained model object

plot: str, default = ‘auc’
List of available plots (ID - Name):

‘auc’ - Area Under the Curve

‘threshold’ - Discrimination Threshold

‘pr’ - Precision Recall Curve

‘confusion_matrix’ - Confusion Matrix

‘error’ - Class Prediction Error

‘class_report’ - Classification Report

‘boundary’ - Decision Boundary

‘rfe’ - Recursive Feature Selection

‘learning’ - Learning Curve

‘manifold’ - Manifold Learning

‘calibration’ - Calibration Curve

‘vc’ - Validation Curve

‘dimension’ - Dimension Learning

‘feature’ - Feature Importance

‘feature_all’ - Feature Importance (All)

‘parameter’ - Model Hyperparameter

‘lift’ - Lift Curve

‘gain’ - Gain Chart

‘tree’ - Decision Tree

‘ks’ - KS Statistic Plot

scale: float, default = 1
The resolution scale of the figure.

save: bool, default = False
When set to True, plot is saved in the current working directory.

fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the setup function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the setup function.

fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.

plot_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the visualizer class.

groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.

use_train_data: bool, default = False
When set to true, train data will be used for plots, instead of test data.

verbose: bool, default = True
When set to False, progress bar is not displayed.

display_format: str, default = None
To display plots in Streamlit (https://www.streamlit.io/), set this to ‘streamlit’. Currently, not all plots are supported.

Returns
None
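
To make the default ‘auc’ plot more concrete, the sketch below computes the points of a ROC curve and the area under it in pure Python. It shows the underlying calculation only. It does not reproduce how plot_model draws the figure, and `roc_points` and `auc` are hypothetical helper names.

```python
def roc_points(y_true, scores):
    """(false positive rate, true positive rate) points from sweeping the threshold.

    Assumes binary labels 0/1 with at least one sample of each class.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (handles ties)
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(y_true, scores)
area = auc(curve)
```

An area of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.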

**classification evaluate_model**

This function displays a user interface for analyzing performance of a trained model. It calls the plot_model function internally.

In [None]:
pycaret.classification.evaluate_model(estimator, 
                                      fold: Optional[Union[int, Any]] = None, 
                                      fit_kwargs: Optional[dict] = None, 
                                      plot_kwargs: Optional[dict] = None, 
                                      groups: Optional[Union[str, Any]] = None, 
                                      use_train_data: bool = False)

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
lr = create_model('lr')
evaluate_model(lr)

estimator: scikit-learn compatible object
Trained model object

fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the setup function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the setup function.

fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.

plot_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the visualizer class.

groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.

use_train_data: bool, default = False
When set to true, train data will be used for plots, instead of test data.

Returns
None

**classification interpret_model function**

This function analyzes the predictions generated from a trained model. Most plots in this function are based on SHAP (SHapley Additive exPlanations). For more info on this, please see https://shap.readthedocs.io/en/latest/

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
xgboost = create_model('xgboost')
interpret_model(xgboost)

In [None]:
pycaret.classification.interpret_model(estimator, plot: str = 'summary', 
                                       feature: Optional[str] = None, 
                                       observation: Optional[int] = None, 
                                       use_train_data: bool = False, 
                                       X_new_sample: Optional[pandas.core.frame.DataFrame] = None, 
                                       y_new_sample: Optional[pandas.core.frame.DataFrame] = None, 
                                       save: bool = False, **kwargs)

estimator: scikit-learn compatible object
Trained model object

plot: str, default = ‘summary’
Abbreviation of type of plot. The current list of plots supported are (Plot - Name):

‘summary’ - Summary Plot using SHAP

‘correlation’ - Dependence Plot using SHAP

‘reason’ - Force Plot using SHAP

‘pdp’ - Partial Dependence Plot

‘msa’ - Morris Sensitivity Analysis

‘pfi’ - Permutation Feature Importance

feature: str, default = None
This parameter is only needed when plot = ‘correlation’ or ‘pdp’. By default feature is set to None which means the first column of the dataset will be used as a variable. A feature parameter must be passed to change this.

observation: integer, default = None
This parameter only comes into effect when plot is set to ‘reason’. If no observation number is provided, it will return an analysis of all observations with the option to select the feature on x and y axes through drop down interactivity. For analysis at the sample level, an observation parameter must be passed with the index value of the observation in test / hold-out set.

use_train_data: bool, default = False
When set to true, train data will be used for plots, instead of test data.

X_new_sample: pd.DataFrame, default = None
Row from an out-of-sample dataframe (neither train nor test data) to be plotted. The sample must have the same columns as the raw input data, and it is transformed by the preprocessing pipeline automatically before plotting.

y_new_sample: pd.DataFrame, default = None
Row from an out-of-sample dataframe (neither train nor test data) to be plotted. The sample must have the same columns as the raw input label data, and it is transformed by the preprocessing pipeline automatically before plotting.

save: bool, default = False
When set to True, Plot is saved as a ‘png’ file in current working directory.

**kwargs:
Additional keyword arguments to pass to the plot.

Returns
None
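
The idea behind the SHAP-based plots can be illustrated by computing exact Shapley values for a tiny model. This is a sketch under simplifying assumptions: features left out of a coalition are replaced by a fixed baseline value, and all feature orderings are enumerated, which is only feasible for a handful of features. The SHAP library uses more efficient estimators. `shapley_values` and `toy_model` are hypothetical names.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley value of each feature for a single observation x."""
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = predict(current)
        for j in order:
            current[j] = x[j]  # add feature j to the coalition
            output = predict(current)
            contributions[j] += output - previous_output
            previous_output = output
    # Average the marginal contributions over all orderings
    return [c / len(orderings) for c in contributions]

def toy_model(features):
    """Hypothetical model with an interaction between the first two features."""
    a, b, c = features
    return 2.0 * a + a * b + 0.0 * c

x = [1.0, 2.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The contributions add up to the difference between the prediction for x and the prediction for the baseline, which is the property that lets a force plot (‘reason’) decompose a single prediction feature by feature.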


**classification calibrate_model function**

This function calibrates the probability of a given estimator using isotonic or logistic regression. The output of this function is a score grid with CV scores by fold. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
dt = create_model('dt')
calibrated_dt = calibrate_model(dt)

In [None]:
pycaret.classification.calibrate_model(estimator, method: str = 'sigmoid', 
                                       calibrate_fold: Optional[Union[int, Any]] = 5, 
                                       fold: Optional[Union[int, Any]] = None, round: int = 4, 
                                       fit_kwargs: Optional[dict] = None, 
                                       groups: Optional[Union[str, Any]] = None, 
                                       verbose: bool = True)→ Any

estimator: scikit-learn compatible object
Trained model object

method: str, default = ‘sigmoid’
The method to use for calibration. Can be ‘sigmoid’ which corresponds to Platt’s method or ‘isotonic’ which is a non-parametric approach.

calibrate_fold: integer or scikit-learn compatible CV generator, default = 5
Controls internal cross-validation. Can be an integer or a scikit-learn CV generator. If set to an integer, will use (Stratified)KFold CV with that many folds. See scikit-learn documentation on probability calibration for more details.

fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the setup function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the setup function.

round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.

groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.

verbose: bool, default = True
Score grid is not printed when verbose is set to False.

Returns
Trained Model
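
The ‘isotonic’ method can be sketched with the Pool Adjacent Violators algorithm, which fits a non-decreasing step function from raw scores to observed outcomes. This is a minimal pure-Python illustration of the technique, not the routine PyCaret or scikit-learn runs. `isotonic_fit` is a hypothetical helper.

```python
def isotonic_fit(scores, labels):
    """Return (sorted scores, calibrated values) using Pool Adjacent Violators."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [sum of labels, number of samples]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the current one's
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    calibrated = []
    for total, count in blocks:
        calibrated.extend([total / count] * count)
    return [scores[i] for i in order], calibrated

# Raw scores from a hypothetical classifier and the true outcomes
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 1, 0, 0, 1, 1]
sorted_scores, calibrated = isotonic_fit(raw_scores, outcomes)
```

The scores 0.2, 0.3 and 0.4 are pooled into one block because their outcomes (1, 0, 0) violate monotonicity, so all three receive the pooled frequency of one in three. The ‘sigmoid’ method instead fits a two-parameter logistic curve, which is smoother but assumes a specific shape.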

**classification optimize_threshold function**

This function optimizes the probability threshold for a given estimator using a custom cost function. The function displays a plot of optimized cost as a function of probability threshold between 0.0 and 1.0 and returns the optimized threshold value as a numpy float.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
lr = create_model('lr')
optimize_threshold(lr, true_negative = 10, false_negative = -100)

In [None]:
pycaret.classification.optimize_threshold(estimator, 
                                          true_positive: int = 0, 
                                          true_negative: int = 0, 
                                          false_positive: int = 0, 
                                          false_negative: int = 0, 
                                          grid_interval: float = 0.0001)

estimator: scikit-learn compatible object
Trained model object

true_positive: int, default = 0
Cost function or returns for true positive.

true_negative: int, default = 0
Cost function or returns for true negative.

false_positive: int, default = 0
Cost function or returns for false positive.

false_negative: int, default = 0
Cost function or returns for false negative.

grid_interval: float, default = 0.0001
Grid interval for threshold grid search. Iteration count = 1.0/grid_interval. Default 10000 iterations.

Returns
numpy.float64
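
The grid search described above can be sketched in a few lines of pure Python. This is an illustration of the idea, not PyCaret's implementation: `total_cost` and `optimize_threshold_sketch` are hypothetical helpers, and the choice of `p >= threshold` for predicting the positive class is an assumption.

```python
def total_cost(y_true, proba, threshold, true_positive=0, true_negative=0,
               false_positive=0, false_negative=0):
    """Sum the cost (or return) of every prediction at a given threshold."""
    cost = 0
    for actual, p in zip(y_true, proba):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and actual == 1:
            cost += true_positive
        elif predicted == 0 and actual == 0:
            cost += true_negative
        elif predicted == 1 and actual == 0:
            cost += false_positive
        else:
            cost += false_negative
    return cost

def optimize_threshold_sketch(y_true, proba, grid_interval=0.01, **costs):
    """Return the first threshold on the grid that maximises the total."""
    steps = int(round(1.0 / grid_interval))
    grid = [i * grid_interval for i in range(steps + 1)]
    return max(grid, key=lambda t: total_cost(y_true, proba, t, **costs))

# Same cost structure as the example above: reward true negatives, punish false negatives
y_true = [0, 0, 1, 1, 1]
proba = [0.2, 0.6, 0.4, 0.7, 0.9]
costs = {"true_negative": 10, "false_negative": -100}
best = optimize_threshold_sketch(y_true, proba, **costs)
```

With a heavy penalty on false negatives, the search settles on a threshold low enough to keep every actual positive above it, while still classifying the lowest-scoring negative correctly.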

**classification predict_model function**

This function predicts Label and Score (probability of predicted class) using a trained model. When data is None, it predicts label and score on the holdout set.

In [None]:
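from pycaret.datasets import get_data
juice = get_data('juice')
# Hold out a few rows to stand in for unseen data (illustrative; any DataFrame with the same feature columns works)
unseen_dataframe = juice.sample(n = 50, random_state = 123)
train_data = juice.drop(unseen_dataframe.index)
from pycaret.classification import *
exp_name = setup(data = train_data,  target = 'Purchase')
lr = create_model('lr')
pred_holdout = predict_model(lr)
pred_unseen = predict_model(lr, data = unseen_dataframe)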

In [None]:
pycaret.classification.predict_model(estimator, data: Optional[pandas.core.frame.DataFrame] = None, 
                                     probability_threshold: Optional[float] = None, 
                                     encoded_labels: bool = False, raw_score: bool = False, 
                                     drift_report: bool = False, round: int = 4, 
                                     verbose: bool = True)

estimator: scikit-learn compatible object
Trained model object

data: pandas.DataFrame
Shape (n_samples, n_features). All features used during training must be available in the unseen dataset.

probability_threshold: float, default = None
Threshold for converting predicted probability to class label. Unless this parameter is set, it will default to the value set during model creation. If that wasn’t set, the default will be 0.5 for all classifiers. Only applicable for binary classification.

encoded_labels: bool, default = False
When set to True, will return labels encoded as an integer.

raw_score: bool, default = False
When set to True, scores for all labels will be returned.

drift_report: bool, default = False
When set to True, interactive drift report is generated on test set with the evidently library.

round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

verbose: bool, default = True
When set to False, holdout score grid is not printed.

Returns
pandas.DataFrame
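
How Label and Score relate to the class probabilities, and what raw_score changes, can be sketched as below. This is a hypothetical illustration. The function `label_and_score` and the per-class column names such as `Score_CH` are assumptions made for the example and may not match the column names PyCaret produces.

```python
def label_and_score(class_probabilities, raw_score=False):
    """Build the prediction columns for one row from {class: probability}."""
    label = max(class_probabilities, key=class_probabilities.get)
    if raw_score:
        # One score column per class instead of a single Score column
        row = {"Label": label}
        row.update({f"Score_{cls}": p for cls, p in class_probabilities.items()})
        return row
    # Score is the probability of the predicted class only
    return {"Label": label, "Score": class_probabilities[label]}

row_probabilities = {"CH": 0.7, "MM": 0.3}
default_row = label_and_score(row_probabilities)
raw_row = label_and_score(row_probabilities, raw_score=True)
```

Note that Score is always the probability of the predicted class, so under the default rule a Score of 0.7 with Label ‘CH’ means a 0.3 probability for ‘MM’, not a 0.7 probability of the positive class.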

**classification finalize_model function**

This function trains a given estimator on the entire dataset including the holdout set.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
lr = create_model('lr')
final_lr = finalize_model(lr)

In [None]:
pycaret.classification.finalize_model(estimator, 
                                      fit_kwargs: Optional[dict] = None, 
                                      groups: Optional[Union[str, Any]] = None, 
                                      model_only: bool = True)

estimator: scikit-learn compatible object
Trained model object

fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.

groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.

model_only: bool, default = True
When set to True, only the model object is re-trained and all the transformations in Pipeline are ignored.

Returns
Trained Model

**classification deploy_model**

This function deploys the transformation pipeline and trained model to the cloud.



In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
lr = create_model('lr')
# sets appropriate credentials for the platform as environment variables
import os
os.environ["AWS_ACCESS_KEY_ID"] = str("foo")
os.environ["AWS_SECRET_ACCESS_KEY"] = str("bar")
deploy_model(model = lr, model_name = 'lr-for-deployment', platform = 'aws', authentication = {'bucket' : 'S3-bucket-name'})

In [None]:
pycaret.classification.deploy_model(model, 
                                    model_name: str, 
                                    authentication: dict, 
                                    platform: str = 'aws')

Amazon Web Service (AWS) users:
To deploy a model on AWS S3 (‘aws’), credentials have to be passed. The easiest way is to use environment variables in your local environment. The following information from the IAM portal of your Amazon console account is required:

AWS Access Key ID

AWS Secret Access Key

More info: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#environment-variables

Google Cloud Platform (GCP) users:
To deploy a model on Google Cloud Platform (‘gcp’), a project must be created using the command line or the GCP console. Once the project is created, you must create a service account and download the service account key as a JSON file to set environment variables in your local environment.

More info: https://cloud.google.com/docs/authentication/production

Microsoft Azure (Azure) users:
To deploy a model on Microsoft Azure (‘azure’), environment variables for connection string must be set in your local environment. Go to settings of storage account on Azure portal to access the connection string required.

AZURE_STORAGE_CONNECTION_STRING (required as environment variable)

More info: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python?toc=%2Fpython%2Fazure%2FTOC.json

model: scikit-learn compatible object
Trained model object

model_name: str
Name of model.

authentication: dict
Dictionary of applicable authentication tokens.

When platform = ‘aws’: {‘bucket’ : ‘S3-bucket-name’, ‘path’: (optional) folder name under the bucket}

When platform = ‘gcp’: {‘project’: ‘gcp-project-name’, ‘bucket’ : ‘gcp-bucket-name’}

When platform = ‘azure’: {‘container’: ‘azure-container-name’}

platform: str, default = ‘aws’
Name of the platform. Currently supported platforms: ‘aws’, ‘gcp’ and ‘azure’.

Returns
None
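
Because the keys expected in the authentication dictionary differ by platform, it can help to check them before calling deploy_model. The helper below is a hypothetical convenience, not part of PyCaret. It only encodes the required keys listed above and treats ‘path’ for AWS as optional.

```python
# Required keys per platform, taken from the authentication description above
REQUIRED_KEYS = {
    "aws": {"bucket"},              # 'path' is optional
    "gcp": {"project", "bucket"},
    "azure": {"container"},
}

def missing_auth_keys(platform, authentication):
    """Return the required keys that are absent from the authentication dict."""
    if platform not in REQUIRED_KEYS:
        raise ValueError(f"Unsupported platform: {platform!r}")
    return sorted(REQUIRED_KEYS[platform] - set(authentication))

aws_missing = missing_auth_keys("aws", {"bucket": "S3-bucket-name"})
gcp_missing = missing_auth_keys("gcp", {"bucket": "gcp-bucket-name"})
```

An empty list means the dictionary has every required key for that platform. It says nothing about whether the credentials in the environment variables are valid.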

**classification save_model function**

This function saves the transformation pipeline and trained model object into the current working directory as a pickle file for later use.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
lr = create_model('lr')
save_model(lr, 'saved_lr_model')

In [None]:
pycaret.classification.save_model(model, model_name: str, 
                                  model_only: bool = False, 
                                  verbose: bool = True, 
                                  **kwargs)

model: scikit-learn compatible object
Trained model object

model_name: str
Name of the model.

model_only: bool, default = False
When set to True, only trained model object is saved instead of the entire pipeline.

verbose: bool, default = True
Success message is not printed when verbose is set to False.

**kwargs:
Additional keyword arguments to pass to joblib.dump().

Returns
Tuple of the model object and the filename.

**classification load_model function**

This function loads a previously saved pipeline.

In [None]:
from pycaret.classification import load_model
saved_lr = load_model('saved_lr_model')

In [None]:
pycaret.classification.load_model(model_name, 
                                  platform: Optional[str] = None, 
                                  authentication: Optional[Dict[str, str]] = None, 
                                  verbose: bool = True)

model_name: str
Name of the model.

platform: str, default = None
Name of the cloud platform. Currently supported platforms: ‘aws’, ‘gcp’ and ‘azure’.

authentication: dict, default = None
Dictionary of applicable authentication tokens.

When platform = ‘aws’: {‘bucket’ : ‘S3-bucket-name’}

When platform = ‘gcp’: {‘project’: ‘gcp-project-name’, ‘bucket’ : ‘gcp-bucket-name’}

When platform = ‘azure’: {‘container’: ‘azure-container-name’}

verbose: bool, default = True
Success message is not printed when verbose is set to False.

Returns
Trained Model
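
The save and load round trip can be sketched with the standard library alone. This is a simplified stand-in, not PyCaret's code: it uses `pickle` rather than joblib, the helpers `save_object` and `load_object` are hypothetical, the `.pkl` suffix is an assumption, and a plain dictionary plays the role of the fitted pipeline.

```python
import os
import pickle
import tempfile

def save_object(obj, model_name, directory="."):
    """Pickle obj to <directory>/<model_name>.pkl and return (obj, filename)."""
    filename = os.path.join(directory, f"{model_name}.pkl")
    with open(filename, "wb") as fh:
        pickle.dump(obj, fh)
    return obj, filename

def load_object(model_name, directory="."):
    """Load the object previously written by save_object."""
    with open(os.path.join(directory, f"{model_name}.pkl"), "rb") as fh:
        return pickle.load(fh)

# Stand-in for a fitted pipeline: preprocessing settings plus model parameters
pipeline = {"imputer": {"strategy": "mean"},
            "model": {"coef": [0.4, -1.2], "intercept": 0.1}}

with tempfile.TemporaryDirectory() as tmp:
    _, path = save_object(pipeline, "saved_lr_model", directory=tmp)
    restored = load_object("saved_lr_model", directory=tmp)
```

Saving the whole pipeline rather than the model alone matters because new data must pass through the same preprocessing steps before it reaches the model. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.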

**classification automl function**

This function returns the best model out of all trained models in current session based on the optimize parameter. Metrics evaluated can be accessed using the get_metrics function.

In [None]:
pycaret.classification.automl(optimize: str = 'Accuracy', 
                              use_holdout: bool = False)

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
top3 = compare_models(n_select = 3)
tuned_top3 = [tune_model(i) for i in top3]
blender = blend_models(tuned_top3)
stacker = stack_models(tuned_top3)
best_auc_model = automl(optimize = 'AUC')

optimize: str, default = ‘Accuracy’
Metric to use for model selection. It also accepts custom metrics added using the add_metric function.

use_holdout: bool, default = False
When set to True, metrics are evaluated on holdout set instead of CV.

Returns
Trained Model

**classification pull function**

Returns last printed score grid. Use pull function after any training function to store the score grid in pandas.DataFrame.

In [None]:
pycaret.classification.pull(pop: bool = False)

pop: bool, default = False
If True, will pop (remove) the returned dataframe from the display container.

Returns
pandas.DataFrame

**classification models function**

Returns table of models available in the model library.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
all_models = models()

In [None]:
pycaret.classification.models(type: Optional[str] = None, 
                              internal: bool = False, 
                              raise_errors: bool = True)

type: str, default = None
linear : filters and only return linear models

tree : filters and only return tree based models

ensemble : filters and only return ensemble models

internal: bool, default = False
When True, will return extra columns and rows used internally.

raise_errors: bool, default = True
When False, will suppress all exceptions, ignoring models that couldn’t be created.

Returns
pandas.DataFrame

**classification get_metrics function**

Returns table of available metrics used for CV.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
all_metrics = get_metrics()

In [None]:
pycaret.classification.get_metrics(reset: bool = False, include_custom: bool = True, raise_errors: bool = True)

reset: bool, default = False
When True, will reset all changes made using the add_metric and remove_metric function.

include_custom: bool, default = True
Whether to include user added (custom) metrics or not.

raise_errors: bool, default = True
If False, will suppress all exceptions, ignoring metrics that couldn’t be created.

Returns
pandas.DataFrame

**classification add_metric function**

Adds a custom metric to be used for CV.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
from sklearn.metrics import log_loss
add_metric('logloss', 'Log Loss', log_loss, greater_is_better = False)

In [None]:
pycaret.classification.add_metric(id: str, name: str, 
                                  score_func: type, 
                                  target: str = 'pred', 
                                  greater_is_better: bool = True, 
                                  multiclass: bool = True, **kwargs)

id: str
Unique id for the metric.

name: str
Display name of the metric.

score_func: type
Score function (or loss function) with signature score_func(y, y_pred, **kwargs).

target: str, default = ‘pred’
The target of the score function.

‘pred’ for the prediction table

‘pred_proba’ for pred_proba

‘threshold’ for decision_function or predict_proba

greater_is_better: bool, default = True
Whether a higher value of score_func is better (True) or a lower value is better (False).

multiclass: bool, default = True
Whether the metric supports multiclass target.

**kwargs:
Arguments to be passed to score function.

Returns
pandas.Series
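
A custom metric only needs to follow the score_func(y, y_pred, **kwargs) signature described above. The function below is a hypothetical example of such a metric, written in pure Python so it can be checked on its own. The cost values and the `positive` label are illustrative assumptions.

```python
def misclassification_cost(y, y_pred, fp_cost=1.0, fn_cost=5.0, positive=1):
    """Average cost per sample; lower is better."""
    total = 0.0
    for actual, predicted in zip(y, y_pred):
        if predicted == positive and actual != positive:
            total += fp_cost   # false positive
        elif predicted != positive and actual == positive:
            total += fn_cost   # false negative
    return total / len(y)

y = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
default_cost = misclassification_cost(y, y_pred)
heavier_cost = misclassification_cost(y, y_pred, fn_cost=10.0)
```

Because a lower value is better here, such a function would be registered with greater_is_better = False, and extra arguments such as fn_cost would be supplied through the **kwargs of add_metric.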

**classification remove_metric function**

Removes a metric from CV.

In [None]:
from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice,  target = 'Purchase')
remove_metric('MCC')

In [None]:
pycaret.classification.remove_metric(name_or_id: str)

name_or_id: str
Display name or ID of the metric.

Returns
None

In [None]:
pycaret.classification.get_logs(experiment_name: Optional[str] = None, save: bool = False)→ pandas.core.frame.DataFrame