# PyCaret

A PyCaret beginner's room.

[PyCaret](https://pycaret.org/) 1.0.0, released in 2020, is a free Python library that handles data preprocessing, visualization, and model development for machine learning in a few lines of code. It is a low-code AutoML tool. PyCaret is a Python wrapper around several major machine learning libraries (scikit-learn, XGBoost, LightGBM, etc.) and can handle classification, regression, clustering, anomaly detection, and natural language processing.

For Kagglers, it can be useful for quickly getting rough scores from a variety of models. However, advanced Kagglers are likely to have their own preprocessing and model evaluation code, or an existing pipeline, so they may not use it as much.

* I referred to the wonderful PyCaret notebook [here](https://www.kaggle.com/hasanbasriakcay/ubiquan-market-preds-pycaret-model-comparisons/notebook).


* Please note that PyCaret takes a long time to run because it trains multiple models. For that reason, the slowest cells are commented out and are not executed here.

* I am a beginner in machine learning, so I would appreciate comments if there are any mistakes.

# List of PyCaret functions used in this notebook

Preprocessing: setup()

Compare models: compare_models()

Create model: create_model()

Tuning: tune_model()

Visualization: plot_model()

Evaluate: evaluate_model()

Inference: finalize_model(), predict_model()

In [None]:
%%capture
!pip install pycaret[full]

In [None]:
import numpy as np 
import pandas as pd 
import os
import warnings
warnings.filterwarnings('ignore')

# Read the training data (Parquet format)

Reference: https://www.kaggle.com/robikscube/fast-data-loading-and-low-mem-with-parquet-files/notebook

In [None]:
%%time
train = pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet')
test = pd.read_parquet('../input/ubiquant-parquet/example_test.parquet')

train

# Reduce data
The full dataset is too large and would run out of memory later, so we reduce it first.

Note that only about 1/100 of the original rows are kept.

### Reduce columns

In [None]:
#https://www.kaggle.com/hasanbasriakcay/ubiquan-market-preds-pycaret-model-comparisons
### Cols Select
IGNORE_COLS = ['row_id', 'f_4', 'f_13', 'f_20', 'f_27', 'f_30', 'f_49', 'f_63', 'f_66', 'f_73', 'f_74', 'f_84', 'f_111', 
               'f_115', 'f_120', 'f_122', 'f_124', 'f_129', 'f_148', 'f_170', 'f_182', 'f_200', 'f_228', 'f_248', 'f_254', 
               'f_258', 'f_269', 'f_272', 'f_291', 'f_293', 'f_299', 'f_4', 'f_7', 'f_13', 'f_19', 'f_20', 'f_27', 'f_30', 
               'f_35', 'f_37', 'f_39', 'f_40', 'f_49', 'f_56', 'f_60', 'f_61', 'f_63', 'f_66', 'f_67', 'f_70', 'f_73', 'f_74', 
               'f_75', 'f_84', 'f_99', 'f_101', 'f_102', 'f_107', 'f_111', 'f_115', 'f_120', 'f_122', 'f_123', 'f_124', 'f_129', 
               'f_148', 'f_154', 'f_161', 'f_164', 'f_166', 'f_170', 'f_175', 'f_180', 'f_182', 'f_183', 'f_191', 'f_199', 'f_200', 
               'f_201', 'f_202', 'f_205', 'f_211', 'f_215', 'f_217', 'f_218', 'f_220', 'f_227', 'f_228', 'f_235', 'f_244', 'f_248', 
               'f_253', 'f_254', 'f_258', 'f_269', 'f_272', 'f_275', 'f_278', 'f_283', 'f_288', 'f_291', 'f_292', 'f_293', 'f_296', 
               'f_299']

basic_cols = ['time_id', 'investment_id', 'target']
num_feat = 300
features = [f'f_{i}' for i in range(num_feat)]
cols = basic_cols + features
selected_cols = []
for c in cols:
    if c in IGNORE_COLS:
        continue
    selected_cols.append(c)
train=train[selected_cols]
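The column selection above can be sketched on a toy frame. This is only an illustration: the `toy` DataFrame and its values are made up, and the ignore list is shortened. It shows that duplicate names in the ignore list are harmless, and that a set plus a list comprehension gives the same result as the loop.

```python
import pandas as pd

# Hypothetical miniature of the training frame (values are made up).
toy = pd.DataFrame({
    'row_id': ['0_1', '0_2'],
    'time_id': [0, 0],
    'investment_id': [1, 2],
    'target': [0.1, -0.2],
    'f_0': [1.0, 2.0],
    'f_1': [3.0, 4.0],
    'f_2': [5.0, 6.0],
    'f_3': [7.0, 8.0],
})

# 'f_1' appears twice on purpose; converting to a set removes the duplicate.
ignore = set(['row_id', 'f_1', 'f_3', 'f_1'])

# Keep every column that is not in the ignore set, preserving column order.
keep = [c for c in toy.columns if c not in ignore]
toy_selected = toy[keep]

print(list(toy_selected.columns))
```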

### Reduce rows

In [None]:
train = train[:25002].copy()  # first 25002 rows (time_id 0-10); copy so later column assignments do not write into a slice
train
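The row slice relies on knowing that the first 25002 rows cover time_id 0 to 10. A minimal sketch of an alternative, assuming the goal is simply "keep time_id up to 10", is to filter on the column itself. The `toy` frame and `max_time_id` name below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a time_id column (values are made up).
toy = pd.DataFrame({
    'time_id': [0, 0, 5, 10, 10, 11, 12],
    'target': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
})

max_time_id = 10

# Boolean filter: keeps the rows whose time_id is at most max_time_id,
# independent of how many rows each time_id contains.
subset = toy[toy['time_id'] <= max_time_id].copy()

print(len(subset), subset['time_id'].max())
```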

# Reduce Memory Usage

In [None]:
#https://www.kaggle.com/hasanbasriakcay/ubiquan-market-preds-pycaret-model-comparisons


def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

train = reduce_mem_usage(train)
train

In [None]:
train.info()
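One thing to keep in mind with reduce_mem_usage() is that casting floats down to float16 saves memory at the cost of precision. The following is a small sketch with made-up values, not taken from the competition data, that compares the rounding error of float16 and float32.

```python
import numpy as np

# Made-up sample values in the range [-1, 1].
values = np.array([0.1234567, -0.7654321, 0.0001234, 0.9999999], dtype=np.float64)

as_f16 = values.astype(np.float16)
as_f32 = values.astype(np.float32)

# Largest absolute rounding error introduced by each downcast.
err_f16 = float(np.max(np.abs(values - as_f16.astype(np.float64))))
err_f32 = float(np.max(np.abs(values - as_f32.astype(np.float64))))

# Bytes used per value for each dtype.
bytes_per_value = {
    'float64': values.itemsize,
    'float32': as_f32.itemsize,
    'float16': as_f16.itemsize,
}

print(bytes_per_value)
print('max error float16:', err_f16)
print('max error float32:', err_f32)
```

float16 uses a quarter of the memory of float64, but its rounding error is several orders of magnitude larger than that of float32.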

# import pycaret.regression

This competition is a regression task, so we use 'from pycaret.regression import *'.

In [None]:
#regression
from pycaret.regression import *

#classification:Not required for this competition.
#from pycaret.classification import *

# Preprocessing: setup()

The first step is to run setup().

setup() configures the preprocessing. It handles missing values, splits the data into a training set and a hold-out set, and so on.

Let's look at the main items in its output.
* session_id : The random seed used throughout the experiment for reproducibility. In this experiment, session_id is set to 2022.
* Original Data : The original shape of the data set. In this experiment, (25002, 229) means 25002 samples and 229 columns including the target column.
* Missing Values : This is indicated as True when the original data has missing values. For this experiment, there are no missing values in the data set.
* Numeric Features : The number of features inferred as numeric. In this dataset, 228 features are inferred as numeric.
* Categorical Features : The number of features inferred as categorical. There are no categorical features in this dataset.
* Transformed Train Set : Displays the shape of the transformed training set (17501, 229).
* Transformed Test Set : Displays the shape of the transformed test set (7501, 229).

In [None]:
%%time

reg = setup(data = train,
            target = 'target',
            #numeric_features = NUM_FEATURES,
            session_id = 2022,
            silent = True, #Skip checking for type estimation.
            data_split_shuffle = False) #Avoid using "future" observations to predict "past" observations.
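The shapes reported by setup() can be reproduced by hand. This is a minimal sketch, assuming a 70% training fraction and the rounding that scikit-learn's train_test_split applies (the hold-out size is rounded up). With data_split_shuffle = False the split is simply "first rows for training, last rows for hold-out". The `toy` frame and its `time_order` column are hypothetical.

```python
import math

import numpy as np
import pandas as pd

n_rows = 25002
train_size = 0.7  # assumed training fraction

# Hypothetical frame whose only column records the original row order.
toy = pd.DataFrame({'time_order': np.arange(n_rows)})

# The hold-out size is rounded up; the training set gets the remainder.
n_test = math.ceil(n_rows * (1 - train_size))
n_train = n_rows - n_test

# Without shuffling, the split is a plain cut in row order.
train_part = toy.iloc[:n_train]
test_part = toy.iloc[n_train:]

print(train_part.shape[0], test_part.shape[0])
```

Because every training row comes before every hold-out row, no "future" observation is used to predict a "past" one.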

# Model Comparison: compare_models()

This function trains all the models in the model library and scores them using k-fold cross-validation.

For regression, the output includes MAE, MSE, RMSE, R2, RMSLE, and MAPE, along with training time.

LightGBM, CatBoost, XGBoost, etc., which are often used on Kaggle, are included as well.



In [None]:
%%time

#It will take some time to run.
best_model=compare_models(sort = 'RMSE')

You can also retrieve the top N models instead of only the best one.

In [None]:
#Not executed because it takes time to execute

#N = 3 #Specify the number of upper models
#top_models = compare_models(sort = 'RMSE', n_select = N)
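The k-fold scoring behind each row of the compare_models() grid can be sketched with NumPy alone. This is an illustration under assumptions, not PyCaret's implementation: `kfold_rmse` is a hypothetical helper, ordinary least squares stands in for a real model, the data is synthetic, and 10 folds is an assumed setting.

```python
import numpy as np

def kfold_rmse(X, y, n_folds=10):
    """Return the validation RMSE of a least-squares fit for each fold."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    scores = []
    for val_idx in folds:
        # Train on every row that is not in the current validation fold.
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        A_train = np.column_stack([X[train_mask], np.ones(train_mask.sum())])
        coef, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        # Score on the held-out fold.
        A_val = np.column_stack([X[val_idx], np.ones(len(val_idx))])
        pred = A_val @ coef
        scores.append(float(np.sqrt(np.mean((y[val_idx] - pred) ** 2))))
    return scores

# Synthetic regression data (made up for this sketch).
rng = np.random.default_rng(2022)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

fold_scores = kfold_rmse(X, y, n_folds=10)
mean_rmse = float(np.mean(fold_scores))
print('mean RMSE over folds:', round(mean_rmse, 4))
```

The single number shown per model in the comparison grid corresponds to `mean_rmse` here: the average of the per-fold scores.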

# Create model: create_model()

In this case, I will use the Random Forest Regressor model.

The metrics printed in the compare_models() score grid are averages across all CV folds. create_model() shows the score for each fold as well as the mean.

In [None]:
%%time
model = create_model('rf')

# Tuning: tune_model()

PyCaret uses a random grid search to tune the hyperparameters of the model automatically.

The output is the score grid of the tuned model: MAE, MSE, RMSE, R2, RMSLE, and MAPE.

In [None]:
#It will take about 10 hours to run.
#Not executed because it takes time to execute

#tuned_model = tune_model(model, optimize = 'RMSE')
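The idea of a random grid search can be sketched in plain Python. Everything here is hypothetical: `random_grid_search` is an illustrative helper, the grid values are made up, and `toy_rmse` is a stand-in for a cross-validated RMSE, which is the expensive part in practice.

```python
import random

def random_grid_search(param_grid, score_fn, n_iter=10, seed=2022):
    """Score n_iter randomly drawn combinations and return the lowest-scoring one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_iter):
        # Draw one value per hyperparameter instead of trying every combination.
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        trials.append((score_fn(params), params))
    best_score, best_params = min(trials, key=lambda t: t[0])
    return best_score, best_params, trials

# Hypothetical grid for a tree ensemble.
param_grid = {
    'n_estimators': [50, 100, 200, 400],
    'max_depth': [3, 5, 7, 9, 11],
}

def toy_rmse(params):
    """Stand-in for a cross-validated RMSE; lower is better."""
    return 0.9 + abs(params['max_depth'] - 7) * 0.01 + abs(params['n_estimators'] - 200) / 10000

best_score, best_params, trials = random_grid_search(param_grid, toy_rmse, n_iter=10)
print(best_score, best_params)
```

Only `n_iter` combinations are evaluated, so the search cost is fixed in advance, at the price of possibly missing the best combination in the grid.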

# Get parameters
You can check the parameters of the model.

In [None]:
#Not executed because it takes time to execute
#tuned_model.get_params()

# Evaluate: evaluate_model()

evaluate_model() opens an interactive widget where you can check the model's hyperparameters and evaluation plots.

In [None]:
%%time
evaluate_model(model)

# Blending: blend_models()

If no model is specified, blend_models() will use all models supported by PyCaret for blending.

In [None]:
##Not executed because it takes time to execute

# create models
#cat = create_model('catboost') #CatBoost
#rf = create_model('rf') #Random Forest
#lr = create_model('lr') #Linear Regression

# tuning
#tuned_cat = tune_model(cat)
#tuned_rf = tune_model(rf)
#tuned_lr = tune_model(lr)

# Blending
#For regression, blending averages the predictions of the listed models.
#(The 'soft'/'hard' voting methods apply to classification only.)
#blender_specific = blend_models(estimator_list = [tuned_cat,tuned_rf,tuned_lr])
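For regression, blending boils down to averaging the predictions of several models. A minimal NumPy sketch with made-up predictions follows; `pred_a`, `pred_b`, and `pred_c` are hypothetical model outputs, and the weights are illustrative.

```python
import numpy as np

def rmse(y, p):
    """Root mean squared error between targets y and predictions p."""
    return float(np.sqrt(np.mean((y - p) ** 2)))

# Made-up targets and predictions from three hypothetical models.
y_true = np.array([0.2, -0.1, 0.4, 0.0])
pred_a = np.array([0.3, -0.2, 0.5, 0.1])
pred_b = np.array([0.1, 0.0, 0.3, -0.1])
pred_c = np.array([0.2, -0.1, 0.1, 0.0])

# Plain blend: unweighted mean of the three predictions.
blend = np.mean([pred_a, pred_b, pred_c], axis=0)

# Weighted blend: trust some models more than others.
weighted = np.average([pred_a, pred_b, pred_c], axis=0, weights=[0.5, 0.3, 0.2])

for name, p in [('a', pred_a), ('b', pred_b), ('c', pred_c), ('blend', blend)]:
    print(name, round(rmse(y_true, p), 4))
```

In this toy case the errors of the individual models partly cancel, so the blend scores better than each of them. That is not guaranteed in general.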

# Stacking: stack_models()

Stacking is often used on Kaggle. With PyCaret, it only requires specifying multiple base models and a meta-model.

In [None]:
#Not executed because it takes time to execute

# create individual models for stacking
#cat = create_model('catboost')
#rf = create_model('rf')
#tuned_cat = tune_model(cat)
#tuned_rf = tune_model(rf)

#meta_model
#xgboost = create_model('xgboost')

# stacking models
#stacker = stack_models(estimator_list = [tuned_cat,tuned_rf], meta_model = xgboost)
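The idea behind stacking is that the meta-model learns from the predictions of the base models. Below is a simplified NumPy sketch under stated assumptions: it uses a hold-out split rather than cross-validated out-of-fold predictions, least-squares fits stand in for CatBoost, Random Forest, and XGBoost, and all helper names and data are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

# Synthetic data (made up): the target depends on both features.
rng = np.random.default_rng(2022)
X = rng.normal(size=(300, 2))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Rows 0-99 train the base models, 100-199 train the meta-model, 200-299 evaluate.
base_idx, meta_idx, test_idx = np.split(np.arange(300), [100, 200])

# Two deliberately weak base models, each seeing only one feature.
base_a = fit_linear(X[base_idx, :1], y[base_idx])
base_b = fit_linear(X[base_idx, 1:], y[base_idx])

def base_predictions(rows):
    """Stack the base-model predictions as columns: the meta-model's features."""
    return np.column_stack([
        predict_linear(base_a, X[rows, :1]),
        predict_linear(base_b, X[rows, 1:]),
    ])

# The meta-model is fit on base predictions for rows the base models never saw.
meta = fit_linear(base_predictions(meta_idx), y[meta_idx])
stacked_pred = predict_linear(meta, base_predictions(test_idx))

stacked_rmse = rmse(y[test_idx], stacked_pred)
base_a_rmse = rmse(y[test_idx], base_predictions(test_idx)[:, 0])
base_b_rmse = rmse(y[test_idx], base_predictions(test_idx)[:, 1])
print(stacked_rmse, base_a_rmse, base_b_rmse)
```

Training the meta-model on rows the base models have not seen is what keeps it from simply rewarding base models that overfit.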

# Inference: predict_model()
Before finalizing the model, perform one final check by predicting on the test/hold-out set and reviewing the evaluation metrics.

In [None]:
%%time
predict_model(model)

# Finalize the model: finalize_model()

Finally, run finalize_model() to refit the model on the entire dataset, including the hold-out set.

In [None]:
%%time
final_model = finalize_model(model)
predict_model(final_model)

# Visualization: plot_model()

You can visualize various graphs with plot_model().

In [None]:
#Prediction error plot
plot_model(final_model, plot='error')

In [None]:
#feature importance
plot_model(final_model, plot='feature')

In [None]:
#Not executed because it takes time to execute

#learning curve
#plot_model(final_model, plot='learning')

# Save & load model

In [None]:
#save_model()
#save_model(final_model, model_name='Final RF Model')

#load_model()
#saved_final_model = load_model(model_name='Final RF Model')

# PyCaret + SHAP
PyCaret also supports SHAP, but this notebook does not run it because it takes too long.

In [None]:
#!pip install shap

In [None]:
#import shap

In [None]:
#Passing a model to interpret_model displays a summary plot
#summary plot shows us which explanatory variables have a large effect on the model.

#interpret_model(final_model)

In [None]:
#display dependence plot with 'correlation' argument
#dependence plot is a scatter plot of a specific explanatory variable and SHAP values.

#interpret_model(final_model,plot='correlation')

In [None]:
#force plot when 'reason' is specified as argument.
#force plot shows SHAP values for individual data.
#specify data index with observation argument

#interpret_model(final_model,plot='reason',observation=1)
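To get a feel for what SHAP values mean without running the library, here is a small NumPy sketch for the simplest case. It assumes a linear model and features treated as independent, in which case the SHAP value of feature i for one row is weight_i * (x_i - mean(x_i)). The data, weights, and bias are made up, and this is not what interpret_model() computes for a tree model.

```python
import numpy as np

# Synthetic features and a hypothetical linear model.
rng = np.random.default_rng(2022)
X = rng.normal(size=(500, 3))
weights = np.array([2.0, -1.0, 0.0])
bias = 0.5

def predict(data):
    return data @ weights + bias

# Base value: the average prediction over the data.
base_value = float(predict(X).mean())

# Per-row, per-feature contributions relative to the base value.
shap_values = (X - X.mean(axis=0)) * weights

# What a summary plot ranks features by: mean absolute contribution.
mean_abs = np.abs(shap_values).mean(axis=0)

# Additivity: base value plus a row's contributions gives that row's prediction.
reconstructed = base_value + shap_values.sum(axis=1)

print('mean |SHAP| per feature:', np.round(mean_abs, 3))
```

The rows of `shap_values` are what a force plot shows for a single observation, and plotting one column of `shap_values` against the same column of `X` gives a dependence plot.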