![image](logo.png)
###  🐫 Python Comparative Analysis for Machine Learning 0.3.1 🐫


# Usage example
pycaML lets you compare models with tuned and default parameters. It also supports stacking and voting ensembles.

Every object in pycaML is an Experiment. There are two types of Experiments: `RegressionExperiment` and `ClassificationExperiment`. If you start a `ClassificationExperiment`, pycaML automatically detects whether the task is binary or multi-label classification.
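
How pycaML performs this detection is not described here. As a rough illustration of the kind of check involved, scikit-learn's `type_of_target` helper classifies a target array by its shape and values. Whether pycaML relies on this helper is an assumption, and the toy arrays below are made up.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Toy target arrays, invented for illustration only.
binary_y = np.array([0, 1, 1, 0, 1])                         # two distinct classes
multiclass_y = np.array([0, 2, 1, 2, 0])                     # more than two classes
multilabel_y = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])   # several 0/1 labels per row

binary_kind = type_of_target(binary_y)
multiclass_kind = type_of_target(multiclass_y)
multilabel_kind = type_of_target(multilabel_y)

print(binary_kind)      # binary
print(multiclass_kind)  # multiclass
print(multilabel_kind)  # multilabel-indicator
```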

To begin, let's create a `RegressionExperiment` instance using the diabetes dataset.

```python
from pycaML import RegressionExperiment

diabetes = RegressionExperiment(name='diabetes')
diabetes.load_data('diabetes.csv', target='Glucose')
```

```
Directory experiments/diabetes/params created!
Directory experiments/diabetes/trials created!
Directory experiments/diabetes/tables created!
Directory experiments/diabetes/data created!
X_train, y_train, X_test, y_test saved
```


Creating the `diabetes` object automatically creates the folders needed for the experiment.
`load_data` takes the dataset as input, splits it into four files (`X_train`, `y_train`, `X_test`, `y_test`) and saves them to the `/experiments/diabetes/data` folder.
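
The split-and-save step can be pictured roughly as follows. This is a sketch, not pycaML's implementation: the `split_and_save` helper, the 80/20 split, the CSV file format and the toy data are all assumptions; only the four piece names come from the output above.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split


def split_and_save(df, target, out_dir, test_size=0.2, seed=0):
    """Split df into train/test features and target, and write one CSV per piece."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pieces = {"X_train": X_train, "y_train": y_train, "X_test": X_test, "y_test": y_test}
    for name, piece in pieces.items():
        piece.to_csv(out_dir / f"{name}.csv", index=False)
    return sorted(p.name for p in out_dir.glob("*.csv"))


# Tiny made-up frame standing in for diabetes.csv.
toy = pd.DataFrame({
    "BMI": [22.1, 30.5, 27.3, 35.0, 24.8, 29.9, 31.2, 26.4, 33.3, 23.7],
    "Age": [25, 47, 33, 52, 29, 41, 60, 38, 45, 22],
    "Glucose": [85, 140, 110, 160, 95, 120, 155, 105, 130, 90],
})

with tempfile.TemporaryDirectory() as tmp:
    saved = split_and_save(toy, target="Glucose", out_dir=Path(tmp) / "experiments" / "toy" / "data")

print(saved)
```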


The `start` method runs the experiment and saves the result in the `/experiments/diabetes/data` folder. If the result file already exists, it is loaded and the experiment is not run again.

```python
diabetes.start()
diabetes.result
```

| Model | CV mean_squared_error | CV mean_absolute_error | CV root_mean_squared_error | Test mean_squared_error | Test mean_absolute_error | Test root_mean_squared_error | STD mean_squared_error | STD mean_absolute_error | STD root_mean_squared_error |
|---|---|---|---|---|---|---|---|---|---|
| Ridge | 684.0937 | 19.7132 | 25.9922 | 704.9065 | 20.6151 | 26.5501 | 149.3266 | 2.0916 | 2.9153 |
| Least Angle Regression | 684.1458 | 19.7181 | 25.9938 | 704.6762 | 20.6074 | 26.5457 | 149.0704 | 2.0922 | 2.9102 |
| Bayesian Ridge | 685.2642 | 19.7011 | 26.0095 | 707.7618 | 20.6916 | 26.6038 | 151.7873 | 2.0711 | 2.9613 |
| Lasso | 687.251 | 19.649 | 26.0409 | 710.1292 | 20.7857 | 26.6482 | 155.7772 | 2.0639 | 3.0206 |
| TheilSen | 711.0897 | 19.8694 | 26.4725 | 736.8144 | 20.6382 | 27.1443 | 167.3953 | 2.1714 | 3.2089 |
| Huber | 728.275 | 20.4693 | 26.8886 | 796.3084 | 21.8721 | 28.2189 | 123.059 | 1.6999 | 2.2976 |
| ExtraTrees | 744.8984 | 20.0806 | 27.1331 | 769.2363 | 21.5691 | 27.7351 | 162.8124 | 1.7093 | 2.9489 |
| CatBoost | 746.5364 | 19.9194 | 27.1227 | 810.2917 | 22.4741 | 28.4656 | 187.3362 | 2.3122 | 3.3004 |
| Random Forest | 747.2463 | 20.3745 | 27.1317 | 703.7196 | 20.8853 | 26.5277 | 183.1579 | 2.2459 | 3.3341 |
| Elastic Net | 747.6407 | 20.5127 | 27.1626 | 795.5149 | 22.1152 | 28.2049 | 168.0972 | 1.9605 | 3.1363 |


By default, 5-fold cross-validation is performed. `CV root_mean_squared_error` is the mean RMSE across the validation folds, and `STD root_mean_squared_error` is the standard deviation of the RMSE across those folds; the other `CV` and `STD` columns follow the same pattern.
You can change the number of folds with the `cv` parameter. Example with 10-fold cross-validation:

```python
diabetes.start(cv=10)
```
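
To make the mean and standard deviation over folds concrete, here is a minimal sketch using scikit-learn directly. It is not pycaML's code: the `cv_rmse` helper is hypothetical, and the data is scikit-learn's bundled diabetes dataset, which is a different dataset from the `diabetes.csv` file used above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Stand-in data: scikit-learn's bundled diabetes set, not the diabetes.csv above.
X, y = load_diabetes(return_X_y=True)


def cv_rmse(model, X, y, cv=5):
    """Return (mean, std) of the RMSE over the validation folds."""
    # scikit-learn negates error metrics so that higher is always better;
    # flip the sign back to get plain RMSE values, one per fold.
    scores = -cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    return scores.mean(), scores.std()


mean_5, std_5 = cv_rmse(Ridge(), X, y, cv=5)
mean_10, std_10 = cv_rmse(Ridge(), X, y, cv=10)

print(f"5-fold RMSE:  {mean_5:.2f} +/- {std_5:.2f}")
print(f"10-fold RMSE: {mean_10:.2f} +/- {std_10:.2f}")
```

More folds give more scores to average, at the cost of fitting the model more times.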

# Stacking models

The experiment above trained all the available models with default parameters.
To perform model stacking, let's create another experiment. There is no need to run `load_data` again if the data has already been saved in the data folder.
Stacking is started with the `stack` method, which takes two parameters:

- `n_estimators` sets how many models to include in the final estimator
- `estimators` can be `best`, `random` or `all`


```python
diabetes_stacking = RegressionExperiment(name='diabetes')
diabetes_stacking.stack(n_estimators=10, estimators='best')
diabetes_stacking.stack_result
```

| Model | Test mean_squared_error | Test mean_absolute_error | Test root_mean_squared_error | Estimators | N_estimators |
|---|---|---|---|---|---|
| Stacking | 697.102287 | 20.604786 | 26.402695 | ['Ridge', 'Least Angle Regression', 'Bayesian ... | 6 |
| Voting | 705.505953 | 20.588648 | 26.561362 | ['Ridge', 'Least Angle Regression', 'Bayesian ... | 6 |
| Stacking | 680.004624 | 20.567459 | 26.076898 | [Ridge, Least Angle Regression, Bayesian Ridge... | 10 |
| Voting | 691.613451 | 20.495616 | 26.298545 | [Ridge, Least Angle Regression, Bayesian Ridge... | 10 |
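
For readers unfamiliar with the two ensemble types in the table, the sketch below shows the general idea with scikit-learn: rank candidate models by cross-validated error, keep the best `n_estimators`, then build a stacking and a voting ensemble from them. This is an illustration under assumptions, not pycaML's implementation: the candidate list, the `LinearRegression` final estimator, the 80/20 split and the bundled scikit-learn diabetes data (a different dataset from `diabetes.csv`) are all chosen here for the example.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, StackingRegressor, VotingRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in data: scikit-learn's bundled diabetes set.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical candidate pool; pycaML's own pool is much larger.
candidates = {
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Rank candidates by cross-validated MSE on the training set (lower is better).
cv_mse = {
    name: -cross_val_score(model, X_train, y_train, cv=5, scoring="neg_mean_squared_error").mean()
    for name, model in candidates.items()
}
n_estimators = 2
best_names = sorted(cv_mse, key=cv_mse.get)[:n_estimators]
best = [(name, candidates[name]) for name in best_names]

# Stacking fits a final estimator on the base models' out-of-fold predictions;
# voting simply averages the base models' predictions.
ensembles = {
    "Stacking": StackingRegressor(estimators=best, final_estimator=LinearRegression()),
    "Voting": VotingRegressor(estimators=best),
}

test_mse = {}
for label, ensemble in ensembles.items():
    ensemble.fit(X_train, y_train)
    test_mse[label] = mean_squared_error(y_test, ensemble.predict(X_test))
    print(f"{label}: test MSE = {test_mse[label]:.1f}")
```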


# Hyperparameter tuning
Hyperparameter tuning is as simple as starting the experiment with the `tuning` parameter set to the metric you want to optimize. You can also stack tuned models by calling `stack` once the tuning process has saved the parameters.

```python
diabetes_tuned = RegressionExperiment(name='diabetes')
diabetes_tuned.start(tuning='mean_squared_error')
diabetes_tuned.result
```

```
Result loaded
Data loaded
```


| Model | CV mean_squared_error | CV mean_absolute_error | CV root_mean_squared_error | Test mean_squared_error | Test mean_absolute_error | Test root_mean_squared_error | STD mean_squared_error | STD mean_absolute_error | STD root_mean_squared_error |
|---|---|---|---|---|---|---|---|---|---|
| CatBoost | 676.618 | 19.3714 | 25.8387 | 660.4535 | 20.3281 | 25.6993 | 156.7793 | 2.1262 | 2.9969 |
| Bagging | 683.856 | 19.7232 | 25.9889 | 693.9194 | 20.458 | 26.3423 | 149.633 | 2.0606 | 2.9039 |
| Bayesian Ridge | 684.0655 | 19.7309 | 25.9896 | 701.3845 | 20.594 | 26.4837 | 150.4353 | 2.0754 | 2.9333 |
| Ridge | 684.068 | 19.7091 | 25.991 | 705.2452 | 20.6255 | 26.5565 | 149.6693 | 2.0888 | 2.922 |
| Elastic Net | 684.0738 | 19.7088 | 25.991 | 705.4142 | 20.6309 | 26.5596 | 149.7063 | 2.0886 | 2.9227 |
| Least Angle Regression | 684.1458 | 19.7181 | 25.9938 | 704.6762 | 20.6074 | 26.5457 | 149.0704 | 2.0922 | 2.9102 |
| Orthogonal Matching Pursuit | 684.1458 | 19.7181 | 25.9938 | 704.6762 | 20.6074 | 26.5457 | 149.0704 | 2.0922 | 2.9102 |
| Lasso | 684.1503 | 19.7177 | 25.9938 | 704.65 | 20.6076 | 26.5452 | 149.087 | 2.0924 | 2.9105 |
| XGBoost | 684.5187 | 19.5088 | 25.9646 | 651.5563 | 20.1063 | 25.5256 | 171.0888 | 2.2113 | 3.2187 |
| Huber | 685.6506 | 19.6731 | 26.0001 | 699.0842 | 20.5891 | 26.4402 | 159.071 | 2.2705 | 3.1058 |


```python
diabetes_stacking_tuned = RegressionExperiment(name='diabetes')
diabetes_stacking_tuned.stack(tuning='mean_squared_error')
diabetes_stacking_tuned.stack_result
```

| Model | Test mean_squared_error | Test mean_absolute_error | Test root_mean_squared_error | Estimators | N_estimators |
|---|---|---|---|---|---|
| Stacking | 697.102287 | 20.604786 | 26.402695 | ['Ridge', 'Least Angle Regression', 'Bayesian ... | 6 |
| Voting | 705.505953 | 20.588648 | 26.561362 | ['Ridge', 'Least Angle Regression', 'Bayesian ... | 6 |
| Stacking | 680.004624 | 20.567459 | 26.076898 | ['Ridge', 'Least Angle Regression', 'Bayesian ... | 10 |
| Voting | 691.613451 | 20.495616 | 26.298545 | ['Ridge', 'Least Angle Regression', 'Bayesian ... | 10 |
| Stacking | 665.105399 | 20.219939 | 25.789637 | ['CatBoost', 'Bagging', 'Bayesian Ridge', 'Rid... | 10 |
| Voting | 692.044351 | 20.483104 | 26.306736 | ['CatBoost', 'Bagging', 'Bayesian Ridge', 'Rid... | 10 |
| Stacking | 665.105399 | 20.219939 | 25.789637 | [CatBoost, Bagging, Bayesian Ridge, Ridge, Ela... | 10 |
| Voting | 692.044351 | 20.483104 | 26.306736 | [CatBoost, Bagging, Bayesian Ridge, Ridge, Ela... | 10 |


When `tuning` is set, the `start` method optimizes every model with 100 runs of Bayesian optimization using the TPE (Tree-structured Parzen Estimator) algorithm, as implemented in the hyperopt package.

But don't worry! Since optimizing models takes a lot of time, you don't need to optimize all of them in a single run.  
pycaML saves optimized parameters to the `experiments/diabetes/params/` folder. When you run an experiment, the parameters are loaded automatically if they already exist, so you won't run the same optimization twice (unless you delete the file).  
The `experiments/diabetes/trials/` folder contains additional information on the optimization runs.
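
The load-if-present behaviour described above follows a common caching pattern, sketched below with the standard library only. The `load_or_tune` helper, the one-JSON-file-per-model layout and the `fake_tune` stand-in are hypothetical; the format pycaML actually uses for its parameter files is not stated here.

```python
import json
import tempfile
from pathlib import Path


def load_or_tune(params_dir, model_name, tune):
    """Return cached parameters if the file exists; otherwise run `tune` and cache its result."""
    path = Path(params_dir) / f"{model_name}.json"
    if path.exists():
        return json.loads(path.read_text())
    params = tune()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(params))
    return params


calls = []


def fake_tune():
    # Stand-in for an expensive optimization run; records each invocation.
    calls.append(1)
    return {"alpha": 0.5}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_tune(tmp, "Ridge", fake_tune)   # runs fake_tune and writes the file
    second = load_or_tune(tmp, "Ridge", fake_tune)  # reads the file, no second run

print(first, second, len(calls))
```

Deleting the cached file is what forces the optimization to run again, which matches the behaviour described for the `params` folder.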






# Author 
If you like the package, you can find me on [LinkedIn](https://www.linkedin.com/in/donato-riccio-280084146/).
