![image](logo.png)
### Python Comparative Analysis for Machine Learning


# Usage example
pycaML lets you compare models with tuned and default parameters. It also supports stacking and voting ensembles.

Every object in pycaML is an Experiment. There are two types of Experiment: `RegressionExperiment` and `ClassificationExperiment`. If you start a `ClassificationExperiment`, pycaML automatically detects whether the task is binary or multi-label classification.

To begin, let's create a `RegressionExperiment` instance using the diabetes dataset.

In [1]:
from pycaML import RegressionExperiment
diabetes = RegressionExperiment(name = 'diabetes')
diabetes.load_data('diabetes.csv', target = 'Glucose')

Data loaded
X_train, y_train, X_test, y_test saved


The `diabetes` object automatically creates the folders the experiment needs.
`load_data` takes the dataset as input, splits it into four files (`X_train`, `y_train`, `X_test`, `y_test`) and saves them to the `/experiments/diabetes/data` folder.
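To make the four-file layout concrete, here is a minimal sketch of that kind of split using only the standard library. It is not pycaML's implementation: the helper name `split_and_save`, the 80/20 split ratio, the seed and the CSV file names are all assumptions made for illustration.

```python
import csv
import random
import tempfile
from pathlib import Path


def split_and_save(csv_path, target, out_dir, test_size=0.2, seed=0):
    """Split a CSV into train/test features and targets, one file each.

    Hypothetical helper: the ratio, seed and file names are assumptions.
    """
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Shuffle reproducibly, then hold out the last `test_size` fraction.
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_size))
    cut = len(rows) - n_test
    parts = {"train": rows[:cut], "test": rows[cut:]}

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    features = [col for col in rows[0] if col != target]

    saved = {}
    for name, subset in parts.items():
        for prefix, cols in (("X", features), ("y", [target])):
            path = out_dir / f"{prefix}_{name}.csv"
            with open(path, "w", newline="") as f:
                # extrasaction="ignore" drops the columns not listed in `cols`.
                writer = csv.DictWriter(f, fieldnames=cols, extrasaction="ignore")
                writer.writeheader()
                writer.writerows(subset)
            saved[f"{prefix}_{name}"] = path
    return saved


# Toy data standing in for diabetes.csv.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "toy.csv"
    with open(source, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["BMI", "Age", "Glucose"])
        writer.writerows([[20 + i, 30 + i, 90 + i] for i in range(10)])
    files = split_and_save(source, target="Glucose", out_dir=Path(tmp) / "data")
    print(sorted(files))
```

Keeping features and target in separate files means later experiments can reload exactly the same split without touching the original dataset.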


The `start` method runs the experiment and saves the result in the `/experiments/diabetes/data` folder. If a result file already exists, it is loaded and the experiment is not run again.

In [2]:
diabetes.start() 
diabetes.result

20:39:07: Training Ridge 13/14                  

| | Model | RMSE_test | RMSE_cv | RMSE_std | Time |
|---|---|---|---|---|---|
| 9 | EBM | 25.559451 | 26.538906 | 2.928745 | 1.304769 |
| 1 | Random Forest | 26.527714 | 27.335807 | 3.334061 | 0.159 |
| 12 | Ridge | 26.550075 | 26.155185 | 2.9153 | 0.000801 |
| 11 | Lasso | 26.648249 | 26.215473 | 3.020647 | 0.000801 |
| 6 | ExtraTrees | 27.73511 | 27.292827 | 2.948893 | 0.1024 |
| 10 | Elastic Net | 28.204873 | 27.34302 | 3.136312 | 0.000602 |
| 5 | CatBoost | 28.465623 | 27.322818 | 3.300429 | 1.441607 |
| 7 | KNN | 28.888397 | 29.704169 | 3.227931 | 0.000601 |
| 4 | Gradient Boost | 29.001088 | 27.448801 | 3.289781 | 0.129612 |
| 3 | LightGBM | 29.300748 | 27.547163 | 3.597968 | 1.557333 |


By default, 5-fold cross-validation is performed. `RMSE_cv` is the mean score on the validation folds. `RMSE_std` is the standard deviation of the score across the validation folds.
You can change the number of folds by setting the `cv` parameter. Example with 10-fold cross-validation:
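As a rough illustration of how a mean and a standard deviation come out of the folds, the sketch below cross-validates a trivial baseline that always predicts the training mean. Everything here is an assumption for illustration (the function names, the contiguous unshuffled folds, the baseline model, and the use of the population standard deviation); it does not show how pycaML computes `RMSE_cv` and `RMSE_std` internally.

```python
import math
import statistics


def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cv_rmse(y, k=5):
    """Return (mean, std, per-fold scores) for a mean-predictor baseline."""
    scores = []
    for train, val in kfold_indices(len(y), k):
        prediction = statistics.fmean(y[i] for i in train)  # "fit" on the training folds
        scores.append(rmse([y[i] for i in val], [prediction] * len(val)))
    # Population standard deviation is an assumption (it matches NumPy's default).
    return statistics.fmean(scores), statistics.pstdev(scores), scores


y = [float(v) for v in range(1, 21)]  # toy target values
rmse_cv, rmse_std, fold_scores = cv_rmse(y, k=5)
print(f"RMSE_cv={rmse_cv:.3f}  RMSE_std={rmse_std:.3f}")
```

A small `RMSE_std` relative to `RMSE_cv` suggests the model's error is stable across folds; a large one suggests the score depends heavily on which rows land in the validation fold.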

In [None]:
diabetes.start(cv = 10) 

# Stacking models

The experiment above trained all the available models with default parameters.
To perform model stacking, let's create another experiment. There is no need to run `load_data` again if the data has already been saved to the data folder.


In [3]:
diabetes_stacking = RegressionExperiment(name = 'diabetes', stacking = True)
diabetes_stacking.start()
diabetes_stacking.result

Result loaded
Data loaded
10:51:24: Training Voting (diverse) 4/5            

| | Model | RMSE_test | RMSE_cv | rmse_Std | Time |
|---|---|---|---|---|---|
| 2 | Stacking (diverse) | 25.810042 | 26.546858 | 2.822744 | 7.049409 |
| 3 | Voting (diverse) | 26.053284 | 26.864156 | 3.488867 | 1.406394 |
| 0 | Stacking (all) | 26.52166 | 26.988583 | 2.827267 | 61.703028 |
| 1 | Voting (all) | 26.587388 | 26.503379 | 3.350272 | 8.767302 |


`Stacking (all)` uses every other model as a level-0 estimator.
The `diverse` variants of stacking and voting use only the best model from each family: the best boosting model, the best parallel ensemble (Random Forest or ExtraTrees) and the best simple model.
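The "best per family" idea can be sketched in a few lines. The `RMSE_cv` values below are copied from the default-parameter results table above; the family assignments, the choice of `RMSE_cv` as the ranking metric and the helper name `pick_diverse` are assumptions, so pycaML may group or rank models differently.

```python
# RMSE_cv values copied from the default-parameter results table.
rmse_cv = {
    "Random Forest": 27.335807,
    "ExtraTrees": 27.292827,
    "CatBoost": 27.322818,
    "Gradient Boost": 27.448801,
    "LightGBM": 27.547163,
    "Ridge": 26.155185,
    "Lasso": 26.215473,
    "Elastic Net": 27.34302,
}

# Hypothetical grouping; pycaML's actual family assignments may differ.
families = {
    "boosting": ["CatBoost", "Gradient Boost", "LightGBM"],
    "parallel": ["Random Forest", "ExtraTrees"],
    "simple": ["Ridge", "Lasso", "Elastic Net"],
}


def pick_diverse(scores, groups):
    """Keep the lowest-error model from each family."""
    return {
        family: min(members, key=scores.__getitem__)
        for family, members in groups.items()
    }


diverse = pick_diverse(rmse_cv, families)
print(diverse)
```

Picking one model per family keeps the ensemble small and favours base estimators whose errors are less correlated, which is consistent with the much shorter training times of the `diverse` rows in the table.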

# Hyperparameter tuning
Hyperparameter tuning is as simple as creating another experiment with the flag `tuning = True`. You can also stack tuned models by setting both `tuning = True` and `stacking = True`.

In [5]:
diabetes_tuned = RegressionExperiment(name = 'diabetes', tuning = True)
# n_eval is an optional parameter (default: 100)
diabetes_tuned.start(n_eval = 100)
diabetes_tuned.result

| | Model | RMSE_test | RMSE_cv | RMSE_std | Time |
|---|---|---|---|---|---|
| 0 | XGBoost | 25.431691 | 26.275172 | 3.162214 | 1.103 |
| 1 | CatBoost | 25.620023 | 26.000716 | 2.966027 | 0.351201 |
| 2 | ExtraTrees | 25.692319 | 25.974532 | 3.066963 | 0.0912 |
| 3 | Random Forest | 25.71451 | 26.28291 | 3.190689 | 0.238601 |
| 4 | Gradient Boost | 25.799594 | 26.226755 | 3.174442 | 0.067601 |
| 5 | EBM | 26.036886 | 26.271469 | 3.068709 | 2.507198 |
| 6 | LightGBM | 26.205889 | 26.325291 | 3.284387 | 0.76946 |
| 7 | Lasso | 26.544329 | 26.156433 | 2.910997 | 0.0004 |
| 8 | Elastic Net | 26.554725 | 26.155138 | 2.920088 | 0.0004 |
| 9 | Ridge | 26.556566 | 26.154694 | 2.922152 | 0.0006 |


In [5]:
diabetes_stacking_tuned = RegressionExperiment(name = 'diabetes', tuning = True, stacking = True)
diabetes_stacking_tuned.start()
diabetes_stacking_tuned.result

| | Model | RMSE_test | RMSE_cv | rmse_Std | Time |
|---|---|---|---|---|---|
| 2 | Stacking (diverse) | 25.459647 | 25.952576 | 3.100108 | 6.207174 |
| 3 | Voting (diverse) | 25.502212 | 25.9267 | 3.2612 | 1.046467 |
| 0 | Stacking (all) | 25.553054 | 26.649562 | 3.079056 | 29.515667 |
| 1 | Voting (all) | 25.553578 | 25.92176 | 3.252272 | 4.9455 |


When tuning is enabled, the `start` method optimizes every model with Bayesian optimization, by default 100 evaluations per model, using the Tree-structured Parzen Estimator (TPE) algorithm. The optimization is based on the hyperopt package.

But don't worry! Since optimizing models takes a long time, you don't need to optimize all of them in a single run.  
pycaML saves the optimized parameters to the `experiments/diabetes/params/` folder. When an experiment runs, parameters that already exist are loaded automatically, so the same optimization never runs twice (unless you delete the file).  
The `experiments/diabetes/trials/` folder contains additional information on the optimization runs.
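The caching behaviour described above can be pictured with the following sketch. It is only an illustration of the "tune once, reuse afterwards" pattern: the JSON file format, the one-file-per-model naming, the `alpha` parameter and the function names are assumptions, and plain random search stands in for hyperopt's TPE so that the example has no dependencies.

```python
import json
import random
import tempfile
from pathlib import Path


def random_search(objective, low, high, n_eval, seed=0):
    """Stand-in optimizer: plain random search, NOT TPE."""
    rng = random.Random(seed)
    candidates = [rng.uniform(low, high) for _ in range(n_eval)]
    return {"alpha": min(candidates, key=objective)}


def tune_all(models, params_dir, n_eval=100):
    """Tune each model once; on later runs, reuse the saved parameters."""
    params_dir = Path(params_dir)
    params_dir.mkdir(parents=True, exist_ok=True)
    tuned, reused = {}, []
    for name, objective in models.items():
        path = params_dir / f"{name}.json"  # hypothetical file naming
        if path.exists():
            # Parameters already on disk: load them and skip the optimization.
            tuned[name] = json.loads(path.read_text())
            reused.append(name)
            continue
        tuned[name] = random_search(objective, 0.0, 10.0, n_eval)
        path.write_text(json.dumps(tuned[name]))
    return tuned, reused


# Toy objectives standing in for each model's cross-validated error.
models = {
    "Ridge": lambda alpha: (alpha - 3.0) ** 2,
    "Lasso": lambda alpha: (alpha - 1.0) ** 2,
}

with tempfile.TemporaryDirectory() as tmp:
    first, reused_first = tune_all(models, Path(tmp) / "params")
    second, reused_second = tune_all(models, Path(tmp) / "params")
    print(reused_first, reused_second)
```

Because each model has its own file, an interrupted run loses at most the model that was being tuned; deleting a single file forces only that model to be optimized again.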






# Author 
If you like the package, you can find me on [LinkedIn](https://www.linkedin.com/in/donato-riccio-280084146/).
