![image](logo.png)
### Python Comparative Analysis for Machine Learning


# Usage example
pycaML lets you compare models trained with default and tuned hyperparameters. It also supports stacking and voting ensembles.

Every object in pycaML is an Experiment. There are three types of Experiments: `RegressionExperiment`, `BinaryClassificationExperiment` and `MultiClassExperiment`. 

To begin, let's create a `RegressionExperiment` instance using the diabetes dataset.

In [1]:
from pycaML import RegressionExperiment
diabetes = RegressionExperiment(name = 'diabetes')
diabetes.load_data('diabetes.csv', target = 'Glucose')

Data loaded
Data loaded


Creating the `diabetes` object automatically creates the folders the experiment needs.
`load_data` takes the dataset as input, splits it into four files, and saves them to the `/experiments/diabetes/data` folder.
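
As a rough illustration of what such a four-way split can look like, the sketch below holds out part of the rows as a test set and writes features and target to separate CSV files. The `split_to_files` helper, the file names (`X_train.csv` and so on), and the 80/20 ratio are assumptions made for this example; pycaML's actual file names and split ratio may differ.

```python
import csv
import random
import tempfile
from pathlib import Path


def split_to_files(rows, target, out_dir, test_size=0.2, seed=0):
    """Hypothetical helper: write X_train, X_test, y_train and y_test CSV files."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # shuffle a copy, reproducibly
    n_test = int(len(rows) * test_size)
    parts = {"test": rows[:n_test], "train": rows[n_test:]}

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    features = [col for col in rows[0] if col != target]

    paths = []
    for split, subset in parts.items():
        for prefix, columns in (("X", features), ("y", [target])):
            path = out_dir / f"{prefix}_{split}.csv"
            with path.open("w", newline="") as handle:
                # extrasaction="ignore" drops the columns not listed in fieldnames
                writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
                writer.writeheader()
                writer.writerows(subset)
            paths.append(path)
    return paths


# Toy rows standing in for diabetes.csv; the values are made up.
toy_rows = [{"BMI": 20 + i, "Age": 30 + i, "Glucose": 90 + 2 * i} for i in range(10)]
data_dir = Path(tempfile.mkdtemp()) / "experiments" / "diabetes" / "data"
written = split_to_files(toy_rows, target="Glucose", out_dir=data_dir)
print(sorted(path.name for path in written))
```

Keeping the target in its own files means the feature files can be passed to a model as they are.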


The `start` method runs the experiment and saves the result in the `/experiments/diabetes/data` folder. If the result file already exists, it is loaded and the experiment is not run again.
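
The load-if-present behaviour can be pictured with a small caching helper. This is only a sketch of the general pattern: `load_or_run`, the JSON format, and the `result.json` file name are assumptions for illustration and are not taken from pycaML's source.

```python
import json
import tempfile
from pathlib import Path


def load_or_run(result_path, run):
    """Return the cached result if the file exists, otherwise run and save it."""
    result_path = Path(result_path)
    if result_path.exists():
        print("Result loaded")
        return json.loads(result_path.read_text())

    result = run()  # the expensive step, only reached when no file exists
    result_path.parent.mkdir(parents=True, exist_ok=True)
    result_path.write_text(json.dumps(result))
    return result


calls = []


def train_models():
    # Stand-in for the real training loop; records that it was executed.
    calls.append("run")
    return {"Linear Regression": 26.5, "KNN": 28.9}


path = Path(tempfile.mkdtemp()) / "result.json"
first = load_or_run(path, train_models)   # trains and writes the file
second = load_or_run(path, train_models)  # reads the file, no training
print(first == second, len(calls))
```

With this pattern, forcing a fresh run amounts to deleting the saved file.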

In [2]:
diabetes.start() 
diabetes.result

10:49:25: Training Linear Regression 11/12          

|    | Model | RMSE_test | RMSE_cv | rmse_Std | Time |
|----|-------|-----------|---------|----------|------|
| 9 | EBM | 25.559451 | 26.538906 | 2.928745 | 1.302912 |
| 1 | Random Forest | 26.527714 | 27.335807 | 3.334061 | 0.1516 |
| 10 | Linear Regression | 26.545738 | 26.156182 | 2.910194 | 0.002003 |
| 6 | ExtraTrees | 27.73511 | 27.292827 | 2.948893 | 0.1052 |
| 5 | CatBoost | 28.465623 | 27.322818 | 3.300429 | 1.255016 |
| 7 | KNN | 28.888397 | 29.704169 | 3.227931 | 0.001999 |
| 4 | Gradient Boost | 29.001088 | 27.448801 | 3.289781 | 0.137202 |
| 3 | LightGBM | 29.287183 | 27.550593 | 3.601632 | 1.814198 |
| 2 | XGBoost | 29.375231 | 29.392813 | 2.936202 | 4.303913 |
| 8 | AdaBoost | 29.375362 | 28.610432 | 2.796502 | 0.051399 |


By default, 5-fold cross-validation is performed. `RMSE_cv` is the mean score over the validation folds, and `rmse_Std` is the standard deviation of the score across those folds.
You can change the number of folds with the `cv` parameter. Example with 10-fold cross-validation:

In [None]:
diabetes.start(cv = 10) 
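
To make the two cross-validation columns concrete, here is a dependency-free sketch that scores a trivial mean-predicting baseline on each fold and then reports the mean and standard deviation of the fold RMSEs. The helper names, the contiguous fold layout, and the use of the population standard deviation are assumptions for illustration; pycaML may split and aggregate differently.

```python
import math
import statistics


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cross_validate(y, k=5):
    """Return (mean, std) of the per-fold RMSE of a mean-predicting baseline."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k):
        prediction = statistics.fmean(y[i] for i in train_idx)  # "fit" on the training folds
        y_val = [y[i] for i in val_idx]
        scores.append(rmse(y_val, [prediction] * len(y_val)))
    # Mean plays the role of RMSE_cv, the spread plays the role of rmse_Std.
    return statistics.fmean(scores), statistics.pstdev(scores)


toy_target = [float(v) for v in range(20)]  # made-up target values
print(cross_validate(toy_target, k=5))
print(cross_validate(toy_target, k=10))
```

A large spread relative to the mean suggests that the score depends heavily on which rows end up in the validation fold.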

# Stacking models

The experiment above trained all available models with default parameters.
To perform model stacking, create another experiment. There is no need to call `load_data` again if the data has already been saved to the data folder.


In [3]:
diabetes_stacking = RegressionExperiment(name = 'diabetes', stacking = True)
diabetes_stacking.start()
diabetes_stacking.result

Result loaded
Data loaded
10:51:24: Training Voting (diverse) 4/5            

|    | Model | RMSE_test | RMSE_cv | rmse_Std | Time |
|----|-------|-----------|---------|----------|------|
| 2 | Stacking (diverse) | 25.810042 | 26.546858 | 2.822744 | 7.049409 |
| 3 | Voting (diverse) | 26.053284 | 26.864156 | 3.488867 | 1.406394 |
| 0 | Stacking (all) | 26.52166 | 26.988583 | 2.827267 | 61.703028 |
| 1 | Voting (all) | 26.587388 | 26.503379 | 3.350272 | 8.767302 |


`Stacking (all)` uses every other model as a level-0 estimator.
The `diverse` variants of stacking and voting use only the best boosting model, the best parallel ensemble model (Random Forest or ExtraTrees), and the best of the simple models.
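
The "diverse" selection can be sketched as picking the lowest-error model from each family. The scores below are the `RMSE_cv` values from the default-parameter results above. The grouping of models into families, the use of `RMSE_cv` as the selection criterion, and the `pick_diverse` helper are assumptions made for this example. EBM is left out because the text does not say which family it belongs to.

```python
# RMSE_cv values copied from the default-parameter results above.
rmse_cv = {
    "Random Forest": 27.335807,
    "Linear Regression": 26.156182,
    "ExtraTrees": 27.292827,
    "CatBoost": 27.322818,
    "KNN": 29.704169,
    "Gradient Boost": 27.448801,
    "LightGBM": 27.550593,
    "XGBoost": 29.392813,
    "AdaBoost": 28.610432,
}

# Assumed grouping; pycaML's internal family assignment may differ.
families = {
    "boosting": ["CatBoost", "Gradient Boost", "LightGBM", "XGBoost", "AdaBoost"],
    "parallel": ["Random Forest", "ExtraTrees"],
    "simple": ["Linear Regression", "KNN"],
}


def pick_diverse(scores, groups):
    """Return the lowest-RMSE model of each family."""
    return {family: min(members, key=scores.__getitem__) for family, members in groups.items()}


diverse = pick_diverse(rmse_cv, families)
print(diverse)
# {'boosting': 'CatBoost', 'parallel': 'ExtraTrees', 'simple': 'Linear Regression'}
```

Using one model per family keeps the ensemble small, which is consistent with the much shorter training times reported for the diverse variants.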

# Hyperparameter tuning
Hyperparameter tuning is as simple as creating another experiment with `tuning = True`. You can also stack tuned models by setting both `tuning = True` and `stacking = True`.

In [4]:
diabetes_tuned = RegressionExperiment(name = 'diabetes', tuning = True)
# n_eval is an optional parameter (default: 100)
diabetes_tuned.start(n_eval = 100)
diabetes_tuned.result

|    | Model | RMSE_test | RMSE_cv | rmse_Std | Time |
|----|-------|-----------|---------|----------|------|
| 5 | CatBoost | 25.531721 | 26.00538 | 3.022123 | 1.1228 |
| 2 | XGBoost | 25.583461 | 26.195805 | 3.226319 | 1.798199 |
| 3 | LightGBM | 25.692363 | 26.334937 | 3.189525 | 0.505999 |
| 9 | EBM | 25.734647 | 26.185927 | 3.122633 | 2.045199 |
| 1 | Random Forest | 25.746069 | 26.264782 | 3.184428 | 0.100199 |
| 6 | ExtraTrees | 25.929733 | 26.308307 | 3.125071 | 0.1454 |
| 4 | Gradient Boost | 26.062517 | 26.235544 | 3.346406 | 0.093402 |
| 8 | AdaBoost | 26.597059 | 26.522331 | 3.318685 | 0.079001 |
| 10 | Linear Regression | 26.649405 | 26.216172 | 3.020942 | 0.0006 |
| 0 | Decision Tree | 26.679701 | 27.293106 | 3.269204 | 0.001 |


In [5]:
diabetes_stacking_tuned = RegressionExperiment(name = 'diabetes', tuning = True, stacking = True)
diabetes_stacking_tuned.start()
diabetes_stacking_tuned.result

|    | Model | RMSE_test | RMSE_cv | rmse_Std | Time |
|----|-------|-----------|---------|----------|------|
| 2 | Stacking (diverse) | 25.459647 | 25.952576 | 3.100108 | 6.207174 |
| 3 | Voting (diverse) | 25.502212 | 25.9267 | 3.2612 | 1.046467 |
| 0 | Stacking (all) | 25.553054 | 26.649562 | 3.079056 | 29.515667 |
| 1 | Voting (all) | 25.553578 | 25.92176 | 3.252272 | 4.9455 |


With `tuning = True`, the `start` method optimizes each model with 100 evaluations of Bayesian optimization using the TPE (Tree-structured Parzen Estimator) algorithm, as implemented in the hyperopt package.

But don't worry! Since optimizing models takes a lot of time, you don't need to optimize every model in a single run.  
pycaML saves the optimized parameters to the `experiments/diabetes/params/` folder. When you run an experiment, the parameters are loaded automatically if they already exist, so the same optimization never runs twice (unless you delete the file).  
The `experiments/diabetes/trials/` folder contains additional information on the optimization runs.



# Version 
pycaML is in early development, and many more features are planned.

# Author 
If you like the package, you can find me on [LinkedIn](https://www.linkedin.com/in/donato-riccio-280084146/).
