![image](logo.png)
### Python Comparative Analysis for Machine Learning


# Usage example
pycaML lets you compare models with tuned and default parameters. It also supports stacking and voting ensembles.

Every object in pycaML is an Experiment. There are two types of Experiments: `RegressionExperiment` and `ClassificationExperiment`. If you start a `ClassificationExperiment`, pycaML automatically detects whether the task is binary or multi-label classification.

To begin, let's create a `RegressionExperiment` instance using the diabetes dataset.

In [1]:
from pycaML import RegressionExperiment
diabetes = RegressionExperiment(name = 'diabetes')
diabetes.load_data('diabetes.csv', target = 'Glucose')

Directory experiments/diabetes/params created!
Directory experiments/diabetes/trials created!
Directory experiments/diabetes/tables created!
Directory experiments/diabetes/data created!
X_train, y_train, X_test, y_test saved


Creating the `diabetes` object automatically creates the folders the experiment needs.
`load_data` takes the dataset as input, splits it into four files (`X_train`, `y_train`, `X_test`, `y_test`), and saves them to the `experiments/diabetes/data` folder.
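
For intuition, the split that `load_data` performs can be pictured with a minimal standard-library sketch. The 80/20 ratio, the shuffling seed, the output file names, and the helper name `split_and_save` are all assumptions made for illustration; pycaML's actual implementation may differ.

```python
import csv
import random
from pathlib import Path


def split_and_save(csv_path, target, out_dir, test_size=0.2, seed=0):
    """Split a CSV into train/test features and targets, then save four files.

    Hypothetical helper: the ratio, seed, and file names are assumptions.
    """
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Shuffle reproducibly, then hold out the first `n_test` rows as test data.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_size)
    parts = {"train": rows[n_test:], "test": rows[:n_test]}

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    features = [c for c in rows[0] if c != target]

    for name, part in parts.items():
        # Features go to X_<name>.csv, the target column to y_<name>.csv.
        with open(out_dir / f"X_{name}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=features)
            writer.writeheader()
            writer.writerows({c: r[c] for c in features} for r in part)
        with open(out_dir / f"y_{name}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=[target])
            writer.writeheader()
            writer.writerows({target: r[target]} for r in part)

    # Report how many rows ended up in each split.
    return {name: len(part) for name, part in parts.items()}
```

Calling `split_and_save('diabetes.csv', 'Glucose', 'experiments/diabetes/data')` would then produce four CSV files, one per split and role.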


The `start` method runs the experiment and saves the result in the `experiments/diabetes/data` folder. If the result file already exists, it is loaded and the experiment is not run again.

In [10]:
diabetes.start() 
diabetes.result

| Unnamed: 0 | Model | RMSE_test | RMSE_cv | RMSE_std | Time |
|---|---|---|---|---|---|
| 10 | EBM | 25.559451 | 26.538906 | 2.928745 | 1.310682 |
| 2 | Random Forest | 26.527714 | 27.335807 | 3.334061 | 0.198201 |
| 15 | Least Angle Regression | 26.545738 | 26.156182 | 2.910194 | 0.001802 |
| 13 | Ridge | 26.550075 | 26.155185 | 2.9153 | 0.0008 |
| 19 | Bayesian Ridge | 26.603792 | 26.177551 | 2.961254 | 0.001001 |
| 12 | Lasso | 26.648249 | 26.215473 | 3.020647 | 0.000601 |
| 21 | TheilSen | 27.144326 | 26.666264 | 3.208906 | 0.5574 |
| 16 | Orthogonal Matching Pursuit | 27.353908 | 28.552806 | 2.754056 | 0.001002 |
| 0 | Bagging | 27.691749 | 28.809562 | 3.732251 | 0.021 |
| 7 | ExtraTrees | 27.73511 | 27.292827 | 2.948893 | 0.107601 |


By default, 5-fold cross-validation is performed. `RMSE_cv` is the mean score across the validation folds, and `RMSE_std` is the standard deviation of that score across the folds.
You can change the number of folds with the `cv` parameter. For example, with 10-fold cross-validation:

In [3]:
diabetes.start(cv = 10) 
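
To make the `RMSE_cv` and `RMSE_std` columns concrete, here is a minimal sketch of how a k-fold mean and standard deviation of RMSE can be computed. It uses a trivial mean-predicting baseline, contiguous folds, and the population standard deviation. These choices and the function name `kfold_rmse` are assumptions for illustration, not pycaML's implementation.

```python
import math
import statistics


def kfold_rmse(y, cv=5):
    """Return (mean, std) of the RMSE of a mean-predictor over `cv` folds.

    `y` is a list of numeric targets. Hypothetical helper for illustration.
    """
    n = len(y)
    scores = []
    for k in range(cv):
        # Contiguous fold boundaries; fold k is the validation set.
        start, stop = k * n // cv, (k + 1) * n // cv
        valid = y[start:stop]
        train = y[:start] + y[stop:]

        # "Fit" the baseline: predict the training mean for every sample.
        prediction = statistics.fmean(train)
        mse = statistics.fmean((v - prediction) ** 2 for v in valid)
        scores.append(math.sqrt(mse))

    # Mean over folds (like RMSE_cv) and spread across folds (like RMSE_std).
    return statistics.fmean(scores), statistics.pstdev(scores)
```

A real model would replace the mean baseline, but the aggregation over folds stays the same: one score per fold, then a mean and a standard deviation.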

# Stacking models

The above experiment trained all the available models with default parameters.
To perform model stacking, let's create another experiment. There is no need to run `load_data` again if the data has already been saved to the data folder.


In [9]:
diabetes_stacking = RegressionExperiment(name = 'diabetes', stacking = True)
diabetes_stacking.start()
diabetes_stacking.result

Result loaded
Data loaded


| Unnamed: 0 | Model | RMSE_test | RMSE_cv | RMSE_std | Time |
|---|---|---|---|---|---|
| 0 | Voting (all) | 25.959632 | 25.956527 | 3.301492 | 8.59525 |
| 1 | Stacking (diverse) | 25.996212 | 26.696336 | 2.807487 | 7.218289 |
| 2 | Voting (diverse) | 26.310165 | 27.01654 | 3.5386 | 1.429399 |
| 3 | Stacking (all) | 27.066082 | 31.819227 | 3.693124 | 48.291322 |


`Stacking (all)` uses every other model as a level-0 estimator.
`Stacking (diverse)` and `Voting (diverse)` use only the best model from each of three groups: boosting models, parallel ensembles (RF, ET), and simple models.
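
The "diverse" selection can be sketched as picking the lowest-error model from each group and then averaging their predictions, which is how a voting ensemble combines regressors. The group labels, the use of `RMSE_cv` as the selection criterion, and the function names below are illustrative assumptions; only the scores are copied from the default-parameter results earlier in this README.

```python
from statistics import fmean

# RMSE_cv values copied from the default-parameter results above.
# The group labels are assumptions made for this sketch.
results = [
    ("EBM", "boosting", 26.538906),
    ("Random Forest", "parallel", 27.335807),
    ("ExtraTrees", "parallel", 27.292827),
    ("Ridge", "simple", 26.155185),
    ("Lasso", "simple", 26.215473),
]


def pick_diverse(results):
    """Keep the model with the lowest CV error in each group."""
    best = {}
    for model, group, rmse_cv in results:
        if group not in best or rmse_cv < best[group][1]:
            best[group] = (model, rmse_cv)
    return {group: model for group, (model, _) in best.items()}


def vote(predictions):
    """Average per-sample predictions from several regressors."""
    return [fmean(sample) for sample in zip(*predictions)]


print(pick_diverse(results))
print(vote([[1.0, 2.0], [3.0, 4.0]]))
```

Stacking differs from voting in the last step: instead of a plain average, a meta-model is trained on the level-0 predictions.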

# Hyperparameter tuning
Hyperparameter tuning is as simple as creating another experiment with the flag `tuning = True`. You can also stack tuned models by combining `tuning = True` with `stacking = True`.

In [7]:
diabetes_tuned = RegressionExperiment(name = 'diabetes', tuning = True)
# n_eval is an optional parameter (default = 100)
diabetes_tuned.start(n_eval = 100)
diabetes_tuned.result

Result loaded
Data loaded


| Unnamed: 0 | Model | RMSE_test | RMSE_cv | RMSE_std | Time |
|---|---|---|---|---|---|
| 0 | Gradient Boost | 25.391383 | 26.304448 | 3.162734 | 0.036801 |
| 1 | CatBoost | 25.699729 | 26.004154 | 2.967139 | 0.390401 |
| 2 | XGBoost | 25.729658 | 26.294288 | 3.030478 | 0.971916 |
| 3 | Random Forest | 25.737857 | 26.269705 | 3.185983 | 0.109799 |
| 4 | EBM | 25.850996 | 26.2667 | 3.019551 | 2.336289 |
| 5 | ExtraTrees | 25.932177 | 26.308251 | 3.110662 | 0.1314 |
| 6 | AdaBoost | 26.138107 | 26.551887 | 3.302247 | 0.060601 |
| 7 | Huber | 26.446251 | 26.183322 | 3.107596 | 0.049999 |
| 8 | Bayesian Ridge | 26.483656 | 26.154648 | 2.93332 | 0.001399 |
| 9 | Lasso | 26.544964 | 26.156317 | 2.910634 | 0.000401 |


In [8]:
diabetes_stacking_tuned = RegressionExperiment(name = 'diabetes', tuning = True, stacking = True)
diabetes_stacking_tuned.start()
diabetes_stacking_tuned.result

Result loaded
Data loaded


| Unnamed: 0 | Model | RMSE_test | RMSE_cv | RMSE_std | Time |
|---|---|---|---|---|---|
| 0 | Voting (diverse) | 25.501409 | 25.962292 | 3.271162 | 2.73374 |
| 1 | Stacking (diverse) | 25.948605 | 26.168364 | 3.11427 | 14.864656 |
| 2 | Voting (all) | 26.771776 | 31.761112 | 10.639915 | 6.386552 |
| 3 | Stacking (all) | 626.631348 | 172.005079 | 60.60815 | 36.860789 |


When tuning is enabled, the `start` method optimizes every model with 100 runs (the default `n_eval`) of Bayesian optimization using the TPE (Tree-structured Parzen Estimator) algorithm, as implemented in the hyperopt package.

But don't worry! Since optimizing models takes a lot of time, you don't need to optimize every model in a single run.
pycaML saves the optimized parameters to the `experiments/diabetes/params/` folder. When you run an experiment, the parameters are loaded automatically if they already exist, so the same optimization never runs twice (unless you delete the file).

The `experiments/diabetes/trials/` folder contains additional information on the optimization runs.
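
The load-or-optimize behavior described above follows a common caching pattern, sketched here with JSON files. The one-file-per-model layout, the `.json` format, and the names `load_or_optimize` and `optimize` are hypothetical; this README does not document pycaML's on-disk format.

```python
import json
from pathlib import Path


def load_or_optimize(model_name, optimize, params_dir="experiments/diabetes/params"):
    """Return cached parameters if present, otherwise run `optimize` and cache them.

    Hypothetical helper: `optimize` is any callable returning a dict of parameters.
    """
    path = Path(params_dir) / f"{model_name}.json"
    if path.exists():
        # A previous run already tuned this model: skip the expensive search.
        return json.loads(path.read_text())

    params = optimize()  # e.g. a hyperparameter search returning a dict
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(params))
    return params
```

With this pattern, deleting a model's parameter file is enough to force a fresh optimization on the next run.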






# Author 
If you like the package, you can find me on [LinkedIn](https://www.linkedin.com/in/donato-riccio-280084146/).
