## AutoML notebook
Hi! Welcome to the AutoML notebook. This notebook lets you use AutoML in a few steps:
1. Upload a raw dataset
2. Create a subset from this raw dataset to do your analysis on
3. Let AutoML create a good model for your data, based on the provided subset

To use the notebook correctly, run each code block in order. Above each code block there is an explanation of what is happening.

In [1]:
from numpy import argwhere, delete
from pandas import read_csv, read_sql_table, DataFrame
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tpot import TPOTClassifier
import warnings
warnings.filterwarnings('ignore')

### Step 1: Upload dataset
Upload a raw dataset; this has to be a .csv file. Copy the file path into the location variable, writing every backslash in the path twice (as in the example below) so that the file is loaded and no error is thrown. Set the separator of the .csv file in the separator variable. Examples are commented out in the lines below. If loading succeeds, the top of the dataframe is shown.

In [28]:
#separator = ","
location = "C:\\Users\\riooms\\Desktop\\dataset_37_diabetes.csv"
separator = ','
#location =  'D:\\28.5. - RARP - CWZ - ML.csv'
df = read_csv(location, sep = separator)
df.head()

Unnamed: 0,preg,plas,pres,skin,insu,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,tested_positive
1,1,85,66,29,0,26.6,0.351,31,tested_negative
2,8,183,64,0,0,23.3,0.672,32,tested_positive
3,1,89,66,23,94,28.1,0.167,21,tested_negative
4,0,137,40,35,168,43.1,2.288,33,tested_positive
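The path and separator handling described in this step can be tried out without touching a real file. The sketch below is only an illustration: the path `C:\data\example.csv` and the small semicolon-separated text are made up for the example, and the data is read from memory rather than from disk.

```python
from io import StringIO
from pathlib import PureWindowsPath

from pandas import read_csv

# Three ways of writing the same (hypothetical) Windows path in Python.
escaped = "C:\\data\\example.csv"   # every backslash doubled
raw = r"C:\data\example.csv"        # raw string, single backslashes
forward = "C:/data/example.csv"     # forward slashes
same_path = (
    PureWindowsPath(escaped) == PureWindowsPath(raw) == PureWindowsPath(forward)
)

# A made-up semicolon-separated dataset, held in memory for the example.
csv_text = "preg;plas;class\n6;148;tested_positive\n1;85;tested_negative\n"

# With the matching separator the three columns are recognised.
demo_df = read_csv(StringIO(csv_text), sep=";")

# With the wrong separator every row ends up in a single column.
wrong_sep_df = read_csv(StringIO(csv_text), sep=",")

print(same_path, demo_df.shape, wrong_sep_df.shape)
```

If the dataframe shown after loading has only one column, the separator variable most likely does not match the file.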


All variables that are not numeric are label-encoded to numeric values in this code section, so that the AutoML method can read your data. The output of this code block is a list with the names of the variables in your dataset.

In [5]:
# Categorical boolean mask
categorical_feature_mask = df.dtypes==object
categorical_cols = df.columns[categorical_feature_mask].tolist()
le = LabelEncoder()
# apply le on categorical feature columns
if len(categorical_cols) >= 1 :
    df[categorical_cols] = df[categorical_cols].apply(lambda col: le.fit_transform(col))
print(df.columns.values)

['record_id' 'Side' 'epenbni' 'mriece' 'rtsuspect' 'Epstein.new' 'age'
 'psad']
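To see what label encoding does to a column, the following sketch encodes a tiny made-up dataframe and keeps one encoder per column so the mapping can be inspected afterwards. The toy data and the `encoders` dictionary are assumptions for illustration; the notebook cell itself reuses a single encoder and does not store the mapping.

```python
from pandas import DataFrame
from sklearn.preprocessing import LabelEncoder

# Made-up data: one numeric column and one text column.
toy = DataFrame({
    "age": [50, 31, 32],
    "class": ["tested_positive", "tested_negative", "tested_positive"],
})

# Select every column that is not numeric.
categorical = toy.select_dtypes(exclude="number").columns

# Fit a separate encoder per column so each mapping stays available.
encoders = {}
for col in categorical:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# LabelEncoder numbers the sorted labels 0, 1, 2, ...
mapping = {
    str(label): code for code, label in enumerate(encoders["class"].classes_)
}
print(toy["class"].tolist(), mapping)
```

Keeping the encoders around makes it possible to translate predictions back to the original labels with `inverse_transform`.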


### Step 2: Create subsets

Copy the column names of the predictor variables into the lists below, replacing the values that are there. Copy the name of the target variable into the target assignment. Do not add the target variable to the subset variables; this will cause an error.
Run the code blocks to get the subsets ready.

In [20]:
#subsetvariables = ['column1' ,'column2', 'column3', 'etc...']
subset1variables = ['mriece', 'rtsuspect' ,'Epstein.new' ,'age', 'psad']
subset2variables = ['mriece', 'rtsuspect' ,'Epstein.new' ,'age', 'psad']

#target = 'class'
target = 'epenbni'

df.rename(columns={target: 'target'}, inplace=True)
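A small check before training can catch the two mistakes this step warns about: a misspelled column name and a target that slipped into the predictor list. The helper `check_subset` below is hypothetical (it is not part of the notebook) and assumes the target column has already been renamed to `target`.

```python
from pandas import DataFrame


def check_subset(df, subset, target="target"):
    """Return (predictors, target column) after validating the subset list."""
    missing = [col for col in subset if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in the dataset: {missing}")
    if target in subset:
        raise ValueError("The target variable must not be one of the predictors.")
    return df[subset], df[target]


# Made-up data for the example.
toy = DataFrame({"age": [50, 31], "psad": [0.1, 0.2], "target": [1, 0]})
predictors, labels = check_subset(toy, ["age", "psad"])
print(list(predictors.columns), labels.tolist())
```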



### Step 3: Create models
Run the code below to set the configuration for the two automatically created models: the first is restricted to logistic regression, the second may use any model that AutoML considers. The estimated waiting time is 15 minutes for both models to complete (at most 7.5 minutes per model).

In [24]:
tpot_config = {
    'sklearn.linear_model.LogisticRegression': {
        'penalty': ["l1", "l2"],
        'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.],
        'dual': [True, False]
    }
}

tpot1 = TPOTClassifier(generations=10, population_size=50, verbosity=2, 
                      max_time_mins=7.5, scoring = 'roc_auc',
                      config_dict=tpot_config)
tpot2 = TPOTClassifier(generations=10, population_size=50, verbosity=2, scoring = 'roc_auc',
                      max_time_mins=7.5)
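To get a feel for how many hyperparameter settings the logistic regression configuration above describes, the combinations can simply be counted. This is a rough sketch in plain Python; it does not call TPOT, and the helper `count_combinations` is made up for the example.

```python
from itertools import product

# Same configuration dictionary as in the cell above.
tpot_config = {
    'sklearn.linear_model.LogisticRegression': {
        'penalty': ["l1", "l2"],
        'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.],
        'dual': [True, False]
    }
}


def count_combinations(config):
    """Count the hyperparameter combinations listed for each operator."""
    return {
        operator: sum(1 for _ in product(*grid.values()))
        for operator, grid in config.items()
    }


combination_counts = count_combinations(tpot_config)
print(combination_counts)
```

The count covers a single operator only. The pipeline printed in the results nests several logistic regression steps, so the number of candidate pipelines that can be explored is larger than this figure.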


Run this cell to create both models

The AutoML method detects whether there are missing values in your dataset and replaces them with the median value of the column.
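Median replacement of missing values can be illustrated with scikit-learn's `SimpleImputer`. This is a stand-in sketch on made-up data, assuming the imputation behaves like a median strategy applied per column; it is not the code that the AutoML method runs internally.

```python
import numpy as np
from pandas import DataFrame
from sklearn.impute import SimpleImputer

# Made-up data with one missing value in each column.
toy = DataFrame({
    "age": [50.0, np.nan, 32.0, 21.0],
    "psad": [0.1, 0.2, np.nan, 0.4],
})

# Replace each missing value with the median of its own column.
imputer = SimpleImputer(strategy="median")
filled = DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Here the missing age becomes 32.0 (the median of 50, 32 and 21) and the missing psad becomes 0.2 (the median of 0.1, 0.2 and 0.4).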

In [25]:
# select the predictor columns for each subset
# (the target variable was renamed to 'target' in Step 2)

predictorss1 = df[subset1variables]
predictorss2 = df[subset2variables]


# set up the training and validation sets for model 1
X_train1, X_test1, Y_train1, Y_test1 = train_test_split(predictorss1, df.target, train_size = 0.75, test_size = 0.25)

# train model 1 and score it on the held-out data
tpot1.fit(X_train1, Y_train1)
score1 = tpot1.score(X_test1, Y_test1)
model1 = tpot1._optimized_pipeline

# set up the training and validation sets for model 2
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(predictorss2, df.target, train_size = 0.75, test_size = 0.25)

# train model 2 and score it on the held-out data
tpot2.fit(X_train2, Y_train2)
score2 = tpot2.score(X_test2, Y_test2)
model2 = tpot2._optimized_pipeline




Generation 1 - Current best internal CV score: 0.5363209761082102
Generation 2 - Current best internal CV score: 0.5363209761082102
Generation 3 - Current best internal CV score: 0.543686557729111
Generation 4 - Current best internal CV score: 0.5480319365425748
Generation 5 - Current best internal CV score: 0.5480319365425748
Generation 6 - Current best internal CV score: 0.5781806215848769
Generation 7 - Current best internal CV score: 0.5781806215848769
Generation 8 - Current best internal CV score: 0.5781806215848769
Generation 9 - Current best internal CV score: 0.5929155950432546
Generation 10 - Current best internal CV score: 0.5929155950432546
Generation 11 - Current best internal CV score: 0.6600345517366794
Generation 12 - Current best internal CV score: 0.6726373626373626
Generation 13 - Current best internal CV score: 0.6726373626373626
Generation 14 - Current best internal CV score: 0.6726373626373626
Generation 15 - Current best internal CV score: 0.6726373626373626
Gener


Generation 1 - Current best internal CV score: 0.7814806818770398
Generation 2 - Current best internal CV score: 0.7854627485680477
Generation 3 - Current best internal CV score: 0.7854627485680477
Generation 4 - Current best internal CV score: 0.7854627485680477
Generation 5 - Current best internal CV score: 0.7854627485680477
Generation 6 - Current best internal CV score: 0.7859479998846732
Generation 7 - Current best internal CV score: 0.7896552833810332
Generation 8 - Current best internal CV score: 0.7920679278360747
Generation 9 - Current best internal CV score: 0.7920679278360747
Generation 10 - Current best internal CV score: 0.7920679278360747

7.50715475 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: ExtraTreesClassifier(BernoulliNB(Normalizer(input_matrix, norm=l1), alpha=1.0, fit_prior=True), bootstrap=True, criterion=gini, max_features=0.5, min_sampl
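The 75/25 split used in the training cell is random, so scores can differ from run to run. The sketch below shows one possible variation on made-up data: fixing `random_state` makes the split repeatable, and `stratify` keeps the class proportions equal in both parts. Both arguments are additions for illustration and are not used in the notebook cell.

```python
from pandas import DataFrame
from sklearn.model_selection import train_test_split

# Made-up dataset: 80 rows, of which one in four belongs to class 1.
toy = DataFrame({
    "age": range(80),
    "psad": [i / 10 for i in range(80)],
    "target": [0, 0, 0, 1] * 20,
})
X = toy[["age", "psad"]]
y = toy["target"]

# Same proportions as in the notebook, plus a fixed seed and stratification.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=42, stratify=y
)

# Repeating the call with the same seed yields the same rows.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=42, stratify=y
)

print(len(X_train), len(X_test), int(y_train.sum()), int(y_test.sum()))
```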

### Results
The next cells can be used to get your model output. Only run these cells when the optimization progress bar is filled!

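Run this cell to receive the output of your first model. Because the models were created with scoring = 'roc_auc', the reported score is the ROC AUC on the validation set rather than the classification accuracy.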

In [26]:
"Model accuracy model 1:  " +str(score1) + '.  Model used (1): ' + str(model1)

'Model accuracy model 1:  0.678550135501355.  Model used (1): LogisticRegression(LogisticRegression(LogisticRegression(LogisticRegression(LogisticRegression(input_matrix, LogisticRegression__C=5.0, LogisticRegression__dual=False, LogisticRegression__penalty=l2), LogisticRegression__C=20.0, LogisticRegression__dual=True, LogisticRegression__penalty=l2), LogisticRegression__C=1.0, LogisticRegression__dual=False, LogisticRegression__penalty=l1), LogisticRegression__C=25.0, LogisticRegression__dual=False, LogisticRegression__penalty=l1), LogisticRegression__C=20.0, LogisticRegression__dual=False, LogisticRegression__penalty=l1)'

Run this cell to receive the output of your second model

In [27]:
"Model accuracy model 2: " +str(score2) + '.  Model used (2): ' + str(model2)

'Model accuracy model 2: 0.7949365979080231.  Model used (2): ExtraTreesClassifier(BernoulliNB(Normalizer(input_matrix, Normalizer__norm=l1), BernoulliNB__alpha=1.0, BernoulliNB__fit_prior=True), ExtraTreesClassifier__bootstrap=True, ExtraTreesClassifier__criterion=gini, ExtraTreesClassifier__max_features=0.5, ExtraTreesClassifier__min_samples_leaf=11, ExtraTreesClassifier__min_samples_split=9, ExtraTreesClassifier__n_estimators=100)'
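The scores above come from models created with scoring = 'roc_auc', so they measure how well positives are ranked above negatives rather than the fraction of correct predictions. The made-up example below shows that the two measures can differ on the same predictions; the labels, scores and the 0.5 threshold are assumptions for illustration only.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up true labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 0, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.35]

# Accuracy needs hard predictions; here a 0.5 threshold is assumed.
y_pred = [int(score >= 0.5) for score in y_score]
accuracy = accuracy_score(y_true, y_pred)

# ROC AUC works on the scores directly and ignores the threshold.
auc = roc_auc_score(y_true, y_score)

print(accuracy, auc)
```

In this example every case is predicted negative, which gives an accuracy of 0.8, while the single positive case outranks three of the four negatives, which gives a ROC AUC of 0.75.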