## Tutorial based on https://github.com/alteryx/evalml

### TODO methodology ideas:
* algorithmic bias

### Workflow
* semantic commits
* git flow

### Documentation
* add the `problem_type` argument to the documentation
* update the Woodwork documentation so its examples no longer trigger the dreaded value slice error: https://woodwork.alteryx.com/en/stable/guides/statistical_insights.html

In [51]:
import evalml
from evalml.automl import AutoMLSearch

import woodwork as ww

In [28]:
ww.config.set_option('numeric_categorical_threshold', 2)

In [29]:
ww.config

Woodwork Global Config Settings
-------------------------------
natural_language_threshold: 10
numeric_categorical_threshold: 2

In [33]:
ww.list_logical_types()

,name,type_string,description,physical_type,standard_tags,is_default_type,is_registered,parent_type
0,Boolean,boolean,Represents Logical Types that contain binary v...,boolean,{},True,True,
1,Categorical,categorical,Represents Logical Types that contain unordere...,category,{category},True,True,
2,CountryCode,country_code,Represents Logical Types that contain categori...,category,{category},True,True,Categorical
3,Datetime,datetime,Represents Logical Types that contain date and...,datetime64[ns],{},True,True,
4,Double,double,Represents Logical Types that contain positive...,float64,{numeric},True,True,
5,EmailAddress,email_address,Represents Logical Types that contain email ad...,string,{},True,True,NaturalLanguage
6,Filepath,filepath,Represents Logical Types that specify location...,string,{},True,True,NaturalLanguage
7,FullName,full_name,Represents Logical Types that may contain firs...,string,{},True,True,NaturalLanguage
8,IPAddress,ip_address,Represents Logical Types that contain IP addre...,string,{},True,True,NaturalLanguage
9,Integer,integer,Represents Logical Types that contain positive...,Int64,{numeric},True,True,


In [34]:
ww.list_semantic_tags()

,name,is_standard_tag,valid_logical_types
0,category,True,"[Categorical, CountryCode, Ordinal, SubRegionC..."
1,numeric,True,"[Double, Integer]"
2,index,False,"[Integer, Double, Categorical, Datetime]"
3,time_index,False,[Datetime]
4,date_of_birth,False,[Datetime]


In [65]:
X, y = evalml.demos.load_breast_cancer()

In [66]:
# X is a Woodwork DataTable; describe() reports types and summary statistics per column
X.describe()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
physical_type,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,...,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64
logical_type,Double,Double,Double,Double,Double,Double,Double,Double,Double,Double,...,Double,Double,Double,Double,Double,Double,Double,Double,Double,Double
semantic_tags,{numeric},{numeric},{numeric},{numeric},{numeric},{numeric},{numeric},{numeric},{numeric},{numeric},...,{numeric},{numeric},{numeric},{numeric},{numeric},{numeric},{numeric},{numeric},{numeric},{numeric}
count,569,569,569,569,569,569,569,569,569,569,...,569,569,569,569,569,569,569,569,569,569
nunique,456,479,522,539,474,537,537,542,432,499,...,457,511,514,544,411,529,539,492,500,535
nan_count,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
mean,14.1273,19.2896,91.969,654.889,0.0963603,0.104341,0.0887993,0.0489191,0.181162,0.0627976,...,16.2692,25.6772,107.261,880.583,0.132369,0.254265,0.272188,0.114606,0.290076,0.0839458
mode,12.34,14.93,82.61,512.2,0.1007,0.1147,0,0,0.1601,0.05667,...,12.36,17.7,101.7,284.4,0.1216,0.1486,0,0,0.2226,0.07427
std,3.52405,4.30104,24.299,351.914,0.0140641,0.0528128,0.0797198,0.0388028,0.0274143,0.00706036,...,4.83324,6.14626,33.6025,569.357,0.0228324,0.157336,0.208624,0.0657323,0.0618675,0.0180613
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0,0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0,0,0.1565,0.05504


In [67]:
X.mutual_information()

,column_1,column_2,mutual_info
0,mean radius,mean area,0.919339
1,worst radius,worst area,0.847250
2,mean radius,mean perimeter,0.797975
3,mean perimeter,mean area,0.790030
4,worst radius,worst perimeter,0.737651
...,...,...,...
430,smoothness error,worst texture,0.028677
431,symmetry error,worst texture,0.028277
432,mean smoothness,worst texture,0.027734
433,smoothness error,worst symmetry,0.026979
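
As a rough illustration of what a mutual information score measures, the sketch below computes plain mutual information (in nats) between two discrete sequences using only the standard library. It is not Woodwork's implementation: the scores in the table above all lie between 0 and 1, which suggests a normalization (and some handling of continuous columns) that this sketch does not attempt. The function name and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def discrete_mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)

    total = 0.0
    for (x, y), count in joint_counts.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
        total += p_xy * log(count * n / (x_counts[x] * y_counts[y]))
    return total


# Hypothetical toy columns
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]

mi_same = discrete_mutual_information(a, a)         # a column fully determines itself
mi_independent = discrete_mutual_information(a, b)  # b carries no information about a
```

A column paired with itself yields its own entropy, while two independent columns yield zero, which matches the intuition behind the ranking above: pairs such as `mean radius` and `mean area` score high because one largely determines the other.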


In [68]:
y

<DataColumn: None (Physical Type = category) (Logical Type = Categorical) (Semantic Tags = {'category'})>
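
The target here is categorical, and the search log further down reports its classes as `benign` and `malignant` rather than `[0, 1]`. If a numeric 0/1 target is wanted, one option is to map the labels explicitly before splitting. The sketch below shows only the mapping, on a hypothetical toy list rather than the real Woodwork `DataColumn`; which class becomes 1 is an assumption, and whether this changes the message emitted by the search has not been verified here.

```python
# Toy stand-in for the target; the real y is a Woodwork DataColumn.
labels = ["benign", "malignant", "benign", "benign", "malignant"]

# Hypothetical mapping: which class is treated as the positive class (1)
# is a modelling choice, not something the source specifies.
class_to_int = {"benign": 0, "malignant": 1}

# A KeyError here would reveal an unexpected label instead of silently passing it on.
encoded = [class_to_int[label] for label in labels]
print(encoded)  # [0, 1, 0, 0, 1]
```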

In [69]:
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(
    X, y, problem_type=evalml.problem_types.ProblemTypes.BINARY
)

In [70]:
automl = AutoMLSearch(
    X_train,
    y_train,
    problem_type=evalml.problem_types.ProblemTypes.BINARY,
    max_batches=10,
    max_iterations=10,
)

Generating pipelines to search over...


In [71]:
automl.search()

Numerical binary classification target classes must be [0, 1], got [benign, malignant] instead
*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary. 
Lower score is better.

Using SequentialEngine to train and score pipelines.
Searching up to 10 batches for a total of 10 pipelines. 
Allowed model families: lightgbm, extra_trees, linear_model, decision_tree, catboost, random_forest, xgboost



[Interactive plot (FigureWidget): 'Best Score' trace, lines+markers]

(1/10) Mode Baseline Binary Classification P... Elapsed:00:00
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 12.904
(2/10) Decision Tree Classifier w/ Imputer      Elapsed:00:00
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 2.432
High coefficient of variation (cv >= 0.2) within cross validation scores. Decision Tree Classifier w/ Imputer may not perform as estimated on unseen data.
(3/10) LightGBM Classifier w/ Imputer           Elapsed:00:00
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.133
(4/10) Extra Trees Classifier w/ Imputer        Elapsed:00:02
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.137
(5/10) Elastic Net Classifier w/ Imputer + S... Elapsed:00:03
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.506
(6/10) CatBoost Classifier w/ Imputer           Elapsed:00:04
	Starting cross validation
	Finished cross validatio
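
The "high coefficient of variation (cv >= 0.2)" message in the log above flags pipelines whose cross-validation fold scores are spread widely relative to their mean. A minimal sketch of that check follows, assuming the coefficient is the sample standard deviation divided by the mean; the exact computation inside evalml may differ, and the fold scores below are hypothetical, not taken from this run.

```python
from statistics import mean, stdev

THRESHOLD = 0.2  # value quoted in the log message


def coefficient_of_variation(scores):
    """Absolute ratio of the sample standard deviation to the mean."""
    return abs(stdev(scores) / mean(scores))


# Hypothetical per-fold Log Loss values
stable_folds = [0.13, 0.14, 0.13]
unstable_folds = [1.8, 2.7, 2.8]

for name, folds in [("stable", stable_folds), ("unstable", unstable_folds)]:
    cv = coefficient_of_variation(folds)
    print(name, "high variance" if cv >= THRESHOLD else "ok")
```

Because the ratio is relative, a pipeline with a very small mean score can be flagged even when the absolute spread between folds is tiny.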

In [72]:
automl.rankings

,id,pipeline_name,score,validation_score,percent_better_than_baseline,high_variance_cv,parameters
0,8,Logistic Regression Classifier w/ Imputer + St...,0.094015,0.060529,99.271446,True,{'Imputer': {'categorical_impute_strategy': 'm...
1,6,XGBoost Classifier w/ Imputer,0.113098,0.069048,99.123568,True,{'Imputer': {'categorical_impute_strategy': 'm...
2,7,Random Forest Classifier w/ Imputer,0.119972,0.099614,99.070299,False,{'Imputer': {'categorical_impute_strategy': 'm...
3,2,LightGBM Classifier w/ Imputer,0.132722,0.110679,98.971496,False,{'Imputer': {'categorical_impute_strategy': 'm...
4,3,Extra Trees Classifier w/ Imputer,0.136959,0.111169,98.938661,False,{'Imputer': {'categorical_impute_strategy': 'm...
6,5,CatBoost Classifier w/ Imputer,0.386387,0.374338,97.005774,False,{'Imputer': {'categorical_impute_strategy': 'm...
7,4,Elastic Net Classifier w/ Imputer + Standard S...,0.505862,0.496767,96.079926,False,{'Imputer': {'categorical_impute_strategy': 'm...
8,1,Decision Tree Classifier w/ Imputer,2.431916,2.726782,81.15435,True,{'Imputer': {'categorical_impute_strategy': 'm...
9,0,Mode Baseline Binary Classification Pipeline,12.904388,12.952041,0.0,False,{'Baseline Classifier': {'strategy': 'mode'}}
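
The `percent_better_than_baseline` column can be reproduced from the `score` column as a relative improvement over the baseline pipeline's score. The formula below is inferred from the numbers in the table for this lower-is-better objective (Log Loss Binary), not taken from evalml's source, so treat it as a sanity check rather than the library's definition.

```python
def percent_better_than_baseline(score, baseline_score):
    """Relative improvement over the baseline, for a lower-is-better objective."""
    return 100 * (baseline_score - score) / baseline_score


baseline = 12.904388  # Mode Baseline Binary Classification Pipeline

logistic = percent_better_than_baseline(0.094015, baseline)
decision_tree = percent_better_than_baseline(2.431916, baseline)

print(round(logistic, 2))       # 99.27
print(round(decision_tree, 2))  # 81.15
```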


In [73]:
# Take the top-ranked pipeline, fit it on the training split,
# then predict on the test split
pipeline = automl.best_pipeline
pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

<DataColumn: None (Physical Type = category) (Logical Type = Categorical) (Semantic Tags = {'category'})>