<a href="https://colab.research.google.com/github/tommybebe/til/blob/master/ml/AutoGluon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### References
- [AutoGluon - Predicting Columns in a Table - Quick Start](https://autogluon.mxnet.io/tutorials/tabular_prediction/tabular-quickstart.html)
- [Getting Started with AutoML and AWS AutoGluon.ipynb](https://github.com/aws-samples/aws-machine-learning-university-accelerated-tab/blob/master/notebooks/MLA-TAB-Lecture3-AutoGluon.ipynb)
- [Machine Learning Accelerator - Tabular Data - Lecture 3](https://github.com/aws-samples/aws-machine-learning-university-accelerated-tab/blob/master/notebooks/MLA-TAB-Lecture3-AutoGluon.ipynb)


### Setup

In [5]:
# Here we assume CUDA 10.1 is installed.  Change the package suffix
# to match your own CUDA version (e.g. mxnet-cu100 for CUDA 10.0).
!pip install --upgrade mxnet-cu101
!pip install autogluon

Collecting mxnet-cu101
  Downloading https://files.pythonhosted.org/packages/bf/03/02325a5de95d5cfdd43c929ea55d9cadb44d239ca3aee7e3131540c09773/mxnet_cu101-1.6.0.post0-py2.py3-none-manylinux1_x86_64.whl (711.7MB)
Installing collected packages: mxnet-cu101
Successfully installed mxnet-cu101-1.6.0.post0
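
The install comment above says the package suffix should follow the CUDA version. A minimal sketch of that naming pattern follows. The helper name `mxnet_package_for_cuda` is hypothetical, and the pattern is inferred from the `mxnet-cu101` example only, so check PyPI to confirm that a build exists for your CUDA version.

```python
def mxnet_package_for_cuda(cuda_version: str) -> str:
    """Build a pip package name such as 'mxnet-cu101' from a version like '10.1'.

    Hypothetical helper: it only formats the name and does not check
    that the package is actually published.
    """
    major, minor = cuda_version.split(".")[:2]
    return f"mxnet-cu{major}{minor}"


print(mxnet_package_for_cuda("10.1"))
```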




In [2]:
!pip install -U ipykernel

Collecting ipykernel
  Downloading https://files.pythonhosted.org/packages/52/19/c2812690d8b340987eecd2cbc18549b1d130b94c5d97fcbe49f5f8710edf/ipykernel-5.3.4-py3-none-any.whl (120kB)


### Example 1. Income classification

#### Data

In [1]:
import autogluon as ag
from autogluon import TabularPrediction as task

In [4]:
train_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
train_data = train_data.head(500) # subsample 500 data points for faster demo
train_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,178478,Bachelors,13,Never-married,Tech-support,Own-child,White,Female,0,0,40,United-States,<=50K
1,23,State-gov,61743,5th-6th,3,Never-married,Transport-moving,Not-in-family,White,Male,0,0,35,United-States,<=50K
2,46,Private,376789,HS-grad,9,Never-married,Other-service,Not-in-family,White,Male,0,0,15,United-States,<=50K
3,55,?,200235,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,0,50,United-States,>50K
4,36,Private,224541,7th-8th,4,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0,0,40,El-Salvador,<=50K


In [3]:
label_column = 'class'
print("Summary of class variable: \n", train_data[label_column].describe())

Summary of class variable: 
 count        500
unique         2
top        <=50K
freq         394
Name: class, dtype: object
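
The summary above shows an imbalanced label: 394 of the 500 sampled rows are `<=50K`. As a rough sanity check, the sketch below works out the accuracy of a model that always predicts the majority class. The counts are copied from the `describe()` output, and the variable names are illustrative only.

```python
# Counts taken from the describe() output: 500 rows, top class appears 394 times.
label_counts = {"<=50K": 394, ">50K": 500 - 394}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of a trivial classifier that always predicts the majority class.
baseline_accuracy = label_counts[majority_label] / total
print(f"Always predicting {majority_label!r} gives accuracy {baseline_accuracy:.3f}")
```

Any trained model should beat this baseline on the 500-row sample. The test set may have a different class balance, so it needs its own baseline.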


#### Fit

In [5]:
dir = 'agModels-predictClass' # specifies folder where to store trained models
predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir)

Beginning AutoGluon training ...
AutoGluon will save models to agModels-predictClass/
AutoGluon Version:  0.0.13
Train Data Rows:    500
Train Data Columns: 15
Preprocessing data ...
Here are the 2 unique label values in your data:  [' <=50K', ' >50K']
AutoGluon infers your prediction problem is: binary  (because only two unique label-values observed).
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
Train Data Class Count: 2
NumExpr defaulting to 2 threads.
Feature Generator processed 500 data points with 14 features
Original Features (raw dtypes):
	int64 features: 6
	object features: 8
Original Features (inferred dtypes):
	int features: 6
	object features: 8
Generated Features (special dtypes):
Processed Features (raw dtypes):
	int features: 6
	category features: 8
Processed Features:
	int features: 6
	categor
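
The training log above lists the label values as `' <=50K'` and `' >50K'`, each with a leading space. The sketch below shows why that matters when comparing labels by hand. Stripping the whitespace is a choice for your own post-processing, not something the log says AutoGluon requires.

```python
# Label values exactly as printed in the training log (note the leading space).
raw_labels = [" <=50K", " >50K"]

# A naive comparison against the "clean" string does not match.
print(raw_labels[1] == ">50K")

# Stripping whitespace first makes the comparison behave as expected.
cleaned_labels = [label.strip() for label in raw_labels]
print(cleaned_labels[1] == ">50K")
```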

#### Inference

In [7]:
test_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label_column]  # values to predict
test_data_nolab = test_data.drop(labels=[label_column],axis=1) # delete label column to prove we're not cheating
test_data_nolab.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,31,Private,169085,11th,7,Married-civ-spouse,Sales,Wife,White,Female,0,0,20,United-States
1,17,Self-emp-not-inc,226203,12th,8,Never-married,Sales,Own-child,White,Male,0,0,45,United-States
2,47,Private,54260,Assoc-voc,11,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,1887,60,United-States
3,21,Private,176262,Some-college,10,Never-married,Exec-managerial,Own-child,White,Female,0,0,30,United-States
4,17,Private,241185,12th,8,Never-married,Prof-specialty,Own-child,White,Male,0,0,20,United-States


In [8]:
predictor = task.load(dir) # unnecessary, just demonstrates how to load previously-trained predictor from file

y_pred = predictor.predict(test_data_nolab)
print("Predictions:  ", y_pred)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

Predictions:   [' <=50K' ' <=50K' ' >50K' ... ' <=50K' ' <=50K' ' <=50K']


Evaluation: accuracy on test data: 0.8249564950353158
Evaluations on test data:
{
    "accuracy": 0.8249564950353158,
    "accuracy_score": 0.8249564950353158,
    "balanced_accuracy_score": 0.6792933272763128,
    "matthews_corrcoef": 0.45574507440667816,
    "f1_score": 0.8249564950353158
}


Detailed (per-class) classification report:
{
    " <=50K": {
        "precision": 0.8371901797251263,
        "recall": 0.956515903905516,
        "f1-score": 0.8928839889751942,
        "support": 7451
    },
    " >50K": {
        "precision": 0.7420382165605095,
        "recall": 0.40207075064710956,
        "f1-score": 0.5215444879686626,
        "support": 2318
    },
    "accuracy": 0.8249564950353158,
    "macro avg": {
        "precision": 0.789614198142818,
        "recall": 0.6792933272763128,
        "f1-score": 0.7072142384719284,
        "support": 9769
    },
    "weighted avg": {
        "precision": 0.8146124081399505,
        "recall": 0.8249564950353158,
        "f1-score": 0.8047721081958779,
        "support": 9769
    }
}
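
To see how the numbers in the evaluation output above relate to each other, the sketch below recomputes three of them from the per-class precision, recall and support. It uses the textbook definitions in plain Python rather than AutoGluon's own code. The top-level `f1_score` equals the accuracy, which suggests a micro average, but that is an inference from the numbers and is not stated in the output.

```python
# Per-class values copied from the classification report.
report = {
    " <=50K": {"precision": 0.8371901797251263, "recall": 0.956515903905516, "support": 7451},
    " >50K": {"precision": 0.7420382165605095, "recall": 0.40207075064710956, "support": 2318},
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Balanced accuracy is the unweighted mean of the per-class recalls.
recalls = [stats["recall"] for stats in report.values()]
balanced_accuracy = sum(recalls) / len(recalls)

# Plain accuracy is the support-weighted mean of the per-class recalls.
total_support = sum(stats["support"] for stats in report.values())
accuracy = sum(stats["recall"] * stats["support"] for stats in report.values()) / total_support

# F1 for the minority class.
f1_high_income = f1(report[" >50K"]["precision"], report[" >50K"]["recall"])

print(f"balanced accuracy: {balanced_accuracy:.4f}")
print(f"accuracy:          {accuracy:.4f}")
print(f"F1 for ' >50K':    {f1_high_income:.4f}")
```

Accuracy sits well above balanced accuracy because recall on the minority `>50K` class is only about 0.40.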


#### Check model summary

In [10]:
results = predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                         model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0      weighted_ensemble_k0_l1       0.87       0.387214  8.178858                0.001337           0.380271            1       True         11
1           LightGBMClassifier       0.86       0.011454  0.174834                0.011454           0.174834            0       True          7
2          NeuralNetClassifier       0.86       0.024859  4.372865                0.024859           4.372865            0       True          9
3           CatboostClassifier       0.85       0.011435  1.483190                0.011435           1.483190            0       True          8
4     LightGBMClassifierCustom       0.84       0.014230  0.377960                0.014230           0.377960            0       True         10
5     ExtraTreesClassifierGini       0.83       0.112609  0.520672  
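
The leaderboard above trades validation score against prediction time. As an illustration, the sketch below picks the fastest model whose score is close to the best one, using only the five complete rows copied from the table. The `tolerance` of 0.015 is an arbitrary assumption, not an AutoGluon setting.

```python
# (model, score_val, pred_time_val) copied from the complete leaderboard rows.
leaderboard = [
    ("weighted_ensemble_k0_l1", 0.87, 0.387214),
    ("LightGBMClassifier", 0.86, 0.011454),
    ("NeuralNetClassifier", 0.86, 0.024859),
    ("CatboostClassifier", 0.85, 0.011435),
    ("LightGBMClassifierCustom", 0.84, 0.014230),
]

best_score = max(score for _, score, _ in leaderboard)
tolerance = 0.015  # assumed: how much validation score we are willing to give up

# Keep models within the tolerance of the best score, then take the fastest one.
candidates = [row for row in leaderboard if best_score - row[1] <= tolerance]
fastest = min(candidates, key=lambda row: row[2])

print(f"best score: {best_score}")
print(f"fastest near-best model: {fastest[0]} ({fastest[2]} s on validation data)")
```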

#### Save model

In [12]:
predictor.save()

TabularPredictor saved. To load, use: TabularPredictor.load("agModels-predictClass/")


In [14]:
!ls agModels-predictClass

learner.pkl  models  SummaryOfModels.html  utils


### Example 2. Outcome Type classification

#### Data

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split


file_uri = 'https://github.com/aws-samples/aws-machine-learning-university-accelerated-tab/blob/master/data/review/review_dataset.csv?raw=true'
df = pd.read_csv(file_uri)
train_data, test_data = train_test_split(df, test_size=0.1, shuffle=True, random_state=23)

#### Fit

In [17]:
from autogluon import TabularPrediction as task

k = 10000 # grab less data for a quick demo
#k = train_data.shape[0] # grab the whole dataset

predictor = task.fit(train_data=train_data.head(k), label='Outcome Type')

No output_directory specified. Models will be saved in: AutogluonModels/ag-20200826_152210/
Beginning AutoGluon training ...
AutoGluon will save models to AutogluonModels/ag-20200826_152210/
AutoGluon Version:  0.0.13
Train Data Rows:    10000
Train Data Columns: 13
Preprocessing data ...
Here are the 2 unique label values in your data:  [1.0, 0.0]
AutoGluon infers your prediction problem is: binary  (because only two unique label-values observed).
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Train Data Class Count: 2
Feature Generator processed 10000 data points with 230 features
Original Features (raw dtypes):
	object features: 10
	int64 features: 2
Original Features (inferred dtypes):
	object features: 9
	text features: 1
	int features: 2
Generated Features (special dtypes):
	text_as_category features: 1
	text_spe

In [18]:
predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                         model  score_val  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0      weighted_ensemble_k0_l1      0.877       1.320798  79.765735                0.003705           0.730549            1       True         11
1   RandomForestClassifierGini      0.869       0.243101   4.365239                0.243101           4.365239            0       True          1
2   RandomForestClassifierEntr      0.866       0.243715   4.967941                0.243715           4.967941            0       True          2
3           CatboostClassifier      0.857       0.060184  10.444251                0.060184          10.444251            0       True          8
4           LightGBMClassifier      0.849       0.050409   1.142792                0.050409           1.142792            0       True          7
5     ExtraTreesClassifierEntr      0.848       0.240790   5.5

{'feature_prune': False,
 'hyperparameter_tune': False,
 'hyperparameters_userspecified': {'default': {'CAT': [{}],
   'GBM': [{}],
   'KNN': [{'AG_args': {'name_suffix': 'Unif'}, 'weights': 'uniform'},
    {'AG_args': {'name_suffix': 'Dist'}, 'weights': 'distance'}],
   'NN': [{}],
   'RF': [{'AG_args': {'name_suffix': 'Gini',
      'problem_types': ['binary', 'multiclass']},
     'criterion': 'gini'},
    {'AG_args': {'name_suffix': 'Entr',
      'problem_types': ['binary', 'multiclass']},
     'criterion': 'entropy'}],
   'XT': [{'AG_args': {'name_suffix': 'Gini',
      'problem_types': ['binary', 'multiclass']},
     'criterion': 'gini'},
    {'AG_args': {'name_suffix': 'Entr',
      'problem_types': ['binary', 'multiclass']},
     'criterion': 'entropy'}],
   'custom': [{'AG_args': {'disable_in_hpo': True,
      'model_type': 'GBM',
      'name_suffix': 'Custom'},
     'boosting_type': 'gbdt',
     'feature_fraction': 0.9,
     'learning_rate': 0.03,
     'min_data_in_leaf': 5,
  

### Summary
- Simple and nice: in both examples, a single `task.fit()` call trained several models plus a weighted ensemble.