# AutoGluon Quick Start
https://github.com/aws-samples/aws-ai-ml-workshop-kr/blob/master/sagemaker/autogluon/autogluon_helloworld.ipynb

### source
- https://autogluon.mxnet.io/

### References
- Video (in Korean) - https://youtu.be/xnimaVNTWfc
- AWS blog - https://aws.amazon.com/ko/blogs/opensource/machine-learning-with-autogluon-an-open-source-automl-library/
- Sample code from the blog - https://github.com/shashankprasanna/autogluon-demos/blob/master/otto-kaggle-example.ipynb
- Demos - https://github.com/shashankprasanna/autogluon-demos
- Source code - https://github.com/awslabs/autogluon
- Paper - https://arxiv.org/abs/2003.06505 (PDF: https://arxiv.org/pdf/2003.06505.pdf)

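# Advantages of AutoGluon
- Handles data preprocessing for you
    - Missing-value handling
    - One-hot encoding or embedding layers for categorical variables
- Finds the best model among several candidates
- Can also run hyperparameter tuning (optional)
- Reports training time and inference time for each model, so the analyst can pick a model that fits their constraints
- Builds ensemble and stacked models as well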

# Data Pre-Processing in AutoGluon
AutoGluon uses **median imputation** for missing numeric values, **quantile normalization** for skewed distributions, and **standard normalization** for all other variables.

For categorical features, AutoGluon uses an embedding layer if the feature has more than 4 discrete levels; otherwise it uses **one-hot encoding**.

[Source](https://towardsdatascience.com/tabular-prediction-using-auto-machine-learning-autogluon-de2507ecd94f)
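The rules quoted above can be illustrated with a small standard-library sketch. It is not AutoGluon's implementation: the helper names (`median_impute`, `standardize`, `choose_categorical_encoding`), the `embedding_threshold` parameter, and the toy values are assumptions made for illustration, and quantile normalization is left out for brevity.

```python
import statistics

def median_impute(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def choose_categorical_encoding(values, embedding_threshold=4):
    """Pick 'embedding' when a feature has more than `embedding_threshold` levels."""
    n_levels = len(set(values))
    return "embedding" if n_levels > embedding_threshold else "one-hot"

# Toy columns (illustrative values only).
ages = [25, None, 46, 55, 36]
filled = median_impute(ages)   # the None becomes the median of 25, 46, 55, 36
scaled = standardize(filled)

sex = ["Female", "Male", "Male", "Male", "Male"]
education = ["Bachelors", "5th-6th", "HS-grad", "HS-grad", "7th-8th", "11th", "Assoc-voc"]

print(filled)
print(choose_categorical_encoding(sex), choose_categorical_encoding(education))
```

With two levels, `sex` falls on the one-hot side of the threshold, while `education` has six distinct levels here and would be routed to an embedding.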

## Installation
`pip install autogluon`

If you run into permission errors, use `pip install --user autogluon`.
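After installing, it can help to confirm which version is active, since the API has changed between releases. The snippet below is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata

def installed_version(package="autogluon"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

version = installed_version()
print(version if version else "autogluon is not installed in this environment")
```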

## AutoGluon API

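```python
# Current API; older releases used `from autogluon import TabularPrediction as task`.
from autogluon.tabular import TabularDataset, TabularPredictor

# DATASET_PATH, LABEL_COLUMN_NAME, and new_data are placeholders to fill in.

# 1) TabularDataset() - behaves much like a pandas DataFrame
train_data = TabularDataset(DATASET_PATH)

# 2) fit() - trains several models and saves them to disk
predictor = TabularPredictor(label=LABEL_COLUMN_NAME).fit(train_data)

# 3) predict()
prediction = predictor.predict(new_data)
```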

In [1]:
```python
import autogluon as ag
# from autogluon import TabularPrediction as task
from autogluon.tabular import TabularDataset, TabularPredictor
```

## 1) Dataset()

In [2]:
```python
train_data = TabularDataset(data='https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
# train_data = train_data.head(500) # subsample 500 data points for faster demo
# print(train_data.head())
```

In [3]:
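```python
train_data.head()
```

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|-------|
| 0 | 25 | Private | 178478 | Bachelors | 13 | Never-married | Tech-support | Own-child | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 1 | 23 | State-gov | 61743 | 5th-6th | 3 | Never-married | Transport-moving | Not-in-family | White | Male | 0 | 0 | 35 | United-States | <=50K |
| 2 | 46 | Private | 376789 | HS-grad | 9 | Never-married | Other-service | Not-in-family | White | Male | 0 | 0 | 15 | United-States | <=50K |
| 3 | 55 | ? | 200235 | HS-grad | 9 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 50 | United-States | >50K |
| 4 | 36 | Private | 224541 | 7th-8th | 4 | Married-civ-spouse | Handlers-cleaners | Husband | White | Male | 0 | 0 | 40 | El-Salvador | <=50K |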


In [4]:
```python
label_column = 'class'
train_data.groupby(label_column).count()
```

| class | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|-------|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|
| <=50K | 29704 | 29704 | 29704 | 29704 | 29704 | 29704 | 29704 | 29704 | 29704 | 29704 | 29704 | 29704 | 29704 | 29704 |
| >50K | 9369 | 9369 | 9369 | 9369 | 9369 | 9369 | 9369 | 9369 | 9369 | 9369 | 9369 | 9369 | 9369 | 9369 |
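The class counts above show that the labels are imbalanced, which matters when reading an accuracy score. As a rough reference point, the sketch below computes the accuracy of always predicting the majority class, using only the counts from the table.

```python
# Class counts copied from the groupby output above.
class_counts = {"<=50K": 29704, ">50K": 9369}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

print(f"rows: {total}")
print(f"majority class: {majority_class}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

A model is only useful on this data if its accuracy is clearly above that baseline of roughly 0.76.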


## 2) fit()
- https://auto.gluon.ai/stable/api/autogluon.task.html
- **`TabularPredictor`**
    - `label`: name of the target (response) column; there is no need to split the data into X and y
    - `eval_metric`: metric used to evaluate models
        - default = {"classification": "accuracy", "regression": "root_mean_squared_error"}
        - classification: `['accuracy', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_weighted', 'roc_auc', 'roc_auc_ovo_macro', 'average_precision', 'precision', 'precision_macro', 'precision_micro', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_weighted', 'log_loss', 'pac_score']`
        - regression: `['root_mean_squared_error', 'mean_squared_error', 'mean_absolute_error', 'median_absolute_error', 'r2']`
    - `path`: folder in which model artifacts are saved (as `.pkl` files)
- **`fit()`**
    - `train_data`: training data
    - `tuning_data`: validation data used for hyperparameter tuning; if omitted, part of `train_data` is held out and used instead
    - `time_limit`: approximate time budget for `fit()` in seconds; models that cannot be trained within the remaining time are skipped. If omitted, all models are trained
    - `presets`: training presets that trade predictive quality against speed. default = `medium_quality`
        - `best_quality`, `high_quality`, `good_quality`, `medium_quality`, `optimize_for_deployment`
        - https://github.com/awslabs/autogluon/blob/master/docs/tutorials/tabular_prediction/tabular-quickstart.md
        
| Preset         | Model Quality                                          | Use Cases                                                                                                                                               | Fit Time (Ideal) | Inference Time (Relative to medium_quality) | Disk Usage |
|----------------|--------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|---------------------------------------------|------------|
| best_quality   | State-of-the-art (SOTA), much better than high_quality | When accuracy is what matters                                                                                                                           | 16x+             | 32x+                                        | 16x+       |
| high_quality   | Better than good_quality                               | When a very powerful, portable solution with fast inference is required: Large-scale batch inference                                                    | 16x              | 4x                                          | 2x         |
| good_quality   | Significantly better than medium_quality               | When a powerful, highly portable solution with very fast inference is required: Billion-scale batch inference, sub-100ms online-inference, edge-devices | 16x              | 2x                                          | 0.1x       |
| medium_quality | Competitive with other top AutoML Frameworks           | Initial prototyping, establishing a performance baseline                                                                                                | 1x               | 1x                                          | 1x         |
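The parameters above could be combined as in the following sketch. `build_fit_kwargs` and `train` are hypothetical helpers written for this example, the path `agModels-goodQuality` is made up, and the preset names are simply the ones listed above (they may differ between AutoGluon releases). `autogluon` is imported lazily inside `train` so the argument-building part runs even where the library is not installed.

```python
def build_fit_kwargs(time_limit=None, presets="medium_quality"):
    """Assemble keyword arguments for TabularPredictor.fit() (hypothetical helper)."""
    known_presets = {
        "best_quality", "high_quality", "good_quality",
        "medium_quality", "optimize_for_deployment",
    }
    if presets not in known_presets:
        raise ValueError(f"unknown preset: {presets!r}")
    kwargs = {"presets": presets}
    if time_limit is not None:
        if time_limit <= 0:
            raise ValueError("time_limit must be a positive number of seconds")
        kwargs["time_limit"] = time_limit
    return kwargs

def train(train_data, label, path, eval_metric="accuracy", **fit_kwargs):
    """Fit a predictor; autogluon is imported here, not at module import time."""
    from autogluon.tabular import TabularPredictor
    predictor = TabularPredictor(label=label, eval_metric=eval_metric, path=path)
    return predictor.fit(train_data, **fit_kwargs)

fit_kwargs = build_fit_kwargs(time_limit=600, presets="good_quality")
print(fit_kwargs)

# Example call (requires autogluon and a loaded train_data):
# predictor = train(train_data, label="class", path="agModels-goodQuality",
#                   eval_metric="roc_auc", **fit_kwargs)
```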

In [5]:
```python
save_path = 'agModels-predictClass'  # specifies folder to store trained models
predictor = TabularPredictor(label=label_column, path=save_path).fit(train_data)
```

```
Beginning AutoGluon training ...
AutoGluon will save models to "agModels-predictClass\"
AutoGluon Version:  0.4.0
Python Version:     3.9.12
Operating System:   Windows
Train Data Rows:    39073
Train Data Columns: 14
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' <=50K', ' >50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting
```

In [6]:
```python
test_data = TabularDataset(data='https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label_column]  # values to predict
test_data_nolab = test_data.drop(columns=[label_column])  # delete label column to prove we're not cheating
test_data_nolab.head()
```

```
Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769
```

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|
| 0 | 31 | Private | 169085 | 11th | 7 | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 20 | United-States |
| 1 | 17 | Self-emp-not-inc | 226203 | 12th | 8 | Never-married | Sales | Own-child | White | Male | 0 | 0 | 45 | United-States |
| 2 | 47 | Private | 54260 | Assoc-voc | 11 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1887 | 60 | United-States |
| 3 | 21 | Private | 176262 | Some-college | 10 | Never-married | Exec-managerial | Own-child | White | Female | 0 | 0 | 30 | United-States |
| 4 | 17 | Private | 241185 | 12th | 8 | Never-married | Prof-specialty | Own-child | White | Male | 0 | 0 | 20 | United-States |


## 3) predict()
- **`TabularPredictor.predict()`**
    - `data`: data to predict on (the test data)
    - `model`: name of a specific model to use; defaults to the best-performing model
        - Model names can be listed with `predictor.get_model_names()`
- **`TabularPredictor.evaluate_predictions()`**
    - `auxiliary_metrics`: whether to also report metrics other than the configured `eval_metric`
- **`TabularPredictor.leaderboard()`**: shows the metrics for each model

In [7]:
```python
predictor = TabularPredictor.load(save_path)  # unnecessary, just demonstrates how to load previously-trained predictor from file

y_pred = predictor.predict(test_data_nolab)
print("Predictions:  \n", y_pred)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)
```

```
Evaluation: accuracy on test data: 0.8763435356740711
Evaluations on test data:
{
    "accuracy": 0.8763435356740711,
    "balanced_accuracy": 0.7950062351568354,
    "mcc": 0.6395678748952276,
    "f1": 0.710727969348659,
    "precision": 0.798708288482239,
    "recall": 0.640207075064711
}
```

```
Predictions:  
 0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object
```
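The auxiliary metrics in the evaluation output above are related to each other. The short check below uses only the standard definitions (F1 as the harmonic mean of precision and recall, and binary balanced accuracy as the mean of recall and specificity) with the numbers copied from that output; it does not call AutoGluon.

```python
# Values copied from the evaluation output above.
precision = 0.798708288482239
recall = 0.640207075064711
balanced_accuracy = 0.7950062351568354

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# For binary problems, balanced accuracy is the mean of recall (true-positive rate)
# and specificity (true-negative rate), so specificity can be recovered from it.
specificity = 2 * balanced_accuracy - recall

print(f"f1 = {f1:.6f}")
print(f"specificity = {specificity:.6f}")
```

The recovered specificity is much higher than the recall, meaning the model misses a sizable share of `>50K` rows while rarely mislabeling `<=50K` rows, which is consistent with the class imbalance in the data.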


In [8]:
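```python
leaderboard = predictor.leaderboard(test_data, silent=True)
leaderboard.head()
```

|   | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|-------|------------|-----------|----------------|---------------|----------|-------------------------|------------------------|-------------------|-------------|-----------|-----------|
| 0 | XGBoost | 0.877162 | 0.8872 | 0.084999 | 0.030999 | 3.058001 | 0.084999 | 0.030999 | 3.058001 | 1 | True | 11 |
| 1 | WeightedEnsemble_L2 | 0.876344 | 0.8912 | 0.412998 | 0.216998 | 12.253992 | 0.005 | 0.006 | 1.685997 | 2 | True | 14 |
| 2 | CatBoost | 0.874399 | 0.8824 | 0.024 | 0.021 | 28.942511 | 0.024 | 0.021 | 28.942511 | 1 | True | 7 |
| 3 | LightGBMLarge | 0.873784 | 0.8856 | 0.106999 | 0.040001 | 4.198 | 0.106999 | 0.040001 | 4.198 | 1 | True | 13 |
| 4 | LightGBM | 0.873477 | 0.8824 | 0.072 | 0.063001 | 1.099998 | 0.072 | 0.063001 | 1.099998 | 1 | True | 4 |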
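Because the leaderboard reports both score and prediction time, it can be used to trade a little accuracy for faster inference. The sketch below works on rows copied from the table above; `fastest_within_tolerance` and its `tolerance` parameter are hypothetical and not part of AutoGluon.

```python
# Rows copied from the leaderboard above: (model, score_test, pred_time_test in seconds).
leaderboard_rows = [
    ("XGBoost", 0.877162, 0.084999),
    ("WeightedEnsemble_L2", 0.876344, 0.412998),
    ("CatBoost", 0.874399, 0.024),
    ("LightGBMLarge", 0.873784, 0.106999),
    ("LightGBM", 0.873477, 0.072),
]

def fastest_within_tolerance(rows, tolerance=0.005):
    """Return the fastest model whose test score is within `tolerance` of the best."""
    best_score = max(score for _, score, _ in rows)
    candidates = [row for row in rows if best_score - row[1] <= tolerance]
    return min(candidates, key=lambda row: row[2])

model, score, pred_time = fastest_within_tolerance(leaderboard_rows)
print(model, score, pred_time)
```

The selected name can then be passed as the `model` argument of `predictor.predict()`.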


In [10]:
```python
predictor.get_model_names()
```

```
['KNeighborsUnif',
 'KNeighborsDist',
 'LightGBMXT',
 'LightGBM',
 'RandomForestGini',
 'RandomForestEntr',
 'CatBoost',
 'ExtraTreesGini',
 'ExtraTreesEntr',
 'NeuralNetFastAI',
 'XGBoost',
 'NeuralNetTorch',
 'LightGBMLarge',
 'WeightedEnsemble_L2']
```

In [9]:
```python
predictor.predict(test_data_nolab, model='LightGBMXT')
```

```
0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object
```

## Best Model Information
- **`TabularPredictor.info()`**: returns details on every model.
    - Model name, evaluation metrics, training time, inference time, hyperparameters, and, for ensembles, which models were combined

In [11]:
```python
infos = predictor.info()
best_model = predictor.get_model_best()
infos['model_info'][best_model]['hyperparameters']
```

```
{'use_orig_features': False,
 'max_base_models': 25,
 'max_base_models_per_type': 5,
 'save_bag_folds': True}
```

In [12]:
```python
best_model
```

```
'WeightedEnsemble_L2'
```

## Behind the fit function

### Data preprocessing
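1. Determine whether the task is classification or regression
2. Classify each feature as numeric, categorical, text, or date/time, and preprocess it accordingly
    - Categorical columns whose values never repeat are dropped (e.g., a user_id column)
    - Text columns are converted to n-gram features, date/time columns to suitable numeric values, and so on
    - Missing values are filled in as "unknown"
3. Split the data into a training set and a validation set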
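Step 2 above can be sketched in plain Python. This is a toy heuristic, not AutoGluon's feature generator: `infer_feature_type` and `preprocess` are hypothetical names, the rule "strings with spaces are text" is an assumption made for the example, and n-gram extraction is omitted.

```python
from datetime import datetime

def infer_feature_type(values):
    """Classify a column as 'numeric', 'datetime', 'text', or 'categorical' (toy heuristic)."""
    observed = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        return "numeric"
    if all(isinstance(v, datetime) for v in observed):
        return "datetime"
    # Assumption: strings containing spaces are free text, others are categories.
    if any(isinstance(v, str) and " " in v for v in observed):
        return "text"
    return "categorical"

def preprocess(columns):
    """Drop never-repeating categorical columns and fill missing categories with 'unknown'."""
    processed = {}
    for name, values in columns.items():
        kind = infer_feature_type(values)
        if kind == "categorical":
            observed = [v for v in values if v is not None]
            if len(set(observed)) == len(observed):
                continue  # every value is unique (ID-like), so the column is dropped
            values = ["unknown" if v is None else v for v in values]
        elif kind == "datetime":
            # Convert date/time values to a numeric representation (POSIX timestamp).
            values = [None if v is None else v.timestamp() for v in values]
        processed[name] = values
    return processed

# Toy columns (illustrative values only).
columns = {
    "user_id": ["u1", "u2", "u3", "u4"],
    "workclass": ["Private", None, "Private", "State-gov"],
    "age": [25, 23, 46, 55],
}
result = preprocess(columns)
print(result)
```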

### Model fitting
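- Tries general-purpose algorithms such as random forests first and leaves ones such as k-NN for later
- Returns the most accurate model when there are no constraints, or the most accurate model achievable under a given cost/time limit
- Combines models through ensembling and multi-layer stacking
- Hyperparameter optimization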
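To make the ensembling step concrete, the sketch below shows how a weighted average combines the predicted probabilities of several base models. The probabilities and weights are made up for illustration, `weighted_ensemble` is a hypothetical helper, and the way AutoGluon actually chooses its ensemble weights is not covered here.

```python
def weighted_ensemble(probabilities_by_model, weights):
    """Combine per-model positive-class probabilities with a weighted average."""
    total_weight = sum(weights.values())
    n_rows = len(next(iter(probabilities_by_model.values())))
    combined = []
    for i in range(n_rows):
        weighted_sum = sum(weights[name] * probs[i]
                           for name, probs in probabilities_by_model.items())
        combined.append(weighted_sum / total_weight)
    return combined

# Made-up probabilities of the positive class (>50K) for three rows.
probabilities_by_model = {
    "LightGBM": [0.10, 0.80, 0.40],
    "CatBoost": [0.20, 0.60, 0.60],
}
weights = {"LightGBM": 0.75, "CatBoost": 0.25}  # hypothetical ensemble weights

ensemble = weighted_ensemble(probabilities_by_model, weights)
labels = [">50K" if p >= 0.5 else "<=50K" for p in ensemble]
print(ensemble, labels)
```

In the third row the two base models disagree, and the higher-weighted model decides the final label.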

### Algorithms
Algorithm lineup (initial configuration)
- Random Forests
- Extremely Randomized trees
- k-nearest neighbors
- LightGBM boosted trees
- CatBoost boosted trees
- AutoGluon-Tabular deep neural networks
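The algorithm families above can be matched against the model names returned by `predictor.get_model_names()` earlier in this document. The prefix-to-family mapping below is a hypothetical one written for this example, based only on how the names read; it is not an AutoGluon API.

```python
# Model names as returned by predictor.get_model_names() earlier in this document.
model_names = [
    "KNeighborsUnif", "KNeighborsDist", "LightGBMXT", "LightGBM",
    "RandomForestGini", "RandomForestEntr", "CatBoost", "ExtraTreesGini",
    "ExtraTreesEntr", "NeuralNetFastAI", "XGBoost", "NeuralNetTorch",
    "LightGBMLarge", "WeightedEnsemble_L2",
]

# Hypothetical prefix-to-family mapping, written for this example only.
family_by_prefix = {
    "RandomForest": "Random Forests",
    "ExtraTrees": "Extremely Randomized trees",
    "KNeighbors": "k-nearest neighbors",
    "LightGBM": "LightGBM boosted trees",
    "CatBoost": "CatBoost boosted trees",
    "NeuralNet": "deep neural networks",
}

def group_by_family(names):
    """Group model names by the first matching family prefix."""
    groups = {}
    for name in names:
        family = next((fam for prefix, fam in family_by_prefix.items()
                       if name.startswith(prefix)), "not in the initial list")
        groups.setdefault(family, []).append(name)
    return groups

groups = group_by_family(model_names)
for family, members in groups.items():
    print(f"{family}: {members}")
```

Names that match none of the listed families, such as `XGBoost` and the `WeightedEnsemble_L2` ensemble, end up in the "not in the initial list" group.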