## AutoML Example (Code Shared for a DACON Competition)
**Source: https://dacon.io/competitions/official/235647/codeshare/1701?page=1&dtype=recent&ptype=pub**



This notebook walks through entering a tabular-data competition with PyCaret, an AutoML package. I used the given data as-is, with no feature engineering or model tuning, and trained and predicted with the default models, so additional feature engineering and tuning should yield better scores.

In my experience so far, PyCaret suits single-output prediction tasks, but I have not found an easy way to apply it to multi-output prediction tasks. If you know how to apply PyCaret to a multi-output task, I would appreciate it if you shared tutorial code.

## Read Data

In [1]:
import pandas as pd
train = pd.read_csv('../input/kakr-4th-competition/train.csv')
test = pd.read_csv('../input/kakr-4th-competition/test.csv')
submission = pd.read_csv('../input/kakr-4th-competition/sample_submission.csv')

## Check the Shapes of the Data

In [2]:
print(train.shape)
print(test.shape)
print(submission.shape)

(26049, 16)
(6512, 15)
(6512, 2)


## Install PyCaret

In [None]:
# !pip install pycaret

## Import Functions for the Classification Task

In [3]:
from pycaret.classification import *

## Set Up the Environment

- In PyCaret, you must set up the experiment environment before training models. This is done with the 'setup' function.

In [1]:
train.head()


In [5]:
# The 'income' column is the prediction target, so pass it as the target argument
# silent = True skips the interactive data-type confirmation prompt
# (an argument of PyCaret 2.x; newer releases may not accept it)
clf = setup(data = train, target = 'income', silent = True)

## Train and Compare Models

- With the environment set up, we will train and compare the default models that PyCaret provides.
- The 'compare_models' function trains 15 default models and compares their performance.
- We keep the top 3 models ranked by F1, because F1 is the evaluation metric for this competition.
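Since models are ranked by F1, it may help to recall how that metric is computed. The snippet below is a minimal sketch, not PyCaret's implementation: the function name and the confusion counts are hypothetical and chosen only for illustration.

```python
def f1_score_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for the positive class: precision 0.75, recall 0.60
example_f1 = f1_score_from_counts(tp=60, fp=20, fn=40)
print(round(example_f1, 4))
```

Because F1 ignores true negatives, it can rank models differently from accuracy when the classes are imbalanced.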

In [7]:
best_3 = compare_models(sort = 'F1', n_select = 3)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Light Gradient Boosting Machine | 0.8737 | 0.9269 | 0.661 | 0.7834 | 0.7169 | 0.6363 | 0.6403 | 0.4036 |
| 1 | CatBoost Classifier | 0.8745 | 0.9289 | 0.6551 | 0.7906 | 0.7163 | 0.6367 | 0.6415 | 10.0232 |
| 2 | Extreme Gradient Boosting | 0.8707 | 0.9254 | 0.659 | 0.7738 | 0.7116 | 0.629 | 0.6325 | 10.0862 |
| 3 | Ada Boost Classifier | 0.8603 | 0.9153 | 0.625 | 0.7564 | 0.684 | 0.5954 | 0.6002 | 1.3996 |
| 4 | Gradient Boosting Classifier | 0.8657 | 0.9211 | 0.598 | 0.7975 | 0.683 | 0.6001 | 0.6104 | 4.2536 |
| 5 | Random Forest Classifier | 0.8471 | 0.8783 | 0.5819 | 0.7319 | 0.648 | 0.552 | 0.5581 | 0.2184 |
| 6 | Linear Discriminant Analysis | 0.8421 | 0.8942 | 0.5731 | 0.7186 | 0.6373 | 0.5381 | 0.5439 | 0.2755 |
| 7 | Extra Trees Classifier | 0.8306 | 0.8806 | 0.5944 | 0.669 | 0.6291 | 0.5199 | 0.5217 | 1.3858 |
| 8 | Decision Tree Classifier | 0.813 | 0.748 | 0.622 | 0.6126 | 0.6169 | 0.4934 | 0.4936 | 0.2272 |
| 9 | Ridge Classifier | 0.8404 | 0.0 | 0.5176 | 0.7456 | 0.6108 | 0.5145 | 0.5282 | 0.0415 |


- The three best models are stored in the best_3 variable.

## Model Ensemble

- We will ensemble the three trained models with a soft-voting ensemble, which combines their predicted probabilities.

In [8]:
blended = blend_models(estimator_list = best_3, fold = 5, method = 'soft')

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8665 | 0.9242 | 0.6338 | 0.7732 | 0.6966 | 0.612 | 0.617 |
| 1 | 0.8763 | 0.9286 | 0.6716 | 0.7865 | 0.7245 | 0.6454 | 0.6488 |
| 2 | 0.8703 | 0.9283 | 0.658 | 0.7726 | 0.7107 | 0.6278 | 0.6312 |
| 3 | 0.8717 | 0.9282 | 0.6523 | 0.7815 | 0.7111 | 0.6295 | 0.6337 |
| 4 | 0.8801 | 0.9332 | 0.6757 | 0.7979 | 0.7317 | 0.6553 | 0.659 |
| Mean | 0.873 | 0.9285 | 0.6583 | 0.7823 | 0.7149 | 0.634 | 0.6379 |
| SD | 0.0048 | 0.0029 | 0.0149 | 0.0094 | 0.0122 | 0.015 | 0.0146 |
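The idea behind soft voting can be sketched in a few lines. This is only an illustration under simple assumptions (equal model weights, binary classification), not what 'blend_models' does internally. The function name and the probabilities are hypothetical.

```python
from typing import List


def soft_vote(probabilities_per_model: List[List[float]]) -> List[float]:
    """Average each sample's positive-class probability across models."""
    n_models = len(probabilities_per_model)
    return [
        sum(sample_probs) / n_models
        for sample_probs in zip(*probabilities_per_model)
    ]


# Hypothetical positive-class probabilities from three models for four samples
model_probs = [
    [0.9, 0.4, 0.2, 0.6],
    [0.8, 0.6, 0.1, 0.4],
    [0.7, 0.2, 0.3, 0.8],
]
averaged = soft_vote(model_probs)

# The final label comes from thresholding the averaged probability
labels = [1 if p >= 0.5 else 0 for p in averaged]
print(averaged, labels)
```

Hard voting would instead take a majority vote over each model's predicted label, which discards how confident each model is.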


## Prediction
- We will use the ensemble model to make predictions.
- The setup environment already contains a hold-out set, so we predict on it to check the model's performance.

In [9]:
pred_holdout = predict_model(blended)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Voting Classifier | 0.8656 | 0.9255 | 0.6443 | 0.7638 | 0.699 | 0.6133 | 0.617 |
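To make the hold-out idea concrete, here is a rough sketch of a random train / hold-out split. It assumes a 70/30 ratio, which I understand to be PyCaret's default 'train_size', and it ignores the preprocessing that 'setup' also performs. The helper name and seed are hypothetical.

```python
import random
from typing import List, Tuple


def holdout_split(
    rows: List[int], train_size: float = 0.7, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and cut them into train and hold-out parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


# 26049 is the number of rows in train.csv, as printed earlier
row_ids = list(range(26049))
train_part, holdout_part = holdout_split(row_ids)
print(len(train_part), len(holdout_part))
```

The hold-out rows never take part in cross-validation, so the score on them is a check on data the models have not seen.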


## Retrain the Model on the Whole Data

- So far, we have experimented by splitting the given train data again into train / validation sets, so the models have not been trained on the full training data.
- For the best performance, we will retrain the model on the whole dataset.

In [10]:
final_model = finalize_model(blended)

## Predict on the Competition Test Set

- We will now use the 'predict_model' function to predict on the competition test set with the retrained model.

In [11]:
predictions = predict_model(final_model, data = test)

In [12]:
predictions

| | id | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | Label | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 28 | Private | 67661 | Some-college | 10 | Never-married | Adm-clerical | Other-relative | White | Female | 0 | 0 | 40 | United-States | <=50K | 0.0034 |
| 1 | 1 | 40 | Self-emp-inc | 37869 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | >50K | 0.5428 |
| 2 | 2 | 20 | Private | 109952 | Some-college | 10 | Never-married | Handlers-cleaners | Own-child | White | Male | 0 | 0 | 25 | United-States | <=50K | 0.0002 |
| 3 | 3 | 40 | Private | 114537 | Assoc-voc | 11 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | >50K | 0.6900 |
| 4 | 4 | 37 | Private | 51264 | Doctorate | 16 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 99 | France | <=50K | 0.3569 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6507 | 6507 | 35 | Private | 61343 | Bachelors | 13 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 40 | United-States | >50K | 0.5145 |
| 6508 | 6508 | 41 | Self-emp-inc | 32185 | Bachelors | 13 | Married-civ-spouse | Tech-support | Husband | White | Male | 0 | 0 | 40 | United-States | >50K | 0.5892 |
| 6509 | 6509 | 39 | Private | 409189 | 5th-6th | 3 | Married-civ-spouse | Other-service | Husband | White | Male | 0 | 0 | 40 | Mexico | <=50K | 0.0233 |
| 6510 | 6510 | 35 | Private | 180342 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K | 0.2160 |


In [13]:
# Temporarily store the predicted score in a helper column (named 'voted' here)
submission['voted'] = predictions['Score']

In [14]:
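# Label rows with a score of 0.5 or higher as 1; other rows keep the value from sample_submission.csv
submission.loc[submission['voted'] >= 0.5, 'prediction'] = 1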

In [15]:
submission

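| | id | prediction | voted |
|---|---|---|---|
| 0 | 0 | 0 | 0.0034 |
| 1 | 1 | 1 | 0.5428 |
| 2 | 2 | 0 | 0.0002 |
| 3 | 3 | 1 | 0.6900 |
| 4 | 4 | 0 | 0.3569 |
| ... | ... | ... | ... |
| 6507 | 6507 | 1 | 0.5145 |
| 6508 | 6508 | 1 | 0.5892 |
| 6509 | 6509 | 0 | 0.0233 |
| 6510 | 6510 | 0 | 0.2160 |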
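The thresholding step can also be written without a helper column, and without relying on the 'prediction' column of sample_submission.csv already being 0. The sketch below uses small stand-in frames built from the first four rows shown above. It assumes that 'Score' is the probability of the positive class (>50K), which matches this output but may differ in other PyCaret versions.

```python
import pandas as pd

# Stand-ins for the notebook's `predictions` and `submission` frames,
# using the first four scores from the output above
predictions = pd.DataFrame(
    {"id": [0, 1, 2, 3], "Score": [0.0034, 0.5428, 0.0002, 0.6900]}
)
submission = pd.DataFrame({"id": [0, 1, 2, 3], "prediction": [0, 0, 0, 0]})

# Assumption: Score is the positive-class probability, so >= 0.5 means class 1
submission["prediction"] = (predictions["Score"] >= 0.5).astype(int)
print(submission)
```

This sets both classes explicitly, so the result does not depend on the placeholder values in the sample file.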


In [16]:
# Drop the helper column before saving
del submission['voted']

In [17]:
submission.to_csv('submission_proba_0.csv', index = False)