- [Chapter 1. Overview](#overview)
- [Chapter 2. Feature Engineering](#feature_engineering)
    - [(1) Loading the Data](#data_import)
    - [(2) Combining the Data](#data_combine)
    - [(3) Handling the Data](#handling_missing_values)
    - [(4) Encoding String Data](#feature_encoding)
    - [(5) Splitting the Dataset](#split_data)
    - [(6) Limitations](#limitation)
- [Chapter 3. Scikit Learn](#scikit_learn)
    - [(1) Splitting the Dataset](#data_split)
    - [(2) Base Model - Decision Tree](#base_model_tree)
    - [(3) Helper Class and Submission Function](#helper_class)
        * [(A) DecisionTreeClassifier](#DecisionTreeClassifier)
        * [(B) RandomForestClassifier](#RandomForestClassifier)
        * [(C) LightGBM](#lightgbm)
        * [(D) Feature Importance](#feature_importance)
        * [(E) Submission](#submission)
- [Chapter 4. PyCaret](#pycaret)
    - [(1) Intro](#intro)
    - [(2) Model Building](#model_building)
        + [(A) Initialize Setup](#initialize_setup)
        + [(B) Comparing All Models](#compare_models)
        + [(C) Create Model](#create_pycaret_model)
        + [(D) Tune Model](#tune_pycaret_model)
        + [(E) Plot Model](#plot_pycaret_model)
        + [(F) Predictions and Submissions](#preds_submissions)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="overview"></a>
## Chapter 1. Overview
- Comparing PyCaret and Scikit-Learn code

<a id="feature_engineering"></a>
## Chapter 2. Feature Engineering
- Comparing PyCaret and Scikit-Learn code

<a id="data_import"></a>
### (1) Loading the Data

In [None]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sb
import os

print("Version Pandas", pd.__version__)
print("Version Matplotlib", matplotlib.__version__)
print("Version Numpy", np.__version__)
print("Version Seaborn", sb.__version__)

os.listdir('../input/tabular-playground-series-apr-2021/')

- Load the data

In [None]:
BASE_DIR = '../input/tabular-playground-series-apr-2021/'
train = pd.read_csv(BASE_DIR + 'train.csv')
test = pd.read_csv(BASE_DIR + 'test.csv')
sample_submission = pd.read_csv(BASE_DIR + 'sample_submission.csv')

train.shape, test.shape, sample_submission.shape

In [None]:
train.head()

In [None]:
test.head()

In [None]:
sample_submission.head()

<a id="data_combine"></a>
### (2) Combining the Data

In [None]:
# Stack train on top of test; ignore_index avoids duplicate row labels
all_df = pd.concat([train, test], ignore_index=True)
all_df.shape

<a id="handling_missing_values"></a>
### (3) Handling the Data

In [None]:
# Start
print("Before Handling:", all_df.shape)

# Age: fill missing values with the rounded mean age of the passenger's Pclass
age_dict = all_df[['Age', 'Pclass']].dropna().groupby('Pclass').mean().round(0).to_dict()
print("Mean Age by Pclass:", age_dict)
all_df['Age'] = all_df['Age'].fillna(all_df.Pclass.map(age_dict['Age']))

# Cabin: label missing cabins, then keep only the first letter as a code;
# missing cabins therefore get the code 'N' (from "No Cabin")
all_df["Cabin"] = all_df["Cabin"].fillna("No Cabin")
print("Values from Cabin: ", all_df["Cabin"].unique())
all_df['Cabin_Code'] = all_df['Cabin'].map(lambda x: x[0].strip())
print("Values from Cabin Code: ", all_df["Cabin_Code"].unique())

# Fare: fill missing values with the overall mean fare
print("Mean Fare:", np.round(all_df['Fare'].mean(), 2))
all_df['Fare'] = all_df['Fare'].fillna(round(all_df['Fare'].mean(), 2))

# Embarked: mark missing ports with the placeholder "X"
all_df["Embarked"] = all_df["Embarked"].fillna("X")
print("Values from Embarked: ", all_df["Embarked"].unique())

# Delete columns that are not used as features
all_df.drop(['Ticket', 'Cabin', 'Name', 'PassengerId'], axis=1, inplace=True)
print("After Handling:", all_df.shape)
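
The group-wise imputation used for `Age` above can be checked on a small example. This is a minimal sketch on made-up values (the `toy` frame is hypothetical, not the competition data); it only illustrates how `fillna` combined with `map` fills each missing age with the mean of that row's `Pclass`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing Age in each class
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# Mean Age per Pclass (NaN values are skipped), rounded as in the cell above
age_by_class = toy.groupby("Pclass")["Age"].mean().round(0)

# Map every row's Pclass to its class mean; fillna uses it only where Age is missing
toy["Age"] = toy["Age"].fillna(toy["Pclass"].map(age_by_class))
print(toy)
```

Rows that already have an age are left untouched; only the two missing entries are replaced.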

<a id="feature_encoding"></a>
### (4) Feature Encoding

In [None]:
all_df.info()

In [None]:
cat_cols = ['Pclass', 'Sex', 'Cabin_Code', 'Embarked']
num_cols = ['Age', 'SibSp', 'Parch', 'Fare', 'Survived']

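# Nominal or ordinal? Create dummy variables.
# Note: pd.get_dummies only expands object/category columns by default,
# so an integer-typed Pclass is passed through unchanged.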
onehot_df = pd.get_dummies(all_df[cat_cols])
print("onehot_df Shape:", onehot_df.shape)

num_df = all_df[num_cols]
print("num_df Shape:", num_df.shape)

all_cleansed_df = pd.concat([num_df, onehot_df], axis=1)
print("all_cleansed_df Shape:", all_cleansed_df.shape)
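
One detail of the encoding step is easy to miss: by default `pd.get_dummies` expands only string-like or categorical columns and passes numeric columns through. The sketch below uses a hypothetical two-column frame to show the difference between the default call and an explicit `columns=` argument; whether `Pclass` should be one-hot encoded is a modelling choice, not something this example decides.

```python
import pandas as pd

# Hypothetical toy frame: one integer column, one string column
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["male", "female", "male"],
})

# Default behaviour: only the string column is expanded
default_encoded = pd.get_dummies(toy)
print(default_encoded.columns.tolist())

# Naming the columns explicitly forces one-hot encoding of Pclass as well
forced_encoded = pd.get_dummies(toy, columns=["Pclass", "Sex"])
print(forced_encoded.columns.tolist())
```

In this notebook the default behaviour is what keeps a plain `Pclass` column available for the stratified split later on.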

<a id="split_data"></a>
### (5) Splitting the Dataset


In [None]:
# Rows are still in train-then-test order, so slice by position
X = all_cleansed_df[:train.shape[0]].copy()  # copy avoids SettingWithCopyWarning
print("X Shape is:", X.shape)
y = X['Survived']
X = X.drop(columns=['Survived'])
test_data = all_cleansed_df[train.shape[0]:].drop(columns=['Survived'])
test_data.info()

In [None]:
X.shape, y.shape

In [None]:
test_data.shape

<a id="limitation"></a>
### (6) Limitations of Feature Engineering
- Poor vs. wealthy passengers (separation by social class) is not captured (X)
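
One possible way to approach the limitation noted above is to bin `Fare` into quantile bands as a rough wealth proxy. This is only a sketch under assumptions: the `Fare_Band` column name, the three bands, and the toy fares are all hypothetical and are not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical fares, not taken from the competition data
toy = pd.DataFrame({"Fare": [5.0, 8.0, 15.0, 30.0, 80.0, 250.0]})

# Split fares into three equally populated bands (terciles)
toy["Fare_Band"] = pd.qcut(toy["Fare"], q=3, labels=["low", "mid", "high"])
print(toy)
```

A band like this could then be one-hot encoded together with the other categorical columns.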

<a id="scikit_learn"></a>
## Chapter 3. Scikit-Learn


In [None]:
!pip install scikit-learn==0.23.2

In [None]:
import sklearn
print(sklearn.__version__)

<a id="data_split"></a>
### (1) Splitting the Dataset
- Stratified Sampling : https://medium.com/@411.codebrain/train-test-split-vs-stratifiedshufflesplit-374c3dbdcc36

In [None]:
from sklearn.model_selection import train_test_split

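# Note: the split is stratified on Pclass (not on the target), so each split keeps the Pclass proportions
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.3, stratify = X[['Pclass']], random_state=42)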
X_train.shape, X_val.shape, y_train.shape, y_val.shape
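
To see what stratification does, the sketch below splits a small imbalanced label vector and compares the positive rate in both parts. The data is synthetic, and unlike the cell above it stratifies on the label itself rather than on `Pclass`; it is meant as an illustration of the `stratify` argument, not a replacement for the notebook's split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=42
)

# With stratify, both parts keep the original 20% positive rate
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```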

- Evaluation metric function

In [None]:
from sklearn.metrics import accuracy_score
def acc_score(y_true, y_pred, **kwargs):
    return accuracy_score(y_true, (y_pred > 0.5).astype(int), **kwargs)
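
As a quick sanity check of the thresholding idea, the sketch below applies the same `> 0.5` cut to a few made-up probabilities and scores the result. The arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.2, 0.7, 0.4, 0.6])  # hypothetical predicted probabilities

# Probabilities above 0.5 become class 1, everything else class 0
y_hat = (y_prob > 0.5).astype(int)
accuracy = accuracy_score(y_true, y_hat)
print(y_hat, accuracy)
```

Two of the four thresholded predictions match the labels, so the accuracy here is 0.5.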

<a id="base_model_tree"></a>
### (2) Base Model - Decision Tree

In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from matplotlib import pyplot as plt

tree_model = DecisionTreeClassifier(max_depth=3)
tree_model.fit(X_train, y_train)
predictions = tree_model.predict_proba(X_val)
AUC = roc_auc_score(y_val, predictions[:,1])
ACC = acc_score(y_val, predictions[:,1])
print("Model AUC:", AUC)
print("Model Accuracy:", ACC)
print("\n")

fpr, tpr, _ = roc_curve(y_val, predictions[:,1])

fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(fpr, tpr)
ax.text(x = 0.3, 
        y = 0.4, 
        s = "Model AUC is {}\n\nModel Accuracy is {}".format(np.round(AUC, 2), np.round(ACC, 2)), 
        fontsize=16, bbox=dict(facecolor='gray', alpha=0.3))
ax.set_xlabel('FPR')
ax.set_ylabel('TPR')
ax.set_title('ROC curve')

plt.show()
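
The relationship between the predicted probabilities, the ROC curve, and the AUC can be verified by hand on a tiny example. The four scores below are made up; AUC is the share of (positive, negative) pairs in which the positive example receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # hypothetical positive-class probabilities

# roc_curve sweeps the decision threshold and returns the FPR/TPR at each step
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc_value = roc_auc_score(y_true, y_score)

# 3 of the 4 positive/negative pairs are ranked correctly
print("AUC:", auc_value)
```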

- Submit the file

In [None]:
final_preds = tree_model.predict(test_data)
# predict() already returns class labels; the threshold only casts them to 0/1 integers
binarizer = np.vectorize(lambda x: 1 if x >= .5 else 0)
prediction_binarized = binarizer(final_preds)
print(prediction_binarized)
# Swap the placeholder Survived column of the sample submission for the predictions
submission = pd.concat([sample_submission,pd.DataFrame(prediction_binarized)], axis=1).drop(columns=['Survived'])
submission.columns = ['PassengerId', 'Survived']
submission.to_csv('submission.csv', index=False)

<a id="helper_class"></a>
### (3) Helper Class and Submission Function

In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score, confusion_matrix
from matplotlib import pyplot as plt

SEED = 0 # for Reproducibility

# class 
class sk_helper(object):
    def __init__(self, model, seed = 0, params=None):
        # Copy params so neither the caller's dict nor a shared default is mutated
        params = dict(params or {})
        params['random_state'] = seed
        self.model = model(**params)
        self.model_name = model.__name__
        
    # train
    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)
        
    # predict (takes a feature matrix, not labels)
    def predict(self, X):
        return self.model.predict(X)
    
    # inner fit
    def fit(self, x, y):
        return self.model.fit(x, y)
    
    # feature importance
    def feature_importances(self, X_train, y_train):
        return self.model.fit(X_train, y_train).feature_importances_
        
    # roc_curve
    def roc_curve_graph(self, X_train, y_train, X_val, y_val):
        self.model.fit(X_train, y_train)
        
        print("model_name:", self.model_name)
        model_name = self.model_name
        preds_proba = self.model.predict_proba(X_val)
        preds = (preds_proba[:, 1] > 0.5).astype(int)
        auc = roc_auc_score(y_val, preds_proba[:, 1])
        acc = accuracy_score(y_val, preds)
        confusion = confusion_matrix(y_val, preds)
        print('Confusion Matrix')
        print(confusion)
        print("Model AUC: {0:.3f}, Model Accuracy: {1:.3f}\n".format(auc, acc))
        # Use this model's own probabilities, not the global `predictions` from the base model
        fpr, tpr, _ = roc_curve(y_val, preds_proba[:, 1])
        fig, ax = plt.subplots(figsize=(10, 6))

        ax.plot(fpr, tpr)
        ax.text(x = 0.3, 
                y = 0.4, 
                s = "Model AUC is {}\n\nModel Accuracy is {}".format(np.round(auc, 2), np.round(acc, 2)), 
                fontsize=16, bbox=dict(facecolor='gray', alpha=0.3))
        ax.set_xlabel('FPR')
        ax.set_ylabel('TPR')
        ax.set_title('ROC curve of {}'.format(model_name), fontsize=16)

        plt.show()

<a id="DecisionTreeClassifier"></a>
#### (A) Decision Tree


In [None]:
%%time
tree_params = {'max_depth' : 6}
tree_model = sk_helper(model=DecisionTreeClassifier, seed=SEED, params=tree_params)
tree_model.roc_curve_graph(X_train, y_train, X_val, y_val)

<a id="RandomForestClassifier"></a>
#### (B) RandomForest


In [None]:
%%time
from sklearn.ensemble import RandomForestClassifier

rf_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    'warm_start': True,
    # 'max_features': 0.2,
    'max_depth': 6,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
    'verbose': 1
}

rf_model = sk_helper(model=RandomForestClassifier, seed=SEED, params=rf_params)
rf_model.roc_curve_graph(X_train, y_train, X_val, y_val)

<a id="lightgbm"></a>
#### (C) LightGBM


In [None]:
%%time

import lightgbm
from lightgbm import LGBMClassifier
print(lightgbm.__version__)
lgb_params = {
    'metric': 'auc',
    'n_estimators': 10000,
    'objective': 'binary',
}

lgb_model = sk_helper(model=LGBMClassifier, seed=SEED, params=lgb_params)
lgb_model.roc_curve_graph(X_train, y_train, X_val, y_val)

<a id="feature_importance"></a>
#### (D) Feature Importance


In [None]:
tree_features = tree_model.feature_importances(X_train, y_train)
rf_features = rf_model.feature_importances(X_train, y_train)
lgb_features = lgb_model.feature_importances(X_train, y_train)

In [None]:
cols = X.columns.values
feature_df = pd.DataFrame({'features': cols, 
                          'Decision Tree': tree_features, 
                          'RandomForest': rf_features, 
                          'LightGBM': lgb_features})

feature_df

In [None]:
%matplotlib inline

import seaborn as sb
import matplotlib.pyplot as plt

width = 0.3
x = np.arange(0, len(feature_df.index))

## ax[0] graph
fig, ax = plt.subplots(nrows = 2, ncols = 1, figsize = (16, 16)) # Option sharex=True
ax[0].bar(x - width/2, feature_df['Decision Tree'], color = "#0095FF", width = width)
ax[0].bar(x + width/2, feature_df['RandomForest'], color = "#E6C0B1", width = width)
ax[0].set_xticks(x)
ax[0].set_xticklabels(feature_df['features'], rotation=90)

## ax[0] legend
colors = {'Decision Tree':'#0095FF', 'RandomForest':'#E6C0B1'} 
labels = list(colors.keys())
handles = [plt.Rectangle((0,0),1,1, color=colors[label]) for label in labels]

ax[0].legend(handles, labels, bbox_to_anchor = (0.95, 0.95))
ax[0].set_title("Feature Importance between Decision Tree and RandomForest", fontsize=20)

## ax[1] graph
ax[1].bar(x, feature_df['LightGBM'], color = "#60F09E")
ax[1].set_xticks(x)
ax[1].set_xticklabels(feature_df['features'], rotation=90)
ax[1].set_title("Feature Importance of LightGBM", fontsize=20)

## plt manage
## plt.xticks(x, feature_df['features'], rotation=90)
plt.tight_layout()
plt.show()
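
The two charts above are on different scales: scikit-learn tree importances sum to 1, while LightGBM's default importance type counts how often a feature is used in a split. If a side-by-side comparison is wanted, one option is to rescale every vector to shares. The helper name `to_share` and the numbers below are hypothetical.

```python
import numpy as np

def to_share(importances):
    """Scale non-negative importances so that they sum to 1."""
    importances = np.asarray(importances, dtype=float)
    total = importances.sum()
    # Leave an all-zero vector unchanged to avoid dividing by zero
    return importances / total if total > 0 else importances

split_counts = np.array([120, 60, 20])  # hypothetical LightGBM-style split counts
shares = to_share(split_counts)
print(shares)
```

Applied to the `LightGBM` column of `feature_df`, this would put all three models on the same 0 to 1 axis.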

<a id="submission"></a>
#### (E) Submission

In [None]:
import numpy as np
from datetime import datetime

version = datetime.now().strftime("%d-%m-%Y %H-%M-%S")

def final_submission(model, data, version):
    final_preds = model.predict(data)
    binarizer = np.vectorize(lambda x: 1 if x >= .5 else 0)
    prediction_binarized = binarizer(final_preds)
    submission = pd.concat([sample_submission,pd.DataFrame(prediction_binarized)], axis=1).drop(columns=['Survived'])
    submission.columns = ['PassengerId', 'Survived']
    submission.to_csv('Sklearn of Submit Date {} Submission.csv'.format(version), index=False)
    
final_submission(lgb_model, test_data, version)

<a id='pycaret'></a>
## Chapter 4. PyCaret

<a id="intro"></a>
### (1) Intro
- URL: https://pycaret.gitbook.io/docs/
> It's an open source low-code machine learning library that aims to reduce cycle time from hypothesis to insights. 

- Point 1. Simple and Easy to use
> All the operations performed in PyCaret are automatically stored in a custom `Pipeline` that is fully orchestrated for `deployment`. 
- Point 2. Python Wrapper
> Around several machine learning libraries and frameworks such as scikit-learn, XGBoost, Microsoft LightGBM, spaCy and many more. 
- Point 3. Train Multiple Models 
> It trains multiple models SIMULTANEOUSLY (interesting!) and outputs a table comparing the performance of each model you developed. 
- Point 4. [PyCaret on GPU](https://pycaret.readthedocs.io/en/latest/installation.html)
> `PyCaret >= 2.2` provides the option to use GPU for select model training and hyperparameter tuning. There is no change in the use of the API, however, in some cases, additional libraries have to be installed as they are not installed with the default slim version or the full version. The following estimators can be trained on GPU.

In [None]:
!pip install pycaret==2.2.3

<a id="model_building"></a>        
### (2) Model Building

<a id="initialize_setup"></a>
### (A) Initial Setup

In [None]:
from pycaret.utils import version
import sklearn
print("pycaret version:", version())
print("sklearn version:", sklearn.__version__)

In [None]:
from pycaret.classification import *

all_df_pycaret = pd.concat([X, y], axis=1)
all_df_pycaret['Survived'] = all_df_pycaret['Survived'].astype('int64')
all_df_pycaret.info()

setup(data = all_df_pycaret, 
      target = 'Survived', 
      fold = 3, # number of cross-validation folds
      silent = True, 
      normalize = True
     )

set_config('seed', 123)

<a id="compare_models"></a>
### (B) Comparing All Models

In [None]:
%%time

best_model = compare_models(sort = 'Accuracy', n_select = 3)
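
For comparison with Chapter 3, a rough scikit-learn analogue of `compare_models` is a loop that cross-validates several estimators and collects the scores in a table. This is a sketch under assumptions: the synthetic data from `make_classification`, the three candidate models, and the `leaderboard` name are illustrative, and PyCaret evaluates many more models and metrics than shown here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the example runs on its own
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "dt": DecisionTreeClassifier(max_depth=3, random_state=0),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
}

rows = []
for name, model in candidates.items():
    # 3-fold cross-validated accuracy, mirroring fold=3 and sort='Accuracy'
    scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring="accuracy")
    rows.append({"model": name, "accuracy": scores.mean()})

leaderboard = (
    pd.DataFrame(rows)
    .sort_values("accuracy", ascending=False)
    .reset_index(drop=True)
)
print(leaderboard)
```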

<a id="create_pycaret_model"></a>
### (C) Create Model


In [None]:
%%time
gbc_model = create_model("gbc")

<a id="tune_pycaret_model"></a>
### (D) Tune Model

In [None]:
%%time
tuned_gbc = tune_model(gbc_model, n_iter = 50)
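
In plain scikit-learn, a roughly comparable tuning step is a randomized search over a parameter grid. The sketch below is an assumption-laden stand-in: the data is synthetic, the parameter ranges are arbitrary, and `n_iter=5` is used only to keep it fast (the cell above uses 50 iterations). It is not a description of what `tune_model` does internally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data so the example runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical search space for a gradient boosting classifier
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.2],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```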

<a id="plot_pycaret_model"></a>
### (E) Plot Model

In [None]:
plot_model(tuned_gbc, plot = "confusion_matrix")

In [None]:
plot_model(tuned_gbc, plot = "feature_all")

In [None]:
plot_model(tuned_gbc, plot="auc")

<a id="preds_submissions"></a>
### (F) Prediction and Submission

In [None]:
predictions = predict_model(tuned_gbc, data = test_data)
predictions.info()

In [None]:
submission = pd.read_csv(BASE_DIR + 'sample_submission.csv')
submission['Survived'] = predictions['Label']
submission.to_csv('PyCaret Submission.csv', index=False)
submission.head()

1) First pass with PyCaret
2) Check the details <-- feature engineering
3) Restructure the data
