# [DATT] Project 1. Should the selected final model be used as is, or should it be refit on the full dataset?

- [Background]
  - While analyzing data, after choosing the final model and its optimal hyper-parameters, I have wondered: should I use the model fit on the train data, or should I combine the train and test data and use a model refit on all of it? Intuitively, the optimal hyper-parameters were selected using the train data only, which led to a further question: is it really right to refit on train + test combined? As the first DATT project, I test this question experimentally on several datasets.


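- [Hypothesis]
  - Hypothesis 1. With the same hyper-parameters, the model fit on the train sample and the model fit on the full sample will have the same accuracy.
  - Hypothesis 2. With the same hyper-parameters, the model fit on the train sample and the model fit on the full sample will have different accuracy. If they differ, which sample yields the higher accuracy? ($\alpha=0.05$)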


- [Setting]
  - Data Case : 4 datasets in total; each is a Binary Classification problem, and the class balance of y differs across datasets (minority class about 50%, 37%, 25%, 21%)
  - Iteration : 100 runs, one per seed
  - Data Setting by each Iteration
    - Unknown Data : 20% of all data
    - Known Data : 80% of all data
      - Train Data : 80% of Known Data
      - Test Data : 20% of Known Data
  - Cross-Validation for Optimal Hyper-Parameter : 10-Fold
  - Preprocessing
    - Continuous variables : used as is
    - Categorical variables : Label Encoding or One-Hot Encoding, depending on the variable
  - Model : Random Forest Model
    - Number of Trees : [50, 100, 150, 300]
    - Max Depth : [None, 10, 20, 30]
    - Max Features : [auto, sqrt, log2] (for a Random Forest classifier, auto is equivalent to sqrt)
  - Fitting & Prediction (Same Hyper-parameter)
    - Train Model : fit on Train Data only (`x_train` in the code) -> predict unknown data y from unknown data x
    - Full Model : fit on Train + Test Data (the Known Data, `train_x` in the code) -> predict unknown data y from unknown data x


- [Project Step]
  - Step 1. Data Import & Data Preprocessing
  - Step 2. Split Unknown & Known Data by seed
  - Step 3. Split train & test Data from Known Data by seed
  - Step 4. Search for Optimal Hyper-Parameter by F1-Score
  - Step 5. Model Fitting & Prediction : Train sample & Full sample (same Hyper-Parameter)
  - Step 6. Comparison Prediction Results : Full sample - Train sample
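
The nested split described in [Setting] can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's code: `nested_split` is a hypothetical helper, and rounding the held-out size with `round` is an assumption of this sketch. The notebook itself calls `train_test_split` twice with the same seed.

```python
import random


def nested_split(n_samples, holdout=0.2, seed=0):
    """Split row indices into train / test / unknown parts.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # First split: hold out `holdout` of all data as Unknown Data.
    n_unknown = round(holdout * n_samples)
    unknown, known = indices[:n_unknown], indices[n_unknown:]

    # Second split: hold out `holdout` of the Known Data as Test Data.
    n_test = round(holdout * len(known))
    test, train = known[:n_test], known[n_test:]
    return train, test, unknown


train, test, unknown = nested_split(1000, seed=42)
print(len(train), len(test), len(unknown))  # 640 160 200
```

With 1000 rows the Train Model sees 64% of all data and the Full Model sees 80%, so the comparison in this project is between a model fit on 640 rows and one fit on 800 rows, both scored on the same 200 unknown rows.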

## Setting. Packages Import

In [1]:
### Base packages
import numpy as np
import pandas as pd
import random 

### Evaluation packages
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score , recall_score , f1_score

### Model packages
from sklearn.ensemble import RandomForestClassifier

### Data packages
from sklearn.datasets import load_breast_cancer
import seaborn as sns

### etc
import warnings
warnings.filterwarnings("ignore")

In [2]:
### Evaluation Function
def evaluation(y_test , pred):
    accuracy = np.round(accuracy_score(y_test , pred),4)
    f1 = np.round(f1_score(y_test,pred),4)
    precision = np.round(precision_score(y_test , pred),4)
    recall = np.round(recall_score(y_test , pred),4)
    
    return pd.DataFrame({'Accuracy':[accuracy], 'F1':[f1], 'Precision':[precision], 'Recall':[recall]})
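
As a reading aid for the `evaluation` function, the four metrics can be written out from the confusion counts. The sketch below is a dependency-free illustration, not a replacement for the scikit-learn calls: `binary_metrics` and the toy labels are hypothetical. It treats label 1 as the positive class, which is also the scikit-learn default (`pos_label=1`) for precision, recall and F1.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, F1, precision and recall for 0/1 labels (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'Accuracy': accuracy, 'F1': f1, 'Precision': precision, 'Recall': recall}


# Toy labels (hypothetical): 3 TP, 3 TN, 1 FP, 1 FN
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Because the positive class is label 1, the F1, precision and recall reported for each case describe whichever class is coded as 1 in that dataset.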

## Case 1. Breast Cancer Data

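- A dataset for predicting breast cancer, made up of 30 continuous features and the target y (breast cancer status).
- Class 1 makes up 63% of y and class 0 makes up 37% (the summary tables label this case by the 37% share).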

In [16]:
### [Step 1] Data Import & Data Preprocessing

## Data Import
cancer = load_breast_cancer()

## Data Split
x = cancer.data
y = cancer.target

## Proportion of y
pd.DataFrame(y).value_counts()/len(y)

1    0.627417
0    0.372583
dtype: float64

In [143]:
### Validation for 100-Iteration
train_result = pd.DataFrame()
full_result = pd.DataFrame()
i = 0

for set_seed in range(100) :
    
    ### [Step 2] Split Unknown & Known Data by seed
    train_x, unknown_x, train_y, unknown_y = train_test_split(x, y, test_size = 0.2, random_state = set_seed)
    
    ### [Step 3] Split train & test Data from Known Data by seed
    x_train, x_test, y_train, y_test = train_test_split(train_x, train_y, test_size = 0.2, random_state = set_seed)
    
    ### [Step 4] Search for Optimal Hyper-Parameter by F1-Score
    
    ## Base Model
    rf_model = RandomForestClassifier(random_state = set_seed)
    
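    ## Parameter Setting
    ## ('auto' is omitted: for RandomForestClassifier it was an alias of 'sqrt',
    ##  and recent scikit-learn releases no longer accept it)
    par = {'n_estimators':[50,100,150,300],
           'max_depth':[None,10,20,30],
           'max_features':['sqrt', 'log2']}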
    
    ## Stratified 10-Fold
    train_rf = GridSearchCV(rf_model, param_grid = par, scoring = 'f1', cv = 10)
    
    ### [Step 5] Model Fitting & Prediction : Train sample & Full sample (same Hyper-Parameter)
    
    ## Train sample Model Fitting
    train_rf.fit(x_train, y_train)
    
    ## Full sample Model Fitting
    best_par = train_rf.best_params_
    full_rf = RandomForestClassifier(random_state = set_seed,
                                     max_depth = best_par['max_depth'],
                                     max_features = best_par['max_features'],
                                     n_estimators = best_par['n_estimators'])
    full_rf.fit(train_x, train_y)
    
    ## Prediction
    train_pred = train_rf.predict(unknown_x)
    full_pred = full_rf.predict(unknown_x)  
    
    train_result = pd.concat([train_result,evaluation(unknown_y, train_pred)],axis=0).reset_index(drop = True)
    full_result = pd.concat([full_result,evaluation(unknown_y, full_pred)],axis=0).reset_index(drop = True)
    
    print("Finish Iteration",i+1)
    i += 1

Finish Iteration 1
Finish Iteration 2
... (output truncated)

In [30]:
### Results

## Saving Results
#train_result.to_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data1_train_result.csv", index = False)
#full_result.to_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data1_full_result.csv", index = False)
train_result = pd.read_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data1_train_result.csv")
full_result = pd.read_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data1_full_result.csv")

In [31]:
### [Step 6] Comparison Prediction Results : Full sample - Train sample
result1 = pd.DataFrame()
result1['Accuracy_full_train'] = full_result['Accuracy'] - train_result['Accuracy']
result1['F1_full_train'] = full_result['F1'] - train_result['F1']
result1['Precision_full_train'] = full_result['Precision'] - train_result['Precision']
result1['Recall_full_train'] = full_result['Recall'] - train_result['Recall']
result1.describe()

| | Accuracy_full_train | F1_full_train | Precision_full_train | Recall_full_train |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 0.003775 | 0.002999 | 0.003312 | 0.002712 |
| std | 0.011312 | 0.00886 | 0.01244 | 0.011291 |
| min | -0.0351 | -0.0273 | -0.0274 | -0.05 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0088 | 0.0071 | 0.01315 | 0.0133 |
| max | 0.0351 | 0.028 | 0.0474 | 0.0299 |


## Case 2. Wine Data

- A Binary Classification problem that distinguishes whether a wine is red or white, using 11 features.
- y consists of 75% class 0 (white) and 25% class 1 (red).

In [21]:
### [Step 1] Data Import & Data Preprocessing

## Data Import
red = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
white = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')
red['color'] = 1
white['color'] = 0

## Data Split
data = pd.concat([red,white],axis = 0).reset_index(drop = True)
x = data.drop(['color','quality'],axis=1)
y = data['color']

## Proportion of y
pd.DataFrame(y).value_counts()/len(y)

color
0        0.753886
1        0.246114
dtype: float64

In [193]:
### Validation for 100-Iteration
train_result = pd.DataFrame()
full_result = pd.DataFrame()
i = 0

for set_seed in range(100) :
    
    ### [Step 2] Split Unknown & Known Data by seed
    train_x, unknown_x, train_y, unknown_y = train_test_split(x, y, test_size = 0.2, random_state = set_seed)
    
    ### [Step 3] Split train & test Data from Known Data by seed
    x_train, x_test, y_train, y_test = train_test_split(train_x, train_y, test_size = 0.2, random_state = set_seed)
    
    ### [Step 4] Search for Optimal Hyper-Parameter by F1-Score
    
    ## Base Model
    rf_model = RandomForestClassifier(random_state = set_seed)
    
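    ## Parameter Setting
    ## ('auto' is omitted: for RandomForestClassifier it was an alias of 'sqrt',
    ##  and recent scikit-learn releases no longer accept it)
    par = {'n_estimators':[50,100,150,300],
           'max_depth':[None,10,20,30],
           'max_features':['sqrt', 'log2']}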
    
    ## Stratified 10-Fold
    train_rf = GridSearchCV(rf_model, param_grid = par, scoring = 'f1', cv = 10)
    
    ### [Step 5] Model Fitting & Prediction : Train sample & Full sample (same Hyper-Parameter)
    
    ## Train sample Model Fitting
    train_rf.fit(x_train, y_train)
    
    ## Full sample Model Fitting
    best_par = train_rf.best_params_
    full_rf = RandomForestClassifier(random_state = set_seed,
                                     max_depth = best_par['max_depth'],
                                     max_features = best_par['max_features'],
                                     n_estimators = best_par['n_estimators'])
    full_rf.fit(train_x, train_y)
    
    ## Prediction
    train_pred = train_rf.predict(unknown_x)
    full_pred = full_rf.predict(unknown_x)  
    
    train_result = pd.concat([train_result,evaluation(unknown_y, train_pred)],axis=0).reset_index(drop = True)
    full_result = pd.concat([full_result,evaluation(unknown_y, full_pred)],axis=0).reset_index(drop = True)
    
    print("Finish Iteration",i+1)
    i += 1

Finish Iteration 1
Finish Iteration 2
... (output truncated)

In [32]:
### Results

## Saving Results
#train_result.to_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data2_train_result.csv", index = False)
#full_result.to_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data2_full_result.csv", index = False)
train_result = pd.read_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data2_train_result.csv")
full_result = pd.read_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data2_full_result.csv")

In [33]:
### [Step 6] Comparison Prediction Results : Full sample - Train sample
result2 = pd.DataFrame()
result2['Accuracy_full_train'] = full_result['Accuracy'] - train_result['Accuracy']
result2['F1_full_train'] = full_result['F1'] - train_result['F1']
result2['Precision_full_train'] = full_result['Precision'] - train_result['Precision']
result2['Recall_full_train'] = full_result['Recall'] - train_result['Recall']
result2.describe()

| | Accuracy_full_train | F1_full_train | Precision_full_train | Recall_full_train |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 0.000294 | 0.000607 | 0.000158 | 0.001027 |
| std | 0.001128 | 0.002343 | 0.002177 | 0.004232 |
| min | -0.0023 | -0.0049 | -0.0067 | -0.0129 |
| 25% | -0.000175 | -0.00045 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0008 | 0.0016 | 0.0001 | 0.0031 |
| max | 0.0031 | 0.0066 | 0.0061 | 0.0129 |


## Case 3. Titanic Data

- A Binary Classification problem predicting survival on the Titanic; the data were artificially subsampled so that the share of y = 1 is about 21%.
- Five features are used in total; sex is stored as a string, so it was one-hot encoded.
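
The subsampling used to reach the 21% share can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: `downsample_positives` is a hypothetical helper, and the toy label vector uses 549 negatives and 342 positives, the class counts of the public Titanic data that are consistent with the 151 sampled survivors and the 0.215714 proportion shown in the cell output.

```python
import random


def downsample_positives(labels, n_keep, seed=1234):
    """Keep every negative row and a random subset of `n_keep` positive rows.

    Returns the sorted indices of the rows that are kept.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    return sorted(negatives + rng.sample(positives, n_keep))


# Toy labels with the assumed class counts (549 negatives, 342 positives)
labels = [0] * 549 + [1] * 342
kept = downsample_positives(labels, n_keep=151)
ratio = sum(labels[i] for i in kept) / len(kept)
print(len(kept), round(ratio, 6))  # 700 0.215714
```

Only the positive class is thinned, so all negative rows survive and the positive share drops from 342/891 to 151/700.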

In [12]:
### [Step 1] Data Import & Data Preprocessing

## Data Import
data = sns.load_dataset('titanic')

np.random.seed(1234)
idx = data[data.survived == 1].index
idx1 = np.random.choice(idx, size = 151,replace = False)
data = data[(data.survived == 0) | (data.index.isin(idx1))].reset_index(drop=True)

## Data Preprocessing

## Missing Value Check (counts only; no rows are removed here)
data.isnull().sum()

## Encoding
data = data[['survived','pclass','sex','sibsp','parch','fare']]
data = pd.get_dummies(data, columns=['sex'])

## Data Split
x = data.drop(['survived'], axis = 1)
y = data['survived']

## Proportion of y
pd.DataFrame(y).value_counts()/len(y)

survived
0           0.784286
1           0.215714
dtype: float64

In [121]:
### Validation for 100-Iteration
train_result = pd.DataFrame()
full_result = pd.DataFrame()
i = 0

for set_seed in range(100) :
    
    ### [Step 2] Split Unknown & Known Data by seed
    train_x, unknown_x, train_y, unknown_y = train_test_split(x, y, test_size = 0.2, random_state = set_seed)
    
    ### [Step 3] Split train & test Data from Known Data by seed
    x_train, x_test, y_train, y_test = train_test_split(train_x, train_y, test_size = 0.2, random_state = set_seed)
    
    ### [Step 4] Search for Optimal Hyper-Parameter by F1-Score
    
    ## Base Model
    rf_model = RandomForestClassifier(random_state = set_seed)
    
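    ## Parameter Setting
    ## ('auto' is omitted: for RandomForestClassifier it was an alias of 'sqrt',
    ##  and recent scikit-learn releases no longer accept it)
    par = {'n_estimators':[50,100,150,300],
           'max_depth':[None,10,20,30],
           'max_features':['sqrt', 'log2']}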
    
    ## Stratified 10-Fold
    train_rf = GridSearchCV(rf_model, param_grid = par, scoring = 'f1', cv = 10)
    
    ### [Step 5] Model Fitting & Prediction : Train sample & Full sample (same Hyper-Parameter)
    
    ## Train sample Model Fitting
    train_rf.fit(x_train, y_train)
    
    ## Full sample Model Fitting
    best_par = train_rf.best_params_
    full_rf = RandomForestClassifier(random_state = set_seed,
                                     max_depth = best_par['max_depth'],
                                     max_features = best_par['max_features'],
                                     n_estimators = best_par['n_estimators'])
    full_rf.fit(train_x, train_y)
    
    ## Prediction
    train_pred = train_rf.predict(unknown_x)
    full_pred = full_rf.predict(unknown_x)  
    
    train_result = pd.concat([train_result,evaluation(unknown_y, train_pred)],axis=0).reset_index(drop = True)
    full_result = pd.concat([full_result,evaluation(unknown_y, full_pred)],axis=0).reset_index(drop = True)
    
    print("Finish Iteration",i+1)
    i += 1

Finish Iteration 1
Finish Iteration 2
... (output truncated)

In [34]:
### Results

## Saving Results
#train_result.to_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data3_train_result.csv", index = False)
#full_result.to_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data3_full_result.csv", index = False)

train_result = pd.read_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data3_train_result.csv")
full_result = pd.read_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data3_full_result.csv")

In [35]:
### [Step 6] Comparison Prediction Results : Full sample - Train sample
result3 = pd.DataFrame()
result3['Accuracy_full_train'] = full_result['Accuracy'] - train_result['Accuracy']
result3['F1_full_train'] = full_result['F1'] - train_result['F1']
result3['Precision_full_train'] = full_result['Precision'] - train_result['Precision']
result3['Recall_full_train'] = full_result['Recall'] - train_result['Recall']
result3.describe()

| | Accuracy_full_train | F1_full_train | Precision_full_train | Recall_full_train |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 0.007787 | 0.016564 | 0.026847 | 0.009862 |
| std | 0.015566 | 0.040892 | 0.05935 | 0.054871 |
| min | -0.0357 | -0.0775 | -0.1465 | -0.1379 |
| 25% | 0.0 | -0.009025 | -0.006425 | -0.0312 |
| 50% | 0.0071 | 0.0115 | 0.02605 | 0.0 |
| 75% | 0.0143 | 0.04225 | 0.05325 | 0.0548 |
| max | 0.05 | 0.1304 | 0.2273 | 0.125 |


## Case 4. Penguins Data

- A Binary Classification problem predicting the sex of a penguin, using 6 features in total.
- The two classes of y each make up about 50%.

In [27]:
### [Step 1] Data Import & Data Preprocessing

## Data Import
data = sns.load_dataset('penguins')

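## Data Encoding
data = pd.get_dummies(data, columns=['species','island'])
## Assign the mapped column back instead of a chained inplace replace,
## which newer pandas versions warn about; missing sex values stay NaN
data['sex'] = data['sex'].map({'Female': 0, 'Male': 1})

## Missing Value drop
data = data.dropna(how='any')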

## Data Split
x = data.drop(['sex'], axis = 1)
y = data['sex']

## Proportion of y
pd.DataFrame(y).value_counts()/len(y)

sex
1.0    0.504505
0.0    0.495495
dtype: float64

In [67]:
### Validation for 100-Iteration
train_result = pd.DataFrame()
full_result = pd.DataFrame()
i = 0

for set_seed in range(100) :
    
    ### [Step 2] Split Unknown & Known Data by seed
    train_x, unknown_x, train_y, unknown_y = train_test_split(x, y, test_size = 0.2, random_state = set_seed)
    
    ### [Step 3] Split train & test Data from Known Data by seed
    x_train, x_test, y_train, y_test = train_test_split(train_x, train_y, test_size = 0.2, random_state = set_seed)
    
    ### [Step 4] Search for Optimal Hyper-Parameter by F1-Score
    
    ## Base Model
    rf_model = RandomForestClassifier(random_state = set_seed)
    
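    ## Parameter Setting
    ## ('auto' is omitted: for RandomForestClassifier it was an alias of 'sqrt',
    ##  and recent scikit-learn releases no longer accept it)
    par = {'n_estimators':[50,100,150,300],
           'max_depth':[None,10,20,30],
           'max_features':['sqrt', 'log2']}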
    
    ## Stratified 10-Fold
    train_rf = GridSearchCV(rf_model, param_grid = par, scoring = 'f1', cv = 10)
    
    ### [Step 5] Model Fitting & Prediction : Train sample & Full sample (same Hyper-Parameter)
    
    ## Train sample Model Fitting
    train_rf.fit(x_train, y_train)
    
    ## Full sample Model Fitting
    best_par = train_rf.best_params_
    full_rf = RandomForestClassifier(random_state = set_seed,
                                     max_depth = best_par['max_depth'],
                                     max_features = best_par['max_features'],
                                     n_estimators = best_par['n_estimators'])
    full_rf.fit(train_x, train_y)
    
    ## Prediction
    train_pred = train_rf.predict(unknown_x)
    full_pred = full_rf.predict(unknown_x)  
    
    train_result = pd.concat([train_result,evaluation(unknown_y, train_pred)],axis=0).reset_index(drop = True)
    full_result = pd.concat([full_result,evaluation(unknown_y, full_pred)],axis=0).reset_index(drop = True)
    
    print("Finish Iteration",i+1)
    i += 1

Finish Iteration 1
Finish Iteration 2
... (output truncated)

In [36]:
### Results

## Saving Results
#train_result.to_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data4_train_result.csv", index = False)
#full_result.to_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data4_full_result.csv", index = False)
train_result = pd.read_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data4_train_result.csv")
full_result = pd.read_csv("C:/Users/kyt28/OneDrive/바탕 화면/Data이모저모/Project1/data4_full_result.csv")

In [37]:
### [Step 6] Comparison Prediction Results : Full sample - Train sample
result4 = pd.DataFrame()
result4['Accuracy_full_train'] = full_result['Accuracy'] - train_result['Accuracy']
result4['F1_full_train'] = full_result['F1'] - train_result['F1']
result4['Precision_full_train'] = full_result['Precision'] - train_result['Precision']
result4['Recall_full_train'] = full_result['Recall'] - train_result['Recall']
result4.describe()

| | Accuracy_full_train | F1_full_train | Precision_full_train | Recall_full_train |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 0.004332 | 0.004968 | 0.001868 | 0.007393 |
| std | 0.02238 | 0.02408 | 0.031474 | 0.036971 |
| min | -0.0448 | -0.0525 | -0.0879 | -0.0667 |
| 25% | -0.0149 | -0.013625 | -0.007725 | -0.025775 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0298 | 0.026275 | 0.024725 | 0.0303 |
| max | 0.0448 | 0.0513 | 0.0729 | 0.0909 |


## Project 1. Summary

- First, in each of the 100 iterations, one model was fit on the Full sample and one on the Train sample, and both were evaluated on the unknown data.
- Each model was scored with four evaluation metrics: Accuracy, F1-Score, Precision, and Recall.
- Then, for each case and every metric, the difference (Full sample minus Train sample) was computed.
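
One way to write this comparison down is given here as a sketch. The symbols $d_i$, $m_i^{\text{full}}$, $m_i^{\text{train}}$ and $\bar{d}$ are notation introduced for illustration and do not appear in the original notebook. For a given case and metric, let $m_i^{\text{full}}$ and $m_i^{\text{train}}$ be the scores of the two models on the unknown data in iteration $i$:

$$
d_i = m_i^{\text{full}} - m_i^{\text{train}}, \quad i = 1, \dots, 100,
\qquad
\bar{d} = \frac{1}{100} \sum_{i=1}^{100} d_i
$$

Hypothesis 1 and Hypothesis 2 then correspond to

$$
H_0 : \mathbb{E}[d_i] = 0
\quad \text{vs.} \quad
H_1 : \mathbb{E}[d_i] \neq 0
\qquad (\alpha = 0.05)
$$

The `mean` and `std` rows of each `describe()` output are the sample mean $\bar{d}$ and the sample standard deviation of the $d_i$.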

### Summary 1. Difference Analysis

In [107]:
cases = ['Case 1. P(Y=1)=37%','Case 2. P(Y=1)=25%','Case 3. P(Y=1)=21%','Case 4. P(Y=1)=50%']
results = [result1, result2, result3, result4]

## "mean (std)" of the Full - Train differences: one row per metric, one column per case
result_1 = pd.DataFrame(
    {case: [f"{np.round(m,6)} ({np.round(s,6)})" for m, s in zip(res.mean(), res.std())]
     for case, res in zip(cases, results)},
    index = ['Accuracy','F1-Score','Precision','Recall'])
result_1

| | Case 1. P(Y=1)=37% | Case 2. P(Y=1)=25% | Case 3. P(Y=1)=21% | Case 4. P(Y=1)=50% |
|---|---|---|---|---|
| Accuracy | 0.003775 (0.011312) | 0.000294 (0.001128) | 0.007787 (0.015566) | 0.004332 (0.02238) |
| F1-Score | 0.002999 (0.00886) | 0.000607 (0.002343) | 0.016564 (0.040892) | 0.004968 (0.02408) |
| Precision | 0.003312 (0.01244) | 0.000158 (0.002177) | 0.026847 (0.05935) | 0.001868 (0.031474) |
| Recall | 0.002712 (0.011291) | 0.001027 (0.004232) | 0.009862 (0.054871) | 0.007393 (0.036971) |


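- Each cell shows the mean (standard deviation) of the 100 per-iteration differences.
- In every case, all four metrics were slightly higher for the model fit on the Full sample than for the model fit on the Train sample.
- However, the Confidence Intervals of the four metrics contain 0 in every case, so no statistically significant difference is found at the 5% significance level.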
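
The text above refers to confidence intervals, while the table reports a mean and a standard deviation. As a hedged sketch (not the author's calculation), the snippet below contrasts two intervals that can be built from those numbers: mean ± 1.96·SD, which describes the spread of single-iteration differences, and mean ± 1.96·SD/√n, a normal-approximation interval for the average difference. `mean_diff_interval` is a hypothetical helper, z = 1.96 is an assumption, and the second interval treats the 100 iterations as independent even though they resample the same dataset, so it is probably too narrow.

```python
import math


def mean_diff_interval(mean, std, n, z=1.96):
    """Return (spread interval, interval for the mean) of paired differences.

    spread : mean ± z * std            (range of single-iteration differences)
    ci     : mean ± z * std / sqrt(n)  (uncertainty of the average difference,
                                        assuming independent iterations)
    """
    se = std / math.sqrt(n)
    spread = (mean - z * std, mean + z * std)
    ci = (mean - z * se, mean + z * se)
    return spread, ci


# Accuracy, Case 1: mean and std taken from the summary table
spread, ci = mean_diff_interval(0.003775, 0.011312, 100)
print("mean ± 1.96·SD      :", tuple(round(v, 6) for v in spread))
print("mean ± 1.96·SD/√n   :", tuple(round(v, 6) for v in ci))
```

With n = 100 the second interval is ten times narrower than the first, so which of the two is meant by "confidence interval" affects whether 0 falls inside it.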

### Summary 2. Positive Proportion

In [120]:
result_2 = pd.DataFrame()
for res in [result1, result2, result3, result4] :
    ## Share of the 100 iterations in which Full - Train is positive, zero, or negative
    shares = pd.concat([(res > 0).sum()/100, (res == 0).sum()/100, (res < 0).sum()/100], axis=1)
    result_2 = pd.concat([result_2,shares],axis=0)
result_2 = result_2.reset_index().reset_index()
result_2.columns = ['Case','Metric','P(Full > Train)','P(Full = Train)','P(Full < Train)']
result_2['Case'] = np.repeat(["Case 1. P(Y=1)=37%","Case 2. P(Y=1)=25%","Case 3. P(Y=1)=21%","Case 4. P(Y=1)=50%"], 4)
result_2

| | Case | Metric | P(Full > Train) | P(Full = Train) | P(Full < Train) |
|---|---|---|---|---|---|
| 0 | Case 1. P(Y=1)=37% | Accuracy_full_train | 0.42 | 0.43 | 0.15 |
| 1 | Case 1. P(Y=1)=37% | F1_full_train | 0.45 | 0.36 | 0.19 |
| 2 | Case 1. P(Y=1)=37% | Precision_full_train | 0.45 | 0.36 | 0.19 |
| 3 | Case 1. P(Y=1)=37% | Recall_full_train | 0.28 | 0.62 | 0.1 |
| 4 | Case 2. P(Y=1)=25% | Accuracy_full_train | 0.42 | 0.33 | 0.25 |
| 5 | Case 2. P(Y=1)=25% | F1_full_train | 0.44 | 0.29 | 0.27 |
| 6 | Case 2. P(Y=1)=25% | Precision_full_train | 0.27 | 0.57 | 0.16 |
| 7 | Case 2. P(Y=1)=25% | Recall_full_train | 0.39 | 0.37 | 0.24 |
| 8 | Case 3. P(Y=1)=21% | Accuracy_full_train | 0.63 | 0.16 | 0.21 |
| 9 | Case 3. P(Y=1)=21% | F1_full_train | 0.63 | 0.08 | 0.29 |

... (remaining rows truncated in the output)


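- Looking, for each case, at the share of iterations in which the Full sample model scored higher than, equal to, or lower than the Train sample model, the model fit on the Full sample came out ahead more often on every metric.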
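
The win and loss shares in this table can also be examined with an exact sign test, sketched below with the standard library only. This is an illustrative addition, not part of the original analysis: `sign_test_p_value` is a hypothetical helper, ties are dropped before testing, and the test assumes independent iterations, which the repeated splits of one dataset only approximate, so the p-value is likely optimistic.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided exact sign test for paired comparisons (ties already removed).

    Under the null hypothesis, a win and a loss are equally likely.
    """
    n = wins + losses
    k = max(wins, losses)
    # Probability of a result at least as one-sided as the observed one
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Case 1, Accuracy: Full > Train in 42 iterations, Full < Train in 15 (43 ties dropped)
p = sign_test_p_value(42, 15)
print(f"p = {p:.4g}")
```

The sign test uses only the direction of each difference, so it complements the mean-based comparison in Summary 1 without relying on the size of the differences.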

### Conclusion

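- Selecting the optimal Hyper-Parameters on the Train data and then refitting on all samples (Train + Test) gave slightly higher values on all four Metrics in every Case than predicting with the model fit on the Train data alone. (Random Forest)
- However, from a statistical point of view, these differences cannot be called statistically significant. (at the 5% significance level)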

### Opinion
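- This experiment gave me a chance to settle my question. As expected, the model refit on the Full sample scored higher on every metric than the model fit on the Train sample. However, because these differences are not statistically significant, whether to refit after selecting the final model during an analysis still deserves some thought.
- In general, when analyzing a moderate amount of data, it seems appropriate to refit on the Full sample and obtain a model that performs even slightly better than the original. With a large amount of data, however, fitting takes considerable time and cost, so using the model built on the Train sample as is, without refitting, seems the right choice. (More sophisticated models generally take more time and money, and from a company's point of view everything is a cost.)
- Through this study, I can now judge, according to the analysis situation and with time and cost in mind, whether to refit the final model on the Full sample or to keep the model fit on the Train sample.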