## Albiti-4th-Week-TeamA
**- 김서연, 신예진, 최정윤 -**

1. Goal: Understand diabetes-related indicators (open data and meta info).
2. Period: 2021.05.31 ~ 2021.06.06.
3. Task 1. Build a classification model using the Pima dataset.
    - Download the Pima dataset from Kaggle.
    - Build a model with Accuracy of at least 70% and F1 of at least 70%.
4. Task 2. Higher and higher.
    - Accuracy of at least 85% and F1 of at least 85%.
_____________

# EDA

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import PercentFormatter
import warnings
warnings.filterwarnings(action='ignore')

In [None]:
df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
df.head()

### Understanding the Columns

`Pregnancies`  : Number of times pregnant

`Glucose` : Plasma glucose concentration at 2 hours in an oral glucose tolerance test

`BloodPressure` : Diastolic blood pressure (mm Hg)

`SkinThickness` : Triceps skin fold thickness (mm)

`Insulin` : 2-Hour serum insulin (mu U/ml)

`BMI` : Body mass index (weight in kg/(height in m)^2)

`DiabetesPedigreeFunction` : Diabetes pedigree function

`Age` : Age (years)

`Outcome` : Class variable (0 or 1) 268 of 768 are 1, the others are 0

### Null & DType Check
- No missing values.
- No object columns.

In [None]:
df.info()

### Basic Statistics
- Many columns have a minimum of 0.
- However, most of these columns cannot actually take a value of 0 (e.g. BloodPressure).

In [None]:
df.describe()
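
The observation about 0 values can be quantified by counting exact zeros per column. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the dataset); on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy rows standing in for df; the values are illustrative only.
toy = pd.DataFrame({
    'Glucose':       [148, 85, 0, 89],
    'BloodPressure': [72, 0, 64, 0],
    'Insulin':       [0, 0, 0, 94],
    'Pregnancies':   [6, 1, 0, 1],   # 0 is a legitimate value for this column
})

# Count how many rows are exactly 0 in each column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

Columns such as `Pregnancies` can legitimately be 0, so a count like this only flags candidates; which columns to treat as missing is still a domain decision.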

### Distribution Plot by Outcome
- Quite a lot of data points have a value of 0 (e.g. Insulin).
- The Outcome classes are imbalanced at a ratio of roughly 2:1.

In [None]:
fig, ax = plt.subplots(3, 3, figsize=(12,10))

for i in range(len(df.columns)-1):
    name = df.columns[i]
    sns.distplot(df[df['Outcome'] == 0][name], color='green', ax=ax[i//3, i%3])
    sns.distplot(df[df['Outcome'] == 1][name], color='red', ax=ax[i//3, i%3])
    ax[i//3, i%3].set_title(f'Healthy vs Diabetic by {name}')

sns.countplot(x='Outcome', data=df, ax=ax[2, 2])  # draw explicitly on the last subplot
ax[2, 2].set_title('Outcome')

fig.tight_layout()
plt.show()
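
The roughly 2:1 imbalance can be checked with simple arithmetic using the class counts stated in the column description (268 of 768 rows are 1). This is a small worked example, not part of the original notebook.

```python
# Class counts taken from the column description above.
n_total = 768
n_pos = 268                    # Outcome == 1
n_neg = n_total - n_pos        # Outcome == 0

ratio = n_neg / n_pos          # negatives per positive
pos_share = n_pos / n_total    # share of positives
baseline_acc = n_neg / n_total # accuracy of always predicting 0

print(f'{n_neg}:{n_pos} -> {ratio:.2f}:1, positives {pos_share:.1%}, baseline accuracy {baseline_acc:.1%}')
```

A model that always predicts 0 would already reach about 65% accuracy, which is why F1 is tracked alongside Accuracy in the task targets.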

### Box Plot
- Several features show a very large number of outliers.
- Some features also appear to be affected by the 0 values.

In [None]:
fig, ax = plt.subplots(3, 3, figsize=(12,8))

for i in range(len(df.columns)-1):
    name = df.columns[i]
    sns.boxplot(x= df[name], ax=ax[i//3, i%3])
    ax[i//3, i%3].set_title(f'{name} Box Plot')

sns.countplot(x='Outcome', data=df, ax=ax[2, 2])  # draw explicitly on the last subplot
ax[2, 2].set_title('Outcome')

fig.tight_layout()
plt.show()

### Correlation Coefficients
- We decided not to drop any variables.

In [None]:
# Pearson correlation between the columns (computed once, directly on df)
cor = df.corr(method='pearson')
mask = np.triu(np.ones_like(cor, dtype=bool))
fig, ax = plt.subplots(figsize=(6, 6))  
corr_heatmap = sns.heatmap(cor, mask = mask, cbar = True, annot = True, annot_kws={'size' : 9}, fmt = '.2f', square = True)

# Preprocessing
### Handling 0 (zero) Values

- Values of 0 in columns that cannot actually be 0 need to be handled.
- Target columns: Glucose, BloodPressure, SkinThickness, Insulin, BMI

In [None]:
lst_null = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[lst_null] = df[lst_null].replace(0, np.nan)
df.isnull().sum()

In [None]:
# DataFrame of per-Outcome column means, used as replacement values
outcome_mean_df = df.groupby('Outcome').mean()
outcome_mean_df.head()

In [None]:
for col in lst_null:
    df[col] = np.where((df[col].isnull())&(df['Outcome']==0), outcome_mean_df[col].iloc[0], df[col])
    df[col] = np.where((df[col].isnull())&(df['Outcome']==1), outcome_mean_df[col].iloc[1], df[col])
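
The class-wise mean imputation above can also be written with `groupby(...).transform('mean')` and `fillna`. The following is a sketch on a hypothetical toy frame (`toy` and its values are illustrative assumptions), meant only to show the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing Glucose value in each Outcome class.
toy = pd.DataFrame({
    'Outcome': [0, 0, 0, 1, 1, 1],
    'Glucose': [100.0, np.nan, 110.0, 150.0, 170.0, np.nan],
})

# Mean of Glucose within each Outcome group, aligned row by row (NaN is skipped).
group_mean = toy.groupby('Outcome')['Glucose'].transform('mean')

# Fill only the missing entries with the mean of the row's own class.
toy['Glucose'] = toy['Glucose'].fillna(group_mean)
print(toy)
```

The fill value depends on `Outcome`, so this approach requires the label to be known for every row that is imputed.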

In [None]:
fig, ax = plt.subplots(3, 3, figsize=(12,10))

for i in range(len(df.columns)-1):
    name = df.columns[i]
    sns.distplot(df[df['Outcome'] == 0][name], color='green', ax=ax[i//3, i%3])
    sns.distplot(df[df['Outcome'] == 1][name], color='red', ax=ax[i//3, i%3])
    ax[i//3, i%3].set_title(f'Healthy vs Diabetic by {name}')

sns.countplot(x='Outcome', data=df, ax=ax[2, 2])  # draw explicitly on the last subplot
ax[2, 2].set_title('Outcome')

fig.tight_layout()
plt.show()

### Variable Transformation
- Tried log, square-root, and Box-Cox transformations.
- Box-Cox brought the distributions closest to a normal shape.
- Box-Cox was used in the end.

In [None]:
# Variable transformation (Box-Cox on x + 1 so inputs are strictly positive, then standardize)
from sklearn import preprocessing
from scipy.stats import boxcox
skewed_cols = ['Pregnancies', 'Insulin', 'DiabetesPedigreeFunction', 'Age']

for col in skewed_cols :
    df[col] = preprocessing.scale(boxcox(df[col]+1)[0])
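
One way to compare the three transformations is to look at sample skewness before and after. The snippet below is a sketch on synthetic log-normal data (an assumption made for illustration, not the Pima columns); skewness is only one possible criterion, and the notebook's own comparison was made from the distribution plots.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Synthetic, strictly positive, right-skewed data (illustrative only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

log_x = np.log(x)
sqrt_x = np.sqrt(x)
boxcox_x, lam = boxcox(x)   # lambda is chosen by maximum likelihood

# Skewness close to 0 indicates a more symmetric distribution.
for label, values in [('raw', x), ('log', log_x), ('sqrt', sqrt_x), ('boxcox', boxcox_x)]:
    print(f'{label:>6}: skew = {skew(values):.3f}')
```

`scipy.stats.boxcox` requires strictly positive input, which is why the notebook adds 1 before transforming columns such as `Pregnancies` that contain zeros.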

In [None]:
fig, ax = plt.subplots(3, 3, figsize=(12,10))

for i in range(len(df.columns)-1):
    name = df.columns[i]
    sns.distplot(df[df['Outcome'] == 0][name], color='green', ax=ax[i//3, i%3])
    sns.distplot(df[df['Outcome'] == 1][name], color='red', ax=ax[i//3, i%3])
    ax[i//3, i%3].set_title(f'Healthy vs Diabetic by {name}')

sns.countplot(x='Outcome', data=df, ax=ax[2, 2])  # draw explicitly on the last subplot
ax[2, 2].set_title('Outcome')

fig.tight_layout()
plt.show()

### Scaling
- Tried RobustScaler, MinMaxScaler, and StandardScaler.
- RobustScaler gave marginally higher model performance and was used in the end.

In [None]:
from sklearn.model_selection import train_test_split
X = df.drop(['Outcome'], axis=1)
y = df.Outcome

In [None]:
from sklearn.preprocessing import RobustScaler
robust_scaler = RobustScaler()

X_robust_scaled = robust_scaler.fit_transform(X)
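
With its default settings, `RobustScaler` centers each feature on its median and divides by the interquartile range, which makes it less sensitive to outliers than mean/std scaling. The sketch below checks this on a small made-up array with one extreme value.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up single feature with one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual equivalent: (x - median) / (Q3 - Q1)
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(manual.ravel())
```

The outlier still maps to a large value, but it does not distort the center or the scale used for the other points.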

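### Oversampling
- Looking at the class counts of the target variable, '0' outnumbers '1'.
- SMOTE, an oversampling technique, is used to address the class imbalance.
- The counts of the '0' and '1' classes are made equal.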

In [None]:
# Oversampling
from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy='auto', random_state=1234)
X_resampled, y_resampled= sm.fit_resample(X_robust_scaled,y)

print('After OverSampling, the shape of train_X: {}'.format(X_resampled.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_resampled.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_resampled==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_resampled==0)))
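
The idea behind SMOTE is to create new minority-class points by interpolating between an existing minority point and one of its nearest minority neighbours. The snippet below is a simplified NumPy sketch of that interpolation step only; it is not imblearn's implementation, and the points, `n_new`, and `k` are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1234)

# Hypothetical 2-D minority-class points.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])
n_new = 6   # number of synthetic points to create
k = 2       # nearest neighbours considered per point

# Pairwise distances between minority points; a point is not its own neighbour.
dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)
neighbours = np.argsort(dist, axis=1)[:, :k]

synthetic = []
for _ in range(n_new):
    i = rng.integers(len(minority))   # pick a minority point at random
    j = rng.choice(neighbours[i])     # pick one of its k nearest neighbours
    gap = rng.random()                # position along the segment, in [0, 1)
    synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
synthetic = np.array(synthetic)

print(synthetic)
```

Each synthetic point lies on the line segment between two real minority points, so the new samples stay inside the region already covered by the minority class.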

# Models

### Confusion Matrix Function
- model : the model before fitting
- X : X data (full dataset)
- y : y data (full dataset)
- name : index label for the resulting DataFrame

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import KFold, cross_val_predict

# Train on 70% of the data, evaluate on the remaining 30%
def model_confusion(model, X, y, name):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
    model.fit(X_train, y_train)

    pred = model.predict(X_test).reshape(-1,)
    result_dict = {'Accuracy': [accuracy_score(y_test, pred)], 
                   'Precision': [precision_score(y_test, pred)], 
                   'Recall': [recall_score(y_test, pred)], 
                   'F1 score': [f1_score(y_test, pred)]}
    
    result = pd.DataFrame(result_dict, index=[name])
    return result

# K-fold training: out-of-fold predictions from 10-fold cross validation
def Kfold_model_confusion(model, X, y, name):
    # Use the X and y passed in, not the global X_resampled / y_resampled
    y_pred = cross_val_predict(model, X, y, cv=10)
    result_dict = {'Accuracy': [accuracy_score(y, y_pred)], 
                  'Precision': [precision_score(y, y_pred)], 
                  'Recall': [recall_score(y, y_pred)], 
                  'F1 score': [f1_score(y, y_pred)]}
    result = pd.DataFrame(result_dict, index=[name])
    return result
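
The four metrics reported by these functions all derive from the confusion matrix. The following worked example uses made-up labels (`y_true` and `y_pred` are illustrative, not model output) to show how each score is computed.

```python
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Made-up labels for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```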

### Import Modules

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
import xgboost

### Models we tried
- Logistic Regression
- SVM
- Gaussian NB
- Random Forest
- Gradient Boosting
- AdaBoost
- Decision Tree
- LGBM
- XGBoost
- Ensemble with Random Forest, Gradient Boost, XGBoost, LGBM
- Ensemble with all models we tried

In [None]:
lr_clf = LogisticRegression()
svm_clf = svm.SVC(kernel='linear', C=0.7, gamma=5)
nb_clf = GaussianNB()
forest_clf = RandomForestClassifier(n_estimators=500)
gra_clf = GradientBoostingClassifier(n_estimators=500)
ada_clf = AdaBoostClassifier(n_estimators=500)
dt_clf = DecisionTreeClassifier(random_state=1234)
lgbm_clf = LGBMClassifier(n_estimators=500)
xgb_clf = xgboost.XGBClassifier(n_estimators=500, learning_rate=0.2, 
                                gamma=0.5, max_depth=20, verbosity=0)
en_clf = VotingClassifier(estimators=[('rf', forest_clf), ('gb', gra_clf), ('xgb', xgb_clf), ('lgbm', lgbm_clf)],
                         voting='soft',weights=[2, 3, 5, 4])
all_clf = VotingClassifier(estimators=[('lr', lr_clf), ('svm', svm_clf), ('nb', nb_clf), ('for', forest_clf),
                                      ('gra', gra_clf), ('ada', ada_clf), ('dt', dt_clf), ('lgbm', lgbm_clf),
                                      ('xgb', xgb_clf)], voting='hard', weights=[1,2,2,8,5,4,4,5,6])
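
The two ensembles differ in how votes are combined: `voting='soft'` averages the predicted class probabilities using the weights, while `voting='hard'` takes a weighted majority over the predicted labels. The sketch below illustrates both rules with NumPy for a single sample; the probabilities are hypothetical, and only the weights `[2, 3, 5, 4]` come from `en_clf`.

```python
import numpy as np

# Hypothetical [P(class 0), P(class 1)] from four models for one sample.
probas = np.array([
    [0.60, 0.40],   # rf
    [0.45, 0.55],   # gb
    [0.30, 0.70],   # xgb
    [0.55, 0.45],   # lgbm
])
weights = np.array([2, 3, 5, 4])   # same weights as en_clf

# Soft voting: weighted average of probabilities, then argmax.
avg = np.average(probas, axis=0, weights=weights)
soft_vote = int(np.argmax(avg))

# Hard voting: each model votes for its own argmax label, votes are weighted.
labels = probas.argmax(axis=1)
hard_vote = int(np.argmax(np.bincount(labels, weights=weights)))

print(avg, soft_vote, hard_vote)
```

Soft voting requires every member to expose `predict_proba`, which is one reason the all-model ensemble that includes `SVC` uses hard voting.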

### Results
1. Train with Train Data : Test Data = 7 : 3
    - Data without scaling and oversampling
    - Data with scaling and oversampling
2. Train with K-fold Cross Validation

In [None]:
print("< 7:3 split dataset before scaling and over sampling >")

result1 = model_confusion(lr_clf, X, y, 'Logistic Regression')
result1 = pd.concat([result1, model_confusion(svm_clf, X, y, 'SVM')])
result1 = pd.concat([result1, model_confusion(nb_clf, X, y, 'Gaussian NB')])
result1 = pd.concat([result1, model_confusion(forest_clf, X, y, 'Random Forest')])
result1 = pd.concat([result1, model_confusion(gra_clf, X, y, 'Gradient Boosting')])
result1 = pd.concat([result1, model_confusion(ada_clf, X, y, 'AdaBoosting')])
result1 = pd.concat([result1, model_confusion(dt_clf, X, y, 'Decision Tree')])
result1 = pd.concat([result1, model_confusion(lgbm_clf, X, y, 'LGBM')])
result1 = pd.concat([result1, model_confusion(xgb_clf, X, y, 'XGBoost')])
result1 = pd.concat([result1, model_confusion(en_clf, X, y, 'Ensemble')])
result1 = pd.concat([result1, model_confusion(all_clf, X, y, 'Ensemble_all')])
result1

In [None]:
print("< 7:3 split dataset after scaling and over sampling >")

result2 = model_confusion(lr_clf, X_resampled, y_resampled, 'Logistic Regression')
result2 = pd.concat([result2, model_confusion(svm_clf, X_resampled, y_resampled, 'SVM')])
result2 = pd.concat([result2, model_confusion(nb_clf, X_resampled, y_resampled, 'Gaussian NB')])
result2 = pd.concat([result2, model_confusion(forest_clf, X_resampled, y_resampled, 'Random Forest')])
result2 = pd.concat([result2, model_confusion(gra_clf, X_resampled, y_resampled, 'Gradient Boosting')])
result2 = pd.concat([result2, model_confusion(ada_clf, X_resampled, y_resampled, 'AdaBoosting')])
result2 = pd.concat([result2, model_confusion(dt_clf, X_resampled, y_resampled, 'Decision Tree')])
result2 = pd.concat([result2, model_confusion(lgbm_clf, X_resampled, y_resampled, 'LGBM')])
result2 = pd.concat([result2, model_confusion(xgb_clf, X_resampled, y_resampled, 'XGBoost')])
result2 = pd.concat([result2, model_confusion(en_clf, X_resampled, y_resampled, 'Ensemble')])
result2 = pd.concat([result2, model_confusion(all_clf, X_resampled, y_resampled, 'Ensemble_all')])
result2

In [None]:
print("< K-fold Cross Validation after scaling and over sampling >")

result3 = Kfold_model_confusion(lr_clf, X_resampled, y_resampled, 'Logistic Regression')
result3 = pd.concat([result3, Kfold_model_confusion(svm_clf, X_resampled, y_resampled, 'SVM')])
result3 = pd.concat([result3, Kfold_model_confusion(nb_clf, X_resampled, y_resampled, 'Gaussian NB')])
result3 = pd.concat([result3, Kfold_model_confusion(forest_clf, X_resampled, y_resampled, 'Random Forest')])
result3 = pd.concat([result3, Kfold_model_confusion(gra_clf, X_resampled, y_resampled, 'Gradient Boosting')])
result3 = pd.concat([result3, Kfold_model_confusion(ada_clf, X_resampled, y_resampled, 'AdaBoosting')])
result3 = pd.concat([result3, Kfold_model_confusion(dt_clf, X_resampled, y_resampled, 'Decision Tree')])
result3 = pd.concat([result3, Kfold_model_confusion(lgbm_clf, X_resampled, y_resampled, 'LGBM')])
result3 = pd.concat([result3, Kfold_model_confusion(xgb_clf, X_resampled, y_resampled, 'XGBoost')])
result3 = pd.concat([result3, Kfold_model_confusion(en_clf, X_resampled, y_resampled, 'Ensemble')])
result3 = pd.concat([result3, Kfold_model_confusion(all_clf, X_resampled, y_resampled, 'Ensemble_all')])
result3

# Final Results

#### Train:Test=7:3 Split

In [None]:
pd.DataFrame(result2[result2['F1 score']==max(result2['F1 score'])])

#### K-fold Cross Validation

In [None]:
pd.DataFrame(result3[result3['F1 score']==max(result3['F1 score'])])