
# k-fold
- Cross-validation lets you judge how much a model overfits.
- https://scikit-learn.org/stable/modules/cross_validation.html

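- Advantages
    - Reduces the risk of overfitting to one particular dataset split
    - Makes it possible to build a more generalized model
    - Helps prevent underfitting when the dataset is small, since every sample is used for training
- Disadvantages
    - Longer training and evaluation time (the model is refit once per fold); this can be mitigated by training the folds in parallel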
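
As a minimal sketch of how k-fold splitting works, the example below runs scikit-learn's `KFold` on ten toy samples. The array `X` and the choice of 5 folds are illustrative assumptions, not part of this notebook's data. Each sample lands in the validation fold exactly once, and a model would be trained once per fold, which is where the extra cost comes from.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples (assumption: stand-in data, not the Titanic set)
X = np.arange(10).reshape(-1, 1)

# Split into 5 consecutive folds without shuffling
kf = KFold(n_splits=5, shuffle=False)

folds = []
for train_idx, val_idx in kf.split(X):
    # train_idx: rows used for fitting, val_idx: rows held out for scoring
    folds.append((train_idx, val_idx))
    print("train:", train_idx, "validation:", val_idx)
```

With 10 samples and 5 folds, each validation fold holds 2 samples and the remaining 8 are used for training.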

## 1. Load the data

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
import pandas as pd

In [3]:
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/머신러닝/Day07-수업자료/data/titanic_train.csv")
df.tail(1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


## 2. Data preprocessing

In [5]:

# Keep only the columns we need; .copy() avoids SettingWithCopyWarning on the in-place edits below
columns = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
df = df[columns].copy()

# Drop rows with missing values
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

# Sex, Embarked: convert to dummy (one-hot) variables
dummy_sex = pd.get_dummies(df["Sex"])
dummy_embarked = pd.get_dummies(df["Embarked"])
df = pd.concat([df, dummy_sex, dummy_embarked], axis=1).drop(columns=["Sex", "Embarked"])

df.tail(1)


Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,female,male,C,Q,S
711,0,3,32.0,0,0,7.75,0,1,0,1,0


## 3. Split the dataset

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
df_x = df.drop(columns=["Survived"])
df_y = df["Survived"]

In [8]:
train_x, test_x, train_y, test_y = train_test_split(df_x, df_y, test_size=0.2, random_state=1)
len(train_x), len(test_x), len(train_y), len(test_y)

(569, 143, 569, 143)

## 4. Modeling

### 4.1 decision tree

In [9]:
from sklearn.tree import DecisionTreeClassifier

In [10]:
import numpy as np

In [11]:
dt_model = DecisionTreeClassifier(random_state=0).fit(train_x, train_y)
score = dt_model.score(test_x, test_y) * 100
np.round(score, 2)

78.32

### 4.2 random forest

In [12]:
from sklearn.ensemble import RandomForestClassifier

In [13]:
rf_model = RandomForestClassifier(random_state=0).fit(train_x, train_y)
score = rf_model.score(test_x, test_y) * 100
np.round(score, 2)

76.92

## 5. k-fold cross validation score
- https://scikit-learn.org/stable/modules/cross_validation.html
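- A large variance across folds means the model's accuracy changes a lot depending on the data it sees, so the risk of overfitting is high.
- In the modeling results above, the Decision Tree scored a higher accuracy than the random forest (78.32 vs. 76.92), but the cross-validation scores put the random forest ahead (mean 0.729 vs. 0.671).
- The Decision Tree result is therefore likely to be overfit.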
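
To make the cross-validation score concrete, here is a rough sketch of the loop that `cross_val_score` performs for a classifier: split with `StratifiedKFold`, fit a fresh clone on each training part, and score it on the held-out part. The data comes from `make_classification` and the 5-fold setting is arbitrary; both are assumptions for illustration, not the Titanic data used in this notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = DecisionTreeClassifier(random_state=0)

# For a classifier and an integer cv, cross_val_score splits with StratifiedKFold
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in skf.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# The library call should reproduce the manual per-fold accuracies
library_scores = cross_val_score(model, X, y, cv=5)

print("manual :", manual_scores.mean(), manual_scores.var())
print("library:", library_scores.mean(), library_scores.var())
```

The mean summarizes expected accuracy, and the variance shows how much that accuracy moves from fold to fold.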

In [14]:
from sklearn.model_selection import cross_val_score

### 5.1 decision tree

In [15]:
scores = cross_val_score(dt_model, test_x, test_y, cv=10)
scores.mean(), scores.var()

(0.6714285714285715, 0.017310657596371883)

### 5.2 random forest

In [16]:
scores = cross_val_score(rf_model, test_x, test_y, cv=10)
scores.mean(), scores.var()

(0.7285714285714285, 0.01384126984126984)
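
The longer run time listed among the disadvantages can often be offset by evaluating folds in parallel. The sketch below uses the `n_jobs` parameter of `cross_val_score` on synthetic data. The dataset, `n_estimators=50`, and `n_jobs=-1` (use all available cores) are illustrative assumptions, so treat it as a starting point, not a tuned setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the Titanic dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# One fold at a time
sequential = cross_val_score(model, X, y, cv=10)

# Folds distributed over all available CPU cores
parallel = cross_val_score(model, X, y, cv=10, n_jobs=-1)

print("mean:", np.round(parallel.mean(), 4), "variance:", np.round(parallel.var(), 4))
```

Because the model has a fixed `random_state`, the parallel run should return the same per-fold scores as the sequential one; only the wall-clock time differs.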