# Machine Learning - Decision Trees
  

**2019-2023 [FinanceData.KR]()**


## Iris Dataset

http://nbviewer.jupyter.org/b7c2a073463ef59938f2eaaf9d2c169c

#### Using a decision tree model

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=0)
model.fit(x_train, y_train)

#### Visualizing the decision tree
sklearn.tree.plot_tree() draws the fitted tree with matplotlib.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html



In [None]:
import matplotlib.pyplot as plt
from sklearn import tree

plt.figure(figsize=(15, 9))
annots = tree.plot_tree(model, class_names=iris.target_names, feature_names=iris.feature_names, filled=True, rounded=True)

In [None]:
annots

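With the default settings, the tree keeps splitting until the Gini impurity of every leaf (shown as `gini` in each node) reaches 0.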
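
To make the `gini` value in each node concrete, here is a minimal sketch of the Gini impurity formula, 1 minus the sum of squared class proportions. The helper name `gini_impurity` is hypothetical and is not part of scikit-learn; it only illustrates the arithmetic.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure = gini_impurity([0, 0, 0, 0])    # one class only -> 0.0
mixed = gini_impurity([0, 0, 1, 1])   # two classes, evenly split -> 0.5
three = gini_impurity([0, 1, 2])      # three classes, evenly split -> about 0.667

print(pure, mixed, round(three, 3))
```

A node with impurity 0 contains samples of a single class, so there is nothing left to split.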

#### Preventing overfitting
Adjust parameters such as max_depth, min_samples_split, and min_samples_leaf to keep the tree from overfitting.

In [None]:
model = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_leaf=2, random_state=0)
model.fit(x_train, y_train)

plt.figure(figsize=(15, 9))
annots = tree.plot_tree(model, class_names=iris.target_names, feature_names=iris.feature_names, filled=True, rounded=True)

In [None]:
annots
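
As a rough check on the effect of these parameters, the sketch below fits an unrestricted tree and a restricted one on the same split and prints their depth, leaf count, and train/test accuracy. The variable names `full` and `pruned` are illustrative, and the exact scores depend on the split, so none are quoted here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

# Unrestricted tree: grows until every leaf is pure
full = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

# Restricted tree: same settings as the cell above
pruned = DecisionTreeClassifier(
    criterion="gini", max_depth=3, min_samples_leaf=2, random_state=0
).fit(x_train, y_train)

for label, m in [("unrestricted", full), ("restricted", pruned)]:
    print(
        f"{label}: depth={m.get_depth()}, leaves={m.get_n_leaves()}, "
        f"train={m.score(x_train, y_train):.3f}, test={m.score(x_test, y_test):.3f}"
    )
```

A large gap between train and test accuracy is the usual sign of overfitting; on a dataset as small and clean as iris the gap may be minor.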

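#### Feature importances
* A measure of how much each feature contributed to the tree's splits
* Values lie between 0 and 1; a larger value means more influence
* The importances of all features sum to 1 (as long as the tree has at least one split)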
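
The properties listed above can be checked directly. This is a small sketch that fits a tree on the full iris data (an assumption made only to keep the example self-contained) and verifies the range and the sum of `feature_importances_`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

importances = model.feature_importances_
total = importances.sum()

# Rank features from most to least important
ranked = sorted(zip(iris.feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name} : {value:.4f}")

print(f"sum : {total:.4f}")
```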




In [None]:
for name, value in zip(iris.feature_names , model.feature_importances_):
    print(f'{name} : {value:.4f}')

In [None]:
model.feature_importances_

In [None]:
import seaborn as sns
sns.barplot(x=model.feature_importances_ , y=iris.feature_names)

## Titanic Survivors Dataset

In [None]:
import seaborn as sns

df = sns.load_dataset("titanic")
df

In [None]:
df["survived"].value_counts()

#### Preprocessing - integer encoding

Reference: [Machine learning data preprocessing - encoding](https://nbviewer.org/1e6222113a0333a72ac25adcedde39a1)

In [None]:
from sklearn.preprocessing import LabelEncoder

df["gender"] = LabelEncoder().fit_transform(df["sex"])
df.head()
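
To see which integer each category received, the fitted encoder can be inspected through its `classes_` attribute. The sketch below uses a small hand-written list rather than the Titanic column, so the values are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["male", "female", "female", "male"])

# classes_ is sorted, and each label's position is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)         # {'female': 0, 'male': 1}
print(codes.tolist())  # [1, 0, 0, 1]
```

Keeping a reference to the encoder, instead of calling `LabelEncoder().fit_transform(...)` inline, also makes `inverse_transform` available later.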

#### Preprocessing - handling missing values

In [None]:
df['age'] = df['age'].fillna(df['age'].mean())

df.isnull().sum()

#### Selecting the required columns

In [None]:
feature_names = ["pclass", "age", "gender"]

X = df[feature_names]
y = df["survived"]

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
model.fit(x_train, y_train)

In [None]:
plt.figure(figsize=(15, 9))
annots = tree.plot_tree(model, feature_names=feature_names, filled=True, rounded=True)

In [None]:
for name, value in zip(feature_names , model.feature_importances_):
    print(f'{name} : {value:.4f}')

## confusion_matrix

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, model.predict(x_test))

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, model.predict(x_test)))
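
For reading the output above: in a binary confusion matrix from scikit-learn, rows are the true classes and columns are the predicted classes. The sketch below uses a small hand-made label set (not the Titanic predictions) to show how precision, recall, and accuracy follow from the four cells.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3 1]
           #  [2 4]]

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)                  # 4 / 5 = 0.8
recall = tp / (tp + fn)                     # 4 / 6, about 0.667
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 7 / 10 = 0.7

print(f"precision={precision:.3f}, recall={recall:.3f}, accuracy={accuracy:.3f}")
```

These are the same quantities that `classification_report` lists for class 1.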

