# <a id='toc1_'></a>[Building Analysis Models](#toc0_)
---

**Table of contents**<a id='toc0_'></a>    
- [Building Analysis Models](#toc1_)    
  - [Linear Regression](#toc1_1_)    
  - [Logistic Regression](#toc1_2_)    
  - [Decision Tree](#toc1_3_)    
  - [Support Vector Machine (SVM)](#toc1_4_)    
  - [K-Nearest Neighbors (K-NN)](#toc1_5_)    
  - [Artificial Neural Network (ANN)](#toc1_6_)    
  - [Ensemble](#toc1_7_)    
    - [Bagging (Bootstrap Aggregating)](#toc1_7_1_)    
    - [Random Forest](#toc1_7_2_)    
      - [Categorical Dependent Variable](#toc1_7_2_1_)    
      - [Numeric Dependent Variable](#toc1_7_2_2_)    


---

## <a id='toc1_1_'></a>[Linear Regression](#toc0_)

In [23]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv('./datasets/Hitters.csv')
print(df.describe())

            AtBat        Hits       HmRun        Runs         RBI       Walks  \
count  322.000000  322.000000  322.000000  322.000000  322.000000  322.000000   
mean   380.928571  101.024845   10.770186   50.909938   48.027950   38.742236   
std    153.404981   46.454741    8.709037   26.024095   26.166895   21.639327   
min     16.000000    1.000000    0.000000    0.000000    0.000000    0.000000   
25%    255.250000   64.000000    4.000000   30.250000   28.000000   22.000000   
50%    379.500000   96.000000    8.000000   48.000000   44.000000   35.000000   
75%    512.000000  137.000000   16.000000   69.000000   64.750000   53.000000   
max    687.000000  238.000000   40.000000  130.000000  121.000000  105.000000   

            Years       CAtBat        CHits      CHmRun        CRuns  \
count  322.000000    322.00000   322.000000  322.000000   322.000000   
mean     7.444099   2648.68323   717.571429   69.490683   358.795031   
std      4.926087   2324.20587   654.472627   86.26606

In [24]:
print(df.head(3))

       Unnamed: 0  AtBat  Hits  HmRun  Runs  RBI  Walks  Years  CAtBat  CHits  \
0  -Andy Allanson    293    66      1    30   29     14      1     293     66   
1     -Alan Ashby    315    81      7    24   38     39     14    3449    835   
2    -Alvin Davis    479   130     18    66   72     76      3    1624    457   

   ...  CRuns  CRBI  CWalks  League Division PutOuts  Assists  Errors  Salary  \
0  ...     30    29      14       A        E     446       33      20     NaN   
1  ...    321   414     375       N        W     632       43      10   475.0   
2  ...    224   266     263       A        W     880       82      14   480.0   

   NewLeague  
0          A  
1          N  
2          A  

[3 rows x 21 columns]


In [25]:
# Check column types and missing values
## Only the Salary column has 263 non-null values, so it contains missing values that must be handled
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322 entries, 0 to 321
Data columns (total 21 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  322 non-null    object 
 1   AtBat       322 non-null    int64  
 2   Hits        322 non-null    int64  
 3   HmRun       322 non-null    int64  
 4   Runs        322 non-null    int64  
 5   RBI         322 non-null    int64  
 6   Walks       322 non-null    int64  
 7   Years       322 non-null    int64  
 8   CAtBat      322 non-null    int64  
 9   CHits       322 non-null    int64  
 10  CHmRun      322 non-null    int64  
 11  CRuns       322 non-null    int64  
 12  CRBI        322 non-null    int64  
 13  CWalks      322 non-null    int64  
 14  League      322 non-null    object 
 15  Division    322 non-null    object 
 16  PutOuts     322 non-null    int64  
 17  Assists     322 non-null    int64  
 18  Errors      322 non-null    int64  
 19  Salary      263 non-null    f

In [26]:
# Drop rows with missing values and store the result back in df
df = df.dropna()

print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, 1 to 321
Data columns (total 21 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  263 non-null    object 
 1   AtBat       263 non-null    int64  
 2   Hits        263 non-null    int64  
 3   HmRun       263 non-null    int64  
 4   Runs        263 non-null    int64  
 5   RBI         263 non-null    int64  
 6   Walks       263 non-null    int64  
 7   Years       263 non-null    int64  
 8   CAtBat      263 non-null    int64  
 9   CHits       263 non-null    int64  
 10  CHmRun      263 non-null    int64  
 11  CRuns       263 non-null    int64  
 12  CRBI        263 non-null    int64  
 13  CWalks      263 non-null    int64  
 14  League      263 non-null    object 
 15  Division    263 non-null    object 
 16  PutOuts     263 non-null    int64  
 17  Assists     263 non-null    int64  
 18  Errors      263 non-null    int64  
 19  Salary      263 non-null    float6

In [27]:
# One-hot encode the categorical (object) columns with pd.get_dummies
# Note: the player-name column 'Unnamed: 0' is also an object column, so it becomes one dummy column per player
df = pd.get_dummies(df, drop_first=True)

x = df.drop('Salary', axis=1)   # all columns except Salary
y = df['Salary']

# Use 80% of the data for training and 20% for evaluation
## 80% of x -> train_x
## 20% of x -> test_x
## 80% of y -> train_y
## 20% of y -> test_y
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2)

# Regression analysis
md = LinearRegression()
md.fit(train_x, train_y)   # fit the regression model
print(md.feature_names_in_)   # print the order of the independent variables

['AtBat' 'Hits' 'HmRun' 'Runs' 'RBI' 'Walks' 'Years' 'CAtBat' 'CHits'
 'CHmRun' 'CRuns' 'CRBI' 'CWalks' 'PutOuts' 'Assists' 'Errors'
 'Unnamed: 0_-Alan Ashby' 'Unnamed: 0_-Alan Trammell'
 'Unnamed: 0_-Alan Wiggins' 'Unnamed: 0_-Alex Trevino'
 'Unnamed: 0_-Alfredo Griffin' 'Unnamed: 0_-Alvin Davis'
 'Unnamed: 0_-Andre Dawson' 'Unnamed: 0_-Andre Thornton'
 'Unnamed: 0_-Andres Galarraga' 'Unnamed: 0_-Andres Thomas'
 'Unnamed: 0_-Andy VanSlyke' 'Unnamed: 0_-Argenis Salazar'
 'Unnamed: 0_-Barry Bonds' 'Unnamed: 0_-Bill Almon'
 'Unnamed: 0_-Bill Buckner' 'Unnamed: 0_-Bill Doran'
 'Unnamed: 0_-Bill Madlock' 'Unnamed: 0_-Bill Schroeder'
 'Unnamed: 0_-Billy Hatcher' 'Unnamed: 0_-BillyJo Robidoux'
 'Unnamed: 0_-Bo Diaz' 'Unnamed: 0_-Bob Brenly' 'Unnamed: 0_-Bob Dernier'
 'Unnamed: 0_-Bob Kearney' 'Unnamed: 0_-Bob Melvin'
 'Unnamed: 0_-Bobby Bonilla' 'Unnamed: 0_-Brett Butler'
 'Unnamed: 0_-Brian Downing' 'Unnamed: 0_-Brook Jacoby'
 'Unnamed: 0_-Bruce Bochy' 'Unnamed: 0_-Buddy Bell'
 'Unnamed: 0_
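
The feature list above contains one dummy column per player because `Unnamed: 0` holds player names as an object column. A minimal sketch, assuming a small made-up frame in place of `Hitters.csv` (all values are hypothetical), of how dropping the identifier before `pd.get_dummies` leaves only the genuine categorical features:

```python
import pandas as pd

# Toy frame standing in for Hitters.csv (values are made up for illustration).
toy = pd.DataFrame({
    'Unnamed: 0': ['-Player A', '-Player B', '-Player C', '-Player D'],
    'Hits': [66, 81, 130, 141],
    'League': ['A', 'N', 'A', 'N'],
    'Division': ['E', 'W', 'W', 'E'],
    'Salary': [70.0, 475.0, 480.0, 500.0],
})

# Encoding every object column turns each player name into its own dummy column.
encoded_all = pd.get_dummies(toy, drop_first=True)

# Dropping the identifier first leaves only League and Division to encode.
encoded_clean = pd.get_dummies(toy.drop(columns=['Unnamed: 0']), drop_first=True)

print(list(encoded_all.columns))
print(list(encoded_clean.columns))
```

With the identifier removed, the model would be fitted on the performance statistics and the two or three real categorical columns rather than on hundreds of per-player indicators.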

In [28]:
print(md.coef_)   # print the coefficients of model md

[-2.42922383e+00  9.26403534e+00 -4.24195329e+00 -4.21883090e+00
  2.20031073e+00  5.92905458e+00 -9.34444275e+00 -7.72113093e-02
 -9.11501085e-02  2.74301857e+00  1.76649218e+00 -2.72616458e-02
 -9.75509599e-01  2.48359313e-01  3.02379312e-01 -6.32421464e-01
  1.13447642e+02 -2.77496185e+02  3.90809180e+02  2.90993137e+02
  2.15998543e+02 -2.05996739e+02  2.59080951e-13 -2.27373675e-13
 -4.43515156e+02 -1.18891027e+02  2.06803888e+01  1.51821709e+01
 -2.04446814e+02 -8.28167579e+01 -5.32750084e+02  1.13299417e+01
 -5.06158301e+01  3.22112282e+01 -4.79272781e+01 -2.28282779e+02
  1.75289614e+02 -1.13686838e-13  4.24787091e+02  1.42113298e+02
 -8.55711875e+01 -3.31215057e+02 -2.84217094e-14 -6.82121026e-13
  4.26325641e-13 -4.13537069e+01 -2.41584530e-13  1.01081214e+02
  2.35530130e+02  2.29644577e+02 -2.24486518e+02  6.89301720e+01
 -1.42108547e-13  0.00000000e+00 -2.03814141e+02  3.16277512e+01
  2.56771139e+01 -2.72953706e+02  5.49351030e+01 -1.45040796e+02
  9.18459531e+01 -1.97053

In [29]:
print(md.intercept_)     # print the intercept of model md

134.27992157352782


In [30]:
pred = md.predict(test_x)

print(mean_squared_error(test_y, pred) ** 0.5)    # RMSE (a larger value means larger residuals); squared=False is not accepted by newer scikit-learn

395.07679434923136
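
To make the metric concrete, here is a minimal sketch of RMSE computed by hand and compared with the square root of scikit-learn's MSE. The arrays `y_true` and `y_pred` are made-up values, not taken from the Hitters data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up salaries and predictions, for illustration only.
y_true = np.array([475.0, 480.0, 500.0, 91.5])
y_pred = np.array([450.0, 520.0, 430.0, 150.0])

# RMSE is the square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Taking the square root of MSE does not depend on the `squared` argument.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is expressed in the same unit as the target, it can be read directly against the scale of `Salary`.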




## <a id='toc1_2_'></a>[Logistic Regression](#toc0_)

In [31]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('./datasets/PimaIndianDiabetes2.csv')

print(df.describe())

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI    Diabetes         Age     Outcome  
count  768.000000  768.000000  768.000000  768.000000  
mean    31.992578    0.471876   33.240885    0.348958  
std      7.884160    0.331329   11.760232    0.476951  
min      0.000000    0.078000   21.000000    0.000000  
25%     27.300000    0.243750   24.

In [32]:
print(df.head(3))

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   

   Diabetes  Age  Outcome  
0     0.627   50        1  
1     0.351   31        0  
2     0.672   32        1  


In [33]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Pregnancies    768 non-null    int64  
 1   Glucose        768 non-null    int64  
 2   BloodPressure  768 non-null    int64  
 3   SkinThickness  768 non-null    int64  
 4   Insulin        768 non-null    int64  
 5   BMI            768 non-null    float64
 6   Diabetes       768 non-null    float64
 7   Age            768 non-null    int64  
 8   Outcome        768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None


In [34]:
df = df.dropna()
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Pregnancies    768 non-null    int64  
 1   Glucose        768 non-null    int64  
 2   BloodPressure  768 non-null    int64  
 3   SkinThickness  768 non-null    int64  
 4   Insulin        768 non-null    int64  
 5   BMI            768 non-null    float64
 6   Diabetes       768 non-null    float64
 7   Age            768 non-null    int64  
 8   Outcome        768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None


In [None]:
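x = df.drop('Outcome', axis=1)          # Outcome is the 0/1 target; Diabetes is a numeric feature
y = df['Outcome']
y = LabelEncoder().fit_transform(y)     # categorical -> numeric (Outcome is already 0/1, so it is unchanged)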

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

md = LogisticRegression(max_iter=1000)
md.fit(x_train, y_train)

predict = md.predict(x_test)

cm = confusion_matrix(y_test, predict, labels=[1, 0])   # rows: actual, columns: predicted, positive class first
print(cm)

In [None]:
print(accuracy_score(y_test, predict))
print(recall_score(y_test, predict))
print(precision_score(y_test, predict))
print(f1_score(y_test, predict))
print(roc_auc_score(y_test, predict))
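
The scores printed above all follow from the four cells of the confusion matrix. A minimal sketch, using made-up labels (`y_true`, `y_hat`) rather than the diabetes data, of how the cells map to each metric when `labels=[1, 0]` puts the positive class first:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             recall_score, precision_score, f1_score)

# Made-up labels and predictions, for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_hat = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# With labels=[1, 0], rows are actual classes and columns are predicted classes,
# both ordered positive-first: [[TP, FN], [FP, TN]].
(tp, fn), (fp, tn) = confusion_matrix(y_true, y_hat, labels=[1, 0])

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)          # share of actual positives that were found
precision = tp / (tp + fp)       # share of predicted positives that were correct
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, recall, precision, f1)
print(accuracy_score(y_true, y_hat), recall_score(y_true, y_hat),
      precision_score(y_true, y_hat), f1_score(y_true, y_hat))
```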

## <a id='toc1_3_'></a>[Decision Tree](#toc0_)

In [37]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('./datasets/PimaIndianDiabetes2.csv')

print(df.describe())

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI    Diabetes         Age     Outcome  
count  768.000000  768.000000  768.000000  768.000000  
mean    31.992578    0.471876   33.240885    0.348958  
std      7.884160    0.331329   11.760232    0.476951  
min      0.000000    0.078000   21.000000    0.000000  
25%     27.300000    0.243750   24.

In [38]:
print(df.head(3))

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   

   Diabetes  Age  Outcome  
0     0.627   50        1  
1     0.351   31        0  
2     0.672   32        1  


In [39]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Pregnancies    768 non-null    int64  
 1   Glucose        768 non-null    int64  
 2   BloodPressure  768 non-null    int64  
 3   SkinThickness  768 non-null    int64  
 4   Insulin        768 non-null    int64  
 5   BMI            768 non-null    float64
 6   Diabetes       768 non-null    float64
 7   Age            768 non-null    int64  
 8   Outcome        768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None


In [40]:
df = df.dropna()
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Pregnancies    768 non-null    int64  
 1   Glucose        768 non-null    int64  
 2   BloodPressure  768 non-null    int64  
 3   SkinThickness  768 non-null    int64  
 4   Insulin        768 non-null    int64  
 5   BMI            768 non-null    float64
 6   Diabetes       768 non-null    float64
 7   Age            768 non-null    int64  
 8   Outcome        768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None


In [None]:
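x = df.drop('Outcome', axis=1)          # Outcome is the 0/1 target; Diabetes is a numeric feature
y = df['Outcome']
y = LabelEncoder().fit_transform(y)     # categorical -> numeric (Outcome is already 0/1, so it is unchanged)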

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

md = DecisionTreeClassifier(max_depth=2)
md.fit(x_train, y_train)

predict = md.predict(x_test)

cm = confusion_matrix(y_test, predict, labels=[1, 0])   # rows: actual, columns: predicted, positive class first
print(cm)

In [None]:
print(accuracy_score(y_test, predict))
print(recall_score(y_test, predict))
print(precision_score(y_test, predict))
print(f1_score(y_test, predict))
print(roc_auc_score(y_test, predict))
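
`roc_auc_score` above is fed hard 0/1 predictions, which evaluates a single threshold. A minimal sketch of the alternative, passing the positive-class probability from `predict_proba` so that samples are ranked across all thresholds. The data come from `make_classification` as a stand-in for the diabetes file, and the `random_state` values are assumptions added for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Synthetic binary data standing in for the diabetes set (an assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# AUC from hard 0/1 predictions reflects one operating point only.
auc_labels = roc_auc_score(y_test, tree.predict(X_test))

# AUC from the positive-class probability uses the full ranking of samples.
auc_scores = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])

print(auc_labels, auc_scores)
```

The two values generally differ, and the probability-based one is the conventional reading of ROC AUC.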

## <a id='toc1_4_'></a>[Support Vector Machine (SVM)](#toc0_)

In [None]:
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('./datasets/PimaIndianDiabetes2.csv')

print(df.describe())
print(df.head(3))
print(df.info())

df = df.dropna()
print(df.info())

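x = df.drop('Outcome', axis=1)          # Outcome is the 0/1 target; Diabetes is a numeric feature
y = df['Outcome']
y = LabelEncoder().fit_transform(y)     # categorical -> numeric (Outcome is already 0/1, so it is unchanged)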

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

md = SVC(kernel='linear')
md.fit(x_train, y_train)

predict = md.predict(x_test)

cm = confusion_matrix(y_test, predict, labels=[1, 0])   # rows: actual, columns: predicted, positive class first
print(cm)

print(accuracy_score(y_test, predict))
print(recall_score(y_test, predict))
print(precision_score(y_test, predict))
print(f1_score(y_test, predict))
print(roc_auc_score(y_test, predict))

## <a id='toc1_5_'></a>[K-Nearest Neighbors (K-NN)](#toc0_)

In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('./datasets/PimaIndianDiabetes2.csv')

print(df.describe())
print(df.head(3))
print(df.info())

df = df.dropna()
print(df.info())

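x = df.drop('Outcome', axis=1)          # Outcome is the 0/1 target; Diabetes is a numeric feature
y = df['Outcome']
y = LabelEncoder().fit_transform(y)     # categorical -> numeric (Outcome is already 0/1, so it is unchanged)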

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

md = KNeighborsClassifier(n_neighbors=5)
md.fit(x_train, y_train)

predict = md.predict(x_test)

cm = confusion_matrix(y_test, predict, labels=[1, 0])   # rows: actual, columns: predicted, positive class first
print(cm)

print(accuracy_score(y_test, predict))
print(recall_score(y_test, predict))
print(precision_score(y_test, predict))
print(f1_score(y_test, predict))
print(roc_auc_score(y_test, predict))

## <a id='toc1_6_'></a>[Artificial Neural Network (ANN)](#toc0_)

In [None]:
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('./datasets/PimaIndianDiabetes2.csv')

print(df.describe())
print(df.head(3))
print(df.info())

df = df.dropna()
print(df.info())

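x = df.drop('Outcome', axis=1)          # Outcome is the 0/1 target; Diabetes is a numeric feature
y = df['Outcome']
y = LabelEncoder().fit_transform(y)     # categorical -> numeric (Outcome is already 0/1, so it is unchanged)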

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

md = MLPClassifier(hidden_layer_sizes=(64, 32), activation='relu', max_iter=1000)   # two hidden layers: 64 and 32 units
md.fit(x_train, y_train)

predict = md.predict(x_test)

cm = confusion_matrix(y_test, predict, labels=[1, 0])   # rows: actual, columns: predicted, positive class first
print(cm)

print(accuracy_score(y_test, predict))
print(recall_score(y_test, predict))
print(precision_score(y_test, predict))
print(f1_score(y_test, predict))
print(roc_auc_score(y_test, predict))

## <a id='toc1_7_'></a>[Ensemble](#toc0_)

### <a id='toc1_7_1_'></a>[Bagging (Bootstrap Aggregating)](#toc0_)

In [None]:
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('./datasets/PimaIndianDiabetes2.csv')

print(df.describe())
print(df.head(3))
print(df.info())

df = df.dropna()
print(df.info())

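x = df.drop('Outcome', axis=1)          # Outcome is the 0/1 target; Diabetes is a numeric feature
y = df['Outcome']
y = LabelEncoder().fit_transform(y)     # categorical -> numeric (Outcome is already 0/1, so it is unchanged)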

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

base_model = DecisionTreeClassifier(max_depth=2)  # use a decision tree as the base model
md = BaggingClassifier(base_model, n_estimators=100)   # 100 trees, each fitted on a bootstrap sample
md.fit(x_train, y_train)

predict = md.predict(x_test)

cm = confusion_matrix(y_test, predict, labels=[1, 0])   # rows: actual, columns: predicted, positive class first
print(cm)

print(accuracy_score(y_test, predict))
print(recall_score(y_test, predict))
print(precision_score(y_test, predict))
print(f1_score(y_test, predict))
print(roc_auc_score(y_test, predict))
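
The idea behind bagging can be sketched by hand: fit each tree on a bootstrap sample (rows drawn with replacement), then aggregate the predictions by majority vote. This is an illustration of the technique on synthetic data, not scikit-learn's implementation, which may aggregate differently (for example by averaging predicted probabilities). The ensemble size `n_models` and the seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the diabetes set (an assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # hypothetical ensemble size, kept odd to avoid tied votes
votes = []

for _ in range(n_models):
    # Bootstrap: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_depth=2).fit(X_train[idx], y_train[idx])
    votes.append(tree.predict(X_test))

# Aggregate: majority vote across the trees for each test sample.
votes = np.array(votes)                              # shape (n_models, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

print((bagged_pred == y_test).mean())                # accuracy of the vote
```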

### <a id='toc1_7_2_'></a>[Random Forest](#toc0_)

#### <a id='toc1_7_2_1_'></a>[Categorical Dependent Variable](#toc0_)

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('./datasets/PimaIndianDiabetes2.csv')

print(df.describe())
print(df.head(3))
print(df.info())

df = df.dropna()
print(df.info())

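x = df.drop('Outcome', axis=1)          # Outcome is the 0/1 target; Diabetes is a numeric feature
y = df['Outcome']
y = LabelEncoder().fit_transform(y)     # categorical -> numeric (Outcome is already 0/1, so it is unchanged)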

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

md = RandomForestClassifier(n_estimators=100, max_depth=2)
md.fit(x_train, y_train)

predict = md.predict(x_test)
print(predict)

cm = confusion_matrix(y_test, predict, labels=[1, 0])   # actual labels first, then predictions; positive class first
print(cm)

print(accuracy_score(y_test, predict))
print(recall_score(y_test, predict))
print(precision_score(y_test, predict))
print(f1_score(y_test, predict))
print(roc_auc_score(y_test, predict))

#### <a id='toc1_7_2_2_'></a>[Numeric Dependent Variable](#toc0_)

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('./datasets/PimaIndianDiabetes2.csv')

print(df.describe())
print(df.head(3))
print(df.info())

df = df.dropna()
print(df.info())

df['Outcome'] = LabelEncoder().fit_transform(df['Outcome'])  # Outcome is already 0/1, so it is unchanged

x = df.drop('BloodPressure', axis=1)   # predict BloodPressure from the remaining columns
y = df['BloodPressure']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

md = RandomForestRegressor(n_estimators=100, max_depth=2)
md.fit(x_train, y_train)

predict = md.predict(x_test)
print(predict)

print(mean_squared_error(y_test, predict) ** 0.5)   # RMSE; squared=False is not accepted by newer scikit-learn