<a href="https://colab.research.google.com/github/weilipan/MachineLearing/blob/main/1_3_%E5%88%87%E5%89%B2%E6%95%B8%E6%93%9A%E9%9B%86.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

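# 1-3 Splitting the Dataset
## 1-3-1 Training, Validation, and Test Sets
Split the dataset into a training set and a test set: fit the model on the training set, then evaluate its performance on the test set.
1. The split is random, and the randomness is controlled by the seed parameter random_state. Fixing this parameter produces the same training and test sets on every run, which is useful for debugging and for reproducing experiments.
2. Set the ratio of training to test data. The larger share usually goes to the training set so that a better model can be fitted. Common ratios are 80:20 and 70:30; for larger datasets the ratio can reach 90:10 or even 99:1. The proportion is set with test_size (or train_size); when neither is given, the test set defaults to 0.25.
3. For classification problems, both model building and evaluation benefit when the proportion of each class in the training and test sets is close to that of the whole dataset. This is achieved by passing the target variable to the stratify parameter.
4. Once evaluation is complete and a model has been chosen, the final model can be refitted on the entire dataset.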
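
The first three points can be checked on a small synthetic example. This is a minimal sketch, not part of the original notebook: the toy arrays `X` and `y` (100 samples, 20% in class 1) are assumptions chosen so that the proportions are easy to verify by eye.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (assumption): 80 samples of class 0, 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Point 1: the same random_state gives an identical split on every call.
split_a = train_test_split(X, y, random_state=42, stratify=y)
split_b = train_test_split(X, y, random_state=42, stratify=y)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# Point 2: with neither test_size nor train_size given, 25% of the rows are held out.
X_train, X_test, y_train, y_test = split_a
default_test_fraction = len(X_test) / len(X)

# Point 3: stratify=y keeps the share of class 1 the same in both parts.
train_pos_rate = y_train.mean()
test_pos_rate = y_test.mean()

print(f"same split: {same_split}")
print(f"test fraction: {default_test_fraction}")
print(f"class-1 share: train={train_pos_rate:.2f}, test={test_pos_rate:.2f}")
```

Dropping `stratify=y` from the call would still give a valid split, but the class-1 share in the test set would then depend on the random draw.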

In [None]:
# Download the Pokémon dataset; -O sets the name of the saved file
!wget -O pokemon.csv https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

df=pd.read_csv('pokemon.csv')
X=df.loc[:,'HP':'Speed'] # features: the columns from HP through Speed
y=df['Type 1'] # target: the primary type
# Split the dataset: 80% for training, 20% for testing
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,
                                               random_state=42,
                                               stratify=y)
# Compare class counts in the full dataset, the training set, and the test set
df_count=pd.concat([y.value_counts(),
                    y_train.value_counts(),
                    y_test.value_counts()],axis=1)
df_count.columns=['y','y_train','y_test']
df_count.head()


| Type 1 | y | y_train | y_test |
|---|---|---|---|
| Water | 112 | 90 | 22 |
| Normal | 98 | 78 | 20 |
| Grass | 70 | 56 | 14 |
| Bug | 69 | 55 | 14 |
| Psychic | 57 | 45 | 12 |


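## 1-3-2 k-Fold Cross-Validation
Cross-validation gives a more stable performance estimate than a single training/test split. The training set is divided into k equal parts (folds), where k is chosen by the user and is usually 5 or 10. Each fold serves once as the validation set while the model is fitted on the remaining k-1 folds, and the k scores are then averaged. Different ways of forming the folds lead to different cross-validation methods, the most basic of which is k-fold cross-validation.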
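
To make the fold mechanics concrete, here is a minimal sketch of plain k-fold index generation in NumPy. The helper `k_fold_indices` and its `seed` argument are hypothetical names introduced for illustration only; in practice scikit-learn's `KFold` and `StratifiedKFold` splitters do this job.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for plain k-fold cross-validation.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)  # k nearly equal parts
    for i in range(k):
        valid_idx = folds[i]  # fold i is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, valid_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for fold, (train_idx, valid_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={np.sort(train_idx)}, valid={np.sort(valid_idx)}")
```

Every sample lands in exactly one validation fold, so each one is used for validation once and for training k-1 times.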

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
import numpy as np

X=df.loc[:,'HP':'Speed']
y=df['Legendary'] # new target: whether the Pokémon is Legendary
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)
# Stratified k-fold: each fold keeps the class proportions of y_train
kfold=StratifiedKFold(n_splits=10,shuffle=True,random_state=42).split(X_train,y_train)
score_lst=[] # accuracy of each of the k folds

for k,(i_train,i_valid) in enumerate(kfold):
  # Initialize a fresh kNN classifier and fit it on the training folds
  knn=KNeighborsClassifier(n_neighbors=2)
  knn.fit(X_train.iloc[i_train,:],
          y_train.iloc[i_train])
  # Evaluate accuracy on the held-out validation fold
  score=knn.score(X_train.iloc[i_valid,:],
                  y_train.iloc[i_valid])
  score_lst.append(score)
  print(f'{k+1}-Fold:Acc={score:.2f}')

print(f'\n10-fold CV accuracy={np.mean(score_lst):.3f},std={np.std(score_lst):.3f}')


1-Fold:Acc=0.94
2-Fold:Acc=0.95
3-Fold:Acc=0.94
4-Fold:Acc=0.97
5-Fold:Acc=0.94
6-Fold:Acc=0.94
7-Fold:Acc=0.94
8-Fold:Acc=0.97
9-Fold:Acc=0.94
10-Fold:Acc=0.91

10-fold CV accuracy=0.942,std=0.017


In [None]:
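# scikit-learn provides the helper cross_val_score() to evaluate a model concisely.
# It can also distribute the evaluation of the different folds across several CPU cores
# to run them in parallel. The n_jobs parameter sets how many cores to use; -1 uses all of them.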
from sklearn.model_selection import cross_val_score

knn=KNeighborsClassifier(n_neighbors=2)
score_lst=cross_val_score(estimator=knn,X=X_train,y=y_train,cv=10,n_jobs=-1)
print(f'10-fold CV accuracy scores\n {score_lst}')
print(f'\n10-fold CV accuracy = {np.mean(score_lst):.3f}, std={np.std(score_lst):.3f}')

10-fold CV accuracy scores
 [0.96875  0.9375   0.953125 0.9375   0.984375 0.921875 0.953125 0.921875
 0.921875 0.921875]

10-fold CV accuracy = 0.942, std=0.021
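
The per-fold scores from cross_val_score() differ from those of the manual loop, and one plausible reason is the fold assignment: for a classifier, an integer `cv` is shorthand for a `StratifiedKFold` without shuffling, whereas the manual loop shuffled with `random_state=42`. The sketch below illustrates this on synthetic stand-in data (the `make_classification` dataset and the choice of 5 folds are assumptions made to keep the example self-contained).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 200 samples, 6 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
knn = KNeighborsClassifier(n_neighbors=2)

# For a classifier, cv=5 behaves like StratifiedKFold(n_splits=5) without shuffling.
scores_int = cross_val_score(knn, X, y, cv=5)
scores_obj = cross_val_score(knn, X, y, cv=StratifiedKFold(n_splits=5))

# Passing a shuffled splitter changes the folds, so the scores may differ.
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_shuffled = cross_val_score(knn, X, y, cv=shuffled_cv)

# The same shuffled folds evaluated by hand, as in the manual loop.
manual = []
for train_idx, valid_idx in shuffled_cv.split(X, y):
    model = KNeighborsClassifier(n_neighbors=2).fit(X[train_idx], y[train_idx])
    manual.append(model.score(X[valid_idx], y[valid_idx]))

print("integer cv      :", np.round(scores_int, 3))
print("shuffled cv     :", np.round(scores_shuffled, 3))
print("manual, shuffled:", np.round(manual, 3))
```

Passing the splitter object through `cv` is one way to make cross_val_score() and a hand-written loop evaluate exactly the same folds.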
