![](所有数据全为训练数据.png)
![](测试数据集的出现.png)

If we keep tuning hyperparameters against the test dataset,  
the model may end up overfitting that particular test dataset  
![](针对特定数据过拟合.png)


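Train the model on the training dataset  
Evaluate the trained model on the validation dataset and tune it repeatedly (the validation dataset is the one used for adjusting hyperparameters)  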

The test dataset takes no part in building the model; it serves only to measure the final model's performance
![](验证数据集的出现.png)
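
The train / validation / test workflow described above can be sketched roughly as follows. This is a minimal illustration, not part of the original notebook: the 60/20/20 proportions, the `random_state` value, and the names `X_rest`, `X_val`, `best_val_score`, and `final_clf` are all assumptions chosen for the example.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Hold out 20% as the test set; it is not touched until the very end
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666)
# Split the remainder again: 75% training, 25% validation (assumed ratios)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=666)

# Tune k against the validation set only
best_k, best_val_score = 0, 0.0
for k in range(2, 10):
    knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance")
    knn_clf.fit(X_train, y_train)
    val_score = knn_clf.score(X_val, y_val)
    if val_score > best_val_score:
        best_k, best_val_score = k, val_score

# Report the final performance once, on the untouched test set
final_clf = KNeighborsClassifier(n_neighbors=best_k, weights="distance")
final_clf.fit(X_train, y_train)
test_score = final_clf.score(X_test, y_test)
print(best_k, best_val_score, test_score)
```

Because the test set plays no role in choosing `best_k`, `test_score` is a fairer estimate of performance on unseen data than `best_val_score`.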

# Cross-validation
![](交叉验证.png)

## Using train_test_split

In [2]:
import numpy as np
from sklearn import datasets

In [3]:
digits = datasets.load_digits()
X = digits.data
y = digits.target

In [8]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.4)

In [6]:
from sklearn.neighbors import KNeighborsClassifier

best_score = 0
best_p = 0
best_k = 0
for k in range(2,10):
    for p in range(1,6):
        knn_clf = KNeighborsClassifier(weights="distance",n_neighbors=k,p=p)
        knn_clf.fit(X_train,y_train)
        score = knn_clf.score(X_test,y_test)
        
        if score > best_score:
            best_score=score
            best_p = p
            best_k = k

print(best_k)
print(best_p)
print(best_score)

2
2
0.9847009735744089


## Using cross-validation

In [9]:
from sklearn.model_selection import cross_val_score

knn_clf = KNeighborsClassifier()
cross_val_score(knn_clf,X_train,y_train)



array([0.98618785, 0.98328691, 0.98319328])
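
To make clearer what `cross_val_score` computes, here is a rough manual equivalent. It assumes 3 folds (to match the three scores printed above) and relies on scikit-learn's documented behavior of using unshuffled `StratifiedKFold` for classifiers when `cv` is an integer. The names `n_folds`, `fold_clf`, `manual_scores`, and `library_scores`, and the `random_state` value, are illustrative and not from the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.4, random_state=666)

knn_clf = KNeighborsClassifier()
n_folds = 3  # assumed fold count

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=n_folds).split(X_train, y_train):
    # Fit a fresh copy on the other folds, then score on the held-out fold
    fold_clf = clone(knn_clf)
    fold_clf.fit(X_train[train_idx], y_train[train_idx])
    manual_scores.append(fold_clf.score(X_train[val_idx], y_train[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(knn_clf, X_train, y_train, cv=n_folds)
print(manual_scores)
print(library_scores)
```

Each fold serves as the validation set exactly once, and `X_test` is never used during cross-validation.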

In [10]:
best_score = 0
best_p = 0
best_k = 0
for k in range(2,10):
    for p in range(1,6):
        knn_clf = KNeighborsClassifier(weights="distance",n_neighbors=k,p=p)
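        # Score this (k, p) pair by its mean accuracy across the CV folds;
        # cross_val_score fits its own clones, so no separate fit is needed
        scores = cross_val_score(knn_clf,X_train,y_train)
        score = np.mean(scores)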
        
        if score > best_score:
            best_score=score
            best_p = p
            best_k = k

print(best_k)
print(best_p)
print(best_score)





2
3
0.987021088720914


In [11]:
best_knn_clf = KNeighborsClassifier(weights="distance",n_neighbors=2,p=3)

In [12]:
best_knn_clf.fit(X_train,y_train)
best_knn_clf.score(X_test,y_test)

0.9847009735744089

## Revisiting grid search

In [13]:
from sklearn.model_selection import GridSearchCV
# Define the hyperparameter combinations to search
param_grid = [
    {
        "weights":["distance"],
        "n_neighbors":[i for i in range(2, 11)],
        'p':[i for i in range(1,6)]
    }
]

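# Create a classifier with default settings
knn_clf = KNeighborsClassifier()
# Set up the grid search over param_grid
grid_search = GridSearchCV(knn_clf, param_grid) 
# Search for the best hyperparameter combination (scored by cross-validation)
grid_search.fit(X_train,y_train)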



GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=None,
             param_grid=[{'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [15]:
grid_search.best_score_

0.987012987012987

In [16]:
grid_search.best_params_

{'n_neighbors': 2, 'p': 3, 'weights': 'distance'}

In [17]:
best_knn_clf = grid_search.best_estimator_

In [18]:
best_knn_clf.score(X_test,y_test)

0.9847009735744089

## Some other parameters

In [19]:
# The cv parameter sets the number of folds
cross_val_score(knn_clf,X_train,y_train,cv=5)

array([0.97260274, 1.        , 0.98148148, 0.98604651, 0.98584906])

In [None]:
# grid_search = GridSearchCV(knn_clf, param_grid,cv=5) 

k-fold cross-validation  
Drawback: k models are trained for every evaluation, so it is roughly k times slower  
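
The cost can be made concrete with a little arithmetic. The sketch below counts how many model fits a k-fold grid search needs for the `param_grid` used earlier; `count_fits` is a hypothetical helper written for this illustration, and it ignores the one extra refit on the full training data that `GridSearchCV` performs when `refit=True`.

```python
from itertools import product

# Same grid as the param_grid used in the grid search above
param_grid = {
    "weights": ["distance"],
    "n_neighbors": list(range(2, 11)),
    "p": list(range(1, 6)),
}

def count_fits(grid, n_folds):
    """Number of model fits a k-fold grid search needs (ignoring the final refit)."""
    n_candidates = len(list(product(*grid.values())))
    return n_candidates * n_folds

for n_folds in (3, 5, 10):
    print(n_folds, count_fits(param_grid, n_folds))
```

With 45 candidate combinations, going from 3 folds to 10 folds raises the count from 135 fits to 450.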

Leave-one-out
![](留一法.png)
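
Leave-one-out is the extreme case of k-fold in which k equals the number of samples, so each sample is used once as a single-item validation set. A minimal sketch with scikit-learn's `LeaveOneOut` splitter follows. The 200-sample subset, `n_neighbors=3`, and the names `X_small`, `y_small`, and `loo_scores` are assumptions made only to keep the example fast.

```python
from sklearn import datasets
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
# Keep only the first 200 samples so the sketch runs quickly (arbitrary choice)
X_small, y_small = digits.data[:200], digits.target[:200]

knn_clf = KNeighborsClassifier(n_neighbors=3)
# One fit per sample: train on 199 samples, validate on the remaining one
loo_scores = cross_val_score(knn_clf, X_small, y_small, cv=LeaveOneOut())

print(len(loo_scores))    # one score per sample
print(loo_scores.mean())  # fraction of held-out samples classified correctly
```

Each individual score is either 0 or 1, since every validation set holds a single sample. The mean does not depend on a random split, but the number of fits grows with the size of the dataset.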