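Hyperparameters: parameters that must be specified before a machine learning algorithm is run, such as k in the KNN algorithm.<br>
Model parameters: parameters that the algorithm learns during training and that belong to the model.<br>
<br>
(The KNN algorithm has no model parameters.)<br><br>
Ways to find good hyperparameters:
<li>Domain knowledge
<li>Empirical values
<li>Experimental search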

In [1]:
import numpy as np
from sklearn import datasets

In [2]:
digits = datasets.load_digits()
X = digits.data
y = digits.target

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 666)

In [4]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)

0.9888888888888889

### Finding the best k

In [5]:
best_score = 0.0
best_k = -1
for k in range(1,11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score

print("best k = ", best_k)
print("best score = ", best_score)

best k =  4
best score =  0.9916666666666667


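If the best k found by the search lies at or near the edge of the search range, for example 10, it is a good idea to widen the range and search again, say from 8 to 20, to see whether an even better k exists.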
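The boundary check described above can be sketched as follows. This is only a minimal illustration: `best_k_in_range` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for `knn_clf.score(X_test, y_test)` that happens to peak at k = 12.

```python
def best_k_in_range(score_for_k, k_values):
    """Return (best_k, hit_upper_edge) for the candidate values of k."""
    k_values = list(k_values)
    # max() keeps the first maximum, matching the strict `>` in the loop above
    best_k = max(k_values, key=score_for_k)
    return best_k, best_k == k_values[-1]


# Hypothetical stand-in for knn_clf.score(X_test, y_test); peaks at k = 12
def toy_score(k):
    return 1.0 - 0.001 * (k - 12) ** 2


first_k, first_hit_edge = best_k_in_range(toy_score, range(1, 11))
print("first search:", first_k, first_hit_edge)

if first_hit_edge:
    # The best k sits on the upper edge, so search a wider range
    second_k, second_hit_edge = best_k_in_range(toy_score, range(8, 21))
    print("second search:", second_k, second_hit_edge)
```

With a real classifier, `score_for_k` would fit a `KNeighborsClassifier(n_neighbors=k)` and return its score, exactly as the loop in cell 5 does.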

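### Is k the only hyperparameter in KNN?

Another one is how neighbors are weighted by their distance.<br>
In sklearn.neighbors.KNeighborsClassifier, the weights parameter controls this hyperparameter.<br>
weights defaults to uniform, meaning every neighbor's vote counts equally regardless of distance. Setting it to distance weights each neighbor's vote by the inverse of its distance, so closer neighbors count more.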
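To make the difference concrete, here is a small pure-Python sketch of the two voting schemes. It is not scikit-learn's implementation; the `vote` helper and the toy neighbors are assumptions made for illustration, and the sketch assumes all distances are strictly positive.

```python
from collections import defaultdict


def vote(neighbors, weights="uniform"):
    """Pick a label from (label, distance) pairs of the k nearest points."""
    totals = defaultdict(float)
    for label, distance in neighbors:
        # uniform: each neighbor counts 1; distance: each counts 1 / distance
        totals[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return max(totals, key=totals.get)


# Toy example: one very close "red" point and two farther "blue" points
neighbors = [("red", 1.0), ("blue", 3.0), ("blue", 4.0)]

uniform_winner = vote(neighbors, "uniform")    # blue wins 2 votes to 1
distance_winner = vote(neighbors, "distance")  # red: 1.0, blue: 1/3 + 1/4
print(uniform_winner, distance_winner)
```

The same three neighbors produce different predictions under the two schemes, which is why weights is worth including in the search.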

#### Let's test it: is it better to weight by distance or not?

In [6]:
best_method = ""
best_score = 0.0
best_k = -1
for method in ("uniform", "distance"):
    for k in range(1,11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights = method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method

print("best k = ", best_k)
print("best score = ", best_score)
print("best method = ", best_method)

best k =  4
best score =  0.9916666666666667
best method =  uniform


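So for this handwritten digit recognition problem, at least on this train/test split, KNN does better without distance weighting.

#### Are there more hyperparameters?

The p in the Minkowski distance: p=1 gives the Manhattan distance, p=2 gives the Euclidean distance.<br>
KNeighborsClassifier in sklearn has a corresponding p parameter.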
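As a quick check of how p changes the distance, the sketch below computes the Minkowski distance, (Σ|xᵢ − yᵢ|ᵖ)^(1/p), by hand. The function name `minkowski_distance` and the sample points are assumptions chosen for illustration.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two equal-length sequences of numbers."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)

manhattan = minkowski_distance(a, b, 1)  # |3| + |4|
euclidean = minkowski_distance(a, b, 2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```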

#### Searching for the best p of the Minkowski distance

In [7]:
%%time
best_p = -1
best_score = 0.0
best_k = -1

for k in range(1,11):
    for p in range(1,6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights = "distance", p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_p = p

print("best k = ", best_k)
print("best p = ", best_p)
print("best score = ", best_score)

best k =  3
best p =  2
best score =  0.9888888888888889
Wall time: 47.8 s


The search approach used above is called grid search: list every combination of hyperparameter values and test each one.
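The "grid" is simply the Cartesian product of the candidate values. The sketch below enumerates the same combinations as the nested loops in cell 7 using `itertools.product`; the `param_grid` dictionary is an illustrative name here, not something defined elsewhere in this notebook.

```python
from itertools import product

# Candidate values for each hyperparameter, as in the nested loops above
param_grid = {"n_neighbors": range(1, 11), "p": range(1, 6)}

names = list(param_grid)
combinations = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

print(len(combinations))  # 10 values of k times 5 values of p
print(combinations[0])
```

Each dictionary could then be unpacked into the classifier, for example `KNeighborsClassifier(weights="distance", **params)`, and scored as before. The number of combinations grows multiplicatively with every added hyperparameter, which helps explain why the search above takes tens of seconds.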

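New problem: hyperparameters sometimes depend on one another. For example, the search above only considers p when distance weighting is enabled.<br>
So how can all of the hyperparameters be tested in one go?

To find out, see the next section, Grid Search.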