Contents:
* Implementing a KNN algorithm based on linear search

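Steps of the KNN algorithm:
1. Compute the distance between the current point and every point in the labelled dataset
2. Select the k points with the smallest distance to the current point
3. Count how often each class occurs among those k points
4. Return the most frequent class among those k points as the predicted class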
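
The four steps above can be sketched on a toy dataset as follows. The points, labels, and query are made up for illustration, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data: two clusters labelled 0 and 1
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]])
y = np.array([0, 0, 0, 1, 1])
query = np.array([1.1, 1.0])
k = 3

# Step 1: distance from the query to every labelled point (Euclidean assumed)
distances = np.sqrt(np.sum((X - query) ** 2, axis=1))
# Step 2: indices of the k smallest distances
nearest = np.argsort(distances)[:k]
# Step 3: class frequencies among those k points
votes = Counter(y[nearest].tolist())
# Step 4: the most frequent class is the prediction
prediction = votes.most_common(1)[0][0]
print(prediction)  # the three nearest points all belong to class 0
```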

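Linear search and KDTree-based search differ in how they select the k points closest to the current point.  
Linear search simply computes the distance from every point in the dataset to the current point, sorts the distances, takes the k nearest points, and lets the classes of those k points vote.  
KDTree search first stores the dataset in a tree structure, then uses that structure to find the k nearest points.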
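
As a rough illustration of the two search strategies, the sketch below finds the same k neighbors once by sorting all distances and once with `sklearn.neighbors.KDTree`. The random dataset, the seed, and the variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
points = rng.random((200, 3))  # hypothetical dataset of 200 points in 3-D
query = rng.random((1, 3))     # a single query point
k = 5

# Linear search: compute all distances, sort, keep the k smallest
dist_all = np.sqrt(np.sum((points - query) ** 2, axis=1))
linear_idx = np.argsort(dist_all)[:k]

# KDTree search: build the tree once, then query it for the k nearest points
tree = KDTree(points)
tree_dist, tree_idx = tree.query(query, k=k)

# Both strategies should agree on which points are nearest
print(sorted(linear_idx.tolist()) == sorted(tree_idx[0].tolist()))
```

The tree has to be built before it can be queried, so it only pays off when many queries are run against the same dataset.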

# KNN based on linear search

In [7]:
import numpy as np
from collections import Counter


class KNN:
    def __init__(self, k=3, p=2):
        """Initialize the KNN classifier: k neighbors, Minkowski distance of order p."""
        self.k = k
        self.p = p
        self.X_train = None
        self.y_train = None

    def fit(self, X_train, y_train):
        """Fit the KNN classifier on the training data (the data is simply stored)."""
        self.X_train = X_train
        self.y_train = y_train
        return self

    def predict(self, X_predict):
        """Given a dataset X_predict, return the vector of predicted labels."""
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """Given a single sample x, return its predicted label."""
        # Minkowski distance from x to every training point: (sum |a - b|^p)^(1/p)
        distance = np.power(np.sum(np.power(np.abs(self.X_train - x), self.p), axis=1), 1 / self.p)
        nearest = np.argsort(distance)
        topK_y = self.y_train[nearest[:self.k]]

        # Majority vote among the labels of the k nearest points
        votes = Counter(topK_y)
        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "KNN(k=%d,p=%d)" % (self.k, self.p)
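
The `p` parameter refers to the order of the Minkowski distance, (sum |a_i - b_i|^p)^(1/p). As a quick sanity check of that formula, the sketch below evaluates it on a 3-4-5 triangle; the helper name `minkowski` is hypothetical and not part of the class above.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors (hypothetical helper)."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)


a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 1))  # Manhattan distance: 3 + 4
print(minkowski(a, b, 2))  # Euclidean distance: sqrt(9 + 16)
```

The outer power is monotonic for positive `p`, so it changes the distance values but not the ordering of the neighbors.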

In [10]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
data = load_iris()
X = data.data
y = data.target
X_train,X_test,y_train,y_test = train_test_split(X,y)

In [30]:
%%time
clf = KNN(k=5)
clf.fit(X_train,y_train)
y_predict = clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)

CPU times: user 3.8 ms, sys: 1.34 ms, total: 5.14 ms
Wall time: 3.98 ms


## Comparison with the KNN algorithm in sklearn

In [31]:
%%time
from sklearn.neighbors import KNeighborsClassifier
clf_skl = KNeighborsClassifier()
clf_skl.fit(X_train,y_train)
clf_skl.score(X_test,y_test)

CPU times: user 4.84 ms, sys: 2.75 ms, total: 7.59 ms
Wall time: 5.21 ms


In [22]:
clf_skl

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

Summary:
Because the iris dataset is small, the hand-written KNN and sklearn's KNN perform about the same (3.98 ms versus 5.21 ms of wall time).

## Testing the performance gap between the hand-written KNN and sklearn's KNN on the handwritten digits dataset

In [32]:
from sklearn.datasets import load_digits
data = load_digits()

In [38]:
X_train,X_test,y_train,y_test = train_test_split(data.data,data.target)

In [61]:
%%time
clf = KNN()
clf.fit(X_train,y_train)
predict = clf.predict(X_test)

CPU times: user 832 ms, sys: 3.99 ms, total: 836 ms
Wall time: 840 ms


In [62]:
accuracy_score(y_test, predict)

0.9911111111111112

In [58]:
%%time
clf_skl = KNeighborsClassifier(n_neighbors=3)
clf_skl.fit(X_train,y_train)

CPU times: user 2.93 ms, sys: 1.17 ms, total: 4.11 ms
Wall time: 3 ms


In [59]:
clf_skl.score(X_test,y_test)

0.9888888888888889

In [60]:
clf_skl

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

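Summary:
The measured gap is huge: 840 ms of wall time for the hand-written KNN versus 3 ms for sklearn, a difference of more than a hundredfold 😢. Note that the sklearn cell times only `fit`, while the hand-written cell times both `fit` and `predict`, so the two figures do not measure the same work.  
Given this, the next step has to be trying a KDTree.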
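
One way to time training and prediction together for sklearn's classifier is sketched below, so that the figure covers the same work as the hand-written `fit` plus `predict` cell. The fixed `random_state` and the choice of the `brute` and `kd_tree` algorithms are assumptions made for this example; timings vary between machines and runs, so none are shown.

```python
import time
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_digits()
# random_state is fixed only to make this sketch repeatable (an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

results = {}
for algorithm in ("brute", "kd_tree"):
    clf = KNeighborsClassifier(n_neighbors=3, algorithm=algorithm)
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)  # score() runs the prediction step
    elapsed = time.perf_counter() - start
    results[algorithm] = (accuracy, elapsed)
    print(f"{algorithm}: accuracy={accuracy:.4f}, fit+predict={elapsed * 1000:.1f} ms")
```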