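### Description:
> * This time we implement handwritten digit recognition with the sklearn package. It is again a review of the steps for analyzing a machine learning problem, though slightly different from before because the data are images.
>

> * The images need to be converted into numeric data, and loading the data also differs from the usual approach.
>

> * The dataset already comes split into a training set and a test set, but each sample is stored in its own file, so this is a chance to practice file operations.
>

> * This project may go beyond KNN and try some other classification algorithms to see whether the classification results improve.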

### 1. Import the required packages

In [73]:
import numpy as np
from os import listdir

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings("ignore")

### 2. Load the training and test sets

In [2]:
# Convert a 32x32 text image into a 1x1024 vector
def img2Vector(filename):
    returnVect = np.zeros((1, 1024))
    # Use a context manager so the file is always closed
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect
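
The same conversion can be sketched without the nested index arithmetic by reading the rows into a list and reshaping. This is only an illustrative variant: the name `img2vector_np` and its `size` parameter are hypothetical, and the demo file is synthetic (a diagonal of ones), not one of the real digit files.

```python
import os
import tempfile
from itertools import islice

import numpy as np


def img2vector_np(filename, size=32):
    """Read a size x size text image of '0'/'1' characters into a (1, size*size) array."""
    with open(filename) as fr:
        # Take the first `size` lines and the first `size` characters of each line
        rows = [[int(ch) for ch in line[:size]] for line in islice(fr, size)]
    return np.array(rows, dtype=float).reshape(1, size * size)


# Demo on a synthetic image: row i has a single '1' in column i
lines = ["".join("1" if j == i else "0" for j in range(32)) for i in range(32)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "0_0.txt")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    vec = img2vector_np(path)

print(vec.shape, vec.sum())
```

Pixel `(i, j)` of the image ends up at index `32*i + j` of the vector, the same layout as in `img2Vector`.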

In [75]:
# Load the dataset
trainingFileList = listdir('digits/trainingDigits')
print(len(trainingFileList))    # 1934 files
Train_num = len(trainingFileList)
TrainingMat = np.zeros((Train_num, 1024))
hwLabels = []
for i in range(Train_num):
    fileNameStr = trainingFileList[i]
    fileStr = fileNameStr.split('.')[0]
    classNumStr = int(fileStr.split('_')[0])
    hwLabels.append(classNumStr)
    TrainingMat[i,:] = img2Vector('digits/trainingDigits/%s' % fileNameStr)

# The training set is now extracted: (TrainingMat, hwLabels)

# Next, extract the test set
testFileList = listdir('digits/testDigits')
Test_num = len(testFileList)
TestMat = np.zeros((Test_num, 1024))
TestLabels = []
for i in range(Test_num):
    fileNameStr = testFileList[i]
    fileStr = fileNameStr.split('.')[0]
    classNumStr = int(fileStr.split('_')[0])
    TestLabels.append(classNumStr)
    TestMat[i, :] = img2Vector('digits/testDigits/%s' % fileNameStr)

# The test set is now extracted as well: (TestMat, TestLabels)
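
The training and test loops above do the same work, so they could be folded into one helper. The sketch below is one possible shape for it, under assumptions: `label_from_filename`, `load_digit_dir`, and `read_text_image` are hypothetical names, the file names follow the `<digit>_<index>.txt` pattern that the cell parses, and the two demo files are made up rather than taken from `digits/`.

```python
import os
import tempfile

import numpy as np


def label_from_filename(filename):
    """Return the digit before the underscore, e.g. '9_45.txt' -> 9."""
    return int(os.path.splitext(filename)[0].split("_")[0])


def load_digit_dir(dirname, to_vector):
    """Load every file in `dirname` into an (n, 1024) matrix and a label list."""
    filenames = sorted(os.listdir(dirname))  # sorted only to make the order reproducible
    mat = np.zeros((len(filenames), 1024))
    labels = []
    for i, name in enumerate(filenames):
        labels.append(label_from_filename(name))
        mat[i, :] = to_vector(os.path.join(dirname, name))
    return mat, labels


def read_text_image(path):
    """Stand-in reader for this demo; the notebook's img2Vector could be passed instead."""
    with open(path) as f:
        rows = [[int(ch) for ch in line.strip()] for line in f if line.strip()]
    return np.array(rows, dtype=float).reshape(1, -1)


# Demo on two synthetic files (made-up data, not the real digits/ directory)
with tempfile.TemporaryDirectory() as tmp:
    for name, fill in [("3_0.txt", "0"), ("7_1.txt", "1")]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write("\n".join(fill * 32 for _ in range(32)) + "\n")
    mat, labels = load_digit_dir(tmp, read_text_image)

print(mat.shape, labels)
```

With the real data, the calls would presumably look like `load_digit_dir('digits/trainingDigits', img2Vector)` and `load_digit_dir('digits/testDigits', img2Vector)`.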

### 3. Build the model

In [23]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(TrainingMat, hwLabels)
predictions = model.predict(TestMat)
#print(len(predictions))
# errorCount = 0.0
# for i in range(Test_num):
#     print("the classifier came back with:%d, the real answer is:%d" % (predictions[i], TestLabels[i]))
#     if (predictions[i] != TestLabels[i]):
#         errorCount += 1.0
# print("\ntotal number of errors is : %d" % errorCount)
# print("\nthe total error rate is : %f" % (errorCount/float(Test_num)))

the classifier came back with:0, the real answer is:0
...
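
The commented-out loop in the cell above counts mismatches one sample at a time. A vectorized sketch of the same count with NumPy is shown here; `error_summary` is a hypothetical helper name, and the label arrays are toy values invented for the demo, not the notebook's actual predictions.

```python
import numpy as np


def error_summary(y_true, y_pred):
    """Return (error count, error rate, indices of misclassified samples)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    wrong = np.flatnonzero(y_true != y_pred)  # positions where prediction and label differ
    return len(wrong), len(wrong) / len(y_true), wrong.tolist()


# Toy labels (made up): two of the eight predictions are wrong
toy_true = [0, 0, 1, 1, 2, 2, 9, 9]
toy_pred = [0, 0, 1, 7, 2, 2, 9, 4]

count, rate, wrong_idx = error_summary(toy_true, toy_pred)
print("total number of errors is : %d" % count)
print("the total error rate is : %f" % rate)
print("misclassified sample indices:", wrong_idx)
```

In the notebook, the arguments would be `TestLabels` and `predictions`; the returned indices make it easy to look up which files were misclassified.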

### 4. Compare models and try to find a better one

In [82]:
models = {}
models['KNN'] = KNeighborsClassifier(n_neighbors=3)
models['SVM'] = SVC(C=2.0, kernel='rbf')
models['CART'] = DecisionTreeClassifier()
models['NB'] = GaussianNB()
models['AB'] = AdaBoostClassifier()
models['GBM'] = GradientBoostingClassifier()
models['RF'] = RandomForestClassifier()
models['ET'] = ExtraTreesClassifier()
for key, model in models.items():
    model.fit(TrainingMat, hwLabels)
    predictions = model.predict(TestMat)
    # accuracy_score expects (y_true, y_pred)
    print("{} accuracy_rate: {}".format(key, accuracy_score(TestLabels, predictions)))

KNN accuracy_rate: 0.9873150105708245
SVM accuracy_rate: 0.9767441860465116
CART accuracy_rate: 0.8858350951374208
NB accuracy_rate: 0.733615221987315
AB accuracy_rate: 0.828752642706131
GBM accuracy_rate: 0.9725158562367865
RF accuracy_rate: 0.959830866807611
ET accuracy_rate: 0.9682875264270613
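
To read the comparison at a glance, the reported accuracies can be sorted from best to worst. The sketch below copies the numbers from the output above, rounded to four decimals. Note that the tree-based models are created without a fixed `random_state`, so their values may shift slightly between runs.

```python
# Test-set accuracies copied from the output above (rounded to 4 decimals)
reported = {
    "KNN": 0.9873,
    "SVM": 0.9767,
    "CART": 0.8858,
    "NB": 0.7336,
    "AB": 0.8288,
    "GBM": 0.9725,
    "RF": 0.9598,
    "ET": 0.9683,
}

# Sort by accuracy, highest first
ranking = sorted(reported.items(), key=lambda kv: kv[1], reverse=True)

for name, acc in ranking:
    print(f"{name:<5}{acc:.4f}  error rate {1 - acc:.4f}")
```

On this single train/test split, KNN and SVM come out on top, followed by the tree ensembles (GBM, ET, RF), with Gaussian naive Bayes last.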


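### 5. Try to optimize the model
> The results above show that the support vector machine performs reasonably well, so we try searching for better parameters.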

In [80]:
# Hyperparameter tuning --- SVM
param_grid = {}
param_grid['C'] = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0]
param_grid['kernel'] = ['linear', 'poly', 'rbf', 'sigmoid']
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy')
grid_result = grid.fit(X=TrainingMat, y=hwLabels)

print("Best: {}  using {}".format(grid_result.best_score_, grid_result.best_params_))


Best: 0.9560496380558429  using {'C': 2.0, 'kernel': 'rbf'}


In [83]:
# Hyperparameter tuning --- KNN
param_grid = {'n_neighbors': [1, 3, 5, 7,  9, 11, 13, 15, 17, 19, 21]}
model = KNeighborsClassifier()
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy')
grid_result = grid.fit(X=TrainingMat, y=hwLabels)
print("Best: {}  using {}".format(grid_result.best_score_, grid_result.best_params_))

Best: 0.9503619441571872  using {'n_neighbors': 3}
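
The scores printed by the grid searches are `best_score_` values, that is, mean cross-validated accuracy computed on folds of the training set. They are therefore not directly comparable with the test-set accuracies in section 4. A minimal sketch of scoring the tuned model on held-out data follows. It assumes sklearn's bundled 8x8 `load_digits` data as a stand-in, since the `digits/` files are not available here, and the split parameters are arbitrary choices for the demo.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): sklearn's bundled 8x8 digits, not the 32x32 text files
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

grid = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train, y_train)

# best_score_ is the mean cross-validated accuracy over the training folds
print("CV accuracy:", grid.best_score_, grid.best_params_)

# With the default refit=True, predict() uses the best model refit on all training data
test_acc = accuracy_score(y_test, grid.predict(X_test))
print("Test accuracy:", test_acc)
```

In the notebook, the equivalent step would presumably be `accuracy_score(TestLabels, grid.predict(TestMat))` after each grid search.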
