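The face recognition model is a fairly involved case of reducing n-dimensional data to k dimensions. We first read and preprocess the face images, then reduce their dimensionality with PCA, and finally build a K-nearest-neighbors (KNN) model to recognize faces.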

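## Background
Face recognition is a biometric technology that identifies a person from facial features. It has developed rapidly and is widely used, for example in face-recognition access control systems and pay-by-face apps.

In essence, face recognition builds a model on the color values of the pixels in each face image and classifies new images from them. Every pixel has its own value, and together these values form the feature vector of a face. Because a face image contains many pixels, it also yields many feature variables, so PCA is used to reduce the dimensionality.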
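To make the idea of going from n features to k features concrete, here is a minimal sketch of PCA done by hand with NumPy on synthetic data. The data, the variable names, and the choice k = 3 are all illustrative assumptions and are not part of the face dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for image data: 50 samples with n = 20 features that
# really vary along only 3 hidden directions, plus a little noise.
hidden = rng.normal(size=(50, 3))
mixing = rng.normal(size=(3, 20))
X_demo = hidden @ mixing + 0.01 * rng.normal(size=(50, 20))

# PCA by hand: center the data, then project onto the top-k right singular vectors.
k = 3
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:k].T  # shape (50, k)

# Share of the total variance kept by the first k components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(X_reduced.shape, round(float(explained), 4))
```

Because the synthetic features are driven by only three hidden factors, three components keep almost all of the variance. Real face images are not this clean, which is why the number of components has to be chosen with some care.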

## Reading, processing, and extracting variables from the face data

### 1. Reading the face image data

In [1]:
import os
names = os.listdir('datasets/olivettifaces/')

In [2]:
names[0:5]

['10_6.jpg', '36_5.jpg', '12_4.jpg', '34_7.jpg', '29_2.jpg']

In [3]:
# Open one of the images from Python to take a look at it
from PIL import Image
img0 = Image.open('datasets/olivettifaces/' + names[0])
img0.show()

### 2. Processing the face data: extracting feature variables

In [4]:
import numpy as np
img0 = img0.convert('L')  # the 'L' mode converts the image to grayscale
img0 = img0.resize((32, 32))  # resize() scales the image to 32×32 pixels
arr = np.array(img0)
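The same conversion can be tried without the dataset on disk. The sketch below builds a random RGB image in memory as a hypothetical stand-in for a face photo and checks what `convert('L')`, `resize()`, and `np.array()` produce. Note that `resize()` takes `(width, height)`, while the resulting array is indexed as `(rows, columns)`.

```python
import numpy as np
from PIL import Image

# Hypothetical stand-in for a face photo: a 64x48 RGB image of random pixels.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(48, 64, 3), dtype=np.uint8)
img_demo = Image.fromarray(rgb)   # mode 'RGB', size (width=64, height=48)

gray = img_demo.convert('L')      # one 8-bit channel per pixel
small = gray.resize((32, 32))     # resize() takes (width, height)
arr_demo = np.array(small)        # 2-D array of grayscale values

print(img_demo.mode, img_demo.size)
print(arr_demo.shape, arr_demo.dtype)
```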

In [5]:
import pandas as pd
pd.DataFrame(arr)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,197,90,104,96,71,55,44,46,60,65,...,33,40,45,47,49,52,45,78,107,102
1,199,93,94,72,65,70,57,52,70,80,...,26,31,36,39,39,45,42,40,71,107
2,200,91,77,76,79,62,66,89,93,91,...,32,27,24,28,36,38,44,42,45,61
3,196,78,76,74,60,60,66,65,71,82,...,40,32,25,27,33,30,36,42,45,52
4,189,58,73,69,52,46,48,53,59,64,...,39,30,28,28,25,24,28,33,39,42
5,185,41,51,51,41,33,39,44,49,62,...,42,39,40,35,34,32,29,30,33,34
6,183,27,32,31,29,31,33,38,49,74,...,49,48,44,32,27,31,28,24,25,26
7,187,37,30,25,29,30,31,35,39,49,...,39,38,42,37,31,30,26,23,24,25
8,188,51,36,27,30,26,31,32,31,33,...,59,62,56,46,48,48,39,29,29,35
9,184,53,40,27,29,28,28,32,37,45,...,115,118,104,102,100,87,67,43,41,54


In [6]:
arr = arr.reshape(1, -1)
arr

array([[197,  90, 104, ..., 213, 216, 207]], dtype=uint8)
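`reshape(1, -1)` and `flatten()`, which is used in the next cell, both lay the pixels out row by row, but they return arrays of different rank. A small sketch on a toy 3×4 grid (an illustrative assumption, not image data) shows the difference:

```python
import numpy as np

grid = np.arange(12, dtype=np.uint8).reshape(3, 4)  # toy 3x4 "image"

row = grid.reshape(1, -1)   # 2-D array with a single row: shape (1, 12)
flat = grid.flatten()       # 1-D copy: shape (12,)

print(row.shape, flat.shape)
print(flat.tolist())        # rows of the grid, concatenated in order
```

For a 32×32 image the same calls give shapes `(1, 1024)` and `(1024,)`, so calling `flatten()` on the 2-D image array is already enough to obtain the list of 1,024 values.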

In [7]:
arr.flatten().tolist()

[197, 90, 104, 96, 71, 55, 44, 46, 60, 65, 64, 61, 54, 65, 70, 59, 50, 46, 39, 36, 37, 33, 33, 40, 45, 47, 49, 52, 45, 78, 107, 102,
 199, 93, 94, 72, 65, 70, 57, 52, 70, 80, 81, 70, 65, 66, 65, 70, 69, 58, 42, 34, 35, 31, 26, 31, 36, 39, 39, 45, 42, 40, 71, 107,
 200, 91, 77, 76, 79, 62, 66, 89, 93, 91, 98, 94, 83, 81, 74, 66, 69, 69, 56, 47, 41, 34, 32, 27, 24, 28, 36, 38, 44, 42, 45, 61,
 196, 78, 76, 74, 60, 60, 66, 65, 71, 82, 95, 90, 82, 89, 87, 74, 67, 67, 64, 61, 56, 47, 40, 32, 25, 27, 33, 30, 36, 42, 45, 52,
 189, 58, 73, 69, 52, 46, 48, 53, 59, 64, 74, 81, 83, 95, 90, 72, 62, 67, 64, 58, 55, 49, 39, 30, 28, 28, 25, 24, 28, 33, 39, 42,
 185, 41, 51, 51, 41, 33, 39, 44, 49, 62, 74, 71, 62, 70, 76, 73, 70, 73, 72, 69, 63, 52, 42, 39, 40, 35, 34, 32, 29, 30, 33, 34,
 183, 27, 32, 31, 29, 31, ...

In [8]:
# Convert every face image into numeric data
X = []  # empty list that will hold the grayscale values of each face image
for i in names:  # names is the list of image file names obtained at the start
    img = Image.open('datasets/olivettifaces/' + i)
    img = img.convert('L')  # convert the image to grayscale
    img = img.resize((32, 32))  # scale to 32×32 pixels
    arr = np.array(img)
    X.append(arr.flatten().tolist())  # append this image's 1,024 grayscale values to X
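The loop above can also be wrapped in a small function. The sketch below is one possible variant, not the notebook's own code: `load_faces` is a hypothetical helper, it sorts the file names because `os.listdir()` returns them in arbitrary order, and it is demonstrated on a throwaway folder of random images rather than the real dataset.

```python
import os
import tempfile

import numpy as np
from PIL import Image


def load_faces(folder, size=(32, 32)):
    """Return (names, X): sorted file names and one flattened grayscale row per image."""
    names = sorted(f for f in os.listdir(folder) if f.lower().endswith('.jpg'))
    rows = []
    for name in names:
        with Image.open(os.path.join(folder, name)) as img:
            rows.append(np.asarray(img.convert('L').resize(size)).ravel())
    return names, np.stack(rows)


# Demonstrate on a temporary folder of random images (not the real dataset).
rng = np.random.default_rng(0)
with tempfile.TemporaryDirectory() as tmp:
    for fname in ['2_1.jpg', '1_2.jpg', '1_1.jpg']:
        pixels = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
        Image.fromarray(pixels).save(os.path.join(tmp, fname))
    demo_names, X_demo = load_faces(tmp)

print(demo_names, X_demo.shape)
```

Returning the names together with the matrix keeps rows and file names aligned, which matters later when the labels are parsed from those same names.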

In [9]:
import pandas as pd
X = pd.DataFrame(X)
# X holds the feature variables of all face images; each row contains the grayscale value of every pixel in one image
X

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023
0,197,90,104,96,71,55,44,46,60,65,...,195,191,189,176,204,221,219,213,216,207
1,205,119,124,121,115,102,92,97,91,83,...,172,175,145,165,167,162,166,169,170,169
2,167,14,14,10,16,13,11,14,24,42,...,175,203,211,206,206,177,187,203,212,211
3,178,32,38,43,40,37,42,42,42,44,...,137,141,139,144,155,161,160,148,143,145
4,125,122,122,121,122,123,120,124,119,122,...,168,169,156,147,149,149,149,150,149,217
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,105,108,107,105,108,108,109,108,107,107,...,180,172,150,153,149,152,152,152,147,216
396,200,98,106,103,90,82,83,64,53,35,...,191,191,188,168,169,172,170,170,170,169
397,199,102,105,107,105,96,92,80,70,68,...,190,199,191,160,140,193,215,207,209,208
398,98,101,103,103,107,112,87,69,106,108,...,173,155,144,151,151,147,146,152,147,217


In [10]:
print(X.shape)

(400, 1024)


### 3. Processing the face data: extracting the target variable

In [11]:
temp = names[0].split('_')[0]
temp = int(temp)
temp

10

In [12]:
# Extract the target variable for all 400 face images
y = []
for i in names:
    y.append(int(i.split('_')[0]))  # the person ID is the number before the underscore
y

[10, 36, 12, 34, 29, 8, 32, 14, 30, 23, 16, 16, 23, 30, 14, 32, 8, 29, 34, 12,
 36, 10, 36, 10, 2, 38, 34, 12, 29, 8, 14, 21, 32, 16, 40, 30, 40, 30, 16, 32,
 21, 14, 8, 29, 12, 34, 38, 2, 10, 36, 36, 25, 10, 34, 12, 29, 14, 32, 8, 6,
 16, 30, 18, 18, 30, 16, 6, 8, 32, 14, 29, 12, 34, 10, 25, 36, 10, 36, 12, 27,
 34, 29, 32, 14, 8, 30, 16, 4, 4, 16, 30, 8, 14, 32, 29, 34, 27, 12, 36, 10,
 17, 31, 9, 15, 20, 33, 35, 13, 28, 37, 11, 3, 39, 39, 3, 11, 37, 28, 13, 35,
 33, 20, 15, 9, 31, 17, 31, 22, 17, 9, 33, 15, 13, 35, 1, 28, 11, 37, 37, 11,
 28, 1, 35, 13, 15, 33, 9, 17, 22, 31, 31, 17, 5, 33, 15, 9, 13, 26, 35, 28,
 11, 37, 37, 11, 28, 35, 26, 13, 9, 15, 33, 5, 17, 31, 17, 31, 19, 15, 33, 9,
 7, 35, 13, 28, 37, 24, 11, 11, 24, 37, 28, 13, 35, 7, 9, 33, 15, 19, 31, 17,
 31, 22, 19, 5, 20, 15, ...
 

In [13]:
len(y)

400
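Label extraction can be written as a list comprehension, and a quick count per person is a cheap sanity check on the result. In the sketch below only the first two file names come from the listing above; the rest are made up, and reading the part after the underscore as a photo index is an assumption.

```python
from collections import Counter

# File names following the "<person id>_<photo index>.jpg" pattern (partly hypothetical).
demo_names = ['10_6.jpg', '36_5.jpg', '10_2.jpg', '2_9.jpg', '36_1.jpg']

# The person ID is the integer before the underscore.
demo_y = [int(name.split('_')[0]) for name in demo_names]
counts = Counter(demo_y)  # how many photos each person has

print(demo_y)
print(counts.most_common())
```

On the full dataset, the same count should show every person with the same number of photos. A different result would point to a misnamed or missing file.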

## Data splitting and dimensionality reduction
### 1. Splitting into training and test sets

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
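With 40 people and only 10 photos each, a purely random split can leave some people with very few training photos. One option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts. The sketch uses random toy features with the same label layout as the face set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data shaped like the face set: 40 people x 10 photos, random "pixels".
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 16))
y_demo = np.repeat(np.arange(1, 41), 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Count how many test photos each person (IDs 1..40) contributes.
per_person = np.bincount(y_te, minlength=41)[1:]
print(X_tr.shape, X_te.shape, per_person.min(), per_person.max())
```

With stratification, every person ends up with 2 test photos and 8 training photos. Switching to a stratified split would change which images land in the test set, so the scores reported later in this walkthrough would not be directly comparable.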

### 2. PCA dimensionality reduction

In [15]:
from sklearn.decomposition import PCA
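# Create the PCA model with n_components=100: the 1,024 original features are
# linearly combined into 100 new, mutually uncorrelated features
pca = PCA(n_components=100)
pca.fit(X_train)  # fit the PCA model on the training features only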

In [16]:
# Use the fitted PCA model to reduce both the training and the test features
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

In [17]:
print("Training set:", X_train_pca.shape)
print("Test set:", X_test_pca.shape)

Training set: (320, 100)
Test set: (80, 100)
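The value 100 for `n_components` is a choice made in this walkthrough. A common way to judge such a choice is the cumulative explained variance, available as `explained_variance_ratio_` on a fitted `PCA`. The sketch below uses synthetic data and a 95% threshold, both of which are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 200 samples, 50 features driven by 5 hidden factors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 50))
X_demo += 0.05 * rng.normal(size=X_demo.shape)

# Fit a full PCA and accumulate the variance explained by each component.
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
k_95 = int(np.searchsorted(cumulative, 0.95, side='right') + 1)  # smallest k above 95%

# Passing a float in (0, 1) asks scikit-learn to pick that k itself.
pca_95 = PCA(n_components=0.95).fit(X_demo)
print(k_95, pca_95.n_components_)
```

Applied to the face data, the same check on `X_train` would show how much variance the 100 components retain.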


In [18]:
pd.DataFrame(X_train_pca).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-209.194783,-483.972886,-666.187647,-9.44291,90.117449,-372.726019,-68.658625,-65.583955,205.699489,128.490271,...,7.223361,103.594285,26.7457,-41.425204,6.164882,-7.773288,44.361977,25.205916,0.693482,43.244242
1,1000.904477,182.220499,54.964676,210.836605,-327.9603,179.376838,41.289902,182.588741,-64.144235,163.133922,...,-91.904078,79.751745,-12.721174,-37.322907,-45.295632,90.033565,41.488621,-0.297781,-21.66832,-46.947382
2,922.391257,-137.872455,-71.645004,-230.662194,-283.218886,-65.189245,73.734977,5.421844,-185.546159,23.495267,...,-61.614615,-5.127722,-13.572533,7.579274,6.545021,22.061466,-21.704533,-10.135357,40.125017,7.295073
3,20.002095,-156.049387,-632.836555,-13.186843,303.680092,-241.865972,368.969253,-242.585366,-59.969704,-48.072715,...,8.086528,-18.980447,0.164915,16.935543,-1.932065,-52.933057,-24.80615,21.272025,-39.519931,3.473405
4,-88.251392,-429.519773,-332.18516,-337.815174,340.893245,16.324757,54.116855,373.453012,-79.452681,-108.305404,...,53.444203,-17.184134,4.648724,-50.924929,-28.508075,-61.405204,-39.576091,-7.101498,27.380695,15.128173


In [19]:
pd.DataFrame(X_test_pca).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-321.694819,253.921637,-118.118867,-47.424803,-398.578863,-363.309106,224.493866,-84.487746,-49.916768,90.52117,...,-31.284767,-1.11091,-19.110605,-87.444995,-62.221263,3.965311,-11.865698,47.049103,3.626724,10.624931
1,-197.357089,514.410049,210.901735,38.342406,147.654736,-281.959454,242.332514,-32.805916,351.342863,0.399789,...,-19.721624,-1.335597,40.348113,-1.321023,-12.343347,30.890136,-31.131491,-43.328556,13.045446,42.916344
2,778.405108,-5.071112,-210.346944,-187.63076,-21.274935,64.34563,-33.602712,129.315647,166.219442,-59.542377,...,-4.57995,-31.82971,9.494805,20.260669,-9.025256,-23.84764,-14.88964,-0.721419,-16.447159,-19.92945
3,1089.483628,103.741141,-209.20932,-145.92928,-16.35185,-174.07204,210.084631,-47.299789,189.822662,45.705738,...,2.988708,-4.319463,-17.547373,-68.083339,28.422788,19.784891,51.543116,6.39886,12.343777,-8.58078
4,-318.476681,235.823098,238.878715,588.296981,-366.673581,-548.017741,362.243586,121.926383,101.168752,-130.133655,...,-41.825833,-16.004098,-21.46129,21.601761,-34.639083,9.031246,-60.819426,6.913433,33.948377,-20.253161


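## Building and using the model

### 1. Building the model
The model is built with the K-nearest-neighbors (KNN) classifier. KNN stores the feature data of the training faces. Given a test sample, it finds the closest training samples and assigns the label that is most common among them.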

In [20]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train_pca, y_train)
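The nearest-neighbor vote can be reproduced by hand on a tiny example. The sketch below uses made-up 2-D points and k = 3 as illustrative assumptions, and compares the manual result with `KNeighborsClassifier`. When no argument is passed, as in the cell above, scikit-learn uses `n_neighbors=5`.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: two well-separated 2-D clusters labeled 0 and 1.
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
query = np.array([0.3, 0.2])

# By hand: Euclidean distance to every training point, then a majority vote
# among the k closest ones.
k = 3
distances = np.linalg.norm(X_toy - query, axis=1)
nearest = np.argsort(distances)[:k]
manual_label = Counter(y_toy[nearest].tolist()).most_common(1)[0][0]

# The same prediction from scikit-learn.
knn_toy = KNeighborsClassifier(n_neighbors=k).fit(X_toy, y_toy)
sklearn_label = int(knn_toy.predict(query.reshape(1, -1))[0])

print(manual_label, sklearn_label)
```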

### 2. Model prediction

In [23]:
y_pred = knn.predict(X_test_pca)
y_pred

array([25, 17,  2,  2, 17, 30, 19,  3,  8, 11, 17, 38, 14, 38, 21,  8, 27,
        8, 12, 19, 15,  9, 21, 39, 29, 30, 23,  8, 23,  2, 29, 29, 35, 22,
       32, 34,  8,  9,  2, 11,  1, 20,  5, 18, 40, 30,  4,  6, 18, 19, 16,
       31,  5, 34, 20,  3, 20, 28, 13, 40, 13,  8, 26, 33,  2,  5, 35, 40,
       20,  4, 26, 34, 38, 18, 16, 19,  2,  7, 13, 22])

In [25]:
# Put the predicted and actual values side by side for comparison
import pandas as pd
a = pd.DataFrame()
a['Predicted'] = list(y_pred)
a['Actual'] = list(y_test)
a.head()

Unnamed: 0,Predicted,Actual
0,25,25
1,17,17
2,2,2
3,2,4
4,17,17


In [26]:
# Check the prediction accuracy over the whole test set
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)
score

0.8625

In [27]:
# Besides accuracy_score(), the KNN classifier's own score() method also returns the accuracy
score_knn = knn.score(X_test_pca, y_test)
score_knn

0.8625
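Accuracy is simply the fraction of test samples whose predicted label matches the true one. The sketch below computes it by hand and also lists the positions of the misclassified samples, which is handy for looking at the images the model gets wrong. The two label arrays are hypothetical and are not the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, not the notebook's results.
y_true_demo = np.array([25, 17, 2, 4, 17, 30, 19, 3])
y_pred_demo = np.array([25, 17, 2, 2, 17, 30, 19, 8])

# Accuracy is the fraction of positions where prediction and truth agree.
manual_acc = float(np.mean(y_true_demo == y_pred_demo))
sk_acc = accuracy_score(y_true_demo, y_pred_demo)

# Indices of the misclassified samples.
wrong = np.flatnonzero(y_true_demo != y_pred_demo)
print(manual_acc, sk_acc, wrong.tolist())
```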

### 3. Model comparison: with and without dimensionality reduction

In [30]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()  # build the KNN model
knn.fit(X_train, y_train)  # train directly on the raw pixels, without PCA
y_pred = knn.predict(X_test)  # predict directly on the raw test pixels
from sklearn.metrics import accuracy_score
score_no_pca = accuracy_score(y_test, y_pred)
score_no_pca

0.875
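On this single split the two scores differ by 0.0125, which is one test image out of 80, so the comparison is sensitive to how the data happened to be split. Cross-validation gives a steadier picture. The sketch below shows the pattern on the 8×8 digit images bundled with scikit-learn, used here only as a stand-in that needs no files on disk. The choices `n_components=20` and `cv=5` are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Stand-in data: the 8x8 digit images bundled with scikit-learn (64 features).
X_digits, y_digits = load_digits(return_X_y=True)

knn_raw = KNeighborsClassifier()
knn_pca = make_pipeline(PCA(n_components=20, random_state=0), KNeighborsClassifier())

# Mean accuracy over 5 folds; the pipeline refits PCA on each training fold.
score_raw = cross_val_score(knn_raw, X_digits, y_digits, cv=5).mean()
score_pca = cross_val_score(knn_pca, X_digits, y_digits, cv=5).mean()
print(round(score_raw, 4), round(score_pca, 4))
```

Putting PCA inside the pipeline means it is fitted only on the training part of each fold, mirroring the `pca.fit(X_train)` step used earlier. The same two estimators could be passed the face matrix `X` and labels `y` in place of the digits.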