**<font size=4>Principal Component Analysis (PCA)</font>**

Ref: Gavin Hackeling, Mastering Machine Learning with scikit-learn, 2014

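Dimensionality reduction addresses three kinds of problems:
1. It can mitigate the curse of dimensionality.
2. It can compress data while minimizing the information lost.
3. Data with hundreds of dimensions is hard to understand, whereas data in two or three dimensions can be visualized and is much easier to interpret.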
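
The second point can be made concrete with a small sketch. It is not taken from the referenced book: the synthetic data, the variable names, and the helper `relative_error` are all assumptions chosen for illustration. The data has 10 features driven mostly by 2 hidden factors, and the code measures how much of the centered data is lost when only the top `k` directions of a truncated SVD are kept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 10 features that are
# mostly driven by 2 latent factors, plus a little noise.
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(300, 10))
Xc = X - X.mean(axis=0)  # center each feature

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

def relative_error(k):
    """Fraction of the total squared variation lost with only k directions."""
    X_k = (U[:, :k] * S[:k]) @ Vt[:k]  # rank-k reconstruction of Xc
    return np.sum((Xc - X_k) ** 2) / np.sum(Xc ** 2)

errors = [relative_error(k) for k in (1, 2, 5)]
print(errors)
```

In this setup the error should drop sharply once `k` reaches the number of latent factors, which is the sense in which a reduction can compress data while losing little information.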

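PCA reduces dimensionality by projecting the data onto a lower-dimensional subspace.
The principal components can be computed in two ways:
1. Compute the covariance matrix of the data and take its eigenvalues and eigenvectors. (http://blog.csdn.net/ybdesire/article/details/6270328/)
2. Use the singular value decomposition (SVD) of the data matrix to find the eigenvectors of the covariance matrix and the square roots of its eigenvalues (up to a constant factor that depends on the number of samples).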
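
As a rough check that the two routes agree, the sketch below runs both on the same centered data using NumPy only. The synthetic data and all variable names are assumptions made for illustration and do not come from the referenced book. The eigenvalues of the covariance matrix should equal the squared singular values divided by `n_samples - 1`, and the eigenvectors should match the right singular vectors up to sign.

```python
import numpy as np

# Synthetic data (assumed for illustration): 200 samples, 5 correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# PCA works on mean-centered data
Xc = X - X.mean(axis=0)
n_samples = Xc.shape[0]

# Method 1: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (n_samples - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Method 2: SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_eigvals = S ** 2 / (n_samples - 1)

# Same eigenvalues, and the same directions up to a sign flip
eigvals_match = np.allclose(eigvals, svd_eigvals)
directions_match = np.allclose(np.abs(eigvecs.T @ Vt.T), np.eye(5), atol=1e-6)
print(eigvals_match, directions_match)
```

The SVD route avoids forming the covariance matrix explicitly, which matters when the number of features is large, as with the 10304-pixel images used in the example that follows.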

In [1]:
# Example: use PCA on a face recognition problem. Face recognition is a supervised classification task that identifies a person from a photo.
from os import walk, path
import numpy as np
import mahotas as mh
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

FILE_PATH = r'd:\Code\GitHub\notebook\datasets\att_faces'

X = []
y = []

for dir_path, dir_names, file_names in walk(FILE_PATH):
    for fn in file_names:
        if fn.endswith('pgm'):
            image_filename = path.join(dir_path, fn)
            # Flatten each image to a 10304-element vector, then use sklearn.preprocessing.scale()
            # to standardize its pixel values to zero mean and unit variance
            X.append(scale(mh.imread(image_filename, as_grey=True).reshape(10304).astype('float32')))
            # The directory path of each image serves as its class label (one directory per person)
            y.append(dir_path)

X = np.array(X)
X_train, X_test, y_train, y_test = train_test_split(X, y)
pca = PCA(n_components = 150)
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)
print('The original dimensions of the training data were:', X_train.shape)
print('The reduced dimensions of the training data were:', X_train_reduced.shape)

classifier = LogisticRegression()
accuracy = cross_val_score(classifier, X_train_reduced, y_train)

# Note: classification_report may warn that 'Precision and F-score are ill-defined and being set to 0.0 in labels'
# when a label in y_test is never predicted. See: https://stackoverflow.com/questions/34757653/why-does-scikitlearn-says-f1-score-is-ill-defined-with-fn-bigger-than-0
print('Cross validation accuracy:', np.mean(accuracy), accuracy)
classifier.fit(X_train_reduced, y_train)
prediction = classifier.predict(X_test_reduced)
print(classification_report(y_test, prediction))

The original dimensions of the training data were: (300, 10304)
The reduced dimensions of the training data were: (300, 150)
Cross validation accuracy: 0.792933361542 [ 0.78070175  0.83673469  0.76136364]
                                                precision    recall  f1-score   support

 d:\Code\GitHub\notebook\datasets\att_faces\s1       1.00      1.00      1.00         2
d:\Code\GitHub\notebook\datasets\att_faces\s10       1.00      0.83      0.91         6
d:\Code\GitHub\notebook\datasets\att_faces\s11       0.50      1.00      0.67         1
d:\Code\GitHub\notebook\datasets\att_faces\s12       1.00      1.00      1.00         4
d:\Code\GitHub\notebook\datasets\att_faces\s13       1.00      1.00      1.00         1
d:\Code\GitHub\notebook\datasets\att_faces\s14       1.00      1.00      1.00         3
d:\Code\GitHub\notebook\datasets\att_faces\s15       1.00      1.00      1.00         2
d:\Code\GitHub\notebook\datasets\att_faces\s16       1.00      1.00      1.00         2
d: