# PCA (Principal Component Analysis)

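## Principle
Rotate the data onto new axes, then select a subset of the new features according to how much of the data's variance each one explains.
> The first new axis is the direction of maximum variance in the original data.

> The second axis is the direction of maximum variance among those orthogonal to the first.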
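
As a rough illustration of the two axes described above, the following sketch builds a small synthetic 2-D data set and checks that the leading eigenvector of the covariance matrix carries at least as much projected variance as any of a set of randomly sampled unit directions, and that the second axis is orthogonal to it. The data, the seed, and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2-D data (illustrative only, not the data set used later)
x = rng.normal(size=500)
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=500)])
centered = data - data.mean(axis=0)

# eigh returns eigenvalues in ascending order for a symmetric matrix
eig_vals, eig_vects = np.linalg.eigh(np.cov(centered, rowvar=False))
first_axis = eig_vects[:, -1]   # direction of largest variance
second_axis = eig_vects[:, -2]  # orthogonal direction with the next-largest variance


def projected_variance(direction):
    """Sample variance of the centered data projected onto a unit direction."""
    return np.var(centered @ direction, ddof=1)


# Compare the first axis against many random unit directions
angles = rng.uniform(0, np.pi, size=1000)
random_dirs = np.column_stack([np.cos(angles), np.sin(angles)])
max_random_var = max(projected_variance(d) for d in random_dirs)

print(projected_variance(first_axis) >= max_random_var - 1e-12)
print(abs(first_axis @ second_axis) < 1e-10)
```

The variance along the first axis equals the largest eigenvalue of the covariance matrix, which is why no other unit direction can exceed it.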

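## Algorithm
> 1. Remove the mean.

> 2. Compute the covariance matrix.

> 3. Compute the eigenvalues and eigenvectors of the covariance matrix.

> 4. Sort the eigenvectors by their eigenvalues, from largest to smallest, and keep the top N eigenvectors.

> 5. Transform the data into the new space spanned by those N eigenvectors.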
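
The five steps above can be sketched in NumPy as follows. This is a minimal sketch rather than a reference implementation: the function name `pca`, its return values, and the random input are assumptions, and `np.linalg.eigh` is used here because a covariance matrix is symmetric.

```python
import numpy as np


def pca(data, n_components):
    """Minimal PCA sketch following the five steps listed above."""
    data = np.asarray(data, dtype=float)
    # 1. Remove the mean of each feature
    mean_vals = data.mean(axis=0)
    centered = data - mean_vals
    # 2. Covariance matrix (columns are features)
    cov_mat = np.cov(centered, rowvar=False)
    # 3. Eigenvalues and eigenvectors of the symmetric covariance matrix
    eig_vals, eig_vects = np.linalg.eigh(cov_mat)
    # 4. Sort by eigenvalue, largest first, and keep the top N eigenvectors
    order = np.argsort(eig_vals)[::-1][:n_components]
    components = eig_vects[:, order]
    # 5. Project the centered data onto the retained eigenvectors
    low_dim = centered @ components
    return low_dim, components, mean_vals


rng = np.random.default_rng(0)
sample = rng.normal(size=(100, 3))  # hypothetical input: 100 samples, 3 features
low_dim, components, mean_vals = pca(sample, n_components=2)
print(low_dim.shape)  # (100, 2)
```

Returning the mean and the retained eigenvectors alongside the projected data makes it possible to map the reduced data back to the original axes later.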

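## Pros and cons
Pros: reduces the complexity of the data and identifies the most important features; after the rotation, a simple classifier such as a decision tree can sometimes separate the classes with a margin comparable to an SVM's.

Cons: not always necessary, and may discard useful information.

PCA is very effective at capturing the features shared by samples of the same class, but because it ignores class labels it is not well suited to distinguishing between different classes.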
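
One common way to gauge how much information is discarded is the share of total variance carried by the retained components. The sketch below reuses the two eigenvalues reported in the example section of this document (0.36651371 and 2.89713496); the variable names are illustrative.

```python
import numpy as np

# Eigenvalues of the covariance matrix, taken from the example section
eig_vals = np.array([0.36651371, 2.89713496])

sorted_vals = np.sort(eig_vals)[::-1]               # largest first
explained_ratio = sorted_vals / sorted_vals.sum()   # variance share per component
cumulative = np.cumsum(explained_ratio)             # share kept by the top-k components

print(explained_ratio.round(4))
print(cumulative.round(4))
```

For these two eigenvalues, keeping only the first component retains roughly 89% of the total variance, so about 11% is lost.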

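## Implementation in Python
sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
> n_components : number of components to keep

> copy : defaults to True, which leaves the original data unmodified

> whiten : whether to whiten the output; defaults to False

> svd_solver : which singular value decomposition solver to use, one of {'auto', 'full', 'arpack', 'randomized'}

> tol : tolerance for singular values when svd_solver == 'arpack'; defaults to 0

> iterated_power : number of iterations for the power method when svd_solver == 'randomized'; defaults to 'auto'

> random_state : random state used when svd_solver == 'arpack' or 'randomized'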
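
A short usage sketch of the class described above, on hypothetical random data (the array `X` and its shape are assumptions). It fits the model, projects the data onto two components, and maps the result back to the original feature space.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # hypothetical data: 200 samples, 5 features

pca = PCA(n_components=2, svd_solver="full")
X_reduced = pca.fit_transform(X)               # centers X internally, then projects
X_restored = pca.inverse_transform(X_reduced)  # back to 5 features (lossy)

print(X_reduced.shape)                 # (200, 2)
print(pca.components_.shape)           # (2, 5)
print(pca.explained_variance_ratio_)   # variance share of each retained component
```

`explained_variance_ratio_` is a convenient check on how much variance the chosen `n_components` keeps.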

# Example

In [6]:
import numpy as np
import pandas as pd

In [7]:
dataMat = pd.read_table("D:/python/Machine-Learning/machinelearninginaction/Ch13/testSet.txt",header=None,names=["A","B"])

In [9]:
meanVals = np.mean(dataMat,axis=0)

In [10]:
meanRemoved = dataMat - meanVals

In [11]:
covMat = np.cov(meanRemoved,rowvar=0)

In [12]:
covMat

array([[1.05198368, 1.1246314 ],
       [1.1246314 , 2.21166499]])

In [13]:
eigVals,eigVects = np.linalg.eig(covMat)

In [14]:
eigVals

array([0.36651371, 2.89713496])

In [16]:
eigVects

array([[-0.85389096, -0.52045195],
       [ 0.52045195, -0.85389096]])
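
As a sanity check on the output above, the sketch below recomputes the decomposition from the printed covariance matrix and verifies the defining property of each eigenpair. The matrix entries are copied from the output of cell 12; the variable names are illustrative.

```python
import numpy as np

# Covariance matrix as printed above (rounded to 8 decimals)
cov_mat = np.array([[1.05198368, 1.1246314],
                    [1.1246314, 2.21166499]])
eig_vals, eig_vects = np.linalg.eig(cov_mat)

# Each column v of eig_vects should satisfy cov_mat @ v == eig_val * v
for i in range(len(eig_vals)):
    v = eig_vects[:, i]
    print(np.allclose(cov_mat @ v, eig_vals[i] * v))

# Eigenvectors of a symmetric matrix with distinct eigenvalues are orthonormal
print(np.allclose(eig_vects.T @ eig_vects, np.eye(2)))
```

The eigenvectors are stored as columns, so `eig_vects[:, i]` pairs with `eig_vals[i]`.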

In [18]:
# Indices that would sort the eigenvalues in ascending order
eigValInd = np.argsort(eigVals)

In [19]:
eigValInd

array([0, 1], dtype=int64)
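
`np.argsort` sorts in ascending order, so selecting the top N eigenvectors requires reversing the index array first. A minimal sketch using the eigenvalues and eigenvectors printed above; `top_n` and `red_eig_vects` are names introduced here for illustration.

```python
import numpy as np

eig_vals = np.array([0.36651371, 2.89713496])
eig_vects = np.array([[-0.85389096, -0.52045195],
                      [0.52045195, -0.85389096]])

top_n = 1  # hypothetical number of components to keep
eig_val_ind = np.argsort(eig_vals)         # ascending order
eig_val_ind = eig_val_ind[::-1][:top_n]    # largest first, keep top_n indices
red_eig_vects = eig_vects[:, eig_val_ind]  # columns are the kept eigenvectors

print(eig_val_ind)          # [1]
print(red_eig_vects.shape)  # (2, 1)
```

Projecting the mean-removed data onto `red_eig_vects` instead of the full eigenvector matrix is what actually reduces the dimension.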

In [21]:
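# Projects onto both eigenvectors in their original order (eigValInd is not applied),
# so this rotates the data without reducing its dimension
lowDDataMat = np.dot(meanRemoved,eigVects)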

In [22]:
lowDDataMat

array([[ 0.15840394, -2.51033597],
       [ 0.5092619 , -2.86915379],
       [-0.20728318,  0.09741085],
       ...,
       [-0.62056456, -0.50166225],
       [-0.02335614, -0.05898712],
       [-1.37276015, -0.18978714]])

In [24]:
meanVals

A    9.063936
B    9.096002
dtype: float64

In [26]:
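# Adds the column means to the rotated coordinates; this shifts the points
# but does not map them back to the original axes
lowDDataMat+meanVals.tolist()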

array([[9.22234038, 6.58566621],
       [9.57319833, 6.22684839],
       [8.85665326, 9.19341303],
       ...,
       [8.44337188, 8.59433993],
       [9.04058029, 9.03701506],
       [7.69117629, 8.90621504]])
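
Adding the means to the rotated coordinates, as in the cell above, leaves the points expressed in the eigenvector basis. To map reduced data back to the original axes, one would usually also multiply by the transpose of the retained eigenvectors. The sketch below shows this on synthetic data standing in for the two-column file, which is not reproduced here; the data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the two-column data set
x = rng.normal(loc=9.0, size=300)
data = np.column_stack([x, x + 0.5 * rng.normal(size=300)])

mean_vals = data.mean(axis=0)
mean_removed = data - mean_vals
eig_vals, eig_vects = np.linalg.eig(np.cov(mean_removed, rowvar=False))

# Keep only the eigenvector with the largest eigenvalue
red_eig_vects = eig_vects[:, np.argsort(eig_vals)[::-1][:1]]
low_d = mean_removed @ red_eig_vects          # shape (300, 1)
recon = low_d @ red_eig_vects.T + mean_vals   # back in the original axes (lossy)

# With all eigenvectors kept, the reconstruction is exact up to rounding
full_recon = (mean_removed @ eig_vects) @ eig_vects.T + mean_vals
print(recon.shape)
print(np.allclose(full_recon, data))
```

The one-component reconstruction lies on a line through the data mean, which is the sense in which the discarded component's information is lost.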

In [27]:
from sklearn.datasets import load_breast_cancer

In [28]:
cancer = load_breast_cancer()

In [30]:
cancer.data

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [31]:
cancer.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, ...

In [33]:
cancer.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')
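
One possible next step with this data set, offered only as a sketch: standardize the 30 features and reduce them to two principal components with scikit-learn. The use of `StandardScaler` and the choice of two components are assumptions, not something the notebook above does.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
# Standardize first: the 30 features are on very different scales
X_scaled = StandardScaler().fit_transform(cancer.data)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(cancer.data.shape)  # (569, 30)
print(X_pca.shape)        # (569, 2)
print(pca.explained_variance_ratio_)
```

Without standardization, features with large numeric ranges such as `mean area` would dominate the covariance matrix and therefore the leading components.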