Principal component matrix: PCA assumes the dataset is centered at the origin.

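Projecting the training set down to $d$ dimensions: $X_{d\text{-proj}}=XW_d$, where $W_d$ is the matrix whose columns are the first $d$ principal components.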
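The projection formula can be checked on a small synthetic dataset before running it on MNIST. The sketch below is only illustrative: the toy data and the names `X_toy`, `W_d`, and `X_proj_manual` are assumptions, not part of the original notes. It centers the data by hand, builds $W_d$ from the SVD, and compares the result with scikit-learn's `PCA`, which centers the data internally.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical toy data: 200 instances, 5 correlated features, deliberately off-center
X_toy = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5)) + 10.0

# Manual route: center first, then use the top-d right singular vectors as W_d
d = 2
X_toy_centered = X_toy - X_toy.mean(axis=0)
_, _, Vt_toy = np.linalg.svd(X_toy_centered, full_matrices=False)
W_d = Vt_toy.T[:, :d]
X_proj_manual = X_toy_centered @ W_d

# scikit-learn route: PCA subtracts the mean itself
X_proj_sklearn = PCA(n_components=d, svd_solver="full").fit_transform(X_toy)

# The sign of each component is arbitrary, so compare absolute values
same_up_to_sign = np.allclose(np.abs(X_proj_manual), np.abs(X_proj_sklearn), atol=1e-6)
print(same_up_to_sign)
```

Skipping the centering step in the manual route would make the first singular vector point mostly toward the data mean rather than along the direction of largest variance.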

In [1]:
from sklearn.datasets import fetch_openml
mnist=fetch_openml('mnist_784', version=1)
mnist.keys()
X, y = mnist["data"], mnist["target"]
import numpy as np
X = X.to_numpy()
y = y.to_numpy(dtype=np.uint8)
X_train = X[:60000]

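# Extract the two unit vectors that define the first two principal components (PCs).
# full_matrices=False avoids allocating the full 70,000 x 70,000 matrix U.
X_centered=X-X.mean(axis=0)
U, s, Vt=np.linalg.svd(X_centered, full_matrices=False)
c1=Vt.T[:, 0]
c2=Vt.T[:, 1]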

# Project the dataset onto the plane defined by the first two principal components
W2=Vt.T[:, :2]
X2D=X_centered.dot(W2)

from sklearn.decomposition import PCA
pca=PCA(n_components=2)
X2D=pca.fit_transform(X)
# Explained variance ratio of each principal component
pca.explained_variance_ratio_

array([0.09746116, 0.07155445])

In [2]:
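# Fit PCA without reducing dimensionality, then find the smallest d that preserves 95% of the variance
pca=PCA()
pca.fit(X_train)
cumsum=np.cumsum(pca.explained_variance_ratio_)
d=np.argmax(cumsum>=0.95)+1
# Shortcut: a float between 0 and 1 tells PCA which ratio of variance to preserve
pca=PCA(n_components=0.95)
X_reduced=pca.fit_transform(X_train)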

In [3]:
pca=PCA(n_components=154)
X_reduced=pca.fit_transform(X_train)
X_recovered=pca.inverse_transform(X_reduced)

PCA inverse transformation, back to the original number of dimensions: $X_{\text{recovered}}=X_{d\text{-proj}}W_{d}^{T}$
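To make the inverse transformation concrete, here is a minimal sketch on synthetic data (the toy dataset and the names `X_toy`, `pca_toy`, and `X_manual` are assumptions made for illustration). It applies $X_{d\text{-proj}}W_d^T$ by hand, adds back the mean that `PCA` subtracted during fitting, and compares the result with `inverse_transform`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical toy data: 10 features that lie close to a 3-dimensional subspace
latent = rng.normal(size=(300, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(300, 10)) + 5.0

pca_toy = PCA(n_components=3, svd_solver="full")
X_toy_reduced = pca_toy.fit_transform(X_toy)

# Manual inverse: X_recovered = X_dproj @ W_d^T, then undo the centering
W_d = pca_toy.components_.T                      # shape (10, 3)
X_manual = X_toy_reduced @ W_d.T + pca_toy.mean_
X_sklearn = pca_toy.inverse_transform(X_toy_reduced)

# Reconstruction error: mean squared distance between original and recovered data
reconstruction_mse = np.mean((X_toy - X_sklearn) ** 2)
print(np.allclose(X_manual, X_sklearn), reconstruction_mse)
```

The recovered data is not identical to the original because the variance along the dropped components is lost; here that is only the small noise term.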

In [None]:
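# Randomized PCA
rnd_pca=PCA(n_components=154, svd_solver="randomized")
X_reduced=rnd_pca.fit_transform(X_train)

# Incremental PCA: feed the training set one mini-batch at a time
from sklearn.decomposition import IncrementalPCA
n_batches=100
inc_pca=IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)
X_reduced=inc_pca.transform(X_train)

# Alternative: memory-map the training set from disk and call fit() directly.
# "my_mnist.data" is a placeholder path; the file is written once, then reopened read-only.
filename="my_mnist.data"
m,n=X_train.shape
X_mm=np.memmap(filename, dtype="float32", mode="w+", shape=(m,n))
X_mm[:]=X_train
X_mm.flush()
del X_mm
X_mm=np.memmap(filename, dtype="float32", mode="readonly", shape=(m,n))
batch_size=m//n_batches
inc_pca=IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)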

In [None]:
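# Kernel PCA
# Note: KernelPCA builds an m x m kernel matrix, so fitting on all 70,000 MNIST
# instances is very memory-hungry; a subset of X is more practical.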
from sklearn.decomposition import KernelPCA
rbf_pca=KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced=rbf_pca.fit_transform(X)

# First reduce the data to two dimensions with kPCA, then classify with logistic regression.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf=Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression())
])
# Pipeline parameters are addressed as <step name>__<parameter> (double underscore)
param_grid=[{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"]
}]
grid_search=GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)

rbf_pca=KernelPCA(n_components=2, kernel="rbf", gamma=0.0433, fit_inverse_transform=True)
X_reduced=rbf_pca.fit_transform(X)
X_preimage=rbf_pca.inverse_transform(X_reduced)
# Compute the reconstruction pre-image error
from sklearn.metrics import mean_squared_error
mean_squared_error(X, X_preimage)


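LLE: Locally Linear Embedding, a nonlinear dimensionality reduction technique and a manifold learning technique.

How it works: LLE first measures how each training instance linearly relates to its nearest neighbors, then looks for a low-dimensional representation of the training set in which these local relationships are best preserved.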
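LLE is usually demonstrated on a dataset with clear manifold structure. The following sketch uses scikit-learn's Swiss roll generator instead of MNIST; the dataset choice and the names `X_roll` and `X_unrolled` are assumptions made for illustration.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 1,000 points on a noisy 3-D Swiss roll; t_roll is the position along the roll
X_roll, t_roll = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# Each instance is described by its 10 nearest neighbors, then embedded in 2-D
lle_roll = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle_roll.fit_transform(X_roll)

print(X_unrolled.shape)
print(lle_roll.reconstruction_error_)
```

Plotting `X_unrolled` colored by `t_roll` is a quick way to see whether the roll was unrolled; the quality depends noticeably on `n_neighbors`.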

In [None]:
from sklearn.manifold import LocallyLinearEmbedding
lle=LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced=lle.fit_transform(X)

LLE step 1: linearly modeling local relationships

$
\hat{W} = \arg\min_{W} \sum_{i=1}^{m} \left\| \boldsymbol{x}^{(i)} - \sum_{j=1}^{m} w_{i,j} \boldsymbol{x}^{(j)} \right\|^2
$

subject to
$
\begin{cases}
w_{i,j} = 0 & \text{if } \boldsymbol{x}^{(j)} \text{ is not one of the } k \text{ nearest neighbors of } \boldsymbol{x}^{(i)} \\
\sum_{j=1}^{m} w_{i,j} = 1 & \text{for } i = 1, 2, \cdots, m
\end{cases}
$
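The constrained problem in step 1 decouples into one small linear system per instance. The sketch below is one common way to solve it, not necessarily what scikit-learn does internally; the function name `lle_weights` and the regularization term `reg` are assumptions introduced here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle_weights(X, k, reg=1e-3):
    """Return the m x m weight matrix W of LLE step 1 (hypothetical helper)."""
    m = X.shape[0]
    # kneighbors() without arguments excludes each point from its own neighbor list
    neighbor_idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(return_distance=False)
    W = np.zeros((m, m))
    for i in range(m):
        Z = X[neighbor_idx[i]] - X[i]        # neighbors, shifted so x_i is the origin
        G = Z @ Z.T                          # local Gram matrix, shape (k, k)
        G += reg * np.trace(G) * np.eye(k)   # regularize: G is singular when k > n
        w = np.linalg.solve(G, np.ones(k))
        W[i, neighbor_idx[i]] = w / w.sum()  # enforce sum_j w_ij = 1
    return W

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))            # hypothetical toy data
W_toy = lle_weights(X_toy, k=5)
print(W_toy.sum(axis=1)[:3])                 # each row sums to 1
```

Weights outside each instance's neighbor list stay at zero, which is the first constraint; the normalization in the last line of the loop enforces the second.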

LLE step 2: reducing dimensionality while preserving relationships

$
\hat{Z} = \arg\min_{Z} \sum_{i=1}^{m} \left\| \boldsymbol{z}^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j} \boldsymbol{z}^{(j)} \right\|^2
$
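With the weights fixed, step 2 can be solved as an eigenvalue problem. The cost equals $\mathrm{tr}(Z^T M Z)$ with $M = (I-\hat{W})^T(I-\hat{W})$. Under the usual extra constraints that the embedding is centered and has unit covariance (an assumption taken from the standard LLE formulation, not stated in these notes), the solution is given by the eigenvectors of $M$ with the smallest eigenvalues, skipping the first, near-constant one. The function name `lle_embed` is hypothetical, and this dense version is only a sketch suitable for small $m$.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import NearestNeighbors

def lle_embed(X, k=10, d=2, reg=1e-3):
    """Dense LLE sketch: returns the embedding Z (m x d) and the weights W (m x m)."""
    m = X.shape[0]
    idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(return_distance=False)
    # Step 1: reconstruction weights, one regularized linear system per instance
    W = np.zeros((m, m))
    for i in range(m):
        Z_local = X[idx[i]] - X[i]
        G = Z_local @ Z_local.T
        G += reg * np.trace(G) * np.eye(k)
        w = np.linalg.solve(G, np.ones(k))
        W[i, idx[i]] = w / w.sum()
    # Step 2: eigenvectors of M = (I - W)^T (I - W), eigenvalues in ascending order
    I_minus_W = np.eye(m) - W
    M = I_minus_W.T @ I_minus_W
    _, eigvecs = np.linalg.eigh(M)
    # Skip the first eigenvector (eigenvalue close to 0), keep the next d
    return eigvecs[:, 1:d + 1], W

X_roll, _ = make_swiss_roll(n_samples=300, noise=0.1, random_state=0)
Z_roll, W_roll = lle_embed(X_roll, k=10, d=2)

# Value of the step 2 cost function for the embedding found
residual = np.sum((Z_roll - W_roll @ Z_roll) ** 2)
print(Z_roll.shape, residual)
```

Building the dense $m \times m$ matrix is where the $O(dm^2)$-type cost comes from, which is why LLE scales poorly to very large datasets.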

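Computational complexity:

$O(m \log(m)\, n \log(k))$ for finding the $k$ nearest neighbors, $O(mnk^3)$ for optimizing the weights, and $O(dm^2)$ for constructing the low-dimensional representations, where $m$ is the number of instances, $n$ the number of features, and $d$ the target dimensionality.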

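Other dimensionality reduction techniques: random projection, multidimensional scaling (MDS), Isomap, t-distributed stochastic neighbor embedding (t-SNE), and linear discriminant analysis (LDA).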
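All five techniques have scikit-learn implementations with the same `fit_transform` interface. The sketch below runs them on a small slice of the `load_digits` dataset to keep the runtime short; the dataset, the subset size, and the hyperparameters are assumptions chosen for illustration, not tuned values.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import MDS, TSNE, Isomap
from sklearn.random_projection import GaussianRandomProjection

# Small 8x8 digits dataset; only the first 300 instances are used
X_digits, y_digits = load_digits(return_X_y=True)
X_small, y_small = X_digits[:300], y_digits[:300]

reducers = {
    "random projection": GaussianRandomProjection(n_components=2, random_state=42),
    "MDS": MDS(n_components=2, random_state=42),
    "Isomap": Isomap(n_components=2, n_neighbors=10),
    "t-SNE": TSNE(n_components=2, random_state=42),
    "LDA": LinearDiscriminantAnalysis(n_components=2),
}

embeddings = {}
for name, reducer in reducers.items():
    # LDA is supervised and uses the labels; the other four ignore y
    embeddings[name] = reducer.fit_transform(X_small, y_small)
    print(name, embeddings[name].shape)
```

LDA is the only supervised method in the list, so it is also the only one that cannot be applied to unlabeled data.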