## Linear algebra basics

### Symmetric matrices


## PCA: Dimensionality reduction with principal components

**Principal component analysis**, or **PCA**, is an alternative to regularization and straightforward feature elimination. PCA is particularly useful for problems with a very large number of features compared to the number of training cases. For example, when faced with a problem with many thousands of features and perhaps a few thousand cases, PCA can be a good choice to **reduce the dimensionality** of the feature space.

PCA is one of a family of transformation methods that reduce dimensionality. PCA is the focus here, since it is the most widely used of these methods. 

The basic idea of PCA is rather simple: find a linear transformation of the feature space which **projects the majority of the variance** onto a few orthogonal dimensions in the transformed space. The PCA transformation maps the data values to a new coordinate system defined by the principal components. Assuming the highest-variance directions, or **components**, are the most informative, low-variance components can be eliminated from the space with little loss of information.

The projection along which the greatest variance occurs is called the **first principal component**. The next projection, orthogonal to the first, with the greatest remaining variance is called the **second principal component**. Subsequent components are all mutually orthogonal, with decreasing variance along the projected direction.

Widely used PCA algorithms compute the components sequentially, starting with the first principal component. This means that it is computationally efficient to compute the first several components from a very large number of features. Thus, PCA can make problems with very large numbers of features computationally tractable.

****
**Note:** It may help your understanding to realize that the principal components are the **eigenvectors** of the covariance matrix of the features. The variance along each component is given by the corresponding **eigenvalue**, and each eigenvalue divided by the sum of all eigenvalues is the fraction of the variance explained by that component.
****
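
As a minimal sketch of the eigenvector view described in the note, the following NumPy code computes principal components from the covariance matrix of centered data. The function name `pca_eig` and the simulated data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def pca_eig(X):
    """Return (components, variances) from the eigendecomposition of cov(X).

    X has shape (n_samples, n_features). Rows of `components` are unit-length
    principal directions, sorted by decreasing variance.
    """
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # symmetric (n_features, n_features)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # re-sort to descending variance
    return eigvecs[:, order].T, eigvals[order]

# Illustrative data: 500 draws from a correlated bivariate normal
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=500)

components, variances = pca_eig(X)
ratios = variances / variances.sum()        # fraction of variance per component
print(components)
print(ratios)
```

The signs of the returned directions are arbitrary, so a library implementation may report the same components with one or more rows negated.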

## A simple example

To cement the concepts of PCA you will now work through a simple example. This example is restricted to 2-d data so that the results are easy to visualize. 

As a first step, execute the code in the cells below to install and load the packages required for the rest of this notebook.
The code that follows simulates data from a bivariate Normal distribution. The distribution is deliberately centered on $(0, 0)$ with unit variance on each dimension. There is considerable correlation between the two dimensions, leading to the covariance matrix:

$$cov(X) =  \begin{bmatrix}
  1.0 & 0.6 \\
  0.6 & 1.0
 \end{bmatrix}$$

Given the covariance matrix, 100 draws from this distribution are computed using the `multivariate_normal` function from the NumPy `random` package. Execute this code:

In [None]:
!pip install pandas_profiling

In [None]:
import pandas as pd
import sklearn.model_selection as ms
from sklearn import linear_model
import sklearn.metrics as sklm
import sklearn.decomposition as skde
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import math
from pandas_profiling import ProfileReport
%matplotlib inline

In [None]:
# Simulate a two-feature sample from a normal distribution with the given covariance.
nr.seed(124)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
mean = np.array([0.0, 0.0])

sample = nr.multivariate_normal(mean, cov, 100)
sample.shape
 

To get a feel for this data, execute the code in the cell below to display a plot and examine the result. 

In [None]:
plt.scatter(sample[:,0], sample[:,1])
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Sample data')

You can see that the data have a roughly elliptical pattern. The correlation between the two dimensions is also visible.

With the simulated data set created, it is time to compute the PCA model. The code in the cell below does the following:
1. Define a PCA model object using the `PCA` class from the scikit-learn `decomposition` package.
2. Fit the PCA model to the sample data.
3. Display the ratio of the **variance explained** by each of the components, where, for a matrix $X$, the ratio for component $i$ is given by:

$$VE_i(X) = \frac{Var_{component\ i}(X)}{Var_{total}(X)}$$

Notice that by construction:

$$\sum_{i=1}^N VE_i(X) = 1.0$$

In other words, the variances along the components add up to the total variance, so the explained variance ratios always sum to 1.0. The contribution of each component is its variance divided by the total variance. Equivalently, once the variances are normalized by the total variance, they sum to 1.
Execute this code and examine the result.
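> The `explained_variance_ratio_` attribute of the PCA model gives the share of the variance explained by each component, and these shares sum to 1. The `explained_variance_` attribute gives the variance values themselves. Used together, they let you plot either the explained variance ratio or the variance per component, which makes it easier to choose how many components to keep.
> Further reading: https://blog.csdn.net/qq_36523839/article/details/82558636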

In [None]:
pca_model = skde.PCA()
pca_fit = pca_model.fit(sample)
print(pca_fit.explained_variance_ratio_)
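# Compute the eigenvalues and eigenvectors of the sample covariance matrix with NumPy
cov_s = np.cov(sample.T)
lamda, Q = np.linalg.eig(cov_s)
print(lamda)  # eigenvalues (not necessarily sorted)
print(Q.T)    # rows of Q.T are the eigenvectors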

Notice that the explained variance of the first component is several times larger than that of the second component. This is exactly the desired result, indicating that the first principal component explains the majority of the variance of the sample data.
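
For the population covariance matrix used to simulate the data, the explained variance ratios can be worked out directly. A matrix of the form $\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$ has eigenvalues $1 + \rho$ and $1 - \rho$, so for $\rho = 0.6$ the eigenvalues are 1.6 and 0.4 and the ratios are $1.6 / 2 = 0.8$ and $0.4 / 2 = 0.2$. The ratios estimated from a finite sample will differ somewhat from these values. A short check, as a sketch:

```python
import numpy as np

# Population covariance used for the simulation
cov = np.array([[1.0, 0.6], [0.6, 1.0]])

# eigvalsh returns ascending eigenvalues for a symmetric matrix; reverse them
eigvals = np.linalg.eigvalsh(cov)[::-1]

# Explained variance ratio: each eigenvalue over the total variance
ve = eigvals / eigvals.sum()
print(eigvals)
print(ve)
```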

The code in the cell below computes and prints the scaled components. Each component (a unit-length eigenvector) is multiplied by its explained variance ratio, so that the length of each vector reflects the relative importance of that component. Execute this code:

In [None]:
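# Copy first: scaling `components_` in place would corrupt the fitted model
comps = pca_fit.components_.copy()
# Scale each component (row) by its explained variance ratio
comps = comps * pca_fit.explained_variance_ratio_[:, np.newaxis]
print(comps)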

Notice that the two vectors have their origins at $[0, 0]$, have quite different magnitudes, and point in different directions.

To better understand how the components relate to the data, execute the code in the cell below to plot the data along with the principal components:

In [None]:
plt.scatter(sample[:,0], sample[:,1])
# Draw each scaled component as a line from the origin
plt.plot([0.0, comps[0,0]], [0.0,comps[0,1]], color = 'red', linewidth = 5)
plt.plot([0.0, comps[1,0]], [0.0,comps[1,1]], color = 'red', linewidth = 5)

plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Sample data')

Notice that the first principal component (the long red line) lies along the direction of greatest variance of the data, as expected. The short red line lies along the direction of the second principal component. The length of each line is the explained variance ratio of that component.

The ultimate goal of PCA is to transform data to a coordinate system with the highest variance directions along the axes. The code in the cell below uses the `transform` method on the PCA object to perform this operation and then plots the result. Execute this code: 

In [None]:
trans = pca_fit.transform(sample)  # change coordinates so the axes follow the directions of maximum variance
plt.scatter(trans[:,0], trans[:,1])
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Sample data')

Notice that the scales along these two coordinates are quite different. The first principal component lies along the horizontal axis and the second along the vertical axis. The values span a noticeably wider range along the horizontal axis than along the vertical axis, so it is clear that most of the variance is along the direction of the first principal component.
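
To make the change of coordinates concrete, here is a minimal NumPy sketch. It assumes the transform amounts to centering the data and projecting it onto the principal directions, which is how scikit-learn documents its unwhitened PCA transform; the variable names and the simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=200)

# Center the data and diagonalize its covariance matrix
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
W = eigvecs[:, ::-1]      # columns are directions, largest variance first

# Coordinates of every sample in the principal-component basis
scores = Xc @ W

# In the new basis the features are uncorrelated: the covariance is diagonal,
# with the eigenvalues (component variances) on the diagonal
cov_scores = np.cov(scores, rowvar=False)
print(np.round(cov_scores, 6))
```

The off-diagonal entries vanish up to rounding error, which is the sense in which the new axes capture the variance independently of one another.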

In [None]:
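# Define a small matrix of samples: each row is a feature, each column an observation
import numpy as np
x = np.array([2, 2, 4, 8, 4])
y = np.array([2, 6, 6, 8, 8])
data = np.vstack((x, y))
print(data)
print(data.shape)
data_cov = np.cov(data)
print("Covariance of the two features x and y (a symmetric matrix):")
print(data_cov)

print("Center each feature on zero mean")
def toZero(x: np.ndarray) -> np.ndarray:
    m = np.mean(x)
    return x - m
x = toZero(x)
y = toZero(y)
data2 = np.vstack((x, y))
data_cov = np.cov(data2)
print("Centering leaves the covariance matrix unchanged:")
print(data_cov)

print("Diagonalize the covariance matrix using its eigenvalues and eigenvectors:")
eig, Q = np.linalg.eig(data_cov)
print(eig)
print("Note: NumPy returns the eigenvectors as the columns of Q, so the rows of Q.T form the new basis")
print(Q.T)
print("The eigenvalues on the diagonal give the covariance matrix in the new basis")
sigma = np.diag(eig)  # np.cov already divides by (n-1); no further scaling is needed
print(sigma)
print("Project the centered samples onto the new basis")
newV = Q.T @ data2
print(newV)
print(data2)
print(Q)

# Compare with scikit-learn, which expects samples in rows
# (uses `skde` from the import cell at the top of the notebook)
pca_model = skde.PCA()
pca_fit = pca_model.fit(data.T)

comps = pca_fit.components_
print(comps)
print(pca_fit.explained_variance_ratio_)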


In [None]:
import numpy as np
from numpy import linalg as la
A = np.array([1,5,7,6,1,2,1,10,4,4, 3,6,7,5,2]).reshape(3,5)
print("A times its transpose is a symmetric matrix:\n", A@A.T, "\nshape:", (A@A.T).shape)
print("Original matrix:\n", A, "\nshape:", A.shape)
U, s, Vt = la.svd(A)
# Start from an m x n matrix of zeros to hold the singular values
Sigma = np.zeros(A.shape)
print("Left singular vectors U:\n", U, "\nshape:", U.shape)
print("Right singular vectors (transposed) Vt:\n", Vt, "\nshape:", Vt.shape)

# Place the singular values on the diagonal
for i in range(len(s)):
    Sigma[i,i] = s[i]
print("Singular value matrix:\n", Sigma, "\nshape:", Sigma.shape)

In [None]:
B=U@Sigma@Vt
print("Does the reconstruction match the original matrix:", np.allclose(A,B))
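
The symmetric matrix $A A^T$ printed above connects the SVD to the eigendecomposition: the squared singular values of $A$ are the eigenvalues of $A A^T$. The following sketch checks this numerically for the same matrix; the use of `eigvalsh` and of the economy-size SVD are choices made for this illustration.

```python
import numpy as np

A = np.array([1, 5, 7, 6, 1, 2, 1, 10, 4, 4, 3, 6, 7, 5, 2],
             dtype=float).reshape(3, 5)

# Economy-size SVD: U is 3x3, s has 3 entries, Vt is 3x5
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of the symmetric matrix A @ A.T, reversed to descending order
eigvals = np.linalg.eigvalsh(A @ A.T)[::-1]

print(s ** 2)
print(eigvals)
print(np.allclose(s ** 2, eigvals))
```

This is why PCA can be computed either from the eigendecomposition of a covariance matrix or from the SVD of the centered data matrix.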

# PCA worked example

## Load Features and Labels

Keeping the foregoing simple example in mind, it is time to apply PCA to some real data. 

The code in the cell below loads the dataset, which has had the following preprocessing:
1. Cleaning missing values.
2. Aggregating categories of certain categorical variables. 
3. Encoding categorical variables as binary dummy variables.
4. Standardizing numeric variables. 


Execute the code in the cell below to load the features and labels as numpy arrays for the example: 

1. Read and inspect the data

In [None]:
import pandas as pd 
import numpy as np 
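def loadData():
    # Raw strings keep the Windows path backslashes literal
    Features = np.array(pd.read_csv(r'..\..\data\Credit_Features.csv'))
    Labels = np.array(pd.read_csv(r'..\..\data\Credit_Labels.csv'))
    # Report the shapes before returning (statements after `return` never run)
    print(Features.shape)
    print(Labels.shape)
    return Features,Labels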


There are 35 features in this data set. The numeric features have been z-score scaled, so they are zero centered (mean removed) and have unit variance (divided by the standard deviation).

****
**Note:** <font color="red">Before performing PCA, all features must be zero mean and unit variance. Failure to do so will result in biased computation of the components and scales. In this case, the data set has already been scaled, but ordinarily scaling is a key step.</font>
****
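
As a sketch of the scaling step for data that has not been prepared in advance, the following code applies z-score standardization with plain NumPy. The helper names `fit_scaler` and `apply_scaler` and the random data are assumptions made for this illustration; scikit-learn's `StandardScaler` does the same job.

```python
import numpy as np

def fit_scaler(X_train):
    """Learn the per-feature mean and standard deviation from training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0        # leave constant features unscaled
    return mean, std

def apply_scaler(X, mean, std):
    """Standardize X using statistics learned elsewhere."""
    return (X - mean) / std

# Illustrative features on very different scales
rng = np.random.default_rng(2)
loc, scale = [10.0, -3.0, 0.5], [5.0, 0.1, 2.0]
X_train_raw = rng.normal(loc=loc, scale=scale, size=(200, 3))
X_test_raw = rng.normal(loc=loc, scale=scale, size=(50, 3))

mean, std = fit_scaler(X_train_raw)
X_train_std = apply_scaler(X_train_raw, mean, std)
X_test_std = apply_scaler(X_test_raw, mean, std)   # reuse training statistics

print(X_train_std.mean(axis=0).round(6))
print(X_train_std.std(axis=0).round(6))
```

The statistics are learned from the training subset only and then reused for the test subset, so that no information from the test data leaks into the preprocessing.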


Now, run the code in the cell below to split the data set into training and test subsets:

In [None]:
from sklearn.model_selection import train_test_split
import numpy.random as nr
def preData(Features,Labels):
    nr.seed(1115)
    X_train, X_test, y_train, y_test = train_test_split(Features, Labels, test_size=0.3)
    # Flatten y from a column matrix to a 1-D vector with np.ravel
    y_train=np.ravel(y_train)
    y_test=np.ravel(y_test)
    return X_train, X_test, y_train, y_test

#### Compute principal components

The code in the cell below builds a PCA model for dimensionality reduction. It computes the principal components for the training feature subset, prints the explained variance ratio of each component, and plots the explained variance against the component number.

The printed ratios are a bit abstract. However, you can see that the variance ratios are in descending order and that the sum is 1.0. The plot (a scree plot) shows which components carry a large share of the variance; those components become the new, reduced set of dimensions.

Execute this code:

In [None]:
import sklearn.decomposition as skde
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import numpy.random as nr
import pandas as pd 
import numpy as np
 
def loadData():
    # Raw strings keep the Windows path backslashes literal
    Features = np.array(pd.read_csv(r'..\..\data\Credit_Features.csv'))
    Labels = np.array(pd.read_csv(r'..\..\data\Credit_Labels.csv'))
    
    print(Features.shape)
    print(Labels.shape)
    return Features,Labels

def preData(Features,Labels):
    nr.seed(1115)
    X_train, X_test, y_train, y_test = train_test_split(Features, Labels, test_size=0.3)
    # Flatten y from a column matrix to a 1-D vector with np.ravel
    y_train=np.ravel(y_train)
    y_test=np.ravel(y_test)
    return X_train, X_test, y_train, y_test
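def plot_explained(mod):
    # Scree plot: explained variance ratio against component number (1-based)
    comps = mod.explained_variance_ratio_
    x = range(1, len(comps) + 1)
    plt.plot(x,comps)
    
def percentagePCA(model):
    # Print the cumulative explained variance ratio after each component
    comps = model.explained_variance_ratio_
    x=0
    for i in range(len(comps)):
        x=x+comps[i]
        print("Cumulative explained variance ratio through component {}: {}".format(i+1,x))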
        
if __name__=='__main__':
    Features,Labels=loadData()
    X_train, X_test, y_train, y_test=preData(Features,Labels)
    pca_model=skde.PCA()
    pca_model.fit(X_train)
    print("Explained variance ratios:\n",pca_model.explained_variance_ratio_)
    print("Sum of explained variance ratios (should be 1):",sum(pca_model.explained_variance_ratio_))
    plot_explained(pca_model)
    percentagePCA(pca_model)
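
The cumulative ratios can also be used to pick the number of components automatically. The sketch below returns the smallest number of components whose cumulative explained variance reaches a target; the function name `n_components_for` and the example ratios are hypothetical and not taken from the credit data.

```python
import numpy as np

def n_components_for(ratios, threshold):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratios)
    # searchsorted finds the first index whose cumulative value is >= threshold;
    # adding 1 converts the 0-based index into a component count
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against thresholds above the total (for example, rounding error)
    return min(count, len(cumulative))

# Hypothetical explained variance ratios, in descending order
example_ratios = [0.6, 0.25, 0.1, 0.05]
print(n_components_for(example_ratios, 0.9))
```

With these example ratios the cumulative sums are 0.6, 0.85, 0.95 and 1.0, so three components are needed to reach 90% of the variance.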

Now it is time to create a PCA model with a reduced number of components. The code in the cell below trains and fits a PCA model with 5 components, and then transforms the features using that model. Execute this code. 

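With scikit-learn, first use the explained variance ratios to decide how many components to keep, then fit a PCA model with that number of components and use it to reduce the data.
```python
pca_mod_5 = skde.PCA(n_components = x)  # x is the number of components to keep
pca_mod_5.fit(X_train)                  # fit the model; transform() then reduces the data
```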

In [None]:
pca_model_5=skde.PCA(n_components=5)
pca_model_5.fit(X_train)
comp_5=pca_model_5.transform(X_train)

#### Compute and evaluate a logistic regression model on the reduced features

Next, you will compute and evaluate a logistic regression model using the features transformed by the first 5 principal components. Execute the code in the cell below to define and fit a logistic regression model, and print the model coefficients.

In [None]:
## Define and fit the logistic regression model
from sklearn.linear_model import LogisticRegression
# C: float, default=1.0
# Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
log_mod_5 =  LogisticRegression(C = 10.0, class_weight = {0:0.45, 1:0.55}) 
log_mod_5.fit(comp_5, y_train)
print(log_mod_5.intercept_)
print(log_mod_5.coef_)

## Evaluate the model
Notice that there are now 5 regression coefficients, one for each component. This number is in contrast to the 35 features in the raw data. 

Next, evaluate this model using the code below. Notice that the test features are transformed using the<font color='red'> same PCA transformation used for the training data</font>. Execute this code and examine the results.

In [None]:
import  sklearn.metrics as sklm 
def score_model(probs, threshold):
    return np.array([1 if x > threshold else 0 for x in probs[:,1]])

def print_metrics(labels, probs, threshold):
    scores = score_model(probs, threshold)
    metrics = sklm.precision_recall_fscore_support(labels, scores)
    conf = sklm.confusion_matrix(labels, scores)  # confusion matrix
    print('                 Confusion matrix')
    print('                 Score positive    Score negative')
    print('Actual positive    %6d' % conf[0,0] + '             %5d' % conf[0,1])
    print('Actual negative    %6d' % conf[1,0] + '             %5d' % conf[1,1])
    print('')
    print('Accuracy        %0.2f' % sklm.accuracy_score(labels, scores))
    print('AUC             %0.2f' % sklm.roc_auc_score(labels, probs[:,1]))
    print('Macro precision %0.2f' % float((float(metrics[0][0]) + float(metrics[0][1]))/2.0))
    print('Macro recall    %0.2f' % float((float(metrics[1][0]) + float(metrics[1][1]))/2.0))
    print(' ')
    print('           Positive      Negative')
    print('Num case   %6d' % metrics[3][0] + '        %6d' % metrics[3][1])
    print('Precision  %6.2f' % metrics[0][0] + '        %6.2f' % metrics[0][1])
    print('Recall     %6.2f' % metrics[1][0] + '        %6.2f' % metrics[1][1])
    print('F1         %6.2f' % metrics[2][0] + '        %6.2f' % metrics[2][1])

def plot_auc(labels, probs):
    ## Compute the false positive rate, true positive rate
    ## and threshold along with the AUC
    fpr, tpr, threshold = sklm.roc_curve(labels, probs[:,1])
    auc = sklm.auc(fpr, tpr)
    
    ## Plot the result
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, color = 'orange', label = 'AUC = %0.2f' % auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()    

probabilities = log_mod_5.predict_proba(pca_model_5.transform(X_test))
print_metrics(y_test, probabilities, 0.3)    
plot_auc(y_test, probabilities)     
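
The reason the test features must go through the same PCA transformation is that the projection is defined by the mean and the components learned from the training data. A minimal NumPy sketch of that idea follows; the helper names `fit_pca` and `project` and the random data are assumptions made for this illustration.

```python
import numpy as np

def fit_pca(X_train, k):
    """Learn the training mean and the first k principal directions via SVD."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal directions of the centered training data
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:k]

def project(X, mean, components):
    """Project any data set using parameters learned from the training set."""
    return (X - mean) @ components.T

rng = np.random.default_rng(3)
train = rng.normal(size=(70, 6))
test = rng.normal(size=(30, 6))

mean_, comps_ = fit_pca(train, k=2)
train_k = project(train, mean_, comps_)
test_k = project(test, mean_, comps_)   # same mean and components as training
print(train_k.shape, test_k.shape)
```

Fitting a second PCA on the test data would produce a different coordinate system, and the coefficients of the logistic regression model would no longer refer to the same axes.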

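#### SVD-based dimensionality reduction
To compress a matrix, first compute the SVD factors $U$, $\Sigma$, and $V^T$ of the original matrix $A$ (which could represent an image, for example). Keep only the $k$ largest singular values and their singular vectors, which carry most of the information, then rebuild the approximation $A_k$ and compare it with the original $A$.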

In [None]:
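k=2
print("k =",k)
Sigma_k=Sigma[:k,:k]  # keep the first k rows and k columns of the singular value matrix
print("Singular value matrix, first k rows and columns:",Sigma_k)
U_k=U[:,:k]
print("Left singular vectors, first k columns:",U_k)
Vt_k=Vt[:k,:]
print("Right singular vectors (transposed), first k rows:",Vt_k)

A_k=U_k@Sigma_k@Vt_k
print("Rank-k approximation:\n",A_k)
print("Original matrix:\n",A)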
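
How much is lost by keeping only $k$ singular values can be quantified. For the best rank-$k$ approximation, the Frobenius norm of the error equals the square root of the sum of the squared discarded singular values. The sketch below checks this for the same matrix `A`; the variable names are chosen for this illustration.

```python
import numpy as np

A = np.array([1, 5, 7, 6, 1, 2, 1, 10, 4, 4, 3, 6, 7, 5, 2],
             dtype=float).reshape(3, 5)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Rank-k approximation: scale the first k columns of U by the singular values
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Frobenius norm of the approximation error (the default norm for matrices)
err = np.linalg.norm(A - A_k)
# Norm predicted from the discarded singular values
expected = np.sqrt(np.sum(s[k:] ** 2))
# Share of the total squared singular values kept by the first k
kept = np.sum(s[:k] ** 2) / np.sum(s ** 2)

print(err, expected, kept)
```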


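### PCA example 2: dimensionality reduction on semiconductor data

1. Read the data
2. Preprocess the data
   * Handle missing values
   * Standardize the data
3. Compute the covariance matrix

Data background: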
Title: SECOM Data Set

Abstract: Data from a semi-conductor manufacturing process
	

-----------------------------------------------------

Data Set Characteristics: Multivariate
Number of Instances: 1567
Area: Computer
Attribute Characteristics: Real
Number of Attributes: 591
Date Donated: 2008-11-19
Associated Tasks: Classification, Causal-Discovery
Missing Values? Yes

-----------------------------------------------------

Source:

Authors: Michael McCann, Adrian Johnston 

-----------------------------------------------------

Data Set Information:

A complex modern semi-conductor manufacturing process is normally under consistent 
surveillance via the monitoring of signals/variables collected from sensors and/or 
process measurement points. However, not all of these signals are equally valuable 
in a specific monitoring system. The measured signals contain a combination of 
useful information, irrelevant information as well as noise. It is often the case 
that useful information is buried in the latter two. Engineers typically have a 
much larger number of signals than are actually required. If we consider each type 
of signal as a feature, then feature selection may be applied to identify the most 
relevant signals. The Process Engineers may then use these signals to determine key 
factors contributing to yield excursions downstream in the process. This will 
enable an increase in process throughput, decreased time to learning and reduce the 
per unit production costs.

To enhance current business improvement techniques the application of feature 
selection as an intelligent systems technique is being investigated.

The dataset presented in this case represents a selection of such features where 
each example represents a single production entity with associated measured 
features and the labels represent a simple pass/fail yield for in house line 
testing, figure 2, and associated date time stamp. Where -1 corresponds to a pass 
and 1 corresponds to a fail and the date time stamp is for that specific test 
point.


Using feature selection techniques it is desired to rank features according to 
their impact on the overall yield for the product, causal relationships may also be 
considered with a view to identifying the key features.

Results may be submitted in terms of feature relevance for predictability using 
error rates as our evaluation metrics. It is suggested that cross validation be 
applied to generate these results. Some baseline results are shown below for basic 
feature selection techniques using a simple kernel ridge classifier and 10 fold 
cross validation.

Baseline Results: Pre-processing objects were applied to the dataset simply to 
standardize the data and remove the constant features and then a number of 
different feature selection objects selecting 40 highest ranked features were 
applied with a simple classifier to achieve some initial results. 10 fold cross 
validation was used and the balanced error rate (*BER) generated as our initial 
performance metric to help investigate this dataset.


SECOM Dataset: 1567 examples 591 features, 104 fails

| FS method (40 features) | BER % | True + % | True - % |
|---|---|---|---|
| S2N (signal to noise) | 34.5 +-2.6 | 57.8 +-5.3 | 73.1 +-2.1 |
| Ttest | 33.7 +-2.1 | 59.6 +-4.7 | 73.0 +-1.8 |
| Relief | 40.1 +-2.8 | 48.3 +-5.9 | 71.6 +-3.2 |
| Pearson | 34.1 +-2.0 | 57.4 +-4.3 | 74.4 +-4.9 |
| Ftest | 33.5 +-2.2 | 59.1 +-4.8 | 73.8 +-1.8 |
| Gram Schmidt | 35.6 +-2.4 | 51.2 +-11.8 | 77.5 +-2.3 |

-----------------------------------------------------

Attribute Information:

Key facts: Data Structure: The data consists of 2 files the dataset file SECOM 
consisting of 1567 examples each with 591 features a 1567 x 591 matrix and a labels 
file containing the classifications and date time stamp for each example.

As with any real life data situations this data contains null values varying in 
intensity depending on the individuals features. This needs to be taken into 
consideration when investigating the data either through pre-processing or within 
the technique applied.

The data is represented in a raw text file, each line representing an individual 
example and the features separated by spaces. The null values are represented by 
the 'NaN' value as per MatLab.
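
The description above mentions null values and constant features, both of which need handling before PCA. The following sketch shows one possible preprocessing step with plain NumPy: fill each missing value with its column mean, then drop columns that do not vary. The function name and the toy matrix are hypothetical; the toy matrix merely stands in for the 1567 x 591 data.

```python
import numpy as np

def impute_and_drop_constant(X):
    """Fill NaN with column means, then drop zero-variance columns."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)          # column means ignoring NaN
    filled = np.where(np.isnan(X), col_means, X)
    keep = filled.std(axis=0) > 0              # True for columns that vary
    return filled[:, keep], keep

# Toy stand-in for the sensor matrix: NaN marks a missing reading,
# and the middle column is constant
toy = np.array([[1.0, 5.0, np.nan],
                [2.0, 5.0, 4.0],
                [np.nan, 5.0, 6.0]])

clean, keep = impute_and_drop_constant(toy)
print(keep)
print(clean)
```

A column that is entirely NaN has no mean to impute and would need to be dropped separately before this step.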



In [None]:
!pip install  dtale
!pip install sweetviz 

In [None]:
import pandas as pd
import sweetviz as sv

# EDA using Sweetviz
sweet_report = sv.analyze(pd.read_csv("titanic.csv"))

#Saving results to HTML file
sweet_report.show_html('sweet_report.html')

In [None]:
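import matplotlib
import matplotlib.pyplot as plt

class PCA_explain:
    def __init__(self,pca_model):
        self.__pca_model=pca_model
    
    def plot_explained(self):
        # Switch to the Tk backend so the plot opens in a window (requires a GUI)
        matplotlib.use('TkAgg')

        # Scree plot of the explained variance ratio per component (1-based)
        comps = self.__pca_model.explained_variance_ratio_
        x = range(1, len(comps) + 1)
        plt.plot(x,comps)
        plt.show()
        
    def percentagePCA(self,percentage: float)->int:
        # Return the smallest number of components whose cumulative
        # explained variance ratio reaches `percentage`
        comps = self.__pca_model.explained_variance_ratio_
        x=0
        count=len(comps)
        for i in range(len(comps)):
            x=x+comps[i]
            if x>=percentage:
                count=i+1
                print("The first {} components explain {} of the variance, reaching the target {}".format(count,x,percentage))
                break
        return count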
          

In [1]:
from fileinput import filename
import string
import numpy as np  
import pandas as pd 
import dtale
from pandas_profiling import ProfileReport
import sweetviz as sv
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import sklearn.decomposition as skde

import matplotlib

 


def loadDataSet(dataFileName,sep,fileType,header=None)->pd.DataFrame:
    """Read a delimited data file and return it as a DataFrame of floats.

    Args:
        dataFileName (str): path to the data file.
        sep (str): field separator.
        fileType (str): 'csv' or 'data'.
        header (int, optional): row number of the header line, or None if
            the file has no header. Defaults to None.

    Returns:
        pd.DataFrame: the file contents converted to float.
    """
    # pandas expects a row index (or None) here, not a bool; treat True as row 0
    if header is True:
        header=0
    if fileType=='csv':
        df=pd.read_csv(dataFileName,sep=sep,header=header)
    elif fileType=='data':
        df=pd.read_table(dataFileName,sep=sep,header=header)
    else:
        raise ValueError("fileType must be 'csv' or 'data'")
    df=df.astype(float)
    return df


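def preDataSet(dataSet:pd.DataFrame,method="del")->pd.DataFrame:
    """Handle missing values in a DataFrame.

    Args:
        dataSet (pd.DataFrame): input data, possibly containing NaN.
        method (str): "del" drops rows with missing values, "zero" fills
            them with 0, "mean" fills them with the column mean.

    Returns:
        pd.DataFrame: the data with missing values handled.
    """
    # fillna returns a new DataFrame, so the result must be assigned
    if method=="del":
        dataSet=dataSet.dropna()
    elif method=="zero":
        dataSet=dataSet.fillna(0)
    elif method=="mean":
        dataSet=dataSet.fillna(dataSet.mean())
    
    return dataSet    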

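def percentage_null(data)->list:
    # Fraction of missing values in each column
    n_rows=data.shape[0]
    return (data.isnull().sum()/n_rows).tolist()

def pca(data_cov):
    b=data_cov@data_cov.T 
    # Compute the eigenvalues and the matrix of eigenvectors
    eg_value,eg_vec=np.linalg.eig(b)
    return eg_value,eg_vec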

if __name__=="__main__":
    # Read the file and run EDA
    #fileName=r"..\..\data\secom.data"
    #df=loadDataSet(fileName,sep=" ",fileType='data')
    fileName=r"..\..\data\Credit_Features.csv"
    df=loadDataSet(fileName,sep=",",fileType='csv',header=0)
    print(df.shape)
 
    print(df.describe())
    #df=preDataSet(df,method="mean")
    data=df.to_numpy()
    # Preprocessing with scikit-learn:
    # 1. Mean imputation: replace missing values with the column mean
    # 2. Standardization: put all features on the same scale
    imp_mean=SimpleImputer(missing_values=np.nan,strategy="mean")
    imp_mean.fit(data)
    data2=imp_mean.transform(data)
    # Standardize the data
    stdsc=StandardScaler()
    data3=stdsc.fit_transform(data2)
    print(data3)
    #print("Build the correlation matrix")
    #data_cov=np.cov(data3)
    #print(data_cov)
    pca_model=skde.PCA()
    pca_model.fit(data3)
    print("Explained variance ratios:\n",pca_model.explained_variance_ratio_)
    print("Sum of explained variance ratios (should be 1):",sum(pca_model.explained_variance_ratio_))
    pca_exp=PCA_explain(pca_model)
    pca_exp.plot_explained()
    count=pca_exp.percentagePCA(0.8)
    print(count)
    # Apply PCA to the data, keeping `count` components
    pca_5=skde.PCA(n_components=count)
    pca_5.fit(data3)
    data5=pca_5.transform(data3)
    print(data5)


#### Image compression
https://blog.csdn.net/discoverer100/article/details/89356513

In [None]:
!pip install pillow

In [None]:
!pip install -U matplotlib

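#### Image processing background
Converting a PNG image to an array yields a rank-3 array, that is, a 3D tensor. Note that the axes have different sizes: the three axes hold the height, the width, and the color channels respectively.

For example, an image array with shape (897, 631, 4) is 897 pixels high and 631 pixels wide, with 4 color channels.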

In [None]:
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
def image_svd(A,k):
    # Economy-size SVD avoids building the full square U for large images
    U,s,Vt=np.linalg.svd(A,full_matrices=False)
    # Keep the k largest singular values and their singular vectors
    Sigma_k=np.diag(s[:k])
    U_k=U[:,:k]
    Vt_k=Vt[:k,:]
    return U_k@Sigma_k@Vt_k
   
def imageR(img):
      R=img[:,:,0]
      G=img[:,:,1]
      B=img[:,:,2]
      #A=img[:,:,3]
      return R,G,B#,A
def imageS(R,G,B):
      img=np.stack((R,G,B),2)
      return img
if __name__=="__main__":
    image=Image.open("girl.jpg","r")
    A=np.array(image)
    print("\nImage shape:",A.shape)
  
    fig,axes=plt.subplots(3,2)
    fig.set_size_inches(15,15)
    ax1=axes[0,0]
    ax2=axes[0,1]
    ax3=axes[1,0]
    ax4=axes[1,1]
    ax5=axes[2,0]
    ax1.axis('off')
    ax2.axis('off')
    ax3.axis('off')
    ax4.axis('off')  # hide the axis ticks and labels
    ax1.imshow(image)
    R,G,B =imageR(A)
    print("Red channel:\n" ,"matrix shape",R.shape)
    ax2.imshow(R)
    print("Green channel:" )
    ax3.imshow(G)
    print("Blue channel:" )
    ax4.imshow(B)
    #print("Alpha channel:" )
    #ax5.imshow(A2)
    k=20
    R_k=image_svd(R,k)
    G_k=image_svd(G,k)
    B_k=image_svd(B,k)
    #A_k=image_svd(A2,k)
    
    print(R,R.shape)
    img2=imageS(R_k,G_k,B_k)
    
    ax6=axes[2,1]
    ax6.imshow(G_k)
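
To see why the truncated SVD compresses an image, compare storage counts. A full channel needs one number per pixel, while a rank-$k$ approximation needs only the first $k$ columns of $U$, the first $k$ rows of $V^T$, and $k$ singular values. The sketch below uses the (897, 631) channel size quoted earlier and $k = 20$; the function name is hypothetical.

```python
def rank_k_storage(height, width, k):
    """Numbers stored by a rank-k SVD of one channel.

    k columns of U (height each), k rows of Vt (width each), k singular values.
    """
    return k * (height + width + 1)

height, width, k = 897, 631, 20
full = height * width                        # one number per pixel
compressed = rank_k_storage(height, width, k)
print(full, compressed, round(full / compressed, 1))
```

The saving applies per color channel, and it shrinks as `k` grows; the approximation only pays off while `k` stays well below the smaller image dimension.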
  

 
    
    

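### Using SVD in recommender systems
> Recommender systems use classic collaborative filtering.
> A common problem in recommender systems looks like this: there are many users and many items, and only a small fraction of the users have rated a small fraction of the items. We want to predict the ratings the target user would give to items they have not yet rated, and then recommend the items with the highest predicted ratings. Consider, for example, a user-item rating table like the following: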
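
As a deliberately simplified sketch of this idea, the code below fills the missing ratings with item means, takes a rank-2 SVD approximation, and recommends each user's highest-scoring unrated item. The rating matrix is hypothetical, and the naive mean fill is an assumption made for this illustration; practical recommenders usually fit the factorization to the observed ratings only.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, NaN = not rated
R = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, np.nan, 1.0, 1.0],
              [1.0, 1.0, 5.0, np.nan],
              [np.nan, 1.0, 4.0, 5.0]])

mask = ~np.isnan(R)                          # True where a rating exists
item_means = np.nanmean(R, axis=0)
filled = np.where(mask, R, item_means)       # naive fill before factorizing

# Rank-k approximation of the filled matrix gives the predicted ratings
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2
R_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]

# For each user, pick the unrated item with the highest predicted rating
scores = np.where(mask, -np.inf, R_hat)
best_item = scores.argmax(axis=1)
print(np.round(R_hat, 2))
print(best_item)
```

In this toy matrix every user has exactly one unrated item, so the recommendation is forced; with more missing entries the predicted ratings in `R_hat` decide the ranking.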
