# Chapter 9

### Reducing Features Using Principal Components (While Retaining Variance)

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets

digits = datasets.load_digits()
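# Standardize the feature matrix
features = StandardScaler().fit_transform(digits.data)
# Create a PCA that retains 99% of the variance
pca = PCA(n_components=.99,
                whiten=True) # Transforms the values of each principal component so that they have zero mean and unit variance
                # svd_solver = "randomized" # Implements a stochastic algorithm that finds the first principal components, often in less time (note: a fractional n_components is expected to require svd_solver="full")
# Fit the PCA and transform the features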
pca_features = pca.fit_transform(features)

print("Original number of features :", features.shape[1])
print("Reduced number of features :", pca_features.shape[1])

Original number of features : 64
Reduced number of features : 54


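#### Notes: PCA is an unsupervised technique: it does not need any information from the target vector and considers only the feature matrix. In addition, the new features that PCA produces cannot be interpreted by humans, so if the model must remain interpretable, reducing dimensionality through feature selection is the better choice.
![Linearly separable](img/78666551_595455054560185_8155747298551267328_n.jpg)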
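
The 99% figure can be checked by hand. The following is a minimal sketch, assuming the same standardized digits data as the cell above: it fits a PCA that keeps every component and counts how many leading components are needed before the cumulative explained variance ratio exceeds 0.99. The variable names (`full_pca`, `n_needed`, `auto_pca`) are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same standardized digits data as in the cell above
features = StandardScaler().fit_transform(datasets.load_digits().data)

# Fit a PCA that keeps every component, then accumulate the variance ratios
full_pca = PCA(svd_solver="full").fit(features)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Position of the first component at which the running total exceeds 99%
n_needed = int(np.argmax(cumulative > 0.99)) + 1

# PCA(n_components=.99) should arrive at the same count on its own
auto_pca = PCA(n_components=.99, svd_solver="full").fit(features)
print("Components counted by hand:", n_needed)
print("Components chosen by PCA  :", auto_pca.n_components_)
```

Inspecting `cumulative` directly is also a convenient way to see how quickly the explained variance saturates before committing to a threshold.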

### Reducing Features When Data Is Linearly Inseparable

In [10]:
from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles
# Generate linearly inseparable data
features, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)
# Apply kernel PCA with a radial basis function (RBF) kernel
kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1)
features_kpca = kpca.fit_transform(features)

print("Original number of features :", features.shape[1])
print("Reduced number of features :", features_kpca.shape[1])

Original number of features : 2
Reduced number of features : 1


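#### Notes: Standard PCA reduces features with a linear projection. If the data is linearly separable, PCA works well; if it is not, PCA alone performs poorly, because the projection intertwines the classes. Ideally the dimensionality reduction should also make the data linearly separable, and this is where KernelPCA helps (with kernels such as rbf, poly, and sigmoid). KernelPCA does, however, require the number of components to be specified (e.g., n_components=1), and the kernel's own parameters must be set as well.
![Linearly inseparable](img/77410285_2148243928817543_140205286072778752_n.jpg)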
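
One way to see the difference is to reduce the circle data to a single feature with both methods and check how well a linear classifier separates the classes afterwards. This is a rough sketch rather than a rigorous evaluation: it reuses the parameters from the cell above, scores on the training data only, and the helper `linear_accuracy` is a hypothetical name introduced for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Same concentric-circle data as above, this time keeping the class labels
features, target = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Reduce to a single feature with linear PCA and with RBF kernel PCA
features_pca = PCA(n_components=1).fit_transform(features)
features_kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1).fit_transform(features)

def linear_accuracy(reduced, labels):
    """Training accuracy of a linear classifier on the reduced feature."""
    return LogisticRegression().fit(reduced, labels).score(reduced, labels)

acc_pca = linear_accuracy(features_pca, target)
acc_kpca = linear_accuracy(features_kpca, target)
print("Linear PCA accuracy:", acc_pca)
print("Kernel PCA accuracy:", acc_kpca)
```

If the kernel and `gamma` suit the data, the kernel PCA feature should be far easier to split with a single threshold than the linear projection.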

### Reducing Features by Maximizing Class Separability (LDA)

In [11]:
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create and run an LDA, then use it to transform the features
lda = LinearDiscriminantAnalysis(n_components=1) # Number of features to return
features_lda = lda.fit(features, target).transform(features)

print("Original number of features :", features.shape[1])
print("Reduced number of features :", features_lda.shape[1])

Original number of features : 4
Reduced number of features : 1


In [12]:
# View the ratio of variance explained by each component
lda.explained_variance_ratio_

array([0.9912126])

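#### Notes: LDA is a classification method and also a commonly used dimensionality reduction technique. With PCA we are interested only in the components that maximize the variance in the data, whereas LDA has the additional goal of maximizing the differences between classes.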
![LDA](img/78144320_595190831217947_8467876655149350912_n.jpg)

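#### Notes: A trick for tuning n_components: first set it to None, then use LDA to return the ratio of variance explained by each component, and count how many components are needed before the cumulative ratio reaches a chosen threshold.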

In [17]:
lda = LinearDiscriminantAnalysis(n_components=None)
lda.fit(features, target)

# Ratio of variance explained by each component
lda_var_ratios = lda.explained_variance_ratio_

def select_n_components(var_ratio, goal_var: float) -> int:
    # Running total of explained variance
    total_variance = 0.0
    # Number of components counted so far
    n_components = 0
    for explained_variance in var_ratio:
        total_variance += explained_variance
        n_components += 1
        # Stop as soon as the goal is reached
        if total_variance >= goal_var:
            break
    return n_components

select_n_components(lda_var_ratios, 0.95)

1
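
The loop above can also be written with a cumulative sum. The sketch below is an equivalent NumPy variant; `select_n_components_cumsum` is a hypothetical name, and the second input array is made up for illustration. Like the loop, it returns the total number of components when the goal is never reached.

```python
import numpy as np

def select_n_components_cumsum(var_ratio, goal_var: float) -> int:
    # Running total of explained variance after each component
    cumulative = np.cumsum(var_ratio)
    # Positions at which the running total has reached the goal
    reached = np.flatnonzero(cumulative >= goal_var)
    # First such position (1-based), or every component if the goal is never met
    return int(reached[0]) + 1 if reached.size else len(cumulative)

print(select_n_components_cumsum(np.array([0.9912126, 0.0087874]), 0.95))
print(select_n_components_cumsum(np.array([0.5, 0.3, 0.15, 0.05]), 0.75))
```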

### Reducing Features Using Matrix Factorization on a Non-Negative Feature Matrix (NMF)

In [19]:
from sklearn.decomposition import NMF
from sklearn import datasets

digits = datasets.load_digits()
features = digits.data

nmf = NMF(n_components=10, random_state=1)
feature_nmf = nmf.fit_transform(features)
print("Original number of features :", features.shape[1])
print("Reduced number of features :", feature_nmf.shape[1])

Original number of features : 64
Reduced number of features : 10


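#### Notes: NMF is an unsupervised technique for linear dimensionality reduction. It factorizes the feature matrix (that is, breaks it into several matrices whose product approximates the original matrix) into matrices that represent the latent relationship between observations and their features. The factor matrices have far fewer dimensions than their product. Given the desired number of returned features, $r$:
$$ V \approx WH $$
* $V$ is the $d \times n$ feature matrix ($d$ features, $n$ observations; it must not contain negative values)
* $W$ is a $d \times r$ matrix
* $H$ is an $r \times n$ matrix

Adjusting the value of $r$ sets the amount of dimensionality reduction.

#### The NMF feature matrix cannot contain negative values, and NMF does not provide the explained variance of the output features.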
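
The factorization can be inspected directly. The sketch below assumes the same digits data and multiplies the two factors back together to see how close the product is to the original matrix. Note that scikit-learn stores observations in rows, so its matrices are the transposes of the $d \times n$ notation above: `fit_transform` returns an observations-by-$r$ matrix and `components_` is $r$-by-features. The `max_iter=1000` setting and the relative error measure are choices made for this illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import NMF

features = datasets.load_digits().data  # 1797 observations x 64 non-negative features

nmf = NMF(n_components=10, random_state=1, max_iter=1000)
W = nmf.fit_transform(features)  # per-observation weights, shape (1797, 10)
H = nmf.components_              # per-component loadings, shape (10, 64)

# The product of the two small factors approximates the original matrix
approximation = W @ H
relative_error = np.linalg.norm(features - approximation) / np.linalg.norm(features)

print(W.shape, H.shape, approximation.shape)
print("Relative reconstruction error:", relative_error)
```

Raising `n_components` generally lowers the reconstruction error at the cost of less reduction.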

### Reducing Features on Sparse Data

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
from sklearn import datasets
import numpy as np

In [6]:
digits = datasets.load_digits()
# Standardize the features
features = StandardScaler().fit_transform(digits.data)
# Create a sparse matrix
features_sparse = csr_matrix(features)
# Create a TSVD
tsvd = TruncatedSVD(n_components=10)
# Fit the TSVD on the sparse matrix and transform it
features_sparse_tsvd = tsvd.fit(features_sparse).transform(features_sparse)
# Show the results
print("Original number of features :", features_sparse.shape[1])
print("Reduced number of features :", features_sparse_tsvd.shape[1])

Original number of features : 64
Reduced number of features : 10


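#### Notes: TSVD is well suited to sparse matrices. PCA in fact often uses non-truncated singular value decomposition (SVD) as one of its steps. In regular SVD, given d features, SVD produces factor matrices that are d * d, whereas TSVD returns factors that are n * n, where n is set through a parameter (n_components).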
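
To make the truncation concrete, here is a NumPy-only sketch on a small random matrix (hypothetical data, not the digits set). It computes a full SVD and then keeps only the first `k` singular values and vectors, which is the idea behind TSVD; it is not meant to mirror how `TruncatedSVD` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # hypothetical data: 100 observations, 8 features

# Thin SVD: X = U @ diag(S) @ Vt, with singular values in descending order
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 3  # number of components to keep, playing the role of n_components
X_reduced = U[:, :k] * S[:k]   # reduced features, identical to X @ Vt[:k].T
X_approx = X_reduced @ Vt[:k]  # rank-k approximation of X

error_k = np.linalg.norm(X - X_approx)
error_full = np.linalg.norm(X - (U * S) @ Vt)

print(X_reduced.shape)
print("Error with k components  :", error_k)
print("Error with all components:", error_full)
```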

In [9]:
tsvd.explained_variance_ratio_[:3].sum() # The first three output components explain about 30% of the variance in the original data

0.3003938535247808

In [14]:
# Create and run a TSVD with one fewer component than the number of features
tsvd = TruncatedSVD(n_components=features_sparse.shape[1]-1)
tsvd.fit(features)

# Ratio of variance explained by each component
tsvd_var_ratios = tsvd.explained_variance_ratio_

def select_n_components(var_ratio, goal_var):
    # Initial amount of variance explained so far
    total_variance = 0.0
    # Initial number of features
    n_components = 0
    for explained_variance in var_ratio:
        total_variance += explained_variance
        n_components += 1
        # Stop as soon as the goal is reached
        if total_variance >= goal_var:
            break
    return n_components

select_n_components(tsvd_var_ratios, 0.95)

40

# Chapter 10

### Feature selection : Filter, Wrapper, Embedded
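* Filter: selects the best features by examining their statistical properties
* Wrapper: uses trial and error to find the subset of features that produces the model with the highest-quality predictions
* Embedded: selects the best feature subset as part of the learning algorithm's training process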

### Thresholding Feature Variance (Removing Low-Variance Features)

In [15]:
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

In [16]:
iris = datasets.load_iris()
features = iris.data
target = iris.target
threshold = VarianceThreshold(threshold=.5)
features_high_variance = threshold.fit_transform(features)
features_high_variance[:3]

array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2]])

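#### Notes: Variance Thresholding (VT) first computes the variance of each feature:
$$\operatorname{Var}(x)=\frac{1}{n} \sum_{i=1}^{n} (x_i-\mu)^2$$
where $x$ is the feature vector, $x_i$ is an individual feature value, and $\mu$ is the mean of that feature. Features whose variance does not exceed the threshold are then dropped.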
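
The formula can be applied by hand and compared with what `VarianceThreshold` reports. The sketch below assumes the same iris data and the same 0.5 threshold as the cell above; `manual_variances` and `keep` are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

features = datasets.load_iris().data

# Var(x) = (1/n) * sum((x_i - mu)^2), computed separately for each feature column
mu = features.mean(axis=0)
manual_variances = ((features - mu) ** 2).sum(axis=0) / features.shape[0]

# Features whose variance exceeds the threshold are kept
keep = manual_variances > 0.5

selector = VarianceThreshold(threshold=.5).fit(features)
print(manual_variances)
print(keep)
print(selector.get_support())
```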

In [18]:
# View the variances
threshold.fit(features).variances_

array([0.68112222, 0.18871289, 3.09550267, 0.57713289])

#### Variance thresholding does not work after standardization, because every standardized feature has unit variance

In [19]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features_std = scaler.fit_transform(features)
selector = VarianceThreshold()
selector.fit(features_std).variances_

array([1., 1., 1., 1.])

### Thresholding Binary Feature Variance

In [20]:
from sklearn.feature_selection import VarianceThreshold
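# Feature 0: 80% class 0
# Feature 1: 80% class 1
# Feature 2: 60% class 0, 40% class 1
features = [[0, 1, 0],
            [0, 1, 1],
            [0, 1, 0],
            [0, 1, 1],
            [1, 0, 0]]
thresholder = VarianceThreshold(threshold=(.75)*(1-.75)) # Variance of a Bernoulli random variable, p(1 - p), where p is the proportion of observations of class 1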
thresholder.fit_transform(features)

array([[0],
       [1],
       [0],
       [1],
       [0]])
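
The result follows from the Bernoulli variance $p(1-p)$ noted in the code comment. As a worked check on the same five rows, the sketch below computes $p$ for each feature and compares $p(1-p)$ with the threshold $0.75 \times 0.25 = 0.1875$; only the third feature, with variance $0.4 \times 0.6 = 0.24$, clears it.

```python
import numpy as np

features = np.array([[0, 1, 0],
                     [0, 1, 1],
                     [0, 1, 0],
                     [0, 1, 1],
                     [1, 0, 0]])

p = features.mean(axis=0)         # proportion of ones per feature: 0.2, 0.8, 0.4
bernoulli_variance = p * (1 - p)  # 0.16, 0.16, 0.24
threshold = .75 * (1 - .75)       # 0.1875

print(bernoulli_variance)
print(bernoulli_variance > threshold)
```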

### Handling Highly Correlated Features

In [42]:
import pandas as pd
import numpy as np
features = np.array([[1, 1, 1],
                             [2, 2, 0],
                             [3, 3, 1],
                             [4, 4, 0],
                             [5, 5, 1],
                             [6, 6, 0],
                             [7, 7, 1],
                             [8, 7, 0],
                             [9, 7, 1]])
df = pd.DataFrame(features)
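# Create the correlation matrix
corr_matrix = df.corr().abs()
# Select the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)) # Boolean mask for DataFrame.where (np.bool was removed from recent NumPy releases)
# Find the index of feature columns with a correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]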
df.drop(df.columns[to_drop], axis=1)

   0  2
0  1  1
1  2  0
2  3  1
3  4  0
4  5  1
5  6  0
6  7  1
7  8  0
8  9  1


#### Notes: The upper triangle, above the diagonal

In [35]:
np.triu(np.ones(corr_matrix.shape), k=1).astype(bool) # Converted to True/False

array([[False,  True,  True],
       [False, False,  True],
       [False, False, False]])

#### If two features are highly correlated, the information they contain is very similar, so keeping both is redundant.

### Removing Irrelevant Features for Classification

In [44]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif
iris = load_iris()
features = iris.data
target = iris.target

# Convert to integers so the data can be treated as categorical
features = features.astype(int)

# Select the two features with the highest chi-square statistics
chi2_selector = SelectKBest(chi2, k=2)
features_kbest = chi2_selector.fit_transform(features, target)

print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 2


#### If the features are quantitative, compute the ANOVA F-value between each feature and the target vector

In [45]:
fvalue_selector = SelectKBest(f_classif, k=2)
features_kbest = fvalue_selector.fit_transform(features, target)
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 2


#### Use SelectPercentile to select the top n percent of features

In [49]:
from sklearn.feature_selection import SelectPercentile
fvalue_selector = SelectPercentile(f_classif, percentile=75)
features_kbest = fvalue_selector.fit_transform(features, target)
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 3


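#### Notes: The chi-square statistic tests the independence of two categorical vectors. It measures the difference between the number of observations in each class of a categorical feature and the number that would be expected if the feature were independent of (that is, had no relationship with) the target vector.
#### *When chi-square is used for feature selection, both the target vector and the features must be categorical, and no value may be negative.*
#### *To handle numerical features, we can use f_classif to compute the ANOVA F-value between each feature and the target vector.*

$$ \chi^2 = \sum_{i=1}^{n}  \frac {(O_i-E_i)^2}{E_i}$$
where $O_i$ is the number of observations in class $i$ and $E_i$ is the number of observations expected in class $i$ if there were no relationship between the feature and the target vector.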
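
As a worked example of the formula, the sketch below computes the statistic for a small, made-up contingency table, summing $(O - E)^2 / E$ over every cell. The counts are hypothetical, and this illustrates the formula only; it is not meant to reproduce the exact computation inside scikit-learn's `chi2`, which as far as I know works from per-class feature totals rather than an explicit contingency table.

```python
import numpy as np

# Hypothetical contingency table: rows are feature categories, columns are target classes
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)  # 40 and 60
col_totals = observed.sum(axis=0, keepdims=True)  # 50 and 50
grand_total = observed.sum()                      # 100

# Expected counts if the feature and the target were independent
expected = row_totals * col_totals / grand_total  # [[20, 20], [30, 30]]

# Sum of (O - E)^2 / E over all cells: 5 + 5 + 10/3 + 10/3 = 50/3
chi_squared = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi_squared)
```

The larger the statistic, the further the observed counts are from what independence would predict, which is why features with high values are the ones retained.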

### Recursive Feature Elimination (Using Cross-Validation)

In [59]:
import warnings
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn import datasets, linear_model
warnings.filterwarnings("ignore")
features, target = make_regression(n_samples = 10000, n_features = 100, n_informative= 2, random_state=1)
# Create a linear regression
ols = linear_model.LinearRegression()
# Recursively eliminate features
rfecv = RFECV(estimator=ols, step=1, scoring="neg_mean_squared_error")
rfecv.fit(features, target)
rfecv.transform(features)

array([[ 0.00850799,  0.7031277 ],
       [-1.07500204,  2.56148527],
       [ 1.37940721, -1.77039484],
       ...,
       [-0.80331656, -1.60648007],
       [ 0.39508844, -1.34564911],
       [-0.55383035,  0.82880112]])

In [60]:
# Number of best features
rfecv.n_features_

2

In [61]:
# Which features are the best
rfecv.support_

array([False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])

In [62]:
# Feature rankings (1 is best)
rfecv.ranking_

array([82, 84, 74, 33, 81,  1, 18, 46, 57, 67, 45,  7, 58, 52, 78,  8,  5,
       73, 31, 11, 43, 14, 34, 83, 21, 96, 20, 41, 94, 90, 71, 47, 30, 27,
       89, 50, 25, 69, 86,  1, 76, 19, 97, 88,  9, 16, 23, 80, 75, 54, 91,
       12, 65, 59, 24, 32,  4, 26, 10, 42, 72,  2, 87, 40, 66,  3, 92, 17,
       39, 35, 13, 79, 38,  6, 53, 60, 22, 61, 28, 95, 93, 36, 99, 48, 51,
       68, 37, 70, 15, 98, 56, 29, 44, 63, 49, 64, 77, 85, 55, 62])

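#### Notes: The idea behind RFE is to repeatedly train a model that has parameters (weights or coefficients), such as linear regression or an SVM. The first time the model is trained, all features are included. Then the feature with the smallest parameter is found; it is treated as the least important and is eliminated, and the process repeats on the remaining features.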
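
The elimination loop described above can be sketched by hand. This is a simplified illustration on a small, hypothetical regression problem, not a reimplementation of `RFECV`: it stops at a fixed number of features (`n_features_to_keep`, an assumed name) instead of using cross-validation to choose that number, and it relies on the features already being on a comparable scale so that coefficient magnitudes are meaningful.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Small hypothetical problem: 10 features, only 2 of which drive the target
features, target, true_coef = make_regression(
    n_samples=200, n_features=10, n_informative=2, coef=True, random_state=1)

remaining = list(range(features.shape[1]))  # indices of features still in play
n_features_to_keep = 2

while len(remaining) > n_features_to_keep:
    # Retrain on the features that are still in play
    model = LinearRegression().fit(features[:, remaining], target)
    # The feature with the smallest absolute coefficient is treated as least important
    weakest = int(np.argmin(np.abs(model.coef_)))
    del remaining[weakest]

print("Kept features       :", remaining)
print("Informative features:", np.flatnonzero(true_coef).tolist())
```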