### Feature selection

#### 1. Feature selection

(1) Purpose of feature selection

    Choose among the features that already exist (rather than creating new ones) for modeling, in order to improve model performance

(2) What an optimal feature subset achieves

- Reduces overfitting


- Improves the accuracy of the results


- Increases model interpretability


- Reduces model training time


#### 2. Targets and methods of feature selection

- Redundant features

  - Features that contain duplicated or unimportant information

  - Values that were used to produce an aggregate statistic


- Correlated features


- Text vectors produced by feature engineering


#### 3. Redundant features

    # Create a list of redundant column names to drop
    to_drop = ["col_1", "col_2"]

    # Drop those columns from the dataset
    df_subset = df.drop(to_drop, axis=1)

    # Print out the head of the new dataset
    print(df_subset.head())
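
As a small illustration of the second kind of redundancy listed in section 2, the sketch below builds a hypothetical DataFrame in which three raw columns (`run1`, `run2`, `run3`, names invented for this example) are only used to produce the aggregate `run_mean`, and then drops them. It is a minimal sketch under those assumptions, not a general recipe for detecting redundancy.

```python
import pandas as pd

# Hypothetical data: three raw measurements per row (column names are made up)
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "run1": [10.0, 12.0, 11.0],
    "run2": [11.0, 13.0, 12.0],
    "run3": [12.0, 14.0, 13.0],
})

# Build the aggregate statistic from the raw columns
run_cols = ["run1", "run2", "run3"]
df["run_mean"] = df[run_cols].mean(axis=1)

# The raw columns are now redundant with the aggregate, so drop them
df_subset = df.drop(columns=run_cols)
print(df_subset.head())
```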

#### 4. Correlated features

(1) Feature selection for regression models

a. Filter method: rank features by their statistical performance

b. Filter functions:

- df.corr(): Pearson's correlation matrix


- sns.heatmap(corr_object): heatmap plot


- abs(): absolute value


(2) Use df.drop() to remove highly correlated features

    # Print out the column correlations of the dataset
    cor = df.corr()
    print(cor)
    
    # Find the columns whose correlation values are greater than 0.5
    # ("col_names" is a placeholder for the column name(s) you identify)
    to_drop = "col_names"

    # Drop those columns from the DataFrame
    df = df.drop(to_drop, axis=1)
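
The `to_drop` value can also be derived from the correlation matrix instead of being read off by eye. The sketch below is one possible way to do this, assuming synthetic data (columns `a`, `b`, `c` are invented) and the 0.5 threshold used in these notes: it keeps only the upper triangle of the absolute correlation matrix, so each pair is checked once, and drops the later column of any pair above the threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is almost a copy of "a", "c" is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
cor = df.corr().abs()
upper = cor.where(np.triu(np.ones(cor.shape, dtype=bool), k=1))

# A column is dropped if it correlates above the threshold with an earlier column
threshold = 0.5
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)

df_reduced = df.drop(columns=to_drop)
```

Which column of a correlated pair is removed depends only on column order here; in practice you may prefer to keep the one that is easier to interpret or more correlated with the target.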
    
(3) Select the features that are highly correlated with the output variable

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Print out the column correlations of the dataset
    cor = df.corr()
    print(cor)

    # Correlation matrix heatmap
    plt.figure()
    sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
    plt.show()

    # Absolute correlation with output variable A (A itself is excluded)
    cor_target = abs(cor["A"]).drop("A")

    # Selecting highly correlated features
    best_features = cor_target[cor_target > 0.5]
    print(best_features)

(4) Statistical test or correlation measure to use, by variable type:

|Feature (rows) / Response (columns)|Continuous|Categorical|
|:-----:|:-----:|:-----:|
|Continuous|Pearson's Correlation|LDA|
|Categorical|ANOVA|Chi-Square|
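
Three cells of the table above map directly onto functions in `scipy.stats`. The sketch below runs each on synthetic data (all arrays are invented for illustration). The LDA cell is left out because it is a fitted model rather than a single test; scikit-learn provides it as `sklearn.discriminant_analysis.LinearDiscriminantAnalysis`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Continuous feature vs continuous response: Pearson's correlation
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
r, p_pearson = stats.pearsonr(x, y)

# Categorical feature vs continuous response: one-way ANOVA across the groups
group_a = rng.normal(loc=0.0, size=50)
group_b = rng.normal(loc=1.0, size=50)
group_c = rng.normal(loc=2.0, size=50)
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

# Categorical feature vs categorical response: chi-square test on a contingency table
table = np.array([[30, 10],
                  [10, 30]])
chi2_stat, p_chi2, dof, expected = stats.chi2_contingency(table)

print(r, p_pearson)
print(f_stat, p_anova)
print(chi2_stat, p_chi2, dof)
```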
 
**Reference:**

- [NumPy, SciPy, and Pandas: Correlation With Python](https://realpython.com/numpy-scipy-pandas-correlation-python/)

#### 5. Text vectors

(1) sklearn: TfidfVectorizer
    
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf_vec = TfidfVectorizer()
    
    text_tfidf = tfidf_vec.fit_transform(df['text'])

- Vocabulary words (word to column index): 


      tfidf_vec.vocabulary_


- Word weights: 


      text_tfidf[row].data


- Word indices: 


      text_tfidf[row].indices
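
To make the three attributes above concrete, here is a minimal sketch on a hypothetical three-document corpus (the sentences are invented; in the notes the text comes from `df['text']`). It pairs each column index in one row with its TF-IDF weight and maps the index back to the word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(docs)

# vocabulary_ maps each word to its column index in the sparse matrix
print(tfidf_vec.vocabulary_)

# For one row, .indices holds the column indices of the words present
# and .data holds the matching TF-IDF weights (same order as .indices)
row = 0
index_to_word = {idx: word for word, idx in tfidf_vec.vocabulary_.items()}
weights = {index_to_word[i]: w
           for i, w in zip(text_tfidf[row].indices, text_tfidf[row].data)}
print(weights)
```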

(2) Find the highest-weighted words in a document

    import pandas as pd

    # Reverse the key-value pairs in the vocabulary (column index -> word)
    vocab = {v : k for k, v in tfidf_vec.vocabulary_.items()}

    def return_weights(vocab, original_vocab, vector, vector_row, top_n):
        # Zip together the row indices and weights and pass it into dict function
        zipped = dict(zip(vector[vector_row].indices, vector[vector_row].data))
        
        # Let's transform that zipped dict into a series
        zipped_series = pd.Series({vocab[i] : zipped[i] for i in vector[vector_row].indices})
        
        # Let's sort the series to pull out the top n weighted words
        zipped_index = zipped_series.sort_values(ascending=False)[ : top_n].index
        
        return [original_vocab[i] for i in zipped_index]

    # Print out the vocabulary indices of the top 5 weighted words in document 0
    print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 0, 5))

(3) Collect the highly weighted words across all documents into a list

    def words_to_filter(vocab, original_vocab, vector, top_n):
        filter_list = []
        for i in range(0, vector.shape[0]):
            # Here we'll call the return_weights() function and extend the list we're creating
            filtered = return_weights(vocab, original_vocab, vector, i, top_n) 
            filter_list.extend(filtered)
            
        # Return the list in a set, so we don't get duplicate word indices
        return set(filter_list)

    # Call the function to get the list of word indices
    filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

    # By converting filtered_words back to a list, we can use it to filter the columns in the text vector
    filtered_text = text_tfidf[:, list(filtered_words)]


#### 6. Feature selection for regression problems using ML

(1) Wrapper method: Use an ML method to evaluate performance

a. Forward selection (LARS-least angle regression)

- Starts with no features, adds one at a time

Example:

    # Import modules (least angle regression with cross-val)
    from sklearn.linear_model import LarsCV

    # Drop feature suggested not important
    X = X.drop('col_not_important', axis=1)

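    # Instantiate (the `normalize` argument has been removed from recent
    # scikit-learn releases; standardize X beforehand if scaling is needed)
    lars_mod = LarsCV(cv=5)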

    # Fit
    feat_selector = lars_mod.fit(X, y)

    # Print r-squared score and estimated alpha (estimated regularization parameter)
    print(lars_mod.score(X, y))
    print(lars_mod.alpha_)

b. Backward elimination

- Starts with all features, eliminates one at a time
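
scikit-learn has no estimator named "backward elimination", but `SequentialFeatureSelector` with `direction="backward"` follows the same idea. The sketch below is a minimal example on synthetic data; the choice of `LinearRegression` and of keeping 3 features are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic regression data: 8 features, only 3 of them informative
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Backward elimination: start from all features and remove one per step,
# each time removing the feature whose absence hurts the CV score least
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="backward", cv=5
)
selector.fit(X, y)

# Boolean mask of the features that were kept
print(selector.get_support())
X_selected = selector.transform(X)
```

Setting `direction="forward"` on the same class gives a plain forward selection that starts from no features.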


c. Forward selection/backward elimination combination (bidirectional elimination)

d. Recursive feature elimination (RFECV)

Example: 
        
    # support vector regression estimator
    from sklearn.svm import SVR
    
    # recursive feature elimination with cross-val
    from sklearn.feature_selection import RFECV

    # Instantiate estimator and feature selector
    svr_mod = SVR(kernel="linear")
    feat_selector = RFECV(svr_mod, cv=5)

    # Fit
    feat_selector = feat_selector.fit(X, y)

    # Print support(boolean array of selected features) and ranking(feature ranking, selected=1)
    print(feat_selector.support_)
    print(feat_selector.ranking_)
    print(X.columns)

(2) Embedded method: Iterative model training to extract features

- Ridge (L2)


- Lasso (L1)


- ElasticNet: combines the L1 and L2 penalties; the `l1_ratio` parameter sets the mix between them.


a. regularization functions

- Lasso estimator: 

      sklearn.linear_model.Lasso

- Lasso estimator with cross-validation: 

      sklearn.linear_model.LassoCV

- Ridge estimator: 

      sklearn.linear_model.Ridge

- Ridge estimator with cross-validation: 

      sklearn.linear_model.RidgeCV

- ElasticNet estimator: 

      sklearn.linear_model.ElasticNet

- ElasticNet estimator with cross-validation: 

      sklearn.linear_model.ElasticNetCV

b. Examples

*Lasso:*

    # Import modules
    from sklearn.linear_model import Lasso, LassoCV
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.3)

    # Instantiate cross-validated lasso, fit
    lasso_cv = LassoCV(alphas=None, cv=10, max_iter=10000)
    lasso_cv.fit(X_train, y_train) 

    # Instantiate a lasso estimator passing the best alpha value from lasso_cv, fit, predict and print MSE
    lasso = Lasso(alpha = lasso_cv.alpha_)
    lasso.fit(X_train, y_train)
    print(mean_squared_error(y_true=y_test, y_pred=lasso.predict(X_test)))

*Ridge:* 

    # Import modules
    import numpy as np
    from sklearn.linear_model import Ridge, RidgeCV
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.3)

    # Instantiate cross-validated ridge, fit
    ridge_cv = RidgeCV(alphas=np.logspace(-6, 6, 13))
    ridge_cv.fit(X_train, y_train)

    # Instantiate ridge, fit, predict and print MSE
    ridge = Ridge(alpha = ridge_cv.alpha_)
    ridge.fit(X_train, y_train)
    print(mean_squared_error(y_true=y_test, y_pred=ridge.predict(X_test)))
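
*ElasticNet (sketch):*

The same pattern as the Lasso and Ridge examples should carry over to ElasticNet. The block below is a minimal sketch that uses synthetic data in place of the notes' `X` and `y`; the candidate `l1_ratio` values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the X and y used in the examples above
X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=10.0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=123, test_size=0.3)

# Cross-validate over both the penalty strength (alpha) and the L1/L2 mix (l1_ratio)
enet_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=10000)
enet_cv.fit(X_train, y_train)

# Refit a plain ElasticNet with the selected hyperparameters and report test MSE
enet = ElasticNet(alpha=enet_cv.alpha_, l1_ratio=enet_cv.l1_ratio_, max_iter=10000)
enet.fit(X_train, y_train)
print(mean_squared_error(y_true=y_test, y_pred=enet.predict(X_test)))
```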

(3) Feature importance: tree-based ML models

a. Random Forest

    # Import
    from sklearn.ensemble import RandomForestRegressor

    # Instantiate
    rf_mod = RandomForestRegressor(max_depth=2, random_state=123, n_estimators=100, oob_score=True)

    # Fit
    rf_mod.fit(X, y)

    # After model fit
    rf_mod.feature_importances_
    
b. Extra Trees

    # Import
    from sklearn.ensemble import ExtraTreesRegressor

    # Instantiate
    xt_mod = ExtraTreesRegressor()

    # Fit
    xt_mod.fit(X, y)

    # After model fit
    xt_mod.feature_importances_
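
The importances above are only scores; to turn them into an actual feature subset, one option is scikit-learn's `SelectFromModel`. The sketch below uses synthetic data, and the "above the mean importance" threshold is an assumption chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic regression data: 10 features, 3 of them informative
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=123)

rf_mod = RandomForestRegressor(n_estimators=100, random_state=123)
rf_mod.fit(X, y)

# Importances are non-negative and sum to 1 across features
importances = rf_mod.feature_importances_

# Keep the features whose importance is at least the mean importance
selector = SelectFromModel(rf_mod, prefit=True, threshold="mean")
X_selected = selector.transform(X)
print(selector.get_support())
print(X_selected.shape)
```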

### Dimensionality reduction

#### 1. What is dimensionality reduction?

- An unsupervised learning method (linear or non-linear)


- Combines or decomposes the feature space


- Shrinks the feature space through feature extraction


- After the linear transformation, the features of the new space are uncorrelated with one another


- Each principal component captures as much of the remaining variance as possible
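
The last two points can be checked numerically. The sketch below builds hypothetical data with two strongly correlated features, applies PCA, and prints the correlation matrix before and after the transformation together with the variance ratio of each component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: three features, the first two strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([
    base,
    base + rng.normal(scale=0.3, size=500),
    rng.normal(size=500),
])

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# Correlation between the original features vs. between the components
print(np.corrcoef(X, rowvar=False).round(2))
print(np.corrcoef(X_pca, rowvar=False).round(2))

# Variance captured by each component, in decreasing order
print(pca.explained_variance_ratio_)
```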


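#### 2. Dimensionality reduction vs feature selection

Similarity: both reduce the number of features in the dataset

Differences:

a. Dimensionality reduction forms linear combinations of the features and, in doing so, removes multicollinearity

b. Feature selection includes or removes features based on their relationship with the target variable (no feature transformation)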

#### 3. Why use dimensionality reduction?

(1) Curse of dimensionality: model performance degrades as the number of features grows (overfitting due to high dimensionality)

(2) Advantages of dimensionality reduction

- Speeds up ML training


- Principal components can be visualized


- Can improve model accuracy


#### 4. Dimensionality reduction methods

(1) Principal component analysis (PCA)

- Uses only the relationships among the features in X (the target y is not used)


- Calculated by finding principal axes


- Translates, rotates and scales data to the direction of the maximum variance


- Lower-dimensional projection of the data


(2) Singular value decomposition (SVD)

- Linear algebra and vector calculus


- Decomposes data matrix into three matrices


- Results in 'singular' values


- Total variance in the (centered) data is proportional to the sum of the squared singular values
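
The link between singular values and variance can be verified directly with NumPy. The sketch below uses random data (invented for illustration), centers it, decomposes it into the three matrices, and compares the sum of squared singular values divided by `n - 1` with the total variance and with what PCA reports.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

# Center the data, then decompose it: X_centered = U @ diag(s) @ Vt
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Total variance of the data equals the sum of squared singular values / (n - 1)
n_samples = X.shape[0]
total_variance = X_centered.var(axis=0, ddof=1).sum()
print(total_variance, (s ** 2).sum() / (n_samples - 1))

# The same per-component quantities are what PCA reports as explained variance
pca = PCA().fit(X)
print(pca.explained_variance_)
print(s ** 2 / (n_samples - 1))
```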


#### 5. Dimensionality reduction packages

- Principal component analysis: sklearn.decomposition.PCA


- Singular value decomposition: sklearn.decomposition.TruncatedSVD


- Fit and transform the data: PCA/SVD.fit_transform(X)


- Explained variance of the principal components: PCA/SVD.explained_variance_ratio_


Example 1: PCA

    # Import modules
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Feature matrix and target array
    X = df.drop('target', axis=1)
    y = df['target']

    # Apply PCA to the dataset, keeping the first 3 principal components
    pca = PCA(n_components=3)

    # Fit and transform
    df_pca = pca.fit_transform(X)

    # Look at the percentage of variance explained by the different components
    print(pca.explained_variance_ratio_)

    # Split the df_pca and the y labels into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(df_pca, y)

    # Fit knn to the training data (a default k-nearest-neighbors classifier
    # is assumed here, since the original does not define `knn`)
    knn = KNeighborsClassifier()
    knn.fit(X_train, y_train)

    # Score knn on the test data and print it out
    print(knn.score(X_test, y_test))
    
Example 2: SVD

    # Import module
    from sklearn.decomposition import TruncatedSVD

    # Feature matrix and target array
    X = df.drop('target', axis=1)
    y = df['target']

    # SVD
    svd = TruncatedSVD(n_components=3)

    # Fit and transform
    principalComponents = svd.fit_transform(X)

    # Print ratio of variance explained
    print(svd.explained_variance_ratio_)

#### 6. Visualizing dimensionality reduction

(1) Principal component analysis (PCA)

a. Use PCA to separate different classes

Example:

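    import matplotlib.pyplot as plt

    # df_PCA is assumed to hold the first two principal components plus the 'target' column
    fig, ax = plt.subplots(figsize=(8, 8))

    targets = [0, 1]
    colors = ['r', 'b']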

    # For loop to create plot
    for target, color in zip(targets, colors):
        indicesToKeep = df_PCA['target'] == target
        ax.scatter(df_PCA.loc[indicesToKeep, 'principal component 1'],   
                   df_PCA.loc[indicesToKeep, 'principal component 2'],
                   c = color, s = 50)
    # Legend
    ax.legend(targets)
    ax.grid()
    plt.show()

b. Visualize the principal components with a scree plot

Example:

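    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.decomposition import PCA

    # Remove target variable
    X = df.drop('target', axis=1)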

    # Instantiate
    pca = PCA(n_components=10)

    # Fit and transform
    principalComponents = pca.fit_transform(X)

    # List principal components names
    principal_components = ['PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10']

    # Create a DataFrame
    pca_df = pd.DataFrame({'Variance Explained': pca.explained_variance_ratio_, 'PC':principal_components})

    # Plot DataFrame
    sns.barplot(x='PC', y='Variance Explained', data=pca_df, color="c")
    plt.show()

    # Instantiate, fit and transform
    pca2 = PCA()
    principalComponents2 = pca2.fit_transform(X)

    # Assign variance explained
    var = pca2.explained_variance_ratio_

    # Plot cumulative variance
    cumulative_var = np.cumsum(var)*100
    plt.plot(cumulative_var,'k-o', markerfacecolor='None', markeredgecolor='k')
    plt.title('Principal Component Analysis', fontsize=12)
    plt.xlabel("Principal Component", fontsize=12)
    plt.ylabel("Cumulative Proportion of Variance Explained", fontsize=12)
    plt.show()
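
The cumulative curve is usually read to decide how many components to keep. The sketch below shows one way to do that without a plot, assuming synthetic data and a 95% target (both chosen for illustration): it finds the smallest number of components whose cumulative variance exceeds the target, and compares it with what `PCA` selects when `n_components` is given as a float.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic data stands in for the feature matrix X
X, _ = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

# Fit all components and find the smallest number exceeding 95% cumulative variance
pca_all = PCA(svd_solver="full").fit(X)
cumulative_var = np.cumsum(pca_all.explained_variance_ratio_)
n_needed = int(np.argmax(cumulative_var > 0.95)) + 1

# Passing a float in (0, 1) lets PCA choose that number of components itself
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X)
print(n_needed, pca_95.n_components_)
```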

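(2) t-distributed stochastic neighbor embedding (t-SNE)

- Builds a probability distribution over pairs of high-dimensional data points


- Embeds the high-dimensional data in a low-dimensional space for visualization (low-dimensional embedding)


Using the TSNE class: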

    # t-sne with data
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns
    from sklearn.manifold import TSNE

    df = pd.read_csv('dataset.csv')

    # Feature matrix
    X = df.drop('target', axis=1)

    tsne = TSNE(n_components=2, verbose=1, perplexity=40)
    tsne_results = tsne.fit_transform(X)

    df['t-SNE-PC-one'] = tsne_results[:, 0]
    df['t-SNE-PC-two'] = tsne_results[:, 1]

    # t-sne viz
    plt.figure(figsize=(16,10))
    sns.scatterplot(x="t-SNE-PC-one", y="t-SNE-PC-two", hue="target", 
                    palette=sns.color_palette(["grey","blue"]),
                    data=df, legend="full", alpha=0.3)
    plt.show()