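### Feature Engineering

#### 1. What is feature engineering?

- Creating new features from existing feature data to extract additional information


- Helps reveal the relationships between features


- The appropriate methods depend on the dataset (dataset-dependent)


- An effective way to improve a predictive model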


#### 2. Benefits of feature engineering

- Improves the predictive power of the learning algorithm


- Helps the machine learning model perform better


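#### 3. Data types in feature engineering

- Categorical data


- Numeric data (e.g., summary statistics, timestamps)


- Text data


(1) Ways to represent data in feature engineering:

a. Indicator variables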

- Threshold indicator


- Multiple features


- Groups of classes


b. Interaction features

- Sum/Difference/Product/Quotient


- Other mathematical combos


c. Feature representation

- Datetime stamps


- Dummy variables [(n - 1) features for n categories]
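
The representations listed above can be sketched on a small table. This is only an illustration: the housing columns (`bedrooms`, `bathrooms`, `sqft`, `price`) and the cutoffs are hypothetical and not part of the original notes.

```python
import pandas as pd

# Hypothetical housing data; every column name and cutoff here is illustrative
df = pd.DataFrame({
    "bedrooms": [1, 3, 4],
    "bathrooms": [1, 2, 3],
    "sqft": [500, 1500, 2400],
    "price": [100000, 330000, 600000],
})

# Threshold indicator: 1 when a numeric value crosses a cutoff
df["is_large"] = (df["sqft"] > 1000).astype(int)

# Indicator built from multiple features
df["family_home"] = ((df["bedrooms"] >= 3) & (df["bathrooms"] >= 2)).astype(int)

# Interaction features: a sum and a quotient of existing columns
df["rooms"] = df["bedrooms"] + df["bathrooms"]
df["price_per_sqft"] = df["price"] / df["sqft"]

print(df)
```

Whether any of these derived columns helps depends on the dataset, so they are best treated as candidates to validate rather than guaranteed improvements.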


(2) DataFrame data types

a. df.dtypes:

- object: string/mixed types


- int64: integer


- float64: float


- datetime64 (or timedelta): datetime


b. Converting a column's type

    df["A"] = df["A"].astype("float")
    print(df.dtypes)
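
As a minimal sketch of the conversion above, assuming a hypothetical column `A` that was read in as strings:

```python
import pandas as pd

# Hypothetical data: column "A" holds numbers stored as strings
df = pd.DataFrame({"A": ["1.5", "2", "3.25"], "B": [1, 2, 3]})
print(df.dtypes)

# Convert the string column to floats so arithmetic works on it
df["A"] = df["A"].astype("float")
print(df.dtypes)
print(df["A"].sum())
```

`astype` raises an error if any value cannot be parsed, which is usually the desired behavior for a first pass because it surfaces dirty values early.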

#### 4. Categorical features

(1) Binary variables

a. Using pandas

- apply() function:


      df["Class_enc"] = df["Class"].apply(lambda val: 1 if val == "y" else 0)

- replace() function:


      df["Class_enc"] = df["Class"].replace({'x': 0, 'y': 1})

b. Using scikit-learn

    from sklearn.preprocessing import LabelEncoder
    
    le = LabelEncoder()
    
    df["Class_enc"] = le.fit_transform(df["Class"])
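
To check that the pandas and scikit-learn approaches agree, the sketch below encodes the same hypothetical `Class` column three ways. It uses `map()` in place of `replace()` as a substitution of my own; both apply a dictionary lookup here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column with the values "x" and "y"
df = pd.DataFrame({"Class": ["x", "y", "y", "x"]})

# pandas: explicit rule with apply()
df["enc_apply"] = df["Class"].apply(lambda val: 1 if val == "y" else 0)

# pandas: dictionary lookup with map()
df["enc_map"] = df["Class"].map({"x": 0, "y": 1})

# scikit-learn: LabelEncoder assigns integers in sorted order of the labels
le = LabelEncoder()
df["enc_le"] = le.fit_transform(df["Class"])

print(le.classes_)
print(df)
```

Because `LabelEncoder` orders labels alphabetically, the integer it assigns to each class is not something you choose; `le.classes_` shows the mapping it learned.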

(2) Multi-class variables

a. One-hot encoding

    # get_dummies joins prefix and category with "_", so 'enc' yields columns like enc_x
    df_onehot = pd.get_dummies(df, columns=['Class'], prefix='enc')

b. Dummy encoding

    # drop_first=True drops the first category, leaving n - 1 columns
    df_dummy = pd.get_dummies(df, columns=['Class'], drop_first=True, prefix='enc')

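c. One-hot vs. dummy encoding

- One-hot encoding: one column per category, so each feature is easy to interpret, but the columns are perfectly collinear because any one of them can be derived from the others


- Dummy encoding: n - 1 columns that carry the same information without the redundant column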
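
The difference is easy to see on a toy column. This sketch assumes a hypothetical `Class` column with three categories.

```python
import pandas as pd

# Hypothetical column with three categories
df = pd.DataFrame({"Class": ["a", "b", "c", "a"]})

# One-hot encoding: one column per category
one_hot = pd.get_dummies(df, columns=["Class"], prefix="enc")

# Dummy encoding: the first category is dropped and implied by all zeros
dummy = pd.get_dummies(df, columns=["Class"], drop_first=True, prefix="enc")

print(list(one_hot.columns))
print(list(dummy.columns))

# Every one-hot row sums to 1, which is the source of the collinearity
row_sums = one_hot.sum(axis=1)
print(row_sums.tolist())
```

Dummy encoding is the usual choice for linear models with an intercept, while one-hot encoding is often kept when interpretability of each category matters more.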


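d. Grouping rare categories by their counts $\rightarrow$ limits the large number of columns that one-hot/dummy encoding would otherwise create
    
    counts = df['Class'].value_counts()
    
    # Rows whose category appears fewer than 5 times
    mask = df['Class'].isin(counts[counts < 5].index)
    
    # Assign through .loc; chained indexing (df['Class'][mask] = ...) may not modify df
    df.loc[mask, 'Class'] = 'Other'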
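
A small run of the same idea, assuming a hypothetical `Class` column in which two categories are rare:

```python
import pandas as pd

# Hypothetical data: "a" and "b" are common, "c" and "d" are rare
df = pd.DataFrame({"Class": ["a"] * 6 + ["b"] * 5 + ["c"] * 2 + ["d"]})

counts = df["Class"].value_counts()

# Flag rows whose category occurs fewer than 5 times
mask = df["Class"].isin(counts[counts < 5].index)

# Collapse the rare categories into a single "Other" label
df.loc[mask, "Class"] = "Other"

print(df["Class"].value_counts())
```

The cutoff of 5 is arbitrary here; in practice it is a tuning choice that trades column count against lost detail.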

**References**: 

- [Tutorial: (Robust) One Hot Encoding in Python](https://blog.cambridgespark.com/robust-one-hot-encoding-in-python-3e29bfcec77e)


- [Handling Categorical Data in Python](https://www.datacamp.com/community/tutorials/categorical-data)

#### 5. Numeric features

(1) Summary statistics: 

Example: computing the mean

    columns = ["day1", "day2", "day3"]
    
    # Row-wise mean across the selected columns
    df["mean"] = df[columns].mean(axis=1)

(2) Converting datetime formats:

    df["date_converted"] = pd.to_datetime(df["date"])
    
    # Extract the month with the .dt accessor
    df["month"] = df["date_converted"].dt.month
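
Once a column is parsed as datetime, several features can be derived from it. The sketch below uses hypothetical dates, and the extra `day_of_week` column is an addition for illustration.

```python
import pandas as pd

# Hypothetical date strings
df = pd.DataFrame({"date": ["2021-01-15", "2021-06-30", "2021-12-25"]})

# Parse the strings into datetime64 values
df["date_converted"] = pd.to_datetime(df["date"])

# Derive features through the .dt accessor
df["month"] = df["date_converted"].dt.month
df["day_of_week"] = df["date_converted"].dt.dayofweek  # Monday is 0

print(df)
```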
    
(3) Binarizing numeric variables:

    df['Binary_counts'] = 0
    
    df.loc[df['Number_of_counts'] > 0, 'Binary_counts'] = 1
    
(4) Binning numeric variables:

    import numpy as np

    # Bins are right-inclusive: (-inf, 0], (0, 5], (5, inf)
    df['Binned_Group'] = pd.cut(df['Number_of_counts'], bins=[-np.inf, 0, 5, np.inf], labels=[1, 2, 3])
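
Binarizing and binning can be compared side by side on a few values. This is a sketch with made-up counts, and it uses a boolean comparison for the binary column as an equivalent shortcut.

```python
import numpy as np
import pandas as pd

# Hypothetical count data
df = pd.DataFrame({"Number_of_counts": [0, 1, 5, 6, 20]})

# Binarize: 1 if there is at least one count, else 0
df["Binary_counts"] = (df["Number_of_counts"] > 0).astype(int)

# Bin: pd.cut intervals are right-inclusive by default,
# so 0 falls in bin 1, 1..5 in bin 2, and anything above 5 in bin 3
df["Binned_Group"] = pd.cut(
    df["Number_of_counts"], bins=[-np.inf, 0, 5, np.inf], labels=[1, 2, 3]
)

print(df)
```

Note that the boundary value 5 lands in the second bin; pass `right=False` to `pd.cut` if left-inclusive intervals are wanted instead.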

#### 6. Text data

(1) Finding and handling stray characters

Example: £ 5000 $\rightarrow$ 5000.00

    # Convert the column to numeric values
    coerced_vals = pd.to_numeric(df['dollars'], errors='coerce')

    # Find the index of missing values
    idx = coerced_vals.isna()

    # Print the relevant rows
    print(df['dollars'][idx].head())
    
    # Remove the stray currency symbol and any surrounding whitespace
    df['dollars'] = df['dollars'].str.replace('£', '', regex=False).str.strip()
    df['dollars'] = df['dollars'].astype('float')
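
The find-then-fix workflow above can be run end to end on a toy column. The values in `dollars` are hypothetical.

```python
import pandas as pd

# Hypothetical column where one value carries a stray currency symbol
df = pd.DataFrame({"dollars": ["1200", "£ 5000", "30.5"]})

# Values that cannot be parsed become NaN instead of raising an error
coerced = pd.to_numeric(df["dollars"], errors="coerce")

# Inspect the rows that failed to parse
bad_rows = df.loc[coerced.isna(), "dollars"]
print(bad_rows)

# Remove the stray symbol, trim whitespace, then convert
df["dollars"] = (
    df["dollars"].str.replace("£", "", regex=False).str.strip().astype("float")
)
print(df["dollars"].tolist())
```

Inspecting the failed rows first matters: it shows which characters are actually present before you decide what to strip.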

(2) Standardizing text, splitting it into words, and counting
    
    # Standardize the text
    df['text'] = df['text'].str.lower()
    print(df['text'][0])

    # Length of text
    df['char_cnt'] = df['text'].str.len()
    print(df['char_cnt'].head())

    # Word splits
    df['word_splits'] = df['text'].str.split()
    df['word_splits'].head(1)

    # Word counts
    df['word_cnt'] = df['text'].str.split().str.len()
    print(df['word_cnt'].head())

    # Average length of word
    df['avg_word_len'] = df['char_cnt'] / df['word_cnt']
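
A worked run on two short strings, as a sketch with hypothetical text:

```python
import pandas as pd

# Hypothetical text column
df = pd.DataFrame({"text": ["Feature Engineering helps", "Clean data"]})

# Standardize to lowercase
df["text"] = df["text"].str.lower()

# Character count (spaces included) and word count
df["char_cnt"] = df["text"].str.len()
df["word_cnt"] = df["text"].str.split().str.len()

# Average characters per word
df["avg_word_len"] = df["char_cnt"] / df["word_cnt"]

print(df)
```

Because `char_cnt` includes spaces, `avg_word_len` slightly overstates the true average word length; that is harmless when the value is only used as a relative feature.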

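(3) Processing text with regular expressions

a. Removing non-letter characters

    # Removing unwanted characters
    # (regex=True is required in pandas 2.0+, where str.replace defaults to literal matching)
    df['text'] = df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)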

- [a-zA-Z] : All letter characters


- [^a-zA-Z] : All non letter characters


b. Extracting numeric values with the re module

    import re
    string = "temperature: 80.0 F"

    # Extract the first decimal number from a string
    def return_temp(string):
        pattern = re.compile(r"\d+\.\d+")
        
        # search() scans the whole string; match() would only test the start
        temp = pattern.search(string)
        
        # If a match is found, group(0) holds the matched text
        if temp is not None:
            return float(temp.group(0))
            
    # Apply the function to the string column and take a look at both columns
    df["string_val"] = df["string"].apply(lambda row: return_temp(row))
    print(df[["string", "string_val"]].head())
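
The sketch below exercises both regex steps on hypothetical strings and shows why `re.search` is needed when the number is not at the start of the string.

```python
import re
import pandas as pd

def return_temp(string):
    """Return the first decimal number in the string, or None if absent."""
    found = re.search(r"\d+\.\d+", string)
    return float(found.group(0)) if found else None

# Hypothetical strings
df = pd.DataFrame(
    {"string": ["temperature: 80.0 F", "no reading", "about 71.5 degrees"]}
)
df["string_val"] = df["string"].apply(return_temp)
print(df)

# re.match anchors at the start of the string, so it finds nothing here
print(re.match(r"\d+\.\d+", "temperature: 80.0 F"))

# Replacing every non-letter with a space
cleaned = pd.Series(["Hello, World 2!"]).str.replace("[^a-zA-Z]", " ", regex=True)
print(repr(cleaned.iloc[0]))
```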

(4) Word count representation

    # Initializing the vectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    cv = CountVectorizer()

    # Specifying the vectorizer
    # min_df : minimum fraction of documents the word must occur in
    # max_df : maximum fraction of documents the word can occur in
    cv = CountVectorizer(min_df=0.1, max_df=0.9)

    # Fitting and transforming the vectorizer
    cv.fit(df['text_clean'])
    cv_transformed = cv.transform(df['text_clean'])

    # Transforming to toarray()
    cv_transformed.toarray()

    # Getting the features
    # (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
    feature_names = cv.get_feature_names_out()

    # Create cv_df dataframe
    cv_df = pd.DataFrame(cv_transformed.toarray(), columns=feature_names).add_prefix('Counts_')

    # Updating DataFrame
    df = pd.concat([df, cv_df], axis=1, sort=False)

*Issue with word count representation:*

Counts may be very large for common words, which provide little value as distinguishing features.
 
(5) tf/idf representation

a. Vectorizing text

- term frequency (tf)


- inverse document frequency (idf)


$$\text{TF-IDF} = \frac{\text{count of word occurrences}}{\text{total words in document}} \times \log\left(\frac{\text{total number of docs}}{\text{number of docs word is in}}\right)$$
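
A worked example can make the weighting concrete. This sketch uses the common textbook form, term frequency multiplied by log(total docs / docs containing the term), on a made-up tokenized corpus; the helper names `tf`, `idf`, and `tf_idf` are hypothetical.

```python
import math

# Hypothetical tokenized corpus
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

def tf(term, doc):
    """Share of the document's words that are this term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (total docs / docs containing the term)."""
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" occurs in every document, so its weight collapses to zero
print(tf_idf("the", docs[0], docs))

# "dog" occurs in only one document, so it gets a positive weight
print(tf_idf("dog", docs[1], docs))
```

scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this plain version even though the ranking intuition is the same.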

b. Text feature extraction class: TfidfVectorizer

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Max features and stopwords
    # max_features : Maximum number of columns created from TF-IDF
    # stop_words : List of common words to omit e.g. "and", "the" etc.
    tv = TfidfVectorizer(max_features=100, stop_words='english')

    # Fitting and transforming the text
    tv_transformed = tv.fit_transform(df['text'])

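    # Putting it all together
    # (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
    tv_df = pd.DataFrame(tv_transformed.toarray(), columns=tv.get_feature_names_out()).add_prefix('TFIDF_')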
    df = pd.concat([df, tv_df], axis=1, sort=False)

    # Inspecting the transforms
    examine_row = tv_df.iloc[0]
    print(examine_row.sort_values(ascending=False))

    # Applying the already-fitted vectorizer to new data (transform only, no refit)
    new_tv_transformed = tv.transform(new_df['text'])
    new_tv_df = pd.DataFrame(new_tv_transformed.toarray(), columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
    new_df = pd.concat([new_df, new_tv_df], axis=1, sort=False)

c. Text classification with tf/idf vectors

i. Naive Bayes Classifier:

- Assumes the features are independent of one another


- Efficient in high-dimensional feature spaces


ii. Example: 

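      from sklearn.model_selection import train_test_split
      from sklearn.naive_bayes import GaussianNB

      # Run the toarray() method on the tf/idf vector
      # Split the dataset according to the class distribution of category
      y = df["category"]
      X_train, X_test, y_train, y_test = train_test_split(tv_transformed.toarray(), y, stratify=y)

      # Fit the model to the training data
      # (GaussianNB is an assumption; the notes do not say which Naive Bayes variant nb is)
      nb = GaussianNB()
      nb.fit(X_train, y_train)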

      # Print out the model's accuracy
      print(nb.score(X_test, y_test))
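
For a self-contained version of the classification steps, the sketch below builds a tiny made-up corpus with two categories. The texts, the labels, and the choice of `GaussianNB` are all assumptions for illustration, and with this little data the accuracy figure is not meaningful.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical labelled corpus: six sports and six cooking sentences
df = pd.DataFrame({
    "text": [
        "the team won the football match", "the striker scored a late goal",
        "the coach praised the players", "fans cheered at the stadium",
        "the referee stopped the match", "the goalkeeper saved a penalty",
        "bake the bread in the oven", "chop the onions and garlic",
        "simmer the sauce with herbs", "whisk the eggs and sugar",
        "season the soup with salt", "roast the chicken with butter",
    ],
    "category": ["sports"] * 6 + ["cooking"] * 6,
})

# Vectorize the text; GaussianNB needs a dense array
tv = TfidfVectorizer(stop_words="english")
X = tv.fit_transform(df["text"]).toarray()
y = df["category"]

# Stratify so both classes keep their proportions in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

nb = GaussianNB()
nb.fit(X_train, y_train)

acc = nb.score(X_test, y_test)
print(acc)
```

In a real project the vectorizer should be fitted on the training split only, so that vocabulary and IDF weights do not leak information from the test set.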
      
(6) Bag of words and N-grams

a. Bag of words

- A collection of words with no regard for order or grammar


b. N-grams

- Uses ordered sequences of words to represent text that carries meaning


Example: 

- bigrams: Sequences of two consecutive words


- trigrams: Sequences of three consecutive words


c. Processing bigrams with tf/idf vectors

    # ngram_range = (min, max): minimum and maximum length of n-grams
    tv_bi_gram_vec = TfidfVectorizer(ngram_range = (2, 2))

    # Fit and apply bigram vectorizer
    tv_bi_gram = tv_bi_gram_vec.fit_transform(df['text'])

    # Print the bigram features
    # (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
    print(tv_bi_gram_vec.get_feature_names_out())

    # Create a DataFrame with the bigram tf/idf features
    tv_df = pd.DataFrame(tv_bi_gram.toarray(), columns=tv_bi_gram_vec.get_feature_names_out()).add_prefix('Counts_')
    
    # Finding the most common bigrams
    tv_sums = tv_df.sum()
    print(tv_sums.sort_values(ascending=False).head())