### Feature Engineering

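#### 1. What is feature engineering?

- Creating new features from existing feature data to extract additional information


- Helps reveal relationships between features


- The appropriate techniques depend on the dataset (dataset-dependent)


- An effective way to improve a predictive model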


#### 2. Benefits of feature engineering

- Increases the predictive power of the learning algorithm


- Helps machine learning models perform better


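#### 3. Types of data in feature engineering

- Categorical data


- Numerical data (e.g., summary statistics, timestamps)


- Text data


**Ways to represent engineered features:**

(1) Indicator variables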

- Threshold indicator


- Multiple features


- Special events


- Groups of classes
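
As a rough illustration of the four kinds of indicator variables listed above, the sketch below builds one of each with pandas. The DataFrame, its column names, the cutoff of 21, and the event date are all invented for this example.

```python
import pandas as pd

# Hypothetical data; every column name and value here is made up.
df = pd.DataFrame({
    "age": [18, 25, 40, 21],
    "bedrooms": [2, 1, 2, 3],
    "bathrooms": [2, 1, 1, 2],
    "sale_date": pd.to_datetime(
        ["2023-11-24", "2023-11-25", "2023-12-01", "2023-11-24"]
    ),
    "property_type": ["condo", "house", "apartment", "house"],
})

# Threshold indicator: 1 when a numeric value reaches a cutoff (assumed: 21).
df["age_ge_21"] = (df["age"] >= 21).astype(int)

# Multiple features: 1 only when several conditions hold at the same time.
df["two_bed_two_bath"] = (
    (df["bedrooms"] == 2) & (df["bathrooms"] == 2)
).astype(int)

# Special event: 1 for rows that fall on a particular date (assumed date).
df["event_day"] = (df["sale_date"] == pd.Timestamp("2023-11-24")).astype(int)

# Groups of classes: 1 when the category belongs to a chosen group.
df["is_multi_unit"] = df["property_type"].isin(["condo", "apartment"]).astype(int)

print(df[["age_ge_21", "two_bed_two_bath", "event_day", "is_multi_unit"]])
```

Each indicator is a boolean mask cast to 0/1, so the same pattern extends to any condition that can be written as a comparison.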


(2) Interaction features

- Sum/Difference/Product/Quotient


- Other mathematical combos
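
A minimal sketch of the arithmetic combinations listed above, using pandas. The `width` and `height` columns are hypothetical and were chosen only because their sum, difference, product, and quotient are easy to interpret.

```python
import pandas as pd

# Hypothetical measurements; the column names are invented for this example.
df = pd.DataFrame({
    "width": [2.0, 3.0, 4.0],
    "height": [1.0, 3.0, 2.0],
})

df["half_perimeter"] = df["width"] + df["height"]  # sum
df["width_excess"] = df["width"] - df["height"]    # difference
df["area"] = df["width"] * df["height"]            # product
df["aspect_ratio"] = df["width"] / df["height"]    # quotient

print(df)
```

Quotient features need some care in practice: a zero in the denominator column produces `inf` or `NaN`, so such rows usually have to be handled before modeling.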


(3) Feature representation

- Datetime stamps


- Transform a categorical feature with k levels into dummy variables [(k - 1) binary columns]
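
For the datetime case, one possible sketch is to split a timestamp into several simpler columns with the pandas `.dt` accessor. The `timestamp` column and the derived column names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical timestamps stored as strings.
df = pd.DataFrame({"timestamp": ["2021-03-06 08:30:00", "2021-12-27 17:45:00"]})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Break the timestamp into components a model can use directly.
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6

# Derived indicator: Saturday (5) or Sunday (6).
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)

print(df)
```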


#### 4. Categorical features

(1) Binary variables: 0 or 1

a. Using pandas

- apply() function:


      data["Class_enc"] = data["Class"].apply(lambda val: 1 if val == "y" else 0)

- replace() function:


      data["Class_enc"] = data["Class"].replace({'x': 0, 'y': 1})

b. Using scikit-learn

    from sklearn.preprocessing import LabelEncoder
    
    le = LabelEncoder()
    
    data["Class_enc"] = le.fit_transform(data["Class"])
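
It can help to check which integer `LabelEncoder` assigned to which label. The sketch below uses a small made-up `Class` column; `classes_` and `inverse_transform` are part of the scikit-learn API, while the data is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column with the labels "x" and "y".
data = pd.DataFrame({"Class": ["y", "x", "y", "x"]})

le = LabelEncoder()
data["Class_enc"] = le.fit_transform(data["Class"])

# classes_ stores the labels in sorted order; the position is the assigned code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
restored = le.inverse_transform(data["Class_enc"])
print(list(restored))
```

Because the codes follow sorted label order, they are not necessarily the codes you would pick by hand; the `replace()` approach is the safer choice when a specific mapping matters.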

(2) Multi-class variables: one-hot encoding

- get_dummies() function: 


      data_enc = pd.get_dummies(data=data)
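
Called without further arguments, `get_dummies()` encodes every object or categorical column. The sketch below, on an invented `color` column, restricts the encoding with `columns=` and shows how `drop_first=True` yields the (k - 1)-column form mentioned earlier.

```python
import pandas as pd

# Hypothetical data: one categorical column with k = 3 levels, one numeric column.
data = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size_cm": [10, 12, 9, 11],
})

# Encode only the listed column: 3 levels give 3 dummy columns.
full = pd.get_dummies(data, columns=["color"])

# drop_first=True drops the first level ("blue"), leaving k - 1 = 2 columns.
reduced = pd.get_dummies(data, columns=["color"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level avoids perfectly collinear dummy columns, which matters mostly for linear models; tree-based models usually work fine with all k columns.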

**References**: 

- [Tutorial: (Robust) One Hot Encoding in Python](https://blog.cambridgespark.com/robust-one-hot-encoding-in-python-3e29bfcec77e)


- [Handling Categorical Data in Python](https://www.datacamp.com/community/tutorials/categorical-data)

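#### 5. Numerical features

(1) Summary statistics:

Example: computing a row-wise mean

    columns = ["day1", "day2", "day3"]
    
    # Average the selected columns across each row
    df["mean"] = df[columns].mean(axis=1)

(2) Converting datetime formats:

    # Parse the date strings into datetime values
    df["date_converted"] = pd.to_datetime(df["date"])
    
    # Extract the month (1-12) with the .dt accessor
    df["month"] = df["date_converted"].dt.month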

#### 6. Text data

(1) Extracting values from strings with regular expressions

    import re

    # Pattern for a decimal number, e.g. the 75.6 in "temperature:75.6 F"
    pattern = re.compile(r"\d+\.\d+")

    def return_temp(string):
        # Search the whole string for the pattern; re.match would only
        # test the start of the string and miss a number further in
        temp = pattern.search(string)

        # If a match is found, use group(0) to return the found value
        if temp is not None:
            return float(temp.group(0))
        return None

    # Apply the function to the string column and take a look at both columns
    df["string_val"] = df["string"].apply(return_temp)
    print(df[["string", "string_val"]].head())
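
The choice between `re.match` and `re.search` matters for strings like the one above, where the number is not at the start. A short standalone check, using only the standard library:

```python
import re

pattern = re.compile(r"\d+\.\d+")
text = "temperature:75.6 F"

# match() only succeeds if the pattern matches at position 0 of the string.
at_start = pattern.match(text)
print(at_start)

# search() scans forward and returns the first match anywhere in the string.
found = pattern.search(text)
print(found.group(0))
```

Here `match()` returns `None` because the string starts with letters, while `search()` finds the decimal number.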

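(2) Extracting text features with tf-idf

a. Vectorizing text

- term frequency (tf): how often a term occurs within a document


- inverse document frequency (idf): down-weights terms that occur in many documents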
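
To make the two quantities concrete, here is a small standard-library sketch that uses one common textbook definition: tf as a relative frequency and idf as log(N / df). The corpus and the helper names `tf`, `idf`, and `tfidf` are invented for this example. scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus invented for this example; each document is already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
]

def tf(term, doc):
    """Relative frequency of `term` within a single document."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """log(N / df), where df counts the documents that contain `term`.

    Assumes `term` occurs in at least one document (df > 0).
    """
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)

# "the" appears in every document, so its idf is log(1) = 0.
print(tfidf("the", docs[0], docs))

# "dog" appears in only one document, so it gets a positive weight.
print(tfidf("dog", docs[1], docs))
```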


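b. Text vectorizer class: TfidfVectorizer

    from sklearn.feature_extraction.text import TfidfVectorizer

    # documents: an iterable of raw text strings, one per document
    tfidf_vec = TfidfVectorizer()

    # Learn the vocabulary and idf weights, then return a sparse tf-idf matrix
    text_tfidf = tfidf_vec.fit_transform(documents)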
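
To see what the fitted vectorizer holds, the sketch below runs it on three made-up sentences and inspects the result. `vocabulary_` and `idf_` are attributes of a fitted `TfidfVectorizer`; the `documents` list is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: three short documents.
documents = [
    "the cat sat",
    "the dog barked",
    "the cat and the dog",
]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

# Sparse matrix: one row per document, one column per vocabulary term.
print(text_tfidf.shape)

# vocabulary_ maps each term to its column index.
terms = sorted(tfidf_vec.vocabulary_)
print(terms)

# idf_ holds the learned idf weight per column; "the" occurs in every
# document, so it receives the smallest weight.
the_idf = tfidf_vec.idf_[tfidf_vec.vocabulary_["the"]]
print(the_idf, tfidf_vec.idf_.min())
```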

c. Text classification with tf-idf vectors

i. Naive Bayes classifier:

- Assumes the features are independent of one another given the class


- Efficient in high-dimensional feature spaces


ii. Example:

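      from sklearn.model_selection import train_test_split
      from sklearn.naive_bayes import GaussianNB

      # Assumption: nb is not defined in these notes; GaussianNB is used here
      # because it accepts the dense array produced by toarray()
      nb = GaussianNB()

      # Run the toarray() method on the tf/idf vector
      # Split the dataset according to the class distribution of category
      y = df["category"]
      X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)

      # Fit the model to the training data
      nb.fit(X_train, y_train)

      # Print out the model's accuracy
      print(nb.score(X_test, y_test))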
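
For a version that runs from start to finish, the sketch below puts the steps together on a tiny invented dataset. Two choices are this sketch's own rather than taken from the notes: it uses `MultinomialNB`, which accepts the sparse tf-idf matrix directly, and it splits the raw text before vectorizing so that the vocabulary is learned from the training rows only. The `text` column, the sentences, and `random_state=0` are assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled corpus: four documents per category.
df = pd.DataFrame({
    "text": [
        "great goal in the final match",
        "the team won the cup",
        "coach praises the striker",
        "a late penalty decided the game",
        "new phone has a faster chip",
        "the laptop battery lasts longer",
        "software update fixes the bug",
        "chip maker unveils new processor",
    ],
    "category": ["sport"] * 4 + ["tech"] * 4,
})

# Stratified split keeps the class proportions equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"], stratify=df["category"], random_state=0
)

# Fit the vectorizer on training text only, then reuse it for the test text.
tfidf_vec = TfidfVectorizer()
X_train_tfidf = tfidf_vec.fit_transform(X_train)
X_test_tfidf = tfidf_vec.transform(X_test)

# Train the classifier and report accuracy on the held-out documents.
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)
score = nb.score(X_test_tfidf, y_test)
print(score)
```

With only eight documents the accuracy figure is not meaningful; the point is the order of the steps: split, fit the vectorizer, transform, fit the model, score.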