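# Feature Engineering Overview

## Purpose of This Project

This project introduces feature engineering techniques that are important for improving machine learning performance, together with implementation examples.  
The main focus is feature engineering, but feature selection is also covered in part.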

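## What Is Feature Engineering?

In machine learning, a feature is a measurable variable that describes the object under analysis. In a dataset, features are usually represented as columns.

The quality of the features included in a dataset affects the accuracy of the machine learning model and, in turn, strongly affects the quality of the insights gained from applying machine learning.

Feature selection and feature engineering are carried out to improve the quality of a dataset.  
Feature selection is the process of focusing on the features relevant to the analysis target and removing irrelevant ones. Feature engineering is the construction of new features from existing ones and their addition to the dataset.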
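
The difference between the two operations can be sketched on a toy table. This is a minimal illustration, not part of the experiment in this notebook: the column names (`height_m`, `weight_kg`, `record_id`, `bmi`) and all values are hypothetical, and `record_id` is simply assumed to be irrelevant to the target.

```python
import pandas as pd

# Hypothetical toy dataset: every column name and value is made up for illustration.
df = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.82],
    "weight_kg": [55.0, 72.0, 90.0],
    "record_id": [101, 102, 103],  # an identifier, assumed unrelated to the target
})

# Feature engineering: build a new feature (BMI) from existing columns.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Feature selection: keep the relevant features and drop the irrelevant one.
selected = df.drop(columns=["record_id"])

print(selected.columns.tolist())
```

Feature engineering adds a column while feature selection removes one, so the two are complementary and are often applied in sequence.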

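## A Concrete Example of Feature Selection and Feature Engineering

Details of the feature engineering techniques are described in other notebooks; this section gives a concrete example that illustrates the difference between feature selection and feature engineering.

The example uses the Flood Modeling Dataset and applies tsfresh-based feature selection and feature engineering to the SVR Optimised setting from the paper [Time Series Extrinsic Regression](https://arxiv.org/abs/2006.12672).

As in the paper, the hyperparameters are chosen by a GridSearch with 3-fold cross-validation over the parameters below, and the best model is adopted.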

|Parameters|Values|
|:--|:--|
|Kernel|RBF, Sigmoid|
|gamma|0.001, 0.01, 0.1, 1|
|C|0.1, 1, 10, 100|
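
To get a feel for the cost of this search, the grid can be enumerated with the standard library alone. The sketch below only counts candidates and fits; it assumes that an exhaustive grid search trains one model per parameter combination per fold, and it ignores the final refit of the best model on the full training set that `GridSearchCV` performs by default.

```python
from itertools import product

# Same grid as the table above.
param_grid = {
    "kernel": ["rbf", "sigmoid"],
    "gamma": [0.001, 0.01, 0.1, 1],
    "C": [0.1, 1, 10, 100],
}
n_folds = 3

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)    # 2 * 4 * 4 combinations
n_fits = n_candidates * n_folds   # one fit per candidate per fold

print(n_candidates, n_fits)
```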


### Implementation Example

In [1]:
# --- Automatically reload local modules when they are updated ---
%load_ext autoreload
%autoreload 2

In [2]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
from tsfresh import extract_features
from tsfresh.feature_selection.significance_tests import target_real_feature_real_test
from lib.dataloader.flood_modeling import load_flood_modeling

#### Downloading the Dataset

In [3]:
if (not os.path.exists("flood_modeling_datasets")):
    !mkdir -p "flood_modeling_datasets" ; \
        cd flood_modeling_datasets ; \
        wget "https://zenodo.org/record/3902694/files/FloodModeling1_TEST.ts" ; \
        wget "https://zenodo.org/record/3902694/files/FloodModeling1_TRAIN.ts" ; \
        ls
else:
    print('[INFO] Dataset flood_modeling_datasets already exists')

--2021-09-05 07:37:46--  https://zenodo.org/record/3902694/files/FloodModeling1_TEST.ts
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 380100 (371K) [application/octet-stream]
Saving to: ‘FloodModeling1_TEST.ts’


2021-09-05 07:37:49 (353 KB/s) - ‘FloodModeling1_TEST.ts’ saved [380100/380100]

--2021-09-05 07:37:49--  https://zenodo.org/record/3902694/files/FloodModeling1_TRAIN.ts
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 882160 (861K) [application/octet-stream]
Saving to: ‘FloodModeling1_TRAIN.ts’


2021-09-05 07:37:52 (526 KB/s) - ‘FloodModeling1_TRAIN.ts’ saved [882160/882160]

FloodModeling1_TEST.ts	FloodModeling1_TRAIN.ts


In [4]:
train_ts = os.path.join('flood_modeling_datasets', 'FloodModeling1_TRAIN.ts')
test_ts = os.path.join('flood_modeling_datasets', 'FloodModeling1_TEST.ts')
x_train, y_train, x_test, y_test = load_flood_modeling(train_ts, test_ts)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(471, 266)
(471,)
(202, 266)
(202,)


#### Training the Model with 3-Fold Cross-Validation and GridSearch

In [5]:
def rmse(y_true, y_pred):
    # Root mean squared error (lower is better)
    return np.sqrt(mean_squared_error(y_true, y_pred))

params = {
    'kernel': ['rbf', 'sigmoid'],
    'gamma': [0.001, 0.01, 0.1, 1],
    'C': [0.1, 1, 10, 100]
}
model_svr = GridSearchCV(
    svm.SVR(),
    params,
    cv=KFold(n_splits=3, shuffle=True, random_state=1234),
    scoring=make_scorer(rmse, greater_is_better=False))
model_svr.fit(x_train, y_train)

print('[INFO] Best params: {}'.format(model_svr.best_params_))
print('[INFO] Best score: {}'.format(-model_svr.best_score_))

[INFO] Best params: {'C': 0.1, 'gamma': 0.01, 'kernel': 'sigmoid'}
[INFO] Best score: 0.04293317276271991
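
The cell above prints `-model_svr.best_score_` because a scorer created with `greater_is_better=False` sign-flips the metric, so that the search can always maximize its score. The following sketch mimics that convention with plain NumPy; the target and prediction values are illustrative only and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error; lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only, not taken from the dataset.
y_true = [0.0, 1.0, 2.0]
good_pred = [0.0, 1.0, 2.0]   # perfect prediction
bad_pred = [1.0, 2.0, 3.0]    # off by 1 everywhere

# A "greater is better" search maximizes the negated error,
# so the reported best score must be negated again to get the RMSE.
neg_scores = {"good": -rmse(y_true, good_pred), "bad": -rmse(y_true, bad_pred)}
best = max(neg_scores, key=neg_scores.get)

print(best, -neg_scores[best])
```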


#### Evaluating on the Test Data

In [6]:
prediction = model_svr.predict(x_test)
print(rmse(y_test, prediction))

0.046303583075482053


The paper [Time Series Extrinsic Regression](https://arxiv.org/abs/2006.12672) reports RMSE=0.05 for this setting, so the test RMSE of about 0.046 obtained here reproduces that result.

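#### Extracting Features from the Time Series with tsfresh (Feature Engineering) and Training

To extract features with tsfresh, the time series data is first reshaped into tidy (long-format) data.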
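
The reshaping step is easier to follow on a tiny array. The sketch below uses a hypothetical input of 2 series with 3 time steps each and applies the same transpose-then-`melt` pattern, with the same `sample` and `A` column names as the notebook. It assumes that the row order within each `sample` id carries the time order, since no explicit time column is created.

```python
import numpy as np
import pandas as pd

# Hypothetical toy input: 2 series (rows) of 3 time steps (columns) each.
x = np.array([[0.1, 0.2, 0.3],
              [1.0, 2.0, 3.0]])

# Transpose so that each column is one series, then stack the columns.
# Each output row is one observation: (series id, value).
long_df = pd.DataFrame(x.T).melt(var_name="sample", value_name="A")

print(long_df)
```

The result has one row per observation (number of series times number of time steps), which is why the training table in the notebook has 471 × 266 = 125286 rows.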

In [7]:
df_x_train = pd.DataFrame(x_train.T)
df_x_train_melt = df_x_train.melt(var_name='sample', value_name='A')
print(df_x_train_melt.shape)
df_x_train_melt.head()

(125286, 2)


Unnamed: 0,sample,A
0,0,0.05801
1,0,0.104612
2,0,0.147225
3,0,0.178263
4,0,0.191615


In [8]:
df_x_train_melt_ef = extract_features(df_x_train_melt, column_id='sample')
print(df_x_train_melt_ef.shape)
df_x_train_melt_ef.dropna(axis=1, inplace=True)
print(df_x_train_melt_ef.shape)
df_x_train_melt_ef.head()

Feature Extraction: 100%|██████████| 40/40 [00:07<00:00,  5.49it/s]


(471, 787)
(471, 775)


Unnamed: 0,A__variance_larger_than_standard_deviation,A__has_duplicate_max,A__has_duplicate_min,A__has_duplicate,A__sum_values,A__abs_energy,A__mean_abs_change,A__mean_change,A__mean_second_derivative_central,A__median,...,A__fourier_entropy__bins_2,A__fourier_entropy__bins_3,A__fourier_entropy__bins_5,A__fourier_entropy__bins_10,A__fourier_entropy__bins_100,A__permutation_entropy__dimension_3__tau_1,A__permutation_entropy__dimension_4__tau_1,A__permutation_entropy__dimension_5__tau_1,A__permutation_entropy__dimension_6__tau_1,A__permutation_entropy__dimension_7__tau_1
0,0.0,0.0,1.0,1.0,75.562985,154.196861,0.053672,-0.000219,-8.8e-05,0.0,...,0.188113,0.275463,0.378572,0.543862,0.77113,0.784683,1.055961,1.353449,1.662556,1.950983
1,0.0,0.0,1.0,1.0,98.868674,145.591326,0.308707,-0.00121,0.000607,0.0,...,0.163982,0.220352,0.262742,0.329196,1.734991,1.581074,2.68464,3.707707,4.371126,4.689947
2,0.0,0.0,1.0,1.0,50.257011,114.038367,0.218197,-0.000646,0.000182,0.0,...,0.045395,0.045395,0.090729,0.136002,1.329262,1.061334,1.523338,1.882738,2.060085,2.144009
3,1.0,0.0,1.0,1.0,107.636373,318.370109,0.175128,-0.009807,-0.000205,0.0,...,0.138228,0.217718,0.375938,0.516731,1.844222,1.33371,2.165619,2.864823,3.334014,3.653658
4,0.0,0.0,1.0,1.0,91.368367,180.487523,0.094849,-0.000604,0.000303,0.0,...,0.138228,0.190068,0.299591,0.418924,0.99992,1.14612,1.706036,2.27592,2.831335,3.316521


In [9]:
df_x_test = pd.DataFrame(x_test.T)
df_x_test_melt = df_x_test.melt(var_name='sample', value_name='A')
print(df_x_test_melt.shape)
df_x_test_melt.head()

(53732, 2)


Unnamed: 0,sample,A
0,0,0.190118
1,0,0.278452
2,0,0.364941
3,0,0.446354
4,0,0.519787


In [10]:
df_x_test_melt_ef = extract_features(df_x_test_melt, column_id='sample')
print(df_x_test_melt_ef.shape)
df_x_test_melt_ef.dropna(axis=1, inplace=True)
print(df_x_test_melt_ef.shape)
df_x_test_melt_ef.head()

Feature Extraction: 100%|██████████| 34/34 [00:03<00:00,  9.58it/s]


(202, 787)
(202, 775)


Unnamed: 0,A__variance_larger_than_standard_deviation,A__has_duplicate_max,A__has_duplicate_min,A__has_duplicate,A__sum_values,A__abs_energy,A__mean_abs_change,A__mean_change,A__mean_second_derivative_central,A__median,...,A__fourier_entropy__bins_2,A__fourier_entropy__bins_3,A__fourier_entropy__bins_5,A__fourier_entropy__bins_10,A__fourier_entropy__bins_100,A__permutation_entropy__dimension_3__tau_1,A__permutation_entropy__dimension_4__tau_1,A__permutation_entropy__dimension_5__tau_1,A__permutation_entropy__dimension_6__tau_1,A__permutation_entropy__dimension_7__tau_1
0,1.0,0.0,1.0,1.0,138.548151,416.90274,0.05497,-0.000717,-0.000167,0.0,...,0.079983,0.155665,0.235155,0.339942,0.49593,0.761174,0.990888,1.216844,1.448576,1.682238
1,0.0,0.0,1.0,1.0,64.714169,189.995101,0.300888,-0.001674,-0.003694,0.0,...,0.045395,0.045395,0.090729,0.254093,2.148727,0.934982,1.469652,1.89761,2.165799,2.283179
2,0.0,0.0,1.0,1.0,60.983586,97.945542,0.221572,-0.003128,0.00157,0.0,...,0.110453,0.233137,0.345796,0.52906,2.024522,1.25376,2.019174,2.595891,2.896423,3.03242
3,0.0,0.0,1.0,1.0,54.456523,189.393081,0.083508,-4.5e-05,-0.000252,0.0,...,0.291459,0.541661,0.699205,0.942941,1.426901,0.816907,1.166416,1.486912,1.816094,2.150398
4,1.0,0.0,1.0,1.0,153.217469,394.718757,0.183493,-0.000785,0.000152,0.002127,...,0.079983,0.090729,0.090729,0.170467,0.472096,1.458402,2.428749,3.335742,3.896036,4.204995


In [11]:
(df_x_train_melt_ef.columns == df_x_test_melt_ef.columns).all()

True
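
The check above confirms that the training and test feature tables ended up with identical columns in this run. Because NaN columns are dropped from each table independently, that is not guaranteed in general. One possible way to make the alignment explicit is sketched below; the feature names `f1`, `f2`, `f3` and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical feature tables; names and values are made up for illustration.
train = pd.DataFrame({"f1": [1.0, 2.0], "f2": [np.nan, 1.0], "f3": [0.5, 0.7]})
test = pd.DataFrame({"f1": [3.0, 4.0], "f2": [2.0, 2.5], "f3": [np.nan, 0.9]})

# Dropping NaN columns independently can leave different columns in each table.
train_clean = train.dropna(axis=1)   # drops f2
test_clean = test.dropna(axis=1)     # drops f3

# Keep only the columns that survive in both tables, in the training order.
common = [c for c in train_clean.columns if c in test_clean.columns]
train_aligned = train_clean[common]
test_aligned = test_clean[common]

print(common)
```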

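From each time series of 266 samples (time steps), 787 features were extracted; after dropping the columns that contain NaN, 775 features remained.  
Summary statistics of the resulting features are shown below.  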

In [12]:
df_x_train_melt_ef.describe()

Unnamed: 0,A__variance_larger_than_standard_deviation,A__has_duplicate_max,A__has_duplicate_min,A__has_duplicate,A__sum_values,A__abs_energy,A__mean_abs_change,A__mean_change,A__mean_second_derivative_central,A__median,...,A__fourier_entropy__bins_2,A__fourier_entropy__bins_3,A__fourier_entropy__bins_5,A__fourier_entropy__bins_10,A__fourier_entropy__bins_100,A__permutation_entropy__dimension_3__tau_1,A__permutation_entropy__dimension_4__tau_1,A__permutation_entropy__dimension_5__tau_1,A__permutation_entropy__dimension_6__tau_1,A__permutation_entropy__dimension_7__tau_1
count,471.0,471.0,471.0,471.0,471.0,471.0,471.0,471.0,471.0,471.0,...,471.0,471.0,471.0,471.0,471.0,471.0,471.0,471.0,471.0,471.0
mean,0.233546,0.0,1.0,1.0,83.022627,243.43292,0.196158,-0.004727693,0.000384,0.011063,...,0.126633,0.20824,0.331969,0.526838,1.466896,1.035007,1.623229,2.141073,2.493796,2.70708
std,0.423536,0.0,0.0,0.0,34.692807,334.705174,0.126406,0.008206583,0.003295,0.037256,...,0.082069,0.148735,0.247285,0.39573,0.90132,0.373981,0.674897,0.96939,1.160782,1.259191
min,0.0,0.0,1.0,1.0,13.213483,19.805576,0.025911,-0.07462023,-0.025322,0.0,...,0.045395,0.045395,0.045395,0.079983,0.090729,0.094269,0.09984,0.100163,0.100488,0.100815
25%,0.0,0.0,1.0,1.0,54.179229,131.066582,0.106797,-0.004580441,-0.000382,0.0,...,0.079983,0.090729,0.090729,0.170467,0.785933,0.748043,1.094856,1.359456,1.606664,1.761342
50%,0.0,0.0,1.0,1.0,82.726576,202.169152,0.165587,-0.001861509,0.000152,0.0,...,0.079983,0.159721,0.288307,0.476867,1.329262,1.028211,1.566765,2.036235,2.465934,2.666115
75%,0.0,0.0,1.0,1.0,109.357643,294.651683,0.253088,-0.0007735044,0.000799,0.0,...,0.163982,0.31146,0.480816,0.742266,2.096334,1.354551,2.214789,2.974935,3.468341,3.756777
max,1.0,0.0,1.0,1.0,171.860527,6795.004536,0.712191,-2.132599e-17,0.030973,0.327183,...,0.456746,0.881258,1.335047,2.009035,3.96846,1.730466,3.026124,4.309662,5.062327,5.359928


A model is then trained on these features.

In [13]:
model_svr = GridSearchCV(
    svm.SVR(),
    params,
    cv=KFold(n_splits=3, shuffle=True, random_state=1234),
    scoring=make_scorer(rmse, greater_is_better=False))
model_svr.fit(df_x_train_melt_ef, y_train)

print('[INFO] Best params: {}'.format(model_svr.best_params_))
print('[INFO] Best score: {}'.format(-model_svr.best_score_))

[INFO] Best params: {'C': 0.1, 'gamma': 0.001, 'kernel': 'sigmoid'}
[INFO] Best score: 0.05797496334968894


In [14]:
prediction = model_svr.predict(df_x_test_melt_ef)
print(rmse(y_test, prediction))

0.06749691962168236


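Training on the features extracted with tsfresh gave RMSE=0.067, which is worse than the 0.05 reported in the paper.  
Next, feature selection is applied to these features: a p-value is computed for each feature by statistical hypothesis testing and, following convention, thresholds of 0.01 and 0.05 are both tried.

Both the inputs and the output of this dataset are continuous, so target_real_feature_real_test is used to compute the p-values.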
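
The thresholding step itself is independent of how the p-values were obtained. As a minimal sketch, assume the p-values are already available as a pandas Series indexed by feature name; the feature names and numbers below are hypothetical, and `select_features` is a helper introduced only for this illustration.

```python
import pandas as pd

# Hypothetical p-values per feature; names and numbers are made up for illustration.
p_values = pd.Series({"feat_a": 1e-12, "feat_b": 0.03, "feat_c": 0.49})

def select_features(p_values, alpha):
    """Return the names of features whose p-value is at most alpha."""
    return p_values[p_values <= alpha].index.tolist()

print(select_features(p_values, 0.01))
print(select_features(p_values, 0.05))
```

The stricter threshold always yields a subset of the features kept by the looser one, so the 0.01 run uses no more features than the 0.05 run.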

#### Selecting Features and Training

In [15]:
p_values = []
for column in df_x_train_melt_ef.columns:
    p_value = target_real_feature_real_test(df_x_train_melt_ef[column], pd.Series(y_train))
    p_values.append(p_value)

p_values = pd.DataFrame(np.array(p_values).reshape(1, -1), columns=df_x_train_melt_ef.columns)
p_values.dropna(axis=1, inplace=True)
p_values.head()

Unnamed: 0,A__variance_larger_than_standard_deviation,A__sum_values,A__abs_energy,A__mean_abs_change,A__mean_change,A__mean_second_derivative_central,A__median,A__mean,A__standard_deviation,A__variation_coefficient,...,A__fourier_entropy__bins_2,A__fourier_entropy__bins_3,A__fourier_entropy__bins_5,A__fourier_entropy__bins_10,A__fourier_entropy__bins_100,A__permutation_entropy__dimension_3__tau_1,A__permutation_entropy__dimension_4__tau_1,A__permutation_entropy__dimension_5__tau_1,A__permutation_entropy__dimension_6__tau_1,A__permutation_entropy__dimension_7__tau_1
0,1.6457380000000002e-43,3.524606e-13,4.833027e-83,0.037212,0.491681,0.537429,0.724922,3.524606e-13,3.358749e-97,5.394481e-08,...,7.8e-05,2.9e-05,0.000105,0.001233,0.137604,0.225059,0.124132,0.157889,0.32111,0.714294


In [16]:
p_values_under001 = p_values[p_values<=0.01].dropna(axis=1)
p_values_under001.head()

Unnamed: 0,A__variance_larger_than_standard_deviation,A__sum_values,A__abs_energy,A__mean,A__standard_deviation,A__variation_coefficient,A__variance,A__skewness,A__kurtosis,A__root_mean_square,...,A__ratio_beyond_r_sigma__r_1.5,A__ratio_beyond_r_sigma__r_2,A__ratio_beyond_r_sigma__r_2.5,A__ratio_beyond_r_sigma__r_5,A__ratio_beyond_r_sigma__r_6,A__ratio_beyond_r_sigma__r_7,A__fourier_entropy__bins_2,A__fourier_entropy__bins_3,A__fourier_entropy__bins_5,A__fourier_entropy__bins_10
0,1.6457380000000002e-43,3.524606e-13,4.833027e-83,3.524606e-13,3.358749e-97,5.394481e-08,3.358749e-97,2.32991e-13,5.18507e-12,4.833027e-83,...,1.410152e-17,1.5047e-10,0.000124,8.16557e-13,4e-06,0.000565,7.8e-05,2.9e-05,0.000105,0.001233


In [17]:
p_values_under005 = p_values[p_values<=0.05].dropna(axis=1)
p_values_under005.head()

Unnamed: 0,A__variance_larger_than_standard_deviation,A__sum_values,A__abs_energy,A__mean_abs_change,A__mean,A__standard_deviation,A__variation_coefficient,A__variance,A__skewness,A__kurtosis,...,A__ratio_beyond_r_sigma__r_2.5,A__ratio_beyond_r_sigma__r_5,A__ratio_beyond_r_sigma__r_6,A__ratio_beyond_r_sigma__r_7,A__lempel_ziv_complexity__bins_5,A__lempel_ziv_complexity__bins_10,A__fourier_entropy__bins_2,A__fourier_entropy__bins_3,A__fourier_entropy__bins_5,A__fourier_entropy__bins_10
0,1.6457380000000002e-43,3.524606e-13,4.833027e-83,0.037212,3.524606e-13,3.358749e-97,5.394481e-08,3.358749e-97,2.32991e-13,5.18507e-12,...,0.000124,8.16557e-13,4e-06,0.000565,0.043,0.011191,7.8e-05,2.9e-05,0.000105,0.001233


In [18]:
df_x_train_melt_ef_under001 = df_x_train_melt_ef[p_values_under001.columns]
df_x_test_melt_ef_under001 = df_x_test_melt_ef[p_values_under001.columns]

model_svr = GridSearchCV(
    svm.SVR(),
    params,
    cv=KFold(n_splits=3, shuffle=True, random_state=1234),
    scoring=make_scorer(rmse, greater_is_better=False))
model_svr.fit(df_x_train_melt_ef_under001, y_train)

print('[INFO] Best params: {}'.format(model_svr.best_params_))
print('[INFO] Best score: {}'.format(-model_svr.best_score_))

prediction = model_svr.predict(df_x_test_melt_ef_under001)
print(rmse(y_test, prediction))

[INFO] Best params: {'C': 0.1, 'gamma': 0.001, 'kernel': 'sigmoid'}
[INFO] Best score: 0.05797496334968894
0.06749691962168236


In [19]:
df_x_train_melt_ef_under005 = df_x_train_melt_ef[p_values_under005.columns]
df_x_test_melt_ef_under005 = df_x_test_melt_ef[p_values_under005.columns]

model_svr = GridSearchCV(
    svm.SVR(),
    params,
    cv=KFold(n_splits=3, shuffle=True, random_state=1234),
    scoring=make_scorer(rmse, greater_is_better=False))
model_svr.fit(df_x_train_melt_ef_under005, y_train)

print('[INFO] Best params: {}'.format(model_svr.best_params_))
print('[INFO] Best score: {}'.format(-model_svr.best_score_))

prediction = model_svr.predict(df_x_test_melt_ef_under005)
print(rmse(y_test, prediction))

[INFO] Best params: {'C': 0.1, 'gamma': 0.001, 'kernel': 'sigmoid'}
[INFO] Best score: 0.05797496334968894
0.06749691962168236


## Reference

* [Feature variables](https://www.datarobot.com/jp/wiki/feature/)
* [Feature selection](https://www.datarobot.com/jp/wiki/feature-selection/)
* [Feature engineering](https://www.datarobot.com/jp/wiki/feature-engineering/)
* [Data insights](https://www.datarobot.com/jp/wiki/insights/)
* [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)
* [A collection of Feature-Engineering links](https://qiita.com/squash/items/667f8cda16c76448b0f4)
* [Building features from a DataFrame is too tedious, so let's generate them automatically with featuretools](https://qiita.com/Hyperion13fleet/items/4eaca365f28049fe11c7)
* [tsfresh: a library for automatic feature extraction from time series data](https://qiita.com/yuko1658/items/871df86f99a9134cc9ef)
* [A summary of feature selection](https://qiita.com/shimopino/items/5fee7504c7acf044a521)
* [How to select features correctly in machine learning](https://rightcode.co.jp/blog/information-technology/feature-selection-right-choice)
* [What is feature selection? Understanding feature selection, a key technique for improving prediction accuracy in machine learning](https://www.codexa.net/feature-selection-methods/)
* [Human Activity Recognition Using Smartphones Data Set](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones)
* [Human Activity Recognition using Smartphone](https://arxiv.org/abs/1401.8212)
* [Human Activity Analysis and Recognition from Smartphones using Machine Learning Techniques](https://arxiv.org/abs/2103.16490)
* [Human Activity Recognition using Machine Learning](https://github.com/sushantdhumak/Human-Activity-Recognition-with-Smartphones)
* [How to Choose a Feature Selection Method For Machine Learning](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/)
* [Understanding statistical analysis: an overview of commonly used statistical analysis methods](https://www.nli-research.co.jp/report/detail/id=61928?site=nli)
* [Monash, UEA & UCR Time Series Extrinsic Regression Repository](http://tseregression.org/)
* [Flood Modeling Dataset 1](https://zenodo.org/record/3902694#.YTQjG50zaUk)
* [Flood Modeling Dataset 2](https://zenodo.org/record/3902696#.YTQktZ0zaUk)
* [Flood Modeling Dataset 3](https://zenodo.org/record/3902698#.YTQktZ0zaUk)
* [Monash University, UEA, UCR Time Series Extrinsic Regression Archive](https://arxiv.org/abs/2006.10996)
* [Time Series Extrinsic Regression](https://arxiv.org/abs/2006.12672)
* [ChangWeiTan/TS-Extrinsic-Regression](https://github.com/ChangWeiTan/TS-Extrinsic-Regression)
* [Manufacturing: using sensor data for machine learning](https://www.datarobot.com/jp/blog/use_manufacturing_sensor_data_for_machine_learning/)
* [tsfresh](https://tsfresh.readthedocs.io/en/latest/index.html)
* [Issues in the analysis of large-scale data](https://www.mbsj.jp/admins/ethics_and_edu/PNE/5_article.pdf)