## Driverless AI - Python Scoring Pipeline Example

A demo that runs `example.py` from the scoring folder (`scoring-pipeline`) in Jupyter.

Environment:
- Ubuntu 18.04 (AWS EC2, t2.2xlarge), used as the local environment
- Driverless AI 1.8.8 Python Scoring Pipeline

In [7]:
import pandas as pd
import numpy as np
from numpy import nan
from scipy.special import expit  # public import path for the same function

Import the scoring model (already installed into the Python environment from the `scoring_h2oai_experiment_0bdb6222_458f_11eb_91c1_0242ac110002-1.0.0-py3-none-any.whl` file in the scoring folder).

In [4]:
from scoring_h2oai_experiment_0bdb6222_458f_11eb_91c1_0242ac110002 import Scorer

### Instantiating the Scorer  
- For performance, create only one Scorer instance and call score() or score_batch() on that same instance as many times as needed

In [5]:
scorer = Scorer()

2020-12-26 23:34:23,743 C: NA  D:  NA    M:  NA    NODE:SERVER      17986  INFO   | License manager initialized
2020-12-26 23:34:23,745 C: NA  D:  NA    M:  NA    NODE:SERVER      17986  INFO   | -----------------------------------------------------------------
2020-12-26 23:34:23,747 C: NA  D:  NA    M:  NA    NODE:SERVER      17986  INFO   | Checking whether we have a valid license...
2020-12-26 23:34:23,748 C: NA  D:  NA    M:  NA    NODE:SERVER      17986  INFO   | No Cloud provider found
2020-12-26 23:34:23,750 C: NA  D:  NA    M:  NA    NODE:SERVER      17986  INFO   | License inherited from environment
2020-12-26 23:34:23,760 C: NA  D:  NA    M:  NA    NODE:SERVER      17986  INFO   | 
2020-12-26 23:34:23,762 C: NA  D:  NA    M:  NA    NODE:SERVER      17986  INFO   | license_version:1
2020-12-26 23:34:23,763 C: NA  D:  NA    M:  NA    NODE:SERVER      17986  INFO   | serial_number:3
2020-12-26 23:34:23,765 C: NA  D:  NA    M:  NA    NODE:SERVER      17986  INFO   | licensee_org
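
The single-instance advice above can be sketched as follows. This is only an illustration: `StandInScorer` and `get_scorer` are hypothetical names, and the stand-in class replaces the generated `Scorer` so that the snippet runs without the scoring package installed.

```python
from functools import lru_cache


class StandInScorer:
    """Hypothetical stand-in for the generated Scorer class (not the real API)."""

    instances_created = 0

    def __init__(self):
        # The real Scorer does its setup here (e.g. the license check logged above).
        StandInScorer.instances_created += 1

    def score(self, row):
        # Placeholder result; the real method returns the model's prediction(s).
        return [float(len(row))]


@lru_cache(maxsize=1)
def get_scorer():
    """Create the scorer on first use and return the same instance afterwards."""
    return StandInScorer()


# Every call site shares one instance, so the setup runs only once.
first = get_scorer().score(["1", "2", "3", "4"])
second = get_scorer().score(["5", "6", "7", "8"])
print(StandInScorer.instances_created)
```

In a notebook, keeping `scorer = Scorer()` in its own cell and not re-running it has the same effect; the cached accessor is mainly useful in a script or a service.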

---

Input (feature) information  
- Range is the minimum-to-maximum range observed in the training data

| Name | Type    | Range                                     | 
| ---- | ------- | ----------------------------------------- | 
| x1   | float32 | [-3.0065999031066895, 2.7874999046325684] | 
| x2   | float32 | [-4.136000156402588, 3.256700038909912]   | 
| x3   | float32 | [-3.0952999591827393, 3.3822999000549316] | 
| x4   | float32 | [-3.42330002784729, 3.0445001125335693] | 

---

### Scoring a single row with the Scorer.score() method

In [90]:
scorer.score([
    '-2.631',  # x1
    '1.277',  # x2
    '-2.797',  # x3
    '3.319',  # x4
])

2020-12-27 00:35:26,446 C:  0% D:224.9GB M:29.8GB  NODE:SERVER      17986  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)


[-1.3629496097564697]
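
The call above passes the values as strings in the column order x1 to x4. When a record arrives as a dictionary, a small helper can put the values into that order first. This is a minimal sketch: `to_score_row` and `COLUMN_ORDER` are hypothetical names, not part of the scoring package.

```python
# Column order taken from the input feature table in this notebook.
COLUMN_ORDER = ["x1", "x2", "x3", "x4"]


def to_score_row(record, columns=COLUMN_ORDER):
    """Return the record's values as strings, in the given column order.

    Hypothetical helper: a missing feature raises KeyError, so an
    incomplete record is never scored silently.
    """
    return [str(record[name]) for name in columns]


# The dictionary keys may come in any order.
record = {"x4": 3.319, "x1": -2.631, "x3": -2.797, "x2": 1.277}
row = to_score_row(record)
print(row)  # ['-2.631', '1.277', '-2.797', '3.319']
```

The resulting list can then be passed on, for example as `scorer.score(row)`.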

In [15]:
# Scoring also works for values outside the training min/max range
scorer.score([
    '10',  # x1
    '10',  # x2
    '10',  # x3
    '10',  # x4
])

2020-12-26 23:53:05,307 C:  0% D:224.9GB M:29.9GB  NODE:SERVER      17986  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)


[8.19909954071045]
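
Since the scorer accepts out-of-range values without complaint, it can be useful to flag them before scoring. The sketch below reuses the ranges from the feature table; `TRAINING_RANGES` and `out_of_range_features` are hypothetical names introduced for illustration.

```python
# Training min/max per feature, copied from the input feature table.
TRAINING_RANGES = {
    "x1": (-3.0065999031066895, 2.7874999046325684),
    "x2": (-4.136000156402588, 3.256700038909912),
    "x3": (-3.0952999591827393, 3.3822999000549316),
    "x4": (-3.42330002784729, 3.0445001125335693),
}


def out_of_range_features(row, ranges=TRAINING_RANGES):
    """Return the names of features whose value lies outside the training range.

    `row` holds the values in the same order as `ranges` (x1 to x4).
    """
    flagged = []
    for (name, (low, high)), value in zip(ranges.items(), row):
        if not low <= float(value) <= high:
            flagged.append(name)
    return flagged


first_row = out_of_range_features(["-2.631", "1.277", "-2.797", "3.319"])
second_row = out_of_range_features(["10", "10", "10", "10"])
print(first_row)   # ['x4']
print(second_row)  # ['x1', 'x2', 'x3', 'x4']
```

Note that the first example row is already slightly out of range: its x4 value of 3.319 exceeds the training maximum of about 3.0445.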

---

### Batch-scoring a data table with the Scorer.score_batch() method

In [31]:
# Create sample data from random numbers
import random
x1 = [random.uniform(-5,5) for _ in range(10)]
x2 = [random.uniform(-5,5) for _ in range(10)]
x3 = [random.uniform(-5,5) for _ in range(10)]
x4 = [random.uniform(-5,5) for _ in range(10)]
df = pd.DataFrame({'x1':x1,'x2':x2,'x3':x3,'x4':x4})
df

| | x1 | x2 | x3 | x4 |
| --- | --- | --- | --- | --- |
| 0 | -3.917801 | 4.882211 | 3.633671 | 1.869447 |
| 1 | 2.289611 | 1.187278 | -1.661479 | 3.050628 |
| 2 | -2.698765 | -1.753604 | -1.789623 | -3.505589 |
| 3 | -3.861452 | -3.641115 | -3.809173 | -1.615557 |
| 4 | 2.056016 | 3.801538 | -1.998786 | -0.117072 |
| 5 | 0.245556 | 3.343963 | -3.962128 | -2.549399 |
| 6 | 0.980815 | -3.709113 | -3.245773 | -0.499412 |
| 7 | 1.284562 | 1.689672 | -0.19171 | -3.143283 |
| 8 | -4.635536 | -3.704511 | -3.397017 | -1.520294 |
| 9 | -3.830807 | 0.023586 | 1.806482 | -1.28212 |


In [86]:
type(df)

pandas.core.frame.DataFrame

In [38]:
# Data types of the input data
df.dtypes

x1    float64
x2    float64
x3    float64
x4    float64
dtype: object

In [60]:
import time
start = time.time()    # start timing

res = scorer.score_batch(df)   # score the whole frame
display(res)

e_time = time.time() - start   # elapsed seconds (includes display)

2020-12-27 00:17:51,872 C:  0% D:224.9GB M:29.8GB  NODE:SERVER      17986  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)


| | y |
| --- | --- |
| 0 | 5.220603 |
| 1 | 2.739367 |
| 2 | 2.789081 |
| 3 | 1.320055 |
| 4 | 5.507656 |
| 5 | 6.16531 |
| 6 | 0.898444 |
| 7 | 4.237808 |
| 8 | 0.935505 |
| 9 | -2.013859 |


In [53]:
# Elapsed time
print(f"e_time:{e_time}[s]")

e_time:0.32343506813049316[s]
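
The measurement above is a single run and also includes the time spent in `display(res)`. A slightly more careful measurement could look like the sketch below. `timed` and `stand_in_score_batch` are hypothetical names; the stand-in function replaces `scorer.score_batch` so that the snippet runs on its own.

```python
import time


def timed(func, *args, repeats=5):
    """Call func(*args) `repeats` times; return the last result and the best time in seconds."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


def stand_in_score_batch(rows):
    # Hypothetical placeholder for scorer.score_batch; it just sums each row.
    return [sum(r) for r in rows]


result, best_seconds = timed(stand_in_score_batch, [[1.0, 2.0], [3.0, 4.0]])
print(f"best of 5: {best_seconds:.6f}[s]")
```

Taking the best of several runs reduces the influence of one-off delays, and timing only the call itself keeps rendering time out of the figure.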


**[Recommended] For consistency with Driverless AI and for faster scoring, pass a `datatable.Frame` object to the scoring method**

In [64]:
import datatable
dt = datatable.Frame(df)   # convert the pandas.DataFrame to a datatable.Frame
dt

| | x1 | x2 | x3 | x4 |
| --- | --- | --- | --- | --- |
| | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
| 0 | −3.9178 | 4.88221 | 3.63367 | 1.86945 |
| 1 | 2.28961 | 1.18728 | −1.66148 | 3.05063 |
| 2 | −2.69876 | −1.7536 | −1.78962 | −3.50559 |
| 3 | −3.86145 | −3.64112 | −3.80917 | −1.61556 |
| 4 | 2.05602 | 3.80154 | −1.99879 | −0.117072 |
| 5 | 0.245556 | 3.34396 | −3.96213 | −2.5494 |
| 6 | 0.980815 | −3.70911 | −3.24577 | −0.499412 |
| 7 | 1.28456 | 1.68967 | −0.19171 | −3.14328 |
| 8 | −4.63554 | −3.70451 | −3.39702 | −1.52029 |
| 9 | −3.83081 | 0.0235856 | 1.80648 | −1.28212 |


In [65]:
type(dt)

datatable.Frame

In [85]:
# Data types of the input data
for col in dt.names:
    print(col, ": ", dt[col].stype)

x1 :  stype.float64
x2 :  stype.float64
x3 :  stype.float64
x4 :  stype.float64


In [87]:
start = time.time()    # start timing

res = scorer.score_batch(dt)   # score the whole frame
display(res)

e_time = time.time() - start   # elapsed seconds (includes display)

2020-12-27 00:32:51,818 C:  0% D:224.9GB M:29.8GB  NODE:SERVER      17986  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)


| | y |
| --- | --- |
| 0 | 5.220603 |
| 1 | 2.739367 |
| 2 | 2.789081 |
| 3 | 1.320055 |
| 4 | 5.507656 |
| 5 | 6.16531 |
| 6 | 0.898444 |
| 7 | 4.237808 |
| 8 | 0.935505 |
| 9 | -2.013859 |


In [88]:
# Elapsed time
print(f"e_time:{e_time}[s]")

e_time:0.30411601066589355[s]
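
To put the two timings side by side, the difference can be worked out directly from the values printed in this notebook. With only 10 rows and one run per input type, the gap (about 19 ms, roughly 6%) may well be within run-to-run noise, so treat it as indicative only.

```python
# Elapsed times printed earlier in this notebook (one run each, 10 rows).
pandas_seconds = 0.32343506813049316     # score_batch(df), pandas.DataFrame input
datatable_seconds = 0.30411601066589355  # score_batch(dt), datatable.Frame input

saved = pandas_seconds - datatable_seconds
relative = saved / pandas_seconds

print(f"saved {saved * 1000:.1f} ms ({relative:.1%} of the pandas run)")
```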


---

### Feature transformation with the Scorer.fit_transform_batch() method
- Corresponds to [Transform Another Dataset] in Driverless AI (the feature engineering pipeline)

In [96]:
df['y'] = pd.Series([random.uniform(-1,1) for _ in range(10)])
df

| | x1 | x2 | x3 | x4 | y |
| --- | --- | --- | --- | --- | --- |
| 0 | -3.917801 | 4.882211 | 3.633671 | 1.869447 | 0.564277 |
| 1 | 2.289611 | 1.187278 | -1.661479 | 3.050628 | -0.396532 |
| 2 | -2.698765 | -1.753604 | -1.789623 | -3.505589 | 0.867018 |
| 3 | -3.861452 | -3.641115 | -3.809173 | -1.615557 | -0.234887 |
| 4 | 2.056016 | 3.801538 | -1.998786 | -0.117072 | 0.513329 |
| 5 | 0.245556 | 3.343963 | -3.962128 | -2.549399 | -0.580393 |
| 6 | 0.980815 | -3.709113 | -3.245773 | -0.499412 | 0.079676 |
| 7 | 1.284562 | 1.689672 | -0.19171 | -3.143283 | -0.255809 |
| 8 | -4.635536 | -3.704511 | -3.397017 | -1.520294 | 0.290084 |
| 9 | -3.830807 | 0.023586 | 1.806482 | -1.28212 | -0.592879 |


In [104]:
train_transformed, valid_transformed, test_transformed = scorer.fit_transform_batch(train_frame=df, valid_frame=df, test_frame=df)

2020-12-27 01:09:53,906 C:  0% D:224.9GB M:29.8GB  NODE:SERVER      17986  INFO   | Using 1 parallel workers (1 parent workers) for fit_transform.
2020-12-27 01:09:54,084 C:  0% D:224.9GB M:29.8GB  NODE:SERVER      17986  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)
2020-12-27 01:09:54,281 C:  0% D:224.9GB M:29.8GB  NODE:SERVER      17986  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)
2020-12-27 01:09:54,466 C:  0% D:224.9GB M:29.8GB  NODE:SERVER      17986  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)


- train_frame: the data used for fitting
- valid_frame: the validation data for hyperparameter tuning
- test_frame: the data to transform

In [105]:
test_transformed

| | 0_x1 | 1_x2 | 2_x3 | 3_x4 | y |
| --- | --- | --- | --- | --- | --- |
| 0 | -3.917801 | 4.882211 | 3.633671 | 1.869447 | 0.564277 |
| 1 | 2.289611 | 1.187278 | -1.661479 | 3.050628 | -0.396532 |
| 2 | -2.698765 | -1.753604 | -1.789623 | -3.505589 | 0.867018 |
| 3 | -3.861452 | -3.641115 | -3.809173 | -1.615557 | -0.234887 |
| 4 | 2.056016 | 3.801538 | -1.998786 | -0.117072 | 0.513329 |
| 5 | 0.245556 | 3.343963 | -3.962128 | -2.549399 | -0.580393 |
| 6 | 0.980815 | -3.709112 | -3.245773 | -0.499412 | 0.079676 |
| 7 | 1.284562 | 1.689672 | -0.19171 | -3.143283 | -0.255809 |
| 8 | -4.635536 | -3.704511 | -3.397017 | -1.520294 | 0.290084 |
| 9 | -3.830807 | 0.023586 | 1.806482 | -1.28212 | -0.592879 |


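This scoring pipeline applies no feature transformations, so the transformed data is essentially the same as the input data; only the column names gain an index prefix such as `0_x1`.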
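
The comparison between input and transformed data can also be done programmatically. The sketch below uses row 6 from the two tables above, the one row where the printed x2 values differ in the last digit. `strip_index_prefix` is a hypothetical helper, and the tolerance of 1e-5 is an assumption chosen to absorb that last-digit difference (possibly float32 rounding, which is not confirmed here).

```python
import math

# Row 6 of the input frame and of test_transformed, copied from the tables above.
input_row = {"x1": 0.980815, "x2": -3.709113, "x3": -3.245773, "x4": -0.499412}
transformed_row = {"0_x1": 0.980815, "1_x2": -3.709112, "2_x3": -3.245773, "3_x4": -0.499412}


def strip_index_prefix(name):
    """Map '1_x2' to 'x2' by dropping everything up to the first underscore."""
    return name.split("_", 1)[1]


# Rename the transformed columns back to the original feature names.
renamed = {strip_index_prefix(key): value for key, value in transformed_row.items()}

# Compare value by value with an absolute tolerance (assumed, see above).
unchanged = all(
    math.isclose(input_row[name], renamed[name], abs_tol=1e-5)
    for name in input_row
)
print(unchanged)  # True
```

For a pipeline that does engineer features, the transformed frame would contain different columns and this check would no longer apply.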