## Driverless AI - Python Scoring Pipeline: Sample Run (Local Environment)

A demo that runs example.py from the scoring folder (scoring-pipeline) in Jupyter.

Environment:
- Ubuntu 18.04 (AWS EC2, t2.2xlarge), used as the local environment
- Driverless AI 1.8.8 Python Scoring Pipeline

In [1]:
import pandas as pd
import numpy as np
from numpy import nan
from scipy.special._ufuncs import expit

Import the scoring model. (It has already been installed into the Python environment from the file scoring_h2oai_experiment_0bdb6222_458f_11eb_91c1_0242ac110002-1.0.0-py3-none-any.whl in the scoring folder.)

In [3]:
from scoring_h2oai_experiment_0bdb6222_458f_11eb_91c1_0242ac110002 import Scorer

### Instantiating the Scorer  
- For performance reasons, create a single Scorer instance and call score() or score_batch() on that same instance as many times as needed

In [4]:
scorer = Scorer()

2020-12-27 01:28:02,767 C: NA  D:  NA    M:  NA    NODE:SERVER      19017  INFO   | License manager initialized
2020-12-27 01:28:02,769 C: NA  D:  NA    M:  NA    NODE:SERVER      19017  INFO   | -----------------------------------------------------------------
2020-12-27 01:28:02,771 C: NA  D:  NA    M:  NA    NODE:SERVER      19017  INFO   | Checking whether we have a valid license...
2020-12-27 01:28:02,772 C: NA  D:  NA    M:  NA    NODE:SERVER      19017  INFO   | No Cloud provider found
2020-12-27 01:28:02,774 C: NA  D:  NA    M:  NA    NODE:SERVER      19017  INFO   | License inherited from environment
2020-12-27 01:28:02,782 C: NA  D:  NA    M:  NA    NODE:SERVER      19017  INFO   | 
2020-12-27 01:28:02,784 C: NA  D:  NA    M:  NA    NODE:SERVER      19017  INFO   | license_version:1
2020-12-27 01:28:02,786 C: NA  D:  NA    M:  NA    NODE:SERVER      19017  INFO   | serial_number:3
2020-12-27 01:28:02,787 C: NA  D:  NA    M:  NA    NODE:SERVER      19017  INFO   | licensee_org
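
One way to follow the single-instance advice is to cache the constructor call. The sketch below is only an illustration: the `Scorer` class defined here is a stand-in so the snippet runs without the licensed package (in real use it would be the class imported from the scoring wheel), and `get_scorer` is a hypothetical helper name.

```python
from functools import lru_cache


class Scorer:
    """Stand-in for the real Scorer (assumption, for illustration only)."""

    instances_created = 0

    def __init__(self):
        # The real constructor initializes the license manager and the model,
        # which is why it should run only once per process.
        Scorer.instances_created += 1


@lru_cache(maxsize=None)
def get_scorer():
    """Create the Scorer on first use and return the same object afterwards."""
    return Scorer()


first = get_scorer()
second = get_scorer()
print(first is second, Scorer.instances_created)
```

Every caller that goes through `get_scorer()` shares one instance, so the startup cost is paid once.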

---

Input (feature) information  
- Range is the [minimum, maximum] of each feature in the training data

| Name | Type    | Range                                     | 
| ---- | ------- | ----------------------------------------- | 
| x1   | float32 | [-3.0065999031066895, 2.7874999046325684] | 
| x2   | float32 | [-4.136000156402588, 3.256700038909912]   | 
| x3   | float32 | [-3.0952999591827393, 3.3822999000549316] | 
| x4   | float32 | [-3.42330002784729, 3.0445001125335693] | 
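
Because the ranges describe the training data, it can be useful to flag inputs that fall outside them before scoring, since the model is extrapolating for such values. A minimal sketch, assuming the ranges are copied by hand from the table above (rounded to four decimals); `TRAINING_RANGES` and `out_of_range_features` are hypothetical names, not part of the scoring package.

```python
# Training ranges copied from the feature table (rounded to 4 decimals).
TRAINING_RANGES = {
    "x1": (-3.0066, 2.7875),
    "x2": (-4.1360, 3.2567),
    "x3": (-3.0953, 3.3823),
    "x4": (-3.4233, 3.0445),
}


def out_of_range_features(row):
    """Return the names of features whose value lies outside the training range."""
    flagged = []
    for name, value in row.items():
        low, high = TRAINING_RANGES[name]
        # Values may arrive as strings, so convert before comparing.
        if not low <= float(value) <= high:
            flagged.append(name)
    return flagged


in_sample = {"x1": "-2.631", "x2": "1.277", "x3": "-2.797", "x4": "3.319"}
far_out = {"x1": 10, "x2": 10, "x3": 10, "x4": 10}
print(out_of_range_features(in_sample))  # ['x4']
print(out_of_range_features(far_out))    # ['x1', 'x2', 'x3', 'x4']
```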

---

### Scoring a single row with Scorer.score()

In [5]:
scorer.score([
    '-2.631',  # x1
    '1.277',  # x2
    '-2.797',  # x3
    '3.319',  # x4
])

2020-12-27 01:28:08,115 C:  0% D:224.9GB M:30.2GB  NODE:SERVER      19017  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)


[-1.3629496097564697]
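
`score()` receives one row as a list whose order matches the feature table (x1 to x4). As a sketch, a hypothetical helper `to_score_row` can build that list from a dict keyed by feature name, so the ordering does not depend on how the record was created. Passing the values as strings mirrors the cell above; `FEATURE_ORDER` is an assumed name.

```python
FEATURE_ORDER = ["x1", "x2", "x3", "x4"]  # column order from the feature table


def to_score_row(record):
    """Return the record's values as strings, in the order the model expects."""
    missing = [name for name in FEATURE_ORDER if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [str(record[name]) for name in FEATURE_ORDER]


row = to_score_row({"x4": 3.319, "x1": -2.631, "x3": -2.797, "x2": 1.277})
print(row)  # this list is what would be passed to scorer.score(row)
```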

In [6]:
# Values outside the training data's min/max range can still be scored
scorer.score([
    '10',  # x1
    '10',  # x2
    '10',  # x3
    '10',  # x4
])

2020-12-27 01:28:10,037 C:  3% D:224.9GB M:30.2GB  NODE:SERVER      19017  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)


[8.19909954071045]

---

### Batch scoring a data table with Scorer.score_batch()

In [7]:
# Create sample data from random numbers
import random
x1 = [random.uniform(-5,5) for _ in range(10)]
x2 = [random.uniform(-5,5) for _ in range(10)]
x3 = [random.uniform(-5,5) for _ in range(10)]
x4 = [random.uniform(-5,5) for _ in range(10)]
df = pd.DataFrame({'x1':x1,'x2':x2,'x3':x3,'x4':x4})
df

Unnamed: 0,x1,x2,x3,x4
0,-1.75305,2.821383,4.254005,-2.973499
1,-4.019309,-1.982661,3.16061,4.619455
2,4.22512,4.335088,-4.099486,-4.481148
3,-3.775419,-1.174434,4.797593,-2.719304
4,-3.207626,2.566643,-0.356088,0.909786
5,-2.712394,-2.901967,4.270808,2.73853
6,4.860317,0.398382,1.057774,3.21039
7,-3.919865,-1.279046,-4.421415,0.47572
8,3.382945,1.450419,0.397534,-3.129112
9,-4.964369,-2.964394,1.727644,-4.648407


In [8]:
type(df)

pandas.core.frame.DataFrame

In [9]:
# Data types of the input data
df.dtypes

x1    float64
x2    float64
x3    float64
x4    float64
dtype: object

In [10]:
import time
start = time.time()    # start timing

res = scorer.score_batch(df)   # scoring
display(res)

e_time = time.time() - start 

2020-12-27 01:28:15,692 C:  1% D:224.9GB M:30.2GB  NODE:SERVER      19017  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)


Unnamed: 0,y
0,1.050215
1,3.261864
2,8.1991
3,-2.644363
4,2.690867
5,3.261864
6,4.290782
7,-1.666396
8,4.477252
9,-3.855328


In [11]:
# Elapsed time
print ("e_time:{0}".format(e_time) + "[s]")

e_time:0.3250236511230469[s]
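
Note that the interval measured above also covers `display(res)`. A minimal sketch of timing only the call itself with `time.perf_counter`, a monotonic clock intended for measuring short durations. `timed` is a hypothetical helper, and `sum` stands in for `scorer.score_batch` so that the snippet runs on its own.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in call; in the notebook this would be timed(scorer.score_batch, df).
result, e_time = timed(sum, [1, 2, 3])
print("e_time:{0:.6f}[s]".format(e_time))
```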


**[Recommended] For consistency with Driverless AI and for faster scoring, pass a datatable.Frame object to the scoring method.**

In [12]:
import datatable
dt = datatable.Frame(df)   # convert pandas.DataFrame to datatable.Frame
dt

Unnamed: 0,x1,x2,x3,x4
0,-1.75305,2.82138,4.25401,-2.9735
1,-4.01931,-1.98266,3.16061,4.61946
2,4.22512,4.33509,-4.09949,-4.48115
3,-3.77542,-1.17443,4.79759,-2.7193
4,-3.20763,2.56664,-0.356088,0.909786
5,-2.71239,-2.90197,4.27081,2.73853
6,4.86032,0.398382,1.05777,3.21039
7,-3.91986,-1.27905,-4.42142,0.47572
8,3.38294,1.45042,0.397534,-3.12911
9,-4.96437,-2.96439,1.72764,-4.64841


In [13]:
type(dt)

datatable.Frame

In [14]:
# Data types of the input data
for col in dt.names:
    print(col, ": ", dt[col].stype)

x1 :  stype.float64
x2 :  stype.float64
x3 :  stype.float64
x4 :  stype.float64


In [17]:
start = time.time()    # start timing

res = scorer.score_batch(dt)   # scoring
display(res)

e_time = time.time() - start 

2020-12-27 01:28:35,636 C:  0% D:224.9GB M:30.2GB  NODE:SERVER      19017  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)


Unnamed: 0,y
0,1.050215
1,3.261864
2,8.1991
3,-2.644363
4,2.690867
5,3.261864
6,4.290782
7,-1.666396
8,4.477252
9,-3.855328


In [18]:
# Elapsed time
print ("e_time:{0}".format(e_time) + "[s]")

e_time:0.3226892948150635[s]


---

### Feature transformation with Scorer.fit_transform_batch()
- Corresponds to [Transform Another Dataset] in Driverless AI (the feature engineering pipeline)

In [19]:
df['y'] = pd.Series([random.uniform(-1,1) for _ in range(10)])
df

Unnamed: 0,x1,x2,x3,x4,y
0,-1.75305,2.821383,4.254005,-2.973499,-0.35061
1,-4.019309,-1.982661,3.16061,4.619455,-0.803862
2,4.22512,4.335088,-4.099486,-4.481148,0.845024
3,-3.775419,-1.174434,4.797593,-2.719304,-0.755084
4,-3.207626,2.566643,-0.356088,0.909786,-0.641525
5,-2.712394,-2.901967,4.270808,2.73853,-0.542479
6,4.860317,0.398382,1.057774,3.21039,0.972063
7,-3.919865,-1.279046,-4.421415,0.47572,-0.783973
8,3.382945,1.450419,0.397534,-3.129112,0.676589
9,-4.964369,-2.964394,1.727644,-4.648407,-0.992874


In [20]:
train_transformed, valid_transformed, test_transformed = scorer.fit_transform_batch(train_frame=df, valid_frame=df, test_frame=df)

2020-12-27 01:28:42,893 C:  1% D:224.9GB M:30.2GB  NODE:SERVER      19017  INFO   | Using 1 parallel workers (1 parent workers) for fit_transform.
2020-12-27 01:28:43,121 C:  1% D:224.9GB M:30.2GB  NODE:SERVER      19017  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)
2020-12-27 01:28:43,336 C:  1% D:224.9GB M:30.2GB  NODE:SERVER      19017  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)
2020-12-27 01:28:43,555 C:  1% D:224.9GB M:30.2GB  NODE:SERVER      19017  INFO   | Submitted    0 and Completed    0 non-identity feature engineering tasks out of    4 total tasks (including    4 identity)


- train_frame: data used for fitting
- valid_frame: validation data for hyperparameter tuning
- test_frame: the data to be transformed

In [21]:
test_transformed

Unnamed: 0,0_x1,1_x2,2_x3,3_x4,y
0,-1.75305,2.821383,4.254005,-2.973499,-0.35061
1,-4.019309,-1.982661,3.16061,4.619455,-0.803862
2,4.22512,4.335088,-4.099486,-4.481148,0.845024
3,-3.775419,-1.174434,4.797594,-2.719304,-0.755084
4,-3.207626,2.566643,-0.356088,0.909786,-0.641525
5,-2.712394,-2.901967,4.270808,2.73853,-0.542479
6,4.860317,0.398382,1.057774,3.21039,0.972063
7,-3.919865,-1.279046,-4.421415,0.47572,-0.783973
8,3.382945,1.450419,0.397534,-3.129112,0.676589
9,-4.964369,-2.964394,1.727644,-4.648407,-0.992874


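This scoring pipeline applies no feature transformations, so the transformed data matches the input data, apart from the index prefix added to the column names and float rounding in the last digit.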
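
This observation can also be checked programmatically rather than by eye. The sketch below uses only the standard library and a few values copied from the tables above. The `<index>_<name>` column naming is inferred from the output shown, and `strip_index_prefix` and `same_values` are hypothetical helpers. A small tolerance is needed because one value differs in the last digit (4.797593 vs 4.797594).

```python
import math
import re


def strip_index_prefix(name):
    """Turn a transformed column name such as '2_x3' back into 'x3'."""
    return re.sub(r"^\d+_", "", name)


def same_values(original, transformed, rel_tol=1e-6):
    """Compare two {column: [values]} mappings up to float rounding."""
    renamed = {strip_index_prefix(k): v for k, v in transformed.items()}
    if renamed.keys() != original.keys():
        return False
    return all(
        len(original[k]) == len(renamed[k])
        and all(
            math.isclose(a, b, rel_tol=rel_tol)
            for a, b in zip(original[k], renamed[k])
        )
        for k in original
    )


# Rows 3 and 4 of the input and transformed tables, columns x3 and y only.
original = {"x3": [4.797593, -0.356088], "y": [-0.755084, -0.641525]}
transformed = {"2_x3": [4.797594, -0.356088], "y": [-0.755084, -0.641525]}
print(same_values(original, transformed))  # True
```

With the full frames, the same idea applies: rename the transformed columns, then compare with a tolerance instead of exact equality.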