## ch21 Serving Models with MLflow
- https://github.com/mattharrison/effective_xgboost_book/blob/main/xgbcode.ipynb

<div style="text-align: right"> <b>Author : Kwang Myung Yu</b></div>
<div style="text-align: right"> Initial upload: 2023.8.18</div>
<div style="text-align: right"> Last update: 2023.8.18</div>

In [1]:
import os
import sys
import time
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from scipy import stats
import warnings; warnings.filterwarnings('ignore')
#plt.style.use('ggplot')
plt.style.use('seaborn-whitegrid')
%matplotlib inline

### 21.1 Installation and Setup

In [2]:
from feature_engine import encoding, imputation
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
import matplotlib.pyplot as plt
import mlflow
import numpy as np
import pandas as pd
from sklearn import base, metrics, model_selection, \
pipeline, preprocessing
from sklearn.metrics import accuracy_score, roc_auc_score
import xgboost as xgb
import urllib
import zipfile

from sklearn import model_selection, preprocessing
import xg_helpers as xhelp

In [3]:
url = 'https://github.com/mattharrison/datasets/raw/master/data/'\
'kaggle-survey-2018.zip'
fname = 'kaggle-survey-2018.zip'
member_name = 'multipleChoiceResponses.csv'

In [4]:
raw = xhelp.extract_zip(url, fname, member_name)
## Create raw X and raw y
kag_X, kag_y = xhelp.get_rawX_y(raw, 'Q6')

In [5]:
## Split data
kag_X_train, kag_X_test, kag_y_train, kag_y_test = \
model_selection.train_test_split(
kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y)

In [6]:
## Transform X with pipeline
X_train = xhelp.kag_pl.fit_transform(kag_X_train)
X_test = xhelp.kag_pl.transform(kag_X_test)
## Transform y with label encoder
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(kag_y_train)
y_train = label_encoder.transform(kag_y_train)
y_test = label_encoder.transform(kag_y_test)
# Combined Data for cross validation/etc
X = pd.concat([X_train, X_test], axis='index')
y = pd.Series([*y_train, *y_test], index=X.index)

hyperopt is used to tune the model,   
and mlflow is used for logging.

In [7]:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
import mlflow
from sklearn import metrics
import xgboost as xgb

In [8]:
ex_id = mlflow.create_experiment(name='ex3', artifact_location='ex2path')
mlflow.set_experiment(experiment_name='ex3')

<Experiment: artifact_location='/Users/sguys99/Desktop/project/self-study/xgboost/effective_xgboost/ex2path', creation_time=1692832509739, experiment_id='861349195765381159', last_update_time=1692832509739, lifecycle_stage='active', name='ex3', tags={}>

In [9]:
with mlflow.start_run():
    params = {'random_state': 42}
    rounds = [{'max_depth': hp.quniform('max_depth', 1, 12, 1),  # tree
               'min_child_weight': hp.loguniform('min_child_weight', -2, 3)},
              {'subsample': hp.uniform('subsample', 0.5, 1),   # stochastic
               'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1)},
              {'gamma': hp.loguniform('gamma', -10, 10)}, # regularization
              {'learning_rate': hp.loguniform('learning_rate', -7, 0)} # boosting
    ]

    for round in rounds:
        params = {**params, **round}
        trials = Trials()
        best = fmin(fn=lambda space: xhelp.hyperparameter_tuning(
                space, X_train, y_train, X_test, y_test),            
            space=params,           
            algo=tpe.suggest,            
            max_evals=10,            
            trials=trials,
            timeout=60*5 # 5 minutes
        )
        params = {**params, **best}
        params['max_depth'] = int(params['max_depth']) # must be cast to int to work (hp.quniform returns a float)
        for param, val in params.items():
            mlflow.log_param(param, val)
        
        xg = xgb.XGBClassifier(eval_metric='logloss', early_stopping_rounds=50, **params)
        xg.fit(X_train, y_train,
               eval_set=[(X_train, y_train),
                         (X_test, y_test)
                        ]
              )     
        for metric in [metrics.accuracy_score, metrics.precision_score, metrics.recall_score, 
                       metrics.f1_score]:
            mlflow.log_metric(metric.__name__, metric(y_test, xg.predict(X_test)))
            
    model_info = mlflow.xgboost.log_model(xg, artifact_path='model')

100%|██████████| 10/10 [00:01<00:00,  6.70trial/s, best loss: -0.7657458563535912]
[0]	validation_0-logloss:0.61792	validation_1-logloss:0.61927
[1]	validation_0-logloss:0.57478	validation_1-logloss:0.57918
[2]	validation_0-logloss:0.54551	validation_1-logloss:0.55309
[3]	validation_0-logloss:0.52702	validation_1-logloss:0.53567
[4]	validation_0-logloss:0.51445	validation_1-logloss:0.52740
[5]	validation_0-logloss:0.50370	validation_1-logloss:0.52021
[6]	validation_0-logloss:0.49590	validation_1-logloss:0.51482
[7]	validation_0-logloss:0.48841	validation_1-logloss:0.51205
[8]	validation_0-logloss:0.48254	validation_1-logloss:0.50750
[9]	validation_0-logloss:0.47883	validation_1-logloss:0.50669
[10]	validation_0-logloss:0.47578	validation_1-logloss:0.50609
[11]	validation_0-logloss:0.46857	validation_1-logloss:0.50519
[12]	validation_0-logloss:0.46562	validation_1-logloss:0.50602
[13]	validation_0-logloss:0.46320	validation_1-logloss:0.50568
[14]	validation_0-logloss:0.45946	validation_
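
The tuning cell above searches in stages: each round adds a few new search dimensions to the parameters already fixed by earlier rounds, then freezes the best values found. The merge logic can be sketched without hyperopt. Here pick_best is a hypothetical stand-in for fmin that simply takes the midpoint of each range, and the ranges are illustrative only.

```python
# Sketch of the staged merge used in the tuning loop.
# pick_best is a hypothetical stand-in for hyperopt's fmin.

def pick_best(space):
    """Pretend optimizer: return the midpoint of every (low, high) range."""
    return {name: (rng[0] + rng[1]) / 2
            for name, rng in space.items()
            if isinstance(rng, tuple)}

params = {'random_state': 42}           # fixed values carried across rounds
rounds = [
    {'max_depth': (1, 12)},             # tree structure
    {'subsample': (0.5, 1.0)},          # stochastic sampling
]

for search_round in rounds:             # avoids shadowing the built-in round()
    space = {**params, **search_round}  # earlier bests plus new ranges
    best = pick_best(space)
    params = {**space, **best}          # freeze this round's winners

params['max_depth'] = int(params['max_depth'])  # depth must be an integer
print(params)
```

Because later rounds start from the frozen values, each stage searches a small space instead of one large joint space.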

In [10]:
ex_id

'861349195765381159'

In [11]:
model_info.run_id

'22f11ad59e22476a8fd05701dd888e21'

### 21.3 Running A Model From Code

In [12]:
logged_model = 'runs:/22f11ad59e22476a8fd05701dd888e21/model'
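
Rather than pasting the run ID by hand, the URI can be assembled from model_info.run_id. A minimal sketch: build_runs_uri is a hypothetical helper, and the default artifact path 'model' is assumed to match the artifact_path passed to log_model earlier.

```python
def build_runs_uri(run_id, artifact_path='model'):
    """Return an MLflow 'runs:/<run_id>/<artifact_path>' URI string."""
    return f'runs:/{run_id}/{artifact_path}'

# Run ID copied from the model_info.run_id output in this notebook.
logged_model = build_runs_uri('22f11ad59e22476a8fd05701dd888e21')
print(logged_model)
```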

In [13]:
loaded_model = mlflow.pyfunc.load_model(logged_model)
loaded_model

mlflow.pyfunc.loaded_model:
  artifact_path: model
  flavor: mlflow.xgboost
  run_id: 22f11ad59e22476a8fd05701dd888e21

In [14]:
X_test.iloc[[0]]

Unnamed: 0,age,education,years_exp,compensation,python,r,sql,Q1_Male,Q1_Female,Q1_Prefer not to say,Q1_Prefer to self-describe,Q3_United States of America,Q3_India,Q3_China,major_cs,major_other,major_eng,major_stat
7894,22,16.0,1.0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0


In [15]:
loaded_model.predict(X_test.iloc[[0]])

array([1])

### 21.4 Serving Predictions

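Once the run UUID is known, the model can be served.   
By default, a pyenv-based virtual environment is created.  
With --env-manager local, the model can be tested in the local environment instead.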

mlflow models serve -m ex2path/861349195765381159/ \
22f11ad59e22476a8fd05701dd888e21/artifacts/model \
-p 1234 --env-manager local

Important: the command printed in the book is wrong.

The command above serves the model on port 1234.  
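
For scripting, the same command can be assembled in Python before it is launched. The sketch below only builds and prints the command. serve_command is a hypothetical helper, and the flags are the ones used in the command above.

```python
import shlex

def serve_command(model_path, port, env_manager='local'):
    """Build the argument list for 'mlflow models serve'."""
    return ['mlflow', 'models', 'serve',
            '-m', model_path,
            '-p', str(port),
            '--env-manager', env_manager]

# Path copied from the serve command in this notebook.
model_path = ('ex2path/861349195765381159/'
              '22f11ad59e22476a8fd05701dd888e21/artifacts/model')
cmd = serve_command(model_path, 1234)
print(shlex.join(cmd))   # copy-pasteable shell form
# subprocess.Popen(cmd) would start the server (not run here)
```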

### 21.5 Querying from the Command Line

On Unix-like systems, the curl command can be used as follows.  

```
curl $URL -X POST -H "Content-Type:application/json" --data $JSON_DATA
```

This requires the data to be in JSON format.

In [16]:
{'dataframe_split':
{'columns': ['col1', 'col2'],
'data': [[22, 16.0],
[25, 18.0]]}
}

{'dataframe_split': {'columns': ['col1', 'col2'],
  'data': [[22, 16.0], [25, 18.0]]}}

- The JSON data must be structured as shown above.
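
The dataframe_split layout is just column names plus rows of values, so it can be built with the standard library alone. A small sketch: make_split_payload is a hypothetical helper, and the column names are the placeholders from the cell above.

```python
import json

def make_split_payload(columns, rows):
    """Return a JSON string in the {'dataframe_split': ...} layout."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError('each row needs one value per column')
    return json.dumps({'dataframe_split': {'columns': list(columns),
                                           'data': [list(r) for r in rows]}})

payload = make_split_payload(['col1', 'col2'], [[22, 16.0], [25, 18.0]])
print(payload)
```

The length check is an extra safeguard added for this sketch, so that a malformed row is caught before the request is sent.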

If you would rather not build the JSON by hand, pandas can do it.   
With a DataFrame, call the .to_json method with orient='split'.

In [17]:
X_test.head(2).to_json()

'{"age":{"7894":22,"10541":25},"education":{"7894":16.0,"10541":18.0},"years_exp":{"7894":1.0,"10541":1.0},"compensation":{"7894":0,"10541":70000},"python":{"7894":1,"10541":1},"r":{"7894":0,"10541":1},"sql":{"7894":0,"10541":0},"Q1_Male":{"7894":1,"10541":1},"Q1_Female":{"7894":0,"10541":0},"Q1_Prefer not to say":{"7894":0,"10541":0},"Q1_Prefer to self-describe":{"7894":0,"10541":0},"Q3_United States of America":{"7894":0,"10541":1},"Q3_India":{"7894":1,"10541":0},"Q3_China":{"7894":0,"10541":0},"major_cs":{"7894":1,"10541":0},"major_other":{"7894":0,"10541":1},"major_eng":{"7894":0,"10541":0},"major_stat":{"7894":0,"10541":0}}'

In [18]:
X_test.head()

Unnamed: 0,age,education,years_exp,compensation,python,r,sql,Q1_Male,Q1_Female,Q1_Prefer not to say,Q1_Prefer to self-describe,Q3_United States of America,Q3_India,Q3_China,major_cs,major_other,major_eng,major_stat
7894,22,16.0,1.0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0
10541,25,18.0,1.0,70000,1,1,0,1,0,0,0,1,0,0,0,1,0,0
21353,35,18.0,2.0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1
13879,25,18.0,1.0,100000,1,0,1,1,0,0,0,1,0,0,1,0,0,0
21971,18,18.0,1.0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0


In [19]:
X_test.head(2).to_json(orient = 'split', index = False)

'{"columns":["age","education","years_exp","compensation","python","r","sql","Q1_Male","Q1_Female","Q1_Prefer not to say","Q1_Prefer to self-describe","Q3_United States of America","Q3_India","Q3_China","major_cs","major_other","major_eng","major_stat"],"data":[[22,16.0,1.0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0],[25,18.0,1.0,70000,1,1,0,1,0,0,0,1,0,0,0,1,0,0]]}'

The output above is a JSON string, and it must be converted to a dictionary so that it can be nested inside another dictionary.  

This is a JSON string, and we need a Python dictionary so that we can embed it in another
dictionary. Call this value DICT. We must place it in another dictionary under the
key dataframe_split: {'dataframe_split': DICT}. We will use the json.loads function to create a
dictionary from the string. (We can't use the Python string because the quotes are incorrect
for JSON.)

Here is the JSON data we need to insert into the dictionary:

In [20]:
import json

In [21]:
json.loads(X_test.head(2).to_json(orient='split', index=False))

{'columns': ['age',
  'education',
  'years_exp',
  'compensation',
  'python',
  'r',
  'sql',
  'Q1_Male',
  'Q1_Female',
  'Q1_Prefer not to say',
  'Q1_Prefer to self-describe',
  'Q3_United States of America',
  'Q3_India',
  'Q3_China',
  'major_cs',
  'major_other',
  'major_eng',
  'major_stat'],
 'data': [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
  [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}

In [22]:
{'dataframe_split': json.loads(X_test.head(2).to_json(orient='split', index=False))}

{'dataframe_split': {'columns': ['age',
   'education',
   'years_exp',
   'compensation',
   'python',
   'r',
   'sql',
   'Q1_Male',
   'Q1_Female',
   'Q1_Prefer not to say',
   'Q1_Prefer to self-describe',
   'Q3_United States of America',
   'Q3_India',
   'Q3_China',
   'major_cs',
   'major_other',
   'major_eng',
   'major_stat'],
  'data': [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
   [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}

Finally, the Python dictionary has to be turned back into a JSON string.  
Use json.dumps for this.
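
json.dumps matters here because calling str() on a dictionary gives Python's repr with single quotes, which a JSON parser rejects. A quick standard-library check with a made-up one-column payload:

```python
import json

data = {'dataframe_split': {'columns': ['age'], 'data': [[22]]}}

as_repr = str(data)         # "{'dataframe_split': ...}" (Python syntax)
as_json = json.dumps(data)  # '{"dataframe_split": ...}' (JSON syntax)

try:
    json.loads(as_repr)
    repr_is_json = True
except json.JSONDecodeError:
    repr_is_json = False

print(repr_is_json)                 # False
print(json.loads(as_json) == data)  # True
```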

In [23]:
import json

In [24]:
post_data = json.dumps({'dataframe_split': json.loads(X_test.head(2).to_json(orient='split', index=False))})
post_data

'{"dataframe_split": {"columns": ["age", "education", "years_exp", "compensation", "python", "r", "sql", "Q1_Male", "Q1_Female", "Q1_Prefer not to say", "Q1_Prefer to self-describe", "Q3_United States of America", "Q3_India", "Q3_China", "major_cs", "major_other", "major_eng", "major_stat"], "data": [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}'

Let's wrap this in a function.

In [35]:
def create_post_data(df):
    dictionary = json.loads(df
                            .to_json(orient='split', index=False))
    return json.dumps({'dataframe_split': dictionary})

In [36]:
post_data = create_post_data(X_test.head(2))
print(post_data)

{"dataframe_split": {"columns": ["age", "education", "years_exp", "compensation", "python", "r", "sql", "Q1_Male", "Q1_Female", "Q1_Prefer not to say", "Q1_Prefer to self-describe", "Q3_United States of America", "Q3_India", "Q3_China", "major_cs", "major_other", "major_eng", "major_stat"], "data": [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}


In Jupyter, a leading ! runs a shell command.  

```
!curl http://127.0.0.1:1234/invocations -X POST -H \
"Content-Type:application/json" --data $post_data
```

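However, running it as shown raises an error.
The $ prefix makes Jupyter substitute the Python variable post_data into the command, but the shell then splits the unquoted JSON on spaces and strips its double quotes.
The value therefore needs to be wrapped in single quotes.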
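
The underlying issue is shell word splitting. The effect can be reproduced with shlex from the standard library, which follows POSIX shell rules. The tiny payload below is made up for illustration.

```python
import json
import shlex

payload = json.dumps({'dataframe_split': {'columns': ['a b'], 'data': [[1]]}})

# Unquoted: the shell splits on spaces and strips the double quotes.
unquoted = shlex.split(f'curl --data {payload}')
# Quoted: shlex.quote wraps the JSON in single quotes, keeping it whole.
quoted_args = shlex.split(f'curl --data {shlex.quote(payload)}')

print(len(unquoted))               # more than 3 pieces: payload was broken up
print(quoted_args[-1] == payload)  # True: payload arrives as one argument
```

shlex.quote also escapes any single quote inside the payload, which a plain f-string wrapper would not.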

In [37]:
post_data

'{"dataframe_split": {"columns": ["age", "education", "years_exp", "compensation", "python", "r", "sql", "Q1_Male", "Q1_Female", "Q1_Prefer not to say", "Q1_Prefer to self-describe", "Q3_United States of America", "Q3_India", "Q3_China", "major_cs", "major_other", "major_eng", "major_stat"], "data": [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}'

In [38]:
quoted = f"'{post_data}'"
quoted

'\'{"dataframe_split": {"columns": ["age", "education", "years_exp", "compensation", "python", "r", "sql", "Q1_Male", "Q1_Female", "Q1_Prefer not to say", "Q1_Prefer to self-describe", "Q3_United States of America", "Q3_India", "Q3_China", "major_cs", "major_other", "major_eng", "major_stat"], "data": [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}\''

In [41]:
def create_post_data(df, quote=True):
    dictionary = {'dataframe_split': json.loads(df
       .to_json(orient='split', index=False))}
    if quote:
        # Serialize to real JSON first; formatting the dict directly
        # would emit Python repr (single quotes), which is not valid JSON.
        return f"'{json.dumps(dictionary)}'"
    else:
        return dictionary

In [42]:
quoted = create_post_data(X_test.head(2))
quoted

'\'{"dataframe_split": {"columns": ["age", "education", "years_exp", "compensation", "python", "r", "sql", "Q1_Male", "Q1_Female", "Q1_Prefer not to say", "Q1_Prefer to self-describe", "Q3_United States of America", "Q3_India", "Q3_China", "major_cs", "major_other", "major_eng", "major_stat"], "data": [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}\''

Now let's try it.

In [43]:
!curl http://127.0.0.1:1234/invocations -x post -h "content-type:application/json" --data $quoted

Usage: curl [options...] <url>
Invalid category provided, here is a list of all categories:

 auth        Different types of authentication methods
 connection  Low level networking operations
 curl        The command line tool itself
 dns         General DNS options
 file        FILE protocol options
 ftp         FTP protocol options
 http        HTTP and HTTPS protocol options
 imap        IMAP protocol options
 misc        Options that don't fit into any other category
 output      Filesystem output
 pop3        POP3 protocol options
 post        HTTP Post specific options
 proxy       All options related to proxies
 scp         SCP protocol options
 sftp        SFTP protocol options
 smtp        SMTP protocol options
 ssh         SSH protocol options
 telnet      TELNET protocol options
 tftp        TFTP protocol options
 tls         All TLS/SSL related options
 upload      All options for uploads
 verbose     Options related to any kind of command line output of curl


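- This did not work. curl options are case-sensitive: lowercase -x sets a proxy and -h requests help, which is why curl printed its help categories instead of sending the request. The flags need to be upper-case -X POST and -H, as in the earlier command.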
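
One way to sidestep shell quoting altogether is to hand curl its arguments as a list, so that no shell ever parses the JSON. The sketch below only builds the list. The URL is the local server from above, the payload is a made-up single row (in the notebook it would be the unquoted post_data string), and the subprocess call is left commented out because it needs the server running.

```python
import json

url = 'http://127.0.0.1:1234/invocations'
# Illustrative payload only; a real request needs all model columns.
payload = json.dumps({'dataframe_split': {'columns': ['age'], 'data': [[22]]}})

# Note the upper-case -X and -H: curl options are case-sensitive.
curl_args = ['curl', url,
             '-X', 'POST',
             '-H', 'Content-Type:application/json',
             '--data', payload]   # passed as one argument, no quoting needed

print(curl_args[-1])
# import subprocess
# subprocess.run(curl_args, capture_output=True, text=True)
```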

### 21.6 Querying with the Requests Library

This code uses the requests library to make a POST request. It will predict the first two test
rows. It uses the pandas .to_json method and requires setting orient='split'. It returns 1
(software engineer) for the first row and 0 (data scientist) for the second.
In this case, because we are sending the JSON data as a dictionary and not a quoted string,
we will pass in quote=False.

In [44]:
import requests as req
import json

In [46]:
r = req.post('http://127.0.0.1:1234/invocations', 
             json=create_post_data(X_test.head(2), quote=False))

In [47]:
print(r.text)

{"predictions": [1, 0]}


- The requests approach works fine.
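
The response body is itself JSON, so the predictions can be decoded and mapped back to job titles. A small sketch using the response text shown above. The label mapping follows the description in this section (1 for software engineer, 0 for data scientist). In the notebook, the fitted label_encoder would presumably play this role.

```python
import json

response_text = '{"predictions": [1, 0]}'   # copied from the output above

# Mapping taken from the prose in this section; an assumption, not server output.
labels = {0: 'data scientist', 1: 'software engineer'}

predictions = json.loads(response_text)['predictions']
decoded = [labels[p] for p in predictions]
print(decoded)   # ['software engineer', 'data scientist']
```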

### 21.7 Building with Docker

mlflow models build-docker -m ex2path/22f11ad59e22476a8fd05701dd888e21/artifacts/model -n MODEL_NAME

Does this require installing mlflow[extras]?  

pip install mlflow[extras]

- Not recorded here because there was no Dockerfile at the path.
- To be tried later.