# Rapidly Launching an AI Application on a Machine Learning Database: RUL

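Remaining useful life (RUL) is the length of time a system can keep operating normally after it has already been in service for some period. With an RUL estimate, engineers can schedule maintenance, optimize operating efficiency, and avoid unplanned downtime, which makes RUL prediction the top priority in a predictive maintenance program.
The task here is to build a real-time intelligent application that predicts remaining useful life with a machine learning model. We use NASA's [Turbofan Engine Degradation Simulation Data Set](https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/#turbofan) as the training and test sets.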
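To make the definition concrete, the following minimal sketch computes an RUL label for every recording in a run-to-failure history. It assumes the common convention that an engine's RUL at a given cycle is its last observed cycle minus the current cycle. The function name `add_rul_labels` and the tiny sample are illustrative only and are not part of the demo code.

```python
from collections import defaultdict


def add_rul_labels(records):
    """Attach an RUL label to each (engine_no, time_in_cycles) record.

    Assumption: each engine's history runs to failure, so its last
    observed cycle marks the end of its useful life.
    """
    last_cycle = defaultdict(int)
    for engine_no, cycle in records:
        last_cycle[engine_no] = max(last_cycle[engine_no], cycle)
    return [
        (engine_no, cycle, last_cycle[engine_no] - cycle)
        for engine_no, cycle in records
    ]


# Hypothetical sample: engine 1 is observed for 3 cycles, engine 2 for 2.
sample = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]
labeled = add_rul_labels(sample)
```

Each output tuple carries the engine number, the cycle, and the number of cycles left before that engine's final recording.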

The entire application is developed in a [notebook](http://ipython.org/notebook.html).


## Initialize the environment
Initialization installs fedb and its runtime environment. For the initialization script, see https://github.com/4paradigm/DemoApps/blob/main/predict-remaining-useful-life-nb/demo/init.sh

In [1]:
!cd demo && sh init.sh

ZooKeeper JMX enabled by default
Using config: /home/jovyan/work/zookeeper-3.4.14/bin/../conf/zoo.cfg
Starting zookeeper ... already running as process 729.
Starting tablet ... tablet already running as process 798.
Starting nameserver ... nameserver already running as process 851.
2021-06-16 11:35:22,068:2006(0x7f2ca10d2a00):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.14
2021-06-16 11:35:22,068:2006(0x7f2ca10d2a00):ZOO_INFO@log_env@757: Client environment:host.name=m7-pce-dev01
2021-06-16 11:35:22,068:2006(0x7f2ca10d2a00):ZOO_INFO@log_env@764: Client environment:os.name=Linux
2021-06-16 11:35:22,068:2006(0x7f2ca10d2a00):ZOO_INFO@log_env@765: Client environment:os.arch=3.10.0-1127.18.2.el7.x86_64
2021-06-16 11:35:22,068:2006(0x7f2ca10d2a00):ZOO_INFO@log_env@766: Client environment:os.version=#1 SMP Sun Jul 26 15:27:06 UTC 2020
2021-06-16 11:35:22,068:2006(0x7f2ca10d2a00):ZOO_INFO@log_env@774: Client environment:user.name=(null)
2021-06-16 11:35:22,

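## Import historical data into fedb

Computing time-series features with fedb requires historical data, so we import it into fedb, where real-time inference can use it to compute features. For the import code, see https://github.com/4paradigm/DemoApps/blob/main/predict-taxi-trip-duration-nb/demo/import.py
Here data/test_FD004.txt serves as the historical data.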
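As a rough illustration of what one record in such a file looks like, the sketch below parses a single line into a named row. It assumes the usual layout of the turbofan data files: 26 whitespace-separated values per line, namely the engine number, the cycle count, three operational settings, and 21 sensor measurements. The column names follow the feature names printed later in this notebook, and `parse_recording` is a hypothetical helper, not the code in import.py.

```python
# Assumed column layout: 2 identifiers + 3 settings + 21 sensors = 26 fields.
COLUMNS = (
    ["engine_no", "time_in_cycles"]
    + [f"operational_setting_{i}" for i in range(1, 4)]
    + [f"sensor_measurement_{i}" for i in range(1, 22)]
)


def parse_recording(line):
    """Turn one whitespace-separated line into a dict keyed by column name."""
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    row = dict(zip(COLUMNS, values))
    # The two identifier columns are integers; everything else is numeric.
    row["engine_no"] = int(row["engine_no"])
    row["time_in_cycles"] = int(row["time_in_cycles"])
    for name in COLUMNS[2:]:
        row[name] = float(row[name])
    return row


# Made-up line in the assumed layout (not real data from the file).
example_line = "1 1 " + " ".join(["0.5"] * 24)
row = parse_recording(example_line)
```

Rejecting lines with an unexpected field count keeps a layout mismatch from silently shifting values into the wrong columns.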

In [3]:
!cd demo && python3 import.py

2021-06-16 11:35:39,101:2075(0x7f7033716740):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.14
2021-06-16 11:35:39,101:2075(0x7f7033716740):ZOO_INFO@log_env@757: Client environment:host.name=m7-pce-dev01
2021-06-16 11:35:39,101:2075(0x7f7033716740):ZOO_INFO@log_env@764: Client environment:os.name=Linux
2021-06-16 11:35:39,101:2075(0x7f7033716740):ZOO_INFO@log_env@765: Client environment:os.arch=3.10.0-1127.18.2.el7.x86_64
2021-06-16 11:35:39,101:2075(0x7f7033716740):ZOO_INFO@log_env@766: Client environment:os.version=#1 SMP Sun Jul 26 15:27:06 UTC 2020
2021-06-16 11:35:39,101:2075(0x7f7033716740):ZOO_INFO@log_env@774: Client environment:user.name=(null)
2021-06-16 11:35:39,101:2075(0x7f7033716740):ZOO_INFO@log_env@782: Client environment:user.home=/root
2021-06-16 11:35:39,101:2075(0x7f7033716740):ZOO_INFO@log_env@794: Client environment:user.dir=/home/jovyan/work/rul/demo
2021-06-16 11:35:39,101:2075(0x7f7033716740):ZOO_INFO@zookeeper_init@827: Initi

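## Train the model on the data

The model is trained on the labeled data. This task uses the following:

* Training script: https://github.com/4paradigm/DemoApps/blob/main/predict-taxi-trip-duration-nb/demo/train_sql.py
* Training data: train_FD004.txt

The task ultimately produces a model, saved as simple_fm.csv.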

In [3]:
# !cd demo && python3 train_by_ft.py

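## Build a real-time inference HTTP service connected to fedb using the trained model

Using the model generated in the previous step and the historical data in fedb, we build a real-time inference service. For the full service code, see https://github.com/4paradigm/DemoApps/blob/main/predict-taxi-trip-duration-nb/demo/predict_server.py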

In [4]:
# !cd demo && sh start_predict_server.sh # TODO
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import pandas as pd
import utils

fm = pd.read_csv('demo/simple_fm.csv', index_col='engine_no')
X = fm.copy().fillna(0)
y = X.pop('remaining_useful_life')

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)

# skip baselines

reg = RandomForestRegressor(n_estimators=100)
reg.fit(X_train, y_train)

preds = reg.predict(X_test)
scores = mean_absolute_error(y_test, preds)  # scikit-learn's order is (y_true, y_pred)
print('Mean Abs Error: {:.2f}'.format(scores))

high_imp_feats = utils.feature_importances(X, reg, feats=10)

Mean Abs Error: 48.02
1: MAX(recordings.sensor_measurement_4) [0.154]
2: MAX(recordings.sensor_measurement_13) [0.105]
3: MAX(recordings.sensor_measurement_11) [0.105]
4: MAX(recordings.sensor_measurement_15) [0.097]
5: MAX(recordings.sensor_measurement_3) [0.063]
6: MAX(recordings.sensor_measurement_2) [0.059]
7: MAX(recordings.operational_setting_2) [0.058]
8: MAX(recordings.sensor_measurement_21) [0.051]
9: MAX(recordings.sensor_measurement_8) [0.044]
10: MAX(recordings.sensor_measurement_9) [0.043]
-----
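The feature names in the output above, such as `MAX(recordings.sensor_measurement_4)`, suggest that each feature is a per-engine aggregate over that engine's recordings. The sketch below shows one way such a MAX aggregate could be computed in plain Python. It is an assumption about the shape of the features, not the demo's actual feature-generation code, and `max_features` and the sample rows are hypothetical.

```python
def max_features(rows, columns):
    """Return {engine_no: {"MAX(recordings.<col>)": value}} for the given columns."""
    features = {}
    for row in rows:
        engine = features.setdefault(row["engine_no"], {})
        for col in columns:
            key = f"MAX(recordings.{col})"
            value = row[col]
            # Keep the largest value seen so far for this engine and column.
            engine[key] = value if key not in engine else max(engine[key], value)
    return features


# Hypothetical recordings for two engines.
recordings = [
    {"engine_no": 1, "sensor_measurement_4": 1100.0},
    {"engine_no": 1, "sensor_measurement_4": 1125.5},
    {"engine_no": 2, "sensor_measurement_4": 1090.0},
]
fm_sketch = max_features(recordings, ["sensor_measurement_4"])
```

The result has one entry per engine, matching how the notebook indexes its feature matrix by `engine_no`.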



## Send an inference request over HTTP



In [29]:
# !cd demo && python3 predict.py # TODO
# no real-time data, just calc from fedb
calc_sql = """            
        select recordings.time_in_cycles
            from t1 as recordings last join (select time_in_cycles
            from t1 tt group by tt.time_in_cycles) as cycles on recordings.time_in_cycles=cycles.time_in_cycles

;
"""

import sqlalchemy as db
engine = db.create_engine('fedb:///db_test?zk=127.0.0.1:2181&zkPath=/fedb')
connection = engine.connect()

from pandas import DataFrame, Series
import numpy as np
fm2 = DataFrame()

result = connection.execute(calc_sql)
print(result.rowcount)

# for r in result:
# #     print(r)
#     fm2 = fm2.append([np.array(r).tolist()])

fm2.head()

0


In [13]:
X = fm2.copy().fillna(0)
y = pd.read_csv(
    'demo/data/RUL_FD004.txt',
    sep=' ',
    header=None,
    names=['remaining_useful_life'],
    index_col=False,
)

preds2 = reg.predict(X)
mae = mean_absolute_error(y, preds2)  # scikit-learn's order is (y_true, y_pred)
print('Mean Abs Error: {:.2f}'.format(mae))

ValueError: at least one array or dtype is required
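The ValueError above is consistent with `fm2` still being an empty DataFrame when it reaches `reg.predict`: the loop that would fill it is commented out, and the query reported a row count of 0. One way to surface this earlier is to validate the query rows before building the feature matrix. The sketch below assumes rows arrive as sequences of numeric values. `rows_to_matrix` is a hypothetical helper and is not part of fedb or scikit-learn.

```python
def rows_to_matrix(rows):
    """Convert query rows to a list-of-lists feature matrix, failing early if empty."""
    matrix = [[float(value) for value in row] for row in rows]
    if not matrix:
        raise ValueError("query returned no rows; there is nothing to predict on")
    width = len(matrix[0])
    if width == 0 or any(len(row) != width for row in matrix):
        raise ValueError("rows must be non-empty and all have the same length")
    return matrix


# Hypothetical query result with two rows of two features each.
matrix = rows_to_matrix([(1, 2.5), (3, 4.0)])
```

With a check like this, an empty result fails at the point where the data is read, with a message that names the actual cause.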