# EasyMLOps  
  
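## Introduction
The `EasyMLOps` package builds modeling tasks as a `Pipeline`. It supports model training, prediction (offline and online), and testing (offline/online prediction consistency, prediction performance), and can be deployed to production by wrapping it in a thin Flask or FastAPI layer. The main features at present are: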

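### 1. Basic modeling module

- Data cleaning: automatic filling, type conversion, clipping, normalization, binning, arithmetic operations, logical operations, comparisons, and so on: easymlops.ml.preprocessing
- Feature processing:
  - Feature encoding, including Target, Label, Onehot Encoding, WOEEncoding, and others: easymlops.ml.encoding
  - Dimensionality reduction, including PCA, NMF, and others: easymlops.ml.decomposition
  - Feature selection: easymlops.ml.feature_selection
    - Filter methods: fill rate (missing rate), variance, correlation, chi-square, P-value, mutual information, IV, PSI, and others
    - Embedded methods: LR, LightGBM, and others
- Classification models, including traditional machine learning models such as LightGBM decision trees, logistic regression, and SVM: easymlops.ml.classification
- Regression models, including LightGBM decision trees: easymlops.ml.regression
- Stacking: the Parallel module trains several models in the same stage, which makes it convenient to build stacking models: easymlops.ml.ensemble.Parallel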

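### 2. Text/NLP processing module
- Text cleaning, including stop-word removal, punctuation removal, removal of specific characters, Chinese character extraction, jieba Chinese word segmentation, keyword extraction, and n-gram feature extraction: easymlops.nlp.preprocessing
- Feature extraction, including traditional models such as BOW and TF-IDF; topic models such as LDA and LSI; and word-embedding models such as fastText, word2vec, and doc2vec: easymlops.nlp.representation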

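### 3. Training performance optimization module (mainly reducing memory usage)

- easymlops.ml.perfopt.ReduceMemUsage: changes column data types. For example, if a feature's values fit within the float16 range but the column is currently float64, it is converted from float64 to float16
- easymlops.ml.perfopt.Dense2Sparse: converts a dense matrix to a sparse matrix (useful when most entries are 0). Note that downstream pipe modules must support sparse matrices (most modules under easymlops.ml.classification do)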

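### 4. Splitting and combining pipelines, running to a given layer, and retrieving intermediate pipe modules

- A pipeml can contain another pipeml as a sub-module, which makes it easy to model block by block and then combine the blocks
- A pipeml can return intermediate-layer data, which makes it easy to reuse someone else's model and continue with your own next step: pipeobj.transform(data,run_to_layer=layer index or module name)
- Two ways to retrieve a specific pipe module
- Running a pipeml by slicing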


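### 5. Pipeline training, prediction, and persistence

- Training interface: fit
- Prediction interfaces: transform for batch prediction and transform_single for single-record prediction
- Persistence: save/load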

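### 6. Custom pipe modules and interface extensions

- fit, transform: implementing just these two functions is enough to plug into a pipeline
- set_params, get_params: implementing these two functions makes the module persistable
- transform_single: supports production prediction
- Adding custom function interfaces and how to call them
- Advanced function interfaces: _fit, _transform, _transform_single, _set_params, _get_params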


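### 7. Production deployment: logging, prediction consistency tests, performance tests, null-value tests, and extreme-value tests

- Production prediction interface: pipeobj.transform_single(data) predicts on production data (usually converted to a dict)
- Logging: pipeobj.transform_single(data,logger) traces and records every step of the pipeline's prediction
- Prediction consistency and performance test: pipeobj.check_transform_function(data) tests the consistency of transform/transform_single and the performance of each pipe module
- Null-value test: pipeobj.check_null_value(data) checks whether the final prediction stays the same for the various forms a null can take, such as a missing key, None, null, nan, np.nan, ...
- Extreme-value test: pipeobj.check_extreme_value(data) checks whether the pipeline still produces normal output for extreme inputs. For example, if a column normally ranges from 1 to 100, production might send inf, 0, the max float, or the min float
- Type-inversion test: pipeobj.check_inverse_dtype(data). For example, if the model was trained on a numeric column, does the code raise an error when given the string "1"? If it was trained on string data, does it crash when given the number 0.01?
- int-to-float test: pipeobj.check_int_trans_float(data). pandas automatically infers some features as int, while production may send float, so both cases need to be tested for consistency
- Automated test interface: pipeobj.auto_test(data) runs all of the tests above in sequence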
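
The consistency and null-value checks listed above can be pictured with a small sketch that does not use EasyMLOps at all. All names here (`batch_transform`, `single_transform`, `check_consistency`) are hypothetical stand-ins, and the two paths deliberately treat `nan` differently so that the check has something to find.

```python
import math


def batch_transform(rows):
    # Hypothetical batch path: only None or a missing key counts as null.
    return [{"Age2": (r.get("Age") or 0) * 2} for r in rows]


def single_transform(row):
    # Hypothetical single-record path: None and nan both count as null.
    age = row.get("Age")
    if age is None or (isinstance(age, float) and math.isnan(age)):
        age = 0
    return {"Age2": age * 2}


def check_consistency(rows, tol=1e-9):
    """Return the indices of rows where the two paths disagree."""
    batch_out = batch_transform(rows)
    mismatches = []
    for i, row in enumerate(rows):
        single_out = single_transform(dict(row))
        for key, expected in batch_out[i].items():
            got = single_out.get(key)
            if got is None or not math.isclose(expected, got, abs_tol=tol):
                mismatches.append(i)
                break
    return mismatches


ordinary_rows = [{"Age": 22.0}, {"Age": None}, {}]
print(check_consistency(ordinary_rows))
print(check_consistency([{"Age": float("nan")}]))
```

The two paths agree on ordinary rows and on `None`, and only the `nan` spelling exposes the difference, which is why trying several null representations is worthwhile.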


## 0. Installation
```bash
pip install easymlops
```  
or

```bash
pip install git+https://github.com/zhulei227/EasyMLOps
```  
or

```bash
git clone https://github.com/zhulei227/EasyMLOps.git
cd EasyMLOps
python setup.py install
```  
or

Copy the whole easymlops package into the same directory as the code you are running, then install the dependencies:
```bash
pip install -r requirements.txt
```

## 1. Basic modeling module

Import the main `PipeML` class

In [1]:
from easymlops import PipeML

Prepare the data as a `pandas.DataFrame`

In [2]:
import pandas as pd
data=pd.read_csv("./data/demo.csv")
data.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Split the data into training and test sets

In [3]:
x_train=data[:500]
x_test=data[500:]
y_train=x_train["Survived"]
y_test=x_test["Survived"]
del x_train["Survived"]
del x_test["Survived"]

### 1.1 Data cleaning

In [4]:
from easymlops.ml.preprocessing import *
ml=PipeML()
ml.pipe(FixInput())\
  .pipe(FillNa(cols=["Cabin","Ticket","Parch","Fare","Sex"],fill_mode="mode"))\
  .pipe(FillNa(cols=["Age"],fill_mode="mean"))\
  .pipe(FillNa(fill_detail={"Embarked":"N"}))\
  .pipe(FillNa())\
  .pipe(TransToCategory(cols=["Cabin","Embarked","Name"]))\
  .pipe(TransToFloat(cols=["Age","Fare"]))\
  .pipe(TransToInt(cols=["Pclass","PassengerId","SibSp","Parch"]))\
  .pipe(TransToLower(cols=["Ticket","Cabin","Embarked","Name","Sex"]))\
  .pipe(CategoryMapValues(map_detail={"Cabin":(["nan","NaN"],"n")}))\
  .pipe(Clip(cols=["Age"],default_clip=(1,99),name="clip_name"))\
  .pipe(Clip(cols=["Fare"],percent_range=(1,99),name="clip_fare"))\
  .pipe(MinMaxScaler(cols=[("Age","Age_minmax")]))\
  .pipe(Normalizer(cols=[("Fare","Fare_normal")]))\
  .pipe(Bins(n_bins=10,strategy="uniform",cols=[("Age","Age_uni")]))\
  .pipe(Bins(n_bins=10,strategy="quantile",cols=[("Age","Age_quan")]))\
  .pipe(Bins(n_bins=10,strategy="kmeans",cols=[("Fare","Fare_km")]))

x_test_new=ml.fit(x_train).transform(x_test)
x_test_new.head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_minmax,Fare_normal,Age_uni,Age_quan,Fare_km
500,501,3,"calic, mr. petar",male,17.0,0,0,315086,8.664062,n,s,0.228571,-0.518231,2.0,1.0,1.0
501,502,3,"canavan, miss. mary",female,21.0,0,0,364846,7.75,n,q,0.285714,-0.539177,2.0,2.0,1.0
502,503,3,"o'sullivan, miss. bridget mary",female,29.203125,0,0,330909,7.628906,n,q,0.402902,-0.541952,4.0,5.0,1.0
503,504,3,"laitinen, miss. kristina sofia",female,37.0,0,0,4135,9.585938,n,s,0.514286,-0.497106,5.0,6.0,1.0
504,505,1,"maioni, miss. roberta",female,16.0,0,0,110152,86.5,b79,s,0.214286,1.265409,2.0,1.0,4.0
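
To make the clipping and min-max steps concrete, here is a minimal pure-Python sketch. It is not the EasyMLOps implementation: `clip` and `MinMaxSketch` are hypothetical names, and the training ages are made up so that the clipped range is 1 to 71, which would be consistent with the `Age_minmax` values in the table above.

```python
def clip(values, low, high):
    """Cap each value to the closed interval [low, high]."""
    return [min(max(v, low), high) for v in values]


class MinMaxSketch:
    """Learns min and max on training data, then rescales with them."""

    def fit(self, values):
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        span = self.high - self.low
        if span == 0:  # constant column: avoid division by zero
            return [0.0 for _ in values]
        return [(v - self.low) / span for v in values]


# Hypothetical training ages; 0.4 is capped to 1 by the clip step.
train_age = clip([0.4, 22.0, 38.0, 71.0], 1, 99)
scaler = MinMaxSketch().fit(train_age)

# Test values are clipped with the same bounds, then scaled with the
# statistics learned on the training data.
scaled = scaler.transform(clip([17.0, 21.0, 120.0], 1, 99))
print(scaled)
```

Because the scaler is fitted on training data only, a test value outside the training range (here 120, clipped to 99) can map outside [0, 1].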


### 1.2 Feature processing

#### 1.2.1 Feature encoding

In [5]:
from easymlops.ml.encoding import *

In [6]:
ml=PipeML()
ml.pipe(FixInput())\
  .pipe(FillNa())\
  .pipe(OneHotEncoding(cols=["Sex"],drop_col=False))\
  .pipe(LabelEncoding(cols=["Sex"]))\
  .pipe(TargetEncoding(cols=["Name","Ticket"],y=y_train))\
  .pipe(WOEEncoding(cols=["Pclass","Embarked","Cabin"],y=y_train,name="woe"))

x_test_new=ml.fit(x_train).transform(x_test)
x_test_new.head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_male,Sex_female
500,501,0.482439,0.0,1,17.0,0,0,0.0,8.664062,0.299607,0.224849,1,0
501,502,0.482439,0.0,2,21.0,0,0,0.0,7.75,0.299607,-0.508609,0,1
502,503,0.482439,0.0,2,0.0,0,0,0.0,7.628906,0.299607,-0.508609,0,1
503,504,0.482439,0.0,2,37.0,0,0,0.0,9.585938,0.299607,0.224849,0,1
504,505,-0.741789,0.0,2,16.0,0,0,1.0,86.5,0.0,0.224849,0,1


In [7]:
#View the details of the WOE encoding layer
ml["woe"].show_detail().head()

Unnamed: 0,col,bin_value,bad_num,bad_rate,good_num,good_rate,woe,iv
0,Pclass,3,78,0.404145,201,0.654723,0.482439,0.120889
1,Pclass,1,66,0.341969,50,0.162866,-0.741789,0.132856
2,Pclass,2,49,0.253886,56,0.18241,-0.330626,0.023632
3,Embarked,S,121,0.626943,241,0.785016,0.224849,0.035543
4,Embarked,C,48,0.248705,44,0.143322,-0.551169,0.058083
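
The columns of this table are related by simple formulas, which the following sketch reproduces from the `Pclass` counts shown above. It assumes the common definitions `woe = ln(good_rate / bad_rate)` and `iv = (good_rate - bad_rate) * woe`, which match the printed numbers; `woe_iv` is a hypothetical helper, not a library function, and it does not handle bins with a zero count.

```python
import math


def woe_iv(bad_counts, good_counts):
    """Return {bin_value: (woe, iv)} from per-bin bad and good counts."""
    total_bad = sum(bad_counts.values())
    total_good = sum(good_counts.values())
    result = {}
    for bin_value in bad_counts:
        bad_rate = bad_counts[bin_value] / total_bad
        good_rate = good_counts[bin_value] / total_good
        woe = math.log(good_rate / bad_rate)
        iv = (good_rate - bad_rate) * woe
        result[bin_value] = (woe, iv)
    return result


# bad_num and good_num for Pclass, copied from the table above.
pclass = woe_iv({3: 78, 1: 66, 2: 49}, {3: 201, 1: 50, 2: 56})
for bin_value, (woe, iv) in pclass.items():
    print(bin_value, round(woe, 6), round(iv, 6))
```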


#### 1.2.2 Dimensionality reduction

In [8]:
from easymlops.ml.decomposition import *

ml=PipeML()
ml.pipe(FixInput())\
  .pipe(FillNa())\
  .pipe(OneHotEncoding(cols=["Pclass","Sex"],drop_col=False))\
  .pipe(LabelEncoding(cols=["Sex","Pclass"]))\
  .pipe(TargetEncoding(cols=["Name","Ticket","Embarked","Cabin"],y=y_train))\
  .pipe(PCADecomposition(n_components=4))

x_test_new=ml.fit(x_train).transform(x_test)
x_test_new.head(5)

Unnamed: 0,0,1,2,3
0,244.143261,-26.432507,-5.579491,0.140058
1,244.14944,-27.066148,-1.535471,0.032267
2,244.048761,-28.613606,-22.472884,-0.393394
3,244.250159,-24.147862,14.298477,0.354024
4,245.210998,51.165312,-11.852065,-1.108347


#### 1.2.3 Feature selection
##### Filter methods

In [9]:
from easymlops.ml.feature_selection import *
ml=PipeML()
ml.pipe(FixInput())\
  .pipe(MissRateFilter(max_threshold=0.1))\
  .pipe(VarianceFilter(min_threshold=0.1))\
  .pipe(PersonCorrFilter(min_threshold=0.1,y=y_train,name="person"))\
  .pipe(PSIFilter(oot_x=x_test,cols=["Pclass","Sex","Embarked"],name="psi",max_threshold=0.5))\
  .pipe(LabelEncoding(cols=["Sex","Ticket","Embarked","Pclass"]))\
  .pipe(TargetEncoding(cols=["Name","Cabin"],y=y_train))\
  .pipe(Chi2Filter(y=y_train,name="chi2"))\
  .pipe(MutualInfoFilter(y=y_train))\
  .pipe(IVFilter(y=y_train,name="iv",cols=["Sex","Fare"],min_threshold=0.05))

x_test_new=ml.fit(x_train).transform(x_test)
x_test_new.head(5)

Unnamed: 0,Pclass,Name,Sex,Ticket,Fare,Cabin,Embarked
500,1,0.0,1,0,8.664062,0.317829,1
501,1,0.0,2,0,7.75,0.317829,3
502,1,0.0,2,0,7.628906,0.317829,3
503,1,0.0,2,0,9.585938,0.317829,1
504,2,0.0,2,354,86.5,0.0,1


In [10]:
#View the PSI calculation details
ml["psi"].show_detail().head()

Unnamed: 0,col,bin_value,ins_num,ins_rate,oot_num,oot_rate,psi
0,Pclass,3,279,0.558,212,0.542199,0.000454
1,Pclass,1,116,0.232,100,0.255754,0.002316
2,Pclass,2,105,0.21,79,0.202046,0.000307
3,Sex,male,315,0.63,262,0.670077,0.002472
4,Sex,female,185,0.37,129,0.329923,0.004595
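
The per-bin `psi` column can be reproduced from the counts with the usual population stability index formula, `(ins_rate - oot_rate) * ln(ins_rate / oot_rate)`. The sketch below assumes that definition, which agrees with the `Pclass` rows above; `psi_by_bin` is a hypothetical helper and does not guard against empty bins.

```python
import math


def psi_by_bin(ins_counts, oot_counts):
    """Return {bin_value: psi contribution} for one feature."""
    ins_total = sum(ins_counts.values())
    oot_total = sum(oot_counts.values())
    result = {}
    for bin_value in ins_counts:
        ins_rate = ins_counts[bin_value] / ins_total
        oot_rate = oot_counts[bin_value] / oot_total
        result[bin_value] = (ins_rate - oot_rate) * math.log(ins_rate / oot_rate)
    return result


# ins_num and oot_num for Pclass, copied from the table above.
pclass_psi = psi_by_bin({3: 279, 1: 116, 2: 105}, {3: 212, 1: 100, 2: 79})
total_psi = sum(pclass_psi.values())
print({k: round(v, 6) for k, v in pclass_psi.items()}, round(total_psi, 6))
```

Summing the bins gives a feature-level value far below the `max_threshold=0.5` used in the example, so under this reading `Pclass` would pass the filter.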


##### Embedded methods

In [11]:
ml=PipeML()
ml.pipe(FixInput())\
  .pipe(FillNa())\
  .pipe(LabelEncoding(cols=["Sex","Ticket","Embarked","Pclass"]))\
  .pipe(TargetEncoding(cols=["Name","Cabin"],y=y_train))\
  .pipe(LREmbed(y=y_train,min_threshold=0.01))\
  .pipe(LGBMEmbed(y=y_train,min_threshold=0.01))
  

x_test_new=ml.fit(x_train).transform(x_test)
x_test_new.head(5)

Unnamed: 0,Pclass,Name,Sex,SibSp,Parch,Cabin,Embarked
500,1,0.0,1,0,0,0.317829,1
501,1,0.0,2,0,0,0.317829,3
502,1,0.0,2,0,0,0.317829,3
503,1,0.0,2,0,0,0.317829,1
504,2,0.0,2,0,0,0.0,1


In [12]:
#View the LR weight distribution
ml[-2].show_detail()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,9.4e-05,0.200128,6.508567,1.029712,0.007582,0.169105,0.042399,0.000298,0.003048,0.909827,0.152208


In [13]:
#View the LightGBM split-count distribution (the default)
ml[-1].show_detail()

Unnamed: 0,Pclass,Name,Sex,SibSp,Parch,Cabin,Embarked
0,35,100,12,16,16,4,7


### 1.3 Classification models

In [14]:
from easymlops.ml.classification import *

ml=PipeML()
ml.pipe(FixInput())\
  .pipe(FillNa())\
  .pipe(OneHotEncoding(cols=["Pclass", "Sex"], drop_col=False)) \
  .pipe(WOEEncoding(cols=["Ticket", "Embarked", "Cabin", "Sex", "Pclass"], y=y_train)) \
  .pipe(LabelEncoding(cols=["Name"]))\
  .pipe(LGBMClassification(y=y_train,native_init_params={"max_depth":2},native_fit_params={"num_boost_round":128}))

x_test_new=ml.fit(x_train).transform(x_test)
x_test_new.head(5)

Unnamed: 0,0,1
0,0.940358,0.059642
1,0.234832,0.765168
2,0.238124,0.761876
3,0.506521,0.493479
4,0.035974,0.964026


In [15]:
#Get the Saabas feature contributions for one record
ml[-1].get_contrib(ml.transform_single(x_test.to_dict("records")[0],run_to_layer=-2))

{'Sex': -0.5342937672948507,
 'Cabin': -0.11912383721061759,
 'Age': 0.03491360138217343,
 'Fare': -0.16696530355596723,
 'SibSp': 0.03240589398019714,
 'Ticket': -0.013014167543831055,
 'Name': -0.10320265490747914,
 'PassengerId': -0.20656856690289452,
 'Embarked': -0.008278169631219159}

### 1.4 Regression models

In [16]:
from easymlops.ml.regression import *

ml=PipeML()
ml.pipe(FixInput())\
  .pipe(FillNa())\
  .pipe(OneHotEncoding(cols=["Pclass", "Sex"], drop_col=False)) \
  .pipe(WOEEncoding(cols=["Ticket", "Embarked", "Cabin", "Sex", "Pclass"], y=y_train)) \
  .pipe(LabelEncoding(cols=["Name"]))\
  .pipe(LGBMRegression(y=y_train,objective="poisson"))

x_test_new=ml.fit(x_train).transform(x_test)
x_test_new.head(5)

Unnamed: 0,pred
0,0.079148
1,1.039213
2,0.856584
3,0.44617
4,1.024305


In [17]:
#Get the Saabas feature contributions for one record
ml[-1].get_contrib(ml.transform_single(x_test.to_dict("records")[0],run_to_layer=-2))

{'Sex': -0.5850500364459678,
 'Cabin': -0.12662416302463073,
 'Fare': -0.14681876996785104,
 'Age': 0.32898357714165033,
 'Name': -0.1049118878477538,
 'PassengerId': -0.41125350326614885,
 'SibSp': -0.006007789150575066,
 'Pclass_2': 0.03219068665942829,
 'Parch': -0.014461325859364781,
 'Ticket': -0.0019810404323185892}

### 1.5 Stacking

In [18]:
from easymlops.ml.ensemble import Parallel
ml = PipeML()
ml.pipe(FixInput()) \
  .pipe(FillNa()) \
  .pipe(Parallel([OneHotEncoding(cols=["Pclass", "Sex"]), LabelEncoding(cols=[("Sex","Sex_label"), ("Pclass","Pclass_label")]),
                    TargetEncoding(cols=["Name", "Ticket", "Embarked", "Cabin", "Sex"], y=y_train)])) \
  .pipe(Parallel([PCADecomposition(n_components=2, prefix="pca"), NMFDecomposition(n_components=2, prefix="nmf")]))

ml.fit(x_train).transform(x_test).head(5)

Unnamed: 0,pca_0,pca_1,nmf_0,nmf_1
0,244.142899,-26.440506,6.209713,0.178597
1,244.149068,-27.073288,6.22505,0.168701
2,244.048386,-28.621173,6.222639,0.096321
3,244.24979,-24.1548,6.260848,0.267427
4,245.211626,51.175971,6.249152,2.144727


In [19]:
ml = PipeML()
ml.pipe(FixInput()) \
  .pipe(FillNa()) \
  .pipe(Parallel([OneHotEncoding(cols=["Pclass", "Sex"]), LabelEncoding(cols=["Sex", "Pclass"]),
                    TargetEncoding(cols=["Name", "Ticket", "Embarked", "Cabin", "Sex"], y=y_train)])) \
  .pipe(Parallel([PCADecomposition(n_components=2, prefix="pca"), NMFDecomposition(n_components=2, prefix="nmf")]))\
  .pipe(Parallel([LGBMClassification(y=y_train, prefix="lgbm"), LogisticRegressionClassification(y=y_train, prefix="lr")]))

ml.fit(x_train).transform(x_test).head(5)

Unnamed: 0,lgbm_0,lgbm_1,lr_0,lr_1
0,0.982302,0.017698,0.661457,0.338543
1,0.984471,0.015529,0.665126,0.334874
2,0.980724,0.019276,0.668726,0.331274
3,0.643435,0.356565,0.663189,0.336811
4,0.140687,0.859313,0.46816,0.53184
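
The idea behind running several modules in the same stage can be sketched in a few lines: every branch receives the same input, and the outputs are merged under a per-branch prefix. This is only an illustration of the pattern, not how `Parallel` is implemented; `run_parallel`, `model_a`, and `model_b` are hypothetical.

```python
def run_parallel(row, branches):
    """Apply each (prefix, function) branch to a copy of the same row and merge the outputs."""
    merged = {}
    for prefix, func in branches:
        out = func(dict(row))  # copy so branches cannot affect each other
        for key, value in out.items():
            merged[f"{prefix}_{key}"] = value
    return merged


def model_a(row):
    # Hypothetical model returning two class probabilities.
    return {"0": 1 - row["x"], "1": row["x"]}


def model_b(row):
    # Hypothetical model that always returns a flat prediction.
    return {"0": 0.5, "1": 0.5}


stacked = run_parallel({"x": 0.25}, [("a", model_a), ("b", model_b)])
print(stacked)
```

The merged columns (`a_0`, `a_1`, `b_0`, `b_1`) play the same role as `lgbm_0`, `lgbm_1`, `lr_0`, `lr_1` above and could feed a second-stage model.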


## 2. Text/NLP processing module
This currently covers text cleaning and text feature extraction.
### 2.1 Text cleaning

In [20]:
#Convert all characters in Name to lowercase, then replace all punctuation with spaces
from easymlops.nlp.preprocessing import *
nlp=PipeML()
nlp.pipe(FixInput())\
   .pipe(TargetEncoding(cols=["Sex","Ticket","Cabin","Embarked"],y=y_train))\
   .pipe(FillNa())\
   .pipe(Lower(cols=["Name"]))\
   .pipe(ReplacePunctuation(cols=["Name"],symbols=" "))

nlp.fit(x_train).transform(x_test).head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
500,501,3,calic mr petar,0.171429,17.0,0,0,0.0,8.664062,0.317829,0.334254
501,502,3,canavan miss mary,0.751351,21.0,0,0,0.0,7.75,0.317829,0.511111
502,503,3,o sullivan miss bridget mary,0.751351,0.0,0,0,0.0,7.628906,0.317829,0.511111
503,504,3,laitinen miss kristina sofia,0.751351,37.0,0,0,0.0,9.585938,0.317829,0.334254
504,505,1,maioni miss roberta,0.751351,16.0,0,0,1.0,86.5,0.0,0.334254


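### 2.2 Text feature extraction
Note: the downstream models expect the smallest processing units in the original column to be separated by spaces. For example, "calic mr petar" in the first row above is handled as three independent words: "calic", "mr", and "petar".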
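
A minimal standard-library sketch of this convention: lowercase the text, replace punctuation with spaces, then split on whitespace to obtain the tokens a bag-of-words style model would count. The helper names `clean` and `bag_of_words` are hypothetical and stand in for the `Lower`, `ReplacePunctuation`, and representation modules used below.

```python
import string
from collections import Counter

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def clean(text):
    """Lowercase the text and replace punctuation with spaces."""
    return text.lower().translate(_PUNCT_TO_SPACE)


def bag_of_words(text):
    """Count whitespace-separated tokens after cleaning."""
    return Counter(clean(text).split())


tokens = clean("O'Sullivan, Miss. Bridget Mary").split()
counts = bag_of_words("Calic, Mr. Petar")
print(tokens)
print(counts)
```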

In [21]:
from easymlops.nlp.representation import *
from easymlops.ml.perfopt import *
#Build a TF-IDF model
nlp=PipeML()
nlp.pipe(FixInput())\
   .pipe(SelectCols(cols=["Name","Age"]))\
   .pipe(Lower(cols=["Name"]))\
   .pipe(ReplacePunctuation(cols=["Name"],symbols=" "))\
   .pipe(TFIDF(cols=["Name"]))\
   .pipe(DropCols(cols=["Name"]))

x_test_new=nlp.fit(x_train).transform(x_test)
x_test_new.head(5)

Unnamed: 0,Age,tfidf_Name_,tfidf_Name_a,tfidf_Name_abbott,tfidf_Name_abelson,tfidf_Name_achem,tfidf_Name_achille,tfidf_Name_achilles,tfidf_Name_ada,tfidf_Name_adahl,...,tfidf_Name_yarred,tfidf_Name_yoto,tfidf_Name_young,tfidf_Name_youseff,tfidf_Name_yousif,tfidf_Name_youssef,tfidf_Name_yousseff,tfidf_Name_yrois,tfidf_Name_zabour,tfidf_Name_zimmerman
0,0.0,0.285612,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.346207,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.194849,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.28994,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.62769,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
#Build an LSI topic model plus a word2vec embedding model
nlp=PipeML()
nlp.pipe(FixInput())\
   .pipe(SelectCols(cols=["Name"]))\
   .pipe(Lower(cols=["Name"]))\
   .pipe(ReplacePunctuation(cols=["Name"],symbols=" "))\
   .pipe(SelectCols(cols=["Name"]))\
   .pipe(Parallel([LsiTopicModel(cols=["Name"],num_topics=4),Word2VecModel(embedding_size=4,cols=["Name"])]))\
   .pipe(DropCols(cols=["Name"]))

nlp.fit(x_train).transform(x_test).head(5)

Unnamed: 0,lsi_Name_0,lsi_Name_1,lsi_Name_2,lsi_Name_3,w2v_Name_0,w2v_Name_1,w2v_Name_2,w2v_Name_3
500,2.132522,0.648002,-0.0759,0.093424,-0.183208,-0.01715,0.403337,0.551337
501,2.034848,-0.608228,-0.73337,0.013288,-0.085437,-0.033206,0.293308,0.416924
502,2.040231,-0.616303,-0.747638,0.022877,-0.051665,0.000399,0.27701,0.309539
503,2.026293,-0.579116,-0.735033,-0.011914,-0.140556,0.009208,0.393155,0.443419
504,2.025096,-0.573415,-0.720068,-0.010283,-0.140556,0.009208,0.393155,0.443419


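## 3. Training performance optimization module
This module mainly optimizes memory usage. The example below is a somewhat special case (one-hot expansion of features).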

In [23]:
from easymlops.ml.perfopt import *

ml=PipeML()
ml.pipe(FixInput())\
  .pipe(Clip(cols=["Age"],default_clip=(1,99),name="clip_name"))\
  .pipe(OneHotEncoding(cols=["Pclass","Sex","Name","Ticket","Embarked","Cabin"],drop_col=True))\
  .pipe(FillNa())\
  .pipe(ReduceMemUsage())\
  .pipe(Dense2Sparse())

ml.fit(x_train).transform(x_train).shape

(500, 1021)

In [24]:
#Memory usage after ReduceMemUsage: 500K
ml.transform(x_train,run_to_layer=-2).memory_usage().sum()//1024

500

In [25]:
#Memory usage after ReduceMemUsage and Dense2Sparse: 24K
ml.transform(x_train,run_to_layer=-1).memory_usage().sum()//1024

24
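
Choosing a narrower float type by value range, as described for `ReduceMemUsage`, can be sketched with the standard library alone. This is an assumption-laden illustration rather than the module's actual logic: `smallest_float_type` and `to_float16` are hypothetical helpers, the input is assumed to be a non-empty list of finite numbers, and only the range is checked, not the precision.

```python
import struct

FLOAT16_MAX = 65504.0                 # largest finite IEEE 754 half-precision value
FLOAT32_MAX = 3.4028234663852886e38   # largest finite single-precision value


def smallest_float_type(values):
    """Pick the narrowest float type whose range covers all values."""
    largest = max(abs(v) for v in values)
    if largest <= FLOAT16_MAX:
        return "float16"
    if largest <= FLOAT32_MAX:
        return "float32"
    return "float64"


def to_float16(value):
    """Round-trip a Python float through half precision."""
    return struct.unpack("e", struct.pack("e", value))[0]


print(smallest_float_type([7.25, 86.5]))
print(smallest_float_type([1e5]))
print(to_float16(8.6625))
```

Fitting the range does not mean the value is preserved exactly: 8.6625 becomes 8.6640625 in half precision. The `Fare` of 8.6625 shown in section 5 appears as 8.664062 in the transformed tables, which would be consistent with such a cast, so downcasting is a trade of precision for memory.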

In [26]:
#Most modules in easymlops.ml.classification support Dense2Sparse output, for example LightGBM
from easymlops.ml.classification import *
ml=PipeML()
ml.pipe(FixInput())\
  .pipe(Clip(cols=["Age"],default_clip=(1,99),name="clip_name"))\
  .pipe(OneHotEncoding(cols=["Pclass","Sex","Name","Ticket","Embarked","Cabin"],drop_col=True))\
  .pipe(FillNa())\
  .pipe(ReduceMemUsage())\
  .pipe(Dense2Sparse())\
  .pipe(LGBMClassification(y=y_train,native_init_params={"max_depth": 2}, native_fit_params={"num_boost_round": 128}))

ml.fit(x_train).transform(x_test).head(5)

Unnamed: 0,0,1
0,0.930951,0.069049
1,0.231447,0.768553
2,0.296645,0.703355
3,0.679053,0.320947
4,0.062002,0.937998


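## 4. Splitting and combining pipelines, running to a given layer, and retrieving intermediate pipe modules
- Modeling is usually iterative: each layer may need several rounds of adjustment before moving on to the next step. With the approach shown above, every change to the last layer means rerunning all the preceding layers, which costs time and effort.
- Splitting the whole pipeline into several sub-pipelines and then combining them makes iterative model development easier.
- For this reason, a PipeML object is itself designed as a pipe module (PipeML sits at the same level as FixInput, FillNa, TargetEncoding, and the other modules introduced above), so a PipeML can also pipe another PipeML object.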

### 4.1 Splitting

In [27]:
#For example, do the feature engineering first
ml1=PipeML()
ml1.pipe(FixInput())\
   .pipe(FillNa())\
   .pipe(OneHotEncoding(cols=["Pclass","Sex"],drop_col=False))\
   .pipe(LabelEncoding(cols=["Sex","Pclass"]))\
   .pipe(TargetEncoding(cols=["Name","Ticket","Embarked","Cabin"],y=y_train))\
   .pipe(Normalizer())\
   .pipe(PCADecomposition(n_components=8))

x_train_new=ml1.fit(x_train).transform(x_train)
x_test_new=ml1.transform(x_test)
x_test_new.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7
0,-2.203936,-0.465733,0.176379,-0.614674,0.068869,-1.80774,0.133572,-0.431929
1,0.541887,-1.862707,-0.226505,-1.390086,-1.644222,-1.236928,1.754896,-0.193005
2,0.513605,-2.216552,-0.407295,-1.341431,-1.010294,-1.195668,2.342431,-0.590154
3,0.362241,-1.403968,-0.719084,-1.017833,-2.254035,-1.845033,-0.022829,-0.676507
4,1.85238,0.241057,0.131182,-0.278974,-1.837343,-1.554423,1.073508,0.252397


In [28]:
#Then train the model
ml2=PipeML().pipe(LogisticRegressionClassification(y=y_train))
ml2.fit(x_train_new).transform(x_test_new).head(5)

Unnamed: 0,0,1
0,0.998719,0.001281
1,0.984316,0.015684
2,0.983844,0.016156
3,0.991256,0.008744
4,0.672809,0.327191


### 4.2 Combining

In [29]:
ml_combine=PipeML().pipe(ml1).pipe(ml2)
ml_combine.transform(x_test).head(5)

Unnamed: 0,0,1
0,0.998719,0.001281
1,0.984316,0.015684
2,0.983844,0.016156
3,0.991256,0.008744
4,0.672809,0.327191
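
Nesting works because a pipeline exposes the same `fit`/`transform` interface as its steps. The sketch below shows that design in miniature; `MiniPipe`, `Scale`, and `Shift` are hypothetical classes written for illustration, not EasyMLOps code.

```python
class Scale:
    """Multiplies every value by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, data):
        return self

    def transform(self, data):
        return [x * self.factor for x in data]


class Shift:
    """Adds a fixed offset to every value."""

    def __init__(self, offset):
        self.offset = offset

    def fit(self, data):
        return self

    def transform(self, data):
        return [x + self.offset for x in data]


class MiniPipe:
    """A pipeline that is itself usable as a step of another pipeline."""

    def __init__(self):
        self.steps = []

    def pipe(self, step):
        self.steps.append(step)
        return self

    def fit(self, data):
        for step in self.steps:
            data = step.fit(data).transform(data)
        return self

    def transform(self, data):
        for step in self.steps:
            data = step.transform(data)
        return data


features = MiniPipe().pipe(Scale(2)).pipe(Shift(1))
model = MiniPipe().pipe(Scale(10))
combined = MiniPipe().pipe(features).pipe(model)

separate = model.transform(features.transform([1, 2]))
together = combined.transform([1, 2])
print(separate, together)
```

Running the two sub-pipelines one after another and running the combined pipeline give the same result, mirroring the identical tables in 4.1 and 4.2.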


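### 4.3 Running to a given layer
Sometimes we want to see how the features change layer by layer through the pipeline, or reuse someone else's feature engineering without its last few steps. The run_to_layer argument of transform/transform_single is useful here.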

In [30]:
ml=PipeML()
ml.pipe(FixInput())\
  .pipe(FillNa())\
  .pipe(OneHotEncoding(cols=["Pclass","Sex"],drop_col=False))\
  .pipe(WOEEncoding(cols=["Sex","Pclass"],y=y_train))\
  .pipe(LabelEncoding(cols=["Name","Ticket"]))\
  .pipe(TargetEncoding(cols=["Embarked","Cabin"],y=y_train,name="target_encoding"))\
  .pipe(FillNa())\
  .pipe(Normalizer())\
  .pipe(PCADecomposition(n_components=8))

x_test_new=ml.fit(x_train).transform(x_test)
x_test_new.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7
0,-1.441766,0.01296,2.828399,0.338076,-1.338859,-0.07501,0.416083,0.156167
1,1.367719,2.166848,1.97736,-0.180096,-2.371848,0.16654,-0.781668,-0.645946
2,1.250239,2.523279,2.089104,-0.258638,-2.242903,1.016662,-0.745599,0.052138
3,1.241155,1.816799,1.987426,0.767974,-2.115902,-1.454603,0.087935,-0.189948
4,2.547265,-0.513182,0.477557,0.098263,-1.795859,-0.815377,-1.506217,2.8012


In [31]:
#View the data after the 2nd layer
ml.transform(x_test,run_to_layer=1).head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
500,501,3,"Calic, Mr. Petar",male,17.0,0,0,315086,8.664062,,S
501,502,3,"Canavan, Miss. Mary",female,21.0,0,0,364846,7.75,,Q
502,503,3,"O'Sullivan, Miss. Bridget Mary",female,0.0,0,0,330909,7.628906,,Q
503,504,3,"Laitinen, Miss. Kristina Sofia",female,37.0,0,0,4135,9.585938,,S
504,505,1,"Maioni, Miss. Roberta",female,16.0,0,0,110152,86.5,B79,S


In [32]:
#View the data after the 3rd layer from the end
ml.transform(x_test,run_to_layer=-3).head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Pclass_3,Pclass_1,Pclass_2,Sex_male,Sex_female
500,501,0.482439,0,1.111379,17.0,0,0,0,8.664062,0.317829,0.334254,1,0,0,1,0
501,502,0.482439,0,-1.56999,21.0,0,0,0,7.75,0.317829,0.511111,1,0,0,0,1
502,503,0.482439,0,-1.56999,0.0,0,0,0,7.628906,0.317829,0.511111,1,0,0,0,1
503,504,0.482439,0,-1.56999,37.0,0,0,0,9.585938,0.317829,0.334254,1,0,0,0,1
504,505,-0.741789,0,-1.56999,16.0,0,0,354,86.5,0.0,0.334254,0,1,0,0,1


In [33]:
#View the data up to the module named target_encoding
ml.transform(x_test,run_to_layer="target_encoding").head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Pclass_3,Pclass_1,Pclass_2,Sex_male,Sex_female
500,501,0.482439,0,1.111379,17.0,0,0,0,8.664062,0.317829,0.334254,1,0,0,1,0
501,502,0.482439,0,-1.56999,21.0,0,0,0,7.75,0.317829,0.511111,1,0,0,0,1
502,503,0.482439,0,-1.56999,0.0,0,0,0,7.628906,0.317829,0.511111,1,0,0,0,1
503,504,0.482439,0,-1.56999,37.0,0,0,0,9.585938,0.317829,0.334254,1,0,0,0,1
504,505,-0.741789,0,-1.56999,16.0,0,0,354,86.5,0.0,0.334254,0,1,0,0,1


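### 4.4 Retrieving intermediate pipe modules
Sometimes we want to retrieve a specific pipe module and call its functions. This can be done with a `positional index` (starting from 0) or by `name`.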

In [34]:
#For example, call the show_detail function of WOEEncoding
ml[3].show_detail()

Unnamed: 0,col,bin_value,bad_num,bad_rate,good_num,good_rate,woe,iv
0,Sex,male,54,0.279793,261,0.850163,1.111379,0.633897
1,Sex,female,139,0.720207,46,0.149837,-1.56999,0.895475
2,Pclass,3,78,0.404145,201,0.654723,0.482439,0.120889
3,Pclass,1,66,0.341969,50,0.162866,-0.741789,0.132856
4,Pclass,2,49,0.253886,56,0.18241,-0.330626,0.023632


In [35]:
#Call show_detail on the module with name="target_encoding"
ml["target_encoding"].show_detail().head()

Unnamed: 0,col,bin_value,target_value
0,Embarked,C,0.521739
1,Embarked,Q,0.511111
2,Embarked,S,0.334254
3,Embarked,,1.0
4,Cabin,A14,0.0


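### 4.5 Slicing
Slicing returns a contiguous range of pipe modules assembled into a PipeML, which makes it more convenient and flexible to obtain intermediate results.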

In [36]:
#Run the first 3 layers
step1=ml[:3].transform(x_test[:5])
step1

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Pclass_3,Pclass_1,Pclass_2,Sex_male,Sex_female
500,501,3,"Calic, Mr. Petar",male,17.0,0,0,315086,8.664062,,S,1,0,0,1,0
501,502,3,"Canavan, Miss. Mary",female,21.0,0,0,364846,7.75,,Q,1,0,0,0,1
502,503,3,"O'Sullivan, Miss. Bridget Mary",female,0.0,0,0,330909,7.628906,,Q,1,0,0,0,1
503,504,3,"Laitinen, Miss. Kristina Sofia",female,37.0,0,0,4135,9.585938,,S,1,0,0,0,1
504,505,1,"Maioni, Miss. Roberta",female,16.0,0,0,110152,86.5,B79,S,0,1,0,0,1


In [37]:
#Run the middle layers 3 and 4
step2=ml[3:5].transform(step1)
step2

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Pclass_3,Pclass_1,Pclass_2,Sex_male,Sex_female
500,501,0.482439,0,1.111379,17.0,0,0,0,8.664062,,S,1,0,0,1,0
501,502,0.482439,0,-1.56999,21.0,0,0,0,7.75,,Q,1,0,0,0,1
502,503,0.482439,0,-1.56999,0.0,0,0,0,7.628906,,Q,1,0,0,0,1
503,504,0.482439,0,-1.56999,37.0,0,0,0,9.585938,,S,1,0,0,0,1
504,505,-0.741789,0,-1.56999,16.0,0,0,354,86.5,B79,S,0,1,0,0,1


In [38]:
#Run layer 5 and everything after it
step3=ml[5:].transform(step2)
step3

Unnamed: 0,0,1,2,3,4,5,6,7
0,-1.441766,0.01296,2.828399,0.338076,-1.338859,-0.07501,0.416083,0.156167
1,1.367719,2.166848,1.97736,-0.180096,-2.371848,0.16654,-0.781668,-0.645946
2,1.250239,2.523279,2.089104,-0.258638,-2.242903,1.016662,-0.745599,0.052138
3,1.241155,1.816799,1.987426,0.767974,-2.115902,-1.454603,0.087935,-0.189948
4,2.547265,-0.513182,0.477557,0.098263,-1.795859,-0.815377,-1.506217,2.8012
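
Running to a layer, looking a module up by index or name, and slicing can all be expressed through one index-resolution helper plus `__getitem__`. The following is a hedged sketch of that idea with hypothetical `Step` and `SlicePipe` classes; how PipeML does it internally is not shown in this README.

```python
class Step:
    """A named step wrapping a plain function."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def transform(self, data):
        return self.func(data)


class SlicePipe:
    def __init__(self, steps=None):
        self.steps = list(steps or [])

    def pipe(self, step):
        self.steps.append(step)
        return self

    def _index(self, layer):
        """Resolve a name or a (possibly negative) position to a list index."""
        if isinstance(layer, str):
            for i, step in enumerate(self.steps):
                if step.name == layer:
                    return i
            raise KeyError(layer)
        if layer < 0:
            layer += len(self.steps)
        if not 0 <= layer < len(self.steps):
            raise IndexError(layer)
        return layer

    def transform(self, data, run_to_layer=None):
        last = len(self.steps) - 1 if run_to_layer is None else self._index(run_to_layer)
        for step in self.steps[: last + 1]:
            data = step.transform(data)
        return data

    def __getitem__(self, key):
        if isinstance(key, slice):
            return SlicePipe(self.steps[key])  # a slice is a new pipeline
        return self.steps[self._index(key)]    # an index or name is one step


p = (SlicePipe()
     .pipe(Step("double", lambda x: x * 2))
     .pipe(Step("inc", lambda x: x + 1))
     .pipe(Step("square", lambda x: x * x)))

print(p.transform(3), p.transform(3, run_to_layer="inc"), p[:2].transform(3))
```

Chaining the slices, `p[2:].transform(p[:2].transform(3))`, gives the same value as `p.transform(3)`, which is the property the `step1`/`step2`/`step3` cells above rely on.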


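## 5. Pipeline training, prediction, and persistence

The training interface fit and the batch prediction interface transform have already been run many times in the demos above, so they are not covered again. This section introduces the single-record prediction interface transform_single, which is mainly used for online prediction on one record at a time. Both its input and its output are dicts.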

In [39]:
input_dict=x_test.to_dict("records")[0]
input_dict

{'PassengerId': 501,
 'Pclass': 3,
 'Name': 'Calic, Mr. Petar',
 'Sex': 'male',
 'Age': 17.0,
 'SibSp': 0,
 'Parch': 0,
 'Ticket': '315086',
 'Fare': 8.6625,
 'Cabin': nan,
 'Embarked': 'S'}

In [40]:
ml.transform_single(input_dict)

{0: -1.4417663747170075,
 1: 0.012960258111902379,
 2: 2.82839852783474,
 3: 0.33807607482067686,
 4: -1.338859240265778,
 5: -0.07500995838955171,
 6: 0.4160828190219088,
 7: 0.15616709646927246}

In [41]:
#We can also pass an empty input to see whether a result still comes back, as a check on robustness
ml.transform_single({})

(<class 'easymlops.ml.preprocessing.core.FixInput'>) module, please check these missing columns:['Parch', 'Embarked', 'Ticket', 'Fare', 'Age', 'PassengerId', 'Pclass', 'SibSp', 'Sex', 'Cabin', 'Name'], they will by filled by 0(int),None(float),np.nan(category)


{0: 0.6465424591523234,
 1: 0.44784853411026065,
 2: 0.620424862293769,
 3: -0.6768081789371995,
 4: -0.2900009009927905,
 5: 1.7418435548525169,
 6: -1.3897057726750697,
 7: -0.740000642778759}

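#### Persistence

In [42]:
#Save
ml.save("ml.pkl")

In [43]:
#Load
#Only the model parameters are saved, so the pipeline structure must be declared again (constructor arguments are not required; passing them is harmless and gives callers more information about how the model was built)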
ml=PipeML()
ml.pipe(FixInput())\
  .pipe(FillNa())\
  .pipe(OneHotEncoding(cols=["Pclass","Sex"],drop_col=False))\
  .pipe(WOEEncoding(cols=["Sex","Pclass"]))\
  .pipe(LabelEncoding(cols=["Name","Ticket"]))\
  .pipe(TargetEncoding(cols=["Embarked","Cabin"],name="target_encoding"))\
  .pipe(FillNa())\
  .pipe(Normalizer())\
  .pipe(PCADecomposition(n_components=8))

<easymlops.pipeml.PipeML at 0x25176329208>

In [44]:
ml.load("ml.pkl")
ml.transform(x_test).head(5)

Unnamed: 0,0,1,2,3,4,5,6,7
0,-1.441766,0.01296,2.828399,0.338076,-1.338859,-0.07501,0.416083,0.156167
1,1.367719,2.166848,1.97736,-0.180096,-2.371848,0.16654,-0.781668,-0.645946
2,1.250239,2.523279,2.089104,-0.258638,-2.242903,1.016662,-0.745599,0.052138
3,1.241155,1.816799,1.987426,0.767974,-2.115902,-1.454603,0.087935,-0.189948
4,2.547265,-0.513182,0.477557,0.098263,-1.795859,-0.815377,-1.506217,2.8012


In [45]:
ml.transform_single(input_dict)

{0: -1.4417663747170075,
 1: 0.012960258111902434,
 2: 2.8283985278347394,
 3: 0.3380760748206768,
 4: -1.338859240265778,
 5: -0.07500995838955166,
 6: 0.41608281902190886,
 7: 0.15616709646927251}

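Note: pipe modules that were split and then combined must also be declared the same way as at training time, nested in the same way, and only then loaded.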
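
The "save parameters only, re-declare the structure, then load" workflow can be sketched with `pickle` and two small helpers. Everything here (`MeanFill`, `save_params`, `load_params`) is a hypothetical illustration under the assumption that each step exposes `get_params`/`set_params`; it is not the PipeML file format.

```python
import os
import pickle
import tempfile


class MeanFill:
    """Fills None with the mean learned during fit."""

    def __init__(self):
        self.mean = None

    def fit(self, values):
        known = [v for v in values if v is not None]
        self.mean = sum(known) / len(known)
        return self

    def transform(self, values):
        return [self.mean if v is None else v for v in values]

    def get_params(self):
        return {"mean": self.mean}

    def set_params(self, params):
        self.mean = params["mean"]


def save_params(steps, path):
    # Only the learned parameters are written, not the step objects.
    with open(path, "wb") as f:
        pickle.dump([step.get_params() for step in steps], f)


def load_params(steps, path):
    # The caller must pass freshly declared steps in the same order.
    with open(path, "rb") as f:
        all_params = pickle.load(f)
    for step, params in zip(steps, all_params):
        step.set_params(params)


trained = [MeanFill().fit([1.0, None, 3.0])]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "steps.pkl")
    save_params(trained, path)
    restored = [MeanFill()]          # structure re-declared, nothing fitted
    load_params(restored, path)

restored_out = restored[0].transform([None, 5.0])
print(restored_out)
```

Since parameters are matched to steps by position, the re-declared structure has to mirror the trained one, which is the point of the note above. As with any pickle file, only load files from a trusted source.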

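## 6. Custom pipe modules and interface extensions

The requirements fall into the following levels:

- Minimum: for data exploration only, implement just the fit and transform interfaces
- Model persistence: implement set_params and get_params to tell PipeML which parameters your module needs to keep for prediction
- Production deployment: implement transform_single so that it returns the same predictions as transform but on a different data format. transform works on a DataFrame, while transform_single works on a dict, and transform_single usually has stricter performance requirements than transform
- Custom extension functions: add any other custom methods, for example to monitor changes in the online data distribution, and call them using the approach described in `4.4`

Below is a simplified implementation of TargetEncoding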

In [46]:
#Note that the class below inherits from object
class TargetEncoding(object):
    def __init__(self,name="", y=None,cols=None, error_value=0):
        self.name=name
        self.y=y
        self.cols=cols
        self.error_value = error_value
        self.target_map_detail = dict()

    def show_detail(self):
        data = []
        for col, map_detail in self.target_map_detail.items():
            for bin_value, target_value in map_detail.items():
                data.append([col, bin_value, target_value])
        return pd.DataFrame(data=data, columns=["col", "bin_value", "target_value"])

    def fit(self, s):
        assert self.y is not None and len(self.y) == len(s)
        s["y_"] = self.y
        for col in self.cols:
            tmp_ = s[[col, "y_"]]
            col_map = list(tmp_.groupby([col]).agg({"y_": ["mean"]}).to_dict().values())[0]
            self.target_map_detail[col] = col_map
        del s["y_"]
        return self
    
    def transform(self, s):
        for col in self.cols:
            if col in s.columns:
                s[col] = s[col].apply(lambda x: self._user_defined_function(col, x))
        return s
    
    def transform_single(self, s):
        for col in self.cols:
            if col in s.keys():
                s[col] = self._user_defined_function(col, s[col])
        return s

    def _user_defined_function(self, col, x):
        map_detail_ = self.target_map_detail.get(col, dict())
        return map_detail_.get(x, self.error_value)

    def get_params(self):
        #The base class is object, which has no get_params, so return only this module's parameters
        return {"target_map_detail": self.target_map_detail, "error_value": self.error_value}

    def set_params(self, params):
        #Restore the parameters returned by get_params
        self.target_map_detail = params["target_map_detail"]
        self.error_value = params["error_value"]

In [47]:
ml=PipeML()
ml.pipe(FixInput())\
  .pipe(SelectCols(cols=["Age","Fare","Embarked"]))\
  .pipe(TargetEncoding(cols=["Embarked"],y=y_train))\
  .pipe(FillNa())

x_test_new=ml.fit(x_train).transform(x_test)
x_test_new.head(5)

Unnamed: 0,Age,Fare,Embarked
500,17.0,8.664062,0.334254
501,21.0,7.75,0.511111
502,0.0,7.628906,0.511111
503,37.0,9.585938,0.334254
504,16.0,86.5,0.334254


In [48]:
ml.transform_single(x_test.to_dict("records")[0])

{'Age': 17.0, 'Fare': 8.664, 'Embarked': 0.3342541436464088}

In [49]:
ml[-2].show_detail()

Unnamed: 0,col,bin_value,target_value
0,Embarked,C,0.521739
1,Embarked,Q,0.511111
2,Embarked,S,0.334254
3,Embarked,,1.0


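#### Advanced interface
The version above is simplified. When implementing fit, transform, transform_single, get_params, and set_params, there may be more to consider:

- Should the types of the input and output data be validated?
- Should the column order of the input and output data be kept consistent?
- What happens if set_params or get_params forgets to handle the parent class?
- How are name clashes between the current module's parameters and the underlying parameters detected?
- Should the data be copied during transform?

It is recommended to inherit from PipeObject when writing a custom module and to implement _fit, _transform, _transform_single, _get_params, and _set_params. This makes the custom pipe module more robust. The adjusted TargetEncoding below uses almost identical code.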

In [50]:
from easymlops.base import PipeObject
class TargetEncoding(PipeObject):
    def __init__(self,y=None, cols=None, error_value=0):
        super().__init__()
        self.y=y
        self.cols=cols
        self.error_value = error_value
        self.target_map_detail = dict()

    def show_detail(self):
        data = []
        for col, map_detail in self.target_map_detail.items():
            for bin_value, target_value in map_detail.items():
                data.append([col, bin_value, target_value])
        return pd.DataFrame(data=data, columns=["col", "bin_value", "target_value"])

    def _fit(self, s):
        assert self.y is not None and len(self.y) == len(s)
        s["y_"] = self.y
        for col in self.cols:
            tmp_ = s[[col, "y_"]]
            col_map = list(tmp_.groupby([col]).agg({"y_": ["mean"]}).to_dict().values())[0]
            self.target_map_detail[col] = col_map
        del s["y_"]
        return self
    
    def _transform(self, s):
        for col in self.cols:
            if col in s.columns:
                s[col] = s[col].apply(lambda x: self._user_defined_function(col, x))
        return s
    
    def _transform_single(self, s):
        for col in self.cols:
            if col in s.keys():
                s[col] = self._user_defined_function(col, s[col])
        return s

    def _user_defined_function(self, col, x):
        map_detail_ = self.target_map_detail.get(col, dict())
        return map_detail_.get(x, self.error_value)

    def _get_params(self):
        return {"target_map_detail": self.target_map_detail, "error_value": self.error_value}

    def _set_params(self, params):
        self.target_map_detail = params["target_map_detail"]
        self.error_value = params["error_value"]

In [51]:
ml=PipeML()
ml.pipe(FixInput())\
  .pipe(SelectCols(cols=["Age","Fare","Embarked"]))\
  .pipe(TargetEncoding(cols=["Embarked"],y=y_train))\
  .pipe(FillNa())

x_test_new=ml.fit(x_train).transform(x_test)
x_test_new.head(5)

Unnamed: 0,Age,Fare,Embarked
500,17.0,8.664062,0.334254
501,21.0,7.75,0.511111
502,0.0,7.628906,0.511111
503,37.0,9.585938,0.334254
504,16.0,86.5,0.334254


In [52]:
ml.transform_single(x_test.to_dict("records")[0])

{'Age': 17.0, 'Fare': 8.664, 'Embarked': 0.3342541436464088}

In [53]:
ml[-2].show_detail()

Unnamed: 0,col,bin_value,target_value
0,Embarked,C,0.521739
1,Embarked,Q,0.511111
2,Embarked,S,0.334254
3,Embarked,,1.0


## 7. Production deployment support: consistency tests & performance tests & logging

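In production, pandas is usually inefficient for single requests, and input typically arrives as a dict (JSON). A pipeline that will be deployed therefore needs one additional function:  
- transform_single: does the same work as transform, but takes a dict as input and returns a dict as output  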
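
A minimal sketch of the idea, written in plain Python rather than with EasyMLOps itself: a toy mean-target encoder whose `transform` handles a batch (a list of rows here, standing in for a DataFrame) and whose `transform_single` handles one dict, the shape in which a JSON request body would arrive. The class name `ToyTargetEncoder` and the sample data are made up for illustration.

```python
import json
from collections import defaultdict


class ToyTargetEncoder:
    """Hypothetical stand-in for a pipe module; not part of EasyMLOps."""

    def __init__(self, col, error_value=0.0):
        self.col = col
        self.error_value = error_value  # used for categories unseen in fit
        self.mapping = {}

    def fit(self, rows, y):
        # Collect the targets per category, then store their mean.
        groups = defaultdict(list)
        for row, target in zip(rows, y):
            groups[row[self.col]].append(target)
        self.mapping = {k: sum(v) / len(v) for k, v in groups.items()}
        return self

    def transform_single(self, row):
        # One dict in, one dict out: the shape of a production request.
        out = dict(row)
        out[self.col] = self.mapping.get(row[self.col], self.error_value)
        return out

    def transform(self, rows):
        # Batch path; it reuses the single-row logic, so both always agree.
        return [self.transform_single(row) for row in rows]


train = [{"Embarked": "S"}, {"Embarked": "S"}, {"Embarked": "C"}, {"Embarked": "C"}]
y = [0, 1, 1, 1]
enc = ToyTargetEncoder("Embarked").fit(train, y)

request_body = '{"Embarked": "S", "Fare": 7.25}'  # what a web handler receives
response = enc.transform_single(json.loads(request_body))
print(json.dumps(response))
```

In a real pipeline the batch path usually goes through pandas while the single-row path does not, which is why the two can drift apart and why the consistency tests later in this section exist.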

### 7.1 transform_single

In [54]:
ml.transform_single({'PassengerId': 1,
 'Cabin': 0,
 'Pclass': 3,
 'Name': 'Braund, Mr. Owen Harris',
 'Sex': 'male',
 'Age': 22.0,
 'SibSp': 1,
 'Parch': 0,
 'Ticket': 'A/5 21171',
 'Fare': 7.25,
 'Embarked': 'S'})

{'Age': 22.0, 'Fare': 7.25, 'Embarked': 0.3342541436464088}

In [55]:
ml.transform_single({'PassengerId': 1,
 'Cabin': 0,
 'Pclass': 3,
 'Name': 'Braund, Mr. Owen Harris',
 'Sex': 'male',
 'Age': 22.0,
 'SibSp': 1,
 'Parch': 0,
 'Ticket': 'A/5 21171',
 'Fare': 7.25,
 'Embarked': 'S'},run_to_layer=-3)

{'Age': 22.0, 'Fare': 7.25, 'Embarked': 'S'}

### 7.2 Logging  
Logging is normally needed only in production, so it is available only in transform_single

In [56]:
import logging
logger = logging.getLogger("EasyMLOps")
logger.setLevel(logging.INFO)
ch = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)

In [57]:
base_log_info={"user_id":1}

In [58]:
output=ml_combine.transform_single({'PassengerId': 1,
 'Cabin': 0,
 'Pclass': 3,
 'Name': 'Braund, Mr. Owen Harris',
 'Sex': 'male',
 'Age': 22.0,
 'SibSp': 1,
 'Parch': 0,
 'Ticket': 'A/5 21171',
 'Fare': 7.25,
 'Embarked': 'S'},logger=logger,log_base_dict=base_log_info)

2023-02-14 22:51:19,414 - EasyMLOps - INFO - {'step': 'step-0-0', 'pipe_name': <class 'easymlops.ml.preprocessing.core.FixInput'>, 'transform': {'PassengerId': 1, 'Pclass': 3, 'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171', 'Fare': 7.25, 'Cabin': '0', 'Embarked': 'S'}, 'user_id': 1}
2023-02-14 22:51:19,414 - EasyMLOps - INFO - {'step': 'step-0-1', 'pipe_name': <class 'easymlops.ml.preprocessing.onevar_operation.FillNa'>, 'transform': {'PassengerId': 1, 'Pclass': 3, 'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171', 'Fare': 7.25, 'Cabin': '0', 'Embarked': 'S'}, 'user_id': 1}
2023-02-14 22:51:19,414 - EasyMLOps - INFO - {'step': 'step-0-2', 'pipe_name': <class 'easymlops.ml.encoding.OneHotEncoding'>, 'transform': {'PassengerId': 1, 'Pclass': 3, 'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171', 'Fare': 7.25, 'C

In [59]:
output

{0: 0.9998591472977608, 1: 0.00014085270223913456}
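
The log lines above suggest that each step emits a dict holding the step id, the pipe name, that step's output, and the caller-supplied base fields. A rough sketch of this pattern with the standard `logging` module follows. `run_with_logging`, `ListHandler`, and the two toy steps are hypothetical, and the actual EasyMLOps internals may differ.

```python
import logging


def run_with_logging(steps, row, logger=None, log_base_dict=None):
    """Apply each step to `row`; log one record per step when a logger is given."""
    for index, step in enumerate(steps):
        row = step(row)
        if logger is not None:
            record = {"step": f"step-{index}", "pipe_name": step.__name__,
                      "transform": dict(row)}
            record.update(log_base_dict or {})  # e.g. user_id, request_id
            logger.info(record)
    return row


def fill_age(row):
    # Toy step: replace a missing Age with 0.
    return {**row, "Age": row.get("Age") or 0}


def double_fare(row):
    # Toy step: scale Fare.
    return {**row, "Fare": row["Fare"] * 2}


class ListHandler(logging.Handler):
    """Keeps log messages in a list so the example can inspect them."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink

    def emit(self, record):
        self.sink.append(record.msg)


captured = []
logger = logging.getLogger("toy_pipeline")
logger.setLevel(logging.INFO)
logger.addHandler(ListHandler(captured))

output = run_with_logging([fill_age, double_fare], {"Age": None, "Fare": 7.25},
                          logger=logger, log_base_dict={"user_id": 1})
```

Merging a base dict into every record makes it possible to trace all steps of one request by a shared key such as `user_id`.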

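### 7.3 transform/transform_single consistency test & performance test: check_transform_function
Before deploying to production, two things usually matter:  
- consistency between the offline-trained model and the online prediction model, i.e. between transform and transform_single;  
- the prediction performance of transform_single on a single record  

Both can be tested automatically by calling:  
- check_transform_function: whenever `complete` is printed, the outputs of transform and transform_single agree on the current test data. Performance is reported as speed:[*] milliseconds per record, together with the peak CPU usage during the run and the memory change (maximum memory minus minimum memory). Any exception is raised immediately and stops the tests for the remaining pipe modules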
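
The core of such a check can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the EasyMLOps implementation: it measures only wall-clock time (no CPU or memory tracking), and `check_consistency` and the toy transforms are hypothetical names.

```python
import time


def check_consistency(transform, transform_single, rows):
    """Compare batch and single-row outputs; return mean ms per single call."""
    batch_out = transform(rows)
    start = time.perf_counter()
    single_out = [transform_single(row) for row in rows]
    elapsed_ms = (time.perf_counter() - start) * 1000
    for i, (b, s) in enumerate(zip(batch_out, single_out)):
        if b != s:
            # Raising stops the check at the first mismatch.
            raise AssertionError(f"row {i}: batch={b!r} single={s!r}")
    return elapsed_ms / max(len(rows), 1)


def scale_batch(rows):
    return [{"Fare": r["Fare"] / 100} for r in rows]


def scale_single(row):
    return {"Fare": row["Fare"] / 100}


def buggy_single(row):
    # Rounds, unlike the batch path: a typical offline/online drift.
    return {"Fare": round(row["Fare"] / 100, 1)}


rows = [{"Fare": 7.25}, {"Fare": 86.5}]
speed = check_consistency(scale_batch, scale_single, rows)
print(f"check [transform] complete,speed:[{speed:.2f}ms]/it")

try:
    check_consistency(scale_batch, buggy_single, rows)
    mismatch_found = False
except AssertionError:
    mismatch_found = True
```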

In [60]:
ml_combine.check_transform_function(x_test)

(<class 'easymlops.ml.preprocessing.core.FixInput'>) module check [transform] complete,speed:[0.19ms]/it,cpu:[37%],memory:[0K]
(<class 'easymlops.ml.preprocessing.onevar_operation.FillNa'>) module check [transform] complete,speed:[0.02ms]/it,cpu:[0%],memory:[0K]
(<class 'easymlops.ml.encoding.OneHotEncoding'>) module check [transform] complete,speed:[0.0ms]/it,cpu:[0%],memory:[0K]
(<class 'easymlops.ml.encoding.LabelEncoding'>) module check [transform] complete,speed:[0.0ms]/it,cpu:[0%],memory:[0K]
(<class 'easymlops.ml.encoding.TargetEncoding'>) module check [transform] complete,speed:[0.01ms]/it,cpu:[0%],memory:[0K]
(<class 'easymlops.ml.preprocessing.onevar_operation.Normalizer'>) module check [transform] complete,speed:[0.07ms]/it,cpu:[0%],memory:[0K]
(<class 'easymlops.ml.decomposition.PCADecomposition'>) module check [transform] complete,speed:[25.26ms]/it,cpu:[40%],memory:[27152K]
(<class 'easymlops.ml.classification.LogisticRegressionClassification'>) module check [transform] c

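### 7.4 Null value test: check_null_value  

- pandas infers types automatically when reading data and represents missing values differently by type: np.nan for float columns, None or NaN for object columns  
- pandas also reads and infers in batches by default, so a single column can hold more than one kind of null; np.nan and None may coexist  

The test therefore sets each column to each null value in turn and checks:  
- for the same null value, whether transform and transform_single agree  
- whether transform gives the same result across different null values  

Custom null values can be set through `null_values=[None, np.nan, "null", "NULL", "nan", "NaN", "", "none", "None", " "]` (the default)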
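
The two checks can be sketched as follows in plain Python, with `float("nan")` standing in for `np.nan`. The toy fill step and the function names are assumptions for illustration; the library's own check may be organized differently.

```python
NULL_VALUES = [None, float("nan"), "null", "NULL", "nan", "NaN", "", "none", "None", " "]


def is_null(value):
    """Treat None, float NaN, and null-like strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is not equal to itself
        return True
    return isinstance(value, str) and value.strip().lower() in {"", "null", "nan", "none"}


def fill_na_single(row, fill_value=0):
    # Toy FillNa-style step for one dict.
    return {k: fill_value if is_null(v) else v for k, v in row.items()}


def fill_na_batch(rows, fill_value=0):
    return [fill_na_single(r, fill_value) for r in rows]


def check_null_value(rows, columns):
    """For each column, inject every null variant and compare all outputs."""
    for col in columns:
        outputs = []
        for null in NULL_VALUES:
            patched = [{**r, col: null} for r in rows]
            batch = fill_na_batch(patched)
            single = [fill_na_single(r) for r in patched]
            assert batch == single, f"{col}: batch/single differ for {null!r}"
            outputs.append(batch)
        # Every null variant must lead to the same final result.
        assert all(o == outputs[0] for o in outputs), f"{col}: null variants differ"
    return True


rows = [{"Age": 22.0, "Embarked": "S"}, {"Age": 38.0, "Embarked": "C"}]
ok = check_null_value(rows, ["Age", "Embarked"])
```

A pipeline passes only if a step early in the chain normalizes every null variant to one value, which is the role a FillNa-style step plays here.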

In [61]:
ml_combine.check_null_value(x_test,sample=10)

column:[PassengerId] check [null value] complete,speed:[29.36ms]/it,cpu:[24%],memory:[120K]
column:[Pclass] check [null value] complete,speed:[28.29ms]/it,cpu:[52%],memory:[1724K]
column:[Name] check [null value] complete,speed:[28.22ms]/it,cpu:[26%],memory:[532K]
column:[Sex] check [null value] complete,speed:[27.47ms]/it,cpu:[39%],memory:[284K]
column:[Age] check [null value] complete,speed:[26.56ms]/it,cpu:[83%],memory:[24K]
column:[SibSp] check [null value] complete,speed:[26.49ms]/it,cpu:[30%],memory:[40K]
column:[Parch] check [null value] complete,speed:[28.05ms]/it,cpu:[30%],memory:[16K]
column:[Ticket] check [null value] complete,speed:[26.6ms]/it,cpu:[20%],memory:[12K]
column:[Fare] check [null value] complete,speed:[27.42ms]/it,cpu:[20%],memory:[80K]
column:[Cabin] check [null value] complete,speed:[28.01ms]/it,cpu:[84%],memory:[120K]
column:[Embarked] check [null value] complete,speed:[28.17ms]/it,cpu:[20%],memory:[44K]


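### 7.5 Extreme value test: check_extreme_value  

Training data is usually filtered, well-behaved data, but extreme values inevitably slip in online. For example, if a training column ranges over `0~1`, passing in `-1` might raise an error. Currently:

- The two column types are tested separately, with these settings:
  - numeric: `number_extreme_values = [np.inf, 0.0, -1, 1, -1e-7, 1e-7, np.finfo(np.float64).min, np.finfo(np.float64).max]` (the default)
  - categorical: `category_extreme_values = ["", "null", None, "1.0", "0.0", "-1.0", "-1", "NaN", "None"]` (the default)  

- Every feature is set to the extreme values above and tested

Note: this only checks consistency between transform and transform_single. It does not require the outputs to match across different extreme values (unlike the null value test above, which also requires every kind of null to produce the same output)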
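
A compact sketch of the type dispatch described above, using only the standard library (`math.inf` and `sys.float_info.max` stand in for the NumPy constants). The toy encoder and function names are hypothetical, and the type test on the first row is a simplification.

```python
import math
import sys

NUMBER_EXTREMES = [math.inf, 0.0, -1, 1, -1e-7, 1e-7,
                   -sys.float_info.max, sys.float_info.max]
CATEGORY_EXTREMES = ["", "null", None, "1.0", "0.0", "-1.0", "-1", "NaN", "None"]
LABELS = {"S": 0, "C": 1, "Q": 2}


def encode_single(row):
    # Toy pipe: clip Fare into [0, 1] and label-encode Embarked (-1 if unseen).
    return {"Fare": min(max(row["Fare"], 0.0), 1.0),
            "Embarked": LABELS.get(row["Embarked"], -1)}


def encode_batch(rows):
    return [encode_single(r) for r in rows]


def check_extreme_value(rows):
    """Inject each extreme into each column; return the number of cases run."""
    checked = 0
    for col in rows[0]:
        numeric = isinstance(rows[0][col], (int, float))
        for extreme in (NUMBER_EXTREMES if numeric else CATEGORY_EXTREMES):
            patched = [{**r, col: extreme} for r in rows]
            # Only batch/single agreement is required; outputs may differ
            # from one extreme value to the next.
            assert encode_batch(patched) == [encode_single(r) for r in patched]
            checked += 1
    return checked


rows = [{"Fare": 0.5, "Embarked": "S"}]
count = check_extreme_value(rows)
```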

In [62]:
ml_combine.check_extreme_value(x_test,sample=10)

column:[PassengerId] check [extreme value] complete,speed:[26.41ms]/it,cpu:[58%],memory:[15708K]
column:[Pclass] check [extreme value] complete,speed:[29.01ms]/it,cpu:[62%],memory:[12720K]
column:[Name] check [extreme value] complete,speed:[26.9ms]/it,cpu:[80%],memory:[12492K]
column:[Sex] check [extreme value] complete,speed:[26.75ms]/it,cpu:[49%],memory:[59900K]
column:[Age] check [extreme value] complete,speed:[28.43ms]/it,cpu:[71%],memory:[47360K]
column:[SibSp] check [extreme value] complete,speed:[26.86ms]/it,cpu:[71%],memory:[22624K]
column:[Parch] check [extreme value] complete,speed:[27.13ms]/it,cpu:[68%],memory:[31772K]
column:[Ticket] check [extreme value] complete,speed:[27.41ms]/it,cpu:[51%],memory:[82232K]
column:[Fare] check [extreme value] complete,speed:[26.58ms]/it,cpu:[69%],memory:[42920K]
column:[Cabin] check [extreme value] complete,speed:[28.16ms]/it,cpu:[35%],memory:[23224K]
column:[Embarked] check [extreme value] complete,speed:[27.78ms]/it,cpu:[84%],memory:[390

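### 7.6 Data type inversion test: check_inverse_dtype  

A feature may be numeric at training time but arrive as a categorical value once online, or the other way round. This test covers that case: originally numeric features are replaced with categorical values, and originally categorical features with numeric values, using these rules:
- originally numeric, replaced with: `number_inverse_values = ["", "null", None, "1.0", "0.0", "-1.0", "-1"]` (the default)  
- originally categorical, replaced with: `category_inverse_values = [0.0, -1, 1, -1e-7, 1e-7, np.finfo(np.float64).min, np.finfo(np.float64).max]` (the default)  

As with the extreme value test, the data type inversion test only requires transform and transform_single to agree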

In [63]:
ml_combine.check_inverse_dtype(x_test,sample=10)

column:[PassengerId] check [inverse type] complete,speed:[27.72ms]/it,cpu:[27%],memory:[51200K]
column:[Pclass] check [inverse type] complete,speed:[29.49ms]/it,cpu:[32%],memory:[5712K]
column:[Name] check [inverse type] complete,speed:[27.7ms]/it,cpu:[40%],memory:[39308K]
column:[Sex] check [inverse type] complete,speed:[27.81ms]/it,cpu:[34%],memory:[17528K]
column:[Age] check [inverse type] complete,speed:[27.03ms]/it,cpu:[50%],memory:[15468K]
column:[SibSp] check [inverse type] complete,speed:[27.36ms]/it,cpu:[83%],memory:[8828K]
column:[Parch] check [inverse type] complete,speed:[27.32ms]/it,cpu:[72%],memory:[38236K]
column:[Ticket] check [inverse type] complete,speed:[27.53ms]/it,cpu:[91%],memory:[21544K]
column:[Fare] check [inverse type] complete,speed:[26.33ms]/it,cpu:[69%],memory:[3920K]
column:[Cabin] check [inverse type] complete,speed:[25.91ms]/it,cpu:[23%],memory:[1916K]
column:[Embarked] check [inverse type] complete,speed:[26.04ms]/it,cpu:[24%],memory:[34476K]


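### 7.7 int-to-float test: check_int_trans_float  
pandas infers some features as int, while the online service may send float, so the following are tested:  
- consistency between transform and transform_single after conversion to float  
- consistency of the transform output between the int and float versions of a feature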
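
One way the int and float versions can diverge is a pipe that converts values to strings before a lookup, since `str(1)` and `str(1.0)` differ. The sketch below illustrates that failure mode with a toy label encoder; `make_encoder` and `check_int_trans_float` are hypothetical helpers, not the library's code.

```python
def make_encoder(cast_to_str):
    """Build a toy label encoder for Pclass; optionally stringify keys first."""
    train_values = [1, 2, 3]  # Pclass as pandas would infer it: int
    if cast_to_str:
        mapping = {str(v): i for i, v in enumerate(train_values)}
        return lambda x: mapping.get(str(x), -1)
    mapping = {v: i for i, v in enumerate(train_values)}
    return lambda x: mapping.get(x, -1)


def check_int_trans_float(encode, values):
    """Return True when int inputs and their float twins encode identically."""
    return all(encode(v) == encode(float(v)) for v in values)


# Numeric keys: 1 and 1.0 hash and compare equal, so the lookup agrees.
safe = check_int_trans_float(make_encoder(cast_to_str=False), [1, 2, 3])
# String keys: "1" is known but "1.0" is not, so the float input falls back to -1.
fragile = check_int_trans_float(make_encoder(cast_to_str=True), [1, 2, 3])
print(safe, fragile)
```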

In [64]:
ml_combine.check_int_trans_float(x_test)

column:[PassengerId] check [int trans float] complete,speed:[25.47ms]/it,cpu:[23%],memory:[864K]
column:[Pclass] check [int trans float] complete,speed:[25.63ms]/it,cpu:[25%],memory:[140K]
column:[SibSp] check [int trans float] complete,speed:[26.59ms]/it,cpu:[39%],memory:[1732K]
column:[Parch] check [int trans float] complete,speed:[27.45ms]/it,cpu:[44%],memory:[2700K]


### 7.8 Automated test: auto_test
auto_test bundles all of the tests above into a single function

In [65]:
#ml_combine.auto_test(x_test)

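## TODO  

- Extend wrapper-style feature selection methods, such as simulated annealing, genetic algorithms, and group (gang) detection
- Add a data monitoring module, mainly for data shift (p(x), p(y), p(y|x)); modules that model and correct for these shifts will be added step by step
- Extend the NLP text classification coverage, e.g. TextCNN, Bert+BiLSTM  
- Add a data sampling module  
- Extend the ensemble learning methods (ensembles for imbalanced samples first)

Pull requests that extend the project are welcome!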