# pipenlp  
  
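## Introduction
The `pipenlp` package makes it easy to build NLP tasks as a `Pipeline` workflow. It currently covers:
- Data cleaning, keyword extraction, and similar preprocessing: `pipenlp.preprocessing`
- Text feature extraction, including traditional models such as BOW and TF-IDF, topic models such as LDA and LSI, word-embedding models such as fastText and word2vec, and dimensionality-reduction models such as PCA and NMF: `pipenlp.representation`
- Text classification, including traditional machine-learning models such as LightGBM decision trees, logistic regression, and SVM: `pipenlp.classification`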

## Installation
```bash
pip install git+https://github.com/zhulei227/pipenlp
```  

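Note: only the essential dependencies are installed. Other packages can be installed as the modules that need them come into use; for example, you are only prompted to run `pip install jieba` when you use `pipenlp.preprocessing.ExtractJieBaWords`. It is therefore advisable to run the pipeline on a small sample first to confirm that all dependencies are present, and only then apply it to the full dataset.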

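One way to check optional dependencies up front, rather than discovering them mid-run, is to ask the import system whether each module can be found. The sketch below uses only the standard library; the `missing_modules` helper and the list of module names are illustrative assumptions, not part of `pipenlp`.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Hypothetical list: replace it with the packages your own pipeline steps need.
needed = ["jieba"]
missing = missing_modules(needed)
if missing:
    print("Install before running the pipeline: pip install " + " ".join(missing))
else:
    print("All optional dependencies found.")
```

Note that the import name and the name used with `pip install` are the same for `jieba`, but that is not true of every package.
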
## Usage

Import the main `PipeNLP` class:

```python
from pipenlp import PipeNLP

nlp = PipeNLP()
```

Prepare the data as a `pandas.DataFrame`:

```python
import pandas as pd

data = pd.read_csv("./data/demo.csv")
data.head(5)
```

| | text | label |
|---|---|---|
| 0 | 动力差 | 消极 |
| 1 | 油耗很低，操控比较好。第二箱油还没有跑完。油耗显示为5.9了，本人13年12月刚拿的本，跑出... | 积极 |
| 2 | 乘坐舒适性 | 积极 |
| 3 | 最满意的不止一点：1、车内空间一流，前后排均满足使用需求，后备箱空间相当大；2、外观时尚，珠... | 积极 |
| 4 | 空间大，相对来说舒适性较好，性比价好些。 | 积极 |


### Data cleaning

```python
from pipenlp.preprocessing import *

nlp.pipe(RemoveDigits())\
   .pipe(RemovePunctuation())\
   .pipe(RemoveWhitespace())
```

```
<pipenlp.pipenlp.PipeNLP at 0x2b60f15cb88>
```

```python
# Fit the pipeline, then store the cleaned text for every row
data["output"] = nlp.fit(data["text"]).transform(data["text"])
data[["output"]].head(5)
```

| | output |
|---|---|
| 0 | 动力差 |
| 1 | 油耗很低操控比较好第二箱油还没有跑完油耗显示为了本人年月刚拿的本跑出这样的油耗很满意了 |
| 2 | 乘坐舒适性 |
| 3 | 最满意的不止一点车内空间一流前后排均满足使用需求后备箱空间相当大外观时尚珠光白尤其喜欢看起来... |
| 4 | 空间大相对来说舒适性较好性比价好些 |

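To make the three cleaning steps concrete, here is a standard-library sketch of what removing digits, punctuation, and whitespace could look like on a plain string. The function names are hypothetical and the rules (Unicode punctuation categories, all whitespace removed) are assumptions; `pipenlp`'s own implementations may differ.

```python
import unicodedata


def remove_digits(text):
    """Drop every character that str.isdigit() treats as a digit."""
    return "".join(ch for ch in text if not ch.isdigit())


def remove_punctuation(text):
    """Drop characters whose Unicode category starts with 'P' (punctuation)."""
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))


def remove_whitespace(text):
    """Drop all whitespace, including spaces, tabs, and newlines."""
    return "".join(text.split())


def clean(text):
    # Apply the steps in the same order as the pipeline above.
    for step in (remove_digits, remove_punctuation, remove_whitespace):
        text = step(text)
    return text


print(clean("油耗很低，操控比较好。 油耗显示为5.9了"))  # 油耗很低操控比较好油耗显示为了
```

Order matters here: removing the digits from `5.9` leaves a stray `.`, which the punctuation step then removes.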

### Word segmentation
By default, tokens in the output are separated by spaces.

```python
nlp = PipeNLP()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())
```

```
<pipenlp.pipenlp.PipeNLP at 0x2b61b7e3508>
```

```python
# Fit the pipeline, then store the segmented text for every row
data["output"] = nlp.fit(data["text"]).transform(data["text"])
data[["output"]].head(5)
```

| | output |
|---|---|
| 0 | 动力 差 |
| 1 | 油耗 很 低 操控 比较 好 第二 箱油 还 没有 跑 完 油耗 显示 为了 本人 年 月 ... |
| 2 | 乘坐 舒适性 |
| 3 | 最 满意 的 不止 一点 车 内 空间 一流 前后排 均 满足 使用 需求 后备箱 空间 相... |
| 4 | 空间 大 相对来说 舒适性 较 好性 比价 好些 |


### Text feature extraction
#### Bag-of-words (BOW) model

```python
from pipenlp.representation import *
```

```python
nlp = PipeNLP()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(BagOfWords())
```

```
<pipenlp.pipenlp.PipeNLP at 0x2b61b7e9a08>
```

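```python
nlp.fit(data["text"]).transform(data["text"]).head(5)
```

| | 一 | 一下 | 一个 | 一个劲 | 一个月 | 一个舒服 | 一二 | 一些 | 一体 | 一停 | ... | 黄色 | 黑 | 黑屏 | 黑底 | 黑烟 | 黑白 | 黑色 | 默认 | 鼓包 | 齐全 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |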

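The table above has one column per vocabulary word and one row per document, with each cell holding a count. As a rough illustration of that idea, the sketch below builds the same kind of matrix from space-separated tokens using only the standard library. The `bag_of_words` function and the toy documents are assumptions for illustration; this is not how `BagOfWords` is implemented in `pipenlp`.

```python
from collections import Counter


def bag_of_words(docs):
    """Count tokens per document over a shared, sorted vocabulary.

    docs: list of strings whose tokens are separated by spaces.
    Returns (vocabulary, rows), where rows[i][j] is the number of
    times vocabulary[j] occurs in docs[i].
    """
    counts = [Counter(doc.split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[token] for token in vocabulary] for c in counts]
    return vocabulary, rows


# Toy documents in the space-separated form produced by word segmentation.
docs = ["动力 差", "乘坐 舒适性", "空间 大 空间 舒适性"]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Most cells are zero because each document uses only a few of the vocabulary words, which is why the first columns of the real output above are all zeros.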

#### LDA topic model

```python
nlp = PipeNLP()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(LdaTopicModel(num_topics=10))
```

```
<pipenlp.pipenlp.PipeNLP at 0x2b61fe9ec48>
```

```python
nlp.fit(data["text"]).transform(data["text"]).head(5)
```

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.033336 | 0.033338 | 0.03334 | 0.033342 | 0.033337 | 0.033348 | 0.033346 | 0.033343 | 0.033338 | 0.699931 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.970957 | 0.0 | 0.0 |
| 2 | 0.033337 | 0.033336 | 0.033336 | 0.699944 | 0.033355 | 0.033342 | 0.033339 | 0.033339 | 0.033336 | 0.033337 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.978554 | 0.0 | 0.0 |
| 4 | 0.011115 | 0.011116 | 0.011115 | 0.011116 | 0.011116 | 0.011116 | 0.89996 | 0.011115 | 0.011115 | 0.011116 |


#### FastText model

```python
nlp = PipeNLP()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(FastTextModel(embedding_size=8))
```

```
<pipenlp.pipenlp.PipeNLP at 0x2b62d91ea88>
```

```python
nlp.fit(data["text"]).transform(data["text"]).head(5)
```

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.499252 | -1.062278 | -0.498653 | 0.006183 | 0.841716 | 1.210205 | -0.377396 | 0.903633 |
| 1 | 0.453218 | -0.888729 | -0.410743 | 0.002836 | 0.651204 | 1.012078 | -0.332197 | 0.746859 |
| 2 | 0.323486 | -0.590322 | -0.257216 | 0.031456 | 0.402557 | 0.635742 | -0.21227 | 0.48293 |
| 3 | 0.366357 | -0.726029 | -0.347917 | 0.001929 | 0.535537 | 0.846559 | -0.273708 | 0.617496 |
| 4 | 0.239123 | -0.466997 | -0.211472 | 0.014322 | 0.347688 | 0.514858 | -0.157679 | 0.36492 |


#### PCA dimensionality reduction

```python
nlp = PipeNLP()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(BagOfWords())\
   .pipe(PCADecomposition(n_components=2))
```

```
<pipenlp.pipenlp.PipeNLP at 0x2b62d918a88>
```

```python
nlp.fit(data["text"]).transform(data["text"]).head(5)
```

| | 0 | 1 |
|---|---|---|
| 0 | -1.257324 | -0.222479 |
| 1 | 1.071957 | 0.266118 |
| 2 | -1.288506 | -0.212698 |
| 3 | 0.281789 | -0.625973 |
| 4 | -1.293974 | -0.353327 |


### Text classification
#### LGBM

```python
from pipenlp.classification import *

nlp = PipeNLP()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(BagOfWords())\
   .pipe(LGBMClassification(y=data["label"]))
```

```
<pipenlp.pipenlp.PipeNLP at 0x2b62d931108>
```

```python
nlp.fit(data["text"]).transform(data["text"]).head(5)
```

| | 积极 | 消极 |
|---|---|---|
| 0 | 0.245708 | 0.754292 |
| 1 | 0.913772 | 0.086228 |
| 2 | 0.4356 | 0.5644 |
| 3 | 0.999868 | 0.000132 |
| 4 | 0.916361 | 0.083639 |


#### Logistic regression

```python
from pipenlp.classification import *

nlp = PipeNLP()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(BagOfWords())\
   .pipe(PCADecomposition(n_components=8))\
   .pipe(LogisticRegressionClassification(y=data["label"]))
```

```
<pipenlp.pipenlp.PipeNLP at 0x2b633e699c8>
```

```python
nlp.fit(data["text"]).transform(data["text"]).head(5)
```

| | 积极 | 消极 |
|---|---|---|
| 0 | 0.502272 | 0.497728 |
| 1 | 0.780132 | 0.219868 |
| 2 | 0.452948 | 0.547052 |
| 3 | 0.999974 | 2.6e-05 |
| 4 | 0.776129 | 0.223871 |

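The classifiers return one probability per class rather than a single label. A small sketch of turning those probabilities into predicted labels, using the first three rows of the logistic regression output shown above, could look like this. The `predict_labels` helper is hypothetical and works on plain dictionaries so that it runs without any third-party package.

```python
def predict_labels(probability_rows):
    """Pick, for each row, the class with the highest probability."""
    return [max(row, key=row.get) for row in probability_rows]


# The first three rows of the logistic regression output shown above.
rows = [
    {"积极": 0.502272, "消极": 0.497728},
    {"积极": 0.780132, "消极": 0.219868},
    {"积极": 0.452948, "消极": 0.547052},
]
print(predict_labels(rows))  # ['积极', '积极', '消极']
```

If the result of `transform` is a `pandas.DataFrame`, as the output above suggests, `DataFrame.idxmax(axis=1)` should give the same labels in one call.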

### Model persistence
#### Saving

```python
nlp.save("nlp.pkl")
```

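#### Loading
Only the model parameters are saved, so the pipeline structure must be declared again before loading (no constructor arguments need to be passed).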

```python
nlp = PipeNLP()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(BagOfWords())\
   .pipe(PCADecomposition())\
   .pipe(LogisticRegressionClassification())
```

```
<pipenlp.pipenlp.PipeNLP at 0x2b6348d4488>
```

```python
nlp.load("nlp.pkl")
```

```python
nlp.transform(data["text"]).head(5)
```

| | 积极 | 消极 |
|---|---|---|
| 0 | 0.502272 | 0.497728 |
| 1 | 0.780132 | 0.219868 |
| 2 | 0.452948 | 0.547052 |
| 3 | 0.999974 | 2.6e-05 |
| 4 | 0.776129 | 0.223871 |

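To illustrate why the structure has to be redeclared, here is a toy sketch of a pipeline that writes only each step's learned parameters to disk and restores them into freshly declared steps. `Scale` and `ToyPipeline` are hypothetical classes written for this example under that assumption; they are not `pipenlp`'s actual code or file format.

```python
import os
import pickle
import tempfile


class Scale:
    """Toy step: learns the maximum value, then divides by it."""

    def __init__(self):
        self.max_value = None

    def fit(self, values):
        self.max_value = max(values)
        return self

    def transform(self, values):
        return [v / self.max_value for v in values]


class ToyPipeline:
    """Hypothetical pipeline that persists parameters only, not step objects."""

    def __init__(self):
        self.steps = []

    def pipe(self, step):
        self.steps.append(step)
        return self  # returning self is what allows chained .pipe() calls

    def fit(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return self

    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values

    def save(self, path):
        # Only each step's attribute dict is written, not its class.
        with open(path, "wb") as f:
            pickle.dump([vars(step) for step in self.steps], f)

    def load(self, path):
        # The steps must already be declared so the parameters have a home.
        with open(path, "rb") as f:
            params = pickle.load(f)
        for step, step_params in zip(self.steps, params):
            vars(step).update(step_params)


path = os.path.join(tempfile.mkdtemp(), "toy.pkl")
ToyPipeline().pipe(Scale()).fit([1.0, 2.0, 4.0]).save(path)

restored = ToyPipeline().pipe(Scale())  # redeclare the structure, no arguments
restored.load(path)
print(restored.transform([2.0]))  # [0.5]
```

Because the file in this sketch holds only parameter values, loading into a pipeline whose steps are declared in a different order would silently assign parameters to the wrong steps, so the declaration should mirror the one used at save time.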

## TODO  

- Add more advanced text classification models such as TextCNN and BERT
- Support pretrained word vectors