# NLP Tasks

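This part does not have many features yet. It mainly consists of modules for data cleaning and feature extraction, demonstrated below on a hotel review dataset.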


In [1]:
# Load and split the data
import os
os.chdir("../../")  # switch to the directory that contains easymlops
import pandas as pd
data=pd.read_csv("./data/demo2.csv",encoding="gbk").sample(frac=1)
print(data.head(5).to_markdown())

|      |   label | review |
|-----:|--------:|:-------|

In [2]:
data.shape

(7766, 2)

In [3]:
x_train=data[:6000]
x_test=data[6000:]
y_train=x_train["label"]
y_test=x_test["label"]
del x_train["label"]
del x_test["label"]

## Text Processing

### Text Cleaning

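- Lower: converts all English letters to lowercase
- Upper: converts all English letters to uppercase
- RemoveDigits: removes every digit from the string
- ReplaceDigits: replaces every digit in the string with a special token, `symbols="[d]"`
- RemovePunctuation: removes all punctuation (Chinese and English) from the string
- ReplacePunctuation: replaces all punctuation in the string with a special token, `symbols="[p]"`
- RemoveWhitespace: removes whitespace characters, including spaces, tabs, and carriage returns
- Replace: replaces substrings within a string (note that `easymlops.table.preprocessing.Replace` only supports replacing the whole string)
- RemoveStopWords: removes stop words, specified with `stop_words=["a","b"...]` or `stop_words_path=xxx/stopwords.txt` (one word per line)
- ExtractChineseWords: keeps only the Chinese characters in the string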
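
To make the effect of these steps concrete, here is a minimal standard-library sketch of a few of them. It is not the easymlops implementation: the function names, the `CN_PUNCTUATION` set, and the Unicode range used for Chinese characters are all assumptions chosen for illustration.

```python
import re
import string

# Assumption: a small, non-exhaustive set of Chinese punctuation marks.
CN_PUNCTUATION = "，。！？、；：（）《》【】“”‘’"


def remove_digits(text: str) -> str:
    """Drop every digit character."""
    return re.sub(r"\d", "", text)


def remove_punctuation(text: str) -> str:
    """Drop ASCII punctuation and the Chinese marks listed above."""
    table = str.maketrans("", "", string.punctuation + CN_PUNCTUATION)
    return text.translate(table)


def remove_whitespace(text: str) -> str:
    """Drop spaces, tabs, newlines and other whitespace."""
    return re.sub(r"\s+", "", text)


def extract_chinese(text: str) -> str:
    """Keep only characters in the common CJK range (an assumed range)."""
    return "".join(re.findall(r"[\u4e00-\u9fa5]+", text))


def clean(text: str) -> str:
    """Chain the steps in the same spirit as a pipeline of cleaning pipes."""
    text = text.lower()
    text = remove_digits(text)
    text = remove_punctuation(text)
    text = remove_whitespace(text)
    return extract_chinese(text)


print(clean("Room 301 早餐，很好！ OK"))
```

Each function takes and returns a plain string, so the steps can be reordered or dropped freely, which is the same property the pipe modules rely on.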

In [4]:
from easymlops import NLPPipeline
from easymlops.nlp.preprocessing import *

In [5]:
nlp=NLPPipeline()
nlp.pipe(Lower())\
   .pipe(Upper())\
   .pipe(RemoveDigits())\
   .pipe(RemovePunctuation())\
   .pipe(RemoveWhitespace())\
   .pipe(RemoveStopWords(stop_words=["的","你","我","得","其他"]))\
   .pipe(ExtractChineseWords())

x_test_new=nlp.fit(x_train).transform(x_test)
print(x_test_new.head(5).to_markdown())

|      | review |
|-----:|:-------|
| 7127 | 房间残旧天花发霉卫生间马桶冲洗不干净没有窗户排气不好以后不用钱都不去哪里住了 |

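### Chinese Word Segmentation
- ExtractJieBaWords: only jieba segmentation is supported for now; by default the resulting tokens are separated by spaces

Note: this requires `pip install jieba`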

In [6]:
nlp=NLPPipeline()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())

x_test_new=nlp.fit(x_train).transform(x_test)
print(x_test_new.head(5).to_markdown())

|      | review |
|-----:|:-------|

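### Keyword Extraction
- ExtractKeyWords: extracts the keywords specified with `key_words=["kw1","kw2"]` or `key_words_path=xxx/keyword.txt`
- AppendKeyWords: appends the extracted keywords to the end of the original text

Note: this requires `pip install pyahocorasick`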
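
As a rough illustration of what keyword extraction produces, the sketch below finds every keyword occurrence in order of appearance and joins the hits with spaces. It uses a regex alternation from the standard library as a stand-in for Aho-Corasick matching; `extract_key_words` is a hypothetical helper, not the library's code, and its handling of overlapping keywords is an assumption.

```python
import re


def extract_key_words(text: str, key_words: list[str]) -> str:
    """Return all keyword occurrences in `text`, in order, space-separated."""
    if not key_words:
        return ""
    # Try longer keywords first so overlapping alternatives prefer the longest.
    ordered = sorted(key_words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    return " ".join(pattern.findall(text))


print(extract_key_words("酒店前台很好，房间干净，前台热情", ["前台", "房间", "酒店"]))
```

A dedicated multi-pattern matcher scans the text once regardless of how many keywords there are, which is presumably why the library depends on `pyahocorasick` for large keyword lists.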

In [8]:
nlp=NLPPipeline()
nlp.pipe(ExtractKeyWords(key_words=["大堂","环境","交通","酒店","空气","嘉宾","国际","免费","早餐","宽带","前台","实惠","携程","房间","卫生"]))

x_test_new=nlp.fit(x_train).transform(x_test)
print(x_test_new.head(5).to_markdown())

|      | review                             |
|-----:|:-----------------------------------|
| 7127 | 房间 卫生                          |
| 1640 | 酒店                               |
| 7113 | 酒店 酒店 房间 前台 早餐 前台 房间 |
| 5536 | 携程 酒店 交通 酒店                |
| 5288 | 实惠                               |


### N-gram Extraction
- ExtractNGramWords: set `n_grams=[1,2,3]` to also concatenate adjacent tokens into 2-grams and 3-grams
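
The n-gram step can be sketched in plain Python as follows. This is a hypothetical re-implementation, not the library's code: adjacent space-separated tokens are joined without a separator, and, as an assumption inferred from the example output in this section, a text with too few tokens to form any n-gram is returned unchanged.

```python
def extract_ngrams(text: str, n_grams=(2,)) -> str:
    """Join every run of n adjacent tokens, for each n in `n_grams`."""
    tokens = text.split()
    grams = []
    for n in n_grams:
        grams.extend(
            "".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    # Assumed fallback: keep the original tokens if no n-gram could be built.
    return " ".join(grams) if grams else " ".join(tokens)


print(extract_ngrams("携程 酒店 交通 酒店", n_grams=[2]))
```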

In [10]:
nlp=NLPPipeline()
nlp.pipe(ExtractKeyWords(key_words=["大堂","环境","交通","酒店","空气","嘉宾","国际","免费","早餐","宽带","前台","实惠","携程","房间","卫生"]))\
   .pipe(ExtractNGramWords(n_grams=[2]))

x_test_new=nlp.fit(x_train).transform(x_test)
print(x_test_new.head(5).to_markdown())

|      | review                                                |
|-----:|:------------------------------------------------------|
| 7127 | 房间卫生                                              |
| 1640 | 酒店                                                  |
| 7113 | 酒店酒店 酒店房间 房间前台 前台早餐 早餐前台 前台房间 |
| 5536 | 携程酒店 酒店交通 交通酒店                            |
| 5288 | 实惠                                                  |


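## Text Feature Extraction

There are three main kinds of features:
- common features such as bow/tfidf
- topic-model features such as lsi/lda
- word-vector features such as fasttext/word2vec/doc2vec

These pipe modules take the output of the text-processing steps above as their input. **Note: the processed tokens must be separated by spaces.**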
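
To show why the space-separated format matters, here is a minimal bag-of-words sketch that splits each document on whitespace and counts tokens against a vocabulary learned from the training documents. `fit_vocabulary` and `bag_of_words` are hypothetical names, and ignoring tokens unseen during fitting is an assumption rather than documented library behavior.

```python
from collections import Counter


def fit_vocabulary(docs: list[str]) -> list[str]:
    """Collect the sorted set of tokens seen in the training documents."""
    return sorted({tok for doc in docs for tok in doc.split()})


def bag_of_words(doc: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary token occurs in `doc`."""
    counts = Counter(doc.split())
    # Tokens outside the vocabulary are simply not counted.
    return [counts[tok] for tok in vocabulary]


train = ["酒店 房间 干净", "房间 早餐 房间"]
vocab = fit_vocabulary(train)
vector = bag_of_words("房间 早餐 房间 泳池", vocab)
print(dict(zip(vocab, vector)))
```

If the text were not split into tokens first, the whole review would count as a single token and every document would map to its own column.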

### Bag-of-Words Features

In [5]:
from easymlops.nlp.representation import *

In [12]:
nlp=NLPPipeline()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(BagOfWords())

x_test_new=nlp.fit(x_train).transform(x_test)
# Show only the first 10 columns
top_10_cols=x_test_new.columns.tolist()[:10]
print(x_test_new[top_10_cols].head(5).to_markdown())

|      |   bag_review_ |   bag_review_一 |   bag_review_一一 |   bag_review_一一列举 |   bag_review_一丁点 |   bag_review_一上 |   bag_review_一上午 |   bag_review_一下 |   bag_review_一下下 |   bag_review_一下子 |
|-----:|--------------:|----------------:|------------------:|----------------------:|--------------------:|------------------:|--------------------:|------------------:|--------------------:|--------------------:|
| 7127 |             0 |               0 |                 0 |                     0 |                   0 |                 0 |                   0 |                 0 |                   0 |                   0 |
| 1640 |             0 |               0 |                 0 |                     0 |                   0 |                 0 |                   0 |                 0 |                   0 |                   0 |
| 7113 |             0 |               0 |                 0 |                     0 |                   0 |                 0 |                   0 |             

### TF-IDF Features
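
TF-IDF reweights token counts so that tokens occurring in many documents contribute less. The sketch below uses one common smoothed variant, `idf = log((1 + N) / (1 + df)) + 1`; this formula is an assumption for illustration, and the `TFIDF` pipe may use a different weighting or normalization.

```python
import math
from collections import Counter


def fit_idf(docs: list[str]) -> dict[str, float]:
    """Compute a smoothed inverse document frequency per token."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    return {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}


def tfidf(doc: str, idf: dict[str, float]) -> dict[str, float]:
    """Weight raw term counts by idf, skipping tokens unseen during fitting."""
    counts = Counter(doc.split())
    return {tok: counts[tok] * idf[tok] for tok in counts if tok in idf}


idf = fit_idf(["酒店 房间", "酒店 早餐", "酒店 前台"])
print(tfidf("酒店 酒店 房间 泳池", idf))
```

In this toy corpus "酒店" appears in every document, so its idf falls to the minimum of 1.0, while the rarer "房间" gets a larger weight.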

In [13]:
nlp=NLPPipeline()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(TFIDF())

x_test_new=nlp.fit(x_train).transform(x_test)
# Show only the first 10 columns
top_10_cols=x_test_new.columns.tolist()[:10]
print(x_test_new[top_10_cols].head(5).to_markdown())

|      |   tfidf_review_ |   tfidf_review_一 |   tfidf_review_一一 |   tfidf_review_一一列举 |   tfidf_review_一丁点 |   tfidf_review_一上 |   tfidf_review_一上午 |   tfidf_review_一下 |   tfidf_review_一下下 |   tfidf_review_一下子 |
|-----:|----------------:|------------------:|--------------------:|------------------------:|----------------------:|--------------------:|----------------------:|--------------------:|----------------------:|----------------------:|
| 7127 |               0 |         0         |                   0 |                       0 |                     0 |                   0 |                     0 |                   0 |                     0 |                     0 |
| 1640 |               0 |         0         |                   0 |                       0 |                     0 |                   0 |                     0 |                   0 |                     0 |                     0 |
| 7113 |               0 |         0         |                   0 |               

### LDA Topic Features
This requires `pip install gensim`

In [14]:
nlp=NLPPipeline()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(LdaTopicModel(num_topics=10))

x_test_new=nlp.fit(x_train).transform(x_test)
print(x_test_new.head(5).to_markdown())

|      |   lda_review_0 |   lda_review_1 |   lda_review_2 |   lda_review_3 |   lda_review_4 |   lda_review_5 |   lda_review_6 |   lda_review_7 |   lda_review_8 |   lda_review_9 |
|-----:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|
| 7127 |      0         |              0 |      0         |       0        |              0 |       0.727733 |              0 |       0.234104 |              0 |              0 |
| 1640 |      0.978557  |              0 |      0         |       0        |              0 |       0        |              0 |       0        |              0 |              0 |
| 7113 |      0.10456   |              0 |      0.114921  |       0        |              0 |       0.650086 |              0 |       0.125424 |              0 |              0 |
| 5536 |      0.0856321 |              0 |      0.0525631 |       0.175759 |              0 |       0.678

### LSI Topic Features

In [15]:
nlp=NLPPipeline()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(LsiTopicModel(num_topics=10))

x_test_new=nlp.fit(x_train).transform(x_test)
print(x_test_new.head(5).to_markdown())

|      |   lsi_review_0 |   lsi_review_1 |   lsi_review_2 |   lsi_review_3 |   lsi_review_4 |   lsi_review_5 |   lsi_review_6 |   lsi_review_7 |   lsi_review_8 |   lsi_review_9 |
|-----:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|
| 7127 |       0.858575 |     -0.70027   |      -0.843685 |     -0.0668491 |     -0.693314  |       0.441741 |    -0.183873   |    -1.17893    |      0.0851453 |      0.755182  |
| 1640 |       5.86927  |      2.11575   |       1.06398  |     -0.622473  |      0.598806  |       0.715508 |     0.508536   |     0.0282597  |     -0.622821  |     -0.417345  |
| 7113 |      15.1899   |      0.261911  |       1.16639  |     -1.48284   |     -3.16782   |       2.59123  |     0.287749   |     0.206293   |      0.414483  |     -0.272836  |
| 5536 |       4.78021  |     -1.56075   |      -0.927984 |      0.797013  |      2.27539   |       0.571

### Word2Vec Word-Vector Features
Here the document feature is the average of the word vectors of all its tokens
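
The averaging step can be sketched without any model at all, given a token-to-vector lookup. The toy vectors below are made up, `average_vectors` is a hypothetical helper, and skipping out-of-vocabulary tokens (returning a zero vector when nothing matches) is an assumption about how such cases might be handled.

```python
def average_vectors(doc: str, word_vectors: dict, size: int) -> list[float]:
    """Average the vectors of the known tokens in a space-separated document."""
    vectors = [word_vectors[tok] for tok in doc.split() if tok in word_vectors]
    if not vectors:
        # No known token: fall back to a zero vector of the right size.
        return [0.0] * size
    # zip(*vectors) iterates over dimensions, so each column is averaged.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy 2-dimensional vectors, purely illustrative.
toy_vectors = {"酒店": [1.0, 0.0], "房间": [0.0, 1.0]}
print(average_vectors("酒店 房间 泳池", toy_vectors, size=2))
```

Averaging discards word order, so two reviews with the same tokens in a different order get the same feature vector.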

In [17]:
nlp=NLPPipeline()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(Word2VecModel(embedding_size=10))

x_test_new=nlp.fit(x_train).transform(x_test)
print(x_test_new.head(5).to_markdown())

|      |   w2v_review_0 |   w2v_review_1 |   w2v_review_2 |   w2v_review_3 |   w2v_review_4 |   w2v_review_5 |   w2v_review_6 |   w2v_review_7 |   w2v_review_8 |   w2v_review_9 |
|-----:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|
| 7127 |      0.149905  |      0.539101  |       1.61803  |     -0.520869  |       1.28348  |       0.269167 |        1.7893  |       0.367624 |      -1.05573  |       0.292887 |
| 1640 |     -0.535773  |     -0.0151048 |       0.963358 |     -0.186407  |       1.03127  |       0.170625 |        1.876   |       1.2312   |      -0.72744  |      -0.37903  |
| 7113 |     -0.0289475 |      0.477569  |       1.1908   |     -0.092565  |       0.66135  |       0.250403 |        1.34142 |       0.445295 |      -1.55385  |      -0.128968 |
| 5536 |      0.0400489 |     -0.0727166 |       1.32027  |     -0.0587338 |       0.402637 |       0.173

### Doc2Vec Document Vectors

In [18]:
nlp=NLPPipeline()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(Doc2VecModel(embedding_size=10))

x_test_new=nlp.fit(x_train).transform(x_test)
print(x_test_new.head(5).to_markdown())

|      |   d2v_review_0 |   d2v_review_1 |   d2v_review_2 |   d2v_review_3 |   d2v_review_4 |   d2v_review_5 |   d2v_review_6 |   d2v_review_7 |   d2v_review_8 |   d2v_review_9 |
|-----:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|
| 7127 |      0.0991587 |    -0.0109045  |     0.0790115  |     -0.0498847 |      0.0956094 |      0.0109559 |       0.214567 |      -0.152318 |      -0.172422 |     0.0982961  |
| 1640 |      0.187836  |    -0.0663459  |     0.0911619  |      0.0199243 |      0.431156  |     -0.107956  |       0.714324 |       0.138839 |      -0.324371 |    -0.132915   |
| 7113 |     -0.207934  |     0.205402   |     0.176139   |     -0.291981  |      0.407413  |     -0.0131239 |       0.731769 |      -0.439703 |      -0.656194 |     0.00921523 |
| 5536 |      0.441469  |    -0.0213459  |    -0.00080414 |      0.297836  |      0.456871  |     -0.2784

### FastText Word Vectors

In [19]:
nlp=NLPPipeline()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(FastTextModel(embedding_size=10))

x_test_new=nlp.fit(x_train).transform(x_test)
print(x_test_new.head(5).to_markdown())

|      |   fasttext_review_0 |   fasttext_review_1 |   fasttext_review_2 |   fasttext_review_3 |   fasttext_review_4 |   fasttext_review_5 |   fasttext_review_6 |   fasttext_review_7 |   fasttext_review_8 |   fasttext_review_9 |
|-----:|--------------------:|--------------------:|--------------------:|--------------------:|--------------------:|--------------------:|--------------------:|--------------------:|--------------------:|--------------------:|
| 7127 |           -0.289566 |           0.158579  |             1.77409 |           -0.214785 |          -0.452001  |          -0.945153  |            1.59104  |           0.0912326 |           -0.832337 |           -0.387233 |
| 1640 |           -0.717482 |           0.0832835 |             1.49416 |            0.726315 |           0.0506884 |          -1.04302   |            1.60123  |           0.863163  |           -0.638874 |           -0.672286 |
| 7113 |           -0.583676 |          -0.0858972 |             1.80547 |          

## Text Classification
With the classifiers from the table module, the pipeline can now handle text classification tasks

In [6]:
from easymlops.table.classification import LGBMClassification

In [7]:
nlp=NLPPipeline()
nlp.pipe(ExtractChineseWords())\
   .pipe(ExtractJieBaWords())\
   .pipe(BagOfWords())\
   .pipe(LGBMClassification(y=y_train))

x_test_new=nlp.fit(x_train).transform(x_test)
print(x_test_new.head(5).to_markdown())

|      |        1 |          0 |
|-----:|---------:|-----------:|
| 1000 | 0.986622 | 0.0133777  |
| 5291 | 0.934425 | 0.0655745  |
| 3134 | 0.227386 | 0.772614   |
| 4607 | 0.990448 | 0.00955158 |
| 3677 | 0.954822 | 0.0451778  |
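
The output above appears to hold one probability column per class label. Under that assumption, a predicted label is the column with the highest probability in each row. The sketch below shows this on the five rows printed above using plain dictionaries; `predict_labels` and `accuracy` are hypothetical helpers, not part of easymlops.

```python
def predict_labels(proba_rows: list[dict]) -> list:
    """Pick, for each row, the class whose probability is highest."""
    return [max(row, key=row.get) for row in proba_rows]


def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of positions where prediction and ground truth agree."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# The five rows shown in the output table, keyed by class label.
rows = [
    {1: 0.986622, 0: 0.0133777},
    {1: 0.934425, 0: 0.0655745},
    {1: 0.227386, 0: 0.772614},
    {1: 0.990448, 0: 0.00955158},
    {1: 0.954822, 0: 0.0451778},
]
print(predict_labels(rows))
```

On the DataFrame itself, `x_test_new.idxmax(axis=1)` gives the same per-row argmax, and the result can then be compared with `y_test` to estimate accuracy.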
