In [None]:
!pip install --upgrade paddlehub -i https://pypi.tuna.tsinghua.edu.cn/simple
# Download the ERNIE module
!hub install ernie

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already up-to-date: paddlehub in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (1.5.3)
Module ernie already installed in /home/aistudio/.paddlehub/modules/ernie


In [None]:
# -*- coding: utf8 -*-

from paddle.fluid.framework import switch_main_program
import paddlehub 
import paddle.fluid as fluid

In [None]:
module = paddlehub.Module(name='ernie')
# dataset = hub.dataset.ChnSentiCorp()

[2020-02-28 19:43:56,204] [    INFO] - Installing ernie module
[2020-02-28 19:43:56,224] [    INFO] - Module ernie already installed in /home/aistudio/.paddlehub/modules/ernie


To try another semantic model (such as ernie_tiny or RoBERTa), simply change the `name` argument passed to `Module`.

Model                              | PaddleHub Module
---------------------------------- | :------:
ERNIE, Chinese                     | `hub.Module(name='ernie')`
ERNIE 2.0 Tiny, Chinese            | `hub.Module(name='ernie_tiny')`
ERNIE 2.0 Base, English            | `hub.Module(name='ernie_v2_eng_base')`
ERNIE 2.0 Large, English           | `hub.Module(name='ernie_v2_eng_large')`
RoBERTa-Large, Chinese             | `hub.Module(name='roberta_wwm_ext_chinese_L-24_H-1024_A-16')`
RoBERTa-Base, Chinese              | `hub.Module(name='roberta_wwm_ext_chinese_L-12_H-768_A-12')`
BERT-Base, Uncased                 | `hub.Module(name='bert_uncased_L-12_H-768_A-12')`
BERT-Large, Uncased                | `hub.Module(name='bert_uncased_L-24_H-1024_A-16')`
BERT-Base, Cased                   | `hub.Module(name='bert_cased_L-12_H-768_A-12')`
BERT-Large, Cased                  | `hub.Module(name='bert_cased_L-24_H-1024_A-16')`
BERT-Base, Multilingual Cased      | `hub.Module(name='bert_multi_cased_L-12_H-768_A-12')`
BERT-Base, Chinese                 | `hub.Module(name='bert_chinese_L-12_H-768_A-12')`

To load a **custom dataset** for transfer learning, see [Custom Datasets](https://github.com/PaddlePaddle/PaddleHub/wiki/PaddleHub%E9%80%82%E9%85%8D%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E5%AE%8C%E6%88%90FineTune).

In [None]:
# Convert the comma-separated source files into the tab-separated files PaddleHub reads.

def csv_to_tsv(src_path, dst_path, with_label=True):
    """Write "comment<TAB>label" rows, or only "comment" when with_label is False."""
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for index, line in enumerate(src):
            if index == 0:
                continue  # skip the header row
            # Drop all spaces, then split on commas: column 0 is read as the
            # label and column 1 as the comment text.
            # NOTE: a comment that itself contains a comma is cut at that comma.
            fields = line.replace("\n", " ").replace(" ", "").strip().split(",")
            label, comment = fields[0], fields[1]
            dst.write(comment + "\t" + label + "\n" if with_label else comment + "\n")

#### train Dataset
csv_to_tsv("data/train_split.csv", "data/train.tsv")

#### valid Dataset
csv_to_tsv("data/valid_split.csv", "data/dev.tsv")

#### predict Dataset (only the comment column is written)
csv_to_tsv("data/test_new.csv", "data/predict.tsv", with_label=False)
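
The conversion above splits each row on every comma, so a comment that itself contains a comma is cut at the first one. A minimal sketch of an alternative, assuming the source files are standard CSV in which such comments are quoted (the raw files are not shown in this notebook, so this is an assumption). `rows_to_tsv` and the sample text are hypothetical, and unlike the cell above this sketch keeps the spaces inside a comment:

```python
import csv
import io

def rows_to_tsv(csv_text, with_label=True):
    """Convert CSV text with a header row and (label, comment) columns to TSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    lines = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        label, comment = row[0].strip(), row[1]
        # Tabs and newlines would break the TSV layout, so replace them with spaces.
        comment = comment.replace("\t", " ").replace("\n", " ").strip()
        lines.append(comment + "\t" + label if with_label else comment)
    return "\n".join(lines) + "\n"

# Hypothetical sample: the first comment contains a quoted comma.
sample = 'label,comment\n1,"good food, fast delivery"\n0,too salty\n'
tsv_text = rows_to_tsv(sample)
print(tsv_text)
```

The `csv` module handles the quoting, so the quoted comma stays inside the comment field instead of producing a third column.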

In [None]:
from paddlehub.dataset.base_nlp_dataset import BaseNLPDataset

class DemoDataset(BaseNLPDataset):
    """DemoDataset"""
    def __init__(self):
        # Directory where the dataset files are stored
        self.dataset_dir = "data"
        super(DemoDataset, self).__init__(
            base_path=self.dataset_dir,
            train_file="train.tsv",
            dev_file="dev.tsv",
            test_file="dev.tsv",
            # If there is also prediction data (no class label needed), it can go in predict.tsv
            # predict_file="predict.tsv",
            train_file_with_header=False,
            dev_file_with_header=False,
            test_file_with_header=False,
            # predict_file_with_header=False,
            # Set of class labels in the dataset
            label_list=["0", "1"])
dataset = DemoDataset()
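
Before handing the files to `DemoDataset`, it can help to confirm that every row has the `text<TAB>label` shape and that each label appears in `label_list`. A small sketch of such a check; `check_tsv_lines` is a hypothetical helper, not part of PaddleHub, and the demo rows are made up:

```python
from collections import Counter

def check_tsv_lines(lines, label_list):
    """Return (label_counts, bad_line_numbers) for lines of 'text<TAB>label'."""
    counts = Counter()
    bad = []
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and parts[0] and parts[1] in label_list:
            counts[parts[1]] += 1
        else:
            bad.append(number)  # wrong column count, empty text, or unknown label
    return counts, bad

demo_lines = ["nice place\t1\n", "too noisy\t0\n", "missing label\n", "odd label\t2\n"]
counts, bad = check_tsv_lines(demo_lines, ["0", "1"])
print(dict(counts), bad)
```

For the real files, the same function could be called with an open file handle, for example `check_tsv_lines(open("data/train.tsv", encoding="utf-8"), ["0", "1"])`. The label counts also show how imbalanced the two classes are.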

## 3. Create the Reader

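Next, create a text-classification reader. The reader preprocesses the data in `dataset`: it first tokenizes the text, then arranges the result in the specific format the model expects as training input.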

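`ClassifyReader` takes the following three parameters:
* `dataset`: the PaddleHub Dataset to read from;
* `vocab_path`: the path to the vocabulary file of the ERNIE/BERT model;
* `max_seq_len`: the maximum sequence length of the ERNIE model. Shorter sequences are padded up to `max_seq_len`, and longer sequences are truncated to `max_seq_len`;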
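
The padding and truncation behavior of `max_seq_len` can be sketched in plain Python. This is an illustration only, not PaddleHub's reader code: the token ids are made up, `pad_id=0` is an assumption, and the real reader also inserts special tokens and builds position and segment ids, which this sketch omits.

```python
def pad_or_truncate(token_ids, max_seq_len, pad_id=0):
    """Return (ids, mask) of length max_seq_len; mask is 1 for real tokens, 0 for padding."""
    ids = list(token_ids[:max_seq_len])   # truncate if the sequence is too long
    mask = [1] * len(ids)
    padding = max_seq_len - len(ids)
    ids += [pad_id] * padding             # pad if the sequence is too short
    mask += [0] * padding
    return ids, mask

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_seq_len=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_seq_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence leaves the function with the same length, which is what lets a batch be stacked into one fixed-shape tensor.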

In [None]:
module.get_vocab_path()

'/home/aistudio/.paddlehub/modules/ernie/assets/vocab.txt'

In [None]:
reader = paddlehub.reader.ClassifyReader(
    dataset=dataset,
    vocab_path=module.get_vocab_path(),
    max_seq_len=256)
    

[2020-02-28 19:44:06,970] [    INFO] - Dataset label map = {'0': 0, '1': 1}


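**NOTE:** The reader's `max_seq_len` and the `max_seq_len` passed to the module's `context` interface must be the same. The maximum sequence length `max_seq_len` is tunable: the suggested value is 128, it can be adjusted to suit the text length of the task, and it must not exceed 512.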
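
One way to pick `max_seq_len` is to look at how long the texts actually are. The sketch below is a heuristic, not something PaddleHub provides: `suggest_max_seq_len` is a hypothetical helper, it assumes roughly one token per character (a rough fit for character-level Chinese tokenization) plus two special tokens, and the 512 cap comes from the note above.

```python
import math

def suggest_max_seq_len(texts, coverage=0.95, special_tokens=2, upper_limit=512):
    """Smallest length that fully covers `coverage` of the texts, capped at upper_limit."""
    if not texts:
        raise ValueError("texts must not be empty")
    lengths = sorted(len(text) + special_tokens for text in texts)
    # Nearest-rank quantile: the rank-th shortest text is the cut-off.
    rank = max(1, math.ceil(coverage * len(lengths)))
    return min(lengths[rank - 1], upper_limit)

# Made-up texts of 10, 20, 30, 40 and 600 characters.
texts = ["a" * n for n in (10, 20, 30, 40, 600)]
len_80 = suggest_max_seq_len(texts, coverage=0.8)
len_100 = suggest_max_seq_len(texts, coverage=1.0)
print(len_80, len_100)
```

A lower coverage truncates the few longest texts in exchange for shorter sequences, which lowers memory use per batch.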

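## 4. Choose a Fine-Tune Optimization Strategy
The transfer optimization strategy suited to Transformer models such as ERNIE/BERT is `AdamWeightDecayStrategy`. For details, see [Strategy](https://github.com/PaddlePaddle/PaddleHub/wiki/PaddleHub-API:-Strategy).

`AdamWeightDecayStrategy` takes the following parameters:
 * `learning_rate`: the maximum learning rate
 * `lr_scheduler`: the decay schedule, either `linear_decay` or `noam_decay`
 * `warmup_proportion`: the proportion of training used for warmup. If set to 0.1, the learning rate rises gradually to `learning_rate` over the first 10% of the training steps
 * `weight_decay`: weight decay, a regularization-like strategy that helps keep the model from overfitting
 * `optimizer_name`: the name of the optimizer; Adam is used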
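
The effect of `warmup_proportion` combined with linear decay can be sketched as a plain function of the step number. This is an illustrative approximation, not PaddleHub's implementation: `learning_rate_at` is a hypothetical name, and the 4000 total steps, the 5e-5 peak, and the decay to 0 are taken from the code and training log elsewhere in this notebook.

```python
def learning_rate_at(step, total_steps, max_lr=5e-5, warmup_proportion=0.1):
    """Linear warmup to max_lr, then linear decay to 0 at total_steps."""
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = total_steps - warmup_steps
    if remaining <= 0:
        return max_lr
    return max_lr * max(0.0, (total_steps - step) / remaining)

for step in (0, 200, 400, 2200, 4000):
    print(step, learning_rate_at(step, total_steps=4000))
```

With these values the rate climbs for the first 400 steps, peaks at `learning_rate`, and then falls steadily until the final step.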

## 5. Choose the Runtime Configuration

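Before fine-tuning, we can set some runtime options. The configuration in the code below means:

* `use_cuda`: `True` trains on the GPU, which requires a machine with a GPU and the GPU build of PaddlePaddle; set it to `False` to train on the CPU;

* `num_epoch`: the fine-tune task passes over the training set 8 times;

* `batch_size`: each batch fed to the model during training holds 16 examples. The model processes a batch in parallel, so a larger `batch_size` trains more efficiently but also uses more memory, and a `batch_size` that is too large can exhaust memory and make training impossible, so choosing a suitable `batch_size` is an important step;

* `log_interval`: print a training log every 10 steps (not set explicitly in the code below; the training log later in this notebook shows entries at steps 10 and 20);

* `eval_interval`: evaluate on the validation set every 50 steps;

* `checkpoint_dir`: save the trained parameters and data to the `ernie_txt_cls_turtorial_demo` directory;

* `strategy`: fine-tune with the `AdamWeightDecayStrategy` instance created in the same code cell;

For more runtime options, see [RunConfig](https://github.com/PaddlePaddle/PaddleHub/wiki/PaddleHub-API:-RunConfig)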
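
These settings determine how many steps a run takes and how often it logs and evaluates. A rough sketch of the arithmetic follows; `training_plan` is a hypothetical helper, the 8000 training examples are an assumption inferred from the `step 10 / 4000` line in the training log, and whether PaddleHub rounds a partial last batch up or drops it is not shown here.

```python
import math

def training_plan(num_examples, batch_size, num_epoch, eval_interval, log_interval=10):
    """Derive step counts from the run configuration."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epoch
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_lines": total_steps // log_interval,
        "evaluations": total_steps // eval_interval,
    }

# 8000 examples is an assumption; the other values match the RunConfig in this notebook.
plan = training_plan(num_examples=8000, batch_size=16, num_epoch=8, eval_interval=50)
print(plan)
```

Halving `batch_size` doubles the number of steps, so `eval_interval` and `log_interval` may need to change with it to keep the same evaluation cadence per epoch.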

In [None]:
strategy = paddlehub.AdamWeightDecayStrategy(
    weight_decay=0.001,
    warmup_proportion=0.1,
    learning_rate=5e-5)


config = paddlehub.RunConfig(
    use_cuda=True,
    num_epoch=8,
    checkpoint_dir="ernie_txt_cls_turtorial_demo",
    batch_size=16,
    eval_interval=50,
    strategy=strategy)

[2020-02-28 19:44:10,343] [    INFO] - Checkpoint dir: ernie_txt_cls_turtorial_demo


PaddleHub provides many optimization strategies, such as `AdamWeightDecayStrategy`, `ULMFiTStrategy`, and `DefaultFinetuneStrategy`. For details, see [Strategy](https://github.com/PaddlePaddle/PaddleHub/wiki/PaddleHub-API:-Strategy)

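## 6. Build the Finetune Task

With a suitable pretrained model and the dataset to transfer to both ready, we can start building a Task.

1. Get the module's context, which includes the input and output variables and the Paddle Program;
2. From the output variables, take the text feature `pooled_output` used for sentiment classification;
3. Attach a fully connected layer after `pooled_output` to create the Task;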
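
Step 3 can be pictured with a small NumPy sketch: a fully connected layer maps each sentence feature to one score per class, and a softmax turns the scores into probabilities. This is a conceptual illustration, not what `TextClassifierTask` runs internally; the random inputs stand in for `pooled_output`, and the feature width of 768 is an assumption.

```python
import numpy as np

def classify(pooled_output, weight, bias):
    """Fully connected layer followed by softmax over the class dimension."""
    logits = pooled_output @ weight + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
hidden_size, num_classes, batch = 768, 2, 4   # 768 is an assumed feature width
pooled = rng.normal(size=(batch, hidden_size))           # stand-in for pooled_output
weight = rng.normal(scale=0.02, size=(hidden_size, num_classes))
bias = np.zeros(num_classes)

probs = classify(pooled, weight, bias)
print(probs.shape, probs.sum(axis=1))
```

During fine-tuning, both this new layer and (because `trainable=True`) the pretrained encoder weights are updated.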

In [None]:
inputs, outputs, program = module.context(
    trainable=True, max_seq_len=256)

# Use "pooled_output" for classification tasks on an entire sentence.
pooled_output = outputs["pooled_output"]

feed_list = [
    inputs["input_ids"].name,
    inputs["position_ids"].name,
    inputs["segment_ids"].name,
    inputs["input_mask"].name,
]

cls_task = paddlehub.TextClassifierTask(
    data_reader=reader,
    feature=pooled_output,
    feed_list=feed_list,
    num_classes=dataset.num_labels,
    metrics_choices=["f1","matthews","acc"],
    config=config)

[2020-02-28 19:44:13,255] [    INFO] - Set maximum sequence length of input tensor to 256
[2020-02-28 19:44:13,257] [    INFO] - The shape of input tensor[input_ids] set to [-1, 256, 1]
[2020-02-28 19:44:13,257] [    INFO] - The shape of input tensor[position_ids] set to [-1, 256, 1]
[2020-02-28 19:44:13,258] [    INFO] - The shape of input tensor[segment_ids] set to [-1, 256, 1]
[2020-02-28 19:44:13,258] [    INFO] - The shape of input tensor[input_mask] set to [-1, 256, 1]
[2020-02-28 19:44:13,259] [    INFO] - 199 pretrained paramaters loaded by PaddleHub



To change the network of the transfer task, see [Custom Tasks](https://github.com/PaddlePaddle/PaddleHub/wiki/PaddleHub:-%E8%87%AA%E5%AE%9A%E4%B9%89Task)

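## 7. Start Finetuning

We use the `finetune_and_eval` interface to train the model. During fine-tuning it periodically evaluates the model, so we can follow how performance changes over the whole training run.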

In [10]:
run_states = cls_task.finetune_and_eval()

[2020-02-28 19:44:15,917] [    INFO] - Strategy with scheduler: {'warmup': 0.1, 'linear_decay': {'start_point': 0.1, 'end_learning_rate': 0}, 'noam_decay': False, 'discriminative': {'blocks': 0, 'factor': 2.6}, 'gradual_unfreeze': 0, 'slanted_triangle': {'cut_fraction': 0.0, 'ratio': 32}}, regularization: {'L2': 0.0, 'L2SP': 0.0, 'weight_decay': 0.001} and clip: {'GlobalNorm': 1.0, 'Norm': 0.0}
[2020-02-28 19:44:20,133] [    INFO] - Try loading checkpoint from ernie_txt_cls_turtorial_demo/ckpt.meta
[2020-02-28 19:44:20,134] [    INFO] - PaddleHub model checkpoint not found, start from scratch...
[2020-02-28 19:44:20,204] [    INFO] - PaddleHub finetune start
[2020-02-28 19:44:24,156] [   TRAIN] - step 10 / 4000: loss=0.50350 f1=0.00000 matthews=-0.05949 acc=0.82500 [step/sec: 2.54]
[2020-02-28 19:44:27,114] [   TRAIN] - step 20 / 4000: loss=0.45047 f1=0.00000 matthews=-0.03083 acc=0.86250 [step/sec: 3.38]
[2020-02-28 19:44:30,0

KeyboardInterrupt: 

## 8. Predict with the Model

Once fine-tuning is finished, we use the model to make predictions. The complete prediction code is as follows:

In [12]:
#### Load the labelled validation set so predictions can be checked
import numpy as np

data1 = []
labels = []
with open("data/valid_split.csv", "r", encoding="utf-8") as f:
    for index, line in enumerate(f):
        if index == 0:
            continue  # skip the header row
        line = line.replace("\n", " ").replace(" ", "").strip().split(",")
        label = int(line[0])
        comment = line[1]
        labels.append(label)
        data1.append([comment])

# Data to be predicted
pred = []
index = 0
run_states = cls_task.predict(data=data1)
results = [run_state.run_results for run_state in run_states]
for batch_result in results:
    # get predict index
    batch_result = np.argmax(batch_result, axis=2)[0]
    for result in batch_result:
        # print("%s\tpredict=%s" % (data1[index], result))
        pred.append(result)
        index += 1

[2020-02-28 20:06:33,899] [    INFO] - The best model has been loaded
[2020-02-28 20:06:33,900] [    INFO] - PaddleHub predict start
[2020-02-28 20:06:44,984] [    INFO] - PaddleHub predict finished.


In [13]:
pred = np.array(pred)
labels = np.array(labels)
correct = pred == labels  # boolean mask; index with it directly, not wrapped in a list
# (true positives, labelled positives, correct predictions, false positives)
tp, positives = pred[correct].sum(), labels.sum()
tp, positives, correct.sum(), len(labels) - correct.sum() - (positives - tp)



(275, 302, 1960, 13)
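
The four numbers above are enough to recover the usual binary metrics. The sketch below reads the tuple as (true positives, labelled positives, correct predictions, false positives), which assumes label `1` is the positive class; `binary_metrics` is a hypothetical helper, not a PaddleHub function.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and Matthews correlation from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1, "matthews": mcc}

# Counts taken from the tuple above: (275, 302, 1960, 13)
tp, positives, correct, fp = 275, 302, 1960, 13
fn = positives - tp   # labelled positives the model missed
tn = correct - tp     # negatives predicted correctly
metrics = binary_metrics(tp, fp, fn, tn)
print({name: round(value, 4) for name, value in metrics.items()})
```

Because only about 15% of the validation labels are positive, accuracy alone looks flattering; F1 and the Matthews coefficient give a better picture of how the positive class is handled.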

In [14]:
#### predict Dataset
data1 = []
labels = []
with open("data/test_new.csv", "r", encoding="utf-8") as f:
    for index, line in enumerate(f):
        if index == 0:
            continue  # skip the header row
        line = line.replace("\n", " ").replace(" ", "").strip().split(",")
        label = line[0]
        comment = line[1]
        labels.append(label)
        data1.append([comment])


In [15]:
import numpy as np

# Data to be predicted
pred=[]
index = 0
run_states = cls_task.predict(data=data1)
results = [run_state.run_results for run_state in run_states]
for batch_result in results:
    # get predict index
    batch_result = np.argmax(batch_result, axis=2)[0]
    for result in batch_result:
        # print("%s\tpredict=%s" % (data1[index], result))
        pred.append(result)
        index += 1

[2020-02-28 20:06:53,983] [    INFO] - The best model has been loaded
[2020-02-28 20:06:53,985] [    INFO] - PaddleHub predict start
[2020-02-28 20:07:04,815] [    INFO] - PaddleHub predict finished.


In [17]:
len(pred)

2000

In [18]:
type(pred[25])

numpy.int64
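
Each entry of `pred` is an integer class index produced by `np.argmax(batch_result, axis=2)[0]`. The sketch below shows why `axis=2` and `[0]` fit together, assuming `run_results` is a list that holds a single `(batch_size, num_classes)` array of class probabilities; that shape is inferred from the indexing in the prediction cell, and the probabilities here are made up.

```python
import numpy as np

# One fake batch: a list holding a single (batch_size, num_classes) array.
batch_result = [np.array([[0.9, 0.1],
                          [0.2, 0.8],
                          [0.4, 0.6]])]

stacked = np.asarray(batch_result)               # shape (1, 3, 2)
predicted = np.argmax(batch_result, axis=2)[0]   # shape (3,): one class index per row
print(stacked.shape, predicted)
```

The outer list adds a leading axis of size 1, so the class dimension becomes axis 2, and `[0]` removes that leading axis again.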

In [19]:
import pandas as pd
sample_file="data/sample.csv"
df_sample=pd.read_csv(sample_file, delimiter=",")
df_sample["label"]=np.array(pred)
df_sample.to_csv("answer.csv",index=False)

In [None]:
exit()