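# Summarization task
Reference: [Hugging Face NLP course, summarization](https://huggingface.co/learn/nlp-course/chapter7/5?fw=pt#summarization)
## Goal
Train a text summarization model. The reference course builds a bilingual (English and Spanish) model; the cells below fine-tune on the Chinese (zh) subset instead.
## Dataset preparation
Multilingual Amazon Reviews Corpus (Amazon no longer provides it; the files were obtained from another source, stored on Google Drive, and read locally)  
The corpus consists of Amazon product reviews in six languages and is commonly used to benchmark multilingual classifiers  
English (en), Japanese (ja), German (de), French (fr), Chinese (zh) and Spanish (es).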


In [1]:
!pip install sentencepiece
!pip install transformers
!pip install datasets
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99
Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)

In [1]:
from datasets import load_dataset
# Note: the "train" split is read from test.csv and the "test" split from validation.csv
sumDataset = load_dataset('csv', data_files={'train': '/content/drive/MyDrive/MultilingualAmazonReviews/test.csv', 'test': '/content/drive/MyDrive/MultilingualAmazonReviews/validation.csv'})
sumDataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 30000
    })
})

In [2]:
zhSumTrainDataset = sumDataset['train'].filter(lambda example: example['language'].startswith('zh'))
print(zhSumTrainDataset)
zhSumTestDataset = sumDataset['test'].filter(lambda example: example['language'].startswith('zh'))
print(zhSumTestDataset)

Dataset({
    features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
    num_rows: 5000
})
Dataset({
    features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
    num_rows: 5000
})


Inspect the data

In [3]:
def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).filter(lambda example: example['language'].startswith('zh')).select(range(num_samples))
    for example in sample:
        print(f"\n'>> Title: {example['review_title']}'")
        print(f"'>> Review: {example['review_body']}'")


show_samples(sumDataset)


'>> Title: 简单方便 性价比高'
'>> Review: 有奶碟方便很多，傻瓜操作；costa的胶囊会好喝些......已经陆续买了三台............'

'>> Title: 还可以'
'>> Review: 想吃麦片，看评论很多说这个好，买的，麦片和其他也差不多'

'>> Title: 不错'
'>> Review: 基本信息都全，没去过的地方纸上先去一下。'


Inspect the product categories of the reviews

In [4]:
zhSumTrainDataset.set_format("pandas")
zh_df = zhSumTrainDataset[:]
# Show counts for the top 20 product categories
zh_df["product_category"].value_counts()[:20]

book                      1567
digital_ebook_purchase     458
apparel                    308
shoes                      236
beauty                     224
kitchen                    223
other                      212
home                       190
grocery                    186
wireless                   173
drugstore                  169
baby_product               161
sports                     132
pc                         130
watch                       97
toy                         93
home_improvement            84
electronics                 73
office_product              72
luggage                     72
Name: product_category, dtype: int64

In [5]:
zhSumTrainDataset.reset_format()

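Model selection: candidates for multilingual text include mT5 and mBART-50; fnlp/bart-base-chinese is a Chinese-specific option. The cells below load fnlp/cpt-base.
### Loading the tokenizer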

In [6]:
from transformers import AutoTokenizer
# Some seq2seq tokenizers (e.g. mT5) need sentencepiece; install it with pip first
check_model = "fnlp/cpt-base"
tokenizer = AutoTokenizer.from_pretrained(check_model)

In [None]:
tokenizer.pad_token_id

0

In [7]:
inputs = tokenizer("想吃麦片")
inputs

{'input_ids': [101, 9688, 6422, 25184, 14062, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [8]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['[CLS]', '想', '吃', '麦', '片', '[SEP]']

In [9]:
max_input_length = 512
max_target_length = 70


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["review_body"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["review_title"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [10]:
tokenized_datasets = zhSumTrainDataset.map(preprocess_function, batched=True)
tokenized_datasets

Dataset({
    features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 5000
})

In [11]:
tokenized_datasets1 = zhSumTestDataset.map(preprocess_function, batched=True)
tokenized_datasets1

Dataset({
    features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 5000
})

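Evaluation metrics  
ROUGE-L: scores the longest common subsequence shared by the prediction and the reference  
ROUGE-N: splits the prediction and the reference into n-grams and computes the recall of the overlap  
[Reference](https://zhuanlan.zhihu.com/p/504279252)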
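
To make the two definitions concrete, here is a minimal character-level sketch. It is not the implementation used by the `rouge_score` package: the helper names (`rouge_n_recall`, `lcs_length`, `rouge_l_recall`) are illustrative, and treating every character as one token is an assumption that suits Chinese text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(prediction, reference, n=1):
    """Share of reference n-grams that also occur in the prediction."""
    pred = ngrams(list(prediction), n)
    ref = ngrams(list(reference), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_recall(prediction, reference):
    """LCS length divided by the reference length."""
    if not reference:
        return 0.0
    return lcs_length(list(prediction), list(reference)) / len(reference)


generated = "简单方便 性价比高"
reference = "简单，划算"
print(rouge_n_recall(generated, reference, n=1))
print(rouge_n_recall(generated, reference, n=2))
print(rouge_l_recall(generated, reference))
```

Only recall is shown; the reported ROUGE scores normally combine precision and recall into an F-measure.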

In [15]:
!pip install rouge_score
!pip install evaluate



In [12]:
import evaluate
rouge_score = evaluate.load("rouge")

In [13]:
print(rouge_score.compute.__doc__)

Compute the evaluation module.

        Usage of positional arguments is not allowed to prevent mistakes.

        Args:
            predictions (list/array/tensor, optional): Predictions.
            references (list/array/tensor, optional): References.
            **kwargs (optional): Keyword arguments that will be forwarded to the evaluation module :meth:`_compute`
                method (see details in the docstring).

        Return:
            dict or None

            - Dictionary with the results if this evaluation module is run on the main process (``process_id == 0``).
            - None if the evaluation module is not run on the main process (``process_id != 0``).
        
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tok

In [14]:
generated_summary = "简单方便 性价比高"
reference_summary = "简单，划算"
scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
scores

{'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0, 'rougeLsum': 0.0}

In [None]:
scores["rouge1"]

0.0
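
A score of 0.0 here is probably a tokenization effect rather than a sign that the two strings share nothing. It appears that the default tokenizer in the `rouge_score` package lowercases the text and keeps only runs of ASCII letters and digits, so Chinese characters are discarded before any n-gram is counted. The sketch below mimics that behaviour with a regular expression (an assumption about the library, not its actual code) and shows one common workaround: mapping every character to an integer id so that the tokens survive. `ascii_only_tokenize` and `to_id_string` are hypothetical helpers.

```python
import re


def ascii_only_tokenize(text):
    """Assumed behaviour of the default tokenizer: keep [a-z0-9] runs only."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def to_id_string(text, vocab):
    """Replace each non-space character with an integer id, space separated."""
    ids = [str(vocab.setdefault(ch, len(vocab))) for ch in text if not ch.isspace()]
    return " ".join(ids)


vocab = {}  # shared by predictions and references so equal characters get equal ids
generated = "简单方便 性价比高"
reference = "简单，划算"

print(ascii_only_tokenize(generated))  # no tokens survive
gen_ids = to_id_string(generated, vocab)
ref_ids = to_id_string(reference, vocab)
print(gen_ids)
print(ref_ids)
print(ascii_only_tokenize(ref_ids))    # the ids survive as tokens
```

The id strings can then be passed as `predictions` and `references` to `rouge_score.compute` in place of the raw text.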

In [40]:
from transformers import pipeline
# Baseline: summarize with the checkpoint before any fine-tuning
summarizer = pipeline("summarization", model=check_model)
# article = tokenized_datasets[1]['review_body']
article = '''本文总结了十个可穿戴产品的设计原则，而这些原则，同样也是笔者认为是这个行业最吸引人的地方：1.为人们解决重复性问题；2.从人开始，而不是从机器开始；3.要引起注意，但不要刻意；4.提升用户能力，而不是取代人'''
print(article)
summarizer(article, max_length=30, min_length=5, do_sample=False)

Some weights of the model checkpoint at fnlp/cpt-base were not used when initializing BartForConditionalGeneration: ['model.encoder.encoder.layer.4.attention.self.value.weight', 'model.encoder.encoder.layer.11.attention.self.query.weight', 'model.encoder.encoder.layer.6.attention.output.LayerNorm.bias', 'model.encoder.encoder.layer.1.attention.self.query.bias', 'model.encoder.encoder.layer.8.output.LayerNorm.bias', 'model.encoder.encoder.layer.2.intermediate.dense.weight', 'model.encoder.encoder.layer.1.output.dense.bias', 'model.encoder.encoder.layer.3.attention.self.value.weight', 'model.encoder.encoder.layer.8.attention.output.LayerNorm.bias', 'model.encoder.encoder.layer.2.attention.self.value.bias', 'model.encoder.encoder.layer.5.intermediate.dense.bias', 'model.encoder.encoder.layer.5.output.dense.weight', 'model.encoder.encoder.layer.1.attention.output.LayerNorm.bias', 'model.encoder.encoder.layer.5.attention.self.query.bias', 'model.encoder.encoder.layer.6.output.LayerNorm.bias

本文总结了十个可穿戴产品的设计原则，而这些原则，同样也是笔者认为是这个行业最吸引人的地方：1.为人们解决重复性问题；2.从人开始，而不是从机器开始；3.要引起注意，但不要刻意；4.提升用户能力，而不是取代人


[{'summary_text': '多 了 很 多 ， 但 是 很 多 了 多 了 以 后 多 了 ， 多 了 但 是 多 了 电 多 了 。'}]

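Load the model. The warning below lists checkpoint weights under `model.encoder.encoder.*` that `BartForConditionalGeneration` does not use, which suggests the encoder is not initialized from the pretrained weights and may explain the weak results further down.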

In [16]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(check_model)

Some weights of the model checkpoint at fnlp/cpt-base were not used when initializing BartForConditionalGeneration: ['model.encoder.encoder.layer.4.attention.self.value.weight', 'model.encoder.encoder.layer.11.attention.self.query.weight', 'model.encoder.encoder.layer.6.attention.output.LayerNorm.bias', 'model.encoder.encoder.layer.1.attention.self.query.bias', 'model.encoder.encoder.layer.8.output.LayerNorm.bias', 'model.encoder.encoder.layer.2.intermediate.dense.weight', 'model.encoder.encoder.layer.1.output.dense.bias', 'model.encoder.encoder.layer.3.attention.self.value.weight', 'model.encoder.encoder.layer.8.attention.output.LayerNorm.bias', 'model.encoder.encoder.layer.2.attention.self.value.bias', 'model.encoder.encoder.layer.5.intermediate.dense.bias', 'model.encoder.encoder.layer.5.output.dense.weight', 'model.encoder.encoder.layer.1.attention.output.LayerNorm.bias', 'model.encoder.encoder.layer.5.attention.self.query.bias', 'model.encoder.encoder.layer.6.output.LayerNorm.bias

Define the training arguments


In [14]:
!pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.21.0


In [42]:
from transformers import Seq2SeqTrainingArguments

batch_size = 17
num_train_epochs = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets) // batch_size
model_name = check_model.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-amazon-zh",
    evaluation_strategy="epoch",
    overwrite_output_dir=True,
    learning_rate=2e-5,  # changing the learning rate did not help much; original value: 5.6e-5
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
)
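
As a quick sanity check on `logging_steps`, the arithmetic below uses the 5,000 training rows and the batch size of 17 from the cells above, assuming a single device and no gradient accumulation. Floor division gives the logging interval, while the number of optimizer steps per epoch rounds up because the last partial batch still counts.

```python
import math

num_rows = 5000        # size of the Chinese training subset
batch_size = 17
num_train_epochs = 8

logging_steps = num_rows // batch_size               # interval used for logging
steps_per_epoch = math.ceil(num_rows / batch_size)   # includes the partial batch
total_steps = steps_per_epoch * num_train_epochs

print(logging_steps, steps_per_epoch, total_steps)   # 294 295 2360
```

The total agrees with the `global_step=2360` reported by `trainer.train()`.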

In [18]:
!pip install nltk



In [19]:
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [20]:
from nltk.tokenize import sent_tokenize


def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])
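
`sent_tokenize` with its default English model may not treat full-width marks such as `。` as sentence boundaries, so on Chinese reviews the function above can return the whole text as a single sentence. A minimal regex-based alternative is sketched below. `split_zh_sentences` and `three_sentence_summary_zh` are hypothetical names, and the set of terminators is an assumption that covers only the most common marks.

```python
import re

# One sentence = a run of non-terminators followed by any closing terminators.
_SENTENCE_RE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_zh_sentences(text):
    """Split text on Chinese and ASCII sentence-ending punctuation."""
    return [s.strip() for s in _SENTENCE_RE.findall(text) if s.strip()]


def three_sentence_summary_zh(text):
    """Lead-3 baseline: keep the first three sentences, one per line."""
    return "\n".join(split_zh_sentences(text)[:3])


print(three_sentence_summary_zh("很好。不错！还行？一般。"))
```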

Evaluation functions from the official documentation

In [21]:
import numpy as np

In [22]:
def postprocess_text(preds, labels):
  preds = [pred.strip() for pred in preds]
  labels = [label.strip() for label in labels]

  # rougeLSum expects newline after each sentence
  # preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
  # labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

  return preds, labels

In [23]:

def compute_metrics(eval_preds):
  preds, labels = eval_preds
  if isinstance(preds, tuple):
      preds = preds[0]
  # Replace -100s used for padding as we can't decode them
  preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
  decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
  labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

  # Some simple post-processing
  decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

  result = rouge_score.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
  result = {k: round(v * 100, 4) for k, v in result.items()}
  prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
  result["gen_len"] = np.mean(prediction_lens)
  return result

In [24]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [25]:
tokenized_datasets = tokenized_datasets.remove_columns(
    zhSumTrainDataset.column_names
)
tokenized_datasets

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 5000
})

In [26]:
tokenized_datasets1 = tokenized_datasets1.remove_columns(
    zhSumTestDataset.column_names
)
tokenized_datasets1

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 5000
})

The data collator generates `decoder_input_ids` for the decoder to use
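
What the collator does to the labels can be sketched in plain Python. This illustrates the usual BART-style behaviour and is not the library's code: labels are padded with -100 so that the loss ignores those positions, and `decoder_input_ids` are the labels shifted one position to the right behind a start token. The token ids and `decoder_start_token_id=102` are made-up values for the example; `pad_token_id=0` matches the tokenizer output shown earlier.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def pad_labels(label_lists):
    """Right-pad every label sequence to the longest one with -100."""
    width = max(len(labels) for labels in label_lists)
    return [labels + [IGNORE_INDEX] * (width - len(labels)) for labels in label_lists]


def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs: start token first, last label dropped, -100 -> pad."""
    shifted = []
    for row in labels:
        new_row = [decoder_start_token_id] + row[:-1]
        shifted.append([pad_token_id if t == IGNORE_INDEX else t for t in new_row])
    return shifted


labels = pad_labels([[101, 7, 8, 9, 102], [101, 9, 102]])
decoder_input_ids = shift_tokens_right(labels, pad_token_id=0, decoder_start_token_id=102)
print(labels)
print(decoder_input_ids)
```

This is also why `compute_metrics` replaces -100 with the pad token id before decoding: -100 is not a valid vocabulary index.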

In [27]:
features = [tokenized_datasets[i] for i in range(2)]
data_collator(features)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': tensor([[  101,  5007, 15134,  4932, 15128,  7710, 15134, 25818,  4909, 11009,
         15284, 20820, 25818,  6559, 15284, 15134,  8429, 12434,  5028,  8451,
          4896,  7189, 25807, 14112,  5954,  4909,  7807, 21002, 30878,  7710,
          5722, 15134, 12637,  6222,  5954, 25818, 21536, 20820,  5028, 25807,
          4909,  9185, 20447,  5007, 25807,   102],
        [  101,  9460,  4968,  6170, 19731, 15207, 21784, 12637, 11226, 25818,
          8485,  4896,  4938,  7514, 10861, 19673, 19731,  5028,  5879,  9223,
         23390, 17922, 25818,  4896, 15264,  8485, 10344,  5028,  7807,  5879,
          4938, 11567, 12257,  5028, 25818, 21784,  4909,  7807, 14788,  5028,
           102,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [28]:
features1 = [tokenized_datasets1[i] for i in range(2)]
data_collator(features1)

{'input_ids': tensor([[  101, 21863, 21537,  5028,  9327, 22915, 10953, 23028, 25818, 11217,
          6433, 21562, 15478,  9970, 19705, 21536, 20820,  5028, 25818, 10931,
          7754, 21545, 10186,  5773,  5108,  3566, 41142, 25818,  5122,  5987,
          5100,  4909,  6350, 20469,  5140, 25818, 14417,  7221, 19916,  7807,
          7807, 20469,  5140,  5028, 25818,  6433, 11315,  9970,  8485, 10091,
         21498, 12403, 20494,  7697,  5965, 20893,  5028, 25818, 21497, 15254,
          5033, 25818, 20893,  5959,  6653,  7697,  5965,  5959,  6653, 25818,
         11217, 21991, 19916, 15134, 11009, 25818,  4909, 14788, 20437, 15284,
         15134, 20469, 20459,  5028, 25818,  4909, 14788,  9688, 21497,  8918,
          7710,  8453,  8270, 25818, 15241, 10371,  6372,  5905,  8485,  6402,
          5122,  5028, 25818, 10374, 18355, 17223,  7723,  8362, 25807, 25807,
           102],
        [  101, 23386,  8992, 10245, 17794, 25818, 10017,  5122, 20820, 10208,
          5959, 20486

In [43]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets1,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [44]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.4079 | 3.264772 | 0.13 | 0.02 | 0.12 | 0.13 | 7.0096 |
| 2 | 1.2755 | 3.317801 | 0.1733 | 0.0133 | 0.1693 | 0.1693 | 7.5226 |
| 3 | 1.1506 | 3.401588 | 0.168 | 0.0133 | 0.1667 | 0.1653 | 7.7178 |
| 4 | 1.0362 | 3.453164 | 0.18 | 0.0133 | 0.17 | 0.168 | 7.1338 |
| 5 | 0.9479 | 3.477561 | 0.14 | 0.02 | 0.14 | 0.14 | 7.3874 |
| 6 | 0.8648 | 3.509202 | 0.148 | 0.02 | 0.14 | 0.144 | 7.3338 |
| 7 | 0.8085 | 3.525378 | 0.1733 | 0.02 | 0.1667 | 0.1733 | 7.5888 |
| 8 | 0.7662 | 3.53059 | 0.14 | 0.02 | 0.14 | 0.14 | 7.5848 |


TrainOutput(global_step=2360, training_loss=1.031290876259238, metrics={'train_runtime': 2944.86, 'train_samples_per_second': 13.583, 'train_steps_per_second': 0.801, 'total_flos': 4016045314778112.0, 'train_loss': 1.031290876259238, 'epoch': 8.0})

In [31]:
trainer.evaluate()

{'eval_loss': 3.2537012100219727,
 'eval_rouge1': 0.2,
 'eval_rouge2': 0.02,
 'eval_rougeL': 0.186,
 'eval_rougeLsum': 0.186,
 'eval_gen_len': 7.393,
 'eval_runtime': 184.3031,
 'eval_samples_per_second': 27.129,
 'eval_steps_per_second': 1.812,
 'epoch': 8.0}

In [33]:
from transformers import pipeline
# Load the fine-tuned checkpoint saved by the trainer
summarizer = pipeline("summarization", model="/content/cpt-base-finetuned-amazon-zh/checkpoint-2500")
article = "想吃麦片，看评论很多说这个好，买的，麦片和其他也差不多"
summarizer(article, max_length=30, min_length=5, do_sample=False)

Your max_length is set to 30, but your input_length is only 29. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=14)


[{'summary_text': '和 麦 片 的 不 一 样'}]

In [34]:
def print_summary(idx):
    # Compare the review, its reference title, and the generated summary
    review = zhSumTestDataset[idx]["review_body"]
    title = zhSumTestDataset[idx]["review_title"]
    summary = summarizer(review)[0]["summary_text"]
    print(f"'>>> Review: {review}'")
    print(f"\n'>>> Title: {title}'")
    print(f"\n'>>> Summary: {summary}'")

In [35]:
print_summary(100)

Your max_length is set to 128, but your input_length is only 58. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=29)


'>>> Review: 商品差强人意，首先封面图片印刷不清与卖家提供图片不符…而且很贵物超所值，本身本子很小还贵比在外面文具店买的贵得多'

'>>> Title: 不满意的一次网购'

'>>> Summary: 商 品 质 量 很 差 ， 图 片 质 量 堪 忧'
