<a href="https://colab.research.google.com/github/DanielDLX/DLfinal/blob/master/bert_amazon2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
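# Import packages
# transformers provides pretrained models that are easy to use.
!pip install transformers
import tensorflow as tf
import pandas as pd
import os
import tqdm.notebook  # imported explicitly so that tqdm.notebook.tqdm is available later
# Use the sequence-classification model, which adds a classification head on top of BERT.
from transformers import BertTokenizer, TFBertForSequenceClassification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline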

tf.__version__



'2.2.0'

In [2]:
# Load the pretrained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)  # number of target classes
model.summary()
model.config

Some weights of the model checkpoint at bert-base-uncased were not used when initializing TFBertForSequenceClassification: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier', 'dropout_37']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
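
The `classifier (Dense)` row of the summary can be checked by hand: a dense layer that maps the 768-dimensional BERT output to 2 labels has 768 × 2 weights plus 2 biases. The following is a minimal sketch of that arithmetic, with the numbers copied from the outputs above; `dense_param_count` is a hypothetical helper written for this illustration, not part of transformers or Keras.

```python
def dense_param_count(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

hidden_size = 768        # "hidden_size" in the BertConfig output
num_labels = 2           # value passed to from_pretrained
bert_params = 109482240  # "bert (TFBertMainLayer)" row of model.summary()

classifier_params = dense_param_count(hidden_size, num_labels)
total_params = bert_params + classifier_params
print(classifier_params, total_params)
```

Both values should agree with the `Param #` column and the `Total params` line printed by `model.summary()`.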

In [3]:
# Data link; the dataset is listed at https://course.fast.ai/datasets.
# Amazon Review Polarity dataset; columns: class label, review title, review text.
am2_url = 'https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz'

In [4]:
# Download the data and set the dataset directory
am2_zip_file = tf.keras.utils.get_file(origin=am2_url,fname='amazon_review_polarity_csv.tgz', extract=True)
base_dir = os.path.join(os.path.dirname(am2_zip_file), 'amazon_review_polarity_csv')
os.listdir(base_dir)

['readme.txt', 'test.csv', 'train.csv']

In [5]:
# Take a look at the readme
with open(os.path.join(base_dir, 'readme.txt')) as f:
  con = f.readlines()
print(con)

['Amazon Review Polaridy Dataset\n', '\n', 'Version 3, Updated 09/09/2015\n', '\n', 'ORIGIN\n', '\n', 'The Amazon reviews dataset consists of reviews from amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. For more information, please refer to the following paper: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.\n', '\n', 'The Amazon reviews polarity dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the above dataset. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).\n', '\n', '\n', 'DESCRIPTION\n', '\n', 'The Amazon reviews polarity dataset is constructed by taking review score 1 and 2 as nega

In [6]:
# Load the data
# base_dir = ''
train = pd.read_csv(os.path.join(base_dir, 'train.csv'), header=None)
print(len(train))
print(train.head())
test = pd.read_csv(os.path.join(base_dir, 'test.csv'), header=None)
print(len(test))
print(test.head())

3600000
   0  ...                                                  2
0  2  ...  This sound track was beautiful! It paints the ...
1  2  ...  I'm reading a lot of reviews saying that this ...
2  2  ...  This soundtrack is my favorite music of all ti...
3  2  ...  I truly like this soundtrack and I enjoy video...
4  2  ...  If you've played the game, you know how divine...

[5 rows x 3 columns]
400000
   0  ...                                                  2
0  2  ...  My lovely Pat has one of the GREAT voices of h...
1  2  ...  Despite the fact that I have only played a sma...
2  1  ...  I bought this charger in Jul 2003 and it worke...
3  2  ...  Check out Maha Energy's website. Their Powerex...
4  2  ...  Reviewed quite a bit of the combo players and ...

[5 rows x 3 columns]


In [7]:
# The encode method of the transformers tokenizer converts a text into token ids and adds [CLS] and [SEP]; the id of [CLS] is 101, [SEP] is 102, and [PAD] is 0.
# So:    a   dog   is  not   a   table   becomes
# [cls]  a   dog   is  not   a   table  [sep]
# 101   1037  3899  2003 2025  1037  2795   102   0  0  0  ...  0 
# %pprint  # toggles pretty printing so the list prints on one line, which is easier to read
tokenizer.encode(text='a dog is not a table', padding='max_length',max_length=512)[:20]

[101,
 1037,
 3899,
 2003,
 2025,
 1037,
 2795,
 102,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]
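
The layout of that output can be reproduced without the library. The sketch below covers only the step after tokenization (wrapping the ids in `[CLS]` and `[SEP]`, then right-padding with `[PAD]`); it does not perform WordPiece tokenization. `add_special_tokens_and_pad` is a hypothetical helper for illustration, and the word ids are copied from the output above.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids shown above for bert-base-uncased

def add_special_tokens_and_pad(token_ids, max_length):
    """Wrap token ids in [CLS] ... [SEP] and right-pad with [PAD] to max_length."""
    ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    if len(ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    return ids + [PAD_ID] * (max_length - len(ids))

# Ids for "a dog is not a table", copied from the tokenizer output above.
word_ids = [1037, 3899, 2003, 2025, 1037, 2795]
encoded = add_special_tokens_and_pad(word_ids, max_length=512)
print(encoded[:10])
```

The two special tokens are the reason the preprocessing step reserves `max_length - 2` positions for the text itself.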

In [None]:
# Preprocess the data
# Tokenize the dataset texts as in the example above and build the matching labels.
max_length = 512
# Some texts are longer than 512; raising max_length further would exhaust memory, so truncate instead.
# Note: this truncates by characters (not tokens), reserving two positions for [CLS] and [SEP].
max_length_temp = max_length - 2
# Vectorized slicing replaces the row-by-row chained assignment (train[2][i] = ...),
# which is slow and triggers pandas' SettingWithCopyWarning.
train[2] = train[2].str.slice(0, max_length_temp)
test[2] = test[2].str.slice(0, max_length_temp)
train_ids = [tokenizer.encode(text=sent, padding='max_length', max_length=max_length, return_tensors="tf") for sent in tqdm.notebook.tqdm(train[2])]
test_ids = [tokenizer.encode(text=sent, padding='max_length', max_length=max_length, return_tensors="tf") for sent in tqdm.notebook.tqdm(test[2])]
# The CSV labels are 1 and 2; shift them to 0 and 1 for the loss function.
train_labels = train[0].values - 1
test_labels = test[0].values - 1
  


In [None]:
# Convert the data to TensorFlow tensors
# train_ids is a list of tf.Tensor objects, so concatenating them along axis 0 is enough
train_ids = tf.concat(train_ids, 0)
# Initialize train_mask to 1, then set it to 0 wherever train_ids equals 0 (the PAD positions)
train_mask = tf.ones(train_ids.shape)
train_mask = tf.where(tf.math.greater(train_ids, 0), train_mask, 0)
# The labels are a NumPy array; convert them to a tf.Tensor
train_labels = tf.convert_to_tensor(train_labels)

# The test set is processed the same way
test_ids = tf.concat(test_ids, 0)
test_mask = tf.ones(test_ids.shape)
test_mask = tf.where(tf.math.greater(test_ids, 0), test_mask, 0)
test_labels = tf.convert_to_tensor(test_labels)
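
The mask logic used here is simple enough to show on a single short row. The following is a plain-Python sketch of the same rule (1 for real tokens, 0 for padding), assuming as above that `[PAD]` has id 0; `attention_mask` is a hypothetical helper, not a transformers function.

```python
PAD_ID = 0  # [PAD] id for bert-base-uncased, as noted earlier

def attention_mask(token_ids):
    """Return 1 for real tokens and 0 for padding positions."""
    return [1 if token_id != PAD_ID else 0 for token_id in token_ids]

# "a dog is not a table" padded to 12 positions instead of 512 for readability.
row = [101, 1037, 3899, 2003, 2025, 1037, 2795, 102, 0, 0, 0, 0]
mask = attention_mask(row)
print(mask)
```

Because token ids are never negative, testing `!= 0` here is equivalent to the `tf.math.greater(ids, 0)` test in the cell above.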

In [None]:
print(train_ids[0])
print(train_mask[0])
print(train_labels[0])

In [None]:
# Training hyperparameters
epochs = 1
batch_size = 4
validation_rate = 0.1
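
These settings determine how long one epoch is. As a rough sketch, assuming the full 3,600,000-row training set printed earlier is used and that Keras holds out the last `validation_split` fraction of the samples for validation, the number of optimizer steps works out as follows:

```python
import math

num_train_rows = 3600000  # len(train) printed earlier
validation_rate = 0.1
batch_size = 4
epochs = 1

# Rows left for training after the validation split is held out.
num_fit_rows = math.floor(num_train_rows * (1.0 - validation_rate))
num_val_rows = num_train_rows - num_fit_rows
steps_per_epoch = math.ceil(num_fit_rows / batch_size)
print(num_fit_rows, num_val_rows, steps_per_epoch * epochs)
```

With a batch size of 4 this gives several hundred thousand steps per epoch, which is worth knowing before starting a run on a shared or time-limited GPU.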

In [None]:
# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
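
`from_logits=True` tells the loss that the model outputs raw scores rather than probabilities, so the softmax is applied inside the loss. The sketch below shows the per-example quantity in plain Python, as an illustration of the formula rather than the Keras implementation; the function name is hypothetical.

```python
import math

def sparse_categorical_crossentropy_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]

# Equal logits mean the model is undecided between the two classes: loss = ln 2.
undecided = sparse_categorical_crossentropy_from_logits([0.0, 0.0], label=1)
# A confident, correct prediction gives a loss close to 0.
confident = sparse_categorical_crossentropy_from_logits([-4.0, 4.0], label=1)
print(undecided, confident)
```

The label is an integer class index (0 or 1 here), which is why the "sparse" variant of the loss is used instead of one-hot targets.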

In [None]:
# Train the model
model.fit(x=[train_ids, train_mask], 
     y=train_labels, 
     batch_size=batch_size, 
     epochs=epochs, 
     verbose=1, 
     callbacks=None,
     validation_split=validation_rate, 
     validation_data=None, 
     shuffle=True)

In [None]:
# Evaluate the model on the test set
model.evaluate(x=[test_ids, test_mask],
        y=test_labels, 
        batch_size=4, 
        verbose=1)
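
`model.evaluate` reports aggregate loss and accuracy only. To inspect individual predictions, the logits for each review have to be reduced to a class index and, if needed, mapped back to the 1/2 labels used in the CSV files. A minimal sketch of that last step follows; the logits are made-up values for illustration, not real model output, and `logits_to_dataset_labels` is a hypothetical helper.

```python
def logits_to_dataset_labels(batch_logits):
    """Pick the highest-scoring class per row and map 0/1 back to the CSV labels 1/2."""
    labels = []
    for logits in batch_logits:
        class_index = max(range(len(logits)), key=lambda i: logits[i])
        labels.append(class_index + 1)  # undo the `- 1` applied during preprocessing
    return labels

# Hypothetical logits for three reviews (not real model output).
example_logits = [[2.3, -1.7], [-0.4, 0.9], [0.1, 0.0]]
predicted_labels = logits_to_dataset_labels(example_logits)
print(predicted_labels)
```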