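# XLM-RoBERTa: a cross-lingual (multilingual) model
> XLM is a BERT-style model that combines two training objectives, MLM and TLM. XLM-RoBERTa brings RoBERTa's training recipe to this multilingual setting, including RoBERTa's dynamic masking (the released XLM-R checkpoints were pretrained with the MLM objective only).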

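> `MLM:` MLM (masked language modeling) works the same way as the language-modeling objective in BERT: some tokens in a sentence are masked and the model predicts them, which forces it to capture the surrounding context. BERT handles a selected token in one of three ways: replace it with [MASK], keep the original token, or replace it with a random token.
>- 1 First, select 15% of all tokens.
>- 2 Of the selected tokens, 80% are replaced with the [MASK] token, 10% keep the original token, and 10% are replaced with a random token.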
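
The two steps above can be sketched in plain Python. This is a minimal illustration, not BERT's actual implementation: the function name `mask_tokens`, the toy vocabulary, and the choice to select exactly 15% of positions (rather than sampling each position independently) are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original token at selected positions, None elsewhere.
    """
    rng = random.Random(seed)
    # Step 1: pick 15% of the positions (at least one).
    n_select = max(1, round(len(tokens) * select_prob))
    selected = set(rng.sample(range(len(tokens)), n_select))

    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if i not in selected:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        # Step 2: decide how to corrupt the selected position.
        r = rng.random()
        if r < 0.8:
            corrupted.append(MASK)               # 80%: [MASK]
        elif r < 0.9:
            corrupted.append(tok)                # 10%: keep the original token
        else:
            corrupted.append(rng.choice(vocab))  # 10%: random token
    return corrupted, labels

tokens = "the quick brown fox jumps over the lazy dog again and again today".split()
corrupted, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), seed=0)
print(corrupted)
print(labels)
```

Calling the function with a fresh seed every time a sentence is fed to the model, instead of fixing the mask once during preprocessing, is the idea behind RoBERTa's dynamic masking.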

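>- `TLM:` TLM (translation language modeling) concatenates two sentences that have similar meanings but are written in different languages, so the model can learn the relationships between languages.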
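
A minimal sketch of how a TLM input could be assembled from a parallel sentence pair. The function name `build_tlm_input`, the `</s>` separator, and the example sentences are assumptions for illustration. The sketch resets the position ids for the second sentence and tags each token with a language id, as described for XLM. Masking would then be applied on top of this sequence, as in MLM.

```python
def build_tlm_input(src_tokens, tgt_tokens, src_lang, tgt_lang, sep="</s>"):
    """Concatenate a translation pair into one sequence for TLM."""
    src = [sep] + list(src_tokens) + [sep]
    tgt = [sep] + list(tgt_tokens) + [sep]
    tokens = src + tgt
    # Positions restart at 0 for the second sentence.
    positions = list(range(len(src))) + list(range(len(tgt)))
    # Every token carries the id of the language it belongs to.
    langs = [src_lang] * len(src) + [tgt_lang] * len(tgt)
    return tokens, positions, langs

tokens, positions, langs = build_tlm_input(
    ["the", "curtains", "were", "blue"],
    ["les", "rideaux", "étaient", "bleus"],
    src_lang="en",
    tgt_lang="fr",
)
for row in zip(tokens, positions, langs):
    print(row)
```

Because both sentences sit in the same sequence, a masked English word can be predicted from its French context and vice versa, which is what encourages aligned representations.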

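# Code walkthrough
0. [Load packages](#0)
0. [TPU](#1)
0. [Set hyperparameters](#2)
0. [Define the tokenization functions](#3)
0. [Define the model](#5)
0. [Load the dataset](#6)
0. [Build the training datasets](#4)
0. [Build and train the model](#7)
0. [Predict and submit](#8)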

<a id=0></a>
### 0. Load packages

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
import tensorflow as tf
import transformers
from transformers import TFXLMRobertaModel, XLMRobertaTokenizer
from transformers import TFAutoModel, AutoTokenizer

<a id=1></a>
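### 1. TPU
> Set up TPU training. This boilerplate is nearly identical from notebook to notebook, so it can be copied as-is.
> - When training on a TPU, always build and compile the model inside `with strategy.scope():`.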

In [None]:
# TPU detection.
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
    print('Running on TPU ', tpu.master())
except ValueError:
    strategy = tf.distribute.get_strategy() # for CPU and single GPU

print('Number of replicas:', strategy.num_replicas_in_sync)  # number of devices

<a id=2></a>
### 2. Set hyperparameters
Set `model_name`, `epoch`, `maxlen`, `tokenizer`, `auto`, and `batch_size`.

In [None]:
seed = 2020
tf.random.set_seed(seed)

model_name = 'jplu/tf-xlm-roberta-large'
#model_name = 'joeddav/xlm-roberta-large-xnli'

epoch = 10

maxlen = 100
#tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

auto = tf.data.experimental.AUTOTUNE  # let tf.data pick the level of parallelism
batch_size = 16 * strategy.num_replicas_in_sync  # common choice on TPU: 8 or 16 per replica

<a id=3></a>
### 3. Define the tokenization functions

In [None]:
def xlm_encode_1(x1, x2, maxlen):
    '''
    x1: df.x1.values
    x2: df.x2.values
    maxlen: int, maximum sequence length
    '''
    s1 = [tokenizer.encode(x) for x in x1]
    s2 = [tokenizer.encode(x) for x in x2]
    # Concatenate each encoded pair into one plain list (only list input is accepted)
    input_word_ids = [a + b for a, b in zip(s1, s2)]
    input_mask = [np.ones_like(x) for x in input_word_ids]
    inputs = {
        'input_word_ids': tf.keras.preprocessing.sequence.pad_sequences(input_word_ids, padding='post', maxlen=maxlen),
        'input_mask': tf.keras.preprocessing.sequence.pad_sequences(input_mask, padding='post', maxlen=maxlen)
        }
    return inputs
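
To make the padding step concrete, here is a small pure-Python sketch of what `pad_sequences(..., padding='post', maxlen=maxlen)` does with its default `truncating='pre'`. The helper name `pad_post` and the toy token ids are assumptions for illustration; this mimics the behavior and is not the Keras implementation.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end; over-long sequences lose tokens from the FRONT
    (mirrors pad_sequences with padding='post' and default truncating='pre')."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# Toy token ids; 0 and 2 stand in for <s> and </s>.
ids = [[0, 11, 12, 2], [0, 21, 22, 23, 24, 25, 2]]
input_word_ids = pad_post(ids, maxlen=6)
input_mask = pad_post([[1] * len(x) for x in ids], maxlen=6)

print(input_word_ids)  # [[0, 11, 12, 2, 0, 0], [21, 22, 23, 24, 25, 2]]
print(input_mask)      # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```

Note that the second sequence loses its leading `<s>` token. If the model reads its sentence representation from position 0, passing `truncating='post'` to `pad_sequences` may be the safer choice.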

In [None]:
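def xlm_encode_2(x1, maxlen):
    '''
    x1: list of texts, or list of [premise, hypothesis] pairs
    maxlen: int, maximum sequence length
    Uses the global tokenizer (AutoTokenizer)
    '''
    # padding='max_length' replaces the deprecated pad_to_max_length=True;
    # truncation=True keeps every row at exactly maxlen tokens
    encoded = tokenizer.batch_encode_plus(x1, padding='max_length', truncation=True, max_length=maxlen)
    return encoded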
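
For intuition about what the tokenizer returns for a sentence pair, the sketch below builds the RoBERTa-style pair layout `<s> A </s></s> B </s>` by hand. The special-token ids (0 for `<s>`, 1 for `<pad>`, 2 for `</s>`) are assumed here and should be checked against `tokenizer.bos_token_id`, `tokenizer.pad_token_id`, and `tokenizer.eos_token_id`. The helper `encode_pair` is hypothetical, and its tail truncation is a simplification of what the real tokenizer does.

```python
BOS, PAD, EOS = 0, 1, 2  # assumed ids for <s>, <pad>, </s>

def encode_pair(ids_a, ids_b, maxlen):
    """Lay out a pair as <s> A </s></s> B </s>, then pad to maxlen."""
    ids = [BOS] + list(ids_a) + [EOS, EOS] + list(ids_b) + [EOS]
    ids = ids[:maxlen]  # simplified: cut the tail if too long
    n_pad = maxlen - len(ids)
    return {
        "input_ids": ids + [PAD] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

encoded = encode_pair([101, 102], [201, 202, 203], maxlen=12)
print(encoded["input_ids"])       # [0, 101, 102, 2, 2, 201, 202, 203, 2, 1, 1, 1]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

By contrast, `xlm_encode_1` encodes the two texts separately and concatenates them, which yields `<s> A </s><s> B </s>`. That differs from the pair layout the model saw during pretraining, so the pair-aware tokenizer call is usually preferable.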

<a id=5></a>
### 4. Define the model

In [None]:
def build_model():
    with strategy.scope():
        # Load the pretrained encoder
        bert_encoder = TFXLMRobertaModel.from_pretrained(model_name)
        #transformer_encoder = TFAutoModel.from_pretrained(model_name)
        
        # Token inputs
        input_word_ids = tf.keras.Input(shape=(None, ), dtype=tf.int32, name='input_word_ids')
        input_mask = tf.keras.Input(shape=(None,), dtype=tf.int32, name="input_mask")
        
        # Run the encoder loaded above; output [0] is the per-token sequence output
        sequence_output = bert_encoder([input_word_ids, input_mask])[0]
        # Keep only the <s> token so the head emits one prediction per example
        embedding = sequence_output[:, 0, :]
        
        # Custom head for the downstream task
        output_layer = tf.keras.layers.Dropout(0.3)(embedding)
        output_dense_layer = tf.keras.layers.Dense(64, activation='relu')(output_layer)
        output_dense_layer = tf.keras.layers.BatchNormalization()(output_dense_layer)
        output_dense_layer = tf.keras.layers.Dense(32, activation='relu')(output_dense_layer)
        
        output = tf.keras.layers.Dense(3, activation='softmax')(output_dense_layer)
        
        # Define the model inputs and outputs
        model = tf.keras.Model(inputs=[input_word_ids, input_mask], outputs=output)
        
        # Optimizer, loss, and metrics
        model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
        
        return model
    

In [None]:
def build_model(max_len):
    # Load the pretrained encoder
    transformer_encoder = TFAutoModel.from_pretrained(model_name)
    # Token-id input of fixed length max_len
    input_word = tf.keras.Input(shape=(max_len, ), dtype=tf.int32, name='input_word')
    # Output [0] is the per-token sequence output
    sequence_output = transformer_encoder(input_word)[0]
    # Take the <s> token
    cls_token = sequence_output[:, 0, :]
    # 3-way softmax classification head
    out = tf.keras.layers.Dense(3, activation='softmax')(cls_token)

    model = tf.keras.Model(inputs=input_word, outputs=out)
    model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return model

In [None]:
with strategy.scope():
    model = build_model(maxlen)
    model.summary()
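
The model is compiled with `sparse_categorical_crossentropy`, which takes integer class labels rather than one-hot vectors. As a rough sketch of what that loss computes (Keras additionally clips probabilities for numerical stability, which is omitted here), with made-up probabilities for a 3-class problem:

```python
import math

def sparse_categorical_crossentropy(labels, probs):
    """Mean of -log(p[true_class]) over the batch; labels are integer ids."""
    losses = [-math.log(p[y]) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)

labels = [0, 2]                  # integer class ids, not one-hot
probs = [[0.7, 0.2, 0.1],        # softmax output for example 1
         [0.1, 0.1, 0.8]]        # softmax output for example 2
loss = sparse_categorical_crossentropy(labels, probs)
print(round(loss, 4))
```

This is why `train_df.label.values` can be passed to the model directly without converting it to one-hot form.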

<a id=6></a>
### 5. Load the dataset

In [None]:
train_df = pd.read_csv('')
test_df = pd.read_csv('')

train_text = train_df[['premise', 'hypothesis']].values.tolist()
test_text = test_df[['premise', 'hypothesis']].values.tolist()

train_label = train_df.label.values

# Split into training and validation sets
train_x, valid_x, train_y, valid_y = train_test_split(train_text, train_label, test_size=0.2, random_state=2020)

train_input = xlm_encode_2(train_x, maxlen)
valid_input = xlm_encode_2(valid_x, maxlen)
test_input = xlm_encode_2(test_text, maxlen)


<a id=4></a>
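### 6. Build the training datasets
- Step0: Prepare the numpy data to load
- Step1: Load it with tf.data.Dataset.from_tensor_slices()
- Step2: Shuffle the data with shuffle()
- Step3: Preprocess with map()
- Step4: Set the batch size with batch()
- Step5: If needed, use repeat() to iterate over the dataset repeatedly (usually not needed for the test set)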
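
To show what the shuffle, batch, and repeat steps do to a stream of examples, here is a pure-Python sketch using generators. The helper names `shuffle_buffer` and `batch` and the toy data are assumptions for illustration. The shuffle mimics the buffer-based sampling that `tf.data` uses, but this is not the `tf.data` implementation.

```python
import itertools
import random

def shuffle_buffer(items, buffer_size, seed=0):
    """Buffer-based shuffle: emit a random element once the buffer is full."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    rng.shuffle(buffer)  # drain what is left at the end of the stream
    yield from buffer

def batch(items, batch_size):
    """Group consecutive elements into lists of batch_size."""
    it = iter(items)
    while True:
        chunk = list(itertools.islice(it, batch_size))
        if not chunk:
            return
        yield chunk

features = ["a", "b", "c", "d", "e"]
labels = [0, 1, 2, 0, 1]
examples = list(zip(features, labels))   # like from_tensor_slices((x, y))

# repeat -> shuffle -> batch, in the same order as the pipeline in this notebook
stream = batch(shuffle_buffer(itertools.cycle(examples), buffer_size=4), batch_size=2)
first_batches = list(itertools.islice(stream, 3))
print(first_batches)
```

A buffer smaller than the dataset only shuffles locally, so a buffer size at least as large as the dataset is needed for a uniform shuffle.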

In [None]:
# Features are paired with one label each; tf.data.Dataset.from_tensor_slices() slices them example by example.
# The model defined above takes token ids only, so pass input_ids.
train_dataset = tf.data.Dataset.from_tensor_slices((train_input['input_ids'], train_y)).repeat().shuffle(2020).batch(batch_size).prefetch(auto)

test_dataset = tf.data.Dataset.from_tensor_slices(test_input['input_ids']).batch(batch_size)

valid_dataset = tf.data.Dataset.from_tensor_slices((valid_input['input_ids'], valid_y)).batch(batch_size).cache().prefetch(auto)

<a id=7></a>
### 7. Build and train the model

In [None]:
n_step = len(train_y) // batch_size

# Validate on the labeled validation split (the test set has no labels)
history = model.fit(train_dataset, steps_per_epoch=n_step, validation_data=valid_dataset, epochs=epoch, shuffle=True, verbose=1)
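
Because the training dataset calls `repeat()`, it never ends on its own, so `fit` needs `steps_per_epoch` to know where one epoch stops. The arithmetic can be sketched as follows. The example count of 10,000 and the 8 replicas are hypothetical numbers chosen for illustration.

```python
def steps_per_epoch(n_examples, per_replica_batch, n_replicas):
    """Return (global batch size, number of full batches per epoch)."""
    global_batch = per_replica_batch * n_replicas
    return global_batch, n_examples // global_batch

# Hypothetical setup: 10,000 training examples, 16 per replica, 8 TPU cores.
global_batch, n_step = steps_per_epoch(10_000, per_replica_batch=16, n_replicas=8)
print(global_batch, n_step)  # 128 78
```

The floor division leaves 16 examples outside each epoch in this example. Since the dataset is repeated and shuffled, those examples still appear in later epochs.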

<a id=8></a>
### 8. Predict and submit

In [None]:
test_preds = model.predict(test_dataset, verbose=1)

submission = test_df.id.copy().to_frame()
submission['prediction'] = test_preds.argmax(axis=1)
submission.to_csv('submission.csv', index=False)
submission.head()
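
The `argmax(axis=1)` call turns each row of class probabilities into a single class id. A small pure-Python sketch of that step, with made-up ids and probabilities standing in for the real test set:

```python
def argmax_rows(prob_rows):
    """Index of the largest probability in each row (like argmax(axis=1))."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]

# Hypothetical softmax outputs for three test examples and three classes.
example_ids = ["id_0", "id_1", "id_2"]
probs = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
predictions = argmax_rows(probs)
rows = list(zip(example_ids, predictions))
print(rows)  # [('id_0', 1), ('id_1', 0), ('id_2', 2)]
```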