**Read Me**

Runtime type: GPU

This notebook uses two pre-trained models for language comprehension: one for English and one for all other languages. A Colab GPU environment gives the best performance. It is NOT possible to train both models in a single Colab session; the session runs out of memory and crashes.

Saved weights can be loaded instead of retraining. You can train one of the models per Colab session. Tips:

*   Comment out the training of the model(s) you do not wish to train, and load the saved weights instead.
*   Each time you run a new version of a model, restart the environment and run all cells to avoid running out of memory.
*   If you are tuning models, save the values locally in your Colab session. This performs better than trying to train or load remotely.

**Note:**
This notebook trains the **Multi-language** model. Use the notebook link (at the bottom) to generate predictions and the submission.



# **Import Libraries & Set Environment**

**Imports from Starter Notebook**

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
# stock imports from original notebook
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

**Install Transformers & Sentencepiece**

These two libraries are used to load and process pretrained language models.

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers

In [None]:
!pip install transformers[sentencepiece]

Collecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96


**Load Dependencies**

Load dependencies from the installed transformers library, along with TensorFlow and scikit-learn. tqdm is used for progress bars.

In [None]:
import tensorflow as tf
from transformers import TFAutoModel, AutoModel, AutoTokenizer, XLMRobertaTokenizer, TFXLMRobertaModel
import sklearn
from sklearn.model_selection import train_test_split
from tqdm import tqdm

**Download Train & Test Data**

Load the train and test data from the GitHub repo.

In [None]:
train = pd.read_csv("https://github.com/jeffreyssimon/Hello_Watson/blob/main/dataset/train.csv?raw=true") 
test = pd.read_csv("https://github.com/jeffreyssimon/Hello_Watson/blob/main/dataset/test.csv?raw=true")

## Data Transform Helper Functions

**tokenize_data**

Takes a data frame and a tokenizer and returns the tokenized features (x) and, when the data frame has a `label` column, the labels (y).

In [None]:
def tokenize_data(df, tokenizer, max_length):
    # tokenize each (premise, hypothesis) pair, padded/truncated to max_length
    text = df[['premise', 'hypothesis']].values.tolist()
    encoded = tokenizer(text, padding='max_length', max_length=max_length, truncation=True)
    # features: token ids only
    x = encoded['input_ids']
    # labels: None when the data frame has no 'label' column
    y = None
    if 'label' in df.columns:
        y = df.label.values
    return x, y
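
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a simplified plain-Python sketch of what happens to a single sequence of token ids. The helper `pad_or_truncate`, the `pad_id` value, and the token ids are made up for illustration; the real tokenizer also inserts special tokens and, for sentence pairs, decides which side to shorten.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return a list of exactly max_length ids (illustrative only)."""
    # Cut off anything beyond max_length
    clipped = token_ids[:max_length]
    # Right-pad shorter sequences with pad_id (assumed value)
    return clipped + [pad_id] * (max_length - len(clipped))


short_row = pad_or_truncate([0, 581, 7515, 2], max_length=6)
long_row = pad_or_truncate(list(range(10)), max_length=6)
print(short_row)  # [0, 581, 7515, 2, 1, 1]
print(long_row)   # [0, 1, 2, 3, 4, 5]
```

Because every row ends up with the same length, the list of rows can be stacked into a single rectangular tensor later on.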

**build_dataset**

Takes x and y values and builds a `tf.data.Dataset` that can be used to train, validate, or run predictions with a model.

In [None]:
def build_dataset(x, y, mode, batch_size):
    if mode == "train":
        # shuffle the training examples, then batch and prefetch
        dataset = (
            tf.data.Dataset
            .from_tensor_slices((x, y))
            .shuffle(2048)
            .batch(batch_size)
            .prefetch(tf.data.AUTOTUNE))
    elif mode == "valid":
        # fixed order; cache the batches so later epochs reuse them
        dataset = (
            tf.data.Dataset
            .from_tensor_slices((x, y))
            .batch(batch_size)
            .cache()
            .prefetch(tf.data.AUTOTUNE))
    elif mode == "test":
        # features only, no labels
        dataset = (
            tf.data.Dataset
            .from_tensor_slices(x)
            .batch(batch_size))
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return dataset
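
As a rough illustration of what `.batch(batch_size)` does to the data, the plain-Python sketch below splits a list into consecutive batches. `iter_batches` is a hypothetical helper, not part of `tf.data`; it assumes the default behaviour in which the final, smaller batch is kept rather than dropped.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


examples = list(range(10))
batches = list(iter_batches(examples, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Steps per epoch is the ceiling of (number of examples / batch size)
steps = math.ceil(len(examples) / 4)
print(steps)  # 3
```

Under the same assumption, a data set of N examples with this notebook's `batch_size` of 16 gives `ceil(N / 16)` steps per epoch.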

# Tokenize & Format Data

**Define Pretrained Models**

There is one model for English and one for non-English text. Both models are based on roberta-large with additional NLI training. Using an English-only model gives better results on the English examples (which are half the data) than the multi-language model does.

In [None]:
# Multi Language Model 
model_name_multi = 'joeddav/xlm-roberta-large-xnli'
# Max length and padding for embeddings
max_length = 120
# batch size for embeddings
batch_size = 16
# fraction of the training data held out for validation
train_test = 0.2

**Split Train Data Set**

Selects the non-English rows of the training data for the multi-language model, then splits them into train and validation sets for measuring model performance during training.

In [None]:
# keep only the non-English rows
tX_multi = train[train['language'] != 'English'].reset_index(drop=True)
# split into train and validation sets
tX_multi, val_multi = train_test_split(tX_multi, test_size=train_test)

# Set Parameters & Build Models

**Set Model Parameters**

Set parameters for the models. The two models share the global parameters, and each has its own model-specific parameters.

In [None]:
# Global Parameters
RANDOM_STATE = 42
tf.random.set_seed(RANDOM_STATE)
# learning_rate = 1e-6
loss = 'sparse_categorical_crossentropy'
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-6,
    decay_steps=1000,
    decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
kernel_regularizer = tf.keras.regularizers.l1(l1=0.01)
# Multi Language Model Parameters
model_name_multi = 'joeddav/xlm-roberta-large-xnli'
epochs_multi = 2
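
With the default `staircase=False`, `ExponentialDecay` scales the initial rate by `decay_rate ** (step / decay_steps)`. The plain-Python sketch below reproduces that formula with the values used above, so the schedule can be inspected without TensorFlow; `decayed_lr` is a hypothetical helper name.

```python
def decayed_lr(step, initial_lr=1e-6, decay_steps=1000, decay_rate=0.9):
    """Learning rate after `step` optimizer steps (continuous decay)."""
    return initial_lr * decay_rate ** (step / decay_steps)


# Print the rate at a few points in training
for step in (0, 500, 1000, 2000):
    print(step, decayed_lr(step))
```

The rate falls by 10% for every 1000 optimizer steps, so over a short run it stays close to the initial `1e-6`.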

**Tokenize Multi-Language Dataset**

Tokenize the multi-language data and build the data sets used to train and validate the multi-language model.

In [None]:
tokenizer_multi = AutoTokenizer.from_pretrained(model_name_multi)
tX_multi, tY_multi = tokenize_data(tX_multi, tokenizer_multi, max_length)
testX_multi, testY_multi = tokenize_data(val_multi, tokenizer_multi, max_length)

tX_multi = build_dataset(tX_multi, tY_multi, "train", batch_size)
testX_multi = build_dataset(testX_multi, testY_multi, "valid", batch_size)


**Build the Multi Language Model**

Build the model for assessing contradictions in non-English text. The current model is an encoder layer, a pooling layer, and an output layer.

In [None]:
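# token ids, padded/truncated to max_length ("input_ids" avoids shadowing the built-in input)
input_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="input")

# pretrained encoder
encoder = TFAutoModel.from_pretrained(model_name_multi)

# [0] selects the last hidden state: (batch, max_length, hidden)
encoder_out = encoder(input_ids)[0]

# average the token embeddings into one vector per example
pooling = tf.keras.layers.GlobalAveragePooling1D()(encoder_out)
# pooling = tf.keras.layers.GlobalMaxPooling1D()(encoder_out)
# 3-class softmax output
output = tf.keras.layers.Dense(3, activation='softmax', kernel_regularizer=kernel_regularizer)(pooling)

model_multi = tf.keras.models.Model(inputs=input_ids, outputs=output)
model_multi.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])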

Some layers from the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing TFXLMRobertaModel: ['classifier']
- This IS expected if you are initializing TFXLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFXLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFXLMRobertaModel were initialized from the model checkpoint at joeddav/xlm-roberta-large-xnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaModel for predictions without further training.
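
The pooling and output layers of the model can be followed with a small NumPy sketch. It uses random numbers as a stand-in for the encoder output and toy sizes (sequence length 5, hidden size 4) chosen only for illustration; the weights `W` and `b` are likewise made up and exist only to show the shapes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the encoder output: (batch, sequence, hidden)
encoder_out = rng.normal(size=(2, 5, 4))

# Global average pooling: mean over the sequence axis -> (batch, hidden)
pooled = encoder_out.mean(axis=1)

# Dense layer with 3 outputs followed by a softmax
W = rng.normal(size=(4, 3))
b = np.zeros(3)
logits = pooled @ W + b
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

print(pooled.shape, probs.shape)  # (2, 4) (2, 3)
```

A plain mean like this one gives every position in the sequence the same weight, padding positions included.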


**Train Multi-Language Model**

Train the model OR load the weights for the model.

In [None]:
# the data sets are already batched by build_dataset, so batch_size is not passed to fit
hist = model_multi.fit(tX_multi, epochs=epochs_multi, validation_data=testX_multi, verbose=1)

Epoch 1/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


In [None]:
model_multi.save_weights('/content/gdrive/MyDrive/model/multi/')
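
The introduction mentions loading saved weights instead of retraining. One way to switch between the two is sketched below, assuming the path given to `save_weights` is later passed to the Keras `load_weights` method on a model built with the same architecture. The helper `train_or_load` and its `train` flag are hypothetical names, not part of the original notebook.

```python
def train_or_load(model, weights_path, train, **fit_kwargs):
    """Train and save the model, or restore previously saved weights.

    `model` is assumed to be a compiled Keras model.
    """
    if train:
        # Train, then persist the weights for later sessions
        history = model.fit(**fit_kwargs)
        model.save_weights(weights_path)
        return history
    # Skip training and reuse the weights from an earlier session
    model.load_weights(weights_path)
    return None


# Example call with this notebook's names (left commented out, since it needs the real model):
# hist = train_or_load(model_multi, '/content/gdrive/MyDrive/model/multi/', train=True,
#                      x=tX_multi, epochs=epochs_multi, validation_data=testX_multi)
```

Setting `train=False` in a fresh session restores the weights and returns `None`, since no training history exists in that case.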

# Prediction

Run this notebook for prediction and submission:
https://colab.research.google.com/drive/1zTK2tK817x8Zll9HumvmN_ZoeXhkmbBP#scrollTo=9o8D4eUIgQs5
