**Read Me**

Our runtime uses [ GPU ]

This workbook loads pretrained weights from Google Drive for both the `English-only` and `Non-English` models and generates the predictions and the submission file.

**English-only model training**: https://colab.research.google.com/drive/1u9a5kD0NtsGR3mltaFlT_aTXybVRe5ax


**Multi-language model training**: https://colab.research.google.com/drive/1qiR2ImnIZ3buQXE6dnCPKpM9-RxCcRF_#scrollTo=d4UIm2Tuk7sC


# **Import Libraries & Set Environment**

**Imports from Starter Notebook**

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
# stock imports from original notebook
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

**Install Transformers & Sentencepiece**

These two libraries are used to load and process pretrained language models.

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attem

In [None]:
!pip install transformers[sentencepiece]

Collecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96


**Load Dependencies**

Load dependencies from the installed transformers library along with tensorflow and sklearn. tqdm is used for progress bars. 

In [None]:
import tensorflow as tf
from transformers import TFAutoModel, AutoModel, AutoTokenizer, XLMRobertaTokenizer, TFXLMRobertaModel
import sklearn
from sklearn.model_selection import train_test_split
from tqdm import tqdm

**Download Train & Test Data**

Load the test data from the GitHub repo. The training data load is commented out because this notebook only generates predictions.

In [None]:
# train = pd.read_csv("https://github.com/jeffreyssimon/Hello_Watson/blob/main/dataset/train.csv?raw=true") 
test = pd.read_csv("https://github.com/jeffreyssimon/Hello_Watson/blob/main/dataset/test.csv?raw=true")

**Download Pre-Trained Model Weights**

Point to the weights saved from a previous training run of the English and non-English models. Both paths are on the mounted Google Drive.

In [None]:
# Paths to the saved weights on the mounted Google Drive
model_en_weights_remote = "/content/gdrive/MyDrive/model/en/"
model_multi_weights_remote = "/content/gdrive/MyDrive/model/multi/"

## Data Transform Helper Functions

**tokenize_data**

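Takes a data frame and a tokenizer and returns the tokenized x values (the `input_ids` of each premise/hypothesis pair) and the y values (the labels, or `None` when the data frame has no `label` column).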

In [None]:
def tokenize_data(df, tokenizer, max_length):
    # tokenize each (premise, hypothesis) pair, padded/truncated to max_length
    text = df[['premise', 'hypothesis']].values.tolist()
    encoded = tokenizer(text, padding='max_length', max_length=max_length, truncation=True)
    # features
    x = encoded['input_ids']
    # labels
    y = None
    if 'label' in df.columns:
        y = df.label.values
    return x, y
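To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the Hugging Face implementation: the helper `pad_or_truncate`, the token ids, and the pad id are all made up for illustration, and a real tokenizer also inserts special tokens and truncates sentence pairs more carefully.

```python
# Toy illustration of fixed-length rows: clip long inputs, pad short ones.
# All names and ids here are hypothetical stand-ins, not library values.

def pad_or_truncate(token_ids, max_length, pad_id):
    """Return a list of exactly max_length ids."""
    clipped = token_ids[:max_length]                  # drop anything past max_length
    padding = [pad_id] * (max_length - len(clipped))  # fill the remainder with pad ids
    return clipped + padding

PAD_ID = 1  # hypothetical pad id chosen for the example

short_row = pad_or_truncate([0, 713, 16, 2], max_length=6, pad_id=PAD_ID)
long_row = pad_or_truncate(list(range(10)), max_length=6, pad_id=PAD_ID)

print(short_row)  # [0, 713, 16, 2, 1, 1]
print(long_row)   # [0, 1, 2, 3, 4, 5]
```

Every row ends up with the same length, which is what lets the rows be stacked into a single tensor. Note that `tokenize_data` keeps only `input_ids`, so the attention mask that the tokenizer also returns is not passed on to the model.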

**build_dataset**

Takes x and y values and builds a `tf.data.Dataset` for training, validation, or prediction, depending on `mode`.

In [None]:
def build_dataset(x, y, mode, batch_size):
  if mode == "train":
    dataset = (
        tf.data.Dataset
        .from_tensor_slices((x, y))
        .shuffle(2048)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE))
  elif mode == "valid":
    dataset = (
        tf.data.Dataset
        .from_tensor_slices((x, y))
        .batch(batch_size)
        .cache()
        .prefetch(tf.data.AUTOTUNE))
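  elif mode == "test":
    dataset = (
        tf.data.Dataset
        .from_tensor_slices(x)
        .batch(batch_size))
  else:
    # fail loudly instead of hitting an undefined `dataset` below
    raise ValueError(f"Unknown mode: {mode!r}; expected 'train', 'valid', or 'test'")
  return dataset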
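The `"test"` branch batches without shuffling, so rows come out in their original order, which matters later when predictions are matched back to rows by position. The sketch below is a plain-Python stand-in for that behavior, not the `tf.data` pipeline itself; `batch_in_order` and the row ids are hypothetical.

```python
# Plain-Python stand-in for .batch(batch_size) with no .shuffle(...):
# rows are grouped in their original order and the last batch may be short.

def batch_in_order(rows, batch_size):
    """Yield consecutive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = list(range(10))  # hypothetical row ids
batches = list(batch_in_order(rows, batch_size=4))
flattened = [row for batch in batches for row in batch]

print(batches)            # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(flattened == rows)  # True
```

No row is dropped or reordered, so the n-th prediction corresponds to the n-th input row.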

# Tokenize & Format Data

**Define Pretrained Models**

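One model handles English and the other handles all non-English languages. Both are RoBERTa-large-style encoders with additional NLI training: `roberta-large-mnli` for English and the multilingual `joeddav/xlm-roberta-large-xnli` for everything else. The English-only model gives better results on English examples (about half the data) than the multi-language model does.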

In [None]:
# English Language Model
model_name_en = 'roberta-large-mnli'
# Multi Language Model 
model_name_multi = 'joeddav/xlm-roberta-large-xnli'

# Set Parameters & Build Models

**Set Model Parameters**

Set parameters for the models. The two models share the global parameters below; the model-specific settings (model name and weights path) are defined separately.

In [None]:
# Global Parameters
RANDOM_STATE = 42
tf.random.set_seed(RANDOM_STATE)
# learning_rate = 1e-6
loss = 'sparse_categorical_crossentropy'
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-6,
    decay_steps=1000,
    decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
kernel_regularizer = tf.keras.regularizers.l1(l1=0.01)

max_length = 120
batch_size = 16
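For reference, with its default non-staircase setting `ExponentialDecay` computes `initial_learning_rate * decay_rate ** (step / decay_steps)`. The sketch below re-implements that formula in plain Python with the values from the cell above; the function name `exponential_decay` is only illustrative.

```python
# Plain-Python version of the (non-staircase) exponential decay formula,
# using the same values as the schedule configured above.

def exponential_decay(step, initial_learning_rate=1e-6, decay_steps=1000, decay_rate=0.9):
    """Learning rate after `step` optimizer steps."""
    return initial_learning_rate * decay_rate ** (step / decay_steps)

for step in (0, 1000, 5000):
    print(step, exponential_decay(step))
```

The rate is multiplied by 0.9 every 1000 steps, so after 5000 steps it has fallen to roughly 59% of its starting value. In this notebook the optimizer is only needed to compile the models, since prediction does not update any weights.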

**1 - Load English Model**

Build the model for assessing English-language contradictions. The current model is an encoder layer, a pooling layer, and an output layer.

In [None]:
input = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="input")

encoder = TFAutoModel.from_pretrained(model_name_en)

encoder_out = encoder(input)[0]

pooling = tf.keras.layers.GlobalAveragePooling1D()(encoder_out)
output = tf.keras.layers.Dense(3, activation='softmax', kernel_regularizer=kernel_regularizer)(pooling)

model_en = tf.keras.models.Model(inputs=input, outputs=output)
model_en.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

model_en.load_weights(model_en_weights_remote)

Some layers from the model checkpoint at roberta-large-mnli were not used when initializing TFRobertaModel: ['classifier']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f9e40060f50>
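As a rough picture of what the pooling and output layers do to the encoder output, here is a NumPy sketch on toy data. The shapes, random values, and weights are assumptions for illustration only; the real model uses the Keras layers and the trained weights loaded above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, far smaller than the real encoder output
batch, seq_len, hidden = 2, 5, 8
encoder_out = rng.normal(size=(batch, seq_len, hidden))  # stand-in for encoder(input)[0]

# GlobalAveragePooling1D: mean over the sequence axis -> (batch, hidden)
pooled = encoder_out.mean(axis=1)

# Dense(3, activation='softmax'): affine map followed by a row-wise softmax
weights = rng.normal(size=(hidden, 3))  # hypothetical weights
bias = np.zeros(3)
logits = pooled @ weights + bias
shifted = logits - logits.max(axis=1, keepdims=True)  # subtract the max for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Taking the argmax per row gives the predicted class, as the prediction cells do
predicted_class = probs.argmax(axis=1)

print(pooled.shape, probs.shape, predicted_class.shape)  # (2, 8) (2, 3) (2,)
```

Each row of `probs` sums to 1 across the three output classes.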

**2 - Load Multi Language Model**

Build the model for assessing non-English-language contradictions. The current model is an encoder layer, a pooling layer, and an output layer.

In [None]:
input = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="input")

encoder = TFAutoModel.from_pretrained(model_name_multi)

encoder_out = encoder(input)[0]

pooling = tf.keras.layers.GlobalAveragePooling1D()(encoder_out)
output = tf.keras.layers.Dense(3, activation='softmax', kernel_regularizer=kernel_regularizer)(pooling)

model_multi = tf.keras.models.Model(inputs=input, outputs=output)
model_multi.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

model_multi.load_weights(model_multi_weights_remote)

Some layers from the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing TFXLMRobertaModel: ['classifier']
- This IS expected if you are initializing TFXLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFXLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFXLMRobertaModel were initialized from the model checkpoint at joeddav/xlm-roberta-large-xnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaModel for predictions without further training.


<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f9db64c0a50>

# Prediction

**Build Predictions**

Go through the test data set, using the appropriate tokenizer and model to make each prediction.

In [None]:
print("Total Predictions to make: ", len(test))

# Split the data into English/other language segments
kaggle_en, kaggle_multi = test[test['language'] == 'English'], test[test['language'] != 'English']

Total Predictions to make:  5195


In [None]:
# Make all English predictions 
tokenizer_en = AutoTokenizer.from_pretrained(model_name_en)
kaggleX_en, kaggleY_en = tokenize_data(kaggle_en, tokenizer_en, max_length)
kaggleX_en = build_dataset(kaggleX_en, kaggleY_en, "test", batch_size)
results_en = model_en.predict(kaggleX_en)
predictions_en = [result.argmax() for result in results_en]

In [None]:
# Make all predictions for other languages
tokenizer_multi = AutoTokenizer.from_pretrained(model_name_multi)
kaggleX_multi, kaggleY_multi = tokenize_data(kaggle_multi, tokenizer_multi, max_length)
kaggleX_multi = build_dataset(kaggleX_multi, kaggleY_multi, "test", batch_size)
results_multi = model_multi.predict(kaggleX_multi)
predictions_multi = [result.argmax() for result in results_multi]

In [None]:
# Merge the English and other language predictions in the order they originally appear in the data
predictions_merged = [0]*len(test)
prediction_index_en = 0
prediction_index_multi = 0
for index in range(len(test)):
  if test.loc[index, 'language'] == 'English':
    predictions_merged[index] = predictions_en[prediction_index_en]
    prediction_index_en += 1
  else:
    predictions_merged[index] = predictions_multi[prediction_index_multi]
    prediction_index_multi += 1

# Copy the id column so adding 'prediction' does not raise SettingWithCopyWarning
predictions_dataframe = test[['id']].copy()
predictions_dataframe['prediction'] = predictions_merged
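The merge loop relies on both prediction lists being in the original row order of their language subset. Under that same assumption, the merge can also be written with a boolean mask. The sketch below uses small made-up arrays rather than the real `test` data frame.

```python
import numpy as np

# Hypothetical toy data: one language per row, plus per-subset predictions
languages = np.array(["English", "French", "English", "Thai", "English"])
predictions_en = np.array([2, 0, 1])   # one per English row, in row order
predictions_multi = np.array([1, 2])   # one per non-English row, in row order

is_english = languages == "English"

# Fill each subset's predictions back into its original positions
merged = np.empty(len(languages), dtype=int)
merged[is_english] = predictions_en
merged[~is_english] = predictions_multi

print(merged.tolist())  # [2, 1, 0, 2, 1]
```

This gives the same result as the index-counter loop while avoiding a per-row `.loc` lookup.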
  


In [None]:
# Permanent save submission to Google Drive
predictions_dataframe.to_csv('/content/gdrive/MyDrive/submission/submission.csv', index=False)

# Temp save submission to Colab
predictions_dataframe.to_csv('submission.csv', index=False)