**Read Me**

Our runtime uses [ GPU ]

This notebook uses two pre-trained models for language comprehension: one for English and one for all other languages. Running it in a Colab GPU environment gives the best performance. It is NOT possible to train both models in a single Colab session; the session will run out of memory and crash.

You can load the weights using the code below, and you can train one of the models per Colab session. Tips:

*   Comment out the training of the model(s) you do not wish to train, and load the saved weights instead.
*   Each time you run a new version of a model, restart the runtime and run all cells to avoid running out of memory.
*   If you are tuning models, save the values locally in your Colab session. This performs better than trying to learn or load remotely. <br><br>

**Note:**
This notebook **ONLY** trains the **English** model. Use the notebook linked at the bottom to generate predictions and the submission.




# **Import Libraries & Set Environment**

**Imports from Starter Notebook**

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# stock imports from original notebook
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

**Install Transformers & Sentencepiece**

These two libraries are used to load and process pretrained language models.

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attem

In [None]:
!pip install transformers[sentencepiece]

Collecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96


**Load Dependencies**

Load dependencies from the installed transformers library along with tensorflow and sklearn. tqdm is used for progress bars. 

In [None]:

# Install googletrans
!pip install googletrans

# Data visualisation
import plotly.graph_objects as go
from plotly.offline import iplot
from plotly import tools
import plotly.express as px
import plotly.offline as py
import plotly.figure_factory as ff
py.init_notebook_mode(connected=True)

# Modelling
import tensorflow as tf
from transformers import TFAutoModel, AutoModel, AutoTokenizer, XLMRobertaTokenizer, TFXLMRobertaModel
import sklearn
from sklearn.model_selection import train_test_split
from tqdm import tqdm

Collecting googletrans
  Downloading googletrans-3.0.0.tar.gz (17 kB)
Collecting httpx==0.13.3
  Downloading httpx-0.13.3-py3-none-any.whl (55 kB)
Collecting httpcore==0.9.*
  Downloading httpcore-0.9.1-py3-none-any.whl (42 kB)
Collecting rfc3986<2,>=1.3
  Downloading rfc3986-1.5.0-py2.py3-none-any.whl (31 kB)
Collecting sniffio
  Downloading sniffio-1.2.0-py3-none-any.whl (10 kB)
Collecting hstspreload
  Downloading hstspreload-2021.12.1-py3-none-any.whl (1.3 MB)
Collecting h2==3.*
  Downloading h2-3.2.0-py2.py3-none-any.whl (65 kB)
Collecting h11<0.10,>=0.8
  Downloading h11-0.9.0-py2.py3-none-any.whl (53 kB)
Collecting hpack<4,>=3.0
  Downloading hpack-3.0.0-py2.py3-no

**Download Train & Test Data**

Load the train and test data from the GitHub repo.

In [None]:
train = pd.read_csv("https://github.com/jeffreyssimon/Hello_Watson/blob/main/dataset/train.csv?raw=true") 
test = pd.read_csv("https://github.com/jeffreyssimon/Hello_Watson/blob/main/dataset/test.csv?raw=true")

# Exploratory Data Analysis

Check shape of train/test datasets:

In [None]:
print("Number of rows and columns in train data : ", train.shape)
print("Number of rows and columns in test data : ", test.shape)

Number of rows and columns in train data :  (12120, 6)
Number of rows and columns in test data :  (5195, 5)


Check for null values:

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12120 entries, 0 to 12119
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          12120 non-null  object
 1   premise     12120 non-null  object
 2   hypothesis  12120 non-null  object
 3   lang_abv    12120 non-null  object
 4   language    12120 non-null  object
 5   label       12120 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 568.2+ KB


In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5195 entries, 0 to 5194
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          5195 non-null   object
 1   premise     5195 non-null   object
 2   hypothesis  5195 non-null   object
 3   lang_abv    5195 non-null   object
 4   language    5195 non-null   object
dtypes: object(5)
memory usage: 203.1+ KB


As shown above, every column in both datasets is fully populated (`non-null`).

**Target Variable Exploration**

In [None]:
Accuracy = pd.DataFrame()
Accuracy['Type'] = train.label.value_counts().index
Accuracy['Count'] = train.label.value_counts().values
Accuracy['Type'] = Accuracy['Type'].replace(0,'Entailment')
Accuracy['Type'] = Accuracy['Type'].replace(1,'Neutral')
Accuracy['Type'] = Accuracy['Type'].replace(2,'Contradiction')
Accuracy

            Type  Count
0     Entailment   4176
1  Contradiction   4064
2        Neutral   3880
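The counts above suggest that the three classes are roughly balanced. As a minimal sketch, the snippet below turns those counts into percentage shares. It uses only the numbers from the output table and the label mapping from the cell above; the helper name `class_shares` is hypothetical and not part of the notebook.

```python
# Label ids and names as used in the cell above.
LABEL_NAMES = {0: "Entailment", 1: "Neutral", 2: "Contradiction"}

# Counts copied from the output table above, keyed by label id.
label_counts = {0: 4176, 2: 4064, 1: 3880}


def class_shares(counts):
    """Return {label name: percentage of all rows}, rounded to one decimal."""
    total = sum(counts.values())
    return {LABEL_NAMES[k]: round(100 * v / total, 1) for k, v in counts.items()}


shares = class_shares(label_counts)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

With no class far below a third of the data, plain accuracy is a reasonable metric to watch during training.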


In [None]:
py.init_notebook_mode(connected=True)
fig = px.bar(Accuracy, x='Type', y='Count',
             hover_data=['Count'], color='Count',
             labels={'Count': 'Number of examples'}, height=400)

fig.update_layout( title={
                    'text': "Count of each of the target classes",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'})

fig.show(renderer="colab")

**Languages**

In [None]:
Languages=pd.DataFrame()
Languages['Type'] = train.language.value_counts().index
Languages['Count'] = train.language.value_counts().values

py.init_notebook_mode(connected=True)
fig = go.Figure(data=[go.Pie(labels=Languages['Type'], values=Languages['Count'],hole=0.2)])
fig.update_layout( title={
                    'text': "Percentage distribution of different Languages",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'})

fig.show(renderer="colab")

Observations on the training data:

- There are 15 different languages in total.
- The majority language is English, at 56.7%.
- Each of the remaining languages accounts for between 2.82% and 3.39%.


## Data Transform Helper Functions

**tokenize_data**

Function that takes a data frame and a tokenizer and returns the tokenized x values and, when a `label` column is present, the y values.

In [None]:
def tokenize_data(df, tokenizer, max_length):
    # tokenize
    text = df[['premise', 'hypothesis']].values.tolist()
    encoded = tokenizer(text, padding='max_length', max_length=max_length, truncation=True)
    # features
    x = encoded['input_ids']
    # labels
    y = None
    if 'label' in df.columns:
        y = df.label.values
    return x, y
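To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a library-free sketch of how a premise/hypothesis pair becomes a fixed-length list of ids. It is a simplified model of the behaviour, not the tokenizer's own code. It assumes the RoBERTa special-token ids (`<s>` = 0, `<pad>` = 1, `</s>` = 2) and trims the longer sequence first; the function name `encode_pair` and the toy token ids are made up for illustration.

```python
BOS, PAD, EOS = 0, 1, 2  # assumed RoBERTa special-token ids: <s>, <pad>, </s>


def encode_pair(premise_ids, hypothesis_ids, max_length):
    """Toy model of pair encoding with truncation and max_length padding."""
    if max_length < 4:
        raise ValueError("max_length must leave room for 4 special tokens")
    premise, hypothesis = list(premise_ids), list(hypothesis_ids)
    budget = max_length - 4  # room left after <s>, </s></s> and </s>
    # Longest-first truncation: trim one token at a time from the longer side.
    while len(premise) + len(hypothesis) > budget:
        if len(premise) > len(hypothesis):
            premise.pop()
        else:
            hypothesis.pop()
    ids = [BOS] + premise + [EOS, EOS] + hypothesis + [EOS]
    # Pad on the right so that every row has exactly max_length entries.
    return ids + [PAD] * (max_length - len(ids))


short_pair = encode_pair([11, 12, 13], [21, 22], max_length=12)
long_pair = encode_pair(range(100, 110), range(200, 204), max_length=12)
print(short_pair)
print(long_pair)
```

Every row comes out with the same length, which is what lets the rows be stacked into a single tensor later on.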

**build_dataset**

Take x and y values and build them into a `tf.data.Dataset` that can be used to train, validate, or run predictions with a model.

In [None]:
def build_dataset(x, y, mode, batch_size_embeddings):
    if mode == "train":
        # shuffle, then batch, for training
        dataset = (
            tf.data.Dataset
            .from_tensor_slices((x, y))
            .shuffle(2048)
            .batch(batch_size_embeddings)
            .prefetch(tf.data.AUTOTUNE))
    elif mode == "valid":
        # fixed order; cached because it is reused every epoch
        dataset = (
            tf.data.Dataset
            .from_tensor_slices((x, y))
            .batch(batch_size_embeddings)
            .cache()
            .prefetch(tf.data.AUTOTUNE))
    elif mode == "test":
        # no labels at prediction time
        dataset = (
            tf.data.Dataset
            .from_tensor_slices(x)
            .batch(batch_size_embeddings))
    else:
        raise ValueError(f"Unknown mode: {mode!r}")
    return dataset
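As a rough illustration of what batching does to the rows, the sketch below groups a list into fixed-size batches in plain Python. It is a simplification: it shuffles the whole list once, whereas the `shuffle(2048)` call above works with a buffer of 2048 elements. The helper name `batches` is hypothetical.

```python
import random


def batches(items, batch_size, shuffle=False, seed=None):
    """Group items into lists of batch_size; the last batch may be smaller."""
    items = list(items)
    if shuffle:
        # Full shuffle of the list (a simplification of a shuffle buffer).
        random.Random(seed).shuffle(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


ordered = batches(range(10), batch_size=4)
shuffled = batches(range(10), batch_size=4, shuffle=True, seed=0)
print(ordered)
print(shuffled)
```

The validation and test modes keep the original row order, which matters for the test set because predictions have to line up with the row ids.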

# Tokenize & Format Data

**Define Pretrained Models**

One model is for English and one is for non-English text. Both models are based on roberta-large with some extra NLI training. On the English examples (more than half the data), an English-only model gives better results than the multilingual model.

In [None]:
# English Language Model
model_name_en = 'roberta-large-mnli'
# Max length and padding for embeddings - try this at larger numbers, perhaps 236
max_length = 120
# batch size for embeddings
batch_size_embeddings = 16
# train test split
train_test = 0.2
# use validation set?
val_set=True

**Split Train Data Set**

Selects the English rows from the training data set. Optionally splits them into train and validation sets for measuring model performance during training.

In [None]:
# keep only the English rows
tX_en = train[train['language']=='English'].reset_index(drop=True)
# optionally hold out a validation set
if val_set:
  tX_en, val_en = train_test_split(tX_en, test_size=train_test)
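The hold-out split can be pictured with a small, library-free sketch. It shuffles the rows, then sets aside a `test_size` fraction for validation. It is an illustration under simple assumptions, not scikit-learn's implementation; the helper name `split_rows` and its `seed` parameter are hypothetical.

```python
import math
import random


def split_rows(rows, test_size, seed=None):
    """Shuffle rows and return (train_rows, valid_rows)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Round the validation size up so it is never empty for test_size > 0.
    n_valid = math.ceil(len(rows) * test_size)
    return rows[n_valid:], rows[:n_valid]


train_rows, valid_rows = split_rows(range(10), test_size=0.2, seed=42)
print(len(train_rows), len(valid_rows))
```

A fixed seed makes the split repeatable from run to run, which helps when comparing model versions.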

# Set Parameters & Build Models

**Set Model Parameters**

Set parameters for the models. Note that the two models share the global parameters and also have their own model-specific parameters.

In [None]:
# Global Parameters
RANDOM_STATE = 42
tf.random.set_seed(RANDOM_STATE)
# learning_rate = 1e-5
loss = 'sparse_categorical_crossentropy'
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=2e-6,
    decay_steps=1000,
    decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
kernel_regularizer = tf.keras.regularizers.l1(l1=0.01)

epochs_en = 2
# try batchsize of 16
batch_size_model=32
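The `ExponentialDecay` schedule above lowers the learning rate smoothly as training proceeds. With the default non-staircase behaviour, the rate at a given step is `initial_learning_rate * decay_rate ** (step / decay_steps)`. The sketch below evaluates that formula in plain Python with the values from the cell above; the function name `exponential_decay` is only illustrative.

```python
def exponential_decay(step, initial_learning_rate=2e-6,
                      decay_steps=1000, decay_rate=0.9):
    """Learning rate after `step` optimizer steps (continuous decay)."""
    return initial_learning_rate * decay_rate ** (step / decay_steps)


for step in (0, 500, 1000, 5000):
    print(f"step {step:>5}: lr = {exponential_decay(step):.3e}")
```

With these settings the rate drops by 10% for every 1000 optimizer steps.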

**Tokenize English Dataset**

Tokenize the English data and build a data set for processing through the English language model.

In [None]:
# Tokenize English Data
tokenizer_en = AutoTokenizer.from_pretrained(model_name_en)
tX_en, tY_en = tokenize_data(tX_en, tokenizer_en, max_length)
tX_en = build_dataset(tX_en, tY_en, "train", batch_size_embeddings)
if val_set == True:
  testX_en, testY_en = tokenize_data(val_en, tokenizer_en, max_length)
  testX_en = build_dataset(testX_en, testY_en, "valid", batch_size_embeddings)


**Build English Model**

Build the model for assessing English-language contradictions. The current model consists of an encoding layer, a pooling layer, and an output layer.

In [None]:
# named input_ids to avoid shadowing the built-in input()
input_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="input")

encoder = TFAutoModel.from_pretrained(model_name_en)

# [0] is the sequence of per-token hidden states
encoder_out = encoder(input_ids)[0]
pooling = tf.keras.layers.GlobalAveragePooling1D()(encoder_out)

output = tf.keras.layers.Dense(3, activation='softmax', kernel_regularizer=kernel_regularizer)(pooling)

model_en = tf.keras.models.Model(inputs=input_ids, outputs=output)
model_en.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])


Some layers from the model checkpoint at roberta-large-mnli were not used when initializing TFRobertaModel: ['classifier']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.
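The pooling and output layers of the model are simple operations. The sketch below shows them in plain Python on toy numbers: average pooling takes the mean of the per-token vectors, and softmax turns three scores into class probabilities. The dense layer's weights are left out, and the helper names and input values are invented for illustration.

```python
import math


def global_average_pool(token_vectors):
    """Mean over the sequence axis: one value per hidden dimension."""
    n_tokens = len(token_vectors)
    return [sum(column) / n_tokens for column in zip(*token_vectors)]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three tokens, each with a 2-dimensional hidden state (toy values).
toy_encoder_out = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(toy_encoder_out)

# Toy scores for the three classes, in label order 0, 1, 2.
probs = softmax([2.0, 1.0, 0.1])
print(pooled)
print(probs)
```

The predicted class is the index of the largest probability.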


In [None]:
# The datasets are already batched by build_dataset, so batch_size is not passed to fit()
if val_set:
  hist = model_en.fit(tX_en, epochs=epochs_en, validation_data=testX_en, verbose=1)
else:
  hist = model_en.fit(tX_en, epochs=epochs_en, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10

In [None]:
model_en.save_weights('/content/gdrive/MyDrive/model/en/')

# Prediction

Run this notebook for prediction and submission:
https://colab.research.google.com/drive/1zTK2tK817x8Zll9HumvmN_ZoeXhkmbBP#scrollTo=9o8D4eUIgQs5
