# **1. Introduction**

**The Challenge:**

If you have two sentences, there are three ways they could be related: 

* One could entail the other, one could contradict the other, or they could be unrelated. 

* Natural Language Inference (NLI) is a popular NLP problem that involves determining how pairs of sentences (a premise and a hypothesis) are related.

* The task here is to create an NLI model that assigns a label of 0, 1, or 2 (corresponding to entailment, neutral, and contradiction) to each premise-hypothesis pair. The premises and hypotheses are written in multiple languages.


**The Approach for solving the Problem:**

**Goal:** As per the competition, our goal is to predict whether a given hypothesis is related to its premise by entailment or contradiction, or whether neither holds (neutral).
For each sample in the test set, we must predict a value of 0, 1, or 2 for the `label` variable.
These values map to the logical relations as follows:

* 0 == entailment
* 1 == neutral
* 2 == contradiction
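
The mapping can be kept in one place so that plots and reports show names rather than raw ids. This is a minimal sketch; `LABEL_NAMES` and `label_to_name` are illustrative names, not part of the competition code.

```python
# Label ids as defined by the competition; the helper name is illustrative.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

def label_to_name(label_id: int) -> str:
    """Return the relation name for a competition label id."""
    return LABEL_NAMES[label_id]

print([label_to_name(i) for i in (0, 1, 2)])
```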

To achieve this goal we will follow the steps below:

* Exploratory Data Analysis (EDA) will be performed on the given datasets to understand the patterns in the data and to gain more insight into what the data looks like.

* As in most projects, we consider the following EDA steps:

    • Previewing the data, which consists of loading and inspecting the datasets.
    
    • Checking the total number of entries and the shape of the datasets, including the number of columns and their types.
    
    • Checking whether the datasets contain any null values or duplicate entries.
    
    • Plotting the distribution of the numeric and categorical data.
    
    • Visualizing these distributions using seaborn, matplotlib, ggplot, box plots and other visualization techniques suitable for this data.
    
    • Visualizing the word count using histograms.
    
    • Creating our own graphs for the language distribution and the label distribution.
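
Several of the checks listed above fit in a few lines of pandas. The sketch below runs on a small made-up frame (the rows and the name `toy` are assumptions for illustration, not the competition data) and shows the null check, the duplicate check and the per-sentence word count that a histogram would be built from.

```python
import pandas as pd

# Toy stand-in for train.csv; the rows are made up for illustration only.
toy = pd.DataFrame({
    "premise": ["The cat sat on the mat.", "He plays the piano.", "He plays the piano."],
    "hypothesis": ["A cat is sitting.", "He is a musician.", "He is a musician."],
    "label": [0, 1, 1],
})

null_counts = toy.isnull().sum()                       # missing values per column
duplicate_rows = int(toy.duplicated().sum())           # fully repeated rows
premise_words = toy["premise"].str.split().str.len()   # words per premise

print(null_counts.to_dict(), duplicate_rows, premise_words.tolist())
```

On the real data, `premise_words.hist()` would give the word-count histogram mentioned in the list.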
    
Some experimental steps will be part of this project, since we will be using TensorFlow distribution strategies. TensorFlow provides APIs for distributing training across multiple GPUs or TPUs, with several distribution strategies available:

    • MirroredStrategy
    • TPUStrategy
    • MultiWorkerMirroredStrategy
    • CentralStorageStrategy
    • ParameterServerStrategy
    
We will be using the TPUStrategy for our project, because GPUs and TPUs can radically reduce the time required to execute a single training step. Achieving peak performance requires an efficient input pipeline that delivers data for the next step before the current step has finished. 

   * For this we will build flexible and efficient input pipelines using the TensorFlow `tf.data` API.

   * Scaling the batch size by the number of replicas.

   * We will pre-process the data before passing it to the transformers: the text is tokenized with the AutoTokenizer and then fed to the model.

   * Encoding the data.

   * Since both train and test sets are given, we will train the models on the train dataset, holding out part of it for validation with a train-test split.

There will be a two-way approach to the modelling here, where we experiment with pre-trained models and validate their performance.

For this, we will be using:

   * KFold, a cross-validation splitter from sklearn (not a model in itself), to create the folds.
    
   * The Bidirectional Encoder Representations from Transformers (BERT) model, a technique for NLP pre-training.
    
From the BERT family of models, we will be considering XLM-RoBERTa and DistilBERT for our project.

• Create predictions and store them as the final submission.csv.

• Accuracy will be the metric used to score the models.

**Description of the Datasets:**

In [None]:
import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')



from sklearn import ensemble, metrics, model_selection
import os



from sklearn.utils import shuffle
from sklearn.metrics import roc_auc_score, accuracy_score
#!pip install transformers
import transformers
from transformers import BertTokenizer,AutoTokenizer, TFAutoModel,BertForSequenceClassification, TFBertModel
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification
from transformers import AdamW

import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

#!pip install googletrans
#from googletrans import Translator
import copy
from sklearn.model_selection import train_test_split


os.environ["WANDB_API_KEY"] = "0" ## to silence the warning:wandb: WARNING W&B installed but not logged in.  Run `wandb login` or set the WANDB_API_KEY env variable.

**TensorFlow:**

TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. It is used for both research and production at Google.

**Tensor processing unit (TPU):**

In May 2016, Google announced its Tensor processing unit (TPU), an application-specific integrated circuit (ASIC, a hardware chip) built specifically for machine learning and tailored for TensorFlow. A TPU is a programmable AI accelerator designed to provide high throughput of low-precision arithmetic (e.g., 8-bit), and oriented toward using or running models rather than training them. Google announced they had been running TPUs inside their data centers for more than a year, and had found them to deliver an order of magnitude better-optimized performance per watt for machine learning.


**TRAIN DATA:**

In [None]:
train_data = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")
train_data.head(10)

The data is straightforward: each row contains a premise, a hypothesis and a class (label) that tells us how the two sentences are related, i.e. whether the relation is entailment, neutral or contradiction. So, the model takes two inputs (the premise and hypothesis sentences) and returns one of these classes.

In [None]:
a = train_data.shape

print("The shape of the train data is:",a)

In [None]:
print("the attributes are",train_data.columns)

The train data consists of 12,120 instances and 6 attributes.

The attributes are: 

    'id'
    'premise'
    'hypothesis'
    'lang_abv'
    'language'
    'label'

The .info() method reports the column names, non-null counts and data types (dtypes) of our data.

Below is the description of the data types of our data.

In [None]:
train_data.info()

**TEST DATA:**

In [None]:
test_data = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")
test_data.head(10)

In [None]:
b = test_data.shape
print("The shape of the test data is:",b)

The test data consists of 5,195 instances and 5 attributes.

The attributes are:

    id
    premise
    hypothesis
    lang_abv
    language
    
The target attribute 'label' is not included in the test data, since it is what the model has to predict.

Let's take a look at two pairs of premise and hypothesis sentences from the train data:

In [None]:
train_data['premise'].values[0]

In [None]:
train_data['hypothesis'].values[0]

In [None]:
train_data['label'].values[0]

**OBSERVATION:**

Based on the information in the premise, we know that the hypothesis is true.
The premise entails the hypothesis, and the label shows that. 
So, this pair is related by entailment.

In [None]:
train_data['premise'].values[1]

In [None]:
train_data['hypothesis'].values[1]

In [None]:
train_data['label'].values[1]

**OBSERVATION:**

Based on the information in the premise, we know that the hypothesis is false.
The premise and the hypothesis contradict each other, and the label shows that. 
So, this pair is related by contradiction.

# **2. Explore the Data**

**Numeric Columns in the data:**

In [None]:
num_cols = train_data._get_numeric_data().columns
print("The numeric columns in the train data are:",num_cols)

**Categorical Columns in the data:**

In [None]:
df_cat = train_data[train_data.columns.difference(num_cols)]
print("The categorical columns in the train data are:",df_cat.columns)

**Visualization of the languages given in the data:**

In [None]:
value = train_data.language.values
value

In [None]:
t = train_data['language'].unique()
print("The unique languages in the dataset are:",t)

In [None]:
tl = train_data.language.nunique()
print("The total number of languages present in the data are:",tl)

In [None]:
v = train_data.language.value_counts()
v

In [None]:
sns.set_style("whitegrid", {'grid.linestyle': '--'})
plt.figure(figsize=(15,6))
sns.countplot(x="language", data=train_data, palette="Set2")
plt.show()

**Percentage Distribution of all the languages:**

In [None]:
labels, frequencies = np.unique(value, return_counts = True)
plt.figure(figsize = (20,10))
plt.pie(frequencies,labels = labels, autopct = '%.1f%%')
plt.show()


**OBSERVATIONS:**

1. From the above two distributions we can see that English is the dominant language in the given dataset, accounting for 56.7% of the rows.

2. The remaining languages are more or less equally distributed.

**Comparing the Languages Distribution in both Train and Test data:**

In [None]:
fig = plt.figure(figsize = (15,5))

plt.subplot(1,2,1)
plt.title('Train data language distribution')
sns.countplot(data = train_data, x = 'language', order = v.index)
plt.xticks(rotation=90)

plt.subplot(1,2,2)
plt.title('Test data language distribution')
sns.countplot(data = test_data, x = 'language', order = test_data['language'].value_counts().index)
plt.xticks(rotation=90)

**OBSERVATIONS:**

From the above plots we can see that the language distributions of the train and test data are very similar.
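
The visual comparison can be backed by numbers by putting the per-language shares of both splits side by side. The sketch below uses two short made-up series (`train_lang` and `test_lang` are hypothetical stand-ins for the `language` columns) and is only meant to show the idea.

```python
import pandas as pd

# Hypothetical language columns standing in for train_data and test_data.
train_lang = pd.Series(["English"] * 6 + ["French"] * 2 + ["Thai"] * 2)
test_lang = pd.Series(["English"] * 3 + ["French"] * 1 + ["Thai"] * 1)

# Share of each language in each split, aligned on the language name.
shares = pd.DataFrame({
    "train": train_lang.value_counts(normalize=True),
    "test": test_lang.value_counts(normalize=True),
}).fillna(0.0)
shares["abs_diff"] = (shares["train"] - shares["test"]).abs()

print(shares.round(3))
```

A small `abs_diff` for every language would support the observation that the two splits follow the same language distribution.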

**Visualization of the Labels in the Data:**

* Here the target variable is the 'label' attribute.

In [None]:
value_label = train_data.label.values
value_label

In [None]:
train_data.label.value_counts()

In [None]:
v_label = pd.DataFrame()
v_label['Type'] = train_data.label.value_counts().index
v_label['Count'] = train_data.label.value_counts().values
v_label['Type']=v_label['Type'].replace(0,'Entailment')
v_label['Type']=v_label['Type'].replace(1,'Neutral')
v_label['Type']=v_label['Type'].replace(2,'Contradiction')
v_label

In [None]:
train_data.hist(bins=50, figsize=(8,8))
plt.show()

In [None]:
sns.set_style("whitegrid", {'grid.linestyle': '--'})
plt.figure(figsize=(10,5))
sns.countplot(x="Type", data=v_label)
plt.title("Distribution of the label types")
plt.show()

In [None]:
# v_label already holds one row per class, so plot its counts directly
plt.figure(figsize = (10,10))
plt.pie(v_label['Count'], labels = v_label['Type'], autopct = '%.1f%%')
plt.show()

**OBSERVATIONS:**

* There are a total of 12,120 instances in the data, with 6 attributes/features.

* There are 4,176 records of Entailment.
    
* There are 4,064 records of Contradiction.
    
* There are 3,880 records of Neutral.

* Therefore, the classes are nearly balanced and there is no class imbalance in the given data.
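
The balance claim can be checked with a little arithmetic on the counts listed above. The sketch below only uses those three numbers; the variable names are illustrative.

```python
# Label counts reported in the observations above.
counts = {"Entailment": 4176, "Contradiction": 4064, "Neutral": 3880}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Ratio of the largest to the smallest class; values near 1 indicate balance.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(total, shares, round(imbalance_ratio, 2))
```

Each class holds roughly a third of the rows, and the largest class is less than 8% bigger than the smallest.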

# **3. Prepare the Data**

**To check if any null value is present in the data:**

In [None]:
train_data.isnull().sum()

**OBSERVATION:**

There are no missing values present in the data.

**Translating the non-English sentences into English sentences by using Googletrans:**

* Googletrans is a free and unlimited Python library that implements the Google Translate API. It uses the Google Translate Ajax API to make calls to methods such as detect and translate.

Features which we will be using for our project:

    1. Fast and reliable - it uses the same servers that translate.google.com uses
    
    2. Auto language detection
    
    3. Bulk translations
    
* Since we have 15 different languages, we will use Googletrans to translate the non-English sentences into English. 

* We will create separate translated CSV files for the train and test data, so the translation code below needs to be run only once.

**Translated Train data is as below:**

In [None]:
#train = pd.read_csv('../input/translated-data/translated_train.csv')
#train.head(10)

**Translated Test data is as below:**

In [None]:
#test_data.premise[test_data.lang_abv!= 'en']=test_data.premise[test_data.lang_abv!= 'en'].apply(lambda x: Translation(x))

In [None]:
#test_data.hypothesis[test_data.lang_abv!= 'en']=test_data.hypothesis[test_data.lang_abv!= 'en'].apply(lambda x: Translation(x))

In [None]:
#test_data.to_csv(r'translated_test.csv', index = False)

In [None]:
#test = pd.read_csv('../input/translated-data/translated_test.csv')
#test.head(10)

# **Configuring the TPU:**

* Here, we detect the hardware and return the appropriate distribution strategy.

* A TPU needs to be found and set up to work with the models used in this notebook.

* A strategy needs to be defined for how the model is replicated across the cores of the TPU board and how the results from these replicas are merged back together during training.

* The code below looks for a TPU and sets it up. If none is available, it falls back to TensorFlow's default distribution strategy, which works on CPU and a single GPU.

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
    print('Running on TPU ', tpu.master())
except ValueError:
    strategy = tf.distribute.get_strategy()

    
print("REPLICAS: ", strategy.num_replicas_in_sync)

# **4. Shortlist Promising Models**

**Models:**

* Computer vision models are often built on "backbones". These are pre-trained models whose weights can be generalised to a new task. Extra layers (the head) are attached to the end of the backbone to handle the new task, giving a model that benefits from state-of-the-art pre-training but is still built for the task at hand. The same idea applies to language models.

* We will build a model that uses RoBERTa as the backbone, with a softmax layer on the end to assign the class (entailment, neutral, contradiction, i.e. 0, 1, 2).

* BERT (the original language transformer model that models like RoBERTa are based on) is quite a complex model. 

* DistilBERT is a smaller alternative. The abstract of the DistilBERT paper describes it as follows: "As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train."

* We use the `transformers` library here, which gives us access to the pre-trained models from "Hugging Face". [https://huggingface.co/transformers/index.html]


* Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.


**Features:**

    High performance on NLU and NLG tasks

    Low barrier to entry for educators and practitioners

    State-of-the-art NLP for everyone:

        Deep learning researchers

        Hands-on practitioners

        AI/ML/NLP teachers and educators

    Lower compute costs, smaller carbon footprint:

        Researchers can share trained models instead of always retraining

        Practitioners can reduce compute time and production costs

        8 architectures with over 30 pretrained models, some in more than 100 languages

    Choose the right framework for every part of a model’s lifetime:

        Train state-of-the-art models in 3 lines of code

        Deep interoperability between TensorFlow 2.0 and PyTorch models

        Move a single model between TF2.0/PyTorch frameworks at will

        Seamlessly pick the right framework for training, evaluation, production

# **XLM-RoBERTa Model:**

**Overview:**

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook’s RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.

The abstract from the paper is the following:

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.


Several versions of XLM-RoBERTa are available in the Transformers library. Here are two:

* xlm-roberta-base

* xlm-roberta-large

This is the link to the XLM-RoBERTa paper:

https://arxiv.org/pdf/1911.02116.pdf

# **XLM-RoBERTa Model: (Large)**

**Defining Parameters to be used:**

To be able to tweak the parameters of the model, we have defined them as globals.

The per-replica batch size is multiplied by the number of replicas, which is 8 on a TPU. 
This makes sure that each of the eight TPU cores processes the specified batch size and not one eighth of that number.

In [None]:
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
AUTO = tf.data.experimental.AUTOTUNE
step = len(train_data) // BATCH_SIZE
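
The arithmetic behind these globals can be spelled out with plain numbers. The sketch below assumes 8 replicas (one per TPU core) and the 12,120 training rows reported earlier; on other hardware `num_replicas` would differ.

```python
# Assumed values: 8 replicas (one per TPU core) and the 12,120 training rows
# reported earlier; per_replica_batch mirrors the 16 used in the notebook.
num_replicas = 8
per_replica_batch = 16
num_examples = 12120

global_batch = per_replica_batch * num_replicas   # examples consumed per step
steps_per_epoch = num_examples // global_batch    # full batches in one pass

print(global_batch, steps_per_epoch)
```

Under these assumptions one training step consumes 128 examples, and 94 steps cover the data once.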

**Building the Model Definition:**

* We use the class method `from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)` to load the pre-trained encoder.

* The model is built inside the scope of the strategy which we defined above.

* The classifier token (`<s>` for XLM-RoBERTa, the counterpart of BERT's `[CLS]`) is used for sequence classification, i.e. classification of the whole sequence instead of per-token classification. It is the first token of the sequence when built with special tokens, and its output vector serves as a summary of the sequence.

* Then we apply the softmax layer that produces the class probabilities and compile the model.

In [None]:
def model_defination(strategy,transformer):
    # Build and compile the classifier inside the distribution strategy scope
    with strategy.scope():
        encoder = TFAutoModel.from_pretrained(transformer)
        # Token ids of the encoded premise-hypothesis pair (max_length=30)
        input_layer = Input(shape=(30,), dtype=tf.int32, name="input_layer")
        sequence_output = encoder(input_layer)[0]
        # Output vector of the first (classifier) token
        cls_token = sequence_output[:, 0, :]
        output_layer = Dense(3, activation='softmax')(cls_token)
        model = Model(inputs=input_layer, outputs=output_layer)
        model.compile(
            Adam(learning_rate=1e-5), 
            loss='sparse_categorical_crossentropy', 
            metrics=['accuracy']
        )
        return model
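
What the classification head does to the encoder output can be traced with NumPy alone. The sketch below is not the Keras model: it uses random numbers in place of the encoder output and an untrained dense layer, with arbitrary sizes chosen for illustration, to show the slicing of the first token and the softmax over three classes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the encoder output: (batch, sequence_length, hidden_size).
# The sizes are arbitrary; a real encoder has a much larger hidden size.
sequence_output = rng.normal(size=(4, 30, 8))

cls_vector = sequence_output[:, 0, :]        # first position of every sequence

# Randomly initialised dense layer with 3 units, one per class.
weights = rng.normal(size=(8, 3))
bias = np.zeros(3)
logits = cls_vector @ weights + bias

# Numerically stable softmax over the class axis.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

predicted = probs.argmax(axis=1)             # class id in {0, 1, 2}
print(cls_vector.shape, probs.shape, predicted)
```

Each row of `probs` sums to 1, and the index of its largest entry is the predicted label.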

* After defining the function above, we call model_defination to load our pre-trained model inside the strategy scope.

In [None]:
model=model_defination(strategy,"jplu/tf-xlm-roberta-large")

Below is the summary of the model loaded above.

In [None]:
model.summary()

**Tokenizing: With (xlm-roberta-large)**

* Like any machine learning model, a language model works with numbers, not text. 

* We need to tokenize the sentences to prepare them for training.

* These tokens are numeric indices that represent words or sub-word pieces. Each model has its own vocabulary of tokens.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('jplu/tf-xlm-roberta-large')

**Creating 3-folds for using in our model:**

In [None]:
from sklearn.model_selection import KFold, StratifiedKFold

# shuffle
df = shuffle(train_data)

# initialize kfold
kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1024)

# for stratification
y = df['label']

# Put the folds into a list. This is a list of tuples.
fold_list = list(kf.split(df, y))

train_df_list = []
val_df_list = []

for i, fold in enumerate(fold_list):

    # kf.split returns positional indices, so select the rows with iloc
    df_train = df.iloc[fold[0]]
    df_val = df.iloc[fold[1]]
    
    train_df_list.append(df_train)
    val_df_list.append(df_val)
    
    

print(len(train_df_list))
print(len(val_df_list))
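
The indices returned by a scikit-learn splitter are row positions, not index labels, and the two differ once a frame has been shuffled. The toy example below (a made-up frame, not the competition data) shows how positional selection with `iloc` and label matching with `index.isin` pick different rows.

```python
import pandas as pd

# Toy frame whose index labels are out of order, as after a shuffle.
toy = pd.DataFrame({"label": [0, 1, 2, 0]}, index=[3, 0, 2, 1])

positions = [0, 1]                          # positional indices, as a splitter returns

by_position = toy.iloc[positions]           # rows at positions 0 and 1
by_label = toy[toy.index.isin(positions)]   # rows whose index label is 0 or 1

print(by_position.index.tolist(), by_label.index.tolist())
```

Only the positional selection returns the rows the splitter actually assigned, which matters for keeping the folds stratified.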

In [None]:
# Display one train fold:

df_train = train_df_list[0]

df_train.head()

In [None]:
# Display one val fold

df_val = val_df_list[0]

df_val.head()

**Creating the training, validation and test sets:**

In [None]:
train_set = df_train[['premise','hypothesis']].values.tolist()
test_set = test_data[['premise','hypothesis']].values.tolist()

**Encoding the Data:**

In [None]:
encoded_train = tokenizer.batch_encode_plus(train_set, pad_to_max_length=True, max_length=30)
encoded_test = tokenizer.batch_encode_plus(test_set, pad_to_max_length=True, max_length=30)

The example below shows what the tokenizer has done to the first training pair. The premise and hypothesis are encoded together as a single sequence of 30 token ids.

**The textual form of the data:**

In [None]:
df_train.premise.values[0]

Below are the first 20 token ids of the encoded pair:

In [None]:
print(encoded_train.input_ids[0][0:20])

The hypothesis of the same pair is:

In [None]:
df_train.hypothesis.values[0]

Below are the remaining token ids of the encoded pair:

In [None]:
print(encoded_train.input_ids[0][20:30])

* So we can see that the sentences above have been converted into an array in which each token is represented by a numeric index. 

* The tokenizer even splits words into sub-words.

**Below we check the vocabulary and special tokens defined by the tokenizer:**

In [None]:
# vocab size

tokenizer.vocab_size

In [None]:
# the special tokens

tokenizer.special_tokens_map

In [None]:
print('bos_token_id <s>:', tokenizer.bos_token_id)
print('eos_token_id </s>:', tokenizer.eos_token_id)
print('sep_token_id </s>:', tokenizer.sep_token_id)
print('pad_token_id <pad>:', tokenizer.pad_token_id)

* The token 0 represents `<s>`, which marks the start of a sequence.

* When a model has two inputs (like a premise and hypothesis), the tokenizer merges the tokens from the two sentences into one array. 

* The `</s>` token is used to denote the end of the premise and the beginning of the hypothesis. XLM-RoBERTa places two of them between the two sentences and one at the very end.
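
The layout of an encoded pair can be reproduced by hand. The sketch below assumes the usual XLM-RoBERTa special-token ids (0 for `<s>`, 1 for `<pad>`, 2 for `</s>`, which the cell above prints); the word ids and the helper `encode_pair` are made up, and the truncation is deliberately naive compared with what the real tokenizer does.

```python
# Special-token ids of the XLM-RoBERTa tokenizer; the word ids are made up.
BOS, PAD, EOS = 0, 1, 2

def encode_pair(premise_ids, hypothesis_ids, max_length):
    """Join two id lists as <s> A </s></s> B </s>, then truncate or pad."""
    ids = [BOS] + premise_ids + [EOS, EOS] + hypothesis_ids + [EOS]
    ids = ids[:max_length]                        # naive truncation from the right
    return ids + [PAD] * (max_length - len(ids))  # right-pad to a fixed length

encoded = encode_pair([101, 102, 103], [201, 202], max_length=12)
print(encoded)
```

Every encoded pair ends up with the same length, which is what the fixed `Input(shape=(30,))` of the model relies on.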

**Train and Validation Split:**

* Splitting the encoded data into training and validation sets.

* We use the training portion of the first of the 3 folds created above and hold out 30% of it for validation.

In [None]:
# using the first fold(0):

X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train['input_ids'], df_train.label.values, test_size=0.3)

x_test = encoded_test['input_ids']

**Pipeline:**

* When using TensorFlow and TPUs, it is best to build a data pipeline.

* This pipeline is built using TensorFlow's `tf.data` API, which provides better performance during training.

* In the pipeline we load the data with `from_tensor_slices`, repeat and shuffle it, batch it, and prefetch the next batch while the model is training on the current batch.

In [None]:
train_df = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))
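
The order of operations in the training pipeline (repeat, then shuffle with a buffer, then batch) can be mimicked in plain Python. This is only an analogue written for illustration, not how `tf.data` is implemented; `repeat_shuffle_batch` and its parameters are hypothetical names.

```python
import itertools
import random

def repeat_shuffle_batch(examples, buffer_size, batch_size, seed=0):
    """Yield endless shuffled batches, mimicking repeat -> shuffle -> batch."""
    rng = random.Random(seed)
    stream = itertools.cycle(examples)           # repeat(): never runs out
    buffer = list(itertools.islice(stream, buffer_size))
    batch = []
    for item in stream:
        # shuffle(): emit a random buffered element, refill from the stream
        pick = rng.randrange(len(buffer))
        batch.append(buffer[pick])
        buffer[pick] = item
        if len(batch) == batch_size:             # batch(): group fixed-size chunks
            yield batch
            batch = []

batches = repeat_shuffle_batch(list(range(10)), buffer_size=4, batch_size=3)
first_five = list(itertools.islice(batches, 5))
print(first_five)
```

Because the stream repeats forever, the caller decides how many batches make up an epoch, which is the role of `steps_per_epoch` in `model.fit`.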

**Training the Model: at different Epoch values (3 and 5)**


An "epoch" is one complete pass of the learning algorithm over the whole training dataset.
The number of epochs is a hyperparameter of gradient descent that controls how many complete passes the model makes over the training dataset.

* Here, we will first find the best number of epochs and then run the next 2 folds with that value for this model.

In [None]:
#At epochs=3
model_3 = model.fit(train_df,steps_per_epoch=step,validation_data=valid_df,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(model_3.history['val_accuracy'])))
print("validation loss {}".format(np.mean(model_3.history['val_loss'])))
print("accuracy {}".format(np.mean(model_3.history['accuracy'])))
print("loss {}".format(np.mean(model_3.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(model_3.history['val_accuracy'])))
print("validation loss {}".format(np.std(model_3.history['val_loss'])))
print("accuracy {}".format(np.std(model_3.history['accuracy'])))
print("loss {}".format(np.std(model_3.history['loss'])))

In [None]:
#At epochs=5
model_5 = model.fit(train_df,steps_per_epoch=step,validation_data=valid_df,epochs=5)

In [None]:
print("validation accuracy {}".format(np.mean(model_5.history['val_accuracy'])))
print("validation loss {}".format(np.mean(model_5.history['val_loss'])))
print("accuracy {}".format(np.mean(model_5.history['accuracy'])))
print("loss {}".format(np.mean(model_5.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(model_5.history['val_accuracy'])))
print("validation loss {}".format(np.std(model_5.history['val_loss'])))
print("accuracy {}".format(np.std(model_5.history['accuracy'])))
print("loss {}".format(np.std(model_5.history['loss'])))

* Accuracy - calculated on the training data. It is the fraction of training instances that are correctly classified.

* Val_accuracy - calculated on the validation data. It measures how well the model's predictions generalise to data it was not trained on.

* Loss - the training loss, which is the average of the losses over every batch of the training data.

* Val_loss - the loss averaged over the batches of the validation data, which the model has not been trained on.


**OBSERVATIONS:**

* At the start, since the model changes over time, the loss over the first batches of an epoch is generally higher than over the last batches.

* We can tell the model is training well when both losses (loss and val_loss) decrease over each epoch and both accuracies (accuracy and val_accuracy) increase gradually.

* Looking at the last epoch of the runs with 3 and 5 epochs, we can see the following:

    * At epochs = 3, loss and val_loss have decreased and accuracy and val_accuracy have increased.
    
    * At epochs = 5, loss is decreasing but val_loss is increasing, and accuracy is increasing whereas val_accuracy has plateaued.
    
    Hence, we do not train beyond epochs = 3, as the model starts to overfit with longer training.

* Note that the second `fit` call continues from the weights reached by the first one, because the same `model` object is reused, so the 5-epoch run is not a fresh start.
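
The rule of thumb used here, stop once the validation loss starts to rise, can be written as two small helpers. The loss values below are hypothetical and are not results from this notebook; `best_epoch` and `first_rise` are illustrative names.

```python
# Hypothetical per-epoch validation losses, not results from this notebook.
val_loss = [0.92, 0.71, 0.64, 0.69, 0.75]

def best_epoch(losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=lambda i: losses[i]) + 1

def first_rise(losses):
    """Return the 1-based epoch at which the loss first increases, or None."""
    for i in range(1, len(losses)):
        if losses[i] > losses[i - 1]:
            return i + 1
    return None

print(best_epoch(val_loss), first_rise(val_loss))
```

Applied to a real run, the same helpers could take `history.history['val_loss']` as input.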

**Now we will run the next fold [1] with epochs = 3:**

In [None]:
df_train1 = train_df_list[1]

In [None]:
train_set1 = df_train1[['premise','hypothesis']].values.tolist()
test_set1 = test_data[['premise','hypothesis']].values.tolist()

In [None]:
encoded_train1 = tokenizer.batch_encode_plus(train_set1, pad_to_max_length=True, max_length=30)
encoded_test1 = tokenizer.batch_encode_plus(test_set1, pad_to_max_length=True, max_length=30)

In [None]:
# using the second fold(1):

X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train1['input_ids'], df_train1.label.values, test_size=0.3)

x_test = encoded_test1['input_ids']

In [None]:
train_df1 = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df1 = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df1 = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#At epochs=3
model_3_1 = model.fit(train_df1,steps_per_epoch=step,validation_data=valid_df1,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(model_3_1.history['val_accuracy'])))
print("validation loss {}".format(np.mean(model_3_1.history['val_loss'])))
print("accuracy {}".format(np.mean(model_3_1.history['accuracy'])))
print("loss {}".format(np.mean(model_3_1.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(model_3_1.history['val_accuracy'])))
print("validation loss {}".format(np.std(model_3_1.history['val_loss'])))
print("accuracy {}".format(np.std(model_3_1.history['accuracy'])))
print("loss {}".format(np.std(model_3_1.history['loss'])))

**Now we will run the next fold [2] with epochs = 3:**

In [None]:
df_train2 = train_df_list[2]

In [None]:
train_set2 = df_train2[['premise','hypothesis']].values.tolist()
test_set2 = test_data[['premise','hypothesis']].values.tolist()

In [None]:
encoded_train2 = tokenizer.batch_encode_plus(train_set2, pad_to_max_length=True, max_length=30)
encoded_test2 = tokenizer.batch_encode_plus(test_set2, pad_to_max_length=True, max_length=30)

In [None]:
# using the third fold(2):

X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train2['input_ids'], df_train2.label.values, test_size=0.3)

x_test = encoded_test2['input_ids']

In [None]:
train_df2 = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df2 = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df2 = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#At epochs=3
model_3_2 = model.fit(train_df2,steps_per_epoch=step,validation_data=valid_df2,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(model_3_2.history['val_accuracy'])))
print("validation loss {}".format(np.mean(model_3_2.history['val_loss'])))
print("accuracy {}".format(np.mean(model_3_2.history['accuracy'])))
print("loss {}".format(np.mean(model_3_2.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(model_3_2.history['val_accuracy'])))
print("validation loss {}".format(np.std(model_3_2.history['val_loss'])))
print("accuracy {}".format(np.std(model_3_2.history['accuracy'])))
print("loss {}".format(np.std(model_3_2.history['loss'])))

 Model_name at epoch=3                   |  n-fold      |  mean_loss    | mean_val_loss  | mean_accuracy | mean_val_accuracy |
 ----------------------------------------|--------------|---------------|----------------|---------------|-------------------|
 XLM-RoBERTa Model: (Large)              |   1          | 0.95          | 0.87           |  0.53         |  0.60             |
 XLM-RoBERTa Model: (Large)              |   2          | 0.36          | 0.82           |  0.86         |   0.78            |
 XLM-RoBERTa Model: (Large)              |   3          | 0.22          | 0.49           |  0.92         |  0.87             |
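
The mean and standard-deviation cells repeated for every fold can be collected into one helper. The following is a minimal sketch, assuming a Keras-style history dictionary such as `model_3_2.history`; the name `summarize_history` and the example values are hypothetical and are not results from this notebook. `pstdev` is used because `np.std` computes the population standard deviation by default.

```python
from statistics import mean, pstdev

def summarize_history(history):
    """Return {metric: (mean, std)} over epochs for a Keras-style history dict."""
    return {name: (mean(values), pstdev(values)) for name, values in history.items()}

# Hypothetical per-epoch values, for illustration only.
example_history = {
    "loss": [0.9, 0.6, 0.3],
    "val_loss": [0.8, 0.7, 0.6],
    "accuracy": [0.5, 0.7, 0.9],
    "val_accuracy": [0.6, 0.7, 0.8],
}

for name, (avg, std) in summarize_history(example_history).items():
    print(f"{name}: mean={avg:.3f} std={std:.3f}")
```

In the notebook this would be called as `summarize_history(model_3_2.history)`. Note that a mean over epochs blends early and late epochs, so the value after the final epoch is usually the more direct measure of the trained model.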

# **Visualizing the model at epoch = 3 for better understanding:**

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(121)
plt.plot(model_3.history['loss'], label = '0 fold loss')
plt.plot(model_3.history['val_loss'], label = '0 fold val_loss')
plt.plot(model_3_1.history['loss'], label = '1 fold loss')
plt.plot(model_3_1.history['val_loss'], label = '1 fold val_loss')
plt.plot(model_3_2.history['loss'], label = '2 fold loss')
plt.plot(model_3_2.history['val_loss'], label = '2 fold val_loss')
plt.title("Curve at epoch=3")
plt.legend()

plt.subplot(122)
plt.plot(model_3.history['accuracy'], label = '0 fold accuracy')
plt.plot(model_3.history['val_accuracy'], label = '0 fold val_accuracy')
plt.plot(model_3_1.history['accuracy'], label = '1 fold accuracy')
plt.plot(model_3_1.history['val_accuracy'], label = '1 fold val_accuracy')
plt.plot(model_3_2.history['accuracy'], label = '2 fold accuracy')
plt.plot(model_3_2.history['val_accuracy'], label = '2 fold val_accuracy')
plt.title("Curve at epoch=3")
plt.legend()

**OBSERVATION:**

* In the plots above, the model fits well at fold(0) with epoch = 3.

* Hence, we will use these values for the XLM-RoBERTa Large model.

# **XLM-RoBERTa Model: (BASE)**

In [None]:
model1=model_defination(strategy,"jplu/tf-xlm-roberta-base")

In [None]:
model1.summary()

**Tokenizing: With (xlm-roberta-base)**

In [None]:
tokenizer1 = AutoTokenizer.from_pretrained('jplu/tf-xlm-roberta-base')

* **For n-fold(0):**

In [None]:
encoded_train = tokenizer1.batch_encode_plus(train_set, pad_to_max_length=True, max_length=30)
encoded_test = tokenizer1.batch_encode_plus(test_set, pad_to_max_length=True, max_length=30)

In [None]:
# using the first fold(0):

X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train['input_ids'], df_train.label.values, test_size=0.3)

x_test = encoded_test['input_ids']

In [None]:
train_df = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#At epochs=3
model_3_b = model1.fit(train_df,steps_per_epoch=step,validation_data=valid_df,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(model_3_b.history['val_accuracy'])))
print("validation loss {}".format(np.mean(model_3_b.history['val_loss'])))
print("accuracy {}".format(np.mean(model_3_b.history['accuracy'])))
print("loss {}".format(np.mean(model_3_b.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(model_3_b.history['val_accuracy'])))
print("validation loss {}".format(np.std(model_3_b.history['val_loss'])))
print("accuracy {}".format(np.std(model_3_b.history['accuracy'])))
print("loss {}".format(np.std(model_3_b.history['loss'])))

In [None]:
#Epochs=5
model_3_b5 = model1.fit(train_df,steps_per_epoch=step,validation_data=valid_df,epochs=5)

In [None]:
print("validation accuracy {}".format(np.mean(model_3_b5.history['val_accuracy'])))
print("validation loss {}".format(np.mean(model_3_b5.history['val_loss'])))
print("accuracy {}".format(np.mean(model_3_b5.history['accuracy'])))
print("loss {}".format(np.mean(model_3_b5.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(model_3_b5.history['val_accuracy'])))
print("validation loss {}".format(np.std(model_3_b5.history['val_loss'])))
print("accuracy {}".format(np.std(model_3_b5.history['accuracy'])))
print("loss {}".format(np.std(model_3_b5.history['loss'])))

**OBSERVATIONS:**

* Comparing the end-of-epoch metrics of the 3-epoch and 5-epoch runs, we can see the following:

    * At epoch = 3, loss and val_loss have both decreased, and accuracy and val_accuracy have both increased.
    * At epoch = 5, loss keeps decreasing while val_loss increases, and accuracy keeps increasing while val_accuracy has plateaued.

A falling training loss combined with a rising validation loss indicates overfitting, so we do not train beyond epoch = 5.

Hence, we take 3 as the epoch value.
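
The comparison between the two runs can also be made mechanically from the per-epoch validation losses. The following is a minimal sketch, assuming the losses are available as a plain list (for example `history.history['val_loss']`); the function names and numbers are hypothetical and are not results from this notebook.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

def epochs_since_improvement(val_losses):
    """Count the trailing epochs that did not beat the best validation loss."""
    return len(val_losses) - best_epoch(val_losses)

# Hypothetical validation losses: improving for three epochs, then rising.
val_losses = [0.95, 0.80, 0.72, 0.75, 0.79]
print(best_epoch(val_losses))                # 3
print(epochs_since_improvement(val_losses))  # 2
```

Keras also provides the `tf.keras.callbacks.EarlyStopping` callback, which applies the same idea during training by monitoring `val_loss` and stopping after a chosen number of epochs without improvement.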

* **For n-fold(1):**

In [None]:
encoded_train1 = tokenizer1.batch_encode_plus(train_set1, pad_to_max_length=True, max_length=30)
encoded_test1 = tokenizer1.batch_encode_plus(test_set1, pad_to_max_length=True, max_length=30)

In [None]:
# Using fold(1):

X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train1['input_ids'], df_train1.label.values, test_size=0.3)

x_test = encoded_test1['input_ids']

In [None]:
train_df1 = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df1 = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df1 = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#At epochs=3
model_3_b1 = model1.fit(train_df1,steps_per_epoch=step,validation_data=valid_df1,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(model_3_b1.history['val_accuracy'])))
print("validation loss {}".format(np.mean(model_3_b1.history['val_loss'])))
print("accuracy {}".format(np.mean(model_3_b1.history['accuracy'])))
print("loss {}".format(np.mean(model_3_b1.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(model_3_b1.history['val_accuracy'])))
print("validation loss {}".format(np.std(model_3_b1.history['val_loss'])))
print("accuracy {}".format(np.std(model_3_b1.history['accuracy'])))
print("loss {}".format(np.std(model_3_b1.history['loss'])))

* **For n-fold(2):**

In [None]:
encoded_train2 = tokenizer1.batch_encode_plus(train_set2, pad_to_max_length=True, max_length=30)
encoded_test2 = tokenizer1.batch_encode_plus(test_set2, pad_to_max_length=True, max_length=30)

In [None]:
# Using fold(2):

X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train2['input_ids'], df_train2.label.values, test_size=0.3)

x_test = encoded_test2['input_ids']

In [None]:
train_df2 = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df2 = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df2 = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#At epochs=3
model_3_b2 = model1.fit(train_df2,steps_per_epoch=step,validation_data=valid_df2,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(model_3_b2.history['val_accuracy'])))
print("validation loss {}".format(np.mean(model_3_b2.history['val_loss'])))
print("accuracy {}".format(np.mean(model_3_b2.history['accuracy'])))
print("loss {}".format(np.mean(model_3_b2.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(model_3_b2.history['val_accuracy'])))
print("validation loss {}".format(np.std(model_3_b2.history['val_loss'])))
print("accuracy {}".format(np.std(model_3_b2.history['accuracy'])))
print("loss {}".format(np.std(model_3_b2.history['loss'])))

# **Visualizing the model at epoch = 3 for better understanding:**

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(121)
plt.plot(model_3_b.history['loss'], label = '0 fold loss')
plt.plot(model_3_b.history['val_loss'], label = '0 fold val_loss')
plt.plot(model_3_b1.history['loss'], label = '1 fold loss')
plt.plot(model_3_b1.history['val_loss'], label = '1 fold val_loss')
plt.plot(model_3_b2.history['loss'], label = '2 fold loss')
plt.plot(model_3_b2.history['val_loss'], label = '2 fold val_loss')
plt.title("Curve at epoch=3")
plt.legend()

plt.subplot(122)
plt.plot(model_3_b.history['accuracy'], label = '0 fold accuracy')
plt.plot(model_3_b.history['val_accuracy'], label = '0 fold val_accuracy')
plt.plot(model_3_b1.history['accuracy'], label = '1 fold accuracy')
plt.plot(model_3_b1.history['val_accuracy'], label = '1 fold val_accuracy')
plt.plot(model_3_b2.history['accuracy'], label = '2 fold accuracy')
plt.plot(model_3_b2.history['val_accuracy'], label = '2 fold val_accuracy')
plt.title("Curve at epoch=3")
plt.legend()

**OBSERVATIONS:**

* In the plots above, the model fits well at fold(0) with epoch = 3.

* Hence, we will use these values for the XLM-RoBERTa Base model.

 Model_name at epoch=3                   |  n-fold      |  mean_loss    | mean_val_loss  | mean_accuracy | mean_val_accuracy |
 ----------------------------------------|--------------|---------------|----------------|---------------|-------------------|
 XLM-RoBERTa Model: (Base)               |   1          | 0.99          | 0.99           |  0.47         |  0.50             |
 XLM-RoBERTa Model: (Base)               |   2          | 0.57          | 0.84           |  0.78         |  0.70             |
 XLM-RoBERTa Model: (Base)               |   3          | 0.35          | 0.61           |  0.87         |  0.82             |
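
When reading per-fold numbers, keep in mind that the cells above call `fit` on the same model object for every fold, and Keras continues training from the current weights on each call. Later folds may therefore partly reflect accumulated training rather than differences between folds. The following is a minimal sketch of a loop that builds a fresh model for each fold. It assumes a hypothetical `build_model` factory (in this notebook it could wrap `model_defination(strategy, ...)`) and uses stand-in functions so that it runs without TensorFlow.

```python
import random

def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle indices once, then yield (train_idx, valid_idx) for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        valid_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, valid_idx

def run_folds(n_samples, n_folds, build_model, train_and_score):
    """Train a fresh model on every fold and collect one score per fold."""
    scores = []
    for train_idx, valid_idx in kfold_indices(n_samples, n_folds):
        model = build_model()  # new weights for every fold
        scores.append(train_and_score(model, train_idx, valid_idx))
    return scores

# Stand-ins (hypothetical) that only track which rows a model has seen.
def build_model():
    return {"seen": set()}

def train_and_score(model, train_idx, valid_idx):
    model["seen"].update(train_idx)
    # Number of validation rows already seen in training; 0 means no leakage.
    return len(model["seen"].intersection(valid_idx))

print(run_folds(10, 3, build_model, train_and_score))  # [0, 0, 0]
```

Because every fold starts from a new model, no validation row has been seen during training, which keeps the fold scores comparable.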

# **For Model: xlm-roberta-base, let us experiment with XLMRobertaTokenizer for Tokenizing:**

* Since we found that the model fits well at epoch = 3 and n-fold(0), we will use these values for the experiment with the XLMRobertaTokenizer.

In [None]:
tokenizer3 = XLMRobertaTokenizer.from_pretrained('jplu/tf-xlm-roberta-base')

In [None]:
encoded_train = tokenizer3.batch_encode_plus(train_set, pad_to_max_length=True, max_length=30)
encoded_test = tokenizer3.batch_encode_plus(test_set, pad_to_max_length=True, max_length=30)

In [None]:
# using the first fold(0):

X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train['input_ids'], df_train.label.values, test_size=0.3)

x_test = encoded_test['input_ids']

In [None]:
train_df_x = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df_x = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df_x = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#At epochs=3
model_3_b_x = model1.fit(train_df_x,steps_per_epoch=step,validation_data=valid_df_x,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(model_3_b_x.history['val_accuracy'])))
print("validation loss {}".format(np.mean(model_3_b_x.history['val_loss'])))
print("accuracy {}".format(np.mean(model_3_b_x.history['accuracy'])))
print("loss {}".format(np.mean(model_3_b_x.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(model_3_b_x.history['val_accuracy'])))
print("validation loss {}".format(np.std(model_3_b_x.history['val_loss'])))
print("accuracy {}".format(np.std(model_3_b_x.history['accuracy'])))
print("loss {}".format(np.std(model_3_b_x.history['loss'])))

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(121)
plt.plot(model_3_b_x.history['loss'], label = '0 fold loss')
plt.plot(model_3_b_x.history['val_loss'], label = '0 fold val_loss')
plt.title("Curve at epoch=3")
plt.legend()

plt.subplot(122)
plt.plot(model_3_b_x.history['accuracy'], label = '0 fold accuracy')
plt.plot(model_3_b_x.history['val_accuracy'], label = '0 fold val_accuracy')
plt.title("Curve at epoch=3")
plt.legend()

**OBSERVATIONS:**

* For the xlm-roberta-base model, switching to the XLMRobertaTokenizer did not improve on the previous runs.

* xlm-roberta-base with the XLMRobertaTokenizer at epoch = 3 performed the worst among the XLM-RoBERTa models.



 Model_name at epoch=3                   |  n-fold      |  mean_loss    | mean_val_loss  | mean_accuracy | mean_val_accuracy |
 ----------------------------------------|--------------|---------------|----------------|---------------|-------------------|
 XLM-RoBERTa Model: (Large)              |   1          | 1.03          | 0.97           |  0.48         |  0.52             |
 XLM-RoBERTa Model: (Large)              |   2          | 0.17          | 0.43           |  0.93         |  0.89             |
 XLM-RoBERTa Model: (Large)              |   3          | 0.17          | 0.38           |  0.94         |  0.90             |
 
 
  Model_name at epoch=3                  |  n-fold      |  mean_loss    | mean_val_loss  | mean_accuracy | mean_val_accuracy |
 ----------------------------------------|--------------|---------------|----------------|---------------|-------------------|
 XLM-RoBERTa Model: (Base)               |   1          | 0.99          | 0.99           |  0.47         |  0.50             |
 XLM-RoBERTa Model: (Base)               |   2          | 0.57          | 0.84           |  0.78         |  0.70             |
 XLM-RoBERTa Model: (Base)               |   3          | 0.35          | 0.61           |  0.87         |  0.82             |
 


* Since epoch = 3 and n-fold(0) gave the best fit for both XLM-RoBERTa models (Base and Large), we carry forward the XLM-RoBERTa Base model at the first fold, n-fold(0).

# **Implementing Model: distilbert-base-multilingual-cased**

**Overview:**

The DistilBERT model was proposed in the blog post "Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT" and the paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.


* We reuse the model definition from above.

* We run it on the train and test data for each of the 3 folds.

In [None]:
model_b=model_defination(strategy,"distilbert-base-multilingual-cased")

In [None]:
model_b.summary()

In [None]:
tokenizerb = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')

* **At the first n-fold(0):**

In [None]:
encoded_trainb = tokenizerb.batch_encode_plus(train_set, pad_to_max_length=True, max_length=30)
encoded_testb = tokenizerb.batch_encode_plus(test_set, pad_to_max_length=True, max_length=30)

In [None]:
X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_trainb['input_ids'], df_train.label.values, test_size=0.3)

x_test = encoded_testb['input_ids']

In [None]:
train_dfb = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_dfb = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_dfb = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#Epochs=3
modelb = model_b.fit(train_dfb,steps_per_epoch=step,validation_data=valid_dfb,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(modelb.history['val_accuracy'])))
print("validation loss {}".format(np.mean(modelb.history['val_loss'])))
print("accuracy {}".format(np.mean(modelb.history['accuracy'])))
print("loss {}".format(np.mean(modelb.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(modelb.history['val_accuracy'])))
print("validation loss {}".format(np.std(modelb.history['val_loss'])))
print("accuracy {}".format(np.std(modelb.history['accuracy'])))
print("loss {}".format(np.std(modelb.history['loss'])))

In [None]:
#Epochs=6 (continues training from the weights reached in the 3-epoch run above)
modelb_6 = model_b.fit(train_dfb,steps_per_epoch=step,validation_data=valid_dfb,epochs=6)

# **Visualization:**

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(121)
plt.plot(modelb.history['loss'], label = 'Train_data Loss')
plt.plot(modelb.history['val_loss'], label = 'Validation_data loss')
plt.title("Curve at epoch=3")
plt.legend()


plt.subplot(122)
plt.plot(modelb.history['accuracy'], label = 'Train_data accuracy')
plt.plot(modelb.history['val_accuracy'], label = 'Validation_data accuracy')
plt.title("Curve at epoch=3")
plt.legend()

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(121)
plt.plot(modelb_6.history['loss'], label = 'Train_data Loss')
plt.plot(modelb_6.history['val_loss'], label = 'Validation_data loss')
plt.title("Curve at epoch=6")
plt.legend()

plt.subplot(122)
plt.plot(modelb_6.history['accuracy'], label = 'Train_data accuracy')
plt.plot(modelb_6.history['val_accuracy'], label = 'Validation_data accuracy')
plt.title("Curve at epoch=6")
plt.legend()

**OBSERVATION:**

* At epoch = 6 the model fits poorly.

* Hence, we use epoch = 3 and will experiment further with different parameters below.

* **At the second n-fold(1):**

In [None]:
encoded_train1 = tokenizerb.batch_encode_plus(train_set1, pad_to_max_length=True, max_length=30)
encoded_test1 = tokenizerb.batch_encode_plus(test_set1, pad_to_max_length=True, max_length=30)

In [None]:
# Using fold(1):

X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train1['input_ids'], df_train1.label.values, test_size=0.3)

x_test = encoded_test1['input_ids']

In [None]:
train_df1 = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df1 = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df1 = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#Epochs=3
modelb_1 = model_b.fit(train_df1,steps_per_epoch=step,validation_data=valid_df1,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(modelb_1.history['val_accuracy'])))
print("validation loss {}".format(np.mean(modelb_1.history['val_loss'])))
print("accuracy {}".format(np.mean(modelb_1.history['accuracy'])))
print("loss {}".format(np.mean(modelb_1.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(modelb_1.history['val_accuracy'])))
print("validation loss {}".format(np.std(modelb_1.history['val_loss'])))
print("accuracy {}".format(np.std(modelb_1.history['accuracy'])))
print("loss {}".format(np.std(modelb_1.history['loss'])))

* **At the third n-fold(2):**

In [None]:
encoded_train2 = tokenizerb.batch_encode_plus(train_set2, pad_to_max_length=True, max_length=30)
encoded_test2 = tokenizerb.batch_encode_plus(test_set2, pad_to_max_length=True, max_length=30)

In [None]:
# Using fold(2):

X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train2['input_ids'], df_train2.label.values, test_size=0.3)

x_test = encoded_test2['input_ids']

In [None]:
train_df2 = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df2 = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df2 = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#Epochs=3
modelb_2 = model_b.fit(train_df2,steps_per_epoch=step,validation_data=valid_df2,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(modelb_2.history['val_accuracy'])))
print("validation loss {}".format(np.mean(modelb_2.history['val_loss'])))
print("accuracy {}".format(np.mean(modelb_2.history['accuracy'])))
print("loss {}".format(np.mean(modelb_2.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(modelb_2.history['val_accuracy'])))
print("validation loss {}".format(np.std(modelb_2.history['val_loss'])))
print("accuracy {}".format(np.std(modelb_2.history['accuracy'])))
print("loss {}".format(np.std(modelb_2.history['loss'])))

  Model_name at epoch=3                  |  n-fold      |  mean_loss    | mean_val_loss  | mean_accuracy | mean_val_accuracy |
 ----------------------------------------|--------------|---------------|----------------|---------------|-------------------|
 distilbert-base-multilingual-cased               |   1          | 0.94          | 1.08           |  0.52         |  0.46             |
 distilbert-base-multilingual-cased               |   2          | 0.58          | 0.99           |  0.76         |  0.64             |
 distilbert-base-multilingual-cased               |   3          | 0.34          | 0.66           |  0.88         |  0.80             |

# **Visualization:**

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(121)
plt.plot(modelb.history['loss'], label = '0 fold loss')
plt.plot(modelb.history['val_loss'], label = '0 fold val_loss')
plt.plot(modelb_1.history['loss'], label = '1 fold loss')
plt.plot(modelb_1.history['val_loss'], label = '1 fold val_loss')
plt.plot(modelb_2.history['loss'], label = '2 fold loss')
plt.plot(modelb_2.history['val_loss'], label = '2 fold val_loss')
plt.title("Curve at epoch=3")
plt.legend()

plt.subplot(122)
plt.plot(modelb.history['accuracy'], label = '0 fold accuracy')
plt.plot(modelb.history['val_accuracy'], label = '0 fold val_accuracy')
plt.plot(modelb_1.history['accuracy'], label = '1 fold accuracy')
plt.plot(modelb_1.history['val_accuracy'], label = '1 fold val_accuracy')
plt.plot(modelb_2.history['accuracy'], label = '2 fold accuracy')
plt.plot(modelb_2.history['val_accuracy'], label = '2 fold val_accuracy')
plt.title("Curve at epoch=3")
plt.legend()

**OBSERVATION:**


* The DistilBERT model performs well at epoch = 3 and n-fold(0).

* We will experiment further and fine-tune this model with different values at the same epoch and n-fold value.

# **Implementing Model: "Bert-base-multilingual-cased"**

**Overview:**

* The BERT model was proposed in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It’s a bidirectional transformer pre-trained using a combination of masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.

**The abstract from the paper is the following:**

* We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

* BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).


* BERT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

* BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation.
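
The advice on right padding can be illustrated with a small sketch of what padding to a fixed length does to a list of token ids. This is a simplified illustration and not the tokenizer's actual implementation: the function name `pad_or_truncate`, the pad id of 0, and the token ids are assumptions, and a real tokenizer also keeps the special tokens in place when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a token-id list to max_length, or pad it on the right with pad_id."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    padded = clipped + [pad_id] * n_pad          # padding goes on the right
    attention_mask = [1] * len(clipped) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, attention_mask

# Hypothetical token ids; real ids come from the tokenizer.
ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

With right padding the real tokens keep positions 0, 1, 2, ..., which matches the positions seen during pre-training. The attention mask marks which positions are padding; the cells in this notebook pass only `input_ids` to the model, so passing the mask as well is a possible refinement rather than something the notebook does.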

In [None]:
model_bert=model_defination(strategy,"bert-base-multilingual-cased")

In [None]:
model_bert.summary()

In [None]:
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

* **At the first n-fold(0):**

In [None]:
encoded_train_bert = tokenizer_bert.batch_encode_plus(train_set, pad_to_max_length=True, max_length=30)
encoded_testb_bert = tokenizer_bert.batch_encode_plus(test_set, pad_to_max_length=True, max_length=30)

In [None]:
X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train_bert['input_ids'], df_train.label.values, test_size=0.3)

x_test = encoded_testb_bert['input_ids']

In [None]:
train_df_bert = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df_bert = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df_bert = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#Epochs=2
modelbert = model_bert.fit(train_df_bert,steps_per_epoch=step,validation_data=valid_df_bert,epochs=2)

In [None]:
print("validation accuracy {}".format(np.mean(modelbert.history['val_accuracy'])))
print("validation loss {}".format(np.mean(modelbert.history['val_loss'])))
print("accuracy {}".format(np.mean(modelbert.history['accuracy'])))
print("loss {}".format(np.mean(modelbert.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(modelbert.history['val_accuracy'])))
print("validation loss {}".format(np.std(modelbert.history['val_loss'])))
print("accuracy {}".format(np.std(modelbert.history['accuracy'])))
print("loss {}".format(np.std(modelbert.history['loss'])))

In [None]:
#Epochs=3
modelbert_3 = model_bert.fit(train_df_bert,steps_per_epoch=step,validation_data=valid_df_bert,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(modelbert_3.history['val_accuracy'])))
print("validation loss {}".format(np.mean(modelbert_3.history['val_loss'])))
print("accuracy {}".format(np.mean(modelbert_3.history['accuracy'])))
print("loss {}".format(np.mean(modelbert_3.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(modelbert_3.history['val_accuracy'])))
print("validation loss {}".format(np.std(modelbert_3.history['val_loss'])))
print("accuracy {}".format(np.std(modelbert_3.history['accuracy'])))
print("loss {}".format(np.std(modelbert_3.history['loss'])))

# **Visualization:**

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(121)
plt.plot(modelbert.history['loss'], label = 'Train_data Loss')
plt.plot(modelbert.history['val_loss'], label = 'Validation_data loss')
plt.title("Curve at epoch=2")
plt.legend()

plt.subplot(122)
plt.plot(modelbert.history['accuracy'], label = 'Train_data accuracy')
plt.plot(modelbert.history['val_accuracy'], label = 'Validation_data accuracy')
plt.title("Curve at epoch=2")
plt.legend()

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(121)
plt.plot(modelbert_3.history['loss'], label = 'Train_data Loss')
plt.plot(modelbert_3.history['val_loss'], label = 'Validation_data loss')
plt.title("Curve at epoch=3")
plt.legend()

plt.subplot(122)
plt.plot(modelbert_3.history['accuracy'], label = 'Train_data accuracy')
plt.plot(modelbert_3.history['val_accuracy'], label = 'Validation_data accuracy')
plt.title("Curve at epoch=3")
plt.legend()

**OBSERVATION:**


* The bert-base-multilingual-cased model performs well at epoch = 2.

* **At the second n-fold(1):**

In [None]:
encoded_train_bert1 = tokenizer_bert.batch_encode_plus(train_set1, pad_to_max_length=True, max_length=30)
encoded_testb_bert1 = tokenizer_bert.batch_encode_plus(test_set1, pad_to_max_length=True, max_length=30)

In [None]:
X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train_bert1['input_ids'], df_train1.label.values, test_size=0.3)

x_test = encoded_testb_bert1['input_ids']

In [None]:
train_df_bert1 = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df_bert1 = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df_bert1 = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#Epochs=2
modelbert1 = model_bert.fit(train_df_bert1,steps_per_epoch=step,validation_data=valid_df_bert1,epochs=2)

In [None]:
print("validation accuracy {}".format(np.mean(modelbert1.history['val_accuracy'])))
print("validation loss {}".format(np.mean(modelbert1.history['val_loss'])))
print("accuracy {}".format(np.mean(modelbert1.history['accuracy'])))
print("loss {}".format(np.mean(modelbert1.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(modelbert1.history['val_accuracy'])))
print("validation loss {}".format(np.std(modelbert1.history['val_loss'])))
print("accuracy {}".format(np.std(modelbert1.history['accuracy'])))
print("loss {}".format(np.std(modelbert1.history['loss'])))

* **At the third n-fold(2):**

In [None]:
encoded_train_bert2 = tokenizer_bert.batch_encode_plus(train_set2, pad_to_max_length=True, max_length=30)
encoded_testb_bert2 = tokenizer_bert.batch_encode_plus(test_set2, pad_to_max_length=True, max_length=30)

In [None]:
X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train_bert2['input_ids'], df_train2.label.values, test_size=0.3)

x_test = encoded_testb_bert2['input_ids']

In [None]:
train_df_bert2 = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df_bert2 = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df_bert2 = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#Epochs=2
modelbert2 = model_bert.fit(train_df_bert2,steps_per_epoch=step,validation_data=valid_df_bert2,epochs=2)

In [None]:
print("validation accuracy {}".format(np.mean(modelbert2.history['val_accuracy'])))
print("validation loss {}".format(np.mean(modelbert2.history['val_loss'])))
print("accuracy {}".format(np.mean(modelbert2.history['accuracy'])))
print("loss {}".format(np.mean(modelbert2.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(modelbert2.history['val_accuracy'])))
print("validation loss {}".format(np.std(modelbert2.history['val_loss'])))
print("accuracy {}".format(np.std(modelbert2.history['accuracy'])))
print("loss {}".format(np.std(modelbert2.history['loss'])))

  Model_name at epoch=3                  |  n-fold      |  mean_loss    | mean_val_loss  | mean_accuracy | mean_val_accuracy |
 ----------------------------------------|--------------|---------------|----------------|---------------|-------------------|
 Bert-base-multilingual-cased               |   1          | 0.41          | 1.69           |  0.83         |  0.48             |
 Bert-base-multilingual-cased               |   2          | 0.68          | 0.85           |  0.72         |  0.67             |
 Bert-base-multilingual-cased               |   3          | 0.46          | 0.66           |  0.82         |  0.77             |
 

# **Visualization:**

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(121)
plt.plot(modelbert.history['loss'], label = '0 fold loss')
plt.plot(modelbert.history['val_loss'], label = '0 fold val_loss')
plt.plot(modelbert1.history['loss'], label = '1 fold loss')
plt.plot(modelbert1.history['val_loss'], label = '1 fold val_loss')
plt.plot(modelbert2.history['loss'], label = '2 fold loss')
plt.plot(modelbert2.history['val_loss'], label = '2 fold val_loss')
plt.title("Curve at epoch=2")
plt.legend()

plt.subplot(122)
plt.plot(modelbert.history['accuracy'], label = '0 fold accuracy')
plt.plot(modelbert.history['val_accuracy'], label = '0 fold val_accuracy')
plt.plot(modelbert1.history['accuracy'], label = '1 fold accuracy')
plt.plot(modelbert1.history['val_accuracy'], label = '1 fold val_accuracy')
plt.plot(modelbert2.history['accuracy'], label = '2 fold accuracy')
plt.plot(modelbert2.history['val_accuracy'], label = '2 fold val_accuracy')
plt.title("Curve at epoch=2")
plt.legend()

**OBSERVATIONS:**

* The BERT model performs well at epoch = 2 and at the first n-fold(0).

* We will fine-tune the parameters further and run the experiments at the first n-fold(0) for all of the models below:

    1. XLM-RoBERTa BASE
    2. DistilBERT
    3. BERT

# **5. Fine-tune the Models**

**1. XLM-RoBERTa BASE:**

**at epoch = 3 and at the first fold (n-fold[0]):**

* Build the model with a different Dense layer, activation, and loss function.

* Change the MAX_LEN value to 80, which changes the input_layer shape.
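
When varying the output layer and loss, it helps to recall that the labels take three values (0, 1, 2). A three-way head is usually expressed in Keras as `Dense(3, activation='softmax')` with `loss='sparse_categorical_crossentropy'`, whereas a single sigmoid unit can only separate two outcomes. The following is a minimal pure-Python sketch of what such a softmax head computes for one example; the logits are hypothetical and are not model outputs from this notebook.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, probs):
    """Negative log-probability assigned to the true integer label."""
    return -math.log(probs[label])

# Hypothetical logits for one pair: (entailment, neutral, contradiction).
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
predicted = max(range(3), key=lambda i: probs[i])
print(predicted)  # 0, i.e. entailment

# The loss is smaller when the true label is the class the model favours.
print(sparse_categorical_crossentropy(0, probs) < sparse_categorical_crossentropy(2, probs))  # True
```

The predicted class is the index of the largest probability, which maps directly onto the 0, 1, 2 values required for the submission.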

In [None]:
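def model_defination(strategy,transformer):
    with strategy.scope():
        # Pre-trained transformer encoder, loaded by name
        encoder = TFAutoModel.from_pretrained(transformer)
        # Token ids padded or truncated to MAX_LEN = 80
        input_layer = Input(shape=(80,), dtype=tf.int32, name="input_layer")
        sequence_output = encoder(input_layer)[0]
        # Hidden state of the first token, used as the sentence-pair summary
        cls_token = sequence_output[:, 0, :]
        # Labels are 0/1/2, so use a 3-way softmax head; a 1-unit sigmoid with
        # binary cross-entropy is only valid for 0/1 labels
        output_layer = Dense(3, activation='softmax')(cls_token)
        model = Model(inputs=input_layer, outputs=output_layer)
        model.compile(
            Adam(learning_rate=1e-5),
            loss='sparse_categorical_crossentropy',
            metrics=['accuracy']
        )
        return model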
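
Because the labels take three values (0, 1, 2), the choice of output layer and loss decides whether the reported loss is meaningful. The sketch below is a plain-Python illustration and not part of the training pipeline; the function names and the example probabilities are hypothetical. It compares sparse categorical cross-entropy over a 3-way softmax with the binary cross-entropy formula applied to a label of 2, which can become negative.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample: -log of the probability assigned to the true class."""
    return -math.log(probs[label])

def binary_crossentropy(p, label):
    """Binary cross-entropy formula; only meaningful when label is 0 or 1."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Hypothetical softmax output over (entailment, neutral, contradiction).
probs = [0.2, 0.3, 0.5]
loss_multiclass = sparse_categorical_crossentropy(probs, label=2)

# Hypothetical sigmoid output scored against the out-of-range label 2.
loss_binary = binary_crossentropy(0.9, label=2)

print(f"3-class loss: {loss_multiclass:.3f}")
print(f"binary loss with label 2: {loss_binary:.3f}")
```

The 3-class loss is always non-negative, while the binary formula produces a negative value here because the label lies outside {0, 1}.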

In [None]:
model1=model_defination(strategy,"jplu/tf-xlm-roberta-base")

In [None]:
model1.summary()

In [None]:
tokenizer1 = AutoTokenizer.from_pretrained('jplu/tf-xlm-roberta-base')

In [None]:
encoded_train = tokenizer1.batch_encode_plus(train_set, pad_to_max_length=True, max_length=80)
encoded_test = tokenizer1.batch_encode_plus(test_set, pad_to_max_length=True, max_length=80)
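
To show why every encoded example ends up with exactly 80 ids, here is a minimal sketch of padding and truncation in plain Python. The helper `pad_or_truncate` and `pad_id=0` are assumptions for illustration; the real tokenizer uses its own padding id and keeps the special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return token_ids cut or right-padded to exactly max_length entries."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

# A short sequence is padded, a long one is cut (hypothetical ids).
short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate(list(range(10)), max_length=5)

print(short)  # [101, 7592, 102, 0, 0]
print(long)   # [0, 1, 2, 3, 4]
```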

In [None]:
# using the first fold(0):

X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train['input_ids'], df_train.label.values, test_size=0.3)

x_test = encoded_test['input_ids']

In [None]:
train_df = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))
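
Since the training dataset calls repeat(), it never ends on its own, so fit() needs steps_per_epoch to know where an epoch stops. The value of step is not defined in this section; a common choice, assumed here and not confirmed by this notebook, is the number of training samples divided by the batch size. The sizes below are hypothetical.

```python
def steps_per_epoch(num_train_samples, batch_size):
    """Number of full batches needed to pass over the training set about once."""
    return max(1, num_train_samples // batch_size)

# Hypothetical sizes for illustration only.
step = steps_per_epoch(num_train_samples=8000, batch_size=128)
print(step)  # 62
```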

In [None]:
#At epochs=3
model_x = model1.fit(train_df,steps_per_epoch=step,validation_data=valid_df,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(model_x.history['val_accuracy'])))
print("validation loss {}".format(np.mean(model_x.history['val_loss'])))
print("accuracy {}".format(np.mean(model_x.history['accuracy'])))
print("loss {}".format(np.mean(model_x.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(model_x.history['val_accuracy'])))
print("validation loss {}".format(np.std(model_x.history['val_loss'])))
print("accuracy {}".format(np.std(model_x.history['accuracy'])))
print("loss {}".format(np.std(model_x.history['loss'])))
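
The mean and standard deviation cells are repeated for every model, so they could be folded into one helper. This is only a sketch: `summarize_history` is a hypothetical name, and the example dictionary holds made-up values standing in for the history dictionary returned by fit(). `statistics.pstdev` is used because it is the population standard deviation, which is also what np.std computes by default.

```python
from statistics import mean, pstdev

def summarize_history(history):
    """Map each metric name to (mean, population std) over the epochs."""
    return {name: (mean(values), pstdev(values)) for name, values in history.items()}

# Hypothetical per-epoch values, not results from this notebook.
example_history = {
    "loss": [1.0, 0.8, 0.6],
    "val_loss": [0.9, 0.9, 0.9],
}

summary = summarize_history(example_history)
for name, (avg, std) in summary.items():
    print(f"{name}: mean={avg:.3f} std={std:.3f}")
```

With the objects in this notebook, the call would be summarize_history(model_x.history).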

**2. distilbert-base-multilingual-cased:**

*at epoch = 3 and at the first fold (n-fold[0]):*

In [None]:
model_d=model_defination(strategy,"distilbert-base-multilingual-cased")

In [None]:
model_d.summary()

In [None]:
tokenizer_d = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')

In [None]:
encoded_train = tokenizer_d.batch_encode_plus(train_set, pad_to_max_length=True, max_length=80)
encoded_test = tokenizer_d.batch_encode_plus(test_set, pad_to_max_length=True, max_length=80)

In [None]:
# using the first fold(0):

X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train['input_ids'], df_train.label.values, test_size=0.3)

x_test = encoded_test['input_ids']

In [None]:
train_df = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#At epochs=3
model_d = model_d.fit(train_df,steps_per_epoch=step,validation_data=valid_df,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(model_d.history['val_accuracy'])))
print("validation loss {}".format(np.mean(model_d.history['val_loss'])))
print("accuracy {}".format(np.mean(model_d.history['accuracy'])))
print("loss {}".format(np.mean(model_d.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(model_d.history['val_accuracy'])))
print("validation loss {}".format(np.std(model_d.history['val_loss'])))
print("accuracy {}".format(np.std(model_d.history['accuracy'])))
print("loss {}".format(np.std(model_d.history['loss'])))

**3. Bert-base-multilingual-cased:**

*at epoch = 3 and at the first fold (n-fold[0]):*

In [None]:
model_b=model_defination(strategy,"bert-base-multilingual-cased")

In [None]:
model_b.summary()

In [None]:
tokenizer_b = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

In [None]:
encoded_train = tokenizer_b.batch_encode_plus(train_set, pad_to_max_length=True, max_length=80)
encoded_test = tokenizer_b.batch_encode_plus(test_set, pad_to_max_length=True, max_length=80)

In [None]:
# using the first fold(0):

X_train, X_valid, Y_train, Y_valid = train_test_split(encoded_train['input_ids'], df_train.label.values, test_size=0.3)

x_test = encoded_test['input_ids']

In [None]:
train_df = (tf.data.Dataset.from_tensor_slices((X_train, Y_train)).repeat().shuffle(2048).batch(BATCH_SIZE).prefetch(AUTO))

valid_df = (tf.data.Dataset.from_tensor_slices((X_valid, Y_valid)).batch(BATCH_SIZE).cache().prefetch(AUTO))

test_df = (tf.data.Dataset.from_tensor_slices(x_test).batch(BATCH_SIZE))

In [None]:
#At epochs=3
model_b = model_b.fit(train_df,steps_per_epoch=step,validation_data=valid_df,epochs=3)

In [None]:
print("validation accuracy {}".format(np.mean(model_b.history['val_accuracy'])))
print("validation loss {}".format(np.mean(model_b.history['val_loss'])))
print("accuracy {}".format(np.mean(model_b.history['accuracy'])))
print("loss {}".format(np.mean(model_b.history['loss'])))

In [None]:
print("validation accuracy {}".format(np.std(model_b.history['val_accuracy'])))
print("validation loss {}".format(np.std(model_b.history['val_loss'])))
print("accuracy {}".format(np.std(model_b.history['accuracy'])))
print("loss {}".format(np.std(model_b.history['loss'])))

**Comparing the results of the above Models:**

 
 Model_names                             |  epoch value |  loss    | val_loss  | accuracy | val_accuracy |
 ----------------------------------------|--------------|----------|-----------|----------|--------------|
 XLM-RoBERTa Model: (Base)               |   3          |  0.05    | -0.09     |  0.31    |  0.32        |
 distilbert-base-multilingual-cased      |   3          | -2.07    | -1.87     |  0.36    |  0.35        |
 Bert-base-multilingual-cased            |   2          | -3.30    | -3.62     |  0.40    |  0.41        |

In [None]:
plt.figure(figsize=(15, 5))

# Left panel: training and validation loss per epoch for each model
plt.subplot(121)
plt.plot(model_x.history['loss'], label = 'XLM-RoBERTa loss')
plt.plot(model_x.history['val_loss'], label = 'XLM-RoBERTa val_loss')
plt.plot(model_d.history['loss'], label = 'DistilBERT loss')
plt.plot(model_d.history['val_loss'], label = 'DistilBERT val_loss')
plt.plot(model_b.history['loss'], label = 'BERT loss')
plt.plot(model_b.history['val_loss'], label = 'BERT val_loss')
plt.title("Loss curves at epoch=3")
plt.legend()

# Right panel: training and validation accuracy per epoch for each model
plt.subplot(122)
plt.plot(model_x.history['accuracy'], label = 'XLM-RoBERTa accuracy')
plt.plot(model_x.history['val_accuracy'], label = 'XLM-RoBERTa val_accuracy')
plt.plot(model_d.history['accuracy'], label = 'DistilBERT accuracy')
plt.plot(model_d.history['val_accuracy'], label = 'DistilBERT val_accuracy')
plt.plot(model_b.history['accuracy'], label = 'BERT accuracy')
plt.plot(model_b.history['val_accuracy'], label = 'BERT val_accuracy')
plt.title("Accuracy curves at epoch=3")
plt.legend()

**OBSERVATIONS:**

* The best epoch value for the models is 3.

* The models fit on the first fold (n-fold 0) without overfitting.

* The DistilBERT model is performing the best among the three models above.

* Hence, we will use the DistilBERT model for further prediction of the target value: "label".

# **6. Reporting Results**

**Generating the submission file as per the competition:**

* Based on the above, we use the DistilBERT model to predict the target attribute "label" on the test data given in the competition.

* We then generate the submission file to submit the predictions.

In [None]:
# Given test data:

test_data.head(5)

Reading the given Submission File from the competition:

In [None]:
sub = pd.read_csv("../input/contradictory-my-dear-watson/sample_submission.csv")

**Prediction:**

* Making the prediction using model_d (defined for the DistilBERT model) and the test data (test_df), which was prepared above as a TensorFlow dataset.

In [None]:
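# model_d was rebound to the History object returned by fit();
# the trained Keras model is reachable through its .model attribute
test_predict = model_d.model.predict(test_df, verbose=1)
sub['prediction'] = test_predict.argmax(axis=1)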
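
The argmax step turns each row of class scores into a label 0, 1 or 2. The sketch below shows the same operation in plain Python on made-up scores; `predict_labels` and the example rows are hypothetical. Note that argmax over a single output column would always return 0, so the model output needs one column per class.

```python
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

def predict_labels(probabilities):
    """Pick the index of the largest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]

# Hypothetical model outputs: one row of three class probabilities per sample.
example_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]

labels = predict_labels(example_probs)
print(labels)                            # [0, 2, 1]
print([LABEL_NAMES[i] for i in labels])  # ['entailment', 'contradiction', 'neutral']
```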

**The first 10 rows of the generated submission file are shown below:**

In [None]:
sub.head(10)

**Writing the results to a new submission file:**

In [None]:
sub.to_csv('Submission.csv', index=False)

**Displaying the submission results read back from the written file:**

In [None]:
submission = pd.read_csv('Submission.csv')
submission.head()
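
Before uploading, it can help to check that the file has the expected shape. This is a minimal sketch under assumptions: the column names `id` and `prediction` are taken to match the sample submission, `validate_submission` is a hypothetical helper, and the example contents are made up.

```python
import csv
import io

def validate_submission(csv_text):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    # Assumed header; adjust if the sample submission differs.
    if reader.fieldnames != ["id", "prediction"]:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    # Data rows start on line 2, after the header.
    for line_no, row in enumerate(reader, start=2):
        if row["prediction"] not in {"0", "1", "2"}:
            problems.append(f"line {line_no}: invalid prediction {row['prediction']!r}")
    return problems

# Hypothetical file contents; the ids are made up.
good = "id,prediction\nabc123,0\ndef456,2\n"
bad = "id,prediction\nabc123,3\n"

print(validate_submission(good))  # []
print(validate_submission(bad))
```

For the written file, the text could be read with open('Submission.csv').read() and passed to the same function.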

**OBSERVATIONS:**

* The best epoch value for the models is 3.

* The models fit on the first fold (n-fold 0) without overfitting.

* The DistilBERT model performs best among the three fine-tuned models (XLM-RoBERTa, DistilBERT and BERT) on the first fold, which is the fold I used for training.

* These models were considered because the task is to create an NLI model that assigns labels of 0, 1, or 2 (corresponding to entailment, neutral, and contradiction) to pairs of premises and hypotheses, where the hypothesis and premise are in multiple languages.

* The chosen models are pre-trained on more than 70 different languages, which satisfies our data description and the task.

* Based on this, we were able to generate the final submission results using the model selected from the above analysis at different n-folds and epoch values.

**System Limitations:**

* The main limitation of the system is that we depended on the TPU strategy and the TPU utilization quota provided by Kaggle for this competition.

* These pre-trained NLP models are expensive to train, and they need TPU accelerators to run and train quickly.

* The models took longer to run than expected, and the TPU distribution strategy sometimes had to be re-run because of issues that arose when the model could not allocate memory.

**Assumptions:**

* The assumption that the models could be run together on the three n-folds did not work out, and the models did not perform as expected.

* Hence, for further analysis the models were run on each n-fold separately and at different epoch values.
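
For reference, splitting the data into n-folds can be sketched in plain Python as follows. This is an illustration under assumptions, not the split used in this notebook: `k_fold_indices` is a hypothetical helper, the indices are not shuffled, and in practice the folds would usually be shuffled and possibly stratified by label.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(num_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [num_samples // k + (1 if i < num_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size

# Three folds over ten hypothetical samples.
folds = list(k_fold_indices(num_samples=10, k=3))
for fold_no, (train_idx, valid_idx) in enumerate(folds):
    print(f"fold {fold_no}: train={len(train_idx)} valid={valid_idx}")
```

Training one model per fold, as done above, then means calling fit once for each (train_idx, valid_idx) pair.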


# **7. Conclusion**


The expected outcome of the project is to predict whether a given hypothesis is related to its premise by contradiction, entailment, or whether neither of those is true (neutral).
For each sample in the test set, you must predict a 0, 1, or 2 value for the variable.
Those values map to the logical condition as:
0 == entailment
1 == neutral
2 == contradiction

* This project helps me learn and explore the different Feature Engineering techniques that have been taught in the course.

* It also lets me explore the transformations and the scaling techniques applied in a project.

* Apart from that, the skills and knowledge that I get to learn and enhance through this project are:

Natural Language Processing related features:

• The Natural Language Toolkit (NLTK) is a popular library for natural language processing (NLP). NLP is used to develop applications and services that can understand and process human language.

• The knowledge explored under NLP includes using the fastText library for efficient learning of word representations and sentence classification.


TensorFlow related features:

1. Tokenization:
Representing words as tokens that a computer can process, so that a neural network can be trained to understand their meaning.

2. Sequencing:
Representing sentences as sequences of numbers in the correct order, so that a neural network can process them to understand or even generate new text.

3. Word Embeddings:
Mapping the numeric tokens that stand for words to vectors that capture the meaning of the sentences; this is where the embedding feature is used.
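
The three steps above can be sketched on a toy example. Everything here is a simplification for illustration: the helpers are hypothetical, the text is split on whitespace (the transformer tokenizers used in this project work on subword vocabularies), and the embedding table holds made-up numbers rather than learned vectors.

```python
def build_vocab(sentences):
    """Assign an integer id to each distinct lowercase word; 0 is kept for padding."""
    vocab = {"<pad>": 0}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_sequence(sentence, vocab):
    """Tokenization and sequencing: words become ids, kept in sentence order."""
    return [vocab[word] for word in sentence.lower().split()]

sentences = ["the cat sleeps", "the dog sleeps"]
vocab = build_vocab(sentences)
sequences = [to_sequence(s, vocab) for s in sentences]

# Word embeddings: each id selects one row of a small, made-up vector table.
embedding_table = [[float(i), float(i) * 0.5] for i in range(len(vocab))]
embedded = [[embedding_table[i] for i in seq] for seq in sequences]

print(vocab)      # {'<pad>': 0, 'the': 1, 'cat': 2, 'sleeps': 3, 'dog': 4}
print(sequences)  # [[1, 2, 3], [1, 4, 3]]
```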

The areas where I have learned lessons are:

• How to use Keras along with TensorFlow to create data input pipelines for optimization and analysis, and how to use accelerators for projects.

• How to use Bidirectional Encoder Representations from Transformers (BERT).

• The different pre-trained models that are based on BERT.

• How to use the k-fold technique on TPUs to prevent memory issues.

• How to enhance the performance of models by preventing leakage.

# **8. References**

The references and citations of the resources used while developing the project are:

https://huggingface.co/transformers/task_summary.html

https://huggingface.co/transformers/index.html

https://en.wikipedia.org/wiki/TensorFlow

https://www.kaggle.com/pradeepmuniasamy/contradictory-my-dear-watson-everything-you-need

https://www.kaggle.com/vbookshelf/basics-of-bert-and-xlm-roberta-pytorch

https://www.kaggle.com/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert?select=submission.csv

https://www.kaggle.com/mattbast/training-transformers-with-tensorflow-and-tpus

https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
