Hi!

This notebook is based on the [official tutorial](https://www.kaggle.com/anasofiauzsoy/tutorial-notebook) and on [Detecting Contradictions in Multilingual Text](https://www.kaggle.com/sukanyabag/detecting-contradictions-in-multilingual-text).

If you like it, please upvote! ;)

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
os.environ["WANDB_API_KEY"] = "0"  # dummy key to silence the Weights & Biases warning

Let's set up our TPU.

In [None]:
import tensorflow as tf

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)  # non-experimental API (TF 2.3+)
except ValueError:
    strategy = tf.distribute.get_strategy() # for CPU and single GPU
    
print('Number of replicas:', strategy.num_replicas_in_sync)

In [None]:
import random


from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import roc_auc_score, accuracy_score

import transformers
# Only the classes used below are imported; AdamW is unused here and is
# no longer exported by recent transformers releases.
from transformers import AutoTokenizer, TFAutoModel

import warnings
warnings.filterwarnings("ignore")

## Loading Data

The training set contains a premise, a hypothesis, a label (0 = entailment, 1 = neutral, 2 = contradiction), and the language of the text. For more information about what these mean and how the data is structured, check out the data page: https://www.kaggle.com/c/contradictory-my-dear-watson/data
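
The label encoding above can be kept in a small lookup table so that integer labels are easier to read during exploration. This is a minimal sketch; `LABEL_NAMES` and `label_name` are illustrative names, not part of the original notebook.

```python
# Label encoding as described in the text above.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

def label_name(label_id: int) -> str:
    """Return the human-readable class name for an integer label."""
    return LABEL_NAMES[label_id]

print(label_name(2))  # contradiction
```

With the training dataframe loaded, the same dictionary could be applied column-wise, for example via `train['label'].map(LABEL_NAMES)`.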

In [None]:
train = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")
train.shape

In [None]:
train.head()

In [None]:
train['label'].value_counts()

In [None]:
train['premise'].str.len().describe(np.linspace(0, 1, 9))

In [None]:
train['hypothesis'].str.len().describe(np.linspace(0, 1, 9))

Let's look at one of the pairs of sentences.

In [None]:
train.premise.values[1]

In [None]:
train.hypothesis.values[1]

In [None]:
train.label.values[1]

These statements are contradictory, and the label (2 = contradiction) confirms it.

Let's look at the distribution of languages in the training set.

In [None]:
labels, frequencies = np.unique(train.language.values, return_counts = True)

plt.figure(figsize = (10,10))
plt.pie(frequencies,labels = labels, autopct = '%1.1f%%')
plt.show()

## Preparing Data for Input

In [None]:
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

In [None]:
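# Pad every pair to 512 tokens and truncate longer pairs so that all rows
# match the (512,) input shape the model expects.
encoded = tokenizer(train['premise'].tolist(), train['hypothesis'].tolist(),
                    padding='max_length', truncation=True, max_length=512,
                    return_tensors='tf')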
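
To make the effect of fixed-length padding concrete, here is a rough pure-Python sketch of what happens to a single tokenized pair. It is not the tokenizer's implementation: `pad_and_mask` is a hypothetical helper, the token ids are made up, the padding id of 1 is an assumption about XLM-R's vocabulary, and the truncation is a plain cut that ignores special tokens.

```python
from typing import Dict, List

PAD_ID = 1   # assumption: padding token id used by XLM-R
MAX_LEN = 8  # toy value for readability; the notebook works with 512

def pad_and_mask(token_ids: List[int],
                 max_len: int = MAX_LEN,
                 pad_id: int = PAD_ID) -> Dict[str, List[int]]:
    """Cut or pad a list of token ids to max_len and build the attention mask."""
    ids = token_ids[:max_len]          # naive truncation (real tokenizers keep the end token)
    n_real = len(ids)
    n_pad = max_len - n_real
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * n_real + [0] * n_pad,  # 1 = real token, 0 = padding
    }

example = pad_and_mask([0, 11, 12, 2])
print(example["input_ids"])       # [0, 11, 12, 2, 1, 1, 1, 1]
print(example["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The attention mask is what lets the model ignore the padded positions, which is why both `input_ids` and `attention_mask` are fed to the network.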

## Creating & Training Model

We build and compile the model inside `strategy.scope()` so that its variables are replicated across the available TPU cores:

In [None]:
with strategy.scope():
    input_ids = tf.keras.Input(shape =(512,), dtype=tf.int32, name='input_ids') 
    attention_mask = tf.keras.Input(shape=(512,),dtype=tf.int32, name='attention_mask')  
    
    roberta = TFAutoModel.from_pretrained('joeddav/xlm-roberta-large-xnli')
    roberta = roberta([input_ids, attention_mask])[0]
    
    output = tf.keras.layers.GlobalAveragePooling1D()(roberta)
    output = tf.keras.layers.Dense(3, activation = 'softmax')(output)
        
    model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
    
    model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss='sparse_categorical_crossentropy', 
                  metrics=['accuracy']) 
    
    model.summary()
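
The `GlobalAveragePooling1D` layer above collapses the per-token outputs into one vector per example by averaging over the sequence axis. The sketch below shows that averaging in plain Python next to a mask-aware variant. The function names and toy vectors are illustrative only; whether padded positions actually enter the notebook's average depends on whether a Keras mask reaches the layer, which this sketch does not try to settle.

```python
from typing import List

Vector = List[float]

def mean_pool(token_vectors: List[Vector]) -> Vector:
    """Average all token vectors, position by position."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def masked_mean_pool(token_vectors: List[Vector], attention_mask: List[int]) -> Vector:
    """Average only the token vectors whose attention mask is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return mean_pool(kept)

# Two real tokens followed by two padded positions (toy values).
toy_tokens = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0], [0.0, 0.0]]
toy_mask = [1, 1, 0, 0]

print(mean_pool(toy_tokens))                   # [1.0, 2.0]
print(masked_mean_pool(toy_tokens, toy_mask))  # [2.0, 4.0]
```

The two results differ whenever padding is present, so a mask-aware pooling layer is one possible variation to try if short inputs seem to be diluted by padding.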

In [None]:
early_stop = tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)

model.fit(encoded.data, tf.convert_to_tensor(train.label), 
          validation_split=0.2, 
          epochs=10,
          batch_size=8*strategy.num_replicas_in_sync,
          callbacks=[early_stop],
          verbose=1)
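
The `EarlyStopping` callback above halts training once the validation loss has failed to improve for `patience` consecutive epochs, and `restore_best_weights=True` rolls the model back to the best epoch. The following is a simplified sketch of that bookkeeping, not the Keras implementation; `early_stopping_epoch` and the loss values are made up for illustration.

```python
from typing import List, Tuple

def early_stopping_epoch(val_losses: List[float], patience: int = 3) -> Tuple[int, int]:
    """Return (epoch at which training stops, epoch with the best loss), 0-indexed."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ends at the last scheduled epoch.
    return len(val_losses) - 1, best_epoch

stopped, best = early_stopping_epoch([0.9, 0.7, 0.72, 0.71, 0.75, 0.6], patience=3)
print(stopped, best)  # 4 1
```

In this toy run the final value 0.6 is never seen, because three non-improving epochs in a row end training first and the weights from epoch 1 would be restored.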

## Generating & Submitting Predictions

In [None]:
test = pd.read_csv('../input/contradictory-my-dear-watson/test.csv')

In [None]:
# Same padding and truncation settings as for the training data.
test_encoded = tokenizer(test['premise'].tolist(), test['hypothesis'].tolist(),
                         padding='max_length', truncation=True, max_length=512,
                         return_tensors='tf')

In [None]:
# Pick the most probable of the three classes for each test pair.
predictions = np.argmax(model.predict(test_encoded.data), axis=1)
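
Because the model ends in a three-way softmax, `model.predict` yields one row of three class probabilities per test pair, and the predicted label is the index of the largest value in each row. A tiny pure-Python illustration with made-up probabilities (the `argmax` helper is illustrative, not part of the notebook):

```python
from typing import List

def argmax(row: List[float]) -> int:
    """Index of the largest value in a row of class probabilities."""
    return max(range(len(row)), key=row.__getitem__)

# Hypothetical softmax outputs for two test pairs.
toy_probabilities = [
    [0.7, 0.2, 0.1],  # most likely class 0 (entailment)
    [0.1, 0.3, 0.6],  # most likely class 2 (contradiction)
]
toy_predictions = [argmax(row) for row in toy_probabilities]
print(toy_predictions)  # [0, 2]
```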

The submission file will consist of the ID column and a prediction column. We can just copy the ID column from the test file, make it a dataframe, and then add our prediction column.

In [None]:
submission = test.id.copy().to_frame()
submission['prediction'] = predictions

In [None]:
submission['prediction'].value_counts()

In [None]:
submission.head()

In [None]:
submission.to_csv("submission.csv", index=False)

And now we've created our submission file, which can be submitted to the competition.