<center><img src="https://raw.githubusercontent.com/chiapas/kaggle/master/competitions/contradictory-my-dear-watson/header.png" width="1000"></center>
<br>
<center><h1>Detecting contradiction and entailment in multilingual text using TPUs</h1></center>
<br>

#### Natural Language Inference (NLI) is a classic NLP (Natural Language Processing) problem that involves taking two sentences (the _premise_ and the _hypothesis_) and deciding how they are related: whether the premise entails the hypothesis, contradicts it, or does neither.

#### In this notebook, we will use additional NLI datasets, including

* [The Stanford Natural Language Inference Corpus (SNLI)](https://nlp.stanford.edu/projects/snli/)
* [The Multi-Genre NLI Corpus (MultiNLI, MNLI)](https://cims.nyu.edu/~sbowman/multinli/)
* [Cross-lingual NLI Corpus (XNLI)](https://cims.nyu.edu/~sbowman/xnli/)

#### We will also use Hugging Face's recent library [nlp](https://huggingface.co/nlp/) to work with these datasets.

## Update:
### Since (some of) the test examples were found to come from these external datasets, I have removed the training part of this notebook so as not to encourage using them.

#### Import

In [None]:
import os
os.environ["WANDB_API_KEY"] = "0" ## to silence warning

import numpy as np
import random
import sklearn
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
import plotly.express as px

!pip uninstall -y transformers
!pip install transformers

import transformers
import tokenizers

# Hugging Face's new library for datasets (https://huggingface.co/nlp/)
!pip install nlp
import nlp

import datetime

strategy = None

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Datasets

## Competition dataset

In [None]:
original_train = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")

# Shuffle once (a second shuffle adds nothing); no seed is set, so the split differs between runs
original_train = sklearn.utils.shuffle(original_train)

validation_ratio = 0.2
nb_valid_examples = max(1, int(len(original_train) * validation_ratio))

original_valid = original_train[:nb_valid_examples]
original_train = original_train[nb_valid_examples:]
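
The split above is unseeded, so the validation set changes on every run. As a minimal sketch of a reproducible variant, the hypothetical helper `shuffle_and_split` below (not part of the original notebook) applies the same "shuffle, then take the first 20%" logic to a plain list, using a fixed seed. The seed value `42` and the toy `rows` list are assumptions for illustration only.

```python
import random


def shuffle_and_split(rows, validation_ratio=0.2, seed=42):
    """Shuffle a copy of `rows` with a fixed seed, then split off the head as validation.

    Hypothetical helper for illustration; mirrors the slicing logic of the cell above.
    """
    shuffled = list(rows)                      # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)      # seeded, hence repeatable
    nb_valid = max(1, int(len(shuffled) * validation_ratio))
    return shuffled[nb_valid:], shuffled[:nb_valid]


rows = list(range(10))  # stand-in for the rows of train.csv
train_rows, valid_rows = shuffle_and_split(rows)
print(len(train_rows), len(valid_rows))
```

For the DataFrame used in this notebook, passing `random_state` to `sklearn.utils.shuffle` would play the same role as the seed here.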

In [None]:
print(f"original - training: {len(original_train)} examples")
original_train.head(10)

In [None]:
print(f"original - validation: {len(original_valid)} examples")
original_valid.head(10)

In [None]:
original_test = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")
print(f"original - test: {len(original_test)} examples")
original_test.head(10)

## Extra datasets

### Let's use Hugging Face's new library [nlp](https://huggingface.co/nlp/) to get more NLI datasets.

#### Load a dataset - The Multi-Genre NLI Corpus (MNLI)
First, let's load [The Multi-Genre NLI Corpus (MultiNLI, MNLI)](https://cims.nyu.edu/~sbowman/multinli/). It contains about 433,000 sentence pairs annotated with textual entailment information.

In [None]:
mnli = nlp.load_dataset(path='glue', name='mnli')

#### check the loaded dataset

Let's look at some information about the MNLI dataset. By default, [nlp.load_dataset](https://huggingface.co/nlp/package_reference/loading_methods.html#nlp.load_dataset) returns a dictionary keyed by split name; the splits are usually `train`, `validation` and `test`, but not always. The values are [nlp.arrow_dataset.Dataset](https://huggingface.co/nlp/master/package_reference/main_classes.html#nlp.Dataset) objects.



In [None]:
print(mnli, '\n')

print('The split names in MNLI dataset:')
for k in mnli:
    print('   ', k)
    
# Get the datasets
print("\nmnli['train'] is ", type(mnli['train']))

mnli['train']

#### look inside 'nlp.arrow_dataset.Dataset'

In order to get the number of examples in a dataset, for example, `mnli['train']`, you can do
```
    mnli['train'].num_rows
```


You can iterate over a [nlp.arrow_dataset.Dataset](https://huggingface.co/nlp/master/package_reference/main_classes.html#nlp.Dataset) object like:
```
    for elt in mnli['train']:
        ...
```
Each step yields one example, which is a dictionary of features (in a general sense).

You can also access the content of a [nlp.arrow_dataset.Dataset](https://huggingface.co/nlp/master/package_reference/main_classes.html#nlp.Dataset) object by specifying a feature name. For example, the training dataset in `mnli` has `premise`, `hypothesis`, `label` and `idx` as features.

You can either specify a feature name first (you get a list) followed by a slice, like
```
    # You get a `list` first, then slice it
    mnli['train']['premise'][:3]
```
or use slice notation first to get a dictionary (which represents a sliced dataset) followed by a feature name.
```
    # You get a `dictionary` (of lists) first, then a list
    mnli['train'][:3]['premise']
```

The results will be the same.
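
To see why the two access orders agree, here is a small sketch built on a hypothetical stand-in class, `ToyColumnTable`. It is not the `nlp` Dataset class and makes no claim about how that class is implemented; it only mimics the two behaviours described above (a feature name returns a list, a slice returns a dictionary of lists).

```python
class ToyColumnTable:
    """Hypothetical stand-in for a column-oriented dataset (NOT the nlp class)."""

    def __init__(self, columns):
        self.columns = columns  # dict: feature name -> list of values

    def __getitem__(self, key):
        if isinstance(key, str):
            # Feature name: return the whole column as a list.
            return self.columns[key]
        if isinstance(key, slice):
            # Slice: return a dict of lists restricted to the selected rows.
            return {name: values[key] for name, values in self.columns.items()}
        raise TypeError(f"unsupported key type: {type(key).__name__}")


table = ToyColumnTable({
    "premise": ["p0", "p1", "p2", "p3"],
    "hypothesis": ["h0", "h1", "h2", "h3"],
    "label": [0, 1, 2, 0],
})

column_then_slice = table["premise"][:3]   # list first, then slice it
slice_then_column = table[:3]["premise"]   # dict of lists first, then pick a column
print(column_then_slice == slice_then_column)
```

In this toy model the column-first form touches the whole column before slicing, so on a large split the slice-first form likely does less work.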

In order to get the name of the classes, you can do

```
mnli['train'].features['label'].names
```

Let's use what we learned to check some training examples

In [None]:
print('The number of training examples in mnli dataset:', mnli['train'].num_rows)
print('The number of validation examples in mnli dataset - part 1:', mnli['validation_matched'].num_rows)
print('The number of validation examples in mnli dataset - part 2:', mnli['validation_mismatched'].num_rows, '\n')

print('The class names in mnli dataset:', mnli['train'].features['label'].names)
print('The feature names in mnli dataset:', list(mnli['train'].features.keys()), '\n')

for elt in mnli['train']:
    
    print('premise:', elt['premise'])
    print('hypothesis:', elt['hypothesis'])
    print('label:', elt['label'])
    print('label name:', mnli['train'].features['label'].names[elt['label']])
    print('idx', elt['idx'])
    print('-' * 80)
    
    if elt['idx'] >= 10:
        break

Note that the class names are
```
    ['entailment', 'neutral', 'contradiction'] 
```
which match the label encoding of the original competition dataset, described on [this competition data page](https://www.kaggle.com/c/contradictory-my-dear-watson/data):

> label: the classification of the relationship between the premise and hypothesis (0 for entailment, 1 for neutral, 2 for contradiction)
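
The correspondence can be checked mechanically. The sketch below hard-codes the class-name list printed above and the id-to-name mapping quoted from the competition page, then compares them. `label_name` is a hypothetical helper introduced here for illustration.

```python
# Class names in the order reported by the dataset's `label` feature.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]

# Mapping quoted from the competition data page (label id -> meaning).
COMPETITION_LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}


def label_name(label_id):
    """Return the class name for a label id, rejecting ids outside 0..2."""
    if not 0 <= label_id < len(LABEL_NAMES):
        raise ValueError(f"unexpected label id: {label_id}")
    return LABEL_NAMES[label_id]


labels_match = all(
    label_name(label_id) == name for label_id, name in COMPETITION_LABELS.items()
)
print(labels_match)
```

The explicit range check matters because a negative id would otherwise index the list from the end and silently return a wrong class name.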

#### Convert MNLI to pandas.DataFrame

In [None]:
# convert to a dataframe and view
mnli_train_df = pd.DataFrame(mnli['train'])
mnli_valid_1_df = pd.DataFrame(mnli['validation_matched'])
mnli_valid_2_df = pd.DataFrame(mnli['validation_mismatched'])

mnli_train_df = mnli_train_df[['premise', 'hypothesis', 'label']]
mnli_valid_1_df = mnli_valid_1_df[['premise', 'hypothesis', 'label']]
mnli_valid_2_df = mnli_valid_2_df[['premise', 'hypothesis', 'label']]

mnli_train_df['lang_abv'] = 'en'
mnli_valid_1_df['lang_abv'] = 'en'
mnli_valid_2_df['lang_abv'] = 'en'

In [None]:
mnli_train_df.head(10)

In [None]:
mnli_valid_1_df.head(10)

In [None]:
mnli_valid_2_df.head(10)

### Load more extra datasets

#### The Stanford Natural Language Inference Corpus (SNLI)

Next, let's load [The Stanford Natural Language Inference Corpus (SNLI)](https://nlp.stanford.edu/projects/snli/). It contains about 570,000 sentence pairs annotated with textual entailment information.

In [None]:
snli = nlp.load_dataset(path='snli')

print('The number of training examples in snli dataset:', snli['train'].num_rows)
print('The number of validation examples in snli dataset:', snli['validation'].num_rows, '\n')

print('The class names in snli dataset:', snli['train'].features['label'].names)
print('The feature names in snli dataset:', list(snli['train'].features.keys()), '\n')

for idx, elt in enumerate(snli['train']):
    
    print('premise:', elt['premise'])
    print('hypothesis:', elt['hypothesis'])
    print('label:', elt['label'])
    print('label name:', snli['train'].features['label'].names[elt['label']])
    print('-' * 80)
    
    if idx >= 10:
        break

Again, the class names are
```
    ['entailment', 'neutral', 'contradiction'] 
```
which match the original competition dataset.

In [SNLI](https://nlp.stanford.edu/projects/snli/), the same premise appears with several different hypotheses and labels. On a first attempt, I got `nan` as the training loss, so I won't use this dataset in the current notebook.
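
One possible contributor to the `nan` loss, offered here as an assumption that this notebook does not verify: the SNLI release distributed through Hugging Face marks examples without a gold label with `-1`, which is not a valid class index for a three-class loss. A minimal sketch of filtering such rows, using made-up toy rows and a hypothetical helper `drop_unlabeled`:

```python
VALID_LABELS = {0, 1, 2}  # entailment, neutral, contradiction


def drop_unlabeled(rows):
    """Keep only rows whose label is one of the three valid class ids."""
    return [row for row in rows if row["label"] in VALID_LABELS]


# Toy rows written for this example; they are not taken from SNLI.
toy_rows = [
    {"premise": "A man is sleeping.", "hypothesis": "A person rests.", "label": 0},
    {"premise": "A man is sleeping.", "hypothesis": "A man is running.", "label": 2},
    {"premise": "A man is sleeping.", "hypothesis": "He is tired.", "label": -1},
]

clean_rows = drop_unlabeled(toy_rows)
print(len(toy_rows), "->", len(clean_rows))
```

On a DataFrame such as `snli_train_df`, a boolean mask like `snli_train_df['label'] != -1` would be the analogous filter. Whether this removes the `nan` would still need to be checked by training.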

#### Convert SNLI to pandas.DataFrame

In [None]:
# convert to a dataframe and view
snli_train_df = pd.DataFrame(snli['train'])
snli_valid_df = pd.DataFrame(snli['validation'])

snli_train_df = snli_train_df[['premise', 'hypothesis', 'label']]
snli_valid_df = snli_valid_df[['premise', 'hypothesis', 'label']]

snli_train_df['lang_abv'] = 'en'
snli_valid_df['lang_abv'] = 'en'

In [None]:
snli_train_df.head(10)

In [None]:
snli_valid_df.head(10)

#### The Cross-Lingual NLI Corpus (XNLI)

[MNLI](https://cims.nyu.edu/~sbowman/multinli/) and [SNLI](https://nlp.stanford.edu/projects/snli/) contain only English sentences. Let's load the [Cross-lingual NLI Corpus (XNLI)](https://cims.nyu.edu/~sbowman/xnli/) dataset. It contains only validation and test splits, with no training examples.

In [None]:
xnli = nlp.load_dataset(path='xnli')

print('The number of validation examples in xnli dataset:', xnli['validation'].num_rows, '\n')

print('The class names in xnli dataset:', xnli['validation'].features['label'].names)
print('The feature names in xnli dataset:', list(xnli['validation'].features.keys()), '\n')

for idx, elt in enumerate(xnli['validation']):
    
    print('premise:', elt['premise'])
    print('hypothesis:', elt['hypothesis'])
    print('label:', elt['label'])
    print('label name:', xnli['validation'].features['label'].names[elt['label']])
    print('-' * 80)
    
    if idx >= 3:
        break

The class names are still
```
    ['entailment', 'neutral', 'contradiction'],
```
however, the features `premise` and `hypothesis` are no longer strings but dictionaries that contain the sentences in different languages!
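
To make the nesting concrete, here is a hand-written toy record. Its layout is inferred from the conversion loop used later in this notebook, with `premise` keyed by language code and `hypothesis` holding parallel `language` and `translation` lists. The sentences, the two-language restriction, and the helper name `flatten` are assumptions for illustration. The sketch turns one nested record into one flat row per language.

```python
# Toy record; layout inferred from this notebook's conversion loop, content made up.
toy_example = {
    "premise": {"en": "He is asleep.", "fr": "Il dort."},
    "hypothesis": {
        "language": ["en", "fr"],
        "translation": ["He is running.", "Il court."],
    },
    "label": 2,
}


def flatten(example):
    """Return one flat row per language found in the hypothesis."""
    rows = []
    hyp = example["hypothesis"]
    # `language` and `translation` are parallel lists, so zip pairs them up.
    for lang, hypothesis in zip(hyp["language"], hyp["translation"]):
        rows.append({
            "premise": example["premise"][lang],  # premise is looked up by language code
            "hypothesis": hypothesis,
            "label": example["label"],            # the label is shared across languages
            "lang_abv": lang,
        })
    return rows


flat_rows = flatten(toy_example)
for row in flat_rows:
    print(row)
```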

#### Convert XNLI to pandas.DataFrame

In [None]:
# flatten XNLI into one row per (example, language) pair
buffer = {
    'premise': [],
    'hypothesis': [],
    'label': [],
    'lang_abv': []
}


for x in xnli['validation']:
    label = x['label']
    for idx, lang in enumerate(x['hypothesis']['language']):
        hypothesis = x['hypothesis']['translation'][idx]
        premise = x['premise'][lang]
        buffer['premise'].append(premise)
        buffer['hypothesis'].append(hypothesis)
        buffer['label'].append(label)
        buffer['lang_abv'].append(lang)
        
# convert to a dataframe and view
xnli_valid_df = pd.DataFrame(buffer)
xnli_valid_df = xnli_valid_df[['premise', 'hypothesis', 'label', 'lang_abv']]

In [None]:
xnli_valid_df.head(15 * 3)