![](https://storage.googleapis.com/www.forwardit.lv/kaggle/headline.png)

## con‧tra‧dic‧tion /ˌkɒntrəˈdɪkʃən /

> the fact of something being the complete opposite of something else or very different from something else, so that one of them must be wrong 

—Cambridge Dictionary


> a difference between two statements, beliefs, or ideas about something that means they cannot both be true 

—Longman dictionary

> The famous pipe. How people reproached me for it! And yet, could you stuff my pipe? No, it's just a representation, is it not? So if I had written on my picture "This is a pipe", I'd have been lying!

— René Magritte


# Welcome

Welcome to a Very Contradictory EDA! This notebook is inspired by a quote by Kaggle Grandmaster [Agnis Liukis](https://www.kaggle.com/alijs1):
> To stand out and get some real advantage, it is necessary to do something different, find something that others didn’t notice. 

So the purpose of this EDA is to find something interesting and not so straightforward about this dataset. Let's go!

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

In [None]:
train = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")
test = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/test.csv')

# Random exploration

Let's start by looking at the raw data. Instead of the usual approach of taking the head of the dataset, we will draw random samples:

In [None]:
train.sample(frac=0.001, replace=True, random_state=1)

In [None]:
test.sample(frac=0.001, replace=True, random_state=1)

This gives us a chance to find samples beyond the first few rows of the training and test sets that might be worth investigating further:

In [None]:
test[(test["id"]=="40a9b0f08e") | (test["id"]=="4e9266e800")]

Let's inspect one of the sentence pairs, consisting of "Yes, sir" and "I will take care of that right away Sir". Given this pair without knowing which sentence is the premise and which is the hypothesis:

* **Would you choose premise and hypothesis the same way as in the dataset?**
* **Would you label these pairs differently depending on that assignment?**

**Case 1 (neutral)**

* Premise: "Yes, sir."
* Hypothesis: "I will take care of that right away Sir."


**Case 2 (entailed)**

* Premise: "I will take care of that right away Sir."
* Hypothesis: "Yes, sir."

# Annotation Artifacts? Yes sir!

Despite the two sentences having very similar meaning, the resulting label differs depending on sentence order! Why is this happening? Word count. **A longer sentence naturally conveys more information than a shorter one.**
1. If the longer sentence is the hypothesis (as in Case 1), we could assume the pair is more likely to be contradictory (or at least neutral), because a longer sentence has more chances to contain information that contradicts the premise.
2. If the longer sentence is the premise (as in Case 2), the pair is more likely to be entailed, simply because a shorter hypothesis has fewer chances to be contradictory.

Let's check this idea by introducing an additional feature, the **word count ratio**:

In [None]:
def get_word_count(sentence):
    return len(str(sentence).split())
    
def get_word_count_ratio(premise, hypothesis):
    return get_word_count(premise) / get_word_count(hypothesis)

train['word_count_ratio'] = train[['premise', 'hypothesis']].apply(lambda x: get_word_count_ratio(*x), axis=1)
train.head(5)
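
As a quick sanity check, the ratio can be computed by hand for the "Yes, sir" pair from Cases 1 and 2. This is a minimal sketch that redefines the two helpers above so it runs on its own; the names `short_sentence`, `long_sentence`, `case_1` and `case_2` are introduced here for illustration only.

```python
def get_word_count(sentence):
    # Whitespace tokenization: punctuation stays attached to the words.
    return len(str(sentence).split())


def get_word_count_ratio(premise, hypothesis):
    return get_word_count(premise) / get_word_count(hypothesis)


short_sentence = "Yes, sir."
long_sentence = "I will take care of that right away Sir."

# Case 1: the short sentence is the premise, the long one is the hypothesis.
case_1 = get_word_count_ratio(short_sentence, long_sentence)
# Case 2: the long sentence is the premise, the short one is the hypothesis.
case_2 = get_word_count_ratio(long_sentence, short_sentence)

print(round(case_1, 3), case_2)  # 0.222 4.5
```

A ratio below 1 means the hypothesis is the longer sentence (Case 1); a ratio above 1 means the premise is longer (Case 2).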

Now let's look at the samples where the hypothesis is more than twice as long as the premise:

In [None]:
longer_hypotheses = train[(train['word_count_ratio'] < 0.5)]
longer_hypotheses_en = longer_hypotheses.loc[longer_hypotheses["language"] == "English"]
longer_hypotheses_en.head(20)

![](https://storage.googleapis.com/www.forwardit.lv/kaggle/Jerusalem.png)

The one about proximity to Jerusalem is of particular interest:

In [None]:
train[train["id"]=="1be4c67e65"]



Let's check out all the samples with the same premise:

In [None]:
train.loc[train["premise"] == "Near Jerusalem"]

Recall the labels: 0 = entailment, 1 = neutral, 2 = contradiction.

Isn't being *three miles away from Jerusalem* still being close to it? I would rather label this pair as *entailment*, but the training set says it is *neutral*.

We seem to have found a pattern: samples with short premises are the most *contradictory* to look at.

# The defeat of Napoleon

Let's continue by selecting the samples whose premise consists of just one word:

In [None]:
one_word_premises = train.loc[train["premise"].apply(lambda premise: get_word_count(premise) == 1)]
one_word_premises_en = one_word_premises.loc[one_word_premises["language"] == "English"]
one_word_premises_en.head(30)
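
The English filter is not only for readability. `get_word_count` splits on whitespace, so it may undercount words in languages that are written without spaces between words. A minimal sketch of the effect, assuming the `language` column also contains such languages; both sentences below are made-up examples, not rows from the dataset.

```python
def get_word_count(sentence):
    # Same whitespace-based helper as in the notebook.
    return len(str(sentence).split())


# Made-up example sentences (assumption: not taken from the dataset).
english = "I like cats"
chinese = "我喜欢猫"  # roughly "I like cats", written without spaces

english_count = get_word_count(english)
chinese_count = get_word_count(chinese)

print(english_count, chinese_count)  # 3 1
```

Under this tokenization a whole sentence can look like a "one-word premise", so word count features are best interpreted per language.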

The following pairs are great examples of what could go wrong when training a model for Natural Language Inference. Some of these sentences require very specific domain knowledge:
* *Dr Bauerstein* and *Alfred Inglethorp* are fictional characters in *Agatha Christie*'s detective novel *The Mysterious Affair at Styles*.
* *Saint-Paul-de-Vence* is a commune in the Alpes-Maritimes department in the Provence-Alpes-Côte d'Azur region of Southeastern France. 
* *D-Day* is the name of the Normandy landings operation during World War II, on June 6, 1944.
* *Waterloo* is a municipality in Belgium from which the famous *Battle of Waterloo* took its name.
* *Melatonin* is a hormone made by the pineal gland. It helps your body know when it's time to sleep and wake up.

![](https://storage.googleapis.com/www.forwardit.lv/kaggle/Regiment-Charles-Ewart.png)


1. **Is Waterloo the defeat of Napoleon?** It depends on how one understands the word "Waterloo". Although the word is commonly used to refer to the *Battle of Waterloo*, its literal meaning is the municipality, not Napoleon's famous battle. Waterloo might be a battle, but it is primarily a geographic name, so the pair is *neutral*.
2. **Is Bauerstein a doctor?** If you have just read Agatha Christie's novel *The Mysterious Affair at Styles*, you may recall Dr. Bauerstein and answer yes. But we should assume no prior knowledge of Agatha Christie, in which case Bauerstein is just a random surname, so the pair is *neutral*.
3. **Are Alfred Inglethorp and Bauerstein the same person?** Obviously not, if you are familiar with Agatha Christie's *The Mysterious Affair at Styles*. On the other hand, the pair is not "Agatha Christie's Bauerstein" vs "Agatha Christie's Alfred Inglethorp", so we should treat those names without any connection to the novel. In that case they are just two different names, and the pair is thus a *contradiction*.

One might dispute the last statement, since a person sometimes has two names. For example, how would you label the pair *Agatha Christie* and *Mary Westmacott*? (The latter is one of her pseudonyms.)
* Following the logic of *Alfred Inglethorp vs Bauerstein*, the names are different, so we should say it is a *contradiction*.
* Given the knowledge that Mary Westmacott is Agatha Christie's pseudonym, we should say it is an *entailment*.
* Given the knowledge that there are thousands of people with the very same name, Agatha Christie, the pair might be *neutral*.

This leads to the conclusion that humans judge the meaning of sentence pairs using **prior knowledge** and **context**.

In certain cases, the very same pair of sentences could be interpreted differently given different knowledge and context.

> Should our model be aware of Agatha Christie and Napoleon?

![](https://storage.googleapis.com/www.forwardit.lv/kaggle/napoleon_vs_agatha_cristie.png)

Let's recall the Waterloo sample again. Would *The defeat of Napoleon* be an *entailment* if the premise explicitly referred to Waterloo as a battle, e.g. *The battle of Waterloo*? It depends on our personal judgement of a historic event.

A great example of this paradox is *The Battle of Borodino*, on which historians hold very diverse opinions:
* Some claim it is a victory for the French
* Others say it is a victory for the Russians
* A third group says it was a *Pyrrhic victory*, as it ultimately cost Napoleon his army

So another important question arises:

> How could one draw the line between natural language understanding and holding a certain (probably biased) opinion on some facts?

In [None]:
train.label.unique()

In [None]:
labels = {
    0: "Entailment",
    1: "Neutral",
    2: "Contradiction"
}
def dist_plot(df):
    """Plot the density of word_count_ratio for each label present in df."""
    plt.figure(figsize=(16, 10))
    for label in sorted(df['label'].unique()):
        # Build the mask from df itself rather than from the global train frame
        subset = df[df['label'] == label]

        # Draw the density plot (sns.kdeplot replaces the deprecated sns.distplot)
        sns.kdeplot(subset['word_count_ratio'], linewidth=3, label=labels[label])

    # Plot formatting
    plt.legend(prop={'size': 16}, title='Label')
    plt.title('Word Count Ratio vs Label')
    plt.xlabel('Word Count Ratio')
    plt.ylabel('Density')

In [None]:
dist_plot(longer_hypotheses_en)

In [None]:
plt.figure(figsize=(12,10))
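# Correlate numeric columns only; text columns such as premise cannot be correlated
cor = train.select_dtypes(include="number").corr()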
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
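
A Pearson correlation treats the label codes 0, 1 and 2 as ordered numbers, which may hide a relationship that is not monotonic. As a complement, a per-label summary of the ratio can be sketched with a `groupby`. The `toy` frame below is made-up data so that the example runs on its own and says nothing about the real distribution; on the actual data one would use `train` instead.

```python
import pandas as pd

# Made-up values for illustration only (assumption: not real dataset rows).
toy = pd.DataFrame({
    "label": [0, 0, 1, 1, 2, 2],
    "word_count_ratio": [2.0, 4.0, 1.0, 0.5, 0.25, 0.75],
})

# Count, mean and median of the ratio within each label.
summary = toy.groupby("label")["word_count_ratio"].agg(["count", "mean", "median"])
print(summary)
```

The median is included because a ratio is unbounded above, so a few very long premises can pull the mean up.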

One may also note that the sentences in these "strange" samples are of very different lengths. With that, we will formulate our own set of hypotheses based on the word counts of the respective sentences:
1. Sentences of the same length are more likely to be neutral
2. Sentences of different lengths are more likely to either contradict or entail
3. If the hypothesis is longer than the premise, it has higher entropy and contains more information, which in turn increases the chances of a contradiction
4. If the hypothesis is shorter than the premise, it has lower entropy and is more likely to be a "summary": either neutral or entailment
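
One way to test these four hypotheses is to bucket each pair by relative length and compare the label shares per bucket. The sketch below assumes the `word_count_ratio` and `label` columns used in this notebook; the `toy` frame, the `length_group` column and the group names are hypothetical and use made-up values.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (assumption: not real dataset rows).
toy = pd.DataFrame({
    "word_count_ratio": [0.4, 0.5, 1.0, 1.0, 2.0, 3.0],
    "label": [2, 1, 1, 1, 0, 0],
})

# ratio < 1: hypothesis longer; ratio > 1: hypothesis shorter; otherwise same length.
conditions = [toy["word_count_ratio"] < 1, toy["word_count_ratio"] > 1]
choices = ["hypothesis longer", "hypothesis shorter"]
toy["length_group"] = np.select(conditions, choices, default="same length")

# Share of each label within each length group (every row sums to 1).
label_share = pd.crosstab(toy["length_group"], toy["label"], normalize="index")
print(label_share)
```

On the real data, replacing `toy` with `train` gives one row per length group, which can be read directly against hypotheses 1 to 4.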

Word overlap is measured by the percentage of tokens from the question that appear in the evidence.

We will check these assumptions on the training set using the **word count ratio** feature introduced earlier.

Let's check what share of the training set has a hypothesis noticeably longer than the premise (ratio below 0.8):

In [None]:
longer_hypotheses = train[(train['word_count_ratio'] < 0.8)]
len(longer_hypotheses) / len(train)

Now let's check how this relates to the label (our assumption is that contradiction will prevail):

In [None]:
# Labels are categorical, so plot counts per label instead of a density estimate
sns.countplot(x=longer_hypotheses['label'])

In [None]:
# Same plot for the whole training set, for comparison
sns.countplot(x=train['label'])
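
Comparing two plots by eye is subjective. A chi-square test of independence is one possible way to check whether the label depends on the hypothesis being longer. This is a sketch under assumptions: the `toy` frame holds made-up values, `hypothesis_longer` is a hypothetical helper column, and the 0.8 threshold is the one used above.

```python
import pandas as pd
from scipy import stats

# Made-up values for illustration only (assumption: not real dataset rows).
toy = pd.DataFrame({
    "word_count_ratio": [0.3, 0.5, 0.6, 0.7, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 0.4, 1.8],
    "label": [2, 2, 1, 2, 1, 0, 0, 0, 1, 0, 0, 2],
})
toy["hypothesis_longer"] = toy["word_count_ratio"] < 0.8

# Rows: hypothesis longer or not; columns: label; cells: number of pairs.
contingency = pd.crosstab(toy["hypothesis_longer"], toy["label"])

# Null hypothesis: the label is independent of the length group.
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
print(contingency)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

With only twelve toy rows the expected counts are too small for the test to be reliable; the point is the mechanics, which carry over to `train` unchanged.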

# About the author

This notebook is published under the **Data Science DJ** initiative with the goal of giving you distilled pieces of valuable information, short and concise, easy to comprehend. 

I spend a few hours every day writing a single post about a single concept. You can find them by:

* [Joining my Telegram channel](https://t.me/datasciencedj)
* [Following my LinkedIn tag](https://www.linkedin.com/feed/hashtag/?keywords=datasciencedj)

If this work gives you joy, or maybe even inspiration, please consider contributing to my [Patreon account](https://www.patreon.com/datasciencedj).
Thank you!

# Resources

1. https://www.arxiv-vanity.com/papers/1803.02324/