> "…when you have eliminated the impossible, whatever remains, however improbable, must be the truth"

> -Sir Arthur Conan Doyle

![](https://i.pinimg.com/originals/e7/b8/df/e7b8dfcca0a5383d99142f28b3a6f51d.jpg)

### **The Challenge:**
If you have two sentences, there are three ways they could be related: one could entail the other, one could contradict the other, or they could be unrelated. Natural Language Inferencing (NLI) is a popular NLP problem that involves determining how pairs of sentences (consisting of a premise and a hypothesis) are related.

Your task is to create an NLI model that assigns labels of 0, 1, or 2 (corresponding to entailment, neutral, and contradiction) to pairs of premises and hypotheses. To make things more interesting, the train and test set include text in fifteen different languages! You can find more details on the dataset by reviewing the Data page.

I have always felt that EDA notebooks on Kaggle have lost their main purpose, which is to explain the data. They are often more about fancy graphs than the meaning behind them. I hope this notebook fulfills that purpose.


Let us walk through the directory.

In [None]:
!pip install -q wordcloud

In [None]:
import numpy as np
import pandas as pd
import os
print('Inside Input we have:')
for i, (dirname, _, filenames) in enumerate(os.walk('/kaggle/input/contradictory-my-dear-watson')):
    print('\t '* i, '{}) {} folder. It has:-'.format(i+1, dirname.split('/')[-1]))
    for idx,filename in enumerate(filenames):
        print('\t '* (i+1),f'{idx+1}. {filename}' )


Now let us load our CSV files and take a first look at them.

In [None]:
train_df = pd.read_csv('../input/contradictory-my-dear-watson/train.csv')
test_df = pd.read_csv('../input/contradictory-my-dear-watson/test.csv')
train_df.shape, test_df.shape

In [None]:
train_df.head()

In [None]:
test_df.head()

#### Note to ourselves: ***It looks like a Multi-Lingual, Multi-Class Problem.***

### First let us analyze the label, or target
The task is to classify pairs of sentences (consisting of a premise and a hypothesis) into three categories:

**1. entailment (label 0) means the hypothesis follows logically from the premise**

**2. contradiction (label 2) means the hypothesis cannot be true if the premise is true**

**3. neutral (label 1) means the hypothesis neither follows from nor contradicts the premise**

#### EXAMPLE
As explained in Getting Started, take this premise:
> He came, he opened the door and I remember looking back and seeing the expression on his face, and I could tell that he was disappointed.

###### Hypothesis 1:

> Just by the look on his face when he came through the door I just knew that he was let down.

We know that this is true based on the information in the premise. So, this pair is related by **entailment**.

###### Hypothesis 2:

> He was trying not to make us feel guilty but we knew we had caused him trouble.

This very well might be true, but we can’t conclude this based on the information in the premise. So, this relationship is **neutral**.

###### Hypothesis 3:

> He was so excited and bursting with joy that he practically knocked the door off its frame.

We know this isn’t true, because it is the complete opposite of what the premise says. So, this pair is related by **contradiction**.
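
To connect the example with the numeric labels from the challenge description, here is a minimal sketch that puts the three pairs into a small DataFrame and maps each label id to its name. The names `LABEL_NAMES` and `example_df` are made up for illustration, and the frame only mimics the `premise`, `hypothesis` and `label` columns of `train.csv`.

```python
import pandas as pd

# Label ids as given in the challenge description
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

premise = ("He came, he opened the door and I remember looking back and seeing "
           "the expression on his face, and I could tell that he was disappointed.")

# One row per hypothesis, mirroring the premise/hypothesis/label columns of train.csv
example_df = pd.DataFrame({
    "premise": [premise] * 3,
    "hypothesis": [
        "Just by the look on his face when he came through the door I just knew that he was let down.",
        "He was trying not to make us feel guilty but we knew we had caused him trouble.",
        "He was so excited and bursting with joy that he practically knocked the door off its frame.",
    ],
    "label": [0, 1, 2],
})

# Attach a readable name next to each numeric label
example_df["label_name"] = example_df["label"].map(LABEL_NAMES)
print(example_df[["label", "label_name"]])
```

The same `map` call should work on the real training frame, assuming its `label` column only holds 0, 1 and 2.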

In [None]:
import seaborn as sns
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='label', data=train_df);

#### Note to ourselves: In terms of labels, it is a **balanced multi-class** problem.
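
A count plot shows balance visually; the same claim can be checked with numbers. The sketch below is one possible way to do it, using a hypothetical helper `label_shares` on a toy series that stands in for `train_df.label`.

```python
import pandas as pd

def label_shares(labels: pd.Series) -> pd.Series:
    """Fraction of rows per label, ordered by label id (illustrative helper)."""
    return labels.value_counts(normalize=True).sort_index()

# Toy stand-in for train_df.label; the real column is much longer
toy_labels = pd.Series([0, 1, 2, 0, 1, 2, 0, 2, 1])
shares = label_shares(toy_labels)
print(shares)
```

On the real data, shares close to one third each would support calling the dataset balanced.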

### Now let us move forward and analyze the language variable

Let us see the different languages we are dealing with:

In [None]:
languages = train_df['language'].unique()
print('The languages are', languages, '\nTotal number of languages:', len(languages))

We have 
Arabic, Bulgarian, Chinese, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, and Vietnamese.

In [None]:
import plotly.express as px
import matplotlib.pyplot as plt

name, count = np.unique(train_df['language'], return_counts = True)
fig = px.pie( values= count, names=name, title='Languages Available to us.')
fig.update_traces(hoverinfo='value+label+percent', textposition='inside', textfont_size=15, textinfo='value+label',
                  marker=dict(line=dict(color='#000100', width=2)))
fig.show()

In [None]:
name, count = np.unique(train_df[train_df['language'] != 'English'].language, return_counts = True)

fig = px.bar(x=name, y=count)
fig.update_traces(texttemplate='%{y:.2s}',  textposition='outside')
fig.update_layout(uniformtext_minsize=15, uniformtext_mode='hide', xaxis_tickangle=-80)
fig.show()

In [None]:
fig = plt.figure(figsize = (25,18))
for i,n in enumerate(train_df.language.unique()):
    ax1 = plt.subplot(5,3,i+1)
    sns.countplot(x='label', data=train_df[train_df.language == n], ax=ax1)
    ax1.set_title(n)
    ax1.set_xlabel('')


#### Note to ourselves: It is a fairly balanced dataset if we separate English from the other languages.
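
The per-language plots can also be summarized as one table of label shares per language. This is a minimal sketch with a hypothetical helper, `label_shares_by_language`, run on a toy frame that only mimics the `language` and `label` columns of `train_df`.

```python
import pandas as pd

def label_shares_by_language(df: pd.DataFrame) -> pd.DataFrame:
    """Rows are languages, columns are labels, each row sums to 1 (illustrative helper)."""
    return pd.crosstab(df["language"], df["label"], normalize="index")

# Toy stand-in for train_df with made-up rows
toy = pd.DataFrame({
    "language": ["English"] * 4 + ["Hindi"] * 2,
    "label": [0, 1, 2, 0, 1, 2],
})
table = label_shares_by_language(toy)
print(table)
```

Rows where every share is near one third would indicate that the labels are balanced within that language.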

### Now let us move forward and analyze the text portion.

In [None]:
import wordcloud

In [None]:
# Join the raw premises; to_string() would also mix row indices and padding into the text
text = ' '.join(train_df.loc[train_df.language == 'English', 'premise'])

In [None]:
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)
# Extra filler words to hide from the cloud
stopwords.update(["many", "always", "you", "well", "time", "mean", "much"])
cloud = WordCloud(stopwords=stopwords, max_font_size=50, max_words=800, background_color="white").generate(text)

# Display the generated image:
plt.figure(figsize=(15, 15))
plt.imshow(cloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Minimum and maximum sentence lengths, counted in space-separated words, for the premise and hypothesis columns.

In [None]:
# Word counts per sentence, splitting on single spaces
premise_len = train_df.premise.apply(lambda x: len(x.split(' ')))
hypothesis_len = train_df.hypothesis.apply(lambda x: len(x.split(' ')))
print('max length of sentence in premise', premise_len.max())
print('min length of sentence in premise', premise_len.min())
print('max length of sentence in hypothesis', hypothesis_len.max())
print('min length of sentence in hypothesis', hypothesis_len.min())
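
One caveat: Chinese and Thai do not put spaces between words, so a space-based count will understate sentence length for those languages. A per-language breakdown makes this visible. The sketch below is only an illustration; `length_summary` is a hypothetical helper and the toy frame stands in for `train_df`.

```python
import pandas as pd

def length_summary(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Min and max space-separated token counts of `column`, per language."""
    lengths = df[column].str.split(" ").str.len()
    return lengths.groupby(df["language"]).agg(["min", "max"])

# Toy stand-in for train_df with made-up rows
toy = pd.DataFrame({
    "language": ["English", "English", "Chinese"],
    "premise": ["a b c", "hello", "你好世界"],
})
summary = length_summary(toy, "premise")
print(summary)
```

The Chinese sentence counts as a single token here even though it has several characters, which is why character counts or a proper tokenizer may be a better measure for such languages.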

Things we have learned about the dataset are:
* Multi Class
* Multi Lingual
* Balanced Dataset