## Challenge

In this competition we are challenged to use all available NLP tools to "teach" an algorithm to read the way we do, or at least the way we should. It is too idealistic to say that we will teach a CPU to understand the language humans speak, but Natural Language Processing techniques help us translate the problem into one a machine can work with.

Before jumping into crunching the numbers, though, let us take a look at the given data and see what conclusions we can draw from it.

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 5);
sns.set_style('whitegrid')

In [None]:
root_dir = '../input/contradictory-my-dear-watson'
train_path = 'train.csv'
test_path = 'test.csv'
sub_path = 'sample_submission.csv'

In [None]:
train_df = pd.read_csv(os.path.join(root_dir, train_path))
test_df = pd.read_csv(os.path.join(root_dir, test_path))
train_df.head()

In [None]:
sorted(train_df.language.unique()) == sorted(test_df.language.unique())

Good to know: we have the same set of languages in both test and training sets.

In [None]:
train_df.language.value_counts()

To be honest, it is not surprising that English dominates the dataset.

In [None]:
train_df.label.hist(color='orange')

In [None]:
train_df.isna().sum()

There is not much we can draw from the regular (non-text) columns, so let's jump into text processing. We start simple.

In [None]:
for each in ['premise', 'hypothesis']:
    # Character count per row, computed once and reused for every statistic
    n_symbols = train_df[each].str.len()
    print(f'Mean symbols in {each}:', n_symbols.mean())
    print(f'Maximum symbols in {each}:', n_symbols.max())
    print(f'Minimum symbols in {each}:', n_symbols.min())
    print(f'Median symbols in {each}:', n_symbols.median())

In [None]:
for each in ['premise', 'hypothesis']:
    # Word count per row (split on single spaces), computed once and reused
    n_words = train_df[each].str.split(' ').str.len()
    print(f'Mean number of words in {each}:', n_words.mean())
    print(f'Maximum number of words in {each}:', n_words.max())
    print(f'Minimum number of words in {each}:', n_words.min())
    print(f'Median number of words in {each}:', n_words.median())

In [None]:
train_df['premise_len'] = train_df['premise'].apply(lambda x: len(x.split(' ')))
train_df['hypothesis_len'] = train_df['hypothesis'].apply(lambda x: len(x.split(' ')))

In [None]:
fig, ax = plt.subplots(1, 3)
train_df[train_df.label==0].premise_len.hist(ax=ax[0], color='gray', label='entailment', bins=10)
ax[0].legend();
train_df[train_df.label==1].premise_len.hist(ax=ax[1], color='gold', label='neutral', bins=10)
ax[1].legend();
train_df[train_df.label==2].premise_len.hist(ax=ax[2], color='olive', label='contradiction', bins=10)
ax[2].legend();

In [None]:
fig, ax = plt.subplots(1, 3)
train_df[train_df.label==0].hypothesis_len.hist(ax=ax[0], color='gray', label='entailment', bins=10)
ax[0].legend();
train_df[train_df.label==1].hypothesis_len.hist(ax=ax[1], color='gold', label='neutral', bins=10)
ax[1].legend();
train_df[train_df.label==2].hypothesis_len.hist(ax=ax[2], color='olive', label='contradiction', bins=10)
ax[2].legend();

It seems hypotheses have a much wider distribution despite having fewer words per sentence on average. Still, these are not really good metrics when dealing with several languages: Spanish, for example, is usually considered a more verbose language overall. Let's concentrate on each language individually. That is where we will start in the next version of this notebook.
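As a rough illustration of looking at each language separately, the word counts can be grouped by the `language` column. This is only a sketch: `toy_df` and its rows are made up to stand in for `train_df`, and splitting on spaces is a poor proxy for word counts in languages written without spaces between words.

```python
import pandas as pd

# Toy stand-in for train_df; the rows below are invented for illustration only.
toy_df = pd.DataFrame({
    'language': ['English', 'English', 'Spanish', 'Spanish'],
    'premise': ['the cat sat on the mat', 'it rains',
                'el gato se sentó en la alfombra', 'llueve mucho hoy'],
    'hypothesis': ['a cat sat', 'it is dry',
                   'un gato se sentó', 'hoy no llueve'],
})

# Whitespace word counts, same definition as premise_len / hypothesis_len.
for col in ['premise', 'hypothesis']:
    toy_df[f'{col}_len'] = toy_df[col].str.split(' ').str.len()

# Mean and median word counts per language.
per_language = (
    toy_df.groupby('language')[['premise_len', 'hypothesis_len']]
    .agg(['mean', 'median'])
)
print(per_language)
```

On the real data the same `groupby` would run on `train_df` directly, since it already carries the `premise_len` and `hypothesis_len` columns.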

In [None]:
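# .copy() gives an independent frame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
lang_en = train_df[train_df.language=='English'].copy()
lang_en.describe()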

In [None]:
from nltk.probability import FreqDist
from nltk.corpus import stopwords
sw = stopwords.words('english')

lang_en.loc[:, 'premise'] = lang_en['premise'].apply(lambda x: x.lower())
lang_en.loc[:, 'hypothesis'] = lang_en['hypothesis'].apply(lambda x: x.lower())

p = ' '.join(lang_en['premise'].tolist()).split(' ')
h = ' '.join(lang_en['hypothesis'].tolist()).split(' ')
f_dist_p = FreqDist([x for x in p if x.replace('.', '') not in sw and len(x)>1])
f_dist_h = FreqDist([x for x in h if x.replace('.', '') not in sw and len(x)>1])

In [None]:
p_common = f_dist_p.most_common(20)
plt.bar([x[0] for x in p_common], [x[1] for x in p_common], 
        color='purple', label='most common in premise');
plt.legend();

In [None]:
h_common = f_dist_h.most_common(20)
plt.bar([x[0] for x in h_common], [x[1] for x in h_common], 
        color='purple', label='most common in hypothesis');
plt.legend();

In [None]:
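import spacy
# The 'en' shortcut was removed in spaCy 3; load the small English model by its
# full name (assumes en_core_web_sm is installed in the environment)
nlp = spacy.load('en_core_web_sm')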

In [None]:
doc = nlp(lang_en.loc[17, 'premise'])
spacy.displacy.render(doc, style='dep', options={'distance':80})

In [None]:
doc = nlp(lang_en.loc[17, 'hypothesis'])
spacy.displacy.render(doc, style='dep', options={'distance':80})

In [None]:
lang_en.loc[17, 'label']

That seems about right. My hypothesis, though, is that keywords and the presence of negating words will be quite predictive of the label. Before we boil the ocean with compute, let's try another example.
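One cheap way to probe the negation idea is to flag hypotheses that contain a negating word and count the flags per label. The sketch below is not part of the original analysis: the `NEGATIONS` list and the `toy_pairs` examples are assumptions made up for illustration, and the label codes follow the plots above (0 entailment, 1 neutral, 2 contradiction).

```python
from collections import Counter

# Hypothetical, deliberately small negation list; extend it as needed.
NEGATIONS = {'no', 'not', 'never', 'nobody', 'nothing'}

def has_negation(text):
    """Return True if any token is a negation word or ends with "n't"."""
    tokens = text.lower().replace('.', '').replace(',', '').split()
    return any(tok in NEGATIONS or tok.endswith("n't") for tok in tokens)

# Made-up (hypothesis, label) pairs, not taken from the dataset.
toy_pairs = [
    ('The man is sleeping.', 0),
    ('The man is not sleeping.', 2),
    ("She didn't go to the store.", 2),
    ('She might like apples.', 1),
]

# Count how often a negation shows up in the hypothesis for each label.
negation_by_label = Counter(
    label for text, label in toy_pairs if has_negation(text)
)
print(negation_by_label)
```

On the real data the same flag could be computed with `lang_en['hypothesis'].apply(has_negation)` and compared against `label`.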

In [None]:
doc = nlp(lang_en.loc[321, 'premise'])
spacy.displacy.render(doc, style='dep', options={'distance':60})

In [None]:
doc = nlp(lang_en.loc[321, 'hypothesis'])
spacy.displacy.render(doc, style='dep', options={'distance':60})

In [None]:
lang_en.loc[321, 'label']

Yeah, it could not be that easy. Nevertheless, a useful insight: we should take into account the meaning of the words, especially those which differ between the sentences.

In [None]:
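def leave_diff(string1, string2):
    """Return the words of string1 that do not occur in string2, space-joined."""
    # Lowercase and drop periods and commas so 'Mat.' and 'mat' compare equal
    string1 = string1.lower().replace('.', '').replace(',', '')
    string2 = string2.lower().replace('.', '').replace(',', '')
    tokens1 = string1.split(' ')
    tokens2 = string2.split(' ')
    # One-directional set difference; the order of the result is not guaranteed
    diff = set(tokens1).difference(tokens2)
    return ' '.join(diff)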
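Note that `leave_diff` only keeps words that appear in the first string and not in the second, so anything the hypothesis adds (a negation, for example) is lost. A possible variant that keeps both directions is sketched below; `two_way_diff` is a hypothetical helper, not something used elsewhere in this notebook, and the example sentences are invented.

```python
def two_way_diff(premise, hypothesis):
    """Return (only_in_premise, only_in_hypothesis) as sorted token lists."""
    def tokenize(text):
        # Same light cleaning as leave_diff: lowercase, drop '.' and ','.
        return set(text.lower().replace('.', '').replace(',', '').split(' '))

    p_tokens, h_tokens = tokenize(premise), tokenize(hypothesis)
    # Sorting makes the output deterministic, unlike a raw set.
    return sorted(p_tokens - h_tokens), sorted(h_tokens - p_tokens)

only_p, only_h = two_way_diff('The cat sat on the mat.',
                              'The cat did not sit on the mat.')
print(only_p)  # ['sat']
print(only_h)  # ['did', 'not', 'sit']
```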

In [None]:
# Build the column in one pass instead of assigning row by row with .loc
lang_en['diff'] = [
    leave_diff(p, h) for p, h in zip(lang_en['premise'], lang_en['hypothesis'])
]
lang_en.head()

Notice here: the usual simple bag-of-words approach certainly will not be an option. But even though we are encouraged to use a TPU here, we should exhaust simple approaches before boiling the ocean with compute, right?

In [None]:
lang_en.loc[7, ['premise', 'hypothesis', 'diff']]

In [None]:
lang_en[lang_en['diff']=='']['label'].value_counts()
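In the spirit of exhausting simple approaches first, one more cheap feature to try is the token overlap between premise and hypothesis. The sketch below uses the Jaccard index; `jaccard_overlap` is a hypothetical helper and the example sentences are made up, so treat it as a starting point rather than a tested feature.

```python
def jaccard_overlap(premise, hypothesis):
    """Share of distinct tokens the two sentences have in common (0.0 to 1.0)."""
    p_tokens = set(premise.lower().replace('.', '').replace(',', '').split())
    h_tokens = set(hypothesis.lower().replace('.', '').replace(',', '').split())
    union = p_tokens | h_tokens
    if not union:
        # Both strings are empty after cleaning; define the overlap as 0.
        return 0.0
    return len(p_tokens & h_tokens) / len(union)

overlap = jaccard_overlap('The cat sat on the mat.', 'The cat sat on the rug.')
print(round(overlap, 3))
```

A pair with an empty `diff` can still have an overlap below 1.0 whenever the hypothesis adds words of its own, which is one reason to look at both quantities together.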