## About the Competition

Any two given sentences can be related in one of three ways: one could entail the other, one could contradict the other, or they could be unrelated. **Natural Language Inference** (NLI) is a popular NLP problem that involves determining how pairs of sentences (consisting of a premise and a hypothesis) are related.

Our task is to create an NLI model that assigns labels of 0, 1, or 2 (corresponding to entailment, neutral, and contradiction) to pairs of premises and hypotheses. To make things more interesting, the train and test sets include text in fifteen different languages!
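
The label encoding described above can be written down as a small lookup table. This is only an illustrative sketch: `LABEL_NAMES` and `describe_label` are hypothetical helper names, not part of the competition data or of any library.

```python
# Hypothetical helper: maps the competition's integer labels to readable names.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def describe_label(label: int) -> str:
    """Return the class name for a label id, or raise on an unknown id."""
    if label not in LABEL_NAMES:
        raise ValueError(f"unknown label: {label}")
    return LABEL_NAMES[label]


print([describe_label(i) for i in range(3)])
```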

**About this kernel**

This kernel acts as a starter kit. It presents the key insights from the given text data.

**Key Takeaways**

* Extensive EDA
* Effective Story Telling
* Creative Feature Engineering
* Modelling

## Importing the Necessary Packages

In [None]:
#!pip install googletrans

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)



## Data Visualisation
from plotly.offline import iplot
from plotly import tools
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as py
import plotly.figure_factory as ff
py.init_notebook_mode(connected=True)
import plotly.offline as pyo

## Data Preprocessing
import re
import nltk
from gensim.models import word2vec

## Visualizing similarity of words
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline

##Translation
#from googletrans import Translator


## Models
import xgboost as xgb
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn import ensemble, metrics, model_selection, naive_bayes
from sklearn.preprocessing import LabelEncoder

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


## Exploratory Data Analysis

#### Simple Data Exploration 

In [None]:
train_df = pd.read_csv("/kaggle/input/contradictory-my-dear-watson/train.csv")
test_df = pd.read_csv("/kaggle/input/contradictory-my-dear-watson/test.csv")
print("Number of rows and columns in train data : ",train_df.shape)
print("Number of rows and columns in test data : ",test_df.shape)

In [None]:
train_df.head()

In [None]:
test_df.head()

#### Target Variable Exploration

In [None]:
Accuracy=pd.DataFrame()
Accuracy['Type']=train_df.label.value_counts().index
Accuracy['Count']=train_df.label.value_counts().values
Accuracy['Type']=Accuracy['Type'].replace(0,'Entailment')
Accuracy['Type']=Accuracy['Type'].replace(1,'Neutral')
Accuracy['Type']=Accuracy['Type'].replace(2,'Contradiction')
Accuracy

In [None]:
py.init_notebook_mode(connected=True)
fig = go.Figure(data=[go.Pie(labels=Accuracy['Type'], values=Accuracy['Count'],hole=0.2)])
fig.update_layout( title={
                    'text': "Percentage distribution of the 3 classes",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'})
fig.show()

fig = px.bar(Accuracy, x='Type', y='Count',
             hover_data=['Count'], color='Count',
             labels={'Count':'Number of records'}, height=400)

fig.update_layout( title={
                    'text': "Count of each of the target classes",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'})
fig.show()

**Observation:**

There are a total of **12120** records, which contain

1. **4176** records of Entailment
2. **4064** records of Contradiction
3. **3880** records of Neutral.

Therefore, there is **no significant class imbalance** in the given data.
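
To make the balance claim concrete, the counts quoted above can be turned into percentage shares. This is a minimal sketch that hard-codes those three counts instead of reading them from `train_df`; a perfectly even split would give each class about 33.3%.

```python
# Class counts quoted in the observation above.
counts = {"Entailment": 4176, "Contradiction": 4064, "Neutral": 3880}
total = sum(counts.values())

# Percentage share of each class, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)
print(shares)
```

No class deviates from an even split by more than about 1.3 percentage points.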

#### Languages in Train and Test data

In [None]:
Languages=pd.DataFrame()
Languages['Type']=train_df.language.value_counts().index
Languages['Count']=train_df.language.value_counts().values

In [None]:
py.init_notebook_mode(connected=True)
fig = go.Figure(data=[go.Pie(labels=Languages['Type'], values=Languages['Count'],hole=0.2)])
fig.update_layout( title={
                    'text': "Percentage distribution of different Languages",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'})
fig.show()

**Observation**

From the above graph, we can see that **English** is the dominant language in the given dataset.

In [None]:
Languages_test=pd.DataFrame()
Languages_test['Type']=test_df.language.value_counts().index
Languages_test['Count']=test_df.language.value_counts().values
a = sum(Languages_test.Count)
Languages_test.Count = Languages_test.Count.div(a).mul(100).round(2)

In [None]:
a = sum(Languages.Count)
Languages.Count = Languages.Count.div(a).mul(100).round(2)

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Train', x=Languages.Type, y=Languages.Count),
    go.Bar(name='Test', x=Languages_test.Type, y=Languages_test.Count)
])
# Change the bar mode
fig.update_layout(barmode='group',title={
                    'text': "Distribution across Train and Test",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'})
fig.show()

**Observation**

From the graph, we can clearly see that the distribution of languages is nearly identical across the train and test data.
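
A visual comparison like this can also be backed by a single number. The sketch below computes the largest gap in percentage share between two lists of language names. The helper names and the toy data are made up for illustration; in the notebook the inputs would presumably be `train_df.language` and `test_df.language`.

```python
from collections import Counter


def language_shares(languages):
    """Return {language: percentage share} for a list of language names."""
    counts = Counter(languages)
    total = len(languages)
    return {lang: 100 * n / total for lang, n in counts.items()}


def max_share_gap(train_languages, test_languages):
    """Largest absolute difference in percentage share over all languages."""
    train = language_shares(train_languages)
    test = language_shares(test_languages)
    all_langs = set(train) | set(test)
    return max(abs(train.get(l, 0.0) - test.get(l, 0.0)) for l in all_langs)


# Made-up toy data, not the competition files.
toy_train = ["English"] * 6 + ["French"] * 2 + ["Thai"] * 2
toy_test = ["English"] * 5 + ["French"] * 3 + ["Thai"] * 2
print(max_share_gap(toy_train, toy_test))
```

A gap close to zero would support the observation that both splits follow the same language distribution.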

### Premise vs Hypothesis

When dealing with text data, feature engineering can be done in two parts:

1. Meta features - features extracted from the text, such as the number of words, number of stop words, number of punctuation marks, etc.
2. Text-based features - features based directly on the text / words, such as frequency, SVD, word2vec, etc.

**Meta Features:**

We will start by creating meta features and see how good they are at predicting the label. The feature list is as follows:

1. Number of words in the text
2. Number of unique words in the text
3. Number of characters in the text
4. Number of stopwords
5. Number of punctuations
6. Number of upper case words
7. Number of title case words
8. Average length of the words

**We'll try to analyse the meta features of the premise and the hypothesis. If possible, we'll try to include them in the models in a later part of this notebook.**
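
The eight features in the list above can all be computed for a single string with the standard library. This is a minimal sketch under stated assumptions: `meta_features` and `TOY_STOPWORDS` are hypothetical names, and the tiny stop-word set stands in for a fuller list such as NLTK's.

```python
import string

# Tiny illustrative stop-word set (an assumption); a real run would likely
# use a fuller list such as NLTK's English stop words.
TOY_STOPWORDS = {"a", "an", "the", "on", "in", "of", "is", "and"}


def meta_features(text, stopwords=TOY_STOPWORDS):
    """Compute the eight meta features listed above for one piece of text."""
    text = str(text)
    words = text.split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "num_chars": len(text),
        "num_stopwords": sum(w.lower() in stopwords for w in words),
        "num_punctuations": sum(c in string.punctuation for c in text),
        "num_words_upper": sum(w.isupper() for w in words),
        "num_words_title": sum(w.istitle() for w in words),
        # Guard against empty text, where the mean is undefined.
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


print(meta_features("The cat SAT on the Mat."))
```

Applied column-wise with `Series.apply`, a function like this would yield one feature column per dictionary key.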

In [None]:
import string

In [None]:
Meta_features = pd.DataFrame()

## Number of words in the text ##
Meta_features["premise_num_words"] = train_df["premise"].apply(lambda x: len(str(x).split()))
Meta_features["hypothesis_num_words"] = train_df["hypothesis"].apply(lambda x: len(str(x).split()))

## Number of characters in the text ##
Meta_features["premise_num_chars"] = train_df["premise"].apply(lambda x: len(str(x)))
Meta_features["hypothesis_num_chars"] = train_df["hypothesis"].apply(lambda x: len(str(x)))

## Number of punctuations in the text ##
Meta_features["premise_num_punctuations"] =train_df["premise"].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
Meta_features["hypothesis_num_punctuations"] =train_df["hypothesis"].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )

## Average length of the words in the text ##
Meta_features["premise_mean_word_len"] = train_df["premise"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
Meta_features["hypothesis_mean_word_len"] = train_df["hypothesis"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

Meta_features['label'] = train_df['label']

In [None]:
fig = go.Figure()

categories = [0,1,2]
# Name[i] must describe label i: 0 = entailment, 1 = neutral, 2 = contradiction
Name = ['Entailment','Neutral','Contradiction']
for category in categories:
    fig.add_trace(go.Violin(x=Meta_features['label'][Meta_features['label'] == category],
                            y=Meta_features['premise_num_words'][Meta_features['label'] == category],
                            name=Name[category],
                            box_visible=True,
                            meanline_visible=True))


fig.update_layout( title={
                    'text': "Number of Premise words per category",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'})

fig.show()

**Observation**

The distribution of word counts is almost the same for Contradiction and Neutral, whereas it is slightly lower for Entailment.

In [None]:
fig = go.Figure()
for category in categories:
    fig.add_trace(go.Violin(x=Meta_features['label'][Meta_features['label'] == category],
                            y=Meta_features['premise_num_punctuations'][Meta_features['label'] == category],
                            name=Name[category],
                            box_visible=True,
                            meanline_visible=True))


fig.update_layout( title={
                    'text': "Number of Punctuations in Premise per category",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'})

fig.show()

**Observation**

The distribution of punctuation counts is almost the same for Neutral and Entailment, whereas it is slightly lower for Contradiction.

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=Meta_features['premise_num_words']))
fig.add_trace(go.Histogram(x=Meta_features['hypothesis_num_words']))

# Overlay both histograms
fig.update_layout(barmode='overlay',title={
                    'text': "Distribution of Words over Premise VS Hypothesis",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'})
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=Meta_features['premise_num_punctuations']))
fig.add_trace(go.Histogram(x=Meta_features['hypothesis_num_punctuations']))

# Overlay both histograms
fig.update_layout(barmode='overlay',title={
                    'text': "Distribution of Punctuations over Premise VS Hypothesis",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'})
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=Meta_features['premise_num_chars']))
fig.add_trace(go.Histogram(x=Meta_features['hypothesis_num_chars']))

# Overlay both histograms
fig.update_layout(barmode='overlay',title={
                    'text': "Distribution of Characters over Premise VS Hypothesis",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'})
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()

**Observations:**

From the histograms above, the following observations were made:

1) The distributions of the hypothesis parameters fall within the range of those of the premise.

2) The peak of each curve represents the most probable value in the dataset, and it is much higher for the hypothesis.
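
The peak of a histogram of counts is simply the most frequent value, which can be read off directly. The sketch below uses made-up word counts, not the competition data, and `histogram_peak` is a hypothetical helper name. It shows how values clustered in a narrow range give a taller peak than values spread over a wide one.

```python
from collections import Counter


def histogram_peak(values):
    """Return (most common value, its count) for a list of values."""
    value, count = Counter(values).most_common(1)[0]
    return value, count


# Made-up word counts for illustration only.
toy_premise_lengths = [5, 12, 12, 20, 31, 8, 12, 17]
toy_hypothesis_lengths = [7, 7, 8, 7, 9, 7, 8, 7]

print(histogram_peak(toy_premise_lengths))
print(histogram_peak(toy_hypothesis_lengths))
```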

### Visualizing Word Vectors in Hypothesis and Premise

As we observed earlier, this data contains a total of 15 languages, so I decided to translate all of the text into English so that we can perform a generalised analysis.

In [None]:
#def Translation(x):
#    translator = Translator()
#    return translator.translate(x).text

In [None]:
#test_df.premise[test_df.lang_abv!= 'en']=test_df.premise[test_df.lang_abv!= 'en'].apply(lambda x: Translation(x))
#print("here")
#test_df.hypothesis[test_df.lang_abv!= 'en']=test_df.hypothesis[test_df.lang_abv!= 'en'].apply(lambda x: Translation(x))

In [None]:
#train_df.premise[train_df.lang_abv!= 'en']=train_df.premise[train_df.lang_abv!= 'en'].apply(lambda x: Translation(x))
#print("here")
#train_df.hypothesis[train_df.lang_abv!= 'en']=train_df.hypothesis[train_df.lang_abv!= 'en'].apply(lambda x: Translation(x))

In [None]:
train_df = pd.read_csv("../input/contradictory-my-watson-translated/train_translated.csv")
test_df = pd.read_csv("../input/contradictory-my-watson-translated/test_translated.csv")

In [None]:
train_df.head()

In [None]:
temp = pd.DataFrame()
temp['premise'] = train_df['premise']
temp['hypothesis'] = train_df['hypothesis']

In [None]:
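# A set gives fast membership tests; stopwords.words() with no argument returns every language's list
STOP_WORDS = set(nltk.corpus.stopwords.words())

def clean_sentence(val):
    # Drop punctuation and underscores, then lowercase
    regex = re.compile(r'([^\s\w]|_)+')
    sentence = regex.sub('', val).lower()
    sentence = sentence.split(" ")

    # Keep only the words that are not stop words
    sentence = [word for word in sentence if word not in STOP_WORDS]

    sentence = " ".join(sentence)
    return sentence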

temp['premise'] =  temp['premise'].apply(clean_sentence)
temp['hypothesis'] =  temp['hypothesis'].apply(clean_sentence)

In [None]:
def build_corpus(data):
    corpus = []
    for col in ['premise', 'hypothesis']:
        for sentence in data[col].items():  # Series.iteritems() was removed in pandas 2.0
            word_list = sentence[1].split(" ")
            corpus.append(word_list)
            
    return corpus

corpus = build_corpus(temp)        

#### t-SNE on word vectors

* t-Distributed Stochastic Neighbor Embedding is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data.  
* It works by taking a group of high-dimensional vocabulary word feature vectors, then compresses them down to 2-dimensional x,y coordinate pairs. 
* The idea is to keep similar words close together on the plane, while maximizing the distance between dissimilar words.

In [None]:
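def tsne_plot(model):
    "Creates a t-SNE model and plots it"
    labels = []
    tokens = []

    # model.wv.vocab is the gensim < 4.0 API; gensim >= 4.0 exposes model.wv.key_to_index instead
    for word in model.wv.vocab:
        tokens.append(model.wv[word])  # index the KeyedVectors; model[word] is deprecated
        labels.append(word)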
    
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(16, 16)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

In [None]:
# A more selective model
# `size` is the gensim < 4.0 argument name; gensim >= 4.0 calls it `vector_size`
model = word2vec.Word2Vec(corpus, size=100, window=20, min_count=150, workers=4)
tsne_plot(model)

The plot above shows the similarity of words after the text has been converted into English.

## Modelling

Although this competition mainly focuses on Transformer models, I am an ardent fan of boosting algorithms, so I apply both Transformer models and boosting algorithms here for the sake of understanding.

#### XGBoost on Natural Language Inference

**Creative Feature Engineering**

As mentioned earlier in this notebook, we will try to use the meta features derived above to check whether they have any impact on the prediction.

In [None]:
## Number of words in the text ##
train_df["premise_num_words"] = train_df["premise"].apply(lambda x: len(str(x).split()))
train_df["hypothesis_num_words"] = train_df["hypothesis"].apply(lambda x: len(str(x).split()))
test_df["premise_num_words"] = test_df["premise"].apply(lambda x: len(str(x).split()))
test_df["hypothesis_num_words"] = test_df["hypothesis"].apply(lambda x: len(str(x).split()))

## Number of characters in the text ##
train_df["premise_num_chars"] = train_df["premise"].apply(lambda x: len(str(x)))
train_df["hypothesis_num_chars"] = train_df["hypothesis"].apply(lambda x: len(str(x)))
test_df["premise_num_chars"] = test_df["premise"].apply(lambda x: len(str(x)))
test_df["hypothesis_num_chars"] = test_df["hypothesis"].apply(lambda x: len(str(x)))

## Number of punctuations in the text ##
train_df["premise_num_punctuations"] =train_df["premise"].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
train_df["hypothesis_num_punctuations"] =train_df["hypothesis"].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
test_df["premise_num_punctuations"] = test_df["premise"].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
test_df["hypothesis_num_punctuations"] = test_df["hypothesis"].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )

## Average length of the words in the text ##
train_df["premise_mean_word_len"] = train_df["premise"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
train_df["hypothesis_mean_word_len"] = train_df["hypothesis"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test_df["premise_mean_word_len"] = test_df["premise"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test_df["hypothesis_mean_word_len"] = test_df["hypothesis"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

## Language Transformation
lb_make = LabelEncoder()
train_df["language"] = lb_make.fit_transform(train_df["language"])
# Reuse the mapping fitted on train so each language gets the same code in test
# (transform raises if test contains a language unseen in train)
test_df["language"] = lb_make.transform(test_df["language"])

## lang_abv Transformation
lb_make = LabelEncoder()
train_df["lang_abv"] = lb_make.fit_transform(train_df["lang_abv"])
test_df["lang_abv"] = lb_make.transform(test_df["lang_abv"])

In [None]:
from nltk.corpus import stopwords
import re
import nltk
import string

stop_words = set(stopwords.words('english')) 
def text_cleaner(text):
    newString = text.lower()
    newString = re.sub(r'\([^)]*\)', '', newString)
    newString = re.sub('"','', newString)    
    newString = re.sub(r"'s\b","",newString)
    newString = re.sub("[^a-zA-Z]", " ", newString) 
    tokens = [w for w in newString.split() if not w in stop_words]
    long_words=[]
    for i in tokens:
        if len(i)>=3:                  #removing short word
            long_words.append(i)   
    return (" ".join(long_words)).strip()

# Clean the premise and hypothesis columns of both frames
for df in (train_df, test_df):
    for col in ('premise', 'hypothesis'):
        df[col] = df[col].apply(text_cleaner)

In [None]:
## premise
tfidf_vec = TfidfVectorizer(analyzer='word',max_features=1000)
tfidf_vec.fit(train_df['premise'].values.tolist() + test_df['premise'].values.tolist())
train_premise = tfidf_vec.transform(train_df['premise'].tolist())
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
df1 = pd.DataFrame(train_premise.toarray(), columns=tfidf_vec.get_feature_names_out()).add_suffix('_premise')
train_df = pd.concat([train_df, df1], axis = 1)

test_premise = tfidf_vec.transform(test_df['premise'].tolist())
df1 = pd.DataFrame(test_premise.toarray(), columns=tfidf_vec.get_feature_names_out()).add_suffix('_premise')
test_df = pd.concat([test_df, df1], axis = 1)

## hypothesis
tfidf_vec = TfidfVectorizer(analyzer='word',max_features=1000)
tfidf_vec.fit(train_df['hypothesis'].values.tolist() + test_df['hypothesis'].values.tolist())
train_hypothesis = tfidf_vec.transform(train_df['hypothesis'].tolist())
df1 = pd.DataFrame(train_hypothesis.toarray(), columns=tfidf_vec.get_feature_names_out()).add_suffix('_hypothesis')
train_df = pd.concat([train_df, df1], axis = 1)

test_hypothesis = tfidf_vec.transform(test_df['hypothesis'].tolist())
df1 = pd.DataFrame(test_hypothesis.toarray(), columns=tfidf_vec.get_feature_names_out()).add_suffix('_hypothesis')
test_df = pd.concat([test_df, df1], axis = 1)

In [None]:
def runXGB(train_X, train_y, test_X, test_y=None, test_X2=None, seed_val=0, child=1, colsample=0.3):
    param = {}
    param['objective'] = 'multi:softprob'
    param['eta'] = 0.1
    param['max_depth'] = 3
    param['silent'] = 1
    param['num_class'] = 3
    param['eval_metric'] = "mlogloss"
    param['min_child_weight'] = child
    param['subsample'] = 0.8
    param['colsample_bytree'] = colsample
    param['seed'] = seed_val
    num_rounds = 2000

    plst = list(param.items())
    xgtrain = xgb.DMatrix(train_X, label=train_y)

    if test_y is not None:
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [ (xgtrain,'train'), (xgtest, 'test') ]
        model = xgb.train(plst, xgtrain, num_rounds, watchlist, early_stopping_rounds=50, verbose_eval=20)
    else:
        xgtest = xgb.DMatrix(test_X)
        model = xgb.train(plst, xgtrain, num_rounds)

    pred_test_y = model.predict(xgtest, ntree_limit = model.best_ntree_limit)
    pred_test_y2 = None  # stays None when no second test set is passed
    if test_X2 is not None:
        xgtest2 = xgb.DMatrix(test_X2)
        pred_test_y2 = model.predict(xgtest2, ntree_limit = model.best_ntree_limit)
    return pred_test_y, pred_test_y2, model

In [None]:
train_X = train_df.drop(list(train_df.columns[[0,1]])+['label']+['premise','hypothesis'], axis=1)
test_X = test_df.drop(list(test_df.columns[[0,1]])+['premise','hypothesis'], axis=1)
train_y = train_df['label']

kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=2017)
cv_scores = []
pred_full_test = 0
pred_train = np.zeros([train_df.shape[0], 3])

for dev_index, val_index in kf.split(train_X):
    dev_X, val_X = train_X.loc[dev_index], train_X.loc[val_index]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    pred_val_y, pred_test_y, model = runXGB(dev_X, dev_y, val_X, val_y, test_X, seed_val=0, colsample=0.7)
    pred_full_test = pred_full_test + pred_test_y
    pred_train[val_index,:] = pred_val_y
    cv_scores.append(metrics.log_loss(val_y, pred_val_y))
    break  # stops after the first fold; remove this line to run all 5 folds
print("cv scores : ", cv_scores)

out_df = pd.DataFrame(pred_full_test)

In [None]:
submission = pd.DataFrame()
submission['id'] = test_df['id']
submission['prediction'] = out_df.idxmax(axis=1)

submission.to_csv("submission.csv", index=False)

Since the boosting algorithm didn't perform as expected, let's move on to Transformers.
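
One way to put the printed CV log loss in context is to compare it with a trivial baseline. The sketch below is a plain-Python version of multiclass log loss, not the scikit-learn implementation, and `multiclass_log_loss` is a hypothetical helper name. It evaluates a "model" that always predicts 1/3 for each class, which scores ln(3) whatever the true labels are; a useful model should score below that.

```python
import math


def multiclass_log_loss(y_true, y_prob):
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-15  # avoid log(0)
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        total += -math.log(max(probs[label], eps))
    return total / len(y_true)


# A "model" that always predicts 1/3 for each of the three classes.
labels = [0, 1, 2, 1, 0]
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in labels]

print(multiclass_log_loss(labels, uniform))  # equals ln(3), about 1.0986
```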

## Transformer Models

Since Transformer models are new to me, I'll try implementing a BERT Transformer on TPUs in the coming week, turning this into a complete tutorial.

#### Reference

1) [Spooky Author](https://github.com/SudalaiRajkumar/Kaggle/blob/master/SpookyAuthor/simple_fe_notebook_spooky_author.ipynb)

2) [Visualisation using T-sne](https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne)

This notebook is heavily inspired by the above sources.

### I will be working extensively on this kernel in the coming weeks.

### Please stay tuned for more updates. If you have any suggestions, please leave them in the comments.

### Please upvote the kernel if you find it useful.