# NLP: Contradictory, My Dear Watson

We analyse the Contradictory, My Dear Watson competition from Kaggle.

## Index

- [1. Import libraries and download data](#section1)
- [2. Dataframe Analysis](#section2)
- [3. Model](#section3)


## 1. Import libraries and download data <a id='section1'></a>

In [None]:
import numpy as np
import pandas as pd
import string
import matplotlib.pyplot as plt
import seaborn as sns


import nltk
from nltk.corpus import stopwords

import sys
from textblob import TextBlob 

import re

from IPython.display import display
import plotly.express as px
import spacy


In [None]:
# Input data files are available in the read-only "../input/" directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
path = '/kaggle/input/contradictory-my-dear-watson/'

train = pd.read_csv(path + 'train.csv')
test = pd.read_csv(path + 'test.csv')
sample_submission = pd.read_csv(path + 'sample_submission.csv')

## 2. Dataframe Analysis <a id='section2'></a>
 
We analyse the data contained in the train and test dataframes.

### 2.1. Shape and head

- train

The train set has 6 columns (5 features and 1 target) and 12120 rows.

In [None]:
print("  - Train: \ntrain shape:", train.shape)
print("Head:")
train.head()

- Test

The test set has 5 columns (5 features) and 5191 rows.

In [None]:
print("  - Test: \ntest shape:", test.shape)
print("Head:")
test.head()

### 2.2. Feature types and NaN values

In [None]:
train.info()

In [None]:
test.info()

According to the output above, neither the train set nor the test set contains any NaN values.
The features are strings and the target is an integer (a categorical variable).
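
The same check can be done programmatically rather than by reading the `info()` output. This is a minimal sketch on a small made-up frame; `toy` and its rows are illustrative and not taken from the competition data.

```python
import pandas as pd

# Hypothetical stand-in for the train set: two text columns and an integer label.
toy = pd.DataFrame({
    "premise": ["He went home.", "It is raining.", None],
    "hypothesis": ["He left.", "The sky is clear.", "She sings."],
    "label": [0, 2, 1],
})

# Number of missing values per column; all zeros would mean no NaN anywhere.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Column dtypes: text columns are object/string, the label is an integer type.
print(toy.dtypes)
```

On the real data, `train.isna().sum()` and `test.isna().sum()` would be expected to show zeros in every column.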

### 2.3. Features and target

In this section, we visualise the variables to get an idea of what kind of data we have.


- Target (label)

We draw a barplot showing the frequency of each value of the target.

In [None]:
fig, axes = plt.subplots(nrows=1,ncols=1, figsize=(12,5))
train['label'].value_counts().plot.bar(ax=axes)
plt.title('label - target')
plt.show()

Meaning of the label:

0-> entailment

1-> neutral

2-> contradiction

Looking at the plot, we see that the three classes of the target are represented in quite similar proportions.
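
The balance seen in the barplot can also be expressed as numbers. This is a small sketch with made-up labels; `labels` stands in for `train['label']`, and `balance_ratio` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical labels; in the notebook this would be train['label'].
labels = pd.Series([0, 1, 2, 0, 1, 2, 0, 2], name="label")

# Share of each class instead of raw counts, sorted by class id.
shares = labels.value_counts(normalize=True).sort_index()

# Ratio between the rarest and the most common class (1.0 = perfectly balanced).
balance_ratio = shares.min() / shares.max()
print(shares.round(3))
print("balance ratio:", round(balance_ratio, 2))
```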

- language

The data is multilingual, so we want to see which languages are represented and in what proportion. To do so, we plot one frequency barplot for the train set and another for the test set. This also lets us check whether both sets have a similar distribution.

In [None]:
def plot_table(data, name_data):
    
    pd_language_prob = round(pd.DataFrame(data.language.value_counts()).transpose()/data['language'].count(),2)
    
    lang = pd_language_prob.columns.tolist()
    prob = pd_language_prob.values.tolist()[0]

    table = [['prob'+'('+i+')',j] for (i,j) in zip(lang,prob)]

    fig = plt.figure()
    
    # definitions for the axes
    left, width = 0.10, 1.5
    bottom, height = 0.1, .8
    bottom_h = left_h = left + width + 0.02

    rect_cones = [left, bottom, width, height]
    rect_box = [left_h, bottom, 0.17, height]

    # plot
    ax1 = plt.axes(rect_cones)
    data.groupby('language')['language'].agg(['count']).plot.bar(ax=ax1)
   
  
    plt.title("Frequency of languages " + name_data + " set")
    
    ax2 = plt.axes(rect_box)
    my_table = ax2.table(cellText = table, loc ='right')
    my_table.set_fontsize(40)
    my_table.scale(4,4)
    ax2.axis('off')
    plt.show()

In [None]:

plot_table(train,'train')

In [None]:
plot_table(test,'test')

The dataframes contain 15 languages. English is the most frequent language, and the other 14 are represented in equal proportions. Moreover, the language distributions of the two datasets match.
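
One way to compare the two distributions directly, rather than by reading two plots, is to put the proportions side by side. This is a sketch with made-up language columns; `train_lang`, `test_lang` and `max_gap` are illustrative names.

```python
import pandas as pd

# Hypothetical language columns standing in for train['language'] and test['language'].
train_lang = pd.Series(["English", "English", "French", "Thai"])
test_lang = pd.Series(["English", "French", "English", "Thai"])

# Proportion of each language in each set, aligned on the language name.
comparison = pd.concat(
    [train_lang.value_counts(normalize=True), test_lang.value_counts(normalize=True)],
    axis=1,
    keys=["train", "test"],
).fillna(0.0)

# Largest absolute gap between the two sets; values near 0 mean matching distributions.
max_gap = (comparison["train"] - comparison["test"]).abs().max()
print(comparison)
print("max gap:", max_gap)
```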

- languages and label

Now we focus on how the target is distributed across the languages. We plot a barplot with three bars per language, one for each label: entailment, neutral and contradiction.

In [None]:
pd_language_label = pd.DataFrame(train[['language','label']].groupby(['language','label'])['label'].count())
fig, axes = plt.subplots(nrows=1,ncols=1, figsize=(12,9))
pd_language_label.unstack().plot.barh(ax=axes)
plt.title('label vs language')
plt.show()

The distribution of the three labels across the languages is quite uniform: every language shows the same behaviour.
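
A normalised cross-tabulation gives the same picture as a table. This is a minimal sketch on made-up rows, assuming the same `language` and `label` column names as the train set.

```python
import pandas as pd

# Hypothetical rows; in the notebook these columns come from the train set.
toy = pd.DataFrame({
    "language": ["English", "English", "English", "French", "French", "French"],
    "label": [0, 1, 2, 0, 1, 2],
})

# Rows are languages, columns are labels; normalize='index' makes each row sum to 1.
label_share = pd.crosstab(toy["language"], toy["label"], normalize="index")
print(label_share.round(2))
```

Rows that look alike indicate that the label distribution does not depend on the language.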

#### 2.3.1. English

In this section, we focus only on the entries in English. We study the main features, hypothesis and premise, because they are the inputs to our model, and we visualise some variables to look for patterns.

In [None]:
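# .copy() gives an independent dataframe, so adding columns later does not raise SettingWithCopyWarning
df_English = train[train['language']=='English'].copy()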

In [None]:
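# Load spaCy's English pipeline (used below to tokenize the text)
# Note: the 'en' shortcut only works in spaCy 2.x; newer versions need the full name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')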

In [None]:
class removing():
    '''Text-cleaning helpers, called on the class itself, e.g. removing.lower(text)'''
    @staticmethod
    def lower(texto):
        return(str(texto).lower())
    @staticmethod
    def remove_url(texto):
        # drop http:// and https:// links
        return(re.sub(r'http://\S+|https://\S+','', texto))
    @staticmethod
    def remove_emoji(texto):
        emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
        return emoji_pattern.sub(r'', texto)
    @staticmethod
    def remove_punctuation(texto):
        # keep only word characters and whitespace
        return(re.sub(r'[^\w\s]','',texto))

In [None]:
# clean the premise and hypothesis text and store the results in two new features
df_English.loc[:,'premise_modify'] = df_English.apply(lambda x: removing.lower(x.premise), axis=1)
df_English.loc[:,'premise_modify'] = df_English.apply(lambda x:removing.remove_url(x.premise_modify), axis=1)
df_English.loc[:,'premise_modify'] = df_English.apply(lambda x: removing.remove_punctuation(x.premise_modify),axis=1)
df_English.loc[:,'premise_modify'] = df_English.apply(lambda x: removing.remove_emoji(x.premise_modify), axis=1)

df_English.loc[:,'hypothesis_modify'] = df_English.apply(lambda x:removing.lower(x.hypothesis), axis=1)
df_English.loc[:,'hypothesis_modify'] = df_English.apply(lambda x:removing.remove_url(x.hypothesis_modify), axis=1)
df_English.loc[:,'hypothesis_modify'] = df_English.apply(lambda x: removing.remove_punctuation(x.hypothesis_modify), axis=1)
df_English.loc[:,'hypothesis_modify'] = df_English.apply(lambda x: removing.remove_emoji(x.hypothesis_modify), axis=1)


- Histograms of the number of words

We clean the premise and hypothesis text because we only want to count words, so we remove punctuation and other symbols. We then draw 6 histograms, with one row per label and one column each for premise and hypothesis.
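
To make the idea concrete, here is a small sketch of cleaning and counting on a single sentence. It uses a plain whitespace split instead of spaCy, so it is an approximation: spaCy's tokenizer can split some strings differently and the counts need not match exactly. The helper `clean_and_count` is a name introduced here for illustration.

```python
import re

def clean_and_count(text):
    """Lower-case, strip punctuation, and count whitespace-separated words."""
    cleaned = re.sub(r"[^\w\s]", "", str(text).lower())
    return len(cleaned.split())

example = "Well, I wasn't even thinking about that!"
print(clean_and_count(example))
```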

In [None]:
def count_words(df, feature):
    num_words = [] 
    for i in df[feature]:
        aux = nlp(i)
        num_words.append(len(aux))
    return(num_words)


In [None]:
df_English.loc[:,'num_words_premise'] = count_words(df_English,'premise_modify')
df_English.loc[:,'num_words_hypothesis'] = count_words(df_English,'hypothesis_modify')


In [None]:
fig, axes = plt.subplots(nrows=3,ncols=2, figsize=(12,8))
plt.subplots_adjust(hspace = 0.5)
df_English[df_English.label == 0].num_words_premise.plot.hist(bins=40,  ax = axes[0][0], color='blue')
axes[0][0].set_title('# Words Premise Histogram (Entailment)')
df_English[df_English.label == 0].num_words_hypothesis.plot.hist(bins=40,  ax = axes[0][1], color='orange')
axes[0][1].set_title('# Words Hypothesis Histogram (Entailment)')

df_English[df_English.label == 1].num_words_premise.plot.hist(bins=40,  ax = axes[1][0], color='blue')
axes[1][0].set_title('# Words Premise Histogram (Neutral)')
df_English[df_English.label == 1].num_words_hypothesis.plot.hist(bins=40,  ax = axes[1][1], color='orange')
axes[1][1].set_title('# Words Hypothesis Histogram (Neutral)')

df_English[df_English.label == 2].num_words_premise.plot.hist(bins=40,  ax = axes[2][0], color='blue')
axes[2][0].set_title('# Words Premise Histogram (Contradiction)')
# label == 2 is contradiction (the original line filtered on label == 1 by mistake)
df_English[df_English.label == 2].num_words_hypothesis.plot.hist(bins=40,  ax = axes[2][1], color='orange')
axes[2][1].set_title('# Words Hypothesis Histogram (Contradiction)')
plt.show()

# summary dataframe: mean and standard deviation of the word counts for each label

mean_premise = []
std_premise = []
mean_hypothesis = []
std_hypothesis = []
for i in range(3):
    for j in ['num_words_premise', 'num_words_hypothesis']:
        mean_aux = round(df_English[df_English.label == i][j].mean(),2)
        std_aux = round(df_English[df_English.label == i][j].std(),2)
        if j == 'num_words_premise':
            mean_premise.append(mean_aux)
            std_premise.append(std_aux)
        else:
            mean_hypothesis.append(mean_aux)
            std_hypothesis.append(std_aux)

index_list = ['Entailment', 'Neutral', 'Contradiction']

pd_aux = pd.DataFrame({'mean_premise': mean_premise, 'std_premise':std_premise, 
                           'mean_hypothesis': mean_hypothesis, 'std_hypothesis': std_hypothesis}, index = index_list)


pd_aux

Comparing the histograms above, the premise histograms have similar distributions across the labels, and the same is true of the hypothesis histograms. Therefore, we cannot observe any difference between the three labels, since they behave similarly in terms of number of words.
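
A per-label summary table of this kind can also be built in a single `groupby` call. The sketch below uses made-up word counts; only the column names are assumed to match the ones used in this notebook.

```python
import pandas as pd

# Hypothetical word counts; the real columns are num_words_premise / num_words_hypothesis.
toy = pd.DataFrame({
    "label": [0, 0, 1, 1, 2, 2],
    "num_words_premise": [10, 14, 9, 11, 20, 22],
    "num_words_hypothesis": [5, 7, 6, 6, 8, 10],
})

# Mean and standard deviation of both counts for each label in one call.
summary = (
    toy.groupby("label")[["num_words_premise", "num_words_hypothesis"]]
    .agg(["mean", "std"])
    .round(2)
    .rename(index={0: "Entailment", 1: "Neutral", 2: "Contradiction"})
)
print(summary)
```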

- stopwords

We would like to see how the stopwords are distributed. To do so, we select the 20 most frequent stopwords and plot them in treemaps, separately for each label (entailment, neutral and contradiction) and for premise and hypothesis.
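
The selection of the most frequent stopwords can be sketched with `collections.Counter`. The stopword set and sentences below are made up for illustration; the notebook itself uses NLTK's English list.

```python
from collections import Counter

# Small illustrative stopword set; the notebook uses NLTK's English list instead.
toy_stopwords = {"the", "a", "is", "in", "of", "to"}

sentences = ["the cat is in the garden", "a dog is sleeping", "the end of the road"]

# Keep only stopwords from every sentence and count them across the whole collection.
counts = Counter(
    word for sentence in sentences for word in sentence.split() if word in toy_stopwords
)

# most_common(n) returns the n most frequent (word, count) pairs, most frequent first.
top = counts.most_common(2)
print(top)
```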

In [None]:
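# a set makes the membership test "x in stop" fast
stop = set(stopwords.words('english'))
class stop_words():
    '''Stopword helpers, called on the class itself, e.g. stop_words.get_stopwords(text)'''
    @staticmethod
    def get_stopwords(texto):
        # keep only the stopwords
        return(' '.join([x for x in texto.split(' ') if x in stop]))
    @staticmethod
    def remove_stopwords(texto):
        # keep everything except the stopwords
        return(' '.join([x for x in texto.split(' ') if not x in stop]))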

In [None]:
df_English.loc[:, 'stopwords_premise'] = df_English.premise_modify.apply(lambda x: stop_words.get_stopwords(x))
df_English.loc[:, 'stopwords_hypothesis'] = df_English.hypothesis_modify.apply(lambda x: stop_words.get_stopwords(x))
df_English.loc[:, 'premise_notstop'] = df_English.premise_modify.apply(lambda x: stop_words.remove_stopwords(x))
df_English.loc[:, 'hypothesis_notstop'] = df_English.hypothesis_modify.apply(lambda x: stop_words.remove_stopwords(x))

In [None]:

def count_list(list_pass):
    '''Count how many times each word appears in a list of words'''
    count = {}
    for word in list_pass:
        if word in count:
            count[word] += 1
        else:
            count[word] = 1
    return(count)

In [None]:
def draw_treemap(serie, title_name):
    '''Draw a treemap of the 20 most frequent words in a Series of strings.
    Note: pass the Series itself (e.g. df['feature']) and the title of the plot
    '''
    list_aux = [w.split(' ') for w in serie]
    list_aux = [item for sublist in list_aux for item in sublist]
    dict_aux = count_list(list_aux)
    df_aux = pd.DataFrame({'words': list(dict_aux.keys()), 'number':list(dict_aux.values())})
    df_aux = df_aux.sort_values(by=['number'], ascending=False)[0:20]

    fig = px.treemap(df_aux, path=['words'], values='number', width=900, height=400, title=title_name)
    fig.show()

In [None]:
#print('STOPWORDS_PREMISE:')
draw_treemap(df_English[df_English['label']==0].stopwords_premise, 'Treemap - StopWords_Premise(Entailment)(the 20 most popular)')
draw_treemap(df_English[df_English['label']==1].stopwords_premise, 'Treemap - StopWords_Premise(Neutral)(the 20 most popular)')
draw_treemap(df_English[df_English['label']==2].stopwords_premise, 'Treemap - StopWords_Premise(Contradiction)(the 20 most popular)')
#print('-------------- STOPWORDS_hypothesis --------------')
draw_treemap(df_English[df_English['label']==0].stopwords_hypothesis, 'Treemap - StopWords_Hypothesis(Entailment)(the 20 most popular)')
draw_treemap(df_English[df_English['label']==1].stopwords_hypothesis, 'Treemap - StopWords_Hypothesis(Neutral)(the 20 most popular)')
draw_treemap(df_English[df_English['label']==2].stopwords_hypothesis, 'Treemap - StopWords_Hypothesis(Contradiction)(the 20 most popular)')

We observe that all treemaps contain the same words in very similar proportions.

## 3. Model <a id='section3'></a>

In [None]:
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, TFAutoModel, BertTokenizer, TFBertModel
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model 
from tensorflow.keras.callbacks import ModelCheckpoint


In [None]:
# TPU 
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
    print('Running on TPU', tpu.master())
except ValueError:
    strategy = tf.distribute.get_strategy()
print("REPLICAS:", strategy.num_replicas_in_sync)

In [None]:
#variables 
model_name = 'bert-base-multilingual-cased'

max_len = 40
#batch size depend on replica
batch_size = 16 * strategy.num_replicas_in_sync


In [None]:
tokenizer = BertTokenizer.from_pretrained(model_name)

- Preparing the data
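
Preparing the data for BERT means turning each pair of sentences into three aligned integer sequences. The toy sketch below shows their layout in pure Python. The function `encode_pair`, the tiny `vocab` and the word-level tokens are all hypothetical; the real tokenizer splits text into word pieces and uses its own vocabulary ids.

```python
def encode_pair(tokens_a, tokens_b, vocab, max_len):
    """Toy version of sentence-pair encoding: [CLS] a [SEP] b [SEP] + padding."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 for the first sentence (with [CLS] and its [SEP]), 1 for the second.
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    tokens, segments = tokens[:max_len], segments[:max_len]  # naive truncation
    pad = max_len - len(tokens)
    return {
        "input_ids": [vocab[t] for t in tokens] + [vocab["[PAD]"]] * pad,
        "token_type_ids": segments + [0] * pad,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(tokens) + [0] * pad,
    }

# Hypothetical vocabulary; real ids come from the pretrained tokenizer.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "he": 3, "left": 4, "went": 5, "home": 6}
encoded = encode_pair(["he", "left"], ["he", "went", "home"], vocab, max_len=10)
print(encoded)
```

All three lists have length `max_len`, which is why the Keras inputs of the model are declared with that shape.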

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train, 
                                                      train.label.values, 
                                                      test_size = 0.2, 
                                                     random_state = 245)

In [None]:
train_text = X_train[['hypothesis', 'premise']].values.tolist()
val_text = X_valid[['hypothesis', 'premise']].values.tolist()
test_text = test[['hypothesis', 'premise']].values.tolist()

In [None]:
def preparing_data(data):
    '''Encode a list of [hypothesis, premise] pairs so that they can be used in the BERT model
        RETURN: a dictionary with keys: 'input_ids', 'token_type_ids', 'attention_mask'
    '''
    # encode the words (assign a number to every token), add the '[CLS]' token at the beginning
    # of the input and a '[SEP]' token between the two sentences,
    # and pad or truncate every sequence to max_len so that all vectors have the same length
    # (padding adds 0's); this must match the shape=(max_len,) of the model inputs
    text_encoded = tokenizer.batch_encode_plus(data, padding='max_length', truncation=True, max_length=max_len)
    #create a dictionary with tokens
    dict_input = {}
    #convert list to tensor
    dict_input['input_ids'] = tf.convert_to_tensor(text_encoded.input_ids)
    dict_input['token_type_ids'] = tf.convert_to_tensor(text_encoded.token_type_ids)
    dict_input['attention_mask'] = tf.convert_to_tensor(text_encoded.attention_mask)
    return(dict_input)



In [None]:
train_input = preparing_data(train_text)
val_input = preparing_data(val_text)
test_input  = preparing_data(test_text)

In [None]:
with strategy.scope():
    bert_encoder = TFBertModel.from_pretrained(model_name)
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name='input_ids')
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name='attention_mask')
    token_type_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name='token_type_ids')
    
    embedding = bert_encoder([input_ids, attention_mask, token_type_ids])[0]
    output = tf.keras.layers.Dense(3, activation='softmax')(embedding[:,0,:])
    
    model = tf.keras.Model(inputs=[input_ids, attention_mask, token_type_ids], outputs=output)
    model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
   

model.summary()

In [None]:
model_history = model.fit(train_input, y_train,
                          epochs = 10, 
                          batch_size = 64,
                          validation_data = (val_input, y_valid),
                          verbose = 1)

In [None]:
model_history_df = pd.DataFrame(model_history.history, index = range(1, len(model_history.history['loss'])+1))

In [None]:
fig, axes = plt.subplots(nrows=1,ncols=1, figsize=(12,5))
model_history_df.plot(style=['b-','r-','b--','r--'], ax = axes)
plt.title('Model Accuracy and Loss')
axes.set_xlabel('epochs')
axes.set_ylabel('Accuracy - Loss')
plt.show()

For the model, we use the following hyperparameters:

- max_len = 40
- batch_size = 64
- epochs = 10

We plot the accuracy and the loss against the number of epochs. The training accuracy increases and the training loss decreases as the number of epochs grows; however, val_accuracy flattens after epoch 2 whereas val_loss keeps increasing. This indicates that the model overfits beyond that point, so we select epochs = 2 for the final submission.
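
The choice of epoch can also be read from the history instead of the plot. This is a minimal sketch with made-up validation losses (not results from this notebook); with the real run, `model_history.history['val_loss']` would take the place of `val_loss`.

```python
import numpy as np

# Hypothetical validation losses per epoch (made-up numbers, not results from this notebook).
val_loss = [0.95, 0.88, 0.93, 1.05, 1.20]

# Epochs are 1-indexed in the plot, so add 1 to the argmin position.
best_epoch = int(np.argmin(val_loss)) + 1
print("epoch with lowest validation loss:", best_epoch)
```

An alternative would be the Keras `EarlyStopping` callback with `monitor='val_loss'` and `restore_best_weights=True`, which stops training automatically once the validation loss stops improving.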

In [None]:
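# Note: calling fit() again continues from the weights already trained above;
# to train for only 2 epochs from scratch, re-run the model-building cell first
model_history = model.fit(train_input, y_train,
                          epochs = 2, 
                          batch_size = 64,
                          validation_data = (val_input, y_valid),
                          verbose = 1)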

In [None]:
# Predict on test
test_preds = model.predict(test_input, verbose = 1)
sample_submission['prediction'] = test_preds.argmax(axis = 1)

In [None]:
sample_submission.to_csv('submission.csv', index=False)
sample_submission.head()

## References

https://www.kaggle.com/kksienc/comprehensive-nlp-tutorial-3-bert