Team members:

   Nithesh N    22BAD061

   Siva Nirai   22BAD096

   Tharanesh R  22BAD105
   
   Veerakumar S 22BAD109

Dataset link:
https://www.kaggle.com/competitions/contradictory-my-dear-watson/data

Dataset Description:
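The dataset contains pairs of sentences, a premise and a hypothesis, written in several languages. Each hypothesis either follows from its premise (entailment), is unrelated to it (neutral), or contradicts it.
Example premise: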
He came, he opened the door and I remember looking back and seeing the expression on his face, and I could tell that he was disappointed.

Hypothesis 1:

Just by the look on his face when he came through the door I just knew that he was let down.

We know that this is true based on the information in the premise. So, this pair is related by entailment.

Hypothesis 2:

He was trying not to make us feel guilty but we knew we had caused him trouble.

This very well might be true, but we can’t conclude this based on the information in the premise. So, this relationship is neutral.

Hypothesis 3:

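He was so excited and bursting with joy that he practically knocked the door off its frame.

This is the opposite of what the premise says, since the premise tells us he was disappointed. So, this relationship is a contradiction.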
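
The three examples above can be written as (premise, hypothesis, label) records. This is only a small sketch: the integer codes (0 = entailment, 1 = neutral, 2 = contradiction) are an assumption based on the competition's data description and should be checked against train.csv.

```python
# Label names keyed by integer code (assumed mapping, verify against the dataset).
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

premise = ("He came, he opened the door and I remember looking back and seeing "
           "the expression on his face, and I could tell that he was disappointed.")

# Each record pairs the same premise with one hypothesis and its label code.
examples = [
    (premise, "Just by the look on his face when he came through the door I just knew that he was let down.", 0),
    (premise, "He was trying not to make us feel guilty but we knew we had caused him trouble.", 1),
    (premise, "He was so excited and bursting with joy that he practically knocked the door off its frame.", 2),
]

for _, hypothesis, label in examples:
    print(LABEL_NAMES[label], "->", hypothesis)
```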

Importing required libraries


In [70]:
import pandas as pd
import numpy as np
import re
import unicodedata
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D

The preprocess_text function does the following:

Converts the text to lowercase so that the same word is always represented the same way.
Strips leading and trailing whitespace.
Normalizes the text by converting accented characters to their ASCII equivalents, removes every character that is not a letter, digit, or space, and collapses repeated whitespace into a single space.

In [71]:
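def preprocess_text(text, language='english'):
    # Lowercase and trim so that identical words map to the same token.
    text = text.lower().strip()
    if language == 'english':
        # Decompose accented characters and drop the non-ASCII parts ("é" becomes "e").
        text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode("utf-8")
        text = re.sub(r"[^a-z0-9 ]", "", text)
    else:
        # Python's built-in re module rejects \p{L} and \p{N}; \w (Unicode letters,
        # digits, and underscore) is used as the closest built-in approximation.
        text = re.sub(r"[^\w ]", "", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text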
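
A quick check of the English branch on two sample strings. The helper clean_english is a hypothetical stand-alone copy of that branch, written only for this illustration. Note that text in a non-Latin script comes out empty, because the ASCII step drops every character it cannot represent.

```python
import re
import unicodedata

def clean_english(text):
    # Hypothetical stand-alone copy of the English branch of preprocess_text.
    text = text.lower().strip()
    text = unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode("utf-8")
    text = re.sub(r"[^a-z0-9 ]", "", text)
    return re.sub(r"\s+", " ", text)

print(repr(clean_english("  Café, déjà-vu!  ")))  # 'cafe dejavu'
print(repr(clean_english("你好")))  # ''
```

Since the later cells call the function without a language argument, every row goes through this branch regardless of its language.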

Loading the train and test datasets


In [72]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

Applying the preprocessing function to the hypothesis and premise columns

In [73]:
train_df['hypothesis'] = train_df['hypothesis'].apply(preprocess_text)
train_df['premise'] = train_df['premise'].apply(preprocess_text)
test_df['hypothesis'] = test_df['hypothesis'].apply(preprocess_text)
test_df['premise'] = test_df['premise'].apply(preprocess_text)

Combining the hypothesis and premise into a single input text


In [74]:
train_texts = train_df['hypothesis'] + " " + train_df['premise']
test_texts = test_df['hypothesis'] + " " + test_df['premise']

Using the language column as the target label

In [75]:
train_labels = train_df['language']
test_labels = test_df['language']

Encoding the train and test labels

In [76]:
label_encoder = LabelEncoder()
train_labels_encoded = label_encoder.fit_transform(train_labels)
test_labels_encoded = label_encoder.transform(test_labels)
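
LabelEncoder gives each distinct label an integer according to its sorted order. The plain-Python sketch below mimics that behavior on made-up label values; it is an illustration, not scikit-learn's implementation, and the function names are hypothetical.

```python
def fit_label_index(labels):
    # Sort the distinct labels and number them 0..n-1, as LabelEncoder does.
    classes = sorted(set(labels))
    return {label: idx for idx, label in enumerate(classes)}

def encode(labels, index):
    # A label missing from the index raises KeyError here; LabelEncoder.transform
    # likewise fails (with ValueError) on labels it did not see during fit.
    return [index[label] for label in labels]

sample = ["French", "English", "Thai", "English"]  # made-up label values
index = fit_label_index(sample)
print(index)
print(encode(sample, index))
```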

Tokenizing the text

In [77]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts)

Converting the texts to integer sequences

In [78]:
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)
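
To show what fitting the tokenizer and calling texts_to_sequences produce, here is a simplified sketch in plain Python. It assumes whitespace-separated, already cleaned text, ranks words by frequency starting at index 1, and silently drops words that were not seen during fitting, which is also how the Keras Tokenizer behaves when no oov_token is set. The function names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    # Count words, then give the most frequent word index 1, the next 2, and so on.
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    # Unknown words are skipped rather than mapped to a placeholder.
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

train = ["the cat sat", "the dog sat down"]
word_index = fit_word_index(train)
print(word_index)
print(to_sequences(["the bird sat"], word_index))
```

Because the tokenizer is fitted on the training texts only, any test word outside that vocabulary simply disappears from its sequence.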

Padding the sequences to a common length

In [79]:
max_len = max(max(len(seq) for seq in train_sequences), max(len(seq) for seq in test_sequences))
train_sequences_padded = pad_sequences(train_sequences, maxlen=max_len, padding='pre')
test_sequences_padded = pad_sequences(test_sequences, maxlen=max_len, padding='pre')
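
With padding='pre', zeros are added at the front of short sequences, and by default pad_sequences also truncates long sequences from the front. A plain-Python sketch of that behavior (pad_pre is a hypothetical helper, not the Keras function):

```python
def pad_pre(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)  # left-pad with the fill value
    return padded

print(pad_pre([[1, 2], [3, 4, 5, 6]], maxlen=3))
```

In this notebook max_len is the length of the longest sequence, so nothing is truncated and only the padding applies.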

Creating a sequential model with Embedding, LSTM, and Dense layers

In [80]:
model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=100, input_length=max_len),
    SpatialDropout1D(0.2),
    LSTM(units=200),
    Dense(units=len(set(train_labels)), activation='softmax')
])
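
The size of this model can be estimated by hand. The sketch below uses the standard parameter formulas for Embedding, LSTM, and Dense layers; vocab_size and num_classes are illustrative values, not taken from the notebook's run.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab_size, num_classes = 50_000, 15  # illustrative assumptions
total = (embedding_params(vocab_size, 100)
         + lstm_params(100, 200)
         + dense_params(200, num_classes))
print(lstm_params(100, 200), total)
```

With a vocabulary this large, the embedding table accounts for most of the parameters.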

Compiling the model

In [81]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
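
sparse_categorical_crossentropy takes integer class labels and, for each sample, returns the negative log of the probability that the softmax assigned to the true class. A small worked example with made-up probabilities:

```python
import math

def sparse_categorical_crossentropy(true_class, probs):
    # Loss for one sample: -log(probability of the correct class).
    return -math.log(probs[true_class])

probs = [0.1, 0.7, 0.2]  # made-up softmax output over three classes
print(sparse_categorical_crossentropy(1, probs))  # true class has high probability: small loss
print(sparse_categorical_crossentropy(0, probs))  # true class has low probability: larger loss
```

This is why the integer output of LabelEncoder can be passed to fit directly, without one-hot encoding.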

Training the model

In [82]:
model.fit(train_sequences_padded, train_labels_encoded, epochs=10, batch_size=32, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10


KeyboardInterrupt
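
A note on validation_split=0.2 in the fit call: Keras holds out the last 20% of the training arrays, taken before any shuffling, so the row order of train.csv decides which samples are used for validation. A sketch of that slicing (split_last is a hypothetical helper):

```python
def split_last(samples, validation_split):
    # The validation samples are taken from the end of the data.
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

train_part, val_part = split_last(list(range(10)), 0.2)
print(train_part, val_part)
```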



Evaluating the model on the test set

In [83]:
loss, accuracy = model.evaluate(test_sequences_padded, test_labels_encoded)
print("Test Accuracy:", accuracy)


Test Accuracy: 0.7811356782913208
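
The reported accuracy is the fraction of test samples whose highest-probability class matches the encoded label. A minimal sketch of that calculation with made-up predictions:

```python
def accuracy(prob_rows, true_labels):
    # Predicted class = index of the largest probability in each row.
    predicted = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels)

probs = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]  # made up
labels = [0, 1, 0, 0]
print(accuracy(probs, labels))
```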
