# WatsonianAdventure


## Goal
The goal of this project is to develop a machine learning model that detects contradiction and entailment in multilingual text. This task, known as natural language inference, has applications in areas such as information retrieval, question answering, and text summarization.



## Source of the Dataset
The dataset was collected for the "Contradictory, My Dear Watson" Kaggle competition (https://www.kaggle.com/competitions/contradictory-my-dear-watson/data?select=train.csv).

## Dataset General Info
The dataset contains pairs of text (a premise and a hypothesis) from various sources in different languages, along with a label indicating whether the relationship between the two texts is entailment, neutral, or contradiction (0, 1, or 2, respectively).

## Dataset Summary

### Numbers:
The dataset consists of 12,500 rows. 
There are 5,501 unique premises and 9,008 unique hypotheses in the dataset. 

### Class Distribution
3,240 rows are labeled as "contradiction". 
5,924 rows are labeled as "entailment".
3,336 rows are labeled as "neutral".
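
The counts above imply an imbalanced label distribution, which matters when reading accuracy numbers. The following is a minimal sketch that only does arithmetic on the counts reported in this section; the variable names are illustrative and the "majority baseline" is a generic reference point, not a result from this project.

```python
from collections import Counter

# Class counts copied from the "Class Distribution" section of this README.
label_counts = Counter({"entailment": 5924, "neutral": 3336, "contradiction": 3240})

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

# A classifier that always predicts the most frequent class is the simplest
# reference point for accuracy.
majority_label, majority_count = label_counts.most_common(1)[0]
majority_baseline = majority_count / total

for label, share in shares.items():
    print(f"{label:<13} {share:.1%}")
print(f"majority baseline ({majority_label}): {majority_baseline:.1%}")
```

Under these counts, a model trained on the full dataset would need to beat roughly 47% accuracy to do better than always predicting "entailment".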

### Missing Values
There are no missing values in the dataset.

### Statistical Summaries
| Variable   | Mean characters | Variance (characters) | Mean words | Variance (words) |
|------------|-----------------|-----------------------|------------|------------------|
| Premise    | 66.91           | 2949.08               | 12.67      | 110.98           |
| Hypothesis | 51.51           | 1282.49               | 9.54       | 35.12            |
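
Summaries like the ones in this table can be computed from per-text character and word counts. The sketch below is dependency-free and makes two assumptions that the README does not state: words are split on whitespace, and the variance is the sample variance (the default of pandas' `.var()`). `length_stats` is a hypothetical helper name.

```python
import statistics

def length_stats(texts):
    """Return mean and sample variance of character and word counts."""
    char_counts = [len(text) for text in texts]
    word_counts = [len(text.split()) for text in texts]  # whitespace split (assumption)
    return {
        "mean_chars": statistics.mean(char_counts),
        "var_chars": statistics.variance(char_counts),
        "mean_words": statistics.mean(word_counts),
        "var_words": statistics.variance(word_counts),
    }

# Three hypotheses from the Sample section of this README.
hypotheses = [
    "The new rights are not important.",
    "The US is no longer a superpower.",
    "Fear the turtle.",
]
stats = length_stats(hypotheses)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Running the same function over the full `premise` and `hypothesis` columns should give numbers comparable to the table, provided the assumptions above match how the table was produced.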

### Sample
| Premise                                             | Hypothesis                                    | Label         |
|:----------------------------------------------------|:----------------------------------------------|:--------------|
| The new rights are nice enough, but not world-shaking. | The new rights are not important.           | contradiction |
| Electronic money is not yet a frequently used feature. | Many people are starting to use electronic money more often. | entailment    |
| US is still a superpower.                             | The US is no longer a superpower.           | contradiction |
| Fear the turtle.                                      | Fear the turtle.                             | neutral       |

### Preprocessing Techniques

1) Filtering non-English records, which leaves 6,517 rows: 2,304 labeled as "entailment", 2,060 labeled as "neutral", and 2,152 labeled as "contradiction".

2) Removing stop words:
Stop words are the most commonly used words in a language. English examples include "the", "and", "a", "an", "in", and "on". We remove them because they carry little meaning on their own and can add noise to the analysis.

3) Tokenization:
Breaking text down into individual tokens (words and punctuation marks), so that it can be represented as a sequence of discrete symbols that machine learning algorithms can process.

4) Word embeddings:
Representing text as dense numeric vectors (here, 100-dimensional pre-trained GloVe vectors averaged over each text). These vectors capture semantic relationships between words and phrases, which can improve the accuracy of the machine learning model.

You can check the dataset here: https://docs.google.com/spreadsheets/d/1dK9E9EvCCtQ5EjoUfYBJT1agvHHr92vjTQR4KwdSuUM/edit?usp=sharing

In [22]:
from langdetect import detect
import pandas as pd

In [23]:
# load the dataset
df = pd.read_csv("/Users/2rwak/Desktop/contradictory-my-dear-watson/train.csv")

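In [14]:
# filter out non-English records
from langdetect import DetectorFactory, LangDetectException

DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded

def detect_language(text):
    try:
        return detect(str(text))
    except LangDetectException:  # raised when the text has no usable features
        return 'unknown'

# keep only rows where both premise and hypothesis are detected as English
is_english = (df['premise'].apply(detect_language) == 'en') & (df['hypothesis'].apply(detect_language) == 'en')
df = df[is_english]

# save the filtered dataset next to train.csv, where the next cell reads it
df.to_csv('/Users/2rwak/Desktop/contradictory-my-dear-watson/train_en.csv', index=False)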
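
Automatic language detection can misclassify short sentences, so it may be worth comparing it against language metadata shipped with the data. The sketch below assumes that `train.csv` has a `lang_abv` column holding a language code per row; that column name is an assumption about the file's schema, and the three rows are a hypothetical excerpt (the French row is made up for illustration).

```python
import csv
import io

# Hypothetical three-row excerpt standing in for train.csv; the `lang_abv`
# column name is an assumption about the file's schema.
sample_csv = io.StringIO(
    "premise,hypothesis,lang_abv,label\n"
    "US is still a superpower.,The US is no longer a superpower.,en,2\n"
    "Il pleut.,Il fait beau.,fr,2\n"
    "Fear the turtle.,Fear the turtle.,en,1\n"
)

rows = list(csv.DictReader(sample_csv))

# Keep only the rows whose language code is English.
english_rows = [row for row in rows if row["lang_abv"] == "en"]

print(f"kept {len(english_rows)} of {len(rows)} rows")
```

With pandas the same filter would be `df[df["lang_abv"] == "en"]`. The resulting row counts may differ from the `langdetect`-based counts reported above.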


In [15]:
# stop-word removal
!pip3 install nltk
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

# load the dataset
df = pd.read_csv('/Users/2rwak/Desktop/contradictory-my-dear-watson/train_en.csv')

# remove stop words (case-insensitive match on whitespace-separated words)
stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    return ' '.join(word for word in str(text).split() if word.lower() not in stop_words)

df['premise'] = df['premise'].apply(remove_stop_words)
df['hypothesis'] = df['hypothesis'].apply(remove_stop_words)

# save the cleaned dataset next to the input file, where the next cell reads it
df.to_csv('/Users/2rwak/Desktop/contradictory-my-dear-watson/train_en_cleaned.csv', index=False)
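
One caveat for this task: NLTK's English stop-word list includes negation words such as "no", "not", and "nor", and removing them can erase exactly the signal that separates a contradiction from an entailment. The sketch below illustrates the effect with a small hand-written stop-word set (an assumption standing in for NLTK's full list) and a hypothetical `keep` parameter for words to preserve.

```python
# Small illustrative stop-word list (an assumption, not NLTK's full list).
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "on", "and", "no", "not", "nor"}
# Negation words to keep even though stop-word lists usually contain them.
NEGATIONS = {"no", "not", "nor"}

def remove_stop_words(text, keep=frozenset()):
    """Drop stop words from whitespace-separated text, except those in `keep`."""
    drop = STOP_WORDS - set(keep)
    return " ".join(word for word in text.split() if word.lower() not in drop)

# Hypothesis from the Sample section of this README.
hypothesis = "The US is no longer a superpower."

print(remove_stop_words(hypothesis))                  # US longer superpower.
print(remove_stop_words(hypothesis, keep=NEGATIONS))  # US no longer superpower.
```

With the real NLTK list, the equivalent change would be `stop_words = set(stopwords.words('english')) - {'no', 'not', 'nor'}`.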



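In [20]:
# Tokenization
import pandas as pd
import nltk
nltk.download('punkt')  # recent NLTK releases may require 'punkt_tab' as well

from nltk.tokenize import word_tokenize

# load the dataset; keep_default_na=False keeps texts that became empty after
# stop-word removal as '' instead of NaN (which str() would turn into 'nan')
df = pd.read_csv('/Users/2rwak/Desktop/contradictory-my-dear-watson/train_en_cleaned.csv', keep_default_na=False)

# tokenize text
df['premise'] = df['premise'].apply(lambda x: word_tokenize(str(x)))
df['hypothesis'] = df['hypothesis'].apply(lambda x: word_tokenize(str(x)))

# save the tokenized dataset; each token list is written in its string form,
# e.g. "['new', 'rights']", in the same folder the next cell reads from
df.to_csv('/Users/2rwak/Desktop/contradictory-my-dear-watson/train_en_tokenized.csv', index=False)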
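
Saving token lists to CSV has a side effect worth knowing: a list cell is stored as its string form, so reading the file back yields a string such as `"['new', 'rights']"` rather than a list. The sketch below reproduces this round trip with the standard library only (the in-memory buffer and the sample tokens are illustrative) and shows one way to recover the list with `ast.literal_eval`.

```python
import ast
import csv
import io

tokens = ["new", "rights", "nice", "enough", ","]

# Write one row the way a list cell ends up in a CSV file: as str(list).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["premise"])
writer.writerow([str(tokens)])

# Read it back: the cell is now a plain string, not a list.
buffer.seek(0)
row = list(csv.DictReader(buffer))[0]
raw_cell = row["premise"]
print(type(raw_cell).__name__)

# ast.literal_eval safely parses the string form back into a Python list.
restored = ast.literal_eval(raw_cell)
print(restored == tokens)
```

Any later step that consumes the tokenized file should parse the column this way instead of tokenizing the string form again, which would produce tokens with stray brackets and quotes.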



In [21]:
# word embeddings
import ast
import pandas as pd
import numpy as np

from gensim.models import KeyedVectors

# load the dataset
df = pd.read_csv('/Users/2rwak/Desktop/contradictory-my-dear-watson/train_en_tokenized.csv')

# load the pre-trained GloVe embeddings
# download the GloVe embeddings file from the official website at https://nlp.stanford.edu/projects/glove/.
# (no_header=True requires gensim 4.0 or later)
glove_model = KeyedVectors.load_word2vec_format('/Users/2rwak/Desktop/contradictory-my-dear-watson/glove.6B.100d.txt', binary=False, no_header=True)

def mean_vector(token_list_str):
    """Average the GloVe vectors of the tokens stored in one CSV cell."""
    # the CSV stores each token list as a string such as "['new', 'rights']",
    # so parse it back into a list instead of tokenizing the string again
    tokens = ast.literal_eval(token_list_str)
    # glove.6B is an uncased vocabulary, so look tokens up in lowercase
    vectors = [glove_model[word.lower()] for word in tokens if word.lower() in glove_model.key_to_index]
    if len(vectors) > 0:
        return np.mean(vectors, axis=0)
    # no token is in the vocabulary: fall back to a zero vector
    return np.zeros((100,))

# convert premise and hypothesis to their vector representations
df['premise_vectors'] = [mean_vector(cell) for cell in df['premise']]
df['hypothesis_vectors'] = [mean_vector(cell) for cell in df['hypothesis']]

# save the embeddings dataset
df.to_csv('/Users/2rwak/Desktop/contradictory-my-dear-watson/train_en_embeddings.csv', index=False)
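
The averaging step above can be checked by hand on a tiny example. The sketch below uses a made-up 3-dimensional embedding table in place of GloVe (the values, `EMBEDDINGS`, and `DIM` are all hypothetical) and plain Python lists instead of NumPy, so it runs without any downloads.

```python
# Toy 3-dimensional embedding table with made-up values (hypothetical; the
# GloVe file used in this README has 100 dimensions).
EMBEDDINGS = {
    "fear": [1.0, 0.0, 2.0],
    "turtle": [3.0, 2.0, 0.0],
}
DIM = 3

def mean_vector(tokens):
    """Average the vectors of known tokens; zero vector if none are known."""
    vectors = [EMBEDDINGS[t.lower()] for t in tokens if t.lower() in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    # zip(*vectors) groups the values of each dimension across all tokens
    return [sum(component) / len(vectors) for component in zip(*vectors)]

known_vec = mean_vector(["Fear", "turtle", "."])  # "." is skipped as unknown
unknown_vec = mean_vector(["zzz"])                # nothing known: zero vector

print(known_vec)    # [2.0, 1.0, 1.0]
print(unknown_vec)  # [0.0, 0.0, 0.0]
```

Averaging discards word order, so two texts that contain the same words in a different order receive the same vector. That is a limitation to keep in mind when the averaged vectors are used as model features.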


