<a href="https://colab.research.google.com/github/reganmeloche/mrpc_paraphrase/blob/main/baseline_approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Paraphrasing - baseline approach


In this notebook we use traditional NLP techniques (TF-IDF vectors and cosine similarity) to produce a baseline set of results against which we can compare future approaches.

## Import Data

In [None]:
import pandas as pd
import csv

In [None]:
ROOT_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP/ms_paraphrase'

In [None]:
data_path = f'{ROOT_PATH}/data'

train_df = pd.read_csv(f'{data_path}/train_df.csv')
test_df = pd.read_csv(f'{data_path}/test_df.csv')

In [None]:
test_df.head()

## Preprocessing

Before extracting features from the text, we perform some basic preprocessing: lowercasing, punctuation removal, digit normalization, tokenization, lemmatization, and stop-word removal.

In [None]:
!python -m spacy download en_core_web_md

In [None]:
import spacy
import re
from string import punctuation

In [None]:
nlp = spacy.load("en_core_web_md")
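# NER is not needed here. Keep 'attribute_ruler': in spaCy v3 English pipelines
# it supplies the POS tags that the rule-based lemmatizer relies on.
nlp.remove_pipe('ner')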
print(nlp.pipe_names)

We define a preprocessing function that we will apply to all of the sentences.

In [None]:
stop_words = nlp.Defaults.stop_words

def preprocess(text):
    # Text pre-processing

    # Lowercase it all
    text = text.lower()

    # Replace dash with space
    text = text.replace("-", " ")

    # Remove punctuation
    text = ''.join(c for c in text if c not in punctuation)

    # Replace each run of digits with the placeholder '#'
    text = re.sub(r'\d+', '#', text)

    # Spacy preprocessing
    doc = nlp(text)

    # Lemmatize
    tokens = [t.lemma_ for t in doc]

    # Remove stop words
    tokens = [t for t in tokens if t not in stop_words]

    return tokens

In [None]:
sample_text = 'I took the 3 dogs for a walk last Tuesday at 8pm!'
sample_prepro = preprocess(sample_text)
print(sample_prepro)
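
To see what the string-level steps do before spaCy gets involved, here is a small sketch that isolates them. The helper name `normalize` is hypothetical and is not used elsewhere in this notebook; it simply mirrors the first four steps of `preprocess`.

```python
import re
from string import punctuation

def normalize(text):
    """String-level cleanup only: no tokenization, lemmatization, or stop-word removal."""
    text = text.lower()                                       # case-fold
    text = text.replace("-", " ")                             # split hyphenated words
    text = "".join(c for c in text if c not in punctuation)   # drop ASCII punctuation
    return re.sub(r"\d+", "#", text)                          # collapse each digit run to '#'

print(normalize("I took the 3 dogs for a walk last Tuesday at 8pm!"))
# i took the # dogs for a walk last tuesday at #pm
```

The order of the last two steps matters: `#` is itself in `string.punctuation`, so the digit placeholder has to be inserted after punctuation is stripped.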

Now we apply our preprocessing to the training data. This step may take a few minutes to run.

In [None]:
X = train_df[['s1','s2']].values
y = train_df['label'].values

In [None]:
Xp = [[preprocess(x[0]), preprocess(x[1])] for x in X]

## Vectorizing

We are going to use a standard TF-IDF approach for our baseline. This involves creating a corpus out of all of our preprocessed sentences and fitting a TF-IDF vectorizer to that corpus. Any sentence in our training corpus will then map to a TF-IDF vector.
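
To make the weighting concrete, here is a pure-Python sketch of the computation, assuming scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization). The function `tfidf_vectors` and the toy corpus are hypothetical illustrations, and tokenization here is a plain whitespace split rather than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        # Guard against empty documents, which would otherwise divide by zero
        vectors.append([v / norm if norm else 0.0 for v in vec])
    return vocab, vectors

toy_corpus = ["dog walk park", "dog run park", "cat sleep sofa"]
vocab, vectors = tfidf_vectors(toy_corpus)
print(vocab)
print([round(v, 3) for v in vectors[0]])
```

In the first toy sentence, `walk` appears in only one document and so receives a higher weight than `dog` or `park`, which appear in two. Note also that `TfidfVectorizer`'s default token pattern only keeps tokens of two or more word characters, so a standalone `#` placeholder would likely not end up in the fitted vocabulary.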

In [None]:
corpus = []

for x in Xp:
    corpus.append(' '.join(x[0]))
    corpus.append(' '.join(x[1]))

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

In [None]:
X1 = [' '.join(x[0]) for x in Xp]
X2 = [' '.join(x[1]) for x in Xp]

print(X1[3])

In [None]:
X1t = vectorizer.transform(X1)
X2t = vectorizer.transform(X2)

## Distance measurement

Now we have all of our sentences preprocessed and transformed to TF-IDF vectors.

We can now use a similarity measure to get a sense of how "close" each sentence is to its paired partner. We compute cosine similarity as 1 minus the cosine distance provided by the scipy library.


In [None]:
from scipy import spatial

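def get_dist(x1, x2):
    # Despite the name, this returns cosine *similarity* (1 - cosine distance).
    # scipy expects 1-D arrays, so flatten the 1 x n sparse rows first.
    a = x1.toarray().ravel()
    b = x2.toarray().ravel()
    return 1 - spatial.distance.cosine(a, b)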


In [None]:
test_dist = get_dist(X1t[0], X2t[0])
print(test_dist)
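
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it directly in plain Python; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for illustration and are not part of the notebook's pipeline.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an empty sentence as dissimilar to everything
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0, 2], [1, 0, 2]))  # same direction, close to 1
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # no shared terms, 0
```

The zero-vector guard is worth keeping in mind: a sentence whose tokens are all removed during preprocessing, or are all missing from the fitted vocabulary, maps to an all-zero TF-IDF vector, for which the cosine is undefined.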

Now we can come up with a threshold by calculating the average similarity over all of the training pairs that have a label of 1.

In [None]:
import numpy as np

# Get all distances
distances = [get_dist(x1,x2) for (x1,x2) in zip(X1t, X2t)]

# Filter to keep only those that are labeled as a paraphrase
matches = [d for i,d in enumerate(distances) if y[i] == 1]

# calculate the average
threshold = np.average(matches)

print(threshold)
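
The following toy sketch shows how a mean-of-positives threshold behaves. The scores and labels are made up for illustration, and the `toy_` names are hypothetical so they do not clash with the notebook's own variables.

```python
from statistics import mean

# Hypothetical similarity scores and gold labels (1 = paraphrase)
toy_sims = [0.9, 0.8, 0.4, 0.75, 0.2]
toy_labels = [1, 1, 1, 0, 0]

# Mean similarity over the paraphrase pairs only
positive_sims = [s for s, label in zip(toy_sims, toy_labels) if label == 1]
toy_threshold = mean(positive_sims)

# Predict 1 when a pair scores strictly above the threshold
toy_preds = [int(s > toy_threshold) for s in toy_sims]

print(toy_threshold)
print(toy_preds)
```

Because the threshold is the mean of the positive scores, any true paraphrase scoring below that mean is rejected (the third pair here), while a non-paraphrase scoring above it is accepted (the fourth pair). This is one plausible reason a threshold chosen this way may cost recall.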

## Test set

Now we can apply the same treatment to our test cases and then use the threshold to predict if they are a paraphrase.

First we apply the same preprocessing and vectorization, reusing the vectorizer that was fitted on the training data.

In [None]:
X_test = test_df[['s1','s2']].values
y_test = test_df['label'].values

Xp_test = [[preprocess(x[0]), preprocess(x[1])] for x in X_test]

X1_test = [' '.join(x[0]) for x in Xp_test]
X2_test = [' '.join(x[1]) for x in Xp_test]

X1t_test = vectorizer.transform(X1_test)
X2t_test = vectorizer.transform(X2_test)

Next we calculate the similarity score for each test pair.

In [None]:
test_distances = [get_dist(x1,x2) for (x1,x2) in zip(X1t_test, X2t_test)]

For each similarity score, if it is over our threshold we predict 1 (the pair of sentences is a paraphrase); otherwise we predict 0 (not a paraphrase).

In [None]:
def predict(x):
    # 1 if the similarity exceeds the threshold, otherwise 0
    return int(x > threshold)

In [None]:
y_pred = [predict(x) for x in test_distances]

## Evaluation

Now we compare our predictions against the actual test labels to see how well our baseline classifier performed.

In [None]:
from sklearn import metrics

print(metrics.classification_report(y_test, y_pred))

The dataset is unbalanced, so we're more interested in the precision, recall, and F-score than we are in the accuracy. The results are quite weak, so it shouldn't be too difficult to improve on this baseline.
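
To illustrate why accuracy alone can mislead on unbalanced labels, here is a small sketch with made-up labels and a classifier that always predicts 1. The helper `precision_recall_f1` is a hypothetical stand-in that computes the per-class precision, recall, and F1 directly from counts; the zero-denominator cases are set to 0.0 by assumption.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical unbalanced labels and an "always predict 1" classifier
toy_true = [1, 1, 1, 1, 0, 0]
toy_always_one = [1] * len(toy_true)

accuracy = sum(t == p for t, p in zip(toy_true, toy_always_one)) / len(toy_true)
print(accuracy)
print(precision_recall_f1(toy_true, toy_always_one, positive=1))
print(precision_recall_f1(toy_true, toy_always_one, positive=0))
```

Here the trivial classifier is right two thirds of the time, yet it never identifies a single non-paraphrase: precision, recall, and F1 for class 0 are all zero. The per-class rows of the classification report expose this, whereas the accuracy figure hides it.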