# Semantic Textual Similarity (STS) Project

The objective of this project is to determine the similarity between two sentences. One sentence is said to be a paraphrase of another when its content (or message) is the same but it uses different words and/or structure.

An example from the trial set: 
 - The bird is bathing in the sink.

 - Birdie is washing itself in the water basin.

We are given training and test sets in which each sentence pair is labeled with a gold-standard score ("gs") on a scale of 0-5.

|label|	description|
| :-: | :-: |
|5	| They are completely equivalent, as they mean the same thing.|
|4	| They are mostly equivalent, but some unimportant details differ.|
|3	| They are roughly equivalent, but some important information differs or is missing.|
|2	| They are not equivalent, but share some details.|
|1	| They are not equivalent, but are on the same topic.|
|0	| They are on different topics.|
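
A regressor produces real-valued scores, so reading a prediction against this table means clipping it to the 0-5 range and rounding it to the nearest label. The sketch below illustrates one way to do that. The names `GS_DESCRIPTIONS` and `describe_score` are hypothetical and are not part of helper_funcs.py; the descriptions are shortened from the table above.

```python
# Hypothetical helper (not from helper_funcs.py): look up the table row for a score.
GS_DESCRIPTIONS = {
    5: "completely equivalent",
    4: "mostly equivalent, some unimportant details differ",
    3: "roughly equivalent, some important information differs or is missing",
    2: "not equivalent, but share some details",
    1: "not equivalent, but on the same topic",
    0: "on different topics",
}


def describe_score(score):
    """Clip a real-valued score to [0, 5], round it, and return (label, description)."""
    clipped = min(5.0, max(0.0, float(score)))
    # Python's round() sends exact .5 ties to the nearest even integer.
    label = int(round(clipped))
    return label, GS_DESCRIPTIONS[label]


print(describe_score(4.2))
```

Rounding is only for human inspection; the evaluation further down correlates the raw real-valued predictions with the gold scores.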

In [2]:
# Helper module with the data loaders load_sentences and load_gs, plus feature extraction
from helper_funcs import *


All of our data loaders, preprocessing, feature extraction, and post-processing functions live in the helper_funcs.py file, which keeps this notebook clean and easy to present.

Below, we can see the initial steps of our pipeline: 

**Functions:**

- *load_sentences(PATH):* opens and reads every file under the specified PATH whose name matches `STS.input.*`, and loads the sentence pairs into a pandas.DataFrame for easier data manipulation.
- *load_gs(PATH):* similar to load_sentences, but loads all `STS.gs.*` files under the specified PATH into a pandas.DataFrame.
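
The real loaders live in helper_funcs.py and are not shown here, so the following is only a minimal sketch of what they might look like. It assumes each `STS.input.*` file holds one tab-separated sentence pair per line and each `STS.gs.*` file holds one score per line. The function names and the column names `sent1`, `sent2`, and `gs` are hypothetical.

```python
import glob
import os

import pandas as pd


def load_sentences_sketch(path):
    """Read every STS.input.* file under `path` into one DataFrame of sentence pairs."""
    rows = []
    # Sorting keeps the file order deterministic and aligned with the gs files.
    for fname in sorted(glob.glob(os.path.join(path, "STS.input.*"))):
        with open(fname, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue  # skip blank lines
                parts = line.split("\t", 1)
                if len(parts) != 2:
                    raise ValueError(f"expected two tab-separated sentences in {fname}")
                rows.append(parts)
    return pd.DataFrame(rows, columns=["sent1", "sent2"])


def load_gs_sketch(path):
    """Read every STS.gs.* file under `path` into a one-column DataFrame of scores."""
    scores = []
    for fname in sorted(glob.glob(os.path.join(path, "STS.gs.*"))):
        with open(fname, encoding="utf-8") as f:
            scores.extend(float(line) for line in f if line.strip())
    return pd.DataFrame({"gs": scores})
```

Reading the files line by line, rather than through a CSV parser, sidesteps quoting problems when a sentence contains quotation marks.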

In [3]:
import pickle
# Training paths
TRAIN_PATH = './data/train/input/'
TRAIN_GS_PATH = './data/train/gs/'
# Test paths
TEST_PATH = './data/test/input/'
TEST_GS_PATH = './data/test/gs/'

# Load the sentence pairs and their gold-standard ("gs") scores.
# The gs labels are needed for training and evaluation even when the features are already pickled.
X_train, y_train = load_sentences(TRAIN_PATH), load_gs(TRAIN_GS_PATH)
X_test, y_test = load_sentences(TEST_PATH), load_gs(TEST_GS_PATH)

# Feature extraction with standardized values.
# --> UNCOMMENT THESE LINES IF THE FEATURES ARE NOT ALREADY PICKLED
#X_train_scaled_norm = extract_features(X_train, scaled=True)
#X_test_scaled_norm = extract_features(X_test, scaled=True)

In [None]:
# Pickle the features for later use so they do not have to be recomputed
# saving the training features 
with open("./X_train_scaled_norm.pickle",'wb') as f:
    pickle.dump(X_train_scaled_norm,f,pickle.HIGHEST_PROTOCOL)

# saving the testing features 
with open("./X_test_scaled_norm.pickle",'wb') as f:
    pickle.dump(X_test_scaled_norm,f,pickle.HIGHEST_PROTOCOL)


In [4]:
# Load the pickled features to avoid re-processing
with open("./X_train_scaled_norm.pickle", "rb") as f:
    X_train_scaled_norm = pickle.load(f)
with open("./X_test_scaled_norm.pickle", "rb") as f:
    X_test_scaled_norm = pickle.load(f)

In [5]:
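from sklearn.svm import SVR
from scipy.stats import pearsonr

# RBF-kernel support vector regressor trained on the standardized features
svr = SVR(kernel='rbf', gamma=0.02, C=150, epsilon=0.50, tol=0.1)
svr.fit(X_train_scaled_norm, y_train.values.reshape(-1,))

# Predict similarity scores for the test set
y_pred = svr.predict(X_test_scaled_norm)

# Flatten predictions and gold scores to 1-D arrays before correlating them
y_c, y_d = y_pred.reshape(-1,), y_test.values.reshape(-1,)

# Pearson correlation between the predictions and the gold standard
corr = pearsonr(y_c, y_d)[0]
print(f"PEARSON CORRELATION {corr:.4f}")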

PEARSON CORRELATION 0.7892
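
For reference, the Pearson correlation reported above is the covariance of the two score vectors divided by the product of their standard deviations. The sketch below computes it directly with NumPy. The function name `pearson_sketch` is hypothetical; the notebook itself uses `scipy.stats.pearsonr`.

```python
import numpy as np


def pearson_sketch(x, y):
    """Pearson r: sum of centered products over the product of centered norms."""
    # ravel() flattens column vectors of shape (n, 1) into 1-D arrays of shape (n,).
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


print(pearson_sketch([1, 2, 3, 4], [1, 3, 2, 4]))
```

Because the measure is invariant to shifting and positive rescaling of either vector, a model can achieve a high correlation even if its predictions fall outside the 0-5 range.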


In [6]:
# Compare our model against the baseline Pearson correlation
baseline = 0.7562
diff = corr - baseline
pcnt_chng = (diff / baseline) * 100
print(f"Difference between our model and baseline: {diff:.4f}")
print(f"The percent change was {pcnt_chng:.4f}%")

Difference between our model and baseline: 0.0330
The percent change was 4.3588%


In [None]:
# Using an MLP
from sklearn.neural_network import MLPRegressor

HIDDEN_LAYERS = (500, 3)  # two hidden layers: 500 units, then 3 units
BATCH_SIZE = 250

mlp_regr = MLPRegressor(hidden_layer_sizes=HIDDEN_LAYERS,
                        batch_size=BATCH_SIZE,
                        early_stopping=False,
                        solver='sgd',
                        max_iter=5000,
                        epsilon=1e-7,  # only used by solver='adam'; no effect with 'sgd'
                        learning_rate_init=0.01,
                        learning_rate='adaptive',
                        verbose=True,
                        n_iter_no_change=30)
mlp_regr.fit(X_train_scaled_norm, y_train.values.reshape(-1,))

y_pred = mlp_regr.predict(X_test_scaled_norm)

# score() returns the coefficient of determination (R^2), not the Pearson correlation
score = mlp_regr.score(X_test_scaled_norm, y_test.values.reshape(-1,))

In [None]:
score

In [None]:
# y_test is a DataFrame, so flatten it to a 1-D array before correlating
corr = pearsonr(y_pred, y_test.values.reshape(-1,))[0]
print(f"PEARSON CORRELATION {corr:.4f}")