# Semantic Textual Similarity (STS) Project

The objective of this project is to determine how similar two sentences are. One sentence is said to be a "paraphrase" of another when the content (or message) is the same but the words and/or structure differ.

An example from the trial set: 
 - The bird is bathing in the sink.

 - Birdie is washing itself in the water basin.

We are given training and test sets in which each sentence pair is labeled with a gold-standard score (the "gs") on a scale of 0-5.

|label|	description|
| :-: | :-: |
|5	| They are completely equivalent, as they mean the same thing.|
|4	| They are mostly equivalent, but some unimportant details differ.|
|3	| They are roughly equivalent, but some important information differs or is missing.|
|2	| They are not equivalent, but share some details.|
|1	| They are not equivalent, but are on the same topic.|
|0	| They are on different topics.|

In [71]:
# Helper module with the data loaders (load_sentences, load_gs) and the feature extraction
from helper_funcs import *

All of our data loaders, data preprocessing, feature extraction and post-processing are stored in the helper_funcs.py file, which keeps this notebook cleaner to present.

Below, we can see the initial steps of our pipeline: 

**Functions:**

- *load_sentences(PATH):* reads every file in the specified PATH whose name matches `STS.input.*` and loads the sentence pairs into a pandas.DataFrame for easier data manipulation.
- *load_gs(PATH):* similar to load_sentences, it reads every `STS.gs.*` file in the specified PATH and loads the gold-standard scores into a pandas.DataFrame.
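The actual loaders live in helper_funcs.py and are not shown here. As a rough sketch of what they might look like, the block below assumes that each `STS.input.*` file holds two tab-separated sentences per line and each `STS.gs.*` file holds one score per line. The function names (`load_sentences_sketch`, `load_gs_sketch`) and the column names are hypothetical, chosen only for illustration.

```python
import csv
import glob
import os

import pandas as pd


def load_sentences_sketch(path):
    """Read every STS.input.* file under `path` into one DataFrame.

    Assumption: each line holds two sentences separated by a tab.
    """
    frames = []
    # Sorting keeps the file order aligned with the matching STS.gs.* files
    for file in sorted(glob.glob(os.path.join(path, "STS.input.*"))):
        frames.append(
            pd.read_csv(
                file,
                sep="\t",
                header=None,
                names=["sentence_1", "sentence_2"],  # hypothetical column names
                quoting=csv.QUOTE_NONE,  # keep quotation marks inside sentences as-is
            )
        )
    return pd.concat(frames, ignore_index=True)


def load_gs_sketch(path):
    """Read every STS.gs.* file under `path` into one DataFrame.

    Assumption: each line holds a single gold-standard score.
    """
    frames = [
        pd.read_csv(file, header=None, names=["gs"])  # hypothetical column name
        for file in sorted(glob.glob(os.path.join(path, "STS.gs.*")))
    ]
    return pd.concat(frames, ignore_index=True)
```

Both functions walk the files in sorted order so that row *i* of the sentences frame lines up with row *i* of the scores frame, provided the input and gs files share the same suffixes.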

In [75]:
# TRAINING PATH
TRAIN_PATH = './data/train/input/'
TRAIN_GS_PATH = './data/train/gs/'
# TEST PATH
TEST_PATH = './data/test/input/'
TEST_GS_PATH = './data/test/gs/'

# Loading the data
# The gold-standard labels (y_train, y_test) are needed in every run.
X_train, y_train = load_sentences(TRAIN_PATH), load_gs(TRAIN_GS_PATH)
X_test, y_test = load_sentences(TEST_PATH), load_gs(TEST_GS_PATH)

# Feature extraction with standardized values
# --> UNCOMMENT THESE LINES IF THE FEATURES ARE NOT PICKLED YET
#X_train_scaled_norm = extract_features(X_train, scaled=True)
#X_test_scaled_norm = extract_features(X_test, scaled=True)

In [None]:
import pickle
# pickle the extracted features so they do not have to be recomputed later
# saving the training features 
with open("./X_train_scaled_norm.pickle",'wb') as f:
    pickle.dump(X_train_scaled_norm,f,pickle.HIGHEST_PROTOCOL)

# saving the testing features 
with open("./X_test_scaled_norm.pickle",'wb') as f:
    pickle.dump(X_test_scaled_norm,f,pickle.HIGHEST_PROTOCOL)

In [76]:
import pickle
# opening the pickle files to avoid re-processing
with open("X_train_scaled_norm.pickle", "rb") as f:
    X_train_scaled_norm = pickle.load(f)
with open("X_test_scaled_norm.pickle", "rb") as f:
    X_test_scaled_norm = pickle.load(f)
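The save and load cells above can be folded into a single "load if cached, otherwise compute and cache" step, which removes the need to comment lines in and out between runs. The helper below is a minimal sketch of that pattern. `load_or_compute` is a hypothetical name and is not part of helper_funcs.py.

```python
import os
import pickle


def load_or_compute(pickle_path, compute_fn):
    """Return the object cached at `pickle_path`; compute and cache it if absent."""
    if os.path.exists(pickle_path):
        with open(pickle_path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(pickle_path, "wb") as f:
        pickle.dump(result, f, pickle.HIGHEST_PROTOCOL)
    return result


# Hypothetical usage with the names from this notebook:
# X_train_scaled_norm = load_or_compute(
#     "./X_train_scaled_norm.pickle",
#     lambda: extract_features(X_train, scaled=True),
# )
```

One caveat of this design: the cache is keyed only on the file name, so the pickle has to be deleted by hand whenever the feature extraction changes.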

In [77]:
from sklearn.svm import SVR
from scipy.stats import pearsonr
svr = SVR(kernel = 'rbf', gamma = 0.02, C = 150, epsilon = 0.50, tol = 0.1)
svr.fit(X_train_scaled_norm, y_train.values.reshape(-1,))

# Predict
y_pred = svr.predict(X_test_scaled_norm)

In [108]:
corr = pearsonr(y_pred, y_test.values.reshape(-1,))[0]
print(f"PEARSON CORRELATION {corr:.4f}")

PEARSON CORRELATION 0.7892


In [110]:
# checking the difference between our model and the baseline
baseline = 0.7562
diff = corr - baseline
pcnt_chng = (diff / baseline) * 100
print(f"Difference between our model and baseline: {diff:.4f}")
print(f"The percent change was {pcnt_chng:.4f}%")

Difference between our model and baseline: 0.0330
The percent change was 4.3661%


In [145]:
# Using an MLP
from sklearn.neural_network import MLPRegressor

# one entry per hidden layer: 500 units, then 3 units
HIDDEN_LAYERS = (500, 3)
BATCH_SIZE = 250

mlp_regr = MLPRegressor(hidden_layer_sizes=HIDDEN_LAYERS,
                        batch_size=BATCH_SIZE,
                        early_stopping=False,
                        solver='sgd',
                        max_iter=5000,
                        epsilon=1e-7,  # only used by solver='adam'; ignored with 'sgd'
                        learning_rate_init=0.01,
                        learning_rate='adaptive',
                        verbose=True,
                        n_iter_no_change=30)
mlp_regr.fit(X_train_scaled_norm, y_train.values.reshape(-1,))

y_pred = mlp_regr.predict(X_test_scaled_norm)

# score() returns the coefficient of determination (R^2), not a correlation
score = mlp_regr.score(X_test_scaled_norm, y_test.values.reshape(-1,))

Iteration 1, loss = 1.78340217
Iteration 2, loss = 0.90509741
Iteration 3, loss = 0.82808012
Iteration 4, loss = 0.80055436
Iteration 5, loss = 0.78369295
Iteration 6, loss = 0.76834989
Iteration 7, loss = 0.75640489
Iteration 8, loss = 0.75041499
Iteration 9, loss = 0.74523004
Iteration 10, loss = 0.73466582
Iteration 11, loss = 0.72975278
Iteration 12, loss = 0.72194398
Iteration 13, loss = 0.71728007
Iteration 14, loss = 0.71006880
Iteration 15, loss = 0.70668046
Iteration 16, loss = 0.69990243
Iteration 17, loss = 0.69169428
Iteration 18, loss = 0.68822357
Iteration 19, loss = 0.68410500
Iteration 20, loss = 0.68155556
Iteration 21, loss = 0.67902896
Iteration 22, loss = 0.66852683
Iteration 23, loss = 0.66405586
Iteration 24, loss = 0.65927156
Iteration 25, loss = 0.65721598
Iteration 26, loss = 0.65088616
Iteration 27, loss = 0.64556189
Iteration 28, loss = 0.64221942
Iteration 29, loss = 0.63799847
Iteration 30, loss = 0.63436508
Iteration 31, loss = 0.62618464
...
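To see what the `hidden_layer_sizes` tuple used above actually builds, the sketch below fits the same layout on a tiny synthetic dataset and inspects the weight-matrix shapes. The data, `max_iter` and `random_state` values are made up for illustration and are unrelated to the STS features.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 4))  # synthetic data: 40 samples, 4 features
y_toy = X_toy.sum(axis=1)

with warnings.catch_warnings():
    # The toy fit is deliberately short, so silence the convergence warning
    warnings.simplefilter("ignore", ConvergenceWarning)
    toy_mlp = MLPRegressor(hidden_layer_sizes=(500, 3), max_iter=5, random_state=0)
    toy_mlp.fit(X_toy, y_toy)

# One weight matrix per layer transition: input -> 500 -> 3 -> output
layer_shapes = [w.shape for w in toy_mlp.coefs_]
print(layer_shapes)  # [(4, 500), (500, 3), (3, 1)]
```

So `(500, 3)` means two hidden layers, the second with only 3 units. If three hidden layers of 500 units each were intended, that would be written `(500, 500, 500)`; the notebook does not say which was meant.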

In [146]:
score

-0.7858825450751643

In [147]:
corr = pearsonr(y_pred, y_test.values.reshape(-1,))[0]
print(f"PEARSON CORRELATION {corr:.4f}")

PEARSON CORRELATION 0.1796
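The two numbers above measure different things. `MLPRegressor.score` returns the coefficient of determination R^2, which turns negative whenever the predictions have a larger squared error than always predicting the mean of the gold standard. Pearson correlation is unaffected by the scale and offset of the predictions. The toy example below, with made-up numbers, shows that the two can disagree in sign.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

# Made-up gold-standard scores and predictions, for illustration only
y_true_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_pred_toy = 0.1 * y_true_toy + 4.0  # right ordering, wrong scale and offset

pearson_toy = pearsonr(y_pred_toy, y_true_toy)[0]
r2_toy = r2_score(y_true_toy, y_pred_toy)

print(f"Pearson: {pearson_toy:.2f}")  # 1.00
print(f"R^2:     {r2_toy:.2f}")       # -0.86
```

For this reason the Pearson value, not `score`, is the number to compare with the SVR result and the baseline.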
