# Semantic Textual Similarity (STS) Project

The objective of this project is to determine how similar two sentences are in meaning. A sentence is said to be "paraphrased" when the content (or message) is the same, but the words and/or structure differ.

An example from the trial set: 
 - The bird is bathing in the sink.

 - Birdie is washing itself in the water basin.

We are given training and test sets in which each sentence pair is labeled with a gold-standard score ("gs") on a scale of 0-5.

| Label | Description |
| :-: | :-: |
| 5 | They are completely equivalent, as they mean the same thing. |
| 4 | They are mostly equivalent, but some unimportant details differ. |
| 3 | They are roughly equivalent, but some important information differs or is missing. |
| 2 | They are not equivalent, but share some details. |
| 1 | They are not equivalent, but are on the same topic. |
| 0 | They are on different topics. |

In [71]:
# Helper module providing the data-loading functions load_sentences and load_gs
from helper_funcs import *

In [75]:
# TRAINING PATH
TRAIN_PATH = './data/train/input/'
TRAIN_GS_PATH = './data/train/gs/'
# TEST PATH
TEST_PATH = './data/test/input/'
TEST_GS_PATH = './data/test/gs/'

# Loading the data
# (y_train and y_test are needed below even when the features are already pickled)
X_train, y_train = load_sentences(TRAIN_PATH), load_gs(TRAIN_GS_PATH)
X_test, y_test = load_sentences(TEST_PATH), load_gs(TEST_GS_PATH)

# --> UNCOMMENT THESE LINES ON THE FIRST RUN; LEAVE THEM COMMENTED IF THE FEATURES ARE ALREADY PICKLED
# X_train with extracted features and standardized values
# X_train_scaled_norm = extract_features(X_train, scaled=True)
# X_test with extracted features and standardized values
# X_test_scaled_norm = extract_features(X_test, scaled=True)
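
The calls above pass `scaled=True` to `extract_features`, whose implementation is not shown in this notebook. As a minimal sketch of what standardizing features usually involves, assuming scikit-learn's `StandardScaler` and made-up values (`train_feats` and `test_feats` are hypothetical stand-ins, not the project's real features), the scaler is fitted on the training features only and then reused for the test features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices: rows are sentence pairs, columns are features.
train_feats = np.array([[0.2, 3.0], [0.8, 1.0], [0.5, 2.0]])
test_feats = np.array([[0.5, 2.0], [1.1, 0.0]])

# Fit on the training data only, so no test-set statistics leak into the model.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_feats)
# Reuse the training mean and standard deviation for the test data.
test_scaled = scaler.transform(test_feats)

print(train_scaled.mean(axis=0))  # approximately 0 for every column
print(test_scaled[0])             # this row equals the training mean, so it maps to zeros
```

Whether `extract_features` shares its scaling statistics between the two calls cannot be told from the notebook. If it standardizes each set independently, the train and test features end up on slightly different scales.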

In [None]:
import pickle
# Pickle the features for later use so they do not have to be recomputed.
# Run this cell only after the features above have been computed.
# saving the training features
with open("./X_train_scaled_norm.pickle", 'wb') as f:
    pickle.dump(X_train_scaled_norm, f, pickle.HIGHEST_PROTOCOL)

# saving the testing features
with open("./X_test_scaled_norm.pickle", 'wb') as f:
    pickle.dump(X_test_scaled_norm, f, pickle.HIGHEST_PROTOCOL)

In [76]:
import pickle
# opening the pickle files to avoid re-processing
with open("./X_train_scaled_norm.pickle", "rb") as f:
    X_train_scaled_norm = pickle.load(f)
with open("./X_test_scaled_norm.pickle", "rb") as f:
    X_test_scaled_norm = pickle.load(f)
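
The save and load cells above can be folded into a single step that computes the features only when no pickle exists yet, which removes the need to comment lines in and out between runs. A minimal sketch, assuming a hypothetical helper `load_or_compute` (not part of `helper_funcs`) and a stand-in function in place of the real feature extraction:

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the object pickled at `path`, computing and saving it first if needed."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f, pickle.HIGHEST_PROTOCOL)
    return result

# Demonstration with a stand-in computation and a temporary directory.
calls = []

def fake_features():
    calls.append(1)  # record that the expensive step actually ran
    return [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "features.pickle")
    first = load_or_compute(cache, fake_features)   # computes and saves
    second = load_or_compute(cache, fake_features)  # loads from disk

print(first == second, len(calls))
```

In this notebook the call might look like `load_or_compute("./X_train_scaled_norm.pickle", lambda: extract_features(X_train, scaled=True))`, assuming `extract_features` behaves as the commented-out lines suggest.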

In [77]:
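from sklearn.svm import SVR
from scipy.stats import pearsonr

# RBF-kernel support vector regressor with fixed hyperparameters
svr = SVR(kernel='rbf', gamma=0.01, C=200, epsilon=0.50, tol=0.25)
# flatten the labels into the 1-D array that fit() expects
svr.fit(X_train_scaled_norm, y_train.values.reshape(-1,))

# Predict similarity scores for the test set
y_pred = svr.predict(X_test_scaled_norm)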

In [108]:
corr = pearsonr(y_pred, y_test.values.reshape(-1,))[0]
print(f"PEARSON CORRELATION {corr:.4f}")

PEARSON CORRELATION 0.7892
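
For reference, the Pearson correlation reported above is the covariance of the two score vectors divided by the product of their standard deviations. A minimal pure-Python sketch of that formula, using made-up scores rather than the project's predictions (`pearson` is an illustrative helper; the notebook itself uses `scipy.stats.pearsonr`):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up gold-standard scores and predictions on the 0-5 scale.
gold = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
pred = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]  # every prediction is off by the same offset

print(round(pearson(gold, pred), 6))
```

Because the coefficient is unaffected by a constant offset or a positive rescaling, uniformly shifted predictions like these still correlate perfectly with the gold scores. A high Pearson value therefore shows that the predictions track the gold scores linearly, not that the absolute values match.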


In [109]:
# comparing our result against the baseline correlation
baseline = 0.7562
diff = corr - baseline
# relative improvement over the baseline, in percent
pcnt_chng = (diff / baseline) * 100
print(f"Difference between our model and baseline: {diff:.4f}")
print(f"The percent change was {pcnt_chng:.4f}%")

Difference between our model and baseline: 0.0330
The percent change was 4.3661%
