# Semantic Textual Similarity (STS) Project

The objective of this project is to determine how similar two sentences are in meaning. A sentence is said to be a "paraphrase" of another when the content (or message) is the same but the words and/or structure differ.

An example from the trial set: 
 - The bird is bathing in the sink.

 - Birdie is washing itself in the water basin.
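
To see why paraphrase detection needs more than counting shared words, here is a minimal sketch that computes the Jaccard word overlap of the pair above. The `tokens` and `jaccard` helpers are illustrative assumptions and are not taken from the project's `helper_funcs` module.

```python
import re

def tokens(sentence):
    """Lowercase a sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def jaccard(a, b):
    """Jaccard similarity of the two sentences' word sets: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

s1 = "The bird is bathing in the sink."
s2 = "Birdie is washing itself in the water basin."
print(round(jaccard(s1, s2), 3))  # 0.273: only "the", "is" and "in" are shared
```

The two sentences mean nearly the same thing, yet only 3 of their 11 distinct words coincide, so a purely lexical score would understate their similarity.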

Each sentence pair in the training and test sets is labeled with a "gs" (gold standard) score on a scale of 0-5.

| label | description |
| :-: | :-: |
| 5 | They are completely equivalent, as they mean the same thing. |
| 4 | They are mostly equivalent, but some unimportant details differ. |
| 3 | They are roughly equivalent, but some important information differs or is missing. |
| 2 | They are not equivalent, but share some details. |
| 1 | They are not equivalent, but are on the same topic. |
| 0 | They are on different topics. |

In [None]:
# Helper module with the data loaders (load_sentences, load_gs) and the feature-extraction functions used below
from helper_funcs import *

In [None]:
# TRAINING PATH
TRAIN_PATH = './data/train/input/'
TRAIN_GS_PATH = './data/train/gs/'
# TEST PATH
TEST_PATH = './data/test/input/'
TEST_GS_PATH = './data/test/gs/'
# Load the sentence pairs and their gold-standard scores
X_train, y_train = load_sentences(TRAIN_PATH), load_gs(TRAIN_GS_PATH)
X_test, y_test = load_sentences(TEST_PATH), load_gs(TEST_GS_PATH)
# X_train with extracted features and standardized values 
X_train_scaled_norm = extract_features(X_train,scaled=True)
# X_test with extracted features and standardized values 
X_test_scaled_norm = extract_features(X_test,scaled=True)
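
How `extract_features` standardizes its output is not shown here. As a rough illustration of what "standardized values" usually means, the sketch below fits per-column mean and standard deviation on the training rows and reuses them for the test rows. The function names and the toy data are assumptions made for this example only.

```python
from statistics import mean, pstdev

def fit_standardizer(rows):
    """Return per-column (mean, std) computed from the training rows."""
    columns = list(zip(*rows))
    # A constant column has std 0; fall back to 1.0 so it maps to 0 instead of dividing by 0.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]

def standardize(rows, stats):
    """Apply previously fitted column statistics to any set of rows."""
    return [[(value - mu) / sigma for value, (mu, sigma) in zip(row, stats)]
            for row in rows]

# Toy feature matrices (hypothetical values)
train_rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
test_rows = [[2.0, 10.0], [4.0, 12.0]]

stats = fit_standardizer(train_rows)
train_scaled = standardize(train_rows, stats)
test_scaled = standardize(test_rows, stats)
print(test_scaled)
```

Reusing the training statistics for the test set keeps both sets on the same scale and avoids leaking test information into the preprocessing step.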

In [None]:
# Separate each pair into its two (processed) sentences
SA, SB = get_processed_sentences(X_train)
# Jaccard, fuzzy-matching and Levenshtein features
feature_df = jd_fuzz_lev(SA, SB, X_train)
# Unigram, bigram and trigram features
ngram_features = get_ngram_features(SA, SB)
# Length-related features
length_features = get_length_features(SA,SB)
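
The helper functions above live in `helper_funcs` and are not reproduced here. The sketch below shows one plausible way n-gram overlap and length features could be computed for a single sentence pair. All names (`word_ngrams`, `ngram_overlap`, `length_features`) are hypothetical and may differ from what the project actually does.

```python
import re

def tokens(sentence):
    """Lowercase a sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def word_ngrams(sentence, n):
    """Set of word n-grams (as tuples) in a sentence."""
    words = tokens(sentence)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(a, b, n):
    """Shared n-grams divided by all distinct n-grams of the two sentences."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

def length_features(a, b):
    """Absolute difference and shorter/longer ratio of the token counts."""
    len_a, len_b = len(tokens(a)), len(tokens(b))
    longer = max(len_a, len_b)
    return {"abs_diff": abs(len_a - len_b),
            "ratio": min(len_a, len_b) / longer if longer else 1.0}

s1 = "The bird is bathing in the sink."
s2 = "Birdie is washing itself in the water basin."

# One feature per n-gram order, plus the length features
features = {f"overlap_{n}gram": ngram_overlap(s1, s2, n) for n in (1, 2, 3)}
features.update(length_features(s1, s2))
print(features)
```

For this pair the overlap drops quickly as n grows (one shared bigram, no shared trigram), which is why higher-order n-grams mostly help with near-identical sentences.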

In [None]:
from sklearn.svm import SVR
from scipy.stats import pearsonr

# RBF-kernel support vector regressor, trained on the standardized training features
svr = SVR(kernel='rbf', gamma=0.01, C=200, epsilon=0.50, tol=0.25)
svr.fit(X_train_scaled_norm, y_train.values.reshape(-1,))

# Predict similarity scores for the standardized test features
test_predict = svr.predict(X_test_scaled_norm)

In [None]:
correlation = pearsonr(test_predict, y_test.values.reshape(-1,))[0]
print("Pearson correlation:", correlation)
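
For reference, the Pearson correlation used as the evaluation metric can be written out by hand. The following is a small standard-library sketch with made-up scores; `scipy.stats.pearsonr` returns the same coefficient as the first element of its result.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical gold scores and two sets of predictions
gold = [0.0, 1.0, 2.5, 4.0, 5.0]
rescaled = [0.5 * g + 1.0 for g in gold]   # wrong scale, perfect ranking
reversed_scores = [5.0 - g for g in gold]  # perfectly anti-correlated

print(pearson(gold, rescaled))
print(pearson(gold, reversed_scores))
```

Because the coefficient is invariant to shifting and positive rescaling, predictions can be off in absolute terms and still reach a correlation of 1, so it measures how well the predicted scores track the gold scores rather than how close they are.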

In [None]:
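# Grid search over the SVR hyperparameters
from sklearn.model_selection import GridSearchCV

# Each value must be a list of candidates; ('rbf') is just the string 'rbf'.
# 'degree' and 'coef0' are ignored by the RBF kernel, so they are left out of the grid.
param = {'kernel': ['rbf'],
         'C': [10, 20, 50, 100],
         'gamma': [1e-4, 1e-3, 0.01, 0.1, 0.2, 0.5, 0.6, 0.9]}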

modelsvr = SVR()

grids = GridSearchCV(modelsvr,param,cv=5,n_jobs=-1,verbose=2)

grids.fit(X_train_scaled_norm, y_train.values.reshape(-1,))

In [None]:
test_predict = grids.predict(X_test_scaled_norm)
correlation = pearsonr(test_predict, y_test.values.reshape(-1,))[0]
print("Pearson correlation:", correlation)
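
Unless a `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's default score, which for `SVR` is R², whereas the results here are reported as Pearson correlation. The sketch below shows one way the search could be aligned with the reported metric using `make_scorer`. The synthetic data, the small grid and the `pearson_score` helper are assumptions for illustration, not the project's setup.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def pearson_score(y_true, y_pred):
    """Pearson r between gold scores and predictions (higher is better)."""
    # Correlation is undefined for constant input; score that case as 0.
    if np.std(y_true) == 0 or np.std(y_pred) == 0:
        return 0.0
    return pearsonr(y_true, y_pred)[0]

pearson_scorer = make_scorer(pearson_score, greater_is_better=True)

# Synthetic stand-in for the standardized features and 0-5 gold scores
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
noise = rng.normal(scale=0.3, size=120)
y_demo = np.clip(2.5 + X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise, 0, 5)

param_grid = {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]}
search = GridSearchCV(SVR(), param_grid, cv=3, scoring=pearson_scorer)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated Pearson r of the best candidate
```

With a scorer like this, `best_params_` reflects the configuration with the highest cross-validated correlation, which is the quantity later printed for the test set.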