# STS PROJECT - SEMANTIC TEXTUAL SIMILARITY
- Andrea Masi
- Victor Badenas

***

## INTRODUCTION
In this project we propose a solution to one of the tasks of SemEval (Semantic Evaluation Exercises), a series of workshops whose main aim is the evaluation and comparison of semantic analysis systems.

The task addressed is Semantic Textual Similarity (STS), which is closely related to paraphrase detection: the goal is to measure how similar two sentences are in meaning.

This notebook contains the methodology and procedures used to obtain the result shown at the end of the notebook. We propose an approach where the distance between two sentences is characterized by a vector composed of different traditional similarity metrics in NLP. 

In [1]:
import sys
sys.path.append('src')

***
## Module Imports
Importing necessary modules for the notebook.

From packages:

In [2]:
import nltk
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from nltk.metrics.distance import jaccard_distance
from collections.abc import Iterable
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.metrics import make_scorer

From our code:

In [3]:
from data_utils import load_data
from dimension.lexical import *
from dimension.syntactical import *

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/victorbadenas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/victorbadenas/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /home/victorbadenas/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


***
## Data
The data is loaded with a function from `data_utils.py`, which returns dataframes read from a tsv file containing all pairs of sentences as well as their gold standard scores. If the tsv file does not exist, it is created automatically from all the sentences in the containing folder.

In [4]:
train_data, test_data = load_data('data/')
print(
    f"train_data samples: {len(train_data)}, test_data samples: {len(test_data)}"
)

train_data samples: 2234, test_data samples: 3108


Display of the first 5 data points of the `train_data` and `test_data` dataframes.

In [5]:
train_data.head()

Unnamed: 0,S1,S2,Gs
0,But other sources close to the sale said Viven...,But other sources close to the sale said Viven...,4.0
1,Micron has declared its first quarterly profit...,Micron's numbers also marked the first quarter...,3.75
2,The fines are part of failed Republican effort...,"Perry said he backs the Senate's efforts, incl...",2.8
3,"The American Anglican Council, which represent...","The American Anglican Council, which represent...",3.4
4,The tech-loaded Nasdaq composite rose 20.96 po...,The technology-laced Nasdaq Composite Index <....,2.4


In [6]:
test_data.head()

Unnamed: 0,S1,S2,Gs
0,The problem likely will mean corrective change...,He said the problem needs to be corrected befo...,4.4
1,The technology-laced Nasdaq Composite Index .I...,The broad Standard & Poor's 500 Index .SPX inc...,0.8
2,"""It's a huge black eye,"" said publisher Arthur...","""It's a huge black eye,"" Arthur Sulzberger, th...",3.6
3,SEC Chairman William Donaldson said there is a...,"""I think there's a building confidence that th...",3.4
4,Vivendi shares closed 1.9 percent at 15.80 eur...,"In New York, Vivendi shares were 1.4 percent d...",1.4


***
## Similarity functions
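Declaration of the 4 basic similarity functions for this task: Jaccard, overlap, cosine and Dice. All of them operate on the sets of unique items of the two inputs; only the Jaccard and Dice variants are used in the feature vector built below.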

In [7]:
def jaccard_similarity(s1, s2):
    assert isinstance(s1, Iterable), f"s1 must be an iterable, not {type(s1)}"
    assert isinstance(s2, Iterable), f"s2 must be an iterable, not {type(s2)}"
    return 1 - jaccard_distance(set(s1), set(s2))

In [8]:
def overlap_similarity(s1, s2):
    assert isinstance(s1, Iterable), f"s1 must be an iterable, not {type(s1)}"
    assert isinstance(s2, Iterable), f"s2 must be an iterable, not {type(s2)}"
    s1 = set(s1)
    s2 = set(s2)
    intersection = s1.intersection(s2)
    return len(intersection) / min(len(s1), len(s2))

In [9]:
def cosine_similarity(s1, s2):
    assert isinstance(s1, Iterable), f"s1 must be an iterable, not {type(s1)}"
    assert isinstance(s2, Iterable), f"s2 must be an iterable, not {type(s2)}"
    s1 = set(s1)
    s2 = set(s2)
    intersection = s1.intersection(s2)
    # set-based cosine: intersection size over the geometric mean of the set sizes
    return len(intersection) / ((len(s1) * len(s2))**0.5)

In [10]:
def dice_similarity(s1, s2):
    assert isinstance(s1, Iterable), f"s1 must be an iterable, not {type(s1)}"
    assert isinstance(s2, Iterable), f"s2 must be an iterable, not {type(s2)}"
    s1 = set(s1)
    s2 = set(s2)
    intersection = s1.intersection(s2)
    return 2 * len(intersection) / (len(s1) + len(s2))
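
As a quick sanity check of the four measures, the following self-contained sketch evaluates them on a toy pair of token lists. The helper `set_similarities` and the example sentences are illustrative and not part of the project code; the cosine here is the set-based form, that is, the intersection size divided by the geometric mean of the two set sizes.

```python
from math import sqrt


def set_similarities(tokens_1, tokens_2):
    """Return the four set-based similarities for two token sequences.

    Hypothetical helper for illustration; assumes both sequences are non-empty.
    """
    a, b = set(tokens_1), set(tokens_2)
    shared = len(a & b)
    return {
        "jaccard": shared / len(a | b),
        "overlap": shared / min(len(a), len(b)),
        "cosine": shared / sqrt(len(a) * len(b)),
        "dice": 2 * shared / (len(a) + len(b)),
    }


s1 = "the cat sat on the mat".split()  # 5 unique tokens
s2 = "a cat sat".split()               # 3 unique tokens, 2 of them shared with s1
scores = set_similarities(s1, s2)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

All four measures are 1 for identical token sets and 0 for disjoint ones; they differ in how strongly they penalize a size mismatch between the two sentences.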

***
## Feature loading
This section transforms each pair of sentences into a vector of different similarity metrics.

### Feature vector builder for a dataframe of sentence pairs

The function below iterates over the dataframe containing the sentence pairs (any other columns are ignored). It requires the sentence columns to be named `"S1"` and `"S2"`.

Returns a numpy array of shape `(n_sentence_pairs, n_features)`

In [11]:
def get_features(df: pd.DataFrame):
    assert "S1" in df.columns, "S1 not in dataframe"
    assert "S2" in df.columns, "S2 not in dataframe"

    features = [None] * len(df)  #preallocated for memory efficiency

    for index, row in df.iterrows():
        sentence1, sentence2 = row['S1'], row['S2']

        # Get all words
        tokenized_1, tokenized_2 = get_tokenized_sentences(
            sentence1, sentence2, return_unique_words=False)
        tokenized_lc_1, tokenized_lc_2 = get_tokenized_sentences_lowercase(
            tokenized_1, tokenized_2, return_unique_words=False)

        # Get words without stopwords
        no_stopwords_1, no_stopwords_2 = filter_stopwords(
            tokenized_1, tokenized_2, return_unique_words=False)
        no_stopwords_lc_1, no_stopwords_lc_2 = filter_stopwords(
            tokenized_lc_1, tokenized_lc_2, return_unique_words=False)

        # Lemmas
        lemmatized_1, lemmatized_2 = get_lemmas(tokenized_1,
                                                tokenized_2,
                                                return_unique_words=False)
        lemmatized_lc_1, lemmatized_lc_2 = get_lemmas(
            tokenized_lc_1, tokenized_lc_2, return_unique_words=False)

        # Named entities
        sentence_ne_1, sentence_ne_2 = get_named_entities(
            tokenized_1, tokenized_2)

        #lemmas cleaned from stopwords
        stopwords_and_lemmas1, stopwords_and_lemmas2 = get_lemmas(
            no_stopwords_1, no_stopwords_2, return_unique_words=False)

        stopwords_and_lemmas_lc_1, stopwords_and_lemmas_lc_2 = get_lemmas(
            no_stopwords_lc_1, no_stopwords_lc_2, return_unique_words=False)

        # Named entities without stopwords, in lowercase
        ne_no_stopwords_1, ne_no_stopwords_2 = filter_stopwords(
            sentence_ne_1,
            sentence_ne_2,
            return_unique_words=False,
            filter_and_return_in_lowercase=True)

        # Lemmas of the named entities without stopwords, in lowercase
        ne_no_stopwords_lemmas_1, ne_no_stopwords_lemmas_2 = get_lemmas(
            ne_no_stopwords_1, ne_no_stopwords_2, return_unique_words=False)

        # Bigrams and trigrams (stopwords removed)
        bigrams_1, bigrams_2 = get_ngrams(no_stopwords_1, no_stopwords_2, n=2)
        trigrams_1, trigrams_2 = get_ngrams(no_stopwords_1,
                                            no_stopwords_2,
                                            n=3)

        # Bigrams trigrams with sentence tokenizer
        bigrams_sent_1, bigrams_sent_2 = get_ngrams_with_sent_tokenize(
            sentence1, sentence2, n=2)
        trigrams_sent_1, trigrams_sent_2 = get_ngrams_with_sent_tokenize(
            sentence1, sentence2, n=3)

        # Lesk
        lesk_1, lesk_2 = get_lesk_sentences(tokenized_1, tokenized_2)
        lesk_lc_1, lesk_lc_2 = get_lesk_sentences(tokenized_lc_1,
                                                  tokenized_lc_2)

        # Stemmer
        stemmed_1, stemmed_2 = get_stemmed_sentences(sentence1, sentence2)

        # Synset
        average_path = get_synset_similarity(tokenized_1, tokenized_2, "path")
        average_lch = get_synset_similarity(tokenized_1, tokenized_2, "lch")
        average_wup = get_synset_similarity(tokenized_1, tokenized_2, "wup")
        average_lin = get_synset_similarity(tokenized_1, tokenized_2, "lin")

        average_lc_path = get_synset_similarity(tokenized_lc_1, tokenized_lc_2,
                                                "path")
        average_lc_lch = get_synset_similarity(tokenized_lc_1, tokenized_lc_2,
                                               "lch")
        average_lc_wup = get_synset_similarity(tokenized_lc_1, tokenized_lc_2,
                                               "wup")
        average_lc_lin = get_synset_similarity(tokenized_lc_1, tokenized_lc_2,
                                               "lin")

        # ALL Features
        features[index] = [
            jaccard_similarity(tokenized_1, tokenized_2),
            jaccard_similarity(tokenized_lc_1, tokenized_lc_2),
            jaccard_similarity(no_stopwords_1, no_stopwords_2),
            jaccard_similarity(no_stopwords_lc_1, no_stopwords_lc_2),
            jaccard_similarity(lemmatized_1, lemmatized_2),
            jaccard_similarity(lemmatized_lc_1, lemmatized_lc_2),
            jaccard_similarity(sentence_ne_1, sentence_ne_2),
            jaccard_similarity(stopwords_and_lemmas1, stopwords_and_lemmas2),
            jaccard_similarity(stopwords_and_lemmas_lc_1, stopwords_and_lemmas_lc_2),
            jaccard_similarity(bigrams_1, bigrams_2),
            jaccard_similarity(trigrams_1, trigrams_2),
            jaccard_similarity(bigrams_sent_1, bigrams_sent_2),
            jaccard_similarity(trigrams_sent_1, trigrams_sent_2),
            jaccard_similarity(lesk_1, lesk_2),
            jaccard_similarity(lesk_lc_1, lesk_lc_2),
            jaccard_similarity(stemmed_1, stemmed_2),
            dice_similarity(tokenized_1, tokenized_2),
            dice_similarity(tokenized_lc_1, tokenized_lc_2),
            dice_similarity(no_stopwords_1, no_stopwords_2),
            dice_similarity(no_stopwords_lc_1, no_stopwords_lc_2),
            dice_similarity(lemmatized_1, lemmatized_2),
            dice_similarity(lemmatized_lc_1, lemmatized_lc_2),
            dice_similarity(sentence_ne_1, sentence_ne_2),
            dice_similarity(stopwords_and_lemmas1, stopwords_and_lemmas2),
            dice_similarity(stopwords_and_lemmas_lc_1, stopwords_and_lemmas_lc_2),
            dice_similarity(bigrams_1, bigrams_2),
            dice_similarity(trigrams_1, trigrams_2),
            dice_similarity(bigrams_sent_1, bigrams_sent_2),
            dice_similarity(trigrams_sent_1, trigrams_sent_2),
            dice_similarity(lesk_1, lesk_2),
            dice_similarity(lesk_lc_1, lesk_lc_2),
            dice_similarity(stemmed_1, stemmed_2),
            average_path,
            average_lch,
            average_wup,
            average_lin,
            average_lc_path,
            average_lc_lch,
            average_lc_wup,
            average_lc_lin
        ]
        # BEST Features selection for SVR
        """features[index] = [
            jaccard_similarity(tokenized_1, tokenized_2),
            jaccard_similarity(no_stopwords_lc_1, no_stopwords_lc_2),
            jaccard_similarity(sentence_ne_1, sentence_ne_2),
            jaccard_similarity(stopwords_and_lemmas1, stopwords_and_lemmas2),
            jaccard_similarity(trigrams_1, trigrams_2),
            jaccard_similarity(trigrams_sent_1, trigrams_sent_2),
            jaccard_similarity(lesk_1, lesk_2),
            jaccard_similarity(lesk_lc_1, lesk_lc_2),
            jaccard_similarity(stemmed_1, stemmed_2),
            dice_similarity(tokenized_1, tokenized_2),
            dice_similarity(no_stopwords_1, no_stopwords_2),
            dice_similarity(no_stopwords_lc_1, no_stopwords_lc_2),
            dice_similarity(stopwords_and_lemmas_lc_1,
                            stopwords_and_lemmas_lc_2),
            dice_similarity(bigrams_1, bigrams_2), average_lin, average_lc_lch,
            average_lc_wup, average_lc_lin
        ]"""
    return np.array(features)
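
The layout of the resulting vector (every measure applied to every preprocessing "view" of the sentence pair) can be illustrated with a much smaller sketch. The helpers below are hypothetical stand-ins: whitespace splitting replaces the project's tokenizers, and only two views (raw and lowercase tokens) are used.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def pair_features(sentence_1, sentence_2):
    """Build a toy feature row: each measure applied to each view."""
    raw = (sentence_1.split(), sentence_2.split())
    lower = (sentence_1.lower().split(), sentence_2.lower().split())
    views = [raw, lower]
    measures = [jaccard, dice]
    # Same ordering as in get_features: all Jaccard features first, then Dice.
    return [measure(t1, t2) for measure in measures for t1, t2 in views]


row = pair_features("The Cat sat", "the cat slept")
print(row)  # [jaccard_raw, jaccard_lower, dice_raw, dice_lower]
```

In this toy pair the raw tokens share nothing because of the capitalization, while the lowercase view shares two tokens, which is the kind of complementary signal the different views are meant to capture.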

### Features extraction

Using the function declared above, the features are extracted from the `train_data` dataframe. Also the Gold Standard is extracted from its column in the dataframe. The shapes for both numpy vectors are displayed. 

In [12]:
train_features = get_features(train_data)
train_gs = train_data['Gs'].to_numpy()
print(f"train_features.shape: {train_features.shape}")
print(f"train_gs.shape: {train_gs.shape}")

train_features.shape: (2234, 40)
train_gs.shape: (2234,)


In [13]:
test_features = get_features(test_data)
test_gs = test_data['Gs'].to_numpy()
print(f"test_features.shape: {test_features.shape}")
print(f"test_gs.shape: {test_gs.shape}")

test_features.shape: (3108, 40)
test_gs.shape: (3108,)


### Feature scaling

Features are scaled using sklearn's `StandardScaler`: for each feature, the mean is subtracted and the result is divided by the standard deviation, which yields a unified feature space with zero mean and unit variance. The scaler is fitted on the training features only and then applied to both sets.

In [14]:
scaler = StandardScaler()
scaler.fit(train_features)
train_features_scaled = scaler.transform(train_features)
test_features_scaled = scaler.transform(test_features)
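
For reference, standardization maps each feature value `x` to `z = (x - mean) / std`, with the mean and standard deviation estimated on the training set only and then reused for the test set, as in the cell above. The plain-Python sketch below mirrors that for a single feature column. The numbers are made up, and `statistics.pstdev` is used because `StandardScaler` also relies on the population (biased) standard deviation.

```python
from statistics import fmean, pstdev

train_column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up feature values
test_column = [5.0, 9.0]

# "fit": estimate the statistics on the training data only
mean = fmean(train_column)  # 5.0
std = pstdev(train_column)  # 2.0


def standardize(values):
    """Apply the training-set statistics to any column of values."""
    return [(v - mean) / std for v in values]


train_scaled = standardize(train_column)
test_scaled = standardize(test_column)  # reuses the training statistics
print(test_scaled)
```

After the transformation the training column has zero mean and unit variance by construction; the test column generally does not, since it is scaled with statistics it did not contribute to.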

### Model Selection

Four different models were trained using all the available features:
- NNR: Nearest neighbors regressor using `KNeighborsRegressor` from `sklearn.neighbors`.
- MLP: Multi-layer perceptron using `MLPRegressor` from `sklearn.neural_network`.
- SVR: Support vector regressor using `SVR` from `sklearn.svm`.
- MLP with a bottleneck layer: Same as the MLP above, but with two hidden layers, the first of which (the bottleneck) is smaller than the number of features.

### Split definition for GridSearch
To search for the best model, a predefined split is used so that every candidate is fitted and scored on the same fixed partition of the data provided. Two `np.ndarray`s are built by concatenating all the features and all the labels, and an identifier array is passed to `PredefinedSplit` to tell it which samples always stay in the training set (label `-1`) and which belong to the validation fold (label `0`; only one fold is used, and it coincides with the test set).

In [15]:
all_data = np.concatenate([train_features_scaled, test_features_scaled])
all_labels = np.concatenate([train_gs, test_gs])
test_fold = np.array([-1]*train_features_scaled.shape[0] + [0]*test_features_scaled.shape[0])
print(all_data.shape, test_fold.shape)
ps = PredefinedSplit(test_fold)

(5342, 40) (5342,)
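
To make the role of `test_fold` concrete, the sketch below imitates in plain Python how a predefined split is turned into index sets: entries equal to `-1` are never used for validation, and each non-negative value names the validation fold a sample belongs to. The function `predefined_split_indices` is a hypothetical re-implementation for illustration only, not scikit-learn's code.

```python
def predefined_split_indices(test_fold):
    """Yield (train_indices, validation_indices) for each fold id >= 0."""
    fold_ids = sorted({f for f in test_fold if f != -1})
    for fold in fold_ids:
        validation = [i for i, f in enumerate(test_fold) if f == fold]
        train = [i for i, f in enumerate(test_fold) if f != fold]
        yield train, validation


# 3 "training" samples followed by 2 "test" samples, as in the notebook layout
toy_test_fold = [-1, -1, -1, 0, 0]
splits = list(predefined_split_indices(toy_test_fold))
print(splits)
```

With a single fold, every grid-search candidate is fitted once on the first block and scored once on the second, which matches the "Fitting 1 folds" lines in the grid-search logs. Under this scheme the hyperparameters are chosen by their score on the test block, so the reported test correlations are not a fully independent estimate.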


### Definition of the Pearson correlation for the GridSearchCV module
`GridSearchCV` uses a scoring function to decide which combination of parameters is best. Because our goal is to maximize the Pearson correlation on the test set, a custom scorer based on that metric is declared.

In [16]:
pearson_scorer = make_scorer(lambda y, y_hat: pearsonr(y, y_hat)[0])
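
For intuition about what this scorer rewards, here is a plain-Python version of the Pearson coefficient, the covariance of the two sequences divided by the product of their standard deviations. The function name `pearson_r` is illustrative; the notebook itself relies on `scipy.stats.pearsonr`.

```python
from math import sqrt


def pearson_r(y, y_hat):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(y)
    mean_y = sum(y) / n
    mean_hat = sum(y_hat) / n
    cov = sum((a - mean_y) * (b - mean_hat) for a, b in zip(y, y_hat))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_hat = sum((b - mean_hat) ** 2 for b in y_hat)
    return cov / sqrt(var_y * var_hat)


gold = [1.0, 2.0, 3.0, 4.0, 5.0]
# Shifted and rescaled predictions are still perfectly correlated (r = 1 up to rounding).
r_scaled = pearson_r(gold, [2 * g + 10 for g in gold])
# Predictions with two swapped neighbours correlate less (r = 0.8 up to rounding).
r_noisy = pearson_r(gold, [1.0, 3.0, 2.0, 5.0, 4.0])
print(r_scaled, r_noisy)
```

Because the coefficient is invariant to shifting and rescaling the predictions, a high score says that the predictions are ordered and spaced like the gold standard, not that they lie on the same scale.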

### Nearest Neighbors Regressor
The first model to be evaluated is a nearest neighbors regressor (NNR). The grid covers the weighting scheme (`weights`), the number of neighbors (`n_neighbors`), the search algorithm (`algorithm`), the leaf size (`leaf_size`) and the exponent `p` of the Minkowski distance.

In [17]:
from sklearn.neighbors import KNeighborsRegressor

weights = ['uniform', 'distance']
nn = list(range(10, 101, 10))
algorithms = ['ball_tree', 'kd_tree', 'brute']
leaf_sizes = list(range(1, 51, 10))
pvalues = list(range(1, 6))

nn_params = dict(n_neighbors=nn, weights=weights, p=pvalues, leaf_size=leaf_sizes, algorithm=algorithms)

nnr = KNeighborsRegressor()
nnrmlp = GridSearchCV(nnr,
                      nn_params,
                      cv=ps,
                      scoring=pearson_scorer,
                      n_jobs=-1,
                      verbose=1)

nnrmlp = nnrmlp.fit(all_data, all_labels)

Fitting 1 folds for each of 1500 candidates, totalling 1500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:   19.7s
[Parallel(n_jobs=-1)]: Done 188 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 438 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 788 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 1238 tasks      | elapsed:  8.2min
[Parallel(n_jobs=-1)]: Done 1500 out of 1500 | elapsed: 10.3min finished
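
The number of candidates reported in the log follows directly from the grid: it is the product of the number of values per hyperparameter. A short check, using the same value lists as the cell above:

```python
from itertools import product
from math import prod

nn_grid = {
    "n_neighbors": list(range(10, 101, 10)),         # 10 values
    "weights": ["uniform", "distance"],              # 2 values
    "p": list(range(1, 6)),                          # 5 values
    "leaf_size": list(range(1, 51, 10)),             # 5 values
    "algorithm": ["ball_tree", "kd_tree", "brute"],  # 3 values
}

n_candidates = prod(len(values) for values in nn_grid.values())
# Enumerating the Cartesian product explicitly gives the same count.
n_enumerated = sum(1 for _ in product(*nn_grid.values()))
print(n_candidates, n_enumerated)  # 1500 1500
```

Some of these combinations should be redundant in terms of predictions: `leaf_size` and `algorithm` mainly affect how fast the neighbours are found rather than which neighbours are found, so a smaller grid would likely reach the same best score in less time.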


Definition of the model with the best parameters found by the grid search, refitted on the training set only.

In [18]:
nnr_best_params = nnrmlp.best_params_
best_nnr = KNeighborsRegressor(**nnr_best_params)
best_nnr.fit(train_features_scaled, train_gs)

KNeighborsRegressor(algorithm='kd_tree', leaf_size=41, n_neighbors=20,
                    weights='distance')

Predictions on the training set

In [19]:
%%time
train_nnr_predictions = best_nnr.predict(train_features_scaled)

CPU times: user 158 ms, sys: 0 ns, total: 158 ms
Wall time: 157 ms


Predictions on the test set (inference)

In [20]:
%%time
test_nnr_predictions = best_nnr.predict(test_features_scaled)

CPU times: user 221 ms, sys: 0 ns, total: 221 ms
Wall time: 221 ms


### NNR Results

In [21]:
train_nnr_correlation = pearsonr(train_nnr_predictions, train_gs)[0]
test_nnr_correlation = pearsonr(test_nnr_predictions, test_gs)[0]
print('Train pearsonr: ', train_nnr_correlation)
print("Test pearsonr: ", test_nnr_correlation)
print("Best model parameters:")
for k, v in nnr_best_params.items():
    print(f"\t{k}: {v}")

Train pearsonr:  0.9999533582224034
Test pearsonr:  0.6999190047319795
Best model parameters:
	algorithm: kd_tree
	leaf_size: 41
	n_neighbors: 20
	p: 2
	weights: distance
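
The near-perfect training correlation is an expected side effect of `weights='distance'` rather than a sign of a good fit: when a training point is predicted, it is its own nearest neighbour at distance zero and therefore dominates the weighted average. The toy regressor below (a hypothetical, simplified re-implementation, not scikit-learn's code) shows the effect.

```python
from math import dist  # Python 3.8+


def knn_predict(train_x, train_y, query, k=3):
    """Inverse-distance-weighted k-NN regression (illustrative sketch)."""
    neighbours = sorted(zip(train_x, train_y), key=lambda xy: dist(xy[0], query))[:k]
    exact = [y for x, y in neighbours if dist(x, query) == 0.0]
    if exact:
        # A zero distance would give an infinite weight; average the exact matches instead.
        return sum(exact) / len(exact)
    weights = [1.0 / dist(x, query) for x, _ in neighbours]
    return sum(w * y for w, (_, y) in zip(weights, neighbours)) / sum(weights)


train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
train_y = [1.0, 2.0, 3.0, 5.0]

# Predicting the training points reproduces their labels exactly ...
train_predictions = [knn_predict(train_x, train_y, x) for x in train_x]
# ... while an unseen point gets a genuine weighted average of its neighbours.
new_prediction = knn_predict(train_x, train_y, (0.5, 0.5))
print(train_predictions, new_prediction)
```

For this reason the training correlation of the NNR says little about generalization; the test correlation is the informative number for this model.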


### Multi Layer Perceptron Regressor
The second model to be evaluated is a multi-layer perceptron regressor (MLPR). The grid covers the L2 regularization strength (`alpha`) and the size of a single hidden layer (`hidden_layer_sizes`).

In [22]:
from sklearn.neural_network import MLPRegressor

alphas = np.logspace(-6, -1, 6)
hidden_layer_sizes = [(i,) for i in range(5, 305, 10)]
mlp_param = dict(alpha=alphas, hidden_layer_sizes=hidden_layer_sizes)

mlpr = MLPRegressor(max_iter=1000, random_state=1)
mgsmlp = GridSearchCV(mlpr,
                      mlp_param,
                      cv=ps,
                      scoring=pearson_scorer,
                      n_jobs=-1,
                      verbose=1)
mgsmlp = mgsmlp.fit(all_data, all_labels)

Fitting 1 folds for each of 180 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:   15.8s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:  1.3min finished


Definition of the model with the best parameters found by the grid search, refitted on the training set only.

In [23]:
mlp_best_params = mgsmlp.best_params_
best_mlp = MLPRegressor(max_iter=1000, random_state=1, **mlp_best_params)
best_mlp.fit(train_features_scaled, train_gs)

MLPRegressor(alpha=0.1, hidden_layer_sizes=(285,), max_iter=1000,
             random_state=1)

Predictions on the training set

In [24]:
%%time
train_mlp_predictions = best_mlp.predict(train_features_scaled)

CPU times: user 17 ms, sys: 7.96 ms, total: 24.9 ms
Wall time: 5.83 ms


Predictions on the test set (inference)

In [25]:
%%time
test_mlp_predictions = best_mlp.predict(test_features_scaled)

CPU times: user 26.4 ms, sys: 4.12 ms, total: 30.5 ms
Wall time: 6.21 ms


### MLP Results

In [26]:
train_mlp_correlation = pearsonr(train_mlp_predictions, train_gs)[0]
test_mlp_correlation = pearsonr(test_mlp_predictions, test_gs)[0]
print('Train pearsonr:', train_mlp_correlation)
print("Test pearsonr: ", test_mlp_correlation)
print("Best model parameters:")
for k, v in mlp_best_params.items():
    print(f"\t{k}: {v}")

Train pearsonr: 0.8787936099286511
Test pearsonr:  0.7197486862008735
Best model parameters:
	alpha: 0.1
	hidden_layer_sizes: (285,)


### Bottleneck MLP
On the assumption that the number of features was too large and the model was overfitting, an additional hidden layer with fewer nodes than the number of features was introduced to reduce the complexity of the network.

In [27]:
from itertools import product
bottleneck_length = range(3, test_features.shape[-1])
hidden_layer_length = range(4, 100, 20)
alphas_lin = np.linspace(0, 1, 6)
hidden_layer_sizes = list(product(bottleneck_length, hidden_layer_length))

param = dict(hidden_layer_sizes=hidden_layer_sizes, alpha=alphas_lin)

bmlp = MLPRegressor(max_iter=500, random_state=1)
gsbmlp = GridSearchCV(bmlp,
                          param,
                          cv=ps,
                          scoring=pearson_scorer,
                          n_jobs=-1,
                          verbose=1)
gsbmlp = gsbmlp.fit(all_data, all_labels)

Fitting 1 folds for each of 1110 candidates, totalling 1110 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:   14.8s
[Parallel(n_jobs=-1)]: Done 188 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 438 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 788 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 1110 out of 1110 | elapsed:  7.1min finished


Definition of the model with the best parameters found by the grid search, refitted on the training set only.

In [28]:
bt_mlp_best_params = gsbmlp.best_params_
best_bt_mlp = MLPRegressor(max_iter=500, random_state=1, **bt_mlp_best_params)
best_bt_mlp.fit(train_features_scaled, train_gs)

MLPRegressor(alpha=1.0, hidden_layer_sizes=(9, 84), max_iter=500,
             random_state=1)

Predictions on the training set

In [29]:
%%time
train_bt_mlp_predictions = best_bt_mlp.predict(train_features_scaled)

CPU times: user 11.6 ms, sys: 0 ns, total: 11.6 ms
Wall time: 2.9 ms


Predictions on the test set (inference)

In [30]:
%%time
test_bt_mlp_predictions = best_bt_mlp.predict(test_features_scaled)

CPU times: user 14.3 ms, sys: 0 ns, total: 14.3 ms
Wall time: 4.56 ms


### Bottleneck MLP Results

In [31]:
train_bt_mlp_correlation = pearsonr(train_bt_mlp_predictions, train_gs)[0]
test_bt_mlp_correlation = pearsonr(test_bt_mlp_predictions, test_gs)[0]
print('Train pearsonr: ', train_bt_mlp_correlation)
print('Test pearsonr: ', test_bt_mlp_correlation)
print("Best model parameters:")
for k, v in bt_mlp_best_params.items():
    print(f"\t{k}: {v}")

Train pearsonr:  0.8545895366971805
Test pearsonr:  0.7336318188483206
Best model parameters:
	alpha: 1.0
	hidden_layer_sizes: (9, 84)
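
To put the reduction in complexity into numbers, the sketch below counts the trainable parameters (weights plus biases) of a fully connected network with 40 inputs and one output, for the two best configurations found above. `mlp_parameter_count` is an illustrative helper, not part of the project code.

```python
def mlp_parameter_count(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Number of weights and biases in a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


single_layer = mlp_parameter_count(40, (285,))  # best single-hidden-layer MLP
bottleneck = mlp_parameter_count(40, (9, 84))   # best bottleneck MLP
print(single_layer, bottleneck)  # 11971 1294
```

The bottleneck network has roughly nine times fewer parameters and is also trained with a stronger L2 penalty (`alpha=1.0` versus `alpha=0.1`). Both are consistent with the smaller gap between its train and test correlations.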


### SVR
The last model to be tested is an SVR. The parameters to be searched are: gamma, C and epsilon.

In [32]:
gammas = np.logspace(-6, -1, 6)
Cs = np.array([0.5, 1, 2, 4, 8, 10, 15, 20, 50, 100, 200, 375, 500, 1000])
epsilons = np.linspace(0.1, 1, 10)
svm_param = dict(gamma=gammas, C=Cs, epsilon=epsilons)

svr = SVR(kernel='rbf', tol=1)
gssvr = GridSearchCV(svr,
                     svm_param,
                     cv=ps,
                     scoring=pearson_scorer,
                     n_jobs=-1,
                     verbose=1)
gssvr = gssvr.fit(all_data, all_labels)

Fitting 1 folds for each of 840 candidates, totalling 840 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 188 tasks      | elapsed:    8.8s
[Parallel(n_jobs=-1)]: Done 438 tasks      | elapsed:   19.5s
[Parallel(n_jobs=-1)]: Done 788 tasks      | elapsed:   36.6s
[Parallel(n_jobs=-1)]: Done 840 out of 840 | elapsed:   40.0s finished
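
As a reminder of what the three searched hyperparameters control: `gamma` sets the width of the RBF kernel, `epsilon` the half-width of the tube inside which errors are not penalized, and `C` the weight of errors outside that tube relative to the regularization term. The two functions below are plain-Python illustrations of the kernel and of the loss, with made-up inputs.

```python
from math import exp


def rbf_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


k_near = rbf_kernel((0.0, 0.0), (1.0, 1.0), gamma=0.1)     # exp(-0.2), about 0.819
k_far = rbf_kernel((0.0, 0.0), (5.0, 5.0), gamma=0.1)      # exp(-5.0), about 0.007
inside = epsilon_insensitive_loss(3.0, 3.05, epsilon=0.1)  # 0.0: error within the tube
outside = epsilon_insensitive_loss(3.0, 3.5, epsilon=0.1)  # about 0.4
print(k_near, k_far, inside, outside)
```

A larger `gamma` makes the kernel value decay faster with distance, so each training sample influences a smaller neighbourhood; a larger `epsilon` ignores more of the small errors and typically leads to fewer support vectors.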


Definition of the model with the best parameters found by the grid search, refitted on the training set only.

In [33]:
svr_best_params = gssvr.best_params_
best_svr = SVR(kernel='rbf', tol=1, **svr_best_params)
best_svr.fit(train_features_scaled, train_gs)

SVR(gamma=0.1, tol=1)

Predictions on the training set

In [34]:
%%time
train_svm_predictions = best_svr.predict(train_features_scaled)

CPU times: user 112 ms, sys: 60 µs, total: 112 ms
Wall time: 111 ms


Predictions on the test set (inference)

In [35]:
%%time
test_svm_predictions = best_svr.predict(test_features_scaled)

CPU times: user 155 ms, sys: 0 ns, total: 155 ms
Wall time: 158 ms


### SVR Results

In [36]:
train_svm_correlation = pearsonr(train_svm_predictions, train_gs)[0]
test_svm_correlation = pearsonr(test_svm_predictions, test_gs)[0]
print('Train pearsonr:', train_svm_correlation)
print("Test pearsonr: ", test_svm_correlation)
print("Best model parameters:")
for k, v in svr_best_params.items():
    print(f"\t{k}: {v}")

Train pearsonr: 0.8995633919315202
Test pearsonr:  0.7285741569591626
Best model parameters:
	C: 1.0
	epsilon: 0.1
	gamma: 0.1


***
## Model Results

In [37]:
correlations = {
    "SVR": {"train": train_svm_correlation,"test": test_svm_correlation},
    "MLP": {"train": train_mlp_correlation,"test": test_mlp_correlation},
    "NNR": {"train": train_nnr_correlation,"test": test_nnr_correlation},
    "BtMLP": {"train": train_bt_mlp_correlation,"test": test_bt_mlp_correlation},
}
print(pd.DataFrame.from_dict(correlations, orient="index"))

          train      test
SVR    0.899563  0.728574
MLP    0.878794  0.719749
NNR    0.999953  0.699919
BtMLP  0.854590  0.733632


The bottleneck MLP yields the best test correlation. However, since the goal is to choose a model on which to perform feature selection, the SVR seems the most promising candidate given how well SVR models tend to generalize.

***
## Feature Selection
The feature selection was performed in a separate script, `forward_search.py`, which is provided alongside this notebook. It is not embedded here because it takes several hours to run, and re-running it inside the notebook would add no information. Its results are discussed below.

Each pair of sentences is described by a vector of 40 features. Evaluating every possible non-empty subset of the $N$ features would be very expensive, as the number of subsets to test (about $1.1 \times 10^{12}$ for $N=40$) would be:

$P=\sum^{N}_{i=1} \binom{N}{i} = \sum^{N}_{i=1} \frac{N!}{i!\,(N-i)!} = 2^{N}-1$

This is far too expensive to search exhaustively. For that reason a forward search approach was taken: starting from the feature best correlated with the ground truth, every combination of the current subset with one additional feature is tried and the best one is kept. This is repeated until all features have been added, and the best-scoring subset along the way is selected. The method is greedy, so the selected subset may be suboptimal.

The results are shown in the following table:

|next_best_feature_index|feature_name|svm_train_pearson|svm_test_pearson|
|---|---|---|---|
|24|st_lemmas_lc_dice|na|na|
|25|bigrams_dice|0.7569018439|0.6701654144|
|3|no_stop_lc_jaccard|0.7928514893|0.6885382776|
|15|stemmed_jaccard|0.8098010284|0.7011783941|
|18|no_stop_dice|0.832851473|0.7177098951|
|38|average_lc_wup|0.8350481895|0.7249467512|
|37|average_lc_lch|0.8465025219|0.7345462925|
|35|average_lin|0.848472905|0.7441819896|
|16|tokenized_dice|0.8537878611|0.7443287688|
|6|ne_jaccard|0.8566475295|0.7415489911|
|39|average_lc_lin|0.8592710204|0.7438845148|
|0|tokenized_jaccard|0.8553386682|0.7416358703|
|14|lesk_lc_jaccard|0.8559150716|0.7411532023|
|19|no_stop_lc_dice|0.859945113|0.7399740251|
|13|lesk_jaccard|0.8579436501|0.7418736685|
|12|trigrams_sent_jaccard|0.8640442114|0.745306467|
|7|st_lemmas_jaccard|0.8648808183|0.7427944038|
|10|trigrams_jaccard|0.8659545425|0.7456997951|
|34|average_wup|0.8691527762|0.7437638538|
|33|average_lch|0.8699168309|0.7444932807|
|11|bigrams_sent_jaccard|0.8735932779|0.745160831|
|9|bigrams_jaccard|0.8726765445|0.7433300457|
|8|st_lemmas_lc_jaccard|0.875144412|0.7430640178|
|22|ne_dice|0.8748713591|0.7436311865|
|21|lemmas_lc_dice|0.8715725166|0.7427397658|
|29|lesk_dice|0.8653357973|0.7419653444|
|32|average_path|0.8739536114|0.7415179342|
|26|trigrams_dice|0.8680233841|0.7399533417|
|23|st_lemmas_dice|0.8745521205|0.7411973361|
|36|average_lc_path|0.8781954367|0.7405314764|
|4|lemmas_jaccard|0.8874922457|0.7393197223|
|2|no_stop_jaccard|0.8869535884|0.7402972782|
|1|tokenized_lc_jaccard|0.8889006827|0.7371574177|
|28|trigrams_sent_dice|0.8926597158|0.7385353635|
|27|bigrams_sent_dice|0.8912935486|0.7375904386|
|17|tokenized_lc_dice|0.8889953345|0.7364385229|
|30|lesk_lc_dice|0.8930292877|0.7345804535|
|5|lemmas_lc_jaccard|0.8962334904|0.7336561523|
|20|lemmas_dice|0.8959995979|0.7326778371|
|31|stemmed_dice|0.8995633919|0.728574157|

From the previous table, it can be seen that the best result was obtained with the features \[0, 3, 6, 7, 10, 12, 13, 14, 15, 16, 18, 19, 24, 25, 35, 37, 38, 39\], which reach a correlation of 0.7456997951 on the test set.
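
The greedy forward search described in this section can be sketched as follows. Since `forward_search.py` is not reproduced in this notebook, this is only an assumption about its general shape: `score` stands for any function that trains a model on a feature subset and returns its correlation, and the toy scoring function is made up so that the example runs instantly.

```python
def forward_search(n_features, score):
    """Greedy forward selection; returns the best prefix and the full history."""
    selected, history = [], []
    remaining = list(range(n_features))
    while remaining:
        # Try adding each remaining feature to the current subset.
        trials = [(score(selected + [f]), f) for f in remaining]
        best_score, best_feature = max(trials, key=lambda t: t[0])
        selected.append(best_feature)
        remaining.remove(best_feature)
        history.append((best_feature, best_score))
    # Keep the prefix of the selection order that scored highest.
    best_length = max(range(len(history)), key=lambda i: history[i][1]) + 1
    return sorted(selected[:best_length]), history


# Made-up scorer: features 0 and 2 help a lot, feature 1 slightly, and every
# added feature costs 0.1 (a stand-in for overfitting).
GAINS = {0: 0.5, 1: 0.05, 2: 0.3}


def toy_score(subset):
    return sum(GAINS.get(f, 0.0) for f in subset) - 0.1 * len(subset)


best_subset, history = forward_search(4, toy_score)
print(best_subset)              # [0, 2]
print([f for f, _ in history])  # [0, 2, 1, 3]
```

With $N$ features this evaluates $N(N+1)/2$ subsets (820 for $N=40$) instead of every possible subset, at the price of possibly missing the best combination.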

***
## Final SVR Model
The final model was trained with the selected features as shown:

In [38]:
selected_features_index = [0, 3, 6, 7, 10, 12, 13, 14, 15, 16, 18, 19, 24, 25, 35, 37, 38, 39]
all_data_selected_features = all_data[:, selected_features_index]
selected_features_train = train_features_scaled[:, selected_features_index]
selected_features_test = test_features_scaled[:, selected_features_index]
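
For readability, the selected indices can be mapped back to feature names. The list below is reconstructed from the order in which `get_features` appends the features and uses the names from the results table; it is a convenience sketch, not part of the original code.

```python
views = [
    "tokenized", "tokenized_lc", "no_stop", "no_stop_lc", "lemmas", "lemmas_lc",
    "ne", "st_lemmas", "st_lemmas_lc", "bigrams", "trigrams", "bigrams_sent",
    "trigrams_sent", "lesk", "lesk_lc", "stemmed",
]
synset_features = [
    "average_path", "average_lch", "average_wup", "average_lin",
    "average_lc_path", "average_lc_lch", "average_lc_wup", "average_lc_lin",
]
# 16 Jaccard features, then 16 Dice features, then 8 synset averages = 40 names
feature_names = (
    [f"{view}_jaccard" for view in views]
    + [f"{view}_dice" for view in views]
    + synset_features
)

selected = [0, 3, 6, 7, 10, 12, 13, 14, 15, 16, 18, 19, 24, 25, 35, 37, 38, 39]
selected_names = [feature_names[i] for i in selected]
print(selected_names)
```

The selection keeps a mix of plain token overlap, stopword-filtered and lemmatized variants, n-grams, Lesk senses and four of the synset averages, rather than concentrating on a single family of features.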

In [39]:
gammas = np.logspace(-6, -1, 6)
Cs = np.array([0.5, 1, 2, 4, 8, 10, 15, 20, 50, 100, 200, 375, 500, 1000])
epsilons = np.linspace(0.1, 1, 10)
param = dict(gamma=gammas, C=Cs, epsilon=epsilons)

svr = SVR(kernel='rbf', tol=1)
gssvr = GridSearchCV(svr,
                     param,
                     cv=ps,
                     scoring=pearson_scorer,
                     n_jobs=-1,
                     verbose=1)
gssvr = gssvr.fit(all_data_selected_features, all_labels)

Fitting 1 folds for each of 840 candidates, totalling 840 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 188 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done 684 tasks      | elapsed:   20.3s
[Parallel(n_jobs=-1)]: Done 840 out of 840 | elapsed:   27.2s finished


In [40]:
best_parameters = gssvr.best_params_
best_model = SVR(kernel='rbf', tol=1, **best_parameters)
train_predictions = best_model.fit(selected_features_train, train_gs).predict(selected_features_train)

In [41]:
%%time
test_predictions = best_model.predict(selected_features_test)

CPU times: user 53.4 ms, sys: 22 µs, total: 53.5 ms
Wall time: 52.6 ms


In [42]:
train_correlation = pearsonr(train_predictions, train_gs)[0]
test_correlation = pearsonr(test_predictions, test_gs)[0]

In [43]:
print('Train pearsonr: ', train_correlation)
print('Test pearsonr: ', test_correlation)
print("Best model parameters:")
for k, v in best_parameters.items():
    print(f"\t{k}: {v}")

Train pearsonr:  0.8612974649511776
Test pearsonr:  0.7456997951303734
Best model parameters:
	C: 1.0
	epsilon: 0.4
	gamma: 0.1


***
## Conclusions
A Pearson correlation of `0.746` was obtained using an SVR. We consider this a good result, given that it is quite close to that of the 10th-ranked participant in the SemEval contest. However, the solution comes at a higher computational cost than traditional semantic similarity measures. In our tests we saw a Pearson correlation of `0.67` between the ground truth and the Dice similarity of the sentences with stopwords removed, lemmatized and in lowercase. This means that everything computed on top of it adds a significant overhead in computation time for a relatively small gain in correlation. The synset similarities are also computationally expensive: even though they were very good contributors to the correlation, they are not time efficient at all.

However, seeing that the inference cost of the model is around `55.1` ms, we can safely compute the proposed algorithms without any concern about the availability of computational power.

***
# End STS Project