# Semantic Textual Similarity (STS) Project

The objective of this project is to determine how similar two sentences are. A sentence is said to be "paraphrased" when its content (or message) is the same as another sentence's, but it uses different words and/or structure.

An example from the trial set: 
 - The bird is bathing in the sink.

 - Birdie is washing itself in the water basin.

We are given training and test sets in which each sentence pair is labeled with a gold-standard ("gs") similarity score on a scale of 0-5.

| Label | Description |
| :-: | :-: |
| 5 | They are completely equivalent, as they mean the same thing. |
| 4 | They are mostly equivalent, but some unimportant details differ. |
| 3 | They are roughly equivalent, but some important information differs or is missing. |
| 2 | They are not equivalent, but share some details. |
| 1 | They are not equivalent, but are on the same topic. |
| 0 | They are on different topics. |

We need to do the following:
- Read in the sentences as a single dataframe: load all three dataframes and then append them into one larger dataframe.
- Append the corresponding GS scores to that dataframe.
- Create a utils file containing all the features we want to compute.
- Show which features were created, and how and why.
- Create a pipeline that:
    - Takes in all the features from above and combines them into a feature array
    - Standardizes the values
    - Outputs a simple N-D array with all the processed/calculated features

- Create 3 variations:
    1. "Standard" distance similarities
    2. "XTRa Train": more training data, generated with back-translation and AEDA
    3. "Embeddings": pre-trained models combined with a deep L-layered model
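
To make the "XTRa Train" idea more concrete, here is a minimal sketch of AEDA-style augmentation, which inserts random punctuation marks into a sentence while leaving the original words untouched. The function name `aeda_augment`, the `ratio` parameter, and the sequential insertion strategy are our own assumptions for illustration; the punctuation set and the one-third ratio follow our reading of the AEDA paper and should be double-checked against it. Back-translation is not sketched here because it needs an external translation model.

```python
import random

# Punctuation marks used for insertion (assumed to match the AEDA paper).
AEDA_PUNCTUATION = [".", ";", "?", ":", "!", ","]


def aeda_augment(sentence, ratio=1 / 3, rng=None):
    """Return a copy of `sentence` with random punctuation marks inserted.

    `ratio` bounds the number of insertions relative to the word count.
    """
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        return sentence
    # Insert between 1 and ratio * len(words) marks (always at least 1).
    max_inserts = max(1, int(ratio * len(words)))
    n_inserts = rng.randint(1, max_inserts)
    for _ in range(n_inserts):
        position = rng.randint(0, len(words))  # any gap, including both ends
        words.insert(position, rng.choice(AEDA_PUNCTUATION))
    return " ".join(words)


original = "The bird is bathing in the sink."
augmented = aeda_augment(original, rng=random.Random(42))
print(augmented)
```

Since the words and their order are preserved, an augmented pair can reasonably keep the gs label of the pair it was generated from.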


*STEPS:*
1. Preprocess textual data 
    - Read in sentence pairs
    - Tokenize 
    - POS tag 
    - Remove stopwords and punctuation 

2. Extract Features 
    - Similarity measures 
    - Word frequency 
    - TF-IDF?
3. Generate Extra Data (I)
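
The first two steps can be sketched on the trial pair as follows. This is a simplified stand-in, not the notebook's actual preprocessing: it uses whitespace tokenization and a tiny hand-written stopword list (both assumptions) instead of the NLTK tokenizer and stopword corpus, and the helper names `preprocess` and `jaccard_similarity` are hypothetical.

```python
import string

# Tiny illustrative stopword list (assumption); the notebook itself
# downloads NLTK's full English stopword corpus.
STOPWORDS = {"the", "is", "in", "a", "an", "of", "to", "and", "itself"}


def preprocess(sentence):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    table = str.maketrans("", "", string.punctuation)
    tokens = sentence.lower().translate(table).split()
    return [tok for tok in tokens if tok not in STOPWORDS]


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the token-set intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty sentences are identical
    return len(set_a & set_b) / len(set_a | set_b)


sent_a = preprocess("The bird is bathing in the sink.")
sent_b = preprocess("Birdie is washing itself in the water basin.")
print(sent_a)  # ['bird', 'bathing', 'sink']
print(sent_b)  # ['birdie', 'washing', 'water', 'basin']
print(jaccard_similarity(sent_a, sent_b))  # 0.0
```

The trial pair scores 0.0 because the two sentences share no content words, even though they are paraphrases. Plain lexical overlap is therefore not enough on its own, which is why several similarity measures are combined as features.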

In [1]:
# Data Loader file with two functions
from File_Loader import *

# Preprocessing file with several preprocessing functions
from pre_processing import *

In [2]:
# TRAINING PATH
TRAIN_PATH = './data/train/input/'
TRAIN_GS_PATH = './data/train/gs/'

# TEST PATH
TEST_PATH = './data/test/input/'
TEST_GS_PATH = './data/test/gs/'

# Loading the data
X_train, y_train = LoadSentences(TRAIN_PATH), LoadGS(TRAIN_GS_PATH)
X_test, y_test = LoadSentences(TEST_PATH), LoadGS(TEST_GS_PATH)


In [3]:
# Data augmentation

# Back-translation:
# translate from one language (English) to another (French, Italian, German, etc.),
# then translate back to the original language (English).
# English --> another language --> English

# Let's look at how many sentence pairs we have in total
print(f"We have a total of {X_train.shape[0]} sentence pairs")

We have a total of 2234 sentence pairs


In [4]:
## Preprocessing 
# remove punctuation [X]
# remove stopwords [X]
# stemming 
# lemmatizing 
# TF-IDF
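
As a rough illustration of the TF-IDF item, the sketch below weights already-tokenized sentences and compares them with cosine similarity. It is written in plain Python so it runs without extra downloads. The smoothed IDF formula is one of several common definitions, and the helper names `tfidf_vectors` and `cosine_similarity`, as well as the third toy sentence, are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vector = {}
        for term, count in counts.items():
            tf = count / len(doc)
            # Smoothed IDF (assumed variant): rarer terms get larger weights.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    ["bird", "bathing", "sink"],
    ["birdie", "washing", "water", "basin"],
    ["bird", "sink", "water"],  # made-up third sentence for contrast
]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # no shared terms
print(cosine_similarity(vectors[0], vectors[2]))  # shares "bird" and "sink"
```

In practice the IDF weights would be computed over all training sentences rather than over three toy documents.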


In [5]:
import string
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.stem import WordNetLemmatizer
from nltk.metrics import jaccard_distance
from nltk.probability import FreqDist
from nltk.collocations import BigramCollocationFinder
from nltk.collocations import TrigramCollocationFinder
from nltk.wsd import lesk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.corpus import sentiwordnet
from nltk.corpus import wordnet_ic
nltk.download('maxent_ne_chunker')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('words')
nltk.download('sentiwordnet')
nltk.download('wordnet_ic')
# setting the wordnet_ic 
brown_ic = wordnet_ic.ic('ic-brown.dat')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/Eric/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package punkt to /Users/Eric/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/Eric/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/Eric/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/Eric/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to /Users/Eric/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /Users/Eric/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloadin

In [6]:
# Building our pipeline
PIPELINE = {}
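
One possible shape for such a pipeline is sketched below: a dictionary maps feature names to functions of a sentence pair, the results are stacked into an array, and each column is standardized to zero mean and unit variance. The two toy features, the `FEATURES` registry, and `build_feature_matrix` are hypothetical names chosen for illustration; they are not the features or the `PIPELINE` contents used in this project.

```python
import numpy as np


def length_difference(sent_a, sent_b):
    """Absolute difference in whitespace-token counts."""
    return abs(len(sent_a.split()) - len(sent_b.split()))


def word_overlap(sent_a, sent_b):
    """Jaccard overlap of the lowercased whitespace tokens."""
    set_a, set_b = set(sent_a.lower().split()), set(sent_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Hypothetical registry: feature name -> function of a sentence pair.
FEATURES = {
    "length_difference": length_difference,
    "word_overlap": word_overlap,
}


def build_feature_matrix(pairs, features=FEATURES):
    """Compute every feature for every pair, then standardize each column."""
    raw = np.array(
        [[fn(a, b) for fn in features.values()] for a, b in pairs],
        dtype=float,
    )
    mean = raw.mean(axis=0)
    std = raw.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return (raw - mean) / std


pairs = [
    ("The bird is bathing in the sink.", "Birdie is washing itself in the water basin."),
    ("A man is playing a guitar.", "A man is playing a guitar."),
    ("A dog runs.", "The stock market fell sharply today."),
]
X = build_feature_matrix(pairs)
print(X.shape)  # (3, 2)
```

For a real train/test split, the mean and standard deviation should be computed on the training features only and then reused to transform the test features.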

In [7]:
import numpy as np

# np.random.seed(42)  # uncomment for a reproducible sample; seed() returns None, so there is nothing to assign
samples = np.array(X_train.SentA).reshape(-1,)
sample_sentence = np.random.choice(samples, 1)[0]
sample_sentence

'net revenue rose to $3.99 billion from $3.85 billion during the same quarter last year.'