# Semantic Textual Similarity

---

# Authors

- Unai Gurbindo
- Jaume Guasch

---

# Date
*December 22nd, 2023*

---

# Statement

Use the data set and task description of the Semantic Textual Similarity task from SemEval 2012.

Implement several approaches to detect paraphrases using sentence similarity metrics.

- Explore some lexical dimensions.
- Explore the syntactic dimension alone.
- Explore the combination of both.

Add new components of your choice (optional).

Pre-trained word or sentence embedding models, such as BERT, are not allowed.

Compare and comment on the results achieved by these approaches, both against each other and against the official results.

Send files to raco in IHLT STS Project before the oral presentation:

- Jupyter notebook: sts-[Student1]-[Student2].ipynb

- Slides: sts-[Student1]-[Student2].pdf

---

# Requirements

In [1]:
!pip install nltk





In [2]:
!pip install spacy





In [3]:
!pip install pyspellchecker





In [4]:
!pip install -U deep-translator





In [19]:
!pip install textdistance

ERROR: Could not find a version that satisfies the requirement textdistance (from versions: none)
ERROR: No matching distribution found for textdistance


--- 

## Introduction

---
## Project Structure

---
## Workflow

---
### Libraries

In [23]:
import os
import pandas as pd
import numpy as np  

import spacy
# The model must be installed before loading it: python -m spacy download en_core_web_sm
# nlp = spacy.load('en_core_web_sm')
from spacy.tokens import Doc

# import textdistance as td

---
### Data upload

In [6]:
# Read the files in the Data folder: test-gold.tgz, train.tgz

# Unzip the archives (run once)
# !tar -xzf ./Data/test-gold.tgz
# !tar -xzf ./Data/train.tgz

train_path = './train'
test_path = './test-gold'

def load_sts(path, files):
    """Load the sentence pairs and gold scores of every dataset listed in `files`."""
    frames = []
    for file in files:
        # Each input line holds two tab-separated sentences
        with open(path + '/STS.input.' + file + '.txt', 'r') as f:
            sentences = [s.split('\t') for s in f.read().splitlines()]
        pairs = pd.DataFrame(sentences, columns=['Sentence 1', 'Sentence 2'])
        gold = pd.read_table(path + '/STS.gs.' + file + '.txt', names=['GS'])
        pairs = pd.concat([pairs, gold], axis=1)
        pairs['Origin'] = file  # full dataset name, e.g. 'MSRpar'
        frames.append(pairs)
    data = pd.concat(frames, ignore_index=True)
    # keep=False drops every copy of a duplicated (sentence pair, score) row
    data.drop_duplicates(subset=['Sentence 1', 'Sentence 2', 'GS'], keep=False, inplace=True)
    target = data['GS']
    return data.drop(columns=['GS']), target

# Read train files
train_files = ["MSRpar", "MSRvid", "SMTeuroparl"]
train_data, train_target = load_sts(train_path, train_files)

# Read test files
test_files = ["MSRpar", "MSRvid", "SMTeuroparl", "surprise.OnWN", "surprise.SMTnews"]
test_data, test_target = load_sts(test_path, test_files)

In [7]:
train_data

| | Sentence 1 | Sentence 2 | Origin |
|---|---|---|---|
| 0 | But other sources close to the sale said Viven... | But other sources close to the sale said Viven... | MSRpar |
| 1 | Micron has declared its first quarterly profit... | Micron's numbers also marked the first quarter... | MSRpar |
| 2 | The fines are part of failed Republican effort... | Perry said he backs the Senate's efforts, incl... | MSRpar |
| 3 | The American Anglican Council, which represent... | The American Anglican Council, which represent... | MSRpar |
| 4 | The tech-loaded Nasdaq composite rose 20.96 po... | The technology-laced Nasdaq Composite Index <.... | MSRpar |
| ... | ... | ... | ... |
| 2229 | Action is needed quickly, which is why we deci... | It is urgent and that is why we have decided t... | SMTeuroparl |
| 2230 | One could indeed wish for more and for improve... | We can actually want more and better, but I th... | SMTeuroparl |
| 2231 | (Parliament accepted the oral amendment) | (Parliament accepted the oral amendment) | SMTeuroparl |
| 2232 | - My party has serious reservations about Comm... | My party serious reservations about the regula... | SMTeuroparl |


---
### Pre-processing strategies

> Spell checker

*Example*

In [8]:
from spellchecker import SpellChecker

# Create a SpellChecker instance (pyspellchecker supports an edit distance of 1 or 2)
spell = SpellChecker(distance=2)

# Example text with some intentional spelling errors
text = "this is an examplo sentennce with some misspelled wordds"

# Split the text into words
words = text.split()

# Find and print misspelled words
misspelled = spell.unknown(words)
print("Misspelled words:", misspelled)

# Correct the misspelled words; keep the original word when no correction is found
corrected = [
    (spell.correction(word) or word) if word.lower() in misspelled else word
    for word in words
]
text = " ".join(corrected)

print("Corrected text:", text)

Misspelled words: {'sentennce', 'wordds', 'examplo'}
Corrected text: this is an example sentence with some misspelled words


> Back-translation for data augmentation

In [13]:
from deep_translator import GoogleTranslator

sentence = "keep it up, you are awesome"

# Back-translation: English -> Spanish -> English yields a paraphrase of the input
sentence_translate_es = GoogleTranslator(source='auto', target='es').translate(sentence)
GoogleTranslator(source='auto', target='en').translate(sentence_translate_es)

"keep it up you're great"

---
### Feature extraction 

> Stopwords

In [None]:
doc = nlp(sentence)
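
A minimal sketch of stop-word filtering that does not depend on a loaded spaCy model. The names `STOPWORDS` and `remove_stopwords` and the tiny hand-picked word list are assumptions made for illustration, not part of the project; with the `en_core_web_sm` model available, spaCy's `token.is_stop` attribute could play the same role.

```python
# Illustrative stop-word set (assumption): a real run would use a full list,
# such as the one shipped with spaCy or NLTK.
STOPWORDS = {"a", "an", "the", "is", "are", "it", "you", "of", "to", "and", "in", "up"}


def remove_stopwords(sentence):
    """Lowercase, split on whitespace, strip punctuation and drop stop words."""
    tokens = [t.strip(".,;:!?\"'()") for t in sentence.lower().split()]
    return [t for t in tokens if t and t not in STOPWORDS]


print(remove_stopwords("keep it up, you are awesome"))
```

Only the content words survive, which keeps very frequent function words from inflating overlap-based similarity scores.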


> Lemmas

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

> Tokens

> Ngrams with words
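
As a rough sketch of this feature, word n-grams can be built with a sliding window over the token list. The helper name `word_ngrams` and the whitespace tokenisation are assumptions for illustration; the project may tokenise with spaCy or NLTK instead.

```python
def word_ngrams(sentence, n):
    """Return the word n-grams (as tuples) of a whitespace-tokenised sentence."""
    tokens = sentence.lower().split()
    # A window of size n slides over the tokens; too-short sentences give []
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams("A man is playing a guitar", 2))
```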

> Ngrams with characters
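
Character n-grams follow the same sliding-window idea at the character level, which makes the comparison more tolerant of inflection and spelling variants. The helper name `char_ngrams` and the whitespace normalisation step are assumptions for illustration.

```python
def char_ngrams(sentence, n):
    """Return the character n-grams of a sentence after normalising case and spaces."""
    text = " ".join(sentence.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("guitar", 3))
```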

---
### Similarities
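
Two set-based measures that could serve as lexical similarity features are the Jaccard and Dice coefficients. The sketch below assumes each sentence has already been reduced to a collection of items (tokens, lemmas or n-grams); the function names are illustrative and the choice of measures is an assumption, not the project's confirmed method.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def dice(a, b):
    """Dice coefficient: 2 |A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


s1 = "a man is playing a guitar".split()
s2 = "a man plays a guitar".split()
print(jaccard(s1, s2), dice(s1, s2))
```

Both values lie in [0, 1]. Here `playing` and `plays` do not match as raw tokens, which illustrates why lemmas can be a more useful unit than surface forms.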

---
### Feature combination

> Feature selection

> Neural Network

> Random Forest

> SVM

> Model Ensembles

---
## Final results
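
SemEval 2012 ranks STS systems by the Pearson correlation between the predicted scores and the gold standard, so a comparison with the official results needs that metric. The sketch below is a plain-Python version under that assumption; the helper name `pearson` and the score lists are hypothetical and shown only to illustrate the call, they are not results of this project.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Hypothetical scores, for illustration only
gold = [0.0, 1.0, 2.0, 3.0]
pred = [0.1, 0.3, 0.5, 0.7]
print(pearson(gold, pred))
```

Because the coefficient is invariant to linear rescaling, similarity values in [0, 1] can be compared directly with gold scores on the 0 to 5 scale.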