## Jaro-Winkler Distance

The `calculate_jw_distance` function takes two strings, `sentence1` and `sentence2`, and returns a float. Despite its name, the returned value is the Jaro-Winkler *similarity*: it ranges from 0 (no similarity) to 1 (identical strings), so higher values mean the strings are closer. Subtract it from 1 if you need a distance.

The function delegates the computation to `jaro_winkler()` from the "jellyfish" library. Newer jellyfish releases expose the same computation as `jaro_winkler_similarity()`, and the shorter name may not be available there.

The docstring and type hints document the expected inputs and the return type. Install the library with `pip install jellyfish` before calling `jaro_winkler()`.
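To make the score easier to interpret, here is a minimal pure-Python sketch of the standard Jaro-Winkler formula. The names `jaro_sketch` and `jaro_winkler_sketch` and the parameters `prefix_scale`, `max_prefix`, and `boost_threshold` are illustrative assumptions, not part of jellyfish, and the result may not match the library's output digit for digit.

```python
def jaro_sketch(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0

    # Two characters match only if they are equal and at most
    # `window` positions apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        stop = min(len(s2), i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                break

    matches = sum(matched1)
    if matches == 0:
        return 0.0

    # Matched characters that line up in a different order are
    # transpositions; each swapped pair counts once.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len(s1)
        + matches / len(s2)
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler_sketch(
    s1: str,
    s2: str,
    prefix_scale: float = 0.1,     # assumed: the customary scaling factor
    max_prefix: int = 4,           # assumed: the customary prefix cap
    boost_threshold: float = 0.7,  # assumed: boost only fairly similar strings
) -> float:
    """Jaro similarity plus a bonus for a shared prefix."""
    jaro = jaro_sketch(s1, s2)
    if jaro <= boost_threshold:
        return jaro

    # Count how many leading characters agree, up to max_prefix.
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1

    return jaro + prefix * prefix_scale * (1 - jaro)


similarity = jaro_winkler_sketch("MARTHA", "MARHTA")
print(round(similarity, 4))      # 0.9611
print(round(1 - similarity, 4))  # the corresponding distance
```

Because the measure works on characters and rewards shared prefixes, it is better suited to short strings such as names than to whole sentences.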

In [None]:
!pip install jellyfish
import jellyfish

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting jellyfish
  Downloading jellyfish-0.11.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: jellyfish
Successfully installed jellyfish-0.11.2


In [None]:
def calculate_jw_distance(sentence1: str, sentence2: str) -> float:
    """
    Calculate the Jaro-Winkler similarity between two strings.
    
    Args:
        sentence1 (str): The first string.
        sentence2 (str): The second string.
    
    Returns:
        float: The Jaro-Winkler similarity, between 0 (no similarity) and 1 (identical strings).
    """
    # jaro_winkler() returns a similarity score, so higher means closer
    jw_similarity = jellyfish.jaro_winkler(sentence1, sentence2)
    
    return jw_similarity

In [None]:
s1 = calculate_jw_distance(
    "what is the purpose of clinical trials",
    "my name is teddy and I play soccer"
)

s2 = calculate_jw_distance(
    "what is the purpose of clinical trials",
    "the reason for people to do clinical trials is to conduct drug experiment"
)

s1, s2

(0.6056384276198518, 0.6573110859340131)

## Word Mover's Distance

The `calculate_wmd` function takes two strings, `sentence1` and `sentence2`, and returns a float derived from the Word Mover's Distance (WMD) between them.

The function first loads a pre-trained Word2Vec model with `KeyedVectors.load_word2vec_format()` from the "gensim" library. Replace `'./word2vec.bin'` with the path to the pre-trained Word2Vec binary file on your system. Because the model is loaded inside the function, every call reads the file again; for repeated use, load it once outside the function.

Next, the function lowercases each sentence and tokenizes it into words with `nltk.word_tokenize()`, so the comparison is case-insensitive.

The function then builds a `WmdSimilarity` index over the first sentence and queries it with the second. With `num_best=1`, the query returns a list containing one `(document_index, score)` pair, and `[0][1]` extracts the score. Note that `WmdSimilarity` reports a similarity, computed as `1 / (1 + WMD)`, not the raw distance: the score lies between 0 and 1, and higher values mean the sentences are closer. To get the raw distance, `KeyedVectors.wmdistance()` can be called on the two token lists instead.

The docstring and type hints document the expected inputs and the return type. Download the "punkt" tokenizer data with `nltk.download('punkt')` before calling `nltk.word_tokenize()`.
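The score that `WmdSimilarity` returns is a similarity rather than the raw distance. The short sketch below shows how the two relate, assuming gensim computes the score as `1 / (1 + WMD)`. The helper names `wmd_to_similarity` and `similarity_to_wmd` are hypothetical and not part of gensim.

```python
import math


def wmd_to_similarity(distance: float) -> float:
    """Map a WMD value (0 or greater) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)


def similarity_to_wmd(similarity: float) -> float:
    """Invert the mapping; a similarity of 0 means an infinite distance."""
    if similarity <= 0.0:
        return math.inf
    return 1.0 / similarity - 1.0


print(wmd_to_similarity(0.0))  # identical documents -> 1.0
print(wmd_to_similarity(1.0))  # -> 0.5
print(similarity_to_wmd(0.5))  # -> 1.0
```

A distance of 0 maps to a similarity of 1, and the similarity shrinks toward 0 as the distance grows, so the two measures rank sentence pairs in opposite orders.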

In [None]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.similarities import WmdSimilarity
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
def calculate_wmd(sentence1: str, sentence2: str) -> float:
    """
    Calculate a Word Mover's Distance (WMD) based similarity between two sentences.
    
    Args:
        sentence1 (str): The first sentence.
        sentence2 (str): The second sentence.
    
    Returns:
        float: The WmdSimilarity score, 1 / (1 + WMD); higher means more similar.
    """
    # Load a pre-trained Word2Vec model (reloaded on every call)
    model = KeyedVectors.load_word2vec_format('./word2vec.bin', binary=True)
    
    # Lowercase the sentences and tokenize them into words
    words1 = nltk.word_tokenize(sentence1.lower())
    words2 = nltk.word_tokenize(sentence2.lower())
    
    # Index the first sentence and query it with the second;
    # the result is a list of (document_index, similarity) pairs
    wmd_similarity = WmdSimilarity([words1], model, num_best=1)
    wmd_score = wmd_similarity[words2][0][1]
    
    return wmd_score


In [None]:
w1 = calculate_wmd(
    "what is the purpose of clinical trials",
    "my name is teddy and I play soccer"
)

w2 = calculate_wmd(
    "what is the purpose of clinical trials",
    "the reason for people to do clinical trials is to conduct drug experiment"
)

w1, w2

## Cosine Similarity

In [None]:
import numpy as np
from scipy.spatial.distance import cosine

def calculate_cosine_similarity(sentence1: str, sentence2: str) -> float:
    """
    Calculate the cosine similarity between two sentences.
    
    Args:
        sentence1 (str): The first sentence.
        sentence2 (str): The second sentence.
    
    Returns:
        float: The cosine similarity between the two sentences, represented as a float value between 0 and 1.
    """
    # Tokenize the sentences into words
    words1 = sentence1.lower().split()
    words2 = sentence2.lower().split()
    
    # Create a set of unique words from both sentences
    unique_words = set(words1 + words2)
    
    # Create a frequency vector for each sentence
    freq_vector1 = np.array([words1.count(word) for word in unique_words])
    freq_vector2 = np.array([words2.count(word) for word in unique_words])
    
    # Calculate the cosine similarity between the frequency vectors
    similarity = 1 - cosine(freq_vector1, freq_vector2)
    
    return similarity


In [None]:
calculate_cosine_similarity(
    "what is the purpose of clinical trials",
    "my name is teddy and I play soccer"
)

0.13363062095621214

In [None]:
calculate_cosine_similarity(
    "what is the purpose of clinical trials",
    "the reason for people to do clinical trials is to conduct drug experiment"
)

0.39036002917941337
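The two scores above can be checked by hand. The first pair shares only the word "is", so the dot product is 1 and the vector lengths are √7 and √8, giving 1/√56 ≈ 0.1336. The sketch below reproduces the same bag-of-words computation with the standard library only. The name `bag_of_words_cosine` is illustrative, and returning 0.0 for an empty sentence is an assumption made here to avoid dividing by zero.

```python
import math
from collections import Counter


def bag_of_words_cosine(sentence1: str, sentence2: str) -> float:
    """Cosine similarity between the word-count vectors of two sentences."""
    counts1 = Counter(sentence1.lower().split())
    counts2 = Counter(sentence2.lower().split())

    # Only words present in both sentences contribute to the dot product.
    shared = counts1.keys() & counts2.keys()
    dot = sum(counts1[word] * counts2[word] for word in shared)

    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty sentence is similar to nothing

    return dot / (norm1 * norm2)


print(bag_of_words_cosine(
    "what is the purpose of clinical trials",
    "my name is teddy and I play soccer",
))
print(bag_of_words_cosine(
    "what is the purpose of clinical trials",
    "the reason for people to do clinical trials is to conduct drug experiment",
))
```

Because the vectors hold raw word counts, the score reflects only exact word overlap; synonyms such as "purpose" and "reason" contribute nothing.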

## Semantic Textual Similarity

In [None]:
!pip install sentence-transformers scipy

In [None]:
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

The `calculate_sts_score` function takes two strings, `sentence1` and `sentence2`, and returns a float representing the semantic textual similarity (STS) score between them.

The function uses the `SentenceTransformer` class from the "sentence-transformers" library to load a pre-trained model called "paraphrase-MiniLM-L6-v2", which produces sentence embeddings that capture the meaning of each sentence. The model is loaded on every call, which is slow when the function is applied to many sentence pairs.

The function then calls `model.encode()` to compute the embeddings. Because each sentence is passed as a one-element list, `encode()` returns a 2D array with a single row, and the `[0]` index selects that row as a 1D embedding.

Finally, the function computes the cosine distance between the two embeddings with `cosine()` from the `scipy.spatial.distance` module and subtracts it from 1 to obtain the cosine similarity, which it returns as the STS score. The score can range from -1 to 1: values near 1 indicate similar meaning, and values near 0 or below indicate unrelated sentences.

In summary, the function combines a pre-trained sentence-embedding model with cosine similarity to measure how close two sentences are in meaning.
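One way to avoid reloading the model on every call is to load it once and reuse it. The sketch below illustrates the idea with a scorer that receives the encoding function as an argument. `make_sts_scorer`, `cosine_similarity`, and `toy_encode` are hypothetical names introduced here, and the toy encoder is only a stand-in so that the example runs without downloading a model.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Encoder = Callable[[List[str]], Sequence[Vector]]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity in [-1, 1]; 0.0 if either vector has zero length."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def make_sts_scorer(encode: Encoder) -> Callable[[str, str], float]:
    """Build a scoring function around an encoder that is created only once."""
    def score(sentence1: str, sentence2: str) -> float:
        # Encode both sentences in one batch: one embedding per sentence.
        embedding1, embedding2 = encode([sentence1, sentence2])
        return cosine_similarity(embedding1, embedding2)
    return score


def toy_encode(sentences: List[str]) -> List[List[float]]:
    """Stand-in encoder (not a real model): character and word counts."""
    return [[float(len(s)), float(len(s.split()))] for s in sentences]


toy_score = make_sts_scorer(toy_encode)
print(toy_score("clinical trials", "soccer team"))
```

With sentence-transformers installed, passing `SentenceTransformer('paraphrase-MiniLM-L6-v2').encode` in place of `toy_encode` should give real STS scores, assuming `encode()` returns one embedding per input sentence as it does for the one-element lists used in this notebook.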

In [None]:
def calculate_sts_score(sentence1: str, sentence2: str) -> float:
    """Return the cosine similarity between the embeddings of two sentences."""
    # Load a pre-trained sentence-embedding model (reloaded on every call)
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    
    # Compute sentence embeddings; [0] selects the single row of the 2D result
    embedding1 = model.encode([sentence1])[0]
    embedding2 = model.encode([sentence2])[0]
    
    # cosine() returns the cosine distance, so 1 - distance is the similarity
    similarity_score = 1 - cosine(embedding1, embedding2)
    
    return similarity_score


In [None]:
calculate_sts_score(
    "what is the definition of clinical trials?",
    "the soccer team is not doing well"
)

-0.027773594483733177

In [None]:
calculate_sts_score(
    "what is the definition of clinical trials?",
    "the definition of clinical trials consist of conducting medical experiment"
)

0.8558713793754578

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('olympics_qa.csv')

In [None]:
df.head()

| Unnamed: 0 | title | heading | content | tokens | context | questions | answers |
|---|---|---|---|---|---|---|---|
| 0 | 2020 Summer Olympics | Summary | The 2020 Summer Olympics, officially the Games... | 621 | 2020 Summer Olympics\nSummary\n\nThe 2020 Summ... | 1. When were the 2020 Summer Olympics original... | 1. The 2020 Summer Olympics were originally sc... |
| 1 | 2020 Summer Olympics | Host city selection | The International Olympic Committee (IOC) vote... | 126 | 2020 Summer Olympics\nHost city selection\n\nT... | 1. When did the IOC vote to select the host ci... | 1. The IOC voted to select the host city of th... |
| 2 | 2020 Summer Olympics | Impact of the COVID-19 pandemic | In January 2020, concerns were raised about th... | 375 | 2020 Summer Olympics\nImpact of the COVID-19 p... | 1. What are the potential impacts of COVID-19 ... | 1. The potential impacts of COVID-19 on the 20... |
| 3 | 2020 Summer Olympics | Qualifying event cancellation and postponement | Concerns about the pandemic began to affect qu... | 298 | 2020 Summer Olympics\nQualifying event cancell... | 1. What qualifying events were moved to altern... | 1. The women's basketball qualification and th... |
| 4 | 2020 Summer Olympics | Effect on doping tests | Mandatory doping tests were being severely res... | 163 | 2020 Summer Olympics\nEffect on doping tests\n... | 1. What is the main concern with the doping te... | 1. The main concern with the doping tests for ... |


The `add_sts_score_column` function takes a Pandas DataFrame `dataframe` and a string `sentence`, adds an `sts_score` column, and returns the rows sorted by that score in descending order. Note that the column is assigned on the DataFrame that was passed in, so the caller's DataFrame gains the column as well; the returned object is a separate, sorted DataFrame.

The function first applies `calculate_sts_score` to each item in the 'questions' column with the `apply()` method. The `lambda` receives each item `x` and computes the STS score between `x` and `sentence`. The resulting scores are stored in the new 'sts_score' column.

Next, the function sorts the DataFrame with `sort_values()`, setting `by='sts_score'` and `ascending=False` so that the highest scores appear first.

Finally, the function returns the sorted DataFrame, which ranks every row's questions by how closely they match the given sentence.

In [None]:
def add_sts_score_column(dataframe: pd.DataFrame, sentence: str) -> pd.DataFrame:
    dataframe['sts_score'] = dataframe['questions'].apply(lambda x: calculate_sts_score(x, sentence))
    sorted_dataframe = dataframe.sort_values(by='sts_score', ascending=False)
    
    return sorted_dataframe


In [None]:
add_sts_score_column(df, 'What are the potential impacts of COVID-19?')