## Similarity Score Calculation Methodology

This document outlines the methodology used to calculate a relevance score for each document. The score combines the text similarity between the query and the document with the contextual relevance of named entities, weighted by λ.

### Formula
$$\text{Score} = \text{Similarity Score} + \lambda \cdot \text{Contextual Relevance}$$


- **Score**: The relevance of the document.
- **Similarity Score**: Obtained by encoding the query and each document as embeddings and taking the cosine similarity between them. The `sentence_transformers` library from Hugging Face is used to generate the document embeddings.
- **λ (Lambda)**: Represents the weight assigned to the identified named entities in the query. Increasing λ gives higher priority to documents that mention the same entity type.
    - **Note**: Emphasis is placed on *entity type* (`ne_type`) rather than *entity span* for three reasons:
        1. **Flexibility**: `ne_type` enables broader search by generalizing across similar concepts, so matching is not limited to exact phrases.
        2. **Robustness**: It is more resilient to variations in terminology, which gives more consistent search results.
        3. **Efficiency**: Focusing on `ne_type` simplifies entity matching, improving computational speed and scalability.
- **Contextual Relevance**: Measures how relevant an entity is within the document, calculated using tf-idf (term frequency-inverse document frequency).
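
As a minimal sketch of how the formula combines its terms, assume the similarity score and the contextual relevance have already been computed for each document. The function name `combined_score`, the value `lam=0.5`, and the numbers below are illustrative assumptions, not values from this notebook.

```python
def combined_score(similarity, contextual_relevance, lam=0.5):
    """Score = Similarity Score + lambda * Contextual Relevance."""
    return similarity + lam * contextual_relevance


# Hypothetical (similarity, contextual relevance) pairs for two documents.
candidates = {
    "doc_a": (0.62, 0.10),
    "doc_b": (0.58, 0.40),
}

# Rank the documents with the entity term switched off and switched on.
for lam in (0.0, 0.5):
    ranked = sorted(
        candidates,
        key=lambda name: combined_score(*candidates[name], lam=lam),
        reverse=True,
    )
    print(f"lambda={lam}: {ranked}")
```

With λ = 0 the ranking follows similarity alone. Raising λ lets the contextual relevance term reorder the results.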

### Practical Example
Consider a set of documents:
- "Queen Victoria's coronation at Weycombe Abbey."
- "Police station at the East Corner of Weycombe Abbey."
- ...

In this example, "Queen Victoria" would be assigned a higher Contextual Relevance than "Weycombe Abbey" due to its less frequent appearance across the corpus, highlighting its significance.
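
The effect can be illustrated with a small inverse document frequency calculation over the two example documents. This is a sketch only: it uses the smoothed form ln((1 + N) / (1 + df)) + 1, matches entities by plain substring, and the helper name `entity_idf` is hypothetical. The tf-idf computation actually used for this dataset may differ.

```python
import math

documents = [
    "Queen Victoria's coronation at Weycombe Abbey.",
    "Police station at the East Corner of Weycombe Abbey.",
]


def entity_idf(entity, docs):
    """Smoothed inverse document frequency of an entity string."""
    # Number of documents mentioning the entity (case-insensitive substring match).
    doc_freq = sum(entity.lower() in doc.lower() for doc in docs)
    return math.log((1 + len(docs)) / (1 + doc_freq)) + 1


for entity in ("Queen Victoria", "Weycombe Abbey"):
    print(f"{entity}: idf = {entity_idf(entity, documents):.3f}")
```

"Queen Victoria" appears in one of the two documents and "Weycombe Abbey" in both, so the rarer entity receives the larger weight.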


## Generating Document Embeddings

In [2]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics.pairwise import cosine_similarity
import spacy
nlp = spacy.load("en_core_web_sm")  # Load the small English model

In [2]:
input_file_path = '/Users/thebekhruz/Desktop/100Days-Of-Code/100-Days-of-NLP-Odyssey/data/processed/data_with_contextual_relevance.jsonl'

try:
    df = pd.read_json(input_file_path, lines=True)
    print('Data Frame loaded as df')
except Exception as e:
    print(e)

Data Frame loaded as df


In [3]:
# Check the data for missing values.
# The data is in the desired format if every column shows False.
df.isna().value_counts()

IAID   text   mentions
False  False  False       1983
Name: count, dtype: int64

In [4]:
# Preprocess the given text by removing punctuation and stop words,
# lemmatising the remaining words, and making them lowercase.
def preprocess_text(text):
    doc = nlp(text)
    tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

# Preprocess the text in the 'text' column
df['preprocessed_text'] = df['text'].apply(preprocess_text)


In [3]:
# Initialize the SentenceTransformer model with 'msmarco-distilbert-dot-v5'.
# This model is trained on the MS MARCO passage-ranking dataset for semantic search
# and is tuned for dot-product scoring.
model = SentenceTransformer("msmarco-distilbert-dot-v5")


In [6]:
# Encode the preprocessed text with the model loaded above
df['text_embedding'] = df['preprocessed_text'].apply(lambda x: model.encode(x))

In [7]:
df.head()

Unnamed: 0,IAID,text,mentions,preprocessed_text,text_embedding
0,a7bb9917-95ff-3f55-a640-4c5afcec25f2,View towards SE of junction of Queen Victoria ...,"[{'ne_span': 'Queen Victoria Road', 'ne_start'...",view se junction queen victoria road high st e...,"[-0.036429767, 0.43713483, 0.050612718, -0.088..."
1,c29a7b77-7c46-3b85-88fe-05c8f4b2e384,"Front page of Bucks Free Press, Time capsule f...","[{'ne_span': 'Bucks Free Press', 'ne_start': 1...",page bucks free press time capsule clock house...,"[0.24973764, 0.45752427, 0.47631216, -0.098070..."
2,196c11e6-f7b6-392f-ae41-28653345087c,"High Wycombe Police Station, in Queen Victoria...","[{'ne_span': 'High Wycombe Police Station', 'n...",high wycombe police station queen victoria roa...,"[-0.021879677, 0.22485393, 0.2611494, 0.114179..."
3,7a5aace6-2398-3dcf-8843-37ff6ccea875,"Reference Library door, Queen Victoria Rd, Hig...","[{'ne_span': 'Reference Library', 'ne_start': ...",reference library door queen victoria rd high ...,"[0.19242485, 0.4036871, 0.14782123, -0.1158049..."
4,c66c4715-c03a-3aab-964b-e733f3ff1cf4,"Terrace of brick and flint cottages, Beech Rd,...","[{'ne_span': 'Beech Rd', 'ne_start': 37, 'ne_e...",terrace brick flint cottage beech rd wycombe m...,"[0.05957342, 0.4252165, 0.34293148, -0.1636256..."


In [8]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(query, model, k=5):
    """Print the k documents in df most similar to the query; return the query embedding."""
    # Preprocess the query the same way as the documents
    query = preprocess_text(query)

    # Encode the query; passing a list yields a 2D array of shape (1, dim)
    query_embedding = model.encode([query])

    # Calculate cosine similarity between the query and every document embedding
    doc_embeddings = np.array(df['text_embedding'].tolist())
    similarity_scores = cosine_similarity(query_embedding, doc_embeddings).flatten()

    # Get indices of the top k scores, highest first
    top_k_indices = np.argsort(similarity_scores)[-k:][::-1]

    # Retrieve and print the corresponding texts with their rank and score
    for rank, index in enumerate(top_k_indices, start=1):
        most_similar_text = df.iloc[index]['text']
        print(f"Document Rank: {rank}")
        print(f"{most_similar_text}\nSimilarity score: {similarity_scores[index]}\n---")

    return query_embedding
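
The ranking step inside `search` can be illustrated without loading the model. The sketch below uses toy 3-dimensional vectors in place of real embeddings and plain Python in place of NumPy and scikit-learn. `cosine` and `top_k` are hypothetical helper names, not part of this notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar documents, best first."""
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for document embeddings.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.1, 0.0]

for rank, (index, score) in enumerate(top_k(query_vec, doc_vecs), start=1):
    print(f"Rank {rank}: document {index} (cosine similarity {score:.3f})")
```

Sorting all scores and slicing the first `k` plays the same role as the `np.argsort(...)[-k:][::-1]` expression in `search`.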

In [9]:
# Prepare the data for saving
del df['preprocessed_text']
df.head()


Unnamed: 0,IAID,text,mentions,text_embedding
0,a7bb9917-95ff-3f55-a640-4c5afcec25f2,View towards SE of junction of Queen Victoria ...,"[{'ne_span': 'Queen Victoria Road', 'ne_start'...","[-0.036429767, 0.43713483, 0.050612718, -0.088..."
1,c29a7b77-7c46-3b85-88fe-05c8f4b2e384,"Front page of Bucks Free Press, Time capsule f...","[{'ne_span': 'Bucks Free Press', 'ne_start': 1...","[0.24973764, 0.45752427, 0.47631216, -0.098070..."
2,196c11e6-f7b6-392f-ae41-28653345087c,"High Wycombe Police Station, in Queen Victoria...","[{'ne_span': 'High Wycombe Police Station', 'n...","[-0.021879677, 0.22485393, 0.2611494, 0.114179..."
3,7a5aace6-2398-3dcf-8843-37ff6ccea875,"Reference Library door, Queen Victoria Rd, Hig...","[{'ne_span': 'Reference Library', 'ne_start': ...","[0.19242485, 0.4036871, 0.14782123, -0.1158049..."
4,c66c4715-c03a-3aab-964b-e733f3ff1cf4,"Terrace of brick and flint cottages, Beech Rd,...","[{'ne_span': 'Beech Rd', 'ne_start': 37, 'ne_e...","[0.05957342, 0.4252165, 0.34293148, -0.1636256..."


In [10]:
# Check that each IAID has text_embedding
df.isna().value_counts()

IAID   text   mentions  text_embedding
False  False  False     False             1983
Name: count, dtype: int64

In [11]:
# # Save the file 
# output_file_path = '/Users/thebekhruz/Desktop/100Days-Of-Code/100-Days-of-NLP-Odyssey/data/processed/data_with_contextual_relevance_and_doc_embddings.jsonl'
# df.to_json(output_file_path, orient='records', lines=True)
# print(f'Data saved to {output_file_path}')

## Testing

In [20]:
# General Query about a Location: 
query = "What is Queen Victoria Road in High Wycombe known for?"
embedding = search(query, model, k=3)



[[ 2.32408091e-01  2.75263131e-01  3.00824612e-01 -1.37099579e-01
   2.12703750e-01  2.23786548e-01  3.17782640e-01 -5.44148922e-01
   2.07092296e-02  2.52248675e-01  1.80968478e-01 -2.86924660e-01
   8.67880210e-02  4.93522704e-01  2.04148442e-01  2.59680420e-01
   1.60995215e-01  3.49115282e-01 -3.90778072e-02 -3.56900170e-02
   3.56251091e-01  1.63225144e-01 -3.78627181e-02  4.39224064e-01
   6.81126595e-01  1.44065052e-01  2.32686520e-01 -3.03672224e-01
   2.91557908e-01 -3.88740711e-02  2.17191894e-02 -8.59443694e-02
   3.65102470e-01 -5.61769009e-02 -8.96136239e-02  1.79392189e-01
   1.57263130e-01 -1.91133529e-01 -2.94375807e-01 -1.08308412e-01
  -4.42694575e-01 -2.39115164e-01 -9.33959112e-02  2.33993813e-01
  -1.70036629e-01 -6.94545507e-02  2.42701620e-01  5.04665732e-01
   5.53732514e-02 -1.35693640e-01 -1.72060564e-01  9.25753593e-01
   8.56420957e-03  1.26511857e-01  4.48406637e-01  3.96166109e-02
   1.39501944e-01 -3.33353668e-01  2.80351430e-01 -4.31802571e-01
  -2.89443

In [13]:
# Specific Event or Item Query: 
query = "Information about the time capsule found in High Wycombe."
search(query, model, k=3)

# The model does not perform well on queries about specific events when only semantic similarity is considered, without contextual relevance.

[[-9.10598319e-03 -6.08279817e-02  4.10736233e-01 -1.29252240e-01
   6.31143332e-01  7.66958967e-02  4.66839701e-01 -1.75720602e-01
  -2.96951681e-02 -1.75775699e-02 -1.01538692e-02 -2.89295852e-01
  -2.05438435e-02  5.74194312e-01 -5.13539475e-04  1.80662975e-01
   8.09746236e-03  6.23788647e-02 -8.89904201e-02 -1.65618271e-01
   4.91100550e-01 -1.74240217e-01 -8.91846418e-03  3.36328030e-01
   4.30831015e-01 -7.63762891e-02 -3.25990736e-01 -4.19220120e-01
  -3.35852876e-02 -9.88990068e-02 -4.03624624e-01 -2.41954178e-01
   3.85523349e-01 -2.18320675e-02 -1.17340110e-01  1.78627402e-01
   6.24263473e-02 -8.08209330e-02 -7.97794014e-02 -1.41190410e-01
   8.90251324e-02 -2.34751940e-01 -3.32146078e-01  3.82704660e-02
  -1.06233181e-02  5.45483641e-02 -5.27046919e-02  6.33778632e-01
  -1.31855443e-01 -9.54149365e-02 -3.00210059e-01  8.34850669e-01
   2.86032651e-02  8.92716572e-02  1.66860253e-01  7.76649714e-02
   1.32768378e-01 -4.70361084e-01  2.94813477e-02 -1.08363889e-01
   5.51292

In [14]:
# Cultural or Historical Site Query: 
query = "Where is the Reference Library located in High Wycombe?"
search(query, model, k=3)



[[ 9.84901041e-02  2.19361588e-01  1.23722211e-01 -2.96491295e-01
   2.94133782e-01  8.32282975e-02 -2.89669335e-02 -2.38139838e-01
  -7.43357316e-02  1.44863665e-01 -3.32558043e-02 -2.71607995e-01
   3.47052030e-02  3.19820046e-01  2.05159798e-01  2.09611416e-01
   1.45627469e-01  9.08271894e-02 -7.94764757e-02  3.47810313e-02
   4.88785207e-01  2.14196488e-01 -1.75858259e-01  3.48131686e-01
   3.51109833e-01  2.42914200e-01  9.12759006e-02 -4.44340825e-01
   1.06857866e-01 -9.67863426e-02 -1.92713022e-01 -1.54355466e-01
   4.30629134e-01 -1.12873226e-01 -2.00127840e-01  2.61492819e-01
   4.01484638e-01 -4.49393690e-02 -8.42352689e-04 -9.30866897e-02
   9.58699286e-02 -1.72288924e-01 -8.23439509e-02  2.06850737e-01
   9.13056955e-02  2.39276007e-01 -2.03508928e-01  6.25743032e-01
  -2.57076383e-01 -1.28501147e-01 -4.50732082e-01  9.28138435e-01
  -2.36227408e-01  1.13246121e-01  9.86947343e-02  1.53077766e-01
  -8.76219943e-03 -3.25538993e-01  1.06575184e-01 -3.27098012e-01
   6.64042

In [15]:
# Broad Historical or Cultural Query:
query = "Historical landmarks in High Wycombe."
search(query, model, k=3)



[[-1.28695339e-01  1.04755960e-01  1.98115975e-01 -3.07836354e-01
   1.72916621e-01  7.53208697e-02  3.23828608e-02 -3.19486201e-01
  -7.79238194e-02  6.49881139e-02  1.14377394e-01 -1.30077675e-01
  -4.31987904e-02  5.52415907e-01  6.53677881e-02  2.16582119e-01
   1.82112321e-01  1.59957647e-01 -1.76647097e-01  6.46523684e-02
   3.76829326e-01 -8.51041544e-03 -3.82900476e-01  3.62290204e-01
   3.23034942e-01  1.80174112e-01  9.66784656e-02 -4.84691739e-01
   9.98164266e-02  1.44830614e-01 -3.76291573e-03 -1.44857258e-01
   1.27330482e-01 -1.29857212e-01 -2.29745299e-01  2.97888339e-01
   1.00224957e-01 -7.87514299e-02 -2.18666166e-01 -9.65037197e-03
   8.96870941e-02 -2.12867394e-01 -1.53660744e-01  3.08702528e-01
  -8.51944908e-02  6.96981102e-02 -5.41689694e-02  6.14795208e-01
  -7.88878873e-02 -1.17620744e-01 -5.76376319e-01  8.17503691e-01
  -3.91935706e-01  1.98269695e-01  2.71704197e-01  5.25114015e-02
   1.13092050e-01 -8.66650045e-02  1.08565912e-01 -4.15793300e-01
  -1.71221

In [16]:
query = "Are there any annual cultural festivities taking place at Queen Victoria Road, High Wycombe?"
search(query, model, k=5)


[[ 8.22718535e-03  1.70604140e-01  5.99720061e-01 -2.36553505e-01
  -3.21590193e-02  3.34031507e-02  6.78628922e-01 -3.34725939e-02
  -1.13841854e-01  1.45257488e-01  1.98471546e-02 -3.13243032e-01
   3.37471738e-02  5.58001339e-01  8.85333866e-02  4.24400777e-01
   2.29371309e-01  1.83365762e-01  5.89418411e-02  1.61373932e-02
   3.80466789e-01  4.63457964e-02  2.54782680e-02  5.68056703e-01
   4.43437129e-01  1.61404356e-01  3.35310817e-01 -4.46288168e-01
   1.89415962e-01 -2.45550185e-01 -2.03899331e-02 -1.08435303e-02
   2.09644064e-01  8.77546263e-04  4.94886562e-02 -3.70987244e-02
   3.93436998e-01 -2.49428242e-01  6.32016873e-03 -5.98685443e-03
  -2.19598010e-01 -4.64595318e-01  1.05181709e-01  1.19593146e-03
  -3.83317411e-01 -1.91955984e-01  1.29822250e-02  6.78616643e-01
  -1.02413505e-01 -8.09229463e-02 -2.35538453e-01  7.00195730e-01
   1.07495464e-01  2.00798716e-02  5.23967981e-01  1.62628546e-01
   1.32398069e-01 -1.70072063e-03  6.14733063e-02 -1.95937246e-01
  -1.32656

### Implementing the Formula