## Goal:
### Score = Semantic Similarity + λ × Contextual Relevance

This section estimates the contextual relevance of named entities, using TF-IDF computed over the entities extracted from each record.
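As a minimal sketch of the additive formula in the heading above, the two components could be combined as follows. The function name `combined_score`, its argument names, and the default `lam=0.5` are illustrative assumptions, not values taken from this notebook.

```python
def combined_score(semantic_similarity: float, contextual_relevance: float, lam: float = 0.5) -> float:
    """Score = Semantic Similarity + lambda * Contextual Relevance."""
    return semantic_similarity + lam * contextual_relevance


# Hypothetical inputs: a similarity of 0.8 and a TF-IDF based relevance of 0.9
print(combined_score(0.8, 0.9, lam=0.5))
```

Setting `lam=0` reduces the score to the semantic similarity alone, which is a convenient baseline when tuning the weight.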


In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
nlp = spacy.load("en_core_web_sm")  # Load the small English model

In [2]:
input_file_path = '/Users/thebekhruz/Desktop/100Days-Of-Code/100-Days-of-NLP-Odyssey/data/raw/data_formatted_date.jsonl'
df = pd.read_json(input_file_path, lines=True)
df.head(10)

,IAID,text,mentions,relations
0,a7bb9917-95ff-3f55-a640-4c5afcec25f2,View towards SE of junction of Queen Victoria ...,"[{'ne_span': 'Queen Victoria Road', 'ne_start'...",
1,c29a7b77-7c46-3b85-88fe-05c8f4b2e384,"Front page of Bucks Free Press, Time capsule f...","[{'ne_span': 'Bucks Free Press', 'ne_start': 1...",
2,196c11e6-f7b6-392f-ae41-28653345087c,"High Wycombe Police Station, in Queen Victoria...","[{'ne_span': 'High Wycombe Police Station', 'n...",
3,7a5aace6-2398-3dcf-8843-37ff6ccea875,"Reference Library door, Queen Victoria Rd, Hig...","[{'ne_span': 'Reference Library', 'ne_start': ...",
4,c66c4715-c03a-3aab-964b-e733f3ff1cf4,"Terrace of brick and flint cottages, Beech Rd,...","[{'ne_span': 'Beech Rd', 'ne_start': 37, 'ne_e...",
5,d1159b13-8aa9-35c1-a4c2-fd13e24732b2,"Site of former Dial House, demolition at corne...","[{'ne_span': 'Dial House', 'ne_start': 15, 'ne...",
6,e39a291d-ed39-3b56-b9aa-2022964a4114,"Row of cottages and part of factory building, ...","[{'ne_span': 'Coopers Yard', 'ne_start': 46, '...",[{'subject': 'https://www.wikidata.org/wiki/Q6...
7,3b84ea4c-e194-3c34-abbf-064a41ad59da,"60th anniversary of opening of Library, part o...","[{'ne_span': 'Library', 'ne_start': 31, 'ne_en...",
8,bf418a27-d1d9-324b-a53f-50ab7ae8d81d,View towards NE of crossroads at junction of Q...,"[{'ne_span': 'Queen Victoria Rd', 'ne_start': ...",
9,5bc83263-dcd0-3764-98cb-a3761480b4c7,"High Wycombe Police Station, Queen Victoria Ro...","[{'ne_span': 'High Wycombe Police Station', 'n...",


In [3]:
# Delete relations column as it does not have any meaningful information
del df['relations']

In [4]:
# Check for rows whose 'mentions' value is missing (NaN) and remove them
if df['mentions'].isnull().sum() > 0:
    df = df.dropna(subset=['mentions'])
    print('NaN mentions rows were removed.')
else:
    print('There are no empty mentions dictionaries in the database')

NaN mentions rows were removed.


In [5]:
# Extract specific keys from dictionaries in the 'mentions' column and create a new column
def extract_key_from_dict_list(dict_list, key):
    if isinstance(dict_list, list):
        result = [element.get(key) for element in dict_list if isinstance(element, dict) and key in element]
        return result
    else:
        return []
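A small usage sketch may help show what the helper returns for well-formed and malformed input. The `toy_mentions` list is hypothetical data in the same shape as the `mentions` column, and the function is repeated here only so that the block runs on its own.

```python
def extract_key_from_dict_list(dict_list, key):
    # Same logic as the cell above, repeated so this block is self-contained
    if isinstance(dict_list, list):
        return [element.get(key) for element in dict_list if isinstance(element, dict) and key in element]
    return []


# Hypothetical mentions in the same shape as the 'mentions' column
toy_mentions = [
    {'ne_span': 'Beech Rd', 'ne_start': 37, 'ne_type': 'LOC'},
    {'ne_start': 50},   # no 'ne_span' key, so this element is skipped
    'not a dict',       # non-dict elements are skipped too
]

spans = extract_key_from_dict_list(toy_mentions, 'ne_span')
missing = extract_key_from_dict_list(None, 'ne_span')  # non-list input gives an empty list
print(spans, missing)
```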

In [6]:

# This function applies the extraction of a key for a series in a DataFrame
def apply_extraction_to_column(df, column_name, key, new_column_name):
    df[new_column_name] = df[column_name].apply(lambda x: extract_key_from_dict_list(x, key))
    return df


df = apply_extraction_to_column(df, 'mentions', 'ne_span', 'extracted_entities')
df.head()


,IAID,text,mentions,extracted_entities
0,a7bb9917-95ff-3f55-a640-4c5afcec25f2,View towards SE of junction of Queen Victoria ...,"[{'ne_span': 'Queen Victoria Road', 'ne_start'...","[Queen Victoria Road, High St, Easton St, High..."
1,c29a7b77-7c46-3b85-88fe-05c8f4b2e384,"Front page of Bucks Free Press, Time capsule f...","[{'ne_span': 'Bucks Free Press', 'ne_start': 1...","[Bucks Free Press, Clock House, Arts School, F..."
2,196c11e6-f7b6-392f-ae41-28653345087c,"High Wycombe Police Station, in Queen Victoria...","[{'ne_span': 'High Wycombe Police Station', 'n...","[High Wycombe Police Station, Queen Victoria R..."
3,7a5aace6-2398-3dcf-8843-37ff6ccea875,"Reference Library door, Queen Victoria Rd, Hig...","[{'ne_span': 'Reference Library', 'ne_start': ...","[Reference Library, Queen Victoria Rd, High Wy..."
4,c66c4715-c03a-3aab-964b-e733f3ff1cf4,"Terrace of brick and flint cottages, Beech Rd,...","[{'ne_span': 'Beech Rd', 'ne_start': 37, 'ne_e...","[Beech Rd, Wycombe Marsh, 1935]"


### TF-IDF analysis
1. Clean data / Preprocessing: standardise the data, normalise it (all lower case), and lemmatise it (reduce all words to their root forms).
2. Tokenize the text and count word frequencies
3. Find TF for words
4. Find IDF for words
5. Vectorize vocab

source: https://medium.com/analytics-vidhya/tf-idf-term-frequency-technique-easiest-explanation-for-text-classification-in-nlp-with-code-8ca3912e58c3
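The steps above can be sketched by hand on a toy corpus. This is an illustration under stated assumptions rather than the notebook's pipeline: the three documents are hypothetical, tokenisation is a plain whitespace split, and the smoothed IDF and L2 normalisation follow the defaults documented for scikit-learn's `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`).

```python
import math
from collections import Counter

# Hypothetical, already preprocessed documents (lower-cased, stop words removed)
docs = [
    "queen victoria road high st",
    "queen victoria road police station",
    "beech rd wycombe marsh",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(tokens):
    """Return the L2-normalised TF-IDF weights for one tokenised document."""
    term_freq = Counter(tokens)
    raw = {term: count * idf[term] for term, count in term_freq.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


vectors = [tfidf_vector(tokens) for tokens in tokenized]
# 'queen' occurs in two documents and 'high' in one, so 'high' gets the larger weight
print(vectors[0])
```

Terms that occur in fewer documents receive a higher IDF, which is why the rarer `high` outweighs `queen` within the first document.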


In [7]:
# Preprocesses the given text by removing punctuation, making text lowercase, and removing stop words.
def preprocess_text(text):
    doc = nlp(text)
    tokens = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

<!-- * If you want to calculate the TF-IDF scores for all the words in your preprocessed text, then you should use preprocessed_text.
* If you are specifically interested in the TF-IDF scores for the named entities (i.e., ne_span) only, then you should use preprocessed_entities. -->

$Score = \text{Semantic Similarity} + \lambda \times \text{Contextual Relevance (TF-IDF)}$
#### Significance of Named Entities Within Documents:
* To prioritize named entities, compute TF-IDF on *preprocessed_entities* rather than on the full text. This measures the importance of each entity independently of the surrounding text.
#### Semantic Similarity and Contextual Relevance:
* TF-IDF adjusts the $Score$ by giving more weight to distinctive named entities. Using *preprocessed_entities* keeps the relevance score focused on the entities, so the surrounding text does not dilute it.
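
One way to write the contextual relevance term down, assuming (as the scoring code later in this notebook does) that the relevance of an entity is the sum of the TF-IDF weights of its tokens within the entity document of the same record, and using the additive form from the Goal heading. The symbols $e$ (entity), $d$ (entity document), $\mathrm{Sim}$ and $\mathrm{CR}$ are notation introduced here for illustration only:

$$
\mathrm{CR}(e, d) = \sum_{t \in e} \mathrm{tfidf}(t, d),
\qquad
\mathrm{Score}(e, d) = \mathrm{Sim}(e) + \lambda \cdot \mathrm{CR}(e, d)
$$

Because each document vector is L2-normalised, an entity whose tokens make up most of a short entity document can reach a relevance close to, or even above, 1.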


In [8]:
# Apply preprocess_text to every entity in the 'extracted_entities' column
# Uncomment the next line if the surrounding contextual information should be scored as well
# df['preprocessed_text'] = df['text'].apply(preprocess_text)
df['preprocessed_entities'] = df['extracted_entities'].apply(lambda x: [preprocess_text(entity) for entity in x])


In [13]:
# Flatten the per-row lists into a single list of preprocessed entities across all documents
all_entities = [entity for entities in df['preprocessed_entities'] for entity in entities]

# Convert this list into a string where each entity is separated by a space (to simulate a "document" of entities)
entities_text = ' '.join(all_entities)

# Create a "document" for each set of entities in each row to calculate TF-IDF scores
entities_documents = [' '.join(entities) for entities in df['preprocessed_entities']]

# Initialize the vectorizer
entity_vectorizer = TfidfVectorizer()

# Fit and transform the entities documents to calculate TF-IDF
entity_tfidf_matrix = entity_vectorizer.fit_transform(entities_documents)

# tfidf_df = pd.DataFrame(entity_tfidf_matrix.toarray(), columns=entity_vectorizer.get_feature_names_out())
tfidf_df = pd.DataFrame(entity_tfidf_matrix.toarray(), index=df['IAID'], columns=entity_vectorizer.get_feature_names_out())
tfidf_df.head()


IAID,02173,02174,02181,02183,02184,02212,1230,13,13th,13thc,...,yard,ye,ymca,yorkshire,youers,young,youngs,ypres,zager,zephaniah
a7bb9917-95ff-3f55-a640-4c5afcec25f2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
c29a7b77-7c46-3b85-88fe-05c8f4b2e384,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
196c11e6-f7b6-392f-ae41-28653345087c,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7a5aace6-2398-3dcf-8843-37ff6ccea875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
c66c4715-c03a-3aab-964b-e733f3ff1cf4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
# Function to get the TF-IDF score for a word in a specific document using IAID
def get_tfidf_score(word, vectorizer, tfidf_df, iaid):
    # The columns of tfidf_df are the vectorizer's feature names, so a word that is
    # in the vocabulary can be used directly as the column label
    if word in vectorizer.vocabulary_:
        return tfidf_df.loc[iaid, word]
    # If the word is not in the vocabulary, it contributes nothing to the score
    return 0.0

# Function to calculate the total TF-IDF score for each mention's ne_span for a specific document using IAID
def add_tfidf_scores_to_mentions(row, vectorizer, tfidf_df):
    mentions = row['mentions']
    iaid = row['IAID']  # Use IAID to reference the document in tfidf_df
    for mention in mentions:
        # Note: whitespace splitting only approximates the preprocessing used to build the
        # vocabulary; tokens with attached punctuation and stop words are not in it and score 0
        words = mention['ne_span'].lower().split()
        total_score = sum(get_tfidf_score(word, vectorizer, tfidf_df, iaid) for word in words)
        mention['total_tfidf_score'] = total_score

# Apply the function to each row; the mention dictionaries are updated in place
df.apply(lambda row: add_tfidf_scores_to_mentions(row, entity_vectorizer, tfidf_df), axis=1)
# Inspect the first mention of the first remaining row
df['mentions'].iloc[0][0]



{'ne_span': 'Queen Victoria Road',
 'ne_start': 31,
 'ne_end': 50,
 'ne_type': 'LOC',
 'total_tfidf_score': 0.902690796523274}
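
The scoring function splits each `ne_span` on whitespace, while the vocabulary was built by `TfidfVectorizer`, whose default token pattern keeps only runs of two or more word characters. The sketch below illustrates how the two tokenisations can disagree. The `vocabulary` set and the `span` string are hypothetical stand-ins for `entity_vectorizer.vocabulary_` and a real mention, and the stop-word removal done by `preprocess_text` is not reproduced here.

```python
import re

# Default token pattern of scikit-learn's TfidfVectorizer: runs of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical vocabulary, standing in for entity_vectorizer.vocabulary_
vocabulary = {"high", "st", "queen", "victoria", "road"}

# Hypothetical ne_span that contains punctuation
span = "High St, Queen Victoria Road"

split_tokens = span.lower().split()                  # whitespace split, as in the scoring function
regex_tokens = TOKEN_PATTERN.findall(span.lower())   # tokens as the vectorizer would see them

# Tokens that would silently score 0 because they are missing from the vocabulary
missed_by_split = [token for token in split_tokens if token not in vocabulary]
missed_by_regex = [token for token in regex_tokens if token not in vocabulary]
print(missed_by_split, missed_by_regex)
```

With the whitespace split, `st,` keeps its comma and is not found in the vocabulary, so it would contribute 0 to `total_tfidf_score`. In scikit-learn, `entity_vectorizer.build_analyzer()` returns the callable the vectorizer itself uses, which may be a more reliable way to tokenise spans consistently.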