This notebook walks through extracting the text embeddings, building the ground truth, and testing that implementation.

In [1]:
import os
import sys
import pandas as pd
from dotenv import load_dotenv
from src.nlp_models import HuggingFaceEmbeddings
from src import utils

In [2]:
# --- Setup paths ---
BASE_DIR = os.getcwd()
SRC_DIR = f'{BASE_DIR}/src'
OUTPUT_DIR = f'{BASE_DIR}/outputs'

sys.path.append(str(SRC_DIR))

# --- Load environment variables ---
load_dotenv(f'{BASE_DIR}/.env')

True

In [3]:
# --- Parameters ---
SEARCH_QUERY = "2024 Champions League Final"
LANGUAGE = "en"
TOTAL_ARTICLES = 100
PAGE_SIZE = 5
TOPIC_LABEL = "champions+league+2024"
PUBLISHED_AFTER = "2024-01-01"

ARTICLES_CSV = f'{OUTPUT_DIR}/articles.csv'
EMBEDDED_CSV = f'{OUTPUT_DIR}/embedded_articles.csv'
SCORED_CSV = f'{OUTPUT_DIR}/articles_scored.csv'
GROUND_TRUTH_TXT = f'{OUTPUT_DIR}/ground_truth.txt'

Part 1: Fetch articles and store them to a dataframe object
===
The articles were downloaded using the NewsAPI resource available online. We downloaded a total of 100 articles and saved them to ```outputs/articles.csv```.

To download a different set of articles, run the code below. You will first need to obtain a new API key from the resource [available here](https://newsapi.org/) and store it in the variable ```THENEWSAPI_KEY``` in your .env file.

```python
print("Fetching articles...")
articles = fetch_articles(
    query=SEARCH_QUERY,
    language=LANGUAGE,
    total_articles=TOTAL_ARTICLES,
    page_size=PAGE_SIZE,
    topic_label=TOPIC_LABEL,
    published_after=PUBLISHED_AFTER
)
```
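
With `TOTAL_ARTICLES = 100` and `PAGE_SIZE = 5`, a download of this kind takes 20 page requests. The internals of `fetch_articles` are not shown in this notebook, so the following is only a minimal sketch of such a paging loop, assuming a hypothetical `fetch_page(page, page_size)` callable that returns a list of article dicts and an empty list once results run out.

```python
import math
from typing import Callable, Dict, List

Article = Dict[str, str]


def collect_articles(
    fetch_page: Callable[[int, int], List[Article]],  # hypothetical page fetcher
    total_articles: int,
    page_size: int,
) -> List[Article]:
    """Request pages until `total_articles` are collected or results run out."""
    n_pages = math.ceil(total_articles / page_size)
    articles: List[Article] = []
    for page in range(1, n_pages + 1):
        batch = fetch_page(page, page_size)
        if not batch:  # the source has no more results
            break
        articles.extend(batch)
    return articles[:total_articles]


# Stand-in fetcher so the sketch runs without a network call or an API key.
def fake_fetch_page(page: int, page_size: int) -> List[Article]:
    start = (page - 1) * page_size
    return [{"article_id": str(i)} for i in range(start, start + page_size)]


articles = collect_articles(fake_fetch_page, total_articles=100, page_size=5)
print(len(articles))  # 100
```

Passing the fetcher in as an argument keeps the loop independent of any particular API client, which also makes it easy to exercise offline.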

The structure of the dataframe containing our article data is shown below.

In [4]:
df_articles = pd.read_csv(ARTICLES_CSV)
df_articles.head(3)

Unnamed: 0,article_id,title,body,source,published_at,url
0,0cd1f952-eb81-44ed-9792-5c854acfaeb4,Date confirmed for CAF Champions League quarte...,CAF has announced the date and venue for the q...,thesouthafrican.com,2025-02-05T09:49:17.000000Z,https://www.thesouthafrican.com/sport/soccer/c...
1,e3f890d4-4637-4dde-a45d-89c0f61a9046,"Champions League final, Premier League race: M...",Open Extended Reactions\n\nWith PSG and Boruss...,espn.co.uk,2024-05-02T15:35:23.000000Z,https://www.espn.co.uk/football/story/_/id/400...
2,83888beb-8b74-4e0e-8b46-97541b03ca31,"Champions League final, Premier League race: M...",Open Extended Reactions\n\nWith PSG and Boruss...,espn.com,2024-05-02T12:31:53.000000Z,https://www.espn.com/soccer/story/_/id/4005714...


Part 2: Generate text embeddings
===
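We embed the `body` of each article with the project's `HuggingFaceEmbeddings` wrapper, which here loads the `sentence-transformers/all-MiniLM-L6-v2` model, and save the result to `outputs/embedded_articles.csv`.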

In [5]:
embedder = HuggingFaceEmbeddings(path=str(ARTICLES_CSV), save_path=str(OUTPUT_DIR))
embeds_df = embedder.get_embedding_df(column='body', directory=str(OUTPUT_DIR), file='embedded_articles.csv')

Using device: cpu
Model moved to device: cpu
Model: sentence-transformers/all-MiniLM-L6-v2


Now that we have the embeddings, we load them into a dataframe alongside the original article columns.

In [6]:
df_embeddings = pd.read_csv(EMBEDDED_CSV)

In [7]:
print(df_embeddings.shape)
df_embeddings.head(1).T

(98, 7)


Unnamed: 0,0
article_id,0cd1f952-eb81-44ed-9792-5c854acfaeb4
title,Date confirmed for CAF Champions League quarte...
body,CAF has announced the date and venue for the q...
source,thesouthafrican.com
published_at,2025-02-05T09:49:17.000000Z
url,https://www.thesouthafrican.com/sport/soccer/c...
embeddings,"[-0.15082290768623352, 0.014091070741415024, -..."


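The dataframe has 98 rows and 7 columns: the original article columns plus a single `embeddings` column that holds the full vector for each article's body. The `all-MiniLM-L6-v2` model produces 384-dimensional vectors, so expanding that column into one column per dimension would give a 98 by 390 dataframe.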
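
Because the embeddings were written to CSV, `pd.read_csv` returns each vector as a string such as `"[-0.15, 0.01, ...]"` rather than a list of numbers. The following is a minimal sketch of parsing one such cell back into floats, assuming the column holds comma-separated, JSON-style lists as the preview above suggests. `parse_embedding` is a hypothetical helper, not part of `src`.

```python
import json
from typing import List


def parse_embedding(cell: str) -> List[float]:
    """Turn a serialized list such as "[0.1, -0.2]" into a list of floats."""
    values = json.loads(cell)
    return [float(v) for v in values]


# The first two values shown in the preview above, used as a tiny sample.
sample = "[-0.15082290768623352, 0.014091070741415024]"
vector = parse_embedding(sample)
print(len(vector), vector[0])
```

Applied to the whole column, this could look like `df_embeddings["embeddings"].apply(parse_embedding)`.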

Part 3: Extract ground truth from all articles
===
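We load spaCy's `en_core_web_lg` pipeline and pass it, together with the articles CSV, to `utils.get_ground_truth`. The returned strings are written one per line to `outputs/ground_truth.txt`.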

In [8]:
import spacy

In [9]:
nlp = spacy.load("en_core_web_lg")
ground_truth = utils.get_ground_truth(ARTICLES_CSV, nlp)

with open(GROUND_TRUTH_TXT, "w", encoding="utf-8") as f:
    f.write("\n".join(ground_truth))

----------Processing Article 1----------
----------Processing Article 2----------
----------Processing Article 3----------
----------Processing Article 4----------
----------Processing Article 5----------
----------Processing Article 6----------
----------Processing Article 7----------
----------Processing Article 8----------
----------Processing Article 9----------
----------Processing Article 10----------
----------Processing Article 11----------
----------Processing Article 12----------
----------Processing Article 13----------
----------Processing Article 14----------
----------Processing Article 15----------
----------Processing Article 16----------
----------Processing Article 17----------
----------Processing Article 18----------
----------Processing Article 19----------
----------Processing Article 20----------
----------Processing Article 21----------
----------Processing Article 22----------
----------Processing Article 23----------
----------Processing Article 24----------
-

Part 4: Create the text embeddings for the ground truth
===
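We read the ground-truth file back, join its lines into a single string, and embed that string with the `HuggingFaceEmbeddings` class used for the articles.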

In [10]:
# First we transform our ground truth into a text embedding vector.

# Read the ground truth back as a single string, using the same
# encoding it was written with.
with open(GROUND_TRUTH_TXT, 'r', encoding='utf-8') as f:
    ground_truth_list = [line.strip() for line in f]

ground_truth = " ".join(ground_truth_list)

ground_truth_embedding = HuggingFaceEmbeddings.get_single_embedding(ground_truth)

Part 5: Compare the similarity between each of the embeddings to the Ground Truth
===
We score each article by the cosine similarity between its embedding and the ground-truth embedding.

In [None]:
# we will be using cosine similarity to evaluate the similarity between each article and the previously
# built ground truth
df_scored = utils.compare_articles_to_ground_truth(df_embeddings, ground_truth_embedding)

# df_scored.to_csv(SCORED_CSV, index=False)

np.float64(0.41755040577605784)
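
Cosine similarity is the dot product of two vectors divided by the product of their norms, so it ranges from -1 to 1 and ignores vector length. The implementation of `utils.compare_articles_to_ground_truth` is not shown in this notebook. The following plain-Python sketch only illustrates the metric and a ranking step, with `cosine_similarity` and `rank_by_similarity` as hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (|a| * |b|); raise if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    embeddings: Dict[str, Sequence[float]], reference: Sequence[float]
) -> List[Tuple[str, float]]:
    """Score every embedding against `reference`, highest similarity first."""
    scores = [
        (key, cosine_similarity(vec, reference)) for key, vec in embeddings.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy 2-dimensional vectors; real embeddings have many more dimensions.
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_by_similarity(toy, reference=[1.0, 0.0])
print(ranking)
```

In the toy data, `a` points in the same direction as the reference and ranks first, `c` sits at 45 degrees to it, and `b` is orthogonal and ranks last.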