# Edge Full-Match Counting Implementation

This notebook processes a CSV of edges (`language_code`, `source_wikidata_id`, `target_wikidata_id`, `weight`). It retrieves Wikipedia extracts and entity names through the DuckDB handler, cleans the target names, counts full matches of each target name in the source's extract, and writes the results to a new CSV.

## 1. Setup: Import Libraries and Project Modules

Import required libraries and ensure project modules are accessible.

In [1]:
import os
import sys
import pandas as pd

# Add the project root (parent of the working directory) to sys.path so that `src` can be imported
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from src.modules.duckdb_handler import DuckDBHandler
from src import filters, edge_processing

## 2. Load Edge CSV

Specify the input CSV file and load the edges into a DataFrame.

In [2]:
# Set input CSV path
input_csv = '../data/out/SpotlightWeightSource_0102_0505.csv'

# Load edges
edges_df = pd.read_csv(input_csv)
print(f"Loaded {len(edges_df)} edges.")
edges_df.head()

Loaded 12444180 edges.


Unnamed: 0,language_code,source_wikidata_id,target_wikidata_id,weight
0,en,Q6882,Q53003,0.0
1,en,Q6882,Q5912,0.0
2,en,Q6882,Q212719,0.0
3,en,Q6882,Q514998,0.0
4,en,Q6882,Q210134,0.0
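
The `Unnamed: 0` column in the preview above suggests that the input CSV was written together with its DataFrame index, which pandas then reads back as an ordinary data column. The following is a minimal sketch, assuming that first column really is just a saved row index, of reading it as the index instead. The inline sample only mirrors the shape of the first rows shown above; it is not the real file.

```python
import io

import pandas as pd

# Illustrative sample in the same shape as the preview above: the header
# starts with an empty field because the index was written without a name.
sample_csv = io.StringIO(
    ",language_code,source_wikidata_id,target_wikidata_id,weight\n"
    "0,en,Q6882,Q53003,0.0\n"
    "1,en,Q6882,Q5912,0.0\n"
)

# index_col=0 tells pandas to use the first column as the row index,
# so no 'Unnamed: 0' data column is created.
sample_df = pd.read_csv(sample_csv, index_col=0)
print(list(sample_df.columns))
```

If the column is kept as in the cell above, it is also carried through to the output CSV as a regular column.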


## 3. Connect to DuckDB Database

Initialize the DuckDB handler to enable entity data retrieval.

In [3]:
# Path to DuckDB database
DB_PATH = '../data/out/graph_final.db'

# Initialize handler
handler = DuckDBHandler(DB_PATH)

## 4. Define Full-Match Counting Function

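The counting logic is no longer defined in this notebook. It lives in the project modules imported in section 1, and section 5 calls `edge_processing.count_exact_matches_grouped` to count full matches of each cleaned target name in the source's extract.

In [4]:
# count_fuzzy_matches is now in filters; nothing to define here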
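
Since the project's matching code is not shown in this notebook, here is a minimal sketch of what counting "full matches" of a cleaned name could look like. Both `clean_title` and `count_full_matches` are hypothetical helpers written for illustration; the cleaning rule (dropping underscores and a trailing parenthetical) and the case-insensitive whole-phrase match are assumptions, not the behavior of `filters` or `edge_processing`.

```python
import re


def clean_title(title: str) -> str:
    """Hypothetical cleaning: replace underscores and drop a trailing '(...)' part."""
    title = title.replace("_", " ")
    title = re.sub(r"\s*\([^)]*\)\s*$", "", title)
    return title.strip()


def count_full_matches(name: str, text: str) -> int:
    """Count case-insensitive occurrences of `name` in `text` as a whole phrase."""
    if not name or not text:
        return 0
    # The lookarounds act as word boundaries that still work when the name
    # begins or ends with a non-word character.
    pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


extract = "Georgia lies in the Caucasus. Georgian is the official language of Georgia."
name = clean_title("Georgia_(country)")
print(name, count_full_matches(name, extract))  # prints: Georgia 2
```

"Georgian" is not counted because the match must end at a word boundary, which is the distinction between a full match and a plain substring search.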


## 5. Process Edges and Compute Full-Match Counts

For each edge, retrieve the source extract and the target title, clean the title, and count its full matches in the extract.

In [5]:
import time

print("Step 1: Collecting all needed (qid, lang) pairs...")
src_pairs = list(zip(edges_df['source_wikidata_id'], edges_df['language_code']))
tgt_pairs = list(zip(edges_df['target_wikidata_id'], edges_df['language_code']))
print(f"  Source pairs: {len(src_pairs)} | Target pairs: {len(tgt_pairs)}")

print("Step 2: Preparing unique sets for efficient lookup...")
src_qid_lang_set = set(src_pairs)
tgt_qid_lang_set = set(tgt_pairs)
print(f"  Unique source (qid, lang): {len(src_qid_lang_set)} | Unique target (qid, lang): {len(tgt_qid_lang_set)}")

print("Step 2a: Fetching all source extracts from DB...")
src_qids = list({qid for qid, _ in src_qid_lang_set})
src_langs = list({lang for _, lang in src_qid_lang_set})
print(f"  Fetching extracts for {len(src_qids)} QIDs and {len(src_langs)} languages...")
src_extracts_df = handler.get_pages_data(src_qids, src_langs)
print(f"  Retrieved {len(src_extracts_df)} extracts.")
# Build the (qid, lang) -> extract lookup column-wise; avoids a slow iterrows() loop
src_extracts_dict = dict(zip(
    zip(src_extracts_df['wikidata_id'], src_extracts_df['language_code']),
    src_extracts_df['extract'],
))

print("Step 2b: Fetching all target titles from DB...")
tgt_qids = list({qid for qid, _ in tgt_qid_lang_set})
tgt_langs = list({lang for _, lang in tgt_qid_lang_set})
print(f"  Fetching titles for {len(tgt_qids)} QIDs and {len(tgt_langs)} languages...")
tgt_titles_df = handler.get_titles_for_qids(tgt_qids, tgt_langs)
print(f"  Retrieved {len(tgt_titles_df)} titles.")
# Build the (qid, lang) -> page title lookup column-wise; avoids a slow iterrows() loop
tgt_titles = dict(zip(
    zip(tgt_titles_df['wikidata_id'], tgt_titles_df['language_code']),
    tgt_titles_df['page_title'],
))

print("Step 3: Computing fullmatch counts (grouped + parallel)...")
start_time = time.time()
target_counts, source_self_counts = edge_processing.count_exact_matches_grouped(
    edges_df, src_extracts_dict, tgt_titles
)
edges_df['fullmatch_count'] = target_counts
edges_df['source_self_count'] = source_self_counts
total_time = time.time() - start_time
print(f"Done. Total time: {total_time / 60:.2f} min.")
print("'fullmatch_count' and 'source_self_count' columns added to edges_df.")


Step 1: Collecting all needed (qid, lang) pairs...
  Source pairs: 12444180 | Target pairs: 12444180
Step 2: Preparing unique sets for efficient lookup...
  Unique source (qid, lang): 415987 | Unique target (qid, lang): 398500
Step 2a: Fetching all source extracts from DB...
  Fetching extracts for 93339 QIDs and 5 languages...
  Retrieved 466350 extracts.
Step 2b: Fetching all target titles from DB...
  Fetching titles for 92078 QIDs and 5 languages...
  Retrieved 460089 titles.
Step 3: Computing fullmatch counts (grouped + parallel)...
Grouped 12444180 edges into 415987 source groups (avg 29.9 targets/source).


Source groups (exact): 100%|██████████| 415987/415987 [01:33<00:00, 4438.25it/s]


Done. Total time: 1.83 min.
'fullmatch_count' and 'source_self_count' columns added to edges_df.
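
The log line "Grouped 12444180 edges into 415987 source groups" indicates that edges are bucketed by source before counting, so each extract is handled once for all of its targets. The sketch below illustrates that grouping idea on toy data. `grouped_counts` and `count_whole_phrase` are hypothetical names, the QIDs and texts are made up, and nothing here claims to reproduce `edge_processing.count_exact_matches_grouped` (its parallelism, title cleaning, and the `source_self_count` computation are not shown in this notebook and are left out).

```python
import re
from collections import defaultdict

import pandas as pd


def count_whole_phrase(name, text):
    """Case-insensitive whole-phrase count (assumed matching rule)."""
    pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def grouped_counts(edges, extracts, titles):
    """Return one count per edge, in the original row order."""
    counts = [0] * len(edges)
    # Bucket row positions by (source, language) so each extract is looked up once.
    groups = defaultdict(list)
    rows = zip(edges["source_wikidata_id"], edges["language_code"],
               edges["target_wikidata_id"])
    for pos, (src, lang, tgt) in enumerate(rows):
        groups[(src, lang)].append((pos, tgt))
    for (src, lang), members in groups.items():
        text = extracts.get((src, lang))
        if not isinstance(text, str) or not text:
            continue  # missing or empty extract: counts stay at 0
        for pos, tgt in members:
            title = titles.get((tgt, lang))
            if isinstance(title, str) and title:
                counts[pos] = count_whole_phrase(title, text)
    return counts


# Toy data; the QIDs and texts are invented for illustration.
toy_edges = pd.DataFrame({
    "language_code": ["en", "en", "en"],
    "source_wikidata_id": ["Q1", "Q1", "Q2"],
    "target_wikidata_id": ["Q10", "Q11", "Q10"],
})
toy_extracts = {("Q1", "en"): "Alpha borders Beta. Beta is larger than Alpha."}
toy_titles = {("Q10", "en"): "Beta", ("Q11", "en"): "Gamma"}

toy_counts = grouped_counts(toy_edges, toy_extracts, toy_titles)
print(toy_counts)  # prints: [2, 0, 0]
```

Grouping pays off here because the run above averages about 30 targets per source, so the per-source lookup cost is shared across many edges.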


## 6. Save Results to CSV

Export the DataFrame with full-match counts to a new CSV file.

In [6]:
# Save results to a CSV file whose name is derived from the input file name
# (os is already imported in section 1)
# Get input file name without extension
input_base = os.path.splitext(os.path.basename(input_csv))[0]
output_csv = f"../data/out/{input_base}_fullmatch.csv"
edges_df.to_csv(output_csv, index=False)
print(f"Saved results to {output_csv}")

# Print some results
edges_df.head(20)

Saved results to ../data/out/SpotlightWeightSource_0102_0505_fullmatch.csv


Unnamed: 0,language_code,source_wikidata_id,target_wikidata_id,weight,fullmatch_count,source_self_count
0,en,Q6882,Q53003,0.0,0,0
1,en,Q6882,Q5912,0.0,0,0
2,en,Q6882,Q212719,0.0,0,0
3,en,Q6882,Q514998,0.0,0,0
4,en,Q6882,Q210134,0.0,0,0
5,en,Q6882,Q55391,0.0,0,0
6,en,Q6882,Q1388518,0.0,1,0
7,en,Q6882,Q274143,0.0,0,0
8,en,Q6882,Q41406,0.0,0,0
9,en,Q6882,Q7371,0.0,0,0
