# LLM + BLIP Combined Pipeline: Multi-Modal Knowledge Graph Completion

## Overview
This notebook combines predictions from two modalities to enrich the WJoconde knowledge graph:
- **VLM (Vision-Language Model, here BLIP)**: Extracts "depicts" relations from artwork **images**
- **LLM (Large Language Model)**: Extracts "depicts" relations from artwork **text descriptions**

## Workflow
1. **Cell 1**: Combine multiple BLIP JSON output files into a single CSV
2. **Cell 2**: Merge BLIP (vision) and LLM (text) predictions
3. **Cell 3**: Add ground truth labels from knowledge graph for comparison
4. **Cell 4**: Filter predictions using Word2Vec semantic similarity (threshold=0.4)
   - Removes predictions that are semantically similar to existing KG entries
   - Keeps only NOVEL predictions that could enrich the knowledge graph
5. **Cell 5**: Count items in each depicts column for analysis
6. **Cell 6**: Compute summary statistics (total predictions vs. ground truth)
7. **Cell 7**: Count unique entities in the original knowledge graph
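
The filtering rule in step 4 can be sketched on toy data as follows. This is a minimal illustration, not the notebook's code: `keep_novel`, `toy_similarity`, and the scores in `toy_scores` are hypothetical stand-ins for the Word2Vec comparison, and the default threshold of 0.4 is the value set in Cell 4's code.

```python
def keep_novel(predicted, truth, similarity, threshold=0.4):
    """Keep predictions whose similarity to every truth label is below threshold."""
    return [
        p for p in predicted
        if all(similarity(p, t) < threshold for t in truth)
    ]

# Hypothetical similarity scores standing in for Word2Vec (assumption).
toy_scores = {("dog", "animal"): 0.62, ("river", "animal"): 0.11}

def toy_similarity(a, b):
    # Unknown pairs score 0.0, so they count as dissimilar.
    return toy_scores.get((a, b), 0.0)

novel = keep_novel(["dog", "river"], ["animal"], toy_similarity)
print(novel)  # ['river']
```

Under this rule an entity with no ground-truth labels keeps all of its predictions, since there is nothing to compare them against.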

## Key Output
- Final dataset containing:
  - `from`: Entity URI
  - `depicts`: All predictions (LLM + BLIP combined)
  - `true_depicts`: Ground truth from knowledge graph
  - `filtered_depicts`: Novel predictions not in KG (potential additions)


In [None]:
# =============================================================================
# CELL 1: Combine Multiple BLIP JSON Output Files
# =============================================================================
# Purpose: 
#   - Load multiple BLIP VLM output JSON files (blip_valid_json_pairs.json, etc.)
#   - Extract "depicts" predictions from each entity
#   - Merge all predictions into a single DataFrame with unique values
#   - Save combined BLIP outputs to 'blip_outputs_all.csv'

# Input: blip_valid_json_pairs.json (list further files, e.g. blip_valid_json_pairs2.json, in json_files)
# Output: blip_outputs_all.csv (columns: from, depicts)
# =============================================================================

import json
import pandas as pd

# Define file paths
json_files = [
    "blip_valid_json_pairs.json",
]

# Initialize an empty dictionary to store data
data = {}

# Process each JSON file
for file in json_files:
    try:
        with open(file, 'r', encoding='utf-8') as f:
            json_data = json.load(f)

        for key, value in json_data.items():
            if isinstance(value, dict) and "depicts" in value:
                depicts_list = value["depicts"]
                # Collect labels in a set so repeats across files collapse
                labels = data.setdefault(key, set())
                if isinstance(depicts_list, list):
                    labels.update(depicts_list)
    except Exception as e:
        print(f"Error processing {file}: {e}")

# Convert the cleaned data into a DataFrame
df = pd.DataFrame(
    {"from": list(data.keys()), "depicts": [", ".join(sorted(depicts_set)) for depicts_set in data.values()]}
)

# Define output file path
output_csv = "blip_outputs_all.csv"  # Cell 2 reads this file name

# Save to CSV
df.to_csv(output_csv, index=False, encoding='utf-8')

print(f"CSV file saved at: {output_csv}")
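
The input shape Cell 1 expects, and how repeated labels collapse across files, can be sketched without touching disk. The URIs and labels below are hypothetical; the only structure taken from the notebook is a top-level mapping from entity key to a record with a list-valued `depicts` field. Unlike Cell 1, this sketch does not create an entry for a record whose `depicts` value is not a list.

```python
import json

# Hypothetical contents of two BLIP output files (assumption).
file_a = json.loads('{"http://example.org/e1": {"depicts": ["horse", "rider"]}}')
file_b = json.loads(
    '{"http://example.org/e1": {"depicts": ["rider", "sky"]},'
    ' "http://example.org/e2": {"caption": "no depicts key"}}'
)

merged = {}
for payload in (file_a, file_b):
    for uri, record in payload.items():
        # Skip records that lack a list-valued "depicts" field.
        if isinstance(record, dict) and isinstance(record.get("depicts"), list):
            merged.setdefault(uri, set()).update(record["depicts"])

# One comma-separated, sorted string per entity, as written to the CSV.
rows = {uri: ", ".join(sorted(labels)) for uri, labels in merged.items()}
print(rows)  # {'http://example.org/e1': 'horse, rider, sky'}
```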


In [None]:
# =============================================================================
# CELL 2: Merge BLIP (Vision) and LLM (Text) Predictions
# =============================================================================
# Purpose:
#   - Load BLIP predictions (from images) and LLM predictions (from text)
#   - Merge both sources of "depicts" predictions for each entity
#   - Deduplicate predictions using sets
#   - Save the combined multi-modal predictions

# Input: blip_outputs_all.csv, llm_outputs_all.csv
# Output: llm_bilp_combined_all.csv (merged predictions from both modalities)
# =============================================================================

import pandas as pd

# Define file paths
blip_csv_path = "blip_outputs_all.csv"
llm_csv_path = "llm_outputs_all.csv"
output_csv_path = "llm_bilp_combined_all.csv"

# Load CSV files
blip_df = pd.read_csv(blip_csv_path)
llm_df = pd.read_csv(llm_csv_path)

# Function to process 'depicts' column (convert to set for deduplication)
def process_depicts(depicts):
    if pd.isna(depicts):
        return set()
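    # Split on commas and strip whitespace so "a,b" and "a, b" give the same labels
    return {item.strip() for item in str(depicts).split(",") if item.strip()}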

# Create a dictionary to store combined data
combined_data = {}

# Process BLIP data
for _, row in blip_df.iterrows():
    key = row["from"]
    depicts_set = process_depicts(row["depicts"])
    if key not in combined_data:
        combined_data[key] = depicts_set
    else:
        combined_data[key].update(depicts_set)

# Process LLM data
for _, row in llm_df.iterrows():
    key = row["from"]
    depicts_set = process_depicts(row["depicts"])
    if key not in combined_data:
        combined_data[key] = depicts_set
    else:
        combined_data[key].update(depicts_set)

# Convert the merged data into a DataFrame
final_df = pd.DataFrame(
    {"from": list(combined_data.keys()), "depicts": [", ".join(sorted(depicts_set)) for depicts_set in combined_data.values()]}
)

# Save to CSV
final_df.to_csv(output_csv_path, index=False, encoding='utf-8')

print(f"CSV file saved at: {output_csv_path}")
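
The same union-per-entity merge can also be written with `pd.concat` and `groupby` instead of explicit row loops. This is an alternative sketch under assumptions, not the notebook's method: the two small frames and the helper `to_set` are hypothetical, and only the `from` and `depicts` column names come from the cells above.

```python
import pandas as pd

# Hypothetical stand-ins for blip_outputs_all.csv and llm_outputs_all.csv.
blip = pd.DataFrame({"from": ["e1", "e2"], "depicts": ["horse, sky", "tree"]})
llm = pd.DataFrame({"from": ["e1", "e3"], "depicts": ["horse, rider", None]})

def to_set(value):
    """Turn a comma-separated string into a set of stripped labels."""
    if pd.isna(value):
        return set()
    return {item.strip() for item in value.split(",") if item.strip()}

# Stack both sources, then union the label sets of each entity.
stacked = pd.concat([blip, llm], ignore_index=True)
stacked["depicts"] = stacked["depicts"].apply(to_set)
merged = (
    stacked.groupby("from", sort=False)["depicts"]
    .agg(lambda sets: ", ".join(sorted(set().union(*sets))))
    .reset_index()
)
print(merged)
```

Entities that appear in only one source are kept, and an entity whose only value is missing ends up with an empty `depicts` string.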


In [None]:
# =============================================================================
# CELL 3: Add Ground Truth Labels to Combined Predictions
# =============================================================================
# Purpose:
#   - Load LLM image results containing ground truth "depicts" labels
#   - Extract true_depicts from the combined_dict column
#   - Merge ground truth labels with combined predictions
#   - This enables comparison between predictions and ground truth

# Input: llm_image_results.csv, llm_bilp_combined_all.csv
# Output: llm_bilp_combined_all.csv (with added true_depicts column)
# =============================================================================

import pandas as pd
import ast  # To safely evaluate dictionary strings

# Define file paths
llm_image_results_path = "llm_image_results.csv" # just a file containing the true_depicts
combined_outputs_path = "llm_bilp_combined_all.csv"  # Cell 2 output (BLIP + LLM predictions)
output_csv_path = "llm_bilp_combined_all.csv"


# Load CSV files
llm_image_df = pd.read_csv(llm_image_results_path)
combined_outputs_df = pd.read_csv(combined_outputs_path)

# Function to safely extract keys from combined_dict
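def extract_keys(combined_dict_str):
    try:
        combined_dict = ast.literal_eval(combined_dict_str)  # Parse the string as a Python literal
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        return None  # Not a parseable literal (e.g. NaN or malformed text)
    if not isinstance(combined_dict, dict):
        return None  # Parsed, but not a dictionary
    return ", ".join(sorted(str(key) for key in combined_dict))  # Extract and sort keys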

# Apply function to extract 'true_depicts'
llm_image_df["true_depicts"] = llm_image_df["combined_dict"].apply(extract_keys)

# Merge dataframes on 'entity' (llm_image_results) and 'from' (combined_outputs)
final_df = combined_outputs_df.merge(
    llm_image_df[["entity", "true_depicts"]],
    left_on="from",
    right_on="entity",
    how="left"
)

# Drop redundant 'entity' column after merge
final_df.drop(columns=["entity"], inplace=True)

# Save the final dataset to CSV
final_df.to_csv(output_csv_path, index=False, encoding='utf-8')

print(f"CSV file saved at: {output_csv_path}")
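
How a `combined_dict` string turns into a `true_depicts` value can be checked on a few inputs. The sample strings, their counts, and the helper name `keys_as_labels` are hypothetical; the notebook only shows that the dictionary keys are used as the ground-truth labels.

```python
import ast

def keys_as_labels(raw):
    """Return sorted dict keys as 'a, b', or None when raw is not a dict literal."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return None
    if not isinstance(parsed, dict):
        return None
    return ", ".join(sorted(str(key) for key in parsed))

samples = [
    "{'rider': 1, 'horse': 2}",  # dict literal: keys become labels
    "['horse']",                 # valid literal, but not a dict
    "not a dict",                # not a Python literal at all
    float("nan"),                # missing CSV value as read by pandas
]
results = [keys_as_labels(sample) for sample in samples]
print(results)  # ['horse, rider', None, None, None]
```

Rows that yield `None` here end up with a missing `true_depicts` value after the left merge, which Cell 4 treats as an empty ground-truth list.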


In [None]:
# =============================================================================
# CELL 4: Filter Predictions Using Word2Vec Semantic Similarity
# =============================================================================
# Purpose:
#   - Use Word2Vec (Google News 300) to compute semantic similarity
#   - Filter out predicted "depicts" that are too similar to ground truth
#   - This identifies NEW/NOVEL predictions not already in the knowledge graph
#   - Similarity threshold: 0.4 (predictions scoring at or above this are filtered out)

# Method:
#   - For each predicted depict, compare against all true_depicts
#   - If similarity >= 0.4 with any true_depict, remove it (redundant)
#   - Keep only predictions that are semantically distinct from ground truth

# Input: llm_bilp_combined_all.csv
# Output: llm_bilp_combined_all.csv (with filtered_depicts column)
# =============================================================================

import pandas as pd
import gensim
import gensim.downloader as api
from tqdm import tqdm  # Import tqdm for progress tracking

# Load the pretrained Word2Vec vectors via gensim's downloader
# (fetched on first use and cached locally; this is a large download)
wv = api.load('word2vec-google-news-300')

# Define file paths
input_csv_path = "llm_bilp_combined_all.csv"
output_csv_path = "llm_bilp_combined_all.csv"

# Load the CSV file
df = pd.read_csv(input_csv_path)

# Define similarity threshold (adjust as needed)
similarity_threshold = 0.4

# Function to process string lists
def process_list(column_value):
    if pd.isna(column_value) or column_value.strip() == "":
        return []
    return [x.strip() for x in column_value.split(",")]

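# Function to compute similarity for phrases.
# For each in-vocabulary word of `seq`, take its best match among the
# in-vocabulary words of `obj`, then average those best-match scores.
# Returns 0 when either phrase has no in-vocabulary words, so such
# predictions are never filtered out by the threshold check below.
def compute_similarity(seq, obj):
    seq_words = [word for word in seq.split() if word in wv.key_to_index]
    obj_words = [word for word in obj.split() if word in wv.key_to_index]

    total_similarities = []
    for seq_word in seq_words:
        # obj_words is already restricted to the vocabulary
        word_similarities = [wv.similarity(seq_word, obj_word) for obj_word in obj_words]
        if word_similarities:
            total_similarities.append(max(word_similarities))  # Take max similarity for each word

    return sum(total_similarities) / len(total_similarities) if total_similarities else 0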

# Iterate over each row with tqdm for progress tracking
filtered_depicts_list = []

for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing Entities", unit="entity"):
    depicts = process_list(row["depicts"])
    true_depicts = process_list(row["true_depicts"])
    
    filtered_depicts = []
    
    # Iterate over each depicts phrase with tqdm
    for depict in tqdm(depicts, desc="Filtering Depicts", leave=False, unit="phrase"):
        is_similar = False
        
        # Compare with each true_depicts phrase
        for true_depict in true_depicts:
            similarity = compute_similarity(depict, true_depict)
            if similarity >= similarity_threshold:
                is_similar = True
                break  # No need to check further if already above threshold
        
        # Keep the depict if no high similarity found
        if not is_similar:
            filtered_depicts.append(depict)
    
    # Store filtered values as a string
    filtered_depicts_list.append(", ".join(filtered_depicts))

# Add results to the DataFrame
df["filtered_depicts"] = filtered_depicts_list

# Save the updated CSV
df.to_csv(output_csv_path, index=False, encoding='utf-8')

print(f"Filtered CSV saved at: {output_csv_path}")
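
The phrase score used in Cell 4 (best match per word, then the mean of those best matches) can be worked through by hand with tiny vectors. Everything here is an assumption made for illustration: the 2-D `toy_vectors` replace the 300-dimensional Word2Vec embeddings, and `cosine` and `phrase_similarity` are hypothetical helpers that mirror the logic of `compute_similarity` without needing the model download.

```python
import math

# Toy 2-D embeddings (assumption); real Word2Vec vectors have 300 dimensions.
toy_vectors = {
    "white": (1.0, 0.0),
    "horse": (0.0, 1.0),
    "pony": (0.6, 0.8),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def phrase_similarity(seq, obj, vectors):
    """Mean over seq words of the best cosine match among obj words."""
    seq_words = [w for w in seq.split() if w in vectors]
    obj_words = [w for w in obj.split() if w in vectors]
    if not seq_words or not obj_words:
        return 0.0  # nothing comparable, treated as dissimilar
    best = [max(cosine(vectors[s], vectors[o]) for o in obj_words) for s in seq_words]
    return sum(best) / len(best)

score = phrase_similarity("white horse", "pony", toy_vectors)
reverse = phrase_similarity("pony", "white horse", toy_vectors)
print(round(score, 3), round(reverse, 3))  # 0.7 0.8
```

Two properties of this scoring are worth keeping in mind when reading the filtered results. The score is not symmetric, so the order of prediction and ground truth matters. A phrase made only of out-of-vocabulary words scores 0 and is therefore always kept as novel.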


In [None]:
# =============================================================================
# CELL 5: Count Items in Each Depicts Column
# =============================================================================
# Purpose:
#   - Count the number of items in depicts, true_depicts, and filtered_depicts
#   - Add length columns for analysis and comparison
#   - Helps quantify: total predictions, ground truth size, novel predictions

# Output columns added:
#   - depicts_length: Number of total predicted depicts
#   - true_depicts_length: Number of ground truth depicts
#   - filtered_depicts_length: Number of novel (non-redundant) predictions
# =============================================================================

import pandas as pd

# Define file paths (the file is updated in place)
input_csv_path = "llm_bilp_combined_all.csv"
output_csv_path = "llm_bilp_combined_all.csv"

# Load the CSV file
df = pd.read_csv(input_csv_path)

# Function to count the number of elements in a comma-separated list
def count_items(column_value):
    if pd.isna(column_value) or column_value.strip() == "":
        return 0
    return len(column_value.split(","))

# Compute lengths
df["depicts_length"] = df["depicts"].apply(count_items)
df["true_depicts_length"] = df["true_depicts"].apply(count_items)
df["filtered_depicts_length"] = df["filtered_depicts"].apply(count_items)

# Save the updated CSV
df.to_csv(output_csv_path, index=False, encoding='utf-8')

print(f"CSV file saved at: {output_csv_path}")
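
One caveat with counting by `split(",")` is that empty tokens, such as those left by a trailing comma or a doubled separator, are counted as items. A stricter counter could look like the sketch below; `count_nonempty` is a hypothetical helper, not part of the notebook, and whether such empty tokens occur in the actual CSV is not known from the cells above.

```python
def count_nonempty(value):
    """Count comma-separated items, ignoring empty or whitespace-only tokens."""
    if not isinstance(value, str):
        return 0  # covers NaN and other missing values read from the CSV
    return sum(1 for item in value.split(",") if item.strip())

print(count_nonempty("horse, rider"))     # 2
print(count_nonempty("horse, , rider,"))  # 2
print(count_nonempty(float("nan")))       # 0
```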
