# Soft Prompt vs Hard Prompt (GNN+LLM) Comparison

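This notebook compares two approaches for injecting graph knowledge into LLMs. The names follow the `soft_vs_hard_experiment` module and are the reverse of the usual prompt-tuning usage, where a "soft prompt" is a learned continuous embedding and a "hard prompt" is discrete text.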

## 1. Soft Prompt (Graph as Text)
- Graph → Text serialization
- Serialized text is added to the LLM's context window
- The LLM has to recover structure through attention over that text
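
A minimal sketch of the serialization step, assuming nodes are dicts with `id` and `name` keys and edges are dicts with `source`, `target`, and `type` keys. This schema and the `serialize_triples` helper are hypothetical and only illustrate the idea; they are not the format `SoftPromptFormatter` actually produces.

```python
def serialize_triples(nodes, edges):
    """Render a graph as one '(head) -[REL]-> (tail)' line per edge."""
    names = {n["id"]: n["name"] for n in nodes}
    lines = []
    for e in edges:
        # Fall back to the raw id if an endpoint is missing from the node list.
        head = names.get(e["source"], str(e["source"]))
        tail = names.get(e["target"], str(e["target"]))
        lines.append(f"({head}) -[{e['type']}]-> ({tail})")
    return "\n".join(lines)


# Toy graph, made up for illustration.
nodes = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "Widget"},
    {"id": 3, "name": "Jane Doe"},
]
edges = [
    {"source": 1, "target": 2, "type": "PRODUCES"},
    {"source": 3, "target": 1, "type": "WORKS_AT"},
]

context = serialize_triples(nodes, edges)
prompt = (
    f"Graph facts:\n{context}\n\n"
    "Question: Who works at the company that produces Widget?"
)
print(prompt)
```

The text grows by one line per edge, which is where the context-length cost of this approach comes from.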

## 2. Hard Prompt (GNN Encoding)
- Graph → GNN → Embedding
- Structure explicitly encoded
- Can be injected into LLM hidden states
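
One possible way to inject a graph embedding is to project it to a few "virtual tokens" and prepend them to the question's token embeddings. The sketch below uses plain Python lists, toy dimensions, and an untrained random projection; `mean_pool`, `project_to_virtual_tokens`, and all sizes are assumptions for illustration, not what `GNNEncoder` does internally.

```python
import random


def mean_pool(node_embs):
    """Average node embeddings into a single graph embedding."""
    dim = len(node_embs[0])
    return [sum(v[i] for v in node_embs) / len(node_embs) for i in range(dim)]


def project_to_virtual_tokens(graph_emb, weights):
    """Map a graph embedding to k virtual tokens.

    weights[t] is a (d_model x d_graph) matrix for virtual token t.
    """
    return [
        [sum(w * x for w, x in zip(row, graph_emb)) for row in w_t]
        for w_t in weights
    ]


random.seed(0)
D_GRAPH, D_MODEL, K = 4, 8, 2  # toy sizes, chosen for illustration only

# Untrained random projection; in practice this would be learned.
weights = [
    [[random.uniform(-1, 1) for _ in range(D_GRAPH)] for _ in range(D_MODEL)]
    for _ in range(K)
]


def virtual_tokens_for(num_nodes):
    # Stand-in for GNN output: one random embedding per node.
    node_embs = [
        [random.uniform(-1, 1) for _ in range(D_GRAPH)] for _ in range(num_nodes)
    ]
    return project_to_virtual_tokens(mean_pool(node_embs), weights)


small = virtual_tokens_for(5)
large = virtual_tokens_for(500)

# Placeholder for the embedded question (3 tokens of width D_MODEL).
question_embs = [[0.0] * D_MODEL for _ in range(3)]
inputs_small = small + question_embs
inputs_large = large + question_embs
print(len(inputs_small), len(inputs_large))
```

Both inputs have the same sequence length even though one graph is 100 times larger, because pooling collapses the graph before projection.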

### Key Differences
| Aspect | Soft Prompt | Hard Prompt |
|--------|-------------|-------------|
| Structure | Implicit (text) | Explicit (GNN) |
| Context Length | O(nodes + edges) | O(1) virtual tokens |
| Multi-hop | LLM must infer | GNN propagates |
| Training | Zero-shot | May need fine-tuning |
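
The "Multi-hop" row can be illustrated with a toy message-passing loop: each layer lets information travel one more hop. This is a sketch with a scalar feature per node and plain mean aggregation, which is an assumption made for brevity and may differ from the layers `GNNEncoder` uses.

```python
def propagate(features, adjacency):
    """One round of message passing: each node averages itself and its neighbors."""
    updated = []
    for node, neighbors in enumerate(adjacency):
        group = [node] + neighbors
        updated.append(sum(features[n] for n in group) / len(group))
    return updated


# Path graph 0 - 1 - 2 - 3 as an adjacency list.
adjacency = [[1], [0, 2], [1, 3], [2]]

# Only node 0 carries a signal at the start.
h0 = [1.0, 0.0, 0.0, 0.0]
h1 = propagate(h0, adjacency)
h2 = propagate(h1, adjacency)

print("after 1 layer: ", h1)
print("after 2 layers:", h2)
```

After one layer only node 1 has picked up the signal from node 0; after two layers node 2 has as well, while node 3 (three hops away) is still at zero. With standard message-passing layers, the number of layers therefore bounds how many hops a node can see.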

In [None]:
# Environment setup (Run first in Colab)
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Install dependencies (uncomment in Colab)
# !pip install torch_geometric transformers accelerate sentence-transformers neo4j
# Adjust the CUDA tag (cu121) below to match your PyTorch build (see torch.version.cuda)
# !pip install pyg_lib torch_scatter torch_sparse -f https://data.pyg.org/whl/torch-{torch.__version__.split('+')[0]}+cu121.html

In [None]:
import os
import sys
sys.path.insert(0, '../src')

import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from soft_vs_hard_experiment import (
    ExperimentConfig, 
    SoftVsHardExperiment,
    SoftPromptFormatter,
    GNNEncoder,
    PromptType
)

## Configuration

In [None]:
config = ExperimentConfig(
    # Neo4j (update with your settings)
    neo4j_uri="bolt://localhost:7687",
    neo4j_user="neo4j",
    neo4j_password=os.environ.get("NEO4J_PASSWORD", "password"),  # prefer an env var over a hard-coded secret
    neo4j_database="finderlpg",
    
    # LLM
    llm_model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    # llm_model_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # For testing
    
    # Embeddings
    embedding_model_id="sentence-transformers/all-MiniLM-L6-v2",
    embedding_dim=384,
    
    # GNN
    gnn_hidden_dim=256,
    gnn_num_layers=2,
    gnn_heads=4,
    
    # Retrieval
    top_k_nodes=20,
    max_hops=2,
    
    # Memory optimization
    use_4bit=True,  # Enable for Colab T4/V100
)

print(f"Device: {config.device}")

## Initialize Experiment

In [None]:
exp = SoftVsHardExperiment(config)
exp.setup(["neo4j", "embeddings", "llm", "gnn"])

In [None]:
# Load questions
questions_df = exp.data_loader.load_questions(limit=50)
print(f"Loaded {len(questions_df)} questions")
questions_df.head()

## Soft Prompt Formatting Examples

Let's see how graph data looks in different text formats:

In [None]:
# Get a sample subgraph
sample_id = questions_df.iloc[0]['id']
sample_subgraph = exp.data_loader.get_subgraph(sample_id, max_hops=config.max_hops)

print(f"Subgraph: {len(sample_subgraph['nodes'])} nodes, {len(sample_subgraph['edges'])} edges")

In [None]:
# Format 1: Structured
print("=" * 60)
print("FORMAT: STRUCTURED")
print("=" * 60)
structured = SoftPromptFormatter.format_structured(
    sample_subgraph['nodes'][:10], 
    sample_subgraph['edges'][:15],
    include_props=True
)
print(structured)

In [None]:
# Format 2: Natural Language
print("=" * 60)
print("FORMAT: NATURAL LANGUAGE")
print("=" * 60)
natural = SoftPromptFormatter.format_natural(
    sample_subgraph['nodes'][:10], 
    sample_subgraph['edges'][:15]  # same edge slice as the other formats, so token counts are comparable
)
print(natural)

In [None]:
# Format 3: Triples
print("=" * 60)
print("FORMAT: TRIPLES")
print("=" * 60)
triples = SoftPromptFormatter.format_triples(
    sample_subgraph['nodes'][:10], 
    sample_subgraph['edges'][:15]
)
print(triples)

In [None]:
# Token comparison
print("\n" + "=" * 60)
print("TOKEN COUNT COMPARISON")
print("=" * 60)

for name, text in [("Structured", structured), ("Natural", natural), ("Triples", triples)]:
    tokens = len(exp.tokenizer.encode(text))
    chars = len(text)
    print(f"{name}: {tokens} tokens, {chars} chars")

## Run Single Comparison

In [None]:
# Pick a question
idx = 0
row = questions_df.iloc[idx]
print(f"Question: {row['text']}")
print(f"Answer: {row['answer']}")

In [None]:
# Run comparison
result = exp.run_comparison(
    question_id=row['id'],
    question=row['text'],
    ground_truth=row['answer']
)

In [None]:
# Display results
print("=" * 80)
print("COMPARISON RESULTS")
print("=" * 80)

print(f"\nQuestion: {result['question']}")
print(f"Ground Truth: {result['ground_truth']}")
print(f"Subgraph: {result['subgraph_nodes']} nodes, {result['subgraph_edges']} edges")

print("\n" + "-" * 40)
print("[1] LLM ONLY (No Context)")
print("-" * 40)
print(f"Response: {result['llm_only_response']}")
print(f"Tokens: {result['llm_only_meta']['input_tokens']} in, {result['llm_only_meta']['output_tokens']} out")
print(f"Time: {result['llm_only_meta']['generation_time']:.2f}s")

print("\n" + "-" * 40)
print("[2] SOFT PROMPT (Graph as Text)")
print("-" * 40)
print(f"Response: {result['soft_prompt_response']}")
print(f"Tokens: {result['soft_prompt_meta']['input_tokens']} in, {result['soft_prompt_meta']['output_tokens']} out")
print(f"Context: {result['soft_prompt_meta']['context_length']} chars")
print(f"Time: {result['soft_prompt_meta']['generation_time']:.2f}s")

print("\n" + "-" * 40)
print("[3] HARD PROMPT (GNN Encoding)")
print("-" * 40)
print(f"Response: {result['hard_prompt_response']}")
print(f"Tokens: {result['hard_prompt_meta']['input_tokens']} in, {result['hard_prompt_meta']['output_tokens']} out")
print(f"GNN time: {result['hard_prompt_meta'].get('gnn_time', 0):.4f}s")
print(f"Graph embedding norm: {result['hard_prompt_meta'].get('graph_emb_norm', 0):.4f}")
print(f"Total time: {result['hard_prompt_meta']['generation_time']:.2f}s")

## Run Full Experiment

In [None]:
# Run on multiple questions
results_df = exp.run_experiment(questions_df, sample_size=10)

In [None]:
# Summary statistics
exp.print_summary()

## Visualization

In [None]:
# Extract metrics
metrics = {
    'Method': [],
    'Input Tokens': [],
    'Generation Time': [],
}

for r in exp.results:
    metrics['Method'].append('LLM Only')
    metrics['Input Tokens'].append(r['llm_only_meta']['input_tokens'])
    metrics['Generation Time'].append(r['llm_only_meta']['generation_time'])
    
    metrics['Method'].append('Soft Prompt')
    metrics['Input Tokens'].append(r['soft_prompt_meta']['input_tokens'])
    metrics['Generation Time'].append(r['soft_prompt_meta']['generation_time'])
    
    metrics['Method'].append('Hard Prompt')
    metrics['Input Tokens'].append(r['hard_prompt_meta']['input_tokens'])
    metrics['Generation Time'].append(r['hard_prompt_meta']['generation_time'])

metrics_df = pd.DataFrame(metrics)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Token comparison
sns.boxplot(data=metrics_df, x='Method', y='Input Tokens', ax=axes[0], palette='Set2')
axes[0].set_title('Input Token Count by Method')
axes[0].set_ylabel('Tokens')

# Time comparison
sns.boxplot(data=metrics_df, x='Method', y='Generation Time', ax=axes[1], palette='Set2')
axes[1].set_title('Generation Time by Method')
axes[1].set_ylabel('Seconds')

plt.tight_layout()
plt.savefig('soft_vs_hard_comparison.png', dpi=150)
plt.show()

In [None]:
# Token reduction analysis
soft_tokens = [r['soft_prompt_meta']['input_tokens'] for r in exp.results]
hard_tokens = [r['hard_prompt_meta']['input_tokens'] for r in exp.results]

# Percentage of input tokens saved by the hard prompt relative to the soft prompt
reduction = [(s - h) / s * 100 for s, h in zip(soft_tokens, hard_tokens)]

print("Token Reduction (Soft → Hard):")
print(f"  Mean: {np.mean(reduction):.1f}%")
print(f"  Min:  {np.min(reduction):.1f}%")
print(f"  Max:  {np.max(reduction):.1f}%")

## Analysis: When does each method work better?

### Soft Prompt Strengths:
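- Zero-shot (no training needed)
- Interpretable context (the graph facts are readable in the prompt)
- Works with any LLM, since only the prompt text changes
- Good for small graphs that fit comfortably in the context window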

### Hard Prompt (GNN) Strengths:
- Fixed context length regardless of graph size
- Explicitly encodes structure (multi-hop paths)
- More token-efficient for large graphs
- Captures graph topology through message passing instead of leaving the LLM to infer it from text
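
The trade-off above can be made concrete with a rough cost model. The sketch below serializes ring graphs of increasing size and counts whitespace-separated tokens as a crude stand-in for a real tokenizer; `NUM_VIRTUAL_TOKENS` is a hypothetical fixed budget for the GNN path, not a value taken from this experiment.

```python
def ring_graph_text(num_nodes):
    """Serialize a ring graph as one 'n<i> LINKS_TO n<j>' line per edge."""
    return "\n".join(
        f"n{i} LINKS_TO n{(i + 1) % num_nodes}" for i in range(num_nodes)
    )


def rough_token_count(text):
    # Whitespace split; a real tokenizer would usually give a larger count.
    return len(text.split())


NUM_VIRTUAL_TOKENS = 8  # hypothetical fixed budget for the GNN path

sizes = [10, 100, 1000]
text_cost = [rough_token_count(ring_graph_text(n)) for n in sizes]
gnn_cost = [NUM_VIRTUAL_TOKENS for _ in sizes]

for n, t, g in zip(sizes, text_cost, gnn_cost):
    print(f"{n:>5} nodes: text ~{t} tokens, virtual tokens {g}")
```

The text cost grows linearly with the number of edges while the virtual-token cost stays flat, so the gap widens as subgraphs get larger.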

In [None]:
# Analyze by subgraph size
# Reuses soft_tokens and hard_tokens from the token reduction cell above
subgraph_sizes = [r['subgraph_nodes'] for r in exp.results]

fig, ax = plt.subplots(figsize=(10, 5))

ax.scatter(subgraph_sizes, soft_tokens, label='Soft Prompt', alpha=0.7, s=100)
ax.scatter(subgraph_sizes, hard_tokens, label='Hard Prompt', alpha=0.7, s=100)

ax.set_xlabel('Subgraph Size (nodes)')
ax.set_ylabel('Input Tokens')
ax.set_title('Token Usage vs Subgraph Size')
ax.legend()

plt.tight_layout()
plt.savefig('tokens_vs_subgraph_size.png', dpi=150)
plt.show()

In [None]:
# Save results
exp.save_results('soft_vs_hard_results.json')
results_df.to_csv('soft_vs_hard_results.csv', index=False)
print("Results saved!")

In [None]:
# Cleanup
exp.cleanup()
torch.cuda.empty_cache()

## Next Steps

1. **Quality Evaluation**: Compare answer accuracy using metrics (F1, exact match)
2. **GNN Training**: Train or fine-tune the GNN on your dataset
3. **Full Hard Prompt Integration**: Inject GNN embeddings directly into LLM hidden states
4. **Ablation Studies**: Vary the following:
   - Soft prompt formats
   - GNN architectures (GCN, GAT, GraphSAGE)
   - Number of virtual tokens for hard prompt
   - Subgraph pruning strategies (e.g., Prize-Collecting Steiner Tree, PCST)
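
For the quality evaluation step, a minimal sketch of exact match and token-level F1 in the style commonly used for extractive QA. The normalization rules (lowercasing, stripping punctuation and English articles) are an assumption and may need adjusting for this dataset; `normalize`, `exact_match`, and `token_f1` are illustrative helpers, not part of `soft_vs_hard_experiment`.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(ground_truth))


def token_f1(prediction, ground_truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    gold = normalize(ground_truth).split()
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = token_f1("the tower in Paris", "Eiffel Tower")
print(em, f1)
```

These functions could be applied to `result['soft_prompt_response']` and `result['hard_prompt_response']` against `result['ground_truth']`. Long free-form generations will rarely score an exact match, so F1 is likely the more informative of the two here.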