# Phase 3: Analysis & Visualization

This notebook runs the **Faithful Concept Mapper** pipeline.
1. **Ingest**: Wipro BRSR Report.
2. **Extract**: Principle 6 Data (Emissions, Waste, Water).
3. **Evaluate**: Calculate Drift (0-3) against SEBI Requirements (Ground Truth).
4. **Visualize**: Generate Sankey Diagram of the evidence flow.

In [1]:
import sys
import os
import pandas as pd
import plotly.graph_objects as go
from dotenv import load_dotenv

# Add project root to path
sys.path.append(os.path.abspath('..'))

from src.ingest import IngestionEngine
from src.eval import EvaluationEngine

load_dotenv()

# Initialize Engines
ingestor = IngestionEngine()
evaluator = EvaluationEngine()

### 1. Define Ground Truth (SEBI Principle 6 Requirements)
These are the "Gold Standard" requirements against which we measure faithfulness.

In [2]:
sebi_requirements = [
    {
        "id": "REQ-01",
        "category": "Emissions",
        "requirement": "Companies must report Scope 1 and Scope 2 Greenhouse Gas (GHG) emissions in metric tonnes of CO2 equivalent."
    },
    {
        "id": "REQ-02",
        "category": "Water",
        "requirement": "Companies must disclose total water consumption and water intensity per rupee of turnover."
    },
    {
        "id": "REQ-03",
        "category": "Waste",
        "requirement": "Companies must report total waste generated, classified into hazardous and non-hazardous, and recycling percentages."
    }
]
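
Later cells look up disclosures by `category`, so a quick consistency check on the requirement records can catch missing keys or duplicates early. The sketch below is only an illustration of such a check: `validate_requirements` and `REQUIRED_KEYS` are hypothetical names, not part of the project code.

```python
from collections import Counter

# Keys every requirement record is expected to carry (assumption based on the list above)
REQUIRED_KEYS = {"id", "category", "requirement"}


def validate_requirements(requirements):
    """Return a list of problems; an empty list means the records look consistent."""
    problems = []
    for index, req in enumerate(requirements):
        missing = REQUIRED_KEYS - req.keys()
        if missing:
            problems.append(f"record {index}: missing keys {sorted(missing)}")
    # ids and categories are both used as lookup keys, so neither should repeat
    for key in ("id", "category"):
        counts = Counter(req[key] for req in requirements if key in req)
        for value, count in counts.items():
            if count > 1:
                problems.append(f"duplicate {key}: {value!r} appears {count} times")
    return problems


# Two trimmed records shaped like the ones above
sample = [
    {"id": "REQ-01", "category": "Emissions", "requirement": "Report Scope 1 and Scope 2 emissions."},
    {"id": "REQ-02", "category": "Water", "requirement": "Disclose water consumption and intensity."},
]
print(validate_requirements(sample))  # []
```

Running the same function on `sebi_requirements` before the evaluation loop would flag a mistyped key before it surfaces as a `KeyError` further down.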

### 2. Run Extraction (Ingestion Phase)
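
The cell below narrows the report to chunks that mention Principle 6, emissions, or water before passing text to the extraction agent. A minimal sketch of that filter as a reusable function follows. `filter_chunks` and its parameters are hypothetical names; the chunk shape (a dict with a `text` key) is taken from the cell below. Unlike the cell, this sketch matches every keyword case-insensitively.

```python
def filter_chunks(chunks, keywords, max_chars=None):
    """Keep chunks whose text mentions any keyword (case-insensitive) and join them."""
    lowered = [k.lower() for k in keywords]
    kept = [c["text"] for c in chunks if any(k in c["text"].lower() for k in lowered)]
    context = "\n".join(kept)
    # Optionally cap the context size, as the cell below does with a slice
    return context if max_chars is None else context[:max_chars]


# Made-up chunks for illustration only
chunks = [
    {"text": "Principle 6 disclosures start here."},
    {"text": "Board composition and remuneration."},
    {"text": "Total Scope 1 emissions for the year."},
]
context = filter_chunks(chunks, ["principle 6", "emissions", "water"])
print(context.count("\n") + 1)  # 2 chunks kept
```

Passing the keyword list as an argument makes it easy to add terms such as "waste" without editing the comprehension itself.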

In [3]:
pdf_path = "../data/target_report.pdf"

if os.path.exists(pdf_path):
    # Load
    chunks = ingestor.load_and_chunk(pdf_path)
    
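    # Filter context (naive optimization): keep chunks that mention Principle 6, emissions, or water.
    # Note: "waste" is not a keyword, so waste chunks survive only if they also match one of these terms.
    relevant_text = "\n".join(
        c['text']
        for c in chunks
        if "Principle 6" in c['text']
        or "emissions" in c['text'].lower()
        or "water" in c['text'].lower()
    )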
    print(f"Context Length: {len(relevant_text)} chars")
    
    # Extract
    extracted_data = ingestor.extract_principle_6(relevant_text[:40000])
    print("Extraction Complete.")
    print(extracted_data.model_dump_json(indent=2))
else:
    print("PDF Missing!")

📄 Loading PDF: ../data/target_report.pdf
✅ Created 69 chunks from 34 pages.
Context Length: 32738 chars
🤖 Running Extraction Agent...
Extraction Complete.
{
  "emissions": {
    "scope_1": 9640.0,
    "scope_2": 59120.0,
    "scope_3": 271792.85,
    "unit": "tCO2e"
  },
  "waste": {
    "total_waste_generated": 4478.0,
    "recycled_percentage": 66.02,
    "hazardous_waste": 23.0
  },
  "water": {
    "total_water_consumed": 878303.07,
    "water_intensity": 0.97
  },
  "other_initiatives": [
    "Net Zero GHG emissions by 2040",
    "100% RE for all owned facilities by 2030",
    "Improve freshwater use efficiency by 65% by 2030",
    "Increase share of recycled water to 50% by 2030"
  ]
}
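
A few derived figures follow directly from the extracted values and make the output easier to sanity-check. The sketch below assumes that `hazardous_waste` is reported in the same unit as `total_waste_generated` and that `recycled_percentage` applies to the total; the output above confirms neither assumption.

```python
# Values copied from the extraction output above
extracted = {
    "emissions": {"scope_1": 9640.0, "scope_2": 59120.0},
    "waste": {
        "total_waste_generated": 4478.0,
        "recycled_percentage": 66.02,
        "hazardous_waste": 23.0,
    },
}

emissions = extracted["emissions"]
waste = extracted["waste"]

# Combined Scope 1 and Scope 2 emissions
scope_1_and_2 = emissions["scope_1"] + emissions["scope_2"]
# Non-hazardous waste as the remainder of the total (assumes matching units)
non_hazardous = waste["total_waste_generated"] - waste["hazardous_waste"]
# Absolute recycled quantity implied by the percentage (assumes it refers to the total)
recycled = waste["total_waste_generated"] * waste["recycled_percentage"] / 100

print(f"Scope 1 + 2: {scope_1_and_2:.1f} tCO2e")    # Scope 1 + 2: 68760.0 tCO2e
print(f"Non-hazardous waste: {non_hazardous:.1f}")  # Non-hazardous waste: 4455.0
print(f"Recycled waste: {recycled:.1f}")            # Recycled waste: 2956.4
```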


### 3. Run Evaluation (Faithfulness Check)

In [4]:
audit_results = []

# Convert the extracted schema into one text claim per category for comparison.
# A real implementation would map fields dynamically; for this demo the mapping is manual.
claims_map = {
    "Emissions": f"Scope 1: {extracted_data.emissions.scope_1}, Scope 2: {extracted_data.emissions.scope_2}.",
    "Water": f"Total Consumption: {extracted_data.water.total_water_consumed}, Intensity: {extracted_data.water.water_intensity}.",
    "Waste": f"Total Waste: {extracted_data.waste.total_waste_generated}, Hazardous: {extracted_data.waste.hazardous_waste}."
}

for req in sebi_requirements:
    cat = req['category']
    claim = claims_map.get(cat, "No data reported")
    
    # Calculate drift.
    # In this simplified flow the extraction output IS the claim (what the company disclosed).
    # A groundedness check usually asks whether evidence supports a claim. Here the question
    # is different: does the claim satisfy the requirement?
    # The intended direction is "does the claim entail the requirement?" (did the company
    # do what was asked?), not "does the requirement entail the claim?".
    # calculate_drift is called with (requirement, claim).
    
    eval_res = evaluator.calculate_drift(req['requirement'], claim)
    
    audit_results.append({
        "Requirement": req['category'],
        "SEBI_Mandate": req['requirement'],
        "Company_Disclosure": claim,
        "Drift_Score": eval_res['drift_score'],
        "Label": eval_res['label']
    })

df_audit = pd.DataFrame(audit_results)
df_audit

| | Requirement | SEBI_Mandate | Company_Disclosure | Drift_Score | Label |
|---|---|---|---|---|---|
| 0 | Emissions | Companies must report Scope 1 and Scope 2 Gree... | Scope 1: 9640.0, Scope 2: 59120.0. | 2 | neutral |
| 1 | Water | Companies must disclose total water consumptio... | Total Consumption: 878303.07, Intensity: 0.97. | 2 | neutral |
| 2 | Waste | Companies must report total waste generated, c... | Total Waste: 4478.0, Hazardous: 23.0. | 2 | neutral |
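
The manual `claims_map` above covers only a handful of fields. A minimal sketch of the more dynamic mapping mentioned in the code comments is shown here, assuming the extracted data is available as a nested dict (for a Pydantic v2 model, `model_dump()` produces one). `build_claims` is a hypothetical helper, and the way field names are turned into labels is an arbitrary choice.

```python
def build_claims(extracted, missing_text="No data reported"):
    """Turn each dict-valued section of a nested dict into one text claim."""
    claims = {}
    for section, fields in extracted.items():
        if not isinstance(fields, dict):
            continue  # skip free-text lists such as other_initiatives
        parts = [
            f"{name.replace('_', ' ').title()}: {value}"
            for name, value in fields.items()
            if value is not None  # fields the extractor could not fill are left out
        ]
        # Section names are title-cased so they line up with the requirement categories
        claims[section.title()] = ", ".join(parts) + "." if parts else missing_text
    return claims


# Illustrative input: emissions copied from the output above, water deliberately empty
extracted = {
    "emissions": {"scope_1": 9640.0, "scope_2": 59120.0, "unit": "tCO2e"},
    "water": {"total_water_consumed": None, "water_intensity": None},
    "other_initiatives": ["Net Zero GHG emissions by 2040"],
}
claims = build_claims(extracted)
print(claims["Emissions"])  # Scope 1: 9640.0, Scope 2: 59120.0, Unit: tCO2e.
print(claims["Water"])      # No data reported
```

Because every populated field ends up in the claim, details such as the emissions unit or the recycling percentage reach the evaluator without a hand-written template per category.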


### 4. Sankey Diagram (Visualization)
Visualizing the flow: SEBI Requirement -> Company Disclosure -> Drift Status

In [5]:
# Nodes: Requirements (Left), Disclosures (Middle), Scores (Right)
labels = []
sources = []
targets = []
values = []

# Levels
req_indices = {}
current_idx = 0

# 1. Requirement Nodes
for i, row in df_audit.iterrows():
    lbl = f"Req: {row['Requirement']}"
    if lbl not in labels:
        labels.append(lbl)
        req_indices[lbl] = current_idx
        current_idx += 1

# 2. Disclosure Nodes
disc_indices = {}
for i, row in df_audit.iterrows():
    lbl = f"Disc: {str(row['Company_Disclosure'])[:20]}..."
    if lbl not in labels:
        labels.append(lbl)
        disc_indices[lbl] = current_idx
        current_idx += 1

# 3. Score Nodes
score_indices = {}
for s in [0, 1, 2, 3]:
    lbl = f"Drift Score: {s}"
    if lbl not in labels:
        labels.append(lbl)
        score_indices[lbl] = current_idx
        current_idx += 1

# Links
for i, row in df_audit.iterrows():
    # Node labels for this row (must match the labels built above)
    r_lbl = f"Req: {row['Requirement']}"
    d_lbl = f"Disc: {str(row['Company_Disclosure'])[:20]}..."
    s_lbl = f"Drift Score: {int(row['Drift_Score'])}"  # int() guards against a float score such as 2.0
    
    # Req -> Disc
    sources.append(req_indices[r_lbl])
    targets.append(disc_indices[d_lbl])
    values.append(1)
    
    # Disc -> Score
    sources.append(disc_indices[d_lbl])
    targets.append(score_indices[s_lbl])
    values.append(1)

fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        line=dict(color="black", width=0.5),
        label=labels
    ),
    link=dict(
        source=sources,
        target=targets,
        value=values
    )
)])

fig.update_layout(title_text="Faithful Concept Mapper: Audit Flow", font_size=10)
fig.show()
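
The node and link bookkeeping above can be factored into a small function that is testable without Plotly. The sketch below is one possible refactoring, not the project's implementation. `build_sankey_links` is a hypothetical name, and it takes plain dicts with the same keys as the `df_audit` columns (for example from `df_audit.to_dict('records')`). It differs from the cell above in two ways: nodes are numbered in order of first appearance, and only drift scores that actually occur get a node.

```python
def build_sankey_links(rows, max_label_len=20):
    """Build Sankey node labels and link lists from audit rows."""
    labels = []
    index = {}

    def node(label):
        # Register a label once and return its position in `labels`
        if label not in index:
            index[label] = len(labels)
            labels.append(label)
        return index[label]

    sources, targets, values = [], [], []
    for row in rows:
        req = node(f"Req: {row['Requirement']}")
        disc = node(f"Disc: {str(row['Company_Disclosure'])[:max_label_len]}...")
        score = node(f"Drift Score: {int(row['Drift_Score'])}")
        # Two links per row: requirement -> disclosure, disclosure -> score
        for src, dst in ((req, disc), (disc, score)):
            sources.append(src)
            targets.append(dst)
            values.append(1)
    return labels, sources, targets, values


# Two rows copied from the audit table above
rows = [
    {"Requirement": "Emissions", "Company_Disclosure": "Scope 1: 9640.0, Scope 2: 59120.0.", "Drift_Score": 2},
    {"Requirement": "Water", "Company_Disclosure": "Total Consumption: 878303.07, Intensity: 0.97.", "Drift_Score": 2},
]
labels, sources, targets, values = build_sankey_links(rows)
print(sources, targets)  # [0, 1, 3, 4] [1, 2, 4, 2]
```

The returned lists can be passed directly as `label`, `source`, `target`, and `value` to `go.Sankey`.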