# Grid Compliance LLM Pipeline - Data Exploration

This notebook explores the data pipeline for the Grid Compliance QA Assistant. It covers:

1. **Document Chunks Overview**: Chunks extracted from the PDFs and stored in the pipeline database
2. **Training Dataset Analysis**: QA pairs generated by tier
3. **Data Quality Checks**: Sample review and statistics
4. **Visualization**: Distribution charts

---

In [2]:
%pip install pandas matplotlib seaborn



In [3]:
# Setup and Imports
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Project paths
PROJECT_ROOT = Path.cwd().parent
DB_PATH = PROJECT_ROOT / "data" / "pipeline.db"

print(f"Database path: {DB_PATH}")
print(f"Database exists: {DB_PATH.exists()}")

Database path: /workspaces/automated-grid-compliance-llm-pipeline/data/pipeline.db
Database exists: True
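
Once the database file is confirmed to exist, a quick schema check can catch a missing or misnamed table before any query runs. The sketch below is illustrative only: `list_tables` is a hypothetical helper, and the in-memory database with its four-column `document_chunks` table is an assumed stand-in for `pipeline.db`, not the pipeline's real schema.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in for pipeline.db: an in-memory database with an assumed schema
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE document_chunks ("
    "id INTEGER PRIMARY KEY, source_file TEXT, page_number INTEGER, content TEXT)"
)

tables = list_tables(demo_conn)
print(tables)
demo_conn.close()
```

Against the real database, the same helper could be called on a connection opened with `sqlite3.connect(DB_PATH)`.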


## 1. Document Chunks Overview

First, let's explore the document chunks extracted from the PDFs.

In [18]:
# Load document chunks from database
conn = sqlite3.connect(DB_PATH)

# Get all chunks
chunks_df = pd.read_sql_query("""
    SELECT id, source_file, page_number, content, LENGTH(content) as content_length
    FROM document_chunks
""", conn)

print(f"Total document chunks: {len(chunks_df)}")
print("\nChunks by source:")
print(chunks_df.groupby('source_file').size())
chunks_df.head()

Total document chunks: 74

Chunks by source:
source_file
G99_Issue_2.pdf            59
SPEN_EV_Fleet_Guide.pdf     8
UKPN_EDS_08_5050.pdf        7
dtype: int64


| | id | source_file | page_number | content | content_length |
|---|---|---|---|---|---|
| 0 | 1 | G99_Issue_2.pdf | 26 | ENA and Department for Business, Energy and In... | 2473 |
| 1 | 2 | G99_Issue_2.pdf | 27 | system of the associated Steam Unit(s) or Stea... | 2273 |
| 2 | 3 | G99_Issue_2.pdf | 28 | Droop\n\nThe ratio of the per unit steady stat... | 2370 |
| 3 | 4 | G99_Issue_2.pdf | 29 | Final Operational Notification (FON)\n\nA noti... | 2849 |
| 4 | 5 | G99_Issue_2.pdf | 30 | Generator's Installation\n\nThe electrical i... | 2641 |
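
Chunk counts alone do not show how evenly the text was split, so a per-source summary of `content_length` is a natural follow-up. The sketch below uses a small hand-copied sample of lengths that appear in this notebook's outputs rather than the full `chunks_df`; the choice of summary statistics is an assumption, not part of the pipeline.

```python
import pandas as pd

# A handful of (source_file, content_length) pairs copied from this notebook's outputs
sample = pd.DataFrame(
    {
        "source_file": ["G99_Issue_2.pdf"] * 5 + ["SPEN_EV_Fleet_Guide.pdf"] * 3,
        "content_length": [2473, 2273, 2370, 2849, 2641, 1946, 2058, 1557],
    }
)

# One row per source file, with basic statistics on chunk length
length_summary = (
    sample.groupby("source_file")["content_length"]
    .agg(["count", "min", "median", "max"])
)
print(length_summary)
```

Running the same `groupby` on `chunks_df` would cover all 74 chunks instead of this sample.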


In [23]:
# Explore chunks for a specific source file with simple pagination
source_name = "SPEN_EV_Fleet_Guide.pdf"  # change to any source_file in chunks_df
page_size = 60                           # number of chunks per page
start = 0                                # set to 0, page_size, 2 * page_size, ... to view later pages

def display_chunks(source_file: str, start_idx: int = 0, limit: int = page_size):
    # Keep only the chosen source, ordered by page and then by chunk id
    filtered = (
        chunks_df[chunks_df["source_file"] == source_file]
        .sort_values(["page_number", "id"])
        .reset_index(drop=True)
    )
    # Clamp the end index so the printed range never runs past the last chunk
    end_idx = min(start_idx + limit, len(filtered))
    print(f"Source: {source_file}")
    print(f"Total chunks: {len(filtered)} | Showing {start_idx} to {end_idx - 1}")

    # Return one page of rows for display
    return filtered.iloc[start_idx:end_idx][["id", "page_number", "content_length", "content"]]

selected_chunks = display_chunks(source_name, start)
selected_chunks

Source: SPEN_EV_Fleet_Guide.pdf
Total chunks: 8 | Showing 0 to 7


| | id | page_number | content_length | content |
|---|---|---|---|---|
| 0 | 67 | 5 | 1946 | Understanding Your Demand Profile\n\nBefore de... |
| 1 | 68 | 6 | 2058 | Calculating your Fleet Charging Requirements\n... |
| 2 | 69 | 10 | 1557 | OPTIONS TO CONSIDER\n\nLoad Management\n\nLoad... |
| 3 | 70 | 11 | 1997 | OPTIONS TO CONSIDER\n\nOn-site Generation and ... |
| 4 | 71 | 12 | 1085 | CASE STUDY\n\nExample 1 - Small Connection\n\n... |
| 5 | 72 | 13 | 905 | Example 2 - Large non-firm/flexible connection... |
| 6 | 73 | 14 | 1496 | CASE STUDY\n\nExample 3 - Large connection\n\n... |
| 7 | 74 | 15 | 6805 | GLOSSARY OF TERMS\n\n\| Term ... |
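
Chunk 74 (the glossary page) is several times longer than the other chunks from this file, which is the kind of outlier a data quality check might surface. A minimal sketch of one way to flag such chunks follows, using the ids and lengths from the listing above. The cut-off of three times the median is an assumed rule of thumb, not a setting from the pipeline.

```python
import pandas as pd

# ids and content lengths as shown in the listing above
spen = pd.DataFrame(
    {
        "id": [67, 68, 69, 70, 71, 72, 73, 74],
        "content_length": [1946, 2058, 1557, 1997, 1085, 905, 1496, 6805],
    }
)

RATIO = 3  # assumed cut-off for this sketch, not a pipeline setting

# Flag any chunk whose length exceeds RATIO times the median length
median_length = spen["content_length"].median()
oversized = spen[spen["content_length"] > RATIO * median_length]
print(oversized)
```

Whether a flagged chunk should be split further depends on its content; a glossary table, for example, may reasonably stay in one piece.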


In [24]:
# Print the full text of each selected chunk
for _, row in selected_chunks.iterrows():
    print(f"\n--- Chunk ID: {row['id']} | Page: {row['page_number']} | Length: {row['content_length']} ---")
    print(row['content'])
    print("\n")


--- Chunk ID: 67 | Page: 5 | Length: 1946 ---
Understanding Your Demand Profile

Before deciding on whether you need to upgrade your existing electricity connection to accommodate the additional load requirements from electric vehicle charge points, you will need to establish how much electricity you are currently consuming on your site (i.e. your Maximum Demand) and at what times.

You should then check this against your Authorised Capacity for the site, as set out in your connection agreement (i.e. the capacity that you are authorised to use as part of your agreement with your DNO).

This will determine if you have available capacity to accommodate all, or part, of the additional load from your proposed EV charge points. While the provision of a single EV charger to support one or two vehicles may not be an issue, connecting multiple commercial vehicles will normally require an assessment of the electricity network. You should therefore contact your DNO to discuss whether an increas