<div class="alert alert-block alert-success">
    <h1>
        Example notebook - AI governance control mappings
    </h1>
    <p>
        Link to dataset : Michael Brock Li' dataset
    </p>
</div>

## Executive Summary

This notebook demonstrates a **knowledge graph-based approach** to managing regulatory control mappings across multiple standards (ISO42001, ISO27001, ISO27701, EU AI Act, NIST RMF, SOC2).

**Key capabilities:**
- **Semantic search**: Find relevant controls using natural language queries
- **Cross-standard mapping**: Identify overlapping requirements between different frameworks
- **Graph visualization**: Explore relationships between domains, topics, controls, and standards
- **Hybrid search**: Combine semantic understanding with keyword matching for optimal results

---

## Data Overview

**Dataset**: Michael Brock Li's AI governance control mappings
- **44 control statements** covering AI governance requirements
- **6 regulatory standards** with reference mappings
- **Hierarchical structure**: Domain → Topic → Control → Standards

### Standard Coverage
- ISO 42001 (AI Management System)
- ISO 27001 (Information Security)
- ISO 27701 (Privacy Information Management)
- EU AI Act
- NIST RMF (Risk Management Framework)
- SOC 2 (Service Organization Control)


# Import modules and functions

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [51]:
import re
import pandas as pd
import networkx as nx
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
from IPython.display import display, Markdown

In [4]:
from turingdb_examples.utils import get_return_statements
from turingdb_examples.graph import build_create_command_from_networkx

In [47]:
from turingdb_graphrag.embeddings import (
    build_node_only_embeddings,
    build_context_enriched_embeddings,
    build_smart_enriched_embeddings,
    build_sparse_embeddings,
    build_node2vec_embeddings
)
from turingdb_graphrag.search import (
    dense_search,
    sparse_search,
    print_results,
    hybrid_search,
    print_hybrid_results,
    compare_search_methods
)
from turingdb_graphrag.subgraph import (
    get_subgraph_around_query
)
from turingdb_graphrag.visualization import (
    visualize_graph_with_pyvis,
    visualize_subgraph_interactive
)
from turingdb_graphrag.workflow import (
    search_and_expand_hybrid_filtered,
    generate_report_hybrid_workflow_results
)
from turingdb_graphrag.llm import (
    graph_to_llm_context,
    create_llm_prompt_with_graph,
    query_llm
)

In [6]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

True

# Set path to data

In [7]:
example_name = "AI_governance_control_mapping"
path_data = f"{os.getcwd()}/data/{example_name}"
if not os.path.exists(path_data):
    raise ValueError(f"{path_data} does not exists")

# Create graph using `turingdb` python package

<div class="alert alert-block alert-info">
    <h2>
        See <a href="https://docs.turingdb.ai/quickstart">TuringDB Get started documentation</a> for the important steps to follow :
    </h2>
    <h3>
        <ul>
            <li>Create your TuringDB account</li>
            <li>Create your instance in the <a href="https://console.turingdb.ai/auth">TuringDB Cloud UI</a></li>
            <li>Copy your Instance ID from the Database Instances management page</li>
            <li>Get API Key from the Settings in UI</li>
        </ul>
        Remember to have your instance active while working in this notebook !
    </h3>
</div>

## Connect to instance and transfer data

In [8]:
from turingdb import TuringDB

# Create TuringDB client
client = TuringDB(
    host="http://localhost:6666"  # Remove this parameter and set the two parameters below
    # instance_id=os.getenv("INSTANCE_ID"),
    # auth_token=os.getenv("AUTH_TOKEN"),
)

In [9]:
%%time

# TODO: Jira ticket
# - when access_key and secret_key and not correct, no error shown
# - why 6s the first time and 2s after ?
client.s3_connect(
    bucket_name="turing-internal",
    region="eu-west-2",
    access_key=os.getenv("AWS_ACCESS_KEY"),
    secret_key=os.getenv("AWS_SECRET_KEY")
)

CPU times: user 93.5 ms, sys: 43 ms, total: 136 ms
Wall time: 2.19 s


In [10]:
! tree -L 2 /home/dev/.turing/

[01;34m/home/dev/.turing/[0m
├── [01;34mdata[0m
│   ├── ai_gov_control_mappings_full.csv
│   └── reactome.dump
└── [01;34mgraphs[0m
    ├── [01;34mAI_governance_control_mapping1[0m
    ├── [01;34mAI_governance_control_mapping2[0m
    ├── [01;34mAI_governance_control_mapping3[0m
    ├── [01;34mAI_governance_control_mapping4[0m
    ├── [01;34mAI_governance_control_mapping5[0m
    ├── [01;34mAI_governance_control_mapping6[0m
    ├── [01;34mdefault[0m
    └── [01;34mreactome[0m

10 directories, 2 files


In [11]:
%%time

client.transfer(
    src="data/AI_governance_control_mapping/ai_gov_control_mappings_full.csv",
    dst="turingdb://ai_gov_control_mappings_full.csv"  # to s3 bucket or TuringDB instance or local .turing
)

CPU times: user 75.1 ms, sys: 12.9 ms, total: 88 ms
Wall time: 374 ms


In [12]:
! tree -L 2 /home/dev/.turing/

[01;34m/home/dev/.turing/[0m
├── [01;34mdata[0m
│   ├── ai_gov_control_mappings_full.csv
│   └── reactome.dump
└── [01;34mgraphs[0m
    ├── [01;34mAI_governance_control_mapping1[0m
    ├── [01;34mAI_governance_control_mapping2[0m
    ├── [01;34mAI_governance_control_mapping3[0m
    ├── [01;34mAI_governance_control_mapping4[0m
    ├── [01;34mAI_governance_control_mapping5[0m
    ├── [01;34mAI_governance_control_mapping6[0m
    ├── [01;34mdefault[0m
    └── [01;34mreactome[0m

10 directories, 2 files


## Check data files are available

In [13]:
list_csv_files = sorted(os.listdir(path_data))
if not "ai_gov_control_mappings_full.csv" in list_csv_files:
    raise ValueError(f"csv file is not available in {path_data}")

## Import and format data

In [14]:
# Load CSV
path_turing_folder = f"{os.getenv('HOME')}/.turing"
df = pd.read_csv(f"{path_turing_folder}/data/ai_gov_control_mappings_full.csv")
df = df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\n',  ' ', regex=True)
print(f"✓ Loaded {len(df)} rows from CSV")
df

✓ Loaded 44 rows from CSV


Unnamed: 0,Domain,Master,Topic,Control Statement,ISO42001,ISO27001,ISO27701,EU AI ACT,NIST RMF,SOC2
0,Governance & Leadership,GL-1,Executive Commitment and Accountability,The organisation's executive leadership shall ...,4.1 5.1 5.2 9.3 A.2.2 A.2.3 A.2.4,5.1 5.2 9.3 A.5.1 A.5.2,6.1.1 6.1.2,4.1,Govern 1.1 Govern 2.3 Govern 3.1,CC.1.1 CC.1.2 CC.1.3 CC.1.4 CC.1.5 CC.5.3
1,Governance & Leadership,GL-2,"Roles, Responsibilities & Resources","The organisation shall define, document, and m...",5.3 7.1-7.3 A.3.2 A.4.2,5.3 7.1-7.3 A.6.1 A.6.2 A.7.2,6.2.1 6.2.2 7.2.2 9.2.3,22.1 22.2 26.3,,CC.1.3 CC.1.4
2,Governance & Leadership,GL-3,Strategic Alignment & Objectives,The organisation shall document clear objectiv...,4.1-4.4 5.2 6.2-6.3 A.2.2-A.2.4 A.6.1.2 A.9.3 ...,4.1-4.4 6.2-6.3,A.7.2.1 A.7.2.2 B.8.2.2,,Map 1.3 Map 1.4 Govern 1.1 Govern 1.2 Govern 4...,
3,Risk Management,RM-1,Risk Management Framework and Governance,"The organisation shall establish, document, an...",6.1,6.1,12.2.1 A.7.2.5 A.7.2.8 B.8.2.6,9.1 9.2,Govern 1.3 Govern 1.4 Govern 1.5 Map 1.5,CC3.1
4,Risk Management,RM-2,Risk Identification and Impact Assessment,The organisation shall conduct and document co...,6.1.1-6.1.2 6.1.4 8.4 A.5.2 A.5.3 A.5.4 A.5.5,6.1.2,A.7.2.5 A.7.3.10 A.7.4.4,9.9 27.1,Map 1.1 Map 3.1 Map 3.2 Measure 2.6 Measure 2...,CC3.2
5,Risk Management,RM-3,Risk Treatment and Control Implementation,The organisation shall implement appropriate t...,6.1.3,6.1.3,A.7.4.1 A.7.4.2 A.7.4.4 A.7.4.5,8.1 8.2 17.1 9.3 9.4 9.5,Manage 1.2 Manage 1.3 Manage 1.4,CC5.1 CC9.1
6,Risk Management,RM-4,Risk Monitoring and Response,The organisation shall implement continuous mo...,6.1.3 8.1-8.3,6.1.3 8.1-8.3,A.7.4.3 A.7.4.9 B.8.2.4 B.8.2.5 B.8.4.3,9.6,Measure 3.1 Measure 3.2 Manage 2.1 Manage 2.2 ...,CC3.4 CC9.2
7,Regulatory Operations,RO-1,Regulatory Compliance Framework,"The organisation shall establish, document, an...",,A.18.1,18.2.1 A.7.2.1-A.7.2.4 B.8.2.1-B.8.2.2 B.8.2.4...,5.1 5.2 6.1-6.4 8.1-8.2 40.1 41.1 42.1 43.1-43...,Govern 1.1 Map 4.1,CC1.5
8,Regulatory Operations,RO-2,"Transparency, Disclosure and Reporting",The organisation shall implement mechanisms to...,A.8.3 A.8.5,A.6.3,6.2.3 A.7.3.2-A.7.3.3 A.7.3.8-A.7.3.9 A.7.5.3-...,50.1-50.5 86.1-86.3 20.1 20.2 60.7 60.8,Govern 6.1 Map 4.1,CC2.3 P1.1 P1.2 P1.3
9,Regulatory Operations,RO-3,Record-Keeping,The organisation shall maintain comprehensive ...,,A.7.2,8.2.3 A.7.2.8 A.7.3.1 A.7.4.3 A.7.4.6-A.7.4.8 ...,11.1 11.3 18.1 19.1 19.2 71.2 71.3,Map 4.1 Measure 2.12,P3.1 P3.2 P3.3


## Create graph + Graph Design - Regulatory Control Mappings

### Graph Structure

```
[Domain] --contains--> [Topic] --has_control--> [Control] --maps_to--> [Standard Reference]
```

**Example:**
```
Data Governance → Privacy → "Ensure data encryption" → ISO27001: A.8.24
                                                      → NIST RMF: SC-28
```

### Node Types

| Type | Count | Description | Searchable |
|------|-------|-------------|------------|
| **Control** | 44 | Actual control statements | ✅ Primary |
| **Topic** | ~15-20 | Sub-categories | ✅ Optional |
| **Domain** | ~5-10 | High-level groupings | ✅ Optional |
| **Standard** | ~50-100 | Standard references | ❌ Reference only |

### Design Benefits

1. **Flexible querying**: Search by meaning, not just keywords
2. **Many-to-many mappings**: One control can satisfy multiple standards
3. **Hierarchical navigation**: Browse from high-level domains down to specific requirements
4. **Cross-standard analysis**: Find overlaps and gaps between frameworks
5. **Supports Key Queries**:
   - "Find controls about data privacy" (semantic search)
   - "Which standards cover this topic?" (graph traversal)
   - "What's the overlap between ISO and NIST?" (cross-standard analysis)

In [15]:
# Create graph
G = nx.DiGraph()

# Build graph structure
for idx, row in df.iterrows():
    # Main control node (what we'll search)
    control_id = f"control_{idx}"
    G.add_node(control_id, 
               type='control',
               statement=row['Control Statement'],
               domain=row['Domain'],
               master=row['Master'],
               topic=row['Topic'])
    
    # Domain node
    domain_id = f"domain_{row['Domain']}"
    if domain_id not in G:
        G.add_node(domain_id, type='domain', name=row['Domain'])
    
    # Topic node
    topic_id = f"topic_{row['Topic']}"
    if topic_id not in G:
        G.add_node(topic_id, type='topic', name=row['Topic'])
    
    # Connections
    G.add_edge(domain_id, topic_id, rel='contains')
    G.add_edge(topic_id, control_id, rel='has_control')
    
    # Standard mappings
    standards = {
        'ISO42001': row['ISO42001'],
        'ISO27001': row['ISO27001'],
        'ISO27701': row['ISO27701'],
        'EU_AI_ACT': row['EU AI ACT'],
        'NIST_RMF': row['NIST RMF'],
        'SOC2': row['SOC2']
    }
    
    for std_name, std_ref in standards.items():
        if pd.notna(std_ref) and str(std_ref).strip():
            std_id = f"std_{std_name}_{std_ref}"
            if std_id not in G:
                G.add_node(std_id, type='standard', standard=std_name, reference=std_ref)
            G.add_edge(control_id, std_id, rel='maps_to')

print(f"✓ Graph built: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

# Show breakdown
node_types = {}
for node, data in G.nodes(data=True):
    node_type = data.get('type', 'unknown')
    node_types[node_type] = node_types.get(node_type, 0) + 1

print("\nNode breakdown:")
for ntype, count in node_types.items():
    print(f"  {ntype}: {count}")

✓ Graph built: 305 nodes, 304 edges

Node breakdown:
  control: 44
  domain: 12
  topic: 44
  standard: 205


## Create `turingdb` graph

In [16]:
# Get list of available graphs
list_graphs = client.query("LIST GRAPH").loc[:, 0].tolist()

In [17]:
# Set graph name
graph_name_prefix = example_name
graph_name_nb_suffix = str(
    max(
        [
            int(re.sub(graph_name_prefix, "", g))
            for g in list_graphs
            if g.startswith(graph_name_prefix)
            and re.sub(graph_name_prefix, "", g).isdigit()
        ]
        + [0]
    )
    + 1
)
graph_name = graph_name_prefix + graph_name_nb_suffix
graph_name = re.sub("-", "_", graph_name)
graph_name

'AI_governance_control_mapping7'

In [18]:
%%time

# Set graph
client.query(f"CREATE GRAPH {graph_name}")
client.set_graph(graph_name)

# Create a new change on the graph
change = client.query("CHANGE NEW").loc[0, 0]
#TODO change to : client.new_change()
print(f"Current change {change}")

# Checkout into the change
client.checkout(change=change)

Current change 0
CPU times: user 1.12 ms, sys: 872 μs, total: 1.99 ms
Wall time: 11.4 ms


In [19]:
# Build CREATE command from networkx object
create_command = build_create_command_from_networkx(G)
print(f"Cypher CREATE command :\n\n{100 * '*'}\n{create_command}\n{100 * '*'}")

Cypher CREATE command :

****************************************************************************************************
CREATE (n0:Control {"id":"control_0", "type":"control", "statement":"The organisation s executive leadership shall establish, document, and maintain formal accountability for AI governance through approved policies that align with organisational objectives and values. These policies shall be reviewed at planned intervals by executive leadership to ensure continued effectiveness and relevance. Executive leadership shall demonstrate active engagement in AI risk decisions and maintain ultimate accountability for the organisation s AI systems.", "domain":"Governance & Leadership", "master":"GL-1", "topic":"Executive Commitment and Accountability"}),
(n1:Domain {"id":"domain_Governance & Leadership", "type":"domain", "name":"Governance & Leadership"}),
(n2:Topic {"id":"topic_Executive Commitment and Accountability", "type":"topic", "name":"Executive Commitment and Ac

In [20]:
%%time

# Run CREATE command
client.query(create_command)

# Commit the change
client.query("COMMIT")
client.query("CHANGE SUBMIT")

# Checkout into main
client.checkout()

CPU times: user 1.96 ms, sys: 846 μs, total: 2.8 ms
Wall time: 150 ms


<div class="alert alert-block alert-info">
    <h2>
        Visualize your graph in TuringDB Graph Visualizer ! Now that your instance is running:
    </h2>
    <h3>
        <ul>
            <li>Go to <a href="https://console.turingdb.ai/databases">TuringDB Console - Database Instances</a></li>
            <li>In your current instance panel, click on "Open Visualizer" button</li>
            <li>Visualizer opens, now you can choose your graph in the dropdown menu at the top-right corner</li>
        </ul>
        You can then play with your graph and visualize the nodes you want !
    </h3>
</div>

# Query TuringDB

## Use metaqueries to have insight on graph overall structure

<h3>
    To learn more about 📮 Metaqueries, please check TuringDB documentation on this <a href="https://turingdb.mintlify.app/query/cypher_subset#%F0%9F%93%AE-metaqueries">link</a>
</h3>

In [21]:
%%time

# CALL PROPERTIES() - returns a column of all the different node and edge properties and their types in the database
command = """
CALL PROPERTIES()
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = ["Property_ID", "Property_name", "Property_type"]
    display(df)

Unnamed: 0,Property_ID,Property_name,Property_type
0,0,id,String
1,1,type,String
2,2,statement,String
3,3,domain,String
4,4,master,String
5,5,topic,String
6,6,name,String
7,7,standard,String
8,8,reference,String
9,9,rel,String


CPU times: user 4.87 ms, sys: 0 ns, total: 4.87 ms
Wall time: 4.34 ms


In [22]:
%%time

# CALL LABELS () - returns a column of all the different node labels
command = """
CALL LABELS()
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = ["Node_type_ID", "Node_label"]
    display(df)

Unnamed: 0,Node_type_ID,Node_label
0,0,Control
1,1,Domain
2,2,Topic
3,3,Standard


CPU times: user 2.27 ms, sys: 887 μs, total: 3.15 ms
Wall time: 2.72 ms


In [23]:
%%time

# CALL EDGETYPES() - returns a column of all the different edge types (edge equivalent of node labels)
command = """
CALL EDGETYPES()
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = ["Edge_type_ID", "Edge_label"]
    display(df)

Unnamed: 0,Edge_type_ID,Edge_label
0,0,CONNECTED


CPU times: user 2.18 ms, sys: 872 μs, total: 3.05 ms
Wall time: 2.5 ms


In [24]:
%%time

# CALL LABELSETS() - returns a two columns describing combinations of node labels
command = """
CALL LABELSETS()
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = ["Node_type_ID", "Node_label"]
    display(df)

Unnamed: 0,Node_type_ID,Node_label
0,0,Control
1,1,Domain
2,2,Topic
3,3,Standard


CPU times: user 2.11 ms, sys: 859 μs, total: 2.97 ms
Wall time: 2.57 ms


## Simple queries

In [25]:
%%time

# Match all edges and return them
command = """
MATCH (n)-[e]-(m)
RETURN n.id, e, m.id
"""
df = client.query(command)
if df.empty:
    print("No result found")
else:
    df.columns = get_return_statements(command)
    display(df)

Unnamed: 0,n.id,e,m.id
0,control_34,0,std_ISO27701_12.2.2 A.7.4.3
1,control_34,1,std_EU_AI_ACT_26.5 72.1 72.2 72.3 72.4
2,control_34,2,std_NIST_RMF_Measure 1.2 Measure 2.4 Manage 2....
3,control_34,3,std_SOC2_CC4.1 CC4.2 A1.1 A1.2
4,control_34,4,std_ISO27001_8.1-8.3 A.12.3 A.12.6 A.17.1 A.17.2
...,...,...,...
299,"topic_Transparency, Disclosure and Reporting",299,control_8
300,topic_Robustness,300,control_24
301,topic_Data Security,301,control_19
302,"topic_Security Governance, Architecture and En...",302,control_16


CPU times: user 4.44 ms, sys: 72 μs, total: 4.52 ms
Wall time: 4.1 ms


# Load the embedding model

In [26]:
%%time

# This will convert text to vectors
model = SentenceTransformer('paraphrase-MiniLM-L3-v2')
print(f"✓ Model loaded: {model.get_sentence_embedding_dimension()} dimensions")

✓ Model loaded: 384 dimensions
CPU times: user 177 ms, sys: 24.4 ms, total: 202 ms
Wall time: 1.42 s


# Build vector index on the graph

## Vector Search Implementation - Dense (semantic) search

#### How It Works

Each control is converted to a **384-dimensional vector** using a pre-trained language model (`paraphrase-MiniLM-L3-v2`).

**Search process:**
1. Convert user query to vector
2. Calculate cosine similarity with all control vectors
3. Rank by similarity score (0-1)
4. Return top-k most relevant controls

#### Why Vectors?

- **Semantic understanding**: "data protection" matches "privacy safeguards"
- **Handles synonyms**: "AI governance" finds "artificial intelligence oversight"
- **No keyword dependency**: Works even without exact term matches

### Use the three different approaches

In [27]:
%%time

# Build different versions
node_only = build_node_only_embeddings(G, model)  # Node-only embeddings
lightweight = build_context_enriched_embeddings(G, model, strategy='lightweight')  # Context-enriched embeddings
smart = build_smart_enriched_embeddings(G, model)  # Type-specific context enrichment

print("\n" + "="*80 + "\n")

# Compare what gets encoded for a control node
control_id = 'control_0'

print("NODE-ONLY TEXT:")
print(node_only[1][control_id])
print("\n" + "="*80 + "\n")

print("SMART ENRICHED TEXT:")
print(smart[1][control_id])
print("\n" + "="*80 + "\n")

✓ Vector index built using node-only embeddings approach: 305 vectors
✓ Vector index built using context-enriched embeddings approach (strategy lightweight): 305 vectors
✓ Vector index built using type-specific context enrichment approach: 305 vectors


NODE-ONLY TEXT:
The organisation's executive leadership shall establish, document, and maintain formal accountability for AI governance through approved policies that align with organisational objectives and values. These policies shall be reviewed at planned intervals by executive leadership to ensure continued effectiveness and relevance. Executive leadership shall demonstrate active engagement in AI risk decisions and maintain ultimate accountability for the organisation's AI systems.


SMART ENRICHED TEXT:
The organisation's executive leadership shall establish, document, and maintain formal accountability for AI governance through approved policies that align with organisational objectives and values. These policies shall be review

## Vector Search Implementation - Sparse (keyword) search

In [28]:
%%time

# Build sparse embeddings
sparse_vectors, node_texts, sparse_vectorizer = build_sparse_embeddings(
    G=G,
    max_features=500,
    ngram_range=(1, 2)
)

Building sparse index (TF-IDF)...
✓ Sparse index built: 305 vectors
  Vocabulary size: 500
  Sample terms: ['access', 'accessible', 'accountability', 'actions', 'activities', 'address', 'address identified', 'affected', 'affected individuals', 'agreements', 'ai', 'ai development', 'ai driven', 'ai governance', 'ai impacts', 'ai lifecycle', 'ai systems', 'align', 'align organisational', 'alignment']
CPU times: user 19.1 ms, sys: 114 μs, total: 19.2 ms
Wall time: 18.4 ms


## Node2Vec

In [29]:
%%time

# Build structural embeddings
structural_vectors = build_node2vec_embeddings(G, dimensions=384)

Training Node2Vec on graph structure...


  import pkg_resources
  import pkg_resources


✓ Node2Vec trained: 305 structural vectors
CPU times: user 4.9 s, sys: 1.9 s, total: 6.81 s
Wall time: 6.33 s


In [30]:
structural_vectors[control_id].shape

(384,)

# Search capabilities

## Vector Search Implementation - Dense (semantic) search

Find controls relevant to any natural language query:

```python
results = search("data privacy protection", k=5)
```

**Use cases:**
- Exploratory research: "What controls cover AI model governance?"
- Concept-based lookup: "security monitoring requirements"
- Gap analysis: "What's missing in our risk management?"

In [31]:
# Choose most relevant approach
node_vectors = smart[0]
node_texts = smart[1]

### Query

In [32]:
%%time

# Example 1: Find controls about data privacy
query = "data privacy protection"
results = dense_search(
    query=query,
    node_vectors=node_vectors,
    node_texts=node_texts,
    G=G,
    model=model,
    k=3,
    #node_type='control'
)
print(f"Query: '{query}'")
print_results(results)

Query: 'data privacy protection'

Found 3 results:

1. Similarity: 0.7615
   Node: domain_Privacy
   Type: domain
   Text: Domain: Privacy. Topics: Privacy by Design and Governance, Personal Data Management, Privacy Compliance and Monitoring, Privacy-Enhancing Technologies and Mechanisms...

2. Similarity: 0.6742
   Node: topic_Privacy-Enhancing Technologies and Mechanisms
   Type: topic
   Text: Topic: Privacy-Enhancing Technologies and Mechanisms. Domain: Privacy. Examples: The organisation shall implement appropriate technical mechanisms and privacy-en...

3. Similarity: 0.6605
   Node: control_30
   Type: control
   Text: The organisation shall implement appropriate technical mechanisms and privacy-enhancing technologies in AI systems to protect personal data and ensure privacy by default. This includes implementing en...
   Domain: Privacy
   Topic: Privacy-Enhancing Technologies and Mechanisms
CPU times: user 79.8 ms, sys: 2.76 ms, total: 82.6 ms
Wall time: 7.41 ms


In [33]:
%%time

# Other examples
# Try different queries
queries = [
    "AI model governance",
    "risk management controls",
    "security monitoring requirements",
    "data subject rights"
]

for query in queries:
    print(f"\n{'='*80}")
    print(f"QUERY: '{query}'")
    print('='*80)
    results = dense_search(
        query="data privacy protection",
        node_vectors=node_vectors,
        node_texts=node_texts,
        G=G,
        model=model,
        k=3,
        #node_type='control'
    )
    print_results(results)


QUERY: 'AI model governance'

Found 3 results:

1. Similarity: 0.7615
   Node: domain_Privacy
   Type: domain
   Text: Domain: Privacy. Topics: Privacy by Design and Governance, Personal Data Management, Privacy Compliance and Monitoring, Privacy-Enhancing Technologies and Mechanisms...

2. Similarity: 0.6742
   Node: topic_Privacy-Enhancing Technologies and Mechanisms
   Type: topic
   Text: Topic: Privacy-Enhancing Technologies and Mechanisms. Domain: Privacy. Examples: The organisation shall implement appropriate technical mechanisms and privacy-en...

3. Similarity: 0.6605
   Node: control_30
   Type: control
   Text: The organisation shall implement appropriate technical mechanisms and privacy-enhancing technologies in AI systems to protect personal data and ensure privacy by default. This includes implementing en...
   Domain: Privacy
   Topic: Privacy-Enhancing Technologies and Mechanisms

QUERY: 'risk management controls'

Found 3 results:

1. Similarity: 0.7615
   Node: domai

## Vector Search Implementation - Sparse (keyword) search

In [34]:
%%time

# Example: Find controls about data privacy
query = "data privacy protection"
results = sparse_search(
    query=query,
    sparse_vectors=sparse_vectors,
    sparse_vectorizer=sparse_vectorizer,
    node_texts=node_texts,
    G=G,
    k=3,
    node_type='control'
)
print(f"Query: '{query}'")
print_results(results)

# Try different queries
queries = [
    "AI model governance",
    "risk management controls",
    "security monitoring requirements",
    "data subject rights"
]

for query in queries:
    print(f"\n{'='*80}")
    print(f"QUERY: '{query}'")
    print('='*80)
    results = sparse_search(
        query=query,
        sparse_vectors=sparse_vectors,
        sparse_vectorizer=sparse_vectorizer,
        node_texts=node_texts,
        G=G,
        k=3,
        node_type='control'
    )
    print_results(results)

Query: 'data privacy protection'

Found 3 results:

1. Similarity: 0.4642
   Node: control_27
   Type: control
   Text: The organisation shall implement privacy by design principles in all AI systems, ensuring privacy considerations are embedded from initial planning through system retirement. This includes establishin...
   Domain: Privacy
   Topic: Privacy by Design and Governance

2. Similarity: 0.4379
   Node: control_30
   Type: control
   Text: The organisation shall implement appropriate technical mechanisms and privacy-enhancing technologies in AI systems to protect personal data and ensure privacy by default. This includes implementing en...
   Domain: Privacy
   Topic: Privacy-Enhancing Technologies and Mechanisms

3. Similarity: 0.3954
   Node: control_29
   Type: control
   Text: The organisation shall establish processes to monitor compliance with privacy requirements, detect and respond to privacy incidents, and ensure continuous improvement of privacy controls. This incl

## Hybrid search: Best of Both Worlds

### The Problem

- **Dense (semantic) search**: Great for concepts, misses exact terms
- **Sparse (keyword) search**: Finds exact matches, misses semantics

### The Solution

**Hybrid search** combines both approaches:

```
final_score = α × semantic_score + (1-α) × keyword_score
```

### Alpha Parameter Guide

| Alpha | Behavior | Best For |
|-------|----------|----------|
| 1.0 | Pure semantic | Conceptual queries |
| 0.7 | Favor semantics | General use (recommended) |
| 0.5 | Balanced | Mixed queries |
| 0.3 | Favor keywords | Technical lookups |
| 0.0 | Pure keywords | Exact term matching |

### When to Use What

**Dense (α=1.0)**
- Query: "What covers security?"
- Finds: Controls about protection, safeguards, defense

**Sparse (α=0.0)**
- Query: "ISO27001 A.8.24"
- Finds: Exact standard reference

**Hybrid (α=0.7)**
- Query: "NIST risk management frameworks"
- Finds: Both NIST references AND risk-related controls

### Query

In [35]:
%%time

# Example: Find controls about data privacy
query = "data privacy protection"
results = hybrid_search(
    query=query,
    node_vectors=node_vectors,
    node_texts=node_texts,
    G=G,
    sparse_vectors=sparse_vectors,
    sparse_vectorizer=sparse_vectorizer,
    model=model,
    k=3,
    alpha=0.7,  # 70% semantic, 30% keywords
    node_type='control'
)
print(f"Query: '{query}'")
print_results(results)

# Try different queries
queries = [
    "AI model governance",
    "risk management controls",
    "security monitoring requirements",
    "data subject rights"
]

for query in queries:
    print(f"\n{'='*80}")
    print(f"QUERY: '{query}'")
    print('='*80)
    results = hybrid_search(
        query=query,
        node_vectors=node_vectors,
        node_texts=node_texts,
        G=G,
        sparse_vectors=sparse_vectors,
        sparse_vectorizer=sparse_vectorizer,
        model=model,
        k=3,
        alpha=0.7,
        node_type='control'
    )
    print_results(results)

Query: 'data privacy protection'

Found 3 results:

1. Similarity: 0.9794
   Node: control_30
   Type: control
   Text: The organisation shall implement appropriate technical mechanisms and privacy-enhancing technologies in AI systems to protect personal data and ensure privacy by default. This includes implementing en...
   Domain: Privacy
   Topic: Privacy-Enhancing Technologies and Mechanisms

2. Similarity: 0.8324
   Node: control_27
   Type: control
   Text: The organisation shall implement privacy by design principles in all AI systems, ensuring privacy considerations are embedded from initial planning through system retirement. This includes establishin...
   Domain: Privacy
   Topic: Privacy by Design and Governance

3. Similarity: 0.7496
   Node: control_29
   Type: control
   Text: The organisation shall establish processes to monitor compliance with privacy requirements, detect and respond to privacy incidents, and ensure continuous improvement of privacy controls. This incl

## Compare Dense vs Sparse vs Hybrid

In [36]:
%%time

# Limit results to specific node type (all node types by default)
node_type = None # "control"
k = 5
alpha = 0.6

# Example 1: Query with specific standard ID (sparse should win)
compare_search_methods(
    query="ISO27001 controls",
    node_vectors=node_vectors,
    node_texts=node_texts,
    G=G,
    sparse_vectors=sparse_vectors,
    sparse_vectorizer=sparse_vectorizer,
    model=model,
    k=k,
    alpha=alpha
)

# Example 2: Conceptual query (dense should win)
compare_search_methods(
    query="protecting sensitive information",
    node_vectors=node_vectors,
    node_texts=node_texts,
    G=G,
    sparse_vectors=sparse_vectors,
    sparse_vectorizer=sparse_vectorizer,
    model=model,
    k=k,
    alpha=alpha
)

# Example 3: Mixed query (hybrid should win)
compare_search_methods(
    query="NIST risk management frameworks",
    node_vectors=node_vectors,
    node_texts=node_texts,
    G=G,
    sparse_vectors=sparse_vectors,
    sparse_vectorizer=sparse_vectorizer,
    model=model,
    k=k,
    alpha=alpha
)



QUERY: 'ISO27001 controls'

1. DENSE ONLY (Semantic):
--------------------------------------------------------------------------------
1. 0.408 | The organisation shall implement and maintain comprehensive physical security controls to protect in...
2. 0.391 | The organisation shall implement mechanisms for meaningful human oversight of AI systems, ensuring h...
3. 0.390 | The organisation shall implement appropriate technical and organisational measures to address identi...

2. SPARSE ONLY (Keywords):
--------------------------------------------------------------------------------
1. 0.211 | The organisation shall implement and maintain comprehensive physical security controls to protect in...
2. 0.195 | The organisation shall conduct regular testing of AI system safety and security controls, including ...
3. 0.150 | The organisation shall implement and maintain comprehensive network security controls to protect aga...

3. HYBRID (alpha=0.6):
-----------------------------------------

# Get subgraph around query results and visualise

### Graph-Based Context Retrieval

Get full context around relevant controls:

```python
subgraph = get_subgraph_around_query("risk management", k=3, hops=1)
```

**Returns:**
- Relevant controls
- Related topics and domains
- All mapped standard references
- Network connections

In [37]:
method_to_test = "hybrid"
possible_method_to_test = ["dense", "sparse", "hybrid"]
if not method_to_test in possible_method_to_test:
    raise ValueError(f"method_to_test has to be one of {possible_method_to_test}")

if method_to_test == "dense":
    # Dense search
    subgraph, results = get_subgraph_around_query(
        query="data privacy",
        G=G,
        search_func=dense_search,
        search_params={
            'node_vectors': node_vectors,
            'node_texts': node_texts,
            'model': model,
            'node_type': 'control'
        },
        k=3,
        hops=1
    )

elif method_to_test == "sparse":
    # Sparse search
    subgraph, results = get_subgraph_around_query(
        query="ISO27001 controls",
        G=G,
        search_func=sparse_search,
        search_params={
            'sparse_vectors': sparse_vectors,
            'sparse_vectorizer': sparse_vectorizer,
            'node_texts': node_texts,
            'node_type': 'control'
        },
        k=3,
        hops=1
    )

elif method_to_test == "hybrid":
    # Hybrid search
    subgraph, results = get_subgraph_around_query(
        query="risk management",
        G=G,
        search_func=hybrid_search,
        search_params={
            'node_vectors': node_vectors,
            'node_texts': node_texts,
            'sparse_vectors': sparse_vectors,
            'sparse_vectorizer': sparse_vectorizer,
            'model': model,
            'alpha': 0.7,
            'node_type': 'control'
        },
        k=3,
        hops=1
    )

In [38]:
print(f"Query: '{query}'")
print(f"Subgraph: {subgraph.number_of_nodes()} nodes, {subgraph.number_of_edges()} edges")

# Show what's in the subgraph
types_in_subgraph = {}
for node, data in subgraph.nodes(data=True):
    ntype = data.get('type', 'unknown')
    types_in_subgraph[ntype] = types_in_subgraph.get(ntype, 0) + 1

print("\nSubgraph composition:")
for ntype, count in types_in_subgraph.items():
    print(f"  {ntype}: {count}")

Query: 'data subject rights'
Subgraph: 24 nodes, 21 edges

Subgraph composition:
  standard: 18
  topic: 3
  control: 3


# Visualization Capabilities

## Interactive Visualization (PyVis)

- Hover to see full control text
- Click to explore connections
- Physics-based layout
- Filterable and zoomable

**Features:**
- Color legend for node types
- Relevance-based sizing
- Relationship labels on edges
- Responsive browser-based interface

In [39]:
# With hybrid search
visualize_subgraph_interactive(
    query="data privacy protection",
    G=G,
    search_func=hybrid_search,
    search_params={
        'node_vectors': node_vectors,
        'node_texts': node_texts,
        'sparse_vectors': sparse_vectors,
        'sparse_vectorizer': sparse_vectorizer,
        'model': model,
        'alpha': 0.7,
        #'node_type': 'control'
    },
    k=4,
    hops=1,
    output_file='graph_hybrid.html'
)

✓ Interactive graph saved to: graph_hybrid.html
  Nodes: 13
  Edges: 12


# Workflow v3 : do hybrid also for stage 2 (Node2Vec + semantic)

In [40]:
semantic_results, expanded, subgraph = search_and_expand_hybrid_filtered(
    query="data privacy protection",
    G=G,
    node_vectors=node_vectors,
    node_texts=node_texts,
    sparse_vectors=sparse_vectors,
    sparse_vectorizer=sparse_vectorizer,
    structural_vectors=structural_vectors,
    model=model,
    k_search=5,
    max_hops=10,
    min_structural_sim=0.5,   # Must be structurally similar
    min_semantic_sim=0.5,     # AND semantically relevant to query
    structural_weight=0.5,    # 50-50 balance
    alpha=0.7                 # Weight alpha to attribute to semantic (dense) search, (1 - alpha) for keyword (sparse) search
)

report = generate_report_hybrid_workflow_results(semantic_results, expanded)
print(report)

Stage 1: Hybrid search for 'data privacy protection'...
--------------------------------------------------------------------------------

Found 5 semantically relevant seed nodes:
  1. control_30 (score: 0.981)
     The organisation shall implement appropriate technical mechanisms and privacy-enhancing technologies...
  2. control_27 (score: 0.870)
     The organisation shall implement privacy by design principles in all AI systems, ensuring privacy co...
  3. control_29 (score: 0.799)
     The organisation shall establish processes to monitor compliance with privacy requirements, detect a...
  4. control_19 (score: 0.526)
     The organisation shall protect data throughout its lifecycle using appropriate technical and procedu...
  5. control_28 (score: 0.498)
     The organisation shall implement operational processes for the responsible collection, use, storage,...

Stage 2: Hybrid filtering (structural + semantic)...
  - Max hops: 10
  - Min structural similarity: 0.5
  - Min semant

In [45]:
# Access subgraph data
print(f"\nSubgraph ({subgraph}) node attributes:")
for node in list(subgraph.nodes()):
    print(f"\n* {node}:")
    for key, value in subgraph.nodes[node].items():
        if key not in ['statement']:  # Skip long text
            print(f"  {key}: {value}")

# Export subgraph if needed
#nx.write_gml(subgraph, "filtered_subgraph.gml")

# Visualise subgraph
visualize_graph_with_pyvis(subgraph, output_file='hybrid_filtered_graph.html')


Subgraph (DiGraph with 13 nodes and 11 edges) node attributes:

* std_EU_AI_ACT_10.5 26.9 27.4:
  type: standard
  standard: EU_AI_ACT
  reference: 10.5 26.9 27.4
  is_seed: False
  seed_node: control_27
  hop_distance: 1
  direction: successor
  structural_similarity: 0.9475840926170349
  semantic_similarity: 0.5740383267402649
  combined_score: 0.7608112096786499

* topic_Data Security:
  type: topic
  name: Data Security
  is_seed: False
  seed_node: control_19
  hop_distance: 1
  direction: predecessor
  structural_similarity: 0.9788761734962463
  semantic_similarity: 0.5994614958763123
  combined_score: 0.7891688346862793

* control_28:
  type: control
  domain: Privacy
  master: PR-2
  topic: Personal Data Management
  is_seed: True
  seed_score: 0.49791723190820836
  dense_score: 0.42099461863599436
  sparse_score: 0.6774033295433743

* control_30:
  type: control
  domain: Privacy
  master: PR-4
  topic: Privacy-Enhancing Technologies and Mechanisms
  is_seed: True
  seed_scor

# LAST STEP : take exploration results and ask LLM to answer original client query only using subgraph

In [46]:
## Option 1: Natural language (most LLM-friendly)
#graph_text = graph_to_llm_context(subgraph, format='natural')
#
## Option 2: Markdown (good for structured analysis)
#graph_text = graph_to_llm_context(subgraph, format='markdown')
#
## Option 3: JSON (for programmatic LLMs)
#graph_text = graph_to_llm_context(subgraph, format='json')
#
## Option 4: Cypher (for graph database queries)
#graph_text = graph_to_llm_context(subgraph, format='cypher')

# Complete prompt
prompt = create_llm_prompt_with_graph(
    query="data privacy protection",
    subgraph=subgraph,
    report=report,
    format='natural'  # or 'markdown'
)

# Send to LLM
print(prompt)

# Save to file
with open('llm_prompt.txt', 'w') as f:
    f.write(prompt)

# Graph-Based Query Response

## User Query
"data privacy protection"

## Search Results
HYBRID-FILTERED WORKFLOW RESULTS

1. SEED NODE (Hybrid Search Match):
   Node: control_30
   Semantic Score: 0.981
   Text: The organisation shall implement appropriate technical mechanisms and privacy-enhancing technologies in AI systems to protect personal data and ensure...
   Domain: Privacy
   Topic: Privacy-Enhancing Technologies and Mechanisms

   HYBRID-FILTERED NEIGHBORS:
   (Must pass BOTH structural AND semantic thresholds)
   1. domain_Privacy (predecessor, 2 hops)
      Combined: 0.834
      └─ Structural: 0.906
      └─ Semantic: 0.761
      Type: domain
      Domain: Privacy. Topics: Privacy by Design and Governance, Personal Data Managem...
   2. topic_Privacy-Enhancing Technologies and Mechanisms (predecessor, 1 hops)
      Combined: 0.830
      └─ Structural: 0.986
      └─ Semantic: 0.674
      Type: topic
      Topic: Privacy-Enhancing Technologies and Mechanisms. Domain: Privac

In [48]:
api_keys = {
    "Anthropic": os.getenv("ANTHROPIC_API_KEY"),
    "Mistral": os.getenv("MISTRAL_API_KEY"),
    "OpenAI": os.getenv("OPENAI_API_KEY"),
}

In [49]:
%%time

#system_prompt = """
#Only base your response on the data provided.
#Do not add knowledge but describe it in a natural way as if you were a touristic guide, be friendly !
#You can add forms but not content.
#Format the output with nice Markdown format and emojis.
#"""
#
#prompt = f"""
#Here is the path I got from stations {start_station} to {end_station} : {df_path.to_dict()}.
#Describe the path between these two stations.
#Give me an itinerary of things I could visit on the way.
#"""

provider = "Anthropic"

result = query_llm(
    prompt=prompt,
    #system_prompt=system_prompt,
    provider=provider,
    api_key=api_keys[provider],
    temperature=0.2,
)

CPU times: user 187 ms, sys: 33.7 ms, total: 221 ms
Wall time: 9.59 s


In [52]:
display(Markdown(result))

# Response to "data privacy protection"

## Key Insights from Graph Analysis

### Primary Findings
The graph reveals a comprehensive approach to data privacy protection with multiple interconnected aspects:

1. Core Privacy Protection Mechanisms:
- Privacy-Enhancing Technologies
- Privacy by Design Principles
- Personal Data Management
- Privacy Compliance and Monitoring

### Relevant Nodes and Connections

#### Key Control Nodes:
1. control_30: Implements technical mechanisms for privacy protection
   - Focuses on privacy-enhancing technologies
   - High semantic relevance (0.981)

2. control_27: Embeds privacy considerations from initial planning
   - Implements "privacy by design" principles
   - Semantic score of 0.870

3. control_29: Monitors privacy compliance
   - Establishes processes to detect and respond to privacy incidents
   - Semantic score of 0.799

4. control_28: Manages personal data lifecycle
   - Covers collection, use, storage, and disposal of personal data
   - Semantic score of 0.498

#### Domain and Topic Connections:
- All controls are connected to the domain_Privacy
- Covers topics like:
  - Privacy-Enhancing Technologies
  - Privacy by Design
  - Personal Data Management
  - Privacy Compliance and Monitoring

### Standards Mapping
- control_27 maps to two standards:
  1. EU_AI_ACT (sections 10.5, 26.9, 27.4)
  2. SOC2 (sections P1.1, P1.2, P1.3)

## Comprehensive Protection Strategy
The graph illustrates a multi-layered approach to data privacy:
1. Proactive design (privacy by design)
2. Technical protection mechanisms
3. Operational data management
4. Continuous compliance monitoring

### Conclusion
Data privacy protection involves technical, operational, and governance strategies that span the entire data lifecycle, with a focus on proactive protection and continuous monitoring.

TODO necessary:
- plug LLM at the end for report : DONE
- display subgraph : DONE
- add graph creation using turingdb to show graph in visualiser
- stage 1 find seed nodes using turingdb graph instead of networkx G (using node ID injection)
- stage 2 exploration to do with turingdb queries instead of networkx G
- biosearch v2 get subgraph with all pairwise entities paths
- stage 2 add weight of if we are on a path between two seeds or not to know if the exploration makes sense

TODO potential:
- community detection
- different levels of community
- explanation of these communities

# Test : hypothetic document generation

# Test : after stage 1 ask LLM to say if seed node is relevant