# Notes

**What we need for the implementation**
- Relational Database Schema
    - R set of relations
    - K function that maps each relation to its set of primary key attributes
    - F function that maps each relation to its set of attributes, 
        where each attribute is given by its column name together with its datatype
- Each relation has:
    - Attribute set (F)
    - pk set (K)
    - Foreign key set (subset of F)
- Functions:
    - docTable: returns the textual description of a relation
    - docAttr: returns the textual description of an attribute a in relation r
- Ontology Representation:
    - C set of class identifiers (concepts)
    - P set of properties
        - Pobj subset of object properties
        - Pdata subset of data properties
    - A set of axioms 
        - class hierarchies, 
        - domain and range assertions, 
        - property characteristics
    - M set of annotation assertions
        - labels
        - comments
        - provenance info
- External Ontology Repo
    - DINGO
- Lexical view converter 
    - of any ontology
    - **CHECK**
- RAG pipeline 
    - Embed lexical views of 
        - external ontologies
        - Relational schema 
        - core ontology
    - Index in FAISS 
    - At each relation (...)
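
The schema and ontology components listed above could be held in a few small containers. The following is only a sketch under assumptions: the class names `Relation`, `Schema` and `Ontology`, the field names, and the choice of triples for axioms and annotations are all illustrative and do not appear elsewhere in these notes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Relation:
    """One relation r in R (hypothetical container)."""
    name: str
    attributes: Dict[str, str]  # F(r): column name -> datatype
    primary_key: Set[str] = field(default_factory=set)  # K(r)
    # Foreign keys: local column -> (referenced relation, referenced column)
    foreign_keys: Dict[str, Tuple[str, str]] = field(default_factory=dict)
    description: str = ""  # text returned by docTable
    attribute_docs: Dict[str, str] = field(default_factory=dict)  # texts for docAttr

    def __post_init__(self):
        # Both key sets must be drawn from the relation's own attributes.
        unknown = (self.primary_key | set(self.foreign_keys)) - set(self.attributes)
        if unknown:
            raise ValueError(f"{self.name}: keys not in attribute set: {sorted(unknown)}")


@dataclass
class Schema:
    relations: Dict[str, Relation] = field(default_factory=dict)  # R

    def K(self, r: str) -> Set[str]:
        return self.relations[r].primary_key

    def F(self, r: str) -> Dict[str, str]:
        return self.relations[r].attributes

    def doc_table(self, r: str) -> str:
        return self.relations[r].description

    def doc_attr(self, r: str, a: str) -> str:
        return self.relations[r].attribute_docs.get(a, "")


@dataclass
class Ontology:
    classes: Set[str] = field(default_factory=set)  # C
    object_properties: Set[str] = field(default_factory=set)  # Pobj
    data_properties: Set[str] = field(default_factory=set)  # Pdata
    axioms: List[Tuple[str, str, str]] = field(default_factory=list)  # A
    annotations: List[Tuple[str, str, str]] = field(default_factory=list)  # M

    @property
    def properties(self) -> Set[str]:
        # P is the union of object and data properties.
        return self.object_properties | self.data_properties
```

Checking the key sets in `__post_init__` enforces the "subset of F" condition from the list at construction time instead of leaving it to later steps.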

# Data Structures

## Database Schema

In [1]:
import re
from pprint import pprint

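def parse_mysql_ddl_file(filepath):
    """Parse a MySQL dump and return, per table, its raw DDL, column names, primary keys and foreign keys.

    Only CREATE TABLE statements that end with an ENGINE=...; clause are recognised.
    """
    with open(filepath, 'r', encoding='utf-8') as f:
        ddl = f.read()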

    # Remove MySQL directives (/*! ... */;) first, then ordinary comments
    ddl = re.sub(r'/\*!.*?\*/;?', '', ddl, flags=re.DOTALL)
    ddl = re.sub(r'/\*.*?\*/', '', ddl, flags=re.DOTALL)
    ddl = re.sub(r'--.*?$', '', ddl, flags=re.MULTILINE)

    # Find all CREATE TABLE statements (handles backticks and multiline)
    table_regex = re.compile(
        r'CREATE TABLE\s+`?(\w+)`?\s*\((.*?)\)\s*ENGINE=.*?;',
        re.DOTALL | re.IGNORECASE
    )
    tables = table_regex.findall(ddl)
    result = {}

    for table_name, table_body in tables:
        # Raw DDL
        raw_ddl = f"CREATE TABLE `{table_name}` ({table_body});"

        # Split lines, remove empty and trailing commas
        lines = [line.strip().rstrip(',') for line in table_body.splitlines() if line.strip()]
        columns = []
        primary_keys = []
        foreign_keys = []

        for line in lines:
            # Column definition (starts with backtick or word, not constraint)
            if re.match(r'^`?\w+`?\s', line) and not line.upper().startswith(('PRIMARY KEY', 'FOREIGN KEY', 'CONSTRAINT', 'UNIQUE', 'KEY')):
                col_name = re.match(r'^`?(\w+)`?', line).group(1)
                columns.append(col_name)
            # Primary key
            elif line.upper().startswith('PRIMARY KEY'):
                pk_match = re.search(r'\((.*?)\)', line)
                if pk_match:
                    pk_cols = [col.strip(' `') for col in pk_match.group(1).split(',')]
                    primary_keys.extend(pk_cols)
            # Foreign key (with or without a leading CONSTRAINT name)
            elif 'FOREIGN KEY' in line.upper():
                fk_match = re.search(r'FOREIGN KEY\s*\((.*?)\)\s*REFERENCES\s*`?(\w+)`?\s*\((.*?)\)', line, re.IGNORECASE)
                if fk_match:
                    fk_cols = [col.strip(' `') for col in fk_match.group(1).split(',')]
                    ref_table = fk_match.group(2)
                    ref_cols = [col.strip(' `') for col in fk_match.group(3).split(',')]
                    foreign_keys.append({
                        'columns': fk_cols,
                        'ref_table': ref_table,
                        'ref_columns': ref_cols
                    })

        result[table_name] = {
            'raw_ddl': raw_ddl,
            'columns': columns,
            'primary_keys': primary_keys,
            'foreign_keys': foreign_keys
        }

    return result
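
The notes define F as giving each attribute by column name together with its datatype, while the parser above keeps column names only. Below is a minimal sketch of how the datatype could be captured as well. `parse_column_types` is a hypothetical helper, and the regular expression assumes mysqldump-style bodies with one definition per line; modifiers such as `unsigned` are not captured.

```python
import re

# Matches e.g. "`id` bigint(20) NOT NULL" or "`title` varchar(255) DEFAULT NULL".
COLUMN_RE = re.compile(r'^`?(\w+)`?\s+(\w+(?:\s*\([^)]*\))?)')
CONSTRAINT_PREFIXES = ('PRIMARY KEY', 'FOREIGN KEY', 'CONSTRAINT', 'UNIQUE', 'KEY')


def parse_column_types(table_body):
    """Return {column_name: datatype} for the body of one CREATE TABLE statement."""
    types = {}
    for raw in table_body.splitlines():
        line = raw.strip().rstrip(',')
        # Skip blank lines and key/constraint definitions.
        if not line or line.upper().startswith(CONSTRAINT_PREFIXES):
            continue
        match = COLUMN_RE.match(line)
        if match:
            types[match.group(1)] = match.group(2).lower()
    return types
```

The function takes the same `table_body` string that the loop in `parse_mysql_ddl_file` iterates over, so its result could be stored next to `columns` if that turns out to be useful.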

In [2]:
schema = parse_mysql_ddl_file('../RDB_schema/Usable_schema.sql')
pprint(schema)

{'granter_activity': {'columns': ['id',
                                  'created_at',
                                  'updated_at',
                                  'type',
                                  'title',
                                  'description',
                                  'application_id',
                                  'company_id',
                                  'file_id',
                                  'profile_id',
                                  'created_by_expert',
                                  'activity_date',
                                  'data',
                                  'data_id',
                                  'opportunity_id',
                                  'data_type'],
                      'foreign_keys': [{'columns': ['application_id'],
                                        'ref_columns': ['id'],
                                        'ref_table': 'granter_application'},
                                 

## Descriptions of the Database tables

In [3]:
None
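
This section does not contain any table descriptions yet. As a stopgap, a purely structural description can be derived from the parsed schema. The sketch below assumes the dictionary layout returned by `parse_mysql_ddl_file`; `describe_table` is a hypothetical name, and the generated text is no substitute for the human-written documentation that `docTable` is meant to return.

```python
def describe_table(table_name, table):
    """Render one parsed table as a short plain-text description."""
    lines = [f"Table {table_name} has columns: {', '.join(table['columns'])}."]
    if table['primary_keys']:
        lines.append(f"Primary key: {', '.join(table['primary_keys'])}.")
    for fk in table['foreign_keys']:
        lines.append(
            f"Column(s) {', '.join(fk['columns'])} reference "
            f"{fk['ref_table']}({', '.join(fk['ref_columns'])})."
        )
    return "\n".join(lines)
```

Text of this kind could also serve as the lexical view of the relational schema that the RAG step in the notes wants to embed.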

## Prompt

In [4]:
#prompt = f"""
#    Generate ontology elements with provenance annotations for database table {data[table_name]} based on:
#
#    [CONTEXT]
#    - Database Schema of the database {data[schema_context]}
#    - Take semantics from the Relevant Documents {data[documents]}
#    - Take semantics from the Existing Ontology Knowledge {data[existing_ontology]}
#
#    [INSTRUCTIONS]
#    1. Include these elements:
#        Classes (subclass of Thing)
#        Data properties with domain/range
#        Object properties with domain/range
#    2. Use only one rdfs:domain and one rdfs:range per property. If multiple options exist, select the most general or create a shared superclass.
#    3. Do not create a property named "is". Use rdf:type for instance membership, rdfs:subClassOf for class hierarchies, and owl:sameAs for instance equality.
#    4. Use this format example:
#
#    Class: {data[table_name]}
#    Annotations:
#    prov:wasDerivedFrom
#    <http://example.org/provenance/{data[table_name]}>
#
#    DataProperty:
#    has_column_name
#    domain {data[table_name]}
#    range string
#    Annotations:
#    prov:wasDerivedFrom
#    <http://example.org/provenance/{data[table_name]}/column_name>
#
#    ObjectProperty:
#    relates_to_table
#    domain {data[table_name]}
#    range RelatedTable
#    Annotations:
#    prov:wasDerivedFrom
#    <http://example.org/provenance/{data[table_name]}/fk_column>
#
#    Only output Manchester Syntax and nothing else. [OUTPUT]
#"""

# Building a delta-ontology (testing 1 iteration)

In [3]:
from langchain_ollama import OllamaLLM
llm = OllamaLLM(model="llama3.2:3b")



In [4]:
with open("external_ontologies/DINGO-OWL.ttl", "r", encoding="utf-8") as f:
    external_ontology = f.read()
print(external_ontology[632:])


dg:Grant a rdfs:Class ;
    rdfs:label "Grant" ;
    rdfs:comment "The class for grant: a disbursed fund payed to a recipient or beneficiary and the process for it." ;
    rdfs:isDefinedBy dg: ;
	skos:closeMatch schema:MonetaryGrant ;
	skos:closeMatch frapo:Funding ;
	skos:narrowMatch frapo:Grant ;
	skos:broadMatch wd:Q230788 .

dg:GrantPayment a rdfs:Class ;
    rdfs:label "GrantPayment" ;
    rdfs:comment "The class for grant payments: a single payment to a recipient or beneficiary within a Grant." ;
    rdfs:isDefinedBy dg: ;
	skos:broadMatch frapo:Payment ;
	skos:broadMatch wd:Q1148747 .

dg:GrantShare a rdfs:Class ;
    rdfs:label "GrantShare" ;
    rdfs:comment "The class for grant shares: the full or proper portion or part allotted or belonging to or contributed to an individual entity within a Grant." ;
    rdfs:isDefinedBy dg: .	
	
dg:Project a rdfs:Class ;
    rdfs:label "Project" ;
    rdfs:comment "The class for projects: an organised endeavour (collective or individual) p

In [7]:
import json

table = schema['granter_activity']

prompt = f"""
    Generate ontology elements with provenance annotations for database table granter_activity based on:

    [CONTEXT]
    - Database Schema of the table {json.dumps(table)}

    [INSTRUCTIONS]
    1. Include these elements:
        Classes (subclass of Thing)
        Data properties with domain/range
        Object properties with domain/range
    2. Use only one rdfs:domain and one rdfs:range per property. If multiple options exist, select the most general or create a shared superclass.
    3. Do not create a property named "is". Use rdf:type for instance membership, rdfs:subClassOf for class hierarchies, and owl:sameAs for instance equality.
    4. Use this format example:

    Class: granter_activity
    Annotations:
    prov:wasDerivedFrom
    <http://example.org/provenance/granter_activity>

    DataProperty:
    has_column_name
    domain granter_activity
    range string
    Annotations:
    prov:wasDerivedFrom
    <http://example.org/provenance/granter_activity/column_name>

    ObjectProperty:
    relates_to_table
    domain granter_activity
    range RelatedTable
    Annotations:
    prov:wasDerivedFrom
    <http://example.org/provenance/granter_activity/fk_column>

    Only output Manchester Syntax and nothing else. [OUTPUT]
"""

In [None]:
delta_ontology = llm.invoke(prompt)

In [None]:
print(delta_ontology)

Here is the generated ontology in Manchester Syntax:

Class: granter_activity
Annotations:
  prov:wasDerivedFrom
  <http://example.org/provenance/granter_activity>

DataProperty:
has_column_name
domain granter_activity
range string
Annotations:
  prov:wasDerivedFrom
  <http://example.org/provenance/granter_activity/column_name>

ObjectProperty:
relates_to_table
domain granter_activity
range RelatedTable
Annotations:
  prov:wasDerivedFrom
  <http://example.org/provenance/granter_activity/fk_column>

DataProperty:
has_application_id
domain granter_activity
range bigint
Annotations:
  prov:wasDerivedFrom
  <http://example.org/provenance/granter_activity/application_id>

ObjectProperty:
relates_to_application
domain granter_activity
range Application
Annotations:
  prov:wasDerivedFrom
  <http://example.org/provenance/granter_activity/foreign_application_id>

DataProperty:
has_company_id
domain granter_activity
range bigint
Annotations:
  prov:wasDerivedFrom
  <http://example.org/provenance
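
The model prefixed its answer with a sentence of its own even though the prompt asks for Manchester Syntax only. One way to cope is to discard everything before the first frame keyword and split the rest into frames. This is a minimal sketch: `strip_preamble` and `split_frames` are hypothetical helpers, and they only recognise the three frame keywords used in the prompt's format example.

```python
import re

# A frame starts at a line beginning with one of the keywords from the prompt.
FRAME_START = re.compile(r'^(Class|DataProperty|ObjectProperty):', re.MULTILINE)


def strip_preamble(llm_output):
    """Drop any text before the first Class:/DataProperty:/ObjectProperty: line."""
    match = FRAME_START.search(llm_output)
    return llm_output[match.start():] if match else ""


def split_frames(llm_output):
    """Split the output into one string per frame, in order of appearance."""
    body = strip_preamble(llm_output)
    starts = [m.start() for m in FRAME_START.finditer(body)]
    ends = starts[1:] + [len(body)]
    return [body[a:b].strip() for a, b in zip(starts, ends)]
```

Working frame by frame also makes it easier to spot elements copied verbatim from the format example, such as `has_column_name`.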

## Validating Delta Ontology

In [10]:
evaluator_prompt = f"""
You are an expert in OWL 2 DL ontology modeling and validation.

Your task is to review the following delta ontology fragment generated from a relational database table, along with its schema and relevant context.

[DELTA-ONTOLOGY]
{delta_ontology}

[DATABASE SCHEMA]
{table}

[CORE ONTOLOGY CONTEXT]
(empty as of now)

[VALIDATION CRITERIA]
1. **Coherence with Core Ontology**  
   - Do NOT redefine an existing class, property, or concept already present in the core ontology with the same meaning.
   - Reuse existing ontology elements where possible instead of creating duplicates.

2. **Alignment with Input Table Schema**  
   - Every significant column and foreign key in the table must be represented as an appropriate ontology element (class, data property, or object property).
   - Naming should reflect the database semantics clearly and consistently.

3. **Syntactic Validity**  
   - The ontology must conform to the OWL 2 DL profile and valid Manchester Syntax.
   - Only one `rdfs:domain` and one `rdfs:range` per property.

4. **Logical Consistency**  
   - No contradictory class axioms or property constraints.
   - No circular subclass relationships.
   - Correct choice between object properties and data properties.

5. **Clarity and Naming Quality**  
   - Use self-explanatory, domain-relevant names.
   - Avoid generic or meaningless labels (e.g., "Entity1", "PropertyA").
   - All properties should follow consistent naming patterns (e.g., `has_`, `is_...Of`).

[YOUR TASK]
- Check the delta ontology fragment against all criteria above.
- If issues are found, provide a corrected version of the ontology in valid Manchester Syntax.
- Make minimal necessary changes to preserve the author's intent while ensuring correctness and OWL 2 DL compliance.
- Ensure all elements keep their provenance annotations.

[OUTPUT FORMAT]
Respond ONLY with:
1. "Status: PASS" if the ontology fragment meets all criteria, or "Status: FAIL" if it does not.
2. If FAIL, provide:
   a. A short bullet list of the issues found.
   b. A corrected Manchester Syntax version of the ontology fragment.

Do NOT include any other commentary outside this format.
"""

In [11]:
revision = llm.invoke(evaluator_prompt)

In [12]:
print(revision)

Status: PASS

No issues were found with the provided delta ontology fragment. It meets all the criteria:

1. Coherence with Core Ontology: No redefinitions or duplicates of existing core ontology elements were found.
2. Alignment with Input Table Schema: All significant columns and foreign keys in the table are represented as appropriate ontology elements, following the database semantics clearly and consistently.
3. Syntactic Validity: The ontology conforms to OWL 2 DL profile and valid Manchester Syntax, with only one `rdfs:domain` and one `rdfs:range` per property.
4. Logical Consistency: No contradictory class axioms or property constraints were found, and no circular subclass relationships exist. Correct choices between object properties and data properties were made.
5. Clarity and Naming Quality: Self-explanatory, domain-relevant names were used throughout, with consistent naming patterns (e.g., `has_`, `is_...Of`).
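
The reviewer returned PASS although the fragment still contains `has_column_name` and `relates_to_table`, which are the placeholder elements from the format example. Two deterministic checks could complement the LLM verdict: reading the status line mechanically, and comparing the provenance IRIs against the table's columns. The sketch below is hedged accordingly: both function names are hypothetical, and the coverage check assumes the `http://example.org/provenance/<table>/<column>` convention from the generation prompt.

```python
import re

# Accepts 'Status: PASS', '1. Status: FAIL' and '"Status: PASS"' at the start of a line.
STATUS_RE = re.compile(
    r'^\s*(?:\d+\.\s*)?"?Status:\s*(PASS|FAIL)\b',
    re.IGNORECASE | re.MULTILINE,
)


def parse_status(revision):
    """Return 'PASS', 'FAIL', or None when the reply has no status line."""
    match = STATUS_RE.search(revision)
    return match.group(1).upper() if match else None


def missing_columns(delta_ontology, table_name, table):
    """List the columns that have no provenance IRI ending in /<table>/<column>."""
    pattern = re.compile(
        r'<http://example\.org/provenance/' + re.escape(table_name) + r'/(\w+)>'
    )
    covered = set(pattern.findall(delta_ontology))
    return [col for col in table['columns'] if col not in covered]
```

A fragment might then be accepted only when `parse_status(revision)` is `'PASS'` and `missing_columns(delta_ontology, 'granter_activity', table)` is empty; a reply without a status line yields `None` and can be treated as a failed review.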


# Running the pipeline for the whole database schema

In [6]:
import json

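def process_table(table_name, schema, llm, processed_tables, results):
    """Generate and review a delta ontology for one table, then recurse into the tables it references."""
    # Skip tables that were already handled (this also guards against foreign-key cycles)
    if table_name in processed_tables:
        return
    processed_tables.add(table_name)
    table = schema[table_name]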

    # Generate delta ontology
    prompt = f"""
        Generate ontology elements with provenance annotations for database table {table_name} based on:

        [CONTEXT]
        - Database Schema of the table {json.dumps(table)}

        [INSTRUCTIONS]
        1. Include these elements:
            Classes (subclass of Thing)
            Data properties with domain/range
            Object properties with domain/range
        2. Use only one rdfs:domain and one rdfs:range per property. If multiple options exist, select the most general or create a shared superclass.
        3. Do not create a property named "is". Use rdf:type for instance membership, rdfs:subClassOf for class hierarchies, and owl:sameAs for instance equality.
        4. Use this format example:

        Class: {table_name}
        Annotations:
        prov:wasDerivedFrom
        <http://example.org/provenance/{table_name}>

        DataProperty:
        has_column_name
        domain {table_name}
        range string
        Annotations:
        prov:wasDerivedFrom
        <http://example.org/provenance/{table_name}/column_name>

        ObjectProperty:
        relates_to_table
        domain {table_name}
        range RelatedTable
        Annotations:
        prov:wasDerivedFrom
        <http://example.org/provenance/{table_name}/fk_column>

        Only output Manchester Syntax and nothing else. [OUTPUT]
    """
    delta_ontology = llm.invoke(prompt)

    # Revise delta ontology
    evaluator_prompt = f"""
    You are an expert in OWL 2 DL ontology modeling and validation.

    Your task is to review the following delta ontology fragment generated from a relational database table, along with its schema and relevant context.

    [DELTA-ONTOLOGY]
    {delta_ontology}

    [DATABASE SCHEMA]
    {table}

    [CORE ONTOLOGY CONTEXT]
    (empty as of now)

    [VALIDATION CRITERIA]
    1. **Coherence with Core Ontology**  
       - Do NOT redefine an existing class, property, or concept already present in the core ontology with the same meaning.
       - Reuse existing ontology elements where possible instead of creating duplicates.

    2. **Alignment with Input Table Schema**  
       - Every significant column and foreign key in the table must be represented as an appropriate ontology element (class, data property, or object property).
       - Naming should reflect the database semantics clearly and consistently.

    3. **Syntactic Validity**  
       - The ontology must conform to the OWL 2 DL profile and valid Manchester Syntax.
       - Only one `rdfs:domain` and one `rdfs:range` per property.

    4. **Logical Consistency**  
       - No contradictory class axioms or property constraints.
       - No circular subclass relationships.
       - Correct choice between object properties and data properties.

    5. **Clarity and Naming Quality**  
       - Use self-explanatory, domain-relevant names.
       - Avoid generic or meaningless labels (e.g., "Entity1", "PropertyA").
       - All properties should follow consistent naming patterns (e.g., `has_`, `is_...Of`).

    [YOUR TASK]
    - Check the delta ontology fragment against all criteria above.
    - If issues are found, provide a corrected version of the ontology in valid Manchester Syntax.
    - Make minimal necessary changes to preserve the author's intent while ensuring correctness and OWL 2 DL compliance.
    - Ensure all elements keep their provenance annotations.

    [OUTPUT FORMAT]
    Respond ONLY with:
    1. "Status: PASS" if the ontology fragment meets all criteria, or "Status: FAIL" if it does not.
    2. If FAIL, provide:
       a. A short bullet list of the issues found.
       b. A corrected Manchester Syntax version of the ontology fragment.

    Do NOT include any other commentary outside this format.
    """
    revision = llm.invoke(evaluator_prompt)
    print(f"Table: {table_name}\nRevision:\n{revision}\n")

    results[table_name] = {
        "delta_ontology": delta_ontology,
        "revision": revision
    }

    # Recursively process referenced tables via foreign keys
    for fk in table['foreign_keys']:
        ref_table = fk['ref_table']
        if ref_table in schema:
            process_table(ref_table, schema, llm, processed_tables, results)


# Initialize processed tables set
processed_tables = set()
results = {}

# Loop through all tables in the schema
for table_name in schema:
    process_table(table_name, schema, llm, processed_tables, results)

Table: granter_activity
Revision:
After reviewing the delta ontology fragment against all criteria, I found some issues that need to be addressed:

1. Coherence with Core Ontology:
   - The class "granter_activity" is not defined in the core ontology, but it's defined here. This should be avoided as per the coherence criterion.
   - The data properties and object properties related to granter_activity do not have an explicit domain or range. They should be updated to conform to OWL 2 DL profile.

2. Alignment with Input Table Schema:
   - Some columns in the table (e.g., "type", "title", "description") are not explicitly represented as ontology elements.
   - The foreign key constraints on some columns (e.g., "application_id", "company_id") do not have explicit domain or range annotations.

3. Syntactic Validity:
   - The data property "data" has no annotation for its range. It should be updated to conform to OWL 2 DL profile.
   - There is a missing `rdfs:range` annotation on the obje

KeyboardInterrupt: 
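
The run was interrupted partway through. Because `results` is filled in place, the tables finished before the interrupt are still in memory, but they would be lost on a kernel restart. A small checkpoint could guard against that. The helpers below are a sketch with hypothetical names, and the file location is left to the caller.

```python
import json
from pathlib import Path


def save_results(results, path):
    """Write results to a temporary file, then rename it over the target."""
    path = Path(path)
    tmp = path.with_suffix(path.suffix + '.tmp')
    tmp.write_text(json.dumps(results, indent=2, ensure_ascii=False), encoding='utf-8')
    tmp.replace(path)


def load_results(path):
    """Return previously saved results, or an empty dict if there are none."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding='utf-8'))
```

Calling `save_results` after each table inside `process_table`, and seeding `results` and `processed_tables` from `load_results` before the loop, would let a rerun continue where the previous one stopped.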

In [None]:
pprint(results)
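
The generation prompt is written out three times in this notebook: once commented out, once for `granter_activity`, and once inside `process_table`. A single builder would keep the copies from drifting apart. The following is a sketch only: `build_generation_prompt` and `FORMAT_EXAMPLE` are hypothetical names, the instruction numbering is made consecutive, and the optional `documents` and `existing_ontology` arguments mirror the placeholders of the commented-out prompt.

```python
import json

# Format example shown to the model; {name} is replaced by the table name.
FORMAT_EXAMPLE = """\
Class: {name}
Annotations:
prov:wasDerivedFrom
<http://example.org/provenance/{name}>

DataProperty:
has_column_name
domain {name}
range string
Annotations:
prov:wasDerivedFrom
<http://example.org/provenance/{name}/column_name>

ObjectProperty:
relates_to_table
domain {name}
range RelatedTable
Annotations:
prov:wasDerivedFrom
<http://example.org/provenance/{name}/fk_column>"""


def build_generation_prompt(table_name, table, documents=None, existing_ontology=None):
    """Assemble the generation prompt; optional context lines appear only when given."""
    context = [f"- Database Schema of the table {json.dumps(table)}"]
    if documents:
        context.append(f"- Take semantics from the Relevant Documents {documents}")
    if existing_ontology:
        context.append(
            f"- Take semantics from the Existing Ontology Knowledge {existing_ontology}"
        )
    lines = [
        "Generate ontology elements with provenance annotations "
        f"for database table {table_name} based on:",
        "",
        "[CONTEXT]",
        *context,
        "",
        "[INSTRUCTIONS]",
        "1. Include these elements:",
        "    Classes (subclass of Thing)",
        "    Data properties with domain/range",
        "    Object properties with domain/range",
        "2. Use only one rdfs:domain and one rdfs:range per property. If multiple "
        "options exist, select the most general or create a shared superclass.",
        '3. Do not create a property named "is". Use rdf:type for instance membership, '
        "rdfs:subClassOf for class hierarchies, and owl:sameAs for instance equality.",
        "4. Use this format example:",
        "",
        FORMAT_EXAMPLE.format(name=table_name),
        "",
        "Only output Manchester Syntax and nothing else. [OUTPUT]",
    ]
    return "\n".join(lines)
```

With this in place, `process_table` could call `llm.invoke(build_generation_prompt(table_name, table))` instead of embedding the template.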