# Data Integration for Building Information Modeling

This notebook demonstrates how to convert raw building information data (IFC files and PDF documents) into RDF graphs. This prepares the data for graph-based knowledge retrieval with Graph RAG (Retrieval-Augmented Generation).

**What we'll accomplish:**
- Process IFC (Industry Foundation Classes) building models and PDF product documentation into semantic graphs serialized as Turtle (`.ttl`)
- Convert TTL files into embeddings
- Analyze and visualize the results

## 0. Setup
This notebook can run in either Google Colab or locally. The setup cell below automatically configures your environment by detecting whether it's running in Colab or locally, cloning the repository if needed (Colab), and installing dependencies from `requirements.txt`.

**Key Point:** This setup ensures the notebook runs consistently anywhere with minimal configuration.

In [None]:
import os
from pathlib import Path

# Detect environment
try:
    from IPython import get_ipython
    IN_COLAB = 'google.colab' in str(get_ipython())
except Exception:
    IN_COLAB = False

# Configure environment
if IN_COLAB:
    !git clone https://github.com/qaecy/bilt2025.git
    %cd bilt2025
    requirements_path = "requirements.txt"
    from google.colab import userdata
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
else:
    # Find requirements.txt based on current directory
    current_dir = Path().resolve()
    requirements_path = "../requirements.txt" if current_dir.name == "notebooks" else "requirements.txt"
    print(f"Looking for requirements at: {Path(requirements_path).resolve()}")

# Install dependencies if requirements.txt exists
if os.path.exists(requirements_path):
    %pip install -r {requirements_path}
    if IN_COLAB:
        %pip install -e .
    print("✓ Environment setup complete")
else:
    print("⚠️ Could not find requirements.txt")

## Prerequisites

All required packages are listed in `requirements.txt` and installed automatically by the setup cell above. Key libraries include:
- **Data Processing:** pandas, rdflib
- **IFC Processing:** ifcopenshell
- **PDF Processing:** pymupdf
- **Embedding:** openai (the OpenAI Python SDK)

### Data Flow Overview
The diagram below illustrates the process: Input files (IFC/PDF) are processed by converters into RDF/Turtle format, then transformed into embeddings for analysis and use in the next Graph RAG lab.

```
┌───────────────┐     ┌─────────────────┐     ┌────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Input Data   │     │   Converters    │     │  Output Data   │     │    Embedding     │     │     Analysis    │
├───────────────┤ --> ├─────────────────┤ --> ├────────────────┤ --> ├──────────────────┤ --> ├─────────────────┤
│  IFC Files    │     │ ifc_converter   │     │      RDF       │     │ ttl_to_embedding │     │ Triple Count    │
│  PDF Files    │     │ pdf_converter   │     │  (Turtle fmt)  │     │                  │     │ Size Comparison │
└───────────────┘     └─────────────────┘     └────────────────┘     └──────────────────┘     └─────────────────┘
                                                                                                       ↓
                                                                                       ┌─────────────────────────┐
                                                                                       │  Graph RAG (Next Lab)   │
                                                                                       └─────────────────────────┘
```

## 1. Import Libraries

Here we import the necessary libraries. Note the different handling for Colab vs. local environments to ensure correct path setup.

**Key Point:** These libraries provide the tools to process our building data files.

In [105]:
import sys
import time
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import display
from rdflib import Graph

# Add project to path if running locally
if not IN_COLAB:
    project_root = Path().resolve()
    if project_root.name == 'notebooks':
        project_root = project_root.parent
    if str(project_root) not in sys.path:
        sys.path.insert(0, str(project_root))

# Import converters
from src.graph_converter.ifc_converter import ifc_to_ttl
from src.graph_converter.pdf_converter import pdf_to_ttl
from src.graph_converter.embedding_converter import ttl_to_embeddings

## 2. Define Data Paths

Next, we'll set up the paths for our raw data and output directories. This uses a consistent approach to find paths relative to the project root, whether running locally or in Colab.

**Key Point:** Paths are consistent across environments and match the expected project layout.

In [None]:
# Define paths with consistent approach for Colab and local environments
project_root = Path().resolve()
if project_root.name == 'notebooks':
    project_root = project_root.parent

DATA_DIR = project_root / "data"
RAW_DIR = DATA_DIR / "raw" / "buildingsmart_duplex"
GRAPH_DIR = DATA_DIR / "graph" / "buildingsmart_duplex"
EMBEDDINGS_DIR = DATA_DIR / "embeddings" / "buildingsmart_duplex"

# Ensure directories exist
for dir_path in [RAW_DIR, GRAPH_DIR, EMBEDDINGS_DIR]:
    dir_path.mkdir(parents=True, exist_ok=True)  # parents=True also creates missing intermediate folders

print(f"Raw data: {RAW_DIR}")
print(f"Output directory: {GRAPH_DIR}")

## 3. List Available Data Files

Let's examine the available data files. IFC files contain 3D building models and associated data, while PDF files typically contain product specifications or documentation.

**Key Point:** We have two complementary data sources: IFC files for structured building geometry and property data, and PDF files for unstructured textual specifications.

In [None]:
# List and display available files
ifc_files = list(RAW_DIR.glob("*.ifc"))
pdf_files = list(RAW_DIR.glob("*.pdf"))

# Create DataFrame with file information
file_info = [{
    "Filename": file.name,
    "Type": file.suffix[1:].upper(),
    "Size (MB)": round(file.stat().st_size / (1024 * 1024), 2)
} for file in ifc_files + pdf_files]

file_df = pd.DataFrame(file_info).sort_values(["Type", "Size (MB)"], ascending=[True, False])
display(file_df)

## 4. Processing IFC Files

We'll define a single function, `process_files`, that drives every conversion in this notebook: IFC to TTL, PDF to TTL, and TTL to embeddings. It caches results (`force_reprocess=False` by default), so files that already have an output are skipped instead of being processed again.

In [108]:
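def process_files(files, converter_func, output_dir, output_format, force_reprocess=False):
    """Convert files with converter_func, reusing cached outputs when present.
    
    Args:
        files: List of input files (Path objects) to process
        converter_func: Function called as converter_func(input_path, output_path)
        output_dir: Directory where output files are saved
        output_format: Output file extension without the dot (e.g. "ttl" or "json")
        force_reprocess: Whether to regenerate outputs that already exist

    Returns:
        DataFrame with one row per file (status, processing time, output size)
    """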
    results = []
    
    for file in files:
        output_file = output_dir / f"{file.stem}.{output_format}"
        
        # Use cache if file exists and not forced to reprocess
        if output_file.exists() and not force_reprocess:
            print(f"✓ Using cached {file.name}")
            results.append({
                "Filename": file.name,
                "Status": "Cached",
                "Processing Time": "N/A",
                "Output Size (MB)": round(output_file.stat().st_size / (1024 * 1024), 2)
            })
            continue
        
        try:
            print(f"Processing {file.name}...")
            start_time = time.time()
            converter_func(str(file), str(output_file))            
            processing_time = round(time.time() - start_time, 2)
            print(f"✓ Processed {file.name} in {processing_time} seconds")
            
            results.append({
                "Filename": file.name,
                "Status": "Success",
                "Processing Time": f"{processing_time} sec",
                "Output Size (MB)": round(output_file.stat().st_size / (1024 * 1024), 2)
            })
            
        except Exception as e:
            print(f"❌ Error processing {file.name}: {str(e)}")
            results.append({
                "Filename": file.name,
                "Status": f"Error: {str(e)[:50]}...",
                "Processing Time": "N/A",
                "Output Size (MB)": "N/A"
            })
    
    return pd.DataFrame(results)

For IFC files, this function utilizes the `ifc_to_ttl` converter which focuses on extracting structured semantic data from the IFC model.


**Behind the Scenes (`ifc_to_ttl`):**
- Parses the `.ifc` file using `ifcopenshell`.
- Maps IFC entities (e.g., `IfcWall`, `IfcSpace`) and their relationships to RDF triples, using standard ontologies like BOT (Building Topology Ontology) and aligning with ifcOWL where possible.
- Extracts Property Sets (`IfcPropertySet`) and their individual properties (e.g., `IfcPropertySingleValue`), converting values (text, numbers, booleans) into appropriate RDF literals.
- Attempts to include unit information (e.g., using QUDT ontology mappings) for relevant properties.
- Adds provenance details (source file, tool version, timestamp).
- Validates the resulting graph against SHACL shapes (`src/graph_converter/ifc_converter_shapes.ttl`) before saving the `.ttl` file.


In [None]:
ifc_results = process_files(ifc_files, ifc_to_ttl, GRAPH_DIR, "ttl", force_reprocess=False)
display(ifc_results)

## 5. Processing PDF Documents

Similarly, we'll process PDF files using the same `process_files` function. This time, it internally calls the `pdf_to_ttl` converter, which handles PDF-specific extraction and conversion.

**Behind the Scenes (`pdf_to_ttl`):**
- Opens the PDF using `PyMuPDF` (fitz).
- Extracts document metadata (title, author, etc.) and page-level details (dimensions, rotation).
- Extracts text content block by block, retaining basic formatting metadata (font, size, color, position).
- Chunks the extracted text from each page into smaller, semantically coherent segments using a sentence splitter (`wtpsplit_lite`).
- Represents the document structure (document → page → chunk) and content in RDF using standard vocabularies like FABIO (document types), DCTERMS (metadata), and CNT (text content `cnt:chars`).
- Adds formatting and positional metadata using a custom `pdo:` namespace.
- Validates the resulting graph against SHACL shapes (`src/graph_converter/pdf_converter_shapes.ttl`) before saving.
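
To make the chunking step more concrete, here is a minimal sketch that groups sentences into size-limited chunks. It uses a naive regular-expression splitter as a stand-in for `wtpsplit_lite`, so the real converter will split differently. The `chunk_sentences` helper and its `max_chars` parameter are hypothetical.

```python
import re


def chunk_sentences(text, max_chars=200):
    """Group sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk
            current = sentence       # start a new one with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


page_text = "The door is fire rated. It is 900 mm wide. Install per the manual."
chunks = chunk_sentences(page_text, max_chars=45)
for i, chunk in enumerate(chunks, start=1):
    print(f"chunk {i}: {chunk}")
```

Each chunk would then become one node under its page in the document → page → chunk structure, with the text stored in `cnt:chars`.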

**Key Point:** The `pdf_to_ttl` converter focuses on extracting and structuring textual information from PDFs.

In [None]:
pdf_results = process_files(pdf_files, pdf_to_ttl, GRAPH_DIR, "ttl", force_reprocess=False)
display(pdf_results)

## 6. Analysis of Processed Data

Let's analyze the resulting RDF graphs (stored as TTL files). We'll compare metrics like file size and triple count between the data derived from IFC and PDF sources using the `analyze_graph` helper function.

**Key Point:** This analysis quantifies the richness of data extracted. Notice how IFC files typically yield significantly more structured triples compared to PDFs, as reflected in the IFC:PDF ratio table.

In [None]:
# Analyze processed files
ttl_files = list(GRAPH_DIR.glob("*.ttl"))

# Function to safely load and analyze a graph file
def analyze_graph(ttl_file, ifc_stems):
    try:
        g = Graph()
        g.parse(str(ttl_file), format="turtle")
        source_type = "IFC" if ttl_file.stem in ifc_stems else "PDF"
        return {
            "Filename": ttl_file.name,
            "Source Type": source_type,
            "Size (MB)": round(ttl_file.stat().st_size / (1024 * 1024), 2),
            "Triple Count": len(g)
        }
    except Exception as e:
        print(f"Error analyzing {ttl_file.name}: {str(e)}")
        return {
            "Filename": ttl_file.name,
            "Source Type": "Unknown",
            "Size (MB)": round(ttl_file.stat().st_size / (1024 * 1024), 2),
            "Triple Count": 0
        }

# Get list of IFC file stems for source type detection
ifc_stems = [ifc.stem for ifc in ifc_files]

# Gather information about all TTL files
graph_info = [analyze_graph(ttl_file, ifc_stems) for ttl_file in ttl_files]
graph_df = pd.DataFrame(graph_info).sort_values(["Source Type", "Triple Count"], ascending=[True, False])
display(graph_df)

# Visualize comparison between IFC and PDF data
plt.figure(figsize=(12, 6))

# Size comparison
plt.subplot(1, 2, 1)
graphs_by_source = graph_df.groupby("Source Type")["Size (MB)"].sum()
graphs_by_source.plot(kind="bar", color=["#2196F3", "#FF9800"])
plt.title("Total Size by Source Type")
plt.ylabel("Size (MB)")
plt.grid(axis="y", linestyle="--", alpha=0.7)

# Triple count comparison
plt.subplot(1, 2, 2)
triples_by_source = graph_df.groupby("Source Type")["Triple Count"].sum()
triples_by_source.plot(kind="bar", color=["#2196F3", "#FF9800"])
plt.title("Total Triples by Source Type")
plt.ylabel("Number of Triples")
plt.grid(axis="y", linestyle="--", alpha=0.7)

plt.tight_layout()
plt.show()

# Create ratio comparison DataFrame
try:
    # Calculate size and triple ratios
    size_by_type = graphs_by_source.to_dict()
    triples_by_type = triples_by_source.to_dict()
    
    # Only calculate ratios if both IFC and PDF data exist
    if 'IFC' in size_by_type and 'PDF' in size_by_type and size_by_type['PDF'] > 0:
        size_ratio = f"{size_by_type['IFC'] / size_by_type['PDF']:.2f}:1"
    else:
        size_ratio = "N/A"
        
    if 'IFC' in triples_by_type and 'PDF' in triples_by_type and triples_by_type['PDF'] > 0:
        triple_ratio = f"{triples_by_type['IFC'] / triples_by_type['PDF']:.2f}:1"
    else:
        triple_ratio = "N/A"
    
    # Create ratio comparison DataFrame
    ratio_data = {
        'Metric': ['Size (MB)', 'Triple Count'],
        'IFC Total': [size_by_type.get('IFC', 0), triples_by_type.get('IFC', 0)],
        'PDF Total': [size_by_type.get('PDF', 0), triples_by_type.get('PDF', 0)],
        'Ratio (IFC:PDF)': [size_ratio, triple_ratio]
    }
    
    display(pd.DataFrame(ratio_data))
    print("This comparison shows how IFC files typically generate much richer semantic data compared to PDF documents.")
    
except Exception as e:
    print(f"Error calculating ratios: {str(e)}")

## 7. Sample Triple Analysis

Let's examine sample triples (Subject-Predicate-Object statements) to understand the structure of our semantic graphs. This helps visualize the relationships extracted from the source files.

**Key Point:** These triples form the atomic units of our knowledge graph, representing facts and relationships about the building components and documents.

In [None]:
def analyze_sample_triples(file_path, sample_size=10):
    """Display the total triple count and a sample of triples from an RDF file"""
    g = Graph()
    g.parse(str(file_path), format="turtle")
    
    print(f"Analyzing {file_path.name} - Total triples: {len(g)}")
    
    # Show sample triples
    print("Sample triples:")
    for i, (s, p, o) in enumerate(list(g)[:sample_size]):
        print(f"  {i+1}. {s} → {p} → {o}")
        
# Sample one IFC and one PDF file
ifc_sample = next((f for f in ttl_files if f.stem in ifc_stems), None)
pdf_sample = next((f for f in ttl_files if f.stem not in ifc_stems), None)

if ifc_sample:
    analyze_sample_triples(ifc_sample)
    print("-" * 80)
    
if pdf_sample:
    analyze_sample_triples(pdf_sample)

## 8. Entity-Centric Embedding Generation

To prepare our knowledge graph for semantic search and Retrieval-Augmented Generation (RAG), we'll generate *entity-centric* embeddings using the `ttl_to_embeddings` converter. This involves:
1. Grouping all triples by their subject (the entity). Each entity might be a building element (like a wall or door from IFC) or a document chunk (from PDF).
2. For each entity, creating a single text representation by concatenating its type, label, and relevant properties (selected based on a predefined list `ALLOWED_PREDICATES` in the code).
   - For IFC entities, this process includes traversing into linked Property Sets (`IfcPropertySet`) to gather related properties, providing richer context.
   - For PDF text chunks (`cnt:ContentAsText`), only the actual text content (`cnt:chars`) is typically used.
3. Using a pre-trained language model (specifically, OpenAI's `text-embedding-3-small` by default) to generate a vector embedding for this consolidated entity text.
4. Saving the entity URI, its readable name, the generated text, and the embedding vector into a JSON file.

This approach creates comprehensive vector representations capturing the semantic context around each entity.
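
Steps 1 and 2 can be sketched with plain Python tuples. This is only an approximation of what `ttl_to_embeddings` does: the predicate set below is a hypothetical subset (the real `ALLOWED_PREDICATES` list is defined in the converter code), the `build_entity_texts` helper is invented for this example, and property set traversal is left out.

```python
from collections import defaultdict

# Hypothetical subset for illustration; the real ALLOWED_PREDICATES
# list is defined in the converter code and may differ.
ALLOWED_PREDICATES = {"rdf:type", "rdfs:label", "props:FireRating"}


def build_entity_texts(triples):
    """Group (subject, predicate, object) tuples by subject and join the allowed ones."""
    grouped = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate in ALLOWED_PREDICATES:
            grouped[subject].append(f"{predicate}: {obj}")
    return {subject: "; ".join(parts) for subject, parts in grouped.items()}


triples = [
    ("ex:door_1", "rdf:type", "bot:Element"),
    ("ex:door_1", "rdfs:label", "Fire Door"),
    ("ex:door_1", "props:FireRating", "EI60"),
    ("ex:door_1", "ex:internalId", "12345"),  # filtered out: not an allowed predicate
]

entity_texts = build_entity_texts(triples)
print(entity_texts["ex:door_1"])
```

The resulting string is what would be sent to the embedding model, one request payload per entity.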

**Key Point:** These embeddings enable semantic similarity searches over building entities and document chunks, which is crucial for the Graph RAG system in the next lab.
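
To show what a semantic similarity search over these embeddings looks like, here is a minimal cosine-similarity ranking in pure Python. The record fields `uri` and `embedding` are assumptions about the JSON layout, and the three-dimensional vectors are toy values. Real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, records, k=2):
    """Return the k (score, uri) pairs most similar to query_vector, best first."""
    scored = [(cosine_similarity(query_vector, r["embedding"]), r["uri"]) for r in records]
    return sorted(scored, reverse=True)[:k]


# Toy records; field names are assumed, not taken from the converter output.
records = [
    {"uri": "ex:door_1", "embedding": [1.0, 0.0, 0.0]},
    {"uri": "ex:wall_1", "embedding": [0.0, 1.0, 0.0]},
    {"uri": "ex:chunk_1", "embedding": [0.8, 0.6, 0.0]},
]

results = top_k([1.0, 0.0, 0.0], records)
for score, uri in results:
    print(f"{uri}: {score:.2f}")
```

Because IFC entities and PDF chunks share one vector space, a single query can surface both a building element and the document passage that describes it.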

In [None]:
embedding_results = process_files(ttl_files, ttl_to_embeddings, EMBEDDINGS_DIR, "json", force_reprocess=False)
display(embedding_results)


## 9. Summary

We have successfully converted raw IFC and PDF building data into structured RDF graphs and generated entity-centric embeddings. This prepares the data for graph-based knowledge retrieval.

**Key Point:** We've transformed diverse building data into a unified, queryable format ready for Graph RAG.

In [None]:
# Create summary statistics
embedding_files = list(EMBEDDINGS_DIR.glob("*.json"))
total_raw_size = sum(f.stat().st_size for f in (ifc_files + pdf_files)) / (1024 * 1024)
total_graph_size = sum(f.stat().st_size for f in ttl_files) / (1024 * 1024)
total_embedding_size = sum(f.stat().st_size for f in embedding_files) / (1024 * 1024)

summary_data = {
    'Metric': [
        'Total files processed',
        'IFC files processed',
        'PDF files processed',
        'Total RDF files created',
        'Total embedding files created',
        'Total raw data size (MB)',
        'Total RDF graph size (MB)',
        'Total embeddings size (MB)'
    ],
    'Value': [
        len(ifc_files) + len(pdf_files),
        len(ifc_files),
        len(pdf_files),
        len(ttl_files),
        len(embedding_files),
        round(total_raw_size, 2),
        round(total_graph_size, 2),
        round(total_embedding_size, 2)
    ]
}

display(pd.DataFrame(summary_data))
print("The processed data is now ready for Graph RAG in the next lab session.")

## 10. Troubleshooting

If you encounter issues while running this notebook, here are common problems and solutions:

### Missing Dependencies
- **Problem**: Import errors like `ModuleNotFoundError`.
- **Solution**: Ensure the setup cell (Section 0) ran successfully and installed packages from `requirements.txt`. If running locally outside the standard project structure, ensure `requirements.txt` is found. You might need to run `%pip install -r path/to/requirements.txt` manually.

### Processing Errors (Memory)
- **Problem**: Large IFC files cause out-of-memory errors during conversion.
- **Solution**: Restart the kernel and try processing smaller files first. Ensure your machine or Colab instance has sufficient RAM (at least 8GB recommended, more for larger files). Close other memory-intensive applications.
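
One way to try smaller files first is to sort the inputs by size before passing them to `process_files`. The `smallest_first` helper below is hypothetical, and the demo uses temporary placeholder files rather than real IFC models.

```python
import tempfile
from pathlib import Path


def smallest_first(files):
    """Return files ordered by size on disk, smallest first."""
    return sorted(files, key=lambda f: Path(f).stat().st_size)


# Demo with throwaway files standing in for IFC models.
with tempfile.TemporaryDirectory() as tmp:
    big = Path(tmp) / "big.ifc"
    small = Path(tmp) / "small.ifc"
    big.write_bytes(b"x" * 2048)
    small.write_bytes(b"x" * 16)
    ordered = [f.name for f in smallest_first([big, small])]

print(ordered)
```

In the notebook this would look like `process_files(smallest_first(ifc_files), ifc_to_ttl, GRAPH_DIR, "ttl")`, so a memory failure on a large model does not block the smaller ones.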

### Embedding Generation Errors
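- **Problem**: `OPENAI_API_KEY` is not set, so embedding requests fail.
- **Solution**: Make sure the key is available before running Section 8. In Colab, add `OPENAI_API_KEY` as a notebook secret, which the setup cell reads with `userdata.get`. Locally, the OpenAI client reads `OPENAI_API_KEY` from the environment. See the repository README for how to set it.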
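
A quick check like the one below can confirm whether the key is visible to the kernel before any embedding call is made. The `api_key_status` helper is hypothetical and deliberately never prints the key itself.

```python
import os


def api_key_status(env=None):
    """Return 'set' or 'missing' for OPENAI_API_KEY without revealing its value."""
    env = os.environ if env is None else env
    return "set" if env.get("OPENAI_API_KEY") else "missing"


print(f"OPENAI_API_KEY is {api_key_status()}")
```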

### File Not Found Errors
- **Problem**: Input data files (`.ifc`, `.pdf`) or output directories are not found.
- **Solution**: Double-check the directory structure defined in Section 2 (`Define Data Paths`). Ensure the `data/raw/buildingsmart_duplex/` directory contains the necessary source files relative to your project root. The expected structure is:
  ```
  project_root/
  ├── data/
  │   ├── raw/buildingsmart_duplex/  (Input IFC & PDF files)
  │   ├── graph/buildingsmart_duplex/ (Output TTL files)
  │   └── embeddings/buildingsmart_duplex/ (Output JSON embedding files)
  ├── notebooks/
  │   └── 01_data_integration.ipynb
  ├── src/
  └── requirements.txt
  ```

## Next Steps: Graph RAG

The processed RDF graphs and embeddings created in this notebook form the foundation for the next lab session on Graph RAG. In that session, we will:
1. Run vector-based RAG using the embeddings generated here.
2. Load the TTL files into a unified knowledge graph database and use it for query-based RAG.