# Databricks GitIngest and Chunking ETL

## Get ready to put on your Data Engineering Hat! 🧢

This tutorial walks through setting up GitIngest in Databricks, extracting raw data from a repository, processing it into structured chunks, and finally loading it into the `llm.rag.knowledge_base` Delta table for LLM-based RAG (Retrieval-Augmented Generation).

```
PACKAGE VERSIONS:
- langchain==0.3.14   
- re==2.2.1
- pyspark==3.5.2
```

In [0]:
import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
import re
from functools import reduce
from pyspark.sql.types import StringType, StructField, StructType, ArrayType, IntegerType
from pyspark.sql.functions import udf, col, array_size, posexplode, concat, lit
import pyspark

In [0]:
print(f"""
PACKAGE VERSIONS:
- langchain=={langchain.__version__}   
- re=={re.__version__}
- pyspark=={pyspark.__version__}
""")

# 1. Creating the LLM Catalog and Database for RAG
---
Before implementing **Retrieval-Augmented Generation (RAG)** with Large Language Models (LLMs), we need to establish a **catalog and database** in Databricks. These organizational structures allow us to efficiently store, manage, and retrieve knowledge sources that enhance LLM responses.

In **Databricks**, a **catalog** and a **database** play a crucial role in structuring and governing **retrievable data** for LLM workflows. Here's how:

### 1. **Catalog (High-Level Data Governance)**
   - A **catalog** serves as the **top-level container** for organizing data assets in Databricks.
   - It provides a **centralized metadata store** that ensures **access control, security, and compliance** for RAG-based knowledge retrieval.
   - In **Unity Catalog**, it facilitates structured **data lineage tracking**, allowing LLMs to retrieve and reference the most **accurate, versioned, and curated data**.
   - Think of a catalog as a **collection of databases** containing the knowledge sources for RAG.

   **Example:**
   ```sql
   CREATE CATALOG llm;
   ```

### 2. **Database (Schema for Knowledge Management)**
  - A **database (schema)** inside a catalog organizes **tables, vector embeddings, and indexed documents** used in RAG.
  - It helps manage **structured (tabular), semi-structured (JSON), and unstructured (text, embeddings)** data for efficient retrieval.
  - By structuring LLM knowledge bases into a database schema, we ensure **efficient lookups, version control, and traceability** of referenced knowledge.

  **Examples:**
  
  - Unified approach
    ```sql
    CREATE DATABASE llm.rag;
    ```
  - Segmented approach
    ```sql
    CREATE DATABASE llm.knowledge;
    ```

#### 2.1 Using a Single Knowledge Base Table (Unified Approach)

✅ Best for: Generalized knowledge retrieval, simple governance, and ease of management.

Implementation:

- Store all knowledge sources (structured, semi-structured, and unstructured) in one table.
- Use columns to differentiate document types, sources, or categories.
- Implement vector embeddings within the same table for efficient similarity search.

Example Table Schema: (`llm.rag.knowledge_base`)

| chunk_id (BIGINT) | document_text (STRING)           | source_type (STRING) | metadata (JSON)         | vector_embedding (ARRAY<FLOAT>) |
|-------------------|----------------------------------|----------------------|-------------------------|---------------------------------|
| 1                 | "Databricks simplifies AI..."    | Documentation        | {"category": "AI"}      | [0.12, 0.34, ...]               |
| 2                 | "Stock market trends in 2024..." | Finance Reports      | {"category": "Finance"} | [0.87, 0.45, ...]               |
| 3                 | "Python's async features..."     | Blog Article         | {"category": "Tech"}    | [0.65, 0.23, ...]               |

Pros of a Single Table:
-   ✔ Simple query structure – Easier to maintain and retrieve data.
-   ✔ Centralized management – All knowledge exists in one place, reducing redundancy.
-   ✔ Efficient vector search – Unified embeddings allow for cross-domain retrieval.

Cons of a Single Table:
-   ❌ Potential performance bottlenecks – Large datasets may slow down retrieval.
-   ❌ Complex access control – If different teams need different permissions, fine-grained control is harder.
-   ❌ Schema complexity – Handling different data types and sources in a unified structure requires robust schema design.
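
To make the unified approach more concrete, here is a minimal pure-Python sketch of similarity search over rows shaped like the example table. Everything in it is illustrative: the two-dimensional vectors, the query vector, and the `cosine_similarity` and `top_k` helpers are hypothetical, and a real pipeline would use an embedding model and a vector index instead of a linear scan.

```python
import math

# Toy rows mirroring the example schema; the 2-D embeddings are made up for illustration
rows = [
    {"chunk_id": 1, "source_type": "Documentation", "vector_embedding": [1.0, 0.0]},
    {"chunk_id": 2, "source_type": "Finance Reports", "vector_embedding": [0.0, 1.0]},
    {"chunk_id": 3, "source_type": "Blog Article", "vector_embedding": [0.6, 0.8]},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, rows, k=2, source_type=None):
    """Return the chunk_ids of the k most similar rows, optionally filtered by source_type."""
    candidates = [r for r in rows if source_type is None or r["source_type"] == source_type]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query, r["vector_embedding"]),
        reverse=True,
    )
    return [r["chunk_id"] for r in ranked[:k]]


query = [0.9, 0.1]  # hypothetical query embedding
print(top_k(query, rows))                                 # cross-domain search
print(top_k(query, rows, source_type="Finance Reports"))  # search restricted by a column
```

The optional `source_type` filter shows how a single table can still serve domain-specific queries by filtering on a column before ranking.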

#### 2.2 Using Multiple Knowledge Base Tables (Segmented Approach)

✅ Best for: Domain-specific retrieval, strict governance, and optimized search performance.

Implementation:

- Create separate tables per knowledge type (e.g., Finance, Healthcare, Documentation).
- Use a common schema across tables for consistency.
- Maintain separate embedding indexes for efficient similarity search.

Example Table Structures:
1. Finance Knowledge Table (`llm.knowledge.finance`)

| chunk_id (BIGINT) | document_text (STRING)           | metadata (JSON)         | vector_embedding (ARRAY<FLOAT>) |
|-------------------|----------------------------------|-------------------------|---------------------------------|
| 1                 | "Stock market trends in 2024..." | {"category": "Finance"} | [0.87, 0.45, ...]               |

2. Tech Knowledge Table (`llm.knowledge.tech`)

| chunk_id (BIGINT) | document_text (STRING)       | metadata (JSON)      | vector_embedding (ARRAY<FLOAT>) |
|-------------------|------------------------------|----------------------|---------------------------------|
| 2                 | "Python's async features..." | {"category": "Tech"} | [0.65, 0.23, ...]               |

Pros of Multiple Tables:
- ✔ Better performance – Queries are optimized for domain-specific retrieval.
- ✔ Simplified governance – Different access controls per table (e.g., finance team sees only finance knowledge).
- ✔ Domain-specialized retrieval – RAG can query only relevant tables, improving accuracy.
- ✔ Parallel processing – Queries run faster when searching only necessary datasets.
- ✔ Better for LangChain graphs – Each table can back its own retriever, which maps cleanly onto the nodes of a LangChain graph.

Cons of Multiple Tables:
- ❌ More complex maintenance – Requires managing multiple table schemas and pipelines.
- ❌ Increased storage overhead – Some knowledge may exist in multiple tables, causing duplication.
- ❌ Cross-domain retrieval is harder – Combining multiple knowledge sources requires querying several tables and merging the results.
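
The cross-domain trade-off can be sketched in a few lines. The helper below is hypothetical (`build_retrieval_query` is not a Databricks API); it only builds the SQL text needed to read one or several domain tables, assuming they share a common schema and follow the `llm.knowledge.<domain>` naming used in the examples above.

```python
def build_retrieval_query(domains, catalog="llm", database="knowledge"):
    """Build a SELECT over one or more domain tables that share the same schema."""
    if not domains:
        raise ValueError("at least one domain is required")
    # Only allow plain identifiers, since the names are interpolated into SQL text
    if not all(d.isidentifier() for d in domains):
        raise ValueError("domain names must be plain identifiers")
    selects = [
        f"SELECT chunk_id, document_text, '{d}' AS domain FROM {catalog}.{database}.{d}"
        for d in domains
    ]
    # A single domain needs one SELECT; several domains are stacked with UNION ALL
    return "\nUNION ALL\n".join(selects)


print(build_retrieval_query(["finance"]))
print(build_retrieval_query(["finance", "tech"]))
```

With one domain the query touches a single table, while every extra domain adds another `UNION ALL` branch, which is the additional cost of cross-domain retrieval in the segmented approach.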





In [0]:
# Defining catalog and database
catalog = "llm"
database = "rag"

# Create catalog and database if not exists
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {catalog}.{database}")

print(f"Catalog {catalog} and database {database} created successfully!")

# 2. Creating a Volume for RAG in Databricks
---

A **Volume** in Databricks serves as a **storage layer** for **unstructured and semi-structured knowledge sources** used in **Retrieval-Augmented Generation (RAG)**. It allows efficient management of **documents, embeddings, and reference materials** that LLMs retrieve during inference.

A **Volume** is part of **Unity Catalog**, ensuring **governance, security, and metadata tracking** for knowledge sources. It supports:
- **Raw textual documents** (PDFs, Markdown, JSON, CSV, etc.)  
- **Preprocessed embeddings** (stored as Parquet files)  
- **Reference images, code snippets, or domain-specific data**  

### **Key Features of Volumes for RAG**
1. **Organized Knowledge Storage**  
   - Volumes store **retrievable knowledge sources** such as text documents, embeddings, and metadata in a **Databricks-managed** environment.
   - **Tables are not stored in volumes**; instead, raw data is stored here before being indexed into vector stores or Delta tables.

2. **Managed by Unity Catalog**  
   - Unity Catalog provides **governance, access control, and security** for RAG knowledge sources.
   - Enables **fine-grained permissions** to control who can retrieve different types of knowledge.

3. **Supports Multi-Format Knowledge Storage**  
   - Stores **unstructured knowledge** (PDFs, Markdown, JSON) and **vector embeddings** in Parquet format.
   - Ideal for handling **staging data, embeddings, and reference documents** in RAG pipelines.

4. **Integration with Vector Search and LLM Workflows**  
   - Allows **bulk loading of knowledge** into Delta tables for vector indexing.
   - **Seamless connection** with **FAISS, ChromaDB, or Databricks Vector Search** for efficient retrieval.

---

### **How Volumes Enhance RAG in Databricks**
✅ **Efficient Knowledge Storage** – Organizes raw **textual and embedding data** for LLM retrieval.  
✅ **Governed Access** – Ensures **secure** knowledge management via **Unity Catalog**.  
✅ **Supports Hybrid RAG** – Stores **text and multimodal data (images, PDFs, code, etc.)** for multi-source LLM retrieval.  

---

### **Example: Storing Knowledge Sources in a Volume**
#### **1️⃣ Creating a Volume**
```sql
CREATE VOLUME llm.knowledge.source_files;
```

#### **2️⃣ Uploading RAG Knowledge Files**
```python
dbutils.fs.cp("file:/local/path/documents.pdf", "dbfs:/Volumes/llm/knowledge/source_files/")
```

#### **3️⃣ Reading Knowledge Files for Processing**
```python
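# spark.read.text loads plain-text files line by line; binary formats such as PDF need a parser first
documents = spark.read.text("dbfs:/Volumes/llm/knowledge/source_files/")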
documents.show()
```

#### **4️⃣ Converting Knowledge into a Delta Table for Vector Indexing**
```python
documents.write.format("delta").saveAsTable("llm.knowledge.document_embeddings")
```
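
The paths in the steps above follow the Unity Catalog convention `/Volumes/<catalog>/<schema>/<volume>/<path>`. As a small sketch of that mapping, the hypothetical helper below (`volume_path` is an illustrative name, not a Databricks function) converts a three-level volume name into its file path.

```python
import posixpath


def volume_path(volume_name, *parts):
    """Map a three-level volume name (catalog.schema.volume) to its /Volumes path."""
    levels = volume_name.split(".")
    if len(levels) != 3 or not all(levels):
        raise ValueError(f"expected catalog.schema.volume, got {volume_name!r}")
    return posixpath.join("/Volumes", *levels, *parts)


print(volume_path("llm.knowledge.source_files"))
print(volume_path("llm.knowledge.source_files", "documents.pdf"))
```

Deriving the path from the volume name keeps the SQL that creates the volume and the Python that reads from it pointing at the same location.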

By using **Volumes in Databricks**, we enable **scalable, secure, and optimized knowledge storage** for **RAG-based LLM pipelines**, ensuring that retrieval is both **fast and reliable**. 🚀  


In [0]:
# Define external location for the volume
external_location = "s3://silveraiwolf/landingzone/llm/rag/"
volume_path = f"{catalog}.{database}.source_files"

# Create external volume if not exists
spark.sql(f"CREATE EXTERNAL VOLUME IF NOT EXISTS {volume_path} LOCATION '{external_location}'")

print(f"External volume {volume_path} created successfully!")

# 3. Setting Up GitIngest
---

GitIngest turns a Git repository into a single text digest, which makes it easy to pull a repository's raw content into the workspace for ETL.

1. Open a web browser and go to [GitIngest Repository](https://gitingest.com/samlexrod/silveraiwolf-learning).

2. Locate and click on the **Download** button.

3. Save the downloaded file to a preferred directory for further processing.

4. Upload the downloaded `.txt` file into the `raw/` folder of the previously created volume (`/Volumes/llm/rag/source_files/raw/`).

> 🎁 Feel free to use the https://gitingest.com/samlexrod/silveraiwolf-learning repository for this tutorial. It is open source and always will be.
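
In the downloaded digest, each repository file is introduced by a `File: <path>` header framed by two lines of 48 `=` characters, and a directory listing precedes the first header. The sketch below splits a tiny hand-written sample with the same regular expression the ETL cell uses later; the sample contents and the exact shape of the directory listing are made up for illustration.

```python
import re

# A tiny hand-written digest in the GitIngest layout (contents are hypothetical)
sep = "=" * 48
sample_digest = (
    "Directory structure:\n- README.md\n- src/app.py\n\n"
    f"{sep}\nFile: README.md\n{sep}\n# Title\nHello\n\n"
    f"{sep}\nFile: src/app.py\n{sep}\nprint('hi')\n"
)

# The capture group keeps each file name in the result of re.split
pattern = r"={48}\nFile: (.+?)\n={48}\n"
sections = re.split(pattern, sample_digest)

# sections[0] is everything before the first header (the directory listing);
# after that the list alternates: file name, content, file name, content, ...
directory_section = sections[0]
files = {name: body.strip() for name, body in zip(sections[1::2], sections[2::2])}

print(directory_section.strip())
print(files)
```

Because the pattern contains one capture group, `re.split` returns the file names alongside the contents, so pairing odd and even positions rebuilds a `{file name: content}` mapping.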

# 4. Create Base Tables
---
Enabling Change Data Feed (CDF), the Delta Lake feature that provides Change Data Capture (CDC), on the base Delta tables allows efficient tracking of data changes over time. Here’s why it’s important:

- **Incremental Updates** – Instead of reprocessing the entire dataset, CDC enables tracking only the new, modified, or deleted records, improving performance and reducing compute costs.
- **Data Lineage & Auditing** – CDC logs changes over time, making it easier to trace data history, track modifications, and support compliance requirements.
- **Real-time Processing** – Delta tables with CDC integrate seamlessly with streaming workloads, allowing for near real-time updates in analytical or machine learning applications.
- **Optimized for RAG Pipelines** – Since LLMs often work with evolving datasets, CDC helps in efficiently updating knowledge bases without redundant processing.
- **Improved Scalability** – Large datasets benefit from incremental ingestion rather than full reloads, reducing the impact on storage and computational resources.
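
To illustrate what change tracking records, the pure-Python sketch below compares two snapshots of a keyed dataset and emits one row per change. It is a conceptual model only, not how Delta computes its feed: `diff_snapshots` and the sample data are hypothetical, while the `_change_type` labels mirror the values Delta's Change Data Feed uses (`insert`, `update_preimage`, `update_postimage`, `delete`).

```python
def diff_snapshots(old, new):
    """Compare two {key: value} snapshots and emit one row per detected change."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes.append({"key": key, "value": new[key], "_change_type": "insert"})
        elif key not in new:
            changes.append({"key": key, "value": old[key], "_change_type": "delete"})
        elif old[key] != new[key]:
            # An update is recorded as a before image and an after image
            changes.append({"key": key, "value": old[key], "_change_type": "update_preimage"})
            changes.append({"key": key, "value": new[key], "_change_type": "update_postimage"})
    return changes


# Keys are (file_name, chunk_id) pairs, values are chunk contents (made-up data)
old = {("README.md", 1): "intro v1", ("README.md", 2): "setup"}
new = {("README.md", 1): "intro v2", ("docs/faq.md", 1): "faq"}

changes = diff_snapshots(old, new)
for change in changes:
    print(change)
```

A downstream consumer only has to process these change rows, rather than the full table, which is where the incremental-update savings come from.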

In [0]:
# Define knowledge table name
table_name = "knowledge_base"

# Create Delta table with Change Data Capture (CDC) enabled
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {catalog}.{database}.{table_name} (
    id BIGINT GENERATED ALWAYS AS IDENTITY,
    file_name STRING NOT NULL,
    file_url STRING NOT NULL,
    chunk_id INTEGER NOT NULL,
    content STRING NOT NULL,
    content_type STRING NOT NULL,
    total_file_chunks INTEGER NOT NULL,
    updated_by STRING,
    updated_at TIMESTAMP,
    inserted_at TIMESTAMP DEFAULT current_timestamp
)
USING DELTA
TBLPROPERTIES (delta.enableChangeDataFeed=true, delta.feature.allowColumnDefaults='supported')
""")

# 5. Incremental *.txt ETL with SQL Merge

This section walks through a full data pipeline that processes raw text files, splits them into chunks, and merges the structured data into a Delta Table. The main goals of this process are:

- ✅ Text Processing: Read and split large text files into manageable chunks for better retrieval.
- ✅ Schema Structuring: Maintain metadata such as file_name, chunk_id, and total_file_chunks.
- ✅ Efficient Storage & Retrieval: Store the processed data in a Delta Table for efficient querying.
- ✅ Change Tracking: Use MERGE INTO to update, insert, or delete records dynamically.
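
Before reading the ETL cell, it may help to see what `chunk_size` and `chunk_overlap` mean. The sketch below is a simplified fixed-width sliding window, not the `RecursiveCharacterTextSplitter` algorithm (which prefers to break on separators such as blank lines and spaces); `sliding_window_chunks` is a hypothetical helper used only to show how consecutive chunks share text and how `chunk_id` and `total_file_chunks` are derived.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks where each chunk repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=4)

# Mirror the table columns: 1-based chunk_id, content, total_file_chunks
records = [(chunk_id, content, len(chunks)) for chunk_id, content in enumerate(chunks, start=1)]
for record in records:
    print(record)
```

Each chunk starts with the last four characters of the previous one, so text that straddles a boundary still appears whole in at least one neighbouring window as long as it is shorter than the overlap.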

In [0]:
# Define a UDF to process the chunks
@udf(ArrayType(StringType()))
def process_chunks(content: str) -> list[str]:
    """
    Splits the text into overlapping chunks.

    Parameters:
    content (str): The text to be split into chunks.

    Returns:
    list[str]: A list of chunks.
    """
    splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", " ", ""],
        chunk_size=1000,
        chunk_overlap=200,  # Overlap preserves context across chunk boundaries
        length_function=len
    )
    return splitter.split_text(content)


# Path to the raw files inside the volume created earlier
directory_path = "/Volumes/llm/rag/source_files/raw/"

# PROCESSING NEW FILES
# Every file in the raw directory is treated as new; no processed-files filter is applied here
new_files = os.listdir(directory_path)

# Printing the list of new files
# (joined outside the f-string: backslashes inside f-string expressions fail before Python 3.12)
bullet_list = ",\n".join(f" - {name}" for name in new_files)
print(f"New files: \n{bullet_list} \n")

# Template for the knowledge base table (file_url included so the union and merge line up)
schema = StructType([
    StructField("file_name", StringType(), True),
    StructField("file_url", StringType(), True),
    StructField("chunk_id", IntegerType(), True),
    StructField("content", StringType(), True),
    StructField("content_type", StringType(), True),
    StructField("total_file_chunks", IntegerType(), True)
])
df_knoledge_base = spark.createDataFrame([], schema)
column_order = df_knoledge_base.columns

for file_name in new_files:
    txt_path = os.path.join(directory_path, file_name)

    # Read the file and close the handle afterwards
    with open(txt_path, "r") as f:
        file_text = f.read()

    # Specific case parsing
    if file_name == "samlexrod-silveraiwolf-learning.txt":
        # The following pattern splits the gitingest sections per file so the rag
        # can remain in context within the document.
        #
        # E.g. each one of the sections is separated by a pattern:
        # ================================================
        # File: README.md
        # ================================================
        # # Content

        content_type = "Git Repo"
        base_url = "https://github.com/samlexrod/silveraiwolf-learning/blob/master/"

        # Pattern specific for gitingest (=== pattern)
        pattern = r"={48}\nFile: (.+?)\n={48}\n"

        # Splitting the text while keeping filenames
        sections = re.split(pattern, file_text)

        # sections[0] is the directory tree that precedes the first file header;
        # the rest alternates filename, content, filename, content, ...
        directory_section = sections[0]
        file_sections = sections[1:]

        # Organizing data into dictionary {filename: content}
        file_dict = {file_sections[i]: file_sections[i+1].strip() for i in range(0, len(file_sections), 2) if "tutorial-data" not in file_sections[i]}
        file_dict["directory"] = directory_section

        # Converting dictionary to list of (file_name, content) tuples
        data = list(file_dict.items())

        # Defining the schema of the per-file sections (kept separate from the template schema)
        section_schema = StructType([
            StructField("file_name", StringType(), True),
            StructField("content", StringType(), True)
        ])

        # Creating a dataframe for Delta table
        df_chunks = (spark.createDataFrame(data, section_schema)
            .withColumn("content", process_chunks(col("content")))
            .withColumn("total_file_chunks", array_size(col("content")))
            .select(
                "file_name",
                "total_file_chunks",
                posexplode(col("content")).alias("chunk_id", "content"))
            .withColumn("chunk_id", col("chunk_id") + 1)
            # Adding content_type column
            .withColumn("content_type", lit(content_type))
            # Adding file_url column for the rag to use for the link
            .withColumn("file_url", concat(lit(base_url), col("file_name")))
            # Reordering only after every template column exists
            .select(*column_order)
        )

    else:
        continue  # Other file logic not in the scope of this tutorial; skip the union

    # Unioning the chunks into the knowledge base table
    df_knoledge_base = df_knoledge_base.union(df_chunks)

# Creating a view for merge
df_knoledge_base.createOrReplaceTempView("knowledge_base_view")

# Merge data into Delta table
# content_type is written on insert and update because the target column is NOT NULL
spark.sql(f"""
MERGE INTO {catalog}.{database}.{table_name} AS target
USING knowledge_base_view AS source
ON target.file_name = source.file_name AND target.chunk_id = source.chunk_id
WHEN MATCHED AND (target.content != source.content OR coalesce(target.file_url, '') != source.file_url) THEN
    UPDATE SET
        file_url = source.file_url,
        content = source.content,
        content_type = source.content_type,
        total_file_chunks = source.total_file_chunks,
        updated_by = current_user(),
        updated_at = current_timestamp()
WHEN NOT MATCHED THEN INSERT (file_name, file_url, chunk_id, content, content_type, total_file_chunks)
    VALUES (source.file_name, source.file_url, source.chunk_id, source.content, source.content_type, source.total_file_chunks)
WHEN NOT MATCHED BY SOURCE THEN DELETE
""").display()

# 6. Display Knowledge Base Table

Now that you've structured the knowledge_base table, the next step is to integrate vector search into Databricks. This will allow efficient semantic retrieval of relevant text chunks for Retrieval-Augmented Generation (RAG).

In [0]:
df_knoledge_base.display()

In [0]:
%sql
SELECT * FROM llm.rag.knowledge_base

# END OF NOTEBOOK