# RAG System for German Road Signs

This notebook demonstrates a complete Retrieval-Augmented Generation (RAG) pipeline for German traffic signs. It indexes both page text and image metadata (URL, description, category), so answers can reference the relevant sign images.

## 1. Image Parsing and JSON Metadata Generation
- Extract images from [iamexpat.de road signs section](https://www.iamexpat.de/expat-info/driving-germany/road-signs)
- Generate structured JSON metadata containing:
  - Image URL
  - Sign description
  - Sign category
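
As a rough illustration of the metadata listed above, one record in the output file might look like the sketch below. The field names follow the scraper cell in this section (which also writes a `status` flag); the values are taken from the "Oil Slick" entry that shows up in the retrieval output further down.

```python
import json

# One illustrative record; values copied from the "Oil Slick" result shown later.
record = {
    "category": "Supplementary signs (Zusatzschilder)",
    "title": "Oil Slick",
    "image_url": "https://iamexpat.directus.app/assets/551a49e7-7fbd-46e0-9681-b89e991462c7?width=120&height=66",
    "status": "ok",  # "manual_check" when no description could be found
}

# The scraper stores a list of such records, keeping non-ASCII characters as-is.
serialized = json.dumps([record], indent=2, ensure_ascii=False)
print(serialized)
```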

In [2]:
import json
import os
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

URL = "https://www.iamexpat.de/expat-info/driving-germany/road-signs"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
OUTPUT_PATH = "data/germany_road_signs.json"

def parse_all_signs():
    print(f"Connecting to {URL}...")
    response = requests.get(URL, headers=HEADERS, timeout=30)
    if response.status_code != 200:
        print(f"Error loading page (HTTP {response.status_code})")
        return

    soup = BeautifulSoup(response.content, 'html.parser')
    all_signs_data = []

    # Prefer the article body; fall back to the whole page if it is missing.
    content_area = soup.find('div', class_='article__content') or soup.find('body')

    for img in content_area.find_all('img'):
        src = img.get('src', '')

        # Skip images that neither look like a sign (by URL) nor carry alt text.
        looks_like_sign = 'sign' in src.lower() or src.endswith('.svg')
        if not looks_like_sign and not img.get('alt'):
            continue

        # Resolve relative paths against the page URL.
        img_url = urljoin(URL, src)

        # Description: alt/title attribute first, then the surrounding table cell.
        title = img.get('alt', '').strip() or img.get('title', '').strip()

        if not title:
            parent_td = img.find_parent('td')
            if parent_td:
                title = parent_td.get_text(strip=True)
                if not title and parent_td.find_next_sibling('td'):
                    title = parent_td.find_next_sibling('td').get_text(strip=True)

        if not title or len(title) < 2:
            title = "NEED_MANUAL_DESCRIPTION"

        # Category: text of the nearest preceding section heading.
        category = "General"
        prev_h = img.find_previous(['h2', 'h3'])
        if prev_h:
            category = prev_h.get_text(strip=True)

        all_signs_data.append({
            "category": category,
            "title": title,
            "image_url": img_url,
            "status": "manual_check" if title == "NEED_MANUAL_DESCRIPTION" else "ok"
        })

    # Deduplicate by image URL; a later entry overwrites an earlier one.
    unique_data = list({item['image_url']: item for item in all_signs_data}.values())

    # Make sure the output directory exists before writing.
    os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)
    with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
        json.dump(unique_data, f, indent=2, ensure_ascii=False)

    print(f"{len(unique_data)} road signs collected")
    manual_count = sum(1 for x in unique_data if x['status'] == 'manual_check')
    print(f"{manual_count} need manual description")

if __name__ == "__main__":
    parse_all_signs()

Connecting to https://www.iamexpat.de/expat-info/driving-germany/road-signs...
300 road signs collected
0 need manual description


## 2. Text Data Loading
- Load textual information from the road signs section
- Prepare documents for downstream processing

In [3]:
import os
from urllib.parse import urljoin, urlparse

START_URLS = [
    "https://www.iamexpat.de/expat-info/driving-germany/road-signs",
    "https://www.iamexpat.de/expat-info/driving-germany"
]

BASE_DOMAIN = "www.iamexpat.de"
OUTPUT_DIR = "data/text_files"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

os.makedirs(OUTPUT_DIR, exist_ok=True)

visited = set()


def clean_filename(url: str) -> str:
    path = urlparse(url).path.strip("/")
    if not path:
        return "index.txt"
    name = path.split("/")[-1]
    return re.sub(r"[^a-zA-Z0-9_-]", "_", name) + ".txt"


def extract_text(soup: BeautifulSoup) -> str:
    content = []

    for tag in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        text = tag.get_text(" ", strip=True)
        if len(text) > 30:
            content.append(text)

    return "\n\n".join(content)


def is_valid_link(link: str) -> bool:
    parsed = urlparse(link)
    return (
        parsed.netloc == BASE_DOMAIN
        and parsed.path.startswith("/expat-info/driving-germany")
    )


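def process_page(url: str):
    # Drop any #fragment so in-page anchor links do not trigger repeat fetches.
    url = urlparse(url)._replace(fragment="").geturl()
    if url in visited:
        return

    print(f"Processing: {url}")
    visited.add(url)

    try:
        r = requests.get(url, headers=HEADERS, timeout=30)
        r.raise_for_status()
    except requests.RequestException as exc:
        # Skip pages that fail to load instead of aborting the whole crawl.
        print(f"Skipping {url}: {exc}")
        return

    soup = BeautifulSoup(r.text, "html.parser")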


    text = extract_text(soup)
    if text:
        filename = clean_filename(url)
        filepath = os.path.join(OUTPUT_DIR, filename)

        with open(filepath, "w", encoding="utf-8") as f:
            f.write(text)


    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if is_valid_link(link):
            process_page(link)




for start_url in START_URLS:
    process_page(start_url)

print(f"\nTexts saved: {len(os.listdir(OUTPUT_DIR))}")


Processing: https://www.iamexpat.de/expat-info/driving-germany/road-signs
Processing: https://www.iamexpat.de/expat-info/driving-germany
Processing: https://www.iamexpat.de/expat-info/driving-germany/driving-licence
Processing: https://www.iamexpat.de/expat-info/driving-germany/learning-to-drive
Processing: https://www.iamexpat.de/expat-info/driving-germany/buying-a-car
Processing: https://www.iamexpat.de/expat-info/driving-germany/car-leasing
Processing: https://www.iamexpat.de/expat-info/driving-germany/registering-vehicle
Processing: https://www.iamexpat.de/expat-info/driving-germany/motor-vehicle-tax
Processing: https://www.iamexpat.de/expat-info/driving-germany/emissions-sticker
Processing: https://www.iamexpat.de/expat-info/driving-germany/periodic-technical-inspection-hauptuntersuchung-tuev
Processing: https://www.iamexpat.de/expat-info/driving-germany/importing-car
Processing: https://www.iamexpat.de/expat-info/driving-germany/exporting-car
Processing: https://www.iamexpat.de/e
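
The crawler names each output file after the last path segment of the page URL. The sketch below repeats `clean_filename` from the cell above and applies it to three URLs from the log. The last two URLs are hypothetical (they are not taken from the crawl) and only illustrate that pages sharing a final segment would overwrite each other.

```python
import re
from urllib.parse import urlparse


def clean_filename(url: str) -> str:
    # Same logic as the crawler cell: last path segment, sanitized, plus ".txt".
    path = urlparse(url).path.strip("/")
    if not path:
        return "index.txt"
    name = path.split("/")[-1]
    return re.sub(r"[^a-zA-Z0-9_-]", "_", name) + ".txt"


urls = [
    "https://www.iamexpat.de/expat-info/driving-germany/road-signs",
    "https://www.iamexpat.de/expat-info/driving-germany",
    "https://www.iamexpat.de/expat-info/driving-germany/periodic-technical-inspection-hauptuntersuchung-tuev",
]
names = [clean_filename(u) for u in urls]
for name in names:
    print(name)

# Hypothetical URLs: different parents, same final segment -> same file name.
a = clean_filename("https://www.iamexpat.de/expat-info/driving-germany/cars/faq")
b = clean_filename("https://www.iamexpat.de/expat-info/driving-germany/licences/faq")
collision = a == b
print("collision:", collision)
```

If nested pages with repeated final segments ever appear, building the name from the full path would avoid the overwrite.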

## 3. Data Chunking
- Split all documents (image metadata + textual data) into manageable chunks
- Ensure metadata is preserved for each chunk

In [4]:
import json
from typing import List
from langchain_core.documents import Document
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


def load_image_json(json_path: str) -> List[Document]:
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    documents: List[Document] = []

    for item in data:
        title = item.get("title", "").strip()
        category = item.get("category", "").strip()
        image_url = item.get("image_url", "").strip()

        page_content = f"Traffic sign: {title}. Category: {category}."

        metadata = {
            "type": "image",
            "title": title,
            "category": category,
            "image_url": image_url
        }

        documents.append(
            Document(
                page_content=page_content,
                metadata=metadata
            )
        )

    return documents



def load_text_files(path: str) -> List[Document]:
    loader = DirectoryLoader(
        path,
        glob="**/*.txt",
        loader_cls=TextLoader,
        loader_kwargs={"encoding": "utf-8"},
        show_progress=True
    )
    return loader.load()



class UniversalChunker:
    def __init__(
        self,
        chunk_size: int = 500,
        chunk_overlap: int = 50
    ):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=["\n\n", "\n", ". ", " "],
            length_function=len
        )

    def chunk(self, documents: List[Document]) -> List[Document]:
        return self.splitter.split_documents(documents)



image_docs = load_image_json("data/germany_road_signs.json")
text_docs = load_text_files("data/text_files")

all_documents = image_docs + text_docs

print("Loaded documents:")
print(f"- Image JSON docs: {len(image_docs)}")
print(f"- Text docs: {len(text_docs)}")
print(f"- TOTAL: {len(all_documents)}")

chunker = UniversalChunker()
chunks = chunker.chunk(all_documents)

print(f"\nCreated {len(chunks)} chunks")


100%|██████████| 18/18 [00:00<00:00, 576.40it/s]

Loaded documents:
- Image JSON docs: 300
- Text docs: 18
- TOTAL: 318

Created 633 chunks
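
To make the effect of `chunk_size=500` and `chunk_overlap=50` more concrete, here is a deliberately simplified sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to split on the listed separators); it is a plain fixed-size sliding window, and `chunk_text` is a hypothetical helper. It shows how overlap works, that every chunk gets a copy of the metadata, and why a short image description yields exactly one chunk.

```python
from typing import Dict, List


def chunk_text(text: str, metadata: Dict, chunk_size: int = 500,
               chunk_overlap: int = 50) -> List[Dict]:
    """Fixed-window chunking with overlap; each chunk carries a metadata copy."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({"page_content": piece, "metadata": dict(metadata)})
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# A 1200-character text is cut at offsets 0, 450 and 900.
long_chunks = chunk_text("x" * 1200, {"source": "example.txt"})
print(len(long_chunks), [len(c["page_content"]) for c in long_chunks])

# A short image description fits into a single chunk, metadata intact.
sign_chunks = chunk_text(
    "Traffic sign: yield. Category: Right-of-way signs.",
    {"type": "image", "title": "yield", "category": "Right-of-way signs"},
)
print(len(sign_chunks), sign_chunks[0]["metadata"]["type"])
```

This matches the counts above: the 300 image documents stay at 300 chunks, and only the longer text documents are split.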





## 4. Embeddings Creation and Vector Storage
- Convert chunks to vector embeddings using the `sentence-transformers/all-MiniLM-L12-v2` model
- Persist embeddings in a Chroma vector store for semantic retrieval

In [5]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
import os

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L12-v2"
)

print(f"Total chunks to embed: {len(chunks)}")

image_chunks = [c for c in chunks if c.metadata.get("type") == "image"]
text_chunks = [c for c in chunks if c.metadata.get("type") != "image"]

print(f"Image chunks: {len(image_chunks)}")
print(f"Text chunks: {len(text_chunks)}")

PERSIST_DIR = "./chroma_db"

vectorstore = Chroma.from_documents(
    documents=chunks,              
    embedding=embedding_model,
    persist_directory=PERSIST_DIR
)

print(" Vector store created")
print("Stored embeddings:", vectorstore._collection.count())

results = vectorstore.similarity_search("slippery road", k=5)

for i, doc in enumerate(results):
    print(f"\nResult {i+1}")
    print("TEXT:", doc.page_content)
    print("METADATA:", doc.metadata)




Total chunks to embed: 633
Image chunks: 300
Text chunks: 333
 Vector store created
Stored embeddings: 633

Result 1

Result 2
TEXT: Traffic sign: Oil Slick. Category: Supplementary signs (Zusatzschilder).
METADATA: {'image_url': 'https://iamexpat.directus.app/assets/551a49e7-7fbd-46e0-9681-b89e991462c7?width=120&height=66', 'title': 'Oil Slick', 'type': 'image', 'category': 'Supplementary signs (Zusatzschilder)'}

Result 3
TEXT: Traffic sign: parking on pavement allowed wholly. Category: Parking signs.
METADATA: {'title': 'parking on pavement allowed wholly', 'category': 'Parking signs', 'image_url': 'https://iamexpat.directus.app/assets/30336666-995c-441c-b9bf-168e3b49fe3f', 'type': 'image'}

Result 4
TEXT: Traffic sign: yield. Category: Right-of-way signs.
METADATA: {'title': 'yield', 'type': 'image', 'category': 'Right-of-way signs', 'image_url': 'https://iamexpat.directus.app/assets/1e57e2c6-e312-4fdb-81a9-f62745427f0a'}

Result 5
TEXT: Traffic sign: parking on pavement allowed ha
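
The results above include signs that have little to do with slippery roads. A top-k similarity search always returns `k` items, however weak the match. The sketch below illustrates this with cosine similarity over made-up three-dimensional vectors. The numbers are invented for illustration, real embeddings have hundreds of dimensions, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], index: Dict[str, List[float]],
          k: int) -> List[Tuple[float, str]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), title) for title, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:k]


# Toy "embeddings" (made-up values, not produced by any model).
index = {
    "Oil Slick": [0.9, 0.1, 0.0],
    "yield": [0.1, 0.9, 0.1],
    "parking on pavement allowed wholly": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedded query "slippery road"

results = top_k(query, index, k=2)
for score, title in results:
    print(f"{score:.2f}  {title}")
```

The second hit is returned even though its score is far lower than the first, which is why a score threshold can be worth considering on top of a fixed `k`.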

## 5. AI Model Integration
- Initialize a language model (LLM) for RAG
- Connect the LLM with a retriever to query vector embeddings

In [6]:
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate

load_dotenv()

AIML_API_KEY = os.getenv("AIML_API_KEY")

llm = ChatOpenAI(
    model="gpt-4o-mini", 
    api_key=AIML_API_KEY,
    base_url="https://api.aimlapi.com/v1",
    temperature=0
)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

def format_docs(docs):
    formatted = []
    for d in docs:
        block = f"""
TEXT:
{d.page_content}

IMAGE_URL:
{d.metadata.get("image_url", "None")}

CATEGORY:
{d.metadata.get("category", "Unknown")}
"""
        formatted.append(block.strip())
    return "\n\n---\n\n".join(formatted)


prompt = ChatPromptTemplate.from_template("""
You are an expert assistant on German road rules and traffic signs.

Use the provided context to answer the question.
If the context contains image metadata, mention the image(s) when relevant.

If the description of a sign is incomplete, you MAY infer its meaning
based on common traffic rules and the sign category.
Do NOT invent specific legal details that are not implied by the context.

When images are relevant, include them in your answer using this format:

Image: <image_url>
Explanation: <short explanation of the sign>

Context:
{context}

Question:
{question}

Answer in a clear and structured way.
""")


rag_chain = (
    {
        "context": retriever | RunnableLambda(format_docs),
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
)
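
To see what the LLM actually receives as `{context}`, the sketch below runs the same `format_docs` logic on two stand-in documents. `FakeDoc` is a hypothetical substitute for `langchain_core.documents.Document`, and the second document's text and `source` path are placeholders. It suggests that text chunks, which carry no `image_url` or `category` metadata, appear with `None` and `Unknown`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    # Same formatting as in the cell above.
    formatted = []
    for d in docs:
        block = f"""
TEXT:
{d.page_content}

IMAGE_URL:
{d.metadata.get("image_url", "None")}

CATEGORY:
{d.metadata.get("category", "Unknown")}
"""
        formatted.append(block.strip())
    return "\n\n---\n\n".join(formatted)


docs = [
    FakeDoc(
        "Traffic sign: Oil Slick. Category: Supplementary signs (Zusatzschilder).",
        {
            "type": "image",
            "title": "Oil Slick",
            "category": "Supplementary signs (Zusatzschilder)",
            "image_url": "https://iamexpat.directus.app/assets/551a49e7-7fbd-46e0-9681-b89e991462c7?width=120&height=66",
        },
    ),
    # Placeholder text chunk: only a source path, no image metadata.
    FakeDoc("Some crawled page text.", {"source": "data/text_files/road-signs.txt"}),
]

context = format_docs(docs)
print(context)
```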


## 6. RAG Query Execution

In [None]:
response = rag_chain.invoke("What does a slippery road traffic sign mean?")
print(response.content)

The slippery road traffic sign, often represented as an oil slick sign, indicates that the road surface may be slippery due to oil, rain, or other conditions. This sign is a supplementary sign (Zusatzschild) that warns drivers to exercise caution and reduce speed to prevent accidents.

Image: ![Oil Slick Sign](https://iamexpat.directus.app/assets/551a49e7-7fbd-46e0-9681-b89e991462c7?width=120&height=66)  
Explanation: The sign alerts drivers to potential slippery conditions on the road.
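
The prompt asks for `Image: <image_url>`, but the model wrapped the URL in Markdown image syntax. If the URLs need to be picked up programmatically (for example, to display the images), a tolerant parser along these lines could accept both forms. `extract_image_urls` is a hypothetical helper, and the second `Image:` entry in the sample text is made up for illustration, using the "yield" URL from the retrieval output.

```python
import re
from typing import List

# Matches "Image: <url>" as well as "Image: ![alt](<url>)" at the start of a line.
IMAGE_LINE = re.compile(
    r"^Image:\s*(?:!\[[^\]]*\]\()?(https?://[^\s)]+)",
    re.MULTILINE,
)


def extract_image_urls(answer: str) -> List[str]:
    """Return every image URL found on an 'Image:' line, in order."""
    return IMAGE_LINE.findall(answer)


answer = (
    "Image: ![Oil Slick Sign](https://iamexpat.directus.app/assets/"
    "551a49e7-7fbd-46e0-9681-b89e991462c7?width=120&height=66)  \n"
    "Explanation: The sign alerts drivers to potential slippery conditions on the road.\n"
    "\n"
    # Illustrative second entry in the plain format requested by the prompt.
    "Image: https://iamexpat.directus.app/assets/1e57e2c6-e312-4fdb-81a9-f62745427f0a\n"
    "Explanation: Yield sign.\n"
)

urls = extract_image_urls(answer)
for url in urls:
    print(url)
```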


### Semantic search sanity check

In [None]:
docs = vectorstore.similarity_search("slippery road", k=3)

for d in docs:
    print("TEXT:", d.page_content)
    print("IMAGE:", d.metadata.get("image_url"))


IMAGE: https://iamexpat.directus.app/assets/f1f65762-e9f7-45ed-a274-af7a9b15d05b?width=120&height=105
TEXT: Traffic sign: Oil Slick. Category: Supplementary signs (Zusatzschilder).
IMAGE: https://iamexpat.directus.app/assets/551a49e7-7fbd-46e0-9681-b89e991462c7?width=120&height=66
TEXT: Traffic sign: parking on pavement allowed wholly. Category: Parking signs.
IMAGE: https://iamexpat.directus.app/assets/30336666-995c-441c-b9bf-168e3b49fe3f
