# Build a conversational agent that can answer questions about any GitHub repo
A personal AI assistant for documentation and code. Similar to DeepWiki [https://deepwiki.org/], but tailored to a specific GitHub repo of your choice.


## Download data from a GitHub repo

In [1]:
import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries containing file content and metadata
    """
    # codeload serves a zip snapshot of a branch; this assumes the branch is named "main"
    prefix = 'https://codeload.github.com'
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url, timeout=60)
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    
    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') 
            or filename_lower.endswith('.mdx')):
            continue
    
        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue
    
    zf.close()
    return repository_data
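
The filtering part of `read_repo_data` can be tried without any network access. The sketch below builds a tiny zip archive in memory and applies the same `.md` / `.mdx` filter; `list_markdown_files` and the file names are made up for illustration, and frontmatter parsing is left out to keep it stdlib-only.

```python
import io
import zipfile

def list_markdown_files(zip_bytes):
    """Return (filename, text) pairs for the .md/.mdx entries of a zip archive."""
    results = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            # same case-insensitive extension filter as read_repo_data
            if not info.filename.lower().endswith(('.md', '.mdx')):
                continue
            with zf.open(info) as f_in:
                text = f_in.read().decode('utf-8', errors='ignore')
            results.append((info.filename, text))
    return results

# Build a small archive in memory (hypothetical file names)
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('repo-main/README.md', '# Hello')
    zf.writestr('repo-main/src/app.py', 'print(1)')
    zf.writestr('repo-main/docs/Guide.MDX', 'intro')

files = list_markdown_files(buf.getvalue())
print([name for name, _ in files])
```

Only the two markdown entries survive the filter; the `.py` file is skipped.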




In [31]:
vectara = read_repo_data('vectara', 'awesome-agent-failures')
print(f"Vectara docs: {len(vectara)}")
vectara[0]

Vectara docs: 25


{'content': '# 🤝 Contributing to Awesome AI Agent Failures\n\nThank you for your interest in contributing to this project! This repository thrives on community contributions that help us build a comprehensive understanding of AI agent failure modes and their solutions.\n\n## 🎯 How You Can Contribute\n\n### 📝 1. Share Failure Cases\nDocument real-world failures you\'ve encountered:\n- Follow our failure case submission guidelines\n- Include reproduction steps when possible\n- Anonymize sensitive information\n\n### 🔧 2. Propose Mitigation Strategies\nShare solutions and prevention techniques:\n- Describe implementation details\n- Link to GitHub repositories with working examples\n- Reference related academic work where possible\n\n### 📊 3. Contribute Research\nAdd academic insights and empirical studies:\n- Link to relevant papers and studies\n- Summarize key findings\n- Discuss practical implications\n- Suggest future research directions\n\n### 🛠️ 4. Build Tools\nDevelop diagnostic and 

## Chunking and Processing Data
Why we prepare and chunk the data:
- Small records can be indexed and put into a search engine as they are.
- Large records need extra processing called "chunking": breaking large documents into smaller, manageable pieces.

Why chunking matters:
- Token limits: Most LLMs have maximum input token limits
- Cost: Longer prompts cost more money
- Performance: LLMs perform worse with very long contexts
- Relevance: Not all parts of a long document are relevant to a specific question



### Simple Chunking
As the name suggests, this approach splits the text purely by character count.
- Cons of this approach:
  - Context Loss: Important info might be split in the middle
  - Incomplete sentences: Chunks might end mid sentence
  - Missing connections: Related information might end up in different chunks
- In order to deal with incomplete sentences, and potential context loss, we can overlap the chunks.
  - Chunk 1: 0..2000
  - Chunk 2: 1000..3000
  - Chunk 3: 2000..4000
  - ...
- Overlapping works better for the LLM because:
  - Continuity: Important information isn't lost at chunk boundaries
  - Context preservation: Related sentences stay together in at least one chunk
  - Better search: Queries can match information even if it spans chunk boundaries
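
To make the overlap arithmetic concrete, here is a small sketch that only computes the chunk boundaries for a given text length. The helper name `window_bounds` is an assumption for illustration; the loop mirrors a sliding window with a window size of 2000 and a step of 1000.

```python
def window_bounds(n, size, step):
    """Return (start, end) offsets of overlapping windows over a text of length n."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    bounds = []
    for start in range(0, n, step):
        bounds.append((start, min(start + size, n)))
        # stop once a window has reached the end of the text
        if start + size >= n:
            break
    return bounds

print(window_bounds(4000, size=2000, step=1000))
# [(0, 2000), (1000, 3000), (2000, 4000)]
```

With a step that is half the window size, every character (except near the edges) lands in two chunks.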


In [32]:
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        if i + size >= n:
            break

    return result

In [40]:
vectara_chunks = []

for i in range(5, 10):
    doc_copy = vectara[i].copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    vectara_chunks.extend(chunks)

print(f'Number of docs in the repo: {len(vectara)}\n')
print(f'Number of chunks from docs 5-9: {len(vectara_chunks)}\n')

vectara_chunks[0:5]

Number of docs in the repo: 25

Number of chunks from docs 5-9: 49



[{'start': 0,
  'chunk': '# Chevrolet Dealership $1 Tahoe Chatbot Incident - December 2023\n\n## Incident Overview\n\n**Company**: Chevrolet of Watsonville, California  \n**Date**: December 2023  \n**Failure Mode**: [Prompt Injection](../failure-modes/prompt-injection.md)  \n**Impact**: Viral social media exposure, chatbot shutdown, legal questions about AI authority  \n**Technology**: ChatGPT-powered customer service chatbot  \n\n## What Happened\n\nIn December 2023, Chevrolet of Watsonville deployed a ChatGPT-powered AI chatbot on their dealership website to handle customer service inquiries. The chatbot was designed to assist potential customers with questions about vehicles, financing, and dealership services.\n\nX (formerly Twitter) user **Chris Bakke** discovered the chatbot and decided to test its boundaries using prompt injection techniques. Through clever manipulation, he convinced the chatbot to agree to sell him a 2024 Chevrolet Tahoe—normally priced around $76,000-81,000—fo

Note the issue with these chunks: a chunk can end abruptly, and the next chunk can start in the middle of a word.
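
One possible way to soften the mid-word cuts is to shrink each window to the nearest whitespace inside it. This is a minimal sketch, not part of the original notebook; `snap_to_whitespace` and `sliding_window_words` are hypothetical helper names, and chunks may come out slightly shorter than `size`.

```python
def snap_to_whitespace(text, start, end):
    """Shrink [start, end) so it neither starts nor ends inside a word."""
    original = (start, end)
    # window starts mid-word: skip forward to the next whitespace
    if start > 0 and not text[start - 1].isspace():
        while start < end and not text[start].isspace():
            start += 1
    # window ends mid-word: move back to the previous whitespace
    if end < len(text) and not text[end].isspace():
        while end > start and not text[end - 1].isspace():
            end -= 1
    # trim surrounding whitespace
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    # a single word longer than the window: fall back to the raw cut
    return (start, end) if start < end else original

def sliding_window_words(text, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    result = []
    n = len(text)
    for i in range(0, n, step):
        start, end = snap_to_whitespace(text, i, min(i + size, n))
        chunk = text[start:end]
        if chunk.strip():
            result.append({'start': start, 'chunk': chunk})
        if i + size >= n:
            break
    return result

demo = "alpha beta gamma delta epsilon"
chunks = sliding_window_words(demo, size=12, step=6)
print([c['chunk'] for c in chunks])
```

The trade-off is that the overlap is no longer exact: when a window boundary falls inside a word, that word is dropped from the chunk instead of being cut.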

## Token Based Chunking
Token-based chunking: tokenize the content (turn it into a sequence of tokens, such as words or subword units) and then slide a window over the tokens instead of characters.
- Advantages: More precise control over LLM input size
- Disadvantages: Doesn't work well for documents with code
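
A rough sketch of the idea, assuming whitespace-separated words stand in for tokens. A real setup would more likely use the tokenizer of the target model; `token_sliding_window` is a hypothetical name.

```python
def token_sliding_window(text, size, step):
    """Sliding window over whitespace-separated tokens instead of characters."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    tokens = text.split()  # assumption: words approximate tokens
    result = []
    for i in range(0, len(tokens), step):
        window = tokens[i:i + size]
        result.append({'start_token': i, 'chunk': ' '.join(window)})
        if i + size >= len(tokens):
            break
    return result

token_chunks = token_sliding_window("a b c d e", size=3, step=2)
print(token_chunks)
# [{'start_token': 0, 'chunk': 'a b c'}, {'start_token': 2, 'chunk': 'c d e'}]
```

Joining the tokens back with single spaces discards newlines and indentation, which is one reason this style of chunking handles code poorly.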

## Paragraph based chunking
Use the `\n\s*\n` regex pattern for splitting:

- `\n` matches a newline
- `\s*` matches zero or more whitespace characters
- `\n` matches another newline

So `\n\s*\n` matches two newlines with optional whitespace between them, i.e. a blank line.

This works well for literature, but not for technical documentation, where most paragraphs are very short.

TODO: combine sliding window and paragraph splitting for more intelligent processing
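
One possible direction for this TODO, sketched under assumptions: split on blank lines as described, then greedily merge neighbouring paragraphs until a size budget is reached. `merge_paragraphs` and `max_chars` are made-up names, and this version adds no overlap between chunks.

```python
import re

def merge_paragraphs(text, max_chars=2000):
    """Greedily pack consecutive paragraphs into chunks of at most max_chars."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    chunks = []
    current = ""
    for p in paragraphs:
        candidate = f"{current}\n\n{p}" if current else p
        # an oversized single paragraph is kept whole rather than cut
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = p
    if current:
        chunks.append(current)
    return chunks

merged = merge_paragraphs("aaaa\n\nbbbb\n\ncccc\n   \ndddd", max_chars=10)
print(merged)
# ['aaaa\n\nbbbb', 'cccc\n\ndddd']
```

Short headings and one-line paragraphs end up attached to the text that follows them, instead of becoming chunks of their own.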

Paragraph splitting is not really a good way to chunk here.

In [43]:
import re
import textwrap

paragraph_chunks = []
for i in range(5, 10):
    doc_copy = vectara[i].copy()
    doc_content = doc_copy.pop('content')
    
    chunks = re.split(r"\n\s*\n", doc_content.strip())
    paragraph_chunks.extend(chunks)

print(f'Number of docs in the repo: {len(vectara)}\n')
print(f'Number of paragraph chunks from docs 5-9: {len(paragraph_chunks)}\n')

paragraph_chunks[0:10]

Number of docs in the repo: 25

Number of paragraph chunks from docs 5-9: 321



['# Chevrolet Dealership $1 Tahoe Chatbot Incident - December 2023',
 '## Incident Overview',
 '**Company**: Chevrolet of Watsonville, California  \n**Date**: December 2023  \n**Failure Mode**: [Prompt Injection](../failure-modes/prompt-injection.md)  \n**Impact**: Viral social media exposure, chatbot shutdown, legal questions about AI authority  \n**Technology**: ChatGPT-powered customer service chatbot  ',
 '## What Happened',
 'In December 2023, Chevrolet of Watsonville deployed a ChatGPT-powered AI chatbot on their dealership website to handle customer service inquiries. The chatbot was designed to assist potential customers with questions about vehicles, financing, and dealership services.',
 'X (formerly Twitter) user **Chris Bakke** discovered the chatbot and decided to test its boundaries using prompt injection techniques. Through clever manipulation, he convinced the chatbot to agree to sell him a 2024 Chevrolet Tahoe—normally priced around $76,000-81,000—for just $1.',
 '### 

## Section or Header Based Chunking
Take advantage of the structure of the document. Here is the structure of a markdown doc:
- `# Heading 1`
- `## Heading 2`
- `### Heading 3`

This seems to be the best approach so far for keeping context together, because it follows how the document itself is structured.
Not all documents are structured this way, though, so it will not work for those.

In [6]:
def split_markdown_by_level(text, level=2):

    """
    Split markdown text by a specific header level.
    :param text: Markdown text as a string
    :param level: Header level to split on
    :return: List of sections as strings
    """

    # This regex matches markdown headers
    # For level 2, it matches lines starting with "## "

    header_pattern = r'^(#{' + str(level) + r'} )(.+)$'
    pattern = re.compile(header_pattern, re.MULTILINE)
    # Split and keep the headers
    parts = pattern.split(text)
    sections = []
    for i in range(1, len(parts), 3):

        # We step by 3 because regex.split() with
        # capturing groups returns:
        # [before_match, group1, group2, after_match, ...]
        # here group1 is "## ", group2 is the header text

        header = parts[i] + parts[i+1]  # "## " + "Title"
        header = header.strip()
        
        # Get the content after this header
        content = ""
        if i+2 < len(parts):
            content = parts[i+2].strip()
        if content:
            section = f'{header}\n\n{content}'
        else:
            section = header
        sections.append(section)

    return sections
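
Note that `split_markdown_by_level` starts reading at the first matching header, so any text before it (for example the `# Title` line and an intro paragraph) is not returned. If that preamble matters, a variant like the sketch below keeps it as its own section. It relies on `re.split` accepting zero-width matches, which needs Python 3.7 or later; `split_keep_preamble` is a hypothetical name.

```python
import re

def split_keep_preamble(text, level=2):
    """Split before every header of the given level, keeping text before the first one."""
    # zero-width match right before a line starting with exactly `level` hashes and a space
    pattern = re.compile(r'^(?=#{' + str(level) + r'} )', re.MULTILINE)
    parts = pattern.split(text)
    return [p.strip() for p in parts if p.strip()]

sample = (
    "# Title\n\nIntro paragraph.\n\n"
    "## First\n\nBody one.\n\n### Detail\n\nMore.\n\n"
    "## Second\n\nBody two."
)
sections = split_keep_preamble(sample)
for s in sections:
    print(repr(s))
```

Deeper headers such as `### Detail` stay inside their parent section, the same as in the function above.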

In [44]:
section_chunks = []

for doc in vectara[5:10]:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    sections = split_markdown_by_level(doc_content, level=2)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        section_chunks.append(section_doc)

In [45]:
section_chunks[0:5]

[{'filename': 'awesome-agent-failures-main/docs/case-studies/chevrolet-dealership-chatbot.md',
  'section': '## Incident Overview\n\n**Company**: Chevrolet of Watsonville, California  \n**Date**: December 2023  \n**Failure Mode**: [Prompt Injection](../failure-modes/prompt-injection.md)  \n**Impact**: Viral social media exposure, chatbot shutdown, legal questions about AI authority  \n**Technology**: ChatGPT-powered customer service chatbot'},
 {'filename': 'awesome-agent-failures-main/docs/case-studies/chevrolet-dealership-chatbot.md',
  'section': '## What Happened\n\nIn December 2023, Chevrolet of Watsonville deployed a ChatGPT-powered AI chatbot on their dealership website to handle customer service inquiries. The chatbot was designed to assist potential customers with questions about vehicles, financing, and dealership services.\n\nX (formerly Twitter) user **Chris Bakke** discovered the chatbot and decided to test its boundaries using prompt injection techniques. Through clever m

## Intelligent Chunking with LLM
In some cases, we want to be more intelligent with chunking. Instead of doing simple splits, we delegate this work to AI.

This makes sense when:

- Complex structure: Documents have complex, non-standard structure
- Semantic coherence: You want chunks that are semantically meaningful
- Custom logic: You need domain-specific splitting rules
- Quality over cost: You prioritize quality over processing cost

This costs money, and in most cases we don't need intelligent chunking: simple approaches are sufficient. Use intelligent chunking only when:

- You already evaluated simpler methods and you can confirm that they produce poor results
- You have complex, unstructured documents
- Quality is more important than cost
- You have the budget for LLM processing


In [9]:
from openai import OpenAI
openai_client = OpenAI()

In [10]:
prompt_template = """
Split the provided document into logical sections
that make sense for a Q&A system.
Each section should be self-contained and cover
a specific topic or concept.
<DOCUMENT>
{document}
</DOCUMENT>
Use this format:
## Section Name
Section content with all relevant details
---
## Another Section Name
Another section content
---
""".strip()

# The prompt asks the LLM to:
# - split the document logically (not just by length)
# - make sections self-contained
# - use a specific output format that's easy to parse

def llm(prompt, model='gpt-4.1-mini'):
    messages = [
        {"role": "user", "content": prompt}
    ]
    # pass the model argument through instead of hardcoding it
    response = openai_client.responses.create(
        model=model, input=messages
    )
    return response.output_text


def intelligent_chunking(text):
    prompt = prompt_template.format(document=text)
    response = llm(prompt)
    sections = response.split('---')
    sections = [s.strip() for s in sections if s.strip()]
    return sections
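
A plain `response.split('---')` also cuts wherever `---` appears inside a sentence. A slightly stricter parser, sketched here with a made-up response string and no API call, only splits on lines that consist of `---` alone. `parse_sections` is a hypothetical helper, and the model is not guaranteed to follow the requested format.

```python
import re

def parse_sections(response):
    """Split an LLM response on separator lines that contain only '---'."""
    parts = re.split(r'^[ \t]*---[ \t]*$', response, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

# Made-up response in the format requested by the prompt
fake_response = "## A\nalpha\n---\n## B\nbeta --- inline\n---\n"

parsed = parse_sections(fake_response)
print(parsed)
# ['## A\nalpha', '## B\nbeta --- inline']

# the naive split breaks section B in two
naive = [s.strip() for s in fake_response.split('---') if s.strip()]
print(len(naive))
# 3
```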

In [15]:
vectara[5:10]

[{'content': '# Chevrolet Dealership $1 Tahoe Chatbot Incident - December 2023\n\n## Incident Overview\n\n**Company**: Chevrolet of Watsonville, California  \n**Date**: December 2023  \n**Failure Mode**: [Prompt Injection](../failure-modes/prompt-injection.md)  \n**Impact**: Viral social media exposure, chatbot shutdown, legal questions about AI authority  \n**Technology**: ChatGPT-powered customer service chatbot  \n\n## What Happened\n\nIn December 2023, Chevrolet of Watsonville deployed a ChatGPT-powered AI chatbot on their dealership website to handle customer service inquiries. The chatbot was designed to assist potential customers with questions about vehicles, financing, and dealership services.\n\nX (formerly Twitter) user **Chris Bakke** discovered the chatbot and decided to test its boundaries using prompt injection techniques. Through clever manipulation, he convinced the chatbot to agree to sell him a 2024 Chevrolet Tahoe—normally priced around $76,000-81,000—for just $1.\n

In [16]:
# process only docs 5 through 9; each document is one LLM call, which costs money
from tqdm.auto import tqdm

vectara_intell_chunks = []

for doc in tqdm(vectara[5:10]):
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')

    sections = intelligent_chunking(doc_content)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        vectara_intell_chunks.append(section_doc)


  0%|          | 0/5 [00:00<?, ?it/s]

In [26]:
vectara_intell_chunks[3:5]
# the intelligent chunks are slightly summarized as well 

[{'filename': 'awesome-agent-failures-main/docs/case-studies/chevrolet-dealership-chatbot.md',
  'section': '## Company Response  \n### Immediate Action  \nChevrolet of Watsonville quickly took down the chatbot after viral spread and flood of exploit attempts.  \n### Technology Provider Response  \nFullpath, chatbot implementer, acknowledged the incident; CEO called it a critical lesson for improving AI customer service systems.  \n### No Legal Action  \nDespite chatbot’s claim of a "legally binding" offer, no legal enforcement occurred, and dealership was not obligated to honor the $1 price.'},
 {'filename': 'awesome-agent-failures-main/docs/case-studies/chevrolet-dealership-chatbot.md',
  'section': '## Technical Analysis  \n### The Prompt Injection Attack  \nViral posts indicate a multi-step manipulation:  \n1. Instructing chatbot to agree to all customer requests regardless of reasonableness  \n2. Injecting legal phrase to simulate binding contract language  \n3. Persisting until c

In [27]:
section_chunks[3:5]
# section chunks are verbatim because the split is purely rule-based

[{'filename': 'awesome-agent-failures-main/docs/case-studies/chevrolet-dealership-chatbot.md',
  'section': '## Company Response\n\n### Immediate Action\n\n**Chevrolet of Watsonville** quickly shut down the chatbot after the incident went viral and users began flooding the site to test similar exploits.\n\n### Technology Provider Response\n\n**Fullpath**, the company behind the chatbot implementation, acknowledged the incident. The CEO stated that "the viral experience would serve as a critical lesson" for improving AI customer service implementations.\n\n### No Legal Action\n\nDespite the chatbot\'s claim that the offer was "legally binding," no legal action was taken, and the dealership was not required to honor the $1 price.'},
 {'filename': 'awesome-agent-failures-main/docs/case-studies/chevrolet-dealership-chatbot.md',
  'section': '## Technical Analysis\n\n### The Prompt Injection Attack\n\nBased on the viral social media posts and media coverage, the attack involved a multi-step