<a href="https://colab.research.google.com/github/salmantec/AI-Agents-Crash-Course/blob/feat%2FDay-1/Day-1/Day_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
## Intro

# We'll create a conversational agent that can answer questions about any GitHub repository - think of it as your personal AI assistant for documentation and code.
# If you know DeepWiki, it's something similar, but tailored to your GitHub repo.
# For that, we need to:
# - Download and process data from the repo
# - Put it inside a search engine
# - Make the search engine available to our agent

# Today, we will do the first part: downloading the data.


In [None]:
## Ingest and Index Your Data

In [4]:
!pip install uv



In [5]:
!uv pip install requests python-frontmatter

Using Python 3.12.11 environment at: /usr
Audited 2 packages in 112ms


In [9]:
import frontmatter

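# Note: sample_data/example.md must exist before running this cell;
# its expected content is shown below.
with open('sample_data/example.md', 'r', encoding='utf-8') as f:
  post = frontmatter.load(f)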

# Access metadata
print(post.metadata['title'])
print(post.metadata['tags'])

# Access content
print(post.content)



### Content of sample_data/example.md file

# ---
# title: "Getting Started with AI"
# author: "John Doe"
# date: "2024-01-15"
# tags: ["ai", "machine-learning", "tutorial"]
# difficulty: "beginner"
# ---

# # Getting Started with AI

# This is the main content of the document written in **Markdown**.

# You can include code blocks, links, and other formatting here.


FileNotFoundError: [Errno 2] No such file or directory: 'sample_data/example.md'
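
In [None]:
# The cell above fails when sample_data/example.md is missing from the environment.
# To see what front matter parsing produces without any file on disk, here is a minimal sketch.
# It is NOT how python-frontmatter works internally: split_front_matter is a hypothetical helper,
# and it assumes every metadata value is valid JSON, which happens to hold for this sample only.
# Real front matter is YAML and needs a proper YAML parser.

```python
import json

# Same text as the sample file, kept in a string so no file on disk is needed
sample_text = '''---
title: "Getting Started with AI"
author: "John Doe"
date: "2024-01-15"
tags: ["ai", "machine-learning", "tutorial"]
difficulty: "beginner"
---

# Getting Started with AI

This is the main content of the document written in **Markdown**.
'''

def split_front_matter(text):
  """Hypothetical helper: split text into (metadata dict, content string)."""
  lines = text.splitlines()
  if not lines or lines[0].strip() != '---':
    return {}, text  # no front matter block at the top

  # Find the closing '---' delimiter
  end = None
  for i, line in enumerate(lines[1:], start=1):
    if line.strip() == '---':
      end = i
      break
  if end is None:
    return {}, text  # opening delimiter without a closing one

  metadata = {}
  for line in lines[1:end]:
    if not line.strip():
      continue
    key, _, value = line.partition(':')
    # Assumption: values are valid JSON (true for this sample, not for YAML in general)
    metadata[key.strip()] = json.loads(value.strip())

  content = '\n'.join(lines[end + 1:]).strip()
  return metadata, content

metadata, content = split_front_matter(sample_text)
print(metadata['title'])
print(metadata['tags'])
print(content)
```

# The metadata becomes a dictionary and the Markdown body stays a plain string,
# which mirrors the post.metadata / post.content split used above.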

In [12]:
## Working with Zip Archives

# Of the ways to get a repository's files, downloading it as a zip archive is the easier and more efficient option for our use case.
# We don't even need to save the zip archive - we can load it into our Python process memory and extract all the data we need from there.
# So the plan:
# - Use requests for downloading the zip archive from GitHub
# - Open the archive using built-in zipfile and io modules
# - Iterate over all .md and .mdx files in the repo
# - Collect the results into a list

# Let's implement it step by step.

# First, we import the necessary libraries:

import io
import zipfile
import requests
import frontmatter

# Next, we download the repository as a zip file. GitHub provides a convenient URL format for this:

url = 'https://codeload.github.com/DataTalksClub/faq/zip/refs/heads/main'
resp = requests.get(url)


# Now we process the zip file in memory without saving it to disk:

repository_data = []

# Create a ZipFile object from the downloaded content
zf = zipfile.ZipFile(io.BytesIO(resp.content))

for file_info in zf.infolist():
  filename = file_info.filename.lower()

  # Only process markdown files
  # str.endswith accepts a tuple, so this keeps both .md and .mdx files
  if not filename.endswith(('.md', '.mdx')):
    continue

  # Read and parse each file
  with zf.open(file_info) as f_in:
    content = f_in.read()
    post = frontmatter.loads(content)
    data = post.to_dict()
    data['filename'] = filename
    repository_data.append(data)

zf.close()


# Let's look at what we got
print(repository_data[1])


{'id': '9e508f2212', 'question': 'Course: When does the course start?', 'sort_order': 1, 'content': "The next cohort starts January 13th, 2025. More info at [DTC](https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html).\n\n- Register before the course starts using this [link](https://airtable.com/shr6oVXeQvSI5HuWD).\n- Join the [course Telegram channel with announcements](https://t.me/dezoomcamp).\n- Don’t forget to register in DataTalks.Club's Slack and join the channel.", 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/001_9e508f2212_course-when-does-the-course-start.md'}
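
In [None]:
# The in-memory approach can also be tried without a network connection.
# The sketch below is an illustration under stated assumptions: it builds a tiny zip archive
# in memory (the file names are made up), then reads it back the same way as the downloaded
# bytes, keeping only .md and .mdx entries. archive_bytes stands in for resp.content.

```python
import io
import zipfile

# Build a small zip archive in memory (file names are invented for the demo)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
  zf.writestr('demo-main/README.md', '# Demo')
  zf.writestr('demo-main/docs/guide.mdx', '# Guide')
  zf.writestr('demo-main/src/app.py', 'print("hi")')

archive_bytes = buffer.getvalue()  # plays the role of resp.content

markdown_files = {}
# The with-block closes the archive even if reading a file fails
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
  for file_info in zf.infolist():
    name = file_info.filename
    # Check both extensions at once by passing a tuple to endswith
    if not name.lower().endswith(('.md', '.mdx')):
      continue
    markdown_files[name] = zf.read(file_info).decode('utf-8')

print(sorted(markdown_files))
```

# Using zipfile.ZipFile as a context manager is an alternative to calling zf.close() by hand.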


In [14]:
# Complete implementation of the logic above as a reusable function

import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
  """
  Download and parse all markdown files from a GitHub repository

  Args:
    repo_owner: GitHub username or organization
    repo_name: Repository name

  Returns:
    List of dictionaries containing file content and metadata
  """
  prefix = 'https://codeload.github.com'
  url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
  resp = requests.get(url)

  if resp.status_code != 200:
    raise Exception(f"Failed to download repository {repo_owner}/{repo_name}: {resp.status_code}")

  repository_data = []

  # Create a ZipFile object from the downloaded content
  zf = zipfile.ZipFile(io.BytesIO(resp.content))

  for file_info in zf.infolist():
    filename = file_info.filename
    filename_lower = filename.lower()

    # Skip everything that is not a markdown file
    if not filename_lower.endswith(('.md', '.mdx')):
      continue

    try:
      with zf.open(file_info) as f_in:
        content = f_in.read().decode('utf-8', errors='ignore')
        post = frontmatter.loads(content)
        data = post.to_dict()
        data['filename'] = filename
        repository_data.append(data)
    except Exception as e:
      print(f"Error processing {filename}: {e}")
      continue

  zf.close()
  return repository_data

In [15]:
dtc_faq = read_repo_data('DataTalksClub', 'faq')
evidently_docs = read_repo_data('evidentlyai', 'docs')

print(f"FAQ Documents: {len(dtc_faq)}")
print(f"Evidently Docs: {len(evidently_docs)}")

FAQ Documents: 1217
Evidently Docs: 95
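
In [None]:
# read_repo_data always requests the main branch, so a repository whose default branch
# has a different name would likely fail to download.
# One possible adjustment, sketched below, is to make the branch a parameter.
# build_archive_url is a hypothetical helper, not part of the code above, and the branch
# name in the second call is only an illustration.

```python
def build_archive_url(repo_owner, repo_name, branch='main'):
  """Hypothetical helper: build the codeload zip URL for a given branch."""
  prefix = 'https://codeload.github.com'
  return f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/{branch}'

# Default branch name, same URL as the one used above
print(build_archive_url('DataTalksClub', 'faq'))

# Illustrative only: this branch is not claimed to exist in the repository
print(build_archive_url('DataTalksClub', 'faq', branch='some-branch'))
```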


In [None]:
# Data Processing Considerations

# For the FAQ, the data is ready to use. These are small records that we can index (put into a search engine) as is.
# The Evidently docs, by contrast, are very large documents. They need extra processing called "chunking": breaking large documents into smaller, manageable pieces. This is important because:

# - Search relevance: Smaller chunks are more specific and relevant to user queries
# - Performance: AI models work better with shorter text segments
# - Memory limits: Large documents might exceed token limits of language models

# We will cover chunking techniques in tomorrow's lesson.
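
In [None]:
# To give a rough idea of what "breaking large documents into smaller pieces" can look like,
# here is a minimal sketch of one simple approach: fixed-size character windows with overlap.
# chunk_text, its parameters, and the default sizes are hypothetical choices for illustration,
# not the method the course settles on.

```python
def chunk_text(text, size=2000, overlap=200):
  """Hypothetical helper: split text into overlapping character windows."""
  if size <= 0 or not 0 <= overlap < size:
    raise ValueError('size must be positive and overlap must be in [0, size)')

  step = size - overlap
  chunks = []
  for start in range(0, len(text), step):
    chunks.append({'start': start, 'chunk': text[start:start + size]})
    if start + size >= len(text):
      break  # the last window already reaches the end of the text
  return chunks

sample = 'abcdefghij' * 5  # 50 characters of stand-in text
for piece in chunk_text(sample, size=20, overlap=5):
  print(piece['start'], len(piece['chunk']))
```

# The overlap keeps some shared context between neighbouring chunks,
# so a sentence cut at a boundary still appears whole in one of them.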
