# Week 1 — Q4–Q6 (Podcasts ingestion + chunking + search)

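This notebook fetches the DataTalks.Club podcast pages from GitHub, splits each episode into overlapping windows of paragraphs, indexes the resulting chunks with **minsearch**, and runs a search query against the index.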

In [118]:
import os, io, json, requests, traceback
from typing import Iterable, Callable, Dict, Any, List
from dataclasses import dataclass
import frontmatter
import sys
import re
import math

# Add the parent directory ('..') to the system path
sys.path.append(os.path.abspath('..'))

from minsearch import Index

from ai_data_pipelines.github_docs.github import GithubRepositoryDataReader
from ai_data_pipelines.common.chunking import sliding_window, chunk_documents

## Q4 — Download the podcast data and count records

In [119]:
@dataclass
class RawFile:
    filename: str
    content: str

In [120]:
reader = GithubRepositoryDataReader(
    repo_owner="DataTalksClub",
    repo_name="datatalksclub.github.io",
    allowed_extensions={"md", "mdx"},
    filename_filter=lambda p: p.startswith("_podcast/") and not p.endswith("_template.md")
)
raw = reader.read()
raw_docs = [RawFile(filename=f.filename, content=f.content) for f in raw]
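
The `filename_filter` passed to the reader can be checked in isolation. The sketch below reuses the same predicate on a few sample paths. Only the first path appears in this notebook's output; the other two are hypothetical and exist only to exercise each branch. The name `is_podcast_episode` is also introduced here for illustration.

```python
# Same predicate as the filename_filter lambda passed to the reader.
def is_podcast_episode(path: str) -> bool:
    return path.startswith("_podcast/") and not path.endswith("_template.md")

sample_paths = [
    "_podcast/s09e04-freelancing-and-consulting-with-data-engineering.md",
    "_podcast/_template.md",   # hypothetical template file, should be rejected
    "_posts/example-post.md",  # hypothetical non-podcast file, should be rejected
]

kept = [p for p in sample_paths if is_podcast_episode(p)]
print(kept)
```

Only the episode file passes the filter. The template and the file outside `_podcast/` are dropped before any parsing happens.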

In [121]:
def parse_frontmatter(raw_docs):
    parsed = []
    for rf in raw_docs:
        try:
            post = frontmatter.loads(rf.content)
            d = post.to_dict()
            d["filename"] = rf.filename
            parsed.append(d)
        except Exception as e:
            # Report which file failed to parse and continue with the rest
            print(f"Failed to parse {rf.filename}: {e}")
    return parsed

docs = parse_frontmatter(raw_docs)
len(docs)

184

## Q5 — Chunk by paragraphs (size=30, overlap=15) and count chunks
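
The relation between window size, overlap, and step used in this section is `step = size - overlap`, so `size=30` with `overlap=15` gives `step=15`. The chunking itself is done by `chunk_documents` from the project's own `ai_data_pipelines.common.chunking` module, whose source is not shown in this notebook. The function below, `sliding_window_sketch`, is a hypothetical stand-in that illustrates the assumed behavior. It is not the module's actual implementation.

```python
from typing import Any, Dict, List, Sequence


def sliding_window_sketch(seq: Sequence[Any], size: int, step: int) -> List[Dict[str, Any]]:
    """Return overlapping windows over seq, each tagged with its start index."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    windows = []
    for start in range(0, len(seq), step):
        windows.append({"start": start, "content": list(seq[start:start + size])})
        # Stop once a window reaches the end of the sequence,
        # so the tail is not emitted again in an even shorter window.
        if start + size >= len(seq):
            break
    return windows


size, overlap = 30, 15
step = size - overlap  # 15

# Stand-in for one episode's paragraph list (70 dummy paragraphs).
paragraphs = [f"p{i}" for i in range(70)]
windows = sliding_window_sketch(paragraphs, size=size, step=step)

print([w["start"] for w in windows])
print([len(w["content"]) for w in windows])
```

With 70 paragraphs this produces windows starting at 0, 15, 30, and 45, and the last window holds the remaining 25 paragraphs. Start offsets are always multiples of the step, which matches the `para_start` values printed in the Q6 results.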

In [122]:
# Q5 — chunk by transcript paragraphs (size=30, overlap=15 -> step=15)

def _split_body_content(body: str) -> list[str]:
    """
    Split markdown body into 'paragraphs' with emphasis on bullet lists.
    - Prefer bullets starting with '*' or '-' at the beginning of a line.
    - Clean trailing attribute blocks like {:target="_blank"}.
    - Fallback to blank-line paragraphs if no bullets found.
    """
    if not body:
        return []

    # Remove a standalone "Links:" line (common in these docs)
    body = re.sub(r'(?im)^\s*links:\s*$', '', body).strip()

    # Extract bullets: lines beginning with * or -
    bullets = re.findall(r'^\s*[\-\*]\s+(.*\S.*)$', body, flags=re.M)
    if bullets:
        clean = [re.sub(r'\{\:.*?\}', '', b).strip() for b in bullets]
        return [c for c in clean if c]

    # Fallback: split by blank lines
    parts = [p.strip() for p in re.split(r'\n\s*\n', body) if p.strip()]
    return parts

def _extract_paragraphs_from_doc(d: dict) -> list[str]:
    paras = []

    # 1) Transcript lines/headers as paragraphs
    tr = d.get("transcript")
    if isinstance(tr, list):
        for entry in tr:
            if isinstance(entry, dict):
                if entry.get("line"):
                    paras.append(str(entry["line"]).strip())
                elif entry.get("header"):
                    paras.append(str(entry["header"]).strip())
            elif isinstance(entry, str) and entry.strip():
                paras.append(entry.strip())

    # 2) Plus any body links/paragraphs in 'content'
    body = d.get("content") or ""
    paras.extend(_split_body_content(body))

    return paras

# Prepare docs for chunking: put the paragraph LIST into the 'content' field
docs_for_chunking = []
skipped = 0  # stays 0: templates are already excluded by the reader's filename_filter
for d in docs:
    paras = _extract_paragraphs_from_doc(d)
    docs_for_chunking.append({**d, "content": paras})

print(f"Docs prepared: {len(docs_for_chunking)} (skipped templates: {skipped})")
print("Total paragraphs:", sum(len(d["content"]) for d in docs_for_chunking))

# size=30, overlap=15 -> step=15
chunks = chunk_documents(
    documents=docs_for_chunking,
    size=30,
    step=15,
    content_field_name="content"  # this is our paragraph list
)

# chunk_documents returns windows where 'content' is the list slice; join for indexing
for c in chunks:
    if isinstance(c.get("content"), list):
        c["content"] = "\n\n".join(c["content"])
    if "para_start" not in c:
        c["para_start"] = c.get("start", "")

print(f"Q5: number of chunks: {len(chunks)}")

# Sanity check: average number of windows per episode
avg = len(chunks) / max(1, len(docs_for_chunking))
print("Avg windows per episode:", round(avg, 2))


Docs prepared: 184 (skipped templates: 0)
Total paragraphs: 27268
Q5: number of chunks: 1741
Avg windows per episode: 9.46
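
The bullet handling in `_split_body_content` above can be checked on a small input. The sketch below applies the same three regular expressions to a hypothetical episode body. The links and the `example.com` URLs are made up for illustration.

```python
import re

# Hypothetical episode body: a "Links:" line followed by two bullets
# that carry kramdown-style attribute blocks.
body = """Links:

* [LinkedIn](https://example.com/a){:target="_blank"}
- [Website](https://example.com/b){:target="_blank"}
"""

# Step 1: drop the standalone "Links:" line.
body = re.sub(r'(?im)^\s*links:\s*$', '', body).strip()

# Step 2: capture the text of each line that starts with '*' or '-'.
bullets = re.findall(r'^\s*[\-\*]\s+(.*\S.*)$', body, flags=re.M)

# Step 3: strip trailing attribute blocks such as {:target="_blank"}.
clean = [re.sub(r'\{\:.*?\}', '', b).strip() for b in bullets]

print(clean)
```

Each bullet becomes one paragraph, so an episode's link list adds one entry per link to the paragraph count on top of its transcript lines.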


## Q6 — Index with minsearch and query

In [127]:
index = Index(text_fields=["content", "title"])
index.fit(chunks)

q = "how do I make money with AI?"
# Append related keywords so chunks that use different wording can still match
q = " ".join([
    q,
    "monetize revenue business entrepreneurship freelancing consulting"
])
hits = index.search(query=q, num_results=20)

print("Top 20:")
for i, h in enumerate(hits[:20], 1):
    print(f"{i}. {h.get('title')!r} — {h.get('filename')} (para_start={h.get('para_start')})")

first = hits[0] if hits else {}
print("\nQ6: first episode in results:", first.get("title", "<no title>"), "(", first.get("filename"), ")")


Top 20:
1. 'Freelancing and Consulting with Data Engineering' — _podcast/s09e04-freelancing-and-consulting-with-data-engineering.md (para_start=135)
2. 'Freelancing and Consulting with Data Engineering' — _podcast/s09e04-freelancing-and-consulting-with-data-engineering.md (para_start=150)
3. 'Freelancing and Consulting with Data Engineering' — _podcast/s09e04-freelancing-and-consulting-with-data-engineering.md (para_start=0)
4. 'Freelancing and Consulting with Data Engineering' — _podcast/s09e04-freelancing-and-consulting-with-data-engineering.md (para_start=75)
5. 'Freelancing and Consulting with Data Engineering' — _podcast/s09e04-freelancing-and-consulting-with-data-engineering.md (para_start=30)
6. 'Freelancing and Consulting with Data Engineering' — _podcast/s09e04-freelancing-and-consulting-with-data-engineering.md (para_start=45)
7. 'Freelancing and Consulting with Data Engineering' — _podcast/s09e04-freelancing-and-consulting-with-data-engineering.md (para_start=90)
8. 'Freelan
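
Because the windows overlap, several chunks from the same episode can fill the top of the ranking, as the list above shows. A minimal sketch of collapsing the hits to one entry per episode follows. The helper `first_hit_per_episode` and the sample hits are hypothetical. The sketch assumes only that each hit is a dict with a `filename` key, as in the results printed above.

```python
from typing import Any, Dict, List


def first_hit_per_episode(hits: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the highest-ranked chunk for each filename, preserving rank order."""
    seen = set()
    unique = []
    for h in hits:
        key = h.get("filename")
        if key not in seen:
            seen.add(key)
            unique.append(h)
    return unique


# Hypothetical hits, shaped like the dicts printed in the ranking above.
sample_hits = [
    {"filename": "_podcast/a.md", "title": "Episode A", "para_start": 135},
    {"filename": "_podcast/a.md", "title": "Episode A", "para_start": 150},
    {"filename": "_podcast/b.md", "title": "Episode B", "para_start": 0},
]

episodes = first_hit_per_episode(sample_hits)
for h in episodes:
    print(h["title"], h["para_start"])
```

Applied to the real `hits` list, this would give a ranking of distinct episodes. Requesting more than 20 results beforehand may be needed so that enough distinct episodes remain after collapsing.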

In [128]:
print(hits[0]["content"])

Yeah, open source chatbot company. Essentially, you don't know what you will find in the subject because it depends on implementation. Even the schema is not really fixed. So we needed a way to manage a very volatile schema, but also be able to freeze it if we wanted to.

Going from freelancing to making your own product (and other investments)

I heard that many people want to go to consulting, to freelancing, but they don't want to do it forever. Because here, you exchange time for money, but at some point, maybe you just want to get money without spending time. [chuckles] Is this usually what freelancers do at the end? They see a problem that is repeated many times, so they end up packaging it as a product and then sell it as a product?

Some people, yes – because they get bored. But each person has their own ambitions and approaches. Some people simply go three days a week and enjoy a richer life. Other people start companies, other people don't care, and just work like that. Other