## Question 4

For Q4-Q6 we'll reproduce what we did in the module, but with a different GitHub repo.

We'll use the podcasts archive from DataTalks.Club: https://datatalks.club/podcast.html

The data is available here: https://github.com/DataTalksClub/datatalksclub.github.io/tree/main/_podcast

Download the data (only for podcasts). How many records are there?

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://github.com/DataTalksClub/datatalksclub.github.io/tree/main/_podcast'

# Run Chrome headless; webdriver_manager fetches a matching chromedriver.
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

filenames = []
seen = set()
try:
    driver.get(url)

    # The file name is read from each link's title attribute.
    for a in driver.find_elements(By.TAG_NAME, 'a'):
        title = a.get_attribute('title') or ''  # guard against a missing title
        if not title.endswith('.md') or title in seen:
            continue
        seen.add(title)
        filenames.append(title)
finally:
    driver.quit()  # always release the browser process

In [4]:
print(f'There are {len(filenames)} records.')

There are 185 records.
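
The filtering loop above boils down to an order-preserving de-duplication of link titles. A minimal sketch of that logic as a standalone function, run on made-up titles (the `collect_markdown_titles` name and the sample list are illustrative and not part of the notebook):

```python
def collect_markdown_titles(titles):
    """Return unique titles ending in '.md', in first-seen order."""
    seen = set()
    result = []
    for title in titles:
        title = title or ''  # treat a missing title as empty
        if not title.endswith('.md') or title in seen:
            continue
        seen.add(title)
        result.append(title)
    return result


# Hypothetical link titles, including a duplicate and non-Markdown entries.
sample_titles = ['s01e01-roles.md', None, 's01e01-roles.md', 'README', 's01e02-processes.md']
markdown_titles = collect_markdown_titles(sample_titles)
print(markdown_titles)
```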



## Question 5

Let's prepare this data. It's already structured, so you can chunk it using paragraphs. Let's do chunk size 30 and overlap 15. How many chunks do you have in the result?
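
A chunk size of 30 with an overlap of 15 means each window starts 15 items after the previous one, that is, step = size - overlap. The sketch below only illustrates that arithmetic: `count_windows` is a hypothetical helper, and it assumes windowing stops at the first window that reaches the end of the sequence.

```python
import math


def count_windows(n, size=30, overlap=15):
    """Number of sliding windows over n items, stopping at the first window that reaches the end."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    if n <= 0:
        return 0
    if n <= size:
        return 1
    return math.ceil((n - size) / step) + 1


for n in (10, 30, 31, 45, 46):
    print(n, count_windows(n))
```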

In [2]:
import re

import frontmatter
import requests
from tqdm.auto import tqdm

podcasts = []
for f in tqdm(filenames):
    url = f'https://raw.githubusercontent.com/DataTalksClub/datatalksclub.github.io/refs/heads/main/_podcast/{f}'

    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail loudly instead of parsing an error page
    raw = r.text

    # Replace {{ ... }} template expressions before parsing the front matter.
    content = re.sub(r'\{\{.*?\}\}', 'TEMPLATE_VAR', raw)
    post = frontmatter.loads(content)

    data = post.to_dict()
    data['filename'] = f
    podcasts.append(data)

100%|██████████| 185/185 [00:25<00:00,  7.39it/s]
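
The `re.sub` call above swaps every `{{ ... }}` expression for a plain placeholder before the front matter is parsed, presumably so that the YAML parser does not choke on the double braces. A small stdlib-only illustration on a made-up snippet (the sample text is hypothetical):

```python
import re

# Hypothetical front matter containing template expressions.
raw = 'title: "Episode"\nlink: {{ site.url }}/podcast\nnote: {{ a }} and {{ b }}\n'

# Non-greedy match so each {{ ... }} pair is replaced on its own.
cleaned = re.sub(r'\{\{.*?\}\}', 'TEMPLATE_VAR', raw)
print(cleaned)
```

Because `.` does not match newlines by default, an expression that spans several lines would be left untouched by this pattern.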


In [3]:
flattened = []

for p in podcasts:
    # The transcript is normally a list of entries; tolerate a dict wrapper or a missing value.
    transcript = p.get("transcript") or []
    entries = transcript.get("chunk", []) if isinstance(transcript, dict) else transcript
    # Keep only entries that carry spoken text under the "line" key.
    lines = [e["line"] for e in entries if isinstance(e, dict) and "line" in e]
    merged_text = " ".join(lines).strip()

    flattened.append({
        "title": p.get("title"),
        "episode": p.get("episode"),
        "text": merged_text
    })
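
To make the flattening step concrete, here is a minimal sketch on an invented transcript. The `merge_transcript_lines` helper and the sample entries are hypothetical; they only mirror the `"line"` filtering used above.

```python
def merge_transcript_lines(transcript):
    """Join the 'line' values of transcript entries into one string."""
    entries = transcript or []
    lines = [e["line"] for e in entries if isinstance(e, dict) and "line" in e]
    return " ".join(lines).strip()


# Hypothetical transcript: a section header followed by two spoken lines.
sample_transcript = [
    {"header": "Introduction"},
    {"line": "Welcome to the show.", "who": "Host"},
    {"line": "Thanks for having me.", "who": "Guest"},
]
merged = merge_transcript_lines(sample_transcript)
print(merged)
```

Entries without a `"line"` key, such as the header, are dropped, so only spoken text ends up in the merged string.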

In [6]:
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        if i + size >= n:
            break

    return result

def chunk_documents(docs: list, size=30, step=15):
    doc_chunks = []

    for doc in docs:
        doc_copy = doc.copy()
        # 'text' is one merged string here, so the window slides over characters.
        doc_content = doc_copy.pop('text')
        chunks = sliding_window(doc_content, size, step)
        for chunk in chunks:
            chunk.update(doc_copy)
        doc_chunks.extend(chunks)
    return doc_chunks

chunks = chunk_documents(flattened)

In [7]:
print(f'There are {len(chunks)} chunks.')

There are 516597 chunks.
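
Note that `sliding_window` slices whatever sequence it receives: given a string it produces windows of characters, while given a list of paragraphs it produces windows of paragraphs. The sketch below contrasts the two on toy data with a smaller window. It re-declares a compact version of the function so it runs on its own, and the sample lines are invented.

```python
def sliding_window(seq, size, step):
    """Windows of `size` items every `step` items, ending once the end is reached."""
    result = []
    for i in range(0, len(seq), step):
        result.append({'start': i, 'chunk': seq[i:i + size]})
        if i + size >= len(seq):
            break
    return result


# Invented transcript lines standing in for paragraphs.
lines = ["first line", "second line", "third line", "fourth line", "fifth line"]

by_paragraph = sliding_window(lines, size=4, step=2)
by_character = sliding_window(" ".join(lines), size=4, step=2)

print(len(by_paragraph), by_paragraph[0]['chunk'])
print(len(by_character), by_character[0]['chunk'])
```

If paragraph-level chunks are the goal, one option is to keep each episode's lines as a list instead of joining them into a single string; the resulting chunk count would then differ from the figure above.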


## Question 6

Index the data with Index from minsearch. What's the first episode in the results for "how do I make money with AI?"



In [13]:
from minsearch import Index
index = Index(
    text_fields=['chunk', 'title']
)

index.fit(chunks)


<minsearch.minsearch.Index at 0x17a0c2810>

In [23]:
query = 'how do I make money with AI?'

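# Use different quote styles inside and outside the f-string (required before Python 3.12).
top = index.search(query)[0]
print(f"{top['title']} is the first episode in the results.")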

Make an Impact Through Volunteering Open Source Work is the first episode in the results.
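
Since the index is built over chunks, several hits can come from the same episode. A small sketch of collapsing hits to distinct episode titles in ranked order; the `distinct_titles` helper and the sample hits are hypothetical and only mimic the shape of the chunk dictionaries built earlier.

```python
def distinct_titles(hits):
    """Return episode titles in ranked order, dropping repeats."""
    seen = set()
    titles = []
    for hit in hits:
        title = hit.get('title')
        if title is None or title in seen:
            continue
        seen.add(title)
        titles.append(title)
    return titles


# Hypothetical search hits shaped like the chunk records (start, chunk, title, episode).
sample_hits = [
    {'start': 0, 'chunk': '...', 'title': 'Episode A', 'episode': 1},
    {'start': 15, 'chunk': '...', 'title': 'Episode A', 'episode': 1},
    {'start': 30, 'chunk': '...', 'title': 'Episode B', 'episode': 2},
]
episode_titles = distinct_titles(sample_hits)
print(episode_titles)
```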
