<a href="https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/main/notebooks/Crawl_a_Website.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q llama-index==0.13.3 llama-index-llms-openai==0.5.4 openai==1.102.0 newspaper4k==0.9.3.1 \
                lxml_html_clean==0.4.2 crawl4ai==0.7.4 jedi==0.19.2

!playwright install


In [2]:
import os
import asyncio
import json
import nest_asyncio
from google.colab import userdata

# Set API Keys
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [3]:
import newspaper

urls = [
    "https://docs.llamaindex.ai/en/stable/understanding",
    "https://docs.llamaindex.ai/en/stable/understanding/using_llms/using_llms/",
    "https://docs.llamaindex.ai/en/stable/understanding/indexing/indexing/",
    "https://docs.llamaindex.ai/en/stable/understanding/querying/querying/",
]

pages_content = []

# Retrieve the Content
for url in urls:
    try:
        article = newspaper.Article(url)
        article.download()
        article.parse()
        if len(article.text) > 0:
            pages_content.append(
                {"url": url, "title": article.title, "text": article.text}
            )
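    except Exception as exc:
        # Skip pages that fail to download or parse, but report why.
        print(f"Skipping {url}: {exc}")
        continue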

print(pages_content[0])
print(len(pages_content))

{'url': 'https://docs.llamaindex.ai/en/stable/understanding', 'title': 'Building an LLM application #', 'text': "Using LLMs: hit the ground running by getting started working with LLMs. We'll show you how to use any of our dozens of supported LLMs, whether via remote API calls or running locally on your machine.\n\nBuilding agents: agents are LLM-powered knowledge workers that can interact with the world via a set of tools. Those tools can retrieve information (such as RAG, see below) or take action. This tutorial includes:\n\nBuilding a single agent: We show you how to build a simple agent that can interact with the world via a set of tools.\n\nUsing existing tools: LlamaIndex provides a registry of pre-built agent tools at LlamaHub that you can incorporate into your agents.\n\nMaintaining state: agents can maintain state, which is important for building more complex applications.\n\nStreaming output and events: providing visibility and feedback to the user is important, and streaming

In [4]:
# Convert to Document
from llama_index.core.schema import Document

documents = [
    Document(text=row["text"], metadata={"title": row["title"], "url": row["url"]})
    for row in pages_content
]


## Crawl a Website

In [5]:
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

urls_to_crawl = [
    "https://docs.llamaindex.ai/en/stable/understanding/",
]

# Synchronous wrapper
def crawl_sync():
    async def crawl_with_crawl4ai():
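        config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,  # always fetch a fresh copy
            page_timeout=80000,           # per-page timeout in milliseconds
            word_count_threshold=50       # drop content blocks shorter than 50 words
        )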

        data_res = {"data": []}

        async with AsyncWebCrawler() as crawler:
            results = await crawler.arun_many(
                urls_to_crawl,
                config=config
            )

            for result in results:
                if result.success:
                    title = result.metadata.get("title", "")
                    if not title and result.markdown:
                        lines = result.markdown.raw_markdown.split('\n')
                        for line in lines:
                            if line.startswith('#'):
                                title = line.strip('#').strip()
                                break

                    data_res["data"].append({
                        "text": result.markdown.raw_markdown if result.markdown else "",
                        "meta": {
                            "url": result.url,
                            "meta": {
                                "title": title
                            }
                        }
                    })

        return data_res

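    # Colab already runs an event loop; nest_asyncio permits nested use.
    nest_asyncio.apply()

    loop = asyncio.new_event_loop()
    try:
        result = loop.run_until_complete(crawl_with_crawl4ai())
    finally:
        # Always release the loop, even if the crawl raises.
        loop.close()
    return result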

# Run the crawler
data_res = crawl_sync()

# Print the URL, title, and first 500 characters of the first crawled page
print("URL:", data_res["data"][0]["meta"]["url"])
print("Title:", data_res["data"][0]["meta"]["meta"]["title"])
print("Content:", data_res["data"][0]["text"][0:500], "...")

URL: https://docs.llamaindex.ai/en/stable/understanding/
Title: Building an LLM Application - LlamaIndex
Content: [ Skip to content ](https://docs.llamaindex.ai/en/stable/understanding/#building-an-llm-application)
[ ![logo](https://docs.llamaindex.ai/en/stable/_static/assets/LlamaSquareBlack.svg) ](https://docs.llamaindex.ai/en/stable/ "LlamaIndex")
LlamaIndex 
Building an LLM Application 
Search`K`
  * [ Home ](https://docs.llamaindex.ai/en/stable/)
  * [ Learn ](https://docs.llamaindex.ai/en/stable/understanding/)
  * [ Use Cases ](https://docs.llamaindex.ai/en/stable/use_cases/)
  * [ Examples ](https:/ ...
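
The title fallback inside `crawl_with_crawl4ai` can be pulled out into a small helper and tried without running a crawl. The sketch below is a minimal standalone version; the name `title_from_markdown` and its `default` parameter are hypothetical and not part of crawl4ai. It differs slightly from the inline loop in that it ignores leading whitespace and only strips `#` from the start of the line.

```python
def title_from_markdown(markdown: str, default: str = "") -> str:
    """Return the text of the first Markdown heading, or `default` if none is found."""
    for line in markdown.split("\n"):
        stripped = line.strip()
        if stripped.startswith("#"):
            # Drop the leading '#' markers and surrounding whitespace.
            return stripped.lstrip("#").strip()
    return default


sample = "intro text\n## Building an LLM Application\nbody text"
print(title_from_markdown(sample))
```

A helper like this could be used wherever `result.metadata` has no `title` entry, with `default` set to the page URL if an empty title is undesirable.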


In [6]:
from llama_index.core.schema import Document

documents = [
    Document(
        text=row["text"],
        metadata={"title": row["meta"]["meta"]["title"], "url": row["meta"]["url"]},
    )
    for row in data_res["data"]
]

In [7]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex

Settings.llm = OpenAI(model="gpt-5-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=30)
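
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below slides a fixed window over a list of words. It is a simplified illustration only: `SentenceSplitter` measures size in tokens and tries to keep sentences together, whereas this hypothetical `chunk_with_overlap` helper works on whitespace-separated words and ignores sentence boundaries.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With the toy values above, each chunk starts with the last word of the previous one. The same idea, at a scale of 512 and 30, is what lets neighbouring chunks share a little context.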

In [8]:
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

In [9]:
res = query_engine.query("What is a query engine?")
print(res.response)

A query engine is the component that runs retrieval and processing over your indexed data to answer user questions. It implements the querying strategy paired with an index to:

- find and return the most relevant pieces of data (improving relevance, speed, and accuracy),
- optionally re-rank, filter, or transform results before passing them to an LLM,
- and format the LLM’s output into structured responses (for example, an API response).

In short, it’s the part of a RAG pipeline that turns a user query into the right data and response.


In [10]:
# Show the retrieved nodes
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    print("Title\t", src.metadata["title"])
    print("URL\t", src.metadata["url"])
    print("Score\t", src.score)
    print("-_" * 20)

Node ID	 6713aba0-4b60-4f71-b9fd-0964ef9e80f2
Title	 Building an LLM Application - LlamaIndex
URL	 https://docs.llamaindex.ai/en/stable/understanding/
Score	 0.30650711906653494
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
Node ID	 be435d8c-1fbd-4429-83cc-def67de565b5
Title	 Building an LLM Application - LlamaIndex
URL	 https://docs.llamaindex.ai/en/stable/understanding/
Score	 0.2536047997303896
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
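
The `Score` values above are similarity scores between the query embedding and each node embedding. Assuming the default in-memory vector store ranks nodes by cosine similarity and returns the top two, the ranking step can be sketched in plain Python. The toy two-dimensional vectors and the `top_k` helper are hypothetical; real embeddings from `text-embedding-3-small` have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=2):
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for indexed nodes.
toy_nodes = {
    "node-a": [1.0, 0.0],
    "node-b": [0.6, 0.8],
    "node-c": [0.0, 1.0],
}
ranked = top_k([1.0, 0.0], toy_nodes, k=2)
for node_id, score in ranked:
    print(node_id, round(score, 3))
```

Under this assumption, a higher score means the node's text is closer in meaning to the query, which is why the nodes are listed in descending score order.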
