# Simple DomContext Demo

This notebook demonstrates the basic usage of domcontext:
1. Capturing a web page (Google) with Playwright
2. Parsing it into clean markdown
3. Finding elements by ID
4. Chunking large pages

## Prerequisites

```bash
pip install "domcontext[playwright]"
playwright install chromium
```

## Setup

In [1]:
from playwright.async_api import async_playwright
from domcontext import DomContext
from domcontext.utils import capture_snapshot

## Step 1: Capture Google Search Page

In [2]:
# Capture CDP snapshot from Google
# Browser will stay open so you can view the page
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

# Navigate to Google
print("Navigating to Google...")
await page.goto('https://www.google.com')

# Wait for page to stabilize
print("Waiting for page to stabilize...")
await page.wait_for_timeout(1000)  # Wait 1 second

# Capture snapshot
print("Capturing snapshot...")
snapshot = await capture_snapshot(page)

print("✓ Snapshot captured!")
print("Browser will remain open for viewing.")

Navigating to Google...
Waiting for page to stabilize...
Capturing snapshot...
✓ Snapshot captured!
Browser will remain open for viewing.
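
The cell above relies on top-level `await`, which works in Jupyter but not in a plain Python script. Below is a minimal sketch of the same capture wrapped in a coroutine. It assumes `capture_snapshot(page)` behaves as in the cell above and that the snapshot it returns is plain data that stays usable after the browser closes. The name `capture_page`, the `settle_ms` parameter, and the headless default are illustrative and not part of domcontext. The imports sit inside the function so that defining it does not require the optional extras to be installed.

```python
import asyncio


async def capture_page(url: str, headless: bool = True, settle_ms: int = 1000):
    """Open `url`, wait briefly, and return a snapshot (hypothetical helper)."""
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.async_api import async_playwright
    from domcontext.utils import capture_snapshot

    # The context manager stops Playwright even if navigation fails.
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=headless)
        try:
            page = await browser.new_page()
            await page.goto(url)
            await page.wait_for_timeout(settle_ms)  # same fixed wait as the cell above
            return await capture_snapshot(page)
        finally:
            # Unlike the notebook cell, this closes the browser right away.
            await browser.close()


# In a script (not a notebook), run it with:
# snapshot = asyncio.run(capture_page("https://www.google.com"))
```

Unlike the notebook cell, this version does not leave the browser open for viewing, so it suits unattended runs better than interactive exploration.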


## Step 2: Parse into DomContext

In [3]:
# Parse the snapshot
context = DomContext.from_cdp(snapshot)

print(f"Total tokens: {context.tokens}")
print(f"Total elements: {len(list(context.elements()))}")

Total tokens: 617
Total elements: 49


## Step 3: View Clean Markdown Output

In [4]:
# Show the clean markdown representation
print(context.markdown)

- body-1
  - div-1
    - div-2 (role="navigation")
      - a-1
        - "About"
      - a-2
        - "Store"
      - header-1 (role="none")
        - div-3
          - a-3 (aria-label="Gmail ")
            - "Gmail"
          - a-4 (aria-label="Search for Images ")
            - "Images"
        - a-5 (aria-label="Google apps" aria-expanded="false" role="button")
        - a-6 (aria-label="Sign in")
          - span-1
            - "Sign in"
    - svg-1 (aria-label="Google" role="img")
    - form-1 (role="search")
      - div-4
        - div-5
          - textarea-1 (title="Search" aria-label="Search" aria-expanded="false" name="q" role="combobox")
          - div-6
            - div-7 (aria-label=" Clear" role="button")
            - div-8
              - div-9 (aria-label="Search by voice" role="button")
              - div-10 (aria-label="Search by image" role="button")
            - button-1 (role="link" type="button")
              - span-2
                - "AI Mode"
        - 

## Step 4: Get Elements by ID

Looking at the markdown above, each element has a readable ID like `input-1`, `textarea-1`, `div-1`, etc.
You can use `context.get_element(id)` to retrieve specific elements.
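
On a long outline, finding the right ID by eye gets tedious. One possible shortcut is to scan the printed outline with a regular expression. This is a sketch that assumes the line format shown in Step 3 (`- tag-N (attr="value" ...)`). `find_ids` is a hypothetical helper, not a domcontext function, and it only sees the attributes that the outline prints, which is a subset of what `get_element` returns.

```python
import re

# Matches outline lines such as:  - a-5 (aria-label="Google apps" role="button")
# Quoted text lines like  - "About"  are skipped because they start with a quote.
LINE_RE = re.compile(r'^\s*- (?P<id>[a-z][\w-]*-\d+)(?: \((?P<attrs>.*)\))?\s*$')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def find_ids(markdown: str, **wanted: str) -> list[str]:
    """Return element IDs whose printed attributes match every `wanted` pair.

    Underscores in keyword names stand for hyphens (aria_label -> aria-label).
    """
    matches = []
    for line in markdown.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        attrs = dict(ATTR_RE.findall(m.group("attrs") or ""))
        if all(attrs.get(k.replace("_", "-")) == v for k, v in wanted.items()):
            matches.append(m.group("id"))
    return matches


# Abbreviated excerpt in the style of the Step 3 output.
sample = '''\
- body-1
  - div-1
    - a-5 (aria-label="Google apps" aria-expanded="false" role="button")
    - a-6 (aria-label="Sign in")
      - span-1
        - "Sign in"
    - div-9 (aria-label="Search by voice" role="button")
'''
print(find_ids(sample, role="button"))
print(find_ids(sample, aria_label="Sign in"))
```

With a real page, pass `context.markdown` instead of `sample`, then feed a returned ID to `context.get_element(...)`.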

In [5]:
# Get a specific element by its ID from the markdown outline
# "input-1" is the Google Search button (confirmed by the attributes printed below)
search_button = context.get_element('input-1')

print(f"Tag: {search_button.tag}")
print(f"Text: {search_button.text}")
print(f"Attributes: {search_button.attributes}")
print()

# Get the search textarea
search_input = context.get_element('textarea-1')
print(f"Tag: {search_input.tag}")
print(f"Attributes: {search_input.attributes}")

Tag: input
Text: 
Attributes: {'class': 'gNO89b', 'value': 'Google Search', 'aria-label': 'Google Search', 'name': 'btnK', 'role': 'button', 'tabindex': '0', 'type': 'submit', 'data-ved': '0ahUKEwi4js6Rz6-QAxW4OjQIHatrO9AQ4dUDCBM'}

Tag: textarea
Attributes: {'jsname': 'yZiJbe', 'class': 'gLFyf', 'aria-controls': 'Alh6id', 'aria-owns': 'Alh6id', 'title': 'Search', 'aria-label': 'Search', 'aria-autocomplete': 'both', 'aria-expanded': 'false', 'aria-haspopup': 'false', 'autocapitalize': 'off', 'autocomplete': 'off', 'autocorrect': 'off', 'id': 'APjFqb', 'maxlength': '2048', 'name': 'q', 'role': 'combobox', 'rows': '1', 'spellcheck': 'false', 'data-ved': '0ahUKEwi4js6Rz6-QAxW4OjQIHatrO9AQ39UDCAU'}
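
The attribute dictionaries above carry enough information to locate the same element again in the live page. Below is a minimal sketch that turns a tag and its attributes into a CSS selector. `css_selector` is a hypothetical helper, not part of domcontext, and the preference order (id, then name, then aria-label) is a judgment call.

```python
def css_selector(tag: str, attributes: dict[str, str]) -> str:
    """Build a CSS attribute selector from the most specific attribute present."""
    # Preference order is a judgment call: id first, then name, then aria-label.
    for key in ("id", "name", "aria-label"):
        value = attributes.get(key)
        if value:
            # Escape backslashes and double quotes for use inside a quoted CSS string.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            return f'{tag}[{key}="{escaped}"]'
    return tag  # Nothing distinctive: fall back to the bare tag name.


print(css_selector("textarea", {"id": "APjFqb", "name": "q", "role": "combobox"}))
print(css_selector("input", {"name": "btnK", "type": "submit"}))
```

The result can be passed to Playwright, for example `page.locator(css_selector(search_input.tag, search_input.attributes))`. Generated attributes such as `id` may differ between page loads, so a selector built from one snapshot is not guaranteed to match a later one.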


## Step 5: Iterate Through All Elements

You can iterate through all elements to explore the page structure.

In [6]:
# Materialize the elements once, then preview the first 10
elements = list(context.elements())

print("First 10 elements:")
print("=" * 60)

for i, element in enumerate(elements[:10], 1):
    text_preview = element.text[:40] if element.text else "(no text)"
    print(f"{i}. {element.tag:10} - {text_preview}")

print()
print(f"Total: {len(elements)} elements")

First 10 elements:
1. body       - About Store Gmail Images  Sign in  AI Mo
2. div        - About Store Gmail Images  Sign in  AI Mo
3. div        - About Store Gmail Images  Sign in
4. a          - About
5. a          - Store
6. header     - Gmail Images  Sign in
7. div        - Gmail Images
8. a          - Gmail
9. a          - Images
10. a          - (no text)

Total: 49 elements
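
Iteration becomes more useful with a filter on top. The sketch below keeps only elements with a given `role`, assuming `element.attributes` is a plain dict as the Step 4 output suggests. `by_role` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the example runs without a browser.

```python
from types import SimpleNamespace


def by_role(elements, role: str) -> list:
    """Keep elements whose `role` attribute equals `role`."""
    return [el for el in elements if el.attributes.get("role") == role]


# Stand-in objects so the sketch runs without a browser; with a real page,
# pass context.elements() instead.
fake_elements = [
    SimpleNamespace(tag="a", text="About", attributes={}),
    SimpleNamespace(tag="a", text="", attributes={"role": "button", "aria-label": "Google apps"}),
    SimpleNamespace(tag="textarea", text="", attributes={"role": "combobox", "name": "q"}),
    SimpleNamespace(tag="input", text="", attributes={"role": "button", "name": "btnK"}),
]

for el in by_role(fake_elements, "button"):
    label = el.attributes.get("aria-label") or el.attributes.get("name", "")
    print(f"{el.tag:10} {label}")
```

In the notebook, the equivalent call would be `by_role(context.elements(), "button")`.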


## Step 6: Chunking for Large Pages

Split the DOM into smaller chunks that fit in LLM context windows.

**New in v0.1.3**: Atomic-level chunking with continuation markers (`...`) and parent context!
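
To build intuition for the two parameters used in the next cell, here is a toy sliding-window splitter over a flat token list. It is not domcontext's algorithm, which splits along the DOM structure and adds parent context. It only illustrates the assumption that `max_tokens` caps each chunk and `overlap` is the number of tokens shared between consecutive chunks. The function name `sliding_windows` is made up for this sketch.

```python
def sliding_windows(tokens: list[str], max_tokens: int, overlap: int) -> list[list[str]]:
    """Split `tokens` into windows of at most `max_tokens`, each sharing
    `overlap` tokens with the previous window."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # The last window already reaches the end.
    return windows


words = [f"t{i}" for i in range(10)]
for window in sliding_windows(words, max_tokens=4, overlap=1):
    print(window)
```

Each window starts `max_tokens - overlap` tokens after the previous one, so a larger overlap gives more shared context at the cost of more chunks.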

In [7]:
# Split into chunks of at most 200 tokens with a 50-token overlap
chunks = list(context.chunks(max_tokens=200, overlap=50))

print(f"Total chunks: {len(chunks)}")
print("\n" + "=" * 60)

for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i}:")
    print(f"  Tokens: {chunk.tokens}")
    
    preview = chunk.markdown
    print(f"  Preview:\n{preview}")

print("\n" + "=" * 60)
print("\nNotice the smart chunking features (v0.1.3):")
print("  • Parent hierarchy for context (e.g., body-1, div-1)")
print("  • Continuation markers '...' show where content splits")
print("  • Attributes split seamlessly: (type=\"submit\" ...)")
print("  • Text splits by word: \"Hello world ...\"")
print("\nTo disable parent path, use: chunks(include_parent_path=False)")

Total chunks: 7


Chunk 1:
  Tokens: 198
  Preview:
- body-1
  - div-1
    - div-2 (role="navigation")
      - a-1
        - "About"
      - a-2
        - "Store"
      - header-1 (role="none")
        - div-3
          - a-3 (aria-label="Gmail ")
            - "Gmail"
          - a-4 (aria-label="Search for Images ")
            - "Images"
        - a-5 (aria-label="Google apps" aria-expanded="false" role="button")
        - a-6 (aria-label="Sign in")
          - span-1
            - "Sign in"
    - svg-1 (aria-label="Google" role="img")
    - form-1 (role="search")
      - div-4
        - div-5
          - textarea-1 (title="Search" aria-label="Search" aria-expanded="false" ...)


Chunk 2:
  Tokens: 195
  Preview:
- body-1
  - div-1
    - div-2
      - header-1
        - a-5 (aria-label="Google apps" aria-expanded="false" role="button")
        - a-6 (aria-label="Sign in")
          - span-1
            - "Sign in"
    - svg-1 (aria-label="Google" role="img")
    - form-1 (role="sear
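
Small pages such as this one (617 tokens) may not need chunking at all. The sketch below sends the whole page when it fits a token budget and falls back to chunks otherwise. It uses only the names seen in this notebook (`context.tokens`, `context.markdown`, `context.chunks(max_tokens=..., overlap=...)`). `prompt_parts` and `FakeContext` are hypothetical; the latter is a stand-in so the example runs without a browser.

```python
from types import SimpleNamespace


def prompt_parts(context, budget: int, overlap: int = 50) -> list[str]:
    """Return the whole page if it fits in `budget` tokens, else its chunks."""
    if context.tokens <= budget:
        return [context.markdown]
    return [chunk.markdown for chunk in context.chunks(max_tokens=budget, overlap=overlap)]


class FakeContext:
    """Stand-in with the same attribute names the notebook uses."""

    def __init__(self, tokens: int, markdown: str):
        self.tokens = tokens
        self.markdown = markdown

    def chunks(self, max_tokens: int, overlap: int):
        # Pretend the page always splits into two pieces.
        return [SimpleNamespace(markdown="part 1"), SimpleNamespace(markdown="part 2")]


page = FakeContext(tokens=617, markdown="- body-1")
print(len(prompt_parts(page, budget=1000)))
print(len(prompt_parts(page, budget=200)))
```

In the notebook, `prompt_parts(context, budget=200)` would return the markdown of the same chunks printed above.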

## Summary

You've learned:
- ✅ How to capture a live web page with Playwright
- ✅ How to parse it into clean markdown
- ✅ How to find elements by their generated IDs
- ✅ How to chunk large pages for LLM context windows

Next: Check out `advanced_demo.ipynb` for more features!

## Cleanup

Close the browser when you're done viewing the page.

In [8]:
# Close the browser and cleanup
await browser.close()
await playwright.stop()
print("✓ Browser closed!")

✓ Browser closed!
