# Chroma Initialization

This section sets up the Chroma database so that embeddings can be stored.

In [5]:
# Initiate a persistent Chroma client

import chromadb

client = chromadb.PersistentClient(path="chroma")

client.heartbeat()  # makes sure connection to client is successful

1716157944121996000

### Using Collections

The `collection` primitive lets you manage collections of embeddings.

A Chroma collection is created with a name and an optional embedding function.

## Parse website for data
To parse the website, start with a base URL. From that base, a function gathers the subpages that will be parsed.

Once the text is parsed, it needs to be cleaned. Cleaning makes it possible to split the text into reasonably sized chunks while preserving context.
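
The exact cleaning rules are not fixed yet; one plausible first step is whitespace normalization. The helper name `clean_text` and the sample string are assumptions made for this sketch.

```python
import re

def clean_text(raw_text):
    """Collapse every run of whitespace into one space and trim both ends."""
    # \s matches spaces, tabs, and newlines left over from the HTML layout.
    return re.sub(r"\s+", " ", raw_text).strip()

cleaned = clean_text("  Navigation\n\n  Aids \t are in use.  ")
print(cleaned)
```

Further rules (dropping navigation text, removing footers) would depend on what the parsed pages actually contain.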

### Create a crawler to gather subpages from the main URL

In [53]:
import requests
from bs4 import BeautifulSoup
import re

base_url = "https://www.faa.gov/air_traffic/publications/atpubs/aim_html/"

response = requests.get(base_url)
soup = BeautifulSoup(response.content, "html.parser")
links = soup.find_all('a', href=True)
subpages = set()

for link in links:
    href = link['href']
    # Keep only links whose href contains 'chap' or 'appendix' (case-insensitive)
    if re.search(r'(chap|appendix)', href, re.IGNORECASE):
        full_url = f"{base_url.rstrip('/')}/{href.lstrip('/')}"
        subpages.add(full_url)

subpages

{'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./appendix_1.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./appendix_2.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./appendix_3.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./appendix_4.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./appendix_5.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./chap0_cfr.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./chap0_chap0_policy.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./chap0_faa_desc.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./chap0_info_eoc.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./chap0_subscription_info.html',
 'https://www.faa.gov/air_traffic/publications/atpubs/aim_html/./chap10_section_1.html',
 'https://www.faa.gov/air_traffic/publications/atpubs
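
The hrefs on this page are relative (for example `./appendix_1.html`), which is why the joined URLs above contain a `/./` segment. They still resolve, but if normalized URLs are preferred, `urllib.parse.urljoin` from the standard library is one option. The helper name `resolve_subpage` is hypothetical.

```python
from urllib.parse import urljoin

AIM_BASE_URL = "https://www.faa.gov/air_traffic/publications/atpubs/aim_html/"

def resolve_subpage(base, href):
    """Resolve a relative href against the base URL, removing './' segments."""
    return urljoin(base, href)

resolved = resolve_subpage(AIM_BASE_URL, "./appendix_1.html")
print(resolved)
```

Because `urljoin` replaces everything after the last `/` of the base, the base URL needs to keep its trailing slash for this to work.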

### Chunk content

Each page's content must be split into chunks before it can be embedded. Every chunk must carry the right metadata, and chunks must not overlap.

```
Output:
{
    "content": "",
    "section": "",
    "heading": "",
    "url": "",
    "common_keywords": []
}
```

- Chunk by section
  - Each subpage gathered above contains multiple small sections.
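
One way to assemble a record in the shape shown above is a small helper. The name `make_chunk` and the sample values are assumptions for illustration; in particular, what goes in `section` versus `heading` is a guess.

```python
def make_chunk(content, section, heading, url, common_keywords=None):
    """Build one chunk record with the five fields of the output schema."""
    return {
        "content": content,
        "section": section,
        "heading": heading,
        "url": url,
        # Copy into a new list so records never share one keyword list.
        "common_keywords": list(common_keywords or []),
    }

chunk = make_chunk(
    content="Various types of air navigation aids are in use today, each serving a special purpose.",
    section="General",           # assumed: the title of the section
    heading="Navigation Aids",   # assumed: the page header
    url="https://www.faa.gov/air_traffic/publications/atpubs/aim_html/chap1_section_1.html",
)
print(chunk["section"], chunk["common_keywords"])
```

Building one record per section keeps chunks from overlapping, since each section's text appears in exactly one record.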

In [41]:
# return main html content from given url

def get_page_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    content = soup.find("main", class_="main-content")

    return content

In [42]:
# return soup object to further extract tags

def get_page_soup(url): 
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    return soup

In [62]:
soup = get_page_soup("https://www.faa.gov/air_traffic/publications/atpubs/aim_html/chap1_section_1.html")
soup

<!DOCTYPE html>
<html lang="en"><head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<!-- Insert favicon -->
<link href="./assets/img/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="./assets/img/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="./assets/img/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="./assets/img/site.webmanifest" rel="manifest"/>
<link href="./assets/img/safari-pinned-tab.svg" rel="mask-icon"/>
<!-- Styles -->
<meta content="ie=edge" http-equiv="X-UA-Compatible"/><link href="./assets/css/gfonts.css" rel="stylesheet"/>
<link href="./assets/fontawesome/css/all.css" rel="stylesheet"/>
<link href="./assets/css/atc_nav_search.css" rel="stylesheet"/>
<link href="./assets/css/atc_styles.css" rel="stylesheet"/>
<link href="./assets/css/atc_faa_ol.css" rel="stylesheet"/>
<link href="./assets/css/olstyles.css" rel="stylesheet"/>
<link href="./asse

In [65]:
sections = []
list_items = soup.find_all("li")
list_items

[<li class="nav-item is-active" data-depth="0" data-id="menu-0-4">
 <ul class="nav-list">
 <li class="nav-item" data-depth="1" data-id="menu-1-/chap_0">
 <button aria-label="Navigation toggler" class="nav-toggle"></button>
 <span class="nav-text" style="cursor: pointer;"></span>
 <a class="nav-link" href="./chap_0.html">General Information</a>
 <ul class="nav-list">
 <li class="nav-item is-active" data-depth="2" data-id="menu-2-/chap0_info_eoc.html">
 <a class="nav-link" href="./chap0_info_eoc.html">Explanation of Changes</a>
 </li>
 <li class="nav-item is-active" data-depth="2" data-id="menu-2-/chap0_faa_desc">
 <a class="nav-link" href="./chap0_faa_desc.html">Federal Aviation Administration (FAA)</a>
 </li>
 <li class="nav-item is-active" data-depth="2" data-id="menu-2-/chap0_policy">
 <a class="nav-link" href="./chap0_policy.html">Flight Information Publication Policy</a>
 </li>
 <li class="nav-item is-active" data-depth="2" data-id="menu-2-/chap0_cfr">
 <a class="nav-link" href="./

In [66]:
filtered_list = [li for li in list_items if "nav-item" not in li.get("class", [])]
filtered_list

[<li class="crumb"> <a aria-label="Chapter title" class="chapter-link-title" href="./chap_1.html">Air Navigation</a></li>,
 <li class="crumb">Navigation Aids</li>,
 <li>
 <a class="anchor" id="$paragraph1-1-1"></a><strong>General</strong>
 <ol class="list-style-type-lowerLetter"><li>
 Various types of air navigation aids are in use today, each serving a special purpose. These aids have varied owners and operators, namely: the Federal Aviation Administration (FAA), the military services, private organizations, individual states and foreign governments. The FAA has the statutory authority to establish, operate, maintain air navigation facilities and to prescribe standards for the operation of any of these aids which are used for instrument flight in federally controlled airspace. These aids are tabulated in the Chart Supplement.
 </li>
 <li>
 Pilots should be aware of the possibility of momentary erroneous indications on cockpit displays when the primary signal generator for a ground-bas

In [72]:
# Page header
page_header = filtered_list[1].getText().strip()

In [None]:
# input: (list) list of <li> tags to search for sections
# output: (list) list of content in each section

# This finds the text within each section, then stores it in a list.

def get_section_text(list_of_sections):
    for li in list_of_sections:
        # Only <li> tags that contain an anchor tag hold a section
        if li.find("a", class_="anchor"):
            section_text = li.getText().strip()
            sections.append(section_text)

    # Return the module-level `sections` list so the documented output holds
    return sections
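
To see how the filtering and extraction steps above behave without fetching a page, here is a small sketch that runs on an inline HTML snippet. The snippet is a simplified stand-in rather than real FAA markup, and the helper name `extract_section_texts` is hypothetical; unlike the function above, it builds and returns its own list.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that mimics the structure seen in the output above.
SAMPLE_HTML = """
<ul>
  <li class="nav-item"><a class="nav-link" href="./chap_0.html">General Information</a></li>
  <li class="crumb">Navigation Aids</li>
  <li><a class="anchor" id="p1"></a><strong>General</strong>
    <ol><li>Various types of air navigation aids are in use today.</li></ol>
  </li>
</ul>
"""

def extract_section_texts(list_items):
    """Return the text of every <li> that contains an anchor tag."""
    return [
        li.get_text(" ", strip=True)
        for li in list_items
        if li.find("a", class_="anchor")
    ]

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Drop navigation entries first, as in the filtered_list cell.
content_items = [
    li for li in sample_soup.find_all("li")
    if "nav-item" not in li.get("class", [])
]

section_texts = extract_section_texts(content_items)
print(section_texts)
```

The nested `<li>` inside the `<ol>` has no anchor of its own, so its text is captured once, as part of its parent section, rather than as a separate entry.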