<a href="https://colab.research.google.com/github/rawkintrevo/sme-seeks/blob/main/notebooks/New_Index_from_Website_Scrape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Install Dependencies and CONFIG

In this section we `pip install` the dependencies and `wget` the sitemap of interest. (Note: reading a sitemap is a far cleaner way to 'scrape' a website than crawling links.)

In [None]:
!pip install -q llama-index GitPython "pinecone-client[grpc]" html2text

In [None]:
!wget https://firebase.google.com/docs/sitemap.xml

In [None]:
# CONFIG

INDEX_NAME = "firebase-react-helper"

# Step 2: Open and parse the `sitemap.xml`

We open the `sitemap.xml` and parse it into a list of dictionaries.

**NOTE** Sitemaps vary slightly from site to site, so you might have to tweak this a bit for your use case.
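
One variation worth checking for: the sitemaps.org protocol also allows a *sitemap index*, a file whose root is `<sitemapindex>` and whose entries point at further sitemap files rather than at pages. The sketch below is one way to tell the two apart before parsing. The helper name `list_sitemap_locs` is made up for this example and is not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_locs(sitemap_xml):
    """Return ("index", child sitemap URLs) or ("urlset", page URLs)."""
    root = ET.fromstring(sitemap_xml)
    # ElementTree tags look like "{namespace}sitemapindex"; check the local name.
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    parent = "s:sitemap" if kind == "index" else "s:url"
    locs = [
        loc.text.strip()
        for loc in root.findall(f"{parent}/s:loc", namespaces=NS)
        if loc.text
    ]
    return kind, locs
```

If the result is `"index"`, each returned URL is itself a sitemap that would need to be downloaded and parsed in turn.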


In [None]:
# open ./sitemap.xml for reading in python
with open('./sitemap.xml', 'r') as file:
    data = file.read()

In [None]:
import xml.etree.ElementTree as ET

def parse_sitemap_to_list_of_dicts(sitemap_xml):
    result = []
    root = ET.fromstring(sitemap_xml)

    # Define the namespace for XPath
    namespace = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    for url_elem in root.findall(".//s:url", namespaces=namespace):
        url_dict = {}
        loc_elem = url_elem.find("s:loc", namespaces=namespace)
        changefreq_elem = url_elem.find("s:changefreq", namespaces=namespace)
        priority_elem = url_elem.find("s:priority", namespaces=namespace)

        url_dict["loc"] = loc_elem.text if loc_elem is not None else ""
        url_dict["changefreq"] = changefreq_elem.text if changefreq_elem is not None else ""
        url_dict["priority"] = priority_elem.text if priority_elem is not None else ""
        if "hl=" in url_dict["loc"]:
            continue #skip the link if it's in another lang
        result.append(url_dict)

    return result

In [None]:
parsed_sitemap = parse_sitemap_to_list_of_dicts(data)
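
Before kicking off a long scrape, it can help to sanity-check what was parsed. The sketch below counts entries by the first path segment after `/docs/`. The helper `count_by_section` and the three sample entries are hypothetical; in the notebook you would pass `parsed_sitemap` instead.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_section(entries, prefix="/docs/"):
    """Count sitemap entries by the first path segment after `prefix`."""
    counts = Counter()
    for entry in entries:
        path = urlparse(entry["loc"]).path
        if not path.startswith(prefix):
            counts["(other)"] += 1
            continue
        rest = path[len(prefix):]
        # An empty first segment means the URL is the docs root itself.
        counts[rest.split("/", 1)[0] or "(root)"] += 1
    return counts


# Hypothetical entries, in the same shape as `parsed_sitemap`.
sample = [
    {"loc": "https://firebase.google.com/docs/auth/web/start"},
    {"loc": "https://firebase.google.com/docs/auth/admin"},
    {"loc": "https://firebase.google.com/docs/firestore/quickstart"},
]
print(count_by_section(sample).most_common())
```

A section with a surprisingly large count is a hint that you may want to filter it out before scraping.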

# Step 3: Run the Scrape

`html2text` is nice because it converts the HTML directly into Markdown, which is the format I like to store docs in. However, every website has its idiosyncrasies. In the Firebase docs, some sitemap entries turned out to be artifact pages, which show up as pages without a top-level `# ` title. So, in pseudocode:

```
if "\n# " in content:
   <take everything between '\n# ' and 'Send feedback' as the text>
else:
   <skip the document; it's an artifact>
```

For a different website, you'll need to adapt this rule.
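
To make the rule easier to adapt, the two markers can be pulled out as parameters. This is a minimal sketch: the function name `extract_body` is hypothetical, and the defaults are simply the Firebase markers from the pseudocode above.

```python
def extract_body(md, start_marker="\n# ", end_marker="Send feedback"):
    """Return the text after the first start_marker, cut at end_marker.

    Returns None when start_marker is absent, so the caller can skip the page.
    """
    parts = md.split(start_marker)
    if len(parts) < 2:
        return None
    # If end_marker is missing, split() returns the segment unchanged.
    return parts[1].split(end_marker)[0]
```

For another site you would pass that site's own markers, for example `extract_body(md, start_marker="\n## ", end_marker="Was this page helpful?")`, where both strings are placeholders you would replace after inspecting a few converted pages.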

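Finally, we attach some metadata to each document: the scrape date, the source URL, and a title taken from the URL path after `docs/`.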
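
The metadata step can be sketched as a small helper. The name `build_metadata` and the fallback behaviour are assumptions made for this example: when the URL contains no `docs/`, it uses the whole URL as the title rather than raising an `IndexError`.

```python
from datetime import date


def build_metadata(url, marker="docs/", today=None):
    """Build the metadata dict for one scraped page."""
    # Everything after the first `marker` becomes the title; fall back to
    # the full URL when the marker is absent.
    _, found, tail = url.partition(marker)
    return {
        "date": (today or date.today()).strftime("%Y-%m-%d"),
        "src": url,
        "title": tail if found else url,
    }
```

The `today` parameter exists only so the date can be pinned in tests; in normal use it can be left out.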

In [None]:
import requests
import html2text
from datetime import datetime
from llama_index.readers.schema.base import Document
from time import sleep

documents = []
skips = []
counter = 0

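# Start at counter - 1 so a resumed run retries the last attempted item.
# Clamp at 0: range(-1, ...) would index parsed_sitemap[-1] and scrape the last URL twice.
for c in range(max(counter - 1, 0), len(parsed_sitemap)):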
  item = parsed_sitemap[c]
  counter = c
  if c % 100 == 0:
    print(f"{counter} / {len(parsed_sitemap)} documents collected.")
  counter += 1
  r = requests.get(item["loc"])
  # r.text is the decoded HTML; str(r.content) would yield the bytes repr (b'...') with escaped newlines
  md = html2text.html2text(r.text)
  if "\n# " in md:
    text = md.split("\n# ")[1].split("Send feedback")[0]
  else:
    print(f"Skipping {item['loc']} - missing title #")
    skips.append(item["loc"])
    continue
  documents.append(Document(text=text, metadata={"date": datetime.now().strftime("%Y-%m-%d"),
                                                "src": item["loc"],
                                                "title": item["loc"].split("docs/")[1]
  }))
  sleep(0.3)
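
A long scrape like this will occasionally hit a timeout or a transient server error. One way to keep going is to wrap the fetch in a small retry loop and record the URLs that still fail. The sketch below is not the notebook's method: `fetch_all` is a hypothetical helper, and `fetch` is any callable you supply.

```python
import time


def fetch_all(urls, fetch, delay=0.3, retries=2):
    """Call fetch(url) for every URL, retrying failures with a linear backoff.

    Returns (results, failures): results maps url -> fetched value,
    failures maps url -> repr of the last exception.
    """
    results, failures = {}, {}
    for url in urls:
        for attempt in range(retries + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception as exc:  # deliberately broad for a sketch
                if attempt == retries:
                    failures[url] = repr(exc)
                else:
                    time.sleep(delay * (attempt + 1))
        time.sleep(delay)  # pause between pages
    return results, failures
```

With `requests`, the callable could be something like `lambda url: requests.get(url, timeout=30).text`. The `failures` dict plays the same role as the `skips` list in the loop above.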

In [None]:
# Here's a sample document
documents[0]

Document(id_='1cf71ca5-2137-490f-9913-6515013e35ff', embedding=None, metadata={'date': '2024-01-02', 'src': 'https://firebase.google.com/docs/reference/js/v8/firebase.auth.phonemultifactorassertion', 'title': 'reference/js/v8/firebase.auth.phonemultifactorassertion'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='0ab365817230322061ab47a959708b29dc896ec655c3a2ef0547f47ab3b8e4f5', text='\n\n\n\n  * [firebase](/docs/reference/js/v8/firebase).\n\n\n  * [auth](/docs/reference/js/v8/firebase.auth).\n\n\n  * PhoneMultiFactorAssertion\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThe class for asserting ownership of a phone second factor.\n\n\n\n\n\n\n\n## Index\n\n\n\n\n\n### Constructors\n\n\n\n\n\n  * [constructor](/docs/reference/js/v8/firebase.auth.phonemultifactorassertion#constructor)\n\n\n\n\n### Properties\n\n\n\n\n\n  * [factorId](/docs/reference/js/v8/firebase.auth.phonemultifactorassertion#factorid)\n\n\n\n\n\n\n## Constructors\n\n\n\n### Privat

# Step 4: Create the Index

Note: you'll need the following notebook secrets for this to work.

* `open_ai_key`: your OpenAI API key, used for embeddings
* `pinecone_api_key_js`: the API key for the Pinecone vector store you'll be loading
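
If you also want to run the notebook outside Colab, one option is to fall back to environment variables when `google.colab` is not importable. This is a sketch under that assumption: the helper `get_secret` is hypothetical, and it assumes each environment variable has the same name as the Colab secret.

```python
import os


def get_secret(name):
    """Read a secret from Colab's userdata, else from an environment variable."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get(name)
    except ImportError:
        pass
    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} not found in Colab secrets or the environment")
    return value
```

Calls such as `userdata.get('open_ai_key')` would then become `get_secret('open_ai_key')`.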

In [None]:
from llama_index.storage.storage_context import StorageContext
from llama_index import VectorStoreIndex
from llama_index.vector_stores import PineconeVectorStore

import openai
import pinecone

from google.colab import userdata

openai.api_key = userdata.get('open_ai_key')
pinecone.init(api_key=userdata.get("pinecone_api_key_js"), environment="gcp-starter")

pinecone.create_index(
    INDEX_NAME, dimension=1536, metric="euclidean", pod_type="p1"
)

pinecone_index = pinecone.Index(INDEX_NAME)

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
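
Re-running the cell above after the index already exists is likely to fail at `create_index`. A small guard avoids that. The sketch assumes the 2.x `pinecone-client` interface used in this notebook, where the module-level `pinecone.list_indexes()` returns the existing index names. The helper `ensure_index` is hypothetical.

```python
def ensure_index(client, name, dimension=1536, metric="euclidean", **kwargs):
    """Create the index only if `name` is not already listed.

    `client` is anything exposing list_indexes() and create_index(), such as
    the `pinecone` module after pinecone.init(). Returns True if created.
    """
    if name in client.list_indexes():
        return False
    client.create_index(name, dimension=dimension, metric=metric, **kwargs)
    return True
```

In the notebook this would be called as `ensure_index(pinecone, INDEX_NAME, pod_type="p1")`. The default `dimension=1536` mirrors the value used above, which matches the output size of OpenAI's `text-embedding-ada-002` model.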



# Step 5: Upsert the Documents


In [None]:
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

