In [2]:
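import sys

from dotenv import load_dotenv

# Extend the search path before importing paginx, in case it is loaded from the parent directory
sys.path.append('..')
load_dotenv()  # read API keys and other settings from a local .env file

import paginx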

# Web Page Indexing and Vectorization 👀

This Jupyter notebook contains a script that performs indexing and vectorization of web page contents. The primary purpose of this script is to crawl through a specified web page, extract the textual contents, and subsequently store these contents as vector objects in a database.

The vectorized information can then be used in a Retrieval-Augmented Generation (RAG) flow, in which a large language model (LLM) answers questions using the indexed content as context. This makes the responses more context-aware and grounds them in the actual content of the web page.

The notebook is structured in a step-by-step manner, guiding you through the process of web page crawling, text extraction, vectorization, and storage in a database. Each step is accompanied by detailed explanations and code snippets to provide a comprehensive understanding of the process.

## Web Crawler and Content Extractor

This code implements a web crawler and content extractor that:

1. Extracts URLs from the given HTML content, filtering for the same domain and validating the URLs. ✅
2. Crawls a website starting from a given URL, iteratively processing and extracting links from each page. ✅
3. Returns a list of HTML documents extracted from the website. ✅

The code displays the source URL of each processed page and the total number of pages in the extracted content.
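
To make the first step concrete, here is a minimal sketch of same-domain link extraction using only the Python standard library. It is not the `paginx` implementation, which is not shown in this notebook; the names `LinkCollector` and `same_domain_links` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(html, base_url):
    """Return sorted, de-duplicated absolute links that share base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links = set()
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        # Keep only web links on the same host; this drops mailto:, external sites, etc.
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            # Strip fragments so /about and /about#team count as one page
            links.add(parsed._replace(fragment="").geturl())
    return sorted(links)


sample_html = (
    '<a href="/about">About</a> '
    '<a href="https://example.org/x">External</a> '
    '<a href="mailto:hi@example.org">Mail</a> '
    '<a href="/about#team">Team</a>'
)
links = same_domain_links(sample_html, "https://vectrix.ai")
print(links)
```

A crawler can then repeat this on every newly discovered link, keeping a set of visited URLs so that no page is fetched twice.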

In [3]:
crawler = paginx.Crawler("https://vectrix.ai")
site_pages = crawler.extract()

Fetching pages: 100%|##########| 1/1 [00:00<00:00,  1.88it/s]
Fetching pages: 100%|##########| 9/9 [00:01<00:00,  5.78it/s]
Fetching pages: 100%|##########| 6/6 [00:01<00:00,  4.62it/s]


In [4]:
site_pages[2].dict()

{'title': 'Vectrix - Contact Us',
 'hostname': None,
 'date': None,
 'fingerprint': '3addcbe8f91e221e',
 'id': None,
 'license': None,
 'comments': None,
 'raw_text': 'At Vectrix, we specialize in using advanced technology, particularly generative AI, to transform the way businesses operate. Our focus is on simplifying complex processes, automating routine tasks, and designing unique, intuitive solutions that are easy to integrate and use.',
 'text': 'At Vectrix, we specialize in using advanced technology, particularly generative AI, to transform the way businesses operate. Our focus is on simplifying complex processes, automating routine tasks, and designing unique, intuitive solutions that are easy to integrate and use.',
 'language': None,
 'image': None,
 'pagetype': 'website',
 'filedate': '2024-05-27',
 'source': None,
 'source_hostname': None,
 'excerpt': 'Contact Us',
 'categories': '',
 'tags': ''}

## Data Preprocessing and Chunking
In this step we will split all the extracted web pages into logical chunks. 

➡️ We will use the [trafilatura](https://trafilatura.readthedocs.io/en/latest/) library to extract the main content of the web pages. It will return the main content of the page, the title, and the meta description.

➡️ We will pipe this to another splitter that cuts sections into smaller chunks if they are too large. For this we use LangChain.

➡️ We will also attach an LLM to the chain to filter out chunks that are not relevant, for example navigation bars and footers.



### Chunking and metadata extraction
Using the functions below, we extract the metadata and divide the text into chunks.

In [5]:
chunker = paginx.Webchunker(site_pages)
chunks = chunker.chunk_content(chunk_size=500, chunk_overlap=50)

In [6]:
print(chunks[2].json(indent=2))

{
  "page_content": "At Vectrix, we specialize in using advanced technology, particularly generative AI, to transform the way businesses operate. Our focus is on simplifying complex processes, automating routine tasks, and designing unique, intuitive solutions that are easy to integrate and use.",
  "metadata": {
    "title": "Vectrix - Contact Us",
    "hostname": null,
    "image": null,
    "source": null,
    "source_hostname": null,
    "excerpt": "Contact Us"
  },
  "type": "Document"
}
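
To illustrate what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of fixed-size chunking with overlap. It assumes sizes are counted in characters and that text is cut at fixed offsets; the splitter used by `paginx` may instead split on separators or count tokens. The function `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut
    at a boundary still appears whole in one of the two neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(1200))  # 1200 characters of dummy text
pieces = chunk_text(text, chunk_size=500, chunk_overlap=50)
print([len(piece) for piece in pieces])  # [500, 500, 300]
```

A larger overlap preserves more context across boundaries at the cost of storing more duplicated text.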


### NER Extraction Pipeline
Here we will use LangChain and an LLM to extract the named entities from the text.

In [7]:
extractor = paginx.Extract('Replicate', 'meta/meta-llama-3-70b-instruct')
results = extractor.extract(chunks)

Extracting entities: 100%|██████████| 16/16 [00:21<00:00,  1.32s/it]


In [None]:
results[3].dict()

In [8]:
# Show the memory usage of this notebook
import os
import psutil
process = psutil.Process(os.getpid())
print("Memory used: ", process.memory_info().rss / 1024 ** 2, "MB")

Memory used:  276.625 MB


## Storing the results in a Chroma DB Object
Chroma is a fast, easy-to-use vector database. Here it holds the retrieved content in memory so that a RAG chain can query it; the data can also be persisted to disk for later use. We use the LangChain integration to store the data in Chroma.
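
Conceptually, a similarity search embeds the query and ranks the stored vectors by how close they are to it. The sketch below shows the idea with hand-written three-dimensional vectors and cosine similarity. It is an illustration only: real embeddings have hundreds of dimensions, the metric Chroma uses depends on how the collection is configured, and `cosine_similarity`, `top_k`, and `toy_store` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=3):
    """Return the texts of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector store": (text, embedding) pairs with made-up embeddings
toy_store = [
    ("founders page", [0.9, 0.1, 0.0]),
    ("contact page", [0.1, 0.9, 0.0]),
    ("pricing page", [0.0, 0.2, 0.9]),
]
best = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(best)  # ['founders page', 'contact page']
```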

In [23]:
chroma = paginx.Chroma(results)
chroma = chroma.return_db_object()

# Let's perform a search and see if this works ...
# k is passed as a keyword argument so it is not treated as part of the query text
search_results = chroma.similarity_search('Who are the founders of Vectrix?', k=3)

for result in search_results:
    print(result.json(indent=2))

{
  "page_content": "AI Innovation Studio Meet team Vectrix Ben Selleslagh Co-Founder Meet Ben, a pivotal player and Co-Founder at Vectrix. With a rich background as a data professional, Ben brings his diverse experience from banking, government, and media sectors into the mix. He's skilled in crafting and executing data-driven strategies that sync perfectly with business goals. His expertise in Google Cloud technology makes him a wizard at building scalable and efficient data architectures. At Vectrix, Ben applies his know-how to innovate and drive our data and AI solutions, always with an eye on efficiency and scalability, ensuring that we stay at the forefront of AI technology. Dimitri Allaert Co-Founder Meet Dimitri Allaert, a driving force and Co-Founder at Vectrix. With roots in medical engineering and a history in the pharmaceutical world, Dimitri brings a different twist to our AI solutions. He kicked off his journey co-founding BUFFL, a market research platform, and then moved

## Storing the results in PostgreSQL (pgvector)

In [None]:
from langchain_core.documents import Document
from langchain_cohere import CohereEmbeddings
from langchain_postgres import PGVector

connection = "postgresql+psycopg://postgres:mysecretpassword@localhost/paginx"
# `url` is not defined earlier in this notebook; assuming the crawl start URL is meant
url = "https://vectrix.ai"
collection_name = url
embeddings = CohereEmbeddings()

In [None]:
#vectorstore.drop_tables()

In [None]:
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

In [None]:
vectorstore.add_documents(results, ids=[result.metadata["uuid"] for result in results])
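
The call above assumes that every document carries a `uuid` entry in its metadata. If some documents lack one, a stable identifier could be derived from the content instead, so that re-running the notebook produces the same IDs. The sketch below uses `uuid.uuid5` from the standard library; `stable_id` is a hypothetical helper and not part of `paginx` or LangChain.

```python
import uuid


def stable_id(source, page_content):
    """Derive a deterministic UUID (version 5) from a source label and the chunk text."""
    # Hypothetical helper: the same inputs always map to the same ID
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}\n{page_content}"))


id_first = stable_id("https://vectrix.ai", "At Vectrix, we specialize in ...")
id_again = stable_id("https://vectrix.ai", "At Vectrix, we specialize in ...")
id_other = stable_id("https://vectrix.ai", "A different chunk of text")

print(id_first == id_again)  # True: identical input gives an identical ID
print(id_first == id_other)  # False: different text gives a different ID
```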

In [None]:
# Fetch the three closest matches and keep the second one (index 1)
result = vectorstore.similarity_search("When was the company founded?", k=3)[1]

In [None]:
print(result.page_content)