In [1]:
from dotenv import load_dotenv
load_dotenv()

import sys
sys.path.append('..')

# Web Page Indexing and Vectorization 👀

This Jupyter notebook indexes and vectorizes the contents of a website. It crawls the site starting from a given URL, extracts the text of each page, and stores that text as vectors in a database.

The stored vectors can then be used in a Retrieval-Augmented Generation (RAG) flow, in which a Large Language Model (LLM) answers questions using the indexed content as context.

The notebook walks through the process step by step: web page crawling, text extraction, vectorization, and storage in a database. Each step comes with a short explanation and the code that performs it.

## Web Crawler and Content Extractor

This code implements a web crawler and content extractor that:

1. Extracts URLs from the given HTML content, filtering for the same domain and validating the URLs. ✅
2. Crawls a website starting from a given URL, iteratively processing and extracting links from each page. ✅
3. Returns a list of HTML documents extracted from the website. ✅

The code displays the source URL of each processed page and the total number of pages in the extracted content.
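
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the `Crawler` implementation used in this notebook: the names `extract_same_domain_links`, `crawl`, `fetch`, and `max_pages` are hypothetical, and `fetch` stands in for whatever function downloads a page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_same_domain_links(html, base_url):
    """Return absolute http(s) links in `html` that share base_url's host."""
    parser = _LinkParser()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop the #fragment part
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            if absolute not in links:
                links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl; `fetch` is any callable mapping a URL to HTML."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        for link in extract_same_domain_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` as a parameter keeps the sketch independent of any HTTP library; in practice it could wrap a real download function with error handling and rate limiting.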

In [2]:
from src.paginx.page_crawler.crawler import Crawler

crawler = Crawler("https://vectrix.ai")
site_pages = crawler.extract()

Fetching pages: 100%|##########| 1/1 [00:00<00:00,  2.09it/s]
Fetching pages: 100%|##########| 9/9 [00:01<00:00,  5.79it/s]
Fetching pages: 100%|##########| 6/6 [00:01<00:00,  5.12it/s]


In [3]:
site_pages[2].dict()

{'title': 'Have a project in mind? Let’s collaborate.',
 'hostname': None,
 'date': None,
 'fingerprint': '4f94d83b257336ce',
 'id': None,
 'license': None,
 'comments': None,
 'raw_text': 'Projects Vectrix Projects: Streamlining Business Operations with AI Vectrix is committed to providing AI project solutions that are specifically designed to meet the unique needs of your business. Our focus is on developing AI applications that enhance operational efficiency and address specific challenges within your organization. \u200d Custom AI Solution Development \u200dOur process involves a detailed analysis of your business requirements to create AI solutions tailored for your operations. We concentrate on areas such as automating repetitive tasks and introducing new AI functionalities to improve efficiency. \u200d Structured Project Management \u200dVectrix employs a structured approach to project management, ensuring consistency and clarity throughout the development cycle. We maintain ope

## Data Preprocessing and Chunking
In this step we will split all the extracted web pages into logical chunks. 

➡️ We will use the [trafilatura](https://trafilatura.readthedocs.io/en/latest/) library to extract the main content of the web pages. It returns a JSON object with the following attributes:
- `title`: The title of the page
- `author`: The author of the page, in most cases this will be empty
- `hostname`: The hostname of the page
- `date`: The date of the page
- `fingerprint`: A fingerprint of the page
- `id`: The id of the page, most of the time this will be empty
- `license`: The license of the page, most of the time this will be empty
- `comments`: The comments of the page, most of the time this will be empty
- `raw_text`: The raw text of the page, with HTML and visual elements removed
- `language`: The language of the page
- `image`: The images of the page, contains the URLs
- `pagetype`: Always set to website
- `source`: Main URL of the website
- `source-hostname`: Hostname of the website
- `excerpt`: An excerpt of the page
- `categories`: The categories of the page
- `tags`: The tags of the page

➡️ We will pipe this to another splitter that cuts the sections into smaller chunks if they are too large. For this we use LangChain.

➡️ We will also attach an LLM to the chain to filter out chunks that are not relevant, such as navigation bars and footers.



### Chunking and metadata extraction
Using the functions below, we extract the metadata and divide the text into chunks.
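
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal character-based sketch of overlapping chunking. It is an assumption-laden simplification, not what `Webchunker` does internally: the function name `split_with_overlap` is hypothetical, and a real splitter would typically prefer to cut at paragraph or sentence boundaries.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so a sentence
    that straddles a cut still appears in full in at least one chunk
    when it is shorter than the overlap.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

A larger overlap preserves more context across chunk boundaries at the cost of storing and embedding more duplicated text.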

In [4]:
from src.paginx.page_crawler.web_chunker import Webchunker

chunker = Webchunker(site_pages)
chunks = chunker.chunk_content(chunk_size=500, chunk_overlap=50)

In [5]:
print(chunks[2].json(indent=2))

{
  "page_content": "Projects Vectrix Projects: Streamlining Business Operations with AI Vectrix is committed to providing AI project solutions that are specifically designed to meet the unique needs of your business. Our focus is on developing AI applications that enhance operational efficiency and address specific challenges within your organization. \u200d Custom AI Solution Development \u200dOur process involves a detailed analysis of your business requirements to create AI solutions tailored for your operations. We concentrate on areas such as automating repetitive tasks and introducing new AI functionalities to improve efficiency. \u200d Structured Project Management \u200dVectrix employs a structured approach to project management, ensuring consistency and clarity throughout the development cycle. We maintain open communication, ensuring that each project phase aligns with your business objectives. \u200d Rigorous Testing and Optimization \u200dWe place a high emphasis on the qu

### NER Extraction Pipeline
Here we will use LangChain and an LLM to extract the named entities from the text.

In [8]:
from src.paginx.ner.extract import Extract

extractor = Extract('Replicate', 'meta/meta-llama-3-70b-instruct')
results = extractor.extract(chunks)

Extracting entities: 100%|██████████| 16/16 [00:23<00:00,  1.48s/it]


In [9]:
# Show the memory usage of this notebook
import os
import psutil
process = psutil.Process(os.getpid())
print("Memory used: ", process.memory_info().rss / 1024 ** 2, "MB")

Memory used:  226.890625 MB


## Storing the results in a Postgres database

In [None]:
from langchain_core.documents import Document
from langchain_cohere import CohereEmbeddings
from langchain_postgres import PGVector

# Connection string for a local Postgres instance
connection = "postgresql+psycopg://postgres:mysecretpassword@localhost/paginx"
# Name the collection after the crawled site (the URL passed to Crawler above)
collection_name = "https://vectrix.ai"
embeddings = CohereEmbeddings()

In [None]:
# Uncomment to drop the existing tables (requires `vectorstore`, created in the next cell)
#vectorstore.drop_tables()

In [None]:
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

In [None]:
vectorstore.add_documents(results, ids=[result.metadata["uuid"] for result in results])

In [None]:
# Fetch the 3 closest chunks and keep the second-ranked one
result = vectorstore.similarity_search("When was the company founded?", k=3)[1]

In [None]:
print(result.page_content)
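
The similarity search in the cells above ranks stored chunks by how close their embedding is to the embedding of the query. The sketch below shows the idea on toy two-dimensional vectors, assuming cosine similarity as the metric. It is not how PGVector executes the query (that happens inside Postgres), and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=3):
    """Return the texts of the k stored (text, vector) pairs closest to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _vector in ranked[:k]]


# Toy data: in the notebook these vectors would come from the embedding model
stored = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(top_k([1.0, 0.1], stored, k=2))
# → ['a', 'c']
```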