In [1]:
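import sys
sys.path.append('..')  # extend the import path before importing the local paginx package

from dotenv import load_dotenv
load_dotenv()  # read API keys and settings from a .env file
import paginx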

USER_AGENT environment variable not set, consider setting it to identify your requests.


# Web Page Indexing and Vectorization 👀

This Jupyter notebook indexes and vectorizes the contents of a website. It crawls a specified site, extracts the text of each page, and stores the result as vectors in a database.

The vectorized content can then be used in a Retrieval-Augmented Generation (RAG) flow, in which a Large Language Model (LLM) answers questions using the indexed pages as context.

The notebook proceeds step by step through crawling, text extraction, chunking, entity extraction, and storage in a vector database, with a short explanation and code snippet for each step.

## Web Crawler and Content Extractor

This code implements a web crawler and content extractor that:

1. Extracts URLs from the given HTML content, filtering for the same domain and validating the URLs. ✅
2. Crawls a website starting from a given URL, iteratively processing and extracting links from each page. ✅
3. Returns a list of HTML documents extracted from the website. ✅

The code displays the source URL of each processed page and the total number of pages in the extracted content.
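
The same-domain link filtering described in step 1 can be illustrated with the standard library alone. This is a minimal sketch, not the `paginx.Crawler` implementation: the names `LinkParser` and `same_domain_links` are hypothetical, and the rule "same host, http(s) only, fragment removed" is an assumption about what "validating the URLs" means here.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(html, base_url):
    """Return absolute, de-duplicated links that stay on base_url's host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" so a page is listed once.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        is_web = parts.scheme in ("http", "https")
        if is_web and parts.netloc == host and absolute not in links:
            links.append(absolute)
    return links


sample_html = (
    '<a href="/about-us">About</a> '
    '<a href="https://example.com/x">External</a> '
    '<a href="/about-us#team">Team</a>'
)
print(same_domain_links(sample_html, "https://vectrix.ai"))
```

A crawler would call a function like this on every fetched page and queue the links it has not visited yet, which is consistent with the three progress bars in the output below (one per crawl depth), although the actual traversal strategy of `paginx` is not shown in this notebook.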

In [2]:
crawler = paginx.Crawler("https://vectrix.ai")
site_pages = crawler.extract()

Fetching pages: 100%|##########| 1/1 [00:00<00:00,  1.76it/s]
Fetching pages: 100%|##########| 6/6 [00:02<00:00,  2.04it/s]
Fetching pages: 100%|##########| 10/10 [00:04<00:00,  2.39it/s]


In [3]:
len(site_pages)

17

In [4]:
print(site_pages[8].content)

{'title': 'Job Description', 'author': None, 'hostname': None, 'date': None, 'fingerprint': '395059d9a3b4a234', 'id': None, 'license': None, 'comments': None, 'raw_text': "At Vectrix, we're passionate about cultivating a space where learning, innovation, and personal growth go hand-in-hand. We're on the lookout for diverse, enthusiastic talent who share our vision and are eager to make a tangible impact in the world of AI. Embark on a journey with Vectrix, where the exciting world of AI awaits you. With a spectrum of specialties and niches, there's a place for every interest. We value your ideas and initiative. Feel free to suggest topics you're passionate about or consider conducting your thesis under the mentorship of our seasoned team. What Vectrix is looking for: You're a student in Computer Science, Data Science, or a related field, looking for an energizing environment to work with the latest in data and machine learning tech. You've got a knack for analytics, are comfortable wit

## Data Preprocessing and Chunking
In this step we will split all the extracted web pages into logical chunks. 

➡️ We will use the [trafilatura](https://trafilatura.readthedocs.io/en/latest/) library to extract the main content of the web pages. It will return the main content of the page, the title, and the meta description.

➡️ We will pipe this to another splitter to cut the sections into smaller chunks if they are too large. For this we use LangChain.

➡️ We will also attach an LLM to the chain to filter out chunks that are not relevant, for example navigation bars and footers.



### Chunking and metadata extraction
Using the functions below we extract the metadata and divide the text into chunks.

In [5]:
chunker = paginx.Webchunker(site_pages)
chunks = chunker.chunk_content(chunk_size=500, chunk_overlap=50)

In [6]:
len(chunks)

21
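
To make the `chunk_size` and `chunk_overlap` arguments concrete, here is a minimal sliding-window sketch. It is not what `paginx.Webchunker` does internally: the function name `split_with_overlap` is hypothetical, and it counts characters, whereas the real splitter may count tokens or respect sentence boundaries.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


pieces = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is still fully present in at least one chunk.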

In [7]:
print(chunks[18].dict())

{'id': None, 'metadata': {'source': 'https://vectrix.ai/offerings/chat-ui', 'title': 'Vectrix', 'language': 'en'}, 'page_content': 'Vectrix Chat Vectrix Chat is an internal tool that allows you to interact with various Large Language Models (LLMs) and easily switch between them. It also allows you to create your own custom models and assistants. \u200d Key Features Multiple LLM Support: Vectrix Chat integrates with popular LLMs like Google Gemini, Anthropic Claude, and OpenAI GPT-4, allowing you to quickly test and compare different models. Custom Model Integration: You can easily connect and interact with your own custom models that run locally or in the cloud. Prompt Storage & Reuse: Create, store, and reuse prompts for commonly used tasks, saving you time and effort. Assistant Creation: Build custom assistants with specific instructions and files. Think of these assistants as dedicated GPT instances, tailored to your specific needs. File Integration: Attach files to your chats or as

In [8]:
print(chunks[18].json(indent=2))

{
  "id": null,
  "metadata": {
    "source": "https://vectrix.ai/offerings/chat-ui",
    "title": "Vectrix",
    "language": "en"
  },
  "page_content": "Vectrix Chat Vectrix Chat is an internal tool that allows you to interact with various Large Language Models (LLMs) and easily switch between them. It also allows you to create your own custom models and assistants. \u200d Key Features Multiple LLM Support: Vectrix Chat integrates with popular LLMs like Google Gemini, Anthropic Claude, and OpenAI GPT-4, allowing you to quickly test and compare different models. Custom Model Integration: You can easily connect and interact with your own custom models that run locally or in the cloud. Prompt Storage & Reuse: Create, store, and reuse prompts for commonly used tasks, saving you time and effort. Assistant Creation: Build custom assistants with specific instructions and files. Think of these assistants as dedicated GPT instances, tailored to your specific needs. File Integration: Attach fi

### NER Extraction Pipeline
Here we use LangChain and an LLM to extract the named entities from the text.

In [9]:
extractor = paginx.Extract('Replicate', 'meta/meta-llama-3-70b-instruct')
results = extractor.extract(chunks)

Extracting entities: 100%|██████████| 21/21 [00:15<00:00,  1.37it/s]


In [10]:
print(results[4].dict()['metadata'])

{'source': 'https://vectrix.ai/about-us', 'title': 'Vectrix - About Us', 'description': 'Established in 2023, Vectrix began as a small but ambitious team, aware of the growing demand for inventive, impactful solutions in the fast-paced digital era. Our expertise lies in blending creativity with generative AI technology to help businesses excel.', 'language': 'en', 'uuid': '0a7c7c18-dacc-41d3-8385-2dffda666822', 'NER': {'entity_list': [{'entity_type': 'person', 'entity_name': 'Ben Selleslagh'}, {'entity_type': 'person', 'entity_name': 'Dimitri Allaert'}, {'entity_type': 'organization', 'entity_name': 'Vectrix'}, {'entity_type': 'organization', 'entity_name': 'BUFFL'}, {'entity_type': 'technology', 'entity_name': 'Google Cloud'}], 'language': 'English', 'category': 'Company Profile'}}
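
The `NER` metadata shown above is a plain nested dictionary, so it can be post-processed without any extra library. The following sketch groups entity names by type; the helper name `entities_by_type` is hypothetical, and the sample dictionary simply copies the entities from the output above.

```python
from collections import defaultdict


def entities_by_type(metadata):
    """Group entity names by entity type, keeping first-seen order."""
    grouped = defaultdict(list)
    ner = metadata.get("NER") or {}  # tolerate chunks without NER results
    for entity in ner.get("entity_list", []):
        names = grouped[entity["entity_type"]]
        if entity["entity_name"] not in names:
            names.append(entity["entity_name"])
    return dict(grouped)


sample_metadata = {
    "source": "https://vectrix.ai/about-us",
    "NER": {
        "entity_list": [
            {"entity_type": "person", "entity_name": "Ben Selleslagh"},
            {"entity_type": "person", "entity_name": "Dimitri Allaert"},
            {"entity_type": "organization", "entity_name": "Vectrix"},
            {"entity_type": "organization", "entity_name": "BUFFL"},
            {"entity_type": "technology", "entity_name": "Google Cloud"},
        ],
        "language": "English",
        "category": "Company Profile",
    },
}
print(entities_by_type(sample_metadata))
```

A grouping like this could be used to filter search results by entity later on, for instance to restrict a query to chunks that mention a given person.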


In [11]:
# Show the memory usage of this notebook
import os
import psutil
process = psutil.Process(os.getpid())
print("Memory used: ", process.memory_info().rss / 1024 ** 2, "MB")

Memory used:  275.625 MB


## Storing the results in a Chroma DB Object
Chroma is a fast, easy-to-use vector database. Here it holds the retrieved content in memory so that a RAG chain can query it; the data can also be persisted to disk for later use. We use the LangChain integration to store the data in Chroma.

In [None]:
from langchain_cohere import CohereEmbeddings

chroma = paginx.Chroma(CohereEmbeddings())
vector_db = chroma.create_db(results, os.getenv('CHROMA_DB_LOCATION'))

# Let's perform a search and see if this works ...
search_results = vector_db.similarity_search('Who are the founders of Vectrix ?', k=3)

for result in search_results:
    print(result.json(indent=2))

In [None]:
search_results = vector_db.similarity_search_with_score('Who are the founders of Vectrix ?', k=3)

for result in search_results:
    print(result)
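
To show what a similarity search with scores computes conceptually, here is a small pure-Python sketch that ranks toy vectors by cosine distance. It is an illustration only: the three-dimensional vectors and the names `cosine_distance` and `top_k` are made up, real embeddings have hundreds of dimensions, and the distance metric a vector store actually uses depends on how the collection is configured.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=3):
    """Return the k (text, distance) pairs closest to query_vector."""
    scored = [(text, cosine_distance(query_vector, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings" standing in for the vectors an embedding model would produce.
toy_index = [
    ("about-us", [0.9, 0.1, 0.0]),
    ("career", [0.1, 0.9, 0.0]),
    ("contact", [0.5, 0.5, 0.1]),
]
hits = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for text, distance in hits:
    print(text, round(distance, 3))
```

When reading scores returned by a vector store, check whether they are distances (lower is better) or similarities (higher is better), since the two conventions sort in opposite directions.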

## Storing the results in PostgreSQL (pgvector)

In [None]:
from langchain_cohere import CohereEmbeddings
from langchain_postgres import PGVector

connection = "postgresql+psycopg://postgres:mysecretpassword@localhost/paginx"
url = "https://vectrix.ai"  # the site crawled in the first step
collection_name = url  # one collection per crawled site
embeddings = CohereEmbeddings()

In [None]:
#vectorstore.drop_tables()

In [None]:
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

In [None]:
vectorstore.add_documents(results, ids=[result.metadata["uuid"] for result in results])

In [None]:
# Retrieve the three closest chunks and keep the second one (index 1)
result = vectorstore.similarity_search("When is the company founded ? ", k=3)[1]

In [None]:
print(result.page_content)

## Storing the results in Weaviate (cluster)

### Initialize the Vector store and check that all the required modules are installed

Download the Docker Compose file if needed:
```bash
curl -o docker-compose.yml "https://configuration.weaviate.io/v2/docker-compose/docker-compose.yml?cohere_key_approval=yes&generative_anyscale=false&generative_aws=false&generative_cohere=false&generative_mistral=false&generative_octoai=false&generative_ollama=false&generative_openai=false&generative_palm=false&media_type=text&modules=modules&ner_module=false&qna_module=false&ref2vec_centroid=false&reranker_cohere=true&reranker_cohere_key_approval=yes&reranker_transformers=false&runtime=docker-compose&spellcheck_module=true&spellcheck_module_model=pyspellchecker-en&sum_module=false&text_module=text2vec-cohere&weaviate_version=v1.25.4&weaviate_volume=named-volume"
```

Make sure to set the persistent directory to the correct value:
```yaml
volumes:
  - ~/weaviate_data:/var/lib/weaviate
```

Also configure the Cohere API key:
```yaml
environment:
  SPELLCHECK_INFERENCE_API: 'http://text-spellcheck:8080'
  COHERE_APIKEY: ***
```

In [12]:
import weaviate

client = weaviate.connect_to_local()

meta_info = client.get_meta()
print(meta_info)



{'hostname': 'http://[::]:8080', 'modules': {'generative-ollama': {'documentationHref': 'https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion', 'name': 'Generative Search - Ollama'}, 'text-spellcheck': {'model': {'name': 'pyspellchecker'}}, 'text2vec-ollama': {'documentationHref': 'https://github.com/ollama/ollama/blob/main/docs/api.md#generate-embeddings', 'name': 'Ollama Module'}}, 'version': '1.26.0'}


### Configure the vectorizer

In [70]:
from weaviate.classes.config import Configure

client.collections.create(
    "Vectrix",
    vectorizer_config=[
        Configure.NamedVectors.text2vec_ollama(
            name="content",
            source_properties=["title", "url", "NER", "content", "type"],
            model="mxbai-embed-large:335m",  # the model to use, e.g. "nomic-embed-text"
            api_endpoint="http://host.docker.internal:11434",
        )
    ],
    # Additional parameters not shown
)

<weaviate.collections.collection.Collection at 0x314b43bc0>

### List all collections

In [71]:
for collection in client.collections.list_all():
    print(collection)

Vectrix


### Importing the data

In [72]:
print(results[3].dict())

{'id': None, 'metadata': {'source': 'https://vectrix.ai/career', 'title': 'Vectrix - Career', 'description': 'Discover exciting career opportunities with us and be part of a dynamic team that is revolutionising the industry.', 'language': 'en', 'uuid': 'd791d5b4-07a1-48b6-b701-0126e11fad66', 'NER': {'entity_list': [{'entity_type': 'organization', 'entity_name': 'Vectrix'}], 'language': 'English', 'category': 'Artificial Intelligence'}, 'type': 'webpage'}, 'page_content': "At Vectrix, we specialize in training and validating Small Language Models (SLMs) on business data. Choose our platform for a DIY approach or collaborate with our experts. We make complex tech simple with tailored AI solutions. Ready to transform your work? Let's innovate together!", 'type': 'Document'}


In [73]:
collection = client.collections.get("Vectrix")

with collection.batch.dynamic() as batch:
    for result in results:
        result.metadata["type"] = "webpage"
        weaviate_object = {
            "title": result.dict()["metadata"]["title"],
            "url": result.dict()["metadata"]["source"],
            "NER": result.dict()["metadata"]["NER"],
            "content": result.dict()["page_content"],
            "type": "webpage",
        }

        batch.add_object(
            properties=weaviate_object,
        )


### Performing a (hybrid) search

In [81]:
collection = client.collections.get("Vectrix")

response = collection.query.near_text(
    query="Who are the founders of Vectrix ?",  # The model provider integration will automatically vectorize the query
    limit=3,
    
)

for obj in response.objects:
    print(obj.properties["title"])

Vectrix - About Us
Vectrix - Contact Us
Vectrix


In [80]:
from weaviate.classes.query import MetadataQuery

vectrix = client.collections.get("Vectrix")
response = vectrix.query.hybrid(
    query="Who are the founders of Vectrix ?",
    alpha=0.8,
    return_metadata=MetadataQuery(explain_score=True),
    limit=3,
    query_properties=['content'],
)

for o in response.objects:
    #print(o.properties)
    print(o.properties['title'], '\n')
    print(o.properties['url'], '\n')
    print(o.properties['type'], '\n')
    print(o.metadata.explain_score)

Vectrix 

https://vectrix.ai/job-list/junior-ai-researcher 

webpage 


Hybrid (Result Set keyword,bm25) Document 83b611e3-6d7a-4e66-b400-6dfaa92f2bdf: original score 1.2208344, normalized score: 0.17300671 - 
Hybrid (Result Set vector,hybridVector) Document 83b611e3-6d7a-4e66-b400-6dfaa92f2bdf: original score 0.6553893, normalized score: 0.702733
Vectrix 

https://vectrix.ai/job-list/software-engineer-front-end 

webpage 


Hybrid (Result Set keyword,bm25) Document fe6feb9b-4753-4846-bef1-7a91aa2044dc: original score 1.1256851, normalized score: 0.1542456 - 
Hybrid (Result Set vector,hybridVector) Document fe6feb9b-4753-4846-bef1-7a91aa2044dc: original score 0.6558575, normalized score: 0.70386404
Vectrix - About Us 

https://vectrix.ai/about-us 

webpage 


Hybrid (Result Set keyword,bm25) Document f4377e91-9f84-435a-9bfe-29eb87d166ab: original score 0.53118104, normalized score: 0.037023842 - 
Hybrid (Result Set vector,hybridVector) Document f4377e91-9f84-435a-9bfe-29eb87d166ab: ori
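
The `explain_score` output above reports, for each document, an original and a normalized score from both the keyword (BM25) and the vector result set. The sketch below shows one way such scores can be fused, assuming min-max normalization per result set followed by a weighted sum in which `alpha` weights the vector side. It is an illustration under those assumptions, not Weaviate's implementation; the function names and the toy scores are hypothetical, and vector scores are treated as similarities (higher is better).

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range 0..1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}  # all scores equal: treat as best
    return {doc: (score - lo) / (hi - lo) for doc, score in scores.items()}


def fuse(keyword_scores, vector_scores, alpha=0.8):
    """Combine two result sets: alpha weights vector, (1 - alpha) keyword."""
    keyword = min_max(keyword_scores) if keyword_scores else {}
    vector = min_max(vector_scores) if vector_scores else {}
    fused = {}
    for doc in set(keyword) | set(vector):
        # A document missing from one result set contributes 0 from that side.
        fused[doc] = alpha * vector.get(doc, 0.0) + (1 - alpha) * keyword.get(doc, 0.0)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


toy_keyword = {"a": 2.0, "b": 1.0, "c": 0.0}
toy_vector = {"a": 0.2, "b": 0.6, "c": 0.4}
ranking = fuse(toy_keyword, toy_vector, alpha=0.8)
for doc, score in ranking:
    print(doc, round(score, 3))
```

With `alpha=0.8` the vector side dominates, which matches the pattern in the output above where the vector contribution to each document is much larger than the keyword contribution. Lowering `alpha` shifts weight toward exact keyword matches.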

### Remove a collection

In [69]:
client.collections.delete("Vectrix")

### Close the connection

In [None]:
client.close()