In [None]:
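import sys
sys.path.append('..')  # make the local vectrix package importable before importing it

from dotenv import load_dotenv
load_dotenv()  # load API keys and other settings from a .env file

import vectrix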

# Web Page Indexing and Vectorization 👀

This Jupyter notebook indexes and vectorizes the contents of web pages. It crawls a specified website, extracts the text of each page, and stores that text as vector objects in a database.

The vectorized information can then be used in a Retrieval-Augmented Generation (RAG) flow to answer questions with a Large Language Model (LLM). Grounding the model in the indexed pages lets it give detailed answers based on the site's actual content.

The notebook proceeds step by step: web page crawling, text extraction, vectorization, and storage in a database. Each step comes with an explanation and the code that performs it.
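
To make the retrieval part of a RAG flow concrete, here is a minimal sketch of ranking stored vectors by cosine similarity to a query vector. The three-dimensional vectors, the document texts, and the helper names `cosine_similarity` and `top_k` are made up for illustration; in this notebook the embeddings come from an embedding model and the search is performed by the vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=2):
    """Return the texts of the k documents most similar to the query.

    documents is a list of (text, vector) pairs.
    """
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, doc[1]),
        reverse=True,
    )
    return [text for text, _vector in ranked[:k]]


# Toy data: hypothetical texts with hand-written 3-dimensional "embeddings".
documents = [
    ("About the company", [0.9, 0.1, 0.0]),
    ("Job openings", [0.1, 0.9, 0.0]),
    ("Contact details", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.2, 0.0]

print(top_k(query_vector, documents, k=2))
```

The retrieved texts would then be passed to the LLM as context alongside the user's question.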

## Web Crawler and Content Extractor

This code implements a web crawler and content extractor that:

1. Extracts URLs from the given HTML content, filtering for the same domain and validating the URLs. ✅
2. Crawls a website starting from a given URL, iteratively processing and extracting links from each page. ✅
3. Returns a list of HTML documents extracted from the website. ✅

The code displays the source URL of each processed page and the total number of pages in the extracted content.
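
The same-domain filtering described in step 1 can be sketched with the Python standard library alone. This is an illustration of the idea, not the implementation inside `vectrix.Crawler`; the names `LinkCollector` and `same_domain_links` and the `example.com` URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(html, base_url):
    """Return sorted, de-duplicated http(s) links on the same host as base_url."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            links.add(absolute)
    return sorted(links)


sample_html = (
    '<a href="/about">About</a>'
    '<a href="https://example.org/x">External</a>'
    '<a href="mailto:hi@example.com">Mail</a>'
    '<a href="/about#team">Team</a>'
)
print(same_domain_links(sample_html, "https://example.com/"))
```

A crawler would call such a function on every fetched page, queue the links it has not visited yet, and stop once a page limit such as `max_pages` is reached.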

In [None]:
crawler = vectrix.Crawler("https://vectrix.ai", max_pages=20)
site_pages = crawler.extract()

In [None]:
len(site_pages)

In [None]:
print(site_pages[8].content)

## Data Preprocessing and Chunking
In this step we split the extracted web pages into logical chunks.

➡️ We use the [trafilatura](https://trafilatura.readthedocs.io/en/latest/) library to extract the main content of each web page, together with its title and meta description.

➡️ We pipe this output to a second splitter that cuts sections into smaller chunks if they are too large. For this we use LangChain.

➡️ We also attach an LLM to the chain to filter out chunks that are not relevant, such as navigation bars and footers.



### Chunking and metadata extraction
Using the functions below, we extract the metadata and divide the text into chunks.
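
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified character-based sketch. It is an assumption-laden illustration rather than what `Webchunker` does internally: real text splitters usually prefer to break on separators such as paragraphs or sentences. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap keeps a little shared context between consecutive chunks, so a sentence cut at a chunk boundary is still partly visible in both.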

In [None]:
chunker = vectrix.Webchunker(site_pages)
chunks = chunker.chunk_content(chunk_size=500, chunk_overlap=50)

In [None]:
len(chunks)

In [None]:
print(chunks[18].dict())

In [None]:
print(chunks[18].json(indent=2))

### NER Extraction Pipeline
Here we use LangChain and an LLM to extract the named entities from the text.

In [None]:
extractor = vectrix.Extract('Replicate', 'meta/meta-llama-3-70b-instruct')
results = extractor.extract(chunks)

In [None]:
print(results[4].dict()['metadata'])

In [None]:
# Show the memory usage of this notebook
import os
import psutil
process = psutil.Process(os.getpid())
print("Memory used: ", process.memory_info().rss / 1024 ** 2, "MB")

## Storing the results in a Weaviate cluster

### Initialize the vector store and check that all the required modules are installed

Download the Docker Compose file if needed:
```bash
curl -o docker-compose.yml "https://configuration.weaviate.io/v2/docker-compose/docker-compose.yml?cohere_key_approval=yes&generative_anyscale=false&generative_aws=false&generative_cohere=false&generative_mistral=false&generative_octoai=false&generative_ollama=false&generative_openai=false&generative_palm=false&media_type=text&modules=modules&ner_module=false&qna_module=false&ref2vec_centroid=false&reranker_cohere=true&reranker_cohere_key_approval=yes&reranker_transformers=false&runtime=docker-compose&spellcheck_module=true&spellcheck_module_model=pyspellchecker-en&sum_module=false&text_module=text2vec-cohere&weaviate_version=v1.25.4&weaviate_volume=named-volume"
```

Make sure to set the persistent directory to the correct value:
```yaml
    volumes:
    - ~/weaviate_data:/var/lib/weaviate
```

Also configure the Cohere API key:
```yaml
    environment:
      SPELLCHECK_INFERENCE_API: 'http://text-spellcheck:8080'
      COHERE_APIKEY: ***
```

In [2]:
from vectrix.db.weaviate import Weaviate, VectorDocument

weaviate = Weaviate()

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [None]:
weaviate.create_collection(name='Vectrix', 
                           embedding_model='Ollama', 
                           model_name="mxbai-embed-large:335m",
                           model_url="http://host.docker.internal:11434")

In [3]:
print(weaviate.list_collections())

['Vectrix']


In [6]:
weaviate.set_colleciton(name='Vectrix')

In [None]:
data_to_vectorize = []

for result in results:
    data_to_vectorize.append(
        VectorDocument(
            title=result.metadata["title"],
            url=result.metadata["source"],
            content=result.page_content,
            type="webpage",
            NER=str(result.metadata["NER"]),
        )
    )

weaviate.add_data(data_to_vectorize)

In [7]:
retriever = weaviate.get_retriever()
retriever.invoke('Who are the Vectrix founders ?')

[(Document(metadata={'title': 'Vectrix', 'url': 'https://vectrix.ai/job-list/junior-ai-researcher', 'type': 'webpage'}, page_content='About Vectrix \u200dVectrix is a cutting-edge AI startup dedicated to revolutionizing the field of generative AI solutions. Located in the vibrant heart of Antwerp, we strive to push the boundaries of artificial intelligence through innovative research and practical applications. \u200d Position Overview \u200dWe are currently seeking a passionate and skilled Junior AI Researcher to join our dynamic team. This role is ideal for someone who is deeply intrigued by the world of AI and eager to contribute to the development and deployment of groundbreaking AI technologies. \u200d Key Responsibilities Collaborate with a team of experts to develop and refine generative AI models. Implement AI solutions using Python, focusing on robustness and scalability. Engage in the deployment of AI systems, ensuring smooth integration and functionality. Work with Google Cl

In [None]:
weaviate.remove_collection("Vectrix")

In [None]:
weaviate.info()

In [None]:
weaviate.close()