In [1]:
from dotenv import load_dotenv
load_dotenv()

True

# Web Page Indexing and Vectorization 👀

This Jupyter notebook indexes and vectorizes web page content. It crawls a specified website, extracts the text from each page, and stores that text as vectors in a database.

The vectorized content can then be used in a Retrieval-Augmented Generation (RAG) flow, in which a Large Language Model (LLM) answers questions using passages retrieved from the indexed pages. This grounds the model's answers in the content of the website.

The notebook proceeds step by step: web page crawling, text extraction, vectorization, and storage in a database. Each step comes with an explanation and a code snippet.
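To make the retrieval part of the RAG flow concrete, here is a minimal sketch using only the standard library. The three-dimensional vectors and the `retrieve` helper are made up for illustration. A real setup would use embeddings from a model and a vector database rather than a Python list.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, store, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _vector in ranked[:k]]


# Toy "database": (text, vector) pairs with hypothetical embeddings.
store = [
    ("Pricing page", [0.9, 0.1, 0.0]),
    ("About the founders", [0.1, 0.9, 0.2]),
    ("Contact form", [0.0, 0.2, 0.9]),
]

# A query vector that points in roughly the same direction as the second entry.
top = retrieve([0.2, 0.8, 0.1], store, k=2)
print(top)
```

In a RAG flow the retrieved texts would be passed to the LLM as context alongside the user's question.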

## Web Crawler and Content Extractor

This code implements a web crawler and content extractor that:

1. Extracts URLs from the given HTML content, filtering for the same domain and validating the URLs. ✅
2. Crawls a website starting from a given URL, iteratively processing and extracting links from each page. ✅
3. Returns a list of HTML documents extracted from the website. ✅

The code displays the source URL of each processed page and the total number of pages in the extracted content.
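The steps above can be sketched with the standard library alone. This is not the `WebScraper` implementation, only an illustration of same-domain link extraction and an iterative crawl. The names `extract_links`, `crawl`, and the `fetch` callback are assumptions made for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) links that stay on the domain of base_url."""
    parser = _LinkParser()
    parser.feed(html)
    domain = urlparse(base_url).netloc
    links = set()
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" suffixes.
        url, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc == domain:
            links.add(url)
    return links


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl; fetch(url) returns the page HTML or None on failure."""
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in sorted(extract_links(html, url) - seen):
            seen.add(link)
            queue.append(link)
    return pages


# In-memory stand-in for a website, so the example runs without network access.
site = {
    "https://example.com/": '<a href="/about">About</a> <a href="https://other.org/">Ext</a>',
    "https://example.com/about": '<a href="/">Home</a> <a href="mailto:hi@example.com">Mail</a>',
}
pages = crawl("https://example.com/", site.get)
for url in pages:
    print(url)
print(f"Total pages: {len(pages)}")
```

Passing the download function in as `fetch` keeps the crawl logic independent of the HTTP client, which also makes it easy to test.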

In [2]:
from vectrix.importers import WebScraper

scraper = WebScraper("https://vectrix.ai")
all_links = scraper.get_all_links()


2024-08-20 13:49:08,500 - trafilatura.downloads - ERROR - not a 200 response: 404 for URL https://vectrix.ai/robots.txt
2024-08-20 13:49:09,075 - trafilatura.downloads - ERROR - not a 200 response: 404 for URL https://vectrix.ai/robots.txt


In [5]:
scraper.download_pages('Vectrix', all_links)

ProgrammingError: (psycopg2.errors.UndefinedTable) relation "langchain_pg_embedding" does not exist
LINE 3:             FROM langchain_pg_embedding
                         ^

[SQL: 
            SELECT DISTINCT jsonb_extract_path_text(cmetadata, 'url') as url
            FROM langchain_pg_embedding
            WHERE jsonb_extract_path_text(cmetadata, 'url') IS NOT NULL
        ]
(Background on this error at: https://sqlalche.me/e/20/f405)

## Data Preprocessing and Chunking
In this step we will split all the extracted web pages into logical chunks. 

➡️ We will use the [trafilatura](https://trafilatura.readthedocs.io/en/latest/) library to extract the main content of the web pages. It will return the main content of the page, the title, and the meta description.

➡️ We will pipe this output to a second splitter that cuts sections into smaller chunks when they are too large. For this we use LangChain.

➡️ We will also attach an LLM to the chain to discard chunks that are not relevant, such as navigation bars and footers.
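The splitting idea can be sketched in plain Python. This is not the LangChain splitter and it does not include the LLM-based filter. The `chunk_text` and `to_documents` helpers, the character limits, and the dictionary layout are assumptions made for this example.

```python
def chunk_text(text, max_chars=500, overlap=50):
    """Split on blank lines, then window any paragraph longer than max_chars."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    chunks = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
            continue
        # Slide a window over the long paragraph, overlapping consecutive chunks.
        step = max_chars - overlap
        for start in range(0, len(paragraph), step):
            chunks.append(paragraph[start:start + max_chars])
            if start + max_chars >= len(paragraph):
                break
    return chunks


def to_documents(text, metadata, **kwargs):
    """Attach a copy of the page metadata to every chunk of the page."""
    return [
        {"text": chunk, "metadata": dict(metadata)}
        for chunk in chunk_text(text, **kwargs)
    ]


page = "Short intro.\n\n" + "x" * 120
docs = to_documents(page, {"url": "https://example.com/"}, max_chars=50, overlap=10)
print([len(doc["text"]) for doc in docs])
```

The overlap keeps some shared context between neighbouring chunks, so a sentence cut at a chunk boundary still appears whole in at least one of them when it is shorter than the overlap.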



### Chunking and metadata extraction
Using the functions below, we extract the metadata and divide the text into chunks.

In [None]:
from vectrix.importers import chunk_content
chunked_webpages = chunk_content(web_pages)


print(f"Before chunking we had {len(web_pages)} pages; after chunking we have {len(chunked_webpages)} chunks")

## Storing the result in a Weaviate cluster

### Initialize the vector store and check that all the required modules are installed

Download the Docker Compose file if needed:
```bash
curl -o docker-compose.yml "https://configuration.weaviate.io/v2/docker-compose/docker-compose.yml?cohere_key_approval=yes&generative_anyscale=false&generative_aws=false&generative_cohere=false&generative_mistral=false&generative_octoai=false&generative_ollama=false&generative_openai=false&generative_palm=false&media_type=text&modules=modules&ner_module=false&qna_module=false&ref2vec_centroid=false&reranker_cohere=true&reranker_cohere_key_approval=yes&reranker_transformers=false&runtime=docker-compose&spellcheck_module=true&spellcheck_module_model=pyspellchecker-en&sum_module=false&text_module=text2vec-cohere&weaviate_version=v1.25.4&weaviate_volume=named-volume"
```

Make sure to set the persistent directory to the correct value:
```yaml
volumes:
  - ~/weaviate_data:/var/lib/weaviate
```

Also configure the Cohere API key:
```yaml
environment:
  SPELLCHECK_INFERENCE_API: 'http://text-spellcheck:8080'
  COHERE_APIKEY: ***
```
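Once the containers are running, one way to check that the required modules are enabled is to query Weaviate's `/v1/meta` endpoint and inspect its `modules` field. The sketch below assumes the instance listens on `http://localhost:8080`. It also assumes the module names `text2vec-cohere`, `reranker-cohere`, and `text-spellcheck`, which are inferred from the options in the download URL. Adjust both to your setup.

```python
import json
from urllib.request import urlopen


def missing_modules(meta, required):
    """Return the required module names absent from a /v1/meta response body."""
    enabled = set(meta.get("modules") or {})
    return sorted(set(required) - enabled)


def fetch_meta(base_url="http://localhost:8080", timeout=5):
    """Fetch the meta information of a running Weaviate instance (assumed URL)."""
    with urlopen(f"{base_url}/v1/meta", timeout=timeout) as response:
        return json.load(response)


# Module names assumed from the options chosen in the Docker Compose URL.
REQUIRED = ["text2vec-cohere", "reranker-cohere", "text-spellcheck"]

# Offline example with a hand-written response body. Against a live instance,
# call missing_modules(fetch_meta(), REQUIRED) instead.
sample_meta = {
    "version": "1.25.4",
    "modules": {"text2vec-cohere": {}, "reranker-cohere": {}},
}
missing = missing_modules(sample_meta, REQUIRED)
print(missing)
```

An empty list means every required module is reported by the instance.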

In [None]:
from vectrix.db import Weaviate

weaviate = Weaviate()

In [None]:
weaviate.create_collection(name='Loop', 
                           embedding_model='Ollama', 
                           model_name="mxbai-embed-large:335m",
                           model_url="http://host.docker.internal:11434")

In [None]:
print(weaviate.list_collections())

In [None]:
weaviate.set_collection(name='Vectrix')

In [None]:
weaviate.add_data(chunked_webpages)

In [None]:
retriever = weaviate.get_retriever()
retriever.invoke('Who are the Vectrix founders?')

In [None]:
weaviate.remove_collection("Loop")

In [None]:
weaviate.info()

In [None]:
weaviate.close()

In [15]:
from langchain_core.documents import Document
from langchain_cohere import CohereEmbeddings
from langchain_postgres import PGVector

connection = "postgresql://postgres:postgres@127.0.0.1:54322/postgres"
collection_name = "test"
embeddings = CohereEmbeddings()

In [16]:
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

In [None]:
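import streamlit as st

# chat_page, add_data_page, view_sources and manage_projects are assumed to be defined elsewhere in the app.
pg = st.navigation(
    {
        "Ask": [chat_page],
        "Manage Data": [add_data_page, view_sources],
        "Settings": [manage_projects],
    }
)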