In [1]:
from dotenv import load_dotenv
load_dotenv()
import os

import warnings
warnings.simplefilter("ignore", ResourceWarning)

# Vectrix Demo 👨🏻‍💻
This notebook demonstrates how to import data from various sources, load it into a vector store, and then use it to answer questions with a Retrieval Augmented Reasoning 🦜🔗 LangGraph workflow.

## Importing Data
### 1. From a URL 🔗

**Web Crawling and Data Extraction Example**

This cell demonstrates the process of crawling a website, chunking the extracted content, and performing Named Entity Recognition (NER) using the Vectrix library.

#### Steps:

1. **Web Crawling**: 
   - Uses `vectrix.Crawler` to extract pages from "https://vectrix.ai"
   - Limits the crawl to a maximum of 20 pages

2. **Content Chunking**:
   - Utilizes `vectrix.Webchunker` to break down the extracted content
   - Chunks are created with a size of 500 characters and an overlap of 50 characters

3. **Named Entity Recognition**:
   - Employs `vectrix.Extract` with the 'Replicate' model
   - Uses the 'meta/meta-llama-3-70b-instruct' model for entity extraction

This example showcases a typical workflow for web data extraction and processing using the Vectrix library.
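
The chunking step described above (500-character chunks with a 50-character overlap) can be sketched in plain Python. This is only an illustration of the sliding-window idea, not the Vectrix implementation; the function name `chunk_text` and its parameters are assumptions made for this example.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 1200
print([len(chunk) for chunk in chunk_text(sample)])  # → [500, 500, 300]
```

Text shorter than `chunk_size` comes back as a single chunk, and the last chunk is usually shorter than the others.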


In [2]:
from vectrix.importers import WebScraper

scraper = WebScraper("https://www.vectrix.ai/")
all_links = scraper.get_all_links()

2024-08-14 23:59:51,671 - trafilatura.downloads - ERROR - not a 200 response: 404 for URL https://www.vectrix.ai/robots.txt
2024-08-14 23:59:51,733 - courlan.core - INFO - 14 links found – 11 valid links
2024-08-14 23:59:51,734 - root - INFO - Starting to scrape 11 links
2024-08-14 23:59:51,914 - trafilatura.downloads - ERROR - not a 200 response: 404 for URL https://www.vectrix.ai/robots.txt
2024-08-14 23:59:51,974 - courlan.core - INFO - 10 links found – 7 valid links

In [3]:
web_pages = scraper.get_all_pages(all_links)
len(web_pages)

2024-08-15 00:00:13,326 - root - INFO - Listing all pages for domain: www.vectrix.ai
2024-08-15 00:00:13,478 - root - INFO - Getting all documents for domain: www.vectrix.ai


21

In [4]:
from vectrix.importers import chunk_content
chunked_webpages = chunk_content(web_pages)


print(f"Before chunking we had {len(web_pages)} and after chunking {len(chunked_webpages)}")

Before chunking we had 21 and after chunking 21


#### Adding Data to Vector Store using Weaviate

This code demonstrates how to add extracted web data to a Weaviate vector store using the Vectrix library.

**Steps**:

1. **Import and Instantiate Weaviate**:
   - Import necessary modules from Vectrix
   - Create a Weaviate instance

2. **Create a Collection**:
   - Set up a new collection named 'Vectrix'
   - Choose the embedding model (the cell below passes `embedding_model='Cohere'`)

3. **Prepare Data for Vectorization**:
   - Create `VectorDocument` objects for each extracted result
   - Include metadata such as title, URL, content, type, and NER information

4. **Add Data to Vector Store**:
   - Use `weaviate.add_data()` to insert the prepared documents

5. **Perform a Test Query**:
   - Get a retriever from the Weaviate instance
   - Execute a sample query to verify functionality

**Note:**
- You can remove a collection using `weaviate.remove_collection("Vectrix")`
- This example uses the 'Cohere' embedding model, but you can use any compatible embedding model

This code showcases the process of storing and querying vectorized web data using Weaviate and Vectrix, enabling efficient semantic search and retrieval.
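
To make the idea of semantic search concrete, here is a minimal in-memory sketch of what a vector store does: embed each document, embed the query, and rank by cosine similarity. It is not how Weaviate or Vectrix work internally. The class `ToyVectorStore`, the `embed` function (plain word counts instead of a real embedding model), and the document fields are all assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store; real vector stores add indexing and persistence."""

    def __init__(self):
        self._items = []  # list of (vector, document) pairs

    def add_data(self, documents):
        for doc in documents:
            self._items.append((embed(doc["content"]), doc))

    def search(self, query: str, k: int = 2):
        query_vec = embed(query)
        scored = [(cosine(query_vec, vec), doc) for vec, doc in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
        return scored[:k]


store = ToyVectorStore()
store.add_data([
    {"title": "About", "content": "the founders of the company"},
    {"title": "Blog", "content": "a post about vector search"},
])
best_score, best_doc = store.search("who are the founders", k=1)[0]
print(best_doc["title"])  # → About
```

A real embedding model would also match queries that share meaning but no words with the document, which word counts cannot do.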


In [5]:
from vectrix.db import Weaviate

# Instantiation of the Vector store
weaviate = Weaviate()

2024-08-15 00:00:18,955 - httpx - INFO - HTTP Request: GET https://esus6owcs6iz4lv8rw5vta.c0.europe-west3.gcp.weaviate.cloud/v1/meta "HTTP/1.1 200 OK"
2024-08-15 00:00:19,041 - httpx - INFO - HTTP Request: GET https://pypi.org/pypi/weaviate-client/json "HTTP/1.1 200 OK"
I0000 00:00:1723672819.082231 1311905 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache


In [6]:
from vectrix.db import Weaviate

# Instantiation of the Vector store
weaviate = Weaviate()

weaviate.remove_collection("Vectrix")

# Create a new collection to store the documents (you can also append them to an existing one)
weaviate.create_collection(name='Vectrix', 
                           embedding_model='Cohere')


weaviate.add_data(chunked_webpages)


# Perform a quick search to see if it works
retriever = weaviate.get_retriever()
retriever.invoke('Who are the Vectrix founders ?')

2024-08-15 00:00:22,256 - httpx - INFO - HTTP Request: GET https://esus6owcs6iz4lv8rw5vta.c0.europe-west3.gcp.weaviate.cloud/v1/meta "HTTP/1.1 200 OK"
2024-08-15 00:00:22,348 - httpx - INFO - HTTP Request: GET https://pypi.org/pypi/weaviate-client/json "HTTP/1.1 200 OK"
2024-08-15 00:00:22,769 - httpx - INFO - HTTP Request: DELETE https://esus6owcs6iz4lv8rw5vta.c0.europe-west3.gcp.weaviate.cloud/v1/schema/Vectrix "HTTP/1.1 200 OK"
2024-08-15 00:00:22,978 - httpx - INFO - HTTP Request: POST https://esu

[Document(metadata={'title': 'Vectrix - About Us', 'url': 'https://www.vectrix.ai/about-us', 'source_type': 'webpage', 'source_format': 'html', 'uuid': 'a843f8ea-a46c-4de0-99e5-c9f21778754d', 'score': 0.710525393486023}, page_content="AI Innovation Studio Meet Team Vectrix Ben Selleslagh Co-Founder Meet Ben, a pivotal player and Co-Founder at Vectrix. With a rich background as a data professional, Ben brings his diverse experience from banking, government, and media sectors into the mix. He's skilled in crafting and executing data-driven strategies that sync perfectly with business goals. His expertise in Google Cloud technology makes him a wizard at building scalable and efficient data architectures. At Vectrix, Ben applies his know-how to innovate and drive our data and AI solutions, always with an eye on efficiency and scalability, ensuring that we stay at the forefront of AI technology. Dimitri Allaert Co-Founder Meet Dimitri Allaert, a driving force and Co-Founder at Vectrix. With

In [None]:
weaviate.set_colleciton('Vectrix')
retriever = weaviate.get_retriever()
retriever.invoke('Who are the Vectrix founders ?')

### 2. From OneDrive ☁
We can also connect to a OneDrive folder and download its contents. The downloaded documents can then be added to the vector store in the same way as the web pages. This lets us extract various kinds of documents, such as PDF, Word, and PowerPoint files.

The cells below download files from OneDrive, process them with Unstructured, and store the results in Weaviate.

Steps:
1. Connect to OneDrive using Azure credentials
2. List and download all files
3. Process files with Unstructured importer
4. Close OneDrive connection
5. Add processed documents to Weaviate vector store

Requirements:
- vectrix library (OneDrive connector, Unstructured importer)
- Weaviate client
- Azure credentials as environment variables (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET)

Note: Close OneDrive connection before uploading to vector store.
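
Because `os.environ.get` returns `None` silently when a variable is missing, it can help to check the credentials listed above before connecting. The sketch below uses only the standard library; the helper name `missing_env_vars` is an assumption made for this example and is not part of Vectrix.

```python
import os

REQUIRED_VARS = ("AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All OneDrive credentials are set")
```

If the variables live in a `.env` file, run `load_dotenv()` first (as in the first cell of this notebook) so they are visible to `os.environ`.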

In [None]:
from vectrix.connectors.onedrive import OneDrive
from vectrix.importers.unstructured import Unstructured

drive = OneDrive(azure_client_id = os.environ.get('AZURE_CLIENT_ID'), azure_client_secret = os.environ.get('AZURE_CLIENT_SECRET'))

drive.list_files()

In [None]:
# Download all files in the drive, returns a list of downloaded files
downloaded_files = drive.download_files()
print(downloaded_files)

# Process the documents with Unstructured (raw PDF files can't be loaded into a vector store directly)
importer = Unstructured()
documents = importer.process_files(downloaded_files)

# Close the drive BEFORE uploading the documents to the vector store
drive.close()

# Add the documents to the vector store
weaviate = Weaviate()
weaviate.set_colleciton('Vectrix')
weaviate.add_data(documents)
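
One way to make sure a connection is always closed, even if a download fails, is the standard library's `contextlib.closing`, which works with any object that has a `close()` method. The sketch below uses a hypothetical `FakeDrive` class as a stand-in for a connector. Whether the Vectrix `OneDrive` class behaves the same way is an assumption that should be checked against the library.

```python
from contextlib import closing


class FakeDrive:
    """Hypothetical stand-in for a connector exposing download_files() and close()."""

    def __init__(self):
        self.closed = False

    def download_files(self):
        if self.closed:
            raise RuntimeError("drive is already closed")
        return ["report.pdf", "slides.pptx"]

    def close(self):
        self.closed = True


def download_then_close(drive):
    """Download all files and guarantee that close() runs afterwards."""
    with closing(drive):
        return drive.download_files()


drive = FakeDrive()
files = download_then_close(drive)
print(files, drive.closed)  # → ['report.pdf', 'slides.pptx'] True
```

With this pattern, processing and uploading happen after the `with` block has exited, so the connection is already closed by the time the vector store is touched.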