In [1]:
from dotenv import load_dotenv
load_dotenv()
import vectrix
import os
import logging
logging.basicConfig(level=logging.INFO)

USER_AGENT environment variable not set, consider setting it to identify your requests.


# Vectrix Demo 👨🏻‍💻
This notebook demonstrates how to import data from various sources, load it into a vector store, and then use it to answer questions with Retrieval Augmented Reasoning 🦜🔗 LangGraph.

## Importing Data
### 1. From a URL 🔗

**Web Crawling and Data Extraction Example**

This cell demonstrates the process of crawling a website, chunking the extracted content, and performing Named Entity Recognition (NER) using the Vectrix library.

#### Steps:

1. **Web Crawling**: 
   - Uses `vectrix.Crawler` to extract pages from "https://vectrix.ai"
   - Limits the crawl to a maximum of 20 pages

2. **Content Chunking**:
   - Utilizes `vectrix.Webchunker` to break down the extracted content
   - Chunks are created with a size of 500 characters and an overlap of 50 characters

3. **Named Entity Recognition**:
   - Employs `vectrix.Extract` with 'Replicate' as the model provider
   - Uses the 'meta/meta-llama-3-70b-instruct' model for entity extraction

This example showcases a typical workflow for web data extraction and processing using the Vectrix library.
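
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-size character chunking with overlap. The `chunk_text` function is a hypothetical helper written for illustration only; it is not part of Vectrix, and `vectrix.Webchunker` may well split on different boundaries (sentences, HTML elements) internally.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Illustrative input only: 1200 characters of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(sample, chunk_size=500, chunk_overlap=50)
print([len(piece) for piece in pieces])
```

With these settings each chunk starts 450 characters after the previous one, so the last 50 characters of a chunk are repeated at the start of the next. The overlap reduces the chance that a sentence or entity is cut in half at a chunk boundary.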


In [2]:
# Extracting data from a URL

crawler = vectrix.Crawler("https://vectrix.ai")
site_pages = crawler.extract()

print(f"Extracted {len(site_pages)} pages")

# Chunk the data
chunker = vectrix.Webchunker(site_pages)
chunks = chunker.chunk_content(chunk_size=500, chunk_overlap=50)

# Extract NER info (Named Entities)
extractor = vectrix.Extract('Replicate', 'meta/meta-llama-3-70b-instruct')
results = extractor.extract(chunks)

print("Example CHUNK: ")
print(chunks[1])

Fetching pages: 100%|##########| 1/1 [00:01<00:00,  1.17s/it]
Fetching pages: 100%|##########| 10/10 [00:03<00:00,  2.51it/s]
Fetching pages: 100%|##########| 9/9 [00:04<00:00,  2.09it/s]


Extracted 20 pages


Extracting entities: 100%|██████████| 25/25 [05:36<00:00, 13.44s/it]

An error occurred: Prediction timed out.
Example CHUNK: 
page_content='Image Extraction with Langchain and Gemini Ben Selleslagh In today's digital landscape, businesses are constantly seeking ways to optimize their online presence and streamline their operations. One powerful tool that's gaining traction is AI-powered image metadata extraction. At Vectrix, we've implemented this technology for a major online retailer, processing thousands of product images to enhance user experience and boost SEO rankings. What is Image Metadata Extraction? Image metadata extraction is the process of using artificial intelligence to analyze images and generate structured data about their content. This can include descriptions, colors, attributes, and even SEO-friendly hashtags. By leveraging advanced machine learning models, we can automatically extract valuable information from visual content, turning images into a rich source of data. The Business Benefits - Enhanced SEO: By generating rich, varied 




#### Adding Data to Vector Store using Weaviate

This code demonstrates how to add extracted web data to a Weaviate vector store using the Vectrix library.

**Steps**:

1. **Import and Instantiate Weaviate**:
   - Import necessary modules from Vectrix
   - Create a Weaviate instance

2. **Create a Collection**:
   - Set up a new collection named 'Vectrix'
   - Use a local Ollama model for embeddings
   - Specify the model name and URL

3. **Prepare Data for Vectorization**:
   - Create `VectorDocument` objects for each extracted result
   - Include metadata such as title, URL, content, type, and NER information

4. **Add Data to Vector Store**:
   - Use `weaviate.add_data()` to insert the prepared documents

5. **Perform a Test Query**:
   - Get a retriever from the Weaviate instance
   - Execute a sample query to verify functionality

**Note:**
- You can remove a collection using `weaviate.remove_collection("Vectrix")`
- This example uses a local Ollama model, but you can use any compatible embedding model

This code showcases the process of storing and querying vectorized web data using Weaviate and Vectrix, enabling efficient semantic search and retrieval.
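
As a rough intuition for what the retriever's `score` represents, the sketch below ranks a few toy documents by cosine similarity to a query vector. Everything in it is an assumption made for illustration: the three-dimensional vectors are made up (real embeddings from a model such as `mxbai-embed-large` have far more dimensions), `cosine_similarity` and `top_k` are hypothetical helpers, and the brute-force scan does not reflect how Weaviate indexes or searches vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, docs, k=2):
    """Return the k (score, title) pairs with the highest similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in docs.items()]
    scored.sort(reverse=True)  # highest score first
    return scored[:k]


# Made-up embeddings, for illustration only
toy_docs = {
    "About Vectrix": [0.9, 0.1, 0.0],
    "Image extraction blog": [0.1, 0.9, 0.2],
    "Open application": [0.7, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

for score, title in top_k(query, toy_docs):
    print(f"{score:.3f}  {title}")
```

Documents whose vectors point in nearly the same direction as the query score close to 1, which is why semantically related text is returned even when it shares no exact keywords with the question.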


In [3]:
from vectrix.db import Weaviate
from vectrix.models.documents import VectorDocument

# Instantiation of the Vector store
weaviate = Weaviate()

weaviate.remove_collection("Vectrix")

# Create a new collection to store the documents (you can also append them to an existing one)
weaviate.create_collection(name='Vectrix', 
                           embedding_model='Ollama', # For this example we use a local Ollama model, but you can use any other model 
                           model_name="mxbai-embed-large:335m",
                           model_url="http://host.docker.internal:11434") # Ollama URL

data_to_vectorize = []


# Now let's add our downloaded webpages to the Vector store.
for result in results:
    data_to_vectorize.append(
        VectorDocument(
            title=result.metadata["title"],
            url=result.metadata["source"],
            content=result.page_content,
            type="webpage",
            NER=str(result.metadata["NER"]),
        )
    )

weaviate.add_data(data_to_vectorize)


# Perform a quick search to see if it works
retriever = weaviate.get_retriever()
retriever.invoke('Who are the Vectrix founders ?')



[Document(metadata={'title': 'Open Application - Create Your Dream Job at Vectrix', 'url': 'https://vectrix.ai/job-list/open-application---create-your-own-dream-job', 'type': 'webpage', 'uuid': 'ed329d5a-748b-4621-af2d-c9c2e59a091b', 'score': 0.9064587950706482}, page_content="About Vectrix At Vectrix, we train and validate private Small Language Models (SLMs) on business data, ensuring accuracy, explainability, and security. Based in Antwerp, our flexible platform offers DIY tools and expert collaboration, leveraging Retrieval Augmented Reasoning (RAR) for precise, contextually relevant answers. Join our innovative team and shape your career in a dynamic environment. At Vectrix, you’ll make a real impact and grow your skills in a supportive setting. Ready to advance AI with us? Apply now! Position Overview We are growing rapidly and our hiring plans are expanding just as quickly. We are looking for highly motivated team members who want to help build one of the leading companies in AI

### 2. From OneDrive ☁
We can also connect to a OneDrive folder and download its contents. After this step we can add the information to our vector store in the same way as before. This lets us extract various kinds of documents, such as PDF, Word, and PowerPoint files.

The cells below download files from OneDrive, process them with Unstructured, and store them in Weaviate.

Steps:
1. Connect to OneDrive using Azure credentials
2. List and download all files
3. Process files with Unstructured importer
4. Close OneDrive connection
5. Add processed documents to Weaviate vector store

Requirements:
- vectrix library (OneDrive connector, Unstructured importer)
- Weaviate client
- Azure credentials as environment variables (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET)

Note: Close OneDrive connection before uploading to vector store.
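
Because the connector reads its credentials from environment variables, it can help to check that they are set before connecting. The snippet below is a small sketch of such a check; `missing_env_vars` is a hypothetical helper, not part of Vectrix, and it only verifies that the variables exist, not that the credentials are valid.

```python
import os

# Variable names taken from the requirements listed above
REQUIRED_VARS = ("AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Azure credentials found.")
```

If the variables live in a `.env` file, the `load_dotenv()` call at the top of the notebook needs to run before this check.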

In [None]:
from vectrix.connectors.onedrive import OneDrive
from vectrix.importers.unstructured import Unstructured

drive = OneDrive(
    azure_client_id=os.environ.get('AZURE_CLIENT_ID'),
    azure_client_secret=os.environ.get('AZURE_CLIENT_SECRET'),
)

drive.list_files()

In [None]:
# Download all files in the drive, returns a list of downloaded files
downloaded_files = drive.download_files()
print(downloaded_files)

# Process the documents using Unstructured (we can't load raw PDF files into a vector store)
importer = Unstructured()
documents = importer.process_files(downloaded_files)

# Close the drive BEFORE uploading the files to the vector store
drive.close()

# Add the documents to the Vector store
weaviate = Weaviate()
weaviate.set_collection('Vectrix')
weaviate.add_data(documents)