# Weaviate Import

This notebook populates the `WeaviateBlogChunk` collection.

You can connect to a Weaviate instance running on localhost, or create a free 14-day sandbox on [WCS](https://console.weaviate.cloud/)!

1. (Option 1) Create a cluster on WCS and grab your cluster URL and auth key (if enabled)

1. (Option 2) Run `docker-compose up -d` with the accompanying Docker Compose file to start Weaviate locally on `localhost:8080`


2. Make sure the `blog` folder is in this directory (its contents come from github.com/weaviate/weaviate-io; feel free to drag and drop that folder in here to update the content). The notebook reads the `index.mdx` file in each of its subfolders.


3. Run this notebook and the 1182 blog chunks will be loaded into Weaviate.

## Connect to Client

In [12]:
# Import Weaviate and Connect to Client
import weaviate

# client = weaviate.connect_to_local()  # Connect to localhost
client = weaviate.connect_to_wcs(
    cluster_url="WCS-url",  # Replace with your WCS URL
    auth_credentials=weaviate.auth.AuthApiKey("auth-key"),  # Replace with your WCS key
    headers={
        "X-Cohere-Api-Key": "API-Key",  # Replace with your Cohere API key
    }
)

## Create Schema

In [13]:
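# CAUTION: Uncommenting and running this deletes every collection in the instance, along with its objects.
# To remove only this notebook's collection, use client.collections.delete("WeaviateBlogChunk") instead.

# client.collections.delete_all()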

In [4]:
import weaviate.classes.config as wvcc

collection = client.collections.create(
    name="WeaviateBlogChunk",
    vectorizer_config=wvcc.Configure.Vectorizer.text2vec_cohere(
        model="embed-multilingual-v3.0"
    ),
    properties=[
        wvcc.Property(name="content", data_type=wvcc.DataType.TEXT),
        wvcc.Property(name="author", data_type=wvcc.DataType.TEXT),
    ],
)

## Chunk Blogs

In [5]:
import os
import re

def chunk_list(lst, chunk_size):
    """Break a list into chunks of the specified size."""
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def split_into_sentences(text):
    """Split text into sentences using regular expressions."""
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

def read_and_chunk_index_files(main_folder_path):
    """Read index.mdx files from subfolders, split them into sentences, and group every 5 sentences into a chunk."""
    blog_chunks = []
    for folder_name in os.listdir(main_folder_path):
        subfolder_path = os.path.join(main_folder_path, folder_name)
        if os.path.isdir(subfolder_path):
            index_file_path = os.path.join(subfolder_path, 'index.mdx')
            if os.path.isfile(index_file_path):
                with open(index_file_path, 'r', encoding='utf-8') as file:
                    content = file.read()
                    sentences = split_into_sentences(content)
                    sentence_chunks = chunk_list(sentences, 5)
                    sentence_chunks = [' '.join(chunk) for chunk in sentence_chunks]
                    blog_chunks.extend(sentence_chunks)
    return blog_chunks

# Example usage
main_folder_path = './blog'
blog_chunks = read_and_chunk_index_files(main_folder_path)


In [6]:
len(blog_chunks)

1182

In [7]:
blog_chunks[0]

'---\ntitle: Combining LangChain and Weaviate\nslug: combining-langchain-and-weaviate\nauthors: [erika]\ndate: 2023-02-21\ntags: [\'integrations\']\nimage: ./img/hero.png\ndescription: "LangChain is one of the most exciting new tools in AI. It helps overcome many limitations of LLMs, such as hallucination and limited input lengths."\n---\n![Combining LangChain and Weaviate](./img/hero.png)\n\nLarge Language Models (LLMs) have revolutionized the way we interact and communicate with computers. These machines can understand and generate human-like language on a massive scale. LLMs are a versatile tool that is seen in many applications like chatbots, content creation, and much more. Despite being a powerful tool, LLMs have the drawback of being too general.'
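
The first chunk above begins with the post's YAML front matter, so that metadata is embedded together with the text, and the `author` property declared in the schema is never filled. The following is a minimal sketch of one way to separate the two. `split_front_matter` is a hypothetical helper that is not part of this notebook, and it assumes the front matter is delimited by `---` lines with `\n` line endings and lists authors in brackets, as in the output above.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Assumes the front matter is the leading block between two '---' lines.
FRONT_MATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)
AUTHORS_RE = re.compile(r"^authors:\s*\[(.*?)\]\s*$", re.MULTILINE)

def split_front_matter(text):
    """Return (authors, body); authors is a list of author handles."""
    match = FRONT_MATTER_RE.match(text)
    if match is None:
        # No front matter found: leave the text untouched.
        return [], text
    authors_match = AUTHORS_RE.search(match.group(1))
    authors = []
    if authors_match:
        authors = [a.strip() for a in authors_match.group(1).split(",") if a.strip()]
    return authors, text[match.end():]

# Shortened version of the post shown above.
sample = (
    "---\n"
    "title: Combining LangChain and Weaviate\n"
    "authors: [erika]\n"
    "date: 2023-02-21\n"
    "---\n"
    "Large Language Models (LLMs) have revolutionized the way we interact "
    "and communicate with computers.\n"
)

authors, body = split_front_matter(sample)
print(authors)
print(body)
```

One option is to call such a helper on `content` before splitting it into sentences, then store the joined author handles in the `author` property of each chunk.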

## Import Objects

In [14]:
blogs = client.collections.get("WeaviateBlogChunk")

# Insert each chunk as its own object; only `content` is set,
# so the `author` property from the schema stays empty.
for blog_chunk in blog_chunks:
    blogs.data.insert(
        properties={
            "content": blog_chunk
        }
    )
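
The loop above sends one request per chunk. As a rough sketch of an alternative, the chunks could be sent in groups with the v4 client's `insert_many`. This assumes `insert_many` accepts a list of property dictionaries and returns a result whose `errors` attribute maps batch positions to error objects with a `message`. `import_in_batches` and the batch size of 100 are hypothetical choices, not part of the original notebook.

```python
def import_in_batches(collection, blog_chunks, batch_size=100):
    """Insert chunks with insert_many, sending batch_size objects per request.

    Hypothetical helper; returns the number of objects inserted without error.
    """
    inserted = 0
    for start in range(0, len(blog_chunks), batch_size):
        batch = blog_chunks[start:start + batch_size]
        objects = [{"content": blog_chunk} for blog_chunk in batch]
        result = collection.data.insert_many(objects)
        # Assumption: result.errors maps the position within this batch to an error object.
        for index, error in result.errors.items():
            print(f"Failed to insert chunk {start + index}: {error.message}")
        inserted += len(objects) - len(result.errors)
    return inserted

# Usage against a live connection (assumes `client` and `blog_chunks` from the cells above):
# blogs = client.collections.get("WeaviateBlogChunk")
# print(import_in_batches(blogs, blog_chunks))
```

Because the helper only relies on `collection.data.insert_many`, it can be tried against a stub object before it is pointed at a real cluster.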