# Weaviate Import

This notebook populates the `WeaviateBlogChunk` collection.

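You can connect to a Weaviate instance running on localhost, or create a free 14-day sandbox on [Weaviate Cloud (WCD)](https://console.weaviate.cloud/).

1. (Option 1) Create a cluster on WCD and grab your cluster URL and auth key (if enabled). The connection cell reads them from the `WCD_CLUSTER_URL` and `WCD_CLUSTER_KEY` environment variables, and reads your OpenAI key from `OPENAI_KEY`.

1. (Option 2) Run `docker-compose up -d` with the provided Docker Compose file to start Weaviate locally on localhost:8080, then use the commented-out `weaviate.connect_to_local()` line in the connection cell instead of the WCD connection.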


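2. Make sure the `/blog` folder is in this directory (its posts are taken from github.com/weaviate/weaviate-io; feel free to drag and drop that folder in here to update the content). Note that the chunking cell reads from `main_folder_path`, which is set to `'../data'`, so adjust that path to match where the folder actually lives.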


3. Run this notebook to load the blog chunks into Weaviate. The exact count depends on the snapshot of the blog content; the run recorded below produced 943 chunks.

## Connect to Client

In [15]:
# Import Weaviate and Connect to Client
import weaviate
import os


WCD_CLUSTER_URL = os.getenv("WCD_CLUSTER_URL")
WCD_CLUSTER_KEY = os.getenv("WCD_CLUSTER_KEY")
OPENAI_KEY = os.getenv("OPENAI_KEY")

# client = weaviate.connect_to_local()  # Connect to a local instance on localhost

# connect to your cluster on WCD
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=WCD_CLUSTER_URL,  # Replace with your WCD URL
    auth_credentials=weaviate.auth.AuthApiKey(WCD_CLUSTER_KEY),  # Replace with your WCD key
    headers={
        'X-OpenAI-Api-Key': OPENAI_KEY # Replace with your OpenAI API key
    }
)
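
If any of the three environment variables is unset, `os.getenv` returns `None` and the connection call fails with a less obvious error. A minimal sketch of an up-front check follows; the helper name `require_env` is hypothetical and not part of the Weaviate client.

```python
import os


def require_env(*names):
    """Return the values of the given environment variables, in order.

    Hypothetical helper: raises RuntimeError listing every variable that is
    unset or empty, so a missing key is reported before connecting.
    """
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return [os.environ[name] for name in names]


# Possible usage before creating the client (variable names as in the cell above):
# WCD_CLUSTER_URL, WCD_CLUSTER_KEY, OPENAI_KEY = require_env(
#     "WCD_CLUSTER_URL", "WCD_CLUSTER_KEY", "OPENAI_KEY"
# )
```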

## Create Schema

In [11]:
# CAUTION: Running this deletes ALL collections in the instance, along with their objects

# client.collections.delete_all()

In [12]:
import weaviate.classes.config as wvcc

collection = client.collections.create(
    name="WeaviateBlogChunk",
    vectorizer_config=wvcc.Configure.Vectorizer.text2vec_openai(
        model="ada"
    ),
    properties=[
        wvcc.Property(name="content", data_type=wvcc.DataType.TEXT),
        wvcc.Property(name="author", data_type=wvcc.DataType.TEXT),
    ],
)

## Chunk Blogs

In [7]:
import re

def chunk_list(lst, chunk_size):
    """Break a list into chunks of the specified size."""
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def split_into_sentences(text):
    """Split text into sentences using regular expressions."""
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

def read_and_chunk_index_files(main_folder_path):
    """Read index.md files from subfolders, split into sentences, and chunk every 5 sentences."""
    blog_chunks = []
    for folder_name in os.listdir(main_folder_path):
        subfolder_path = os.path.join(main_folder_path, folder_name)
        if os.path.isdir(subfolder_path):
            index_file_path = os.path.join(subfolder_path, 'index.mdx')
            if os.path.isfile(index_file_path):
                with open(index_file_path, 'r', encoding='utf-8') as file:
                    content = file.read()
                    sentences = split_into_sentences(content)
                    sentence_chunks = chunk_list(sentences, 5)
                    sentence_chunks = [' '.join(chunk) for chunk in sentence_chunks]
                    blog_chunks.extend(sentence_chunks)
    return blog_chunks

# Example usage
main_folder_path = '../data'
blog_chunks = read_and_chunk_index_files(main_folder_path)
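
To see what the chunking helpers produce before pointing them at the blog folder, here is a small check that runs without Weaviate. It repeats the two helpers from the cell above so it is self-contained; the sample text and the chunk size of 2 are made up for illustration (the notebook itself uses 5).

```python
import re


def chunk_list(lst, chunk_size):
    """Break a list into chunks of the specified size."""
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]


def split_into_sentences(text):
    """Split text into sentences using the same regular expression as above."""
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]


# Illustrative sample text, not taken from the blog
sample_text = (
    "Weaviate is a vector database. It stores objects and vectors. "
    "Is it open source? Yes, it is. Chunks group sentences together."
)

sentences = split_into_sentences(sample_text)
chunks = [" ".join(group) for group in chunk_list(sentences, 2)]

for chunk in chunks:
    print(chunk)
```

When the number of sentences is not a multiple of the chunk size, the final chunk is simply shorter than the others.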


In [13]:
len(blog_chunks)

943

In [9]:
blog_chunks[0]

'---\ntitle: ChatGPT for Generative Search\nslug: generative-search\nauthors: [zain, erika, connor]\ndate: 2023-02-07\ntags: [\'search\', \'integrations\']\nimage: ./img/hero.png\ndescription: "Learn how you can customize Large Language Models prompt responses to your own data by leveraging vector databases."\n---\n![ChatGPT for Generative Search](./img/hero.png)\n\n<!-- truncate -->\n\nWhen OpenAI launched ChatGPT at the end of 2022, more than one million people had tried the model in just a week and that trend has only continued with monthly active users for the chatbot service reaching over 100 Million, quicker than any service before, as reported by [Reuters](https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/) and [Yahoo Finance](https://finance.yahoo.com/news/chatgpt-on-track-to-surpass-100-million-users-faster-than-tiktok-or-instagram-ubs-214423357.html?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_
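
The first chunk shows that the front matter of each post ends up inside the chunk text, and that it carries an `authors` line even though the import below never fills the `author` property. A rough sketch of separating the front matter and reading the authors follows. It assumes the front matter is delimited by `---` lines and that authors appear as `authors: [a, b]`, as in the output above; the function names are hypothetical, and a proper YAML parser would be more robust.

```python
import re

# Front matter: a block between two `---` lines at the very start of the file
FRONT_MATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)


def split_front_matter(text):
    """Return (metadata_block, body); metadata_block is '' when none is found."""
    match = FRONT_MATTER.match(text)
    if match is None:
        return "", text
    return match.group(1), text[match.end():]


def parse_authors(metadata_block):
    """Extract names from a line such as `authors: [zain, erika, connor]`."""
    match = re.search(r"^authors:\s*\[(.*?)\]\s*$", metadata_block, re.MULTILINE)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


# Shortened version of the post shown above, for illustration
sample_post = (
    "---\n"
    "title: ChatGPT for Generative Search\n"
    "authors: [zain, erika, connor]\n"
    "date: 2023-02-07\n"
    "---\n"
    "Body text of the post."
)

metadata, body = split_front_matter(sample_post)
authors = parse_authors(metadata)
```

With something like this, the body could be chunked on its own and the author names stored alongside each chunk.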

## Import Objects

In [16]:
blogs = client.collections.get("WeaviateBlogChunk")

# Insert each chunk as its own object; Weaviate assigns a UUID automatically.
# Only `content` is set here, so the `author` property stays empty.
for blog_chunk in blog_chunks:
    blogs.data.insert(
        properties={
            "content": blog_chunk
        }
    )
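
The loop above sends one request per chunk. For larger imports, the v4 Python client also offers a batching context manager on the collection. The sketch below wraps it in a function so the collection is passed in; the function name `import_chunks` is hypothetical, and the exact batching behaviour depends on your client version.

```python
def import_chunks(collection, chunks):
    """Add every chunk to `collection` via the client's dynamic batcher.

    `collection` is expected to be a Weaviate v4 collection object, e.g. the
    `blogs` handle from the cell above. Returns the objects that failed to
    import, as reported by the client after the batch context exits.
    """
    with collection.batch.dynamic() as batch:
        for chunk in chunks:
            batch.add_object(properties={"content": chunk})
    return collection.batch.failed_objects


# Possible usage with the names from this notebook:
# failed = import_chunks(blogs, blog_chunks)
# print(f"{len(failed)} chunks failed to import")
```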