## Weaviate & keyword search 

### English: easy

Splitting English sentences into words is easy: split on spaces.

For example, splitting the sentence `Hello, beautiful world!` on spaces gives `["Hello,", "beautiful", "world!"]`.

### Korean: not so easy

What about this?

```
아버지가방에들어가신다
```

Using spaces only, it will not be split up at all:

```
- ["아버지가방에들어가신다"]
```

A tokenizer could also easily split it incorrectly, like this:
```
- ["아버지", "가방", "에", "들어가", "신다"] ❌ (Father goes into bag)
```

It should be:
```
- ["아버지", "가", "방", "에", "들어가", "신다"] ✅ (Father goes into the room)
```
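The ambiguity above can be reproduced in a few lines of plain Python. This is only an illustrative sketch: the two token lists are copied by hand from the examples above and are not produced by any tokenizer.

```python
sentence = "아버지가방에들어가신다"

# Whitespace splitting finds no boundaries, so the whole sentence is one token
print(sentence.split())

# Two candidate segmentations, copied by hand from the examples above
wrong = ["아버지", "가방", "에", "들어가", "신다"]        # "Father goes into bag"
right = ["아버지", "가", "방", "에", "들어가", "신다"]   # "Father goes into the room"

# Both reassemble into exactly the same string, so the characters alone
# cannot tell us which split is intended
print("".join(wrong) == sentence, "".join(right) == sentence)
```

Because both splits cover the same characters, choosing between them takes knowledge of Korean vocabulary and grammar, which is what a dictionary-based tokenizer provides.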

Great search is critical for building great AI applications, and the ability to split a sentence into words is a key part of that. 

### Introducing Weaviate's Korean tokenizer

In Weaviate `1.25.7`, we introduce a Korean tokenizer that can split Korean sentences into words. This is a significant step forward in helping Korean developers build great AI applications.

## Demo with Weaviate

Install Docker, then run the following command from the directory that contains your `docker-compose.yml` file to start a Weaviate instance:

```bash
docker-compose up -d
```

Run `pip install weaviate-client` to install the Weaviate client. 

Then, run the following code to connect to Weaviate:

```python
import weaviate
import os

cohere_key = os.environ["COHERE_API_KEY"]

client = weaviate.connect_to_local(
    headers={"X-Cohere-Api-Key": cohere_key}
)
```
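Since the Korean tokenizer arrives in Weaviate `1.25.7`, it can be worth confirming the server version before creating the collection. The helper below is a hypothetical sketch (the name `supports_kagome_kr` is not part of the Weaviate client); it only compares version strings.

```python
def supports_kagome_kr(version: str) -> bool:
    """Return True if a Weaviate version string is at least 1.25.7."""
    # Drop an optional "v" prefix and any pre-release suffix such as "-rc.1"
    core = version.lstrip("v").split("-")[0]
    parts = tuple(int(p) for p in core.split(".")[:3])
    return parts >= (1, 25, 7)

print(supports_kagome_kr("1.25.7"))   # True
print(supports_kagome_kr("1.24.20"))  # False
```

With a live connection, the version string can be read from the server metadata, for example `supports_kagome_kr(client.get_meta()["version"])`.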



The collection below uses the "Kagome" tokenizer with the "MeCab-ko" dictionary to tokenize Korean sentences. 

- [How to set a tokenizer](https://weaviate.io/developers/weaviate/manage-data/collections#property-level-settings)
- [Available tokenizers](https://weaviate.io/developers/weaviate/config-refs/schema#tokenization)

```python
from weaviate.classes.config import Configure, Property, DataType, Tokenization

# Delete the collection if it exists
if client.collections.exists("Wiki"):
    client.collections.delete("Wiki")

# Create the collection
wiki = client.collections.create(
    name="Wiki",
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            tokenization=Tokenization.KAGOME_KR
        ),
        Property(
            name="chunk",
            data_type=DataType.TEXT,
            tokenization=Tokenization.KAGOME_KR
        ),
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_cohere(
            name="chunk",
            source_properties=["chunk"],
            model="embed-multilingual-v3.0"
        ),
    ],
    generative_config=Configure.Generative.cohere(model="command-r-plus")
)
```

## Helper code

The following cells load the source texts and split them into small chunks before import.

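```python
# Load texts (Korean Wikipedia text)

from pathlib import Path

data_dir = Path("./data")
# Read each .txt file as UTF-8 so Korean text decodes correctly on every platform
src_texts = [
    {"body": txt_file.read_text(encoding="utf-8"), "title": txt_file.stem}
    for txt_file in data_dir.glob("*.txt")
]
```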

```python
# Split text into small chunks

from typing import List

def get_chunks_fixed_size(text: str, chunk_size: int) -> List[str]:
    # Each chunk runs a quarter of chunk_size into the next one, so neighbours overlap
    overlap = chunk_size // 4
    return [text[i:i+chunk_size+overlap] for i in range(0, len(text), chunk_size)]

def get_chunks(text: str) -> List[str]:
    # Split on blank lines, then break sections longer than 100 characters
    # into overlapping chunks based on a chunk size of 50
    sections = text.split("\n\n")
    chunks = []
    for s in sections:
        if len(s) > 100:
            sub_chunks = get_chunks_fixed_size(s, 50)
            chunks.extend(sub_chunks)
        else:
            chunks.append(s)
    return chunks
```
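To make the overlap concrete, here is a small check of `get_chunks_fixed_size` on a made-up sample string (the function is repeated so the snippet runs on its own). Each chunk starts `chunk_size` characters after the previous one but extends a further `chunk_size // 4` characters, so with a chunk size of 50 a chunk holds up to 62 characters.

```python
from typing import List

def get_chunks_fixed_size(text: str, chunk_size: int) -> List[str]:
    overlap = chunk_size // 4
    return [text[i:i + chunk_size + overlap] for i in range(0, len(text), chunk_size)]

# A made-up 20-character sample, split with chunk_size=8 (so overlap=2)
sample = "0123456789" * 2
chunks = get_chunks_fixed_size(sample, chunk_size=8)
print(chunks)  # ['0123456789', '8901234567', '6789']

# The last two characters of one chunk are the first two of the next
print(chunks[0][-2:] == chunks[1][:2])  # True
```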

### Import data into Weaviate

```python
from weaviate.util import generate_uuid5

with wiki.batch.fixed_size(batch_size=200) as batch:
    for src_text in src_texts:
        chunks = get_chunks(src_text["body"])
        for chunk in chunks:
            batch.add_object(
                properties={
                    "title": src_text["title"],
                    "chunk": chunk,
                },
                uuid=generate_uuid5(chunk)
            )

# Print the total number of imported chunks
count = wiki.aggregate.over_all(total_count=True).total_count

print(count)
```

```
416
```
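One detail worth noting: the import derives each object's UUID from the chunk text alone, so two identical chunks from different articles would share an ID and may end up stored as a single object. The sketch below illustrates the idea with the standard library's `uuid.uuid5` as a stand-in for `generate_uuid5` (an assumption made so the snippet runs without Weaviate) and with made-up sample rows.

```python
import uuid

def stand_in_uuid5(text: str) -> str:
    # Stand-in for weaviate.util.generate_uuid5: the same input always gives the same ID
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, text))

# Made-up rows: the first two share the same chunk text under different titles
rows = [
    ("skull", "머리뼈는 얼굴을 구성한다."),
    ("head", "머리뼈는 얼굴을 구성한다."),
    ("preface", "머리말은 페이지의 꼭대기에 위치한다."),
]

ids_by_chunk = {stand_in_uuid5(chunk) for _, chunk in rows}
ids_by_title_and_chunk = {stand_in_uuid5(f"{title}\n{chunk}") for title, chunk in rows}

print(len(rows), len(ids_by_chunk), len(ids_by_title_and_chunk))  # 3 2 3
```

If every chunk should be kept as its own object, one option is to include the title in the value that is hashed, as in the second set above.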


## Example searches

Let's check if this is working properly.

### "머리" vs "머리말" 

These are very different words in Korean:
- "머리"  (head)
- "머리말"  (page header / preface)

If "머리말" is not tokenized correctly (for example, if it is split into "머리" and "말"), a search for it will also return results relating to "머리" (head).

Let's see what happens if we search for "머리구성" (head composition) and "머리말구성" (page header / preface composition).
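Why the split matters for keyword search can be shown with a toy overlap count. This is a deliberately simplified stand-in for BM25 scoring, and the token lists are written by hand for illustration rather than produced by Kagome.

```python
def overlap(query_tokens, doc_tokens):
    # Count how many distinct query tokens also appear in the document
    return len(set(query_tokens) & set(doc_tokens))

# Hand-written tokens for a document about the head
head_doc = ["머리", "구성"]

# The query "머리말구성", tokenized two ways
query_good = ["머리말", "구성"]         # "머리말" kept as one word
query_oversplit = ["머리", "말", "구성"]  # "머리말" wrongly broken up

print(overlap(query_good, head_doc))       # 1: only "구성" matches
print(overlap(query_oversplit, head_doc))  # 2: "머리" now matches as well
```

With the over-split query, the document about the head looks like a much stronger match for a query about prefaces, which is exactly the failure a Korean-aware tokenizer is meant to avoid.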

### Example 1: Search & translate

We perform searches and translate the results using a Cohere large language model.

```python
for query in ["머리구성", "머리말구성"]:

    r = wiki.generate.bm25(
        query=query,
        query_properties=["chunk"],
        single_prompt="Return a translation of this into English (and nothing else): {chunk}",
        limit=2
    )
    print(f"\n========== RESULTS FOR QUERY: {query} ==========")
    for i, o in enumerate(r.objects):
        print(f"\n========== RESULT {i+1} ==========")
        print("ARTICLE TITLE:", o.properties["title"])
        print("CHUNK BODY:", o.properties["chunk"].replace("\n", " ")[:100] + "...")
        print("TRANSLATION:", o.generated.replace("\n", " ")[:100] + "...")
```



```
========== RESULTS FOR QUERY: 머리구성 ==========

========== RESULT 1 ==========
ARTICLE TITLE: skull
CHUNK BODY: == 구조 == [[파일:Lateral head skull.jpg|섬네일|왼쪽|머리의 구성]] 머리뼈는 얼굴을 ...
TRANSLATION: == Structure == [[File:Lateral head skull.jpg|thumb|left|Structure of the head]] The skull is compos...

========== RESULT 2 ==========
ARTICLE TITLE: skull
CHUNK BODY: ]] 머리뼈는 얼굴을 구성하고 머리뼈공간을 보호한다. [[뇌]]를 비롯하여 [[눈 (해부학)|눈]], [[귀]]...
TRANSLATION: The skull forms the face and protects the cranial cavity. It houses the [[brain]], as well as the [[...

========== RESULTS FOR QUERY: 머리말구성 ==========

========== RESULT 1 ==========
ARTICLE TITLE: preface
CHUNK BODY: 적으로 머리말을 만들고 유지하는 기능을 제공하며 여기서 머리말은 페이지마다 동일할 수도 있고 페이지 번호와 같이...
TRANSLATION: It provides the ability to create and maintain headers as an enemy, where headers can be the same on...

========== RESULT 2 ==========
ARTICLE TITLE: preface
CHUNK BODY: '''머리말''' 또는 '''머리글'''은 [[타이포그래피]]에서 본문과 구별되면서도 인쇄된 페이지의 꼭대기에 ...
TRANSLATION: A 'headword' or 'header' in typography is distinct from the main text while being at the top of a pr...
```


### Example 2: Search & summarize

We perform searches and summarise the results into Korean and English, using a Cohere large language model.

```python
for query in ["머리구성", "머리말구성"]:

    r = wiki.generate.bm25(
        query=query,
        query_properties=["chunk"],
        grouped_task=f"Summarise the findings here into a few bullet points about {query}. Each point should be a single sentence, and in Korean AND English.",
        limit=3
    )
    print(f"\n========== RESULTS FOR QUERY: {query} ==========")
    print("GENERATED SUMMARY:")
    print(r.generated)
```


```
========== RESULTS FOR QUERY: 머리구성 ==========
GENERATED SUMMARY:
Here is a summary of the findings about the structure of the head in Korean and English: 

- 머리뼈는 머리뼈공간을 보호하고, 뇌, 눈, 귀를 포함해 얼굴을 구성합니다. - The skull protects the cranial cavity and forms the face, including the brain, eyes, and ears.
- 대부분의 좌우 대칭 동물은 머리를 가지고 있으며, 머리는 신체의 앞쪽 끝부분을 구성합니다. - Most bilaterally symmetrical animals have a head, which forms the anterior end of the body.

========== RESULTS FOR QUERY: 머리말구성 ==========
GENERATED SUMMARY:
Here is a summary of the key points about '머리말구성' (preface composition) in both Korean and English:

- 머리말은 본문과 구분되면서도 페이지의 꼭대기에 위치하는 인쇄된 텍스트입니다. - The preface is printed text that is distinct from the main body and located at the top of the page.

- 머리말은 페이지마다 동일하거나 페이지 번호와 같이 달라질 수 있습니다. - The preface can remain the same or vary on each page, such as with page numbers.

- 출판물에서 머리말은 난외표제라고도 불리며, 이는 펼친 책의 왼쪽과 오른쪽 페이지에 나타납니다. - In publishing, the preface is also known as a 'running head', appearing on the left and right pages of an open book.
```
