## Weaviate & Korean

![Img](./assets/korean_tokenization_0.png)

[Weaviate](https://www.weaviate.io) includes powerful integrations that help you build AI apps with Korean data. 

These include integrations with multi-lingual models such as Cohere's and, now, a Korean tokenizer.

## Korean Tokenizer

Tokenization splits up text into components, and is critical for performing keyword searches. But tokenization is not as simple as it sounds.

### English

Splitting up English sentences into words is relatively easy, as you can split on spaces. 

For example, splitting the sentence `Hello, beautiful world!` on spaces gives `["Hello,", "beautiful", "world!"]`. 

### Korean

But Korean is a different story. Korean words do not always have spaces between them, so splitting up Korean sentences is not as simple. For example, how would you split up this sentence?

```
아버지가방에들어가신다
```

Using spaces only, it will not be split up at all:

```
- ["아버지가방에들어가신다"]
```

Now, a search for "아버지" (father) will not return this sentence, even though it contains the word "아버지".
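A quick sketch of the problem, using plain Python string splitting as a stand-in for a whitespace tokenizer. This is an illustration only, not how Weaviate indexes text.

```python
# Whitespace tokenization: workable for English, not for unspaced Korean.
english = "Hello, beautiful world!"
korean = "아버지가방에들어가신다"

english_tokens = english.split()
korean_tokens = korean.split()

print(english_tokens)  # three tokens
print(korean_tokens)   # one token: the whole sentence

# A keyword index matches whole tokens, so "아버지" is not found as a token,
# even though it is a substring of the sentence.
print("아버지" in korean_tokens)
print("아버지" in korean)
```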

Splitting on known words can also easily go wrong. The split below uses only common Korean words, but it is incorrect:
```
- ["아버지", "가방", "에", "들어가", "신다"] ❌ (Father goes into bag)
```

It should be:
```
- ["아버지", "가", "방", "에", "들어가", "신다"] ✅ (Father goes into the room)
```
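To see why dictionary lookups alone are ambiguous, here is a toy sketch that enumerates every way to cover the sentence with words from a small, hand-picked vocabulary. The vocabulary and the `segmentations` helper are assumptions made for illustration; this is not how Weaviate's tokenizer is implemented.

```python
from typing import List, Set

# Hypothetical mini-vocabulary; a real dictionary is far larger.
VOCAB: Set[str] = {"아버지", "가", "방", "가방", "에", "들어가", "신다"}

def segmentations(text: str, vocab: Set[str]) -> List[List[str]]:
    """Return every split of `text` into words that all appear in `vocab`."""
    if not text:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocab:
            # Keep this word, then segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([prefix] + rest)
    return results

candidates = segmentations("아버지가방에들어가신다", VOCAB)
for candidate in candidates:
    print(candidate)
```

Both the "bag" and the "room" readings come out as valid candidates, so a tokenizer needs more than a word list to choose between them.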

![Img](./assets/korean_tokenization_1.png)

Great search is critical for building great AI applications, and the ability to split a sentence into words is a key part of that. 

### Introducing Weaviate's Korean tokenizer

In Weaviate `1.25.7`, we introduce a Korean tokenizer that can split Korean sentences into words. This is a significant step forward in helping Korean developers build great AI applications.

## Demo with Weaviate

Install Docker, then run the following command from the directory containing your `docker-compose.yml` file to start a Weaviate instance:

```bash
docker-compose up -d
```

Run `pip install weaviate-client` to install the Weaviate client. 

Then, run the following code to connect to Weaviate:

In [1]:
import weaviate
import os

cohere_key = os.environ["COHERE_API_KEY"]

client = weaviate.connect_to_local(
    headers={"X-Cohere-Api-Key": cohere_key}
)



The collection below uses the "Kagome" tokenizer with the "MeCab-ko" dictionary to tokenize Korean sentences. 

- [How to set a tokenizer](https://weaviate.io/developers/weaviate/manage-data/collections#property-level-settings)
- [Available tokenizers](https://weaviate.io/developers/weaviate/config-refs/schema#tokenization)

In [2]:
from weaviate.classes.config import Configure, Property, DataType, Tokenization

# Delete the collection if it exists
if client.collections.exists("Wiki"):
    client.collections.delete("Wiki")

# Create the collection
wiki = client.collections.create(
    name="Wiki",
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            tokenization=Tokenization.KAGOME_KR
        ),
        Property(
            name="chunk",
            data_type=DataType.TEXT,
            tokenization=Tokenization.KAGOME_KR
        ),
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_cohere(
            name="chunk",
            source_properties=["chunk"],
            model="embed-multilingual-v3.0"  # Multi-lingual embedding model
        ),
    ],
    generative_config=Configure.Generative.cohere(model="command-r-plus")  # Multi-lingual large language model
)

## Helper code

These functions help us pre-process the data: one cell loads the source texts, and the next splits them into chunks.

In [3]:
# Load texts (Korean Wikipedia text)

from pathlib import Path

data_dir = Path("./data")
src_texts = [
    # Read explicitly as UTF-8 so Korean text loads correctly on any platform
    {"body": txt_file.read_text(encoding="utf-8"), "title": txt_file.stem}
    for txt_file in data_dir.glob("*.txt")
]

In [4]:
# Split text into small chunks

from typing import List

def get_chunks_fixed_size(text: str, chunk_size: int) -> List[str]:
    overlap = int(chunk_size // 4)
    return [text[i:i+chunk_size+overlap] for i in range(0, len(text), chunk_size)]

def get_chunks(text: str) -> List[str]:
    sections = text.split("\n\n")
    chunks = []
    for s in sections:
        if len(s) > 100:
            sub_chunks = get_chunks_fixed_size(s, 50)
            chunks.extend(sub_chunks)
        else:
            chunks.append(s)
    return chunks
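To make the overlap behaviour concrete, here is a small sketch that runs the fixed-size chunker on a short ASCII string. The function is repeated so the snippet is self-contained; the sample string and `chunk_size=8` are arbitrary choices for illustration.

```python
from typing import List

def get_chunks_fixed_size(text: str, chunk_size: int) -> List[str]:
    # Each chunk is extended by a quarter of chunk_size so neighbours overlap.
    overlap = chunk_size // 4
    return [text[i:i + chunk_size + overlap] for i in range(0, len(text), chunk_size)]

sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
chunks = get_chunks_fixed_size(sample, chunk_size=8)
for chunk in chunks:
    print(repr(chunk))
```

With `chunk_size=8` the overlap is 2 characters, so each chunk ends with the 2 characters that begin the next one. Note that the final chunk can be a short tail that is already fully contained in the previous chunk.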

### Import data into Weaviate

In [5]:
from weaviate.util import generate_uuid5

with wiki.batch.fixed_size(batch_size=200) as batch:
    for src_text in src_texts:
        chunks = get_chunks(src_text["body"])
        for chunk in chunks:
            batch.add_object(
                properties={
                    "title": src_text["title"],
                    "chunk": chunk,
                },
                uuid=generate_uuid5(chunk)
            )

# Print the total number of imported chunks
count = wiki.aggregate.over_all(total_count=True).total_count

print(count)

416
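The import above derives each object's UUID from the chunk text. As a rough sketch of why that matters, the snippet below uses the standard library's `uuid.uuid5` as a stand-in; `chunk_id` is a hypothetical helper, and `weaviate.util.generate_uuid5` may derive its values differently. The sample sentences are made up.

```python
import uuid

def chunk_id(chunk: str) -> uuid.UUID:
    # Hypothetical stand-in: name-based (version 5) UUIDs are deterministic,
    # so the same input text always produces the same ID.
    return uuid.uuid5(uuid.NAMESPACE_DNS, chunk)

id_a = chunk_id("머리말은 페이지마다 동일할 수도 있다")
id_b = chunk_id("머리말은 페이지마다 동일할 수도 있다")
id_c = chunk_id("머리뼈는 얼굴을 구성한다")

print(id_a == id_b)  # same text, same ID
print(id_a == id_c)  # different text, different ID
```

One consequence of this design is that two chunks with identical text map to the same ID, even if they come from different articles.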


## Example keyword searches

Let's check that this is working properly by searching for words that look similar but mean different things.

These are very different words in Korean:
- "머리"  (head)
- "머리말"  (page header / preface)

If "머리말" is not tokenized correctly, the search results will include results relating to "머리" (head).

The Korean tokenizer should be able to differentiate between these two words.
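The difference can be sketched in plain Python. The token lists below are hand-written for illustration and are not actual tokenizer output; `token_match` and `substring_match` are hypothetical helpers.

```python
from typing import Dict, List

# Hand-written token lists for illustration only (not real tokenizer output).
docs: Dict[str, List[str]] = {
    "head": ["머리", "가", "있다"],
    "preface": ["머리말", "을", "만들다"],
}

def token_match(query: str, tokens: List[str]) -> bool:
    """Match only when the query equals a whole token."""
    return query in tokens

def substring_match(query: str, tokens: List[str]) -> bool:
    """Match when the query appears anywhere inside any token."""
    return any(query in token for token in tokens)

token_hits = [title for title, tokens in docs.items() if token_match("머리", tokens)]
substring_hits = [title for title, tokens in docs.items() if substring_match("머리", tokens)]

print(token_hits)      # only the document containing the token "머리"
print(substring_hits)  # also picks up "머리말", which is a different word
```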

### Example 1: `머리` vs `머리말`

Let's see what happens if we search for "머리" (head) and "머리말" (page header / preface).

In [6]:
for query in ["머리", "머리말"]:

    r = wiki.query.bm25(  # Plain keyword search; no generation needed here
        query=query,
        query_properties=["chunk"],
        limit=2
    )
    print(f"\n========== RESULTS FOR QUERY: {query} ==========")
    for i, o in enumerate(r.objects):
        print(f"\n========== RESULT {i+1} ==========")
        print("ARTICLE TITLE:", o.properties["title"])
        print("CHUNK BODY:", o.properties["chunk"].replace("\n", " ")[:100] + "...")



ARTICLE TITLE: head
CHUNK BODY: 아주 단순한 동물의 경우 머리가 없는 것도 있으나 대부분의 [[좌우 대칭 동물류]]는 머리가 있다. [[척추동물...

ARTICLE TITLE: head
CHUNK BODY:  머리 그림]] '''머리'''({{llang|en|Head}})는 [[인간]]이나 [[동물]]의 [[목]] 위...


ARTICLE TITLE: preface
CHUNK BODY: 적으로 머리말을 만들고 유지하는 기능을 제공하며 여기서 머리말은 페이지마다 동일할 수도 있고 페이지 번호와 같이...

ARTICLE TITLE: preface
CHUNK BODY: '''머리말''' 또는 '''머리글'''은 [[타이포그래피]]에서 본문과 구별되면서도 인쇄된 페이지의 꼭대기에 ...


### Example 2: `머리구성` vs `머리말구성`

Let's see what happens if we search for slightly more complex phrases, like: "머리구성" (head composition) and "머리말구성" (page header / preface composition).

In [7]:
for query in ["머리구성", "머리말구성"]:

    r = wiki.query.bm25(  # Plain keyword search; no generation needed here
        query=query,
        query_properties=["chunk"],
        limit=2
    )
    print(f"\n========== RESULTS FOR QUERY: {query} ==========")
    for i, o in enumerate(r.objects):
        print(f"\n========== RESULT {i+1} ==========")
        print("ARTICLE TITLE:", o.properties["title"])
        print("CHUNK BODY:", o.properties["chunk"].replace("\n", " ")[:100] + "...")



ARTICLE TITLE: skull
CHUNK BODY: == 구조 == [[파일:Lateral head skull.jpg|섬네일|왼쪽|머리의 구성]] 머리뼈는 얼굴을 ...

ARTICLE TITLE: skull
CHUNK BODY: ]] 머리뼈는 얼굴을 구성하고 머리뼈공간을 보호한다. [[뇌]]를 비롯하여 [[눈 (해부학)|눈]], [[귀]]...


ARTICLE TITLE: preface
CHUNK BODY: 적으로 머리말을 만들고 유지하는 기능을 제공하며 여기서 머리말은 페이지마다 동일할 수도 있고 페이지 번호와 같이...

ARTICLE TITLE: preface
CHUNK BODY: '''머리말''' 또는 '''머리글'''은 [[타이포그래피]]에서 본문과 구별되면서도 인쇄된 페이지의 꼭대기에 ...


## Retrieval augmented generation (RAG)

Weaviate is AI-native, meaning it integrates with generative AI models to perform retrieval augmented generation. This makes it **easy to build AI applications**.

Above, we have set up Weaviate with:

- Cohere's multi-lingual embedding model (`embed-multilingual-v3.0`)
- Cohere's multi-lingual generative model (`command-r-plus`)

So we can perform RAG with Korean data. 

### RAG example 1: Translate

Here, we translate each result into English using the generative model.

In [8]:
for query in ["머리구성", "머리말구성"]:

    r = wiki.generate.bm25(
        query=query,
        query_properties=["chunk"],
        single_prompt="Return a translation of this into English (and nothing else): {chunk}",
        limit=2
    )
    print(f"\n========== RESULTS FOR QUERY: {query} ==========")
    for i, o in enumerate(r.objects):
        print(f"\n========== RESULT {i+1} ==========")
        print("ARTICLE TITLE:", o.properties["title"])
        print("CHUNK BODY:", o.properties["chunk"].replace("\n", " ")[:100] + "...")
        print("TRANSLATION:", o.generated.replace("\n", " ")[:100] + "...")



ARTICLE TITLE: skull
CHUNK BODY: == 구조 == [[파일:Lateral head skull.jpg|섬네일|왼쪽|머리의 구성]] 머리뼈는 얼굴을 ...
TRANSLATION: == Structure == [[File:Lateral head skull.jpg|thumb|left|Composition of the head]] The skull is comp...

ARTICLE TITLE: skull
CHUNK BODY: ]] 머리뼈는 얼굴을 구성하고 머리뼈공간을 보호한다. [[뇌]]를 비롯하여 [[눈 (해부학)|눈]], [[귀]]...
TRANSLATION: The skull forms the face and protects the cranial cavity. It includes the brain, eyes, and ears....


ARTICLE TITLE: preface
CHUNK BODY: 적으로 머리말을 만들고 유지하는 기능을 제공하며 여기서 머리말은 페이지마다 동일할 수도 있고 페이지 번호와 같이...
TRANSLATION: It provides the ability to create and maintain headers as an enemy, where the header can be the same...

ARTICLE TITLE: preface
CHUNK BODY: '''머리말''' 또는 '''머리글'''은 [[타이포그래피]]에서 본문과 구별되면서도 인쇄된 페이지의 꼭대기에 ...
TRANSLATION: A 'headword' or 'header' in typography is set apart from the main text while still appearing at the ...
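As a rough mental model of `single_prompt`, the `{chunk}` placeholder is filled in with each result's `chunk` property, giving one prompt per object. The sketch below imitates that with Python's `str.format`; the actual substitution happens inside Weaviate, and the sample results here are made up.

```python
single_prompt = "Return a translation of this into English (and nothing else): {chunk}"

# Made-up search results standing in for objects returned by a query.
results = [
    {"title": "skull", "chunk": "머리뼈는 얼굴을 구성한다."},
    {"title": "preface", "chunk": "머리말은 페이지마다 동일할 수도 있다."},
]

# One prompt per result, each with its own chunk text substituted in.
prompts = [single_prompt.format(chunk=result["chunk"]) for result in results]

for prompt in prompts:
    print(prompt)
```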


### RAG example 2: Search & summarize

Here, all of the search results are passed to the model together with one prompt, producing a single output.

Here, we ask the model to write a summary in bullet points, in Korean AND English.

In [9]:
for query in ["머리구성", "머리말구성"]:

    r = wiki.generate.bm25(
        query=query,
        query_properties=["chunk"],
        grouped_task=f"Summarise the findings here into a few bullet points about {query}. Each point should be a single sentence, and in Korean AND English.",
        limit=3
    )
    print(f"\n========== RESULTS FOR QUERY: {query} ==========")
    print("GENERATED SUMMARY:")
    print(r.generated)


GENERATED SUMMARY:
Here is a summary of the findings about the structure of the head in Korean and English: 

- 머리뼈는 머리뼈공간을 보호하고, 뇌, 눈, 귀를 포함하는 구조물입니다. - The skull protects the cranial cavity and houses the brain, eyes, and ears.
- 머리뼈는 22개의 뼈로 이루어져 있습니다. - The skull is composed of 22 bones.
- 머리뼈의 모양은 동물마다 다를 수 있지만, 기본적인 구조는 비슷합니다. - While the shape of the skull can vary between animals, the basic structure remains similar.
- 단순한 동물을 제외한 대부분의 좌우 대칭 동물은 머리를 가지고 있습니다. - Most bilateral symmetrical animals, except for very simple ones, have heads.

GENERATED SUMMARY:
Here is a summary of the key points about '머리말구성' (preface composition) in both Korean and English:

- 머리말은 본문과 구분되면서도 페이지의 꼭대기에 위치하는 타이포그래피 요소입니다. - The preface is a typographical element that is distinct from the main text and located at the top of the page.
- 머리말은 페이지마다 동일하거나 페이지 번호와 같이 달라질 수 있습니다. - The preface can remain the same on every page or vary with elements like page numbers.
- 출판물에서 머리말은 난외표제라고도 불리며, 이는 펼친 책의 왼

## Example semantic searches

We can also perform semantic searches (based on meaning) using Weaviate, as well as hybrid searches that combine keyword and semantic search.

Because the embedding model is multi-lingual (`embed-multilingual-v3.0`), we can perform searches in Korean and English.

In [11]:
# Semantic search in Korean & English - shows very similar results
for query in ["head", "머리"]:
    r = wiki.generate.near_text(
        query=query,
        target_vector="chunk",
        single_prompt="Return a translation of this into English (and nothing else): {chunk}",
        limit=2
    )
    print(f"\n========== RESULTS FOR QUERY: {query} ==========")
    for i, o in enumerate(r.objects):
        print(f"\n========== RESULT {i+1} ==========")
        print("ARTICLE TITLE:", o.properties["title"])
        print("CHUNK BODY:", o.properties["chunk"].replace("\n", " ")[:100] + "...")
        print("TRANSLATION:", o.generated.replace("\n", " ")[:100] + "...")



ARTICLE TITLE: head
CHUNK BODY:  머리 그림]] '''머리'''({{llang|en|Head}})는 [[인간]]이나 [[동물]]의 [[목]] 위...
TRANSLATION: 'Head' (in English) is the part of the body that is above the neck in humans and animals....

ARTICLE TITLE: head
CHUNK BODY: 물]]의 [[목]] 위의 부분을 가리킨다. 대개의 경우 머리에는 [[눈 (해부학)|눈]], [[코]], [[입]...
TRANSLATION: It refers to the part above the [[neck]] of the [[water]]. In most cases, the head includes [[eye (a...


ARTICLE TITLE: head
CHUNK BODY:  머리 그림]] '''머리'''({{llang|en|Head}})는 [[인간]]이나 [[동물]]의 [[목]] 위...
TRANSLATION: '''Head''' ({{llang|en|Head}}) is the part of the [[human]] or [[animal]] above the [[neck]]....

ARTICLE TITLE: skull
CHUNK BODY: 78-89-6109-092-6}}, 215쪽</ref> 머리를 이루는 뼈는 크게 보아 [[뇌머리뼈]], [[얼굴...
TRANSLATION: "78-89-6109-092-6}}, page 215</ref> The bones that make up the head are broadly classified into [[br...
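The hybrid search mentioned earlier in this section can be sketched as a small helper. `hybrid_search` is a hypothetical wrapper name; it assumes a collection set up like `wiki` above, with a `chunk` property and a named vector also called `chunk`. The `alpha` value of 0.5 is an arbitrary choice for illustration.

```python
def hybrid_search(collection, query: str, alpha: float = 0.5, limit: int = 2):
    """Run a hybrid (keyword + vector) search over the "chunk" data."""
    return collection.query.hybrid(
        query=query,
        query_properties=["chunk"],  # keyword side searches this property
        target_vector="chunk",       # vector side uses this named vector
        alpha=alpha,                 # 0 = keyword only, 1 = vector only
        limit=limit,
    )

# Example usage against a running Weaviate instance:
# r = hybrid_search(wiki, "머리")
# for o in r.objects:
#     print(o.properties["title"])
```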


## Generative feedback loops (GFL)

(Preview note) We are building "generative feedback loop" tools, which allow you to enrich and enhance your data using these generative outputs. 

As a basic example, these translated outputs or summarised outputs can be added back into Weaviate, and used going forward.

Keep an eye out for this feature in future releases.
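Until then, a manual version of the idea might look like the sketch below. It is not the upcoming tooling: `store_translations` is a hypothetical helper, and `chunk_en` is an assumed property name that does not exist in the collection defined above.

```python
def store_translations(collection, response, target_property: str = "chunk_en") -> int:
    """Write each object's generated text back to it; return how many were updated."""
    updated = 0
    for o in response.objects:
        if o.generated:  # skip objects that have no generated output
            collection.data.update(
                uuid=o.uuid,
                properties={target_property: o.generated},
            )
            updated += 1
    return updated

# Example usage, where `r` is a response from a generate query with single_prompt:
# n = store_translations(wiki, r)
```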

## What next?

Weaviate's Korean tokenizer is a significant step forward in helping Korean developers build great AI applications.

Try out Weaviate, starting with the [Quickstart](https://weaviate.io/developers/weaviate/quickstart). 

And where you have Korean data, set the property tokenizer to "kagome_kr" as shown in the code above. 

### Note

- As of `1.25.7`, the tokenizer must be enabled separately by setting the `ENABLE_TOKENIZER_KAGOME_KR` [environment variable](https://weaviate.io/developers/weaviate/config-refs/env-vars) to `true` (for example, in the `docker-compose.yml` file).