[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/weaviate-features/model-providers/openai/similarity_search_multilingual_japanese.ipynb)

# Dependencies

In [1]:
%pip install -U weaviate-client

Note: you may need to restart the kernel to use updated packages.


# Configuration

In [2]:
import weaviate
import os

from weaviate import classes as wvc
client = weaviate.connect_to_embedded(
    version="1.28.2",
    environment_variables={
        "ENABLE_TOKENIZER_KAGOME_JA": "true",
    },
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]
    }
)

In [3]:
print(
    "Client Version:", weaviate.__version__, 
    "Server Version:", client.get_meta().get("version")
)

Client Version: 4.10.2 Server Version: 1.28.2


# Schema

In [4]:
# Reset the schema. CAUTION: this deletes the collection and all of its data.
client.collections.delete("MyCollection")

collection = client.collections.create(
    "MyCollection",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    generative_config=wvc.config.Configure.Generative.openai(),
    properties=[
        wvc.config.Property(
            name="content", 
            data_type=wvc.config.DataType.TEXT,
            tokenization=wvc.config.Tokenization.KAGOME_JA
        )
    ]
)
print("Successfully created the schema.")

Successfully created the schema.
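
The `KAGOME_JA` tokenization matters because Japanese text has no spaces between words. The snippet below is only a rough illustration of that problem: the regex is an assumed stand-in for a word-style tokenizer that splits on non-alphanumeric characters, not Weaviate's implementation, and it does not reproduce Kagome's output.

```python
import re

sentence = "私の名前は田中(Tanaka)です。趣味はテニスです。"

# Assumed stand-in for a word-style tokenizer (illustration only):
# lowercase the text, then split on every non-alphanumeric character.
word_like_tokens = re.findall(r"\w+", sentence.lower())

print(word_like_tokens)

# The name is buried inside a longer token, so an exact token match fails
# even though the characters are present in the sentence.
print("田中" in word_like_tokens)
print("田中" in sentence)
```

A morphological tokenizer is meant to segment such runs into individual words, which is what lets keyword queries like `田中` match in the BM25 and hybrid searches later in this notebook.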


# Import the Data

In [5]:
data = [
    "私の名前は鈴木(Suzuki)です。趣味は野球です。", # My name is Suzuki. My hobby is baseball.
    "私の名前は佐藤(Sato)です。趣味はサッカーです。", # My name is Sato. My hobby is soccer.
    "私の名前は田中(Tanaka)です。趣味はテニスです。" # My name is Tanaka. My hobby is tennis.
]

# Batch import all objects
# (Batch import is overkill for 3 objects, but it is recommended for large volumes of data.)
with collection.batch.dynamic() as batch:
    for item in data:
        # the call that performs data insert
        batch.add_object(
            properties={"content": item},
        )
        print(item)

print("Data import complete")

私の名前は鈴木(Suzuki)です。趣味は野球です。
私の名前は佐藤(Sato)です。趣味はサッカーです。
私の名前は田中(Tanaka)です。趣味はテニスです。

Data import complete
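
A batch can finish without raising even if some objects were rejected (for example, when vectorization fails). A minimal sketch of a post-import check follows; `report_failed_objects` is a hypothetical helper name, and it assumes the v4 client exposes failures on `collection.batch.failed_objects` with a `message` attribute on each entry.

```python
def report_failed_objects(collection) -> int:
    """Print up to five import errors and return the number of failed objects.

    Hypothetical helper: assumes `collection.batch.failed_objects` is a list
    of error entries that each carry a `message` attribute.
    """
    failed = collection.batch.failed_objects
    for error in failed[:5]:
        print("Import failed:", error.message)
    return len(failed)
```

Calling `report_failed_objects(collection)` right after the `with` block should return `0` when every object was imported.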


Run a quick check that all objects were imported, using the
[meta count](https://weaviate.io/developers/weaviate/search/aggregate#retrieve-a-meta-property).

In [6]:
# Check number of objects
collection.aggregate.over_all()

AggregateReturn(properties={}, total_count=3)
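
To turn that visual check into a programmatic one, the count can be compared with the number of items that were sent. This is a small sketch; `imported_count_matches` is a hypothetical helper name, and it assumes `aggregate.over_all(total_count=True)` returns an object with a `total_count` field, as in the output above.

```python
def imported_count_matches(collection, expected: int) -> bool:
    """Return True if the collection holds exactly `expected` objects.

    Hypothetical helper built on the aggregate call shown above.
    """
    total = collection.aggregate.over_all(total_count=True).total_count
    return total == expected
```

In this notebook the call would be `imported_count_matches(collection, len(data))`.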

# Queries

## Semantic search (nearText)

In [7]:
# "バトミントン" means "badminton", a word that appears in none of the documents
query = collection.query.near_text("バトミントン", limit=2)
for obj in query.objects:
    print("####")
    print(obj.properties["content"])


####
私の名前は田中(Tanaka)です。趣味はテニスです。
####
私の名前は鈴木(Suzuki)です。趣味は野球です。


## Semantic search with filter

In [8]:
query = collection.query.near_text(
    query="バトミントン",  # "badminton"
    limit=2,
    # Keep only objects whose content matches the wildcard pattern (鈴木 = Suzuki)
    filters=wvc.query.Filter.by_property("content").like("*鈴木*")
)
for obj in query.objects:
    print("####")
    print(obj.properties["content"])

####
私の名前は鈴木(Suzuki)です。趣味は野球です。


## Generative search

In [9]:
# The Japanese part of the prompt asks: "How is my name read?"
query = collection.generate.near_text(
    query="バトミントン",  # "badminton"
    limit=1,
    single_prompt="{content}。私の名前の読み方は何ですか？ answer in english"
)
for obj in query.objects:
    print("####")
    print(obj.properties["content"])
    print(obj.generated)

####
私の名前は田中(Tanaka)です。趣味はテニスです。
Your name, Tanaka, is pronounced as "Tah-nah-kah" in English.


## Hybrid search

In [10]:
# alpha=0.5: vector and keyword scores weighted equally
query = collection.query.hybrid(
    query="田中",
    alpha=0.5,
    limit=1
)
for obj in query.objects:
    print("#### alpha=0.5")
    print(obj.properties["content"])

# alpha=1: pure vector search
query = collection.query.hybrid(
    query="田中",
    alpha=1,
    limit=1
)
for obj in query.objects:
    print("#### alpha=1")
    print(obj.properties["content"])


# alpha=0: pure keyword (BM25) search
query = collection.query.hybrid(
    query="田中",
    alpha=0,
    limit=1
)
for obj in query.objects:
    print("#### alpha=0")
    print(obj.properties["content"])


#### alpha=0.5
私の名前は田中(Tanaka)です。趣味はテニスです。
#### alpha=1
私の名前は田中(Tanaka)です。趣味はテニスです。
#### alpha=0
私の名前は田中(Tanaka)です。趣味はテニスです。
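
All three settings return the same object here because the keyword and vector rankings happen to agree. To show how `alpha` can change the winner when they disagree, here is a simplified sketch of score blending. The document ids and scores are made up, and the min-max normalization is an assumption for illustration; Weaviate's own fusion algorithms (ranked fusion and relative score fusion) differ in detail.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a single or constant score maps to 1.0."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def blend(vector_scores, keyword_scores, alpha):
    """Rank ids by alpha * vector score + (1 - alpha) * keyword score."""
    vec, kw = min_max(vector_scores), min_max(keyword_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
    # Sort by descending score, then by id so that ties are deterministic.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores: the vector search prefers doc_a, the keyword search doc_b.
vector_scores = {"doc_a": 0.90, "doc_b": 0.70, "doc_c": 0.60}
keyword_scores = {"doc_b": 0.45}

for alpha in (0, 0.5, 1):
    print(alpha, blend(vector_scores, keyword_scores, alpha)[0][0])
```

With these made-up numbers, `alpha=1` ranks by the vector scores alone and `alpha=0` by the keyword scores alone, which is the behavior the three queries above are probing.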


## BM25 search (keyword-based)

In [11]:
# BM25 keyword search; request the score of each result
query = collection.query.bm25(
    query="田中",
    limit=1,
    return_metadata=wvc.query.MetadataQuery(score=True)
)
for obj in query.objects:
    print("####", obj.metadata.score)
    print(obj.properties["content"])

#### 0.4458314776420593
私の名前は田中(Tanaka)です。趣味はテニスです。
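
For intuition about where such a score comes from, the sketch below computes a textbook BM25 score over hand-segmented tokens. The token lists are assumptions written by hand, not Kagome output, and `k1 = 1.2`, `b = 0.75` are assumed parameter values, so the numbers will not match the score printed above.

```python
import math

K1, B = 1.2, 0.75  # assumed BM25 parameters

# Hand-segmented tokens (an assumption, not the output of the Kagome tokenizer).
docs = {
    "suzuki": ["私", "の", "名前", "は", "鈴木", "suzuki", "です", "趣味", "は", "野球", "です"],
    "sato": ["私", "の", "名前", "は", "佐藤", "sato", "です", "趣味", "は", "サッカー", "です"],
    "tanaka": ["私", "の", "名前", "は", "田中", "tanaka", "です", "趣味", "は", "テニス", "です"],
}


def bm25_score(term, doc_id):
    """Textbook BM25 score of a single query term for one document."""
    tokens = docs[doc_id]
    tf = tokens.count(term)
    if tf == 0:
        return 0.0
    n_docs = len(docs)
    n_containing = sum(term in other for other in docs.values())
    idf = math.log(1 + (n_docs - n_containing + 0.5) / (n_containing + 0.5))
    avg_len = sum(len(other) for other in docs.values()) / n_docs
    length_norm = K1 * (1 - B + B * len(tokens) / avg_len)
    return idf * tf * (K1 + 1) / (tf + length_norm)


for doc_id in docs:
    print(doc_id, round(bm25_score("田中", doc_id), 4))
```

Only the document that contains `田中` as a token receives a non-zero score, which is why correct segmentation of Japanese text is a precondition for keyword search.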
