
**2024/06/08**

Weaviate released [v4 of its Python client in February 2024](https://weaviate.io/blog/py-client-v4-release). [v3 and v4 differ substantially](https://weaviate.io/developers/weaviate/client-libraries/python/v3_v4_migration), so this notebook records some of the changes in basic usage.

# Package installation & environment variable setup

In [None]:
%pip install -Uqq langchain-weaviate langchain_community langchain_openai
%pip install -Uqq openai tiktoken langchain

In [None]:
from google.colab import userdata

OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")

#---#

WEAVIATE_TEST_API_KEY = userdata.get("weaviate_test1_api_key")

#---#

from huggingface_hub import HfApi
HF_TOKEN = userdata.get("HF_TOKEN")

api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

soaring0616


In [None]:
# Connect to WCS (Weaviate Cloud Services)
import weaviate
import json
import os

client = weaviate.connect_to_wcs(
    cluster_url="https://testing1-g3dylhd4.weaviate.network",   # Replace with your own cluster URL
    auth_credentials=weaviate.auth.AuthApiKey(WEAVIATE_TEST_API_KEY),  # Use your own API key
    headers={
        "X-OpenAI-Api-Key": OPENAI_API_KEY,
        "X-Huggingface-Api-Key": HF_TOKEN,
    }
)


# Check if your instance is live and ready
# This should return `True`
client.is_ready()

True

# In v4, `schema` is replaced by `collections`
Details: https://forum.weaviate.io/t/attributeerror-weaviateclient-object-has-no-attribute-schema/2433/3

In [None]:
client.collections.delete_all() # Delete all existing collections (the former schema)
#client.collections.get() # collections.get requires the `name` argument
client.collections.list_all() # List all collections

{}
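
`delete_all()` removes every collection in the cluster. When only one collection needs to be reset, a narrower helper may be safer. The sketch below assumes the v4 client's `client.collections.exists(name)` and `client.collections.delete(name)` methods; the helper name `drop_collection_if_exists` is made up for illustration.

```python
def drop_collection_if_exists(client, name):
    """Delete a single collection by name; return True if something was deleted.

    `client` is expected to be a connected v4 WeaviateClient (assumption:
    it exposes `collections.exists` and `collections.delete`).
    """
    if client.collections.exists(name):
        client.collections.delete(name)
        return True
    return False


# Usage with the client from this notebook (not executed here):
# drop_collection_if_exists(client, "MyExampleIndex")
```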

## Creating a `collection` also uses a new syntax

Details:
1.   https://weaviate.io/developers/weaviate/manage-data/collections#create-a-collection
2.   https://weaviate.io/developers/weaviate/manage-data/collections#create-a-collection-and-define-properties
3.   https://weaviate.io/developers/weaviate/manage-data/collections#property-level-settings



In [None]:
import weaviate
import weaviate.classes as wvc
import os

newcollections = client.collections.create(
    name="MyExampleIndex",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),    # Set the vectorizer to "text2vec-openai" to use the OpenAI API for vector-related operations
    generative_config=wvc.config.Configure.Generative.cohere(),             # Set the generative module to "generative-cohere" to use the Cohere API for RAG
    properties=[
        wvc.config.Property(
            name="content",
            data_type=wvc.config.DataType.TEXT,
            vectorize_property_name=True,  # Include the property name ("content") when vectorizing
            tokenization=wvc.config.Tokenization.LOWERCASE  # Use "lowercase" tokenization
        ),
    ]
)


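## Read the file and store each line (entries are separated by blank lines) as one vector record in the vector database
### Read the file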

In [None]:
!wget https://raw.githubusercontent.com/soaring0616/some_stuff/main/test123.txt

--2024-06-08 07:15:36--  https://raw.githubusercontent.com/soaring0616/some_stuff/main/test123.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1136 (1.1K) [text/plain]
Saving to: ‘test123.txt.1’


2024-06-08 07:15:36 (58.8 MB/s) - ‘test123.txt.1’ saved [1136/1136]



In [None]:
# Use the LangChain packages
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("test123.txt") # Test file: the "ten questions" of traditional Chinese medicine
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

## Batch import syntax has changed as follows:
1.   https://weaviate.io/developers/weaviate/manage-data/import#basic-import


In [None]:
collection = client.collections.get("MyExampleIndex")  # collections.get requires the `name` argument

with collection.batch.dynamic() as batch:
    for item in docs[0].page_content.split("\n\n"):
        properties = {
            "content": item
        }

        batch.add_object(  # `batch.add_data_object` is replaced by `batch.add_object`
            properties=properties  # There is no `class_name` parameter anymore
        )
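
The import step can also be wrapped in small helpers. This is only a sketch, not the notebook's original code: `split_entries` and `import_entries` are hypothetical names. It assumes that `collection.batch.failed_objects` is populated once the batch context exits, as the v4 import documentation describes. Splitting on blank lines and dropping empty chunks avoids storing whitespace-only objects.

```python
def split_entries(text):
    """Split raw text on blank lines and drop empty or whitespace-only chunks."""
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


def import_entries(collection, entries):
    """Add each entry as one object with a `content` property.

    Returns whatever the batch reports as failed (assumption: the v4 client
    exposes this as `collection.batch.failed_objects` after the `with` block).
    """
    with collection.batch.dynamic() as batch:
        for entry in entries:
            batch.add_object(properties={"content": entry})
    return collection.batch.failed_objects


# Usage with the objects from this notebook (not executed here):
# failed = import_entries(collection, split_entries(docs[0].page_content))
# print(len(failed), "objects failed to import")
```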


## Inspecting the contents of a `collection` now uses an `iterator`. Details:

https://weaviate.io/developers/weaviate/manage-data/read-all-objects#read-object-properties-and-ids



In [None]:
for item in collection.iterator():
    print(item.uuid, item.properties)

# for item in collection.iterator(include_vector=True): # To also see the stored vectors, set the `include_vector` parameter
#     print(item.uuid, item.properties, item.vector)

04d53cd9-d85c-4278-a82f-5d30bd1e0105 {'content': '1. 問睡眠,你的睡眠如何?是否一覺到天亮?是否每天定時會醒?如果會醒,是幾點會醒?是否多夢?等等.'}
0e95c450-82b1-414b-bb74-e115ecef2088 {'content': '2. 問胃口,你感覺餓嗎?有欲望想吃什麼特別的食物或是喜愛什麼味道的食物?或是不餓,完全沒有胃口.'}
26313942-51b3-4bfe-b30b-2ff8150bf4a6 {'content': '10. 問女子月經,無論妳有無月經,都要詳細說明妳的月經情形,是延後還是每次都提前呢?痛不痛呢?生過小孩嗎?'}
5c1d1a09-5c7f-4160-8745-d264e3100d6a {'content': '9. 問性功能,你性功能好嗎?等等.'}
8b683022-fb33-40fc-aca9-bca54a2346b9 {'content': '8. 問體力如何,精神好嗎?還是一直疲憊中?早上起床時,是精神奕奕呢?還是無法起床呢?精神能夠集中嗎?'}
8c1c2dec-3f64-4aa2-8751-0ce0a47d559a {'content': '5. 問口渴,你很渴嗎?如渴,最想喝什麼溫度的水?如不渴,時常會忘記喝水嗎?還是再怎麼喝也不能止渴呢?'}
a35b4c51-9b87-49e6-a965-3db48185bdd9 {'content': '7. 問汗,你容易出汗嗎?會半夜盜汗嗎?會時常流汗不止嗎?還是不出汗的身體呢?'}
b08df0ab-6a64-4213-82c3-d22ee1715bf6 {'content': '6. 問寒熱,你平時覺得身體很熱還是很冷?手腳冰冷嗎?'}
d302d1e7-10ca-48ec-9039-c4608e085a1f {'content': '3. 問大便,你便秘嗎?每天有大便嗎?大便顏色是什麼?是下利嗎?很臭還是無味?等等.'}
ecca02fd-f035-48ea-bee6-6d1acd28dcf1 {'content': '4. 問小便,你的小便是什麼顏色?頻尿嗎?還是小不出來?還是沒有尿意?平均一天幾次?等等.'}
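
In the output above the objects appear ordered by UUID, not by question number (question 10 is printed before question 9). If the original order matters, the contents can be re-sorted on the client side. A minimal sketch, assuming every entry starts with a number followed by a period; the helper names are made up for illustration.

```python
import re


def leading_number(content):
    """Return the integer before the first period, e.g. 10 for '10. ...'."""
    match = re.match(r"\s*(\d+)\.", content)
    # Entries without a leading "N." are pushed to the end.
    return int(match.group(1)) if match else 10**9


def sort_by_question_number(contents):
    """Sort a list of content strings by their leading question number."""
    return sorted(contents, key=leading_number)


# Usage with the collection from this notebook (not executed here):
# contents = [item.properties["content"] for item in collection.iterator()]
# for line in sort_by_question_number(contents):
#     print(line)
```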


## Retrieving a specific object is now done by `id`

For example: https://weaviate.io/developers/weaviate/manage-data/read
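
A small sketch of that lookup, assuming the v4 method `collection.query.fetch_object_by_id`, which returns `None` when no object has the given id. The wrapper name `get_properties_by_id` is hypothetical.

```python
def get_properties_by_id(collection, object_id):
    """Fetch one object by UUID and return its properties, or None if absent."""
    obj = collection.query.fetch_object_by_id(object_id)
    return None if obj is None else obj.properties


# Usage with a UUID printed by the iterator (not executed here):
# collection = client.collections.get("MyExampleIndex")
# print(get_properties_by_id(collection, "04d53cd9-d85c-4278-a82f-5d30bd1e0105"))
```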

## Search syntax has changed as follows
Details: https://weaviate.io/developers/weaviate/search/basics

In [None]:
collection = client.collections.get("MyExampleIndex")
response = collection.query.fetch_objects()

for o in response.objects:
    print(o.properties)

{'content': '1. 問睡眠,你的睡眠如何?是否一覺到天亮?是否每天定時會醒?如果會醒,是幾點會醒?是否多夢?等等.'}
{'content': '2. 問胃口,你感覺餓嗎?有欲望想吃什麼特別的食物或是喜愛什麼味道的食物?或是不餓,完全沒有胃口.'}
{'content': '10. 問女子月經,無論妳有無月經,都要詳細說明妳的月經情形,是延後還是每次都提前呢?痛不痛呢?生過小孩嗎?'}
{'content': '9. 問性功能,你性功能好嗎?等等.'}
{'content': '8. 問體力如何,精神好嗎?還是一直疲憊中?早上起床時,是精神奕奕呢?還是無法起床呢?精神能夠集中嗎?'}
{'content': '5. 問口渴,你很渴嗎?如渴,最想喝什麼溫度的水?如不渴,時常會忘記喝水嗎?還是再怎麼喝也不能止渴呢?'}
{'content': '7. 問汗,你容易出汗嗎?會半夜盜汗嗎?會時常流汗不止嗎?還是不出汗的身體呢?'}
{'content': '6. 問寒熱,你平時覺得身體很熱還是很冷?手腳冰冷嗎?'}
{'content': '3. 問大便,你便秘嗎?每天有大便嗎?大便顏色是什麼?是下利嗎?很臭還是無味?等等.'}
{'content': '4. 問小便,你的小便是什麼顏色?頻尿嗎?還是小不出來?還是沒有尿意?平均一天幾次?等等.'}


## Text similarity search syntax has changed as follows
Details: https://weaviate.io/developers/weaviate/search/similarity#search-with-text

In [None]:
from weaviate.classes.query import MetadataQuery

reviews = client.collections.get("MyExampleIndex")
response = reviews.query.near_text(
    query="睡眠",
    limit=4,
    target_vector="content",  # Target vector name; only relevant when the collection defines named vectors
    return_metadata=MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.distance)

{'content': '1. 問睡眠,你的睡眠如何?是否一覺到天亮?是否每天定時會醒?如果會醒,是幾點會醒?是否多夢?等等.'}
0.17417001724243164
{'content': '8. 問體力如何,精神好嗎?還是一直疲憊中?早上起床時,是精神奕奕呢?還是無法起床呢?精神能夠集中嗎?'}
0.1948421597480774
{'content': '7. 問汗,你容易出汗嗎?會半夜盜汗嗎?會時常流汗不止嗎?還是不出汗的身體呢?'}
0.2053612470626831
{'content': '6. 問寒熱,你平時覺得身體很熱還是很冷?手腳冰冷嗎?'}
0.21589070558547974
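
The `distance` values are easier to read with the metric in mind. Assuming the collection uses Weaviate's default cosine metric (the collection in this notebook does not set one explicitly), the distance is one minus the cosine similarity, so a smaller value means a closer match. A minimal pure-Python sketch of that formula, independent of the Weaviate client:

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Same direction gives 0, orthogonal vectors give 1, opposite vectors give 2.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))
```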
