# Weaviate Python Library

The Weaviate Python client is a Python package for connecting to and interacting with a Weaviate instance. The client is not itself a Weaviate instance, but you can use it to create one on the Weaviate Cloud Service (WCS). It provides APIs for importing data, creating schemas, running classification, querying data, and more. This notebook walks through most of them and explains how and when to use each.

# How to use the python-client with a Weaviate cluster

# 1. Create a Weaviate instance/cluster

A Weaviate instance can be created in several ways. One option is to run it from a docker-compose.yaml file.

Another option is to create an account on the Weaviate Cloud Service console (WCS console) and create a cluster there. Several cluster configurations are available to choose from.

Install the Weaviate Python client.

In [1]:
import sys
!{sys.executable} -m pip install weaviate-client==3.0.0

Defaulting to user installation because normal site-packages is not writeable


Import the package and create a cluster on WCS.

In [2]:
from getpass import getpass # hide password
import weaviate # to communicate to the Weaviate instance
from weaviate.wcs import WCS

To authenticate to WCS, or to a Weaviate instance that has authentication enabled, we need to create an authentication object. At the moment the client supports two types of credentials:

* Password credentials: weaviate.auth.AuthClientPassword(username='WCS_ACCOUNT_EMAIL', password='WCS_ACCOUNT_PASSWORD')
* Token credentials: weaviate.auth.AuthClientCredentials(client_secret=YOUR_SECRET_TOKEN)
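Credentials are best kept out of the notebook itself, for example by reading them from the environment. The following is a minimal sketch of that idea; the variable names WCS_USERNAME and WCS_PASSWORD and the helper read_wcs_credentials are assumptions made for this illustration, not names that Weaviate defines.

```python
import os
from typing import Mapping, Tuple


def read_wcs_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (username, password) read from environment-style variables.

    WCS_USERNAME and WCS_PASSWORD are hypothetical names chosen for this sketch.
    """
    missing = [name for name in ("WCS_USERNAME", "WCS_PASSWORD") if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return env["WCS_USERNAME"], env["WCS_PASSWORD"]


# Demonstration with an explicit mapping instead of the real environment.
username, password = read_wcs_credentials(
    {"WCS_USERNAME": "user@example.com", "WCS_PASSWORD": "not-a-real-password"}
)
print(username)
```

The returned pair can then be passed as the username and password arguments of weaviate.auth.AuthClientPassword.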

In [3]:
my_credentials = weaviate.auth.AuthClientPassword(username=input("User name: "), password=getpass('Password: '))

User name: WCS_ACCOUNT_EMAIL
Password: ········


In [4]:
my_wcs = WCS(my_credentials)

Now that we are connected to WCS, we can create and delete clusters, list them with get_clusters, inspect one with get_cluster_config, and check whether a cluster is up with the is_ready method.

To check the signature and docstring of any method in a notebook, run object.method?. You can also use the help() function.
Ex: WCS.is_ready?, my_wcs.is_ready?, or help(WCS.is_ready).
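Outside a notebook the `?` suffix is not available, but the standard library exposes the same information. A small sketch using the inspect module on json.dumps (chosen only so the block runs without a Weaviate connection; any callable, such as WCS.is_ready, can be inspected the same way):

```python
import inspect
import json

# The signature lists the parameters and their default values.
signature = inspect.signature(json.dumps)
print(signature)

# inspect.getdoc returns the cleaned-up docstring that help() also displays.
doc = inspect.getdoc(json.dumps)
print(doc.splitlines()[0])
```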

In [5]:
cluster_name = 'my-first-weaviate-instance'
weaviate_url = my_wcs.create(cluster_name=cluster_name)
weaviate_url

100%|██████████| 100/100 [01:08<00:00,  1.46it/s]


'https://my-first-weaviate-instance.semi.network'

In [6]:
my_wcs.is_ready(cluster_name)

True

# 2. Connect to the cluster


In [8]:
client = weaviate.Client(weaviate_url)

In [9]:
client.is_ready()

True

# 3. Get Data and Analyse it

We are going to use news articles as the data to store in Weaviate. For this we need the newspaper3k package.

In [10]:
!{sys.executable} -m pip install newspaper3k

Defaulting to user installation because normal site-packages is not writeable
Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl (211 kB)
Collecting jieba3k>=0.35.1
  Downloading jieba3k-0.35.1.zip (7.4 MB)
  Preparing metadata (setup.py) ... done
Collecting feedparser>=5.2.1
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
Collecting feedfinder2>=0.0.4
  Downloading feedfinder2-0.0.4.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... done
Collecting tldextract>=2.0.1
  Downloading tldextract-3.3.0-py3-none-any.whl (93 kB)
Collecting tinysegmenter==0.3
  Downloading tinysegmenter-0.3.tar.gz (16 kB)
  Preparing metadata (setup.py) ... done
Collecting sgmlli

In [11]:
import nltk # it is a dependency of newspaper3k
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/renuka/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [16]:
>>> import newspaper
>>> import uuid
>>> import json
>>> from tqdm import tqdm

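>>> def get_articles_from_newspaper(
...         news_url: str, 
...         max_articles: int=100
...     ) -> list:
...     """
...     Download newspaper articles and return them as a list of dicts.
...     Parameters
...     ----------
...     news_url : str
...         URL of the newspaper's front page.
...     max_articles : int
...         Maximum number of articles to collect.
...     Returns
...     -------
...     list
...         One dict per article with the keys 'id', 'title', 'summary' and 'authors'.
...     """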
...     
...     objects = []
...     
...     # Build the actual newspaper    
...     news_builder = newspaper.build(news_url, memoize_articles=False)
...     
...     if max_articles > news_builder.size():
...         max_articles = news_builder.size()
...     pbar = tqdm(total=max_articles)
...     pbar.set_description(f"{news_url}")
...     i = 0
...     while len(objects) < max_articles and i < news_builder.size():
...         article = news_builder.articles[i]
...         try:
...             article.download()
...             article.parse()
...             article.nlp()
... 
...             # keep only articles with a title, a summary and at least one author
...             if article.title and article.summary and article.authors:
... 
...                 # create a UUID for the article using its URL
...                 article_id = uuid.uuid3(uuid.NAMESPACE_DNS, article.url)
... 
...                 # create the object
...                 objects.append({
...                     'id': str(article_id),
...                     'title': article.title,
...                     'summary': article.summary,
...                     'authors': article.authors
...                 })
...                 
...                 pbar.update(1)
... 
...         except Exception:
...             # something went wrong with getting the article, skip it
...             pass
...         i += 1
...     pbar.close()
...     return objects
>>> data = []
>>> data += get_articles_from_newspaper('https://www.theguardian.com/international')

https://www.theguardian.com/international: 100%|██████████| 100/100 [01:43<00:00,  1.03s/it]
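The article IDs above come from uuid.uuid3, which hashes a namespace together with a name, so the same URL always maps to the same UUID and re-running the download yields the same IDs. A short sketch of that property (the example URLs are made up):

```python
import uuid

url = "https://example.com/news/some-article"

first = uuid.uuid3(uuid.NAMESPACE_DNS, url)
second = uuid.uuid3(uuid.NAMESPACE_DNS, url)
other = uuid.uuid3(uuid.NAMESPACE_DNS, "https://example.com/news/another-article")

print(first == second)  # compares two UUIDs built from the same URL
print(first == other)   # compares UUIDs built from different URLs
print(first.version)    # the UUID version number
```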


# Create appropriate data types.

In [17]:
>>> article_class_schema = {
...     # name of the class
...     "class": "Article",
...     # a description of what this class represents
...     "description": "An Article class to store the article summary and its authors",
...     # class properties
...     "properties": [
...         {
...             "name": "title",
...             "dataType": ["string"],
...             "description": "The title of the article", 
...         },
...         {
...             "name": "summary",
...             "dataType": ["text"],
...             "description": "The summary of the article",
...         },
...         {
...             "name": "hasAuthors",
...             "dataType": ["Author"],
...             "description": "The authors this article has",
...         }
...     ]
... }

In [18]:
>>> author_class_schema = {
...     "class": "Author",
...     "description": "An Author class to store the author information",
...     "properties": [
...         {
...             "name": "name",
...             "dataType": ["string"],
...             "description": "The name of the author", 
...         },
...         {
...             "name": "wroteArticles",
...             "dataType": ["Article"],
...             "description": "The articles of the author", 
...         }
...     ]
... }
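The two classes reference each other: Article.hasAuthors points to Author, and Author.wroteArticles points to Article. Creating a class whose property references a class that is not yet in the schema is expected to fail, so one common approach is to create both classes with only their primitive properties first and add the cross-reference properties afterwards. The sketch below does the splitting in plain Python. It relies on the Weaviate convention that class names start with an uppercase letter while primitive data types are lowercase; the helper name split_reference_properties is hypothetical, and a trimmed stand-in dictionary is used so the block runs on its own.

```python
import copy
from typing import List, Tuple


def split_reference_properties(class_schema: dict) -> Tuple[dict, List[dict]]:
    """Split a class schema into (schema without references, reference properties)."""
    primitive_only = copy.deepcopy(class_schema)
    references = []
    primitives = []
    for prop in primitive_only.get("properties", []):
        # A data type that starts with an uppercase letter names another class.
        if any(data_type[:1].isupper() for data_type in prop["dataType"]):
            references.append(prop)
        else:
            primitives.append(prop)
    primitive_only["properties"] = primitives
    return primitive_only, references


# Stand-in for the Article class, trimmed to the keys that matter here.
example_schema = {
    "class": "Article",
    "properties": [
        {"name": "title", "dataType": ["string"]},
        {"name": "summary", "dataType": ["text"]},
        {"name": "hasAuthors", "dataType": ["Author"]},
    ],
}

article_without_refs, article_refs = split_reference_properties(example_schema)
print([p["name"] for p in article_without_refs["properties"]])
print([p["name"] for p in article_refs])
```

With a connected client, the reference-free part could be passed to client.schema.create_class(...) for each class, and each reference property added afterwards with client.schema.property.create(class_name, property_dict). Check both calls against the documentation of the installed client version.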

In [23]:
>>> # helper function
>>> def prettify(json_dict): 
...     print(json.dumps(json_dict, indent=2))
>>> prettify(client.schema.get())
{
  "classes": [
    {
      "class": "Article",
      "description": "An Article class to store the article summary and its authors",
      "invertedIndexConfig": {
        "cleanupIntervalSeconds": 60
      },
      "properties": [
        {
          "dataType": [
            "string"
          ],
          "description": "The title of the article",
          "name": "title"
        },
        {
          "dataType": [
            "text"
          ],
          "description": "The summary of the article",
          "name": "summary"
        }
      ],
      "vectorIndexConfig": {
        "cleanupIntervalSeconds": 300,
        "maxConnections": 64,
        "efConstruction": 128,
        "vectorCacheMaxObjects": 500000
      },
      "vectorIndexType": "hnsw",
      "vectorizer": "text2vec-contextionary"
    }
  ]
}



In [27]:
>>> prettify(client.schema.get())
{
  "classes": [
    {
      "class": "Article",
      "description": "An Article class to store the article summary and its authors",
      "invertedIndexConfig": {
        "cleanupIntervalSeconds": 60
      },
      "properties": [
        {
          "dataType": [
            "string"
          ],
          "description": "The title of the article",
          "name": "title"
        },
        {
          "dataType": [
            "text"
          ],
          "description": "The summary of the article",
          "name": "summary"
        },
        {
          "dataType": [
            "Author"
          ],
          "description": "The authors this article has",
          "name": "hasAuthors"
        }
      ],
      "vectorIndexConfig": {
        "cleanupIntervalSeconds": 300,
        "maxConnections": 64,
        "efConstruction": 128,
        "vectorCacheMaxObjects": 500000
      },
      "vectorIndexType": "hnsw",
      "vectorizer": "text2vec-contextionary"
    },
    {
      "class": "Author",
      "description": "An Author class to store the author information",
      "invertedIndexConfig": {
        "cleanupIntervalSeconds": 60
      },
      "properties": [
        {
          "dataType": [
            "string"
          ],
          "description": "The name of the author",
          "name": "name"
        },
        {
          "dataType": [
            "Article"
          ],
          "description": "The articles of the author",
          "name": "wroteArticles"
        }
      ],
      "vectorIndexConfig": {
        "cleanupIntervalSeconds": 300,
        "maxConnections": 64,
        "efConstruction": 128,
        "vectorCacheMaxObjects": 500000
      },
      "vectorIndexType": "hnsw",
      "vectorizer": "text2vec-contextionary"
    }
  ]
}

{
  "classes": [
    {
      "class": "Author",
      "description": "An Author class to store the author information",
      "invertedIndexConfig": {
        "bm25": {
          "b": 0.75,
          "k1": 1.2
        },
        "cleanupIntervalSeconds": 60,
        "stopwords": {
          "additions": null,
          "preset": "en",
          "removals": null
        }
      },
      "properties": [
        {
          "dataType": [
            "string"
          ],
          "description": "The name of the author",
          "name": "name",
          "tokenization": "word"
        }
      ],
      "shardingConfig": {
        "virtualPerPhysical": 128,
        "desiredCount": 1,
        "actualCount": 1,
        "desiredVirtualCount": 128,
        "actualVirtualCount": 128,
        "key": "_id",
        "strategy": "hash",
        "function": "murmur3"
      },
      "vectorIndexConfig": {
        "skip": false,
        "cleanupIntervalSeconds": 300,
        "maxConnections": 64,
   


In [28]:
>>> schema = client.schema.get() # save schema
>>> client.schema.delete_all() # delete all classes
>>> prettify(client.schema.get())
{
  "classes": []
}


# Load data.

In [31]:
prettify(data[0])

{
  "id": "9bd544e2-132e-3fad-a31d-4ad9a05ce9d6",
  "title": "Russia-Ukraine war: Russian bombardment of Sievierodonetsk \u2018pushes Ukrainian troops back to city\u2019s outskirts\u2019 \u2013 live",
  "summary": "It was \u201cimpossible\u201d to say that Sievierodonetsk had been completely seized by Russian troops, Haidai said, adding:Our (forces) now again control only the outskirts of the city.\nBut the fighting is still going on, our (forces) are defending Sievierodonetsk, it is impossible to say the Russians completely control the city.\n1h ago 16.34 Ukraine now only controls outskirts of Sievierodonetsk, says governor Ukrainian forces have been pushed back by a Russian bombardment in the frontline eastern city of Sievierodonetsk and now only control its outskirts, according to the governor of Luhansk, Serhiy Haidai.\nIt was \u201cimpossible\u201d to say that Sievierodonetsk had been completely seized by Russian troops, Haidai said, adding:Our (forces) now again control only the 

In [32]:
>>> article_object = {
...     'title': data[0]['title'],
...     'summary': data[0]['summary'].replace('\n', '') # remove newline character
...     # we leave out the `hasAuthors` because it is a reference and will be created after we create the Authors
... }
>>> article_id = data[0]['id']

>>> # validate the object
>>> result = client.data_object.validate(
...     data_object=article_object,
...     class_name='Article',
...     uuid=article_id
... )

>>> prettify(result)


{
  "error": [
    {
      "message": "invalid object: class 'Article' not present in schema"
    }
  ],
  "valid": false
}
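The validation result is a plain dictionary with a boolean valid flag and, on failure, a list of error messages. The message shown here is consistent with the schema having been emptied by the earlier client.schema.delete_all() call. A small helper can turn the result into something easier to act on; the name validation_errors is hypothetical and the dictionary layout is taken from the output above.

```python
from typing import List


def validation_errors(result: dict) -> List[str]:
    """Return the error messages of a validation result, or [] if it is valid."""
    if result.get("valid"):
        return []
    # 'error' may be missing or None; treat both as "no details available".
    return [entry.get("message", "") for entry in (result.get("error") or [])]


# Result copied from the validation output above.
failed = {
    "error": [{"message": "invalid object: class 'Article' not present in schema"}],
    "valid": False,
}

for message in validation_errors(failed):
    print(message)
```

If the class is indeed missing, the schema saved earlier in the schema variable could be restored with client.schema.create(schema) before validating again.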


In [34]:
>>> # create the object
>>> client.data_object.create(
...     data_object=article_object,
...     class_name='Article',
...     uuid=article_id) # if not specified, weaviate is going to create a UUID for you.

'9bd544e2-132e-3fad-a31d-4ad9a05ce9d6'

In [35]:
prettify(client.data_object.get(article_id, with_vector=False))

{
  "class": "Article",
  "creationTimeUnix": 1654708475775,
  "id": "9bd544e2-132e-3fad-a31d-4ad9a05ce9d6",
  "lastUpdateTimeUnix": 1654708475775,
  "properties": {
    "summary": "It was \u201cimpossible\u201d to say that Sievierodonetsk had been completely seized by Russian troops, Haidai said, adding:Our (forces) now again control only the outskirts of the city.But the fighting is still going on, our (forces) are defending Sievierodonetsk, it is impossible to say the Russians completely control the city.1h ago 16.34 Ukraine now only controls outskirts of Sievierodonetsk, says governor Ukrainian forces have been pushed back by a Russian bombardment in the frontline eastern city of Sievierodonetsk and now only control its outskirts, according to the governor of Luhansk, Serhiy Haidai.It was \u201cimpossible\u201d to say that Sievierodonetsk had been completely seized by Russian troops, Haidai said, adding:Our (forces) now again control only the outskirts of the city.01:29 31,000 Russ

# New Batch object

In [69]:
>>> from weaviate.batch import Batch # for the typing purposes
>>> from weaviate.util import generate_uuid5 # old way was from weaviate.tools import generate_uuid
>>> def add_article(batch: Batch, article_data: dict) -> str:
...    
...     article_object = {
...         'title': article_data['title'],
...         'summary': article_data['summary'].replace('\n', '') # remove newline character
...     }
...     article_id = article_data['id']
...    
...    # add article to the object batch request
...     batch.add_data_object(  # old way was batch.add(...)
...        data_object=article_object,
...        class_name='Article',
...        uuid=article_id
...    )
...    
...     return article_id
>>> def add_author(batch: Batch, author_name: str, created_authors: dict) -> str:
...    
...    if author_name in created_authors:
...        # return author UUID
...        return created_authors[author_name]
...    
...    # generate a UUID for the Author from the author's name
...    author_id = generate_uuid5(author_name)
...    
...    # add author to the object batch request
...    batch.add_data_object(  # old way was batch.add(...)
...        data_object={'name': author_name},
...        class_name='Author',
...        uuid=author_id
...    )
...    
...    created_authors[author_name] = author_id
...    return author_id
>>> def add_references(batch: Batch, article_id: str, author_id: str)-> None:
...    # add references to the reference batch request
...    ## Author -> Article
...    batch.add_reference(  # old way was batch.add(...)
...        from_object_uuid=author_id,
...        from_object_class_name='Author',
...        from_property_name='wroteArticles',
...        to_object_uuid=article_id
...    )
...    
...    ## Article -> Author 
...    batch.add_reference(  # old way was batch.add(...)
...        from_object_uuid=article_id,
...        from_object_class_name='Article',
...        from_property_name='hasAuthors',
...        to_object_uuid=author_id
...    )

# a) Manually

In [70]:
>>> from tqdm import trange
>>> created_authors = {} # maps author name -> UUID, so each author is added only once
>>> for i in trange(1, 100):
...     
...    # add article to the batch
...    article_id = add_article(client.batch, data[i])
...    
...    for author in data[i]['authors']:
...        
...        # add author to the batch
...        author_id = add_author(client.batch, author, created_authors)
...        
...        # add cross references to the batch
...        add_references(client.batch, article_id=article_id, author_id=author_id)
...    
...    if i % 20 == 0:
...        # submit the objects from the batch to weaviate
...        client.batch.create_objects()
...        
...        # submit the references from the batch to weaviate
...        client.batch.create_references()
>>> # submit any objects that are left
>>> status_objects = client.batch.create_objects()
>>> status_references = client.batch.create_references()
>>> # if there is no need for the output from batch creation, one could flush both
>>> # object and references with one call
>>> client.batch.flush()

100%|██████████| 99/99 [00:06<00:00, 16.18it/s]
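The loop above submits the batch every 20 articles and then once more for whatever is left. The same pattern, stripped of the Weaviate calls, looks like this; the helper chunked is hypothetical and only illustrates why the final submit is needed.

```python
from typing import Iterable, Iterator, List


def chunked(items: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of at most `size` items."""
    chunk: List[int] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        # The leftover items correspond to the final create_objects() call.
        yield chunk


sizes = [len(chunk) for chunk in chunked(range(1, 100), 20)]
print(sizes)
```

For the 99 articles indexed 1 to 99 this gives four full chunks of 20 and a final chunk of 19, which is the part that would be lost without the last submit.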


In [71]:
>>> from tqdm import trange
>>> with client.batch as batch:
...     for i in trange(1, 100):
...        
...        # add article to the batch
...        article_id = add_article(batch, data[i])
...        
...        for author in data[i]['authors']:
...            
...            # add author to the batch
...            author_id = add_author(batch, author, created_authors)
...            
...            # add cross references to the batch
...            add_references(batch, article_id=article_id, author_id=author_id)
...        
...        if i % 20 == 0:
...            # submit the objects from the batch to weaviate
...            batch.create_objects()
...            
...            # submit the reference from the batch to weaviate
...            batch.create_references()

100%|██████████| 99/99 [00:05<00:00, 18.85it/s]


# b) Auto-create batches when full

In [74]:
>>> # we still need the 'created_authors' so we do not add the same author twice
>>> client.batch.configure(
...     batch_size=30,
...     callback=None, # use this argument to set a callback function on the batch creation results
... )

>>> for i in trange(10, 20):
...    
...    # add article to the batch
...    article_id = add_article(client.batch, data[i])
...    for author in data[i]['authors']:
...        # add author to the batch
...        author_id = add_author(client.batch, author, created_authors)
...        # add cross references to the batch
...        add_references(client.batch, article_id=article_id, author_id=author_id)
>>> client.batch.flush()

100%|██████████| 10/10 [00:00<00:00, 10.22it/s]


In [75]:
>>> # we still need the 'created_authors' so we do not add the same author twice
>>> client.batch.configure(
...     batch_size=30,
...    callback=None, # use this argument to set a callback function on the batch creation results
... )
>>> with client.batch(batch_size=30) as batch: # the client.batch(batch_size=30) is the same as client.batch.configure(batch_size=30)
...     for i in trange(10, 20):
...        # add article to the batch
...        article_id = add_article(batch, data[i])
...        for author in data[i]['authors']:
...            # add author to the batch
...            author_id = add_author(batch, author, created_authors)
...            # add cross references to the batch
...            add_references(batch, article_id=article_id, author_id=author_id)

100%|██████████| 10/10 [00:00<00:00, 10.45it/s]
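The callback argument receives the results of each automatically submitted batch, which makes it a natural place to surface per-object errors. The sketch below assumes a response layout in which each entry is a dictionary and failures are nested under result, errors, error; verify this against the responses of your Weaviate version. The function name collect_batch_errors and the sample data are hypothetical.

```python
from typing import List, Optional


def collect_batch_errors(results: Optional[list]) -> List[str]:
    """Return the error messages found in a batch creation response."""
    messages: List[str] = []
    for item in results or []:
        # Assumed layout: {'result': {'errors': {'error': [{'message': ...}]}}}
        errors = (item.get("result") or {}).get("errors") or {}
        for error in errors.get("error") or []:
            messages.append(error.get("message", ""))
    return messages


# Made-up response with one success and one failure, for illustration only.
sample_results = [
    {"id": "00000000-0000-0000-0000-000000000001", "result": {}},
    {
        "id": "00000000-0000-0000-0000-000000000002",
        "result": {"errors": {"error": [{"message": "example failure"}]}},
    },
]

print(collect_batch_errors(sample_results))
```

A thin wrapper that prints or logs these messages could then be passed as callback=... to client.batch.configure.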


# 5.5. Query data.

In [76]:
>>> result = client.query.get(class_name='Article', properties="title")\
...     .do()
>>> print(f"Number of articles returned: {len(result['data']['Get']['Article'])}")
>>> result

Number of articles returned: 100


{'data': {'Get': {'Article': [{'title': 'Exhausted Russian fighters complain of conditions in eastern Ukraine'},
    {'title': 'England’s new strangulation law – and why it’s needed – podcast'},
    {'title': 'Moscow’s chief rabbi ‘in exile’ after resisting Kremlin pressure over war'},
    {'title': '‘The worst law on earth’: why the rich love London’s reputation managers'},
    {'title': 'I’m nearly 60. Here’s what I’ve learned about growing old so far'},
    {'title': 'US tourist throws scooter down Rome’s Spanish Steps, causing €25,000 damage'},
    {'title': 'Microplastics found in freshly fallen Antarctic snow for first time'},
    {'title': 'Inside the strange world of NFTs'},
    {'title': 'The Congolese student fighting with pro-Russia separatists in Ukraine'},
    {'title': 'Israeli police \u200battack funeral procession of killed journalist Shireen Abu Aqleh – video'},
    {'title': 'The Guardian :: What can I expect after attending an interview?'},
    {'title': 'Depp v Hear

In [77]:
>>> result = client.query.get(class_name='Article', properties="title")\
...     .with_limit(200)\
...     .do()
>>> print(f"Number of articles returned: {len(result['data']['Get']['Article'])}")

Number of articles returned: 100


In [78]:
>>> client.query.get(class_name='Article', properties="title")\
...     .with_limit(5)\
...     .with_near_text({'concepts': ['Fashion']})\
...     .do()

{'errors': [{'locations': [{'column': 23, 'line': 1}],
   'message': 'Unknown argument "nearText" on field "Article" of type "GetObjectsObj". Did you mean "nearObject" or "nearVector"?',
   'path': None}]}
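A failed query does not raise an exception here; it returns a dictionary with an errors key instead of data. This particular message suggests that the instance has no text vectorizer module enabled that provides the nearText argument, so only nearObject and nearVector are available. A small guard makes such failures explicit; the helper name raise_on_graphql_errors is hypothetical.

```python
def raise_on_graphql_errors(result: dict) -> dict:
    """Return result['data'], or raise if the response carries GraphQL errors."""
    errors = result.get("errors")
    if errors:
        messages = "; ".join(error.get("message", "") for error in errors)
        raise RuntimeError(f"query failed: {messages}")
    return result.get("data", {})


# Shortened copy of the error response shown above.
failed_response = {
    "errors": [
        {
            "locations": [{"column": 23, "line": 1}],
            "message": 'Unknown argument "nearText" on field "Article"',
            "path": None,
        }
    ]
}

try:
    raise_on_graphql_errors(failed_response)
except RuntimeError as exc:
    print(exc)
```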

# AGGREGATE

In [80]:
>>> # no filter, count all objects of class Article
>>> client.query.aggregate(class_name='Article')\
...     .with_meta_count()\
...     .do()

{'data': {'Aggregate': {'Article': [{'meta': {'count': 100}}]}}}

In [81]:
>>> # no filter, count all objects of class Author
>>> client.query.aggregate(class_name='Author')\
...     .with_meta_count()\
...     .do()

{'data': {'Aggregate': {'Author': [{'meta': {'count': 145}}]}}}
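The count sits several levels deep in the response. A tiny accessor keeps that path in one place; the name meta_count is hypothetical and the layout is taken from the two responses above.

```python
def meta_count(result: dict, class_name: str) -> int:
    """Extract the object count of `class_name` from an Aggregate response."""
    return result["data"]["Aggregate"][class_name][0]["meta"]["count"]


# Responses copied from the two aggregate queries above.
article_result = {"data": {"Aggregate": {"Article": [{"meta": {"count": 100}}]}}}
author_result = {"data": {"Aggregate": {"Author": [{"meta": {"count": 145}}]}}}

print(meta_count(article_result, "Article"))
print(meta_count(author_result, "Author"))
```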