# Information Retrieval using BERT

This example contains the notebook that was used in this [Towards Data Science article](https://towardsdatascience.com/a-sub-50ms-neural-search-with-distilbert-and-weaviate-4857ae390154).

In this example we use Weaviate as a pure vector database, without any vectorization module attached. A BERT transformer vectorizes the text documents outside of Weaviate, and Weaviate's vector search then retrieves the closest ones.

Since this example was released, Weaviate has added vectorization modules such as the `text2vec-transformers` module. You can use this module to run Weaviate with a BERT transformer out of the box.

Uncomment the lines below to install the `torch`, `transformers` and `nltk` libraries.

In [16]:
#!python3 -m pip install torch

In [15]:
#!python3 -m pip install transformers --quiet
#!python3 -m pip install nltk

## Model Used

We will use the DistilBERT model.

##### `distilbert-base-uncased`

**DistilBERT**:
- **DistilBERT** is a smaller, faster, and lighter version of BERT (Bidirectional Encoder Representations from Transformers). It was created by Hugging Face using a process called **knowledge distillation**, where a smaller model (student) is trained to mimic the behavior of a larger, more complex model (teacher).
- **Purpose**: The goal of DistilBERT is to keep performance close to BERT's while significantly reducing the model size and inference time.

**Model Details**:
- **Base**: This indicates that the model is of base size. For DistilBERT that means 6 transformer layers, half of the 12 layers in BERT-base.
- **Uncased**: This means that the model does not differentiate between uppercase and lowercase letters. During the preprocessing step, all the text is converted to lowercase, and the vocabulary used by the tokenizer does not include any uppercase letters.
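
To make the "uncased" preprocessing concrete, here is a minimal sketch of the normalization step, assuming it amounts to lowercasing plus accent stripping. The helper name `normalize_uncased` is hypothetical, and the real tokenizer goes on to split the text into WordPiece tokens, which is not shown here.

```python
import unicodedata

def normalize_uncased(text: str) -> str:
    """Lowercase text and strip accents, roughly as an uncased BERT tokenizer does."""
    lowered = text.lower()
    # Decompose accented characters into a base character plus a combining mark ...
    decomposed = unicodedata.normalize("NFD", lowered)
    # ... then drop the combining marks (Unicode category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(normalize_uncased("Weaviate and BERT"))  # weaviate and bert
print(normalize_uncased("Café RÉSUMÉ"))        # cafe resume
```

As a consequence, queries such as "Windows vs Mac" and "windows vs mac" map to the same token sequence and therefore to the same vector.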

##### Key Features of DistilBERT

1. **Size and Speed**:
   - DistilBERT is about 40% smaller than BERT-base (roughly 66 million versus 110 million parameters), making it faster to run and less memory-intensive.
   - This efficiency is achieved without a significant drop in performance.

2. **Training Process**:
   - **Knowledge Distillation**: The smaller DistilBERT model learns to predict the output of the larger BERT model. It retains much of BERT's language understanding capabilities while being more compact and efficient.

3. **Performance**:
   - DistilBERT achieves about 97% of BERT's performance on various NLP benchmarks while being more resource-efficient.
   - It is well-suited for applications where computational resources are limited or when faster inference times are required.
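
The idea behind knowledge distillation can be sketched with a toy loss in plain Python. This is an illustration under simplifying assumptions, not DistilBERT's training code: the logits are made-up values, the function names are hypothetical, and the published training recipe combines several loss terms rather than this one alone.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))

teacher = [4.0, 1.0, 0.5]       # hypothetical teacher logits over 3 classes
good_student = [3.8, 1.1, 0.4]  # close to the teacher
bad_student = [0.5, 1.0, 4.0]   # far from the teacher

print(distillation_loss(good_student, teacher))
print(distillation_loss(bad_student, teacher))
```

The student that mimics the teacher's output distribution receives the smaller loss, which is the signal that pushes the smaller model toward the larger model's behavior.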

##### Usage in Code
In the provided code, `distilbert-base-uncased` is used as the model name to load both the pre-trained DistilBERT model and its corresponding tokenizer. This enables efficient text processing and inference, especially useful for deploying NLP models in production environments.

Here's how it fits into the code:
```python
MODEL_NAME = "distilbert-base-uncased"
model = AutoModel.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
```
- **`MODEL_NAME`**: Specifies the DistilBERT model.
- **`AutoModel.from_pretrained(MODEL_NAME)`**: Loads the pre-trained DistilBERT model.
- **`AutoTokenizer.from_pretrained(MODEL_NAME)`**: Loads the tokenizer tailored for the DistilBERT model.

By using DistilBERT, the code benefits from a model that is both efficient and powerful, making it suitable for practical applications where computational resources and speed are important considerations.

In [25]:
import torch
from transformers import AutoModel, AutoTokenizer
from nltk.tokenize import sent_tokenize
import weaviate
from weaviate.embedded import EmbeddedOptions

torch.set_grad_enabled(False)

# update to use a different model if desired
MODEL_NAME = "distilbert-base-uncased"
model = AutoModel.from_pretrained(MODEL_NAME)
#model.to('cuda') # remove if working without GPUs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# initialize nltk (for tokenizing sentences)
import nltk
nltk.download('punkt')

# initialize weaviate client for importing and searching
# v3 client equivalents, kept for reference:
#client = weaviate.Client("http://localhost:8080")
#client = weaviate.Client(embedded_options=EmbeddedOptions())

# v4 client: start an embedded Weaviate instance and connect to it
# (use weaviate.connect_to_local(port=8079) to reach an instance that is already running)
client = weaviate.connect_to_embedded()


[nltk_data] Downloading package punkt to /Users/rdua/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Started /Users/rdua/.cache/weaviate-embedded: process ID 17964


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-07-14T18:12:41+05:30"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-07-14T18:12:41+05:30"}
{"level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-07-14T18:12:41+05:30"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50050","time":"2024-07-14T18:12:41+05:30"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:8079","time":"2024-07-14T18:12:41+05:30"}
{"action":"lsm_recover_from_active_wal_success","class":"Post","index":"post","level":"info","msg":"successfully recovered from write-ahead-log","path":"

In [26]:
print(client.__dict__)

{'_WeaviateClient__skip_init_checks': False, '_connection': <weaviate.connect.v4.ConnectionV4 object at 0x1399195a0>, 'backup': <weaviate.backup.backup._Backup object at 0x139919840>, 'cluster': <weaviate.collections.cluster._Cluster object at 0x1399194b0>, 'collections': <weaviate.collections.collections._Collections object at 0x1399196f0>, 'batch': <weaviate.collections.batch.client._BatchClientWrapper object at 0x139919750>, 'integrations': <weaviate.integrations._Integrations object at 0x1399279d0>}


Let's break down and explain the provided code step by step:

### Imports
```python
import torch
from transformers import AutoModel, AutoTokenizer
from nltk.tokenize import sent_tokenize
import weaviate
```
- **`torch`**: A deep learning library for tensor computation and automatic differentiation.
- **`transformers`**: A library from Hugging Face providing pre-trained transformer models (like BERT, GPT, etc.).
  - **`AutoModel`**: A class that automatically selects the appropriate model architecture based on the provided model name.
  - **`AutoTokenizer`**: A class that automatically selects the appropriate tokenizer based on the provided model name.
- **`nltk`**: The Natural Language Toolkit, a library for working with human language data (text).
  - **`sent_tokenize`**: A function to split text into sentences.
- **`weaviate`**: A library for interacting with the Weaviate vector search engine.

### Disable Gradient Calculation
```python
torch.set_grad_enabled(False)
```
- **Purpose**: Disables gradient calculation. This is useful during inference when we don't need to compute gradients, thus saving memory and computation time.

### Model and Tokenizer Initialization
```python
# updated to use different model if desired
MODEL_NAME = "distilbert-base-uncased"
model = AutoModel.from_pretrained(MODEL_NAME)
#model.to('cuda') # remove if working without GPUs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
```
- **`MODEL_NAME`**: Specifies the name of the pre-trained model to be used. In this case, `distilbert-base-uncased`, a smaller, faster version of BERT that uses less memory.
- **`model = AutoModel.from_pretrained(MODEL_NAME)`**: Loads the pre-trained model specified by `MODEL_NAME`.
- **`model.to('cuda')`**: Moves the model to the GPU for faster computation. The line is commented out in the cell above, so the model runs on the CPU; uncomment it only when a GPU is available.
- **`tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)`**: Loads the tokenizer corresponding to the pre-trained model specified by `MODEL_NAME`.

### Initialize NLTK
```python
# initialize nltk (for tokenizing sentences)
import nltk
nltk.download('punkt')
```
- **Purpose**: Imports `nltk` and downloads the `punkt` tokenizer models, which are used for sentence tokenization.
- **`nltk.download('punkt')`**: Downloads the necessary data for sentence tokenization.

### Initialize Weaviate Client
```python
# initialize weaviate client for importing and searching
client = weaviate.connect_to_embedded()
```
- **Purpose**: Initializes a Weaviate client for interacting with the Weaviate vector search engine.
- **`weaviate.connect_to_embedded()`**: Starts an embedded Weaviate instance as a local process and connects to it using the v4 Python client. With the v3 client, the equivalent call for a server already running on port `8080` was `weaviate.Client("http://localhost:8080")`.

### Summary
This code initializes necessary libraries and models for NLP tasks and sets up a Weaviate client for vector search operations. The process includes:
1. Disabling gradient computation to optimize inference.
2. Loading a pre-trained model (`distilbert-base-uncased`) and its tokenizer.
3. Preparing NLTK for sentence tokenization.
4. Starting an embedded Weaviate instance and connecting to it.

In [27]:
import os
import random

def get_post_filenames(limit_objects=100):
    file_names = []
    i=0
    for root, dirs, files in os.walk("./data/20news-bydate-test"):
        for filename in files:
            path = os.path.join(root, filename)
            file_names += [path]
        
    random.shuffle(file_names)
    limit_objects = min(len(file_names), limit_objects)
      
    file_names = file_names[:limit_objects]

    return file_names

def read_posts(filenames=[]):
    posts = []
    for filename in filenames:
        f = open(filename, encoding="utf-8", errors='ignore')
        post = f.read()
        
        # strip the headers (the first occurrence of two newlines)
        post = post[post.find('\n\n'):]
        
        # remove posts with less than 10 words to remove some of the noise
        if len(post.split(' ')) < 10:
               continue
        
        post = post.replace('\n', ' ').replace('\t', ' ')
        if len(post) > 1000:
            post = post[:1000]
        posts += [post]

    return posts  

Let's break down and explain the provided code step by step:

### Imports
```python
import os
import random
```
- **`os`**: This module provides a way of using operating system-dependent functionality like reading or writing to the file system.
- **`random`**: This module implements pseudo-random number generators for various distributions. Here it is used to shuffle a list.

### `get_post_filenames` Function
```python
def get_post_filenames(limit_objects=100):
    file_names = []
    i = 0
    for root, dirs, files in os.walk("./data/20news-bydate-test"):
        for filename in files:
            path = os.path.join(root, filename)
            file_names += [path]
        
    random.shuffle(file_names)
    limit_objects = min(len(file_names), limit_objects)
      
    file_names = file_names[:limit_objects]

    return file_names
```
- **Purpose**: To retrieve and return a list of file paths from the directory `./data/20news-bydate-test`, limited by `limit_objects`.

1. **Initialization**:
    - `file_names = []`: Initializes an empty list to store file paths.
    - `i = 0`: Initializes a counter variable (though it's not used in the function).

2. **Directory Traversal**:
    - `os.walk("./data/20news-bydate-test")`: Recursively walks through the directory tree rooted at `./data/20news-bydate-test`.
    - `for root, dirs, files in os.walk("./data/20news-bydate-test")`: Iterates over each directory (`root`), its subdirectories (`dirs`), and files (`files`).

3. **Collecting File Paths**:
    - `for filename in files`: Iterates over each file in the current directory.
    - `path = os.path.join(root, filename)`: Constructs the full file path.
    - `file_names += [path]`: Adds the file path to the `file_names` list.

4. **Shuffling and Limiting**:
    - `random.shuffle(file_names)`: Randomly shuffles the `file_names` list.
    - `limit_objects = min(len(file_names), limit_objects)`: Sets `limit_objects` to the smaller value between the length of `file_names` and the provided limit.
    - `file_names = file_names[:limit_objects]`: Limits the list to `limit_objects` number of files.

5. **Return**:
    - `return file_names`: Returns the list of file paths.
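
The walk, shuffle, and limit steps can be tried without the dataset. The sketch below builds a small throwaway directory tree and samples from it; `sample_file_paths`, the `seed` parameter, and the folder names are assumptions made for illustration and are not part of the notebook.

```python
import os
import random
import tempfile

def sample_file_paths(root_dir, limit=100, seed=None):
    """Collect all file paths under root_dir, shuffle them, and keep at most `limit`."""
    paths = [
        os.path.join(root, name)
        for root, _dirs, files in os.walk(root_dir)
        for name in files
    ]
    random.Random(seed).shuffle(paths)  # a local RNG keeps the shuffle reproducible
    return paths[:limit]                # slicing already handles limit > len(paths)

# Build a small throwaway tree that mimics the newsgroup folder layout.
with tempfile.TemporaryDirectory() as tmp:
    for group in ("rec.motorcycles", "sci.space"):
        os.makedirs(os.path.join(tmp, group))
        for i in range(3):
            with open(os.path.join(tmp, group, f"{i}.txt"), "w") as f:
                f.write("dummy post")

    sample = sample_file_paths(tmp, limit=4, seed=0)
    all_paths = sample_file_paths(tmp, limit=100, seed=0)
    print(len(sample), len(all_paths))  # 4 6
```

Because slicing never reads past the end of a list, the explicit `min(len(file_names), limit_objects)` step in the notebook is a safeguard rather than a necessity.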

### `read_posts` Function
```python
def read_posts(filenames=[]):
    posts = []
    for filename in filenames:
        f = open(filename, encoding="utf-8", errors='ignore')
        post = f.read()
        
        # strip the headers (the first occurrence of two newlines)
        post = post[post.find('\n\n'):]
        
        # remove posts with less than 10 words to remove some of the noise
        if len(post.split(' ')) < 10:
               continue
        
        post = post.replace('\n', ' ').replace('\t', ' ')
        if len(post) > 1000:
            post = post[:1000]
        posts += [post]

    return posts
```
- **Purpose**: To read the content of files specified by `filenames`, process the content, and return a list of processed posts.

1. **Initialization**:
    - `posts = []`: Initializes an empty list to store the processed posts.

2. **Reading Files**:
    - `for filename in filenames`: Iterates over each file path in `filenames`.
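    - `f = open(filename, encoding="utf-8", errors='ignore')`: Opens the file with UTF-8 encoding, ignoring decoding errors. The file is never explicitly closed; a `with open(...) as f:` block would close it deterministically.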
    - `post = f.read()`: Reads the entire content of the file into the variable `post`.

3. **Processing Content**:
    - **Stripping Headers**:
        - `post = post[post.find('\n\n'):]`: Strips off the headers by finding the first blank line (two consecutive newlines) and keeping everything from there on.
    - **Filtering Short Posts**:
        - `if len(post.split(' ')) < 10: continue`: Skips posts with fewer than 10 space-separated tokens to remove some noise.
    - **Replacing Newlines and Tabs**:
        - `post = post.replace('\n', ' ').replace('\t', ' ')`: Replaces newlines and tabs with spaces.
    - **Truncating Long Posts**:
        - `if len(post) > 1000: post = post[:1000]`: Limits the length of the post to 1000 characters.

4. **Adding to List**:
    - `posts += [post]`: Adds the processed post to the `posts` list.

5. **Return**:
    - `return posts`: Returns the list of processed posts.
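
The cleaning steps can be exercised on an in-memory string, without touching the file system. The sketch below is a variation rather than the notebook's exact logic: `clean_post` is a hypothetical helper, it counts words with `split()` instead of `split(' ')`, and it keeps the whole text when no blank line is found. In the notebook's version, `find` returns `-1` in that case, which reduces the post to its last character, and the length filter then drops it.

```python
def clean_post(raw, min_words=10, max_chars=1000):
    """Return the cleaned body of a newsgroup post, or None if it is too short."""
    header_end = raw.find("\n\n")
    # str.find returns -1 when there is no blank line; keep the whole text in that case.
    body = raw[header_end:] if header_end != -1 else raw
    if len(body.split()) < min_words:  # split() without arguments ignores runs of whitespace
        return None
    body = body.replace("\n", " ").replace("\t", " ")
    return body[:max_chars]

raw_post = (
    "From: someone@example.com\n"
    "Subject: lens question\n"
    "\n"
    "Which lens would you recommend for\tlandscape photography on a small budget?\n"
)
print(repr(clean_post(raw_post)))
print(clean_post("Subject: hi\n\ntoo short"))  # None
```

The leading spaces in the cleaned text come from the two newlines of the blank line, which matches the leading whitespace visible in the search results of the notebook.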

### Summary
- `get_post_filenames(limit_objects=100)`: This function collects and returns a shuffled list of file paths from the specified directory, limited to a specified number.
- `read_posts(filenames=[])`: This function reads the content of the files specified by the provided list of filenames, processes the content by stripping headers, filtering short posts, replacing newlines and tabs, truncating long posts, and returns a list of these processed posts.

### Text2Vec

The function `text2vec` converts a given text into a fixed-size vector representation using the pre-trained DistilBERT model loaded earlier. Here is an explanation of each part of the function:

#### Function Definition
```python
def text2vec(text):
```
**`def text2vec(text):`**: This line defines a function named `text2vec` that takes a single argument `text`, which is either a string (as in `search`) or a list of strings (as in `vectorize_posts`, which passes the sentences of a post).

#### Tokenization
```python
tokens_pt = tokenizer(text, padding=True, truncation=True, max_length=500, add_special_tokens=True, return_tensors="pt")
```
- **`tokenizer(text, ...)`**: This line uses a tokenizer to preprocess the input `text`.
  - **`padding=True`**: Pads the sequences to the same length.
  - **`truncation=True`**: Truncates sequences to a maximum length.
  - **`max_length=500`**: Sets the maximum length of the sequences to 500 tokens.
  - **`add_special_tokens=True`**: Adds special tokens (like [CLS] and [SEP] for BERT) to the sequences.
  - **`return_tensors="pt"`**: Returns the output as PyTorch tensors.
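
To see what padding and truncation do to a batch, here is a toy sketch on plain lists of token ids. The ids are made up, `pad_and_truncate` is a hypothetical helper, and the cut is simplified: the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each token-id list to max_length, then pad all lists to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# Hypothetical token ids for three sentences of different lengths.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
# [101, 7, 8, 102, 0] [1, 1, 1, 1, 0]
# [101, 9, 102, 0, 0] [1, 1, 1, 0, 0]
# [101, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

After this step every row has the same length, which is what allows the batch to be stacked into a single tensor.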

#### Model Inference
```python
outputs = model(**tokens_pt)
```
- **`model(**tokens_pt)`**: Passes the tokenized input through the pre-trained model to get the model's outputs. The `**tokens_pt` syntax unpacks the dictionary of tensors returned by the tokenizer and passes them as keyword arguments to the model.

#### Commented GPU Code
```python
# tokens_pt.to('cuda') # remove if working without GPUs
```
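- **`# tokens_pt.to('cuda')`**: This line is commented out and would be used to move the tokenized input to a GPU if available. The comment indicates that this line should be removed if not working with GPUs. Because it sits after the model call, it would also have to move before `model(**tokens_pt)` to have any effect.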

#### Averaging the Output
```python
return outputs[0].mean(0).mean(0).detach()
```
- **`outputs[0]`**: Accesses the first element of the model's outputs, the hidden states of the last layer, with shape `(batch, sequence_length, hidden_size)`.
- **`mean(0).mean(0)`**: Computes the mean of the hidden states. The first `mean(0)` averages across the batch dimension (one row per sentence when `text` is a list of sentences), leaving a `(sequence_length, hidden_size)` tensor. The second `mean(0)` averages across the token positions, leaving a single vector of length `hidden_size` (768 for `distilbert-base-uncased`). Padding positions are included in the average.
- **`detach()`**: Detaches the tensor from the computation graph. Since gradients are already disabled globally with `torch.set_grad_enabled(False)`, this acts as a safeguard.

#### Summary
The function `text2vec` takes a string of text, tokenizes it, passes it through a pre-trained model, and returns a vector representation of the text by averaging the hidden states from the model's output. The final vector is detached from the computation graph.
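
The averaging can be followed on nested lists standing in for a tensor of shape `(batch, sequence_length, hidden_size)`. The numbers are made up, and both helper names are hypothetical. The second function shows a mask-aware mean that skips padding positions; it is a common alternative and is not what the notebook does.

```python
def pool_all(hidden_states):
    """Average over batch and sequence positions, like .mean(0).mean(0)."""
    batch, seq_len = len(hidden_states), len(hidden_states[0])
    hidden = len(hidden_states[0][0])
    return [
        sum(hidden_states[b][s][h] for b in range(batch) for s in range(seq_len))
        / (batch * seq_len)
        for h in range(hidden)
    ]

def pool_masked(hidden_states, attention_mask):
    """Average only over positions whose attention mask is 1 (padding is skipped)."""
    hidden = len(hidden_states[0][0])
    kept = [
        vec
        for rows, mask in zip(hidden_states, attention_mask)
        for vec, m in zip(rows, mask)
        if m == 1
    ]
    return [sum(vec[h] for vec in kept) / len(kept) for h in range(hidden)]

# 2 sentences, 3 token positions, hidden size 2; the last position of the
# second sentence stands in for padding.
hidden_states = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [0.0, 0.0]],
]
attention_mask = [[1, 1, 1], [1, 1, 0]]

print(pool_all(hidden_states))                    # averages over 6 positions
print(pool_masked(hidden_states, attention_mask)) # averages over 5 positions: [5.0, 6.0]
```

Averaging over the batch first and the sequence second gives the same result as a single average over all positions, because every sentence is padded to the same length.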

### Vectorize Posts

The provided code defines a function `vectorize_posts` that processes a list of posts, converts each post into a vector representation using the previously defined `text2vec` function, and keeps track of the time taken to process the posts.

The function definition starts with `def vectorize_posts(posts=[]):`, which defines a function named `vectorize_posts` that takes a single argument `posts`. This argument is expected to be a list of strings (posts), and the default value for `posts` is an empty list.

Next, two variables are initialized: `post_vectors=[]`, which initializes an empty list `post_vectors` to store the vector representations of the posts, and `before=time.time()`, which records the current time before starting the vectorization process. This time will be used to measure the duration of the process.

The function then enters a loop with `for i, post in enumerate(posts):`, which loops through each post in the `posts` list. The variable `i` is the index of the current post, and `post` is the current post itself. Inside the loop, the line `vec=text2vec(sent_tokenize(post))` converts the current post into a vector using the `text2vec` function. The `sent_tokenize(post)` call, imported from NLTK in the first cell, splits the post into sentences, so `text2vec` receives a list of sentences and averages over all of them. The resulting vector is then added to the `post_vectors` list with `post_vectors += [vec]`.

The function includes a progress update mechanism with the line `if i % 25 == 0 and i != 0:`. This line checks if the current index `i` is a multiple of 25 and not zero. If both conditions are true, it executes `print("So far {} objects vectorized in {}s".format(i, time.time()-before))`, which prints a progress update showing how many posts have been vectorized and the time elapsed since the start.

After the loop, `after=time.time()` records the current time once the vectorization process is complete. The total number of posts vectorized and the total time taken for the process are printed with `print("Vectorized {} items in {}s".format(len(posts), after-before))`. Finally, the function returns the list of vector representations of the posts with `return post_vectors`.

In summary, the `vectorize_posts` function processes a list of posts, converts each post into a vector representation, prints progress updates every 25 posts, measures and prints the total time taken, and returns a list of the vector representations.

In [28]:
import time

def text2vec(text):
    tokens_pt = tokenizer(text, padding=True, truncation=True, max_length=500, add_special_tokens = True, return_tensors="pt")
    outputs = model(**tokens_pt)
    #tokens_pt.to('cuda') # remove if working without GPUs
    return outputs[0].mean(0).mean(0).detach()

def vectorize_posts(posts=[]):
    post_vectors=[]
    before=time.time()
    for i, post in enumerate(posts):
        vec=text2vec(sent_tokenize(post))
        post_vectors += [vec]
        if i % 25 == 0 and i != 0:
            print("So far {} objects vectorized in {}s".format(i, time.time()-before))
    after=time.time()
    
    print("Vectorized {} items in {}s".format(len(posts), after-before))
    
    return post_vectors

For reference, the schema setup with the v3 Python client:
```python
def init_weaviate_schema(client):
    # a simple schema containing just a single class for our posts
    schema = {
        "classes": [{
                "class": "Post",
                "vectorizer": "none", # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves through our BERT model
                "properties": [{
                    "name": "content",
                    "dataType": ["text"],
                }]
        }]
    }

    # cleanup from previous runs
    client.schema.delete_all()

    client.schema.create(schema)
```

In [40]:
import weaviate.classes as wvc
from weaviate.classes.config import Property, DataType

def init_weaviate_schema(client):
    # cleanup from previous runs
    client.collections.delete_all()

    # a simple schema containing just a single collection for our posts
    client.collections.create(
        "Post",
        # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves through our BERT model
        vectorizer_config=wvc.config.Configure.Vectorizer.none(),
        properties=[
            # the property name must match the key used when importing the posts
            Property(name="post", data_type=DataType.TEXT),
        ]
    )

In [75]:
def import_posts_with_vectors(posts, vectors, client):
    if len(posts) != len(vectors):
        raise Exception("len of posts ({}) and vectors ({}) does not match".format(len(posts), len(vectors)))
    posts_client = client.collections.get("Post")
    for i, post in enumerate(posts):
        try:
            # v3 equivalent:
            #client.data_object.create(
            #    data_object={"content": post},
            #    class_name='Post',
            #    vector=vectors[i]
            #)
            posts_client.data.insert(
                properties={"post": post},
                # our own BERT vector, converted from a torch tensor to a plain list
                vector=vectors[i].tolist()
            )
        except Exception as e:
            # report the failure and continue with the next post
            print("failed to import post {}: {}".format(i, e))

In [76]:
print(dir(client.collections))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_connection', '_create', '_delete', '_exists', '_export', '_get_all', 'create', 'create_from_config', 'create_from_dict', 'delete', 'delete_all', 'exists', 'export_config', 'get', 'list_all']


## Searching the collection

### Function Definition

The function `search` is defined with two parameters: `query`, which is a string that defaults to an empty string, and `limit`, which is an integer that defaults to 3.

```python
def search(query="", limit=3):
```

### Vectorization Timing

The function records the current time before starting the vectorization process, calls the `text2vec` function to convert the query string into a vector, and calculates the time taken to vectorize the query, storing it in `vec_took`.

```python
before = time.time()
vec = text2vec(query)
vec_took = time.time() - before
```

### Preparing for Search

It then records the current time before starting the search process and prepares a dictionary `near_vec` containing the vectorized query. This dictionary is a leftover from the v3 query syntax; the v4 `near_vector` call does not use it.

```python
before = time.time()
near_vec = {"vector": vec}
```

### Performing the Search

The function retrieves the `Post` collection from the Weaviate client and performs a search using the `near_vector` method, which finds objects in the `Post` collection that are similar to the vectorized query. The call hard-codes `limit=2`, so the function's `limit` argument is ignored and two results are returned even when only one is requested; passing `limit=limit` would honor the argument. The distance is included in the returned metadata. The time taken to perform the search is calculated and stored in `search_took`.

```python
posts = client.collections.get("Post")
res = posts.query.near_vector(
    near_vector=vec.tolist(),
    limit=2,
    return_metadata=MetadataQuery(distance=True)
)
search_took = time.time() - before
```

### Printing the Results

A summary of the search operation is printed, including the query, the number of results, the total time taken, the time taken to vectorize, and the time taken to search. The function then iterates through the search results and prints the properties and distance metadata of each result.

```python
print("\nQuery \"{}\" with {} results took {:.3f}s ({:.3f}s to vectorize and {:.3f}s to search)"
      .format(query, limit, vec_took+search_took, vec_took, search_took))
for o in res.objects:
    print(o.properties)
    print(o.metadata.distance)
```

### Summary

The `search` function performs a vector-based search on a Weaviate collection called `Post`. It converts the query into a vector, searches for similar vectors in the collection, measures and prints the time taken for both steps, and then prints the properties and distance of the retrieved objects. As written, it always returns two results, because the `limit` argument is not passed on to the query.
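
To build intuition for the `distance` values that come back, here is a brute-force sketch of nearest-neighbour search in plain Python. It assumes the collection uses cosine distance, compares the query against every stored vector, and uses made-up three-dimensional vectors. Weaviate instead relies on an approximate HNSW index, so this illustrates the idea and not Weaviate's implementation.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_search(query_vec, stored, limit=3):
    """Return the `limit` closest (distance, text) pairs, nearest first."""
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in stored]
    return sorted(scored)[:limit]

# Hypothetical posts with hand-written vectors.
stored = [
    ("post about cameras", [0.9, 0.1, 0.0]),
    ("post about motorcycles", [0.1, 0.9, 0.1]),
    ("post about software", [0.0, 0.2, 0.9]),
]
results = brute_force_search([1.0, 0.0, 0.0], stored, limit=2)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

A smaller distance means a closer match, so results are listed in ascending order of distance, as in the output of the `search` cell.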

In [80]:
from weaviate.classes.query import MetadataQuery

def search(query="", limit=3):
    before = time.time()
    vec = text2vec(query)
    vec_took = time.time() - before

    before = time.time()
    near_vec = {"vector": vec}
    #res = client \
    #    .query.get("Post", ["content", "_additional {certainty}"]) \
    #    .with_near_vector(near_vec) \
    #    .with_limit(limit) \
    #    .do()
    questions = client.collections.get("Post")
    res = questions.query.near_vector(
         near_vector=vec.tolist(),
        limit=2,
        return_metadata=MetadataQuery(distance=True)
    )
    search_took = time.time() - before

    print("\nQuery \"{}\" with {} results took {:.3f}s ({:.3f}s to vectorize and {:.3f}s to search)" \
          .format(query, limit, vec_took+search_took, vec_took, search_took))
    #print(res)
    for o in res.objects:
        #print(o)
        print(o.properties)
        print(o.metadata.distance)
    #for post in res["data"]["Get"]["Post"]:
    #    print("{:.4f}: {}".format(post["_additional"]["certainty"], post["content"]))
    #    print('---')

In [81]:
print(client.__dict__)
init_weaviate_schema(client)
posts = read_posts(get_post_filenames(100))
vectors = vectorize_posts(posts)
import_posts_with_vectors(posts, vectors, client)

{'_WeaviateClient__skip_init_checks': False, '_connection': <weaviate.connect.v4.ConnectionV4 object at 0x1399195a0>, 'backup': <weaviate.backup.backup._Backup object at 0x139919840>, 'cluster': <weaviate.collections.cluster._Cluster object at 0x1399194b0>, 'collections': <weaviate.collections.collections._Collections object at 0x1399196f0>, 'batch': <weaviate.collections.batch.client._BatchClientWrapper object at 0x139919750>, 'integrations': <weaviate.integrations._Integrations object at 0x1399279d0>}


{"level":"info","msg":"Created shard post_D2lz8FsXLRbl in 1.606834ms","time":"2024-07-14T18:58:59+05:30"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-07-14T18:58:59+05:30","took":82958}
  f = open(filename, encoding="utf-8", errors='ignore')

So far 25 objects vectorized in 2.633683919906616s
So far 50 objects vectorized in 5.101656675338745s
So far 75 objects vectorized in 7.952788829803467s
Vectorized 100 items in 10.733489751815796s


In [82]:
search("the best camera lens", 1)
search("motorcycle trip", 1)
search("which software do i need to view jpeg files", 1)
search("windows vs mac", 1)


Query "the best camera lens" with 1 results took 0.029s (0.028s to vectorize and 0.001s to search)
{'post': '  Could someone give me some info on Soft PC.   How does it work? What kind of performance can I expect? Can you run windows under it adequately? Any info if appreciated.  '}
0.3389297127723694
{'post': "   It seems I'm in the fortunate position to desire what many people want to sell- a miniature color tv.  I require color and input for cable or vcr.  I would  prefer a 5inch diagonal and a tube television (not lcd).  Get paid the first, make an offer by email.  Marc   "}
0.37118011713027954

Query "motorcycle trip" with 1 results took 0.029s (0.028s to vectorize and 0.001s to search)
{'post': '  Could someone give me some info on Soft PC.   How does it work? What kind of performance can I expect? Can you run windows under it adequately? Any info if appreciated.  '}
0.33715254068374634
{'post': '   When riding in a group, generally speaking, do most people mind when another rid