This example contains the notebook that was used in this <a href="https://towardsdatascience.com/a-sub-50ms-neural-search-with-distilbert-and-weaviate-4857ae390154">Towards Data Science article.</a>

In this example we use Weaviate as a pure vector database, without a vectorization module. A BERT transformer vectorizes the text documents outside Weaviate, and Weaviate's search then retrieves the closest ones.

Note that Weaviate runs here without any vectorization module attached. After this example was released, we released new vectorization modules, such as the text2vec-transformers module. You can use that module to run Weaviate with a BERT transformer out of the box.

In [16]:
#!python3 -m pip install torch

In [15]:
#!python3 -m pip install transformers --quiet
#!python3 -m pip install nltk

In [36]:
import torch
from transformers import AutoModel, AutoTokenizer
from nltk.tokenize import sent_tokenize
import weaviate
from weaviate.embedded import EmbeddedOptions

torch.set_grad_enabled(False)

# update MODEL_NAME to use a different model if desired
MODEL_NAME = "distilbert-base-uncased"
model = AutoModel.from_pretrained(MODEL_NAME)
#model.to('cuda') # remove if working without GPUs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# initialize nltk (for tokenizing sentences)
import nltk
nltk.download('punkt')

# initialize weaviate client for importing and searching
#client = weaviate.Client("http://localhost:8080")
client = weaviate.Client(
    embedded_options=EmbeddedOptions()
)

[nltk_data] Downloading package punkt to /Users/rdua/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


embedded weaviate is already listening on port 8079


Let's break down and explain the provided code step by step:

### Imports
```python
import torch
from transformers import AutoModel, AutoTokenizer
from nltk.tokenize import sent_tokenize
import weaviate
from weaviate.embedded import EmbeddedOptions
```
- **`torch`**: A deep learning library for tensor computation and automatic differentiation.
- **`transformers`**: A library from Hugging Face providing pre-trained transformer models (such as BERT and GPT).
  - **`AutoModel`**: A class that automatically selects the appropriate model architecture based on the provided model name.
  - **`AutoTokenizer`**: A class that automatically selects the appropriate tokenizer based on the provided model name.
- **`nltk`**: The Natural Language Toolkit, a library for working with human language data (text).
  - **`sent_tokenize`**: A function to split text into sentences.
- **`weaviate`**: The Python client library for interacting with the Weaviate vector search engine.
  - **`EmbeddedOptions`**: Configuration for an embedded Weaviate instance that the client starts itself.

### Disable Gradient Calculation
```python
torch.set_grad_enabled(False)
```
- **Purpose**: Disables gradient calculation. This is useful during inference when we don't need to compute gradients, thus saving memory and computation time.

### Model and Tokenizer Initialization
```python
# update MODEL_NAME to use a different model if desired
MODEL_NAME = "distilbert-base-uncased"
model = AutoModel.from_pretrained(MODEL_NAME)
#model.to('cuda') # uncomment when a CUDA GPU is available
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
```
- **`MODEL_NAME`**: Specifies the name of the pre-trained model to be used. In this case, `distilbert-base-uncased`, a smaller, faster, lower-memory distilled version of BERT.
- **`model = AutoModel.from_pretrained(MODEL_NAME)`**: Loads the pre-trained model specified by `MODEL_NAME`.
- **`model.to('cuda')`**: Moves the model to the GPU for faster computation. The line is commented out in the cell above, so the model runs on the CPU; uncomment it only when a CUDA GPU is available.
- **`tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)`**: Loads the tokenizer corresponding to the pre-trained model specified by `MODEL_NAME`.

### Initialize NLTK
```python
# initialize nltk (for tokenizing sentences)
import nltk
nltk.download('punkt')
```
- **Purpose**: Imports `nltk` and downloads the `punkt` tokenizer models, which are used for sentence tokenization.
- **`nltk.download('punkt')`**: Downloads the necessary data for sentence tokenization.

### Initialize Weaviate Client
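```python
# initialize weaviate client for importing and searching
#client = weaviate.Client("http://localhost:8080")
client = weaviate.Client(
    embedded_options=EmbeddedOptions()
)
```
- **Purpose**: Initializes a Weaviate client for interacting with the Weaviate vector search engine.
- **`weaviate.Client(embedded_options=EmbeddedOptions())`**: Starts an embedded Weaviate instance from the Python process and connects to it. The cell output shows it listening on port `8079`.
- **`weaviate.Client("http://localhost:8080")`**: The commented-out alternative, which connects to a separately running Weaviate instance on `localhost` at port `8080`.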

### Summary
This code initializes necessary libraries and models for NLP tasks and sets up a Weaviate client for vector search operations. The process includes:
1. Disabling gradient computation to optimize inference.
2. Loading a pre-trained model (`distilbert-base-uncased`) and its tokenizer.
3. Preparing NLTK for sentence tokenization.
4. Setting up a connection to a local Weaviate instance (embedded, in the cell above).

In [37]:
import os
import random

def get_post_filenames(limit_objects=100):
    file_names = []
    i=0
    for root, dirs, files in os.walk("./data/20news-bydate-test"):
        for filename in files:
            path = os.path.join(root, filename)
            file_names += [path]
        
    random.shuffle(file_names)
    limit_objects = min(len(file_names), limit_objects)
      
    file_names = file_names[:limit_objects]

    return file_names

def read_posts(filenames=[]):
    posts = []
    for filename in filenames:
        f = open(filename, encoding="utf-8", errors='ignore')
        post = f.read()
        
        # strip the headers (the first occurrence of two newlines)
        post = post[post.find('\n\n'):]
        
        # remove posts with less than 10 words to remove some of the noise
        if len(post.split(' ')) < 10:
            continue
        
        post = post.replace('\n', ' ').replace('\t', ' ')
        if len(post) > 1000:
            post = post[:1000]
        posts += [post]

    return posts  

Let's break down and explain the provided code step by step:

### Imports
```python
import os
import random
```
- **`os`**: This module provides operating system-dependent functionality. Here it is used to walk the directory tree and build file paths.
- **`random`**: This module implements pseudo-random number generators for various distributions. Here it is used to shuffle a list.

### `get_post_filenames` Function
```python
def get_post_filenames(limit_objects=100):
    file_names = []
    i = 0
    for root, dirs, files in os.walk("./data/20news-bydate-test"):
        for filename in files:
            path = os.path.join(root, filename)
            file_names += [path]
        
    random.shuffle(file_names)
    limit_objects = min(len(file_names), limit_objects)
      
    file_names = file_names[:limit_objects]

    return file_names
```
- **Purpose**: To retrieve and return a list of file paths from the directory `./data/20news-bydate-test`, limited by `limit_objects`.

1. **Initialization**:
    - `file_names = []`: Initializes an empty list to store file paths.
    - `i = 0`: Initializes a counter variable (though it's not used in the function).

2. **Directory Traversal**:
    - `os.walk("./data/20news-bydate-test")`: Recursively walks through the directory tree rooted at `./data/20news-bydate-test`.
    - `for root, dirs, files in os.walk("./data/20news-bydate-test")`: Iterates over each directory (`root`), its subdirectories (`dirs`), and files (`files`).

3. **Collecting File Paths**:
    - `for filename in files`: Iterates over each file in the current directory.
    - `path = os.path.join(root, filename)`: Constructs the full file path.
    - `file_names += [path]`: Adds the file path to the `file_names` list.

4. **Shuffling and Limiting**:
    - `random.shuffle(file_names)`: Randomly shuffles the `file_names` list.
    - `limit_objects = min(len(file_names), limit_objects)`: Clamps `limit_objects` to the number of files found, so the limit never exceeds the length of the list.
    - `file_names = file_names[:limit_objects]`: Limits the list to `limit_objects` number of files.

5. **Return**:
    - `return file_names`: Returns the list of file paths.
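
The steps above can be tried on a throwaway directory. This is a minimal sketch rather than the notebook's code: `list_files` and its `seed` parameter are hypothetical names introduced here for illustration, and the temporary directory only stands in for `./data/20news-bydate-test`.

```python
import os
import random
import tempfile


def list_files(root_dir, limit_objects=100, seed=None):
    """Collect every file path under root_dir, shuffle, and keep at most limit_objects.

    `seed` is an illustrative addition so that a sample can be reproduced.
    """
    file_names = []
    for root, _dirs, files in os.walk(root_dir):
        for filename in files:
            file_names.append(os.path.join(root, filename))

    # Sort first so the shuffle starts from the same order on every platform.
    file_names.sort()
    random.Random(seed).shuffle(file_names)

    # Slicing past the end of a list is safe, so no explicit min() is needed.
    return file_names[:limit_objects]


# Build a tiny stand-in for the dataset: two folders with three posts each.
demo_root = tempfile.mkdtemp()
for group in ("rec.autos", "sci.space"):
    os.makedirs(os.path.join(demo_root, group))
    for n in range(3):
        with open(os.path.join(demo_root, group, str(n)), "w") as f:
            f.write("Subject: demo\n\nbody text")

sample = list_files(demo_root, limit_objects=4, seed=0)
all_files = list_files(demo_root, limit_objects=100, seed=0)
print(len(sample), len(all_files))
```

Passing a fixed `seed` makes repeated runs pick the same files, which can help when comparing search results between runs.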

### `read_posts` Function
```python
def read_posts(filenames=[]):
    posts = []
    for filename in filenames:
        f = open(filename, encoding="utf-8", errors='ignore')
        post = f.read()
        
        # strip the headers (the first occurrence of two newlines)
        post = post[post.find('\n\n'):]
        
        # remove posts with less than 10 words to remove some of the noise
        if len(post.split(' ')) < 10:
            continue
        
        post = post.replace('\n', ' ').replace('\t', ' ')
        if len(post) > 1000:
            post = post[:1000]
        posts += [post]

    return posts
```
- **Purpose**: To read the content of files specified by `filenames`, process the content, and return a list of processed posts.

1. **Initialization**:
    - `posts = []`: Initializes an empty list to store the processed posts.

2. **Reading Files**:
    - `for filename in filenames`: Iterates over each file path in `filenames`.
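    - `f = open(filename, encoding="utf-8", errors='ignore')`: Opens the file as UTF-8 and silently drops bytes that cannot be decoded. The handle is never closed explicitly; a `with open(...) as f:` block would close it automatically.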
    - `post = f.read()`: Reads the entire content of the file into the variable `post`.

3. **Processing Content**:
    - **Stripping Headers**:
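        - `post = post[post.find('\n\n'):]`: Strips the headers by keeping everything from the first blank line (two consecutive newlines) onward. If a file has no blank line, `find` returns `-1`, at most the last character is kept, and the short-post filter below discards it.
    - **Filtering Short Posts**:
        - `if len(post.split(' ')) < 10: continue`: Skips posts with fewer than 10 space-separated words to remove some noise.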
    - **Replacing Newlines and Tabs**:
        - `post = post.replace('\n', ' ').replace('\t', ' ')`: Replaces newlines and tabs with spaces.
    - **Truncating Long Posts**:
        - `if len(post) > 1000: post = post[:1000]`: Limits the length of the post to 1000 characters.

4. **Adding to List**:
    - `posts += [post]`: Adds the processed post to the `posts` list.

5. **Return**:
    - `return posts`: Returns the list of processed posts.
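
The per-post processing can also be separated from the file handling, which makes it easy to test on plain strings. The sketch below is one possible arrangement, not the notebook's code: `clean_post`, `read_posts_from`, `min_words`, and `max_chars` are hypothetical names, and the file is opened in a `with` block so the handle is closed after reading.

```python
def clean_post(raw, min_words=10, max_chars=1000):
    """Return the cleaned body of one raw post, or None if it should be skipped."""
    # The headers end at the first blank line (two consecutive newlines).
    header_end = raw.find("\n\n")
    if header_end == -1:
        # No blank line: skip explicitly. The notebook's slice ends up
        # discarding such posts as well, via the short-post filter.
        return None
    body = raw[header_end:]

    # Same filter as the notebook: count space-separated chunks.
    if len(body.split(" ")) < min_words:
        return None

    # Flatten newlines and tabs, then truncate to max_chars characters.
    body = body.replace("\n", " ").replace("\t", " ")
    return body[:max_chars]


def read_posts_from(filenames):
    """Read and clean every file, keeping only the posts that pass the filter."""
    posts = []
    for filename in filenames:
        with open(filename, encoding="utf-8", errors="ignore") as f:
            cleaned = clean_post(f.read())
        if cleaned is not None:
            posts.append(cleaned)
    return posts


example = "From: a@b\nSubject: x\n\n" + "word " * 12 + "\nlast line"
print(repr(clean_post(example)))
```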

### Summary
- `get_post_filenames(limit_objects=100)`: This function collects and returns a shuffled list of file paths from the specified directory, limited to a specified number.
- `read_posts(filenames=[])`: This function reads the content of the files specified by the provided list of filenames, processes the content by stripping headers, filtering short posts, replacing newlines and tabs, truncating long posts, and returns a list of these processed posts.

In [38]:
import time

def text2vec(text):
    tokens_pt = tokenizer(text, padding=True, truncation=True, max_length=500, add_special_tokens=True, return_tensors="pt")
    #tokens_pt.to('cuda') # uncomment when the model runs on a GPU
    outputs = model(**tokens_pt)
    # average the token embeddings over all sentences, then over all token positions
    return outputs[0].mean(0).mean(0).detach()

def vectorize_posts(posts=[]):
    post_vectors=[]
    before=time.time()
    for i, post in enumerate(posts):
        vec=text2vec(sent_tokenize(post))
        post_vectors += [vec]
        if i % 25 == 0 and i != 0:
            print("So far {} objects vectorized in {}s".format(i, time.time()-before))
    after=time.time()
    
    print("Vectorized {} items in {}s".format(len(posts), after-before))
    
    return post_vectors
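
In `text2vec`, `outputs[0]` holds one hidden vector per token for every sentence, so calling `.mean(0)` twice reduces it to a single vector per post. The pure-Python sketch below illustrates that reduction on nested lists standing in for a tensor of shape (sentences, tokens, hidden size). `mean_axis0` and `pool` are hypothetical helper names, and the numbers are made up for the example.

```python
def mean_axis0(rows):
    """Average a list of equal-length rows element-wise, like tensor.mean(0)."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def pool(hidden_states):
    """Reduce [sentences][tokens][hidden] to a single [hidden] vector."""
    # First mean(0): average across sentences, leaving [tokens][hidden].
    per_token = [mean_axis0(list(vectors)) for vectors in zip(*hidden_states)]
    # Second mean(0): average across token positions, leaving [hidden].
    return mean_axis0(per_token)


# Two sentences, two token positions each, hidden size 2.
hidden_states = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
print(pool(hidden_states))
```

Because the tokenizer pads every sentence to the same length, the hidden states at padding positions are part of this average too. Weighting the mean by the attention mask would be one way to exclude them, at the cost of producing different vectors than the notebook.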

The schema, import, and search cells below use the v3 API of the Weaviate Python client (`client.schema`, `client.data_object`, `client.query`).

In [39]:
def init_weaviate_schema(client):
    # a simple schema containing just a single class for our posts
    schema = {
        "classes": [{
                "class": "Post",
                "vectorizer": "none", # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves through our BERT model
                "properties": [{
                    "name": "content",
                    "dataType": ["text"],
                }]
        }]
    }

    # cleanup from previous runs
    client.schema.delete_all()

    client.schema.create(schema)

In [40]:
def import_posts_with_vectors(posts, vectors, client):
    if len(posts) != len(vectors):
        raise Exception("len of posts ({}) and vectors ({}) does not match".format(len(posts), len(vectors)))
        
    for i, post in enumerate(posts):
        try:
            client.data_object.create(
                data_object={"content": post},
                class_name='Post',
                vector=vectors[i]
            )
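        except Exception as e:
            # report the failure and continue with the next post
            print("Failed to import post {}: {}".format(i, e))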

In [47]:
def search(query="", limit=3):
    before = time.time()
    vec = text2vec(query)
    vec_took = time.time() - before

    before = time.time()
    near_vec = {"vector": vec}
    res = client \
        .query.get("Post", ["content", "_additional {certainty}"]) \
        .with_near_vector(near_vec) \
        .with_limit(limit) \
        .do()

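    # Rough equivalent with the v4 Python client (needs a v4 client object
    # and `from weaviate.classes.query import MetadataQuery`):
    #questions = client.collections.get("Post")
    #res = questions.query.near_vector(
    #    near_vector=vec.tolist(),
    #    limit=limit,
    #    return_metadata=MetadataQuery(distance=True)
    #)
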
    search_took = time.time() - before

    print("\nQuery \"{}\" with {} results took {:.3f}s ({:.3f}s to vectorize and {:.3f}s to search)" \
          .format(query, limit, vec_took+search_took, vec_took, search_took))
    for post in res["data"]["Get"]["Post"]:
        print("{:.4f}: {}".format(post["_additional"]["certainty"], post["content"]))
        print('---')
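
The `certainty` value printed by `search` can be related back to cosine similarity. The sketch below assumes the `Post` class uses Weaviate's default cosine distance, in which case certainty is `(1 + cosine similarity) / 2`. The helper names are hypothetical and the vectors are toy values, not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def certainty_from_cosine(cos_sim):
    """Map cosine similarity in [-1, 1] to certainty in [0, 1]."""
    return (1.0 + cos_sim) / 2.0


def cosine_from_certainty(certainty):
    """Inverse mapping: recover cosine similarity from a certainty value."""
    return 2.0 * certainty - 1.0


query_vec = [1.0, 0.0]
post_vec = [1.0, 1.0]
cos_sim = cosine_similarity(query_vec, post_vec)
print(round(cos_sim, 4), round(certainty_from_cosine(cos_sim), 4))
```

Under that assumption, a certainty of 0.8323 corresponds to a cosine similarity of about 0.6646, so certainty values compress the cosine range and tend to look high even for loosely related posts.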

In [48]:
init_weaviate_schema(client)
posts = read_posts(get_post_filenames(100))
vectors = vectorize_posts(posts)
import_posts_with_vectors(posts, vectors, client)

So far 25 objects vectorized in 2.707801103591919s
So far 50 objects vectorized in 5.729227066040039s
So far 75 objects vectorized in 8.259504079818726s
Vectorized 100 items in 11.178253173828125s


In [49]:
search("the best camera lens", 1)
search("motorcycle trip", 1)
search("which software do i need to view jpeg files", 1)
search("windows vs mac", 1)


Query "the best camera lens" with 1 results took 0.064s (0.057s to vectorize and 0.007s to search)
0.8323:                                                            cheek. 
---

Query "motorcycle trip" with 1 results took 0.034s (0.027s to vectorize and 0.007s to search)
0.8913:                                                            cheek. 
---

Query "which software do i need to view jpeg files" with 1 results took 0.041s (0.034s to vectorize and 0.007s to search)
0.9204:   In article <matess.735934793@gsusgi1.gsu.edu> matess@gsusgi1.gsu.edu (Eliza Strickler) writes: >I just donwloaded a *.bin file from a unix machine which is >supposed to be converted to a MAC format. Does anyone know  >what I need to do to this file to get it into any Dos, Mac >or Unix readable format. Someone mentioned fetch on the unix >machine - is this correct? Could someone explain the .bin >format a little? >  This is almost certainly a MacBinary file which is an encoded version of a mac file so the Reso