# Hybrid search with DocArray + Weaviate

In this notebook, we will set up a Weaviate document index with 1000 movie datapoints and run some fun searches against this store with DocArray! We will cover 3 different searches:
  1. Pure text search (symbolic search with BM25)
  2. Vector search
  3. Hybrid search (combining symbolic and vector approaches)

Make sure you have a Docker instance with Weaviate running locally, or set credentials for connecting to a remote instance.
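
Before running the rest of the notebook, it can help to confirm that the Weaviate instance is actually reachable. The snippet below is a minimal sketch using only the standard library; the helper name `weaviate_is_ready` is made up for this example, and it assumes the instance exposes Weaviate's readiness endpoint at `/v1/.well-known/ready` without authentication.

```python
import urllib.error
import urllib.request


def weaviate_is_ready(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200.

    Assumption: an unauthenticated Weaviate instance is listening at `base_url`.
    """
    url = base_url.rstrip("/") + "/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or a non-2xx status code.
        return False
```

Calling `weaviate_is_ready()` before creating the document index gives an early, readable signal if the Docker container is not up yet.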

In [1]:
import docarray
docarray.__version__  # verify that we have 0.30.0 here :)

'0.30.0'

In [2]:
import os
import numpy as np
import csv
import requests as rq
from typing import Dict
from pydantic import Field
from docarray import BaseDoc
from docarray.typing import NdArray
from docarray.index.backends.weaviate import WeaviateDocumentIndex

## Data loading and encoding

We will be using the IMDB Movies Dataset from Kaggle, which contains the 1000 top-rated movies on IMDB. To download this dataset, we will use `opendatasets`, which saves the data as a CSV file in the following directory structure:

`imdb-dataset-of-top-1000-movies-and-tv-shows/imdb_top_1000.csv`

### Document Schema
After the data is downloaded, we define a DocArray document schema for our data, consisting of a text field (a concatenation of each movie's title, release year, genre, overview, director and star names) and an embedding field.

### Encoding
We also define a helper function `encode()` to help us encode our text fields using `sentence-transformers`. We will be using the `'multi-qa-MiniLM-L6-cos-v1'` model, which has been specifically trained for semantic search, using 215M question-answer pairs from various sources and domains, including StackExchange, Yahoo Answers, Google & Bing search queries and many more. This model should perform well across many search tasks and domains, including the movies domain. Let's see how well it does.

In [3]:
import opendatasets as od
od.download("https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows")

Skipping, found downloaded files in "./imdb-dataset-of-top-1000-movies-and-tv-shows" (use force=True to force download)


In [4]:
# Define a docarray document schema
class MovieDocument(BaseDoc):
    text: str  # will contain a concatenation of title, year, genre, overview, director and star names
    embedding: NdArray[384] = Field(
        dims=384, is_embedding=True
    )  # Embedding column -> vector representation of the document

In [5]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1') # may take some time to load!

def encode(text: str):
    embedding = model.encode(text)
    return embedding



In [6]:
# Load the data
docs = []
with open('imdb-dataset-of-top-1000-movies-and-tv-shows/imdb_top_1000.csv', newline='') as csvfile:
    movie_reader = csv.reader(csvfile, delimiter=',')
    next(movie_reader)  # skip the header row
    for c, row in enumerate(movie_reader):
        # title, year, genre, overview, then director and the first three stars
        text = ' '.join([row[1], row[2], row[5], row[7]] + row[9:13])
        embedding = encode(text=text)
        d = MovieDocument(text=text, embedding=embedding, id=c)
        docs.append(d)
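
The positional indices in the loading cell are easy to misread, so here is a small sketch of the same concatenation using column names. The column names (`Series_Title`, `Released_Year`, `Genre`, `Overview`, `Director`, `Star1` to `Star3`) are an assumption about the Kaggle CSV header, and the one-row in-memory CSV is only a stand-in for the real file.

```python
import csv
import io

# Assumed column names; check them against the header of imdb_top_1000.csv.
TEXT_COLUMNS = [
    "Series_Title", "Released_Year", "Genre", "Overview",
    "Director", "Star1", "Star2", "Star3",
]


def build_text(row: dict) -> str:
    """Join the selected columns into the single text field used for search."""
    return " ".join(row[col] for col in TEXT_COLUMNS)


# Tiny in-memory stand-in for the real CSV file.
sample_csv = io.StringIO(
    "Series_Title,Released_Year,Genre,Overview,Director,Star1,Star2,Star3\n"
    'Frankenstein,1931,"Drama, Horror, Sci-Fi",'
    '"Dr. Frankenstein dares to tamper with life and death by creating '
    'a human monster out of lifeless body parts.",'
    "James Whale,Colin Clive,Mae Clarke,Boris Karloff\n"
)

rows = list(csv.DictReader(sample_csv))
text = build_text(rows[0])
print(text)
```

Using `csv.DictReader` makes the choice of fields explicit and fails loudly with a `KeyError` if a column is missing, at the cost of depending on the exact header names.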

In [7]:
# Take a look at the loaded + processed documents
print("Number of documents: ", len(docs))
docs[0].summary()

Number of documents:  1000


## Indexing

Next, we want to store our documents in a document store, in this case a Weaviate document index. To do this, we first define a batch config. Batching is the recommended practice for bulk operations such as importing data, because it can significantly improve import performance. We specify a batch configuration below and pass it on as runtime configuration.

We also define a database config below, which holds our host and index name. This notebook assumes you have a local Weaviate document index set up, and therefore uses the host `http://localhost:8080`, but if you have instead set up a remote instance, please follow the [documentation](https://docs.docarray.org/user_guide/storing/index_weaviate/) for connecting to your Weaviate index.

In [8]:
batch_config = {
    "batch_size": 20,
    "dynamic": False,
    "timeout_retries": 3,
    "num_workers": 1,
}
db_config = WeaviateDocumentIndex.DBConfig(
    host="http://localhost:8080",  # Replace with your endpoint credentials if relevant
    index_name="MovieDocument",
)

runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config)

store = WeaviateDocumentIndex[MovieDocument](db_config=db_config)
store.configure(runtimeconfig)  # Batch settings being passed on
store.index(docs)
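
To make the effect of `batch_size` concrete, the sketch below shows how a list of 1000 items splits into batches of 20. It is only an illustration of the batching idea, not the code the Weaviate client runs internally; `iter_batches` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 1000 placeholder items stand in for the 1000 movie documents.
batches = list(iter_batches(list(range(1000)), batch_size=20))
print(len(batches), len(batches[0]), len(batches[-1]))  # 50 20 20
```

With these settings the import is sent in 50 requests instead of 1000 single-object requests, which is where the performance gain comes from.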

__________________

## 1. Text Search

Now, we can start to search our IMDB Movies data! First, let's use symbolic search with a text query to find some results. The `text_search()` method on the Weaviate document store finds exact (case-insensitive) matches for the terms in our text query, meaning at least one of the words must appear in a document's text for it to be returned as a result.

In [9]:
# Define text query
text_query = 'monster scary'

In [10]:
text_q = store.build_query().text_search(text_query, search_field="text").limit(10).build()
text_results = store.execute_query(text_q)
print("Number of documents returned: ", len(text_results))

Number of documents returned:  3


In [11]:
for doc in text_results:
    print("Document number: ", doc.id)
    print("Document text:")
    print(doc.text)
    print("Match found at position ", doc.text.lower().find("monster"), "/", doc.text.lower().find("scary"))
    print()

Document number:  719
Document text:
Frankenstein 1931 Drama, Horror, Sci-Fi Dr. Frankenstein dares to tamper with life and death by creating a human monster out of lifeless body parts. James Whale Colin Clive Mae Clarke Boris Karloff
Match found at position  113 / -1

Document number:  401
Document text:
Beauty and the Beast 1991 Animation, Family, Fantasy A prince cursed to spend his days as a hideous monster sets out to regain his humanity by earning a young woman's love. Gary Trousdale Kirk Wise Paige O'Hara Robby Benson
Match found at position  100 / -1

Document number:  716
Document text:
Bride of Frankenstein 1935 Drama, Horror, Sci-Fi Mary Shelley reveals the main characters of her novel survived: Dr. Frankenstein, goaded by an even madder scientist, builds his monster a mate. James Whale Boris Karloff Elsa Lanchester Colin Clive
Match found at position  178 / -1



### Notes
Because only exact matches are returned, we get just 3 results, even though we set the limit to 10. In other words, the text search does not automatically return "similar" results; it only returns documents that contain our term(s). Unfortunately, there are no exact matches for the term "scary" in our dataset, so all three hits come from "monster" alone.
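
The exact-match behaviour follows directly from how BM25 scores documents: a document only receives a score for query terms it actually contains. Below is a minimal pure-Python sketch of the standard BM25 formula on three toy snippets. The tokenizer, the parameter values `k1=1.2` and `b=0.75`, and the toy texts are assumptions for illustration, and Weaviate's own tokenization and scoring may differ in detail.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with the classic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))  # count each term once per document

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue  # a term that is absent contributes nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "a human monster out of lifeless body parts",
    "monsters have to scare children",
    "a young woman travels to Paris",
]
scores = bm25_scores("monster scary", toy_docs)
print(scores)
```

Only the first snippet gets a non-zero score: "monsters" and "scare" are different tokens from "monster" and "scary", so without stemming the second snippet scores zero, just like the third.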

__________________

## 2. Vector Search
Next, we'll use the embeddings we computed with our `sentence-transformers` model: we encode the query with the same model and then compute the similarity between our query vector and the vectors of our documents. To do this, we call the `find()` method with a query embedding on our document store.

In [12]:
q_embedding = encode(text_query)
vector_q = (
    store.build_query()
    .find(q_embedding)
    .limit(10)
    .build()
)

vector_results = store.execute_query(vector_q)
print("Number of documents returned: ", len(vector_results))

Number of documents returned:  10


### Notes 

Great! We have 10 results now. That's because the vector comparison can still find "similar" results even if there are no explicit mentions of our terms "monster" and "scary". We'll take a closer look at the results in a comparative analysis below.
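
For intuition, the vector comparison can be pictured as ranking documents by cosine similarity to the query embedding. The sketch below does this by brute force on hypothetical 3-dimensional vectors; the real embeddings have 384 dimensions, and Weaviate performs the search server-side with its own index, so this is only an illustration of the idea.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k (id, similarity) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings, not real model output.
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
nearest = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(nearest)
```

Every document gets some similarity score, which is why a vector search can always fill the requested limit of 10 even when no document contains the query terms.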

__________________

## 3. Hybrid Search

In [13]:
q_embedding = encode(text_query)
hybrid_q = (
    store.build_query()
    .text_search(
        text_query, search_field=None  # Set as None as it is required but has no effect
    )
    .find(q_embedding)
    .limit(10)
    .build()
)

hybrid_results = store.execute_query(hybrid_q)
print("Number of documents returned: ", len(hybrid_results))

Number of documents returned:  10


### Notes

We again have a total of 10 results. These results should combine the best of both worlds, pushing documents to the top that contain an explicit mention of our search terms, and filling up the rest of the results with highly relevant results, possibly containing synonyms and related topics. We'll analyse this below!
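
One common way to combine a keyword ranking with a vector ranking is weighted reciprocal rank fusion. The sketch below is an illustration of that general technique under assumed parameters (`alpha=0.5`, `k=60`); it is not taken from this notebook and does not claim to be the exact calculation Weaviate performs. The two ID lists are copied from this notebook's text and vector search results.

```python
from typing import Dict, List


def fuse_rankings(text_ids: List[int], vector_ids: List[int],
                  alpha: float = 0.5, k: int = 60) -> List[int]:
    """Weighted reciprocal rank fusion of two ranked ID lists.

    alpha weights the vector ranking and (1 - alpha) the text ranking.
    Each list contributes weight / (k + rank), with ranks starting at 1.
    """
    scores: Dict[int, float] = {}
    for weight, ranking in ((1 - alpha, text_ids), (alpha, vector_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # sorted() is stable, so tied documents keep their insertion order.
    return sorted(scores, key=scores.get, reverse=True)


text_ids = [719, 401, 716]
vector_ids = [245, 719, 844, 271, 395, 724, 401, 932, 566, 417]

fused = fuse_rankings(text_ids, vector_ids)
print(fused[:10])
```

Documents that appear in both lists (719 and 401) collect score from both components and rise to the top, while a document found only by the text search (716) still lands among the leading results.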

__________________

## Results: A Comparative Analysis

Now that we have a bunch of search results from our various approaches, let's see how many relevant results were retrieved using the approaches.

In [14]:
from IPython.display import HTML
from typing import List
import pandas as pd
pd.set_option('display.max_colwidth', 0)

In [15]:
def display_result(bm25_data: List[MovieDocument], vec_data: List[MovieDocument], hyb_data: List[MovieDocument]):
    df1 = pd.DataFrame([d.dict() for d in bm25_data])
    df2 = pd.DataFrame([d.dict() for d in vec_data])
    df3 = pd.DataFrame([d.dict() for d in hyb_data])
    df = pd.concat([df1, df2, df3], axis=1).drop(["embedding"], axis=1)
    df.columns = [
        ["text search results", "text search results", "vector search results", "vector search results", "hybrid search results", "hybrid search results"],
        ["text", "id", "text", "id", "text", "id"]
    ]
    return df.style.set_table_styles(
        [
            {"selector": "", "props": [("border", "1px solid grey")]},
            {"selector": "tbody td", "props": [("border", "1px solid grey")]},
            {"selector": "th", "props": [("border", "1px solid grey")]},
        ]
    )

In [16]:
display_result(text_results, vector_results, hybrid_results)

Unnamed: 0_level_0,text search results,text search results,vector search results,vector search results,hybrid search results,hybrid search results
Unnamed: 0_level_1,text,id,text,id,text,id
0,"Frankenstein 1931 Drama, Horror, Sci-Fi Dr. Frankenstein dares to tamper with life and death by creating a human monster out of lifeless body parts. James Whale Colin Clive Mae Clarke Boris Karloff",719.0,"Monsters, Inc. 2001 Animation, Adventure, Comedy In order to power the city, monsters have to scare children so that they scream. However, the children are toxic to the monsters, and after a child gets through, 2 monsters realize things may not be what they think. Pete Docter David Silverman Lee Unkrich Billy Crystal",245,"Frankenstein 1931 Drama, Horror, Sci-Fi Dr. Frankenstein dares to tamper with life and death by creating a human monster out of lifeless body parts. James Whale Colin Clive Mae Clarke Boris Karloff",719
1,"Beauty and the Beast 1991 Animation, Family, Fantasy A prince cursed to spend his days as a hideous monster sets out to regain his humanity by earning a young woman's love. Gary Trousdale Kirk Wise Paige O'Hara Robby Benson",401.0,"Frankenstein 1931 Drama, Horror, Sci-Fi Dr. Frankenstein dares to tamper with life and death by creating a human monster out of lifeless body parts. James Whale Colin Clive Mae Clarke Boris Karloff",719,"Beauty and the Beast 1991 Animation, Family, Fantasy A prince cursed to spend his days as a hideous monster sets out to regain his humanity by earning a young woman's love. Gary Trousdale Kirk Wise Paige O'Hara Robby Benson",401
2,"Bride of Frankenstein 1935 Drama, Horror, Sci-Fi Mary Shelley reveals the main characters of her novel survived: Dr. Frankenstein, goaded by an even madder scientist, builds his monster a mate. James Whale Boris Karloff Elsa Lanchester Colin Clive",716.0,"Halloween 1978 Horror, Thriller Fifteen years after murdering his sister on Halloween night 1963, Michael Myers escapes from a mental hospital and returns to the small town of Haddonfield, Illinois to kill again. John Carpenter Donald Pleasence Jamie Lee Curtis Tony Moran",844,"Monsters, Inc. 2001 Animation, Adventure, Comedy In order to power the city, monsters have to scare children so that they scream. However, the children are toxic to the monsters, and after a child gets through, 2 monsters realize things may not be what they think. Pete Docter David Silverman Lee Unkrich Billy Crystal",245
3,,,"The Thing 1982 Horror, Mystery, Sci-Fi A research team in Antarctica is hunted by a shape-shifting alien that assumes the appearance of its victims. John Carpenter Kurt Russell Wilford Brimley Keith David",271,"Bride of Frankenstein 1935 Drama, Horror, Sci-Fi Mary Shelley reveals the main characters of her novel survived: Dr. Frankenstein, goaded by an even madder scientist, builds his monster a mate. James Whale Boris Karloff Elsa Lanchester Colin Clive",716
4,,,"The Nightmare Before Christmas 1993 Animation, Family, Fantasy Jack Skellington, king of Halloween Town, discovers Christmas Town, but his attempts to bring Christmas to his home causes confusion. Henry Selick Danny Elfman Chris Sarandon Catherine O'Hara",395,"Halloween 1978 Horror, Thriller Fifteen years after murdering his sister on Halloween night 1963, Michael Myers escapes from a mental hospital and returns to the small town of Haddonfield, Illinois to kill again. John Carpenter Donald Pleasence Jamie Lee Curtis Tony Moran",844
5,,,"Get Out 2017 Horror, Mystery, Thriller A young African-American visits his white girlfriend's parents for the weekend, where his simmering uneasiness about their reception of him eventually reaches a boiling point. Jordan Peele Daniel Kaluuya Allison Williams Bradley Whitford",724,"The Thing 1982 Horror, Mystery, Sci-Fi A research team in Antarctica is hunted by a shape-shifting alien that assumes the appearance of its victims. John Carpenter Kurt Russell Wilford Brimley Keith David",271
6,,,"Beauty and the Beast 1991 Animation, Family, Fantasy A prince cursed to spend his days as a hideous monster sets out to regain his humanity by earning a young woman's love. Gary Trousdale Kirk Wise Paige O'Hara Robby Benson",401,"The Nightmare Before Christmas 1993 Animation, Family, Fantasy Jack Skellington, king of Halloween Town, discovers Christmas Town, but his attempts to bring Christmas to his home causes confusion. Henry Selick Danny Elfman Chris Sarandon Catherine O'Hara",395
7,,,"Saw 2004 Horror, Mystery, Thriller Two strangers awaken in a room with no recollection of how they got there, and soon discover they're pawns in a deadly game perpetrated by a notorious serial killer. James Wan Cary Elwes Leigh Whannell Danny Glover",932,"Get Out 2017 Horror, Mystery, Thriller A young African-American visits his white girlfriend's parents for the weekend, where his simmering uneasiness about their reception of him eventually reaches a boiling point. Jordan Peele Daniel Kaluuya Allison Williams Bradley Whitford",724
8,,,"King Kong 1933 Adventure, Horror, Sci-Fi A film crew goes to a tropical island for an exotic location shoot and discovers a colossal ape who takes a shine to their female blonde star. He is then captured and brought back to New York City for public exhibition. Merian C. Cooper Ernest B. Schoedsack Fay Wray Robert Armstrong",566,"Saw 2004 Horror, Mystery, Thriller Two strangers awaken in a room with no recollection of how they got there, and soon discover they're pawns in a deadly game perpetrated by a notorious serial killer. James Wan Cary Elwes Leigh Whannell Danny Glover",932
9,,,"Young Frankenstein 1974 Comedy An American grandson of the infamous scientist, struggling to prove that his grandfather was not as insane as people believe, is invited to Transylvania, where he discovers the process that reanimates a dead body. Mel Brooks Gene Wilder Madeline Kahn Marty Feldman",417,"King Kong 1933 Adventure, Horror, Sci-Fi A film crew goes to a tropical island for an exotic location shoot and discovers a colossal ape who takes a shine to their female blonde star. He is then captured and brought back to New York City for public exhibition. Merian C. Cooper Ernest B. Schoedsack Fay Wray Robert Armstrong",566


As we already discussed above, the **text search** using BM25 only returns results containing an exact match for our query terms, in this case only matching three documents. All three contain the term "monster", so we can be confident that they are relevant to our query, making the *precision* of this approach very high. However, many documents are missing, even ones that contain a pluralized form of our search terms (e.g. "monsters"), making the *recall* of this approach poor.

The top 5 **vector search** results also all look relevant to our query: "Monsters, Inc.", "Frankenstein" (also a 'monster'), "Halloween", "The Thing" (with its "shape-shifting alien") and "The Nightmare Before Christmas". However, only two of the three documents retrieved by the text search approach appear in the 10 results retrieved using vector search (documents 719 and 401); "Bride of Frankenstein" (document 716) is missing from the top 10. Some of the returned documents contain no monster at all, for example documents 724 ("Get Out") and 932 ("Saw"), though as horror films they may well still count as 'scary'. Nonetheless, the vector search approach does quite well here because it works on latent representations of our data, capturing content with related meanings such as `scare`, `alien`, `Halloween` and `victims`, and it is thereby able to retrieve relevant documents that contain neither of our query terms.

Finally, let's take a look at the **hybrid search** results, the top 5 of which contain all three of our exact matches (documents 719, 401, 716) from the text search approach! This is because the BM25 component raises the combined score of these documents in the hybrid scoring calculation. The rest of the results are filled in by the vector search component, so the hybrid search improves on a simple BM25 search while still incorporating the advantages of the symbolic approach. In particular, "Bride of Frankenstein", which was missing from the vector search top 10, now appears in fourth place, while "Young Frankenstein" (document 417), which contains neither query term, drops out of the top 10.
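
The overlap between the three result lists can also be checked programmatically. The sketch below hard-codes the document IDs from the results table so that it runs on its own; in the notebook they could be derived with something like `[int(d.id) for d in text_results]`. Treating the text search hits as the reference set is an assumption made for this comparison, not a full relevance judgement.

```python
from typing import List

# IDs copied from the results table above.
text_ids = [719, 401, 716]
vector_ids = [245, 719, 844, 271, 395, 724, 401, 932, 566, 417]
hybrid_ids = [719, 401, 245, 716, 844, 271, 395, 724, 932, 566]


def overlap(reference: List[int], candidate: List[int]) -> List[int]:
    """IDs from `reference` that also occur in `candidate`, in reference order."""
    candidate_set = set(candidate)
    return [doc_id for doc_id in reference if doc_id in candidate_set]


def recall_against(reference: List[int], candidate: List[int]) -> float:
    """Fraction of `reference` IDs recovered by `candidate`."""
    if not reference:
        return 0.0
    return len(overlap(reference, candidate)) / len(reference)


print("exact matches found by vector search:", overlap(text_ids, vector_ids))
print("exact matches found by hybrid search:", overlap(text_ids, hybrid_ids))
print("only in vector top 10:", sorted(set(vector_ids) - set(hybrid_ids)))
```

The vector search recovers two of the three exact matches, the hybrid search recovers all three, and the only vector result that leaves the hybrid top 10 is document 417.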