# Movie Search Using Zilliz Cloud and SentenceTransformers

In this example, we walk through a movie search using Zilliz Cloud and the SentenceTransformers library. The dataset we search is the Wikipedia Movie Plots dataset found on [Kaggle](https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots). For this example, we have re-hosted the data on a public Google Drive.

## Before you start

For this example, we use **pymilvus** to connect to Zilliz Cloud, **sentence-transformers** to generate vector embeddings, and **gdown** to download the example dataset.

In [None]:
%pip install pymilvus sentence-transformers gdown

## Parameters

These are the main arguments you need to modify to run the example with your own account. Each one has a comment describing what it is.

In [14]:
import gdown, zipfile, time, csv
from tqdm import tqdm
from pymilvus import connections, DataType, FieldSchema, CollectionSchema, Collection, utility
from sentence_transformers import SentenceTransformer

# Parameters for setting up Zilliz Cloud
COLLECTION_NAME = 'movies_db'  # Collection name
DIMENSION = 384  # Embeddings size
URI = 'http://localhost:19530'  # Endpoint URI obtained from Zilliz Cloud
TOKEN = 'root:Milvus'  # API key or a colon-separated cluster username and password

# Inference Arguments
BATCH_SIZE = 128

# Search Arguments
TOP_K = 3

url = 'https://drive.google.com/uc?id=11ISS45aO2ubNCGaC3Lvd3D7NT8Y7MeO8'
output = '../movies.zip'
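
If you would rather not hardcode credentials in the notebook, one option is to read them from environment variables and fall back to the defaults above. This is a minimal sketch: the variable names `ZILLIZ_URI` and `ZILLIZ_TOKEN` and the helper `load_connection_settings` are hypothetical, not something Zilliz Cloud or pymilvus defines.

```python
import os

def load_connection_settings(env=os.environ):
    """Return (uri, token), preferring environment variables over the defaults.

    ZILLIZ_URI and ZILLIZ_TOKEN are hypothetical names chosen for this sketch.
    """
    uri = env.get('ZILLIZ_URI', 'http://localhost:19530')
    token = env.get('ZILLIZ_TOKEN', 'root:Milvus')
    return uri, token

URI, TOKEN = load_connection_settings()
```

With neither variable set, `URI` and `TOKEN` keep the same values as in the parameters cell.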

In [4]:
gdown.download(url, output)

with zipfile.ZipFile("../movies.zip","r") as zip_ref:
    zip_ref.extractall("../movies")

Downloading...
From (uriginal): https://drive.google.com/uc?id=11ISS45aO2ubNCGaC3Lvd3D7NT8Y7MeO8
From (redirected): https://drive.google.com/uc?id=11ISS45aO2ubNCGaC3Lvd3D7NT8Y7MeO8&confirm=t&uuid=ecb9fbc2-dbd9-4270-9f2a-ba538c62f05a
To: /Users/anthony/Documents/Github/zdoc-demos/movies.zip
100%|██████████| 30.9M/30.9M [00:02<00:00, 11.0MB/s]


## Set up Zilliz Cloud

At this point, we are going to begin setting up Zilliz Cloud. The steps are as follows:

1. Connect to the Zilliz Cloud cluster using the provided URI.
2. If the collection already exists, drop it.
3. Create the collection that holds the id, title of the movie, and the embeddings of the plot text.
4. Create an index on the newly created collection and load it into memory.

Once these steps are done, the collection is ready for inserts and searches. Any data added is indexed automatically and is available for search immediately. If the data is very fresh, searches might be slower, because brute-force search is used on data that is still in the process of being indexed.
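
To make "brute-force search" concrete, here is a conceptual sketch in plain Python: the query is compared against every stored vector and the closest ones are kept. It only illustrates the idea and is not how Zilliz Cloud implements search internally; the function name `brute_force_search` and the tiny 2-dimensional vectors are made up for the illustration.

```python
import math

def brute_force_search(query, vectors, top_k):
    """Return (index, distance) pairs for the top_k vectors nearest to query.

    Every stored vector is scanned, so the cost grows linearly with the
    number of vectors. An index exists to avoid exactly this full scan.
    """
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest Euclidean distance first
    return scored[:top_k]

# Illustrative vectors only; real embeddings in this example have 384 dimensions.
stored = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
nearest = brute_force_search([0.9, 0.1], stored, top_k=2)
```

Here `nearest` lists vector 1 first and vector 0 second, since `[1.0, 0.0]` is the closest stored vector to the query.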

In [5]:
# Connect to the Zilliz Cloud cluster
connections.connect(
    uri=URI, 
    token=TOKEN
)

# Remove any previous collections with the same name
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)


# Create collection which includes the id, title, and embedding.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=200),  # VARCHARS need a maximum length, so for this example they are set to 200 characters
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)

# Create an AUTOINDEX index on the embedding field, using the L2 metric.
index_params = {
    'index_type': 'AUTOINDEX',
    'metric_type': 'L2',
    'params': {}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()


## Insert data

For this example, we use the SentenceTransformers `all-MiniLM-L6-v2` model to create embeddings of the plot text. This model returns 384-dimensional embeddings, which matches the `DIMENSION` parameter set earlier.

In these next few steps we will:

1. Load the data.

2. Embed the plot text data using SentenceTransformers.

3. Insert the data into Zilliz Cloud.
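
For step 1, the loader in the next cell reads columns by position (`row[1]` and `row[7]`). If the CSV file begins with a header row, `csv.reader` yields it like any other row, so it would be embedded and inserted as if it were a movie. A hedged alternative is `csv.DictReader`, which consumes the header and lets you select columns by name. The sketch below assumes the file has a header with columns named `Title` and `Plot`; check this against your copy of `plots.csv`. The helper name `load_title_plot` and the sample rows are made up for illustration.

```python
import csv
import io

def load_title_plot(f):
    """Yield (title, plot) pairs, skipping rows where either field is empty.

    Assumes a header row with 'Title' and 'Plot' columns (an assumption
    about the file layout, not something this notebook verifies).
    """
    reader = csv.DictReader(f)
    for row in reader:
        title, plot = row['Title'], row['Plot']
        if title and plot:
            yield title, plot

# Made-up sample rows, used only to show the behavior.
sample = io.StringIO(
    'Release Year,Title,Plot\n'
    '1999,Example Movie,"An example plot, with a comma."\n'
    '2000,No Plot Movie,\n'
)
pairs = list(load_title_plot(sample))
```

In this sample, only `Example Movie` is yielded: the header is consumed by `DictReader` and the second row is skipped because its plot is empty.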

In [20]:
transformer = SentenceTransformer('all-MiniLM-L6-v2')

# Yield (title, plot) pairs from the CSV, skipping rows where either is empty
def csv_load(file):
    with open(file, newline='') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            if '' in (row[1], row[7]):
                continue
            yield (row[1], row[7])

def count(file):
    with open(file, newline='') as f:
        count = 0
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            if '' in (row[1], row[7]):
                continue
            count += 1
    return count

# Embed the plot text with SentenceTransformers and insert titles and embeddings
def embed_insert(data):
    embeds = transformer.encode(data[1])
    ins = [
            data[0],
            [x for x in embeds]
    ]
    collection.insert(ins)



data_batch = [[],[]]

for title, plot in tqdm(csv_load('../movies/plots.csv'), total=count('../movies/plots.csv')):
    data_batch[0].append(title)
    data_batch[1].append(plot)
    if len(data_batch[0]) % BATCH_SIZE == 0:
        embed_insert(data_batch)
        data_batch = [[],[]]

# Embed and insert the remainder
if len(data_batch[0]) != 0:
    embed_insert(data_batch)

# Call a flush to index any unsealed segments.
collection.flush()

100%|██████████| 34887/34887 [15:01<00:00, 38.70it/s]
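
The accumulate-then-insert pattern in the loop above can also be expressed as a small generator that groups any iterable into fixed-size chunks, which removes the need for a separate "insert the remainder" step. This is a sketch of an alternative, not the notebook's own method; the helper name `batched` and the sample rows are hypothetical.

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of up to `size` items; the final list may be shorter."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Made-up (title, plot) pairs standing in for the rows from csv_load.
rows = [('t1', 'p1'), ('t2', 'p2'), ('t3', 'p3')]
batches = list(batched(rows, 2))

# Each chunk can be split back into a titles column and a plots column.
titles, plots = zip(*batches[0])
```

In the notebook, each chunk from `batched(csv_load(...), BATCH_SIZE)` could be unzipped this way and passed to `embed_insert` as `[list(titles), list(plots)]`.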


## Perform the search

With all the data inserted into Zilliz Cloud, we can start performing our searches. In this example, we search for movies based on their plot. Because both queries are sent in a single batched search call, the reported search time covers the whole batch rather than each individual query.

In [21]:
# Search for titles whose plots most closely match these phrases.
search_terms = ['A movie about cars', 'A movie about monsters']

# Embed the search phrases so they can be compared against the stored plot embeddings
def embed_search(data):
    embeds = transformer.encode(data)
    return [x for x in embeds]

search_data = embed_search(search_terms)

start = time.time()
res = collection.search(
    data=search_data,  # Embedded search values
    anns_field="embedding",  # Search across embeddings
    param={},
    limit = TOP_K,  # Limit to top_k results per search
    output_fields=['title']  # Include title field in result
)
end = time.time()

for hits_i, hits in enumerate(res):
    print('Title:', search_terms[hits_i])
    print('Search Time:', end-start)
    print('Results:')
    for hit in hits:
        print( hit.entity.get('title'), '----', hit.distance)
    print()

Title: A movie about cars
Search Time: 0.292910099029541
Results:
Auto Driver ---- 0.8420578837394714
Auto Driver ---- 0.8420578837394714
Red Line 7000 ---- 0.9104408025741577

Title: A movie about monsters
Search Time: 0.292910099029541
Results:
Monster Hunt ---- 0.8105474710464478
Monster Hunt ---- 0.8105474710464478
The Astro-Zombies ---- 0.8998499512672424
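
The results above list some titles twice with identical distances. The notebook does not say why, so the sketch below only shows one way to collapse repeated titles on the client side, keeping the smallest distance for each. The helper name `dedupe_by_title` is hypothetical, and it works on plain `(title, distance)` tuples, which could be built from `hit.entity.get('title')` and `hit.distance` in the loop above.

```python
def dedupe_by_title(hits):
    """Keep the closest hit per title, returned in ascending-distance order."""
    best = {}
    for title, distance in hits:
        if title not in best or distance < best[title]:
            best[title] = distance
    return sorted(best.items(), key=lambda item: item[1])

# Values copied from the first query's output above.
hits = [
    ('Auto Driver', 0.8420578837394714),
    ('Auto Driver', 0.8420578837394714),
    ('Red Line 7000', 0.9104408025741577),
]
unique_hits = dedupe_by_title(hits)
```

Deduplicating after the search can leave fewer than `TOP_K` results, so you may want to request a larger `limit` if you need a fixed number of distinct titles.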

