# Movie recommendation with Azure Open AI & Azure Cognitive Search
## Part 1: Embeddings generation with Azure Open AI and Azure Cognitive Search ingestion

<img src="https://github.com/retkowsky/images/blob/master/movies_search.png?raw=true">

In [1]:
#%pip install azure-search-documents --pre --upgrade

In [2]:
import json
import math
import openai
import os
import pandas as pd
import pickle
import pytz
import requests
import sys
import tiktoken
import time

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    VectorSearch,
    SimpleField,
    SemanticSettings,
    SemanticField,
    SemanticConfiguration,
    SearchIndex,
    SearchFieldDataType,
    SearchField,
    SearchableField,
    PrioritizedFields,
    HnswVectorSearchAlgorithmConfiguration,
)

from datetime import datetime
from dotenv import load_dotenv
from openai.embeddings_utils import get_embedding, cosine_similarity
from tqdm import tqdm

In [3]:
sys.version

'3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0]'

In [4]:
local_tz = pytz.timezone(requests.get("https://ipinfo.io").json()["timezone"])
print("Local time:", datetime.now(local_tz).strftime("%d-%b-%Y %H:%M:%S"))

Local time: 05-Sep-2023 12:47:23


In [5]:
print("Open AI version:", openai.__version__)

Open AI version: 0.27.9


In [6]:
load_dotenv("azure.env")

openai.api_type = "azure"
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.api_base = os.getenv("OPENAI_API_BASE")
openai.api_version = os.getenv("OPENAI_API_VERSION")

acs_endpoint = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
acs_key = os.getenv("AZURE_SEARCH_ADMIN_KEY")
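
Because `os.getenv` returns `None` for a missing variable, a typo in `azure.env` only surfaces later as an authentication or connection error. A minimal sketch of an early check, assuming the variable names used in the cell above; `missing_settings` is a hypothetical helper, not part of the original notebook:

```python
import os

# Variable names taken from the configuration cell above
REQUIRED_SETTINGS = [
    "OPENAI_API_KEY",
    "OPENAI_API_BASE",
    "OPENAI_API_VERSION",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
]


def missing_settings(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Missing settings:", ", ".join(missing))
```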

In [7]:
# Azure Open AI embeddings model to use
embeddings_engine = "text-embedding-ada-002"

- Vector search is in public preview
- Model name: text-embedding-ada-002
- Model version: 2
- API version: 2023-05-15

In [8]:
# Azure Cognitive Search index name to create
index_name = "moviereview"

## 0. Azure Cognitive Search vector store
<img src="https://github.com/retkowsky/images/blob/master/vector_search_architecture.png?raw=true">

## 1. Data

In [9]:
EXCEL_FILE = "data/movies.xlsx"

!ls $EXCEL_FILE -lh

-rwxrwxrwx 1 root root 2.9M Sep  5 10:41 data/movies.xlsx


In [10]:
df = pd.read_excel(EXCEL_FILE)

In [11]:
df["title"] = df["title"].astype(str)
df["year"] = df["year"].astype(str)

columns_to_drop = ["tagline", "website"]
df = df.drop(columns_to_drop, axis=1)

In [12]:
df.head(5)

Unnamed: 0,imdb_id,title,cast,director,description,genres,year
0,tt0035423,Kate & Leopold,Meg Ryan Hugh Jackman Liev Schreiber Breckin M...,James Mangold,When her scientist ex-boyfriend discovers a po...,Comedy Fantasy Romance Science Fiction,2001
1,tt0052646,The Brain That Wouldn't Die,Jason Evers Virginia Leith Doris Brent Audrey ...,Joseph Green,Dr. Bill Cortner (Jason Evers) and his fiancé...,Horror Science Fiction,1962
2,tt0053559,13 Ghosts,Charles Herbert Jo Morrow Martin Milner Rosema...,William Castle,Reclusive Dr. Zorba has died and left his mans...,Horror,1960
3,tt0053580,The Alamo,John Wayne Richard Widmark Laurence Harvey Fra...,John Wayne,The legendary true story of a small band of so...,Action Adventure Drama History Western,1960
4,tt0053604,The Apartment,Jack Lemmon Shirley MacLaine Fred MacMurray Ra...,Billy Wilder,Bud Baxter is a minor clerk in a huge New York...,Comedy Drama Romance,1960


In [13]:
df.shape

(10785, 7)

In [14]:
df = df.drop_duplicates()
df.shape

(10784, 7)

In [15]:
df.head(5)

Unnamed: 0,imdb_id,title,cast,director,description,genres,year
0,tt0035423,Kate & Leopold,Meg Ryan Hugh Jackman Liev Schreiber Breckin M...,James Mangold,When her scientist ex-boyfriend discovers a po...,Comedy Fantasy Romance Science Fiction,2001
1,tt0052646,The Brain That Wouldn't Die,Jason Evers Virginia Leith Doris Brent Audrey ...,Joseph Green,Dr. Bill Cortner (Jason Evers) and his fiancé...,Horror Science Fiction,1962
2,tt0053559,13 Ghosts,Charles Herbert Jo Morrow Martin Milner Rosema...,William Castle,Reclusive Dr. Zorba has died and left his mans...,Horror,1960
3,tt0053580,The Alamo,John Wayne Richard Widmark Laurence Harvey Fra...,John Wayne,The legendary true story of a small band of so...,Action Adventure Drama History Western,1960
4,tt0053604,The Apartment,Jack Lemmon Shirley MacLaine Fred MacMurray Ra...,Billy Wilder,Bud Baxter is a minor clerk in a huge New York...,Comedy Drama Romance,1960


In [16]:
# Removing some extra spaces
df["description"] = df["description"].str.replace("  ", " ")
df["title"] = df["title"].str.replace("  ", " ")
df["cast"] = df["cast"].str.replace("  ", " ")
df["director"] = df["director"].str.replace("  ", " ")
df["genres"] = df["genres"].str.replace("  ", " ")
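
Note that replacing two spaces with one is a single pass, so a run of three or more spaces still leaves a double space behind. A minimal sketch of a regex-based alternative on plain strings; `collapse_spaces` is a hypothetical helper, and unlike the cell above it also trims the ends and treats tabs and newlines as whitespace:

```python
import re


def collapse_spaces(text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


sample = "Kate   &  Leopold "
print(repr(sample.replace("  ", " ")))  # a single pass can leave a double space
print(repr(collapse_spaces(sample)))
```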

In [17]:
tokenizer = tiktoken.get_encoding("cl100k_base")
df["nb_tokens"] = df["description"].apply(lambda x: len(tokenizer.encode(x)))
df = df[df.nb_tokens < 8192]
len(df)

10784

In [18]:
df.head(5)

Unnamed: 0,imdb_id,title,cast,director,description,genres,year,nb_tokens
0,tt0035423,Kate & Leopold,Meg Ryan Hugh Jackman Liev Schreiber Breckin M...,James Mangold,When her scientist ex-boyfriend discovers a po...,Comedy Fantasy Romance Science Fiction,2001,82
1,tt0052646,The Brain That Wouldn't Die,Jason Evers Virginia Leith Doris Brent Audrey ...,Joseph Green,Dr. Bill Cortner (Jason Evers) and his fiancé...,Horror Science Fiction,1962,106
2,tt0053559,13 Ghosts,Charles Herbert Jo Morrow Martin Milner Rosema...,William Castle,Reclusive Dr. Zorba has died and left his mans...,Horror,1960,53
3,tt0053580,The Alamo,John Wayne Richard Widmark Laurence Harvey Fra...,John Wayne,The legendary true story of a small band of so...,Action Adventure Drama History Western,1960,36
4,tt0053604,The Apartment,Jack Lemmon Shirley MacLaine Fred MacMurray Ra...,Billy Wilder,Bud Baxter is a minor clerk in a huge New York...,Comedy Drama Romance,1960,67


In [19]:
df["nb_tokens"].describe()

count    10784.000000
mean        62.621384
std         35.517581
min          2.000000
25%         35.000000
50%         57.000000
75%         81.000000
max        232.000000
Name: nb_tokens, dtype: float64

In [20]:
df = df.drop("nb_tokens", axis=1)

In [21]:
df.shape

(10784, 7)

## 2. Generating text embeddings with Azure Open AI

### Vector embeddings

In [22]:
print("Embedding engine:", embeddings_engine)

Embedding engine: text-embedding-ada-002


In [23]:
def openai_text_embeddings(text):
    """
    Generating embeddings from text using Azure Open AI
    Input: text
    Output: text embeddings
    """
    embeddings = openai.Embedding.create(
        input=text,
        deployment_id=embeddings_engine,
    )
    embeddings = embeddings["data"][0]["embedding"]

    return embeddings

In [24]:
emb = openai_text_embeddings("My name is James Bond")
emb[:5]

[-0.03617486730217934,
 -0.005520837381482124,
 -0.007070655468851328,
 -0.030174769461154938,
 0.0020399712957441807]

In [25]:
print("Size of the embeddings =", len(emb))

Size of the embeddings = 1536


### Running the embedding for the 'description' column (stored as `embed_overview`)

In [26]:
print("Running the embedding process...")
df["embed_overview"] = None

with tqdm(total=len(df)) as pbar:
    def apply_embedding(x):
        """
        Azure Open AI text embedding
        """
        global pbar
        embedding = get_embedding(x["description"], engine=embeddings_engine)
        pbar.update(1)  # Update the progress bar
        return embedding
    df["embed_overview"] = df.apply(apply_embedding, axis=1)

Running the embedding process...


100%|██████████| 10784/10784 [22:34<00:00,  7.96it/s] 


### Running the embedding for the 'title' column

In [27]:
print("Running the embedding process...")
df["embed_title"] = None

with tqdm(total=len(df)) as pbar:
    def apply_embedding(x):
        """
        Azure Open AI text embedding
        """
        global pbar
        embedding = get_embedding(x["title"], engine=embeddings_engine)
        pbar.update(1)  # Update the progress bar
        return embedding
    df["embed_title"] = df.apply(apply_embedding, axis=1)

Running the embedding process...


100%|██████████| 10784/10784 [22:02<00:00,  8.15it/s] 
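
The two runs above issue one request per row, which is why each takes about 22 minutes. A sketch of grouping texts into batches, assuming the deployment accepts a list of inputs per request (the maximum batch size depends on the service, so the default of 16 is an assumption). `embed_in_batches` and the `embed_batch` callable are hypothetical names; in this notebook `embed_batch` could wrap a call to `openai.Embedding.create` with a list as `input` and return the vectors in input order:

```python
def embed_in_batches(texts, embed_batch, batch_size=16):
    """
    Embed texts in groups of batch_size.
    embed_batch: callable taking a list of strings and returning one vector per string
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_batch must return one vector per input text")
        vectors.extend(batch_vectors)
    return vectors


# Stand-in embedder for illustration only: a one-dimensional "vector" per text
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["The Alamo", "13 Ghosts", "CQ"], fake_embed_batch, batch_size=2))
```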


### Saving the documents (initial data + embeddings) into a file

In [28]:
df.head(5)

Unnamed: 0,imdb_id,title,cast,director,description,genres,year,embed_overview,embed_title
0,tt0035423,Kate & Leopold,Meg Ryan Hugh Jackman Liev Schreiber Breckin M...,James Mangold,When her scientist ex-boyfriend discovers a po...,Comedy Fantasy Romance Science Fiction,2001,"[0.021089695394039154, -0.010973065160214901, ...","[0.012436236254870892, 0.017795050516724586, 0..."
1,tt0052646,The Brain That Wouldn't Die,Jason Evers Virginia Leith Doris Brent Audrey ...,Joseph Green,Dr. Bill Cortner (Jason Evers) and his fiancé...,Horror Science Fiction,1962,"[-0.012167753651738167, -0.011197278276085854,...","[-0.030167536810040474, 0.009392774663865566, ..."
2,tt0053559,13 Ghosts,Charles Herbert Jo Morrow Martin Milner Rosema...,William Castle,Reclusive Dr. Zorba has died and left his mans...,Horror,1960,"[0.010904512368142605, -0.0042160083539783955,...","[-0.00030440607224591076, -0.01510559953749179..."
3,tt0053580,The Alamo,John Wayne Richard Widmark Laurence Harvey Fra...,John Wayne,The legendary true story of a small band of so...,Action Adventure Drama History Western,1960,"[-0.013121142983436584, -0.021301323547959328,...","[-0.016000520437955856, -0.018589602783322334,..."
4,tt0053604,The Apartment,Jack Lemmon Shirley MacLaine Fred MacMurray Ra...,Billy Wilder,Bud Baxter is a minor clerk in a huge New York...,Comedy Drama Romance,1960,"[-0.00418621813878417, -0.010639390908181667, ...","[-0.0017813920276239514, -0.002438942203298211..."


In [29]:
df.shape

(10784, 9)

In [30]:
documents = df.to_dict(orient="records")
print("Number of documents =", len(documents))

Number of documents = 10784


In [31]:
# Saving the documents into a pkl file
PKL_DIR = "embeddings"
PKL_FILE = "movies.pkl"

os.makedirs(PKL_DIR, exist_ok=True)

print("Saving documents...")
with open(os.path.join(PKL_DIR, PKL_FILE), 'wb') as f:
    pickle.dump(documents, f)
print("Done")

os.listdir(PKL_DIR)

Saving documents...
Done


['movies.pkl']
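
To reuse the embeddings without calling Azure Open AI again, the pickle file can be read back later. A minimal round-trip sketch with a tiny stand-in record and a temporary directory (both are assumptions for illustration; the real file is `embeddings/movies.pkl`). Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code:

```python
import os
import pickle
import tempfile


def save_documents(docs, path):
    """Serialize a list of document dicts to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(docs, f)


def load_documents(path):
    """Read back a list of document dicts written by save_documents."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in record with a shortened embedding, for illustration only
sample_docs = [{"imdb_id": "tt0053604", "title": "The Apartment", "embed_title": [0.1, 0.2]}]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "movies.pkl")
    save_documents(sample_docs, pkl_path)
    restored = load_documents(pkl_path)

print("Number of documents =", len(restored))
```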

## 3. Cosine similarity principles

In [32]:
def get_cosine_similarity(vector1, vector2):
    """
    Get cosine similarity value between two embedded vectors
    Inputs: 2 embedded vectors
    Output: cosine similarity value
    """
    if len(vector1) != len(vector2):
        return None
    
    dot_product = sum(x * y for x, y in zip(vector1, vector2))
    magnitude1 = math.sqrt(sum(x * x for x in vector1))
    magnitude2 = math.sqrt(sum(x * x for x in vector2))
    cosine_similarity = round(dot_product / (magnitude1 * magnitude2), 15)

    if cosine_similarity == 1:
        print(
            "\033[1;31;34mStrictly identical text: Cosine similarity =",
            cosine_similarity,
        )
    
    elif cosine_similarity >= 0.8:
        print("\033[1;31;32mSame semantic text: Cosine similarity =", cosine_similarity)
    
    else:
        print(
            "\033[1;31;91mDifferent semantic text: Cosine similarity =",
            cosine_similarity,
        )
    
    return cosine_similarity

In [33]:
emb1 = openai_text_embeddings("My name is James Bond")
emb2 = openai_text_embeddings("Sean Connery.")
emb3 = openai_text_embeddings("Azure Open AI is great!")

In [34]:
get_cosine_similarity(emb1, emb1)

Strictly identical text: Cosine similarity = 1.0


1.0

In [35]:
get_cosine_similarity(emb1, emb2)

Same semantic text: Cosine similarity = 0.83354490270083


0.83354490270083

In [36]:
get_cosine_similarity(emb1, emb3)

Different semantic text: Cosine similarity = 0.723586104586534


0.723586104586534

In [37]:
get_cosine_similarity(emb2, emb1)

Same semantic text: Cosine similarity = 0.83354490270083


0.83354490270083

In [38]:
get_cosine_similarity(emb2, emb3)

Different semantic text: Cosine similarity = 0.718070263122291


0.718070263122291
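
The value computed above is the dot product of the two vectors divided by the product of their norms, so it depends only on direction, not on length. A small worked sketch with hand-made two-dimensional vectors (illustrative values, not real embeddings); `cosine` is a hypothetical helper that mirrors the arithmetic of `get_cosine_similarity` without the rounding and printing:

```python
import math


def cosine(v1, v2):
    """Cosine similarity: dot(v1, v2) / (|v1| * |v2|)."""
    dot_product = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    return dot_product / (norm1 * norm2)


print(cosine([1.0, 2.0], [2.0, 4.0]))   # same direction, different length: close to 1
print(cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite directions
```

Scaling a vector leaves the result unchanged, which is why the first pair scores as identical in direction even though the second vector is twice as long.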

## 4. Quick local tests (without Azure Cognitive Search)

In [39]:
def quick_search(df, user_query, top_n=3):
    """
    Searching documents
    Inputs: dataframe, query and topn
    Output: results
    """
    embedding = get_embedding(
        user_query,
        engine=embeddings_engine,
    )
    df["cosine_similarity"] = df.embed_overview.apply(
        lambda x: cosine_similarity(x, embedding)
    )
    results = df.sort_values("cosine_similarity", ascending=False).head(top_n)
    display(results)

    return results

In [40]:
results = quick_search(df, "I want to see some Terminator movies", top_n=3)

Unnamed: 0,imdb_id,title,cast,director,description,genres,year,embed_overview,embed_title,cosine_similarity
1276,tt0088247,The Terminator,Arnold Schwarzenegger Michael Biehn Linda Hami...,James Cameron,In the post-apocalyptic future reigning tyrann...,Action Thriller Science Fiction,1984,"[-0.02188941463828087, -0.044750526547431946, ...","[-0.025060655549168587, -0.0402563251554966, -...",0.850306
2159,tt0103064,Terminator 2 Judgment Day,Arnold Schwarzenegger Linda Hamilton Edward Fu...,James Cameron,Nearly 10 years have passed since Sarah Connor...,Action Thriller Science Fiction,1991,"[-0.013668149709701538, -0.0405806303024292, -...","[-0.02478639781475067, -0.0283088106662035, -0...",0.829719
3755,tt0181852,Terminator 3 Rise of the Machines,Arnold Schwarzenegger Nick Stahl Claire Danes ...,Jonathan Mostow,It's been 10 years since John Connor saved Ear...,Action Thriller Science Fiction,2003,"[0.007508266717195511, -0.04598238691687584, -...","[-0.0275761429220438, -0.058681610971689224, -...",0.82646


In [41]:
results = quick_search(df, "Je veux voir un film de James Bond", top_n=3)

Unnamed: 0,imdb_id,title,cast,director,description,genres,year,embed_overview,embed_title,cosine_similarity
290,tt0062512,You Only Live Twice,Sean Connery Akiko Wakabayashi Karin Dor Mie H...,Lewis Gilbert,A mysterious space craft kidnaps a Russian and...,Action Thriller Adventure,1967,"[-0.008935574442148209, -0.03285738453269005, ...","[-0.02086525410413742, -0.007323090452700853, ...",0.818913
522,tt0070328,Live and Let Die,Roger Moore Yaphet Kotto Jane Seymour Clifton ...,Guy Hamilton,James Bond must investigate a mysterious murde...,Adventure Action Thriller,1973,"[-0.01073690690100193, -0.02086213417351246, -...","[-0.032646261155605316, -0.016837764531373978,...",0.81724
1845,tt0097742,Licence to Kill,Timothy Dalton Carey Lowell Robert Davi Talisa...,John Glen,James Bond and his American colleague Felix Le...,Adventure Action Thriller,1989,"[0.002153094159439206, 0.004642711021006107, -...","[-0.0041930824518203735, -0.005286786705255508...",0.8119


In [42]:
results = quick_search(df, "Quiero ver películas de ciencia ficción", top_n=5)

Unnamed: 0,imdb_id,title,cast,director,description,genres,year,embed_overview,embed_title,cosine_similarity
4128,tt0254199,CQ,Jeremy Davies Angela Lindvall Élodie Bouchez ...,Roman Coppola,A young filmmaker in 1960s Paris juggles direc...,Comedy Drama Science Fiction,2001,"[-0.0043552410788834095, -0.016719359904527664...","[0.0024958874564617872, 0.005884502083063126, ...",0.814828
9116,tt2049543,Synchronicity,Chad McKnight Brianne Davis AJ Bowen Scott Poy...,Jacob Gentry,In this mind-bending 'Sci-Fi Noir' a daring ph...,Thriller Mystery Science Fiction,2015,"[-0.0016540847718715668, -0.018224244937300682...","[-0.01355673186480999, -0.02561151422560215, 0...",0.799714
3936,tt0216216,The 6th Day,Arnold Schwarzenegger Michael Rapaport Tony Go...,Roger Spottiswoode,Futuristic action about a man who meets a clon...,Action Mystery Science Fiction Thriller,2000,"[-0.015566646121442318, -0.03223968297243118, ...","[0.0033594027627259493, 0.0012303648982197046,...",0.797103
1792,tt0097100,Communion,Christopher Walken Lindsay Crouse Frances Ster...,Philippe Mora,A novelist's wife and son see him changed by a...,Drama Horror Science Fiction Thriller,1989,"[0.021728629246354103, -0.012683791108429432, ...","[0.01540224440395832, -0.007430095691233873, 0...",0.795096
4932,tt0366627,The Jacket,Adrien Brody Keira Knightley Kris Kristofferso...,John Maybury,A military veteran goes on a journey into the ...,Drama Mystery Thriller Fantasy,2005,"[0.0058812107890844345, -0.02817738987505436, ...","[-0.016100479289889336, -0.0138565544039011, -...",0.793296


In [43]:
results = quick_search(df, "Voglio vedere dei film musicali", top_n=5)

Unnamed: 0,imdb_id,title,cast,director,description,genres,year,embed_overview,embed_title,cosine_similarity
2883,tt0115899,Il ciclone,Leonardo Pieraccioni Lorena Forteza Barbara En...,Leonardo Pieraccioni,La vita di una tranquilla famiglia toscana di ...,Drama Comedy Romance Foreign,1996,"[0.0005137093830853701, -0.008648240938782692,...","[-0.012413086369633675, -0.006416610442101955,...",0.825148
3389,tt0120910,Fantasia 2000,Steve Martin Quincy Jones Bette Midler James E...,Paul Brizzi Hendel Butoy Francis Glebas Eric G...,An update of the original film with new interp...,Music Animation Family Fantasy,1999,"[-0.02040269412100315, -0.01043155137449503, -...","[0.017556900158524513, -0.0321248397231102, -0...",0.803908
1183,tt0086850,Acqua e sapone,Carlo Verdone Natasha Hovey Glenn Saxson Elena...,Carlo Verdone,La madre-manager di una giovanissima modella a...,Drama Family,1983,"[-0.011183799244463444, -0.019681930541992188,...","[0.024926507845520973, 0.004666802939027548, 0...",0.803762
2833,tt0114844,Viaggi di nozze,Carlo Verdone Claudia Gerini Veronica Pivetti ...,Carlo Verdone,Le vicessitudini di tre coppie di novelli spos...,Non available,1995,"[-0.008751184679567814, -0.01806781068444252, ...","[-0.009643185883760452, -0.012746378779411316,...",0.79761
1515,tt0092276,Yuppies 2,Massimo Boldi Jerry Calà Christian De Sica Ez...,Enrico Oldoini,Continuano le avventure degli yuppies milanesi...,Comedy,1986,"[-0.008874254301190376, -0.03015108034014702, ...","[-0.0203807782381773, -0.029809894040226936, -...",0.797188


In [44]:
results = quick_search(df, "音楽映画が観たい", top_n=5)

Unnamed: 0,imdb_id,title,cast,director,description,genres,year,embed_overview,embed_title,cosine_similarity
3389,tt0120910,Fantasia 2000,Steve Martin Quincy Jones Bette Midler James E...,Paul Brizzi Hendel Butoy Francis Glebas Eric G...,An update of the original film with new interp...,Music Animation Family Fantasy,1999,"[-0.02040269412100315, -0.01043155137449503, -...","[0.017556900158524513, -0.0321248397231102, -0...",0.80374
4961,tt0368667,Interstella 5555 The 5tory of the 5ecret 5tar ...,Romanthony,Kazuhisa Takenôchi,A sci-fi anime House-musical movie collaborati...,Animation Science Fiction Music,2003,"[0.017697621136903763, -0.011214802972972393, ...","[0.004670271649956703, -0.014423966407775879, ...",0.787416
4704,tt0335266,Lost in Translation,Bill Murray Scarlett Johansson Anna Faris Giov...,Sofia Coppola,Two lost souls visiting Tokyo -- the young neg...,Drama,2003,"[0.005371313542127609, -0.017871160060167313, ...","[-0.013685127720236778, -0.029664143919944763,...",0.784041
2971,tt0116922,Lost Highway,Bill Pullman Patricia Arquette John Roselius L...,David Lynch,A tormented jazz musician finds himself lost i...,Drama Thriller Mystery,1997,"[0.0029709336813539267, -0.032269012182950974,...","[-0.008940287865698338, -0.01946103200316429, ...",0.782554
5209,tt0397535,Memoirs of a Geisha,Zhang Ziyi Gong Li Youki Kudoh Tsai Chin Suzuk...,Rob Marshall,A sweeping romantic epic set in Japan in the y...,Drama History Romance,2005,"[-0.02323296293616295, -0.024124089628458023, ...","[-0.022762073203921318, -0.011697894893586636,...",0.778716
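
`quick_search` scores every row and then sorts the whole dataframe. The same brute-force idea can be sketched without pandas, keeping only the best `top_n` scores with `heapq.nlargest`. The vectors below are made-up three-dimensional stand-ins for real embeddings, and `top_n_similar` is a hypothetical helper:

```python
import heapq
import math


def cosine(v1, v2):
    """Cosine similarity between two vectors of the same length."""
    dot_product = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    return dot_product / (norm1 * norm2)


def top_n_similar(query_vector, items, top_n=3):
    """
    Brute-force nearest neighbours.
    items: iterable of (label, vector) pairs
    Returns the top_n (score, label) pairs, best first.
    """
    scored = ((cosine(query_vector, vector), label) for label, vector in items)
    return heapq.nlargest(top_n, scored)


toy_index = [
    ("robots", [0.9, 0.1, 0.0]),
    ("spies", [0.1, 0.9, 0.0]),
    ("musicals", [0.0, 0.1, 0.9]),
]
for score, label in top_n_similar([1.0, 0.2, 0.0], toy_index, top_n=2):
    print(label, round(score, 3))
```

This exhaustive scan is fine for roughly ten thousand rows; the HNSW index configured later in this notebook trades exactness for speed on larger collections.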


## 5. Azure Cognitive Search functions

In [45]:
def delete_index(index_name):
    """
    Deleting an Azure Cognitive Search index
    Input: Azure Cognitive Search index
    Output: None
    """
    start = time.time()
    search_client = SearchIndexClient(
        endpoint=acs_endpoint, credential=AzureKeyCredential(acs_key)
    )
    
    try:
        print("Deleting the Azure Cognitive Search index:", index_name)
        search_client.delete_index(index_name)
        print("Done. Elapsed time:", round(time.time() - start, 2), "secs")
    except Exception as e:
        print("Cannot delete index. Check the index name. Error:", e)

In [46]:
def index_stats(index_name):
    """
    Get statistics about Azure Cognitive Search index
    Input: Azure Cognitive Search index
    Output: Get Azure Cognitive Search index stats
    """
    url = (
        acs_endpoint
        + "/indexes/"
        + index_name
        + "/stats?api-version=2021-04-30-Preview"
    )
    headers = {
        "Content-Type": "application/json",
        "api-key": acs_key,
    }
    response = requests.get(url, headers=headers)
    print("Azure Cognitive Search index status for:", index_name, "\n")

    if response.status_code == 200:
        res = response.json()
        print(json.dumps(res, indent=2))
        document_count = res["documentCount"]
        storage_size = res["storageSize"]

    else:
        print("Request failed with status code:", response.status_code)
        # Avoid returning undefined names when the request fails
        document_count, storage_size = None, None

    return document_count, storage_size

In [47]:
def index_status(index_name):
    """
    Azure Cognitive Search index status
    Input: Azure Cognitive Search index
    Output: Get Azure Cognitive Search index status
    """
    print("Azure Cognitive Search Index:", index_name, "\n")

    headers = {"Content-Type": "application/json", "api-key": acs_key}
    params = {"api-version": "2021-04-30-Preview"}
    response = requests.get(
        acs_endpoint + "/indexes/" + index_name, headers=headers, params=params
    )

    if response.status_code == 200:
        print(json.dumps(response.json(), indent=5))
    else:
        print("Request failed with status code:", response.status_code)

## 6. Creating an Azure Cognitive Search index

In [48]:
try:
    # Setting the Azure Cognitive Search client
    print("Setting the Azure Cognitive Search client")
    search_client = SearchIndexClient(
        endpoint=acs_endpoint,
        credential=AzureKeyCredential(acs_key)
    )
    print("Done. Azure Cognitive Search client defined.")
    print(search_client)

except:
    print("Request failed. Cannot create Azure Cognitive Search client:", acs_endpoint)

Setting the Azure Cognitive Search client
Done. Azure Cognitive Search client defined.
<azure.search.documents.indexes._search_index_client.SearchIndexClient object at 0x7fbee4cb70a0>


### Removing any existing index

In [49]:
delete_index(index_name)

Deleting the Azure Cognitive Search index: moviereview
Done. Elapsed time: 0.86 secs


### Creating search index

In [50]:
vector_search_dim = len(openai_text_embeddings("Hello"))
print("Vector embeddings size =", vector_search_dim)

Vector embeddings size = 1536


In [51]:
# Create a search index
index_client = SearchIndexClient(
    endpoint=acs_endpoint, credential=AzureKeyCredential(acs_key)
)
fields = [
    # Index
    SimpleField(
        name="imdb_id",
        type=SearchFieldDataType.String,
        key=True,
        sortable=True,
        filterable=True,
        facetable=True,
    ),
    # Searchable fields
    SearchableField(name="title", type=SearchFieldDataType.String, filterable=True),
    SearchableField(name="cast", type=SearchFieldDataType.String, filterable=True),
    SearchableField(name="director", type=SearchFieldDataType.String, filterable=True),
    SearchableField(name="description", type=SearchFieldDataType.String),
    SearchableField(name="genres", type=SearchFieldDataType.String, filterable=True),
    SearchableField(name="year", type=SearchFieldDataType.String, filterable=True),
    # Vector embeddings
    SearchField(
        name="embed_overview",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=vector_search_dim,
        vector_search_configuration="my-vector-config",
    ),
    SearchField(
        name="embed_title",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=vector_search_dim,
        vector_search_configuration="my-vector-config",
    ),
]


# Configuration
vector_search = VectorSearch(
    algorithm_configurations=[
        HnswVectorSearchAlgorithmConfiguration(
            name="my-vector-config",
            # HNSW is a graph-based Approximate Nearest Neighbors (ANN)
            # algorithm optimized for high-recall, low-latency applications
            kind="hnsw",
            parameters={
                "m": 4,
                "efConstruction": 400,
                "efSearch": 500,
                "metric": "cosine",  # Cosine similarity metric
            },
        )
    ]
)

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=PrioritizedFields(
        title_field=SemanticField(field_name="title"),
        prioritized_keywords_fields=[SemanticField(field_name="genres")],
        prioritized_content_fields=[SemanticField(field_name="description")],
    ),
)

# Create the semantic settings with the configuration
semantic_settings = SemanticSettings(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search,
    semantic_settings=semantic_settings,
)

try:
    result = index_client.create_or_update_index(index)
    print(f"Done. The {result.name} Azure Cognitive Search index has been created!")

except Exception as e:
    # "result" is undefined if the call above failed, so report index_name instead
    print(f"Error. The {index_name} Azure Cognitive Search index cannot be created: {e}")

Done. The moviereview Azure Cognitive Search index has been created!


## 7. Uploading the documents into the index

In [52]:
print("Number of documents to load =", len(documents))

Number of documents to load = 10784


In [53]:
def upload_documents(docs):
    """
    Uploading documents into the Azure Cognitive Search index
    Inputs: documents
    Outputs: loading documents to Azure Cognitive Search index
    """
    search_client = SearchClient(
        endpoint=acs_endpoint,
        index_name=index_name,
        credential=AzureKeyCredential(acs_key),
    )
    result = search_client.upload_documents(docs)

    return result

In [54]:
def chunk_list(input_list, chunk_size):
    """
    Chunk a list according to the chunk_size value
    Inputs: documents (list), chunk size list
    Outputs: chunk list of documents
    """
    return [
        input_list[i : i + chunk_size] for i in range(0, len(input_list), chunk_size)
    ]
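
As a quick check of the slicing logic, here is a self-contained sketch that applies the same list comprehension to a list of integers standing in for the 10,784 documents, with the chunk size of 500 used in the upload cell:

```python
def chunk_list(input_list, chunk_size):
    """Split input_list into consecutive slices of at most chunk_size items."""
    return [
        input_list[i : i + chunk_size] for i in range(0, len(input_list), chunk_size)
    ]


fake_documents = list(range(10784))  # integers standing in for the document dicts
chunks = chunk_list(fake_documents, 500)

print("Number of chunks =", len(chunks))
print("Size of the last chunk =", len(chunks[-1]))
```

With 10,784 items this gives 22 chunks, the last one holding the remaining 284 items.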

In [55]:
start = time.time()

chunk_size = 500  # We will load documents chunk by chunk
chunks = chunk_list(documents, chunk_size)
idx = 1

print("Loading the documents into the Azure Cognitive Search index...")
print("Total number of documents to load =", len(documents))
print()

loaded_docs = 0

for chunk in chunks:
    # Count the documents in this chunk (the last chunk may be smaller)
    loaded_docs += len(chunk)
    pct_done = round(loaded_docs / len(documents) * 100)

    print(
        f"Processing chunk {idx:03}",
        f"| Number of loaded documents = {loaded_docs:06}",
        "of",
        len(documents),
        "| Done:",
        pct_done,
        "%",
    )
    upload_documents(chunk)
    idx += 1

elapsed = time.time() - start
print("\nDone")
print(
    "Elapsed time: "
    + time.strftime(
        "%H:%M:%S.{}".format(str(elapsed % 1)[2:])[:15], time.gmtime(elapsed)
    )
)

Loading the documents into the Azure Cognitive Search index...
Total number of documents to load = 10784

Processing chunk 001 | Number of loaded documents = 000500 of 10784 | Done: 5 %
Processing chunk 002 | Number of loaded documents = 001000 of 10784 | Done: 9 %
Processing chunk 003 | Number of loaded documents = 001500 of 10784 | Done: 14 %
Processing chunk 004 | Number of loaded documents = 002000 of 10784 | Done: 19 %
Processing chunk 005 | Number of loaded documents = 002500 of 10784 | Done: 23 %
Processing chunk 006 | Number of loaded documents = 003000 of 10784 | Done: 28 %
Processing chunk 007 | Number of loaded documents = 003500 of 10784 | Done: 32 %
Processing chunk 008 | Number of loaded documents = 004000 of 10784 | Done: 37 %
Processing chunk 009 | Number of loaded documents = 004500 of 10784 | Done: 42 %
Processing chunk 010 | Number of loaded documents = 005000 of 10784 | Done: 46 %
Processing chunk 011 | Number of loaded documents = 005500 of 10784 | Done: 51 %
Proce

In [56]:
print(f"Elapsed time to process {len(documents)} documents = {round(elapsed)} seconds")
print(f"Time per processed document in second = {round(elapsed / len(documents), 5)}")
print(f"Number of processed documents per second = {int(len(documents) / elapsed)}")

Elapsed time to process 10784 documents = 209 seconds
Time per processed document in second = 0.01939
Number of processed documents per second = 51


## 8. Azure Cognitive Search index information

In [57]:
index_name

'moviereview'

In [58]:
index_status(index_name)

Azure Cognitive Search Index: moviereview 

{
     "@odata.context": "https://azurecogsearcheastussr.search.windows.net/$metadata#indexes/$entity",
     "@odata.etag": "\"0x8DBAE03DD8B3409\"",
     "name": "moviereview",
     "defaultScoringProfile": null,
     "fields": [
          {
               "name": "imdb_id",
               "type": "Edm.String",
               "searchable": false,
               "filterable": true,
               "retrievable": true,
               "sortable": true,
               "facetable": true,
               "key": true,
               "indexAnalyzer": null,
               "searchAnalyzer": null,
               "analyzer": null,
               "normalizer": null,
               "synonymMaps": []
          },
          {
               "name": "title",
               "type": "Edm.String",
               "searchable": true,
               "filterable": true,
               "retrievable": true,
               "sortable": false,
               "facetable": f

In [59]:
document_count, storage_size = index_stats(index_name)

Azure Cognitive Search index status for: moviereview 

{
  "@odata.context": "https://azurecogsearcheastussr.search.windows.net/$metadata#Microsoft.Azure.Search.V2021_04_30_Preview.IndexStatistics",
  "documentCount": 9000,
  "storageSize": 472209194
}


In [60]:
print("Number of documents in the index =", f"{document_count:,}")
print("Size of the index =", round(storage_size / (1024 * 1024), 2), "MB")

Number of documents in the index = 9,000
Size of the index = 450.33 MB


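Note: Index statistics are not updated in real time, so the document count (9,000 here, against 10,784 uploaded) and the storage size can lag behind the upload. Wait a few minutes and run the two cells above again to see the updated figures.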
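
One way to wait for the statistics to catch up is to poll until the reported count reaches the number of uploaded documents. A minimal sketch, assuming a zero-argument callable that returns the current document count (in this notebook it could wrap `index_stats(index_name)[0]`); `wait_for_document_count` and its parameters are hypothetical names:

```python
import time


def wait_for_document_count(get_count, expected, timeout=300.0, interval=10.0, sleep=time.sleep):
    """
    Poll get_count() until it returns at least `expected` or the timeout expires.
    Returns the last observed count.
    """
    deadline = time.monotonic() + timeout
    count = get_count()
    while count < expected and time.monotonic() < deadline:
        sleep(interval)
        count = get_count()
    return count


# Stand-in for the real stats call: the count grows on each poll
observed = iter([9000, 10000, 10784])
final_count = wait_for_document_count(
    lambda: next(observed), expected=10784, interval=0.0
)
print("Number of documents in the index =", f"{final_count:,}")
```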

> Go to the next notebook