<a href="https://colab.research.google.com/github/umermansoor/EmbeddingSemanticSearchDemo/blob/main/Embeddings101UM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Word Embeddings

Word embeddings are a really cool way of converting words and phrases into numerical vectors (think of each word as occupying a point in a multidimensional space). The magic behind the idea is that *similar* words, like "cheeseburger" and "hamburger," have **closely related vector representations**. This proximity in the vector space reflects their *semantic relationship*, which helps algorithms "understand" the meaning behind words more effectively.

One fascinating use case is "**Semantic Search**." In a nutshell, semantic search **finds** words or phrases by looking at the vector representation of the words and finding those that are **close** together in that multi-dimensional space.

How's this better compared to traditional text-based search? Suppose you're searching for articles about "climate change." In that case, a semantic search engine could also fetch documents discussing "global warming" or "greenhouse effect," as these phrases are semantically related. That's the magic!
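To make the contrast concrete, here is a minimal sketch of a naive keyword search. The three document titles are made up for illustration. Because it only matches documents that literally contain the query text, the two related documents are missed.

```python
# Hypothetical document titles, invented for illustration.
documents = [
    "Climate change and rising sea levels",
    "How global warming affects crop yields",
    "The greenhouse effect explained",
]

def keyword_search(query, docs):
    """Return the documents that literally contain the query text (case-insensitive)."""
    q = query.lower()
    return [d for d in docs if q in d.lower()]

matches = keyword_search("climate change", documents)
print(matches)  # only the first title contains the exact phrase
```

A semantic search would instead compare vectors, so the "global warming" and "greenhouse effect" titles could also rank highly.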

Let's get hands-on and build our own semantic search engine using word embeddings!

Let's start by installing the `openai` Python library.



In [7]:
!pip install openai -q


## Setup our Semantic Search App

Now let's start setting up our application. To generate word embeddings, we use specialized algorithms that learn to represent words in vector space based on the contexts they appear in. One common algorithm for generating word embeddings is called Word2Vec. We'll use OpenAI's API to generate our vectors; it uses an underlying transformer architecture and puts embeddings on steroids!

Run the code below, enter your OpenAI **API key** when *prompted*, and hit Enter when done. (Google it if you don't know how to retrieve it.)

In [8]:
import pandas as pd
import numpy as np
from getpass import getpass
import openai

openai.api_key = getpass()

··········


## Build our Search Dataset

Okay, now let's create the dataset that our tiny search engine will store and that we'll search against. I'm choosing some random words (feel free to modify them as you play with it).

In [9]:
l = ["hamburger", "cheeseburger", "blue", "fries", "vancouver", "karachi", "acura", "car", "weather", "biryani"]
dataset = pd.DataFrame(l, columns=['term'])
print(dataset)

           term
0     hamburger
1  cheeseburger
2          blue
3         fries
4     vancouver
5       karachi
6         acura
7           car
8       weather
9       biryani


## Convert Search Dataset to Vectors Using Embeddings

Awesome! Now we need to convert each word in our search dataset into an embedding and store those in our search engine! As mentioned above, we'll use OpenAI to calculate embeddings. The model I'm using is `text-embedding-ada-002`, which is 99.8% cheaper than DaVinci, so you have plenty of room to play around without worrying about costs!

Let's convert our 'raw' search dataset to embeddings.

In [10]:
from openai.embeddings_utils import get_embedding

dataset['embedding'] = dataset['term'].apply(
    lambda x: get_embedding(x, engine='text-embedding-ada-002')
)

# print terms and their embeddings side by side
print(dataset)

           term                                          embedding
0     hamburger  [-0.01317964494228363, -0.001876765862107277, ...
1  cheeseburger  [-0.01824556663632393, 0.00504859397187829, 0....
2          blue  [0.005490605719387531, -0.007445123512297869, ...
3         fries  [0.01848343200981617, -0.030745232477784157, -...
4     vancouver  [-0.011030120775103569, -0.023991534486413002,...
5       karachi  [-0.004611444193869829, -0.001336810179054737,...
6         acura  [0.0055086081847548485, 0.013021569699048996, ...
7           car  [-0.007495860103517771, -0.021644126623868942,...
8       weather  [0.011580432765185833, -0.013912283815443516, ...
9       biryani  [-0.009054498746991158, -0.015499519184231758,...
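Note that `openai.embeddings_utils` was removed in version 1.0 of the `openai` library, so the cell above needs an older release. As a hedged sketch, the same step with the newer client style could look like the helper below. `embed_terms` is a hypothetical helper name, and `client` is assumed to be an `openai.OpenAI()` instance; check the documentation for your installed version before relying on it.

```python
def embed_terms(client, terms, model="text-embedding-ada-002"):
    """Return one embedding (a list of floats) per term, in input order.

    Assumes `client` is an openai.OpenAI() instance (openai >= 1.0).
    """
    # The embeddings endpoint accepts a list, so all terms go in one request.
    response = client.embeddings.create(model=model, input=list(terms))
    # Each returned item carries an index; sort by it so output order matches input order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

# Usage (needs an API key and network access, so it is left commented out):
# from openai import OpenAI
# client = OpenAI(api_key=getpass())
# dataset['embedding'] = embed_terms(client, dataset['term'])
```

Sending all terms in a single request also avoids one API call per row.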


What's next? We have our search dataset as embeddings. Without further ado, let's start searching against it!

But first let's convert our embeddings into numpy arrays so we can run math on them, such as finding nearby words in our multidimensional space.

In [11]:
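# convert each embedding to a numpy array so we can run some calculations on it
# (assign the result back; .apply() alone does not modify the DataFrame)
dataset['embedding'] = dataset['embedding'].apply(np.array)
dataset['embedding']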

0    [-0.01317964494228363, -0.001876765862107277, ...
1    [-0.01824556663632393, 0.00504859397187829, 0....
2    [0.005490605719387531, -0.007445123512297869, ...
3    [0.01848343200981617, -0.030745232477784157, -...
4    [-0.011030120775103569, -0.023991534486413002,...
5    [-0.004611444193869829, -0.001336810179054737,...
6    [0.0055086081847548485, 0.013021569699048996, ...
7    [-0.007495860103517771, -0.021644126623868942,...
8    [0.011580432765185833, -0.013912283815443516, ...
9    [-0.009054498746991158, -0.015499519184231758,...
Name: embedding, dtype: object
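One way to "run math" on the embeddings is to stack them into a single 2-D matrix with one row per term. The sketch below uses tiny made-up 3-dimensional vectors in place of the real embeddings (which are much longer), so the numbers are illustrative only.

```python
import numpy as np

# Made-up vectors standing in for dataset['embedding'].to_list().
toy_embeddings = [
    [0.9, 0.1, 0.0],   # "hamburger"
    [0.8, 0.2, 0.1],   # "cheeseburger"
    [0.0, 0.1, 0.9],   # "car"
]

# Stack the per-term vectors into one (n_terms, n_dims) array.
matrix = np.vstack(toy_embeddings)
print(matrix.shape)

# Vectorized math now works across all terms at once, e.g. each vector's length.
norms = np.linalg.norm(matrix, axis=1)
print(norms)
```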

## Start Searching 🚀

We are ready to start searching. Enter a search term in the prompt below (e.g. "big mac") and hit Enter when done.

In [12]:
keyword = input('What do you want to search today? ')

What do you want to search today? big mac


Now that we have our search keyword, we can start searching. But remember, we have vector representations in our dataset. So we'll need to convert the search keyword to its vector representation and then use that representation for search. Let's get to it.

In [13]:
from openai.embeddings_utils import get_embedding

keywordVector = get_embedding(
    keyword, engine="text-embedding-ada-002"
)

# print the embedding of our keyword
print(keywordVector)

[-0.025127442553639412, -0.022262267768383026, -0.011588485911488533, -0.019840994849801064, -0.010122270323336124, -0.001772237941622734, -0.027898456901311874, -0.016800951212644577, -0.022571653127670288, -0.0188321303576231, 0.023701582103967667, 0.03570033982396126, 0.008239056915044785, -0.004264132585376501, -0.008003655821084976, 0.028032971546053886, 0.03685716912150383, -0.0027508363127708435, 0.00852153915911913, -0.01420480664819479, -0.010431654751300812, 0.010095367208123207, -0.016679886728525162, -0.024401061236858368, 0.01584589295089245, -0.006362569984048605, 0.0010475373128429055, -0.007472320459783077, 0.009214292280375957, -0.006174248643219471, 0.015993859618902206, -0.01815955527126789, -0.024051321670413017, 0.004701307043433189, -0.014984995126724243, 0.012442657724022865, 0.001559535856358707, -0.004771927371621132, 0.026566755026578903, -0.00362518522888422, 0.014702512882649899, 0.005185561720281839, 0.00414643157273531, -0.003554564667865634, -0.0111445859

Okay, now we are REALLY ready to start searching... semantically searching! We have both our search dataset and our keyword as vectors!

The semantic search process is really simple: we just need to find the vectors in the multidimensional space that are closest to our keyword. To do this, we'll use a measure called **cosine similarity**, which is the cosine of the angle between two vectors: the closer it is to 1, the more similar their directions.
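For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal numpy sketch follows; the vectors are made up, and `cosine_sim` is only an illustrative name (it is not the library's implementation).

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])  # parallel vectors
perpendicular = cosine_sim([1.0, 0.0], [0.0, 1.0])   # vectors at a right angle
print(same_direction, perpendicular)
```

Parallel vectors score approximately 1 and perpendicular vectors score 0, regardless of how long the vectors are.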

Our search algorithm is simple: take the search keyword vector, run through our search dataset, and calculate how similar each stored vector is to it.

Let's print the top 10 matching results!

In [14]:
from openai.embeddings_utils import cosine_similarity

dataset["distance"] = dataset['embedding'].apply(
    lambda x: cosine_similarity(x, keywordVector)
)

# "distance" holds cosine similarity, so higher means closer: sort descending
dataset.sort_values(
    "distance",
    ascending=False
).head(10)

           term                                          embedding  distance
0     hamburger  [-0.01317964494228363, -0.001876765862107277, ...  0.853306
1  cheeseburger  [-0.01824556663632393, 0.00504859397187829, 0....  0.841594
3         fries  [0.01848343200981617, -0.030745232477784157, -...  0.823209
4     vancouver  [-0.011030120775103569, -0.023991534486413002,...  0.784345
2          blue  [0.005490605719387531, -0.007445123512297869, ...  0.783165
9       biryani  [-0.009054498746991158, -0.015499519184231758,...  0.781528
5       karachi  [-0.004611444193869829, -0.001336810179054737,...  0.781077
7           car  [-0.007495860103517771, -0.021644126623868942,...  0.779868
8       weather  [0.011580432765185833, -0.013912283815443516, ...  0.772024
6         acura  [0.0055086081847548485, 0.013021569699048996, ...  0.761125


In the search results above, I used the keyword "*big mac*" and my top 3 matches were "*hamburger, cheeseburger, fries*". That's semantic search in action!
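The search step can also be wrapped in one small function. This is a sketch with made-up 2-dimensional vectors rather than real embeddings, and `top_matches` is a hypothetical helper, not part of the `openai` library.

```python
import numpy as np

def top_matches(query_vec, terms, vectors, k=3):
    """Return the k (term, similarity) pairs most similar to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(vectors, dtype=float)
    # Cosine similarity of the query against every row at once.
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    # argsort is ascending, so reverse it to put the highest scores first.
    best = np.argsort(sims)[::-1][:k]
    return [(terms[i], float(sims[i])) for i in best]

terms = ["hamburger", "car", "fries"]
vectors = [[1.0, 0.1], [0.0, 1.0], [0.8, 0.3]]  # illustrative values only
results = top_matches([1.0, 0.0], terms, vectors, k=2)
print([term for term, _ in results])
```

With real data, `vectors` would be the stored embeddings and `query_vec` the embedding of the search keyword.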

Play with the code. Enter city names, colors, or car brands, update the search dataset with phrases, and see how it finds the best match.