<div style="background-color: #1B1A21; text-align: right; margin-bottom: -1px">
    <img src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/singlestore-banner.png" style="padding: 0px; padding-right: 20px; margin: 0px; padding-top: 20px; height: 60px"/>
    <img src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/banner-colors.png" style="width:100%; height: 50px; padding: 0px; margin: 0px; margin-bottom: -8px"/>
</div>

# Semantic Search with OpenAI Embedding Creation

In this notebook, we demonstrate semantic search on SingleStoreDB with SQL! Unlike traditional keyword-based search methods, semantic search algorithms take into account the relationships between words and their meanings, enabling them to deliver more accurate and relevant results, even when search terms are vague or ambiguous.

SingleStoreDB’s built-in parallelization and Intel SIMD-based vector processing take care of the heavy lifting involved in processing vector data. This allows you to run your ML algorithms right in your database extremely efficiently with just a couple of lines of SQL!


In this example, we use the OpenAI embeddings API to create embeddings for our dataset and run semantic search using the `DOT_PRODUCT` vector matching function!

## 1. Create a workspace in your workspace group

An S-00 workspace is sufficient for this example.

## 2. Create a Database named `semantic_search`

In [8]:
%%sql
DROP DATABASE IF EXISTS semantic_search;

CREATE DATABASE semantic_search;



<div class="alert alert-block alert-danger" style="font-size: 150%; font-weight: bold">
    <p style="float: left; padding-right: 20px; padding-left: 10px"><img src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/caution.png" style="height: 55px; vertical-align: middle"/></p>
    <p>Make sure to select the <tt style="font-size: 80%">semantic_search</tt> database from the drop-down menu at the top of this notebook.
    It updates the <tt style="font-size: 80%">connection_url</tt> to connect to that database.</p>
</div>

## 3. Install and import required libraries

We will use the OpenAI embeddings API and will need to import the relevant dependencies accordingly. 

In [12]:
# `openai.embeddings_utils` was removed in openai 1.0, so pin an earlier release.
!pip3 install "openai<1" matplotlib plotly pandas scipy scikit-learn requests --quiet

import json
import os

import openai
import requests
from openai.embeddings_utils import get_embedding

## 4. Create an OpenAI account and get API connection details

To vectorize and embed the employee reviews and query strings, we leverage OpenAI's embeddings API. To use this API, you will need an API key, which you can get [here](https://platform.openai.com/account/api-keys). You'll need to add a payment method to actually get vector embeddings using the API, though the charges are minimal for a small example like the one presented here.

<div class="alert alert-block alert-danger" style="font-size: 150%; font-weight: bold">
    <p style="float: left; padding-right: 20px; padding-left: 10px"><img src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/caution.png" style="height: 55px; vertical-align: middle"/></p>
    <p>You will have to update your notebook's firewall settings to include <tt style="font-size: 90%">*.*.openai.com</tt> in order to get embeddings from OpenAI APIs.</p>
</div>

In [39]:
openai.api_key = '<OPEN_AI_API_KEY>'
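Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the key has been exported in an environment variable named `OPENAI_API_KEY`, it could be read at runtime instead. The variable name and the `load_openai_key` helper are illustrative assumptions, not part of the original notebook.

```python
import os


def load_openai_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing.

    Both this helper and the default variable name are assumptions
    made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage in the notebook (after `import openai`):
# openai.api_key = load_openai_key()
```

Raising early when the variable is missing gives a clearer error than a failed API call later in the notebook.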

## 5. Create a new table in your database called `reviews`

In [22]:
%%sql
CREATE TABLE reviews (
    date_review VARCHAR(255), 
    job_title VARCHAR(255), 
    location VARCHAR(255), 
    review TEXT
);



## 6. Import our sample data into your table

This dataset has 15 reviews left by anonymous employees of a firm.

In [23]:
url = 'https://raw.githubusercontent.com/singlestore-labs/singlestoredb-samples/main/' + \
      'Tutorials/ai-powered-semantic-search/hr_sample_data.sql'

Note that we are using the `%sql` magic command here to run a query against the currently
selected database.

In [26]:
for query in [x for x in requests.get(url).text.split('\n') if x.strip()]:
    %sql {{query}}
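The loop above treats every non-blank line of the downloaded file as one complete SQL statement. A minimal sketch of that filtering step in isolation, using a hypothetical helper `split_statements` and a made-up script rather than the real sample data, might look like this:

```python
def split_statements(sql_text: str) -> list[str]:
    """Return the non-blank lines of a SQL script, one statement per line.

    Assumes, as the loop above does, that no statement spans several
    lines. Unlike the loop above, this sketch also strips surrounding
    whitespace from each line.
    """
    return [line.strip() for line in sql_text.splitlines() if line.strip()]


# Made-up script for illustration; not the real sample data.
script = "INSERT INTO reviews VALUES ('a', 'b', 'c', 'd');\n\n  \nSELECT 1;\n"
statements = split_statements(script)
print(statements)
```

A script with multi-line statements would need a real SQL splitter; line-based splitting only works because of the one-statement-per-line layout.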

## 7. Add vector embeddings for each review

To embed the reviews in our SingleStoreDB database, we add a new column called `embeddings`, iterate through each row in the table, call OpenAI’s embeddings API with the text in the `review` column, and store the result in `embeddings` for that row.

In [29]:
%sql ALTER TABLE reviews ADD embeddings BLOB;

reviews = %sql SELECT review FROM reviews;

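# Embed each review and store the packed vector in its row.
# Matching rows on the review text assumes the text is unique and contains no unescaped single quotes.
for i in reviews:
    review_embedding = json.dumps(get_embedding(i[0], engine="text-embedding-ada-002"))
    %sql UPDATE reviews SET embeddings = JSON_ARRAY_PACK('{{review_embedding}}') WHERE review='{{i[0]}}';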
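For intuition about what ends up in the `embeddings` column: `JSON_ARRAY_PACK` turns a JSON array of numbers into a packed binary value. The sketch below mimics that encoding in plain Python, assuming 32-bit little-endian floats. That layout is an assumption about the default format, so check the SingleStoreDB documentation before relying on it. The helper names `pack_f32` and `unpack_f32` are hypothetical.

```python
import json
import struct


def pack_f32(json_array: str) -> bytes:
    """Pack a JSON array of numbers as consecutive 32-bit floats.

    Assumption for this sketch: little-endian float32, 4 bytes per element.
    """
    values = json.loads(json_array)
    return struct.pack(f"<{len(values)}f", *values)


def unpack_f32(blob: bytes) -> list[float]:
    """Inverse of pack_f32: recover the list of floats from the bytes."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


# Made-up vector whose values are exactly representable as float32.
vector = [0.5, -0.25, 0.125]
blob = pack_f32(json.dumps(vector))
print(len(blob))         # 12 (4 bytes per element)
print(unpack_f32(blob))  # [0.5, -0.25, 0.125]
```

Under the same assumption, a 1536-dimensional `text-embedding-ada-002` vector would occupy 6144 bytes in the `BLOB` column.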

## 8. Run the semantic search algorithm with just one line of SQL

We will utilize SingleStoreDB's distributed architecture to efficiently compute the dot product of the embedding of the input string (stored in `searchstring`) with the embedding of each review in the database and return the 5 reviews with the highest dot product scores. Each embedding vector is normalized to length 1, so the dot product is equal to the cosine similarity between the two vectors, an appropriate nearness metric. SingleStoreDB makes this extremely fast because it compiles queries to machine code and runs `DOT_PRODUCT` using SIMD instructions.
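To make the claim about unit-length vectors concrete, here is a small sketch in plain Python with made-up three-dimensional vectors (real `text-embedding-ada-002` embeddings have 1536 dimensions). The helpers `dot`, `normalize`, `cosine_similarity`, and `top_k` are illustrative and not part of the notebook; `top_k` mirrors what `ORDER BY Score DESC LIMIT 5` does in the query that follows.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to length 1."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k(query, rows, k=5):
    """Rank (label, vector) rows by dot product with the query, highest first."""
    scored = [(label, dot(vec, query)) for label, vec in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up vectors for illustration; not real embeddings.
rows = [
    ("pay", normalize([1.0, 0.2, 0.0])),
    ("commute", normalize([0.0, 1.0, 0.3])),
    ("training", normalize([0.6, 0.1, 0.8])),
]
query = normalize([0.9, 0.1, 0.1])

for label, vec in rows:
    # For unit vectors the two measures agree up to floating-point error.
    assert math.isclose(dot(vec, query), cosine_similarity(vec, query))

print([label for label, _ in top_k(query, rows, k=2)])  # ['pay', 'training']
```

Because the division by the vector lengths drops out for unit vectors, the database only needs the cheaper dot product to produce a cosine-similarity ranking.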

In [32]:
searchstring = input("Please enter a search string: ")

search_embedding = json.dumps(get_embedding(searchstring, engine="text-embedding-ada-002")) 

results = %sql SELECT review, DOT_PRODUCT(embeddings, JSON_ARRAY_PACK('{{search_embedding}}')) AS Score FROM reviews ORDER BY Score DESC LIMIT 5;

for i, res in enumerate(results):
    print(f'{i + 1}: {res[0]} Score: {res[1]}')

Please enter a search string:  test search


1: Some good people to work with. Flexible working. Out of hours language classes and aerobics. Morale. Lack of managerial structure. Doesnt seem to support career progression. No formal training Score: 0.7393652200698853
2: client reporting admin. Easy to get the job, Nice colleagues. Abysmal pay, around minimum wage. No actual training for your job role. No incentive to improve. Score: 0.735421359539032
3: While the work itself is satisfactory, the daily commute can be a bit of a hassle. Although the company offers no solution for this issue, it is not uncommon to have this challenge in larger cities. Overall, if you can tolerate the commute, it could be a suitable place to work. Score: 0.7307074069976807
4: Low salary, bad micromanagement. Easy to get the job even without experience in finance. Very low salary, poor working conditions, very little training provided but high expectations Score: 0.7303189635276794
5: The company provides an excellent career path with plenty of opportu

## 9. Clean up

In [35]:
%%sql
DROP DATABASE semantic_search;



<img src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/banner-colors-reverse.png" style="width: 100%; margin: 0; padding: 0"/>