Now with autocomplete suggestions!
Build your own text-based search engine with real-time smart autocomplete suggestions! The smart autocomplete takes into account spelling mistakes, word variations, and more! Search on precise text, keywords, and semantics of your own data - all powered by Rockset. This project implements the following standard text search concepts:
- Wildcard Techniques
- Tokenization
- Term Frequency (TF) with weights
- Ngrams
- Levenshtein Distance
- Vector Search
Check out these slides for more information on these concepts.
To build the autocomplete & text search on titles & keywords, you will need an account with Rockset. To build the semantic search (optional), you will need an account on both OpenAI and Rockset to get an API key for both platforms. Thankfully, API keys are available on the free versions of both platforms. To create an account on OpenAI go here and to create an account on Rockset go here.
Rockset already has a public dataset of book titles, descriptions, and embeddings! Follow the steps below to set up this collection correctly:
- In the Rockset Console, go to the "Collections" tab and then select "Create a Collection"
- Scroll down and select "Public Datasets"
- Click the "Book Embeddings Dataset" then "Start"
- Once the preview has loaded, click "Next"
- There is a default ingest transformation, but we'll need to make a few additions for our text search use case. The ingest transformation is a powerful tool available in Rockset. It allows you to execute a SQL query on all incoming data before it is stored in Rockset. For this project, we'll need to tokenize and create ngrams on the text we plan to search. Use the ingest transformation below:
SELECT
TOKENIZE(title, 'en_US') AS title_tokens, -- tokenizing the title
NGRAMS(LOWER(title), 3) AS title_ngrams, -- creating ngrams of the title
TOKENIZE(description, 'en_US') AS description_tokens, -- tokenizing the description
title,
series,
author,
TRY_CAST(rating as float) as rating,
description,
language,
TRY_CAST(isbn as integer) as isbn,
genres,
characters,
bookFormat as book_format,
edition,
TRY_CAST(pages as int) as page_count,
publisher,
publishDate as publish_date,
firstPublishDate as first_publish_date,
awards,
TRY_CAST(numRatings as int) as num_ratings,
ratingsByStars as ratings_by_stars,
TRY_CAST(likedPercent as float) as liked_percent,
setting,
coverImg as cover_image,
TRY_CAST(bbeScore as int) as bbe_score,
TRY_CAST(bbeVotes as int) as bbe_votes,
TRY_CAST(price as float) as price,
VECTOR_ENFORCE(embedding, 1536, 'float') as book_embedding
FROM
_input
where
title is not NULL
- On the next page, type a workspace name and collection name. I used workspace=Text-Search and collection=Books.
- The final step is to click "Create" and wait for the data to ingest into your Rockset collection. This will only take a few minutes.
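To get a feel for what the tokenized and ngram fields in the ingest transformation contain, here is a rough JavaScript sketch. It is only an approximation: `tokenize` and `charNgrams` are hypothetical helpers, and Rockset's `TOKENIZE` and `NGRAMS` may normalize text differently (the sketch assumes lowercase word tokens and character trigrams).

```javascript
// Rough illustration only; not Rockset's actual TOKENIZE/NGRAMS implementation.
function tokenize(text) {
  // Lowercase, then split on anything that is not a letter or a digit.
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

function charNgrams(text, n) {
  // Slide a window of n characters over the lowercased text.
  const s = text.toLowerCase();
  const grams = [];
  for (let i = 0; i + n <= s.length; i++) {
    grams.push(s.slice(i, i + n));
  }
  return grams;
}

const titleTokens = tokenize("The Hobbit");
const titleNgrams = charNgrams("Hobbit", 3);
const typoNgrams = charNgrams("Hobit", 3);

// Trigrams that a misspelled query still shares with the real title.
const shared = typoNgrams.filter((g) => titleNgrams.includes(g));

console.log(titleTokens, titleNgrams, shared);
```

Because a misspelled query still shares some trigrams with the correct title, matching on ngrams can surface titles that an exact token match would miss.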
To run semantic search on the embeddings in the public dataset, we need to build a special IVF index. This can be done with the following query:
CREATE
SIMILARITY INDEX text_search_book_embed
ON
FIELD "Text-Search".Books:book_embedding DIMENSION 1536 AS 'faiss::IVF256,Flat';
Run the query below to check the status of the index. Proceed when the status is Ready.
SELECT
index_status, *
FROM
_system.similarity_index
WHERE
name = 'text_search_book_embed'
For more information, check out Rockset's Vector Search documentation.
Rockset's patented Query Lambdas are named, parameterized SQL queries stored in Rockset that can be executed from a dedicated REST endpoint. In the Query Editor in Rockset, save the following SQL queries as Query Lambdas. We will call these later from our webpage. Each Query Lambda below requires you to create a parameter search_query of type string.
Save the following query as a Query Lambda named searchTitles under the Text-Search workspace:
script {{{
export function levenshteinDistance(str1, str2) {
// https://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows
const len1 = str1.length;
const len2 = str2.length;
let prevRow = Array.from({ length: len2 + 1 }, (_, i) => i);
for (let i = 1; i <= len1; i++) {
let currentRow = [i];
for (let j = 1; j <= len2; j++) {
const cost = str1[i - 1] === str2[j - 1] ? 0 : 1;
currentRow[j] = Math.min(
prevRow[j] + 1, // deletion
currentRow[j - 1] + 1, // insertion
prevRow[j - 1] + cost // substitution
);
}
prevRow = currentRow;
}
return prevRow[len2];
}
}}}
(
SELECT
title,
'exact' as score,
'exact' as distance,
'exact' as hybrid_score,
num_ratings
FROM
"Text-Search".Books
WHERE
LOWER(title) LIKE CONCAT(LOWER(:search_query), '%')
ORDER BY
num_ratings desc
LIMIT
10
)
UNION
(
SELECT
title,
score() as score,
_script.levenshteinDistance(title, :search_query) as distance,
score() - 0.05 * _script.levenshteinDistance(title, :search_query) as hybrid_score,
num_ratings
FROM
"Text-Search".Books
WHERE
search(
CONTAINS(title_tokens, :search_query),
BOOST(
0.5,
CONTAINS(
title_ngrams,
ARRAY_JOIN(NGRAMS(LOWER(:search_query), 3), ' ')
)
)
) OPTION(match_all = false)
ORDER BY
hybrid_score desc, num_ratings desc
LIMIT
10
)
ORDER BY
hybrid_score desc, num_ratings desc
LIMIT
10
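To see how the fuzzy branch of searchTitles ranks results, the sketch below reproduces the `score() - 0.05 * distance` calculation in plain JavaScript. The `hybridScore` helper and the search scores are made up for illustration; real scores come from Rockset's `score()` function. Note that the UDF compares the strings as given, so differences in letter case count toward the distance.

```javascript
// Same two-row Levenshtein algorithm as the UDF in the query, minus `export`.
function levenshteinDistance(str1, str2) {
  let prevRow = Array.from({ length: str2.length + 1 }, (_, i) => i);
  for (let i = 1; i <= str1.length; i++) {
    const currentRow = [i];
    for (let j = 1; j <= str2.length; j++) {
      const cost = str1[i - 1] === str2[j - 1] ? 0 : 1;
      currentRow[j] = Math.min(
        prevRow[j] + 1, // deletion
        currentRow[j - 1] + 1, // insertion
        prevRow[j - 1] + cost // substitution
      );
    }
    prevRow = currentRow;
  }
  return prevRow[str2.length];
}

// Hypothetical helper mirroring `score() - 0.05 * distance` from the query.
function hybridScore(searchScore, title, query) {
  return searchScore - 0.05 * levenshteinDistance(title, query);
}

// Made-up search scores, not real Rockset output.
const candidates = [
  { title: "Habits", searchScore: 1.0 },
  { title: "Hobbit", searchScore: 1.0 },
];

// Sort by hybrid score, highest first, as the query's ORDER BY does.
const ranked = candidates
  .map((c) => ({ ...c, hybrid: hybridScore(c.searchScore, c.title, "Hobit") }))
  .sort((a, b) => b.hybrid - a.hybrid);

console.log(ranked.map((c) => `${c.title}: ${c.hybrid.toFixed(2)}`));
```

With equal search scores, the title that needs fewer edits to match the query ranks first. The 0.05 weight controls how strongly edit distance pulls a result down.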
Save the following query as a Query Lambda named searchKeywords under the Text-Search workspace:
SELECT
title,
score() as score,
num_ratings,
description
FROM
"Text-Search".Books
WHERE
search(
CONTAINS(title_tokens, :search_query),
CONTAINS(description_tokens, :search_query)
) OPTION(match_all = false)
ORDER BY
score desc,
num_ratings desc
LIMIT
10
Save the following query as a Query Lambda named searchSemantic under the Text-Search workspace:
SELECT
title,
APPROX_DOT_PRODUCT(
JSON_PARSE(:search_query),
book_embedding
) as similarity,
num_ratings,
description
FROM
"Text-Search".Books HINT(access_path=index_similarity_search)
ORDER BY
similarity DESC
LIMIT
10
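Since search_query is a string parameter and the query calls `JSON_PARSE` on it, the embedding of the search text has to be passed as a JSON-encoded array. The sketch below shows that encoding with toy 4-dimensional vectors (the real embeddings have 1536 dimensions), along with an exact dot product for comparison. The `dotProduct` helper is hypothetical, and treating `APPROX_DOT_PRODUCT` as an index-backed approximation of this value is our reading of the query, not a statement about Rockset internals.

```javascript
// Exact dot product of two equal-length numeric vectors.
function dotProduct(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Toy vectors for illustration; real embeddings come from an embedding model.
const queryEmbedding = [1, 0, 2, 0];
const bookEmbeddings = {
  "Book A": [1, 0, 1, 0],
  "Book B": [0, 1, 0, 1],
};

// The value to pass as the search_query parameter of searchSemantic.
const searchQueryParam = JSON.stringify(queryEmbedding);

// Rank the toy books by similarity, highest first.
const rankedBooks = Object.entries(bookEmbeddings)
  .map(([title, embedding]) => ({
    title,
    similarity: dotProduct(queryEmbedding, embedding),
  }))
  .sort((a, b) => b.similarity - a.similarity);

console.log(searchQueryParam, rankedBooks);
```

The query embedding must have the same dimension as the stored book_embedding field (1536 here), so it needs to come from an embedding model with that output size.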
Create an API key in the API Keys tab of the Rockset Console. The region can be found in the dropdown menu at the top of the page. For more information, refer to Rockset's API Reference.
Before running the webpage, check lines 2-8:
const rocksetApiKey = "YOUR_ROCKSET_API_KEY"; // UPDATE WITH YOUR ROCKSET API KEY
const apiServer = "YOUR_ROCKSET_REGION_URL"; // UPDATE WITH YOUR ROCKSET REGION URL (ex: "https://api.usw2a1.rockset.com")
const qlWorkspace = 'Text-Search'; // UPDATE if not the same
const qlName_titles = 'searchTitles'; // UPDATE if not the same
const qlName_keywords = 'searchKeywords'; // UPDATE if not the same
const qlName_semantic = 'searchSemantic'; // UPDATE if not the same
const openaiApiKey = "YOUR_OPENAI_API_KEY"; // UPDATE WITH YOUR OPENAI API KEY (only for semantic search)
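For reference, here is a minimal sketch of how these values could be combined into a request that executes a Query Lambda. `buildQueryLambdaRequest` is a hypothetical helper, not the webpage's actual code. It assumes Rockset's Query Lambda execution endpoint has the form `/v1/orgs/self/ws/{workspace}/lambdas/{name}/tags/{tag}`, that a `latest` tag is available, and that the API key goes in an `Authorization: ApiKey ...` header. Please verify these against the API Reference before relying on them.

```javascript
// Hypothetical helper: builds (but does not send) a Query Lambda request.
function buildQueryLambdaRequest({ apiServer, apiKey, workspace, lambdaName, searchQuery, tag = "latest" }) {
  // Assumed endpoint shape; check Rockset's API Reference for your account.
  const url =
    `${apiServer}/v1/orgs/self/ws/${encodeURIComponent(workspace)}` +
    `/lambdas/${encodeURIComponent(lambdaName)}/tags/${encodeURIComponent(tag)}`;
  return {
    url,
    options: {
      method: "POST",
      headers: {
        Authorization: `ApiKey ${apiKey}`,
        "Content-Type": "application/json",
      },
      // Each Query Lambda above expects one string parameter, search_query.
      body: JSON.stringify({
        parameters: [{ name: "search_query", type: "string", value: searchQuery }],
      }),
    },
  };
}

const request = buildQueryLambdaRequest({
  apiServer: "https://api.usw2a1.rockset.com",
  apiKey: "YOUR_ROCKSET_API_KEY",
  workspace: "Text-Search",
  lambdaName: "searchTitles",
  searchQuery: "harry pot",
});

console.log(request.url);
```

To send the request, pass both parts to `fetch`, for example `fetch(request.url, request.options)`, and read the rows from the JSON response. Keeping the request builder separate from the network call makes it easy to inspect the URL and body while debugging.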
Now you're ready to search!
- NEW Rockset function EDIT_DISTANCE can be used to replace the Levenshtein UDF. Check out the official Rockset documentation for more details.
- Consider adding metadata filtering to your queries by updating the WHERE clause. For example, we can update the searchSemantic Query Lambda to add filtering on price and publisher name:
SELECT
title,
APPROX_DOT_PRODUCT(
JSON_PARSE(:search_query),
book_embedding
) as similarity,
num_ratings,
description
FROM
"Text-Search".Books HINT(access_path=index_similarity_search)
WHERE
price BETWEEN 5 AND 25
AND publisher IN ('Scholastic Inc.', 'Pocket', 'Wildside Press')
ORDER BY
similarity DESC
LIMIT
10
- Prefer video format? Check out the workshop we hosted on this topic: https://www.youtube.com/watch?v=AG1QdjJ8BoA&t=17s