<a href="https://colab.research.google.com/github/sanimesa/genai/blob/main/notebooks/Wine_Search_using_Chromadb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Use Chromadb to search for best matches for a given wine tasting note.
In this example, we examine the capabilities of chromadb (https://www.trychroma.com/), a vector database for storing and querying embeddings.
As of this writing, chromadb is fully open source, but it does not offer a hosted cloud version.

In [None]:
!pip install chromadb
!pip install ipython-autotime
%load_ext autotime

Collecting chromadb
  Downloading chromadb-0.4.7-py3-none-any.whl (415 kB)
Collecting pydantic<2.0,>=1.9 (from chromadb)
  Downloading pydantic-1.10.12-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Collecting chroma-hnswlib==0.7.2 (from chromadb)
  Downloading chroma-hnswlib-0.7.2.tar.gz (31 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fastapi<0.100.0,>=0.95.2 (from chromadb)
  Downloading fastapi-0.99.1-py3-none-any.whl (58 kB)
Collecting uvicorn[standard]>=0.1

### Load a list of wine tasting notes

Load a list of wine tasting notes and prepare the three lists (ids, documents, and metadata) required to build a chromadb collection. These tasting notes were scraped from Wine Spectator magazine.

In [None]:
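import requests

# Fetch the scraped tasting notes (a JSON array of wine records)
r = requests.get('https://raw.githubusercontent.com/sanimesa/wine_tasting/main/tasting_notes/cabernet_output_1.json')
r.raise_for_status()  # fail early on a bad HTTP response
wines = r.json()      # parse the JSON once and reuse it

# Chroma expects three parallel lists: ids, documents, and metadatas
ids = ['wine_' + str(i) for i in range(1, len(wines) + 1)]
documents = [wine['tasting_note'] for wine in wines]
metadatas = [{'vineyard': wine['vineyard'], 'wine_name': wine['wine_name']} for wine in wines]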

time: 383 ms (started: 2023-08-24 18:57:09 +00:00)
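
The three lists built above must stay aligned position by position: the i-th id, document, and metadata entry all describe the same wine, and the ids should be unique. A minimal sketch of that preparation as a reusable function follows. The helper name `build_collection_inputs` and the two sample records are made up for illustration and are not part of the original notebook.

```python
def build_collection_inputs(wines, prefix="wine_"):
    """Turn a list of wine records into parallel ids/documents/metadatas lists.

    Hypothetical helper for illustration; the record keys mirror those used
    in the cell above ('tasting_note', 'vineyard', 'wine_name').
    """
    ids, documents, metadatas = [], [], []
    for i, wine in enumerate(wines, start=1):
        ids.append(f"{prefix}{i}")
        documents.append(wine["tasting_note"])
        metadatas.append({"vineyard": wine["vineyard"], "wine_name": wine["wine_name"]})
    return ids, documents, metadatas


# Made-up sample records, standing in for the downloaded JSON
sample_wines = [
    {"vineyard": "EXAMPLE ESTATE", "wine_name": "Cabernet Sauvignon 2015", "tasting_note": "Ripe plum and cedar."},
    {"vineyard": "SAMPLE CELLARS", "wine_name": "Cabernet Franc 2018", "tasting_note": "Bright cherry and spice."},
]

sample_ids, sample_documents, sample_metadatas = build_collection_inputs(sample_wines)

# Sanity checks: equal lengths and no repeated ids
assert len(sample_ids) == len(sample_documents) == len(sample_metadatas)
assert len(set(sample_ids)) == len(sample_ids)
print(sample_ids)
```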


### Create a collection in chromadb, add the sample tasting notes, then run a query

Notice that we provide chromadb only with the raw documents, not with any embeddings. Chromadb downloads a default embedding model (all-MiniLM-L6-v2, as the download log below shows), computes the embeddings, and handles the distance computation at query time, all by itself!

In [None]:
import chromadb
import pandas as pd
from google.colab import data_table
data_table.enable_dataframe_formatter()

client = chromadb.Client()
collection = client.create_collection("wine_reviews")

collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)


/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:01<00:00, 46.0MiB/s]


time: 35.1 s (started: 2023-08-24 18:57:10 +00:00)
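
Conceptually, a query embeds the query text with the same model and returns the stored documents whose vectors lie closest to it. The sketch below imitates only the ranking step, using tiny hand-made 3-dimensional vectors and a brute-force squared Euclidean distance. The vectors, the `nearest` helper, and the choice of squared L2 as the metric are assumptions for illustration; chromadb works on real model embeddings and an approximate nearest-neighbor index (the install log lists `chroma-hnswlib`) rather than a full scan.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, stored, n_results=2):
    """Return the n_results (id, distance) pairs closest to query_vec.

    Brute-force scan; 'stored' maps id -> vector.
    """
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = better match
    return scored[:n_results]


# Toy "embeddings" (made up); real ones have hundreds of dimensions
toy_vectors = {
    "wine_1": [1.0, 0.0, 0.0],
    "wine_2": [0.0, 1.0, 0.0],
    "wine_3": [0.5, 0.5, 0.0],
}

toy_query = [1.0, 0.0, 0.0]
top = nearest(toy_query, toy_vectors)
print(top)
```

As in the real query results, the hits come back sorted from smallest to largest distance.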


Let's search for our wine. The tasting note below was taken from the True Myth 2020 Cabernet Sauvignon: https://truemythwinery.com/wines/true-myth-cabernet-sauvignon/

In [None]:
results = collection.query(
    query_texts=["full of polished aromas of blueberry, cherry and vanilla, leading to flavors of dark red fruits, black currants and hints of pepper, mocha and caramelized oak. Rich yet smooth"],
    n_results=5,
    # where_document={"$contains": "search_string"}  # optional filter on document text
)


print(results)
# Create a dataframe from the data dictionary
df = pd.DataFrame({
    'ID': results['ids'][0],
    'Document': results['documents'][0],
    'Vineyard': [meta['vineyard'] for meta in results['metadatas'][0]],
    'Wine Name': [meta['wine_name'] for meta in results['metadatas'][0]],
    'Distance': results['distances'][0]
})

df



{'ids': [['wine_101', 'wine_9', 'wine_76', 'wine_149', 'wine_167']], 'distances': [[0.3579668402671814, 0.43874284625053406, 0.4515348970890045, 0.4670472741127014, 0.4771585166454315]], 'metadatas': [[{'vineyard': 'DUNHAM MACLACHLAN', 'wine_name': 'Cabernet Sauvignon Columbia Valley Pursued By Bear 2012'}, {'vineyard': 'SPARKMAN', 'wine_name': 'Cabernet Franc Columbia Valley Yonder 2018'}, {'vineyard': 'LEEUWIN', 'wine_name': 'Cabernet Sauvignon Margaret River Art Series 2013'}, {'vineyard': 'DOUBLEBACK', 'wine_name': 'Cabernet Sauvignon Walla Walla Valley 2009'}, {'vineyard': 'RODNEY STRONG', 'wine_name': "Cabernet Sauvignon Alexander Valley Alexander's Crown 2008"}]], 'embeddings': None, 'documents': [['Firm, dense and expressive, with layers of blueberry, currant, cedar and savory spice flavors coming together seamlessly against polished tannins. The finish sails on nicely. ', 'Sleek and polished, with black cherry and blueberry flavors laced with dusky spice and toasty mocha. Glid

| | ID | Document | Vineyard | Wine Name | Distance |
|---|---|---|---|---|---|
| 0 | wine_101 | Firm, dense and expressive, with layers of blu... | DUNHAM MACLACHLAN | Cabernet Sauvignon Columbia Valley Pursued By ... | 0.357967 |
| 1 | wine_9 | Sleek and polished, with black cherry and blue... | SPARKMAN | Cabernet Franc Columbia Valley Yonder 2018 | 0.438743 |
| 2 | wine_76 | A bright mouthful of cherry, raspberry and cur... | LEEUWIN | Cabernet Sauvignon Margaret River Art Series 2013 | 0.451535 |
| 3 | wine_149 | Broad and generous, this is impressive for its... | DOUBLEBACK | Cabernet Sauvignon Walla Walla Valley 2009 | 0.467047 |
| 4 | wine_167 | Very ripe, but also quite polished, fleshy and... | RODNEY STRONG | Cabernet Sauvignon Alexander Valley Alexander'... | 0.477159 |


time: 260 ms (started: 2023-08-24 18:57:45 +00:00)
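
The Distance column is easier to read against a reference scale. If the distances are squared Euclidean distances between unit-length embeddings (an assumption about the default configuration, not something the output itself confirms), then each distance d corresponds to a cosine similarity of 1 - d/2, because |a - b|² = 2 - 2·cos(a, b) for unit vectors. A small sketch applying that conversion to the distances printed above:

```python
def l2sq_to_cosine(distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    Valid only under the assumption that both vectors have length 1.
    """
    return 1.0 - distance / 2.0


# Distances copied from the query output above (rounded to 6 places)
distances = [0.357967, 0.438743, 0.451535, 0.467047, 0.477159]
similarities = [round(l2sq_to_cosine(d), 4) for d in distances]
print(similarities)
```

Under that assumption, the best match would have a cosine similarity of roughly 0.82, with the remaining four hits between about 0.76 and 0.78.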


### Enter some wine characteristics below to find wines with similar characteristics!

In [None]:
#@title Find your own wine
tasting_note = "Exhilarating blackberry and blueberry aromas are enhanced by crushed violet, herbes de Provence, and spiced tea. On the palate, dense and soft, with finely textured tannins. Juicy layers of black and red currants lead to a bright finish, accented by pastry notes from extended barrel aging." #@param {type:"string"}

results = collection.query(
    query_texts=[tasting_note],
    n_results=5
)


print(results)
# Create a dataframe from the data dictionary
df = pd.DataFrame({
    'ID': results['ids'][0],
    'Document': results['documents'][0],
    'Vineyard': [meta['vineyard'] for meta in results['metadatas'][0]],
    'Wine Name': [meta['wine_name'] for meta in results['metadatas'][0]],
    'Distance': results['distances'][0]
})

df

{'ids': [['wine_82', 'wine_101', 'wine_161', 'wine_15', 'wine_56']], 'distances': [[0.5165730714797974, 0.5486158728599548, 0.5549145936965942, 0.5668891072273254, 0.5806964039802551]], 'metadatas': [[{'vineyard': 'HALL', 'wine_name': 'Cabernet Sauvignon Napa Valley Terra Secca 2014'}, {'vineyard': 'DUNHAM MACLACHLAN', 'wine_name': 'Cabernet Sauvignon Columbia Valley Pursued By Bear 2012'}, {'vineyard': 'HEWITT', 'wine_name': 'Cabernet Sauvignon Rutherford 2008'}, {'vineyard': 'TOLAINI', 'wine_name': 'Cabernet Sauvignon Toscana Legit 2018'}, {'vineyard': 'BERINGER', 'wine_name': 'Cabernet Sauvignon Napa Valley Private Reserve 2016'}]], 'embeddings': None, 'documents': [['Intense and well-centered on vivid ripe currant, blackberry and cherry flavors, shaded by cedar notes and framed by firm, chewy tannins. Leaves you with a rustic impression and a solid core of fruit. ', 'Firm, dense and expressive, with layers of blueberry, currant, cedar and savory spice flavors coming together seamle

| | ID | Document | Vineyard | Wine Name | Distance |
|---|---|---|---|---|---|
| 0 | wine_82 | Intense and well-centered on vivid ripe curran... | HALL | Cabernet Sauvignon Napa Valley Terra Secca 2014 | 0.516573 |
| 1 | wine_101 | Firm, dense and expressive, with layers of blu... | DUNHAM MACLACHLAN | Cabernet Sauvignon Columbia Valley Pursued By ... | 0.548616 |
| 2 | wine_161 | Firm mineral, cedar and dusty earth notes fram... | HEWITT | Cabernet Sauvignon Rutherford 2008 | 0.554915 |
| 3 | wine_15 | Well-defined black cherry and blackberry flavo... | TOLAINI | Cabernet Sauvignon Toscana Legit 2018 | 0.566889 |
| 4 | wine_56 | Juicy and engaging, with lots of raspberry, bl... | BERINGER | Cabernet Sauvignon Napa Valley Private Reserve... | 0.580696 |


time: 251 ms (started: 2023-08-24 18:57:45 +00:00)
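
Both query cells build the same DataFrame from the result dictionary, so that step could be factored into a helper. The sketch below does so without pandas and returns a list of row dictionaries. `results_to_rows` is a hypothetical name, and the sample dictionary is a hand-trimmed copy of the first two hits from the output above, with the tasting notes shortened as in the table.

```python
def results_to_rows(results, query_index=0):
    """Flatten one query's hits from a Chroma-style result dict into rows.

    Each field of the result holds one inner list per query text;
    query_index selects which query's hits to flatten.
    """
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    metas = results["metadatas"][query_index]
    dists = results["distances"][query_index]
    return [
        {
            "ID": doc_id,
            "Document": doc,
            "Vineyard": meta["vineyard"],
            "Wine Name": meta["wine_name"],
            "Distance": dist,
        }
        for doc_id, doc, meta, dist in zip(ids, docs, metas, dists)
    ]


# Hand-trimmed sample shaped like the printed output above
sample_results = {
    "ids": [["wine_82", "wine_101"]],
    "distances": [[0.5165730714797974, 0.5486158728599548]],
    "metadatas": [[
        {"vineyard": "HALL", "wine_name": "Cabernet Sauvignon Napa Valley Terra Secca 2014"},
        {"vineyard": "DUNHAM MACLACHLAN", "wine_name": "Cabernet Sauvignon Columbia Valley Pursued By Bear 2012"},
    ]],
    "documents": [[
        "Intense and well-centered on vivid ripe curran...",
        "Firm, dense and expressive, with layers of blu...",
    ]],
}

rows = results_to_rows(sample_results)
for row in rows:
    print(row["ID"], row["Vineyard"], round(row["Distance"], 6))
```

Passing the returned list to `pd.DataFrame(rows)` should produce the same columns as the tables above.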
