# RAG Embeddings Practice

In [1]:
!pip install qdrant_client


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opentelemetry-proto 1.27.0 requires protobuf<5.0,>=3.19, but you have protobuf 5.28.3 which is incompatible.


## Import libraries

In [3]:
# Pandas to read csv and preprocess data
import pandas as pd

# Qdrant for vector database, storing of processed information from CSV
from qdrant_client import models, QdrantClient

# Sentence transformer to tokenise and encode
from sentence_transformers import SentenceTransformer



## Preprocess CSV

In [4]:
df = pd.read_csv('top_rated_wines.csv')
df = df[df['variety'].notna()] # Drop rows with a missing variety; NaN values break payload serialization
data = df.sample(700).to_dict('records') # Keep a random sample of 700 records; indexing more takes longer

len(data)

700
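Note that `df.sample(700)` draws different rows on every run, so the search results further down can change between runs; pandas accepts a `random_state` argument on `sample` to fix the draw. The sketch below shows the same filter-then-sample idea in plain Python, under the assumption that each record is a dict. The records and the helper names `has_variety` and `sample_records` are hypothetical and not part of the notebook.

```python
import math
import random

# Hypothetical records standing in for rows of top_rated_wines.csv
records = [
    {"name": "Wine A", "variety": "Red Wine"},
    {"name": "Wine B", "variety": float("nan")},  # missing variety
    {"name": "Wine C", "variety": "White Wine"},
    {"name": "Wine D", "variety": "Red Wine"},
]

def has_variety(record):
    """Return True when the 'variety' field is present and not NaN."""
    value = record.get("variety")
    if value is None:
        return False
    return not (isinstance(value, float) and math.isnan(value))

def sample_records(rows, n, seed):
    """Drop rows without a variety, then draw up to n rows reproducibly."""
    clean = [row for row in rows if has_variety(row)]
    # A seeded Random instance gives the same sample on every run
    return random.Random(seed).sample(clean, min(n, len(clean)))

sample = sample_records(records, n=2, seed=42)
print(len(sample))
```

Capping `n` at the number of clean rows avoids the error that sampling more rows than are available would otherwise raise.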

In [9]:
data[:10]

[{'name': 'Kanzler Vineyards Pinot Noir 2007',
  'region': 'North Coast, California',
  'variety': 'Red Wine',
  'rating': 96.0,
  'notes': 'Dark ruby color. Intense flavors of black cherry, boysenberry and blueberry. Wonderful nose of rose petals and spice combined with a delightful earthiness. A big, but well balanced wine with good acid and soft tannins to complement the conentrated fruit flavors. A wine that reveals layer upon layer of mouthwatering structure, depth and complexity as it lingers on the palette. '},
 {'name': 'Guigal La Mouline Cote Rotie 2009',
  'region': 'Cote Rotie, Rhone, France',
  'variety': 'Red Wine',
  'rating': 98.0,
  'notes': 'Ruby red with bright red highlights. On the nose, small red fruits, blackberry and floral aromas of violets, powerful but fine aromatic intensity. The wine has a soft mouthfeel and beautiful harmony between the fine flavors and explosive richness due to the concentration. A voluptuous feminine wine with a velvety or silky texture a

In [5]:
# Encode text with MiniLM
# Define the model name
model_name = "sentence-transformers/all-MiniLM-L6-v2"

# Load the model (downloaded and cached locally on first use)
encoder = SentenceTransformer(model_name)


In [6]:
# create the vector database client
qdrant = QdrantClient(":memory:") # Create in-memory Qdrant instance

In [7]:
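# Create a collection to store the wines
# recreate_collection is deprecated in recent qdrant_client releases;
# drop any existing collection, then create it explicitly
if qdrant.collection_exists("top_wines"):
    qdrant.delete_collection("top_wines")

qdrant.create_collection(
    collection_name="top_wines",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(), # Vector size is defined by the model used
        distance=models.Distance.COSINE
    )
)

True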
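The collection is configured with `models.Distance.COSINE`, so vectors are compared by the angle between them rather than by their length. As a rough illustration of that metric only (not Qdrant's internal code), here is a minimal pure-Python sketch; the function name `cosine_similarity` and the toy vectors are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-d vectors; real MiniLM embeddings have many more dimensions
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # 0
opposite = cosine_similarity([1.0, 0.0], [-2.0, 0.0])        # close to -1

print(same_direction, orthogonal, opposite)
```

Scaling a vector does not change its score, which is why two tasting notes of very different length can still be judged similar.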

In [8]:
# Vectorize: embed each wine's tasting notes and store the full record as the payload
qdrant.upload_points(
    collection_name = "top_wines",
    points = [
        models.PointStruct(
            id=idx,
            vector=encoder.encode(doc["notes"]).tolist(),
            payload=doc,
        ) for idx, doc in enumerate(data) # data is the variable holding all the wines
    ]
)

In [10]:
# Set a prompt
user_prompt = "Suggest an amazing Shiraz wine from Australia"

In [11]:
# Embed the prompt and search the collection
# Return the 5 most similar wines
hits = qdrant.search(
    collection_name="top_wines",
    query_vector = encoder.encode(user_prompt).tolist(),
    limit = 5
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'name': 'De Lisio Shiraz 2004', 'region': 'McLaren Vale, South Australia, Australia', 'variety': 'Red Wine', 'rating': 96.0, 'notes': '"The profound 2004 Shiraz was cropped at a measly .5 tons of fruit per acre, and aged almost entirely in new French oak. It is a killer wine in a killer line-up from De Lisio in 2004. Dense blue/purple to the rim, this highly extracted (but not overly extracted) effort reveals notes of crushed rocks, blueberries, blackberries, camphor, lead pencil shavings, and spicy oak. Boasting great purity, a full-bodied, opulent texture, huge richness, but no sense of pruniness or flabbiness given its precision and refreshing structure, this is a well-balanced, potentially complex McLaren Vale blockbuster. It should drink well for 15+ years." - Wine Advocate'} score: 0.6775632971328055
{'name': 'De Lisio The Catalyst Shiraz/Grenache 2004', 'region': 'McLaren Vale, South Australia, Australia', 'variety': 'Red Wine', 'rating': 96.0, 'notes': '"A blockbuster in the m
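Conceptually, the search above scores every stored vector against the query vector and returns the `limit` best matches with their payloads. The following is a brute-force sketch of that idea in plain Python; it is not how Qdrant is implemented (a vector database can use an approximate index instead of scanning everything), and `top_k`, the payloads, and the 2-d vectors are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, points, limit):
    """Score every (payload, vector) pair and return the best `limit`, highest score first."""
    scored = [
        (payload, cosine_similarity(query_vector, vector))
        for payload, vector in points
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]

# Hypothetical payloads with toy 2-d vectors
points = [
    ({"name": "Wine A"}, [1.0, 0.0]),
    ({"name": "Wine B"}, [0.7, 0.7]),
    ({"name": "Wine C"}, [0.0, 1.0]),
]

results = top_k([1.0, 0.1], points, limit=2)
for payload, score in results:
    print(payload, "score:", score)
```

The query points almost along the first axis, so "Wine A" ranks first and "Wine B" second, while "Wine C" falls outside the limit.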

In [12]:
# We can see the results above
# Define a variable to hold the search results
search_results = [hit.payload for hit in hits]

In [20]:
search_results[0]

{'name': 'De Lisio Shiraz 2004',
 'region': 'McLaren Vale, South Australia, Australia',
 'variety': 'Red Wine',
 'rating': 96.0,
 'notes': '"The profound 2004 Shiraz was cropped at a measly .5 tons of fruit per acre, and aged almost entirely in new French oak. It is a killer wine in a killer line-up from De Lisio in 2004. Dense blue/purple to the rim, this highly extracted (but not overly extracted) effort reveals notes of crushed rocks, blueberries, blackberries, camphor, lead pencil shavings, and spicy oak. Boasting great purity, a full-bodied, opulent texture, huge richness, but no sense of pruniness or flabbiness given its precision and refreshing structure, this is a well-balanced, potentially complex McLaren Vale blockbuster. It should drink well for 15+ years." - Wine Advocate'}

In [16]:
# Set up Prompt Template
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

In [22]:
from langchain.prompts import ChatPromptTemplate
# Example query
query_text = user_prompt
context_text = str(search_results)

In [23]:
# Load the template for the prompt
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=query_text)
print(prompt)

Human: 
Answer the question based only on the following context:

[{'name': 'De Lisio Shiraz 2004', 'region': 'McLaren Vale, South Australia, Australia', 'variety': 'Red Wine', 'rating': 96.0, 'notes': '"The profound 2004 Shiraz was cropped at a measly .5 tons of fruit per acre, and aged almost entirely in new French oak. It is a killer wine in a killer line-up from De Lisio in 2004. Dense blue/purple to the rim, this highly extracted (but not overly extracted) effort reveals notes of crushed rocks, blueberries, blackberries, camphor, lead pencil shavings, and spicy oak. Boasting great purity, a full-bodied, opulent texture, huge richness, but no sense of pruniness or flabbiness given its precision and refreshing structure, this is a well-balanced, potentially complex McLaren Vale blockbuster. It should drink well for 15+ years." - Wine Advocate'}, {'name': 'De Lisio The Catalyst Shiraz/Grenache 2004', 'region': 'McLaren Vale, South Australia, Australia', 'variety': 'Red Wine', 'rating
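As the output shows, `str(search_results)` pastes the raw Python list of dicts into the prompt. One possible alternative, sketched here with plain `str.format` instead of `ChatPromptTemplate`, is to render each record as a short labelled block before filling the template. The helper `format_context` and the two example records are hypothetical; only the template text is taken from the notebook.

```python
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

def format_context(results):
    """Render each record as a labelled text block instead of a raw dict repr."""
    blocks = []
    for record in results:
        blocks.append(
            f"Name: {record['name']}\n"
            f"Region: {record['region']}\n"
            f"Rating: {record['rating']}\n"
            f"Notes: {record['notes'].strip()}"
        )
    # A blank line between blocks keeps the records visually separate
    return "\n\n".join(blocks)

# Hypothetical records shaped like the search payloads
example_results = [
    {"name": "Example Shiraz", "region": "Example Region", "rating": 96.0,
     "notes": "Dark fruit and spice. "},
    {"name": "Example Blend", "region": "Example Region", "rating": 95.0,
     "notes": "Soft tannins."},
]

prompt = PROMPT_TEMPLATE.format(
    context=format_context(example_results),
    question="Suggest an amazing Shiraz wine from Australia",
)
print(prompt)
```

Whether this helps the model is an empirical question; the point is only that the context string is ordinary text and can be shaped before it reaches the LLM.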

In [25]:
# Connect to the local LLM served by Ollama
# We are using Mistral (the model must already be pulled and the Ollama server running)
from langchain_ollama import OllamaLLM

model = OllamaLLM(model="mistral")
response_text = model.invoke(prompt)

print(response_text)

 Based on the provided context, both the De Lisio Shiraz 2004 and the De Lisio The Catalyst Shiraz/Grenache 2004 received a rating of 96.0 from the Wine Advocate, indicating they are amazing Shiraz wines from Australia. However, if you're looking for a more recent vintage, the By Farr Shiraz 2014 is also a highly-rated Shiraz from Australia with a score of 96.0. For an older vintage, the Greenock Creek Alice's Shiraz 2003 received a rating of 97.0. The choice ultimately depends on your personal preference for wine styles and age.
