<a href="https://colab.research.google.com/github/yongsa-nut/SF323_CN408_AIEngineer/blob/main/SF323_CN408_Lecture_4_RAG_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SF323/CN408 - Lecture 4: RAG Demo

## Google Vertex Setup

In [None]:
!gcloud auth application-default login

In [None]:
!gcloud auth application-default set-quota-project YOUR_PROJECT_ID  # replace YOUR_PROJECT_ID with your project ID

In [None]:
import openai
from google.auth import default
import google.auth.transport.requests

# TODO(developer): set your project ID (and region, if different) below
project_id = "YOUR_PROJECT_ID"    # Update here
location = "us-central1"

# Programmatically get an access token
credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(google.auth.transport.requests.Request())

# OpenAI Client
client = openai.OpenAI(
  base_url=f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/endpoints/openapi",
  api_key=credentials.token
)

## OpenRouter

In [None]:
from google.colab import userdata
from openai import OpenAI

client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=userdata.get('openrouter'),
)

def generate(prompt):
    response = client.chat.completions.create(
        model="google/gemini-2.5-flash-lite",
        messages = [{'role':'user',
                     'content':prompt}]
    )
    return response.choices[0].message.content

generate("Hello")

'Hi there! How can I help you today?'

<br>

---

## Demo 1: Keyword Matching

In [None]:
# The data
knowledge_base = { 'CN101' : "Introduction to Computer Programming.",
             'MA111' : 'Fundamentals of Calculus',
             'CN200' : 'Discrete Mathematics',
             'SF211' : 'Object-Oriented Programming'
}

In [None]:
def keyword_generate(query, docs):
    # Retrieve relevant information
    context = ''
    for k in docs:
        if k in query:
            context += f'{k} = {docs[k]}\n'

    # Augmented Prompt
    prompt = f'''<question>{query}</question>
    Please use context in <context> tags to answer the question.
    <context>{context}</context>'''

    # Generate
    response = generate(prompt)

    # printing out
    print('Query:', query)
    print('Retrieved documents:', context)
    print('Response:', response)

query = 'CN101 คือวิชาอะไร?'  # Thai for "What course is CN101?"

keyword_generate(query, knowledge_base)

**Query**: CN101 คือวิชาอะไร?
Retrieved documents: CN101 = Introduction to Computer Programming.

**Response**: CN101 คือวิชา **Introduction to Computer Programming**
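
The `k in query` test in `keyword_generate` is an exact, case-sensitive substring match, so a query that writes the code as `cn101` would retrieve nothing. The sketch below is a minimal illustration of one way to loosen that; `keyword_retrieve` and the small `kb` dictionary are hypothetical names introduced here, not part of the notebook.

```python
# Hypothetical helper for illustration; not part of the original notebook.
def keyword_retrieve(query, docs):
    """Return 'code = title' lines for every course code mentioned in the query."""
    query_upper = query.upper()  # normalize case so 'cn101' matches 'CN101'
    return [f"{code} = {title}" for code, title in docs.items()
            if code.upper() in query_upper]

kb = {
    "CN101": "Introduction to Computer Programming.",
    "MA111": "Fundamentals of Calculus",
}

print(keyword_retrieve("what is cn101?", kb))          # matches despite lower case
print(keyword_retrieve("tell me about calculus", kb))  # no course code, so nothing is retrieved
```

The second call shows the remaining weakness of keyword matching: a question that describes a course without naming its code retrieves no context at all.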


<br>

---

## Demo 2: BM25

Documentation

In [None]:
!pip install rank_bm25

In [None]:
from rank_bm25 import BM25Okapi

In [None]:
# Sample document collection
# Link: https://drive.google.com/file/d/1dofsvV5XptwgnXDZIyo5Y4vOmkl8v5qr/view
documents = [
    "CN101 Introduction to Computer Programming: Basic concepts of computer systems, electronic data processing and concepts, system and application software, algorithms, flowcharts, data representation, program design and development methodology, and problem-solving using high-level language programming (Python).",
    "SF211 Object-Oriented Programming: Introduction to object-oriented programming. Class, Object, Encapsulation, Inheritance, Polymorphism, and Abstraction.",
    "CN200 Discrete Mathematics: Logic. Proof techniques. Basic set theory. Relations and functions. Mathematical induction. Countability and counting arguments. Permutations and combinations. Inclusionexclusion principle. Elementary finite probability. Topics in graph theory: isomorphism, planarity, circuits, trees, and directed graphs.",
    "SF230 Linear Algebra and Numerical Analysis: Theorems of matrices, vector spaces, linear independence, dimensions, rank of matrices, applications of matrices for solving systems of linear equations, inverse of matrices, determinant, Cramer’ s Rule, linear transformations, inner product spaces, orthogonal complement and least square, eigenvalues, eigenvectors and its application. Numerical solutions of one variable equations, polynomial interpolation, numerical methods of differentiation and integration, solving engineering problems by using package",
    "SF250 Probability Theory and Statistics: Introduction to probability theory. Topics covered include random variables, conditional probability, expectation, independence, Bayes' rule, important distributions, joint distributions, central limit theorem, laws of large numbers, statistical inference; point and confidence interval estimation, hypothesis tests, analysis of variance, linear regression.",
    "SF212 Web Application Development: Introduction to the basic principles of web application programming. Web server systems. Basic HTML and Cascading StyleSheets. Server-side web application development. Database access and manipulation through the web. Session management. Web application security.",
    "SF220 Introduction to Software Engineering: Scientific foundation for software engineering, introduction to software development process and life cycles. Methods, techniques, and tools used for software engineering process. Students work in small teams on substantial, realistic projects, covering most phases of the software production life cycle.",
    "SF231 Data Structures and Algorithms: Introduction to data structures and algorithms, algorithm analysis, arrays and linked lists, stacks, queues, priority queues, heaps, binary trees, binary search trees, AVL trees, other variations in trees, hashing, sorting, graph algorithms, algorithm design techniques, online judges and algorithm competitions.",
    "SF251 Introduction to Data Science: Learning the foundational principles of data science and data mining through an in-depth examination of the data science lifecycle. This comprehensive course delves into each important steps, beginning with a comprehension of problem formulation tailored to the specific needs and objectives of businesses. Learners will engage in data mining methodologies, including processes of data extraction, transformation, and integration. Moreover, an analysis of data quality and refinement techniques will be undertaken to ensure the integrity and reliability of datasets. Furthermore, learners will explore the complexities of feature engineering, unraveling intricate relationships and patterns within the data domain. Leveraging state-of-the-art machine learning algorithms, learners will delve into predictive analytics, forecasting future trends and behaviors with precision and accuracy. Additionally, the course will emphasize the art of data presentations, enabling participants to effectively communicate analytical insights to diverse stakeholders.",
    "CN351 Web Application Security: Current state of security in web applications. Key security mechanisms for web applications. Client and server side controls. Common vulnerabilities of web-based applications and how to protect against the attacks."
    ]

# Simple tokenization using string splitting
tokenized_docs = [doc.lower().split() for doc in documents]

# Create BM25 object
bm25 = BM25Okapi(tokenized_docs)

In [None]:
for i, words in enumerate(tokenized_docs):
  print(i, words)

0 ['cn101', 'introduction', 'to', 'computer', 'programming:', 'basic', 'concepts', 'of', 'computer', 'systems,', 'electronic', 'data', 'processing', 'and', 'concepts,', 'system', 'and', 'application', 'software,', 'algorithms,', 'flowcharts,', 'data', 'representation,', 'program', 'design', 'and', 'development', 'methodology,', 'and', 'problem-solving', 'using', 'high-level', 'language', 'programming', '(python).']
1 ['sf211', 'object-oriented', 'programming:', 'introduction', 'to', 'object-oriented', 'programming.', 'class,', 'object,', 'encapsulation,', 'inheritance,', 'polymorphism,', 'and', 'abstraction.']
2 ['cn200', 'discrete', 'mathematics:', 'logic.', 'proof', 'techniques.', 'basic', 'set', 'theory.', 'relations', 'and', 'functions.', 'mathematical', 'induction.', 'countability', 'and', 'counting', 'arguments.', 'permutations', 'and', 'combinations.', 'inclusionexclusion', 'principle.', 'elementary', 'finite', 'probability.', 'topics', 'in', 'graph', 'theory:', 'isomorphism,', 
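
The token lists above keep punctuation attached to words (`'programming:'`, `'(python).'`), so a query token such as `programming` will not match `programming:`. A minimal sketch of a slightly stricter tokenizer follows; `simple_tokenize` is a hypothetical helper, and the regular expression is only one reasonable choice.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep runs of letters/digits, allowing hyphens inside a word."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

sample = "CN101 Introduction to Computer Programming: high-level language programming (Python)."
print(simple_tokenize(sample))
```

If a tokenizer like this is used, it should be applied to both the documents and the query so that the two sides produce comparable tokens.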

In [None]:
def BM25_retriever(query, top_k=2):
  # Tokenize the query the same way as the documents
  tokenized_query = query.lower().split()
  # Pass the list of words into bm25 to get one score per document
  doc_scores = bm25.get_scores(tokenized_query)
  # Sort by score (highest first) and keep the top_k documents
  top_docs = sorted(enumerate(doc_scores), key=lambda x: x[1], reverse=True)[:top_k]

  return [documents[i] for i, _ in top_docs]

In [None]:
BM25_retriever("data science", 3)

['SF251 Introduction to Data Science: Learning the foundational principles of data science and data mining through an in-depth examination of the data science lifecycle. This comprehensive course delves into each important steps, beginning with a comprehension of problem formulation tailored to the specific needs and objectives of businesses. Learners will engage in data mining methodologies, including processes of data extraction, transformation, and integration. Moreover, an analysis of data quality and refinement techniques will be undertaken to ensure the integrity and reliability of datasets. Furthermore, learners will explore the complexities of feature engineering, unraveling intricate relationships and patterns within the data domain. Leveraging state-of-the-art machine learning algorithms, learners will delve into predictive analytics, forecasting future trends and behaviors with precision and accuracy. Additionally, the course will emphasize the art of data presentations, ena
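
To make the ranking less of a black box, here is a sketch of a textbook BM25 scorer written from scratch. It is an illustration under assumptions, not the `rank_bm25` implementation: the smoothed IDF used here differs from the variant in that library, `bm25_scores` and `toy_docs` are hypothetical names, and `k1=1.5`, `b=0.75` are simply commonly used values.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for d in tokenized_docs for term in set(d))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed, non-negative IDF (an assumption; libraries differ here)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalized via b
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

toy_docs = [
    "introduction to data science".split(),
    "web application security".split(),
    "data structures and algorithms".split(),
]
print(bm25_scores("data science".split(), toy_docs))
```

The first toy document matches both query terms and scores highest, the third matches only `data`, and the second matches nothing and scores zero.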

In [None]:
def bm25_generate(query, top_k=3):
  # Retrieve relevant information
  retrieved_docs = BM25_retriever(query, top_k)
  context = "\n".join(retrieved_docs)

  # Augmented Prompt
  prompt = f'''<question>{query}</question>
  Please use context in <context> tags to answer the question.
  <context>{context}</context>'''

  # Generate
  response = generate(prompt)

  # printing out
  print('Query:', query)
  print('Retrieved documents:', retrieved_docs)
  print('\nResponse:', response)

query = "Can you tell me about the data science class?"
bm25_generate(query)

Query: Can you tell me about the data science class?
Retrieved documents: ['SF251 Introduction to Data Science: Learning the foundational principles of data science and data mining through an in-depth examination of the data science lifecycle. This comprehensive course delves into each important steps, beginning with a comprehension of problem formulation tailored to the specific needs and objectives of businesses. Learners will engage in data mining methodologies, including processes of data extraction, transformation, and integration. Moreover, an analysis of data quality and refinement techniques will be undertaken to ensure the integrity and reliability of datasets. Furthermore, learners will explore the complexities of feature engineering, unraveling intricate relationships and patterns within the data domain. Leveraging state-of-the-art machine learning algorithms, learners will delve into predictive analytics, forecasting future trends and behaviors with precision and accuracy. 

<br>

---

## Demo 3: RAG without a database

In [None]:
!pip install sentence_transformers datasets

In [None]:
import torch
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Same document collection
documents = [
    "CN101 Introduction to Computer Programming: Basic concepts of computer systems, electronic data processing and concepts, system and application software, algorithms, flowcharts, data representation, program design and development methodology, and problem-solving using high-level language programming (Python).",
    "SF211 Object-Oriented Programming: Introduction to object-oriented programming. Class, Object, Encapsulation, Inheritance, Polymorphism, and Abstraction.",
    "CN200 Discrete Mathematics: Logic. Proof techniques. Basic set theory. Relations and functions. Mathematical induction. Countability and counting arguments. Permutations and combinations. Inclusionexclusion principle. Elementary finite probability. Topics in graph theory: isomorphism, planarity, circuits, trees, and directed graphs.",
    "SF230 Linear Algebra and Numerical Analysis: Theorems of matrices, vector spaces, linear independence, dimensions, rank of matrices, applications of matrices for solving systems of linear equations, inverse of matrices, determinant, Cramer’ s Rule, linear transformations, inner product spaces, orthogonal complement and least square, eigenvalues, eigenvectors and its application. Numerical solutions of one variable equations, polynomial interpolation, numerical methods of differentiation and integration, solving engineering problems by using package",
    "SF250 Probability Theory and Statistics: Introduction to probability theory. Topics covered include random variables, conditional probability, expectation, independence, Bayes' rule, important distributions, joint distributions, central limit theorem, laws of large numbers, statistical inference; point and confidence interval estimation, hypothesis tests, analysis of variance, linear regression.",
    "SF212 Web Application Development: Introduction to the basic principles of web application programming. Web server systems. Basic HTML and Cascading StyleSheets. Server-side web application development. Database access and manipulation through the web. Session management. Web application security.",
    "SF220 Introduction to Software Engineering: Scientific foundation for software engineering, introduction to software development process and life cycles. Methods, techniques, and tools used for software engineering process. Students work in small teams on substantial, realistic projects, covering most phases of the software production life cycle.",
    "SF231 Data Structures and Algorithms: Introduction to data structures and algorithms, algorithm analysis, arrays and linked lists, stacks, queues, priority queues, heaps, binary trees, binary search trees, AVL trees, other variations in trees, hashing, sorting, graph algorithms, algorithm design techniques, online judges and algorithm competitions.",
    "SF251 Introduction to Data Science: Learning the foundational principles of data science and data mining through an in-depth examination of the data science lifecycle. This comprehensive course delves into each important steps, beginning with a comprehension of problem formulation tailored to the specific needs and objectives of businesses. Learners will engage in data mining methodologies, including processes of data extraction, transformation, and integration. Moreover, an analysis of data quality and refinement techniques will be undertaken to ensure the integrity and reliability of datasets. Furthermore, learners will explore the complexities of feature engineering, unraveling intricate relationships and patterns within the data domain. Leveraging state-of-the-art machine learning algorithms, learners will delve into predictive analytics, forecasting future trends and behaviors with precision and accuracy. Additionally, the course will emphasize the art of data presentations, enabling participants to effectively communicate analytical insights to diverse stakeholders.",
    "CN351 Web Application Security: Current state of security in web applications. Key security mechanisms for web applications. Client and server side controls. Common vulnerabilities of web-based applications and how to protect against the attacks."
    ]

# Initialize the sentence transformer model
embed_model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed documents
doc_embeddings = embed_model.encode(documents)

In [None]:
doc_embeddings

In [None]:
len(doc_embeddings[0])

384

In [None]:
def embedded_retriever(query, top_k=1):
    # Embed the query
    query_embedding = embed_model.encode([query])

    # Calculate cosine similarity
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

    # Get top-k relevant documents
    top_indices = similarities.argsort()[-top_k:][::-1]
    return [documents[i] for i in top_indices]
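
The retriever above ranks documents by the cosine similarity between the query embedding and each document embedding. The sketch below spells out that computation in plain Python on made-up 3-dimensional vectors, which stand in for the 384-dimensional embeddings; `cosine` and `top_k_indices` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_indices(query_vec, doc_vecs, k):
    """Indices of the k documents most similar to the query, best first."""
    sims = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)[:k]

# Toy vectors for illustration only
toy_doc_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.1, 0.0]
print(top_k_indices(toy_query, toy_doc_vecs, k=2))
```

Because cosine similarity depends only on direction, scaling a vector by a positive constant does not change its ranking.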

In [None]:
def RAG_generate(query, top_k=3):
  # Retrieve relevant docs
  retrieved_docs = embedded_retriever(query, top_k)
  context = "\n".join(retrieved_docs)

  # Augmented Prompt
  prompt = f'''<question>{query}</question>
  Please use context in <context> tags to answer the question.
  <context>{context}</context>'''

  # Generate
  response = generate(prompt)

  # printing out
  print('Query:', query)
  print('Retrieved documents:', retrieved_docs)
  print('\nResponse:', response)

query = "Can you tell me about the data science class?"
RAG_generate(query)

Query: Can you tell me about the data science class?
Retrieved documents: ['SF251 Introduction to Data Science: Learning the foundational principles of data science and data mining through an in-depth examination of the data science lifecycle. This comprehensive course delves into each important steps, beginning with a comprehension of problem formulation tailored to the specific needs and objectives of businesses. Learners will engage in data mining methodologies, including processes of data extraction, transformation, and integration. Moreover, an analysis of data quality and refinement techniques will be undertaken to ensure the integrity and reliability of datasets. Furthermore, learners will explore the complexities of feature engineering, unraveling intricate relationships and patterns within the data domain. Leveraging state-of-the-art machine learning algorithms, learners will delve into predictive analytics, forecasting future trends and behaviors with precision and accuracy. 

<br>

---

## Demo 4: RAG with Pinecone

In [None]:
!pip3 install pinecone

In [None]:
from pinecone import Pinecone, ServerlessSpec

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print('Sorry no cuda.')
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

In [None]:
query = 'Can you tell me about the data science class?'
xq = model.encode(query)
xq.shape

(384,)

In [None]:
# Same document collection
documents = [
    "CN101 Introduction to Computer Programming: Basic concepts of computer systems, electronic data processing and concepts, system and application software, algorithms, flowcharts, data representation, program design and development methodology, and problem-solving using high-level language programming (Python).",
    "SF211 Object-Oriented Programming: Introduction to object-oriented programming. Class, Object, Encapsulation, Inheritance, Polymorphism, and Abstraction.",
    "CN200 Discrete Mathematics: Logic. Proof techniques. Basic set theory. Relations and functions. Mathematical induction. Countability and counting arguments. Permutations and combinations. Inclusionexclusion principle. Elementary finite probability. Topics in graph theory: isomorphism, planarity, circuits, trees, and directed graphs.",
    "SF230 Linear Algebra and Numerical Analysis: Theorems of matrices, vector spaces, linear independence, dimensions, rank of matrices, applications of matrices for solving systems of linear equations, inverse of matrices, determinant, Cramer’ s Rule, linear transformations, inner product spaces, orthogonal complement and least square, eigenvalues, eigenvectors and its application. Numerical solutions of one variable equations, polynomial interpolation, numerical methods of differentiation and integration, solving engineering problems by using package",
    "SF250 Probability Theory and Statistics: Introduction to probability theory. Topics covered include random variables, conditional probability, expectation, independence, Bayes' rule, important distributions, joint distributions, central limit theorem, laws of large numbers, statistical inference; point and confidence interval estimation, hypothesis tests, analysis of variance, linear regression.",
    "SF212 Web Application Development: Introduction to the basic principles of web application programming. Web server systems. Basic HTML and Cascading StyleSheets. Server-side web application development. Database access and manipulation through the web. Session management. Web application security.",
    "SF220 Introduction to Software Engineering: Scientific foundation for software engineering, introduction to software development process and life cycles. Methods, techniques, and tools used for software engineering process. Students work in small teams on substantial, realistic projects, covering most phases of the software production life cycle.",
    "SF231 Data Structures and Algorithms: Introduction to data structures and algorithms, algorithm analysis, arrays and linked lists, stacks, queues, priority queues, heaps, binary trees, binary search trees, AVL trees, other variations in trees, hashing, sorting, graph algorithms, algorithm design techniques, online judges and algorithm competitions.",
    "SF251 Introduction to Data Science: Learning the foundational principles of data science and data mining through an in-depth examination of the data science lifecycle. This comprehensive course delves into each important steps, beginning with a comprehension of problem formulation tailored to the specific needs and objectives of businesses. Learners will engage in data mining methodologies, including processes of data extraction, transformation, and integration. Moreover, an analysis of data quality and refinement techniques will be undertaken to ensure the integrity and reliability of datasets. Furthermore, learners will explore the complexities of feature engineering, unraveling intricate relationships and patterns within the data domain. Leveraging state-of-the-art machine learning algorithms, learners will delve into predictive analytics, forecasting future trends and behaviors with precision and accuracy. Additionally, the course will emphasize the art of data presentations, enabling participants to effectively communicate analytical insights to diverse stakeholders.",
    "CN351 Web Application Security: Current state of security in web applications. Key security mechanisms for web applications. Client and server side controls. Common vulnerabilities of web-based applications and how to protect against the attacks."
    ]

doc_embeddings = model.encode(documents)

### Setup Pinecone

- [Documentation](https://docs.pinecone.io/guides/index-data/create-an-index)

In [None]:
from google.colab import userdata

pinecone = Pinecone(api_key=userdata.get('pinecone_key'))
INDEX_NAME = 'sf323-2025' #Can't be upper case

# Cleaning up the index
if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(INDEX_NAME)
print(INDEX_NAME)

# Creating a serverless index
pinecone.create_index(
    name = INDEX_NAME,
    dimension = model.get_sentence_embedding_dimension(),
    metric = 'cosine',
    spec = ServerlessSpec(cloud='aws', region='us-east-1'))

dense_index = pinecone.Index(INDEX_NAME)
print(dense_index)

sf323-2025
<pinecone.db_data.index.Index object at 0x7aefd75ff310>


### Upsert to Pinecone

- Format: a list of dicts
  - `{'id': xx, 'values': embedding, 'metadata': dict}`
- Documentation: https://docs.pinecone.io/reference/api/2024-07/data-plane/upsert

In [None]:
records = []
for x in range(len(documents)):
  record = {
      'id': str(x),
      'values':doc_embeddings[x],
      'metadata': {
          'text':documents[x]
       }
  }
  records.append(record)

dense_index.upsert(vectors=records)   # There is a limit on how much you can upsert at a time. See https://docs.pinecone.io/guides/data/upsert-data

{'upserted_count': 10}
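
The comment on the upsert call notes that there is a limit on how much can be upserted at once. For a larger collection, one common approach is to send the records in batches. The sketch below shows only the batching logic; `chunked` is a hypothetical helper, the toy records are stand-ins, and the batch size of 100 is an assumed illustrative value, so check the Pinecone documentation for the actual limits.

```python
def chunked(items, batch_size):
    """Yield successive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in records; in the notebook these would be the `records` list built above.
toy_records = [{"id": str(i)} for i in range(250)]

for batch in chunked(toy_records, batch_size=100):  # 100 is an assumed, illustrative size
    # In the notebook this would be: dense_index.upsert(vectors=batch)
    print(len(batch))
```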

In [None]:
dense_index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}
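
The stats above report a `total_vector_count` of 0 even though the upsert returned 10. One plausible explanation (an assumption; the notebook does not say) is that freshly upserted vectors take a moment to show up in the index stats. A minimal polling sketch follows; `wait_for_count` is a hypothetical helper that takes any zero-argument function returning the current count, and the fake counter stands in for the real index.

```python
import time

def wait_for_count(get_count, expected, timeout_s=30.0, interval_s=1.0):
    """Poll get_count() until it returns at least `expected` or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_count() >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Stand-in for the index: reports 0 twice, then 10.
fake_counts = iter([0, 0, 10])
print(wait_for_count(lambda: next(fake_counts), expected=10, interval_s=0.01))
```

In the notebook, `get_count` could for example wrap a call to `dense_index.describe_index_stats()` and read its `total_vector_count` field.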

### Retrieving Documents

In [None]:
query = 'Can you tell me about the data science class?'

# 1) Embed your query
embed_query = model.encode(query).tolist()
# 2) Retrieve the top-k nearest documents from the index
retrieved_docs = dense_index.query(vector=embed_query, top_k=3, include_metadata=True)
print(retrieved_docs)

In [None]:
text = [r['metadata']['text'] for r in retrieved_docs['matches']]
print(text)

['SF251 Introduction to Data Science: Learning the foundational principles of data science and data mining through an in-depth examination of the data science lifecycle. This comprehensive course delves into each important steps, beginning with a comprehension of problem formulation tailored to the specific needs and objectives of businesses. Learners will engage in data mining methodologies, including processes of data extraction, transformation, and integration. Moreover, an analysis of data quality and refinement techniques will be undertaken to ensure the integrity and reliability of datasets. Furthermore, learners will explore the complexities of feature engineering, unraveling intricate relationships and patterns within the data domain. Leveraging state-of-the-art machine learning algorithms, learners will delve into predictive analytics, forecasting future trends and behaviors with precision and accuracy. Additionally, the course will emphasize the art of data presentations, ena

### Generate with retrieved documents

In [None]:
def dense_retriever(query, top_k=3):
    # First embed your query
    embed_query = model.encode(query).tolist()
    # Then retrieve the document
    retrieved_docs =  dense_index.query(vector=embed_query,
                                  top_k=top_k,
                                  include_metadata=True)
    # Then get the actual text
    texts = [r['metadata']['text'] for r in retrieved_docs['matches']]
    return "\n".join(texts)

In [None]:
def RAG_pinecone_response(query, top_k=3):
    # Retrieve context
    context = dense_retriever(query, top_k)

    # Augmented prompt
    prompt = f'''<question>{query}</question>
    Use the context in <context> tags to answer the question.
    <context>{context}</context>'''

    # Generate
    response = generate(prompt)

    # printing out
    print('Query:', query)
    print('Retrieved documents:', context)
    print('\nResponse:', response)

query = 'Can you tell me about the data science class?'
RAG_pinecone_response(query)

Query: Can you tell me about the data science class?
Retrieved documents: ['SF251 Introduction to Data Science: Learning the foundational principles of data science and data mining through an in-depth examination of the data science lifecycle. This comprehensive course delves into each important steps, beginning with a comprehension of problem formulation tailored to the specific needs and objectives of businesses. Learners will engage in data mining methodologies, including processes of data extraction, transformation, and integration. Moreover, an analysis of data quality and refinement techniques will be undertaken to ensure the integrity and reliability of datasets. Furthermore, learners will explore the complexities of feature engineering, unraveling intricate relationships and patterns within the data domain. Leveraging state-of-the-art machine learning algorithms, learners will delve into predictive analytics, forecasting future trends and behaviors with precision and accuracy. 

## Demo 5: Hybrid Search with Pinecone

- Maintain two vector indexes: one sparse (created below) and one dense (from Demo 4)

### Sparse database

In [None]:
from pinecone import Pinecone, ServerlessSpec

index_name = "test-sparse-search"

if not pinecone.has_index(index_name):
    pinecone.create_index(
        name=index_name,
        vector_type="sparse",
        metric="dotproduct",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

sparse_index = pinecone.Index(index_name)
print(sparse_index)

<pinecone.db_data.index.Index object at 0x7aefdc1be8d0>


In [None]:
sparse_embeddings = pinecone.inference.embed(
    model="pinecone-sparse-english-v0",
    inputs=[d for d in documents],
    parameters={"input_type": "passage", "truncate": "END"}
)
print(sparse_embeddings)

https://github.com/pinecone-io/pinecone-text

In [None]:
# Requires the pinecone-text package (pip install pinecone-text)
from pinecone_text.sparse import BM25Encoder

# Initialize BM25 and fit the corpus
bm25 = BM25Encoder()
bm25.fit(documents)

# Encode every document as a sparse vector (for upsert to the Pinecone index)
sparse_embeddings = [bm25.encode_documents(d) for d in documents]

print(sparse_embeddings)
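
Each sparse embedding is a dictionary with two parallel lists, `indices` and `values`, and the sparse index above was created with the `dotproduct` metric. The sketch below shows how a dot product between two such vectors can be computed; `sparse_dot` is a hypothetical helper and the vectors are made up, since real indices are integer token ids produced by the encoder.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors given as {'indices': [...], 'values': [...]}."""
    b_lookup = dict(zip(b["indices"], b["values"]))
    # Only indices present in both vectors contribute to the sum
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))

# Made-up toy vectors for illustration
doc_vec = {"indices": [3, 17, 42], "values": [0.5, 0.2, 0.9]}
query_vec = {"indices": [17, 42, 99], "values": [1.0, 0.5, 0.3]}
print(sparse_dot(query_vec, doc_vec))
```

Only the shared indices 17 and 42 contribute here, which is why a document that shares no terms with the query scores zero.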

In [None]:
sparse_records = []
for x in range(len(documents)):
  record = {
      'id': str(x),
      'sparse_values': sparse_embeddings[x],
      'metadata': {
          'text': documents[x]
      }
  }
  sparse_records.append(record)
sparse_index.upsert(vectors = sparse_records)

{'upserted_count': 10}

In [None]:
query = "Can you tell me about the data science class?"
encoded_query = bm25.encode_queries(query)
retrieved_docs = sparse_index.query(sparse_vector = encoded_query,
                                    top_k=3,
                                    include_metadata=True)
print(retrieved_docs)

In [None]:
def sparse_search(query, top_k=3):
    # Encode the query with the fitted BM25 encoder
    encoded_query = bm25.encode_queries(query)
    retrieved_docs = sparse_index.query(sparse_vector = encoded_query,
                                    top_k=top_k,
                                    include_metadata=True)
    # Return the raw matches (ids, scores, and metadata)
    return retrieved_docs

sparse_search(query)

{'matches': [{'id': '8',
              'metadata': {'text': 'SF251 Introduction to Data Science: '
                                   'Learning the foundational principles of '
                                   'data science and data mining through an '
                                   'in-depth examination of the data science '
                                   'lifecycle. This comprehensive course '
                                   'delves into each important steps, '
                                   'beginning with a comprehension of problem '
                                   'formulation tailored to the specific needs '
                                   'and objectives of businesses. Learners '
                                   'will engage in data mining methodologies, '
                                   'including processes of data extraction, '
                                   'transformation, and integration. Moreover, '
                                   'an ana

In [None]:
def dense_search(query, top_k=3):
    # First embed your query
    embed_query = model.encode(query).tolist()
    # Then retrieve the document
    retrieved_docs =  dense_index.query(vector=embed_query,
                                  top_k=top_k,
                                  include_metadata=True)
    return retrieved_docs

dense_search(query)

### Merge sparse and dense
- The code below merges the two result sets and removes duplicates.
- Then we use a reranker to pick the final top-k.
- Alternatively, we can use Reciprocal Rank Fusion, as discussed in class.

In [None]:
def merge_chunks(h1, h2):
    """Merge the unique hits from two search results into one list of {'_id', 'chunk_text'} dicts."""
    # Deduplicate by id (when an id appears in both, the hit from h2 is kept)
    deduped_hits = {hit['id']: hit for hit in h1['matches'] + h2['matches']}.values()
    # Sort by score descending (note: sparse and dense scores are on different scales)
    sorted_hits = sorted(deduped_hits, key=lambda x: x['score'], reverse=True)
    # Transform to format for reranking
    result = [{'_id': hit['id'], 'chunk_text': hit['metadata']['text']} for hit in sorted_hits]
    return result

sparse_results = sparse_search(query)
dense_results = dense_search(query)

merged_results = merge_chunks(sparse_results, dense_results)

print('[\n   ' + ',\n   '.join(str(obj) for obj in merged_results) + '\n]')

[
   {'_id': '8', 'chunk_text': 'SF251 Introduction to Data Science: Learning the foundational principles of data science and data mining through an in-depth examination of the data science lifecycle. This comprehensive course delves into each important steps, beginning with a comprehension of problem formulation tailored to the specific needs and objectives of businesses. Learners will engage in data mining methodologies, including processes of data extraction, transformation, and integration. Moreover, an analysis of data quality and refinement techniques will be undertaken to ensure the integrity and reliability of datasets. Furthermore, learners will explore the complexities of feature engineering, unraveling intricate relationships and patterns within the data domain. Leveraging state-of-the-art machine learning algorithms, learners will delve into predictive analytics, forecasting future trends and behaviors with precision and accuracy. Additionally, the course will emphasize the
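
Sorting the merged hits by raw score mixes dot-product scores from the sparse index with cosine scores from the dense index. Reciprocal Rank Fusion avoids this by using only each document's rank in every list. A minimal sketch follows; `reciprocal_rank_fusion` is a hypothetical helper, the id lists are made up, and `k=60` is a commonly used constant rather than a value taken from the notebook.

```python
def reciprocal_rank_fusion(ranked_id_lists, k=60):
    """Fuse ranked lists of ids: each id scores sum(1 / (k + rank)), with rank starting at 1."""
    scores = {}
    for ranked_ids in ranked_id_lists:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up rankings for illustration
sparse_ids = ["8", "7", "0"]
dense_ids = ["8", "4", "7"]
for doc_id, score in reciprocal_rank_fusion([sparse_ids, dense_ids]):
    print(doc_id, round(score, 4))
```

In the notebook, the id lists could be built from the search results, for example `[m['id'] for m in sparse_results['matches']]`. Documents that rank well in both lists rise to the top, regardless of how the underlying scores are scaled.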

### Reranking

https://docs.pinecone.io/guides/search/rerank-results

In [None]:
result = pinecone.inference.rerank(
    model="bge-reranker-v2-m3",
    query=query,
    documents=merged_results,
    rank_fields=["chunk_text"],
    top_n=10,
    return_documents=True,
    parameters={
        "truncate": "END"
    }
)

print("Query", query)
print('-----')
for row in result.data:
    print(f"{row['document']['_id']} - {round(row['score'], 2)} - {row['document']['chunk_text']}")

Query Can you tell me about the data science class?
-----
8 - 0.35 - SF251 Introduction to Data Science: Learning the foundational principles of data science and data mining through an in-depth examination of the data science lifecycle. This comprehensive course delves into each important steps, beginning with a comprehension of problem formulation tailored to the specific needs and objectives of businesses. Learners will engage in data mining methodologies, including processes of data extraction, transformation, and integration. Moreover, an analysis of data quality and refinement techniques will be undertaken to ensure the integrity and reliability of datasets. Furthermore, learners will explore the complexities of feature engineering, unraveling intricate relationships and patterns within the data domain. Leveraging state-of-the-art machine learning algorithms, learners will delve into predictive analytics, forecasting future trends and behaviors with precision and accuracy. Additio