# **Part 1.A.1: RAG Explained in 'Plain English'**

## **Introduction**
Retrieval Augmented Generation (RAG) describes a system that parses document text into chunks and then produces numerical representations of those chunks, called embeddings. Once embeddings are created for the chunked text, a RAG system stores them as vectors in a special type of database called a vector store (i.e. a database of embedding vectors and their associated metadata).

When a user submits a query to retrieve information from the documents represented in the vector store, a typical RAG system first converts the query into an embedding and measures its similarity to the embeddings in the vector store. The system then retrieves the embeddings most similar to the user's query. Finally, if the system is used with a Large Language Model (LLM), it inserts the text associated with those embeddings into a prompt that is sent to the LLM.

The prompt sent to the LLM can be constructed in a few different ways. For example, the user's query and the text from the vector store can be sent together as one 'user prompt.' Alternatively, the user's query can be sent in the 'user query' parameter of the LLM API call while the text from the vector store is placed in the 'system prompt.' A third option is to pass the retrieved text (i.e. the context) through the 'function' parameter of the LLM API call. In every case, the idea is the same: RAG with LLMs retrieves information from a vector/embedding database (i.e. the vector store) and includes it alongside the user's query in the call made to an LLM.
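Two of the prompt layouts described above can be sketched in a few lines. This is a minimal illustration rather than any particular library's API: `build_messages` and its `style` argument are hypothetical names, the example chunks are made up, and the message format assumes the common chat convention of dictionaries with `role` and `content` keys.

```python
def build_messages(query, retrieved_chunks, style="system"):
    """Assemble chat messages from a user query and retrieved text chunks.

    style="system": retrieved context goes in the system prompt.
    style="user":   context and query are combined into one user prompt.
    """
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    if style == "system":
        return [
            {"role": "system",
             "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ]
    if style == "user":
        return [
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ]
    raise ValueError(f"unknown style: {style!r}")


# Made-up chunks standing in for text retrieved from a vector store.
chunks = ["Proficient in Python and SQL.", "Managed a team of five."]
messages = build_messages("Which skills are listed?", chunks)
print(messages[0]["content"])
```

Either list could then be passed as the `messages` argument of a chat-style LLM call.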

![vector representation diagram](./assets/rag-tutorial-images/rag0sys-di-03.png)

source: https://qdrant.tech/articles/what-is-rag-in-ai/#


## **Potential Benefits of RAG (over fine-tuning or training an LLM)**
Research suggests RAG can improve an LLM's ability to provide relevant contextual information in response to user queries. Meanwhile, RAG systems do not necessarily require the same amount of computing or data resources as training a new large language model or fine-tuning an existing one. This may enable businesses with fewer resources to build information systems with LLMs and/or allow businesses to build customized LLM applications at a lower cost.

## **Potential Limitations of RAG Systems with LLMs**
Depending on the LLM used (e.g. open vs. closed source), RAG systems have privacy implications. For example, a business can securely store its sensitive or proprietary data in a vector store built on local infrastructure or on the entity's own corporate infrastructure. However, unless the entity also hosts its own LLM resource(s), calls to an external LLM API provider might expose trade secrets, intellectual property, or sensitive or private customer/user data. Relatedly, ethical considerations around data privacy, security, and intellectual property rights must be weighed when deciding how to build a RAG system utilizing an LLM or other generative AI model.

## **Main Components of a RAG System**
The diagram below depicts the main components of a RAG system. To summarize, the main components of RAG are:

1) Document loader and text chunker
2) Embedding generator (e.g. embedding model)
3) Vector database (called a vector store); a crucial design consideration when building a vector store is which 'text similarity' measure and search method to use (discussed in more detail below)
4) LLM

![rag system diagram](./assets/rag-tutorial-images/rag-sys-diagram.png)

source: https://wandb.ai/cosmo3769/RAG/reports/A-Gentle-Introduction-to-Retrieval-Augmented-Generation-RAG---Vmlldzo1MjM4Mjk1

![rag/indexing diagram](./assets/rag-tutorial-images/index-rag-di.png)

source: https://qdrant.tech/articles/what-is-rag-in-ai/#

## **Chunking Text**

Please read these brief articles/blogs on text chunking:

1) [Chunking strategies for LLM applications](https://drlee.io/chunking-strategies-for-llm-applications-7a37d56e2b15)
2) [Optimizing RAG with Advanced Chunking Techniques](https://antematter.io/blogs/optimizing-rag-advanced-chunking-techniques-study)
3) [How Chunk Sizes Affect Semantic Retrieval Results](https://ai.plainenglish.io/investigating-chunk-size-on-semantic-results-b465867d8ca1)
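As a concrete starting point alongside the readings, one of the simplest strategies is fixed-size chunking with overlap. The sketch below splits on whitespace-separated words; `chunk_words`, `chunk_size`, and `overlap` are illustrative names and are not part of any library used in this notebook.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to `chunk_size` words, where consecutive
    chunks share `overlap` words so text near a boundary appears in both."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Twelve placeholder words: w0 ... w11
sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, chunk_size=5, overlap=2):
    print(chunk)
```

Larger chunks carry more context per embedding, while smaller chunks make retrieval more precise; the overlap is a hedge against splitting a relevant passage across two chunks.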


## **Embeddings**

![vector representation diagram](./assets/rag-tutorial-images/vector-rep-di.png)

source: https://qdrant.tech/articles/what-is-rag-in-ai/#

### **Text Chunk (Document or Knowledge Base) Embeddings**
After chunking our document(s) or the text in our knowledge base, the next step in creating a RAG system is to transform the text chunks into embedding vectors. Essentially, embeddings are numerical representations of text. By analogy, imagine that your phone number, instead of being +01 987 654 3219, were written out as the name Jane Smith Doe. Creating embedding vectors does the opposite: we take natural language text and transform it into a numerical representation that carries meaning. In the analogy, the phone number is the numerical representation that stands for a specific person's name.

Typically, embeddings that have similar numerical representations also have similar meanings or connections. For example, imagine we had scraped the website "petfinder.com", chunked the text of the website pages, and created embeddings for the listings on each page. Pretend there is an embedding "123", an embedding "234", and finally an embedding "789". (Real embeddings are vectors of many numbers; single numbers are used here for simplicity.) Embeddings "123" and "234" are closer to each other than either is to "789". If we transformed the numbers back into text, we might learn that "123" = chihuahua, "234" = dachshund, and "789" = siamese cat. Since chihuahua and dachshund are both breeds of dog, their embeddings are less distant, or more similar, than either is to the embedding for siamese cat (i.e. "789"). In the same way, when we create embeddings for a knowledge base, embeddings that are numerically closer are typically more similar in meaning or association than embeddings that are more distant.
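The same idea can be shown with small vectors. Real embeddings have hundreds of dimensions and are produced by a model; the three-dimensional values below are invented purely for illustration.

```python
import math

# Invented 3-dimensional "embeddings"; real models output many more dimensions.
toy_embeddings = {
    "chihuahua":   [0.90, 0.80, 0.10],
    "dachshund":   [0.85, 0.75, 0.20],
    "siamese cat": [0.10, 0.20, 0.90],
}


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


dog_to_dog = euclidean(toy_embeddings["chihuahua"], toy_embeddings["dachshund"])
dog_to_cat = euclidean(toy_embeddings["chihuahua"], toy_embeddings["siamese cat"])
print(f"chihuahua vs dachshund:   {dog_to_dog:.3f}")
print(f"chihuahua vs siamese cat: {dog_to_cat:.3f}")
```

With these made-up values, the two dog breeds sit much closer together than the dog and the cat, which is the property a retrieval system relies on.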

### **Query Embeddings (In this notebook, see the Section on Using LLMs in RAG Systems for more information)**
In addition to creating embeddings for our knowledge base or text chunks, we must also transform each query into an embedding, using the same embedding model, before submitting it to the vector store (or before retrieving context for an LLM, if we choose to integrate one).
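The key point is that the query must pass through the same embedding function as the text chunks so that both live in the same vector space. The sketch below uses a deliberately crude word-count embedding over a fixed vocabulary as a stand-in for a real model; `VOCAB`, `embed`, and `top_k` are hypothetical names and the chunks are made up.

```python
VOCAB = ["python", "sql", "teaching", "accounting", "marketing"]


def embed(text):
    """Toy embedding: count how often each vocabulary word appears."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def top_k(query, chunks, k=2):
    """Return the k chunks whose embeddings score highest against the query."""
    q_vec = embed(query)  # the query uses the same embed() as the chunks
    scored = []
    for chunk in chunks:
        c_vec = embed(chunk)
        score = sum(q * c for q, c in zip(q_vec, c_vec))  # dot product
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


chunks = [
    "built dashboards with python and sql",
    "ten years of teaching experience",
    "managed accounting and payroll",
]
print(top_k("who knows python", chunks, k=1))
```

If the query were embedded with a different function or vocabulary, the scores would no longer be comparable to the stored chunk embeddings.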

##### **Readings**

*   [Vector Embeddings and RAG Demystified: Leveraging Amazon Bedrock, Aurora, and LangChain](https://community.aws/concepts/vector-embeddings-and-rag-demystified)

## **Vector Store**

## **Similarity Metrics**

**Three main similarity measures used in RAG systems are:**
- Cosine similarity
- Dot Product similarity
- Euclidean Distance

**Please read the article below for definitions of these metrics (and to complete the corresponding homework questions).**

[Vector Similarity Explained](https://www.pinecone.io/learn/vector-similarity/)

**Additional References:**

[Distance Metrics in Vector Search](https://weaviate.io/blog/distance-metrics-in-vector-search)
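For reference alongside the readings, the three measures can be written directly from their standard definitions. This is a plain-Python sketch for small vectors; the function names are illustrative.

```python
import math


def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Length of the difference vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u, twice the length

print(dot(u, v))                  # 18.0
print(cosine_similarity(u, v))    # 1.0
print(euclidean_distance(u, v))   # 3.0
```

The example shows why the choice matters: cosine similarity treats `u` and `v` as identical because it ignores magnitude, while the dot product and Euclidean distance are both affected by vector length.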

## **Generation Model**

![query vectorization diagram](./assets/rag-tutorial-images/query-vectorizatio-di.png)

source: https://qdrant.tech/articles/what-is-rag-in-ai/#

## **Additional Learning Resources**

*   [Microsoft: Retrieval Augmented Generation (RAG) and Vector Databases](https://github.com/microsoft/generative-ai-for-beginners/blob/main/15-rag-and-vector-databases/README.md?WT.mc_id=academic-105485-koreyst)

*   [HuggingFace: Simple RAG for GitHub issues using Hugging Face Zephyr and LangChain](https://huggingface.co/learn/cookbook/en/rag_zephyr_langchain)

*   [HuggingFace: Advanced RAG on HuggingFace documentation using LangChain](https://huggingface.co/learn/cookbook/en/advanced_rag)

# **Setting up Python Environment**

In [None]:
!pip install -r requirements.txt
!pip install ipykernel langchain_experimental llama-index-vector-stores-pinecone PyMuPDF pinecone-client pypdf faiss-cpu langchain_community transformers sentence_transformers

import fitz  # PyMuPDF

import os, io, json, transformers, pinecone, pypdf, faiss, sqlite3, langchain_community, langchain, openai, math, time, nltk, torch, huggingface_hub, datasets

from openai import OpenAI

from nltk.tokenize import sent_tokenize

from transformers import pipeline

from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

from transformers import BertTokenizer, BertModel

import pandas as pd
import numpy as np
from io import StringIO
from dotenv import load_dotenv
from operator import itemgetter

from langchain import document_loaders, embeddings

from langchain.vectorstores import FAISS
from llama_index.vector_stores.pinecone import PineconeVectorStore

from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings.openai import OpenAIEmbedding

from langchain_experimental.text_splitter import SemanticChunker

from langchain_community.vectorstores import FAISS

from llama_index.core.schema import TextNode

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from llama_index.core.node_parser import SentenceSplitter

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from pinecone import Pinecone, ServerlessSpec         # vector store

from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    SimpleDirectoryReader
)

from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.llms.openai import OpenAI as LlamaIndexOpenAI  # alias avoids shadowing the OpenAI client imported from openai above

from sentence_transformers import util  # SentenceTransformer is already imported above

# Download tokenizers
nltk.download('punkt')



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/shreyaraut/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Follow these directions to set up the environment:

TLDR:

1. Open a terminal window, type the following, and press enter: `conda env create -f environment.yml`

2. When conda has finished installing the packages listed in the `environment.yml` file, type the following in your terminal window and press return: `conda activate genai_lab_hw2`

3. Then run the import cell in this section to import the required Python packages.

Contact your TAs if you encounter any issues importing the required Python packages.

## Instructions for setting up your .env file:

1. Create a .env file in the same directory as this notebook

2. Add the following lines to the .env file:

    OPENAI_API_KEY=<your_openai_api_key>

    PINECONE_API_KEY=<your_pinecone_api_key>

    PINECONE_ENV=<your_pinecone_environment>

    HF_TOKEN=<your_huggingface_token>

3. Replace the placeholders with your actual keys

4. Save the file

5. Restart the kernel to ensure the keys are loaded correctly
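After restarting the kernel and calling `load_dotenv()`, it can help to confirm that the keys were actually loaded without printing their values. A small sketch, assuming the four variable names listed above; `missing_keys` is a hypothetical helper, and it takes the environment as an argument so it can also be tried on a plain dictionary.

```python
import os

REQUIRED_KEYS = ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_ENV", "HF_TOKEN"]


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Report only the names of missing variables, never the secret values.
missing = missing_keys(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```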


In [None]:
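# Load the variables defined in the .env file, then read the OpenAI API Key
# (stored under its own name so the `openai` module is not overwritten):
load_dotenv()
openai_api_key = os.getenv('OPENAI_API_KEY')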

In [None]:
# Verify Python version
!python --version

# Verify OpenAI library version
!openai --version

Python 3.12.8
openai 1.60.1


In [None]:
import openai
import os

# Retrieve the API key from environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")

# Note: avoid printing the OpenAI API key itself; it is a secret.

In [None]:
from pinecone import Pinecone

load_dotenv()

# OpenAI API Key (stored under its own name so the `openai` module is not overwritten):
openai_api_key = os.getenv('OPENAI_API_KEY')

# Pinecone API Key (read from the .env file rather than hardcoding the secret in the notebook):
pinecone_api_key = os.getenv('PINECONE_API_KEY')
environment = os.getenv('PINECONE_ENV')

# Hugging Face Token:
HF_TOKEN = os.getenv('HF_TOKEN')

# configure Pinecone client
pc = Pinecone(api_key=pinecone_api_key)

# **Lab Demo**

### **Create a Pinecone Index**

In [None]:
# read the USE_SERVERLESS environment variable to decide whether to use a 'serverless' Pinecone environment:
use_serverless = os.environ.get("USE_SERVERLESS", "False").lower() == "true"

In [None]:
# specify the Pinecone environment to use:
if use_serverless:
    spec = pinecone.ServerlessSpec(cloud='aws', region="us-east-1")
else:
    spec = pinecone.PodSpec(environment=environment)

In [None]:
# Name our Pinecone Index:
index_name = "hw02"

In [None]:
# If a Pinecone index of the same name already exists, delete it:
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)

In [None]:
# define parameters for and create the index:

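dimensions = 1536 #384              # the dimensions of the index need to match the embedding model used for the RAG system. For example, OpenAI's text-embedding-ada-002 produces 1536-dimensional vectors, all-MiniLM-L6-v2 produces 384, and the bert-base-nli-mean-tokens model used later in this notebook produces 768.

# note: this call hardcodes a serverless spec rather than using the `spec` object defined above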
pc.create_index(
    name=index_name,
    dimension=dimensions,
    metric="cosine",          # we can use different distance metrics to measure the similarity between vector embeddings and user queries. this is where we define what similarity metric we are going to use for the vector store.
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# wait for index to be ready before connecting
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)
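The loop above waits indefinitely if the index never becomes ready. One way to bound the wait is a small polling helper with a timeout. `wait_until` is a hypothetical helper, not part of the Pinecone client; the commented usage line reuses the `pc.describe_index` call from the cell above.

```python
import time


def wait_until(predicate, timeout=60.0, interval=1.0):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Assumed usage with the index created above:
# ready = wait_until(lambda: pc.describe_index(index_name).status['ready'])

# Self-contained demonstration: a check that succeeds on the third poll.
calls = {"n": 0}


def third_time_lucky():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(third_time_lucky, timeout=5.0, interval=0.01))
```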

In [None]:
# check that the index was created successfully:

for index in pc.list_indexes():
    print(index['name'])

hw02


In [None]:
# get a description of the index:

pc.describe_index("hw02")

{
    "name": "hw02",
    "dimension": 1536,
    "metric": "cosine",
    "host": "hw02-9ui1von.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "deletion_protection": "disabled"
}

In [None]:
index = pc.Index(index_name)  # connect to the index so it can be used in the vector store

In [None]:
# get index stats:
index_stats_response = index.describe_index_stats()
index_stats_response

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [None]:
vector_store = PineconeVectorStore(pinecone_index=index)    # this function creates a vector store where we will add and store embeddings

### **Load Data Into the Environment**

In [None]:
# define an object to specify the filepath where data is located:
df_path = "Resume.csv"

# create a pandas dataframe to store the data:
df=pd.read_csv(df_path)

df

Unnamed: 0,ID,Resume_str,Resume_html,Category
0,16852973,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,"<div class=""fontsize fontface vmargins hmargin...",HR
1,22323967,"HR SPECIALIST, US HR OPERATIONS ...","<div class=""fontsize fontface vmargins hmargin...",HR
2,33176873,HR DIRECTOR Summary Over 2...,"<div class=""fontsize fontface vmargins hmargin...",HR
3,27018550,HR SPECIALIST Summary Dedica...,"<div class=""fontsize fontface vmargins hmargin...",HR
4,17812897,HR MANAGER Skill Highlights ...,"<div class=""fontsize fontface vmargins hmargin...",HR
...,...,...,...,...
2479,99416532,RANK: SGT/E-5 NON- COMMISSIONED OFFIC...,"<div class=""fontsize fontface vmargins hmargin...",AVIATION
2480,24589765,"GOVERNMENT RELATIONS, COMMUNICATIONS ...","<div class=""fontsize fontface vmargins hmargin...",AVIATION
2481,31605080,GEEK SQUAD AGENT Professional...,"<div class=""fontsize fontface vmargins hmargin...",AVIATION
2482,21190805,PROGRAM DIRECTOR / OFFICE MANAGER ...,"<div class=""fontsize fontface vmargins hmargin...",AVIATION


In [None]:
# rename the column to 'text' (a column named text is required for some of the functions we will use later):
df.rename(columns = {'Resume_str':'text'}, inplace = True)

In [None]:
# check the names of the columns in the dataframe:
df.columns

Index(['ID', 'text', 'Resume_html', 'Category'], dtype='object')

In [None]:
# glance at the first few rows of the dataframe:
df.head()

Unnamed: 0,ID,text,Resume_html,Category
0,16852973,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,"<div class=""fontsize fontface vmargins hmargin...",HR
1,22323967,"HR SPECIALIST, US HR OPERATIONS ...","<div class=""fontsize fontface vmargins hmargin...",HR
2,33176873,HR DIRECTOR Summary Over 2...,"<div class=""fontsize fontface vmargins hmargin...",HR
3,27018550,HR SPECIALIST Summary Dedica...,"<div class=""fontsize fontface vmargins hmargin...",HR
4,17812897,HR MANAGER Skill Highlights ...,"<div class=""fontsize fontface vmargins hmargin...",HR


In [None]:
# check the length of the dataframe (this also tells us how many resumes we have):
len(df)

2484

In [None]:
# Function to select 10 random rows from each group, where group is defined by the column named 'Category', and the row selection is from the column named 'text'
def select_random_rows(group, n=10):
    return group.sample(n=min(n, len(group)))

# Applying the function to each group in the column named 'Category':
sample_resume_df = df.groupby('Category').apply(lambda x: select_random_rows(x['text'], 10)).reset_index(drop=False)

sample_resume_df.tail(100)



Unnamed: 0,Category,level_1,text
140,DIGITAL-MEDIA,1299,DIGITAL MARKETING SPECIALIST Su...
141,DIGITAL-MEDIA,1311,DIGITAL MARKETING DIRECTOR ...
142,DIGITAL-MEDIA,1262,DIRECTOR OF DONOR RELATIONS Pro...
143,DIGITAL-MEDIA,1312,DIGITAL MARKETING MANAGER Summa...
144,DIGITAL-MEDIA,1254,DIRECTOR OF NEW BUSINESS DEVELOPMENT ...
...,...,...,...
235,TEACHER,374,TEACHER Summary An elementar...
236,TEACHER,365,TEACHER Summary Experien...
237,TEACHER,405,PRESCHOOL TEACHER Summary Sh...
238,TEACHER,344,TEACHER Professional Summary ...


In [None]:
# look at the sample resume dataframe:
sample_resume_df = pd.DataFrame(sample_resume_df)
sample_resume_df

Unnamed: 0,Category,level_1,text
0,ACCOUNTANT,1829,ACCOUNTANT Summary To achiev...
1,ACCOUNTANT,1853,ACCOUNTANT Summary If you ne...
2,ACCOUNTANT,1856,ACCOUNTANT Executive Profile ...
3,ACCOUNTANT,1857,ACCOUNTANT Summary Flexible ...
4,ACCOUNTANT,1871,ACCOUNTANT III Summary Ener...
...,...,...,...
235,TEACHER,374,TEACHER Summary An elementar...
236,TEACHER,365,TEACHER Summary Experien...
237,TEACHER,405,PRESCHOOL TEACHER Summary Sh...
238,TEACHER,344,TEACHER Professional Summary ...


In [None]:
# get the length of the sample resume dataframe:
len(sample_resume_df)

240

In [None]:
# glance at what the first resume looks like:
sample_resume_df['text'][0]

"         ACCOUNTANT       Summary    To achieve a job as an Accountant that utilizes my accounting, communication, analytical & leadership skills.      Highlights          MS Office (Excel, Word, PowerPoint), SAP R/3, Adobe Reader, QuickBooks, Lacerte, Prosystems & Tax base  Accounts Payable Processes & Management  Invoices/Expense Reports/Payment Transactions  Corporate Accounting & Bookkeeping  Finalization of Trial Balance & Balance Sheet/Income Statement.  Spreadsheets & Accounting Reports  Tax Reporting, Planning & Filing of returns.  Handle Customer Relations.  Journal Entries & General Ledger  Bank Reconciliation & General Ledger.  Teambuilding & Staff Supervision                Experience     09/2014   to   Current     Accountant    Company Name          Working for all Clients in USA Implemented Quickbooks Accounting v.  2013 and 2016 for all the Companies including but not limited to chart of accounts.  Implemented Quicbooks payroll v.2016 from scratch Working on processing 

### **Chunk Text**

In [None]:
# Splitting Text into Sentences
def split_text_into_sentences(text):
    """Split a block of text into a list of sentences using NLTK's sentence tokenizer."""
    sentences = sent_tokenize(text, language='english')  # 'english' is the default language
    return sentences

In [None]:
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt_tab')
nltk.download('punkt')

# create a list to store the chunks of text that we will create next:
resume_sentences = []

# split the text in each row of the 'text' column into sentences and store the sentences in the list:
for row in sample_resume_df['text']:
    sentences = split_text_into_sentences(row)
    resume_sentences.extend(sentences)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/shreyaraut/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/shreyaraut/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# look at what the split sentences look like:
resume_sentences

['         ACCOUNTANT       Summary    To achieve a job as an Accountant that utilizes my accounting, communication, analytical & leadership skills.',
 'Highlights          MS Office (Excel, Word, PowerPoint), SAP R/3, Adobe Reader, QuickBooks, Lacerte, Prosystems & Tax base  Accounts Payable Processes & Management  Invoices/Expense Reports/Payment Transactions  Corporate Accounting & Bookkeeping  Finalization of Trial Balance & Balance Sheet/Income Statement.',
 'Spreadsheets & Accounting Reports  Tax Reporting, Planning & Filing of returns.',
 'Handle Customer Relations.',
 'Journal Entries & General Ledger  Bank Reconciliation & General Ledger.',
 'Teambuilding & Staff Supervision                Experience     09/2014   to   Current     Accountant    Company Name          Working for all Clients in USA Implemented Quickbooks Accounting v.  2013 and 2016 for all the Companies including but not limited to chart of accounts.',
 'Implemented Quicbooks payroll v.2016 from scratch Working

In [None]:
# define a function to split the resumes into sentences and tag each sentence with its source resume:

def split_resumes_to_sentences(df, text_column):
    """
    Split the resumes into individual sentences and tag each sentence with
    the index of the resume it came from.

    Parameters:
        df (pd.DataFrame): The DataFrame containing the resumes.
        text_column (str): The name of the column containing the resume texts.

    Returns:
        pd.DataFrame: A DataFrame with one row per sentence and the index of its
        source resume (stored in the 'unique_identifier' column).
    """
    # Initialize an empty list to hold the resulting data
    sentences_list = []

    # Iterate through the DataFrame rows
    for idx, row in df.iterrows():
        # Tokenize the resume text into sentences
        sentences = sent_tokenize(row[text_column])

        # Append each sentence along with the original index to the list
        for sentence in sentences:
            sentences_list.append((idx, sentence))

    # Convert the list to a DataFrame
    sentences_df = pd.DataFrame(sentences_list, columns=['unique_identifier', 'sentence'])

    return sentences_df

In [None]:
# Split resumes into sentences and record, for each sentence, the identifier of the resume it came from:
sentences_df = split_resumes_to_sentences(sample_resume_df, 'text')

In [None]:
# look at the sentences dataframe:
sentences_df

Unnamed: 0,unique_identifier,sentence
0,0,ACCOUNTANT Summary To achiev...
1,0,"Highlights MS Office (Excel, Word, Po..."
2,0,Spreadsheets & Accounting Reports Tax Reporti...
3,0,Handle Customer Relations.
4,0,Journal Entries & General Ledger Bank Reconci...
...,...,...
8509,239,01/2004 to 01/2009 Teacher Company ...
8510,239,"Selected for ""Leadership Academy""; a statewide..."
8511,239,Collaborated extensively with district level a...
8512,239,Invited to score Missouri Assessment Program (...


### **Create Embeddings**

In [None]:
# get the length of the sentences dataframe:
len(sentences_df)

8514

In [None]:
# load a pretrained model to create embeddings for the sentences:

model = SentenceTransformer('bert-base-nli-mean-tokens')

In [None]:
# create sentence embeddings:
sentence_embeddings = model.encode(sentences_df['sentence'].tolist())  # pass a plain list of strings to the model

# check the shape of the sentence embeddings:
sentence_embeddings.shape

(8514, 768)

#### **Example Measurement of Computational Costs by Text Chunking Method**

In [None]:

def compute_embedding_costs(text, model_name='all-MiniLM-L6-v2', eps=0.6, min_samples=2):
    """
    Measures the execution time of two steps: creating sentence embeddings,
    and grouping those sentences into paraphrase-level chunks by clustering.

    Parameters:
    - text (str): The input text to be processed.
    - model_name (str): The name of the model to use for embedding.
    - eps (float): The epsilon value for DBSCAN clustering.
    - min_samples (int): The minimum sample count for DBSCAN clustering.

    Returns:
    - A tuple containing the execution times (in seconds) for sentence embedding and for paraphrase-level clustering.
    """
    model = SentenceTransformer(model_name)  # model loading is not included in the timings

    # Sentence Embedding Timing (tokenization + encoding)
    start_time = time.time()
    sentences = sent_tokenize(text)
    sentence_embeddings = model.encode(sentences)
    sentence_embedding_time = time.time() - start_time

    # Paraphrase Chunking (Clustering) Timing
    start_clustering_time = time.time()
    clustering = DBSCAN(eps=eps, min_samples=min_samples, metric='cosine').fit(sentence_embeddings)
    cluster_labels = clustering.labels_

    # Join the sentences in each cluster into one chunk; note that these
    # chunks are text only and are not re-embedded here.
    paraphrase_chunks = []
    for cluster_id in set(cluster_labels):
        if cluster_id == -1:  # -1 marks noise points that belong to no cluster
            continue
        cluster_sentences = np.array(sentences)[cluster_labels == cluster_id]
        paraphrase = ' '.join(cluster_sentences)
        paraphrase_chunks.append(paraphrase)
    paraphrase_embedding_time = time.time() - start_clustering_time

    return sentence_embedding_time, paraphrase_embedding_time

# Example usage
if __name__ == "__main__":
    text = ("This is a sample text. It has several sentences, meant to showcase "
            "how embeddings are computed. Some of these sentences may be clustered "
            "together, representing paraphrases or semantically similar groups.")

    sent_time, para_time = compute_embedding_costs(text)
    print(f"Sentence Embedding Time: {sent_time:.4f} seconds")
    print(f"Paraphrase Embedding Time: {para_time:.4f} seconds")

Sentence Embedding Time: 0.3793 seconds
Paraphrase Embedding Time: 0.0044 seconds
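A single `time.time()` measurement can be noisy, especially for steps that take only a few milliseconds. One hedged alternative is to repeat the call and keep the fastest run using `time.perf_counter()`. `best_of` is a hypothetical helper, and the workload below is a stand-in rather than an actual embedding call.

```python
import time


def best_of(func, repeats=5):
    """Run `func` several times; return (fastest time in seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        best = min(best, time.perf_counter() - start)
    return best, result


# Stand-in workload; in the notebook this could instead wrap an
# embedding or clustering call in a lambda.
elapsed, total = best_of(lambda: sum(i * i for i in range(100_000)))
print(f"best of 5 runs: {elapsed:.6f} seconds")
```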


In [None]:
def estimate_model_flops(model_name, text):
    """
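    Roughly estimate the FLOPs for generating embeddings for a given text using a specified model.

    Note: this is a crude proxy for demonstration. It sums the number of input
    elements seen by each weighted layer rather than counting true floating point operations.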

    Parameters:
    - model_name (str): Model identifier from Hugging Face Transformers.
    - text (str): Text to process.

    Returns:
    - FLOPs (int): An estimated number of floating point operations.
    """
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs['input_ids']

    # Hooks for the operations
    def hook_fn_forward(module, input, output):
        # Attempt to access the tensor shape in a safer manner
        input_shape = input[0].size()

        # A generalized fallback if shape isn't what's expected
        if len(input_shape) == 2:  # Assuming shape [batch, seq_len] for simplicity
            batch_size, seq_len = input_shape
            # Hypothetical FLOPs calculation: For demonstration, let's assume it's just the product
            flops = batch_size * seq_len
        elif len(input_shape) > 2:  # Assuming more dimensions (e.g., embeddings)
            flops = int(torch.prod(torch.tensor(input_shape)))  # convert to a plain int so the total is a number, not a tensor
        else:
            # In case of unsupported dimensions, set flops to 0 or some placeholder
            flops = 0

        # Storing calculated FLOPs in the module
        if hasattr(module, '__flops__'):
            module.__flops__ += flops
        else:
            module.__flops__ = flops

    def add_hooks_to_model(model, hook_fn):
        """
        Recursively add hook_fn to all the layers of the model.
        """
        total_flops = 0
        for layer in model.children():
            if list(layer.children()):  # if the layer has children, recursively add hooks
                total_flops += add_hooks_to_model(layer, hook_fn)
            else:
                if hasattr(layer, 'weight'):
                    layer.register_forward_hook(hook_fn)
                    layer.__flops__ = 0
        return total_flops

    add_hooks_to_model(model, hook_fn_forward)

    with torch.no_grad():
        _ = model(**inputs)

    total_flops = sum([mod.__flops__ for mod in model.modules() if hasattr(mod, '__flops__')])

    return total_flops

# Example usage
if __name__ == "__main__":
    model_name = "bert-base-uncased"
    text = "This is an example sentence"
    flops = estimate_model_flops(model_name, text)
    print(f"Estimated FLOPs: {flops}")

Estimated FLOPs: 715797


## **Create a FAISS Vector Store**

In [None]:
# specify the dimensions of the sentence embeddings:
d = sentence_embeddings.shape[1]

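# specify the number of vectors (sentence embeddings) to be indexed:
nb = sentence_embeddings.shape[0]

# the values below are not used by the flat index built in this notebook:
nq = 10000                       # number of queries for a synthetic benchmark
np.random.seed(1234)             # set a random seed to make the process reproducible
xb = np.random.random((nb, d)).astype('float32')   # synthetic database vectors

nlist = 100                      # number of clusters (cells) for an IVF-style index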

In [None]:
# glance at the shape of the sentence embeddings or dimension for the vector store:
d

768

In [None]:
# create an index for the vector store:
index = faiss.IndexFlatL2(d)

In [None]:
# add the sentence embeddings to the index:
index.add(sentence_embeddings)

In [None]:
# check the number of vectors in the index:
index.ntotal

8514

In [None]:
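# train the index (a flat index requires no training, so this has no effect here; training matters for indexes such as IVF):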
index.train(sentence_embeddings)

index.is_trained  # check if index is now trained

True

### **Construct Query and Perform Search**

In [None]:
# define a query to submit to the vector store:
question = "Which resume has the most software skills listed?"

In [None]:
# define the number of documents to retrieve from the vector store in response to the query:
k=10

# create an embedding for the query:
xq = model.encode([question])

In [None]:
%%time
 # measure the time it takes to search the index
D, I = index.search(xq, k)  # search the index for the query, using the number of documents to retrieve specified by k
print(I) # print the indices of the documents that are most similar to the query

[[5763 7097 1048 1946 1045 2290 4586 7197 7162 4401]]
CPU times: user 5.76 ms, sys: 38.9 ms, total: 44.7 ms
Wall time: 299 ms
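Conceptually, a flat L2 index compares the query against every stored vector and keeps the `k` smallest distances. The plain-Python sketch below mirrors that exhaustive search on tiny made-up vectors; it returns squared L2 distances and row indices, analogous to the `D` and `I` arrays above. `flat_l2_search` is an illustrative name, and a real FAISS index is far faster.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbor search by squared Euclidean distance.

    Returns (distances, indices) for the k closest stored vectors.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first; ties broken by lower index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


stored = [
    [0.0, 0.0],
    [1.0, 1.0],
    [5.0, 5.0],
    [1.0, 0.0],
]
distances, indices = flat_l2_search(stored, query=[1.0, 0.5], k=2)
print(indices)    # [1, 3]
print(distances)  # [0.25, 0.25]
```

The returned indices play the same role as the row positions printed above: they are used to look up the original sentences.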


In [None]:
# Retrieve the sentences for all k indices returned for the query and concatenate them into one string
first_index = I[0]  # row 0 of I holds the k nearest-neighbor indices for the (single) query
first_row_string = sentences_df['sentence'].iloc[first_index].sum()  # .sum() on strings concatenates them with no separator

print(first_row_string) # Print the string data


Highly computer literate in various database software programs.Served as a technical lead and a tier 2 escalation resource for multiple applications and operating systems.Sought out as first point of contact for computer & software issues.database software) and STATA (data analyzing software).Team member for computer conversion from MAS90 to JDEdwards.Advanced knowledge in repair and software requirements for Dell and Lenovo devices.At the designing stage ER and Schema was formulated and in the implementing stage database was built in the most popular RDBMS called MySQL.Supported customers having data connectivity issues, assisting with troubleshooting steps and rebooting of hardware.Determined and alleviated hardware, software and network issues.Designed, and customized databases and created software integration solutions.
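Summing a Series of strings concatenates them with no separator, which is why the sentences above run together. If more readable context is preferred, joining with a space is one option. A small sketch on a plain list of made-up sentences:

```python
# Made-up sentences standing in for retrieved results.
retrieved = ["Determined hardware issues.", "Designed databases."]

# Concatenation with no separator (what summing strings does):
run_together = "".join(retrieved)

# Joining with a space keeps sentence boundaries readable:
spaced = " ".join(retrieved)

print(run_together)
print(spaced)
```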


### **Define System Prompt (e.g. context message) to send to LLM**

In [None]:
# define a function to retrieve the results from the vector store:
def get_sys_message(q, k):
    xq = model.encode([q])      # embed the query with the same model used for the sentences
    D, I = index.search(xq, k)  # search the index for the k nearest sentences
    top_indices = I[0]          # the k indices returned for the (single) query
    context = sentences_df['sentence'].iloc[top_indices].sum()  # concatenate the retrieved sentences
    return context

In [None]:
# use the custom function to retrieve the results from the vector store:
get_sys_message(q="Which resume has the most software skills listed?", k=100)

'Highly computer literate in various database software programs.Served as a technical lead and a tier 2 escalation resource for multiple applications and operating systems.Sought out as first point of contact for computer & software issues.database software) and STATA (data analyzing software).Team member for computer conversion from MAS90 to JDEdwards.Advanced knowledge in repair and software requirements for Dell and Lenovo devices.At the designing stage ER and Schema was formulated and in the implementing stage database was built in the most popular RDBMS called MySQL.Supported customers having data connectivity issues, assisting with troubleshooting steps and rebooting of hardware.Determined and alleviated hardware, software and network issues.Designed, and customized databases and created software integration solutions.Proficient in Software Development Life Cycle (SDLC) and SRUM AGILE methodologies of development process to produce software solutions by team.Team member for two c

In [None]:
from openai import OpenAI

def rag_openAI_gpt(model, q, k, prompt):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    f = get_sys_message(q, k)  # context retrieved from the vector store for query q

    response = client.chat.completions.create(
        model=model,
        messages=[
            # f-string, so the retrieved context is actually interpolated into the instruction
            {"role": "system", "content": f"Instruction: use the information in {f} to answer the user's question."},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": f},
            {"role": "user", "content": "What is the answer?"}
        ]
    )
    return response.choices[0].message.content
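
As the introduction notes, the retrieved text can be placed in the prompt in more than one way. The sketch below shows one possible layout, with the context in the system message and the question in a single user message. `build_rag_messages` is a hypothetical helper, not part of the notebook or of the OpenAI library, and the example strings are placeholders.

```python
def build_rag_messages(context, question):
    """Build a chat message list with retrieved context in the system prompt."""
    system = (
        "Use only the context below to answer the user's question. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


# Placeholder inputs (hypothetical), standing in for get_sys_message(q, k)
# and the user's prompt.
messages = build_rag_messages("Example retrieved text.", "Summarize the resume.")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

The resulting list could be passed as the `messages` argument of `client.chat.completions.create`. Keeping the context in one place avoids sending the same retrieved text twice.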


In [None]:
gpt_3_5_turbo = "gpt-3.5-turbo"
gpt_4 = "gpt-4"
gpt_4_turbo = "gpt-4-0125-preview"
gpt_4o = "gpt-4o"

Classify the document and return a label based on the document type or class. Make the label specify which occupation the document pertains to

In [None]:
rag_openAI_gpt(model=gpt_3_5_turbo, q="Which resume has the most software skills listed?", k=20, prompt="Classify the document and return a label based on the document type or class. Make the label specify which occupation the document pertains to")

'Based on the content provided, the document pertains to the occupation of a computer technician or IT specialist.'

In [None]:
rag_openAI_gpt(model=gpt_4, q="Which resume has the most software skills listed?", k=20, prompt="summarize the resume")

"The resume belongs to a highly computer-literate professional with advanced knowledge in repair and software requirements for Dell and Lenovo devices. They have served as a technical lead and a tier 2 escalation resource for multiple applications and operating systems. They're experienced in software development life cycle (SDLC) and SRUM AGILE methodologies, and have participated in computer conversions and database building using MySQL. With proficiency in digital marketing, systems integration, database management, and complex problem-solving, they've resolved hardware, software, and network issues. They also have experience in user acceptance testing, system integration testing, performance testing, decision table testing, and regression testing."

In [None]:
rag_openAI_gpt(model=gpt_4_turbo, q="Which resume has the most software skills listed?", k=20, prompt="summarize the resume")

"The summary of the resume highlights the individual's extensive experience and skills in computer science and information technology, showcasing a strong background in database software programs, operating systems, and troubleshooting. Their expertise encompasses the full Software Development Life Cycle (SDLC), agile methodologies, digital marketing, systems integration, and database management. Notably, the individual has played key roles in technical leadership, computer software conversions, and has advanced knowledge in the repair and software requirements for specific hardware brands. Additionally, the person is adept at working with various programming, database, and data analysis software tools, including MySQL and STATA, and has a proficiency in responsiveness design for websites. They have experience in version control tools like SVN and Git, are capable of performing a wide range of testing methodologies, and excel in solving complex hardware and software problems."

In [None]:
#rag_openAI_gpt(model=gpt_4o, q="Which resume has the most software skills listed?", k=5, prompt="What is the document?")

# **Homework 2 Assignment**

## **Section A. Experimenting with Vector Store Query Design (50 points)**

In [1]:
!pip install -r requirements.txt
!pip install ipykernel langchain_experimental llama-index-vector-stores-pinecone PyMuPDF pinecone-client pypdf faiss-cpu langchain_community transformers sentence_transformers


In [6]:
!pip install llama-index
!pip install llama-index-embeddings-huggingface

In [81]:
#### do not change ###
import json, os, io, re, requests, fitz
import requests
from langchain_text_splitters import RecursiveJsonSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    SimpleDirectoryReader
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.core.schema import TextNode
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.llms.openai import OpenAI
from dotenv import load_dotenv

# sentence transformers
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [8]:
print(os.environ.get("PINECONE_API_KEY"))

None


In [None]:
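# path to the local .env file that holds the API keys
# (not passed to load_dotenv() below, which looks for a .env file on its own)
dotenv_path = "/Users/leesunny/Desktop/F25_Genai_Lab/hw2/.env"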

In [33]:
from pinecone import Pinecone, pinecone, ServerlessSpec
load_dotenv()

#access_endpoint_api_key = os.getenv('access_endpoint_api_key')
openai = os.getenv('OPENAI_API_KEY')
pinecone_api_key = os.getenv('PINECONE_API_KEY')  # load keys from the environment; never hardcode them
#environment = os.getenv('PINECONE_ENV')
environment = "us-east-1-aws"
HF_TOKEN = os.getenv('HF_TOKEN')

# configure Pinecone client
pc = Pinecone(api_key=pinecone_api_key)

In [17]:
# import the CMU Student Handbook
doc = fitz.open("the-word-2023-24-12.11.23.pdf")

### **Choose a method to chunk the text data:**

- [Semantic chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker)

- [Recursive chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter)

- [Character chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/character_text_splitter)

- [Token chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/split_by_token)
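
To make the idea of chunking concrete, here is a minimal sketch of the simplest of the four methods, fixed-size character chunking with overlap. `chunk_by_characters` is a hypothetical helper written for illustration. The LangChain splitters linked above additionally try to break on separators such as paragraphs or sentences, which this sketch does not attempt.

```python
def chunk_by_characters(text, chunk_size, overlap):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that content cut at a
    boundary still appears whole in one of the two neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = chunk_by_characters("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

Larger chunks carry more context per embedding but make each match less specific. Smaller chunks are more precise but can separate a policy statement from its heading.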

##### Choose a type of chunker (From langchain):

In [18]:
from langchain_experimental.text_splitter import SemanticChunker

# parser to split the PDF handbook text into chunks (llama_index SentenceSplitter):
text_parser = SentenceSplitter(
    chunk_size=1024
)

In [19]:
text_chunks = []
doc_idxs = []


for doc_idx, page in enumerate(doc):
    page_text = page.get_text("text")
    cur_text_chunks = text_parser.split_text(page_text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))

In [20]:
text_chunks

['1 \n \n \n \nThe Word: Student Handbook \n2023-2024',
 '2 \n \n \nWelcome to The Word 2023-2024 \n \nThe Word student handbook contains information and resources to help you create your \nCarnegie Mellon experience and embrace your role as a valued member of our university. \n \nAt Carnegie Mellon, our ambition is that all students will reach their highest potential in the \nareas of intellectual and artistic pursuits, personal well-being, professional development, \nleadership, and contribution to the larger community. \n \nCommunity membership affords many privileges and likewise responsibilities as we all uphold \nthe standards of our university. To ensure you are knowledgeable of the university’s \nexpectations and the process that will be followed if standards are not met, this handbook \narticulates the rights and responsibilities afforded to and expected of each member of our \ncommunity. \n \nWe hope you will take advantage of this handbook as you prepare for your successful 

In [21]:
nodes = []

for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    src_doc_idx = doc_idxs[idx]
    src_page = doc[src_doc_idx]
    nodes.append(node)

#### **Chunker Choices**

In [None]:
# Chunker choice #1: Semantic Chunker

In [None]:
# Chunker choice #2:

### **Create the vector store using chosen similarity metrics:**

In [25]:
use_serverless = os.environ.get("USE_SERVERLESS", "False").lower() == "true"

if use_serverless:
    spec = pinecone.ServerlessSpec(cloud='aws', region='us-east-1')
else:
    spec = pinecone.PodSpec(environment=environment)

# Name our Pinecone Index:
index_name = "hw02"

# If a Pinecone index of the same name already exists, delete it:
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)

### **choose a similarity metric to use for the vector store:**

In [26]:
pc

<pinecone.control.pinecone.Pinecone at 0x12e781f70>

In [27]:
environment = "us-east-1-aws"

In [36]:
index_name = "hw02"

pc.create_index(
    name=index_name,
    dimension=1536, # Replace with your model dimensions
    metric="cosine", # Replace with your model metric
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

In [None]:
pc.list_indexes()

[
    {
        "name": "hw02",
        "dimension": 1536,
        "metric": "cosine",
        "host": "hw02-9ui1von.svc.aped-4627-b74a.pinecone.io",
        "spec": {
            "serverless": {
                "cloud": "aws",
                "region": "us-east-1"
            }
        },
        "status": {
            "ready": true,
            "state": "Ready"
        },
        "deletion_protection": "disabled"
    }
]

In [37]:
pc_index = pc.Index(index_name)  # create an index to use in the vector store
vector_store = PineconeVectorStore(pinecone_index=pc_index)    # this function creates a vector store where we will add and store embeddings

In [38]:
pc_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [39]:
llm = OpenAI(model="gpt-3.5-turbo")

extractors = [
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
]

pipeline = IngestionPipeline(
    transformations=extractors,
)
nodes = await pipeline.arun(nodes=nodes, in_place=False)

100%|██████████| 5/5 [00:03<00:00,  1.57it/s]
100%|██████████| 297/297 [01:13<00:00,  4.04it/s]


### **choose an embedding model to use for the vector store:**

#### **OpenAI Embeddings**

In [40]:
model_ada="text-embedding-ada-002"
small_txt_embedmodel_="text-embedding-3-small"

In [41]:
embed_model = OpenAIEmbedding(model="text-embedding-3-small", openai_api_key=openai)

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

### **load the embeddings into the vector store (e.g. create a vector store):**

In [42]:
vector_store.add(nodes)

Upserted vectors: 100%|██████████| 297/297 [00:01<00:00, 173.98it/s]


['27142df1-b926-4639-bd2f-e5990f3bb6d9',
 'b116c766-682d-4094-b0af-613954538b49',
 '305183b9-5736-48e1-8b0a-7f3901f5fdd2',
 '9381f809-e8b1-4bd3-b839-2a81ca3d9cbf',
 '162f87e1-3cf2-4a80-938a-90ffeca411b1',
 'ed12045c-f6ee-4180-baed-d69aa5b45421',
 'c5eeceb9-e4a2-4b18-8c5e-e18fe9bf3f2c',
 '84316b9d-f258-473b-9a85-563b12405e04',
 '417d8cce-d944-41d6-b894-9c6ddcedf6d1',
 '1cd5d60e-4a5b-465d-bcab-698c2d5bca25',
 '0d7f3786-8463-43a5-bf6b-cc4b07103261',
 '9711745e-ac02-42a9-8458-2ac2a4e14513',
 '9f065f0e-b748-40a2-8c9c-fd0344c047ee',
 '38044359-16c8-44ce-8400-b0a94ad74c12',
 '071a72ad-b058-4947-9056-d3147471817d',
 '307a7d0e-e800-4c8a-a67a-9517801fcea4',
 '730da04a-8360-4739-9e19-f5d530582544',
 'ab7c0db4-76bb-406b-9325-633377e5a9e1',
 '73de0664-82de-41b6-b19b-4f5b072aebd7',
 '25887983-64cc-40c7-b404-1ae972ea30c1',
 'b8784f9c-d301-47c6-aa34-c94557318039',
 '23c983e9-73a3-460e-9e05-cbb986a9d929',
 '78bcbb3e-5bb0-4beb-9518-456d651b618f',
 '6ee82d92-3a11-4cf6-8ad5-2359179c6611',
 '6e9abe48-21d7-

In [43]:
pc_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
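
The stats above still report a `total_vector_count` of 0 even though 297 vectors were just upserted. One likely explanation (an assumption, not confirmed by the notebook) is that index statistics lag briefly behind writes. The sketch below polls until the count catches up or a timeout passes. `wait_for_vector_count` is a hypothetical helper, and a fake counter stands in for the real index so the example runs on its own.

```python
import time


def wait_for_vector_count(get_count, expected, timeout_s=30.0, poll_s=1.0,
                          sleep=time.sleep):
    """Poll get_count() until it reaches `expected` or the timeout elapses.

    Returns the last observed count, which may still be below `expected`.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        count = get_count()
        if count >= expected or time.monotonic() >= deadline:
            return count
        sleep(poll_s)


# Fake counter (hypothetical) that reports 0 twice before showing the vectors.
fake_counts = iter([0, 0, 297])
result = wait_for_vector_count(lambda: next(fake_counts), expected=297,
                               sleep=lambda seconds: None)
print(result)
```

Against the real index, `get_count` could be something like `lambda: pc_index.describe_index_stats().total_vector_count`. The attribute access is an assumption about the response object and should be checked against the installed client version.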

In [44]:
print(nodes[0].metadata)

{'document_title': 'Carnegie Mellon University Student Success and Community Handbook: Navigating Ethical Standards and Transformative Vision', 'questions_this_excerpt_can_answer': '1. What is the title and publication year of the Carnegie Mellon University Student Success and Community Handbook?\n2. What is the purpose of the handbook mentioned in the excerpt?\n3. What specific information or guidelines can students expect to find in the handbook related to ethical standards and transformative vision at Carnegie Mellon University?'}


In [None]:
#nodes

In [45]:
print(nodes[0])

Node ID: 27142df1-b926-4639-bd2f-e5990f3bb6d9
Text: 1        The Word: Student Handbook  2023-2024


### **Retrieve Content from the Vector Store**

In [46]:
from openai import OpenAI
client = OpenAI()

In [47]:
# define the query:
query = (
    "animal pet fish policy"
)

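# choose one of these models (it must match the model used to embed the nodes,
# which was "text-embedding-3-small" above; otherwise the scores are not meaningful):
embed_model_ada = "text-embedding-ada-002"
embed_model_3_small = "text-embedding-3-small"

res = client.embeddings.create(
    input=[query],
    model=embed_model_3_small
)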

# retrieve from Pinecone
xq = res.data[0].embedding #res['data'][0]['embedding']

# get relevant contexts (including the questions)
res2 = pc_index.query(vector=xq, top_k=2, include_metadata=True)

In [48]:
# print the results:
res2

{'matches': [{'id': '9074eb45-8814-445f-b75e-5dfe369b2033',
              'metadata': {'_node_content': '{"id_": '
                                            '"9074eb45-8814-445f-b75e-5dfe369b2033", '
                                            '"embedding": null, "metadata": '
                                            '{"document_title": "Carnegie '
                                            'Mellon University Student Success '
                                            'and Community Handbook: '
                                            'Navigating Ethical Standards and '
                                            'Transformative Vision", '
                                            '"questions_this_excerpt_can_answer": '
                                            '"1. How does Carnegie Mellon '
                                            'University handle scholarship '
                                            'renewals for students based on '
                          

#### **Query the vector store using these queries**

**Instruction: set the 'k' parameter to 5**

Query 1: What is the policy statement for the academic integrity policy?

Query 2: What is the policy violation definition for cheating?

Query 3: What is the policy statement for improper or illegal communications?

Query 4: What are CMU’s quiet hours?

Query 5: Where are pets allowed on CMU?


### ***query the vector store with the 5 queries above (don't forget to record the responses in your homework submission spreadsheet: see instructions for a link to the spreadsheet!):***

In [71]:
from collections import OrderedDict
import pandas as pd
import json
import re

# Function to clean text by removing extra spaces and newline characters
def clean_text(text):
    return re.sub(r'\s+', ' ', text).strip()

# Initialize OpenAI Embeddings
embed_model = OpenAIEmbeddings(model="text-embedding-3-small")

# List of queries
queries = [
    "What is the policy statement for the academic integrity policy?",
    "What is the policy violation definition for cheating?",
    "What is the policy statement for improper or illegal communications?",
    "What are CMU’s quiet hours?",
    "Where are pets allowed on CMU?"
]

k = 5  # Number of top matches to retrieve

# Store query results in an ordered dictionary
query_results = OrderedDict()

# List to store data for CSV output
data = []

for query in queries:
    # Convert query into an embedding
    query_embedding = embed_model.embed_query(query)

    # Retrieve top k matches from Pinecone
    results = pc_index.query(vector=query_embedding, top_k=k, include_metadata=True)

    for i, match in enumerate(results["matches"]):
        metadata = match['metadata']

        # Ensure `_node_content` exists before accessing it
        if '_node_content' in metadata:
            node_content = json.loads(metadata['_node_content'])  # Parse JSON
            response_text = node_content.get('text', 'N/A')  # Extract text
        else:
            response_text = 'N/A'

        # Clean the response text by removing newline characters and extra spaces
        cleaned_response_text = clean_text(response_text)
        score = match['score']
        response_number = f"Response {i+1}"  # Label responses

        # Append results to data list
        data.append([query, response_number, i+1, score, cleaned_response_text])

# Create DataFrame with added "Response Number" column
df = pd.DataFrame(data, columns=["Query", "Response Number", "Rank", "Score", "Response Text"])

# # Save DataFrame to CSV
# csv_filename = "/Users/leesunny/Desktop/F25_Genai_Lab/hw2/query_results.csv"
# df.to_csv(csv_filename, index=False)
#
# # Print CSV file path
# print(f"CSV file saved at: {csv_filename}")

CSV file saved at: /Users/leesunny/Desktop/F25_Genai_Lab/hw2/query_results.csv


In [74]:
# the similarity metric (cosine) is fixed when the index is created, not passed per query
results = pc_index.query(vector=query_embedding, top_k=k, include_metadata=True)
results

{'matches': [{'id': 'f3e713ba-6295-423f-908d-e0d9146d5b67',
              'metadata': {'_node_content': '{"id_": '
                                            '"f3e713ba-6295-423f-908d-e0d9146d5b67", '
                                            '"embedding": null, "metadata": '
                                            '{"document_title": "Carnegie '
                                            'Mellon University Student Success '
                                            'and Community Handbook: '
                                            'Navigating Ethical Standards and '
                                            'Transformative Vision", '
                                            '"questions_this_excerpt_can_answer": '
                                            '"1. What are the guidelines for '
                                            'impounded bicycles or wheeled '
                                            'vehicles at Carnegie Mellon '
                          

### **Homework Questions:**

**A.II.** Explain your rationale for choosing the similarity metric you decided to use in the vector store. What is one pro of using the metric, and what is one difference between the metric you selected and the other two similarity metrics we discussed in the lab? (We discussed cosine, dot product, and Euclidean similarity metrics.)

Answer: I chose cosine similarity, which measures the angle between two vectors rather than their magnitude. This makes it effective for text embeddings because the direction of a vector (which represents semantic meaning) matters more than its length. One pro is that the score is bounded between -1 and 1, so results are easy to compare across queries. The main difference from the other two metrics is sensitivity to magnitude: the dot product grows with the norms of the vectors, and Euclidean distance measures the straight-line distance between their endpoints, so both change when a vector is rescaled, whereas cosine similarity does not. This matters for text search, where embeddings of different words and sentences may have different magnitudes (norms), but their semantic similarity should be preserved.
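
The difference between the three metrics can be checked on toy vectors. The sketch below uses made-up 2-dimensional vectors (not real embeddings): `b` points in the same direction as `a` but is twice as long, while `c` points in a different direction.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (independent of magnitude)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between the endpoints of a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude
c = [2.0, 1.0]  # different direction, same magnitude as a

for name, other in (("b", b), ("c", c)):
    print(name,
          "cosine:", round(cosine_similarity(a, other), 3),
          "dot:", dot(a, other),
          "euclidean:", round(euclidean_distance(a, other), 3))
```

Cosine similarity treats `b` as a perfect match for `a`, while Euclidean distance ranks `c` as closer because `b` is longer. When embeddings are normalized to unit length, cosine similarity and the dot product produce the same ranking.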



**A.III.** Copy and paste the results or information retrieved from the vector store in response to each of the queries you submitted to the vector store in the SPREADSHEET TEMPLATE (please see instructions for a link to the spreadsheet template you should copy and use).  


**A.IV.** Qualitatively analyze the responses to your queries submitted to the vector store. Did the queries retrieve the information you were expecting to obtain. Why or why not? Why do you think the queries were successful / unsuccessful in retrieving the information you expected or needed?

Answer: The retrieved responses were mostly aligned with the expected information but had some limitations in relevance and precision. For example:

For the query on the academic integrity policy, the second response provided a structured and direct policy definition, making it the most relevant. However, the first response contained irrelevant introductory content before reaching useful details on academic policies. This suggests a potential chunking issue, where extracted text includes unnecessary context.

For the query on policy violations for cheating, the second response was highly effective, listing specific violations such as unauthorized exam access, plagiarism, and falsified data. However, the first response was too broad, discussing general ethical behavior and citation requirements rather than explicitly defining "cheating." This indicates that the retrieval model may have ranked general integrity discussions higher than specific policy violations.

Success Factors:
- The best-ranked responses contained relevant information on academic integrity and violations.
- Key policies were retrieved, though sometimes mixed with broader ethical discussions.

Areas for Improvement:
- Chunking refinement: Remove unnecessary leading text to avoid irrelevant content.
- Better query expansion or ranking techniques: Ensure direct definitions of terms like "cheating" are prioritized over general policy discussions.



## **Section B. Experimenting with Vector Store Embeddings & Query Parameters (50 points)**

1) Choose 1 of the 5 queries provided in A.1.6.A, above, and experiment with submitting the query to the vector store by changing the search parameters in the following manner:


*   A) Baseline query, e.g. query, k=1.

*   B) Query, parameter k = 3

*   C) Query, parameter k = 5

*   D) Query, parameter k = 10

**In your written homework submission, record the UNIQUE responses/results of each query submitted to the vector store.**


In [75]:
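# Define different values for k
# (`query` and `query_embedding` carry over from the last iteration of the loop above:
# "Where are pets allowed on CMU?")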
k_values = [1, 3, 5, 10]

# List to store data for CSV output
data_full_text = []

# Experiment with different k values
for k in k_values:
    results = pc_index.query(vector=query_embedding, top_k=k, include_metadata=True)

    for i, match in enumerate(results["matches"]):
        metadata = match['metadata']

        # Ensure `_node_content` exists before accessing it
        if '_node_content' in metadata:
            node_content = json.loads(metadata['_node_content'])
            response_text = node_content.get('text', 'N/A')
        else:
            response_text = 'N/A'

        # Clean response text
        cleaned_response_text = clean_text(response_text)
        score = match['score']
        response_number = f"Response {i+1}"

        # Append results to data list
        data_full_text.append([query, k, response_number, i+1, score, cleaned_response_text])

# Create DataFrame with full response text
df_full_text = pd.DataFrame(data_full_text, columns=["Query", "k", "Response Number", "Rank", "Score", "Response Text"])

# # Save DataFrame to CSV
# csv_full_text_filename = "/Users/leesunny/Desktop/F25_Genai_Lab/hw2/query_results_2.csv"
# df_full_text.to_csv(csv_full_text_filename, index=False)
#
# # Print CSV file path
# csv_full_text_filename

'/Users/leesunny/Desktop/F25_Genai_Lab/hw2/query_results_2.csv'
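
Because the same query embedding is reused for every `k`, the results for a larger `k` will usually repeat the results for a smaller `k` and then add more. A small helper can reduce the collected rows to the unique responses the assignment asks for. This is a sketch under that assumption: `unique_responses` and the toy rows are hypothetical, and the rows are simplified to `(k, rank, text)` tuples.

```python
def unique_responses(rows):
    """Keep the first occurrence of each response text.

    rows: iterable of (k, rank, text) tuples, in the order they were collected.
    Returns the rows whose text has not been seen before.
    """
    seen = set()
    unique = []
    for k, rank, text in rows:
        if text not in seen:
            seen.add(text)
            unique.append((k, rank, text))
    return unique


# Toy rows (hypothetical): k=3 repeats the k=1 result and adds two more.
toy_rows = [
    (1, 1, "chunk A"),
    (3, 1, "chunk A"),
    (3, 2, "chunk B"),
    (3, 3, "chunk C"),
]

for k, rank, text in unique_responses(toy_rows):
    print(f"first seen at k={k}, rank={rank}: {text}")
```

Recording the `k` and rank at which each response first appears also shows how far down the ranking a relevant chunk sits.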

2. Return to step A.1.B., above, and select a different text chunking method (e.g. word, sentence, paragraph).
- Chunk your text data using the method. Create embeddings for the text.
- Load the embeddings into the vector store.
- Submit the same query you selected in B.1, above, to the vector store 6 times (using the different ‘k’ parameter settings defined in B.1, above), and record the responses.

**In your written homework submission, record the responses/results of each query submitted to the vector store.**

In [85]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load PDF document
doc = fitz.open("the-word-2023-24-12.11.23.pdf")
full_text = ""

for page in doc:
    full_text += page.get_text("text") + "\n"

# Use RecursiveCharacterTextSplitter for chunking
text_parser = RecursiveCharacterTextSplitter(
    chunk_size=1024,      # Maximum characters per chunk
    chunk_overlap=100,   # Overlapping characters to retain context
    separators=["\n\n", "\n", " "],  # First split by paragraph, then line, then space
)

# Generate text chunks
chunks = text_parser.split_text(full_text)

# Initialize OpenAI embedding model
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Create nodes with embeddings
nodes = []
for chunk in chunks:
    embedding = embed_model.get_text_embedding(chunk)  # Generate embedding
    node = TextNode(text=chunk, embedding=embedding)  # Assign embedding
    nodes.append(node)

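# Store embeddings in the Pinecone vector store.
# Note: this adds to the existing "hw02" index, so vectors from the earlier chunking run
# remain in it unless the index is deleted and recreated first.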
vector_store = PineconeVectorStore(index_name="hw02")
vector_store.add(nodes)

Upserted vectors: 100%|██████████| 851/851 [00:04<00:00, 208.09it/s]


['aa3e3fc2-4888-4a52-8dd7-05847c2be7db',
 '05aaa26f-83dc-4369-82b3-bc4f4c764b7e',
 '7ea63ca1-43b1-4c15-ba1b-7f383f69c9a7',
 '00903342-ba88-4688-a10e-47d3308c01f7',
 '6af77b1d-12c3-4206-8517-627501522840',
 '57447676-f87a-43c3-8931-397c619b85e8',
 '122fa593-9149-4a14-9290-162fd269658c',
 '82927efa-ce58-4b25-a6e8-ebb97eec7c5b',
 'cd011819-a630-4e81-865b-a4192727ae18',
 '97f97139-e017-4dcb-a34f-2ddde519b8de',
 '7bc14caf-56cf-486c-8262-11f7de9e64ff',
 '3cc67412-cdb0-4eb2-8df7-91a38986ddf3',
 'ff0e57d5-9fde-448d-876f-0f31907d9498',
 '0f2e143e-1479-4648-b3d8-b83e5f906713',
 '10b2a0d0-bcd7-47a6-bc95-264c8a0734a1',
 'd78f2e1f-d7a0-4f09-94bc-5730d95785d2',
 'fd9bb1c6-3417-4aa1-9623-2f4a05ed1194',
 '11bc1de3-6aa5-40f6-a308-6936b5f1153d',
 '691842a8-f1c9-4bd1-8a1e-dd05ca1e3542',
 'eaa260e4-f5f3-466a-99c1-eef6e8885d1a',
 '709de5e2-6e3f-4ae6-b53c-7f7f75197ae7',
 '8c46f6ea-b38c-4b2f-8071-890f92080bee',
 'd75860ea-965e-4962-92bc-6a92ebf4b9c6',
 'ec023cc8-6c5c-43ac-99a2-556b8a24e797',
 '7ffd0a1f-e772-

In [88]:
# run the query for different k

# Function to clean text by removing extra spaces and newline characters
def clean_text(text):
    return re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces and newlines

# Initialize OpenAI Embeddings
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Define the query to test
query = "Where are pets allowed on CMU?"

# Define different `k` values
k_values = [1, 3, 5, 10, 15, 20]

# Store results in a list for CSV export
data = []

# Convert query to embedding
query_embedding = embed_model.get_text_embedding(query)

# Loop through each `k` value and submit the query
for k in k_values:
    results = pc_index.query(vector=query_embedding, top_k=k, include_metadata=True)

    for i, match in enumerate(results["matches"]):
        metadata = match['metadata']

        # Ensure `_node_content` exists before accessing it
        if '_node_content' in metadata:
            node_content = json.loads(metadata['_node_content'])  # Parse JSON
            response_text = node_content.get('text', 'N/A')  # Extract text
        else:
            response_text = 'N/A'

        # Clean response text (but do not limit length)
        cleaned_response_text = clean_text(response_text)

        score = match['score']
        response_number = f"Response {i+1}"  # Label responses

        # Append results to data list
        data.append([query, k, response_number, i+1, score, cleaned_response_text])

# # Create DataFrame for analysis
# df = pd.DataFrame(data, columns=["Query", "k Value", "Response Number", "Rank", "Score", "Response Text"])
#
# # Save DataFrame to CSV
# csv_filename = "/Users/leesunny/Desktop/F25_Genai_Lab/hw2/query_results_diff_chunk.csv"
# df.to_csv(csv_filename, index=False)
#
# # Print confirmation
# print(f"Successfully saved results to {csv_filename}")


Successfully saved results to /Users/leesunny/Desktop/F25_Genai_Lab/hw2/query_results_diff_chunk.csv


### **Homework Questions:**

**B.I.** Explain your rationale for selecting the query you chose in B.1. Why did you choose this query vs. the other 4 queries?

**B.II.** Copy and paste the responses to the queries you submitted to the vector store in the SPREADSHEET TEMPLATE.


**B.III.** Copy and paste the responses to the queries you submitted to the vector store in the SPREADSHEET TEMPLATE.

**B.IV.** In observing the responses from the vector store to the queries created in B.1., which ‘k’ parameter do you think retrieved the highest quality / most accurate result? Why do you think this parameter was the best to use with the query?

**B.V.** In observing the responses from the vector store to the queries created in B.2., which ‘k’ parameter do you think retrieved the highest quality / most accurate result? Why do you think this parameter was the best to use with the query?

# **BONUS TASKS / QUESTIONS: Define function to call LLM API**

## Please email Sara for the Bonus Task Python Notebook once you've completed your homework assignment