# Class Introduction

## Objective
Explore how language models and embedding techniques can improve the retrieval and generation of relevant information, using practical examples of semantic search and similarity analysis between concepts.



**RAG (Retrieval-Augmented Generation)** is a technique that combines generative models (such as language models) with information retrieval systems. The goal is to improve model responses using relevant external data, which is retrieved through semantic searches.

**Embeddings** are numerical representations (vectors) of words, phrases, or documents. These representations capture the meaning and semantic relationships between concepts, allowing efficient similarity comparisons.

**How do they work together?**
1. The query text is converted into an embedding.
2. This embedding is compared with a database of embeddings (e.g., documents, phrases, words).
3. The most semantically similar items are retrieved.
4. The generative model uses these retrieved items to generate a more accurate and contextualized response.

**Advantages:**
- Enables more precise and relevant answers.
- Leverages external and up-to-date information.
- Captures deep semantic relationships between concepts.

**Relevant Links**
- [What are vector embeddings](https://www.pinecone.io/learn/vector-embeddings/) 
- [OpenAI Embeddings pricing](https://platform.openai.com/docs/pricing#embeddings)
- [RAG Course](https://www.youtube.com/watch?v=sVcwVQRHIc8)

### Current Flow without RAG
```mermaid
flowchart TD
    A[User] -->|question or data| B[Generate prompt using template]
    B -->|use the prompt with context| C[LLM API]
    C -->|generate response using training data| D[Raw LLM response]
    D -->|process and format output| E[Formatted response]
    E -->|display response to user| A

##### Configurations Section

In [39]:
from langchain_openai import OpenAIEmbeddings
import random
from utils import *

  from .autonotebook import tqdm as notebook_tqdm


In [2]:

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
)


### Exploring Embeddings with Simple Words: Semantic Search Examples for RAG

In this section, we will test how embeddings work using simple words (animals, plants, and minerals). We will perform semantic searches to observe how the model finds similarities between concepts and how this can be useful for retrieval-augmented generation (RAG) systems. We will analyze the results and explain why certain words appear as more similar, highlighting the power of embeddings to capture contextual relationships in language.

In [40]:


animals= ["dog", "cat", "fish", "bird", "hamster", "rabbit", "turtle", "snake", "lizard"]
plants = ["rose", "tulip", "sunflower"]
minerals = ["gold", "silver", "copper", "iron"]
full_list = animals + plants + minerals
random.shuffle(full_list)
str(full_list)

"['fish', 'silver', 'dog', 'cat', 'tulip', 'rabbit', 'rose', 'gold', 'snake', 'lizard', 'iron', 'turtle', 'sunflower', 'copper', 'bird', 'hamster']"

In [41]:
# Create a vector store from the full list of items using OpenAI embeddings
# This will allow us to perform semantic searches on the items
# Note: In a real application, you would typically use a larger and more diverse dataset
# to create a more effective vector store for semantic search.


from langchain_core.vectorstores import InMemoryVectorStore

vectorstore = InMemoryVectorStore.from_texts(
    full_list,
    embedding=embeddings,
)


In [42]:
keys = [x for x in vectorstore.store.keys()]
item = vectorstore.store[keys[1]]
print("id:",item['id'])
print("text:",item['text'])
print("vector:",item['vector'])
print("vector size:",len(item['vector']))

id: 1cb52e97-0f07-4dbd-9e0c-80721e21766b
text: silver
vector: [0.02142113260924816, -0.03170178085565567, -0.013682607561349869, -0.0342688262462616, 0.035240814089775085, -0.035639580339193344, 0.015576739795506, 0.017807329073548317, -0.0038474546745419502, -0.021907126531004906, 0.01573873683810234, 0.017944402992725372, -0.004454947542399168, -0.022156354039907455, 0.0031807702034711838, 0.03730940818786621, -0.02512216567993164, -0.021022368222475052, -0.07691171020269394, 0.023975716903805733, 0.0023770104162395, 0.012143625877797604, 0.05134093388915062, 0.03222515806555748, -0.0185176283121109, 0.017396103590726852, 0.022978806868195534, -0.020673448219895363, -0.01646149903535843, -0.0027726595290005207, -0.0016854031709954143, -0.04368963837623596, 0.05308552831411362, -0.030530409887433052, -0.03200085461139679, 0.0005163688911125064, 0.023103419691324234, -0.0405244454741478, 0.008816435001790524, -0.018305784091353416, 0.002919080900028348, 0.008965970948338509, 0.04872404

In [43]:
def run_vectorstore_cosine_example(query: str, k: int = 4) -> None:
    retrieved_with_scores = vectorstore.similarity_search_with_score(query, k)
    print("len", len(retrieved_with_scores))
    for doc, score in retrieved_with_scores:
        print(f"- {doc.page_content} (score: {score:.4f})")
    print("==========================\n")
    return


In [55]:
run_vectorstore_cosine_example("animals", 10)

len 10
- dog (score: 0.5994)
- bird (score: 0.5707)
- hamster (score: 0.5418)
- cat (score: 0.5289)
- snake (score: 0.5051)
- rabbit (score: 0.5006)
- turtle (score: 0.4970)
- fish (score: 0.4271)
- lizard (score: 0.3936)
- sunflower (score: 0.3095)



In [51]:
run_vectorstore_cosine_example("flower", 4)

len 4
- rose (score: 0.5643)
- sunflower (score: 0.4832)
- tulip (score: 0.4475)
- fish (score: 0.4297)



In [52]:
run_vectorstore_cosine_example("mineral", 10)

len 10
- iron (score: 0.4294)
- copper (score: 0.3467)
- lizard (score: 0.3206)
- gold (score: 0.2902)
- fish (score: 0.2820)
- silver (score: 0.2713)
- rabbit (score: 0.2708)
- cat (score: 0.2694)
- hamster (score: 0.2495)
- rose (score: 0.2370)



## 🎯 ANSWER: Why does "lizard" appear before "gold" and "silver"?

### The results show:
- **iron**: 0.4294 ✅ (real mineral, high similarity)
- **copper**: 0.3467 ✅ (real mineral, high similarity)  
- **lizard**: 0.3205 ❌ (not a mineral, but high similarity)
- **gold**: 0.2903 ✅ (real mineral, lower similarity)
- **silver**: 0.2714 ✅ (real mineral, lower similarity)

### 🧠 Explanation:

**OpenAI embeddings** capture **real-world semantic associations**, not just taxonomic categories.

**"Lizard" is more related to "mineral" because:**

1. 🦎 **Animal care contexts**: "mineral supplements for lizards"
2. 🥗 **Animal nutrition**: "lizards need minerals in their diet"
3. 🏥 **Veterinary**: "mineral deficiency in lizard care"
4. 🧪 **Scientific**: "calcium minerals for reptile health"

**"Gold" and "Silver" may be less related because:**
- In many contexts they are precious metals, not "minerals" in the nutritional sense
- Their embeddings focus more on economic value, jewelry, investment

### 💡 Conclusion:
Embeddings are not dictionaries, they reflect **real language usage**, where "mineral" frequently appears with "lizard" in animal care contexts, more than with "gold/silver" in general contexts.

This is normal and expected in modern embedding systems that capture real language usage rather than just categorical relationships.

### Searching with Context: Enhancing Retrieval Using Semantic Embeddings

In [56]:
info = [
    "the earth is the third planet from the sun in the solar system",
    "the moon is a natural satellite of the earth",
    "the sun is the star at the center of the solar system",
    "the solar system consists of the sun and the objects that orbit it, including planets, moons, asteroids, and comets",
    "the earth has one natural satellite, the moon",
    "the sun is a nearly perfect sphere of hot plasma, with a diameter of about 1.4 million kilometers",
    "the solar system is located in the Milky Way galaxy",
    "the Milky Way galaxy is a barred spiral galaxy, containing billions of stars",
    "the solar system is about 4.6 billion years old",
    "the earth's atmosphere is composed mainly of nitrogen and oxygen",
    "the moon has a thin atmosphere, composed mainly of helium, neon, and hydrogen",
    "the sun's surface temperature is about 5,500 degrees Celsius",
    "the solar system is part of the Orion Arm of the Milky Way galaxy",
    "the Milky Way galaxy is about 100,000 light-years in diameter",
    "the solar system is located about 27,000 light-years from the galactic center",
    "the earth's rotation period is about 24 hours, while its revolution period around the sun is about 365.25 days",
    "the moon's rotation period is about 27.3 days, which is the same as its revolution period around the earth",
    "the sun's rotation period varies with latitude, ranging from about 25 days at the equator to about 35 days at the poles",
    "the solar system contains eight planets, which are divided into terrestrial planets (Mercury, Venus, Earth, Mars) and gas giants (Jupiter, Saturn, Uranus, Neptune)",
    "the terrestrial planets are rocky and have solid surfaces, while the gas giants are composed mainly of hydrogen and helium and have thick atmospheres",
    "the solar system also contains dwarf planets, such as Pluto, Eris, and Ceres, which are smaller than the eight planets and do not clear their orbits of other debris",
    "the asteroid belt is located between the orbits of Mars and Jupiter and contains millions of small rocky bodies",
]

info_vectorstore = InMemoryVectorStore.from_texts(
    info,
    embedding=embeddings,
)

In [57]:
def run_vectorstore_for_info(query: str, k: int = 4) -> None:
    retrieved_with_scores = info_vectorstore.similarity_search_with_score(query, k)
    print("len", len(retrieved_with_scores))
    for doc, score in retrieved_with_scores:
        print(f"- {doc.page_content} (score: {score:.4f})")
    print("==========================\n")
    return


In [64]:
run_vectorstore_for_info("cuales son los satelites de jupiter", 4)

len 4
- the solar system contains eight planets, which are divided into terrestrial planets (Mercury, Venus, Earth, Mars) and gas giants (Jupiter, Saturn, Uranus, Neptune) (score: 0.4182)
- the asteroid belt is located between the orbits of Mars and Jupiter and contains millions of small rocky bodies (score: 0.3999)
- the solar system also contains dwarf planets, such as Pluto, Eris, and Ceres, which are smaller than the eight planets and do not clear their orbits of other debris (score: 0.3786)
- the solar system consists of the sun and the objects that orbit it, including planets, moons, asteroids, and comets (score: 0.3722)



### Another example BTS Info

In [65]:
bts_info = [
"Blue Trail Software is a San Francisco-based software development company, founded in 2013, with production centers in Argentina, Bolivia, Mexico, Peru, Uruguay, and Spain.",
"The company specializes in creating sophisticated web solutions, mobile applications, IoT platforms, AI-powered solutions, and back-end engines.",
"Our core objective is to become an agile and resilient organization centered on the empowerment of our stakeholders (employees, clients, suppliers, and the local environment).",
"We believe in our people; success starts with a well-trained, disciplined, and passionate workforce.",
"We believe in empowering people; we pledge to our employees to create the conditions that will compel them to transform themselves.",
"We believe in creativity and execution, not in competition; our success comes from our ability to execute our vision, one step at a time.",
"We believe in patience; we are here for the long run, building what we believe a corporation of the 21st century should look like.",
"We believe in diversity; each new employee, client, and supplier makes us vastly more connected, more diverse, and more powerful.",
"We believe in giving back to our communities; making a difference in the life of our local communities is core to our mission.",
"Our mission is to excel in the global marketplace while benefiting our stakeholders and society by simultaneously optimizing financial performance, ethical practice, social contribution, and environmental impact.",
"As per our constitution, our company is allowed a maximum of three levels of management depth to preserve its agility.",
"No employee will be more than two management levels away from the CEO’s level.",
"The current leadership structure comprises the Office of the CEO, focused on corporate vision and strategy; the Office of the COO, focused on operations; and the Office of the CTO, in charge of defining our technical priorities."
"Seniority categories include Junior, Semi‑Senior, Senior Engineer, Staff Engineer, Principal Engineer, and Distinguished Engineer. ",
"Within the Junior category, there are three levels: Position 1, Position 2, and Position 3. ",
"The Semi‑Senior category likewise has three levels numbered 1 through 3. ",
"The Senior Engineer category is split into three positions: Position 1, Position 2, and Position 3. ",
"The Staff Engineer, Principal Engineer, and Distinguished Engineer categories represent the most advanced career levels without numbered subdivisions. ",
"Available career paths include Backend Engineer, Frontend Engineer, Fullstack Engineer, QA Manual, QA Automation, Mobile Engineer, Designer, Project Manager, Architect, and DevOps Engineer. ",
"Backend Engineers build and maintain the infrastructure that powers applications, manage data storage, integrate with external services, and ensure backend systems are secure, scalable, and efficient. ",
"Frontend Engineers work on the client side of applications, creating the visual and interactive elements users interact with directly, using HTML, CSS, JavaScript, and frameworks like React or Angular. ",
"Fullstack Engineers are versatile, handling both client and server sides of an application—from the user interface to server logic and databases. ",
"QA Manual testers execute test cases manually without automation tools to ensure software functions as expected and to identify bugs through hands‑on testing. ",
"QA Automation Engineers create automated tests with tools and scripts to streamline the testing process, ensuring repeatable and consistent test execution. ",
"Mobile Engineers develop applications for mobile devices, working on platforms like iOS (Swift, Objective‑C) and Android (Java, Kotlin). ",
"Designers are responsible for aesthetics, user experience, and interfaces of applications and websites, focusing on visual identity and usability. ",
"Project Managers oversee the project lifecycle, coordinating teams, managing resources, and communicating progress to stakeholders to ensure objectives and deadlines are met. ",
"Architects design the technical architecture of systems and applications, making key decisions about technologies and ensuring solutions are scalable and secure. ",
"DevOps Engineers integrate development and operations, automate processes, and manage infrastructure to implement CI/CD pipelines and accelerate software delivery. ",
"Summary of all roles and levels: BTS offers the following seniority levels—Junior (Positions 1–3), Semi‑Senior (Positions 1–3), Senior Engineer (Positions 1–3), Staff Engineer, Principal Engineer, Distinguished Engineer—and the career paths of Backend Engineer, Frontend Engineer, Fullstack Engineer, QA Manual, QA Automation, Mobile Engineer, Designer, Project Manager, Architect, and DevOps Engineer."
]

In [66]:
bts_vectorstore = InMemoryVectorStore.from_texts(
    bts_info,
    embedding=embeddings,
)

In [67]:
def run_vectorstore_bts(query: str, k: int = 4) -> None:
    retrieved_with_scores = bts_vectorstore.similarity_search_with_score(query, k)
    print("len", len(retrieved_with_scores))
    for doc, score in retrieved_with_scores:
        print(f"- {doc.page_content} (score: {score:.4f})")
    print("==========================\n")
    return

In [74]:
run_vectorstore_bts("role positions", 4)

len 4
- The Senior Engineer category is split into three positions: Position 1, Position 2, and Position 3.  (score: 0.4497)
- Summary of all roles and levels: BTS offers the following seniority levels—Junior (Positions 1–3), Semi‑Senior (Positions 1–3), Senior Engineer (Positions 1–3), Staff Engineer, Principal Engineer, Distinguished Engineer—and the career paths of Backend Engineer, Frontend Engineer, Fullstack Engineer, QA Manual, QA Automation, Mobile Engineer, Designer, Project Manager, Architect, and DevOps Engineer. (score: 0.4419)
- Within the Junior category, there are three levels: Position 1, Position 2, and Position 3.  (score: 0.4397)
- The current leadership structure comprises the Office of the CEO, focused on corporate vision and strategy; the Office of the COO, focused on operations; and the Office of the CTO, in charge of defining our technical priorities.Seniority categories include Junior, Semi‑Senior, Senior Engineer, Staff Engineer, Principal Engineer, and Distin

Basic Flow Using RAG
```mermaid
flowchart TD
    A[User] -->|"enter question or data"| B[Generate question's embedding]
    B -->|"vector search (top‑k)"| C[Retrieve relevant data]
    C -->|"combine with prompt template"| D[Construct RAG prompt]
    D -->|"call LLM API"| E[LLM API]
    E -->|"receive raw response"| F[Raw LLM response]
    F -->|"process & format output"| G[Formatted, context‑augmented response]
    G -->|"display to user"| A