<img src="banner-5-coding-with-ai.png" width="100%">
<br>

# **Understanding Embeddings in Large Language Models (LLMs)**

---

Embeddings are a foundational concept in natural language processing and machine learning. In the context of a language model, they convert words, sentences, or entire documents into numerical vectors of fixed size. These vectors capture the semantic meaning of the input text. In this notebook, we'll dive deep into what embeddings mean for an LLM like GPT (Generative Pre-trained Transformer).

---

**1. Introduction to Embeddings**

At its core, an embedding is a mapping from discrete objects (such as words) to vectors of real numbers.

```python
# Sample Word to Vector Representation (Hypothetical)
word = "computer"
vector_representation = [0.12, -0.58, 0.91, ...]  # a long list of numbers
```

This vector representation is useful because:
- Vectors can be input into neural networks.
- Semantically similar words will have similar vector representations.
- They allow for efficient computations to measure similarity, perform arithmetic operations, etc.
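As a minimal sketch of the similarity point above, the snippet below computes cosine similarity between small made-up vectors. The three-dimensional values and the `cosine_similarity` helper are illustrative assumptions; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors, for illustration only (not real embeddings)
toy_vectors = {
    "computer": [0.9, 0.8, 0.1],
    "laptop":   [0.8, 0.9, 0.2],
    "banana":   [0.1, 0.2, 0.9],
}

sim_laptop = cosine_similarity(toy_vectors["computer"], toy_vectors["laptop"])
sim_banana = cosine_similarity(toy_vectors["computer"], toy_vectors["banana"])
print(f"computer vs laptop: {sim_laptop:.3f}")
print(f"computer vs banana: {sim_banana:.3f}")
```

With these toy values, "computer" scores much closer to "laptop" than to "banana", which is the behaviour we hope a trained embedding model exhibits.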

---

**2. How LLMs Use Embeddings**

LLMs typically utilize embeddings in two main phases:

1. **Embedding Layer (Input)**: Convert words/tokens into vectors.
2. **Contextual Embeddings (Hidden States)**: Capture contextual information as the model processes sequences.

The magic of LLMs like GPT is that they don't just use a static embedding for each word; the embedding changes based on the context!
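To make the difference between the two phases concrete, here is a toy sketch. It assumes a random, untrained embedding table and a single self-attention step with no learned weights, which is a heavy simplification of what a real transformer does. The vocabulary, dimensions, and function names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"river": 0, "bank": 1, "loan": 2}
dim = 4
embedding_table = rng.normal(size=(len(vocab), dim))  # random, untrained weights

def static_embed(tokens):
    """Phase 1: look up one fixed row per token."""
    return embedding_table[[vocab[t] for t in tokens]]

def self_attention(x):
    """Phase 2 (toy): mix each token vector with the others in the sequence."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ x

sentence_a = ["river", "bank"]
sentence_b = ["bank", "loan"]

static_a = static_embed(sentence_a)[1]   # "bank" in sentence A
static_b = static_embed(sentence_b)[0]   # "bank" in sentence B
context_a = self_attention(static_embed(sentence_a))[1]
context_b = self_attention(static_embed(sentence_b))[0]

print("static vectors equal:    ", np.allclose(static_a, static_b))
print("contextual vectors equal:", np.allclose(context_a, context_b))
```

The lookup gives "bank" the same vector in both sentences, while the attention step blends in the neighbouring token, so the two contextual vectors differ.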

---


**3. Exploring Semantic Relationships**

Embeddings can capture various semantic relationships. For example, the famous analogy "man is to king as woman is to queen" can be represented through vector arithmetic.

```python
# Hypothetical representation (pseudocode, not runnable Python):
# vector('king') - vector('man') + vector('woman') ≈ vector('queen')
```

This showcases the depth and richness of information present in the embeddings.
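The analogy can be tried out on hand-picked toy vectors. The sketch below assumes three made-up dimensions (roughly "royalty", "masculine", "feminine") chosen so that the arithmetic works out; real embeddings only approximate this behaviour, and the `analogy` helper is hypothetical.

```python
import math

# Hand-picked toy vectors: [royalty, masculine, feminine]. Not real embeddings.
toy = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.3, 0.3],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy)
print(result)  # prints: queen
```

Excluding the three input words from the candidates is a common convention in analogy tests; without it, the nearest vector is often one of the inputs themselves.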

---

Embeddings are a powerful way to represent text numerically, capturing rich semantic meaning in compact vectors. Inside an LLM, these embeddings are not static: each token's vector is recomputed from its surrounding context, which lets the model tell apart different senses of the same word.



# Using Embeddings to answer questions about a local document

One of the ways we can use embeddings is to answer questions about a local document. For example, given a PDF file, we can extract the content, split it into smaller documents (e.g., pages), embed them, and then perform semantic search to answer questions about the document.

To process a PDF and extract its text for embedding, you would typically use a library such as `PyPDF2`. You can then run a semantic search over embeddings from OpenAI (or from any other model) with a vector-search library such as FAISS. In the example that follows, we instead use the nearest-neighbor algorithm from the Scikit-Learn library to search the embeddings we get from OpenAI's API.

---

**Embedding and Semantic Search on PDF Content using Scikit-Learn and OpenAI**

---

**Step 1:** Import Necessary Libraries

First, let's import the required libraries.


In [26]:
import numpy as np
import openai
import PyPDF2
import pandas as pd


---

**Step 2:** Extract Content from PDF

We'll use `PyPDF2` to extract content from the given PDF file, "ThePragmaticProgrammer.pdf".


In [27]:
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # Iterate over pages directly; fall back to "" if a page yields no text
        text = " ".join(page.extract_text() or "" for page in reader.pages)
    return text

pdf_content = extract_text_from_pdf("ThePragmaticProgrammer.pdf")



---

**Step 3:** Split PDF Content into Documents

For simplicity, we blindly chunk the document into smaller documents (in this case, approximately 500-token chunks with a 50-token overlap, matching the `chunk_size=500` and `overlap=50` arguments in the cell below). In a real-world scenario, you would want to split the document into meaningful chunks (e.g., paragraphs or sections).


In [None]:
from util import chunk_prompt

documents = chunk_prompt(pdf_content, chunk_size=500, overlap=50)

print(f"Number of Split Documents: {len(documents)}")
print(f"First Document length: {len(documents[0])}")
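The `chunk_prompt` helper comes from the local `util` module and is not shown in this notebook. As a rough idea of how overlapping chunks can be produced, here is a minimal sketch that assumes whitespace-separated words stand in for tokens. The `chunk_words` function is hypothetical and is not the actual `chunk_prompt` implementation.

```python
def chunk_words(text, chunk_size=500, overlap=50):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Tiny demonstration: 12 words, chunks of 5 words, 2 words shared between neighbours
sample = " ".join(f"w{i}" for i in range(12))
demo_chunks = chunk_words(sample, chunk_size=5, overlap=2)
for chunk in demo_chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of embedding some text twice.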


---

**Step 4:** Embed the Documents

We'll obtain an embedding for each document using OpenAI's `text-embedding-ada-002` model, which is trained on a large corpus of text and captures rich semantic information. The API accepts up to 2048 inputs per request; here we send 20 documents at a time.


In [None]:
# openai credentials
openai.api_base = 'http://aitools.cs.vt.edu:7860/openai/v1'
openai.api_key = 'aitools'

# calculate embeddings
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023
BATCH_SIZE = 20  # you can submit up to 2048 embedding inputs per request

embeddings = []
for batch_start in range(0, len(documents), BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = documents[batch_start:batch_end]
    print(f"Batch {batch_start} to {batch_start + len(batch) - 1}")  # the last batch may be shorter
    response = openai.Embedding.create(model=EMBEDDING_MODEL, input=batch)
    for i, be in enumerate(response["data"]):
        assert i == be["index"]  # double check embeddings are in same order as input
    batch_embeddings = [e["embedding"] for e in response["data"]]
    embeddings.extend(batch_embeddings)

df = pd.DataFrame({"text": documents, "embedding": embeddings})


In [None]:
# We are using a data frame to visualize the data here.

print(df.head())
df.shape

Let's use Scikit-Learn's `NearestNeighbors` class to perform semantic search on the embeddings. We'll pass `algorithm='ball_tree'`, which builds a ball-tree index intended to speed up neighbor queries; for high-dimensional embeddings like these, the gain over brute-force search can be small.


In [31]:
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(embeddings)
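To show what a nearest-neighbor query computes, here is a brute-force sketch in plain NumPy that ranks rows by cosine similarity. It is an illustrative alternative, not Scikit-Learn's implementation; the `top_k_cosine` helper and the two-dimensional toy vectors are assumptions made for the example.

```python
import numpy as np

def top_k_cosine(query, matrix, k=3):
    """Return (indices, similarities) of the k rows of `matrix` most similar to `query`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    # Normalise so that a dot product equals cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    similarities = matrix_unit @ query_unit
    order = np.argsort(-similarities)[:k]  # highest similarity first
    return order, similarities[order]

# Made-up 2-D "embeddings", for illustration only
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
top_idx, top_sims = top_k_cosine([1.0, 0.05], toy_embeddings, k=2)
print(top_idx, top_sims)
```

Brute force compares the query against every stored vector, which is fine for a single book's worth of chunks; index structures matter more as the collection grows.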

Let's now try to ask some questions about the document and see if we can get the right answers.

In [None]:
query = "When should I catch an Exception?"
# Embed the query with the same model used for the documents
response = openai.Embedding.create(model=EMBEDDING_MODEL, input=query)
query_embedding = np.array(response["data"][0]["embedding"]).reshape(1, -1)
print(query_embedding.shape)

In [None]:
distances, indices = nbrs.kneighbors(query_embedding)

print("Nearest Neighbors Indices:", indices)
print("Distances:", distances)

for idx, distance in zip(indices[0], distances[0]):
    doc = documents[idx].replace("\n", " ")
    print(f"[{idx}]@{distance} {doc}")
    print("-" * 100)
    print("\n\n")
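A note on reading the distances: with its default settings, `NearestNeighbors` reports Euclidean distance. Assuming the embeddings are unit length (which OpenAI documents for its embedding models, but is worth checking for other models), the identity ‖a − b‖² = 2 − 2·cos(a, b) converts a distance into a cosine similarity. The helper name below is hypothetical.

```python
import math

def cosine_from_euclidean(distance):
    """Cosine similarity of two unit-length vectors, given their Euclidean distance."""
    return 1.0 - (distance ** 2) / 2.0

# Check the identity on two unit vectors that are 60 degrees apart
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
euclidean = math.dist(a, b)
direct_cosine = sum(x * y for x, y in zip(a, b))
print(round(cosine_from_euclidean(euclidean), 6), round(direct_cosine, 6))
```

Because the conversion is monotonic, ranking unit vectors by Euclidean distance gives the same order as ranking them by cosine similarity.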


How do we know whether the semantic search did a good job?
One check is to send the LLM a prompt containing the question plus a retrieved chunk as context, and see whether it can answer the question from that context. If it can, the retrieved chunk was relevant to the question.

In [None]:
from util import converse
prompt_template = """
Answer the following question using the context provided:
%Question: 
```
{question}
``` 
%Context: 
```
{context}
```
"""
for idx in indices[0]:
    messages = []
    prompt = prompt_template.format(question=query, context=documents[idx])
    messages, response = converse(messages, prompt)
    print(f""" [{idx}]: Explanation: {response}""")
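The loop above sends one request per retrieved chunk. A common variation is to place several chunks into a single prompt so the model can draw on all of them at once. The sketch below shows one way to assemble such a prompt; the `build_combined_prompt` function, its wording, and the `max_chars` budget are assumptions for illustration, not part of the `util` module.

```python
def build_combined_prompt(question, chunks, max_chars=6000):
    """Join retrieved chunks into one numbered context block, stopping at max_chars."""
    parts = []
    used = 0
    for rank, chunk in enumerate(chunks, start=1):
        entry = f"[{rank}] {chunk.strip()}"
        if parts and used + len(entry) > max_chars:
            break  # keep at least one chunk, drop the rest once the budget is spent
        parts.append(entry)
        used += len(entry)
    context = "\n\n".join(parts)
    return (
        "Answer the following question using only the context provided.\n"
        f"Question: {question}\n\n"
        f"Context:\n{context}\n"
    )

demo_prompt = build_combined_prompt(
    "When should I catch an Exception?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(demo_prompt)
```

In this notebook, the chunk list would be `[documents[i] for i in indices[0]]`, and the resulting string could be passed to `converse` in place of the per-chunk prompt.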


---

**Conclusion:**


We've demonstrated how to extract content from a PDF, split it into smaller documents, embed them, and perform semantic search using Scikit-Learn and OpenAI's embeddings. This can be a powerful way to search through large documents or even collections of documents.
 
Tip: For practical deployment, always consider factors like the size of your dataset, the frequency of queries, and the desired latency. In many real-world scenarios, batching operations, caching frequent queries, or using more specialized search libraries can significantly enhance performance and user experience. Furthermore, periodically updating embeddings can ensure that your semantic search remains relevant as the underlying content or context evolves.
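As one possible sketch of caching frequent queries, the example below wraps an embedding function in `functools.lru_cache` so that repeated query strings do not trigger a second request. The `fake_embed` function is a stand-in that returns a tiny deterministic vector; in practice it would be replaced by a real embedding call.

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how often the underlying function actually runs

def fake_embed(text):
    """Stand-in for a real embedding request; returns a tiny deterministic vector."""
    calls["count"] += 1
    return (float(len(text)), float(sum(map(ord, text)) % 97))

@lru_cache(maxsize=1024)
def cached_embed(text):
    # Tuples are returned so cached values cannot be mutated by callers
    return fake_embed(text)

first = cached_embed("When should I catch an Exception?")
second = cached_embed("When should I catch an Exception?")  # served from the cache
print(calls["count"], cached_embed.cache_info())
```

An in-memory cache like this is lost when the process exits; a persistent store would be needed to reuse embeddings across runs.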

---

**End of Notebook**

---
