---
---
# Notebook: [ Week #04: From Embeddings to Applications]

## Setup
---

In [None]:
!pip install openai
!pip install langchain
!pip install langchain-openai
!pip install langchain-experimental
!pip install pypdf
!pip install lolviz
!pip install chromadb
!pip install rank_bm25
!pip install umap-learn
!pip install tqdm
!pip install tiktoken

In [None]:
!wget "https://d17lzt44idt8rf.cloudfront.net/aicamp/data/digital_products.csv"



In [None]:
from openai import OpenAI
from getpass import getpass

API_KEY = getpass("Enter your OpenAI API Key")
client = OpenAI(api_key=API_KEY)

## Helper Function
---

## Function for Generating Embedding

In [None]:
def get_embedding(input, model='text-embedding-3-small'):
    response = client.embeddings.create(
        input=input,
        model=model
    )
    return [x.embedding for x in response.data]

## Function for Text Generation

In [None]:
# This is the "updated" helper function for calling the LLM
# Pass model="gpt-4" to use a different model
def get_completion(prompt, model="gpt-3.5-turbo", temperature=0, top_p=1.0, max_tokens=4000, n=1):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
        n=n  # number of completions to request
    )
    # Only the first completion is returned, even when n > 1
    return response.choices[0].message.content

In [None]:
# This is a "modified" helper function that we will discuss in this session
# Note that this function directly takes in "messages" as the parameter.
def get_completion_by_messages(messages, model="gpt-3.5-turbo", temperature=0, top_p=1.0, max_tokens=1024, n=1):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
        n=n  # number of completions to request
    )
    # Only the first completion is returned, even when n > 1
    return response.choices[0].message.content

## Functions for Token Counting

In [None]:
# These functions calculate the number of tokens in a piece of text or in a list of "messages"
# ⚠️ This is a simplified implementation that is good enough for a rough estimation

import tiktoken

def count_tokens(text):
    encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')
    return len(encoding.encode(text))

def count_tokens_from_message_rough(messages):
    encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')
    value = ' '.join([x.get('content') for x in messages])
    return len(encoding.encode(value))


---
---

# Understanding Embeddings

In [None]:
in_1 = "Flamingo spotted at the bird park"

in_2 = "Sea otter seen playing at the marine park"

in_3 = "Baby panda born at the city zoo"

in_4 = "Python developers prefer snake_case for variable naming"

in_5 = "New JavaScript framework aims to simplify coding"

in_6 = "C++ developers appreciate the power of OOP"

in_7 = "Java is a popular choice for enterprise applications"


list_of_input_texts = [in_1, in_2, in_3, in_4, in_5, in_6, in_7]

## Understand the Outputs of Embedding Process

In [None]:
response = get_embedding(list_of_input_texts)

In [None]:
len(response)

In [None]:
len(response[0])

In [None]:
import lolviz

lolviz.objviz(response)

# Note that there are 7 items in the list, each represents the embeddings (list of numerical values) for each of the inputs
# Each embedding vector contains 1,536 numerical values that represent the original text

In [None]:
# To save the image locally, keep a reference to the object returned by `objviz`
# Supported formats include svg, png, jpeg, pdf, etc.
drawing = lolviz.objviz(response)
drawing.render('embeddings_seven', format='png')

## Visualize Embedding

---

### Understanding UMAP for Data Analysts

- Uniform Manifold Approximation and Projection (UMAP) is a powerful dimensionality reduction technique that can be used to visualize high-dimensional data in a lower-dimensional space.
- Compared with techniques such as t-SNE, UMAP tends to preserve more of the global structure of the data while still keeping local neighbourhoods intact, which makes it a useful tool for exploratory data analysis.

### Using UMAP in Python

The UMAP algorithm is implemented in the `umap-learn` package in Python. Here's a simple example of how to use it:

```python
import umap
import numpy as np

# Assume embeddings is your high-dimensional data
embeddings = np.random.rand(100, 50)

reducer = umap.UMAP()
umap_embeddings = reducer.fit_transform(embeddings)
```

In this example, `umap.UMAP()` creates a UMAP object, and `fit_transform()` fits the model to the data and then transforms the data to a lower-dimensional representation. The result, `umap_embeddings`, is a 2D array of the lower-dimensional embeddings of your data.


<br>

---

In the following example, we will use UMAP to visualize the 7 pieces of text from the previous example.

In [None]:
import numpy as np
import pandas as pd
import umap # For compressing high-dimensional data (many columns) into lower-dimensional data (e.g. 2 columns)

import matplotlib.pyplot as plt
import seaborn as sns # For data visualization

In [None]:
def get_projected_embeddings(embeddings, random_state=0):
    reducer = umap.UMAP(random_state=random_state).fit(embeddings)
    embeddings_2d_array = reducer.transform(embeddings)
    return pd.DataFrame(embeddings_2d_array, columns=['x', 'y'])

> 💡 Explanation for the cell above:
> - `def get_projected_embeddings(embeddings, random_state=0):` This line defines the function and its parameters. The function takes in two arguments: embeddings (your high-dimensional data) and random_state (a seed for the random number generator, which ensures that the results are reproducible).  
> - `reducer = umap.UMAP(random_state=random_state).fit(embeddings)` This line creates a UMAP object and fits it to your data. The fit method learns the structure of the data.  
> - `embeddings_2d_array = reducer.transform(embeddings)` This line transforms the high-dimensional data into a lower-dimensional space. The transformed data is stored in embeddings_2d_array.  
> - `return pd.DataFrame(embeddings_2d_array, columns=['x', 'y'])` This line converts the lower-dimensional data into a pandas DataFrame for easier manipulation and returns it. The DataFrame has two columns, 'x' and 'y', which represent the two dimensions of the reduced data.

In [None]:
# Get the embeddings in a DataFrame object
projected_embeddings = get_projected_embeddings(response)

# Insert a new column to store the original texts
projected_embeddings['text'] = list_of_input_texts

projected_embeddings

In [None]:
# Create the scatter plot that visualizes the locations of the embeddings in the number space (commonly known as "vector space")
sns.scatterplot(x=projected_embeddings['x'], y=projected_embeddings['y'])

# Add labels to each point
for i in range(projected_embeddings.shape[0]):
    plt.text(x=projected_embeddings.loc[i, 'x'],
             y=projected_embeddings.loc[i, 'y'],
             s=projected_embeddings.loc[i, 'text'],
             fontsize='x-small')

- Observe the distances between the different texts.
  - Although the text "Python developers prefer snake_case for variable naming" mentions two animals (python and snake), its embedding is far from the three data points that are actually about animals.
  - It is closer to the other data points that focus on programming/coding.

![](https://d17lzt44idt8rf.cloudfront.net/aicamp/resources/embeddings_distance.png)


---


### Understanding Cosine Similarity for LLM Embeddings

Cosine similarity is a metric that measures how similar two vectors are, irrespective of their magnitude. It's widely used in natural language processing to calculate the similarity between text documents represented as vector embeddings, such as those produced by Large Language Models (LLMs).

The cosine similarity between two vectors is calculated as the cosine of the angle between them.
- If the vectors are identical, the angle is 0 and the cosine similarity is 1.
- If the vectors are orthogonal, the angle is 90 degrees and the cosine similarity is 0, indicating no similarity.

Cosine similarity is particularly useful for LLM embeddings because it compares the direction of two vectors rather than their length, which tends to track the semantic similarity between text documents. It works well with the high dimensionality of LLM embeddings and is relatively efficient to compute, making it a popular choice for measuring the distance between LLM embeddings.
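
The calculation described above can be sketched in a few lines of plain Python. This is only an illustration with tiny made-up 2-dimensional vectors (real embeddings from `text-embedding-3-small` have 1,536 values); the function name `cosine_similarity` is our own and not part of any library used in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction but different length: similarity is (approximately) 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Orthogonal vectors: similarity is 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Opposite directions: similarity is -1
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Note that the first pair scores as fully similar even though one vector is twice as long as the other, which is what "irrespective of magnitude" means in practice.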


---
---
<br>

# Setting up Credentials for langchain
---

We will be using `Langchain` for our Use Cases in the next section.
- The code cell below sets up the credentials (i.e., the OpenAI API key) to allow `Langchain` to use OpenAI API endpoints, such as `ChatCompletion` and `Embedding`.
- This also means that we will not directly use OpenAI's SDK in some of the use cases below.
- `Langchain` components such as the `retriever` (discussed in the Knowledge Base) will handle the calls to OpenAI's API endpoints automatically. All we need to do is to specify the `API_KEY` as required by Langchain, as indicated in the cell below.
- This way of providing credentials is called `environment variable assignment` or `setting environment variable`.
- Specifically, it sets the value of the environment variable named `OPENAI_API_KEY` to the value stored in the Python variable `API_KEY` (which we entered earlier in this notebook).
- Environment variables are used to configure and customize the behavior of software applications by allowing them to access external settings or secrets without hardcoding them directly into the code.
- We will learn how to set credentials more efficiently and securely in a later part of our training.
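
As a small illustration of how environment variables behave inside a Python process, the sketch below sets and reads one. The variable names `EXAMPLE_SERVICE_KEY` and `EXAMPLE_UNSET_VARIABLE` are made up for this example and are not used by OpenAI or Langchain.

```python
import os

# "EXAMPLE_SERVICE_KEY" is a made-up name used only for this illustration
os.environ["EXAMPLE_SERVICE_KEY"] = "dummy-value"

# Any code running in the same process can now read the setting by name
value = os.environ.get("EXAMPLE_SERVICE_KEY")

# .get() returns a default instead of raising KeyError when a variable is missing
missing = os.environ.get("EXAMPLE_UNSET_VARIABLE", "not set")

print(value, missing)
```

A variable set this way lives only for the duration of the current process (here, the notebook kernel); it is not saved to the operating system.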

Find out more about why we are using Langchain from [here](https://d27l3jncscxhbx.cloudfront.net/topic-4-from-embeddings-to-applications/4.-retrieval-augmented-generation-(rag).html#Using_Langchain_for_RAG)

In [None]:
import os

os.environ["OPENAI_API_VERSION"] = "2024-03-01-preview"
os.environ["OPENAI_API_KEY"] = API_KEY

---
---

<br>

# Use Cases of Embeddings
---

## Use Case #1: Semantic Search & Recommendation System

### Understanding Keyword Retrieval and Dense Retrieval

- Keyword retrieval and dense retrieval are two different methods used in information retrieval systems.

- Keyword Retrieval:
    - Keyword retrieval, also known as sparse retrieval, is a traditional method of information retrieval.
    - It involves matching the exact keywords in the query with the documents in the database.
    - This method is simple and explainable, but it has some limitations:
    - It doesn’t fully capture the semantics of each term in the context of the whole text.
    - It may not perform well when the query words do not exactly match the words in the document.
- Dense Retrieval (or Semantic Search):
    - Dense retrieval, on the other hand, uses dense vector representations (embeddings) of the text to capture the deep semantic relationship between queries and documents.
    - These vectors are usually generated by neural networks, particularly transformer-based models.
    - This method often improves substantially on keyword search because it captures the semantics of the text. However, it’s more complex and computationally intensive than keyword retrieval.

<br>
    
- Here are the key differences between the two methods:
    - **Semantics**: Dense retrieval captures the semantics of the text, while keyword retrieval does not.
    - **Matching**: Keyword retrieval relies on exact keyword matching, while dense retrieval uses semantic matching.
    - **Performance**: Dense retrieval generally outperforms keyword retrieval, especially when the query words do not exactly match the words in the document.
    - **Complexity**: Dense retrieval is more complex and computationally intensive than keyword retrieval.
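
To make the "exact matching" limitation concrete, here is a toy word-overlap scorer. It is a deliberately simplified stand-in for keyword retrieval (it is not BM25), and the function name `keyword_overlap` and the sample sentence are assumptions made for this sketch.

```python
def keyword_overlap(query, document):
    """Count the query words that literally appear in the document (case-insensitive)."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return len(query_words & document_words)

document = "Cloak helps public officers anonymise sensitive data"

# The query reuses words from the document, so it matches
exact_wording = keyword_overlap("anonymise data", document)

# The query has a similar meaning but shares no words, so it scores zero
paraphrase = keyword_overlap("mask private information", document)

print(exact_wording, paraphrase)
```

A dense retriever compares embeddings instead of literal words, so it has a chance of matching the paraphrased query to the same document.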

[Reference for Products - https://www.developer.tech.gov.sg/products/all-products/](https://www.developer.tech.gov.sg/products/all-products/)

In [None]:
# The variable `corpus`  contains a list of documents (text data).
# We are using the products & services from Reference https://www.developer.tech.gov.sg/products/all-products/
corpus = [
    'Cybersecurity. AI Document Parser (AISAY) – An AI-Powered Document Reader and Transcription API Service. AISAY is an AI-Powered Document Reader and Transcription API Service for public officers. Learn more here!',
    'Analytics. Analytics.gov – Enabling Data Exploitation for Whole-of-Government (WOG). Analytics.gov is a Whole-Of-Government (WOG) data exploitation platform to support the analysis of data by agencies.',
    'Productivity. Cloak – The Central Privacy Toolkit for Policy-Compliant Data Anonymisation. Cloak helps public officers to anonymise sensitive data based on public sector guidelines through a one-stop, self-service web application. Learn more here!',
    'DevOps. Container Stack (CStack) – Managed Platform for Apps using Kubernetes. Container Stack is a cloud-based container hosting platform and a Runtime component within Singapore Government Tech Stack.',
    'Data and APIs. Data.gov.sg — The One-Stop Open Data Portal for Publicly Available Singapore Government Datasets. Learn from Data.gov.sg, Singapore’s one-stop open data portal offering government datasets. Dive in now!',
    'Analytics. GovText – The Whole-of-Government Text Analytics Platform. Analyse your textual data efficiently with the GovText Natural Language Processing (NLP) platform for WOG. Discover more!',
    'Data and APIs. Monetary Authority of Singapore (MAS) APIs - Streamlining of Financial Applications through Data. The Monetary Authority of Singapore (MAS) provides APIs for developers, allowing MAS’ applications to be streamlined.',
    'Analytics. Whole-of-Government Application Analytics (WOGAA) - Improve Government Services with Data. WOGAA is an analytics & performance platform for public officers to monitor the health of their government websites and optimise the performance of their digital services with data.',
    'Data and APIs. Vault - A Central Data Discovery and Distribution Platform for WOG Vault is a platform where government data is consolidated, organised and made discoverable for public servants to explore, search and securely access.',
    'Productivity. Transcribe provides auto-transcription and localised Speech-to-Text services for Singapore government officers.',
    'Data and APIs. SingStat Table Builder. The SingStat Table Builder contains over 1,800 statistical data tables from 60 public sector agencies providing a comprehensive statistical view of Singapore’s economic and socio-demographic characteristics.',
    'Productivity. Postman — Deliver Messages to Citizens in Minutes. Postman is a multichannel cloud-based service for Singapore government agencies to send mass personalized messages in minutes.',
]

#### Keywords Retrieval

In [None]:
# This library provides an implementation of the BM25 algorithm,
# which is commonly used for information retrieval and text search
from rank_bm25 import BM25Okapi

# Tokenizes each document in the corpus
# by splitting it into individual words based on spaces.
tokenized_corpus = [doc.split(" ") for doc in corpus]

# Initialize BM25
# The BM25Okapi model is now ready to compute relevance scores for queries against this corpus.
bm25 = BM25Okapi(tokenized_corpus)

In [None]:
# Now let's say we have the following query
query = "Data exploitation"

# The query must be tokenized in the same way as the corpus (the list of input documents),
# i.e., split into individual words on spaces
tokenized_query = query.split(" ")

# Computes BM25 scores for each document in the corpus based on the query.
doc_scores = bm25.get_scores(tokenized_query)

# Print the list of relevance scores corresponding to each document.
# It's a position (index) based listing of the scores,
# following the original order of the texts in our `corpus` variable
print(doc_scores)

In [None]:
# Finding the Most Relevant Document:
import numpy as np
x = np.array(doc_scores)

# The argmax() function returns the index of the maximum value in the array.
# This line uses indexing to retrieve the most relevant document
# from the original corpus based on the highest BM25 score.
corpus[x.argmax()]
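
`argmax()` returns only the single best match. If you want the top few documents instead, one possible approach is to sort the positions by score. The sketch below uses made-up scores and a hypothetical helper name, `top_k_indices`; it is not part of `rank_bm25`.

```python
def top_k_indices(scores, k):
    """Return the positions of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Made-up relevance scores, one per document position
example_scores = [0.0, 1.7, 0.4, 0.0, 2.3]

best_two = top_k_indices(example_scores, k=2)
print(best_two)
```

In this notebook, the returned positions could then be used to index into `corpus`, in the same way `x.argmax()` is used above.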

> 💡 Experiment with different `query` values to see how well the `Keyword-based Retriever` works and when it hits its limitations

---

> ⚠️ The provided code demonstrates a basic implementation of the BM25 algorithm for keyword-based search.
> - However, it’s essential to recognize that real-world production-level keyword search systems are significantly more complex.
> - While this simplified BM25 example provides a foundational understanding, building robust search systems involves addressing these complexities.
> - Production-level search engines require a combination of information retrieval, machine learning, and engineering expertise to deliver accurate and efficient results.

---


<br>



#### Dense Retrieval

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

In [None]:
query = "Data exploitation"

In [None]:
# an embeddings model is initialized using the OpenAIEmbeddings class.
# The specified model is 'text-embedding-3-small'.
embeddings_model = OpenAIEmbeddings(model='text-embedding-3-small')

In [None]:
# A vector store (a database specialized for storing and searching embeddings) is created using the Chroma class.
# It stores embeddings for a given set of texts (documents).
# The embeddings_model is used to convert the texts into embeddings.
vectorstore = Chroma.from_texts(corpus, embeddings_model)

In [None]:
# This line performs a similarity search within the vector store for the query
# The k=3 parameter specifies that we want to retrieve the top 3 most similar documents.
# The method returns both the similar documents and their associated relevance scores.
# These scores indicate how similar each retrieved document is to the query (a higher value means more similar).
vectorstore.similarity_search_with_relevance_scores(query, k=3)
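
Conceptually, a similarity search embeds the query and then ranks the stored vectors by how close they are to it. The sketch below shows that idea with tiny made-up 3-dimensional vectors and cosine similarity. It is an illustration only: the function names are our own, and Chroma's actual distance metric, indexing, and score normalisation may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_by_similarity(query_vector, document_vectors, k):
    """Return the positions of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), position)
              for position, vector in enumerate(document_vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [position for _, position in scored[:k]]

# Made-up "embeddings" for three documents and one query
document_vectors = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.3, 0.1],
]
query_vector = [1.0, 0.0, 0.0]

top_two = rank_by_similarity(query_vector, document_vectors, k=2)
print(top_two)
```

A real vector store avoids comparing the query against every stored vector by using an approximate nearest-neighbour index, which is what makes dense retrieval practical on large collections.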

> 💡 Experiment with different `query` values to see how well the `Dense Retriever` works on vague queries and non-direct matches

---

## Use Case #2: Building Predictive Models with Semantic Meanings of Text

---

We are preparing the data for the machine learning model that we will build later,
where we want to use the "description of the product" to derive the classification of the product,
e.g., is it `productivity`, `cybersecurity`, `analytics` or another category?

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("digital_products.csv")
df[-10:]

In [None]:
# Get the embeddings of the "description" (text column)
# The API expects a plain list of strings, so convert the pandas Series first
embeddings_vector_prods = get_embedding(df.text.tolist())

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
# Split your data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(embeddings_vector_prods, df.label, test_size=0.3, random_state=42)

In [None]:
# Choose a model
classifier = RandomForestClassifier(max_depth=3, n_estimators=100, random_state=42)

In [None]:
# Train the model
classifier.fit(X_train, y_train)

In [None]:
# Evaluate the model
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))

In [None]:
# Print out the accuracy of the model
# in predicting the "category of the products"
print(accuracy_score(y_test, y_pred))
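
The two metrics above are related: accuracy is the sum of the diagonal of the confusion matrix (the correct predictions) divided by the total number of predictions. The sketch below shows this with a made-up 3-class matrix; the numbers are not results from this notebook, and the helper name is our own.

```python
def accuracy_from_confusion_matrix(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up confusion matrix: rows are true classes, columns are predicted classes
example_matrix = [
    [5, 1, 0],
    [0, 4, 2],
    [1, 0, 7],
]

accuracy = accuracy_from_confusion_matrix(example_matrix)
print(accuracy)
```

Here 16 of the 20 predictions fall on the diagonal. The off-diagonal cells show which categories the model confuses with each other, which accuracy alone does not reveal.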

---
---

<br>

    

## Use Case #3: Retrieval-Augmented Generation

**\[ Why Context Augmentation? \]**

- LLMs offer a natural language interface between humans and data. Widely available models come pre-trained on huge amounts of publicly available data like Wikipedia, mailing lists, textbooks, source code and more.

- However, while LLMs are trained on a great deal of data, they are not trained on your data, which may be private or specific to the problem you’re trying to solve. It’s behind APIs, in SQL databases, or trapped in PDFs and slide decks.

- You may choose to fine-tune an LLM with your data, but:
    - Training an LLM is expensive.
    - Due to the cost to train, it’s hard to update an LLM with the latest information.
    - Observability is lacking. When you ask an LLM a question, it’s not obvious how the LLM arrived at its answer.

- Instead of fine-tuning, one can use a context augmentation pattern called Retrieval-Augmented Generation (RAG) to obtain more accurate text generation relevant to your specific data. RAG involves the following high level steps:

    1. Retrieve information from your data sources first,
    2. Add it to your question as context, and
    3. Ask the LLM to answer based on the enriched prompt.

- In doing so, RAG overcomes all three weaknesses of the fine-tuning approach:
    - There’s no training involved, so it’s cheap.
    - Data is fetched only when you ask for them, so it’s always up to date.
    - The retrieved documents can be shown alongside the answer, so it’s more trustworthy.
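
The three RAG steps can be sketched without any library. In the toy example below, the retriever simply counts shared words (a real system would use embeddings and a vector store), and the final call to the LLM is left as a comment. All function names, the sample documents, and the prompt wording are assumptions made for this illustration.

```python
def retrieve(question, documents, k=2):
    """Step 1 (toy version): rank documents by how many question words they share."""
    question_words = set(question.lower().split())

    def overlap(document):
        return len(question_words & set(document.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:k]

def build_prompt(question, context_chunks):
    """Step 2: add the retrieved text to the question as context."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

documents = [
    "Postman sends mass personalised messages for government agencies",
    "Transcribe provides speech-to-text services",
    "Cloak helps anonymise sensitive data",
]
question = "Which product sends messages"

retrieved = retrieve(question, documents, k=1)
prompt = build_prompt(question, retrieved)
print(prompt)

# Step 3: send `prompt` to the LLM, e.g. with the get_completion helper defined earlier
```

Because the retrieved text is visible in the prompt, you can always inspect what the model was given when it produced its answer.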


**\[ Why LangChain for Context Augmentation? \]**

- Firstly, LangChain imposes no restriction on how you use LLMs. You can still use LLMs as auto-complete, chatbots, semi-autonomous agents, and more. It only makes LLMs more relevant to you.

- LangChain provides the following tools to help you quickly stand up production-ready RAG systems:
    - Data connectors ingest your existing data from their native source and format. These could be APIs, PDFs, SQL, and (much) more.
    - Data indexes structure your data in intermediate representations that are easy and performant for LLMs to consume.
    - Engines provide natural language access to your data. For example, query engines are powerful retrieval interfaces for knowledge-augmented output.

![](https://d27l3jncscxhbx.cloudfront.net/lib/media/img-20240421132947558.png)

### Document Loading

- Use document loaders to load data from a source as `Document` objects.
  - A Document is a piece of text and associated metadata.
  - For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.

- See [official documentation on LangChain's Document Loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders) for different kinds of loaders for different sources.

In [None]:
# In this example, we will load the Prompt Engineering Playbook

from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://www.developer.tech.gov.sg/products/collections/data-science-and-artificial-intelligence/playbooks/prompt-engineering-playbook-beta-v3.pdf")
pages = loader.load()

- Each page is a `Document` object.
- A `Document` contains text (`page_content`) and `metadata`.
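
To picture the shape of such an object, here is a minimal stand-in built with a dataclass. `SimpleDocument` is a hypothetical class written for this illustration and is not LangChain's `Document`; the content and metadata values are also made up.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Stand-in for illustration only; not LangChain's Document class."""
    page_content: str                             # the text of the page or chunk
    metadata: dict = field(default_factory=dict)  # extra details, e.g. where the text came from

page = SimpleDocument(
    page_content="Zero-shot prompting gives the model no examples.",
    metadata={"source": "example.pdf", "page": 0},
)

print(page.page_content)
print(page.metadata)
```

Keeping the metadata next to the text is what later allows a retrieved chunk to be traced back to its source file and page.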

In [None]:
pages[0]

In [None]:
pages[0].metadata

In [None]:
# Let's count how many tokens there are
# by summing the token counts from every page
# Don't worry about understanding the code in this cell
import numpy as np

list_of_tokencounts = []
for page in pages:
    list_of_tokencounts.append(count_tokens(page.page_content))

print(f"There are total of {np.sum(list_of_tokencounts)} tokens")

In [None]:
np.average(list_of_tokencounts)

### Document Splitting

![](https://d27l3jncscxhbx.cloudfront.net/lib/media/img-20240421143533502.png)



**[ Different Types of Splitters ]**
- `CharacterTextSplitter()` -> Implementation of splitting text that looks at characters.
- `MarkdownHeaderTextSplitter()` -> Implementation of splitting markdown files based on specified headers.
- `TokenTextSplitter()` -> Implementation of splitting text that looks at tokens.
- `SentenceTransformersTokenTextSplitter()` -> Implementation of splitting text that looks at tokens.
- `RecursiveCharacterTextSplitter()` -> Implementation of splitting text that looks at characters. Recursively tries to split by different characters to find one that works.
- `Language()` -> for CPP, Python, Ruby, Markdown, etc.
- `NLTKTextSplitter()` -> Implementation of splitting text that looks at sentences using NLTK (Natural Language Toolkit).
- `SpacyTextSplitter()` -> Implementation of splitting text that looks at sentences using spaCy.

|    Name   |               Splits On               | Adds Metadata |                                                                                        Description                                                                                       |
|:---------:|:-------------------------------------:|:-------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| Recursive | A list of user defined characters     |               | Recursively splits text. Splitting text recursively serves the purpose of trying to keep related pieces of text next to each other. This is the recommended way to start splitting text. |
| HTML      | HTML specific characters              | ✅             | Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML)                                           |
| Markdown  | Markdown specific characters          | ✅             | Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown)                                   |
| Code      | Code (Python, JS) specific characters |               | Splits text based on characters specific to coding languages. 15 different languages are available to choose from.                                                                       |
| Token     | Tokens                                |               | Splits text on tokens. There exist a few different ways to measure tokens.                                                                                                               |
| Character | A user defined character              |               | Splits text based on a user defined character. One of the simpler methods.                                                                                                               |

In [None]:
# Basic Document Splitting

from langchain.text_splitter import CharacterTextSplitter


some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""


# Create the splitter with the desired chunk_size and chunk_overlap (both measured in characters by default).
# Note: CharacterTextSplitter splits only on its single separator (a blank line, "\n\n", by default),
# so a chunk can end up longer than chunk_size when no separator falls inside it.
r_splitter = CharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=25
)

# split_text() splits the input text into a list of chunks according to those parameters
r_splitter.split_text(some_text)

In [None]:
# Testing on PDF
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("https://www.developer.tech.gov.sg/products/collections/data-science-and-artificial-intelligence/playbooks/prompt-engineering-playbook-beta-v3.pdf")
pages = loader.load()

# Try paragraph breaks first, then line breaks, then spaces, then individual characters
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=500,
    chunk_overlap=50,
    length_function=count_tokens
)


splitted_documents = text_splitter.split_documents(pages)

> 💡 Explanation

- **Separators**:
    - The separators parameter is a list of strings that the algorithm will use to split the text.
    - The algorithm will try to split the text using the first separator in the list.
    - If the resulting chunks are still larger than the chunk_size, it will try to split them further using the next separator in the list, and so on.
    - This process continues until all chunks are smaller than the chunk_size or until all separators have been used.
- **Chunk Size**:
    - The chunk_size parameter is the maximum size for each chunk of text, measured by the length_function.
    - If after using all the other separators, there are still chunks that are larger than the chunk_size, the last separator in the list (the empty string `""`) splits the text character by character, so a chunk may end wherever the limit falls in the text.
- **Chunk Overlap**:
    - The chunk_overlap parameter determines how much of the end of one chunk is repeated at the start of the next chunk, measured in the same units as chunk_size (characters with `len`, tokens with `count_tokens`).
    - This can be useful to provide context when processing each chunk independently.
- **Length Function**:
    - The length_function parameter is a function that calculates the ‘length’ of a chunk.
    - This could be a simple function like len (which counts characters), or a function like count_tokens (which counts tokens), as used in the cell above.
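
The effect of the overlap setting is easiest to see on a plain sliding window. The sketch below cuts a string into fixed-size character chunks where each chunk repeats the tail of the previous one. It is a simplified illustration (no separators involved), and `chunk_with_overlap` is a hypothetical helper, not a LangChain function.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into chunks of at most chunk_size characters, repeating chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk starts with the last character of the previous chunk, so text that sits on a boundary appears in both neighbours.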

`RecursiveCharacterTextSplitter` will first try to split by the separators. If the resulting chunks are still too large, it will then split them further until they are smaller than the chunk_size. The chunk_overlap is applied as the split pieces are merged back into chunks. The length_function is used throughout this process to measure the size of the chunks. It’s important to note that the separators are tried in the order they are given in the list, and the algorithm will always try to split by separators before resorting to splitting by chunk_size.
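
The "try the separators in order" idea can be sketched as a short recursive function. This is a simplified illustration and not LangChain's implementation: it measures length in characters only, drops the separators, and neither merges small pieces back together nor applies any overlap. The name `recursive_split` is our own.

```python
def recursive_split(text, separators, chunk_size):
    """Split text with the first separator; recurse with the next one on pieces that are too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No usable separator left: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, remaining = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks

sample = "aaa bbb\n\nccc ddd eee"
chunks = recursive_split(sample, separators=["\n\n", " ", ""], chunk_size=7)
print(chunks)
```

The first paragraph fits within the limit and is kept whole, while the second is too long and is split again on spaces. This is how the recursive strategy tries to keep related text together for as long as possible.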

In [None]:
len(splitted_documents)

In [None]:
splitted_documents[15]

In [None]:
splitted_documents[16]

In [None]:
splitted_documents[17]

<br>

---

### Embedding & Vectorstores

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import Chroma

In [None]:

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
db = Chroma.from_documents(splitted_documents, embeddings_model, persist_directory="./chroma_db")

In [None]:
print(db._collection.count())

<br>

---

### Retrieval

![](https://d27l3jncscxhbx.cloudfront.net/lib/media/img-20240421143642415.png)

### Basic Retrieval

In [None]:
db.similarity_search('Zero Shot', k=3)

In [None]:
db.similarity_search_with_relevance_scores('Zero Shot', k=3)

### Question & Answer

![](https://d27l3jncscxhbx.cloudfront.net/lib/media/img-20240421150029496.png)

- Multiple relevant documents have been retrieved from the vector store
- Potentially compress the relevant splits to fit into the LLM context
- Send the information along with our question to an LLM to select and format an answer

#### RetrievalQA Chain

In [None]:
from langchain.chains import RetrievalQA

In [None]:
from langchain_openai import ChatOpenAI

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    ChatOpenAI(model='gpt-3.5-turbo'),
    # search_kwargs controls how many chunks the retriever returns
    retriever=db.as_retriever(search_kwargs={"k": 20})
)

qa_chain.invoke("Why do LLMs hallucinate?")

#### With Custom Q&A Prompt

In [None]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# Run chain
qa_chain = RetrievalQA.from_chain_type(
    ChatOpenAI(model='gpt-3.5-turbo'),
    retriever=db.as_retriever(),
    return_source_documents=True, # Make inspection of document possible
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [None]:
qa_chain.invoke("Why do LLMs hallucinate?")