**DSS : BUILDING LARGE LANGUAGE MODELS FOR BUSINESS APPLICATIONS Day 3**

- Part 3 of BUILDING LARGE LANGUAGE MODELS FOR BUSINESS APPLICATIONS
- Course Length: 3 hours
- Last Updated: July 2023
---

Developed by Algoritma's Research and Development division

## Background

The coursebook is part of the **Large Language Models Specialization** developed by [Algoritma](https://algorit.ma/). The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference, etc.

# Understanding Embedding in LLM 

## Training Objectives

- **Understanding Embeddings in Large Language Models (LLM) for Natural Language Processing**
   - Basic concepts of embeddings in LLM
   - Usage of embeddings in natural language processing
   - Demonstration of embeddings usage in text analysis


- **Advanced Embeddings in Large Language Models (LLM) for Text Processing**
   - In-depth understanding of embeddings in LLM
   - Implementation of embeddings in text processing using Python
   - Demonstration of embedding techniques in text processing tasks


- **Advanced Applications of Embeddings in Text Processing with Large Language Models (LLM)**
   - Application of embeddings in advanced text processing
   - Usage of embeddings for text classification and contextual understanding
   - Demonstration of embeddings usage in advanced tasks

## Understanding Embeddings in Large Language Models (LLM) for Natural Language Processing

We have created a GPT-based question-answering system using Large Language Models (LLMs) that can generate answers based on our data. Now, let's delve deeper into how LLMs understand natural language by exploring the concept of embeddings.

In natural language processing (NLP), **embeddings** are representations of **words or text as numerical vectors**. These vectors capture the **semantic and contextual** information of the text, allowing the model to understand the meaning and relationships between words. In simpler terms, embeddings help a model, such as a chatbot, understand the meaning of words and how they relate to each other.

Imagine you have a chatbot designed to assist customers with their inquiries. When a customer asks a question, the chatbot needs to understand the meaning of the question and provide a relevant answer. This is where embeddings come into play. The chatbot is trained on a large amount of text data and learns to associate words with their respective embeddings. These embeddings encode the information about the words' meaning and context.

For example, if a customer asks, "What are the payment options available?" the chatbot uses the embeddings to understand the meaning of the words "payment" and "options" and their relationship within the sentence. It can then provide an appropriate response by retrieving information from its knowledge base.

Embeddings enable the chatbot to **make sense** of the customer's input and generate accurate and contextually relevant responses. By capturing the meaning and relationships between words, embeddings enhance the chatbot's understanding of natural language and improve its ability to generate meaningful and coherent answers.

### Basic Concept of Embedding (Vector)

A vector (or embedding) is an **array of numbers**. That on its own is exciting, but what is even more exciting is that these arrays can represent more complex data like text, images, audio or even video. In the case of text, these representations are designed to capture **semantic and syntactic** relationships between words, allowing algorithms to understand and process language more effectively.

Word embeddings, specifically, are dense vector representations that encode the meaning of a word based on its context in a large corpus of text. In simpler terms, they map words to numerical vectors in a high-dimensional space, where similar words are located closer to each other. The resulting vectors can then be stored and searched in a vector database (we will talk about this later).
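
To make "similar words are located closer to each other" concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The numbers are invented for illustration only and are not output of any model; real embeddings have hundreds of dimensions learned from data. `cosine_similarity` is a helper defined here, not a library call.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (invented numbers, not model output)
toy_vectors = {
    "cat":    [0.90, 0.80, 0.10],
    "kitten": [0.85, 0.90, 0.15],
    "car":    [0.10, 0.20, 0.95],
}

sim_cat_kitten = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

# The related pair should score higher than the unrelated pair
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```

In this toy space "cat" and "kitten" point in almost the same direction, while "cat" and "car" do not, which is the property a trained embedding model aims to provide for real vocabulary.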

Creating these embeddings is done by an embedding model, and there are multiple embedding models to choose from. OpenAI also provides an embedding model, but we will use a free, open-source model so we don't run out of credit: the **`"all-MiniLM-L6-v2"`** embedding model.

Making embeddings can be visualised in the following way:

![embedding](assets/embedding.gif)

This embedding process is used in many LLM applications, for example a Q&A system or a GPT chatbot. The question asked to the chatbot is embedded as well, and a similarity search lets the retriever return the stored text whose embeddings are closest to the question. The LLM then uses that text to produce a coherent and well-structured answer.

Let's go through the concept step by step, starting with how to convert raw text into vector form.

In [10]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

This embedding function is based on an open-source sentence transformer model, which converts sentences or text into numerical embeddings that capture the semantic meaning and contextual information of the text. Let's create some example sentences:

In [11]:
sentences = [
    "This is document about cat",
    "This is document about car",
    "Example of the long sentences: China increased its coal-fired power capacity by 42.9 GW, or 4.5%, in the 18 months to June 2019, according to a report by Global Energy Monitor. The study also found that another 121.3 GW of coal-fired power plants are under construction in China, which has pledged to reduce its coal usage. However, the country’s absolute coal consumption has still increased in line with rising energy demand. China accounts for more than 40% of the world's total coal generation capacity."
]

In [12]:
# Perform embedding using embed_documents()
embedded_sentences = embedding_function.embed_documents(sentences)

# show embedded result
embedded_sentences[0][:10]

[0.03584069758653641,
 0.0844104066491127,
 0.004435761831700802,
 0.061324890702962875,
 -0.0967964306473732,
 -0.011024923995137215,
 -0.03996596485376358,
 0.027137821540236473,
 -0.03685334697365761,
 0.04033121094107628]

The output of this code is the result of performing embedding using the `embed_documents()` method of the `embedding_function`. The `sentences` variable represents a list of sentences or text that we want to embed. The `embed_documents()` method takes these sentences as input and generates their corresponding embeddings. 

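The variable `embedded_sentences` stores the embedded representations of the input sentences. It is a list with one vector per sentence, and each vector is a numerical representation that captures the semantic meaning and context of that sentence. The output above shows the first 10 values of the first sentence's vector.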

The embedding model used here, `all-MiniLM-L6-v2` (published as `sentence-transformers/all-MiniLM-L6-v2`), generates a fixed-size vector of 384 dimensions for any given input, regardless of its length. This model is designed to map sentences and paragraphs to a dense vector space with 384 dimensions.

The purpose of this embedding is to capture the semantic meaning of sentences and enable tasks such as semantic search, where similarity between sentences can be measured in this vector space. By representing sentences in a fixed-size vector format, the model facilitates efficient comparison and retrieval of semantically similar sentences.

In summary, regardless of the length of the input sentence, the embedding model consistently produces a 384-dimensional vector representation that captures the semantic information of the sentence. This representation can be used for various NLP tasks, including semantic search, where similarity between sentences is important.
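
The "fixed size regardless of length" property can be illustrated without downloading a model. The sketch below is a toy stand-in, not how `all-MiniLM-L6-v2` works: it hashes each word into one of `dim` buckets and averages the counts, so it carries no real semantics. The function name `toy_embed` and the dimension of 8 are assumptions made for this example. What it shares with a real sentence embedding model is that the output length is set by the model, not by the input.

```python
import hashlib

def toy_embed(text, dim=8):
    """Map text of any length to a vector with exactly `dim` numbers.

    Toy stand-in for an embedding model: each word is hashed into one
    of `dim` buckets and the bucket counts are averaged over all words.
    """
    vector = [0.0] * dim
    words = text.lower().split()
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % dim
        vector[bucket] += 1.0
    # Average so that long texts do not produce larger numbers
    return [value / max(len(words), 1) for value in vector]

short_text = "This is document about cat"
long_text = "China increased its coal-fired power capacity " * 20

short_vec = toy_embed(short_text)
long_vec = toy_embed(long_text)

# Both vectors have the same number of dimensions
print(len(short_vec), len(long_vec))
```

With the real model from this coursebook, the equivalent check would be to look at the length of each vector in `embedded_sentences`, which should be 384 for the short and the long sentence alike.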

## Advanced Embedding in Large Language Models (LLM) for Text Processing

The ability of embedding to handle text data is crucial for addressing the demands of today's industries. Embedding allows us to process and understand textual information effectively, enabling a wide range of text processing tasks.

One significant application of embedding is in large language models (LLMs), which leverage advanced embedding techniques to comprehend and generate natural language. By utilizing embedding in LLMs, we can tackle various industrial challenges more efficiently.

In this section, we will explore the use of **Chroma DB**, an embedding-based (vector) database, to enhance text processing capabilities. By embedding and indexing relevant data, we can leverage the embedded representations of text to perform tasks like question-answering. This approach enables us to extract meaningful information from the data and provide accurate responses to user queries.

By harnessing the power of embedding in LLMs and leveraging technologies like Chroma DB, we can significantly improve the **efficiency and effectiveness** of text processing in industries. This opens up new opportunities for automating tasks, gaining insights from textual data, and enhancing decision-making processes.

### Vector Database (CHROMA DB)

When working with Large Language Models (LLMs) like GPT-4 or Google's PaLM 2, we will often be working with large amounts of unstructured, textual data. Structured data can simply be stored in a SQL database, but that is much harder with unstructured data. When we have, for instance, many text files with information on a certain topic, like the example above, it helps to store this information in a way that lets us retrieve the desired data efficiently. The answer to this: **Vector Databases**.

The specific vector database that we will use is the **ChromaDB** vector database.

[Chroma Website](https://docs.trychroma.com/getting-started#:~:text=Chroma%20is%20a%20database%20for,hosted%20version%20is%20coming%20soon!):

> Chroma is a database for building AI applications with embeddings. It comes with everything you need to get started built in, and runs on your machine.

By using `Chroma`, we can streamline the process of embedding text and computing distances between vectors (such as cosine distance), as it provides built-in functionality for these tasks.

`Chroma` and `LangChain` are integrated, allowing for seamless usage. To take advantage of this integration, we need to import the necessary functions from the respective libraries. This integration simplifies the implementation of text processing tasks by providing convenient methods for embedding text, computing distances between vectors, and utilizing these functionalities within the broader context of `LangChain`.
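
A vector store has to turn "how similar are these two embeddings" into a number. Below is a minimal sketch of two common choices, cosine distance and squared Euclidean (L2) distance, in plain Python. The helper names are defined here for illustration and are not Chroma or LangChain functions, and which metric a particular Chroma collection uses depends on how it is configured.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def squared_l2_distance(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two toy vectors, scaled to unit length
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cos_dist = cosine_distance(a, b)
l2_sq = squared_l2_distance(a, b)

print(f"cosine distance: {cos_dist:.4f}")
print(f"squared L2:      {l2_sq:.4f}")
```

For unit-length vectors the squared L2 distance equals twice the cosine distance, so both metrics rank neighbours in the same order. For vectors that are not normalized the two can disagree, which is why it is worth knowing which one a store is using.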

To utilize `Chroma DB` and `LangChain`, we need to import the necessary libraries and modules. 

- `SentenceTransformerEmbeddings` from `langchain.embeddings.sentence_transformer`: generates embeddings for sentences using a pre-trained Sentence Transformer model.
- `CharacterTextSplitter` from `langchain.text_splitter`: splits text documents into smaller chunks or segments, which can be useful for efficient processing and analysis.
- `Chroma` from `langchain.vectorstores`: a vector store that enables us to store and query embedded text data efficiently.
- `TextLoader` from `langchain.document_loaders`: loads the content of a text file into LangChain `Document` objects.

In [13]:
# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

In [36]:
# load the document and split it into chunks
loader = TextLoader("data_input/summary.txt")
document = loader.load()

In [37]:
document

[Document(page_content='Summary for Australia:\n\n- Australia\'s coal and gas exports may reduce by half within the next five years due to the passing of its peak and the efforts of Asian countries to decrease greenhouse gas emissions. The earnings of minerals and energy exports are predicted to reach $464bn in 2022-23 from $128bn in thermal coal exports and $91bn in liquidified natural gas (LNG) exports. These figures have resulted from the global energy crisis caused by Russia\'s invasion of Ukraine, leading to high fossil fuel prices, causing the replacement of Russian gas with alternative supplies in northern hemisphere nations. \n\n- The seaborne coal market grew by 5.9% year-on-year to 1208 million tonnes in 2022, reversing the negative trend of previous years, according to shipbroker Banchero Costa. Although Australia\'s coal exports declined by 5% in 2022 due to China\'s adoption  of alternative markets, relations between the two countries have mended and coal shipments are exp

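Once the document has been loaded, we may find that it consists of multiple paragraphs. To facilitate our search for the most relevant paragraph, we can use the `CharacterTextSplitter` module. It splits the text on a separator (by default a blank line, i.e. a paragraph break) and merges neighbouring pieces as long as the result stays within `chunk_size`, here 1000 characters; `chunk_overlap=0` means consecutive chunks share no text. Because the split happens only at the separator, a single paragraph longer than 1000 characters is kept whole, which is why the output below warns about a chunk of size 1917.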

This splitting process allows us to analyze each paragraph individually and determine which one is most similar to our query. It simplifies the task of finding relevant information within the document and enables more focused analysis.
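
The chunking behaviour described above can be sketched in a few lines of plain Python. This is a simplified illustration of separator-based splitting without overlap, not LangChain's actual implementation; the function name `split_on_paragraphs` and the sample text are made up for this example.

```python
def split_on_paragraphs(text, chunk_size=1000, separator="\n\n"):
    """Split on `separator`, then greedily merge pieces up to `chunk_size`."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece.strip():
            continue  # skip empty pieces
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
        else:
            if current:
                chunks.append(current)
            # Start a new chunk; a single piece may exceed chunk_size
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Four synthetic "paragraphs" of 400, 500, 1500 and 300 characters
sample = "\n\n".join(["A" * 400, "B" * 500, "C" * 1500, "D" * 300])

chunks = split_on_paragraphs(sample, chunk_size=1000)
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)
```

The first two paragraphs fit together in one chunk, while the 1500-character paragraph ends up as a chunk of its own that is longer than `chunk_size`, the same situation that triggers the "longer than the specified 1000" warning from `CharacterTextSplitter`.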

In [38]:
# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
text = text_splitter.split_documents(document)

text[:3]

Created a chunk of size 1917, which is longer than the specified 1000


[Document(page_content="Summary for Australia:\n\n- Australia's coal and gas exports may reduce by half within the next five years due to the passing of its peak and the efforts of Asian countries to decrease greenhouse gas emissions. The earnings of minerals and energy exports are predicted to reach $464bn in 2022-23 from $128bn in thermal coal exports and $91bn in liquidified natural gas (LNG) exports. These figures have resulted from the global energy crisis caused by Russia's invasion of Ukraine, leading to high fossil fuel prices, causing the replacement of Russian gas with alternative supplies in northern hemisphere nations.", metadata={'source': 'data_input/summary.txt'}),
 Document(page_content='- The seaborne coal market grew by 5.9% year-on-year to 1208 million tonnes in 2022, reversing the negative trend of previous years, according to shipbroker Banchero Costa. Although Australia\'s coal exports declined by 5% in 2022 due to China\'s adoption  of alternative markets, relati

Once we have split each paragraph in the document, we can proceed to embed the sentences and store them in Chroma for efficient retrieval.

- We will create an open-source embedding function using `SentenceTransformerEmbeddings`. In this example, we use the model "all-MiniLM-L6-v2" to perform the sentence embedding.

- We load the embedded sentences into Chroma using the `from_documents` method. We pass in the `text` as the input and the `embedding_function` to perform the embedding process. Chroma will store the embedded vectors along with their corresponding sentences, enabling fast and accurate retrieval based on semantic similarity.


In [39]:
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(text, embedding_function)

By leveraging Chroma, we can easily search for the most relevant sentences or paragraphs in the document by comparing their embedded vectors, providing a powerful tool for text processing and information retrieval.

Chroma stores the ids, embeddings, documents, and metadata in a collection. This collection acts as a repository where the information is organized and indexed for efficient retrieval.

In [40]:
db._collection.get().keys()

dict_keys(['ids', 'embeddings', 'documents', 'metadatas'])

- The `ids` represent unique identifiers associated with each document or sentence in the collection. These ids serve as references to access specific entries in the collection.

- The `embeddings` are the vector representations of the documents or sentences. These embeddings capture the semantic information and enable similarity-based searches within the collection.

- The `documents` refer to the original text content that has been split and processed. These documents can be paragraphs, sentences, or any other meaningful textual units.

- The `metadatas` hold any additional information associated with each document. In this example that is the `source` file path; other attributes such as timestamps or author names could be stored as well.
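
The four keys hold parallel lists: position `i` in each list describes the same entry. The sketch below uses a hand-written dictionary with made-up values to show that shape; it is a stand-in for what a collection returns, not real output from Chroma.

```python
# Hypothetical stand-in for the dictionary returned by a collection
toy_collection = {
    "ids": ["doc-0", "doc-1"],
    "embeddings": [[0.1, 0.9], [0.8, 0.2]],
    "documents": ["first chunk of text", "second chunk of text"],
    "metadatas": [{"source": "example.txt"}, {"source": "example.txt"}],
}

# Zip the parallel lists into one record per entry
records = [
    {"id": doc_id, "embedding": vector, "document": text, "metadata": meta}
    for doc_id, vector, text, meta in zip(
        toy_collection["ids"],
        toy_collection["embeddings"],
        toy_collection["documents"],
        toy_collection["metadatas"],
    )
]

for record in records:
    print(record["id"], "->", record["document"], record["metadata"])
```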

In [41]:
db._collection.get()['documents'][:3]

["Summary for Australia:\n\n- Australia's coal and gas exports may reduce by half within the next five years due to the passing of its peak and the efforts of Asian countries to decrease greenhouse gas emissions. The earnings of minerals and energy exports are predicted to reach $464bn in 2022-23 from $128bn in thermal coal exports and $91bn in liquidified natural gas (LNG) exports. These figures have resulted from the global energy crisis caused by Russia's invasion of Ukraine, leading to high fossil fuel prices, causing the replacement of Russian gas with alternative supplies in northern hemisphere nations.",
 '- The seaborne coal market grew by 5.9% year-on-year to 1208 million tonnes in 2022, reversing the negative trend of previous years, according to shipbroker Banchero Costa. Although Australia\'s coal exports declined by 5% in 2022 due to China\'s adoption  of alternative markets, relations between the two countries have mended and coal shipments are expected to resume. Indones

After exploring the Chroma collection, we can evaluate how well the retrieval works by embedding a question and finding the stored documents that best match it.

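We will define the `query` as "Is there an export ban on Coal in Indonesia? Why?" and use the `similarity_search_with_score` method of the Chroma vector store to find the documents most similar to the query. We specify `k=3` to retrieve the top 3 matching documents. Each result is returned as a pair of a document and a score; the score is a distance, so a lower value means a closer match.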


In [44]:
# Embed query and find similar document
query = "Is there an export ban on Coal in Indonesia? Why?"
docs = db.similarity_search_with_score(query, k=3)
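
Conceptually, a top-`k` similarity search scores every stored vector against the query vector and keeps the `k` closest. The sketch below shows that ranking step with a brute-force scan over a tiny, made-up store. The vectors, texts, and the helper `top_k` are assumptions for illustration; a real vector store would typically use an index rather than comparing against every entry.

```python
def squared_l2_distance(a, b):
    """Sum of squared coordinate differences (lower means closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vector, store, k=3):
    """Return the k (text, distance) pairs closest to the query."""
    scored = [
        (text, squared_l2_distance(query_vector, vector))
        for text, vector in store
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Hypothetical store of (text, embedding) pairs with invented numbers
toy_store = [
    ("chunk about coal exports", [0.9, 0.1, 0.0]),
    ("chunk about shipping rates", [0.6, 0.4, 0.1]),
    ("chunk about weather", [0.0, 0.2, 0.9]),
    ("chunk about gas prices", [0.5, 0.5, 0.5]),
]
toy_query = [1.0, 0.0, 0.0]  # pretend this is the embedded question

results = top_k(toy_query, toy_store, k=3)
for text, distance in results:
    print(f"{distance:.2f}  {text}")
```

The result has the same shape as the output of the search above: a list of pairs, ordered from the closest match to the least close of the top `k`.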

In [45]:
docs

[(Document(page_content="- Indonesia aims to produce 694 million tons of coal in 2021 to fulfill both domestic and export demands, said the country's Ministry of Energy and Mineral Resources. Last year, the nation produced 627 million tons, with 4-5 million tons exported to Europe, compared with 500,000 tons in previous years, the Indonesian Coal Mining Association said. European coal demand is expected to stay strong next year.\n\n- Indonesia's new ban on coal exports, implemented to ensure adequate supply for its state-owned electricity companies, is expected to disrupt Supramax and Panamax markets in the Pacific region. The country exports around 400 million metric tonnes of thermal coal each year to countries including China, India, Japan, South Korea and Vietnam. The ban, which has seen bulkers unable to sail out of Indonesian ports, is likely to result in a tonnage surfeit in the Asia-Pacific, leading to lower shipping rates, particularly in East Coast South America (ECSA) and th