Reference: https://medium.com/@jainashish.079/get-insight-from-your-business-data-build-llm-application-with-langchain-and-hugging-face-using-b32c442ea6cd

The rise of large language models (LLMs) has reshaped the industry, and everyone is thinking about how to use their power to support the business and make customers' lives easier. Looking at the LLM landscape, models are either proprietary, like the ChatGPT family from OpenAI, or open source, like the FLAN-T5 family from Google (encoder-decoder models). Either way, they are trained on public, internet-based data.

<b><i>The question is: how can we use LLMs with our own business data?</i></b>

# Fine-Tuning vs RAG (Retrieval Augmented Generation)

There are two approaches. We can fine-tune an LLM on our own data for a specific task (such as question answering or summarization), or we can use RAG, which supplies the relevant business data to the LLM at the time a customer query is executed. The choice between RAG and fine-tuning depends on several factors.

* Fine-tuning is a good choice when we have a large amount of task-specific labeled data and want to extract insight from it. For example, a call center may want to summarize chats between agents and customers to better understand complex conversations. We can fine-tune an LLM on a chat-history database with labeled summaries and then run inference on real-time chat history. Fine-tuning can be computationally expensive and time-consuming, and it requires substantial infrastructure (GPUs and memory). However, methods such as PEFT (Parameter-Efficient Fine-Tuning) speed up training and reduce memory use, and techniques such as quantization and pruning help with the computational cost.

* RAG is advantageous when a retrieval corpus covering the information relevant to the task (for example, question answering) is available. It lets customers converse with these documents and get answers to their queries from them through the LLM. For example, we may have data on a corporate wiki, websites, and PDFs, and want to run customer queries against those documents. RAG is more efficient in resource use and gives faster results, which makes it suitable for applications with limited compute resources, real-time requirements, or low-latency needs.

It is cheaper to keep retrieval indices up to date (RAG) than to repeatedly retrain an LLM through fine-tuning.

# RAG (Retrieval Augmented Generation)

RAG is a framework for building LLM-powered applications that draw on data sources outside the model and add that data to the input, giving the model richer context to improve its output. One option for getting answers from an LLM about our own data is to pass the full data in the prompt as context, together with the question. The problem is that LLMs are constrained by the size of the context window (4,096 tokens in GPT-3.5). Context windows keep growing with new model releases (up to 32,768 tokens in GPT-4), but we still cannot pass a full corpus that may run to gigabytes.
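
To put the gap between a corpus and a context window into numbers, here is a rough back-of-the-envelope sketch. It assumes about 4 characters per token, which is only a rule of thumb for English text and not a real tokenizer count; the function names are made up for illustration.

```python
# Rough estimate of how many full prompts a corpus would need.
# ASSUMPTION: ~4 characters per token is a rule of thumb for English text,
# not an exact tokenizer count.
CHARS_PER_TOKEN = 4


def estimated_tokens(corpus_bytes: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Approximate token count, treating one byte as one character."""
    return corpus_bytes // chars_per_token


def windows_needed(corpus_bytes: int, context_window: int) -> int:
    """Number of full-size prompts needed to cover the whole corpus."""
    tokens = estimated_tokens(corpus_bytes)
    return -(-tokens // context_window)  # ceiling division


one_gigabyte = 1024 ** 3
print(estimated_tokens(one_gigabyte))        # approximate tokens in 1 GB of text
print(windows_needed(one_gigabyte, 32_768))  # prompts needed with a 32,768-token window
```

Under that assumption, even the larger window would have to be filled thousands of times to cover a single gigabyte, which is why passing the whole corpus in one prompt is not an option.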

The intuition behind RAG is to first run the customer query against the corpus, fetch only the relevant information, which is much smaller, and then pass the retrieved information to the LLM in the context window together with the customer query. This means we need to divide the corpus into chunks and store them in a form that lets us fetch the relevant chunks for a given query. A good way to do this is to convert the chunks into text embeddings and store them in a vector database. A text embedding is a compressed, abstract representation of text in which text of arbitrary length is represented as a vector of numbers. Embeddings are usually learned from a corpus of text such as Wikipedia. Think of them as a universal encoding for text, where texts with similar content have similar vectors. We can then find the relevant chunks by running a similarity search on the vector store. Finally, we combine the relevant chunks with the customer query into a prompt and pass that prompt to the LLM to get the desired result.
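
The retrieve-then-prompt flow described above can be sketched with the standard library alone. This is a toy illustration, not how a production system works: the `embed` function below uses simple word counts as a stand-in for a learned embedding model, the in-memory list stands in for a vector database, and the sample chunks, the prompt wording, and all function names are made up for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, index: list, k: int = 2) -> list:
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, context_chunks: list) -> str:
    """Combine the retrieved chunks and the query into one prompt string."""
    context = "\n".join(context_chunks)
    return f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"


# Made-up corpus chunks, embedded once and kept in an in-memory "vector store".
chunks = [
    "Refunds are processed within five business days.",
    "Our office is closed on public holidays.",
    "Shipping is free for orders above fifty dollars.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

query = "How long do refunds take?"
top = retrieve(query, index, k=1)
print(build_prompt(query, top))
```

Only the chunk about refunds shares a word with the query, so it is the one that ends up in the prompt. A learned embedding would also match chunks that use different words for the same idea, which word counts cannot do.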

# Implementing RAG (Retrieval Augmented Generation)

In this blog we will use LangChain (https://www.langchain.com/), an excellent open source developer framework for building LLM applications. It provides abstractions that hide much of the complexity of building an LLM application and offers easy-to-use interfaces through its Python and JavaScript libraries. It also provides integration points with other libraries and systems for document loading, vector stores, calling LLMs through APIs, and loading models from the Hugging Face model hub. We will also use Hugging Face, a platform where the machine learning community collaborates on models, datasets, and applications. From Hugging Face we will download the open source LLM FLAN-T5 (https://huggingface.co/google/flan-t5-large) from Google to generate answers, and the sentence transformer model all-MiniLM-L6-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to create embeddings. We will load these models from the local machine. You can download them from Hugging Face by cloning the model repositories, which lets you run this code without internet access or in a very constrained environment. Downloading the models can take time, depending on your network speed:
* git lfs install 
* git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 
* git clone https://huggingface.co/google/flan-t5-large

We will use Python code for illustration. You can use this code in your own applications, and you can refer to the LangChain site for more code references. You also need to install the Python libraries below to run the code.

In [31]:
# Install only if the package is absent. Otherwise, comment out the following lines

!pip install langchain
# !pip install torch
# !pip install transformers
# !pip install faiss-cpu
# !pip install pypdf
# !pip install sentence-transformers

Defaulting to user installation because normal site-packages is not writeable


# RAG (Retrieval Augmented Generation)

<b>Step 1</b> - Load the corpus from multiple sources. The corpus can come in many forms, such as PDFs, Microsoft Word documents, or an online or corporate wiki. LangChain provides document loaders for different sources, including PDF, CSV, file directory, HTML, JSON, and Markdown.

In [8]:
from langchain.document_loaders import PyPDFLoader

# pdfLoader = PyPDFLoader("example_data/Large_language_model.pdf")
pdfLoader = PyPDFLoader(r"C:\Users\raman\OneDrive\Documents\Datasets/large_language_models.pdf")
documents = pdfLoader.load()

<b>Step 2</b> - Once the documents are loaded into memory, we can divide them into smaller chunks. <b><i>It sounds easy, but it is tricky to divide a document without losing the relationships between the chunks</i></b>. LangChain provides several types of text splitters, such as split by character, split code, MarkdownHeaderTextSplitter, recursively split by character, and split by tokens.

In [38]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(documents)
print(type(docs))
print(len(docs))
# print((docs[10]))
print("====================")
print(docs)
print("====================")
print(docs[10])

<class 'list'>
232
[Document(page_content='JOURNAL OF L ATEX 1\nA Comprehensive Overview of Large Language\nModels\nHumza Naveed, Asad Ullah Khan*, Shi Qiu*, Muhammad Saqib*,\nSaeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, Ajmal Mian\nAbstract —\nLarge Language Models (LLMs) have recently demonstrated\nremarkable capabilities in natural language processing tasks and\nbeyond. This success of LLMs has led to a large influx of\nresearch contributions in this direction. These works encompass\ndiverse topics such as architectural innovations of the underlying\nneural networks, context length improvements, model alignment,\ntraining datasets, benchmarking, efficiency and more. With the\nrapid development of techniques and regular breakthroughs in\nLLM research, it has become considerably challenging to perceive\nthe bigger picture of the advances in this direction. Considering\nthe rapidly emerging plethora of literature on LLMs, it is\nimperative that the research community is abl
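
To see what the `chunk_size` and `chunk_overlap` settings mean, here is a minimal sketch of a fixed-size sliding window over characters. It is not LangChain's algorithm: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, while this hypothetical `split_with_overlap` helper cuts at fixed offsets. It only illustrates how an overlap lets neighbouring chunks share some text.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows where each window repeats the
    last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each piece starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears in full in at least one chunk when the overlap is large enough.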

<b>Begin - Clean the Data using NLTK</b>

<b>End - Clean the Data using NLTK</b>

<b>Step 3</b> - We can now create embeddings from these docs. An embedding is a vector representation of a piece of text. Every doc is represented in a vector space, and similar docs have similar vectors. This helps us find docs based on the user query: we can do semantic search (similarity search), looking for the pieces of text that are closest to the query in the vector space. Many embedding model providers (OpenAI, Cohere, Hugging Face, etc.) integrate with LangChain. We will use the sentence transformer model all-MiniLM-L6-v2 and load it locally. This model maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

In [32]:
from langchain.embeddings import HuggingFaceEmbeddings

# HuggingFaceEmbeddings loads the sentence-transformers model itself, so it takes
# a model name (or a local clone path) through model_name. Passing a
# SentenceTransformer object as `model` is rejected with a ValidationError.
# modelPath = "/model/sentence-transformer/all-MiniLM-L6-v2"  # local clone
modelPath = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'}  # run the model on the CPU
encode_kwargs = {'normalize_embeddings': False}  # keep the raw, unnormalized vectors
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [35]:
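from sentence_transformers import SentenceTransformer
# sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# encode() expects plain strings, so pass each chunk's page_content
# rather than the Document objects themselves.
texts = [doc.page_content for doc in docs]
# Stored under its own name so the `embeddings` object from the previous cell is not overwritten.
chunk_vectors = model.encode(texts)
# print(chunk_vectors.shape)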