#Installing Dependencies


In [1]:
!pip install -q langchain
!pip install -q transformers
!pip install -q sentence_transformers
# !pip install -q chromadb
!pip install -q faiss-cpu
!pip install -q datasets
!pip install -q torch
!pip install -q huggingface_hub
!pip install -q langchain-community


#Importing Libraries

In [29]:
from langchain_community.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
import torch


#Document Loading

We are going to use a LangChain document loader ([documentation](https://python.langchain.com/docs/integrations/document_loaders/)). In this task we load the Hugging Face dataset databricks-dolly-15k.


In [8]:
#Specify the name of the dataset on the Hugging Face Hub and the column to use as page content
dataset_name = "databricks/databricks-dolly-15k"
page_content_column = "context"

#Create the loader instance with the HuggingFaceDatasetLoader
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)

#Load the data with the loader's "load" method
data = loader.load()

In [10]:
data[:2]

[Document(metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}, page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\'s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."'),
 Document(metadata={'instruction': 'Which is a species of fish? Tope or Rope', 'response': 'Tope', 'category': 'classification'}, page_content='""')]

#Document Transformers

Once the data is loaded, you can transform it to suit your requirements. Here that means splitting the data into smaller chunks that your model can accept, which helps it give accurate results.

##Text Splitter
For the document transformer we use a **text splitter**. LangChain provides several; today we use the **RecursiveCharacterTextSplitter**, which is recommended for generic text. It splits long texts recursively until the chunks are small enough.

In [11]:
#Create an instance of the RecursiveCharacterTextSplitter class with specific parameters.
#It splits the text into chunks of up to 1000 characters, with an overlap of 150 characters between chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 150)

#Now pass the loaded data to the text_splitter using the "split_documents" method
docs = text_splitter.split_documents(data)

In [12]:
docs[0]

Document(metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}, page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\'s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."')
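
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not LangChain's implementation: the real RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

With these toy values each chunk repeats the last 3 characters of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.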

#Text Embeddings

Text embedding is the step where we convert text into vectors. Embeddings capture semantic meaning, which lets you quickly and efficiently find other pieces of text that are similar.

In this tutorial we are going to use **HuggingFaceEmbeddings**.

In [15]:
#Define the path of the pre-trained embedding model. We are going to use sentence-transformers/all-MiniLM-l6-v2
modelPath  = "sentence-transformers/all-MiniLM-l6-v2"

# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device': 'cpu'}

#Create a dictionary with the encoding options. We set normalize_embeddings to True. You can set it to False and compare the output
encode_kwargs = {'normalize_embeddings': True}

#Create an instance of the HuggingFaceEmbeddings class
embeddings = HuggingFaceEmbeddings(
    model_name = modelPath,
    model_kwargs = model_kwargs,
    encode_kwargs = encode_kwargs
)


In [17]:
#test the embeddings

text = "This is a test document."
query_result = embeddings.embed_query(text)
query_result[:5]

[-0.038338545709848404,
 0.12346472591161728,
 -0.028642980381846428,
 0.05365270748734474,
 0.008845367468893528]
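
The effect of `normalize_embeddings` can be illustrated without any model. The sketch below uses two made-up 2-dimensional vectors (real embeddings from this model have many more dimensions) and shows that once vectors are scaled to unit length, a plain dot product gives the same value as cosine similarity. The helper names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, chosen only for easy arithmetic
a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine_similarity(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, dot_normalized)  # both are approximately 0.96
```

This is one reason normalization is a common choice before storing vectors: similarity search then depends only on direction, not on vector length.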

#Vector Stores

We need a database where we can store the embeddings and search them efficiently. Given a query, a vector store returns the stored embedding vectors that are most similar to it. In this tutorial we are going to use Meta's FAISS. You can also use other vector stores such as Chroma or Pinecone.



In [18]:
db = FAISS.from_documents(docs, embeddings)

In [19]:
#Now search your question
qus = "What is France?"
searchDocs = db.similarity_search(qus)
print(searchDocs[0].page_content)

"The French Revolution (French: R\u00e9volution fran\u00e7aise [\u0281ev\u0254lysj\u0254\u0303 f\u0281\u0251\u0303s\u025b\u02d0z]) was a period of radical political and societal change in France that began with the Estates General of 1789 and ended with the formation of the French Consulate in November 1799. Many of its ideas are considered fundamental principles of liberal democracy, while the values and institutions it created remain central to French political discourse.\n\nIts causes are generally agreed to be a combination of social, political and economic factors, which the Ancien R\u00e9gime proved unable to manage. In May 1789, widespread social distress led to the convocation of the Estates General, which was converted into a National Assembly in June. Continuing unrest culminated in the Storming of the Bastille on 14 July, which led to a series of radical measures by the Assembly, including the abolition of feudalism, the imposition of state control over the Catholic Church
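
Conceptually, the similarity search above ranks stored vectors by their distance to the query vector and returns the closest ones. The sketch below does this by brute force over a toy in-memory list. It is only an illustration: FAISS uses optimized index structures, and the use of Euclidean (L2) distance here is an assumption about the default configuration. The 2-dimensional vectors, the texts, and the `top_k` helper are all made up.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, stored, k=4):
    """Return the texts of the k stored items closest to query_vec, nearest first.

    stored is a list of (text, vector) pairs.
    """
    ranked = sorted(stored, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical store: each entry pairs a chunk of text with its embedding
toy_store = [
    ("doc about France", [0.9, 0.1]),
    ("doc about fish", [0.1, 0.9]),
    ("doc about Paris", [0.8, 0.3]),
]

nearest = top_k([1.0, 0.0], toy_store, k=2)
print(nearest)
```

Brute force costs one distance computation per stored vector for every query, which is why dedicated vector stores matter once the collection grows.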


#Preparing the LLM Model

You can choose any suitable model from Hugging Face: a tokenizer preprocesses the text, and a question-answering model provides answers based on the input text and question.

In this tutorial I am going to use **Intel/dynamic_tinybert**, which is fine-tuned for question answering.

In [20]:
#Create a tokenizer Object
tokenizer = AutoTokenizer.from_pretrained("Intel/dynamic_tinybert")

#Create a model object
model = AutoModelForQuestionAnswering.from_pretrained("Intel/dynamic_tinybert")


##Preparing the question answer pipeline

Create a pipeline using the model and the tokenizer, then wrap it in LangChain's HuggingFacePipeline with additional model-specific arguments.

In [25]:
#specify the model name
model_name = "Intel/dynamic_tinybert"

#load the tokenizer associated with the specified model.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding = True, truncation= True, max_length = 512)

#Define the question answering pipeline using the model and tokenizer
question_answer = pipeline(
    "question-answering",
    model = model_name,
    tokenizer = tokenizer,
    device = 0 if torch.cuda.is_available() else -1, # Move device specification here to the underlying pipeline
    return_tensors= "pt")

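#Create an instance of HuggingFacePipeline, which wraps the question-answering pipeline with additional
#model-specific arguments (temperature and max_length).
#Note: HuggingFacePipeline is written for generation-style tasks (text-generation, text2text-generation,
#summarization, translation), so a "question-answering" pipeline may be rejected when the chain runs.
llm = HuggingFacePipeline(
    pipeline = question_answer,
    model_kwargs = {'temperature': 0.7, 'max_length': 512},
)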




#Retriever

The following tasks are now complete:


*   Embedding the data and storing it in the database
*   Preparing the LLM
*   Creating the question-answering pipeline and wrapping it in HuggingFacePipeline

Next, we need to retrieve the data. A retriever is an interface that returns documents for a given query.

A retriever does not need to store documents; it only returns (retrieves) them. Vector stores serve as the backbone of the retriever.



In [22]:
#Create a retriever object from "db" using the "as_retriever" method.
#The retriever fetches the documents in the vector store that are relevant to a query.
retriever = db.as_retriever()

In [26]:
#Search for documents relevant to the question
query = "What is France?"
searchDocs = retriever.invoke(query)
print(searchDocs[0].page_content)

"The French Revolution (French: R\u00e9volution fran\u00e7aise [\u0281ev\u0254lysj\u0254\u0303 f\u0281\u0251\u0303s\u025b\u02d0z]) was a period of radical political and societal change in France that began with the Estates General of 1789 and ended with the formation of the French Consulate in November 1799. Many of its ideas are considered fundamental principles of liberal democracy, while the values and institutions it created remain central to French political discourse.\n\nIts causes are generally agreed to be a combination of social, political and economic factors, which the Ancien R\u00e9gime proved unable to manage. In May 1789, widespread social distress led to the convocation of the Estates General, which was converted into a National Assembly in June. Continuing unrest culminated in the Storming of the Bastille on 14 July, which led to a series of radical measures by the Assembly, including the abolition of feudalism, the imposition of state control over the Catholic Church



#Retrieval QA Chain from LangChain

Now we are going to use the RetrievalQA chain to find answers to questions. The RetrievalQA chain combines question answering with a retrieval step. To do this we use the LangChain LLM wrapper and the vector database as a retriever.

By default, the **"Stuff"** chain puts all the retrieved data into a single prompt. If you have a large amount of data, you can use the **MapReduce, Refine, or MapRerank** chains instead. I recommend the **Refine** chain.
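
The difference between the "stuff" and "refine" strategies can be sketched with a stand-in LLM. This is a simplified illustration under stated assumptions: the prompt wording is invented and is not LangChain's actual prompt templates, and `fake_llm`, `stuff_answer`, and `refine_answer` are hypothetical names.

```python
def stuff_answer(llm, question, docs):
    """'Stuff': concatenate every retrieved chunk into one prompt, one LLM call."""
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def refine_answer(llm, question, docs):
    """'Refine': answer from the first chunk, then update the answer chunk by chunk."""
    if not docs:
        raise ValueError("refine needs at least one document")
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    for doc in docs[1:]:
        answer = llm(
            f"Existing answer: {answer}\n"
            f"New context:\n{doc}\n\n"
            f"Refine the existing answer to the question: {question}"
        )
    return answer


# Stand-in for a real model: records each prompt and returns a placeholder string
calls = []


def fake_llm(prompt):
    calls.append(prompt)
    return f"answer after call {len(calls)}"


docs = ["chunk one", "chunk two", "chunk three"]

stuff_answer(fake_llm, "What is France?", docs)
stuff_calls = len(calls)

calls.clear()
refine_answer(fake_llm, "What is France?", docs)
refine_calls = len(calls)

print(stuff_calls, refine_calls)  # 1 3
```

"Stuff" makes a single call but the prompt grows with every chunk, while "refine" makes one call per chunk and keeps each prompt small, which is the trade-off to weigh when the retrieved data is large.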

In [27]:
#Create a retriever from "db" with a search configuration that retrieves up to 4 splits/documents.
#You can also specify search_type, which defines the type of search the retriever performs.
#It can be "similarity" (default), "mmr", or "similarity_score_threshold". Here we use "similarity".
#You can change the number of splits/documents (k) according to your use case.


retriever = db.as_retriever(search_type = "similarity", search_kwargs = {"k": 4})

In [34]:
#Create a question-answering instance using the RetrievalQA class.
#It is configured with our llm, chain type "refine", the retriever instance, and an option to not return the source documents

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="refine",
    retriever=retriever,
    return_source_documents=False
    )

#Inferencing

Now we are ready to put a question to our LLM, which will use the RAG pipeline that we have implemented.

In [38]:
question = "Who is Abraham Lincoln?"
#invoke returns a dict; the answer is stored under the "result" key
result = qa.invoke({"query": question})
print(result["result"])