

## NOTE: This notebook ONLY works on a Colab runtime that has a GPU.
Make sure that you have a GPU instance: Runtime -> Change Runtime Type -> GPU -> T4

If you have a GPU instance, the command below prints a table describing the GPU.



In [None]:
!nvidia-smi



Install the prerequisites.
* datasets, transformers - to use the Hugging Face Transformers library
* langchain - LangChain Python library for the chaining, RAG and agent examples
* bitsandbytes - to enable loading models in 8-bit
* accelerate - runtime optimization of inference
* chromadb - ChromaDB, the vector database for the indexing and RAG examples

In [None]:
!pip install datasets transformers==4.28.0 numpy langchain bitsandbytes accelerate chromadb

Make sure that your Colab is connected to a GPU machine: Runtime > Change Runtime type > GPU

Define a Custom LLM wrapper for Dolly v2
## Dolly LLM Wrapper for LangChain
The crux of our solution is here. This implementation replaces OpenAI: it loads the Dolly 3B model from Hugging Face in 8-bit mode and defines a `_call` method, which LangChain uses for chaining.

### Parameters:
* **temperature** - sharpness of the answers. Ranges from 0 to 1; the lower the value, the sharper the results. For example, with a value of 1.0 the answers are more creative, and with a value of 0.1 they are more factual. Default is 0.8.
* **top_p** - cumulative probability cutoff for candidate tokens. Ranges from 0 to 1. Default is 0.9, meaning only the most probable tokens whose probabilities together add up to 90% are considered for the output.
* **top_k** - number of candidate tokens considered for each output token. Default is 40.
* **max_tokens** - total number of tokens in the generated sequence. Note: this count includes the input tokens as well. Default is 512.
* **repeat_penalty** - penalty applied to repeated tokens. Should be greater than 1. Default is 1.1.
* **do_sampling** - whether to sample tokens or not. Default is True. Set this value to True for creative tasks like story generation, and to False for factual Q&A.
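
To make the interaction between `temperature`, `top_k` and `top_p` concrete, here is a minimal sketch in plain Python. It is not the Transformers implementation: the function name `sampling_distribution` and the toy logits are assumptions made up for illustration, and the filter order (temperature, then top-k, then top-p) is assumed as well.

```python
import math
import random

def sampling_distribution(logits, temperature=0.8, top_k=40, top_p=0.9):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide the logits before the softmax; lower values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exp = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = {tok: v / total for tok, v in exp.items()}

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise the surviving tokens so that they sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}

# Hypothetical logits for four candidate tokens.
toy_logits = {"cat": 2.0, "dog": 1.5, "fish": 0.5, "rock": -1.0}
sharp = sampling_distribution(toy_logits, temperature=0.1)
flat = sampling_distribution(toy_logits, temperature=1.0, top_p=1.0)
# Sampling (do_sampling=True) draws from the filtered distribution.
next_token = random.choices(list(flat), weights=list(flat.values()))[0]
print(sharp, flat, next_token)
```

With the low temperature, almost all probability mass lands on the top token, so the top-p cutoff leaves a single candidate; with temperature 1.0 and `top_p=1.0`, every token stays in play.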



In [None]:
# Prerequisites: pip install transformers langchain torch
from langchain.llms.base import LLM
from typing import Optional, List, Mapping, Any
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import pydantic
from pydantic import Field, validator
from langchain import PromptTemplate


class DollyLLM(LLM, pydantic.BaseModel):
    
    temperature: float = Field(0.8, description="Temperature")
    top_p: float = Field(0.9, description="Top p")
    top_k: int = Field(40, description="Top k")
    repeat_penalty: float = Field(1.1, description="Repeat penalty")
    max_tokens: int = Field(512, description="Max token to generate")
    do_sampling: bool = Field(True, description="Sample NN")
    #model_path:str = Field("databricks/dolly-v2-3b", description="Dolly model path")
    END_KEY = "### End"
    tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")
    model: AutoModelForCausalLM = AutoModelForCausalLM.from_pretrained("RajuKandasamy/dolly-v2-3b-8bit", load_in_8bit=True, device_map="auto")
    device:str = "cuda" if torch.cuda.is_available() else "cpu"
    class Config:
        arbitrary_types_allowed = True

    def sanitize_and_tokenize(self, text: str, max_tokens: int = 128):
        tokens = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=max_tokens)
        return tokens
        
    @property
    def _llm_type(self) -> str:
        return "Dolly_v2_3B"
    
    # This method will be called by Langchain to generate a response
    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        self.model.eval()
        # Fall back to the end marker when no stop sequences are given.
        if not stop:
            stop = [self.END_KEY]
        try:
            input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(self.model.device)
            with torch.no_grad():
                output = self.model.generate(
                    input_ids,
                    max_length=self.max_tokens,  # counts prompt tokens plus generated tokens
                    do_sample=self.do_sampling,
                    temperature=self.temperature,
                    top_k=self.top_k,
                    top_p=self.top_p,
                    repetition_penalty=self.repeat_penalty,
                    pad_token_id=self.tokenizer.eos_token_id,
                    early_stopping=True,
                )
            response = self.tokenizer.decode(output[0], skip_special_tokens=False)
            # Keep only the text after the last response marker.
            response = response.split("### Response:\n")[-1]
            # Cut the response at the first stop sequence (in list order) that appears in it.
            for s in stop:
                if s in response:
                    return response[:response.find(s)]
            return response
        except Exception as e:
            raise RuntimeError(f"Error generating response: {e}") from e

    
    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        """Get the identifying parameters."""
        return {"temperature": self.temperature, "max_tokens": self.max_tokens, "do_sampling": self.do_sampling, "top_k": self.top_k, "top_p": self.top_p, "repeat_penalty": self.repeat_penalty}
    



You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` attribute will be overwritten with the one you passed to `from_pretrained`.



Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.


LangChain's prompt templates are a better way to format inputs to LLMs. Dolly v2 is fine-tuned with the prompt template shown below. Note that loading the model may take a while, since the Dolly weights have to be downloaded. Load the model only once, because it is loaded again every time the loading cell is executed.
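
As a rough illustration of the prompt format, the sketch below builds an instruction prompt with plain string formatting and then strips a raw completion back down to the answer. The markers `### Instruction:`, `### Response:` and `### End` are the ones used in this notebook; the helper names `build_prompt` and `extract_response` and the sample completion are hypothetical.

```python
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"

def build_prompt(instruction: str, system: str = "") -> str:
    """Wrap an instruction in the section markers used by this notebook's prompts."""
    header = f"{system}\n" if system else ""
    return f"{header}{INSTRUCTION_KEY}\n{instruction}\n\n{RESPONSE_KEY}\n"

def extract_response(generated: str) -> str:
    """Keep the text after the last response marker and before the end marker."""
    answer = generated.split(f"{RESPONSE_KEY}\n")[-1]
    return answer.split(END_KEY)[0].strip()

prompt_text = build_prompt("Tell me a joke.", system="You are a funny AI assistant.")
# Hypothetical raw model output: the prompt echoed back, then the answer and the end marker.
raw_output = prompt_text + "Why did the GPU cross the road?\n\n### End"
print(extract_response(raw_output))
```

A causal language model returns the prompt together with the completion, which is why the answer has to be cut out between the two markers.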

In [None]:
torch.cuda.empty_cache()

In [None]:
llm = DollyLLM()

Let's test the LLM using a simple prompt.

In [None]:
llm.max_tokens = 128
template = """
You are a funny AI assistant.
### Instruction:
{prompt}
### Response:
Once upon a time"""

prompt = PromptTemplate(
    input_variables=["prompt"],
    template=template,
)
llmprompt = prompt.format(prompt="Tell me a joke.")

print(llm(llmprompt, stop=["### End"]))



Once upon a time there was a little mouse.  He was walking down the street when he came across a cracker.  He thought to himself, "this cracker must be very tasty because I'm very hungry."  So he started to eat it.  After he finished he looked down at the cracker and he was very sad.  He cried, "oh no, I'm so hungry but I can't eat this cracker because it's gone bad."  Then he heard a sound.  He looked up and saw


## Introducing LangChain Chains
A chain is a unit in an LLM task list. We can sequence chains in such a way that the output of an earlier chain is passed as input to the next one. In the sample below, we come up with fancy domain names for a couple of companies using chaining.

We define two prompts. With the first prompt we get a name for the company; we then pass the company name to the second prompt to generate a domain name.

Both prompts are chained using LangChain.
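
Before using the LangChain classes, it may help to see the idea stripped down to plain Python. The sketch below is not LangChain code: `make_step`, `run_sequential` and the canned `fake_llm` are hypothetical names invented here, and the fake model only exists so that the example runs without a GPU.

```python
from typing import Callable, List

def make_step(template: str, llm: Callable[[str], str]) -> Callable[[str], str]:
    """Return a step that fills the template with its input and calls the LLM."""
    def step(value: str) -> str:
        return llm(template.format(input=value)).strip()
    return step

def run_sequential(steps: List[Callable[[str], str]], first_input: str) -> str:
    """Feed each step's output into the next step, like a simple sequential chain."""
    value = first_input
    for step in steps:
        value = step(value)
    return value

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model: returns canned answers."""
    if "domain name" in prompt:
        company = prompt.split("company: ")[-1]
        return company.lower().replace(" ", "") + ".com"
    return "Smart Clay Dolls"

name_step = make_step("What is a good name for a company that makes {input}?", fake_llm)
domain_step = make_step("Suggest a .com domain name for the company: {input}", fake_llm)
print(run_sequential([name_step, domain_step], "clay dolls"))
```

Only the first step receives user input; every later step sees nothing but the previous step's output, which is the same contract a simple sequential chain relies on.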

In [None]:
from langchain.chains import LLMChain
from langchain.chains import SimpleSequentialChain
llm.max_tokens = 256
prompt = PromptTemplate(
    input_variables=["product"],
    template="You are a helpful AI assistant. End the conversation after your answer. ### Instruction: What is a good name for a company that makes {product}?\n\n### Response:\n",
)

chain = LLMChain(llm=llm, prompt=prompt)
#company_name = chain.run("clay dolls")

second_prompt = PromptTemplate(
    input_variables=["company_name"],
    template="You are a helpful AI assistant. End the conversation after your answer. ### Instruction: Suggest a fancy .com domain name within 10 letters for the company: {company_name}\n\n### Response:\n",
)
chain_two = LLMChain(llm=llm, prompt=second_prompt)

overall_chain = SimpleSequentialChain(chains=[chain, chain_two], verbose=True)

# Run the chain specifying only the input variable for the first chain.
domain1 = overall_chain.run("clay dolls")
domain2 = overall_chain.run("Coconut Oil")
print(domain1)
print(domain2)





> Entering new SimpleSequentialChain chain...
Clay dolls

www.smartclaydolls.com

> Finished chain.


> Entering new SimpleSequentialChain chain...
Pure Coconut Oil

purchasingpurecoconutoil.com

> Finished chain.
www.smartclaydolls.com


purchasingpurecoconutoil.com




You can explore more about chaining at https://python.langchain.com/en/latest/modules/chains/getting_started.html

## Language indexing
Just like SQL table indexes, there are techniques to index linguistic data and store it in a special kind of database known as a "vector DB". Pinecone is a popular one. For this demonstration we are going to use a simple one called ChromaDB.

Why language indexing?
Language indexing and retrieval mechanisms can help overcome a few limitations of LLMs. For instance, the amount of text that we can input to an LLM is limited to a few thousand words. If we need to work on a huge pile of documents with LLMs, we need a way to filter the relevant records in the dataset by meaning and pass only those to the LLM for further processing.

A vector DB stores embeddings, which are vectors that represent pieces of text, and compares them by the distance between those vectors. For instance, the words "BentoML" and "BuntoML" are closer to each other than to "AutoML" from a vector distance point of view.
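
One common way to measure how close two embeddings are is cosine similarity. The sketch below uses hand-written three-dimensional vectors purely for illustration; the texts and numbers are made up, and a real embedding model produces vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen so that the first two texts point in similar directions.
embeddings = {
    "model serving": [0.9, 0.1, 0.0],
    "deploying ML models": [0.8, 0.2, 0.1],
    "web framework": [0.1, 0.9, 0.2],
}
query = embeddings["model serving"]
for text, vector in embeddings.items():
    print(f"{text}: {cosine_similarity(query, vector):.3f}")
```

The closer the score is to 1.0, the nearer the two texts are in the embedding space.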

The sample below shows how to set up and use a vector DB.

In [None]:
!git clone https://github.com/rkandas/aibootcampdata.git


Cloning into 'aibootcampdata'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 11 (delta 1), reused 8 (delta 1), pack-reused 0
Unpacking objects: 100% (11/11), 12.98 MiB | 7.44 MiB/s, done.


In [None]:
%cd /content/aibootcampdata
!git pull
%cd /content

/content/aibootcampdata
Already up to date.
/content


In [None]:
from langchain.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path='./aibootcampdata/Radar_datatable.csv')

data = loader.load()

print(data[:10])

[Document(page_content='Name: Aleph.js\nURL: https://www.thoughtworks.com/radar/languages-and-frameworks/aleph-js\nVolume: Oct-22\nRing: Assess\nQuadrant: Languages & Frameworks\nshorturl: https://tinyurl.com/2pawh2eu\nDescription: There is certainly no shortage of frameworks to build web applications in JavaScript/ TypeScript . We\'ve featured many of them in the Radar, but what sets  Aleph.js  apart in this crowded field is that it\'s built to run on  Deno , the new server-side run time created by the original developer of  Node . This puts Aleph.js on a modern foundation that addresses several shortcomings and problems with Node. Aleph.js is still new " it\'s approaching the 1.0 release at the time of writing " but it already offers a solid developer experience, including hot module replacement. With Deno now way past its  1.0 release , this is a modern choice for projects that can take the risk.', metadata={'source': './aibootcampdata/Radar_datatable.csv', 'row': 0}), Document(page

Here we introduce ChromaDB, a vector database, to store our embeddings.
Here are our buzzwords:
* **ChromaDB** - A vector database that can store text documents using embeddings. You can read more about it at https://docs.trychroma.com/getting-started
* **Embeddings** - Embeddings are the AI-native way to represent any kind of data, making them the perfect fit for working with all kinds of AI-powered tools and algorithms. They can represent text, images, and soon audio and video. You can read more about them at https://docs.trychroma.com/embeddings
* **HuggingFaceEmbeddings** - A LangChain class that helps us transform text documents into embeddings. https://huggingface.co/blog/getting-started-with-embeddings

In the example below, we use the HuggingFaceEmbeddings class to convert the CSV data loaded in the previous step into embeddings and load them into ChromaDB.


In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()
db = Chroma.from_documents(data, embeddings)




Once the text embeddings are loaded into the vector DB, we can run queries on top of it. The vector DB returns the documents that are closest to the given query. Note: the returned results do not necessarily match the query; they are simply its nearest neighbours.
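
The nearest-neighbour behaviour can be sketched without any vector database. The toy below ranks documents by cosine similarity over simple word counts; `embed`, `similarity` and `similarity_search` are hypothetical helpers written for this illustration and say nothing about how ChromaDB works internally.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real embedding model captures meaning, not just words."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity_search(query, documents, k=4):
    """Return the k documents most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: similarity(q, embed(d)), reverse=True)
    return ranked[:k]

documents = [
    "bentoml is a framework for serving machine learning models",
    "aleph.js is a framework for web applications on deno",
    "chromadb is a vector database",
]
top = similarity_search("framework for serving models", documents, k=2)
print(top)
# A query that matches nothing still gets k neighbours back.
print(similarity_search("quantum physics", documents, k=2))
```

The second call shows why the results have to be read as neighbours rather than matches: the search always returns the k closest documents, however far away they are.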

In [None]:
query = "What is the use of BentoML?"
docs = db.similarity_search(query)

In [None]:
print(len(docs))
print(docs)

4


## RAG - Retrieval Augmented Generation
[![Retrieval Augmented Generation](https://global.discourse-cdn.com/business7/uploads/hellohellohello/optimized/2X/f/f1d7e02e789e83dbeab9ee0e6bdc0b7f1e4d59d7_2_690x388.jpeg)](https://ai.facebook.com/1319742961447503/videos/244800523626272/ "Retrieval Augmented Generation")


RAG augments an LLM's knowledge by passing documents as additional context. In general, LLMs answer from their stored weights. For many use cases, the LLM does not have the required data: for instance, the judgment records of the Indian judiciary.
We combine vector DB index-based filtering with LLMs to come up with RAG.

We use the example data loaded into the vector DB earlier and run the same query, but this time against the LLM.

In the sample below, the filtered documents are passed to the LLM as context to answer the natural-language query.
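
Conceptually, the "stuff" approach pastes the retrieved documents into a single prompt ahead of the question. The sketch below shows that assembly step in plain Python. The instruction sentence is copied from the chain output shown in this notebook; the trailing `Answer:` cue, the `build_stuff_prompt` helper and the `max_chars` guard are assumptions added for illustration.

```python
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"  # assumed cue; the notebook output is cut off before this point
)

def build_stuff_prompt(documents, question, max_chars=2000):
    """Join the retrieved documents into one context block and append the question."""
    context = "\n\n".join(documents)
    # Crude guard for small context windows: truncate the context, never the question.
    context = context[:max_chars]
    return STUFF_TEMPLATE.format(context=context, question=question)

retrieved = [
    "Name: BentoML\nDescription: BentoML is a python-first framework "
    "for serving machine-learning models in production at scale."
]
rag_prompt = build_stuff_prompt(retrieved, "What is the use of BentoML?")
print(rag_prompt)
```

Because the whole context travels inside the prompt, the number of documents that can be stuffed in is bounded by the model's context window, which is why only the top record is passed to the small Dolly model.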

In [None]:
from langchain.chains.question_answering import load_qa_chain
llm.max_tokens = 512
chain = load_qa_chain(llm, chain_type="stuff")
# limiting it to top record due to the limitation of our dolly model.
result = chain.run(input_documents=[docs[0]], question=query)
print(result)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Name: BentoML
URL: https://www.thoughtworks.com/radar/languages-and-frameworks/bentoml
Volume: Oct-22
Ring: Assess
Quadrant: Languages & Frameworks
shorturl: https://tinyurl.com/2l8syruf
Description: BentoML  is a python-first framework for serving machine-learning models in production at scale. The models it provides are agnostic of their environment; all model artifacts, source code and dependencies are encapsulated in a self-contained format called Bento. It's like having your model "as a service." Think of BentoML as the  Docker  for ML models: It generates VM images with pre-programmed APIs ready for deployment and includes features that make it easy to test these images. BentoML can help speed up the initial development effort by easing the start of projects which is why we included it in Assess.

Question: What is the

## LLM Agents

LLM agents are a powerful tool that combines chaining with a pool of tools and an LLM to achieve a particular goal.



An OpenAI-based Colab tutorial on agents is available here:

[![LLM Agents](https://img.youtube.com/vi/ziu87EXZVUE/0.jpg)](https://www.youtube.com/watch?v=ziu87EXZVUE "LLM Agents")


Here we present the agent concept without OpenAI. Executing this cell may take a very long time. For agents to work effectively, we need LLMs with 30B parameters or more, which have the ability to plan and prioritize tasks. Our 3B Dolly model may not be up to the job.
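
The control loop behind an agent can be sketched in a few lines. This is a toy under stated assumptions, not LangChain's agent implementation: `run_agent`, the `tool_name: tool input` reply format, the scripted model and the made-up tool result are all hypothetical.

```python
def run_agent(llm, tools, question, max_steps=5):
    """Alternate between asking the LLM for an action and running the chosen tool."""
    scratchpad = ""
    for _ in range(max_steps):
        reply = llm(question, scratchpad)
        if reply.startswith("Final Answer:"):
            return reply[len("Final Answer:"):].strip()
        # Reply format assumed by this sketch: "tool_name: tool input"
        tool_name, _, tool_input = reply.partition(":")
        observation = tools[tool_name.strip()](tool_input.strip())
        # The scratchpad carries earlier actions and observations into the next LLM call.
        scratchpad += f"Action: {reply}\nObservation: {observation}\n"
    return "Stopped: step limit reached"

def scripted_llm(question, scratchpad):
    """Hypothetical stand-in for a planning-capable model."""
    if "Observation:" not in scratchpad:
        return "count_orders: by region"
    last_line = scratchpad.strip().splitlines()[-1]
    return "Final Answer: " + last_line.replace("Observation: ", "")

# Hypothetical tool with made-up data, standing in for a SQL query tool.
tools = {"count_orders": lambda arg: "Region A has the most orders"}
print(run_agent(scripted_llm, tools, "Which region's customers ordered the most?"))
```

The model has to pick a valid tool, format its reply correctly and decide when to stop on every turn, which is the planning burden that small models tend to struggle with.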

In [None]:
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain.agents import AgentExecutor
db = SQLDatabase.from_uri("sqlite:///./aibootcampdata/northwind.db")
llm.max_tokens = 2048
toolkit = SQLDatabaseToolkit(db=db, llm=llm)

agent_executor = create_sql_agent(
    llm=llm,
    toolkit=toolkit,
    verbose=True
)
agent_executor.run("Which region customers ordered the most?")



> Entering new AgentExecutor chain...
