<a href="https://colab.research.google.com/github/yashasvi-shukl/GenAI/blob/main/RAG_llama_13B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers sentence-transformers pinecone-client datasets accelerate einops langchain xformers bitsandbytes

Collecting transformers
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting pinecone-client
  Downloading pinecone_client-2.2.4-py3-none-any.whl (179 kB)
Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting accelerate
  Downloading accelerate-0.24.0-py3-none-any.whl (260 kB)

# Initializing the Hugging Face Embedding Pipeline

We use a sentence-transformers model, loaded through LangChain, to transform text documents into vector embeddings.

In [2]:
from torch import cuda
from langchain.embeddings import HuggingFaceEmbeddings

embed_model_id  = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

In [3]:
# Use the embedding model to create document embeddings:

docs = [
    "this is first document",
    "this is second document"
]

embeddings = embed_model.embed_documents(docs)
print(f'We have {len(embeddings)} doc embeddings, each with a dimensionality of {len(embeddings[0])}.')

We have 2 doc embeddings, each with a dimensionality of 384.
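
To make these vectors more concrete, here is a minimal sketch of cosine similarity, the metric we choose when creating the index later on. The three-dimensional vectors are made-up stand-ins for real 384-dimensional embeddings, and `cosine_similarity` is a helper written for this illustration, not part of LangChain or Pinecone.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (assumption: real embeddings would have 384 entries each).
doc_a = [1.0, 2.0, 3.0]
doc_b = [2.0, 4.0, 6.0]   # points in the same direction as doc_a
doc_c = [-3.0, 0.0, 1.0]  # orthogonal to doc_a

print(round(cosine_similarity(doc_a, doc_b), 6))  # 1.0
print(round(cosine_similarity(doc_a, doc_c), 6))  # 0.0
```

With the real model, `cosine_similarity(embeddings[0], embeddings[1])` would compare the two example documents embedded above.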


# Building Vector Indexes

Next we build the embedding pipeline: it generates an embedding for each document and stores it in a vector store (Pinecone in this case) for indexing.

#### Note: You need a [free Pinecone API key](https://app.pinecone.io/organizations/-NhjXysNz5U5mmIw808B/projects/gcp-starter:46f24ad/keys)

In [4]:
import os
import pinecone

# Placeholders: replace both values with your own Pinecone API key and environment name
os.environ['PINECONE_API_KEY'] = 'PINECONE_API_KEY'
os.environ['PINECONE_ENVIRONMENT'] = 'PINECONE_ENVIRONMENT'

In [5]:
pinecone.init(
    api_key = os.environ.get('PINECONE_API_KEY'),
    environment = os.environ.get('PINECONE_ENVIRONMENT')
)
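
Because the credentials above are placeholder strings, a call made with them will fail only once it reaches the Pinecone service. As a small optional sketch, the helper below checks a variable before it is used. `require_env` and the `DEMO_API_KEY` variable are hypothetical names introduced for this example; they are not part of the Pinecone client.

```python
import os

def require_env(name, placeholder):
    """Return an environment variable, failing early if it is missing
    or still set to the placeholder string."""
    value = os.environ.get(name)
    if not value or value == placeholder:
        raise RuntimeError(f"Set {name} to a real value before continuing.")
    return value

# Demonstration with a made-up variable so no real credential is touched.
os.environ["DEMO_API_KEY"] = "DEMO_API_KEY"  # still the placeholder
try:
    require_env("DEMO_API_KEY", placeholder="DEMO_API_KEY")
except RuntimeError as err:
    print(err)

os.environ["DEMO_API_KEY"] = "abc123"  # a made-up value
print(require_env("DEMO_API_KEY", placeholder="DEMO_API_KEY"))
```

In this notebook the same check could wrap `PINECONE_API_KEY` and `PINECONE_ENVIRONMENT` before they are passed to `pinecone.init`.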

In [6]:
# Now we initialize the index

import time

index_name = 'llama-2-rag'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),
        metric='cosine'
    )

    # wait until the index is ready
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

In [7]:
# Now we connect to the index

index = pinecone.Index(index_name = index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.04871,
 'namespaces': {'': {'vector_count': 4871}},
 'total_vector_count': 4871}

In [8]:
# The index is created and our vector store is ready.
# We now need data to start the indexing process.
# I am using the Llama 2 arXiv research papers as the dataset.

from datasets import load_dataset

data = load_dataset(
    'jamescalam/llama-2-arxiv-papers-chunked',
    split='train'
)
data = data.to_pandas()

In [9]:
batch_size = 32

for i in range(0, len(data), batch_size):
    i_end = min(len(data), i + batch_size)
    batch = data.iloc[i:i_end]
    # build a unique ID per chunk from the paper DOI and the chunk number
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    texts = [x['chunk'] for _, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'],
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))


In [10]:

index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.04871,
 'namespaces': {'': {'vector_count': 4871}},
 'total_vector_count': 4871}
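
The batching pattern used in the upsert loop can be looked at on its own. The sketch below reproduces only the index arithmetic and the ID format, with no Pinecone or pandas involved. `batch_ranges` and `make_id` are helper names made up for this example, and the DOI value is invented.

```python
def batch_ranges(n_items, batch_size):
    """Yield (start, end) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(n_items, start + batch_size)

def make_id(doi, chunk_id):
    """Build an ID in the same '<doi>-<chunk-id>' form as the loop above."""
    return f"{doi}-{chunk_id}"

print(list(batch_ranges(n_items=70, batch_size=32)))
# [(0, 32), (32, 64), (64, 70)]

# With 4871 vectors and a batch size of 32, the loop runs 153 times.
print(sum(1 for _ in batch_ranges(4871, 32)))  # 153

print(make_id("1234.56789", 3))  # 1234.56789-3 (made-up DOI)
```

The last batch is simply shorter than the others, which is why the loop clamps the end index with `min`.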

# Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a text-generation pipeline with Hugging Face transformers. The pipeline requires two things that we must initialize first:


*   An LLM, in this case meta-llama/Llama-2-13b-chat-hf.
*   The respective tokenizer for the model.


We'll explain these as we get to them; let's begin with our model.

We initialize the model and load it onto our CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5-10 minutes.

In [11]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-13b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = 'HUGGINGFACE_API_KEY'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")



Model loaded on cuda:0
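
To see why 4-bit quantization matters here, a rough back-of-the-envelope estimate helps. The sketch below counts memory for the weights only, assuming a nominal 13 billion parameters (taken from the "13b" in the model name). It ignores quantization overhead, any layers kept in higher precision, activations, and the generation cache, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 13e9  # assumption: nominal parameter count from the model name

print(weight_memory_gb(n_params, 16))  # 26.0  (16-bit weights)
print(weight_memory_gb(n_params, 4))   # 6.5   (4-bit weights)
```

Under these assumptions the 4-bit weights need roughly a quarter of the memory of 16-bit weights, which is what brings the model within reach of a single Colab GPU.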


In [12]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)





Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [15]:

generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.7,  # 'randomness' of outputs: must be > 0, values near 0 are nearly deterministic, larger values are more random
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
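
The effect of `temperature` is easier to see on a toy example. The sketch below applies a temperature-scaled softmax to three made-up token scores. It illustrates the general idea only and is not the transformers implementation; `softmax_with_temperature` is a name chosen for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

for t in (0.1, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures concentrate almost all probability on the highest-scoring token, while higher temperatures flatten the distribution and make less likely tokens more probable.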

In [16]:
res = generate_text("Explain to me the difference between Machine Learning and Deep Learning.")
print(res[0]["generated_text"])

Explain to me the difference between Machine Learning and Deep Learning.

Answer:

Machine learning (ML) and deep learning (DL) are both subfields of artificial intelligence (AI) that involve training algorithms to make predictions or take actions based on data. The key differences between ML and DL are in their approach, architecture, and application areas.

1. Approach:
	* ML is a more traditional approach to AI that involves designing hand-crafted features and using statistical models to learn from data.
	* DL is a more recent approach that uses neural networks with multiple layers to learn complex representations of data.
2. Architecture:
	* ML typically involves a single hidden layer between the input and output layers, while DL has multiple hidden layers (usually called "deep" layers) that allow it to learn more complex and abstract representations of data.
	* Each layer in a DL model learns to extract higher-level features from the previous layer, allowing it to capture much dee

In [17]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

In [20]:
print(llm(prompt="Explain to me the difference between Machine Learning and Deep Learning."))



Machine learning is a type of artificial intelligence (AI) that involves training algorithms on data and using that training to make predictions or decisions. It's like a computer program that can learn from experience and improve its performance over time.

Deep learning is a subfield of machine learning that focuses on training algorithms with multiple layers of neural networks. These neural networks are designed to mimic the structure and function of the human brain, and they can learn to recognize patterns in large amounts of data.

The main differences between machine learning and deep learning are:

1. Complexity: Machine learning algorithms are typically simpler and more straightforward than deep learning algorithms.
2. Training data: Machine learning algorithms can be trained on smaller amounts of data, while deep learning algorithms require much larger amounts of data to train effectively.
3. Accuracy: Deep learning algorithms tend to be more accurate than machine learning a

# Initializing a RetrievalQA Chain

For **Retrieval Augmented Generation (RAG)** in **LangChain** we need to initialize either a **RetrievalQA** or **RetrievalQAWithSourcesChain** object. Both need an LLM (which we have initialized) and our Pinecone index, wrapped in a LangChain vector store object.

In [21]:
# Let's begin by initializing the LangChain vector store

from langchain.vectorstores import Pinecone

text_field = 'text'  # field in metadata that contains text content

vectorestore = Pinecone(
    index, embed_model.embed_query, text_field
)



In [22]:
query = 'what makes llama 2 so powerful?'

vectorestore.similarity_search(
    query,
    k = 3
)

[Document(page_content='Ricardo Lopez-Barquilla, Marc Shedroﬀ, Kelly Michelena, Allie Feinstein, Amit Sangani, Geeta\nChauhan,ChesterHu,CharltonGholson,AnjaKomlenovic,EissaJamil,BrandonSpence,Azadeh\nYazdan, Elisa Garcia Anzano, and Natascha Parks.\n•ChrisMarra,ChayaNayak,JacquelinePan,GeorgeOrlin,EdwardDowling,EstebanArcaute,Philomena Lobo, Eleonora Presani, and Logan Kerr, who provided helpful product and technical organization support.\n46\n•Armand Joulin, Edouard Grave, Guillaume Lample, and Timothee Lacroix, members of the original\nLlama team who helped get this work started.\n•Drew Hamlin, Chantal Mora, and Aran Mun, who gave us some design input on the ﬁgures in the\npaper.\n•Vijai Mohan for the discussions about RLHF that inspired our Figure 20, and his contribution to the\ninternal demo.\n•Earlyreviewersofthispaper,whohelpedusimproveitsquality,includingMikeLewis,JoellePineau,\nLaurens van der Maaten, Jason Weston, and Omer Levy.', metadata={'source': 'http://arxiv.org/pdf/230
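
Conceptually, a similarity search embeds the query, scores it against the stored vectors, and returns the text of the `k` best matches. The brute-force sketch below shows that idea on three hand-made records. It is not how Pinecone searches internally; the record layout, the `top_k` helper, and the two-dimensional vectors are assumptions for illustration.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, records, k, text_field="text"):
    """Return the stored text of the k records most similar to query_vec."""
    best = heapq.nlargest(k, records, key=lambda r: cosine(query_vec, r["values"]))
    return [r["metadata"][text_field] for r in best]

# Toy records: a vector plus metadata holding the original text.
records = [
    {"values": [1.0, 0.0], "metadata": {"text": "chunk about training data"}},
    {"values": [0.0, 1.0], "metadata": {"text": "chunk about acknowledgements"}},
    {"values": [0.9, 0.1], "metadata": {"text": "chunk about fine-tuning"}},
]

print(top_k([1.0, 0.0], records, k=2))
# ['chunk about training data', 'chunk about fine-tuning']
```

The `text_field` argument plays the same role as in the vector store above: it names the metadata key that holds the chunk text to hand back.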

In [25]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm = llm , chain_type = 'stuff',
    retriever = vectorestore.as_retriever()
)
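
The `stuff` chain type places all retrieved chunks into a single prompt and sends that prompt to the LLM once. The sketch below shows the idea with plain string handling. The prompt wording and the `build_stuff_prompt` helper are assumptions for illustration and are not LangChain's actual template; the chunks are short made-up examples.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block, then ask the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Made-up chunks standing in for what the retriever would return.
chunks = [
    "Llama 2 ranges in scale from 7 billion to 70 billion parameters.",
    "The fine-tuned models are optimized for dialogue use cases.",
]

prompt = build_stuff_prompt("What sizes does Llama 2 come in?", chunks)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks plus the question fit in the model's context window.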

## Let's start asking questions

In [28]:
# without RAG

llm("How Large Language Models (LLM) works?")

'\n\nLarge language models (LLMs) are a class of artificial intelligence models that are trained on vast amounts of text data to generate language outputs that are coherent and natural-sounding. These models have become increasingly popular in recent years due to their ability to generate text that is often indistinguishable from human-written content. In this article, we will explore how LLMs work and some of the key techniques used in their training.\n\n1. Word embeddings:\n\nOne of the fundamental components of LLMs is word embeddings, which are a way of representing words as vectors in a high-dimensional space. This allows the model to capture the relationships between words and their meanings, such as synonymy, antonymy, and context. Word embeddings are typically learned using unsupervised methods, such as Word2Vec or GloVe.\n\n2. Encoder-decoder architecture:\n\nLLMs typically use an encoder-decoder architecture, where the encoder takes in a sequence of words and generates a cont

In [30]:
# Using RAG pipeline
rag_pipeline("How Large Language Models (LLM) works?")

{'query': 'How Large Language Models (LLM) works?',
 'result': ' Large Language Models (LLM) are collections of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. They are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, maybe as suitable substitute for closed-source models.'}

In [31]:
rag_pipeline("Why llama 2 is so special?")

{'query': 'Why llama 2 is so special?',
 'result': '\nLlama 2 is a family of pretrained and fine-tuned large language models (LLMs) that have been developed and released by the authors of this paper. The Llama 2 models are trained on a large corpus of self-supervised data, and then aligned with human preferences through techniques such as reinforcement learning with human feedback (RLHF). This allows the models to perform complex reasoning tasks requiring expert knowledge across a wide range of fields, and to interact with humans through intuitive chat interfaces. The Llama 2 models are also heavily ﬁne-tuned to align with human preferences, which greatly enhances their usability and safety.'}