<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/llamindex-projects/02_rag_with_llamaindex_Llama2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## RAG System using Llama2 with Hugging Face

In [None]:
!pip install pypdf

In [None]:
!pip install einops accelerate langchain bitsandbytes

In [None]:
## Embedding
!pip install sentence_transformers

In [None]:
# The imports below use the pre-0.10 llama_index package layout, so pin the version
!pip install "llama_index<0.10"

In [None]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts.prompts import SimpleInputPrompt
from llama_index.embeddings import LangchainEmbedding
from llama_index.response.pprint_utils import pprint_response

from langchain.embeddings.huggingface import HuggingFaceEmbeddings

import torch

In [None]:
import warnings

warnings.filterwarnings('ignore')

In [None]:
!wget https://arxiv.org/pdf/1706.03762.pdf
!wget https://arxiv.org/pdf/1506.02640.pdf

!mkdir data
!mv 1706.03762.pdf attention.pdf
!mv 1506.02640.pdf yolo.pdf

!mv *.pdf data/

## Loading documents

In [None]:
documents=SimpleDirectoryReader("/content/data").load_data()
documents

In [None]:
system_prompt="""
You are a Q&A assistant. Your goal is to answer questions as
accurately as possible based on the instructions and context provided.
"""

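# Query wrapper used in this notebook. Note: the <|USER|>/<|ASSISTANT|> markers
# follow a StableLM-style convention; Llama 2 chat models natively expect [INST] ... [/INST].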
query_wrapper_prompt=SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")
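The `<|USER|>`/`<|ASSISTANT|>` wrapper above is not the template Llama 2 chat checkpoints were fine-tuned on. As a rough illustration, the sketch below builds a single-turn prompt with the `[INST]` and `<<SYS>>` markers. `build_llama2_prompt` is a hypothetical helper written for this note, not a llama_index API, and the leading `<s>` token is assumed to be added by the tokenizer.

```python
# Illustrative sketch only: build_llama2_prompt is a hypothetical helper.
# The beginning-of-sequence token is assumed to be added by the tokenizer.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_llama2_prompt(system_prompt: str, query_str: str) -> str:
    """Wrap a system prompt and one user query in Llama 2 chat markers."""
    return (
        f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}"
        f"{query_str.strip()} {E_INST}"
    )


example = build_llama2_prompt("You are a Q&A assistant.", "What is YOLO?")
print(example)
```

If you want to try this format in the notebook, one option (untested here) is to pass an equivalent template string to `SimpleInputPrompt` and compare the answers against the ones shown below.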

In [None]:
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## LLM model

In [None]:
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    # float16 compute with 8-bit weights (via bitsandbytes) to reduce memory usage; needs a CUDA GPU
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True}
)
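To see why `load_in_8bit` matters on a Colab GPU, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It assumes a round 7 billion parameters and ignores activations, the KV cache, and quantization overhead, so treat the numbers as rough lower bounds rather than measurements.

```python
# Rough weight-memory estimate for a 7B-parameter model.
# Assumption: exactly 7e9 parameters; activations, KV cache, and
# quantization bookkeeping are ignored.
NUM_PARAMS = 7_000_000_000

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.1f} GB")
```

Under these assumptions, 8-bit weights need about half the memory of float16 weights, which is what lets the model fit on smaller GPUs.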

In [None]:
embed_model=LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

In [None]:
service_context=ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model=embed_model
)

In [None]:
service_context

ServiceContext(llm_predictor=LLMPredictor(system_prompt=None, query_wrapper_prompt=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>), prompt_helper=PromptHelper(context_window=4096, num_output=256, chunk_overlap_ratio=0.1, chunk_size_limit=None, separator=' '), embed_model=LangchainEmbedding(model_name='sentence-transformers/all-mpnet-base-v2', embed_batch_size=10, callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x7b3afc9c8fa0>), transformations=[SentenceSplitter(include_metadata=True, include_prev_next_rel=True, callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x7b3afc9c8fa0>, id_func=<function default_id_func at 0x7b3bb8cea290>, chunk_size=1024, chunk_overlap=200, separator=' ', paragraph_separator='\n\n\n', secondary_chunking_regex='[^,.;。？！]+[,.;。？！]?')], llama_logger=<llama_index.logger.base.LlamaLogger object at 0x7b3bb64f48e0>, callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x7b3afc9c8fa0>)
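The repr above reports `chunk_size=1024` and `chunk_overlap=200`. The sketch below shows what those two numbers mean with a plain sliding window over a token list. It is a simplification: the real `SentenceSplitter` also tries to respect sentence and paragraph boundaries, and `chunk_with_overlap` is a hypothetical helper, not part of llama_index.

```python
# Simplified fixed-size chunking with overlap.
# Assumption: "tokens" are plain list items; the real splitter counts
# tokenizer tokens and prefers sentence boundaries.
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = [f"w{i}" for i in range(2500)]
chunks = chunk_with_overlap(tokens, chunk_size=1024, chunk_overlap=200)
print([len(c) for c in chunks])
```

Each chunk repeats the last 200 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.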

## Vector Index

In [None]:
vector_store_index=VectorStoreIndex.from_documents(documents, service_context=service_context)

In [None]:
vector_store_index

<llama_index.indices.vector_store.base.VectorStoreIndex at 0x7b3b2306e7d0>

In [None]:
query_engine=vector_store_index.as_query_engine()

## Query vector index

In [None]:
response=query_engine.query("what is attention is all you need?")
pprint_response(response, show_source=True)
print(response)

Final Response: Attention is a powerful tool in NLP, but it is not the
only thing you need to build a successful model. While attention
mechanisms like the one described in the passage can help the model
focus on relevant parts of the input, they do not address other
important aspects of language processing, such as syntax and
semantics. To build a truly robust NLP model, you will need to
incorporate a variety of techniques, including attention, as well as
other types of neural network layers and traditional NLP methods.
______________________________________________________________________
Source Node 1/2
Node ID: a1933801-245c-4048-ae18-67dbab1654d6
Similarity: 0.5568833006661468
Text: Input-Input Layer5 The Law will never be perfect , but its
application should be just - this is what we are missing , in my
opinion . <EOS> <pad> The Law will never be perfect , but its
application should be just - this is what we are missing , in my
opinion . <EOS> <pad> Input-Input Layer5 The Law wil
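The `Similarity` value printed for each source node is a score between the query embedding and the chunk embedding. The sketch below assumes the retriever ranks chunks by cosine similarity and keeps the top two, which is consistent with the "Source Node 1/2" label above but is an assumption about the defaults. The vectors are toy values, not real `all-mpnet-base-v2` embeddings, and `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=2):
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" standing in for real 768-dimensional ones.
toy_nodes = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.6, 0.8, 0.0],
    "node-c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], toy_nodes, k=2)
for node_id, score in results:
    print(node_id, round(score, 3))
```

A score near 0.55, as in the output above, indicates only a moderate match, which may explain why the retrieved chunk (an attention-visualization caption) did not help the model answer the question well.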

In [None]:
response=query_engine.query("what is Transformers?")
pprint_response(response, show_source=True)
print(response)

Final Response: Transformers is a type of neural network architecture
introduced in the paper "Attention is All You Need" by Vaswani et al.
in 2017. It's primarily designed for sequence-to-sequence tasks, such
as machine translation, and it relies on self-attention mechanisms to
process input sequences. The Transformer model consists of an encoder
and a decoder, each composed of multiple identical layers. Each layer
in the encoder and decoder contains a stack of two sub-layers: a
multi-head self-attention mechanism and a position-wise fully
connected feed-forward network. The self-attention mechanism allows
the model to attend to different parts of the input sequence
simultaneously, while the feed-forward network processes the output of
the self-attention mechanism to produce the final output. The
Transformer model also uses a technique called attention masking to
prevent the model from attending to positions in the input sequence
that are beyond the current position being processed. T

In [None]:
response=query_engine.query("what is YOLO?")
pprint_response(response, show_source=True)
print(response)

Final Response: YOLO is a real-time object detection system that uses
a single neural network to predict bounding boxes and class
probabilities directly from full images. It is simple, fast, and
achieves high performance on object detection tasks. YOLO is trained
on full images and directly optimizes detection performance, making it
different from traditional object detection methods. It is also shown
to be effective in detecting objects in artwork, where other methods
struggle.
______________________________________________________________________
Source Node 1/2
Node ID: 0b2b0b90-582c-4dc9-a31d-25226fd0ac95
Similarity: 0.4745535599386536
Text: 8 27.6 52.0 41.7 69.6 61.3 68.3 57.8 29.6 57.8 40.9 59.3 54.1
SDS [16] 50.7 69.7 58.4 48.5 28.3 28.8 61.3 57.5 70.8 24.1 50.7 35.9
64.9 59.1 65.8 57.1 26.0 58.8 38.6 58.9 50.7 R-CNN [13] 49.6 68.1 63.8
46.1 29.4 27.9 56.6 57.0 65.9 26.5 48.7 39.5 66.2 57.3 65.4 53.2 26.2
54.5 38.1 50.6 51.6 Table 3: PASCAL VOC 2012 Leaderboard. YOLO
compared wi

In [None]:
response=query_engine.query("what is Object detection?")
pprint_response(response, show_source=True)
print(response)

Final Response: Object detection is a core problem in computer vision
that involves locating and classifying objects within images or
videos. It is a fundamental task in many applications, such as
autonomous driving, robotics, surveillance, and healthcare. Object
detection can be broadly classified into two categories: instance-
level detection and semantic segmentation. Instance-level detection
involves identifying individual objects within an image, while
semantic segmentation involves assigning a class label to each pixel
in an image.  Object detection pipelines typically start by extracting
robust features from input images, followed by classifiers or
localizers that identify objects in the feature space. These
classifiers or localizers can be run in a sliding window fashion over
the whole image or on some subset of regions in the image.  Deformable
parts models (DPM) and R-CNN are two popular object detection
frameworks that use a sliding window approach to find objects in
images.