[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb)

# RAG with LLaMa 13B

In this notebook we'll explore how we can use the open source **Llama-13b-chat** model in both Hugging Face transformers and LangChain.
At the time of writing, you must first request access to Llama 2 models via [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) (access is typically granted within a few hours). If you need guidance on getting access please refer to the beginning of this [article](https://www.pinecone.io/learn/llama-2/) or [video](https://youtu.be/6iHVJyX2e50?t=175).

---

🚨 _Note that running this on CPU is sloooow. If running on Google Colab you can avoid this by going to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**. This should be included within the free tier of Colab._

---

We start by doing a `pip install` of all required libraries.

In [2]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchdata 0.6.0 requires torch==2.0.0, but you have torch 2.0.1 which is incompatible.[0m[31m
[0m

## Initializing the Hugging Face Embedding Pipeline

We begin by initializing the embedding pipeline that will handle the transformation of our docs into vector embeddings. We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding.

In [6]:
device_count = torch.cuda.device_count()

# Define your device map for multi-GPU support
# Example device map: {"model": "cuda:0", "tokenizer": "cuda:1"}
device_map = {
    "model": [f"cuda:{i}" for i in range(device_count)],
    "tokenizer": [f"cuda:{i}" for i in range(device_count)],
}


NameError: name 'torch' is not defined

In [7]:
from torch import cuda

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import torch
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

# Check if CUDA (GPU) is available and get the number of available GPUs
if torch.cuda.is_available():
    device = torch.device("cuda")
    num_devices = torch.cuda.device_count()
    print(f"Using {num_devices} GPU(s)")
else:
    device = torch.device("cpu")
    print("No GPU available, using CPU")
embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

# device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

Using 2 GPU(s)




Downloading .gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading 1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading train_script.py:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

We can use the embedding model to create document embeddings like so:

In [8]:
docs = [
    "this is one document",
    "and another document"
]

embeddings = embed_model.embed_documents(docs)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

We have 2 doc embeddings, each with a dimensionality of 384.


## Building the Vector Index

We now need to use the embedding pipeline to build our embeddings and store them in a Pinecone vector index. To begin we'll initialize our index, for this we'll need a [free Pinecone API key](https://app.pinecone.io/).

In [9]:
import os
import pinecone

# get API key from app.pinecone.io and environment from console
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY') or '3b5e2eb7-ac6b-4271-aafc-ff9ecac6cbcb',
    environment=os.environ.get('PINECONE_ENVIRONMENT') or 'gcp-starter'
)


Now we initialize the index.

In [10]:
import time

index_name = 'llama-2-rag'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

In [11]:
pinecone.describe_index(index_name)

IndexDescription(name='llama-2-rag', metric='cosine', replicas=1, dimension=384.0, shards=1, pods=1, pod_type='starter', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')

Now we connect to the index:

In [12]:
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.63774,
 'namespaces': {'': {'vector_count': 63774}},
 'total_vector_count': 63774}

With our index and embedding process ready we can move onto the indexing process itself. For that, we'll need a dataset. We will use a set of Arxiv papers related to (and including) the Llama 2 research paper.

In [None]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
%time yelp_reviews = pd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json", encoding = 'ISO-8859-1', lines=True, nrows=5000)
%time yelp_business = pd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_business.json", encoding = 'ISO-8859-1', lines=True)

yelp_business_sub = yelp_business[['business_id','name','city','state','stars','review_count','is_open','categories','attributes','hours']]
yelp_reviews_sub = yelp_reviews[['business_id','stars','useful','funny','cool','text']]
yelp_reviews_sub = yelp_reviews_sub.rename(columns = {'stars':'review_rating'})

filtered_businesses = yelp_business_sub[yelp_business_sub['business_id'].isin(yelp_reviews_sub['business_id'])]
print(len(filtered_businesses))
result_df = pd.merge(filtered_businesses, yelp_reviews_sub, on='business_id', how='left')
result_df.to_csv('output.csv', index=False)



# data = load_dataset(
#     'jamescalam/llama-2-arxiv-papers-chunked',
#     split='train'
# )
# data

In [None]:
len(result_df)

In [None]:
# grouped_df = result_df.groupby('business_id').agg({'review': '\n\n\n'.join, 'name': 'first', 'city': 'first','state':'first'}).reset_index()
# grouped_df.dtypes
# specific_business_id = "0Kn5W22UmxOqPj2cjouFNA"

# # Use boolean indexing to filter rows with the specific business ID
# filtered_rows = grouped_df[grouped_df['business_id'] == specific_business_id]
# if not filtered_rows.empty:
#     # Print the complete review(s)
#     for review in filtered_rows['review']:
#         print(review)

In [None]:
import pandas as pd
import json  # Import the json module

# Assuming df is your DataFrame
def format_review(row):
    # Separate true and false attributes
    if isinstance(row['attributes'], str):
        try:
            attributes_dict = json.loads(row['attributes'])
        except json.JSONDecodeError:
            attributes_dict = {}
    else:
        attributes_dict = row['attributes']
#     if row['attributes'] is not None:
#         attributes_true = [k for k, v in row['attributes'].items() if v]
#         attributes_false = [k for k, v in row['attributes'].items() if not v]
#     else:
#         attributes_true = []
#         attributes_false = []
    if attributes_dict is not None:
        attributes_true = [k for k, v in attributes_dict.items() if v]
        attributes_false = [k for k, v in attributes_dict.items() if not v]
    else:
        attributes_true = []
        attributes_false = []

    # Format attributes
    attr_true_str = ', '.join(attributes_true) if attributes_true else 'None'
    attr_false_str = ', '.join(attributes_false) if attributes_false else 'None'

    # Format review reactions
    reactions = []
    if row['useful'] == 1: reactions.append('useful')
    if row['funny'] == 1: reactions.append('funny')
    if row['cool'] == 1: reactions.append('cool')
    reactions_str = ', '.join(reactions) if reactions else 'None'

    # Construct the review text
    review_text = f"{row['name']} in city {row['city']} state {row['state']} has {row['stars']} rating for {row['review_count']} reviews, offers {row['categories']} food and it has attributes {attr_true_str} and lacks {attr_false_str}. The restaurant is open {row['hours']}. User gave the restaurant {row['review_rating']} rating with review reading \"{row['text']}\". Others found this review {reactions_str}."

    return review_text

# Apply the function to each row
result_df['review'] = result_df.apply(format_review, axis=1)

# Now df['review'] contains the formatted text for each row


In [14]:
# review_df = result_df[['business_id','review']]
import pandas as pd
new_review_df = pd.read_csv('/kaggle/input/input-new/sampled_dataset.csv')


In [None]:
new_review_df = new_review_df[['business_id','name', 'city', 'state' ,'review']]

In [None]:
# /grouped_df =new_review_df.groupby('business_id')['review'].apply('\n'.join).reset_index()


In [None]:
# grouped_df = new_review_df.groupby('business_id').agg({'review': '\n\n\n'.join, 'name': 'first', 'city': 'first','state':'first'}).reset_index()
# grouped_df.dtypes
# specific_business_id = "0Kn5W22UmxOqPj2cjouFNA"

# # Use boolean indexing to filter rows with the specific business ID
# filtered_rows = grouped_df[grouped_df['business_id'] == specific_business_id]
# if not filtered_rows.empty:
#     # Print the complete review(s)
#     for review in filtered_rows['review']:
#         print(review)

In [15]:
new_review_df.dtypes

business_id       object
name              object
city              object
state             object
stars            float64
review_count       int64
is_open            int64
categories        object
attributes        object
hours             object
review_id         object
review_rating      int64
useful             int64
funny              int64
cool               int64
text              object
combined_id       object
review            object
dtype: object

We will embed and index the documents like so:

In [16]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.63774,
 'namespaces': {'': {'vector_count': 63774}},
 'total_vector_count': 63774}

In [None]:

from datasets import Dataset
# data = Dataset.from_pandas(new_review_df)
data = Dataset.from_pandas(new_review_df)
data
data = data.to_pandas()

batch_size = 256

for i in range(0, len(data), batch_size):
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    ids = [f"{x['combined_id']}" for i, x in batch.iterrows()]
    texts = [x['review'] for i, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {
         'review': x['review'],
         'city': x['city'],
         'state': x['state'],
         'name': x['name'],
#          'business_id': x['business_id'],
        } for i, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds,metadata))

## Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a `text-generation` pipeline with Hugging Face transformers. The Pipeline requires three things that we must initialize first, those are:

* A LLM, in this case it will be `meta-llama/Llama-2-13b-chat-hf`.

* The respective tokenizer for the model.

We'll explain these as we get to them, let's begin with our model.

We initialize the model and move it to our CUDA-enabled GPU. Using Colab this can take 5-10 minutes to download and initialize the model.

In [None]:
# import torch
# from torch import cuda, bfloat16
# import transformers

# model_id = 'NousResearch/Llama-2-13b-chat-hf'

# device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# # Check the number of available GPUs
# device_count = torch.cuda.device_count()

# # Define your device map for multi-GPU support
# device_map = {
#     "model": [f'cuda:{i}' for i in range(device_count)],
#     "tokenizer": [f'cuda:{i}' for i in range(device_count)],
# }

# # set quantization configuration to load a large model with less GPU memory
# bnb_config = transformers.BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type='nf4',
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_compute_dtype=bfloat16
# )

# # begin initializing HF items, need auth token for these
# hf_auth = 'HF_AUTH_TOKEN'
# model_config = transformers.AutoConfig.from_pretrained(
#     model_id,
#     use_auth_token=hf_auth
# )

# # Pass device_map as a keyword argument to from_pretrained
# model = transformers.AutoModelForCausalLM.from_pretrained(
#     model_id,
#     trust_remote_code=True,
#     config=model_config,
#     quantization_config=bnb_config,
#     device_map=device_map,  # Use the updated device_map as a keyword argument
#     use_auth_token=hf_auth
# )

# # Move the model to the selected device
# model.to(device)

# # Set the model in evaluation mode
# model.eval()
# print(f"Model loaded on {device}")


In [18]:
from torch import cuda, bfloat16
import transformers

model_id = 'NousResearch/Llama-2-13b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = 'HF_AUTH_TOKEN'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")

Downloading config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]



Downloading (…)fetensors.index.json:   0%|          | 0.00/33.4k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/9.90G [00:00<?, ?B/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/6.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading generation_config.json:   0%|          | 0.00/175 [00:00<?, ?B/s]

Model loaded on cuda:0


The pipeline requires a tokenizer which handles the translation of human readable plaintext to LLM readable token IDs. The Llama 2 13B models were trained using the Llama 2 13B tokenizer, which we initialize like so:

In [19]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

Downloading tokenizer_config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]



Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [20]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.0,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=512,  # mex number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

Confirm this is working:

In [None]:
res = generate_text("Suggest me a restaurant in San Fransisco for Thai")
print(res[0]["generated_text"])

Now to implement this in LangChain

In [21]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

In [None]:
llm(prompt="Suggest me a restaurant in San Fransisco for Thai")

We still get the same output as we're not really doing anything differently here, but we have now added **Llama 2 13B Chat** to the LangChain library. Using this we can now begin using LangChain's advanced agent tooling, chains, etc, with **Llama 2**.

## Initializing a RetrievalQA Chain

For **R**etrieval **A**ugmented **G**eneration (RAG) in LangChain we need to initialize either a `RetrievalQA` or `RetrievalQAWithSourcesChain` object. For both of these we need an `llm` (which we have initialized) and a Pinecone index — but initialized within a LangChain vector store object.

Let's begin by initializing the LangChain vector store, we do it like so:

# **Large context metadata**

# Dont use vectorstore

In [22]:
from langchain.vectorstores import Pinecone

text_field = 'review'  # field in metadata that contains text content

vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)

We can confirm this works like so:

In [None]:
# from langchain.schema.retriever import BaseRetriever, Document
# from typing import List,Any

# class CustomVectorStoreRetriever(BaseRetriever):
#     vectorstore: Any
#     dataframe: Any

#     def __init__(self, vectorstore, dataframe):
#         super().__init__()
#         self.dataframe = dataframe
#         self.vectorstore =vectorstore

#     def get_relevant_documents(self, query, top_k=5):
#         retrieval_results = self.vectorstore.similarity_search(query, k=top_k)
#         retrieved_documents:Document = []
#         for result in retrieval_results:
#             doc_id = result.page_content
#             document_text = self.dataframe[self.dataframe['business_id'] == doc_id]['review'].iloc[0]
#             document = Document(page_content=document_text)
#             retrieved_documents.append(document)
# #         print(retrieved_documents)
#         return retrieved_documents
#     async def _aget_relevant_documents(
#             self,
#             query: str,
#             *,
#             run_manager ,
#             **kwargs ,
#     ) -> List[Document]:
#         raise NotImplementedError()

# custom_retriever = CustomVectorStoreRetriever(vectorstore, new_review_df)

In [None]:
query = "restuarant in San Fransisco"
vectorstore.similarity_search(
    query,  # the search query
    k=5  # returns top 3 most relevant chunks of text
)

Looks good! Now we can put our `vectorstore` and `llm` together to create our RAG pipeline.

In [23]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever()
)

Let's begin asking questions! First let's try *without* RAG:

In [None]:
# llm(' restuarant in San Fransisco')

Hmm, that's not what we meant... What if we use our RAG pipeline?

In [None]:
rag_pipeline('Suggest restuarant in city nashville')

In [None]:
rag_pipeline('Restuarant in Nashville open on sunday and serves thai food ')

In [None]:
rag_pipeline('Tell me more about Siam Cafe in Nashville?')

This looks *much* better! Let's try some more.

In [None]:
rag_pipeline('What food does Chile Burrito serves?')

Okay, it looks like the LLM with no RAG is less than ideal — let's stop embarassing the poor LLM and stick with RAG + LLM. Let's ask the same question to our RAG pipeline.

In [None]:
rag_pipeline('what safety measures were used in the development of llama 2?')

A reasonable answer from the RAG pipeline, but it doesn't contain much information — maybe we can ask more about this, like what is this _"red team"_ procedure that delayed the launch of the 34B model?

In [None]:
rag_pipeline('what red teaming procedures were followed for llama 2?')

Very interesting!

In [None]:
rag_pipeline('how does the performance of llama 2 compare to other local LLMs?')

In [None]:
rag_pipeline('Give me different city names')

In [None]:
df[df['city']=='New Castle']['name'].unique()

In [None]:
df['city'].unique()

**Metrics/Results**

In [None]:
df =new_review_df

# **Synthetic data creation**

In [None]:
import json
import ast

# question=f" Restaurants in {city} with more than {rating} that deliver food?"
# def extract_delivery(attribute_str):
#     try:
#         attribute_dict = ast.literal_eval(attribute_str)
#         return attribute_dict.get('RestaurantsDelivery', 'True')
#     except (SyntaxError, ValueError):
#         return 'False'

# # Apply the extract_delivery function to create a new column 'delivery'
# new_review_df['delivery'] = new_review_df['attributes'].apply(extract_delivery)

# # Filter for restaurants in Nashville with delivery
# nashville_restaurants_with_delivery = new_review_df[(new_review_df['city'] == 'Nashville') & (new_review_df['delivery'] == 'True') & (new_review_df['stars'] > rating)]


# if not nashville_restaurants_with_delivery.empty:
#     restaurant_names = ', '.join(nashville_restaurants_with_delivery['name'].unique())
#     statement = f"Following restaurants in Nashville deliver food: {restaurant_names}"
# else:
#     statement = "There are no restaurants in Nashville that deliver food."
# print(statement)
def extract_delivery(attribute_str):
        
        try:
            attribute_dict = ast.literal_eval(attribute_str)
            return attribute_dict.get('RestaurantsDelivery', 'True')
        except (SyntaxError, ValueError):
            return 'False'

    # Apply the extract_delivery function to create a new column 'delivery'
df['delivery'] = df['attributes'].apply(extract_delivery)


def add_question_and_answer(df, city, rating):
    # Create the question based on the parameters
    question = f"Restaurants in {city} with more than {rating} that deliver food?"

    # Function to parse the attributes string and extract 'RestaurantsDelivery' value
    
    # Filter for restaurants in the specified city with delivery and a star rating above the given threshold
    city_restaurants_with_delivery = df[(df['city'] == city) & (df['delivery'] == 'True') & (df['stars'] > rating)]
    # Create a statement based on the filter
    if not city_restaurants_with_delivery.empty:
        restaurant_names = ', '.join(city_restaurants_with_delivery['name'].unique())
        statement = f"Following restaurants in {city} deliver food and have a star rating above {rating}: {restaurant_names}"
    else:
        statement = f"There are no restaurants in {city} that deliver food and have a star rating above {rating}."

    # Add the question and answer as new rows to the DataFrame
    new_row = {'question': question, 'answer': statement}
    qna_list = []
#     print(new_row)
    qna_list.append(new_row)
    return qna_list

cities = ['Nashville','Clementon','New Castle','New Orleans','Harvey','Yardley','Franklin','Wilmington','Santa Barbara','Saint Petersburg']
rating = 4
qna_list = []

for city in cities:
    qna_list.extend(add_question_and_answer(df, city, rating))



In [None]:
len(qna_list)

In [None]:
attirbutes = ['OutdoorSeating','RestaurantsTakeOut','ByAppointmentOnly']
for attrib in attirbutes:
    def extract(attribute_str):
        try:
            attribute_dict = ast.literal_eval(attribute_str)
            return attribute_dict.get(attrib, 'True')
        except (SyntaxError, ValueError):
            return 'False'
    df[attrib] = df['attributes'].apply(extract)

# Outdoor seating
def add_question_and_answer_attribe(df, city, rating):
    question = f"Restaurants in {city} with more than {rating} rating that offers Outdoor Seating?"
    city_restaurants_with_delivery = df[(df['city'] == city) & (df['OutdoorSeating'] == 'True') & (df['stars'] > rating)]
    if not city_restaurants_with_delivery.empty:
        restaurant_names = ', '.join(city_restaurants_with_delivery['name'].unique())
        statement = f"Following restaurants in {city} offer Outdoor Seating and have a star rating above {rating}: {restaurant_names}"
    else:
        statement = f"There are no restaurants in {city} that offer Outdoor Seating and have a star rating above {rating}."
    new_row = {'question': question, 'answer': statement}
    qna_list = []
    qna_list.append(new_row)
    return qna_list

# cities = ['Nashville','','New Orleans','Harvey','Yardley','Franklin','Wilmington','Santa Barbara','Saint Petersburg']
rating = 4.5

for city in cities:
    qna_list.extend(add_question_and_answer_attribe(df, city, rating))

print(len(qna_list))    
    
# RestaurantsTakeOut
def add_question_and_answer_attribe(df, city, rating):
    question = f"Restaurants in {city} with more than {rating} rating that offers take outs?"
    city_restaurants_with_delivery = df[(df['city'] == city) & (df['RestaurantsTakeOut'] == 'True') & (df['stars'] > rating)]
    if not city_restaurants_with_delivery.empty:
        restaurant_names = ', '.join(city_restaurants_with_delivery['name'].unique())
        statement = f"Following restaurants in {city} offer take outs and have a star rating above {rating}: {restaurant_names}"
    else:
        statement = f"There are no restaurants in {city} that offer take outs and have a star rating above {rating}."
    new_row = {'question': question, 'answer': statement}
    qna_list = []
    qna_list.append(new_row)
    return qna_list

# cities = ['Nashville','New Orleans','Harvey','Yardley','Franklin','Wilmington','Santa Barbara','Saint Petersburg']
rating = 4.5

for city in cities:
    qna_list.extend(add_question_and_answer_attribe(df, city, rating))
print(len(qna_list))    
    
# ByAppointmentOnly
def add_question_and_answer_attribe(df, city, rating):
    question = f"Restaurants in {city} with more than {rating} rating that are not appointment only?"
    city_restaurants_with_delivery = df[(df['city'] == city) & (df['ByAppointmentOnly'] == 'False') & (df['stars'] > rating)]
    if not city_restaurants_with_delivery.empty:
        restaurant_names = ', '.join(city_restaurants_with_delivery['name'].unique())
        statement = f"Following restaurants in {city} that are not appointment only and have a star rating above {rating}: {restaurant_names}"
    else:
        statement = f"There are no restaurants in {city} that are not appointment only and have a star rating above {rating}."
    new_row = {'question': question, 'answer': statement}
    qna_list = []
    qna_list.append(new_row)
    return qna_list

# cities = ['Nashville','New Orleans','Harvey','Yardley','Franklin','Wilmington','Santa Barbara','Saint Petersburg']
rating = 4.5

for city in cities:
    qna_list.extend(add_question_and_answer_attribe(df, city, rating))

print(len(qna_list))    

In [None]:
len(qna_list)

In [None]:
def add_question_and_answer_rest_rat(df, city, rating):
    question = f"Restaurants in {city} with more than {rating}"
    filtered_names = df[(df['city'] == city) & (df['stars'] > rating)]['name']
    if not filtered_names.empty:
        restaurant_names = ', '.join(filtered_names.unique())
        statement = f"Following restaurants in {city} have more than {rating} rating: {restaurant_names}"
    else:
        statement = f"There are no restaurants in {city} with more than {rating} rating."
    new_row = {'question': question, 'answer': statement}
    qna_list = []
    qna_list.append(new_row)
    return qna_list


# cities = ['Nashville','New Orleans','Harvey','Yardley','Franklin','Wilmington','Santa Barbara','Saint Petersburg']
rating = 4.9

for city in cities:
    qna_list.extend(add_question_and_answer_rest_rat(df, city, rating))

In [None]:
len(qna_list)
# print(qna_list)

In [None]:
qna_list

In [204]:
cities

['Nashville',
 'Clementon',
 'New Castle',
 'New Orleans',
 'Harvey',
 'Yardley',
 'Franklin',
 'Wilmington',
 'Santa Barbara',
 'Saint Petersburg']

In [205]:
filtered_city_df = df[df['city'].isin(cities)]

In [207]:
len(filtered_city_df)

17262

In [183]:
df.dtypes

business_id            object
name                   object
city                   object
state                  object
stars                 float64
review_count            int64
is_open                 int64
categories             object
attributes             object
hours                  object
review_id              object
review_rating           int64
useful                  int64
funny                   int64
cool                    int64
text                   object
combined_id            object
review                 object
delivery               object
OutdoorSeating         object
RestaurantsTakeOut     object
ByAppointmentOnly      object
dtype: object

In [None]:
df['categories'].unique().tolist()

In [None]:


category = ['Burgers', 'Pizza', 'Coffee & Tea', 'Bakeries','Mexican','Sandwiches','Mediterranean', 'Italian', 'Greek']

# food = df[df['categories'].isin(category)]['name'].unique()
pattern = '|'.join(category)
# restaurants = filtered_city_df[filtered_city_df['categories'].str.contains(pattern)]['name'].unique().tolist()
# print('food ',food, len(food), type(food))
qna = []
# print(restaurants)
for city in cities: 
    restaurants = filtered_city_df[(filtered_city_df['categories'].str.contains(pattern)) & (df['city']==city)]['name'].unique().tolist()
    if restaurants:
        categories = ', '.join(category)
        restaurants_str = ', '.join(restaurants)
        question = f"Which restaurants in {city} include either of {categories}"
        answer =  f"{restaurants_str} in {city} include either of {categories}"
        vikram = {}
        print(vikram)
        vikram['question'] = question
        vikram['answer'] = answer
        qna.append(vikram)
    else:
        question = f"Which restaurants in {city} include {categories}"
        answer =  f"There are no restaurants in {city} that include {categories}"
        
print(qna)

In [None]:
print(qna)

In [220]:
print(qna)

[]


In [None]:
qna_df = pd.DataFrame(qna_list)
print(qna_df)

In [1]:
len(qna_df)

NameError: name 'qna_df' is not defined

# **Inference code:**

In [24]:
import pandas as pd
qna_df = pd.read_csv('/kaggle/input/qna-dataset/Question answer predicts_reformatted.csv')

**1. 13B llama2**

In [None]:
import pandas as pd
rag_outputs = []
for index, row in qna_df.iterrows():
    question = row['question']
    predicted_answer = rag_pipeline(question)
    rag_outputs.append({
        'question': question,
        'answer': row['answer'],
        'predicted_answer': predicted_answer['result']
    })


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
result_df = pd.DataFrame(rag_outputs)
result_df.to_csv("rag_outputs_LLama_13B.csv", index=False)

**2. 7B Llama2**

In [None]:
import pandas as pd
rag_outputs = []
for index, row in qna_df.iterrows():
    question = row['question']
    predicted_answer = rag_pipeline(question)
    rag_outputs.append({
        'question': question,
        'answer': row['answer'],
        'predicted_answer': predicted_answer['answer']
    })

result_df = pd.DataFrame(rag_outputs)
result_df.to_csv("rag_outputs.csv", index=False)

In [None]:
for index, row in result_df.iterrows():
    question = row['question']
    predicted_answer = rag_pipeline(question)
    rag_outputs.append({
        'question': question,
        'answer': row['answer'],
        'predicted_answer': predicted_answer.answer
    })

In [241]:
rag_outputs = []
for index, row in df.iterrows():
    question = row['question']
#     predicted_answer = rag_pipeline(question)
    rag_outputs.append({
        'question': question,
        'answer': row['answer'],
        'predicted_answer': row['predicted_answer']['result']
    })

k = pd.DataFrame(rag_outputs)
k.to_csv("rag_outputs.csv", index=False)