<a href="https://colab.research.google.com/github/sdcharle/BMGPetBot/blob/master/sdc_various_llm_retrievalqa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2-13b-retrievalqa.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2-13b-retrievalqa.ipynb)

# RAG with LLaMa 13B and some Open Source models

(or 7B)

In this notebook we'll explore how we can use the open source **Llama-2-13b-chat** model in both Hugging Face transformers and LangChain.
At the time of writing, you must first request access to Llama 2 models via [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) (access is typically granted within a few hours). If you need guidance on getting access please refer to the beginning of this [article](https://www.pinecone.io/learn/llama-2/) or [video](https://youtu.be/6iHVJyX2e50?t=175).



---

🚨 _Note that running this on CPU is sloooow. If running on Google Colab you can avoid this by going to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**. This should be included within the free tier of Colab._

Use A100 if Pro!

---

We start by doing a `pip install` of all required libraries.

In [None]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0 \
  chromadb==0.3.21 \
  tiktoken==0.3.3


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# thanks dude - https://ayoolafelix.hashnode.dev/how-to-permanently-install-a-module-on-google-colab-ckixqrvs40su044s187y274tc
import os, sys

# set a cache location
os.environ["TRANSFORMERS_CACHE"] = "/content/drive/MyDrive/HuggingCache"


Mounted at /content/drive


## Initializing the Hugging Face Embedding Pipeline

We begin by initializing the embedding pipeline that will handle the transformation of our docs into vector embeddings. We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding.

In [None]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)


We can use the embedding model to create document embeddings like so:

In [None]:
docs = [
    "this is one document",
    "and another document"
]

embeddings = embed_model.embed_documents(docs)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

We have 2 doc embeddings, each with a dimensionality of 384.


In [None]:
import chromadb
from chromadb.config import Settings

# oops persist to Colab
chroma_client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=DA.paths.user_db # this is an optional argument. If you don't supply this, the data will be ephemeral
))

# If the collection already exists, delete it first so we start from a clean slate
if collection_name in [c.name for c in chroma_client.list_collections()]:
    chroma_client.delete_collection(name=collection_name)

# Always (re)create the collection so `talks_collection` is defined either way
print(f"Creating collection: '{collection_name}'")
talks_collection = chroma_client.create_collection(name=collection_name)

talks_collection.add(
    documents=<FILL_IN>,
    ids=<FILL_IN>
)

In [None]:
import os

# set it here (for now) - in future use secrets; never commit a real key
os.environ['PINECONE_API_KEY'] = 'YOUR_PINECONE_API_KEY'
os.environ['PINECONE_ENVIRONMENT'] = 'us-west4-gcp'


## Building the Vector Index

We now need to use the embedding pipeline to build our embeddings and store them in a Pinecone vector index. To begin, we'll initialize our index; for this we'll need a [free Pinecone API key](https://app.pinecone.io/).

In [None]:
import os
import pinecone

# get API key from app.pinecone.io and environment from console
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY',
    environment=os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENV'
)

Now we initialize the index.

In [None]:
import time

index_name = 'llama-2-rag'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)
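
The index above is created with `metric='cosine'`, so stored vectors are ranked by cosine similarity to the query vector. As a rough illustration of what that score measures, here is a minimal pure-Python sketch; the three-dimensional vectors and the `doc-*` names are made up for the example (the real embeddings have 384 dimensions), and this is not how Pinecone computes it internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values for illustration only)
stored = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [1.0, 1.0, 0.0],
}
query = [2.0, 0.0, 0.0]

# Rank stored vectors by similarity to the query, highest first
ranked = sorted(stored, key=lambda k: cosine_similarity(query, stored[k]), reverse=True)
print(ranked)  # ['doc-a', 'doc-c', 'doc-b']
```

Note that `query` is twice as long as `doc-a` but still scores 1.0 against it: cosine similarity depends only on direction, not magnitude.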

Now we connect to the index:

In [None]:
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838}

With our index and embedding process ready we can move on to the indexing process itself. For that, we'll need a dataset. We will use a set of arXiv papers related to (and including) the Llama 2 research paper.

In [None]:
from datasets import load_dataset

data = load_dataset(
    'jamescalam/llama-2-arxiv-papers-chunked',
    split='train'
)
data


Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 4838
})

We will embed and index the documents like so:

In [None]:
data = data.to_pandas()

batch_size = 32

for i in range(0, len(data), batch_size):
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    ids = [f"{x['doi']}-{x['chunk-id']}" for i, x in batch.iterrows()]
    texts = [x['chunk'] for i, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'],
         'source': x['source'],
         'title': x['title']} for i, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))
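
The loop above walks the dataset in fixed-size slices, with `min(...)` clamping the final slice to the end of the data. A small stand-alone sketch of just that index arithmetic, using the 4838 rows and batch size of 32 from this notebook (`batch_bounds` is a hypothetical helper name, not part of any library):

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, end) index pairs covering range(n_rows) in steps of batch_size."""
    for start in range(0, n_rows, batch_size):
        yield start, min(n_rows, start + batch_size)

bounds = list(batch_bounds(4838, 32))
print(len(bounds))   # 152 batches, i.e. 152 upsert calls
print(bounds[-1])    # (4832, 4838): the final batch holds only 6 rows
```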

In [None]:
import pandas as pd

# show full chunk text instead of truncating long columns
pd.set_option('display.max_colwidth', None)
print(data['chunk'][0:2])


0                                                                                                                                                                                                                         High-Performance Neural Networks\nfor Visual Object Classication\nDan C. Cire san, Ueli Meier, Jonathan Masci,\nLuca M. Gambardella and J urgen Schmidhuber\nTechnical Report No. IDSIA-01-11\nJanuary 2011\nIDSIA / USI-SUPSI\nDalle Molle Institute for Articial Intelligence\nGalleria 2, 6928 Manno, Switzerland\nIDSIA is a joint institute of both University of Lugano (USI) and University of Applied Sciences of Southern Switzerland (SUPSI),\nand was founded in 1988 by the Dalle Molle Foundation which promoted quality of life.\nThis work was partially supported by the Swiss Commission for Technology and Innovation (CTI), Project n. 9688.1 IFF:\nIntelligent Fill in Form.arXiv:1102.0183v1  [cs.AI]  1 Feb 2011\nTechnical Report No. IDSIA-01-11 1\nHigh-Performance Neural Networ

In [None]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838}

## Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires two things that we must initialize first:

* An LLM, in this case `meta-llama/Llama-2-13b-chat-hf` (or one of the alternatives assigned to `model_id` in the code below).

* The respective tokenizer for the model.

We'll explain these as we get to them. Let's begin with our model.

We initialize the model and load it onto our CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5-10 minutes.

In [None]:
from torch import cuda, bfloat16
import transformers
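# Candidate models: only the last uncommented assignment takes effect.
# The meta-llama models need approved access (requested - sit tight!)
# model_id = 'meta-llama/Llama-2-13b-chat-hf'
# model_id = 'meta-llama/Llama-2-7b-chat-hf'
model_id = 'ehartford/WizardLM-13B-V1.0-Uncensored'  # this one is faster - pretty dope!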

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
# (placeholder: supply your own Hugging Face access token; never commit a real one)
hf_auth = 'YOUR_HF_AUTH_TOKEN'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")

# 4 m A100
# 14 m V100!


Model loaded on cuda:0
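
To see why the 4-bit `BitsAndBytesConfig` matters, it can help to estimate how much memory the weights alone would need at different precisions. The sketch below is back-of-the-envelope arithmetic only: it assumes a nominal 13 billion parameters and ignores activations, the KV cache, and the small overhead of quantization constants, so real usage will be somewhat higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 13e9  # nominal parameter count for a "13B" model (assumption)
for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions the weights shrink from roughly 26 GB in fp16 to roughly 6.5 GB at 4 bits, which is what makes a 13B model plausible on a single Colab GPU.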


The pipeline requires a tokenizer, which handles the translation of human-readable plaintext into LLM-readable token IDs. A model must be paired with the tokenizer it was trained with, so we load the tokenizer from the same `model_id`:

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)


Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [None]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.0,  # 'randomness' of outputs; 0.0 is the min, higher values are more random
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
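
To make the `temperature` and `repetition_penalty` comments above more concrete, here is a toy sketch of how such parameters typically act on a model's raw scores (logits). It is a simplified illustration on a made-up three-token vocabulary, not the transformers implementation. The sketch needs a temperature above zero; a temperature of 0.0 is usually treated as greedy decoding, where the highest-scoring token is always picked.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Down-weight already-generated tokens: divide positive logits, multiply negative ones."""
    out = list(logits)
    for t in set(seen_token_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

logits = [2.0, 1.0, 0.5]  # toy scores for a 3-token vocabulary (hypothetical)
print(softmax_with_temperature(logits, 1.0))   # fairly spread out
print(softmax_with_temperature(logits, 0.1))   # nearly all mass on token 0
print(apply_repetition_penalty(logits, seen_token_ids=[0], penalty=1.1))
```

With a penalty of 1.1, a token that has already appeared has its score nudged down slightly each step, which is usually enough to break the loops mentioned in the comment.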

Confirm this is working:

In [None]:
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])

# 35 s woah (13b)

# 35 on 7 also OH WELLL!

Explain to me the difference between nuclear fission and fusion.

Nuclear Fission: The splitting of a nucleus into two or more fragments, releasing energy in the form of heat and radiation. This process is used in nuclear reactors to generate electricity.

Nuclear Fusion: The joining of two or more nuclei to form a heavier nucleus, releasing energy in the form of heat and radiation. This process is used in hydrogen bombs and experimental fusion reactors to generate energy.


In [None]:
res = generate_text("Explain to me how a miserable jerk like Elon Musk got so wealthy.")
print(res[0]["generated_text"])

# 35 s woah (13b)

# 35 on 7 also OH WELLL!

# uncensored 13 b is about 10 seconds. whoop whoop!

Explain to me how a miserable jerk like Elon Musk got so wealthy.

He's not particularly smart, he doesn't have any special skills or talents, and he's not particularly charismatic or likable. He's just a guy who started a company that makes electric cars and rockets.

So how did he get so rich? Was it because of his business acumen? His ability to raise capital? His connections in the tech industry? Or was it just dumb luck?


In [None]:
res = generate_text("South Africa has a history of racism, can they be forgiven for their past?")
print(res[0]["generated_text"])

South Africa has a history of racism, can they be forgiven for their past?

The answer to this question is complex and multifaceted. On the one hand, it is true that South Africa has a long and painful history of racism and oppression, particularly during the apartheid era. The country's legacy of segregation and discrimination against black South Africans has left deep scars on society, and many people continue to suffer from its effects today.

However, it is also important to recognize that South Africa has made significant strides towards reconciliation and healing in recent years. The Truth and Reconciliation Commission (TRC), established by Nelson Mandela's government in 1995, was a groundbreaking effort to address the crimes committed under apartheid and promote national unity. The TRC provided a forum for victims and perpetrators to tell their stories and seek closure, and it helped to lay the foundation for a more just and equitable society.

Moreover, South Africa has made gr

Now to implement this in LangChain:

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

In [None]:
llm(prompt="Explain to me the difference between wiggers and rednecks.")

'\n\nWiggers are people who try to emulate black culture, often by adopting hip-hop fashion or slang. They may also listen to rap music or other genres associated with African American culture. Rednecks, on the other hand, are typically rural white Americans who identify with a working class lifestyle and have certain cultural practices that set them apart from mainstream society. These can include hunting, fishing, and attending country music concerts. While both groups may be seen as outsiders in some circles, wiggers are often criticized for appropriating black culture without understanding its historical context or significance, while rednecks are more likely to be celebrated for their traditional values and way of life.'

In [None]:
llm(prompt="Here are five reasons Bloominglabs founder and Data Scientist Steve Charlesworth is one awesome dude:")



"\n1. He's a data scientist, which means he knows how to use math and science to solve problems.\n2. He founded Bloominglabs, a hackerspace in Bloomington, Indiana where people can come together to learn, create, and invent new things.\n3. He's passionate about helping others learn and grow, which is why he started the Bloominglabs Data Science Meetup group.\n4. He's also an advocate for open source software, which means he believes that knowledge should be shared freely with others.\n5. And finally, he's a great guy who always has a positive attitude and is willing to help others whenever they need it."

We get the same kind of output, as we're not really doing anything differently here, but **Llama 2 13B Chat** is now wrapped as a LangChain LLM. With this we can begin using LangChain's advanced agent tooling, chains, etc., with **Llama 2**.

## Initializing a RetrievalQA Chain

For **R**etrieval **A**ugmented **G**eneration (RAG) in LangChain we need to initialize either a `RetrievalQA` or `RetrievalQAWithSourcesChain` object. For both of these we need an `llm` (which we have initialized) and a Pinecone index — but initialized within a LangChain vector store object.

Let's begin by initializing the LangChain vector store. We do it like so:

In [None]:
from langchain.vectorstores import Pinecone

text_field = 'text'  # field in metadata that contains text content

vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)

We can confirm this works like so:

In [None]:
query = 'what makes llama 2 special?'

vectorstore.similarity_search(
    query,  # the search query
    k=3  # returns top 3 most relevant chunks of text
)

[Document(page_content='Ricardo Lopez-Barquilla, Marc Shedroﬀ, Kelly Michelena, Allie Feinstein, Amit Sangani, Geeta\nChauhan,ChesterHu,CharltonGholson,AnjaKomlenovic,EissaJamil,BrandonSpence,Azadeh\nYazdan, Elisa Garcia Anzano, and Natascha Parks.\n•ChrisMarra,ChayaNayak,JacquelinePan,GeorgeOrlin,EdwardDowling,EstebanArcaute,Philomena Lobo, Eleonora Presani, and Logan Kerr, who provided helpful product and technical organization support.\n46\n•Armand Joulin, Edouard Grave, Guillaume Lample, and Timothee Lacroix, members of the original\nLlama team who helped get this work started.\n•Drew Hamlin, Chantal Mora, and Aran Mun, who gave us some design input on the ﬁgures in the\npaper.\n•Vijai Mohan for the discussions about RLHF that inspired our Figure 20, and his contribution to the\ninternal demo.\n•Earlyreviewersofthispaper,whohelpedusimproveitsquality,includingMikeLewis,JoellePineau,\nLaurens van der Maaten, Jason Weston, and Omer Levy.', metadata={'source': 'http://arxiv.org/pdf/230

Looks good! Now we can put our `vectorstore` and `llm` together to create our RAG pipeline.

In [None]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever()
)
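
The `chain_type='stuff'` setting means the retrieved chunks are simply concatenated ("stuffed") into one prompt together with the question. The sketch below illustrates the idea in plain Python; the template wording, the `build_stuff_prompt` helper, and the example chunks are assumptions for illustration and are not LangChain's exact prompt.

```python
def build_stuff_prompt(question, documents):
    """Concatenate ("stuff") all retrieved chunks into a single prompt for the LLM."""
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

# Hypothetical retrieved chunks, standing in for the retriever's output
chunks = [
    "Llama 2 is a collection of pretrained and fine-tuned LLMs.",
    "The models range in scale from 7 billion to 70 billion parameters.",
]
prompt = build_stuff_prompt("what is so special about llama 2?", chunks)
print(prompt)
```

Because every chunk goes into a single prompt, this approach only works while the combined text fits in the model's context window.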

With the pipeline built, let's ask it a question about Llama 2:

In [None]:
rag_pipeline('what is so special about llama 2?')

{'query': 'what is so special about llama 2?',
 'result': ' Llama 2 is a collection of large language models (LLMs) developed by Facebook AI Research (FAIR). It consists of pretrained and fine-tuned models ranging in scale from 7 billion to 70 billion parameters. The models are optimized for dialogue use cases and outperform open-source chat models on most benchmarks they were tested on. Based on their human evaluations for helpfulness and safety, they may serve as a suitable substitute for closed source models.'}

This looks *much* better! Let's try some more.

In [None]:
llm('what safety measures were used in the development of llama 2?')

"\n\nI'm looking for information on how the developers of Llama 2 ensured the safety of their users during the development process. Specifically, I'm interested in knowing about any safety measures that were implemented to protect users from potential risks or hazards associated with the use of the platform.\n\nHere are some possible answers:\n\n1. The developers of Llama 2 conducted thorough risk assessments to identify and mitigate any potential safety risks associated with the platform. This included identifying potential hazards such as data breaches, cyber attacks, and other security risks, and implementing appropriate safeguards to prevent these risks from occurring.\n2. The platform was designed with user privacy and security in mind, and the developers implemented various measures to protect user data and ensure that it is not compromised. For example, the platform may have implemented encryption techniques to protect user data, or implemented strict access controls to limit wh

Okay, it looks like the LLM with no RAG is less than ideal. Let's stop embarrassing the poor LLM and stick with RAG + LLM, asking the same question to our RAG pipeline.

In [None]:
rag_pipeline('what safety measures were used in the development of llama 2?')

{'query': 'what safety measures were used in the development of llama 2?',
 'result': ' The development of llama 2 included safety measures such as pre-training, fine-tuning, and model safety approaches. Additionally, the authors delayed the release of the 34B model due to a lack of time to sufficiently red team.'}

A reasonable answer from the RAG pipeline, but it doesn't contain much information — maybe we can ask more about this, like what is this _"red team"_ procedure that delayed the launch of the 34B model?

In [None]:
rag_pipeline('what red teaming procedures were followed for llama 2?')

{'query': 'what red teaming procedures were followed for llama 2?',
 'result': " The paper describes the red teaming procedures used for Llama 2. These included creating prompts that might elicit unsafe or undesirable responses from the model, such as those based on sensitive topics or those that could potentially cause harm if the model were to respond inappropriately. The red teaming exercises were performed by a set of experts who evaluated the model's responses and provided feedback on its performance. The paper also mentions that multiple additional rounds of red teaming were performed over several months to measure the robustness of the model as it was released internally."}

Very interesting!

In [None]:
rag_pipeline('how does the performance of llama 2 compare to other local LLMs?')

{'query': 'how does the performance of llama 2 compare to other local LLMs?',
 'result': ' The performance of llama 2 is compared to other local LLMs such as chinchilla and bard in the paper. Specifically, the authors report that llama 2 outperforms these other models on the series of helpfulness and safety benchmarks they tested. Additionally, the authors note that llama 2 appears to be on par with some of the closed-source models, at least on the human evaluations they performed.'}

In [None]:
!pip install auto_gptq


Let's try some of TheBloke's GPTQ-quantized models (13B below, or swap in a 7B variant).

In [None]:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from torch import cuda

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

model_name_or_path = "TheBloke/Llama-2-13B-chat-GPTQ"
model_name_or_path = 'TheBloke/llama2_13b_chat_uncensored-GPTQ' # or 7
#model_name_or_path = 'TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ'

model_basename = "gptq_model-4bit-128g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device=device,
    use_triton=use_triton,
    quantize_config=None
)

# 4 m A100



FileNotFoundError: ignored

In [None]:
import transformers
# too much repetition
# not supported for pipeline??!?!?!?!
input_ids = tokenizer("Doctor do I have the diabetus? Well son", return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.1, max_new_tokens=255)
print(tokenizer.decode(output[0]))
# 20 s

<s> Doctor do I have the diabetus? Well son, I'm afraid you do. nobody's perfect.
I'm not sure if I'm a doctor, but I'm pretty sure you have diabetes.
Doctor, I'm feeling a little under the weather. Can you prescribe me some medicine?
Sure thing. Here's a bottle of aspirin.
Doctor, I'm feeling a little under the weather. Can you prescribe me some medicine?
Sure thing. Here's a bottle of aspirin. And here's a bottle of Tylenol.
Doctor, I'm feeling a little under the weather. Can you prescribe me some medicine?
Sure thing. Here's a bottle of aspirin. And here's a bottle of Tylenol. And here's a bottle of cough syrup.
Doctor, I'm feeling a little under the weather. Can you prescribe me some medicine?
Sure thing. Here's a bottle of aspirin. And here's a bottle of Tylenol. And here's a bottle of cough syrup. And here'


In [None]:
# get 311 stuffs
# https://data.bloomington.in.gov/resource/aw6y-t4ix.json
import pandas as pd

reports_311 = pd.read_csv("https://bloomington.data.socrata.com/api/views/aw6y-t4ix/rows.csv?accessType=DOWNLOAD&bom=true&format=true")
#https://data.bloomington.in.gov/resource/aw6y-t4ix.csv")
reports_311.head()

# get rid of crap

reports_311.shape


(109805, 17)

In [None]:
# oops need ALLLLLLL
reports_311.head()

reports_311 = reports_311[~reports_311['service_name'].isin(['Trash', 'Recycling', 'Excessive Growth',
                                              'Yard Waste', 'Potholes, Other Street Repair',
                                                             'Sidewalk Snow Removal',
                                                             'Parking on Unimproved Surface',
                                                             'Street Snow Removal',
                                                             'Line of Sight',
                                                             'Debris Removal',
                                                             'Drainage or Runoff',
                                                             'Street Trees',
                                                             'Leaf Collection'])]
pd.set_option('display.max_rows', None)

reports_311.value_counts("service_name")
reports_311.shape
# down to 26 K


(25796, 17)

In [None]:
reports_311 = reports_311[(~reports_311['description'].isna()) & (reports_311['description'].str.len() >= 10)]

reports_311.shape


(25244, 17)

In [None]:
reports_311.head()

Unnamed: 0,service_request_id,requested_datetime,updated_datetime,closed_date,status_description,source,service_name,description,agency_responsible,address,city,state,zip,lat,long,Georeference,SLA Days
248,184141,06/09/2023 09:49:07 AM,06/12/2023 02:03:38 PM,06/12/2023 02:03:37 PM,closed,Other,Other,Between this property and the property to the ...,HAND,1726 S Olive ST,Bloomington,IN,47401.0,39.147732,-86.519623,POINT (-86.5196228 39.14773178),
3917,183744,05/09/2023 04:33:50 PM,05/26/2023 08:53:26 AM,05/26/2023 08:53:26 AM,closed,Other,Water Quality,Yesterday my front water faucet was working ju...,Utilities Water Quality,,Bloomington,IN,,,,,
4233,120803,12/17/1996 05:00:00 AM,09/11/2016 10:02:05 PM,12/17/1996 05:00:00 AM,closed,,Other,FIRE HYDRANT LEAKING,,901 E EMINENCE ST,Bloomington,IN,,,,,
4644,120808,12/16/1996 05:00:00 AM,09/11/2016 10:02:05 PM,12/16/1996 05:00:00 AM,closed,,Other,SEWER BACKUP,,324 N JEFFERSON ST,Bloomington,IN,,,,,
5663,122070,05/24/1995 05:00:00 AM,09/11/2016 10:02:05 PM,05/24/1995 05:00:00 AM,closed,,Other,lot west of 1521,,1520 W 8th ST,Bloomington,IN,47404.0,39.169781,-86.552979,POINT (-86.55297852 39.16978073),


In [None]:

# takes a while so only do as needed
import time
import numpy as np
#index_name = 'llama-2-rag'


index_name = 'btown311'
#pinecone.delete_index(index_name)

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

data = reports_311.copy()  # copy so adding columns doesn't modify a filtered view of the original

data['chunk'] = data['description']

data = data.replace({np.nan: None})  # Replace NaNs with None for string columns
data = data.fillna("")  # Replace None and NaN values with an empty string

# try as is come back if we need 'chunks'

index = pinecone.Index(index_name)
index.describe_index_stats()
batch_size = 32
for i in range(0, len(data), batch_size):
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    ids = [f"{x['service_request_id']}" for i, x in batch.iterrows()] # have to add chunk id if we go there
    texts = [x['chunk'] for i, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {
          'requested_datetime': x['requested_datetime'],
          'source': x['source'],
          'service_name': x['service_name'],
          'agency_responsible': x['agency_responsible'],
          'address': x['address'],
          'lat': x['lat'],
          'long': x['long'],
          'description': x['description'][:40000] # circle back
         } for i, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))


# max metadata    40960
# 3 minutes it takes
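
The `[:40000]` slice on `description` counts characters, while the metadata limit noted in the comments is presumably measured in bytes, so a description containing multi-byte characters could still end up larger than expected. Below is a minimal sketch of a byte-aware truncation, assuming UTF-8 encoding. The names `truncate_utf8` and `max_bytes` are hypothetical helpers for illustration, not part of the Pinecone client.

```python
def truncate_utf8(text: str, max_bytes: int = 40000) -> str:
    """Cut text so that its UTF-8 encoding is at most max_bytes long."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character that the cut split in half
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# "é" takes two bytes in UTF-8, so only two whole characters fit into 5 bytes
print(truncate_utf8("é" * 10, max_bytes=5))
```

Since the limit most likely applies to the whole metadata dictionary rather than to one field, it may be worth leaving some headroom below the limit for the other fields.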

In [None]:
index = pinecone.Index(index_name)
index.describe_index_stats()

# clear index (if needed)


{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 25244}},
 'total_vector_count': 25244}

In [None]:
from langchain.vectorstores import Pinecone

text_field = 'description'  # field in metadata that contains text content

vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)

query = 'what are some problems with homeless people doing drugs?'

vectorstore.similarity_search(
    query,  # the search query
    k=30  # return the 30 most relevant chunks of text
)

[Document(page_content="concerned property is being used by homeless and for drugs, claims it's unsafe", metadata={'address': '2431 S Bryan ST', 'agency_responsible': 'HAND', 'lat': 39.1396637, 'long': -86.53620148, 'requested_datetime': datetime.datetime(2014, 10, 16, 13, 52, 39), 'service_name': 'Unsafe Buildings', 'source': 'Phone Call'}),
 Document(page_content='Unsecured home; possible evidence of homeless use, and drug use.', metadata={'address': '512 W 16th ST', 'agency_responsible': '', 'lat': 39.17791748, 'long': -86.53884888, 'requested_datetime': datetime.datetime(2002, 6, 25, 5, 0), 'service_name': 'Unsafe Buildings', 'source': 'Phone Call'}),
 Document(page_content='I want to discuss the Homelessness problem. Please call me back.', metadata={'address': '302 S College AVE', 'agency_responsible': 'Mayors Office', 'lat': 39.16439819, 'long': -86.53553009, 'requested_datetime': datetime.datetime(2022, 8, 30, 11, 33, 39), 'service_name': 'Other', 'source': 'Other'}),
 Document(
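
To make the ranking above less of a black box, here is a rough sketch of what a cosine-similarity search does conceptually: score every stored vector against the query vector and keep the `k` best. This is a brute-force illustration in plain Python with toy 2-dimensional vectors. The helper names `cosine` and `top_k` are hypothetical, and a hosted index such as Pinecone presumably uses a faster approximate search rather than this loop.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# toy "embeddings"; the index in this notebook stores 384-dimensional vectors
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))  # → [0, 2]
```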

In [None]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever()
)
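
The `'stuff'` chain type takes its name from the way it works: all retrieved documents are "stuffed" into a single prompt, followed by the question. The sketch below illustrates that idea only. The template wording and the `build_stuff_prompt` helper are assumptions made for illustration, and LangChain's actual default prompt is worded differently.

```python
# Hypothetical template; LangChain's real default prompt text differs.
TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(question, texts):
    """Join all retrieved texts into one context block and fill the template."""
    context = "\n\n".join(texts)
    return TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    "what are some concerns about homelessness?",
    [
        "Unsecured home; possible evidence of homeless use, and drug use.",
        "I want to discuss the Homelessness problem. Please call me back.",
    ],
)
print(prompt)
```

Because everything goes into one prompt, the number of retrieved documents is bounded by the model's context window. The number of documents can be adjusted when creating the retriever, for example with `vectorstore.as_retriever(search_kwargs={'k': 3})`.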

In [None]:
rag_pipeline('what are some concerns about homelessness?')



{'query': 'what are some concerns about homelessness?',
 'result': ' Some concerns about homelessness include safety issues related to unsecured properties, potential for drug use or other criminal activity, and the impact on neighborhoods and communities. Additionally, homelessness can be a symptom of larger social and economic problems such as poverty, lack of affordable housing, mental health issues, and substance abuse.'}
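
To check which reports an answer like this is based on, `RetrievalQA.from_chain_type` accepts `return_source_documents=True`, which adds a `source_documents` list to the result dictionary. The sketch below shows one way such a result could be summarized. `summarize_sources` is a hypothetical helper, and `fake_result` is a hand-built stand-in (using a report from the search output earlier) so the example runs without the LLM or the index.

```python
from types import SimpleNamespace


def summarize_sources(result, max_chars=80):
    """List the retrieved reports behind an answer as short one-line strings."""
    lines = []
    for doc in result.get("source_documents", []):
        address = doc.metadata.get("address") or "no address"
        lines.append(f"{address}: {doc.page_content[:max_chars]}")
    return lines


# Stand-in for a chain result; a real one would come from a RetrievalQA chain
# created with return_source_documents=True.
fake_result = {
    "result": "...",
    "source_documents": [
        SimpleNamespace(
            page_content="Unsecured home; possible evidence of homeless use, and drug use.",
            metadata={"address": "512 W 16th ST"},
        )
    ],
}

for line in summarize_sources(fake_result):
    print(line)
```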

In [None]:
rag_pipeline('do people like the mayor?')

{'query': 'do people like the mayor?',
 'result': ' It seems that there are mixed opinions about the mayor among the commenters. Some seem to support the mayor while others express criticism or even anger towards them.'}

In [None]:
rag_pipeline('what are some criticisms of the mayor?')

{'query': 'what are some criticisms of the mayor?',
 'result': ' The caller has expressed criticism of the mayor regarding their perception of harassment or unfair treatment by IUPD, as well as asking about running for mayor during an election year. Additionally, there was a mention of the Banneker Center and concerns about racial issues. Finally, one caller suggested that the mayor should resign due to his anti-police rhetoric and discriminatory behavior towards political opponents.'}

In [None]:
rag_pipeline('where are some places people have felt threatened by homeless people? How about near an ATM?')

{'query': 'where are some places people have felt threatened by homeless people? How about near an ATM?',
 'result': ' In recent years, there have been reports of homeless individuals causing disruptions and creating safety concerns in various areas across the United States. One such location where people have reported feeling threatened by homeless individuals is near ATMs. For example, in Bloomington, Indiana, homeless residents have entered and occupied the ATM vestibule, which has caused intimidation for both bank staff and bank customers using the ATM. Additionally, in other cities, homeless individuals have been seen loitering near ATMs, which can create a sense of unease and fear among those who use the machines.'}

In [None]:
rag_pipeline('are some of the reports racist?')

{'query': 'are some of the reports racist?',
 'result': ' Yes, some of the reports contain racist language and sentiments. The first report contains derogatory language towards members of the LGBTQ community, while the second report expresses hostility towards Black Lives Matter and accuses them of being a terrorist organization.'}

In [None]:
rag_pipeline('what are some complaints involving drugs and drug usage?')

{'query': 'what are some complaints involving drugs and drug usage?',
 'result': ' The complaints involve drug use, drug sales, drug trafficking, drug possession, drug paraphernalia, and drug-related crimes such as assault, robbery, and burglary.'}

In [None]:
rag_pipeline('cite specific complaints about drug use and activity?')

{'query': 'cite specific complaints about drug use and activity?',
 'result': ' The specific complaints about drug use and activity include:\n\n* Open drug use at B and T Park\n* Threats made towards a baby sitter who asked them to put their drugs away\n* Strong marijuana smell and people drinking alcohol under the Grimes and Morton Bridge\n* Excessive trash and "illegal camping" or "residence" due to the hundreds of calls and complaints received.'}

In [None]:
rag_pipeline('is seminary park mentioned at all? what do they say?')

{'query': 'is seminary park mentioned at all? what do they say?',
 'result': ' The article mentions Seminary Park as one of the locations where the homeless camp has grown over the past year. However, it does not provide any specific information on what actions the city is taking to address the situation.'}

In [None]:
rag_pipeline('is kroger mentioned? cite specific complaints.')

{'query': 'is kroger mentioned? cite specific complaints.',
 'result': ' Yes, Kroger is mentioned in the context provided. The specific complaint is not explicitly stated, but it can be inferred from the fact that there are multiple temporary signs on site and that the B-line side of Kroger is mentioned. It is possible that these signs were put up as a result of customer complaints or concerns about something related to Kroger.'}

In [None]:
rag_pipeline('what are some specific comments about the B-line? What kinds of problems have people had?')

{'query': 'what are some specific comments about the B-line? What kinds of problems have people had?',
 'result': ' The B-Line has been used by many people for walking, jogging, and cycling. However, there have been some issues reported regarding its safety and cleanliness. Some people have mentioned that they have seen city vehicles driving on the B-Line, which is not allowed. Others have expressed concerns about homeless individuals living under the bridge and the lack of mask-wearing in close proximity. Additionally, graffiti has been observed on the B-Line.'}

In [None]:
rag_pipeline('describe some of the graffiti that has been reported. what does it say or show?')



{'query': 'describe some of the graffiti that has been reported. what does it say or show?',
 'result': ' The graffiti reported includes written graffiti on a wall, graffiti on a building, additional graffiti on the south and east sides of the Metropolitan Reporting building, and vulgar graffiti on the sidewalk. However, without more specific information about the content of the graffiti, it is difficult to provide a detailed description.'}

In [None]:
rag_pipeline('what are some library related comments and incidents? report the most violent ones.')

{'query': 'what are some library related comments and incidents? report the most violent ones.',
 'result': " There have been several library-related incidents reported in the area. Here are some of the most violent ones:\n\n1. Obscene graffiti on the ground at the corner of the used bookstore and alleyway. This has been going on all day with pedestrians having to avoid the public library.\n2. A couple was seen hanging around the library. The man is 6 feet tall, bald, with a graying head, thin build, and a mean look. He was throwing and breaking glass booze bottles. The woman with him had a bruised face and was being abused.\n3. Aggressive homeless people were seen near the children's playground. Some men were exposing themselves, and there was a homeless camp with 4-5 tents. They were also chopping down trees and burning them."}

In [None]:
rag_pipeline('what are the scariest and angriest reports, cite some and give location info.')

{'query': 'what are the scariest and angriest reports, cite some and give location info.',
 'result': ' The scariest and angriest reports in Elm Heights are:\n\n1. "The Sidewalk Flooded": This report describes how the sidewalk in front of 2210 floods whenever it rains, causing accidents like the one where your mother fell and broke her leg. The location of this report is unknown.\n2. "Dangerous Crosswalks on 17th Street": This report highlights the dangerous conditions on 17th Street due to poor lighting and pedestrians wearing earbuds, leading to near-miss accidents. The location of this report is also unknown.\n3. "Looking for City Statistics": This report provides links to various websites with information about the city, including housing, cost of living, crime, and growth. The location of this report is unknown.\n4. "There is a Dead Deer on S Highland": This report describes a dead deer found on S Highland in Elm Heights. The location of this report is S Highland just south of Max

In [None]:
rag_pipeline('what are the angriest comments from the index? cite some and quote the foulest language like shit piss and fuck')

{'query': 'what are the angriest comments from the index? cite some and quote the foulest language like shit piss and fuck',
 'result': ' The angriest comments from the index are:\n\n* "I was issued an unsigned ticket with an unprofessional comment using slang language from an officer."\n* "Extremely offensive and very public graffiti on the sidewalk on the bridge over the Renwick Trail. It reads: \'Fuck Jews.\'"'}

In [None]:
rag_pipeline('what is going on at Cook Medical car show? only use the provided text.')

{'query': 'what is going on at Cook Medical  car show? only use the provided text.',
 'result': ' It appears that Cook Medical hosted a car show and used all of the disabled parking spots, coning them off for show cars to use. The company also directed people who worked within the facility and had disabled parking permits to park in the very rear of the parking lot.'}

In [None]:
rag_pipeline('what is going on with gang activity? Cite specific cases.')

{'query': 'what is going on with gang activity? Cite specific cases.',
 'result': ' It seems like there have been incidents related to drug crimes and gang activity in the area. The councilman mentioned that he rented a building on Hillside by the railroad tracks, and last night an employee caught some kids breaking into a car and pulling a gun on them. This was reported to the police, and the citizen wants more police patrols and presence in the area because it has gang problems. Additionally, the citizen wants lights installed along Hillside by their parking lot.'}

In [None]:
rag_pipeline('what kind of trouble are dogs causing? Cite specific cases.')

{'query': 'what kind of trouble are dogs causing? Cite specific cases.',
 'result': " The dogs are causing problems by roaming freely in the neighborhood, potentially posing a danger to themselves and others. In addition, the dogs are causing a nuisance by leaving waste on the neighbor's porch, which falls onto the reporting party's porch. Furthermore, the dogs have entered the reporting party's home without permission, creating a potential safety hazard."}

In [None]:
rag_pipeline('is Switchyard Park mentioned anywhere? What is happening there?')

{'query': 'is Switchyard Park mentioned anywhere? What is happening there?',
 'result': " Yes, Switchyard Park is mentioned in the context provided. The park is undergoing construction and is expected to be completed in November. There is no map available for the park, and there is no information about rental options or the Twin Lakes clubhouse on the city's website. Additionally, the skate park is closed due to construction, and there are no lights in the skate park after dark."}