<a href="https://colab.research.google.com/github/socd06/databricks-language-hackathon/blob/main/June3_Dolly_3B_QA_Chain_Multi_Doc_Retriever_WIP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers accelerate einops Xformers langchain InstructorEmbedding sentence-transformers chromadb


In [2]:
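import locale
# Colab workaround: report UTF-8 as the preferred encoding so shell commands such as !pip do not fail
locale.getpreferredencoding = lambda: "UTF-8"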

!pip install unstructured pandoc


In [3]:
# Reference: https://huggingface.co/databricks/dolly-v2-3b
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline
import torch

from transformers import pipeline, AutoTokenizer

model_name = "databricks/dolly-v2-3b" # smallest Dolly v2 model, for faster inference; dolly-v2-7b and dolly-v2-12b are larger alternatives

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
generate_text = pipeline("text-generation",
                         model=model_name, 
                         tokenizer=tokenizer,
                         torch_dtype=torch.bfloat16, 
                         trust_remote_code=True, 
                         device_map="auto",
                         return_full_text=True, 
                         max_new_tokens=128, 
                         top_p=0.95, top_k=50)

# template for an instruction with no input
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="{instruction}")

hf_pipeline = HuggingFacePipeline(pipeline=generate_text)

llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)

# Test LLM Chain

In [4]:
question = 'Who was Dolly the sheep?'
llm_chain.run(question)

"\nDolly the sheep was a sheep who was the world's first successful cloned animal.  She was also the only sheep to be cloned.  She was produced using an advanced method of cloning called nuclear transfer.  This method resulted in a embryo that was genetically identical to a full grown sheep."
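Conceptually, the chain above does two things: it fills the prompt template with the question, then passes the resulting text to the model. The sketch below illustrates that flow with plain Python. `fake_llm` and `run_chain` are hypothetical stand-ins, not LangChain classes, and no real model is called.

```python
# Hypothetical stand-ins for illustration only; not LangChain's implementation.
def fake_llm(prompt_text: str) -> str:
    """Placeholder for the Dolly pipeline: reports the prompt it received."""
    return f"[model output for: {prompt_text}]"


def run_chain(template: str, llm, **variables) -> str:
    """Fill the template with the given variables, then call the model on the result."""
    prompt_text = template.format(**variables)
    return llm(prompt_text)


answer = run_chain("{instruction}", fake_llm, instruction="Who was Dolly the sheep?")
print(answer)
```

Because the template here is just `{instruction}`, the model sees the question unchanged; a richer template would wrap it in extra instructions.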

# Use a Transformers agent tool to summarize text

In [5]:
from transformers import load_tool
summarizer = load_tool('summarization')

In [6]:
corpus = """dolly-v2-12b Model Card
Summary
Databricks’ dolly-v2-12b, an instruction-following large language model trained on the Databricks machine learning platform that is licensed for commercial use. Based on pythia-12b, Dolly is trained on ~15k instruction/response fine tuning records databricks-dolly-15k generated by Databricks employees in capability domains from the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA and summarization. dolly-v2-12b is not a state-of-the-art model, but does exhibit surprisingly high quality instruction following behavior not characteristic of the foundation model on which it is based.

Dolly v2 is also available in these smaller models sizes:

dolly-v2-7b, a 6.9 billion parameter based on pythia-6.9b
dolly-v2-3b, a 2.8 billion parameter based on pythia-2.8b
Please refer to the dolly GitHub repo for tips on running inference for various GPU configurations.

Owner: Databricks, Inc.

Model Overview
dolly-v2-12b is a 12 billion parameter causal language model created by Databricks that is derived from EleutherAI’s Pythia-12b and fine-tuned on a ~15K record instruction corpus generated by Databricks employees and released under a permissive license (CC-BY-SA)
"""

In [7]:
summary = summarizer(corpus)
summary



"Dolly-v2-12b is an instruction-following large language model trained on the Databricks machine learning platform. It is based on EleutherAI’s Pythia-12B and fine-tuned on a 15K record instruction corpus. It's not a state-of-the-art model, but does exhibit surprisingly high quality instruction following behavior."

In [8]:
hf_summary = pipeline("summarization", model="knkarthick/MEETING_SUMMARY")

# Prepare Documents


In [24]:
!pip install pypdf


In [27]:
import os
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

In [29]:
# Load and process the PDF files
loader = DirectoryLoader('/content/docs/', glob="./*.pdf", loader_cls=PyPDFLoader)

documents = loader.load()

In [33]:
# Verify we were able to load the docs
documents[0:2]

[Document(page_content='  \n \n  \n   \n  BLUEPRINT FOR AN \nAI B ILL OF \nRIGHTS \nMAKING AUTOMATED \nSYSTEMS WORK FOR \nTHE AMERICAN PEOPLE \nOCTOBER 2022 \n', metadata={'source': '/content/docs/Blueprint-for-an-AI-Bill-of-Rights.pdf', 'page': 0}),
 Document(page_content=' \n \n  \n  \n \n \n \n \n \n \n \n About this Document \nThe Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People was \npublished by the White House Office of Science and Technology Policy in October 2022. This framework was \nreleased one year after OSTP announced  the launch of a process to develop “a bill of rights for an AI-powered \nworld.” Its release follows a year of public engagement to inform this initiative. The framework is available \nonline at: https://www.whitehouse.gov/ostp/ai-bill-of-rights \nAbout the Office of Science and Technology Policy \nThe Office of Science and Technology Policy (OSTP)  was established by the National Science and Technology  \nPolicy, Or

In [34]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
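To give a feel for what `chunk_size=1000` and `chunk_overlap=200` mean, here is a minimal sliding-window sketch in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which this sketch ignores. `split_with_overlap` is a hypothetical helper name. Note that a splitter only takes effect once it is applied to the loaded pages, for example with `text_splitter.split_documents(documents)`.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Toy input: 2500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.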

# Download HuggingFace Embeddings
Check the [MTEB English Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) to pick an embedding model with good retrieval performance.

In [35]:
# Choose one of the top performers from the MTEB English Leaderboard

from langchain.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings

# Ranked #2 on the MTEB Retrieval task (June 2023) among models under ~500 MB
model_name = "intfloat/e5-base-v2" 

hf = HuggingFaceEmbeddings(model_name=model_name)
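One detail worth knowing about the E5 family: its model card recommends prefixing each input with `query: ` or `passage: ` so the model can tell questions from documents. The notebook does not do this, and its effect on these results was not measured. The sketch below shows only the string preparation; `add_e5_prefix` is a hypothetical helper, and wiring it into the embedding step is left as an assumption.

```python
def add_e5_prefix(texts, kind):
    """Prepend the E5-style role prefix ('query' or 'passage') to each text."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text}" for text in texts]


print(add_e5_prefix(["What is the AI Bill of Rights?"], "query"))
print(add_e5_prefix(["The Blueprint for an AI Bill of Rights is a set of five principles."], "passage"))
```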

# Make a Vector Database

In [36]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

## Here are the new embeddings being used
embedding = hf 

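# Note: this embeds the full pages in `documents`; the text_splitter defined earlier is not applied here
vectordb = Chroma.from_documents(documents=documents, 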
                                 embedding=embedding,
                                 persist_directory=persist_directory)

# Make a retriever

In [37]:
retriever = vectordb.as_retriever()
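A retriever of this kind embeds the query and returns the stored documents whose vectors are closest to it. The toy sketch below shows the idea with hand-written 3-dimensional vectors and cosine similarity. Both are assumptions for illustration: real embeddings have hundreds of dimensions, and the distance metric Chroma actually uses depends on its configuration. `top_k` and `toy_index` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the names of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up vectors standing in for embedded pages
toy_index = {
    "page_a": [1.0, 0.0, 0.0],
    "page_b": [0.9, 0.1, 0.0],
    "page_c": [0.0, 1.0, 0.0],
    "page_d": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.0, 0.1], toy_index, k=2))  # ['page_a', 'page_b']
```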

In [38]:
docs = retriever.get_relevant_documents("What is the AI BILL OF RIGHTS?")

# Test the basic retriever

In [39]:
docs[0]

Document(page_content=' \n  ABOUT THIS  FRAMEWORK\nThe Blueprint for an AI Bill of Rights is a set of five principles and associated practices to help guide the \ndesign, use, and deployment of automated systems to protect the rights of the American public in the age of \nartificial intel-ligence. Developed through extensive consultation with the American public, these principles are \na blueprint for building and deploying automated systems that are aligned with democratic values and protect \ncivil rights, civil liberties, and privacy. The Blueprint for an AI Bill of Rights includes this Foreword, the five \nprinciples, notes on Applying the The Blueprint for an AI Bill of Rights, and a Technical Companion that gives \nconcrete steps that can be taken by many kinds of organizations—from governments at all levels to companies of \nall sizes—to uphold these values. Experts from across the private sector, governments, and international \nconsortia have published principles and framework

In [42]:
docs = retriever.get_relevant_documents("What does AI RMF mean?")

In [43]:
docs[0]

Document(page_content='NIST AI 100-1 AI RMF 1.0\nWhen applying the AI RMF, risks which the organization determines to be highest for the\nAI systems within a given context of use call for the most urgent prioritization and most\nthorough risk management process. In cases where an AI system presents unacceptable\nnegative risk levels – such as where significant negative impacts are imminent, severe harms\nare actually occurring, or catastrophic risks are present – development and deployment\nshould cease in a safe manner until risks can be sufficiently managed. If an AI system’s\ndevelopment, deployment, and use cases are found to be low-risk in a specific context, that\nmay suggest potentially lower prioritization.\nRisk prioritization may differ between AI systems that are designed or deployed to directly\ninteract with humans as compared to AI systems that are not. Higher initial prioritization\nmay be called for in settings where the AI system is trained on large datasets comprised 

# Make a proper Question Retrieval chain

In [52]:
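# use map_reduce so each retrieved document is sent to the LLM in its own prompt and the partial answers are combined afterwards;
# this keeps individual prompts shorter and helps avoid token-length errors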
qa_chain = RetrievalQA.from_chain_type(llm=hf_pipeline, 
                                  chain_type="map_reduce", 
                                  retriever=retriever, 
                                  return_source_documents=True)

qa_chain.combine_documents_chain.llm_chain.prompt.template = '''
You are an AI ethicist. 
Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Always answer with unbiased, ethical and safe advice.
----------------
{context}

Question: {question}
Helpful Answer:'''
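The map_reduce strategy can be pictured as two stages: a "map" stage that asks the question against each retrieved document separately, and a "reduce" stage that asks it once more over the collected partial answers. The sketch below mimics that control flow with a stub model. `map_reduce_answer` and `counting_llm` are hypothetical, and the prompt wording is invented for illustration, not LangChain's actual prompts.

```python
def map_reduce_answer(question, documents, llm):
    """Ask the question per document (map), then once over the partial answers (reduce)."""
    partial_answers = [
        llm(f"Context: {doc}\nQuestion: {question}") for doc in documents
    ]
    combined = "\n".join(partial_answers)
    return llm(f"Summaries: {combined}\nQuestion: {question}")


calls = []  # records every prompt the stub model receives


def counting_llm(prompt_text):
    """Stub model: remembers the prompt and returns a numbered placeholder answer."""
    calls.append(prompt_text)
    return f"answer{len(calls)}"


final = map_reduce_answer(
    "What is the AI Bill of Rights?",
    ["doc one", "doc two", "doc three"],
    counting_llm,
)
print(len(calls), final)  # 4 answer4
```

With three documents the model is called four times, so this approach trades extra inference time for shorter individual prompts.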

In [53]:
def trim_string(input_string):
    input_string = str(input_string)
    trim_index = input_string.find("### Human:")
    if trim_index != -1:  # If the phrase is found
        return input_string[:trim_index]
    else:
        return input_string  # If the phrase isn't found, return the original string
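As a quick check of the trimming idea, the sketch below applies an equivalent helper to a made-up response. `trim_at_marker` is a hypothetical variant written with `str.partition`. Whether Dolly actually emits a `### Human:` marker is not confirmed by this notebook, so the sample text is invented.

```python
def trim_at_marker(text, marker="### Human:"):
    """Return everything before the first occurrence of marker (or all of it if absent)."""
    head, _, _ = str(text).partition(marker)
    return head


# Invented example of a response that runs on into a simulated next turn
raw = "The framework has five principles.\n### Human: next question"
print(repr(trim_at_marker(raw)))  # 'The framework has five principles.\n'
print(repr(trim_at_marker("No marker here.")))  # 'No marker here.'
```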

In [54]:
## Cite sources

import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)
    print(f'Wrapped Text is: {len(wrapped_text)} chars long')
    return wrapped_text

def process_llm_response(llm_response):
    temp_resp = wrap_text_preserve_newlines(llm_response['result'])
    temp_resp = trim_string(temp_resp)
    print(temp_resp)
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
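When several retrieved chunks come from the same file, printing each source path verbatim produces repeated lines. One possible refinement, sketched below, groups page numbers under each path using the `source` and `page` metadata keys that `PyPDFLoader` attaches. `summarize_sources` is a hypothetical helper, and the page numbers in the example are made up.

```python
def summarize_sources(metadata_list):
    """Group page numbers by source path, dropping duplicate entries."""
    pages_by_source = {}
    for meta in metadata_list:
        pages = pages_by_source.setdefault(meta["source"], set())
        if "page" in meta:
            pages.add(meta["page"])
    return {source: sorted(pages) for source, pages in pages_by_source.items()}


# Made-up metadata resembling what the retrieved documents carry
example_metadata = [
    {"source": "/content/docs/Blueprint-for-an-AI-Bill-of-Rights.pdf", "page": 3},
    {"source": "/content/docs/Blueprint-for-an-AI-Bill-of-Rights.pdf", "page": 0},
    {"source": "/content/docs/Blueprint-for-an-AI-Bill-of-Rights.pdf", "page": 3},
]
for source, pages in summarize_sources(example_metadata).items():
    print(source, "pages:", pages)
```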

In [55]:
# full example
query = "What is the AI Bill of Rights?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

Token indices sequence length is longer than the specified maximum sequence length for this model (1899 > 1024). Running this sequence through the model will result in indexing errors


Wrapped Text is: 617 chars long

The AI Bill of Rights is an ethical framework for building and deploying automated systems to align with
democratic values and protect civil rights, civil liberties, and privacy.
The AI Bill of Rights is developed by Partnership on AI and civil liberties and was published in June 2023.
It should be noted that the framework is developed by civil liberties groups that consulted with the public.
However, it can be used by governments at all levels to guide their approach to deploying automated systems.
Some of the highlights of the framework include:
*  The AI should be designed by and serve the public


An AI (


Sources:
/content/docs/Blueprint-for-an-AI-Bill-of-Rights.pdf
/content/docs/Blueprint-for-an-AI-Bill-of-Rights.pdf
/content/docs/Blueprint-for-an-AI-Bill-of-Rights.pdf
/content/docs/Blueprint-for-an-AI-Bill-of-Rights.pdf


# Notes for next version:
## Fix Token indices sequence length Issue
Reference: 
[Token indices sequence length Issue](https://stackoverflow.com/questions/68850172/token-indices-sequence-length-issue)
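One direction to explore for the next version is capping the text sent to the model at a fixed token budget. The sketch below shows the idea using whitespace-separated words as a crude stand-in for real tokens. That is an assumption: an actual fix would count tokens with the model's own tokenizer. The limit of 1024 and the length of 1899 are taken from the warning above, and `truncate_to_budget` is a hypothetical helper.

```python
def truncate_to_budget(text, max_tokens=1024):
    """Keep at most max_tokens whitespace-separated words (a rough proxy for tokens)."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text  # already within budget; leave untouched
    return " ".join(tokens[:max_tokens])


long_text = "word " * 1899  # same length as the sequence in the warning
shortened = truncate_to_budget(long_text)
print(len(shortened.split()))  # 1024
```

Truncation discards whatever falls past the limit, so splitting pages into smaller chunks before embedding may be a better way to preserve content.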
