<a href="https://colab.research.google.com/github/rastringer/promptcraft_notebooks/blob/main/a_new_hope.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Talk to your Data: Star Wars

In this notebook, we will embed the script of the 1977 Star Wars film "A New Hope", then use Vertex AI language models to 'chat' with the data.

We will use the following technologies:

* Vertex AI Generative Studio

* LangChain, a framework for building applications with large language models

* The open-source Chroma vector store database

We will apply the following approaches:

* Retrieval Augmented Generation (RAG). With RAG, we retrieve the passages of the script most relevant to a question, pass them to the model alongside the question, and ask it to base its answer on those passages
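
The retrieve-then-ask flow of RAG can be sketched without any model at all. The snippet below is a minimal illustration, not the notebook's implementation: the three `CHUNKS` are made-up stand-ins for script chunks, word overlap stands in for embedding similarity, and the prompt wording in `build_prompt` is hypothetical.

```python
import re

# Hypothetical mini knowledge base; in the notebook this role is played by script chunks.
CHUNKS = [
    "Princess Leia hides the stolen Death Star plans inside the droid R2-D2.",
    "Luke Skywalker lives with his aunt and uncle on a moisture farm on Tatooine.",
    "Han Solo is the captain of the Millennium Falcon.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Rank chunks by word overlap with the question (a stand-in for embedding similarity)."""
    question_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(question_words & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Augment the question with the retrieved context before it is sent to a model."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


question = "Who is the captain of the Millennium Falcon?"
context = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, context)
print(prompt)
```

The string in `prompt` is what would be sent to the language model, so the model answers from the retrieved text rather than from memory alone.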


<img src="https://github.com/rastringer/promptcraft_notebooks/blob/main/images/panel_demo.png?raw=1" alt="Panel demo" width="500"/>


### What is an embedding?

To feed text, images or audio to machine learning models, we first have to convert them to numerical values a model can understand.

Embeddings in this example convert the text of the film script into vectors of floating point numbers. We use a trained model (from Vertex AI) that places related terms such as "Lightsaber" and "Jedi" close together in the 'embedding space'. Embedding the script therefore preserves semantic similarity: passages with related meanings end up with nearby vectors.
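
To make "close together in the embedding space" concrete, here is a small sketch with hand-picked, hypothetical 3-dimensional vectors. Real embedding models return vectors with hundreds of dimensions, and the numbers below are invented purely for illustration.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
TOY_EMBEDDINGS = {
    "lightsaber": [0.9, 0.8, 0.1],
    "jedi": [0.8, 0.9, 0.2],
    "moisture farm": [0.1, 0.2, 0.9],
}


def nearest(word: str) -> str:
    """Return the other word whose vector lies closest (Euclidean distance) to `word`."""
    others = [w for w in TOY_EMBEDDINGS if w != word]
    return min(others, key=lambda w: math.dist(TOY_EMBEDDINGS[word], TOY_EMBEDDINGS[w]))


print(nearest("lightsaber"))
```

With these made-up numbers, "jedi" sits much nearer to "lightsaber" than "moisture farm" does, which is the kind of structure a trained embedding model is expected to produce.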

<img src="https://github.com/rastringer/promptcraft_notebooks/blob/main/images/embeddings.png?raw=1" alt="Embeddings" width="600"/>


In [1]:
# Install the packages
! pip3 install --upgrade google-cloud-aiplatform
! pip3 install "shapely<2.0.0"
! pip install langchain
! pip install pypdf
! pip install pydantic==1.10.8
! pip install chromadb==0.3.26
! pip install "langchain[docarray]"
! pip install typing-inspect==0.8.0 typing_extensions==4.5.0


In [2]:
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [1]:
from google.colab import auth
auth.authenticate_user()

### SDK and Project Initialization

In [50]:
PROJECT_ID = "notebooks-370010"
REGION = "us-central1"

from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

### Import LangChain tools

In [51]:
# Utils
import time
from typing import List

# Langchain
import langchain
from pydantic import BaseModel

print(f"LangChain version: {langchain.__version__}")

# Vertex AI
from google.cloud import aiplatform
from langchain.chat_models import ChatVertexAI
from langchain.embeddings import VertexAIEmbeddings
from langchain.llms import VertexAI
from langchain.schema import HumanMessage, SystemMessage

print(f"Vertex AI SDK version: {aiplatform.__version__}")

LangChain version: 0.0.231
Vertex AI SDK version: 1.28.0


In [52]:
!ls

a_new_hope.pdf	docs  sample_data  star-wars-episode-iv-a-new-hope-1977.pdf


In [15]:
!wget https://assets.scriptslug.com/live/pdf/scripts/star-wars-episode-iv-a-new-hope-1977.pdf

--2023-07-13 09:21:46--  https://assets.scriptslug.com/live/pdf/scripts/star-wars-episode-iv-a-new-hope-1977.pdf
Resolving assets.scriptslug.com (assets.scriptslug.com)... 205.185.216.42, 205.185.216.10
Connecting to assets.scriptslug.com (assets.scriptslug.com)|205.185.216.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 237305 (232K) [application/pdf]
Saving to: ‘star-wars-episode-iv-a-new-hope-1977.pdf’


2023-07-13 09:21:47 (4.04 MB/s) - ‘star-wars-episode-iv-a-new-hope-1977.pdf’ saved [237305/237305]



In [53]:
from langchain.llms import VertexAI
from langchain import PromptTemplate, LLMChain
from langchain.document_loaders import PyPDFLoader

# Copy the file path of the downloaded script.
# In Colab, it should appear as below.
loader = PyPDFLoader("/content/star-wars-episode-iv-a-new-hope-1977.pdf")

doc = loader.load()

### Text splitters

Language models limit how much text they accept as input, so it is good practice to use text splitters to break documents into manageable 'chunks'.

Smaller chunks can also improve vector store matches, since a focused chunk is more likely to sit close to a specific query than a long passage covering many topics. The overlap between neighbouring chunks helps preserve context that straddles a chunk boundary.
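
The idea of chunk size and chunk overlap can be shown with a plain sliding window. This is a simplified sketch: `split_text` is a hypothetical helper that cuts at fixed character offsets, whereas LangChain's `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "It is a period of civil war. Rebel spaceships have won their first victory."
chunks = split_text(sample, chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk repeats the last 10 characters of the previous one, so text near a boundary appears in two chunks.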

In [55]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [56]:
splits = text_splitter.split_documents(doc)

In [82]:
len(splits)

178

In [21]:
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

### Embeddings example

As a simple example of embedding sentences, we will use the Vertex AI SDK and embedding model to work out numerical values for some simple sentences.

We then calculate the dot product of the resulting arrays of floats. Sentences that are similar should have higher dot product results.
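
One caveat worth keeping in mind: a dot product reflects both the direction and the length of the vectors, while cosine similarity reflects direction only. For unit-length vectors the two are equal. The sketch below uses hypothetical 2-dimensional vectors to show the difference; it makes no claim about whether a particular embedding model returns normalized vectors.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine_similarity(a, b):
    """Dot product of the unit-length versions of a and b."""
    return dot(normalize(a), normalize(b))


# Hypothetical vectors: b points the same way as a but is ten times longer,
# c is perpendicular to a.
a = [1.0, 2.0]
b = [10.0, 20.0]
c = [2.0, -1.0]

print(dot(a, b), cosine_similarity(a, b))
print(dot(a, c), cosine_similarity(a, c))
```

Scaling a vector changes its dot products but leaves its cosine similarities unchanged, which is why many vector stores normalize or use cosine distance.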

In [57]:
import numpy as np

def text_embedding() -> None:
    """Embed three sentences and compare them using dot products."""
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
    # get_embeddings returns one embedding per input string;
    # .values holds the list of floats
    vector1 = model.get_embeddings(["I like dogs"])[0].values
    vector2 = model.get_embeddings(["Canines are my favourite"])[0].values
    vector3 = model.get_embeddings(["What is life?"])[0].values
    print(f"Dot product of sentence1 and sentence2: {np.dot(vector1, vector2)}")
    print(f"Dot product of sentence1 and sentence3: {np.dot(vector1, vector3)}")
    # Uncomment to inspect an embedding:
    # print(f"Length of embedding vector: {len(vector1)}")
    # print(vector1)

In [59]:
text_embedding()

Dot product of sentence1 and sentence2: 0.8168754343111654
Dot product of sentence1 and sentence3: 0.4556939364227788


In [60]:
from langchain.vectorstores import Chroma

# Clear any previous vector store
!rm -rf ./docs/chroma

Let's set up a vector database using the open source [Chroma](https://www.trychroma.com/).

In [61]:
from langchain.embeddings import VertexAIEmbeddings

persist_directory = 'docs/chroma/'
embeddings = VertexAIEmbeddings()

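# Note: only the first four chunks are indexed here.
# Pass documents=splits to index the whole script.
vectordb = Chroma.from_documents(
    documents=splits[0:4],
    embedding=embeddings,
    persist_directory=persist_directory
)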

In [62]:
print(vectordb._collection.count())

4


In [65]:
question = "Who is Luke Skywalker?"

In [66]:
# Here, k=3 specifies the number of relevant documents we want to return
docs = vectordb.similarity_search(question, k=3)
# Note: qa_chain is created in the Retrieval section further down; run that cell first
result = qa_chain({"query": question})
result["result"]


"Luke Skywalker is a young farm boy who lives on the planet Tatooine. He is the son of Anakin Skywalker, who was a Jedi Knight before he turned to the dark side of the Force and became Darth Vader. Luke is unaware of his father's true identity, and he dreams of one day becoming a Jedi Knight himself."

In [67]:
# As requested, we get three docs from the similarity search
len(docs)

3
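
What a similarity search with `k=3` does can be sketched with a tiny in-memory store. `ToyVectorStore` and its 2-dimensional vectors are hypothetical and only illustrate the top-k idea; Chroma uses real embeddings and its own indexing.

```python
import heapq


class ToyVectorStore:
    """In-memory stand-in for a vector database (hypothetical, for illustration)."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((vector, text))

    def similarity_search(self, query_vector, k=3):
        """Return the k stored texts whose vectors have the highest dot product with the query."""
        def score(item):
            vector, _ = item
            return sum(q * v for q, v in zip(query_vector, vector))

        best = heapq.nlargest(k, self._items, key=score)
        return [text for _, text in best]


store = ToyVectorStore()
store.add([0.9, 0.1], "Luke gazes at the twin suns.")
store.add([0.8, 0.3], "Luke cleans the droids.")
store.add([0.1, 0.9], "Vader boards the Rebel ship.")
store.add([0.2, 0.8], "Stormtroopers search the corridors.")

results = store.similarity_search([1.0, 0.0], k=3)
print(results)
```

The store always returns the `k` best-scoring entries, even when some of them are only weakly related to the query, so a larger `k` trades precision for coverage.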

In [76]:
question = "who is han solo?"
docs_ss = vectordb.similarity_search(question,k=3)
result = qa_chain({"query": question})
result["result"]

'Han Solo is a smuggler who is hired by Princess Leia to transport the Death Star plans to the Rebel Alliance.'

In [77]:
len(docs_ss)

3

In [83]:
question = "What are the rebel alliance's chances against the empire?"
docs = vectordb.similarity_search(question,k=3)
result = qa_chain({"query": question})
result["result"]

'The rebel alliance is losing the battle against the empire.'

In [84]:
print(docs[1].page_content)

A long time ago, in a galaxy far, far, away...
A vast sea of stars serves as the backdrop for the main 
title. War drums echo through the heavens as a rollup slowly crawls into infinity.
It is a period of civil war. Rebel spaceships, striking from 
a hidden base, have won their first victory against the evil Galactic Empire.
During the battle, Rebel spies managed to steal secret plans 
to the Empire's ultimate weapon, the Death Star, an armored space station with enough power to destroy an entire planet.
Pursued by the Empire's sinister agents, Princess Leia races 
home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy...
The awesome yellow planet of Tatooine emerges from a total 
eclipse, her two moons glowing against the darkness. A tiny silver spacecraft, a Rebel Blockade Runner firing lasers from the back of the ship, races through space. It is pursed by a giant Imperial Stardestroyer. Hundreds of deadly laserbolts streak 

### Retrieval

In [87]:
from langchain.chains import RetrievalQA

llm = VertexAI(
    model_name="text-bison@001",
    max_output_tokens=1024,
    temperature=0.1,
    top_p=0.8,
    top_k=40,
    verbose=True,
)

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

### Prompt

In [None]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. \
If you don't know the answer, just say that you don't know, \
don't try to make up an answer. Use six sentences maximum. \
Keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
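
To see what the model eventually receives, it can help to fill the two placeholders by hand. The sketch below uses plain `str.format` on a shortened, hypothetical template and two made-up chunks; with the "stuff" chain type, the chain does roughly the same thing by concatenating the retrieved chunks into `{context}`.

```python
# Hypothetical shortened template with the same two placeholders.
template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""

# Stand-ins for the chunks a retriever would return.
retrieved_chunks = [
    "It is a period of civil war.",
    "Rebel spies managed to steal secret plans to the Death Star.",
]

prompt = template.format(
    context="\n\n".join(retrieved_chunks),
    question="What did the Rebel spies steal?",
)
print(prompt)
```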


In [None]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [92]:
question = "Who is Luke Skywalker?"
result = qa_chain({"query": question})
result["result"]

"Luke Skywalker is a young farm boy who lives on the planet Tatooine. He is the son of Anakin Skywalker, who was a Jedi Knight before he turned to the dark side of the Force and became Darth Vader. Luke is unaware of his father's true identity, but he is destined to become a Jedi Knight himself and help to defeat the Galactic Empire."

### Checking for hallucinations


In [90]:
question = "What is Darth Vader's favourite Spotify playlist?"
result = qa_chain({"query": question})
result["result"]

"I don't know."

In [91]:
question = "How does Obi Wan know Darth Vader?"
result = qa_chain({"query": question})
result["result"]

'Obi Wan and Darth Vader were once friends, but Darth Vader turned to the dark side of the Force and became a Sith Lord.'

### Chat

In [105]:
# Build prompt
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end. \
If you don't know the answer, just say that you don't know, \
don't try to make up an answer.  \
Use four sentences maximum.  \
Write with the enthusiasm of a true fan for the material. \
Add detail to your answers from the story.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"],template=template,)

# Run chain
from langchain.chains import RetrievalQA
question = "What are the major topics in the film?"
qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectordb.as_retriever(),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})


result = qa_chain({"query": question})
result["result"]

"The major topics in the film are the Rebel Alliance's fight against the evil Galactic Empire, and the quest to destroy the Death Star, an armored space station with enough power to destroy an entire planet."

### Memory

For an effective chat, the chain needs to remember the earlier questions and answers in the conversation, so that follow-up questions can refer back to them.

In [106]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

In [107]:
from langchain.chains import ConversationalRetrievalChain
retriever=vectordb.as_retriever()
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=memory
)

In [108]:
question = "Does Obi Wan know Darth Vader?"
result = qa({"question": question})
result['answer']

'Yes, Obi Wan knows Darth Vader.'

In [109]:
question = "How?"
result = qa({"question": question})
result["answer"]

'Obi Wan and Darth Vader were once friends.'

In [110]:
question = "Why did they cease to be friends?"
result = qa({"question": question})
result["answer"]

'Obi Wan and Darth Vader ceased to be friends because Darth Vader turned to the dark side of the Force.'
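
A follow-up such as "How?" only makes sense together with the earlier turns. A conversational retrieval chain handles this by first asking the model to rewrite the follow-up as a standalone question, using the chat history, and then retrieving with that rewritten question. The sketch below shows the idea with plain strings; the helper names and the prompt wording are hypothetical, not LangChain's actual prompt.

```python
# Chat history kept as (question, answer) pairs.
chat_history = [
    ("Does Obi Wan know Darth Vader?", "Yes, Obi Wan knows Darth Vader."),
]


def format_history(history):
    """Render (question, answer) pairs as a plain-text transcript."""
    lines = []
    for question, answer in history:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    return "\n".join(lines)


def build_condense_prompt(history, follow_up):
    """Build a prompt asking a model to rewrite a follow-up as a standalone question."""
    return (
        "Given the conversation below, rephrase the follow-up question "
        "so that it can be understood on its own.\n\n"
        f"{format_history(history)}\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )


prompt = build_condense_prompt(chat_history, "How?")
print(prompt)
```

A model given this prompt might produce something like "How does Obi Wan know Darth Vader?", which is a far better search query than "How?" on its own.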

In [111]:
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatVertexAI
# Used by load_db below
from langchain.embeddings import VertexAIEmbeddings
from langchain.llms import VertexAI

In [112]:
def load_db(file, chain_type, k):
    # load documents
    loader = PyPDFLoader(file)
    documents = loader.load()
    # split documents
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    docs = text_splitter.split_documents(documents)
    # define embedding
    embeddings = VertexAIEmbeddings()
    # create vector database from data
    db = DocArrayInMemorySearch.from_documents(docs, embeddings)
    # define retriever
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": k})
    # create a chatbot chain. Memory is managed externally.
    qa = ConversationalRetrievalChain.from_llm(
        llm=VertexAI(temperature=0.1, max_output_tokens=1024),
        chain_type=chain_type,
        retriever=retriever,
        return_source_documents=True,
        return_generated_question=True,
    )
    return qa

In [115]:
import panel as pn
import param

class cbfs(param.Parameterized):
    chat_history = param.List([])
    answer = param.String("")
    db_query = param.String("")
    db_response = param.List([])

    def __init__(self, **params):
        super().__init__(**params)
        self.panels = []
        self.loaded_file = "/content/star-wars-episode-iv-a-new-hope-1977.pdf"
        self.qa = load_db(self.loaded_file,"stuff", 4)

    def call_load_db(self, count):
        if count == 0 or file_input.value is None:  # init or no file specified :
            return pn.pane.Markdown(f"Loaded File: {self.loaded_file}")
        else:
            file_input.save("temp.pdf")  # local copy
            self.loaded_file = file_input.filename
            button_load.button_style="outline"
            self.qa = load_db("temp.pdf", "stuff", 4)
            button_load.button_style="solid"
        self.clr_history()
        return pn.pane.Markdown(f"Loaded File: {self.loaded_file}")

    def convchain(self, query):
        if not query:
            return pn.WidgetBox(pn.Row('User:', pn.pane.Markdown("", width=600)), scroll=True)
        result = self.qa({"question": query, "chat_history": self.chat_history})
        self.chat_history.extend([(query, result["answer"])])
        self.db_query = result["generated_question"]
        self.db_response = result["source_documents"]
        self.answer = result['answer']
        self.panels.extend([
            pn.Row('User:', pn.pane.Markdown(query, width=600)),
            pn.Row('ChatBot:', pn.pane.Markdown(self.answer, width=600))
        ])
        inp.value = ''  #clears loading indicator when cleared
        return pn.WidgetBox(*self.panels,scroll=True)

    @param.depends('db_query')
    def get_lquest(self):
        if not self.db_query:
            return pn.Column(
                pn.Row(pn.pane.Markdown("Last question to DB:")),
                pn.Row(pn.pane.Str("no DB accesses so far"))
            )
        return pn.Column(
            pn.Row(pn.pane.Markdown("DB query:")),
            pn.pane.Str(self.db_query)
        )

    @param.depends('db_response', )
    def get_sources(self):
        if not self.db_response:
            return
        rlist=[pn.Row(pn.pane.Markdown(f"Result of DB lookup:"))]
        for doc in self.db_response:
            rlist.append(pn.Row(pn.pane.Str(doc)))
        return pn.WidgetBox(*rlist, width=600, scroll=True)

    @param.depends('convchain', 'clr_history')
    def get_chats(self):
        if not self.chat_history:
            return pn.WidgetBox(pn.Row(pn.pane.Str("No History Yet")), width=600, scroll=True)
        rlist=[pn.Row(pn.pane.Markdown(f"Current Chat History variable"))]
        for exchange in self.chat_history:
            rlist.append(pn.Row(pn.pane.Str(exchange)))
        return pn.WidgetBox(*rlist, width=600, scroll=True)

    def clr_history(self,count=0):
        self.chat_history = []
        return


In [116]:
pn.extension()

cb = cbfs()

file_input = pn.widgets.FileInput(accept='.pdf')
button_load = pn.widgets.Button(name="Load DB", button_type='primary')
button_clearhistory = pn.widgets.Button(name="Clear History", button_type='warning')
button_clearhistory.on_click(cb.clr_history)
inp = pn.widgets.TextInput( placeholder='Enter text here…')

bound_button_load = pn.bind(cb.call_load_db, button_load.param.clicks)
conversation = pn.bind(cb.convchain, inp)

tab1 = pn.Column(
    pn.Row(inp),
    pn.layout.Divider(),
    pn.panel(conversation,  loading_indicator=True, height=300),
    pn.layout.Divider(),
)
tab2= pn.Column(
    pn.panel(cb.get_lquest),
    pn.layout.Divider(),
    pn.panel(cb.get_sources ),
)
tab3= pn.Column(
    pn.panel(cb.get_chats),
    pn.layout.Divider(),
)
tab4=pn.Column(
    pn.Row( file_input, button_load, bound_button_load),
    pn.Row( button_clearhistory, pn.pane.Markdown("Clears chat history. Can use to start a new topic" )),
    pn.layout.Divider(),
)
dashboard = pn.Column(
    pn.Row(pn.pane.Markdown('# Chat with your data')),
    pn.Tabs(('Conversation', tab1), ('Database', tab2), ('Chat History', tab3),('Configure', tab4))
)
dashboard

With thanks to DeepLearning.AI's excellent [LangChain Chat With Your Data](https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/1/introduction) course.