# Power your products with ChatGPT and your own data

This walkthrough shows how to build starter Q&A and Chatbot applications using the ChatGPT API and your own data.

It is laid out in these sections:
- **Setup:** 
    - Initiate variables and source the data
- **Lay the foundations:**
    - Set up the vector database to accept vectors and data
    - Load the dataset, chunk the data up for embedding and store in the vector database
- **Make it a product:**
    - Add a retrieval step where users provide queries and we return the most relevant entries
    - Summarise search results with GPT-3
    - Test out this basic Q&A app in Streamlit
- **Build your moat:**
    - Create an Assistant class to manage context and interact with our bot
    - Use the Chatbot to answer questions using semantic search context
    - Test out this basic Chatbot app in Streamlit
    
Upon completion, you will have the building blocks to create your own production chatbot or Q&A application using OpenAI APIs and a vector database.

This notebook was originally presented with [these slides](https://drive.google.com/file/d/1dB-RQhZC_Q1iAsHkNNdkqtxxXqYODFYy/view?usp=share_link), which provide visual context for this journey.

In [1]:
%load_ext autoreload
%autoreload 2

## Setup

First we'll set up our libraries and environment variables.

In [2]:
import openai
import os
import requests
import numpy as np
import pandas as pd
from typing import Iterator
import tiktoken
import textract
from numpy import array, average

from database import get_redis_connection

# Set our default models and chunking size
from config import COMPLETIONS_MODEL, EMBEDDINGS_MODEL, CHAT_MODEL, TEXT_EMBEDDING_CHUNK_SIZE, VECTOR_FIELD_NAME

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

# Unclosed sockets are reported as ResourceWarning, so that is the category to filter
warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [3]:
pd.set_option('display.max_colwidth', 0)

In [4]:
data_dir = os.path.join(os.curdir,'data')
pdf_files = sorted([x for x in os.listdir(data_dir) if 'DS_Store' not in x])
pdf_files

['Medicare_Home_Health.pdf']

## Laying the foundations

### Storage

We're going to use Redis as our database for both document contents and the vector embeddings. You will need the full Redis Stack to use RediSearch, the module that enables semantic search; more detail is in the [docs for Redis Stack](https://redis.io/docs/stack/get-started/install/docker/).

To set this up locally, you will need to install Docker and then run the following command: ```docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest```.

The code used here draws heavily on [this repo](https://github.com/RedisAI/vecsim-demo).

After setting up the Docker instance of Redis Stack, you can follow the below instructions to initiate a Redis connection and create a Hierarchical Navigable Small World (HNSW) index for semantic search.
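
To make the idea of semantic search concrete, here is a minimal pure-Python sketch of what a vector index does conceptually: rank stored vectors by cosine distance to a query vector. It is a brute-force scan over toy 3-dimensional vectors, not HNSW (which approximates this ranking far more efficiently), and the helper names `cosine_distance` and `top_k` are illustrative assumptions rather than part of Redis.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity, so smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Brute-force nearest neighbours: score every stored vector, keep the k closest."""
    scored = [(key, cosine_distance(query, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy 3-dimensional "embeddings"; the embeddings in this notebook have 1536 dimensions
stored = {
    "homedoc:0": [1.0, 0.0, 0.0],
    "homedoc:1": [0.9, 0.1, 0.0],
    "homedoc:2": [0.0, 1.0, 0.0],
}
nearest = top_k([1.0, 0.05, 0.0], stored, k=2)
print(nearest)
```

The two vectors pointing in nearly the same direction as the query come back first, which is the behaviour the index provides at scale.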

In [5]:
# Setup Redis
from redis import Redis
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TextField,
    VectorField,
    NumericField
)
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)

redis_client = get_redis_connection()

In [6]:
# Constants
VECTOR_DIM = 1536 #len(data['title_vector'][0]) # length of the vectors
#VECTOR_NUMBER = len(data)                 # initial number of vectors
PREFIX = "homedoc"                            # prefix for the document keys
DISTANCE_METRIC = "COSINE"                # distance metric for the vectors (ex. COSINE, IP, L2)

In [7]:
# Create search index

# Index
INDEX_NAME = "homehealth-index"           # name of the search index
VECTOR_FIELD_NAME = 'content_vector'

# Define RediSearch fields for each of the columns in the dataset
# This is where you should add any additional metadata you want to capture
filename = TextField("filename")
text_chunk = TextField("text_chunk")
file_chunk_index = NumericField("file_chunk_index")

# define RediSearch vector fields to use HNSW index

text_embedding = VectorField(VECTOR_FIELD_NAME,
    "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC
    }
)
# Add all our field objects to a list to be created as an index
fields = [filename,text_chunk,file_chunk_index,text_embedding]

In [8]:
redis_client.ping()

True

In [9]:
# Optional step to drop the index if it already exists
#redis_client.ft(INDEX_NAME).dropindex()

# Check if index exists
try:
    redis_client.ft(INDEX_NAME).info()
    print("Index already exists")
except Exception as e:
    print(e)
    # Create RediSearch Index
    print('Not there yet. Creating')
    redis_client.ft(INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )

Index already exists


### Ingestion

We'll load our PDFs and do the following:
- Initiate our tokenizer
- Run a processing pipeline to:
    - Mine the text from each PDF
    - Split them into chunks and embed them
    - Store them in Redis
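
The chunking itself happens inside `transformers.py`, which is not shown here. As a rough sketch of the idea, the snippet below splits a token sequence into fixed-size chunks. Whitespace splitting stands in for the real tokenizer, and the name `chunk_tokens` and the chunk size of 5 are assumptions chosen for illustration.

```python
from typing import Iterator, List

def chunk_tokens(tokens: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]

# Whitespace splitting stands in for a real tokenizer purely for illustration
text = "The patient must be under the care of a physician or allowed practitioner"
tokens = text.split()
chunks = [" ".join(chunk) for chunk in chunk_tokens(tokens, chunk_size=5)]
for index, chunk in enumerate(chunks):
    print(index, chunk)
```

Each chunk would then be embedded and stored under its own key, together with the filename and chunk index fields defined in the index above.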

In [10]:
# The transformers.py file contains all of the transforming functions, including ones to chunk, embed and load data
# For more details, check the file and work through each function individually
from transformers import handle_file_string

In [11]:
# Read the API key from the environment rather than hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

In [12]:
%%time
# This step can take a few minutes for larger documents

# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

# Process each PDF file and prepare for embedding
for pdf_file in pdf_files:
    
    pdf_path = os.path.join(data_dir,pdf_file)
    print(pdf_path)
    
    # Extract the raw text from each PDF using textract
    text = textract.process(pdf_path, method='pdfminer')
    
    # Chunk each document, embed the contents and load to Redis
    handle_file_string((pdf_file,text.decode("utf-8")),tokenizer,redis_client,VECTOR_FIELD_NAME,INDEX_NAME)

./data/Medicare_Home_Health.pdf
CPU times: user 767 ms, sys: 199 ms, total: 966 ms
Wall time: 5.44 s


In [13]:
# Check that our docs have been inserted
redis_client.ft(INDEX_NAME).info()['num_docs']

'141'

## Make it a product

Now we can test that our search works as intended by:
- Querying our data in Redis using semantic search and verifying results
- Adding a step to pass the results to GPT-3 for summarisation
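
`get_redis_results` lives in `database.py`, which is not shown here. As a hedged sketch of the two ingredients a RediSearch KNN query typically needs, the snippet below packs a query embedding into FLOAT32 bytes and builds the KNN query string. The helper names and the `vec_param` and `vector_score` labels are assumptions for illustration.

```python
import struct

def to_float32_bytes(vector):
    """Pack floats into little-endian FLOAT32 bytes (the byte order NumPy produces on typical machines)."""
    return struct.pack(f"<{len(vector)}f", *vector)

def build_knn_query(vector_field, k=5, param_name="vec_param", score_alias="vector_score"):
    """Build a RediSearch KNN query string that searches across all documents ('*')."""
    return f"*=>[KNN {k} @{vector_field} ${param_name} AS {score_alias}]"

query_vector = [0.1, 0.2, 0.3]  # stand-in for a 1536-dimension embedding
blob = to_float32_bytes(query_vector)
query_string = build_knn_query("content_vector", k=2)
print(len(blob), query_string)
```

With redis-py, a string like this is typically wrapped in `Query(...).dialect(2)` and executed with `redis_client.ft(INDEX_NAME).search(...)`, passing the bytes through `query_params`. Whether `get_redis_results` does exactly this is an assumption.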

In [14]:
from database import get_redis_results

In [18]:
%%time

homehealth_query='Who can perform a home health initial assessment?'

result_df = get_redis_results(redis_client,homehealth_query,index_name=INDEX_NAME)
result_df.head(2)

CPU times: user 2.72 ms, sys: 1.59 ms, total: 4.31 ms
Wall time: 116 ms


Unnamed: 0,id,result,certainty
0,0,"If the patient is starting home health directly after discharge from an acute/post-acute care setting where the physician or allowed practitioner, with privileges, that cared for the patient in that setting is certifying the patient’s eligibility for the home health benefit, but will not be following the patient after discharge, then the certifying physician or allowed practitioner must identify the community physician or allowed practitioner who will be following the patient after discharge. One of the criteria that must be met for a patient to be considered eligible for the home health benefit is that the patient must be under the care of a physician or allowed practitioner (number 4 listed above). Otherwise, the certification is not valid. The certification must be complete prior to when an HHA bills Medicare for reimbursement however, physicians and allowed practitioners should complete the certification when the plan of care is established, or as soon as possible thereafter. This is longstanding CMS policy as referenced in Pub 100-01, Medicare General Information, Eligibility, and Entitlement Manual, chapter 4, section 30.1. It is not acceptable for HHAs to wait until the end of a 60-day certification period to obtain a completed certification/recertification. 30.5.1.1 – Face-to-Face Encounter (Rev. 10438, Issued: 11-06-20, Effective: 03-01-20, Implementation: 01- 11-21) 1. Allowed Provider Types As part of the certification of patient eligibility for the Medicare home health benefit, a face-to-face encounter with the patient must be performed by the certifying physician or allowed practitioner himself or herself, a physician or allowed practitioner that cared for the patient in the acute or post-acute care facility (with privileges who cared for the patient in an acute or post-acute care facility from which the patient was directly admitted to home health) or an allowed non-physician practitioner (NPP).",0.164514839649
1,1,"For recertification of home health services, the physician or allowed practitioner must certify (attest) that: 1. The home health services are or were needed because the patient is or was confined to the home as defined in §30.1 2. The patient needs or needed skilled nursing services on an intermittent basis (other than solely venipuncture for the purposes of obtaining a blood sample), or physical therapy, or speech-language pathology services or continues to need occupational therapy after the need for skilled nursing care, physical therapy, or speech-language pathology services ceased. Where a patient’s sole skilled service need is for skilled oversight of unskilled services (management and evaluation of the care plan as defined in §40.1.2.2), the physician or allowed practitioner must include a brief narrative describing the clinical justification of this need as part of the recertification, or as a signed addendum to the recertification 3. A plan of care has been established and is periodically reviewed by a physician or allowed practitioner and 4. The services are or were furnished while the patient is or was under the care of a physician or allowed practitioner. Medicare does not limit the number of continuous 60-day recertifications for beneficiaries who continue to be eligible for the home health benefit. The certification may cover a period less than but not greater than 60 days. Because the updated home health plan of care must include the frequency and duration of visits to be made, the physician or allowed practitioner does not have to estimate how much longer skilled services will be needed for the recertification. 30.5.3 - Who May Sign the Certification or Recertification (Rev. 10738, Issued: 05-07-21, Effective: 01-01-21, Implementation: 08-09-21) The physician or allowed practitioner who signs the certification or recertification must be permitted to do so by 42 CFR 424.22.",0.172113120556


In [19]:
# Build a prompt to provide the original query, the result and ask to summarise for the user
summary_prompt = '''Summarise this result in a bulleted list to answer the search query a customer has sent.
Search query: SEARCH_QUERY_HERE
Search result: SEARCH_RESULT_HERE
Summary:
'''
summary_prepped = summary_prompt.replace('SEARCH_QUERY_HERE',homehealth_query).replace('SEARCH_RESULT_HERE',result_df['result'][0])
summary = openai.Completion.create(engine=COMPLETIONS_MODEL,prompt=summary_prepped,max_tokens=500)
# Response provided by GPT-3
print(summary['choices'][0]['text'])

- The certifying physician or allowed practitioner must identify the community physician or allowed practitioner who will be following the patient after discharge. 
- The certification must be completed prior to the home health agency bill Medicare for reimbursement. 
- A face-to-face encounter with the patient is required and must be performed by the certifying physician or allowed practitioner, a physician or allowed practitioner with privileges who cared for the patient in an acute or post-acute care facility, or an allowed non-physician practitioner.
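
The prompt construction in the cell above can be wrapped in a small helper so it is easy to reuse or to pass more than one search result. This is a sketch only: `build_summary_prompt` and its `max_results` parameter are hypothetical names, and the result strings are placeholders rather than real search output.

```python
SUMMARY_TEMPLATE = (
    "Summarise this result in a bulleted list to answer the search query a customer has sent.\n"
    "Search query: {query}\n"
    "Search result: {result}\n"
    "Summary:\n"
)

def build_summary_prompt(query, results, max_results=1):
    """Fill the template with the query and the first max_results search results."""
    selected = [r.strip() for r in results[:max_results]]
    return SUMMARY_TEMPLATE.format(query=query, result="\n".join(selected))

prompt = build_summary_prompt(
    "Who can perform a home health initial assessment?",
    ["First matching chunk of text.", "Second matching chunk of text."],
    max_results=2,
)
print(prompt)
```

Passing more results gives the model more context at the cost of a longer prompt, so `max_results` is a trade-off to tune against the model's context limit.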


### Search

Now that we've got our knowledge embedded and stored in Redis, we can create an internal search application. It's not sophisticated, but it'll get the job done for us.

In the directory containing this app, execute ```streamlit run search.py```. This will open up a Streamlit app in your browser where you can ask questions of your embedded data.

__Example Questions__:
- who can perform a home health initial assessment
- can a patient be discharged from home health services if they are not homebound
## Build your moat

The Q&A was useful, but fairly limited in the complexity of interaction we can have - if the user asks a sub-optimal question, there is no assistance from the system to prompt them for more info or conversation to lead them down the right path.

For the next step we'll make a Chatbot using the Chat Completions endpoint, which will:
- Be given instructions on how it should act and what the goals of its users are
- Be supplied some required information that it needs to collect
- Go back and forth with the customer until it has populated that information
- Say a trigger word that will kick off semantic search and summarisation of the response

For more details on our Chat Completions endpoint and how to interact with it, please check out the docs [here](https://platform.openai.com/docs/guides/chat).

### Framework

This section outlines a basic framework for working with the API and storing context of previous conversation "turns". Once this is established, we'll extend it to use our retrieval endpoint.
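
Before involving the API, the context-keeping idea can be sketched offline. Below, `fake_model` is a stand-in for the Chat Completions call and `ConversationLog` is a hypothetical class name. The point is only that every request carries the full history, which is how the model "remembers" earlier turns.

```python
def fake_model(messages):
    """Stand-in for a chat API call: reports how many messages of context it received."""
    return {"role": "assistant", "content": f"I can see {len(messages)} message(s) of context."}

class ConversationLog:
    """Keeps every turn so each new request carries the full prior context."""

    def __init__(self, system_prompt):
        self.history = [{"role": "system", "content": system_prompt}]

    def ask(self, user_text, model=fake_model):
        self.history.append({"role": "user", "content": user_text})
        reply = model(self.history)  # the whole history is sent, not just the latest turn
        self.history.append(reply)
        return reply["content"]

log = ConversationLog("You are a helpful assistant")
first = log.ask("What can you do to help me")
second = log.ask("Tell me more about option 4")
print(first)
print(second)
```

The second call sees four messages (system, user, assistant, user), which is why a follow-up such as "option 4" can be resolved against the earlier answer.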

In [20]:
# A basic example of how to interact with our ChatCompletion endpoint
# It requires a list of "messages", consisting of a "role" (one of system, user or assistant) and "content"
question = 'How can you help me'


completion = openai.ChatCompletion.create(
  model="gpt-4",
  messages=[
    {"role": "user", "content": question}
  ]
)
print(f"{completion['choices'][0]['message']['role']}: {completion['choices'][0]['message']['content']}")

assistant: As an AI language model, I can help you with various tasks including answering questions, providing information, offering suggestions, generating content, and assisting with daily tasks. Some examples:

1. Answering general knowledge questions.
2. Providing information on various topics like history, science, and technology.
3. Giving advice on productivity, time management, and personal development.
4. Helping you find resources, tools, or apps for a specific task or goal.
5. Assisting with language-related tasks, such as grammar correction, text summarization, or translation.
6. Suggesting ideas for creative projects, such as writing prompts or content ideas.
7. Offering troubleshooting or technical support for simple issues.

Please let me know what you need help with, and I will do my best to assist you.


In [26]:
from termcolor import colored

# A basic class to create a message as a dict for chat
class Message:
    
    
    def __init__(self,role,content):
        
        self.role = role
        self.content = content
        
    def message(self):
        
        return {"role": self.role,"content": self.content}
        
# Our assistant class we'll use to converse with the bot
class Assistant:
    
    def __init__(self):
        self.conversation_history = []

    def _get_assistant_response(self, prompt):
        
        try:
            completion = openai.ChatCompletion.create(
              model="gpt-3.5-turbo",
              messages=prompt
            )
            
            response_message = Message(completion['choices'][0]['message']['role'],completion['choices'][0]['message']['content'])
            return response_message.message()
            
        except Exception as e:
            
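            # Return a message dict so callers can still index ['role'] and ['content']
            return Message('assistant', f'Request failed with exception {e}').message()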

    def ask_assistant(self, next_user_prompt, colorize_assistant_replies=True):
        self.conversation_history.extend(next_user_prompt)
        assistant_response = self._get_assistant_response(self.conversation_history)
        self.conversation_history.append(assistant_response)
        return assistant_response
            
        
    def pretty_print_conversation_history(self, colorize_assistant_replies=True):
        for entry in self.conversation_history:
            if entry['role'] == 'system':
                pass
            else:
                prefix = entry['role']
                content = entry['content']
                output = colored(prefix +':\n' + content, 'green') if colorize_assistant_replies and entry['role'] == 'assistant' else prefix +':\n' + content
                print(output)

In [22]:
# Initiate our Assistant class
conversation = Assistant()

# Create a list to hold our messages and insert both a system message to guide behaviour and our first user question
messages = []
system_message = Message('system','You are a helpful business assistant who has innovative ideas')
user_message = Message('user','What can you do to help me')
messages.append(system_message.message())
messages.append(user_message.message())
messages

[{'role': 'system',
  'content': 'You are a helpful business assistant who has innovative ideas'},
 {'role': 'user', 'content': 'What can you do to help me'}]

In [23]:
# Get back a response from the Chatbot to our question
response_message = conversation.ask_assistant(messages)
print(response_message['content'])

As a business assistant, I can do a lot of things to help you. Here are some innovative ideas that you might find useful:

1. Create a Social Media Marketing Plan: I can help you develop a social media marketing plan that aligns with your business goals. This plan can include strategies for creating engaging content, targeting your audience to reach a wider audience, and leveraging social media advertising to promote your business.

2. Develop a Customer Loyalty Program: A customer loyalty program can be an effective way to retain customers and grow your business. I can help you develop a program that rewards your customers for repeat business, referrals, and other actions that support your business.

3. Conduct Market Research: Every successful business needs to stay up-to-date with the latest market trends and customer preferences. I can conduct market research on your behalf to help you make informed decisions about your business's product or service offerings.

4. Streamline Your B

In [24]:
next_question = 'Tell me more about option 4'

# Initiate a fresh messages list and insert our next question
messages = []
user_message = Message('user',next_question)
messages.append(user_message.message())
response_message = conversation.ask_assistant(messages)
print(response_message['content'])

Streamlining your business operations is all about making your processes more efficient. This means identifying tasks and processes that can be automated, simplified, or eliminated entirely. By doing this, you can save time, reduce costs, and increase productivity, which can help your business grow.

Here are some ways that I can help you streamline your business operations:

1. Conduct a Process Audit: In order to streamline your operations, it's important to start by understanding your current processes. I can conduct a process audit on your behalf to identify areas where improvements can be made.

2. Automate Repetitive Tasks: There are likely many tasks that are performed repeatedly in your business, such as data entry, invoicing, or scheduling. By automating these tasks, you can save time and free up your team to focus on more important tasks.

3. Integrate Your Systems: Do you have multiple systems that don't communicate with each other? Integrating these systems can save time an

In [25]:
# Print out a log of our conversation so far

conversation.pretty_print_conversation_history()

user:
What can you do to help me
assistant:
As a business assistant, I can do a lot of things to help you. Here are some innovative ideas that you might find useful:

1. Create a Social Media Marketing Plan: I can help you develop a social media marketing plan that aligns with your business goals. This plan can include strategies for creating engaging content, targeting your audience to reach a wider audience, and leveraging social media advertising to promote your business.

2. Develop a Customer Loyalty Program: A customer loyalty program can be an effective way to retain customers and grow your business. I can help you develop a program that rewards your customers for repeat business, referrals, and other actions that support your business.

3. Conduct Market Research: Every successful business needs to stay up-to-date with the latest market trends and customer preferences. I can conduct market research on your behalf to help you make informed decisions about your business's pr

### Knowledge retrieval

Now we'll extend the class to call a downstream service when the Chatbot says an agreed trigger phrase, which acts as our stop sequence.

The main changes are:
- The system message is more comprehensive, giving criteria for the Chatbot to advance the conversation
- Adding an explicit stop sequence for it to use when it has the info it needs
- Extending the class with a function ```_get_search_results``` which sources Redis results
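
The control flow behind these changes can be sketched with stubs in place of Redis and the chat API. In this simplified version the latest user message is used directly as the search query, whereas the class below asks a completions model to extract it. `route_reply`, `stub_search` and `stub_answer` are hypothetical names.

```python
TRIGGER = "searching for answers"

def route_reply(assistant_text, history, search_fn, answer_fn):
    """Run a search and answer from it if the trigger phrase appears;
    otherwise pass the assistant's reply straight through."""
    if TRIGGER not in assistant_text.lower():
        return assistant_text
    # Simplification: use the latest user message as the search query
    latest_user = next(m["content"] for m in reversed(history) if m["role"] == "user")
    context = search_fn(latest_user)
    return answer_fn(latest_user, context)

# Stubs standing in for the Redis lookup and the second chat call
def stub_search(query):
    return f"[top chunk for: {query}]"

def stub_answer(query, context):
    return f"Answer to '{query}' based on {context}"

history = [{"role": "user", "content": "Can a physical therapist perform the initial assessment?"}]
clarifying = route_reply("Certainly, which clinician did you mean?", history, stub_search, stub_answer)
answered = route_reply("Searching for answers.", history, stub_search, stub_answer)
print(clarifying)
print(answered)
```

Matching on a lower-cased phrase keeps the check tolerant of capitalisation, but it also means the phrase should be unusual enough that the model will not say it by accident.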

In [27]:
# Updated system prompt requiring a Question and any clarification information to be captured from the user
system_prompt = '''
You are a helpful Medicare Home Health knowledge base assistant. You need to capture a Question and any clarification information.
The Question is their query on Medicare Home Health Regulations, and the clarification information is any additional information they provide to help you answer their question.
If you need to ask the user for clarification information, ask them for it.
Once you have a Question and any clarification information, say "searching for answers".

Example 1:

User: I'd like to know if a clinician can perform a home health initial assessment.

Assistant: Certainly, which clinician did you want to know this about?

User: Physical Therapist, please.

Assistant: Searching for answers.
'''

# New Assistant class to add a vector database call to its responses
class RetrievalAssistant:
    
    def __init__(self):
        self.conversation_history = []  

    def _get_assistant_response(self, prompt):
        
        try:
            completion = openai.ChatCompletion.create(
              model=CHAT_MODEL,
              messages=prompt,
              temperature=0.1
            )
            
            response_message = Message(completion['choices'][0]['message']['role'],completion['choices'][0]['message']['content'])
            return response_message.message()
            
        except Exception as e:
            
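            # Return a message dict so callers can still index ['role'] and ['content']
            return Message('assistant', f'Request failed with exception {e}').message()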
    
    # The function to retrieve Redis search results
    def _get_search_results(self,prompt):
        latest_question = prompt
        search_content = get_redis_results(redis_client,latest_question,INDEX_NAME)['result'][0]
        return search_content
        

    def ask_assistant(self, next_user_prompt):
        self.conversation_history.extend(next_user_prompt)
        assistant_response = self._get_assistant_response(self.conversation_history)
        
        # Answer normally unless the trigger phrase "searching for answers" is used
        if 'searching for answers' in assistant_response['content'].lower():
            question_extract = openai.Completion.create(model=COMPLETIONS_MODEL,prompt=f"Extract the user's latest question and any clarification information from this conversation: {self.conversation_history}. Extract it as a sentence stating the Question and the clarification information")
            search_result = self._get_search_results(question_extract['choices'][0]['text'])
            
            # We insert an extra system prompt here to give fresh context to the Chatbot on how to use the Redis results
            # In this instance we add it to the conversation history, but in production it may be better to hide
            self.conversation_history.insert(-1,{"role": 'system',"content": f"Answer the user's question using this content: {search_result}. If you cannot answer the question, say 'Sorry, I don't know the answer to this one'"})
            #[self.conversation_history.append(x) for x in next_user_prompt]
            
            assistant_response = self._get_assistant_response(self.conversation_history)
            print(next_user_prompt)
            print(assistant_response)
            self.conversation_history.append(assistant_response)
            return assistant_response
        else:
            self.conversation_history.append(assistant_response)
            return assistant_response
            
        
    def pretty_print_conversation_history(self, colorize_assistant_replies=True):
        for entry in self.conversation_history:
            if entry['role'] == 'system':
                pass
            else:
                prefix = entry['role']
                content = entry['content']
                output = colored(prefix +':\n' + content, 'green') if colorize_assistant_replies and entry['role'] == 'assistant' else prefix +':\n' + content
                #prefix = entry['role']
                print(output)

In [29]:
conversation = RetrievalAssistant()
messages = []
system_message = Message('system',system_prompt)
user_message = Message('user','Can a patient be discharged from home health services if they are not homebound?')
messages.append(system_message.message())
messages.append(user_message.message())
response_message = conversation.ask_assistant(messages)
response_message

[{'role': 'system', 'content': '\nYou are a helpful Medicare Home Health knowledge base assistant. You need to capture a Question and any clarification information.\nThe Question is their query on Medicare Home Health Regulations, and the clarification information is any additional information they provide to help you answer their question.\nIf you need to ask the user for clarification information, ask them for it.\nOnce you have a Question and any clarification information, say "searching for answers".\n\nExample 1:\n\nUser: I\'d like to know if a clinician can perform a home health initial assessment.\n\nAssistant: Certainly, which clinician did you want to know this about?\n\nUser: Physical Therapist, please.\n\nAssistant: Searching for answers.\n'}, {'role': 'user', 'content': 'Can a patient be discharged from home health services if they are not homebound?'}]
{'role': 'assistant', 'content': 'Yes, a patient can be discharged from home health services if they are not homebound. Ac

{'role': 'assistant',
 'content': 'Yes, a patient can be discharged from home health services if they are not homebound. According to Medicare Home Health Regulations, for a patient to be eligible to receive covered home health services under both Part A and Part B, they must be considered "confined to the home" or homebound. If a patient no longer meets the homebound criteria, they may no longer be eligible for home health services, and discharge from the services may be appropriate.'}

In [30]:
messages = []
user_message = Message('user','For 2023 please.')
messages.append(user_message.message())
response_message = conversation.ask_assistant(messages)
#response_message

In [31]:
conversation.pretty_print_conversation_history()

user:
Can a patient be discharged from home health services if they are not homebound?
assistant:
Yes, a patient can be discharged from home health services if they are not homebound. According to Medicare Home Health Regulations, for a patient to be eligible to receive covered home health services under both Part A and Part B, they must be considered "confined to the home" or homebound. If a patient no longer meets the homebound criteria, they may no longer be eligible for home health services, and discharge from the services may be appropriate.
user:
For 2023 please.
assistant:
I'm sorry, but I cannot provide information about Medicare Home Health Regulations for 2023 as they have not been released yet. Regulations and guidelines may change over time, so it's essential to refer to the most current information when it becomes available. Please feel free to ask any other questions you may have about the current regulations.


### Chatbot

Now we'll put all this into action with a real (basic) Chatbot.

In the directory containing this app, execute ```streamlit run chat.py```. This will open up a Streamlit app in your browser where you can ask questions of your embedded data. 

__Example Questions__:
- who can perform a home health initial assessment
- can a patient be discharged from home health services if they are not homebound
- who may sign the certification or recertification

### Consolidation

Over the course of this notebook you have:
- Laid the foundations of your product by embedding your knowledge base
- Created a Q&A application to serve basic use cases
- Extended this to be an interactive Chatbot

These are the foundational building blocks of any Q&A or Chat application built on our APIs. Use them as your starting point; we look forward to seeing what you build with them!