# GPT user support

> Semantic search enabled via GPT and context-specific responses

In [9]:
import numpy as np
import openai
from openai import OpenAI
import os
import pandas as pd
import pickle

COMPLETIONS_MODEL = "text-davinci-003"
EMBEDDING_MODEL = "text-embedding-ada-002"

# Authenticate with OpenAI API
with open('apiKeys.txt', 'r') as temp:
    apiKey = temp.read().strip()  # drop any trailing newline so the key is sent cleanly
client = OpenAI(api_key=apiKey)

In [13]:
def gpt4(question, tokens=500):
    messages=[{"role": "user", "content": question}]

    response = client.chat.completions.create(model="gpt-4",
                                                max_tokens=tokens,
                                                temperature=0,
                                                messages=messages)

    # Extract the content
    content = response.choices[0].message.content

    # Split the content into text and code
    text_parts = []
    code_parts = []
    in_code_block = False

    for line in content.split("\n"):
        if line.startswith("```"):
            in_code_block = not in_code_block
            continue
        if in_code_block:
            code_parts.append(line)
        else:
            text_parts.append(line)

    # Print the text parts
    for line in text_parts:
        print(line)

    # Print a separator
    print("\n" + "-"*50 + "\n")

    # Print the code parts
    for line in code_parts:
        print(line)
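
The fence-splitting step inside `gpt4` can be pulled out into a pure function, which makes it possible to check the behaviour without calling the API. The sketch below assumes that every code fence opens and closes on its own line; `split_text_and_code` is a hypothetical helper name and is not part of the notebook.

```python
def split_text_and_code(content: str) -> tuple[list[str], list[str]]:
    """Separate prose lines from lines that sit inside code fences (hypothetical helper)."""
    text_parts, code_parts = [], []
    in_code_block = False
    for line in content.split("\n"):
        # A fence line toggles the state and is itself discarded
        if line.startswith("`" * 3):
            in_code_block = not in_code_block
            continue
        (code_parts if in_code_block else text_parts).append(line)
    return text_parts, code_parts


fence = "`" * 3
sample = "\n".join(["Install it first:", fence + "python", "import tapipy", fence, "Done."])
text, code = split_text_and_code(sample)
print(text)
print(code)
```

Because the prose and the code are printed separately, the prose loses the position where each snippet belonged, which is why the numbered steps in the outputs below appear with empty gaps.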

## GPT Hallucination (lying)

In [14]:
prompt = "How to generate a token using Tapipy"

gpt4(prompt,300)

Tapipy (Tenable Application Programming Interface Python) is a Python client library for Tenable.io API. However, it does not directly generate tokens. 

The token generation is handled by Tenable.io API itself. You can generate API keys (access key and secret key) from your Tenable.io account. Here is how you can do it:

1. Log in to your Tenable.io account.
2. Click on "My Account" in the top right corner.
3. Click on "API Keys" on the left side menu.
4. Click on "Generate" button. It will generate a pair of keys: Access Key and Secret Key.

Once you have these keys, you can use them in Tapipy to authenticate your requests. Here is an example:


Remember to replace 'your_access_key' and 'your_secret_key' with your actual keys.

--------------------------------------------------

from tapipy.tenable_io import TenableIOClient

client = TenableIOClient(access_key='your_access_key', secret_key='your_secret_key')

# Now you can use client to make requests


'openai.Completion.create(\n    prompt=prompt,\n    temperature=0,\n    max_tokens=300,\n    model=COMPLETIONS_MODEL\n)["choices"][0]["text"].strip(" \n")'

There is no website called "Tapipy" where you can create an account, and Tapipy has nothing to do with Tenable.io. All of this is wrong!

## Forcing GPT to not lie!

In [15]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Q: How to generate a token using Tapipy?
A:
"""

gpt4(prompt,300)

Sorry, I don't know.

--------------------------------------------------



'openai.Completion.create(\n    prompt=prompt,\n    temperature=0,\n    max_tokens=300,\n    model=COMPLETIONS_MODEL\n)["choices"][0]["text"].strip(" \n")'

Well....that was very helpful!

## Providing Context to GPT

> What if we could give GPT some context so that it can provide useful help?

In [16]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Context: 
Create an Tapis Client Object

The first step in using the Tapis Python SDK, tapipy, is to create a Tapis Client object. First, import the Tapis class and create python object called t that points to the Tapis server using your TACC username and password. Do so by typing the following in a Python shell:

# Import the Tapis object
from tapipy.tapis import Tapis

# Log into you the Tapis service by providing user/pass and url.
t = Tapis(base_url='https://tacc.tapis.io',
          username='your username',
          password='your password')

Generate a Token

With the t object instantiated, we can exchange our credentials for an access token. In Tapis, you never send your username and password directly to the services; instead, you pass an access token which is cryptographically signed by the OAuth server and includes information about your identity. The Tapis services use this token to determine who you are and what you can do.

    # Get tokens that will be used for authenticated function calls
    t.get_tokens()
    print(t.access_token.access_token)

    Out[1]: eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9...

Note that the tapipy t object will store and pass your access token for you, so you don’t have to manually provide the token when using the tapipy operations. You are now ready to check your access to the Tapis APIs. It will expire though, after 4 hours, at which time you will need to generate a new token. If you are interested, you can create an OAuth client (a one-time setup step, like creating a TACC account) that can be used to generate access and refresh tokens. For simplicity, we are skipping that but if you are interested, check out the Tenancy and Authentication section.
Q: How to generate a token using Tapipy?
A:
"""

gpt4(prompt)

To generate a token using Tapipy, you first need to create a Tapis Client object. After importing the Tapis class and creating a python object that points to the Tapis server using your TACC username and password, you can exchange your credentials for an access token. Here are the steps:

1. Import the Tapis object:


2. Log into the Tapis service by providing your username, password, and url:


3. Generate the tokens:


The output will be your access token. The Tapipy object will store and pass your access token for you, so you don’t have to manually provide the token when using the Tapipy operations.

--------------------------------------------------

from tapipy.tapis import Tapis
t = Tapis(base_url='https://tacc.tapis.io',
          username='your username',
          password='your password')
t.get_tokens()
print(t.access_token.access_token)


### (1) Create a word embedding as vector

In [19]:
import markdown2
from bs4 import BeautifulSoup
from transformers import GPT2TokenizerFast

import numpy as np
from nltk.tokenize import sent_tokenize

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Open the markdown file
with open("actor.md", "r") as file:
    content = file.read()

# Use markdown2 to convert the markdown file to html
html = markdown2.markdown(content)

# Use BeautifulSoup to parse the html
soup = BeautifulSoup(html, "html.parser")

# Initialize variables to store heading, subheading, and corresponding paragraphs
headings = []
paragraphs = []

data = []

MAX_WORDS = 500

def count_tokens(text: str) -> int:
    """count the number of tokens in a string"""
    return len(tokenizer.encode(text))

# Iterate through the tags in the soup
for tag in soup.descendants:
    # Check if the tag is a heading
    if tag.name in ["h1", "h2", "h3", "h4", "h5", "h6"]:
        # When the next heading is encountered, store the previous section
        # (joined headings, joined paragraphs, token count) and start a new one
        if headings and paragraphs:
            hdgs = " ".join(headings)
            para = " ".join(paragraphs)
            data.append([hdgs, para, count_tokens(para)])
            headings = []
            paragraphs = []
        # Add to heading
        headings.append(tag.text)
    # Check if the tag is a paragraph
    elif tag.name == "p":
        paragraphs.append(tag.text)

# Store the final section, which has no following heading to trigger the step above
if headings and paragraphs:
    para = " ".join(paragraphs)
    data.append([" ".join(headings), para, count_tokens(para)])

Token indices sequence length is longer than the specified maximum sequence length for this model (2262 > 1024). Running this sequence through the model will result in indexing errors
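
The grouping rule used above (collect consecutive headings, then attach every paragraph up to the next heading) can be tried on a small inline string without markdown2 or BeautifulSoup. This is a minimal sketch under stated assumptions: it works on raw Markdown lines with a regular expression rather than on parsed HTML, and `split_by_heading` is a hypothetical name.

```python
import re


def split_by_heading(markdown_text: str) -> list[tuple[str, str]]:
    """Group body text under the heading(s) that precede it (hypothetical helper)."""
    sections = []
    headings, body = [], []
    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            # A new heading closes the previous section once it has body text
            if headings and body:
                sections.append((" ".join(headings), " ".join(body)))
                headings, body = [], []
            headings.append(match.group(1).strip())
        elif line.strip():
            body.append(line.strip())
    # Flush the last section, which has no following heading
    if headings and body:
        sections.append((" ".join(headings), " ".join(body)))
    return sections


doc = (
    "# Actors\n"
    "## What is Abaco\n"
    "Abaco is a web service.\n"
    "\n"
    "## Getting Started\n"
    "Create an account.\n"
    "Install Docker.\n"
)
sections = split_by_heading(doc)
print(sections)
```

Like the notebook code, this sketch does not special-case code blocks, so a `# comment` line inside a code sample would be treated as a heading.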


We create a dataset and filter out any sections with 40 or fewer tokens, as those are unlikely to contain enough context to answer a question well.

In [20]:
df = pd.DataFrame(data, columns=["heading", "content", "tokens"])
df = df[df.tokens>40]
df = df.reset_index(drop=True) # reset index
df.head()

| | heading | content | tokens |
|---|---|---|---|
| 0 | Actors Introduction to Abaco What is Abaco | Abaco is an NSF-funded web service and distrib... | 131 |
| 1 | Using Abaco | Abaco is in production and has been adopted by... | 58 |
| 2 | Getting Started | This Getting Started guide will walk you throu... | 85 |
| 3 | Account Creation and Software Installation Cre... | The main instance of the Abaco platform is hos... | 77 |
| 4 | Create a Docker account | Docker is an open-source container runtime\npr... | 55 |


In [23]:
def testing(question, tokens=300):
    messages=[{"role": "user", "content": question}]

    response = client.chat.completions.create(model="gpt-4",
                                                max_tokens=tokens,
                                                temperature=0,
                                                messages=messages)
    print(response)
    # Extract the content
    content = response.choices[0].message.content

testing('who let the dogs out')

ChatCompletion(id='chatcmpl-8VOz2e5g76ZisRkbFSENcQnppdytk', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content='"The Baha Men" is the name of the band that released the popular song "Who Let the Dogs Out" in 2000. The phrase is often used humorously to refer to situations where things are getting out of control.', role='assistant', function_call=None, tool_calls=None))], created=1702494896, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=47, prompt_tokens=12, total_tokens=59))


In [21]:
def get_embedding(text: str, model: str=EMBEDDING_MODEL):
    # openai>=1.0.0 removed openai.Embedding; use the client's embeddings endpoint instead
    result = client.embeddings.create(
        model=model,
        input=text
    )
    return result.data[0].embedding


def compute_doc_embeddings(df: pd.DataFrame):
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps each row index to the embedding vector of that row's content.
    """
    return {
        idx: get_embedding(r.content) for idx, r in df.iterrows()
    }

### Word embedding as vectors

In [22]:
vector_embedding = compute_doc_embeddings(df)


In [10]:
df['vector_embedding'] = pd.Series(vector_embedding)
df.head()

| | heading | content | tokens | vector_embedding |
|---|---|---|---|---|
| 0 | Actors Introduction to Abaco What is Abaco | Abaco is an NSF-funded web service and distrib... | 131 | [-0.008385769091546535, -0.01955496147274971, ... |
| 1 | Using Abaco | Abaco is in production and has been adopted by... | 58 | [-0.017907971516251564, -0.008049326948821545,... |
| 2 | Getting Started | This Getting Started guide will walk you throu... | 85 | [-0.006924053188413382, -0.010912280529737473,... |
| 3 | Account Creation and Software Installation Cre... | The main instance of the Abaco platform is hos... | 77 | [-0.0048550949431955814, -0.020554518327116966... |
| 4 | Create a Docker account | Docker is an open-source container runtime\npr... | 55 | [-0.0035271041560918093, -0.03285187482833862,... |


### (2) Find the most similar document embeddings to the question embedding

We embed the query string and use it to find the most similar document sections. Since this is a small example, we store and search the embeddings locally.

In [11]:
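def cosine_similarity(a, b):
    # openai.embeddings_utils was removed in openai>=1.0.0, so compute it with NumPy
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))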

def order_documents_query_similarity(data, query_str, nres=3):
    embedding = get_embedding(query_str, model=EMBEDDING_MODEL)
    data['similarities'] = data.vector_embedding.apply(lambda x: cosine_similarity(x, embedding))

    res = data.sort_values('similarities', ascending=False).head(nres)
    return res
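
To make the ranking step concrete without an API call, the sketch below scores three toy 2-D vectors against a query vector and keeps the best two. The vectors and the `top_k` helper are made up for illustration; real `text-embedding-ada-002` embeddings are much longer, and this version uses only the standard library rather than NumPy.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs: dict, k: int = 2) -> list:
    """Return the k (index, score) pairs with the highest similarity (hypothetical helper)."""
    scores = {idx: cosine_similarity(vec, query_vec) for idx, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy "embeddings": document 0 points almost the same way as the query
toy_docs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
ranked = top_k([1.0, 0.1], toy_docs)
print(ranked)
```

Cosine similarity depends only on direction, not vector length, so a long section and a short section about the same topic can score equally well.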

We can see that the document sections most relevant to the token question are listed at the top.

In [12]:
res = order_documents_query_similarity(df, "How to generate a token using Tapipy")
res.head()

| | heading | content | tokens | vector_embedding | similarities |
|---|---|---|---|---|---|
| 10 | Get tokens that will be used for authenticated... | t.gettokens()\n print(t.accesstoken.access_to... | 205 | [-0.01216100063174963, -0.010938613675534725, ... | 0.828162 |
| 7 | Create an Tapis Client Object | The first step in using the Tapis Python SDK, ... | 69 | [-0.00659452797845006, -0.005156101193279028, ... | 0.824878 |
| 36 | Python with Tapipy | Setting up an Tapis object with token and API ... | 502 | [-0.015956224873661995, -0.0027807820588350296... | 0.816858 |


### (3) Add the most relevant document sections to the query prompt

In [13]:
question =  "How to generate a token using Tapipy"

In [14]:
def construct_prompt(question: str, df: pd.DataFrame, ncontents = 3) -> str:
    """
    Fetch the document sections most relevant to the question and
    prepend them to the question as context, within a token budget.
    """
    MAX_SECTION_LEN = 500

    # Rank the sections by similarity to the question (a single embedding call)
    context = order_documents_query_similarity(df, question, nres=ncontents)

    chosen_sections = []
    chosen_section_len = 0

    for _, ctx in context.iterrows():
        # Stop as soon as the next section would exceed the token budget
        chosen_section_len += ctx.tokens
        if chosen_section_len > MAX_SECTION_LEN:
            break
            
        chosen_sections.append(" " + ctx.content.replace("\n", " "))
    
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
    
    return header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"
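
The section-selection loop in `construct_prompt` is a greedy cut-off on a running token count. The sketch below isolates that rule using the three token counts from the ranking above (205, 69, 502); the section labels and the `select_within_budget` name are placeholders for illustration.

```python
def select_within_budget(sections, max_tokens: int = 500) -> list:
    """Keep ranked sections, in order, until the running token total exceeds the budget."""
    chosen, used = [], 0
    for text, n_tokens in sections:
        used += n_tokens
        if used > max_tokens:
            break  # this section and every lower-ranked one are dropped
        chosen.append(text)
    return chosen


# (label, token count) pairs in similarity order
ranked_sections = [("token section", 205), ("client section", 69), ("python section", 502)]
chosen = select_within_budget(ranked_sections)
print(chosen)
```

One consequence of this rule is that if the top-ranked section alone exceeds the budget, nothing is selected and the prompt carries no context at all.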



In [15]:
construct_prompt(question="How to generate a token using Tapipy", df=df)

'Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don\'t know."\n\nContext:\n t.gettokens()   print(t.accesstoken.access_token) Out[1]: eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9...   ``` Note that the tapipy t object will store and pass your access token for you, so you don\\\'t have to manually provide the token when using the tapipy operations. You are now ready to check your access to the Tapis APIs. It will expire though, after 4 hours, at which time you will need to generate a new token. If you are interested, you can create an OAuth client (a one-time setup step, like creating a TACC account) that can be used to generate access and refresh tokens. For simplicity, we are skipping that but if you are interested, check out the Tenancy and Authentication section. The first step in using the Tapis Python SDK, tapipy, is to create a Tapis Client object. First, import the Tapis class and create python o

### (4) Answer the user's question based on the context.



In [16]:
COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 300,
    "model": COMPLETIONS_MODEL,
}

def answer_query_with_context(
    query: str,
    df: pd.DataFrame,
    show_prompt: bool = False) -> str:
    
    prompt = construct_prompt(
        query,
        df
    )
    
    if show_prompt:
        print(prompt)

    # openai>=1.0.0 removed openai.Completion; use the client's completions endpoint instead
    response = client.completions.create(
                prompt=prompt,
                **COMPLETIONS_API_PARAMS
            )

    return response.choices[0].text.strip(" \n")

### Original GPT without context: it invents a Tapipy website and app for generating a token

In [20]:
prompt = "How to generate a token using Tapipy"

client.completions.create(prompt=prompt, temperature=0, max_tokens=300, model=COMPLETIONS_MODEL).choices[0].text.strip(" \n")

'1. Create an account on Tapipy.\n\n2. Log in to your account and go to the “My Apps” page.\n\n3. Click on “Create New App” and enter the details of your app.\n\n4. Once your app is created, click on “Generate Token”.\n\n5. Enter the details of the token you want to generate and click “Generate”.\n\n6. Your token will be generated and displayed on the screen.'

### When the question has matching context, it answers correctly!

In [17]:
answer_query_with_context("How to generate a token using Tapipy", df)

'Use the t.gettokens() command to generate a token using Tapipy.'

The answer points to the right call, but the name should read `t.get_tokens()`. The underscores are already missing from the retrieved section (see the `content` column in the ranking above), most likely because they were read as emphasis markers when `actor.md` was converted to HTML.

### When it doesn't know...at least it is honest!

In [18]:
answer_query_with_context("How to access files using Tapipy", df)

"I don't know."