# GPT user support

> Semantic search enabled via GPT and context-specific responses

In [5]:
import numpy as np
import openai
from openai import OpenAI
import os
import pandas as pd
import pickle

COMPLETIONS_MODEL = "text-davinci-003"
EMBEDDING_MODEL = "text-embedding-ada-002"

# Authenticate with OpenAI API
with open('apiKeys.txt', 'r') as temp:
    apiKey = temp.read().strip()  # drop any trailing newline from the key file
client = OpenAI(api_key=apiKey)

In [None]:
def gpt4(question, tokens=500):
    messages=[{"role": "user", "content": question}]

    response = client.chat.completions.create(model="gpt-4",
                                                max_tokens=tokens,
                                                temperature=0,
                                                messages=messages)

    # Extract the content
    content = response.choices[0].message.content

    # Split the content into text and code
    text_parts = []
    code_parts = []
    in_code_block = False

    for line in content.split("\n"):
        if line.startswith("```"):
            in_code_block = not in_code_block
            continue
        if in_code_block:
            code_parts.append(line)
        else:
            text_parts.append(line)

    # Print the text parts
    for line in text_parts:
        print(line)

    # Print a separator
    print("\n" + "-"*50 + "\n")

    # Print the code parts
    for line in code_parts:
        print(line)
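
The fence-splitting loop inside `gpt4` can be pulled out into a small pure function, which makes it testable without calling the API. The following is a minimal sketch; the name `split_text_and_code` and the sample reply are illustrative assumptions, not part of the notebook above.

```python
FENCE = "`" * 3  # the three-backtick markdown fence marker


def split_text_and_code(content: str) -> tuple[list[str], list[str]]:
    """Separate fenced code lines from prose lines in a markdown reply."""
    text_parts, code_parts = [], []
    in_code_block = False
    for line in content.split("\n"):
        # A fence line toggles the state and is itself discarded.
        if line.startswith(FENCE):
            in_code_block = not in_code_block
            continue
        (code_parts if in_code_block else text_parts).append(line)
    return text_parts, code_parts


# Hypothetical reply used only to exercise the function.
reply = "Here is an example:\n" + FENCE + "python\nprint('hello')\n" + FENCE + "\nDone."
text, code = split_text_and_code(reply)
print(text)
print(code)
```

Note that a reply cut off by `max_tokens` leaves the last fence unclosed, so every line after it is treated as code.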

## GPT Hallucination (lying)

In [None]:
prompt = "How to generate a token using Tapipy"

gpt4(prompt,300)

Tapipy (Tapis Python SDK) is a Python client for the Tapis API. However, generating a token is not done through Tapipy itself, but through the Tapis API. 

Here's a general way to generate a token using Tapis API:

1. First, you need to have your Tapis tenant id, username, and password.

2. You can then use the Tapis API's `tokens` endpoint to generate a token. Here's an example using Python's `requests` library:


Replace `"your_username"`, `"your_password"`, and `"your_tenant_id"` with your actual Tapis username, password, and tenant id.

3. The `access_token` in the response is your token.

Please note that this is a general way to generate a token using the Tapis API, and the actual steps may vary depending on your specific use case and settings. Always refer to the official T

--------------------------------------------------

import requests
import json

url = "https://api.tapis.io/tokens"

payload = {
    "grant_type": "password",
    "username": "your_username",
    "password"

There is no website called "Tapipy" where you create an account... all of this is wrong!

## Forcing GPT to not lie!

In [None]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Q: How to generate a token using Tapipy?
A:
"""

gpt4(prompt,300)

Sorry, I don't know.

--------------------------------------------------



Well....that was very helpful!

## Providing Context to GPT

> What if we could give GPT some context so that it can provide useful help?

In [None]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Context: 
Create an Tapis Client Object

The first step in using the Tapis Python SDK, tapipy, is to create a Tapis Client object. First, import the Tapis class and create python object called t that points to the Tapis server using your TACC username and password. Do so by typing the following in a Python shell:

# Import the Tapis object
from tapipy.tapis import Tapis

# Log into you the Tapis service by providing user/pass and url.
t = Tapis(base_url='https://tacc.tapis.io',
          username='your username',
          password='your password')

Generate a Token

With the t object instantiated, we can exchange our credentials for an access token. In Tapis, you never send your username and password directly to the services; instead, you pass an access token which is cryptographically signed by the OAuth server and includes information about your identity. The Tapis services use this token to determine who you are and what you can do.

    # Get tokens that will be used for authenticated function calls
    t.get_tokens()
    print(t.access_token.access_token)

    Out[1]: eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9...

Note that the tapipy t object will store and pass your access token for you, so you don’t have to manually provide the token when using the tapipy operations. You are now ready to check your access to the Tapis APIs. It will expire though, after 4 hours, at which time you will need to generate a new token. If you are interested, you can create an OAuth client (a one-time setup step, like creating a TACC account) that can be used to generate access and refresh tokens. For simplicity, we are skipping that but if you are interested, check out the Tenancy and Authentication section.
Q: How to generate a token using Tapipy?
A:
"""

gpt4(prompt)

To generate a token using Tapipy, you first need to create a Tapis Client object. After importing the Tapis class and creating a python object that points to the Tapis server using your TACC username and password, you can exchange your credentials for an access token. Here are the steps:

1. Import the Tapis object:


2. Log into the Tapis service by providing your username, password, and url:


3. Generate the tokens:


The output will be your access token. The Tapipy t object will store and pass your access token for you, so you don’t have to manually provide the token when using the Tapipy operations.

--------------------------------------------------

from tapipy.tapis import Tapis
t = Tapis(base_url='https://tacc.tapis.io',
          username='your username',
          password='your password')
t.get_tokens()
print(t.access_token.access_token)


### (1) Create a word embedding as vector

In [None]:
import markdown2
from bs4 import BeautifulSoup
from transformers import GPT2TokenizerFast

import numpy as np
from nltk.tokenize import sent_tokenize

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

context = 'actor'

# Open the markdown file
with open(os.path.join(context + '.md'), "r") as file:
    content = file.read()

# Use markdown2 to convert the markdown file to html
html = markdown2.markdown(content)

# Use BeautifulSoup to parse the html
soup = BeautifulSoup(html, "html.parser")

# Initialize variables to store heading, subheading, and corresponding paragraphs
headings = []
paragraphs = []

data = []

MAX_WORDS = 500

def count_tokens(text: str) -> int:
    """count the number of tokens in a string"""
    return len(tokenizer.encode(text))

# Iterate through the tags in the soup
for tag in soup.descendants:
    # Check if the tag is a heading
    if tag.name in ["h1", "h2", "h3", "h4", "h5", "h6"]:
        # When the next heading is encountered, store the previous heading(s) and their paragraphs
        if headings and paragraphs:
            hdgs = " ".join(headings)
            para = " ".join(paragraphs)
            data.append([hdgs, para, count_tokens(para)])
            headings = []
            paragraphs = []
        # Add to heading
        headings.append(tag.text)
    # Check if the tag is a paragraph
    elif tag.name == "p":
        paragraphs.append(tag.text)

Token indices sequence length is longer than the specified maximum sequence length for this model (2262 > 1024). Running this sequence through the model will result in indexing errors
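
The warning above comes from the GPT-2 tokenizer's 1,024-token model limit. The tokenizer is only used for counting here, but the message does show that at least one section is 2,262 tokens long, far more than the 500-token context budget used later in `construct_prompt`. One option is to split long sections into smaller chunks before embedding them. The sketch below is an assumption about how that could be done, not something this notebook does: `chunk_paragraphs` is a hypothetical helper, and it counts whitespace-separated words by default as a stand-in for `count_tokens`.

```python
def chunk_paragraphs(paragraphs, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack paragraphs into chunks of at most max_tokens each.

    A single paragraph larger than max_tokens becomes its own oversized chunk.
    """
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        # Close the current chunk if adding this paragraph would overflow it.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


paras = ["one two three", "four five", "six seven eight nine", "ten"]
chunks = chunk_paragraphs(paras, max_tokens=5)
print(chunks)
```

With a real tokenizer, the token count of a joined chunk can differ slightly from the sum of its parts, so it is worth leaving some headroom in `max_tokens`.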


We create a dataset and filter out any sections with 40 tokens or fewer, as those are unlikely to contain enough context to ask a good question.

In [None]:
df = pd.DataFrame(data, columns=["heading", "content", "tokens"])
df = df[df.tokens>40]
df = df.reset_index(drop=True) # reset index
df.head()

|   | heading | content | tokens |
|---|---|---|---|
| 0 | Actors Introduction to Abaco What is Abaco | Abaco is an NSF-funded web service and distrib... | 131 |
| 1 | Using Abaco | Abaco is in production and has been adopted by... | 58 |
| 2 | Getting Started | This Getting Started guide will walk you throu... | 85 |
| 3 | Account Creation and Software Installation Cre... | The main instance of the Abaco platform is hos... | 77 |
| 4 | Create a Docker account | Docker is an open-source container runtime\npr... | 55 |


In [None]:
def get_embedding(text: str, model: str=EMBEDDING_MODEL):
    result = client.embeddings.create(model=model,
                                        input=text).data[0].embedding
    return result


def compute_doc_embeddings(df: pd.DataFrame):
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps each row index to the embedding vector of that row's content.
    """
    return {
        idx: get_embedding(r.content) for idx, r in df.iterrows()
    }
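
`compute_doc_embeddings` makes one API request per row. The embeddings endpoint also accepts a list of strings as `input`, so the same vectors can be fetched in fewer requests. The sketch below shows one way to do that; `embed_in_batches` and the `batch_size` of 100 are hypothetical choices, and the client is passed in as an argument so the function does not depend on a global.

```python
def embed_in_batches(client, texts, model, batch_size=100):
    """Return one embedding per input text, sending batch_size texts per request."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        # Each result carries the index of its input; sort so order matches the batch.
        for item in sorted(response.data, key=lambda d: d.index):
            embeddings.append(item.embedding)
    return embeddings


# Possible usage with the objects defined in this notebook (requires an API key):
# vectors = embed_in_batches(client, df.content.tolist(), EMBEDDING_MODEL)
# df["vector_embedding"] = vectors
```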

### Word embedding as vectors

In [None]:
vector_embedding = compute_doc_embeddings(df)

In [None]:
df['vector_embedding'] = pd.Series(vector_embedding)
df.head()

|   | heading | content | tokens | vector_embedding |
|---|---|---|---|---|
| 0 | Actors Introduction to Abaco What is Abaco | Abaco is an NSF-funded web service and distrib... | 131 | [-0.008369413204491138, -0.019507333636283875,... |
| 1 | Using Abaco | Abaco is in production and has been adopted by... | 58 | [-0.017935028299689293, -0.008070049807429314,... |
| 2 | Getting Started | This Getting Started guide will walk you throu... | 85 | [-0.006894299294799566, -0.0108660152181983, 0... |
| 3 | Account Creation and Software Installation Cre... | The main instance of the Abaco platform is hos... | 77 | [-0.004799393005669117, -0.02063984051346779, ... |
| 4 | Create a Docker account | Docker is an open-source container runtime\npr... | 55 | [-0.003532970556989312, -0.0329020619392395, 0... |


Stash the dataframe so that in the future we can just load it without having to revectorize it. This will save some time.

In [None]:
df.to_csv(os.path.join('vectorizedDataFrames', context))
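
One thing to watch when loading the stashed file: a CSV stores each embedding list as its text representation, so the `vector_embedding` column comes back as strings and has to be parsed before it can be used for similarity search. A minimal sketch of that parsing step follows; `parse_embedding` is a hypothetical helper name.

```python
import json


def parse_embedding(cell: str) -> list[float]:
    """Convert the text form of a stored embedding back into a list of floats."""
    return [float(x) for x in json.loads(cell)]


stored = str([-0.0083, 0.0195, 1e-05])  # what a list looks like once written to CSV
restored = parse_embedding(stored)
print(restored)

# Possible usage when reloading the dataframe saved above:
# df = pd.read_csv(os.path.join('vectorizedDataFrames', context), index_col=0,
#                  converters={"vector_embedding": parse_embedding})
```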

### (2) Find the most similar document embeddings to the question embedding

We embed the query string and use it to find the most similar document sections. Since this is a small example, we store and search the embeddings locally.

In [None]:
from scipy.spatial.distance import cosine

def order_documents_query_similarity(data, query_str, nres=3):
    embedding = get_embedding(query_str, model=EMBEDDING_MODEL)
    data['similarities'] = data.vector_embedding.apply(lambda x: 1-cosine(x, embedding))

    res = data.sort_values('similarities', ascending=False).head(nres)
    return res
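
SciPy's `cosine` returns a distance, which is why the function above uses `1 - cosine(x, embedding)` to get a similarity. To make the ranking step concrete, here is a dependency-free sketch of the same idea on toy vectors; `cosine_similarity` and `top_matches` are illustrative names, and the two-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_vec, doc_vecs, nres=3):
    """Return the nres (key, similarity) pairs most similar to the query."""
    scored = [(key, cosine_similarity(vec, query_vec)) for key, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:nres]


docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = top_matches([1.0, 0.0], docs, nres=2)
print(ranked)
```

Document `a` points in the same direction as the query and ranks first, `c` is at 45 degrees and ranks second, and the orthogonal `b` is dropped.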

We can see that the document sections most relevant to the token question are listed at the top.

In [63]:
res = order_documents_query_similarity(df, "How to generate a token using Tapipy")
res.head()

|   | heading | content | tokens | vector_embedding | similarities |
|---|---|---|---|---|---|
| 10 | Get tokens that will be used for authenticated... | t.gettokens()\n print(t.accesstoken.access_to... | 205 | [-0.012237769551575184, -0.01072288304567337, ... | 0.827899 |
| 7 | Create an Tapis Client Object | The first step in using the Tapis Python SDK, ... | 69 | [-0.0065297014079988, -0.00512651028111577, 0.... | 0.825042 |
| 36 | Python with Tapipy | Setting up an Tapis object with token and API ... | 502 | [-0.01604623533785343, -0.002748530823737383, ... | 0.816826 |


### (3) Add the most relevant document sections to the query prompt

In [64]:
question =  "How to generate a token using Tapipy"

In [65]:
def construct_prompt(question: str, df: pd.DataFrame, ncontents = 3) -> str:
    """
    Build a prompt from the document sections most similar to the question.

    Sections are added in order of similarity until adding the next one
    would exceed MAX_SECTION_LEN tokens.
    """
    MAX_SECTION_LEN = 500

    # Rank the sections once; every call embeds the question through the API.
    context = order_documents_query_similarity(df, question, nres=ncontents)

    chosen_sections = []
    chosen_section_len = 0

    for _, ctx in context.iterrows():
        chosen_section_len += ctx.tokens
        if chosen_section_len > MAX_SECTION_LEN:
            break

        chosen_sections.append(" " + ctx.content.replace("\n", " "))

    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""

    return header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"



In [66]:
construct_prompt(question="How to generate a token using Tapipy", df=df)

'Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don\'t know."\n\nContext:\n t.gettokens()   print(t.accesstoken.access_token) Out[1]: eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9...   ``` Note that the tapipy t object will store and pass your access token for you, so you don\\\'t have to manually provide the token when using the tapipy operations. You are now ready to check your access to the Tapis APIs. It will expire though, after 4 hours, at which time you will need to generate a new token. If you are interested, you can create an OAuth client (a one-time setup step, like creating a TACC account) that can be used to generate access and refresh tokens. For simplicity, we are skipping that but if you are interested, check out the Tenancy and Authentication section. The first step in using the Tapis Python SDK, tapipy, is to create a Tapis Client object. First, import the Tapis class and create python o
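
The section-selection loop in `construct_prompt` can be isolated to see how the token budget behaves. The sketch below reproduces that loop on plain tuples; `select_sections` is a hypothetical name, and the token counts 205, 69, and 502 are taken from the three ranked rows shown earlier, with placeholder text in place of the real content.

```python
def select_sections(sections, max_tokens=500):
    """Pick sections in ranked order until the running token total exceeds max_tokens.

    sections: iterable of (content, token_count) pairs, most relevant first.
    """
    chosen, used = [], 0
    for content, tokens in sections:
        used += tokens
        if used > max_tokens:
            break  # stop at the first section that does not fit
        chosen.append(" " + content.replace("\n", " "))
    return chosen


ranked = [("first\nsection", 205), ("second section", 69), ("third section", 502)]
picked = select_sections(ranked)
print(picked)
```

Only the first two sections fit, which matches the prompt printed above. Because the loop breaks instead of skipping, a top-ranked section larger than the budget leaves the context empty even if smaller sections follow, so `select_sections([("big", 600), ("small", 50)])` returns an empty list.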

### (4) Answer the user's question based on the context.



In [67]:
COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 300,
    "model": COMPLETIONS_MODEL,
}

def answer_query_with_context(
    query: str,
    df: pd.DataFrame,
    show_prompt: bool = False) -> str:
    
    prompt = construct_prompt(
        query,
        df
    )
    
    if show_prompt:
        print(prompt)

    response = client.completions.create(
                prompt=prompt,
                **COMPLETIONS_API_PARAMS
            )

    return response.choices[0].text.strip(" \n")
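
`answer_query_with_context` sends the prompt to the legacy completions endpoint with `text-davinci-003`. If that model is not available to you, the same prompt can be sent through the chat endpoint that the `gpt4` helper at the top of this notebook already uses. This is a minimal sketch under that assumption; `answer_with_chat_model` is a hypothetical name, and the client is passed in explicitly.

```python
def answer_with_chat_model(client, prompt: str, model: str = "gpt-4",
                           max_tokens: int = 300) -> str:
    """Send a fully constructed prompt as a single user message and return the reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # same deterministic setting as COMPLETIONS_API_PARAMS
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content.strip(" \n")


# Possible usage with the objects defined in this notebook (requires an API key):
# answer_with_chat_model(client, construct_prompt("How to generate a token using Tapipy", df))
```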

### Original GPT without context: it invents a Tapipy website and app to generate a token

In [68]:
prompt = "How to generate a token using Tapipy"
response = client.completions.create(prompt=prompt, temperature=0, max_tokens=300, model=COMPLETIONS_MODEL)
response.choices[0].text.strip(" \n")

'1. Create an account on Tapipy.\n\n2. Log in to your account and go to the “My Apps” page.\n\n3. Click on “Create New App” and enter the details of your app.\n\n4. Once your app is created, click on “Generate Token”.\n\n5. Enter the details of the token you want to generate and click “Generate”.\n\n6. Your token will be generated and displayed on the screen.'

### When it can find context for the question, it answers correctly!

In [69]:
answer_query_with_context("How to generate a token using Tapipy", df)

'Use the t.gettokens() command to generate a token using Tapipy.'

### When it doesn't know...at least it is honest!

In [70]:
answer_query_with_context("How to access files using Tapipy", df)

"I don't know."