## Why search is better than fine-tuning

GPT can learn knowledge in two ways:

- Via model weights (i.e., fine-tune the model on a training set)
- Via model inputs (i.e., insert the knowledge into an input message)

Although fine-tuning can feel like the more natural option—training on data is how GPT learned all of its other knowledge, after all—we generally do not recommend it as a way to teach the model knowledge. Fine-tuning is better suited to teaching specialized tasks or styles, and is less reliable for factual recall.

As an analogy, model weights are like long-term memory. When you fine-tune a model, it's like studying for an exam a week away. When the exam arrives, the model may forget details, or misremember facts it never read.

In contrast, message inputs are like short-term memory. When you insert knowledge into a message, it's like taking an exam with open notes. With notes in hand, the model is more likely to arrive at correct answers.
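As a rough illustration of the "open notes" approach, the sketch below builds a chat-style message list that carries reference text alongside the question. The helper name `build_messages`, the prompt wording, and the sample note are all assumptions made for illustration; they are not part of this notebook.

```python
# Hypothetical helper: puts reference notes into the user message
# so the model can read them while answering.
def build_messages(question: str, notes: list[str]) -> list[dict]:
    """Return chat messages with the notes inserted ahead of the question."""
    notes_block = "\n\n".join(f'Text:\n"""\n{note}\n"""' for note in notes)
    user_content = (
        "Use the text below to answer the question.\n\n"
        f"{notes_block}\n\nQuestion: {question}"
    )
    return [
        {"role": "system", "content": "You answer questions using the provided text."},
        {"role": "user", "content": user_content},
    ]


# Made-up note and question, purely for demonstration.
messages = build_messages(
    "When does the store open?",
    ["Example note: the store opens at 9am."],
)
print(messages[1]["content"])
```

The resulting list has the same shape as the `messages` argument used for chat completions later in this notebook.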

One downside of text search relative to fine-tuning is that each model is limited by a maximum amount of text it can read at once:

| Model           | Maximum text length       |
|-----------------|---------------------------|
| `gpt-3.5-turbo` | 4,096 tokens (~5 pages)   |
| `gpt-4`         | 8,192 tokens (~10 pages)  |
| `gpt-4-32k`     | 32,768 tokens (~40 pages) |

Continuing the analogy, you can think of the model like a student who can only look at a few pages of notes at a time, despite potentially having shelves of textbooks to draw upon.
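To make the limits concrete, here is a small sketch that checks whether a prompt of a given token count fits a model's context window while leaving room for the answer. The limits are copied from the table above; the function name `fits_in_context` and the default reserve of 500 tokens are arbitrary assumptions.

```python
# Context-window sizes copied from the table above (in tokens).
MAX_TOKENS = {
    "gpt-3.5-turbo": 4096,
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
}


def fits_in_context(model: str, prompt_tokens: int, reserved_for_answer: int = 500) -> bool:
    """Return True if the prompt plus a reserved answer budget fits the model's limit."""
    return prompt_tokens + reserved_for_answer <= MAX_TOKENS[model]


print(fits_in_context("gpt-3.5-turbo", 3000))  # 3000 + 500 is within 4096
print(fits_in_context("gpt-3.5-turbo", 4000))  # 4000 + 500 exceeds 4096
```

The limit covers the prompt and the generated answer together, which is why the sketch reserves part of the window for the response.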

Therefore, to build a system capable of drawing upon large quantities of text to answer questions, we recommend using a Search-Ask approach.


## Search

Text can be searched in many ways, for example:

- Lexical-based search
- Graph-based search
- Embedding-based search

This example notebook uses embedding-based search. [Embeddings](https://platform.openai.com/docs/guides/embeddings) are simple to implement and work especially well with questions, as questions often don't lexically overlap with their answers.

Consider embeddings-only search as a starting point for your own system. Better search systems might combine multiple search methods, along with features like popularity, recency, user history, redundancy with prior search results, click rate data, etc. Q&A retrieval performance may also be improved with techniques like [HyDE](https://arxiv.org/abs/2212.10496), in which questions are first transformed into hypothetical answers before being embedded. Similarly, GPT can also potentially improve search results by automatically transforming questions into sets of keywords or search terms.
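One way to combine search methods is to blend an embedding similarity score with a simple lexical signal. The sketch below is only an illustration: `keyword_overlap`, `hybrid_rank`, the 0.7 weight, and the embedding scores in the example are all made up, and a real system would tune the weighting against its own data.

```python
def keyword_overlap(query: str, text: str) -> float:
    """Jaccard overlap between the lowercase word sets of query and text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def hybrid_rank(query: str, candidates: list[tuple[str, float]], weight: float = 0.7):
    """Rank (text, embedding_score) pairs by a weighted mix of both signals.

    `weight` is the share given to the embedding score; 0.7 is an arbitrary choice.
    """
    scored = [
        (text, weight * emb_score + (1 - weight) * keyword_overlap(query, text))
        for text, emb_score in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up embedding scores: the second text scores slightly higher on
# embeddings alone, but shares no words with the query.
candidates = [
    ("curling results and medal table", 0.80),
    ("figure skating schedule", 0.82),
]
ranked = hybrid_rank("curling medal results", candidates)
print(ranked[0][0])
```

In this toy case the lexical signal moves the first text ahead of the second, even though its embedding score is marginally lower.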

## Full procedure

Specifically, this notebook demonstrates the following procedure:

1. Prepare search data (once per document)
    1. Collect: We'll download a few hundred Wikipedia articles about the 2022 Winter Olympics
    2. Chunk: Documents are split into short, mostly self-contained sections to be embedded
    3. Embed: Each section is embedded with the OpenAI API
    4. Store: Embeddings are saved (for large datasets, use a vector database)
2. Search (once per query)
    1. Given a user question, generate an embedding for the query from the OpenAI API
    2. Using the embeddings, rank the text sections by relevance to the query
3. Ask (once per query)
    1. Insert the question and the most relevant sections into a message to GPT
    2. Return GPT's answer
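The three stages above can be sketched end to end without any API calls. In this toy version, a bag-of-words count vector stands in for the OpenAI embedding, a Python list stands in for the embedding store, and the final step only builds the prompt rather than sending it. All names (`toy_embed`, `search`, `build_prompt`) and the sample document are assumptions for illustration.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embeddings API call: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# 1. Prepare search data (once per document): chunk, embed, store.
document = "Curling is played on ice.\n\nSki jumping is scored on distance and style."
chunks = [c.strip() for c in document.split("\n\n") if c.strip()]
store = [(chunk, toy_embed(chunk)) for chunk in chunks]


# 2. Search (once per query): embed the query, rank chunks by similarity.
def search(query: str, top_n: int = 1) -> list[str]:
    query_vec = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


# 3. Ask (once per query): insert the question and top chunks into a message.
def build_prompt(query: str) -> str:
    context = "\n\n".join(search(query))
    return f"Use the text below to answer.\n\n{context}\n\nQuestion: {query}"


print(build_prompt("How is ski jumping scored?"))
```

Swapping `toy_embed` for a real embedding call and sending the prompt to a chat model gives the shape of the procedure this notebook follows.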

### Costs

Because GPT is more expensive than embeddings search, a system with a decent volume of queries will have its costs dominated by step 3.

- For `gpt-3.5-turbo` using ~1,000 tokens per query, it costs ~$0.002 per query, or ~500 queries per dollar (as of Apr 2023)
- For `gpt-4`, again assuming ~1,000 tokens per query, it costs ~$0.03 per query, or ~30 queries per dollar (as of Apr 2023)

Of course, exact costs will depend on the system specifics and usage patterns.
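The arithmetic behind these figures is simple enough to sketch. The per-1,000-token prices below are the ones implied by the per-query costs above (as of Apr 2023); the function name `queries_per_dollar` is an assumption, and current pricing should be checked before relying on the numbers.

```python
def queries_per_dollar(price_per_1k_tokens: float, tokens_per_query: int) -> float:
    """Return how many queries one dollar buys at a given per-1,000-token price."""
    cost_per_query = price_per_1k_tokens * tokens_per_query / 1000
    return 1 / cost_per_query


# Prices implied by the per-query costs quoted above (Apr 2023).
print(round(queries_per_dollar(0.002, 1000)))  # gpt-3.5-turbo
print(round(queries_per_dollar(0.03, 1000)))   # gpt-4
```

At ~1,000 tokens per query this gives about 500 queries per dollar for `gpt-3.5-turbo` and about 33 for `gpt-4`, which the text rounds to ~30.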

## Preamble

We'll begin by:
- Importing the necessary libraries
- Selecting models for embeddings search and question answering



In [1]:
# imports
import ast  # for converting embeddings saved as strings back to arrays
import openai  # for calling the OpenAI API
import pandas as pd  # for storing text and embeddings data
import tiktoken  # for counting tokens
from scipy import spatial  # for calculating vector similarities for search
from transformers import GPT2Tokenizer # for counting tokens
import os
import spacy

# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

### Setting API Key

In [2]:
# Ensure the file is in your script's directory or provide the full path
# (raw string, so the backslashes in the Windows path are not treated as escape sequences)
filename = r"N:\CAREER\SEPEHR\EDUCATION\Brainstation\Data Science\GPT API Keys\OpenAI API.txt"

# Open the file in read mode ('r')
with open(filename, 'r') as file:
    openai.api_key = file.read().strip()  # read the key and strip leading/trailing whitespace

#### Listing available models

In [4]:
# Author: Viet Dac Lai
import pprint

GPT4 = 'gpt-4-0314'
MODEL_NAME = GPT4
model = openai.Model(MODEL_NAME)

def list_all_models():
    model_list = openai.Model.list()['data']
    model_ids = [x['id'] for x in model_list]
    model_ids.sort()
    pprint.pprint(model_ids)

if __name__ == '__main__':
    list_all_models()

['ada',
 'ada-code-search-code',
 'ada-code-search-text',
 'ada-search-document',
 'ada-search-query',
 'ada-similarity',
 'ada:2020-05-03',
 'babbage',
 'babbage-code-search-code',
 'babbage-code-search-text',
 'babbage-search-document',
 'babbage-search-query',
 'babbage-similarity',
 'babbage:2020-05-03',
 'code-davinci-edit-001',
 'code-search-ada-code-001',
 'code-search-ada-text-001',
 'code-search-babbage-code-001',
 'code-search-babbage-text-001',
 'curie',
 'curie-instruct-beta',
 'curie-search-document',
 'curie-search-query',
 'curie-similarity',
 'curie:2020-05-03',
 'cushman:2020-05-03',
 'davinci',
 'davinci-if:3.0.0',
 'davinci-instruct-beta',
 'davinci-instruct-beta:2.0.0',
 'davinci-search-document',
 'davinci-search-query',
 'davinci-similarity',
 'davinci:2020-05-03',
 'gpt-3.5-turbo',
 'gpt-3.5-turbo-0301',
 'if-curie-v2',
 'if-davinci-v2',
 'if-davinci:3.0.0',
 'text-ada-001',
 'text-ada:001',
 'text-babbage-001',
 'text-babbage:001',
 'text-curie-001',
 'text-curi

#### Troubleshooting: Installing libraries

If you need to install any of the libraries above, run `pip install {library_name}` in your terminal.

For example, to install the `openai` library, run:
```zsh
pip install openai
```

(You can also do this in a notebook cell with `!pip install openai` or `%pip install openai`.)

After installing, restart the notebook kernel so the libraries can be loaded.

#### Troubleshooting: Setting your API key

The OpenAI library will try to read your API key from the `OPENAI_API_KEY` environment variable. If you haven't already, you can set this environment variable by following [these instructions](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).
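If you prefer to read the environment variable explicitly, for example to fail early with a clear message, a minimal sketch could look like this. The helper name `load_api_key` and the error wording are assumptions; only `OPENAI_API_KEY` comes from the text above.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return key


# Usage (left commented out so the sketch runs without a key):
# openai.api_key = load_api_key()
```

This keeps the key out of the notebook and out of any file checked into version control.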

### Motivating example: GPT cannot answer questions about current events

Because the training data for `gpt-3.5-turbo` and `gpt-4` mostly ends in September 2021, the models cannot answer questions about more recent events.

## 1. Prepare search data

To save you the time & expense, we've prepared a pre-embedded dataset of a few hundred Wikipedia articles about the 2022 Winter Olympics.

To see how we constructed this dataset, or to modify it yourself, see [Embedding Wikipedia articles for search](Embedding_Wikipedia_articles_for_search.ipynb).

In [3]:
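# Path (raw string, so the backslashes in the Windows path are not treated as escape sequences)
embeddings_path = r"N:\CAREER\SEPEHR\EDUCATION\Brainstation\Data Science\Deliverables\Hackathon\OpenAI API\Embedding Search-Ask\df_GPT_embedded_final_no_reviews.csv"

# Reading
df = pd.read_csv(embeddings_path)

# Convert embeddings from their saved string form back to lists of floats,
# so they can be compared numerically during search
df["embedding"] = df["embedding"].apply(ast.literal_eval)

# Display
df.head(3)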

| Unnamed: 0 | text | embedding |
|------------|------|-----------|
| 0 | App_Name: Product Passage, Developer: Tezisto ... | [-0.012677876278758049, 0.019527003169059753, ... |
| 1 | App_Name: Smart Bundles, Developer: Gravitate,... | [0.002880027750506997, 0.015666238963603973, 0... |
| 2 | App_Name: AnyAsset ‑ Digital Downloads, Develo... | [-0.04260732978582382, -0.0012113063130527735,... |


## 2. Search

Now we'll define a search function that:
- Takes a user query and a dataframe with text & embedding columns
- Embeds the user query with the OpenAI API
- Uses distance between query embedding and text embeddings to rank the texts
- Returns two lists:
    - The top N texts, ranked by relevance
    - Their corresponding relevance scores

In [7]:
# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    # zip() yields tuples; convert to lists to match the declared return type
    return list(strings[:top_n]), list(relatednesses[:top_n])
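To see what the ranking step does without calling the API, the sketch below applies the same idea to made-up 3-dimensional vectors. `cosine_relatedness` is a pure-Python stand-in that computes the same quantity as `1 - spatial.distance.cosine(x, y)`; the vectors and text labels are invented for illustration.

```python
import math


def cosine_relatedness(x: list[float], y: list[float]) -> float:
    """Cosine similarity; the same quantity as 1 - spatial.distance.cosine(x, y)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm


# Made-up 3-dimensional "embeddings"; real text-embedding-ada-002 vectors have 1,536 dimensions.
query_embedding = [1.0, 0.0, 0.0]
rows = [
    ("text A", [0.9, 0.1, 0.0]),
    ("text B", [0.0, 1.0, 0.0]),
    ("text C", [0.5, 0.5, 0.0]),
]

# Score every row against the query, then sort from most to least related.
ranked = sorted(
    ((text, cosine_relatedness(query_embedding, emb)) for text, emb in rows),
    key=lambda pair: pair[1],
    reverse=True,
)
for text, score in ranked:
    print(f"{text}: {score:.3f}")
```

The vector pointing in nearly the same direction as the query ranks first, and the orthogonal one ranks last with a score of zero.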


## 3. Ask

With the search function above, we can now automatically retrieve relevant knowledge and insert it into messages to GPT.

Below, we define a function `ask` that:
- Takes a user query
- Searches for text relevant to the query
- Stuffs that text into a message for GPT
- Sends the message to GPT
- Returns GPT's answer

### Conversation

In [4]:
# Set up the tokenizer that matches the chat model, so counts reflect
# the tokens the API actually bills and limits (word counts undercount them)
encoding = tiktoken.encoding_for_model(GPT_MODEL)

def count_tokens(text: str) -> int:
    """Return the number of tokens in a string, as counted by the model's tokenizer."""
    return len(encoding.encode(text))

def query_message(
    query: str,
    df: pd.DataFrame,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from the DataFrame."""
    
    # Instructions sent ahead of the retrieved texts, one instruction per line
    introduction = (
        "- Assist the merchant with building their Shopify website.\n"
        "- If you are asked a question, make sure to ask a question back to confirm you are replying accurately.\n"
        "- If the user asks for recommendations, use the appstore data provided to give 2 recommendations, and write a short summary of each recommendation.\n"
        "- Do not forget to give a recommendation, as you don't want to waste the user's time.\n"
        "- Prioritize free apps.\n"
        "- Put the FULL app name into quotes! We want to be able to query it.\n"
        "- Keep it within 300 characters."
    )

    question = f"\n\nQuestion: {query}"
    message = introduction
    
    if df is not None:
        for _, row in df.iterrows():
            text = row['text']
            next_article = f'\n\nText:\n"""\n{text}\n"""'
            if (
                count_tokens(message + next_article + question)
                > token_budget
            ):
                break
            else:
                message += next_article
    return message + question

def ask(
    query: str,
    model_selected: str = "gpt-3.5-turbo",
    token_budget: int = 4096 - 100,
    print_message: bool = False,
    history: list = None,
) -> str:
    """Answer a query with GPT, using `history` as the running conversation.

    As written, the query is sent as-is: this function does not call
    `query_message` or the search function, so no retrieved text is added.
    """
    if history is None:  # avoid sharing one mutable default list across calls
        history = []
    message = query

    if print_message:
        print(message)

    messages = history.copy()  # copy so the trimming below does not modify the history
    messages.append({"role": "user", "content": message})

    # Drop the oldest non-system messages until the conversation fits the budget
    while count_tokens(''.join(m['content'] for m in messages)) > token_budget and len(messages) > 2:
        messages.pop(1)  # index 0 is the system message, so index 1 is the oldest exchange

    max_tokens = token_budget - count_tokens(''.join(m['content'] for m in messages))  # remaining tokens for the response

    response = openai.ChatCompletion.create(
        model=model_selected,
        messages=messages,
        temperature=0,
        max_tokens=max_tokens,
    )
    response_message = response["choices"][0]["message"]["content"]

    # Store both the user's message and the assistant's reply, so later turns see the full exchange
    history.append({"role": "user", "content": message})
    history.append({"role": "assistant", "content": response_message})

    return response_message

# Initialise the conversation history
history = [
    {"role": "system", "content": "You are a helpful assistant that aids in Shopify Webstore Development by suggesting apps and providing captions for products."},
]
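The history-trimming step inside `ask` can be tried offline. The sketch below uses a hypothetical helper, `trim_history`, with a whitespace word count standing in for a real tokenizer. It also stops once only the system message and the newest message remain, which is a design choice of this sketch. The sample conversation is made up.

```python
def trim_history(messages: list[dict], token_budget: int, count_tokens) -> list[dict]:
    """Drop the oldest non-system messages until the conversation fits the budget."""
    trimmed = list(messages)  # copy so the caller's history is untouched

    def total(msgs: list[dict]) -> int:
        return sum(count_tokens(m["content"]) for m in msgs)

    # Index 0 is the system message, so the oldest droppable message is index 1.
    while total(trimmed) > token_budget and len(trimmed) > 2:
        trimmed.pop(1)
    return trimmed


def word_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "first question about apps"},
    {"role": "assistant", "content": "first answer with several words in it"},
    {"role": "user", "content": "second question"},
]
kept = trim_history(conversation, token_budget=10, count_tokens=word_count)
print([m["content"] for m in kept])
```

With a budget of 10 words, the first exchange is dropped and only the system message and the latest question are kept, while the original `conversation` list is left unchanged.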

### Example questions

In [39]:
# Ask the first question
response = ask('Recommend me 6 good starting apps', model_selected=GPT_MODEL, history=history)  
print(response)

Sure, here are 6 great starting apps for different purposes:

1. Trello - Trello is a project management app that allows you to organize tasks and projects on a visual board. It's great for keeping track of to-do lists, deadlines, and team collaboration.

2. Grammarly - Grammarly is a writing app that helps you improve your writing skills by checking for grammar, spelling, and punctuation errors. It's great for writing emails, reports, and other business documents.

3. Canva - Canva is a graphic design app that allows you to create professional-looking designs for social media, marketing materials, and presentations. It's great for creating visual content without needing to hire a graphic designer.

4. Google Analytics - Google Analytics is a web analytics app that allows you to track website traffic, user behavior, and conversion rates. It's great for understanding how your website is performing and making data-driven decisions.

5. Slack - Slack is a communication app that allows you

In [34]:
# Ask the second question
response = ask('Are there any alternatives to the second one?', model_selected=GPT_MODEL, history=history)  
print(response)

Yes, there are several alternatives to Grammarly that you can consider:

1. ProWritingAid - ProWritingAid is a writing app that offers grammar and spelling checks, as well as style and readability suggestions. It also offers a thesaurus and a contextual thesaurus to help you find the right words.

2. Hemingway Editor - Hemingway Editor is a writing app that helps you simplify your writing and make it more readable. It highlights complex sentences, adverbs, and passive voice, and suggests simpler alternatives.

3. Ginger Software - Ginger Software is a writing app that offers grammar and spelling checks, as well as a sentence rephraser and a translator. It also offers a personal trainer feature that helps you improve your writing skills over time.

4. WhiteSmoke - WhiteSmoke is a writing app that offers grammar and spelling checks, as well as style and punctuation suggestions. It also offers a translator and a plagiarism checker.

5. LanguageTool - LanguageTool is a writing app that off

In [35]:
# Ask the third question
response = ask('Which of these are cheapest?', model_selected=GPT_MODEL, history=history)  
print(response)

Of the alternatives I mentioned, LanguageTool is the cheapest option as it offers a free version with basic grammar and spelling checks. However, the free version has limited features and is only available for personal use. 

ProWritingAid also offers a free version with limited features, but it has a word limit of 500 words per check. The premium version of ProWritingAid starts at $70 per year.

Hemingway Editor has a one-time fee of $19.99 for the desktop app, and the online version is free to use.

Ginger Software offers a free version with basic grammar and spelling checks, but it has limited features. The premium version of Ginger Software starts at $20.97 per month.

WhiteSmoke offers a free trial, but the premium version starts at $5 per month for the basic plan and goes up to $11.50 per month for the premium plan.


#### Using the `.py` implementation

In [36]:
# Create an instance of the class
assistant = ShopifyAssistant()

# Call the ask() method with a query
response = assistant.ask(query="Recommend me 6 good starting apps")

# Print the response
print(response)

Token indices sequence length is longer than the specified maximum sequence length for this model (1748 > 1024). Running this sequence through the model will result in indexing errors


Sure, I'd be happy to help! Can you please clarify what you mean by "starting apps"? Are you looking for apps to help you get started with building your Shopify website, or are you looking for apps to help you with specific tasks such as marketing, inventory management, or customer service?


In [37]:
# Create an instance of the class
assistant = ShopifyAssistant()

# Call the ask() method with a query
response = assistant.ask(query="Yes I'm looking to build my shopify website")

# Print the response
print(response)

Token indices sequence length is longer than the specified maximum sequence length for this model (1752 > 1024). Running this sequence through the model will result in indexing errors


Great! What kind of products will you be selling on your website?


In [38]:
# Create an instance of the class
assistant = ShopifyAssistant()

# Call the ask() method with a query
response = assistant.ask(query="Jewelry")

# Print the response
print(response)

Token indices sequence length is longer than the specified maximum sequence length for this model (1744 > 1024). Running this sequence through the model will result in indexing errors


What type of jewelry are you looking to sell on your Shopify website?
