# Exploring the Nuances of Customizing GPT by Teaching GPT about Current Events

With all the hype around ChatGPT, we wanted to explore what it's like to develop using OpenAI's APIs. What would it be like to train a model using GPT-3 as a platform? How much training data would it take? How much would it cost? What would we learn and how hard would it be?

To learn more, we needed to pick a niche that GPT-3 didn't know about. We chose current events. The latest version of GPT-3 (often referred to as GPT-3.5) is trained on data through June 2021, so the model knows nothing about anything that happened afterwards. By downloading RSS feeds of popular news sites, we could teach GPT-3 about things it didn't know, then test the new model by asking it questions about events that happened after June 2021.

OpenAI offers a service called fine tuning, which lets you customize a model by feeding it example prompts and responses that illustrate what it should learn. This was our first approach to the problem.

## Fine Tuning with OpenAI's GPT models

First we installed OpenAI's Python package. As a test topic we chose something that requires recent information: the train derailment in Ohio in 2023.

In [14]:
!pip install --upgrade openai

Defaulting to user installation because normal site-packages is not writeable
Collecting openai
  Using cached openai-0.26.5-py3-none-any.whl
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 0.26.3
    Uninstalling openai-0.26.3:
      Successfully uninstalled openai-0.26.3
Successfully installed openai-0.26.5


In [15]:
import openai
import json

We'll also need to set up the API key we obtained from OpenAI.com. At the time of writing, OpenAI gives you $18 in free credit, which is more than enough to run this notebook and do much more experimentation.

In [16]:
import os
os.environ['OPENAI_API_KEY'] = "Add OpenAI key here"
openai.api_key = "Add OpenAI key here"

As a baseline, let's query OpenAI's state-of-the-art GPT-3.5 model, Davinci (`text-davinci-003`), and ask it about the recent train derailment.

In [24]:
prompt = "Where did the train carrying hazardous materials derail?"

In [25]:
result = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt
)
print(result["choices"][0]["text"])



The exact location of the train derailment is not available, as different


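Well, that didn't work. The model doesn't know about the derailment, and the reply is cut off mid-sentence, most likely because we left `max_tokens` at its small default. Let's try fine tuning a model on some data about the train derailment in Ohio. We'll prepare the data, save it to a file, and then upload that file to OpenAI.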

In [28]:
# from https://en.wikipedia.org/wiki/2023_Ohio_train_derailment
examples = [
    {"prompt": "2023 Ohio train derailment", "completion": "The 2023 Ohio train derailment (also called the East Palestine train derailment) occurred on February 3, 2023, at 8:55 p.m. EST (UTC−5), when a Norfolk Southern freight train carrying hazardous materials derailed in East Palestine, Ohio, United States.[1] The freight train burned for more than two days, and then emergency crews conducted a controlled burn of several railcars at the request of state officials,[2] which released hydrogen chloride and phosgene into the air.[1] As a result, residents within a 1-mile (1.6-kilometer) radius were evacuated, and an emergency response was initiated from agencies in Ohio, Pennsylvania, and West Virginia. The U.S. federal government sent Environmental Protection Agency (EPA) administrator Michael S. Regan to provide assistance on February 16, 2023."}
]

In [29]:
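# Use a context manager so the file is flushed and closed before we upload it.
with open("trainingdata.jsonl", "w") as f:
    for example in examples:
        # One JSON object per line (JSONL).
        f.write(json.dumps(example) + "\n")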

In [30]:
file = openai.File.create(file=open("trainingdata.jsonl"), purpose='fine-tune')

From here we can tell OpenAI to begin fine tuning, using Davinci as the base model and our file about the 2023 Ohio train derailment as the additional training data.

In [31]:
fine_tune = openai.FineTune.create(training_file=file['id'], model="davinci")

We can use the following console command to track the fine tuning's progress. It will likely take about 30 minutes before the job completes and we can query the new model. If the command fails, you can run it again to resume polling.

In [69]:
!openai api fine_tunes.follow -i {fine_tune['id']}

[2023-02-23 15:59:43] Created fine-tune: ft-av2Lfjr4eAObz9DFPPS7WX6G
[2023-02-23 16:06:37] Fine-tune costs $0.02
[2023-02-23 16:06:37] Fine-tune enqueued
[2023-02-23 18:16:05] Fine-tune is in the queue. Queue number: 31
[2023-02-23 18:17:02] Fine-tune is in the queue. Queue number: 30
[2023-02-23 18:17:30] Fine-tune is in the queue. Queue number: 29
[2023-02-23 18:19:17] Fine-tune is in the queue. Queue number: 28
[2023-02-23 18:19:50] Fine-tune is in the queue. Queue number: 27
[2023-02-23 18:20:19] Fine-tune is in the queue. Queue number: 26
[2023-02-23 18:20:21] Fine-tune is in the queue. Queue number: 25
[2023-02-23 18:25:59] Fine-tune is in the queue. Queue number: 24
[2023-02-23 18:26:23] Fine-tune is in the queue. Queue number: 23
[2023-02-23 18:29:01] Fine-tune is in the queue. Queue number: 22
[2023-02-23 18:29:37] Fine-tune is in the queue. Queue number: 21
[2023-02-23 18:30:57] Fine-tune is in the queue. Queue number: 20
[2023-02-23 18:31:23] Fine-tune is in the queue. Queue
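
The same waiting can be done from Python instead of the console. Below is a minimal sketch of a polling loop, not the method the CLI uses internally. `wait_for_job`, `make_fake_fetcher`, the list of terminal states, and the `"example-model"` name are all hypothetical; `fetch_job` stands for whatever call your SDK version offers for retrieving a fine-tune job (with the 0.26-era package used here that would presumably be something like `openai.FineTune.retrieve(id=fine_tune['id'])`, but check your installed version). A fake fetcher is used so the sketch runs offline.

```python
import time

# Statuses we assume mean "stop polling"; treat this set as an assumption.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_job, poll_seconds=30.0, max_polls=240, sleep=time.sleep):
    """Call fetch_job() until the job reports a terminal status.

    fetch_job is a hypothetical zero-argument callable returning a dict-like
    object with a "status" key.
    """
    for _ in range(max_polls):
        job = fetch_job()
        if job["status"] in TERMINAL_STATES:
            return job
        sleep(poll_seconds)
    raise TimeoutError("fine-tune job did not finish within the polling budget")


def make_fake_fetcher(statuses):
    """Stand-in for the real API: yields the given statuses one call at a time."""
    remaining = iter(statuses)

    def fetch():
        status = next(remaining)
        model = "example-model" if status == "succeeded" else None
        return {"status": status, "fine_tuned_model": model}

    return fetch


# Simulate a job that sits in the queue, runs, then succeeds (no real sleeping).
job = wait_for_job(
    make_fake_fetcher(["pending", "running", "succeeded"]),
    sleep=lambda seconds: None,
)
print(job["status"], job["fine_tuned_model"])
```

Passing the fetch function in as an argument keeps the loop independent of any particular SDK version, which matters here because the fine-tuning API has changed over time.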

Once this is complete, let's copy the name of the fine-tuned model into the cell below and run our previous prompt to see if it does any better.

In [70]:
result = openai.Completion.create(
    model="davinci:ft-personal-2023-02-24-01-57-19",
    prompt=prompt
)
print(result["choices"][0]["text"])



Officials say the train derailed in Nantes Dorian, just west of


We really don't see an improvement here: the model simply invents a location. Perhaps we need more data.

## Fine Tuning on More Data from RSS Feeds
Let's build a more complex example that downloads all of today's news via RSS feeds and fine tunes on that. First we'll install an RSS parser, have it download several popular news sources, and prepare our data to fine tune a new model.

In [33]:
!pip install rss-parser

Defaulting to user installation because normal site-packages is not writeable


In [34]:
from rss_parser import Parser
from requests import get

In [35]:
rss_urls = [
    "https://rss.nytimes.com/services/xml/rss/nyt/US.xml",
    "https://rss.nytimes.com/services/xml/rss/nyt/World.xml",
    "http://feeds.bbci.co.uk/news/rss.xml?edition=us",
    "http://rss.cnn.com/rss/cnn_world.rss",
    "http://rss.cnn.com/rss/cnn_us.rss",
    "https://feeds.washingtonpost.com/rss/world?itid=lk_inline_manual_36",
    "https://feeds.washingtonpost.com/rss/national?itid=lk_inline_manual_32",
    "https://feeds.a.dj.com/rss/RSSWorldNews.xml",
    "https://feeds.a.dj.com/rss/WSJcomUSBusiness.xml",
    "https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en"
]

In [43]:
prompts = []

In [44]:
for url in rss_urls:
    xml = get(url)
    print(url)
    
    parser = Parser(xml=xml.content)
    feed = parser.parse()
    
    for item in feed.feed:
        prompts.append({"prompt": item.title, "completion": item.description})

https://rss.nytimes.com/services/xml/rss/nyt/US.xml
https://rss.nytimes.com/services/xml/rss/nyt/World.xml
http://feeds.bbci.co.uk/news/rss.xml?edition=us
http://rss.cnn.com/rss/cnn_world.rss
http://rss.cnn.com/rss/cnn_us.rss
https://feeds.washingtonpost.com/rss/world?itid=lk_inline_manual_36
https://feeds.washingtonpost.com/rss/national?itid=lk_inline_manual_32
https://feeds.a.dj.com/rss/RSSWorldNews.xml
https://feeds.a.dj.com/rss/WSJcomUSBusiness.xml
https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en
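
To see what this loop produces without hitting the network, here is a rough sketch of the same title/description extraction using only the standard library on a tiny inline feed. The feed content, `SAMPLE_RSS`, and `rss_to_pairs` are made up for illustration, and this is not how the `rss-parser` package works internally.

```python
import xml.etree.ElementTree as ET

# A tiny inline feed (made-up content) standing in for a downloaded RSS document.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Headline one</title>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Headline two</title>
      <description></description>
    </item>
  </channel>
</rss>"""


def rss_to_pairs(xml_text):
    """Turn each <item> into a {"prompt": title, "completion": description} dict."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.iter("item"):
        # findtext returns None when the tag is missing, so fall back to "".
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        pairs.append({"prompt": title, "completion": description})
    return pairs


pairs = rss_to_pairs(SAMPLE_RSS)
for pair in pairs:
    print(pair)
```

Note that the second item has an empty description. Real feeds contain items like this too, so some of the pairs we collect will have empty completions.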


In [48]:
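# Use a context manager so the file is flushed and closed before the next step reads it.
with open("rss-trainingdata.jsonl", "w") as f:
    for pair in prompts:
        # One JSON object per line (JSONL).
        f.write(json.dumps(pair) + "\n")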

This time we'll use a tool that OpenAI provides to clean the training data.

In [49]:
!openai tools fine_tunes.prepare_data -f rss-trainingdata.jsonl -q

Analyzing...

- Your file contains 307 prompt-completion pairs
- `completion` column/key should not contain empty strings. These are rows: [160, 162, 164, 165, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194]
- There are 9 duplicated prompt-completion sets. These are rows: [43, 45, 49, 177, 178, 179, 210, 211, 212]
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the end of the completion makes it clearer to the fine-tuned model where the completion should end. See https://beta.openai.com/
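
The kind of clean-up the tool is suggesting can be sketched by hand. This is only an approximation of what the report describes, not the tool's actual implementation: `prepare_pairs` and the `" END"` marker are hypothetical, while the `\n\n###\n\n` separator is the one we append to prompts when querying the fine-tuned model.

```python
import json

SEPARATOR = "\n\n###\n\n"  # separator appended to every prompt
END_MARKER = " END"        # hypothetical ending string; any consistent marker would do


def prepare_pairs(pairs):
    """Drop empty and duplicate pairs, then add the separator and ending marker."""
    seen = set()
    cleaned = []
    for pair in pairs:
        prompt = pair["prompt"].strip()
        completion = pair["completion"].strip()
        if not prompt or not completion:
            continue  # rows with an empty prompt or completion are skipped
        key = (prompt, completion)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append({
            "prompt": prompt + SEPARATOR,
            # A leading space on the completion follows the usual guidance
            # for these older completion models.
            "completion": " " + completion + END_MARKER,
        })
    return cleaned


raw = [
    {"prompt": "Headline one", "completion": "Summary of the first story."},
    {"prompt": "Headline one", "completion": "Summary of the first story."},
    {"prompt": "Headline two", "completion": ""},
]
cleaned = prepare_pairs(raw)
print("\n".join(json.dumps(row) for row in cleaned))
```

If you add an ending marker like this, pass the same string as the `stop` sequence when you query the fine-tuned model so the completion ends where the training examples did.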

Let's go ahead and train a newly fine tuned model on this much larger set of data.

In [52]:
file = openai.File.create(file=open("rss-trainingdata_prepared.jsonl"), purpose='fine-tune')
rss_fine_tune = openai.FineTune.create(training_file=file['id'], model="davinci")

In [71]:
!openai api fine_tunes.follow -i {rss_fine_tune['id']}

[2023-02-23 16:09:28] Created fine-tune: ft-zn6ehx8QooiPqquVScmCT8gd
[2023-02-23 19:06:21] Fine-tune costs $1.84
[2023-02-23 19:06:22] Fine-tune enqueued
[2023-02-23 20:30:19] Fine-tune is in the queue. Queue number: 31
[2023-02-23 20:30:51] Fine-tune is in the queue. Queue number: 30
[2023-02-23 20:34:11] Fine-tune is in the queue. Queue number: 29
[2023-02-23 20:34:52] Fine-tune is in the queue. Queue number: 28
[2023-02-23 20:35:22] Fine-tune is in the queue. Queue number: 27
[2023-02-23 20:36:49] Fine-tune is in the queue. Queue number: 26
[2023-02-23 20:38:21] Fine-tune is in the queue. Queue number: 25
[2023-02-23 20:40:15] Fine-tune is in the queue. Queue number: 23
[2023-02-23 20:40:59] Fine-tune is in the queue. Queue number: 22
[2023-02-23 20:42:31] Fine-tune is in the queue. Queue number: 21
[2023-02-23 20:42:46] Fine-tune is in the queue. Queue number: 20
[2023-02-23 20:44:41] Fine-tune is in the queue. Queue number: 19
[2023-02-23 20:44:43] Fine-tune is in the queue. Queue

Once that's complete, let's compare before (non-finetuned) and after (finetuned) on a question about the last day's news.

In [72]:
prompt = "Where did the train carrying hazardous materials derail?"
result = openai.Completion.create(
    model="davinci",
    prompt=prompt + '\n\n###\n\n'
)
print("Before (non-finetuned) result: " + result['choices'][0]['text'])
result = openai.Completion.create(
    model="davinci:ft-personal-2023-02-24-04-03-06",
    prompt=prompt + '\n\n###\n\n'
)
print("After (finetuned) result: " + result['choices'][0]['text'])

Before (non-finetuned) result: 

Additional Information:



Sound Transit’s emergency closure of
After (finetuned) result: 

Backgrounder

In the early hours of February 10, 2019


Still, the results are gibberish. It turns out that when we fine tune on davinci, OpenAI uses a much older version of davinci that lacks the instruction-following features of text-davinci-003 (or ChatGPT). Fine tuning is much less suited to instruction-based problems and much better suited to problems like classification and autocompletion. To make this work the way we want, we'll need a new approach.

## Getting Customized Results Without Fine Tuning

Let's do an experiment.

In [54]:
prompt = "Given that The 2023 Ohio train derailment (also called the East Palestine train derailment) occurred on February 3, 2023, at 8:55 p.m. EST (UTC−5), when a Norfolk Southern freight train carrying hazardous materials derailed in East Palestine, Ohio, United States. Where did the train carrying hazardous materials derail?"
result = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt + '\n\n###\n\n'
)
print(result['choices'][0]['text'])

 The train carrying hazardous materials derailed in East Palestine, Ohio, United States.


This may seem counterintuitive, but we presented some new information and asked a question about it, all without fine tuning a model. If we can search for content that may answer the question being asked and prepopulate the prompt with that content, then we can use GPT-3's instruction-following features to work with the new information. Fortunately, there are some great tools, like langchain, for building creative solutions like this.
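
The idea of prepopulating the prompt can be sketched as a small helper. `build_prompt` and its wording are hypothetical, written only to illustrate the pattern; it is not the template langchain uses.

```python
def build_prompt(snippets, question):
    """Place retrieved snippets ahead of the question in a single prompt string."""
    # Skip blank snippets and render the rest as a simple bulleted context block.
    context = "\n".join("- " + s.strip() for s in snippets if s.strip())
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        "Context:\n" + context + "\n\n"
        "Question: " + question.strip() + "\n"
        "Answer:"
    )


prompt_text = build_prompt(
    ["The train derailed in East Palestine, Ohio.", "  "],
    "Where did the train derail?",
)
print(prompt_text)
```

The resulting string is what we would send as the `prompt` of a completion request, with the snippets chosen by some search step.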

In [55]:
!pip install langchain

Defaulting to user installation because normal site-packages is not writeable


Now we'll download the same RSS feeds as before, but this time, instead of fine tuning on them, we'll prefill our prompt with this data before we ask the question about current events.

In [56]:
from langchain.docstore.document import Document

documents = []
for url in rss_urls:
    xml = get(url)
    
    parser = Parser(xml=xml.content)
    feed = parser.parse()
    
    for item in feed.feed:
        documents.append(Document(
            page_content=item.title + '. ' + item.description
        ))    

In [57]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

prompt = "Where did the train carrying hazardous materials derail?"

chain = load_qa_chain(OpenAI(temperature=0))
chain({"input_documents":documents, "question":prompt}, return_only_outputs=True)["output_text"]

InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 16655 tokens (16399 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.

We see here that GPT doesn't support prompts this large. After all, we dumped the entire contents of the last day's news into one request: roughly 16,000 tokens against a limit of 4,097. We'll need a more intelligent approach to populating the prompt.
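
To make the size problem concrete, here is a rough sketch of budgeting a prompt against the context limit. The 4,097-token limit and the 256 tokens for the completion come from the error message; the four-characters-per-token estimate, the 100 tokens reserved for the question, and both function names are assumptions for illustration. A real tokenizer would give exact counts.

```python
def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)


def take_within_budget(texts, context_limit=4097, reserved_for_completion=256,
                       reserved_for_question=100):
    """Keep texts in order until the estimated token budget would be exceeded."""
    budget = context_limit - reserved_for_completion - reserved_for_question
    kept, used = [], 0
    for text in texts:
        cost = estimate_tokens(text)
        if used + cost > budget:
            break  # the next text would not fit
        kept.append(text)
        used += cost
    return kept, used


# Fifty fake news items of 400 characters each (about 100 estimated tokens apiece).
fake_items = ["a" * 400 for _ in range(50)]
kept, used = take_within_budget(fake_items)
print(len(kept), "items kept, about", used, "tokens")
```

Simply taking items in order until the budget runs out keeps the request valid, but it does nothing to make sure the items that fit are the relevant ones.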

## Using Text Embeddings and Vector Similarity Searches to Prepopulate Our Prompt

Imagine if we searched our recent news data for content related to our prompt and populated the prompt with only the relevant pieces. A typical full-text search index wouldn't be appropriate here because the exact words of the question are unlikely to appear in the content, especially since the content is made of statements and the prompt is a question. Instead we'll use text embeddings, which let us find text that is similar in meaning to other text. OpenAI recently released a Text Embeddings API that converts text into a vector that can be compared to other vectors. The vector for "people work" would be close to the vector for "humans do jobs" even though the words are completely different.

If we populate our prompt only with text that's similar to the question, the answer is likely to be in the prompt. FAISS is a vector store we can use to compare text embeddings, and langchain supports it as a way to build prompts for OpenAI. We'll build a search index and then populate our prompt only with information that's similar to our question.
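
The comparison of vectors can be illustrated with a toy example. The three-dimensional vectors below are invented by hand; real embeddings come from the embeddings API and have far more dimensions. Cosine similarity is used here as a common stand-in for "similar", and FAISS may be configured with a different distance measure, so treat this as a sketch of the idea and not of FAISS itself. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = sorted(
        indexed,
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


# Toy vectors invented for illustration only.
indexed = [
    ("people work", [0.9, 0.1, 0.0]),
    ("humans do jobs", [0.8, 0.2, 0.1]),
    ("the sky is blue", [0.0, 0.1, 0.9]),
]
query = [0.85, 0.15, 0.05]

nearest = top_k(query, indexed, k=2)
print(nearest)
```

A vector store does the same kind of nearest-neighbor lookup, only with data structures that stay fast when there are many thousands of vectors.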

One note: I had to add a payment method to my OpenAI account to get past the rate limits when generating text embeddings for the RSS feeds.

In [33]:
!pip install faiss-cpu

Defaulting to user installation because normal site-packages is not writeable
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.3


In [58]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.faiss import FAISS

search_index = FAISS.from_documents(documents, OpenAIEmbeddings())

In [60]:
prompt = "Where did the train carrying hazardous materials derail?"

chain = load_qa_chain(OpenAI(temperature=0))
chain({"input_documents":search_index.similarity_search(prompt, k=4), "question":prompt}, return_only_outputs=True)["output_text"]

' East Palestine, Ohio.'

It works! By pairing the similarity search with GPT-3, we can now answer questions about news in the RSS feeds.

## Parting Thoughts, Shortcomings, Caveats

While this is a powerful use of some cutting-edge Natural Language Processing, there are still a few potential shortcomings of this approach. One issue is authority. FAISS only cares about how similar text is when populating prompts. If I ask the model "What color is the dog?" and my text database contains both "The dog is black" and "The dog is white," the vector index has no way of knowing which to present to the model.

Another issue is that if many pieces of text are similar to the prompt, the result can be a very large prompt, which increases processing time and cost unless the number of retrieved documents is capped (we used `k=4` above).

All the same, we're at the beginning of a very interesting time where generative AI seems to be reaching a critical velocity. Tools like langchain, which can accelerate already powerful tools, are a sign of even more interesting times to come. With GPT-4 likely to be released any time now, this year should prove to be another year of growth for AI.

## Sources

* https://langchain.readthedocs.io/en/latest/use_cases/question_answering.html
* https://dagster.io/blog/chatgpt-langchain
* https://github.com/openai/openai-cookbook