## Prompt Engineering

***Introduction***

In this notebook, we show how to prompt engineer LLMs to analyze text from Discovery according to our needs. The notebook is intended as a starting point: it shows two examples of prompt engineering, with the goal that you can then apply the same techniques to other analysis requirements.

We start by installing and importing the required libraries. We will use the OpenAI Python library to call GPT-3.5 through OpenAI's API, which requires an OpenAI API key.
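
Rather than pasting the key into the notebook, one option is to read it from an environment variable. The sketch below assumes the key is stored in a variable named `OPENAI_API_KEY`; the helper name `load_api_key` is hypothetical and is not part of `helper_functions` or the OpenAI library.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Raises RuntimeError with a readable message if the variable is unset
    or empty, so the failure happens before any API call is made.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key
```

The returned value could then be passed to the client as `OpenAI(api_key=load_api_key())`.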

In [None]:
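# Install third-party dependencies. json and re ship with Python, so they need no install.
%pip install -q openai
%pip install -q requests

import json
import re

import requests
from openai import OpenAI

import helper_functions as hf  # helper module used throughout this notebook

# Replace "API_KEY" with your own OpenAI API key.
client = OpenAI(api_key="API_KEY")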

## **Document classification**

Text classification is a Natural Language Processing (NLP) task in which a text or document is assigned to one or more classes or categories. Normally, the classes have to be pre-defined as labels to which the texts are assigned. When we have a set of descriptions but no labels, we can use LLMs to classify the texts against those descriptions.

We first define the set of descriptions against which we want our texts to be classified. For this we use the collection descriptions from Discovery, requested through the Discovery API as shown in the [Discovery API notebook](https://github.com/rae-drt/tna-exploratory-notebooks/blob/main/1-intro-to-discovery-api.ipynb). We start from a default list of record series IDs whose descriptions we will request. To add other record series, enter their IDs in the input box as a comma-separated list, for example ['123', '124', '125']. Otherwise, just press Enter in the input box.

In [None]:
documentIDs = ['C13519', 'C13520', 'C13522']
inputArray = input("Please enter the list of record series IDs")
if inputArray:
  hf.append_IDs(documentIDs, inputArray)
print("The list of records are:", documentIDs)
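
The helper `hf.append_IDs` is not shown in this notebook. As a rough sketch of what such a helper might do, the hypothetical function below accepts input typed either as `['123', '124']` or as `123, 124`, and appends any IDs that are not already in the list. The real helper may behave differently.

```python
import re


def append_ids_sketch(existing: list, raw_input: str) -> list:
    """Append IDs parsed from raw_input to existing, skipping duplicates.

    Hypothetical stand-in for hf.append_IDs; the real helper may differ.
    """
    # Pull out runs of letters and digits, ignoring brackets, quotes and commas.
    for candidate in re.findall(r"[A-Za-z0-9]+", raw_input):
        if candidate not in existing:
            existing.append(candidate)
    return existing
```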


Now we use the Discovery API to fetch their descriptions and store them in the 'documents' dictionary, keyed by the title of each record.

In [None]:
documents = hf.populate_documents(documentIDs)
print(documents)

We will now define the set of texts we want to classify.

In [None]:
textIDs = ['C3411937', 'C12215981', 'C5485074', 'C5485055']
inputArray = input("Please enter the list of record IDs")
if inputArray:
  hf.append_IDs(textIDs, inputArray)
print("The list of records are:", textIDs)

In [None]:
texts = hf.populate_texts(textIDs)
print(texts)

Now we send the texts to GPT with a suitable prompt. We first tell the system what its role is; here, its role is to be a helpful assistant. In the prompt we ask GPT simply to classify the text and return the result in the JSON schema {'text': text, 'category':value} so that it can be processed computationally. In the first cell below we do not pass any labels, so the model creates its own set of labels and assigns each text to the one it most closely matches. In the second cell we pass the set of documents before the prompt and ask the model to classify each text to one of them.

In [None]:
for text in texts:
  response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Classify and return as JSON the text '" + text + "'. Format in the following JSON schema {'text': text, 'category':value}"}
    ]
  )

  response_json = json.loads(response.choices[0].message.content)
  print(response_json["text"] + ": " + response_json["category"])
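
The cell above assumes that the reply is valid JSON, but a model can wrap the object in extra words, in which case `json.loads` raises an error. A minimal, more forgiving sketch follows; `parse_json_reply` is a hypothetical helper, not part of the OpenAI library.

```python
import json
import re


def parse_json_reply(reply: str):
    """Return the first JSON object found in reply, or None if nothing parses."""
    # Grab everything from the first '{' to the last '}', spanning newlines.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        # For example, the reply used single quotes, which JSON does not allow.
        return None
```

With this helper, a loop could skip or retry any text whose reply comes back as `None` instead of stopping with an exception.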

In [None]:
for text in texts:
  response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "The set of documents are: " + str(documents)},
      {"role": "user", "content": "Classify and return as JSON the text '" + text + "' to one of the documents given above. Format in the following JSON schema {'text': text, 'category':value}"}
    ]
  )

  response_json = json.loads(response.choices[0].message.content)
  print(response_json["text"] + ": " + response_json["category"])
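
Because the model is free to return any string, it can help to check that the returned category really is one of the supplied documents. The sketch below assumes that `documents` is a dictionary keyed by record title, as described earlier; `match_category` is a hypothetical helper introduced only for illustration.

```python
from typing import Optional


def match_category(category: str, documents: dict) -> Optional[str]:
    """Return the key of documents that matches category, or None.

    The comparison ignores case and surrounding whitespace, so small
    formatting differences in the model's reply still match.
    """
    wanted = category.strip().lower()
    for title in documents:
        if title.strip().lower() == wanted:
            return title
    return None
```

A result of `None` signals that the model returned a label outside the given set, which is worth reporting rather than printing as if it were a valid class.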

## **Sensitivity classification**

Sensitivity classification is an NLP task in which a text is assigned to one of two classes, "sensitive" or "not sensitive", depending on two factors: whether the text contains personal information and whether it is offensive in nature.

We now define the context on which we want the system to base its classification. We will pass this context along with the prompt.

In [None]:
context = "Texts speaking about the monarchy are not sensitive. Every other text is sensitive."

We first classify the texts without giving the context.

In [None]:
for text in texts:
  response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Classify as sensitive or not the text '" + text + "'. Format in the following JSON schema {'text': text, 'classification':value, 'reason':value}"}
    ]
  )

  response_json = json.loads(response.choices[0].message.content)
  print(response_json["classification"] + ": " + response_json["text"])

We can see that some texts are classified as sensitive even though they are open to the public at The National Archives. This is because they relate to legal matters or conflicts.

We now classify the texts based on the context. For this, we first provide the context in its own message, and then in the following prompt we tell the system to classify the text based on that context.
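
The model returns the classification as free text, so the label may vary in case or punctuation between replies. Note that a naive substring check would be wrong here, because "not sensitive" contains "sensitive". The sketch below shows one possible way to map the label to a boolean; `is_sensitive` and the accepted spellings are assumptions, not something the API guarantees.

```python
def is_sensitive(label: str) -> bool:
    """Map a free-text label such as 'Sensitive' or 'not sensitive' to a bool.

    Raises ValueError for labels that are not recognised, so unexpected
    replies are noticed instead of being silently miscounted.
    """
    cleaned = label.strip().lower().replace("-", " ").replace("_", " ")
    if cleaned == "sensitive":
        return True
    if cleaned in {"not sensitive", "non sensitive"}:
        return False
    raise ValueError(f"Unrecognised sensitivity label: {label!r}")
```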

In [None]:
for text in texts:
  response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "system", "content": "The context is: '" + context + "'"},
      {"role": "user", "content": "Given the context, classify as sensitive or not the text '" + text + "'. Format in the following JSON schema {'text': text, 'classification':value}"}
    ]
  )
  response_json = json.loads(response.choices[0].message.content)
  print(response_json["classification"] + ": " + response_json["text"])

We can see that the classification now follows our context: only the texts that the context designates as sensitive are classified as sensitive.
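
The two sensitivity cells differ only in whether the context message is included. As a sketch, the message list could be built by a single function with an optional context, which makes the with-context and without-context runs easier to compare. The function name `build_sensitivity_messages` is hypothetical; the message wording mirrors the cells above.

```python
from typing import Optional


def build_sensitivity_messages(text: str, context: Optional[str] = None) -> list:
    """Build the chat messages for one text, with or without a context."""
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    schema = " Format in the following JSON schema {'text': text, 'classification':value}"
    if context:
        # Supply the context in its own message, then refer to it in the prompt.
        messages.append({"role": "system", "content": "The context is: '" + context + "'"})
        prompt = "Given the context, classify as sensitive or not the text '" + text + "'."
    else:
        prompt = "Classify as sensitive or not the text '" + text + "'."
    messages.append({"role": "user", "content": prompt + schema})
    return messages
```

The returned list could be passed as the `messages` argument of `client.chat.completions.create`, once with `context` and once without.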