<a href="https://colab.research.google.com/github/stele-and-rivers-001/study-series-nlp-1/blob/main/Large_language_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

If you have been following along with this series, you likely already have some familiarity with chat completion models such as OpenAI's GPT model series. If not, check out ChatGPT to see the magic in action. These *Large Language Models*, or LLMs, are designed to take text as input and generate human-like responses based on that input. They are trained on vast amounts of text data and are capable of a wide range of language-related tasks, such as answering questions, generating text, translating languages, and more. If you're familiar with working with APIs, you can work directly with many of these models using Python, as we did in the last two parts of this series.

*Prompt engineering* is the process of designing and refining the text inputs to generate desired responses from artificial intelligence models, particularly large language models like GPT. We will explore prompt engineering techniques to improve performance as well.

This study will compare the performance of some of the most popular AI chat models on our text classification task. At a base level, here's how we'll run the test: instead of pasting each job title from our list of ~1,000 into an individual ChatGPT window one by one, we will write a Python script that sends each item to the model and asks it to assign one of our labels. Then we can measure performance and compare it to our models from the previous studies. Since there is no training stage, we will only use the test data. Using prompt engineering best practices, we can provide additional context to assist the models. We won't dive too deeply into features such as agents and assistants, as those will be the focus of the next study in this series.

The models included in this study are as follows: OpenAI's GPT-3.5 and GPT-4, Meta's Llama 3, Google's Gemini Pro 1.0, Hugging Face's HuggingChat and Anthropic's Claude Sonnet. Integrating these APIs into your code is not a one-size-fits-all approach. Each company has a different integration process, so it is important to understand the documentation for each model. Let's dive in!

## Install libraries and import data

In [None]:
! pip install -Uqq openai
! pip install -Uqq google-generativeai
! pip install -Uqq replicate
! pip install -Uqq hugchat
! pip install -Uqq anthropic


In [None]:
#hide
import json
import os
from pathlib import Path
import openai
from openai import OpenAI
from google.colab import files, drive
import pandas as pd
import time
import google.generativeai as genai
import replicate
from hugchat import hugchat
from hugchat.login import Login
import anthropic
import textwrap
from IPython.display import display
from IPython.display import Markdown

In [None]:
#hide
drive.mount('/content/drive')

Mounted at /content/drive


You'll need an API key for each model. The code below uploads and reads a file containing all of your API keys, but you can also use the Google Colab secrets panel.

In [None]:
uploaded = files.upload()

Saving secrets.txt to secrets.txt


In [None]:
with open('secrets.txt', 'r') as f:
    lines = f.readlines()

# set the openai key
openai_api_key = None
for line in lines:
    if line.startswith('OPENAI_API_KEY='):
        openai_api_key = line.strip().split('=')[1]
        break

# set the hugging face key
hf_api_key = None
for line in lines:
    if line.startswith('HUGGING_FACE_KEY='):
        hf_api_key = line.strip().split('=')[1]
        break

# set the hugging face email
hf_email = None
for line in lines:
    if line.startswith('HF_EMAIL='):
        hf_email = line.strip().split('=')[1]
        break

# set the hugging face password
hf_pw = None
for line in lines:
    if line.startswith('HF_PW='):
        hf_pw = line.strip().split('=')[1]
        break

# set the gemini key
gemini_api_key = None
for line in lines:
    if line.startswith('GEMINI_API_KEY='):
        gemini_api_key = line.strip().split('=')[1]
        break

# set the replicate key (llama)
replicate_api_key = None
for line in lines:
    if line.startswith('REPLICATE_API_KEY='):
        replicate_api_key = line.strip().split('=')[1]
        break

# set the claude key
claude_api_key = None
for line in lines:
    if line.startswith('CLAUDE_API_KEY='):
        claude_api_key = line.strip().split('=')[1]
        break
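The lookups above all follow the same pattern, so they could be collapsed into a single helper. Here is a minimal sketch, assuming the same `KEY=value` layout in `secrets.txt`; `parse_secrets` is a hypothetical name and not part of any library. Splitting on the first `=` only also keeps values that themselves contain an `=` intact.

```python
def parse_secrets(lines):
    """Parse KEY=value lines into a dict (hypothetical helper)."""
    secrets = {}
    for line in lines:
        line = line.strip()
        # skip blank lines, comments and anything without a KEY=value shape
        if not line or line.startswith('#') or '=' not in line:
            continue
        # partition splits on the first '=' only, so values containing '=' survive
        key, _, value = line.partition('=')
        secrets[key.strip()] = value.strip()
    return secrets


# made-up example lines standing in for the contents of secrets.txt
example_lines = ["OPENAI_API_KEY=sk-example\n", "HF_PW=pass=word\n", "\n"]
secrets = parse_secrets(example_lines)
openai_key_example = secrets.get("OPENAI_API_KEY")  # None if the key is missing
```

With this approach each key becomes a dictionary lookup, for example `secrets.get("GEMINI_API_KEY")`.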

Import test data. This method does not require a training dataset as we are directly asking the chat models to categorize the data.

You can access the test data here:
https://drive.google.com/drive/folders/1b8l9kVtItOInUVASMvzpu1evVSR7x1Ub

In [None]:
#### add your path to test_data.csv below ####
test_df = pd.read_csv()
test_df.columns

Index(['label', 'text'], dtype='object')

In [None]:
test_df.columns = ['label','text']
test_df.tail()

Unnamed: 0,label,text
195,healthcare,Clinical Nurse Educator
196,retail_hospitality,Entertainment Coordinator
197,legal,Estate Planning Lawyer
198,technology,Technical Program Manager
199,healthcare,Oncology Nurse


In [None]:
test_df.describe()

Unnamed: 0,label,text
count,200,200
unique,8,200
top,education,Education Technology Specialist
freq,36,1


In [None]:
#hide
# show unique labels to ensure no typos or missing categories
unique_labels = test_df['label'].unique()
label_counts = test_df['label'].value_counts()
# print("Label Counts:", label_counts)
# print("Unique Labels:", unique_labels)

unique_labels_list = unique_labels.tolist()
print("Label Counts List:", unique_labels_list)

Label Counts List: ['education', 'technology', 'retail_hospitality', 'marketing_advertising', 'drama_arts', 'legal', 'healthcare', 'finance']


## Prompts and common parameters

### Prompts

Here we will define some universal prompt variables. Many of these APIs support both a system and user prompt.

System Prompt:
- The system prompt is the initial text provided to the model by the API user. It sets the context or direction for the subsequent text generation. Think of it as a broad description of the scene/job to be done by the model.
- The system prompt can be a question, a statement, or any text that provides context for the response.
- The quality and relevance of the system prompt can significantly influence the coherence and accuracy of the generated response.

User Prompt:
- The user prompt is a specific instruction or query provided by the end-user, typically through an interface or application that utilizes the API.
- It's the input from the user that triggers the generation of text from the AI model based on the context provided in the system prompt.
- User prompts guide the AI model on what specific information or response the user is seeking.
- They can vary widely depending on the application, ranging from simple questions to more complex requests or commands.

Remember that most of these models charge on a per-token basis, so the longer the system and user prompts are, the more tokens are sent to the model and the more each request costs. Longer inputs also take longer to process. When working with limited resources, it is important to balance providing enough context against spending more than necessary.
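A back-of-the-envelope estimate can make this trade-off concrete. The sketch below assumes roughly four characters per token, which is a rule of thumb for English text rather than an exact count, and uses a placeholder price. `estimate_cost` and both constants are hypothetical, so substitute the provider's tokenizer and current pricing before relying on the numbers.

```python
CHARS_PER_TOKEN = 4                 # rough heuristic for English text (assumption)
PRICE_PER_1K_INPUT_TOKENS = 0.01    # placeholder price in USD, not a real quote


def estimate_cost(prompt, n_requests, price_per_1k=PRICE_PER_1K_INPUT_TOKENS):
    """Approximate input cost of sending a prompt of this size n_requests times."""
    tokens_per_request = len(prompt) / CHARS_PER_TOKEN
    total_tokens = tokens_per_request * n_requests
    return total_tokens / 1000 * price_per_1k


sample_prompt = "x" * 2000  # stand-in for a 2,000-character few-shot prompt
estimated = estimate_cost(sample_prompt, n_requests=200)
print(f"Estimated input cost: ${estimated:.2f}")
```

Doubling the number of few-shot examples roughly doubles the prompt length, and with it the estimated input cost.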

Our system prompt tells the model what the general purpose of the incoming prompts will be about. We are asking the model to categorize items in a specific domain.

In [None]:
system_prompt = "You are a helpful assistant that categorizes job titles by industry."

One of the best ways to provide context to the models is by giving examples of the inputs and expected outputs. This is often referred to as "Few-shot" prompting. This user prompt provides a specific request as well as two thorough examples with inputs and outputs in the requested format of a json dictionary key:value pair. It then provides a final input using the variables we will be submitting via the API request.

In [None]:
## function to create user prompt given the input text and the existing list of topics
def create_user_prompt(topics_list, input_text):

    user_prompt = f"""I'd like you to assist me in relating a job title from an input text to an existing list of industries and return a dictionary with a key-value pair. The "key" will be equal to the input text and the "value" will be equal to the existing topic that the input text is most closely related to.

    Example:
    Existing Topics:
    ['education', 'technology', 'retail_hospitality', 'marketing_advertising', 'drama_arts', 'legal', 'healthcare', 'finance']
    Input Text:
    'Spa Operations Manager'
    Response:{{"Spa Operations Manager":"retail_hospitality"}}

    Example:
    Existing Topics:
    ['education', 'technology', 'retail_hospitality', 'marketing_advertising', 'drama_arts', 'legal', 'healthcare', 'finance']
    Input Text:
    'Clinical Nurse Educator'
    Response:{{"Clinical Nurse Educator":"healthcare"}}

    Existing Topics:
    {topics_list}
    Input Text:
    {input_text}
    Response:
    """

    return user_prompt
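The two worked examples are hard-coded in the f-string above. If you want to experiment with different or additional examples, one option is to build the few-shot section from a list. This is a sketch under that assumption; `build_few_shot_prompt` and its `examples` argument are hypothetical and are not used elsewhere in this study.

```python
import json


def build_few_shot_prompt(topics_list, input_text, examples):
    """Assemble a few-shot prompt from (job_title, label) example pairs."""
    parts = [
        "Relate the job title in the input text to one of the existing topics "
        "and return a dictionary with one key-value pair: the input text as the "
        "key and the closest existing topic as the value."
    ]
    for title, label in examples:
        # json.dumps renders the example answer in the JSON shape we expect back
        response = json.dumps({title: label})
        parts.append(
            f"Example:\nExisting Topics:\n{topics_list}\n"
            f"Input Text:\n'{title}'\nResponse:{response}"
        )
    # the final block leaves the response open for the model to complete
    parts.append(
        f"Existing Topics:\n{topics_list}\nInput Text:\n{input_text}\nResponse:"
    )
    return "\n\n".join(parts)


labels = ['education', 'technology', 'retail_hospitality', 'healthcare']
examples = [("Spa Operations Manager", "retail_hospitality"),
            ("Clinical Nurse Educator", "healthcare")]
prompt = build_few_shot_prompt(labels, "Oncology Nurse", examples)
print(prompt)
```

Keeping the examples in a list makes it easy to measure how accuracy and token usage change as examples are added or removed.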

### Parameters

Many of these models allow users to specify parameters to adjust the performance of the model and tune the output to their liking. Here we will discuss many of the common parameters, and later on we will see some of the more relevant options in practice.

*Fine-tuning:* As with assistants and agents, we will not explore fine-tuning in this study, but it is important to discuss. It works similarly to how we fine-tuned models in the previous studies: users submit data and fine-tune a model before attempting their task. This saves costs by allowing for fewer examples in the prompt, but requires time and effort to set up. OpenAI's GPT series and Google's Gemini Pro have the option to fine-tune a model.

*Functions*, or *tools:* a parameter in the API used to provide function specifications. When included, it lets the model generate arguments for those functions, but the model is not required to call a function unless you specify that it must. OpenAI, Google and Anthropic include function capabilities.

Here is a helpful article regarding how function calling works: https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models

*Temperature:* influences the creativity and diversity of the model response. A higher temperature may provide the model more freedom to explore word choices, but the generated response may end up being incorrect or irrelevant.
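To illustrate the effect, here is a toy sketch of temperature scaling applied to a softmax over made-up scores for three candidate tokens. It shows the general idea only; the providers' actual sampling code is not public, `softmax_with_temperature` is a hypothetical helper, and the sketch requires a temperature greater than 0.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    # subtract the max before exponentiating for numerical stability
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.5)
high = softmax_with_temperature(logits, 2.0)
print("temperature 0.5:", [round(p, 3) for p in low])
print("temperature 2.0:", [round(p, 3) for p in high])
```

At the lower temperature the top-scoring token takes a larger share of the probability, while the higher temperature flattens the distribution and gives the other tokens more of a chance.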

*max_tokens:* limit the tokens of the output to save resources and avoid lengthy responses

*top_p:* Specifies the cumulative probability threshold for tokens considered in the next token generation. The model samples only from the smallest set of most probable tokens whose probabilities add up to top_p. Lower values restrict the model to the most probable tokens, resulting in more predictable but less diverse text. Higher values allow the model to consider a wider range of options, even those with lower probabilities, which can lead to more diverse and surprising text.

*top_k*: Specifies the maximum number of tokens (words) to consider when generating the next token in a sequence.
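A toy sketch of both filters on a made-up next-token distribution may help. `top_k_filter` and `top_p_filter` are hypothetical helpers that illustrate the general technique, not any provider's implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}


# made-up probabilities for the next token
probs = {"healthcare": 0.6, "education": 0.25, "legal": 0.1, "finance": 0.05}
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.8))
```

In this example both filters keep only "healthcare" and "education": the top two tokens together already cover 0.85 of the probability mass, which exceeds the 0.8 threshold.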

*response_mime_type or response_format:* setting this to JSON ensures a formatted output.

*candidate_count*, *n:* The maximum number of generated response messages to return. This could be an interesting way to see what the top predictions are but would require user oversight to correct any errors.

*frequency penalty:* Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

*presence penalty:* Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
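The two penalties can be sketched as adjustments to each token's score. The version below follows the formula given in OpenAI's documentation, where a token's score is reduced by its count times the frequency penalty, plus the presence penalty if it has appeared at all. Other providers may differ, and `apply_penalties` is a hypothetical helper operating on made-up scores.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the scores of tokens that already appear in the generated text."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        c = counts[token]
        seen = 1.0 if c > 0 else 0.0
        # frequency penalty grows with every repeat; presence penalty applies once
        adjusted[token] = logit - c * frequency_penalty - seen * presence_penalty
    return adjusted


logits = {"nurse": 1.5, "teacher": 1.2, "lawyer": 0.9}  # made-up scores
history = ["nurse", "nurse", "teacher"]                 # tokens generated so far
adjusted = apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.2)
print(adjusted)
```

Here "nurse" is penalized most because it has already appeared twice, while "lawyer" keeps its original score because it has not appeared yet.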

## OpenAI

First up, from the company that built ChatGPT, we will check out the GPT series. OpenAI offers many models for a variety of tasks. In this study, we will compare the popular GPT-3.5 and GPT-4 chat completion models. All OpenAI models are billed on a per-token basis, but GPT-3.5 is much cheaper than GPT-4 and has higher rate limits.

In [None]:
openai.api_key = openai_api_key
client = OpenAI(api_key=openai_api_key)

In [None]:
# select a model from OpenAI's offerings
#model = 'gpt-3.5-turbo'
model = 'gpt-4-0613'

Here is our function for making an API call. Some additional parameters that GPT supports are temperature and functions, or tools.


In [None]:
## function for making GPT API call
def gpt_call(user_prompt, system_prompt, model, tools, tool_choice):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.5,
        tools=tools,
        tool_choice=tool_choice,
    )
    return response

Here is where we use a function, or tool, to ensure the output format to be a key:value pair. We want GPT to return the output as specified in the user prompt above. In order to do this, we tell GPT to use this function, which requires data to be in a structured format instead of regular (and inconsistent) text. Now when we parse the output data, we have a consistent and workable structure, removing potential formatting errors.

In [None]:
# Define the function
tools = [
    {   "type":"function",
        "function": {
            "name": "generate_dictionary",
            "description": "Generate a dictionary with 'key' equal to the input text value and 'value' equal to the matching topic from the existing list.",
            "parameters": {
                "type": "object",
                "properties": {
                    "key": {
                        "type": "string",
                        "description": "The input text."
                    },
                    "value": {
                        "type": "string",
                        "description": "The matching topic from the existing list of topics."
                    }
                }
            }
        }
    }
]

tool_choice={"type": "function", "function": {"name": "generate_dictionary"}}

In [None]:
## set the predictions df so that we can calculate accuracy
gpt_preds_df = test_df.copy()
#gpt_preds_df = gpt_preds_df.iloc[:2]

In [None]:
## GPT-4 Loop
tracker_list = []
# loop through the text column in the test_df
for i, text in enumerate(gpt_preds_df['text']):
    print("Text Item: ", text)
    # use a while loop to run the loop continuously until a valid response is obtained from GPT
    ## BE CAREFUL! You may want to add a max_n argument to ensure that you're not running the model over and over again while racking up a large bill
    ## You can also implement spending limits in your OpenAI account
    while True:
        try:
            # run the user_prompt function, inputting the categories and the text item to be categorized
            user_prompt = create_user_prompt(unique_labels_list, text)
            # run the gpt_call function and return the response
            response = gpt_call(user_prompt=user_prompt,system_prompt=system_prompt,model=model,tools=tools,tool_choice=tool_choice)
            # parse the response data and print the output (prediction)
            gpt_output = response.choices[0].message
            print("GPT Output: ", gpt_output)
            # set the key:value pair based on the GPT output
            gpt_key = json.loads(gpt_output.tool_calls[0].function.arguments).get("key")
            gpt_value = json.loads(gpt_output.tool_calls[0].function.arguments).get("value")
            tracker = {gpt_key: gpt_value}
            print("Dict Output: ", tracker)
            # if the output value is not blank, append the dictionary to the final list and exit the while loop
            # Check if both key and value have values
            if gpt_key is not None and gpt_value is not None and gpt_value != '':
                # Append the dictionary to the final list
                tracker_list.append(tracker)
                # Update gpt_preds_df with the gpt_value
                gpt_preds_df.at[i, 'predicted_label'] = gpt_value
                break  # Exit the while loop

        # GPT restricts users by setting a rate limit. If a RateLimitError occurs, pause for a minute and continue
        except Exception as e:
            print(f"An error occurred: {e}")
            if "Rate limit exceeded" in str(e):
                print(f"Rate limit exceeded. Waiting for 60 seconds to retry")
                time.sleep(60)
            else:
                break  # Exit the loop for other exceptions

print(tracker_list)

Text Item:  Education Technology Specialist
GPT Output:  ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_Yhy8LQ5tElVfEag2TxdSoDPH', function=Function(arguments='{\n  "key": "Education Technology Specialist",\n  "value": "education"\n}', name='generate_dictionary'), type='function')])
Dict Output:  {'Education Technology Specialist': 'education'}
Text Item:  Incident Response Analyst
GPT Output:  ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_7gO0TB8QSZMASQsiKKZVamhX', function=Function(arguments='{\n  "key": "Incident Response Analyst",\n  "value": "technology"\n}', name='generate_dictionary'), type='function')])
Dict Output:  {'Incident Response Analyst': 'technology'}
Text Item:  Spa Operations Manager
GPT Output:  ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id
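As the comment in the loop warns, an unbounded `while True` can keep calling a paid API indefinitely. One way to cap this is a small retry wrapper with a maximum number of attempts. This is a sketch only: `call_with_retries` and its parameters are hypothetical, and the `sleep` argument exists so the wait can be replaced when experimenting.

```python
import time


def call_with_retries(make_request, max_attempts=3, wait_seconds=60, sleep=time.sleep):
    """Call make_request() until it returns a non-None result or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = make_request()
            if result is not None:
                return result
            print(f"Attempt {attempt}: empty result, retrying")
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
            if "Rate limit" not in str(e):
                break  # mirror the loop above: give up on other errors
            if attempt < max_attempts:
                sleep(wait_seconds)
    return None  # the caller decides how to handle a missing prediction


# stand-in for an API call that hits a rate limit once and then succeeds
attempts = {"n": 0}


def flaky_request():
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise RuntimeError("Rate limit exceeded")
    return {"Oncology Nurse": "healthcare"}


result = call_with_retries(flaky_request, max_attempts=3, sleep=lambda s: None)
print(result)
```

In the notebook, `make_request` could wrap the call and parsing steps for a single job title, so a title that never yields a valid answer is skipped after `max_attempts` tries instead of looping forever.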

Finally we loop through the test data and add the GPT output to a new prediction label column. We then measure the accuracy.

In [None]:
## GPT-3.5 Loop
tracker_list = []
# loop through the text column in the test_df
for i, text in enumerate(gpt_preds_df['text']):
    print("Text Item: ", text)
    # use a while loop to run the loop continuously until a valid response is obtained from GPT
    while True:
        try:
            # run the user_prompt function, inputting the categories and the text item to be categorized
            user_prompt = create_user_prompt(unique_labels_list, text)
            # run the gpt_call function and return the response
            response = gpt_call(user_prompt=user_prompt,system_prompt=system_prompt,model=model,tools=tools,tool_choice=tool_choice)
            # parse the response data and print the output (prediction)
            gpt_output = response.choices[0].message
            print("GPT Output: ", gpt_output)
            # set the key:value pair based on the GPT output
            gpt_key = json.loads(gpt_output.tool_calls[0].function.arguments).get("key")
            gpt_value = json.loads(gpt_output.tool_calls[0].function.arguments).get("value")
            tracker = {gpt_key: gpt_value}
            print("Dict Output: ", tracker)
            # if the output value is not blank, append the dictionary to the final list and exit the while loop
            # Check if both key and value have values
            if gpt_key is not None and gpt_value is not None and gpt_value != '':
                # Append the dictionary to the final list
                tracker_list.append(tracker)
                # Update gpt_preds_df with the gpt_value
                gpt_preds_df.at[i, 'predicted_label'] = gpt_value
                break  # Exit the while loop

        # GPT restricts users by setting a rate limit. If a RateLimitError occurs, pause for a minute and continue
        except Exception as e:
            print(f"An error occurred: {e}")
            if "Rate limit exceeded" in str(e):
                print(f"Rate limit exceeded. Waiting for 60 seconds to retry")
                time.sleep(60)
            else:
                break  # Exit the loop for other exceptions

print(tracker_list)

Text Item:  Education Technology Specialist
GPT Output:  ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_HHkcyzPIve0aNYDODlKtWsK6', function=Function(arguments='{"key":"Education Technology Specialist","value":"technology"}', name='generate_dictionary'), type='function')])
Dict Output:  {'Education Technology Specialist': 'technology'}
Text Item:  Incident Response Analyst
GPT Output:  ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_IuBq4uZPt1y4TxHDbS3im4o0', function=Function(arguments='{"key":"Incident Response Analyst","value":"technology"}', name='generate_dictionary'), type='function')])
Dict Output:  {'Incident Response Analyst': 'technology'}
Text Item:  Spa Operations Manager
GPT Output:  ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_c6BmRmedFPVFU0d

Let's see how GPT performed. First let's check the unique labels in the predictions column of our dataset. GPT should have only output options from the original list of 8.

In [None]:
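# use separate variable names so the original unique_labels_list, which is
# passed to the prompt for every model, is not overwritten with predictions
predicted_labels = gpt_preds_df['predicted_label'].unique()
predicted_labels_list = predicted_labels.tolist()
print("Label Counts List:", predicted_labels_list)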

Label Counts List: ['education', 'technology', 'retail_hospitality', 'legal', 'drama_arts', 'marketing_advertising', 'healthcare', 'finance', 'management']


We can see GPT created an extra label called "management". This is one of the problems with LLMs: they do not always stay within the bounds you give them. Lowering the temperature to reduce creativity helps with this issue. Here we have set temperature to 0.5 on a scale from 0 to 2.0.
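Two further guards are worth considering. JSON Schema has an `enum` keyword, which can be added to the `value` property of the tool definition so that the schema itself lists the allowed labels, and the parsed prediction can be checked against the original label list before it is stored. The sketch below shows both; `build_tools` and `is_valid_label` are hypothetical helpers, and how strictly a given model honors `enum` was not tested in this study.

```python
ALLOWED_LABELS = ['education', 'technology', 'retail_hospitality', 'marketing_advertising',
                  'drama_arts', 'legal', 'healthcare', 'finance']


def build_tools(allowed_labels):
    """Build a tool spec whose 'value' argument is limited to the allowed labels."""
    return [{
        "type": "function",
        "function": {
            "name": "generate_dictionary",
            "description": "Generate a dictionary with 'key' equal to the input text "
                           "and 'value' equal to the matching topic from the existing list.",
            "parameters": {
                "type": "object",
                "properties": {
                    "key": {"type": "string", "description": "The input text."},
                    "value": {
                        "type": "string",
                        "enum": list(allowed_labels),  # restrict to the known labels
                        "description": "The matching topic from the existing list of topics.",
                    },
                },
                "required": ["key", "value"],
            },
        },
    }]


def is_valid_label(value, allowed_labels):
    """Return True only if the prediction is one of the original labels."""
    return value in allowed_labels


tools_with_enum = build_tools(ALLOWED_LABELS)
print(is_valid_label("management", ALLOWED_LABELS))
print(is_valid_label("healthcare", ALLOWED_LABELS))
```

A check like `is_valid_label` could replace the `gpt_value != ''` test in the loop, so that an out-of-list answer triggers another attempt instead of being written to the predictions column.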

Here is our function for measuring accuracy

In [None]:
def test_set_accuracy(test_df):
  ## simple accuracy calc using pandas - TRUE/FALSE evaluates to 1/0 when using .mean()
  ## so taking average is a handy shortcut for calculating accuracy
  accuracy = (test_df['predicted_label'] == test_df['label']).mean()
  print(f"Accuracy: {accuracy}")

In [None]:
gpt_preds_df.head()

Unnamed: 0,label,text,predicted_label
0,education,Education Technology Specialist,education
1,technology,Incident Response Analyst,technology
2,retail_hospitality,Spa Operations Manager,retail_hospitality
3,marketing_advertising,Data Analyst,technology
4,drama_arts,Hair Assistant,retail_hospitality


In [None]:
test_set_accuracy(gpt_preds_df)

Accuracy: 0.935


Using GPT-4 we get an accuracy of 93.5%. This is an impressive start. We can see above that one of the mislabeled job titles was "Data Analyst". If we are being honest, it probably qualifies as either technology or marketing/advertising, and GPT chose technology. This is a situation where some human oversight is still required. Up to this point, our best performing model was the BERT transformer with a test accuracy of 89%. Our first experiment with an LLM has already surpassed that performance, and we haven't even looked into fine-tuning or adding agents (coming soon!).

To avoid repeating code, we re-ran the cells above for GPT-3.5 by switching the model variable. GPT-3.5 reached a test accuracy of 90.5%. Some improvement is expected from the latest model, but knowing the exact difference is important when weighing it against the extra cost.
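A single accuracy number hides where the mistakes are. A per-label breakdown and a count of the confused label pairs can point to ambiguous titles such as "Data Analyst". The sketch below uses a handful of toy rows rather than the real predictions; `per_label_accuracy` and `confusion_pairs` are hypothetical helpers, and in the notebook the two lists could come from `gpt_preds_df['label'].tolist()` and `gpt_preds_df['predicted_label'].tolist()`.

```python
from collections import Counter, defaultdict


def per_label_accuracy(true_labels, predicted_labels):
    """Return the share of correct predictions for each true label."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        correct[t] += int(t == p)
    return {label: correct[label] / total[label] for label in total}


def confusion_pairs(true_labels, predicted_labels):
    """Count each (true, predicted) pair where the prediction was wrong."""
    return Counter((t, p) for t, p in zip(true_labels, predicted_labels) if t != p)


# toy data for illustration only, not the study's results
y_true = ["education", "technology", "marketing_advertising", "drama_arts", "technology"]
y_pred = ["education", "technology", "technology", "retail_hospitality", "technology"]

print(per_label_accuracy(y_true, y_pred))
print(confusion_pairs(y_true, y_pred).most_common())
```

Pairs that show up repeatedly are good candidates for human review or for an extra example in the prompt.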

## Google Gemini

Google's Gemini has versions 1.0 and 1.5 available as of the publishing of this series. There is a free tier for each model, with strict rate limits, and neither version has a paid tier yet. The limits on the free version of 1.5 Pro are too strict for our task, so we will test with version 1.0. Be sure to check back when the paid tier for version 1.5 releases later in May 2024. Version 1.0 supports only user prompts, a major difference from version 1.5.

In [None]:
gemini_preds_df = test_df.copy()
#gemini_preds_df = gemini_preds_df.iloc[:1]

In [None]:
# not using the markdown, remove?
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [None]:
genai.configure(api_key=gemini_api_key)
#os.environ['GENAI_API_KEY'] = gemini_api_key

In [None]:
# Set up the model
generation_config = {
  "temperature": 0.5,
  ## this disables the top_p filtering
  "top_p": 1,
  ## this disables the top_k filtering
  "top_k": 0,
  ## keep max output low, labels provided are low token count
  "max_output_tokens": 100,
  #"response_mime_type": "application/json",
}

In [None]:
# list the available models
for m in genai.list_models():
  if 'generateContent' in m.supported_generation_methods:
    print(m.name)

models/gemini-1.0-pro
models/gemini-1.0-pro-001
models/gemini-1.0-pro-latest
models/gemini-1.0-pro-vision-latest
models/gemini-1.5-pro-latest
models/gemini-pro
models/gemini-pro-vision


In [None]:
# choose the text only model
model_name="gemini-1.0-pro-latest"

In [None]:
def gemini_call(model_name, generation_config, user_prompt):
    model = genai.GenerativeModel(model_name=model_name,
                                  generation_config=generation_config)

    response = model.generate_content(user_prompt)

    return response

In [None]:
tracker_list = []

for i, text in enumerate(gemini_preds_df['text']):
    print("Text Item:", text)
    while True:
        try:
            user_prompt = create_user_prompt(unique_labels_list, text)
            response = gemini_call(model_name, generation_config, user_prompt)
            print("Gemini Output:", response)
            ## Extracting the text
            response_str = response.candidates[0].content.parts[0].text
            response_str = response_str.strip()
            ## Parsing the string into a dictionary object
            json_response = json.loads(response_str)
            print("Parsed JSON:", json_response)
            if json_response is not None:
                ## Extracting key and value as separate variables using popitem()
                key, value = json_response.popitem()
                tracker = {key: value}
                print("Dict Output:", tracker)
                if text == key and value in unique_labels_list:
                    print("Adding prediction to df")
                    gemini_preds_df.at[i, 'predicted_label'] = value
                    tracker_list.append(tracker)
                    break
                else:
                    print("Invalid Response from Gemini, resubmitting request")
            else:
                print("JSON response is None, resubmitting request...")
        except Exception as e:
            ## Catch all exceptions including RateLimitError
            print(f"An error occurred: {e}")
            if "RESOURCE_EXHAUSTED" in str(e):
                print("Rate limit exceeded. Waiting for 100 seconds to retry...")
                time.sleep(100)  ## Wait for 100 seconds (adjust as needed)
            else:
                break  ## Exit the loop for other exceptions

print("Tracker List:", tracker_list)

Text Item: Education Technology Specialist
Gemini Output: response:
GenerateContentResponse(
    done=True,
    iterator=None,
    result=glm.GenerateContentResponse({'candidates': [{'content': {'parts': [{'text': '{"Education Technology Specialist":"education"}'}], 'role': 'model'}, 'finish_reason': 1, 'index': 0, 'safety_ratings': [{'category': 9, 'probability': 1, 'blocked': False}, {'category': 8, 'probability': 1, 'blocked': False}, {'category': 7, 'probability': 1, 'blocked': False}, {'category': 10, 'probability': 1, 'blocked': False}], 'token_count': 0, 'grounding_attributions': []}]}),
)
Parsed JSON: {'Education Technology Specialist': 'education'}
Dict Output: {'Education Technology Specialist': 'education'}
Adding prediction to df
Text Item: Incident Response Analyst
Gemini Output: response:
GenerateContentResponse(
    done=True,
    iterator=None,
    result=glm.GenerateContentResponse({'candidates': [{'content': {'parts': [{'text': '{"Incident Response Analyst":"technolog

In [None]:
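# use separate variable names so the original unique_labels_list, which is
# passed to the prompt for every model, is not overwritten with predictions
predicted_labels = gemini_preds_df['predicted_label'].unique()
predicted_labels_list = predicted_labels.tolist()
print("Label Counts List:", predicted_labels_list)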

Label Counts List: ['education', 'technology', 'retail_hospitality', 'legal', 'drama_arts', 'marketing_advertising', 'healthcare', 'finance', 'management']


Oddly enough, Gemini also creates a new label category called "management". GPT-4 did the same, but GPT-3.5 stayed within the bounds. Let's keep this in mind and see if it is a recurring issue with other models.

In [None]:
gemini_preds_df.head()

Unnamed: 0,label,text,predicted_label
0,education,Education Technology Specialist,education
1,technology,Incident Response Analyst,technology
2,retail_hospitality,Spa Operations Manager,retail_hospitality
3,marketing_advertising,Data Analyst,technology
4,drama_arts,Hair Assistant,retail_hospitality


In [None]:
test_set_accuracy(gemini_preds_df)

Accuracy: 0.91


Gemini Pro 1.0 falls between GPT-3.5 and GPT-4 with an accuracy of 91%. Its speed of task completion was similar to that of the GPT series. It was able to beat GPT-3.5 slightly without needing a system prompt or a function/tool to keep the output in dictionary format.

## Meta Llama

Meta's open-source model "Llama" released its third version on 4/18/24. While it is open-source, there is no API available through Meta as of this writing. We will use a third-party called Replicate to access the model via API. Replicate has a free trial, but eventually will require a fee to use the API. Llama can also be accessed via Hugging Face's API but let's explore Replicate as it has additional features that Hugging Face does not. We will explore the Hugging Face API later on.

Replicate provides a very user-friendly interface and includes many of the same parameters as OpenAI and Gemini do.

https://replicate.com/meta/meta-llama-3-70b-instruct/api

We discussed earlier how some of the job titles in our data could qualify for multiple categories. Llama recognizes this and occasionally tries to assign two labels to a job title. This is incorrect for our task, so the prompt needed more detail stating that only one label must be selected. There are other ways around this problem. Perhaps your solution is to have the model flag each job that could fit into multiple categories and ask for human intervention to decide. For the task at hand, we just want to compare performance, but it was interesting to see this model deviate from the instructions in that way.

Llama was also found to be very chatty. It provided the output requested, but would output longer strings of text and sentences before it provided the one- or two-word answer requested. This resulted in many errors when parsing the data in the loop below. The first full run reached only 72% accuracy, largely because parsing errors left many predictions as "NaN".

The following instruction was added to the user prompt before the examples. It corrected all errors on the very next run: "Only choose one value from the existing topics. Only respond with the dictionary output shown in the examples provided. No chatting."
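A complementary safeguard is to pull the JSON object out of a chatty reply before parsing it, so that surrounding sentences do not cause `json.loads` to fail. The sketch below assumes the answer is a flat dictionary with no nested braces, as in our prompt examples; `extract_json_object` is a hypothetical helper and was not part of the runs reported here.

```python
import json
import re


def extract_json_object(text):
    """Return the first flat {...} JSON object found in text, or None."""
    # match the first brace-delimited span that contains no nested braces
    match = re.search(r"\{[^{}]*\}", text)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# made-up reply imitating a chatty model response
chatty = 'Sure! Here is the answer: {"Oncology Nurse": "healthcare"} Hope that helps.'
print(extract_json_object(chatty))
print(extract_json_object("I think this is a healthcare role."))
```

Replies with no dictionary at all still return `None`, so they can be retried or flagged rather than silently recorded as missing values.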

In [None]:
os.environ['REPLICATE_API_TOKEN'] = replicate_api_key

In [None]:
model = 'meta/meta-llama-3-70b-instruct'

In [None]:
meta_preds_df = test_df.copy()
#meta_preds_df = meta_preds_df.iloc[:2]

In [None]:
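def meta_call(model, system_prompt, user_prompt):
    # The system and user prompts are sent together as one prompt string
    response = replicate.run(
        model,
        input={
            "prompt": f"{system_prompt} {user_prompt} Assistant: ",
            "temperature": 0.5,
            "top_p": 1,
            "top_k": 0,
            "max_length": 50,
        },
    )
    # The caller iterates over the response and joins the pieces into one string
    return response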

In [None]:
tracker_list = []

for i, text in enumerate(meta_preds_df['text']):
    print("Text Item:", text)
    while True:
        try:
            user_prompt = create_user_prompt(unique_labels_list, text)
            response = meta_call(model, system_prompt, user_prompt)
            # Join the pieces of the response into a single string
            full_response = "".join(response)
            print("Meta Output:", full_response)
            # Parsing the string into a dictionary object
            json_response = json.loads(full_response)
            print("Parsed JSON:", json_response)  # Debugging
            if json_response is not None:
                # Extracting key and value as separate variables using popitem()
                key, value = json_response.popitem()
                tracker = {key: value}
                print("Dict Output:", tracker)
                if text == key and value in unique_labels_list:
                    print("Adding prediction to df")
                    meta_preds_df.at[i, 'predicted_label'] = value
                    tracker_list.append(tracker)
                    break
                else:
                    print("Invalid Response from Meta, resubmitting request")
            else:
                print("JSON response is None, resubmitting request...")
        except Exception as e:
            # Catch all exceptions including RateLimitError
            print(f"An error occurred: {e}")
            if "Rate limit exceeded" in str(e):
                print("Rate limit exceeded. Waiting for 100 seconds to retry...")
                time.sleep(100)  # Wait for 100 seconds (adjust as needed)
            else:
                break  # Exit the loop for other exceptions

print("Tracker List:", tracker_list)

Text Item: Education Technology Specialist
Meta Output: {"Education Technology Specialist":"education"}
Parsed JSON: {'Education Technology Specialist': 'education'}
Dict Output: {'Education Technology Specialist': 'education'}
Adding prediction to df
Text Item: Incident Response Analyst
Meta Output: {"Incident Response Analyst":"technology"}
Parsed JSON: {'Incident Response Analyst': 'technology'}
Dict Output: {'Incident Response Analyst': 'technology'}
Adding prediction to df
Text Item: Spa Operations Manager
Meta Output: {"Spa Operations Manager":"retail_hospitality"}
Parsed JSON: {'Spa Operations Manager': 'retail_hospitality'}
Dict Output: {'Spa Operations Manager': 'retail_hospitality'}
Adding prediction to df
Text Item: Data Analyst
Meta Output: {"Data Analyst":"finance"}
Parsed JSON: {'Data Analyst': 'finance'}
Dict Output: {'Data Analyst': 'finance'}
Adding prediction to df
Text Item: Hair Assistant
Meta Output: {"Hair Assistant":"retail_hospitality"}
Parsed JSON: {'Hair Assis

In [None]:
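# Keep the predicted labels in their own variable so the allowed-label list
# (unique_labels_list) used to validate responses is not overwritten
meta_predicted_labels = meta_preds_df['predicted_label'].unique().tolist()
print("Label Counts List:", meta_predicted_labels)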

Label Counts List: ['education', 'technology', 'retail_hospitality', 'finance', 'legal', 'drama_arts', 'marketing_advertising', 'healthcare']


In [None]:
meta_preds_df.head()

|  | label | text | predicted_label |
|---|---|---|---|
| 0 | education | Education Technology Specialist | education |
| 1 | technology | Incident Response Analyst | technology |
| 2 | retail_hospitality | Spa Operations Manager | retail_hospitality |
| 3 | marketing_advertising | Data Analyst | finance |
| 4 | drama_arts | Hair Assistant | retail_hospitality |


In [None]:
test_set_accuracy(meta_preds_df)

Accuracy: 0.93
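
The `test_set_accuracy` helper is defined earlier in the notebook and is not reproduced here. As a rough illustration of why unparsed rows pulled the first run down to 72%, the hypothetical `accuracy_with_missing` below counts a missing prediction as a miss. That scoring rule is an assumption for this sketch, not necessarily what `test_set_accuracy` does.

```python
def accuracy_with_missing(true_labels, predicted_labels):
    """Fraction of rows predicted correctly; None (unparsed) counts as wrong."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be the same length")
    if not true_labels:
        return 0.0
    correct = sum(
        1
        for truth, pred in zip(true_labels, predicted_labels)
        if pred is not None and pred == truth
    )
    return correct / len(true_labels)


# Toy example: two correct, one wrong, one that failed to parse
print(accuracy_with_missing(["a", "b", "c", "d"], ["a", None, "c", "x"]))  # 0.5
```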


After adding additional context to the user prompt, the Meta Llama 3 model reached 93% accuracy. It showed good speed as well. It is okay that the model required some tinkering with the prompt; that is the basis of prompt engineering. Not all of these models work the same way, and it is up to users to figure out how to get the best performance out of each one. Adding a few short sentences raised accuracy to 93%, which we'd say is a fair trade-off. It would be nice, however, if Replicate incorporated function calling as GPT does to ensure proper output structure.
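
The loop used in this section resubmits a request until it gets a valid answer, which could repeat for a long time if a model keeps returning the same invalid label. A bounded variant with exponential backoff might look like the sketch below. `call_with_retries`, `max_attempts`, and `base_delay` are hypothetical names, and `ask` stands for any function that sends one prompt and returns a parsed prediction, or `None` when the response is invalid.

```python
import time


def call_with_retries(ask, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call ask() until it returns a non-None result or attempts run out.

    Returns the result, or None if every attempt failed or was invalid.
    """
    for attempt in range(max_attempts):
        try:
            result = ask()
            if result is not None:
                return result
        except Exception as exc:  # broad on purpose, like the loop in this section
            print(f"Attempt {attempt + 1} failed: {exc}")
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

A row that still has no prediction after `max_attempts` tries can be logged and skipped, rather than blocking the rest of the run.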

## Hugging Chat

Hugging Face, the popular open-source platform, has its own chatbot, which can be found here: https://huggingface.co/chat/

One of the advantages of using this open-source chat bot is the ability to change the model based on the user's specific needs. This section shows how to use the *unofficial* python library to interface with the Hugging Chat API.

Special thanks to GitHub user **Soulter** for their contributions to the Hugging Chat API. See more on their GitHub page:

https://github.com/Soulter/hugging-chat-api

One huge advantage of Hugging Chat is that it is FREE. The open-source library is fairly lightweight, only allowing one prompt, but since we don't have to worry about cost per token, we can add more detail to that prompt if needed.
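
If "one prompt" is read as there being no separate system message, any system-level instructions have to travel in the same string as the item to classify. A minimal sketch, using the hypothetical helper `merge_prompts` (not part of the notebook or of the hugchat library):

```python
def merge_prompts(system_prompt, user_prompt):
    """Fold system instructions and the user request into one message string."""
    parts = [
        part.strip()
        for part in (system_prompt, user_prompt)
        if part and part.strip()
    ]
    # A blank line keeps the instructions visually separate from the request.
    return "\n\n".join(parts)


print(merge_prompts("You label job titles.", "Job title: Judge"))
```

The merged string can then be passed wherever a single prompt is expected.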

In [None]:
hf_preds_df = test_df.copy()
#hf_preds_df = hf_preds_df.iloc[:2]

We will be using the default model for testing: CohereForAI/c4ai-command-r-plus

In [None]:
## function to switch models
## testing with default (0)
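def switch_model(models):
    print("Available models:")
    for i, model in enumerate(models):
        print(f"{i}: {model}")

    while True:
        try:
            index = int(input("Enter the index of the model you want to switch to: "))
        except ValueError:
            print("Invalid input. Please enter a number.")
            continue
        ## Reject out-of-range indices before touching the chatbot session
        if not 0 <= index < len(models):
            print(f"Index out of range. Please enter a number from 0 to {len(models) - 1}.")
            continue
        chatbot.switch_llm(index)
        print(f"Switched to model {index}: {models[index]}")
        break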

In [None]:
## Log in to huggingface and grant authorization to huggingchat
## trailing slash (/) is required to avoid errors
cookie_path_dir = "./cookies/"
sign = Login(hf_email, hf_pw)
cookies = sign.login(cookie_dir_path=cookie_path_dir, save_cookies=True)
## create chatbot session
chatbot = hugchat.ChatBot(cookies=cookies.get_dict())
## display available models
models = chatbot.get_available_llm_models()
## run function to switch between models (if needed)
switch_model(models)

In [None]:
tracker_list = []

for i, text in enumerate(hf_preds_df['text']):
    print("Text Item:", text)
    while True:
        try:
            user_prompt = create_user_prompt(unique_labels_list, text)
            response = chatbot.chat(user_prompt)
            print("HF Output:", response)
            ## Parsing the Message into a dictionary object
            str_response = str(response)
            json_response = json.loads(str_response)
            print("Parsed JSON:", json_response)
            if json_response is not None:
                ## Extracting key and value as separate variables using popitem()
                key, value = json_response.popitem()
                tracker = {key: value}
                print("Dict Output:", tracker)
                if text == key and value in unique_labels_list:
                    print("Adding prediction to df")
                    hf_preds_df.at[i, 'predicted_label'] = value
                    tracker_list.append(tracker)
                    break
                else:
                    print("Invalid Response from HF, resubmitting request")
            else:
                print("JSON response is None, resubmitting request...")
        except Exception as e:
            ## When HF flags for rate limit, pause requests for 100 seconds
            print(f"An error occurred: {e}")
            if "You are sending too many messages" in str(e):
                print("Rate limit exceeded. Waiting for 100 seconds to retry...")
                time.sleep(100)  ## Wait for 100 seconds before retrying
            else:
                break  ## Exit the loop for other exceptions

print("Tracker List:", tracker_list)

Text Item: Education Technology Specialist
HF Output: {"Education Technology Specialist": "education"}
Parsed JSON: {'Education Technology Specialist': 'education'}
Extracted key-value pair: Education Technology Specialist education
Adding prediction to df
Text Item: Incident Response Analyst
HF Output: {"Incident Response Analyst": "technology"}
Parsed JSON: {'Incident Response Analyst': 'technology'}
Extracted key-value pair: Incident Response Analyst technology
Adding prediction to df
Text Item: Spa Operations Manager
HF Output: {"Spa Operations Manager": "retail_hospitality"}
Parsed JSON: {'Spa Operations Manager': 'retail_hospitality'}
Extracted key-value pair: Spa Operations Manager retail_hospitality
Adding prediction to df
Text Item: Data Analyst
HF Output: {"Data Analyst": "technology"}
Parsed JSON: {'Data Analyst': 'technology'}
Extracted key-value pair: Data Analyst technology
Adding prediction to df
Text Item: Hair Assistant
HF Output: {"Hair Assistant": "retail_hospitality

ERROR:root:No `type` found in response: {'message': 'You are sending too many messages. Try again later.'}


HF Output: An error occurred: Server returns an error: You are sending too many messages. Try again later.
Rate limit exceeded. Waiting for 100 seconds to retry...
HF Output: {"College Counselor":"education"}
Parsed JSON: {'College Counselor': 'education'}
Extracted key-value pair: College Counselor education
Adding prediction to df
Text Item: Stage Manager


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 728, in _stream_query
    yield obj
GeneratorExit
Exception ignored in: <generator object ChatBot._stream_query at 0x7af0cd006ab0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 738, in _stream_query
    raise exceptions.ChatError(f"Failed to parse response: {res}")
hugchat.exceptions.ChatError: Failed to parse response: {"message":"You are sending too many messages. Try again later."}


HF Output: {"Stage Manager":"drama_arts"}
Parsed JSON: {'Stage Manager': 'drama_arts'}
Extracted key-value pair: Stage Manager drama_arts
Adding prediction to df
Text Item: School Psychologist
HF Output: {"School Psychologist":"education"}
Parsed JSON: {'School Psychologist': 'education'}
Extracted key-value pair: School Psychologist education
Adding prediction to df
Text Item: IT Business Relationship Manager
HF Output: {"IT Business Relationship Manager":"technology"}
Parsed JSON: {'IT Business Relationship Manager': 'technology'}
Extracted key-value pair: IT Business Relationship Manager technology
Adding prediction to df
Text Item: Public Relations Strategist
HF Output: {"Public Relations Strategist":"marketing_advertising"}
Parsed JSON: {'Public Relations Strategist': 'marketing_advertising'}
Extracted key-value pair: Public Relations Strategist marketing_advertising
Adding prediction to df
Text Item: Judge
HF Output: {"Judge":"legal"}
Parsed JSON: {'Judge': 'legal'}
Extracted key

ERROR:root:No `type` found in response: {'message': 'You are sending too many messages. Try again later.'}


HF Output: An error occurred: Server returns an error: You are sending too many messages. Try again later.
Rate limit exceeded. Waiting for 100 seconds to retry...
HF Output: {"Video Editor":"technology"}
Parsed JSON: {'Video Editor': 'technology'}
Extracted key-value pair: Video Editor technology
Adding prediction to df
Text Item: Volunteer Coordinator


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 728, in _stream_query
    yield obj
GeneratorExit
Exception ignored in: <generator object ChatBot._stream_query at 0x7af0cd0042e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 738, in _stream_query
    raise exceptions.ChatError(f"Failed to parse response: {res}")
hugchat.exceptions.ChatError: Failed to parse response: {"message":"You are sending too many messages. Try again later."}


HF Output: {"Volunteer Coordinator": "education"}
Parsed JSON: {'Volunteer Coordinator': 'education'}
Extracted key-value pair: Volunteer Coordinator education
Adding prediction to df
Text Item: Marketing Consultant
HF Output: {"Marketing Consultant":"marketing_advertising"}
Parsed JSON: {'Marketing Consultant': 'marketing_advertising'}
Extracted key-value pair: Marketing Consultant marketing_advertising
Adding prediction to df
Text Item: Admissions Counselor
HF Output: {"Admissions Counselor": "education"}
Parsed JSON: {'Admissions Counselor': 'education'}
Extracted key-value pair: Admissions Counselor education
Adding prediction to df
Text Item: Teacher
HF Output: {"Teacher":"education"}
Parsed JSON: {'Teacher': 'education'}
Extracted key-value pair: Teacher education
Adding prediction to df
Text Item: Podiatry Assistant
HF Output: {"Podiatry Assistant":"healthcare"}
Parsed JSON: {'Podiatry Assistant': 'healthcare'}
Extracted key-value pair: Podiatry Assistant healthcare
Adding predi

ERROR:root:No `type` found in response: {'message': 'You are sending too many messages. Try again later.'}


HF Output: An error occurred: Server returns an error: You are sending too many messages. Try again later.
Rate limit exceeded. Waiting for 100 seconds to retry...
HF Output: {"Social Media Coordinator": "marketing_advertising"}
Parsed JSON: {'Social Media Coordinator': 'marketing_advertising'}
Extracted key-value pair: Social Media Coordinator marketing_advertising
Adding prediction to df
Text Item: Education Data Analyst


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 728, in _stream_query
    yield obj
GeneratorExit
Exception ignored in: <generator object ChatBot._stream_query at 0x7af0cd051c40>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 738, in _stream_query
    raise exceptions.ChatError(f"Failed to parse response: {res}")
hugchat.exceptions.ChatError: Failed to parse response: {"message":"You are sending too many messages. Try again later."}


HF Output: {"Education Data Analyst": "education"}
Parsed JSON: {'Education Data Analyst': 'education'}
Extracted key-value pair: Education Data Analyst education
Adding prediction to df
Text Item: Information Security Engineer
HF Output: {"Information Security Engineer": "technology"}
Parsed JSON: {'Information Security Engineer': 'technology'}
Extracted key-value pair: Information Security Engineer technology
Adding prediction to df
Text Item: Fulfillment Manager
HF Output: {"Fulfillment Manager": "retail_hospitality"}
Parsed JSON: {'Fulfillment Manager': 'retail_hospitality'}
Extracted key-value pair: Fulfillment Manager retail_hospitality
Adding prediction to df
Text Item: Business Intelligence Developer
HF Output: {"Business Intelligence Developer": "technology"}
Parsed JSON: {'Business Intelligence Developer': 'technology'}
Extracted key-value pair: Business Intelligence Developer technology
Adding prediction to df
Text Item: Art Conservator
HF Output: {"Art Conservator": "drama_

ERROR:root:No `type` found in response: {'message': 'You are sending too many messages. Try again later.'}


HF Output: An error occurred: Server returns an error: You are sending too many messages. Try again later.
Rate limit exceeded. Waiting for 100 seconds to retry...
HF Output: {"Mortgage Broker": "finance"}
Parsed JSON: {'Mortgage Broker': 'finance'}
Extracted key-value pair: Mortgage Broker finance
Adding prediction to df
Text Item: Genetic Counselor
HF Output: {"Genetic Counselor": "healthcare"}
Parsed JSON: {'Genetic Counselor': 'healthcare'}
Extracted key-value pair: Genetic Counselor healthcare
Adding prediction to df
Text Item: Marketing Analyst
HF Output: {"Marketing Analyst":"marketing_advertising"}
Parsed JSON: {'Marketing Analyst': 'marketing_advertising'}
Extracted key-value pair: Marketing Analyst marketing_advertising
Adding prediction to df
Text Item: Data Engineer
HF Output: {"Data Engineer": "technology"}
Parsed JSON: {'Data Engineer': 'technology'}
Extracted key-value pair: Data Engineer technology
Adding prediction to df
Text Item: Financial Examiner
HF Output: {"Finan

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 728, in _stream_query
    yield obj
GeneratorExit
Exception ignored in: <generator object ChatBot._stream_query at 0x7af0cd004eb0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 738, in _stream_query
    raise exceptions.ChatError(f"Failed to parse response: {res}")
hugchat.exceptions.ChatError: Failed to parse response: {"message":"You are sending too many messages. Try again later."}


HF Output: {"College Access Coordinator":"education"}
Parsed JSON: {'College Access Coordinator': 'education'}
Extracted key-value pair: College Access Coordinator education
Adding prediction to df
Text Item: Financial Planner Assistant
HF Output: {"Financial Planner Assistant":"finance"}
Parsed JSON: {'Financial Planner Assistant': 'finance'}
Extracted key-value pair: Financial Planner Assistant finance
Adding prediction to df
Text Item: Multicultural Education Specialist
HF Output: {"Multicultural Education Specialist":"education"}
Parsed JSON: {'Multicultural Education Specialist': 'education'}
Extracted key-value pair: Multicultural Education Specialist education
Adding prediction to df
Text Item: Clinical Psychologist
HF Output: {"Clinical Psychologist": "healthcare"}
Parsed JSON: {'Clinical Psychologist': 'healthcare'}
Extracted key-value pair: Clinical Psychologist healthcare
Adding prediction to df
Text Item: Online Merchandiser
HF Output: {"Online Merchandiser": "retail_hospit

ERROR:root:No `type` found in response: {'message': 'You are sending too many messages. Try again later.'}


HF Output: An error occurred: Server returns an error: You are sending too many messages. Try again later.
Rate limit exceeded. Waiting for 100 seconds to retry...
HF Output: {"Financial Aid Advisor": "education"}
Parsed JSON: {'Financial Aid Advisor': 'education'}
Extracted key-value pair: Financial Aid Advisor education
Adding prediction to df
Text Item: Network Operations Center (NOC) Engineer


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 728, in _stream_query
    yield obj
GeneratorExit
Exception ignored in: <generator object ChatBot._stream_query at 0x7af0cd0521f0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 738, in _stream_query
    raise exceptions.ChatError(f"Failed to parse response: {res}")
hugchat.exceptions.ChatError: Failed to parse response: {"message":"You are sending too many messages. Try again later."}


HF Output: {"Network Operations Center (NOC) Engineer": "technology"}
Parsed JSON: {'Network Operations Center (NOC) Engineer': 'technology'}
Extracted key-value pair: Network Operations Center (NOC) Engineer technology
Adding prediction to df
Text Item: Information Technology Auditor
HF Output: {"Information Technology Auditor": "technology"}
Parsed JSON: {'Information Technology Auditor': 'technology'}
Extracted key-value pair: Information Technology Auditor technology
Adding prediction to df
Text Item: Vice Squad Officer
HF Output: {"Vice Squad Officer": "legal"}
Parsed JSON: {'Vice Squad Officer': 'legal'}
Extracted key-value pair: Vice Squad Officer legal
Adding prediction to df
Text Item: Law Clerk
HF Output: {"Law Clerk":"legal"}
Parsed JSON: {'Law Clerk': 'legal'}
Extracted key-value pair: Law Clerk legal
Adding prediction to df
Text Item: Resort Activities Coordinator
HF Output: {"Resort Activities Coordinator":"retail_hospitality"}
Parsed JSON: {'Resort Activities Coordinator

ERROR:root:No `type` found in response: {'message': 'You are sending too many messages. Try again later.'}


HF Output: An error occurred: Server returns an error: You are sending too many messages. Try again later.
Rate limit exceeded. Waiting for 100 seconds to retry...
HF Output: {"Producer":"drama_arts"}
Parsed JSON: {'Producer': 'drama_arts'}
Extracted key-value pair: Producer drama_arts
Adding prediction to df
Text Item: Systems Integrator


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 728, in _stream_query
    yield obj
GeneratorExit
Exception ignored in: <generator object ChatBot._stream_query at 0x7af0cd005770>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 738, in _stream_query
    raise exceptions.ChatError(f"Failed to parse response: {res}")
hugchat.exceptions.ChatError: Failed to parse response: {"message":"You are sending too many messages. Try again later."}


HF Output: {"Systems Integrator":"technology"}
Parsed JSON: {'Systems Integrator': 'technology'}
Extracted key-value pair: Systems Integrator technology
Adding prediction to df
Text Item: Nurse
HF Output: {"Nurse": "healthcare"}
Parsed JSON: {'Nurse': 'healthcare'}
Extracted key-value pair: Nurse healthcare
Adding prediction to df
Text Item: Medical Malpractice Lawyer
HF Output: {"Medical Malpractice Lawyer":"legal"}
Parsed JSON: {'Medical Malpractice Lawyer': 'legal'}
Extracted key-value pair: Medical Malpractice Lawyer legal
Adding prediction to df
Text Item: Content Executive
HF Output: {"Content Executive":"marketing_advertising"}
Parsed JSON: {'Content Executive': 'marketing_advertising'}
Extracted key-value pair: Content Executive marketing_advertising
Adding prediction to df
Text Item: Interventional Radiologist
HF Output: {"Interventional Radiologist": "healthcare"}
Parsed JSON: {'Interventional Radiologist': 'healthcare'}
Extracted key-value pair: Interventional Radiologist he

ERROR:root:No `type` found in response: {'message': 'You are sending too many messages. Try again later.'}


HF Output: An error occurred: Server returns an error: You are sending too many messages. Try again later.
Rate limit exceeded. Waiting for 100 seconds to retry...
HF Output: {"Night Auditor":"hospitality_retail"}
Parsed JSON: {'Night Auditor': 'hospitality_retail'}
Extracted key-value pair: Night Auditor hospitality_retail
Invalid Response from HF, resubmitting request


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 728, in _stream_query
    yield obj
GeneratorExit
Exception ignored in: <generator object ChatBot._stream_query at 0x7af0cd0060a0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 738, in _stream_query
    raise exceptions.ChatError(f"Failed to parse response: {res}")
hugchat.exceptions.ChatError: Failed to parse response: {"message":"You are sending too many messages. Try again later."}


HF Output: {"Night Auditor":"hospitality_retail"}
Parsed JSON: {'Night Auditor': 'hospitality_retail'}
Extracted key-value pair: Night Auditor hospitality_retail
Invalid Response from HF, resubmitting request
HF Output: {"Night Auditor":"hospitality_retail"}
Parsed JSON: {'Night Auditor': 'hospitality_retail'}
Extracted key-value pair: Night Auditor hospitality_retail
Invalid Response from HF, resubmitting request
HF Output: {"Night Auditor":"hospitality_retail"}
Parsed JSON: {'Night Auditor': 'hospitality_retail'}
Extracted key-value pair: Night Auditor hospitality_retail
Invalid Response from HF, resubmitting request
HF Output: {"Night Auditor":"hospitality_retail"}
Parsed JSON: {'Night Auditor': 'hospitality_retail'}
Extracted key-value pair: Night Auditor hospitality_retail
Invalid Response from HF, resubmitting request
HF Output: {"Night Auditor":"hospitality_retail"}
Parsed JSON: {'Night Auditor': 'hospitality_retail'}
Extracted key-value pair: Night Auditor hospitality_retail
In

ERROR:root:No `type` found in response: {'message': 'You are sending too many messages. Try again later.'}


HF Output: An error occurred: Server returns an error: You are sending too many messages. Try again later.
Rate limit exceeded. Waiting for 100 seconds to retry...
HF Output: {"Movement Coach":"drama_arts"}
Parsed JSON: {'Movement Coach': 'drama_arts'}
Extracted key-value pair: Movement Coach drama_arts
Adding prediction to df
Text Item: IT Compliance Analyst


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 728, in _stream_query
    yield obj
GeneratorExit
Exception ignored in: <generator object ChatBot._stream_query at 0x7af0cd005e00>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 738, in _stream_query
    raise exceptions.ChatError(f"Failed to parse response: {res}")
hugchat.exceptions.ChatError: Failed to parse response: {"message":"You are sending too many messages. Try again later."}


HF Output: {"IT Compliance Analyst":"technology"}
Parsed JSON: {'IT Compliance Analyst': 'technology'}
Extracted key-value pair: IT Compliance Analyst technology
Adding prediction to df
Text Item: Tour Guide
HF Output: {"Tour Guide":"retail_hospitality"}
Parsed JSON: {'Tour Guide': 'retail_hospitality'}
Extracted key-value pair: Tour Guide retail_hospitality
Adding prediction to df
Text Item: Corporate Treasurer
HF Output: {"Corporate Treasurer": "finance"}
Parsed JSON: {'Corporate Treasurer': 'finance'}
Extracted key-value pair: Corporate Treasurer finance
Adding prediction to df
Text Item: Compliance Analyst
HF Output: {"Compliance Analyst": "legal"}
Parsed JSON: {'Compliance Analyst': 'legal'}
Extracted key-value pair: Compliance Analyst legal
Adding prediction to df
Text Item: Crime Prevention Specialist
HF Output: {"Crime Prevention Specialist": "legal"}
Parsed JSON: {'Crime Prevention Specialist': 'legal'}
Extracted key-value pair: Crime Prevention Specialist legal
Adding predict

ERROR:root:No `type` found in response: {'message': 'You are sending too many messages. Try again later.'}


HF Output: An error occurred: Server returns an error: You are sending too many messages. Try again later.
Rate limit exceeded. Waiting for 100 seconds to retry...
HF Output: {"School Counselor":"education"}
Parsed JSON: {'School Counselor': 'education'}
Extracted key-value pair: School Counselor education
Adding prediction to df
Text Item: Financial Auditor
HF Output: {"Financial Auditor": "finance"}
Parsed JSON: {'Financial Auditor': 'finance'}
Extracted key-value pair: Financial Auditor finance
Adding prediction to df
Text Item: Student Support Services Coordinator
HF Output: {"Student Support Services Coordinator":"education"}
Parsed JSON: {'Student Support Services Coordinator': 'education'}
Extracted key-value pair: Student Support Services Coordinator education
Adding prediction to df
Text Item: School Social Worker
HF Output: {"School Social Worker":"education"}
Parsed JSON: {'School Social Worker': 'education'}
Extracted key-value pair: School Social Worker education
Adding pr

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 728, in _stream_query
    yield obj
GeneratorExit
Exception ignored in: <generator object ChatBot._stream_query at 0x7af0cd004820>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 738, in _stream_query
    raise exceptions.ChatError(f"Failed to parse response: {res}")
hugchat.exceptions.ChatError: Failed to parse response: {"message":"You are sending too many messages. Try again later."}


HF Output: {"Advertising Manager": "marketing_advertising"}
Parsed JSON: {'Advertising Manager': 'marketing_advertising'}
Extracted key-value pair: Advertising Manager marketing_advertising
Adding prediction to df
Text Item: Digital Artist
HF Output: {"Digital Artist": "drama_arts"}
Parsed JSON: {'Digital Artist': 'drama_arts'}
Extracted key-value pair: Digital Artist drama_arts
Adding prediction to df
Text Item: Sponsorship Coordinator
HF Output: {"Sponsorship Coordinator": "marketing_advertising"}
Parsed JSON: {'Sponsorship Coordinator': 'marketing_advertising'}
Extracted key-value pair: Sponsorship Coordinator marketing_advertising
Adding prediction to df
Text Item: Clinical Laboratory Scientist
HF Output: {"Clinical Laboratory Scientist": "healthcare"}
Parsed JSON: {'Clinical Laboratory Scientist': 'healthcare'}
Extracted key-value pair: Clinical Laboratory Scientist healthcare
Adding prediction to df
Text Item: Technology Evangelist
HF Output: {"Technology Evangelist": "technology

ERROR:root:No `type` found in response: {'message': 'You are sending too many messages. Try again later.'}


HF Output: An error occurred: Server returns an error: You are sending too many messages. Try again later.
Rate limit exceeded. Waiting for 100 seconds to retry...
HF Output: {"Tax Advisor":"finance"}
Parsed JSON: {'Tax Advisor': 'finance'}
Extracted key-value pair: Tax Advisor finance
Adding prediction to df
Text Item: Assistant Store Manager
HF Output: {"Assistant Store Manager": "retail_hospitality"}
Parsed JSON: {'Assistant Store Manager': 'retail_hospitality'}
Extracted key-value pair: Assistant Store Manager retail_hospitality
Adding prediction to df
Text Item: Financial Analyst Manager
HF Output: {"Financial Analyst Manager": "finance"}
Parsed JSON: {'Financial Analyst Manager': 'finance'}
Extracted key-value pair: Financial Analyst Manager finance
Adding prediction to df
Text Item: Robotics Engineer
HF Output: {"Robotics Engineer": "technology"}
Parsed JSON: {'Robotics Engineer': 'technology'}
Extracted key-value pair: Robotics Engineer technology
Adding prediction to df
Text I

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 728, in _stream_query
    yield obj
GeneratorExit
Exception ignored in: <generator object ChatBot._stream_query at 0x7af0cd004c80>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 738, in _stream_query
    raise exceptions.ChatError(f"Failed to parse response: {res}")
hugchat.exceptions.ChatError: Failed to parse response: {"message":"You are sending too many messages. Try again later."}


HF Output: {"IT Governance Analyst":"technology"}
Parsed JSON: {'IT Governance Analyst': 'technology'}
Extracted key-value pair: IT Governance Analyst technology
Adding prediction to df
Text Item: Email Marketing Specialist
HF Output: {"Email Marketing Specialist": "marketing_advertising"}
Parsed JSON: {'Email Marketing Specialist': 'marketing_advertising'}
Extracted key-value pair: Email Marketing Specialist marketing_advertising
Adding prediction to df
Text Item: Hotel Concierge Supervisor
HF Output: {"Hotel Concierge Supervisor":"retail_hospitality"}
Parsed JSON: {'Hotel Concierge Supervisor': 'retail_hospitality'}
Extracted key-value pair: Hotel Concierge Supervisor retail_hospitality
Adding prediction to df
Text Item: Literacy Coach
HF Output: {"Literacy Coach": "education"}
Parsed JSON: {'Literacy Coach': 'education'}
Extracted key-value pair: Literacy Coach education
Adding prediction to df
Text Item: Travel Consultant
HF Output: {"Travel Consultant": "retail_hospitality"}
Parse

ERROR:root:No `type` found in response: {'message': 'You are sending too many messages. Try again later.'}


HF Output: An error occurred: Server returns an error: You are sending too many messages. Try again later.
Rate limit exceeded. Waiting for 100 seconds to retry...
HF Output: {"Surgical Assistant": "healthcare"}
Parsed JSON: {'Surgical Assistant': 'healthcare'}
Extracted key-value pair: Surgical Assistant healthcare
Adding prediction to df
Text Item: Data Governance Specialist


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 728, in _stream_query
    yield obj
GeneratorExit
Exception ignored in: <generator object ChatBot._stream_query at 0x7af0cd005620>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hugchat/hugchat.py", line 738, in _stream_query
    raise exceptions.ChatError(f"Failed to parse response: {res}")
hugchat.exceptions.ChatError: Failed to parse response: {"message":"You are sending too many messages. Try again later."}


HF Output: {"Data Governance Specialist": "technology"}
Parsed JSON: {'Data Governance Specialist': 'technology'}
Extracted key-value pair: Data Governance Specialist technology
Adding prediction to df
Text Item: Retail Operations Coordinator
HF Output: {"Retail Operations Coordinator": "retail_hospitality"}
Parsed JSON: {'Retail Operations Coordinator': 'retail_hospitality'}
Extracted key-value pair: Retail Operations Coordinator retail_hospitality
Adding prediction to df
Text Item: Financial Analyst Associate
HF Output: {"Financial Analyst Associate":"finance"}
Parsed JSON: {'Financial Analyst Associate': 'finance'}
Extracted key-value pair: Financial Analyst Associate finance
Adding prediction to df
Text Item: IT Security Operations Manager
HF Output: {"IT Security Operations Manager": "technology"}
Parsed JSON: {'IT Security Operations Manager': 'technology'}
Extracted key-value pair: IT Security Operations Manager technology
Adding prediction to df
Text Item: Nurse Navigator
HF Ou

In [None]:
unique_labels = hf_preds_df['predicted_label'].unique()
unique_labels_list = unique_labels.tolist()
print("Label Counts List:", unique_labels_list)

Label Counts List: ['education', 'technology', 'retail_hospitality', 'legal', 'drama_arts', 'marketing_advertising', 'healthcare', 'finance']


In [None]:
hf_preds_df.head()

Unnamed: 0,label,text,predicted_label
0,education,Education Technology Specialist,education
1,technology,Incident Response Analyst,technology
2,retail_hospitality,Spa Operations Manager,retail_hospitality
3,marketing_advertising,Data Analyst,technology
4,drama_arts,Hair Assistant,retail_hospitality


In [None]:
test_set_accuracy(hf_preds_df)

Accuracy: 0.95


HugChat reaches 95% accuracy! For a free and open-source API, this is an impressive result.

Some additional observations while working with the HugChat API:
- It is much slower than the other options.
- It hits users with a rate limit error very quickly.
- The available models change over time, but currently include Llama 3. Having access to many options for free is a plus.
- There are fewer parameters available, such as functions, system/user prompts and temperature.

However, the API is totally free to use, which could make this a viable option for those on a budget.
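
Because the rate limit is hit so quickly, the retry logic matters more here than with the paid APIs. The run above waits a fixed 100 seconds after each rate limit error. As an alternative, here is a minimal sketch of exponential backoff. The helper names `call_with_backoff` and `looks_like_rate_limit` are hypothetical, and the string match is only an assumption based on the error text that appears in the logs above.

```python
import random
import time


def call_with_backoff(fn, is_rate_limit, max_retries=5, base_delay=1.0,
                      max_delay=120.0, sleep=time.sleep):
    """Call fn(); on a rate-limit error, wait and retry, doubling the delay each time."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, or once retries run out.
            if not is_rate_limit(exc) or attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% random jitter so retries do not fire in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))


def looks_like_rate_limit(exc):
    # Assumption: matches the message text seen in the logs above.
    return "too many messages" in str(exc).lower()
```

In the prediction loop, the model call would be wrapped as `call_with_backoff(lambda: some_model_call(prompt), looks_like_rate_limit)`, where `some_model_call` stands in for whichever request function is being used. The `sleep` parameter is injectable only so the helper can be exercised without actually waiting.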

## Anthropic Claude

Claude has three models available: Haiku, Sonnet and Opus. Haiku is the lightest and fastest, while Opus is the largest and most powerful. All models are paid, with pricing scaling up from Haiku. Anthropic offers $5 in free credits to get started.

Claude supports function calling, which can be included in the parameters if needed. We will include it in our code to ensure proper output structure. The schema is very similar to GPT's, with some slight variations, so we will need to adjust and reset the "tools" variable.

Anthropic also offers integration with Google Sheets, allowing users to interact with Claude directly in cells. Per Claude's documentation, it is important to provide extremely detailed tool descriptions. They have a very thorough "performance enhancement" user guide, which we suggest becoming familiar with to get the most out of the models.

In [None]:
client = anthropic.Anthropic(api_key=claude_api_key)

In [None]:
claude_preds_df = test_df.copy()
#claude_preds_df = claude_preds_df.iloc[:1]

In [None]:
#model_name = 'claude-3-haiku-20240307'
model_name = 'claude-3-sonnet-20240229'
#model_name = 'claude-3-opus-20240229'

In [None]:
# Define the function
tools = [
        {
            "name": "generate_dictionary",
            "description": "Generate a dictionary with 'key' equal to the input text value and 'value' equal to the matching topic from the existing list.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "key": {
                        "type": "string",
                        "description": "The input text."
                    },
                    "value": {
                        "type": "string",
                        "description": "The matching topic from the existing list of topics."
                    }
                }
            }
        }
]
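
The schema above leaves `value` as a free-form string, so the model could return a label outside our list. One possible tightening, shown here as a sketch rather than what this study actually ran, is to build the tool definition from the label list and add the standard JSON Schema keywords `enum` and `required`. The `build_tools` helper is hypothetical, and how strictly a model honors the constraint is not guaranteed, so the returned value is still worth checking.

```python
def build_tools(labels):
    """Return a tool list like the one above, with 'value' limited to known labels."""
    return [
        {
            "name": "generate_dictionary",
            "description": (
                "Generate a dictionary with 'key' equal to the input text value "
                "and 'value' equal to the matching topic from the existing list."
            ),
            "input_schema": {
                "type": "object",
                "properties": {
                    "key": {
                        "type": "string",
                        "description": "The input text.",
                    },
                    "value": {
                        "type": "string",
                        # Restrict the answer to the labels we already know.
                        "enum": sorted(labels),
                        "description": "The matching topic from the existing list of topics.",
                    },
                },
                # Ask for both fields to be present in every tool call.
                "required": ["key", "value"],
            },
        }
    ]
```

With the notebook's variables, this would be called as `tools = build_tools(unique_labels_list)`.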

In [None]:
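def claude_call(model_name, tools, system_prompt, user_prompt):
    # Tool use was a beta feature when this notebook was written; newer
    # anthropic SDK releases expose the same call as client.messages.create.
    message = client.beta.tools.messages.create(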
        model=model_name,
        max_tokens=100,
        temperature=0.5,
        tools=tools,
        system=system_prompt,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": user_prompt
                    }
                ]
            }
        ]
    )
    response = message.content

    return response

In [None]:
tracker_list = []

for i, text in enumerate(claude_preds_df['text']):
    print("Text Item:", text)
    while True:
        try:
            user_prompt = create_user_prompt(unique_labels_list, text)
            response = claude_call(model_name, tools, system_prompt, user_prompt)
            print("Claude Output:", response)
            ## set the key:value pair based on the Claude tool-use output
            c_key = response[0].input['key']
            c_value = response[0].input['value']
            tracker = {c_key: c_value}
            print("Dict Output: ", tracker)
            ## if the output value is not blank, append the dictionary to the final list and exit the while loop
            ## Check if both key and value have values
            if c_key is not None and c_value is not None and c_value != '':
                # Append the dictionary to the final list
                tracker_list.append(tracker)
                # Update claude_preds_df with the c_value
                claude_preds_df.at[i, 'predicted_label'] = c_value
                break  # Exit the while loop
        except Exception as e:
            # Catch all exceptions
            print(f"An error occurred: {e}")
            if "rate_limit_error" in str(e).lower():
                print("Rate limit exceeded. Waiting for 100 seconds to retry...")
                time.sleep(100)  # Wait for 100 seconds (adjust as needed)
            else:
                break  # Exit the loop for other exceptions

print("Tracker List:", tracker_list)

Text Item: Education Technology Specialist
Claude Output: [ToolUseBlock(id='toolu_01BhMNje4zGF9xWeq3bpdKUm', input={'key': 'Education Technology Specialist', 'value': 'technology'}, name='generate_dictionary', type='tool_use')]
Dict Output:  {'Education Technology Specialist': 'technology'}
Text Item: Incident Response Analyst
Claude Output: [ToolUseBlock(id='toolu_01Cs3DxJz8vmhVQR5B7TXquc', input={'key': 'Incident Response Analyst', 'value': 'technology'}, name='generate_dictionary', type='tool_use')]
Dict Output:  {'Incident Response Analyst': 'technology'}
Text Item: Spa Operations Manager
Claude Output: [ToolUseBlock(id='toolu_01Nyerpqyo9bNf9m5mzdXffs', input={'key': 'Spa Operations Manager', 'value': 'retail_hospitality'}, name='generate_dictionary', type='tool_use')]
Dict Output:  {'Spa Operations Manager': 'retail_hospitality'}
Text Item: Data Analyst
Claude Output: [ToolUseBlock(id='toolu_0133Y1K28c2BUtspQ5F2PT9j', input={'key': 'Data Analyst', 'value': 'technology'}, name='gen

In [None]:
claude_preds_df.head()

Unnamed: 0,label,text,predicted_label
0,education,Education Technology Specialist,technology
1,technology,Incident Response Analyst,technology
2,retail_hospitality,Spa Operations Manager,retail_hospitality
3,marketing_advertising,Data Analyst,technology
4,drama_arts,Hair Assistant,retail_hospitality


In [None]:
test_set_accuracy(claude_preds_df)

Accuracy: 0.925


Claude's Sonnet model, the middleweight option, scores 92.5% accuracy. The addition of a tool allows us to ensure output structure, limiting errors in the process. Anthropic might be a name users are unfamiliar with, but they offer a very user-friendly experience to go along with performance right on par with the other popular LLMs.
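
The loop above reads `response[0]`, which assumes the tool call is always the first content block. A response may also contain a text block before the tool call, so a slightly more defensive approach is to search for the `tool_use` block by type and name and to check the label against the allowed list. The sketch below uses stand-in objects that mimic the `ToolUseBlock` fields printed in the output above; `extract_prediction` is a hypothetical helper, not part of the anthropic SDK.

```python
from types import SimpleNamespace


def extract_prediction(blocks, allowed_labels, tool_name="generate_dictionary"):
    """Return (key, value) from the first valid matching tool_use block, else None."""
    for block in blocks:
        if getattr(block, "type", None) != "tool_use":
            continue  # skip text blocks that may precede the tool call
        if getattr(block, "name", None) != tool_name:
            continue  # skip calls to any other tool
        data = getattr(block, "input", None) or {}
        key, value = data.get("key"), data.get("value")
        # Accept only a non-empty key and a label from our known list.
        if key and value in allowed_labels:
            return key, value
    return None


# Stand-in objects with the same attributes as the blocks shown in the output.
demo = [
    SimpleNamespace(type="text", text="Let me classify this."),
    SimpleNamespace(
        type="tool_use",
        name="generate_dictionary",
        input={"key": "Data Analyst", "value": "technology"},
    ),
]
print(extract_prediction(demo, {"technology", "finance"}))
# ('Data Analyst', 'technology')
```

Returning `None` for a missing or out-of-list label gives the calling loop one clear signal to retry or skip the item.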

## Conclusion

In this study, we found the best performance of the series. Recall that our baseline fastai AWD-LSTM model had a test accuracy of 60.5%, which we later improved to 89% with the BERT transformer model and data augmentation. The LLMs in this study achieved accuracies ranging from 90.5% to 95%.

While not exactly the same, these popular options offer a similar experience. The user-friendly nature of OpenAI, combined with its plethora of models to choose from, makes for a compelling option. Hugging Face's open-source and completely free API is great for limited use, but the lengthy response times and low rate limits make it difficult to use for larger completion tasks. Google's Gemini has an exceptional interface, offers valuable add-ons such as function calling and model tuning, and can be linked to a user's Google Drive with ease. However, the current 1.0 version is a bit limited in tuning parameters, and both versions have strict rate limits. Meta's ever-improving Llama 3 offers strong performance and can be accessed via Hugging Chat or third-party API services like Replicate. It would be great to see Meta come out with its own API service for an even easier user experience. Incorporating function calling to keep the model output structure in place would be our top recommendation. Anthropic offers another easy-to-use option with features like function calling. We found it operates very similarly to OpenAI, which in our opinion makes for a positive, user-friendly experience.

We found merits in each of the options explored during this study. If you are just trying to become familiar with LLMs and are on a budget, it's a great idea to check out the open-source options like Hugging Face or Llama. If you want to add features such as functions, agents or assistants to really get the most out of your model, explore the options from OpenAI, Google and Anthropic. In the next study, we are going to dive into the world of agents and see if the hype is real. Many of these models can even accept images as input or output an image based on a description (e.g., OpenAI's DALL-E). We only explored chat completion models, but the possibilities feel limitless.

As a reminder, this series is meant to be a survey of techniques for improving performance of NLP classification tasks with small datasets.

Stay tuned for more case studies like this one, and if there is something you would like to chat about, feel free to reach out at:

shane@stelerivers.com