Did you know that you can classify text into distinct, predefined classes with LLMs like ChatGPT?

Yes, it's possible, and I'll show you how.

The big advantage is that these LLMs already have a lot of knowledge and text understanding capabilities encoded in their weights. This means they can often classify text well without any task-specific training.

This is especially useful if you don't have a lot of labeled training data.

First, let's import all the libraries we need. We will use OpenAI's API and their GPT-3.5-turbo model. Additionally, we will leverage its function calling capabilities through a library called [instructor](https://github.com/jxnl/instructor).

In [1]:
import instructor
import openai
from openai import OpenAI
import enum 
from datasets import load_dataset
import numpy as np
from tqdm import tqdm

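# load the .env file so that OPENAI_API_KEY is available in the environment
from dotenv import load_dotenv
load_dotenv()

# load the API key (the OpenAI() client also reads OPENAI_API_KEY from the environment)
import os
openai.api_key = os.getenv("OPENAI_API_KEY")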

from pydantic import BaseModel

The idea is to use another library called ["pydantic"](https://pydantic.dev/) to specify what data we want the LLM to return. We can leverage this to restrict the model output to the classes we want.

To demonstrate how this works, we'll use the ["rotten_tomatoes"](https://huggingface.co/datasets/rotten_tomatoes) dataset from [huggingface.co](https://huggingface.co). This is a dataset of movie reviews from the website ["Rotten Tomatoes"](https://www.rottentomatoes.com/). The movie reviews are either "positive" or "negative", and the task is to classify each review into one of these two categories.

Further, we will only use a subset of 100 randomly chosen examples for demonstration purposes here. 

The labels are encoded as integers with 0 = 'negative' and 1 = 'positive'. With the `itol` dict we map the integers to the actual labels.

In [44]:
dataset = load_dataset("rotten_tomatoes")
subset = dataset["train"].shuffle().select(range(100)) # random subset of 100 examples

# convert the integers to labels
itol = {0: "negative", 1: "positive"}

# an example of the data
subset["text"][2], itol[subset["label"][2]]

('there has always been something likable about the marquis de sade .',
 'positive')

Now we come to the heart of the whole idea.

We leverage data types to describe the data we want to receive back from the model. 

We use an `Enum` to implement our labels `POSITIVE` and `NEGATIVE`. Then we implement a `pydantic` `BaseModel` that describes what the prediction of the LLM should encapsulate. In our very simple case this is just the predicted `class_label`, which is of type `Labels`. 

Note that the docstring here does more than just provide documentation. Since the docstring becomes part of the schema of the `SinglePrediction` class (as the `description` field) and the schema is passed to the LLM, it tells the LLM what this class represents and what it should do.

In [55]:
class Labels(str, enum.Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    
class SinglePrediction(BaseModel):
    """
    Correct class label for the given text
    """

    class_label: Labels

SinglePrediction.model_json_schema()

{'$defs': {'Labels': {'enum': ['positive', 'negative'],
   'title': 'Labels',
   'type': 'string'}},
 'description': 'Correct class label for the given text',
 'properties': {'class_label': {'$ref': '#/$defs/Labels'}},
 'required': ['class_label'],
 'title': 'SinglePrediction',
 'type': 'object'}
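
To see how the `Enum` restricts what counts as a valid prediction, here is a minimal sketch that validates raw dictionaries against the same model locally, without calling the API. The `"neutral"` payload is a made-up example, and the two classes are repeated so the snippet runs on its own. It assumes pydantic v2, which the `model_json_schema()` call above implies.

```python
import enum

from pydantic import BaseModel, ValidationError


class Labels(str, enum.Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"


class SinglePrediction(BaseModel):
    """
    Correct class label for the given text
    """

    class_label: Labels


# a payload with an allowed label is parsed into the matching enum member
ok = SinglePrediction.model_validate({"class_label": "positive"})
print(ok.class_label.value)

# a label outside the enum (hypothetical example) is rejected
try:
    SinglePrediction.model_validate({"class_label": "neutral"})
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Any response that does not fit the schema fails validation in the same way, so a successfully parsed prediction always carries one of the two labels.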

The `instructor` library works by "patching" the `openai` client and expanding its functionality. We just call the `instructor.patch()` function with an `OpenAI()` client instance as its sole argument and use the result as our `client`.

In [None]:
client = instructor.patch(OpenAI())

Now we can implement the `classify` function that takes in the data we want to classify (our movie reviews) and returns an instance of the `SinglePrediction` class we implemented above.

The code is quite simple and should be familiar to you if you have already used the OpenAI API.

In [26]:
def classify(data: str) -> SinglePrediction:
    return client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        temperature=0.4,
        response_model=SinglePrediction,
        messages=[
            {
                "role": "system",
                "content": "You are a world class algorithm to identify the sentiment of movie reviews.",
            },
            {
                "role": "user",
                "content": f"Classify the sentiment of the following movie review: {data}",
            },
        ],
    )

In [6]:
# calculate accuracy based on preds and targets
def accuracy(preds, targets):
    return np.sum(np.array(preds) == np.array(targets)) / len(preds)
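
As a quick sanity check, here is a small worked example of `accuracy` on made-up predictions. The toy lists are for illustration only and are not results from the dataset. Three of the four predictions match their targets, so the function returns 0.75.

```python
import numpy as np


# same accuracy function as above, repeated so the snippet runs on its own
def accuracy(preds, targets):
    return np.sum(np.array(preds) == np.array(targets)) / len(preds)


# toy data for illustration only
toy_preds = ["positive", "negative", "positive", "positive"]
toy_targets = ["positive", "negative", "negative", "positive"]

toy_acc = accuracy(toy_preds, toy_targets)
print(toy_acc)  # 0.75
```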

In [38]:
preds = [classify(t).class_label.value for t in tqdm(subset["text"])]
targets = [itol[l] for l in subset["label"]]

100%|██████████| 100/100 [01:13<00:00,  1.36it/s]


In [39]:
accuracy(preds, targets) # 0.88 ... well, that's much better than random guessing, and remember, this is without any training on this task.

0.88
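
A single accuracy number hides which kind of mistake the model makes. Below is a minimal sketch for breaking the predictions down by (target, prediction) pair, using only the standard library. The `confusion_counts` helper and the toy lists are hypothetical and not part of the original notebook.

```python
from collections import Counter


def confusion_counts(preds, targets):
    """Count how often each (target, prediction) pair occurs."""
    return Counter(zip(targets, preds))


# toy data for illustration only, not results from the dataset
toy_preds = ["positive", "negative", "positive", "positive"]
toy_targets = ["positive", "negative", "negative", "positive"]

counts = confusion_counts(toy_preds, toy_targets)
for (target, pred), n in sorted(counts.items()):
    print(f"target={target:<8} predicted={pred:<8} count={n}")
```

On the real run you would call `confusion_counts(preds, targets)` with the lists computed above to see whether the errors lean towards false positives or false negatives.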