# Getting probability scores out of LLM classification

When comparing traditional ML classifiers to LLM-based ones, a common problem is that many classifier performance metrics (e.g. ROC AUC or log loss) require a vector of confidence/probability scores across the available options, not just the most likely answer.

Fortunately, some APIs, e.g. the OpenAI API, let you query the logprobs of up to 20 of the most likely tokens at each position of the response.
These still need to be masked (discarding irrelevant options), converted to probabilities, and normalized to sum to one. 
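The three steps above (mask, convert, normalize) can be sketched roughly as follows. This is only an illustration, not the library's actual implementation: the function name `normalize_logprobs` and the example logprob values are made up, and it assumes the top-token logprobs have already been collected into a plain dict.

```python
import math


def normalize_logprobs(top_logprobs, valid_tokens):
    """Mask, exponentiate, and renormalize token log probabilities.

    Hypothetical helper for illustration only.
    top_logprobs: dict mapping token -> log probability.
    valid_tokens: list of tokens that count as valid answers.
    """
    # Mask: keep only the tokens that correspond to a valid answer.
    kept = {tok: lp for tok, lp in top_logprobs.items() if tok in valid_tokens}
    if not kept:
        raise ValueError("none of the valid tokens appear in the top logprobs")

    # Convert to probabilities; subtracting the max keeps exp() numerically stable.
    max_lp = max(kept.values())
    weights = {tok: math.exp(lp - max_lp) for tok, lp in kept.items()}

    # Normalize so the scores sum to one. Valid tokens that were missing
    # from the top logprobs get a score of 0.
    total = sum(weights.values())
    return {tok: weights.get(tok, 0.0) / total for tok in valid_tokens}


# Made-up logprobs for the first response token of a yes/no question.
example = {"Yes": -0.05, "No": -3.2, "Maybe": -7.5}
probs = normalize_logprobs(example, ["Yes", "No"])
print(probs)
```

Note that after masking, the scores are relative to the valid options only: any probability mass the model put on irrelevant tokens (here "Maybe") is discarded.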

To spare you the hassle of doing this, we provide two functions: a binary classifier, `llm_classifier_binary`, which expects a yes/no question, and a multiple-choice classifier, `llm_classifier_multiple`, which expects a multiple-choice question and a list of valid options. The multiple-choice classifier also has an optional boolean argument `include_other`; if true, the classifier also includes an "Other" option in its output, for when none of the valid options fit.

To keep it simple, the multiple-choice classifier supports at most 9 options, so the LLM output can be a single digit (for speed and parsing simplicity). Feel free to contribute a version that supports a larger number of choices! ;)
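To illustrate why single-digit answers are convenient, here is a minimal sketch of how options could be numbered so that each one maps to exactly one digit. The helper `build_choice_prompt`, the prompt wording, and the choice to count "Other" toward the limit of 9 are all assumptions made for this example; the library's internals may differ.

```python
def build_choice_prompt(question, options, include_other=False):
    """Number the options 1..9 and return (prompt, digit -> option mapping).

    Hypothetical helper for illustration only.
    """
    options = list(options)
    if include_other:
        # Assumption: "Other" is appended as the last numbered option.
        options.append("Other")
    if not 1 <= len(options) <= 9:
        raise ValueError("a single-digit answer supports between 1 and 9 options")

    # Each option gets one digit, so the answer is a single easily parsed character.
    digit_to_option = {str(i): opt for i, opt in enumerate(options, start=1)}

    lines = [question, "Answer with a single digit:"]
    lines += [f"{digit}. {opt}" for digit, opt in digit_to_option.items()]
    return "\n".join(lines), digit_to_option


prompt, mapping = build_choice_prompt(
    "Which pet has a fluffy tail?", ["cat", "dog"], include_other=True
)
print(prompt)
print(mapping)
```

With a mapping like this, the logprobs of the digit tokens can be translated directly back into scores for the named options.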

In [1]:
from pprint import pprint

from langchain_openai import ChatOpenAI

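# Fall back to the parent directory if wise_topic is not installed.
try:
    import wise_topic
except ImportError:
    import os
    import sys

    sys.path.append(os.path.realpath(".."))

from wise_topic import llm_classifier_binary, llm_classifier_multiple

# temperature=0 keeps the responses as deterministic as possible.
llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)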


In [2]:
question1 = "Consider a very friendly pet with a fluffy tail. You know it's a cat or a dog. Is it a cat?"
llm_classifier_binary(llm, question1)

{False: 0.03559647724243312, True: 0.9644035227575669}

In [3]:
question2 = "Consider a very friendly pet with a waggy tail. You know it's a cat or a dog. Is it a cat?"
llm_classifier_binary(llm, question2)

{False: 0.9999912515146222, True: 8.748485377892584e-06}

In [4]:
question3 = "Consider a very friendly pet with a fluffy tail. You know it's a cat, a dog, or a dragon. Which is it?"
llm_classifier_multiple(llm, question3, ["cat", "dog", "dragon", "duck"], include_other=False)

{'cat': 0.9372176977942116,
 'dog': 0.062782248112413,
 'dragon': 5.215838794110004e-09,
 'duck': 4.887753666874768e-08}