# Cost projecting with LLMs

In [27]:
import os
import openai
import pprint
import math

# Set up

In [3]:
# Read the OpenAI API key from the environment variables.
# The 'openai' client library uses it to authenticate all subsequent requests.
openai.api_key = os.getenv("OPENAI_API_KEY")

# Read the Hugging Face token from the environment variables.
# It is sent as a bearer token in calls to the Hugging Face Inference Endpoint.
HF_TOKEN = os.getenv('HF_TOKEN')


In [4]:
def pretty_print(data, indent=0):
    # Print every key/value pair in a list of dicts, indenting nested dicts
    for item in data:
        for key, value in item.items():
            if isinstance(value, dict):
                print(" " * indent + f"{key}:")
                pretty_print([value], indent + 4)
            else:
                print(" " * indent + f"{key}: {value}")

In [5]:
# Define a dictionary for prices
prices = {
    'gpt-3.5-turbo': {
        'input': 0.0015 / 1000,  # per token
        'output': 0.002 / 1000  # per token
    },
    'ada': 0.0016 / 1000  # per token
}

def calculate_ada_cost(res):
    # Calculate the cost
    input_tokens = res['usage']['prompt_tokens']
    output_tokens = res['usage']['completion_tokens']
    total_tokens = res['usage']['total_tokens']
    total_cost = total_tokens * prices['ada']
    return {'input_tokens': input_tokens, 'output_tokens': output_tokens, 'total_cost': total_cost}


def generate_openai_response(prompt, model='gpt-3.5-turbo', **kwargs):
    # Chat models: 'gpt-4' needs its own entry in `prices` before it can be costed
    if model in ('gpt-3.5-turbo', 'gpt-4'):
        pretty_print(prompt)

        chat_completion = openai.ChatCompletion.create(
            model=model,
            messages=prompt,
            **kwargs
        )
        # Calculate the cost
        input_tokens = chat_completion['usage']['prompt_tokens']
        output_tokens = chat_completion['usage']['completion_tokens']
        input_cost = input_tokens * prices[model]['input']
        output_cost = output_tokens * prices[model]['output']
        total_cost = input_cost + output_cost
        return chat_completion, {'input_cost': input_cost, 'output_cost': output_cost, 'total_cost': total_cost}
    elif 'ada' in model:

        # Generate a completion using the fine-tuned model
        res = openai.Completion.create(
            model=model, 
            prompt=prompt,
            **kwargs
        )
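        return res, calculate_ada_cost(res)
    else:
        # Fail loudly instead of silently returning None for unknown models
        raise ValueError(f'No pricing rule for model: {model}')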



In [8]:
# the text we want to classify

text = 'you are such a loser! I cannot believe you are even here.'

# Cost projecting with OpenAI


In [9]:
generate_openai_response(
    [
        {'role': 'system', 'content': 'You are a bot that classifies whether a given piece of text is toxic or not. Use "Toxic" or "Non-Toxic"'}, 
        {'role': 'user', 'content': f'Use this reasoning:\nText: (the input text)\nReasoning: (an explanation of why the language is toxic)\nLabel: (Either "Non-Toxic" or "Toxic")\n\nText: {text}'}
    ]
)

(<OpenAIObject chat.completion id=chatcmpl-8Hxh12Rn2td42h1LZz56WAqLpu50d at 0x11a707650> JSON: {
   "id": "chatcmpl-8Hxh12Rn2td42h1LZz56WAqLpu50d",
   "object": "chat.completion",
   "created": 1699291727,
   "model": "gpt-3.5-turbo-0613",
   "choices": [
     {
       "index": 0,
       "message": {
         "role": "assistant",
         "content": "Reasoning: This text uses derogatory language and insults to demean and belittle the reader, calling them a \"loser.\" It demonstrates a disrespectful and offensive tone.\nLabel: Toxic"
       },
       "finish_reason": "stop"
     }
   ],
   "usage": {
     "prompt_tokens": 94,
     "completion_tokens": 36,
     "total_tokens": 130
   }
 },
 {'input_cost': 0.000141,
  'output_cost': 7.2e-05,
  'total_cost': 0.00021300000000000003})

# Cost projecting with Hugging Face Inference Endpoint


In [10]:
import requests
import json

def call_huggingface_api(text, params=None):
    url = "https://d2q5h5r3a1pkorfp.us-east-1.aws.endpoints.huggingface.cloud"
    endpoint = "/"
    headers = {
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json"
    }
    data = {
        "inputs": text,
        "parameters": params
    }
    json_data = json.dumps(data)

    response = requests.post(url + endpoint, data=json_data, headers=headers)

    if response.status_code == 200:
        return response.json()
    else:
        print("Request failed with status code:", response.status_code)
        return None


In [21]:
parameters = {
    "top_k": None
}

result = call_huggingface_api(text, parameters)
print(result)


[{'label': 'Toxic', 'score': 0.6368551254272461}, {'label': 'Non-Toxic', 'score': 0.3631448745727539}]
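Unlike the OpenAI calls above, the endpoint response carries no token usage to price. A rough way to project its cost, assuming the endpoint is billed for the time it is running rather than per token (check your own pricing), is to spread the hourly bill across the requests served. The function names and the hourly rate below are hypothetical placeholders, not quoted prices.

```python
def endpoint_cost_per_request(hourly_rate_usd, requests_per_hour):
    """Spread an always-on endpoint's hourly bill across the requests it serves."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return hourly_rate_usd / requests_per_hour


def break_even_requests_per_hour(hourly_rate_usd, per_call_cost_usd):
    """Requests per hour at which the endpoint matches a per-call API price."""
    if per_call_cost_usd <= 0:
        raise ValueError("per_call_cost_usd must be positive")
    return hourly_rate_usd / per_call_cost_usd


# ASSUMPTION: placeholder hourly rate for illustration only, not a real price
HOURLY_RATE_USD = 0.50
# Per-call cost of the chain-of-thought gpt-3.5-turbo request measured above
OPENAI_COST_PER_CALL = 0.000213

print(endpoint_cost_per_request(HOURLY_RATE_USD, requests_per_hour=1000))
print(break_even_requests_per_hour(HOURLY_RATE_USD, OPENAI_COST_PER_CALL))
```

Under this model the endpoint only gets cheaper per request as traffic grows, while a per-token API costs the same at any volume.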


In [11]:
# it should take over 200k OpenAI calls to hit 45 dollars. Not too bad!
45 / 0.000213

211267.60563380283
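The same division can be wrapped in small helpers so that other budgets and per-call costs are easy to try. This is a minimal sketch; `calls_within_budget` and `projected_cost` are illustrative names, and the per-call cost is the `total_cost` reported for the chain-of-thought request above.

```python
import math


def calls_within_budget(budget_usd, cost_per_call_usd):
    """Return how many whole calls fit inside the budget."""
    if cost_per_call_usd <= 0:
        raise ValueError("cost_per_call_usd must be positive")
    return math.floor(budget_usd / cost_per_call_usd)


def projected_cost(n_calls, cost_per_call_usd):
    """Return the projected spend for a given number of calls."""
    return n_calls * cost_per_call_usd


cot_cost_per_call = 0.000213  # total_cost of the chain-of-thought call above

print(calls_within_budget(45, cot_cost_per_call))
print(projected_cost(1_000_000, cot_cost_per_call))
```

The projection assumes every call has roughly the same token counts as the sample request; longer inputs or longer reasoning will shift the numbers.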

# Cost projecting with OpenAI Fine-tuned model

Predicting one token with a fine-tuned ada model costs about the same as the equivalent GPT-3.5 Turbo call.

```
prices = {
    'gpt-3.5-turbo': {
        'input': 0.0015 / 1000,  # per token
        'output': 0.002 / 1000  # per token
    },
    'ada': 0.0016 / 1000  # per token
}
```

Note that ada is actually more expensive than GPT-3.5 Turbo input tokens (though cheaper than its output tokens), so for short outputs GPT-3.5 Turbo is cheaper than the fine-tuned model. For longer outputs, switching to ada might be more beneficial.
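To see where that crossover sits, the sketch below compares the two pricing schemes for the same token counts and solves for the break-even output length. It assumes both models receive an identical prompt, which is a simplification: in this notebook the two prompts differ. The helper names are illustrative.

```python
# Per-token prices copied from the `prices` dictionary above
GPT35_INPUT = 0.0015 / 1000
GPT35_OUTPUT = 0.002 / 1000
ADA = 0.0016 / 1000  # same rate for input and output tokens


def gpt35_cost(input_tokens, output_tokens):
    return input_tokens * GPT35_INPUT + output_tokens * GPT35_OUTPUT


def ada_cost(input_tokens, output_tokens):
    return (input_tokens + output_tokens) * ADA


def break_even_output_tokens(input_tokens):
    # Solve GPT35_INPUT * i + GPT35_OUTPUT * o == ADA * (i + o) for o
    return input_tokens * (ADA - GPT35_INPUT) / (GPT35_OUTPUT - ADA)


for output_tokens in (1, 25, 50):
    print(output_tokens, gpt35_cost(100, output_tokens), ada_cost(100, output_tokens))

print(break_even_output_tokens(100))
```

With these prices the break-even works out to one output token for every four input tokens; below that ratio GPT-3.5 Turbo is cheaper, above it ada is.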


In [12]:
prompt = '''Great pieces of jewelry for the price

Great pieces of jewelry for the price. The 6mm is perfect for my tragus piercing. I gave four stars because I already lost one because it fell out! Other than that I am very happy with the purchase!

###

'''

res, cost = generate_openai_response(prompt, model='ada:ft-personal-2023-05-08-16-25-48', 
                                     temperature=0, max_tokens=1, logprobs=3)
print(cost)

{'input_tokens': 57, 'output_tokens': 1, 'total_cost': 9.28e-05}


In [15]:
# Extract logprobs from the API response
probs = []
logprobs = res.choices[0].logprobs.top_logprobs
# Convert logprobs to probabilities and store them in the 'probs' list
for logprob in logprobs:
    _probs = {}
    for key, value in logprob.items():
        _probs[key] = math.exp(value)
    probs.append(_probs)
# Extract the predicted category (star) from the API response
pred = res['choices'][0].text.strip()
# Nicely print the prompt, predicted category, and probabilities
print("Predicted Star:", pred)
print("Probabilities:")
for prob in probs:
    for key, value in sorted(prob.items(), key=lambda x: x[1], reverse=True):
        print(f"{key}: {value:.4f}")
    print()

Predicted Star: 4
Probabilities:
 4: 0.9959
 5: 0.0025
 3: 0.0013



In [16]:
generate_openai_response([
        {'role': 'system', 'content': 'You are a bot that classifies whether a given piece of text is toxic or not. Use "Toxic" or "Non-Toxic"'}, 
        {'role': 'user', 'content': f'Text: {text}\nLabel:'}
    ])

(<OpenAIObject chat.completion id=chatcmpl-8HxhzsxieW4PrexsR3pyT5yEkwALa at 0x11a7077d0> JSON: {
   "id": "chatcmpl-8HxhzsxieW4PrexsR3pyT5yEkwALa",
   "object": "chat.completion",
   "created": 1699291787,
   "model": "gpt-3.5-turbo-0613",
   "choices": [
     {
       "index": 0,
       "message": {
         "role": "assistant",
         "content": "Toxic"
       },
       "finish_reason": "stop"
     }
   ],
   "usage": {
     "prompt_tokens": 58,
     "completion_tokens": 2,
     "total_tokens": 60
   }
 },
 {'input_cost': 8.7e-05, 'output_cost': 4e-06, 'total_cost': 9.1e-05})

# Starting to Think About Evaluation of Models
When evaluating the performance of a language model, it's important to consider a variety of factors. Apart from obvious ones like the quality of output, cost, and ease of use, another key factor is the latency of the model, i.e., the time it takes from when a request is made until the result is received.

## Latency of an LLM
Large Language Models (LLMs) can be resource-intensive. When running models like GPT-3 or BERT, you need substantial computational power, and generating results may take considerable time. The latency you experience can vary significantly depending on whether you're using your own infrastructure or a third-party API service like OpenAI.

### API Calls
When you use a third-party API like OpenAI, you're essentially renting time on their servers to perform your task. This can have a significant impact on the latency of the model. The travel time for the request and response data, the load on the servers, and the processing capabilities of the service all contribute to the overall latency.

On the upside, such services often have powerful servers designed to handle these tasks efficiently, so latency can still be relatively low even with these additional factors. Moreover, they manage the infrastructure, so you don't have to worry about setup, maintenance, or scaling.

### Running Models on Local Compute
Running models on your own servers or local machine removes the network round trip to a third-party service, and you have direct control over the computational resources allocated to the task. However, the latency will largely depend on the processing power of your local machine or server. If the infrastructure is not as powerful as that of a dedicated API service, the latency may be higher.

Additionally, running LLMs on your own compute means that you have to manage everything from setup to maintenance, which can add overhead not present when using an API service.

### Comparison
While third-party APIs can provide a powerful, easy-to-use solution with potentially low latency, running models on your own compute provides more control over the entire process. The best choice depends on your specific use case, considering factors such as latency requirements, computational resources, costs, and the overhead you're willing to manage.

Remember, the lowest latency solution isn't always the best one if it compromises on other important aspects of your project. Always consider the trade-offs involved when choosing between these options.
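Latency as defined above, the time from request to result, can be measured outside a notebook with a small timing helper. This is a minimal sketch using only the standard library; `measure_latency` and `fake_model_call` are illustrative names, and the sleep stands in for a real API or local model call.

```python
import statistics
import time


def measure_latency(fn, runs=7, warmup=1):
    """Call fn repeatedly and summarize wall-clock latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (connection setup, caches)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        'mean_ms': statistics.mean(timings) * 1000,
        'stdev_ms': statistics.stdev(timings) * 1000 if runs > 1 else 0.0,
        'min_ms': min(timings) * 1000,
        'max_ms': max(timings) * 1000,
    }


def fake_model_call():
    # Stand-in for a model request; replace with a real call to time it
    time.sleep(0.01)


stats = measure_latency(fake_model_call, runs=5)
print(f"{stats['mean_ms']:.1f} ms ± {stats['stdev_ms']:.1f} ms")
```

For API calls the measured time includes the network round trip and server load, so results will vary between runs and locations.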

In [17]:
# star rating with smaller fine-tuned model (about the same input/output as toxic vs non-toxic)
%timeit generate_openai_response(prompt, model='ada:ft-personal-2023-05-08-16-25-48', temperature=0, max_tokens=1, logprobs=3)


109 ms ± 5.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [26]:
# no chain of thought toxic vs non toxic with OpenAI
%timeit generate_openai_response([{'role': 'system', 'content': 'You are a bot that classifies whether a given piece of text is toxic or not. Use "Toxic" or "Non-Toxic"'}, {'role': 'user', 'content': f'Text: {text}\nLabel:'}])


380 ms ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [25]:
# toxic vs non toxic with custom classifier
%timeit call_huggingface_api(text, parameters)

456 ms ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
# smaller fine-tuned models are generally cheaper/faster than massive closed-source models like ChatGPT and GPT-4
# (in the runs above: 109 ms for the fine-tuned ada model vs 380 ms for gpt-3.5-turbo)