# Interfacing with LLMs Through APIs: Subtitle [EDIT ME]

* * * 

<div class="alert alert-success">  
    
### Learning Objectives - WIP
    
* Understand the power of the raw LLMs behind tools like ChatGPT and Claude
* Learn how to interface with these models directly to unlock capabilities for your own use case
* As a researcher, use LLMs directly for tasks like text extraction at scale
* Understand how LLMs are hosted and how they are separate from tools like ChatGPT
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
📝 **Poll:** A Zoom poll to help you learn!<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br> 

[NOTE: Remove icons if they're not used in the notebook]

### Sections
1. [Section Name - EDIT ME](#section1)
2. [Section Name - EDIT ME](#section2)
3. [Reflection: [Title of Reflection] - EDIT ME](#refl)
4. [Demo: [Title of Demo] - EDIT ME](#demo)

<a id='this'></a>

## This Workshop

Most of us have used tools like ChatGPT or Claude, but what you may not realize is that these interfaces are **not the language models themselves**. They’re polished applications built on top of large language models (LLMs), adding helpful features like memory, chat formatting, context injection, and more.

In this workshop, we’ll peel back the layers and explore:
- **How LLMs are hosted and accessed through APIs**
- **How tools like ChatGPT are built around them**
- **How you can interface directly with LLMs via API calls**

This shift from using LLMs in chat to using them as infrastructure unlocks powerful possibilities for research:
- Extracting structured data from unstructured text  
- Performing thematic coding across interview transcripts  
- Generating summaries, classifications, and annotations at scale  

By the end, you'll know how to go from “ask ChatGPT a question” to building your own LLM-powered workflow for structured extraction.

## Motivating Scenario: Coding Therapy Dialogues at Scale

You're a psychology researcher analyzing hundreds of counseling sessions between clients and therapists. Your goal is to understand:
- What types of **issues** clients bring up
- How they **cope** with emotional challenges
- Whether there's any language indicating **emotional risk**
- What **techniques** therapists use in response

---

### Approach 1: Manual Thematic Coding (Example)

Let’s look at one real example from the [`Amod/mental_health_counseling_conversations`](https://huggingface.co/datasets/Amod/mental_health_counseling_conversations) dataset.

> Client (Context):
> "I’ve been feeling overwhelmed with school lately. It’s like no matter how hard I try, I can’t catch up. I just end up crying at night."
>
> Therapist (Response):
> "It’s completely understandable to feel that way under so much pressure. I’m here to support you. Can we explore what’s making you feel so behind?"

From a research perspective, you might label this exchange as:

- `presenting_issue`: academic stress / anxiety  
- `coping_style`: emotion-focused  
- `client_emotion`: sad  
- `risk_flag`: no  
- `counselor_technique`: reflection

You can imagine doing this manually for the first 10–20 samples, but the task quickly becomes tedious and daunting.

---

### Approach 2: ChatGPT

At some point, you realize: "What if I just paste this into ChatGPT and ask it to extract that structured information for me?"
And you try it, and it works! Voilà. It seems simple enough, and you get a clean response like:

```json
{
  "presenting_issue": "anxiety",
  "risk_flag": false,
  "client_emotion": "sad",
  "coping_style": "emotion-focused",
  "counselor_technique": "reflection"
}
```

But you still have thousands more samples to get through. You still have to:
- Manually copy and paste each sample from the UI
- Append the output to a document or spreadsheet
- Hope ChatGPT stays consistent
- Repeat this thousands of times


This seems like something that would be much easier if you could programmatically call ChatGPT, or whatever model it uses behind the scenes.

### Breakthrough Approach: Interfacing Directly with the Large Language Models Behind ChatGPT
Instead of working through a web interface, what if you could:
- Programmatically send each sample to the model  
- Receive structured, reliable JSON responses  
- Save everything automatically into a CSV or database  
- Scale up from 10 samples to 10,000
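
The steps above can be sketched as a small loop. This is a minimal, offline sketch rather than the workshop's final pipeline: `label_sample` is a hypothetical stand-in that returns a fixed JSON string where a real version would call a hosted model, and `code_samples` and the output column names are illustrative assumptions.

```python
import csv
import json


def label_sample(text):
    """Hypothetical stand-in for a real LLM API call.

    A real version would send `text` to a hosted model and return its reply.
    Here we return a fixed JSON string so the sketch runs without a network.
    """
    return json.dumps({"presenting_issue": "anxiety", "risk_flag": False})


def code_samples(samples, out_path, label_fn=label_sample):
    """Label every sample and save one CSV row per sample."""
    rows = []
    for text in samples:
        labels = json.loads(label_fn(text))  # parse the model's JSON reply
        rows.append({"text": text, **labels})

    # Use the keys of the first row as the CSV header
    fieldnames = list(rows[0].keys()) if rows else ["text"]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling `code_samples(["first transcript", "second transcript"], "coded.csv")` would write two labeled rows. Swapping `label_fn` for a function that calls a real model is the only change needed to scale the same loop to thousands of samples.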

Using LLMs directly through an API, you can do all of this.

[TODO: add a visual or high-level diagram of this workflow]

---

## Background: LLM Hosting
Before we get into how you can interface with these LLMs directly through an API, let's build up some context on how LLMs are hosted.

[TODO: image goes here]

At a high level, the LLM itself runs on hosted servers, and ChatGPT is a web application that connects to those servers.

### LLM Server

At its core, a Large Language Model is just a giant neural network: a trained set of weights and biases. When you ask it something, you’re running what’s called a forward pass, or inference. You pass in some input text, and the model predicts the next most likely word (again and again, until it's done). Theoretically, if you had a powerful enough computer, you could run this set of weights and biases locally.

However, in practice, state-of-the-art (SoTA) models tend to be much, much larger than what fits on a personal computer. When you use ChatGPT, Claude, or Gemini, you’re not actually running the model on your laptop. Behind the scenes, you're sending a request to an LLM hosted on a powerful server capable of running these giant models. They can require tens or hundreds of gigabytes of memory and need specialized hardware like GPUs.

### Hosted LLMs and API Servers
Because running an LLM takes so much computing power, companies host them in the cloud and provide access through something called an API server.

An API (Application Programming Interface) server gives your application a way to:
- Send text to the model (request),
- Run an inference (get predictions) remotely,
- And receive the result back (response)

So when you interact with ChatGPT, what’s really happening under the hood is something like this:

`ChatGPT UI  →  API Call  →  Hosted LLM  →  Output  →  ChatGPT UI`

[TODO: create an image for this flow]

This pattern became the foundation for how LLMs are accessed at scale.
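
To make the request side of this cycle concrete, here is a minimal sketch of what an application might assemble before sending text to a hosted model. The URL, key, and model name are hypothetical placeholders, and the body shape (a `model` plus a list of `messages`) is an assumption borrowed from OpenAI-style chat APIs. The request is only built and inspected here, never sent.

```python
import json
import urllib.request

# Hypothetical endpoint and key, for illustration only
EXAMPLE_URL = "https://api.example.com/v1/chat/completions"
PLACEHOLDER_KEY = "sk-placeholder"

# The request body: which model to run and the text to send it
payload = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Hello!"}],
}

# Build (but do not send) an HTTP POST request carrying the JSON body
request = urllib.request.Request(
    EXAMPLE_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {PLACEHOLDER_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

Sending this request to a real server would return a JSON response containing the model's output. Client libraries wrap exactly this kind of request so you rarely have to build it by hand.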

### Competing Standards for LLM APIs
As more models were released, different organizations built their own API “specifications” for hosting LLMs. A few major ones emerged:

- OpenAI API Spec – The most widely adopted standard, designed by OpenAI
- Hugging Face TGI (Text Generation Inference) – Hugging Face’s approach to model serving
- vLLM – A fast, GPU-efficient serving engine from UC Berkeley & partners

Each of these defined:
- What endpoints (like /generate or /chat) should exist
- How to format the request body (prompts, parameters, etc.)
- What the output should look like (tokens, probabilities, completions...)

### The World Converges: OpenAI's API Format
Similar to how phones began to converge on a single charging standard (like USB-C), the AI developer ecosystem has increasingly converged on OpenAI’s API format as a common interface for LLMs. Today, even third-party models like Mistral, Meta’s LLaMA, and others are often hosted behind APIs that support OpenAI-style endpoints, such as:

- [Responses](https://platform.openai.com/docs/api-reference/responses)
- [Chat](https://platform.openai.com/docs/api-reference/chat)
- [Embeddings](https://platform.openai.com/docs/api-reference/embeddings)

Why the convergence?
- Developer tools (e.g., LangChain, LlamaIndex, OpenRouter) are already built around these formats
- Developers don’t need to learn a new interface for every model (easy model swapping)
- It became a de facto "dialect" for talking to LLMs
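
The "easy model swapping" point can be illustrated with a small sketch. Assuming an OpenAI-style chat request body, switching models only means changing one string. The helper `build_chat_payload` and both model names below are hypothetical, chosen just for illustration.

```python
def build_chat_payload(model, prompt, max_tokens=150):
    """Build an OpenAI-style chat request body for any compatible model."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }


prompt = "Summarize this transcript in one sentence."

# Hypothetical model names: only the "model" field differs between requests
for model in ["provider-a/model-x", "provider-b/model-y"]:
    payload = build_chat_payload(model, prompt)
    print(payload["model"], "->", payload["messages"][0]["content"])
```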

# Using OpenRouter for LLM Access

For the rest of this workshop, we’ll be using a service called **OpenRouter** to interact with LLMs.

---

## What is OpenRouter?

OpenRouter is a unified API gateway for Large Language Models. It allows you to send a prompt to many different models (like GPT-4, Claude, Mistral, and others) using a single, consistent interface. OpenRouter doesn't host the LLM servers itself; instead, it lets you interact with many LLM hosts through one unified API.

In addition to providing one unified experience, we are using OpenRouter since it doesn't require a credit card to interact with the free tier of models (though you are limited to 50 calls per day).

[TODO: add more details about OpenRouter]

---

## Getting Set Up

To use OpenRouter, you’ll need to:
1. Create a free account at [https://openrouter.ai](https://openrouter.ai)
2. Click on your profile (top right) and generate an API key
    - Settings -> API Keys -> Create API Key
3. ⚠️ **Warning:** Save this API key somewhere safe
    - It is generally best practice to store this key somewhere secure, such as an environment variable or a protected file on your system.
    - This key is sensitive: anyone who has it can make API calls on your account. Do not share it with anyone or upload it to public places like GitHub.
    - You can only view the key once, when it is created. If you lose it, you will have to create a new one.

For the purposes of this workshop, let's make a new file in your environment called `API_KEY.txt`, save the key there, and then read it directly from that file.

In [6]:
# Load the API key that we saved to a file
with open('API_KEY.txt', 'r') as file:
    API_KEY = file.read().strip()  # strip() drops any trailing newline or spaces
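
As mentioned in the warning above, an environment variable is another common place to keep a key. Here is a minimal sketch of a loader that prefers an environment variable and falls back to the file. The function name `load_api_key` and the variable name `OPENROUTER_API_KEY` are assumptions for illustration, not names that OpenRouter requires.

```python
import os
from pathlib import Path


def load_api_key(env_var="OPENROUTER_API_KEY", fallback_file="API_KEY.txt"):
    """Return the key from an environment variable, else from a local file."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    # Fall back to the file used in this workshop
    return Path(fallback_file).read_text(encoding="utf-8").strip()
```

With this in place, `API_KEY = load_api_key()` would work whether the key lives in your shell environment or in `API_KEY.txt`.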

In [8]:
# Install the OpenAI Python Client
# !pip install openai

In [14]:
# Import the OpenAI Python Client
from openai import OpenAI

# Notice how we can use the OpenAI client with non-OpenAI hosts/models thanks to the shared API standard
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=API_KEY,  # the key we loaded from API_KEY.txt
)

Now we have our client set up. Next, we need to figure out which model to use.
On OpenRouter, you can navigate to the [Models](https://openrouter.ai/models) tab and search for "free" to view all free models. Notice popular models like Llama 3.2, Google Gemma, and DeepSeek V3. We will choose DeepSeek V3, as it has the largest context length of these, but feel free to try any other.

⚠️ **Warning:** Be careful not to run these cells too many times. We are only allowed **50 free calls per day** on OpenRouter's free tier. Any more than that, and you will be blocked from making further requests until the next day.

💡 **Tip**: You can check how many requests you've used per day by navigating to the [Activity Menu](https://openrouter.ai/activity) in OpenRouter (Top Left Menu -> Activity)

In [None]:
# Request a Completion (Output) from the Chat Completions Endpoint
completion = client.chat.completions.create(
    model="deepseek/deepseek-chat-v3-0324:free",
    max_tokens=150,
    messages=[
        {
          "role": "user",
          "content": "Tell me about the D-Lab at UC Berkeley in 50 words or less."
        }
    ]
)

In [27]:
# Pretty Printing The Completion Response
from pprint import pprint
pprint(completion.model_dump())

{'choices': [{'finish_reason': 'length',
              'index': 0,
              'logprobs': None,
              'message': {'annotations': None,
                          'audio': None,
                          'content': "UC Berkeley's **D-Lab** (Development, "
                                     'Data, and Design Lab) provides '
                                     'interdisciplinary training in data '
                                     'science, research methods, and digital '
                                     'tools to empower students and '
                                     'researchers. Offering workshops, '
                                     'consulting, and courses, D-Lab focuses '
                                     'on practical skills for impactful, '
                                     'data-driven work across social sciences, '
                                     'humanities',
                          'function_call': None,
                          'reaso

Notice the abundance of data contained in the raw completion object. This object represents the response from the LLM, including the output message, its content, the `finish_reason`, and a variety of other useful metadata. To get the generated text, we want to focus on the `choices` field.

In [16]:
print(completion.choices[0].message.content)

UC Berkeley's **D-Lab** (Development, Data, and Design Lab) provides interdisciplinary training in data science, research methods, and digital tools to empower students and researchers. Offering workshops, consulting, and courses, D-Lab focuses on practical skills for impactful, data-driven work across social sciences, humanities


Notice the broad knowledge base of the DeepSeek model, which contains information about UC Berkeley's very own D-Lab.
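
One detail worth noticing in the output above: the text stops mid-sentence and `finish_reason` is `'length'`. In OpenAI-style APIs this value means generation stopped because it hit the `max_tokens` limit. A minimal sketch of a check for this follows. The helper name `was_truncated` is hypothetical, and the example dictionary is a trimmed-down stand-in for `completion.model_dump()`.

```python
def was_truncated(completion_dict):
    """Return True if the first choice stopped because it hit the token limit."""
    choice = completion_dict["choices"][0]
    return choice["finish_reason"] == "length"


# Trimmed-down stand-in for completion.model_dump()
example = {"choices": [{"finish_reason": "length", "message": {"content": "..."}}]}

if was_truncated(example):
    print("Response was cut off; consider raising max_tokens.")
```

In the notebook, `was_truncated(completion.model_dump())` would run the same check on the real response.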

### Congrats! This marks the end of the first part of the workshop! 

In Part 2, we will break down the parameters that go into constructing this call and what they all mean, and how you can use them to build your own scalable workflows.