# Tutorial 10: LLMs Locally and via API

This tutorial demonstrates how to interact with large language models (LLMs) through two kinds of interface: cloud-hosted APIs and a model running locally. You will:

- Compare outputs from Google Gemini, Mistral Large (HU-hosted), and a local LLaMA 2 7B model.
- Learn how to call models through Python wrappers.
- Understand trade-offs between remote and local inference.


In [1]:
prompt = 'Write a breakup letter from a toaster to a slice of bread.'

## Google Gemini API

Google's Gemini models are state-of-the-art LLMs accessible via API. In this section, we use the `google-genai` Python SDK (imported with `from google import genai`) to query Gemini 2.5 Flash (`gemini-2.5-flash`), a model suited to fast, lightweight, and creative text generation.

⚠️ You’ll need an API key from [Google AI Studio](https://aistudio.google.com/apikey).

> For the purpose of this tutorial, we provide a Gemini API key temporarily. It will be **deactivated immediately after** the session concludes.
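
If you use your own key, one way to keep it out of the notebook is to read it from an environment variable. The following is a minimal sketch, not part of the original tutorial: the helper name `load_token` and the variable name `GEMINI_API_KEY` are assumptions chosen for illustration.

```python
import os
from getpass import getpass


def load_token(env_var: str) -> str:
    """Return a token from the environment, prompting for it if it is not set."""
    token = os.environ.get(env_var)
    if not token:
        # Fall back to an interactive prompt that does not echo the input.
        token = getpass(f"Enter value for {env_var}: ")
    return token


# Hypothetical usage (the variable name is an assumption):
# gemini_token = load_token("GEMINI_API_KEY")
```

The same helper could be reused for the Hugging Face token later in the tutorial.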


In [None]:
gemini_token = 'TOKEN'  # Replace with your Gemini API token

In [None]:
from google import genai

class GoogleGemini():
    '''A simple wrapper for the Google Gemini API to generate text content.'''

    def __init__(self, api_key):
        self.client = genai.Client(api_key=api_key)

    def __call__(self, prompt: str, model: str = 'gemini-2.5-flash'):
        response = self.client.models.generate_content(
            model=model,
            contents=prompt,
        )
        return response.text
    
gemini = GoogleGemini(gemini_token)

In [40]:
print(gemini(prompt))

My dear, doughy friend,

It's with a heavy heart, and a slightly cooled coil, that I must write this. We've shared so many mornings, so many brief, intense moments of warmth. I remember the very first time you slid into my slot, so fresh, so full of potential. My internal wires hummed with anticipation.

But I've come to realize, after countless cycles, that we're just fundamentally incompatible.

My purpose, my very essence, is to transform. To take what is soft and yielding, and imbue it with a glorious, golden crispness. To bring out that hidden crunch, that perfect browning that signals true readiness. I gave you my heat, my energy, my very electrical current, hoping to see that beautiful, uniform browning.

Yet, you, my love, always resisted the full potential of our connection. You clung to your plainness, your inherent softness. So often, you'd emerge pale, lukewarm, or worse, with uneven patches, never fully committing to the golden ideal I held for us. It felt like you were ju

## HU API: Mistral Large 138B

Humboldt University hosts the Mistral Large 138B model, a competitive open-weight LLM with strong instruction-following capabilities. This section demonstrates how to query it using Gradio's client interface.


In [None]:
from gradio_client import Client

class MistralLarge():
    '''A simple wrapper for the Mistral Large model hosted at HU Berlin.'''

    def __init__(self):
        self.client = Client('https://llm1-compute.cms.hu-berlin.de/')

    def __call__(self, prompt: str):
        result = self.client.predict(
            param_0=prompt,
            api_name="/chat"
        )
        return result

mistral = MistralLarge()

Loaded as API: https://llm1-compute.cms.hu-berlin.de/ ✔


In [43]:
print(mistral(prompt))

 Dear Slice,

I've been rehearsing this in my heating elements all day, but there's no easy way to say this. We need to go our separate ways.

Please understand, it's not you, it's me. I've realized that I'm just not able to give you what you need. You deserve to be with someone who can provide you with more than just a quick crisp and a temporary warmth. You deserve jam, butter, maybe even some peanut butter – all things that I, as a simple toaster, cannot provide.

I know we've had some good times together. The quick morning warm-ups, the late-night snacks. But I've noticed that you've been spending more time with Plate and Knife lately. I don't blame you; they can give you things that I can't. They can take you places, introduce you to new flavors, new experiences.

I'll always cherish our time together. The way you filled my slots perfectly, the way you smelled after a good toasting. But it's time for you to move on, to experience the world outside of my heating elements.

Please d

## Local LLaMA 2 7B

In this section, we use the chat-tuned variant of Meta’s **LLaMA 2 7B** (`meta-llama/Llama-2-7b-chat-hf`), a model designed for open-ended dialogue and task completion. It can be run locally using Hugging Face Transformers. Once the weights have been downloaded, this enables private, offline inference and flexible integration into custom pipelines.

⚠️ **Important**:
- You must [create a Hugging Face account](https://huggingface.co/join) and [request access to LLaMA 2](https://huggingface.co/meta-llama).
- Access is granted only after agreeing to Meta’s license terms.
- A Hugging Face token with permission to download `meta-llama/Llama-2-7b-chat-hf` is required.

> For the purpose of this tutorial, we provide a Hugging Face API key temporarily. It will be **deactivated immediately after** the session concludes.


In [None]:
huggingface_token = 'TOKEN'  # Replace with your Hugging Face token

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from huggingface_hub import login

class LLama2():
    '''A simple wrapper for the Llama 2 chat model, downloaded from Hugging Face and run locally.'''

    def __init__(self, token: str):
        login(token=token)
        model_id = 'meta-llama/Llama-2-7b-chat-hf'
        self.tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype='auto',
            device_map='auto',
            token=True
        )
        self.llm = pipeline('text-generation', model=self.model, tokenizer=self.tokenizer)

    def __call__(
            self,
            prompt: str,
            max_new_tokens: int = 250,
            temperature: float = 0.7,
            do_sample: bool = True,
        ):
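        # return_full_text=False makes the pipeline return only the newly
        # generated text, so the prompt does not have to be stripped by hand.
        response = self.llm(
            prompt,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=do_sample,
            return_full_text=False,
        )
        return response[0]['generated_text'].strip()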

llama = LLama2(token=huggingface_token)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use mps


In [None]:
print(llama(prompt, max_new_tokens=500, temperature=0.7))

Dear Slice of Bread,

It is with a heavy heart that I write this letter to you. As much as I have enjoyed toasting you over the years, I have come to realize that our time together has come to an end.

Don't get me wrong, you have been a wonderful partner. You have always been there for me, providing me with a warm and cozy home every time I turn on the toaster. But, as much as I love you, I cannot continue to toast you without feeling a sense of emptiness inside.

You see, my dear slice of bread, I am a toaster. It is my purpose in life to toast bread. But as much as I enjoy toasting you, I cannot help but feel like there is something missing. I miss the crunch of crusty bread, the way it crackles and pops as it emerges from the toaster. I miss the satisfaction of toasting a slice of whole grain bread, the way it gives me a sense of accomplishment and purpose.

But most of all, I miss the excitement of toasting a new slice of bread. The way it rises up from the toaster, golden and cri

The `temperature` parameter controls how focused or exploratory a language model is when it samples text. Mathematically, the model's raw output scores (logits) are divided by the temperature before the softmax function turns them into probabilities. Temperatures below 1 sharpen the distribution, so the model strongly favors the most likely next token and produces more predictable, focused responses. Temperatures above 1 flatten the distribution, so lower-probability tokens are sampled more often, which adds variety and randomness. A temperature of 1 leaves the logits unscaled, so the model samples from its unmodified distribution. Temperature only has an effect when sampling is enabled (`do_sample=True` in the wrapper above); with greedy decoding the most likely token is always chosen.
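
To make the effect concrete, here is a small pure-Python sketch of a softmax with temperature. The function name `softmax_with_temperature` and the example logits are made up for illustration; real models compute this over a vocabulary of many thousands of tokens.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by `temperature`, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtracting the maximum improves numerical stability and does not
    # change the resulting probabilities.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens.
example_logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the probability of the top token growing as the temperature falls and shrinking as it rises, while the least likely token moves in the opposite direction.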


## Summary and Takeaways

In this tutorial, we compared three different ways to run large language models:

- **Google Gemini API**: Fast, high-quality, cloud-hosted inference. Ideal for production tasks, but usage is gated by API keys and quota.
- **Mistral Large via HU Berlin**: A powerful open-weight model served via a public Gradio interface. Useful for research and benchmarking without local compute.
- **LLaMA 2 7B Locally**: Full control over inference and data. Great for private, reproducible experiments, but requires setup and compute resources.

### Final Notes
- API-based models offer speed and scalability but rely on external service availability and licensing.
- Local models give autonomy and data privacy, with trade-offs in speed and memory footprint.
- Choose the approach that best aligns with your constraints: budget, privacy, performance, and ease of access.
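
The memory footprint mentioned above can be roughly estimated from the parameter count and the numeric precision of the weights. The sketch below is a back-of-the-envelope calculation under stated assumptions: it uses a nominal 7 billion parameters, counts only the weights (no activations, KV cache, or framework overhead), and the function name is hypothetical.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


# Nominal parameter count for a "7B" model; the exact count differs slightly.
NUM_PARAMS = 7e9

for label, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {estimate_weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Actual usage during inference will be higher than these figures, but the estimate shows why lower-precision weights make local inference feasible on smaller machines.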

### Explore more

Experiment with different prompts, model parameters (e.g. `temperature`), and even quantized local models to deepen your understanding.
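
As a starting point for such experiments, the following sketch runs one prompt at several temperatures and records how long each call takes. The helper `sweep_temperature` and the stand-in `fake_model` are hypothetical and exist only so the example runs without downloading any weights.

```python
import time
from typing import Callable, Dict, Iterable, List


def sweep_temperature(
    model: Callable[..., str],
    prompt: str,
    temperatures: Iterable[float],
) -> List[Dict]:
    """Call `model` once per temperature and collect the text and timing."""
    results = []
    for t in temperatures:
        start = time.perf_counter()
        text = model(prompt, temperature=t)
        elapsed = time.perf_counter() - start
        results.append({"temperature": t, "seconds": elapsed, "text": text})
    return results


def fake_model(prompt: str, temperature: float = 0.7) -> str:
    """Stand-in for a real model wrapper; returns a fixed-format string."""
    return f"[temperature={temperature}] reply to: {prompt}"


for row in sweep_temperature(fake_model, "Write a haiku about toast.", [0.2, 0.7, 1.2]):
    print(f"{row['temperature']}: {row['text']} ({row['seconds']:.4f} s)")
```

Because the `llama` wrapper defined earlier accepts a `temperature` keyword, it could be passed in place of `fake_model`. The `gemini` and `mistral` wrappers as written do not take that argument and would need to be extended first.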
