<a href="https://colab.research.google.com/github/stoyinbizz-ui/Getting-Started-with-the-Hugging-Face-Hub-Inference-API-/blob/main/Getting_Started_with_the_HuggingFace_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Getting Started with the Hugging Face Hub (Inference API)**

* Heard of AI chat-based interfaces like **ChatGPT**, **Gemini**, **HuggingChat**?

**What exactly is a Transformer-based language model?**

These are large language models (LLMs) trained on massive datasets to understand and generate natural language. They are:

* **Generative**: able to produce text and other content.
* **Pre-trained**: trained in advance on large corpora.
* **Transformer-based**: built on the transformer architecture, which converts input into context-aware output.

These models power many common NLP tasks: answering questions, summarizing content, translating languages, and generating human-like dialogue.

### **Adding Your API Key to Colab**

1. In Colab, click **🔑 Secrets** in the left panel.
2. Add a new secret with:

   * **Name**: `HF_TOKEN`
   * **Value**: *your API key*
3. Grant notebook access to that secret.

### **Loading the API Key in Your Code**

First, retrieve the key securely:




In [None]:
# Used to securely store your API key
from google.colab import userdata

In [None]:
HF_API_KEY=userdata.get('HF_TOKEN')

**Then pass it to the SDK:**

```python
import os
os.environ["HF_API_KEY"] = HF_API_KEY

from huggingface_hub import InferenceClient
client = InferenceClient(token=os.environ["HF_API_KEY"])
```

- <font color="red">Warning</font>: Ensure that your API key contains no whitespace, including leading or trailing spaces.
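A stray space or newline can slip in when a token is pasted into the Secrets panel. As a minimal sketch of a guard, the helper below trims the value and fails early if whitespace remains inside it. The name `clean_token` is hypothetical and not part of `huggingface_hub`, and the token shown is a made-up placeholder.

```python
def clean_token(raw_token: str) -> str:
    """Trim surrounding whitespace and reject tokens with whitespace inside.

    `clean_token` is a hypothetical helper used only for illustration.
    """
    token = raw_token.strip()
    if not token:
        raise ValueError("The API key is empty.")
    if any(ch.isspace() for ch in token):
        raise ValueError("The API key contains whitespace; copy it again.")
    return token


# Example with a made-up placeholder value, not a real token
print(clean_token("  hf_example_token\n"))
```

In the notebook, this could wrap the secret lookup, for example `HF_API_KEY = clean_token(userdata.get('HF_TOKEN'))`.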

## **Install the Hugging Face Hub SDK**

Now that your account and access token are ready, the next step is setting up your local environment. We'll access models, datasets, and repos via the Hugging Face Hub Python library.

Install it with pip using the command below:


In [None]:
!pip install huggingface_hub



## **Import packages**
Import the necessary packages.

In [None]:
import pathlib
import textwrap

import openai

from IPython.display import display
from IPython.display import Markdown


def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [None]:
import os
os.environ["HF_API_KEY"] = HF_API_KEY

### **Instantiate a client**

Next, we create a client that can call many different models hosted on the Hub. You only need to provide the API key once, for authentication, when you initialise the client.

In [None]:
from huggingface_hub import InferenceClient
import os

# Create client
client = InferenceClient(token=os.environ["HF_API_KEY"])

prompt = "Write a poem about generative AI and its founders."
messages = [{"role": "user", "content": prompt}]

response = client.chat_completion(
    messages=messages,
    temperature=0.1,
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_tokens=500
)

print(response.choices[0].message.content)

In silicon halls, a dream took shape,
A future born of code and fate,
Generative AI, a wondrous thing,
 Created by minds that dared to sing.

Ian Goodfellow, a pioneer true,
Invented GANs, a breakthrough anew,
A neural network that could create and play,
Unleashing art, in a digital way.

Andrew Ng, a visionary guide,
 Led the charge, with a gentle stride,
Deep learning's power, he did unfold,
A new frontier, where AI could hold.

Yoshua Bengio, a mastermind,
 Contributed greatly, to the design,
Recurrent neural networks, a key to unlock,
The secrets of language, and the human stock.

Geoffrey Hinton, a giant in the field,
His work on backpropagation, a story to yield,
A fundamental concept, that paved the way,
For the AI revolution, in a brighter day.

These founders, of a new era's birth,
Their work and passion, gave AI its mirth,
Generative AI, a tool of great might,
Creating art, music, and a digital light.

Their legacy lives on, in the code they wrote,
A testament to human ingenu

In [None]:
response = client.chat_completion(
    messages=messages,
    temperature=0.7,
    model="Qwen/Qwen2.5-7B-Instruct",
    max_tokens=100
)

print(response.choices[0].message.content)

The capital of France is Paris.


In [None]:
response = client.chat_completion(
    messages=messages,
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_tokens=500
)

print(response.choices[0].message.content)

It was a typical Tuesday morning in Silicon Valley when Elon Musk decided to shake things up. He had been feeling particularly mischievous that day, and as he scrolled through his Twitter feed, he spotted the perfect targets.

First, he tweeted a cryptic message about his latest Neuralink project, claiming that it would allow people to control their dreams remotely. But just as people were getting excited, he dropped the bombshell: "Just kidding, it's just a toaster with a fancy name."

The tech community was left scratching their heads, but Musk wasn't done yet. Next, he announced that SpaceX would be launching a new rocket, dubbed the "Hyperion," which would supposedly reach the moon in under an hour. But when asked for more details, he simply responded with a selfie of himself holding a rubber chicken, captioned "Hyperion: coming soon to a moon near you."

But Musk's pièce de résistance was yet to come. He started responding to people's tweets with absurd, over-the-top comments, l

## **What models can be used with the Python SDK?**

Now you're ready to call models via the Hugging Face Inference API. Before generating responses, let's explore the models available for use with the SDK.

For a more holistic view of available models, see the [Hugging Face Model Hub](https://huggingface.co/models).


In [None]:
from huggingface_hub import HfApi
import pandas as pd

# Initialize API
api = HfApi()

models = api.list_models(
    task="text-generation",
    library="transformers",
    sort="downloads",
    direction=-1,
    limit=50  # Get top 50
)

# Display as a list
model_list = []
for model in models:
    model_list.append({
        "Model ID": model.id,
        "Downloads": model.downloads if hasattr(model, 'downloads') else 0,
        "Likes": model.likes if hasattr(model, 'likes') else 0,
        "Tags": ", ".join(model.tags[:3]) if model.tags else ""
    })

# Show as DataFrame
df = pd.DataFrame(model_list)
df


Use `filter` instead.


|   | Model ID | Downloads | Likes | Tags |
|---|---|---|---|---|
| 0 | openai-community/gpt2 | 11144960 | 3001 | transformers, pytorch, tf |
| 1 | Qwen/Qwen2.5-7B-Instruct | 7897741 | 843 | transformers, safetensors, qwen2 |
| 2 | Qwen/Qwen3-0.6B | 7261136 | 741 | transformers, safetensors, qwen3 |
| 3 | Gensyn/Qwen2.5-0.5B-Instruct | 6501589 | 26 | transformers, safetensors, qwen2 |
| 4 | meta-llama/Llama-3.1-8B-Instruct | 5234642 | 4840 | transformers, safetensors, llama |
| 5 | openai/gpt-oss-20b | 4769961 | 3819 | transformers, safetensors, gpt_oss |
| 6 | dphn/dolphin-2.9.1-yi-1.5-34b | 4717677 | 43 | transformers, safetensors, llama |
| 7 | google/gemma-3-1b-it | 4504785 | 677 | transformers, safetensors, gemma3_text |
| 8 | TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 4360479 | 1437 | transformers, safetensors, llama |
| 9 | Qwen/Qwen3-Embedding-0.6B | 4254861 | 692 | sentence-transformers, safetensors, qwen3 |
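
Once the listing is in hand, ordinary Python is enough to narrow it down. The sketch below uses three rows copied from the output above and a hypothetical helper, `most_liked`, that keeps models at or above a like threshold and sorts them by likes. The threshold of 1000 is an arbitrary illustration, not a recommendation.

```python
# Three rows copied from the listing output above
rows = [
    {"Model ID": "openai-community/gpt2", "Downloads": 11144960, "Likes": 3001},
    {"Model ID": "Qwen/Qwen2.5-7B-Instruct", "Downloads": 7897741, "Likes": 843},
    {"Model ID": "meta-llama/Llama-3.1-8B-Instruct", "Downloads": 5234642, "Likes": 4840},
]


def most_liked(models, min_likes):
    """Hypothetical helper: keep models with enough likes, most liked first."""
    kept = [m for m in models if m["Likes"] >= min_likes]
    return sorted(kept, key=lambda m: m["Likes"], reverse=True)


for m in most_liked(rows, min_likes=1000):
    print(m["Model ID"], m["Likes"])
```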


### **Making Your First API Call**

In this section, you'll learn how to use the Hugging Face Inference API to send requests. For text tasks, you'll use endpoints designed for text generation or analysis. For image tasks, you'll use models built specifically for image generation or classification (which we will cover later in this course).

When interacting with chat-based models through Hugging Face's `InferenceClient`, a common endpoint is the **chat completions** interface for instruction-following models.

#### **Chat Completions API Overview**

The chat-style API allows both single-turn and multi-turn interactions by processing a sequence of messages and generating coherent responses. It works well for both conversations and one-off queries.

#### **Input Structure & Parameters**

**A. Messages**
You send a list of messages structured like:

```python
[
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "Tell me about Yoruba culture."}
]
```

Each message contains:

* **role**: One of `"system"`, `"user"`, or `"assistant"`
* **content**: The actual message string

**B. Max Tokens**
Controls how much text the model should generate.

```python
response = client.chat_completion(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    max_tokens=150
)
```

Tokens are subword units, not full words. For example, "unexpectedly" might tokenize to:
`["un", "expect", "ed", "ly"]`

Each model has its own token limit. For example, Llama 3 (8B) supports up to ~8,192 tokens (input + output combined).
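
Because the limit covers input and output together, a long prompt leaves less room for the reply. The arithmetic can be sketched with a hypothetical helper, `max_output_budget`, using the ~8,192-token figure quoted above. The prompt size of 1,000 tokens is a made-up example value.

```python
def max_output_budget(context_limit: int, prompt_tokens: int) -> int:
    """Hypothetical helper: tokens left for the reply once the prompt is counted."""
    if context_limit < 0 or prompt_tokens < 0:
        raise ValueError("Token counts must be non-negative.")
    # Never report a negative budget when the prompt alone exceeds the limit
    return max(context_limit - prompt_tokens, 0)


# 8,192-token limit with a 1,000-token prompt leaves 7,192 tokens for output
print(max_output_budget(8192, 1000))
```

Setting `max_tokens` above this budget would not give the model more room; the combined limit still applies.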

**C. Temperature**
Controls the randomness of the output. A lower temperature leads to more predictable, deterministic text; a higher temperature increases creativity and variety but may also increase the risk of irrelevant or nonsensical responses. The value ranges between 0 and 2:

* `temperature = 0.0`: More deterministic
* `temperature = 2.0`: More diverse and creative

Example:

```python
response = client.chat_completion(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    temperature=0.7
)
```

Higher values produce more varied results; lower values are better when you need consistent, repeatable answers.
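
To see why temperature has this effect, it helps to look at how it is commonly applied: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The sketch below illustrates this in plain Python with made-up logits. It is not the code any particular provider runs, and `softmax_with_temperature` is a hypothetical name.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("This sketch requires temperature > 0.")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    # Lower temperature concentrates probability on the top-scoring token
    print(t, [round(p, 3) for p in probs])
```

The sketch rejects a temperature of 0 to avoid dividing by zero. Many APIs treat 0 as "always pick the most likely token", but that behaviour is an assumption to verify for the provider you use.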

In [None]:
# Define user Prompt
prompt="Recommend to me a very interesting movie."
# Creating a message as required by the API
messages = [{"role": "user", "content": prompt}]
completion = client.chat_completion(
    messages = messages,
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_tokens=300,
    temperature=1.0,
)
print(completion)
Markdown(completion.choices[0].message.content)



I'd like to recommend a thought-provoking and visually stunning movie that explores the concept of reality, identity, and memory.

**Movie: "Eternal Sunshine of the Spotless Mind" (2004)**

Directed by Michel Gondry, this film tells the story of a couple, Joel (Jim Carrey) and Clementine (Kate Winslet), who undergo a procedure to erase their memories of each other after a painful breakup. The movie then follows Joel as he undergoes the procedure, reliving memories of his time with Clementine and experiencing the fragmentation of their relationship.

**Why you'll find it interesting:**

1. **Unique narrative structure**: The film's non-linear storytelling and unconventional use of flashbacks and dream sequences make it a fascinating and intellectually engaging watch.
2. **Exploration of love and heartbreak**: The movie poignantly captures the complexities of human emotions, particularly the intensity and depth of love, as well as the pain and longing that can follow a breakup.
3. **Michel Gondry's visual storytelling**: The film's cinematography, production design, and special effects create a dreamlike atmosphere, immersing the viewer in the characters' emotional experiences.

**Eternal Sunshine of the Spotless Mind** is a thought-provoking, beautiful, and emotionally resonant movie that will leave you reflecting on the nature of love, memory, and the human experience.

(Warning: The movie deals with mature themes, including memory loss
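
The reply above stops mid-sentence, most likely because it reached the `max_tokens=300` cap. Chat completion responses usually carry a `finish_reason` on each choice, and a value of `"length"` conventionally signals this kind of cut-off. The sketch below assumes that field and value, and uses a stand-in object built with `SimpleNamespace` so it runs without an API call. `was_truncated` is a hypothetical helper.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """Hypothetical check: did the first choice stop because of the token cap?

    Assumes the response exposes `choices[0].finish_reason` and that the
    value "length" marks a reply cut off by `max_tokens`.
    """
    return response.choices[0].finish_reason == "length"


# Stand-in for a real response object, built by hand for illustration
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(content="(Warning: The movie deals with"),
        )
    ]
)

if was_truncated(fake_response):
    print("Reply was cut off; consider raising max_tokens.")
```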

Next, we will learn how to hold a multi-turn conversation with this LLM. To do this, we append the assistant's response to the conversation so far, add the new prompt in the same message format, and pass the full list of dictionaries to the chat completion function.

#### **Conversation Dynamics**
Conversations can vary in length from a single message to a series of exchanges. Typically, these interactions might start with a system message to guide the assistant's behavior, followed by a sequence of alternating messages between the user and the assistant.

#### **Roles Explained**
- **System**: The system message sets the initial tone or guidelines for the assistant's behavior during the interaction. It can be used to imbue the assistant with a specific personality or to provide precise instructions on how it should conduct itself. While the system message is optional, its absence defaults the assistant's demeanor to that of a generally helpful nature, akin to starting with a message like "You are a helpful assistant."
  
- **User**: Messages from the user generally consist of queries or comments that prompt responses from the assistant. These are the driving force of the conversation, guiding the topics and flow of the dialogue.

- **Assistant**: This role involves messages generated in response to the user or system inputs. The assistant's messages can include responses based on previous interactions within the conversation. Alternatively, you can manually craft messages in this role to demonstrate preferred responses or to simulate typical interactions.

By understanding and effectively utilizing these roles, you can create nuanced and dynamic dialogues tailored to specific interaction scenarios or conversational needs.
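
As a rough sketch of how the three roles fit together, the hypothetical helper below assembles a `messages` list from an optional system prompt, earlier (user, assistant) exchanges, and a new user prompt. The function name and argument layout are assumptions made for illustration; only the `role`/`content` dictionary format comes from the API described above.

```python
def build_messages(new_prompt, past_turns=(), system_prompt=None):
    """Hypothetical helper: build a chat `messages` list in role/content format.

    `past_turns` is a sequence of (user_text, assistant_text) pairs.
    """
    messages = []
    if system_prompt:
        # The optional system message always comes first
        messages.append({"role": "system", "content": system_prompt})
    for user_text, assistant_text in past_turns:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The newest user prompt goes last so the model answers it
    messages.append({"role": "user", "content": new_prompt})
    return messages


messages = build_messages(
    "Where was it played?",
    past_turns=[
        ("Who won the FIFA World Cup in 2022?",
         "Argentina won the FIFA World Cup in 2022."),
    ],
    system_prompt="You are a helpful assistant.",
)

for m in messages:
    print(m["role"], "->", m["content"])
```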

In [None]:
# Multi-turn conversation with system message
response = client.chat_completion(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the FIFA World Cup in 2022?"},
        {"role": "assistant", "content": "Argentina won the FIFA World Cup in 2022."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

Markdown(response.choices[0].message.content)

 The 2022 FIFA World Cup was played in Qatar.

## Introduction to Gradio

[Gradio](https://www.gradio.app/docs) is an open-source Python library that makes it easy to build a web-based interface around any Python function, model, or API. With just a few lines of code you can wrap your machine-learning model (or any processing function) into a usable UI and launch it locally or share it publicly. Gradio abstracts away the need for front-end web development skills, enabling non-technical users to interact with your model via browser input fields, sliders, image uploads, etc. Once your interface is running, you can also host it on platforms like Hugging Face Spaces so others can try it from anywhere.


In [None]:
!pip install gradio --quiet

In [None]:
# Import libraries
import gradio as gr
import os

In [None]:
def chat_with_model(prompt):
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user",   "content": prompt}
        ]

        response = client.chat_completion(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=messages,
            max_tokens=500
        )

        print(response.choices[0].message.content)
        print(response.usage.prompt_tokens)

        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {str(e)}"

In [None]:
# Create a Gradio interface
iface = gr.Interface(
    fn=chat_with_model,
    inputs=gr.Textbox(lines=2, placeholder="Ask me anything..."),
    outputs="text",
    title="Chat with AI Models",
    description="Ask the model any questions."
)

In [None]:
# Launch the interface
iface.launch()

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://419edbf9b3588c37d2.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)







### **Assignment: Add Memory to Your Chatbot**

### Your Task
Modify the chatbot function above so that it remembers previous messages in the conversation.

### Current Problem
Right now, your chatbot forgets everything after each response. If you say "My name is Sarah" and then ask "What's my name?", it won't remember.

### What You Need to Do
Make your chatbot remember all previous messages and responses, so it can refer back to earlier parts of the conversation.

### Test Your Memory
Your chatbot should be able to handle this conversation:
```
User: "Hi, my name is Sarah and I love pizza."
Bot: [responds]
User: "What's my name?"
Bot: [should say "Sarah"]
User: "What do I love?"
Bot: [should say "pizza"]
```

### **Hint**

To give the chatbot a memory, you'll need to keep *all* of the past messages (user + bot) in a list of dictionaries behind the scenes, and then include that history every time you make a new API call. Here is how to do it, step by step:

1. At the top of your script (outside the function) create a variable, e.g.

   ```python
   chat_history = []
   ```

   This will hold the sequence of all messages.

2. Each time the user sends a prompt, add a dictionary representing the user message to `chat_history`, e.g.

   ```python
   chat_history.append({"role": "user", "content": prompt})
   ```

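3. Then, when you call the model, pass the full history (which now ends with the current user message) as the `messages` list. For example:

   ```python
   messages = chat_history.copy()  # already ends with the current user message
   response = client.chat_completion(model="meta-llama/Llama-3.1-8B-Instruct", messages=messages, max_tokens=500)
   response_text = response.choices[0].message.content  # appended to the history in step 4
   ```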

4. After the model returns a response, take the assistant's reply content and append another dictionary to `chat_history`:

   ```python
   chat_history.append({"role": "assistant", "content": response_text})
   ```

5. That way, when you next ask "What's my name?" or "What do I love?", the history contains the earlier statement ("My name is Sarah and I love pizza.") followed by the new question. The model sees the entire chain and can answer accordingly.

6. If you want, you can limit how many past messages you keep (for token/efficiency reasons) by slicing the list (e.g., `chat_history = chat_history[-10:]`).

By following those steps, your chatbot will "remember" previous messages in the conversation.
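
Step 6 has one subtlety: if the list ever includes a system message, a plain slice will eventually drop it. A minimal sketch of a trimming helper that keeps system messages follows. The name `trim_history` and the made-up conversation are assumptions for illustration only.

```python
def trim_history(history, max_messages=10):
    """Keep every system message plus the last `max_messages` other messages."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    # Guard against max_messages == 0, where rest[-0:] would return everything
    recent = rest[-max_messages:] if max_messages > 0 else []
    return system + recent


# Made-up conversation: one system message and three user/assistant exchanges
history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(3):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_messages=4)
print([m["content"] for m in trimmed])
```

Choosing an even `max_messages` keeps whole user/assistant pairs, so the trimmed history still starts with a user turn after any system message.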