## Hugging Face Model Inference <img src="../../images/huggingface.png" width=30 />

Inference is the process of using a trained model to make predictions on new data. Because inference can be compute-intensive, it is often run on a dedicated server rather than locally. The `huggingface_hub` library provides an easy way to call a service that runs inference for hosted models.

### Serverless Inference API

The Serverless Inference API lets you explore the most popular models for text, image, speech, and more with a simple API request, so you can build, test, and experiment without dedicated infrastructure or setup. It exposes models that have large community interest and are in active use. Which models are active is based on recent likes, downloads, and usage, so deployed models can be swapped without prior notice. The Hugging Face stack aims to keep the latest popular models warm and ready to use. A model is in one of three states:

- **Warm models:** models ready to be used.
- **Cold models:** models that are not loaded but can be used.
- **Frozen models:** models that currently can’t be run with the API.
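A cold model has to be loaded before it can answer, so the first request to it may fail while loading is in progress. The sketch below shows one way to wait and retry. It rests on assumptions: the helper name `post_with_cold_start_retry` is hypothetical, and the code assumes a loading model answers with HTTP 503 and may report an `estimated_time` (in seconds) in its JSON body. The `post` argument is any callable with the same call shape as `requests.post`.

```python
import time
from typing import Any, Callable


def post_with_cold_start_retry(
    post: Callable[..., Any],
    url: str,
    headers: dict,
    payload: dict,
    max_retries: int = 3,
    default_wait: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """POST to a model endpoint, waiting and retrying while the model is cold.

    Assumption: a cold model answers with HTTP 503 and may include an
    "estimated_time" field (seconds) in its JSON body.
    """
    for attempt in range(max_retries + 1):
        response = post(url, headers=headers, json=payload)
        if response.status_code != 503:
            response.raise_for_status()  # surface other errors (401, 404, ...)
            return response.json()
        if attempt == max_retries:
            break  # out of retries, give up below
        try:
            wait = float(response.json().get("estimated_time", default_wait))
        except (ValueError, AttributeError, TypeError):
            wait = default_wait  # body was missing or not in the assumed shape
        sleep(wait)
    raise TimeoutError(f"Model still loading after {max_retries} retries")
```

With the Requests library installed, this could be called as `post_with_cold_start_retry(requests.post, MODEL_ENDPOINT, headers, {"inputs": "Today is a great day"})`. Passing `post` and `sleep` as arguments keeps the retry logic easy to exercise without network access.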

#### Using the Requests library

With the Requests library, you call the Hugging Face Inference API directly over HTTP in four steps:

**1. API Token:** Obtain an API key from Hugging Face by creating an account and generating a token from your settings.

**2. Endpoint:** Identify the endpoint URL for the specific model you wish to use. 

**3. HTTP Request:** Use the Python requests library to send an HTTP POST request to the model endpoint, including:
- The input data in JSON format.
- An Authorization header containing the API token.

**4. Response Handling:** Parse the JSON response returned by the API, which contains the model's prediction or output.

In [None]:
import os
import requests
from dotenv import load_dotenv

# 1. API Token - load API token environment variables
load_dotenv()

# 2. Endpoint - define the model endpoint 
MODEL_ENDPOINT = "https://api-inference.huggingface.co/models/cardiffnlp/twitter-roberta-base-sentiment-latest"

# 3. HTTP Request - define the input data and associated headers and make the request
payload = {"inputs": "Today is a great day"}  # named payload to avoid shadowing the built-in input()
headers = {"Authorization": f"Bearer {os.getenv('HUGGINGFACEHUB_API_TOKEN')}"}

response = requests.post(MODEL_ENDPOINT, headers=headers, json=payload)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error body

# 4. Response Handling - convert the response and display the results:
output = response.json()

print(f"Sentiment of text: '{payload['inputs']}'")
for result in output[0]:
    print(f"{result['label']}: {round(result['score'] * 100, 2)}%")

#### Using the Inference Client

The InferenceClient class from the huggingface_hub library provides a more streamlined and Pythonic way to interact with the Hugging Face Inference API, simplifying the process of sending inputs and receiving predictions.

**1. Initialise Client:** Use the InferenceClient class from huggingface_hub.

**2. API Token:** Authenticate with your Hugging Face API token, passed through the client's `token` argument or picked up from a saved login. The client then attaches it to each request automatically.

**3. Call Model:** Use the .post() or task-specific methods like .text_generation() to send input data to a specified model endpoint.

**4. Get Results:** The client processes the response and returns predictions in a usable format.

In [None]:
import os
from huggingface_hub import InferenceClient

# 1. Initialise Client
# 2. API Token - pass the token loaded from the environment variables earlier
client = InferenceClient(token=os.getenv("HUGGINGFACEHUB_API_TOKEN"))

# 3. Call Model
text = "Today is a great day"
response = client.text_classification(text, model="cardiffnlp/twitter-roberta-base-sentiment-latest")

# 4. Get Results
print(f"Sentiment of text: '{text}'")
for result in response:
    print(f"{result['label']}: {round(result['score'] * 100, 2)}%")

#### Using the Async Inference Client

The AsyncInferenceClient is an asynchronous version of the InferenceClient provided by the huggingface_hub library. It allows you to make non-blocking requests to Hugging Face's Inference API, which is especially useful in scenarios where you need to handle multiple inference tasks concurrently or optimize for responsiveness in your application.

The AsyncInferenceClient leverages Python's asyncio framework to support asynchronous programming. You define async functions and use the await keyword to call the client’s methods.

**Concurrency:**

Traditional synchronous methods block execution while waiting for the server's response.
With AsyncInferenceClient, multiple requests can run simultaneously without waiting for one to complete, making it ideal for handling multiple inputs in parallel.

**Performance:**

Asynchronous execution helps in reducing the total runtime for applications that need to process a large batch of requests or interact with multiple APIs.

**Non-blocking Execution:**

In web servers or interactive applications, asynchronous operations prevent blocking the main thread, ensuring smooth user experiences and efficient resource use.
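The concurrency benefit can be illustrated without network access. In the sketch below, `classify` is a hypothetical stand-in for an awaited client call such as `client.text_classification(...)`: it only sleeps to simulate latency and returns placeholder data, not real model output. `asyncio.gather` starts all calls at once and returns their results in input order.

```python
import asyncio
import time


async def classify(text: str) -> dict:
    """Hypothetical stand-in for an awaited inference call."""
    await asyncio.sleep(0.2)  # simulate network and model latency
    return {"text": text, "label": "placeholder"}  # not real model output


async def classify_all(texts: list[str]) -> list[dict]:
    # gather() schedules every coroutine immediately and preserves input order
    return await asyncio.gather(*(classify(t) for t in texts))


texts = ["Today is a great day", "I like you.", "Today is sunny"]

start = time.perf_counter()
results = asyncio.run(classify_all(texts))
elapsed = time.perf_counter() - start

# The three simulated calls overlap, so the total is close to one call's latency
print(f"{len(results)} results in {elapsed:.2f}s")
```

Inside a notebook, where an event loop is already running, use `results = await classify_all(texts)` instead of `asyncio.run(...)`.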

In [None]:
import os
from huggingface_hub import AsyncInferenceClient

# 1. Initialise Client
# 2. API Token - pass the token loaded from the environment variables earlier
client = AsyncInferenceClient(token=os.getenv("HUGGINGFACEHUB_API_TOKEN"))

# 3. Call Model (top-level await works in Jupyter/IPython)
text = "Today is a great day"
response = await client.text_classification(
    text,
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

# 4. Get Results
print(f"Sentiment of text: '{text}'")
for result in response:
    print(f"{result['label']}: {round(result['score'] * 100, 2)}%")

### Model Domains

In [None]:
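token = os.getenv("HUGGINGFACEHUB_API_TOKEN")  # loaded earlier by load_dotenv()
client = InferenceClient(token=token)
client_async = AsyncInferenceClient(token=token)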

#### Audio

##### Audio Classification

Audio classification is the task of assigning a label or class to a given audio clip.

Example applications:
- Recognizing which command a user is giving
- Identifying a speaker
- Detecting the genre of a song

In [None]:
import soundfile as sf
import pprint
from IPython.display import Audio

file_path = "../../data/audio-classify.wav"

response = client.audio_classification(
    file_path,
    model="speechbrain/google_speech_command_xvector"
    )

for item in response:
    print(f"Label: {item['label']}, Score: {round(item['score'], 2)}")

data, samplerate = sf.read(file_path)
Audio(data, rate=samplerate)

##### Automatic Speech Recognition

Automatic Speech Recognition (ASR), also known as Speech to Text (STT), is the task of transcribing audio to text.

Example applications:
- Transcribing a podcast
- Building a voice assistant
- Generating subtitles for a video

In [None]:
file_path = "../../data/audio-asr.flac"

response = client.automatic_speech_recognition(file_path)
pprint.pprint(response.text)

data, samplerate = sf.read(file_path)
Audio(data, rate=samplerate)

##### Text to Image

Generate an image based on a given text prompt.

In [None]:
# image_text is the caption produced in the "Image-Text to Text" cell; run that cell first
image = client.text_to_image(
    image_text,
    negative_prompt="low resolution, blurry",
    model="stabilityai/stable-diffusion-3.5-large"
) 

display(image.resize((300, 300)))

#### Image

In [None]:
from PIL import Image

image_path = "../../data/teddy.jpg"
 
image = Image.open(image_path)
display(image)

##### Image Segmentation

Image Segmentation divides an image into segments where each pixel in the image is mapped to an object.
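The pixel-to-object mapping can be made concrete with a small NumPy sketch. It assumes each segment comes with a label and a mask the size of the image, as segmentation results typically do. The 4x4 masks and labels below are hypothetical values, not output from a model.

```python
import numpy as np

# Hypothetical 4x4 masks, one per segment (255 = pixel belongs to the segment)
teddy_mask = np.zeros((4, 4), dtype=np.uint8)
teddy_mask[1:3, 1:3] = 255            # the object occupies the central 2x2 block
background_mask = 255 - teddy_mask    # every other pixel is background

masks = {"background": background_mask, "teddy bear": teddy_mask}
labels = list(masks)

# Stack masks to shape (segments, height, width); argmax picks one segment per pixel
stacked = np.stack([masks[name] for name in labels])
label_map = stacked.argmax(axis=0)

# Fraction of the image covered by each segment
coverage = {name: float((label_map == i).mean()) for i, name in enumerate(labels)}

print(label_map)
print(coverage)
```

Each entry of `label_map` is the index of the segment that owns that pixel, which is the "each pixel is mapped to an object" idea in array form.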

In [None]:
## NEED A GOOD EXAMPLE

##### Image Classification

Image classification is the task of assigning a label or class to an entire image. Each image is expected to belong to exactly one class.

In [None]:
response = client.image_classification(
    "../../data/teddy.jpg",
    top_k=3,
    model="google/vit-base-patch16-224",
)

for label in response:
    print(f"Class Label: {label['label']}, Score: {round(label['score'], 2)}")

##### Image-Text to Text

Image-text-to-text models, also called vision-language models (VLMs), take an image and a text prompt and output text. Unlike image-to-text models, they accept an additional text input, so they are not restricted to use cases such as image captioning, and they may also be trained to accept a conversation as input. The example in this section calls `image_to_text`, which performs plain image captioning without a text prompt.

In [None]:
response = client.image_to_text(
    "../../data/teddy.jpg",
    model="Salesforce/blip-image-captioning-large"
)

image_text = response.generated_text
print(image_text)

##### Image to Image

Image-to-image is the task of transforming a source image to match the characteristics of a target image or a target image domain.

Example applications:
- Transferring the style of an image to another image
- Colorizing a black and white image
- Increasing the resolution of an image

In [None]:
image_transformed = await client_async.image_to_image(
    "../../data/teddy.jpg", 
    prompt="change the cat to a tiger",
    model="stabilityai/stable-diffusion-xl-refiner-1.0"
)

display(image_transformed)

##### Object Detection

Object Detection models identify objects of certain defined classes. They receive an image as input and output bounding boxes and labels for the detected objects, which can then be drawn onto the image.

In [None]:
result = client.object_detection(
    "../../data/teddy.jpg",
)

result

In [None]:
from PIL import ImageDraw, ImageFont
import random

def display_object_detection(image, objects):
    draw_image = image.copy()
    draw = ImageDraw.Draw(draw_image)
    try:
        font = ImageFont.truetype("arial.ttf", 12)
    except IOError:
        font = ImageFont.load_default()
    for obj in objects:  # named obj to avoid shadowing the built-in object
        label = obj.label
        score = obj.score
        box = obj.box

        random_color = (random.randint(0, 255), random.randint(0, 255), random.randint(0, 255))
        xmin, ymin, xmax, ymax = box['xmin'], box['ymin'], box['xmax'], box['ymax']

        draw.rectangle(((xmin, ymin), (xmax, ymax)), outline=random_color, width=3)
        text_bbox = draw.textbbox((0, 0), f"{label} ({score:.2f})", font=font)
        text_width = text_bbox[2] - text_bbox[0]
        text_height = text_bbox[3] - text_bbox[1]
    
        text_x = xmax - text_width - 5  
        text_y = ymax - text_height - 5  

        text_background = [(text_x, text_y), (text_x + text_width, text_y + text_height)]
        draw.rectangle(text_background, fill=random_color)
        draw.text((text_x, text_y), f"{label} ({score:.2f})", fill="white", font=font)

    display(draw_image)

In [None]:
display_object_detection(Image.open("../../data/teddy.jpg"), result)

#### Question Answering

##### Question Answering

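Question Answering models retrieve the answer to a question from a given text, which is useful for searching for an answer in a document. The example in this section uses document question answering, which applies the same idea to an image of a document, here an invoice.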

In [None]:
image = Image.open("../../data/invoice.png")
display(image)

In [None]:
response = client.document_question_answering(
    image="../../data/invoice.png", 
    question="what is the invoice number?"
)

print(f"Answer: {response[0]['answer']}, Score: {round(response[0]['score'], 2)}")

##### Table Question Answering

Table Question Answering (Table QA) is the task of answering a question about information in a given table.

In [None]:
table = {"Repository": ["Transformers", "Datasets", "Tokenizers"], "Stars": ["36542", "4512", "3934"]}

client.table_question_answering(
    table=table,
    query="What is the average stars rating?",
    model="google/tapas-base-finetuned-wtq",
)

#### Text

##### Text Generation

Generate text based on a prompt. To generate a response from a list of messages instead, see the Chat Completion task.

In [None]:
prompt = "The huggingface_hub library is"  # named prompt to avoid shadowing the built-in input()
response = await client_async.text_generation(
    prompt,
    max_new_tokens=15,
)

print(prompt + response)

##### Text Classification

Text Classification is the task of assigning a label or class to a given text. Some use cases are sentiment analysis, natural language inference, and assessing grammatical correctness.

In [None]:
response = client.text_classification(
    "I like you.",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

for message in response:
    print(f"Label: {message['label']}, Score: {round(message['score'], 2)}")

##### Text Summarization

Summarization is the task of producing a shorter version of a document while preserving its important information. Some models can extract text from the original input, while other models can generate entirely new text.

In [None]:
text = """
The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. 
Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington 
Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in 
New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a 
broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). 
Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.
"""

result = client.summarization(text)

result.summary_text

##### Translation

Translation is the task of converting text from one language to another.

In [None]:
result = client.translation(
    "My name is Wolfgang and I live in Berlin",
    model="google-t5/t5-base"
)

result.translation_text

##### Chat Completion

Generate a response given a list of messages in a conversational context, supporting both conversational Language Models (LLMs) and conversational Vision-Language Models (VLMs). This is a subtask of text-generation and image-text-to-text.

In [None]:
response = client.chat_completion(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=500,
    stream=True
)

for message in response:
    # delta.content can be None on some chunks, so fall back to an empty string
    print(message.choices[0].delta.content or "", end="")

##### Token Classification

Token classification is a task in which a label is assigned to some tokens in a text. Some popular token classification subtasks are Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging.

In [None]:
response = client.token_classification(
    "My name is Scott and I was born in Toronto, Canada",
    model="dslim/bert-base-NER"
) 

for message in response:
    print(f"Word: {message['word']}, Entity Group: {message['entity_group']}")

##### Feature Extraction

Feature extraction is the task of converting a text into a vector (often called an “embedding”).

Example applications:
- Retrieving the most relevant documents for a query (for RAG applications).
- Reranking a list of documents based on their similarity to a query.
- Calculating the similarity between two sentences.
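Sentence similarity is commonly computed as the cosine similarity of two embeddings. The sketch below assumes the embeddings are plain numeric vectors. The 4-dimensional values are hypothetical and chosen only for illustration, since real models return vectors with hundreds of dimensions. The helper name `cosine_similarity` is also an assumption.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical embeddings for three sentences (illustrative values only)
sunny = [0.9, 0.1, 0.0, 0.2]
bright = [0.8, 0.2, 0.1, 0.3]
rainy = [-0.7, 0.6, 0.2, -0.1]

print(f"sunny vs bright: {cosine_similarity(sunny, bright):.3f}")
print(f"sunny vs rainy:  {cosine_similarity(sunny, rainy):.3f}")
```

Ranking documents for a query follows the same pattern: compute the similarity between the query embedding and each document embedding, then sort in descending order.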

In [None]:
response = client.feature_extraction(
    "Today is sunny",
    model="thenlper/gte-large"
)

print(response[:5])

##### Fill Mask

Mask filling is the task of predicting the right word (token to be precise) in the middle of a sequence.

In [None]:
response = client.fill_mask(
    "The capital of France is [MASK].", 
    model="bert-base-uncased",
    top_k=2
)

for result in response:
    print(f"Sequence: {result['sequence']} Score: {round(result['score'] * 100, 2)}%")

##### Zero Shot Classification

Zero-shot text classification lets you try out classification without training a model: you pass a sentence or paragraph together with the possible labels for it, and you get a result. The model has not necessarily been trained on the labels you provide, but it can still predict the correct label.

In [None]:
response = client.zero_shot_classification(
    text="Hi, I recently bought a device from your company but it is not working as advertised and I would like to get reimbursed!",
    labels=["refund", "legal", "faq"],
    model="facebook/bart-large-mnli"
)

for message in response:
    print(f"Label: {message['label']}, Score: {round(message['score'], 2)}")