# Deep Learning for Business Applications course

## TOPIC 6: Hugging Face Hub for Computer Vision. Zero-Shot Image Classification

### 1. Libraries

In [None]:
!pip install transformers

In [None]:
# you need to downgrade PyTorch for GPU usage
# because our CUDA drivers for GPU are old
# so uncomment lines below if you are in
# the GPU environment

#!pip uninstall -y torch torchvision
#!pip install torch==2.0.1 torchvision==0.15.2

In [None]:
import os
import torch
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from transformers import CLIPProcessor, CLIPModel

# check if GPU available
# (works in GPU environment only)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('device available:', DEVICE)

# to get rid of tokenizer parallelism warnings
os.environ["TOKENIZERS_PARALLELISM"] = 'false'

In [None]:
!df -h | grep dev/vdb

In [None]:
!rm -rf ~/.cache/huggingface/hub

### 2. Model

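The [CLIP model](https://huggingface.co/openai/clip-vit-base-patch16) was developed by researchers at OpenAI. It embeds images and texts into a shared space, so an image can be classified by comparing it with text descriptions of the candidate classes, without any additional training.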

In [None]:
model_name = 'openai/clip-vit-base-patch16'
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

In [None]:
img = Image.open('imgs/catburger.jpg')  # `soup`, `borsch`, `trayfood`, `catburger`
plt.figure(figsize=(8, 6))
plt.imshow(img)
plt.show()

In [None]:
# here we can define classes that are not
# present in well-known datasets (e.g. COCO),
# which is the main benefit of zero-shot classification

CLASSES = [
    'a photo of a salad', 
    'a photo of a soup', 
    'a photo of a hamburger',
    'a bowl with borsch',
    'a plate with soup',
    'burger with soup and fries', 
    'burger with cat'
]

In [None]:
inputs = processor(
    text=CLASSES,
    images=img,
    return_tensors='pt',
    padding=True
)
# inference only, so gradients are not needed
with torch.no_grad():
    outputs = model(**inputs)

# this is the image-text similarity score
logits_per_image = outputs.logits_per_image
# we can take the softmax to get the label probabilities
probs = logits_per_image.softmax(dim=1)
# resulting class: the label with the highest probability
print('resulting class:', CLASSES[probs.argmax(dim=1).item()])
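
To see what the softmax step does to the similarity scores, here is a minimal sketch in plain Python. The numbers are made up for illustration and are not real CLIP output; the helper `softmax` is a hypothetical name, not part of `transformers`.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # subtract the maximum first for numerical stability
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# made-up similarity scores for three labels (assumption, not model output)
example_logits = [21.5, 19.0, 24.0]
example_probs = softmax(example_logits)

# index of the most probable label
best_index = max(range(len(example_probs)), key=example_probs.__getitem__)
print(example_probs, best_index)
```

The label with the largest logit also gets the largest probability, so softmax does not change the predicted class; it only rescales the scores so they are comparable across labels.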

The same classification can be done with a `pipeline`:

In [None]:
from transformers import pipeline

In [None]:
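def zeroshot(model_name, classes):
    # NOTE: the function classifies the global `img` opened above
    classifier = pipeline('zero-shot-image-classification', model=model_name)
    scores = classifier(
        img,
        candidate_labels=classes  # use the argument, not the global CLASSES
    )
    # results are sorted by score, highest first
    print(
        f'highest score is {scores[0]["score"]:.2f}'
        f' for the label -{scores[0]["label"]}-'
    )
    return scores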


scores = zeroshot(model_name, CLASSES)

In [None]:
plt.figure(figsize=(8, 4))
plt.bar(
    [x['label'] for x in scores],
    [x['score'] for x in scores],
    color='b'
)
plt.xticks(rotation=90)
plt.show()
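
In practice the top score may still be low when no label fits the image well. A minimal sketch of a confidence check is shown here, assuming scores in the same list-of-dicts shape used in the cells above. The function name `confident_label` and the `0.5` threshold are hypothetical choices for illustration, and the data is made up.

```python
def confident_label(scores, min_score=0.5):
    """Return the top label if its score reaches min_score, else None."""
    if not scores:
        return None
    # pick the entry with the highest score, regardless of list order
    best = max(scores, key=lambda item: item['score'])
    return best['label'] if best['score'] >= min_score else None


# made-up scores, not real model output
demo_scores = [
    {'label': 'a photo of a soup', 'score': 0.62},
    {'label': 'a photo of a salad', 'score': 0.38},
]

print(confident_label(demo_scores))
print(confident_label(demo_scores, min_score=0.9))
```

A suitable threshold depends on the images and labels, so it should be tuned on real examples rather than taken from this sketch.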

### 3. Mini-project

### <font color='red'>HOME ASSIGNMENT</font>

Imagine you are a computer vision engineer working on a project to make a student canteen more digital. You are asked to deploy a CV model for meal classification. The canteen is free-flow, so students take a tray with food and show it to a camera. The camera takes a photo of the tray, and a display shows the total price of the combo meal on the tray. Let's assume that only a limited number of meal sets is available.

In this setting it is probably better to use a ready-made zero-shot framework than to collect images, label them and fine-tune a model.

Your task is as follows:
1. Define your own classes in a `MEAL_CLASSES` variable and use this variable with the zero-shot model. Five classes are enough for the home assignment. Each class corresponds to one meal set (e.g. `hamburger with fries and juice`).
2. Collect at least one image for every class. Test that your model works well for every class (use the `zeroshot` function from above).
3. Define a dictionary with prices for every class (meal set). Create your own function that returns the total price of the meal set a student takes.
4. __(ADVANCED, NOT NECESSARY)__ Use a text-to-image generation pipeline to create test images for your work. Ask me for a GPU if needed.

In [None]:
# HINT-1

MEAL_CLASSES = [
    'cutlet with mashed potatoes',
    'vegan sausage with broccoli and lavender raff',
    # add your options
]

In [None]:
# HINT-2

PRICES_RUR = {
    'cutlet with mashed potatoes': 250,
    'vegan sausage with broccoli and lavender raff': 470,
    # add your options
}
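
Since the class labels are used as dictionary keys, a typo in either place would break the price lookup. Here is a minimal sketch of a consistency check; the helper name `missing_prices` and the demo labels and prices are hypothetical and only illustrate the idea.

```python
def missing_prices(classes, prices):
    """Return the class labels that have no entry in the price dictionary."""
    return [label for label in classes if label not in prices]


# hypothetical labels and prices, for illustration only
demo_classes = ['soup with bread', 'pasta with salad']
demo_prices = {'soup with bread': 180}

# a non-empty result means some classes still need a price
print(missing_prices(demo_classes, demo_prices))
```

With your own data the call would be `missing_prices(MEAL_CLASSES, PRICES_RUR)`, and an empty list means every class has a price.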

In [None]:
# HINT-3

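def price_for_meal_set(scores, prices_dict):
    # you have to implement this function yourself:
    # `scores` is the output of `zeroshot`,
    # `prices_dict` maps every class label to its price
    price = None  # TODO: replace with your implementation
    return price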