# Deep Learning for Business Applications course

## TOPIC 6: Hugging Face Hub for Computer Vision

### 1. Libraries

In [None]:
!pip install transformers

In [None]:
# the CUDA drivers on our GPU servers are old,
# so PyTorch has to be downgraded there;
# uncomment the lines below only if you are
# in the GPU environment

#!pip uninstall -y torch torchvision
#!pip install torch==2.0.1 torchvision==0.15.2

In [None]:
import os
import torch
from PIL import Image
import matplotlib.pyplot as plt

# check if GPU available
# (works in GPU environment only)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('device available:', DEVICE)

# to get rid of warnings
os.environ["TOKENIZERS_PARALLELISM"] = 'false'
# env variable to set path to download models
os.environ['HF_HOME'] = '/home/jovyan/dlba/topic_06/cache/'
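
Because `HF_HOME` is set above, downloaded models should land in the `hub` subfolder of that directory rather than in `~/.cache/huggingface/hub`. Below is a simplified sketch of how the location can be worked out from the environment, assuming the usual precedence (`HF_HUB_CACHE` first, then `HF_HOME/hub`, then the default under the home directory) and ignoring other variables such as `XDG_CACHE_HOME`. The helper name `resolve_hub_cache` is made up for illustration.

```python
import os


def resolve_hub_cache(env=None):
    """Return the directory where Hub downloads are expected to be cached."""
    env = os.environ if env is None else env
    # an explicit cache path takes priority
    if env.get('HF_HUB_CACHE'):
        return os.path.expanduser(env['HF_HUB_CACHE'])
    # otherwise the cache lives in the `hub` subfolder of HF_HOME
    # (simplified default: ~/.cache/huggingface)
    hf_home = env.get('HF_HOME', os.path.join('~', '.cache', 'huggingface'))
    return os.path.join(os.path.expanduser(hf_home), 'hub')


print(resolve_hub_cache())
```

Passing a plain dictionary instead of `os.environ` makes it easy to try different settings without touching the real environment.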

### 2. Disk space

<font color='red'>__WARNING!!!__</font>

Keep an eye on free disk space when downloading models from the Hub. Your local disk is only 12 GB, whereas modern models are large and can easily fill it. If the disk fills up, your server will get stuck and you will have to [contact support](https://t.me/simbaplatform).

In [None]:
!df -h | grep dev
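
The same check can be done from Python before a large download. This is a minimal sketch using only the standard library; the helper name `enough_space` and the 5 GB requirement are illustrative assumptions, not platform limits.

```python
import shutil


def enough_space(path='/', required_gb=5.0):
    """Return (ok, free_gb) for the filesystem that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return free_gb >= required_gb, free_gb


ok, free_gb = enough_space('/', required_gb=5.0)
print(f'free: {free_gb:.1f} GB, enough for the download: {ok}')
```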

In [None]:
!ls -ls ~/.cache/

In [None]:
# the default place for Hugging Face Hub models
# (with HF_HOME set as above, look in $HF_HOME/hub instead)

!ls -ls ~/.cache/huggingface/hub

In [None]:
# use `rm -rf` !!! WITH CARE !!!

!rm -rf ~/.cache/huggingface/hub

In [None]:
# a place for PyTorch models

!ls -ls ~/.cache/torch/hub/

In [None]:
!rm -rf ~/.cache/torch/hub/checkpoints
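
Before deleting anything, it helps to see which cached models take the most space. Here is a minimal sketch that sums file sizes per subfolder; `dir_size_bytes` and `report_cache` are hypothetical helper names, and the path is the default Hub cache location used in the cells above. Symbolic links are skipped so that files linked from several places are not counted twice.

```python
import os


def dir_size_bytes(path):
    """Total size of regular files under `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def report_cache(cache_dir):
    """Print subfolders of `cache_dir`, largest first, and return the list."""
    cache_dir = os.path.expanduser(cache_dir)
    if not os.path.isdir(cache_dir):
        print('no cache at', cache_dir)
        return []
    sizes = [
        (dir_size_bytes(os.path.join(cache_dir, d)), d)
        for d in os.listdir(cache_dir)
        if os.path.isdir(os.path.join(cache_dir, d))
    ]
    sizes.sort(reverse=True)
    for size, name in sizes:
        print(f'{size / 1024 ** 2:10.1f} MB  {name}')
    return sizes


report_cache('~/.cache/huggingface/hub')
```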

### 3. Models from the Hub

#### 3.1. Warm up: classification

Start with [ResNet model](https://huggingface.co/microsoft/resnet-50) pre-trained on ImageNet-1k at resolution 224x224.

In [None]:
from transformers import AutoImageProcessor, ResNetForImageClassification

In [None]:
img = Image.open('imgs/burger.jpg')
plt.figure(figsize=(6, 6))
plt.imshow(img)
plt.show()

In [None]:
# create model and image processor
# model will be downloaded automatically
# from Hugging Face Hub

model_name = 'microsoft/resnet-50'
processor = AutoImageProcessor.from_pretrained(model_name)
model = ResNetForImageClassification.from_pretrained(model_name)

In [None]:
# convert image to tensor
inputs = processor(img, return_tensors='pt')

# inference of the model
with torch.no_grad():
    logits = model(**inputs).logits

In [None]:
logits

In [None]:
# how many classes?

len(logits[-1])

In [None]:
# model predicts one of the 1000 ImageNet classes

predicted_label = logits.argmax(-1).item()
print(
    'class predicted:',
    model.config.id2label[predicted_label]
)
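
`argmax` returns only the single best class. To see how confident the model is, the logits can be turned into probabilities with a softmax and the top few classes listed. Below is a minimal, framework-free sketch with made-up logits and a hypothetical three-class label mapping (the real model has 1000 classes). On the actual tensor, `torch.softmax(logits, dim=-1)` followed by `.topk(5)` would play the same role.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda p: p[1], reverse=True)
    return [(id2label[i], p) for i, p in ranked[:k]]


# illustrative values only, not real model output
toy_logits = [2.0, 0.5, -1.0]
toy_labels = {0: 'cheeseburger', 1: 'hotdog', 2: 'pizza'}
for label, prob in top_k(toy_logits, toy_labels, k=2):
    print(f'{label}: {prob:.3f}')
```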

#### 3.2. More classification

A more interesting case is the [Fine-Tuned Vision Transformer (ViT) for NSFW Image Classification](https://huggingface.co/Falconsai/nsfw_image_detection). The model can be used to detect NSFW (Not Safe for Work) content on websites.

In [None]:
from transformers import AutoModelForImageClassification, ViTImageProcessor

In [None]:
# Police Academy rules!

img = Image.open('imgs/blueoyster.jpg')
plt.figure(figsize=(6, 6))
plt.imshow(img)
plt.show()

In [None]:
model = AutoModelForImageClassification.from_pretrained('Falconsai/nsfw_image_detection')
processor = ViTImageProcessor.from_pretrained('Falconsai/nsfw_image_detection')

with torch.no_grad():
    inputs = processor(images=img, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits

predicted_label = logits.argmax(-1).item()
print(
    'class predicted:',
    model.config.id2label[predicted_label]
)

Where to find NSFW images? Figure that out for yourself... Use the Internet if you want.

In [None]:
from transformers import pipeline

# let's use `pipeline` for model inference
pipe = pipeline(
    'image-classification',
    model='Falconsai/nsfw_image_detection'
)

In [None]:
import requests

# load the image into memory
# you will need the URL for the image
img_url = '<YOUR_URL_TO_IMAGE>'
img = Image.open(
    requests.get(img_url, stream=True).raw
).convert('RGB')

In [None]:
# a pipeline makes it easy to classify an image

results = pipe(img)
print(results)
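
The pipeline returns a list of dictionaries with `label` and `score` keys, one per class, so a moderation rule takes a few lines of plain Python. This is a minimal sketch, assuming the labels are named `normal` and `nsfw` (check `model.config.id2label` to confirm) and using a made-up result and an arbitrary 0.8 threshold; `is_flagged` is a hypothetical helper.

```python
def is_flagged(results, label='nsfw', threshold=0.8):
    """Return True if `label` is predicted with a score of at least `threshold`."""
    scores = {r['label']: r['score'] for r in results}
    return scores.get(label, 0.0) >= threshold


# made-up output in the shape the pipeline returns
example = [
    {'label': 'normal', 'score': 0.93},
    {'label': 'nsfw', 'score': 0.07},
]
print('flagged:', is_flagged(example))
```

The threshold is a business decision: a lower value catches more unsafe images at the cost of more false alarms.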

#### 3.3. Object detection

Work with an arbitrary [detection model](https://huggingface.co/facebook/detr-resnet-50).

In [None]:
!pip install opencv-python

In [None]:
import cv2
from transformers import DetrImageProcessor, DetrForObjectDetection

In [None]:
DATA_PATH = '/home/jovyan/__DATA/DLBA_F24/topic_04'
img_path = f'{DATA_PATH}/ace.jpg'
img = Image.open(img_path)
img_ = cv2.imread(img_path)
img_ = cv2.cvtColor(img_, cv2.COLOR_BGR2RGB)
plt.figure(figsize=(16, 6))
plt.imshow(img)
plt.show()

In [None]:
# model and image processor
model_name = 'facebook/detr-resnet-50'
processor = DetrImageProcessor.from_pretrained(model_name, revision='no_timm')
model = DetrForObjectDetection.from_pretrained(model_name, revision='no_timm')

# inference for detection (no gradients needed)
inputs = processor(images=img, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

In [None]:
# convert outputs (bounding boxes and class logits) to COCO format
# let's only keep detections with score > 0.75
th = .75
target_sizes = torch.tensor([img.size[::-1]])
results = processor.post_process_object_detection(
    outputs,
    target_sizes=target_sizes,
    threshold=th
)[0]

# results and bbox drawing
for score, label, box in zip(results['scores'], 
                             results['labels'], 
                             results['boxes']):
    box = [round(i, 2) for i in box.tolist()]
    lbl = model.config.id2label[label.item()]
    print(
        f'detected {lbl} with confidence',
        f'{round(score.item(), 2)} at location {box}'
    )

    top_left = (int(box[0]), int(box[1]))
    bottom_right = (int(box[2]), int(box[3]))
    cv2.rectangle(img_, top_left, bottom_right, (0, 255, 0), 3)
    cv2.putText(
        img_,
        lbl,
        top_left,
        cv2.FONT_HERSHEY_SIMPLEX,
        2,
        (0, 255, 0),
        3
    )

In [None]:
plt.figure(figsize=(16, 6))
plt.imshow(img_)
plt.show()
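
DETR predicts boxes as normalized `(center_x, center_y, width, height)` values between 0 and 1. `post_process_object_detection` rescales them to pixel corner coordinates, which is why `target_sizes` has to be given as `(height, width)` while `PIL` reports `img.size` as `(width, height)`. Here is a minimal sketch of that conversion for a single box with made-up numbers; `cxcywh_to_xyxy` is a hypothetical helper, not part of `transformers`.

```python
def cxcywh_to_xyxy(box, img_height, img_width):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * img_width
    y0 = (cy - h / 2) * img_height
    x1 = (cx + w / 2) * img_width
    y1 = (cy + h / 2) * img_height
    return [x0, y0, x1, y1]


# a box centered in a 640x480 (width x height) image,
# covering half of each dimension
print(cxcywh_to_xyxy((0.5, 0.5, 0.5, 0.5), img_height=480, img_width=640))
# prints [160.0, 120.0, 480.0, 360.0]
```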

__NOTE:__
You may fine-tune this model as we did in previous classes.

#### 3.4. Image captioning

Now let's try [BLIP](https://huggingface.co/Salesforce/blip-image-captioning-base).

BLIP (Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation) is used here for image captioning. This checkpoint is the base architecture (with a ViT base backbone) pretrained on the COCO dataset.

In [None]:
from transformers import BlipProcessor, BlipForConditionalGeneration

In [None]:
model_name = 'Salesforce/blip-image-captioning-base'
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

In [None]:
# check for free space left...
!df -h | grep dev

In [None]:
img = Image.open('imgs/burger.jpg')
plt.figure(figsize=(6, 6))
plt.imshow(img)
plt.show()

In [None]:
def img_caption(img):
    # conditional image captioning
    text = 'a photography of'
    inputs = processor(img, text, return_tensors='pt')
    out = model.generate(**inputs)
    print(
        'conditional image captioning:',
        processor.decode(out[0], skip_special_tokens=True)
    )

    # unconditional image captioning
    inputs = processor(img, return_tensors="pt")
    out = model.generate(**inputs)
    print(
        'unconditional image captioning:',
        processor.decode(out[0], skip_special_tokens=True)
    )


img_caption(img)

In [None]:
img = Image.open('imgs/blueoyster.jpg')
plt.figure(figsize=(6, 6))
plt.imshow(img)
plt.show()

img_caption(img)

#### 3.5. More for image captioning

Now try [one more image captioning model](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) from the Hub.

In [None]:
from transformers import (
    VisionEncoderDecoderModel,
    ViTImageProcessor,
    AutoTokenizer
)

In [None]:
model_name = 'nlpconnect/vit-gpt2-image-captioning'
model = VisionEncoderDecoderModel.from_pretrained(model_name)
feature_extractor = ViTImageProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
img = Image.open('imgs/burger.jpg')
plt.figure(figsize=(6, 6))
plt.imshow(img)
plt.show()

In [None]:
# generation parameters: caption length limit and beam search width
max_length = 16
num_beams = 4
gen_kwargs = {'max_length': max_length, 'num_beams': num_beams}

# run the model
pixel_values = feature_extractor(images=img, return_tensors='pt').pixel_values
output_ids = model.generate(pixel_values, **gen_kwargs)
preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
preds = [pred.strip() for pred in preds]
print(preds)
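
`num_beams` controls beam search: at each step the decoder keeps the `num_beams` best partial captions instead of only the single most likely one, and `max_length` caps the caption length. Below is a toy sketch of the idea over a hand-written next-token table. The vocabulary, the probabilities, and the `beam_search` helper are all made up for illustration and have nothing to do with the real model's decoder; `max_length` here simply counts decoding steps.

```python
import math

# hypothetical next-token probabilities, keyed by the previous token
NEXT = {
    '<s>': {'a': 0.6, 'the': 0.4},
    'a': {'burger': 0.5, 'plate': 0.5},
    'the': {'burger': 0.9, 'plate': 0.1},
    'burger': {'</s>': 1.0},
    'plate': {'</s>': 1.0},
}


def beam_search(num_beams=2, max_length=4):
    """Return (log-probability, tokens) of the best sequence found."""
    beams = [(0.0, ['<s>'])]
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == '</s>':
                candidates.append((score, tokens))  # finished, keep as is
                continue
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        # keep only the `num_beams` best partial sequences
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams[0]


for n in (1, 2):
    score, tokens = beam_search(num_beams=n)
    print(f'num_beams={n}:', ' '.join(tokens[1:-1]), f'(p={math.exp(score):.2f})')
```

With one beam the search is greedy: it commits to `a` at the first step and ends with a sequence of probability 0.30, while two beams also keep `the` alive and find the more probable `the burger` (0.36). Wider beams cost more computation per step.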

...or with use of `pipeline` from `transformers`:

In [None]:
pipe_img2txt = pipeline('image-to-text', model='nlpconnect/vit-gpt2-image-captioning')
results = pipe_img2txt('imgs/blueoyster.jpg')
print(results)

#### 3.6. Text-to-Image model

A [Stable Diffusion](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5)-based model.

In [None]:
!pip install diffusers
!pip install accelerate

<font color='red'>__WARNING!!!__</font>

(1) Keep an eye on free disk space when downloading models from the Hub. Diffusion models ARE VERY LARGE.

(2) You need a GPU environment to run image-generating models.

In [None]:
!rm -rf ~/.cache/huggingface/hub
!rm -rf /home/jovyan/dlba/topic_06/cache/hub

In [None]:
!df -h | grep dev

In [None]:
from diffusers import StableDiffusionPipeline

In [None]:
# just take model from the Hub
# and create a pipeline for work

model_name = 'sd-legacy/stable-diffusion-v1-5'
pipe = StableDiffusionPipeline.from_pretrained(
    model_name, 
    torch_dtype=torch.float16
)
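
`torch_dtype=torch.float16` loads the weights in half precision, which roughly halves the memory they need compared with the default 32-bit floats. Here is a back-of-the-envelope sketch; the one-billion parameter count is a round illustrative number, not the exact size of this model, and the estimate covers the weights only (no activations).

```python
BYTES_PER_PARAM = {'float32': 4, 'float16': 2}


def weights_size_gb(num_params, dtype):
    """Approximate memory needed to hold the weights alone, in GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 ** 3


num_params = 1_000_000_000  # illustrative assumption
for dtype in ('float32', 'float16'):
    print(f'{dtype}: {weights_size_gb(num_params, dtype):.2f} GB')
```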

In [None]:
# move the model to the GPU so it runs fast
pipe = pipe.to(DEVICE)

In [None]:
# we need a brief description of what we want to get
prompt = 'a bald guy with earphones is giving an online lecture'

# ...and here is the image
img = pipe(prompt).images[0]
plt.figure(figsize=(6, 6))
plt.imshow(img)
plt.show()