# Pre-trained Deep Learning Models!

## What are pre-trained Models?

- The landscape of deep learning (DL) has evolved dramatically over the past decade.
- The field has **moved from traditional feed-forward neural networks**, where task-specific models were trained **from scratch**, to **transformer** architectures built on massive **pre-trained models** that adapt to diverse applications through fine-tuning and prompting.
- This **paradigm shift** has **improved performance across domains and fundamentally changed how AI systems are developed and deployed**.

<img src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/cecbccba-6358-476e-9fd8-e2807de9f220/Frame_118.png?t=1693044751" width=500>

Hugging Face was founded in 2016.

It hosts thousands of models (e.g., BERT, GPT-2) you can use **without training from scratch**!

[Go to Hugging Face](https://huggingface.co/) and explore [the pre-trained models available on the website](https://huggingface.co/models).



## Popularity Ranking of DL Architectures (as of 2025)

| Rank | Architecture     | Popularity               | Primary Use Cases                                        |
|------|------------------|--------------------------|----------------------------------------------------------|
| 1️⃣   | Transformers      | ⭐⭐⭐⭐⭐ *(most popular)*    | LLMs, NLP, vision, audio, multimodal                     |
| 2️⃣   | CNNs              | ⭐⭐⭐⭐                     | Image classification, object detection, vision tasks     |
| 3️⃣   | GANs              | ⭐⭐⭐                      | Image generation, style transfer, data augmentation      |
| 4️⃣   | RNNs / LSTMs      | ⭐⭐                       | Legacy NLP, time series prediction, audio modeling       |


# CNNs: Convolutional Neural Networks

## Image classification, detection, vision

<img src="https://hips.hearstapps.com/hmg-prod/images/pembroke-welsh-corgi-royalty-free-image-1726720011.jpg?crop=1.00xw:0.756xh;0,0.134xh&resize=1024:">

### [ResNet-50 v1.5](https://huggingface.co/microsoft/resnet-50)

"ResNet (Residual Network) is a convolutional neural network. ResNet model pre-trained on ImageNet-1k at resolution 224x224."

- 1,000 object categories (classes)
- 1.2 million training images
- 50,000 validation images
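
The classification cells in this section print a few labels with scores. As a rough sketch of where such scores come from, assuming the usual softmax-over-logits formulation, the snippet below ranks classes by probability. The four labels and their logits are hypothetical values made up for illustration (the real model scores 1,000 classes), and `softmax` and `top_k` are illustrative helpers, not part of any library.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, logits, k=3):
    """Return the k most probable (label, score) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical labels and logits, for illustration only (not real model output)
toy_labels = ["Pembroke", "Cardigan", "golden retriever", "tabby"]
toy_logits = [4.0, 2.5, 1.0, -1.0]

for label, score in top_k(toy_labels, toy_logits, k=3):
    print(f"Label: {label}, Score: {score:.4f}")
```

Because the probabilities are normalized over a fixed label set, an image of something outside that set still gets assigned to one of the known classes.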


In [None]:
import warnings
warnings.filterwarnings("ignore")

from transformers import pipeline
import torch

import textwrap # print output in multiple lines

In [None]:
# Check for CUDA (GPU)
device = 0 if torch.cuda.is_available() else -1  # 0 for GPU, -1 for CPU
print(f"Using device: {'GPU' if device == 0 else 'CPU'}")

# Load image classification pipeline with ResNet-50 (CNN)
classifier = pipeline(
    "image-classification",
    model="microsoft/resnet-50",
    device=device,
    use_fast=True  # Use the fast image processor to avoid the warning
)

# Classify the image
result = classifier("https://hips.hearstapps.com/hmg-prod/images/pembroke-welsh-corgi-royalty-free-image-1726720011.jpg")

for item in result:
    print(f"Label: {item['label']}, Score: {item['score']:.4f}")

<img src="https://consumer-cms.petfinder.com/sites/default/files/images/content/Golden%20Retriever%201.jpg">

In [None]:
# Create labels for the above dog (the Golden Retriever image)

result = classifier("https://consumer-cms.petfinder.com/sites/default/files/images/content/Golden%20Retriever%201.jpg")

for item in result:
    print(f"Label: {item['label']}, Score: {item['score']:.4f}")

<img src="https://people.com/thmb/TlNhUj4fJ8pnJNpEvUN-015Jcac=/750x0/filters:no_upscale():max_bytes(150000):strip_icc():focal(979x595:981x597):format(webp)/bts-members-1-03a9c478f1794c448bcb5f74bf94812c.jpg">

In [None]:
# Try the above image. What labels do you expect?
# https://people.com/thmb/TlNhUj4fJ8pnJNpEvUN-015Jcac=/750x0/filters:no_upscale():max_bytes(150000):strip_icc():focal(979x595:981x597):format(webp)/bts-members-1-03a9c478f1794c448bcb5f74bf94812c.jpg







Try a different method (Transformer) ...

### [CLIP model](https://huggingface.co/docs/transformers/en/model_doc/clip)

"CLIP is a multimodal vision and language model motivated by **overcoming the fixed number of object categories** when training a computer vision model. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs. Pretraining on this scale enables **zero-shot transfer** to downstream tasks." Developed by the OpenAI organization.

This is a **transformer**-based model.
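
To make zero-shot scoring a little more concrete, here is a minimal sketch of the usual recipe: normalize the image and text embeddings, take their cosine similarity, scale it, and apply a softmax over the candidate captions. The 3-dimensional embeddings are hypothetical toy values, the fixed `logit_scale` of 100 is an assumption standing in for the model's learned temperature, and `zero_shot_probs` is an illustrative helper rather than a library function.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    img = normalize(image_emb)
    logits = []
    for t in text_embs:
        txt = normalize(t)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy embeddings, for illustration only (real CLIP embeddings are much larger)
toy_image = [0.9, 0.1, 0.3]
toy_texts = {
    "a photo of a band": [0.8, 0.2, 0.4],
    "a photo of a dog": [0.1, 0.9, 0.2],
}

probs = zero_shot_probs(toy_image, list(toy_texts.values()))
for caption, p in zip(toy_texts, probs):
    print(f"{caption:<25} -> {p:.4f}")
```

Since the softmax runs only over the captions supplied, the scores say which candidate fits best, not whether any of them is actually correct.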

In [None]:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch, requests

# Load model & processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id).to(device)
processor = CLIPProcessor.from_pretrained(model_id)

# Image and candidate captions
image = Image.open(requests.get(
    "https://people.com/thmb/TlNhUj4fJ8pnJNpEvUN-015Jcac=/750x0/filters:no_upscale():max_bytes(150000):strip_icc():focal(979x595:981x597):format(webp)/bts-members-1-03a9c478f1794c448bcb5f74bf94812c.jpg",
    stream=True).raw)

texts = ["a photo of BTS",
         "a photo of a dog",
         "a photo of a band",
         "a group of men"]

# Predict
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)
probs = model(**inputs).logits_per_image.softmax(dim=1)[0]

# Display results
print("\n CLIP Similarity Scores:")
for text, p in zip(texts, probs):
    print(f"{text:<25} -> {p:.4f}")


<img src="https://s.yimg.com/ny/api/res/1.2/UrUx_Vbbk413oGzvWSklPA--/YXBwaWQ9aGlnaGxhbmRlcjt3PTI0MDA7aD0xNjAw/https://media.zenfs.com/en/parade_250/0b28a903a2ed548d063f996165786cd4">

In [None]:
# Try the above image. Who is this person?








# Transformers: LLMs / Attention-Based Models

## Sentiment analysis


[distilbert/distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english)

In [None]:
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

print(sentiment("This product is awful and I want a refund."))
print(sentiment("Average experience, nothing special."))
print(sentiment("Absolutely love the new update!"))

In [None]:
# Try a different sentence for sentiment analysis




## Emotion Detection
Use case: Go beyond "positive/negative" — detect emotions like joy, anger, sadness

<img src="https://webflow-amber-prod.gumlet.io/620e4101b2ce12a1a6bff0e8/66ab6846124b51c486c24b3e_640f1bb03074900cbf0f28f3_What-are-the-Ivy-League-schools.webp">

In [None]:
emotion = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", top_k=None)
result = emotion("I can't believe I got in! I'm so happy and feel very grateful.")

for row in result:
    for item in row:
        print(f"{item['label']:<10} -> {item['score']:.4f}")
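
The loop above prints every emotion score. If only the strongest emotion is needed, a small helper can pick it out. This is a sketch that assumes the nested shape the loop above implies (a list with one inner list of label/score dictionaries per input text). The scores in `toy_result` are hypothetical values for illustration, not real model output, and `top_emotion` is an illustrative helper.

```python
def top_emotion(result):
    """Return (label, score) of the highest-scoring emotion for the first input."""
    scores = result[0]  # list of {"label": ..., "score": ...} dicts for one text
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]

# Hypothetical scores, for illustration only (not real model output)
toy_result = [[
    {"label": "joy", "score": 0.90},
    {"label": "surprise", "score": 0.07},
    {"label": "sadness", "score": 0.03},
]]

label, score = top_emotion(toy_result)
print(f"Top emotion: {label} ({score:.4f})")
```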

In [None]:
# Try another expression for Emotion Detection






## Text Generation / Chatbots

Use case: Writing, storytelling, character dialogue
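
As a rough picture of what a text generator does, the sketch below extends a prompt one token at a time, sampling each next word from a probability table conditioned on the previous word. The table `toy_model` and the `generate` helper are hypothetical stand-ins made up for illustration. A real model such as GPT-2 conditions on the whole preceding context and a far larger vocabulary.

```python
import random

# Hypothetical next-word probabilities standing in for a language model
toy_model = {
    "once": {"upon": 1.0},
    "upon": {"a": 1.0},
    "a": {"time": 0.7, "dragon": 0.3},
    "time": {"in": 1.0},
    "in": {"Bucharest": 1.0},
}

def generate(prompt, max_new_tokens=5, seed=0):
    """Append up to max_new_tokens words, sampling each from toy_model."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same text
    tokens = prompt.lower().split()
    for _ in range(max_new_tokens):
        choices = toy_model.get(tokens[-1])
        if not choices:  # no known continuation: stop early
            break
        words = list(choices)
        weights = [choices[w] for w in words]
        tokens.append(rng.choices(words, weights=weights, k=1)[0])
    return " ".join(tokens)

print(generate("Once upon"))
```

The `max_new_tokens` argument here plays a role loosely similar to a length limit in a real generation call: it caps how far the loop runs.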

In [None]:
from transformers import pipeline
import textwrap  # wrap long output across multiple lines
import torch

# Use GPU if available
device = 0 if torch.cuda.is_available() else -1

# Create a text-generation pipeline with GPT-2 (model set explicitly)
generator = pipeline("text-generation", model="gpt2", device=device)

# Generate text
output = generator(
    "Once upon a time in Bucharest ...",
    max_length=500,
    truncation=True,
    pad_token_id=generator.tokenizer.eos_token_id
)

wrapped = textwrap.fill(output[0]["generated_text"], width=80)
print("\n Generated Text:\n" + "-"*80)
print(wrapped)
print("-"*80)


In [None]:
# Try to generate different texts







## Image Captioning
Use case: Describe an image in natural language using [the BLIP model](https://huggingface.co/docs/transformers/en/model_doc/blip)

BLIP is a model that is able to perform various multi-modal tasks including:

- Visual Question Answering
- Image-Text retrieval (Image-text matching)
- Image Captioning

<img src="https://people.com/thmb/TlNhUj4fJ8pnJNpEvUN-015Jcac=/750x0/filters:no_upscale():max_bytes(150000):strip_icc():focal(979x595:981x597):format(webp)/bts-members-1-03a9c478f1794c448bcb5f74bf94812c.jpg">

In [None]:
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests

image = Image.open(requests.get("https://people.com/thmb/TlNhUj4fJ8pnJNpEvUN-015Jcac=/750x0/filters:no_upscale():max_bytes(150000):strip_icc():focal(979x595:981x597):format(webp)/bts-members-1-03a9c478f1794c448bcb5f74bf94812c.jpg", stream=True).raw)

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))


btt? not BTS?

Oops! That looks like a hallucination from the model.

Advanced models (e.g., [BLIP 2](https://huggingface.co/docs/transformers/en/model_doc/blip-2)) are more accurate.

## Audio Transcription (Speech-to-Text)
Use case: Convert speech into text using Whisper.

[openai/whisper-small](https://huggingface.co/openai/whisper-small) is a pre-trained model for automatic speech recognition (ASR) and speech translation.

In [None]:
from IPython.display import Audio, display

# Direct link to the audio file
audio_url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"

# Embed audio player
display(Audio(audio_url))

In [None]:
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    generate_kwargs={"task": "transcribe", "language": "en"}  # transcribe English speech as-is
)

output = asr("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")

wrapped_text = textwrap.fill(output["text"], width=80)

print("Transcription:\n")
print(wrapped_text)

## Translation

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ro")
text = "I Love You All!"

translation = translator(text)[0]['translation_text']
print(translation)

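# GANs: Generative Adversarial Networks

Note: this section is compute-heavy and can take a very long time, especially on CPU.

Use case: image generation. Well-known image generators such as DALL·E (OpenAI) and Stable Diffusion, which is used below, are not GANs; Stable Diffusion is a latent diffusion model.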

In [None]:
# install required packages for deep learning
#!pip install -q huggingface_hub[hf_xet]

[Stable Diffusion 2.1 demo](https://huggingface.co/spaces/stabilityai/stable-diffusion) on Hugging Face. The code below loads the earlier v1.4 checkpoint (CompVis/stable-diffusion-v1-4).

In [None]:
from diffusers import StableDiffusionPipeline
import torch
from PIL import Image
from IPython.display import display

# Check if CUDA is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the model (use full ID: CompVis/stable-diffusion-v1-4)
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True  # helps on CPU!
).to(device)

# Prompt
prompt = "A futuristic smart home device in minimal style product photography"

# Generate
print("Generating image... please wait ⏳")
image = pipe(prompt).images[0]

# Display
display(image)


Using device: cpu



In [None]:
# Try another prompt (e.g., A futuristic intelligent vehicle)






