# ⚡ **Transformers: Revolutionizing Deep Learning**  

Transformers have redefined the landscape of deep learning, establishing themselves as the cornerstone of cutting-edge advancements in **Natural Language Processing (NLP) 📖**, **Computer Vision 🖼️**, and beyond. 🌍✨  

Introduced in the seminal paper *["Attention Is All You Need"](https://arxiv.org/abs/1706.03762)*, the **transformer architecture** leverages **self-attention mechanisms 🤖🧠** to efficiently model complex dependencies in data, enabling unprecedented breakthroughs in artificial intelligence. 🚀  

---

## 🎯 **About This Notebook**  

This course/notebook is designed to provide a **comprehensive understanding** of how to get started with the Transformers library 🏆🚀  

### Note: this notebook uses PyTorch only.


# **> Using the `pipeline()` Function for Inference in the `transformers` Library**

The `pipeline()` function from Hugging Face’s `transformers` library provides an easy-to-use interface for leveraging pretrained models across various natural language processing (NLP), computer vision, and speech-related tasks. It abstracts away the complexities of model selection, tokenization, feature extraction, and inference, allowing users to perform state-of-the-art AI tasks with minimal setup.

---

## **1. Purpose of `pipeline()`**
The `pipeline()` function serves as a high-level API for running inference on various machine learning tasks using pretrained models. It simplifies the process of loading models from the Hugging Face Model Hub and executing them efficiently. Instead of manually handling the model, tokenizer, and preprocessing steps, `pipeline()` takes care of everything in the background.

The function is particularly useful for:
- Rapid prototyping and experimentation.
- Users who want to leverage deep learning models without diving into complex implementation details.
- Applications where inference speed and ease of use are prioritized over customization.

---

## **2. Key Components and Workflow**
When calling `pipeline()`, the function automatically:
1. **Loads the model**: If no specific model is provided, it selects a default pretrained model suitable for the task.
2. **Handles preprocessing**: Applies tokenization for text models or feature extraction for images and audio.
3. **Performs inference**: Passes the processed input through the model for prediction.
4. **Post-processes outputs**: Converts model outputs into a human-readable format.

The function supports various deep learning frameworks, including PyTorch and TensorFlow, and can be executed on CPUs or GPUs for faster performance.
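
The four stages above can be pictured with a toy stand-in. This is only a sketch in plain Python: the word-score table, the whitespace "tokenizer", and every function name below are hypothetical and are not how `transformers` implements `pipeline()` internally.

```python
# Toy stand-in for the four stages; every name here is hypothetical and
# none of this is the real transformers implementation.

# 1. "Load the model": a tiny hand-written word-score table plays the model.
TOY_MODEL = {"happy": 2.0, "great": 1.5, "sad": -2.0, "bad": -1.5}


def preprocess(text):
    # 2. Preprocessing: lowercase, split on whitespace, strip punctuation.
    return [word.strip(".,!?") for word in text.lower().split()]


def forward(tokens):
    # 3. Inference: sum the scores of known words to get one raw score.
    return sum(TOY_MODEL.get(token, 0.0) for token in tokens)


def postprocess(raw_score):
    # 4. Post-processing: turn the raw score into a readable label.
    label = "POSITIVE" if raw_score >= 0 else "NEGATIVE"
    return {"label": label, "raw_score": raw_score}


def toy_pipeline(text):
    return postprocess(forward(preprocess(text)))


print(toy_pipeline("I am happy to see you!"))
print(toy_pipeline("This is a bad, sad day."))
```

The real function follows the same shape, with a pretrained tokenizer and model in place of the toy pieces.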

---

## **3. Supported Tasks**
The `pipeline()` function supports a broad range of tasks across different modalities:

### **a) Natural Language Processing (NLP) Tasks**
- **Text Classification** – Categorizes text into predefined labels (e.g., sentiment analysis, spam detection).
- **Named Entity Recognition (NER)** – Identifies entities such as names, locations, and organizations in text.
- **Summarization** – Generates a concise summary from a longer text passage.
- **Question Answering** – Answers questions based on a given context (e.g., extractive QA).
- **Text Generation** – Produces new text based on a given prompt using models like GPT.
- **Translation** – Translates text from one language to another.
- **Text-to-Text Generation** – Converts one form of text into another (e.g., paraphrasing).

### **b) Computer Vision Tasks**
- **Image Classification** – Assigns labels to images based on their content.
- **Object Detection** – Detects and localizes objects within an image.
- **Image Segmentation** – Identifies pixels belonging to specific objects.
- **Image-to-Text** – Generates captions for images.

### **c) Speech and Audio Tasks**
- **Speech Recognition (ASR)** – Converts spoken language into text.
- **Text-to-Speech (TTS)** – Synthesizes speech from text.
- **Audio Classification** – Categorizes audio clips into predefined classes.

### **d) Multimodal Tasks**
- **Image Captioning** – Generates descriptive text for images.
- **Document Question Answering** – Answers questions based on scanned documents.

---

## **4. Advantages of Using `pipeline()`**
- **Ease of Use**: Requires minimal setup and eliminates the need for manual preprocessing and model loading.
- **Sensible defaults**: Picks a default checkpoint and the matching pre- and post-processing for the requested task.
- **Framework Agnostic**: Works with both PyTorch and TensorFlow models.
- **Supports Multiple Modalities**: Can be used for text, images, and audio-based tasks.
- **Scalability**: Can run on CPUs or GPUs for better performance.

# **> Let's install some important libraries**

In [1]:
!pip install transformers datasets evaluate accelerate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.0/84.0 kB 2.7 MB/s eta 0:00:00
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [2]:
pip install torch 

Note: you may need to restart the kernel to use updated packages.


# **> Understanding the pipeline for Inference** 
## **Using the `pipeline()` for Inference**
The `pipeline()` function is the easiest way to use pretrained models for various tasks, such as:
- **Text classification**
- **Summarization**
- **Object detection**
- **Speech recognition**
- **Image captioning**
- **And more!**

# **> Now we can easily get started with performing some basic sentiment analysis using the pipeline function**

In [3]:
%%capture
from transformers import pipeline

# Load a sentiment analysis pipeline.
# No model is specified, so pipeline() falls back to a default pretrained model for this task.
classifier = pipeline("sentiment-analysis")


In [4]:
# Run inference
result = classifier("I am happy to see you guys here")
print(result)

[{'label': 'POSITIVE', 'score': 0.999849796295166}]


**Adding a sad clause makes the input less positive, and the positive score drops accordingly:**

In [5]:
result = classifier("I am happy to see you guys here, but also sad because of a few personal event in my life")
print(result)

[{'label': 'POSITIVE', 'score': 0.8790883421897888}]


**Now let's try an input with a harsh, negative tone**

In [6]:
result = classifier("I feel like i should hurt more people")
print(result)

[{'label': 'NEGATIVE', 'score': 0.9995318651199341}]


**As we can see, the harsh input is classified as strongly negative. Another example:**

In [7]:
result = classifier("I like to abuse people in my spare time")
print(result)

[{'label': 'NEGATIVE', 'score': 0.9980902075767517}]


## **> We can also pass multiple inputs at once to get a sentiment score for each, better known as "Batch Inference"**

In [8]:
results = classifier([
    "We're thrilled to introduce this new feature! 🎉", 
    "I sincerely hope you find it useful", 
    "We are genuinely proud of this update", 
    "Not sure if you’ll like it, but here it is", 
    "You might want to try it before deciding", 
    "This is probably our best release yet!", 
    "We hope you don’t mind the changes", 
    "It’s a big step forward, and we're excited!", 
    "We understand it might not be for everyone", 
    "Feel free to ignore it if it doesn’t interest you"
])

for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")


label: POSITIVE, with score: 0.9998
label: POSITIVE, with score: 0.9992
label: POSITIVE, with score: 0.9999
label: POSITIVE, with score: 0.6981
label: NEGATIVE, with score: 0.9935
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.9664
label: POSITIVE, with score: 0.9999
label: NEGATIVE, with score: 0.9995
label: NEGATIVE, with score: 0.9996
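
Once a batch comes back as a list of dictionaries, it is easy to summarize. The sketch below copies the labels and rounded scores from the output above into a plain list, counts the labels, and flags low-confidence predictions. The `CONFIDENCE_THRESHOLD` value is a hypothetical cut-off chosen for illustration only.

```python
from collections import Counter

# Labels and rounded scores copied from the batch output above.
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "POSITIVE", "score": 0.9992},
    {"label": "POSITIVE", "score": 0.9999},
    {"label": "POSITIVE", "score": 0.6981},
    {"label": "NEGATIVE", "score": 0.9935},
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.9664},
    {"label": "POSITIVE", "score": 0.9999},
    {"label": "NEGATIVE", "score": 0.9995},
    {"label": "NEGATIVE", "score": 0.9996},
]

CONFIDENCE_THRESHOLD = 0.9  # hypothetical cut-off, not a library default

# How many inputs received each label.
label_counts = Counter(result["label"] for result in results)

# Indices of inputs whose top score falls below the threshold.
uncertain = [i for i, result in enumerate(results) if result["score"] < CONFIDENCE_THRESHOLD]

print(label_counts)
print("Low-confidence inputs:", uncertain)
```

Here only the fourth sentence ("Not sure if you'll like it, but here it is") falls below the cut-off, which matches its hedged wording.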


# **> So far we have not specified a model or tokenizer, so `pipeline()` used its default model for the task. Now let's pass a specific model and tokenizer.**

In [9]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load a custom model and tokenizer
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create a pipeline with the custom model
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Function to map the star ratings to sentiment labels
def map_to_sentiment(star_rating):
    if star_rating in [1, 2]:
        return "Negative"
    elif star_rating == 3:
        return "Neutral"
    else:
        return "Positive"

# Classify a sentence (the model is multilingual; this example is in English)
result = classifier("I am glad that i made it to the first place in this competition")

# Extract the star rating and convert it to a sentiment label
star_rating = int(result[0]['label'].split()[0])  # Extract number of stars
sentiment_label = map_to_sentiment(star_rating)

# Output the mapped sentiment label
print({"sentiment": sentiment_label, "score": result[0]['score']})


{'sentiment': 'Positive', 'score': 0.5760469436645508}
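
The `split()[0]` trick above relies on the label starting with a digit. A slightly more defensive variant is sketched below. It assumes labels of the form `"1 star"` to `"5 stars"` (the same assumption the cell above makes), and the helper name `parse_star_label` is hypothetical.

```python
import re


def parse_star_label(label):
    """Return the star count from a label such as '4 stars' or '1 star'."""
    match = re.fullmatch(r"([1-5]) stars?", label.strip())
    if match is None:
        # Fail loudly instead of silently mapping an unknown label.
        raise ValueError(f"Unexpected label format: {label!r}")
    return int(match.group(1))


def map_to_sentiment(star_rating):
    """Same mapping as in the cell above: 1-2 negative, 3 neutral, 4-5 positive."""
    if star_rating in (1, 2):
        return "Negative"
    if star_rating == 3:
        return "Neutral"
    return "Positive"


for label in ["1 star", "3 stars", "5 stars"]:
    print(label, "->", map_to_sentiment(parse_star_label(label)))
```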


# **> Now Let's have a look at AutoClasses**
**AutoClasses (AutoModel, AutoTokenizer) provide an easy way to load a model from the Hugging Face Model Hub without specifying its architecture manually.**

**Let's look at some of the classes that come under AutoClasses**

## **1- AutoTokenizer**
**AutoTokenizer is used to load the appropriate tokenizer for a given model.**

In [10]:
from transformers import AutoTokenizer

# Load the tokenizer for a model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a sample sentence
input_text = "Hello, how are you?"
encoded_input = tokenizer(input_text, return_tensors="pt")
print(encoded_input)

{'input_ids': tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
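
In the output above, `101` and `102` are BERT's `[CLS]` and `[SEP]` special tokens, and `attention_mask` marks which positions hold real tokens. The mask only becomes interesting when sequences of different lengths are batched together. The sketch below shows the idea in plain Python: `pad_batch` is a hypothetical helper, right-padding is assumed, and the pad id `0` matches the `pad_token_id` in the BERT config printed later in this notebook.

```python
PAD_TOKEN_ID = 0  # matches "pad_token_id": 0 in the BERT config shown later


def pad_batch(sequences, pad_id=PAD_TOKEN_ID):
    """Right-pad id sequences to equal length and build the attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 = real token, 0 = padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


hello_ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]  # ids printed above
short_ids = [101, 7592, 102]  # shorter sequence built from ids that appear above

batch = pad_batch([hello_ids, short_ids])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

With a real tokenizer the same effect comes from passing `padding=True` when encoding a list of texts.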


## **2- AutoModel**
**AutoModel loads the base model for a checkpoint, without any task-specific head. Its output is hidden states rather than task predictions.**

In [11]:
from transformers import AutoModel

# Load the base model for BERT
model = AutoModel.from_pretrained("bert-base-uncased")

# Sample input
input_text = "Hello, how are you?"
encoded_input = tokenizer(input_text, return_tensors="pt")

# Forward pass through the model
output = model(**encoded_input)
print(output)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.0824,  0.0667, -0.2880,  ..., -0.3566,  0.1960,  0.5381],
         [ 0.0310, -0.1448,  0.0952,  ..., -0.1560,  1.0151,  0.0947],
         [-0.8935,  0.3240,  0.4184,  ..., -0.5498,  0.2853,  0.1149],
         ...,
         [-0.2812, -0.8531,  0.6912,  ..., -0.5051,  0.4716, -0.6854],
         [-0.4429, -0.7820, -0.8055,  ...,  0.1949,  0.1081,  0.0130],
         [ 0.5570, -0.1080, -0.2412,  ...,  0.2817, -0.3996, -0.1882]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.9397, -0.4081, -0.9024,  0.8667,  0.6076, -0.1782,  0.9319,  0.2685,
         -0.7918, -1.0000, -0.4899,  0.9625,  0.9823,  0.6102,  0.9614, -0.8728,
         -0.6449, -0.6543,  0.3102, -0.6648,  0.7556,  1.0000,  0.0778,  0.3350,
          0.5094,  0.9948, -0.8847,  0.9590,  0.9761,  0.7384, -0.7787,  0.1232,
         -0.9912, -0.2119, -0.9225, -0.9931,  0.3767, -0.8050, -0.0945,  0.0497,
         -0.9269,  0.2934,  1.00

## **3- AutoModelForSequenceClassification**
**AutoModelForSequenceClassification loads a pre-trained model for sequence classification tasks (like sentiment analysis).**

In [12]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the pre-trained model for sentiment analysis
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Input text
text = "I am so happy with the results!"

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Perform forward pass (get logits)
outputs = model(**inputs)

# Get the logits (raw model predictions)
logits = outputs.logits
print("Logits:", logits)

# Convert logits to probabilities using softmax
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print("Probabilities:", probabilities)

# Get the predicted class (0 = negative, 1 = positive)
predicted_class = torch.argmax(probabilities, dim=-1).item()
print("Predicted class:", predicted_class)

# Map the predicted class to a label
labels = ["negative", "positive"]
predicted_label = labels[predicted_class]
print("Predicted sentiment:", predicted_label)

Logits: tensor([[-4.3497,  4.6922]], grad_fn=<AddmmBackward0>)
Probabilities: tensor([[1.1833e-04, 9.9988e-01]], grad_fn=<SoftmaxBackward0>)
Predicted class: 1
Predicted sentiment: positive
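
To see what the softmax step does, the probabilities can be reproduced by hand. This is a minimal sketch in plain Python (no PyTorch) that uses the two logits printed above. The `softmax` helper is written here for illustration and is not the `torch` implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]


logits = [-4.3497, 4.6922]  # values printed above
probabilities = softmax(logits)

# Index of the largest probability (0 = negative, 1 = positive).
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)

print(probabilities)
print(predicted_class)
```

The result agrees with the `Probabilities` tensor above to the printed precision.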


## **4- AutoModelForTokenClassification**
**AutoModelForTokenClassification is used for tasks like named entity recognition (NER).**

In [13]:
%%capture
from transformers import AutoModelForTokenClassification

# Load the BERT model for token classification
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")


In [14]:
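# Forward pass
# Note: `encoded_input` was created earlier with the `bert-base-uncased` tokenizer,
# whose vocabulary differs from this cased checkpoint's. For meaningful predictions,
# encode the text with this checkpoint's own tokenizer (AutoTokenizer.from_pretrained).
output = model(**encoded_input)
print(output)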

TokenClassifierOutput(loss=None, logits=tensor([[[ 8.7187, -2.3364, -0.7877, -2.5555, -2.1112, -1.6363,  0.1820,
          -2.2605,  0.2477],
         [ 0.8799, -2.0976,  1.1481, -3.5797, -0.5549, -1.5543,  5.8577,
          -2.7680,  0.5249],
         [ 2.9795, -2.1272,  1.3494, -3.8953, -0.8189, -1.0131,  2.8276,
          -2.5852,  0.6740],
         [ 4.6340, -2.5157,  0.5012, -3.5238, -0.6426, -0.7075,  3.0012,
          -2.4569, -0.1732],
         [ 3.9892, -2.3489,  1.2423, -3.7496, -1.4238, -1.3199,  5.3302,
          -2.4133, -0.1764],
         [10.7349, -2.2816, -0.6372, -3.2772, -1.1850, -2.4364,  0.5903,
          -2.3134, -0.9988],
         [ 9.7019, -1.9995, -0.5227, -2.7245, -1.0998, -1.9262,  0.7947,
          -2.2120, -0.3752],
         [ 8.7187, -2.3364, -0.7877, -2.5555, -2.1112, -1.6363,  0.1820,
          -2.2605,  0.2477]]], grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)
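
The logits above have one row per token and one column per label, so the predicted label id for each token is the column with the largest value. A minimal sketch in plain Python, using the numbers copied from the output above, follows. Turning ids into label names would go through `model.config.id2label`, which is not reproduced here.

```python
# One row per token, one column per label id (values copied from the output above).
token_logits = [
    [8.7187, -2.3364, -0.7877, -2.5555, -2.1112, -1.6363, 0.1820, -2.2605, 0.2477],
    [0.8799, -2.0976, 1.1481, -3.5797, -0.5549, -1.5543, 5.8577, -2.7680, 0.5249],
    [2.9795, -2.1272, 1.3494, -3.8953, -0.8189, -1.0131, 2.8276, -2.5852, 0.6740],
    [4.6340, -2.5157, 0.5012, -3.5238, -0.6426, -0.7075, 3.0012, -2.4569, -0.1732],
    [3.9892, -2.3489, 1.2423, -3.7496, -1.4238, -1.3199, 5.3302, -2.4133, -0.1764],
    [10.7349, -2.2816, -0.6372, -3.2772, -1.1850, -2.4364, 0.5903, -2.3134, -0.9988],
    [9.7019, -1.9995, -0.5227, -2.7245, -1.0998, -1.9262, 0.7947, -2.2120, -0.3752],
    [8.7187, -2.3364, -0.7877, -2.5555, -2.1112, -1.6363, 0.1820, -2.2605, 0.2477],
]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


predicted_label_ids = [argmax(row) for row in token_logits]
print(predicted_label_ids)
```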


## **5- AutoModelForQuestionAnswering**
**AutoModelForQuestionAnswering is used for question-answering tasks.**

In [15]:
%%capture
from transformers import AutoModelForQuestionAnswering

# Load the BERT model for question answering
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

# Example question-answering input
context = "Transformers are a type of machine learning model used in NLP."
question = "What are transformers used for?"

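# Note: `tokenizer` still refers to the DistilBERT tokenizer loaded in the
# sequence-classification example. It shares BERT's uncased vocabulary, so the
# ids are valid here, but loading the tokenizer that belongs to this checkpoint
# with AutoTokenizer.from_pretrained(...) is the safer pattern.
encoded_input = tokenizer(question, context, return_tensors="pt")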

In [16]:
# Forward pass through the model
output = model(**encoded_input)
print(output)

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-5.4208, -5.6408, -7.7696, -7.2089, -6.7390, -7.0294, -9.2507, -5.4206,
         -1.3214, -4.5176,  0.5997, -0.6010, -4.0932,  4.8012, -0.2067, -2.4149,
         -1.7752, -0.1793,  2.9989, -3.0361, -5.4210, -5.4206]],
       grad_fn=<CloneBackward0>), end_logits=tensor([[-0.4754, -5.2360, -6.4876, -6.8828, -7.6285, -6.0191, -6.3213, -0.4754,
         -3.8593, -6.4513, -5.6309, -5.3834, -5.1025, -1.9531,  4.1193,  3.0422,
         -2.5693, -4.1460, -2.9805,  4.2221, -0.4753, -0.4754]],
       grad_fn=<CloneBackward0>), hidden_states=None, attentions=None)


## **6- AutoModelForCausalLM**
**AutoModelForCausalLM is for causal language modeling tasks, such as autoregressive generation (e.g., GPT-2).**

In [17]:
%%capture
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load GPT-2 for causal language modeling, together with its own tokenizer.
# A separate variable name keeps the BERT-style `tokenizer` available for later cells.
model = AutoModelForCausalLM.from_pretrained("gpt2")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Example text to generate a continuation for
input_text = "Once upon a time"
encoded_input = gpt2_tokenizer(input_text, return_tensors="pt")

In [18]:
# Generate a continuation and decode it with the same tokenizer
output = model.generate(
    **encoded_input,
    max_length=50,
    pad_token_id=gpt2_tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
print(gpt2_tokenizer.decode(output[0], skip_special_tokens=True))

**The tokenizer must match the model: GPT-2 token ids decoded with a BERT-style tokenizer come out as meaningless `[unused…]` tokens.**


## **7- AutoModelForMaskedLM**
**AutoModelForMaskedLM is for masked language modeling tasks like BERT’s fill-mask.**

In [19]:
%%capture
from transformers import AutoModelForMaskedLM

# Load BERT for masked language modeling
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Example masked input
input_text = "I am a [MASK] model."
encoded_input = tokenizer(input_text, return_tensors="pt")

In [20]:
# Forward pass
output = model(**encoded_input)
print(output)

MaskedLMOutput(loss=None, logits=tensor([[[ -6.6148,  -6.5519,  -6.5816,  ...,  -5.9933,  -5.7652,  -4.0339],
         [ -8.9368,  -8.6326,  -8.7497,  ...,  -8.2383,  -7.9668,  -5.8301],
         [-12.4872, -12.2117, -12.3804,  ..., -11.0130, -10.7869,  -8.4687],
         ...,
         [ -9.6846,  -9.6352,  -9.7309,  ...,  -9.8764,  -8.5677,  -8.0470],
         [-11.8507, -11.4918, -12.0264,  ...,  -9.5268, -10.0466,  -6.8397],
         [-12.3598, -12.6287, -12.3523,  ..., -11.3152, -10.1098, -10.1205]]],
       grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)
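
The logits above hold one score per vocabulary entry for every token position, so filling the mask means locating the `[MASK]` position and ranking that row. The sketch below shows the logic on made-up data: the five-word vocabulary, the logits, and the non-special token ids are all hypothetical. Only `101`, `102`, and `103` are the real `[CLS]`, `[SEP]`, and `[MASK]` ids of `bert-base-uncased`.

```python
MASK_TOKEN_ID = 103  # [MASK] in bert-base-uncased

# Placeholder ids for an 8-token input; only 101, 102 and 103 are real special ids.
input_ids = [101, 11, 12, 13, 103, 14, 15, 102]

# Hypothetical miniature vocabulary and logits for the masked position only.
toy_vocab = ["language", "good", "new", "role", "fashion"]
toy_mask_logits = [7.1, 4.3, 5.0, 6.2, 5.9]


def top_k(logits, vocab, k=3):
    """Return the k highest-scoring (word, logit) pairs, best first."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(vocab[i], logits[i]) for i in ranked[:k]]


# Step 1: find which position in the sequence is masked.
mask_position = input_ids.index(MASK_TOKEN_ID)

# Step 2: rank the vocabulary by the logits at that position.
candidates = top_k(toy_mask_logits, toy_vocab)

print(mask_position)
print(candidates)
```

With the real model, the row to rank would be `output.logits[0, mask_position]`, and ids would be mapped back to words with the tokenizer.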


## **8- AutoModelForMultipleChoice**
**AutoModelForMultipleChoice is used for multiple-choice tasks such as SWAG, where the model scores each candidate answer and the highest-scoring one is selected.**

In [21]:
%%capture
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

# ✅ Use a model fine-tuned for multiple-choice tasks
model_name = "LIAMF-USP/roberta-large-finetuned-race"  # Checkpoint name indicates fine-tuning on RACE
model = AutoModelForMultipleChoice.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# ✅ Define the question and choices
question = "What is the capital of France?"
choices = ["Berlin", "Madrid", "Paris", "Rome", "London", "Lisbon", "Amsterdam", "Brussels", "Stockholm", "Vienna", "Oslo"]

# ✅ Tokenize input properly (pair each choice with the question)
encoding = tokenizer(
    [[question, choice] for choice in choices],  # Format: (question, choice)
    padding=True,
    truncation=True,
    return_tensors="pt"
)

# ✅ Ensure input tensor dimensions match (batch_size=1, num_choices, sequence_length)
input_ids = encoding["input_ids"].unsqueeze(0)  # Add batch dimension
attention_mask = encoding["attention_mask"].unsqueeze(0)

# ✅ Forward pass through model
outputs = model(input_ids=input_ids, attention_mask=attention_mask)

# ✅ Extract logits and find the highest-scoring choice
logits = outputs.logits
predicted_choice = torch.argmax(logits, dim=-1).item()
predicted_answer = choices[predicted_choice]

In [22]:
print(f"✅ Predicted answer: {predicted_answer}")

✅ Predicted answer: Paris


## **9- AutoConfig**
**AutoConfig is used to load the configuration of a model. This is often used for fine-tuning or customizing models.**

In [23]:
from transformers import AutoConfig

# Load the model configuration for BERT
config = AutoConfig.from_pretrained("bert-base-uncased")
print(config)

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.47.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
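
A few useful quantities can be read straight off this configuration. The sketch below copies the relevant values into a plain dictionary and derives the per-head dimension and the size of the three embedding tables (word, position, and token type). It counts the embedding tables only, not LayerNorm or the encoder layers.

```python
# Values copied from the BertConfig printed above.
config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "vocab_size": 30522,
}

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Weights in the word, position and token-type embedding tables.
word_embedding_params = config["vocab_size"] * config["hidden_size"]
position_embedding_params = config["max_position_embeddings"] * config["hidden_size"]
type_embedding_params = config["type_vocab_size"] * config["hidden_size"]
embedding_table_params = (
    word_embedding_params + position_embedding_params + type_embedding_params
)

print("Dimension per attention head:", head_dim)
print("Embedding table parameters:", embedding_table_params)
```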



## **10- AutoFeatureExtractor**
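**AutoFeatureExtractor loads the preprocessing class that turns raw inputs (typically images) into model-ready tensors. For image models, newer versions of `transformers` recommend `AutoImageProcessor` instead.**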

In [24]:
%%capture
from transformers import AutoFeatureExtractor
from PIL import Image

# Load a feature extractor (e.g., for image models)
extractor = AutoFeatureExtractor.from_pretrained("facebook/dino-vitb16")

# Process an image (requires a PIL image)
image = Image.open("/kaggle/input/car-image/car.png")

# Ensure the image is in RGB format
image = image.convert("RGB")

# Extract features
features = extractor(images=image, return_tensors="pt")

In [25]:
# Print the extracted features
print(features)

{'pixel_values': tensor([[[[-0.8507, -0.8849, -0.8849,  ..., -1.1589, -1.1589, -1.1589],
          [-0.8507, -0.8678, -0.8678,  ..., -1.1589, -1.1589, -1.1589],
          [-0.8164, -0.8335, -0.8335,  ..., -1.1589, -1.1589, -1.1589],
          ...,
          [ 0.2624,  0.2624,  0.2967,  ...,  0.1426,  0.1426,  0.1597],
          [ 0.4337,  0.4679,  0.4851,  ..., -0.0287,  0.0056,  0.0227],
          [ 0.6221,  0.6392,  0.6563,  ..., -0.0458, -0.0458, -0.0287]],

         [[-0.8803, -0.8803, -0.8803,  ..., -0.2675, -0.2675, -0.2675],
          [-0.8978, -0.8978, -0.8803,  ..., -0.2675, -0.2675, -0.2675],
          [-0.8978, -0.8803, -0.8803,  ..., -0.2675, -0.2675, -0.2675],
          ...,
          [ 0.1877,  0.1877,  0.2052,  ...,  0.0826,  0.0826,  0.1001],
          [ 0.3452,  0.3627,  0.3803,  ..., -0.1275, -0.1099, -0.0924],
          [ 0.4853,  0.4853,  0.5028,  ..., -0.1450, -0.1625, -0.1625]],

         [[-0.2707, -0.2532, -0.2358,  ...,  0.6705,  0.6705,  0.6705],
          [-0
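
The `pixel_values` above are the image's pixels after rescaling to the range 0 to 1 and normalizing each channel. The sketch below reproduces that arithmetic for single pixels in plain Python. It assumes the standard ImageNet mean and standard deviation, which is consistent with the numbers printed above. The raw values `56` and `101` are hypothetical pixel intensities chosen because they reproduce values seen in the output.

```python
# Standard ImageNet statistics per channel (R, G, B); assumed here, not read from the checkpoint.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(value_0_255, channel):
    """Rescale a 0-255 intensity to 0-1, then standardize with the channel statistics."""
    scaled = value_0_255 / 255.0
    return (scaled - IMAGENET_MEAN[channel]) / IMAGENET_STD[channel]


red_value = round(normalize_pixel(56, 0), 4)     # hypothetical red intensity
green_value = round(normalize_pixel(101, 1), 4)  # hypothetical green intensity

print(red_value)
print(green_value)
```

Both results match values that appear in the first and second channels of the printed tensor.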

## **11- AutoProcessor**
**AutoProcessor is used for multimodal tasks (such as vision-language models).**

In [26]:
from transformers import AutoProcessor
from PIL import Image

# Use a publicly available model that doesn't require a token
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load an image to process
image = Image.open("/kaggle/input/car-image/car.png")  # Replace with your image file

# Process the image
inputs = processor(images=image, return_tensors="pt")

# Display the processed inputs
print(inputs)



{'pixel_values': tensor([[[[-0.4638, -0.4492, -0.4200,  ..., -0.8872, -0.9018, -0.9018],
          [-0.4638, -0.4492, -0.4492,  ..., -0.8726, -0.8726, -0.8726],
          [-0.4346, -0.4492, -0.4638,  ..., -0.8580, -0.8580, -0.8726],
          ...,
          [ 0.1493,  0.1639,  0.1785,  ...,  0.1785,  0.1785,  0.1931],
          [ 0.2515,  0.2515,  0.2223,  ...,  0.1931,  0.1931,  0.1931],
          [ 0.4413,  0.4413,  0.4559,  ...,  0.1493,  0.1493,  0.1493]],

         [[-0.3864, -0.4014, -0.4164,  ..., -0.2213, -0.2063, -0.2063],
          [-0.3864, -0.3864, -0.4014,  ..., -0.2063, -0.2063, -0.2063],
          [-0.3714, -0.3714, -0.3864,  ..., -0.2063, -0.2063, -0.2213],
          ...,
          [ 0.0188,  0.0338,  0.0338,  ...,  0.0638,  0.0789,  0.0638],
          [ 0.1089,  0.1239,  0.1089,  ...,  0.0638,  0.0488,  0.0188],
          [ 0.3190,  0.3190,  0.3190,  ...,  0.0939,  0.0789,  0.0789]],

         [[ 0.1977,  0.1835,  0.1409,  ...,  0.5532,  0.5532,  0.5675],
          [ 0