
---

## Image Captioning using ViT-GPT2

### Introduction
This project uses a pre-trained vision encoder-decoder model, which pairs a Vision Transformer (ViT) image encoder with a GPT-2 text decoder, to generate captions for images automatically. The ViT encoder turns an image into a sequence of feature vectors, and the GPT-2 decoder generates a short description conditioned on those features. The notebook walks through loading the model, captioning a single image, and captioning every image in a directory.

### Project Code Explanation

#### 1. Importing Libraries

In [None]:
import os
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer


We start by importing the necessary libraries. `torch` is used for tensor operations and device selection, `transformers` provides the pre-trained model, image processor, and tokenizer, `PIL` (Pillow) loads the image files, and `os` is used for file system operations.

#### 2. Loading the Pre-trained Model

In [None]:
model_name = "nlpconnect/vit-gpt2-image-captioning"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = VisionEncoderDecoderModel.from_pretrained(model_name)
model.to(device)

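Here, we load the pre-trained ViT-GPT2 checkpoint `nlpconnect/vit-gpt2-image-captioning` and move it to the selected device. If a CUDA-capable GPU is available, it is used to speed up inference; otherwise the model runs on the CPU.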

#### 3. Loading Image Processor and Tokenizer

In [None]:
processor = ViTImageProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

We load the image processor and tokenizer that belong to the same ViT-GPT2 checkpoint. The processor converts a PIL image into the `pixel_values` tensor the encoder expects, and the tokenizer converts the generated token IDs back into text.
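To make the preprocessing step more concrete, the sketch below reproduces the per-channel arithmetic a ViT image processor typically applies after resizing: rescale 8-bit values to [0, 1], then normalize. The mean and standard deviation of 0.5 are assumptions (common ViT defaults), and `normalize_pixel` is a hypothetical helper, not part of `transformers`; the values actually in use can be read from `processor.image_mean` and `processor.image_std`.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Map an 8-bit channel value (0-255) to the range the encoder sees.

    mean and std are ASSUMED defaults; the authoritative values come from
    the checkpoint's preprocessor configuration.
    """
    scaled = value / 255.0        # rescale to [0, 1]
    return (scaled - mean) / std  # shift and scale to [-1, 1] for these defaults


# Show how the darkest, middle, and brightest channel values are mapped
for v in (0, 127.5, 255):
    print(v, normalize_pixel(v))
```

With these assumed defaults, black maps to the bottom of the range and white to the top, so the encoder receives inputs centered around zero.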

#### 4. Defining the Caption Generation Function

In [None]:
def generate_caption(image_path):
    try:
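        # Load the image and force 3-channel RGB (handles grayscale and RGBA inputs);
        # the context manager closes the file handle once the pixels are copied
        with Image.open(image_path) as img:
            image = img.convert("RGB")
        pixel_values = processor(images=image, return_tensors="pt").pixel_values
        pixel_values = pixel_values.to(device)

        # Beam search with 4 beams; the output is capped at 16 tokens
        outputs = model.generate(pixel_values, max_length=16, num_beams=4)
        # Decode the best sequence, dropping special tokens such as end-of-text
        caption = tokenizer.decode(outputs[0], skip_special_tokens=True)
        return caption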
    except Exception as e:
        return f"Error processing image: {e}"

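This function takes an image path as input, converts the image to RGB, preprocesses it, and generates a caption using the model. Generation uses beam search with four beams and stops after at most 16 tokens. If an error occurs (for example, an unreadable file), the function returns an error message string instead of raising, so callers cannot tell a caption from an error by type alone.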
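Because a returned error message looks just like a caption to the caller, one alternative is to keep the two apart. The following is a minimal sketch, not the notebook's method: `try_caption` is a hypothetical wrapper, and it assumes a `captioner` callable that raises on failure (for instance, `generate_caption` with its `try`/`except` removed). A stand-in captioner is used so the example runs without the model.

```python
from typing import Callable, Optional, Tuple


def try_caption(
    image_path: str,
    captioner: Callable[[str], str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (caption, error); exactly one of the two is None."""
    try:
        return captioner(image_path), None
    except Exception as exc:  # broad on purpose: image and model errors vary
        return None, f"{type(exc).__name__}: {exc}"


def fake_captioner(path: str) -> str:
    """Stand-in for the real model call (hypothetical, for illustration only)."""
    if not path.endswith(".jpg"):
        raise ValueError("unsupported file")
    return "a person standing on a beach"


print(try_caption("a.jpg", fake_captioner))
print(try_caption("a.txt", fake_captioner))
```

A caller can then skip writing a caption file whenever the first element is `None` and log the second element instead.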

#### 5. Generating Captions for Images in a Directory

In [None]:
# Directory containing images
image_dir = "persons"  # Replace with your folder path
output_dir = "output_captions"

# Create the output directory if it does not already exist
os.makedirs(output_dir, exist_ok=True)

# Iterate over images in the directory
print(f"Generating captions for images in '{image_dir}'...")
for image_file in os.listdir(image_dir):
    if image_file.lower().endswith((".png", ".jpg", ".jpeg")):
        image_path = os.path.join(image_dir, image_file)
        caption = generate_caption(image_path)
        print(f"Image: {image_file}\nCaption: {caption}\n")

        # Save the caption to a file
        with open(os.path.join(output_dir, f"{image_file}_caption.txt"), "w", encoding="utf-8") as f:
            f.write(caption)

print(f"Captions saved in '{output_dir}'.")

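This section iterates over the `.png`, `.jpg`, and `.jpeg` files in the specified directory, generates a caption for each one, prints it, and saves it to a text file in the output directory. Each caption file is named after the full image file name, so `photo.jpg` produces `photo.jpg_caption.txt`. Because `generate_caption` returns error messages as ordinary strings, a failed image gets its error text written to its caption file.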
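If shorter file names and a stable processing order are preferred, the listing and naming steps could be factored into two small helpers. This is a sketch using `pathlib`; `list_images` and `caption_path` are hypothetical names that do not appear in the notebook. Dropping the extension is a design change with a trade-off: `a.png` and `a.jpg` would then share one caption file.

```python
from pathlib import Path

# Same extensions the notebook's loop accepts
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_images(image_dir):
    """Return image paths in image_dir, sorted for a stable order."""
    return sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def caption_path(image_path, output_dir):
    """Build '<stem>_caption.txt' inside output_dir (extension dropped)."""
    return Path(output_dir) / f"{Path(image_path).stem}_caption.txt"


# Pure path arithmetic; no file needs to exist for this call
print(caption_path("persons/photo.jpg", "output_captions").name)
```

In the loop, `for image_path in list_images(image_dir):` would replace the `os.listdir` call and the extension check, and `caption_path(image_path, output_dir)` would replace the `os.path.join` expression.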

### Conclusion
This project shows how a pre-trained vision encoder-decoder model can be combined with a small amount of file handling code to build a working image captioning pipeline. Potential use cases include generating image descriptions for visually impaired users, improving content accessibility, and automating image metadata generation. As a portfolio piece, it demonstrates practical experience with pre-trained transformer models applied to a real-world task.

---