# IMAGE CAPTIONING

Image captioning is the task of generating a textual description of an image. It combines computer vision and natural language processing to recognize the contents of an image and describe them in words. A typical system pairs a vision encoder, such as a convolutional neural network (CNN) or a vision transformer (ViT), which interprets the image's visual data, with a text decoder, such as a recurrent neural network (RNN) or a transformer, which generates the descriptive text. Applications include assisting visually impaired users, organizing and retrieving images in digital databases, and enriching multimedia content on social media and other platforms.


In [None]:
# INSTALLATION #
# !pip3 install transformers
# !pip3 install fairseq
# !pip3 install sacremoses sentencepiece
# !pip3 install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
# !pip3 install pythainlp

In [6]:
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
import torch
from PIL import Image

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

max_length = 16
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

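def predict(image_paths):
    """Return one English caption per image path, in the same order."""
    images = []
    for image_path in image_paths:
        i_image = Image.open(image_path)
        # The image processor expects 3-channel input, so convert grayscale, palette, or RGBA images
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")

        images.append(i_image)

    # Preprocess all images into one batched tensor and move it to the model's device
    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    # Generate caption token IDs using the beam search settings in gen_kwargs
    output_ids = model.generate(pixel_values, **gen_kwargs)

    # Decode token IDs to text, dropping special tokens and surrounding whitespace
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]

    return preds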
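
The `num_beams` and `max_length` entries in `gen_kwargs` control beam search during generation. The following is a minimal, self-contained sketch of the idea on a toy next-token table. It is not the Hugging Face implementation: `TOY_MODEL`, its probabilities, and the `beam_search` function are hypothetical names and numbers invented for illustration, the sketch omits length normalization, and here `max_length` simply counts generation steps.

```python
import math

# Hypothetical toy "language model": maps the last token to next-token
# probabilities. The numbers are invented purely for illustration.
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(model, num_beams, max_length, bos="<s>", eos="</s>"):
    """Return the best sequence found as (tokens, log_probability)."""
    beams = [([bos], 0.0)]  # unfinished hypotheses kept after each step
    finished = []           # hypotheses that have emitted the end token
    for _ in range(max_length):
        # Extend every unfinished hypothesis by every possible next token
        candidates = []
        for tokens, score in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams highest-scoring candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished hypotheses if none ended within max_length steps
    pool = finished or beams
    return max(pool, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(TOY_MODEL, num_beams=1, max_length=16)
beam_tokens, beam_score = beam_search(TOY_MODEL, num_beams=2, max_length=16)
print(" ".join(greedy_tokens), round(math.exp(greedy_score), 2))
print(" ".join(beam_tokens), round(math.exp(beam_score), 2))
```

With a single beam the search is greedy and commits to the locally best first token, while two beams keep a second hypothesis alive long enough to find a sequence with higher overall probability. Wider beams cost more computation per step, which is the trade-off behind a small value such as `num_beams = 4`.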

In [3]:
from pythainlp.translate import Translate
en2th = Translate('en', 'th')

2024-04-26 15:10:09 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX


Corpus: scb_1m_en-th_moses
- Downloading: scb_1m_en-th_moses 1.0


100%|██████████| 1174648148/1174648148 [08:18<00:00, 2358046.88it/s]
2024-04-26 15:18:39 | INFO | fairseq.file_utils | loading archive file /Users/worawittepsan/pythainlp-data/scb_1m_en-th_moses_1.0/SCB_1M-MT_OPUS+TBASE_en-th_moses-spm_130000-16000_v1.0/models
2024-04-26 15:18:39 | INFO | fairseq.file_utils | loading archive file /Users/worawittepsan/pythainlp-data/scb_1m_en-th_moses_1.0/SCB_1M-MT_OPUS+TBASE_en-th_moses-spm_130000-16000_v1.0/vocab
2024-04-26 15:18:41 | INFO | fairseq.tasks.translation | [en] dictionary: 130000 types
2024-04-26 15:18:41 | INFO | fairseq.tasks.translation | [th] dictionary: 15984 types

## PREDICTION

In [19]:
# IMAGE -> ENG CAPTION -> THA CAPTION
import os

# os.listdir returns entries in arbitrary order, so the captions are not sorted by file name
image_dir = "image_folder/"
image_paths = [os.path.join(image_dir, name) for name in os.listdir(image_dir)]

en_captions = predict(image_paths)
for path, caption in zip(image_paths, en_captions):
    print(f"Th caption of {path}")
    print(f"         EN: {caption}")
    print(f"         TH: {en2th.translate(caption)}")

Th caption of image_folder/1.jpeg
         EN: a herd of sheep standing next to each other
         TH: ฝูงแกะยืนเคียงข้างกัน
Th caption of image_folder/4.jpeg
         EN: a plate of food on a table
         TH: อาหารจานหนึ่งบนโต๊ะ
Th caption of image_folder/5.jpeg
         EN: a green plant sitting on top of a wooden table
         TH: ต้นไม้สีเขียวนั่งอยู่บนโต๊ะไม้
Th caption of image_folder/2.jpeg
         EN: a white cat sitting on top of a table
         TH: แมวขาวนั่งอยู่บนโต๊ะ
Th caption of image_folder/3.jpeg
         EN: a sign that is on top of a table
         TH: ป้ายที่อยู่บนโต๊ะ
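
The output above follows the arbitrary order returned by `os.listdir`, and the prediction cell passes every entry in the folder to `predict` in a single batch. A possible refinement, sketched below under assumptions, is to list only image files in sorted order and to caption them in smaller batches. The helper names `list_image_paths` and `batched`, the `IMAGE_EXTENSIONS` set, and the batch size of 8 are hypothetical choices, not part of the original notebook.

```python
import os

# Assumed set of image file extensions; extend it to match your own data
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_image_paths(folder):
    """Return paths of image files in `folder`, sorted by file name."""
    paths = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        extension = os.path.splitext(name)[1].lower()
        # Skip subdirectories and non-image files (for example hidden system files)
        if os.path.isfile(path) and extension in IMAGE_EXTENSIONS:
            paths.append(path)
    return paths


def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Usage with the notebook's own predict() and en2th objects (not run here):
# for batch in batched(list_image_paths("image_folder/"), 8):
#     for path, caption in zip(batch, predict(batch)):
#         print(path, caption, en2th.translate(caption))
```

Sorting makes repeated runs print captions in the same order, and batching bounds how many images are held in memory and sent through the model at once.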
