
ImageTranslation - Generative AI Image Captioning

'Skier In Mid Air Over Snowy Mountain With Mountain In Background'

Table of Contents

  1. Introduction
  2. Data
  3. Model
  4. Future Work
  5. Conclusion
  6. References

Introduction

Generative AI plays a crucial role in today's world, allowing the creation of new data for a wide variety of purposes. A common use case for generative AI is image captioning, where we harness the power of transformers and neural networks to produce more natural and sophisticated text descriptions of images. Beyond generated text, narration is also improving: Text-To-Speech models can now produce more natural-sounding spoken descriptions.

Data

The user may upload an image or provide a link to one; the app then generates text and audio for that image. The data is deleted once the app is closed.
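Accepting both input paths can be handled by normalizing everything into a PIL image before captioning. The helper below is a sketch of one way to do this (the function name and structure are assumptions, not repo code):

```python
import io

import requests
from PIL import Image

def load_image(source):
    # Hypothetical helper: turn either an image URL (str) or an uploaded
    # file-like object into a PIL image ready for the captioning model.
    if isinstance(source, str) and source.startswith(("http://", "https://")):
        # Link to an image: download the bytes first.
        response = requests.get(source, timeout=10)
        response.raise_for_status()
        data = io.BytesIO(response.content)
    else:
        # Uploaded file: e.g. a value from Flask's request.files.
        data = source
    # Convert to RGB so downstream models get a consistent format.
    return Image.open(data).convert("RGB")
```

Since nothing is written to disk, closing the app naturally discards the data.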

Model

For image captioning we will be using the Salesforce BLIP Image Captioning Large model and its processor, because it has proven to provide accurate, quick captioning. A lighter-weight alternative is the Salesforce BLIP Image Captioning Base model.

from transformers import BlipProcessor, BlipForConditionalGeneration

# Model Name
hf_model = "Salesforce/blip-image-captioning-large"

# Initialize Image-to-text Processor and Model
processor = BlipProcessor.from_pretrained(hf_model)
model = BlipForConditionalGeneration.from_pretrained(hf_model)
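With the processor and model loaded, a caption can be generated roughly as follows. The title-casing helper mirrors the Title Case captions shown in this README and is an assumption, not confirmed repo code:

```python
from PIL import Image

def format_caption(text):
    # Title-case the raw caption to match the captions shown in this README.
    # This formatting step is an assumption, not confirmed repo code.
    return text.strip().title()

def caption_image(image_path):
    # Heavy imports kept local so format_caption stays usable without the ML stack.
    from transformers import BlipProcessor, BlipForConditionalGeneration

    hf_model = "Salesforce/blip-image-captioning-large"
    processor = BlipProcessor.from_pretrained(hf_model)
    model = BlipForConditionalGeneration.from_pretrained(hf_model)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    # Generate token ids for the caption, then decode them to text.
    output_ids = model.generate(**inputs, max_new_tokens=50)
    caption = processor.decode(output_ids[0], skip_special_tokens=True)
    return format_caption(caption)
```

Note the large model downloads roughly 1.8 GB of weights on first use.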

Our Text-To-Speech model takes the caption produced above and generates a .wav file of the caption being vocalized.

from transformers import VitsModel, AutoTokenizer

# Model Name
hf_model = "facebook/mms-tts-eng"

# Initialize TTS Tokenizer and Model
tokenizer = AutoTokenizer.from_pretrained(hf_model)
tts_model = VitsModel.from_pretrained(hf_model)
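Producing the .wav file from a caption could look like the sketch below. The int16 conversion helper and the function names are assumptions about how the repo writes audio, not confirmed code:

```python
import numpy as np

def to_int16(waveform):
    # VITS outputs float audio in roughly [-1, 1]; 16-bit PCM .wav files
    # expect int16. This conversion helper is an assumption, not repo code.
    return np.clip(waveform * 32767, -32768, 32767).astype(np.int16)

def speak(text, tokenizer, tts_model, out_path="caption.wav"):
    # Heavy imports kept local so to_int16 stays usable on its own.
    import torch
    from scipy.io import wavfile

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = tts_model(**inputs).waveform  # shape: (1, num_samples)
    audio = to_int16(waveform.squeeze().cpu().numpy())
    # Write the audio at the model's native sampling rate.
    wavfile.write(out_path, tts_model.config.sampling_rate, audio)
```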

When processing the image and generating text and audio output we can get the following results:

'A Man And A Woman In A Kayak Paddling In The Water'

Future Work

02-24-2024:

  • As of now the model generates captions from images relatively well; however, it is limited to capturing the scene's main actions. It still lacks the capability of describing all of the elements in an image; for example, the caption for the sample image above fails to describe the scenery and background.
  • Selecting a better model, or using a paid API like GPT-4 would yield better results on a large scale.
  • Another transformer model could be trained/used to add more context to the image and allow for better descriptions.
  • I would like to translate the text to different major languages and add text to speech capabilities for these languages.
  • Would like to support dynamic video files (.mp4, .mov, etc.) for real-time descriptions, to better support those with vision impairments.

Conclusion

Overall, the model describes most pictures with great accuracy, but many potentially important details are still left out. To improve image-to-text, we may want to combine models that describe actions well with models that describe details well, improving the robustness of the captioning. If we can accomplish this, we can provide a service that accommodates those with vision impairments in every language.

References

  1. Hugging Face Transformers
  2. Salesforce BLIP Image Captioning Large
  3. Facebook Massively Multilingual Speech Model
  4. Flask Documentation
  5. Sample Image 1
  6. Sample Image 2