# Mini-Project 01

# Flow

- The image below shows the rough flow of the approach, along with the models and APIs involved.

![Image](https://drive.google.com/uc?id=1_tJU1vDfYTN_PgRbn-iPt5iAml6Pyhky)

- The file takes about 6-7 to run completely.
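
The flow above can be read as three chained stages: caption, translate, speak. The following is only a minimal sketch of that wiring, not the notebook's actual code: `run_pipeline` and its `caption_fn`, `translate_fn`, and `tts_fn` parameters are hypothetical names introduced here, and the lambdas are stand-ins so the structure can be checked without loading any model.

```python
from typing import Callable, Dict, List


def run_pipeline(
    image_path: str,
    caption_fn: Callable[[str], str],
    translate_fn: Callable[[str, str], str],
    tts_fn: Callable[[str, str], str],
    target_langs: List[str],
) -> Dict[str, object]:
    """Chain the three stages: caption -> translations -> audio files."""
    # Stage 1: one English caption for the image
    caption = caption_fn(image_path)
    # Stage 2: one translation per target language code
    translations = {lang: translate_fn(caption, lang) for lang in target_langs}
    # Stage 3: one audio file path per translated text
    audio = {lang: tts_fn(text, lang) for lang, text in translations.items()}
    return {"caption": caption, "translations": translations, "audio": audio}


# Stand-in stages (hypothetical) so the wiring can be checked without any model
demo = run_pipeline(
    "/content/test_4.jpg",
    caption_fn=lambda path: "A dog on a beach.",
    translate_fn=lambda text, lang: f"[{lang}] {text}",
    tts_fn=lambda text, lang: f"/content/audio_{lang}.mp3",
    target_langs=["hin_Deva", "ben_Beng"],
)
print(demo["translations"]["hin_Deva"])
```

In the notebook itself, the three stages correspond to the captioning model, IndicTrans2, and `gTTS` described in the sections that follow.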

# Generating a suitable Caption for the image

- Here, I use the [`thwri/CogFlorence-2-Large-Freeze`](https://huggingface.co/thwri/CogFlorence-2-Large-Freeze) model, which is a fine-tuned version of [`microsoft/Florence-2-large`](https://huggingface.co/microsoft/Florence-2-large) trained on the [`Ejafa/ye-pop`](https://huggingface.co/datasets/Ejafa/ye-pop) dataset.

- First, all the necessary dependencies are installed.

- After that, the image is loaded using the PIL `Image` class.

- Next, the model and processor are initialised, and the caption for the image is generated.

In [None]:
!pip install pycocotools
!pip install matplotlib datasets
!pip install opencv-python-headless transformers
!pip install timm flash_attn

In [None]:
from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig
import torch
from PIL import Image
import copy
from matplotlib import pyplot as plt


In [None]:
# Enter the image file path here
img_path = "/content/test_4.jpg"
image = Image.open(img_path).convert("RGB")
# Show the loaded image to confirm it is the intended one
plt.imshow(image)

In [None]:
# Initialise the processor and model
# Note: the Freeze version is used because the normal large version needed more RAM and crashed the system
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("thwri/CogFlorence-2-Large-Freeze", trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained("thwri/CogFlorence-2-Large-Freeze", trust_remote_code=True)

# function to get the description of the image
def run_example(task_prompt, image):

    prompt = task_prompt
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens= 32,
        num_beams=3,
        do_sample=True
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

# Use the <MORE_DETAILED_CAPTION> task prompt to get a more detailed description
result = run_example("<MORE_DETAILED_CAPTION>" , image)
print(result)

In [None]:
generated_text= result['<MORE_DETAILED_CAPTION>']
generated_text
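
Because generation is capped at `max_new_tokens=32`, the caption may stop in the middle of a sentence. The sketch below shows one possible way to tidy the text before translating it. `tidy_caption` is a hypothetical helper, not part of Florence-2 or `transformers`, and it assumes the parsed result is a dict keyed by the task prompt, as in the cell above. Its punctuation rule is deliberately simple and would mis-handle text such as decimals or abbreviations.

```python
import re


def tidy_caption(parsed: dict, task: str = "<MORE_DETAILED_CAPTION>") -> str:
    """Extract the caption, normalise whitespace, and drop a trailing partial sentence."""
    text = parsed.get(task, "")
    # Collapse runs of whitespace (including newlines) into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Position of the last sentence-ending punctuation mark, or -1 if none
    last_end = max(text.rfind("."), text.rfind("!"), text.rfind("?"))
    # Cut only when a complete sentence exists and something follows it
    if last_end != -1 and last_end < len(text) - 1:
        text = text[: last_end + 1]
    return text


sample = {"<MORE_DETAILED_CAPTION>": "  A dog runs on grass. The sky is  blu"}
print(tidy_caption(sample))
```

A caption with no sentence-ending punctuation is returned unchanged apart from whitespace clean-up, so nothing is lost when the model produces a single unfinished sentence.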

# Converting English to Other languages

- With the English description of the image in hand, I translate it into the languages listed below, shown with the Indic code used by IndicTrans2 and the IETF code.


| Language | Indic Code | IETF Code |
|----------|------------|-----------|
| Punjabi  | pan_Guru   | pa        |
| Bengali  | ben_Beng   | bn        |
| Hindi    | hin_Deva   | hi        |
| Marathi  | mar_Deva   | mr        |
| Gujarati | guj_Gujr   | gu        |
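
The table can also be kept as a single data structure, so that the list of translation targets and the code mapping for speech are derived from one place. This is only a sketch: `Language`, `LANGUAGES`, `target_codes`, and `indic_to_ietf` are hypothetical names, and the values are copied from the table above.

```python
from typing import NamedTuple


class Language(NamedTuple):
    name: str
    indic_code: str  # code passed to IndicTrans2 as the target language
    ietf_code: str   # code passed to the text-to-speech step


# One row per language, mirroring the table above
LANGUAGES = [
    Language("Punjabi", "pan_Guru", "pa"),
    Language("Bengali", "ben_Beng", "bn"),
    Language("Hindi", "hin_Deva", "hi"),
    Language("Marathi", "mar_Deva", "mr"),
    Language("Gujarati", "guj_Gujr", "gu"),
]

# Derived views: translation targets and the Indic-to-IETF lookup
target_codes = [lang.indic_code for lang in LANGUAGES]
indic_to_ietf = {lang.indic_code: lang.ietf_code for lang in LANGUAGES}

print(target_codes)
print(indic_to_ietf["hin_Deva"])
```

Adding a language then means adding one row, rather than editing a list and a dictionary separately.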

In [None]:
# Clone the required Git repository for IndicTrans2
!git clone https://github.com/AI4Bharat/IndicTrans2.git

In [None]:
%%capture
# Move into the Hugging Face interface directory of the cloned repository
%cd /content/IndicTrans2/huggingface_interface

In [None]:
%%capture
!git clone https://github.com/VarunGumma/IndicTransToolkit

%cd IndicTransToolkit
!pip install --editable ./


In [None]:
import torch
from IndicTransToolkit import IndicProcessor
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [None]:
# initialising the models
ip = IndicProcessor(inference=True)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

target_lan = ["hin_Deva", "mar_Deva", "guj_Gujr", "ben_Beng", "pan_Guru"]
sentences = [generated_text]
translated_sec = {}

# getting the translation for all the languages
for tar_lan in target_lan:
  batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang= tar_lan)
  batch = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt")

  with torch.inference_mode():
      outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)

  with tokenizer.as_target_tokenizer():
      # This scoping is absolutely necessary, as it will instruct the tokenizer to tokenize using the target vocabulary.
      # Failure to use this scoping will result in gibberish/unexpected predictions as the output will be de-tokenized with the source vocabulary instead.
      outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)

  # storing the output in a dictionary with its language code as the key
  output = ip.postprocess_batch(outputs, lang= tar_lan)
  translated_sec[tar_lan]= output[0]

In [None]:
translated_sec

# Converting the text to Speech

- Here, I have used the `gTTS` API to convert the text to speech for different languages.

In [None]:
!pip install gtts

In [None]:
from gtts import gTTS
import librosa
from IPython.display import Audio

In [None]:
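# Map the IndicTrans2 language codes to the IETF codes that gTTS expects
mapper = {
    'hin_Deva': 'hi',
    'mar_Deva': 'mr',
    'guj_Gujr': 'gu',
    'ben_Beng': 'bn',
    'pan_Guru': 'pa'
}
audio_files = {}
for lang, audio_text in translated_sec.items():
  # Skip any language that has no gTTS mapping (for example 'ory_Orya')
  if lang not in mapper:
    continue

  to_lang = mapper[lang]
  speak = gTTS(text=audio_text, lang=to_lang, slow=False)
  # gTTS writes MP3 data, so the file is saved with the .mp3 extension
  audio_files[lang] = f"/content/audio_{lang}.mp3"
  speak.save(audio_files[lang])

In [None]:
# Load the Hindi audio and play it inline
y, sr = librosa.load(audio_files['hin_Deva'])

Audio(data=y, rate=sr)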
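
The loop over translations can be separated into a planning step that decides which languages get audio and where each file goes, before any speech is synthesised. The sketch below assumes this split is useful. `plan_audio_jobs` is a hypothetical helper, not part of `gTTS`, and the `mp3` default reflects that `gTTS` produces MP3 audio.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Tuple


def plan_audio_jobs(
    translations: Dict[str, str],
    code_map: Dict[str, str],
    out_dir: str = "/content",
    ext: str = "mp3",
) -> Tuple[List[Tuple[str, str, str]], List[str]]:
    """Return (jobs, skipped): jobs are (ietf_code, text, output_path) tuples."""
    jobs, skipped = [], []
    for indic_code, text in translations.items():
        ietf_code = code_map.get(indic_code)
        # Skip languages without a mapping, and empty translations
        if ietf_code is None or not text.strip():
            skipped.append(indic_code)
            continue
        path = PurePosixPath(out_dir) / f"audio_{indic_code}.{ext}"
        jobs.append((ietf_code, text, str(path)))
    return jobs, skipped


jobs, skipped = plan_audio_jobs(
    {"hin_Deva": "नमस्ते", "ory_Orya": "placeholder text"},
    {"hin_Deva": "hi"},
)
print(jobs)
print(skipped)
```

Each job could then be passed to `gTTS(text=text, lang=ietf_code, slow=False).save(path)`, as in the cell above, and the `skipped` list makes it visible which languages were left out instead of dropping them silently.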

# References

1. [IETF language tag](https://en.wikipedia.org/wiki/IETF_language_tag)
2. [Hugging Face](https://huggingface.co/)