# Assignment 1

This notebook performs multilingual text translation using the IndicTrans2 models, which were originally trained with fairseq and later converted to HuggingFace transformers.

## Procedure
  1. There are 3 cases of translation:
    
| Source language | Code | | Target language | Code |
|---|---|---|---|---|
| Bengali | ben_Beng | to | Punjabi | pan_Guru |
| Marathi | mar_Deva | to | Gujarati | guj_Gujr |
| Kannada | kan_Knda | to | Sanskrit | san_Deva |

  2. To check the validity of the translations, the sentences are first written in English and then converted to each of the languages.
  3. After that, the translation is performed.

## Necessary Step

Please run the cells below to install the necessary dependencies.

In [None]:
%%capture
# Clone the required Git repository for IndicTrans2
!git clone https://github.com/AI4Bharat/IndicTrans2.git

In [None]:
%%capture
# Move into the Hugging Face interface directory of the cloned repository
%cd /content/IndicTrans2/huggingface_interface

In [None]:
%%capture
# Install the other essential dependencies needed by the transformer model
!python3 -m pip install nltk sacremoses pandas regex mock "transformers>=4.33.2" mosestokenizer
!python3 -c "import nltk; nltk.download('punkt')"
!python3 -m pip install bitsandbytes scipy accelerate datasets
!python3 -m pip install sentencepiece

!git clone https://github.com/VarunGumma/IndicTransTokenizer
%cd IndicTransTokenizer
!python3 -m pip install --editable ./
%cd ..

Restart your runtime first, then run the cells below.
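
After the restart, it can help to confirm that the installed packages are actually importable before continuing. The following is a minimal sketch, not part of the original assignment: the helper name `missing_modules` and the list `REQUIRED_MODULES` are hypothetical, and the module names are assumed to match the packages installed above.

```python
import importlib.util

# Top-level module names assumed to correspond to the packages used in this notebook.
REQUIRED_MODULES = [
    "torch",
    "transformers",
    "nltk",
    "sacremoses",
    "sentencepiece",
    "IndicTransTokenizer",
]


def missing_modules(names):
    """Return the module names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every module in REQUIRED_MODULES was found.
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

If the printed list is not empty, re-running the installation cells and restarting again is probably the simplest remedy.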

## Working for Transformer


1. Import the required modules:
  * transformers
  * torch
  * AutoModelForSeq2SeqLM from transformers
  * BitsAndBytesConfig from transformers
  * IndicProcessor from IndicTransTokenizer
  * IndicTransTokenizer from IndicTransTokenizer

2. Set the batch size to 4. Then create a variable DEVICE and set it to "cuda" if torch.cuda.is_available() returns True, otherwise to "cpu". Finally, set Quantization to None.

3. Two functions are defined:
    * The first initialises the model and the tokenizer and returns both.
    * The second translates a whole batch of sentences.


In [None]:
import torch
from transformers import BitsAndBytesConfig, AutoModelForSeq2SeqLM
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
import transformers as trf

In [None]:
# Set the variables
BATCH_SIZE = 4
DEVICE= "cuda" if torch.cuda.is_available() else "cpu"
Quantization= None


### Model initializer and tokenizer function.


Create a function initialize_model_and_tokenizer which takes 3 arguments: ckpt_dir, direction, quantization.
Inside the function, if quantization == '4-bit', create a variable qconfig and instantiate it with the appropriate BitsAndBytesConfig. If quantization == '8-bit', create the corresponding 8-bit configuration. Otherwise, set qconfig to None.

See the [BitsAndBytesConfig documentation](https://huggingface.co/docs/transformers/en/main_classes/quantization#transformers.BitsAndBytesConfig).

Next, create a variable tokenizer set to an IndicTransTokenizer for the given direction.

Finally, create a variable model set to AutoModelForSeq2SeqLM, loading the pretrained model from the checkpoint directory.

In [None]:
# Create a function initialize_model_and_tokenizer which takes in 3 arguments: ckpt_dir, direction, quantization.
def initialize_model_and_tokenizer(ckpt_dir, direction, quantization):
    if quantization == '4-bit':
        qconfig = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    elif quantization == '8-bit':
        # The double-quant and compute-dtype options are defined for 4-bit
        # loading only, so the 8-bit case just sets the 8-bit flag.
        qconfig = BitsAndBytesConfig(load_in_8bit=True)
    else:
        qconfig = None

    # Create a variable tokenizer and set it as IndicTransTokenizer with direction set as direction.
    tokenizer= IndicTransTokenizer( direction=direction )

    # Create a model variable set to AutoModelForSeq2SeqLM, Keep trust_remote_code=True, low_cpu_mem_usage=True and quantization_config=qconfig.
    model= AutoModelForSeq2SeqLM.from_pretrained(ckpt_dir, trust_remote_code=True, low_cpu_mem_usage=True, quantization_config=qconfig)


    # If no quantization config is used, move the model to the device
    # and use half precision on the GPU.
    if qconfig is None:
        model = model.to(DEVICE)
        if DEVICE == "cuda":
            model.half()

    model.eval()
    # return both tokenizer and model
    return tokenizer, model


## Helper Function to get translation

In [None]:
def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
    translations = []
    for i in range(0, len(input_sentences), BATCH_SIZE):
        batch = input_sentences[i : i + BATCH_SIZE]

        # Preprocess the batch and extract entity mappings
        batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)

        # Tokenize the batch and generate input encodings
        inputs = tokenizer(
            batch,
            src=True,
            truncation=True,
            padding="longest",
            return_tensors="pt",
            return_attention_mask=True,
        ).to(DEVICE)

        # Generate translations using the model
        with torch.no_grad():
            generated_tokens = model.generate(
                **inputs,
                use_cache=True,
                min_length=0,
                max_length=256,
                num_beams=5,
                num_return_sequences=1,
            )

        # Decode the generated tokens into text
        generated_tokens = tokenizer.batch_decode(generated_tokens.detach().cpu().tolist(), src=False)

        # Postprocess the translations, including entity replacement
        translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)

        del inputs
        torch.cuda.empty_cache()

    return translations
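
The loop in `batch_translate` walks over the input in steps of `BATCH_SIZE` and slices out one chunk per step. The slicing can be illustrated on its own, without a model; this is a minimal sketch in which the helper name `split_into_batches` and the placeholder sentences are hypothetical.

```python
def split_into_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


# Ten placeholder sentences, matching the number of inputs used in this notebook.
sentences = [f"sentence {n}" for n in range(10)]
batches = split_into_batches(sentences, 4)
print([len(b) for b in batches])  # [4, 4, 2]
```

With ten sentences and a batch size of 4, the last batch holds only two sentences, which is why the tokenizer call pads each batch to its own longest sequence.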

## Languages and their codes

Now we join all the functions and datasets together to produce our own translations.

Here is the list of languages supported by the IndicTrans2 models:

| Language                       | Code      | Language                        | Code      | Language                       | Code      |
|--------------------------------|-----------|---------------------------------|-----------|--------------------------------|-----------|
| Assamese                       | asm_Beng  | Kashmiri (Arabic)               | kas_Arab  | Punjabi                        | pan_Guru  |
| Bengali                        | ben_Beng  | Kashmiri (Devanagari)           | kas_Deva  | Sanskrit                       | san_Deva  |
| Bodo                           | brx_Deva  | Maithili                        | mai_Deva  | Santali                        | sat_Olck  |
| Dogri                          | doi_Deva  | Malayalam                       | mal_Mlym  | Sindhi (Arabic)                | snd_Arab  |
| English                        | eng_Latn  | Marathi                         | mar_Deva  | Sindhi (Devanagari)            | snd_Deva  |
| Konkani                        | gom_Deva  | Manipuri (Bengali)              | mni_Beng  | Tamil                          | tam_Taml  |
| Gujarati                       | guj_Gujr  | Manipuri (Meitei)               | mni_Mtei  | Telugu                         | tel_Telu  |
| Hindi                          | hin_Deva  | Nepali                          | npi_Deva  | Urdu                           | urd_Arab  |
| Kannada                        | kan_Knda  | Odia                            | ory_Orya  |                                |           |
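
Because the codes are passed around as plain strings, a truncated code such as `mar_Dev` is easy to miss. One way to catch this early is to check each code against the table before translating. This is a minimal sketch; the names `SUPPORTED_CODES` and `check_lang_code` are hypothetical and not part of IndicTrans2.

```python
# Language codes copied from the table of supported languages.
SUPPORTED_CODES = {
    "asm_Beng", "ben_Beng", "brx_Deva", "doi_Deva", "eng_Latn", "gom_Deva",
    "guj_Gujr", "hin_Deva", "kan_Knda", "kas_Arab", "kas_Deva", "mai_Deva",
    "mal_Mlym", "mar_Deva", "mni_Beng", "mni_Mtei", "npi_Deva", "ory_Orya",
    "pan_Guru", "san_Deva", "sat_Olck", "snd_Arab", "snd_Deva", "tam_Taml",
    "tel_Telu", "urd_Arab",
}


def check_lang_code(code):
    """Return the code unchanged if it is in the table, otherwise raise ValueError."""
    if code not in SUPPORTED_CODES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return code


print(check_lang_code("mar_Deva"))  # mar_Deva
```

Calling `check_lang_code` on the source and target codes before `batch_translate` would turn a typo into an immediate, readable error.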


## Input Lines in English

In [None]:
lan_en=[
    "When I was young, I used to go to the park every day.",
    "He has many old books, which he inherited from his ancestors.",
    "I can't figure out how to solve my problem.",
    "She is very hardworking and intelligent, which is why she got all the good marks.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "She went to the market with her sister to buy a new sari.",
    "Raj told me that he is going to his grandmother's house next month.",
    "All the kids were having fun at the party and were eating lots of sweets.",
    "My friend has invited me to his birthday party, and I will give him a gift."
]

## Bengali to Punjabi Conversion

In [None]:
input_bengali=[
 'আমি যখন ছোট ছিলাম, আমি প্রতিদিন পার্কে যেতাম।',
 'তাঁর কাছে অনেক পুরনো বই রয়েছে, যা তিনি তাঁর পূর্বপুরুষদের কাছ থেকে উত্তরাধিকার সূত্রে পেয়েছিলেন।',
 'আমি বুঝতে পারি না কিভাবে আমার সমস্যার সমাধান করব।',
 'সে খুব পরিশ্রমী এবং বুদ্ধিমান, যে কারণে সে সমস্ত ভাল নম্বর পেয়েছে।',
 'আমরা গত সপ্তাহে একটি নতুন সিনেমা দেখেছি, যা খুব অনুপ্রেরণামূলক ছিল।',
 'আপনি যদি সেই সময় আমার সঙ্গে দেখা করতেন, তাহলে আমরা বাইরে খেতে যেতাম।',
 'সে তার বোনের সাথে একটি নতুন শাড়ি কিনতে বাজারে গিয়েছিল।',
 'রাজ আমাকে বলেছিল যে সে আগামী মাসে তার দাদির বাড়িতে যাচ্ছে।',
 'সব বাচ্চারা পার্টিতে মজা করছিল এবং প্রচুর মিষ্টি খাচ্ছিল।',
 'আমার বন্ধু আমাকে তার জন্মদিনের পার্টিতে আমন্ত্রণ জানিয়েছে এবং আমি তাকে একটি উপহার দেব।'
 ]

output_punjabi=[
 'ਜਦੋਂ ਮੈਂ ਛੋਟਾ ਸੀ, ਮੈਂ ਹਰ ਰੋਜ਼ ਪਾਰਕ ਜਾਂਦਾ ਸੀ।',
 'ਉਨ੍ਹਾਂ ਕੋਲ ਬਹੁਤ ਸਾਰੀਆਂ ਪੁਰਾਣੀਆਂ ਕਿਤਾਬਾਂ ਹਨ, ਜੋ ਉਨ੍ਹਾਂ ਨੂੰ ਆਪਣੇ ਪੁਰਖਿਆਂ ਤੋਂ ਵਿਰਾਸਤ ਵਿੱਚ ਮਿਲੀਆਂ ਹਨ।',
 'ਮੈਨੂੰ ਸਮਝ ਨਹੀਂ ਆ ਰਿਹਾ ਕਿ ਮੈਂ ਆਪਣੀ ਸਮੱਸਿਆ ਦਾ ਹੱਲ ਕਿਵੇਂ ਕਰਾਂ।',
 'ਉਹ ਬਹੁਤ ਮਿਹਨਤੀ ਅਤੇ ਬੁੱਧੀਮਾਨ ਹੈ, ਇਸੇ ਲਈ ਉਸ ਨੂੰ ਸਾਰੇ ਚੰਗੇ ਅੰਕ ਮਿਲੇ।',
 'ਅਸੀਂ ਪਿਛਲੇ ਹਫ਼ਤੇ ਇੱਕ ਨਵੀਂ ਫਿਲਮ ਵੇਖੀ, ਜੋ ਬਹੁਤ ਪ੍ਰੇਰਣਾਦਾਇਕ ਸੀ।',
 'ਜੇ ਤੁਸੀਂ ਉਸ ਸਮੇਂ ਮੈਨੂੰ ਮਿਲਦੇ ਤਾਂ ਅਸੀਂ ਖਾਣਾ ਖਾਣ ਲਈ ਬਾਹਰ ਜਾਂਦੇ।',
 'ਉਹ ਆਪਣੀ ਭੈਣ ਨਾਲ ਨਵੀਂ ਸਾਡ਼ੀ ਖਰੀਦਣ ਲਈ ਬਾਜ਼ਾਰ ਗਈ ਸੀ।',
 'ਰਾਜ ਨੇ ਮੈਨੂੰ ਦੱਸਿਆ ਕਿ ਉਹ ਅਗਲੇ ਮਹੀਨੇ ਆਪਣੀ ਦਾਦੀ ਦੇ ਘਰ ਜਾ ਰਿਹਾ ਹੈ।',
 'ਸਾਰੇ ਬੱਚੇ ਪਾਰਟੀ ਵਿੱਚ ਬਹੁਤ ਮਸਤੀ ਕਰ ਰਹੇ ਸਨ ਅਤੇ ਬਹੁਤ ਸਾਰੀਆਂ ਮਠਿਆਈਆਂ ਖਾ ਰਹੇ ਸਨ।',
 'ਮੇਰੇ ਦੋਸਤ ਨੇ ਮੈਨੂੰ ਆਪਣੇ ਜਨਮਦਿਨ ਦੀ ਪਾਰਟੀ ਵਿੱਚ ਸੱਦਾ ਦਿੱਤਾ ਹੈ, ਅਤੇ ਮੈਂ ਉਸਨੂੰ ਇੱਕ ਤੋਹਫ਼ਾ ਦੇਵਾਂਗਾ।'
 ]

AI4_BHARAT = "ai4bharat/indictrans2-indic-indic-1B"
my_direction = "/content/IndicTrans2/huggingface_interface/IndicTransTokenizer/IndicTransTokenizer/indic-indic/"

# getting tokenizer and model by passing arg. to initialize_model_and_tokenizer function
Tokenizer, model= initialize_model_and_tokenizer( ckpt_dir=AI4_BHARAT, direction= my_direction, quantization="4-bit"  )

indic= IndicProcessor(inference=True)

# Choose Bengali as the source language and Punjabi as the target language.
lan_src_1 = "ben_Beng"
lan_tar_1 = "pan_Guru"

# Find target translation using the batch_translate function with arg: input_sentences, src_lang, tgt_lang, model, tokenizer, ip
target_trans_1= batch_translate( input_bengali, lan_src_1, lan_tar_1, model, Tokenizer, indic  )

del Tokenizer, model

In [None]:
target_trans_1
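
To check validity, the model output can be placed next to the reference sentences. The sketch below computes a rough character-level similarity with the standard library's `difflib`. It is only a coarse proxy and not a standard translation metric. The helper name `similarity_report` is hypothetical, and treating `output_punjabi` as the reference is an assumption about how the lists above are meant to be used.

```python
from difflib import SequenceMatcher


def similarity_report(predictions, references):
    """Pair each prediction with its reference and add a 0..1 character similarity."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    return [
        (pred, ref, SequenceMatcher(None, pred, ref).ratio())
        for pred, ref in zip(predictions, references)
    ]


# Placeholder strings so the sketch runs on its own; in the notebook these
# would presumably be target_trans_1 and output_punjabi.
demo_predictions = ["the cat sat", "hello world"]
demo_references = ["the cat sat", "hello there"]
for pred, ref, score in similarity_report(demo_predictions, demo_references):
    print(f"{score:.2f} | {pred} | {ref}")
```

A score of 1.0 means the two strings are identical. Lower scores only flag sentences worth reading by hand, since a valid translation can differ from the reference in wording.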

## Marathi to Gujarati conversion

In [None]:
input_marathi= [
 'मी लहान असताना दररोज उद्यानात जायचो.',
 'त्याच्याकडे अनेक जुनी पुस्तके आहेत, जी त्याला त्याच्या पूर्वजांकडून वारशाने मिळाली आहेत.',
 'माझ्या समस्येचे निराकरण कसे करावे हे मला समजत नाही.',
 'ती खूप मेहनती आणि बुद्धिमान आहे, म्हणूनच तिला सर्व चांगले गुण मिळाले.',
 'आम्ही गेल्या आठवड्यात एक नवीन चित्रपट पाहिला, जो खूप प्रेरणादायी होता.',
 'त्यावेळी तुम्ही मला भेटला असता तर आम्ही बाहेर जेवायला गेलो असतो.',
 'ती तिच्या बहिणीसोबत नवीन साडी खरेदी करण्यासाठी बाजारात गेली.',
 'राजने मला सांगितले की तो पुढच्या महिन्यात त्याच्या आजीच्या घरी जात आहे.',
 'सर्व मुले पार्टीमध्ये मजा करत होती आणि भरपूर मिठाई खात होती.',
 'माझ्या मित्राने मला त्याच्या वाढदिवसाच्या सोहळ्याला आमंत्रित केले आहे आणि मी त्याला भेटवस्तू देईन.'
]
output_gujrati= [
 'જ્યારે હું નાનો હતો, ત્યારે હું દરરોજ બગીચામાં જતો હતો.',
 'તેમની પાસે ઘણા જૂના પુસ્તકો છે, જે તેમને તેમના પૂર્વજો પાસેથી વારસામાં મળ્યા હતા.',
 'હું સમજી શકતો નથી કે મારી સમસ્યાનો ઉકેલ કેવી રીતે લાવવો.',
 'તે ખૂબ જ મહેનતુ અને બુદ્ધિશાળી છે, તેથી જ તેને બધા સારા ગુણ મળ્યા.',
 'અમે ગયા અઠવાડિયે એક નવી ફિલ્મ જોઈ, જે ખૂબ જ પ્રેરણાદાયક હતી.',
 'જો તમે તે સમયે મને મળ્યા હોત તો અમે બહાર જમવા ગયા હોત.',
 'તે તેની બહેન સાથે નવી સાડી ખરીદવા માટે બજારમાં ગઈ હતી.',
 'રાજે મને કહ્યું કે તે આવતા મહિને તેની દાદીના ઘરે જઈ રહ્યો છે.',
 'બધા બાળકો પાર્ટીમાં મજા કરી રહ્યા હતા અને ઘણી બધી મીઠાઈઓ ખાઈ રહ્યા હતા.',
 'મારા મિત્રે મને તેના જન્મદિવસની પાર્ટીમાં આમંત્રણ આપ્યું છે, અને હું તેને ભેટ આપીશ.'
 ]

AI4_BHARAT = "ai4bharat/indictrans2-indic-indic-1B"
my_direction = "/content/IndicTrans2/huggingface_interface/IndicTransTokenizer/IndicTransTokenizer/indic-indic/"

# getting tokenizer and model by passing arg. to initialize_model_and_tokenizer function
Tokenizer, model= initialize_model_and_tokenizer( ckpt_dir=AI4_BHARAT, direction= my_direction, quantization="4-bit"  )

indic= IndicProcessor(inference=True)

# Choose Marathi as the source language and Gujarati as the target language.
lan_src_2 = "mar_Deva"
lan_tar_2 = "guj_Gujr"

# Find target translation using the batch_translate function with arguments: input_sentences, src_lang, tgt_lang, model, tokenizer, ip
target_trans_2= batch_translate(input_marathi , lan_src_2, lan_tar_2, model, Tokenizer, indic  )

del Tokenizer, model

In [None]:
target_trans_2

## Kannada to Sanskrit

In [None]:
input_kannada= [
 'ನಾನು ಚಿಕ್ಕವಳಿದ್ದಾಗ, ಪ್ರತಿದಿನ ಉದ್ಯಾನವನಕ್ಕೆ ಹೋಗುತ್ತಿದ್ದೆ.',
 'ಅವರು ತಮ್ಮ ಪೂರ್ವಜರಿಂದ ಆನುವಂಶಿಕವಾಗಿ ಪಡೆದ ಅನೇಕ ಹಳೆಯ ಪುಸ್ತಕಗಳನ್ನು ಹೊಂದಿದ್ದಾರೆ.',
 'ನನ್ನ ಸಮಸ್ಯೆಯನ್ನು ಹೇಗೆ ಪರಿಹರಿಸಿಕೊಳ್ಳುವುದು ಎಂದು ನನಗೆ ಅರ್ಥವಾಗುತ್ತಿಲ್ಲ.',
 'ಅವಳು ತುಂಬಾ ಕಷ್ಟಪಟ್ಟು ದುಡಿಯುವವಳು ಮತ್ತು ಬುದ್ಧಿವಂತಳು, ಅದಕ್ಕಾಗಿಯೇ ಅವಳು ಎಲ್ಲಾ ಉತ್ತಮ ಅಂಕಗಳನ್ನು ಪಡೆದಳು.',
 'ನಾವು ಕಳೆದ ವಾರ ಹೊಸ ಚಲನಚಿತ್ರವೊಂದನ್ನು ನೋಡಿದೆವು, ಅದು ಬಹಳ ಸ್ಪೂರ್ತಿದಾಯಕವಾಗಿತ್ತು.',
 'ಆ ಸಮಯದಲ್ಲಿ ನೀವು ನನ್ನನ್ನು ಭೇಟಿಯಾಗಿದ್ದರೆ, ನಾವು ತಿನ್ನಲು ಹೊರಗೆ ಹೋಗುತ್ತಿದ್ದೆವು.',
 'ಆಕೆ ತನ್ನ ಸಹೋದರಿಯೊಂದಿಗೆ ಹೊಸ ಸೀರೆಯನ್ನು ಖರೀದಿಸಲು ಮಾರುಕಟ್ಟೆಗೆ ಹೋದಳು.',
 'ಮುಂದಿನ ತಿಂಗಳು ತನ್ನ ಅಜ್ಜಿಯ ಮನೆಗೆ ಹೋಗುತ್ತಿದ್ದೇನೆ ಎಂದು ರಾಜ್ ನನಗೆ ಹೇಳಿದನು.',
 'ಎಲ್ಲಾ ಮಕ್ಕಳು ಪಾರ್ಟಿಯಲ್ಲಿ ಮೋಜು ಮಾಡುತ್ತಿದ್ದರು ಮತ್ತು ಸಾಕಷ್ಟು ಸಿಹಿತಿಂಡಿಗಳನ್ನು ತಿನ್ನುತ್ತಿದ್ದರು.',
 'ನನ್ನ ಸ್ನೇಹಿತ ತನ್ನ ಹುಟ್ಟುಹಬ್ಬದ ಸಂತೋಷಕೂಟಕ್ಕೆ ನನ್ನನ್ನು ಆಹ್ವಾನಿಸಿದ್ದಾನೆ ಮತ್ತು ನಾನು ಅವನಿಗೆ ಉಡುಗೊರೆಯನ್ನು ನೀಡುತ್ತೇನೆ.'
]

output_sanskrit= [
 'यदा अहं बालकः आसीत्, तदा प्रतिदिनं उद्यानं गच्छामि स्म।',
 'तस्य समीपे बहूनि पुरातनानि पुस्तकानि सन्ति, यानि सः स्वपितृभ्यः उत्तराधिकाररूपेण प्राप्तवान्।',
 'मम समस्यायाः निराकरणं कथं करणीयम् इति अहं कल्पयितुं न शक्नोमि।',
 'सा अतीव परिश्रमी, बुद्धिमती च अस्ति, अतः सा सर्वाणि उत्तमानि अङ्कानि प्राप्नोत्।',
 'वयं गतसप्ताहे नूतनं चलच्चित्रं दृष्टवन्तः, यत् अत्यन्तं प्रेरकम् आसीत्।',
 'तस्मिन् समये यदि त्वं मह्यं मिलितवान् तर्हि वयं भोजनार्थं बहिः गमिष्यामः स्म।',
 'सा नूतनां शाटिकां क्रेतुं भगिन्या सह विपणिं गतवती।',
 'अग्रिमे मासे सः मातामहीगृहं गमिष्यति इति राज् अवदत्।',
 'सर्वे बालकाः उत्सवे विनोदं कुर्वन्तः, अनेकानि मधुराणि खादन्तः च आसन्।',
 'मम मित्रः तस्य जन्मदिनसमारोहे आमन्त्रयत्, अहं तं उपहारं ददामि।'
]

# Create variables to store the checkpoint name and the tokenizer directory path
AI4_BHARAT = "ai4bharat/indictrans2-indic-indic-1B"
my_direction = "/content/IndicTrans2/huggingface_interface/IndicTransTokenizer/IndicTransTokenizer/indic-indic/"

# getting tokenizer and model by passing arg. to initialize_model_and_tokenizer function
Tokenizer, model= initialize_model_and_tokenizer( ckpt_dir=AI4_BHARAT, direction= my_direction, quantization="4-bit"  )

indic= IndicProcessor(inference=True)

# Choose Kannada as the source language and Sanskrit as the target language.
lan_src_3 = "kan_Knda"
lan_tar_3 = "san_Deva"

# Find target translation using the batch_translate function with arg: input_sentences, src_lang, tgt_lang, model, tokenizer, ip
target_trans_3= batch_translate( input_kannada , lan_src_3, lan_tar_3, model, Tokenizer, indic  )

# flush the models to free the GPU memory
del Tokenizer, model

In [None]:
target_trans_3
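
The three cases above follow the same pattern, so they could also be driven from a single table of cases. The sketch below shows that structure with a stand-in translator, so it runs without a model. The names `translate_all`, `fake_translate`, and `demo_cases` are hypothetical, and the single-letter inputs are placeholders rather than real sentences.

```python
def translate_all(cases, translate_fn):
    """Call translate_fn(sentences, src_lang, tgt_lang) for each case, keyed by case name."""
    return {
        name: translate_fn(sentences, src_lang, tgt_lang)
        for name, (sentences, src_lang, tgt_lang) in cases.items()
    }


def fake_translate(sentences, src_lang, tgt_lang):
    """Stand-in translator: tags each sentence with the language pair instead of translating."""
    return [f"[{src_lang}->{tgt_lang}] {s}" for s in sentences]


demo_cases = {
    "bengali_to_punjabi": (["a", "b"], "ben_Beng", "pan_Guru"),
    "marathi_to_gujarati": (["c"], "mar_Deva", "guj_Gujr"),
    "kannada_to_sanskrit": (["d"], "kan_Knda", "san_Deva"),
}
results = translate_all(demo_cases, fake_translate)
print(results["marathi_to_gujarati"])  # ['[mar_Deva->guj_Gujr] c']
```

In the notebook, `translate_fn` could be a small wrapper around `batch_translate` that reuses one loaded model and tokenizer for all three pairs. That assumes the model is kept in memory rather than deleted after each case.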