This repository contains resources for creating a dataset required for fine-tuning AI4Bharat's IndicTrans2, an existing open-source translation model. Additionally, it includes a notebook for fine-tuning the model using Google Colab.
While the model is already impressive at translating English text to Hindi, it misses fine details, especially in conversational inputs. In particular, it predominantly produces output from a male perspective.
The goal is to fine-tune the model to capture more information from conversational inputs and generate appropriate output.
Since I speak Hindi and English, this repository is centered around Hindi, but it should also work for the other Indic languages that the model supports.
To create the dataset for fine-tuning, follow these steps:
-
Clone the Repository:
git clone https://github.com/saurabhv749/indictrans2-conv
cd indictrans2-conv/dataset
-
Configuration: We will use the Anyscale API for text generation. Obtain your API key by visiting the Credentials Page and paste it into dataset/config.yaml. If your target language is other than Hindi, please change tgt_lang accordingly. I'm using Mixtral-8x7B-Instruct-v0.1, but you can use any model listed here.
-
Generate Data: We will use an LLM to generate some conversational data to start with. First, add topics to dataset/src/topics so that conversations are generated around those topics. Then generate the conversation data by running:
python generate.py
-
Preprocessing: The generated conversations will be in the file dataset/src/samples.txt. Clean it so that each line contains one dialogue, and take special care with quotation marks.
-
Translate: Once you're done formatting dataset/src/samples.txt, run python translate.py to create the translated text along with the original sentences.
-
Fix Translations: Yes, more manual work. Correct the translations, but don't change the spacing or line breaks.
-
Create Dataset: Finally, run python create_dataset.py to create a zip file with data in the required folder structure.
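Part of the clean-up described in the Preprocessing step can be scripted before the manual pass. The following is a minimal sketch and is not part of the repository: the function name clean_samples and the specific rules (dropping blank lines, collapsing repeated whitespace, converting curly quotes to straight quotes) are assumptions about what "clean" means here, so adjust them to your data.

```python
import re
from pathlib import Path

# Map typographic quotes to plain ASCII quotes (an assumed normalization).
QUOTE_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def clean_samples(text: str) -> list[str]:
    """Return one cleaned line per non-empty input line."""
    cleaned = []
    for raw in text.splitlines():
        line = raw.translate(QUOTE_MAP)           # straighten quotation marks
        line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
        if line:                                  # skip blank lines
            cleaned.append(line)
    return cleaned


if __name__ == "__main__":
    # Path is relative to the dataset/ directory; the file is rewritten in place.
    path = Path("src/samples.txt")
    if path.exists():
        lines = clean_samples(path.read_text(encoding="utf-8"))
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

This only handles mechanical issues. Splitting or merging lines so that each one holds exactly one dialogue still needs a manual check.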
The indictrans2-finetune.ipynb file is a Google Colab notebook for fine-tuning the translation model using the created en-indic-exp.zip. Follow the instructions within the notebook to:
- Open the notebook in Colab.
- Upload en-indic-exp.zip.
- Extract the zip file.
- Fine-tune the model.
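The extraction step can be done from a notebook cell with Python's standard zipfile module. This is a minimal sketch, assuming the archive was uploaded to the current working directory. The function name extract_dataset and the default target directory are illustrative, and the notebook itself may use a different method (for example, a shell unzip command).

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path: str, dest_dir: str = ".") -> list[str]:
    """Extract zip_path into dest_dir and return the archive's member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


if __name__ == "__main__":
    # Only runs if the archive is present in the working directory.
    if Path("en-indic-exp.zip").exists():
        for name in extract_dataset("en-indic-exp.zip"):
            print(name)
```

Printing the member names is a quick way to confirm that the folder structure matches what the notebook expects before starting the fine-tuning run.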