Dataset used for finetuning mt5 model #43

Closed
SepehrAminiAfshar opened this issue Jul 8, 2023 · 3 comments

Comments

@SepehrAminiAfshar

Hi
First of all, thank you for your great work on this project. You've achieved some of the best results on the Spider benchmark, and your clear and complete README allowed me to run your code very easily.

I want to see if I can fine-tune a text2natsql model on mT5 like you did on CSpider. I was wondering how much data I would have to create, as I want to build a dataset like CSpider but in the Persian language.

Was CSpider the only dataset used for fine-tuning the mT5 backbone, or were other datasets also used?

@lihaoyang-ruc
Contributor

Yes, I only used the CSpider dataset (the Chinese version of Spider) to fine-tune mT5. It contains 7000 (+1659) training examples.

However, honestly, I don't know how much data you need to prepare. The ultimate performance depends on the quality and quantity of the training data as well as the capabilities of the foundation model (e.g., we use T5 for English Spider and mT5 for Chinese CSpider).

In my experiments, I found that the Chinese capability of mT5 is not that strong, which may be due to the "curse of multilinguality". Therefore, choosing a suitable and powerful language model for Persian is also important.

@lihaoyang-ruc
Contributor

For a detailed description of the term "curse of multilinguality", please refer to https://aclanthology.org/2020.acl-main.747.pdf.

@SepehrAminiAfshar
Author

SepehrAminiAfshar commented Jul 12, 2023

Thank you for your answer and the heads up!
It is much appreciated.
