
Create my own Model for Sentence Similarity/Automated Scoring #72

Closed
dhimasyoga16 opened this issue Aug 10, 2020 · 7 comments
@dhimasyoga16

I have a Wikipedia dump file as my corpus (it's in Indonesian; I've extracted it and converted it to .txt).
How can I fine-tune bert-multilingual-cased on my corpus with BERTScore, so I can have my own model for a specific task such as sentence similarity or automated short-answer scoring?

Or maybe I should do this with the original BERT?
Thank you so much in advance.

@Tiiiger
Owner

Tiiiger commented Aug 10, 2020

Hi @dhimasyoga16, thank you for your interest in this repo. I am not sure what the question is asking.

Are you asking how to fine-tune the bert-multilingual model? For that, check the huggingface examples to see how to continue training bert-multilingual with the masked language modeling objective. See https://github.com/huggingface/transformers/tree/master/examples/language-modeling.
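For concreteness, here is a minimal sketch of continued MLM training with the transformers Trainer API, roughly as the linked examples did it at the time (TextDataset has since been deprecated in favor of the datasets library). The corpus path, output directory, and hyperparameters are placeholders, not values from this thread:

```python
# Minimal continued-pretraining sketch with masked language modeling.
# Paths and hyperparameters below are illustrative placeholders.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Plain-text corpus, e.g. an extracted Wikipedia dump saved as .txt.
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="id_wiki.txt",  # placeholder path to the Indonesian corpus
    block_size=128,
)
# Randomly masks 15% of tokens, the standard BERT MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-id-mlm", num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("mbert-id-mlm")
tokenizer.save_pretrained("mbert-id-mlm")
```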

If you have already fine-tuned bert-multilingual, you can feed in the model path to --model when calling the score function.
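In the Python API the corresponding keyword is model_type, which also accepts a local path; when you pass a custom checkpoint, num_layers must be given explicitly. A minimal sketch, with placeholder sentences and the placeholder checkpoint path from the sketch above:

```python
# Scoring with a locally fine-tuned checkpoint via the Python API.
from bert_score import score

cands = ["jawaban siswa ..."]  # candidate sentences (placeholders)
refs = ["kunci jawaban ..."]   # reference sentences (placeholders)

# model_type accepts a local directory saved with save_pretrained;
# num_layers selects which hidden layer the embeddings come from.
P, R, F1 = score(
    cands,
    refs,
    model_type="mbert-id-mlm",  # placeholder path
    num_layers=9,               # illustrative value; tune on validation data
)
print(F1.mean().item())
```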

Feel free to follow up with more questions.

@dhimasyoga16
Author

Hi, thank you so much for the quick assist and the answer.
I'm also sorry my question was hard to understand, my bad.

Can I ask one more question?
I've done feature extraction by running extract_features.py, and it generated a 17.3 GB JSON file. Can I use that JSON file as my model?
I want BERTScore to analyze Indonesian sentences/texts better.

Thank you so much once again :)

@Tiiiger
Owner

Tiiiger commented Aug 11, 2020

Hi @dhimasyoga16, which extract_features.py file are you talking about? Is it in this repo?

If you have precomputed the features, you can modify the code (https://github.com/Tiiiger/bert_score/blob/master/bert_score/utils.py#L253) to load the features instead of computing them again.
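A rough illustration of that idea, not the actual utils.py code: cache embeddings on disk and reload them instead of recomputing. All names and paths here are placeholders:

```python
# Illustrative caching pattern only; this is not the bert_score API.
import torch

FEATURE_CACHE = "precomputed_features.pt"  # placeholder path

def load_or_compute_embeddings(sentences, compute_fn):
    """Return cached embeddings when available; compute and cache the rest."""
    try:
        cache = torch.load(FEATURE_CACHE)
    except FileNotFoundError:
        cache = {}
    missing = [s for s in sentences if s not in cache]
    if missing:
        for s, emb in zip(missing, compute_fn(missing)):
            cache[s] = emb
        torch.save(cache, FEATURE_CACHE)
    return [cache[s] for s in sentences]
```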

Tiiiger closed this as completed on Aug 13, 2020
@dhimasyoga16
Author

Hi, sorry for the inactivity on this issue.
Referring to this link: https://github.com/huggingface/transformers/tree/master/examples/language-modeling

How can I fine-tune the bert-multilingual model for Indonesian? Can I use a Wikipedia dump file as a corpus?

@Tiiiger
Owner

Tiiiger commented Aug 15, 2020

Hi @dhimasyoga16, this question is better posed to the huggingface repo. Hopefully they will have detailed instructions.

We are not really experts on this topic.

@dhimasyoga16
Author

Hi, I've successfully created my language model using huggingface transformers.

When I run a test (using my own model, of course), why does --num_layers affect the score?
For example, --num_layers 2 gives me a lower score than --num_layers 6.
Is there a detailed explanation of this?
And which --num_layers will give better scoring accuracy?

Sorry for so many questions, I'm new to NLP.

@Tiiiger
Owner

Tiiiger commented Aug 19, 2020

Hi @dhimasyoga16, please see our paper for the effect of using different --num_layers. Basically, this argument controls which layer of pre-trained representations you are using. It is hard to say which --num_layers would work best for your application without seeing any validation data on our side.
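One common way to choose is a small sweep: score a validation set with each layer and keep the layer whose F1 correlates best with human judgments. A sketch, where load_validation_data is a hypothetical helper and the checkpoint path is the placeholder from earlier:

```python
# Layer sweep on a validation set with human similarity labels.
from scipy.stats import pearsonr
from bert_score import score

cands, refs, human = load_validation_data()  # hypothetical helper

best = None
for layer in range(1, 13):  # bert-base models have 12 layers
    _, _, F1 = score(cands, refs, model_type="mbert-id-mlm", num_layers=layer)
    r, _ = pearsonr(F1.tolist(), human)
    print(f"layer {layer}: pearson r = {r:.3f}")
    if best is None or r > best[1]:
        best = (layer, r)
print("best layer:", best)
```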
