Questions #1

Open · sachaarbonel opened this issue Jul 31, 2020 · 14 comments

@sachaarbonel commented Jul 31, 2020

Hi @rdenadai, thanks for your great work! I was playing around with your model on Hugging Face and I got results starting with `Ġ`, is that normal? Also, I wanted to know if you'd be willing to collaborate on a fine-tuned POS model. My understanding is that we need a CoNLL-U dataset such as UD_Portuguese-GSD and to clean it up to fit the format used by @mrm8488 in his notebook. I started working on a tool to clean up such datasets.

@rdenadai (Owner)

Hi @sachaarbonel, thanks for taking the time to test the model... I'm thinking about improving it further; I find it sometimes wonky, but for now doing so would cost me around US$ 180.00 on GCP (I have a much bigger corpus [check out my repo with a Word2Vec model trained on this new corpus] to train on, and could perhaps use a bigger vocab and more training epochs).

Anyway, as for your question about the `Ġ`, I think it is normal given the tokenizer used (here I'm inferring from other models trained with ByteLevelBPETokenizer, like roberta-base), but I need to explore this space further to give you a more precise answer.
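For illustration, a minimal snippet showing the same behavior with roberta-base (which uses the same ByteLevelBPETokenizer family): the `Ġ` simply marks a token that was preceded by a space in the raw text, so it is expected output rather than a bug.

```python
from transformers import AutoTokenizer

# roberta-base uses a byte-level BPE vocabulary, like the model in this
# repo; Ġ encodes "this token follows a space" in the raw input.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

print(tokenizer.tokenize("Hello world"))
# ['Hello', 'Ġworld']  <- Ġ marks the leading space before "world"
```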

As for the collaboration, of course I could, and I really appreciate it. For the record, my idea for the model's next step would be to train it on a POS-tagged dataset, and then on my main area of research, which is sentiment analysis.

I'm already searching beyond the UD dataset for Portuguese POS tagging... and found the following two great repos on GitHub.

@sachaarbonel (Author)

Nice! I didn't find those datasets when I researched the subject. Wow, $180 sounds like a lot! Does that correspond to one week of training? I'll try to finish up the tool to clean up the datasets. I don't have time this weekend, maybe next week. I'll keep you updated.

@mrm8488 commented Jul 31, 2020 via email

@rdenadai (Owner)

> Nice! I didn't find those datasets when I researched the subject. Wow, $180 sounds like a lot! Does that correspond to one week of training? I'll try to finish up the tool to clean up the datasets. I don't have time this weekend, maybe next week. I'll keep you updated.

It corresponds to almost ~4 days of training on a T4 or P100, depending... but the dataset I'm using now is almost 3x bigger than the one I previously trained on.

No worries about the datasets... I'm also going to explore them, as I mentioned above.

@rdenadai (Owner) commented Jul 31, 2020

@mrm8488 I did need to change the LineByLineTextDataset... I built my own class based on it, since I'm now planning to train on a 2.5 GB dataset.

The model on the Hugging Face website is one trained on a 900 MB corpus... a small corpus.

Please check out the code here => https://github.com/rdenadai/BR-BERTo/blob/master/transformer.py#L16

It lazily loads each line of the dataset using pandas' read_csv method...
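A minimal sketch of that idea, assuming a corpus with one sentence per line and no tab characters (the class and variable names here are illustrative, not the exact code from transformer.py):

```python
import csv

import pandas as pd
import torch
from torch.utils.data import Dataset


class LazyLineByLineDataset(Dataset):
    """Keeps only raw strings in memory; tokenizes per item on access."""

    def __init__(self, tokenizer, file_path, block_size=128):
        self.tokenizer = tokenizer
        self.block_size = block_size
        # A separator that never occurs in the text yields a single
        # "text" column via pandas' fast C parser.
        self.lines = pd.read_csv(
            file_path,
            sep="\t",
            header=None,
            names=["text"],
            quoting=csv.QUOTE_NONE,
        )["text"]

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # Tokenization is deferred to access time, so the encoded
        # tensors are never all resident in memory at once.
        enc = self.tokenizer(
            str(self.lines.iloc[idx]),
            truncation=True,
            max_length=self.block_size,
        )
        return torch.tensor(enc["input_ids"], dtype=torch.long)
```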

@mrm8488 commented Jul 31, 2020 via email

@rdenadai (Owner) commented Jul 31, 2020

I didn't try with that much data... only 2.5 GB... one thing you could try in my custom class is, instead of loading the file with pandas, changing that line to use Dask... that way you could scale up much more, and since Dask uses pyarrow internally, you could use it to read Parquet files.

The simple approach would be to point pandas at your file and see if it loads... then try Dask... and then change the class to better fit your needs.
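A hedged sketch of that swap, under the same one-column, one-sentence-per-line assumption as above (the file paths and blocksize are illustrative):

```python
import dask.dataframe as dd

# Dask partitions the file and loads partitions on demand instead of
# reading everything into memory up front.
lines = dd.read_csv(
    "corpus.txt",        # assumed path, one sentence per line
    sep="\t",            # assumes no tab characters in the text
    header=None,
    names=["text"],
    quoting=3,           # csv.QUOTE_NONE
    blocksize="64MB",    # keeps each partition memory-friendly
)

# Since Dask reads Parquet through pyarrow, a columnar corpus also works:
# lines = dd.read_parquet("corpus.parquet", columns=["text"])
```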

@mrm8488 commented Jul 31, 2020 via email

@mrm8488 commented Jul 31, 2020 via email

@rdenadai (Owner) commented Jul 31, 2020 via email

@sachaarbonel (Author)

I can help clean datasets if you guys need it.

@rdenadai (Owner) commented Aug 1, 2020

@sachaarbonel thanks, all help is appreciated. Since I'm thinking of rerunning BR_BERTo (perhaps with different parameters) and then doing the POS tagger, I have some time to clean up the POS-tagger datasets I mentioned above, and of course to work on my second task for this model, which is sentiment analysis (I need to build that dataset, since I didn't find a good one for Portuguese; I did have one, but revisiting it showed me that I should take better care with it).

In case you guys want the dataset I'm using to train BR_BERTo, just say so and I can send a Google Drive link so you can download it.

@sachaarbonel (Author)

I guess creating a sentiment analysis dataset should be feasible by applying a translation model from Hugging Face to an English dataset.
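A rough sketch of that idea (the checkpoint name and the language-token prefix are assumptions here, so check the model card before relying on them):

```python
from transformers import pipeline

# A MarianMT English->Romance checkpoint; multilingual targets are
# selected with a ">>lang<<" prefix token (assumed format).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ROMANCE")

english = [
    ("The movie was surprisingly good.", "positive"),
    ("I would never watch this again.", "negative"),
]

# Translate the text while carrying the sentiment label over unchanged.
translated = [
    (translator(f">>pt_br<< {text}")[0]["translation_text"], label)
    for text, label in english
]
print(translated)
```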

@rdenadai (Owner) commented Aug 1, 2020

Yeah, that's one way... and I'm thinking of doing it; one problem is checking whether each phrase was translated correctly or not.

And most datasets only have 2-3 labels (positive/negative/neutral); since affective computing is my area of research interest, I'm looking for a wider range of emotions.
