Questions #1

Open · sachaarbonel opened this issue Jul 31, 2020 · 14 comments

@sachaarbonel commented Jul 31, 2020

Hi @rdenadai, thanks for your great work! I was playing around with your model on Hugging Face and I got results starting with `Ġ`, is that normal? Also, I wanted to know if you'd be willing to collaborate on a fine-tuned POS model. My understanding is that we need a CoNLL-U dataset such as UD_Portuguese-GSD and to clean it up to fit the format used by @mrm8488 in his notebook. I started working on a tool to clean up such datasets.

@rdenadai (Owner)

Hi @sachaarbonel, thanks for taking the time to test the model... I'm thinking about improving it further; I find it sometimes wonky, but for now doing so would cost me around US$ 180.00 on GCP (I have a much bigger corpus [check out my repo with a Word2Vec model trained on this new corpus] to train on, and could perhaps use a bigger vocab and more training epochs).

Anyway, as for your question about the `Ġ`, I think it is normal given the tokenizer used (here I'm inferring from other models trained with ByteLevelBPETokenizer, like roberta-base), but I need to explore this space further to give you a more precise answer.
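For illustration, a minimal snippet showing the same behavior with roberta-base (which uses the same ByteLevelBPETokenizer family): the `Ġ` simply marks a token that was preceded by a space in the raw text, so it is expected output rather than a bug.

```python
from transformers import AutoTokenizer

# roberta-base uses a byte-level BPE vocabulary, like the model in this
# repo; Ġ encodes "this token follows a space" in the raw input.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

print(tokenizer.tokenize("Hello world"))
# ['Hello', 'Ġworld']  <- Ġ marks the leading space before "world"
```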

As for the collaboration, of course I could, and I really appreciate it. For the record, my idea for the model's next step would be to train it on a POS-tagged dataset, and then on my main area of research, which is sentiment analysis.

I'm already searching beyond the UD dataset for Portuguese POS tagging... and found the following two great repos on GitHub.

@sachaarbonel (Author)

Nice! I didn't find those datasets when I researched the subject. Wow, $180 sounds like a lot! Does that correspond to one week of training? I'll try to finish up the tool to clean up the datasets. I don't have time this weekend, maybe next week. I'll keep you updated.

@mrm8488 commented Jul 31, 2020 via email

@rdenadai (Owner)

> Nice! I didn't find those datasets when I researched the subject. Wow, $180 sounds like a lot! Does that correspond to one week of training? I'll try to finish up the tool to clean up the datasets. I don't have time this weekend, maybe next week. I'll keep you updated.

It corresponds to almost ~4 days of training on a T4 or P100, depending... but the dataset I'm using now is almost 3x bigger than the one I previously trained on.

No worries about the datasets... I'm also going to explore them, as I mentioned above.

@rdenadai (Owner) commented Jul 31, 2020

@mrm8488 I did need to change the LineByLineTextDataset... I built my own class based on it, since I'm now planning to train on a 2.5 GB dataset.

The model on the Hugging Face website is one trained on a 900 MB corpus... a small corpus.

Please check out the code here => https://github.com/rdenadai/BR-BERTo/blob/master/transformer.py#L16

It lazily loads each line of the dataset using pandas' read_csv method...
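A minimal sketch of that idea, assuming a corpus with one sentence per line and no tab characters (the class and variable names here are illustrative, not the exact code from transformer.py):

```python
import csv

import pandas as pd
import torch
from torch.utils.data import Dataset


class LazyLineByLineDataset(Dataset):
    """Keeps only raw strings in memory; tokenizes per item on access."""

    def __init__(self, tokenizer, file_path, block_size=128):
        self.tokenizer = tokenizer
        self.block_size = block_size
        # A separator that never occurs in the text yields a single
        # "text" column via pandas' fast C parser.
        self.lines = pd.read_csv(
            file_path,
            sep="\t",
            header=None,
            names=["text"],
            quoting=csv.QUOTE_NONE,
        )["text"]

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # Tokenization is deferred to access time, so the encoded
        # tensors are never all resident in memory at once.
        enc = self.tokenizer(
            str(self.lines.iloc[idx]),
            truncation=True,
            max_length=self.block_size,
        )
        return torch.tensor(enc["input_ids"], dtype=torch.long)
```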

@mrm8488 commented Jul 31, 2020 via email

@rdenadai (Owner) commented Jul 31, 2020

I didn't try with that much data... only 2.5 GB... one thing you could try in my custom class is, instead of loading the file with pandas, changing that line to use Dask... that way you could scale up much more, and since Dask uses pyarrow internally, you could use it to read Parquet files.

The simple approach would be to point pandas at your file and see if it loads... then try Dask... and then change the class to better fit your needs.
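A hedged sketch of that swap, under the same one-column, one-sentence-per-line assumption as above (the file paths and blocksize are illustrative):

```python
import dask.dataframe as dd

# Dask partitions the file and loads partitions on demand instead of
# reading everything into memory up front.
lines = dd.read_csv(
    "corpus.txt",        # assumed path, one sentence per line
    sep="\t",            # assumes no tab characters in the text
    header=None,
    names=["text"],
    quoting=3,           # csv.QUOTE_NONE
    blocksize="64MB",    # keeps each partition memory-friendly
)

# Since Dask reads Parquet through pyarrow, a columnar corpus also works:
# lines = dd.read_parquet("corpus.parquet", columns=["text"])
```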

@mrm8488 commented Jul 31, 2020 via email

@mrm8488 commented Jul 31, 2020 via email

@rdenadai (Owner) commented Jul 31, 2020 via email

@sachaarbonel (Author)

I can help clean datasets if you guys need it.

@rdenadai (Owner) commented Aug 1, 2020

@sachaarbonel thanks, all help is appreciated. Since I'm thinking of rerunning BR_BERTo (perhaps with different parameters) and then doing the POS tagger, I have some time to clean up the POS-tagger datasets I mentioned above, and of course to work on my second task for this model, which is sentiment analysis (I need to build that dataset, since I didn't find a good one for Portuguese; I did have one, but revisiting it showed me that I should take better care with it).

In case you guys want the dataset I'm using to train BR_BERTo, just say so and I can send a Google Drive link so you can download it.

@sachaarbonel (Author)

I guess creating a sentiment analysis dataset should be feasible by applying a translation model from Hugging Face to an English dataset.
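A rough sketch of that idea (the checkpoint name and the language-token prefix are assumptions here, so check the model card before relying on them):

```python
from transformers import pipeline

# A MarianMT English->Romance checkpoint; multilingual targets are
# selected with a ">>lang<<" prefix token (assumed format).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ROMANCE")

english = [
    ("The movie was surprisingly good.", "positive"),
    ("I would never watch this again.", "negative"),
]

# Translate the text while carrying the sentiment label over unchanged.
translated = [
    (translator(f">>pt_br<< {text}")[0]["translation_text"], label)
    for text, label in english
]
print(translated)
```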

@rdenadai (Owner) commented Aug 1, 2020

Yeah, that's one way... and I'm thinking of doing it; one problem is checking whether each phrase was translated correctly or not.

And most datasets only have 2-3 labels (positive/negative/neutral); since affective computing is my area of research interest, I'm looking for a wider range of emotions.
