
AraBert #18

Open
zaidalyafeai opened this issue Jan 30, 2020 · 38 comments

Comments

@zaidalyafeai (Owner) commented Jan 30, 2020

Hey All,

This is a temporary issue to discuss the idea of training a BERT model specific to Arabic: AraBert. We will move to another repository once we have a clear understanding of the problem and how to tackle it, and once we have experts in the field on board. There are three main issues to discuss here, and I want your opinions on them:

  1. Dataset: we can use the available Wikipedia dumps. If you have more suggestions, let us know.
  2. Models: we can use the available BERT models.
  3. Training: we need a sponsor to train in the cloud. If you can help with that, please let us know.

If you have any comments on any of the above points, don't hesitate to respond. More importantly, if you have any experience in training large models, let us know. Open-source contribution is a slow and lengthy process, and it might take us some time to finish the project. If you are planning to contribute, please be sure to set aside some time for this project.

Thanks
Zaid

@itinawi commented Jan 30, 2020

How is this different from Multilingual BERT, which was trained on the Wikipedia dumps of 100+ languages?

It's a great project nonetheless, and I'd be happy to contribute. We would benefit a lot more if we train on a bigger Arabic corpus than just Wikipedia.

Compute remains an issue so we would need to find sponsors for cloud training.

@zaidalyafeai (Owner, Author) commented Jan 30, 2020

@itinawi That is a great point. As you said, multilingual BERT was trained on many languages, which makes it weaker on language-specific tasks. Furthermore, many other language-specific models have been shown to outperform multilingual BERT on many tasks. For instance, Spanish BERT outperforms multilingual BERT, as shown in this repository. There are other benchmarks, on French for instance, as shown by CamemBERT. Moreover, multilingual BERT doesn't perform well on low-resource languages, not to mention the vocabulary constraints of mBERT.

@k-halid commented Jan 30, 2020

Hello @zaidalyafeai,

AraBert is a good idea for a more robust Arabic language model. In addition, the original BERT uses many components, such as the BERT tokenizer, that did not account for Arabic. Therefore, it is necessary to have an AraBert, which will not only contribute to Arabic natural language processing but may also highlight some important points related to NLP and more unified models in the future.

As for me, I have trained the vanilla transformer as a language model before, so I would be happy to contribute.

@mobarmg commented Jan 30, 2020

Hi Zaid,
I am interested in your project, and we at the Naseej AI Lab could sponsor it. What would be the estimated computing resources needed for the project?

@abdallah197 commented Jan 30, 2020

Hi @zaidalyafeai
I'm more than interested. I have been working with BERT for a while. Would it be possible to create a Slack channel?

@EsamGhaleb commented Jan 30, 2020

Cool idea.

I could contribute to coding and developing, but I don't have specific experience in NLP.

BTW, Al-Bert would be a nice name as well :)

@azamalfayez commented Jan 30, 2020

I am interested, count me in.

@Fatima-200159617 commented Jan 30, 2020

Salam alaykoum Zaid,
I am sending this message on behalf of the bigIR group, the "Information Retrieval and Big Data" research group at Qatar University.
Thanks for this initiative; we are interested in contributing to this project. Please let us know how we can help.
Our group published the ArabicWeb16 dataset, with 150M Arabic web pages and significant coverage of dialectal Arabic as well as Modern Standard Arabic.
The dataset can be a great candidate for training the BERT model. Please find below the link to the dataset paper.
https://dl.acm.org/doi/abs/10.1145/2911451.2914677

@HashimHL commented Jan 30, 2020

Regarding the dataset:

I thought about social media sites like Twitter, Facebook, ASKfm, etc., but their data isn't clean or long enough; it requires a lot of preprocessing, and even then it's hard to obtain the data while obeying their policies.

There is a Wikipedia-like site called mawdoo3.com whose articles are clean, free of misspellings, and written in classical Arabic. The data on the website can be used under "fair use" rules, e.g. for educational purposes. The website claims more than 150k articles; for comparison, Wikipedia has 118k Arabic articles, although article length varies, so that might not be a fair comparison.

I also thought about free Arabic ebooks, but most of what I found are old, use difficult Arabic, and aren't text PDFs but scanned images, so pulling the text out of them will not be easy. I'm also not sure about the ethics, since there may be copyright issues.

If I had the opportunity to help, I would happily do so.

If I had the opportunity to help I would happily do so.

@KhaledEssam commented Jan 30, 2020

I think using some data from OPUS (http://opus.nlpl.eu/), which contains subtitles, news, and other resources, might help the model generalize beyond Wikipedia-like use cases.

@mani144 commented Jan 30, 2020

We can use books written in Arabic in PDF format, but I'm not sure whether these books are scanned as pictures or stored as text.

@MagedSaeed (Collaborator) commented Jan 30, 2020

@mani144 You proposed a very nice idea. Is anyone here familiar with a good Arabic OCR? These scanned books are not handwritten, which might make it easier to extract the text. What do you think?

@mobarmg commented Jan 30, 2020

For Arabic corpora, we could provide you with our naseej.net news database [1998-2012], more than 240K pages.

@moaaztaha commented Jan 30, 2020

We can use videos with Arabic captions on YouTube, like videos from BBC Arabic.

@zaidalyafeai (Owner, Author) commented Jan 30, 2020

@abdallah197 yes, at this moment we need a Slack channel. Would you like to create one for this purpose?

@abedkhooli commented Jan 30, 2020

It seems that compute is the easiest part, as some have already offered to sponsor it (either GPU or TPU).
Models: we can use transformers based on English BERT (or train from scratch); maybe test both.
Corpus: I suggest we curate a large enough one with good coverage of topics and styles, perhaps a mixture of Arabic Wikipedia, Mawdoo3's content (if they are willing to share it), QU, Naseej, and some books (we would need to do some processing, including de-duplication of identical/similar articles).
I trained a MultiFiT model a few days ago (based on Arabic Wikipedia) and it performed well on classification tasks (https://www.linkedin.com/feed/update/urn:li:activity:6627500876145672192/). I agree multilingual BERT is almost useless for Arabic (very limited vocabulary representation).

@NightXBurn commented Jan 30, 2020

Hi everyone, very happy to see the AraBert project; thank you @zaidalyafeai. I want to contribute some ideas that could make AraBert more performant.

  • Since diacritics affect word semantics in Arabic, we should train AraBert on diacritized text, not only non-diacritized words. For tokenization, we can use the BPE-dropout method available in the YouTokenToMe library, which can handle diacritic tokens and also outperforms the vanilla BERT tokenizer.

  • For the dataset, in addition to the corpora proposed previously, we can add the books in the Tashkeela corpus, which contain old writing styles and diacritized words.

  • Multilingual BERT gives an amazing representation of Arabic words, thanks to the Google team, and it can be used to tackle state-of-the-art problems in Arabic. It uses the WordPiece tokenizer, which can handle vocabulary-rich languages such as Arabic; in the linked task, the models that apply multilingual BERT are the best. However, it cannot handle diacritics, a phenomenon that occurs in Arabic.
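For intuition, BPE-dropout randomly skips merges during encoding, so the same word yields varied subword segmentations across epochs (which is what makes it robust to rare forms such as diacritized words). A toy pure-Python sketch of the idea; this is an illustration of the technique, not YouTokenToMe's actual implementation:

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Greedy BPE encoding with merge dropout.

    merges: ordered list of (left, right) merge rules.
    Each applicable merge is skipped with probability `dropout`,
    so higher dropout yields finer-grained segmentations.
    """
    rng = rng or random.Random()
    tokens = list(word)
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right and rng.random() >= dropout:
                tokens[i:i + 2] = [left + right]  # apply the merge in place
            else:
                i += 1
    return tokens

# dropout=0.0 behaves like plain BPE; dropout=1.0 falls back to characters
merges = [("a", "b"), ("ab", "c")]
print(bpe_encode("abc", merges, dropout=0.0))  # ['abc']
print(bpe_encode("abc", merges, dropout=1.0))  # ['a', 'b', 'c']
```

With an intermediate dropout (e.g. 0.1, the value suggested in the BPE-dropout paper), the segmentation varies from call to call.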

@k-halid commented Jan 30, 2020

Hello everyone,

I have some ideas related to the points mentioned in @zaidalyafeai's post. First of all, I think we can use the El-Khair corpus, which is, to my knowledge, one of the largest unlabeled Arabic corpora. We can also use the Arabic Wikipedia corpus.

For the model, I suggest we use both TensorFlow and PyTorch BERT models if possible. First and most importantly, this will open more opportunities for ML developers interested in Arabic NLP to extend this work to specific NLP tasks. Second, I am more familiar with PyTorch, and I would like to contribute to this work because I am heavily interested in Arabic NLP.

@amaloraini commented Jan 30, 2020

Salam,

Thanks @zaidalyafeai for the brilliant idea.

I'm currently waiting for BERT-base to finish training on Arabic Wikipedia articles. I'm using TF, and it has been more than a week, but it should finish soon. I will see if it outperforms the original multilingual BERT, and I will share all my findings.

We have 3 GeForce GTX 1080 Ti cards, and they work fine if you limit the batch size to 8 (though training takes too long to finish).

I have done the following preprocessing steps:
1- Removed all diacritics
2- Normalized "alif"
3- Used BPE dropout to create the vocab and to tokenize the words
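Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal sketch assuming the standard Unicode tashkeel range (U+064B through U+0652) and the common hamza/madda alif variants; the exact character set used above may differ:

```python
import re

# Arabic diacritics (tashkeel): fathatan through sukun, U+064B-U+0652
DIACRITICS = re.compile("[\u064B-\u0652]")
# Alif variants: madda (U+0622), hamza above (U+0623), hamza below (U+0625)
ALIF_VARIANTS = re.compile("[\u0622\u0623\u0625]")

def preprocess(text: str) -> str:
    """Remove diacritics, then normalize alif variants to bare alif (U+0627)."""
    text = DIACRITICS.sub("", text)
    return ALIF_VARIANTS.sub("\u0627", text)

# e.g. diacritized "أَحْمَد" becomes "احمد"
print(preprocess("\u0623\u064E\u062D\u0652\u0645\u064E\u062F"))
```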

If someone has access to more resources, it might be a good idea to try out RoBERTa.

I will be happy to contribute, so please let me know if you need any additional information.

Abdulrahman

@OS47 commented Jan 30, 2020

Hi Zaid, I would love to contribute with the community. I haven't reached the BERT concepts yet, but I have a good background in DL & NLP.

Hoping the community grows and serves Arabic content.

@fadybaly commented Jan 30, 2020

Hello everyone,

My friend @WissamAntoun and I have just finished training AraBERT. We trained it on a 21GB Arabic corpus with 3.7B tokens.

As for the preprocessing, we went with two approaches; both will be uploaded to Hugging Face very shortly.

For the first approach we:
1- removed diacritics
2- replaced URLs, emails, numbers, and mentions with unique tokens
3- used the Farasa segmenter to separate the connective letters (huruf al-wasl) و, ل, ب, ف, ال, and trained a SentencePiece vocab with a vocab size of 64k.

For the second approach, we didn't replace the numbers with unique tokens, and we used the Farasa segmenter more heavily to also remove suffixes and prefixes.
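Step 2 of the first approach can be sketched with simple regexes. This is a hypothetical illustration; the actual placeholder tokens and patterns AraBERT uses may differ:

```python
import re

# Hypothetical placeholder tokens; the real AraBERT tokens may differ.
PATTERNS = [
    (re.compile(r"https?://\S+|www\.\S+"), "[URL]"),   # URLs
    (re.compile(r"\S+@\S+\.\S+"), "[EMAIL]"),          # email addresses
    (re.compile(r"@\w+"), "[USER]"),                   # mentions
    (re.compile(r"\d+"), "[NUM]"),                     # numbers
]

def replace_special_tokens(text: str) -> str:
    """Replace URLs, emails, mentions, and numbers with unique tokens.

    Order matters: URLs and emails are replaced before mentions and
    numbers so their '@' signs and digits are not matched twice.
    """
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

print(replace_special_tokens("see https://x.com and mail a@b.com ping @zaid in 2020"))
# see [URL] and mail [EMAIL] ping [USER] in [NUM]
```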

Regarding downstream tasks, we evaluated on several sentiment analysis datasets as well as NER and QA; the models outperformed SOTA on all tasks. The results will be published soon, after we submit a paper detailing our approaches to OSACT4.

We are also planning to make it compatible with the transformers library (adapting the tokenizer).

@itinawi commented Jan 30, 2020

That's awesome news Fady! Moving the needle forward in a field that needs much love. Are you guys planning on releasing the pre-trained model? I'm sure a lot of researchers here would love to play around with that.

@abedkhooli commented Jan 30, 2020

This is really great, @fadybaly and @WissamAntoun; I'm looking forward to checking out the models on Hugging Face. Can you describe the corpus content used in the training?

@fadybaly commented Jan 30, 2020

> That's awesome news Fady! Moving the needle forward in a field that needs much love. Are you guys planning on releasing the pre-trained model? I'm sure a lot of researchers here would love to play around with that.

Of course. We're releasing a TensorFlow version to Hugging Face tomorrow, with a PyTorch version following shortly afterwards.

@fadybaly commented Jan 30, 2020

> This is really great @fadybaly and @WissamAntoun - looking forward to check the models @hf. Can you describe the corpus content used in the training?

It's trained on Modern Standard Arabic data collected from open-source datasets and from heavily cleaned news articles scraped from the web.

@mobarmg commented Jan 30, 2020

Great efforts @fadybaly, I cannot wait to test your model.

@ajamjoom commented Jan 30, 2020

> of course, we're releasing a tensorflow version to huggingface tomorrow, with pytorch following it shortly afterwards

Amazing! It would be great if you could share a link here once it's ready. Thanks.

Looking forward to trying it out.

@zaidalyafeai (Owner, Author) commented Jan 30, 2020

@fadybaly thanks for your efforts. We will try to follow your steps to train a larger model. Btw, are you affiliated with Hugging Face? Why did you decide to publish the models there first? I mean, you could create your own repo with enough information and explanation, then upload the models there if you are interested!

@abedkhooli commented Jan 31, 2020

> it's trained on modern standard arabic data collected from open source datasets and heavily cleaned web scraped news articles websites.

Can you please share random samples of the preprocessed corpus (just before tokenization)? Say around 10 MB of randomly selected texts, limited to one page per text item (first approach).

@abedkhooli commented Jan 31, 2020

This French BERT model may be helpful for the coding work: https://github.com/getalp/Flaubert
(in addition to @fadybaly and @WissamAntoun's work if the code is released, of course).
If a larger corpus is to be used, I suggest sharing it on a central server with enough space so that different preprocessing/selection/configuration options can be tried. Compute sponsors can link nodes as well (some control on resource usage is needed, e.g. approving models to train for x days).

@WissamAntoun commented Jan 31, 2020

Hi everyone,

@fadybaly and I managed to put together a small demo on fine-tuning our araBERT.
(Colab: https://colab.research.google.com/drive/1KSy89fAkWt6EGfnFQElDjXrBror9lIZh,
araBERT TF ckpt: https://drive.google.com/open?id=1GQr8ue04rsvkOO3GieYTWnCOmB3-CNJz)

To use araBERT you will need access to the Farasa segmenter, since in this version of araBERT we segmented the words before applying SentencePiece tokenization. (We are currently training another version of araBERT that won't use segmented words, but we think the performance might suffer a bit because the effective vocabulary will be heavily reduced; we expect it to be done by next week.)

@zaidalyafeai, no, we aren't affiliated with Hugging Face, but we think releasing a compatible version is really helpful for the community. We also have a fork of the BERT repo with the tokenizer made compatible with our vocab; we still need to figure out the best way to make Hugging Face's BertTokenizer compatible.

We also have the ability to train a BERT-large model, but we prioritized the base model because, while BERT-large will perform better, it is still impractical for most applications due to the compute requirements.

We are also finishing the write-up for an arXiv preprint detailing the pretraining process and data.

@zaidalyafeai (Owner, Author) commented Jan 31, 2020

@WissamAntoun, very happy to see some NLP work on Arabic. I hope it gets the recognition it deserves (I will help with that, of course). I encourage you to spend some time updating the fork's README so future researchers can benefit from your work.

@abedkhooli commented Feb 2, 2020

> Our group published the ArabicWeb16 dataset with 150M Arabic Web pages with significant coverage of dialectal Arabic as well as Modern Standard Arabic.
> The dataset can be a great candidate for training the BERT model. Please find below the link to the dataset paper.
> https://dl.acm.org/doi/abs/10.1145/2911451.2914677

@Fatima-200159617 Is http://bigir1.qu.edu.qa:3000/ down? Is there an open format (txt, csv) version of the labeled datasets?

@Fatima-200159617 commented Feb 2, 2020

> Our group published the ArabicWeb16 dataset with 150M Arabic Web pages with significant coverage of dialectal Arabic as well as Modern Standard Arabic.
> The dataset can be a great candidate for training the BERT model. Please find below the link to the dataset paper.
> https://dl.acm.org/doi/abs/10.1145/2911451.2914677

> @Fatima-200159617 Is http://bigir1.qu.edu.qa:3000/ down? Is there an open format (txt, csv) version of the labeled datasets?

The dataset is available and can be accessed from the link below:
https://sites.google.com/view/arabicweb16/home
You can search the entire ArabicWeb16 dataset from the website. Downloading the dataset requires signing the agreement first; then you can download it as WARC files.
Which labeled dataset are you planning to use, the content-based categorized dataset or the dialect-labelled dataset?
The server is under maintenance now; I will let you know as soon as it is back up.

@zaidalyafeai (Owner, Author) commented Feb 2, 2020

Hey All,

Thanks for the informative feedback and the resources you shared in this thread. I didn't expect to find such a thoughtful and brilliant community that cares about Arabic research. That being said, as shown in this thread, there are already efforts to publish BERT for Arabic, and we don't want to do repetitive work. Our plans will change slightly to build on that work and hopefully extend the community's efforts to other models. Please keep in mind that our main objective is to enrich Arabic research despite the lack of support and resources. Personally, I am not looking for any payback besides learning from this awesome community. In short, if you would like to continue this journey to learn, be challenged, and hopefully produce something fruitful, give a thumbs up and I will create a new repository to discuss our next steps.

~Zaid

@Jeddadh commented Feb 2, 2020

Hello,

This is a great idea! I am very interested in the project. I have good experience with BERT (training and fine-tuning NLP models in general) and a good understanding of the Arabic language (grammar, vocabulary, ...).
I will be glad to contribute to the project.

@abedkhooli commented Feb 3, 2020

> Which labeled dataset are you planning to use, the content-based categorized dataset or the dialect-labelled dataset?

Actually, I found https://sites.google.com/view/arabicweb16/home while looking for the dataset and tried to see a sample of the text contents to assess its suitability for an Arabic BERT model. I will wait until the server is up (I have no access to resources that can handle the full dataset).

@abedkhooli commented Feb 3, 2020

> Our plans will change slightly to build on that work and hopefully extend the community to work on other models.

I think there is room for various efforts in this field. Some have already taken place, and those need not be repeated (but are great to learn from). Collective efforts are best but may take time for larger teams (less agility), so smaller teams may be a good option (provided infrastructure is available).
