AraBert #18
Comments
How is this different from Multilingual BERT, which was trained on the Wikipedia dumps of 100+ languages? It's a great project nonetheless, and I'd be happy to contribute. We would benefit much more from training on a larger Arabic corpus than Wikipedia alone. Compute remains an issue, so we would need to find sponsors for cloud training.
@itinawi That is a great point. As you said, Multilingual BERT was trained on many languages, which makes it weaker on language-specific tasks. Furthermore, many other language-specific models have been shown to outperform Multilingual BERT on many tasks. For instance, Spanish BERT outperforms Multilingual BERT, as shown in this repository. There are also benchmarks on French, for instance using CamemBERT. Moreover, Multilingual BERT doesn't perform well on low-resource languages, not to mention the vocabulary constraints of mBERT.
Hello @zaidalyafeai, AraBert is a good idea for a more robust Arabic language model. In addition, the original BERT uses components such as the BERT tokenizer that did not account for Arabic. Therefore, it is necessary to have AraBert, which will not only contribute to Arabic natural language processing, but may also highlight important points related to natural language processing and more unified models in the future. As for me, I have trained the vanilla Transformer as a language model before, so I would be happy to contribute.
Hi Zaid,
Hi @zaidalyafeai |
Cool idea. I could contribute to coding and development, but I don't have specific experience in NLP. BTW, Al-Bert would be a nice name as well :)
I am interested, count me in!
Salam alaykoum Zaid,
Regarding the dataset: I thought about social media sites like Twitter, Facebook, ASKfm, etc., but their data isn't clean or long enough; it requires a lot of preprocessing, and even then it's hard to obtain the data while obeying their policies. There is a site similar to Wikipedia called mawdoo3.com; the articles are clean, don't carry misspellings, and are written in classical Arabic. The data on the website can be used under "fair use" rules, e.g. for educational purposes. According to the website's claim there are more than 150k articles; for comparison, Wikipedia has 118k Arabic articles, though article length may vary, so that might not be a fair comparison. I also thought about free Arabic ebooks, but most of those I found are old and use difficult Arabic; on top of that, they aren't true PDFs but scanned images, so pulling the text out of them will not be an easy task, and I'm not sure about the ethics here since there might be copyright issues. If I have the opportunity to help, I will happily do so.
I think using some data from OPUS (http://opus.nlpl.eu/), which contains subtitles, news, and other resources, might help the model generalize beyond Wikipedia-like use cases.
We can use books written in Arabic in PDF format, but I'm not sure whether these books are scanned as images or stored as text.
@mani144 You proposed a very nice idea. Is anyone here familiar with a good Arabic OCR? These scanned books are not handwritten, which might make it easier to extract the text. What do you think?
For Arabic corpora, we could provide you with our naseej.net news database [1998-2012], more than 240K pages.
We can use videos with Arabic captions on YouTube, like videos from BBC Arabic.
@abdallah197 Yes, at this moment we need a Slack channel. Would you like to create one for this purpose?
It seems that compute is the easiest part, as some have already proposed to sponsor it (either GPU or TPU).
Hi everyone, very happy to see the AraBert project; thank you @zaidalyafeai. I want to contribute some ideas that can make AraBert performant.
Hello everyone, I have some ideas related to the points mentioned in @zaidalyafeai's post. First of all, I think we can use the dataset from the El-Khair et al. corpus, which is to my knowledge one of the largest unlabeled Arabic corpora. We can also use the Arabic Wikipedia corpus. For the model, I suggest that we provide both TensorFlow and PyTorch BERT models if possible. First and most importantly, this will open more opportunities for ML developers interested in Arabic NLP to expand this work on specific NLP tasks. Second, I am more familiar with PyTorch, and I would like to contribute to this work because I am heavily interested in Arabic NLP.
Salam, thanks @zaidalyafeai for the brilliant idea. I'm currently waiting for BERT-base to finish training on Arabic Wikipedia articles. I'm using TF, and it has been more than one week, but it should finish soon. I will see if it outperforms the original Multilingual BERT, and I will share all my findings. We have 3 GeForce GTX 1080 Ti cards, and they work fine if you limit the batch size to 8 (though training takes too long to finish). I have done the following preprocessing steps: If someone has access to more resources, it might be a good idea to try out RoBERTa. I will be happy to contribute, so please let me know if you need any additional information. Abdulrahman
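The exact preprocessing steps in the comment above were not captured in this thread, but common Arabic normalization applied before BERT pretraining can be sketched as follows. This is a hedged illustration of typical choices (stripping diacritics and tatweel, unifying alef and alef maqsura variants), not the commenter's actual pipeline; the function name and the specific rules are assumptions.

```python
import re

# Tashkeel (short vowels, shadda, sukun) plus the dagger alef.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # kashida / elongation character

def normalize_arabic(text: str) -> str:
    """Apply common (optional) Arabic normalization steps."""
    text = DIACRITICS.sub("", text)        # strip vowel marks
    text = text.replace(TATWEEL, "")       # strip elongation
    # Unify hamzated alef variants (آ أ إ) to bare alef (ا).
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)
    # Map alef maqsura (ى) to ya (ي) -- a common but optional choice.
    text = text.replace("\u0649", "\u064A")
    return text
```

Whether to fold alef variants together is a trade-off: it shrinks the vocabulary and smooths over inconsistent spelling in web text, at the cost of losing some orthographic distinctions.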
Hi Zaid, I would like to contribute to the community. I haven't gotten to BERT concepts yet, but I have a good background in DL & NLP. I hope the community keeps growing and serves Arabic content.
Hello everyone, my friend @WissamAntoun and I have just finished training AraBERT. We trained it on a 21GB Arabic corpus with 3.7B tokens. As for the preprocessing steps, we decided to go with two approaches, and both will be uploaded to huggingface very shortly. For the first approach we: For the second approach, we didn't replace the numbers with unique tokens, and we used the Farasa tokenizer more heavily to remove the suffixes and prefixes. Regarding downstream tasks, we evaluated on several sentiment analysis datasets as well as NER and QA; the models outperformed SOTA in all the tasks. We are also planning to make it compatible with the transformers library (adapting the tokenizer).
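The number-replacement step mentioned above ("replace the numbers with unique tokens") could be sketched like this. The actual token string used is not stated in the thread, so `[NUM]` is a placeholder assumption; the point is only to show masking both Western and Eastern Arabic digit runs before tokenization.

```python
import re

NUM_TOKEN = "[NUM]"  # assumed placeholder; the real token name is not given above

# Match runs of Western (0-9) and Eastern Arabic (٠-٩, U+0660..U+0669) digits.
DIGITS = re.compile(r"[0-9\u0660-\u0669]+")

def mask_numbers(text: str) -> str:
    """Replace every digit run with a single unique token."""
    return DIGITS.sub(NUM_TOKEN, text)
```

Collapsing numbers to one token frees vocabulary slots for real words, which matters when the subword vocabulary is small relative to the corpus.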
That's awesome news, Fady! Moving the needle forward in a field that needs much love. Are you planning on releasing the pre-trained model? I'm sure a lot of researchers here would love to play around with it.
This is really great @fadybaly and @WissamAntoun; looking forward to checking the models @hf. Can you describe the corpus content used in the training?
Of course, we're releasing a TensorFlow version to huggingface tomorrow, with PyTorch following shortly afterwards.
It's trained on Modern Standard Arabic data collected from open-source datasets and heavily cleaned web-scraped news websites.
Great efforts @fadybaly, cannot wait to test your model.
Amazing! It would be great if you can share a link here once it's ready. Thanks, looking forward to trying it out.
@fadybaly Thanks for the efforts. We will try to follow your steps to train a larger model. Btw, are you affiliated with Hugging Face? Why did you decide to publish the models there first? I mean, you could create your own repo with enough information and explanation, then upload the models there, if you are interested!
Can you please share random samples of the preprocessed corpus (just before tokenization)? Say around 10 MB of randomly selected texts, limited to one page per text item (first approach).
This French BERT model may be helpful in the coding work: https://github.com/getalp/Flaubert
Hi everyone, @fadybaly and I managed to put together a small demo on fine-tuning our AraBERT. To use AraBERT you will need access to the Farasa segmenter, since in this version of AraBERT we segmented the words before we did SentencePiece tokenization. (We are currently training another version of AraBERT that won't use segmented words, but we think the performance might suffer a bit because the effective vocabulary will be heavily reduced; we expect it to be done by next week.) @zaidalyafeai, no, we aren't affiliated with huggingface, but we think releasing a compatible version is really helpful for the community. We also have a fork of the BERT repo with the tokenizer made compatible with our vocab; we still need to figure out the best way to make huggingface's BertTokenizer compatible. We also have the ability to train a BERT-large model, but we prioritized the base model because, while BERT-large would perform better, it is still impractical for most applications due to the compute requirements. We are also finishing the write-up for an arXiv preprint detailing the pretraining process and data.
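As a rough illustration of the "segment before SentencePiece" idea described above, here is a toy prefix splitter. Farasa performs full morphological segmentation, so this naive prefix list, the helper name, and the `+`-marked output format are simplifications for illustration only, not the actual segmenter.

```python
# Toy prefix splitter: marks a recognized clitic prefix with "+" so a
# subword model can learn it as a separate unit. Real segmentation
# (Farasa) also handles suffixes and resolves ambiguity morphologically.

# وال , بال , ال , و  -- a tiny assumed prefix inventory, longest first.
PREFIXES = ["\u0648\u0627\u0644", "\u0628\u0627\u0644", "\u0627\u0644", "\u0648"]

def toy_segment(word: str) -> str:
    """Split off the longest matching prefix, if the remainder is a plausible stem."""
    for p in sorted(PREFIXES, key=len, reverse=True):
        # Require at least two letters left over, so we don't split short words.
        if word.startswith(p) and len(word) > len(p) + 1:
            return p + "+ " + word[len(p):]
    return word
```

For example, the definite article in "الكتاب" would be split off as "ال+ كتاب" before the subword tokenizer runs, so "كتاب" and "الكتاب" share subword statistics.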
@WissamAntoun Very happy to see some NLP work on Arabic. I hope it gets the recognition it deserves (I will help with that, of course). I encourage you to spend some time updating the fork's README so that future researchers can benefit from your work.
@Fatima-200159617 Is http://bigir1.qu.edu.qa:3000/ down? Is there an open-format (txt, csv) version of the labeled datasets?
The dataset is available and can be accessed from the link below
Hey all, thanks for the informative feedback and the resources you shared in this thread. I didn't expect to find such a thoughtful and brilliant community that cares about Arabic research. That being said, as shown in this thread, there already seem to be efforts to publish BERT in Arabic, and we don't want to duplicate that work. Our plans will change slightly to build on that work and hopefully extend the community to work on other models. Please keep in mind that our main objective is to enrich Arabic research despite the lack of support and resources. Personally, I am not looking for payback beyond learning from this awesome community. In short, if you would like to continue this journey to learn, be challenged, and hopefully produce something fruitful, give a thumbs up and I will create a new repository to discuss our next steps. ~Zaid
Hello, this is a great idea! I am very interested in the project. I have good experience with BERT (training and fine-tuning NLP models in general) and a good understanding of the Arabic language (grammar, vocabulary...).
Actually, I found https://sites.google.com/view/arabicweb16/home while looking for the dataset and tried to see a sample of the text contents to assess suitability for an Arabic BERT model. I will wait till the server is up (I have no access to resources that can deal with the full dataset).
I think there is room for various efforts in this field. Some have already taken place, and these need not be repeated (but are great to learn from). Collective efforts are best but may take time for larger teams (less agility), so smaller teams may be a good option (provided the infrastructure is available).
Hey All,
This is a temporary issue to discuss the idea of training a BERT model specific to Arabic - AraBert. We will move to another repository once we have a clear understanding of the problem and how to tackle it, as well as experts in the field on board. There are three main issues to discuss here, and I want your opinions on them.
If you have any comments on any of the above points, don't hesitate to respond. More importantly, if you have any experience in training large models, let us know. Open source contribution is a slow and lengthy process, and it might take us some time to finish the project. If you are planning to contribute, please be sure to set aside some time for this project.
Thanks
Zaid