Will the corpus for training be open-sourced? #6

Closed
michael-wzhu opened this issue Aug 14, 2019 · 2 comments
Labels
enhancement New feature or request

Comments

@michael-wzhu

As we all know, Chinese NLP research has been slowed by the lack of large open-source corpora, and the problem has grown more severe with the recent advances in large pre-trained LMs. Could you open-source the training corpus for further research and follow-up work?

@yaleimeng

Some of the corpora are protected by copyright, and the project owner has no right to release them.
The public corpora are actually easy to obtain: you can search for the keywords 'Chinese corpus' on GitHub, or gather the data yourself.

@ymcui
Owner

ymcui commented Aug 15, 2019

Thank you for the clarification, @yaleimeng.

I agree that freely accessible large-scale training data is important for future NLP research. However, licensing issues are unavoidable in practice. One thing you have probably noticed: you CANNOT find ready-to-download large-scale Baike data, but you will find plenty of spider programs.
In this context, I'm afraid you will have to use these spider programs to crawl the data yourself. Sorry for the inconvenience this has caused.

ymcui closed this as completed Aug 16, 2019
ymcui added the enhancement (New feature or request) label Aug 20, 2019

3 participants