Problems with the WuDao dataset download and preprocessing scripts #42

Closed
skepsun opened this issue May 9, 2023 · 1 comment

skepsun commented May 9, 2023

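First, the download link. The link I requested through my own account did not work, so I had to use the scidb link, which requires no login. Downloading with curl kept failing (the finished file's md5 did not match and the archive could not be extracted), so I switched to wget, which finally succeeded. The download command I used (without a loop) was: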

wget -v -c 'https://download.scidb.cn/download?fileId=63a30383fed6a8a9e8454302&dataSetType=organization&fileName=WuDaoCorporaText-2.0-open.rar' -O data/WuDaoCorpus2.0_base_200G.rar
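Because the curl attempts produced a file whose md5 did not match, it may be worth checking the checksum before trying to extract. A minimal sketch, assuming GNU `md5sum` is installed; `verify_md5` is a hypothetical helper, and the expected hash has to be taken from the dataset page, since the thread does not quote it:

```shell
# Hypothetical helper (not part of the repository): compare a file's MD5
# with an expected value. Assumes GNU coreutils `md5sum` is available.
verify_md5() {
  file="$1"
  expected="$2"
  # md5sum prints "<hash>  <filename>"; keep only the hash field.
  actual=$(md5sum "$file" | awk '{print $1}') || return 1
  if [ "$actual" = "$expected" ]; then
    echo "md5 OK: $file"
  else
    echo "md5 mismatch for $file: got $actual, expected $expected" >&2
    return 1
  fi
}
```

It could be called as `verify_md5 data/WuDaoCorpus2.0_base_200G.rar "<md5 from the dataset page>"`, and the extraction step run only when it returns success.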

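Second, the extraction command does not specify an output path, so if this sh file is run from the project root, the archive is extracted into the root directory (Open-LLama/WuDaoCorpus2.0_base_200G/). It then has to be moved into the data directory, or the path in data/preprocess_wudao.py has to be changed.
On a separate note, the Pile is really hard to download (it even requires a proxy to get past the firewall)...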
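For the extraction-path point, one option is to pass an explicit destination to `unrar`. A sketch under assumptions: `extract_rar_to` is a hypothetical helper rather than the repository's script, and it relies on `unrar x` accepting a destination directory as its last argument when that argument ends with a slash:

```shell
# Hypothetical helper: extract a .rar archive into a chosen directory
# instead of the current working directory.
extract_rar_to() {
  archive="$1"
  dest="$2"
  # Create the destination first; unrar expects it to exist.
  mkdir -p "$dest" || return 1
  # unrar treats the last argument as a destination path only when it
  # ends with "/", so normalise to exactly one trailing slash.
  unrar x "$archive" "${dest%/}/"
}
```

Running `extract_rar_to data/WuDaoCorpus2.0_base_200G.rar data` from the project root would then place the extracted folder under `data/`, regardless of where the script was started from.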

s-JoL (Owner) commented May 9, 2023

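Thanks for the suggestions on the dataset download part. This download method looks good; I have added it to the readme and @-mentioned you. I used a loop because the WuDao link is not very stable: the download is interrupted after every 1 GB, so I had to add a loop that keeps resuming it.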
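A resume loop of the kind described here could look roughly like the following. This is a sketch, not the script from the repository: `download_with_retry`, `max_tries` and `RETRY_DELAY` are hypothetical names, and it relies on `wget -c` continuing from the partial file left by the previous attempt:

```shell
# Hypothetical helper: keep resuming a download until wget reports success
# or the attempt limit is reached.
download_with_retry() {
  url="$1"
  out="$2"
  max_tries="${3:-50}"          # assumed default, adjust as needed
  delay="${RETRY_DELAY:-5}"     # seconds to wait between attempts
  try=1
  # -c resumes from the bytes already present in "$out".
  until wget -c "$url" -O "$out"; do
    if [ "$try" -ge "$max_tries" ]; then
      echo "giving up after $try attempts" >&2
      return 1
    fi
    try=$((try + 1))
    sleep "$delay"
  done
}
```

The attempt limit is a design choice so that a permanently broken link does not loop forever; an unbounded `until` loop would match the "keep resuming" behaviour more literally.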

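curl and wget may differ in how they handle redirects; when downloading the instruct datasets there were also a few files that could not be downloaded with curl.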
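One known difference is that curl does not follow HTTP redirects unless `-L` is passed, whereas wget follows them by default. Whether that was the cause here is not confirmed in the thread. A sketch of a curl call that follows redirects, fails on HTTP errors and resumes a partial file; `curl_fetch` is a hypothetical wrapper:

```shell
# Hypothetical wrapper around curl for large, redirecting downloads.
curl_fetch() {
  url="$1"
  out="$2"
  # -L      follow redirects (curl does not do this by default)
  # --fail  exit non-zero on HTTP errors instead of saving the error page
  # -C -    resume from the size of the existing partial file
  curl -L --fail -C - -o "$out" "$url"
}
```

Without `--fail`, an HTML error or redirect page can be written to the output file with a zero exit code, which would be consistent with a download that "finishes" but has the wrong md5.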

The problem of unrar not specifying a path has just been fixed in an update.

skepsun closed this as completed May 9, 2023