[Mirror Request] Hugging Face Model Hub #937

Closed
JetRunner opened this issue Aug 20, 2020 · 55 comments
Labels
Accepted Mirror Request Request for new mirror

Comments

@JetRunner

Project name and introduction (Project Intro.)

huggingface/transformers is the No. 1 open-source project in the field of Natural Language Processing (NLP). It provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, T5, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with thousands of pretrained models in 100+ languages and deep interoperability between PyTorch & TensorFlow 2.0.

The requested mirror of Hugging Face's Model Hub is the most important feature of this library. People download pretrained models from the hub (which is very slow in China) and use them in their own applications. The most downloaded pretrained model has >15 million downloads over 30 days.

Upstream address and mirroring method (How to Mirror)

https://s3.amazonaws.com/models.huggingface.co/bert/

It's an S3 bucket, so it can be synced with the S3 sync tool. We (Hugging Face) can provide any necessary assistance from our side.

Other information (Other)

  • Mirror size: 1.5 TB
@Harry-Chen Harry-Chen added the Mirror Request Request for new mirror label Aug 20, 2020
@JetRunner
Author

More context here:

The model hub has a website: https://huggingface.co/models
Since we are making this request as the first party, there should be no legal or authorization concerns.

@Harry-Chen
Member

I wonder how you will distribute these pretrained models if we set up a mirror for them. Will you redirect access to these files on your servers (maybe based on geolocation), or require additional setup from users (such as an environment variable or some other config)?

@JetRunner
Author

JetRunner commented Aug 20, 2020

@Harry-Chen Hi Harry, we would go with the latter since redirecting seems to be a bad idea given that geolocation is not accurate.

@Harry-Chen
Member

OK. I noticed that you are currently using https://cdn.huggingface.co/ to distribute downloadable content. So that means our mirror will be an alternative source for users, correct?

@JetRunner
Author

JetRunner commented Aug 20, 2020

Good point! We are in the process of transitioning user requests to https://cdn.huggingface.co/ (which is a CDN for the S3 bucket, obviously). However, neither option works well for our users in China, who suffer download speeds below 100 KB/s. And yes, the TUNA mirror will be an alternative source for our Chinese users.

@Harry-Chen
Member

Thanks for the information. We will discuss it internally and reply ASAP.

@JetRunner
Author

JetRunner commented Aug 20, 2020

@Harry-Chen Thanks, Harry! I would like to emphasize the importance of our library (and model hub) for NLP/ML engineers/researchers and please don't hesitate to ask us again for anything you need.

@Harry-Chen
Member

We are happy to accept this request. However, due to the limited space on our servers (and the relatively large size of your repo), we plan to begin synchronizing after #939 is resolved.

BTW, your S3 bucket needs to allow list-objects for us to use as upstream.

@JetRunner
Author

Thanks! We will configure the S3 as you suggested.

@Harry-Chen
Member

@JetRunner There are problems when I try to synchronize your S3 bucket. I noticed that some objects with the same key can change from directory to file or vice versa. This causes the synchronization to fail, because the tool we use (aws s3) moves each downloaded object to its final name.

There are some examples:

download failed: s3://models.huggingface.co/bert/sampathkethineedi/sampathkethineedi/industry-classification/ to ../../data/mirrors/hugging-face-models/sampathkethineedi/sampathkethineedi/industry-classification/ [Errno 20] Not a directory: '/data/mirrors/hugging-face-models/sampathkethineedi/sampathkethineedi/industry-classification/.17Bc1BAa' -> '/data/mirrors/hugging-face-models/sampathkethineedi/sampathkethineedi/industry-classification/'
download failed: s3://models.huggingface.co/bert/sshleifer/mbart-trimmed-enro to ../../data/mirrors/hugging-face-models/sshleifer/mbart-trimmed-enro [Errno 21] Is a directory: '/data/mirrors/hugging-face-models/sshleifer/mbart-trimmed-enro.D0Fe7b34' -> '/data/mirrors/hugging-face-models/sshleifer/mbart-trimmed-enro'

This is because mbart-trimmed-enro used to be a directory (https://mirrors.tuna.tsinghua.edu.cn/hugging-face-models/sshleifer/mbart-trimmed-enro/) but is now a file (https://s3.amazonaws.com/models.huggingface.co/bert/sshleifer/mbart-trimmed-enro). A similar problem exists for industry-classification.

@JetRunner
Author

Hi @Harry-Chen! We will look into it on Monday. We really appreciate your hard work, and we would be grateful if you could allow us some time to debug this on weekdays.

@Harry-Chen
Member

@JetRunner Sure. Our mirror is located at https://mirrors.tuna.tsinghua.edu.cn/hugging-face-models/, you can use it for debugging.

@sshleifer

Cool project!

I deleted mbart-trimmed-enro on our side (it is a currently abandoned project).

https://huggingface.co/sampathkethineedi/industry-classification looks like a valid directory to me on our end.

@JetRunner
Author

@sshleifer Thanks, Sam!
@Harry-Chen Do you mind trying to sync again? If the problem persists, we'll investigate it further.

@Harry-Chen
Member

@sshleifer Thanks for your help! The problem with mbart-trimmed-enro is resolved now, but the other one remains.

To reproduce this, you could run aws --no-sign-request s3 sync s3://models.huggingface.co/bert/sampathkethineedi/sampathkethineedi/industry-classification/ . anywhere and you will see the error.

I believe this is because there is an object named models.huggingface.co/bert/sampathkethineedi/sampathkethineedi/industry-classification/ (with a trailing slash), which confuses the synchronization tool. You can download that object at https://s3.amazonaws.com/models.huggingface.co/bert/sampathkethineedi/sampathkethineedi/industry-classification/. I believe this object should also be removed, and only models.huggingface.co/bert/sampathkethineedi/sampathkethineedi/industry-classification/README.md should be kept in S3.

Also there is one object named unicamp-dl//README.md. The double slash in the filename causes the matching between object name and directory level (should be unicamp-dl/README.md) to always fail. So this file is deleted by aws s3 and redownloaded every time we initiate a synchronization. This is not a big problem because it's only a text file. But obviously renaming it would be a better choice.
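
The two kinds of problem keys described above (an object whose key ends in a slash, and a key containing a double slash) can be spotted from a plain key listing before a sync is attempted. The following is only a minimal sketch: the function name `find_problem_keys` and the shortened sample keys are hypothetical, and it inspects a list of strings rather than calling S3.

```python
from typing import Dict, List


def find_problem_keys(keys: List[str]) -> Dict[str, List[str]]:
    """Group object keys that are awkward to mirror onto a local filesystem."""
    problems: Dict[str, List[str]] = {"trailing_slash": [], "empty_segment": []}
    for key in keys:
        # An object whose key ends in "/" collides with the directory that a
        # sync tool creates for the keys stored underneath it.
        if key.endswith("/"):
            problems["trailing_slash"].append(key)
        # A "//" inside the key produces an empty path segment, which has no
        # local equivalent (unicamp-dl//README.md vs unicamp-dl/README.md).
        if "//" in key:
            problems["empty_segment"].append(key)
    return problems


# Illustrative keys, shortened from the ones discussed in this thread.
keys = [
    "bert/sampathkethineedi/industry-classification/",
    "bert/sampathkethineedi/industry-classification/README.md",
    "bert/unicamp-dl//README.md",
]
report = find_problem_keys(keys)
print(report)
```

Running such a check on the upstream side would let the offending keys be renamed or removed before any downstream mirror sees them.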

@sshleifer

sshleifer commented Aug 23, 2020

Does this break your mirror globally or only for the affected paths?
fixed unicamp.
What is the command you want for the other file?

@Harry-Chen
Member

This will not affect other files, but it causes the aws s3 tool to exit abnormally, so the synchronization process is considered failed (though all other files are downloaded successfully).

I can exclude this path in the command we use, but that would not be a good choice, because every other downstream (there will be more than one, since we have multiple servers for load balancing) would have to use the same config. Could you check the models.huggingface.co/bert/sampathkethineedi/sampathkethineedi/industry-classification/ object in S3 and delete it (it seems to have the same content as models.huggingface.co/bert/sampathkethineedi/sampathkethineedi/industry-classification/README.md)?

@sshleifer

sshleifer commented Aug 24, 2020

removed the whole dir, it seems accidental. But this kind of thing can easily happen again. Would be great if your command did not fail if s3 paths are unexpected.
I don't think aws s3 sync, for example, would fail in this kind of situation, but I don't have much context on your code.

@Harry-Chen
Member

@sshleifer I tried to sync again and now it works.

I am actually using this command to sync:

aws --no-sign-request s3 sync s3://models.huggingface.co/bert /path/to/hugging-face-models/

There might be some options that could stop it from failing on such paths, but its default behaviour is to fail (at least with awscli 1.18.124 that we use). If I could not solve it, maybe I will try other tools such as rclone.

@JetRunner
Author

Hi @Harry-Chen, what's the current status of the mirror? Is it actively syncing? I don't see it on the status page. If the mirror is active we can move to the next step.

@Harry-Chen
Member

Yes, it is syncing, but only on one of our servers. So when you are load-balanced to other servers, you won't see it on the status page, but you can still access it through our fallback mechanism.

However, the synchronization still fails due to problems similar to those above. This time it is caused by two files, facebook/bart-large and facebook/bart-large-xsum; directories with the same names also exist. I tried awscli and rclone, and neither allows a file and a directory to have the same name. This error makes sense, because no local filesystem knows how to handle such a situation. I hope you can prevent this from happening on your side.

@julien-c

julien-c commented Sep 9, 2020

Hi @Harry-Chen, we added a periodic integrity check to our S3 so that files bearing the same name as a "folder" (a concept which doesn't exist in S3) will be deleted on our side.
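
The thread does not show how this integrity check is implemented. As a rough illustration of the idea only, the sketch below takes a flat key listing and reports keys that are also used as a "folder" prefix by other keys; the function name `keys_shadowing_folders` and the sample file names are hypothetical.

```python
from typing import Iterable, List


def keys_shadowing_folders(keys: Iterable[str]) -> List[str]:
    """Return keys that are also used as a "folder" prefix by other keys."""
    key_set = set(keys)
    prefixes = set()
    for key in key_set:
        parts = key.split("/")
        # Collect every proper parent prefix: "a/b/c" -> "a", "a/b".
        for i in range(1, len(parts)):
            prefixes.add("/".join(parts[:i]))
    # A key that is both an object and a prefix cannot exist locally as a
    # file and a directory at the same time.
    return sorted(key_set & prefixes)


# Illustrative listing modelled on the bart-large case from this thread.
listing = [
    "facebook/bart-large",
    "facebook/bart-large/config.json",
    "facebook/bart-large-xsum",
    "facebook/bart-large-xsum/config.json",
    "facebook/bart-base/config.json",
]
conflicts = keys_shadowing_folders(listing)
print(conflicts)
```

Whether the conflicting object or the objects beneath it should be deleted is a policy decision; the comment above states that the file bearing the folder's name is the one removed.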

You can go ahead with the mirroring if you'd like.

Thanks for your help!

@Harry-Chen
Member

@julien-c That sounds great! I believe everything is fine now. Currently the mirror is located at https://mirrors.tuna.tsinghua.edu.cn/hugging-face-models/ and synchronizes every 6 hours (this could be more frequent if needed). You can continue with the next part now (e.g. adding a mirror option to your hugging-face-transformers library). Then you can provide documentation on how to use our mirror, and I will add it to our help page.

I am also setting up hugging-face-models on the other mirror sites that we maintain. Once that is done, I will post the additional addresses of this mirror.

@JetRunner
Author

It's awesome! Thanks @Harry-Chen @julien-c

@JetRunner
Author

JetRunner commented Sep 9, 2020

@Harry-Chen
BTW, what are "other mirrors" you are maintaining? Our proposed PR is https://github.com/huggingface/transformers/pull/6679/files.
If you do have other mirrors we would modify the PR to allow selection.

@Harry-Chen
Member

@JetRunner There will be some extra sites:

Other open-source mirror sites (such as USTC) might also synchronize your repo.

So I don't believe hardcoding one single mirror address is a good idea. It would be better to provide more options, something more flexible like:

  • use_cn_mirror: to use a mirror in China (let's say TUNA by default)
  • mirror_address: let the user override the endpoint

What do you think?
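
One way to read the two proposed options is as a small endpoint-resolution rule. The sketch below is hypothetical and is not the code from the PR: the function name `resolve_endpoint` and the precedence (an explicit `mirror_address` wins over `use_cn_mirror`) are assumptions, while both URLs are taken from this thread.

```python
from typing import Optional

# Both addresses appear earlier in this thread.
DEFAULT_ENDPOINT = "https://cdn.huggingface.co"
CN_MIRROR_ENDPOINT = "https://mirrors.tuna.tsinghua.edu.cn/hugging-face-models"


def resolve_endpoint(use_cn_mirror: bool = False,
                     mirror_address: Optional[str] = None) -> str:
    """Pick the base URL to download from (assumed precedence, see above)."""
    # An explicit address overrides everything else.
    if mirror_address:
        return mirror_address.rstrip("/")
    # Otherwise fall back to the default mirror in China if requested.
    if use_cn_mirror:
        return CN_MIRROR_ENDPOINT
    return DEFAULT_ENDPOINT


print(resolve_endpoint())
print(resolve_endpoint(use_cn_mirror=True))
print(resolve_endpoint(mirror_address="https://mirrors.pku.edu.cn/hugging-face-models/"))
```

Keeping the override separate from the boolean means a new mirror site can be used without waiting for a library release that knows about it.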

@JetRunner
Author

@Harry-Chen Thanks for the suggestion. I'll modify the PR.

@JetRunner
Author

JetRunner commented Sep 14, 2020

@Harry-Chen We've merged the PR on our side.
Here's how to use the mirror:

To use the mirror, you can simply add a parameter mirror to the from_pretrained method for models. Note that the mirror parameter can be a URL or selected from the preset list ['tuna', 'bfsu'].

AutoModel.from_pretrained('bert-base-uncased', mirror='tuna')

or

AutoModel.from_pretrained('bert-base-uncased', mirror='https://mirrors.tuna.tsinghua.edu.cn/hugging-face-models')

mirror works with transformers > 3.1.0.
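
For scripts that must run against several transformers versions, the version requirement above can be turned into a small guard that only passes `mirror` when the installed version supports it. This is a sketch under stated assumptions: the helpers `parse_version` and `mirror_kwargs` are hypothetical, and the parser only handles plain dotted versions such as "3.1.0" (a suffix like ".dev0" would raise ValueError).

```python
from typing import Dict, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a plain dotted version such as "3.1.0" into (3, 1, 0)."""
    return tuple(int(part) for part in version.split("."))


def mirror_kwargs(installed_version: str, mirror: str) -> Dict[str, str]:
    """Return {"mirror": ...} only for versions newer than 3.1.0."""
    if parse_version(installed_version) > (3, 1, 0):
        return {"mirror": mirror}
    # Older versions do not accept the parameter, so pass nothing extra.
    return {}


print(mirror_kwargs("3.2.0", "tuna"))
print(mirror_kwargs("3.1.0", "tuna"))
```

The result could then be splatted into the call, for example AutoModel.from_pretrained('bert-base-uncased', **mirror_kwargs(transformers.__version__, 'tuna')).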

@JetRunner
Author

@Harry-Chen Could you please update the website with the usage doc above?

Also I can't find hugging face mirror on tuna now. Is it still syncing?

@fuzhenxin

Hi @JetRunner, Hugging Face is a great project. We also mirror the models at https://mirrors.pku.edu.cn/hugging-face-models/. But we find that the files in S3 are different from those on https://huggingface.co/. Is this by design, or are the S3 models not being updated? Thanks!

  1. There are only four models under /uer/ in S3:
[root@cn100 ~]# aws --no-sign-request s3 ls s3://models.huggingface.co/bert/uer/
                           PRE chinese_roberta_L-2_H-128/
                           PRE chinese_roberta_L-4_H-256/
                           PRE gpt2-chinese-couplet/
                           PRE gpt2-chinese-poem/

But the official website lists many more models: [image]

  2. The last-updated time is different too.
    The last-updated time of uer/gpt2-chinese-couplet in S3:
[root@cn100 ~]# aws --no-sign-request s3 ls s3://models.huggingface.co/bert/uer/gpt2-chinese-couplet/
2020-11-25 06:25:59        859 config.json
2020-11-25 06:26:01  420921295 pytorch_model.bin
2020-11-25 06:26:20        112 special_tokens_map.json
2020-11-25 06:26:21  408449360 tf_model.h5
2020-11-25 06:26:26        414 tokenizer_config.json
2020-11-25 06:26:27     109516 vocab.txt

The last-updated time on the website: [image]

@JetRunner
Author

Hi @fuzhenxin Thanks for the kind words. I'll let @julien-c answer that.

@julien-c

Hi, we kept this S3 bucket for backward compatibility but we do not use it anymore, as we moved to a git/git-lfs backed storage system. (cc @Pierrci)

Not sure whether there's an easy fix for this unfortunately.

I think the future-proof way to improve bandwidth for users based in China is for us to open an AWS China account.

@Harry-Chen
Member

Due to the change of upstream, I have now paused the synchronization of hugging-face-models on our server.

@fuzhenxin

Got it. Thanks!

@Harry-Chen
Member

We have now removed hugging-face-models due to deprecation from upstream. https://mirrors.tuna.tsinghua.edu.cn/news/remove-hugging-face/

@JetRunner
Author

JetRunner commented Sep 8, 2021

@fuzhenxin @Harry-Chen We have deprecated the mirroring option and we're working with AWS China to provide a self-hosted solution.

huggingface/transformers#13470

@xxllp

xxllp commented Sep 10, 2021

too bad

@wyh196646

too bad

@vipcxj

vipcxj commented Dec 5, 2022

too bad

@zhaojinzhou

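I hope this can be restarted.
Large models have become extremely popular recently.
Besides, this is a long-term undertaking, like the proverbial ten years it takes to grow a tree.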

@Harry-Chen
Member

@zhaojinzhou It was Hugging Face that stopped supporting the mirror.

@adeenayakup

adeenayakup commented Apr 24, 2023

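Hugging Face is currently working to improve its infrastructure in China, and hopes to better support the growth of the Chinese-language community.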

@xianbaoqian

xianbaoqian commented Apr 24, 2023

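Thank you all for your support. The previous mirroring approach stopped working properly after some new features (stats, git, etc.) were added, so we stopped supporting the mirror. We are planning to set up an official mirror in China. Stay tuned.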

@statelesshz

@xianbaoqian Hello, how is the work on setting up a mirror in China progressing? Looking forward to your reply.

@Harry-Chen
Member

For matters concerning a Hugging Face mirror in China, please contact the company directly. This issue should not be used for related inquiries or for pressing for progress.

@trotsky1997

It is now completely inaccessible.
