# How to extract data from HF repositories

The goal is to get familiar with the HF API in order to extract data from ML repositories.
It is important to realize that most of the relevant metadata can be found in the README.md file.

### Download a specific file

The quickstart guide of the [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/v0.8.0/en/quick-start) includes an example of how to retrieve a particular file from a model repository.

In [1]:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="google/pegasus-xsum", filename="README.md")


'/home/vscode/.cache/huggingface/hub/models--google--pegasus-xsum/snapshots/8d8ffc158a3bee9fbb03afacdfc347c823c5ec8b/README.md'

Next, we would like to choose where the file is downloaded, and ideally check whether we can load it into memory right away.

For the first problem we have the following [documentation](https://huggingface.co/docs/huggingface_hub/main/en/guides/download). We can use the `local_dir="path/to/folder"` parameter of the `hf_hub_download` function.

In [2]:
hf_hub_download(repo_id="google/pegasus-xsum", filename="README.md", local_dir="/home/vscode/README_files")

'/home/vscode/README_files/README.md'
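For the second problem (getting the content into memory), one simple option is to read the path that `hf_hub_download` returns. The sketch below is only an illustration: `read_text_file` is a hypothetical helper name, and it assumes the README is UTF-8 encoded.

```python
from pathlib import Path


def read_text_file(path: str) -> str:
    """Return the full text of a file, assuming UTF-8 encoding."""
    return Path(path).read_text(encoding="utf-8")


# Usage with the call from the cells above (requires network access):
# readme_text = read_text_file(
#     hf_hub_download(repo_id="google/pegasus-xsum", filename="README.md")
# )
```

The file is still written to the cache or to `local_dir` first; this only saves a separate "open and read" step afterwards.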

### Model cards
Even better, the model card metadata that usually sits at the top of the README.md file can be queried directly with the [ModelCard](https://huggingface.co/docs/huggingface_hub/main/en/guides/model-cards) class.

The problem with this approach is clear: not all the metadata can be found in the model card metadata, and a lot of useful information still has to be extracted from the rest of the README.md file.

Even with that setback, this functionality really helps to process the data.

I don't know yet whether it is worth spending one of the 10k requests we are limited to every minute on a preprocessing step.
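Since part of the useful information lives in the README body rather than in the metadata header, it can help to separate the two once the file is already on disk, without another request. The following is a minimal sketch using only the standard library; `split_readme` is a hypothetical helper, and it assumes the header is a YAML block delimited by `---` lines at the very top of the file.

```python
import re
from typing import Tuple

# YAML front matter: a block between two '---' lines at the start of the file.
_FRONT_MATTER = re.compile(r"\A---[ \t]*\n(.*?)\n---[ \t]*(?:\n|\Z)", re.DOTALL)


def split_readme(text: str) -> Tuple[str, str]:
    """Return (yaml_header, body); yaml_header is '' if no front matter is found."""
    match = _FRONT_MATTER.match(text)
    if match is None:
        return "", text
    return match.group(1), text[match.end():]


sample = "---\nlanguage: en\ntags:\n- summarization\n---\n# Pegasus\nBody text\n"
header, body = split_readme(sample)
```

A loaded `ModelCard` should expose the same split through `card.data` (metadata) and `card.text` (body), so this helper is mainly useful for files that were downloaded separately.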

In [3]:
from huggingface_hub import ModelCard

card = ModelCard.load('google/pegasus-xsum')

print(card)

---
language: en
tags:
- summarization
model-index:
- name: google/pegasus-xsum
  results:
  - task:
      type: summarization
      name: Summarization
    dataset:
      name: samsum
      type: samsum
      config: samsum
      split: train
    metrics:
    - type: rouge
      value: 21.8096
      name: ROUGE-1
      verified: true
    - type: rouge
      value: 4.2525
      name: ROUGE-2
      verified: true
    - type: rouge
      value: 17.4469
      name: ROUGE-L
      verified: true
    - type: rouge
      value: 18.8907
      name: ROUGE-LSUM
      verified: true
    - type: loss
      value: 3.0317161083221436
      name: loss
      verified: true
    - type: gen_len
      value: 20.3122
      name: gen_len
      verified: true
  - task:
      type: summarization
      name: Summarization
    dataset:
      name: xsum
      type: xsum
      config: default
      split: test
    metrics:
    - type: rouge
      value: 46.8623
      name: ROUGE-1
      verified: true
    - type

`card.data` returns a `ModelCardData` instance with the model card’s metadata.
Call `.to_dict()` on this instance to get the representation as a dictionary.

In [4]:
print(card.data.to_dict())

{'language': 'en', 'tags': ['summarization'], 'model-index': [{'name': 'google/pegasus-xsum', 'results': [{'task': {'type': 'summarization', 'name': 'Summarization'}, 'dataset': {'name': 'samsum', 'type': 'samsum', 'config': 'samsum', 'split': 'train'}, 'metrics': [{'type': 'rouge', 'value': 21.8096, 'name': 'ROUGE-1', 'verified': True}, {'type': 'rouge', 'value': 4.2525, 'name': 'ROUGE-2', 'verified': True}, {'type': 'rouge', 'value': 17.4469, 'name': 'ROUGE-L', 'verified': True}, {'type': 'rouge', 'value': 18.8907, 'name': 'ROUGE-LSUM', 'verified': True}, {'type': 'loss', 'value': 3.0317161083221436, 'name': 'loss', 'verified': True}, {'type': 'gen_len', 'value': 20.3122, 'name': 'gen_len', 'verified': True}]}, {'task': {'type': 'summarization', 'name': 'Summarization'}, 'dataset': {'name': 'xsum', 'type': 'xsum', 'config': 'default', 'split': 'test'}, 'metrics': [{'type': 'rouge', 'value': 46.8623, 'name': 'ROUGE-1', 'verified': True}, {'type': 'rouge', 'value': 24.4533, 'name': 'RO
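The nested `model-index` structure is awkward to analyze directly. One possible way to flatten it into one row per metric is sketched below; `iter_metrics` and the row keys are hypothetical names chosen for illustration, and the sample dictionary is a shortened copy of the output above.

```python
from typing import Any, Dict, Iterator


def iter_metrics(card_dict: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per metric found under the 'model-index' key."""
    for entry in card_dict.get("model-index") or []:
        for result in entry.get("results", []):
            task = result.get("task", {})
            dataset = result.get("dataset", {})
            for metric in result.get("metrics", []):
                yield {
                    "model": entry.get("name"),
                    "task": task.get("type"),
                    "dataset": dataset.get("name"),
                    "split": dataset.get("split"),
                    "metric": metric.get("name"),
                    "value": metric.get("value"),
                }


# Shortened copy of the dictionary printed above (first two metrics only).
sample = {
    "language": "en",
    "tags": ["summarization"],
    "model-index": [{
        "name": "google/pegasus-xsum",
        "results": [{
            "task": {"type": "summarization", "name": "Summarization"},
            "dataset": {"name": "samsum", "type": "samsum",
                        "config": "samsum", "split": "train"},
            "metrics": [
                {"type": "rouge", "value": 21.8096, "name": "ROUGE-1", "verified": True},
                {"type": "rouge", "value": 4.2525, "name": "ROUGE-2", "verified": True},
            ],
        }],
    }],
}

rows = list(iter_metrics(sample))
```

With a real card, `iter_metrics(card.data.to_dict())` would be the equivalent call. Cards without a `model-index` key simply yield no rows.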

### List models

The next step is to get the names of the model repositories. After that, it would be nice to figure out a way to select models by time period.

Let's explore the [list_models](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_models) function.

We can access relevant information about each model through the attributes of the [ModelInfo](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.hf_api.ModelInfo) object.

In [11]:
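from itertools import islice

from huggingface_hub import HfApi

api = HfApi()

# List all models (the result is consumed lazily)
model_list = api.list_models()

# Print the id and creation date of the first 15 models only
for model in islice(model_list, 15):
    print(model.id, model.created_at)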

albert/albert-base-v1 2022-03-02 23:29:04+00:00
albert/albert-base-v2 2022-03-02 23:29:04+00:00
albert/albert-large-v1 2022-03-02 23:29:04+00:00
albert/albert-large-v2 2022-03-02 23:29:04+00:00
albert/albert-xlarge-v1 2022-03-02 23:29:04+00:00
albert/albert-xlarge-v2 2022-03-02 23:29:04+00:00
albert/albert-xxlarge-v1 2022-03-02 23:29:04+00:00
albert/albert-xxlarge-v2 2022-03-02 23:29:04+00:00
bert-base-cased-finetuned-mrpc 2022-03-02 23:29:04+00:00
bert-base-cased 2022-03-02 23:29:04+00:00
bert-base-chinese 2022-03-02 23:29:04+00:00
bert-base-german-cased 2022-03-02 23:29:04+00:00
bert-base-german-dbmdz-cased 2022-03-02 23:29:04+00:00
bert-base-german-dbmdz-uncased 2022-03-02 23:29:04+00:00
bert-base-multilingual-cased 2022-03-02 23:29:04+00:00


After roughly the first 30,000 models, different dates start to appear, and they seem to be sorted in ascending order.
The initial date appears to be 2022-03-02 23:29:04+00:00, followed by another batch with 2022-03-02 23:29:05+00:00.

In [30]:
from itertools import islice

# Collect the distinct creation dates of the first 30000 models
unique_dates = {model.created_at for model in islice(api.list_models(), 30000)}

print(unique_dates)
print(len(unique_dates))

{datetime.datetime(2022, 3, 2, 23, 29, 4, tzinfo=datetime.timezone.utc), datetime.datetime(2022, 3, 2, 23, 29, 5, tzinfo=datetime.timezone.utc)}
2


In [35]:
from itertools import islice

# Collect the distinct creation dates of the first 31000 models
unique_dates = {model.created_at for model in islice(api.list_models(), 31000)}

print(len(unique_dates))
# A set is unordered, so this slice is an arbitrary sample of 10 dates, not the 10 most recent
sample_dates = list(unique_dates)[-10:]
print([str(x) for x in sample_dates])

892
['2022-03-09 11:52:08+00:00', '2022-03-09 13:30:22+00:00', '2022-03-03 15:53:14+00:00', '2022-03-04 12:49:07+00:00', '2022-03-07 00:22:21+00:00', '2022-03-08 14:22:06+00:00', '2022-03-03 04:53:15+00:00', '2022-03-06 20:35:13+00:00', '2022-03-05 20:52:57+00:00', '2022-03-07 09:51:17+00:00']
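To select models from a given time period, one option is to filter the listing on `created_at` on the client side. The sketch below is an assumption-laden illustration: `created_between` is a hypothetical helper, and the stand-in objects only mimic the `id` and `created_at` attributes of `ModelInfo` (the second entry is made up for the example). Because the dates returned by the API are timezone-aware, the bounds must be timezone-aware as well.

```python
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Any, Iterable, Iterator


def created_between(models: Iterable[Any], start: datetime, end: datetime) -> Iterator[Any]:
    """Yield the models whose created_at falls in the half-open range [start, end)."""
    for model in models:
        created = getattr(model, "created_at", None)
        if created is not None and start <= created < end:
            yield model


# Stand-ins for ModelInfo objects; 'example/hypothetical-model' is not a real repository.
fake_models = [
    SimpleNamespace(id="albert/albert-base-v1",
                    created_at=datetime(2022, 3, 2, 23, 29, 4, tzinfo=timezone.utc)),
    SimpleNamespace(id="example/hypothetical-model",
                    created_at=datetime(2022, 3, 9, 11, 52, 8, tzinfo=timezone.utc)),
]

start = datetime(2022, 3, 3, tzinfo=timezone.utc)
end = datetime(2022, 3, 10, tzinfo=timezone.utc)
selected = [m.id for m in created_between(fake_models, start, end)]
```

With the real API this would be `created_between(api.list_models(), start, end)`. If the listing really is sorted by creation date, the loop could stop as soon as `created >= end`, but that ordering is only an observation so far, so the sketch scans everything.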


### Faster downloads
If you are running on a machine with high bandwidth, you can increase your download speed with [hf_transfer](https://github.com/huggingface/hf_transfer), a Rust-based library developed to speed up file transfers with the Hub. To enable it:

- Specify the `hf_transfer` extra when installing `huggingface_hub` (e.g. `pip install huggingface_hub[hf_transfer]`).
- Set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable.

Tests will have to be run to determine whether this feature is worth using, given that the README.md files weigh between 3 KB and 10 KB.
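A simple way to run such a test is to time the same download several times with and without the feature enabled. The helper below is a generic timing sketch, not a benchmark from this notebook; `time_call` is a hypothetical name, and the commented usage assumes that `force_download=True` is passed so the cache does not hide the transfer time.

```python
import time
from statistics import median
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock duration, in seconds, of calling fn()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return median(durations)


# Possible usage (requires network access):
# seconds = time_call(lambda: hf_hub_download(
#     repo_id="google/pegasus-xsum", filename="README.md", force_download=True))
```

The environment variable is probably best set before `huggingface_hub` is imported, so the two variants should be measured in separate Python processes.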