# How to extract data from a Hugging Face (HF) repository

The goal is to get familiar with the HF API in order to extract data from ML repositories.
It is important to realize that most of the relevant metadata can be found in the README.md file.

### Download a specific file

The quickstart guide of the [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/v0.8.0/en/quick-start) includes an example of how to retrieve a particular file from an ML repository.

In [9]:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="google/pegasus-xsum", filename="README.md")

'/home/vscode/.cache/huggingface/hub/models--google--pegasus-xsum/snapshots/8d8ffc158a3bee9fbb03afacdfc347c823c5ec8b/README.md'

Next, we would like to choose where the file is downloaded, and also check whether it can be loaded into memory right away.

For the first problem, the [download guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/download) applies: we can pass the `local_dir="path/to/folder"` parameter to the `hf_hub_download` function.

In [10]:
hf_hub_download(repo_id="google/pegasus-xsum", filename="README.md", local_dir="/home/vscode/README_files")

'/home/vscode/README.md/README.md'
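For the second problem (getting the file into memory), one simple option is to read the path that `hf_hub_download` returns. The sketch below assumes the README is UTF-8 text; `read_text_file` and `fetch_readme_text` are hypothetical helper names introduced here, not part of `huggingface_hub`.

```python
from pathlib import Path


def read_text_file(path: str) -> str:
    """Return the whole file at `path` as a string (UTF-8 is assumed)."""
    return Path(path).read_text(encoding="utf-8")


def fetch_readme_text(repo_id: str) -> str:
    """Download README.md (or reuse the cached copy) and return its text.

    Hypothetical helper: it only chains hf_hub_download with a file read.
    """
    # Imported lazily so read_text_file stays usable without the library.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(repo_id=repo_id, filename="README.md")
    return read_text_file(local_path)


# Example usage (needs network access on the first call):
# text = fetch_readme_text("google/pegasus-xsum")
# print(text[:200])
```

The file still goes through the local cache first, so this does not avoid the disk write; it only saves a separate open-and-read step in the calling code.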

### Model cards
Even better, the [ModelCard](https://huggingface.co/docs/huggingface_hub/main/en/guides/model-cards) class lets us query only the model card metadata, which usually lives in the YAML header of the README.md file.

The problem with this approach is clear: not all the metadata can be found in the model card, and a lot of useful information can only be extracted from the rest of the README.md file.

Even with this setback, it can be said that this functionality really helps to process the data.

What is still unclear is whether it is worth spending one of the 10k requests we are limited to every minute on a preprocessing step.
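To make the distinction between the model card metadata and the rest of the README concrete, here is a minimal sketch that splits a README string into its YAML header and its free-text body. `split_front_matter` is a hypothetical helper written for illustration, and the sample text is made up; it is not how `huggingface_hub` parses cards internally. As far as I know, a loaded `ModelCard` already exposes the same two parts as `card.data` (metadata) and `card.text` (body).

```python
from typing import Tuple


def split_front_matter(readme: str) -> Tuple[str, str]:
    """Split README text into (yaml_header, body).

    Returns an empty header when the text does not start with a `---` block.
    """
    lines = readme.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", readme
    # Look for the closing `---` delimiter after the opening one.
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            header = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return header, body
    # No closing delimiter found: treat the whole text as body.
    return "", readme


# Made-up sample, shaped like a typical model README.
sample = (
    "---\n"
    "license: apache-2.0\n"
    "tags:\n"
    "- image-classification\n"
    "---\n"
    "# vit-base-beans\n"
    "\n"
    "Free-text description.\n"
)
header, body = split_front_matter(sample)
print(header)
print(body)
```

Since both parts come from the same file, one download of README.md is enough to get the structured metadata and the free text, which matters if requests are rate-limited.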

In [6]:
from huggingface_hub import ModelCard

card = ModelCard.load('nateraw/vit-base-beans')

print(card)

---
language: en
license: apache-2.0
tags:
- generated_from_trainer
- image-classification
datasets:
- beans
metrics:
- accuracy
widget:
- src: https://huggingface.co/nateraw/vit-base-beans/resolve/main/healthy.jpeg
  example_title: Healthy
- src: https://huggingface.co/nateraw/vit-base-beans/resolve/main/angular_leaf_spot.jpeg
  example_title: Angular Leaf Spot
- src: https://huggingface.co/nateraw/vit-base-beans/resolve/main/bean_rust.jpeg
  example_title: Bean Rust
model-index:
- name: vit-base-beans
  results:
  - task:
      type: image-classification
      name: Image Classification
    dataset:
      name: beans
      type: beans
      args: default
    metrics:
    - type: accuracy
      value: 0.9774436090225563
      name: Accuracy
  - task:
      type: image-classification
      name: Image Classification
    dataset:
      name: beans
      type: beans
      config: default
      split: test
    metrics:
    - type: accuracy
      value: 0.9453125
      name: Accuracy
     

`card.data` returns a `ModelCardData` instance with the model card’s metadata.
Call `.to_dict()` on this instance to get the representation as a dictionary.

In [8]:
print(card.data.to_dict())

{'language': 'en', 'license': 'apache-2.0', 'tags': ['generated_from_trainer', 'image-classification'], 'datasets': ['beans'], 'metrics': ['accuracy'], 'widget': [{'src': 'https://huggingface.co/nateraw/vit-base-beans/resolve/main/healthy.jpeg', 'example_title': 'Healthy'}, {'src': 'https://huggingface.co/nateraw/vit-base-beans/resolve/main/angular_leaf_spot.jpeg', 'example_title': 'Angular Leaf Spot'}, {'src': 'https://huggingface.co/nateraw/vit-base-beans/resolve/main/bean_rust.jpeg', 'example_title': 'Bean Rust'}], 'model-index': [{'name': 'vit-base-beans', 'results': [{'task': {'type': 'image-classification', 'name': 'Image Classification'}, 'dataset': {'name': 'beans', 'type': 'beans', 'args': 'default'}, 'metrics': [{'type': 'accuracy', 'value': 0.9774436090225563, 'name': 'Accuracy'}]}, {'task': {'type': 'image-classification', 'name': 'Image Classification'}, 'dataset': {'name': 'beans', 'type': 'beans', 'config': 'default', 'split': 'test'}, 'metrics': [{'type': 'accuracy', 'v
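Once the metadata is a plain dictionary, individual fields can be pulled out with ordinary dictionary access. The sketch below flattens the `model-index` entries into one row per reported metric. `extract_results` is a hypothetical helper, and `sample_metadata` is a trimmed copy of the card printed above rather than a live API response.

```python
def extract_results(metadata: dict) -> list:
    """Flatten `model-index` into one dict per (dataset, metric) pair."""
    rows = []
    # `or []` guards against a missing key or an explicit None value.
    for entry in metadata.get("model-index") or []:
        for result in entry.get("results") or []:
            dataset = result.get("dataset") or {}
            for metric in result.get("metrics") or []:
                rows.append(
                    {
                        "model": entry.get("name"),
                        "dataset": dataset.get("name"),
                        "split": dataset.get("split"),  # None when absent
                        "metric": metric.get("type"),
                        "value": metric.get("value"),
                    }
                )
    return rows


# Trimmed copy of the metadata shown above for nateraw/vit-base-beans.
sample_metadata = {
    "license": "apache-2.0",
    "model-index": [
        {
            "name": "vit-base-beans",
            "results": [
                {
                    "dataset": {"name": "beans", "type": "beans", "args": "default"},
                    "metrics": [{"type": "accuracy", "value": 0.9774436090225563}],
                },
                {
                    "dataset": {"name": "beans", "type": "beans", "split": "test"},
                    "metrics": [{"type": "accuracy", "value": 0.9453125}],
                },
            ],
        }
    ],
}

for row in extract_results(sample_metadata):
    print(row)
```

Using `.get` everywhere keeps the helper tolerant of cards that omit `model-index` or individual keys, which is common since the metadata is optional.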

### List models

The next step is to get the names of the model repositories. After that, it would be nice to figure out a way to get the names of models based on a time period.
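A possible starting point is `HfApi.list_models`, with the time period applied on the client side. The sketch below rests on several assumptions: a recent `huggingface_hub` version in which each returned item has an `id` and a timezone-aware `last_modified` datetime, and in which `full=True` is what makes the timestamp available. `filter_by_period` and `list_model_ids_in_period` are hypothetical helper names, and the sketch does not rely on any server-side date filter.

```python
from datetime import datetime, timezone


def filter_by_period(models, start, end):
    """Keep the models whose last-modified timestamp lies in [start, end)."""
    kept = []
    for model in models:
        # Assumption: the attribute is called `last_modified` and is a datetime.
        stamp = getattr(model, "last_modified", None)
        if stamp is not None and start <= stamp < end:
            kept.append(model)
    return kept


def list_model_ids_in_period(start, end, limit=100):
    """Fetch up to `limit` models and return the ids modified in the period.

    Hypothetical helper; `start` and `end` should be timezone-aware.
    """
    # Imported lazily so filter_by_period can be used without the library.
    from huggingface_hub import HfApi

    api = HfApi()
    # full=True asks the Hub for the complete model data, timestamps included.
    models = api.list_models(full=True, limit=limit)
    return [model.id for model in filter_by_period(models, start, end)]


# Example usage (needs network access):
# start = datetime(2023, 1, 1, tzinfo=timezone.utc)
# end = datetime(2023, 2, 1, tzinfo=timezone.utc)
# print(list_model_ids_in_period(start, end))
```

Filtering on the client means every listed model still has to be fetched, so `limit` is what bounds the number of results here.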

### Faster downloads
If you are running on a machine with high bandwidth, you can increase your download speed with [hf_transfer](https://github.com/huggingface/hf_transfer), a Rust-based library developed to speed up file transfers with the Hub. To enable it:

- Specify the `hf_transfer` extra when installing `huggingface_hub` (e.g. `pip install huggingface_hub[hf_transfer]`).
- Set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable.

Tests will have to be run to determine whether this feature is worth using, given that the README.md files weigh between 3 KB and 10 KB.
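One way to run such a test is to time the same download with and without the feature enabled. The helper below is a minimal sketch: `average_seconds` is a hypothetical name, and the commented usage assumes `force_download=True` is passed so that the local cache does not hide the transfer time.

```python
import time


def average_seconds(fn, repeats=5):
    """Call fn() `repeats` times and return the mean wall-clock duration."""
    durations = []
    for _ in range(repeats):
        started = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - started)
    return sum(durations) / len(durations)


# Example usage (needs network access). Run it once with
# HF_HUB_ENABLE_HF_TRANSFER=1 set in the environment and once without:
#
# from huggingface_hub import hf_hub_download
#
# def download_readme():
#     hf_hub_download(
#         repo_id="google/pegasus-xsum",
#         filename="README.md",
#         force_download=True,  # bypass the cache so the transfer is measured
#     )
#
# print(average_seconds(download_readme))
```

For files of a few kilobytes the request latency probably dominates the measurement, so averaging over several repeats gives a fairer comparison than a single run.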