When using a pretrained model, check how it was trained, on which datasets, and what its limitations and biases are. All of this information should be indicated on its model card.

The huggingface_hub Python library is a package that offers a set of tools for the model and dataset hubs. It provides simple methods and classes for common tasks such as getting information about repositories on the Hub and managing them. Its APIs work on top of git to manage those repositories’ content and to integrate the Hub into your projects and libraries.

In [None]:
from huggingface_hub import (
    # User management
    login,
    logout,
    whoami,

    # Repository creation and management
    create_repo,
    delete_repo,
    update_repo_visibility,

    # And some methods to retrieve/change information about the content
    list_models,
    list_datasets,
    list_metrics,
    list_repo_files,
    upload_file,
    delete_file,
)

The Repository class manages a local repository in a git-like manner. It abstracts most of the pain points one may have with git to provide all features that we require.

Using this class requires both git and git-lfs, so make sure git-lfs is installed and set up before you begin.


In [None]:
from huggingface_hub import Repository

repo = Repository("<path_to_dummy_folder>", clone_from="<namespace>/dummy-model")

repo.git_pull()
repo.git_add()
repo.git_commit()
repo.git_push()
repo.git_tag()

The model card is the central definition of the model: it ensures reusability by fellow community members and reproducibility of results, and it provides a platform on which other members may build their artifacts. The model card is created through the repository's README.md file.
It usually starts with a very brief, high-level overview of what the model is for, followed by additional details in the following sections:

Model description
Intended uses & limitations
How to use
Limitations and bias
Training data
Training procedure
Evaluation results

- Model description --> Covers the architecture, the version, whether the model was introduced in a paper, whether an original implementation is available, the author, and general information about the model. Any copyright should be attributed here. It can also include general information about training procedures, parameters, and important disclaimers.

- Intended uses and limitations --> Describes the use cases the model is intended for, including the languages, fields, and domains where it can be applied. It also documents areas that are known to be out of scope for the model or where it is likely to perform suboptimally.

- How to use --> Gives examples of how to use the model. This can showcase usage of the pipeline() function and of the model and tokenizer classes.

- Training data --> Indicates which datasets the model was trained on, with a brief description of each dataset.

- Training procedure --> Describes all the aspects of training that are relevant for reproducibility. This includes any preprocessing and postprocessing done on the data, as well as details such as the number of epochs the model was trained for, the batch size, and the learning rate.

- Variable and metrics --> Describes the metrics used for evaluation and the different factors being measured. This makes it easy to compare the model with others.

- Evaluation results --> Indicates how well the model performs on the evaluation dataset. If the model uses a decision threshold, provide the threshold used in the evaluation.
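As a rough illustration of the layout described above, the section names can be turned into a README.md skeleton. This is only a sketch using the standard library: `build_model_card` is a hypothetical helper, the model name and overview text are placeholders, and a temporary folder stands in for a local clone of the model repository.

```python
from pathlib import Path
import tempfile

# Section headings taken from the list of model card sections above.
SECTIONS = [
    "Model description",
    "Intended uses & limitations",
    "How to use",
    "Limitations and bias",
    "Training data",
    "Training procedure",
    "Evaluation results",
]


def build_model_card(model_name: str, overview: str) -> str:
    """Return Markdown text for an empty model card (hypothetical helper)."""
    lines = [f"# {model_name}", "", overview, ""]
    for section in SECTIONS:
        # Each section starts as a placeholder to be filled in by hand.
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


card = build_model_card(
    "dummy-model",
    "A placeholder model used to illustrate the card layout.",
)

# Write the card as README.md in a scratch folder that stands in for the repo.
repo_dir = Path(tempfile.mkdtemp())
readme_path = repo_dir / "README.md"
readme_path.write_text(card, encoding="utf-8")
print(card)
```

In a real repository the file would be written into the cloned folder and then committed and pushed like any other file.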


In [None]:
# Loading local datasets:

from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")

# By default, loading local files creates a DatasetDict object with a train split.
# We can see this by inspecting the squad_it_dataset object:

squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})
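The field="data" argument is needed because files in the SQuAD layout nest their records under a top-level "data" key. The sketch below mimics that selection with the standard json module on a tiny, made-up file. The file name and its contents are assumptions for illustration, and this is not how the datasets library is implemented internally.

```python
import json
import tempfile
from pathlib import Path

# A tiny, made-up file in the SQuAD layout: records sit under a top-level "data" key.
sample = {
    "version": "1.1",
    "data": [
        {"title": "Example A", "paragraphs": []},
        {"title": "Example B", "paragraphs": []},
    ],
}
path = Path(tempfile.mkdtemp()) / "sample-train.json"
path.write_text(json.dumps(sample), encoding="utf-8")

# Roughly what field="data" asks for: keep only the list stored under "data".
with path.open(encoding="utf-8") as f:
    records = json.load(f)["data"]

# The keys of each record correspond to the features shown in the output above.
column_names = sorted(records[0].keys())
print(column_names, len(records))
```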

To include both the train and test splits in a single DatasetDict object, so that Dataset.map() functions can be applied across both splits at once, provide a dictionary to the data_files argument that maps each split name to the file associated with that split.

In [None]:
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset


DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

In [None]:
# To load a remote dataset, instead of providing a path to local files, point the data_files
# argument of load_dataset() to one or more URLs where the remote files are stored.

url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
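The remote files above are gzip-compressed JSON, and load_dataset() takes care of decompressing them. For reference, a manual equivalent of the decompress-then-parse step could look like the sketch below. It works on a made-up local file (the file name and payload are assumptions) rather than on the real SQuAD-it archives.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Made-up payload in the same nested layout, written as gzip-compressed JSON.
payload = {"data": [{"title": "Example", "paragraphs": []}]}
gz_path = Path(tempfile.mkdtemp()) / "sample-test.json.gz"
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    json.dump(payload, f)

# Manual equivalent: open the archive in text mode, parse it, keep the "data" list.
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    rows = json.load(f)["data"]

print(rows[0]["title"])
```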