This notebook shows how you and other members of your organization can use W&B Registry to track, share, and use dataset and model artifacts in a machine learning workflow. By the end of this notebook, you will know how to use W&B to:

1. Create a [custom registry](https://docs.wandb.ai/guides/registry/create_registry)
2. Create [collections](https://docs.wandb.ai/guides/registry/create_collection) within our registry
3. Make our dataset and model artifacts available to other members of our organization
4. Download artifacts from the registry for inference

To do this, we will create a basic neural network to classify the biological class of animals.

## Install and import packages

In [None]:
#!pip install wandb torch ucimlrepo scikit-learn

In [None]:
import torch 
from torch import nn
import wandb
from ucimlrepo import fetch_ucirepo

from sklearn.model_selection import train_test_split

## Retrieve and process dataset
We will use the open source [Zoo dataset](https://archive.ics.uci.edu/dataset/111/zoo) from the UCI Machine Learning Repository.

### Retrieve dataset
We can either manually download the dataset or use the [`ucimlrepo` package](https://github.com/uci-ml-repo/ucimlrepo) to import the dataset directly into our notebook. For this example, we will go with the latter and import the dataset directly into this notebook:

In [None]:
# fetch dataset 
zoo = fetch_ucirepo(id=111) 
  
# data (as pandas dataframes) 
X = zoo.data.features 
y = zoo.data.targets 

### Explore the data

In [None]:
print("features: ", X.shape, "type: ", type(X))
print("labels: ", y.shape, "type: ", type(y))

In [None]:
X.head(5)

### Process data

Most of the major processing was already done for us (no missing values, normalized, etc.). For training, we will convert our dataset from pandas DataFrames to tensors, cast the input tensor to the data type expected by the `nn.Linear` module, and shift our labels so that they index from 0 to 6:

In [None]:
# Data type of the data must match the data type of the model, the default dtype for nn.Linear is torch.float32
dataset = torch.tensor(X.values).type(torch.float32) 

# Convert to tensor and format labels from 0 - 6 for indexing
labels = torch.tensor(y.values)  - 1

print("dataset: ", dataset.shape, "dtype: ",dataset.dtype)
print("labels: ", labels.shape, "dtype: ",labels.dtype)
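The label shift in the cell above matters because `nn.CrossEntropyLoss` expects class indices in the range 0 to C-1, while the Zoo dataset numbers its classes from 1 to 7. Below is a minimal plain-Python sketch of the same shift with a range check added. The helper name `to_zero_indexed` is hypothetical and is not part of this notebook or of any library:

```python
def to_zero_indexed(labels, num_classes=7):
    """Shift 1-based class labels to 0-based indices and validate the range."""
    shifted = [label - 1 for label in labels]
    for original, index in zip(labels, shifted):
        if not 0 <= index < num_classes:
            raise ValueError(f"label {original} is outside 1..{num_classes}")
    return shifted


# Illustrative labels only, not rows taken from the dataset
zero_based = to_zero_indexed([1, 4, 7, 2])
print(zero_based)  # [0, 3, 6, 1]
```

A check like this fails early if a label falls outside the expected range, instead of surfacing later as an indexing error inside the loss function.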

Save the processed dataset locally using [`torch.save`](https://pytorch.org/docs/stable/generated/torch.save.html):

In [None]:
torch.save(dataset, "zoo_dataset.pt")
torch.save(labels, "zoo_labels.pt")

## Create a registry for our dataset and models

Let's create a registry to organize both our dataset artifacts and (at a later step) our model artifacts. To do this, navigate to the Registry App in the W&B App UI:

1. Within Custom registry, click on the **Create registry** button.
2. Provide a name for your registry in the **Name** field. For this example, we will name our registry "Zoo_Classifier".
3. Optionally provide a description about the registry.
4. From the [**Registry visibility**](https://docs.wandb.ai/guides/registry/configure_registry#registry-visibility-types) dropdown, select "Organization".
5. Select "All types" from the **Accepted artifacts** type dropdown.
6. Click on the **Create registry** button.


Note: You do not have to use a single registry to organize and track different types of artifacts. Another popular choice is to create one registry specifically for datasets, another specifically for models, and so forth.

## Track and publish dataset 

Within our "Zoo_Classifier" registry, we will create a collection called "Datasets". A collection is a set of linked artifact versions in a registry. In this example we will create two collections: one for our datasets and one for our models. First, let's create a collection for our datasets. To create a collection we need to do two things:

1. Specify the full path name where we want to store our artifact.
   * The full path name consists of: `{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}`
2. Use the `run.link_artifact` method and pass our artifact object and the full path name.



In [None]:
PROJECT = "zoo_experiment"
TEAM_ENTITY = "smle-reg-team-2"
ORG_NAME = "smle-registries-bug-bash"
REGISTRY_NAME = "Zoo_Classifier"
COLLECTION_NAME = "Datasets"

target_path=f"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}"
print(target_path)

In [None]:
run = wandb.init(
    entity=TEAM_ENTITY,
    project=PROJECT,
    job_type="upload_dataset"
)

artifact = wandb.Artifact(
    name="zoo_dataset",
    type="dataset"
)

artifact.add_file(local_path="zoo_dataset.pt", name="zoo_dataset")
artifact.add_file(local_path="zoo_labels.pt", name="zoo_labels")

run.link_artifact(artifact=artifact, target_path=target_path)

run.finish()

### Split data
Split the data into a training and test set.

In [None]:
# Hold out 25% of the samples as a test set; the fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    dataset, labels, random_state=42, test_size=0.25, shuffle=True
)

## Define model

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_stack = nn.Sequential(
            nn.Linear(in_features=16 , out_features=16),
            nn.Sigmoid(),
            nn.Linear(in_features=16, out_features=7)
        )

    def forward(self, x):
        logits = self.linear_stack(x)
        return logits

model = NeuralNetwork()
print(model)
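As a sanity check on the architecture above, the number of trainable parameters can be worked out by hand: a fully connected layer with `in_features` inputs and `out_features` outputs has `in_features * out_features` weights plus `out_features` biases, and `nn.Sigmoid` has no parameters. A small plain-Python sketch of that arithmetic follows; the helper name `linear_param_count` is made up for illustration:

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


# (in_features, out_features) of the two nn.Linear layers defined above
layers = [(16, 16), (16, 7)]
total_params = sum(linear_param_count(i, o) for i, o in layers)
print(total_params)  # 391
```

The first layer contributes 272 parameters and the second 119, so the whole model is small enough to train quickly on a CPU.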

### Define hyperparameters, loss function, and optimizer

In [None]:
hyperparameter_config = {
    "learning_rate": 0.1,
    "epochs": 1000,
    "model_type": "Multivariate_neural_network_classifier",
}

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=hyperparameter_config["learning_rate"])

## Train model

Train the model, save checkpoints locally, and store each checkpoint as a model artifact in W&B.

In [None]:
run = wandb.init(entity = TEAM_ENTITY, project = PROJECT, job_type = "training", config = hyperparameter_config)

# Training loop (each step uses the full training set)
for e in range(hyperparameter_config["epochs"]):
    pred = model(X_train)
    # CrossEntropyLoss expects labels of shape (N,), so drop the trailing dimension
    loss = loss_fn(pred, y_train.squeeze(1))

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Log the loss as a plain Python float rather than a tensor
    wandb.log({
        "train/epoch_ndx": e,
        "train/train_loss": loss.item()
    })

    # Checkpoint the model every 99 epochs (epochs 1, 100, 199, ...)
    if e % 99 == 1:
        print("epoch: ", e, "loss:", loss.item())

        PATH = 'zoo_wandb.pth'
        torch.save(model.state_dict(), PATH)

        artifact = wandb.Artifact(
            name=f"zoo-{wandb.run.id}",
            type="model",
            metadata={
                "num_classes": 7,
                "model_type": wandb.config["model_type"]
            }
        )
        # Add the checkpoint file, then save the artifact to W&B
        artifact.add_file(PATH)
        artifact.save()

run.finish()
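The checkpoint condition in the training loop above, `e % 99 == 1`, decides how many model artifact versions get created. The short sketch below lists the epochs that satisfy it for the 1000 epochs configured earlier:

```python
epochs = 1000

# Epochs at which the training loop saves a checkpoint
checkpoint_epochs = [e for e in range(epochs) if e % 99 == 1]

print(checkpoint_epochs)
# [1, 100, 199, 298, 397, 496, 595, 694, 793, 892, 991]
print(len(checkpoint_epochs))  # 11
```

Eleven checkpoints would correspond to versions `v0` through `v10`, assuming each saved checkpoint differs from the previous one so that W&B records a new version every time. That matches the `:v10` suffix on the model artifact name used in the next section.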

## Publish model to the registry
Let's make this model artifact available to other users in our organization. To do this, we will create another collection within our Zoo_Classifier registry.

To create a collection within our registry, we will need to get the full name (or path) of our model artifact. Go to the W&B App UI and find the full name of the model artifact you want to link to the registry:

1. Click on the **Artifacts** tab.
2. Select the name of the artifact within the left navbar.
3. Click on the **Version** tab.
4. Within the **Version overview**, you will find the full name of your artifact. Make note of the name.
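Judging by the example used later in this notebook, the full name has the shape `{entity}/{project}/{artifact_name}:{version}`. The sketch below splits such a name into its parts, which can help when checking that you copied the right one. The helper `split_full_name` is hypothetical and not part of the W&B SDK:

```python
def split_full_name(full_name):
    """Split 'entity/project/artifact_name:version' into its four parts."""
    path, separator, version = full_name.rpartition(":")
    if not separator:
        raise ValueError("expected a ':version' suffix")
    parts = path.split("/")
    if len(parts) != 3:
        raise ValueError("expected 'entity/project/artifact_name'")
    entity, project, artifact_name = parts
    return entity, project, artifact_name, version


name_parts = split_full_name("smle-reg-team-2/zoo_experiment/zoo-nhqnys3o:v10")
print(name_parts)
# ('smle-reg-team-2', 'zoo_experiment', 'zoo-nhqnys3o', 'v10')
```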

In [None]:
ORG_NAME = "smle-registries-bug-bash"
REGISTRY_NAME = "Zoo_Classifier"
COLLECTION_NAME = "Trained_models"

In [None]:
target_path=f"{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}"
print(target_path)

In [None]:
run = wandb.init(entity=TEAM_ENTITY, project=PROJECT)
name="smle-reg-team-2/zoo_experiment/zoo-nhqnys3o:v10"
model_artifact = run.use_artifact(artifact_or_name=name, type="model")
run.link_artifact(artifact=model_artifact, target_path=target_path)
run.finish()

## Download artifacts from registry for inference

For this last section, suppose you are a different user. This user wants to take the model and dataset that you pushed to the registry and make predictions on a new test set. Also suppose that this user has [member role permissions](https://docs.wandb.ai/guides/registry/configure_registry#registry-roles-permissions), which means they can view and download artifacts from our registry.

How can this user get the artifacts that you published to the registry? They need to:

1. Know the path of the artifact in the registry
2. Use the W&B Python SDK to download the artifacts
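The registry path used in the cells that follow is the collection path from earlier with a version suffix appended. As a minimal sketch, it can be assembled from its parts like this. The helper name `registry_artifact_path` is an assumption made for illustration, not a W&B function:

```python
def registry_artifact_path(org_name, registry_name, collection_name, version):
    """Build '{ORG_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}:{version}'."""
    return f"{org_name}/wandb-registry-{registry_name}/{collection_name}:{version}"


model_path = registry_artifact_path(
    "smle-registries-bug-bash", "Zoo_Classifier", "Trained_models", "v0"
)
dataset_path = registry_artifact_path(
    "smle-registries-bug-bash", "Zoo_Classifier", "Datasets", "v0"
)
print(model_path)
# smle-registries-bug-bash/wandb-registry-Zoo_Classifier/Trained_models:v0
print(dataset_path)
# smle-registries-bug-bash/wandb-registry-Zoo_Classifier/Datasets:v0
```

Building the path from variables avoids typos when the same organization and registry names are reused across several cells.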

### Download model

In [None]:
run = wandb.init(entity=TEAM_ENTITY, project=PROJECT)
name="smle-registries-bug-bash/wandb-registry-Zoo_Classifier/Trained_models:v0"
registry_model = run.use_artifact(artifact_or_name=name)
local_model_path = registry_model.download()
run.finish()

Because we saved only the model's `state_dict`, we need to re-create the model architecture before loading the weights:

In [None]:
loaded_model = NeuralNetwork()
loaded_model.load_state_dict(torch.load(f=local_model_path + "/zoo_wandb.pth"))

### Get dataset from registry

Let's get the dataset from our registry. For this example, we will download the dataset and use the same random seed to get our test set and labels.

In [None]:
name = "smle-registries-bug-bash/wandb-registry-Zoo_Classifier/Datasets:v0"

In [None]:
run = wandb.init(entity=TEAM_ENTITY, project=PROJECT)
dataset_artifact = run.use_artifact(artifact_or_name=name, type="dataset")
local_dataset_path = dataset_artifact.download()

In [None]:
# Load dataset and labels into notebook
loaded_data = torch.load(f=local_dataset_path+ "/zoo_dataset")
loaded_labels = torch.load(f=local_dataset_path + "/zoo_labels")

# Re-create the earlier split by reusing the same random_state seed
X_train, X_test, y_train, y_test = train_test_split(
    loaded_data, loaded_labels, random_state=42, test_size=0.25, shuffle=True
)
run.finish()
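Reusing the same seed is what lets a second user recover the same test set from the downloaded data. The sketch below illustrates the idea with the standard library only. It is a stand-in for `train_test_split`, not scikit-learn's actual algorithm, so the indices it picks will differ from the ones above. The function name `seeded_split` is hypothetical, and 101 is used as the sample count because the Zoo dataset has 101 rows:

```python
import math
import random


def seeded_split(n_samples, test_size=0.25, seed=42):
    """Shuffle sample indices with a fixed seed and split them into train and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)  # this stand-in rounds the test size up
    return indices[n_test:], indices[:n_test]


train_a, test_a = seeded_split(101)
train_b, test_b = seeded_split(101)

print(test_a == test_b)            # True: same seed, same split
print(len(train_a), len(test_a))   # 75 26
```

Without a fixed seed, each call would shuffle differently, and the second user could end up evaluating on samples the model was trained on.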

### Make predictions with loaded model

Run the test set through the loaded model and map the predicted class indices to class names.

In [None]:
run = wandb.init(entity=TEAM_ENTITY, project=PROJECT)

In [None]:
outputs = loaded_model(X_test)

In [None]:
# Take the index of the largest logit as the predicted class
_, predicted = torch.max(outputs, 1)
print(predicted[:10])

In [None]:
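# Zoo class types 1 to 7, shifted to 0 to 6 to match the labels tensor
class_labels = {
    0: "Mammalia",
    1: "Aves",
    2: "Reptilia",
    3: "Actinopterygii",
    4: "Amphibia",
    5: "Insecta",
    6: "Invertebrata",  # mixed invertebrates (e.g. crab, octopus, worm), not only Crustacea
}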

In [None]:
results = [class_labels[index] for index in predicted.tolist()]
results[:10]
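To judge how well the loaded model does, one option is to compare the predicted class indices against the true labels of the test set. The sketch below computes plain accuracy on Python lists. The function name `accuracy` and the example values are made up for illustration. With the tensors in this notebook, the inputs would be something like `predicted.tolist()` and `y_test.squeeze(1).tolist()`:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up class indices, not real model output
example_predicted = [0, 1, 3, 0, 6]
example_actual = [0, 1, 3, 5, 6]

example_accuracy = accuracy(example_predicted, example_actual)
print(example_accuracy)  # 0.8
```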

In [None]:
run.finish()