# Version datasets in model training runs

## Introduction

You can version datasets, models, and other file objects as Artifacts in Neptune.

This guide shows how to:
* Keep track of a dataset version in your model training runs with artifacts  
* Query the dataset version from previous runs to make sure you are training on the same dataset version
* Group your Neptune Runs by the dataset version they were trained on

By the end of this guide, you will have trained a few models on the same dataset version and seen the runs for that version grouped in the Neptune app.


[See this example in Neptune](https://app.neptune.ai/o/common/org/data-versioning/experiments?compare=IwdgNMQ&split=tbl&dash=artifacts&viewId=6777136b-938e-4639-943d-3f6bc52f8497)

![image](https://neptune.ai/wp-content/uploads/artifacts-grouped-by-dataset-version.png)

## Before you start

This notebook example lets you try out Neptune as an anonymous user, with zero setup.

* If you are running the notebook on your local machine, you need to have [Python](https://www.python.org/downloads/) and [pip](https://pypi.org/project/pip/) installed.
* If you want to see the example recorded to your own workspace instead:
    * Create a Neptune account → [Take me to registration](https://neptune.ai/register)
    * Create a Neptune project that you will use for tracking metadata → [Tell me more about projects](https://docs.neptune.ai/administration/projects)

## Install Neptune and dependencies

In [2]:
! pip install neptune-client scikit-learn==0.24.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Prepare a model training script

Create a training script where you:
* Specify dataset paths for training and testing
* Define model parameters
* Calculate the score on the test set

In [1]:
import neptune.new as neptune

run = neptune.init(
    project="mlops/mlops",
    api_token="eyJhcGlfYWRkcmVzcyI6Imh0dHBzOi8vYXBwLm5lcHR1bmUuYWkiLCJhcGlfdXJsIjoiaHR0cHM6Ly9hcHAubmVwdHVuZS5haSIsImFwaV9rZXkiOiIzYzg3NDA4MS1iNGM1LTRmYzQtYmMyYy1lYjUyZmZhZjg1NTMifQ==",

)  # your credentials


https://app.neptune.ai/mlops/mlops/e/MLOP-14


Info (NVML): NVML Shared Library Not Found. GPU usage metrics may not be reported. For more information, see https://docs.neptune.ai/you-should-know/what-can-you-log-and-display#hardware-consumption


Remember to stop your run once you’ve finished logging your metadata (https://docs.neptune.ai/api-reference/run#.stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


In [2]:
run["Tags"] = "test"

In [27]:
import pandas as pd

clean_data = pd.read_csv('/Users/maomao/Desktop/machine leanring operatinos/team1/cleaned_data_before_modeling', index_col=0)
split_index = int(0.8 * len(clean_data))  # ordered 80/20 split, no shuffling
data_train = clean_data[:split_index]
data_test = clean_data[split_index:]
data_train.to_csv("data_train.csv")
data_test.to_csv("data_test.csv")

In [30]:
data_train.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn', 'tenureCat',
       'MCC', 'TCC'],
      dtype='object')

In [31]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

TRAIN_DATASET_PATH = "/content/data_train.csv"
TEST_DATASET_PATH = "/content/data_test.csv"


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    # All columns except the customerID identifier and the Churn target
    FEATURE_COLUMNS = [
        'gender', 'SeniorCitizen', 'Partner', 'Dependents',
        'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
        'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
        'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
        'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'tenureCat',
        'MCC', 'TCC',
    ]
    TARGET_COLUMN = 'Churn'  # a single column name gives the 1-D target scikit-learn expects
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    model = rf.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # Return the per-class and averaged metrics as a nested dict
    return classification_report(y_test, predictions, output_dict=True)

## Initialize Neptune and create a new run

Connect your script to the Neptune application and create a new run.

In [37]:
import neptune.new as neptune

run = neptune.init(
    project="mlops/mlops",
    api_token="eyJhcGlfYWRkcmVzcyI6Imh0dHBzOi8vYXBwLm5lcHR1bmUuYWkiLCJhcGlfdXJsIjoiaHR0cHM6Ly9hcHAubmVwdHVuZS5haSIsImFwaV9rZXkiOiJjMzYwNGRmMi02ZjY5LTRlN2QtYmM1NC04MjdmOGIxYjMxY2YifQ==",

)

https://app.neptune.ai/mlops/mlops/e/MLOP-13
Remember to stop your run once you’ve finished logging your metadata (https://docs.neptune.ai/api-reference/run#.stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


Click on the link above to open this run in Neptune.

For now it is empty, but keep the tab with the run open to see what happens next.

**A few explanations**

In the code above, you tell Neptune:

* **who you are**: your Neptune API token, `api_token`
* **where you want to send your data**: your Neptune `project`

At this point you have a new run in Neptune. From now on, you will use `run` to log metadata to it.

---

**Note**


Instead of logging data to the public project `common/quickstarts` as the anonymous user `neptuner`, you can log it to your own project.

To do that:

1. Get your [Neptune API token](https://docs.neptune.ai/getting-started/installation#authentication-neptune-api-token)
2. Pass the token to the `api_token` argument of the `neptune.init()` method: `api_token=YOUR_API_TOKEN`
3. Get your [Neptune project name](https://docs.neptune.ai/getting-started/installation#setting-the-project-name)
4. Pass your project name to the `project` argument of the `neptune.init()` method.

For example:

```python
neptune.init(project="YOUR_WORKSPACE/YOUR_PROJECT", api_token="YOUR_API_TOKEN")
```
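
Hardcoding the token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch of one alternative, the credentials can be read from the `NEPTUNE_PROJECT` and `NEPTUNE_API_TOKEN` environment variables, which the Neptune client also looks up on its own. The helper name `neptune_credentials` is hypothetical and not part of the Neptune API.

```python
import os


def neptune_credentials(env=os.environ):
    """Build keyword arguments for neptune.init() from environment variables.

    Hypothetical helper: raises a readable error instead of passing a
    missing project name or token on to Neptune.
    """
    try:
        return {
            "project": env["NEPTUNE_PROJECT"],
            "api_token": env["NEPTUNE_API_TOKEN"],
        }
    except KeyError as missing:
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable first"
        ) from None


# Example with a stand-in environment; real values come from your shell.
example_env = {
    "NEPTUNE_PROJECT": "YOUR_WORKSPACE/YOUR_PROJECT",
    "NEPTUNE_API_TOKEN": "YOUR_API_TOKEN",
}
credentials = neptune_credentials(example_env)
# run = neptune.init(**credentials)
```

With this approach the token stays out of the notebook file and its saved outputs.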

## Add tracking of the dataset version

Save the dataset versions as Neptune artifacts:

In [38]:
TRAIN_DATASET_PATH = "/content/data_train.csv"
TEST_DATASET_PATH = "/content/data_test.csv"

In [39]:
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)
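
As a rough illustration of why a hash can stand in for a dataset version, the sketch below computes a SHA-256 digest of a file's contents and shows that it changes as soon as the file does. This is not Neptune's own hashing scheme, and `file_sha256` is a hypothetical helper used only for the demonstration.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway CSV file.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data_train.csv")
    with open(csv_path, "w") as handle:
        handle.write("tenure,Churn\n1,0\n")
    hash_before = file_sha256(csv_path)
    hash_again = file_sha256(csv_path)  # unchanged file, same digest

    with open(csv_path, "a") as handle:
        handle.write("2,1\n")  # one extra row changes the digest
    hash_after = file_sha256(csv_path)
```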

**Note:**

You can also version the entire folder that contains your datasets by running:

```python
run["datasets"].track_files(DATASET_FOLDER)
```

Datasets are also often tracked at the project level with [Project metadata](https://docs.neptune.ai/api-reference/project).

For more information, see [Organize and share dataset versions](https://docs.neptune.ai/how-to-guides/data-versioning/organize-and-share-dataset-versions).

## Run model training and log parameters and metrics to Neptune

Log parameters to Neptune

In [40]:
params = {
    "n_estimators": 5,
    "max_depth": 2,
    "max_features": 5,
}
run["parameters"] = params

Train the model and log the test scores to Neptune:

In [41]:
score = train_model(params, TRAIN_DATASET_PATH, TEST_DATASET_PATH)




In [42]:
score

{'0': {'f1-score': 0.863448275862069,
  'precision': 0.8186573670444638,
  'recall': 0.9134241245136187,
  'support': 1028},
 '1': {'f1-score': 0.5381026438569209,
  'precision': 0.6603053435114504,
  'recall': 0.4540682414698163,
  'support': 381},
 'accuracy': 0.7892122072391767,
 'macro avg': {'f1-score': 0.7007754598594949,
  'precision': 0.7394813552779571,
  'recall': 0.6837461829917175,
  'support': 1409},
 'weighted avg': {'f1-score': 0.7754733391736649,
  'precision': 0.7758382606100578,
  'recall': 0.7892122072391767,
  'support': 1409}}
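
`classification_report(..., output_dict=True)` returns a nested dictionary, as the output above shows. A minimal sketch of flattening it into one scalar per path, so that each metric could be logged as its own field, might look like this. The helper `flatten_report` and the choice to replace spaces with underscores are assumptions, not part of Neptune or scikit-learn.

```python
def flatten_report(report, prefix="metrics"):
    """Flatten a nested report dict into {"prefix/key/subkey": value} pairs."""
    flat = {}
    for key, value in report.items():
        path = f"{prefix}/{key.replace(' ', '_')}"
        if isinstance(value, dict):
            # Recurse into per-class and averaged sub-dictionaries
            flat.update(flatten_report(value, prefix=path))
        else:
            flat[path] = value
    return flat


# Abbreviated copy of the report shown above.
sample_report = {
    "accuracy": 0.7892122072391767,
    "macro avg": {"recall": 0.6837461829917175, "support": 1409},
}
flat_metrics = flatten_report(sample_report)
# for path, value in flat_metrics.items():
#     run[path] = value
```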

In [43]:
run["metrics/accuracy"] = score["accuracy"]

In [44]:
run["metrics/recall"] = score["macro avg"]["recall"]

Fetch the ID of this training run from Neptune.

You will need it later to check that a new run uses the same dataset versions as this baseline run.

In [13]:
baseline_run_id = run["sys/id"].fetch()
print(baseline_run_id)

MLOP-9


## Stop logging to the current run
**Warning:**

Once you are done logging, you should stop tracking the run using the `stop()` method.
This is needed only while logging from a notebook environment. While logging through a script, Neptune automatically stops tracking once the script has completed execution.
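
One way to make sure `stop()` is always called in a notebook, even when a cell raises, is to wrap the logging code in a small context manager built on `try`/`finally`. The sketch below uses a stand-in object: `DummyRun` and `stopping` are hypothetical names, and `DummyRun` only mimics the `stop()` method so the example runs without a Neptune connection.

```python
from contextlib import contextmanager


class DummyRun:
    """Stand-in for a Neptune run; only tracks whether stop() was called."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


@contextmanager
def stopping(run):
    """Yield the run and call run.stop() on exit, even after an exception."""
    try:
        yield run
    finally:
        run.stop()


demo_run = DummyRun()
try:
    with stopping(demo_run) as active_run:
        raise ValueError("training failed")  # simulate a failing cell
except ValueError:
    pass  # the run has still been stopped at this point
```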

In [14]:
run.stop()

Shutting down background jobs, please wait a moment...
Done!
Waiting for the remaining 1 operations to synchronize with Neptune. Do not kill this process.
All 1 operations synced, thanks for waiting!
Explore the metadata in the Neptune app:
https://app.neptune.ai/mlops/mlops/e/MLOP-9


## Add a version check for the training and testing datasets

You can fetch the dataset version hash from the baseline run and compare it with the hash of the current dataset version.

Create a new Neptune run and track the dataset version:

In [15]:
new_run = neptune.init(
    project="mlops/mlops",
    api_token="eyJhcGlfYWRkcmVzcyI6Imh0dHBzOi8vYXBwLm5lcHR1bmUuYWkiLCJhcGlfdXJsIjoiaHR0cHM6Ly9hcHAubmVwdHVuZS5haSIsImFwaV9rZXkiOiJjMzYwNGRmMi02ZjY5LTRlN2QtYmM1NC04MjdmOGIxYjMxY2YifQ==",
)
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

https://app.neptune.ai/mlops/mlops/e/MLOP-10
Remember to stop your run once you’ve finished logging your metadata (https://docs.neptune.ai/api-reference/run#.stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


Compare the current dataset version with the baseline dataset version

In [17]:
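# Reopen the baseline run read-only (api_token falls back to NEPTUNE_API_TOKEN if omitted)
baseline_run = neptune.init(project="mlops/mlops", run=baseline_run_id, mode="read-only")
new_run.wait()  # force asynchronous logging operations to finish

assert baseline_run["datasets/train"].fetch_hash() == new_run["datasets/train"].fetch_hash()
assert baseline_run["datasets/test"].fetch_hash() == new_run["datasets/test"].fetch_hash()
baseline_run.stop()  # the baseline run is no longer needed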

## Run model training with new parameters

Change the parameters and run model training

In [18]:
params = {
    "n_estimators": 8,
    "max_depth": 3,
    "max_features": 2,
}
new_run["parameters"] = params

score = train_model(params, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

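# Log the same scalar metrics as for the baseline run so the two can be compared
new_run["metrics/accuracy"] = score["accuracy"]
new_run["metrics/recall"] = score["macro avg"]["recall"]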



## Stop logging to the current run

In [19]:
new_run.stop()

Shutting down background jobs, please wait a moment...
Done!
Waiting for the remaining 1 operations to synchronize with Neptune. Do not kill this process.
All 1 operations synced, thanks for waiting!
Explore the metadata in the Neptune app:
https://app.neptune.ai/mlops/mlops/e/MLOP-10

## See all model training runs for this dataset version

To see all training runs for a particular dataset version:
* Go to the `Runs table` in the Neptune app
* Click **+Add column**, type `datasets/train`, and select it to add it to the `Runs table`
* Add the parameters and test scores in the same way
* Compare the scores across runs: because the dataset version didn't change, any difference comes from the parameters

You can also [use **+Group by**](https://docs.neptune.ai/how-to-guides/neptune-ui/groupby) to group runs by train dataset version and quickly find the training runs you care about.

[See this example in Neptune](https://app.neptune.ai/o/common/org/data-versioning/experiments?compare=IwdgNMQ&split=tbl&dash=artifacts&viewId=6777136b-938e-4639-943d-3f6bc52f8497)

![image](https://neptune.ai/wp-content/uploads/artifacts-grouped-by-dataset-version.png)
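
The grouping shown in the Runs table can also be sketched in plain Python once run metadata has been exported. The records below are made-up placeholders rather than real runs or hashes, and `group_by_dataset` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_dataset(run_records, key="datasets/train"):
    """Map each dataset hash to the IDs of the runs trained on it."""
    groups = defaultdict(list)
    for record in run_records:
        groups[record[key]].append(record["sys/id"])
    return dict(groups)


# Placeholder records; real ones would come from the Runs table.
records = [
    {"sys/id": "RUN-1", "datasets/train": "hash-A"},
    {"sys/id": "RUN-2", "datasets/train": "hash-A"},
    {"sys/id": "RUN-3", "datasets/train": "hash-B"},
]
grouped = group_by_dataset(records)
```

Runs that share a hash were trained on the same dataset version, so their scores can be compared directly.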