# Diving Deeper into Weights & Biases

In this notebook, we will explore the following:

* Versioning datasets using [Artifacts](https://docs.wandb.ai/guides/artifacts).
* Exploring and visualizing our datasets with [Tables](https://docs.wandb.ai/guides/data-vis).

## Import the Libraries

In [None]:
import os

import pandas as pd
import wandb

## Logging Dataset to Artifacts

Download the `train.csv` and `test.csv` files from [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/data) and place them in the `data` directory.

In [None]:
# Initialize a WandB Run
wandb.init(project="mlops-zoomcamp-wandb", job_type="log_data")

# Log the `data` directory as an artifact
artifact = wandb.Artifact('Titanic', type='dataset', metadata={"Source": "https://www.kaggle.com/competitions/titanic/data"})
artifact.add_dir('data')
wandb.log_artifact(artifact)

# End the WandB Run
wandb.finish()
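
Since `add_dir` adds every file found in the directory, it can help to confirm that both CSV files are in place before starting the run. The following is a minimal sketch of such a check; `missing_files` is a hypothetical helper (not part of the W&B API), and the `data` directory and file names are taken from the step above.

```python
from pathlib import Path


def missing_files(data_dir, expected=("train.csv", "test.csv")):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


# Hypothetical pre-flight check to run before artifact.add_dir("data")
missing = missing_files("data")
if missing:
    print(f"Missing from data/: {missing}")
```

An empty list means both files were found, so the directory is ready to be logged.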

## Versioning the Data

Next, we fetch the logged dataset, add a `Split` column to `train.csv`, and log the `data` directory again under the same artifact name. Because the contents have changed, W&B records the result as a new version of the artifact.

In [None]:
# Initialize a WandB Run
wandb.init(project="mlops-zoomcamp-wandb", job_type="log_data")

# Fetch version `v0` of the dataset artifact and download it locally
artifact = wandb.use_artifact('geekyrakshit/mlops-zoomcamp-wandb/Titanic:v0', type='dataset')
artifact_dir = artifact.download()
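
The string passed to `use_artifact` follows the pattern `entity/project/name:alias`, where `geekyrakshit` is the entity that owns the project in this example. If you logged the artifact under your own account, you would substitute your own entity. Below is a small sketch of composing that reference; `artifact_ref` is a hypothetical helper and `my-team` is a placeholder entity, neither of which comes from W&B itself.

```python
def artifact_ref(entity, project, name, alias="latest"):
    """Build a fully qualified artifact reference: entity/project/name:alias."""
    return f"{entity}/{project}/{name}:{alias}"


# Reference used in this notebook
print(artifact_ref("geekyrakshit", "mlops-zoomcamp-wandb", "Titanic", "v0"))

# Placeholder entity; replace "my-team" with your own W&B username or team
print(artifact_ref("my-team", "mlops-zoomcamp-wandb", "Titanic"))
```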

Read the dataset files from the downloaded artifact directory.

In [None]:
train_df = pd.read_csv(os.path.join(artifact_dir, "train.csv"))
test_df = pd.read_csv(os.path.join(artifact_dir, "test.csv"))

In [None]:
# Use the first 80% of the rows for training and the rest for validation
num_train_examples = int(0.8 * len(train_df))
num_val_examples = len(train_df) - num_train_examples

print(num_train_examples, num_val_examples)

In [None]:
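# Label each row by position: the first block is "Train", the remainder "Validation"
train_df["Split"] = ["Train"] * num_train_examples + ["Validation"] * num_val_examples
# Overwrite the local copy so the next artifact version includes the `Split` column
train_df.to_csv("data/train.csv", encoding='utf-8', index=False)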
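
The split above is positional: the first 80% of the rows are labelled `Train` and the remainder `Validation`. Computing the validation count as a difference guarantees that the two counts add up to the number of rows, so the label list always matches the DataFrame length. Here is a minimal sketch of the same arithmetic in plain Python; `split_labels` is a hypothetical helper, and 891 is assumed to be the row count of the Kaggle `train.csv` (adjust it if your copy differs).

```python
def split_labels(num_rows, train_fraction=0.8):
    """Return positional split labels: train rows first, validation rows last."""
    num_train = int(train_fraction * num_rows)
    num_val = num_rows - num_train
    return ["Train"] * num_train + ["Validation"] * num_val


# Assumed row count of the Titanic training file
labels = split_labels(891)
print(labels.count("Train"), labels.count("Validation"))  # 712 179
```

Note that a positional split does not shuffle the rows, so it relies on the file order not being correlated with the target.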

In [None]:
# Log the updated `data` directory under the same name; W&B stores it as a new version
artifact = wandb.Artifact('Titanic', type='dataset', metadata={"Source": "https://www.kaggle.com/competitions/titanic/data"})
artifact.add_dir('data')
wandb.log_artifact(artifact)

# End the WandB Run
wandb.finish()

## Explore the Dataset

In [None]:
# Initialize a WandB Run
wandb.init(project="mlops-zoomcamp-wandb", job_type="explore_data")

# Fetch the latest version of the dataset artifact 
artifact = wandb.use_artifact('geekyrakshit/mlops-zoomcamp-wandb/Titanic:latest', type='dataset')
artifact_dir = artifact.download()

# Read the files
train_val_df = pd.read_csv(os.path.join(artifact_dir, "train.csv"))
test_df = pd.read_csv(os.path.join(artifact_dir, "test.csv"))

In [None]:
# Create tables corresponding to datasets
train_val_table = wandb.Table(dataframe=train_val_df)
test_table = wandb.Table(dataframe=test_df)

# Log the tables to Weights & Biases
wandb.log({
    "Train-Val-Table": train_val_table,
    "Test-Table": test_table
})

# End the WandB Run
wandb.finish()
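
The `Split` column added earlier can also be inspected locally with pandas, for example to check how many rows fall into each subset before looking at the logged table. This is a minimal sketch on a toy DataFrame; the rows below are made up for illustration and are not Titanic data.

```python
import pandas as pd

# Toy stand-in for train_val_df; values are illustrative only
toy_df = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5],
        "Split": ["Train", "Train", "Train", "Train", "Validation"],
    }
)

# Count rows per split
split_counts = toy_df["Split"].value_counts().to_dict()
print(split_counts)

# Select only the training rows
train_only = toy_df[toy_df["Split"] == "Train"]
print(len(train_only))
```

The same two expressions should work on `train_val_df` once it has been read from the artifact directory.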