# DataShelf Example

This notebook demonstrates the core functionality of DataShelf, a Git-like version control system for datasets.

We'll walk through:
1. Setting up a DataShelf project
2. Creating collections to organize datasets
3. Saving and versioning datasets
4. Detecting duplicate saves
5. Understanding the metadata structure

*Thanks Claude for helping with the dialogue :)*

## Setup

First, let's import the necessary libraries and create some sample data to work with.

In [1]:
import pandas as pd
import numpy as np
from datashelf.core import init, create_collection, save

# Create sample datasets for our example
np.random.seed(42)

# Sample sales data
sales_data = pd.DataFrame({
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget D'],
    'units_sold': [150, 200, 75, 300],
    'price': [25.99, 45.50, 15.00, 60.00],
    'region': ['North', 'South', 'East', 'West']
})

print("Sample sales data:")
print(sales_data)

Sample sales data:
    product  units_sold  price region
0  Widget A         150  25.99  North
1  Widget B         200  45.50  South
2  Widget C          75  15.00   East
3  Widget D         300  60.00   West


## Step 1: Initialize DataShelf

Before we can start versioning datasets, we need to initialize DataShelf in our project directory. This creates a `.datashelf` folder that will store all our metadata and dataset versions.

In [2]:
# Note: This will prompt you to confirm the directory
# In a real scenario, make sure you're in your project root
init()

2025-07-21 20:32:44,504 - INFO - .datashelf directory and metadata initialized


## Step 2: Create a Collection

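Collections group related datasets; think of them as folders for your data versions. Let's create a collection for our sales analysis. On disk, DataShelf stores the collection under a lowercase, underscore-separated form of the name (`sales_analysis_2024`), while you keep passing the original name to `save`.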

In [3]:
# Create a collection for sales data
create_collection("Sales Analysis 2024")

2025-07-21 20:32:44,515 - INFO - collection directory: Sales Analysis 2024 and metadata file sales_analysis_2024_metadata.yaml created.
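
The log line above suggests that DataShelf derives an on-disk name from the display name. Here is a minimal sketch of that mapping, assuming it only lowercases the name and replaces runs of whitespace with underscores. `to_storage_name` and `metadata_filename` are hypothetical helpers written for illustration, not DataShelf functions, and the real rules may handle more cases (punctuation, for example).

```python
import re


def to_storage_name(display_name: str) -> str:
    """Lowercase the name and collapse whitespace runs into underscores.

    Assumption: inferred from the example output, not from DataShelf's source.
    """
    return re.sub(r"\s+", "_", display_name.strip().lower())


def metadata_filename(display_name: str) -> str:
    """Build the per-collection metadata filename seen in the log output."""
    return f"{to_storage_name(display_name)}_metadata.yaml"


print(to_storage_name("Sales Analysis 2024"))    # sales_analysis_2024
print(metadata_filename("Sales Analysis 2024"))  # sales_analysis_2024_metadata.yaml
```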


## Step 3: Save Dataset Versions

Now we can start saving dataset versions. Each call to `save` takes the DataFrame (`df`) and the target collection (`collection_name`), plus:
- **name**: A descriptive name for this dataset
- **tag**: A label (like "raw", "cleaned", "final")
- **message**: A commit message explaining what this version contains

In [4]:
# Save the raw sales data
save(df=sales_data, 
     collection_name="Sales Analysis 2024", 
     name="quarterly_sales", 
     tag="raw", 
     message="Initial quarterly sales data from database")

2025-07-21 20:32:44,529 - INFO - quarterly_sales added to Sales Analysis 2024


## Step 4: Creating Data Transformations

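Let's transform our data and save the results as new versions. Reusing the name `quarterly_sales` with a different tag stores the result as a separate version alongside the raw one.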

In [5]:
# Add calculated columns
sales_enriched = sales_data.copy()
sales_enriched['revenue'] = sales_enriched['units_sold'] * sales_enriched['price']
sales_enriched['revenue_per_unit'] = sales_enriched['revenue'] / sales_enriched['units_sold']

print("Enriched sales data:")
print(sales_enriched)

Enriched sales data:
    product  units_sold  price region  revenue  revenue_per_unit
0  Widget A         150  25.99  North   3898.5             25.99
1  Widget B         200  45.50  South   9100.0             45.50
2  Widget C          75  15.00   East   1125.0             15.00
3  Widget D         300  60.00   West  18000.0             60.00


In [6]:
# Save the enriched version
save(df=sales_enriched, 
     collection_name="Sales Analysis 2024", 
     name="quarterly_sales", 
     tag="enriched", 
     message="Added revenue calculations and per-unit metrics")

2025-07-21 20:32:44,568 - INFO - quarterly_sales added to Sales Analysis 2024


In [7]:
# Create a summary dataset
sales_summary = pd.DataFrame({
    'total_units': [sales_enriched['units_sold'].sum()],
    'total_revenue': [sales_enriched['revenue'].sum()],
    'avg_price': [sales_enriched['price'].mean()],
    'top_product': [sales_enriched.loc[sales_enriched['revenue'].idxmax(), 'product']]
})

print("Sales summary:")
print(sales_summary)

Sales summary:
   total_units  total_revenue  avg_price top_product
0          725        32123.5    36.6225    Widget D


In [8]:
# Save the summary
save(df=sales_summary, 
     collection_name="Sales Analysis 2024", 
     name="sales_summary", 
     tag="final", 
     message="Final summary statistics for quarterly report")

2025-07-21 20:32:44,602 - INFO - sales_summary added to Sales Analysis 2024


## Step 5: Testing Duplicate Detection

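DataShelf hashes each DataFrame on save. If the hash matches a dataset that is already stored, the save is skipped and the log names the existing dataset, even when you save under a different name.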

In [9]:
# Try to save the same data again
save(df=sales_data, 
     collection_name="Sales Analysis 2024", 
     name="duplicate_test", 
     tag="raw", 
     message="This should be detected as a duplicate")

2025-07-21 20:32:44,612 - INFO - duplicate_test's hash matches a dataframe that is already saved in datashelf: quarterly_sales.
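
The log message above points to hash-based detection. The following is a rough, standard-library-only sketch of that idea, assuming the table is serialized to CSV and hashed with SHA-256. `content_hash` and `HashRegistry` are hypothetical names used for illustration; DataShelf's actual serialization and lookup may differ.

```python
import csv
import hashlib
import io


def content_hash(header, rows):
    """Return the SHA-256 hex digest of a table serialized as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return hashlib.sha256(buffer.getvalue().encode("utf-8")).hexdigest()


class HashRegistry:
    """Remember which dataset name first introduced each content hash."""

    def __init__(self):
        self._first_name_by_hash = {}

    def register(self, name, header, rows):
        """Return the existing dataset's name on a duplicate, else None."""
        digest = content_hash(header, rows)
        if digest in self._first_name_by_hash:
            return self._first_name_by_hash[digest]
        self._first_name_by_hash[digest] = name
        return None


header = ["product", "units_sold"]
rows = [("Widget A", 150), ("Widget B", 200)]

registry = HashRegistry()
print(registry.register("quarterly_sales", header, rows))  # None
print(registry.register("duplicate_test", header, rows))   # quarterly_sales
```

Because the name is not part of the hashed content, saving the same rows under a new name is still reported as a duplicate, which matches the behavior in the log above. For a pandas DataFrame, a comparable digest could be computed over `df.to_csv(index=False).encode("utf-8")`.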


## Step 6: Working with Multiple Collections

Let's create another collection to demonstrate organization.

In [10]:
# Create a second collection
create_collection("Customer Analytics")

# Create some customer data
customer_data = pd.DataFrame({
    'customer_id': range(1001, 1006),
    'name': ['Alice Johnson', 'Bob Smith', 'Carol Davis', 'David Wilson', 'Eva Brown'],
    'total_purchases': [5, 12, 3, 8, 15],
    'avg_order_value': [45.20, 67.80, 23.50, 55.10, 89.90]
})

print("Customer data:")
print(customer_data)

2025-07-21 20:32:44,620 - INFO - collection directory: Customer Analytics and metadata file customer_analytics_metadata.yaml created.


Customer data:
   customer_id           name  total_purchases  avg_order_value
0         1001  Alice Johnson                5             45.2
1         1002      Bob Smith               12             67.8
2         1003    Carol Davis                3             23.5
3         1004   David Wilson                8             55.1
4         1005      Eva Brown               15             89.9


In [11]:
# Save to the customer analytics collection
save(df=customer_data, 
     collection_name="Customer Analytics", 
     name="customer_profiles", 
     tag="raw", 
     message="Initial customer profile data from CRM")

2025-07-21 20:32:44,659 - INFO - customer_profiles added to Customer Analytics


## Understanding the DataShelf Structure

Let's explore what DataShelf has created for us behind the scenes.

In [12]:
import os
import yaml

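# Print the .datashelf directory as a tree (hidden subfolders are listed but not expanded)
def show_directory_tree(path, prefix="", max_depth=3, current_depth=0):
    """Recursively print the entries under `path`, recursing at most `max_depth` levels below it."""
    if current_depth > max_depth:
        return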
    
    items = sorted(os.listdir(path))
    for i, item in enumerate(items):
        item_path = os.path.join(path, item)
        is_last = i == len(items) - 1
        
        current_prefix = "└── " if is_last else "├── "
        print(f"{prefix}{current_prefix}{item}")
        
        if os.path.isdir(item_path) and not item.startswith('.'):
            next_prefix = prefix + ("    " if is_last else "│   ")
            show_directory_tree(item_path, next_prefix, max_depth, current_depth + 1)

print("DataShelf directory structure:")
print(".datashelf/")
show_directory_tree(".datashelf")

DataShelf directory structure:
.datashelf/
├── customer_analytics
│   ├── customer_analytics_metadata.yaml
│   └── customer_profiles_raw.csv
├── datashelf_metadata.yaml
└── sales_analysis_2024
    ├── quarterly_sales_enriched.csv
    ├── quarterly_sales_raw.csv
    ├── sales_analysis_2024_metadata.yaml
    └── sales_summary_final.csv
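
The tree above suggests that each saved version lands at `.datashelf/<collection directory>/<name>_<tag>.csv`. Here is a small sketch of building such a path, assuming that layout holds in general. `version_path` is a hypothetical helper, not part of the DataShelf API, and it expects the on-disk collection directory name rather than the display name.

```python
from pathlib import PurePosixPath


def version_path(collection_dir: str, name: str, tag: str,
                 root: str = ".datashelf") -> PurePosixPath:
    """Build the expected location of a saved version.

    Assumption: the `<name>_<tag>.csv` pattern is inferred from the
    directory listing in this example, not from DataShelf's source.
    """
    return PurePosixPath(root) / collection_dir / f"{name}_{tag}.csv"


print(version_path("sales_analysis_2024", "quarterly_sales", "enriched"))
# .datashelf/sales_analysis_2024/quarterly_sales_enriched.csv
```

Until DataShelf ships its own loader, a path built this way could be passed as `str(path)` to `pandas.read_csv` to read a stored version back, provided the file exists.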


In [13]:
# Look at the main metadata file
with open('.datashelf/datashelf_metadata.yaml', 'r') as f:
    main_metadata = yaml.safe_load(f)

print("Main DataShelf metadata:")
print(yaml.dump(main_metadata, default_flow_style=False, sort_keys=False))

Main DataShelf metadata:
config:
- date_created: '2025-07-21 20:32:44'
  number_of_collections: 2
  collections:
  - sales_analysis_2024
  - customer_analytics
collections:
- collection_name: sales_analysis_2024
  date_created: '2025-07-21 20:32:44'
  date_last_modified: '2025-07-21 20:32:44'
  files:
  - sales_analysis_2024_metadata.yaml
  - quarterly_sales
  - quarterly_sales
  - sales_summary
- collection_name: customer_analytics
  date_created: '2025-07-21 20:32:44'
  date_last_modified: '2025-07-21 20:32:44'
  files:
  - customer_analytics_metadata.yaml
  - customer_profiles



In [14]:
# Look at a collection's metadata
with open('.datashelf/sales_analysis_2024/sales_analysis_2024_metadata.yaml', 'r') as f:
    collection_metadata = yaml.safe_load(f)

print("Sales Analysis 2024 collection metadata:")
print(yaml.dump(collection_metadata, default_flow_style=False, sort_keys=False))

Sales Analysis 2024 collection metadata:
config:
- collection_name: sales_analysis_2024
  date_created: '2025-07-21 20:32:44'
  number_of_files: 4
  most_recent_commit: /Users/rohankrishnan/Documents/GitHub/datashelf/examples/.datashelf/sales_analysis_2024/sales_summary_final.csv
  date_last_modified: '2025-07-21 20:32:44'
files:
- name: sales_analysis_2024_metadata.yaml
  hash: ''
  date_created: '2025-07-21 20:32:44'
  date_last_modified: ''
  tag: ''
  version: null
  message: ''
- name: quarterly_sales
  hash: 276aff861de54b01d75e7b8522ef1b6a6c67bfe037574fa5f9d72ab2c39a4448
  date_created: '2025-07-21 20:32:44'
  date_last_modified: '2025-07-21 20:32:44'
  tag: raw
  version: null
  message: Initial quarterly sales data from database
- name: quarterly_sales
  hash: 89104c39df3a484e6b9d784f88597580058b7c746ecabbbef3fe51b6944153d8
  date_created: '2025-07-21 20:32:44'
  date_last_modified: '2025-07-21 20:32:44'
  tag: enriched
  version: null
  message: Added revenue calculations and
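
Each entry under `files` carries a `name` and a `tag`, so the metadata can be filtered to find the versions of a dataset. Here is a minimal sketch, assuming the structure shown above. `find_versions` is a hypothetical helper, not a DataShelf function, and `example` is a trimmed, hand-written stand-in for the dictionary returned by `yaml.safe_load`.

```python
from typing import Optional


def find_versions(collection_metadata: dict, name: str,
                  tag: Optional[str] = None) -> list:
    """Return the metadata entries matching `name` and, if given, `tag`."""
    matches = []
    for entry in collection_metadata.get("files", []):
        if entry.get("name") != name:
            continue
        if tag is not None and entry.get("tag") != tag:
            continue
        matches.append(entry)
    return matches


# Trimmed stand-in for the collection metadata (only the fields used here)
example = {
    "files": [
        {"name": "sales_analysis_2024_metadata.yaml", "tag": "", "message": ""},
        {"name": "quarterly_sales", "tag": "raw",
         "message": "Initial quarterly sales data from database"},
        {"name": "quarterly_sales", "tag": "enriched",
         "message": "Added revenue calculations and per-unit metrics"},
    ]
}

print([entry["tag"] for entry in find_versions(example, "quarterly_sales")])
# ['raw', 'enriched']
```

Since several versions can share a name, the `tag` is what distinguishes them in this example.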

## Summary

In this example, we've demonstrated the current capabilities of DataShelf:

✅ **What DataShelf can do now:**
- Initialize project-level dataset versioning
- Create organized collections for related datasets
- Save pandas DataFrames with metadata (tags, messages, timestamps)
- Automatically detect and prevent duplicate data storage
- Maintain comprehensive metadata about all dataset versions
- Track dataset history with SHA-256 hashing

🚧 **Coming in future versions:**
- Load specific dataset versions by name/tag
- Compare datasets across different time periods
- Restore previous dataset states
- Command-line interface
- Support for additional data formats (Polars, etc.)

DataShelf provides a solid foundation for dataset version control, making it easy to track how your data evolves throughout your analysis workflow.