# DataShelf v0.2.0: Complete Features Guide

A comprehensive walkthrough of DataShelf's version control system for datasets.

## Table of Contents

1. [Introduction & Setup](#introduction--setup)
2. [Core Concepts](#core-concepts)
3. [Basic Workflow](#basic-workflow)
4. [Advanced Features](#advanced-features)
5. [CLI Usage](#cli-usage)
6. [Configuration Management](#configuration-management)
7. [Data Loading & Retrieval](#data-loading--retrieval)
8. [Metadata Inspection](#metadata-inspection)
9. [Best Practices](#best-practices)
10. [Troubleshooting](#troubleshooting)

---

## Introduction & Setup

DataShelf is a simple version control system for datasets that helps data scientists and analysts track how their datasets evolve over time. It provides hash-based deduplication, metadata tracking, and organized collection management.

### Installation

```bash
# From source
git clone https://github.com/r0hankrishnan/datashelf.git
cd datashelf
pip install -e .

# Dev install
pip install -e ".[dev]"
```

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import datashelf.core as ds
from datashelf.core.config import check_tag_enforcement, get_allowed_tags
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducible examples
np.random.seed(42)
print("Let's get started with our demo for DataShelf v0.2.0!")

Let's get started with our demo for DataShelf v0.2.0!


## Core Concepts

Before diving into the features, let's understand DataShelf's key concepts:
- **Collections**: Logical grouping of related datasets (like folders)
- **Versions**: Each dataset save creates a timestamped version with metadata
- **Tags**: Labels for easy identification ("raw", "cleaned", "final", etc.)
- **Messages**: Descriptive commit messages explaining dataset changes
- **Hashing**: SHA-256-based deduplication prevents storing identical data twice
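
To make the hashing concept concrete, here is a minimal sketch of content-based hashing. It uses plain rows and only the standard library; `content_hash` is a hypothetical helper, and the way DataShelf itself serializes a DataFrame before hashing is not shown in this guide, so treat the CSV serialization below as an assumption.

```python
import csv
import hashlib
import io


def content_hash(header, rows):
    """Return a SHA-256 hex digest of tabular data (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    # Hash the serialized bytes: equal content always gives an equal digest
    return hashlib.sha256(buffer.getvalue().encode("utf-8")).hexdigest()


header = ["product_id", "units_sold"]
original = [["P001", 150], ["P002", 200]]
same_copy = [["P001", 150], ["P002", 200]]
edited = [["P001", 150], ["P002", 201]]

print(content_hash(header, original) == content_hash(header, same_copy))  # True
print(content_hash(header, original) == content_hash(header, edited))     # False
```

Because the digest depends only on content, two saves of the same data can be recognized as duplicates no matter what name they are given.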

### Create Sample Data

In [2]:
# Sales data
sales_data = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Widget A', 'Widget B', 'Widget C', 'Widget D', 'Widget E'],
    'category': ['Electronics', 'Home', 'Electronics', 'Sports', 'Home'],
    'units_sold': [150, 200, 75, 300, 120],
    'unit_price': [25.99, 45.50, 15.00, 60.00, 35.75],
    'region': ['North', 'South', 'East', 'West', 'North']
})

# Customer data - secondary dataset
customer_data = pd.DataFrame({
    'customer_id': range(1001, 1011),
    'name': ['Alice Johnson', 'Bob Smith', 'Carol Davis', 'David Wilson', 'Eva Brown',
             'Frank Miller', 'Grace Lee', 'Henry Taylor', 'Ivy Chen', 'Jack Robinson'],
    'age': [28, 34, 22, 45, 31, 29, 38, 52, 26, 41],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix',
             'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],
    'total_purchases': [5, 12, 3, 8, 15, 7, 9, 4, 11, 6],
    'avg_order_value': [45.20, 67.80, 23.50, 55.10, 89.90, 38.75, 72.30, 41.60, 58.40, 49.20]
})

print("Sample datasets created:")
print(f"Sales data: {sales_data.shape}")
print(f"Customer data: {customer_data.shape}")
print("\nSales data preview:")
print(sales_data.head())

Sample datasets created:
Sales data: (5, 6)
Customer data: (10, 6)

Sales data preview:
  product_id product_name     category  units_sold  unit_price region
0       P001     Widget A  Electronics         150       25.99  North
1       P002     Widget B         Home         200       45.50  South
2       P003     Widget C  Electronics          75       15.00   East
3       P004     Widget D       Sports         300       60.00   West
4       P005     Widget E         Home         120       35.75  North


---

## Basic Workflow

### 1. Initialize DataShelf

In [3]:
# Initialize DataShelf in your project directory
# This creates a .datashelf folder with config and metadata files
ds.init()

2025-07-30 17:34:44,968 - INFO - .datashelf directory with config and metadata files initialized.


0

**What happens during init():**
- Creates `.datashelf/` directory
- Generates `datashelf_metadata.yaml` (project-level metadata)
- Creates `datashelf_config.yaml` (config settings)
- Sets up tag enforcement and allowed tags

### 2. Create Collections

Collections help organize related datasets. Let's create collections for different types of analyses:

In [4]:
# Create collection for different analyses
ds.create_collection("Sales Analysis Q4")
ds.create_collection("Customer Analytics")
ds.create_collection("Experimental Data")

print("Collections created!")

2025-07-30 17:34:44,983 - INFO - Collection directory: sales_analysis_q4 and metadata file created.
2025-07-30 17:34:44,986 - INFO - Collection directory: customer_analytics and metadata file created.
2025-07-30 17:34:44,991 - INFO - Collection directory: experimental_data and metadata file created.


Collections created!
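
The log lines above show each display name mapped to a directory name such as `sales_analysis_q4`. The sketch below reproduces that mapping under the assumption that the rule is "lowercase, then join words with underscores"; `normalize_collection_name` is a hypothetical helper, not a DataShelf function, and the library may handle other characters differently.

```python
def normalize_collection_name(name):
    """Lowercase the name and join its words with underscores (assumed rule)."""
    return "_".join(name.lower().split())


for display_name in ["Sales Analysis Q4", "Customer Analytics", "Experimental Data"]:
    print(f"{display_name!r} -> {normalize_collection_name(display_name)}")
```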


### 3. Save Your First Dataset

In [5]:
result = ds.save(
    df = sales_data,
    collection_name = "Sales Analysis Q4",
    name = "Quarterly Sales",
    tag = "raw",
    message = "Initial Q4 sales data from database export"
)
print(f"Save operation result: {result}")

2025-07-30 17:34:45,007 - INFO - Save as CSV (0.00 MB)
2025-07-30 17:34:45,017 - INFO - Quarterly Sales added to Sales Analysis Q4


Save operation result: 0


### 4. Create and Save Transformations

In [6]:
# Create enriched version with calculations
sales_enriched = sales_data.copy()
sales_enriched['total_revenue'] = sales_enriched['units_sold'] * sales_enriched['unit_price']
sales_enriched['revenue_per_unit'] = sales_enriched['total_revenue'] / sales_enriched['units_sold']
sales_enriched['price_category'] = pd.cut(
    sales_enriched['unit_price'], 
    bins=[0, 30, 50, 100], 
    labels=['Low', 'Medium', 'High']
)

# Save the enriched version
ds.save(
    df=sales_enriched,
    collection_name="Sales Analysis Q4",
    name="quarterly_sales",
    tag="intermediate",
    message="Added revenue calculations and price categorization"
)

print("Enriched data saved!")
print(f"New columns: {[col for col in sales_enriched.columns if col not in sales_data.columns]}")

2025-07-30 17:34:45,034 - INFO - Save as CSV (0.00 MB)
2025-07-30 17:34:45,045 - INFO - quarterly_sales added to Sales Analysis Q4


Enriched data saved!
New columns: ['total_revenue', 'revenue_per_unit', 'price_category']
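
The `price_category` column relies on `pd.cut`, which by default uses right-closed bins: `(0, 30]`, `(30, 50]`, `(50, 100]`. The dependency-free sketch below mirrors that rule with `bisect` so the boundary behavior is easy to see; `bin_label` is a hypothetical helper written for illustration.

```python
from bisect import bisect_left


def bin_label(value, edges, labels):
    """Label a value using right-closed bins: (edges[i], edges[i + 1]]."""
    if value <= edges[0] or value > edges[-1]:
        return None  # outside every bin; pd.cut would produce NaN here
    return labels[bisect_left(edges, value) - 1]


edges = [0, 30, 50, 100]
labels = ["Low", "Medium", "High"]
prices = [25.99, 45.50, 15.00, 60.00, 35.75]

print([bin_label(p, edges, labels) for p in prices])
# ['Low', 'Medium', 'Low', 'High', 'Medium']
print(bin_label(30, edges, labels))  # Low (the right edge belongs to the lower bin)
```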


### 5. Create Summary Analytics

In [7]:
# Create aggregated summary
sales_summary = sales_enriched.groupby('category').agg({
    'units_sold': 'sum',
    'total_revenue': 'sum',
    'unit_price': 'mean'
}).round(2)

sales_summary.reset_index(inplace=True)
# Map each category to its product count. Dividing by the raw groupby result
# would misalign on the index (category labels vs. 0..n-1) and produce NaN.
products_per_category = sales_summary['category'].map(sales_enriched.groupby('category').size())
sales_summary['avg_units_per_product'] = sales_summary['units_sold'] / products_per_category

# Save the summary
ds.save(
    df=sales_summary,
    collection_name="Sales Analysis Q4",
    name="category_summary",
    tag="final",
    message="Final category-level summary for Q4 report"
)

print("Summary data created and saved:")
print(sales_summary)

2025-07-30 17:34:45,067 - INFO - Save as CSV (0.00 MB)
2025-07-30 17:34:45,080 - INFO - category_summary added to Sales Analysis Q4


Summary data created and saved:
      category  units_sold  total_revenue  unit_price  avg_units_per_product
0  Electronics         225         5023.5       20.49                  112.5
1         Home         320        13390.0       40.62                  160.0
2       Sports         300        18000.0       60.00                  300.0



---

## Advanced Features

### Duplicate Detection

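DataShelf automatically prevents saving identical datasets. It compares each new dataset's SHA-256 hash with the hashes already stored and skips the save on a match, even when the dataset is submitted under a different name: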

In [8]:
print("Testing duplicate detection...")

# Try to save the same data again
result = ds.save(
    df=sales_data,  # Same data as before
    collection_name="Sales Analysis Q4",
    name="duplicate_test",
    tag="raw",
    message="This should be detected as a duplicate"
)

print(f"Duplicate detection result: {result}")
print("DataShelf prevented duplicate storage!")

2025-07-30 17:34:45,095 - INFO - duplicate_test's hash matches a dataframe that is already saved in datashelf: Quarterly Sales.


Testing duplicate detection...
Duplicate detection result: 0
DataShelf prevented duplicate storage!
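
The lookup behind this kind of duplicate check can be sketched with a dictionary keyed by digest. This is a toy in-memory model, not DataShelf's implementation: `DedupStore` and its `save` return value are hypothetical, and the payload is raw bytes rather than a DataFrame.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that refuses to keep identical payloads twice."""

    def __init__(self):
        self._names_by_hash = {}

    def save(self, name, payload):
        """Store payload bytes; return (stored, name of the entry holding them)."""
        digest = hashlib.sha256(payload).hexdigest()
        existing = self._names_by_hash.get(digest)
        if existing is not None:
            return False, existing  # same content already stored under this name
        self._names_by_hash[digest] = name
        return True, name


store = DedupStore()
csv_bytes = b"product_id,units_sold\nP001,150\n"
print(store.save("Quarterly Sales", csv_bytes))  # (True, 'Quarterly Sales')
print(store.save("duplicate_test", csv_bytes))   # (False, 'Quarterly Sales')
```

Note that in the run above `ds.save()` returned `0` for the duplicate as well as for the first save, so the log message, not the return value, is what signals that a save was skipped.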


### Working with Multiple Collections

In [9]:
# Save customer data to different collection
ds.save(
    df=customer_data,
    collection_name="Customer Analytics",
    name="customer_profiles",
    tag="raw",
    message="Customer profile data from CRM system"
)

# Create customer segments
customer_segments = customer_data.copy()
customer_segments['age_group'] = pd.cut(
    customer_segments['age'], 
    bins=[0, 30, 40, 50, 100], 
    labels=['Young', 'Mid', 'Mature', 'Senior']
)
customer_segments['value_tier'] = pd.cut(
    customer_segments['avg_order_value'], 
    bins=[0, 40, 60, 80, 100], 
    labels=['Bronze', 'Silver', 'Gold', 'Platinum']
)

ds.save(
    df=customer_segments,
    collection_name="Customer Analytics",
    name="customer_segments",
    tag="intermediate",
    message="Added age groups and value tier segmentation"
)

print("Customer data processing complete!")

2025-07-30 17:34:45,105 - INFO - Save as CSV (0.00 MB)
2025-07-30 17:34:45,113 - INFO - customer_profiles added to Customer Analytics
2025-07-30 17:34:45,120 - INFO - Save as CSV (0.00 MB)
2025-07-30 17:34:45,141 - INFO - customer_segments added to Customer Analytics


Customer data processing complete!


### Smart File Format Selection

When saving a dataset, DataShelf automatically chooses the file format based on data size: CSV for datasets under 10 MB and Parquet for datasets of 10 MB or more:

In [10]:
# Create a larger dataset to demonstrate format selection
large_data = pd.DataFrame({
    'id': range(200000),
    'value_1': np.random.randn(200000),
    'value_2': np.random.randn(200000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 200000),
    'timestamp': pd.date_range('2024-01-01', periods=200000, freq='h')  # 'H' is deprecated in recent pandas
})

# Save large dataset (will automatically use Parquet for efficiency)
ds.save(
    df=large_data,
    collection_name="Experimental Data",
    name="large_dataset",
    tag="raw",
    message="Large experimental dataset for format testing"
)

print(f"Large dataset saved: {large_data.shape}")
print("DataShelf automatically selected optimal file format!")

2025-07-30 17:34:45,201 - INFO - Save as Parquet (18.00 MB)
2025-07-30 17:34:45,278 - INFO - large_dataset added to Experimental Data


Large dataset saved: (200000, 5)
DataShelf automatically selected optimal file format!
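
The selection rule can be sketched as a single threshold check. The 10 MB cutoff comes from the troubleshooting notes in this guide; how DataShelf measures size (and whether a megabyte is 10^6 or 2^20 bytes) is not stated, so `choose_format` is a hypothetical helper that simply takes a size in megabytes.

```python
CSV_LIMIT_MB = 10  # cutoff quoted in this guide's troubleshooting notes


def choose_format(size_mb):
    """Pick a file format from a dataset size in megabytes (assumed rule)."""
    return "csv" if size_mb < CSV_LIMIT_MB else "parquet"


for size_mb in (0.0, 9.99, 10.0, 18.0):
    print(f"{size_mb:>5.2f} MB -> {choose_format(size_mb)}")
```

Under this rule the 18.00 MB dataset from the log above lands on Parquet, while the small sales and customer tables stay as CSV.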



---

## CLI Usage

DataShelf provides a command-line interface for common operations:

### Available CLI Commands

```bash
# Initialize DataShelf
datashelf init

# Create a new collection
datashelf create-collection "My New Collection"

# Display DataShelf metadata
datashelf ls ds-md

# Display collections overview
datashelf ls ds-coll

# Display collection metadata (will prompt for collection name)
datashelf ls coll-md

# Display collection files (will prompt for collection name)
datashelf ls coll-files

# Check out a specific dataset by collection name and hash
datashelf checkout collection_name hash_value
```

In [11]:
# Simulate CLI commands (these would be run in the terminal)
print("CLI Examples:")
print("=" * 50)
print("$ datashelf init")
print()
print("# Create a collection")
print("$ datashelf create-collection 'Sales Analysis Q4'")
print()
print("# View collections")
print("$ datashelf ls ds-coll")
print()
print("# View collection files")
print("$ datashelf ls coll-files")
print("Collection name? Sales Analysis Q4")
print()
print("# Checkout a specific dataset")
print("$ datashelf checkout 'Sales Analysis Q4' abc123...")

CLI Examples:
==================================================
$ datashelf init

# Create a collection
$ datashelf create-collection 'Sales Analysis Q4'

# View collections
$ datashelf ls ds-coll

# View collection files
$ datashelf ls coll-files
Collection name? Sales Analysis Q4

# Checkout a specific dataset
$ datashelf checkout 'Sales Analysis Q4' abc123...
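
For readers who want to see how a command layout like this can be wired up, here is an illustrative `argparse` sketch that mirrors the commands listed above. It is not DataShelf's actual CLI code; the argument names (`collection_name`, `target`, `hash_value`) are assumptions chosen for readability.

```python
import argparse


def build_parser():
    """Build a parser mirroring the commands listed above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="datashelf")
    commands = parser.add_subparsers(dest="command", required=True)

    commands.add_parser("init")

    create = commands.add_parser("create-collection")
    create.add_argument("collection_name")

    listing = commands.add_parser("ls")
    listing.add_argument("target", choices=["ds-md", "ds-coll", "coll-md", "coll-files"])

    checkout = commands.add_parser("checkout")
    checkout.add_argument("collection_name")
    checkout.add_argument("hash_value")
    return parser


args = build_parser().parse_args(["checkout", "Sales Analysis Q4", "abc123"])
print(args.command, args.collection_name, args.hash_value)
```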



---

## Configuration Management

### Tag Enforcement

DataShelf ships with a default set of allowed tags. While there are functions available to disable tag enforcement, **disabling tag enforcement is not recommended**: a fixed set of allowed tags makes saved datasets much more straightforward to organize and browse.

In [12]:
# Check current tag enforcement status
print(f"Tag enforcement enabled: {check_tag_enforcement()}")
print(f"Allowed tags: {get_allowed_tags()}")

# Demonstrate tag validation
try:
    ds.save(
        df=sales_data.head(2),
        collection_name="Experimental Data",
        name="test_data",
        tag="invalid_tag",  # This will fail
        message="Testing tag validation"
    )
except ValueError as e:
    print(f"Tag validation caught invalid tag: {e}")

# Use a valid tag
ds.save(
    df=sales_data.head(2),
    collection_name="Experimental Data",
    name="test_data",
    tag="ad-hoc",  # This is valid
    message="Testing with valid ad-hoc tag"
)
print("Valid tag accepted!")

2025-07-30 17:34:45,290 - ERROR - invalid_tag is not a valid tag.Tag enforcement is currently set to True.You cannot change enforcement as of this version of DataShelf.Please use one of the following allowed tags raw, intermediate, cleaned, ad-hoc, final
2025-07-30 17:34:45,296 - INFO - Save as CSV (0.00 MB)
2025-07-30 17:34:45,307 - INFO - test_data added to Experimental Data


Tag enforcement enabled: True
Allowed tags: ['raw', 'intermediate', 'cleaned', 'ad-hoc', 'final']
Tag validation caught invalid tag: invalid_tag is not a valid tag.Tag enforcement is currently set to True.You cannot change enforcement as of this version of DataShelf.Please use one of the following allowed tags raw, intermediate, cleaned, ad-hoc, final
Valid tag accepted!
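
The check that rejected `invalid_tag` can be sketched as a small validation function. The allowed tags are the ones printed above; `validate_tag`, its `enforce` parameter, and the error wording are hypothetical and only approximate what the library does.

```python
ALLOWED_TAGS = ["raw", "intermediate", "cleaned", "ad-hoc", "final"]


def validate_tag(tag, allowed_tags=ALLOWED_TAGS, enforce=True):
    """Return the tag if acceptable, otherwise raise ValueError (assumed behavior)."""
    if enforce and tag not in allowed_tags:
        raise ValueError(
            f"{tag} is not a valid tag. Allowed tags: {', '.join(allowed_tags)}"
        )
    return tag


print(validate_tag("ad-hoc"))
try:
    validate_tag("invalid_tag")
except ValueError as error:
    print(f"Rejected: {error}")
```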


### Configuration Options

In [13]:
# The default configuration includes:
print("Default DataShelf Configuration:")
print("- tag_enforcement: True")
print("- allowed_tags: ['raw', 'intermediate', 'cleaned', 'ad-hoc', 'final']")
print("- collection_tag_overrides: {} (for future use)")

Default DataShelf Configuration:
- tag_enforcement: True
- allowed_tags: ['raw', 'intermediate', 'cleaned', 'ad-hoc', 'final']
- collection_tag_overrides: {} (for future use)



---

## Data Loading & Retrieval

### Load Datasets Back into Memory

The `load()` function retrieves a specific dataset version and returns it as a pandas DataFrame:

In [14]:
# First, let's get some hash values to work with
ds.ls("coll-files")  # This will prompt for collection name: "Sales Analysis Q4"

In [15]:
# Load a specific dataset version using its hash
# Note: Replace 'hash_value' with actual hash from your metadata
try:
    # This is a demonstration - you'd use actual hash values from your metadata
    loaded_data = ds.load("Sales Analysis Q4", "9d77eabf6b934ce8e759742429021d0afeb2ccaa339c2db35ea4c96fdf96ff3f")
    print("Dataset loaded successfully!")
    print(loaded_data.head())
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print("Demo: Use actual hash values from your collection metadata to load data")

Dataset loaded successfully!
  product_id product_name     category  units_sold  unit_price region
0       P001     Widget A  Electronics         150       25.99  North
1       P002     Widget B         Home         200       45.50  South
2       P003     Widget C  Electronics          75       15.00   East
3       P004     Widget D       Sports         300       60.00   West
4       P005     Widget E         Home         120       35.75  North
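
Loading by hash implies a lookup from hash to stored file. The sketch below shows one way such a lookup could work over file-level metadata records, using the field names described in this guide (`hash`, `file_path`, `deleted`). `find_file_path` and the sample records are made up for illustration and are not part of DataShelf.

```python
def find_file_path(file_records, hash_value):
    """Return the file_path of the non-deleted record with this hash, else None."""
    for record in file_records:
        if record["hash"] == hash_value and not record.get("deleted", False):
            return record["file_path"]
    return None


# Made-up records; real hashes are 64-character SHA-256 hex digests
records = [
    {"name": "Quarterly Sales", "hash": "aaa111", "file_path": "sales/raw.csv", "deleted": False},
    {"name": "category_summary", "hash": "bbb222", "file_path": "sales/summary.csv", "deleted": False},
]

print(find_file_path(records, "bbb222"))  # sales/summary.csv
print(find_file_path(records, "zzz999"))  # None
```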


### Checkout Datasets to Files

The `checkout()` function retrieves a specific dataset and copies it to the directory that contains `.datashelf/`:

In [16]:
# Checkout copies a dataset file to your working directory
# ds.checkout("Sales Analysis Q4", "your_actual_hash_here")
print("Demo: checkout() copies dataset files to your working directory")
print("Use: ds.checkout('collection_name', 'hash_value')")

Demo: checkout() copies dataset files to your working directory
Use: ds.checkout('collection_name', 'hash_value')
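
Conceptually, a checkout is a file copy from the collection directory to the project root. The sketch below demonstrates that idea in a temporary directory; `checkout_copy` and the file name `quarterly_sales.csv` are hypothetical, and the real `ds.checkout()` may name or place files differently.

```python
import shutil
import tempfile
from pathlib import Path


def checkout_copy(stored_file, project_root):
    """Copy a stored dataset file into the project root and return the new path."""
    destination = Path(project_root) / Path(stored_file).name
    shutil.copy2(stored_file, destination)
    return destination


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    collection_dir = root / ".datashelf" / "sales_analysis_q4"
    collection_dir.mkdir(parents=True)
    stored = collection_dir / "quarterly_sales.csv"  # hypothetical file name
    stored.write_text("product_id,units_sold\nP001,150\n")

    copied = checkout_copy(stored, root)
    contents_match = copied.read_text() == stored.read_text()
    print(copied.name, contents_match)
```

The stored file is left in place, so a checkout never alters the versioned copy.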



---

## Metadata Inspection

DataShelf provides comprehensive metadata viewing capabilities:

### Using the Display Functions

In [17]:
print("DataShelf creates the following structure:")
print("""
your_project/
├── .datashelf/
│   ├── datashelf_metadata.yaml      # Project-level metadata
│   ├── datashelf_config.yaml        # Configuration settings
│   └── collection_name/
│       ├── collection_metadata.yaml # Collection-specific metadata  
│       └── dataset_files.[csv|parquet] # Your actual datasets
├── your_notebooks.ipynb
└── your_scripts.py
""")

DataShelf creates the following structure:

your_project/
├── .datashelf/
│   ├── datashelf_metadata.yaml      # Project-level metadata
│   ├── datashelf_config.yaml        # Configuration settings
│   └── collection_name/
│       ├── collection_metadata.yaml # Collection-specific metadata  
│       └── dataset_files.[csv|parquet] # Your actual datasets
├── your_notebooks.ipynb
└── your_scripts.py



### Metadata Fields Explained

In [18]:
print("Key Metadata Fields:")
print("=" * 30)
print("""
Project Level (datashelf_metadata.yaml):
- date_created: When DataShelf was initialized
- number_of_collections: Total collections count
- collections: List of all collections with their metadata

Collection Level (collection_metadata.yaml): 
- collection_name: Name of the collection
- date_created/date_last_modified: Timestamps
- number_of_files: Count of datasets in collection
- most_recent_commit: Path to latest saved dataset
- max_version: Highest version number used

File Level (within collections):
- name: Dataset name
- hash: SHA-256 hash for deduplication
- tag: User-assigned tag (raw, cleaned, etc.)
- version: Auto-incrementing version number
- message: Commit message
- file_path: Full path to dataset file
- deleted: Soft deletion flag (future feature)
""")

Key Metadata Fields:
==============================

Project Level (datashelf_metadata.yaml):
- date_created: When DataShelf was initialized
- number_of_collections: Total collections count
- collections: List of all collections with their metadata

Collection Level (collection_metadata.yaml): 
- collection_name: Name of the collection
- date_created/date_last_modified: Timestamps
- number_of_files: Count of datasets in collection
- most_recent_commit: Path to latest saved dataset
- max_version: Highest version number used

File Level (within collections):
- name: Dataset name
- hash: SHA-256 hash for deduplication
- tag: User-assigned tag (raw, cleaned, etc.)
- version: Auto-incrementing version number
- message: Commit message
- file_path: Full path to dataset file
- deleted: Soft deletion flag (future feature)
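
One way to picture a file-level entry is as a small record type with the fields listed above. The sketch below is illustrative only: `FileRecord` and `next_version` are hypothetical, the values are placeholders, and it assumes a new version is one above the collection's `max_version`.

```python
from dataclasses import asdict, dataclass


@dataclass
class FileRecord:
    """One file-level entry, using the field names listed above."""
    name: str
    hash: str
    tag: str
    version: int
    message: str
    file_path: str
    deleted: bool = False


def next_version(max_version):
    """Auto-increment: one above the collection's current max_version (assumed)."""
    return max_version + 1


record = FileRecord(
    name="quarterly_sales",
    hash="abc123",  # placeholder, not a real SHA-256 digest
    tag="intermediate",
    version=next_version(1),
    message="Added revenue calculations and price categorization",
    file_path=".datashelf/sales_analysis_q4/quarterly_sales.csv",  # hypothetical path
)
print(asdict(record))
```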




---

## Best Practices

### 1. Project Organization

In [19]:
print("📁 Project Organization Best Practices:")
print("""
1. Initialize DataShelf in your project root directory
2. Create meaningful collection names:
   \u2713 "Customer_Analytics_2024"
   \u2713 "Sales_Forecasting_Models" 
   \u2716 "data" or "temp"

3. Use consistent naming conventions:
   - snake_case for collection names
   - descriptive dataset names
   - standardized tags
""")

📁 Project Organization Best Practices:

1. Initialize DataShelf in your project root directory
2. Create meaningful collection names:
   ✓ "Customer_Analytics_2024"
   ✓ "Sales_Forecasting_Models" 
   ✖ "data" or "temp"

3. Use consistent naming conventions:
   - snake_case for collection names
   - descriptive dataset names
   - standardized tags



### 2. Tagging Strategy

In [20]:
print("Recommended Tagging Strategy:")
print("""
- raw: Original, unmodified data from source
- intermediate: Partially processed, work-in-progress
- cleaned: Cleaned and validated data
- ad-hoc: Experimental or one-off analysis
- final: Completed, ready-for-use datasets

Custom workflow example:
raw → intermediate → cleaned → final
  ↓
ad-hoc (for experiments)
""")

Recommended Tagging Strategy:

- raw: Original, unmodified data from source
- intermediate: Partially processed, work-in-progress
- cleaned: Cleaned and validated data
- ad-hoc: Experimental or one-off analysis
- final: Completed, ready-for-use datasets

Custom workflow example:
raw → intermediate → cleaned → final
  ↓
ad-hoc (for experiments)



### 3. Commit Messages

In [21]:
print("Effective Commit Messages:")
print("""
\u2713 Good examples:
- "Added revenue calculations and price tiers"
- "Removed outliers and standardized categories"
- "Final aggregation for Q4 executive report"

\u2716 Avoid:
- "updated data"
- "fixes"
- "version 2"

Be specific about what changed and why!
""")

Effective Commit Messages:

✓ Good examples:
- "Added revenue calculations and price tiers"
- "Removed outliers and standardized categories"
- "Final aggregation for Q4 executive report"

✖ Avoid:
- "updated data"
- "fixes"
- "version 2"

Be specific about what changed and why!



### 4. Version Control Workflow

In [22]:
print("Recommended Workflow:")
print("""
1. Save raw data immediately upon import
2. Document each transformation step
3. Use meaningful tags to track processing stages
4. Save intermediate versions of complex transformations
5. Tag final outputs clearly for easy identification
6. Leverage duplicate detection to avoid waste
""")

Recommended Workflow:

1. Save raw data immediately upon import
2. Document each transformation step
3. Use meaningful tags to track processing stages
4. Save intermediate versions of complex transformations
5. Tag final outputs clearly for easy identification
6. Leverage duplicate detection to avoid waste




---

## Troubleshooting

### Common Issues and Solutions

In [23]:
print("\u2699 Troubleshooting Guide:")
print("=" * 30)
print("""
Issue: "NotADirectoryError: .datashelf does not exist"
Solution: Run ds.init() first to initialize DataShelf

Issue: "Tag validation error"
Solution: Check allowed tags with get_allowed_tags() and use valid tags

Issue: "Collection already exists"  
Solution: This is normal - DataShelf won't overwrite existing collections

Issue: "Cannot find .datashelf directory"
Solution: Make sure you're in the correct directory or that DataShelf was initialized

Issue: File format questions
Solution: DataShelf auto-selects CSV (<10MB) or Parquet (≥10MB) for optimal performance
""")

⚙ Troubleshooting Guide:
==============================

Issue: "NotADirectoryError: .datashelf does not exist"
Solution: Run ds.init() first to initialize DataShelf

Issue: "Tag validation error"
Solution: Check allowed tags with get_allowed_tags() and use valid tags

Issue: "Collection already exists"  
Solution: This is normal - DataShelf won't overwrite existing collections

Issue: "Cannot find .datashelf directory"
Solution: Make sure you're in the correct directory or that DataShelf was initialized

Issue: File format questions
Solution: DataShelf auto-selects CSV (<10MB) or Parquet (≥10MB) for optimal performance



### Checking DataShelf Status 

In [24]:
# Quick health check function
def datashelf_status():
    try:
        from pathlib import Path
        datashelf_path = Path.cwd() / '.datashelf'
        if datashelf_path.exists():
            print("\u2713 DataShelf initialized")
            collections = [p for p in datashelf_path.iterdir() if p.is_dir()]
            print(f"\u2713 Found {len(collections)} collections")
            
            # Check config
            config_path = datashelf_path / 'datashelf_config.yaml'
            if config_path.exists():
                print("\u2713 Configuration file present")
            else:
                print("\u26A0  Configuration file missing")
                
        else:
            print("\u2716 DataShelf not initialized - run ds.init()")
    except Exception as e:
        print(f"\u2716 Error checking status: {e}")

datashelf_status()

✓ DataShelf initialized
✓ Found 3 collections
✓ Configuration file present


---

## Summary

DataShelf v0.2.0 provides a robust foundation for dataset version control with the following capabilities:

### Current Features
- **Project initialization** with `ds.init()`
- **Collection management** with `ds.create_collection()`
- **Dataset versioning** with `ds.save()`
- **Duplicate detection** via SHA-256 hashing
- **Smart file format selection** (CSV/Parquet)
- **Configurable tag enforcement**
- **Comprehensive metadata tracking**
- **CLI interface** for common operations
- **Data loading** with `ds.load()` and `ds.checkout()`
- **Metadata inspection** with `ds.ls()`

### Future Enhancements
- Advanced query capabilities
- Data comparison tools
- Branch-like functionality
- Integration with popular ML frameworks
- Enhanced CLI features
- Support for additional data formats

### Getting Started Checklist
1. Install DataShelf: `pip install -e .`
2. Initialize project: `ds.init()`
3. Create collections: `ds.create_collection("My Analysis")`
4. Save datasets: `ds.save(df, collection_name, name, tag, message)`
5. Explore metadata: `ds.ls("ds-md")`
6. Load previous versions: `ds.load(collection_name, hash_value)`

DataShelf helps you maintain organized, traceable, and efficient dataset workflows.

---