# Thinkube Storage Guide

Learn how to use persistent storage in the Thinkube Jupyter environment.

## Storage Architecture

Thinkube uses SeaweedFS for persistent storage with the following mount points:

- `/home/jovyan/thinkube/notebooks/` - Your personal notebooks (100GB)
- `/home/jovyan/thinkube/datasets/` - Shared datasets (500GB)
- `/home/jovyan/thinkube/models/` - Shared models (200GB)

All data in these directories persists across pod restarts.
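
Before relying on these directories, it can help to confirm that each one is actually present and writable in the current pod. The following is a minimal sketch, not part of Thinkube itself: the `THINKUBE_MOUNTS` dictionary and the `check_mounts` helper are hypothetical names introduced here for illustration, and the paths are the ones listed above.

```python
import os

# Mount points as listed in this guide (the dictionary name is hypothetical)
THINKUBE_MOUNTS = {
    'notebooks': '/home/jovyan/thinkube/notebooks',
    'datasets': '/home/jovyan/thinkube/datasets',
    'models': '/home/jovyan/thinkube/models',
}

def check_mounts(mounts=THINKUBE_MOUNTS):
    """Report, for each named path, whether it exists as a directory and is writable."""
    status = {}
    for name, path in mounts.items():
        status[name] = {
            'exists': os.path.isdir(path),
            # os.access returns False for paths that do not exist
            'writable': os.access(path, os.W_OK),
        }
    return status

for name, info in check_mounts().items():
    print(f"{name}: exists={info['exists']}, writable={info['writable']}")
```

A directory that reports `exists=False` most likely means the storage is not mounted in this pod, so anything written to that path would not persist.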

## Check Storage Availability

In [None]:
import os
import subprocess

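# Return the human-readable `df` report for the filesystem that holds `path`
def get_disk_usage(path):
    result = subprocess.run(['df', '-h', path], capture_output=True, text=True)
    # If `df` fails (for example, the path does not exist), return its error message instead of an empty string
    return result.stdout if result.returncode == 0 else result.stderr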

print("Notebooks storage:")
print(get_disk_usage('/home/jovyan/thinkube/notebooks'))

print("\nDatasets storage:")
print(get_disk_usage('/home/jovyan/thinkube/datasets'))

print("\nModels storage:")
print(get_disk_usage('/home/jovyan/thinkube/models'))
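
If you need the numbers rather than a text report (for example, to warn when free space runs low), the standard library's `shutil.disk_usage` is an alternative to parsing `df` output. This is a sketch; the `disk_usage_gib` helper name is an assumption made here, and the figures are whatever the mount reports, which may not match the per-directory sizes listed in this guide.

```python
import os
import shutil

def disk_usage_gib(path):
    """Return (total, used, free) in GiB for the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return tuple(round(value / gib, 2) for value in (usage.total, usage.used, usage.free))

for path in ('/home/jovyan/thinkube/notebooks',
             '/home/jovyan/thinkube/datasets',
             '/home/jovyan/thinkube/models'):
    if os.path.exists(path):
        total, used, free = disk_usage_gib(path)
        print(f"{path}: {used} GiB used of {total} GiB ({free} GiB free)")
    else:
        print(f"{path}: not available in this environment")
```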

## Working with Notebooks

Save your notebooks in `/home/jovyan/thinkube/notebooks/` for persistence.

In [None]:
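import os

# List your notebooks as an indented tree
notebooks_dir = '/home/jovyan/thinkube/notebooks'
for root, dirs, files in os.walk(notebooks_dir):
    # Skip hidden directories (pruning in place stops os.walk from descending into them)
    dirs[:] = [d for d in dirs if not d.startswith('.')]
    # Depth relative to notebooks_dir; relpath avoids miscounting if the base path text repeats
    rel = os.path.relpath(root, notebooks_dir)
    level = 0 if rel == '.' else rel.count(os.sep) + 1
    indent = ' ' * 2 * level
    print(f"{indent}{os.path.basename(root)}/")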
    sub_indent = ' ' * 2 * (level + 1)
    for file in files:
        if not file.startswith('.'):
            print(f"{sub_indent}{file}")

## Shared Datasets

Use `/home/jovyan/thinkube/datasets/` for datasets that should be accessible from any pod.

In [None]:
import os
import pandas as pd

# Example: Save a dataset to shared storage
datasets_dir = '/home/jovyan/thinkube/datasets'
os.makedirs(datasets_dir, exist_ok=True)

# Create sample dataset
df = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [10, 20, 30, 40, 50],
    'label': [0, 1, 0, 1, 0]
})

# Save to shared datasets
dataset_path = os.path.join(datasets_dir, 'sample_dataset.csv')
df.to_csv(dataset_path, index=False)
print(f"Dataset saved to {dataset_path}")

# Load from shared datasets
loaded_df = pd.read_csv(dataset_path)
print(f"\nLoaded dataset shape: {loaded_df.shape}")
print(loaded_df.head())
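
Because the datasets directory is visible from other pods, a reader elsewhere could open a file while it is still being written. One common way to reduce that risk is to write to a temporary file in the same directory and then rename it into place. The sketch below assumes that a rename on the SeaweedFS mount replaces the target in a single step, which this guide does not confirm; the `atomic_write_text` helper is a hypothetical name introduced here.

```python
import os
import tempfile

def atomic_write_text(path, text):
    """Write text to a temporary file next to path, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Create the temporary file in the target directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix='.tmp_', suffix='.part')
    try:
        # newline='' keeps the line endings of the text exactly as given
        with os.fdopen(fd, 'w', encoding='utf-8', newline='') as f:
            f.write(text)
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial file if anything went wrong before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With pandas, `df.to_csv(index=False)` called without a path returns the CSV as a string, so it can be passed as the `text` argument. Note that `tempfile.mkstemp` creates the file with owner-only permissions; adjust them with `os.chmod` if other users need to read the result.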

## Shared Models

Use `/home/jovyan/thinkube/models/` for trained models accessible from any pod.

In [None]:
import os
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

models_dir = '/home/jovyan/thinkube/models'
os.makedirs(models_dir, exist_ok=True)

# Train a simple model
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
model = LinearRegression()
model.fit(X, y)

# Save model to shared storage
model_path = os.path.join(models_dir, 'sample_model.pkl')
with open(model_path, 'wb') as f:
    pickle.dump(model, f)
print(f"Model saved to {model_path}")

# Load model from shared storage (only unpickle files you trust: pickle can run arbitrary code on load)
with open(model_path, 'rb') as f:
    loaded_model = pickle.load(f)
print(f"\nLoaded model prediction for X=6: {loaded_model.predict([[6]])[0]}")

## Best Practices

1. **Notebooks**: Save work-in-progress in `/home/jovyan/thinkube/notebooks/`
2. **Datasets**: Store reusable datasets in `/home/jovyan/thinkube/datasets/`
3. **Models**: Save trained models in `/home/jovyan/thinkube/models/`
4. **Scratch space**: Use `/home/jovyan/scratch/` for temporary large files (not persistent)
5. **Clean up**: Remove unused files to free up space
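
To decide what to clean up, it helps to see which files take the most space. The following sketch walks a directory and returns its largest files; the `largest_files` helper is a hypothetical name introduced here, and on a large shared directory the walk may take a while.

```python
import heapq
import os

def largest_files(root, n=10):
    """Return up to n (size_in_bytes, path) tuples under root, largest first."""
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full_path), full_path))
            except OSError:
                # Skip files that vanish or cannot be read during the walk
                continue
    return heapq.nlargest(n, sizes)

for size, path in largest_files('/home/jovyan/thinkube/notebooks', n=5):
    print(f"{size / 1024 ** 2:8.1f} MiB  {path}")
```

If the directory does not exist, `os.walk` yields nothing and the loop prints nothing.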

## SeaweedFS S3 Access

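You can also access storage through the SeaweedFS S3-compatible API using `boto3`. The connection settings (`S3_ENDPOINT`, `S3_ACCESS_KEY`, `S3_SECRET_KEY`) are read from `/home/jovyan/.thinkube_env`.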

In [None]:
import os
import boto3
from dotenv import load_dotenv

# Load Thinkube environment
load_dotenv('/home/jovyan/.thinkube_env')

# Connect to SeaweedFS S3
s3 = boto3.client(
    's3',
    endpoint_url=os.getenv('S3_ENDPOINT'),
    aws_access_key_id=os.getenv('S3_ACCESS_KEY'),
    aws_secret_access_key=os.getenv('S3_SECRET_KEY')
)

# List buckets
buckets = s3.list_buckets()
print("Available S3 buckets:")
for bucket in buckets.get('Buckets', []):
    print(f"  - {bucket['Name']}")
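
Once a client is connected, uploading a file and listing a bucket's contents follow the usual boto3 pattern. The sketch below takes the client as a parameter so it can reuse the `s3` object created above; the `upload_and_list` helper is a hypothetical name, and any bucket or key you pass is your own choice, since this guide does not name a specific bucket.

```python
def upload_and_list(s3_client, bucket, local_path, key):
    """Upload local_path to bucket under key, then list objects that share the key's prefix."""
    # boto3 client signature: upload_file(Filename, Bucket, Key)
    s3_client.upload_file(local_path, bucket, key)

    # Use everything up to the last '/' as the listing prefix ('' lists the whole bucket)
    prefix = (key.rsplit('/', 1)[0] + '/') if '/' in key else ''

    # A single list_objects_v2 call returns at most 1000 keys; enough for a quick check
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [(obj['Key'], obj['Size']) for obj in response.get('Contents', [])]

# Example call (the bucket name is a placeholder; pick one printed by the cell above):
# upload_and_list(s3, 'your-bucket-name', dataset_path, 'examples/sample_dataset.csv')
```

To fetch an object back, the client's `download_file(Bucket, Key, Filename)` method is the counterpart of `upload_file`.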