# D2O Delta Sharing - Recipient Demo

## Overview
This notebook demonstrates how to access data shared through Databricks Delta Sharing as an **open recipient**, that is, a user outside Databricks. This is the Databricks-to-Open (D2O) model.

In this demo, we'll:
1. Load credentials from the `DELTA_SHARING_CONFIG` environment variable (the contents of `config.share`)
2. Connect to the Delta Share
3. List available shares, schemas, and tables
4. Query the shared data using pandas
5. Create visualizations using seaborn and matplotlib

**Prerequisites:**
- The provider has created a share and recipient
- You have received the credential file (`config.share`)
- This notebook is running in a Docker container with the necessary libraries

## Step 1: Import Required Libraries

In [None]:
import delta_sharing
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import json
import os
import warnings

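# Configure plot styling (this style name requires matplotlib 3.6 or later)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
# Silence all library warnings to keep the demo output readable
warnings.filterwarnings('ignore')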

print("✓ Libraries imported successfully")
print(f"Delta Sharing version: {delta_sharing.__version__}")
print(f"Pandas version: {pd.__version__}")

## Step 2: Load Delta Sharing Credentials

The credentials are passed in through the `DELTA_SHARING_CONFIG` environment variable, which holds the JSON contents of the `config.share` file.

In [None]:
# Load credentials from environment variable
config_json = os.environ.get('DELTA_SHARING_CONFIG')

if not config_json:
    raise ValueError("DELTA_SHARING_CONFIG environment variable not set!")

# Parse the configuration
config = json.loads(config_json)

# Write the profile to a file: the table URLs built in later steps take the
# form <profile-file>#<share>.<schema>.<table>. Note that this file contains the bearer token.
config_file_path = '/tmp/config.share'
with open(config_file_path, 'w') as f:
    json.dump(config, f, indent=2)

print("✓ Credentials loaded successfully")
print(f"Endpoint: {config['endpoint']}")
print(f"Token expires: {config.get('expirationTime', 'N/A')}")

# Create a SharingClient
client = delta_sharing.SharingClient(config_file_path)
print("✓ Delta Sharing client initialized")
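
The cell above writes the bearer token to a fixed path under `/tmp` with default file permissions. As a minimal sketch of a stricter alternative, assuming a POSIX system, the helper below uses `tempfile.mkstemp`, which creates a file that only the current user can read and write. The name `write_private_profile` is hypothetical (it is not part of the delta-sharing library), and the profile values are placeholders.

```python
import json
import os
import stat
import tempfile
from typing import Optional


def write_private_profile(profile: dict, directory: Optional[str] = None) -> str:
    """Write a Delta Sharing profile to a new file and return its path."""
    # mkstemp creates the file with an unpredictable name and, on POSIX,
    # with permissions that allow access by the creating user only.
    fd, path = tempfile.mkstemp(suffix=".share", dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(profile, f)
    return path


# Placeholder values for illustration only, not real credentials.
example_profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://example.invalid/delta-sharing/",
    "bearerToken": "<redacted>",
}

profile_path = write_private_profile(example_profile)
print(oct(stat.S_IMODE(os.stat(profile_path).st_mode)))
os.remove(profile_path)  # clean up the example file
```

The returned path could be passed to `delta_sharing.SharingClient` in place of `config_file_path`.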

## Step 3: List Available Shares

Let's discover what shares are available to us.

In [None]:
# List all available shares
shares = client.list_shares()

print(f"✓ Found {len(shares)} share(s)\n")
for share in shares:
    print(f"Share: {share.name}")
    if hasattr(share, 'id'):
        print(f"  ID: {share.id}")

## Step 4: List Schemas in the Share

Now let's see what schemas (databases) are available in the share.

In [None]:
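# Use the first share (this demo expects it to be external_retail)
if not shares:
    raise RuntimeError("No shares are visible to this recipient; check the credential file")
share_name = shares[0].name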
print(f"Working with share: {share_name}\n")

# List schemas in the share
schemas = client.list_schemas(delta_sharing.Share(name=share_name))

print(f"✓ Found {len(schemas)} schema(s)\n")
for schema in schemas:
    print(f"Schema: {schema.name}")
    if hasattr(schema, 'share'):
        print(f"  Share: {schema.share}")

## Step 5: List Tables in the Schema

Let's see what tables are available for us to query.

In [None]:
# Use the first schema in the share
if not schemas:
    raise RuntimeError(f"Share {share_name!r} contains no schemas")
schema_name = schemas[0].name
print(f"Working with schema: {schema_name}\n")

# List all tables in the schema
tables = client.list_tables(delta_sharing.Schema(name=schema_name, share=share_name))

print(f"✓ Found {len(tables)} table(s)\n")
for i, table in enumerate(tables, 1):
    print(f"{i}. Table: {table.name}")
    if hasattr(table, 'share'):
        print(f"   Share: {table.share}")
    if hasattr(table, 'schema'):
        print(f"   Schema: {table.schema}")
    print()
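
Steps 3 to 5 walk the share hierarchy one level at a time. The client also offers `list_all_tables()`, which returns every table across all shares in one call. The sketch below turns such a listing into URLs of the form `<profile-file>#<share>.<schema>.<table>`, which the following steps pass to `load_as_pandas`. `table_urls` is a hypothetical helper, and a stand-in client with an assumed schema name (`demo_schema`) replaces the real one so the example runs without a sharing server.

```python
from types import SimpleNamespace


def table_urls(client, profile_path):
    """Map 'share.schema.table' names to Delta Sharing table URLs."""
    urls = {}
    for table in client.list_all_tables():
        full_name = f"{table.share}.{table.schema}.{table.name}"
        urls[full_name] = f"{profile_path}#{full_name}"
    return urls


# Stand-in for delta_sharing.SharingClient. The schema name is an assumption
# made for this example; the real one comes from the provider's share.
fake_client = SimpleNamespace(
    list_all_tables=lambda: [
        SimpleNamespace(share="external_retail", schema="demo_schema", name="customers"),
        SimpleNamespace(share="external_retail", schema="demo_schema", name="sales_transactions"),
    ]
)

for name, url in table_urls(fake_client, "/tmp/config.share").items():
    print(f"{name} -> {url}")
```

With the real `client` from Step 2, the same function would list every table the recipient can read.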

## Step 6: Query the Customers Table

Let's load the customers table into a pandas DataFrame and explore the data.

In [None]:
# Construct table URL for customers
customers_table_url = f"{config_file_path}#{share_name}.{schema_name}.customers"

# Load the table into a pandas DataFrame
print("Loading customers table...")
customers_df = delta_sharing.load_as_pandas(customers_table_url)

print(f"✓ Loaded {len(customers_df)} customer records\n")
print("Data shape:", customers_df.shape)
print("\nColumn names:")
print(customers_df.columns.tolist())
print("\nFirst few records:")
customers_df.head()

## Step 7: Query the Sales Transactions Table

Now let's load the sales transactions data.

In [None]:
# Construct table URL for sales transactions
sales_table_url = f"{config_file_path}#{share_name}.{schema_name}.sales_transactions"

# Load the table into a pandas DataFrame
print("Loading sales transactions table...")
sales_df = delta_sharing.load_as_pandas(sales_table_url)

print(f"✓ Loaded {len(sales_df)} transaction records\n")
print("Data shape:", sales_df.shape)
print("\nColumn names:")
print(sales_df.columns.tolist())
print("\nFirst few records:")
sales_df.head()

## Step 8: Data Analysis and Summary Statistics

Let's explore the data with some basic statistics.

In [None]:
print("=" * 60)
print("CUSTOMER DATA SUMMARY")
print("=" * 60)
print(f"\nTotal customers: {len(customers_df):,}")
print("\nCustomer info:")
customers_df.info()  # info() prints its report directly and returns None

print("\n" + "=" * 60)
print("SALES DATA SUMMARY")
print("=" * 60)
print(f"\nTotal transactions: {len(sales_df):,}")

# Calculate basic statistics if amount column exists
if 'amount' in sales_df.columns:
    print(f"Total revenue: ${sales_df['amount'].sum():,.2f}")
    print(f"Average transaction: ${sales_df['amount'].mean():,.2f}")
    print(f"Max transaction: ${sales_df['amount'].max():,.2f}")
    print(f"Min transaction: ${sales_df['amount'].min():,.2f}")
    
print("\nSales data info:")
sales_df.info()  # info() prints its report directly and returns None

## Step 9: Visualization 1 - Sales Distribution

Let's create some visualizations using seaborn to better understand the data.

In [None]:
# Create a figure with multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Sales Data Analysis Dashboard', fontsize=16, fontweight='bold')

# Plot 1: Distribution of transaction amounts (if amount column exists)
if 'amount' in sales_df.columns:
    sns.histplot(data=sales_df, x='amount', bins=30, kde=True, ax=axes[0, 0])
    axes[0, 0].set_title('Distribution of Transaction Amounts')
    axes[0, 0].set_xlabel('Transaction Amount ($)')
    axes[0, 0].set_ylabel('Frequency')
    
    # Plot 2: Box plot of amounts
    sns.boxplot(data=sales_df, y='amount', ax=axes[0, 1])
    axes[0, 1].set_title('Transaction Amount Box Plot')
    axes[0, 1].set_ylabel('Amount ($)')
else:
    axes[0, 0].text(0.5, 0.5, 'No amount column found', ha='center', va='center')
    axes[0, 1].text(0.5, 0.5, 'No amount column found', ha='center', va='center')

# Plot 3: Transaction count over time (if date column exists)
date_cols = [col for col in sales_df.columns if 'date' in col.lower() or 'time' in col.lower()]
if date_cols:
    date_col = date_cols[0]
    sales_df[date_col] = pd.to_datetime(sales_df[date_col])
    sales_by_date = sales_df.groupby(sales_df[date_col].dt.date).size()
    axes[1, 0].plot(sales_by_date.index, sales_by_date.values, marker='o')
    axes[1, 0].set_title('Transactions Over Time')
    axes[1, 0].set_xlabel('Date')
    axes[1, 0].set_ylabel('Number of Transactions')
    axes[1, 0].tick_params(axis='x', rotation=45)
else:
    axes[1, 0].text(0.5, 0.5, 'No date column found', ha='center', va='center')

# Plot 4: Top customers by transaction count
if 'customer_id' in sales_df.columns:
    top_customers = sales_df['customer_id'].value_counts().head(10)
    sns.barplot(x=top_customers.values, y=top_customers.index.astype(str), ax=axes[1, 1])
    axes[1, 1].set_title('Top 10 Customers by Transaction Count')
    axes[1, 1].set_xlabel('Number of Transactions')
    axes[1, 1].set_ylabel('Customer ID')
else:
    axes[1, 1].text(0.5, 0.5, 'No customer_id column found', ha='center', va='center')

plt.tight_layout()
plt.show()

print("✓ Visualizations created successfully")
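
The "Transactions Over Time" panel converts the first column whose name contains `date` or `time` and overwrites it in `sales_df`. If that column holds values that are not timestamps, `pd.to_datetime` may raise. A minimal sketch of a more defensive variant follows: it leaves the input DataFrame untouched and drops values that fail to parse. The function name and the `transaction_date` column are assumptions for illustration, since the real column names depend on the shared table.

```python
import pandas as pd


def daily_transaction_counts(df, date_col):
    """Count rows per calendar day, skipping values that fail to parse."""
    # errors="coerce" turns unparseable values into NaT instead of raising.
    parsed = pd.to_datetime(df[date_col], errors="coerce").dropna()
    return parsed.groupby(parsed.dt.date).size()


# Small made-up sample; the last value is deliberately not a timestamp.
sample = pd.DataFrame(
    {
        "transaction_date": [
            "2025-01-01 09:00",
            "2025-01-01 17:30",
            "2025-01-02 08:15",
            "not a date",
        ]
    }
)

counts = daily_transaction_counts(sample, "transaction_date")
print(counts)
```

The resulting Series can be plotted the same way as `sales_by_date` in the cell above.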

## Step 10: Visualization 2 - Customer Demographics

Let's analyze customer demographics if the data is available.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle('Customer Demographics Analysis', fontsize=16, fontweight='bold')

# Plot 1: Customer distribution by a categorical column (if exists)
categorical_cols = customers_df.select_dtypes(include=['object']).columns
if len(categorical_cols) > 0:
    col = categorical_cols[0]
    value_counts = customers_df[col].value_counts().head(10)
    sns.barplot(x=value_counts.values, y=value_counts.index, ax=axes[0])
    axes[0].set_title(f'Top 10 {col} Distribution')
    axes[0].set_xlabel('Count')
    axes[0].set_ylabel(col)
else:
    axes[0].text(0.5, 0.5, 'No categorical columns found', ha='center', va='center')

# Plot 2: Pie chart of another categorical column if available
if len(categorical_cols) > 1:
    col = categorical_cols[1]
    value_counts = customers_df[col].value_counts().head(8)
    axes[1].pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%', startangle=90)
    axes[1].set_title(f'{col} Distribution')
elif len(categorical_cols) == 1:
    # Use the same column but different visualization
    col = categorical_cols[0]
    value_counts = customers_df[col].value_counts()
    axes[1].pie(value_counts.values[:8], labels=value_counts.index[:8], autopct='%1.1f%%', startangle=90)
    axes[1].set_title(f'{col} Distribution (Pie Chart)')
else:
    axes[1].text(0.5, 0.5, 'No categorical columns found', ha='center', va='center')

plt.tight_layout()
plt.show()

print("✓ Customer demographics visualizations created")
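
The cell above plots the first `object` column it finds, which may be a high-cardinality field such as a name or an email address. As a sketch of one way to choose a more suitable column, the hypothetical helper below keeps only text-like or categorical columns with a small number of distinct values. The sample data and the threshold are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def low_cardinality_columns(df, max_unique=10):
    """Return text-like columns with at most max_unique distinct values."""
    picked = []
    for col in df.columns:
        series = df[col]
        # Checking the dtype (not the values) treats object columns as text.
        is_text = is_string_dtype(series.dtype) or isinstance(
            series.dtype, pd.CategoricalDtype
        )
        if is_text and series.nunique() <= max_unique:
            picked.append(col)
    return picked


# Made-up sample: email is unique per row, segment has two values.
sample = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "email": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
        "segment": ["gold", "silver", "gold", "silver"],
    }
)

print(low_cardinality_columns(sample, max_unique=3))
```

The first entry of the returned list could replace `categorical_cols[0]` when a low-cardinality column exists.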

## Step 11: Advanced Analysis - Merge and Analyze

Let's join the customer and sales data for deeper insights.

In [None]:
# Try to merge if customer_id exists in both dataframes
if 'customer_id' in sales_df.columns and 'customer_id' in customers_df.columns:
    merged_df = sales_df.merge(customers_df, on='customer_id', how='left')
    
    print(f"✓ Merged dataset created with {len(merged_df)} records")
    print(f"\nMerged columns: {merged_df.columns.tolist()}")
    
    # Calculate revenue by customer segment if applicable
    if 'amount' in merged_df.columns:
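        # Group by a categorical customer column. Skip any column that merge()
        # renamed with an _x/_y suffix because sales_df has a column of the same name.
        cat_cols = [
            col for col in customers_df.columns
            if customers_df[col].dtype == 'object'
            and col != 'customer_id'
            and col in merged_df.columns
        ]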
        
        if cat_cols:
            segment_col = cat_cols[0]
            revenue_by_segment = merged_df.groupby(segment_col)['amount'].agg(['sum', 'mean', 'count'])
            revenue_by_segment.columns = ['Total Revenue', 'Avg Transaction', 'Transaction Count']
            revenue_by_segment = revenue_by_segment.sort_values('Total Revenue', ascending=False)
            
            print(f"\nRevenue Analysis by {segment_col}:")
            print(revenue_by_segment)
            
            # Visualize
            fig, axes = plt.subplots(1, 2, figsize=(16, 6))
            fig.suptitle(f'Revenue Analysis by {segment_col}', fontsize=16, fontweight='bold')
            
            # Total revenue by segment
            sns.barplot(x=revenue_by_segment['Total Revenue'], y=revenue_by_segment.index, ax=axes[0])
            axes[0].set_title('Total Revenue')
            axes[0].set_xlabel('Revenue ($)')
            
            # Average transaction by segment
            sns.barplot(x=revenue_by_segment['Avg Transaction'], y=revenue_by_segment.index, ax=axes[1])
            axes[1].set_title('Average Transaction Value')
            axes[1].set_xlabel('Amount ($)')
            
            plt.tight_layout()
            plt.show()
else:
    print("Cannot merge datasets - customer_id column not found in both tables")
    print(f"Sales columns: {sales_df.columns.tolist()}")
    print(f"Customer columns: {customers_df.columns.tolist()}")
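
A left join like the one above silently multiplies sales rows if `customer_id` is duplicated in the customers table, and it leaves customer columns empty for sales with no matching customer. The sketch below, on made-up data, shows two `DataFrame.merge` options that surface both cases: `validate="many_to_one"` raises if the right-hand keys are not unique, and `indicator=True` adds a `_merge` column marking unmatched rows.

```python
import pandas as pd

# Made-up sample data: customer 3 has a sale but no customer record.
customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["gold", "silver"]})
sales = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [10.0, 5.0, 7.5]})

merged = sales.merge(
    customers,
    on="customer_id",
    how="left",
    validate="many_to_one",  # raise MergeError if customer_id repeats in customers
    indicator=True,          # add a _merge column: "both" or "left_only"
)

# Sales rows that found no matching customer.
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(merged)} merged rows, {len(unmatched)} without a customer match")
```

Whether unmatched sales should be dropped, reported, or kept depends on the analysis; the indicator column only makes them visible.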

## Summary

🎉 **Demo Complete!**

In this notebook, we successfully demonstrated D2O (Databricks-to-Open) Delta Sharing as a recipient:

✅ **What we accomplished:**
1. Loaded credentials from an environment variable
2. Connected to the Delta Sharing endpoint
3. Listed available shares, schemas, and tables
4. Queried shared data using pandas
5. Performed data analysis and generated summary statistics
6. Created multiple visualizations using seaborn and matplotlib
7. Merged datasets for advanced analytics

**Key Benefits of D2O Delta Sharing:**
- 🚀 **No Data Duplication**: Access live data without copying
- 🔒 **Secure**: Token-based authentication
- ⚡ **Up to date**: Each read returns the provider's current version of the table
- 💰 **Cost-effective**: No storage costs for recipients
- 🛠️ **Tool Agnostic**: Use any tool that supports Delta Sharing (Python, Power BI, Tableau, etc.)
- 🌐 **Open Standard**: Based on the open Delta Sharing protocol

**Next Steps:**
- Explore more complex queries and aggregations
- Integrate with your existing data pipelines
- Build dashboards using Power BI or other BI tools
- Set up automated reporting workflows

## 📚 Additional Resources

### Documentation
- **[START-HERE.md](../START-HERE.md)** - Quick navigation guide
- **[README-D2O-DEMO.md](../README-D2O-DEMO.md)** - Complete setup instructions
- **[QUICKSTART-D2O.md](../QUICKSTART-D2O.md)** - Quick reference card
- **[ARCHITECTURE-D2O.md](../ARCHITECTURE-D2O.md)** - System architecture
- **[EXPECTED-OUTPUT.md](../EXPECTED-OUTPUT.md)** - What to expect

### External Links
- [Databricks Delta Sharing Docs](https://docs.databricks.com/delta-sharing/)
- [Delta Sharing Protocol](https://github.com/delta-io/delta-sharing)
- [Python delta-sharing Library](https://github.com/delta-io/delta-sharing/tree/main/python)

### Troubleshooting
If you encounter issues:
1. Check container logs: `docker logs d2o-demo`
2. Verify token expiration in `config.share`
3. Review provider notebook for share/recipient setup
4. See troubleshooting section in README-D2O-DEMO.md
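
For item 2 in the list above, a minimal sketch of an expiry check follows. It assumes the profile's optional `expirationTime` field is an ISO 8601 timestamp such as `2030-01-01T00:00:00.000Z`; the helper name `token_time_left` and the sample value are hypothetical. Python versions before 3.11 accept a narrower range of ISO 8601 strings in `datetime.fromisoformat`, so other timestamp shapes may need extra handling.

```python
from datetime import datetime, timedelta, timezone


def token_time_left(profile, now=None):
    """Return the time until expirationTime, or None if the field is absent."""
    raw = profile.get("expirationTime")
    if not raw:
        return None
    # Older Python versions reject a trailing "Z", so spell out the offset.
    expires = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if expires.tzinfo is None:
        # Assumption: a timestamp without an offset is in UTC.
        expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return expires - now


# Placeholder profile and a fixed "now" so the example is reproducible.
example_profile = {"expirationTime": "2030-01-01T00:00:00.000Z"}
left = token_time_left(example_profile, now=datetime(2029, 12, 31, tzinfo=timezone.utc))

if left is None:
    print("No expirationTime in profile")
elif left < timedelta(0):
    print("Token has expired")
else:
    print(f"Token valid for another {left}")
```

In the notebook, the `config` dictionary loaded in Step 2 could be passed in place of `example_profile`.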

---
**© 2025 Databricks, Inc. All rights reserved.**