# ERPNext Data Pipeline for Banking Churn POC

This notebook handles the complete ERPNext data pipeline for the Apex National Bank churn prediction POC.

## Overview

ERPNext serves as our **Core Banking System**, containing:
- **Customers** - Bank account holders (mapped from ERPNext Customer doctype)
- **Items** - Banking products (Savings, Current, FD, Loans, Cards)
- **Territories** - Bank branches/regions
- **Sales Invoices** - Account transactions/statements

## POC Scope

| Metric | Production | POC |
|--------|------------|-----|
| Customers | 500,000 | **500** |
| Branches | 50 | **10** |
| Transactions | Millions | **~8,500** |
| Time Range | 3 years | **3 years (2023-2026)** |

## Customer Segments for Churn Modeling

| Segment | Distribution | Behavior |
|---------|--------------|----------|
| Active | 60% | Regular transactions throughout |
| At-Risk | 25% | Declining activity in recent months |
| Churned | 15% | Stopped transacting 3-12 months ago |
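
As a rough sanity check on the distribution above, the expected head count per segment in the 500-customer POC can be worked out directly. This is only an illustrative sketch: `POC_CUSTOMERS` and `SEGMENT_SHARES` are placeholder names, and because the notebook assigns segments randomly, the actual counts will vary around these values.

```python
# Illustrative only: expected segment sizes for the POC.
# POC_CUSTOMERS and SEGMENT_SHARES are placeholder names for this sketch.
POC_CUSTOMERS = 500
SEGMENT_SHARES = {"active": 0.60, "at_risk": 0.25, "churned": 0.15}

# Multiply each share by the customer count and round to whole customers.
expected_counts = {
    segment: round(POC_CUSTOMERS * share)
    for segment, share in SEGMENT_SHARES.items()
}
print(expected_counts)
```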

## Notebook Sections

1. **Configuration** - Environment setup and credentials
2. **API Client** - ERPNext REST API wrapper
3. **Exploration** - Understand existing data structure
4. **Cleanup** - Delete existing data (optional)
5. **Data Ingestion** - Create POC banking data
6. **Data Extraction** - Export to CSV/JSON for Databricks

---
## 1. Configuration

### Prerequisites

1. ERPNext account (free at https://frappecloud.com)
2. API Key and Secret from ERPNext: Settings > My Settings > API Access
3. Credentials in `docs/.env` file (auto-loaded)

### Credentials File Format

The notebook automatically loads credentials from `docs/.env`:

```powershell
$env:ERPNEXT_API_KEY = "your_api_key"
$env:ERPNEXT_API_SECRET = "your_api_secret"
```
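
To make the expected format concrete, the sketch below applies the same regular expression the loader cell uses to a few sample lines. The helper name `parse_env_line` is hypothetical and exists only for this illustration. Note that a conventional `KEY=value` dotenv line does not match this pattern, so such lines would be skipped.

```python
import re

# Same pattern the notebook's loader uses for PowerShell-style assignments.
ENV_LINE = re.compile(r'\$env:(\w+)\s*=\s*["\']([^"\']*)["\']')

def parse_env_line(line: str):
    """Return (name, value) for a `$env:NAME = "value"` line, else None."""
    match = ENV_LINE.match(line.strip())
    return match.groups() if match else None

sample_lines = [
    '$env:ERPNEXT_API_KEY = "your_api_key"',
    "# a comment line is ignored",
    "ERPNEXT_API_KEY=plain_dotenv_style",  # not PowerShell format, so no match
]
parsed = [parse_env_line(line) for line in sample_lines]
print(parsed)
```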

In [None]:
# Required packages
import requests
import json
import csv
import random
import os
import time
from datetime import datetime, timedelta
from typing import List, Dict, Any
from pathlib import Path

print("Packages imported successfully!")

In [None]:
# =============================================================================
# CONFIGURATION - Auto-load from .env file
# =============================================================================

import re

def load_env_file(env_path: str) -> dict:
    """
    Load credentials from .env file (PowerShell format).
    
    Parses lines like: $env:VAR_NAME = "value"
    """
    credentials = {}
    try:
        with open(env_path, 'r') as f:
            for line in f:
                # Match PowerShell format: $env:VAR_NAME = "value"
                match = re.match(r'\$env:(\w+)\s*=\s*["\']([^"\']*)["\']', line.strip())
                if match:
                    var_name, value = match.groups()
                    credentials[var_name] = value
        print(f"Loaded {len(credentials)} credentials from {env_path}")
    except FileNotFoundError:
        print(f"Warning: {env_path} not found, using environment variables")
    return credentials

# Load from .env file
ENV_FILE = Path("../../docs/.env")
env_creds = load_env_file(str(ENV_FILE))

# ERPNext Instance URL (no trailing slash)
ERPNEXT_URL = "https://erpnext-rnm-aly.m.erpnext.com"

# API Credentials (from .env file or environment variables)
API_KEY = env_creds.get("ERPNEXT_API_KEY", os.getenv("ERPNEXT_API_KEY", ""))
API_SECRET = env_creds.get("ERPNEXT_API_SECRET", os.getenv("ERPNEXT_API_SECRET", ""))

# Data Generation Parameters
START_DATE = datetime(2023, 1, 1)  # 3 years of data
END_DATE = datetime(2026, 1, 11)
NUM_CUSTOMERS = 500

# Customer Segment Distribution
SEGMENTS = {
    "active": 0.60,    # 60% - Regular activity
    "at_risk": 0.25,   # 25% - Declining activity  
    "churned": 0.15,   # 15% - Stopped transacting
}

# Output Directory
OUTPUT_DIR = Path("../../data/raw")

# Protected system records (never delete these)
PROTECTED_TERRITORIES = ["All Territories", "Bangladesh", "Rest Of The World"]

# Check credentials
if API_KEY and API_SECRET:
    print(f"\nConfiguration loaded!")
    print(f"  ERPNext URL: {ERPNEXT_URL}")
    print(f"  API Key: {API_KEY[:4]}...")  # show only a short prefix to keep the key out of notebook output
    print(f"  Date Range: {START_DATE.date()} to {END_DATE.date()}")
    print(f"  Target Customers: {NUM_CUSTOMERS}")
else:
    print("WARNING: API credentials not set!")
    print("\nEither:")
    print("  1. Add credentials to docs/.env file")
    print("  2. Or set environment variables manually")

In [None]:
# =============================================================================
# REFERENCE DATA FOR GENERATION
# =============================================================================

FIRST_NAMES = [
    "Ahmed", "Mohammad", "Fatima", "Ayesha", "Rahman", "Hassan", "Zainab", "Omar",
    "Sara", "Ali", "Nadia", "Karim", "Layla", "Yusuf", "Amina", "Ibrahim",
    "Mariam", "Tariq", "Salma", "Bilal", "Hana", "Rashid", "Dalia", "Faisal",
    "Noura", "Khalid", "Rania", "Jamal", "Samira", "Mustafa", "Leena", "Adnan",
]

LAST_NAMES = [
    "Khan", "Ahmed", "Rahman", "Hassan", "Ali", "Sheikh", "Malik", "Chowdhury",
    "Islam", "Hossain", "Akter", "Begum", "Uddin", "Siddiqui", "Mirza", "Shah",
]

TERRITORIES = [
    "Downtown Branch", "Uptown Branch", "Eastside Branch", "Westend Branch",
    "Southgate Branch", "Airport Branch", "University Branch", "Mall Branch",
    "Industrial Branch", "Suburban Branch",
]

PRODUCTS = [
    {"code": "PROD-SAV-001", "name": "Basic Savings Account", "rate": 100},
    {"code": "PROD-SAV-002", "name": "Premium Savings Account", "rate": 500},
    {"code": "PROD-CUR-001", "name": "Current Account", "rate": 200},
    {"code": "PROD-CUR-002", "name": "Business Current Account", "rate": 1000},
    {"code": "PROD-FD-001", "name": "Fixed Deposit 1 Year", "rate": 10000},
    {"code": "PROD-FD-002", "name": "Fixed Deposit 3 Year", "rate": 25000},
    {"code": "PROD-LOAN-001", "name": "Personal Loan", "rate": 5000},
    {"code": "PROD-LOAN-002", "name": "Home Loan", "rate": 50000},
    {"code": "PROD-CARD-001", "name": "Credit Card Classic", "rate": 150},
    {"code": "PROD-CARD-002", "name": "Credit Card Platinum", "rate": 500},
]

print(f"Reference data loaded:")
print(f"  Names: {len(FIRST_NAMES)} first, {len(LAST_NAMES)} last")
print(f"  Branches: {len(TERRITORIES)}")
print(f"  Products: {len(PRODUCTS)}")

---
## 2. ERPNext API Client

This class wraps all ERPNext REST API operations needed for our pipeline.
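
Before reading the full class, it may help to see the two building blocks every call relies on: the token authorization header and the `/api/resource/<doctype>` URL. The sketch below is a minimal illustration, not the class itself; `build_auth_headers`, `resource_url`, and the `example.invalid` host are hypothetical names chosen for this example.

```python
from urllib.parse import quote

def build_auth_headers(api_key: str, api_secret: str) -> dict:
    """Headers for token-based authentication, as used by the client class."""
    return {
        "Authorization": f"token {api_key}:{api_secret}",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }

def resource_url(base_url: str, doctype: str, name: str = None) -> str:
    """Build a resource URL, percent-encoding spaces in doctype names."""
    url = f"{base_url.rstrip('/')}/api/resource/{quote(doctype, safe='')}"
    if name is not None:
        url += f"/{quote(name, safe='')}"
    return url

headers = build_auth_headers("my_key", "my_secret")
list_url = resource_url("https://example.invalid/", "Sales Invoice")
print(headers["Authorization"])
print(list_url)
```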

In [None]:
class ERPNextClient:
    """
    ERPNext REST API Client
    
    Provides methods for:
    - CRUD operations on documents
    - Bulk data extraction
    - Cleanup operations (cancel + delete)
    
    Reference: https://docs.frappe.io/framework/user/en/api/rest
    """
    
    def __init__(self, url: str, api_key: str, api_secret: str):
        self.url = url.rstrip('/')
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"token {api_key}:{api_secret}",
            "Accept": "application/json",
            "Content-Type": "application/json"
        })
    
    # -------------------------------------------------------------------------
    # Connection & Basic Operations
    # -------------------------------------------------------------------------
    
    def test_connection(self) -> str:
        """Test API connection, returns logged in user"""
        response = self.session.get(f"{self.url}/api/method/frappe.auth.get_logged_user")
        response.raise_for_status()
        return response.json().get("message", "Unknown")
    
    def get_count(self, doctype: str, filters: list = None) -> int:
        """Get count of documents"""
        params = {"doctype": doctype}
        if filters:
            params["filters"] = json.dumps(filters)
        response = self.session.get(f"{self.url}/api/method/frappe.client.get_count", params=params)
        response.raise_for_status()
        return response.json().get("message", 0)
    
    # -------------------------------------------------------------------------
    # Read Operations
    # -------------------------------------------------------------------------
    
    def get_list(self, doctype: str, fields: list = None, filters: list = None, 
                 limit: int = 1000) -> List[Dict]:
        """Fetch list of documents with pagination"""
        all_records = []
        offset = 0
        batch_size = 100
        
        while True:
            params = {
                "limit_page_length": batch_size,
                "limit_start": offset,
            }
            if fields:
                params["fields"] = json.dumps(fields)
            if filters:
                params["filters"] = json.dumps(filters)
            
            try:
                response = self.session.get(f"{self.url}/api/resource/{doctype}", params=params)
                response.raise_for_status()
                data = response.json().get("data", [])
                
                if not data:
                    break
                    
                all_records.extend(data)
                offset += batch_size
                
                if len(data) < batch_size or len(all_records) >= limit:
                    break
            except Exception as e:
                print(f"Error fetching {doctype}: {e}")
                break
        
        return all_records[:limit]
    
    def get_doc(self, doctype: str, name: str) -> Dict:
        """Fetch single document with all fields"""
        response = self.session.get(f"{self.url}/api/resource/{doctype}/{name}")
        response.raise_for_status()
        return response.json().get("data", {})
    
    # -------------------------------------------------------------------------
    # Write Operations
    # -------------------------------------------------------------------------
    
    def insert(self, doctype: str, data: Dict) -> Dict:
        """Insert new document"""
        response = self.session.post(f"{self.url}/api/resource/{doctype}", json=data)
        if response.status_code != 200:
            raise Exception(f"{response.status_code}: {response.text[:300]}")
        return response.json().get("data", {})
    
    # -------------------------------------------------------------------------
    # Delete Operations (for cleanup)
    # -------------------------------------------------------------------------
    
    def cancel_doc(self, doctype: str, name: str) -> bool:
        """Cancel a submitted document (required before deletion)"""
        try:
            response = self.session.post(
                f"{self.url}/api/method/frappe.client.cancel",
                json={"doctype": doctype, "name": name}
            )
            response.raise_for_status()
            return True
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            return False
    
    def delete_doc(self, doctype: str, name: str) -> bool:
        """Delete a document"""
        try:
            response = self.session.post(
                f"{self.url}/api/method/frappe.client.delete",
                json={"doctype": doctype, "name": name}
            )
            response.raise_for_status()
            return True
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 404:
                return True  # Already deleted
            return False
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            return False
    
    def delete_gl_entries_for_voucher(self, voucher_type: str, voucher_no: str) -> int:
        """Delete GL Entries linked to a voucher (for cleanup)"""
        try:
            response = self.session.get(
                f"{self.url}/api/resource/GL Entry",
                params={
                    # json.dumps escapes quotes in names, unlike manual string formatting
                    "filters": json.dumps([
                        ["voucher_type", "=", voucher_type],
                        ["voucher_no", "=", voucher_no],
                    ]),
                    "fields": json.dumps(["name"]),
                    "limit_page_length": 100
                }
            )
            response.raise_for_status()
            gl_entries = response.json().get("data", [])
            
            for gl in gl_entries:
                self.delete_doc("GL Entry", gl["name"])
            return len(gl_entries)
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            return 0
    
    def delete_gl_entries_for_party(self, party_type: str, party: str) -> int:
        """Delete GL Entries linked to a party (for customer cleanup)"""
        try:
            response = self.session.get(
                f"{self.url}/api/resource/GL Entry",
                params={
                    # json.dumps escapes quotes in names, unlike manual string formatting
                    "filters": json.dumps([
                        ["party_type", "=", party_type],
                        ["party", "=", party],
                    ]),
                    "fields": json.dumps(["name"]),
                    "limit_page_length": 500
                }
            )
            response.raise_for_status()
            gl_entries = response.json().get("data", [])
            
            for gl in gl_entries:
                self.delete_doc("GL Entry", gl["name"])
            return len(gl_entries)
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            return 0


# Initialize client
client = ERPNextClient(ERPNEXT_URL, API_KEY, API_SECRET)

# Test connection
try:
    user = client.test_connection()
    print(f"Connected to ERPNext as: {user}")
except Exception as e:
    print(f"Connection failed: {e}")
    print("\nPlease check:")
    print("1. API Key and Secret are correct")
    print("2. ERPNext instance is accessible")

---
## 3. Exploration

Understand what data exists in ERPNext before making changes.

In [None]:
def explore_existing_data(client: ERPNextClient):
    """
    Survey existing data in ERPNext
    
    Shows counts for all relevant doctypes to understand current state.
    """
    print("=" * 60)
    print("ERPNext Data Survey")
    print(f"Instance: {ERPNEXT_URL}")
    print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
    print("=" * 60)
    
    doctypes = [
        ("Customer", "Bank account holders"),
        ("Item", "Banking products"),
        ("Territory", "Branches/regions"),
        ("Sales Invoice", "Transactions"),
        ("GL Entry", "Ledger entries"),
        ("Payment Entry", "Payments"),
        ("Company", "Organization"),
    ]
    
    results = {}
    for doctype, desc in doctypes:
        try:
            count = client.get_count(doctype)
            results[doctype] = count
            status = "OK" if count > 0 else "Empty"
            print(f"  {doctype:20} : {count:6} records  [{status}]  ({desc})")
        except Exception as e:
            results[doctype] = 0
            print(f"  {doctype:20} : Error - {e}")
    
    return results

# Run exploration
current_data = explore_existing_data(client)

In [None]:
# View sample customers (if any exist)
print("\n" + "=" * 60)
print("Sample Customer Records")
print("=" * 60)

customers = client.get_list("Customer", 
    fields=["name", "customer_name", "territory", "email_id", "website"],
    limit=5
)

if customers:
    for i, c in enumerate(customers, 1):
        print(f"\n{i}. {c.get('customer_name', 'N/A')}")
        print(f"   ID: {c.get('name')}")
        print(f"   Branch: {c.get('territory', 'N/A')}")
        print(f"   Email: {c.get('email_id', 'N/A')}")
        print(f"   Segment: {c.get('website', 'N/A')}")  # We store segment in website field
else:
    print("No customers found.")

---
## 4. Cleanup (Optional)

Run this section **ONLY** if you need to delete existing data and start fresh.

### Important Notes:
- Deletes ALL user-created data (sales invoices, customers, items, territories)
- Preserves system records (All Territories, etc.)
- Requires typing `DELETE ALL` to confirm
- Handles linked records (GL Entries) properly
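
The order of operations for a single invoice can be sketched without touching a live instance. The example below uses a stand-in client that only records calls; `RecordingClient`, `remove_invoice`, and the invoice names are hypothetical and serve only to show the sequence (linked GL Entries, then cancel if submitted, then delete).

```python
class RecordingClient:
    """Stand-in for the real API client; it only records the calls made."""

    def __init__(self):
        self.calls = []

    def delete_gl_entries_for_voucher(self, voucher_type, voucher_no):
        self.calls.append(("delete_gl", voucher_no))
        return 0

    def cancel_doc(self, doctype, name):
        self.calls.append(("cancel", name))
        return True

    def delete_doc(self, doctype, name):
        self.calls.append(("delete", name))
        return True


def remove_invoice(client, name, docstatus):
    """Remove one invoice: GL Entries first, cancel if submitted, then delete."""
    client.delete_gl_entries_for_voucher("Sales Invoice", name)
    if docstatus == 1:  # 1 means the document was submitted
        client.cancel_doc("Sales Invoice", name)
    return client.delete_doc("Sales Invoice", name)


stub = RecordingClient()
remove_invoice(stub, "INV-DRAFT", docstatus=0)
remove_invoice(stub, "INV-SUBMITTED", docstatus=1)
print(stub.calls)
```

Draft invoices skip the cancel step, while submitted ones go through all three.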

In [None]:
def cleanup_sales_invoices(client: ERPNextClient) -> int:
    """Delete all Sales Invoices (handles GL entries and cancellation)"""
    print("\nDeleting Sales Invoices...")
    
    invoices = client.get_list("Sales Invoice", fields=["name", "docstatus"], limit=5000)
    print(f"  Found {len(invoices)} invoices")
    
    deleted = 0
    for inv in invoices:
        name = inv["name"]
        
        # Delete linked GL Entries first
        client.delete_gl_entries_for_voucher("Sales Invoice", name)
        
        # Cancel if submitted
        if inv.get("docstatus") == 1:
            client.cancel_doc("Sales Invoice", name)
        
        # Delete
        if client.delete_doc("Sales Invoice", name):
            deleted += 1
        
        time.sleep(0.03)
    
    print(f"  Deleted: {deleted}/{len(invoices)}")
    return deleted


def cleanup_customers(client: ERPNextClient) -> int:
    """Delete all Customers (handles linked GL entries)"""
    print("\nDeleting Customers...")
    
    customers = client.get_list("Customer", fields=["name"], limit=1000)
    print(f"  Found {len(customers)} customers")
    
    deleted = 0
    for cust in customers:
        name = cust["name"]
        
        # Delete linked GL Entries
        client.delete_gl_entries_for_party("Customer", name)
        
        # Delete customer
        if client.delete_doc("Customer", name):
            deleted += 1
        
        time.sleep(0.03)
    
    print(f"  Deleted: {deleted}/{len(customers)}")
    return deleted


def cleanup_items(client: ERPNextClient) -> int:
    """Delete all Items"""
    print("\nDeleting Items...")
    
    items = client.get_list("Item", fields=["name"], limit=100)
    print(f"  Found {len(items)} items")
    
    deleted = 0
    for item in items:
        if client.delete_doc("Item", item["name"]):
            deleted += 1
        time.sleep(0.03)
    
    print(f"  Deleted: {deleted}/{len(items)}")
    return deleted


def cleanup_territories(client: ERPNextClient) -> int:
    """Delete user-created Territories (preserves system defaults)"""
    print("\nDeleting Territories...")
    
    territories = client.get_list("Territory", fields=["name"], limit=100)
    to_delete = [t for t in territories if t["name"] not in PROTECTED_TERRITORIES]
    print(f"  Found {len(territories)} territories, {len(to_delete)} to delete")
    print(f"  Protected: {', '.join(PROTECTED_TERRITORIES)}")
    
    deleted = 0
    for territory in to_delete:
        if client.delete_doc("Territory", territory["name"]):
            deleted += 1
        time.sleep(0.03)
    
    print(f"  Deleted: {deleted}/{len(to_delete)}")
    return deleted

In [None]:
# ============================================================================
# RUN CLEANUP (OPTIONAL - DESTRUCTIVE OPERATION)
# ============================================================================
#
# Enable and run this cell ONLY if you want to delete all existing data.
# This cannot be undone!
#
# To run: Remove the triple quotes and run the cell

"""
print("=" * 60)
print("FULL CLEANUP - This will delete ALL data!")
print("=" * 60)

# Preview
invoices = client.get_list("Sales Invoice", fields=["name"], limit=5000)
customers = client.get_list("Customer", fields=["name"], limit=1000)
items = client.get_list("Item", fields=["name"], limit=100)

print(f"\nRecords to be deleted:")
print(f"  Sales Invoices: {len(invoices)}")
print(f"  Customers: {len(customers)}")
print(f"  Items: {len(items)}")

# Confirmation
confirm = input("\nType 'DELETE ALL' to proceed: ")

if confirm == "DELETE ALL":
    # Delete in dependency order
    cleanup_sales_invoices(client)
    cleanup_customers(client)
    cleanup_items(client)
    cleanup_territories(client)
    print("\nCleanup complete!")
else:
    print("\nAborted. No data was deleted.")
"""

print("Cleanup cell is commented out for safety.")
print("Remove the triple quotes to enable cleanup.")

---
## 5. Data Ingestion

Create POC banking data in ERPNext.

### Key Technical Note: Backdated Invoices

By default, ERPNext does not honor a backdated `posting_date` on a Sales Invoice; it expects the posting date to be the current date. To create historical transactions, the invoice payload must also set:

```python
"set_posting_time": 1  # Allows backdated invoices
```

Reference: https://github.com/frappe/erpnext/issues/8809
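
As a minimal sketch of where the flag sits, the helper below assembles a draft invoice payload with a historical posting date. The function name `build_backdated_invoice`, its `due_in_days` parameter, and the customer ID `CUST-0001` are assumptions made for this example; the field names mirror those used in the ingestion cell.

```python
from datetime import date, timedelta

def build_backdated_invoice(customer, item_code, rate, posting_date, due_in_days=30):
    """Assemble a draft Sales Invoice payload with a historical posting date."""
    due_date = posting_date + timedelta(days=due_in_days)
    return {
        "customer": customer,
        "posting_date": posting_date.isoformat(),
        "due_date": due_date.isoformat(),
        "set_posting_time": 1,  # keep the supplied posting_date
        "items": [{"item_code": item_code, "qty": 1, "rate": round(rate, 2)}],
        "docstatus": 0,  # 0 = draft
    }

payload = build_backdated_invoice(
    "CUST-0001", "PROD-SAV-001", 100.0, date(2023, 6, 15)
)
print(payload["posting_date"], payload["due_date"])
```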

In [None]:
# Helper functions for data generation

def get_segment() -> str:
    """Randomly assign customer segment based on distribution"""
    rand = random.random()
    if rand < SEGMENTS["active"]:
        return "active"
    elif rand < SEGMENTS["active"] + SEGMENTS["at_risk"]:
        return "at_risk"
    return "churned"


def generate_transaction_dates(segment: str, start: datetime) -> List[datetime]:
    """
    Generate transaction dates based on customer segment behavior.
    
    - Active: Regular transactions throughout the period
    - At-Risk: Normal first 70%, then declining frequency
    - Churned: Stopped transacting 3-12 months ago
    """
    dates = []
    current = start
    
    if segment == "active":
        while current < END_DATE:
            dates.append(current)
            current += timedelta(days=random.randint(15, 45))
    
    elif segment == "at_risk":
        cutoff = start + (END_DATE - start) * 0.7
        while current < cutoff:
            dates.append(current)
            current += timedelta(days=random.randint(15, 45))
        while current < END_DATE:
            dates.append(current)
            current += timedelta(days=random.randint(60, 120))  # Less frequent
    
    elif segment == "churned":
        churn_date = END_DATE - timedelta(days=random.randint(90, 365))
        while current < churn_date:
            dates.append(current)
            current += timedelta(days=random.randint(15, 45))
    
    return dates


print("Helper functions defined.")

In [None]:
def create_territories(client: ERPNextClient) -> List[str]:
    """Create bank branch territories"""
    print("\n" + "=" * 60)
    print("Creating Territories (Bank Branches)")
    print("=" * 60)
    
    existing = [t["name"] for t in client.get_list("Territory", fields=["name"], limit=100)]
    created = []
    
    for name in TERRITORIES:
        if name in existing:
            print(f"  - Exists: {name}")
            created.append(name)
        else:
            try:
                client.insert("Territory", {
                    "territory_name": name,
                    "parent_territory": "All Territories",
                })
                print(f"  + Created: {name}")
                created.append(name)
                time.sleep(0.1)
            except Exception as e:
                print(f"  x Failed: {name} - {e}")
    
    print(f"\nTotal: {len(created)} territories")
    return created


def create_products(client: ERPNextClient) -> List[Dict]:
    """Create banking products (Items)"""
    print("\n" + "=" * 60)
    print("Creating Items (Banking Products)")
    print("=" * 60)
    
    existing = [i["name"] for i in client.get_list("Item", fields=["name"], limit=100)]
    created = []
    
    for product in PRODUCTS:
        if product["code"] in existing:
            print(f"  - Exists: {product['name']}")
            created.append(product)
        else:
            try:
                client.insert("Item", {
                    "item_code": product["code"],
                    "item_name": product["name"],
                    "item_group": "Services",
                    "stock_uom": "Nos",
                    "is_stock_item": 0,
                    "is_sales_item": 1,
                    "standard_rate": product["rate"],
                })
                print(f"  + Created: {product['name']}")
                created.append(product)
                time.sleep(0.1)
            except Exception as e:
                print(f"  x Failed: {product['name']} - {e}")
    
    print(f"\nTotal: {len(created)} products")
    return created

In [None]:
def create_customers(client: ERPNextClient, territories: List[str]) -> List[Dict]:
    """Create customer records with segment distribution"""
    print("\n" + "=" * 60)
    print(f"Creating Customers ({NUM_CUSTOMERS} total)")
    print("=" * 60)
    
    existing = [c["name"] for c in client.get_list("Customer", fields=["name"], limit=1000)]
    customers = []
    segment_counts = {"active": 0, "at_risk": 0, "churned": 0}
    
    for i in range(NUM_CUSTOMERS):
        first = random.choice(FIRST_NAMES)
        last = random.choice(LAST_NAMES)
        full_name = f"{first} {last} {random.randint(100, 999)}"
        
        if full_name in existing:
            continue
        
        segment = get_segment()
        territory = random.choice(territories)
        
        try:
            result = client.insert("Customer", {
                "customer_name": full_name,
                "customer_type": "Individual",
                "customer_group": "Individual",
                "territory": territory,
                "gender": random.choice(["Male", "Female"]),
                "mobile_no": f"+880{random.randint(1300000000, 1999999999)}",
                "email_id": f"{first.lower()}.{last.lower()}{random.randint(1,99)}@gmail.com",
                "website": segment,  # Store segment in website field for later use
            })
            
            customers.append({
                "name": result.get("name"),
                "segment": segment,
                "territory": territory,
            })
            segment_counts[segment] += 1
            
            if (i + 1) % 50 == 0:
                print(f"  Processed {i + 1}/{NUM_CUSTOMERS} ({len(customers)} created)...")
            
            time.sleep(0.05)
        
        except Exception as e:
            if i < 5:
                print(f"  x Failed: {full_name} - {str(e)[:80]}")
    
    print(f"\nSegment Distribution:")
    created = len(customers) or 1  # avoid ZeroDivisionError when every insert failed
    print(f"  Active: {segment_counts['active']} ({segment_counts['active']/created*100:.1f}%)")
    print(f"  At-Risk: {segment_counts['at_risk']} ({segment_counts['at_risk']/created*100:.1f}%)")
    print(f"  Churned: {segment_counts['churned']} ({segment_counts['churned']/created*100:.1f}%)")
    
    return customers

In [None]:
def create_transactions(client: ERPNextClient, customers: List[Dict], products: List[Dict]) -> int:
    """
    Create Sales Invoices (transactions) with backdating support.
    
    KEY: Uses set_posting_time=1 to allow historical dates.
    """
    print("\n" + "=" * 60)
    print("Creating Sales Invoices (Transactions)")
    print("=" * 60)
    print("  Using set_posting_time=1 for backdated invoices...")
    
    total = 0
    errors = 0
    
    for i, customer in enumerate(customers):
        # Calculate customer start date based on segment
        if customer["segment"] == "churned":
            days_ago = random.randint(700, 1000)
        elif customer["segment"] == "at_risk":
            days_ago = random.randint(300, 700)
        else:
            days_ago = random.randint(30, 900)
        
        customer_start = END_DATE - timedelta(days=days_ago)
        if customer_start < START_DATE:
            customer_start = START_DATE + timedelta(days=random.randint(1, 30))
        
        # Generate transaction dates
        tx_dates = generate_transaction_dates(customer["segment"], customer_start)
        
        for tx_date in tx_dates:
            product = random.choice(products)
            amount = product["rate"] * random.uniform(0.5, 2.0)
            due_date = tx_date + timedelta(days=random.randint(30, 60))
            
            try:
                client.insert("Sales Invoice", {
                    "customer": customer["name"],
                    "posting_date": tx_date.strftime("%Y-%m-%d"),
                    "due_date": due_date.strftime("%Y-%m-%d"),
                    "set_posting_time": 1,  # KEY: Allow backdated invoices
                    "territory": customer["territory"],
                    "items": [{
                        "item_code": product["code"],
                        "qty": 1,
                        "rate": round(amount, 2),
                    }],
                    "docstatus": 0,
                })
                total += 1
                
                if total % 100 == 0:
                    print(f"  Created {total} transactions...")
                
                time.sleep(0.03)
            
            except Exception as e:
                errors += 1
                if errors <= 3:
                    print(f"  x Error: {str(e)[:100]}")
        
        if (i + 1) % 50 == 0:
            print(f"  Processed {i + 1}/{len(customers)} customers ({total} transactions)")
    
    print(f"\nTotal: {total} transactions created, {errors} errors")
    return total

In [None]:
# ============================================================================
# RUN DATA INGESTION
# ============================================================================
# 
# This cell creates all POC data in ERPNext:
# - 10 Territories (bank branches)
# - 10 Items (banking products)
# - 500 Customers (with segment distribution)
# - ~8,500 Sales Invoices (transactions)
#
# Runtime: Approximately 10-15 minutes

print("=" * 60)
print("ERPNext Data Ingestion")
print(f"Target: {ERPNEXT_URL}")
print(f"Date Range: {START_DATE.date()} to {END_DATE.date()}")
print("=" * 60)

# Create data
territories = create_territories(client)
products = create_products(client)
customers = create_customers(client, territories)

total_tx = 0  # stays 0 if no customers were created, so the summary below still prints
if customers:
    total_tx = create_transactions(client, customers, products)

# Summary
print("\n" + "=" * 60)
print("INGESTION COMPLETE")
print("=" * 60)
print(f"  Territories: {len(territories)}")
print(f"  Products: {len(products)}")
print(f"  Customers: {len(customers)}")
print(f"  Transactions: {total_tx}")

---
## 6. Data Extraction

Export data from ERPNext to CSV/JSON files for Databricks ingestion.

### Output Files

| File | Description | Use Case |
|------|-------------|----------|
| `erp_customers.csv` | Customer master data | dbt seed / bronze layer |
| `erp_items.csv` | Banking products | Reference data |
| `erp_territories.csv` | Branch/region data | Reference data |
| `erp_sales_invoices.csv` | Transactions | Fact table / bronze layer |
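
Records returned by the API do not always carry the same set of fields, so the CSV header has to be the union of all keys. The sketch below shows that idea on two made-up records written to a temporary directory and read back; `write_records` and the sample values are hypothetical, and missing fields come back as empty strings.

```python
import csv
import tempfile
from pathlib import Path

def write_records(records, path):
    """Write dict records to CSV using the sorted union of all keys as header."""
    fieldnames = sorted({key for record in records for key in record})
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing keys are written as empty cells
    return fieldnames

sample = [
    {"name": "CUST-0001", "territory": "Downtown Branch"},
    {"name": "CUST-0002", "website": "churned"},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "erp_customers.csv"
    header = write_records(sample, csv_path)
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(header)
print(rows)
```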

In [None]:
def extract_customers(client: ERPNextClient) -> List[Dict]:
    """Extract all customer records"""
    print("\nExtracting Customers...")
    
    fields = [
        "name", "customer_name", "customer_type", "customer_group",
        "territory", "gender", "mobile_no", "email_id", "website",
        "creation", "modified"
    ]
    
    customers = client.get_list("Customer", fields=fields, limit=1000)
    print(f"  Found {len(customers)} customers")
    return customers


def extract_items(client: ERPNextClient) -> List[Dict]:
    """Extract all item records"""
    print("\nExtracting Items...")
    
    fields = [
        "name", "item_code", "item_name", "item_group",
        "standard_rate", "description", "creation", "modified"
    ]
    
    items = client.get_list("Item", fields=fields, limit=100)
    print(f"  Found {len(items)} items")
    return items


def extract_territories(client: ERPNextClient) -> List[Dict]:
    """Extract all territory records"""
    print("\nExtracting Territories...")
    
    fields = ["name", "territory_name", "parent_territory", "creation", "modified"]
    
    territories = client.get_list("Territory", fields=fields, limit=100)
    print(f"  Found {len(territories)} territories")
    return territories


def extract_sales_invoices(client: ERPNextClient) -> List[Dict]:
    """Extract all sales invoice records"""
    print("\nExtracting Sales Invoices...")
    
    fields = [
        "name", "customer", "customer_name", "posting_date", "due_date",
        "territory", "grand_total", "status", "docstatus",
        "creation", "modified"
    ]
    
    invoices = client.get_list("Sales Invoice", fields=fields, limit=10000)
    print(f"  Found {len(invoices)} invoices")
    return invoices

In [None]:
def save_to_csv(data: List[Dict], filename: str, output_dir: Path):
    """Save data to CSV file"""
    if not data:
        print(f"  No data for {filename}")
        return
    
    output_dir.mkdir(parents=True, exist_ok=True)
    filepath = output_dir / filename
    
    # Get all unique keys
    all_keys = set()
    for record in data:
        all_keys.update(record.keys())
    fieldnames = sorted(all_keys)
    
    with open(filepath, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)
    
    print(f"  Saved: {filename} ({len(data)} records)")


def save_to_json(data: List[Dict], filename: str, output_dir: Path):
    """Save data to JSON file"""
    if not data:
        print(f"  No data for {filename}")
        return
    
    output_dir.mkdir(parents=True, exist_ok=True)
    filepath = output_dir / filename
    
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, default=str)
    
    print(f"  Saved: {filename} ({len(data)} records)")

In [None]:
# ============================================================================
# RUN DATA EXTRACTION
# ============================================================================

print("=" * 60)
print("ERPNext Data Extraction")
print(f"Source: {ERPNEXT_URL}")
print(f"Output: {OUTPUT_DIR.absolute()}")
print("=" * 60)

# Extract data
customers = extract_customers(client)
items = extract_items(client)
territories = extract_territories(client)
invoices = extract_sales_invoices(client)

# Save to CSV
print("\n" + "=" * 60)
print("Saving to CSV")
print("=" * 60)

save_to_csv(customers, "erp_customers.csv", OUTPUT_DIR)
save_to_csv(items, "erp_items.csv", OUTPUT_DIR)
save_to_csv(territories, "erp_territories.csv", OUTPUT_DIR)
save_to_csv(invoices, "erp_sales_invoices.csv", OUTPUT_DIR)

# Save to JSON
print("\n" + "=" * 60)
print("Saving to JSON")
print("=" * 60)

save_to_json(customers, "erp_customers.json", OUTPUT_DIR)
save_to_json(items, "erp_items.json", OUTPUT_DIR)
save_to_json(territories, "erp_territories.json", OUTPUT_DIR)
save_to_json(invoices, "erp_sales_invoices.json", OUTPUT_DIR)

# Summary
print("\n" + "=" * 60)
print("EXTRACTION COMPLETE")
print("=" * 60)
print(f"\nFiles saved to: {OUTPUT_DIR.absolute()}")
print(f"\n  erp_customers.csv      : {len(customers)} records")
print(f"  erp_items.csv          : {len(items)} records")
print(f"  erp_territories.csv    : {len(territories)} records")
print(f"  erp_sales_invoices.csv : {len(invoices)} records")

---
## Summary

This notebook provides the complete ERPNext data pipeline:

1. **Configuration** - Set up credentials and parameters
2. **API Client** - ERPNext REST API wrapper with CRUD operations
3. **Exploration** - Survey existing data
4. **Cleanup** - Delete data for fresh start (optional)
5. **Ingestion** - Create POC banking data
6. **Extraction** - Export to CSV/JSON

### Key Technical Learnings

| Issue | Solution |
|-------|----------|
| Backdated invoices rejected | Use `set_posting_time: 1` |
| Can't delete invoices | Delete GL Entries first, then cancel, then delete |
| Can't delete customers | Delete linked GL Entries for party first |
| Rate limiting | Add `time.sleep()` between API calls |
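
The notebook itself only uses fixed `time.sleep()` pauses. If throttling errors still occur, one possible extension (not part of this notebook) is to retry a failed call with exponential backoff. The sketch below assumes a hypothetical helper `call_with_backoff`; the `sleep` parameter is injectable so the demonstration runs without actually waiting.

```python
import time

def call_with_backoff(func, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt and retry."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))

# Demonstration with a fake call that fails twice, then succeeds.
attempts = {"count": 0}
waits = []  # records the requested delays instead of sleeping

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```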

### Next Steps

1. Run Salesforce CRM data generator
2. Run Supabase digital channel data generator
3. Run Google Sheets legacy data generator
4. Build Bronze layer in dbt

---
*Created for Banking Customer Churn Prediction POC*