<div style="display: flex; justify-content: space-between; align-items: center; padding: 8px 16px; background: #F8F9FA; border-bottom: 2px solid #E0E0E0; margin: 0; line-height: 1;">
    <div style="font-size: 14px; color: #666;">
        <span style="font-weight: bold; color: #333;">{SOURCE_PLATFORM} → Databricks Migration</span>
        <span style="margin-left: 8px; color: #999;">|</span>
        <span style="margin-left: 8px;">01 - Discover</span>
    </div>
    <div style="display: flex; align-items: center; gap: 8px;">
        <img src="{SOURCE_PLATFORM_ICON}" width="24" height="24"/>
        <span style="color: #999; font-size: 16px;">→</span>
        <img src="https://cdn.simpleicons.org/databricks/FF3621" width="24" height="24"/>
    </div>
</div>


<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>

# Discovery Checklist

## Overview

Before migration begins, you need a thorough understanding of your current **{SOURCE_PLATFORM}** environment. This lesson provides a checklist of what to discover, who to talk to, and how to document findings.

## Learning Objectives

By the end of this lesson, you will be able to:
- Identify all categories of information needed for migration planning
- Conduct stakeholder interviews effectively
- Document dependencies, SLAs, and ownership
- Create a complete migration inventory

## Discovery Categories

A complete discovery spans five categories. Missing any category leads to surprises during migration.

<br />
<div class="mermaid">
flowchart LR
    subgraph DISCOVERY["Migration Discovery"]
        direction TB
        A["Data<br/>Assets"] 
        B["Pipelines<br/>& ETL"]
        C["Consumers<br/>& Users"]
        D["Security<br/>& Access"]
        E["Operations<br/>& SLAs"]
    end
    style DISCOVERY fill:#fff,stroke:#FF3621,stroke-width:2px
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>

| Category | What to Discover | Why It Matters |
|----------|------------------|----------------|
| **Data Assets** | Tables, views, schemas, sizes, formats | Scope the data migration effort |
| **Pipelines & ETL** | Jobs, procedures, UDFs, orchestration | Plan code conversion and testing |
| **Consumers & Users** | BI tools, apps, user types, connections | Coordinate downstream cutover |
| **Security & Access** | Roles, policies, compliance requirements | Replicate access controls |
| **Operations & SLAs** | Schedules, SLAs, monitoring, runbooks | Maintain service levels |


## Data Assets

Understanding the scope and characteristics of your data is essential to estimate migration effort and choose the right methods.

### What to Capture

| Object Type | Key Questions |
|-------------|---------------|
| **Databases & Schemas** | How many? Naming conventions? Logical organization? |
| **Tables** | Count, total size, row counts, largest tables? Locations? Formats? |
| **Views** | Count, complexity, nested dependencies? |
| **External Tables** | Locations, formats, external data sources? |

### Data Characteristics

For significant tables, understand:

| Characteristic | Question | Migration Impact |
|----------------|----------|------------------|
| **Size** | How large (GB/TB)? | Migration duration, method selection |
| **Growth Rate** | Daily/monthly growth? | Ongoing sync strategy |
| **Update Pattern** | Append-only, updates, deletes? | CDC vs full refresh decision |
| **Partitioning** | Current partitioning strategy? | Target table design |
| **Data Quality and Validation** | Known data quality issues? What validation is in place? | Target pipeline design |
| **Retention** | How long is data retained? | Storage and archive planning |
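
The characteristics above can be recorded in a structured form so that totals and the largest tables are easy to pull out. The following is a minimal sketch, not a prescribed format: the `TableProfile` fields, the `summarize` helper, and all sample values are hypothetical and chosen only to mirror the table columns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableProfile:
    """One row of the data inventory (field names are illustrative)."""
    name: str
    size_gb: float
    daily_growth_gb: float
    update_pattern: str  # "append-only", "updates", or "deletes"
    partitioned_by: Optional[str] = None


def summarize(tables, top_n=3):
    """Return table count, total size, and the largest tables by size."""
    largest = sorted(tables, key=lambda t: t.size_gb, reverse=True)[:top_n]
    return {
        "table_count": len(tables),
        "total_size_gb": sum(t.size_gb for t in tables),
        "largest": [t.name for t in largest],
        # Tables that see updates or deletes cannot be synced by copying
        # new rows alone, so they need a CDC vs full refresh decision.
        "needs_cdc_or_full_refresh": [
            t.name for t in tables if t.update_pattern != "append-only"
        ],
    }


# Made-up sample values, for illustration only.
inventory = [
    TableProfile("sales.orders", 1200.0, 4.5, "updates", "order_date"),
    TableProfile("sales.events", 8500.0, 60.0, "append-only", "event_date"),
    TableProfile("ref.countries", 0.5, 0.0, "updates"),
]
summary = summarize(inventory, top_n=2)
print(summary)
```

In practice the inventory rows would come from the source platform's system catalog rather than being typed by hand.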

## Pipelines and ETL

Transformation logic and orchestration typically account for the bulk of migration complexity, so inventory everything to identify hotspots early.

### What to Capture

| Component | Key Questions | Migration Impact |
|-----------|---------------|------------------|
| **Scheduled Jobs** | How many? Frequency? Dependencies? | Convert to Lakeflow Jobs |
| **Stored Procedures** | Count? Complexity? Business logic? | Rewrite to notebooks/SQL |
| **UDFs** | Count? Languages used? | Rewrite if non-SQL |
| **Streaming/CDC** | Real-time components? | Convert to Structured Streaming |
| **External Orchestration** | Airflow, dbt, other tools? | Integration changes |
| **Observability, SRE, Ops** | What tools and processes are in place? | Monitoring or process changes |

### Complexity Indicators

| Low Complexity | High Complexity |
|----------------|-----------------|
| Simple SQL transforms | Complex stored procedures |
| Single table outputs | Multi-table transactions |
| No dependencies | Many upstream/downstream dependencies |
| Standard SQL functions | Custom UDFs, proprietary features |
| Batch processing | Real-time/streaming requirements |
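
One simple way to turn the indicators above into a first-pass hotspot list is to count how many high-complexity indicators each pipeline shows. The sketch below assumes an unweighted count; the indicator keys, the function names, and the sample pipelines are hypothetical, and a real assessment would likely weight indicators differently.

```python
# High-complexity indicators from the table above, as illustrative keys.
HIGH_COMPLEXITY_INDICATORS = {
    "complex_stored_procedures",
    "multi_table_transactions",
    "many_dependencies",
    "custom_udfs_or_proprietary_features",
    "streaming_requirements",
}


def complexity_flags(pipeline_traits):
    """Return the high-complexity indicators present, sorted by name."""
    return sorted(set(pipeline_traits) & HIGH_COMPLEXITY_INDICATORS)


def rank_hotspots(pipelines):
    """Order pipelines by indicator count (highest first), then by name."""
    counts = {
        name: len(complexity_flags(traits))
        for name, traits in pipelines.items()
    }
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical pipelines, for illustration only.
pipelines = {
    "nightly_sales_load": {"complex_stored_procedures", "many_dependencies"},
    "country_lookup_refresh": set(),
    "clickstream_ingest": {
        "streaming_requirements",
        "custom_udfs_or_proprietary_features",
        "many_dependencies",
    },
}
hotspots = rank_hotspots(pipelines)
print(hotspots)
```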

## Consumers and Users

Everyone and everything that depends on the data needs to be identified to coordinate cutover timing and minimize disruption.

### Downstream Systems

| Consumer Type | Key Questions |
|---------------|---------------|
| **BI Tools** | Which tools (Tableau, Power BI, Looker)? Connection method? |
| **Applications** | Which apps query directly? APIs? Connection strings? |
| **Data Science** | Notebooks? ML models? Feature pipelines? |
| **Data Sharing** | External recipients? Contractual obligations? |
| **Exports/Feeds** | Scheduled extracts? Formats? Destinations? |
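
When coordinating cutover, it helps to look up consumers by table rather than tables by consumer. The sketch below inverts a consumer-to-tables mapping into a table-to-consumers index. The `consumers_by_table` helper and every consumer and table name are hypothetical examples.

```python
from collections import defaultdict


def consumers_by_table(reads):
    """Invert {consumer: [tables]} into {table: [consumers]}, both sorted."""
    index = defaultdict(set)
    for consumer, tables in reads.items():
        for table in tables:
            index[table].add(consumer)
    return {table: sorted(names) for table, names in sorted(index.items())}


# Hypothetical downstream systems and the tables they read.
reads = {
    "Tableau: Sales dashboard": ["sales.orders", "ref.countries"],
    "Pricing API": ["sales.orders"],
    "Weekly finance extract": ["finance.ledger"],
}
consumer_map = consumers_by_table(reads)

# Everyone who must be notified before sales.orders is cut over.
print(consumer_map["sales.orders"])
```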

### User Types

| User Type | Discovery Focus |
|-----------|-----------------|
| **Data Engineers** | Pipeline ownership, development patterns |
| **Data Analysts** | SQL usage, BI tools, query patterns |
| **Data Scientists** | Notebook usage, ML workflows |
| **Business Users** | Dashboard access, report dependencies |
| **Service Accounts** | Automation, API access, credentials |

## Security and Access

Current access controls and compliance requirements must be documented accurately to replicate them in the target environment.

### Access Control

| Component | Key Questions |
|-----------|---------------|
| **Roles** | Role hierarchy? Naming conventions? |
| **Users & Groups** | How are users organized? IdP integration? |
| **Object Privileges** | Permission patterns? Who owns what? |
| **Row-Level Security** | Which tables? Policy logic? |
| **Column Masking** | Sensitive columns? Masking rules? |
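
Permission patterns are easier to review when object privileges are grouped by role. The following is a small sketch under the assumption that grants have already been exported as (role, object, privilege) records; the `Grant` type, the `privileges_by_role` helper, and the sample grants are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Grant(NamedTuple):
    """A single object privilege (illustrative shape, not a platform API)."""
    role: str
    object_name: str
    privilege: str


def privileges_by_role(grants):
    """Group grants into {role: sorted [(object_name, privilege), ...]}."""
    grouped = defaultdict(set)
    for grant in grants:
        grouped[grant.role].add((grant.object_name, grant.privilege))
    return {role: sorted(items) for role, items in sorted(grouped.items())}


# Hypothetical grants, for illustration only.
grants = [
    Grant("analyst", "sales.orders", "SELECT"),
    Grant("etl", "sales.orders", "INSERT"),
    Grant("etl", "sales.orders", "SELECT"),
    Grant("analyst", "ref.countries", "SELECT"),
]
by_role = privileges_by_role(grants)
print(by_role)
```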

### Compliance Requirements

| Requirement | Questions to Answer |
|-------------|---------------------|
| **Data Residency** | Geographic restrictions on data location? |
| **Encryption** | At-rest and in-transit requirements? |
| **Audit Logging** | What must be logged? Retention period? |
| **Regulatory** | GDPR, HIPAA, SOC 2, PCI DSS, industry-specific? |

## Operations and SLAs

Capturing performance baselines and service expectations upfront gives you clear success criteria for validation.

### Service Level Agreements

| SLA Type | Key Questions |
|----------|---------------|
| **Data Freshness** | When must data be available? (e.g., daily ETL by 6 AM) |
| **Query Performance** | Response time expectations? (e.g., dashboards < 10 sec) |
| **Availability** | Uptime requirements? (e.g., 99.9%) |
| **Latency** | Real-time requirements? (e.g., < 5 min lag) |
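
Recording each SLA next to its measured baseline makes it clear which targets the current platform already misses. The sketch below models only upper-bound SLAs such as response time and lag; availability is a lower bound and would need the comparison reversed. The `Sla` class, the `unmet` helper, and the baseline numbers are hypothetical, while the targets reuse the examples from the table above.

```python
from dataclasses import dataclass


@dataclass
class Sla:
    """An upper-bound SLA with its measured baseline (illustrative shape)."""
    name: str
    target: float    # value the measurement must stay below, in `unit`
    unit: str
    baseline: float  # measured on the current platform, in `unit`

    def is_met(self):
        return self.baseline < self.target


def unmet(slas):
    """Return the names of SLAs whose baseline does not meet the target."""
    return [sla.name for sla in slas if not sla.is_met()]


# Targets follow the examples above; baselines are made up.
slas = [
    Sla("dashboard_response", target=10.0, unit="sec", baseline=7.2),
    Sla("replication_lag", target=5.0, unit="min", baseline=6.5),
]
print(unmet(slas))
```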

### Operational Readiness

| Area | What to Document |
|------|------------------|
| **Monitoring** | Current tools, dashboards, health checks |
| **Alerting** | Alert rules, notification channels, escalation |
| **Incident Response** | Runbooks, on-call rotation, contacts |
| **Recovery Procedures** | Backup/restore, failure handling |

## Stakeholder Engagement

No single person has visibility into all systems, dependencies, and requirements, so discovery is a team effort.

### Who to Interview

| Role | Focus Areas |
|------|-------------|
| **Platform Admin** | Technical inventory, infrastructure, access control |
| **Data Engineers** | Pipeline details, complexity, pain points |
| **Data Analysts** | Query patterns, BI tools, performance issues |
| **Application Owners** | Integration points, SLAs, dependencies |
| **Security/Compliance** | Governance requirements, audit needs |
| **Finance** | Current costs, contract terms, budget |

### Key Questions by Role

**Data Engineers**: What are the most complex pipelines? Biggest pain points? What would you redesign?

**Data Analysts**: What queries are slowest? What capabilities are missing? What data do you use most?

**Application Owners**: How does your app connect? What happens if data is unavailable? What is your maintenance window?

## Summary

### Discovery Deliverables

| Deliverable | Contents |
|-------------|----------|
| **Data Inventory** | All databases, schemas, tables with sizes and characteristics |
| **Pipeline Catalog** | All jobs, procedures, UDFs with complexity assessment |
| **Consumer Map** | All downstream systems, tools, and user types |
| **Security Baseline** | Roles, policies, compliance requirements |
| **Dependency Graph** | Upstream and downstream relationships |
| **SLA Documentation** | Performance requirements and current baselines |
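
The dependency graph deliverable can be kept as a plain mapping from each object to its upstream dependencies. As a rough sketch, Python's standard-library `graphlib` can then produce an order in which every object appears after the objects it reads from. The object names below are hypothetical.

```python
from graphlib import TopologicalSorter

# Maps each object to the objects it reads from (its upstream dependencies).
# All names are made up for illustration.
depends_on = {
    "raw.orders": set(),
    "raw.customers": set(),
    "staging.orders_clean": {"raw.orders"},
    "marts.customer_revenue": {"staging.orders_clean", "raw.customers"},
}

# static_order() yields upstream objects before the objects that use them.
# It raises graphlib.CycleError if the recorded dependencies form a cycle.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

The relative order of objects that do not depend on each other is not fixed, so treat the result as one valid ordering rather than the only one.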

### Next Steps

- Use profiling tools to validate and quantify discovery findings
- Proceed to [**1.4 Profiling and Complexity Scoring**]($./1.4 - Profiling and Complexity Scoring) for automated assessment

<div style="color: #FF3621; font-weight: bold; font-size: 2em; margin-bottom: 12px;">COURSE DEVELOPER (remove before publishing)</div>

### Template Customization

**Placeholders to replace:**
- `{SOURCE_PLATFORM}` - Source platform name
- `{SOURCE_PLATFORM_ICON}` - Icon URL

**Platform-specific additions:**
- Add platform-specific system catalog queries for discovery
- Include platform-specific object types (e.g., Snowflake Streams/Tasks, Redshift Spectrum tables)
- Reference platform-specific profiling tools
- Add common platform-specific pain points to look for

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>
