# Module 01 - Introduction to Databricks and Workspace

## Overview

Welcome to Databricks! This module introduces you to the Databricks platform, its workspace, and core concepts. Since you already know SQL, Python, Pandas, and PySpark, we'll focus on how Databricks enhances your existing skills.

## Learning Objectives

By the end of this module, you will understand:
- What Databricks is and why it matters for data engineering
- Databricks workspace navigation and structure
- Creating and working with Databricks notebooks
- Understanding clusters and compute resources
- Basic operations in Databricks environment
- Differences between Databricks and local PySpark


## What is Databricks?

**Databricks** is a unified analytics platform built on Apache Spark that simplifies big data processing and machine learning. It is cloud-native and runs on AWS, Azure, and GCP.

### Key Features

1. **Unified Analytics Platform**: Combines data engineering, data science, and business analytics in one place
2. **Apache Spark Optimization**: Optimized Spark runtime; the often-quoted 2-10x speedup over open-source Spark depends heavily on the workload
3. **Collaborative Workspace**: Share notebooks, dashboards, and insights with your team
4. **Managed Infrastructure**: No need to manage clusters, Spark versions, or dependencies manually
5. **Delta Lake Integration**: Built-in support for ACID transactions and time travel on data lakes
6. **Multi-language Support**: Python, SQL, Scala, and R in the same notebook

### Why Databricks for Data Engineers?

- **Faster Development**: Pre-configured Spark clusters, no setup time
- **Better Performance**: Optimized Spark engine (Databricks Runtime)
- **Production Ready**: Built-in job scheduling, monitoring, and alerting
- **Cloud Integration**: Native integration with AWS, Azure, and GCP storage
- **Collaboration**: Teams can work together on notebooks and share results
- **Cost Effective**: Auto-scaling clusters, pay only for what you use


## Databricks Architecture Overview

### Components

1. **Control Plane**: Databricks-managed services (workspace, notebooks, jobs)
2. **Data Plane**: Your cloud account (VPC, storage, compute)
3. **Workspace**: Your collaborative environment (notebooks, dashboards, libraries)
4. **Clusters**: Compute resources that run your Spark jobs
5. **Jobs**: Scheduled or on-demand execution of notebooks or scripts

## Layers: Control Plane vs Data Plane

- **Control plane (managed by Databricks)**: UI/API, notebooks, job scheduler, cluster manager, SQL warehouses (formerly SQL endpoints), Unity Catalog metastore, audit logs, tokens/SCIM/IAM integration, secrets management.

- **Data plane (your cloud account)**: compute instances (driver + workers), DBFS root (object storage mount), customer-owned data in object storage, VPC/VNet networking, private endpoints. Compute pulls config from the control plane but data never needs to transit the control plane.

- **Implication**: security teams care about keeping PII in the data plane while still leveraging SaaS management; networking rules (private endpoints / VPC endpoints) control traffic.

## Multi-Cloud Nuances (AWS, Azure, GCP)
- **AWS**: data plane in your VPC; S3 for storage; IAM roles for passthrough; PrivateLink for control-plane + S3; Kinesis/Kafka integrations.

- **Azure**: data plane in your VNet; ADLS Gen2; Managed Identities for passthrough; Private Link for control plane/data plane + storage firewall; AAD (Microsoft Entra ID) tokens.

- **GCP**: data plane in your VPC; GCS; Service Accounts with short-lived tokens; Private Service Connect for control-plane; BigQuery connector optional.

- **Serverless SQL**: lives in Databricks-owned account; networking and patching handled by Databricks; great for BI with minimal ops.

## Execution Flow (Notebook/Job → Data)

1. User submits a notebook cell / job task via UI/API.

2. Control plane authenticates, fetches metadata (cluster config, UC policies).

3. Driver starts in data plane with config; workers attach; libraries pulled from DBFS/whl/pip mirror.

4. Spark plan built (Catalyst) → optimized (adaptive query execution, AQE; dynamic partition pruning, DPP) → executed; I/O hits object storage via the data-plane network.

5. Results return to driver → UI/warehouse; lineage and audit persisted in control plane; logs/metrics emitted to storage/monitoring.

## Databricks vs Local PySpark

| Feature | Local PySpark | Databricks |
|---------|---------------|------------|
| Setup | Manual installation | Pre-configured |
| Cluster Management | Manual | Automatic |
| Performance | Standard Spark | Optimized runtime (gains vary by workload) |
| Collaboration | Limited | Built-in |
| Storage | Local/Manual config | Integrated with cloud storage |
| Monitoring | Basic | Advanced dashboards |
| Cost | Fixed hardware | Pay-per-use |


## Understanding the Databricks Workspace

The workspace is your main interface in Databricks. It's organized into:

### Workspace Structure

1. **Workspace**: Your personal or shared folder structure
   - `/Users/your_email@domain.com/` - Your personal workspace
   - `/Shared/` - Shared with all users
   - `/Repos/` - Git repositories

2. **Notebooks**: Interactive documents with code and markdown
   - Support multiple languages (Python, SQL, Scala, R)
   - Can mix languages in the same notebook
   - Support for widgets and parameters

3. **Clusters**: Compute resources
   - Single Node: For development (no distributed computing)
   - Standard: Multi-node clusters for production
   - High Concurrency: Shared clusters for multiple concurrent users, including SQL workloads

4. **Jobs**: Scheduled or triggered tasks
   - Run notebooks or JAR files
   - Can be scheduled with cron expressions
   - Support for job dependencies

5. **SQL Warehouses**: For SQL analytics
   - Compute for SQL queries, available in classic, pro, and serverless types
   - Auto-scaling and auto-termination

6. **Libraries**: Python, JAR, or other dependencies
   - Can be installed at cluster or notebook level
   - Support for PyPI, Maven, CRAN, etc.
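
The job features listed above (cron scheduling and task dependencies) can be sketched as a job definition. This is a minimal sketch of a settings dictionary in the shape used by the Jobs API 2.1; the helper `build_job_settings`, the job name, and the notebook paths are hypothetical placeholders. Databricks schedules use Quartz cron syntax, which starts with a seconds field.

```python
import json


def build_job_settings(name, ingest_path, transform_path, cron, timezone="UTC"):
    """Return a job definition with two notebook tasks; the second waits for the first."""
    return {
        "name": name,
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
        },
        "tasks": [
            {
                "task_key": "ingest",
                "notebook_task": {"notebook_path": ingest_path},
            },
            {
                "task_key": "transform",
                "depends_on": [{"task_key": "ingest"}],
                "notebook_task": {"notebook_path": transform_path},
            },
        ],
    }


# Placeholder values (assumptions, not taken from a real workspace)
settings = build_job_settings(
    name="daily-sales-refresh",
    ingest_path="/Shared/demo/ingest",
    transform_path="/Shared/demo/transform",
    cron="0 0 6 * * ?",  # Quartz fields: sec min hour day-of-month month day-of-week
)
print(json.dumps(settings, indent=2))
```

Each task also needs compute (for example an existing cluster or a new job cluster), which is omitted here for brevity.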


## Cluster and Warehouse Architecture

- **Driver**: runs SparkContext, notebook commands, job orchestration; talks to control plane for auth/config.

- **Workers**: executors doing distributed compute; autoscaling adds/removes workers within min/max; spot/ondemand mix to save cost.

- **SQL Warehouse**: managed compute tuned for BI; the serverless variant runs in a Databricks-owned account (no VPC needed on your side), while pro/classic warehouses run in your VPC.

- **Photon**: vectorized engine for SQL/DataFrame; accelerates I/O + compute without code change.

- **Adaptive execution**: AQE, dynamic partition pruning, broadcast hints—all apply to SQL and PySpark.
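
The driver/worker, autoscaling, spot, and Photon settings described above come together in a cluster specification. The following is a minimal sketch in the shape of a Clusters API request body, assuming an AWS workspace; the helper `build_cluster_spec`, the runtime version string, and the instance type are illustrative placeholders.

```python
def build_cluster_spec(name, min_workers, max_workers, use_photon=True, first_on_demand=1):
    """Build an autoscaling cluster spec; workers float between min_workers and max_workers."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": name,
        "spark_version": "15.4.x-scala2.12",  # placeholder runtime version
        "node_type_id": "i3.xlarge",          # placeholder AWS instance type
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": 30,        # stop the cluster after 30 idle minutes
        "runtime_engine": "PHOTON" if use_photon else "STANDARD",
        "aws_attributes": {
            # The first N nodes (driver first) use on-demand instances, the rest spot
            "first_on_demand": first_on_demand,
            "availability": "SPOT_WITH_FALLBACK",
        },
    }


spec = build_cluster_spec("demo-autoscaling", min_workers=2, max_workers=8)
print(spec["autoscale"], spec["runtime_engine"])
```

Keeping the driver on-demand while workers run on spot is a common way to save cost without risking the loss of the node that coordinates the job.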

## Creating Your First Notebook

In Databricks, you can create notebooks through the UI or programmatically. Let's explore the notebook environment.


In [0]:
# Check the Python version and interpreter used by this notebook
import sys
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")


In [0]:
import os

from pyspark.sql import SparkSession

# In a Databricks notebook `spark` already exists; getOrCreate() returns that same session
spark = SparkSession.builder.getOrCreate()

# 1. Spark version
print(f"Spark version: {spark.version}")

# 2. Databricks Runtime version
# Read from the environment variable set on the driver and workers; "N/A" means it is not set
dbr_version = os.getenv("DATABRICKS_RUNTIME_VERSION", "N/A")
print(f"Databricks Runtime: {dbr_version}")

# 3. App ID (left disabled; uncomment to read it with a default value)
# app_id = spark.conf.get("spark.app.id", "N/A")
# print(f"App ID: {app_id}")

In [0]:
%sql
SELECT current_version(), version();

In [0]:
# Check available Spark configurations (classic compute only: spark.sparkContext is not available on serverless)
print("Key Spark Configurations:")
configs = spark.sparkContext.getConf().getAll()
for key, value in sorted(configs)[:20]:  # Show first 20
    print(f"{key}: {value}")


The cell above fails on serverless compute. Serverless compute (for notebooks, jobs, and SQL warehouses) uses the Spark Connect architecture, a decoupled mode in which:

- Your client (the notebook or warehouse frontend) talks to a remote Spark server via gRPC.
- There is no local Spark driver JVM running in your session.
- The traditional SparkContext (accessed via `spark.sparkContext`), which lives on the driver in classic Spark, is not exposed to your session.

As a result, any code that touches `spark.sparkContext` or related low-level driver/JVM attributes (`getConf().getAll()`, `applicationId`, `hadoopConfiguration`, `parallelize`, `broadcast`, the RDD APIs, context-based checkpointing, and so on) fails with a `[JVM_ATTRIBUTE_NOT_SUPPORTED]` error, or a similar one such as `[CONFIG_NOT_AVAILABLE]`.
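
One way to write notebook code that tolerates both classic and serverless compute is to guard these accesses. The helpers below (`has_spark_context`, `safe_conf_get`) are hypothetical names for a minimal sketch, and the stand-in session class exists only so the example runs outside Databricks; in a notebook you would pass the real `spark` object instead.

```python
def has_spark_context(spark):
    """Return True if the session exposes a SparkContext, False if access is rejected."""
    try:
        return spark.sparkContext is not None
    except Exception:
        return False


def safe_conf_get(spark, key, default="N/A"):
    """Read a Spark config value, returning `default` if it is unset or unavailable."""
    try:
        value = spark.conf.get(key)
    except Exception:
        return default
    return default if value is None else value


class ConnectLikeSession:
    """Stand-in (assumption) that mimics a session rejecting low-level access."""

    class conf:
        @staticmethod
        def get(key):
            raise RuntimeError(f"[CONFIG_NOT_AVAILABLE] {key}")

    @property
    def sparkContext(self):
        raise RuntimeError("[JVM_ATTRIBUTE_NOT_SUPPORTED]")


demo_session = ConnectLikeSession()
print(has_spark_context(demo_session))              # False
print(safe_conf_get(demo_session, "spark.app.id"))  # N/A
```

Catching a broad `Exception` is deliberate here, since the concrete exception class differs between classic PySpark and Spark Connect.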

## Understanding Clusters

Clusters are the compute resources in Databricks. They can be:

### Cluster Types

1. **All-Purpose Clusters**: For interactive development
   - Multiple users can attach
   - Manual start/stop
   - Best for notebooks and ad-hoc analysis

2. **Job Clusters**: For automated jobs
   - Created automatically for jobs
   - Terminated after job completion
   - Cost-effective for scheduled tasks

3. **SQL Warehouses**: For SQL analytics
   - Classic, pro, or serverless compute
   - Auto-scaling
   - Optimized for SQL queries

### Cluster Components

- **Driver Node**: Coordinates the cluster and runs your code
- **Worker Nodes**: Execute tasks in parallel
- **Databricks Runtime**: Optimized Spark distribution with pre-installed libraries


In [0]:
# Get cluster information (classic compute only: spark.sparkContext is not available on serverless)
print("Cluster Information:")
print(f"Number of cores (default parallelism): {spark.sparkContext.defaultParallelism}")
print(f"Spark Master: {spark.sparkContext.master}")

# Check if running on Databricks
try:
    context = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
    print("\nRunning on Databricks: Yes")
    print(f"Notebook path: {context.notebookPath().get()}")
except Exception:  # dbutils is undefined (NameError) or the notebook context is unavailable
    print("\nRunning on Databricks: No (or dbutils not available)")
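
A lighter-weight check for "am I on Databricks?" relies on the `DATABRICKS_RUNTIME_VERSION` environment variable used earlier in this module, rather than on `dbutils` internals. This is a minimal sketch with hypothetical helper names; the exact format of the value on each compute type is an assumption, so the parser returns `None` for anything that is not in `major.minor` form.

```python
import os
import re


def databricks_runtime(env=None):
    """Return the DATABRICKS_RUNTIME_VERSION value, or None when it is not set."""
    env = os.environ if env is None else env
    return env.get("DATABRICKS_RUNTIME_VERSION")


def runtime_major_minor(version):
    """Parse (major, minor) from a string such as '15.4'; return None for other formats."""
    match = re.match(r"^(\d+)\.(\d+)", version or "")
    return (int(match.group(1)), int(match.group(2))) if match else None


version = databricks_runtime()
print(f"Running on Databricks: {version is not None}")
print(f"Runtime (major, minor): {runtime_major_minor(version)}")
```

Passing a dictionary as `env` makes the helper easy to unit test without touching the real environment.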


## Databricks Utilities (dbutils)

`dbutils` is a Databricks-specific utility that provides helper functions for:

- **File system operations**: Working with DBFS (Databricks File System)
- **Secrets management**: Accessing secure credentials
- **Notebook workflows**: Running other notebooks
- **Widgets**: Creating interactive parameters
- **Data**: Mounting external storage

### Key dbutils Commands

```python
# File system
dbutils.fs.ls("/")
dbutils.fs.mkdirs("/tmp/demo")
dbutils.fs.cp("source", "destination")

# Secrets
dbutils.secrets.get(scope="my-scope", key="my-key")

# Notebooks
dbutils.notebook.run("path/to/notebook", timeout_seconds=60)

# Widgets
dbutils.widgets.text("input", "default_value")
dbutils.widgets.get("input")
```
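
Widgets are often read through a small wrapper so that the same code also runs where no widget has been defined. The sketch below assumes that `dbutils.widgets.get` raises when the widget does not exist; `get_param` and the stand-in `fake_dbutils` object are hypothetical and only there to make the example runnable outside Databricks.

```python
from types import SimpleNamespace


def get_param(name, default, dbutils=None):
    """Read a notebook widget value, falling back to `default` when it is unavailable."""
    if dbutils is None:
        return default
    try:
        return dbutils.widgets.get(name)
    except Exception:
        return default


# Stand-in for dbutils (assumption): one widget named "run_date" is defined
widget_values = {"run_date": "2024-02-01"}
fake_dbutils = SimpleNamespace(
    widgets=SimpleNamespace(get=lambda name: widget_values[name])
)

print(get_param("run_date", "2024-01-01", fake_dbutils))  # 2024-02-01
print(get_param("region", "North", fake_dbutils))         # North
print(get_param("run_date", "2024-01-01"))                # 2024-01-01
```

In a notebook, call `get_param("run_date", "2024-01-01", dbutils)` with the real `dbutils` object.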


In [0]:
# Explore dbutils file system
print("DBFS Root Directory:")
files = dbutils.fs.ls("/")
for file in files:
    print(f"  {file.name:30s} - {file.size} bytes")


In [0]:
# Check Databricks File System (DBFS) structure
print("\nDBFS Structure:")
print("\n/dbfs - Databricks File System root")
print("/FileStore - Uploaded files and generated outputs")
print("/databricks - System files")
print("/tmp - Temporary files")
print("/mnt - Mount points for external storage")
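
The same DBFS file can be addressed in two ways: Spark and `dbutils.fs` use `dbfs:/...` (or scheme-less) paths, while local file APIs such as Python's `open()` or `%sh` see it under the `/dbfs/` mount on classic clusters. The helpers below are a hypothetical sketch of converting between the two forms; they assume plain DBFS paths and are not meant for `/Volumes/...` paths.

```python
def to_local_path(path):
    """Convert a 'dbfs:/x' or '/x' DBFS path to the '/dbfs/x' form used by local file APIs."""
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    if path.startswith("/dbfs/"):
        return path
    return "/dbfs/" + path.lstrip("/")


def to_dbfs_uri(path):
    """Convert a '/dbfs/x' or '/x' path to the 'dbfs:/x' form used by Spark and dbutils.fs."""
    if path.startswith("dbfs:/"):
        return path
    if path.startswith("/dbfs/"):
        return "dbfs:/" + path[len("/dbfs/"):]
    return "dbfs:/" + path.lstrip("/")


print(to_local_path("dbfs:/FileStore/tables/file.csv"))  # /dbfs/FileStore/tables/file.csv
print(to_dbfs_uri("/dbfs/tmp/demo"))                     # dbfs:/tmp/demo
```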



## Working with Multiple Languages

One of Databricks' powerful features is the ability to mix languages in a single notebook. You can switch between Python, SQL, Scala, and R seamlessly.


In [0]:
# Python cell - Create a sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create sample data
data = [("Alice", 25, "Engineering"),
        ("Bob", 30, "Sales"),
        ("Charlie", 35, "Engineering"),
        ("Diana", 28, "Marketing")]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("department", StringType(), True)
])

df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView("employees")

print("DataFrame created and registered as temporary view 'employees'")
df.show()

print("In Databricks, df.display() renders an interactive table, while df.show() prints plain text as in open-source PySpark")
df.display()


Now let's query the same data using SQL:


In [0]:
%sql
-- SQL cell - Query the temporary view
SELECT 
    department,
    COUNT(*) as employee_count,
    AVG(age) as avg_age
FROM employees
GROUP BY department
ORDER BY employee_count DESC


## Notebook Magic Commands

Databricks supports magic commands for notebook operations:

- `%python` - Run Python code
- `%sql` - Run SQL code
- `%scala` - Run Scala code
- `%r` - Run R code
- `%sh` - Run shell commands
- `%md` - Markdown cell
- `%run` - Run another notebook
- `%fs` - File system operations (shortcut for dbutils.fs)
- `%pip` - Install Python packages

### Example: Using Magic Commands


In [0]:
%fs ls /


The cell above uses the `%fs` magic command as a shortcut for `dbutils.fs.ls("/")`.

**Note**: The `%fs` magic command (and almost all `dbutils.fs` commands) is disabled when your notebook is attached to a serverless SQL warehouse. In that case, use SQL to list files:

In [0]:
%sql
-- List files in a volume
LIST '/Volumes/retail_catalog/v01/retail-pipeline';

-- Or, for more detail
SELECT path, modificationTime, length 
FROM read_files(
  '/Volumes/retail_catalog/v01/retail-pipeline',
  format => 'binaryFile'
);

In [0]:
%sh
# Using %sh for shell commands (Databricks magic commands take a single %)
echo "Hello from shell" && date


## Key Differences: Databricks vs Local PySpark

### 1. SparkSession Initialization

**Local PySpark**:
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.some.config", "value") \
    .getOrCreate()
```

**Databricks**:
```python
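# A SparkSession is already created, optimized, and exposed as `spark`
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # returns the existing session
# Or simply use the pre-configured 'spark' object directly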
```

### 2. File System Access

**Local PySpark**:
```python
df = spark.read.csv("file:///path/to/file.csv")
```

**Databricks**:
```python
# Use DBFS paths
df = spark.read.csv("/FileStore/tables/file.csv")
# Or mount external storage
df = spark.read.csv("/mnt/adls/data/file.csv")
```

### 3. Library Management

**Local PySpark**:
```python
# Install via pip, manage manually
!pip install pandas
```

**Databricks**:
```python
# Install at cluster or notebook level
%pip install pandas
# Or use cluster libraries (persistent across restarts)
```


In [0]:
# Demo: Create a simple DataFrame and compare with Pandas
import pandas as pd
from pyspark.sql import SparkSession

# Create Pandas DataFrame
pandas_df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'score': [85, 90, 88, 92, 87]
})

print("Pandas DataFrame:")
print(pandas_df)
print(f"\nPandas DataFrame size: {pandas_df.shape}")

# Convert to Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)

print("\nSpark DataFrame:")
spark_df.show()
print(f"\nSpark DataFrame count: {spark_df.count()}")


## Best Practices for Databricks Notebooks

1. **Use %md cells** for documentation and explanations
2. **Organize code logically** - separate data ingestion, transformation, and output
3. **Use widgets** for parameterization
4. **Leverage temporary views** to share data between Python and SQL cells
5. **Use display()** instead of show() for better visualization
6. **Clean up resources** - unpersist DataFrames when done
7. **Use cluster libraries** for frequently used packages
8. **Version control** - use Git integration for notebooks
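
The "clean up resources" practice can be made harder to forget with a context manager that pairs `cache()` with `unpersist()`. This is a minimal sketch: `cached` is a hypothetical helper, and `FakeDataFrame` is a stand-in used only so the example runs without Spark. Caching may not be supported on every compute type (serverless compute, for example, restricts it), so treat this as a classic-compute pattern.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache `df` for the duration of the with-block, then release it even on errors."""
    df.cache()
    try:
        yield df
    finally:
        df.unpersist()


class FakeDataFrame:
    """Stand-in for a Spark DataFrame (assumption) that records which methods were called."""

    def __init__(self):
        self.events = []

    def cache(self):
        self.events.append("cache")
        return self

    def count(self):
        self.events.append("count")
        return 0

    def unpersist(self):
        self.events.append("unpersist")
        return self


fake_df = FakeDataFrame()
with cached(fake_df) as working_df:
    working_df.count()  # reuse the cached data as often as needed inside the block

print(fake_df.events)  # ['cache', 'count', 'unpersist']
```

With a real DataFrame the usage is the same: `with cached(sales_df) as working_df: ...`.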


In [0]:
# Demo: Using display() for better visualization
from pyspark.sql.functions import avg, count

# Create sample sales data
sales_data = [
    ("2024-01-01", "Product A", 100.0, "North"),
    ("2024-01-01", "Product B", 150.0, "South"),
    ("2024-01-02", "Product A", 120.0, "North"),
    ("2024-01-02", "Product C", 200.0, "East"),
    ("2024-01-03", "Product B", 180.0, "South"),
]

sales_df = spark.createDataFrame(sales_data, ["date", "product", "amount", "region"])

# Use display() for interactive tables (better than show())
display(sales_df)

# Aggregated view
summary = sales_df.groupBy("region").agg(
    count("*").alias("transaction_count"),
    avg("amount").alias("avg_amount")
)

display(summary)


## Summary

In this module, you learned:

✅ **What Databricks is** - A unified analytics platform built on Apache Spark

✅ **Workspace structure** - How Databricks organizes notebooks, clusters, and jobs

✅ **Notebook basics** - Creating and working with multi-language notebooks

✅ **Clusters** - Understanding compute resources in Databricks

✅ **dbutils** - Databricks-specific utilities for file operations and more

✅ **Key differences** - How Databricks differs from local PySpark

✅ **Best practices** - Tips for effective notebook development

### Next Steps

In the next module, we'll explore:
- Data ingestion from various sources
- Working with DBFS and external storage
- Integrating with Azure Data Lake Storage Gen2
- Reading and writing different file formats


## Exercise

Try these exercises to practice:

1. Create a new notebook and explore the dbutils.fs commands
2. Create a Spark DataFrame from a Python list and convert it to a Pandas DataFrame
3. Use both Python and SQL cells to analyze the same dataset
4. Use the display() function to visualize a DataFrame with at least 100 rows
5. Check your cluster configuration and note the number of cores available
