# **PHASE 1: FOUNDATION (Days 1-4)**

## **DAY 1 (09/01/26) – _Platform Setup & First Steps_**



### __Section - 1 - Learn__

### **_1. Why Databricks vs Pandas/Hadoop?_**

* **Massive Scalability:** Unlike Pandas, which is limited by a single machine's RAM, Databricks uses **distributed computing** to process petabytes of data across clusters.
* **Superior Speed:** Databricks runs on Apache Spark, which keeps intermediate results **in-memory**. For some workloads this can be up to 100x faster than Hadoop's MapReduce, which writes intermediate results to disk between steps.
* **Zero Infrastructure Overhead:** As a fully **managed cloud service**, Databricks eliminates the "Hadoop tax"—the massive effort required to manually configure and maintain complex server clusters.
* **Unified Environment:** It bridges the gap between teams by allowing Data Engineers (SQL/Scala) and Data Scientists (Python/R) to collaborate within the same **shared notebooks**.
* **ACID Reliability:** Through **Delta Lake**, Databricks adds a layer of data integrity (ACID transactions) that prevents data corruption, a common issue in traditional Hadoop data lakes.
* **Cost-Efficient Auto-scaling:** Databricks adds or removes workers as the workload changes and can **terminate clusters automatically** after a period of inactivity, whereas fixed-size Hadoop clusters often sit idle while still incurring costs.
* **Production-Ready AI:** It includes built-in integration with **MLflow**, making it much easier to move from a small Python prototype to a tracked, deployed machine learning model than it is in a Hadoop environment.
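
To make the "distributed computing" point above concrete, here is a minimal sketch in plain Python. It is not how Spark is implemented: threads on one machine stand in for cluster workers, and the function names (`split_into_partitions`, `partial_total`, `distributed_total`) are made up for illustration. The idea it shows is the same, though: split the data into partitions, let each worker process its own partition, then combine the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into num_partitions lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_total(partition):
    """Work done independently on one partition (one simulated worker)."""
    return sum(price for _, price in partition)


def distributed_total(rows, num_partitions=3):
    """Sum prices by combining per-partition partial sums."""
    partitions = split_into_partitions(rows, num_partitions)
    # Threads stand in for cluster nodes here; this is an assumption
    # made only to keep the sketch runnable on a single machine.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_total, partitions))
    return sum(partials)


rows = [("iPhone", 999), ("Samsung", 799), ("MacBook", 1299), ("Pixel", 699)]
print(distributed_total(rows))
```

Pandas would hold all of `rows` in one machine's memory. A distributed engine only needs each partition to fit on the worker that processes it.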

---

### **_2. Lakehouse architecture basics_**

A **Data Lakehouse** is a modern architecture that merges the low-cost, flexible storage of a **Data Lake** with the high-performance management and transactions of a **Data Warehouse**.

Here are the 7 key pillars of Lakehouse architecture:

* **Unified Storage:** It acts as a single source of truth, storing **structured** (tables), **semi-structured** (JSON/XML), and **unstructured** (images/video) data in one place, eliminating data silos.
* **ACID Transactions:** Using layers like Delta Lake, it ensures that data operations (Inserts, Updates, Deletes) are reliable—if a job fails halfway, it won't leave your data in a corrupted, "half-written" state.
* **Decoupled Compute & Storage:** You can scale your storage (like AWS S3 or Azure Data Lake) and your processing power (Databricks clusters) **independently**, which is significantly more cost-effective than traditional databases.
* **Schema Enforcement:** It prevents "data swamp" issues by rejecting data that doesn't fit a predefined format, ensuring high data quality for downstream BI and analytics.
* **Support for Diverse Workloads:** One platform handles everything—**Data Engineering** (ETL), **BI/SQL reporting** (Dashboards), and **Machine Learning** (training models)—without needing to move data between different systems.
* **Open Data Formats:** Data is stored in open-source, non-proprietary formats like **Parquet**. This prevents vendor lock-in and allows other tools to read the data directly.
* **The Medallion Framework:** It typically organizes data into three logical layers to manage quality:
  * **Bronze** (Raw data)
  * **Silver** (Cleaned/Filtered)
  * **Gold** (Aggregated/Business-ready).
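
The Bronze → Silver → Gold flow and the schema enforcement idea above can be sketched in plain Python. This is a simplified illustration, not Delta Lake's actual behavior: the sample records and the helper names (`to_silver`, `to_gold`) are hypothetical, and the sketch filters out bad rows one by one, whereas a real table format may reject a whole write that does not match the schema.

```python
# Hypothetical raw records, as they might land in a Bronze layer.
bronze = [
    {"product": "iPhone", "price": "999", "category": "phone"},
    {"product": "Samsung", "price": "799", "category": "phone"},
    {"product": "MacBook", "price": "1299", "category": "laptop"},
    {"product": "Broken", "price": "n/a", "category": "phone"},  # bad price
    {"product": "NoCategory", "price": "10"},                    # missing field
]


def to_silver(records):
    """Keep records that fit the expected schema and cast price to int."""
    accepted, rejected = [], []
    for rec in records:
        try:
            clean = {
                "product": str(rec["product"]),
                "price": int(rec["price"]),
                "category": str(rec["category"]),
            }
        except (KeyError, ValueError):
            # Missing field or non-numeric price: does not fit the schema.
            rejected.append(rec)
        else:
            accepted.append(clean)
    return accepted, rejected


def to_gold(silver_records):
    """Aggregate cleaned records into total price per category."""
    totals = {}
    for rec in silver_records:
        totals[rec["category"]] = totals.get(rec["category"], 0) + rec["price"]
    return totals


silver, rejected = to_silver(bronze)
gold = to_gold(silver)
print(len(silver), "clean rows,", len(rejected), "rejected")
print(gold)
```

Each layer reads only from the one before it, so a problem in the Gold numbers can be traced back through Silver to the untouched raw data in Bronze.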

---

### **_3. Databricks workspace structure_**

A Databricks Workspace is the centralized environment where teams collaborate. Its architecture is split into two "planes" that separate the user interface and management services from the heavy processing, while the data itself stays in your cloud storage.

Here is the structure broken down into its 7 core components:

##### 1. The Dual-Plane Architecture

* **Control Plane:** Managed by Databricks in their cloud account. It contains the **Web UI**, **Notebook source code**, **Job scheduler**, and **Cluster management** services.
* **Compute Plane:** Typically resides in *your* cloud account (AWS/Azure/GCP). This is where your data is actually processed by clusters of virtual machines, ensuring your data never leaves your security boundary.

##### 2. Unity Catalog (The Governance Layer)

* The modern "brain" of the workspace that manages fine-grained permissions.
* It follows a **three-level namespace**: `Catalog` → `Schema (Database)` → `Table/Volume`, so a table is addressed as `catalog.schema.table`.
* This allows you to manage access to data across multiple workspaces from one central place.
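
As a small illustration of the `Catalog` → `Schema` → `Table` namespace described above, the sketch below builds a fully qualified table name in plain Python. The helper `qualified_name` and the names `main`, `sales`, and `orders` are hypothetical examples, not objects that necessarily exist in your workspace. Wrapping each part in backticks is one way to quote identifiers in Databricks SQL.

```python
def qualified_name(catalog, schema, table):
    """Return a backtick-quoted catalog.schema.table identifier."""
    parts = (catalog, schema, table)
    for part in parts:
        # Reject empty parts and embedded backticks to keep quoting simple.
        if not part or "`" in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return ".".join(f"`{part}`" for part in parts)


# Hypothetical catalog, schema, and table names.
name = qualified_name("main", "sales", "orders")
query = f"SELECT * FROM {name} LIMIT 10"
print(query)
```

In a notebook, a string like `query` could then be passed to `spark.sql(...)`, assuming the table exists and you have permission to read it.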

##### 3. Compute Resources (The Engine)

* **All-Purpose Clusters:** Used for interactive analysis and development in notebooks.
* **Job Clusters:** Transient clusters that spin up only for a specific automated task and shut down immediately after to save costs.
* **SQL Warehouses:** Specialized compute optimized for running SQL queries and powering BI dashboards.
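
A cluster is usually described by a small configuration document. The sketch below builds one as a Python dictionary. The field names (`cluster_name`, `spark_version`, `node_type_id`, `autoscale`, `autotermination_minutes`) follow the Databricks Clusters API as I understand it, so check them against the API reference before relying on them. The helper `build_cluster_spec` and all values are placeholders invented for this example.

```python
import json


def build_cluster_spec(name, min_workers, max_workers, idle_minutes):
    """Build a cluster description with autoscaling and auto-termination."""
    if not 0 < min_workers <= max_workers:
        raise ValueError("need 0 < min_workers <= max_workers")
    return {
        "cluster_name": name,
        # Placeholders: valid values depend on your cloud and workspace.
        "spark_version": "<runtime-version>",
        "node_type_id": "<node-type>",
        # The platform may add or remove workers within this range.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Terminate the cluster after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }


spec = build_cluster_spec("dev-notebooks", min_workers=1, max_workers=4, idle_minutes=30)
print(json.dumps(spec, indent=2))
```

For an all-purpose cluster, a short idle timeout is typically what keeps costs down. A job cluster shuts down when its task finishes regardless.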

##### 4. Workspace Assets (The "Filesystem")

* **Notebooks:** Collaborative documents containing runnable code (Python, SQL, Scala, R), visualizations, and narrative text.
* **Git Folders (Repos):** Integrated version control that lets you sync your Databricks notebooks directly with GitHub, GitLab, or Bitbucket.
* **Dashboards:** Real-time data visualizations built directly on top of your processed tables.

##### 5. Databricks SQL

* A dedicated workspace persona for SQL analysts.
* Includes a **Query Editor**, **Visualization tools**, and **Alerts** that notify you if a specific data metric (like "daily revenue") drops below a threshold.

##### 6. Workflows & Orchestration

* **Jobs:** The tool used to schedule notebooks or JARs to run at specific times or in response to events.
* **Delta Live Tables (DLT):** A framework for building reliable, maintainable, and testable data processing pipelines.

##### 7. Storage Foundation

* **DBFS (Databricks File System):** A layer over your cloud storage (S3/ADLS) that makes it look like a local file system.
* **Metastore:** The central repository for metadata (table definitions, partitions) so the compute knows how to read your raw files as structured tables.

---


### **_4. Industry use cases (Netflix, Shell, Comcast)_**

Large enterprises adopt Databricks for workloads that combine very large data volumes with near-real-time processing, which are hard to serve with single-machine or batch-only tools.

Here are the specific ways Netflix, Shell, and Comcast use the platform:

#### _Netflix: Content & Experience Personalization_

* **Recommendation Engine:** Processes billions of user interactions (viewing history, clicks, search patterns) to power the "Top Picks" and "Because You Watched" algorithms.
* **Artwork Personalization:** Uses machine learning to choose which movie thumbnail to show you based on your taste (e.g., showing the lead actor if you like their work).
* **Content Acquisition:** Analyzes global viewing trends to predict which new shows or movies are worth the multi-million dollar investment for production or licensing.
* **A/B Testing at Scale:** Rapidly tests different UI layouts and features across millions of users simultaneously to see which drives the highest retention.

#### *Shell: The Energy Transition & Reliability*

* **Predictive Maintenance:** Monitors millions of high-frequency sensors on wind turbines, oil rigs, and solar panels to predict equipment failure before it happens.
* **Carbon Reduction:** Analyzes data from across its global supply chain and refineries to optimize energy use and lower its overall carbon footprint.
* **Renewable Energy Forecasting:** Combines weather data (wind speed, solar irradiance) with power grid demand to optimize how and when renewable energy is sold.
* **Digital Twins:** Creates virtual replicas of physical assets (like an offshore rig) to simulate changes and optimize production in a safe, digital environment.

#### *Comcast: Voice Commands & Customer Insight*

* **Voice Remote Excellence:** Processes billions of voice commands from 20+ million remotes in real-time, allowing users to find content instantly using AI.
* **Infrastructure Optimization:** Replaced 640 legacy machines with just 64 Databricks nodes, reducing compute costs by **10x** while increasing speed.
* **Proactive Support:** Analyzes customer telemetry data to identify service issues or "frustration signals" before the customer even picks up the phone to complain.
* **Real-time Ad Targeting:** Uses massive viewership datasets to help advertisers deliver relevant commercials to 125 million households across traditional and streaming TV.

---

### **Practice**

```python
# Create simple DataFrame
data = [("iPhone", 999), ("Samsung", 799), ("MacBook", 1299)]
df = spark.createDataFrame(data, ["product", "price"])
df.show()

# Filter expensive products
df.filter(df.price > 1000).show()
```

Output:

```
+-------+-----+
|product|price|
+-------+-----+
| iPhone|  999|
|Samsung|  799|
|MacBook| 1299|
+-------+-----+

+-------+-----+
|product|price|
+-------+-----+
|MacBook| 1299|
+-------+-----+
```



---

### **Resources**

- [Databricks Trial](https://www.databricks.com/try-databricks)
- [Databricks Quickstart](https://docs.databricks.com/en/introduction/)

---