# **PHASE 3: ADVANCED ANALYTICS (Days 9-11)**

## **DAY 9 (17/01/26) - SQL Analytics & Dashboards**



### **Section 1 - Learn**:

### **_1. SQL Warehouse_**
In Databricks, a **SQL Warehouse** (formerly known as a SQL Endpoint) is a specialized compute resource optimized specifically for **SQL queries, BI reporting, and data visualization**.

While a standard Cluster is like a "Swiss Army Knife" for data engineering and ML, a SQL Warehouse is a "Scalpel" designed for the high-concurrency needs of analysts.

##### **1. Key Characteristics**

* **Optimized for SQL:** It uses the **Photon** engine (a vectorized query engine written in C++) to execute SQL queries, which is typically much faster than running the same workload on a general-purpose cluster without Photon.
* **Instant Availability:** **Serverless SQL Warehouses** start up in seconds, avoiding the multi-minute "cold start" wait of traditional clusters.
* **High Concurrency:** It is designed to handle dozens of users running queries at the same time by automatically scaling out (adding more clusters) and back in based on demand.
* **Simplified Management:** You don't have to choose instance types or Spark configurations. You simply choose a size (2X-Small to 4X-Large).

##### **2. Warehouse Types**

| Type | Best For... | Key Advantage |
| --- | --- | --- |
| **Serverless** | BI, Ad-hoc Analysts, Dashboards | Instant start, low overhead, no infrastructure to manage. |
| **Pro** | Scheduled SQL Tasks, dbt jobs | Includes advanced features like predictive I/O and performance monitoring. |
| **Classic** | Legacy workloads | Basic SQL execution on standard cloud VMs. |

##### **3. SQL Warehouse vs. All-Purpose Cluster**

* **User Type:** **All-Purpose Clusters** are for Data Engineers writing Python/Scala. **SQL Warehouses** are for Business Analysts using SQL or connecting BI tools (Power BI, Tableau).
* **Auto-Stop:** SQL Warehouses support short auto-stop windows (Serverless can go as low as 1 minute when configured through the API), which saves costs by shutting the warehouse down once it has been idle for that period.
* **Interface:** SQL Warehouses are the backbone of the **Databricks SQL** persona, powering the SQL Editor, Dashboards, and Alerts.

##### **4. Connecting BI Tools**

To connect a tool like Power BI or Tableau, copy the **Connection Details** from the SQL Warehouse settings (Server Hostname and HTTP Path) and authenticate, for example with a personal access token. A stopped warehouse starts automatically when a query arrives, so business users do not have to manage compute themselves.
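The same two connection details also work from code. The following is a minimal sketch, assuming the `databricks-sql-connector` package is installed and that the hostname, HTTP path, and token live in environment variables; the variable names and the sample values in the usage note are illustrative assumptions, not values from these notes.

```python
import os


def warehouse_connection_args(env=os.environ):
    """Collect the values shown under Connection Details, plus a token."""
    return {
        "server_hostname": env["DATABRICKS_SERVER_HOSTNAME"],
        "http_path": env["DATABRICKS_HTTP_PATH"],
        "access_token": env["DATABRICKS_TOKEN"],
    }


def run_query(query, env=os.environ):
    """Run one query on the warehouse and return all rows."""
    # Imported lazily so the helper above works without the package installed.
    from databricks import sql  # pip install databricks-sql-connector

    with sql.connect(**warehouse_connection_args(env)) as connection:
        with connection.cursor() as cursor:
            cursor.execute(query)
            return cursor.fetchall()
```

With the three environment variables set, `run_query("SELECT 1")` should start the warehouse if it is stopped and return a single row.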

##### **Best Practices**

* **Use Serverless:** It usually gives the best performance-to-cost ratio for interactive workloads and minimizes "idle time" waste, because it starts and stops quickly.
* **Scaling Limits:** Set a **Min** and **Max** number of clusters. For example, `Min: 1` and `Max: 5` allows the warehouse to expand during the morning rush and shrink at night.
* **Tagging:** Use tags on your warehouses to track which department (Marketing, Finance) is responsible for the compute costs.
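To make the Min/Max setting concrete, here is a toy model of how a cluster count could be clamped between the two limits. It is only an illustration: the real autoscaler is more sophisticated, and the `queries_per_cluster` capacity of 10 is an assumed figure for this sketch.

```python
import math


def clusters_needed(concurrent_queries, min_clusters=1, max_clusters=5,
                    queries_per_cluster=10):
    """Toy model: clusters wanted for the load, clamped to [min, max]."""
    wanted = math.ceil(concurrent_queries / queries_per_cluster)
    return max(min_clusters, min(max_clusters, wanted))


night = clusters_needed(0)          # no load: stays at the minimum
morning_rush = clusters_needed(23)  # needs 3 clusters, within the limits
overload = clusters_needed(80)      # would need 8, capped at the maximum
print(night, morning_rush, overload)
```

The cap is what protects the budget: once the maximum is reached, extra queries queue instead of adding more compute.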

---

### **_2. Complex analytical queries_**

When your data moves into the **Gold layer**, simple `SELECT` and `WHERE` clauses often aren't enough. Complex analytical queries in Databricks utilize advanced SQL and PySpark features to perform multi-step logic, time-series analysis, and sophisticated data transformations.

Here are the primary patterns for handling complex analysis:

##### **1. Common Table Expressions (CTEs)**

CTEs are essential for making complex logic readable. Instead of nesting five subqueries inside each other, you define "temporary result sets" that you can reference later in the same query.

* **Benefit:** Improves readability and maintainability. The optimizer generally treats a CTE like an inlined subquery, so the gain is clarity rather than speed.
* **Syntax:** Use the `WITH` clause to define your steps sequentially.

```sql
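-- Step 1: total revenue per region
WITH regional_sales AS (
  SELECT region, SUM(amount) AS total_rev
  FROM gold.sales
  GROUP BY region
),
-- Step 2: keep only regions above the revenue threshold
top_regions AS (
  SELECT region FROM regional_sales WHERE total_rev > 1000000
)
-- Step 3: return the detailed sales rows for those regions
SELECT * FROM gold.sales WHERE region IN (SELECT region FROM top_regions);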
```

##### **2. Advanced Window Functions**

Beyond simple rankings, window functions enable complex trend analysis:

* **Range-based Frames:** Calculate a "moving average of the last 30 days" rather than just the last 30 rows.
* **Analytical Offsets:** Use `LAG` and `LEAD` to compare a customer's current purchase to their previous one to find "Time to Re-purchase" metrics.
* **Percentiles:** Use `percent_rank()` or `ntile(4)` to segment your data into quartiles (e.g., identifying the top 25% of high-spending customers).
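The offset and quartile patterns can be tried locally. This sketch uses Python's built-in `sqlite3` as a stand-in for a SQL Warehouse (it needs SQLite 3.25 or newer for window functions); the `purchases` table and its values are made up for illustration. `LAG` and `NTILE` are also available in Databricks SQL.

```python
import sqlite3

# Hypothetical purchases data: (customer, day number, amount).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("a", 1, 10.0), ("a", 5, 20.0), ("a", 12, 15.0),
     ("b", 2, 50.0), ("b", 3, 5.0), ("c", 4, 100.0), ("d", 7, 8.0)],
)

# LAG reads the previous row in each customer's partition, so the
# difference is the number of days since that customer's last purchase.
repurchase_gaps = conn.execute("""
    SELECT customer, day,
           day - LAG(day) OVER (PARTITION BY customer ORDER BY day) AS days_since_prev
    FROM purchases
    ORDER BY customer, day
""").fetchall()

# NTILE(4) splits customers into four spending buckets (1 = top spenders).
quartiles = conn.execute("""
    WITH totals AS (
        SELECT customer, SUM(amount) AS total FROM purchases GROUP BY customer
    )
    SELECT customer, total, NTILE(4) OVER (ORDER BY total DESC) AS quartile
    FROM totals
    ORDER BY total DESC
""").fetchall()

print(repurchase_gaps)
print(quartiles)
```

A customer's first purchase has no previous row, so its gap is `NULL` (`None` in Python).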

##### **3. Pivoting and Unpivoting**

Data is often stored in a "Long" format (one row per date/metric), but reports often require a "Wide" format (columns for each month).

* **`PIVOT`**: Converts row values into columns (e.g., turning 12 months of rows into 12 columns).
* **`UNPIVOT`**: Converts columns back into rows, which is crucial for cleaning data that arrived in a "spreadsheet-style" format.
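The reshaping itself is easy to see on a tiny dataset. This is a plain-Python sketch of the long-to-wide and wide-to-long conversions, not the Databricks `PIVOT`/`UNPIVOT` syntax; the regions, months, and revenue figures are invented for the example.

```python
from collections import defaultdict

# "Long" format: one row per (region, month) pair.
long_rows = [
    ("north", "Jan", 100), ("north", "Feb", 120),
    ("south", "Jan", 80), ("south", "Feb", 90),
]


def pivot(rows):
    """Long -> wide: one entry per region, one 'column' per month."""
    wide = defaultdict(dict)
    for region, month, revenue in rows:
        wide[region][month] = revenue
    return dict(wide)


def unpivot(wide):
    """Wide -> long: turn each month column back into its own row."""
    return [
        (region, month, revenue)
        for region, months in wide.items()
        for month, revenue in months.items()
    ]


wide_table = pivot(long_rows)
print(wide_table)
print(unpivot(wide_table))
```

Unpivoting the pivoted table returns the original rows, which is a quick sanity check that no values were lost in the reshape.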

##### **4. Higher-Order Functions (Complex Types)**

Modern data often includes **Arrays** or **Maps** (JSON-like structures). Databricks SQL provides specialized functions to manipulate these in place, without "exploding" the array into rows and re-aggregating, which is often faster and keeps the query simpler.

* **`transform()`**: Applies a function to every element in an array.
* **`filter()`**: Removes elements from an array based on a condition.
* **`aggregate()`**: Reduces an array to a single value (e.g., summing all items in a "cart" array).

```sql
-- Example: Increase all prices in an array by 10%
SELECT transform(product_prices, p -> p * 1.1) AS inflated_prices FROM orders;
```
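For readers who know Python better than SQL lambdas, the three functions map closely onto a list comprehension, a filtered comprehension, and `functools.reduce`. This is an analogy only, with made-up prices; it does not run on a warehouse.

```python
from functools import reduce

product_prices = [10.0, 20.0, 5.0]  # one "array" cell from a hypothetical row

# transform(product_prices, p -> p * 1.1): apply a function to every element.
inflated_prices = [round(p * 1.1, 2) for p in product_prices]

# filter(product_prices, p -> p > 8): keep elements matching a condition.
expensive_prices = [p for p in product_prices if p > 8]

# aggregate(product_prices, 0.0, (acc, p) -> acc + p): fold to a single value.
cart_total = reduce(lambda acc, p: acc + p, product_prices, 0.0)

print(inflated_prices, expensive_prices, cart_total)
```

In each case the array stays inside its row; nothing is exploded into extra rows and grouped back together.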

##### **5. Statistical and ML Functions**

Databricks SQL includes built-in functions for data science without needing Python:

* **`corr(x, y)`**: Calculates the Pearson correlation coefficient.
* **`regr_slope(y, x)`**: Returns the slope of the linear regression line.
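* **`approx_count_distinct()`**: Uses a HyperLogLog-based algorithm to estimate the number of unique values in massive datasets. The result is approximate (the default maximum relative standard deviation is 5%, tunable via an optional second argument), but it is typically much faster and lighter on memory than an exact `COUNT(DISTINCT)`.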
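To see what `corr` and `regr_slope` compute, the textbook formulas can be written out in a few lines. This is a sketch on made-up numbers, not the engine's implementation; note that `regr_slope` takes the dependent variable first, mirrored here by `regression_slope(ys, xs)`.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def regression_slope(ys, xs):
    """Least-squares slope of y on x (dependent variable first)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


ad_spend = [1, 2, 3, 4]  # hypothetical x values
revenue = [2, 4, 5, 8]   # hypothetical y values

r = pearson_corr(ad_spend, revenue)
slope = regression_slope(revenue, ad_spend)
print(round(r, 3), round(slope, 3))
```

A correlation close to 1 with a slope of about 1.9 reads as "revenue rises by roughly 1.9 units for each extra unit of spend" on this toy data.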

##### **6. Lateral Joins (Exploding Data)**

When you have a column containing a list (array) and you need to join it against another table, `LATERAL VIEW` combined with the `explode()` function expands that list into individual rows while keeping the context of the parent row.
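The row expansion can be pictured with a nested loop. This plain-Python sketch uses invented `orders` and `prices` data: it explodes each order's item list into one row per item and then joins each item to a price lookup.

```python
# Hypothetical parent rows, each with an array column "items".
orders = [
    {"order_id": 1, "items": ["pen", "ink"]},
    {"order_id": 2, "items": ["pad"]},
    {"order_id": 3, "items": []},
]
prices = {"pen": 2.5, "ink": 7.0, "pad": 4.0}  # the table being joined to

# Explode: one output row per array element, keeping the parent's order_id.
exploded = [(o["order_id"], item) for o in orders for item in o["items"]]

# Join each exploded row against the lookup table.
order_lines = [(order_id, item, prices[item]) for order_id, item in exploded]

print(exploded)
print(order_lines)
```

Order 3 disappears because its array is empty. A plain `LATERAL VIEW` with `explode()` behaves the same way, and `LATERAL VIEW OUTER` is the variant that keeps such rows with a `NULL` element.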

---

### **_3. Dashboard creation_**
In Databricks, **Dashboards** turn your SQL queries into interactive, shareable visualizations for business users. There are currently two ways to build dashboards, but the platform is moving toward the more modern **Lakehouse Dashboards** (since renamed **AI/BI Dashboards**).

##### **1. Lakehouse Dashboards (The Modern Standard)**

These are high-performance, WYSIWYG (What You See Is What You Get) dashboards that are decoupled from the underlying notebooks or SQL editors.

* **Draft and Publish:** You can work on a "Draft" version without affecting what the business users see. Once ready, you "Publish" the update.
* **Filter Widgets:** Add dropdowns, date pickers, and text inputs that allow users to slice the data dynamically across all charts on the page.
* **Direct Sharing:** Share dashboards with users who don't have access to the underlying SQL code or tables. When the dashboard is published with embedded credentials, viewers see the data using the publisher's permissions rather than their own.
* **Automatic Scaling:** They are powered by **SQL Warehouses**, meaning they can handle many simultaneous users much better than a notebook-based dashboard.

##### **2. The Creation Workflow**

The process typically follows these four steps:

1. **Define Data:** Write one or more SQL queries to pull the metrics you need (usually from your **Gold** tables).
2. **Add Visualizations:** Choose from a wide variety of chart types:
   * **Counter:** For high-level KPIs (e.g., "Total Revenue").
   * **Bar/Line/Area:** For trends over time or category comparisons.
   * **Pie/Donut:** For market share or composition.
   * **Scatter/Heatmap:** For relationship and density analysis.
3. **Layout Design:** Drag and drop visualizations onto a canvas. You can resize them and group them into tabs to organize the story.
4. **Set Refresh Schedule:** Schedule the dashboard to refresh its data automatically (e.g., every morning at 8:00 AM) and optionally send a PDF snapshot via email.

##### **3. Dashboard Features**

| Feature | Benefit |
| --- | --- |
| **Cross-filtering** | Clicking a segment in one chart automatically filters the other charts on the dashboard. |
| **Parameters** | Pass variables from the UI directly into the SQL queries to change the scope of the data. |
| **Download to CSV** | Allows end-users to export the raw data behind a specific visualization for their own offline analysis. |
| **Mobile Optimized** | Dashboards automatically resize for viewing on tablets and phones. |

##### **4. Best Practices for Dashboarding**

* **Query Gold Tables:** Always point your dashboard at **Gold** tables to ensure the data is already cleaned, aggregated, and optimized for speed.
* **Keep it Simple:** Limit the number of visualizations to 5-7 per page. Too many charts can overwhelm the user and slow down the initial load time.
* **Use SQL Warehouses:** Always use a **Serverless SQL Warehouse** for the best "instant-on" experience for your users.
* **Text Annotations:** Use text boxes to explain what the charts are showing or to provide business context for the trends being displayed.

##### **5. Legacy "Notebook" Dashboards**

You may still see "Dashboards" created directly inside a notebook (using the "View" menu). While useful for quick data exploration by engineers, these are being replaced by Lakehouse Dashboards because they are harder to govern and share with non-technical users.


---

### **_4. Visualizations & filters_**

In Databricks, visualizations and filters are the key to turning static data into interactive insights. With the recent shift to **AI/BI Dashboards** (formerly Lakehouse Dashboards), these features have become more powerful, supporting natural language creation and deep interactivity.

##### **1. Types of Visualizations**

Databricks offers a diverse library of chart types tailored for different analytical needs:

* **Trend & Comparison:** Bar, Line, and Area charts for tracking metrics over time.
* **Composition:** Pie and Donut charts to show proportions.
* **Relationship:** Scatter plots and Heatmaps to find correlations.
* **Advanced:** Combo charts (mixing bars and lines), Funnel charts, and Box plots for statistical distribution.
* **Geospatial:** Point maps for visualizing location-based data.

> **AI Tip:** You can now create visualizations by simply describing them to the **Databricks Assistant** (e.g., *"Show me a bar chart of total revenue by month"*).

##### **2. Interaction with Filters**

Filters allow users to slice and dice data dynamically without changing the underlying SQL code. There are three primary ways to implement them:

###### **A. Filter Widgets (On-Canvas)**

You can place these directly on your dashboard layout for end-user interaction:

* **Single/Multi-Value Dropdowns:** Select specific categories (e.g., "Region" or "Clerk ID").
* **Date & Range Pickers:** Select a specific date or a relative range like "Last 7 Days."
* **Text Entry:** Search for specific keywords using "Exact Match" or "Contains."
* **Range Sliders:** Filter numeric values like "Account Balance" or "Age."

###### **B. Cross-Filtering**

This is a "click-to-filter" feature. If you click on a specific bar in a chart (e.g., the "North" region), all other visualizations on the dashboard that share that dataset will automatically update to show data for just that region.

###### **C. Drill-Through**

You can configure a visualization so that clicking a data point takes the user to a **Details** page, carrying over the filter context. For example, clicking a year in a summary bar chart can open a page with a detailed table of every transaction for that specific year.

##### **3. Parameters vs. Field Filters**

Understanding the difference is crucial for performance:

| Feature | Field Filters | Parameters (`:variable`) |
| --- | --- | --- |
| **Execution** | Applied in the browser/UI layer when the dataset is small; larger datasets are filtered on the warehouse. | Re-runs the SQL query on the warehouse. |
| **Performance** | Faster for small to medium datasets. | Slower but necessary for massive data. |
| **Flexibility** | Limited to existing dataset columns. | Can be placed anywhere (e.g., inside a subquery). |
| **Use Case** | General slicing and dicing. | Optimizing early data reduction (filtering before joins). |
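The difference in execution can be sketched locally. The example below uses Python's built-in `sqlite3` as a stand-in for the warehouse, because it happens to accept the same `:name` placeholder style; the `sales` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100), ("north", 50), ("south", 70)],
)


def totals_with_parameter(region):
    """Parameter: the value is bound into the SQL and the query re-runs."""
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales "
        "WHERE region = :region GROUP BY region",
        {"region": region},
    ).fetchall()


# Field filter: the dataset is fetched once, then narrowed on the client.
dataset = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()


def totals_with_field_filter(region):
    """Filter rows of the already-fetched dataset without re-querying."""
    return [row for row in dataset if row[0] == region]


print(totals_with_parameter("north"))
print(totals_with_field_filter("north"))
```

Both calls return the same rows here. The parameter version pushes the `WHERE` clause into the query, which matters when the unfiltered dataset is too large to fetch up front.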

##### **Best Practices**

* **Use Global Filters:** Place filters in the "Global Filter Panel" if they need to apply across multiple pages.
* **Set Default Values:** Always set a logical default (e.g., "Last 30 Days") so the dashboard doesn't attempt to load 10 years of data on startup.
* **Leverage AI Forecast:** For line charts, you can use the built-in "AI Forecast" button to project future trends based on historical data.

---

### **Practice**

In [0]:
%sql
CREATE CATALOG ecommerce;
USE CATALOG ecommerce;
CREATE SCHEMA bronze;
CREATE SCHEMA silver;
CREATE SCHEMA gold;

---

In [0]:
%sql
-- Permissions
GRANT SELECT ON TABLE gold.products TO `analysts@company.com`;
GRANT ALL PRIVILEGES ON SCHEMA silver TO `engineers@company.com`;

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-5531138585135986>, line 1
----> 1 get_ipython().run_cell_magic('sql', '', '-- Permissions\nGRANT SELECT ON TABLE gold.products TO `analysts@company.com`;\nGRANT ALL PRIVILEGES ON SCHEMA silver TO `engineers@company.com`;\n')

File /databricks/python/lib/python3.12/site-packages/IPython/core/interactiveshell.py:2541, in InteractiveShell.run_cell_magic(self, magic_name, line, cell)
   2539 with self.builtin_trap:
   2540     args =

In [0]:
%sql
-- Controlled view
CREATE VIEW gold.top_products AS
SELECT product_name, revenue, conversion_rate
FROM gold.products
WHERE purchases > 10
ORDER BY revenue DESC LIMIT 100;

---

### **Resources**
- [Unity Catalog](https://docs.databricks.com/data-governance/unity-catalog/)
- [Data governance](https://docs.databricks.com/data-governance/unity-catalog/get-started.html)

----