###Describe the relationship between the data lakehouse and the data warehouse.


In [0]:
# A data lakehouse combines the low-cost storage and flexibility of a data lake with the management and performance features of a data warehouse.
# Data lakehouses enable organizations to store structured, semi-structured, and unstructured data in a single system, supporting analytics and BI workloads.
# Data warehouses are optimized for structured data and analytics, while lakehouses support broader data types and use cases.
# Lakehouses often use open formats (like Delta Lake) to provide ACID transactions, schema enforcement, and governance similar to data warehouses.

description = """
The data lakehouse is an architecture that unifies the capabilities of data lakes and data warehouses. 
It allows organizations to store all types of data (structured, semi-structured, unstructured) in a single repository, 
while providing data management, reliability, and performance features traditionally found in data warehouses. 
This enables advanced analytics, machine learning, and business intelligence on a single platform.
"""

display({"Lakehouse vs Warehouse Relationship": description})

###Identify the improvement in data quality in the data lakehouse over the data lake.

In [0]:
improvement = """
Data lakehouses improve data quality over traditional data lakes by introducing ACID transactions, schema enforcement, and data governance features. 
These capabilities ensure data consistency, reliability, and integrity, reducing issues like data corruption, inconsistent schemas, and lack of lineage that are common in data lakes.
"""

display({"Data Quality Improvement in Lakehouse": improvement})
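As a rough illustration of what schema enforcement means in practice, the sketch below rejects a whole batch when any record fails to match a declared schema. It is plain Python that mimics the idea only and is not how Delta Lake implements it; `EXPECTED_SCHEMA` and `enforce_schema` are hypothetical names introduced for this example.

```python
# Hypothetical schema for illustration: column name -> required Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "country": str}


def enforce_schema(records, schema=EXPECTED_SCHEMA):
    """Return the records if every one matches the schema; otherwise raise.

    All-or-nothing on purpose: a batch with one bad record is rejected as a whole,
    so partially valid data never lands in the table.
    """
    for index, record in enumerate(records):
        if set(record) != set(schema):
            raise ValueError(f"record {index}: columns {sorted(record)} do not match the schema")
        for column, expected_type in schema.items():
            if not isinstance(record[column], expected_type):
                raise ValueError(f"record {index}: column '{column}' is not {expected_type.__name__}")
    return list(records)


good_batch = [{"order_id": 1, "amount": 9.5, "country": "DE"}]
bad_batch = [{"order_id": "2", "amount": 3.0, "country": "FR"}]  # order_id has the wrong type

accepted = enforce_schema(good_batch)

try:
    enforce_schema(bad_batch)
    rejected = False
except ValueError:
    rejected = True
```

A plain data lake would accept both batches as files, and the type mismatch would surface only when someone reads the data.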

###Compare and contrast silver and gold tables, which workloads will use a bronze table as a source, which workloads will use a gold table as a source.


In [0]:
comparison = """
Silver tables are cleansed and enriched datasets, typically used for data exploration, reporting, and as a foundation for further transformations. Gold tables are highly curated, business-level aggregates or data marts, optimized for analytics, BI dashboards, and machine learning.

Workloads using bronze tables as a source: ETL pipelines, data cleansing, and enrichment processes (moving data from raw to silver).
Workloads using gold tables as a source: Business intelligence, reporting, advanced analytics, and machine learning models that require trusted, aggregated data.
"""

display({"Silver vs Gold Tables & Workload Sources": comparison})
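The bronze, silver, and gold flow described above can be sketched with plain Python lists standing in for tables. This is a minimal illustration under assumed data: the sample rows and the helper names `to_silver` and `to_gold` are hypothetical, and a real pipeline would typically use Spark DataFrames and Delta tables instead.

```python
from collections import defaultdict

# Bronze: raw events as ingested, including a duplicate and an incomplete row.
bronze = [
    {"order_id": 1, "country": "de", "amount": "10.0"},
    {"order_id": 1, "country": "de", "amount": "10.0"},  # duplicate of the first row
    {"order_id": 2, "country": "FR", "amount": "5.5"},
    {"order_id": 3, "country": "fr", "amount": None},     # missing amount
]


def to_silver(rows):
    """Cleanse: drop rows without an amount, deduplicate on order_id, normalize types."""
    seen, silver_rows = set(), []
    for row in rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver_rows.append({
            "order_id": row["order_id"],
            "country": row["country"].upper(),
            "amount": float(row["amount"]),
        })
    return silver_rows


def to_gold(rows):
    """Aggregate: total revenue per country, the kind of table a dashboard would read."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)  # bronze is the source for the cleansing step
gold = to_gold(silver)      # gold is the source for BI and reporting
print(gold)
```

The cleansing job reads bronze, while the consumer of the final dictionary (a report or dashboard) reads only gold.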

###Identify elements of the Databricks Platform Architecture, such as what is located in the data plane versus the control plane and what resides in the customer’s cloud account


In [0]:
architecture = {
    "Control Plane": [
        "Databricks workspace UI",
        "Job scheduling and orchestration",
        "Cluster management services",
        "Notebook management",
        "Authentication and access control"
    ],
    "Data Plane": [
        "Compute clusters (VMs/instances)",
        "Data processing engines (Spark, DBSQL, ML runtimes)",
        "Data storage access (to customer-managed storage)",
        "Execution of user code and jobs"
    ],
    "Customer's Cloud Account": [
        "Data plane resources (compute clusters, VMs)",
        "Customer-managed data storage (e.g., S3, ADLS, GCS)",
        "Networking and security configurations",
        "Encryption keys and IAM roles"
    ]
}

display({"Databricks Platform Architecture Elements": architecture})

###Differentiate between all-purpose clusters and jobs clusters.

In [0]:
difference = """
All-purpose clusters are interactive, reusable clusters designed for collaborative data analysis, ad hoc queries, and development in notebooks. They can be shared by multiple users and stay available until they are terminated manually or by an auto-termination setting.

Jobs clusters are ephemeral, single-use clusters created automatically for running production jobs or scheduled workflows. They are terminated automatically after the job completes, ensuring resource isolation and cost efficiency.
"""

display({"All-Purpose Clusters vs Jobs Clusters": difference})
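One place the distinction shows up is in job task settings: a task that carries a `new_cluster` specification gets a job cluster created for the run, while a task that carries an `existing_cluster_id` reuses an all-purpose cluster. The sketch below classifies task dictionaries on that basis. The notebook paths, node type, and cluster ID are placeholders, and `cluster_kind` is a hypothetical helper, not part of any Databricks library.

```python
def cluster_kind(task):
    """Classify a job task by the compute it requests."""
    if "new_cluster" in task:
        return "job cluster"          # created for the run, terminated when it finishes
    if "existing_cluster_id" in task:
        return "all-purpose cluster"  # an already defined, shareable cluster is reused
    return "unspecified"


# Placeholder task settings for illustration only.
nightly_task = {
    "task_key": "nightly_etl",
    "notebook_task": {"notebook_path": "/Repos/etl/nightly"},
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
}

adhoc_task = {
    "task_key": "adhoc_analysis",
    "notebook_task": {"notebook_path": "/Users/someone@example.com/explore"},
    "existing_cluster_id": "<cluster-id>",
}

kinds = {task["task_key"]: cluster_kind(task) for task in (nightly_task, adhoc_task)}
print(kinds)
```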

###Identify how cluster software is versioned using the Databricks Runtime.

In [0]:
runtime_versioning_info = """
Databricks clusters use the Databricks Runtime, which is versioned using a major.minor format (e.g., 12.2, 13.3 LTS). 

- Major versions (e.g., 12.x to 13.x) indicate significant changes, possibly not backward-compatible.
- Feature (minor) versions (e.g., 13.2 to 13.3) are backward-compatible enhancements within a major version.
- LTS (Long Term Support) versions (e.g., 13.3 LTS) receive extended support.

Each Databricks Runtime version bundles specific libraries, system components, and features. The runtime version is selected when creating a cluster.
"""

display({"Databricks Runtime Versioning": runtime_versioning_info})
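To make the versioning scheme concrete, the sketch below parses short runtime labels of the form used above ("12.2", "13.3 LTS") into their parts. It assumes only that short form; labels shown in the product may carry extra text that this parser does not handle, and `parse_runtime_label` and `same_major` are hypothetical helpers.

```python
import re


def parse_runtime_label(label):
    """Split a label such as '13.3 LTS' into major version, feature version, and LTS flag."""
    match = re.fullmatch(r"(\d+)\.(\d+)(?:\s+(LTS))?", label.strip())
    if match is None:
        raise ValueError(f"unrecognized runtime label: {label!r}")
    return {
        "major": int(match.group(1)),
        "feature": int(match.group(2)),
        "lts": match.group(3) is not None,
    }


def same_major(label_a, label_b):
    """True when both labels belong to the same major version line."""
    return parse_runtime_label(label_a)["major"] == parse_runtime_label(label_b)["major"]


parsed = parse_runtime_label("13.3 LTS")
print(parsed)
print(same_major("13.2", "13.3 LTS"))
print(same_major("12.2", "13.3 LTS"))
```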

###Identify how clusters can be filtered to view those that are accessible by the user.

In [0]:
cluster_filtering_info = """
Clusters accessible by a user can be filtered using the Databricks REST API or the Databricks UI.

- In the Databricks UI, the Clusters page lists only those clusters the user has permission to view or attach to, based on cluster-level access control.
- Using the REST API (`2.0/clusters/list`), the response includes only clusters the user is authorized to access. The API enforces permissions, so users see only clusters they can manage or attach to.

To list the clusters accessible to the current user from a notebook, call the REST API with a personal access token; the response is scoped to that user's permissions.

The code following this description shows one way to do this in Python, reading the token through Databricks Utilities.
"""

display({"Filtering Accessible Clusters": cluster_filtering_info})

import requests

# Replace with your Databricks workspace URL. The token is read from a secret scope
# rather than hard-coded; the scope and key names here are placeholders.
workspace_url = "https://<databricks-instance>"
token = dbutils.secrets.get(scope="my_scope", key="databricks_token")

response = requests.get(
    f"{workspace_url}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
response.raise_for_status()  # fail loudly on authentication or permission errors

# Default to an empty list if the response has no "clusters" key.
clusters = response.json().get("clusters", [])
display([{"cluster_name": c["cluster_name"], "cluster_id": c["cluster_id"]} for c in clusters])

###Describe how clusters are terminated and the impact of terminating a cluster.

In [0]:
cluster_termination_info = """
Clusters in Databricks can be terminated manually via the UI, programmatically using the REST API, or automatically based on configured auto-termination settings.

- Manual Termination: Users with appropriate permissions can terminate clusters from the Clusters UI or by calling the `2.0/clusters/delete` REST API endpoint.
- Auto-Termination: Clusters can be configured to automatically terminate after a specified period of inactivity.

Impact of Terminating a Cluster:
- All active jobs, notebooks, and interactive sessions running on the cluster are immediately stopped.
- Unsaved results or in-memory data are lost.
- The cluster's compute resources are released, and billing for those resources stops.
- Metadata, libraries, and configuration are preserved for future restarts, but ephemeral data is lost.

Termination cannot be undone for work that was running: interrupted workloads must be rerun after the cluster is restarted or on another cluster.
"""

display({"Cluster Termination and Impact": cluster_termination_info})
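The programmatic route mentioned above can be sketched as a helper that only describes the termination request, so it can be inspected before anything is sent. The workspace URL, token, and cluster ID are placeholders, and `build_terminate_request` is a hypothetical helper name.

```python
def build_terminate_request(workspace_url, token, cluster_id):
    """Describe the POST that terminates a cluster, without sending it."""
    return {
        "method": "POST",
        "url": f"{workspace_url.rstrip('/')}/api/2.0/clusters/delete",
        "headers": {"Authorization": f"Bearer {token}"},
        "json": {"cluster_id": cluster_id},
    }


request_spec = build_terminate_request(
    "https://<databricks-instance>/", "<personal-access-token>", "<cluster-id>"
)
print(request_spec["method"], request_spec["url"])

# To actually send it (this stops everything running on the cluster):
#   import requests
#   requests.request(**request_spec, timeout=30).raise_for_status()
```

Keeping the send step separate makes it harder to terminate a shared cluster by accident while experimenting.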

###Identify a scenario in which restarting the cluster will be useful.

In [0]:
restart_scenario_info = """
Restarting a Databricks cluster is useful when you need to clear the cluster's state, such as after updating installed libraries, environment variables, or configuration settings. For example, if you upgrade a Python package or change an environment variable, a cluster restart ensures that all nodes use the new configuration, resolving issues caused by stale dependencies or inconsistent environments.
"""

display({"Scenario Where Restarting Cluster is Useful": restart_scenario_info})

###Describe how to use multiple languages within the same notebook.

In [0]:
multi_language_info = """
Databricks notebooks support multiple languages within the same notebook by using language magic commands at the beginning of a cell. The supported languages include Python (`%python`), SQL (`%sql`), Scala (`%scala`), and R (`%r`). By default, cells use the notebook's primary language, but you can switch languages per cell using these magic commands.

For example:
- `%python` for Python code
- `%sql` for SQL queries
- `%scala` for Scala code
- `%r` for R code

Variables are not shared between languages, because each language runs in its own interpreter. To pass data from one language to another, register a DataFrame as a temporary view (or write it to a table) in one cell and query it from a cell in the other language.
"""

display({"Using Multiple Languages in a Notebook": multi_language_info})
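The per-cell rule can be modeled in a few lines: a cell runs in the language named by a leading magic command and otherwise in the notebook's default language. The sketch below applies that rule to cell sources held as strings. It is a simplified model, not how the notebook itself resolves languages; `cell_language` is a hypothetical helper, and other magic commands are not handled.

```python
# The four language magics named above, mapped to a language label.
MAGIC_LANGUAGES = {"%python": "python", "%sql": "sql", "%scala": "scala", "%r": "r"}


def cell_language(cell_source, default_language="python"):
    """Return the language a cell would run in, based on its first token."""
    stripped = cell_source.lstrip()
    first_token = stripped.split(None, 1)[0] if stripped else ""
    return MAGIC_LANGUAGES.get(first_token, default_language)


# Example cells from a notebook whose default language is Python.
cells = [
    "df = spark.range(3)",
    "%sql\nSELECT 1 AS one",
    "%scala\nval x = 1",
    '%r\nprint("hi")',
]

languages = [cell_language(cell) for cell in cells]
print(languages)
```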

###Identify how to run one notebook from within another notebook.

In [0]:
run_notebook_info = """
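You can run one Databricks notebook from within another in two ways. The `%run` magic command executes the other notebook inline, so its functions and variables become available in the calling notebook. The `dbutils.notebook.run` command starts the other notebook as a separate run and returns the value it passes to `dbutils.notebook.exit`, which lets you modularize code and reuse notebooks like functions or scripts.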

Example usage:
dbutils.notebook.run("notebook_path", timeout_seconds, arguments)

- "notebook_path": The path to the notebook you want to run.
- timeout_seconds: Maximum time in seconds to wait for the notebook to finish.
- arguments: (Optional) A dictionary of arguments to pass to the notebook.

The called notebook can access the arguments using `dbutils.widgets.get("argument_name")`.
"""

display({"Running One Notebook from Another": run_notebook_info})
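Because the called notebook can fail or time out, a common pattern is to wrap the call in a small retry helper. The sketch below takes the runner as a parameter so it can be exercised outside Databricks with a stand-in; `run_with_retry` and `fake_run` are hypothetical names, and the notebook path and argument are placeholders.

```python
def run_with_retry(run_notebook, path, timeout_seconds, arguments=None, max_retries=2):
    """Call run_notebook(path, timeout_seconds, arguments), retrying on any exception.

    Raises the last exception once max_retries retries have been used up.
    """
    attempts = 0
    while True:
        try:
            return run_notebook(path, timeout_seconds, arguments or {})
        except Exception:
            attempts += 1
            if attempts > max_retries:
                raise


# Stand-in for dbutils.notebook.run: fails once, then succeeds.
calls = []


def fake_run(path, timeout_seconds, arguments):
    calls.append(path)
    if len(calls) < 2:
        raise RuntimeError("transient failure")
    return f"ran {path} with {arguments}"


result = run_with_retry(fake_run, "./child_notebook", 60, {"run_date": "2024-01-01"})
print(result)
```

Inside a Databricks notebook, the same helper would be called as `run_with_retry(dbutils.notebook.run, "./child_notebook", 60, {"run_date": "2024-01-01"})`.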

###Identify how notebooks can be shared with others.

In [0]:
notebook_sharing_info = """
Databricks notebooks can be shared with others in several ways:

1. **Workspace Permissions**: Share notebooks by adjusting permissions in the Databricks workspace. You can grant users or groups read-only, run, edit, or manage access.

2. **Export and Import**: Export notebooks as `.dbc`, `.ipynb`, or `.py` files and share them externally. Recipients can import these files into their own Databricks workspace.

3. **URL Sharing**: Share the notebook's URL with users who have appropriate workspace access.

4. **Version Control Integration**: Use Git integration to share and collaborate on notebooks through repositories.

Choose the method that best fits your collaboration and security requirements.
"""

display({"Sharing Databricks Notebooks": notebook_sharing_info})

###Describe how Databricks Repos enables CI/CD workflows in Databricks.

In [0]:
repos_cicd_info = """
Databricks Repos integrates Git repositories directly into the Databricks workspace, enabling robust CI/CD workflows:

1. **Source Control Integration**: Connect notebooks, libraries, and other files to Git providers (GitHub, GitLab, Azure DevOps, Bitbucket), ensuring version control and collaboration.

2. **Branching and Pull Requests**: Use Git branches for feature development, bug fixes, and code reviews. Pull requests facilitate code review and automated testing before merging.

3. **Automated Testing**: Integrate with CI tools (e.g., GitHub Actions, Azure Pipelines) to run automated tests on code changes, ensuring code quality and reliability.

4. **Continuous Deployment**: Use CD pipelines to deploy notebooks, jobs, and other artifacts to Databricks environments automatically after successful tests and reviews.

5. **Traceability and Auditability**: Track changes, authorship, and history of all code and configuration changes, supporting compliance and reproducibility.

Databricks Repos streamlines the development lifecycle, enabling collaborative, automated, and reliable delivery of analytics and machine learning solutions.
"""

display({"Databricks Repos and CI/CD Workflows": repos_cicd_info})
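As one possible deployment step, a CD pipeline can ask the workspace to move a repo to the latest commit of a branch after tests pass, using the Repos REST API. The sketch below only builds the request description; the workspace URL, token, and repo ID are placeholders, and `build_repo_update_request` is a hypothetical helper name.

```python
def build_repo_update_request(workspace_url, token, repo_id, branch):
    """Describe the PATCH that checks out the latest commit of a branch in a repo."""
    return {
        "method": "PATCH",
        "url": f"{workspace_url.rstrip('/')}/api/2.0/repos/{repo_id}",
        "headers": {"Authorization": f"Bearer {token}"},
        "json": {"branch": branch},
    }


deploy_request = build_repo_update_request(
    "https://<databricks-instance>", "<personal-access-token>", "<repo-id>", "main"
)
print(deploy_request["method"], deploy_request["url"])

# In a CI/CD job, after tests have passed, the request could be sent with:
#   import requests
#   requests.request(**deploy_request, timeout=30).raise_for_status()
```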

###Identify Git operations available via Databricks Repos.

In [0]:
git_operations_info = """
Databricks Repos supports the following Git operations:

1. **Clone**: Clone remote Git repositories into the Databricks workspace.
2. **Pull**: Fetch and merge changes from the remote repository to keep local content up to date.
3. **Commit**: Record changes to tracked files in the local repository.
4. **Push**: Send committed changes from the local repository to the remote repository.
5. **Branch**: Create, switch, and manage branches for feature development or bug fixes.
6. **Merge**: Merge changes from one branch into another.
7. **Revert**: Undo changes by reverting commits.
8. **Resolve Conflicts**: Address merge conflicts directly within the Databricks UI.
9. **View History**: Inspect commit history and diffs for files and notebooks.
10. **Pull Requests (via Git provider)**: Initiate and manage pull requests through the connected Git provider.

These operations enable collaborative development and version control directly within Databricks.
"""

display({"Git Operations in Databricks Repos": git_operations_info})

###Identify limitations in Databricks Notebooks version control functionality relative to Repos.

In [0]:
notebooks_vc_limitations = """
Limitations of Databricks Notebooks version control relative to Repos:

1. **Granularity**: Native notebook version control only tracks changes at the notebook level, lacking file-level and multi-file project support.
2. **Collaboration**: No support for branching, merging, or pull requests, limiting collaborative workflows.
3. **Integration**: Cannot integrate with external Git providers (e.g., GitHub, GitLab, Azure DevOps) for CI/CD, automated testing, or deployment.
4. **History and Traceability**: Limited commit history and diff capabilities compared to Git-based workflows.
5. **Project Structure**: No support for organizing code across multiple files or projects as in a Git repo.
6. **Automation**: Lacks hooks for automated workflows (e.g., CI/CD pipelines) triggered by code changes.
7. **Conflict Resolution**: No built-in tools for resolving merge conflicts.
8. **Auditability**: Limited audit trails and change tracking compared to Git-backed Repos.

Databricks Repos addresses these limitations by providing full Git integration, enabling robust version control, collaboration, and automation.
"""

display({"Databricks Notebooks Version Control Limitations vs Repos": notebooks_vc_limitations})