# Spark (h)and(s)-on Databricks

# Databricks
https://www.databricks.com

## A Data Lakehouse Platform

**Simple. Open. Multicloud.**

The Databricks Lakehouse Platform combines the best elements of data lakes and data warehouses to deliver the reliability, strong governance and performance of data warehouses with the openness, flexibility and machine learning support of data lakes.

This unified approach simplifies your modern data stack by eliminating the data silos that traditionally separate and complicate data engineering, analytics, BI, data science and machine learning. It’s built on open source and open standards to maximize flexibility. And its common approach to data management, security and governance helps you operate more efficiently and innovate faster.

![](https://www.databricks.com/en-website-assets/static/5a945e175f09eb2f522b911352a149f2/Marketecture.svg)

# Simple

The unified approach simplifies your data architecture by eliminating the data silos that traditionally separate analytics, BI, data science and machine learning. With a lakehouse, you can eliminate the complexity and expense that make it hard to achieve the full potential of your analytics and AI initiatives.

![](images/databricks-simple.png)

# Open

Delta Lake forms the open foundation of the lakehouse by providing reliability and world-record-setting performance directly on data in the data lake. You’re able to avoid proprietary walled gardens, easily share data, and build your modern data stack with unrestricted access to the ecosystem of open source data projects and the broad Databricks partner network.

![](https://www.databricks.com/en-website-assets/static/e3d1995c123d1a62316de195faff275e/3024c/Platform-orignal-creators.webp)

# Multicloud

The Databricks Lakehouse Platform offers you a consistent management, security, and governance experience across all clouds. You don’t need to invest in reinventing processes for every cloud platform that you’re using to support your data and AI efforts. Instead, your data teams can simply focus on putting all your data to work to discover new insights.

![](https://www.databricks.com/en-website-assets/static/f4d140f58197009a3d0953ce3651ccc8/Multicloud.svg)

## Community Edition 
https://community.cloud.databricks.com/login.html

## Create Account
- Sign up for a new account
- Select "Get started with Community Edition" instead of selecting a Cloud Provider
- Verify the email

## Run Tutorial
- Click on Guide: Quickstart tutorial
- Create a cluster by clicking on Connect
- Run Cells in order

# Apache Spark

## Unified engine for large-scale data analytics
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
![](https://spark.apache.org/images/spark-logo-trademark.png)

## The Apache Spark project's History
Spark was originally written by the founders of Databricks during their time at UC Berkeley. The Spark project started in 2009, was open sourced in 2010, and in 2013 its code was donated to Apache, becoming Apache Spark. The employees of Databricks have written over 75% of the code in Apache Spark and have contributed more than 10 times more code than any other organization. Apache Spark is a sophisticated distributed computation framework for executing code in parallel across many different machines. While the abstractions and interfaces are simple, managing clusters of computers and ensuring production-level stability is not. Databricks makes big data simple by providing Apache Spark as a hosted solution.

A Gentle Introduction to Apache Spark on Databricks

## The Genesis of Spark

## From Hadoop 1.0
- Big Data and Distributed Computing at Google (2004)
- Hadoop at Yahoo! (2006)

![](https://media.makeameme.org/created/guys-its.jpg)

![](https://minimalistquotes.com/wp-content/uploads/2022/08/simple-things-should-be-simple-and-complex-things-.jpg)


The question then became:
is there a way to make Hadoop and MapReduce (MR) simpler and faster?

## Spark 1.0 and beyond
- Spark’s Early Years at AMPLab (2009) 
- First paper: 10-20x faster than MapReduce (2010)
- Spark 1.0 Released (2014)
- Spark 2.0: Unifying DataFrame and Dataset. Structured Streaming (2016)
- Spark 3.0: Hadoop 3.0 support, support for pandas, faster SQL engine (2020)
- Spark 3.4: Spark Connect (2023)

# Features

## 1. Speed

![](https://cc-media-foxit.fichub.com/image/fox-it-mondofox/0177f439-3c0f-44ae-9803-c25f8bfac0dd/flash-vs-superman-game-2jpg-maxw-824.jpg)

### Run workloads up to 100x faster

![Logistic Regression](https://spark.apache.org/images/logistic-regression.png)

Apache Spark achieves high performance for both batch and streaming data, using:
- a state-of-the-art DAG scheduler
- a query optimizer
- a physical execution engine

## Why is Spark faster?

### 1. Hardware improvements

Today’s commodity servers come cheap, with hundreds of gigabytes of memory, multiple cores, and the underlying Unix-based operating system taking advantage of efficient multithreading and parallel processing.

![](https://external-preview.redd.it/RVpCIxhliY2p5vKF8I-AoCLIoI48yIEpVPXDduTG6Fc.jpg?auto=webp&s=88a001359893e5533423e9886d4d55cfd2dbdf62)

### 2. Directed Acyclic Graph (DAG) Scheduler and Query Optimizer

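Spark builds its computations as a directed acyclic graph (DAG). The DAG scheduler and query optimizer construct an efficient computational graph that can usually be decomposed into tasks that are executed in parallel across workers on the cluster.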
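To make the idea of decomposing a graph into parallel tasks concrete, here is a minimal plain-Python sketch. It does not use Spark and does not reproduce Spark's real scheduler (which also splits work into stages at shuffle boundaries). The step names and the `execution_waves` helper are hypothetical. It only shows that steps with no unmet dependencies can be released together.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each key is a step, each value is the set of steps
# whose output it needs. The names are illustrative, not Spark API.
dag = {
    "read_lines": set(),
    "split_words": {"read_lines"},
    "pair_with_one": {"split_words"},
    "sum_by_word": {"pair_with_one"},
    "read_stopwords": set(),
    "drop_stopwords": {"sum_by_word", "read_stopwords"},
}

def execution_waves(graph):
    """Group steps into waves; steps in the same wave do not depend on
    each other, so a scheduler could run them in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)                 # mark finished, unlocking dependents
    return waves

waves = execution_waves(dag)
for number, steps in enumerate(waves):
    print(number, steps)
```

The first wave contains both read steps because neither depends on anything else; the remaining steps form a chain and therefore run one after another.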

![](https://www.researchgate.net/publication/336769100/figure/fig2/AS:817393752371221@1571893265396/Spark-DAG-for-a-WordCount-application-with-two-stages-each-consisting-of-three-tasks.png)

https://www.researchgate.net/publication/336769100_Artificial_neural_networks_based_techniques_for_anomaly_detection_in_Apache_Spark

### 3. Physical execution engine 

Tungsten is the codename for the umbrella project that reworks Apache Spark’s execution engine. It focuses on substantially improving the memory and CPU efficiency of Spark applications, to push performance closer to the limits of modern hardware.

![](https://shuttermuse.com/wp-content/uploads/2015/03/what-is-a-tungsten-light.jpeg)

## 2. Ease of Use

### Modularity

Write applications quickly in Java, Scala, Python, R, and SQL.

Spark offers over 80 high-level operators that make it easy to build parallel apps. 

And you can use it **interactively** from the Scala, Python, R, and SQL shells.

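#### Python Example

```python
# Load a JSON file into a DataFrame (assumes a SparkSession named `spark`, as in the shells)
df = spark.read.json("logs.json")
# Keep rows where age > 21 and print the nested field name.first
df.where("age > 21").select("name.first").show()
```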

## 3. Generality

### Combine SQL, streaming, and complex analytics.

Spark powers a stack of libraries including 

- SQL and DataFrames

- MLlib for machine learning

- GraphX

- Spark Streaming

![](https://spark.apache.org/images/spark-stack.png)

### It can access diverse external data sources

#### Analyse
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets

#### Query
Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on using relational transformations and can also be used to create a temporary view. Registering a DataFrame as a temporary view allows you to run SQL queries over its data.

https://spark.apache.org/docs/latest/sql-data-sources.html

# Examples
https://spark.apache.org/examples.html

## Compute Spark PI
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/1514004441740508/2956912205716139/latest.html
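As a rough idea of what a Pi computation of this kind does, the sketch below estimates π by Monte Carlo sampling in plain Python, without Spark. It is not taken from the linked notebook. The function names, sample count and seed are assumptions chosen for illustration. With Spark, the same filter-and-count shape would be spread over the cluster (for instance with the RDD methods `parallelize`, `filter` and `count`).

```python
import random

def inside_unit_circle(_):
    """Draw one random point in the unit square and report whether it
    falls inside the quarter circle of radius 1."""
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0

def estimate_pi(num_samples, seed=42):
    random.seed(seed)  # fixed seed so reruns give the same estimate
    # filter keeps the samples that hit the quarter circle; we count them
    hits = sum(1 for _ in filter(inside_unit_circle, range(num_samples)))
    # hits / num_samples approximates pi / 4
    return 4.0 * hits / num_samples

pi_estimate = estimate_pi(200_000)
print(pi_estimate)
```

Each sample is independent of the others, which is why this workload splits so easily across many workers.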

## Word Count
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/1177351990298623/2956912205716139/latest.html
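The word count pattern can be sketched in plain Python before looking at the Spark version. The input lines below are hypothetical and the code is not taken from the linked notebook. The comments name the RDD operations (`flatMap`, `map`, `reduceByKey`) that usually play each role in Spark.

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "that is the question"]  # hypothetical input

# flatMap step: split every line into words and flatten into one stream
words = chain.from_iterable(line.split() for line in lines)
# map step: pair each word with a count of 1
pairs = ((word, 1) for word in words)
# reduceByKey step: add up the counts that share the same word
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(counts["to"], counts["be"], counts["question"])
```

In Spark the reduce step runs per key across partitions, so words can be counted on many machines and merged afterwards.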

## Data Frame Text Search
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/704214092842249/2956912205716139/latest.html
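A minimal plain-Python sketch of the text-search pattern follows. It is not taken from the linked notebook: the log lines and the single `line` column are assumptions for illustration. The idea is to treat the text as a one-column table and apply successive filters, each one narrowing the previous result.

```python
# Hypothetical log lines used only for illustration.
log_lines = [
    "INFO starting service",
    "ERROR MySQL connection refused",
    "ERROR disk quota exceeded",
    "WARN slow response",
    "ERROR MySQL query timed out",
]

# One-column "table": each row is a dict with a single `line` field.
rows = [{"line": text} for text in log_lines]

# First filter: keep only rows whose line mentions ERROR.
errors = [row for row in rows if "ERROR" in row["line"]]
# Second filter, applied to the previous result: errors that mention MySQL.
mysql_errors = [row for row in errors if "MySQL" in row["line"]]

print(len(errors), len(mysql_errors))
```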

# Data Structures

Spark has several core abstractions: Datasets, DataFrames, SQL Tables, and Resilient Distributed Datasets (RDDs).

These abstractions all represent distributed collections of data.

## Transformation

In Spark, the core data structures are immutable, meaning they cannot be changed once created. This might seem like a strange concept at first: if you cannot change it, how are you supposed to use it? To “change” a DataFrame, you instruct Spark how to modify the DataFrame you have into the one that you want. These instructions are called transformations.
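The idea can be imitated in plain Python, as a rough analogy rather than Spark code. A tuple stands in for the immutable data, and generator expressions stand in for transformations: each one describes a new collection without touching the original. Like Spark transformations, the generators are lazy, so nothing is computed until the result is requested. The variable names are hypothetical.

```python
numbers = tuple(range(1, 6))  # immutable source data: (1, 2, 3, 4, 5)

# Each "transformation" describes a new collection derived from the old one.
# Generators are lazy, so no work happens on these two lines.
doubled = (n * 2 for n in numbers)
large = (n for n in doubled if n > 4)

# Asking for the result triggers the whole pipeline.
result = list(large)

print(result)   # the derived data
print(numbers)  # the original data, unchanged
```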


## Resilient Distributed Datasets (RDD)
https://spark.apache.org/docs/latest/rdd-programming-guide.html

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

![](https://1.bp.blogspot.com/-wMroEy8Ow-k/WdCUxRefTTI/AAAAAAAABNM/Z14px-DgqGYqPfAfwNIILI9EX-ozLGplQCLcBGAs/s640/apache-spark-streaming-13-638.jpg)

An RDD in Spark is simply an immutable distributed collection of objects. 

Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. 

RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
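The partitioning idea can be sketched in plain Python, with threads standing in for cluster nodes. This is an analogy under stated assumptions, not how Spark is implemented: `split_into_partitions` is a hypothetical helper, and the data and partition count are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(items, num_partitions):
    """Cut items into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for index in range(num_partitions):
        # the first `extra` partitions each take one additional element
        end = start + size + (1 if index < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions

data = tuple(range(1, 11))                  # immutable collection 1..10
partitions = split_into_partitions(data, 3)

# Each partition is processed independently of the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, partitions))

total = sum(partial_sums)                   # combine the per-partition results
print(partitions, partial_sums, total)
```

Because each partition is summed on its own, the work could just as well happen on three different machines, with only the small partial results sent back to be combined.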

> Learning Spark: https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf

## A live example
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/1455717965234675/2956912205716139/latest.html

## Data Frames
https://spark.apache.org/docs/latest/sql-getting-started.html

## What is a DataFrame?
https://www.databricks.com/glossary/what-are-dataframes

A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.

Every DataFrame contains a blueprint, known as a schema, that defines the name and data type of each column. Spark DataFrames can contain universal data types like StringType and IntegerType, as well as data types that are specific to Spark, such as StructType. Missing or incomplete values are stored as null values in the DataFrame.
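To illustrate what a schema does, here is a small plain-Python sketch. The column names, rows and `conforms` helper are hypothetical. In Spark itself the schema would be expressed with types such as StructType, StringType and IntegerType rather than Python's `str` and `int`.

```python
# Hypothetical schema: one (column name, type) pair per column.
schema = [("name", str), ("age", int)]

rows = [
    {"name": "Ada", "age": 36},
    {"name": "Linus", "age": None},  # missing value stored as a null (None)
]

def conforms(row, schema):
    """True if the row has exactly the schema's columns and every value is
    either null or an instance of the declared type."""
    if set(row) != {column for column, _ in schema}:
        return False
    return all(
        row[column] is None or isinstance(row[column], kind)
        for column, kind in schema
    )

valid = [conforms(row, schema) for row in rows]
null_ages = sum(1 for row in rows if row["age"] is None)
print(valid, null_ages)
```

A row with a wrongly typed value, such as `{"name": "Grace", "age": "unknown"}`, fails the check, while a null in the same column is accepted.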


![](https://www.databricks.com/wp-content/uploads/2018/05/DataFrames.png)

DataFrame representation

A simple analogy is that a DataFrame is like a spreadsheet with named columns. However, the difference between them is that while a spreadsheet sits on one computer in one specific location, a DataFrame can span thousands of computers. In this way, DataFrames make it possible to do analytics on big data, using distributed computing clusters.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7a/Visicalc.png/440px-Visicalc.png)

The reason for putting the data on more than one computer should be intuitive: either the data is too large to fit on one machine or it would simply take too long to perform that computation on one machine.

![](https://intellipaat.com/mediaFiles/2015/08/Resilient-Distributed-Datasets-RDDs.jpg)

https://intellipaat.com/blog/tutorial/spark-tutorial/programming-with-rdds/

## DataFrames

The concept of a DataFrame is common across many different languages and frameworks. DataFrames are the main data type used in pandas, the popular Python data analysis library, and DataFrames are also used in R, Scala, and other languages.


![](https://image.slidesharecdn.com/jumpstartintoapachesparkanddatabricks-160212150759/95/jump-start-into-apache-spark-and-databricks-13-638.jpg?cb=1463623478)


![](https://databricks.com/wp-content/uploads/2016/06/Unified-Apache-Spark-2.0-API-1.png)

# Data Frame Example
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/3119543398385477/2956912205716139/latest.html

# References
- https://docs.databricks.com/introduction/index.html
- https://spark.apache.org/docs/0.8.0/api/pyspark/pyspark.rdd.RDD-class.html
- https://docs.databricks.com/dbfs/databricks-datasets.html