# Learning PySpark

**Status:** *ongoing*

---

## Summary

1. What is Apache Spark?
2. Architecture
3. Spark Ecosystem

---

### 1. What is Apache Spark?

[Spark](http://spark.apache.org/) is a general-purpose, distributed programming framework that was developed at the AMPLab at the University of California, Berkeley. It is open-source software that provides an in-memory computation framework, and it is also well suited to batch processing. Spark works well with real-time (or, more precisely, near-real-time) data.

Machine learning and graph algorithms are iterative, and this is where Spark shines. According to its research paper, it can be up to approximately 100 times faster than its peer, Hadoop, on such workloads. The speedup comes from caching: intermediate data can be kept in memory, so an iterative algorithm does not have to reload it from disk on every pass. Spark can be programmed with Java, Scala, Python, and R.

Spark can be considered an improved [Hadoop](https://hadoop.apache.org/). We can implement a [MapReduce](https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html) algorithm in Spark, and Spark takes advantage of HDFS: it can read data from HDFS and write data back to it. Spark also handles iterative computation efficiently because data can be persisted in memory, which makes it a good fit for interactive data analysis.

<br>

![spark-stack](https://user-images.githubusercontent.com/9319823/45998657-ca3d0900-c0a3-11e8-8bb8-32672e87d119.png)

<br>

The Spark stack encompasses four libraries: [SQL and DataFrames](http://spark.apache.org/sql/), [MLlib](http://spark.apache.org/mllib/) for machine learning, [GraphX](http://spark.apache.org/graphx/), and [Spark Streaming](http://spark.apache.org/streaming/). You can combine these libraries seamlessly in the same application.

- **SparkSQL:** The SparkSQL library is a wrapper over the PySpark core for SQL-like analysis of large amounts of structured or semi-structured data. We can also run SQL queries with PySparkSQL. It can be connected to Apache Hive, so HiveQL can be used too. PySparkSQL introduced the DataFrame, a tabular representation of structured data that resembles a table in a relational database management system.


- **MLlib:** MLlib is a wrapper over the PySpark core that provides machine-learning algorithms. Its machine-learning API is easy to use, and it supports many algorithms for classification, clustering, text analysis, and more.


- **GraphX:** GraphX is Apache Spark's API for graphs and graph-parallel computation. It unifies ETL, exploratory analysis, and iterative graph computation within a single system. You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms using the Pregel API.


- **Spark Streaming:** Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python. It is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.


![streaming-arch](https://user-images.githubusercontent.com/9319823/45999822-150c5000-c0a7-11e8-8a8a-f88b2c5b1c88.png)


#### Advantages of Spark
1. **Swift Processing:** Spark reduces the number of read and write operations to disk.
2. **Dynamic in Nature:** Spark provides over 80 high-level operators, which make it easier to develop parallel applications.
3. **In-Memory Computation:** We do not waste time fetching data from disk on every access; caching data in memory saves time.
4. **Re-Usability:** The same Spark code can be reused for batch processing or for joining a stream against historical data.
5. **Fault Tolerance:** Spark provides fault tolerance through RDDs, which are designed to handle the failure of any worker node in the cluster. Each RDD records the lineage of transformations used to build it, so partitions lost with a failed node can be recomputed instead of being lost for good.
 
#### Disadvantages of Spark
1. **Expensive:** In-memory capability can become a bottleneck when we want cost-efficient processing of big data, as keeping data in memory is quite expensive.
2. **Latency:** Apache Spark has higher latency than Apache Flink.
3. **Manual Optimization:** Spark jobs need to be optimized manually, and the tuning is specific to each dataset.
4. **No File Management:** Apache Spark does not have its own file management system, so it relies on another platform such as Hadoop.
5. **Problem With Small Files:** If we use Spark with Hadoop, we run into the small-files problem: HDFS is designed for a limited number of large files rather than a large number of small files. Spark also lags behind when the data is stored gzipped in S3, because a gzip file cannot be split and has to be decompressed by a single task.

### 2. Architecture

![spark_architecture](https://user-images.githubusercontent.com/9319823/45994904-09645d80-c096-11e8-87e4-2b53f058ba99.png)

### 3. Spark Ecosystem

---

### References

**Note:** Adjust references

1. Raju Kumar Mishra. *PySpark Recipes: A Problem-Solution Approach with PySpark2*. Apress, 2018.
2. http://spark.apache.org/
3. http://spark.apache.org/graphx/
4. https://spark.apache.org/docs/latest/streaming-programming-guide.html