# Spark with Python (PySpark) Tutorial For Beginners
Materials for this notebook are gathered from https://sparkbyexamples.com/pyspark-tutorial/

## PySpark RDD – Resilient Distributed Dataset
A PySpark RDD (Resilient Distributed Dataset) is the fundamental data structure of PySpark: a fault-tolerant, immutable, distributed collection of objects. Immutable means that once you create an RDD you cannot change it. Each RDD is divided into logical partitions, which can be computed on different nodes of the cluster.

### RDD Creation
To create an RDD, you first need a SparkSession, which is the entry point to a PySpark application. A SparkSession is created through the `SparkSession.builder` attribute (finished with `getOrCreate()`), or by calling `newSession()` on an existing session.

A Spark session internally creates a `sparkContext` attribute of type `SparkContext`. You can create multiple SparkSession objects, but only one SparkContext per JVM. If you want a new SparkContext, stop the existing one with `stop()` before creating it.

In [2]:
from pyspark.sql import SparkSession

# Creating a Spark session
spark = SparkSession.builder\
    .master("local[1]")\
    .appName("SparkByExamples.com")\
    .getOrCreate()

22/01/05 22:54:37 WARN Utils: Your hostname, Winsons-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.1.77 instead (on interface en0)
22/01/05 22:54:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/01/05 22:54:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### using parallelize()
SparkContext has several methods for creating RDDs. For example, its `parallelize()` method creates an RDD from a list.

In [3]:
# Create RDD from parallelize()
datalist = [("Java", 20000), ("Python", 30000), ("Scala", 25000), ("JavaScript", 40000)]
rdd = spark.sparkContext.parallelize(datalist)

### using textFile()
An RDD can also be created from a text file using the `textFile()` method of the SparkContext.

In [None]:
# Create RDD from external Data Source
# rdd2 = spark.sparkContext.textFile("/path/textFile.txt")

## RDD Operations

On PySpark RDD, you can perform two kinds of operations.

**RDD transformations** – Transformations are lazy operations. When you run a transformation (for example, one that updates values), it returns a new RDD instead of updating the current one.

**RDD actions** – operations that trigger computation and return RDD values to the driver.

### RDD Transformations
Transformations on a Spark RDD return another RDD, and they are lazy, meaning they don't execute until you call an action on the RDD. Some RDD transformations are `flatMap()`, `map()`, `reduceByKey()`, `filter()`, and `sortByKey()`; each returns a new RDD instead of updating the current one.

### RDD Actions
An RDD action returns values from an RDD to the driver node. In other words, any RDD function that returns something other than an RDD[T] is considered an action.

Some RDD actions are `count()`, `collect()`, `first()`, `max()`, `reduce()` and more.

# PySpark DataFrame

> DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
> 
> – Databricks

If you come from a Python background, you probably already know the pandas DataFrame. A PySpark DataFrame is broadly similar, except that it is distributed across the cluster (its data is stored on different machines) and operations on it execute in parallel on those machines, whereas a pandas DataFrame is stored and processed on a single machine.

## DataFrame creation

The simplest way to create a DataFrame is from a Python list of data. A DataFrame can also be created from an RDD or by reading files from several sources.

### using createDataFrame()
You can create a DataFrame with the `createDataFrame()` method of the SparkSession.

In [4]:
data = [
    ('James','','Smith','1991-04-01','M',3000),
    ('Michael','Rose','','2000-05-19','M',4000),
    ('Robert','','Williams','1978-09-05','M',4000),
    ('Maria','Anne','Jones','1967-12-01','F',4000),
    ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]

spark_df = spark.createDataFrame(data=data, schema=columns)

Since a DataFrame is a structured format with named, typed columns, we can inspect its schema using `df.printSchema()`.



In [5]:
spark_df.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)



`df.show()` displays the first 20 rows of the DataFrame by default.

In [6]:
spark_df.show()

+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M|  3000|
|  Michael|      Rose|        |2000-05-19|     M|  4000|
|   Robert|          |Williams|1978-09-05|     M|  4000|
|    Maria|      Anne|   Jones|1967-12-01|     F|  4000|
|      Jen|      Mary|   Brown|1980-02-17|     F|    -1|
+---------+----------+--------+----------+------+------+



                                                                                

## DataFrame operations
Like RDDs, DataFrames support two kinds of operations: transformations and actions.

## DataFrame from external data sources
In real-world applications, DataFrames are created from external sources such as files on the local system, HDFS, S3, Azure, HBase, or a MySQL table. Below is an example of how to read a CSV file from the local system.

In [None]:
# spark_df = spark.read.csv("/tmp/resources/zipcodes.csv")
# spark_df.printSchema()

### Supported file formats
The DataFrame API supports reading and writing several file formats:

- csv
- text
- Avro
- Parquet
- tsv
- xml and many more

# PySpark SQL Tutorial

PySpark SQL is one of the most used PySpark modules; it processes structured, columnar data. Once you have created a DataFrame, you can interact with the data using SQL syntax.

In other words, Spark SQL brings native raw SQL queries to Spark, meaning you can run traditional ANSI SQL on a Spark DataFrame. In a later section of this PySpark SQL tutorial, you will learn in detail how to use SQL `select`, `where`, `group by`, `join`, `union`, etc.

To use SQL, first create a temporary view on the DataFrame using the `createOrReplaceTempView()` function. Once created, this view can be queried with `sql()` from the SparkSession that created it, and it is dropped when that session ends.

Use the `sql()` method of the SparkSession object to run the query; it returns a new DataFrame.

In [7]:
spark_df.createOrReplaceTempView("PERSON_DATA")

spark_df_sql = spark.sql("SELECT * FROM PERSON_DATA")

spark_df_sql.printSchema()

spark_df_sql.show()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M|  3000|
|  Michael|      Rose|        |2000-05-19|     M|  4000|
|   Robert|          |Williams|1978-09-05|     M|  4000|
|    Maria|      Anne|   Jones|1967-12-01|     F|  4000|
|      Jen|      Mary|   Brown|1980-02-17|     F|    -1|
+---------+----------+--------+----------+------+------+



In [8]:
# Let's see another example using `group by`
grouped_df = spark.sql("SELECT gender, count(*) FROM PERSON_DATA GROUP BY gender")
grouped_df.show()

+------+--------+
|gender|count(1)|
+------+--------+
|     F|       2|
|     M|       3|
+------+--------+



# PySpark Streaming Tutorial

PySpark Streaming is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads. It processes real-time data from sources such as a file system folder, TCP socket, S3, Kafka, Flume, Twitter, and Amazon Kinesis, to name a few. The processed data can be pushed to databases, Kafka, live dashboards, etc.

## Streaming from TCP Socket

Use `readStream.format("socket")` on the Spark session object to read data from a socket, and set the `host` and `port` options to the address you want to stream from.

Spark reads the data from the socket and represents it in a `value` column of the DataFrame; `stream_df.printSchema()` shows that single string column.


In [None]:
stream_df = spark.readStream\
    .format("socket")\
    .option("host", "localhost")\
    .option("port", "9090")\
    .load()

After processing, you can stream the DataFrame to the console. In production, you would typically stream it to Kafka, a database, etc.

In [None]:
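# Append mode is used because this query has no aggregation;
# "complete" mode is only supported for aggregated streaming queries.
query = stream_df.writeStream\
    .format("console")\
    .outputMode("append")\
    .start()

# start() returns the query handle; block here until the stream is stopped.
query.awaitTermination()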