# Brief introduction to PySpark

<br>
> **Contents**
1. Spark workflow and data structure
2. Common SQL / PySpark functions and keywords comparison
3. Window functions
4. Data exploration of Ford GoBike dataset
5. ETL example

<br>
---------------------------------------
## 1. Spark workflow, RDD, DataFrame

Spark is a unified analytics engine for large-scale data processing. It follows a functional programming paradigm: operations in the pipeline (transformations) are "lazy", meaning they are not executed immediately but deferred until a result is actually needed.
<br>
The fundamental data structure in Spark is the RDD (Resilient Distributed Dataset). RDDs are spread across many machines in the cluster. <br>
A DataFrame is built on top of RDDs. It is a distributed collection of data organized into named columns, like a table in a relational database. <br>If you want to learn more, everything is covered in detail on the [Databricks blog](https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html).
<br>
<br>
**RDD characteristics:**
* immutable
* in-memory
* lazily evaluated
* parallel
* structured and unstructured data
* two types of operations: transformations and actions

**DataFrame characteristics:** 
* immutable
* in-memory
* resilient
* distributed
* parallel
* structured
* allows SQL/Hive queries
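
The difference between lazy transformations and actions can be illustrated without Spark at all. The sketch below is only a plain-Python analogy, not Spark code: `map` and `filter` build a lazy pipeline, and `list()` plays the role of an action. The names `traced_square` and `log` are made up for this illustration.

```python
# Plain-Python analogy (not Spark): lazy iterators behave like transformations,
# and consuming them behaves like an action.
log = []

def traced_square(x):
    log.append(x)  # record that the function was actually executed
    return x * x

numbers = range(5)
squared = map(traced_square, numbers)            # "transformation": nothing runs yet
evens = filter(lambda v: v % 2 == 0, squared)    # still nothing runs
calls_before_action = len(log)

result = list(evens)                             # "action": forces the whole pipeline
calls_after_action = len(log)

print(calls_before_action, calls_after_action, result)
```

In Spark the same idea applies to calls such as `select()` or `where()` (transformations) versus `show()` or `count()` (actions).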

First we need to install pyspark and import all dependencies. Remember that Spark runs on the JVM, so you also need a compatible Java distribution on your machine (Java 8 when this notebook was written; check the documentation of your Spark version).

In [None]:
!pip install pyspark

In [None]:
import pyspark
from pyspark.sql import SparkSession, Row, functions as f
from pyspark.sql.window import Window
from pyspark.sql.types import StringType, TimestampType, DoubleType, FloatType, IntegerType, LongType, StructField, StructType
from IPython.display import Image

import random
random.seed(1990)

The `SparkSession` class is the entry point for any Spark application. It lets you interact with the Spark API. 
<br>
`getOrCreate()` returns the already running session if one exists, and creates a new one otherwise.

In [None]:
spark = SparkSession.builder.appName("spark_app").getOrCreate()
spark

Let's generate our first DataFrame with some random data.

In [None]:
# Regions
geo_id = [random.choice(["regA","regB","regC","regD","regE"]) for x in range(500)]

# Products
prod_id = [random.choice(["prodA","prodB","prodC","prodD","prodE","prodF",
                          "prodG","prodH","prodI","prodJ","prodK","prodL"]) for x in range(500)]

# Values
value = [random.uniform(1000,10000) for x in range(500)]
value[5] = None
value[15] = None
value[245] = None

In [None]:
df = spark.createDataFrame([Row(prod=p, geo=g, val=v) for p,g,v in zip(prod_id, geo_id, value)])
df.show(7)

In [None]:
df.createOrReplaceTempView("train_df")

In [None]:
geo_df = spark.createDataFrame([Row(geo_id = "regA", geo_name = "Europe"),
                                Row(geo_id = "regB", geo_name = "Asia"),
                                Row(geo_id = "regC", geo_name = "N_America"),
                                Row(geo_id = "regD", geo_name = "S_America"),
                                Row(geo_id = "regE", geo_name = "Africa")])

In [None]:
geo_df.createOrReplaceTempView("geo_df")
geo_df.show()

<br>
## 2. SQL / Spark functions comparison

In [None]:
# If True, the SQL query is executed; if False, the equivalent Spark DataFrame syntax is used
sql = False

In [None]:
# SQL and Spark equivalent:
spark.sql("SELECT prod FROM train_df").show(7) if sql else \
df.select("prod").show(7)

In [None]:
prod_ids = ["prodA","prodB","prodC","prodD","prodE","prodF", "prodG","prodH","prodI","prodJ","prodK","prodL"]
prod_names = ["smartphone", "PC", "laptop", "headphones", "tv", "speaker", 
              "keyboard", "mouse", "charger", "powerbank", "microphone", "camera"]

prod_df = spark.createDataFrame([Row(prod_id = i, prod_name = n) for i,n in zip(prod_ids, prod_names)])
prod_df.createOrReplaceTempView("prod_df")
prod_df.show()

In [None]:
# SQL example of INNER JOIN
spark.sql("SELECT geo_df.geo_name, prod_df.prod_name, train_df.val FROM train_df \
            INNER JOIN prod_df ON prod_df.prod_id = train_df.prod \
            INNER JOIN geo_df ON geo_df.geo_id = train_df.geo").show(7)

In [None]:
# Spark join() is also 'inner' by default
df.groupBy("prod").agg(f.round(f.sum("val"), 2).alias("total value"))\
                  .sort("total value")\
                  .join(prod_df, df.prod == prod_df.prod_id)\
                  .select("prod_name", "total value")\
                  .show()

In [None]:
# WHERE / where()
# SQL and Spark equivalent:
spark.sql("SELECT * FROM train_df WHERE prod != 'prodA' AND val > 9900").show() if sql else \
df.where(df["prod"] != "prodA").where(f.col("val") > 9900).show()

In [None]:
# LIKE / like()
# SQL and Spark equivalent:
spark.sql("select * from train_df where prod like '%A'").show(7) if sql else \
df.where(df.prod.like('%A')).show(7)

In [None]:
# orderBy()
df.groupBy(["prod","geo"]).sum().orderBy("geo").show(7)

In [None]:
# SUM, AVG, COUNT / sum(), avg(), count()
# SQL and Spark equivalent:
q = ("SELECT prod, geo, SUM(val) val_sum, AVG(val) val_avg, COUNT(*), COUNT(val) FROM train_df GROUP BY prod, geo")
spark.sql(q).show(7) if sql else \
df.groupBy(["prod","geo"]).agg(f.sum("val").alias("val_sum"), f.avg("val").alias("val_avg"), 
                               f.count("*"), f.count("val")).show(7)
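
The query above returns both `COUNT(*)` and `COUNT(val)` because they differ when `val` contains nulls: the first counts rows, the second counts non-null values, and `SUM`/`AVG` skip nulls as well. The sketch below shows this on three hypothetical rows, using Python's built-in `sqlite3` as a stand-in SQL engine (SQLite's dialect is not Spark SQL, but these aggregates follow the same standard null rules).

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Spark SQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train_df (prod TEXT, geo TEXT, val REAL)")
conn.executemany(
    "INSERT INTO train_df VALUES (?, ?, ?)",
    [("prodA", "regA", 100.0), ("prodA", "regA", None), ("prodA", "regA", 300.0)],
)

row = conn.execute(
    "SELECT COUNT(*), COUNT(val), SUM(val), AVG(val) "
    "FROM train_df GROUP BY prod, geo"
).fetchone()
conn.close()

# COUNT(*) sees 3 rows, COUNT(val) only the 2 non-null values;
# AVG divides by the non-null count, not by the row count.
print(row)
```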

In [None]:
# Select unique combinations
# DISTINCT / distinct()
spark.sql("SELECT DISTINCT prod, geo FROM train_df").show(5) if sql else \
df.select("prod", "geo").distinct().show(5)

# alternative
# df.dropDuplicates(["prod", "geo"]).show(5)

In [None]:
# Drop rows with any null values
# SQL and Spark equivalent:
spark.sql("select * from train_df where val is not null").count() if sql else \
df.dropna("any").count()

In [None]:
# Show rows with null values in val column
df.where(df.val.isNull()).show()
# df.where(f.isnull("val")).show()

In [None]:
# Replace null values with 1
# SQL and Spark equivalent:
q = "SELECT prod, geo, IF(val is null, 1, val) AS val FROM train_df"
spark.sql(q).show() if sql else \
df.fillna(1).show(7)
# df.fillna({"val": 1}).show(7)

In [None]:
# Replacing values with REGEXP_REPLACE / replace()
# SQL and Spark equivalent:
spark.sql("SELECT geo, REGEXP_REPLACE(prod, 'prodE', 'Product E') AS prod, val FROM train_df").show(3) if sql else \
df.replace("prodE", "Product E").show(3)

# or with dictionary
# df.replace({"prodA": "Product A", "prodB": "Product B"}).show(3)

In [None]:
# Rename columns
# SQL and Spark equivalent:
spark.sql("SELECT prod, geo, val AS volume FROM train_df").show(3) if sql else \
df.withColumnRenamed("val", "volume").show(3)

# df.select(df.val.alias("volume")).show(3)

In [None]:
# New column
# SQL and Spark equivalent:
spark.sql("SELECT *, val/1000 AS minival FROM train_df").show(3) if sql else \
df.withColumn("minival", df["val"] / 1000).show(3)
# df.select("*", (df.val/1000).alias("minival")).show(3)

In [None]:
# CASE WHEN THEN/ when() otherwise()
# SQL and Spark equivalent:
q = "SELECT prod, CASE WHEN val > 7500 THEN 1 WHEN val < 2500 THEN 3 ELSE 2 END AS out FROM train_df"
spark.sql(q).show(5) if sql else \
df.select(df.prod, f.when(df.val > 7500, 1).when(df.val < 2500, 3).otherwise(2).alias("out")).show(5)
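
One detail worth checking in the `CASE WHEN` example is what happens to null values: a comparison with null is never true, so such rows fall through to the `ELSE` branch (`otherwise()` in Spark). The sketch below runs the same expression on four hypothetical rows with Python's built-in `sqlite3`, used here only as a stand-in SQL engine; the alias `bucket` is made up for this illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Spark SQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train_df (prod TEXT, val REAL)")
conn.executemany(
    "INSERT INTO train_df VALUES (?, ?)",
    [("prodA", 8000.0), ("prodB", 1000.0), ("prodC", 5000.0), ("prodD", None)],
)

labels = conn.execute(
    "SELECT prod, "
    "CASE WHEN val > 7500 THEN 1 WHEN val < 2500 THEN 3 ELSE 2 END AS bucket "
    "FROM train_df ORDER BY prod"
).fetchall()
conn.close()

# prodD has a null value, so neither condition matches and it gets the ELSE value
print(labels)
```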

In [None]:
# SUBSTRING / substring() 
# SQL and Spark equivalent:
spark.sql("SELECT SUBSTRING(prod, 4, 2) AS id FROM train_df").show(5) if sql else \
df.select(f.substring("prod", 4, 2).alias("id")).show(5)
# or
# df.select(df.prod.substr(4, 2).alias("id")).show(5)

<br>
## 3. Window functions
A window function calculates a return value for every input row of a table based on a group of rows (the window), without collapsing those rows the way `GROUP BY` does. 

In [None]:
windowSpec = Window.partitionBy('prod')

# SQL and Spark equivalent:
spark.sql("SELECT prod, val, SUM(val) OVER (PARTITION BY prod) AS prod_val FROM train_df").show(3) if sql else \
df.select("prod", "val", f.sum("val").over(windowSpec).alias("prod_val")).show(3)

In [None]:
windowSpec = Window.partitionBy("prod").orderBy("geo")

# rank() gives tied rows the same rank and leaves gaps after the ties (1, 1, 3, ...)
# dense_rank() gives tied rows the same rank without leaving gaps (1, 1, 2, ...)
# row_number() gives every row a unique consecutive number, even for ties
df.withColumn("ranked", f.rank().over(windowSpec))\
  .withColumn("ranked_dense", f.dense_rank().over(windowSpec))\
  .withColumn("row_number", f.row_number().over(windowSpec)).show(10)
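
To make the difference between the three ranking functions concrete, here is a small plain-Python sketch that mimics them for a single partition. It is not Spark code; `rank_rows` is a hypothetical helper written only for this illustration.

```python
def rank_rows(values):
    """Return (value, rank, dense_rank, row_number) for one sorted partition."""
    out = []
    rank = dense = 0
    prev = object()  # sentinel that never equals a real value
    for row_number, v in enumerate(sorted(values), start=1):
        if v != prev:
            rank = row_number  # rank jumps to the current position (gaps after ties)
            dense += 1         # dense rank grows by one per distinct value
            prev = v
        out.append((v, rank, dense, row_number))
    return out

ranked = rank_rows(["regA", "regA", "regB", "regC", "regC", "regD"])
for row in ranked:
    print(row)
```

Ties share the same `rank` and `dense_rank`, while `row_number` keeps counting; after a tie, `rank` skips ahead and `dense_rank` does not.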

In [None]:
windowSpec = Window.partitionBy("prod").orderBy("val")

# Cumulative sum
df.withColumn("sum_from_start", f.sum(df.val).over(windowSpec)).show(5)
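
The cumulative sum restarts in every partition and grows in the order given by `orderBy`. The plain-Python sketch below reproduces that logic on a few hypothetical rows with distinct values; it is an analogy, not Spark code.

```python
from itertools import accumulate, groupby

# Hypothetical (prod, val) rows
rows = [("prodA", 30.0), ("prodB", 5.0), ("prodA", 10.0), ("prodA", 20.0), ("prodB", 15.0)]

# Sorting by (prod, val) corresponds to partitionBy("prod").orderBy("val")
rows_sorted = sorted(rows)

running = []
for prod, group in groupby(rows_sorted, key=lambda r: r[0]):
    vals = [v for _, v in group]
    # accumulate() yields the running total within this partition only
    running.extend(zip([prod] * len(vals), vals, accumulate(vals)))

for row in running:
    print(row)
```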

In [None]:
# Stop the app
spark.stop()

<br>
## 4. Bike trips data exploration

Bike rental systems are a growing part of the mobility market. I will analyze Lyft bikeshare system data from 2017 and 2018. <br>
The data is available at https://www.lyft.com/bikes/bay-wheels/system-data 
<br>
<br>
We need to merge multiple CSV files into one Spark DataFrame.
<br>
-------------------------------------------------------------------------------------


In [None]:
spark = SparkSession.builder.appName('bike_app').master("local[*]").getOrCreate()

In [None]:
# Merge all 2018 CSV files into one DataFrame
files = "../input/ford-gobike-data/2018*.csv"
goBike = spark.read.csv(files, header=True, inferSchema=True)
print(f"Total Records = {goBike.count()}")

In [None]:
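# Add 2017 data
# Drop the extra column first so that both frames have the same columns (the union matches columns by position)
goBike = goBike.drop("bike_share_for_all_trip")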
goBike = goBike.unionAll(spark.read.csv("../input/ford-gobike-data/2017-fordgobike-tripdata.csv", header=True, inferSchema=True))
print(f"Total Records = {goBike.count()}")

># [↑ Almost 2 mln records]()

In [None]:
goBike.printSchema()

In [None]:
# Example record
goBike.show(1, vertical=True)

In [None]:
# Some data is missing
goBike.where(goBike.member_birth_year.isNull()).count()

In [None]:
# 1. Remove rows with null values
goBike = goBike.dropna("any")

In [None]:
# 2. Distribution of "member_gender" variable
goBike.groupBy("member_gender").count().show()

>About 75% of users are men
># [Men: 3x more often]()

In [None]:
# 3. Compute min, max and average age of customers (age as of 2020)
goBike.select((2020 - goBike["member_birth_year"]).alias("age")).describe()\
      .select("summary", f.round("age", 0).alias("age")).show()

> Age of the oldest cyclist in the dataset. Probably a random number entered in the registry. 
> # [139 years old (ಠ_ಠ)]()

In [None]:
# 4. Count total number of unique bikes
goBike.select("bike_id").distinct().count()

In [None]:
# 5. Count total number of unique stations
goBike.select("start_station_id").union(goBike.select("end_station_id")).distinct().count()

In [None]:
# 6. Find the bikes with the shortest and longest total rental time
goBike.groupBy("bike_id").agg(f.sum("duration_sec").alias("total_time")).orderBy("total_time").show(1)
goBike.groupBy("bike_id").agg(f.sum("duration_sec").alias("total_time")).orderBy(f.desc("total_time")).show(1)

> The longest total rental time of a single bike
> # [11 days]()

In [None]:
# 7. Calculate average time of single rental
goBike.select(f.avg("duration_sec").alias("average")).show()

> Average rental
> # [13 minutes]()

In [None]:
# 8. Find stations with the most traffic between them
goBike.select(f.when(goBike["start_station_id"] > goBike["end_station_id"], 
                     f.array(goBike["start_station_id"], goBike["end_station_id"]))\
                     .otherwise(f.array(goBike["end_station_id"], goBike["start_station_id"]))\
                     .alias("route"))\
      .groupBy("route")\
      .count()\
      .orderBy(f.desc("count"))\
      .show(1)
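
The `when()/otherwise()` expression above puts the larger station id first, so a trip from x to y and a trip from y to x end up with the same `route` value. The same trick in plain Python, on a handful of hypothetical station pairs:

```python
from collections import Counter

# Hypothetical (start_station_id, end_station_id) pairs
trips = [(6, 15), (15, 6), (6, 15), (3, 6), (15, 15)]

# Canonical route: larger id first, so direction no longer matters
route_counts = Counter((max(a, b), min(a, b)) for a, b in trips)

top_route, top_count = route_counts.most_common(1)[0]
print(top_route, top_count)
```

Here the three trips between stations 6 and 15 are counted as one route, regardless of direction.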

In [None]:
# Show stations from route above
goBike.filter((goBike.start_station_id == 6) | (goBike.start_station_id == 15)) \
      .select("start_station_name").distinct().show(truncate=False)
# Source: Google Maps
Image("../input/images/popular_route.JPG", width=800)

In [None]:
# 9. Find rush hour
goBike.select(f.hour("start_time").alias("hour"))\
.groupBy("hour").count().orderBy(f.desc("count")).show(7)

>Rush hours during the day
># [8am & 5pm]()

In [None]:
# 10. Find the average number of rentals per weekday
goBike.select(f.date_format("start_time", "dd.MM.yyyy").alias("date"), \
              f.date_format("start_time", "E").alias("weekday")) \
      .groupBy("date", "weekday").count() \
      .groupBy("weekday").agg(f.avg("count").alias("avg_use")) \
      .orderBy("avg_use").show()
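
The query above aggregates twice: first it counts trips per calendar day, then it averages those daily counts per weekday. A plain-Python sketch of the same two steps, on hypothetical timestamps (weekdays are numbered 0 = Monday to avoid locale-dependent names):

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical start times: 2018-01-01 and 2018-01-08 are Mondays, 2018-01-02 is a Tuesday
start_times = [
    "2018-01-01 08:00:00", "2018-01-01 17:00:00",
    "2018-01-02 09:00:00",
    "2018-01-08 08:00:00", "2018-01-08 12:00:00",
    "2018-01-08 18:00:00", "2018-01-08 19:00:00",
]

# Step 1: number of trips per calendar day
per_day = Counter(
    datetime.strptime(t, "%Y-%m-%d %H:%M:%S").date() for t in start_times
)

# Step 2: average of the daily counts per weekday
by_weekday = defaultdict(list)
for day, n_trips in per_day.items():
    by_weekday[day.weekday()].append(n_trips)

avg_use = {wd: sum(counts) / len(counts) for wd, counts in by_weekday.items()}
print(avg_use)
```

Averaging the daily counts (rather than counting all trips per weekday) keeps the result comparable even if some weekdays occur more often in the data than others.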

>Most popular day for a ride
># [Tuesday]()

In [None]:
# 11. Calculate average distance between stations for all trips
from math import radians, cos, sin, asin, sqrt
from pyspark.sql.types import FloatType

# Credit: Michael Dunn https://stackoverflow.com/a/4913653
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers
    return c * r

haversine_udf = f.udf(haversine, FloatType())
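
A Python UDF is called once per row with plain Python values, and null columns arrive as `None`. We removed null rows earlier, so the function above is safe here, but a null-tolerant variant can be sketched as follows. `haversine_or_none` is a hypothetical name, and the block also sanity-checks the formula on two cases with known answers.

```python
from math import radians, cos, sin, asin, sqrt

def haversine_or_none(lon1, lat1, lon2, lat2):
    """Great-circle distance in km, or None if any coordinate is missing."""
    if None in (lon1, lat1, lon2, lat2):
        return None
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

same_point = haversine_or_none(-122.4, 37.77, -122.4, 37.77)
one_degree_lat = haversine_or_none(0.0, 0.0, 0.0, 1.0)   # about 111.19 km
missing = haversine_or_none(None, 37.77, -122.4, 37.77)

print(same_point, round(one_degree_lat, 2), missing)
```

Such a function could be wrapped with `f.udf(..., FloatType())` in the same way as `haversine` above.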

In [None]:
goBike.select(haversine_udf("start_station_longitude", "start_station_latitude", \
                            "end_station_longitude", "end_station_latitude").alias("distance")) \
.agg(f.avg("distance").alias("avg distance [km]")).show()

>Average distance of a single bike trip
># [1.59 km]()


In [None]:
spark.stop()

## 5. ETL pipeline example

Now, based on our data, let's create an example ETL process:
* Extract: load the data from the source into a DataFrame with a defined structure
* Transform: create a new DataFrame with daily data
* Load: save the transformed DataFrame to a file or convert it to pandas

<br>
<br>
### EXTRACT
This time, let's define the schema of the loaded CSV data explicitly instead of inferring it:

In [None]:
spark = SparkSession.builder.appName('etl_app').master("local[*]").getOrCreate()

def load_with_schema(spark):

    schema = StructType([
        StructField("duration_sec", IntegerType(), True),
        StructField("start_time", TimestampType(), True),
        StructField("end_time", TimestampType(), True),
        StructField("start_station_id", StringType(), True),
        StructField("start_station_name", StringType(), True),
        StructField("start_station_latitude", DoubleType(), True),
        StructField("start_station_longitude", DoubleType(), True),
        StructField("end_station_id", StringType(), True),
        StructField("end_station_name", StringType(), True),
        StructField("end_station_latitude", DoubleType(), True),
        StructField("end_station_longitude", DoubleType(), True),
        StructField("bike_id", IntegerType(), True),
        StructField("user_type", StringType(), True),
        StructField("member_birth_year", IntegerType(), True),
        StructField("member_gender", StringType(), True)
    ])

    df = spark \
        .read \
        .format("csv") \
        .schema(schema)         \
        .option("header", "true") \
        .load("../input/ford-gobike-data/*.csv")

    return df

goBike = load_with_schema(spark)
print(f"Total Records = {goBike.count()}")


<br>
### TRANSFORM
We will create a new DataFrame `dailyData` with the data aggregated by day. <br>
Let's transform our data into these new columns: 

- 'date' : date 
- 'avg_duration_sec' : average rental duration on that day
- 'n_trips' : total number of trips on that day
- 'n_bikes' : number of unique bikes rented on that day
- 'n_routes' : number of unique station combinations (x -> y == y -> x) on that day
- 'n_subscriber' : number of rentals made by subscribers on that day

In [None]:
dailyData = goBike.withColumn("date", f.date_format("start_time", "dd.MM.yyyy")) \
                  .groupBy("date") \
                  .agg(f.avg("duration_sec").alias("avg_duration_sec"), 
                       f.count("*").alias("n_trips"), 
                       f.countDistinct("bike_id").alias("n_bikes"), 
                       f.sum(f.when(goBike.user_type == "Subscriber", 1).otherwise(0))
                       .alias("n_subscriber"))

In [None]:
temp = goBike.select(f.date_format("start_time", "dd.MM.yyyy").alias("date"), 
                    f.when(goBike["start_station_id"] > goBike["end_station_id"], 
                           f.array(goBike["start_station_id"], goBike["end_station_id"]))\
                    .otherwise(f.array(goBike["end_station_id"], goBike["start_station_id"])).alias("route"))\
            .groupBy("date").agg(f.countDistinct("route").alias("n_routes"))

In [None]:
dailyData = dailyData.join(temp, "date")
dailyData.show()
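
To make the meaning of each column concrete, the sketch below computes the same daily aggregates in plain Python for four hypothetical trips. It is only an illustration of the logic, not a replacement for the Spark code above; station ids are strings, as in the schema defined in the EXTRACT step.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (start_time, duration_sec, bike_id, start_station_id, end_station_id, user_type)
trips = [
    ("2018-01-01 08:00:00", 600, 1, "6", "15", "Subscriber"),
    ("2018-01-01 09:00:00", 300, 1, "15", "6", "Customer"),
    ("2018-01-01 17:30:00", 900, 2, "3", "6", "Subscriber"),
    ("2018-01-02 08:15:00", 1200, 2, "6", "15", "Subscriber"),
]

# Group rows by formatted date, like groupBy("date")
by_date = defaultdict(list)
for start_time, *rest in trips:
    date = datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S").strftime("%d.%m.%Y")
    by_date[date].append(rest)

daily = {}
for date, rows in by_date.items():
    daily[date] = {
        "avg_duration_sec": sum(r[0] for r in rows) / len(rows),
        "n_trips": len(rows),
        "n_bikes": len({r[1] for r in rows}),
        # canonical (larger, smaller) pair, so x -> y and y -> x count as one route
        "n_routes": len({(max(r[2], r[3]), min(r[2], r[3])) for r in rows}),
        "n_subscriber": sum(1 for r in rows if r[4] == "Subscriber"),
    }

for date, stats in daily.items():
    print(date, stats)
```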

<br>
### LOAD
Be aware that these operations can be computationally expensive, depending on how big your DataFrame is.


In [None]:
# Option 1: save the transformed data to a .parquet file
# dailyData.write.parquet('path/to/location/transformed.parquet')

# Option 2: save the transformed data to .csv
# dailyData.write.csv('path/to/location/transformed.csv')

# Option 3: convert the transformed data to a pandas DataFrame (collects everything on the driver)
# dailyData.toPandas()

In [None]:
spark.stop()

That's it. If you like it, please upvote. Thank you