<div style="float:right; padding-top: 15px; padding-right: 15px">
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="250">
        </a>
    </div>
</div>

# Spark

## 1. Introduction

* When a dataset exceeds your computer's memory (RAM, or even disk), [Big Data](https://en.wikipedia.org/wiki/Big_data) tools split it into chunks and process them step by step
* Big Data tools automate this process and handle the details under the hood with little extra code
* [Spark](https://spark.apache.org/) is one of the most popular Big Data frameworks
* Spark's DataFrame syntax resembles the pandas API, with some differences
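The chunk-by-chunk idea from the first bullet can be sketched in plain Python. This is only an illustration of the principle, not how Spark is implemented: the in-memory CSV, the `mean_price_in_chunks` helper and the `chunk_size` parameter are all hypothetical names made up for the example.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for a file too large to load at once
RAW_CSV = (
    "route,price\n"
    "MAD-BCN,68.95\n"
    "MAD-BCN,107.70\n"
    "MAD-SEV,47.30\n"
    "MAD-BCN,75.40\n"
    "MAD-SEV,53.40\n"
)


def mean_price_in_chunks(handle, chunk_size=2):
    """Compute the mean price while holding at most `chunk_size` rows at a time."""
    reader = csv.DictReader(handle)
    total, count, chunk = 0.0, 0, []
    for row in reader:
        chunk.append(float(row["price"]))
        if len(chunk) == chunk_size:
            # fold the finished chunk into the running totals and drop it
            total += sum(chunk)
            count += len(chunk)
            chunk = []
    # fold the last, possibly partial, chunk
    total += sum(chunk)
    count += len(chunk)
    return total / count


mean_price = mean_price_in_chunks(io.StringIO(RAW_CSV))
print(round(mean_price, 2))
```

Only the running totals survive between chunks, so memory use stays bounded no matter how many rows the source holds.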

## 2. Installation

You need Java to run Spark because it is written in [Scala](https://www.scala-lang.org/), a JVM-based language with a functional style.

### 2.1 Java installation

#### 2.1.1 Conda

```
conda install openjdk -y
```

#### 2.1.2 Apt

```
sudo apt install default-jdk
```
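Before installing PySpark it can help to confirm that a `java` executable is visible on the `PATH`. The following is a small sketch using only the Python standard library; it assumes the executable is named `java`, which may not hold for every setup.

```python
import shutil
import subprocess

# shutil.which returns the full path of the executable, or None if it is not on PATH
java_path = shutil.which("java")

if java_path is None:
    print("Java not found: install it with one of the commands above")
else:
    # `java -version` writes its report to stderr rather than stdout
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    print(result.stderr.strip() or result.stdout.strip())
```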

### 2.2 PySpark installation

#### 2.2.1 Conda

At the time of writing, the conda package only works with Python 3.7

```
conda install pyspark -y
```

#### 2.2.2 Pip

```
pip install pyspark
```

## 3. Setup

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
            .appName('big_data_session') \
            .master('local[*]') \
            .config('spark.ui.showConsoleProgress', True) \
            .config('spark.sql.repl.eagerEval.enabled', True) \
            .getOrCreate()

In [2]:
spark

## 4. Data processing

The data can be downloaded from the following URL:

- [Spanish Rail Tickets Pricing](https://www.kaggle.com/thegurusteam/spanish-high-speed-rail-system-ticket-pricing)

In [3]:
DATA_PATH = '/home/ubuntu/Desktop/renfe.csv'

sdf = spark.read.option('quote', '"').option('escape', '"').csv(DATA_PATH, 
                                                                header=True, 
                                                                inferSchema=True)

sdf

id,company,origin,destination,departure,arrival,duration,vehicle_type,vehicle_class,price,fare,seats,meta,insert_date
1,renfe,MADRID,BARCELONA,2019-04-18 05:50:00,2019-04-18 08:55:00,3.08,AVE,Preferente,68.95,Promo,,{},2019-04-11 21:49:46
2,renfe,MADRID,BARCELONA,2019-04-18 13:25:00,2019-04-18 16:24:00,2.98,AVE-TGV,Turista,107.7,Flexible,,{},2019-04-11 21:49:46
3,renfe,MADRID,BARCELONA,2019-04-18 06:30:00,2019-04-18 09:20:00,2.83,AVE,Turista,75.4,Promo,,{},2019-04-11 21:49:46
4,renfe,MADRID,BARCELONA,2019-04-18 15:30:00,2019-04-18 18:40:00,3.17,AVE,Preferente,,Promo,,{},2019-04-11 21:49:46
5,renfe,MADRID,BARCELONA,2019-04-18 07:00:00,2019-04-18 09:30:00,2.5,AVE,Turista Plus,106.75,Promo,,{},2019-04-11 21:49:46
6,renfe,MADRID,BARCELONA,2019-04-18 06:30:00,2019-04-18 09:20:00,2.83,AVE,Turista,75.4,Promo,,{},2019-04-11 21:49:46
7,renfe,MADRID,BARCELONA,2019-04-18 07:30:00,2019-04-18 10:40:00,3.17,AVE,Turista Plus,90.5,Promo,,{},2019-04-11 21:49:46
8,renfe,MADRID,BARCELONA,2019-04-18 19:00:00,2019-04-18 21:30:00,2.5,AVE,Preferente,115.65,Promo,,{},2019-04-11 21:49:46
9,renfe,MADRID,BARCELONA,2019-04-18 08:00:00,2019-04-18 10:30:00,2.5,AVE,Turista,88.95,Promo,,{},2019-04-11 21:49:46
10,renfe,MADRID,BARCELONA,2019-04-18 08:00:00,2019-04-18 10:30:00,2.5,AVE,Turista,88.95,Promo,,{},2019-04-11 21:49:46


__VERY IMPORTANT INFO: `sdf` is a Spark DataFrame, which means it is a distributed DataFrame, not a typical Python object that lives in RAM (like a pandas DataFrame)__

- Databricks (founded by Spark's creators) describes a Spark DataFrame as follows:

_"In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a dataframe in Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs (Resilient Distributed Datasets)"_

- Spark DataFrames do not live in the memory of a computer or cluster node; they are evaluated lazily, only when a computation needs their contents
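Lazy evaluation can be illustrated with a plain-Python analogy based on generators. This is not Spark code and not how Spark works internally; it only shows the same behaviour of describing a pipeline first and computing it later. The names `read_rows` and `evaluated` are hypothetical.

```python
evaluated = []  # records which rows were actually processed


def read_rows():
    """Yield prices one by one, logging each row that is really read."""
    for price in [68.95, 107.7, 75.4, 106.75]:
        evaluated.append(price)
        yield price


# Describing the pipeline processes nothing yet: generators are lazy
cheap = (p for p in read_rows() if p < 100)
rows_before = len(evaluated)

# Consuming the pipeline is what triggers the computation
result = list(cheap)
rows_after = len(evaluated)

print(rows_before, rows_after, result)
```

In Spark the roles are similar: calls such as `filter` or `select` only describe the work, while calls such as `count` or `toPandas` make it run.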

#### types

In [4]:
sdf.dtypes

[('id', 'int'),
 ('company', 'string'),
 ('origin', 'string'),
 ('destination', 'string'),
 ('departure', 'string'),
 ('arrival', 'string'),
 ('duration', 'double'),
 ('vehicle_type', 'string'),
 ('vehicle_class', 'string'),
 ('price', 'double'),
 ('fare', 'string'),
 ('seats', 'int'),
 ('meta', 'string'),
 ('insert_date', 'string')]

In [5]:
from pyspark.sql import functions as sf

from pyspark.sql.types import TimestampType

for dt_column in ['departure', 'arrival', 'insert_date']:
    # cast each datetime column in turn (use the loop variable, not a fixed column name)
    sdf = sdf.withColumn(dt_column, sdf[dt_column].cast(TimestampType()))

sdf

id,company,origin,destination,departure,arrival,duration,vehicle_type,vehicle_class,price,fare,seats,meta,insert_date
1,renfe,MADRID,BARCELONA,2019-04-18 05:50:00,2019-04-18 08:55:00,3.08,AVE,Preferente,68.95,Promo,,{},2019-04-11 21:49:46
2,renfe,MADRID,BARCELONA,2019-04-18 13:25:00,2019-04-18 16:24:00,2.98,AVE-TGV,Turista,107.7,Flexible,,{},2019-04-11 21:49:46
3,renfe,MADRID,BARCELONA,2019-04-18 06:30:00,2019-04-18 09:20:00,2.83,AVE,Turista,75.4,Promo,,{},2019-04-11 21:49:46
4,renfe,MADRID,BARCELONA,2019-04-18 15:30:00,2019-04-18 18:40:00,3.17,AVE,Preferente,,Promo,,{},2019-04-11 21:49:46
5,renfe,MADRID,BARCELONA,2019-04-18 07:00:00,2019-04-18 09:30:00,2.5,AVE,Turista Plus,106.75,Promo,,{},2019-04-11 21:49:46
6,renfe,MADRID,BARCELONA,2019-04-18 06:30:00,2019-04-18 09:20:00,2.83,AVE,Turista,75.4,Promo,,{},2019-04-11 21:49:46
7,renfe,MADRID,BARCELONA,2019-04-18 07:30:00,2019-04-18 10:40:00,3.17,AVE,Turista Plus,90.5,Promo,,{},2019-04-11 21:49:46
8,renfe,MADRID,BARCELONA,2019-04-18 19:00:00,2019-04-18 21:30:00,2.5,AVE,Preferente,115.65,Promo,,{},2019-04-11 21:49:46
9,renfe,MADRID,BARCELONA,2019-04-18 08:00:00,2019-04-18 10:30:00,2.5,AVE,Turista,88.95,Promo,,{},2019-04-11 21:49:46
10,renfe,MADRID,BARCELONA,2019-04-18 08:00:00,2019-04-18 10:30:00,2.5,AVE,Turista,88.95,Promo,,{},2019-04-11 21:49:46


In [6]:
sdf.dtypes

[('id', 'int'),
 ('company', 'string'),
 ('origin', 'string'),
 ('destination', 'string'),
 ('departure', 'timestamp'),
 ('arrival', 'timestamp'),
 ('duration', 'double'),
 ('vehicle_type', 'string'),
 ('vehicle_class', 'string'),
 ('price', 'double'),
 ('fare', 'string'),
 ('seats', 'int'),
 ('meta', 'string'),
 ('insert_date', 'timestamp')]

#### sample data

In [7]:
sdf_sample = sdf.sample(fraction=0.01, withReplacement=False)

sdf_sample

id,company,origin,destination,departure,arrival,duration,vehicle_type,vehicle_class,price,fare,seats,meta,insert_date
122,renfe,MADRID,BARCELONA,2019-05-22 19:00:00,2019-05-22 21:30:00,2.5,AVE,Turista Plus,82.35,Promo,,{},2019-04-11 21:50:04
189,renfe,MADRID,BARCELONA,2019-04-27 10:30:00,2019-04-27 13:15:00,2.75,AVE,Turista,58.15,Promo,,{},2019-04-11 21:50:20
207,renfe,MADRID,BARCELONA,2019-04-27 19:30:00,2019-04-27 22:40:00,3.17,AVE,Turista Plus,90.5,Promo,,{},2019-04-11 21:50:20
255,renfe,MADRID,BARCELONA,2019-05-28 17:00:00,2019-05-28 19:30:00,2.5,AVE,Turista,88.95,Promo,,{},2019-04-11 21:50:23
271,renfe,MADRID,BARCELONA,2019-05-28 08:20:00,2019-05-28 11:05:00,2.75,AVE,Turista,75.4,Promo,,{},2019-04-11 21:50:23
292,renfe,MADRID,SEVILLA,2019-06-01 13:10:00,2019-06-01 20:51:00,7.68,MD-LD,Preferente,,Promo,,{},2019-04-11 21:50:34
329,renfe,MADRID,SEVILLA,2019-06-02 19:00:00,2019-06-02 21:38:00,2.63,AVE,Preferente,69.4,Promo,,{},2019-04-11 21:50:38
668,renfe,MADRID,SEVILLA,2019-05-21 08:00:00,2019-05-21 10:32:00,2.53,AVE,Turista,47.3,Promo,,{},2019-04-11 21:51:50
777,renfe,MADRID,SEVILLA,2019-04-22 18:00:00,2019-04-22 20:32:00,2.53,AVE,Turista,53.4,Promo,,{},2019-04-11 21:51:53
857,renfe,MADRID,BARCELONA,2019-06-01 20:30:00,2019-06-01 23:40:00,3.17,AVE,Turista Plus,59.5,Promo,,{},2019-04-11 21:52:08
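Note that `fraction` is a per-row probability rather than an exact share: PySpark documents that `sample` is not guaranteed to return exactly that fraction of the rows. A plain-Python sketch of this kind of sampling without replacement follows; `bernoulli_sample` is a hypothetical helper written for the illustration, not a Spark function.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction` (no replacement)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = range(100_000)
sample = bernoulli_sample(rows, fraction=0.01, seed=42)

# The size is close to 1% of the input, but not exactly 1,000
print(len(sample))
```

Passing a `seed`, which `sdf.sample` also accepts, makes the sample reproducible between runs.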


#### persist data

In [8]:
SAMPLE_PATH = '/home/ubuntu/Desktop/renfe_sample'

sdf_sample.write.mode('overwrite').parquet(SAMPLE_PATH)

#### query data

In [9]:
sdf_sample = spark.read.parquet(SAMPLE_PATH)

sdf_sample

id,company,origin,destination,departure,arrival,duration,vehicle_type,vehicle_class,price,fare,seats,meta,insert_date
27299680,renfe,MADRID,GIRONA,2020-05-24 06:20:00,2020-05-24 10:23:00,4.05,AVE,,,,,{},2020-03-15 12:19:09
27299730,renfe,MADRID,LLEIDA,2020-03-30 15:30:00,2020-03-30 17:35:00,2.08,AVE,,,,,{},2020-03-15 12:19:10
27299837,renfe,MADRID,GIRONA,2020-05-27 19:00:00,2020-05-27 22:18:00,3.3,AVE,,,,,{},2020-03-15 12:19:12
27299972,renfe,MADRID,LLEIDA,2020-04-16 11:30:00,2020-04-16 13:35:00,2.08,AVE,,,,,{},2020-03-15 12:19:15
27300049,renfe,MADRID,LLEIDA,2020-05-03 09:30:00,2020-05-03 11:29:00,1.98,AVE,,,,,{},2020-03-15 12:19:50
27300479,renfe,MADRID,LLEIDA,2020-05-07 09:30:00,2020-05-07 11:29:00,1.98,AVE,,,,,{},2020-03-15 12:19:59
27300508,renfe,TARRAGONA,MADRID,2020-04-17 08:33:00,2020-04-17 11:10:00,2.62,AVE,,,,,{},2020-03-15 12:20:00
27300757,renfe,MADRID,LLEIDA,2020-05-08 17:30:00,2020-05-08 19:35:00,2.08,AVE,,,,,{},2020-03-15 12:20:04
27300861,renfe,TARRAGONA,MADRID,2020-04-20 07:57:00,2020-04-20 10:01:00,2.07,AVE,,,,,{},2020-03-15 12:20:07
27300865,renfe,MADRID,LLEIDA,2020-04-21 07:30:00,2020-04-21 09:35:00,2.08,AVE,,,,,{},2020-03-15 12:20:07


In [10]:
sdf_sample.select(['origin', 'destination']).limit(5)

origin,destination
MADRID,GIRONA
MADRID,LLEIDA
MADRID,GIRONA
MADRID,LLEIDA
MADRID,LLEIDA


#### filter data

In [11]:
sdf_filtered = sdf_sample.filter((sf.col('meta') != '{}') & (sf.col('price') < 60) & (sf.col('seats').isNotNull()))

sdf_filtered

id,company,origin,destination,departure,arrival,duration,vehicle_type,vehicle_class,price,fare,seats,meta,insert_date
37403110,renfe,MADRID,BARCELONA,2020-10-15 09:40:00,2020-10-15 16:17:48,6.63,AVE-LD,Turista con enlace,50.6,Promo +,232,"{""Turista con enl...",2020-09-28 16:10:...
37403347,renfe,MADRID,BARCELONA,2020-10-07 09:40:00,2020-10-07 16:17:48,6.63,AVE-LD,Turista con enlace,41.4,Promo +,216,"{""Turista con enl...",2020-09-28 16:10:...
37403576,renfe,BARCELONA,MADRID,2020-11-07 09:03:00,2020-11-07 18:07:12,9.07,REG.EXP.,Turista,43.25,Adulto ida,209,"{""Turista"": {""Adu...",2020-09-28 16:11:...
37403767,renfe,BARCELONA,VALENCIA,2020-11-11 10:00:00,2020-11-11 15:07:48,5.13,REG.EXP.,Turista,29.55,Adulto ida,209,"{""Turista"": {""Adu...",2020-09-28 16:11:...
37404286,renfe,LEÓN,MADRID,2020-11-20 20:31:00,2020-11-20 22:46:00,2.25,ALVIA,Turista,27.6,Promo +,236,"{""Turista"": {""Pro...",2020-09-28 16:12:...
37404324,renfe,MADRID,CÓRDOBA,2020-10-04 13:30:00,2020-10-04 15:15:00,1.75,ALVIA,Turista,55.9,Flexible,21,"{""Turista"": {""Fle...",2020-09-28 16:12:...
37404397,renfe,LEÓN,MADRID,2020-10-29 15:50:00,2020-10-29 20:48:12,4.97,MD,Turista,38.85,Adulto ida,245,"{""Turista"": {""Adu...",2020-09-28 16:12:...
37404439,renfe,LEÓN,MADRID,2020-10-27 15:50:00,2020-10-27 20:48:12,4.97,MD,Turista,38.85,Adulto ida,245,"{""Turista"": {""Adu...",2020-09-28 16:12:...
37404836,renfe,VALENCIA,BARCELONA,2020-11-21 19:15:00,2020-11-21 21:55:12,2.67,EUROMED,Turista,28.5,Promo +,236,"{""Turista"": {""Pro...",2020-09-28 16:13:...
37404905,renfe,SEVILLA,MADRID,2020-11-24 10:45:00,2020-11-24 13:16:48,2.53,AVE,Turista,27.0,Promo +,267,"{""Turista"": {""Pro...",2020-09-28 16:13:...


#### create new columns

In [12]:
sdf_filtered.withColumn(
    'duration_computed',
    # cast to timestamp first: casting a datetime string straight to an integer does not give epoch seconds
    (sf.col('arrival').cast('timestamp').cast('long') - sf.col('departure').cast('timestamp').cast('long')) / 3600
)

id,company,origin,destination,departure,arrival,duration,vehicle_type,vehicle_class,price,fare,seats,meta,insert_date,duration_computed
37403110,renfe,MADRID,BARCELONA,2020-10-15 09:40:00,2020-10-15 16:17:48,6.63,AVE-LD,Turista con enlace,50.6,Promo +,232,"{""Turista con enl...",2020-09-28 16:10:...,6.63
37403347,renfe,MADRID,BARCELONA,2020-10-07 09:40:00,2020-10-07 16:17:48,6.63,AVE-LD,Turista con enlace,41.4,Promo +,216,"{""Turista con enl...",2020-09-28 16:10:...,6.63
37403576,renfe,BARCELONA,MADRID,2020-11-07 09:03:00,2020-11-07 18:07:12,9.07,REG.EXP.,Turista,43.25,Adulto ida,209,"{""Turista"": {""Adu...",2020-09-28 16:11:...,9.07
37403767,renfe,BARCELONA,VALENCIA,2020-11-11 10:00:00,2020-11-11 15:07:48,5.13,REG.EXP.,Turista,29.55,Adulto ida,209,"{""Turista"": {""Adu...",2020-09-28 16:11:...,5.13
37404286,renfe,LEÓN,MADRID,2020-11-20 20:31:00,2020-11-20 22:46:00,2.25,ALVIA,Turista,27.6,Promo +,236,"{""Turista"": {""Pro...",2020-09-28 16:12:...,2.25
37404324,renfe,MADRID,CÓRDOBA,2020-10-04 13:30:00,2020-10-04 15:15:00,1.75,ALVIA,Turista,55.9,Flexible,21,"{""Turista"": {""Fle...",2020-09-28 16:12:...,1.75
37404397,renfe,LEÓN,MADRID,2020-10-29 15:50:00,2020-10-29 20:48:12,4.97,MD,Turista,38.85,Adulto ida,245,"{""Turista"": {""Adu...",2020-09-28 16:12:...,4.97
37404439,renfe,LEÓN,MADRID,2020-10-27 15:50:00,2020-10-27 20:48:12,4.97,MD,Turista,38.85,Adulto ida,245,"{""Turista"": {""Adu...",2020-09-28 16:12:...,4.97
37404836,renfe,VALENCIA,BARCELONA,2020-11-21 19:15:00,2020-11-21 21:55:12,2.67,EUROMED,Turista,28.5,Promo +,236,"{""Turista"": {""Pro...",2020-09-28 16:13:...,2.67
37404905,renfe,SEVILLA,MADRID,2020-11-24 10:45:00,2020-11-24 13:16:48,2.53,AVE,Turista,27.0,Promo +,267,"{""Turista"": {""Pro...",2020-09-28 16:13:...,2.53
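The arithmetic behind `duration_computed` (difference in seconds divided by 3600) can be checked by hand for a single row. A minimal sketch with the standard library, using the departure and arrival of the first row shown above; `duration_hours` and `FMT` are hypothetical names for this example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def duration_hours(departure, arrival):
    """Return the trip length in hours from two datetime strings."""
    delta = datetime.strptime(arrival, FMT) - datetime.strptime(departure, FMT)
    return delta.total_seconds() / 3600


# First row of the filtered data: its recorded `duration` is 6.63
hours = duration_hours("2020-10-15 09:40:00", "2020-10-15 16:17:48")
print(round(hours, 2))
```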


#### make aggregations

In [13]:
sdf_filtered.groupby(['origin', 'destination']).agg({'price': 'mean'})

origin,destination,avg(price)
BARCELONA,VALENCIA,28.352215189873405
ZARAGOZA,MADRID,44.7396531791908
MADRID,CÓRDOBA,42.75325670498077
MADRID,ZARAGOZA,47.28177419354846
MÁLAGA,MADRID,40.98489932885911
VALLADOLID,MADRID,31.55361560418648
MADRID,BARCELONA,49.27578125
MADRID,VALLADOLID,31.587778810408857
ZARAGOZA,BARCELONA,36.24157281553397
CÓRDOBA,MADRID,43.8485604606525


In [14]:
sdf_filtered.count()

9298

#### apply custom functions

In [15]:
sdf_filtered.select(['meta']).show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|meta                                                                                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"Turista con enlace": {"Promo +": {"price": 50.6, "seats": 232}, "Flexible": {"price": 92.0, "seats": 232}}}                                                                                           |
|{"Turista con enlace": {"Promo +": {"price": 41.4, "seats": 216}, "Flexible": {"price": 92.0, "seats": 216}}}                                                                              

In [16]:
import json

@sf.udf('integer')
def get_first_class_fare_seats(meta):
    try:
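        # parse the JSON string instead of executing it with eval()
        meta_dict = json.loads(meta)
        first_available_class = [*meta_dict][0]
        first_available_fare = [*meta_dict[first_available_class]][0]
        seats = meta_dict[first_available_class][first_available_fare]['seats']
        return seats

    except (TypeError, ValueError, KeyError, IndexError):
        # missing or malformed metadata counts as zero seats
        return 0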

In [17]:
sdf_filtered = sdf_filtered.withColumn('seats_first_class_fare', get_first_class_fare_seats(sf.col('meta')))

sdf_filtered

id,company,origin,destination,departure,arrival,duration,vehicle_type,vehicle_class,price,fare,seats,meta,insert_date,seats_first_class_fare
37403110,renfe,MADRID,BARCELONA,2020-10-15 09:40:00,2020-10-15 16:17:48,6.63,AVE-LD,Turista con enlace,50.6,Promo +,232,"{""Turista con enl...",2020-09-28 16:10:...,232
37403347,renfe,MADRID,BARCELONA,2020-10-07 09:40:00,2020-10-07 16:17:48,6.63,AVE-LD,Turista con enlace,41.4,Promo +,216,"{""Turista con enl...",2020-09-28 16:10:...,216
37403576,renfe,BARCELONA,MADRID,2020-11-07 09:03:00,2020-11-07 18:07:12,9.07,REG.EXP.,Turista,43.25,Adulto ida,209,"{""Turista"": {""Adu...",2020-09-28 16:11:...,209
37403767,renfe,BARCELONA,VALENCIA,2020-11-11 10:00:00,2020-11-11 15:07:48,5.13,REG.EXP.,Turista,29.55,Adulto ida,209,"{""Turista"": {""Adu...",2020-09-28 16:11:...,209
37404286,renfe,LEÓN,MADRID,2020-11-20 20:31:00,2020-11-20 22:46:00,2.25,ALVIA,Turista,27.6,Promo +,236,"{""Turista"": {""Pro...",2020-09-28 16:12:...,236
37404324,renfe,MADRID,CÓRDOBA,2020-10-04 13:30:00,2020-10-04 15:15:00,1.75,ALVIA,Turista,55.9,Flexible,21,"{""Turista"": {""Fle...",2020-09-28 16:12:...,21
37404397,renfe,LEÓN,MADRID,2020-10-29 15:50:00,2020-10-29 20:48:12,4.97,MD,Turista,38.85,Adulto ida,245,"{""Turista"": {""Adu...",2020-09-28 16:12:...,245
37404439,renfe,LEÓN,MADRID,2020-10-27 15:50:00,2020-10-27 20:48:12,4.97,MD,Turista,38.85,Adulto ida,245,"{""Turista"": {""Adu...",2020-09-28 16:12:...,245
37404836,renfe,VALENCIA,BARCELONA,2020-11-21 19:15:00,2020-11-21 21:55:12,2.67,EUROMED,Turista,28.5,Promo +,236,"{""Turista"": {""Pro...",2020-09-28 16:13:...,236
37404905,renfe,SEVILLA,MADRID,2020-11-24 10:45:00,2020-11-24 13:16:48,2.53,AVE,Turista,27.0,Promo +,267,"{""Turista"": {""Pro...",2020-09-28 16:13:...,267
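The parsing logic of a UDF is easier to debug as a plain Python function before it is registered with Spark. Here is a minimal sketch, assuming `meta` is a JSON string shaped like the rows printed above; `first_class_fare_seats` is a hypothetical stand-alone name, not part of the notebook.

```python
import json


def first_class_fare_seats(meta):
    """Return the seats of the first fare of the first class, or 0 on bad input."""
    try:
        meta_dict = json.loads(meta)
        first_class = next(iter(meta_dict.values()))
        first_fare = next(iter(first_class.values()))
        return int(first_fare["seats"])
    except (TypeError, ValueError, KeyError, AttributeError, StopIteration):
        # None, malformed JSON, empty objects or unexpected shapes all map to 0
        return 0


# Metadata string copied from the first row printed above
sample_meta = (
    '{"Turista con enlace": {"Promo +": {"price": 50.6, "seats": 232}, '
    '"Flexible": {"price": 92.0, "seats": 232}}}'
)

print(first_class_fare_seats(sample_meta))
print(first_class_fare_seats("{}"))
```

Once the function behaves as expected, it can be wrapped with `sf.udf` as in the cell above.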


#### rename columns

In [18]:
sdf_filtered = sdf_filtered.withColumnRenamed('seats', 'seats_cheapest_class_fare')

sdf_filtered

id,company,origin,destination,departure,arrival,duration,vehicle_type,vehicle_class,price,fare,seats_cheapest_class_fare,meta,insert_date,seats_first_class_fare
37403110,renfe,MADRID,BARCELONA,2020-10-15 09:40:00,2020-10-15 16:17:48,6.63,AVE-LD,Turista con enlace,50.6,Promo +,232,"{""Turista con enl...",2020-09-28 16:10:...,232
37403347,renfe,MADRID,BARCELONA,2020-10-07 09:40:00,2020-10-07 16:17:48,6.63,AVE-LD,Turista con enlace,41.4,Promo +,216,"{""Turista con enl...",2020-09-28 16:10:...,216
37403576,renfe,BARCELONA,MADRID,2020-11-07 09:03:00,2020-11-07 18:07:12,9.07,REG.EXP.,Turista,43.25,Adulto ida,209,"{""Turista"": {""Adu...",2020-09-28 16:11:...,209
37403767,renfe,BARCELONA,VALENCIA,2020-11-11 10:00:00,2020-11-11 15:07:48,5.13,REG.EXP.,Turista,29.55,Adulto ida,209,"{""Turista"": {""Adu...",2020-09-28 16:11:...,209
37404286,renfe,LEÓN,MADRID,2020-11-20 20:31:00,2020-11-20 22:46:00,2.25,ALVIA,Turista,27.6,Promo +,236,"{""Turista"": {""Pro...",2020-09-28 16:12:...,236
37404324,renfe,MADRID,CÓRDOBA,2020-10-04 13:30:00,2020-10-04 15:15:00,1.75,ALVIA,Turista,55.9,Flexible,21,"{""Turista"": {""Fle...",2020-09-28 16:12:...,21
37404397,renfe,LEÓN,MADRID,2020-10-29 15:50:00,2020-10-29 20:48:12,4.97,MD,Turista,38.85,Adulto ida,245,"{""Turista"": {""Adu...",2020-09-28 16:12:...,245
37404439,renfe,LEÓN,MADRID,2020-10-27 15:50:00,2020-10-27 20:48:12,4.97,MD,Turista,38.85,Adulto ida,245,"{""Turista"": {""Adu...",2020-09-28 16:12:...,245
37404836,renfe,VALENCIA,BARCELONA,2020-11-21 19:15:00,2020-11-21 21:55:12,2.67,EUROMED,Turista,28.5,Promo +,236,"{""Turista"": {""Pro...",2020-09-28 16:13:...,236
37404905,renfe,SEVILLA,MADRID,2020-11-24 10:45:00,2020-11-24 13:16:48,2.53,AVE,Turista,27.0,Promo +,267,"{""Turista"": {""Pro...",2020-09-28 16:13:...,267


#### create virtual sql tables and query them

In [19]:
# createOrReplaceTempView does not fail if the cell is run again
sdf_filtered.createOrReplaceTempView('renfe')

In [20]:
SQL_QUERY = """
select
origin,
destination,
avg(price) as mean_price, 
avg(seats_cheapest_class_fare) as mean_seats
from renfe
group by origin, destination
order by mean_price desc
"""

In [21]:
routes_prices_sdf = spark.sql(SQL_QUERY)

routes_prices_sdf

origin,destination,mean_price,mean_seats
MADRID,BARCELONA,49.27578125,243.5
BARCELONA,MADRID,47.93766233766234,240.7142857142857
MADRID,ZARAGOZA,47.28177419354846,198.02956989247312
ZARAGOZA,MADRID,44.7396531791908,235.6242774566474
CÓRDOBA,MADRID,43.8485604606525,216.2476007677543
MADRID,SEVILLA,43.31225961538459,236.47596153846155
MADRID,CÓRDOBA,42.75325670498077,220.84291187739464
SEVILLA,MADRID,42.68211206896551,238.81896551724137
MÁLAGA,MADRID,40.98489932885911,260.4295302013423
MADRID,MÁLAGA,40.703472222222274,256.78472222222223
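The query logic can be sanity-checked without a Spark session by running the same statement on a tiny table. The sketch below uses SQLite from the Python standard library as a stand-in engine, so it is an approximation rather than Spark SQL; the five rows are taken from the filtered data shown earlier and the table definition is an assumption.

```python
import sqlite3

SQL_QUERY = """
select
origin,
destination,
avg(price) as mean_price,
avg(seats_cheapest_class_fare) as mean_seats
from renfe
group by origin, destination
order by mean_price desc
"""

# (origin, destination, price, seats_cheapest_class_fare)
rows = [
    ("LEÓN", "MADRID", 27.6, 236),
    ("LEÓN", "MADRID", 38.85, 245),
    ("LEÓN", "MADRID", 38.85, 245),
    ("MADRID", "BARCELONA", 50.6, 232),
    ("MADRID", "BARCELONA", 41.4, 216),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table renfe ("
    "origin text, destination text, price real, seats_cheapest_class_fare integer)"
)
conn.executemany("insert into renfe values (?, ?, ?, ?)", rows)
result = conn.execute(SQL_QUERY).fetchall()
conn.close()

for origin, destination, mean_price, mean_seats in result:
    print(origin, destination, round(mean_price, 2), round(mean_seats, 2))
```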


#### transform Spark DataFrame into pandas DataFrame

In [22]:
routes_prices_df = routes_prices_sdf.toPandas()

routes_prices_df

Unnamed: 0,origin,destination,mean_price,mean_seats
0,MADRID,BARCELONA,49.275781,243.5
1,BARCELONA,MADRID,47.937662,240.714286
2,MADRID,ZARAGOZA,47.281774,198.02957
3,ZARAGOZA,MADRID,44.739653,235.624277
4,CÓRDOBA,MADRID,43.84856,216.247601
5,MADRID,SEVILLA,43.31226,236.475962
6,MADRID,CÓRDOBA,42.753257,220.842912
7,SEVILLA,MADRID,42.682112,238.818966
8,MÁLAGA,MADRID,40.984899,260.42953
9,MADRID,MÁLAGA,40.703472,256.784722


<div style="padding-top: 25px; float: right">
    <div>    
        <i>&nbsp;&nbsp;© Copyright by</i>
    </div>
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="125">
        </a>
    </div>
</div>