## Problem 09

Weightage: 40

Join the trips data with the stations data to produce a denormalized table that contains every field from trips plus `startstationname` and `endstationname`.


## Data Description
All of the citibike trip data is available under **/public/citibike/trips**. It contains multiple folders, one for each month. Here is the schema.

```
root
 |-- tripduration: integer (nullable = true)
 |-- starttime: timestamp (nullable = true)
 |-- stoptime: timestamp (nullable = true)
 |-- startstationid: string (nullable = true)
 |-- endstationid: string (nullable = true)
 |-- bikeid: integer (nullable = true)
 |-- usertype: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- gender: integer (nullable = true)
 |-- month: integer (nullable = true)
```
All of the citibike station data is available under **/public/citibike/stations**. 
```
root
 |-- stationid: long (nullable = true)
 |-- stationlatitude: string (nullable = true)
 |-- stationlongitude: string (nullable = true)
 |-- stationname: string (nullable = true)
```

## Output Requirements
* Place the result in the following HDFS directory:
```
/user/`whoami`/mock_test_02/problem09/solution
```
* Use Parquet File format with any number of files.
* The output must have the following column names and data types.
```
 |-- tripduration: integer (nullable = true)
 |-- starttime: timestamp (nullable = true)
 |-- stoptime: timestamp (nullable = true)
 |-- startstationid: integer (nullable = true)
 |-- endstationid: integer (nullable = true)
 |-- bikeid: integer (nullable = true)
 |-- usertype: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- gender: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- startstationname: string (nullable = true)
 |-- endstationname: string (nullable = true)
```
* There are no requirements for sorting the data.
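
The requirements above can be sketched as a pair of left joins against renamed copies of the stations data, followed by casts to the required types. This is a minimal sketch, not a reference solution: the names `OUTPUT_COLUMNS`, `CASTS`, and `denormalize_trips` are hypothetical, and it assumes the `trips` DataFrame passed in already carries the `month` column.

```python
# Column order required by the problem statement.
OUTPUT_COLUMNS = [
    "tripduration", "starttime", "stoptime", "startstationid", "endstationid",
    "bikeid", "usertype", "birthyear", "gender", "month",
    "startstationname", "endstationname",
]

# Target types for the non-string columns (hypothetical helper mapping).
CASTS = {
    "tripduration": "int",
    "starttime": "timestamp",
    "stoptime": "timestamp",
    "startstationid": "int",
    "endstationid": "int",
    "bikeid": "int",
    "gender": "int",
    "month": "int",
}


def denormalize_trips(trips, stations):
    """Attach start and end station names to every trip.

    Assumes `trips` has all ten trip columns (including `month`) and
    `stations` has `stationid` and `stationname`.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql.functions import col

    # Cast the trip columns to the types the output schema asks for.
    typed = trips
    for name, dtype in CASTS.items():
        typed = typed.withColumn(name, col(name).cast(dtype))

    # Two renamed copies of stations avoid ambiguous column names.
    start = stations.select(
        col("stationid").alias("start_id"),
        col("stationname").alias("startstationname"),
    )
    end = stations.select(
        col("stationid").alias("end_id"),
        col("stationname").alias("endstationname"),
    )

    # Left joins keep trips whose station id has no match in stations.
    return (
        typed
        .join(start, col("startstationid") == col("start_id"), "left")
        .join(end, col("endstationid") == col("end_id"), "left")
        .select(*OUTPUT_COLUMNS)
    )
```

One way to keep `month` when reading with a glob is to set the reader's `basePath` option to `/public/citibike/trips`, so that the `month=` folders are still treated as partitions. The result of `denormalize_trips(trips, stations)` could then be written with `.write.parquet(...)` to the directory named above.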

## Validation

Here are the self-validation steps:
* Run the following to check number of files.
```
hdfs dfs -ls /user/`whoami`/mock_test_02/problem09/solution
```
* Run this code to create a DataFrame named `data`.
```
import getpass
username = getpass.getuser()
data = spark.read. \
    parquet(f'/user/{username}/mock_test_02/problem09/solution')
```
* Run `data.printSchema()` to validate the data types of the fields.
```
root
 |-- tripduration: integer (nullable = true)
 |-- starttime: timestamp (nullable = true)
 |-- stoptime: timestamp (nullable = true)
 |-- startstationid: integer (nullable = true)
 |-- endstationid: integer (nullable = true)
 |-- bikeid: integer (nullable = true)
 |-- usertype: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- gender: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- startstationname: string (nullable = true)
 |-- endstationname: string (nullable = true)
```
* Run `data.count()` to validate the number of records. It should be **54462016**.
* Run `data.show()` to preview the data. Make sure all the data is showing up as expected.
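
The schema check can also be done programmatically instead of by eye. The sketch below compares a list of `(name, type)` pairs, such as the one returned by `data.dtypes`, against the expected schema. The names `EXPECTED_DTYPES` and `schema_mismatches` are hypothetical helpers, not part of the problem statement.

```python
# Expected (column, type) pairs, using the short type names that
# DataFrame.dtypes reports (for example "int" for integer).
EXPECTED_DTYPES = [
    ("tripduration", "int"),
    ("starttime", "timestamp"),
    ("stoptime", "timestamp"),
    ("startstationid", "int"),
    ("endstationid", "int"),
    ("bikeid", "int"),
    ("usertype", "string"),
    ("birthyear", "string"),
    ("gender", "int"),
    ("month", "int"),
    ("startstationname", "string"),
    ("endstationname", "string"),
]


def schema_mismatches(actual, expected=EXPECTED_DTYPES):
    """Return readable differences between two lists of (name, type) pairs."""
    actual_map = dict(actual)
    problems = []
    # Report columns that are missing or have the wrong type.
    for name, dtype in expected:
        if name not in actual_map:
            problems.append(f"missing column: {name}")
        elif actual_map[name] != dtype:
            problems.append(f"{name}: expected {dtype}, found {actual_map[name]}")
    # Report columns that the expected schema does not mention.
    expected_names = {name for name, _ in expected}
    for name, _ in actual:
        if name not in expected_names:
            problems.append(f"unexpected column: {name}")
    return problems
```

Calling `schema_mismatches(data.dtypes)` returns an empty list when the saved data matches the required schema, and a list of messages otherwise.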

In [1]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    appName(f'Problem 09 | {username}'). \
    master('yarn'). \
    getOrCreate()

In [2]:
spark.conf.set('spark.sql.shuffle.partitions', 10)

In [3]:
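# Globbing down to the part files skips partition discovery, so the `month`
# column is not loaded; without a schema, every CSV column is read as a string.
cityBikePath = "/public/citibike/trips/month=*/part-*.csv.gz"
cityBikeDf = spark.read.csv(cityBikePath, header=True)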

In [4]:
stationsPath="/public/citibike/stations/part-*.json"
stationsDf=spark.read.json(stationsPath)

In [5]:
from pyspark.sql.functions import col,when

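In [None]:
# Abandoned attempt (never executed): `when` cannot reference columns of
# stationsDf without a join, so this raises an AnalysisException.
# The joins in the following cells replace it.
# cityBikeDf.withColumn("startstationname",when(cityBikeDf['startstationid']==stationsDf["stationid"],stationsDf['stationname'])). \
#           withColumn("endstationname",when(cityBikeDf['endstationid']==stationsDf["stationid"],stationsDf['stationname'])).show()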

In [6]:
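# Inner join (the default): a trip whose startstationid has no row in stationsDf would be dropped.
joinedDf = cityBikeDf.join(stationsDf, on=cityBikeDf['startstationid'] == stationsDf['stationid'])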

In [7]:
startstationDf=joinedDf.withColumn("startstationname",col("stationname")). \
                select("tripduration","starttime","stoptime","startstationid","endstationid","bikeid","usertype","birthyear","gender","startstationname")

In [12]:
startstationDf.count()

54462016

In [13]:
resultDf=startstationDf.join(stationsDf,on=startstationDf["endstationid"]==stationsDf["stationid"],how="left"). \
         withColumn("endstationname",col("stationname")). \
        select("tripduration","starttime","stoptime","startstationid","endstationid","bikeid","usertype","birthyear","gender","startstationname","endstationname")

In [14]:
resultDf.count()

54462016

In [17]:
resultDf.filter(col("endstationname").isNull()). \
    select("startstationid","startstationname","endstationid","endstationname").show()

+--------------+--------------------+------------+--------------+
|startstationid|    startstationname|endstationid|endstationname|
+--------------+--------------------+------------+--------------+
|          3623|W 120 St & Clarem...|        3198|          null|
|           327|Vesey Pl & River ...|        3192|          null|
|           327|Vesey Pl & River ...|        3192|          null|
|          3547|Broadway & Moylan Pl|        3198|          null|
|           347|Greenwich St & W ...|        3481|          null|
|          3295|Central Park W & ...|        3202|          null|
|           146|Hudson St & Reade St|        3267|          null|
|          3467|W Broadway & Spri...|        3276|          null|
|           494|     W 26 St & 8 Ave|        3639|          null|
|           212|W 16 St & The Hig...|        3185|          null|
|          3368|        5 Ave & 3 St|        3205|          null|
|           127|Barrow St & Hudso...|        3184|          null|
|         

In [24]:
stationsDf.select("stationname").filter(col("stationid")==3681).show()

+-----------+
|stationname|
+-----------+
+-----------+
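
The empty lookup above is consistent with trips that reference station ids absent from the stations data. A minimal sketch for listing all such ids at once uses a left anti join, which keeps only the rows of the left side that have no match on the right. The function name `unmatched_station_ids` is hypothetical.

```python
def unmatched_station_ids(trips, stations, id_column="endstationid"):
    """Return the distinct ids in `trips[id_column]` with no row in `stations`.

    Both arguments are expected to be Spark DataFrames; `stations` must
    have a `stationid` column.
    """
    # "left_anti" keeps trips rows for which the join condition never matches.
    unmatched = trips.join(
        stations,
        trips[id_column] == stations["stationid"],
        "left_anti",
    )
    # Collapse to one row per unmatched id.
    return unmatched.select(id_column).distinct()
```

For example, `unmatched_station_ids(cityBikeDf, stationsDf).show()` would list the end station ids that produce a null `endstationname`, and passing `id_column="startstationid"` does the same for start stations.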



In [11]:
resultDf.select("startstationid","startstationname","endstationid","endstationname")

startstationid,startstationname,endstationid,endstationname
3733,Avenue C & E 18 St,504,1 Ave & E 16 St
3372,E 74 St & 1 Ave,3686,Gansevoort St & H...
3329,Degraw St & Smith St,270,Adelphi St & Myrt...
476,E 31 St & 3 Ave,168,W 18 St & 6 Ave
3168,Central Park West...,423,W 54 St & 9 Ave
3531,Frederick Douglas...,3289,W 90 St & Amsterd...
3299,E 98 St & Park Ave,3160,Central Park West...
305,E 58 St & 3 Ave,3810,Central Park West...
486,Broadway & W 29 St,478,11 Ave & W 41 St
3255,8 Ave & W 31 St,523,W 38 St & 8 Ave


In [25]:
resultDf.printSchema()

root
 |-- tripduration: string (nullable = true)
 |-- starttime: string (nullable = true)
 |-- stoptime: string (nullable = true)
 |-- startstationid: string (nullable = true)
 |-- endstationid: string (nullable = true)
 |-- bikeid: string (nullable = true)
 |-- usertype: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- startstationname: string (nullable = true)
 |-- endstationname: string (nullable = true)



In [26]:
resultDf.coalesce(3).write. \
    parquet(f'/user/{username}/mock_test_02/problem09/solution')

## Validations

In [27]:
%%sh
hdfs dfs -ls /user/`whoami`/mock_test_02/problem09/solution

Found 4 items
-rw-r--r--   3 itv001477 supergroup          0 2021-12-11 05:55 /user/itv001477/mock_test_02/problem09/solution/_SUCCESS
-rw-r--r--   3 itv001477 supergroup  466718495 2021-12-11 05:55 /user/itv001477/mock_test_02/problem09/solution/part-00000-39567662-9444-4fd3-bcbf-9fe4a70236a0-c000.snappy.parquet
-rw-r--r--   3 itv001477 supergroup  472014901 2021-12-11 05:53 /user/itv001477/mock_test_02/problem09/solution/part-00001-39567662-9444-4fd3-bcbf-9fe4a70236a0-c000.snappy.parquet
-rw-r--r--   3 itv001477 supergroup  437850262 2021-12-11 05:53 /user/itv001477/mock_test_02/problem09/solution/part-00002-39567662-9444-4fd3-bcbf-9fe4a70236a0-c000.snappy.parquet


In [28]:
data = spark.read. \
  parquet(f'/user/{username}/mock_test_02/problem09/solution')

In [29]:
data.printSchema()

root
 |-- tripduration: string (nullable = true)
 |-- starttime: string (nullable = true)
 |-- stoptime: string (nullable = true)
 |-- startstationid: string (nullable = true)
 |-- endstationid: string (nullable = true)
 |-- bikeid: string (nullable = true)
 |-- usertype: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- startstationname: string (nullable = true)
 |-- endstationname: string (nullable = true)



In [30]:
data.count()

54462016

In [31]:
data.show()

+------------+--------------------+--------------------+--------------+------------+------+----------+---------+------+--------------------+--------------------+
|tripduration|           starttime|            stoptime|startstationid|endstationid|bikeid|  usertype|birthyear|gender|    startstationname|      endstationname|
+------------+--------------------+--------------------+--------------+------------+------+----------+---------+------+--------------------+--------------------+
|         327|2019-09-01T00:00:...|2019-09-01T00:05:...|          3733|         504| 39213|Subscriber|     1968|     1|  Avenue C & E 18 St|     1 Ave & E 16 St|
|        2219|2019-09-29T12:04:...|2019-09-29T12:41:...|          3372|        3686| 18261|Subscriber|     1974|     2|     E 74 St & 1 Ave|Gansevoort St & H...|
|        1145|2019-09-01T00:00:...|2019-09-01T00:19:...|          3329|         270| 21257|  Customer|     1969|     0|Degraw St & Smith St|Adelphi St & Myrt...|
|         816|2019-09-29T12: