## Problem 07

Weightage: 25

Get the station name, latitude, longitude, and number of rides started from each station for each day in the data set. Use **startstationid** to identify the station where each ride started.

## Data Description
All of the citibike trip data is available under **/public/citibike/trips**. It contains multiple folders, one for each month. Here is the schema.

```
root
 |-- tripduration: integer (nullable = true)
 |-- starttime: timestamp (nullable = true)
 |-- stoptime: timestamp (nullable = true)
 |-- startstationid: string (nullable = true)
 |-- endstationid: string (nullable = true)
 |-- bikeid: integer (nullable = true)
 |-- usertype: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- gender: integer (nullable = true)
 |-- month: integer (nullable = true)
```
All of the citibike station data is available under **/public/citibike/stations**. 
```
root
 |-- stationid: long (nullable = true)
 |-- stationlatitude: string (nullable = true)
 |-- stationlongitude: string (nullable = true)
 |-- stationname: string (nullable = true)
```
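
To make the target computation concrete, here is a minimal plain-Python sketch of the join and per-day count on a handful of made-up rows. The sample trips, station names, and the helper `rides_per_station_per_day` are hypothetical and only mirror the column names in the schemas above. The sketch also assumes that trips without a matching station are dropped (inner-join behaviour), which the problem statement does not spell out.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows that mirror the schemas above (not real data).
trips = [
    {"starttime": "2017-01-01 00:05:00", "startstationid": "72"},
    {"starttime": "2017-01-01 09:30:00", "startstationid": "72"},
    {"starttime": "2017-01-02 08:00:00", "startstationid": "72"},
    {"starttime": "2017-01-01 10:00:00", "startstationid": "79"},
    {"starttime": "2017-01-01 11:00:00", "startstationid": "9999"},  # no such station
]
stations = [
    {"stationid": 72, "stationname": "Station A",
     "stationlatitude": "40.767", "stationlongitude": "-73.993"},
    {"stationid": 79, "stationname": "Station B",
     "stationlatitude": "40.7", "stationlongitude": "-74.0"},
]


def rides_per_station_per_day(trips, stations):
    """Count rides per (station, day); the day is an integer like 20170101."""
    # stationid is a long while startstationid is a string, so compare as strings.
    stations_by_id = {str(s["stationid"]): s for s in stations}
    counts = Counter()
    for trip in trips:
        station = stations_by_id.get(trip["startstationid"])
        if station is None:
            continue  # assumed inner join: skip trips with no matching station
        started = datetime.strptime(trip["starttime"], "%Y-%m-%d %H:%M:%S")
        key = (
            station["stationname"],
            float(station["stationlatitude"]),
            float(station["stationlongitude"]),
            int(started.strftime("%Y%m%d")),
        )
        counts[key] += 1
    # One output tuple per group, with the ride count appended as the last field.
    return [key + (ridecount,) for key, ridecount in counts.items()]


daily_counts = rides_per_station_per_day(trips, stations)
```

Each tuple in `daily_counts` follows the required output column order: station name, latitude, longitude, ride start date, ride count.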

## Output Requirements
* Place the result in the following HDFS directory:
```
/user/`whoami`/mock_test_02/problem07/solution
```
* Write the output as CSV in exactly 2 files. Make sure to preserve the header.
* Use the following column names and data types.
```
 |-- stationname: string (nullable = true)
 |-- stationlatitude: double (nullable = true)
 |-- stationlongitude: double (nullable = true)
 |-- ridestartdate: integer (nullable = true)
 |-- ridecount: integer (nullable = true)
```
* Data should be sorted in ascending order by ridestartdate and then in descending order by ridecount.
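
The ordering in the last requirement can be expressed as a two-part sort key. The following plain-Python sketch uses made-up rows (the `Row` tuple and its values are illustrative only) to show ascending `ridestartdate` with descending `ridecount` inside each date.

```python
from collections import namedtuple

# Illustrative row type holding only the columns that matter for ordering.
Row = namedtuple("Row", ["stationname", "ridestartdate", "ridecount"])

rows = [
    Row("Station B", 20170102, 5),
    Row("Station A", 20170101, 3),
    Row("Station C", 20170101, 7),
    Row("Station A", 20170102, 9),
]

# Negating the count gives descending counts within each ascending date.
ordered = sorted(rows, key=lambda r: (r.ridestartdate, -r.ridecount))
```

In Spark the same ordering is written as `orderBy(col('ridestartdate'), col('ridecount').desc())`.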

## Validation

Here are the self-validation steps:
* Run the following to check number of files.
```
hdfs dfs -ls /user/`whoami`/mock_test_02/problem07/solution
```
* Run this code to create a DataFrame named `data`.
```
import getpass
username = getpass.getuser()
data = spark.read. \
    csv(f'/user/{username}/mock_test_02/problem07/solution',
        header=True,
        inferSchema=True
       )
```
* Run `data.printSchema()` to validate the data types of the fields.
```
root
 |-- stationname: string (nullable = true)
 |-- stationlatitude: double (nullable = true)
 |-- stationlongitude: double (nullable = true)
 |-- ridestartdate: integer (nullable = true)
 |-- ridecount: integer (nullable = true)
```
* Run `data.count()` to validate the number of records. It should be **785303**.
* Run `data.orderBy(col('ridestartdate'), col('ridecount').desc()).show()` to preview the output (this requires `col` from `pyspark.sql.functions`). It should match the sample below.

|         stationname|  stationlatitude| stationlongitude|ridestartdate|ridecount|
|--------------------|-----------------|-----------------|-------------|---------|
|Central Park S & ...|      40.76590936|     -73.97634151|     20170101|      160|
|Centre St & Chamb...|      40.71273266|      -74.0046073|     20170101|      126|
|  Broadway & W 60 St|      40.76915505|     -73.98191841|     20170101|      114|
|  Broadway & E 14 St|           40.734|          -73.992|     20170101|      110|
|Central Park West...|40.77579376683666|-73.9762057363987|     20170101|      103|
|West St & Chamber...|      40.71754834|     -74.01322069|     20170101|      101|
|Central Park Nort...|        40.799484|       -73.955613|     20170101|       98|
|  Carmine St & 6 Ave|      40.73038599|     -74.00214988|     20170101|       97|
|Allen St & Stanto...|        40.722055|       -73.989111|     20170101|       96|
|     9 Ave & W 22 St|       40.7454973|     -74.00197139|     20170101|       96|
|     5 Ave & E 88 St|           40.782|          -73.959|     20170101|       94|
|Grand Army Plaza ...|           40.764|          -73.974|     20170101|       93|
|     5 Ave & E 73 St|      40.77282817|     -73.96685276|     20170101|       89|
|Christopher St & ...|      40.73291553|     -74.00711384|     20170101|       89|
|Central Park West...|      40.78472675|     -73.96961715|     20170101|       85|
|Grand Army Plaza ...|       40.6729679|     -73.97087984|     20170101|       82|
|Grand St & Elizab...|        40.718822|        -73.99596|     20170101|       82|
|Greenwich Ave & 8...|    40.7390169121|   -74.0026376103|     20170101|       82|
|Central Park West...|       40.7734066|     -73.97782542|     20170101|       80|
|    12 Ave & W 40 St|      40.76087502|     -74.00277668|     20170101|       80|


In [1]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    appName(f'Problem 07 | {username}'). \
    master('yarn'). \
    getOrCreate()

In [2]:
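# Limit shuffle stages (join, aggregation, sort) to 2 partitions to match the 2-file output requirement.
spark.conf.set('spark.sql.shuffle.partitions', 2)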

In [13]:
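# Read every monthly folder of gzipped trip CSVs; without inferSchema all columns are read as strings.
cityBikePath = "/public/citibike/trips/month=*/part-*.csv.gz"
cityBikeDf = spark.read.csv(cityBikePath, header=True)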

In [14]:
cityBikeDf.show()

+------------+--------------------+--------------------+--------------+------------+------+----------+---------+------+
|tripduration|           starttime|            stoptime|startstationid|endstationid|bikeid|  usertype|birthyear|gender|
+------------+--------------------+--------------------+--------------+------------+------+----------+---------+------+
|         327|2019-09-01T00:00:...|2019-09-01T00:05:...|          3733|         504| 39213|Subscriber|     1968|     1|
|        2219|2019-09-29T12:04:...|2019-09-29T12:41:...|          3372|        3686| 18261|Subscriber|     1974|     2|
|        1145|2019-09-01T00:00:...|2019-09-01T00:19:...|          3329|         270| 21257|  Customer|     1969|     0|
|         816|2019-09-29T12:04:...|2019-09-29T12:18:...|           476|         168| 34200|Subscriber|     1974|     2|
|        1293|2019-09-01T00:00:...|2019-09-01T00:21:...|          3168|         423| 15242|  Customer|     1969|     0|
|         886|2019-09-29T12:04:...|2019-

In [4]:
cityBikeDf.count()

54464729

In [5]:
stationsPath = "/public/citibike/stations/part-*.json"
stationsDf = spark.read.json(stationsPath)

In [6]:
stationsDf.count()

1026

In [7]:
from pyspark.sql.functions import col, lit, date_format, count

In [19]:
# Join trips to stations on the start station, derive the ride date as yyyyMMdd,
# count rides per station per day, then sort by date (ascending) and ride count (descending).
resultDf = cityBikeDf. \
    join(stationsDf, on=cityBikeDf['startstationid'] == stationsDf['stationid']). \
    withColumn('ridestartdate', date_format(cityBikeDf['starttime'], 'yyyyMMdd')). \
    groupBy(stationsDf['stationname'], stationsDf['stationlatitude'], stationsDf['stationlongitude'], 'ridestartdate'). \
    agg(count(lit(1)).alias('ridecount')). \
    orderBy(col('ridestartdate'), col('ridecount').desc())

In [21]:
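# coalesce(2) caps the output at 2 part files; the header option writes the column names into each file.
resultDf.coalesce(2). \
    write.option('header', 'true'). \
    csv(f'/user/{username}/mock_test_02/problem07/solution')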

# Validations

In [22]:
%%sh
hdfs dfs -ls /user/`whoami`/mock_test_02/problem07/solution

Found 3 items
-rw-r--r--   3 itv001477 supergroup          0 2021-12-10 02:29 /user/itv001477/mock_test_02/problem07/solution/_SUCCESS
-rw-r--r--   3 itv001477 supergroup   22527254 2021-12-10 02:29 /user/itv001477/mock_test_02/problem07/solution/part-00000-d96ae177-c227-4e02-a2dc-b7dbe0066531-c000.csv
-rw-r--r--   3 itv001477 supergroup   22222399 2021-12-10 02:29 /user/itv001477/mock_test_02/problem07/solution/part-00001-d96ae177-c227-4e02-a2dc-b7dbe0066531-c000.csv
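
As a rough way to confirm the file count without reading the listing by eye, the sketch below counts the `part-` entries in the text printed by `hdfs dfs -ls`. The helper name `count_part_files` is made up for illustration, and the sample listing uses shortened, hypothetical paths rather than the real output above. It assumes paths contain no spaces.

```python
def count_part_files(listing: str) -> int:
    """Count plain files named 'part-*' in the text output of `hdfs dfs -ls`."""
    count = 0
    for line in listing.splitlines():
        fields = line.split()
        # Plain files have a permission string starting with '-'; this skips
        # blank lines, the "Found N items" line, and directories ('d...').
        if not fields or not fields[0].startswith("-"):
            continue
        file_name = fields[-1].rsplit("/", 1)[-1]
        if file_name.startswith("part-"):
            count += 1
    return count


# Hypothetical listing in the same layout as the output above.
sample_listing = """Found 3 items
-rw-r--r--   3 user supergroup          0 2021-12-10 02:29 /user/user/solution/_SUCCESS
-rw-r--r--   3 user supergroup        100 2021-12-10 02:29 /user/user/solution/part-00000.csv
-rw-r--r--   3 user supergroup        100 2021-12-10 02:29 /user/user/solution/part-00001.csv
"""

part_file_count = count_part_files(sample_listing)
```

The `_SUCCESS` marker is ignored, so a correct solution directory should give a count of 2.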


In [23]:
data = spark.read. \
  csv(f'/user/{username}/mock_test_02/problem07/solution',
      header=True,
      inferSchema=True
     )

In [24]:
data.printSchema()

root
 |-- stationname: string (nullable = true)
 |-- stationlatitude: double (nullable = true)
 |-- stationlongitude: double (nullable = true)
 |-- ridestartdate: integer (nullable = true)
 |-- ridecount: integer (nullable = true)



In [25]:
data.count()

785303

In [26]:
data.orderBy(col('ridestartdate'), col('ridecount').desc()).show()

+--------------------+-----------------+-----------------+-------------+---------+
|         stationname|  stationlatitude| stationlongitude|ridestartdate|ridecount|
+--------------------+-----------------+-----------------+-------------+---------+
|Central Park S & ...|      40.76590936|     -73.97634151|     20170101|      160|
|Centre St & Chamb...|      40.71273266|      -74.0046073|     20170101|      126|
|  Broadway & W 60 St|      40.76915505|     -73.98191841|     20170101|      114|
|  Broadway & E 14 St|           40.734|          -73.992|     20170101|      110|
|Central Park West...|40.77579376683666|-73.9762057363987|     20170101|      103|
|West St & Chamber...|      40.71754834|     -74.01322069|     20170101|      101|
|Central Park Nort...|        40.799484|       -73.955613|     20170101|       98|
|  Carmine St & 6 Ave|      40.73038599|     -74.00214988|     20170101|       97|
|Allen St & Stanto...|        40.722055|       -73.989111|     20170101|       96|
|   