In [1]:
import geopandas
import keplergl
w1 = keplergl.KeplerGl(height=600)

User Guide: https://github.com/keplergl/kepler.gl/blob/master/docs/keplergl-jupyter/user-guide.md


In [2]:
w1

KeplerGl(height=600)

# Dataframes

Apache Spark™ allows you to use DataFrames to query large data files.

## In this lesson you:
* Learn about Spark DataFrames.
* Query large files using Spark DataFrames.
* Visualize query results using charts.



### Introducing DataFrames

Under the covers, DataFrames are derived from data structures known as Resilient Distributed Datasets (RDDs). RDDs and DataFrames are immutable distributed collections of data. Let's take a closer look at what some of these terms mean before we understand how they relate to DataFrames:

* **Resilient**: They are fault tolerant, so if part of your operation fails, Spark quickly recovers the lost computation.
* **Distributed**: RDDs are distributed across networked machines known as a cluster.
* **DataFrame**: A data structure where data is organized into named columns, like a table in a relational database, but with richer optimizations under the hood. 

Without the named columns and declared types provided by a schema, Spark wouldn't know how to optimize the execution of any computation. Since DataFrames have a schema, they use the Catalyst Optimizer to determine the optimal way to execute your code.

DataFrames were invented because the business community uses tables in a relational database, Pandas or R DataFrames, or Excel worksheets. A Spark DataFrame is conceptually equivalent to these, with richer optimizations under the hood and the benefit of being distributed across a cluster.

#### Interacting with DataFrames

Once created (instantiated), a DataFrame object has methods attached to it. Methods are operations one can perform on DataFrames such as filtering,
counting, aggregating and many others.

> <b>Example</b>: To create (instantiate) a DataFrame, use this syntax: `df = ...`

To display the contents of the DataFrame, apply a `show` operation (method) on it using the syntax `df.show()`. 

The `.` indicates you are *applying a method on the object*.

In working with DataFrames, it is common to chain operations together, such as: `df.select().filter().orderBy()`.  

By chaining operations together, you don't need to save intermediate DataFrames into local variables (thereby avoiding the creation of extra objects).

Also note that you do not have to worry about how to order operations, because the optimizer determines the optimal order of execution for you. For example, the following two chains are planned the same way:

`df.select(...).orderBy(...).filter(...)`

versus

`df.filter(...).select(...).orderBy(...)`

#### DataFrames and SQL

DataFrame syntax is more flexible than SQL syntax. Here we illustrate general usage patterns of SQL and DataFrames.

Suppose we have a data set we loaded as a table called `myTable` and an equivalent DataFrame, called `df`.
We have three fields/columns called `col_1` (numeric type), `col_2` (string type) and `col_3` (timestamp type).
Here are basic SQL operations and their DataFrame equivalents. 

Notice that columns in DataFrames are referenced by `col("<columnName>")`.

| SQL                                         | DataFrame (Python)                    |
| ------------------------------------------- | ------------------------------------- | 
| `SELECT col_1 FROM myTable`                 | `df.select(col("col_1"))`             | 
| `DESCRIBE myTable`                          | `df.printSchema()`                    | 
| `SELECT * FROM myTable WHERE col_1 > 0`     | `df.filter(col("col_1") > 0)`         | 
| `..GROUP BY col_2`                          | `..groupBy(col("col_2"))`             | 
| `..ORDER BY col_2`                          | `..orderBy(col("col_2"))`             | 
| `..WHERE year(col_3) > 1990`                | `..filter(year(col("col_3")) > 1990)` | 
| `SELECT * FROM myTable LIMIT 10`            | `df.limit(10)`                        |
| `display(myTable)` (text format)            | `df.show()`                           | 
| `display(myTable)` (html format)            | `display(df)`                         |

**Hint:** You can also run SQL queries with the special syntax `spark.sql("SELECT * FROM myTable")`

In this course you will see many other uses of DataFrames. Working out their SQL equivalents is left to you, in some cases as an exercise.

### Start Spark context
First things first: let's create the SparkSession and its SparkContext. If you don't understand this part, don't worry. We'll cover it in greater detail in future lessons.

In [3]:
MODE = "LOCAL"
# MODE = "CLUSTER"

import sys
import os
import json
import math
import numbers

import numpy as np
import plotly
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.storagelevel import StorageLevel

plotly.offline.init_notebook_mode(connected=True)

try:
    spark.stop()
    print("Stopped a SparkSession")
except Exception as e:
    print("No existing SparkSession")

conf = None
if MODE == "LOCAL":
    conf = SparkConf().\
            setAppName("pyspark_day01_dataframe").\
            setMaster('local[*]').\
            set("spark.ui.port", "4040").\
            set("spark.scheduler.mode", "FAIR")


spark = SparkSession.builder.\
    config(conf=conf).\
    getOrCreate()

sc = spark.sparkContext


def display(df, limit=10):
    return df.limit(limit).toPandas()


def dfTest(id, expected, result):
    assert str(expected) == str(
        result), "{} does not equal expected {}".format(result, expected)


print('Spark Session Started')

Stopped a SparkSession
Spark Session Started


In [4]:
spark

### Querying Data 
This lesson uses the New York Citi Bike trip data set, which is stored as one CSV file per month.

Run the command below to list the files in the `ny-citibike-trip` directory.

In [5]:
! ls ../Data/ny-citibike-trip/

'2013-07 - Citi Bike trip data.csv'  '2013-12 - Citi Bike trip data.csv'
'2013-08 - Citi Bike trip data.csv'   201307-201402-citibike-tripdata.zip
'2013-09 - Citi Bike trip data.csv'  '2014-01 - Citi Bike trip data.csv'
'2013-10 - Citi Bike trip data.csv'  '2014-02 - Citi Bike trip data.csv'
'2013-11 - Citi Bike trip data.csv'


In [6]:
ny_citibike_trip_path = "../Data/ny-citibike-trip/201*-* - Citi Bike trip data.csv"
nyCitibikeTripDF = spark.read.option("header", "true").csv(ny_citibike_trip_path)

In [7]:
display(nyCitibikeTripDF)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,504,2013-11-01 00:00:17,2013-11-01 00:08:41,326,E 11 St & 1 Ave,40.72953837,-73.98426726,349,Rivington St & Ridge St,40.71850211,-73.98329859,16393,Subscriber,1983,2
1,747,2013-11-01 00:00:20,2013-11-01 00:12:47,375,Mercer St & Bleecker St,40.72679454,-73.99695094,504,1 Ave & E 15 St,40.73221853,-73.98165557,18244,Subscriber,1971,1
2,1005,2013-11-01 00:00:30,2013-11-01 00:17:15,264,Maiden Ln & Pearl St,40.70706456,-74.00731853,265,Stanton St & Chrystie St,40.72229346,-73.99147535,14636,Customer,\N,0
3,622,2013-11-01 00:01:29,2013-11-01 00:11:51,472,E 32 St & Park Ave,40.7457121,-73.98194829,174,E 25 St & 1 Ave,40.7381765,-73.97738662,16685,Subscriber,1980,1
4,1454,2013-11-01 00:01:37,2013-11-01 00:25:51,293,Lafayette St & E 8 St,40.73028666,-73.9907647,490,8 Ave & W 33 St,40.751551,-73.993934,18055,Subscriber,1975,2
5,1991,2013-11-01 00:01:53,2013-11-01 00:35:04,358,Christopher St & Greenwich St,40.73291553,-74.00711384,237,E 11 St & 2 Ave,40.73047309,-73.98672378,17529,Customer,\N,0
6,1989,2013-11-01 00:02:00,2013-11-01 00:35:09,358,Christopher St & Greenwich St,40.73291553,-74.00711384,237,E 11 St & 2 Ave,40.73047309,-73.98672378,17238,Customer,\N,0
7,690,2013-11-01 00:02:03,2013-11-01 00:13:33,509,9 Ave & W 22 St,40.7454973,-74.00197139,546,E 30 St & Park Ave S,40.74444921,-73.98303529,15892,Subscriber,1972,1
8,499,2013-11-01 00:02:19,2013-11-01 00:10:38,128,MacDougal St & Prince St,40.72710258,-74.00297088,531,Forsyth St & Broome St,40.71893904,-73.99266288,17260,Subscriber,1988,1
9,657,2013-11-01 00:02:22,2013-11-01 00:13:19,509,9 Ave & W 22 St,40.7454973,-74.00197139,546,E 30 St & Park Ave S,40.74444921,-73.98303529,19053,Subscriber,1986,1


Take a look at the schema with the `printSchema` method. This tells you the field name, field type, and whether the column is nullable or not (default is true).

In [8]:
nyCitibikeTripDF.printSchema()

root
 |-- tripduration: string (nullable = true)
 |-- starttime: string (nullable = true)
 |-- stoptime: string (nullable = true)
 |-- start station id: string (nullable = true)
 |-- start station name: string (nullable = true)
 |-- start station latitude: string (nullable = true)
 |-- start station longitude: string (nullable = true)
 |-- end station id: string (nullable = true)
 |-- end station name: string (nullable = true)
 |-- end station latitude: string (nullable = true)
 |-- end station longitude: string (nullable = true)
 |-- bikeid: string (nullable = true)
 |-- usertype: string (nullable = true)
 |-- birth year: string (nullable = true)
 |-- gender: string (nullable = true)
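
Every column above is typed `string` because the CSV was read without a schema or type inference. One consequence is that ordering or comparing such columns is lexicographic rather than numeric. The following is a minimal plain-Python sketch of that effect (it does not use Spark), using three `tripduration` values from the rows displayed earlier. In Spark, the CSV reader also accepts `option("inferSchema", "true")`, or columns can be cast explicitly, as is done for the coordinates later in this lesson.

```python
# Three tripduration values (in seconds) as they look when read as strings
durations_as_text = ["504", "747", "1005"]

# String ordering compares character by character, so "1005" sorts first
text_order = sorted(durations_as_text)

# Converting to integers for the comparison gives the numeric ordering
numeric_order = sorted(durations_as_text, key=int)

print(text_order)
print(numeric_order)
```

The same distinction applies to filters such as "greater than" on a string column, so it is usually worth checking the schema before sorting or comparing.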



Answer the following question:
> According to our data, which trips were taken by riders whose user type is `Customer`?

Use the DataFrame `select` and `filter` methods.

In [9]:
from pyspark.sql.functions import year
riderDF = (nyCitibikeTripDF 
    .select('tripduration', 'starttime', 'stoptime', 'start station latitude','start station longitude', 'end station latitude','end station longitude', 'bikeid', 'usertype', 'gender')
    .filter("usertype = 'Customer'")
)
display(riderDF)

Unnamed: 0,tripduration,starttime,stoptime,start station latitude,start station longitude,end station latitude,end station longitude,bikeid,usertype,gender
0,1005,2013-11-01 00:00:30,2013-11-01 00:17:15,40.70706456,-74.00731853,40.72229346,-73.99147535,14636,Customer,0
1,1991,2013-11-01 00:01:53,2013-11-01 00:35:04,40.73291553,-74.00711384,40.73047309,-73.98672378,17529,Customer,0
2,1989,2013-11-01 00:02:00,2013-11-01 00:35:09,40.73291553,-74.00711384,40.73047309,-73.98672378,17238,Customer,0
3,421,2013-11-01 00:06:14,2013-11-01 00:13:15,40.736502,-73.97809472,40.74025878,-73.98409214,17732,Customer,0
4,2002,2013-11-01 00:09:09,2013-11-01 00:42:31,40.71286844,-73.95698119,40.744219,-73.97121214,18157,Customer,0
5,370,2013-11-01 00:18:55,2013-11-01 00:25:05,40.73912601,-73.97973776,40.72955361,-73.98057249,19969,Customer,0
6,1667,2013-11-01 00:20:00,2013-11-01 00:47:47,40.70277159,-73.99383605,40.70277159,-73.99383605,20318,Customer,0
7,1577,2013-11-01 00:20:31,2013-11-01 00:46:48,40.70277159,-73.99383605,40.70277159,-73.99383605,17391,Customer,0
8,1345,2013-11-01 00:41:05,2013-11-01 01:03:30,40.72779126,-73.98564945,40.69839895,-73.98068914,15054,Customer,0
9,258,2013-11-01 00:43:07,2013-11-01 00:47:25,40.73038599,-74.00214988,40.73401143,-74.00293877,19857,Customer,0


In [10]:
riderDF.count()

680911

### Most Used Bikes

In [11]:
mostUsedBikesDF = riderDF.groupBy('bikeid').agg(F.count(F.col('usertype')).alias('userCount'))

In [12]:
display(mostUsedBikesDF)

Unnamed: 0,bikeid,userCount
0,18333,136
1,14838,150
2,20219,136
3,16974,23
4,15269,148
5,14899,112
6,15634,132
7,18314,87
8,19132,76
9,18634,127
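
The rows above come back in no particular order, so on their own they do not show which bikes are used most. As a small plain-Python sketch (not Spark), the snippet below ranks the ten displayed rows by count. These are only the rows shown above, not the full result. In the DataFrame API, the analogous step would presumably be appending `.orderBy(F.desc('userCount'))` to the aggregation.

```python
# (bikeid, userCount) pairs copied from the ten rows displayed above
displayed_rows = [
    ("18333", 136), ("14838", 150), ("20219", 136), ("16974", 23),
    ("15269", 148), ("14899", 112), ("15634", 132), ("18314", 87),
    ("19132", 76), ("18634", 127),
]

# Sort by count, largest first; rows with equal counts keep their original order
ranked = sorted(displayed_rows, key=lambda row: row[1], reverse=True)

top_three = ranked[:3]
print(top_three)
```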


### Longest Trips by Time

In [13]:
riderDF.printSchema()

root
 |-- tripduration: string (nullable = true)
 |-- starttime: string (nullable = true)
 |-- stoptime: string (nullable = true)
 |-- start station latitude: string (nullable = true)
 |-- start station longitude: string (nullable = true)
 |-- end station latitude: string (nullable = true)
 |-- end station longitude: string (nullable = true)
 |-- bikeid: string (nullable = true)
 |-- usertype: string (nullable = true)
 |-- gender: string (nullable = true)



In [14]:
from pyspark.sql.types import TimestampType, FloatType

longestTripByTimeDF = (
    riderDF.withColumn('tripDuration', 
                      F.minute((F.unix_timestamp(F.col('stoptime')) - F.unix_timestamp(F.col('starttime')))
                       .cast(TimestampType()))
                      ).orderBy(F.desc('tripDuration'))

)

In [15]:
display(longestTripByTimeDF)

Unnamed: 0,tripDuration,starttime,stoptime,start station latitude,start station longitude,end station latitude,end station longitude,bikeid,usertype,gender
0,59,2014-01-09 13:29:36,2014-01-09 16:29:05,40.76370739,-73.9851615,40.76370739,-73.9851615,20898,Customer,0
1,59,2014-01-01 11:55:21,2014-01-01 12:54:35,40.69221589,-73.9842844,40.70823502,-74.00530063,17886,Customer,0
2,59,2014-01-02 13:18:33,2014-01-02 14:18:21,40.76590936,-73.97634151,40.76695317,-73.98169333,18506,Customer,0
3,59,2014-01-01 11:54:48,2014-01-01 12:54:41,40.69221589,-73.9842844,40.70823502,-74.00530063,19914,Customer,0
4,59,2014-01-11 16:45:08,2014-01-11 17:44:57,40.75038009,-73.98338988,40.72456089,-73.99565293,16034,Customer,0
5,59,2014-01-12 13:05:05,2014-01-12 14:04:13,40.76590936,-73.97634151,40.76590936,-73.97634151,19401,Customer,0
6,59,2014-01-12 14:42:54,2014-01-12 15:42:16,40.69794,-73.96986848,40.68317813,-73.9659641,18244,Customer,0
7,59,2014-01-12 15:18:11,2014-01-12 18:17:28,40.73224119,-74.00026394,40.75513557,-73.98658032,19874,Customer,0
8,59,2014-01-15 17:26:23,2014-01-15 18:26:12,40.7454973,-74.00197139,40.74901271,-73.98848395,17610,Customer,0
9,59,2014-01-15 17:48:49,2014-01-15 18:48:02,40.76590936,-73.97634151,40.76590936,-73.97634151,18869,Customer,0
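
The `tripDuration` values above never exceed 59, even though the first row spans roughly three hours. That pattern is what one would expect if `F.minute` returns only the minute-of-hour field of the timestamp built from the elapsed seconds, rather than the total elapsed minutes. The plain-Python sketch below (no Spark; the helper names are illustrative) reproduces the effect for the first displayed row and contrasts it with dividing the elapsed seconds by 60. Note also that the original `tripduration` column already appears to hold the duration in seconds (504 for a trip from 00:00:17 to 00:08:41).

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def elapsed_seconds(start, stop):
    """Whole seconds between two timestamp strings."""
    delta = datetime.strptime(stop, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds())


def minute_of_hour(seconds):
    """Treat the seconds as a timestamp since the epoch and read its minute field."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).minute


def total_minutes(seconds):
    """Total elapsed minutes, including fractions."""
    return seconds / 60


# First row of the table above
seconds = elapsed_seconds("2014-01-09 13:29:36", "2014-01-09 16:29:05")

print(seconds, minute_of_hour(seconds), round(total_minutes(seconds), 2))
```

Under that reading, ranking trips by total minutes (or directly by the numeric value of `tripduration`) would give a different ordering from the one displayed above.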


### Longest Trip by Lat/Long

In [16]:
cleanRidersDF = (riderDF.dropna('any').select(
    F.col('tripduration'), F.col('starttime'), F.col('stoptime'),
    F.col('start station latitude').cast('float'),
    F.col('start station longitude').cast('float'),
    F.col('end station latitude').cast('float'),
    F.col('end station longitude').cast('float'), F.col('bikeid'),
    F.col('usertype'), F.col('gender')))
display(cleanRidersDF)

Unnamed: 0,tripduration,starttime,stoptime,start station latitude,start station longitude,end station latitude,end station longitude,bikeid,usertype,gender
0,1005,2013-11-01 00:00:30,2013-11-01 00:17:15,40.707066,-74.007317,40.722294,-73.991478,14636,Customer,0
1,1991,2013-11-01 00:01:53,2013-11-01 00:35:04,40.732914,-74.007111,40.730473,-73.986725,17529,Customer,0
2,1989,2013-11-01 00:02:00,2013-11-01 00:35:09,40.732914,-74.007111,40.730473,-73.986725,17238,Customer,0
3,421,2013-11-01 00:06:14,2013-11-01 00:13:15,40.736504,-73.978096,40.740257,-73.984093,17732,Customer,0
4,2002,2013-11-01 00:09:09,2013-11-01 00:42:31,40.712868,-73.956978,40.744221,-73.971214,18157,Customer,0
5,370,2013-11-01 00:18:55,2013-11-01 00:25:05,40.739124,-73.979736,40.729553,-73.980576,19969,Customer,0
6,1667,2013-11-01 00:20:00,2013-11-01 00:47:47,40.70277,-73.993835,40.70277,-73.993835,20318,Customer,0
7,1577,2013-11-01 00:20:31,2013-11-01 00:46:48,40.70277,-73.993835,40.70277,-73.993835,17391,Customer,0
8,1345,2013-11-01 00:41:05,2013-11-01 01:03:30,40.727791,-73.985649,40.698399,-73.98069,15054,Customer,0
9,258,2013-11-01 00:43:07,2013-11-01 00:47:25,40.730385,-74.002151,40.734013,-74.002937,19857,Customer,0


In [17]:
cleanRidersDF.printSchema()

root
 |-- tripduration: string (nullable = true)
 |-- starttime: string (nullable = true)
 |-- stoptime: string (nullable = true)
 |-- start station latitude: float (nullable = true)
 |-- start station longitude: float (nullable = true)
 |-- end station latitude: float (nullable = true)
 |-- end station longitude: float (nullable = true)
 |-- bikeid: string (nullable = true)
 |-- usertype: string (nullable = true)
 |-- gender: string (nullable = true)



In [18]:
from math import radians, cos, sin, asin, sqrt


def get_distance(longit_a, latit_a, longit_b, latit_b):
    # Transform to radians
    longit_a, latit_a, longit_b, latit_b = map(
        radians, [longit_a, latit_a, longit_b, latit_b])
    dist_longit = longit_b - longit_a
    dist_latit = latit_b - latit_a

    # Haversine term (named `area` here, although it is not an area)
    area = sin(dist_latit /
               2)**2 + cos(latit_a) * cos(latit_b) * sin(dist_longit / 2)**2

    # Calculate the central angle
    central_angle = 2 * asin(sqrt(area))
    radius = 6371  # mean Earth radius in kilometres

    # Calculate Distance
    distance = central_angle * radius
    return abs(round(distance, 2))

In [19]:
get_distance(40.707066, 74.007317, 40.722294, 73.991478)

1.82
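
One thing to watch with the haversine formula is argument order. `get_distance` declares its parameters as longitude first, then latitude, while the call above passes the latitude (about 40.7) first and the longitude (about 74.0) second. The sketch below is a self-contained variant with a latitude-first signature (the name `haversine_km` and its parameter order are assumptions made for illustration, not part of the notebook). It compares the two readings of the same four numbers, using the first `Customer` trip shown earlier.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371


def haversine_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat_a, lon_a, lat_b, lon_b = map(radians, [lat_a, lon_a, lat_b, lon_b])
    d_lat = lat_b - lat_a
    d_lon = lon_b - lon_a
    h = sin(d_lat / 2) ** 2 + cos(lat_a) * cos(lat_b) * sin(d_lon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Start and end stations of the first Customer trip, as (latitude, longitude)
start = (40.707066, -74.007317)
end = (40.722294, -73.991478)

# Latitude and longitude in their intended roles
as_intended = haversine_km(start[0], start[1], end[0], end[1])

# The same numbers with the roles exchanged, mirroring the call above
roles_swapped = haversine_km(74.007317, 40.707066, 73.991478, 40.722294)

print(round(as_intended, 2), round(roles_swapped, 2))
```

With the roles exchanged the result matches the 1.82 shown above, while the intended ordering gives a somewhat larger distance, a little over 2 km. The same question of ordering applies wherever the function is called with latitude and longitude columns.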

In [20]:
getDistanceUDF = spark.udf.register("getDistanceUDFSQL", get_distance, FloatType())

In [21]:
longestTripByLatLongDF = (
    cleanRidersDF.withColumn('tripDistance', 
                      getDistanceUDF(F.col('start station latitude'),
                                     F.col('start station longitude'),
                                     F.col('end station latitude'),
                                     F.col('end station longitude')
                          )
                      )
).orderBy(F.desc('tripDistance'))

In [22]:
display(longestTripByLatLongDF)

Unnamed: 0,tripduration,starttime,stoptime,start station latitude,start station longitude,end station latitude,end station longitude,bikeid,usertype,gender,tripDistance
0,4812,2013-10-21 17:16:23,2013-10-21 18:36:35,40.714977,-74.013016,40.680984,-73.95005,14871,Customer,0,7.08
1,1677,2013-08-10 22:32:02,2013-08-10 22:59:59,40.701221,-74.012344,40.680984,-73.95005,19464,Customer,0,6.95
2,1217,2013-07-12 18:44:44,2013-07-12 19:05:01,40.680344,-73.955772,40.708347,-74.017136,16827,Customer,0,6.88
3,1656,2013-07-28 16:29:40,2013-07-28 16:57:16,40.708347,-74.017136,40.680344,-73.955772,17826,Customer,0,6.88
4,1982,2013-07-21 18:19:00,2013-07-21 18:52:02,40.680344,-73.955772,40.708347,-74.017136,20191,Customer,0,6.88
5,1473,2013-09-14 13:06:18,2013-09-14 13:30:51,40.680344,-73.955772,40.708347,-74.017136,20622,Customer,0,6.88
6,1009,2013-08-19 15:53:03,2013-08-19 16:09:52,40.680344,-73.955772,40.71534,-74.016586,17636,Customer,0,6.85
7,484,2013-07-15 19:28:05,2013-07-15 19:36:09,40.705692,-74.016777,40.680344,-73.955772,17142,Customer,0,6.83
8,3434,2013-08-05 22:15:56,2013-08-05 23:13:10,40.720432,-74.010208,40.680984,-73.95005,19353,Customer,0,6.8
9,1515,2013-09-08 18:57:42,2013-09-08 19:22:57,40.711514,-74.015755,40.680344,-73.955772,15228,Customer,0,6.74


### Visualization

In this section, you'll learn how to visualize your Spark DataFrame on a map using kepler.gl and GeoPandas.

In [23]:
kepler_map = keplergl.KeplerGl(height=800)

User Guide: https://github.com/keplergl/kepler.gl/blob/master/docs/keplergl-jupyter/user-guide.md


In [24]:
kepler_map

KeplerGl(height=800)

In [25]:
cleanRidersPandasDF = cleanRidersDF.toPandas().sample(100)

In [26]:
start_location_gdf = (
    geopandas.GeoDataFrame(cleanRidersPandasDF, 
                           # points_from_xy expects x (longitude) first, then y (latitude)
                           geometry=geopandas.points_from_xy(cleanRidersPandasDF['start station longitude'], 
                                                             cleanRidersPandasDF['start station latitude']))

)

In [27]:
kepler_map.add_data(data=start_location_gdf, name="Start-Locations")

In [28]:
kepler_map.save_to_html(file_name='Citi-Bike-Start-to-End-Trips.html', read_only=True)

Map saved to Citi-Bike-Start-to-End-Trips.html!
