# Initial data analysis of taxi rides
This notebook contains an exploratory data analysis (EDA) of the Chicago taxi data, carried out to inform the design of the ETL process.

In [2]:
import os
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local') \
    .appName('taxi') \
    .getOrCreate()

We begin by reading one month of data (January 2016) and then:
* inspect the dataframe schema
* peek at the top rows
* check how many records we are dealing with

In [7]:
data_folder = '/Users/tomra/Projects/data-engineering/udacity-data-engineer-nanodegree/06-capstone-project/data'
df = spark.read \
    .format('csv') \
    .options(header=True, inferSchema=True) \
    .load(os.path.join(data_folder, 'chicago-taxi-rides-2016', 'chicago_taxi_trips_2016_01.csv'))
df.printSchema()

root
 |-- taxi_id: integer (nullable = true)
 |-- trip_start_timestamp: timestamp (nullable = true)
 |-- trip_end_timestamp: timestamp (nullable = true)
 |-- trip_seconds: integer (nullable = true)
 |-- trip_miles: double (nullable = true)
 |-- pickup_census_tract: string (nullable = true)
 |-- dropoff_census_tract: integer (nullable = true)
 |-- pickup_community_area: integer (nullable = true)
 |-- dropoff_community_area: integer (nullable = true)
 |-- fare: double (nullable = true)
 |-- tips: double (nullable = true)
 |-- tolls: double (nullable = true)
 |-- extras: double (nullable = true)
 |-- trip_total: double (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- company: integer (nullable = true)
 |-- pickup_latitude: integer (nullable = true)
 |-- pickup_longitude: integer (nullable = true)
 |-- dropoff_latitude: integer (nullable = true)
 |-- dropoff_longitude: integer (nullable = true)
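
Schema inference costs an extra pass over the CSV file and can guess wrong: `pickup_census_tract` comes back as `string`, presumably because the column holds no values in this month (the NULL counts later in this notebook show it is empty in every record). A minimal sketch of an explicit schema follows, written as a DDL string of the kind `DataFrameReader.schema` accepts. The types mirror the inferred schema above, except for `pickup_census_tract`, which is assumed to be an integer like `dropoff_census_tract`. The names `COLUMN_TYPES`, `schema_ddl` and `SCHEMA_DDL` are illustrative and not part of the original notebook.

```python
# Column names and order follow the printSchema() output above.
# Assumption: pickup_census_tract holds integers like dropoff_census_tract.
COLUMN_TYPES = {
    "taxi_id": "INT",
    "trip_start_timestamp": "TIMESTAMP",
    "trip_end_timestamp": "TIMESTAMP",
    "trip_seconds": "INT",
    "trip_miles": "DOUBLE",
    "pickup_census_tract": "INT",
    "dropoff_census_tract": "INT",
    "pickup_community_area": "INT",
    "dropoff_community_area": "INT",
    "fare": "DOUBLE",
    "tips": "DOUBLE",
    "tolls": "DOUBLE",
    "extras": "DOUBLE",
    "trip_total": "DOUBLE",
    "payment_type": "STRING",
    "company": "INT",
    "pickup_latitude": "INT",
    "pickup_longitude": "INT",
    "dropoff_latitude": "INT",
    "dropoff_longitude": "INT",
}


def schema_ddl(column_types):
    """Render a DDL schema string such as 'a INT, b DOUBLE'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in column_types.items())


SCHEMA_DDL = schema_ddl(COLUMN_TYPES)

# Usage with the reader from this notebook (not executed here):
# df = spark.read.format('csv').options(header=True).schema(SCHEMA_DDL).load(path)
```

With a fixed schema, every monthly file is read with the same column types, whatever values happen to be present in that month.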



In [12]:
df.limit(5).toPandas()

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,85,2016-01-13 06:15:00,2016-01-13 06:15:00,180,0.4,,,24.0,24.0,4.5,0.0,0.0,0.0,4.5,Cash,107.0,199.0,510.0,199.0,510.0
1,2776,2016-01-22 09:30:00,2016-01-22 09:45:00,240,0.7,,,,,4.45,4.45,0.0,0.0,8.9,Credit Card,,,,,
2,3168,2016-01-31 21:30:00,2016-01-31 21:30:00,0,0.0,,,,,42.75,5.0,0.0,0.0,47.75,Credit Card,119.0,,,,
3,4237,2016-01-23 17:30:00,2016-01-23 17:30:00,480,1.1,,,6.0,6.0,7.0,0.0,0.0,0.0,7.0,Cash,,686.0,500.0,686.0,500.0
4,5710,2016-01-14 05:45:00,2016-01-14 06:00:00,480,2.71,,,32.0,,10.25,0.0,0.0,0.0,10.25,Cash,,385.0,478.0,,


In [9]:
print(f"Total number of records in January: {df.count()}")

Total number of records in January: 1705805


## Check the number of null values
Next we count the NULL values in each column to decide how to handle them.

In [23]:
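from pyspark.sql.functions import isnull, when, count

# when() yields the column name as a literal for NULL rows and NULL otherwise;
# count() skips NULLs, so each result is the number of NULL rows in that column.
df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).toPandas()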

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,23,0,125,314,14,1705805,738326,285789,313655,33,33,33,33,33,0,632726,285757,285757,311682,311682


Several columns have a substantial number of missing values, and `pickup_census_tract` is NULL in all 1,705,805 records. This is mostly due to the measures taken to retain the anonymity of the passengers.
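
Raw counts are easier to judge as a share of all rows. The sketch below is plain Python that reuses the NULL counts and the row total reported above, so nothing is recomputed from the data. The names `TOTAL_ROWS`, `NULL_COUNTS` and `null_share` are illustrative.

```python
TOTAL_ROWS = 1705805  # df.count() for January 2016, as printed above

# NULL counts copied from the output above (columns without NULLs are omitted).
NULL_COUNTS = {
    "taxi_id": 23,
    "trip_end_timestamp": 125,
    "trip_seconds": 314,
    "trip_miles": 14,
    "pickup_census_tract": 1705805,
    "dropoff_census_tract": 738326,
    "pickup_community_area": 285789,
    "dropoff_community_area": 313655,
    "fare": 33,
    "tips": 33,
    "tolls": 33,
    "extras": 33,
    "trip_total": 33,
    "company": 632726,
    "pickup_latitude": 285757,
    "pickup_longitude": 285757,
    "dropoff_latitude": 311682,
    "dropoff_longitude": 311682,
}


def null_share(null_counts, total_rows):
    """Return {column: fraction of rows that are NULL}, largest share first."""
    shares = {column: n / total_rows for column, n in null_counts.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


# Print only the columns where at least 1% of the rows are NULL.
for column, share in null_share(NULL_COUNTS, TOTAL_ROWS).items():
    if share >= 0.01:
        print(f"{column:<24} {share:6.1%}")
```

The location and `company` columns dominate the list, while the NULL share of the trip and fare columns stays far below 1%.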

## Check basic descriptive statistics of numerical fields
Finally, we run a simple check of the "shape" of the numerical values in our dataset.

In [24]:
df.describe().toPandas()

Unnamed: 0,summary,taxi_id,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,count,1705782.0,1705491.0,1705791.0,0.0,967479.0,1420016.0,1392150.0,1705772.0,1705772.0,1705772.0,1705772.0,1705772.0,1705805,1073079.0,1420048.0,1420048.0,1394123.0,1394123.0
1,mean,4389.322577562666,653.442181752938,2.8727017026136483,,516.8220157750194,23.220739062095078,20.990691376647632,13.153964152301626,1.5151068196688693,0.0043082017995371,0.948484985097658,15.621889226697071,,92.60232098475508,392.1435683864207,437.7735576543892,401.3053224141629,438.8507061428583
2,stddev,2515.81925889202,932.726047050346,18.107933771818317,,357.5866115771377,19.819355022412203,17.372373287471714,32.874214509629354,2.7449608123129936,0.836362097367778,25.596044140431605,42.72207968577074,,34.13512623911379,252.7693261645224,194.88939422262249,254.4150475007729,202.3008724903298
3,min,0.0,0.0,0.0,,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,Cash,2.0,0.0,1.0,0.0,1.0
4,max,8762.0,86340.0,3280.0,,1140.0,77.0,77.0,9002.29,450.0,999.99,9993.41,9997.16,Unknown,119.0,784.0,785.0,784.0,785.0


### Longest duration

In [34]:
# 28800 seconds = 8 hours
df.select('taxi_id', 'trip_start_timestamp', 'trip_end_timestamp','trip_seconds', 'trip_miles', 'trip_total') \
    .where('trip_seconds > 28800') \
    .sort('trip_seconds', ascending=False) \
    .toPandas()

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,trip_total
0,5884,2016-01-29 18:45:00,2016-01-30 18:45:00,86340,0.0,3.00
1,4598,2016-01-01 01:15:00,2016-01-02 01:15:00,86340,0.0,24.81
2,5884,2016-01-10 03:45:00,2016-01-11 03:45:00,86340,0.0,8.00
3,5884,2016-01-20 16:15:00,2016-01-21 16:15:00,86340,0.0,15.30
4,819,2016-01-10 20:30:00,2016-01-11 20:30:00,86340,0.0,96.00
...,...,...,...,...,...,...
257,5943,2016-01-03 15:45:00,2016-01-04 00:00:00,29460,17.3,46.62
258,7560,2016-01-31 15:45:00,2016-02-01 00:00:00,29280,0.0,0.00
259,5411,2016-01-13 15:45:00,2016-01-14 00:00:00,29280,1.0,6.00
260,8149,2016-01-18 23:00:00,2016-01-19 07:15:00,29160,0.0,162.00
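
The publishers state (quoted in the Summary) that start and end times are rounded to the nearest 15 minutes. That would explain why trips of 86,340 seconds (23 h 59 min) show timestamps exactly 24 hours apart. The sketch below illustrates the effect with a hypothetical raw pickup time, since the true times are not published. The helper `round_to_quarter_hour` and its tie-breaking rule (ties round up) are assumptions, not the publishers' documented method.

```python
from datetime import datetime, timedelta

QUARTER_HOUR = 15 * 60  # seconds


def round_to_quarter_hour(ts):
    """Round a datetime to the nearest 15 minutes (ties round up, an assumption)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds())
    rounded = (elapsed + QUARTER_HOUR // 2) // QUARTER_HOUR * QUARTER_HOUR
    return midnight + timedelta(seconds=rounded)


# Hypothetical raw times: only the rounded values appear in the dataset.
raw_start = datetime(2016, 1, 29, 18, 45, 0)
raw_end = raw_start + timedelta(seconds=86340)  # 23 h 59 min later

print(round_to_quarter_hour(raw_start), round_to_quarter_hour(raw_end))
```

The same rounding makes short trips appear to start and end at the same time, so `trip_seconds` is a better measure of duration than the difference between the two timestamps.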


### Longest distance

In [35]:
df.select('taxi_id', 'trip_start_timestamp', 'trip_end_timestamp','trip_seconds', 'trip_miles', 'trip_total') \
    .where('trip_miles > 500') \
    .sort('trip_miles', ascending=False) \
    .toPandas()

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,trip_total
0,5272,2016-01-06 21:15:00,2016-01-06 22:00:00,2460,3280.0,125.31
1,4362,2016-01-14 08:30:00,2016-01-14 09:30:00,3420,3000.0,78.25
2,4362,2016-01-28 08:15:00,2016-01-28 09:00:00,2280,2970.0,94.38
3,4362,2016-01-02 18:45:00,2016-01-02 19:15:00,1920,2430.0,62.25
4,5272,2016-01-10 17:15:00,2016-01-10 18:00:00,2580,2130.0,102.00
...,...,...,...,...,...,...
247,5272,2016-01-31 17:45:00,2016-01-31 18:00:00,960,530.0,18.00
248,4303,2016-01-28 08:45:00,2016-01-28 09:00:00,1020,520.0,15.75
249,4303,2016-01-01 02:15:00,2016-01-01 02:30:00,1020,520.0,17.77
250,5272,2016-01-20 21:15:00,2016-01-20 21:15:00,780,510.0,15.50
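
One way to see that these distances are inconsistent with the recorded durations is to compute the average speed they imply. The sketch below does this in plain Python for the five longest trips listed above. The helper `implied_mph` is illustrative and not part of the original notebook.

```python
def implied_mph(trip_miles, trip_seconds):
    """Average speed in miles per hour, or None if the duration is zero or missing."""
    if not trip_seconds:
        return None
    return trip_miles * 3600 / trip_seconds


# (taxi_id, trip_seconds, trip_miles) for the five longest trips listed above.
LONGEST_TRIPS = [
    (5272, 2460, 3280.0),
    (4362, 3420, 3000.0),
    (4362, 2280, 2970.0),
    (4362, 1920, 2430.0),
    (5272, 2580, 2130.0),
]

for taxi_id, seconds, miles in LONGEST_TRIPS:
    mph = implied_mph(miles, seconds)
    print(f"taxi {taxi_id}: {miles:.0f} miles in {seconds} s -> {mph:,.0f} mph")
```

Each of these five trips implies a speed of thousands of miles per hour, so the distance, the duration, or both must be wrong. This is an example of an oddity that only shows up when two columns are considered together.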


### Most expensive trips

In [36]:
df.select('taxi_id', 'trip_start_timestamp', 'trip_end_timestamp','trip_seconds', 'trip_miles', 'trip_total') \
    .where('trip_total > 1000') \
    .sort('trip_total', ascending=False) \
    .toPandas()

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,trip_total
0,170,2016-01-15 11:45:00,2016-01-15 11:45:00,120,0.0,9997.16
1,7611,2016-01-02 02:00:00,2016-01-02 02:15:00,1380,0.9,9052.39
2,7611,2016-01-02 04:00:00,2016-01-02 04:15:00,1200,0.2,9051.20
3,2146,2016-01-04 19:00:00,2016-01-04 19:15:00,840,0.1,9051.05
4,7611,2016-01-02 03:45:00,2016-01-02 03:45:00,360,0.0,9050.98
...,...,...,...,...,...,...
98,2146,2016-01-05 09:45:00,2016-01-05 09:45:00,0,0.0,1050.32
99,1270,2016-01-02 16:00:00,2016-01-02 16:00:00,0,0.0,1050.32
100,3916,2016-01-01 14:30:00,2016-01-01 14:45:00,660,0.0,1009.73
101,2509,2016-01-02 20:00:00,2016-01-02 20:00:00,60,0.0,1000.35


#### Expensive trips with short duration or distance

In [38]:
df.select('taxi_id', 'trip_start_timestamp', 'trip_end_timestamp','trip_seconds', 'trip_miles', 'trip_total') \
    .where('trip_total > 1000 AND (trip_seconds < 3600 OR trip_miles < 100)') \
    .sort('trip_total', ascending=False) \
    .toPandas()

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,trip_total
0,170,2016-01-15 11:45:00,2016-01-15 11:45:00,120,0.0,9997.16
1,7611,2016-01-02 02:00:00,2016-01-02 02:15:00,1380,0.9,9052.39
2,7611,2016-01-02 04:00:00,2016-01-02 04:15:00,1200,0.2,9051.20
3,2146,2016-01-04 19:00:00,2016-01-04 19:15:00,840,0.1,9051.05
4,7611,2016-01-02 03:45:00,2016-01-02 03:45:00,360,0.0,9050.98
...,...,...,...,...,...,...
97,2146,2016-01-05 09:45:00,2016-01-05 09:45:00,0,0.0,1050.32
98,1270,2016-01-02 16:00:00,2016-01-02 16:00:00,0,0.0,1050.32
99,3916,2016-01-01 14:30:00,2016-01-01 14:45:00,660,0.0,1009.73
100,2509,2016-01-02 20:00:00,2016-01-02 20:00:00,60,0.0,1000.35


## Summary
The dataset clearly contains outliers and oddities. As I am no expert on the subject, I decided to leave the data as is, for the following reasons:
* There might be special cases that make some of this data valid
* We cannot simply drop, e.g., the top n percent of each column, because it is the interrelationship between column values that makes a record odd
* Without expert opinion it is difficult to define thresholds for what to keep and what to drop
* Leaving the data intact makes it possible to use it later to develop a proper algorithm for handling the outliers
* If records were dropped, one option would be to store them in a separate dataset for analysis
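
If records were ever separated out, the split could be expressed as a pair of complementary SQL predicates for `DataFrame.where`, in the same string form used by the queries in this notebook. The sketch below only builds the predicate strings. The thresholds are the ones used in the exploratory queries and serve purely as placeholders, not as vetted cut-offs, and the names `THRESHOLDS`, `suspicious_predicate`, `SUSPICIOUS` and `KEPT` are illustrative.

```python
# Thresholds taken from the exploratory queries in this notebook.
# They are placeholders for illustration, not vetted cut-offs.
THRESHOLDS = {
    "trip_seconds": 28800,  # 8 hours
    "trip_miles": 500,
    "trip_total": 1000,
}


def suspicious_predicate(thresholds):
    """SQL predicate that is TRUE for rows exceeding any threshold, never NULL."""
    any_exceeded = " OR ".join(
        f"{column} > {limit}" for column, limit in thresholds.items()
    )
    # A comparison against NULL yields NULL, and where() drops such rows.
    # coalesce(..., false) maps NULL to FALSE so that every row lands in
    # exactly one of the two subsets.
    return f"coalesce({any_exceeded}, false)"


SUSPICIOUS = suspicious_predicate(THRESHOLDS)
KEPT = f"NOT {SUSPICIOUS}"

# Usage with the dataframe from this notebook (not executed here):
# quarantined = df.where(SUSPICIOUS)
# kept = df.where(KEPT)
```

Per-column thresholds are only a stand-in here. A rule based on the relationship between columns could be substituted without changing the structure of the split.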

Before releasing the data, the dataset publishers took the following steps to anonymize it:
> ...\[the Taxi ID\] is created specifically for this dataset, with no external meaning, to allow users to determine rides provided by the same taxi but not which taxi.
>
> ...
> 
> ...we have rounded all start and end times to the nearest 15 minutes.
> 
> ...
> 
> ...we provide location only at the Census Tract and Community Area levels

The dataset publishers had this to say about outliers in the data:
> ...we have applied the following corrections to the data.
> * Trip times less than zero or greater than 86,400 seconds are removed.
> * Trip lengths less than zero or greater than 3,500 miles are removed.
> * If any component of the trip cost is less than $0 or greater than $10,000, all components of the trip cost are removed.
>
> ...
>
> Naturally, many of the extreme values that remain likely are also wrong but we prefer to leave it to the user to filter further, based on his or her judgement and needs for a particular use of the data.

Link to the source: https://digital.cityofchicago.org/index.php/chicago-taxi-data-released/
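
The limits quoted by the publishers can double as a sanity check on the extremes found in this notebook. The sketch below compares the minimum and maximum values from the `describe()` output with those limits. Treating `trip_total` as a component of the trip cost is an assumption, and the names `PUBLISHED_BOUNDS`, `OBSERVED` and `out_of_bounds` are illustrative.

```python
# Limits quoted from the publishers' note: values outside them were removed.
# Assumption: trip_total counts as a "component of the trip cost".
PUBLISHED_BOUNDS = {
    "trip_seconds": (0, 86400),
    "trip_miles": (0, 3500),
    "fare": (0, 10000),
    "tips": (0, 10000),
    "tolls": (0, 10000),
    "extras": (0, 10000),
    "trip_total": (0, 10000),
}

# (min, max) per column, copied from the describe() output in this notebook.
OBSERVED = {
    "trip_seconds": (0.0, 86340.0),
    "trip_miles": (0.0, 3280.0),
    "fare": (0.0, 9002.29),
    "tips": (0.0, 450.0),
    "tolls": (0.0, 999.99),
    "extras": (0.0, 9993.41),
    "trip_total": (0.0, 9997.16),
}


def out_of_bounds(observed, bounds):
    """Return the columns whose observed (min, max) falls outside the allowed range."""
    return [
        column
        for column, (low, high) in bounds.items()
        if observed[column][0] < low or observed[column][1] > high
    ]


print(out_of_bounds(OBSERVED, PUBLISHED_BOUNDS))
```

For January 2016 no column falls outside the published limits, which is consistent with the corrections described in the quote. The extreme values seen in this notebook are therefore ones the publishers deliberately left for the user to judge.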