# Cleaning Up the NYC Taxi Rides Dataset
The inspiration for using this dataset came from the following Kaggle competitions and datasets:
- [Competition: New York City Taxi Trip Duration](https://www.kaggle.com/c/nyc-taxi-trip-duration)
- [Competition: New York City Taxi Fare Prediction](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction)
- [Dataset: 2014 New York City Taxi Trips](https://www.kaggle.com/kentonnlp/2014-new-york-city-taxi-trips)

The original dataset was taken from here: <http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml>

The description for the dataset's fields can be found here: <http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf>

We use the data from August 2015. August was chosen because it has no holidays, and 2015 because since 2016 the exact location data has been replaced by an area indicator.


## Importing Packages

In [1]:
import numpy as np
import pandas as pd

## Defining Parameters

In [11]:
n_samples = 100000  ## Number of samples in the clean dataset

## Bounding Box
Set a bounding box of 10 x 10 km whose south-west corner is at the UTM coordinate (582500, 4505500), given as easting and northing in meters.

In [12]:
## You might need to install pyproj: "pip install pyproj==1.9.6"
import pyproj

latlong_to_utm = pyproj.Proj("+proj=utm +zone=18 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")

utm_west = 582.5
utm_south = 4505.5
utm_east = utm_west + 10
utm_north = utm_south + 10

west_longitude, south_latitude = latlong_to_utm(utm_west * 1e3, utm_south * 1e3, inverse=True)
east_longitude, north_latitude = latlong_to_utm(utm_east * 1e3, utm_north * 1e3, inverse=True)

print('Lat-Long bounding box (as [west, east, south, north]): [{}, {}, {}, {}]'.format(west_longitude, east_longitude, south_latitude, north_latitude))
print('\n')
print('UTM bounding box (as [west, east, south, north] in kilometers): [{:.1f}, {:.1f}, {:.1f}, {:.1f}]'.format(utm_west, utm_east, utm_south, utm_north))

Lat-Long bounding box (as [west, east, south, north]): [-74.02351874272918, -73.90369889637867, 40.69627480706037, 40.78528305246283]


UTM bounding box (as [west, east, south, north] in kilometers): [582.5, 592.5, 4505.5, 4515.5]
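
As a rough sanity check, the sides of the box can be estimated directly from the printed latitude/longitude corners with the haversine formula. This is only a sketch and not part of the original notebook: the `haversine_km` helper and the spherical Earth radius of 6371 km are assumptions, so the result only approximates the 10 km UTM sides.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius of a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Corner values copied from the printed lat-long bounding box
west, east = -74.02351874272918, -73.90369889637867
south, north = 40.69627480706037, 40.78528305246283

width_km = haversine_km(south, west, south, east)    # roughly west to east
height_km = haversine_km(south, west, north, west)   # roughly south to north
print('Approximate box size: {:.2f} km x {:.2f} km'.format(width_km, height_km))
```

Both values should come out close to 10 km. Small deviations are expected because the UTM grid is slightly rotated relative to true north and the Earth is not a perfect sphere.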


## Downloading Dataset

In [13]:
import os
import subprocess
import requests
import tqdm

if not os.path.isfile('../original/yellow_tripdata_2015-08.csv'):
    response = requests.get('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-08.csv', stream=True)
    with open('../original/yellow_tripdata_2015-08.csv', 'wb') as fid:
        total_length = int(response.headers.get('content-length'))
        for chunk in tqdm.tqdm_notebook(response.iter_content(chunk_size=1024), desc='Downloading', total=(total_length / 1024) + 1): 
            if chunk:
                fid.write(chunk)
                fid.flush()

## Loading the Dataset
This might take a few minutes since the full dataset size is 1.6 GB.

In [14]:
full_dataset = pd.read_csv('../original/yellow_tripdata_2015-08.csv')
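
If memory is tight, `pd.read_csv` can load only selected columns and process the file in chunks, filtering each chunk before keeping it. The following is a minimal sketch on a tiny in-memory CSV. The made-up rows, the column subset, the chunk size, and the `load_filtered` helper are illustrative assumptions and not part of the original notebook.

```python
import io

import pandas as pd

# Toy stand-in for the real CSV: made-up rows in the same format
csv_text = """VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,RatecodeID
2,2015-08-01 00:00:15,2015-08-01 00:36:21,7.22,1
1,2015-08-01 00:00:16,2015-08-01 00:14:52,2.3,2
1,2015-08-01 00:00:17,2015-08-01 00:08:26,1.8,1
"""

# Hypothetical subset of columns to keep
columns = ['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance', 'RatecodeID']


def load_filtered(source, usecols, chunksize):
    """Read a CSV in chunks, keep rides with RatecodeID == 1, and concatenate them."""
    kept = []
    for chunk in pd.read_csv(source, usecols=usecols, chunksize=chunksize):
        kept.append(chunk[chunk['RatecodeID'] == 1])
    return pd.concat(kept, ignore_index=True)


sample = load_filtered(io.StringIO(csv_text), columns, chunksize=2)
print(sample)
```

With the real file, `source` would be the CSV path instead of the `io.StringIO` object. Filtering on `RatecodeID` this early mirrors a condition that the cleanup step applies anyway.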

## Displaying the first 10 rows of the dataset

In [15]:
print(len(full_dataset))
full_dataset.head(10)

11130304


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-08-01 00:00:15,2015-08-01 00:36:21,1,7.22,-73.999809,40.74334,1,N,-73.942848,40.806622,2,29.5,0.5,0.5,0.0,0.0,0.3,30.8
1,1,2015-08-01 00:00:16,2015-08-01 00:14:52,1,2.3,-73.977043,40.774902,1,N,-73.978256,40.749863,1,12.0,0.5,0.5,2.93,0.0,0.3,16.23
2,1,2015-08-01 00:00:16,2015-08-01 00:06:30,1,1.5,-73.959122,40.775127,1,N,-73.980392,40.782314,1,7.0,0.5,0.5,1.65,0.0,0.3,9.95
3,1,2015-08-01 00:00:16,2015-08-01 00:06:18,1,0.9,-73.976624,40.780746,1,N,-73.970558,40.788845,1,6.0,0.5,0.5,1.45,0.0,0.3,8.75
4,2,2015-08-01 00:00:16,2015-08-01 00:16:28,1,2.44,-73.978592,40.785919,1,N,-73.997353,40.756302,1,13.0,0.5,0.5,2.0,0.0,0.3,16.3
5,2,2015-08-01 00:00:16,2015-08-01 00:13:17,1,3.36,-73.976379,40.785889,1,N,-73.942413,40.82209,1,13.0,0.5,0.5,3.58,0.0,0.3,17.88
6,2,2015-08-01 00:00:16,2015-08-01 00:14:00,2,2.34,-73.986214,40.760872,1,N,-73.956924,40.771561,1,11.5,0.5,0.5,1.0,0.0,0.3,13.8
7,2,2015-08-01 00:00:16,2015-08-01 00:25:25,1,10.19,-73.789978,40.644058,1,N,-73.931221,40.67588,2,31.5,0.5,0.5,0.0,0.0,0.3,32.8
8,1,2015-08-01 00:00:17,2015-08-01 00:26:59,2,3.3,-73.993744,40.727383,1,N,-73.998161,40.764099,1,18.0,0.5,0.5,2.0,0.0,0.3,21.3
9,1,2015-08-01 00:00:17,2015-08-01 00:08:26,1,1.8,-73.994881,40.740059,1,N,-73.976715,40.749336,1,8.0,0.5,0.5,1.85,0.0,0.3,11.15


## Cleaning up the data

In [16]:
dataset = full_dataset.copy()  # Create a copy of the data

## Extract some relevant fields for further processing
pickup_time = pd.to_datetime(dataset['tpep_pickup_datetime'])
dropoff_time = pd.to_datetime(dataset['tpep_dropoff_datetime'])
pickup_date = pd.to_datetime(pickup_time.dt.date.astype(str))
pickup_time_of_day = pickup_time - pickup_date

## Convert Latitude and Longitude to UTM
dataset['pickup_easting'], dataset['pickup_northing'] = latlong_to_utm(dataset['pickup_longitude'].values, dataset['pickup_latitude'].values)

dataset['dropoff_easting'], dataset['dropoff_northing'] = latlong_to_utm(dataset['dropoff_longitude'].values, dataset['dropoff_latitude'].values)

## Convert meters to kilometers
dataset['pickup_easting'] /= 1000
dataset['pickup_northing'] /= 1000
dataset['dropoff_easting'] /= 1000
dataset['dropoff_northing'] /= 1000

## Convert trip distance to kilometers
dataset['trip_distance'] *= 1.60934

## Generate the duration (minutes), day_of_week, day_of_month and time_of_day (hours) fields
dataset['duration'] = (dropoff_time - pickup_time).dt.total_seconds() / 60
dataset['day_of_week'] = pickup_time.dt.weekday
dataset['day_of_month'] = pickup_time.dt.day
dataset['time_of_day'] = pickup_time_of_day.dt.total_seconds() / 60 / 60

## Filter the data
dataset = dataset.query(
    ## Keep only rides which started and ended within the bounding box
    'pickup_easting > {} &'.format(utm_west) +
    'pickup_easting < {} &'.format(utm_east) +
    'pickup_northing > {} &'.format(utm_south) +
    'pickup_northing < {} &'.format(utm_north) +
    'dropoff_easting > {} &'.format(utm_west) +
    'dropoff_easting < {} &'.format(utm_east) +
    'dropoff_northing > {} &'.format(utm_south) +
    'dropoff_northing < {} &'.format(utm_north) +
    ## Remove zero length rides
    'trip_distance > 0 &' +
    '((pickup_easting != dropoff_easting) | (pickup_northing != dropoff_northing)) &' +
    ## Remove really long rides (trip_distance is already in kilometers here, so the 25000 limit is very loose)
    'trip_distance < 25000 &' +
    'duration < 60 &' +
    ## Remove really short rides
    'duration > 0.1 &' +
    ## Remove rides with non-regular rates
    'RatecodeID == 1')

## Sample n_samples random rides
rand_gen = np.random.RandomState(0)
dataset = dataset.sample(n_samples, random_state=rand_gen).reset_index(drop=True)

## Remove unnecessary fields
dataset = dataset.drop(columns=[
    'VendorID',
    'RatecodeID',
    'store_and_fwd_flag',
    'extra',
    'tolls_amount',
    'improvement_surcharge',
    'mta_tax',
    'total_amount',
    'pickup_latitude',
    'pickup_longitude',
    'dropoff_latitude',
    'dropoff_longitude',
    'tpep_pickup_datetime',
    'tpep_dropoff_datetime',
])

## Print first 10 rows
dataset.head(10)

Unnamed: 0,passenger_count,trip_distance,payment_type,fare_amount,tip_amount,pickup_easting,pickup_northing,dropoff_easting,dropoff_northing,duration,day_of_week,day_of_month,time_of_day
0,2,2.768065,2,9.5,0.0,586.996941,4512.979705,588.155118,4515.180889,11.516667,3,13,12.801944
1,1,3.21868,2,10.0,0.0,587.151523,4512.923924,584.850489,4512.632082,12.666667,6,16,20.961389
2,1,2.574944,1,7.0,2.49,587.005357,4513.3597,585.434188,4513.174964,5.516667,0,31,20.412778
3,1,0.965604,1,7.5,1.65,586.648975,4511.729212,586.67153,4512.554065,9.883333,1,25,13.031389
4,1,2.46229,1,7.5,1.66,586.967178,4511.894301,585.262474,4511.755477,8.683333,2,5,7.703333
5,5,1.56106,1,7.5,2.2,585.926415,4512.880385,585.168973,4511.540103,9.433333,3,20,20.667222
6,1,2.574944,1,8.0,1.0,586.731409,4515.084445,588.710175,4514.209184,7.95,5,8,23.841944
7,1,0.80467,2,5.0,0.0,585.344614,4509.712541,585.843967,4509.545089,4.95,5,29,15.831389
8,1,3.653202,1,10.0,1.1,585.422062,4509.477536,583.671081,4507.735573,11.066667,5,8,2.098333
9,6,1.625433,1,5.5,1.36,587.875433,4514.931073,587.701248,4513.709691,4.216667,3,13,21.783056
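
The derived fields and the bounding-box filter can be tried out on a toy table before running them on the full dataset. This is a minimal sketch under stated assumptions: the two rides are made up (the second is chosen to reproduce the duration, day of week, and time of day of the first output row), and referencing Python variables with `@name` inside `query()` is used here as an alternative to building the query string with `format`.

```python
import pandas as pd

# Toy rides with UTM coordinates in kilometers; values are illustrative only
rides = pd.DataFrame({
    'pickup_time': pd.to_datetime(['2015-08-01 00:00:15', '2015-08-13 12:48:07']),
    'dropoff_time': pd.to_datetime(['2015-08-01 00:36:21', '2015-08-13 12:59:38']),
    'pickup_easting': [581.0, 586.996941],   # first ride starts west of the box
    'pickup_northing': [4510.0, 4512.979705],
})

# Duration in minutes and time of day in hours, via the datetime/timedelta accessors
rides['duration'] = (rides['dropoff_time'] - rides['pickup_time']).dt.total_seconds() / 60
midnight = rides['pickup_time'].dt.normalize()  # same date at 00:00:00
rides['time_of_day'] = (rides['pickup_time'] - midnight).dt.total_seconds() / 3600
rides['day_of_week'] = rides['pickup_time'].dt.weekday  # Monday=0 ... Sunday=6

# Bounding box in kilometers, as defined in the parameters above
utm_west, utm_south = 582.5, 4505.5
utm_east, utm_north = utm_west + 10, utm_south + 10

# `@name` inside query() refers to a Python variable in the calling scope
inside = rides.query(
    'pickup_easting > @utm_west & pickup_easting < @utm_east & '
    'pickup_northing > @utm_south & pickup_northing < @utm_north & '
    'duration > 0.1 & duration < 60'
)
print(inside[['duration', 'time_of_day', 'day_of_week']])
```

Only the second ride passes the filter, since the first one starts outside the box.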


## Save the clean dataset

In [17]:
dataset.to_csv('../../datasets/nyc_taxi_rides.csv', index=False)

## Generate a map image according to the bounding box

Probably not the most elegant way to do it, but:

- Go to <https://www.openstreetmap.org/export>.
- Enter the bounding box coordinates into the export ranges on the left.
- Press the share button on the right.
- Select "Set custom dimensions", and adjust the new bounding box to fit the one defined by the export bounding box.
- Set the scale to 1:25000.
- Press download and save the image to [../../media/nyc_map.png](../../media/nyc_map.png).
- Convert the image into a grayscale image.
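
The last step can also be done in code. The sketch below is one possible approach, not the method used for the original image: `rgb_to_gray` applies the common ITU-R BT.601 luma weights to a NumPy array, and the hypothetical `convert_file` helper assumes that Pillow is installed for reading and writing the file.

```python
import numpy as np


def rgb_to_gray(rgb):
    """Convert an (H, W, 3) uint8 RGB array to an (H, W) uint8 grayscale array."""
    weights = np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 luma weights
    gray = rgb[..., :3].astype(float) @ weights
    return np.round(gray).astype(np.uint8)


def convert_file(src_path, dst_path):
    """Load an image, convert it to grayscale, and save it (requires Pillow)."""
    from PIL import Image  # imported lazily; assumption: Pillow is installed
    rgb = np.asarray(Image.open(src_path).convert('RGB'))
    Image.fromarray(rgb_to_gray(rgb)).save(dst_path)


# Quick check on a 1 x 3 image: pure red, pure green, and white pixels
demo = np.array([[[255, 0, 0], [0, 255, 0], [255, 255, 255]]], dtype=np.uint8)
print(rgb_to_gray(demo))
```

For the map itself, a call such as `convert_file('../../media/nyc_map.png', '../../media/nyc_map.png')` would overwrite the downloaded image with its grayscale version. Pillow's own `Image.convert('L')` performs a similar luma conversion and could replace `rgb_to_gray` entirely.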