# Preprocessing the taxi data - Intentionally Blank

**NOTE: This notebook does not need to be executed. A copy of the preprocessed dataframe is saved as a [parquet file](https://parquet.apache.org/)**. For the preparation notebook, click [here](./prep.ipynb).

Before preparing and cleaning the taxi dataset, we first preprocess the CSV to make it smaller. We collected the data from the Chicago Data Portal and filtered the original dataset by trip_start_timestamp directly via the API to minimize the initial file size. To get all the trips from 2016, we used the following query: https://data.cityofchicago.org/resource/wrvz-psew.csv?$where=trip_start_timestamp%20between%20%272016-01-01T00:00:00%27%20and%20%20%272016-12-31T23:59:59%27&$limit=1000000000.
<br>For further information about the dataset and the API click the following link: [Chicago Data Portal - Taxi Trips](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew).
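
As a rough illustration of how a query URL of this shape can be assembled, the sketch below percent-encodes the `$where` and `$limit` parameters with the standard library. The helper name `build_query_url` and its arguments are hypothetical and not part of the original notebook; only the endpoint and the two parameters are taken from the query above.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the query shown above
BASE_URL = "https://data.cityofchicago.org/resource/wrvz-psew.csv"


def build_query_url(start, end, limit=1_000_000_000):
    """Hypothetical helper: build a CSV export URL filtered by trip_start_timestamp."""
    where = f"trip_start_timestamp between '{start}' and '{end}'"
    params = {"$where": where, "$limit": limit}
    # quote_via=quote encodes spaces as %20; "$" and ":" are left unescaped
    return f"{BASE_URL}?{urlencode(params, quote_via=quote, safe='$:')}"


url_2016 = build_query_url("2016-01-01T00:00:00", "2016-12-31T23:59:59")
print(url_2016)
```

Changing the two timestamps would select a different period in the same way.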

To run this notebook, download the dataset from the aforementioned link and copy it to the "data" folder as "original_taxi_data.csv". At least 16GB of RAM is needed to read the CSV, and 32GB is recommended, because the initial CSV is about 14GB in size.
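
If memory is tight, one possible workaround is to read the CSV in chunks and keep only the columns that are needed later. This is not what the notebook does; it is a minimal sketch, and the helper `read_trips_in_chunks` and the tiny in-memory sample are assumptions made for illustration. The concatenated result still has to fit in memory, so this mainly lowers the peak during parsing.

```python
import io

import pandas as pd


def read_trips_in_chunks(source, columns, date_columns, chunksize=1_000_000):
    """Hypothetical helper: read a CSV piecewise and keep only `columns`."""
    reader = pd.read_csv(
        source,
        usecols=columns,          # skip columns that are dropped later anyway
        parse_dates=date_columns, # parse the timestamps while reading
        chunksize=chunksize,      # return an iterator of DataFrames
    )
    return pd.concat(list(reader), ignore_index=True)


# Tiny in-memory stand-in for data/original_taxi_data.csv
sample_csv = io.StringIO(
    "trip_id,taxi_id,trip_start_timestamp,trip_end_timestamp,fare\n"
    "t1,a,2016-01-01T00:00:00,2016-01-01T00:15:00,5.0\n"
    "t2,a,2016-01-01T00:45:00,2016-01-01T01:00:00,7.5\n"
    "t3,b,2016-01-01T00:30:00,2016-01-01T00:45:00,6.0\n"
)
sample_df = read_trips_in_chunks(
    sample_csv,
    columns=["trip_id", "taxi_id", "trip_start_timestamp", "trip_end_timestamp"],
    date_columns=["trip_start_timestamp", "trip_end_timestamp"],
    chunksize=2,
)
```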

**Dependencies:**
- Pandas
- PyArrow (`conda install pyarrow`)
  - Needed for saving to a parquet file

In [1]:
# Importing the libraries
import pandas as pd
import os  
os.makedirs('./data', exist_ok=True) 

In [2]:
# Reading the csv file
# Note: This file is not included in the repository due to its size. Please use the link above to download the file. Loading the file may take a few minutes.
taxi_df = pd.read_csv('data/original_taxi_data.csv')
taxi_df['trip_start_timestamp'] = pd.to_datetime(taxi_df['trip_start_timestamp'])
taxi_df['trip_end_timestamp'] = pd.to_datetime(taxi_df['trip_end_timestamp'])

In [3]:
# Checking for the right time range
taxi_df["trip_start_timestamp"].min(), taxi_df["trip_start_timestamp"].max()

(Timestamp('2016-01-01 00:00:00'), Timestamp('2016-12-31 23:45:00'))

In [4]:
# Checking memory usage for later comparison
taxi_df.memory_usage(deep=True)

Index                                128
trip_id                       3080655883
taxi_id                       5875028507
trip_start_timestamp           254074712
trip_end_timestamp             254074712
trip_seconds                   254074712
trip_miles                     254074712
pickup_census_tract            254074712
dropoff_census_tract           254074712
pickup_community_area          254074712
dropoff_community_area         254074712
fare                           254074712
tips                           254074712
tolls                          254074712
extras                         254074712
trip_total                     254074712
payment_type                  2034421040
company                       2256911991
pickup_centroid_latitude       254074712
pickup_centroid_longitude      254074712
pickup_centroid_location      2692429277
dropoff_centroid_latitude      254074712
dropoff_centroid_longitude     254074712
dropoff_centroid_location     2681476484
dtype: int64

In [5]:
taxi_df.head(5)

Unnamed: 0,trip_id,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,...,extras,trip_total,payment_type,company,pickup_centroid_latitude,pickup_centroid_longitude,pickup_centroid_location,dropoff_centroid_latitude,dropoff_centroid_longitude,dropoff_centroid_location
0,223789629c9e0a01fbab0d787d2664ccdb8355c0,507b1e4d1f39a8a26e7249e6a627f5a0c798dfdafa7b16...,2016-12-31 23:45:00,2016-12-31 23:45:00,180.0,0.7,,,,,...,1.0,5.75,Cash,City Service,,,,,,
1,a1d390b16ede0f133408103b79dcb56bbd74365e,73b2f5adecea91eeef3900303a07f1b0519a594cffb6b0...,2016-12-31 23:45:00,2017-01-01 00:15:00,2160.0,5.4,,,,,...,0.0,23.5,Cash,Chicago Taxicab,,,,,,
2,2fffdf0e5b45125ed3fd7027b92e31bd7e7085ef,d41ab2be597b82c3e6b0b0ecccf98883a84db0d9aed4f6...,2016-12-31 23:45:00,2017-01-01 00:00:00,1080.0,5.1,,,,,...,0.0,15.75,Cash,City Service,,,,,,
3,3c1d5e90e522f7be0bf92c96f5164360d8d02f94,24515782c70f09819506a7724a57e77c78fea60c4dc91d...,2016-12-31 23:45:00,2017-01-01 00:00:00,780.0,2.9,,,,,...,0.0,11.0,Cash,Sun Taxi,,,,,,
4,d9046368ad0f1ba4cc27c659e9467cd3602bd458,f1eda6f0cb8e48e7fdb5f623a4a5113a84c159fbf73638...,2016-12-31 23:45:00,2016-12-31 23:45:00,0.0,0.0,,,,,...,0.0,5.0,Credit Card,Suburban Dispatch LLC,,,,,,


We delete irrelevant columns to save as much memory as possible.

In [6]:
# Drop columns pickup_centroid_latitude, pickup_centroid_longitude, dropoff_centroid_latitude, dropoff_centroid_longitude, fare, tips, tolls, extras, payment_type, pickup_community_area, dropoff_community_area, company
taxi_df = taxi_df.drop(columns=['pickup_centroid_latitude', 'pickup_centroid_longitude', 'dropoff_centroid_latitude', 'dropoff_centroid_longitude', 'fare', 'tips', 'tolls', 'extras', 'payment_type', 'pickup_community_area', 'dropoff_community_area', 'company'])

We preemptively delete rows where trip_end_timestamp, trip_start_timestamp, or taxi_id is null, because computing the idle seconds requires both timestamps and a taxi to attribute them to.

In [7]:
taxi_df[taxi_df["trip_end_timestamp"].isnull() | taxi_df["trip_start_timestamp"].isnull() | taxi_df["taxi_id"].isnull()]

Unnamed: 0,trip_id,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,trip_total,pickup_centroid_location,dropoff_centroid_location
404,6c616d9bd36d367fee0ec951f1510d9afdea8249,,2016-12-31 23:45:00,2017-01-01 00:00:00,600.0,0.0,1.703108e+10,1.703108e+10,6.75,POINT (-87.6188683546 41.8909220259),POINT (-87.6288741572 41.8920726347)
2873,96842033a8a7dd35681c452479dbe9795f5d7e6f,e7e187c80ff0f05f971fef2ac660198b4e86ccecae67e7...,2016-12-31 22:45:00,NaT,,0.0,,,0.00,POINT (-87.771166703 41.9788295262),
2902,0ee2a90ea42e24b272180bccd26c29f1300acff6,,2016-12-31 22:45:00,2016-12-31 23:00:00,1200.0,0.0,1.703108e+10,1.703108e+10,15.25,POINT (-87.6188683546 41.8909220259),POINT (-87.6318639497 41.8920421365)
5694,d6354aa062c3245e39ad7cb8c0063fafca52bfcb,,2016-12-31 21:45:00,2016-12-31 22:15:00,1320.0,0.0,1.703108e+10,1.703108e+10,12.75,POINT (-87.6188683546 41.8909220259),POINT (-87.6129454143 41.8919715078)
6440,fff8f42d31d886b242bb03077ed285f5deb71843,,2016-12-31 21:30:00,2016-12-31 21:45:00,240.0,0.0,1.703108e+10,1.703108e+10,4.50,POINT (-87.6188683546 41.8909220259),POINT (-87.6262149064 41.8925077809)
...,...,...,...,...,...,...,...,...,...,...,...
31709369,b438457d94b43f7c9eb85a61c8618eede325fb12,89ee7f39a29ed33f083ce17d20e0d9f7a07528a5188084...,2016-01-01 11:30:00,NaT,,0.0,1.703183e+10,,0.00,POINT (-87.717503858 41.942859303),
31716791,d40e9fe9b72dfa210478bda8150137e9dfcd7fb1,94024afd53bfce6f81da57630f32bebb2242ef299c70ea...,2016-01-01 07:15:00,NaT,,0.0,,,0.00,POINT (-87.7215590627 41.968069),
31723301,25caee01f98a77edc18f5332f86969573703f04d,,2016-01-01 04:15:00,2016-01-01 04:15:00,0.0,0.0,,,92.00,,
31723302,f99474b30b65e483ca5ceb8c892269c2a12b08c0,,2016-01-01 04:15:00,2016-01-01 04:15:00,0.0,0.0,,,45.00,,


In [8]:
taxi_df.dropna(subset=['trip_end_timestamp', 'trip_start_timestamp', 'taxi_id'], axis=0, inplace=True)

In [9]:
# Sort the taxi data by the start timestamp
taxi_df = taxi_df.sort_values(['trip_start_timestamp'])

# Reset the index
taxi_df = taxi_df.reset_index(drop=True)

taxi_df.set_index(["taxi_id", taxi_df.index], inplace=True)

In [10]:
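# Idle time per taxi: start of each trip minus the end of that taxi's previous trip
idle_parts = []
for taxi_id in taxi_df.index.get_level_values(0).unique():
    idle_parts.append(taxi_df.loc[taxi_id, "trip_start_timestamp"] - taxi_df.loc[taxi_id, "trip_end_timestamp"].shift(1))
# Concatenate once instead of growing a Series inside the loop
idle_seconds = pd.concat(idle_parts)
idle_seconds.name = "idle_seconds"
idle_seconds = idle_seconds.dt.total_seconds()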
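
The per-taxi loop above can be slow on tens of millions of rows. A possible vectorized alternative, sketched here on a toy frame and assuming `taxi_id` is still a regular column, uses `groupby(...).shift(1)` to fetch each taxi's previous trip end. The function name `compute_idle_seconds` is hypothetical; the result should match the loop up to row order, though this has not been run on the full dataset.

```python
import pandas as pd


def compute_idle_seconds(trips):
    """Hypothetical helper: seconds between a trip's start and the same taxi's previous trip end."""
    trips = trips.sort_values("trip_start_timestamp")
    # Previous trip end within each taxi; NaT for a taxi's first trip
    previous_end = trips.groupby("taxi_id")["trip_end_timestamp"].shift(1)
    idle = (trips["trip_start_timestamp"] - previous_end).dt.total_seconds()
    return idle.rename("idle_seconds")


toy_trips = pd.DataFrame(
    {
        "taxi_id": ["a", "b", "a"],
        "trip_start_timestamp": pd.to_datetime(
            ["2016-01-01 00:00", "2016-01-01 00:30", "2016-01-01 00:45"]
        ),
        "trip_end_timestamp": pd.to_datetime(
            ["2016-01-01 00:15", "2016-01-01 00:45", "2016-01-01 01:00"]
        ),
    }
)
toy_idle = compute_idle_seconds(toy_trips)
```

Because the returned Series keeps the original row labels, it can be assigned back as a column without a separate merge.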

In [11]:
taxi_df.set_index([taxi_df.index.get_level_values(1)], inplace=True)

In [12]:
idle_seconds

0                 NaN
4674           1800.0
2535593     2716200.0
2545157        9000.0
2545571           0.0
              ...    
31743851        900.0
31746187        900.0
31747643       1800.0
31749451        900.0
31753557       4500.0
Name: idle_seconds, Length: 31753989, dtype: float64

In [13]:
taxi_df = taxi_df.merge(idle_seconds, left_index=True, right_index=True, how='left')

Rows with null values and duplicates are deleted in this notebook rather than in the preparation notebook, so that computers with little memory can still run the preparation notebook.

In [14]:
display(taxi_df[taxi_df.isnull().any(axis = 1)])

Unnamed: 0,trip_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,trip_total,pickup_centroid_location,dropoff_centroid_location,idle_seconds
0,869d4dbf2d7df18738ed7ba777d50d3699099c52,2016-01-01 00:00:00,2016-01-01 00:15:00,900.0,2.2,1.703128e+10,1.703108e+10,11.65,POINT (-87.642648998 41.8792550844),POINT (-87.6129454143 41.8919715078),
1,8530459178720931cd606e1cf3e3a190b86ae0a2,2016-01-01 00:00:00,2016-01-01 00:15:00,960.0,0.7,1.703132e+10,1.703108e+10,9.95,POINT (-87.6219716519 41.8774061234),POINT (-87.6188683546 41.8909220259),
2,94861124a32087e9d391617103e65fc0210ab9cf,2016-01-01 00:00:00,2016-01-01 00:00:00,0.0,0.0,1.703108e+10,1.703108e+10,3.25,POINT (-87.6318639497 41.8920421365),POINT (-87.6318639497 41.8920421365),
3,494af005a2da882b7df2e6223c14fa9a2d570da8,2016-01-01 00:00:00,2016-01-01 00:00:00,360.0,0.9,,,5.85,POINT (-87.68751551520002 41.9751709433),POINT (-87.68751551520002 41.9751709433),
4,d3f06bb8876509bae42ea8735491d803ec521403,2016-01-01 00:00:00,2016-01-01 00:15:00,480.0,0.0,1.703108e+10,1.703108e+10,8.75,POINT (-87.6262105324 41.8991556134),POINT (-87.6378442095 41.8932163595),
...,...,...,...,...,...,...,...,...,...,...,...
31753981,9c80573849a626137f99dd2d811179d38e155d4c,2016-12-31 23:45:00,2017-01-01 00:00:00,960.0,0.1,,,12.50,POINT (-87.6763559892 41.90120699410001),POINT (-87.6251921424 41.8788655841),3600.0
31753983,044d06dc5bdd7630ca55df20cbfea8f7d47fbea7,2016-12-31 23:45:00,2017-01-01 00:00:00,420.0,0.9,,,7.00,POINT (-87.6763559892 41.90120699410001),POINT (-87.6763559892 41.90120699410001),1800.0
31753986,7d9258852eca3f47601e246b73be9c998485c0fd,2016-12-31 23:45:00,2017-01-01 00:00:00,780.0,0.0,,,15.00,POINT (-87.6333080367 41.899602111),POINT (-87.6559981815 41.9442266014),0.0
31753987,758a79cb3a542ae41dac222ba5ea16ff481da6d6,2016-12-31 23:45:00,2017-01-01 00:00:00,1140.0,4.3,,,15.00,POINT (-87.6763559892 41.90120699410001),POINT (-87.667569312 41.8502663663),900.0


In [15]:
# Drop rows with missing values in any column except idle_seconds (NaN there marks a taxi's first trip)
taxi_df = taxi_df.dropna(how='any', axis=0, subset=taxi_df.columns.difference(['idle_seconds']))
print(f"Number of rows after deleting rows with null values: {len(taxi_df)} ")

Number of rows after deleting rows with null values: 20354299 


In [16]:
display(taxi_df[taxi_df.duplicated(subset=['trip_start_timestamp', 'trip_end_timestamp', 'trip_id'])].head(5))

Unnamed: 0,trip_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,trip_total,pickup_centroid_location,dropoff_centroid_location,idle_seconds


Because no duplicate rows exist in our dataset, we skip the deletion step and drop the now obsolete trip_id column.

In [17]:
taxi_df = taxi_df.drop(columns=[ "trip_id"])

In [18]:
# Convert trip_seconds to uint32 and the census tract columns to int64 without losing information
taxi_df = taxi_df.astype({'trip_seconds': 'uint32', 'pickup_census_tract': 'int64', 'dropoff_census_tract': 'int64'})
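
The claim that the conversion loses no information can be checked before casting. The following is a minimal sketch, not part of the original notebook; `can_cast_losslessly` is a hypothetical helper that tests whether a float column holds only whole numbers within the range of the target integer type and has no missing values.

```python
import numpy as np
import pandas as pd


def can_cast_losslessly(series, dtype):
    """Hypothetical check: True if `series` can be cast to integer `dtype` without changing values."""
    if series.isna().any():
        return False  # integer NumPy dtypes cannot represent NaN
    info = np.iinfo(dtype)
    is_integral = (series % 1 == 0).all()
    in_range = series.between(info.min, info.max).all()
    return bool(is_integral and in_range)


seconds_ok = can_cast_losslessly(pd.Series([900.0, 0.0, 480.0]), "uint32")
tract_ok = can_cast_losslessly(pd.Series([17031281900.0, 17031081402.0]), "int64")
fraction_ok = can_cast_losslessly(pd.Series([1.5, 2.0]), "uint32")
negative_ok = can_cast_losslessly(pd.Series([-1.0, 2.0]), "uint32")
```

On the real data, a check like this could be applied to `trip_seconds` and the two census tract columns before calling `astype`.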

In [19]:
# Last look at the data
taxi_df.head(5)

Unnamed: 0,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,trip_total,pickup_centroid_location,dropoff_centroid_location,idle_seconds
0,2016-01-01,2016-01-01 00:15:00,900,2.2,17031281900,17031081402,11.65,POINT (-87.642648998 41.8792550844),POINT (-87.6129454143 41.8919715078),
1,2016-01-01,2016-01-01 00:15:00,960,0.7,17031320400,17031081403,9.95,POINT (-87.6219716519 41.8774061234),POINT (-87.6188683546 41.8909220259),
2,2016-01-01,2016-01-01 00:00:00,0,0.0,17031081700,17031081700,3.25,POINT (-87.6318639497 41.8920421365),POINT (-87.6318639497 41.8920421365),
4,2016-01-01,2016-01-01 00:15:00,480,0.0,17031081201,17031081800,8.75,POINT (-87.6262105324 41.8991556134),POINT (-87.6378442095 41.8932163595),
5,2016-01-01,2016-01-01 00:15:00,720,4.4,17031061902,17031081403,15.65,POINT (-87.640698076 41.9431550855),POINT (-87.6188683546 41.8909220259),


In [20]:
# Optional: If you want to save the preprocessed data as a csv file uncomment the following line
# taxi_df.to_csv('data/taxi_data_preprocessed.csv', index=False)

# Saving the preprocessed data as a parquet file with gzip compression
taxi_df.to_parquet('data/taxi_data_preprocessed.gzip', compression='gzip')