# Preprocessing the taxi data - Intentionally Blank

**NOTE: This notebook does not need to be executed. A copy of the preprocessed dataframe is saved as a [parquet file](https://parquet.apache.org/).** For the preparation notebook, click [here](./prep.ipynb).

Before preparing and cleaning the taxi dataset, we first preprocess the CSV to make it smaller. We collected the data from the Chicago Data Portal and filtered the original dataset by `trip_start_timestamp` directly via the API to minimize the initial file size. To get all the trips from 2016, we used the following query: `https://data.cityofchicago.org/resource/wrvz-psew.csv?$where=trip_start_timestamp%20between%20%272016-01-01T00:00:00%27%20and%20%20%272016-12-31T23:59:59%27&$limit=1000000000`
<br>For further information about the dataset and the API click the following link: [Chicago Data Portal - Taxi Trips](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew).
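
The query above can also be assembled programmatically. The following is a minimal sketch that uses only the standard library; `build_trip_query` is a hypothetical helper, and the resulting URL encodes the colons (`%3A`) and uses single spaces, which should be equivalent to the hand-written query but has not been checked against the portal.

```python
from urllib.parse import quote, urlencode

# Dataset endpoint taken from the query above.
BASE_URL = "https://data.cityofchicago.org/resource/wrvz-psew.csv"


def build_trip_query(year: int, limit: int = 1_000_000_000) -> str:
    """Return a download URL for all trips that started in `year` (hypothetical helper)."""
    where = (
        f"trip_start_timestamp between '{year}-01-01T00:00:00' "
        f"and '{year}-12-31T23:59:59'"
    )
    # Keep "$" literal so the parameter names stay readable as $where and $limit.
    query = urlencode({"$where": where, "$limit": limit}, safe="$", quote_via=quote)
    return f"{BASE_URL}?{query}"


url_2016 = build_trip_query(2016)
print(url_2016)
```

Changing the `year` argument is enough to request a different time range with the same filter.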

To run this notebook, download the dataset from the link above and copy it to the `data` folder as `original_taxi_data.csv`. At least 16 GB of RAM is needed to read the CSV, and 32 GB is recommended, because the initial CSV is about 14 GB in size.
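
If the RAM requirement is a problem, the CSV could instead be read in chunks while loading only the needed columns. This is an alternative sketch, not what this notebook does: `read_csv_in_chunks` is a hypothetical helper, and the tiny in-memory CSV is made-up data standing in for `data/original_taxi_data.csv`.

```python
import io

import pandas as pd


def read_csv_in_chunks(source, usecols, chunksize=1_000_000):
    """Read only `usecols`, drop incomplete rows per chunk, then concatenate (hypothetical helper)."""
    chunks = pd.read_csv(source, usecols=usecols, chunksize=chunksize)
    cleaned = [chunk.dropna(how="any") for chunk in chunks]
    return pd.concat(cleaned, ignore_index=True)


# Made-up miniature CSV; the real file has many more columns and rows.
sample_csv = io.StringIO(
    "trip_id,trip_seconds,trip_miles,company\n"
    "a,300,0.8,City Service\n"
    "b,,1.2,Sun Taxi\n"
    "c,240,0.4,Blue Diamond\n"
)

small_df = read_csv_in_chunks(
    sample_csv,
    usecols=["trip_id", "trip_seconds", "trip_miles"],
    chunksize=2,
)
print(small_df)
```

Because of `ignore_index=True`, the result gets a fresh index, whereas the approach in this notebook keeps the original row labels.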

**Dependencies:**
- pandas
- PyArrow (`conda install pyarrow`)
  - Needed for saving to a parquet file

In [1]:
# Importing the libraries
import os

import pandas as pd

# Make sure the data folder exists
os.makedirs('./data', exist_ok=True)

In [2]:
# Reading the csv file
# Note: This file is not included in the repository due to its size. Please use the link above to download the file. Loading the file may take a few minutes.
taxi_df = pd.read_csv('data/original_taxi_data.csv')

In [3]:
# Checking for the right time range
taxi_df["trip_start_timestamp"].min(), taxi_df["trip_start_timestamp"].max()

('2016-01-01T00:00:00.000', '2016-12-31T23:45:00.000')
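
The timestamps are still plain strings at this point. The check works anyway because ISO 8601 strings in one fixed format sort in chronological order. A small sketch with made-up values illustrates this by comparing the string result with the parsed result:

```python
import pandas as pd

# Made-up timestamps in the same format as the dataset.
stamps = pd.Series([
    "2016-12-31T23:45:00.000",
    "2016-01-01T00:00:00.000",
    "2016-07-04T12:30:00.000",
])

# Minimum and maximum of the raw strings (lexicographic comparison).
string_range = (stamps.min(), stamps.max())

# Minimum and maximum after parsing to real timestamps.
parsed = pd.to_datetime(stamps)
parsed_range = (parsed.min(), parsed.max())

print(string_range)
print(parsed_range)
```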

In [4]:
# Checking memory usage for later comparison
taxi_df.memory_usage(deep=True)

Index                                128
trip_id                       3080655883
taxi_id                       5875028507
trip_start_timestamp          2540747120
trip_end_timestamp            2540631248
trip_seconds                   254074712
trip_miles                     254074712
pickup_census_tract            254074712
dropoff_census_tract           254074712
pickup_community_area          254074712
dropoff_community_area         254074712
fare                           254074712
tips                           254074712
tolls                          254074712
extras                         254074712
trip_total                     254074712
payment_type                  2034421040
company                       2256911991
pickup_centroid_latitude       254074712
pickup_centroid_longitude      254074712
pickup_centroid_location      2692429277
dropoff_centroid_latitude      254074712
dropoff_centroid_longitude     254074712
dropoff_centroid_location     2681476484
dtype: int64
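
The per-column figures above can be summarized in a single number, and they also reveal the row count. The sketch below uses a made-up miniature frame and assumes the numeric columns are stored as 8-byte values (`float64`), which the `head` output suggests but which is not shown explicitly here.

```python
import pandas as pd

# Made-up miniature frame with one numeric and one text column.
demo_df = pd.DataFrame({
    "trip_seconds": [300.0, 240.0, 540.0],
    "company": ["City Service", "Blue Diamond", "Sun Taxi"],
})

usage = demo_df.memory_usage(deep=True)
print(usage)
print(f"Total: {usage.sum() / 1e9:.9f} GB")

# A float64 value takes 8 bytes, so the byte count of a numeric column
# divided by 8 gives the number of rows.
bytes_per_numeric_column = 254_074_712  # e.g. trip_seconds in the output above
rows_in_taxi_df = bytes_per_numeric_column // 8
print(rows_in_taxi_df)
```

Under that assumption, the unprocessed dataframe holds 31,759,339 rows.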

In [5]:
taxi_df.head(5)

Unnamed: 0,trip_id,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,...,extras,trip_total,payment_type,company,pickup_centroid_latitude,pickup_centroid_longitude,pickup_centroid_location,dropoff_centroid_latitude,dropoff_centroid_longitude,dropoff_centroid_location
0,e5a83fdb24dd07ccf65750dc7a0bd91782d80866,b695f0a6aeeb0c364c6ee6ca5c16f6621e00c50fd6e7d4...,2016-12-31T23:45:00.000,2017-01-01T00:15:00.000,1607.0,2.0,17031840000.0,17031080000.0,32.0,8.0,...,0.0,16.5,Credit Card,Chicago Carriage Cab Corp,41.880994,-87.632746,POINT (-87.6327464887 41.8809944707),41.895033,-87.619711,POINT (-87.6197106717 41.8950334495)
1,fe838501d7bbc346229694ab319236f3f6293980,9d2bc650f24375604a82a15892dfbbea47dce34e8c3236...,2016-12-31T23:45:00.000,2016-12-31T23:45:00.000,469.0,0.28,17031840000.0,17031840000.0,32.0,32.0,...,1.0,6.75,Cash,Chicago Carriage Cab Corp,41.880994,-87.632746,POINT (-87.6327464887 41.8809944707),41.880994,-87.632746,POINT (-87.6327464887 41.8809944707)
2,cad9465c7067dae350e88ce2832fec2d4d709888,92a78c8b1d09e1d7668d08f04825e490957dcdfe6448e5...,2016-12-31T23:45:00.000,2016-12-31T23:45:00.000,300.0,0.8,17031080000.0,17031080000.0,8.0,8.0,...,1.5,7.0,Cash,City Service,41.895033,-87.619711,POINT (-87.6197106717 41.8950334495),41.892042,-87.631864,POINT (-87.6318639497 41.8920421365)
3,0fbcfdc3799233220b66074d629093373a3237a5,b77a2dcc078698ea493d4d703014076e4272dc7d8b420e...,2016-12-31T23:45:00.000,2016-12-31T23:45:00.000,240.0,0.4,17031080000.0,17031080000.0,8.0,8.0,...,1.0,5.5,Cash,Blue Diamond,41.907492,-87.63576,POINT (-87.6357600901 41.9074919303),41.907492,-87.63576,POINT (-87.6357600901 41.9074919303)
4,73496b0c0946a62f11ffa9712017b5aef70b23f5,e8b30fe3cdcf458994b6943ba607e06f31b92202cab6b7...,2016-12-31T23:45:00.000,2017-01-01T00:00:00.000,540.0,1.2,,,6.0,7.0,...,0.0,7.5,Cash,Sun Taxi,41.944227,-87.655998,POINT (-87.6559981815 41.9442266014),41.922686,-87.649489,POINT (-87.6494887289 41.9226862843)


We delete irrelevant columns to save as much memory as possible.

In [6]:
# Drop the irrelevant columns to save memory
columns_to_drop = [
    'pickup_centroid_location', 'dropoff_centroid_location',
    'fare', 'tips', 'tolls', 'extras', 'payment_type',
    'pickup_community_area', 'dropoff_community_area',
    'company', 'taxi_id',
]
taxi_df = taxi_df.drop(columns=columns_to_drop)

Rows with null values and duplicates are removed in this notebook rather than in the preparation notebook, so that the preparation notebook can also run on computers with little memory.

In [7]:
# Show a few rows that contain at least one null value
display(taxi_df[taxi_df.isnull().any(axis=1)].head(5))

Unnamed: 0,trip_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,trip_total,pickup_centroid_latitude,pickup_centroid_longitude,dropoff_centroid_latitude,dropoff_centroid_longitude
4,73496b0c0946a62f11ffa9712017b5aef70b23f5,2016-12-31T23:45:00.000,2017-01-01T00:00:00.000,540.0,1.2,,,7.5,41.944227,-87.655998,41.922686,-87.649489
5,98c1c0bf07b3c3bc68831141207db5563f67a7e0,2016-12-31T23:45:00.000,2016-12-31T23:45:00.000,680.0,2.28,,,10.5,41.878866,-87.625192,41.901207,-87.676356
6,4279ec66210268c8777fa3b0060cc5790d13a9b4,2016-12-31T23:45:00.000,2017-01-01T00:00:00.000,120.0,1.1,,,5.5,41.944227,-87.655998,41.944227,-87.655998
8,17158b3ccd5753b62e52de5becdcec9fed92a288,2016-12-31T23:45:00.000,2017-01-01T00:00:00.000,960.0,4.1,,,13.75,41.922686,-87.649489,41.947792,-87.683835
11,2ac4bfda77755446a465f8928176281c299d2433,2016-12-31T23:45:00.000,2017-01-01T00:00:00.000,1056.0,2.78,,,11.0,41.901207,-87.676356,41.878866,-87.625192


In [8]:
# Drop rows with missing values
taxi_df = taxi_df.dropna(how='any',axis=0)
print(f"Number of rows after deleting rows with null values: {len(taxi_df)} ")

Number of rows after deleting rows with null values: 20356209 
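
Before dropping rows, it can help to see which columns contain the missing values and how many rows the drop removes. The following is a minimal sketch on made-up data that mirrors the pattern seen above, where the census tract is missing:

```python
import numpy as np
import pandas as pd

# Made-up miniature frame; two rows lack a pickup census tract.
demo_df = pd.DataFrame({
    "trip_seconds": [1607.0, 540.0, 680.0],
    "pickup_census_tract": [17031840000.0, np.nan, np.nan],
    "trip_total": [16.5, 7.5, 10.5],
})

# Number of null values per column.
null_counts = demo_df.isnull().sum()
print(null_counts)

# Same call as in the cell above, plus a count of the removed rows.
rows_before = len(demo_df)
cleaned_df = demo_df.dropna(how="any", axis=0)
rows_dropped = rows_before - len(cleaned_df)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```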


In [11]:
display(taxi_df[taxi_df.duplicated(subset=['trip_start_timestamp', 'trip_end_timestamp', 'trip_id'])].head(5))

Unnamed: 0,trip_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,trip_total,pickup_centroid_latitude,pickup_centroid_longitude,dropoff_centroid_latitude,dropoff_centroid_longitude


Because no duplicate rows exist in our dataset, we skip the deletion step and drop the now obsolete `trip_id` column.

In [12]:
taxi_df = taxi_df.drop(columns=["trip_id"])
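
Had the duplicate check returned rows, the deletion step could have looked like the sketch below. It runs on made-up data and uses the same key columns as the check above; by default, pandas keeps the first occurrence of each duplicate.

```python
import pandas as pd

# Made-up miniature frame; the third row repeats the first one.
demo_df = pd.DataFrame({
    "trip_id": ["a", "b", "a"],
    "trip_start_timestamp": ["2016-12-31T23:45:00.000"] * 3,
    "trip_end_timestamp": ["2017-01-01T00:00:00.000"] * 3,
})

key_columns = ["trip_start_timestamp", "trip_end_timestamp", "trip_id"]

# Boolean mask that is True for every repeated occurrence.
duplicate_mask = demo_df.duplicated(subset=key_columns)
print(f"Duplicates found: {int(duplicate_mask.sum())}")

# Keep the first occurrence of each key combination.
deduplicated_df = demo_df.drop_duplicates(subset=key_columns)
print(deduplicated_df)
```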

In [13]:
# Last look at the data
taxi_df.head(5)

Unnamed: 0,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,trip_total,pickup_centroid_latitude,pickup_centroid_longitude,dropoff_centroid_latitude,dropoff_centroid_longitude
0,2016-12-31T23:45:00.000,2017-01-01T00:15:00.000,1607.0,2.0,17031840000.0,17031080000.0,16.5,41.880994,-87.632746,41.895033,-87.619711
1,2016-12-31T23:45:00.000,2016-12-31T23:45:00.000,469.0,0.28,17031840000.0,17031840000.0,6.75,41.880994,-87.632746,41.880994,-87.632746
2,2016-12-31T23:45:00.000,2016-12-31T23:45:00.000,300.0,0.8,17031080000.0,17031080000.0,7.0,41.895033,-87.619711,41.892042,-87.631864
3,2016-12-31T23:45:00.000,2016-12-31T23:45:00.000,240.0,0.4,17031080000.0,17031080000.0,5.5,41.907492,-87.63576,41.907492,-87.63576
7,2016-12-31T23:45:00.000,2017-01-01T00:00:00.000,960.0,4.4,17031060000.0,17031080000.0,19.8,41.93631,-87.651563,41.895033,-87.619711


In [14]:
# Convert trip_seconds to uint32 without losing information
taxi_df = taxi_df.astype({"trip_seconds": "uint32"})
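
The cast is lossless only if every value is a whole, non-negative number within the `uint32` range. The sketch below shows one way to verify this before casting; `can_cast_to_uint32` is a hypothetical helper that is not part of this notebook, and the series holds made-up values.

```python
import numpy as np
import pandas as pd


def can_cast_to_uint32(values: pd.Series) -> bool:
    """Return True if all values are whole, non-negative and fit into uint32 (hypothetical helper)."""
    limit = np.iinfo(np.uint32).max
    is_whole = (values % 1 == 0).all()          # NaN fails this comparison
    in_range = values.between(0, limit).all()   # bounds are inclusive
    return bool(is_whole and in_range)


# Made-up trip durations in seconds, stored as floats like in the dataset.
demo_seconds = pd.Series([1607.0, 469.0, 300.0])

if can_cast_to_uint32(demo_seconds):
    demo_seconds = demo_seconds.astype("uint32")

print(demo_seconds.dtype)
```

A fractional, negative, or missing value makes the helper return `False`, in which case the column is left as it is.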

In [15]:
# Optional: If you want to save the preprocessed data as a csv file uncomment the following line
# taxi_df.to_csv('data/taxi_data_preprocessed.csv', index=False)

# Saving the preprocessed data as a parquet file with gzip compression
taxi_df.to_parquet('data/taxi_data_preprocessed.gzip', compression='gzip')