# Process taxi_zones.csv

The code below reads `taxi_zones.csv` with pandas. The file lists every pick-up and drop-off zone together with its borough and its boundary coordinates. The `OBJECTID` column, rather than `LocationID`, will later be joined with the trips fact table to look up each trip's zone-borough label and coordinates, for a reason that's discussed later.


In [11]:
import pandas as pd
import re
import numpy as np

In [16]:
tz_df = pd.read_csv("taxi_zones.csv")
tz_df.head()

Unnamed: 0,OBJECTID,LocationID,zone,borough,the_geom
0,1,1,Newark Airport,EWR,MULTIPOLYGON (((-74.18445299999996 40.69499599...
1,2,2,Jamaica Bay,Queens,MULTIPOLYGON (((-73.82337597260663 40.63898704...
2,3,3,Allerton/Pelham Gardens,Bronx,MULTIPOLYGON (((-73.84792614099985 40.87134223...
3,4,4,Alphabet City,Manhattan,MULTIPOLYGON (((-73.97177410965318 40.72582128...
4,5,5,Arden Heights,Staten Island,MULTIPOLYGON (((-74.17421738099989 40.56256808...
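
Before choosing a join key, it can help to check that the candidate column is unique. The snippet below is a minimal sketch on a toy frame; the values (including the repeated ID) are made up for illustration and are not taken from `taxi_zones.csv`.

```python
import pandas as pd

# Toy stand-in for the zones table; values are invented for illustration.
zones = pd.DataFrame({
    "OBJECTID": [1, 2, 3, 4],
    "LocationID": [1, 2, 3, 3],  # deliberately repeated ID
})

# Report, per candidate key, whether it is unique and how many rows repeat.
for col in ["OBJECTID", "LocationID"]:
    n_dupes = int(zones[col].duplicated().sum())
    print(f"{col}: unique={zones[col].is_unique}, repeated rows={n_dupes}")
```

The same two calls (`is_unique` and `duplicated()`) can be run on `tz_df` to see which of its ID columns is safe to join on.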


The `the_geom` column stores each zone's boundary as a `MULTIPOLYGON` of coordinate pairs. These coordinates will be used to mark the approximate location of each zone on the interactive map in the dashboard, so I plan to create two columns, `latitude` and `longitude`, holding the average of the zone's latitudes and longitudes respectively. A plain average of all boundary points is enough here because it only marks the **approximate** location on the map.
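
The averaging idea can be sketched on a toy geometry as follows. This is an illustration under assumptions, not the notebook's own cell: the helper name `mean_lon_lat` and the toy square are hypothetical. It relies on the fact that WKT writes each vertex as `x y`, that is, longitude first and latitude second.

```python
import re
import numpy as np

def mean_lon_lat(wkt):
    """Average all vertices of a WKT geometry; returns (longitude, latitude)."""
    # Each vertex is written as "x y": longitude first, then latitude.
    pairs = re.findall(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)", wkt)
    coords = np.array([(float(x), float(y)) for x, y in pairs])
    lon, lat = coords.mean(axis=0)
    return float(lon), float(lat)

# Toy square (made-up coordinates); the ring repeats its first vertex.
toy = "MULTIPOLYGON (((-74.0 40.0, -74.0 41.0, -73.0 41.0, -73.0 40.0, -74.0 40.0)))"
lon, lat = mean_lon_lat(toy)
print(lon, lat)
```

Because a closed ring repeats its first vertex and boundary points are not evenly spaced, the vertex mean is only an approximation of the true centroid, which is acceptable for placing a map marker.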


In [17]:
# WKT stores each vertex as "longitude latitude" (x y). For NYC the negative
# numbers are longitudes and the positive numbers after a space are latitudes.
for k in range(len(tz_df)):
    geom = tz_df.loc[k, "the_geom"]
    lat_list = [float(x) for x in re.findall(r"(?<=\s)\d\d\.\d*", geom)]
    lon_list = [float(x) for x in re.findall(r"-\d\d\.\d*", geom)]
    # Store the mean of all boundary points as the zone's approximate location.
    tz_df.loc[k, "latitude"] = np.average(lat_list)
    tz_df.loc[k, "longitude"] = np.average(lon_list)

tz_df

Unnamed: 0,OBJECTID,LocationID,zone,borough,the_geom,latitude,longitude
0,1,1,Newark Airport,EWR,MULTIPOLYGON (((-74.18445299999996 40.69499599...,40.690243,-74.174270
1,2,2,Jamaica Bay,Queens,MULTIPOLYGON (((-73.82337597260663 40.63898704...,40.612163,-73.817643
2,3,3,Allerton/Pelham Gardens,Bronx,MULTIPOLYGON (((-73.84792614099985 40.87134223...,40.864294,-73.846510
3,4,4,Alphabet City,Manhattan,MULTIPOLYGON (((-73.97177410965318 40.72582128...,40.723853,-73.975209
4,5,5,Arden Heights,Staten Island,MULTIPOLYGON (((-74.17421738099989 40.56256808...,40.556678,-74.189803
...,...,...,...,...,...,...,...
258,259,259,Woodlawn/Wakefield,Bronx,MULTIPOLYGON (((-73.85107116191898 40.91037152...,40.900107,-73.853635
259,260,260,Woodside,Queens,MULTIPOLYGON (((-73.90175373399988 40.76077547...,40.746439,-73.905907
260,261,261,World Trade Center,Manhattan,MULTIPOLYGON (((-74.01332610899988 40.70503078...,40.707456,-74.013983
261,262,262,Yorkville East,Manhattan,MULTIPOLYGON (((-73.94383256699986 40.78285908...,40.778363,-73.943489


The column `the_geom` has now been reduced to the two columns `latitude` and `longitude`, so it is no longer needed. I dropped it along with `LocationID`, which contains duplicates, and renamed `OBJECTID` to `LocationID` so that the column name matches when joining with the trips table. I also combined `zone` and `borough` into a single `zone_borough` column.


In [18]:
# Build one display label per zone, then drop the columns it replaces.
tz_df["zone_borough"] = tz_df["zone"] + ", " + tz_df["borough"]
tz_df = tz_df.drop(["LocationID", "the_geom", "zone", "borough"], axis=1).rename(
    columns={"OBJECTID": "LocationID"}
)
tz_df

Unnamed: 0,LocationID,latitude,longitude,zone_borough
0,1,40.690243,-74.174270,"Newark Airport, EWR"
1,2,40.612163,-73.817643,"Jamaica Bay, Queens"
2,3,40.864294,-73.846510,"Allerton/Pelham Gardens, Bronx"
3,4,40.723853,-73.975209,"Alphabet City, Manhattan"
4,5,40.556678,-74.189803,"Arden Heights, Staten Island"
...,...,...,...,...
258,259,40.900107,-73.853635,"Woodlawn/Wakefield, Bronx"
259,260,40.746439,-73.905907,"Woodside, Queens"
260,261,40.707456,-74.013983,"World Trade Center, Manhattan"
261,262,40.778363,-73.943489,"Yorkville East, Manhattan"


Finally, I export the processed table to `taxi_zone_clean.csv`, without the index column.


In [19]:
tz_df.to_csv("taxi_zone_clean.csv", index=False)
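
The exported table is meant to be joined with the trips table on `LocationID`. The snippet below is a minimal sketch of such a join on toy data; the `trips` frame and its `trip_id` and `PULocationID` columns are assumptions for illustration, not taken from this notebook.

```python
import pandas as pd

# Toy excerpt of the cleaned zones table.
zones = pd.DataFrame({
    "LocationID": [1, 2],
    "zone_borough": ["Newark Airport, EWR", "Jamaica Bay, Queens"],
})

# Hypothetical trips table; the column names are assumed for this sketch.
trips = pd.DataFrame({"trip_id": [10, 11, 12], "PULocationID": [2, 1, 2]})

# Left join keeps every trip; validate raises if LocationID is not unique.
joined = trips.merge(
    zones,
    how="left",
    left_on="PULocationID",
    right_on="LocationID",
    validate="many_to_one",
)
print(joined[["trip_id", "zone_borough"]])
```

Passing `validate="many_to_one"` makes pandas raise an error if the zone key is duplicated, which guards against trips being silently multiplied by the join.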