# Objective for Part 1

We will collect the Land Transport Authority's taxi availability data for one month in 2019.

Steps:

- Explore the taxi dataset and the API call method on the LTA taxi availability website.
- Perform API calls to download the dataset required for our model.
- From the downloaded JSON, extract only the required features and convert them into a usable DataFrame.


In [1]:
# Step 1: Import the libraries you need
import time  # used to time the download loop in Step 8

import pandas as pd
import requests

The <strong>Land Transport Authority (LTA)</strong> in Singapore publishes historical and real-time taxi availability data through a central system. The API call returns the location coordinates of all taxis that are available for hire at the requested time; it does not include "Hired" or "Busy" taxis.

We go to the LTA Taxi Availability page (https://data.gov.sg/dataset/taxi-availability) and explore how the API call works, so that we can retrieve the January 2019 dataset we want.
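The request URLs used in this notebook carry the timestamp in a `date_time` query parameter, with its colons percent-encoded as `%3A`. As a minimal sketch (using the standard library's `urllib.parse.urlencode`, which is not what the notebook itself does), the encoded URL can be built from a plain timestamp string. The helper name `build_taxi_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.data.gov.sg/v1/transport/taxi-availability"


def build_taxi_url(date_time: str) -> str:
    """Return the request URL for one timestamp such as '2019-01-01T00:00:00'."""
    # urlencode percent-encodes the colons (':' becomes '%3A')
    return BASE_URL + "?" + urlencode({"date_time": date_time})


print(build_taxi_url("2019-01-01T00:00:00"))
```

Building the query string this way avoids typing the `%3A` sequences by hand.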


We then perform a single API call for a fixed timestamp in the notebook to see what the data looks like.

In [2]:
# Step 2a: use requests to make a get API call at the URL and assign it to a variable
response = requests.get("https://api.data.gov.sg/v1/transport/taxi-availability?date_time=2019-01-01T00%3A00%3A00")
# Step 2b: declare another variable, and save the JSON in it
taxi_json = response.json()
# Step 2c: peek at your JSON
taxi_json

{'type': 'FeatureCollection',
 'crs': {'type': 'link',
  'properties': {'href': 'http://spatialreference.org/ref/epsg/4326/ogcwkt/',
   'type': 'ogcwkt'}},
 'features': [{'type': 'Feature',
   'geometry': {'type': 'MultiPoint',
    'coordinates': [[103.6267, 1.307992],
     [103.63226, 1.30884],
     [103.6376, 1.300256],
     [103.63767, 1.30045],
     [103.64233, 1.3272],
     [103.64262, 1.31503],
     [103.652616666667, 1.3172154],
     [103.66998, 1.32412],
     [103.67939, 1.32625],
     [103.68554, 1.34106],
     [103.6856, 1.340405],
     [103.688642833333, 1.340839],
     [103.689112833333, 1.342593],
     [103.69163, 1.34406],
     [103.6931, 1.345999],
     [103.6936, 1.344527],
     [103.69386, 1.34267],
     [103.6939, 1.344551],
     [103.694, 1.36935],
     [103.694041833333, 1.34023216666667],
     [103.69427, 1.33496],
     [103.69448, 1.34395],
     [103.6949, 1.339654],
     [103.694952333333, 1.346155],
     [103.6959465, 1.34455316666667],
     [103.696101833333, 1
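
The response is a GeoJSON-style `FeatureCollection` whose feature holds a `MultiPoint` geometry, with each taxi stored as one coordinate pair. Below is a minimal sketch of walking that structure by hand. The `sample` dictionary is a hand-trimmed stand-in for `taxi_json` (the first three coordinates from the output above), and reading each pair as longitude then latitude is an assumption based on the values shown (about 103.6 and 1.3, which fits Singapore).

```python
# Hand-trimmed stand-in for taxi_json, keeping only the first three coordinates
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "MultiPoint",
                "coordinates": [
                    [103.6267, 1.307992],
                    [103.63226, 1.30884],
                    [103.6376, 1.300256],
                ],
            },
        }
    ],
}

# Take the first feature; its geometry lists one pair per available taxi
coordinates = sample["features"][0]["geometry"]["coordinates"]

# Assumption: each pair is [longitude, latitude]
longitudes = [pair[0] for pair in coordinates]
latitudes = [pair[1] for pair in coordinates]

print(len(coordinates), "taxis in the trimmed sample")
print(min(longitudes), max(longitudes))
print(min(latitudes), max(latitudes))
```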

In [3]:
# Step 3: Turn the JSON response directly into a DataFrame
pd_taxi_json = pd.json_normalize(taxi_json)

In [4]:
# Step 4a: Declare a new variable that contains only your 'features' from the JSON
taxi_features = taxi_json['features']
# Step 4b: Turn it into a DataFrame
pd_taxi_features = pd.json_normalize(taxi_features)

In [5]:
pd_taxi_features.head()

Unnamed: 0,type,geometry.type,geometry.coordinates,properties.timestamp,properties.taxi_count,properties.api_info.status
0,Feature,MultiPoint,"[[103.6267, 1.307992], [103.63226, 1.30884], [...",2018-12-31T23:59:44+08:00,5887,healthy
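
The dotted column names such as `properties.api_info.status` come from `pd.json_normalize` flattening nested dictionaries and joining their keys with a dot. The following is a rough pure-Python sketch of that idea, not pandas's actual implementation. The function name `flatten` and the sample record (with a one-item coordinate list and a matching `taxi_count`) are made up for illustration.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with '.'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key path along
            flat.update(flatten(value, name))
        else:
            # Lists (such as the coordinates) are kept as single values
            flat[name] = value
    return flat


# Made-up record shaped like one entry of taxi_json['features']
feature = {
    "type": "Feature",
    "geometry": {"type": "MultiPoint", "coordinates": [[103.6267, 1.307992]]},
    "properties": {
        "timestamp": "2018-12-31T23:59:44+08:00",
        "taxi_count": 1,
        "api_info": {"status": "healthy"},
    },
}

row = flatten(feature)
for column in row:
    print(column)
```

Note that the coordinate list stays a single cell, which is why `geometry.coordinates` holds the whole list of taxi positions in one row.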


We then create a list of timestamps from 1 Jan 2019 to 31 Jan 2019 at 5-minute intervals, to be passed into our API calls.

In [6]:
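# Step 5: Generate a date range in 5-min intervals
# Note: end='1/31/2019' is midnight at the start of 31 Jan, so the last timestamp is 2019-01-31 00:00:00
# '5min' is the current alias for a 5-minute frequency; the older '5T' is deprecated in recent pandas
list_jan2019 = pd.date_range(start='1/1/2019', end='1/31/2019', freq='5min')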
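
It helps to know how many timestamps this range produces, since each one becomes an API call. Here is a quick cross-check with the standard library, as a sketch that mirrors the range above rather than calling pandas. Stepping from midnight on 1 January to midnight on 31 January, both inclusive, covers 30 full days of 288 five-minute slots each, plus the final endpoint.

```python
from datetime import datetime, timedelta

start = datetime(2019, 1, 1)
end = datetime(2019, 1, 31)  # midnight at the start of 31 January
step = timedelta(minutes=5)

# Build every timestamp from start to end, both ends included
timestamps = []
current = start
while current <= end:
    timestamps.append(current)
    current += step

print(len(timestamps))  # 30 days * 288 slots + 1 endpoint = 8641
print(timestamps[-1])   # 2019-01-31 00:00:00
```

Covering the whole of 31 January as well would mean ending at 23:55 on that day, which gives 31 * 288 = 8928 timestamps.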


In [8]:
# Step 6: Create three new lists containing the formatted parts of the DateTime
list_jan2019_date = []
list_jan2019_hour = []
list_jan2019_min = []

for datetime in list_jan2019:
    str_date = str(datetime)
    list_jan2019_date.append(str_date[:10] + "T")

    list_jan2019_hour.append(str_date[11:13] + "%3A")

    list_jan2019_min.append(str_date[14:16] + "%3A00")


In [9]:
# Step 7: zip all of the three lists together (don't forget the %3A)
list_jan2019_zip = [str1+str2+str3 for str1,str2,str3 in zip(list_jan2019_date, list_jan2019_hour, list_jan2019_min)]
list_jan2019_new = list(list_jan2019_zip)
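
Steps 6 and 7 build each string by slicing and concatenating. As an alternative sketch (not the notebook's method), `strftime` can produce the same text in one pass, with `%%` standing for a literal percent sign. The example uses plain `datetime` objects, and the helper names `to_query_value` and `by_slicing` are hypothetical.

```python
from datetime import datetime


def to_query_value(ts: datetime) -> str:
    """Format a timestamp as e.g. '2019-01-01T00%3A05%3A00'."""
    # '%%3A' emits the literal text '%3A' (an encoded colon)
    return ts.strftime("%Y-%m-%dT%H%%3A%M%%3A00")


def by_slicing(ts: datetime) -> str:
    """Same result using the slicing approach from Steps 6 and 7."""
    str_date = str(ts)
    return str_date[:10] + "T" + str_date[11:13] + "%3A" + str_date[14:16] + "%3A00"


example = datetime(2019, 1, 1, 0, 5)
print(to_query_value(example))
print(by_slicing(example))
```

Both functions give the same string, so the choice is mainly about readability.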

In [10]:
# Step 8: Make your API calls and build your DataFrame

# declare the base URL
base_url = "https://api.data.gov.sg/v1/transport/taxi-availability?date_time="

# declare the empty list
list_features = []


In [None]:
# use a for loop in the list you got from Step 7

t0 = time.time()
# URL extraction and JSON conversion can be run in chunks of 500 to avoid hangs on a slow machine;
# uncomment one of the lines below in place of the full loop if you face similar issues
################################################
#for datetime in list_jan2019_new[:500]:
#for datetime in list_jan2019_new[500:1000]:
#for datetime in list_jan2019_new[1000:1500]:
#for datetime in list_jan2019_new[1500:2000]:
#for datetime in list_jan2019_new[2000:2500]:
#for datetime in list_jan2019_new[2500:3000]:
#for datetime in list_jan2019_new[3000:3500]:
#for datetime in list_jan2019_new[3500:4000]:
#for datetime in list_jan2019_new[4000:4500]:
#for datetime in list_jan2019_new[4500:5000]:
#for datetime in list_jan2019_new[5000:5500]:
#for datetime in list_jan2019_new[5500:6000]:
#for datetime in list_jan2019_new[6000:6500]:
#for datetime in list_jan2019_new[6500:7000]:
#for datetime in list_jan2019_new[7000:7500]:
#for datetime in list_jan2019_new[7500:8000]:
#for datetime in list_jan2019_new[8000:8500]:
#for datetime in list_jan2019_new[8500:]:
#####################################################
for datetime in list_jan2019_new:
    
    # combine the base_url and the current date in the for loop    
    # make a get request using the combined URL
    response = requests.get(base_url + datetime)
    
    # get the JSON from the response of the get request
    taxi_json = response.json()
    
    # declare a variable which contains only the 'features' part of the JSON response
    # turn the variable into a DataFrame
    taxi_features = pd.json_normalize(taxi_json['features'])
    
    # append the dataframe into the empty list above
    list_features.append(taxi_features)

t1 = time.time()

#print(f"{t1-t0} seconds to download 500 urls and convert json")
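
The commented-out slices above split the run into chunks of 500 by hand. The same idea can be sketched with a small helper, together with a loop that records failed timestamps instead of stopping at the first error. Everything here is illustrative: `chunked`, `collect` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `requests.get(base_url + datetime).json()` so that the sketch runs without network access.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect(timestamps, fetch, chunk_size=500):
    """Call `fetch` for each timestamp; return (results, failed_timestamps)."""
    results, failed = [], []
    for chunk in chunked(timestamps, chunk_size):
        for ts in chunk:
            try:
                payload = fetch(ts)
                # A payload without 'features' raises KeyError and is recorded
                results.append((ts, payload["features"]))
            except (KeyError, ValueError, OSError):
                # Broad placeholders for bad payloads and connection problems
                failed.append(ts)
    return results, failed


def fake_fetch(ts):
    """Stand-in for the real API call; one timestamp simulates a bad response."""
    if ts == "bad":
        return {"message": "no data"}
    return {"features": [{"type": "Feature"}]}


results, failed = collect(["t1", "bad", "t2"], fake_fetch, chunk_size=2)
print(len(results), failed)
```

Keeping the failed timestamps makes it possible to retry just those later, and to keep the collected rows aligned with the timestamps that actually succeeded.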



In [None]:
# Step 9: concatenate all of the dataframes you appended into the empty list
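# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of repeating 0
df = pd.concat(list_features, ignore_index=True)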

In [None]:
# Step 10: create the 'time' column and assign it with the list from Step 5
df['time'] = list_jan2019
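
The `time` column holds the requested timestamps, which can differ slightly from the snapshot time the API reports in `properties.timestamp`. In the single call shown earlier, the request for 2019-01-01T00:00:00 returned a snapshot stamped 2018-12-31T23:59:44+08:00. Here is a small sketch of measuring that gap, assuming the requested time uses the same +08:00 offset as the response (an assumption, not something the source states).

```python
from datetime import datetime, timedelta, timezone

# Assumption: request times are interpreted in the +08:00 offset
SGT = timezone(timedelta(hours=8))

requested = datetime(2019, 1, 1, 0, 0, 0, tzinfo=SGT)
reported = datetime.fromisoformat("2018-12-31T23:59:44+08:00")

gap = requested - reported
print(gap.total_seconds())  # 16.0
```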

In [None]:
df.info()

In [None]:
# Step 11: Export your DataFrame to CSV
df.to_csv("taxi.csv", index=False)
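
One thing to keep in mind when reloading `taxi.csv`: CSV cells are text, so the nested coordinate lists come back as strings rather than lists. Below is a minimal sketch of that round trip using the standard library's `csv` module in place of pandas, with a made-up one-row sample. `ast.literal_eval` turns the text back into a list.

```python
import ast
import csv
import io

coordinates = [[103.6267, 1.307992], [103.63226, 1.30884]]

# Write one row to an in-memory CSV; the list is stored as its text form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["geometry.coordinates", "properties.taxi_count"])
writer.writerow([str(coordinates), 2])

# Read it back: every cell is now a string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["geometry.coordinates"]
print(type(raw).__name__)

# Parse the text back into a list of coordinate pairs
restored = ast.literal_eval(raw)
print(restored == coordinates)
```

With pandas, the same parsing could be applied to the column after `pd.read_csv`, for example with `.apply(ast.literal_eval)`.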