## Data used in these notebooks: NYC Taxi trips open data

Given hourly historical taxi trip data for NYC, your task is to predict the number of pickups over the next 7 days, per hour and per pickup zone.  <br>

<ul>
<li>Original data source:  <a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page" target="_blank"> https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page</a> </li>
<li>AWS-hosted public source:  <a href="https://registry.opendata.aws/nyc-tlc-trip-records-pds/" target="_blank">https://registry.opendata.aws/nyc-tlc-trip-records-pds/ </a> </li>
<li>Weather data: an AWS-managed weather ingestion service bundled with Amazon Forecast, aggregated by location and by hour. It initially covers only the USA and Europe; other global regions may follow, depending on demand. </li>
<li>Data used: Yellow taxi trips from 2018-12 through 2020-02, stopping before COVID effects appear </li>
</ul>

 
### Features and cleaning
Note: The ~5 GB of raw data has already been cleaned and joined using AWS Glue (tutorials to be created in the future).
<ul>
    <li>Join shape files to add Latitude, Longitude</li>
    <li>Add Trip duration in minutes</li>
    <li>Drop negative trip distances, 0 fares, 0 passengers, and trip durations under 1 min</li>
    <li>Drop 2 unknown zones ['264', '265']</li>
    </ul>
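
The cleaning rules above were applied in AWS Glue, and that job is not shown here. As a rough illustration only, the same rules could be expressed in pandas as below. The column names follow the public TLC yellow-taxi schema and are assumptions about the raw input; the rows are made-up toy data, and `clean_trips` is a hypothetical helper, not part of the Glue job.

```python
import pandas as pd

# Toy rows standing in for raw yellow-taxi trips (made-up values).
# Column names are assumed to follow the public TLC schema.
raw_trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2019-07-01 00:05:00", "2019-07-01 00:10:00", "2019-07-01 01:00:00",
        "2019-07-01 01:30:00", "2019-07-01 02:00:00", "2019-07-01 02:15:00"]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2019-07-01 00:25:00", "2019-07-01 00:10:20", "2019-07-01 01:20:00",
        "2019-07-01 01:45:00", "2019-07-01 02:30:00", "2019-07-01 02:40:00"]),
    "trip_distance": [2.5, 0.1, -1.0, 3.2, 4.0, 1.8],
    "fare_amount": [12.0, 3.0, 8.0, 0.0, 15.0, 9.5],
    "passenger_count": [1, 1, 2, 1, 0, 3],
    "PULocationID": ["142", "48", "79", "230", "161", "264"],
})


def clean_trips(trips: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the cleaning rules listed above."""
    trips = trips.copy()
    # Add trip duration in minutes
    duration = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
    trips["trip_duration_min"] = duration.dt.total_seconds() / 60.0
    keep = (
        (trips["trip_distance"] >= 0)                    # drop negative distances
        & (trips["fare_amount"] != 0)                    # drop 0 fares
        & (trips["passenger_count"] != 0)                # drop 0 passengers
        & (trips["trip_duration_min"] >= 1)              # drop trips under 1 min
        & ~trips["PULocationID"].isin(["264", "265"])    # drop unknown zones
    )
    return trips.loc[keep].reset_index(drop=True)


cleaned = clean_trips(raw_trips)
print(cleaned[["PULocationID", "trip_duration_min"]])
```

Each toy row except the first violates exactly one rule, so only the first row survives the filter.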

## Step 0. Set up and install libraries <a class="anchor" id="setup"></a>

In [51]:
# Import standard open libraries
import pandas as pd

# AWS libraries and initialization
import boto3
import sagemaker

<b>Define S3 bucket</b><br>
The cell below will use the default SageMaker S3 bucket in the account.

In [52]:
from sagemaker import get_execution_role
sess = sagemaker.Session()
s3_bucket_name = sess.default_bucket()
    
# create prefix for organizing your new bucket
s3_prefix = "nyc-taxi-trips"
print(f"using folder '{s3_prefix}'")

using folder 'nyc-taxi-trips'


## Step 1. Read data <a class="anchor" id="read"></a>

First, we read the headerless .csv file. Then we identify which columns map to the inputs that Amazon Forecast requires.

<img src="https://amazon-forecast-samples.s3-us-west-2.amazonaws.com/common/images/nyctaxi_map_fields.png" width="82%">
<br>

In [53]:
## Read cleaned, joined, featurized data from Glue ETL processing
df_raw = pd.read_csv(
    "s3://amazon-forecast-samples/data_prep_templates/clean_features.csv",
    parse_dates=True,
    header=None,  # the file has no header row, so column names are supplied below
    dtype={0: 'str', 1: 'str', 2: 'str', 3: 'str', 4: 'int32',
           5: 'float64', 6: 'str', 7: 'str', 8: 'str'},
    names=['pulocationid', 'pickup_hourly', 'pickup_day_of_week',
           'day_hour', 'trip_quantity', 'mean_item_loc_weekday',
           'pickup_geolocation', 'pickup_borough', 'binned_max_item'],
)

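# drop duplicates, printing the shape before and after
print(df_raw.shape)
df_raw.drop_duplicates(inplace=True)

# parse timestamps; values that do not match the format become NaT
df_raw['pickup_hourly'] = pd.to_datetime(df_raw["pickup_hourly"], format="%Y-%m-%d %H:%M:%S", errors='coerce')
print(df_raw.shape)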
start_time = df_raw.pickup_hourly.min()
end_time = df_raw.pickup_hourly.max()
print(f"Min timestamp = {start_time}")
print(f"Max timestamp = {end_time}")
df_raw.sample(5)

(1507488, 9)
(1507488, 9)
Min timestamp = 2019-07-01 00:00:00
Max timestamp = 2020-02-29 23:00:00


| | pulocationid | pickup_hourly | pickup_day_of_week | day_hour | trip_quantity | mean_item_loc_weekday | pickup_geolocation | pickup_borough | binned_max_item |
|---|---|---|---|---|---|---|---|---|---|
| 140229 | 92 | 2019-09-17 20:00:00 | Tuesday | Tuesday_20 | 0 | 1.21637 | 40.76412728_-73.83044713 | Queens | Cat_1 |
| 1180043 | 179 | 2019-10-15 01:00:00 | Tuesday | Tuesday_1 | 1 | 4.14946 | 40.77142534_-73.92681236 | Queens | Cat_1 |
| 1039834 | 118 | 2019-08-30 10:00:00 | Friday | Friday_10 | 0 | 1.0 | 40.58563106_-74.13707013 | Staten Island | Cat_1 |
| 528615 | 152 | 2019-10-15 21:00:00 | Tuesday | Tuesday_21 | 7 | 7.35476 | 40.8175772_-73.95432487 | Manhattan | Cat_1 |
| 1027600 | 208 | 2019-10-13 19:00:00 | Sunday | Sunday_19 | 0 | 1.08571 | 40.82468614_-73.82488634 | Bronx | Cat_1 |
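
Because the read cell parses `pickup_hourly` with `errors='coerce'`, any value that does not match the format silently becomes `NaT` instead of raising an error. A quick check like the one below can confirm that nothing was lost. This is a minimal sketch on made-up data; in the notebook you would run the same two lines against `df_raw['pickup_hourly']`.

```python
import pandas as pd

# Toy column with one malformed value (made-up data, not taken from the file).
sample = pd.DataFrame({"pickup_hourly": [
    "2019-07-01 00:00:00", "2019-07-01 01:00:00", "not-a-timestamp"]})

# Same parsing call as in the read cell: bad values are coerced to NaT.
parsed = pd.to_datetime(sample["pickup_hourly"],
                        format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Count the coerced values and keep the offending rows for inspection.
n_bad = int(parsed.isna().sum())
bad_rows = sample.loc[parsed.isna()]
print(f"unparseable timestamps: {n_bad}")
```

If the count is nonzero on the real data, inspect `bad_rows` before building the time series, since rows with `NaT` timestamps cannot be placed on the hourly grid.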


## Step 2. Transform Data <a class="anchor" id="transform_tts"></a>

In [54]:
# map expected column names
item_id = "pulocationid"
target_value = "trip_quantity"
timestamp = "pickup_hourly"

# forecast setting
FORECAST_FREQ = "H"

# specify array of dimensions you'll use for forecasting
forecast_dims = [timestamp, item_id]

print(f"forecast_dims = {forecast_dims}")

forecast_dims = ['pickup_hourly', 'pulocationid']
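
The task asks for 7 days of predictions at hourly frequency, so the forecast horizon in time steps follows directly from those two numbers. The sketch below works it out; `HORIZON_DAYS` and `forecast_horizon` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

FORECAST_FREQ = "H"   # hourly, repeated here so the snippet runs standalone
HORIZON_DAYS = 7      # hypothetical name for the 7-day horizon in the task

# Number of hourly steps in the horizon: 7 days * 24 hours per day
forecast_horizon = int(pd.Timedelta(days=HORIZON_DAYS) / pd.Timedelta(hours=1))
print(forecast_horizon)
```

For this notebook's settings the horizon works out to 168 hourly steps per pickup zone.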


In [55]:
## Assemble TTS required columns

tts = df_raw[[timestamp, item_id, target_value]].copy()

print(f"start date = {tts[timestamp].min()}")
print(f"end date = {tts[timestamp].max()}")

# check it
print(tts.shape)
print(tts.dtypes)
tts.head(5)

start date = 2019-07-01 00:00:00
end date = 2020-02-29 23:00:00
(1507488, 3)
pickup_hourly    datetime64[ns]
pulocationid             object
trip_quantity             int32
dtype: object


| | pickup_hourly | pulocationid | trip_quantity |
|---|---|---|---|
| 0 | 2019-07-02 09:00:00 | 1 | 0 |
| 1 | 2019-07-03 01:00:00 | 1 | 0 |
| 2 | 2019-07-05 06:00:00 | 1 | 0 |
| 3 | 2019-07-06 08:00:00 | 1 | 0 |
| 4 | 2019-07-26 17:00:00 | 1 | 0 |
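
Before saving the target time series, it can help to check that each (timestamp, item) pair appears once and to see which items have gaps in their hourly history. The sketch below shows one way to do that on made-up toy data with the same three columns; it is an illustration, not a step the original notebook performs.

```python
import pandas as pd

# Toy series: item "1" has all 4 hours, item "2" is missing 02:00 (made-up data).
toy = pd.DataFrame({
    "pickup_hourly": pd.to_datetime([
        "2019-07-01 00:00", "2019-07-01 01:00", "2019-07-01 02:00", "2019-07-01 03:00",
        "2019-07-01 00:00", "2019-07-01 01:00", "2019-07-01 03:00"]),
    "pulocationid": ["1", "1", "1", "1", "2", "2", "2"],
    "trip_quantity": [0, 2, 1, 0, 5, 3, 4],
})

# Rows that repeat an existing (timestamp, item) pair
n_dupes = int(toy.duplicated(subset=["pickup_hourly", "pulocationid"]).sum())

# Hours a complete series would contain between the min and max timestamps
span = toy["pickup_hourly"].max() - toy["pickup_hourly"].min()
expected_hours = int(span / pd.Timedelta(hours=1)) + 1

# Distinct hours actually present per item, and the shortfall per item
hours_per_item = toy.groupby("pulocationid")["pickup_hourly"].nunique()
missing_hours = expected_hours - hours_per_item
print(missing_hours[missing_hours > 0])
```

On the real data, the same code would run against `tts` using the `timestamp` and `item_id` variables defined above.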


## Step 3. Save Data and Upload to S3 <a class="anchor" id="save_tts"></a>

In [57]:
# Save tts locally as a headerless CSV, then upload it to S3
local_file = "tts.csv"
tts.to_csv(local_file, header=False, index=False)

key = f"{s3_prefix}/tts.csv"
boto3.Session().resource('s3').Bucket(s3_bucket_name).Object(key).upload_file(local_file)
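
Since `tts.csv` is written without a header, whatever dataset schema you later give Amazon Forecast has to list its attributes in the same order as the CSV columns. The sketch below builds such a schema dictionary from the column order used above. The `forecast_attributes` mapping and the choice of `integer` for the target are assumptions for illustration; check them against your own dataset definition.

```python
# Column order written to tts.csv by the cell above.
csv_columns = ["pickup_hourly", "pulocationid", "trip_quantity"]

# Hypothetical mapping from this notebook's column names to
# (Forecast attribute name, attribute type); adjust to your dataset.
forecast_attributes = {
    "pickup_hourly": ("timestamp", "timestamp"),
    "pulocationid": ("item_id", "string"),
    "trip_quantity": ("target_value", "integer"),
}

# Build the schema in CSV column order so attributes line up with the file.
tts_schema = {
    "Attributes": [
        {"AttributeName": forecast_attributes[col][0],
         "AttributeType": forecast_attributes[col][1]}
        for col in csv_columns
    ]
}
schema_names = [attr["AttributeName"] for attr in tts_schema["Attributes"]]
print(schema_names)
```

Deriving the schema from `csv_columns` rather than typing it by hand keeps the two in step if the column selection in Step 2 ever changes.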

Save variables for use with other notebooks

In [60]:
%store df_raw
%store s3_bucket_name
%store s3_prefix
%store start_time
%store end_time
%store item_id
%store target_value
%store timestamp
%store FORECAST_FREQ
%store forecast_dims

Stored 'df_raw' (DataFrame)
Stored 's3_bucket_name' (str)
Stored 's3_prefix' (str)
Stored 'start_time' (Timestamp)
Stored 'end_time' (Timestamp)
Stored 'item_id' (str)
Stored 'target_value' (str)
Stored 'timestamp' (str)
Stored 'FORECAST_FREQ' (str)
Stored 'forecast_dims' (list)
