## Data used in these notebooks: NYC Taxi trips open data

Given hourly historical taxi trip data for NYC, your task is to predict the number of pickups over the next 7 days, per hour and per pickup zone.  <br>
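
As a quick sanity check on the task statement, the sketch below converts the 7-day window into a count of hourly steps, which is how a horizon is usually expressed when the data frequency is hourly. The names `FORECAST_DAYS`, `STEP`, and `forecast_horizon` are illustrative assumptions and are not used elsewhere in these notebooks.

```python
import pandas as pd

# Hypothetical names for illustration only.
FORECAST_DAYS = 7
STEP = pd.Timedelta(hours=1)  # one step per hour, matching hourly data

# Number of hourly steps contained in a 7-day window
forecast_horizon = pd.Timedelta(days=FORECAST_DAYS) // STEP
print(f"forecast horizon = {forecast_horizon} hourly steps per pickup zone")
```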

<ul>
<li>Original data source:  <a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page" target="_blank"> https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page</a> </li>
<li>AWS-hosted public source:  <a href="https://registry.opendata.aws/nyc-tlc-trip-records-pds/" target="_blank">https://registry.opendata.aws/nyc-tlc-trip-records-pds/ </a> </li>
<li>Weather data: an AWS-managed weather ingestion service bundled with Amazon Forecast, aggregated by location and by hour. It initially covers only the USA and Europe; other regions may follow depending on demand. </li>
<li>Data used: yellow taxi trips from 2018-12 through 2020-02, chosen to avoid COVID effects </li>
</ul>

 
### Features and cleaning
Note: The ~5 GB of raw data has already been cleaned and joined using AWS Glue (tutorials to be created in the future).
<ul>
    <li>Join shape files to add latitude and longitude</li>
    <li>Add trip duration in minutes</li>
    <li>Drop trips with negative distances, zero fares, zero passengers, or durations under 1 minute</li>
    <li>Drop 2 unknown zones ['264', '265']</li>
</ul>
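
The actual cleaning ran in AWS Glue and is not shown here. As a rough illustration only, the filtering steps listed above might look like this in pandas. Every column name and value in the toy frame is an assumption, and the shape file join is left out.

```python
import pandas as pd

# Toy trip records; all column names and values are hypothetical.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2019-01-01 00:05:00", "2019-01-01 00:10:00", "2019-01-01 00:20:00",
        "2019-01-01 00:30:00", "2019-01-01 00:40:00", "2019-01-01 00:50:00",
    ]),
    "dropoff_datetime": pd.to_datetime([
        "2019-01-01 00:25:00", "2019-01-01 00:10:30", "2019-01-01 00:45:00",
        "2019-01-01 00:55:00", "2019-01-01 01:00:00", "2019-01-01 01:10:00",
    ]),
    "trip_distance": [2.5, 0.1, -1.0, 3.0, 1.2, 4.0],
    "fare_amount": [12.0, 3.0, 9.0, 0.0, 7.5, 15.0],
    "passenger_count": [1, 1, 2, 1, 0, 2],
    "pulocationid": ["79", "79", "48", "48", "230", "264"],
})

# Add trip duration in minutes
duration = trips["dropoff_datetime"] - trips["pickup_datetime"]
trips["trip_duration_min"] = duration.dt.total_seconds() / 60

# Keep only rows that pass every cleaning rule
UNKNOWN_ZONES = ["264", "265"]
keep = (
    (trips["trip_distance"] >= 0)
    & (trips["fare_amount"] != 0)
    & (trips["passenger_count"] != 0)
    & (trips["trip_duration_min"] >= 1)
    & ~trips["pulocationid"].isin(UNKNOWN_ZONES)
)
clean = trips[keep].reset_index(drop=True)
print(f"kept {len(clean)} of {len(trips)} trips")
```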

## Step 0. Set up and install libraries <a class="anchor" id="setup"></a>

In [None]:
# Import standard open libraries
import pandas as pd
import json
from time import sleep

# AWS libraries and initialization
import boto3
import sagemaker

<b>Define S3 bucket</b><br>
The cell below uses the default SageMaker S3 bucket in the account.

In [None]:
from sagemaker import get_execution_role
sess = sagemaker.Session()
s3_bucket_name = sess.default_bucket()
    
# create prefix for organizing your new bucket
s3_prefix = "nyc-taxi-trips"
print(f"using folder '{s3_prefix}'")

## Step 1. Read data <a class="anchor" id="read"></a>

First, read the headerless .csv file. Then identify which columns map to the inputs that Amazon Forecast requires.

<img src="https://amazon-forecast-samples.s3-us-west-2.amazonaws.com/common/images/nyctaxi_map_fields.png" width="82%">
<br>

In [None]:
## Read cleaned, joined, featurized data from Glue ETL processing
df_raw = pd.read_csv("s3://amazon-forecast-samples/data_prep_templates/clean_features.csv"
                          , parse_dates=True
                          , header=None
                          , dtype={0:'str'
                                   , 1: 'str'
                                   , 2: 'str'
                                   , 3:'str'
                                   , 4: 'int32'
                                   , 5: 'float64'
                                   , 6: 'str'
                                   , 7: 'str'
                                   , 8: 'str'}
                          , names=['pulocationid', 'pickup_hourly', 'pickup_day_of_week'
                                   , 'day_hour', 'trip_quantity', 'mean_item_loc_weekday'
                                   , 'pickup_geolocation', 'pickup_borough', 'binned_max_item'])

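# Drop duplicate rows, printing the shape before and after
print(df_raw.shape)
df_raw.drop_duplicates(inplace=True)

# Parse timestamps; values that do not match the format become NaT
df_raw['pickup_hourly'] = pd.to_datetime(df_raw["pickup_hourly"], format="%Y-%m-%d %H:%M:%S", errors='coerce')
print(df_raw.shape)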
start_time = df_raw.pickup_hourly.min()
end_time = df_raw.pickup_hourly.max()
print(f"Min timestamp = {start_time}")
print(f"Max timestamp = {end_time}")
df_raw.sample(5)
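
Because the timestamp column is parsed with `errors='coerce'`, any value that does not match the format silently becomes `NaT`. The sketch below shows one way to count such rows on a small made-up frame. Dropping them afterwards is an assumption about what you want, not a step this notebook performs.

```python
import pandas as pd

# Toy frame with one malformed timestamp; the values are made up.
raw = pd.DataFrame({
    "pickup_hourly": ["2019-01-01 00:00:00", "not a timestamp", "2019-01-01 02:00:00"],
    "trip_quantity": [3, 5, 7],
})

# Same parsing call as the notebook: unparseable values become NaT
parsed = pd.to_datetime(raw["pickup_hourly"], format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Count rows that failed to parse before deciding what to do with them
n_bad = int(parsed.isna().sum())
print(f"{n_bad} row(s) failed to parse")

# One option (an assumption): drop the unparseable rows
raw["pickup_hourly"] = parsed
raw = raw.dropna(subset=["pickup_hourly"])
```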

## Step 2. Transform Data <a class="anchor" id="transform_tts"></a>

In [None]:
# map expected column names
item_id = "pulocationid"
target_value = "trip_quantity"
timestamp = "pickup_hourly"

# forecast setting
FORECAST_FREQ = "H"

# specify array of dimensions you'll use for forecasting
forecast_dims = [timestamp, item_id]

print(f"forecast_dims = {forecast_dims}")

In [None]:
## Assemble the required Target Time Series (TTS) columns

tts = df_raw[[timestamp, item_id, target_value]].copy()

print(f"start date = {tts[timestamp].min()}")
print(f"end date = {tts[timestamp].max()}")

# check it
print(tts.shape)
print(tts.dtypes)
tts.head(5)
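
A target time series is easiest to reason about when each (timestamp, item) pair appears exactly once. The sketch below checks for repeated keys on a toy frame that reuses the notebook's column names. Summing repeated rows is an assumption that suits count data such as pickups; check it against your own data before applying it.

```python
import pandas as pd

# Toy TTS with one repeated (timestamp, item) key; the values are made up.
tts_demo = pd.DataFrame({
    "pickup_hourly": pd.to_datetime([
        "2019-01-01 00:00:00", "2019-01-01 00:00:00", "2019-01-01 01:00:00",
    ]),
    "pulocationid": ["79", "79", "79"],
    "trip_quantity": [3, 2, 4],
})

keys = ["pickup_hourly", "pulocationid"]

# Rows whose key has already appeared earlier in the frame
n_dupes = int(tts_demo.duplicated(subset=keys).sum())
print(f"{n_dupes} duplicated key(s)")

# Collapse to one row per key by summing the target (assumed aggregation)
tts_agg = tts_demo.groupby(keys, as_index=False)["trip_quantity"].sum()
print(tts_agg)
```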

## Step 3. Save Data and Upload to S3 <a class="anchor" id="save_tts"></a>

In [None]:
# Save tts locally as a headerless CSV, then upload it to S3
local_file = "tts.csv"
tts.to_csv(local_file, header=False, index=False)

key = f"{s3_prefix}/tts.csv"
boto3.Session().resource('s3').Bucket(s3_bucket_name).Object(key).upload_file(local_file)
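
Since the CSV is written without a header, the columns are identified by position, so their order has to agree with the attribute order of whatever dataset schema you later register with Amazon Forecast. The sketch below pairs the notebook's three columns with a schema in the `Attributes` layout that Forecast uses. The attribute types shown are assumptions for a custom-domain target time series and should be checked against the dataset you create.

```python
import json

# Assumed schema for a custom-domain target time series; verify the types
# against your own dataset definition before using it.
schema = {
    "Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "item_id", "AttributeType": "string"},
        {"AttributeName": "target_value", "AttributeType": "float"},
    ]
}

# Column order written to tts.csv in this notebook
csv_columns = ["pickup_hourly", "pulocationid", "trip_quantity"]

# Pair each CSV column with the schema attribute at the same position
attribute_names = [attr["AttributeName"] for attr in schema["Attributes"]]
column_to_attribute = dict(zip(csv_columns, attribute_names))
print(json.dumps(column_to_attribute, indent=2))
```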

## Step 4. Prepare Forecast Access to S3 <a class="anchor" id="forecast_role"></a>

In [None]:
iam = boto3.client("iam")
forecast_role_name = "ForecastToS3"

create_role_response = iam.create_role(
    RoleName=forecast_role_name,
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "forecast.amazonaws.com",
                },
                "Action": "sts:AssumeRole",
            },
        ]
    }),
)

forecast_role_arn = create_role_response["Role"]["Arn"]
print(forecast_role_arn)

# Note that AmazonForecastFullAccess provides access to some specifically-named default S3 buckets as well,
# but we just want it for the Forecast permissions themselves:
iam.attach_role_policy(
    RoleName=forecast_role_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonForecastFullAccess",
)

# By default (since we're experimenting), this code attaches over-generous S3 permissions (full access):
iam.attach_role_policy(
    RoleName=forecast_role_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)
# You could instead use something like the below to give access to *only* the relevant buckets
# (export_bucket_name is not defined in this notebook; set it if you export to a separate bucket):
# inline_s3_policy = {
#     "Version": "2012-10-17",
#     "Statement": [
#         {
#             "Effect": "Allow",
#             "Action": "s3:*",
#             "Resource": [
#                 # (Assuming you're not running in a different partition e.g. aws-cn)
#                 f"arn:aws:s3:::{s3_bucket_name}",
#                 f"arn:aws:s3:::{s3_bucket_name}/*",
#             ]
#         },
#     ],
# }
# if s3_bucket_name != export_bucket_name:
#     inline_s3_policy["Statement"][0]["Resource"].append(f"arn:aws:s3:::{export_bucket_name}")
#     inline_s3_policy["Statement"][0]["Resource"].append(f"arn:aws:s3:::{export_bucket_name}/*")

# iam.put_role_policy(
#     RoleName=forecast_role_name,
#     PolicyName="ForecastPoCBucketAccess",
#     PolicyDocument=json.dumps(inline_s3_policy)
# )

# IAM policy attachments *may* take up to a minute to propagate, so just to be safe:
sleep(60) 
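
Re-running the cell above raises an error once the `ForecastToS3` role exists, because `create_role` does not overwrite an existing role. A minimal sketch of a re-runnable variant follows. The helper name `get_or_create_role_arn` is hypothetical, and it assumes a boto3 IAM client is passed in as `iam`.

```python
import json


def get_or_create_role_arn(iam, role_name, trust_policy):
    """Return the ARN of role_name, creating the role only if it is missing.

    `iam` is expected to be a boto3 IAM client (hypothetical helper).
    """
    try:
        response = iam.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=json.dumps(trust_policy),
        )
    except iam.exceptions.EntityAlreadyExistsException:
        # The role was created by an earlier run, so look it up instead
        response = iam.get_role(RoleName=role_name)
    return response["Role"]["Arn"]


# Same trust policy as in the cell above
forecast_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "forecast.amazonaws.com"},
            "Action": "sts:AssumeRole",
        },
    ],
}
```

In the notebook this could be called as `get_or_create_role_arn(iam, forecast_role_name, forecast_trust_policy)` in place of the direct `create_role` call.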

Save variables for use in other notebooks. They can be restored there with `%store -r`.

In [None]:
%store df_raw
%store s3_bucket_name
%store s3_prefix
%store start_time
%store end_time
%store item_id
%store target_value
%store timestamp
%store FORECAST_FREQ
%store forecast_dims