Rohit's First Kernel - NYC Taxi Fare Prediction
===========
This is my first kernel, submitted to the Google Cloud playground competition [New York City Taxi Fare Prediction](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction).

Strategy
--------------------
1. Filter out outliers
    1. Remove data outside NYC
    2. Remove data where the fare is unreasonable (too high / too low)
2. Use Linear Regression ML Model On Clean Data
3. Use Linear Fit On Unclean Data
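
The outlier filter in step 1 could look roughly like the sketch below. The bounding box and fare range are illustrative assumptions, not values taken from this kernel, and `filter_outliers` is a hypothetical helper name. The column names match the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical bounds: a rough box around the NYC area and a plausible fare range.
NYC_LON = (-74.3, -73.7)
NYC_LAT = (40.5, 41.0)
FARE_RANGE = (2.5, 200.0)

def filter_outliers(df):
    """Keep rows whose pickup and dropoff lie inside the box and whose fare is plausible."""
    in_box = (
        df["pickuplon"].between(*NYC_LON)
        & df["dropofflon"].between(*NYC_LON)
        & df["pickuplat"].between(*NYC_LAT)
        & df["dropofflat"].between(*NYC_LAT)
    )
    fare_ok = df["fare_amount"].between(*FARE_RANGE)
    return df[in_box & fare_ok]

# Three made-up rows: one ordinary trip, one far-west pickup, one zero fare.
sample = pd.DataFrame({
    "fare_amount": [11.0, 9.0, 0.0],
    "pickuplon": [-73.957148, -75.45426, -73.98],
    "pickuplat": [40.717472, 40.75, 40.75],
    "dropofflon": [-73.960302, -73.98, -73.97],
    "dropofflat": [40.696392, 40.75, 40.76],
})
clean = filter_outliers(sample)
print(len(sample), "rows before,", len(clean), "rows after")
```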

Using NYC Open Data
-------------------
NYC Open Data is stored in Google BigQuery open datasets. To access this data in your notebook, see the kernel [How to Query the NYC Open Data](https://www.kaggle.com/paultimothymooney/how-to-query-the-nyc-open-data).


## Setup: Import Libraries

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# to plot 3d scatter plots
from mpl_toolkits.mplot3d import Axes3D

import math

# to print out current time
import datetime
import os

import traceback

import tensorflow as tf
import shutil

tf.logging.set_verbosity(tf.logging.INFO)

print(tf.__version__)

1.10.0


## Read exploratory dataset into pandas dataframe

In [2]:
# BASE_PATH = os.path.dirname("__file__")
BASE_PATH = r'M:/kaggle/NY Taxi Cab/notebook/'

BATCH_SIZE = 512

print('Started reading dataset ------------- ', datetime.datetime.now())

# Try to load the data. This may be an intensive process
df_train = pd.read_csv(
    os.path.join(BASE_PATH, r'..\input\train_split\train-000000000003.csv'),
    nrows=BATCH_SIZE * 2,
    parse_dates=["pickup_datetime"])

print('Finished reading dataset ------------- ', datetime.datetime.now())

Started reading dataset -------------  2018-09-15 22:51:20.085878
Finished reading dataset -------------  2018-09-15 22:51:20.281872


## Describe some dataset statistics

In [3]:
df_train.head(n=10)

| | key | key_original | fare_amount | pickup_datetime | dayofweek | hourofday | pickuplon | pickuplat | dropofflon | dropofflat | passengers |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-09-25 00:48:44+00-73.98575140.732774-73.9... | 2009-09-25 00:48:44 UTC | 56.61 | 2009-09-25 00:48:44 | Fri | 0 | -73.985751 | 40.732774 | -73.916322 | 40.560941 | 1 |
| 1 | 2013-06-21 00:47:00+00-73.95714840.717472-73.9... | 2013-06-21 00:47:00 UTC | 11.0 | 2013-06-21 00:47:00 | Fri | 0 | -73.957148 | 40.717472 | -73.960302 | 40.696392 | 1 |
| 2 | 2015-05-08 00:49:08+00-73.95177459716796940.77... | 2015-05-08 00:49:08 UTC | 10.5 | 2015-05-08 00:49:08 | Fri | 0 | -73.951775 | 40.777752 | -73.980659 | 40.737827 | 2 |
| 3 | 2010-12-24 00:45:35+00-74.00531640.728795-74.0... | 2010-12-24 00:45:35 UTC | 6.9 | 2010-12-24 00:45:35 | Fri | 0 | -74.005316 | 40.728795 | -74.004893 | 40.748165 | 1 |
| 4 | 2013-11-08 00:26:59+00-73.9420140.79534-73.945... | 2013-11-08 00:26:59 UTC | 7.5 | 2013-11-08 00:26:59 | Fri | 0 | -73.94201 | 40.79534 | -73.945941 | 40.814598 | 1 |
| 5 | 2013-07-26 00:51:00+00-73.9816940.751112-73.99... | 2013-07-26 00:51:00 UTC | 10.5 | 2013-07-26 00:51:00 | Fri | 0 | -73.98169 | 40.751112 | -73.997617 | 40.720812 | 5 |
| 6 | 2010-11-19 00:36:00+00-73.99061340.750782-73.9... | 2010-11-19 00:36:00 UTC | 7.7 | 2010-11-19 00:36:00 | Fri | 0 | -73.990613 | 40.750782 | -73.978248 | 40.7489 | 2 |
| 7 | 2011-12-30 00:30:00+00-73.98829840.72796-73.92... | 2011-12-30 00:30:00 UTC | 26.1 | 2011-12-30 00:30:00 | Fri | 0 | -73.988298 | 40.72796 | -73.921317 | 40.867857 | 2 |
| 8 | 2012-12-14 00:11:38+00-73.97705840.752328-73.9... | 2012-12-14 00:11:38 UTC | 12.5 | 2012-12-14 00:11:38 | Fri | 0 | -73.977058 | 40.752328 | -73.979702 | 40.782815 | 1 |
| 9 | 2010-05-21 00:49:01+00-74.00522940.728762-73.9... | 2010-05-21 00:49:01 UTC | 8.9 | 2010-05-21 00:49:01 | Fri | 0 | -74.005229 | 40.728762 | -73.97742 | 40.749979 | 1 |


In [4]:
df_train.describe()

| | fare_amount | hourofday | pickuplon | pickuplat | dropofflon | dropofflat | passengers |
|---|---|---|---|---|---|---|---|
| count | 1024.0 | 1024.0 | 1024.0 | 1024.0 | 1024.0 | 1024.0 | 1024.0 |
| mean | 11.642822 | 0.44043 | -73.981624 | 40.745231 | -73.9732 | 40.746367 | 1.744141 |
| std | 9.572735 | 0.593518 | 0.055177 | 0.027119 | 0.058057 | 0.039485 | 1.363884 |
| min | 2.5 | 0.0 | -75.45426 | 40.64459 | -75.487345 | 40.560941 | 1.0 |
| 25% | 6.1 | 0.0 | -73.995657 | 40.7283 | -73.991881 | 40.726121 | 1.0 |
| 50% | 8.9 | 0.0 | -73.986445 | 40.743842 | -73.980107 | 40.745272 | 1.0 |
| 75% | 13.5 | 1.0 | -73.975086 | 40.759941 | -73.957853 | 40.76405 | 2.0 |
| max | 112.8 | 2.0 | -73.776658 | 41.066758 | -73.75689 | 41.076337 | 6.0 |


## Define training dataset properties

In [5]:
CSV_COLUMNS = 'key,key_original,fare_amount,pickup_datetime,dayofweek,hourofday,pickuplon,pickuplat,dropofflon,dropofflat,passengers'.split(',')
LABEL_COLUMN = 'fare_amount'
KEY_FEATURE_COLUMN = 'key'
DEFAULTS = [['nokey'], ['nokey'], [0.0], ['badDate'], ['Sun'], [0], [-74.0], [40.0], [-74.0], [40.7], [0.0]]
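
To make the role of `DEFAULTS` concrete, here is a plain-Python approximation of how record defaults behave when a CSV line is decoded: each default supplies both the column's type and the value used when a field is empty. `parse_line` is a hypothetical helper written for illustration; it mimics the idea rather than TensorFlow's exact implementation.

```python
import csv

CSV_COLUMNS = ('key,key_original,fare_amount,pickup_datetime,dayofweek,hourofday,'
               'pickuplon,pickuplat,dropofflon,dropofflat,passengers').split(',')
DEFAULTS = [['nokey'], ['nokey'], [0.0], ['badDate'], ['Sun'], [0],
            [-74.0], [40.0], [-74.0], [40.7], [0.0]]

def parse_line(line):
    """Parse one CSV line, casting each field to its default's type and
    substituting the default when the field is empty."""
    fields = next(csv.reader([line]))
    if len(fields) != len(DEFAULTS):
        raise ValueError('expected %d fields, got %d' % (len(DEFAULTS), len(fields)))
    row = {}
    for name, raw, (default,) in zip(CSV_COLUMNS, fields, DEFAULTS):
        row[name] = default if raw == '' else type(default)(raw)
    return row

# The last field (passengers) is left empty on purpose, so its default is used.
row = parse_line('k1,2013-06-21 00:47:00 UTC,11.0,2013-06-21 00:47:00,Fri,0,'
                 '-73.957148,40.717472,-73.960302,40.696392,')
print(row['fare_amount'], row['hourofday'], row['passengers'])
```

Because the two lists are zipped together, `CSV_COLUMNS` and `DEFAULTS` must stay the same length and in the same order.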

## Define the raw input columns (these are also provided at prediction time)

In [6]:
INPUT_COLUMNS = [
    # Define features
    tf.feature_column.categorical_column_with_vocabulary_list('dayofweek', vocabulary_list = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']),
    tf.feature_column.categorical_column_with_identity('hourofday', num_buckets = 24),

    # Numeric columns
    tf.feature_column.numeric_column('pickuplat'),
    tf.feature_column.numeric_column('pickuplon'),
    tf.feature_column.numeric_column('dropofflat'),
    tf.feature_column.numeric_column('dropofflon'),
    tf.feature_column.numeric_column('passengers'),
    
    # Engineered features that are created in the input_fn
    tf.feature_column.numeric_column('latdiff'),
    tf.feature_column.numeric_column('londiff'),
    tf.feature_column.numeric_column('euclidean')
]

## Define evaluation metrics

In [7]:
def add_eval_metrics(labels, predictions):
    pred_values = predictions['predictions']
    return {
        'rmse': tf.metrics.root_mean_squared_error(labels, pred_values)
    }
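
For reference, the root mean squared error reported by this metric is the square root of the mean of the squared differences between labels and predictions. A minimal plain-Python sketch with made-up fares and predictions:

```python
import math

def rmse(labels, predictions):
    """Root mean squared error over two equal-length sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only: the errors are 1.0, -2.0 and 0.0.
labels = [11.0, 10.5, 6.9]
predictions = [10.0, 12.5, 6.9]
print(rmse(labels, predictions))
```

Since fares are in dollars, the RMSE is also in dollars, which makes it easy to interpret.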

## Build the estimator

In [8]:
def build_estimator(model_dir, nbuckets, hidden_units):
    """
    Build an estimator starting from INPUT_COLUMNS.
    These include feature transformations and synthetic features.
    The model is a wide-and-deep model.
    """

    # Input columns
    (dayofweek, hourofday, plat, plon, dlat, dlon, pcount, latdiff, londiff, euclidean) = INPUT_COLUMNS

    # Bucketize the lats & lons
    latbuckets = np.linspace(37.0, 45.0, nbuckets).tolist()
    lonbuckets = np.linspace(-78.0, -70.0, nbuckets).tolist()
    b_plat = tf.feature_column.bucketized_column(plat, latbuckets)
    b_dlat = tf.feature_column.bucketized_column(dlat, latbuckets)
    b_plon = tf.feature_column.bucketized_column(plon, lonbuckets)
    b_dlon = tf.feature_column.bucketized_column(dlon, lonbuckets)

    # Feature cross
    ploc = tf.feature_column.crossed_column([b_plat, b_plon], nbuckets * nbuckets)
    dloc = tf.feature_column.crossed_column([b_dlat, b_dlon], nbuckets * nbuckets)
    pd_pair = tf.feature_column.crossed_column([ploc, dloc], nbuckets ** 4 )
    day_hr =  tf.feature_column.crossed_column([dayofweek, hourofday], 24 * 7)

    # Wide columns and deep columns.
    wide_columns = [
        # Feature crosses
        dloc, ploc, pd_pair,
        day_hr,

        # Sparse columns
        dayofweek, hourofday,

        # Anything with a linear relationship
        pcount 
    ]

    deep_columns = [
        # Embedding_column to "group" together ...
        tf.feature_column.embedding_column(pd_pair, 10),
        tf.feature_column.embedding_column(day_hr, 10),

        # Numeric columns
        plat, plon, dlat, dlon,
        latdiff, londiff, euclidean
    ]
    
    ## setting the checkpoint interval to be much lower for this task
    run_config = tf.estimator.RunConfig(save_checkpoints_secs = 30, 
                                        keep_checkpoint_max = 3)
    estimator = tf.estimator.DNNLinearCombinedRegressor(
        model_dir = model_dir,
        linear_feature_columns = wide_columns,
        dnn_feature_columns = deep_columns,
        dnn_hidden_units = hidden_units,
        config = run_config)

    # add extra evaluation metric for hyperparameter tuning
    estimator = tf.contrib.estimator.add_metrics(estimator, add_eval_metrics)
    return estimator
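
To see what the bucketization does to a coordinate, the sketch below reproduces the boundaries with NumPy and looks up bucket indices with `np.digitize`, whose default convention (`boundaries[i-1] <= value < boundaries[i]`) matches the half-open buckets of a bucketized column. `bucket_index` is a hypothetical helper, and `nbuckets = 10` is the value used later in this notebook.

```python
import numpy as np

nbuckets = 10
latbuckets = np.linspace(37.0, 45.0, nbuckets)
lonbuckets = np.linspace(-78.0, -70.0, nbuckets)

def bucket_index(value, boundaries):
    """Index of the half-open bucket containing value.
    0 means below the first boundary, len(boundaries) means at or above the last."""
    return int(np.digitize(value, boundaries))

# Pickup and dropoff of the first row shown in df_train.head()
pickup = (40.732774, -73.985751)
dropoff = (40.560941, -73.916322)

print('bucket width (degrees):', latbuckets[1] - latbuckets[0])
print('pickup cell :', bucket_index(pickup[0], latbuckets), bucket_index(pickup[1], lonbuckets))
print('dropoff cell:', bucket_index(dropoff[0], latbuckets), bucket_index(dropoff[1], lonbuckets))
```

With 10 boundaries spread over 8 degrees, each bucket is roughly 0.89 degrees wide. Both points of this sample trip therefore land in the same latitude and longitude bucket, so a larger `nbuckets` may be worth trying if finer spatial resolution is wanted.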

## Create feature engineering function that will be used in the input and serving input functions

In [9]:
def add_engineered(features):
    # this is how you can do feature engineering in TensorFlow
    lat1 = features['pickuplat']
    lat2 = features['dropofflat']
    lon1 = features['pickuplon']
    lon2 = features['dropofflon']
    latdiff = (lat1 - lat2)
    londiff = (lon1 - lon2)
    
    # set features for distance with sign that indicates direction
    features['latdiff'] = latdiff
    features['londiff'] = londiff
    dist = tf.sqrt(latdiff * latdiff + londiff * londiff)
    features['euclidean'] = dist
    return features
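
The same arithmetic on plain Python floats, applied to one row from the sample above, shows the size of the engineered values. `engineered` is a hypothetical stand-in for `add_engineered` that works on scalars instead of tensors.

```python
import math

def engineered(pickuplat, pickuplon, dropofflat, dropofflon):
    """Signed coordinate differences and their Euclidean norm, all in degrees."""
    latdiff = pickuplat - dropofflat
    londiff = pickuplon - dropofflon
    return {
        'latdiff': latdiff,
        'londiff': londiff,
        'euclidean': math.sqrt(latdiff * latdiff + londiff * londiff),
    }

# Row 1 of df_train.head(): fare_amount 11.0
feats = engineered(40.717472, -73.957148, 40.696392, -73.960302)
print(feats)
```

The `euclidean` feature is measured in degrees, not kilometres. At New York's latitude a degree of longitude covers only about three quarters of the ground distance of a degree of latitude, so this is a proxy for trip length and not a true distance.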

## Create serving input function to be able to serve predictions

In [10]:
def serving_input_fn():
    feature_placeholders = {
        # All the real-valued columns
        column.name: tf.placeholder(tf.float32, [None]) for column in INPUT_COLUMNS[2:7]
    }
    feature_placeholders['dayofweek'] = tf.placeholder(tf.string, [None])
    feature_placeholders['hourofday'] = tf.placeholder(tf.int32, [None])

    features = add_engineered(feature_placeholders.copy())
    return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)
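
The serving function expects the seven raw features only; the engineered ones are computed inside the graph. A sketch of building one request instance follows. `make_instance` is a hypothetical helper, and serializing instances as JSON lines is an assumption about how the exported model would be called, which this notebook does not show.

```python
import json

DAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']

def make_instance(dayofweek, hourofday, pickuplat, pickuplon,
                  dropofflat, dropofflon, passengers):
    """Build one prediction instance with the types the placeholders expect."""
    if dayofweek not in DAYS:
        raise ValueError('unknown day of week: %r' % dayofweek)
    if not 0 <= hourofday < 24:
        raise ValueError('hourofday must be in [0, 24)')
    return {
        'dayofweek': dayofweek,            # tf.string placeholder
        'hourofday': int(hourofday),       # tf.int32 placeholder
        'pickuplat': float(pickuplat),     # the remaining five are tf.float32
        'pickuplon': float(pickuplon),
        'dropofflat': float(dropofflat),
        'dropofflon': float(dropofflon),
        'passengers': float(passengers),
    }

instance = make_instance('Fri', 0, 40.717472, -73.957148, 40.696392, -73.960302, 1)
line = json.dumps(instance)
print(line)
```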

## Create input function to load data into datasets

In [11]:
def read_dataset(filename, mode, batch_size = 512):
    def _input_fn():
        def decode_csv(value_column):
            columns = tf.decode_csv(value_column, record_defaults = DEFAULTS)
            features = dict(zip(CSV_COLUMNS, columns))
            label = features.pop(LABEL_COLUMN)
            return add_engineered(features), label
        
        # Create list of files that match pattern
        file_list = tf.gfile.Glob(filename)

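        # Create dataset from file list, skipping the header row of every file
        # (a single skip(1) on the combined dataset would drop only the first file's header)
        dataset = tf.data.Dataset.from_tensor_slices(file_list).flat_map(
            lambda path: tf.data.TextLineDataset(path).skip(1)).map(decode_csv)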

        if mode == tf.estimator.ModeKeys.TRAIN:
            num_epochs = None # indefinitely
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)
        else:
            num_epochs = 1 # end-of-input after this

        dataset = dataset.repeat(num_epochs).batch(batch_size)
        batch_features, batch_labels = dataset.make_one_shot_iterator().get_next()
        return batch_features, batch_labels
    return _input_fn
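
A point worth checking when a glob pattern matches several CSV files: skipping one line of the combined stream removes only a single line, whereas skipping inside each file removes every header. The plain-Python sketch below illustrates the difference with two made-up in-memory "files"; whether it matters here depends on whether every split file carries its own header row.

```python
import itertools

# Two made-up files, each starting with a header line.
file_a = ['key,fare_amount', 'a1,5.0', 'a2,6.0']
file_b = ['key,fare_amount', 'b1,7.0']

# Skip once after concatenation: the second file's header survives.
concatenated = list(itertools.islice(itertools.chain(file_a, file_b), 1, None))

# Skip the first line of each file before concatenating: no header survives.
per_file = list(itertools.chain.from_iterable(
    itertools.islice(f, 1, None) for f in (file_a, file_b)))

print(concatenated)
print(per_file)
```

A header line that reaches the CSV decoder cannot be parsed as a float fare, so it would surface as a decoding error during training.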

## Create estimator train and evaluate function

In [12]:
def train_and_evaluate(args):
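    # hidden_units arrives as a space-separated string such as "128 32 4"; convert it to a list of ints
    hidden_units = [int(n) for n in args['hidden_units'].split(' ')]
    estimator = build_estimator(args['output_dir'], args['nbuckets'], hidden_units)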
    train_spec = tf.estimator.TrainSpec(
        input_fn = read_dataset(
            filename = args['train_data_paths'],
            mode = tf.estimator.ModeKeys.TRAIN,
            batch_size = args['train_batch_size']),
        max_steps = args['train_steps'])
    exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
    eval_spec = tf.estimator.EvalSpec(
        input_fn = read_dataset(
            filename = args['eval_data_paths'],
            mode = tf.estimator.ModeKeys.EVAL,
            batch_size = args['eval_batch_size']),
        steps = 100,
        exporters = exporter)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

## Run the training job

In [15]:
# OUTPUTDIR = os.path.join(BASE_PATH, r'../ML_Model/')
OUTPUTDIR = r'C:\Users\mistr\source\repos\rrmistry\kaggle\NY_Taxi_Cab\ML_Model'

if os.path.exists(OUTPUTDIR):
    shutil.rmtree(OUTPUTDIR)

with tf.Session() as sess:
    
    arguments = {
        "output_dir": OUTPUTDIR,
        "train_data_paths": os.path.join(BASE_PATH, r'..\input\train_split\train-*.csv'),
        "eval_data_paths": os.path.join(BASE_PATH, r'..\input\train_split\valid-*.csv'),
        "train_batch_size": 512,
        "eval_batch_size": 512,
        "train_steps": 5000,
        "eval_steps": 10,
        "nbuckets": 10,
        "hidden_units": "128 32 4",
        "eval_delay_secs": 10,
        "min_eval_frequency": 1,
        "format": "csv"
    }
    
    # Run the training job:
    try:
        train_and_evaluate(arguments)
    except Exception:  # report the error but still let KeyboardInterrupt stop the run
        traceback.print_exc()

INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\mistr\\source\\repos\\rrmistry\\kaggle\\NY_Taxi_Cab\\ML_Model', '_log_step_count_steps': 100, '_service': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_is_chief': True, '_save_checkpoints_secs': 30, '_keep_checkpoint_max': 3, '_task_id': 0, '_tf_random_seed': None, '_train_distribute': None, '_num_ps_replicas': 0, '_device_fn': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000015F96251748>, '_num_worker_replicas': 1, '_master': '', '_session_config': None, '_task_type': 'worker'}

INFO:tensorflow:Saving checkpoints for 2003 into C:\Users\mistr\source\repos\rrmistry\kaggle\NY_Taxi_Cab\ML_Model\model.ckpt.
INFO:tensorflow:Skip the current checkpoint eval due to throttle secs (600 secs).
INFO:tensorflow:global_step/sec: 13.7323
INFO:tensorflow:loss = 37262.105, step = 2100 (7.282 sec)
INFO:tensorflow:global_step/sec: 14.5603
INFO:tensorflow:loss = 45862.25, step = 2200 (6.868 sec)
INFO:tensorflow:global_step/sec: 14.9254
INFO:tensorflow:loss = 31499.559, step = 2300 (6.700 sec)
INFO:tensorflow:global_step/sec: 14.8302
INFO:tensorflow:loss = 48031.004, step = 2400 (6.743 sec)
INFO:tensorflow:Saving checkpoints for 2439 into C:\Users\mistr\source\repos\rrmistry\kaggle\NY_Taxi_Cab\ML_Model\model.ckpt.
INFO:tensorflow:Skip the current checkpoint eval due to throttle secs (600 secs).
INFO:tensorflow:global_step/sec: 14.4676
INFO:tensorflow:loss = 42913.484, step = 2500 (6.911 sec)
INFO:tensorflow:global_step/sec: 15.2045
INFO:tensorflow:loss = 49715.754, step = 2600 (6.