# Predict NYC Taxi Tips 
This notebook ingests and prepares data, then trains a model on an Azure Open Dataset that records NYC Yellow Taxi trips and their attributes. The goal is to predict, for a given trip, whether the passenger will leave a tip. The model is then converted to ONNX format and tracked with MLflow.
We will later use the ONNX model for inferencing in an Azure Synapse SQL pool using the new model scoring wizard.
## Note:
**Successful conversion to ONNX requires scikit-learn version 0.20.3.**
Run the first cell to list the installed packages and check your scikit-learn version. If it differs, uncomment the pip install command to install the required version:

%pip install scikit-learn==0.20.3



## Load data
Get a sample of NYC Yellow Taxi trip data from Azure Open Datasets.

In [16]:
#%pip list
#%pip install scikit-learn==0.20.3

Package                               Version
------------------------------------- -------------------
absl-py                               0.10.0
adal                                  1.2.4
aiohttp                               3.6.2
aioredis                              1.3.1
alembic                               1.4.2
ansiwrap                              0.8.4
antlr4-python3-runtime                4.7.2
applicationinsights                   0.11.9
argcomplete                           1.12.0
argon2-cffi                           20.1.0
astor                                 0.8.1
astroid                               2.4.2
async-timeout                         3.0.1
atari-py                              0.2.6
attrs                                 20.1.0
autopep8                              1.5.4
azure-batch                           9.0.0
azure-cli                             2.10.1
azure-cli-command-modules-nspkg       2.0.3
azure-cli-core                        2.10.1
azure-cli

In [17]:
from azureml.opendatasets import NycTlcYellow
from datetime import datetime
from dateutil import parser

start_date = parser.parse('2018-05-01')
end_date = parser.parse('2018-05-07')
nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_pandas_dataframe()
nyc_tlc_df.info()

[Info] read from /tmp/tmp_6hw7482/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-118.c000.snappy.parquet
[Info] read from /tmp/tmp_6hw7482/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00001-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426336-117.c000.snappy.parquet
[Info] read from /tmp/tmp_6hw7482/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00002-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426334-119.c000.snappy.parquet
[Info] read from /tmp/tmp_6hw7482/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00003-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426340-115.c000.snappy.parquet
[Info] read from /tmp/tmp_6hw7482/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/yellow/puYear=2018/puMonth=5/part-00004-ti

In [18]:
from IPython.display import display

sampled_df = nyc_tlc_df.sample(n=10000, random_state=123)
display(sampled_df.head(5))

| | vendorID | tpepPickupDateTime | tpepDropoffDateTime | passengerCount | tripDistance | puLocationId | doLocationId | startLon | startLat | endLon | ... | paymentType | fareAmount | extra | mtaTax | improvementSurcharge | tipAmount | tollsAmount | totalAmount | puYear | puMonth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 87213 | 2 | 2018-05-05 19:07:01 | 2018-05-05 19:28:44 | 1 | 3.95 | 164 | 112 | | | | ... | 2 | 17.0 | 0.0 | 0.5 | 0.3 | 0.0 | 5.76 | 23.56 | 2018 | 5 |
| 145405 | 2 | 2018-05-05 22:46:06 | 2018-05-05 22:59:11 | 1 | 1.22 | 264 | 264 | | | | ... | 1 | 9.5 | 0.5 | 0.5 | 0.3 | 2.16 | 0.0 | 12.96 | 2018 | 5 |
| 457648 | 1 | 2018-05-06 18:53:06 | 2018-05-06 19:06:31 | 1 | 2.2 | 246 | 162 | | | | ... | 2 | 11.0 | 0.0 | 0.5 | 0.3 | 0.0 | 0.0 | 11.8 | 2018 | 5 |
| 369051 | 2 | 2018-05-02 09:25:13 | 2018-05-02 09:36:32 | 1 | 0.89 | 161 | 162 | | | | ... | 1 | 8.0 | 0.0 | 0.5 | 0.3 | 1.76 | 0.0 | 10.56 | 2018 | 5 |
| 38871 | 2 | 2018-05-04 02:58:10 | 2018-05-04 03:01:10 | 3 | 0.45 | 79 | 4 | | | | ... | 1 | 4.0 | 0.5 | 0.5 | 0.3 | 1.32 | 0.0 | 6.62 | 2018 | 5 |


## Prepare and featurize data
- The dataset contains columns that are not useful to the model. We copy only the columns we need into the featurized dataframe.
- The data also contains outliers, which we filter out.

In [19]:
import numpy
import pandas

def get_pickup_time(row):
    """Map a row's pickup hour (0-23) to a coarse time-of-day bin."""
    pickup_hour = row['pickupHour']
    if 7 <= pickup_hour <= 10:
        return 'AMRush'
    elif 11 <= pickup_hour <= 15:
        return 'Afternoon'
    elif 16 <= pickup_hour <= 19:
        return 'PMRush'
    else:
        return 'Night'

featurized_df = pandas.DataFrame()
featurized_df['tipped'] = (sampled_df['tipAmount'] > 0).astype('int')
featurized_df['fareAmount'] = sampled_df['fareAmount'].astype('float32')
featurized_df['paymentType'] = sampled_df['paymentType'].astype('int')
featurized_df['passengerCount'] = sampled_df['passengerCount'].astype('int')
featurized_df['tripDistance'] = sampled_df['tripDistance'].astype('float32')
featurized_df['pickupHour'] = sampled_df['tpepPickupDateTime'].dt.hour.astype('int')
featurized_df['tripTimeSecs'] = ((sampled_df['tpepDropoffDateTime'] - sampled_df['tpepPickupDateTime']) / numpy.timedelta64(1, 's')).astype('int')

featurized_df['pickupTimeBin'] = featurized_df.apply(get_pickup_time, axis=1)
featurized_df = featurized_df.drop(columns='pickupHour')

display(featurized_df.head(5))


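| | tipped | fareAmount | paymentType | passengerCount | tripDistance | tripTimeSecs | pickupTimeBin |
|---|---|---|---|---|---|---|---|
| 87213 | 0 | 17.0 | 2 | 1 | 3.95 | 1303 | PMRush |
| 145405 | 1 | 9.5 | 1 | 1 | 1.22 | 785 | Night |
| 457648 | 0 | 11.0 | 2 | 1 | 2.2 | 805 | PMRush |
| 369051 | 1 | 8.0 | 1 | 1 | 0.89 | 679 | AMRush |
| 38871 | 1 | 4.0 | 1 | 3 | 0.45 | 180 | Night |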
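
The row-wise `apply` used above is easy to read but slow on large frames. As a minimal sketch of a vectorized alternative, assuming the same hour boundaries, the binning could be expressed with `numpy.select`. The helper name `bin_pickup_hour` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def bin_pickup_hour(hours):
    """Vectorized equivalent of get_pickup_time (hypothetical helper)."""
    hours = pd.Series(hours)
    # Conditions are checked in order; hours matching none fall back to 'Night'.
    conditions = [
        hours.between(7, 10),   # inclusive on both ends
        hours.between(11, 15),
        hours.between(16, 19),
    ]
    choices = ['AMRush', 'Afternoon', 'PMRush']
    binned = np.select(conditions, choices, default='Night')
    return pd.Series(binned, index=hours.index)

# Pickup hours of the five sample rows shown earlier.
example = bin_pickup_hour([19, 22, 18, 9, 2])
print(example.tolist())
```

For the five sample rows this yields the same bins as the table above: PMRush, Night, PMRush, AMRush, Night.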


In [20]:
filtered_df = featurized_df[(featurized_df.tipped >= 0) & (featurized_df.tipped <= 1)
    & (featurized_df.fareAmount >= 1) & (featurized_df.fareAmount <= 250)
    & (featurized_df.paymentType >= 1) & (featurized_df.paymentType <= 2)
    & (featurized_df.passengerCount > 0) & (featurized_df.passengerCount < 8)
    & (featurized_df.tripDistance >= 0) & (featurized_df.tripDistance <= 100)
    & (featurized_df.tripTimeSecs >= 30) & (featurized_df.tripTimeSecs <= 7200)]

filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9776 entries, 87213 to 333274
Data columns (total 7 columns):
tipped            9776 non-null int64
fareAmount        9776 non-null float32
paymentType       9776 non-null int64
passengerCount    9776 non-null int64
tripDistance      9776 non-null float32
tripTimeSecs      9776 non-null int64
pickupTimeBin     9776 non-null object
dtypes: float32(2), int64(4), object(1)
memory usage: 534.6+ KB
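
The filter reduces the 10,000 sampled rows to 9,776, but it does not show which rule rejects how many rows. The following is a rough sketch of one way to inspect that, using a small made-up frame (`toy_df`) rather than the real data. The range limits mirror the filter above. The `rules` dictionary and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the taxi dataset.
toy_df = pd.DataFrame({
    'fareAmount': [17.0, 0.5, 300.0, 8.0],
    'paymentType': [2, 1, 1, 3],
    'passengerCount': [1, 1, 0, 2],
    'tripDistance': [3.95, 1.2, 2.0, 0.9],
    'tripTimeSecs': [1303, 10, 900, 679],
})

# One boolean mask per rule; between() is inclusive on both ends.
rules = {
    'fareAmount': toy_df['fareAmount'].between(1, 250),
    'paymentType': toy_df['paymentType'].between(1, 2),
    'passengerCount': toy_df['passengerCount'].between(1, 7),
    'tripDistance': toy_df['tripDistance'].between(0, 100),
    'tripTimeSecs': toy_df['tripTimeSecs'].between(30, 7200),
}

# Count how many rows each rule rejects on its own.
rejected_per_rule = {name: int((~mask).sum()) for name, mask in rules.items()}

# A row is kept only if it passes every rule.
keep_mask = pd.concat(rules, axis=1).all(axis=1)
kept_df = toy_df[keep_mask]

print(rejected_per_rule)
print(len(kept_df), 'of', len(toy_df), 'rows kept')
```

Keeping the masks in a dictionary makes it straightforward to see which threshold is responsible for most of the dropped rows before settling on the limits.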


## Split training and testing data sets
- 70% of the data is used to train the model.
- 30% of the data is used to test the model.

In [22]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(filtered_df, test_size=0.3, random_state=123)

# Features: every column except the label
x_train = train_df.drop(columns=['tipped'])
# Label: kept as a one-column dataframe
y_train = train_df[['tipped']]

x_test = test_df.drop(columns=['tipped'])
y_test = test_df[['tipped']]
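
The split above is purely random. If the share of tipped trips should be identical in both sets, `train_test_split` also accepts a `stratify` argument. The notebook does not use it. The sketch below only illustrates the option on a made-up frame with a 70/30 class mix.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 70 tipped trips and 30 untipped trips.
toy_df = pd.DataFrame({
    'fareAmount': [float(i) for i in range(100)],
    'tipped': [1] * 70 + [0] * 30,
})

# stratify keeps the class proportions the same in both splits.
train_toy, test_toy = train_test_split(
    toy_df, test_size=0.3, random_state=123, stratify=toy_df['tipped'])

print(len(train_toy), int(train_toy['tipped'].sum()))
print(len(test_toy), int(test_toy['tipped'].sum()))
```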

## Export test data as CSV
Export the test data as a CSV file. Later, we will load the CSV file into Synapse SQL pool to test the model.

In [6]:
test_df.to_csv('test_data.csv', index=False)

## Train model
Train a binary classifier to predict whether a taxi trip will be tipped or not.

In [24]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

float_features = ['fareAmount', 'tripDistance']
float_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

integer_features = ['paymentType', 'passengerCount', 'tripTimeSecs']
integer_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['pickupTimeBin']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('float', float_transformer, float_features),
        ('integer', integer_transformer, integer_features),
        ('cat', categorical_transformer, categorical_features)
    ])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

# Train the model (flatten the one-column label dataframe to the 1-D array scikit-learn expects)
clf.fit(x_train, y_train.values.ravel())



Pipeline(memory=None,
     steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('float', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', ver...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])
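
The pipeline sets `handle_unknown='ignore'` on the one-hot encoder. As a small illustration of what that setting does, the sketch below fits an encoder on the four time bins and then transforms a category it has never seen. The value `'Midnight'` is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
bins = np.array([['AMRush'], ['Afternoon'], ['PMRush'], ['Night']], dtype=object)
encoder.fit(bins)

# Categories are stored in sorted order: AMRush, Afternoon, Night, PMRush.
known = encoder.transform(np.array([['Night']], dtype=object)).toarray()

# An unseen category is encoded as all zeros instead of raising an error.
unknown = encoder.transform(np.array([['Midnight']], dtype=object)).toarray()

print(known)
print(unknown)
```

With this setting, an unexpected `pickupTimeBin` value at scoring time contributes nothing to the prediction rather than failing the whole request.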

In [25]:
# Evaluate the model (mean accuracy on the test set)
score = clf.score(x_test, y_test)
print(score)

0.9672690078418003
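
`clf.score` reports mean accuracy only. A single accuracy figure can hide how errors are distributed between the two classes, so it may help to compare it with a majority-class baseline and to look at a confusion matrix. The sketch below uses made-up labels and predictions, not the notebook's results. With the real model, `y_test` and `clf.predict(x_test)` would take their place.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 1])

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Accuracy of always predicting the most frequent class.
positive_rate = y_true.mean()
majority_baseline = max(positive_rate, 1 - positive_rate)

print(accuracy)
print(matrix)
print(majority_baseline)
```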


## Convert the model to ONNX format
Currently, T-SQL scoring supports only the ONNX model format (https://onnx.ai/).

In [9]:
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType, Int64TensorType, DoubleTensorType, StringTensorType

def convert_dataframe_schema(df, drop=None):
    inputs = []
    for k, v in zip(df.columns, df.dtypes):
        if drop is not None and k in drop:
            continue
        if v == 'int64':
            t = Int64TensorType([1, 1])
        elif v == 'float32':
            t = FloatTensorType([1, 1])
        elif v == 'float64':
            t = DoubleTensorType([1, 1])
        else:
            t = StringTensorType([1, 1])
        inputs.append((k, t))
    return inputs

model_inputs = convert_dataframe_schema(x_train)
onnx_model = convert_sklearn(clf, "nyc_taxi_tip_predict", model_inputs)
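
To make the schema mapping easier to follow, the sketch below reproduces the dtype rules of `convert_dataframe_schema` without importing skl2onnx, returning the tensor type names as plain strings. The helper `describe_onnx_inputs` and the one-row frame are hypothetical. The column dtypes are set explicitly to match the `filtered_df.info()` output shown earlier.

```python
import pandas as pd

def describe_onnx_inputs(df):
    """Return (column, tensor type name) pairs using the same dtype rules
    as convert_dataframe_schema (hypothetical helper, names as strings)."""
    type_names = {
        'int64': 'Int64TensorType',
        'float32': 'FloatTensorType',
        'float64': 'DoubleTensorType',
    }
    # Any other dtype (for example object) is treated as a string input.
    return [(name, type_names.get(str(dtype), 'StringTensorType'))
            for name, dtype in zip(df.columns, df.dtypes)]

# One made-up row with the feature columns and dtypes of x_train.
toy_x = pd.DataFrame({
    'fareAmount': pd.Series([17.0], dtype='float32'),
    'paymentType': pd.Series([2], dtype='int64'),
    'passengerCount': pd.Series([1], dtype='int64'),
    'tripDistance': pd.Series([3.95], dtype='float32'),
    'tripTimeSecs': pd.Series([1303], dtype='int64'),
    'pickupTimeBin': pd.Series(['PMRush'], dtype='object'),
})

schema = describe_onnx_inputs(toy_x)
for column, tensor_type in schema:
    print(column, tensor_type)
```

Each dataframe column becomes a separate named model input, so the data used for scoring later needs matching column names and compatible types.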



## Register the model with MLflow

In [10]:
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code A5MG8N367 to authenticate.
Interactive authentication successfully completed.
Nellies_AML_ws
nellies_aml_ws_rg
eastus
58f8824d-32b0-4825-9825-02fa6a801546


In [11]:
import mlflow
import mlflow.onnx

from mlflow.models.signature import infer_signature

experiment_name = 'nyc_taxi_tip_predict_exp'
artifact_path = 'nyc_taxi_tip_predict_artifact'

mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment(experiment_name)

with mlflow.start_run() as run:
    # Infer signature
    input_sample = x_train.head(1)
    output_sample = pandas.DataFrame(columns=['output_label'], data=[1])
    signature = infer_signature(input_sample, output_sample)

    # Log the ONNX model as an artifact of this run
    mlflow.onnx.log_model(onnx_model, artifact_path, signature=signature, input_example=input_sample)

    # Register the model to AML model registry
    mlflow.register_model('runs:/' + run.info.run_id + '/' + artifact_path, 'nyc_taxi_tip_predict')


Successfully registered model 'nyc_taxi_tip_predict'.
Created version '1' of model 'nyc_taxi_tip_predict'.
