# BlazingSQL + cuML NYC Taxi Cab Fare Prediction

This demo uses publicly available [NYC Taxi Cab Data](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) to predict the total fare of a taxi ride in New York City given the pickup and dropoff locations.

In this notebook, we will cover: 
- How to read and query csv files with cuDF and BlazingSQL.
- How to implement a linear regression model with cuML.

#### BlazingSQL install check
The next cell checks whether BlazingSQL is installed and offers to install it if it is not, so that the rest of the notebook runs as expected.

In [1]:
import sys 
# add notebooks-contrib/utils to the import path
sys.path.append('../../../utils/')
from sql_check import bsql_start
# check that BlazingSQL is installed
bsql_start()

"You've got BlazingSQL set up perfectly! Let's get started with SQL in RAPIDS AI!"

## Imports

In [2]:
import cudf
from cuml import LinearRegression
from blazingsql import BlazingContext

## Create BlazingContext
You can think of the BlazingContext much like a SparkContext: it stores information such as the FileSystems you have registered and the Tables you have created. If this cell fails, restart the runtime and run it again.

In [3]:
bc = BlazingContext()

BlazingContext ready


### Download Data
For this demo we will train our model with 20,000,000 rows of data from 4 CSV files (5,000,000 rows each). 

The cell below will download them from AWS to the main `notebooks-contrib/data/blazingsql/` folder for you.

In [4]:
import os
import urllib.request

# relative path to data folder
data_dir = '../../../data/blazingsql/'
# does folder exist?
if not os.path.exists(data_dir):
    print('creating blazingsql directory')
    # create the folder at data_dir itself so it matches the download path
    os.makedirs(data_dir)

In [5]:
# download taxi data
base_url = 'https://blazingsql-colab.s3.amazonaws.com/taxi_data/'
# thanks to Taurean Dyer
# fetch taxi_00.csv through taxi_03.csv, skipping any file already on disk
for i in range(0, 4):
    fn = 'taxi_0' + str(i) + '.csv'
    if not os.path.isfile(data_dir + fn):
        print(f'Downloading {base_url + fn} to {data_dir + fn}')
        urllib.request.urlretrieve(base_url + fn, data_dir + fn)

Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_00.csv to ../../../data/blazingsql/taxi_00.csv
Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_01.csv to ../../../data/blazingsql/taxi_01.csv
Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_02.csv to ../../../data/blazingsql/taxi_02.csv
Downloading https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_03.csv to ../../../data/blazingsql/taxi_03.csv
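The check-then-download logic above can be sketched without touching the network. This is a minimal illustration, not part of the original notebook: `pending_downloads` is a hypothetical helper that creates the data folder if needed and lists which of the four taxi CSVs still have to be fetched. The demo runs against a temporary directory, so nothing is actually downloaded.

```python
import os
import tempfile

BASE_URL = 'https://blazingsql-colab.s3.amazonaws.com/taxi_data/'

def pending_downloads(base_url, data_dir, n_files=4):
    """Return (url, destination) pairs for taxi CSVs not yet present in data_dir."""
    # exist_ok=True makes directory creation safe to repeat
    os.makedirs(data_dir, exist_ok=True)
    pending = []
    for i in range(n_files):
        fn = f'taxi_{i:02d}.csv'
        dest = os.path.join(data_dir, fn)
        if not os.path.isfile(dest):
            pending.append((base_url + fn, dest))
    return pending

# demo against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, 'blazingsql')
    first = pending_downloads(BASE_URL, demo_dir)
    # pretend taxi_00.csv has arrived, then ask again
    open(first[0][1], 'w').close()
    second = pending_downloads(BASE_URL, demo_dir)
```

Each pair in the returned list could then be passed to `urllib.request.urlretrieve(url, dest)`, as the cell above does.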


## Extract, transform, load
To train our linear regression model, we must first perform ETL to prepare our data.

### ETL: Read and Concatenate CSVs

In [6]:
# set column names and types
col_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 
                'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
col_types = ['date64', 'float32', 'float32', 'float32', 
                'float32', 'float32', 'float32']

# load first csv 
gdf_00 = cudf.read_csv(data_dir+'taxi_00.csv', names=col_names, dtype=col_types)
# load second csv
gdf_01 = cudf.read_csv(data_dir+'taxi_01.csv', names=col_names, dtype=col_types)
# load third csv
gdf_02 = cudf.read_csv(data_dir+'taxi_02.csv', names=col_names, dtype=col_types)
# load fourth csv
gdf_03 = cudf.read_csv(data_dir+'taxi_03.csv', names=col_names, dtype=col_types)

# combine all those dataframes into one master dataframe
gdf = cudf.concat([gdf_00, gdf_01, gdf_02, gdf_03])

# what's it look like?
gdf.head()

| | key | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|
| 0 | 2012-02-02 22:30:19.002 | 8.9 | -73.988708 | 40.758804 | -73.986519 | 40.737202 | 1.0 |
| 1 | 2014-09-20 07:19:24.001 | 4.0 | -73.990204 | 40.746708 | -73.994728 | 40.750515 | 1.0 |
| 2 | 2013-02-23 07:18:05.001 | 5.5 | -74.016762 | 40.709438 | -74.009003 | 40.719498 | 3.0 |
| 3 | 2015-04-18 23:49:27.009 | 13.5 | -74.002708 | 40.73373 | -73.986099 | 40.734776 | 1.0 |
| 4 | 2010-03-04 08:15:59.001 | 10.5 | -73.988365 | 40.737663 | -74.012459 | 40.713932 | 1.0 |


In [7]:
gdf.describe()

| | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|
| count | 20000000.0 | 20000000.0 | 20000000.0 | 19999870.0 | 19999870.0 | 20000000.0 |
| mean | 11.34995 | -72.50857 | 39.92013 | -72.51212 | 39.9217 | 1.685199 |
| std | 28.78444 | 12.73661 | 9.756748 | 12.73016 | 9.628169 | 1.334611 |
| min | -176.0 | -3442.06 | -3475.482 | -3440.697 | -3475.482 | 0.0 |
| 25% | 6.0 | -73.99206 | 40.73494 | -73.99139 | 40.73404 | 1.0 |
| 50% | 8.5 | -73.9818 | 40.75267 | -73.98015 | 40.75317 | 1.0 |
| 75% | 12.5 | -73.96707 | 40.76714 | -73.96367 | 40.76811 | 2.0 |
| max | 93963.36 | 3456.222 | 3408.79 | 3456.222 | 3537.133 | 208.0 |


### ETL: Create Table

In [8]:
%time
# make a table from the combined df
bc.create_table('train_taxi', gdf, column_names=col_names)

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 6.2 µs


<pyblazing.apiv2.context.BlazingTable at 0x7f186c3385f8>

### ETL: Query Tables for Training Data

In [9]:
# extract time features, longitude/latitude differences, and passenger count (all floats)
query = '''
        select 
            cast(hour(key) as float) hours, 
            cast(dayofmonth(key) as float) days, 
            cast(month(key) as float) months, 
            cast(year(key) - 2000 as float) years,  
            cast(dropoff_longitude - pickup_longitude as float) longitude_distance, 
            cast(dropoff_latitude - pickup_latitude as float) latitude_distance, 
            cast(passenger_count as float) passenger_count
        from 
            train_taxi
            '''

# run query on table (returns cuDF DataFrame)
X_train = bc.sql(query)

# fill null values 
X_train['longitude_distance'] = X_train['longitude_distance'].fillna(0)
X_train['latitude_distance'] = X_train['latitude_distance'].fillna(0)
X_train['passenger_count'] = X_train['passenger_count'].fillna(0)

# how's it look? 
X_train.head()

| | hours | days | months | years | longitude_distance | latitude_distance | passenger_count |
|---|---|---|---|---|---|---|---|
| 0 | 22.0 | 2.0 | 2.0 | 12.0 | 0.00219 | -0.021603 | 1.0 |
| 1 | 7.0 | 20.0 | 9.0 | 14.0 | -0.004524 | 0.003807 | 1.0 |
| 2 | 7.0 | 23.0 | 2.0 | 13.0 | 0.007759 | 0.010059 | 3.0 |
| 3 | 23.0 | 18.0 | 4.0 | 15.0 | 0.016609 | 0.001045 | 1.0 |
| 4 | 8.0 | 4.0 | 3.0 | 10.0 | -0.024094 | -0.023731 | 1.0 |
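To make the feature query concrete, here is a plain-Python sketch of the same per-ride transformation, run on the CPU for a single row. The `taxi_features` helper is a hypothetical name introduced only for illustration. It also mirrors the null handling: arithmetic on a missing coordinate yields a missing value in SQL, which the notebook then fills with 0.

```python
from datetime import datetime

def taxi_features(key, pickup_lon, pickup_lat, dropoff_lon, dropoff_lat, passengers):
    """Mirror the SQL feature query for one ride; missing values become 0.0."""
    def diff(a, b):
        # a missing operand gives a missing result, which fillna(0) turns into 0
        return 0.0 if a is None or b is None else float(a - b)

    return {
        'hours': float(key.hour),
        'days': float(key.day),
        'months': float(key.month),
        'years': float(key.year - 2000),
        'longitude_distance': diff(dropoff_lon, pickup_lon),
        'latitude_distance': diff(dropoff_lat, pickup_lat),
        'passenger_count': 0.0 if passengers is None else float(passengers),
    }

# first row of the training data shown earlier
row0 = taxi_features(datetime(2012, 2, 2, 22, 30, 19),
                     -73.988708, 40.758804, -73.986519, 40.737202, 1.0)

# same ride with a missing dropoff location
row_missing = taxi_features(datetime(2012, 2, 2, 22, 30, 19),
                            -73.988708, 40.758804, None, None, 1.0)
```

The values in `row0` agree with the first row of `X_train` to within float32 rounding. Note that the two "distance" features are signed coordinate differences, not geographic distances.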


In [10]:
# query dependent variable y
y_train = bc.sql('SELECT fare_amount FROM train_taxi')

y_train.head()

| | fare_amount |
|---|---|
| 0 | 8.9 |
| 1 | 4.0 |
| 2 | 5.5 |
| 3 | 13.5 |
| 4 | 10.5 |


## Linear Regression
### LR: Train Model

In [11]:
%%time
# call & create cuML model
lr = LinearRegression(fit_intercept=True, normalize=False, algorithm="eig")

# train Linear Regression model 
reg = lr.fit(X_train, y_train)

# display results
print(f"Coefficients:\n{reg.coef_}\n")
print(f"Y intercept:\n{reg.intercept_}\n")

Coefficients:
0   -0.027290
1    0.003329
2    0.106803
3    0.637564
4    0.000871
5   -0.000516
6    0.092400
dtype: float32

Y intercept:
3.3568549156188965

CPU times: user 689 ms, sys: 590 ms, total: 1.28 s
Wall time: 1.28 s
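To see what `lr.fit` is solving, here is a minimal pure-Python sketch of ordinary least squares with an intercept. It is not cuML's implementation: as far as the cuML documentation describes it, `algorithm="eig"` works through an eigendecomposition on the GPU, whereas this sketch solves the same normal equations with Gauss-Jordan elimination on a handful of rows. The `fit_ols` helper and the toy data are assumptions made for illustration.

```python
def fit_ols(X, y):
    """Fit y ~ intercept + X . coef by solving the normal equations (A^T A) w = A^T y."""
    # prepend a constant 1.0 column so the first weight is the intercept
    A = [[1.0] + [float(v) for v in row] for row in X]
    n = len(A[0])
    # build the augmented matrix [A^T A | A^T y]
    M = [[sum(r[i] * r[j] for r in A) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(A, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    w = [M[i][n] / M[i][i] for i in range(n)]
    return w[0], w[1:]

# toy data generated from y = 1 + 2*x1 + 3*x2, so the fit should recover those numbers
X_toy = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y_toy = [1 + 2 * x1 + 3 * x2 for x1, x2 in X_toy]
intercept_toy, coef_toy = fit_ols(X_toy, y_toy)
```

Here `intercept_toy` plays the role of `reg.intercept_` and `coef_toy` the role of `reg.coef_`, with one coefficient per feature column.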


### LR: Use Model to Predict Future Taxi Fares 

#### Download Test Data
The cell below checks whether you already have the test data and downloads it if you don't.

In [12]:
# do we have Test taxi file?
if not os.path.isfile('../../../data/blazingsql/test.csv'):
    !wget -P ../../../data/blazingsql https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv

--2020-01-21 17:21:52--  https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv
Resolving blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)... 52.219.120.105
Connecting to blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)|52.219.120.105|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 982916 (960K) [text/csv]
Saving to: ‘../../../data/blazingsql/test.csv’


2020-01-21 17:21:53 (2.56 MB/s) - ‘../../../data/blazingsql/test.csv’ saved [982916/982916]



We are going to create this table directly from the CSV file. BlazingSQL requires the full path to the data for table creation, so this cell uses the `pwd` shell command to find the path to this directory, trims it back to the notebooks-contrib root, and appends the path to the test data inside the `data/` directory.

In [13]:
# identify path to this notebook, !pwd returns SList w/ path (str) at 0th index
path = !pwd
# keep everything up to the notebooks-contrib root
path = path[0].split('intermediate_notebooks')[0] 
# add path to data from there
path = path + 'data/blazingsql/' + 'test.csv'
# how's it look?
path

'/home/jupyter-winston/notebooks-contrib/data/blazingsql/test.csv'
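The same path can be built without a shell command. The sketch below is an illustrative alternative, not the notebook's own method: `resolve_test_csv` is a hypothetical helper, and the working directory passed to it is made up (only the `notebooks-contrib` root and the `intermediate_notebooks` folder name come from the cell above). In the notebook you would pass `os.getcwd()` instead.

```python
def resolve_test_csv(cwd, marker='intermediate_notebooks'):
    """Build the full path to test.csv from a notebook working directory."""
    # keep everything before the marker folder, i.e. the notebooks-contrib root
    root = cwd.split(marker)[0]
    # append the location of the test data relative to that root
    return root + 'data/blazingsql/' + 'test.csv'

# hypothetical working directory, three levels below notebooks-contrib
example_cwd = '/home/jupyter-winston/notebooks-contrib/intermediate_notebooks/examples/blazingsql'
example_path = resolve_test_csv(example_cwd)
```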

In [14]:
# set column names and types
col_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 
                'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
col_types = ['date64', 'float32', 'float32', 'float32', 'float32', 'float32', 'float32']

# create test table directly from CSV
bc.create_table('test_taxi', path, names=col_names, dtype=col_types)

<pyblazing.apiv2.context.BlazingTable at 0x7f186c3dc358>

In [15]:
# extract time features, longitude/latitude differences, and passenger count (all floats)
query = '''
        select 
            cast(hour(key) as float) hours, 
            cast(dayofmonth(key) as float) days, 
            cast(month(key) as float) months, 
            cast(year(key) - 2000 as float) years,  
            cast(dropoff_longitude - pickup_longitude as float) longitude_distance, 
            cast(dropoff_latitude - pickup_latitude as float) latitude_distance, 
            cast(passenger_count as float) passenger_count
        from 
            test_taxi
            '''

# run query on table (returns cuDF DataFrame)
X_test = bc.sql(query)

# fill null values 
X_test['longitude_distance'] = X_test['longitude_distance'].fillna(0)
X_test['latitude_distance'] = X_test['latitude_distance'].fillna(0)
X_test['passenger_count'] = X_test['passenger_count'].fillna(0)

# how's it look? 
X_test.head()

| | hours | days | months | years | longitude_distance | latitude_distance | passenger_count |
|---|---|---|---|---|---|---|---|
| 0 | 13.0 | 27.0 | 1.0 | 15.0 | -0.00811 | -0.01997 | 1.0 |
| 1 | 13.0 | 27.0 | 1.0 | 15.0 | -0.012024 | 0.019814 | 1.0 |
| 2 | 11.0 | 8.0 | 10.0 | 11.0 | 0.002869 | -0.005119 | 1.0 |
| 3 | 21.0 | 1.0 | 12.0 | 12.0 | -0.009277 | -0.016178 | 1.0 |
| 4 | 21.0 | 1.0 | 12.0 | 12.0 | -0.022537 | -0.045345 | 1.0 |


In [16]:
# predict fares 
predictions = lr.predict(X_test)

# display predictions
predictions

0       12.854630
1       12.854605
2       11.256927
3       11.811884
4       11.811888
5       11.811880
6       11.222965
7       11.222733
8       11.222973
9       12.239309
10      12.239325
11      12.239347
12       9.696036
13       9.696022
14      11.468582
15      11.468594
16      11.460928
17      11.460958
18      11.460936
19      11.460926
20      13.485119
21      12.707811
22      12.707788
23      12.707800
24      12.707800
25      12.707785
26      12.707952
27      12.707806
28      12.707804
29      12.707785
          ...    
9884    12.643631
9885    12.643671
9886    12.643652
9887    12.643633
9888    12.643650
9889    12.643656
9890    12.643648
9891    12.643673
9892    12.643652
9893    12.643667
9894    12.643648
9895    12.643719
9896    12.643631
9897    13.454716
9898    13.212105
9899    14.138895
9900    13.368757
9901    13.635015
9902    14.171509
9903    13.832354
9904    13.669437
9905    13.259691
9906    14.138172
9907    13.452593
9908    13
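A linear model's prediction is just the intercept plus the dot product of the coefficients with the feature row. As a sanity check, the sketch below recomputes the first prediction by hand from the rounded coefficients printed by the training cell and row 0 of `X_test`. `predict_fare` is a hypothetical helper for illustration, not a cuML function.

```python
# rounded coefficients and intercept as printed by the training cell
coef = [-0.027290, 0.003329, 0.106803, 0.637564, 0.000871, -0.000516, 0.092400]
intercept = 3.3568549156188965

def predict_fare(features):
    """Intercept plus the dot product of coefficients and features."""
    return intercept + sum(c * x for c, x in zip(coef, features))

# row 0 of X_test: hours, days, months, years,
# longitude_distance, latitude_distance, passenger_count
row0 = [13.0, 27.0, 1.0, 15.0, -0.00811, -0.01997, 1.0]
fare0 = predict_fare(row0)
```

`fare0` lands close to the 12.854630 shown for row 0 above. The `years` term contributes most of it (0.637564 × 15 ≈ 9.56), while the two coordinate-difference terms contribute almost nothing, which helps explain why rides sharing a timestamp receive nearly identical predictions.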

In [17]:
# add predictions to test dataframe
X_test['predicted_fare'] = predictions

# how's that look?
X_test.head()

| | hours | days | months | years | longitude_distance | latitude_distance | passenger_count | predicted_fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 13.0 | 27.0 | 1.0 | 15.0 | -0.00811 | -0.01997 | 1.0 | 12.85463 |
| 1 | 13.0 | 27.0 | 1.0 | 15.0 | -0.012024 | 0.019814 | 1.0 | 12.854605 |
| 2 | 11.0 | 8.0 | 10.0 | 11.0 | 0.002869 | -0.005119 | 1.0 | 11.256927 |
| 3 | 21.0 | 1.0 | 12.0 | 12.0 | -0.009277 | -0.016178 | 1.0 | 11.811884 |
| 4 | 21.0 | 1.0 | 12.0 | 12.0 | -0.022537 | -0.045345 | 1.0 | 11.811888 |
