# BlazingSQL + cuML NYC Taxi Cab Fare Prediction

This demo uses pubically availible [NYC Taxi Cab Data](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) to predict the total fare of a taxi ride in New York City given the pickup and dropoff locations. 

In this notebook, we will cover: 
- How to read and query csv files with cuDF and BlazingSQL.
- How to implement a linear regression model with cuML.

#### BlazingSQL install check
The next cell checks that you have BlazingSQL installed, and offers to install it if not (making sure the notebook will run as expected).

In [1]:
import sys 
# point import path notebooks-contrib/utils
sys.path.append('../../../utils/')
from sql_check import bsql_start
# check that BlazingSQL is installed
bsql_start()

Unable to locate BlazingSQL, would you like to installing it now
Installing BlazingSQL, this may take a few minutes.


"Let's get started with SQL in RAPIDS AI!"

## Imports

In [2]:
import cudf
from cuml import LinearRegression
from blazingsql import BlazingContext

## Create BlazingContext
You can think of the BlazingContext much like a Spark Context (i.e. where information such as FileSystems you have registered and Tables you have created will be stored). If you have issues running this cell, restart runtime and try running it again.

In [3]:
bc = BlazingContext()

BlazingContext ready


### Download Data
For this demo we will train our model with 20,000,000 rows of data from 4 CSV files (5,000,000 rows each). 

The cell below will download them from AWS to the main `notebooks-contrib/data/blazingsql/` folder for you.

In [6]:
import os
import urllib.request

data_dir = '../../../data/blazingsql/'
if not os.path.exists(data_dir):
    print('creating blazingsql directory')
    os.system('mkdir ../../data/blazingsql')

In [8]:
# download taxi data
base_url = 'https://blazingsql-colab.s3.amazonaws.com/taxi_data/'
years = list(range(0, 3))
for i in range(0,4):
    fn = 'taxi_0'+str(i)+ '.csv'
    if not os.path.isfile(data_dir+fn):
        print(f'Downloading {base_url+fn} to {data_dir+fn}')
        urllib.request.urlretrieve(base_url+fn, data_dir+fn)

## Extract, transform, load
In order to train our Linear Regression model, we must first preform ETL so to prepare our data.

### ETL: Read and Join CSVs

In [9]:
# set column names and types
col_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 
                'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
col_types = ['date64', 'float32', 'float32', 'float32', 
                'float32', 'float32', 'float32']

# load first csv 
gdf_00 = cudf.read_csv(data_dir+'taxi_00.csv', names=col_names, dtype=col_types)
# load second csv
gdf_01 = cudf.read_csv(data_dir+'taxi_01.csv', names=col_names, dtype=col_types)
# load third csv
gdf_02 = cudf.read_csv(data_dir+'taxi_02.csv', names=col_names, dtype=col_types)
# load fourth csv
gdf_03 = cudf.read_csv(data_dir+'taxi_03.csv', names=col_names, dtype=col_types)

# combine all those dataframes into one master dataframe
gdf = cudf.concat([gdf_00,gdf_01, gdf_02, gdf_03])

# what's it look like?
gdf.head()

Unnamed: 0,key,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2012-02-02 22:30:19.002,8.9,-73.988708,40.758804,-73.986519,40.737202,1.0
1,2014-09-20 07:19:24.001,4.0,-73.990204,40.746708,-73.994728,40.750515,1.0
2,2013-02-23 07:18:05.001,5.5,-74.016762,40.709438,-74.009003,40.719498,3.0
3,2015-04-18 23:49:27.009,13.5,-74.002708,40.73373,-73.986099,40.734776,1.0
4,2010-03-04 08:15:59.001,10.5,-73.988365,40.737663,-74.012459,40.713932,1.0


In [23]:
gdf.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,20000000.0,20000000.0,20000000.0,19999870.0,19999870.0,20000000.0
mean,11.34995,-72.50857,39.92013,-72.51212,39.9217,1.685199
std,28.78444,12.73661,9.756748,12.73016,9.628169,1.334611
min,-176.0,-3442.06,-3475.482,-3440.697,-3475.482,0.0
25%,6.0,-73.99206,40.73494,-73.99139,40.73404,1.0
50%,8.5,-73.9818,40.75267,-73.98015,40.75317,1.0
75%,12.5,-73.96707,40.76714,-73.96367,40.76811,2.0
max,93963.36,3456.222,3408.79,3456.222,3537.133,208.0


### ETL: Create Table

In [10]:
%time
# make a table from the combined df
bc.create_table('taxi', gdf, column_names=col_names)

CPU times: user 5 µs, sys: 3 µs, total: 8 µs
Wall time: 13.8 µs


<pyblazing.apiv2.context.BlazingTable at 0x7fb7e803a5c0>

### ETL: Query Tables for Training Data

In [11]:
# extract time columns, long & lat, # riders (all floats)
query = '''
        select 
            cast(hour(key) as float) hours, 
            cast(dayofmonth(key) as float) days, 
            cast(month(key) as float) months, 
            cast(year(key) - 2000 as float) years,  
            cast(dropoff_longitude - pickup_longitude as float) longitude_distance, 
            cast(dropoff_latitude - pickup_latitude as float) latitude_distance, 
            cast(passenger_count as float) passenger_count
        from 
            taxi
        '''

# run query on table (returns cuDF DataFrame)
X_train = bc.sql(query)

# fill null values 
X_train['longitude_distance'] = X_train['longitude_distance'].fillna(0)
X_train['latitude_distance'] = X_train['latitude_distance'].fillna(0)
X_train['passenger_count'] = X_train['passenger_count'].fillna(0)

# how's it look? 
X_train.head()

Unnamed: 0,hours,days,months,years,longitude_distance,latitude_distance,passenger_count
0,22.0,2.0,2.0,12.0,0.00219,-0.021603,1.0
1,7.0,20.0,9.0,14.0,-0.004524,0.003807,1.0
2,7.0,23.0,2.0,13.0,0.007759,0.010059,3.0
3,23.0,18.0,4.0,15.0,0.016609,0.001045,1.0
4,8.0,4.0,3.0,10.0,-0.024094,-0.023731,1.0


In [12]:
# query dependent variable y
y_train = bc.sql('SELECT fare_amount FROM main.taxi')

y_train.head()

Unnamed: 0,fare_amount
0,8.9
1,4.0
2,5.5
3,13.5
4,10.5


## Linear Regression
### LR: Train Model

In [13]:
%%time
# call & create cuML model
lr = LinearRegression(fit_intercept=True, normalize=False, algorithm="eig")

# train Linear Regression model 
reg = lr.fit(X_train, y_train)

# display results
print(f"Coefficients:\n{reg.coef_}\n")
print(f"Y intercept:\n{reg.intercept_}\n")

Coefficients:
0   -0.027290
1    0.003329
2    0.106799
3    0.637490
4    0.000872
5   -0.000516
6    0.092426
dtype: float32

Y intercept:
3.357696056365967

CPU times: user 737 ms, sys: 272 ms, total: 1.01 s
Wall time: 1 s


### LR: Use Model to Predict Future Taxi Fares 

Test data for this notebook is already stored in `notebooks-contrib/data/blazingsql/` so there's no need to download it. We are, however, going to create this table directly from CSV, and BlazingSQL requires the full path to the data for table creation. 

This cell uses the `pwd` bash command to identify the path to this directory, then joins it with the relative path to the notebooks-contrib `data/` directory to provide a full path to the test data.

In [14]:
# identify path to this notebook, !pwd returns SList w/ path (str) at 0th index
path = !pwd
# extract path notebooks-contrib then
path = path[0].split('intermediate_notebooks')[0] 
# add path to data from there
path = path + 'data/blazingsql/' + 'taxi_test.csv'

# how's it look?
path

'/rapids/notebooks/wip/blazing012/notebooks-contrib/data/blazingsql/taxi_test.csv'

In [15]:
# set column names and types
col_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 
                'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
col_types = ['date64', 'float32', 'float32', 'float32', 'float32', 'float32', 'float32']

# create test table directly from CSV
bc.create_table('test', path, names=col_names, dtype=col_types)

<pyblazing.apiv2.context.BlazingTable at 0x7fb89a648978>

In [16]:
# extract time columns, long & lat, # riders (all floats)
query = '''
        select 
            cast(hour(key) as float) hours, 
            cast(dayofmonth(key) as float) days, 
            cast(month(key) as float) months, 
            cast(year(key) - 2000 as float) years,  
            cast(dropoff_longitude - pickup_longitude as float) longitude_distance, 
            cast(dropoff_latitude - pickup_latitude as float) latitude_distance, 
            cast(passenger_count as float) passenger_count
        from 
            taxi
        '''

# run query on table (returns cuDF DataFrame)
X_test = bc.sql(query)

# fill null values 
X_test['longitude_distance'] = X_test['longitude_distance'].fillna(0)
X_test['latitude_distance'] = X_test['latitude_distance'].fillna(0)
X_test['passenger_count'] = X_test['passenger_count'].fillna(0)

# how's it look? 
X_test.head()

Unnamed: 0,hours,days,months,years,longitude_distance,latitude_distance,passenger_count
0,22.0,2.0,2.0,12.0,0.00219,-0.021603,1.0
1,7.0,20.0,9.0,14.0,-0.004524,0.003807,1.0
2,7.0,23.0,2.0,13.0,0.007759,0.010059,3.0
3,23.0,18.0,4.0,15.0,0.016609,0.001045,1.0
4,8.0,4.0,3.0,10.0,-0.024094,-0.023731,1.0


In [17]:
# predict fares 
predictions = lr.predict(X_test)

# display predictions
predictions

0           10.719891
1           13.211721
2           12.021486
3           12.871935
4            9.940409
5           10.526224
6           11.606909
7           12.899637
8           12.669996
9           10.323013
10          10.590853
11          10.805092
12          12.207203
13          12.639708
14          11.235060
15          12.924519
16          13.301220
17          13.529264
18           9.150438
19          10.302011
20          11.055346
21          11.334454
22          13.591984
23           8.949201
24          11.257716
25          11.186388
26          10.717313
27          10.259829
28          13.128613
29          10.062141
              ...    
19999970    12.150715
19999971    11.417639
19999972    13.032169
19999973     9.576767
19999974    10.092113
19999975    12.519596
19999976    12.120779
19999977    10.179247
19999978    10.186872
19999979     9.063965
19999980    11.674242
19999981    11.639637
19999982    13.062763
19999983     9.718081
19999984  

In [18]:
# combine into a table of table points and predictions
X_test['predicted_fare'] = predictions

# how's that look?
X_test.head()

Unnamed: 0,hours,days,months,years,longitude_distance,latitude_distance,passenger_count,predicted_fare
0,22.0,2.0,2.0,12.0,0.00219,-0.021603,1.0,10.719891
1,7.0,20.0,9.0,14.0,-0.004524,0.003807,1.0,13.211721
2,7.0,23.0,2.0,13.0,0.007759,0.010059,3.0,12.021486
3,23.0,18.0,4.0,15.0,0.016609,0.001045,1.0,12.871935
4,8.0,4.0,3.0,10.0,-0.024094,-0.023731,1.0,9.940409
