# BlazingSQL + cuML NYC Taxi Cab Fare Prediction

This demo uses pubically availible [NYC Taxi Cab Data](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) to predict the total fare of a taxi ride in New York City given the pickup and dropoff locations. 

In this notebook, we will cover: 
- How to read and query csv files with cuDF and BlazingSQL.
- How to implement a linear regression model with cuML.

![Impression](https://www.google-analytics.com/collect?v=1&tid=UA-39814657-5&cid=555&t=event&ec=guides&ea=taxi_fare_prediction&dt=taxi_fare_prediction)


## Imports

In [1]:
# Notebooks-contrib test 
import os
try:
    import matplotlib
except ModuleNotFoundError:
    os.system('conda install -y matplotlib')
    import matplotlib
    
# Import RAPIDS AI stack
import cudf
from cuml import LinearRegression
from blazingsql import BlazingContext

## Create BlazingContext
You can think of the BlazingContext much like a Spark Context (i.e. where information such as FileSystems you have registered and Tables you have created will be stored). If you have issues running this cell, restart runtime and try running it again.

In [4]:
bc = BlazingContext()

lo
BlazingContext ready


### Download Data
For this demo we will train our model with 20,000,000 rows of data from 4 CSV files (5,000,000 rows each). 

The cell below will download them from AWS to the main `notebooks-contrib/data/blazingsql/` folder for you.

In [3]:
!wget -P ../../../data/blazingsql https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_00.csv
!wget -P ../../../data/blazingsql https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_01.csv
!wget -P ../../../data/blazingsql https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_02.csv
!wget -P ../../../data/blazingsql https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_03.csv
    
# tag relative path to data directory
path = '../../../data/blazingsql/'

--2019-12-17 19:25:55--  https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_00.csv
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.64.216
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.64.216|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 393974627 (376M) [application/x-www-form-urlencoded]
Saving to: ‘../../../data/blazingsql/taxi_00.csv’


2019-12-17 19:32:03 (1.02 MB/s) - ‘../../../data/blazingsql/taxi_00.csv’ saved [393974627/393974627]

--2019-12-17 19:32:03--  https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_01.csv
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.228.80
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.228.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 393961373 (376M) [application/x-www-form-urlencoded]
Saving to: ‘../../../dat

## Extract, transform, load
In order to train our Linear Regression model, we must first preform ETL so to prepare our data.

### ETL: Read and Join CSVs

In [6]:
# set column names and types
col_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 
                'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
col_types = ['date64', 'float32', 'float32', 'float32', 
                'float32', 'float32', 'float32']

# load first csv 
gdf_00 = cudf.read_csv(path+'taxi_00.csv', names=col_names, dtype=col_types)
# load second csv
gdf_01 = cudf.read_csv(path+'taxi_01.csv', names=col_names, dtype=col_types)
# load third csv
gdf_02 = cudf.read_csv(path+'taxi_01.csv', names=col_names, dtype=col_types)
# load fourth csv
gdf_03 = cudf.read_csv(path+'taxi_01.csv', names=col_names, dtype=col_types)

# combine all those dataframes into one master dataframe
gdf = cudf.concat([gdf_00,gdf_01, gdf_02, gdf_03])

# what's it look like?
gdf.head()

Unnamed: 0,key,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2012-02-02 22:30:19.002,8.9,-73.988708,40.758804,-73.986519,40.737202,1.0
1,2014-09-20 07:19:24.001,4.0,-73.990204,40.746708,-73.994728,40.750515,1.0
2,2013-02-23 07:18:05.001,5.5,-74.016762,40.709438,-74.009003,40.719498,3.0
3,2015-04-18 23:49:27.009,13.5,-74.002708,40.73373,-73.986099,40.734776,1.0
4,2010-03-04 08:15:59.001,10.5,-73.988365,40.737663,-74.012459,40.713932,1.0


### ETL: Create Table

In [7]:
%time
# make a table from the combined df
bc.create_table('taxi', gdf, column_names=col_names)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.44 µs


<pyblazing.apiv2.context.BlazingTable at 0x7f5485731940>

### ETL: Query Tables for Training Data

In [8]:
# extract time columns, long & lat, # riders (all floats)
query = '''
        select 
            cast(hour(key) as float) hours, 
            cast(dayofmonth(key) as float) days, 
            cast(month(key) as float) months, 
            cast(year(key) - 2000 as float) years,  
            cast(dropoff_longitude - pickup_longitude as float) longitude_distance, 
            cast(dropoff_latitude - pickup_latitude as float) latitude_distance, 
            cast(passenger_count as float) passenger_count
        from 
            taxi
        '''

# run query on table (returns cuDF DataFrame)
X_train = bc.sql(query)

# fill null values 
X_train['longitude_distance'] = X_train['longitude_distance'].fillna(0)
X_train['latitude_distance'] = X_train['latitude_distance'].fillna(0)
X_train['passenger_count'] = X_train['passenger_count'].fillna(0)

# how's it look? 
X_train.head()

31600


Unnamed: 0,hours,days,months,years,longitude_distance,latitude_distance,passenger_count
0,22.0,2.0,2.0,12.0,0.00219,-0.021603,1.0
1,7.0,20.0,9.0,14.0,-0.004524,0.003807,1.0
2,7.0,23.0,2.0,13.0,0.007759,0.010059,3.0
3,23.0,18.0,4.0,15.0,0.016609,0.001045,1.0
4,8.0,4.0,3.0,10.0,-0.024094,-0.023731,1.0


In [9]:
# query dependent variable y
y_train = bc.sql('SELECT fare_amount FROM main.taxi')

y_train.head()

31600


Unnamed: 0,fare_amount
0,8.9
1,4.0
2,5.5
3,13.5
4,10.5


## Linear Regression
### LR: Train Model

In [12]:
%%time
# call & create cuML model
lr = LinearRegression(fit_intercept=True, normalize=False, algorithm="eig")

# train Linear Regression model 
reg = lr.fit(X_train, y_train)

# display results
print(f"Coefficients:\n{reg.coef_}\n")
print(f"Y intercept:\n{reg.intercept_}\n")

Coefficients:
0   -0.026636
1    0.003887
2    0.108050
3    0.629376
4    0.000906
5   -0.001342
6    0.091615
dtype: float32

Y intercept:
3.4228439331054688

CPU times: user 1.24 s, sys: 539 ms, total: 1.78 s
Wall time: 1.77 s


### LR: Use Model to Predict Future Taxi Fares 

Test data for this notebook is already stored in `notebooks-contrib/data/blazingsql/` so there's no need to download it. We are, however, going to create this table directly from CSV, and BlazingSQL requires the full path to the data for table creation. 

This cell uses the `pwd` bash command to identify the path to this directory, then joins it with the relative path to the notebooks-contrib `data/` directory to provide a full path to the test data.

In [19]:
# identify path to this notebook, !pwd returns SList w/ path (str) at 0th index
path = !pwd
# extract path notebooks-contrib then
path = path[0].split('intermediate_notebooks')[0] 
# add path to data from there
path = path + 'data/blazingsql/' + 'taxi_test.csv'

# how's it look?
path

'/home/winston/notebooks-contrib/data/blazingsql/taxi_test.csv'

In [20]:
# set column names and types
col_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 
                'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
col_types = ['date64', 'float32', 'float32', 'float32', 'float32', 'float32', 'float32']

# create test table directly from CSV
bc.create_table('test', path, names=col_names, dtype=col_types)

<pyblazing.apiv2.context.BlazingTable at 0x7f5460aef1d0>

In [21]:
# extract time columns, long & lat, # riders (all floats)
query = '''
        select 
            cast(hour(key) as float) hours, 
            cast(dayofmonth(key) as float) days, 
            cast(month(key) as float) months, 
            cast(year(key) - 2000 as float) years,  
            cast(dropoff_longitude - pickup_longitude as float) longitude_distance, 
            cast(dropoff_latitude - pickup_latitude as float) latitude_distance, 
            cast(passenger_count as float) passenger_count
        from 
            taxi
        '''

# run query on table (returns cuDF DataFrame)
X_test = bc.sql(query)

# fill null values 
X_test['longitude_distance'] = X_test['longitude_distance'].fillna(0)
X_test['latitude_distance'] = X_test['latitude_distance'].fillna(0)
X_test['passenger_count'] = X_test['passenger_count'].fillna(0)

# how's it look? 
X_test.head()

31600


Unnamed: 0,hours,days,months,years,longitude_distance,latitude_distance,passenger_count
0,22.0,2.0,2.0,12.0,0.00219,-0.021603,1.0
1,7.0,20.0,9.0,14.0,-0.004524,0.003807,1.0
2,7.0,23.0,2.0,13.0,0.007759,0.010059,3.0
3,23.0,18.0,4.0,15.0,0.016609,0.001045,1.0
4,8.0,4.0,3.0,10.0,-0.024094,-0.023731,1.0


In [23]:
# predict fares 
predictions = lr.predict(X_test)

# display predictions
predictions

0           10.704890
1           13.189461
2           11.998629
3           12.844664
4            9.934841
5           10.527137
6           11.609999
7           12.858987
8           12.649053
9           10.325417
10          10.609123
11          10.820753
12          12.196549
13          12.613345
14          11.207047
15          12.902152
16          13.267491
17          13.489919
18           9.172511
19          10.305005
20          11.059145
21          11.318907
22          13.563890
23           8.966608
24          11.246246
25          11.188231
26          10.730618
27          10.256178
28          13.079275
29          10.068510
              ...    
19999970    12.522391
19999971     9.035202
19999972    11.899643
19999973    12.144493
19999974    13.204147
19999975    10.045351
19999976    10.575220
19999977    11.252844
19999978    12.807266
19999979    13.101965
19999980    13.322552
19999981    12.951199
19999982    10.354801
19999983    10.283142
19999984  

In [24]:
# combine into a table of table points and predictions
X_test['predicted_fare'] = predictions

# how's that look?
X_test.head()

Unnamed: 0,hours,days,months,years,longitude_distance,latitude_distance,passenger_count,predicted_fare
0,22.0,2.0,2.0,12.0,0.00219,-0.021603,1.0,10.70489
1,7.0,20.0,9.0,14.0,-0.004524,0.003807,1.0,13.189461
2,7.0,23.0,2.0,13.0,0.007759,0.010059,3.0,11.998629
3,23.0,18.0,4.0,15.0,0.016609,0.001045,1.0,12.844664
4,8.0,4.0,3.0,10.0,-0.024094,-0.023731,1.0,9.934841
