<a href="https://colab.research.google.com/github/rogerfvieira/fare_prediction/blob/main/fare_prediction(linear_regression).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# !pip install --upgrade google-cloud-bigquery
# !pip install google-colab

# Fare prediction (linear regression) using BigQuery ML

This notebook analyzes the publicly available New York City taxi trips dataset. The objective is to explore the data and build a model that predicts fare prices from selected features, using BigQuery and BigQuery ML. The steps below are run through the BigQuery API using Python.
 
**Important**
Running the notebook locally will raise errors unless you can connect to Google Cloud. If you don't have a Google Cloud account, just read through the cells, which already display their outputs. To run the notebook yourself, change the project 'play-368717' to a project of your own. Be aware that this can incur charges on your Google Cloud account, so I recommend against it.
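
One way to make the project swap less error-prone is to build the fully qualified table and model names from variables instead of repeating the literal 'play-368717' inside each query. The sketch below is illustrative only: the helper name `qualified_id` is hypothetical and is not part of this notebook or of the BigQuery client library.

```python
def qualified_id(project_id: str, dataset_id: str, name: str) -> str:
    """Return a backtick-quoted `project.dataset.name` identifier for use in SQL."""
    return f"`{project_id}.{dataset_id}.{name}`"


# Values taken from this notebook; replace project_id with your own project.
project_id = "play-368717"
dataset_id = "linear_regression"

training_table = qualified_id(project_id, dataset_id, "training_data_fare_prediction")
model_name = qualified_id(project_id, dataset_id, "fare_prediction_model")

# The query text is assembled once from the variables above.
evaluation_sql = f"SELECT * FROM ML.EVALUATE(MODEL {model_name})"

print(training_table)
print(evaluation_sql)
```

With this approach only `project_id` needs to change when the notebook is pointed at a different project.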


## Imports


In [1]:
from google.cloud import bigquery 
from google.colab import auth
import pandas as pd

In [2]:
auth.authenticate_user()

## Client creation

Connecting to the BigQuery project

In [3]:
project_id = 'play-368717'
dataset_id = 'linear_regression'
client = bigquery.Client(project=project_id)

## Unclean data source

In [4]:
unclean_data_query = """
SELECT *
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022`
LIMIT 100
"""

df_unclean = client.query(unclean_data_query).to_dataframe()

In [5]:
df_unclean

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code,store_and_fwd_flag,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,imp_surcharge,airport_fee,total_amount,pickup_location_id,dropoff_location_id,data_file_year,data_file_month
0,2,2022-02-21 22:14:25+00:00,2022-02-21 22:14:46+00:00,1,0E-9,1.0,N,1,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,264,264,2022,2
1,2,2022-02-28 17:08:16+00:00,2022-02-28 17:09:03+00:00,1,0E-9,1.0,N,1,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,7,7,2022,2
2,2,2022-02-03 20:44:26+00:00,2022-02-03 20:44:40+00:00,1,0E-9,1.0,N,2,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,193,193,2022,2
3,2,2022-02-06 11:07:13+00:00,2022-02-06 11:07:46+00:00,1,0E-9,1.0,N,1,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,264,264,2022,2
4,2,2022-02-08 08:50:50+00:00,2022-02-08 08:51:06+00:00,1,0E-9,1.0,N,1,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,193,193,2022,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,1,2022-02-22 13:59:46+00:00,2022-02-22 14:51:08+00:00,1,29.900000000,5.0,N,4,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,132,265,2022,2
96,2,2022-02-10 11:42:07+00:00,2022-02-10 11:42:24+00:00,1,0E-9,1.0,N,1,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,193,193,2022,2
97,2,2022-02-24 18:44:59+00:00,2022-02-24 18:45:10+00:00,1,0E-9,1.0,N,1,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,193,193,2022,2
98,1,2022-02-25 17:04:21+00:00,2022-02-25 17:04:21+00:00,0,0E-9,1.0,N,2,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,0E-9,113,264,2022,2


## Data Cleaning / feature engineering and selection

The data must be cleaned before fitting the linear regression. The query below keeps only rows whose selected columns are non-null and drops trips with zero distance.

In [6]:
training_data_query = """
CREATE OR REPLACE TABLE `play-368717.linear_regression.training_data_fare_prediction` AS (
  SELECT vendor_id,
         pickup_datetime,
         dropoff_datetime,
         passenger_count,
         trip_distance,
         pickup_location_id,
         dropoff_location_id,
         fare_amount
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022`
  WHERE pickup_datetime IS NOT NULL
    AND dropoff_datetime IS NOT NULL
    AND passenger_count IS NOT NULL
    AND trip_distance IS NOT NULL AND trip_distance != 0
    AND pickup_location_id IS NOT NULL
    AND dropoff_location_id IS NOT NULL
    AND fare_amount IS NOT NULL
)
"""
training_data = client.query(training_data_query)
training_data.result()  # wait for the CREATE TABLE job to finish before reading the table
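
The row filter in the WHERE clause above can be tried out locally without a BigQuery connection. The following is a minimal sketch in plain Python. The rows are made up for illustration and are not taken from the dataset, and the helper name `is_clean` is hypothetical.

```python
# Made-up rows for illustration only; None stands in for a SQL NULL.
rows = [
    {"passenger_count": 1, "trip_distance": 2.5, "fare_amount": 10.0},
    {"passenger_count": None, "trip_distance": 1.0, "fare_amount": 6.5},
    {"passenger_count": 2, "trip_distance": 0.0, "fare_amount": 3.0},
    {"passenger_count": 1, "trip_distance": 4.2, "fare_amount": None},
]


def is_clean(row: dict) -> bool:
    """Mirror the query's conditions: no NULL columns and a non-zero trip distance."""
    has_no_nulls = all(value is not None for value in row.values())
    return has_no_nulls and row["trip_distance"] != 0


cleaned = [row for row in rows if is_clean(row)]

print(f"kept {len(cleaned)} of {len(rows)} rows")
for row in cleaned:
    print(row)
```

Only the first row survives: the second and fourth contain a missing value, and the third has a zero trip distance.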

In [7]:
dataset_ref = bigquery.DatasetReference(project_id, dataset_id)
table_ref = dataset_ref.table("training_data_fare_prediction")
table = client.get_table(table_ref)
df_training_data = client.list_rows(table).to_dataframe()

This is the cleaned data, which serves as the training data for the linear regression model.

In [8]:
df_training_data

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_location_id,dropoff_location_id,fare_amount
0,2,2022-02-28 13:00:51+00:00,2022-02-28 13:27:56+00:00,1,14.570000000,88,1,58.500000000
1,2,2022-02-13 22:23:49+00:00,2022-02-13 22:36:33+00:00,1,4.570000000,256,54,15.500000000
2,2,2022-02-18 01:59:52+00:00,2022-02-18 02:19:50+00:00,2,5.290000000,88,189,18.500000000
3,2,2022-02-04 22:29:05+00:00,2022-02-04 22:35:55+00:00,1,1.620000000,40,181,7.000000000
4,1,2022-02-16 08:36:29+00:00,2022-02-16 09:29:12+00:00,1,14.300000000,65,136,39.200000000
...,...,...,...,...,...,...,...,...
27982186,1,2022-01-08 18:45:14+00:00,2022-01-08 19:03:51+00:00,2,4.500000000,264,264,17.000000000
27982187,2,2022-01-04 05:50:14+00:00,2022-01-04 05:56:39+00:00,1,1.580000000,264,264,7.000000000
27982188,2,2022-01-25 18:03:52+00:00,2022-01-25 18:13:53+00:00,2,1.670000000,264,264,8.500000000
27982189,2,2022-01-01 14:45:01+00:00,2022-01-01 14:58:16+00:00,1,1.590000000,264,264,10.000000000


## Model Creation

In [9]:
model_creation_query = """
CREATE OR REPLACE MODEL `play-368717.linear_regression.fare_prediction_model` OPTIONS(MODEL_TYPE='LINEAR_REG',LABELS=['fare_amount']) AS
SELECT * FROM `play-368717.linear_regression.training_data_fare_prediction`;
"""

In [10]:
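fare_prediction_model = client.query(model_creation_query)
fare_prediction_model.result()  # wait for training to finish before listing or evaluating the model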

In [11]:
models = client.list_models('linear_regression')

## Model Evaluation

In [12]:
model_evaluation_query = '''
SELECT * FROM ML.EVALUATE(MODEL `play-368717.linear_regression.fare_prediction_model`)
'''

In [13]:
model_eval =  client.query(model_evaluation_query).to_dataframe()

In [14]:
model_eval

Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,4.357807,54.032623,0.151982,2.967463,0.684807,0.684807
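
The mean squared error is in squared fare units, which makes it hard to compare with the mean absolute error. Taking its square root gives the root mean squared error (RMSE), which is in the same units as `fare_amount`. The short sketch below does this arithmetic on the values copied from the output above; RMSE is derived here and is not a column returned by the evaluation query.

```python
import math

# Values copied from the ML.EVALUATE output above.
mean_squared_error = 54.032623
mean_absolute_error = 4.357807
r2_score = 0.684807

# RMSE is the square root of MSE, so it shares the units of fare_amount.
root_mean_squared_error = math.sqrt(mean_squared_error)

print(f"RMSE: {root_mean_squared_error:.2f}")
print(f"MAE:  {mean_absolute_error:.2f}")
print(f"Share of fare variance explained: {r2_score:.1%}")
```

An RMSE noticeably larger than the MAE suggests that a smaller number of trips have large errors, since squaring weights those errors more heavily.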
