## A better baseline using the median and taking into account city/weekend/hour/month
	
I will try to push the RMSE a bit below what the plain median achieves by using a bit more information:

- City
- Weekend
- Month
- Hour

This should lower the RMSE a bit further, even though I am leaving some data unused which I should be able to exploit later.
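
To make the idea concrete, here is a minimal sketch in plain Python that compares a single global median with one median per (city, weekend, month, hour) group. The rows and values are made up for illustration, and the RMSE is computed on the same rows used to fit the medians, so it only shows the direction of the effect, not what the leaderboard will report.

```python
from collections import defaultdict
from math import sqrt
from statistics import median

# Toy rows: (city, weekend, month, hour, observed value). All values are made up.
rows = [
    ("Atlanta", 0, 6, 8, 10.0),
    ("Atlanta", 0, 6, 8, 14.0),
    ("Atlanta", 0, 6, 8, 12.0),
    ("Boston", 1, 6, 23, 0.0),
    ("Boston", 1, 6, 23, 2.0),
    ("Boston", 1, 6, 23, 1.0),
]


def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets))


targets = [row[-1] for row in rows]

# Baseline 1: the same global median for every row.
global_median = median(targets)
rmse_global = rmse([global_median] * len(rows), targets)

# Baseline 2: one median per (city, weekend, month, hour) group.
groups = defaultdict(list)
for *key, value in rows:
    groups[tuple(key)].append(value)
group_median = {key: median(values) for key, values in groups.items()}
rmse_grouped = rmse([group_median[row[:-1]] for row in rows], targets)

print(f"global median RMSE:  {rmse_global:.3f}")
print(f"grouped median RMSE: {rmse_grouped:.3f}")
```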


In [None]:
# Replace 'kaggle-competitions-project' with YOUR OWN project id here
PROJECT_ID = 'kaggle-competitions-project'

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location="US")
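dataset = client.create_dataset('bqml_example', exists_ok=True)
# The queries below write to a dataset named `kaggle`, so make sure it exists too
client.create_dataset('kaggle', exists_ok=True)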

from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID

# create a reference to our table
table = client.get_table("kaggle-competition-datasets.geotab_intersection_congestion.train")

In [None]:
%load_ext google.cloud.bigquery

## Create a table with stats summary 

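First, let's just calculate the summary statistics over the city/weekend/month/hour fields. `approx_quantiles(x, 100)` returns 101 boundaries, so offset 50 is the (approximate) median; the p20 and p80 targets take offsets 20 and 80 instead.

I will use this SQL construct everywhere, so that running the notebook twice (once for testing, once when committing it to Kaggle) does not execute the same query twice.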

```sql
CREATE TABLE IF NOT EXISTS `tablename`
AS SELECT << something >>
;
```
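
The effect of this construct can be tried locally. The sketch below uses SQLite from the Python standard library as a stand-in for BigQuery (an assumption made only so the example runs without a project or credentials); the table and column names are made up. The second execution does nothing because the table already exists.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (city TEXT, value REAL)")
conn.executemany(
    "INSERT INTO train VALUES (?, ?)",
    [("Atlanta", 10.0), ("Boston", 2.0)],
)

create_summary = """
CREATE TABLE IF NOT EXISTS summary
AS SELECT city, AVG(value) AS avg_value FROM train GROUP BY city
"""

conn.execute(create_summary)  # first run: builds the table
conn.execute("INSERT INTO train VALUES ('Chicago', 5.0)")
conn.execute(create_summary)  # second run: table exists, nothing is recomputed

summary_rows = conn.execute("SELECT city FROM summary ORDER BY city").fetchall()
print(summary_rows)
```

The flip side is visible here as well: the row added between the two runs never reaches `summary`. If the query or the source data changes, the table has to be dropped before it is rebuilt.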

In [None]:
%%bigquery 

create table if not exists `kaggle-competitions-project`.kaggle.medianas_01
as select city,
       weekend,
       month,
       hour,
       tts20[offset(20)] as time_p20,
       tts50[offset(50)] as time_p50,
       tts80[offset(80)] as time_p80,
       dffs20[offset(20)] as dist_p20,
       dffs50[offset(50)] as dist_p50,
       dffs80[offset(80)] as dist_p80
from (select 
       city,
       weekend,
       month,
       hour,
       approx_quantiles(TotalTimeStopped_p20,100) as tts20,
       approx_quantiles(TotalTimeStopped_p50,100) as tts50,
       approx_quantiles(TotalTimeStopped_p80,100) as tts80,
       approx_quantiles(DistanceToFirstStop_p20,100) as dffs20,
       approx_quantiles(DistanceToFirstStop_p50,100) as dffs50,
       approx_quantiles(DistanceToFirstStop_p80,100) as dffs80
    from `kaggle-competition-datasets`.geotab_intersection_congestion.train
    group by city, weekend, month, hour
    )
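
To see what the `[offset(k)]` indexing in the query picks out, here is a small pure-Python sketch. `quantile_boundaries` is a hypothetical helper that computes exact boundaries on a sorted list; BigQuery's `approx_quantiles` returns approximate ones, so this mirrors only the shape of the result (`number + 1` sorted elements, minimum first, maximum last), not the algorithm. The input values are made up.

```python
def quantile_boundaries(values, number):
    """Return number + 1 sorted boundaries: minimum first, maximum last.

    Exact stand-in for illustration; approx_quantiles is approximate.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / number)] for i in range(number + 1)]


stopped_p50 = [0, 0, 0, 3, 5, 8, 12, 20, 31, 40, 55]  # made-up values
q = quantile_boundaries(stopped_p50, 100)

# Offset 0 is the minimum, offset 50 the median, offset 100 the maximum.
print(len(q), q[0], q[50], q[100])
```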

Now just join the test table with the summary table computed above.
For me, the best (most economical) way to do it is again a `CREATE TABLE IF NOT EXISTS`, because the notebook still works when it is run again.

The test table has some entries with no matching city/month/weekend/hour combination in the training set.
In that case, instead of letting the join produce a NULL value (which Kaggle does not accept), we fall back to the median of the whole dataset, hoping this will lower the RMSE: (0, 0, 40, 0, 0, 95.4)


In [None]:
%%bigquery

CREATE TABLE IF NOT EXISTS `kaggle-competitions-project`.kaggle.baseline_02
AS SELECT
    rowID as TargetId,
    IFNULL(time_p20, 0) as time_p20,
    IFNULL(time_p50, 0) as time_p50,
    IFNULL(time_p80, 40) as time_p80,
    IFNULL(dist_p20, 0) as dist_p20,
    IFNULL(dist_p50, 0) as dist_p50,
    IFNULL(dist_p80, 95.4) as dist_p80
FROM `kaggle-competition-datasets`.geotab_intersection_congestion.test test
LEFT JOIN `kaggle-competitions-project`.kaggle.medianas_01 m01
  ON test.city = m01.city and
      test.month = m01.month and
      test.weekend = m01.weekend and
      test.hour = m01.hour;
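
The LEFT JOIN plus IFNULL pattern above amounts to a lookup with a default. A minimal sketch in plain Python, where the fallback values are the ones from the query and the `summary` entry is a made-up example row:

```python
# Fallback per target column, taken from the IFNULL defaults in the query.
FALLBACK = {
    "time_p20": 0, "time_p50": 0, "time_p80": 40,
    "dist_p20": 0, "dist_p50": 0, "dist_p80": 95.4,
}

# Hypothetical summary rows keyed by (city, month, weekend, hour).
summary = {
    ("Atlanta", 6, 0, 8): {
        "time_p20": 0, "time_p50": 5, "time_p80": 21,
        "dist_p20": 0, "dist_p50": 30.2, "dist_p80": 80.1,
    },
}


def predict(row_id, key):
    """Look up the group statistics; use the fallback where nothing matches."""
    stats = summary.get(key, {})  # no match behaves like the NULLs of a LEFT JOIN
    prediction = {"TargetId": row_id}
    for column, default in FALLBACK.items():
        value = stats.get(column)
        prediction[column] = default if value is None else value
    return prediction


matched = predict(1, ("Atlanta", 6, 0, 8))
unmatched = predict(2, ("Boston", 1, 1, 3))
print(matched)
print(unmatched)
```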

## Create submission table

Now we have to create a submission table from the results in the baseline_02 table.
This could be done directly, without creating the baseline_02 table first, but it is easier this way (not cheaper, though).

Maybe this would be a better place to put the IFNULL, instead of the baseline_02 table.


In [None]:
%%bigquery

CREATE TABLE IF NOT EXISTS `kaggle-competitions-project`.kaggle.submission_02
AS
   SELECT CONCAT(CAST(TargetID as string), '_0') as TargetId, time_p20 as Target from `kaggle-competitions-project`.kaggle.baseline_02
   UNION ALL
   SELECT CONCAT(CAST(TargetID as string), '_1') as TargetId, time_p50 as Target from `kaggle-competitions-project`.kaggle.baseline_02
   UNION ALL
   SELECT CONCAT(CAST(TargetID as string), '_2') as TargetId, time_p80 as Target from `kaggle-competitions-project`.kaggle.baseline_02
   UNION ALL
   SELECT CONCAT(CAST(TargetID as string), '_3') as TargetId, dist_p20 as Target from `kaggle-competitions-project`.kaggle.baseline_02
   UNION ALL
   SELECT CONCAT(CAST(TargetID as string), '_4') as TargetId, dist_p50 as Target from `kaggle-competitions-project`.kaggle.baseline_02
   UNION ALL
   SELECT CONCAT(CAST(TargetID as string), '_5') as TargetId, dist_p80 as Target from `kaggle-competitions-project`.kaggle.baseline_02
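
The query reshapes one wide row per intersection into six long rows, one per target, with the suffixes `_0` to `_5`. A sketch of the same reshaping in plain Python follows; the baseline row is made up, and the rows come out grouped by row id rather than by target as in the UNION ALL, which I assume does not matter for the submission.

```python
# Column order defines the suffix: time_p20 -> _0, ..., dist_p80 -> _5.
TARGET_COLUMNS = ["time_p20", "time_p50", "time_p80", "dist_p20", "dist_p50", "dist_p80"]


def to_submission_rows(baseline_rows):
    """Yield (TargetId, Target) pairs, six per baseline row."""
    for row in baseline_rows:
        for index, column in enumerate(TARGET_COLUMNS):
            yield f"{row['TargetId']}_{index}", row[column]


# Made-up baseline row for illustration.
baseline = [
    {"TargetId": 7, "time_p20": 0, "time_p50": 0, "time_p80": 40,
     "dist_p20": 0, "dist_p50": 0, "dist_p80": 95.4},
]

submission = list(to_submission_rows(baseline))
for target_id, target in submission:
    print(target_id, target)
```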
   


## Submit the data to Kaggle

We don't need to create a CSV file in this notebook.
The best and easiest way is to tell BigQuery to export the table as a gzip-compressed CSV file into Google Cloud Storage.
Just remember to use 'csv.gz' as the file extension.
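
One possible way to run that export from the notebook is the `extract_table` method of the BigQuery Python client. This is a sketch, not something the notebook actually ran: the bucket name `my-bucket` and the helpers `submission_uri` and `export_submission` are hypothetical, and the bucket has to exist and be writable by your project.

```python
def submission_uri(bucket, name="submission_02"):
    """Build the destination URI; the 'csv.gz' extension matches the gzip compression."""
    return f"gs://{bucket}/{name}.csv.gz"


def export_submission(client, table_id, destination_uri):
    """Ask BigQuery to write table_id to Cloud Storage as a gzip-compressed CSV."""
    # Imported here so the helper above can be used without the library installed.
    from google.cloud import bigquery

    job_config = bigquery.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.CSV
    job_config.compression = bigquery.Compression.GZIP

    extract_job = client.extract_table(table_id, destination_uri, job_config=job_config)
    extract_job.result()  # block until the export has finished
    return destination_uri


# Hypothetical usage, with `client` as created earlier in the notebook:
# export_submission(
#     client,
#     "kaggle-competitions-project.kaggle.submission_02",
#     submission_uri("my-bucket"),
# )
```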