This kernel is an extension of the [template](https://www.kaggle.com/sirtorry/bigquery-ml-template-intersection-congestion). New features include:
* Making some parameters (such as the label to predict) configurable using a custom Jupyter cell magic
* Feature engineering - distance from center and turn direction (following this kernel: https://www.kaggle.com/dcaichara/feature-engineering-and-lightgbm)
* TODO: combine insights from a different dataset

In [None]:
# Replace the placeholder below with YOUR OWN project id
PROJECT_ID = '<Project_ID>'

import os
import pandas as pd

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)
dataset = client.create_dataset('bqml_example', exists_ok=True)

from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID

# create a reference to our table
table = client.get_table("kaggle-competition-datasets.geotab_intersection_congestion.train")

# look at five rows from our dataset
client.list_rows(table, max_results=5).to_dataframe()

Create a Jupyter cell magic, `%%with_config`, that replaces the `{...}` placeholders in a cell with the values stored in `QUERY_CONFIG` and then runs the cell.

E.g. {LABEL_COLUMN} -> TotalTimeStopped_p20

In [None]:
# Allow you to easily have Python variables in SQL query.
from IPython.core.magic import register_cell_magic
from IPython import get_ipython

LABEL_TO_NUM = {
    'TotalTimeStopped_p20' : '0',
    'TotalTimeStopped_p50' : '1',
    'TotalTimeStopped_p80' : '2',
    'DistanceToFirstStop_p20' : '3',
    'DistanceToFirstStop_p50' : '4',
    'DistanceToFirstStop_p80' : '5',
}
    
# Metrics for prediction
# TotalTimeStopped_p20, TotalTimeStopped_p50, TotalTimeStopped_p80, 
# DistanceToFirstStop_p20, DistanceToFirstStop_p50, DistanceToFirstStop_p80.
QUERY_CONFIG = {}
QUERY_CONFIG['LABEL_COLUMN'] = 'DistanceToFirstStop_p80'
QUERY_CONFIG['LABEL_NUM'] = LABEL_TO_NUM[QUERY_CONFIG['LABEL_COLUMN']]
QUERY_CONFIG['MODEL_NAME'] = '`bqml_example.model_' + QUERY_CONFIG['LABEL_NUM'] + "`"

@register_cell_magic('with_config')
def with_config(line, cell):
    contents = cell.format(**QUERY_CONFIG)
    get_ipython().run_cell(contents)
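
To see what `%%with_config` hands to the notebook, the substitution step can be tried on its own. The sketch below applies the same `str.format` call to a short SQL template; the `render` helper and the template text are hypothetical names made up for this illustration and are not part of the kernel.

```python
# Illustrative only: `render` and `template` are hypothetical names.
QUERY_CONFIG = {
    'LABEL_COLUMN': 'DistanceToFirstStop_p80',
    'LABEL_NUM': '5',
    'MODEL_NAME': '`bqml_example.model_5`',
}

def render(cell, config):
    """Fill {NAME} placeholders the same way with_config does."""
    return cell.format(**config)

template = "SELECT {LABEL_COLUMN} AS label FROM ML.PREDICT(MODEL {MODEL_NAME}, ...)"
rendered = render(template, QUERY_CONFIG)
print(rendered)
```

Because `str.format` treats every `{...}` as a placeholder, a cell that needs literal braces would have to double them (`{{` and `}}`).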


## Feature Engineering

Although BigQuery ML does a lot of the work for you, here I add features by joining the data with extra information, such as the coordinates of each city centre.

In [None]:
%load_ext google.cloud.bigquery

In [None]:
# Helper tables containing information to JOIN

# CREATE TABLE IF NOT EXISTS helper.Directions (
#   direction STRING,
#   value FLOAT64
# );
           
# INSERT INTO helper.Directions
# VALUES ("N", 0),
#        ("NE", 1/4),
#        ("E", 1/2),
#        ("SE", 3/4),
#        ("S", 1),
#        ("SW", 5/4),
#        ("W", 3/2),
#        ("NW", 7/4);

# # Use this to later calculate distances from center
# CREATE TABLE IF NOT EXISTS helper.Cities (
#   city STRING,
#   centerLatitude FLOAT64,
#   centerLongitude FLOAT64
# );
                           
# INSERT INTO helper.Cities
# VALUES ("Atlanta", 33.753746, -84.386330),
#        ("Boston", 42.361145, -71.057083),
#        ("Chicago", 41.881832, -87.623177),
#        ("Philadelphia", 39.952583, -75.165222);
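
As a rough illustration of what these helper tables feed into, the sketch below recomputes the two engineered features in plain Python: the straight-line distance to the city centre (in degrees, mirroring the SQL `distance` expression) and the heading difference. The function names are assumptions for this example; the direction values and city centres are copied from the helper tables.

```python
import math

# Heading values copied from helper.Directions (units of a half turn).
DIRECTIONS = {"N": 0, "NE": 1/4, "E": 1/2, "SE": 3/4,
              "S": 1, "SW": 5/4, "W": 3/2, "NW": 7/4}

# City centres copied from helper.Cities: (latitude, longitude).
CITY_CENTERS = {
    "Atlanta": (33.753746, -84.386330),
    "Boston": (42.361145, -71.057083),
    "Chicago": (41.881832, -87.623177),
    "Philadelphia": (39.952583, -75.165222),
}

def distance_from_center(city, latitude, longitude):
    """Euclidean distance in degrees, not a true geographic distance."""
    center_lat, center_lon = CITY_CENTERS[city]
    return math.sqrt((center_lat - latitude) ** 2 + (center_lon - longitude) ** 2)

def diff_heading(entry_heading, exit_heading):
    """Entry value minus exit value; 0 means the headings match."""
    return DIRECTIONS[entry_heading] - DIRECTIONS[exit_heading]

# An intersection exactly at the Boston centre has distance zero.
print(distance_from_center("Boston", 42.361145, -71.057083))
# Entering northbound and leaving eastbound.
print(diff_heading("N", "E"))
```

The difference is not wrapped around the compass, so `diff_heading("NW", "N")` yields 1.75 rather than -0.25.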

In [None]:
%%with_config
%%bigquery
CREATE OR REPLACE MODEL {MODEL_NAME}
OPTIONS(model_type='linear_reg') AS
SELECT
    {LABEL_COLUMN} as label,
    Weekend,
    Hour,
    Month,
    K.City,
    IFNULL(EntryHeading, ExitHeading) as entryheading,
    IFNULL(ExitHeading, EntryHeading) as exitheading,
    IF( EntryStreetName = ExitStreetName , 1, 0) as samestreet,
    SQRT( POW( C.centerLatitude - K.Latitude, 2) + POW(C.centerLongitude - K.Longitude, 2) ) as distance,
    D1.value - D2.value as diffHeading
FROM
    `kaggle-competition-datasets.geotab_intersection_congestion.train` as K
    LEFT JOIN `instant-medium-261000.helper.Directions` as D1 on D1.direction = EntryHeading
    LEFT JOIN `instant-medium-261000.helper.Directions` as D2 on D2.direction = ExitHeading
    LEFT JOIN `instant-medium-261000.helper.Cities` as C on C.city = K.City
WHERE
    RowId < 2600000

The query takes several minutes to complete. Once the first training iteration
has finished, your model (named by `{MODEL_NAME}`, e.g. `model_5` for
`DistanceToFirstStop_p80`) appears in the navigation panel of the BigQuery UI.
Because the query uses a `CREATE MODEL` statement to create a model rather than
return rows, you do not see query results. The output is an empty string.

## Get training statistics

To see the results of the model training, you can use the
[`ML.TRAINING_INFO`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-train)
function, or you can view the statistics in the BigQuery UI.
This kernel uses the `ML.TRAINING_INFO` function.

A machine learning algorithm builds a model by examining many examples and
attempting to find a model that minimizes loss. This process is called empirical
risk minimization.

Loss is the penalty for a bad prediction: a number indicating
how bad the model's prediction was on a single example. If the model's
prediction is perfect, the loss is zero; otherwise, the loss is greater. The
goal of training a model is to find a set of weights that have low
loss, on average, across all examples.
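
A small numeric example may make this concrete. The sketch below assumes squared error as the per-example loss, which is the usual choice for linear regression; the labels and predictions are invented for illustration.

```python
def squared_loss(label, prediction):
    """Penalty for one example: zero when the prediction is exact."""
    return (label - prediction) ** 2

def mean_loss(labels, predictions):
    """Average loss across all examples, the quantity training tries to reduce."""
    losses = [squared_loss(y, p) for y, p in zip(labels, predictions)]
    return sum(losses) / len(losses)

labels = [10.0, 20.0, 30.0]        # invented values
predictions = [10.0, 22.0, 27.0]   # invented values

# Per-example losses: the first prediction is perfect, so its loss is zero.
print([squared_loss(y, p) for y, p in zip(labels, predictions)])
print(mean_loss(labels, predictions))
```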

To see the model training statistics that were generated when you ran the
`CREATE MODEL` query, run the following:

In [None]:
%%with_config
%%bigquery
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL {MODEL_NAME})
ORDER BY iteration 

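## Evaluate your model

Evaluate the model on the held-out rows (`RowId > 2600000`) that were excluded from training.

In [None]: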
%%with_config
%%bigquery
SELECT * 
FROM ML.EVALUATE(MODEL {MODEL_NAME},
(SELECT
    {LABEL_COLUMN} as label,
    Weekend,
    Hour,
    Month,
    K.City,
    IFNULL(EntryHeading, ExitHeading) as entryheading,
    IFNULL(ExitHeading, EntryHeading) as exitheading,
    IF( EntryStreetName = ExitStreetName , 1, 0) as samestreet,
    SQRT( POW( C.centerLatitude - K.Latitude, 2) + POW(C.centerLongitude - K.Longitude, 2) ) as distance,
    D1.value - D2.value as diffHeading
FROM
  `kaggle-competition-datasets.geotab_intersection_congestion.train` as K
  LEFT JOIN `instant-medium-261000.helper.Directions` as D1 on D1.direction = EntryHeading
  LEFT JOIN `instant-medium-261000.helper.Directions` as D2 on D2.direction = ExitHeading
  LEFT JOIN `instant-medium-261000.helper.Cities` as C on C.city = K.City
WHERE
    RowId > 2600000
))
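
For a regression model, `ML.EVALUATE` reports aggregate error metrics such as `mean_absolute_error`, `mean_squared_error` and `r2_score`. The sketch below shows how those three are conventionally defined, using invented labels and predictions; it is a plain-Python illustration under that assumption, not BigQuery's implementation.

```python
def regression_metrics(labels, predictions):
    """Conventional definitions of MAE, MSE and R^2 for paired values."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_label = sum(labels) / n
    total_ss = sum((y - mean_label) ** 2 for y in labels)
    # R^2 compares the residual sum of squares with the variance of the labels.
    r2 = 1 - (mse * n) / total_ss
    return {"mean_absolute_error": mae, "mean_squared_error": mse, "r2_score": r2}

labels = [0.0, 10.0, 20.0, 30.0]       # invented values
predictions = [2.0, 8.0, 22.0, 28.0]   # invented values
metrics = regression_metrics(labels, predictions)
print(metrics)
```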

## Use your model to predict outcomes

Now that you have evaluated your model, the next step is to use it to predict
outcomes.

In [None]:
%%with_config
%%bigquery df
SELECT
    RowId,
    predicted_label as {LABEL_COLUMN}
FROM
  ML.PREDICT(MODEL {MODEL_NAME},
    (
SELECT
    K.RowId,
    K.Weekend,
    K.Hour,
    K.Month,
    K.City,
    IFNULL(EntryHeading, ExitHeading) as entryheading,
    IFNULL(ExitHeading, EntryHeading) as exitheading,
    IF( EntryStreetName = ExitStreetName , 1, 0) as samestreet,
    SQRT( POW( C.centerLatitude - K.Latitude, 2) + POW(C.centerLongitude - K.Longitude, 2) ) as distance,
    D1.value - D2.value as diffHeading
FROM
  `kaggle-competition-datasets.geotab_intersection_congestion.test` as K
  LEFT JOIN `instant-medium-261000.helper.Directions` as D1 on D1.direction = EntryHeading
  LEFT JOIN `instant-medium-261000.helper.Directions` as D2 on D2.direction = ExitHeading
  LEFT JOIN `instant-medium-261000.helper.Cities` as C on C.city = K.City    
    ))
ORDER BY RowId ASC

## Saving results

In [None]:
df['RowId'] = df['RowId'].apply(str) + '_' + QUERY_CONFIG['LABEL_NUM']
df.rename(columns={'RowId': 'TargetId', QUERY_CONFIG['LABEL_COLUMN']: 'Target'}, inplace=True)
df

Each label is predicted in a separate kernel run; the results are then combined.

In [None]:
df.sort_values(by='TargetId', inplace=True)
df.to_csv('submission_{}.csv'.format(QUERY_CONFIG['LABEL_NUM']))

## Combine datasets into one submission

In [None]:
# Combine the per-label submission files into one DataFrame
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
subm_df = pd.concat(
    [pd.read_csv('submission_{}.csv'.format(i), index_col=[0]) for i in range(6)]
)

subm_df.head()

In [None]:
# Arrange rows in submission order: sort by (row id, target number), then reset the index
def order_func(target_id):
    row, target_num = target_id.split("_")
    return (int(row), int(target_num))

subm_df['order'] = subm_df['TargetId'].apply(order_func)

subm_df.sort_values(by='order', inplace=True)
subm_df.drop(columns=['order'], inplace=True)
subm_df.reset_index(drop=True, inplace=True)
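
The numeric sort key matters because `TargetId` is a string: sorting the raw strings would place `10_0` before `2_0`. The sketch below reuses `order_func` on a few invented ids to show the difference without needing pandas.

```python
def order_func(target_id):
    """Split 'row_target' into a tuple of ints so sorting is numeric."""
    row, target_num = target_id.split("_")
    return (int(row), int(target_num))

ids = ["10_0", "2_1", "2_0", "1_5", "10_1"]  # invented ids

by_string = sorted(ids)               # lexicographic order
by_key = sorted(ids, key=order_func)  # numeric (row, target) order

print(by_string)
print(by_key)
```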

In [None]:
subm_df.head(10) 

In [None]:
subm_df.to_csv("submission.csv", index=False)