## Google BigQuery

BigQuery is Google's serverless data warehouse. Because it is serverless, it scales with very little effort, and users can focus on writing queries instead of managing infrastructure.

In BigQuery, the user does not have to worry about the underlying hardware, virtual machines, or the number of nodes or instances. The user simply writes a SQL query and runs it. BigQuery’s execution engine works out the resources needed to run the query as fast as possible, provisions them, executes the query, and releases them.

## Introduction to BigQuery ML

BigQuery ML enables users to create and execute machine learning models in BigQuery using standard SQL queries. BigQuery ML democratizes machine learning by enabling SQL practitioners to build models using existing SQL tools and skills. BigQuery ML increases development speed by eliminating the need to move data.

BigQuery ML uses the standard SQL dialect, which is compliant with the ANSI SQL:2011 standard. The model is created and deployed automatically as part of the training job. One of the biggest advantages is that the data never has to leave the data warehouse, which saves an extra step: traditionally, the data is first moved out of the data store to be pre-processed for feature engineering. BigQuery ML handles feature engineering and pre-processing automatically, out of the box. Because of its serverless architecture, training scales automatically.
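To make "automatic pre-processing" more concrete, here is a small pandas sketch of two common transformations of this kind: standardizing a numeric column and one-hot encoding a categorical one. It is only an illustration on made-up rows, not BigQuery ML's implementation; the column names mirror the training table used later, and the values are invented.

```python
import pandas as pd

# Toy rows; column names mirror the training table, values are made up.
rows = pd.DataFrame({
    "Hour": [0, 8, 17, 23],
    "City": ["Atlanta", "Boston", "Atlanta", "Chicago"],
})

# Numeric feature: center at zero and scale to unit variance.
hour = rows["Hour"]
rows["Hour_std"] = (hour - hour.mean()) / hour.std(ddof=0)

# Categorical feature: one indicator column per distinct value.
city_onehot = pd.get_dummies(rows["City"], prefix="City")

features = pd.concat([rows[["Hour_std"]], city_onehot], axis=1)
print(features)
```

With BigQuery ML none of this has to be written by hand: the raw columns are passed straight to `CREATE MODEL`.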

At the time of writing, BigQuery ML supports the following models:

    1. Linear regression
    2. Binary logistic regression
    3. Multiclass logistic regression for classification
    4. K-means clustering for data segmentation
    5. Previously trained TensorFlow models

In [None]:
# GCP Project Id
PROJECT_ID = 'bigquery-bqml-kaggle'

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location="US")

from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID

## Create Dataset

In [None]:
dataset = client.create_dataset('bqml_intersection', exists_ok=True)

In [None]:
# create a reference to our table
table = client.get_table("kaggle-competition-datasets.geotab_intersection_congestion.train")

# look at five rows from our dataset
client.list_rows(table, max_results=5).to_dataframe()

In [None]:
# create a reference to our table
test_table = client.get_table("kaggle-competition-datasets.geotab_intersection_congestion.test")
# look at five rows from test table
client.list_rows(test_table, max_results=5).to_dataframe()

### Table Schema

In [None]:
# Print information on all the columns in the "train" table
table.schema

In [None]:
# Print information on all the columns in the "test" table
test_table.schema

In [None]:
# Preview the first five entries in the "Latitude" and "Longitude" column of the "train" table
client.list_rows(table, selected_fields=table.schema[2:4], max_results=5).to_dataframe()

In [None]:
# magic command
%load_ext google.cloud.bigquery

### EntryStreetName and ExitStreetName `GROUP BY` City

In [None]:
%%bigquery total_street_name
SELECT
    City,
    COUNT(EntryStreetName) AS EntryStreetNameCount,
    COUNT(ExitStreetName) AS ExitStreetNameCount
FROM `kaggle-competition-datasets.geotab_intersection_congestion.train`
GROUP BY City
ORDER BY City DESC

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
total_street_name.plot(kind='bar', x='City', y=['EntryStreetNameCount','ExitStreetNameCount']);
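One detail worth knowing when reading these counts: `COUNT(column)` counts only the rows where that column is not NULL, so the entry and exit counts for a city can differ. The pandas sketch below reproduces the same aggregation on a few invented rows (the street names and values are made up for illustration).

```python
import pandas as pd

# Invented rows with some missing street names.
train = pd.DataFrame({
    "City": ["Atlanta", "Atlanta", "Boston", "Boston", "Boston"],
    "EntryStreetName": ["Peachtree St", None, "Beacon St", "Beacon St", None],
    "ExitStreetName": ["Peachtree St", "Marietta St", None, "Beacon St", "Boylston St"],
})

counts = (
    train.groupby("City")[["EntryStreetName", "ExitStreetName"]]
    .count()  # counts non-null values only, like SQL COUNT(column)
    .rename(columns={
        "EntryStreetName": "EntryStreetNameCount",
        "ExitStreetName": "ExitStreetNameCount",
    })
    .reset_index()
    .sort_values("City", ascending=False)  # ORDER BY City DESC
)
print(counts)
```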

### EntryHeading and ExitHeading `GROUP BY` City

In [None]:
%%bigquery total_heading
SELECT
    City,
    COUNT(EntryHeading) AS EntryHeadingCount,
    COUNT(ExitHeading) AS ExitHeadingCount
FROM `kaggle-competition-datasets.geotab_intersection_congestion.train`
GROUP BY City
ORDER BY City DESC

In [None]:
total_heading.plot(kind='bar', x='City', y=['EntryHeadingCount','ExitHeadingCount']);

In [None]:
%%bigquery latitude_longitude
SELECT Latitude,Longitude

FROM `kaggle-competition-datasets.geotab_intersection_congestion.train`

In [None]:
sns.relplot(x="Latitude", y="Longitude", data=latitude_longitude);

### IntersectionId count `GROUP BY` City

In [None]:
%%bigquery count_IntersectionId
SELECT City,
COUNT(IntersectionId) AS total_IntersectionId
FROM `kaggle-competition-datasets.geotab_intersection_congestion.train`
GROUP BY City
ORDER BY City

In [None]:
count_IntersectionId.plot(kind='bar', x='City', y='total_IntersectionId');

## Create model

### Train our model:

In [None]:
%%bigquery
CREATE MODEL IF NOT EXISTS `bqml_intersection.total_time_p20`
OPTIONS(model_type='linear_reg') AS
SELECT
    TotalTimeStopped_p20 as label,
    Hour,
    Weekend,
    Month,
    EntryStreetName,
    ExitStreetName,
    EntryHeading,
    ExitHeading,
    Path,
    City
FROM
  `kaggle-competition-datasets.geotab_intersection_congestion.train`
WHERE
    RowId < 2600000

In [None]:
%%bigquery
CREATE MODEL IF NOT EXISTS `bqml_intersection.total_time_p50`
OPTIONS(model_type='linear_reg') AS
SELECT
    TotalTimeStopped_p50 as label,
    Hour,
    Weekend,
    Month,
    EntryStreetName,
    ExitStreetName,
    EntryHeading,
    ExitHeading,
    Path,
    City
FROM
  `kaggle-competition-datasets.geotab_intersection_congestion.train`
WHERE
    RowId < 2600000

In [None]:
%%bigquery
CREATE MODEL IF NOT EXISTS `bqml_intersection.total_time_p80`
OPTIONS(model_type='linear_reg') AS
SELECT
    TotalTimeStopped_p80 as label,
    Hour,
    Weekend,
    Month,
    EntryStreetName,
    ExitStreetName,
    EntryHeading,
    ExitHeading,
    Path,
    City
FROM
  `kaggle-competition-datasets.geotab_intersection_congestion.train`
WHERE
    RowId < 2600000


In [None]:
%%bigquery
CREATE MODEL IF NOT EXISTS `bqml_intersection.distance_p20`
OPTIONS(model_type='linear_reg') AS
SELECT
    DistanceToFirstStop_p20 as label,
    Hour,
    Weekend,
    Month,
    EntryStreetName,
    ExitStreetName,
    EntryHeading,
    ExitHeading,
    Path,
    City
FROM
  `kaggle-competition-datasets.geotab_intersection_congestion.train`
WHERE
    RowId < 2600000

In [None]:
%%bigquery
CREATE MODEL IF NOT EXISTS `bqml_intersection.distance_p50`
OPTIONS(model_type='linear_reg') AS
SELECT
    DistanceToFirstStop_p50 as label,
    Hour,
    Weekend,
    Month,
    EntryStreetName,
    ExitStreetName,
    EntryHeading,
    ExitHeading,
    Path,
    City
FROM
  `kaggle-competition-datasets.geotab_intersection_congestion.train`
WHERE
    RowId < 2600000

In [None]:
%%bigquery
CREATE MODEL IF NOT EXISTS `bqml_intersection.distance_p80`
OPTIONS(model_type='linear_reg') AS
SELECT
    DistanceToFirstStop_p80 as label,
    Hour,
    Weekend,
    Month,
    EntryStreetName,
    ExitStreetName,
    EntryHeading,
    ExitHeading,
    Path,
    City
FROM
  `kaggle-competition-datasets.geotab_intersection_congestion.train`
WHERE
    RowId < 2600000
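The six `CREATE MODEL` statements above differ only in the model name and the label column. As a rough sketch, they could be generated from one template; `create_model_sql`, `TARGETS`, and `FEATURES` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
# Model name -> label column, matching the six CREATE MODEL cells above.
TARGETS = {
    "total_time_p20": "TotalTimeStopped_p20",
    "total_time_p50": "TotalTimeStopped_p50",
    "total_time_p80": "TotalTimeStopped_p80",
    "distance_p20": "DistanceToFirstStop_p20",
    "distance_p50": "DistanceToFirstStop_p50",
    "distance_p80": "DistanceToFirstStop_p80",
}
FEATURES = ["Hour", "Weekend", "Month", "EntryStreetName", "ExitStreetName",
            "EntryHeading", "ExitHeading", "Path", "City"]
TRAIN_TABLE = "kaggle-competition-datasets.geotab_intersection_congestion.train"


def create_model_sql(model_name, label_column, dataset="bqml_intersection",
                     max_row_id=2600000):
    """Return the CREATE MODEL statement for one target (hypothetical helper)."""
    feature_lines = ",\n".join(f"    {name}" for name in FEATURES)
    return (
        f"CREATE MODEL IF NOT EXISTS `{dataset}.{model_name}`\n"
        "OPTIONS(model_type='linear_reg') AS\n"
        "SELECT\n"
        f"    {label_column} AS label,\n"
        f"{feature_lines}\n"
        f"FROM\n  `{TRAIN_TABLE}`\n"
        f"WHERE\n    RowId < {max_row_id}"
    )


statements = {name: create_model_sql(name, label) for name, label in TARGETS.items()}
print(statements["distance_p80"])
```

Each generated statement could then be submitted with `client.query(sql).result()`, assuming the `client` object created at the top of the notebook.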

## Get training statistics

To see the results of the model training, you can use the
[`ML.TRAINING_INFO`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-train)
function, or you can view the statistics in the BigQuery UI.
This tutorial uses the `ML.TRAINING_INFO` function.

A machine learning algorithm builds a model by examining many examples and
attempting to find a model that minimizes loss. This process is called empirical
risk minimization.

Loss is the penalty for a bad prediction: a number indicating
how bad the model's prediction was on a single example. If the model's
prediction is perfect, the loss is zero; otherwise, the loss is greater. The
goal of training a model is to find a set of weights that have low
loss, on average, across all examples.
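As a minimal sketch of what "finding weights with low average loss" means, the NumPy example below fits a one-feature linear model by gradient descent on mean squared error. The data is synthetic, and this is not a description of how BigQuery ML optimizes internally; it only shows the loss shrinking from one iteration to the next, which is the kind of per-iteration figure `ML.TRAINING_INFO` reports.

```python
import numpy as np

# Synthetic data: y is roughly 3*x + 2 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=200)


def mse_loss(w, b):
    """Average squared error of the line w*x + b over all examples."""
    return np.mean((w * x + b - y) ** 2)


w, b = 0.0, 0.0
learning_rate = 0.5
losses = []
for _ in range(500):
    error = w * x + b - y
    losses.append(mse_loss(w, b))
    # Step against the gradient of the mean squared error.
    w -= learning_rate * 2 * np.mean(error * x)
    b -= learning_rate * 2 * np.mean(error)

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.4f}")
print(f"learned w={w:.2f}, b={b:.2f}")
```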

To see the model training statistics that were generated when you ran the
`CREATE MODEL` query, run the following:

In [None]:
%%bigquery
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `bqml_intersection.total_time_p20`)
ORDER BY iteration 

In [None]:
%%bigquery
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `bqml_intersection.distance_p20`)
ORDER BY iteration 

## Evaluate model

In [None]:
%%bigquery
SELECT
  *
FROM ML.EVALUATE(MODEL `bqml_intersection.total_time_p20`, (
  SELECT
    TotalTimeStopped_p20 as label,
    Hour,
    Weekend,
    Month,
    EntryStreetName,
    ExitStreetName,
    EntryHeading,
    ExitHeading,
    Path,
    City
  FROM
    `kaggle-competition-datasets.geotab_intersection_congestion.train`
  WHERE
    RowId >= 2600000))

In [None]:
%%bigquery
SELECT
  *
FROM ML.EVALUATE(MODEL `bqml_intersection.total_time_p50`, (
  SELECT
    TotalTimeStopped_p50 as label,
    Hour,
    Weekend,
    Month,
    EntryStreetName,
    ExitStreetName,
    EntryHeading,
    ExitHeading,
    Path,
    City
  FROM
    `kaggle-competition-datasets.geotab_intersection_congestion.train`
  WHERE
    RowId >= 2600000))
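For a linear regression model, `ML.EVALUATE` reports regression metrics such as `mean_absolute_error`, `mean_squared_error`, `median_absolute_error`, and `r2_score`. The sketch below computes these four by hand on a handful of invented values, purely to show what each number measures; it does not call BigQuery.

```python
import numpy as np

# Invented held-out labels and model predictions.
y_true = np.array([0.0, 10.0, 20.0, 30.0])
y_pred = np.array([2.0, 8.0, 25.0, 27.0])
errors = y_pred - y_true

# Average size of the error, ignoring its sign.
mean_absolute_error = np.mean(np.abs(errors))
# Average squared error; large misses are penalized more heavily.
mean_squared_error = np.mean(errors ** 2)
# Typical error size, robust to a few very bad predictions.
median_absolute_error = np.median(np.abs(errors))
# Share of the label variance explained by the model (1.0 is perfect).
r2_score = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mean_absolute_error, mean_squared_error, median_absolute_error, r2_score)
```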

## Predict outcomes

In [None]:
%%bigquery df_1
SELECT
  RowId,
  predicted_label as Target
FROM
  ML.PREDICT(MODEL `bqml_intersection.distance_p20`,
    (
    SELECT
        RowId,
        Hour,
        Weekend,
        Month,
        EntryStreetName,
        ExitStreetName,
        EntryHeading,
        ExitHeading,
        Path,
        City
    FROM
      `kaggle-competition-datasets.geotab_intersection_congestion.test`))
    ORDER BY RowId ASC

In [None]:
%%bigquery df_2
SELECT
  RowId,
  predicted_label as Target
FROM
  ML.PREDICT(MODEL `bqml_intersection.distance_p50`,
    (
    SELECT
        RowId,
        Hour,
        Weekend,
        Month,
        EntryStreetName,
        ExitStreetName,
        EntryHeading,
        ExitHeading,
        Path,
        City
    FROM
      `kaggle-competition-datasets.geotab_intersection_congestion.test`))
    ORDER BY RowId ASC

In [None]:
%%bigquery df_3
SELECT
  RowId,
  predicted_label as Target
FROM
  ML.PREDICT(MODEL `bqml_intersection.distance_p80`,
    (
    SELECT
        RowId,
        Hour,
        Weekend,
        Month,
        EntryStreetName,
        ExitStreetName,
        EntryHeading,
        ExitHeading,
        Path,
        City
    FROM
      `kaggle-competition-datasets.geotab_intersection_congestion.test`))
    ORDER BY RowId ASC

In [None]:
%%bigquery df_4
SELECT
  RowId,
  predicted_label as Target
FROM
  ML.PREDICT(MODEL `bqml_intersection.total_time_p20`,
    (
    SELECT
        RowId,
        Hour,
        Weekend,
        Month,
        EntryStreetName,
        ExitStreetName,
        EntryHeading,
        ExitHeading,
        Path,
        City
    FROM
      `kaggle-competition-datasets.geotab_intersection_congestion.test`))
    ORDER BY RowId ASC

In [None]:
%%bigquery df_5
SELECT
  RowId,
  predicted_label as Target
FROM
  ML.PREDICT(MODEL `bqml_intersection.total_time_p50`,
    (
    SELECT
        RowId,
        Hour,
        Weekend,
        Month,
        EntryStreetName,
        ExitStreetName,
        EntryHeading,
        ExitHeading,
        Path,
        City
    FROM
      `kaggle-competition-datasets.geotab_intersection_congestion.test`))
    ORDER BY RowId ASC

In [None]:
%%bigquery df_6
SELECT
  RowId,
  predicted_label as Target
FROM
  ML.PREDICT(MODEL `bqml_intersection.total_time_p80`,
    (
    SELECT
        RowId,
        Hour,
        Weekend,
        Month,
        EntryStreetName,
        ExitStreetName,
        EntryHeading,
        ExitHeading,
        Path,
        City
    FROM
      `kaggle-competition-datasets.geotab_intersection_congestion.test`))
    ORDER BY RowId ASC

## Output as CSV

Let's format the results to fit the submission schema. The [format of the submission file](https://www.kaggle.com/c/bigquery-geotab-intersection-congestion/data) requires two columns: `TargetId` and `Target`, the latter holding the predictions. Each `TargetId` has the form `{RowId}_{n}`, where the suffix identifies which of the six metrics the prediction is for. Predictions for `TotalTimeStopped_p20`, for example, have a `TargetId` of `{RowId}_0`, and their `Target` is the predicted value for `TotalTimeStopped_p20`.

In [None]:
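import pandas as pd
# TargetId is '{RowId}_{n}'. Suffix order assumed from the competition's metric list
# (0-2: TotalTimeStopped p20/p50/p80, 3-5: DistanceToFirstStop p20/p50/p80);
# verify it against the competition data page.
df_4['RowId'] = df_4['RowId'].apply(str) + '_0'  # total_time_p20
df_5['RowId'] = df_5['RowId'].apply(str) + '_1'  # total_time_p50
df_6['RowId'] = df_6['RowId'].apply(str) + '_2'  # total_time_p80
df_1['RowId'] = df_1['RowId'].apply(str) + '_3'  # distance_p20
df_2['RowId'] = df_2['RowId'].apply(str) + '_4'  # distance_p50
df_3['RowId'] = df_3['RowId'].apply(str) + '_5'  # distance_p80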

In [None]:
df = pd.concat([df_1, df_2, df_3, df_4, df_5, df_6], axis=0)

In [None]:
df.rename(columns={'RowId': 'TargetId'}, inplace=True)

In [None]:
# df['RowId'] = df['RowId'].apply(str) + '_0'
# df.rename(columns={'RowId': 'TargetId', 'TotalTimeStopped_p20': 'Target'}, inplace=True)
# df

Finally, you'll want to output the results as a CSV. 

In [None]:
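# Start from the sample submission so the output uses the expected TargetId values
submission = pd.read_csv("../input/bigquery-geotab-intersection-congestion/sample_submission.csv")
# Both frames have a Target column, so the merge produces Target_x (sample placeholder)
# and Target_y (our predictions); keep only the predictions
submission = submission.merge(df, on='TargetId')
submission.rename(columns={'Target_y': 'Target'}, inplace=True)
submission = submission[['TargetId', 'Target']]
submission.to_csv('submission.csv', index=False)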
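Because the merge above is an inner join, any `TargetId` without a matching prediction is silently dropped. A quick check before uploading can catch that. The sketch below uses a hypothetical helper, `check_submission`, and toy frames standing in for the sample submission and the merged result.

```python
import pandas as pd


def check_submission(submission, sample):
    """Return a list of problems found in a submission frame (hypothetical helper)."""
    problems = []
    if list(submission.columns) != ["TargetId", "Target"]:
        return ["columns must be exactly TargetId, Target"]
    if submission["TargetId"].duplicated().any():
        problems.append("duplicate TargetId values")
    missing = set(sample["TargetId"]) - set(submission["TargetId"])
    if missing:
        problems.append(f"{len(missing)} TargetId values missing")
    if submission["Target"].isna().any():
        problems.append("missing predictions")
    return problems


# Toy frames standing in for sample_submission.csv and the merged result.
sample = pd.DataFrame({"TargetId": ["0_0", "0_1", "1_0", "1_1"], "Target": [0, 0, 0, 0]})
complete = pd.DataFrame({"TargetId": ["0_0", "0_1", "1_0", "1_1"],
                         "Target": [1.5, 2.0, 0.0, 3.2]})
incomplete = complete.iloc[:3]

print(check_submission(complete, sample))
print(check_submission(incomplete, sample))
```

An empty list means the frame has the expected columns, no duplicate or missing `TargetId` values, and no missing predictions.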