# Getting started with BigQuery ML

BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratize machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.


## Objectives
In this demo, you use:

+ BigQuery ML to create a binary logistic regression model using the `CREATE MODEL` statement
+ The `ML.EVALUATE` function to evaluate the ML model
+ `bq extract` to export the model to Cloud Storage
+ `gcloud ai-platform models create` to deploy the model to AI Platform


## Create your dataset

Enter the following code to import the BigQuery Python client library and initialize a client. The BigQuery client sends requests to the BigQuery API and receives its responses.

In [57]:
from google.cloud import bigquery
from datetime import datetime

client = bigquery.Client(location="US")

Next, you create a BigQuery dataset to store your ML model. Run the following to create your dataset:

In [None]:
dataset = client.create_dataset("BQML_tutorial")

## Create your model

Next, you create a logistic regression model using the Google Analytics sample
dataset for BigQuery. The model is used to predict whether a
website visitor will make a transaction. The standard SQL query uses a
`CREATE MODEL` statement to create and train the model. Standard SQL is the
default query syntax for the BigQuery Python client library.

The BigQuery Python client library provides a cell magic,
`%%bigquery`, which runs a SQL query and returns the results as a Pandas
`DataFrame`.

To run the `CREATE MODEL` query to create and train your model:

In [59]:
%%bigquery
CREATE OR REPLACE MODEL `BQML_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

The query takes several minutes to complete. After the first iteration is
complete, your model (`sample_model`) appears in the navigation panel of the
BigQuery web UI. Because the query uses a `CREATE MODEL` statement to create a
model rather than to return rows, you do not see query results. The output is
an empty `DataFrame`.
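
The `_TABLE_SUFFIX` filter in the query works because the daily `ga_sessions_` tables are suffixed with `YYYYMMDD` dates, and zero-padded strings in that format sort in chronological order. The following is a minimal sketch in plain Python of how the inclusive `BETWEEN` bounds select the training window. The helpers `table_suffix` and `in_range` are hypothetical names used only for illustration and are not part of any BigQuery API.

```python
from datetime import date


def table_suffix(day: date) -> str:
    """Format a date the way the ga_sessions_* daily tables are suffixed."""
    return day.strftime("%Y%m%d")


def in_range(suffix: str, start: str, end: str) -> bool:
    """Mimic SQL's inclusive BETWEEN on zero-padded YYYYMMDD strings."""
    return start <= suffix <= end


# Training window used by the CREATE MODEL query in this tutorial.
TRAIN_START, TRAIN_END = "20160801", "20170630"

# The first training day is inside the window.
print(in_range(table_suffix(date(2016, 8, 1)), TRAIN_START, TRAIN_END))
# July 1, 2017 falls just after the window, so it is excluded from training.
print(in_range(table_suffix(date(2017, 7, 1)), TRAIN_START, TRAIN_END))
```

Because the training window ends on `20170630`, any later sessions can be held out for evaluation without overlapping the training data.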

## Evaluate your model

After creating your model, you evaluate the performance of the classifier using
the [`ML.EVALUATE`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate)
function. You can also use the [`ML.ROC_CURVE`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-roc)
function for metrics that are specific to logistic regression.

A class is one of a set of enumerated target values for a label. For
example, in this tutorial you are using a binary classification model that
detects transactions. The two classes are the values in the `label` column:
`0` (no transactions) and `1` (transaction made).
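
To make the two classes concrete, here is a minimal sketch of how a binary logistic regression turns a weighted sum of features into a probability and then into a class. The weights, the bias, and the 0.5 threshold are illustrative assumptions. They are not values taken from the trained `sample_model`.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted class) for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    label = 1 if probability >= threshold else 0
    return probability, label


# Hypothetical weights for two numeric features: pageviews and is_mobile.
weights = [0.3, -0.5]
bias = -4.0

# A mobile visitor with 2 pageviews versus a desktop visitor with 20.
print(predict_class([2, 1], weights, bias))
print(predict_class([20, 0], weights, bias))
```

With these made-up weights, the first visitor is assigned class `0` and the second class `1`, because only the second probability exceeds the threshold.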

To run the `ML.EVALUATE` query that evaluates the model:

In [None]:
%%bigquery
SELECT
  *
FROM ML.EVALUATE(MODEL `BQML_tutorial.sample_model`, (
  SELECT
    IF(totals.transactions IS NULL, 0, 1) AS label,
    IFNULL(device.operatingSystem, "") AS os,
    device.isMobile AS is_mobile,
    IFNULL(geoNetwork.country, "") AS country,
    IFNULL(totals.pageviews, 0) AS pageviews
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))

When the query is complete, the results appear below the query. The
results should look like the following:

![Model evaluation results table](./resources/model-evaluation.png)

Because you performed a logistic regression, the results include the following
columns:

+ [`precision`](https://developers.google.com/machine-learning/glossary/#precision)
+ [`recall`](https://developers.google.com/machine-learning/glossary/#recall)
+ [`accuracy`](https://developers.google.com/machine-learning/glossary/#accuracy)
+ [`f1_score`](https://en.wikipedia.org/wiki/F1_score)
+ [`log_loss`](https://developers.google.com/machine-learning/glossary/#Log_Loss)
+ [`roc_auc`](https://developers.google.com/machine-learning/glossary/#AUC)
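
As a rough illustration of what most of these columns measure, the sketch below computes precision, recall, accuracy, F1 score, and log loss for a handful of made-up labels and predicted probabilities. It uses the textbook definitions with an assumed 0.5 threshold and does not reproduce how BigQuery ML computes them internally. `roc_auc` is omitted to keep the example short.

```python
import math


def classification_metrics(labels, probabilities, threshold=0.5):
    """Compute common binary classification metrics from scratch."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 0)
    tn = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(labels)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Clip probabilities so that log(0) never occurs.
    eps = 1e-15
    log_loss = -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(labels, probabilities)
    ) / len(labels)

    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1_score": f1,
        "log_loss": log_loss,
    }


# Made-up ground-truth labels and predicted probabilities of class 1.
labels = [1, 0, 1, 0, 0, 1]
probabilities = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]

metrics = classification_metrics(labels, probabilities)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```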


## Use your model for batch prediction

Now that you have evaluated your model, the next step is to use it to predict
outcomes. You first use the model to predict the number of transactions made by
website visitors from each country, and then to predict purchases per user.

To run the query that uses the model to predict the number of transactions:

In [None]:
%%bigquery
SELECT
  country,
  SUM(predicted_label) as total_predicted_purchases
FROM ML.PREDICT(MODEL `BQML_tutorial.sample_model`, (
  SELECT
    IFNULL(device.operatingSystem, "") AS os,
    device.isMobile AS is_mobile,
    IFNULL(totals.pageviews, 0) AS pageviews,
    IFNULL(geoNetwork.country, "") AS country
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
  GROUP BY country
  ORDER BY total_predicted_purchases DESC
  LIMIT 10

When the query is complete, the results appear below the query. The
results should look like the following. Because model training is not
deterministic, your results may differ.

![Model predictions table](./resources/transaction-predictions.png)
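
The aggregation in the query above (`SUM(predicted_label)` grouped by country, sorted in descending order, and limited to the top rows) can be mimicked in plain Python. This is only a sketch over made-up rows. The function name `top_countries` and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def top_countries(rows, limit=10):
    """Sum predicted labels per country and return the largest totals first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["country"]] += row["predicted_label"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Made-up rows shaped like the output of ML.PREDICT (one row per session).
rows = [
    {"country": "United States", "predicted_label": 1},
    {"country": "Japan", "predicted_label": 0},
    {"country": "United States", "predicted_label": 1},
    {"country": "Canada", "predicted_label": 1},
    {"country": "Japan", "predicted_label": 0},
]

for country, total in top_countries(rows, limit=2):
    print(country, total)
```

Because each `predicted_label` is `0` or `1`, the sum per country is the number of sessions for which the model predicts a transaction.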

In the next example, you try to predict the number of transactions each website
visitor will make. This query is identical to the previous query except that it
also selects `fullVisitorId` and uses `GROUP BY fullVisitorId` to group the
results by visitor ID instead of by country.

To run the query that predicts purchases per user:

In [None]:
%%bigquery
SELECT
  fullVisitorId,
  SUM(predicted_label) as total_predicted_purchases
FROM ML.PREDICT(MODEL `BQML_tutorial.sample_model`, (
  SELECT
    IFNULL(device.operatingSystem, "") AS os,
    device.isMobile AS is_mobile,
    IFNULL(totals.pageviews, 0) AS pageviews,
    IFNULL(geoNetwork.country, "") AS country,
    fullVisitorId
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
  GROUP BY fullVisitorId
  ORDER BY total_predicted_purchases DESC
  LIMIT 10

When the query is complete, the results appear below the query. The
results should look like the following:

![Purchase predictions table](./resources/purchase-predictions.png)

## Export the model for online prediction

To make real-time predictions, you can host your model on AI Platform Prediction.
AI Platform Prediction manages the computing resources in the cloud that run your models.
You can then send prediction requests to your model and receive predicted target values in response.

In [31]:

PROJECT='remy-sandbox'
BUCKET='ml-demo-rw'
REGION='us-central1'

# Create the bucket if it does not already exist:
# !gsutil mb -l eu gs://$BUCKET


Export the model to Cloud Storage in the TensorFlow SavedModel format:

In [10]:
 
!bq extract -m BQML_tutorial.sample_model \
           gs://$BUCKET/bqml_model_export/BQML_export_model

Waiting on bqjob_r7d9ca5e546c015b_00000175b90b433b_1 ... (23s) Current status: DONE   


When the command finishes, check that the model has been exported:

In [13]:
!gsutil ls gs://$BUCKET/bqml_model_export/BQML_export_model/

gs://ml-demo-rw/bqml_model_export/BQML_export_model/
gs://ml-demo-rw/bqml_model_export/BQML_export_model/saved_model.pb
gs://ml-demo-rw/bqml_model_export/BQML_export_model/assets/
gs://ml-demo-rw/bqml_model_export/BQML_export_model/variables/
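
If you want to check the listing programmatically rather than by eye, the sketch below looks for the two entries a TensorFlow SavedModel needs: the `saved_model.pb` file and the `variables/` directory. The function name `missing_savedmodel_parts` is hypothetical, and the listing is copied from the output above rather than fetched from Cloud Storage.

```python
def missing_savedmodel_parts(listing, export_dir):
    """Return the required SavedModel entries that are absent from a listing."""
    prefix = export_dir.rstrip("/") + "/"
    # Keep only the part of each path that follows the export directory.
    relative = {path[len(prefix):] for path in listing if path.startswith(prefix)}
    required = ["saved_model.pb", "variables/"]
    return [part for part in required if part not in relative]


export_dir = "gs://ml-demo-rw/bqml_model_export/BQML_export_model/"
listing = [
    "gs://ml-demo-rw/bqml_model_export/BQML_export_model/",
    "gs://ml-demo-rw/bqml_model_export/BQML_export_model/saved_model.pb",
    "gs://ml-demo-rw/bqml_model_export/BQML_export_model/assets/",
    "gs://ml-demo-rw/bqml_model_export/BQML_export_model/variables/",
]

missing = missing_savedmodel_parts(listing, export_dir)
print("Export looks complete" if not missing else f"Missing: {missing}")
```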


## Deploy the model to AI Platform Prediction

In [45]:
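# Cloud Storage path of the exported SavedModel
MODEL_DIR = f"gs://{BUCKET}/bqml_model_export/BQML_export_model/"
# Timestamped version name, so that each deployment gets a unique version
VERSION_NAME = "BQML_" + datetime.now().strftime("%d_%m_%Y%H_%M_%S")
MODEL_NAME = "BQML_export_model"
FRAMEWORK = "tensorflow"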

In [60]:
# Optional: uncomment to use a short, fixed version name for a quick demo
# VERSION_NAME = "BQML_a"

In [61]:
print(VERSION_NAME)

BQML_a
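
The sketch below wraps the timestamp pattern used above in a function and adds a sanity check on the resulting name. The check assumes that model and version names must start with a letter and contain only letters, digits, and underscores. Treat that rule as an assumption and confirm it against the AI Platform documentation. Both `make_version_name` and `is_valid_name` are hypothetical helper names.

```python
import re
from datetime import datetime

# Assumed naming rule: a letter followed by letters, digits, or underscores.
VALID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def make_version_name(prefix="BQML", now=None):
    """Build a timestamped version name such as BQML_12_11_202009_30_05."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%d_%m_%Y%H_%M_%S')}"


def is_valid_name(name):
    """Check a name against the assumed naming rule."""
    return VALID_NAME.fullmatch(name) is not None


version_name = make_version_name()
print(version_name, is_valid_name(version_name))
```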


First, create a new model on AI Platform. You can choose to host it on a single regional endpoint or on the global endpoint. Note that you incur charges only when predictions are being run against the endpoint, even though it stays active continuously.
It may take a few minutes to create the endpoint.


In [None]:
!gcloud ai-platform models create $MODEL_NAME \
--regions=$REGION \
--enable-logging

Next, create the version of your model. 

In [63]:
!gcloud beta ai-platform versions create $VERSION_NAME \
    --model=$MODEL_NAME \
    --origin=$MODEL_DIR \
    --python-version=3.7 \
    --runtime-version=2.1 \
    --machine-type=n1-standard-2 # if no machine type specified, will choose default


Creating version (this might take a few minutes)......done.                    


## Send input to the deployed model for Prediction

Create a sample prediction request. Open a terminal window (File > New > Terminal) and run the following commands, changing the `cd` path to the directory that contains your notebook:


In [None]:
# Run these commands in the terminal, changing the cd path to the directory that contains your notebook:
'''
cd tutorials/bigquery
cat > input.json
'''
# Alternatively, you can use the notebook UI to create the input.json file


In [49]:
# Paste the following JSON into the input.json file:
{"os": "Android", "is_mobile": true, "country": "Japan", "pageviews": 1}
# If you used the terminal, press Ctrl+D to save the file and exit

{'os': 'Android', 'is_mobile': 'True', 'country': 'Japan', 'pageviews': 1}
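
As an alternative to pasting the request by hand, you can write `input.json` from Python. This is a minimal sketch that uses only the standard library. It writes one JSON object per line, which is the newline-delimited layout that `--json-instances` reads, and `json.dumps` converts Python's `True` into the JSON literal `true`. The helper name `write_instances` is hypothetical.

```python
import json
from pathlib import Path


def write_instances(path, instances):
    """Write prediction instances as newline-delimited JSON."""
    with open(path, "w") as f:
        for instance in instances:
            f.write(json.dumps(instance) + "\n")


# Same feature names as the columns the model was trained on.
instance = {"os": "Android", "is_mobile": True, "country": "Japan", "pageviews": 1}

write_instances("input.json", [instance])
print(Path("input.json").read_text())
```

To send several instances in one request, pass a longer list. Each instance ends up on its own line of the file.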

Submit the request to the deployed model:

In [55]:
!gcloud ai-platform predict --model $MODEL_NAME \
       --version $VERSION_NAME --json-instances input.json

LABEL_PROBS                                  LABEL_VALUES  PREDICTED_LABEL
[0.0017362785287754875, 0.9982637214712246]  [u'1', u'0']  [u'0']
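
To read this output, pair each entry in `LABEL_VALUES` with the probability at the same position in `LABEL_PROBS`. The sketch below does that for the values shown above. It assumes the two lists are aligned by position, which is consistent with the `PREDICTED_LABEL` in the sample output, and the function names are hypothetical.

```python
def label_probabilities(label_values, label_probs):
    """Pair each label with the probability at the same position."""
    return dict(zip(label_values, label_probs))


def most_likely_label(label_values, label_probs):
    """Return the label with the highest probability."""
    pairs = zip(label_values, label_probs)
    return max(pairs, key=lambda pair: pair[1])[0]


# Values copied from the sample response above.
label_values = ["1", "0"]
label_probs = [0.0017362785287754875, 0.9982637214712246]

probs = label_probabilities(label_values, label_probs)
print(f"P(transaction) = {probs['1']:.4f}")
print(f"Predicted label = {most_likely_label(label_values, label_probs)}")
```

For this request the model assigns a probability of roughly 0.2% to class `1`, so the predicted label is `0` (no transaction).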


## Cleaning up

To delete the resources created by this tutorial, execute the following code to delete the dataset and its contents:

In [None]:
client.delete_dataset(dataset, delete_contents=True)