##### Copyright 2021 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License").
<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->


# Beam SQL in notebooks
## Run with DataflowRunner

This example demonstrates how to run Beam SQL on Dataflow with the `DataflowRunner`.
If you are new to Beam SQL, run `Apache_Beam_SQL_in_notebooks.ipynb` first to learn the basics.

In [None]:
# The notebook environment needs Docker and JDK 1.8 installed.
!docker image list
!java -version

In [None]:
# Optional: raise the logging level to reduce noise in the output.
import logging

logging.root.setLevel(logging.ERROR)
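
As a quick check of what this setting does, the sketch below uses only the standard `logging` module. The logger name `example` is a hypothetical placeholder. A logger without its own level inherits the root level, so messages below `ERROR` are dropped.

```python
import logging

# Raise the root logger threshold, as in the cell above.
logging.root.setLevel(logging.ERROR)

# "example" is a hypothetical logger name used only for this check.
logger = logging.getLogger("example")

# Loggers without an explicit level inherit their effective level from the root.
warning_enabled = logger.isEnabledFor(logging.WARNING)
error_enabled = logger.isEnabledFor(logging.ERROR)
print(warning_enabled, error_enabled)
```

To see more detail again while debugging, set the root level back to `logging.WARNING` or `logging.INFO`.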

Let's install the `names` package to randomly generate some names.

In [None]:
%pip install names

Import all modules needed for this example.

In [None]:
import names
import typing

import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
from apache_beam.runners.interactive import interactive_beam as ib

Create a pipeline `p` with the `InteractiveRunner`.

In [None]:
p = beam.Pipeline(InteractiveRunner())

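Define a `Person` schema as a `typing.NamedTuple`; its fields become the columns that Beam SQL queries can reference.

In [None]: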
class Person(typing.NamedTuple):
    id: int
    name: str
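
To check which columns a query could refer to, you can inspect the class with the standard library alone. This is a minimal sketch that needs neither Beam nor the `names` package; the sample value is made up for illustration.

```python
import typing


class Person(typing.NamedTuple):
    id: int
    name: str


# Field names, in declaration order.
field_names = Person._fields
# Mapping of field name to declared type.
field_types = typing.get_type_hints(Person)

print(field_names)
print(field_types)

# "Ada Lovelace" is a made-up value used only for illustration.
sample = Person(id=1, name="Ada Lovelace")
print(sample.id, sample.name)
```

Each element of the PCollection created in the next cell is one such `Person` instance.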

In [None]:
# Create 10 Person elements (ids 5 through 14) with randomly generated full names.
persons_2 = (p
             | beam.Create([Person(id=x, name=names.get_full_name()) for x in range(5, 15)]))
ib.show(persons_2)

## Run Beam SQL on Dataflow via `beam_sql` magic

Next, run the Beam SQL query on Dataflow by specifying `-r DataflowRunner`.

The cell generates a form below it where you fill in the minimum pipeline options required. Some fields may be auto-populated from the context of this notebook environment.

The form has two buttons:
- `Run on Dataflow` submits a Dataflow job from this notebook.
- `Show Options` shows you the current pipeline options configured for the job to be submitted.

**Important**: If you're using Beam built from source code, run the cell that sets `sdk_location` (the one after the `beam_sql` cell below) before clicking the `Run on Dataflow` button in the generated form.

**Tips**: In the form generated by the `beam_sql` magic, use `gs://your-GCS-bucket` as the `GCS Bucket` and put `names` in `Additional Packages`. The output PCollection is automatically saved to Cloud Storage at `gs://your-GCS-bucket/staging/on_dataflow`.
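
The output location described in the tips can be sketched as a simple string composition. The helper `staging_output_prefix` below is hypothetical and not part of Beam. It only mirrors the `gs://your-GCS-bucket/staging/on_dataflow` pattern stated above, and the objects actually written by the job may carry additional suffixes.

```python
def staging_output_prefix(gcs_bucket: str, output_name: str) -> str:
    """Hypothetical helper: mirror the path pattern described in the tips."""
    # Drop any trailing slash so the joined path has exactly one separator.
    return f"{gcs_bucket.rstrip('/')}/staging/{output_name}"


# `on_dataflow` is the output name passed with `-o` to the beam_sql magic.
prefix = staging_output_prefix("gs://your-GCS-bucket", "on_dataflow")
print(prefix)
```

Listing the bucket's `staging` folder after the job finishes is the reliable way to find the exact object names.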

In [None]:
# You might need to update ipywidgets if the form is not displayed.
# %pip install -U ipywidgets

In [None]:
%%beam_sql -o on_dataflow -r DataflowRunner
SELECT * FROM persons_2

In [None]:
# Uncomment and execute if you're using Beam built from source code.
# from apache_beam.options.pipeline_options import SetupOptions
# options_on_dataflow.view_as(SetupOptions).sdk_location = '/dir/to/your/apache-beam-x.xx.x.tar.gz'

In [None]:
# Replace the placeholder path with your actual bucket and output file, then run this once the Dataflow job is done.
!gsutil cat 'gs://your-GCS-bucket/staging/abc-00000-of-00001'