##### Copyright 2021 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License").
<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->


# Beam SQL in notebooks
## What's the difference between pipelines using Beam SQL and those not using it?

This example uses a COVID positive case analysis to demonstrate the difference between pipelines that use Beam SQL and those that do not. Please run `Apache_Beam_SQL_in_notebooks.ipynb` to learn Beam SQL basics.

In [None]:
# The notebook environment should have docker and jdk 1.8 installed.
!docker image list
!java -version

In [None]:
# Optionally set the logging level to reduce distraction.
import logging

logging.root.setLevel(logging.ERROR)

Let's build a pipeline that finds the record for the state with the most COVID positive cases on a specific day.
## A pipeline without using Beam SQL

### Get the data

In [None]:
# The COVID Tracking Project has stopped collecting new data; its data ends on 2021-03-07,
# and 'https://covidtracking.com/api/v1/states/current.json' no longer works.
# Instead, load the COVID case data for 2021-03-07 from the assets directory.
import json
with open("../assets/covid_case_20210307.json", "r") as fh:
    current_data = json.load(fh)

### Create a PCollection from the data

In [None]:
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
from apache_beam.runners.interactive import interactive_beam as ib

p = beam.Pipeline(runner=InteractiveRunner())
raw_data = p | 'Create PCollection from json' >> beam.Create(current_data)

### Create an orderable wrapper for the data

In [None]:

from functools import total_ordering

@total_ordering
class UsCovidDataOrderByPositive:
    def __init__(self, data):
        self.data = data
    
    def __gt__(self, other):
        # Treat a missing (None) count as 0 so the comparison never raises
        # a TypeError when either side has no value.
        return (self.data['positive'] or 0) > (other.data['positive'] or 0)

### Pick 4 columns from the data

The code below uses a plain dictionary to hold the data.

You don't have to define a schema or be explicitly aware of the type of each field.

But this can be dangerous in a more complicated pipeline, where your assumptions about the data may be incorrect.

With `visualize_data=True` in `show()`, you can spot the element with the maximum `positive` value in the visualization by setting `Color By` to `positive`.

In [None]:
data = raw_data | 'Parse' >> beam.Map(
    lambda e: {
        'date': e['date'], 
        'state': e['state'], 
        'positive': e['positive'], 
        'negative': e['negative']})
ib.show(data, visualize_data=True)

Find the element with the highest `positive` value.

This is rather verbose: you have to wrap the data in an orderable wrapper, find the maximum entry by that ordering, and then unwrap the data.

In [None]:
entry_with_max_positive = (
    data | 'Data OrderByPositive' >> beam.Map(lambda e: UsCovidDataOrderByPositive(e))
         | 'Find Maximum Positive' >> beam.CombineGlobally(max)
         | 'Convert Back to Data' >> beam.Map(lambda orderable_data: orderable_data.data))
ib.show(entry_with_max_positive)
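
For comparison, the wrap, combine, and unwrap steps above can be sketched in plain Python, outside Beam. This is only an illustration: the records are made-up sample values rather than rows from the dataset, and `positive_or_zero` is a hypothetical helper that assumes a missing count should rank as 0.

```python
# Made-up sample records; the real data has many more fields.
sample = [
    {'date': 20210307, 'state': 'AA', 'positive': 100, 'negative': 900},
    {'date': 20210307, 'state': 'BB', 'positive': 250, 'negative': None},
    {'date': 20210307, 'state': 'CC', 'positive': None, 'negative': 40},
]


def positive_or_zero(record):
    """Hypothetical helper: treat a missing positive count as 0."""
    return record['positive'] or 0


# A key function supplies the ordering, so no wrap/unwrap step is needed.
entry = max(sample, key=positive_or_zero)
print(entry['state'])
```

In the pipeline above, the wrapper class plays the role of this key function.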

## A pipeline using Beam SQL

Let's build the same pipeline but with Beam SQL.

There is something wrong with the schema below. It works with normal Beam usage in Python, but it doesn't work with Beam SQL. Can you spot the mistakes?

In [None]:
from typing import NamedTuple


class UsCovidData(NamedTuple):
    date: str
    state: str
    positive: int
    negative: int

The answer:

- `date` is a keyword in (Calcite) SQL. Use a different field name, such as `partition_date`.
- `date` in the data is an integer, not a `str`. Convert the value with `str()` or declare `date: int`.
- `negative` has missing values, which appear as `None`. So instead of `negative: int`, use `negative: Optional[int]`, or convert `None` to 0 when populating the schema.
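
Type mismatches like these can be caught by inspecting the data before writing the schema. The sketch below is one possible way to do that, not part of the original notebook: `field_types` is a hypothetical helper, and the records are made-up values shaped like the dataset.

```python
from collections import defaultdict


def field_types(records, fields):
    """Hypothetical helper: collect the Python type names seen for each field."""
    seen = defaultdict(set)
    for record in records:
        for field in fields:
            seen[field].add(type(record.get(field)).__name__)
    return dict(seen)


# Made-up sample records shaped like the dataset.
sample = [
    {'date': 20210307, 'state': 'AA', 'positive': 100, 'negative': 900},
    {'date': 20210307, 'state': 'BB', 'positive': 250, 'negative': None},
]
types_seen = field_types(sample, ['date', 'state', 'positive', 'negative'])
print(types_seen)
```

In the notebook, the same helper could be run on `current_data`. A field that reports `NoneType` alongside `int` needs `Optional[int]` in the schema.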

In [None]:
from typing import Optional


# Adjusted schema based on the data
class UsCovidData(NamedTuple):
    partition_date: str  # Remember to str(e['date']).
    state: str
    positive: int
    negative: Optional[int]

### Read data

In [None]:
p_sql = beam.Pipeline(runner=InteractiveRunner())
covid_data = (p_sql 
        | 'Create PCollection from json' >> beam.Create(current_data)
        | 'Parse' >> beam.Map(
            lambda e: UsCovidData(
                partition_date=str(e['date']),
                state=e['state'],
                positive=e['positive'],
                negative=e['negative'])).with_output_types(UsCovidData))
ib.show(covid_data)

### Find the maximum positive value

In [None]:
%%beam_sql -o max_positive
SELECT partition_date, MAX(positive) AS `positive`
FROM covid_data
GROUP BY partition_date

#### Join the maximum positive value with the original data to get the remaining fields

The code below also handles the `negative is None` case by substituting a default value of 0.

In [None]:
%%beam_sql -o entry_with_max_positive
SELECT covid_data.partition_date, covid_data.state, covid_data.positive, {fn IFNULL(covid_data.negative, 0)} AS `negative`
FROM covid_data JOIN max_positive
USING (partition_date, positive)
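
To make the logic of the two queries concrete, here is a rough plain-Python equivalent. It is a sketch on made-up sample rows, not how Beam SQL executes the queries, and the variable names are chosen only for illustration.

```python
# Made-up sample rows as (partition_date, state, positive, negative) tuples.
rows = [
    ('20210307', 'AA', 100, 900),
    ('20210307', 'BB', 250, None),
    ('20210306', 'AA', 90, 880),
]

# Query 1: SELECT partition_date, MAX(positive) ... GROUP BY partition_date
max_by_date = {}
for date, _state, positive, _negative in rows:
    if positive is not None:
        max_by_date[date] = max(positive, max_by_date.get(date, positive))

# Query 2: inner join on (partition_date, positive), with IFNULL(negative, 0).
# Rows with a missing positive value never match, as with NULL in SQL.
result = [
    (date, state, positive, 0 if negative is None else negative)
    for date, state, positive, negative in rows
    if positive is not None and max_by_date.get(date) == positive
]
print(result)
```

Because the join matches on equal values, a tie for the maximum on one date yields one output row per tied state, both here and in the SQL version.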