##### Copyright 2020 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License").
<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->


# Final Exercises

It's time to test everything you learned in these notebooks (which we hope is a lot).

The next exercises are the `hello world` of ETL pipelines: a **WordCount**. You'll first build the standard version and then a modified one. There is also a **streaming** exercise.

As always, there are many possible solutions to this, so it's fine if your solution doesn't match the ones posted here.

In [None]:
import logging
import re
import json
import time
import traceback
from utils.utils import *
from utils.solutions import solutions
import google.auth

from IPython.display import display, HTML

import apache_beam as beam
from apache_beam import FlatMap, Map, ParDo, Flatten, Filter
from apache_beam import Values, CombineGlobally, CombinePerKey
from apache_beam import pvalue, window, WindowInto
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.combiners import Top, Mean, Count
from apache_beam.io.textio import ReadFromText, WriteToText
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery

from apache_beam.runners import DataflowRunner
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

from apache_beam.testing.util import assert_that, is_empty, equal_to

## Standard WordCount

The pipeline is going to read file `kinglear.txt` from a public Cloud Storage bucket and output the number of times each word appears.

**NOTE**: The pipeline counts words case-sensitively, i.e., "Friend" and "friend" are ***not*** counted together.

In [None]:
p = beam.Pipeline(InteractiveRunner())
    
path = "gs://dataflow-samples/shakespeare/kinglear.txt"

def split_words(text):
    words = re.findall(r'[\w\']+', text.strip(), re.UNICODE)
    return #TODO 

#TODO finish pipeline
count = (p 

# For testing the solution - Don't modify
wc =  (count | "Filter" >> Filter(lambda x: x[0] in words_test))

assert_that(wc, equal_to(solutions[7]["wordcount"]))

With `InteractiveRunner` you can even visualize the output data:

In [None]:
ib.show(count, visualize_data=True)

### Hints

**Process elements**
<details><summary>Hint</summary>
<p>
 
The notebook about [I/O operations](5_IOOperations.ipynb) showed that `ReadFromText` reads a file line by line, so each element is one line. From every line (one element) you need to output its words (possibly many elements), which is what `FlatMap` is for.
</p>
</details>

<details><summary>Code</summary>
<p>

```
count = (p | "ReadTxt" >> ReadFromText(path)
           | "Split Words" >> FlatMap(split_words)
```
</p>
</details>
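
To see why `FlatMap` fits here, the following plain-Python sketch (no Beam involved) contrasts the two behaviours. The sample lines are made up for illustration, and the list comprehension and `chain.from_iterable` only mimic what `Map` and `FlatMap` do to the elements.

```python
import re
from itertools import chain

# Two made-up input lines, standing in for elements read by ReadFromText.
lines = ["KING LEAR", "Attend the lords of France"]

def split_words(text):
    return re.findall(r"[\w']+", text.strip())

# Map-like behaviour: exactly one output element (a list) per input line.
mapped = [split_words(line) for line in lines]

# FlatMap-like behaviour: the returned iterables are flattened,
# so every word becomes its own element.
flat_mapped = list(chain.from_iterable(split_words(line) for line in lines))

print(mapped)
print(flat_mapped)
```

With `Map` you would end up counting lists of words; with `FlatMap` you count the words themselves.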

**Split words return**
<details><summary>Hint</summary>
<p>

The variable `words` is a list of the words in each text line, but that is not quite the output you need. To count the words properly you need key-value pairs ( * ): the key is the word, and the value can be `1` (**).
    
(*) You don't actually need key-value pairs in this case; one of the solutions at the end does the same without them.
    
(**) Depending on how you process the key-value pairs, the value either has to be `1` or can be anything. Solutions for both cases are shared below; the first one accepts any value.
</p>
</details>


<details><summary>Code</summary>
<p>
    
```  
    def split_words(text):
        words = re.findall(r'[\w\']+', text.strip(), re.UNICODE)
        return [(x, 1) for x in words]    
```
</p>
</details>
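
As a quick check of what the regular expression in `split_words` keeps and drops, here is a small standalone sketch. The sample sentence is invented for illustration.

```python
import re

# Invented sample line with punctuation, an apostrophe and mixed case.
line = "  Friend, thou'rt my friend!  "

# Same pattern as in split_words: runs of word characters and apostrophes.
words = re.findall(r"[\w']+", line.strip(), re.UNICODE)

# Key-value pairs ready for a per-key count.
pairs = [(word, 1) for word in words]

print(words)
print(pairs)
```

Punctuation is dropped, the apostrophe inside `thou'rt` is kept, and case is preserved, which is why "Friend" and "friend" end up as different keys.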

**Full code**

<details><summary>Solution 1</summary>
<p>
    
```
p = beam.Pipeline(InteractiveRunner())
    
path = "gs://dataflow-samples/shakespeare/kinglear.txt"

def split_words(text):
    words = re.findall(r'[\w\']+', text.strip(), re.UNICODE)
    return [(x, 1) for x in words]

count = (p | "ReadTxt" >> ReadFromText(path)
           | "Split Words" >> FlatMap(split_words)
           | "Count" >> Count.PerKey())

# For testing the solution - Don't modify
wc =  (count | "Filter" >> Filter(lambda x: x[0] in words_test))

assert_that(wc, equal_to(solutions[7]['wordcount']))
```
</p>
</details>
<details><summary>Solution 2</summary>
<p>

This solution doesn't need to output key-value pairs in the `FlatMap`, because it uses `Count.PerElement()`, which counts the number of occurrences of each distinct element in the PCollection.

```
p = beam.Pipeline(InteractiveRunner())
    
path = "gs://dataflow-samples/shakespeare/kinglear.txt"

def split_words(text):
    words = re.findall(r'[\w\']+', text.strip(), re.UNICODE)
    return words

count = (p | "ReadTxt" >> ReadFromText(path)
   | "Split Words" >> FlatMap(split_words)
   | "Count" >> Count.PerElement())

# For testing the solution - Don't modify
wc =  (count | "Filter" >> Filter(lambda x: x[0] in words_test))

assert_that(wc, equal_to(solutions[7]['wordcount']))
```
</p>
</details>

<details><summary>Solution 3</summary>
<p>
    
This solution uses `CombinePerKey` rather than `Count`. It's a lower-level solution that can easily be modified to perform other operations.

```
p = beam.Pipeline(InteractiveRunner())

path = "gs://dataflow-samples/shakespeare/kinglear.txt"

def split_words(text):
    words = re.findall(r'[\w\']+', text.strip(), re.UNICODE)
    return [(x, 1) for x in words]

count = (p | "ReadTxt" >> ReadFromText(path)
           | "Split Words" >> FlatMap(split_words)
           | "Count" >> CombinePerKey(sum))

# For testing the solution - Don't modify
wc =  (count | "Filter" >> Filter(lambda x: x[0] in words_test))

assert_that(wc, equal_to(solutions[7]['wordcount']))
```
</p>
</details>
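
The three solutions differ only in how the counting step treats the value. The sketch below reproduces the three counting styles in plain Python (it does not use Beam) on a made-up word list, to show where the value matters.

```python
from collections import Counter, defaultdict

# Made-up words, as they would come out of the FlatMap step.
words = ["friend", "Friend", "friend", "king"]

# Like Count.PerElement(): count occurrences of each distinct element.
per_element = dict(Counter(words))

# Like Count.PerKey(): count elements per key and ignore the value,
# so any placeholder value works.
pairs_any_value = [(word, "ignored") for word in words]
per_key = defaultdict(int)
for key, _ in pairs_any_value:
    per_key[key] += 1

# Like CombinePerKey(sum): the values are added up, so they must be 1.
pairs_ones = [(word, 1) for word in words]
summed = defaultdict(int)
for key, value in pairs_ones:
    summed[key] += value

print(per_element, dict(per_key), dict(summed))
```

All three produce the same counts here; only the `sum`-based variant depends on the value being `1`.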

## Modified wordcount

Now let's spice things up. The pipeline counts words from two different sources, `kinglear.txt` and `hamlet.txt` (both in the same public Cloud Storage bucket), but it doesn't count every word: stop words (e.g., "and", "for", "to") are discarded. The stop words are stored in a local file, `stopwords.txt`, in the `input` folder. You may add words to or remove words from that file, so the pipeline has to take the file's current contents into account. This time you will store the output locally in a file.

Before you start coding, we recommend that you look at the stop words file so you know how to process it.

The posted solution builds on the previous solution that uses `Count.PerElement()`.

**NOTE**: Although the pipeline counts words case-sensitively, stop words are matched case-insensitively, i.e., both "To" and "to" should be removed. You can use `<string>.lower()` to turn words into lower case.

In [None]:
p = beam.Pipeline(InteractiveRunner())
          
path_1 = "gs://dataflow-samples/shakespeare/kinglear.txt"
path_2 = "gs://dataflow-samples/shakespeare/hamlet.txt"
stopwords_path = "input/stopwords.txt"

output_path = "Output/modified_wordcount"

# TODO: Finish pipeline

count_no_stopwords = (p | )

# Do the file writing here                      
write = count_no_stopwords

# For testing the solution - Don't modify
filtered = count_no_stopwords | "Filter" >> Filter(lambda x: x[0] in words_no_sw_test)
stop_words = count_no_stopwords | "Get StopWords" >> Filter(lambda x: x[0] in sw_test)

assert_that(filtered, equal_to(solutions[7]["modified_wordcount"]), label="words")
assert_that(stop_words, is_empty(), label="stopwords")

### Hints

**Process stop words**
<details><summary>Hint</summary>
<p>

Each word in the stop word file is separated by `", "`, so you can split on that. Since you need several elements from one element, you can use `FlatMap`.
</p>
</details>

<details><summary>Code</summary>
<p>

```
stopwords_p = (p | "Read Stop Words" >> ReadFromText(stopwords_path)
                 | FlatMap(lambda x: x.split(", "))) 
```
</p>
</details>
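
The split itself is plain string handling. In the sketch below the file content is a made-up example line, and the more tolerant variant (splitting on `","` and stripping whitespace) is an optional suggestion, not part of the posted solution.

```python
# Made-up example of one line from the stop words file.
line = "and, for, to, the"

# Exactly what the lambda in the pipeline does.
stopwords = line.split(", ")

# Optional, more tolerant variant: also copes with missing or extra
# spaces after the commas and with a trailing comma.
messy_line = "and,for ,  to, the,"
tolerant = [word.strip() for word in messy_line.split(",") if word.strip()]

print(stopwords)
print(tolerant)
```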

**Split words with condition**
<details><summary>Hint</summary>
<p>

This is the same situation as before, so you need `FlatMap` again (`ParDo` also works, of course). Since the condition depends on data that is only known when the pipeline runs, you need a dynamic parameter, which in Beam means a side input. What you need is the list of stop words, so wrap the stop words PCollection with `AsList` when passing it as a side input.
</p>
</details>

<details><summary>Code</summary>
<p>

```
      
    def split_words(text, stopwords):
        words = re.findall(r'[\w\']+', text.strip(), re.UNICODE)
        return [x for x in words if x.lower() not in stopwords]

     {..}  | "Split Words" >> FlatMap(split_words, stopwords=beam.pvalue.AsList(stopwords_p))
```
</p>
</details>
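
Outside of Beam, the function behaves like any other Python function that takes the stop words as a second argument. The sketch below calls it directly with a hand-written list (an assumption standing in for the side input) to show the case handling.

```python
import re

def split_words(text, stopwords):
    words = re.findall(r"[\w']+", text.strip(), re.UNICODE)
    # Compare in lower case, but keep the original casing in the output.
    return [word for word in words if word.lower() not in stopwords]

# Hand-written stand-in for the side input built with AsList.
stopwords = ["to", "and", "the"]

first = split_words("To be, or not to be", stopwords)
second = split_words("The King and the Fool", stopwords)

print(first)
print(second)
```

Both "To" and "to" are removed, while "King" and "Fool" keep their capital letters for the case-sensitive count.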

**Full code**

<details><summary>Solution</summary>
<p>
    
```
p = beam.Pipeline(InteractiveRunner())
    
path_1 = "gs://dataflow-samples/shakespeare/kinglear.txt"
path_2 = "gs://dataflow-samples/shakespeare/hamlet.txt"
stopwords_path = "input/stopwords.txt"

output_path = "Output/modified_wordcount"

def split_words(text, stopwords):
    words = re.findall(r'[\w\']+', text.strip(), re.UNICODE)
    return [x for x in words if x.lower() not in stopwords]

stopwords_p = (p | "Read Stop Words" >> ReadFromText(stopwords_path)
                 | FlatMap(lambda x: x.split(", "))) 

text_1 = p | "Read Text 1" >> ReadFromText(path_1)
text_2 = p | "Read Text 2" >> ReadFromText(path_2)

count_no_stopwords = ((text_1, text_2)  | Flatten()  
                                        | "Split Words" >> FlatMap(
                                            split_words, 
                                            stopwords=beam.pvalue.AsList(stopwords_p))
                                        | "Count" >> Count.PerElement())

write = count_no_stopwords | "Write" >> WriteToText(file_path_prefix=output_path, file_name_suffix=".txt")

ib.show(count_no_stopwords, visualize_data=True)

# For testing the solution - Don't modify
filtered = count_no_stopwords | "Filter" >> Filter(lambda x: x[0] in words_no_sw_test)
stop_words = count_no_stopwords | "Get StopWords" >> Filter(lambda x: x[0] in sw_test)

assert_that(filtered, equal_to(solutions[7]["modified_wordcount"]), label="words")
assert_that(stop_words, is_empty(), label="stopwords")
```
</p>
</details>
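
To inspect the written output, you can read the result files back with plain Python. This is a sketch under two assumptions: that `WriteToText` keeps its default shard naming (`<prefix>-SSSSS-of-NNNNN<suffix>`) and that each line is the string form of a `(word, count)` tuple. The helper `read_counts` is hypothetical, and the demo writes its own small file instead of relying on real pipeline output.

```python
import ast
import glob
import os
import tempfile

def read_counts(output_prefix, suffix=".txt"):
    """Hypothetical helper: merge all output shards into one dict."""
    counts = {}
    pattern = f"{output_prefix}-*-of-*{suffix}"
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as shard:
            for line in shard:
                line = line.strip()
                if line:
                    # Assumes lines look like "('King', 3)".
                    word, total = ast.literal_eval(line)
                    counts[word] = total
    return counts

# Demo on a fake shard so the sketch runs without the pipeline.
with tempfile.TemporaryDirectory() as tmp_dir:
    prefix = os.path.join(tmp_dir, "modified_wordcount")
    with open(prefix + "-00000-of-00001.txt", "w", encoding="utf-8") as shard:
        shard.write("('King', 3)\n('Fool', 1)\n")
    demo_counts = read_counts(prefix)

print(demo_counts)
```

After running the pipeline, `read_counts("Output/modified_wordcount")` would be the corresponding call, provided the assumptions above hold for your Beam version.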

## Streaming Exercise

The streaming pipeline reads from another topic that we will create (`beambasics-exercise`). Each message has the structure `{name (string), spent (integer)}` (messages are parsed in the pipeline).

It needs to calculate the total amount each buyer (`name`) spends every minute, and write it to BigQuery.

**Important note**: PubSubIO already adds the timestamp to the element (sent time), so you **don't** need to add the timestamp manually with `window.TimestampedValue`.
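
The parsing step can be tried on its own. The sketch below assumes the payload arrives as UTF-8 encoded JSON bytes and uses the example message from this notebook.

```python
import json

# Example payload, assumed to arrive as UTF-8 encoded JSON bytes.
payload = b'{"name": "Guillem", "spent": 10}'

# json.loads accepts bytes as well as str.
message = json.loads(payload)

# Key-value pair for the per-buyer aggregation.
kv = (message["name"], int(message["spent"]))

print(kv)
```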

Let's create the Pub/Sub topic first.

In [None]:
!gcloud pubsub topics create beambasics-exercise

In [None]:
def streaming_pipeline(project_param, region="us-central1"):
    topic = "projects/{}/topics/beambasics-exercise".format(project_param)
    bucket = "gs://beam-basics-{}".format(project_param)
    table = "{}:beam_basics.exercise".format(project_param)
    schema = "name:string,total_spent:integer"

    options = PipelineOptions(
        streaming=True,
        project=project_param,
        region=region,
        staging_location="%s/staging" % bucket,
        temp_location="%s/temp" % bucket
    )
        
    p = beam.Pipeline(DataflowRunner(), options=options)
        
    #TODO: Finish pipeline
    pubsub = (p | "Read Topic" >> ReadFromPubSub(topic=topic)
                | "To dict" >> Map(json.loads) # Example message: {"name": "Guillem", "spent": 10}
                | "To KV" >> Map(lambda x: (x["name"], int(x["spent"]))))
    
    
    return p.run()

To test if the pipeline works, run the following cell. (Hints are below)

The publisher is already imported from the file `utils.py`. Publishing all the messages should take about five minutes (the publisher is throttled), so use this time to check whether the pipeline's outputs are right.

In [None]:
num_messages = 600
project = google.auth.default()[1]
try:
    pipeline = streaming_pipeline(project)
    print("\n PIPELINE RUNNING \n")
    url = ("https://console.cloud.google.com/dataflow/jobs/%s/%s?project=%s" %
     (pipeline._job.location, pipeline._job.id,
      pipeline._job.projectId))
    display(HTML('Click <a href="%s" target="_new">here</a> for the details of your Dataflow job!' % url))
    print("\nLet's wait a bit so the workers can start up \n")
    time.sleep(30)
    print("Ok, let's start the publishing!\n")
    try:
        publish_to_topic(num_messages, "beambasics-exercise", project, notebook_number=7)
        print("\n PUBLISHING DONE\n")
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception:
        print("\n PUBLISHING FAILED")
        traceback.print_exc()
except (KeyboardInterrupt, SystemExit):
    raise
except Exception:
    print("\n PIPELINE FAILED")
    traceback.print_exc()      

### Hints

**Calculate total**
<details><summary>Hint</summary>
<p>

Since you want the total amount of money spent, you need to sum the values per key, which is done with `CombinePerKey`. But since this is a streaming pipeline, you need to window the data before aggregating.
</p>
</details>


<details><summary>Code</summary>
<p>

```
     pubsub | "FixedWindow" >> WindowInto(window.FixedWindows(60))
            | "Sum" >> CombinePerKey(sum)
```
</p>
</details>
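
To build intuition for what a 60-second fixed window followed by a per-key sum produces, here is a plain-Python sketch. It does not use Beam: the timestamps (seconds), the second buyer name and the helper `sum_per_window` are invented for illustration, and real windowing also deals with late data and triggers, which this ignores.

```python
from collections import defaultdict

def sum_per_window(events, window_size=60):
    """Hypothetical helper: total spent per (window start, name)."""
    totals = defaultdict(int)
    for timestamp, name, spent in events:
        # Fixed windows are aligned to multiples of the window size.
        window_start = timestamp - timestamp % window_size
        totals[(window_start, name)] += spent
    return dict(totals)

# Invented events: (timestamp in seconds, name, spent).
events = [
    (5, "Guillem", 10),
    (59, "Guillem", 5),
    (60, "Guillem", 7),
    (61, "Ana", 3),
]

totals = sum_per_window(events)
print(totals)
```

The first two purchases fall into the window starting at second 0 and are summed together, while the purchase at second 60 starts a new total.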

**Write to BigQuery**
<details><summary>Hint</summary>
<p>

You need to do two things to write to BigQuery: prepare the elements and then actually write them. In Python, `WriteToBigQuery` takes either dictionaries or `TableRows` ([doc](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigquery.html?highlight=tablerow#apache_beam.io.gcp.bigquery.TableRowJsonCoder)); this solution uses dictionaries.
    
Make sure the dictionary keys match the schema so that you don't get errors.
</p>
</details>

<details><summary>Code</summary>
<p>

```
      
    def prepare_for_bq(element):
        dictionary = {
            "name": element[0],
            "total_spent": element[1],
        }
        return dictionary
        
    {..}
    
                | "Prepare for BigQuery" >> Map(prepare_for_bq)
                | "Write To BigQuery" >> WriteToBigQuery(table=table, schema=schema,
                                  create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
                                  write_disposition=BigQueryDisposition.WRITE_APPEND))
    
```
</p>
</details>
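
A quick way to check the elements against the schema before sending anything to BigQuery is to compare field names locally. In this sketch `parse_schema` is a hypothetical helper for the simple `name:type,name:type` schema string used in this notebook, and the sample element is made up.

```python
def parse_schema(schema):
    """Hypothetical helper: 'a:string,b:integer' -> {'a': 'string', 'b': 'integer'}."""
    return dict(field.split(":") for field in schema.split(","))

def prepare_for_bq(element):
    name, total = element
    return {"name": name, "total_spent": total}

schema = "name:string,total_spent:integer"
fields = parse_schema(schema)

# Made-up output of the "Sum" step.
row = prepare_for_bq(("Guillem", 15))

# The row keys should be exactly the schema's field names.
keys_match = set(row) == set(fields)
print(fields, row, keys_match)
```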

**Full code**

<details><summary>Solution</summary>
<p>
    
```
def streaming_pipeline(project_param, region="us-central1"):
    
    topic = f"projects/{project_param}/topics/beambasics-exercise"
    bucket = f"gs://beam-basics-{project_param}"
    table = f"{project_param}:beam_basics.exercise"
    schema = "name:string,total_spent:integer"
    
    options = PipelineOptions(
        streaming=True,
        project=project_param,
        region=region,
        staging_location="%s/staging" % bucket,
        temp_location="%s/temp" % bucket
    )
        
    p = beam.Pipeline(DataflowRunner(), options=options)
        
    def prepare_for_bq(element):
        dictionary = {
            "name": element[0],
            "total_spent": element[1],
        }
        return dictionary
        
    pubsub = (p | "Read Topic" >> ReadFromPubSub(topic=topic)
                | "To dict" >> Map(json.loads)
                | "To KV" >> Map(lambda x: (x["name"], int(x["spent"]))) # Example message: {"name": "Guillem", "spent": 10}
                | "FixedWindow" >> WindowInto(window.FixedWindows(60))
                | "Sum" >> CombinePerKey(sum)
                | "Prepare for BigQuery" >> Map(prepare_for_bq)
                | "Write To BigQuery" >> WriteToBigQuery(table=table, schema=schema,
                                      create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
                                      write_disposition=BigQueryDisposition.WRITE_APPEND))
 
    return p.run()

```
</p>
</details>

## Remember to shut down the pipeline when you are done with it

In [None]:
pipeline.cancel()