##### Copyright 2020 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License").
<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->


# Example 2: Streaming Word Count

This example demonstrates how to set up a streaming pipeline that reads from a
[Google Pub/Sub](https://cloud.google.com/pubsub) topic. Each message in the Pub/Sub topic is a word from Shakespeare's play *King Lear*.

The difference between this example and [Example 1](01-Word_Count.ipynb) is that Example 1
takes in a **bounded** source as an input, while this example takes in
an *infinite* data stream, or an **unbounded** source, that is constantly providing data in real time.

The pipeline performs a frequency count on each of those words by window.

You can use this notebook to explore the data in each `PCollection`.

Note that running this example may incur a small [charge](https://cloud.google.com/pubsub/pricing#message_delivery_pricing) if your aggregated Pub/Sub usage is past the free tier.


Check your gcloud configuration. If no account is specified, the account used is the default Compute Engine service account, which looks like `${project_number}-compute@developer.gserviceaccount.com`.

Check the [IAM](https://cloud.google.com/iam) configuration of the account to **ensure it has authorization** to run the code in your notebooks. In this example, to read from a Pub/Sub topic, the account needs at least the `Pub/Sub Editor` role.

In [None]:
!gcloud config list

Let's make sure the Pub/Sub API is enabled.

In [None]:
!gcloud services list | grep -q pubsub && echo "Enabled" || echo "Not enabled"

If the Pub/Sub service is not enabled and the account in use has the project `Editor` role, uncomment and run the command below to
[allow](https://cloud.google.com/apis/docs/getting-started#enabling_apis) your project to access the Pub/Sub service.

Otherwise, log in as the project admin, or ask the project admin, to enable the Pub/Sub service in the console.

In [None]:
#!gcloud services enable pubsub.googleapis.com

Starting with the necessary imports:

In [None]:
import apache_beam as beam
from apache_beam.runners.interactive import interactive_runner
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
import google.auth

Next, we set up the options used to create the streaming pipeline:

In [None]:
# Setting up the Apache Beam pipeline options.
options = pipeline_options.PipelineOptions()

# Sets the pipeline mode to streaming, so we can stream the data from PubSub.
options.view_as(pipeline_options.StandardOptions).streaming = True

# Sets the project to the default project in your current Google Cloud environment.
# The project will be used for creating a subscription to the Pub/Sub topic.
_, options.view_as(GoogleCloudOptions).project = google.auth.default()

The pipeline reads from Google Cloud Pub/Sub, which is an unbounded source. By default, *Apache Beam Notebooks* will record
data from the unbounded sources for replayability. 

In this example, the pipeline reads from a public Pub/Sub topic `projects/pubsub-public-data/topics/shakespeare-kinglear` that outputs multiple words from *King Lear* every second.


In [None]:
# The Google Cloud PubSub topic for this example.
topic = "projects/pubsub-public-data/topics/shakespeare-kinglear"

The following sets how long the Interactive Runner records data from each unbounded source, here 2 minutes (120 seconds). These recordings enable a deterministic replay of the entire pipeline.

In [None]:
ib.options.recording_duration = '2m'
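
To make the idea of recording and replay concrete, here is a minimal pure-Python sketch. It is not how the Interactive Runner is implemented; `unbounded_words`, `record`, the vocabulary, and the rate of two words per second are all hypothetical stand-ins. It shows only that capturing a bounded segment of an endless stream gives you data you can replay identically.

```python
import itertools


def unbounded_words():
    """Simulate an unbounded source: yields (event_time_seconds, word) forever."""
    vocabulary = ["king", "lear", "fool", "storm"]  # hypothetical sample words
    for i in itertools.count():
        # Hypothetical rate: one word every 0.5 seconds.
        yield (i * 0.5, vocabulary[i % len(vocabulary)])


def record(source, duration_seconds):
    """Keep the elements whose event time falls inside the recording duration."""
    return list(itertools.takewhile(lambda e: e[0] < duration_seconds, source))


# Capture a bounded, 120-second segment of the endless stream.
recording = record(unbounded_words(), duration_seconds=120)

# Every replay of the recording yields exactly the same bounded data.
first_replay = list(recording)
second_replay = list(recording)
print(len(recording))                 # 240
print(first_replay == second_replay)  # True
```

Because the recording is a fixed list, any computation rerun on it sees the same input, which is what makes iterative changes to a pipeline comparable.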

The following creates a pipeline with the *Interactive Runner* as the runner with the options we just created.

In [None]:
p = beam.Pipeline(interactive_runner.InteractiveRunner(), options=options)

This creates a `PTransform` that creates a subscription to the given Pub/Sub topic and reads from that subscription.

In [None]:
words = p | "read" >> beam.io.ReadFromPubSub(topic=topic)

Because we are reading from an unbounded source, we need to create a windowing scheme so that we can
count the words by window. The following creates fixed windowing with each window being 10 seconds in duration.
For more information about windowing in Apache Beam, visit the [Apache Beam Programming Guide](https://beam.apache.org/documentation/programming-guide/#windowing-basics).


In [None]:
windowed_words = (words 
                  | "window" >> beam.WindowInto(beam.window.FixedWindows(10)))
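
Fixed windows split event time into equal, non-overlapping intervals. The sketch below illustrates the arithmetic that maps a timestamp to its window, assuming timestamps in seconds and windows aligned to an offset of 0. `fixed_window` is a hypothetical helper for illustration, not part of the Beam API.

```python
def fixed_window(timestamp, size, offset=0):
    """Return (start, end) of the fixed window containing timestamp.

    The window includes its start and excludes its end.
    """
    start = timestamp - (timestamp - offset) % size
    return (start, start + size)


print(fixed_window(3, 10))     # (0, 10)
print(fixed_window(10, 10))    # (10, 20)
print(fixed_window(27.5, 10))  # (20.0, 30.0)
```

An element stamped exactly at a boundary, such as 10, belongs to the window that starts there rather than the one that ends there.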

The following `PTransform` will count the words by window.

In [None]:
windowed_word_counts = (windowed_words
                        | "count" >> beam.combiners.Count.PerElement())
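
Counting per element on a windowed `PCollection` produces one set of counts for each window. As a rough pure-Python analogy, assuming hypothetical `(event_time, word)` pairs and 10-second windows, the result looks like this:

```python
from collections import Counter

WINDOW_SIZE = 10  # seconds, matching FixedWindows(10)

# Hypothetical (event_time_seconds, word) pairs standing in for Pub/Sub messages.
events = [(1, "king"), (4, "lear"), (7, "king"), (12, "king"), (18, "fool")]

counts_by_window = {}
for event_time, word in events:
    # Each element is counted only within the window that contains it.
    window_start = event_time - event_time % WINDOW_SIZE
    counts_by_window.setdefault(window_start, Counter())[word] += 1

for window_start, counts in sorted(counts_by_window.items()):
    print(window_start, dict(counts))
# 0 {'king': 2, 'lear': 1}
# 10 {'king': 1, 'fool': 1}
```

The word `king` appears three times overall, but its count is split across the two windows.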

The `ib.show()` method takes a `PCollection` as a parameter, runs the pipeline that contributes to it, and
shows its content as data comes in. This method will return when all the data has been read.

The optional parameter `include_window_info=True` will include the window information for each element in the output.
You will see 3 additional columns: `event_time`, `windows`, and `pane_info`.
`event_time` is the timestamp associated with the value.
`windows` in this example tells you the start timestamp of the window and its duration.
`pane_info` describes the [triggering](https://beam.apache.org/documentation/programming-guide/#triggers) information for the pane that contained the value.

This example does not use custom triggering, so by default there is only one pane per window, labeled `Pane 0`.

Note that this also automatically records a bounded segment of the unbounded source until the 2-minute recording duration passes. To stop `ib.show()` early, click the button with the tooltip `Interrupt the kernel` in the toolbar (see [FAQ #5. Why does the `ib.collect` or `ib.show` take forever to finish execution? How do I stop it?](../../faq.md#q5)).

In [None]:
ib.show(windowed_word_counts, include_window_info=True)

You can provide 2 more options to limit how much of the recorded data to show.

The optional parameter `n=20` limits `ib.show()` to at most 20 elements. If not set, the default value `inf` tails all elements until the source recording is over.

The optional parameter `duration=30` limits `ib.show()` to elements computed from the first 30 seconds' worth of data from the recorded sources. If not set, the default value `inf` tails all elements until the source recording is over.

If both parameters are set, `ib.show()` stops as soon as either threshold is met. The `ib.show()` call below therefore shows at most 20 elements, computed from the first 30 seconds' worth of data from the recorded sources.

In [None]:
ib.show(windowed_word_counts, include_window_info=True, n=20, duration=30)
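
The "whichever threshold is met first" rule can be sketched in plain Python. This is a simplified illustration, not the `ib.show()` implementation: `limit` is a hypothetical helper, and it applies both limits to the same list, whereas in the notebook `duration` applies to the recorded source data and `n` to the elements of the `PCollection` being shown.

```python
def limit(recorded, n=float("inf"), duration=float("inf")):
    """Yield (event_time, element) pairs until either limit is reached."""
    shown = 0
    for event_time, element in recorded:
        if shown >= n or event_time >= duration:
            break
        yield (event_time, element)
        shown += 1


# Hypothetical recording: one element every 2 seconds for 2 minutes.
recorded = [(t, f"word-{t}") for t in range(0, 120, 2)]

print(len(list(limit(recorded, n=20, duration=30))))  # 15
print(len(list(limit(recorded, n=5, duration=30))))   # 5
print(len(list(limit(recorded))))                     # 60
```

In the first call the 30-second limit is reached before 20 elements are seen; in the second, the element limit is reached first.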

Because we have recorded a bounded segment of the unbounded source, the following will show the same data
as the previous `ib.show()` call. This is to ensure replayability so that you can iteratively augment
your pipeline and verify the output with the same input, which you will see in future cells in this notebook.
Note the parameter `visualize_data=True`. This optional parameter gives you a visualization of the data (see [FAQ #3. How do I read the visualization?](../../faq.md#q3)).

In [None]:
ib.show(windowed_word_counts, include_window_info=True, visualize_data=True, n=20, duration=30)

As mentioned, to ensure replayability for iterative prototyping of your pipeline,
`ib.show()` calls reuse the recorded data by default. You can change this behavior so that
each call always fetches new data:


In [None]:
# Uncomment and run this only if you would like to change the replay behavior:
# ib.options.enable_recording_replay = False

The following `PTransform` will convert the words to lowercase and then count them by window.

In [None]:
windowed_lower_word_counts = (windowed_words
                              | beam.Map(lambda word: word.lower())
                              | "count" >> beam.combiners.Count.PerElement())

Assuming you have not changed `ib.options.enable_recording_replay`, the following returns counts computed from the same words
as before, but lowercased.
Because all words are converted to lowercase before being counted, some words will have a higher count than before.

In [None]:
ib.show(windowed_lower_word_counts, include_window_info=True, n=20, duration=10)
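
The effect of lowercasing before counting can be seen in a small standalone sketch. The payloads below are hypothetical; they are written as `bytes` because `ReadFromPubSub` yields message data as bytes by default, and `bytes.lower()` works the same way as `str.lower()` for ASCII letters.

```python
from collections import Counter

# Hypothetical payloads standing in for messages read from the topic.
words = [b"The", b"the", b"King", b"king", b"KING", b"Lear"]

# Without lowercasing, each capitalization is a distinct element.
print(Counter(words)[b"king"])  # 1

# After lowercasing, the variants collapse into one element with a higher count.
lower_counts = Counter(word.lower() for word in words)
print(lower_counts[b"king"])    # 3
print(lower_counts[b"the"])     # 2
```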

The following gives you a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) that represents the `PCollection`.

In [None]:
ib.collect(windowed_lower_word_counts, include_window_info=True, n=20, duration=30)

You can use the optional parameters `n` and `duration` in `ib.show()` or `ib.collect()` to control how much of the recorded data a streaming pipeline operates on, which lets you iterate on your pipeline faster. However, if you want to stop a recording early or record fresh data, use the `ib.recordings` APIs to explicitly control the long-running background source recording jobs.

In [None]:
# The describe method reports the current status of the background source recording job for a given streaming pipeline.
# A cancelled state means the background source recording job has already hit one of the hard caps configured in ib.options.
ib.recordings.describe(p)
# The stop method stops the background source recording job immediately.
ib.recordings.stop(p)
# The clear method clears the recorded data.
ib.recordings.clear(p)
# The record method explicitly starts a new background source recording job for the given streaming pipeline.
# If `clear` is called while `record` is not called, the next `ib.show()` or `ib.collect()` starts a new recording implicitly.
# ib.recordings.record(p)

After the above `clear` and the optional explicit `record`, the `ib.collect` call below generates different output.

In [None]:
ib.collect(windowed_lower_word_counts, include_window_info=True, n=20, duration=30)

When you are done with this example, you might want to visit the [Pub/Sub subscription page](https://console.cloud.google.com/cloudpubsub/subscription/list) to delete any subscriptions created by this example.

Just like the first example, this example is designed to run easily on a single machine. If the input stream has a very high volume, add an output sink to your `PCollection` result by doing something like:
```
windowed_lower_word_counts | beam.io.<some output transform>
```
and let [Google Cloud Dataflow](https://cloud.google.com/dataflow) run your pipeline.

You can find the list of built-in input and output transforms [here](https://beam.apache.org/documentation/io/built-in/).

Refer to the [user guide](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development) on how to run a Dataflow job using a pipeline assembled from your notebook. You can also refer to [this walkthrough](Dataflow_Word_Count.ipynb) which is based on the [first word count example notebook](01-Word_Count.ipynb).

If you have any feedback on this notebook, drop us a line at beam-notebooks-feedback@google.com.