##### Copyright 2020 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License").
<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->


# How to run the examples on Dataflow

This notebook illustrates how to run the pipeline in the [First Word Count example](01-Word_Count.ipynb) with the Dataflow Runner, instead of the Interactive Runner.

Note that running this example incurs a small [charge](https://cloud.google.com/dataflow/pricing) from Dataflow.

Let's make sure the Dataflow API is enabled. This [allows](https://cloud.google.com/apis/docs/getting-started#enabling_apis) your project to access the Dataflow service:


In [None]:
!gcloud services enable dataflow.googleapis.com

Start with the necessary imports:


In [None]:
import re
import apache_beam as beam
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.runners import DataflowRunner
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib


import google.auth

This next cell was copied from the [01-Word_Count](01-Word_Count.ipynb) example. It contains the same pipeline construction code that performs a word count on text files hosted on Cloud Storage.

In [None]:
class ReadWordsFromText(beam.PTransform):
    
    def __init__(self, file_pattern):
        self._file_pattern = file_pattern
    
    def expand(self, pcoll):
        return (pcoll.pipeline
                | beam.io.ReadFromText(self._file_pattern)
                | beam.FlatMap(lambda line: re.findall(r'[\w\']+', line.strip(), re.UNICODE)))
    
p = beam.Pipeline(InteractiveRunner())

words = p | 'read' >> ReadWordsFromText('gs://apache-beam-samples/shakespeare/kinglear.txt')

counts = (words 
          | 'count' >> beam.combiners.Count.PerElement())

lower_counts = (words
                | "lower" >> beam.Map(lambda word: word.lower())
                | "lower_count" >> beam.combiners.Count.PerElement())

Note that the `Pipeline` is constructed with an `InteractiveRunner`, so you can use operations such as `ib.collect` or `ib.show`.

In [None]:
ib.show(counts)
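
For intuition, the two counts built above can be reproduced on a small in-memory sample using only the Python standard library. This is a sketch rather than part of the pipeline: the sample lines, the helper name `extract_words`, and the `sample_*` variable names are assumptions made for illustration. Only the regular expression is taken from `ReadWordsFromText`.

```python
import re
from collections import Counter

# Illustrative sample text (not read from Cloud Storage).
sample_lines = [
    "Nothing will come of nothing.",
    "Speak again, speak again.",
]

def extract_words(line):
    """Split a line into words with the same regex as ReadWordsFromText."""
    return re.findall(r'[\w\']+', line.strip(), re.UNICODE)

sample_words = [word for line in sample_lines for word in extract_words(line)]

# Counterpart of `counts`: case-sensitive count per distinct word.
sample_counts = Counter(sample_words)

# Counterpart of `lower_counts`: lowercase each word first, then count.
sample_lower_counts = Counter(word.lower() for word in sample_words)

print(sample_counts['Nothing'], sample_counts['nothing'])  # counted separately
print(sample_lower_counts['nothing'])                      # merged after lowercasing
```

The `sample_*` names are deliberately different from `words`, `counts`, and `lower_counts`, so running this cell does not overwrite the `PCollection`s defined earlier.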

### Dataflow Additions

Now, for something a bit different. Because Dataflow executes in the cloud, you need to output to a cloud sink. In this case, you are loading the transformed data into Cloud Storage.

First, set up the `PipelineOptions` to tell the Dataflow service which Google Cloud project to use, the region in which to run the Dataflow job, and the Cloud Storage locations for staging and temporary files.

In [None]:
# Setting up the Apache Beam pipeline options.
options = pipeline_options.PipelineOptions(flags=[])

# Sets the project to the default project in your current Google Cloud environment.
_, options.view_as(GoogleCloudOptions).project = google.auth.default()

# Sets the Google Cloud Region in which Cloud Dataflow runs.
options.view_as(GoogleCloudOptions).region = 'us-central1'

In [None]:
# IMPORTANT! Adjust the following to choose a Cloud Storage location.
dataflow_gcs_location = 'gs://<CHANGE ME>/dataflow'

In [None]:
# Dataflow Staging Location. This location is used to stage the Dataflow Pipeline and SDK binary.
options.view_as(GoogleCloudOptions).staging_location = '%s/staging' % dataflow_gcs_location

# Dataflow Temp Location. This location is used to store temporary files or intermediate results before finally outputting to the sink.
options.view_as(GoogleCloudOptions).temp_location = '%s/temp' % dataflow_gcs_location
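
Because the placeholder in `dataflow_gcs_location` is easy to overlook, a small guard can catch it before the job is submitted. The following is a minimal sketch using only the standard library. The helper name `derive_dataflow_locations` and the bucket name `my-bucket` are hypothetical, and the helper is not part of Apache Beam.

```python
def derive_dataflow_locations(base_location):
    """Return staging, temp, and output paths under a Cloud Storage base path.

    Hypothetical helper: raises ValueError if the base path is not a gs://
    URL or still contains the '<CHANGE ME>' placeholder.
    """
    if not base_location.startswith('gs://'):
        raise ValueError('Expected a gs:// location, got %r' % base_location)
    if '<CHANGE ME>' in base_location:
        raise ValueError('Replace <CHANGE ME> with your bucket name.')
    base = base_location.rstrip('/')
    return {
        'staging': '%s/staging' % base,
        'temp': '%s/temp' % base,
        'output': '%s/output' % base,
    }

# 'my-bucket' is a made-up bucket name used only for illustration.
example_locations = derive_dataflow_locations('gs://my-bucket/dataflow')
print(example_locations['staging'])
print(example_locations['temp'])
```

Calling the helper with the unmodified placeholder value raises a `ValueError`, which is easier to diagnose than a failed job submission.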

In [None]:
# The directory to store the output files of the job.
output_gcs_location = '%s/output' % dataflow_gcs_location

# Specifying the Cloud Storage location to write `counts` to,
# based on the `output_gcs_location` variable set earlier.
(counts | 'Write counts to Cloud Storage' 
 >> beam.io.WriteToText(output_gcs_location + '/wordcount-output.txt'))

# Specifying the Cloud Storage location to write `lower_counts` to,
# based on the `output_gcs_location` variable set earlier.
(lower_counts | 'Write lower counts to Cloud Storage' 
 >> beam.io.WriteToText(output_gcs_location + '/wordcount-lower-output.txt'))

In [None]:
# IMPORTANT! Ensure that the graph is correct before sending it out to Dataflow.
# Because this is a notebook environment, unintended additions to the graph may have occurred when rerunning cells. 
ib.show_graph(p)

### Running the pipeline

Now you are ready to run the pipeline on Dataflow. `run_pipeline()` runs the pipeline and returns a pipeline result object.

In [None]:
pipeline_result = DataflowRunner().run_pipeline(p, options=options)

Using the `pipeline_result` handle, the following code builds a link to the Google Cloud Console web page that shows you details of the Dataflow job you just started:

In [None]:
from IPython.display import display, HTML
url = ('https://console.cloud.google.com/dataflow/jobs/%s/%s?project=%s' % 
      (pipeline_result._job.location, pipeline_result._job.id, pipeline_result._job.projectId))
display(HTML('Click <a href="%s" target="_new">here</a> for the details of your Dataflow job!' % url))
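
The URL construction above can also be wrapped in a small function that escapes its inputs. This is a sketch under the assumption that the console URL keeps the `jobs/<location>/<job id>?project=<project>` shape used in the cell above. The function name `dataflow_job_url` and the example values are hypothetical.

```python
from urllib.parse import quote, urlencode

def dataflow_job_url(location, job_id, project_id):
    """Build the Cloud Console URL for a Dataflow job (hypothetical helper)."""
    path = 'https://console.cloud.google.com/dataflow/jobs/%s/%s' % (
        quote(location, safe=''), quote(job_id, safe=''))
    # urlencode escapes characters that are not safe in a query string.
    return '%s?%s' % (path, urlencode({'project': project_id}))

# Made-up values for illustration only.
example_url = dataflow_job_url('us-central1', 'example-job-id', 'my-project')
print(example_url)
```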

Wait for the job to finish. The following call blocks notebook execution until the Dataflow job completes, which takes a few minutes. In this example, the Dataflow job takes much longer to finish than the same pipeline run directly in the notebook environment, because Dataflow needs time to set up the environment for parallel execution. This is a very small job; Dataflow is better suited to parallelizing and running large jobs (for example, thousands of files with billions of words).

In [None]:
pipeline_result.wait_until_finish()

### Checking the results

Now that the job is finished, check the results in Cloud Storage using the [`gsutil`](https://cloud.google.com/storage/docs/gsutil) command-line tool. Note that `beam.io.WriteToText` writes the results in a sharded set of output files. For example, if the output is specified as `gs://my_bucket/output_directory/result.txt`, the results are written in files with names like `gs://my_bucket/output_directory/result.txt-<shard>-of-<number-of-shards>`.

In [None]:
!gsutil ls {output_gcs_location}
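
The shard naming described above can be sketched in plain Python. This assumes the default `WriteToText` shard name template, which pads the shard index and the shard count to five digits. The helper name `shard_file_names` is hypothetical, and in a real job the runner chooses the number of shards.

```python
def shard_file_names(file_path_prefix, num_shards):
    """List file names of the form <prefix>-SSSSS-of-NNNNN (hypothetical helper)."""
    return ['%s-%05d-of-%05d' % (file_path_prefix, shard, num_shards)
            for shard in range(num_shards)]

# Three shards is an arbitrary number chosen for illustration.
for name in shard_file_names('gs://my_bucket/output_directory/result.txt', 3):
    print(name)
```

This is why the `gsutil cat` commands in this notebook use a trailing `*` wildcard: it matches every shard of a given output.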

Now check the content of the output files by printing their first 10 lines.

In [None]:
!gsutil cat {output_gcs_location}/wordcount-output.txt* | head -10

In [None]:
!gsutil cat {output_gcs_location}/wordcount-lower-output.txt* | head -10
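
If you download the output, each line can be parsed back into a `(word, count)` pair. The sketch below assumes that each line is the string form of a Python tuple, such as `('word', 3)`, which is what writing `(word, count)` elements as text is expected to produce. The sample lines are invented for illustration and are not real job output, and `parse_count_line` is a hypothetical helper.

```python
import ast

# Illustrative lines in the assumed output format; not real job output.
sample_output_lines = [
    "('king', 3)",
    "('lear', 7)",
    "('fool', 5)",
]

def parse_count_line(line):
    """Parse one "('word', count)" line into a (word, count) tuple."""
    word, count = ast.literal_eval(line.strip())
    return word, int(count)

parsed = [parse_count_line(line) for line in sample_output_lines]

# Sort by count, highest first, and keep the two most frequent words.
top_words = sorted(parsed, key=lambda pair: pair[1], reverse=True)[:2]
print(top_words)
```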

That's it! Using this technique, you can also try launching Dataflow jobs for other examples listed.

If you have any feedback on this notebook, drop us a line at beam-notebooks-feedback@google.com.