# Table of Contents
1. [Hello world](#Hello-world)
2. [Hello world (with Map)](#Hello-world-(with-Map))
3. [Hello world (with FlatMap)](#Hello-world-(with-FlatMap))
4. [Hello world (with FlatMap and yield)](#Hello-world-(with-FlatMap-and-yield))
5. [Counting words](#Counting-words)
6. [Counting words with GroupByKey](#Counting-words-with-GroupByKey)
7. [Type hints](#Type-hints)
8. [BigQuery](#BigQuery)
9. [Combiner Examples](#Combiner-Examples)
10. [More Examples](#More-Examples)
11. [Organizing Your Code](#Organizing-Your-Code)

# Hello world

Create a transform from an iterable and use the pipe operator to chain transforms:

In [1]:
# Standard imports
import google.cloud.dataflow as df
# Create a pipeline executing on a direct runner (local, non-cloud).
p = df.Pipeline('DirectPipelineRunner')
# Create a PCollection with names and write it to a file.
(p
 | df.Create('add names', ['Ann', 'Joe'])
 | df.Write('save', df.io.TextFileSink('./output/names')))
# Execute the pipeline.
p.run()

<google.cloud.dataflow.runners.direct_runner.DirectPipelineResult at 0x7fde7bf6da90>

In [2]:
!head ./output/names

Ann
Joe
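
The pipe syntax used above is ordinary Python operator overloading. The following is a minimal sketch of the idea in plain Python, assuming a hypothetical `Step` class that wraps a function over a list; it is not the SDK's implementation and only illustrates how `|` can chain steps from left to right.

```python
class Step(object):
    """Hypothetical stand-in for a transform: wraps a function over a list."""

    def __init__(self, label, fn):
        self.label = label
        self.fn = fn

    def __ror__(self, values):
        # Python calls this for `values | step` because a plain list
        # does not define `|` itself. Apply the step, return the new list.
        return self.fn(values)


names = ['Ann', 'Joe']
result = (names
          | Step('upper', lambda xs: [x.upper() for x in xs])
          | Step('exclaim', lambda xs: [x + '!' for x in xs]))
print(result)  # ['ANN!', 'JOE!']
```

Each `|` hands the output of the left-hand side to the step on the right, which is why a pipeline reads top to bottom.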


# Hello world (with Map)

The `Map` transform takes a callable that is applied to each element of the input `PCollection`; the value it returns becomes one element of the output `PCollection`. Extra arguments passed to `Map` (here `'Hello'`) are forwarded to the callable after the element.

In [3]:
import google.cloud.dataflow as df
p = df.Pipeline('DirectPipelineRunner')
# Read file with names, add a greeting for each, and write results.
(p
 | df.Read('load messages', df.io.TextFileSource('./output/names'))
 | df.Map('add greeting',
          lambda name, msg: '%s %s!' % (msg, name),
          'Hello')
 | df.Write('save', df.io.TextFileSink('./output/greetings')))
p.run()

<google.cloud.dataflow.runners.direct_runner.DirectPipelineResult at 0x7fde8deedb10>

In [4]:
!head ./output/greetings

Hello Ann!
Hello Joe!
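
To make the one-in, one-out behavior concrete, here is a plain-Python sketch of what the `'add greeting'` step computes. The helper `map_elements` is hypothetical and stands in for the transform; it runs on an in-memory list instead of a `PCollection`.

```python
def map_elements(elements, fn, *args):
    """Hypothetical helper: call fn(element, *args) once per element."""
    return [fn(element, *args) for element in elements]


names = ['Ann', 'Joe']
greetings = map_elements(
    names,
    lambda name, msg: '%s %s!' % (msg, name),
    'Hello')  # extra argument, forwarded to the lambda as `msg`
print(greetings)  # ['Hello Ann!', 'Hello Joe!']
```

The output always has exactly as many elements as the input.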


# Hello world (with FlatMap)

A `FlatMap` is like a `Map`, except that its callable returns a (possibly empty) iterable, and each item of that iterable becomes a separate element of the output `PCollection`.

In [7]:
import google.cloud.dataflow as df
p = df.Pipeline('DirectPipelineRunner')
# Read the names file, add every greeting to each name, and write results.
(p
 | df.Read('load messages', df.io.TextFileSource('./output/names'))
 | df.FlatMap('add greetings',
              lambda name, msgs: ['%s %s!' % (m, name) for m in msgs],
              ['Hello', 'Hola'])
 | df.Write('save', df.io.TextFileSink('./output/greetings')))
p.run()

<google.cloud.dataflow.runners.direct_runner.DirectPipelineResult at 0x7fde8deede90>

In [8]:
!head ./output/greetings

Hello Ann!
Hola Ann!
Hello Joe!
Hola Joe!
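
The flattening step can be sketched in plain Python as follows. The helper `flat_map_elements` is hypothetical; it only mirrors the behavior described above on an in-memory list, including the case where the callable returns an empty iterable.

```python
def flat_map_elements(elements, fn, *args):
    """Hypothetical helper: concatenate the iterables returned by fn."""
    output = []
    for element in elements:
        output.extend(fn(element, *args))
    return output


greetings = flat_map_elements(
    ['Ann', 'Joe'],
    lambda name, msgs: ['%s %s!' % (m, name) for m in msgs],
    ['Hello', 'Hola'])
print(greetings)  # ['Hello Ann!', 'Hola Ann!', 'Hello Joe!', 'Hola Joe!']

# Returning an empty list drops the element entirely.
non_empty = flat_map_elements(
    ['Ann', '', 'Joe'],
    lambda name: [name] if name else [])
print(non_empty)  # ['Ann', 'Joe']
```

Two input names with two greetings each give four output elements, which matches the file contents shown above.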


# Hello world (with FlatMap and yield)

The callable passed to a `FlatMap` can also be a generator function, that is, a function that uses `yield`.

In [9]:
import google.cloud.dataflow as df
p = df.Pipeline('DirectPipelineRunner')
# Add greetings using a FlatMap function using yield.
def add_greetings(name, messages):
  for m in messages:
    yield '%s %s!' % (m, name)

(p
 | df.Read('load names', df.io.TextFileSource('./output/names'))
 | df.FlatMap('greet', add_greetings, ['Hello', 'Hola'])
 | df.Write('save', df.io.TextFileSink('./output/greetings')))
p.run()

<google.cloud.dataflow.runners.direct_runner.DirectPipelineResult at 0x7fde7b60bb90>

In [10]:
!head ./output/greetings

Hello Ann!
Hola Ann!
Hello Joe!
Hola Joe!
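
A generator works here because calling a generator function returns an iterable. The sketch below uses only standard Python, outside any pipeline, to show what `add_greetings` hands back for a single name.

```python
import types


def add_greetings(name, messages):
    for m in messages:
        yield '%s %s!' % (m, name)


gen = add_greetings('Ann', ['Hello', 'Hola'])
is_generator = isinstance(gen, types.GeneratorType)
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # a generator can be iterated only once

print(is_generator)  # True
print(first_pass)    # ['Hello Ann!', 'Hola Ann!']
print(second_pass)   # []
```

Compared with building a list, the generator produces its items lazily, one at a time.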


# Counting words

This example counts the words in a text and also shows how to read a text file from Google Cloud Storage.

In [11]:
import re
import google.cloud.dataflow as df
p = df.Pipeline('DirectPipelineRunner')
(p
 | df.Read('read',
           df.io.TextFileSource(
           'gs://dataflow-samples/shakespeare/kinglear.txt'))
 | df.FlatMap('split', lambda x: re.findall(r'\w+', x))
 | df.combiners.Count.PerElement('count words')
 | df.Write('write', df.io.TextFileSink('./output/results')))
p.run()

<google.cloud.dataflow.runners.direct_runner.DirectPipelineResult at 0x7fde7b630250>

In [12]:
!head ./output/results

(u'wants', 1)
(u'whose', 15)
(u'Duke', 8)
(u'helps', 1)
(u'disclaim', 1)
(u'Mum', 1)
(u'shell', 1)
(u'gone', 17)
(u'battles', 1)
(u'between', 9)
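
For comparison, the same split-then-count logic can be sketched with the standard library alone. The two input lines are made-up sample text standing in for the file contents, and `collections.Counter` plays the role that `Count.PerElement` plays in the pipeline.

```python
import collections
import re

# Made-up sample lines; the pipeline above reads these from a text file.
lines = ['to be or not to be', 'that is the question']

# Same regular expression as the 'split' step.
words = []
for line in lines:
    words.extend(re.findall(r'\w+', line))

counts = collections.Counter(words)
print(sorted(counts.items()))
```

As in the pipeline, the match is case-sensitive, so `Duke` and `duke` would be counted as different words.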


# Counting words with GroupByKey

Here we use `GroupByKey` to count the words. This is a somewhat forced example of `GroupByKey`; normally you would use the `df.combiners.Count.PerElement` transform, as in the previous example. The example also shows how to use a wildcard when specifying the text file source.

In [13]:
import re
import google.cloud.dataflow as df
p = df.Pipeline('DirectPipelineRunner')
class MyCountTransform(df.PTransform):
  def apply(self, pcoll):
    return (pcoll
            | df.Map('one word', lambda w: (w, 1))
            # GroupByKey accepts a PCollection of (w, 1) and
            # outputs a PCollection of (w, (1, 1, ...))
            | df.GroupByKey('group words')
            | df.Map('count words',
                     lambda (word, counts): (word, len(counts))))

(p
 | df.Read('read', df.io.TextFileSource('./output/names*'))
 | df.FlatMap('split', lambda x: re.findall(r'\w+', x))
 | MyCountTransform()
 | df.Write('write', df.io.TextFileSink('./output/results')))
p.run()

<google.cloud.dataflow.runners.direct_runner.DirectPipelineResult at 0x7fde7a6900d0>

In [14]:
!head ./output/results

(u'Ann', 1)
(u'Joe', 1)
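
The three steps inside `MyCountTransform` can be sketched in plain Python as follows. The helper `group_by_key` is hypothetical and works on an in-memory list, and the input words are sample data chosen so that one key has more than one value.

```python
import collections


def group_by_key(pairs):
    """Hypothetical helper: collect all values that share a key."""
    groups = collections.OrderedDict()
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return list(groups.items())


words = ['Ann', 'Joe', 'Ann']                        # sample input
pairs = [(w, 1) for w in words]                      # 'one word'
grouped = group_by_key(pairs)                        # 'group words'
counts = [(w, len(ones)) for w, ones in grouped]     # 'count words'

print(grouped)  # [('Ann', [1, 1]), ('Joe', [1])]
print(counts)   # [('Ann', 2), ('Joe', 1)]
```

Counting the grouped ones with `len` gives the same result as summing them, which is why a combiner is the more natural tool for this job.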


# Type hints

In some cases, you can improve the efficiency of the data encoding by providing type hints. For example:

In [1]:
import google.cloud.dataflow as df
from google.cloud.dataflow.typehints import typehints
p = df.Pipeline('DirectPipelineRunner')
(p
 | df.Read('A', df.io.TextFileSource('./output/names'))
 | df.Map('B1', lambda x: (x, 1)).with_output_types(typehints.KV[str, int])
 | df.GroupByKey('GBK')
 | df.Write('C', df.io.TextFileSink('./output/results')))
p.run()

<google.cloud.dataflow.runners.direct_runner.DirectPipelineResult at 0x7f43ce3f2210>

In [2]:
!head ./output/results

(u'Ann', [1])
(u'Joe', [1])
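
One way to see why known types can help: a generic serializer has to describe the types inside every payload, while a layout fixed in advance does not. The sketch below compares `pickle` with a hand-written encoding for a `(str, int)` pair. The byte layout is invented for illustration and is an assumption, not the encoding the SDK uses.

```python
import pickle
import struct


def encode_kv_str_int(record):
    """Illustrative layout: 2-byte key length, UTF-8 key, 8-byte integer."""
    key, value = record
    key_bytes = key.encode('utf-8')
    return (struct.pack('>H', len(key_bytes))
            + key_bytes
            + struct.pack('>q', value))


def decode_kv_str_int(data):
    (key_len,) = struct.unpack('>H', data[:2])
    key = data[2:2 + key_len].decode('utf-8')
    (value,) = struct.unpack('>q', data[2 + key_len:])
    return (key, value)


record = ('Ann', 1)
generic = pickle.dumps(record)       # types are described in the payload
typed = encode_kv_str_int(record)    # types are agreed on in advance

print(len(generic), len(typed))
print(decode_kv_str_int(typed))      # ('Ann', 1)
```

The typed form is shorter here because the reader already knows it will find a string followed by an integer.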


# BigQuery

Here is a pipeline that reads input from a BigQuery table and writes the result to a different table. It calculates the number of tornadoes per month from weather data. To run it, you need to provide an output table that you can write to. The first cell below uses pandas to preview a few rows of the input table.

In [12]:
import pandas as pd

query = "SELECT month, tornado FROM [clouddataflow-readonly:samples.weather_stations] LIMIT 10"

# Named `preview` rather than `df`, which other cells use as the Dataflow module alias.
preview = pd.read_gbq(query, project_id='YOUR-PROJECT', private_key='YOUR-PRIVATE-KEY')
preview.head()

Requesting query... ok.
Query running...
Query done.
Processed: 8.8 kb

Retrieving results...
Got 10 rows.

Total time taken 0.89 s.
Finished at 2016-05-14 11:16:04.


|   | month | tornado |
|---|-------|---------|
| 0 | 5     | False   |
| 1 | 10    | False   |
| 2 | 3     | False   |
| 3 | 11    | True    |
| 4 | 2     | False   |


In [4]:
import google.cloud.dataflow as df
input_table = 'clouddataflow-readonly:samples.weather_stations'
project = 'YOUR-PROJECT'
output_table = 'DATASET.TABLENAME'
p = df.Pipeline(argv=['--project', project])
(p
 | df.Read('read', df.io.BigQuerySource(input_table))
 | df.FlatMap(
     'months with tornadoes',
     lambda row: [(int(row['month']), 1)] if row['tornado'] else [])
 | df.CombinePerKey('monthly count', sum)
 | df.Map('format', lambda (k, v): {'month': k, 'tornado_count': v})
 | df.Write('write', df.io.BigQuerySink(
      output_table,
      schema='month:INTEGER, tornado_count:INTEGER',
      create_disposition=df.io.BigQueryDisposition.CREATE_IF_NEEDED,
      write_disposition=df.io.BigQueryDisposition.WRITE_TRUNCATE)))
p.run()

<google.cloud.dataflow.runners.direct_runner.DirectPipelineResult at 0x7f43ede86cd0>

In [6]:
!bq head -n 10 YOUR-PROJECT:DATASET.TABLENAME

+-------+---------------+
| month | tornado_count |
+-------+---------------+
|     2 |             7 |
|     4 |             5 |
|     3 |             6 |
|    10 |            10 |
|    12 |            10 |
|     8 |             4 |
|     7 |             8 |
|     1 |            16 |
|     5 |             6 |
|     6 |             5 |
+-------+---------------+
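
The per-row logic of this pipeline can be traced in plain Python. The rows below are hypothetical sample data shaped like the `month` and `tornado` columns; the three steps mirror `'months with tornadoes'`, `'monthly count'`, and `'format'`, without touching BigQuery.

```python
import collections

# Hypothetical sample rows, not real weather data.
rows = [
    {'month': 1, 'tornado': True},
    {'month': 1, 'tornado': False},
    {'month': 5, 'tornado': True},
    {'month': 1, 'tornado': True},
]

# 'months with tornadoes': emit (month, 1) only for tornado rows.
pairs = []
for row in rows:
    pairs.extend([(int(row['month']), 1)] if row['tornado'] else [])

# 'monthly count': sum the ones for each month.
totals = collections.defaultdict(int)
for month, one in pairs:
    totals[month] += one

# 'format': build one dictionary per output table row.
formatted = [{'month': k, 'tornado_count': v}
             for k, v in sorted(totals.items())]
print(formatted)
# [{'month': 1, 'tornado_count': 2}, {'month': 5, 'tornado_count': 1}]
```

The dictionary keys match the column names in the `schema` argument of the sink.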


Here is a pipeline that produces the same result, that is, the number of tornadoes per month, but uses a query to filter and aggregate the input in BigQuery instead of reading the whole table.

In [7]:
import google.cloud.dataflow as df
project = 'YOUR-PROJECT'
output_table = 'DATASET.TABLENAME'
input_query = 'SELECT month, COUNT(month) AS tornado_count ' \
        'FROM [clouddataflow-readonly:samples.weather_stations] ' \
        'WHERE tornado=true GROUP BY month'
p = df.Pipeline(argv=['--project', project])
(p
 | df.Read('read', df.io.BigQuerySource(query=input_query))
 | df.Write('write', df.io.BigQuerySink(
      output_table,
      schema='month:INTEGER, tornado_count:INTEGER',
      create_disposition=df.io.BigQueryDisposition.CREATE_IF_NEEDED,
      write_disposition=df.io.BigQueryDisposition.WRITE_TRUNCATE)))
p.run()

<google.cloud.dataflow.runners.direct_runner.DirectPipelineResult at 0x7f43c5126210>

In [8]:
!bq head -n 10 YOUR-PROJECT:DATASET.TABLENAME

+-------+---------------+
| month | tornado_count |
+-------+---------------+
|     5 |             6 |
|     1 |            16 |
|     3 |             6 |
|     6 |             5 |
|     2 |             7 |
|     9 |             7 |
|    12 |            10 |
|     4 |             5 |
|    11 |             9 |
|     7 |             8 |
+-------+---------------+
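
To try the shape of this query without a BigQuery project, the sketch below runs an equivalent statement against an in-memory SQLite table using Python's standard `sqlite3` module. The table contents are hypothetical sample rows, and the query is adapted for SQLite: booleans are stored as integers, so the filter reads `tornado = 1`, and an `ORDER BY` is added to make the output order predictable.

```python
import sqlite3

# Hypothetical sample rows: (month, tornado).
rows = [(1, True), (1, False), (5, True), (1, True)]

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE weather_stations (month INTEGER, tornado INTEGER)')
conn.executemany(
    'INSERT INTO weather_stations VALUES (?, ?)',
    [(month, int(tornado)) for month, tornado in rows])

query = ('SELECT month, COUNT(month) AS tornado_count '
         'FROM weather_stations '
         'WHERE tornado = 1 GROUP BY month ORDER BY month')
result = conn.execute(query).fetchall()
conn.close()

print(result)  # [(1, 2), (5, 1)]
```

Because the database does the filtering and counting, the pipeline only has to copy the already aggregated rows to the sink.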


# Combiner Examples

A common use of Dataflow combiners is to take the sum (or maximum or minimum) of the values for each key. Standard Python functions such as `sum`, `max`, and `min` can be used directly as combiner functions. In fact, any function that reduces an iterable to a single value can be used.

In [13]:
import google.cloud.dataflow as df
p = df.Pipeline('DirectPipelineRunner')

SAMPLE_DATA = [('a', 1), ('b', 10), ('a', 2), ('a', 3), ('b', 20)]

(p
 | df.Create(SAMPLE_DATA)
 | df.CombinePerKey(sum)
 | df.Write(df.io.TextFileSink('./output/results')))
p.run()


<google.cloud.dataflow.runners.direct_runner.DirectPipelineResult at 0x7f43ce317150>

In [14]:
!head ./output/results

('a', 6)
('b', 30)
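
The sketch below mimics per-key combining in plain Python to show that the reducing function is interchangeable. The helper `combine_per_key` is hypothetical; it groups the values for each key in memory and applies the function once per key.

```python
import collections


def combine_per_key(pairs, combine_fn):
    """Hypothetical helper: reduce the values of each key with combine_fn."""
    groups = collections.defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted((key, combine_fn(values)) for key, values in groups.items())


SAMPLE_DATA = [('a', 1), ('b', 10), ('a', 2), ('a', 3), ('b', 20)]

sums = combine_per_key(SAMPLE_DATA, sum)
maxima = combine_per_key(SAMPLE_DATA, max)
minima = combine_per_key(SAMPLE_DATA, min)

print(sums)    # [('a', 6), ('b', 30)]
print(maxima)  # [('a', 3), ('b', 20)]
print(minima)  # [('a', 1), ('b', 10)]
```

This helper sees all values for a key at once. A distributed runner may instead combine partial results, so functions whose answer does not depend on how the values are grouped, such as `sum`, `max`, and `min`, are the safe choice; that caveat is an assumption about the runner and is not stated in this tutorial.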


The `google/cloud/dataflow/examples/cookbook/combiners_test.py` file in the source distribution contains more combiner examples.

# More Examples

The `google/cloud/dataflow/examples` subdirectory in the source distribution has some larger examples.

# Organizing Your Code

Many projects grow to span multiple source files. In that case, organize the project so that all the code needed to run a workflow can be built as a Python package and installed on the worker VMs that execute the job.

Follow the example in `google/cloud/dataflow/examples/complete/juliaset`. If the code is organized this way, you can use the `--setup_file` command-line option to build a source distribution from the project files, stage the resulting tarball, and install it on the workers that execute the job.