##### Copyright 2020 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License").
<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->


# I/O Operations

In most cases, we don't use `Create` to generate elements and print them as the final destination. Instead, we read from sources and write to sinks. These can be files on the local system, buckets on Cloud Storage, BigQuery tables, or Pub/Sub topics.

In [1]:
import logging
from utils.solutions import solutions

import apache_beam as beam
from apache_beam import Map
from apache_beam.transforms.combiners import Count
from apache_beam.io.textio import ReadFromText, WriteToText

from apache_beam.testing.util import assert_that
from apache_beam.testing.util import matches_all, equal_to

from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib


There are many [I/O operations](https://beam.apache.org/releases/pydoc/current/apache_beam.io.html). In this notebook we are only going to deal with reading/writing text files and we'll leave BigQuery and Pub/Sub for the next notebook ([Streaming](6_Streaming.ipynb)).

**`ReadFromText`** reads from a file path, which can also be a Cloud Storage path. The optional parameter `min_bundle_size` sets the minimum size of each [bundle](https://beam.apache.org/documentation/runtime/model/#bundling-and-persistence) the source is split into (i.e., the file is split into N bundles of at least `min_bundle_size`). If it is not set, Apache Beam handles the splitting for you.
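To make the idea of a minimum bundle size concrete, here is a purely conceptual sketch in plain Python. It is not Beam's splitting code: the function `split_into_bundles`, the `desired_bundle_size` argument (standing in for whatever size a runner might ask for), and the byte counts are all hypothetical, chosen only to show how a minimum size limits how finely a file is split.

```python
def split_into_bundles(total_size, desired_bundle_size, min_bundle_size):
    """Split the range [0, total_size) into (start, stop) bundles.

    Hypothetical helper for illustration; not part of Apache Beam.
    """
    # A bundle is never planned smaller than the minimum size.
    bundle_size = max(desired_bundle_size, min_bundle_size, 1)
    bundles = []
    start = 0
    while start < total_size:
        stop = min(start + bundle_size, total_size)
        # If the leftover tail would be below the minimum, fold it
        # into the current bundle instead of emitting a tiny one.
        if total_size - stop < min_bundle_size:
            stop = total_size
        bundles.append((start, stop))
        start = stop
    return bundles


# Assumed numbers: a 1000-unit file, a requested size of 100, a minimum of 300.
bundles = split_into_bundles(total_size=1000,
                             desired_bundle_size=100,
                             min_bundle_size=300)
print(bundles)  # [(0, 300), (300, 600), (600, 1000)]
```

Without the minimum, the same file would be cut into ten small bundles. With it, there are only three. A file smaller than the minimum still ends up as a single bundle in this sketch.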

**`WriteToText`** writes to a file path or a Cloud Storage path. The optional parameter `num_shards` sets the number of output files. Setting it is not recommended, because constraining the number of shards can reduce pipeline performance, but certain use cases require a fixed number of output files.
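It helps to know what the output files will be called. The sketch below is plain Python, not a Beam call: the helper `shard_file_names` is hypothetical and assumes `WriteToText` is left on its default shard naming template (`-SSSSS-of-NNNNN`), so each file name is the prefix, the shard index and total padded to five digits, and then the suffix.

```python
def shard_file_names(file_path_prefix, file_name_suffix, num_shards):
    """Return the file names expected under the default shard template.

    Hypothetical helper for illustration; not part of Apache Beam.
    """
    return [
        "{}-{:05d}-of-{:05d}{}".format(
            file_path_prefix, shard, num_shards, file_name_suffix)
        for shard in range(num_shards)
    ]


# Same prefix, suffix, and shard count as the notebook's write example.
names = shard_file_names("output/example", ".txt", 2)
for name in names:
    print(name)
# output/example-00000-of-00002.txt
# output/example-00001-of-00002.txt
```

Because the shard index is appended to the prefix, no file is named exactly `output/example`, which is why reading the results back requires a pattern rather than the bare prefix.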

In [4]:
p = beam.Pipeline(InteractiveRunner())
    
path = "input/example.txt"
output_path = "output/example"

def print_fn(e):
    # We are adding this step to know what the elements look like
    print("Element: {}".format(e))
    return e

write = (p | "Read" >> ReadFromText(path)
           | "Map" >> Map(print_fn)
           | "Write" >> WriteToText(file_path_prefix=output_path, file_name_suffix=".txt", num_shards=2))

ib.show(write)

In this case, `ib.show()` displays the paths of the files that were written, since those are the elements `WriteToText` outputs.

You can also use wildcards in paths. Using the previous output as input:

In [None]:
p = beam.Pipeline(InteractiveRunner())

output_path = "output/example"

read = p | "Read" >> ReadFromText(file_pattern=output_path + '*')

ib.show_graph(p)
ib.show(read)

Note that because the files are read in parallel, the order of the lines may change.
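The following plain-Python sketch mimics this behaviour without Beam, as a rough illustration only. It assumes two shard files with made-up names and contents in a temporary directory, matches them with a wildcard through the standard `glob` module, and compares the lines read back against the originals without relying on order.

```python
import glob
import os
import tempfile
from collections import Counter

original_lines = ["first", "second", "third", "fourth"]

with tempfile.TemporaryDirectory() as tmp_dir:
    prefix = os.path.join(tmp_dir, "example")

    # Illustrative shard files: the lines are spread across two files,
    # so their original order is not preserved on disk.
    shards = {
        prefix + "-00000-of-00002.txt": original_lines[1::2],
        prefix + "-00001-of-00002.txt": original_lines[0::2],
    }
    for shard_path, lines in shards.items():
        with open(shard_path, "w") as f:
            f.write("\n".join(lines) + "\n")

    # The wildcard matches every shard that starts with the prefix.
    matched_files = glob.glob(prefix + "*")

    read_lines = []
    for matched in matched_files:
        with open(matched) as f:
            read_lines.extend(line.rstrip("\n") for line in f)

# Compare as multisets: same lines, same counts, order ignored.
same_content = Counter(read_lines) == Counter(original_lines)
print(len(matched_files), same_content)
```

When checking the result of a read like this, compare the contents in an order-insensitive way rather than against an exact sequence.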

## Exercise

This exercise uses the file `hamlet.txt`, which is stored in a public Cloud Storage bucket.

The goal is to count how many lines the file has. Optionally, you can write the result to a file; the posted solution does not save it.

In [None]:
p = beam.Pipeline(InteractiveRunner())
    
path = "gs://dataflow-samples/shakespeare/hamlet.txt"

# TODO: Finish the pipeline 
count_lines = (p | )

ib.show(count_lines)

# For testing the solution - Don't modify
assert_that(count_lines, equal_to(solutions[5]))

### Hints

<details><summary>Read Hint</summary>
<p>
When the file is read, it's already split by lines.
</p>
</details>


<details><summary><b>Code</b></summary>
<p>

```
p = beam.Pipeline(InteractiveRunner())
    
path = "gs://dataflow-samples/shakespeare/hamlet.txt"

count_lines = (p | ReadFromText(path)
                 | Count.Globally())

ib.show(count_lines)

# For testing the solution - Don't modify
assert_that(count_lines, equal_to(solutions[5]))
```
</p>
</details>
