# Creating Collections

## Creating in-memory `PCollection`s

- We can now create rudimentary Beam pipelines and pass parameters into them. Next, we'll learn how to create `PCollection`s and fill them with data.
- There are two main options for doing this:
    - Create a `PCollection` from data held in an in-memory collection in your driver program
    - Read data from external sources, such as local or cloud-based files or databases, using Beam-provided IO adapters

To create a `PCollection` from in-memory data, we use the `Create` transform:

```python
import apache_beam as beam

# Helper transform that prints every element of a PCollection.
# Don't worry about how it works; we'll cover this in future lessons.
class Output(beam.PTransform):
    class _OutputFn(beam.DoFn):
        def __init__(self, prefix=''):
            super().__init__()
            self.prefix = prefix

        def process(self, element):
            print(self.prefix+str(element))

    def __init__(self, label=None, prefix=''):
        super().__init__(label)
        self.prefix = prefix

    def expand(self, input):
        input | beam.ParDo(self._OutputFn(self.prefix))

with beam.Pipeline() as p:

    (
        p
        | 'Create range' >> beam.Create(range(1, 11))
        | 'Output range' >> Output()
    ) 

    (
        p
        | 'Create words' >> beam.Create(['To', 'be', 'or', 'not', 'to', 'be'])
        | 'Output words' >> Output()
    )
```

Running the script (`python source/03-creating-collections-01.py`) prints each element on its own line:

1
2
3
4
5
6
7
8
9
10
To
be
or
not
to
be


## Creating `PCollection`s from text files

To read from an external source, use one of Beam's IO adapters. Each read transform returns a `PCollection`.

For example, to read text files we use `ReadFromText` (the Python SDK's counterpart to the Java SDK's `TextIO.Read`).

```python
# Create a PCollection by reading a text file.
# Each line of the input file becomes a separate element.
(p | beam.io.ReadFromText('gs://some/inputData.txt'))
```

In a pipeline:

```python
# Reuses `beam` and the `Output` transform defined in the first example
with beam.Pipeline() as p:

    input = (
        p
        | 'Log lines' >> beam.io.ReadFromText('gs://apache-beam-samples/shakespeare/kinglear.txt')
        | beam.Filter(lambda line: line != "")
    )

    (
        input
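        # Take a random sample of 10 lines. The sample arrives as a single
        # list element, so FlatMap unpacks it back into individual lines
        | 'Log fixed lines' >> beam.combiners.Sample.FixedSizeGlobally(10)
        | beam.FlatMap(lambda sentence: sentence)
        | Output(prefix='Fixed first 10 lines:')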
    )


    words = (
        p
        | 'Log words' >> beam.io.ReadFromText('gs://apache-beam-samples/shakespeare/kinglear.txt')
        # Split each line on whitespace so that elements are words, not lines
        | beam.FlatMap(lambda line: line.split())
        | beam.Filter(lambda word: not word.isspace() or word.isalnum())
        | beam.combiners.Sample.FixedSizeGlobally(10)
        | beam.FlatMap(lambda word: word)
        | 'Log output words' >> Output(prefix='Word: ')
    )
```


## Creating `PCollection`s from CSVs

This consists of two main parts:
- Using `ReadFromText` to load the text lines from the CSV
- Parsing each line of text into its columns

```python
# Reuses `beam` and the `Output` transform from the first example.
#
# Standard DoFn definition: inherit from beam.DoFn, then write the
# main processing logic inside `process()`
class ExtractTaxiRideCostFn(beam.DoFn):

    def process(self, element):
        # Naive CSV parsing: split the line on commas
        line = element.split(',')
        # Index 16 is the column this example treats as the ride cost
        return tryParseTaxiRideCost(line, 16)

# If the row has more than `index` fields, yield the field at that
# index as a float; otherwise (or if it is not numeric) yield 0.0
def tryParseTaxiRideCost(line, index):
    if len(line) > index:
        # `process()` must return an iterable of output elements. Using
        # `yield` makes this function a generator, so elements are
        # produced lazily instead of being collected into a list first
        try:
            yield float(line[index])
        except ValueError:
            yield 0.0
    else:
        yield 0.0

# This pipeline reads the CSV, extracts the cost from each row, and
# prints a random sample of 10 costs
with beam.Pipeline() as p:
    (
        p
        | 'Log lines' >> beam.io.ReadFromText('gs://apache-beam-samples/nyc_taxi/misc/sample1000.csv')
        | beam.ParDo(ExtractTaxiRideCostFn())
        | beam.combiners.Sample.FixedSizeGlobally(10)
        | beam.FlatMap(lambda cost: cost)
        | Output(prefix='Taxi cost: ')
    )

# Example output (the sample is random, so values vary between runs):
#> Taxi cost: 9.35
#> Taxi cost: 15.38
#> Taxi cost: 10.8
#> Taxi cost: 9.8
#> Taxi cost: 18.55
#> Taxi cost: 30.3
#> Taxi cost: 6.2
#> Taxi cost: 7.3
#> Taxi cost: 28.5
#> Taxi cost: 5.3
```
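
Splitting on `','` works for simple rows but breaks when a quoted field contains a comma. One possible alternative, sketched here with Python's standard `csv` module, parses a single line safely. The `parse_cost` function and its `default` parameter are hypothetical names introduced for illustration; the default `index=16` simply mirrors the example above.

```python
import csv


def parse_cost(line, index=16, default=0.0):
    """Return the float in column `index` of one CSV line, or `default`.

    Hypothetical helper, not part of the original example.
    """
    # csv.reader handles quoted fields that contain commas.
    row = next(csv.reader([line]), [])
    if len(row) <= index:
        return default
    try:
        return float(row[index])
    except ValueError:
        # The field exists but is not numeric (for example, a header cell).
        return default


print(parse_cost('a,"b,c",1.5', index=2))  # 1.5
print(parse_cost('a,b', index=2))          # 0.0
```

A `DoFn` could call a function like this from `process()` and `yield` its result, keeping the parsing logic easy to test outside a pipeline.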