##### Copyright 2020 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License").
<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->


# Example 3: Streaming NYC Taxi Ride Data

This example demonstrates how to set up a streaming pipeline that processes a stream of NYC taxi ride data.
Each element in the stream is a JSON object containing the taxi's location, the timestamp, the meter reading, the meter increment, the passenger count, and the ride status.

You'll be able to use this notebook to explore the data in each `PCollection`.

Note that running this example may incur a small [charge](https://cloud.google.com/pubsub/pricing#message_delivery_pricing) if your aggregated Pub/Sub usage is past the free tier.

Let's make sure the Pub/Sub API is enabled. This [allows](https://cloud.google.com/apis/docs/getting-started#enabling_apis) your project to access the Pub/Sub service:


In [None]:
!gcloud services enable pubsub

Start with the necessary imports:

In [None]:
import apache_beam as beam
from apache_beam.runners.interactive import interactive_runner
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.transforms import trigger
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
import google.auth
import json
import pandas as pd

# The Google Cloud PubSub topic that we are reading from for this example.
topic = "projects/pubsub-public-data/topics/taxirides-realtime"

# So that Pandas DataFrames do not truncate data.
# Use None for "no limit"; passing -1 is deprecated since pandas 1.0.
pd.set_option('display.max_colwidth', None)

Next, set up the options for the streaming pipeline:

In [None]:
# Setting up the Beam pipeline options.
options = pipeline_options.PipelineOptions()

# Sets the pipeline mode to streaming, so we can stream the data from PubSub.
options.view_as(pipeline_options.StandardOptions).streaming = True

# Sets the project to the default project in your current Google Cloud environment.
# The project will be used for creating a subscription to the PubSub topic.
_, options.view_as(GoogleCloudOptions).project = google.auth.default()

We are working with an unbounded source. By default, *Apache Beam Notebooks* records data from unbounded sources so that it can be replayed.
The following sets the recording duration to 2 minutes (120 seconds).

In [None]:
ib.options.recording_duration = '2m'

The following creates a pipeline with the *Interactive Runner* as the runner with the options we just created.

In [None]:
p = beam.Pipeline(interactive_runner.InteractiveRunner(), options=options)

The following applies a `PTransform` that creates a subscription to the given Pub/Sub topic and reads from that subscription.
The data is in JSON format, so we add a `Map` `PTransform` that parses each message as JSON.

In [None]:
data = p | "read" >> beam.io.ReadFromPubSub(topic=topic) | beam.Map(json.loads)
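
To see what the `Map(json.loads)` step does to a single message, here is a minimal sketch outside of Beam. The payload is hypothetical: only `meter_increment` is a field name used in this notebook, the other keys are assumed from the field description above, and all values are made up.

```python
import json

# Hypothetical message payload; Pub/Sub delivers message data as bytes.
# Key names other than "meter_increment" are assumptions for illustration.
payload = (
    b'{"latitude": 40.75, "longitude": -73.99, '
    b'"timestamp": "2020-01-01T12:00:00Z", "meter_reading": 12.5, '
    b'"meter_increment": 0.05, "passenger_count": 2, '
    b'"ride_status": "enroute"}'
)

# json.loads accepts bytes as well as str and returns a plain dict.
ride = json.loads(payload)
print(ride["meter_increment"])
```

After this step, each element of `data` is a Python dictionary rather than raw bytes.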

If you want, you can inspect the raw JSON data by doing:

In [None]:
# Uncomment and run this if you want to inspect the raw JSON data:
# ib.show(data)

Because we are reading from an unbounded source, we need a windowing scheme before we can aggregate the data.
Let's use sliding windows that are 10 seconds long and start every second.
For more information about windowing in Apache Beam, visit the [Apache Beam Programming Guide](https://beam.apache.org/documentation/programming-guide/#windowing-basics).


In [None]:
windowed_data = (data | "window" >> beam.WindowInto(beam.window.SlidingWindows(10, 1)))

If you want, you can visualize the windowed JSON data (see [FAQ #3: How do I read the visualization?](../../faq.md#q3)) by doing:

In [None]:
# Uncomment and run this if you want to visualize the windowed JSON data:
# ib.show(windowed_data, include_window_info=True, visualize_data=True)

You will see each element repeated because sliding windows overlap: with a 10-second window
that advances every second, every element belongs to 10 windows.
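
The overlap can be sketched in plain Python. This is an illustration of the idea only, not Beam's implementation; the function name `sliding_windows_for` is hypothetical, and windows are assumed to be half-open intervals `[start, start + size)` aligned to multiples of the period.

```python
def sliding_windows_for(timestamp, size=10, period=1):
    """Return every (start, end) window that contains the timestamp."""
    # Latest window start that is not after the timestamp.
    start = timestamp - (timestamp % period)
    windows = []
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return windows


windows = sliding_windows_for(12.5)
print(len(windows))
```

An element stamped at 12.5 seconds lands in the windows starting at 12, 11, and so on down to 3, which is why it shows up once per window in the visualization.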

Now let's calculate the 10-second dollar run rate for each second, by summing the `meter_increment` JSON field for each window.

First, extract the `meter_increment` field from the JSON object.

In [None]:
meter_increments = windowed_data | beam.Map(lambda e: e.get('meter_increment'))
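
One thing to keep in mind with `dict.get`: it returns `None` when the key is absent. The sketch below assumes, purely for illustration, that some message lacks `meter_increment`; nothing in this notebook says the feed actually contains such messages.

```python
# Hypothetical parsed messages; the second one lacks the field on purpose.
rides = [{"meter_increment": 0.05}, {"ride_status": "pickup"}]

# Same extraction as the lambda above: a missing key yields None.
increments = [ride.get("meter_increment") for ride in rides]

# Supplying a default keeps the values numeric, so they can be summed.
safe_increments = [ride.get("meter_increment", 0.0) for ride in rides]
total = sum(safe_increments)
```

If missing fields were a concern, the same default could be passed to `e.get` in the `Map` step.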

If you want, you can inspect the extracted data by doing:

In [None]:
# Uncomment and run this if you want to inspect the extracted data:
# ib.show(meter_increments, include_window_info=True)

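Now sum all elements by window. `without_defaults()` is needed because `CombineGlobally` cannot emit a default value when the input is not in the global window: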

In [None]:
run_rates = meter_increments | beam.CombineGlobally(sum).without_defaults()
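
Conceptually, the combine step produces one sum per window. A minimal plain-Python sketch of that computation follows; it is not how Beam executes it, the helper name `windowed_sums` is hypothetical, and the sample events are made up.

```python
from collections import defaultdict


def windowed_sums(events, size=10, period=1):
    """Sum values per sliding window [start, start + size)."""
    totals = defaultdict(float)
    for timestamp, value in events:
        # Add the value to every window that contains its timestamp.
        start = timestamp - (timestamp % period)
        while start > timestamp - size:
            totals[(start, start + size)] += value
            start -= period
    return dict(totals)


# Made-up (timestamp_seconds, meter_increment) pairs; a 5-second period
# keeps the number of windows small enough to read.
events = [(0, 0.25), (5, 0.5)]
sums = windowed_sums(events, size=10, period=5)
```

Here the window `(0, 10)` contains both events, so its sum is 0.75, while `(-5, 5)` and `(5, 15)` each contain one.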

The following shows the 10-second dollar run rate for each second.

In [None]:
ib.show(run_rates, include_window_info=True)

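Now you can add a sink to `run_rates` and run a [Google Cloud Dataflow](https://cloud.google.com/dataflow) job, and you'll have a continuous run rate Pub/Sub feed! `WriteToPubSub` expects `bytes` elements, so encode each value first:
```
encoded = run_rates | beam.Map(lambda rate: str(rate).encode('utf-8'))
encoded | beam.io.WriteToPubSub(topic=<your-topic>)
```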

Refer to the [user guide](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development) on how to run a Dataflow job using a pipeline assembled from your notebook. You can also refer to [this walkthrough](Dataflow_Word_Count.ipynb) which is based on the [first word count example notebook](01-Word_Count.ipynb).

Also, as mentioned in the beginning, there are many other fields in each element of the stream. As an exercise, try extracting other fields and applying your own computations.

When you are done with this example, you might want to visit the [Pub/Sub subscription page](https://console.cloud.google.com/cloudpubsub/subscription/list) to delete any subscriptions created by this example.

If you have any feedback on this notebook, drop us a line at beam-notebooks-feedback@google.com.