##### Copyright 2020 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License").
<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->


# Visualizing Data

This notebook demonstrates how to collect data from a `PCollection` and use common Python or JavaScript libraries to visualize it.

Setup (**Important**: run these first)
- [Dependencies Needed](#Dependencies-Needed)
- [Example Data](#Example-Data)

Demos (You can always use `tab` for auto-completion and `shift` + `tab` when the cursor is on a piece of code for docstrings)
1. [Native Interactive Beam Visualization](#Native-Interactive-Beam-Visualization)
2. [Pandas DataFrame](#Pandas-DataFrame)
3. [Matplotlib](#Matplotlib)
4. [Seaborn](#Seaborn)
5. [Bokeh](#Bokeh)
6. [D3.js](#D3.js)

## Dependencies Needed

**Disclaimer**: Third party visualization libraries and their dependencies are not developed or managed by `Interactive Beam` or `Dataflow Notebooks`.

- You only need to `!jupyter labextension install...` once for this notebook instance.
- You only need to `%pip install` once for each kernel you use.
- Follow the instructions of the pip install output when an error occurs or a kernel restart is required.
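
Because the install only needs to happen once per kernel, it can help to check what is already importable before running it again. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for illustration and is not part of Interactive Beam.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the top-level module names that cannot be imported in this kernel."""
    return [name for name in names if find_spec(name) is None]

# Import names of the libraries this notebook installs with %pip.
wanted = ['numpy', 'matplotlib', 'pandas', 'seaborn', 'bokeh']
print('Packages still to install:', missing_packages(wanted))
```

An empty list suggests the `%pip install` step can be skipped for the current kernel. This check does not cover the JupyterLab extensions, which are listed by `jupyter labextension list`.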

In [None]:
!jupyter labextension install @jupyter-widgets/jupyterlab-manager@2.0.0 --no-build
!jupyter labextension install @bokeh/jupyter_bokeh --no-build
# Check the installation.
!jupyter labextension list

%pip install numpy matplotlib pandas seaborn bokeh
# Restart the kernel after the installation. You can click the button with a
# circled arrow icon and tooltip "Restart the kernel" in the tool bar or click
# the menu item "Kernel" > "Restart Kernel..." to restart the kernel.

## Example Data

- The data is fetched from [covidtracking.com](http://covidtracking.com).
- It contains daily COVID-19 tracking data for 2020-08-27 for the US, grouped by state.

In [None]:
from csv import reader
from collections import namedtuple

import apache_beam as beam
from apache_beam.runners.interactive import interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

# The source file contains the raw data in CSV format.
csv_file = '../assets/visualize_data/example_data.csv'

def read_headers(csv_file):
  with open(csv_file, 'r') as f:
    header_line = f.readline().strip()
  return next(reader([header_line]))

# Read the first row of the CSV file to take out the headers.
headers = read_headers(csv_file)
# Build a namedtuple with the headers as a schema for the raw data.
UsCovidData = namedtuple('UsCovidData', headers)

class UsCovidDataCsvReader(beam.DoFn):
  """A parser DoFn that converts each row of raw data in CSV format into
  a UsCovidData schemed namedtuple."""

  def __init__(self, schema):
    self._schema = schema
    
  def process(self, element):
    values = [int(val) if val.isdigit() else val for val in next(reader([element]))]
    return [self._schema(*values)]

# Build a Beam pipeline with InteractiveRunner that uses DirectRunner as the
# underlying runner by default.
p = beam.Pipeline(runner=InteractiveRunner())
pcoll_data = (p 
    | 'Read rows of the csv file' >> beam.io.ReadFromText(csv_file, skip_header_lines=1)
    | 'Parse rows into UsCovidData typed elements' >> beam.ParDo(UsCovidDataCsvReader(UsCovidData)))

# Collect the PCollection's data into a Pandas DataFrame.
data = ib.collect(pcoll_data)
data.describe()
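
To see what the `UsCovidDataCsvReader` step does to a single row, the parsing logic can be sketched with the standard library alone. The header line, the row values, and the names `Row` and `parse_row` below are made up for illustration and are not taken from the example CSV file.

```python
from collections import namedtuple
from csv import reader

# Hypothetical header line and data row; the real file has many more columns.
header_line = 'date,state,positive,negative'
row_line = '20200827,"XX",1200,34000'

# Build the schema from the header, as the notebook does with UsCovidData.
Row = namedtuple('Row', next(reader([header_line])))

def parse_row(line, schema):
    """Parse one CSV line, converting all-digit fields to int."""
    values = [int(v) if v.isdigit() else v for v in next(reader([line]))]
    return schema(*values)

row = parse_row(row_line, Row)
print(row)
```

Note that `str.isdigit()` is false for empty, negative, and decimal strings, so such fields stay as strings under this conversion rule.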

## Native Interactive Beam Visualization

`Interactive Beam`'s native visualization utility `show` renders a paginated, orderable, and searchable data table and a [Facets](https://pair-code.github.io/facets/) visualization.

- It's designed to let users navigate their data and gain insight into it effortlessly.
- It's the most interactive of the libraries shown here, since users can click, drag, and type inputs to customize the visualization without coding.
- More information can be found in [FAQ #3.How do I read the visualization](../../faq.md#q3).

In [None]:
ib.show(pcoll_data, visualize_data=True)

## Pandas DataFrame

The data collected from a PCollection is a Pandas `DataFrame`, so the easiest way to visualize the data is through pandas itself.

- Pandas uses the standard convention for referencing the matplotlib API.
- Pandas visualization [guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

Native Interactive Beam visualization automatically generates a histogram for each column of the data.
On top of that, we can use the Pandas DataFrame's visualization to build stacked histograms.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

plt.close('all')
df = ib.collect(pcoll_data)
plt.figure()

# Display a stacked histogram for 'positive' and 'negative'. 
df[['positive', 'negative']].plot.hist(alpha=0.5, stacked=True)

## Matplotlib

- As explained in [Types of inputs to plotting functions](https://matplotlib.org/tutorials/introductory/usage.html#types-of-inputs-to-plotting-functions):
>All of plotting functions expect numpy.array or numpy.ma.masked_array as input. Classes that are 'array-like' such as pandas data objects and numpy.matrix may or may not work as intended. It is best to convert these to numpy.array objects prior to plotting.

- User [guide](https://matplotlib.org/users/index.html)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

df = ib.collect(pcoll_data)
# Convert from Pandas DataFrame to numpy array.
df_array = df[['positive', 'negative']].values

# Minimal effort with matplotlib produces a plot that is neither readable nor useful.
fig, ax = plt.subplots()
ax.plot(df_array)

The plot above draws 56 data points, each with 2 columns. Each column of data is plotted as a separate line. The upper line is the 'negative' number for each data point and the lower line is the 'positive' number.

Here are the reasons why we cannot interpret the plot:

- There is no title, no legend, and no labels for us to make sense of the data.
- The x-axis is metadata (the index of each data point), not the data itself.
- The line plot does not make sense since there is no trend or time ordering between data points. A scatter plot would be more useful.

There are [two ways](https://matplotlib.org/tutorials/introductory/usage.html#the-object-oriented-interface-and-the-pyplot-interface) to use matplotlib:

- Object-oriented interface (OO-style)
- Rely on pyplot

We demonstrate both by making the above plot readable.

In [None]:
# OO-style

fig_oo, ax_oo = plt.subplots()  # Create a figure and an axes.
ax_oo.plot(df['positive'].values, df['negative'].values, 'bo', label='correlation')
ax_oo.set_xlabel('positive')  # Add an x-label to the axes.
ax_oo.set_ylabel('negative')  # Add a y-label to the axes.
ax_oo.set_title('Correlation between positive and negative')  # Add a title to the axes.
ax_oo.legend()

In [None]:
# Rely on pyplot

plt.plot(df['positive'].values, df['negative'].values, 'bo', label='correlation')  # Plot some data on the (implicit) axes.
plt.xlabel('positive')
plt.ylabel('negative')
plt.title('Correlation between positive and negative')
plt.legend()

## Seaborn

- [Tutorials](https://seaborn.pydata.org/tutorial.html)

We demonstrate a categorical plot and a distribution plot from Seaborn, similar to what we have done with the other libraries.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

df = ib.collect(pcoll_data)

sns.set(style='ticks', color_codes=True)
# Display categorical plot of 'positive', 'negative' and 'total' case numbers.
sns.catplot(data=df[['positive', 'negative', 'total']])

In [None]:
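# Display distribution of 'positive' case numbers.
# `distplot` is deprecated since seaborn 0.11; `histplot` with a KDE overlay
# is the closest replacement.
sns.histplot(df['positive'], kde=True)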

## Bokeh

Compared to the other Python libraries in this demonstration, which produce static plots such as those from matplotlib, Bokeh offers more interactivity. In this respect it is similar to the Native Interactive Beam visualization.

The advantage of interactivity is that the visualization lets the user dive into each data point instead of only giving an overview.
Depending on how the user configures it, a single visualization can be interpreted in different ways to solve different problems.

The disadvantage of interactive visualizations is that they tend to use more resources when rendered and are harder to share.

- User [guide](https://docs.bokeh.org/en/latest/docs/user_guide.html)
- To use Bokeh in JupyterLab, make sure both of these extensions are installed:
  - `jupyter labextension install @jupyter-widgets/jupyterlab-manager`
  - Then `jupyter labextension install @bokeh/jupyter_bokeh`
- **Warning**: There is a known Bokeh-JupyterLab [issue](https://github.com/bokeh/jupyter_bokeh/issues/29). If you refresh the page or
  open a saved notebook, when there is a Bokeh plot in the output, the below code cell with `show(bokeh_plot)` will become invisible.
  You have to clear the output, save the notebook, and then reopen this notebook to view the code.

We demonstrate Bokeh with yet another 'positive'-'negative' correlation plot.

In [None]:
from bokeh.plotting import output_notebook

output_notebook()

In [None]:
from bokeh.plotting import figure, show

df = ib.collect(pcoll_data)

bokeh_plot = figure(
   tools='pan,box_zoom,reset,save',  # The interactive tools offered by bokeh.
   y_range=[1, 10**7], title='Correlation between positive and negative',
   x_axis_label='positive', y_axis_label='negative'
)

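# `scatter` draws circle markers by default; recent Bokeh releases deprecate
# passing `size` to `circle`.
bokeh_plot.scatter(df['positive'], df['negative'], legend_label='correlation', fill_color='white', size=8)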

show(bokeh_plot)

## D3.js

Unlike the other libraries demonstrated here, D3.js is a JavaScript library.

- D3 [Wiki](https://github.com/d3/d3/wiki)
- Advanced tutorial: [D3 in Depth](https://www.d3indepth.com/)

The cells below show a completely different approach: rendering the visualization in your browser rather than relying on black-box libraries to pre-render all outputs in the kernel.

We demonstrate D3.js with a force-simulated bubble chart. You have to execute the cells below again if the page is refreshed or the notebook is opened from a saved state.

In [None]:
# Move data from the Python kernel to a Javascript object in your browser.

from IPython.display import Javascript

df = ib.collect(pcoll_data)
df_json = df[['state', 'positive']].to_json(orient='records')

# Assign the json formatted DataFrame to a global Javascript variable `df_json`.
Javascript('window.df_json={};'.format(df_json))
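
The hand-off above works because JSON text is also valid JavaScript literal syntax, so it can be assigned directly to a variable. Below is a minimal standard-library sketch of the same idea. The records are made-up placeholders in the shape that `to_json(orient='records')` produces (one object per row), and the resulting string is only printed here rather than passed to `IPython.display.Javascript`.

```python
import json

# Hypothetical records; state codes and counts are placeholders.
records = [
    {'state': 'AA', 'positive': 1200},
    {'state': 'BB', 'positive': 560},
]

# Serialize the rows to a JSON array of objects.
payload = json.dumps(records)

# The script text that would be handed to IPython.display.Javascript.
script = 'window.df_json={};'.format(payload)
print(script)
```

Once the browser runs such a script, `window.df_json` is a plain JavaScript array that the D3 code can iterate over with `map`.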

In [None]:
%%HTML

<script src='https://d3js.org/d3.v5.js'></script>
    
<div>
  The bubble chart below has these attributes:
  <ul>
    <li>State names are used as labels.</li>
    <li>The bigger the positive number, the bigger the bubble.</li>
    <li>Use the mouse to drag the graph.</li>
    <li>Use double click or the mouse wheel to zoom in or out of the graph.</li>
    <li>Hovering over a circle displays its positive number.</li>
  </ul>
  <svg id='bubble' width='800' height='600'></svg>
</div>

<script>
  function bubbleChart() {
    let bubbleData = window.df_json
    let width = 800, height = 600;
    let nodes = d3.shuffle(bubbleData.map((d) => {
      return {
        radius: d['positive'] / 10000,
        label: d['state'],
        ...d,
      };
    }));

    let simulation = d3
      .forceSimulation(nodes)
      .force('center', d3.forceCenter().x(width/2).y(height/2))
      .force('forceX', d3.forceX().strength(0.1).x(width/2))
      .force('forceY', d3.forceY().strength(0.1).y(height/2))
      .force('charge', d3.forceManyBody().strength(-80))
      .force(
        'collision',
        d3.forceCollide().strength(1).radius(function (d) {
          return d.radius;
        }))
      .on('tick', ticked);

    let zoomable = d3
      .select('#bubble')
      .attr('viewBox', [0, 0, width, height]);
    let zg = zoomable.append('g').attr('width', width).attr('height', height);
    zoomable.call(
      d3
        .zoom()
        .scaleExtent([0.25, 5])
        .on('zoom', function () {
          zg.attr('transform', d3.event.transform);
        }));

    let u = d3
      .select('#bubble')
      .select('g')
      .selectAll('g')
      .data(nodes);

    let node = u.enter().append('g');
    node
      .append('circle')
      .attr('r', function (d) {
        return d.radius;
      })
      .attr('cx', function (d) {
        return d.x;
      })
      .attr('cy', function (d) {
        return d.y;
      })
      .style('fill', '#FBB65B')
      .append('svg:title')
      .text((d) => d['positive']);
    node
      .append('text')
      .attr('x', (d) => d.x)
      .attr('y', (d) => d.y + d.radius/6)
      .attr('text-anchor', 'middle')
      .style('fill', '#000000')
      .style('font-size', (d) => d.radius/2)
      .text((d) => d.label);
    node.merge(u);
    u.exit().remove();

    function ticked() {
      node.attr('transform', function(d) {
        return 'translate(' + d.x + ',' + d.y + ')'
      });
    }
  }

  bubbleChart();
</script>