In [1]:
%run supportvectors-common.ipynb






# Revisiting the bar, with Altair

We will now repeat the bar-plots exercise we did with `matplotlib`, but this time with the `altair` library.

## Load and summarize

In [2]:
url = "https://raw.githubusercontent.com/supportvectors/viz-datasets/main/" \
    + "regional_covid_data.csv"
data = pd.read_csv(url)
data

Unnamed: 0,Region,Confirmed,Deaths,Recovered,Active
0,Africa,723207,12223,440645,270339
1,Americas,8839286,342732,4468616,4027938
2,Eastern Mediterranean,1490744,38339,1201400,251005
3,Europe,3299523,211144,1993723,1094656
4,South-East Asia,1835297,41349,1156933,637015
5,Western Pacific,292428,8249,206770,77409
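
The `Unnamed: 0` column in the output above is the row index that was saved into the CSV file. The sketch below shows one way to avoid loading it as a data column, using `index_col=0`. It reads an inline copy of the table shown above (assumed to mirror the file at the URL) so that it runs without network access.

```python
import io

import pandas as pd

# Inline copy of the table shown above; the blank first header cell is what
# pandas reports as "Unnamed: 0" (assumed to mirror the CSV at the URL).
csv_text = """,Region,Confirmed,Deaths,Recovered,Active
0,Africa,723207,12223,440645,270339
1,Americas,8839286,342732,4468616,4027938
2,Eastern Mediterranean,1490744,38339,1201400,251005
3,Europe,3299523,211144,1993723,1094656
4,South-East Asia,1835297,41349,1156933,637015
5,Western Pacific,292428,8249,206770,77409
"""

# index_col=0 treats the first, unnamed column as the row index
# instead of loading it as a data column.
data = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(data.columns.tolist())
```

The same `index_col=0` argument should work with `pd.read_csv(url, ...)` in the cell above.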


## The Bar chart family

Let us now start learning to create bar-charts with `altair`, starting with the very basic one.

An `altair` visualization is built with the `Chart` class and its methods:

*  the top-level **chart** object accepts the data
*  a **mark** method specifies how the encoded attributes should be represented in the chart
*  the **encode** method maps data columns to visual attributes of the chart

We create bar charts using the `Chart.mark_bar()` method. 
Refer to the code below and try to understand the syntax. We will break it down in the next cell.

In [3]:

(alt.Chart(data)
     .mark_bar()                      # mark_bar for creating bar plots
     .encode(
             x='Region:N',            # encode x-axis values to the data column 'Region' of type "nominal"
             y='Confirmed:Q',         # encode y-axis values to the data column 'Confirmed' of type "quantitative"
     )
)

# compare to plotting with pandas
# data.plot.bar(x='Region', y='Confirmed')

### The encode method

Let us now try to understand the encode method.

The `Chart.encode()` method provides several kinds of **channels** for mapping data columns to visual attributes. We will use the following in this notebook:

1. position channels - `x`, `y`, `theta` etc.
2. mark property channels - `color`, `size`, `shape` etc.
3. text and tooltip channels - `text`, `tooltip`
4. facet channels - `row`, `column`

In the code above we encode the position channel `x` with the data column `Region`, which means that the x-axis values are taken from the data column `Region`. What about the `:N`?

#### Encoding types

The encoding type of a data column determines how altair interprets the values of the column. The data columns can be encoded in several different types:

| encoding type | shorthand* | description |
|---|---|---|
| `quantitative` | `Q` | continuous real valued quantity |
| `ordinal` | `O` | discrete ordered quantity |
| `nominal` | `N` | discrete unordered category |
| `temporal` | `T` | time or date value |
| `geojson` | `G` | geographic shape |

To learn more about encoding types, refer to:
https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types 

Since `Region` is a categorical variable, we choose the type `nominal`. Choosing the appropriate encoding type matters because it determines the kind of scale and axis that `altair` builds for the channel.

### Shorthand

A **shorthand** is used to conveniently specify the column name and its type as a single string. (It can specify the aggregate as well.)

`x = 'Region:N'` is short for `x = alt.X(field='Region', type='nominal')`

The shorthand is used throughout this notebook.

### Observations on the basic bar chart

This is a straightforward rendering of the bar chart, where we see the number of confirmed cases by region. It already looks better than matplotlib's basic bar chart. However, `altair` provides many options to improve on it. Let us take a step-by-step approach to improving this chart.

### Try out the horizontal format?

We may be tempted to experiment with a horizontal layout of the graph instead. With `altair` this can be done simply by interchanging the x-axis and y-axis encodings, i.e. by putting the quantitative values on the x-axis and the categorical values on the y-axis.


In [4]:
(alt.Chart(data)
    .mark_bar()
    .encode(
            x='Confirmed:Q', # put quantitative values on the x-axis
            y='Region:N',
           )
)

### Customize with encode 

Each channel can be customized using the **encode channel options**. For example the channel `x` can be encoded using the `alt.X` class with the encode channel options passed as arguments. The encode channel options include:
* `field`
* `type`
* `aggregate`
* `title`
* `scale` 
* `sort` etc.

For the full list of encode channel options, refer to: https://altair-viz.github.io/user_guide/encoding.html#encoding-channel-options.

### Sorting the bars by size

Previously, with pandas chart visualizations, we had to sort the underlying dataframe before passing it to the plotting function. With `altair` we can specify the sort using a shorthand and pass it as an argument to the relevant axis encoding.

`sort = '-x'` is short for `sort = alt.Sort(encoding='x', order='descending')`

In [5]:
(alt.Chart(data)
    .mark_bar()
    .encode(
            x='Confirmed:Q', 
             y=alt.Y(
                     'Region:N', 
                      sort='-x', #sort y-values by x-values in descending order
                     ),
           )
)

### Resizing the graph

The graph looks a bit too small; perhaps we can set the size of the graph explicitly.

The `Chart.properties()` method is used to configure the figure.

In [6]:
(alt.Chart(data)
    .mark_bar()
    .encode(
            x='Confirmed:Q',                
            y=alt.Y('Region:N', sort='-x'), 
           )
    .properties(
                width=800,         #set the width of the figure
                height=400,        #set the height of the figure
               )
)

### Set a title to the figure

Unless there is a reason not to, we should always give our figures a title. The title of the figure is specified using the `Chart.properties()` method.

In [7]:
(alt.Chart(data)
    .mark_bar()
    .encode(
            x='Confirmed:Q',               
            y=alt.Y('Region:N', sort='-x'), 
            )
    .properties(
                width=800,
                height=400,
                title='Confirmed COVID-19 cases by UN region', # set the title of the figure
                )
)


### Customize the x-axis label

In [8]:
(alt.Chart(data)
    .mark_bar()
    .encode(
            x=alt.X(
                    'Confirmed:Q',
                    axis=alt.Axis(title='Confirmed covid cases') # set the x-axis label
                    ),               
            y=alt.Y('Region:N', sort='-x'), 
            )
    .properties(
                width=800,
                height=400,
                title='Confirmed COVID-19 cases by UN region', 
                )
)

### Adding the values explicitly to the bars

Sometimes, to explain vital parts of the data, it may be desirable to draw attention to a few facts. Perhaps the values are not easy to read accurately from the figure above. To help the reader, let us add text next to each bar. This can be done by layering a separate text chart over the base chart. Text charts are created using the `Chart.mark_text()` method. 

We want the text chart to reuse the data and the `x` and `y` encodings of the bar chart `bar`. Therefore we call `bar.mark_text()`, which creates a text chart with the same data and encodings as `bar`. We pass the mark properties to the `mark_text()` method and encode the text values with the data column `Confirmed`.

Let us run the following cell to see what happens.


In [9]:
bar = (alt.Chart(data)
          .mark_bar()
          .encode(
                  x=alt.X(
                          'Confirmed', 
                           axis=alt.Axis(title='Confirmed covid cases') # customize x-axis label
                         ), 
                  y=alt.Y('Region', sort='-x'),
                 )
      )


text = (bar.mark_text(
                      align='left',
                      baseline='middle',
                      dx=3,  # nudge the text 3 pixels to the right so it doesn't overlap the end of the bar
                      fontWeight='bold',
                      color='salmon',
                     )
           .encode(text='Confirmed:Q') # encode text to the data column 'Confirmed' of type 'quantitative'
       )


Neither `bar` nor `text` is rendered! This is because the cell only assigns the charts to variables. We have yet to compose these charts into a compound chart and display the result.

#### Compound charts

Charts can be vertically concatenated using `&` and horizontally concatenated using `|`. Charts can be layered on top of each other using `+`. 

Here we layer the bar and text chart using `(bar + text)`

In [10]:
bar = (alt.Chart(data)
          .mark_bar()
          .encode(
                  x=alt.X(
                          'Confirmed', 
                           axis=alt.Axis(title='Confirmed covid cases') # customize x-axis label
                         ), 
                  y=alt.Y('Region', sort='-x'),
                 )
      )


text = (bar.mark_text(
                      align='left',
                      baseline='middle',
                      dx=3, 
                      fontWeight='bold',
                      color='salmon',
                     )
           .encode(text='Confirmed:Q')
       )


(bar+text).properties(                        # layer charts
                      width=800,
                      height=400,
                      title='Confirmed COVID-19 cases by UN region',
                     )

## Multi-bars

How would we plot not just the feature `Confirmed` but also, say, `Deaths` in the same bar-plot? Similar to layering a text chart over the bar chart, we can layer two bar charts over one another as shown below.

In [11]:
confirmed = (alt.Chart(data)
          .mark_bar()
          .encode(
                  x=alt.X(
                          'Confirmed', 
                           axis=alt.Axis(title='Confirmed covid cases') 
                         ), 
                  y=alt.Y('Region'),
                 )
      )

deaths = (alt.Chart(data)
          .mark_bar(color='#e76f51')                                   
          .encode(
                  x=alt.X(
                          'Deaths', 
                           axis=alt.Axis(title='Deaths') 
                         ), 
                  y=alt.Y('Region'),
                 )
      )


(confirmed + deaths).properties(
                      width=800,
                      height=400,
                      title='COVID-19 Confirmed cases vs Deaths by UN region',
                     )

Layering charts does have a disadvantage: `altair` generates a legend only for encoded channels, so colors set through mark properties, as above, get no legend. Also, layering many charts becomes quite tedious. We will look at a simpler and more efficient method next.

## Stacked bar chart

This is a variant of the bar chart where the bars are stacked on top of each other. Let us create a stacked bar chart to show the different case types: `Recovered`, `Active`, and `Deaths`.

The dataset at present is in **wide-form**, i.e. it has one row per independent variable. In our dataset each `Region` has its own row, and the case status types `Recovered`, `Active`, and `Deaths` are in separate columns. With wide-form data we would need to create a separate chart for each column and combine them, as we did previously. However, there is a much simpler way: transform the data to long-form, where each row holds a single observation.

`altair` works best with **long-form** data. To learn about long-form and wide-form data, refer to: https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data 

### Using transform fold

To create a stacked bar chart with `Recovered`, `Active`, and `Deaths`, we must transform these three columns into two columns, `Case status` and `Case count`, using either altair's `transform_fold()` method or `pandas.melt()`. Here let us learn to use `transform_fold()`.

`transform_fold()` takes two arguments: 
* `fold` : the columns to fold (melt)
* `as_` : the names of the two new columns (default being `['key', 'value']`)

Once transformed, the two new columns (here `Case status` and `Case count`, or `key` and `value` if `as_` is omitted) can be used for encoding.

In [12]:
(alt.Chart(data)
     .transform_fold(
                     fold=['Recovered', 'Active', 'Deaths'],
                     as_=['Case status','Case count'],                               
                    )
     .mark_bar()
     .encode(
             x=alt.X(
                     'Case count:Q', 
                     ),
             y='Region:N',
             color=alt.Color('Case status:N')
             )
     .properties(
                 width=800,
                 height=400,
                 title='COVID-19 cases by UN region',
                )
)


Note that applying `transform_fold()` does not modify the dataset. The transformation is local to the chart.

In [13]:
data

Unnamed: 0,Region,Confirmed,Deaths,Recovered,Active
0,Africa,723207,12223,440645,270339
1,Americas,8839286,342732,4468616,4027938
2,Eastern Mediterranean,1490744,38339,1201400,251005
3,Europe,3299523,211144,1993723,1094656
4,South-East Asia,1835297,41349,1156933,637015
5,Western Pacific,292428,8249,206770,77409
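
The `pandas.melt()` alternative mentioned earlier could look like the sketch below. Unlike `transform_fold()`, it produces a new long-form DataFrame. The data is a hand-typed copy of the relevant columns, and `long_df` is a name chosen here for illustration.

```python
import pandas as pd

# Hand-typed copy of the relevant columns from the regional data.
df = pd.DataFrame({
    "Region": ["Africa", "Americas", "Eastern Mediterranean",
               "Europe", "South-East Asia", "Western Pacific"],
    "Recovered": [440645, 4468616, 1201400, 1993723, 1156933, 206770],
    "Active": [270339, 4027938, 251005, 1094656, 637015, 77409],
    "Deaths": [12223, 342732, 38339, 211144, 41349, 8249],
})

# Fold the three status columns into (Case status, Case count) pairs.
long_df = df.melt(
    id_vars="Region",
    value_vars=["Recovered", "Active", "Deaths"],
    var_name="Case status",
    value_name="Case count",
)

print(long_df.shape)   # one row per (region, status) pair
print(long_df.head())
```

The resulting `long_df` can be passed to `alt.Chart()` directly, with the same encodings as above and without the `transform_fold()` step.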


### Add tooltips

To add more information to the plot, we can use tooltips. A tooltip shows selected column values when the cursor hovers over a mark. Tooltips are encoded in the same way as other channels.

In [14]:
(alt.Chart(data)
     .transform_fold(
                     fold=['Recovered', 'Active', 'Deaths'],
                     as_=['Case status','Case count'],                               
                    )
     .mark_bar()
     .encode(
             x=alt.X(
                     'Case count:Q', 
                     ),
             y='Region:N',
             color=alt.Color('Case status:N'),
             tooltip=['Case status:N','Case count:Q']  # add tooltip
             )
     .properties(
                 width=800,
                 height=400,
                 title='COVID-19 cases by UN region',
                )
)

## Grouped Multi-bars  

With the transformed data, we can un-stack the bars and create a subplot for each region. In `altair` this is done by encoding the `row` channel with `Region:N`, which creates one subplot per region. 

Exercise: How can this be done with vertical bars? 

In [15]:
(alt.Chart(data)
     .transform_fold(
                     fold=['Recovered', 'Active', 'Deaths'],
                     as_=['Case status','Case count'],                               
                    )
     .mark_bar()
     .encode(
             x=alt.X(
                     'Case count:Q', 
                     ),
             y=alt.Y('Case status:N',                  # change y-axis values from Region to Case status
                     axis=alt.Axis(title="")
                     ),
             color=alt.Color('Case status:N'),
             tooltip=['Case status:N','Case count:Q'],
             row='Region:N',                           # subplots are generated for each region 
             )
).properties(
                 width=800,
                 height=75,
                 title='COVID-19 cases by UN region',
                )


## Proportional bar chart

Another variation of the compound bar chart is the "proportional bar chart" which more clearly delineates the proportional relationship between two features.

In our current dataset, perhaps it is worthwhile comparing the proportion of recoveries vs deaths vs active, for each region. So first we need to create the proportion features in the data, as shown below.

In [16]:
data['Death_prop'] = data.Deaths / data.Confirmed
data['Active_prop'] = data.Active / data.Confirmed
data['Recovered_prop'] = data.Recovered / data.Confirmed
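
In every row of the table shown earlier, `Confirmed` equals `Deaths + Active + Recovered`, so the three proportions add up to 1 for each region. That is why each stacked bar of proportions spans the full 0% to 100% range. A quick check, on a hand-typed copy of the data:

```python
import pandas as pd

# Hand-typed copy of the regional data.
df = pd.DataFrame({
    "Region": ["Africa", "Americas", "Eastern Mediterranean",
               "Europe", "South-East Asia", "Western Pacific"],
    "Confirmed": [723207, 8839286, 1490744, 3299523, 1835297, 292428],
    "Deaths": [12223, 342732, 38339, 211144, 41349, 8249],
    "Recovered": [440645, 4468616, 1201400, 1993723, 1156933, 206770],
    "Active": [270339, 4027938, 251005, 1094656, 637015, 77409],
})

# Divide each status column by the confirmed count, row by row.
props = df[["Deaths", "Active", "Recovered"]].div(df["Confirmed"], axis=0)

# Each row of proportions should sum to 1 (up to floating-point error).
row_sums = props.sum(axis=1)
print(row_sums.round(6).tolist())
```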

In [17]:
(alt.Chart(data)
     .transform_fold(
                     as_=['case status percent','case count percent'], 
                     fold=['Recovered_prop', 'Active_prop', 'Death_prop'],                                
                    )
     .mark_bar()
     .encode(
             x=alt.X(
                     'case count percent:Q', 
                     axis=alt.Axis(
                                   title="",
                                   format='p'
                                ),
                     ),
             y='Region:N',
             color=alt.Color('case status percent:N', title='Case status'),
             tooltip=alt.Tooltip(
                                 'case count percent:Q', 
                                  format='.2%',       # added tooltip with percentage formatting,
                                )
             )
     .properties(
                 width=800,
                 height=400,
                 title='Distribution of disease status in confirmed Covid-19 patients',
                )
)

### Inferences from the proportional bar chart 

From the previous plots we observed that the Americas has the highest number of confirmed Covid cases. However, the Americas also has the most balanced proportion of active to recovered cases. Europe has the highest number of deaths in proportion to the number of confirmed cases.
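
These observations can be checked numerically. The sketch below works on a hand-typed copy of the data. It measures "balance" as the absolute difference between the recovered and active shares, which is one possible reading of the statement above.

```python
import pandas as pd

# Hand-typed copy of the regional data.
df = pd.DataFrame({
    "Region": ["Africa", "Americas", "Eastern Mediterranean",
               "Europe", "South-East Asia", "Western Pacific"],
    "Confirmed": [723207, 8839286, 1490744, 3299523, 1835297, 292428],
    "Deaths": [12223, 342732, 38339, 211144, 41349, 8249],
    "Recovered": [440645, 4468616, 1201400, 1993723, 1156933, 206770],
    "Active": [270339, 4027938, 251005, 1094656, 637015, 77409],
})

# Share of confirmed cases that ended in death, per region.
death_share = df["Deaths"] / df["Confirmed"]

# Gap between recovered and active shares; smaller means more balanced.
gap = ((df["Recovered"] - df["Active"]) / df["Confirmed"]).abs()

most_confirmed = df.loc[df["Confirmed"].idxmax(), "Region"]
highest_death_share = df.loc[death_share.idxmax(), "Region"]
most_balanced = df.loc[gap.idxmin(), "Region"]

print(most_confirmed, highest_death_share, most_balanced)
```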