# Bar Galore
Now that you have been exposed to the three main parts of the Vega visualization grammar (`data`, `channels`, and `marks`), our conversation will shift to the specifications necessary to create common visual representations.
The bar chart was developed by [William Playfair](https://en.wikipedia.org/wiki/William_Playfair#Bar_chart) in 1786, and since then it has gone on to become one of the most widely used visual representations. One of the reasons why the chart is so effective is that the way quantitative data is encoded capitalizes on the visuo-perceptual system.
In this notebook, the focus will be on using the `mark_bar()` graphical mark to create variations of the simple bar chart.


## Learning Goals
Those who actively work through this notebook will be able to:
- Create the simple bar, column, sorted bar, grouped bar, stacked bar, and normalized bar charts
- Use aggregations and transformations prior to encoding data
- Customize various aspects of a chart
- Determine which bar chart variation is best suited for a given task
- Describe the usefulness of the facet encoding channel

## Tinder Usage Data
Tinder is an online dating application that pioneered the "swiping" interaction to indicate interest. In 2017, Whatsgoodly conducted a survey of more than 3800 US university students to determine their usage of Tinder. The raw data is available on the [data.world](https://data.world/ahalps/how-many-millennials-find-someone-on-tinder) platform. We excluded the records that were not specific to a given institution and records from institutions with low response rates. To make our exploration of the data a bit more nuanced, we have added the `state` and `region` attributes to facilitate comparisons. Because the names of the universities are long, we also added the `uni_abbr` column to shorten axis values.

In [1]:
import pandas as pd
import altair as alt

In [2]:
path = 'data/tinder.csv'
data = pd.read_csv(path)
data.head(10)

Unnamed: 0,uni,type,res,pct,state,region,uni_abbr
0,Appalachian State University,Nonuse,46,0.4,North Carolina,south,ASU
1,Appalachian State University,No,53,0.461,North Carolina,south,ASU
2,Appalachian State University,Yes,16,0.139,North Carolina,south,ASU
3,Butler University,Nonuse,17,0.25,Indiana,midwest,BU
4,Butler University,No,44,0.647,Indiana,midwest,BU
5,Butler University,Yes,7,0.103,Indiana,midwest,BU
6,Cal Poly San Luis Obispo,Nonuse,7,0.28,California,pacific,CPSLO
7,Cal Poly San Luis Obispo,No,10,0.4,California,pacific,CPSLO
8,Cal Poly San Luis Obispo,Yes,8,0.32,California,pacific,CPSLO
9,Case Western Reserve University,Nonuse,12,0.387,Ohio,midwest,CWRU


| Column        | Description                                                                    |
|---------------|--------------------------------------------------------------------------------|
| uni           | University Name                                                                |
| type          | Response type to the question: _Have you ever met up with someone off tinder?_ |
| res           | Number of responses                                                            |
| pct           | Share of responses at the given institution (a fraction between 0 and 1)       |
| state         | State the university is located in                                             |
| region        | Region in the US                                                               |
| uni_abbr      | Abbreviation of the university name                                            |

How large is the dataset?

In [3]:
data.shape

(123, 7)

There are 123 data items and 7 attributes.
What is the data type for each attribute?
- Quantitative: `res` and `pct`
- Nominal: `uni`, `type`, `state`, `region`, and `uni_abbr`
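
As a rough cross-check on this classification, the sketch below uses the pandas dtypes as a first guess: numeric columns are treated as quantitative and everything else as nominal. The `sample` frame is an assumption for illustration; it holds three rows copied from the preview above so the snippet runs without the CSV file. A dtype is only a hint, since a numeric column (an ID, for example) can still be nominal.

```python
import pandas as pd

# Three rows copied from the preview above (illustrative stand-in for `data`).
sample = pd.DataFrame({
    "uni_abbr": ["ASU", "ASU", "ASU"],
    "type": ["Nonuse", "No", "Yes"],
    "res": [46, 53, 16],
    "pct": [0.4, 0.461, 0.139],
})

# First guess: numeric columns are quantitative, the rest are nominal.
quantitative = [c for c in sample.columns
                if pd.api.types.is_numeric_dtype(sample[c])]
nominal = [c for c in sample.columns if c not in quantitative]

print("Quantitative:", quantitative)
print("Nominal:", nominal)
```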

Using pandas we can create a summary of each attribute. For the quantitative attributes, we will include the minimum and maximum values. For the others, we will just get a sense of the unique values that exist.

In [4]:
data.agg(
    {
        "res": ['min', 'max'],
        "pct": ['min', 'max'],
        "uni":['unique'],
        "type":['unique'],
        "state":['unique'],
        "region":['unique'],
    }
)

Unnamed: 0,res,pct,uni,type,state,region
min,1.0,0.053,,,,
max,95.0,0.818,,,,
unique,,,"[Appalachian State University, Butler Universi...","[Nonuse, No, Yes]","[North Carolina, Indiana, California, Ohio, Ma...","[south, midwest, pacific, northeast, mountain]"


Now that we have an overview of what the dataset includes, let's start exploring.

## Simple Bar

Let's start by answering the question, _How many students, at each university, participated in the survey?_
To do so, use the `x` channel to encode the `uni_abbr` and the `y` channel the number of respondents.

In [5]:
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N'),
    alt.Y('res'),
)

The simple bar chart is great for comparing quantitative values. In the preceding plot we have over 30 universities. It is easy to determine that the institution with the abbreviation **CU** has the largest number of participants. Unfortunately, it is harder to determine whether **Amherst** had more participants than **TTU**. One way for us to address this limitation is to create a sorted bar chart.

## Sorted
To sort the values before encoding, we need to include the `sort` property and specify how the data should be arranged.
The value `-y` indicates that the categories encoded on the `x` channel should be sorted in descending order by the value of the data encoded on the `y` channel (i.e., the number of respondents).

In [6]:
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('res')
)

Wait, something seems off. For some reason, not all the bars are sorted in descending order. This is because three values are being represented for each university: the number of students who responded Yes, No, and Nonuse. For each university, the three values are placed on top of each other, but the data is being arranged based on the individual rows and not on the total or on a specific response type. This distinction is crucial.
So a follow-up question is, what is your task? Do you want to represent the _total number of respondents_ or just _those with a specific response_?
Let us stick to our original question: _What is the total number of students, at each university, that participated in the survey?_

### Aggregation: Sum
Altair allows us to apply aggregation operations (e.g., `sum`, `count`, `average`, `min`) to fields.
Visit the link for a listing of all [supported aggregation operations](https://vega.github.io/vega-lite/docs/aggregate.html#ops).

In [7]:
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('sum(res)')
)

That looks more like it. Two things to note. The first is that the bars are now arranged in descending order. The second is that we are encoding the __total__ number of respondents. Notice how the y axis range is now [0, 200].
The sorted bar chart is useful for ranking tasks. Now we can clearly see that __TTU__ had more participants than __Amherst__.
Let's add a tooltip so that we actually know the name of each institution.

In [8]:
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('sum(res)'),
    alt.Tooltip('uni')
).interactive()
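
To see what the `sum(res)` aggregation and the `-y` sort compute, here is a minimal pandas sketch of the same idea. The `sample` frame is an assumption for illustration; it holds the nine rows for ASU, BU, and CPSLO copied from the preview above so the snippet runs without the CSV file.

```python
import pandas as pd

# Nine rows copied from the preview above (illustrative stand-in for `data`).
sample = pd.DataFrame({
    "uni_abbr": ["ASU"] * 3 + ["BU"] * 3 + ["CPSLO"] * 3,
    "type": ["Nonuse", "No", "Yes"] * 3,
    "res": [46, 53, 16, 17, 44, 7, 7, 10, 8],
})

# Equivalent of alt.Y('sum(res)'): one total per university,
# then sorted in descending order like sort='-y'.
totals = (
    sample.groupby("uni_abbr")["res"]
    .sum()
    .sort_values(ascending=False)
)
print(totals)
```

Each bar in the sorted chart corresponds to one entry of `totals`, which is why the ordering becomes consistent once the aggregation is applied.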

### Transform: Filter
What if we just wanted to get a sense of the students who responded No at each institution?
One option would be to first use pandas to filter the data, attach the filtered data to the chart, and then encode.
Altair provides a shortcut with data transformations.
The filter transform only encodes data items that satisfy the filter expression.
Here we will state that for each data item (i.e., `datum`), the response type must be 'No'.

In [9]:
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('res'),
    alt.Tooltip('uni')
).transform_filter(
    (alt.datum.type == 'No')
).interactive()

## Stacked
If we wanted to get a sense of the number of Yes or Nonuse respondents, we could create a chart similar to the one shown before. But that would give us three separate charts, and it would be hard to get a sense of the breakdown of response types for each institution. In a stacked bar chart, each bar encodes multiple values, which are distinguished by color. Stacked bar charts are good for comparing parts of a whole while at the same time providing an overview across multiple groups.

In Altair bars can be stacked with little effort.
Because the y value currently represents the total number of responses per university, we can subdivide each bar by using color.
Let's encode `type` using the `color` channel. In addition, update the tooltip to show the value for the given response type.

In [10]:
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('sum(res)'),
    alt.Tooltip(['uni', 'res']),
    alt.Color('type'),
).interactive()

In the chart above, the use of the `color` encoding channel causes Altair / Vega-Lite to automatically stack the bar marks.
Try adding the parameter `stack=None` to the `y` encoding channel to see what happens if we explicitly disallow stacking.


In [11]:
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('sum(res)', stack=None),
    alt.Color('type'),
    alt.Tooltip(['uni', 'res']),
).interactive()

If we don't apply stacking, all bars start at the baseline of zero and are drawn on top of each other. So for Cornell, the blue bar is drawn first, then the orange, and finally the red. The orange bar cannot be seen because the red bar is larger and covers it.
It is worth mentioning that for stacked bar charts, there is an implicit binding of the data to the channel. You do not need to include the `sum` aggregation option to obtain the two previous charts.

### Color Palette
It is hard to distinguish between red and orange. In a future notebook we will explore color in detail, but here let's use `domain` and `range` to customize the color scale.
The same way we can customize axis values for the `x` and `y` encoding channels, we can customize the color scale.
The domain is the three values of the `type` field (i.e., No, Nonuse, Yes).
The range is the intended output values, that is, the colors we want to use.
Cynthia Brewer has done research on [color palettes](https://colorbrewer2.org/#type=qualitative&scheme=Set2&n=3).
We will use a qualitative color scheme that is accessible to all individuals (including people who may be color blind).

In [12]:
domain = ['No','Nonuse', 'Yes']
range = ['#66c2a5', '#fc8d62', '#8da0cb']
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('sum(res)'),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'res'])
).interactive()

## Normalized Stacked
One of the limitations of the stacked bar chart stems from its strength. When the encoded quantitative values for the main nominal categories are similar, it is easy to understand how the sub-categories contribute.
That is not always the case. In the example shown above, for about half of the institutions it is difficult to compare the proportions of the response types. For instance, for __Amherst__, is the proportion of __Yes__ respondents greater than the proportion of those who did not use Tinder?
The normalized stacked bar chart shows each group as a percentage of the whole, so it is easier to see the relative differences between quantities in each group.
Given that our dataset already has the percentage for each university, we can encode that instead of `res`.


In [13]:
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N'),
    alt.Y('pct'),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'pct'])
).interactive()

Because the stacked proportions for each university sum to 1, we set the axis domain explicitly so that the axis ends at 1 rather than at the default of 1.1.


In [14]:
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N'),
    alt.Y('pct', scale=alt.Scale(domain=[0,1])),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'pct'])
).interactive()

Now what if we didn't have the percentage attribute? One option would be to use pandas to create an additional column and calculate the percentage. Fortunately, Altair provides us with an additional option that is less taxing.
Let's make two changes. First, change the `y` channel to encode `res` instead of the percentage. Then add the parameter `stack="normalize"` to the `y` encoding channel.

In [15]:
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N'),
    alt.Y('res', stack="normalize"),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'pct'])
).interactive()
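
For reference, the pandas route mentioned above could be sketched as follows: divide each response count by the total for its university. The `sample` frame and the `share` column name are assumptions for illustration; the rows are copied from the preview at the top of the notebook, so the result can be compared against the existing `pct` values.

```python
import pandas as pd

# Nine rows copied from the preview above (illustrative stand-in for `data`).
sample = pd.DataFrame({
    "uni_abbr": ["ASU"] * 3 + ["BU"] * 3 + ["CPSLO"] * 3,
    "type": ["Nonuse", "No", "Yes"] * 3,
    "res": [46, 53, 16, 17, 44, 7, 7, 10, 8],
    "pct": [0.4, 0.461, 0.139, 0.25, 0.647, 0.103, 0.28, 0.4, 0.32],
})

# Total responses per university, broadcast back to every row.
totals = sample.groupby("uni_abbr")["res"].transform("sum")

# Share of each response type within its university.
sample["share"] = sample["res"] / totals

print(sample[["uni_abbr", "type", "pct", "share"]].round(3))
```

This is essentially what `stack="normalize"` does inside the chart: each segment is scaled by the total of its bar, so every bar spans 0 to 1.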

One insight that is apparent in the normalized stacked chart is that **Dartmouth College** has the highest proportion of students who have met up with someone from Tinder (i.e., a Yes response).
Let's focus on the institutions in the Southern states.
Using the filter transform, visualize the total number of respondents for institutions in the south and stack by their response type.

In [16]:
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('sum(res)'),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'pct'])
).interactive().transform_filter(
    (alt.datum.region == 'south')
)

The first thing we notice is that nine universities from the southern states of the USA participated in the survey. Appalachian State University had the largest number of responses.
Because there are huge disparities in the number of responses across institutions, the default axis makes it difficult to get a sense of the proportion of students who had a successful meetup from Tinder at universities with a smaller number of respondents.
Let's change from a stacked bar chart to a normalized stacked chart.

In [17]:
alt.Chart(data).mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('sum(res)', stack='normalize'),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'res']),
).interactive().transform_filter(
    (alt.datum.region == 'south')
)

With the normalized stacked bar we can discern that the University of Arkansas, Virginia University, and University of Virginia had larger proportions of students who met up with a person they met on Tinder.
Each variation of the bar chart comes with its strengths and limitations.
Let's look at one more variation before concluding this lesson.

## Grouped Bar
A grouped bar chart is also referred to as a clustered bar chart or multi-series bar chart. Like all the other bar charts you have been exposed to, it is good for comparing values across groups. One of the limitations of the stacked bar chart is that it is hard to compare different segments for each bar because we are using a _non-aligned axis_ to encode the data. The grouped bar chart addresses this issue by aligning all sub-categories to the baseline axis. The limitation of this approach is that we lose the ability to compare at the higher level (i.e., compare institutions).
Unfortunately, while Altair simplifies the creation of numerous charts, the specifications for the grouped bar chart are a bit more nuanced.

### Facet Channels
Our discussion on visual encoding channels has largely been focused on **position** (e.g., x and y) and **mark-property** (e.g., color) channels.
Another channel available is the facet channel (e.g., column, row, and facet).
When a facet channel is used, Altair creates one mini-chart for each unique value of the encoded field.
For the grouped bar chart, instead of having one bar chart, we want to have one for each institution.
We will encode `type` on the `x` channel, and use the `column` channel to encode `uni_abbr`.


In [18]:
alt.Chart(data).mark_bar().encode(
    alt.X('type'),
    alt.Y('sum(res)'),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'res']),
    alt.Column('uni_abbr'),
).interactive().transform_filter(
    (alt.datum.region == 'south')
)

Instead of using the `Column` encoding channel, change it to the `Row` encoding channel and observe the differences.



## Customizing Vizzes
Specifying how to encode each field is just the beginning. Now that we have the chart, we need to focus on customizing it.
The Altair API includes a wide range of customizations. In this section we focus on a select few.
There are many ways a visualization can be customized. Altair includes global, local and encoding specific customizations.
- _Global Config_ acts on an entire chart object
- _Local Config_ acts on one mark of the chart
- _Encoding_ channels can also be used to set some chart properties


### Chart
There are a number of customizations we can make. Let's
- change the `width` and `height` of the charts
- add a descriptive title
- remove the chart borders because for the grouped bar charts we want a less individualistic approach for each chart

See the API for additional global [Chart](https://altair-viz.github.io/user_guide/configuration.html#config-chart) configurations.


In [19]:
alt.Chart(data).mark_bar().encode(
    alt.X('type'),
    alt.Y('sum(res)'),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'pct']),
    alt.Column('uni_abbr:N'),
).interactive().transform_filter(
    (alt.datum.region == 'south')
).properties(
    width=50,
    height=100,
    title='Select Southern Universities Experiences with Tinder'
).configure_view(
    strokeWidth=0
)

### Mark
As seen in the previous notebook, we can specify the width of each bar. We can also change the width of each chart so that the space between the grouped bars is negligible.
See the Altair API for additional [mark](https://altair-viz.github.io/user_guide/configuration.html#mark-and-mark-style-configuration) configurations.

In [20]:
alt.Chart(data).mark_bar(size=10).encode(
    alt.X('type'),
    alt.Y('sum(res)'),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'pct']),
    alt.Column('uni_abbr:N'),
).interactive().transform_filter(
    (alt.datum.region == 'south')
).properties(
    width=33,
    height=100,
    title='Select Southern Universities Experiences with Tinder'
).configure_view(
    strokeWidth=0
)

### Axis
The axes have a default configuration. We have already changed the default configuration for the color channel.
Let's make some changes to the others.
For the `x` channel, we can remove the axis, as its values mirror the color legend.
For the `y` channel, we can remove the grid lines.

See the Altair API for additional [axis](https://altair-viz.github.io/user_guide/configuration.html#axis-configuration) settings.


In [21]:
alt.Chart(data).mark_bar(size=10).encode(
    alt.X('type', axis=None),
    alt.Y('sum(res)', axis=alt.Axis(grid=False)),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'pct']),
    alt.Column('uni_abbr'),
).interactive().transform_filter(
    (alt.datum.region == 'south')
).properties(
    width=33,
    height=100,
    title='Select Southern Universities Experiences with Tinder'
).configure_view(
    strokeWidth=0
)

### Facet
For the `column` channel let's
- change the placement of the header from the top to bottom
- remove the `uni_abbr` title

See the Altair API for additional [facet](https://altair-viz.github.io/user_guide/generated/channels/altair.Column.html?highlight=facet%20channel) formatting specifications.


In [22]:
alt.Chart(data).mark_bar(size=10).encode(
    alt.X('type', axis=None),
    alt.Y('sum(res)', axis=alt.Axis(grid=False)),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'pct']),
    alt.Column('uni_abbr', header=alt.Header(labelOrient='bottom'), title=None),
).interactive().transform_filter(
    (alt.datum.region == 'south')
).properties(
    width=33,
    height=100,
    title='Select Southern Universities Experiences with Tinder'
).configure_view(
    strokeWidth=0
)

### Title
We can also customize the chart title. Let's
- change title color
- change font size

See the Altair API for additional [title](https://altair-viz.github.io/user_guide/configuration.html#title-configuration) configurations.

In [23]:
alt.Chart(data).mark_bar(size=10).encode(
    alt.X('type', axis=None),
    alt.Y('sum(res)', axis=alt.Axis(grid=False)),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'pct']),
    alt.Column('uni_abbr', header=alt.Header(labelOrient='bottom'), title=None),
).interactive().transform_filter(
    (alt.datum.region == 'south')
).properties(
    width=33,
    height=100,
    title='Select Southern Universities Experiences with Tinder'
).configure_view(strokeWidth=0
).configure_title(fontSize=18, color='grey')


## Summary
In this notebook you created sorted, grouped, stacked and normalized bar charts.
More importantly you hopefully have a deeper appreciation for the visualization grammar and have learned how specifying which channel should be used to encode data influences the design of a visualization.
The best thing to do at this point is practice.
Get a dataset and start building vizzes.