# Bar Galore

## Tinder Usage Data

In [2]:
# import libraries
import pandas as pd
import altair as alt

In [4]:
# Read data
path = "../data/tinder.csv"
data = pd.read_csv(path)
data.head(10)

Unnamed: 0,uni,type,res,pct,state,region,uni_abbr
0,Appalachian State University,Nonuse,46,0.4,North Carolina,south,ASU
1,Appalachian State University,No,53,0.461,North Carolina,south,ASU
2,Appalachian State University,Yes,16,0.139,North Carolina,south,ASU
3,Butler University,Nonuse,17,0.25,Indiana,midwest,BU
4,Butler University,No,44,0.647,Indiana,midwest,BU
5,Butler University,Yes,7,0.103,Indiana,midwest,BU
6,Cal Poly San Luis Obispo,Nonuse,7,0.28,California,pacific,CPSLO
7,Cal Poly San Luis Obispo,No,10,0.4,California,pacific,CPSLO
8,Cal Poly San Luis Obispo,Yes,8,0.32,California,pacific,CPSLO
9,Case Western Reserve University,Nonuse,12,0.387,Ohio,midwest,CWRU


| Column        | Description                                                                    |
|---------------|--------------------------------------------------------------------------------|
| uni           | University Name                                                                |
| type          | Response type to the question: _Have you ever met up with someone off tinder?_ |
| res           | Number of responses                                                            |
| pct           | Proportion (0 to 1) of responses at the given institution                      |
| state         | State the university is located in                                             |
| region        | Region in the US                                                               |
| uni_abbr      | Abbreviation of the university name                                            |

How large is the dataset?


In [5]:
data.shape

(123, 7)

The data has 123 rows (items) and 7 columns (attributes). The data type of each attribute is listed below (Q for quantitative, N for nominal):
- uni: N
- type: N
- res: Q
- pct: Q
- state: N
- region: N
- uni_abbr: N
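
As a rough cross-check of the Q/N labels, pandas dtypes can be inspected directly. The sketch below rebuilds a few rows from the `data.head(10)` output above; `sample` and `kinds` are illustrative names, and the rule "numeric means Q" is an assumption that would mislabel numeric codes such as ZIP codes.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# A few rows copied from the data.head(10) output above
sample = pd.DataFrame(
    {
        "uni_abbr": ["ASU", "ASU", "ASU", "BU"],
        "type": ["Nonuse", "No", "Yes", "Nonuse"],
        "res": [46, 53, 16, 17],
        "pct": [0.4, 0.461, 0.139, 0.25],
    }
)

# Assumed rule of thumb: numeric columns -> Q, everything else -> N
kinds = {
    col: "Q" if is_numeric_dtype(sample[col]) else "N"
    for col in sample.columns
}
print(kinds)
```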

In [6]:
# create summary of each attribute depending on its type
# Q ---> min, max
# N ---> unique
data.agg(
    {
        "res":["min", "max"],
        "pct":["min", "max"],
        "uni":["unique"],
        "type":["unique"],
        "state":["unique"],
        "region":["unique"],
    }

)

Unnamed: 0,res,pct,uni,type,state,region
min,1.0,0.053,,,,
max,95.0,0.818,,,,
unique,,,"[Appalachian State University, Butler Universi...","[Nonuse, No, Yes]","[North Carolina, Indiana, California, Ohio, Ma...","[south, midwest, pacific, northeast, mountain]"


### Simple Bar

How many students at each university participated in the survey?

In [7]:
# define variable
chart = alt.Chart(data)

In [12]:
# Make a bar chart with the university abbreviation on the x axis
# and the number of responses on the y axis
chart.mark_bar().encode(
    alt.X("uni_abbr:N"),
    alt.Y("res")
)

A bar chart is good for comparing quantitative values, so we can use it to see which university has the most participants. **However**, when two universities have similar counts, it is hard to tell which one is higher:

e.g. **Amherst** and **TTU**

To solve this, we can sort the bars by value instead of alphabetically.

### Sorted

To sort values, we need to include the **sort** property and specify how the data should be arranged.

In [13]:
# Add the `sort` parameter to the channel we want to sort.
# Here we sort the X channel by the values of the Y channel
# in descending order (that is what the `-` prefix means)

chart.mark_bar().encode(
    alt.X("uni_abbr:N", sort="-y"),
    alt.Y("res")
    
)

**WOWWW**, this looks weird: not all bars are sorted in descending order.

- Each uni has 3 values being represented:
    + number of students who responded yes
    + number of students who responded no
    + number of students who responded no usage

So we need to reframe the question. Do we want to:

> Represent the total number of respondents?
>
> OR
>
> Just those with a specific response?

We are sticking with the first one, the **total number**, which we can compute with an aggregation. Altair offers `sum`, `count`, `average`, `min`, and others; here we need `sum`.
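
To see what a `sum` aggregation does with the three rows per university, the same totals can be computed in pandas. This is only a sketch on the first nine rows shown in the `data.head(10)` output; `sample` and `totals` are illustrative names.

```python
import pandas as pd

# First nine rows of the dataset, copied from the data.head(10) output above
sample = pd.DataFrame(
    {
        "uni_abbr": ["ASU"] * 3 + ["BU"] * 3 + ["CPSLO"] * 3,
        "type": ["Nonuse", "No", "Yes"] * 3,
        "res": [46, 53, 16, 17, 44, 7, 7, 10, 8],
    }
)

# Collapse the three response rows into one total per university,
# then order the totals from largest to smallest
totals = sample.groupby("uni_abbr")["res"].sum().sort_values(ascending=False)
print(totals)
```

With one value per university, a descending sort is well defined, which is what the aggregated chart relies on.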

### Aggregation: Sum

In [14]:
# Aggregate inside the channel using the shorthand
# "aggregation(attribute)", e.g. "sum(res)"

chart.mark_bar().encode(
    alt.X("uni_abbr:N", sort="-y"),
    alt.Y("sum(res)")
    
)

This looks better: the bars are in descending order, and the chart now shows the **total** number of respondents (keep this in mind when reading it). Now we can see that **TTU** had more participants than **Amherst**. We can also add a tooltip showing the full name of each institution, since the abbreviation alone is not enough.

In [15]:
# Adding a tooltip that shows extra attribute info
chart.mark_bar().encode(
    alt.X("uni_abbr:N", sort="-y"),
    alt.Y("sum(res)"),
    alt.Tooltip("uni")  # add tooltip that shows uni name
).interactive()  # optional: enables pan and zoom; tooltips work without it

### Transform: Filter

If we are interested in a specific response type rather than the total, we can use `transform_filter` to encode only the data items that satisfy the filter expression. There is no need to **filter the original data** with pandas.

In [19]:
# Keep only the data items (datum) whose response type is "No"
chart.mark_bar().encode(
    alt.X("uni_abbr:N", sort="-y"),
    alt.Y("res"),
    alt.Tooltip("uni")
).transform_filter(
    (alt.datum.type=="No")
).interactive()
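
For comparison, the rows that the filter keeps can be reproduced with a pandas boolean mask. This sketch uses the first nine rows from the `data.head(10)` output; `sample` and `no_only` are illustrative names, and the chart itself does not need this step.

```python
import pandas as pd

# First nine rows of the dataset, copied from the data.head(10) output above
sample = pd.DataFrame(
    {
        "uni_abbr": ["ASU"] * 3 + ["BU"] * 3 + ["CPSLO"] * 3,
        "type": ["Nonuse", "No", "Yes"] * 3,
        "res": [46, 53, 16, 17, 44, 7, 7, 10, 8],
    }
)

# Same condition as alt.datum.type == "No", applied in pandas
no_only = sample[sample["type"] == "No"]
print(no_only)
```

After filtering there is exactly one row per university, so `alt.Y("res")` can be sorted without an aggregation.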

### Stacked

We could get a sense of the number of Yes or Nonuse respondents with filtered charts like the one above, but that would be three separate charts. Instead, we can stack the bars and separate the response types by color. A stacked bar chart is good for comparing an attribute across multiple groups (category levels).

Remember that the `y` value is now the total number of responses per uni. We can subdivide it by adding `type` to the `color` channel.

In [23]:
chart.mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('sum(res)'),
    alt.Tooltip(['uni', 'res']),
    alt.Color('type'),
).interactive()

> NOTE: Bars that share an x position are stacked by default, which effectively sums them, so in this case `alt.Y("res")` gives the same result as `alt.Y("sum(res)")`.

If `stack` is `None`, every bar starts from a baseline of zero and the bars are drawn over one another, so some groups may be hidden.

In [21]:
# Disable the default stacking
chart.mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('sum(res)', stack=None),
    alt.Color('type'),
    alt.Tooltip(['uni', 'res']),
).interactive()
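
The difference between stacking and `stack=None` can be sketched as arithmetic on the bar segments. In the pandas sketch below, `y0` and `y1` are illustrative names for where each segment starts and ends; the stacking order follows the row order here, which is an assumption and may differ from the order Altair draws.

```python
import pandas as pd

# First nine rows of the dataset, copied from the data.head(10) output above
sample = pd.DataFrame(
    {
        "uni_abbr": ["ASU"] * 3 + ["BU"] * 3 + ["CPSLO"] * 3,
        "type": ["Nonuse", "No", "Yes"] * 3,
        "res": [46, 53, 16, 17, 44, 7, 7, 10, 8],
    }
)

# Stacked: each segment starts where the previous one ended
stacked = sample.copy()
stacked["y1"] = stacked.groupby("uni_abbr")["res"].cumsum()
stacked["y0"] = stacked["y1"] - stacked["res"]
stacked_height = stacked.groupby("uni_abbr")["y1"].max()

# Not stacked: every segment starts at zero, so the visible
# height is just the largest single segment
layered_height = sample.groupby("uni_abbr")["res"].max()

print(stacked[["uni_abbr", "type", "y0", "y1"]])
print(stacked_height, layered_height, sep="\n")
```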

### Color Palette

As with other visualization libraries, Altair lets us change the colors shown to users. We do this by specifying a color scale with the `domain` and `range` properties of the `Color` channel:

- `domain`: the levels of your categorical variable/attribute
- `range`: the intended output values (colors); for a nominal attribute we need a `qualitative` color scheme

In [27]:
# Specify the levels
domain = ["No", "Nonuse", "Yes"]
# Colors for each level, in the same order as `domain`
# (note: the name `range` shadows Python's built-in `range` in this notebook)
range = ["#66c2a5", "#fc8d62", "#8da0cb"]

# encode the chart now
chart.mark_bar().encode(
    alt.X("uni_abbr:N", sort="-y"),
    alt.Y("sum(res)"),
    alt.Color("type", scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(["uni","res"])
).interactive()
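
The scale pairs `domain` and `range` by position: the first level gets the first color, and so on. A small pure-Python sketch of that pairing, using the values from the cell above (`colors` and `color_for` are illustrative names, chosen to avoid shadowing the built-in `range`):

```python
# Levels and colors from the cell above, kept in matching order
domain = ["No", "Nonuse", "Yes"]
colors = ["#66c2a5", "#fc8d62", "#8da0cb"]

# Position i in the domain maps to position i in the colors
color_for = dict(zip(domain, colors))
print(color_for)
```

Reordering one list without the other changes which response type gets which color, so it helps to keep the two lists side by side.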

### Normalized Stacked

We could also normalize the y channel to a 0-1 scale, so that we can see the proportion of each level.

In [28]:
# Stack the chart as before, with `pct` on the y channel
chart.mark_bar().encode(
    alt.X('uni_abbr:N'),
    alt.Y('pct', scale=alt.Scale(domain=[0,1])),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'pct'])
).interactive()

> NOTE: we were given a `pct` attribute here. What if we were not?

In [30]:
# Without a precomputed proportion, use stack="normalize"
# in the channel you want to present as proportions
# (scale=alt.Scale(domain=[0,1]) only sets the axis range);
# in this case, we want it in the Y channel

In [31]:
chart.mark_bar().encode(
    alt.X('uni_abbr:N'),
    alt.Y('res', stack="normalize"),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'pct'])
).interactive()
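
What `stack="normalize"` computes can be sketched in pandas: each count divided by its university's total. On the first nine rows from the `data.head(10)` output, the result matches the provided `pct` column after rounding to three decimals; `sample` and `share` are illustrative names.

```python
import pandas as pd

# First nine rows of the dataset, copied from the data.head(10) output above
sample = pd.DataFrame(
    {
        "uni_abbr": ["ASU"] * 3 + ["BU"] * 3 + ["CPSLO"] * 3,
        "type": ["Nonuse", "No", "Yes"] * 3,
        "res": [46, 53, 16, 17, 44, 7, 7, 10, 8],
        "pct": [0.4, 0.461, 0.139, 0.25, 0.647, 0.103, 0.28, 0.4, 0.32],
    }
)

# Divide each count by the total for its university
share = sample["res"] / sample.groupby("uni_abbr")["res"].transform("sum")
print(share.round(3))
```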

One insight apparent in the normalized stacked chart is that **Dartmouth College** has the highest proportion of students who have met up with someone from Tinder (i.e., a Yes response).
Let's focus on the institutions in the Southern states.
Using the filter transform, visualize the total number of respondents for institutions in the south and stack by their response type.

In [33]:
chart.mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('sum(res)'),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'pct'])
).interactive().transform_filter(
    (alt.datum.region == 'south')
)

The first thing we notice is that nine universities from the southern states of the USA participated in the survey. Appalachian State University had the largest number of responses. Because there are huge disparities in the number of responses across institutions, the default axis makes it difficult to get a sense of the proportion of students who had a successful meetup from Tinder at universities with fewer respondents. Let's change from a stacked bar chart to a normalized stacked chart.

In [35]:
chart.mark_bar().encode(
    alt.X('uni_abbr:N', sort='-y'),
    alt.Y('sum(res)', stack='normalize'),
    alt.Color('type', scale=alt.Scale(domain=domain, range=range)),
    alt.Tooltip(['uni', 'res']),
).interactive().transform_filter(
    (alt.datum.region == 'south')
)

With the normalized stacked bar chart we can discern that the University of Arkansas, Virginia University, and the University of Virginia had larger proportions of students who met up with a person they met on Tinder.
Each variation of the bar chart comes with its strengths and limitations.
Let's look at one more variation before concluding this lesson.