# Grammar of Graphics (30 points)

For a long time, Python's visualization ecosystem was fragmented and not very effective. There were a lot of libraries, each with limited support for styling and chart types. Even popular libraries such as matplotlib have fairly poor defaults: beginners tend to produce bad-looking plots, and creating better-looking ones takes a serious amount of work. This is in contrast with languages such as R, which have fantastic support; ggplot in R lets you generate publication-quality plots easily. If you generated the plots from the previous task, you can appreciate these statements.

In this task, we will briefly explore the concept of the grammar of graphics (GoG). This is a very powerful idea where you think visually about charts and generate them in an expressive grammar. If you have used `ggplot` in R, you have seen how easily it lets you build arbitrarily complicated graphs. Unfortunately, none of the Python libraries are at that caliber yet, though [altair](https://altair-viz.github.io/) and [plotnine](https://plotnine.org/) come the closest. Nevertheless, they provide good coverage of GoG.

Unfortunately, neither of them scales to large datasets. Plotnine can easily take minutes to generate a chart over the FEC dataset. Altair starts complaining after 5K rows and can take as much as 15-20 seconds even for simple charts over the FEC dataset. However, there are serious efforts underway to improve this. In any case, it is useful to explore these libraries to understand the grammar of graphics.

Optional: If you enjoy performance engineering like me, here is how we tackle this in this assignment. We use something called VegaFusion to partially speed things up. The key idea is to evaluate some of the charting logic in Python, which makes it faster. VegaFusion also has an experimental feature where the charting logic is expressed in SQL and DuckDB computes the results before they are visualized. Since it is still experimental, we will not use the DuckDB acceleration here. To see the difference made by VegaFusion, you can comment out the `alt.data_transformers.enable("vegafusion")` line below and restart the kernel. The charts will be 5-10x slower.

In [38]:
# These two lines ensure that all modules are reloaded every time a Python cell is executed.
# This allows us to modify some other Python file and immediately see the results
# instead of restarting the kernel and running every cell. 
%load_ext autoreload
%autoreload 2

import altair as alt
import pandas as pd

from ds5612_pa1 import t3_tasks


# Altair is quite slow for large datasets such as ours.
# So we use VegaFusion, which pre-evaluates data transformations in Python.
# This makes it somewhat faster.
alt.data_transformers.enable("vegafusion")

# Please read the code of get_fec_dataset to understand what it is doing
fec_df = t3_tasks.get_fec_dataset()



The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Basics of Altair (ungraded)

There are many good resources to learn Altair. A key challenge is to think in terms of a visual vocabulary (data, data types, marks, channels, views etc). This is especially hard when your first exposure to visualization is through imperative libraries such as matplotlib. As you learn more Altair, you will see that the grammar allows you to express different types of charts in a consistent manner. So switching from a bar chart to a scatter plot is a tiny change: you only change the mark. This is in contrast to something like matplotlib, where the same switch requires a much more extensive change.

A couple of good ones for the assignment are given below. If you plan to use them more extensively, then you are better off learning ggplot first and then translating.

1. [UW Data Viz](https://idl.uw.edu/visualization-curriculum/altair_introduction.html) is a good and concise intro to Altair. 

2. [Altair User Guide](https://altair-viz.github.io/user_guide/data.html) is a good next step if you want to learn more.

3. [Altair Gallery](https://altair-viz.github.io/gallery/index.html) shows a lot of useful plots. Please spend some time on this when you are free. You can see how the expressiveness of GoG allows you to write very complex and novel visualizations in a few lines. Doing something like this is hard, if not impossible, in other Python visualization libraries.


## Grammar of Graphics with Altair (ungraded)

You might want to review the visualization course slides to brush up on terms such as marks, channels etc. 

The GoG used by Altair is heavily influenced by a UW-MIT project dubbed Vega. It is related to, but sufficiently different from, the model used by ggplot, the other exemplar of GoG. Personally, I find the Altair/Vega approach to be more logical and easier to automate (e.g. when designing dashboards where the charts are generated dynamically). It is also well integrated with the scientific Python ecosystem (e.g. Pandas).

I **strongly** recommend reading the entire UW Altair document as we will combine multiple ideas simultaneously. 

- The core of Altair is the `Chart` object.
- You can then add `marks` to it to control how data is visualized. For example, point is (usually) associated with scatter plot while bar is (usually) associated with bar chart. Using such a grammar makes it clear that between scatter and bar chart, the only thing that changes is the mark while the other parts (x and y axis variables) remain the same. 
- The next is `encoding` that controls the visual channels. Intuitively, it is a dictionary that maps some important visual channels (such as x, y, color, shape, size, opacity) to the columns in your dataset.
- Altair also has good support for `data aggregation` - so when you specify the encoding, you can also specify the aggregation function.
- Altair supports `transforms` and `filters` where you can perform some transformation over your data and also filter it dynamically based on some condition.
- Altair also supports the concept of `properties`, which control miscellaneous chart properties. Title and width/height are the most common, but many more are supported.
- The final powerful idea, which is extremely cumbersome in other viz libraries, is `view composition`. This allows you to define different chart objects and then combine them in a semantically coherent manner. In the simplest case, this can be concatenating the charts so that they are displayed side by side. But you can do a lot more complex composition. We will briefly explore this idea later.

## Task 5.1 Grouped Bar Charts (7.5 points)

Let us start with something simple that should nevertheless make you try multiple ideas in Altair. We are going to create a grouped bar chart where we display the total amount of contributions obtained by Obama and Romney for each state. So there will be a "group" for Texas, one for California and so on. Within each group, there will be two bars - one for Obama and one for Romney. The y-axis will be the sum of the contribution amounts.

Here is the checklist of things that your chart should be able to handle.

1. Think carefully about the visual channels. Hint: you will need X, Y, Color and Column. Which attribute should be associated with which channel?
2. You will need a sum aggregation over the contribution amount.
3. Your chart should have the title of "Contributions by State"
4. The X axis should have the title of "Name"
5. The Y axis should have the title of "Total Contributions"
6. The Y axis should use log scale.
7. The legend should have the title "Candidate Names"
8. The bars should use blue for Obama and red for Romney.
9. The column field should have a title of "State"

Hint: Even though this looks complicated, the code should take no more than 8-10 lines.

You can see a sample plot below. Your plot should look something like or better than this. Pro tip: open it in a different viewer (e.g. browser) for seeing it in full glory.

![T5.1](resources/t51.svg) 

In [44]:
"Write your code here"
# Keep only the three columns that the chart needs.
contb_to_cand_per_state = fec_df[['contbr_st', 'cand_nm', 'contb_receipt_amt']]

bars = alt.Chart(contb_to_cand_per_state).mark_bar().encode(
    x=alt.X('cand_nm:N', title='Name'),
    y=alt.Y('sum(contb_receipt_amt)', title='Total Contributions', scale=alt.Scale(type='log')),
    color=alt.Color('cand_nm',
                    scale=alt.Scale(
                        domain=['Obama, Barack', 'Romney, Mitt'],
                        range=['blue', 'red']
                    ),
                    title='Candidate Names'
                    ),
    column=alt.Column('contbr_st:N', title='State'),
).properties(
    title = 'Contributions by State'
)

bars

## Task 5.2 Adding Interaction (7.5 points)

In the next task, we will add interaction to the previous chart. Specifically, we will once again create a bar chart where we will plot the total contribution per state for each candidate. However, we will now show the bars for one candidate at a time. We will display a dropdown that shows Obama and Romney names. When we select one of the candidates, the bar chart should dynamically update. 

Here are the things to take care of:
1. A natural approach is to do something like what you did for the previous task. However, you will quickly run into a quirk of Altair. Altair stores the data associated with the chart inside the chart object. So, if you use the `fec_df` variable, then the chart object will store the entire 150 MB of data inside it. When you save your Jupyter notebook, it will grow to 200 MB or more because of this.
2. There are many ways to solve this (such as transformed df) but let us go with the simplest. Note that we do not really need the entire data frame to create the chart. As long as we have a summary data frame containing the total amount for each candidate in each state, we will be fine. I have created a new data frame called `df_grouped_by_nm_st`. Use this as the data source for your `alt.Chart` object.
3. X-axis title is "State"
4. Y-axis title is "Total Contribution"
5. Chart title is "Total Contribution Grouped by State"
6. Legend title is "State"
7. The interactivity selector name is "Candidate Name"


Hint: Even though this looks complicated, the code should take no more than 8-10 lines.

Hint: Use the following functions: `binding_select`, `selection_point`, `add_params`, `transform_filter`.

You can see a sample plot below. Your plot should look something like or better than this. Pro tip: open the SVG file in a different viewer (e.g. browser) for seeing it in full glory.

#### Image with Candidate Selector 
![T5.2](resources/t52.png) 

#### Full Image
![T5.2](resources/t52.svg) 

In [40]:
df_grouped_by_nm_st = fec_df.groupby(["contbr_st", "cand_nm"])["contb_receipt_amt"].sum().reset_index()
df_grouped_by_nm_st

| | contbr_st | cand_nm | contb_receipt_amt |
|---|---|---|---|
| 0 | AA | Obama, Barack | 56405.00 |
| 1 | AA | Romney, Mitt | 135.00 |
| 2 | AB | Obama, Barack | 2048.00 |
| 3 | AE | Obama, Barack | 42973.75 |
| 4 | AE | Romney, Mitt | 5680.00 |
| ... | ... | ... | ... |
| 120 | WV | Romney, Mitt | 126725.12 |
| 121 | WY | Obama, Barack | 194046.74 |
| 122 | WY | Romney, Mitt | 252595.84 |
| 123 | XX | Romney, Mitt | 400250.00 |


In [45]:
"Write your code here"

# Dropdown widget; its label is the selector name required by the task.
drop_down = alt.binding_select(options=['Obama, Barack', 'Romney, Mitt'], name='Candidate Name')
selection = alt.selection_point(fields=['cand_nm'], bind=drop_down, name='Candidate', value='Obama, Barack')

bars = alt.Chart(df_grouped_by_nm_st).mark_bar().encode(
    x=alt.X('contbr_st', title='State'),
    y=alt.Y('contb_receipt_amt', title='Total Contribution'),
    color=alt.Color('contbr_st', title='State'),
).add_params(
    selection
).transform_filter(
    selection  # keep only the rows of the selected candidate
).properties(
    title='Total Contribution Grouped by State'
)
bars

## Task 5.3 Comparing Contributions from Top States (7.5 points)

In the next task, we will create a stacked bar chart. We will analyze how much the top 100 cities contribute to each candidate.


1. Create a grouped data frame `df_grouped_by_nm_st_cty` using logic very similar to that of `df_grouped_by_nm_st`. It should be grouped by candidate name, contributor state and contributor city (in that order). For each group, compute the total contribution to that candidate. Do not forget to reset the index.

2. Next, we will order these entries by total contribution and select the top 100, i.e. the 100 rows of the previous df with the largest amounts given to either candidate. It is possible that the same city is in the top 100 twice (for example, once per candidate). That is okay.

3. Assign the top-100 rows to a variable called `sorted_df_grouped_by_nm_st_cty`.

4. Now generate the chart as described below using `sorted_df_grouped_by_nm_st_cty` AND NOT the full group-by result, which can have 20000 or so entries.

5. We put the candidate details side-by-side and display the states in a vertical manner. For each state, we show a stacked bar where the length of each segment corresponds to the total contribution of the corresponding city.

6. As before, ensure that the axis, legend and chart titles are set correctly.

7. For fun, show the city, the state and the total contribution from that city in a tooltip. When I hover over a city in the bar chart, these details should be shown.

Hint: Even though this looks complicated, the code should take no more than 8-10 lines.
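The group, sort and top-N steps can be tried on a tiny example first. The sketch below uses invented rows (shortened candidate names, made-up cities and amounts) and a top 3 instead of a top 100. It also shows how the same city can legitimately appear twice.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "cand_nm": ["Obama", "Obama", "Romney", "Romney", "Romney", "Obama"],
    "contbr_st": ["TX", "TX", "TX", "CA", "CA", "CA"],
    "contbr_city": ["Austin", "Austin", "Austin", "Irvine", "Irvine", "Fresno"],
    "contb_receipt_amt": [100.0, 50.0, 400.0, 30.0, 20.0, 10.0],
})

# One row per (candidate, state, city) with the summed amount.
grouped = (
    toy.groupby(["cand_nm", "contbr_st", "contbr_city"])["contb_receipt_amt"]
    .sum()
    .reset_index()
)

# Largest totals first, then keep the first three rows.
top3 = grouped.sort_values(by="contb_receipt_amt", ascending=False).head(3)
print(top3)
```

Here Austin shows up twice in `top3`, once for each candidate, because the grouping key includes the candidate name.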

You can see a sample plot below. Your plot should look something like or better than this. Pro tip: open the SVG file in a different viewer (e.g. browser) for seeing it in full glory.


![T5.3](resources/t53.svg) 

In [47]:
"Write your code here"
# Group by candidate, state and city (in that order) and total the contributions.
df_grouped_by_nm_st_cty = fec_df.groupby(["cand_nm", "contbr_st", "contbr_city"])["contb_receipt_amt"].sum().reset_index()
# Keep the 100 rows with the largest totals.
sorted_df_grouped_by_nm_st_cty = df_grouped_by_nm_st_cty.sort_values(by="contb_receipt_amt", ascending=False).head(100)

bar = alt.Chart(sorted_df_grouped_by_nm_st_cty).mark_bar().encode(
    x = alt.X("contb_receipt_amt", title="Total Contribution"),
    y = alt.Y("contbr_st", title="State"),
    color = alt.Color("contbr_city", title="City"),
    column = alt.Column("cand_nm", title="Candidate Name"),
    # Show city, state and amount when hovering over a bar segment.
    tooltip = ["contbr_city", "contbr_st", "contb_receipt_amt"]
).properties(
    title = "Contributions in top 100 Cities"
)

bar

## Task 5.4 Multi View Composition (7.5 points)

So far, you have built some charts using Altair. Sure, the Altair code is smaller and more logical than the corresponding Matplotlib code. But it is still not clear why we should use grammar of graphics. In this final task, we will do a simple assignment to answer this.

We are going to create a hybrid chart that is a combination of multiple other charts. It is extremely hard, if not impossible, to do this using traditional methods such as matplotlib. However, you can do this using fewer than 15 lines of code in Altair.

Our hybrid chart is an amalgamation of three charts that are constructed using the `sorted_df_grouped_by_nm_st_cty` data frame that we created in the previous task.
1. The first chart is a traditional stacked bar chart that is drawn for each state, where the two stacks are for Obama and Romney.
2. The second is a line chart that shows the average contribution per city.
3. The final one is a text layer that displays the average contribution per city.

Let us do this step-by-step. 

1. The game plan is to create three different charts and then create a final chart in which each of these three charts is a `layer`. Create these individual charts and ensure that they work well individually before combining them.


2. The first layer is for the stacked bar. Create a chart with the name `stacked_bars`. It is a bar chart over `sorted_df_grouped_by_nm_st_cty`. The X-axis is the state while the Y-axis is the contribution amount. We will use this chart as the base and use it to control the final chart properties. So, you can use this chart to set the X-axis and Y-axis titles. You will also modify it so that Obama and Romney data are colored in blue and red respectively. Finally, the Y-axis is shown in a sqrt scale (which, like the log scale, compresses large values). This is needed as some states (such as Texas or California) give much more money than others. So if we draw in the normal (linear) scale, the bars of the other states will look ugly.


3. The second layer is a line chart named `average_line`. Again, it operates on `sorted_df_grouped_by_nm_st_cty`. The X-axis is the state while the Y-axis is the average of the contribution amount. So if there are 4 cities in the list for a state, the average will be a quarter of their sum. You can modify the line properties so that the line is black (or your favorite color) and thicker than normal using the `strokeWidth` property.


4. The third layer is a text chart named `text`. Of course, calling it a chart is an exaggeration, but it fits. It just displays text: the average amount per state. We could build it independently, but that becomes messy. So here we use a cool trick where we take the `average_line` chart and modify it to get a new chart. This is a good idea for two reasons. First, the average is already computed, so we can just reuse the value instead of recomputing it. Second, the `average_line` chart already knows the position of each point on the line, so we can use that position to display the text above that point. Look at the `mark_text` mark type and play with the `align` and `baseline` properties. You will encode the text channel with the average of the contribution amount.

5. Finally, we will create a layered chart using the following line:

> `final_chart = (stacked_bars + average_line + text).properties(title="Multiview Composition FTW")`


Hint: Even though this looks complicated, believe it or not, the whole chart can be generated in fewer than 15 lines.

You can see a sample plot below. Your plot should look something like or better than this. Pro tip: open the SVG file in a different viewer (e.g. browser) for seeing it in full glory.


![T5.4](resources/t54.svg) 

In [115]:
"Write your code here"
# Layer 1: stacked bars, one stack per state, colored by candidate.
stacked_bars = alt.Chart(sorted_df_grouped_by_nm_st_cty).mark_bar().encode(
    x = alt.X("contbr_st", title="State"),
    y = alt.Y("contb_receipt_amt", title="Total Contribution", scale=alt.Scale(type="sqrt")),
    color = alt.Color("cand_nm",
                      scale=alt.Scale(
                        domain=["Obama, Barack", "Romney, Mitt"],
                        range=["blue", "red"]
                      ),
                      title="Candidate Names"
                      )
).properties(
  height = 500,
  width = 1500
)

# Layer 2: average contribution per state, drawn as a thick black line.
average_line = alt.Chart(sorted_df_grouped_by_nm_st_cty).mark_line(strokeWidth=3).encode(
    x = alt.X("contbr_st", title="State"),
    y = alt.Y("average(contb_receipt_amt)", title="Total Contribution", scale=alt.Scale(type="sqrt")),
    color = alt.value("black")
)

# Layer 3: derive the labels from average_line so that its data, position and
# aggregate are reused; only the mark changes and a text channel is added.
text = average_line.mark_text(align="center", baseline="bottom", dy=-5).encode(
    text=alt.Text("average(contb_receipt_amt)", format=",.0f")
)

# Combine the three layers and set the title required by step 5.
final_chart = (stacked_bars + average_line + text).properties(title="Multiview Composition FTW")
final_chart