**SA463A &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2020 &#x25aa; Foraker and Uhan**

# Exam 1 &mdash; Part II

## Instructions

This exam has 2 parts: Part I and Part II (this part). You should have submitted Part I before starting this part.

__This exam is due at the end of class on Friday 10/2.__

For this part of the exam, you may use your own course materials (e.g. notes, textbook), as well as any materials directly linked from the [course website](https://www.usna.edu/Users/math/uhan/sa463a/). __No collaboration allowed.__

There are 3 problems in this part, worth a total of 75 points. The exam (both parts) is worth a total of 100 points.

Save your work frequently! When you are finished, submit this file using the SA463A Assignment Submission Form linked on the [course website](https://www.usna.edu/Users/math/uhan/sa463a/).

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Background

In this exam, you'll use a dataset on carbon dioxide and greenhouse gas emissions, based on [this data from Our World in Data](https://github.com/owid/co2-data). 

The dataset contains the following variables/columns for a subset of countries around the world, from 2000 to 2016:

| Column | Description |
| :- | :- |
| `year` | Year of observation |
| `region` | Region of the world |
| `country` | Country |
| `co2` | Annual production-based CO2 emissions (million tonnes) |
| `ghg` | Annual greenhouse gas emissions (million tonnes of CO2 equivalents) |
| `population` | Total population of country |
| `gdp` | Total real GDP, inflation-adjusted |


The code cell below imports Pandas and Altair, loads the data into a Pandas DataFrame called `df`, and displays the first five rows of `df`. Run this cell.

In [1]:
# Import Pandas and Altair
import pandas as pd
import altair as alt

# Create DataFrame with data
df = pd.read_csv('data/ghg.csv')

# Display first five rows
df.head()

Unnamed: 0,year,region,country,co2,ghg,population,gdp
0,2000,Americas,Argentina,141.717,366.34,36871000.0,557000000000.0
1,2001,Americas,Argentina,133.311,383.27,37276000.0,525000000000.0
2,2002,Americas,Argentina,124.382,386.17,37682000.0,453000000000.0
3,2003,Americas,Argentina,134.621,408.49,38088000.0,487000000000.0
4,2004,Americas,Argentina,157.034,436.71,38492000.0,532000000000.0


<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problem 1

Create a scatter plot that shows the relationship between annual production-based CO2 emissions and annual greenhouse gas emissions in 2016 among the countries in the dataset. Use color to highlight the region of the world associated with each point in the scatter plot. Provide a descriptive title for each of the axes and the legend. Your chart should look like this:

![](img/problem1.svg)

In [2]:
# Solution
alt.Chart(df).transform_filter(
    'datum.year == 2016'
).mark_circle(size=50).encode(
    alt.X('ghg:Q', title='Annual greenhouse gas emissions (million tonnes CO2)'),
    alt.Y('co2:Q', title='Annual production-based CO2 emissions (million tonnes)'),
    alt.Color('region:N', title='Region of the world')
)
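
As a quick check on how many points the scatter plot should contain, the same 2016 filter can be applied in Pandas. This is a minimal sketch on made-up data; `toy` and `toy_2016` are hypothetical names, not part of the exam dataset.

```python
import pandas as pd

# Hypothetical toy data with a few of the dataset's columns (values are made up)
toy = pd.DataFrame({
    'year': [2015, 2016, 2015, 2016],
    'country': ['A', 'A', 'B', 'B'],
    'co2': [1.0, 2.0, 3.0, 4.0],
})

# Same condition as the 'datum.year == 2016' filter expression
toy_2016 = toy[toy['year'] == 2016]

# After filtering to a single year, each country should appear at most once
print(len(toy_2016), toy_2016['country'].is_unique)
```

Running the same two lines on `df` would show how many countries have a 2016 observation, and therefore how many circles to expect in the chart.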

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problem 2

Create a 6x8 matrix of line charts, one for each of the 48 countries represented in the dataset, that shows the annual production-based CO2 emissions per year. Configure each chart to be 100 pixels wide and 100 pixels tall. Provide a descriptive title for each of the axes, and a title for the overall chart. Your matrix of charts should look like this, except with more charts:

![](img/problem2.svg)

In [3]:
# Solution
base = alt.Chart().mark_line().encode(
    alt.X('year:O', title='Year'),
    alt.Y('co2:Q', title='CO2 emissions')
).properties(
    width=100,
    height=100
)

base.facet(
    data=df,
    facet=alt.Facet('country:N', title='Annual Production-Based CO2 Emissions By Country (million tonnes)'),
    columns=6
)
# ).transform_filter(
#     alt.FieldOneOfPredicate(
#         field='country', 
#         oneOf=['Argentina', 'Australia', 'Austria',
#                'Belarus', 'Belgium', 'Brazil']
#     )
# ).save('img/problem2.svg')
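
The layout produced by `columns=6` can be sanity-checked with a little arithmetic: a wrapped facet places at most `columns` panels in each row, so the number of rows is the panel count divided by `columns`, rounded up. The helper below is a hypothetical illustration, not an Altair function.

```python
import math

def facet_grid_shape(n_panels, columns):
    """Return (rows, columns) for a wrapped facet holding `n_panels` panels.

    Hypothetical helper for illustration only; assumes n_panels >= columns.
    """
    rows = math.ceil(n_panels / columns)
    return rows, columns

# 48 countries wrapped at 6 panels per row
print(facet_grid_shape(48, 6))
```

With 48 countries and 6 panels per row, this gives 8 full rows and no partially filled last row.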

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problem 3

### Option A

Create a line chart that shows the annual greenhouse gas emissions *per capita* for each year and region. Use color to highlight the region of the world associated with each line in the chart. Provide a descriptive title for each of the axes and the legend. Your chart should look like this:

![](img/problem3.svg)

You can compute the annual greenhouse gas emissions per capita *for each region-year pair* as follows:

1. Compute the total greenhouse gas emissions for each region-year pair.
2. Compute the total population for each region-year pair.
3. Compute the annual greenhouse gas emissions per capita by taking the values you computed in \#1 and dividing them by the corresponding values you computed in \#2.


### Option B &mdash; Partial Credit

If you're having trouble creating the chart described above, you can create the following chart instead for partial credit: Create a line chart that shows the total annual greenhouse gas emissions for each year and region. Use color to highlight the region of the world associated with each line in the chart. Provide a descriptive title for each of the axes and the legend. Your chart should look like this:

![](img/problem3_alt.svg)

<p style="font-weight:bold;background:yellow;">In a comment, state which option you are submitting work for. Your work will be graded using only one option's rubric.</p>

In [4]:
# Solution
alt.Chart(df).transform_aggregate(
    groupby=['region', 'year'],
    region_ghg = 'sum(ghg):Q',
    region_pop = 'sum(population):Q',
).transform_calculate(
    region_ghg_per_capita = 'datum.region_ghg / datum.region_pop'
).mark_line().encode(
    alt.X('year:O', title='Year'),
    alt.Y('region_ghg_per_capita:Q', title='Annual greenhouse gas emissions per capita (million tonnes CO2/person)'),
    alt.Color('region:N', title='Region of the world')
)
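
The three steps listed under Option A can also be carried out in Pandas, which is one way to cross-check the values the chart plots. This is a sketch on made-up data, not the graded solution; `toy` and `totals` are hypothetical names, and the output column names simply mirror those used in the Altair transforms.

```python
import pandas as pd

# Hypothetical toy data with the same column names as `df` (values are made up)
toy = pd.DataFrame({
    'year': [2015, 2015, 2015, 2016, 2016, 2016],
    'region': ['Americas', 'Americas', 'Europe', 'Americas', 'Americas', 'Europe'],
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'ghg': [100.0, 300.0, 50.0, 120.0, 280.0, 60.0],
    'population': [10.0, 30.0, 10.0, 10.0, 30.0, 20.0],
})

# Steps 1 and 2: total GHG emissions and total population per region-year pair
totals = (
    toy.groupby(['region', 'year'], as_index=False)
       .agg(region_ghg=('ghg', 'sum'), region_pop=('population', 'sum'))
)

# Step 3: divide the totals to get emissions per capita
totals['region_ghg_per_capita'] = totals['region_ghg'] / totals['region_pop']
print(totals)
```

Note that this is a ratio of regional totals. Averaging each country's own per capita value within a region would generally give a different number, because it ignores differences in population between countries.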

In [5]:
# Solution - Partial credit
alt.Chart(df).mark_line().encode(
    alt.X('year:O', title='Year'),
    alt.Y('sum(ghg):Q', title='Total annual greenhouse gas emissions (million tonnes CO2)'),
    alt.Color('region:N', title='Region of the world')
)

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Grading rubric

### Problem 1

|                                                                                   | Points |
| :-                                                                                | -:     |
| Only used data from 2016                                                          | 4      |
| Created scatter plot, using correct encodings for CO2 emissions and GHG emissions | 10     |
| Used color to differentiate between regions                                       | 5      |
| Provided descriptive title for each of the axes and the legend                    | 6      |
| **Total**                                                                         | **25** |

### Problem 2

|                                                                             | Points |
| :-                                                                          | -:     |
| Created base line chart, using correct encodings for CO2 emissions and year | 8      |
| Adjusted size of base line chart                                            | 3      |
| Created 6x8 matrix of line charts                                           | 8      |
| Provided descriptive title for each of the axes and the overall chart       | 6      |
| **Total**                                                                   | **25** |

### Problem 3 &mdash; Option A

|                                                                                         | Points |
| :-                                                                                      | -:     |
| Compute total GHG emissions and population for each region-year pair                    | 4      |
| Compute annual GHG emissions per capita for each region-year pair                       | 3      |
| Create line chart, using correct encodings for annual GHG emissions per capita and year | 8      |
| Used color to differentiate between regions                                             | 4      |
| Provided descriptive title for each of the axes and the legend                          | 6      |
| **Total**                                                                               | **25** |

### Problem 3 &mdash; Option B

|                                                                                               | Points |
| :-                                                                                            | -:     |
| Create line chart, using correct encodings for total annual GHG emissions and year            | 8      |
| Used color to differentiate between regions                                                   | 4      |
| Provided descriptive title for each of the axes and the legend                                | 6      |
| **Total**                                                                                     | **18** |