# Case Study: CO2 Concentration

In this case study, we are going to reverse engineer a popular visualization of CO2 concentration over the years. This dataset is often showcased by plotting libraries as an example (e.g. [with Altair](https://altair-viz.github.io/gallery/co2_concentration.html), [with Vega-Lite](https://vega.github.io/vega-lite/examples/layer_line_co2_concentration.html)). We will walk through the steps of creating this chart.

## Data

In [1]:
import altair as alt
import pandas as pd
from vega_datasets import data

source = pd.read_csv(data.co2_concentration.url)
source.head()

Unnamed: 0,Date,CO2
0,1958-03-01,315.7
1,1958-04-01,317.46
2,1958-05-01,317.51
3,1958-07-01,315.86
4,1958-08-01,314.93


The data is clean and has 2 columns: "Date" and "CO2".  Let's first plot "Date" as x and "CO2" as y in a line plot.

In [2]:
base = alt.Chart(source)
line_chart = base.mark_line()
line_chart = line_chart.encode(x="Date:T",
                               y=alt.Y("CO2:Q", scale=alt.Scale(zero=False)))
line_chart

Note that we used the suffix `:T` when encoding the "Date" column to indicate that the data is of temporal type. Altair will pick the right scale using this information. Also notice that we are using the full signature of `alt.Y(...)` for the y encoding.  This is because, in addition to specifying the column, we want to change the default encoding scale so that it does not include zero. Feel free to change the y encoding to `y="CO2:Q"` and see the difference.
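To see why excluding zero matters here, the following rough sketch estimates how much of the y axis would actually carry data if the axis started at zero. It uses only the five readings shown by `source.head()` above and assumes the axis would span roughly from 0 to the largest value; the exact domain Altair picks may differ slightly.

```python
# The five CO2 readings shown by source.head() above.
co2 = [315.7, 317.46, 317.51, 315.86, 314.93]

lo, hi = min(co2), max(co2)

# Assumption: with zero included, the axis spans roughly [0, hi],
# while the data itself only spans [lo, hi].
fraction_used = (hi - lo) / hi
print(f"{fraction_used:.1%} of the axis height carries data")
```

For these rows, under 1% of the axis height would be used, which is why the line looks nearly flat when zero is included.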

Now, we want to group the data by decade. Instead of using Altair's transform capabilities as shown [here](https://altair-viz.github.io/gallery/co2_concentration.html), I feel it is easier to add a few new columns using Pandas directly.

In [3]:
source['Date'] = pd.to_datetime(source['Date'])
source['year'] = source['Date'].dt.year
source['month'] = source['Date'].dt.month
source['decade'] = (source['year'] // 10) * 10
source['decade_year'] = source['Date'].dt.month / 12 + (source['year'] - source['decade'])
source.head()

Unnamed: 0,Date,CO2,year,month,decade,decade_year
0,1958-03-01,315.7,1958,3,1950,8.25
1,1958-04-01,317.46,1958,4,1950,8.333333
2,1958-05-01,317.51,1958,5,1950,8.416667
3,1958-07-01,315.86,1958,7,1950,8.583333
4,1958-08-01,314.93,1958,8,1950,8.666667


Here we added "year", "month", "decade" and "decade_year" columns.  The first three columns are self-explanatory.  The "decade_year" column is a decimal value that expresses a date's position within its decade. It ranges from 1/12 (January of a decade's first year) to 10 (December of its last year), and we will use this column as the x coordinate in our plot.
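As a quick check of the arithmetic, here is a minimal sketch that recomputes "decade_year" for single dates in plain Python. The helper `decade_year` is a hypothetical name introduced for illustration; it mirrors the pandas expression in the cell above.

```python
from datetime import date


def decade_year(d: date) -> float:
    """Position of a date within its decade, mirroring the pandas expression."""
    decade = (d.year // 10) * 10  # e.g. 1958 -> 1950
    return d.month / 12 + (d.year - decade)


# March 1958 is 8 years plus 3/12 into the 1950s.
print(decade_year(date(1958, 3, 1)))   # 8.25
# January of a decade's first year gives the smallest value, 1/12.
print(decade_year(date(1960, 1, 1)))
# December of a decade's last year gives the largest value.
print(decade_year(date(1969, 12, 1)))  # 10.0
```

The first result matches the 8.25 shown for 1958-03-01 in the table above.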

In [4]:
base = alt.Chart(source)
line_chart = base.mark_line()
line_chart = line_chart.encode(x=alt.X("decade_year:Q", title="Year in a decade"), 
                               y=alt.Y("CO2:Q", scale=alt.Scale(zero=False), title="CO2 concentration"), 
                               color=alt.Color('decade', scale=alt.Scale(scheme="magma"), legend=None))
line_chart

This plot looks much better! In addition to changing the x encoding, we also encode "decade" as color.  By doing so, we are also implicitly specifying grouping information. All data points with the same "decade" value are grouped together when drawing the line plot. Try changing the color to use the "year" column and see the effect!
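The grouping idea can be sketched without any plotting library: each distinct value of the grouping key yields one line. The `(year, month)` pairs below are illustrative only, and `group_by` is a hypothetical helper, not part of Altair.

```python
from collections import defaultdict


def group_by(points, key):
    """Collect points into lists keyed by key(point); each list is one line."""
    groups = defaultdict(list)
    for p in points:
        groups[key(p)].append(p)
    return dict(groups)


# Illustrative (year, month) pairs; no CO2 values are needed to show grouping.
points = [(1958, 3), (1958, 4), (1959, 1), (1960, 1), (1961, 7), (1970, 2)]

by_decade = group_by(points, key=lambda p: (p[0] // 10) * 10)
by_year = group_by(points, key=lambda p: p[0])

print(sorted(by_decade))  # [1950, 1960, 1970]
print(sorted(by_year))    # [1958, 1959, 1960, 1961, 1970]
```

Grouping by decade produces three lines for these points, while grouping by year produces five shorter ones, which is the kind of difference you should see when switching the color encoding.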

Another interesting change we made is the use of the magma color scheme. We will talk more about color later in the course.  Magma is one of the standard color schemes that Altair inherits from [Vega](https://vega.github.io/vega/).  See more color schemes [here](https://vega.github.io/vega/docs/schemes/).

This chart almost looks like what we want. The only thing left is to add some text labels.

In [5]:
text_base = base.encode(x="decade_year", 
                        y="CO2:Q", 
                        text="year", 
                        color=alt.value("black"))

start_decade = text_base.mark_text(align="left", baseline="top")
start_decade = start_decade.transform_filter(alt.datum.decade_year == 1/12)
end_decade = text_base.mark_text(align="left", baseline="bottom")
end_decade = end_decade.transform_filter(alt.datum.decade_year == 10)
other_year = text_base.mark_text(align="right", baseline="top")\
                      .transform_filter(((alt.datum.year==1958) & (alt.datum.month==3))) +\
             text_base.mark_text(align="left", baseline="top")\
                      .transform_filter(((alt.datum.year==2017) & (alt.datum.month==12)))

line_chart + start_decade + end_decade + other_year

Here we add labels in three groups.  The `start_decade` and `end_decade` layers add labels corresponding to the start and end of a decade.  We use the [filter transform](https://altair-viz.github.io/user_guide/transform/filter.html#) to single out the data points corresponding to the start and end of each decade.  The `other_year` layer adds labels to the earliest and latest data points.  Again, the filter transform is used to pick those data points.
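To make the two decade filters concrete, the sketch below enumerates every month of one decade and applies the same equality tests in plain Python. The 1960s are an arbitrary choice, and `decade_year` here is a hypothetical helper that repeats the formula used for the column.

```python
def decade_year(year: int, month: int) -> float:
    """Same formula as the "decade_year" column."""
    return month / 12 + (year - (year // 10) * 10)


# Every (year, month) pair in one decade; the 1960s are an arbitrary example.
decade = 1960
months = [(y, m) for y in range(decade, decade + 10) for m in range(1, 13)]

# The predicates used by start_decade and end_decade.
start = [(y, m) for y, m in months if decade_year(y, m) == 1 / 12]
end = [(y, m) for y, m in months if decade_year(y, m) == 10]

print(start)  # [(1960, 1)]
print(end)    # [(1969, 12)]
```

Each predicate matches exactly one month per decade: January of the first year and December of the last. Because the filters rely on exact equality, a decade whose January or December reading is missing from the data would simply get no label at that end.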

## Summary

Great job! We have completed this case study.  Here is a quick summary of the topics we touched on:
* Sometimes it is easier to modify the Pandas data frame than to use Altair's transform capability.
* Color encoding can implicitly group data points together.
* Altair provides a number of [color schemes](https://vega.github.io/vega/docs/schemes/) to choose from.
* [Filter transform](https://altair-viz.github.io/user_guide/transform/filter.html) can be useful for adding labels to selected data points.