# In-Class Lecture: Altair Fundamentals 


## Learning Objectives
- Learn the Visualization Grammar used in Altair. 
- Gain experience creating vizzes with 2 - 4 channels encoded. 
- Create and differentiate between regular stacked bars and normalized stacked bars (`stack='normalize'`)
- Create grouped bar charts to show side-by-side comparisons across categories
- Identify and interpret positive, negative, and no correlation relationships in scatter plots, including outlier detection


## Environment Setup

In [4]:
import pandas as pd
import numpy as np
import altair as alt

## Data Setup

### Dataset Description
We are using an updated version of the Gapminder dataset, which contains values up until 2018 for most features.

The data was collected by the [Gapminder Foundation](https://www.gapminder.org/) and shared in [Hans Rosling's popular TED talk](https://www.youtube.com/watch?v=hVimVzgtD6w). If you haven't seen the talk, we encourage you to watch it first!


| Column                | Description                                                                                  |
|-----------------------|----------------------------------------------------------------------------------------------|
| country               | Country name                                                                                 |
| year                  | Year of observation                                                                          |
| population            | Population in the country at each year                                                       |
| region                | Continent the country belongs to                                                             |
| sub_region            | Sub-region the country belongs to                                                            |
| income_group          | Income group                                                                                 |
| life_expectancy       | The mean number of years a newborn would <br>live if mortality patterns remained constant    |
| income                | GDP per capita (in USD) <em>adjusted <br>for differences in purchasing power</em>            |
| children_per_woman    | Average number of children born per woman                                                    |
| child_mortality       | Deaths of children under 5 years <br>of age per 1000 live births                             |
| pop_density           | Average number of people per km<sup>2</sup>                                                  |
| co2_per_capita        | CO2 emissions from fossil fuels (tonnes per capita)                                          |
| years_in_school_men   | Mean number of years in primary, secondary,<br>and tertiary school for men aged 25-36        |
| years_in_school_women | Mean number of years in primary, secondary,<br>and tertiary school for women aged 25-36      |

In [6]:
# If running in PL use this file path
# filepath = "data/world-data-gapminder.csv"

# If running locally on your machine use this one
filepath = 'https://raw.githubusercontent.com/kemiolamudzengi/dsci-320-datasets/main/world-data-gapminder.csv'

# COMMENT OUT THE FILE PATH THAT DOESN'T APPLY TO YOUR CONTEXT

# Read in the data using pandas, remember to set parse_dates!
gapminder = pd.read_csv(filepath, parse_dates=["year"])

# Display basic information about the dataset
print(f"Dataset shape: {gapminder.shape}")
print(f"Years covered: {gapminder.year.dt.year.min()} to {gapminder.year.dt.year.max()}")
print(f"Number of countries: {gapminder.country.nunique()}")
print(f"Regions included: {', '.join(sorted(gapminder.region.unique()))}")
print(f"Column names: {sorted(gapminder.columns)}")

Dataset shape: (38982, 14)
Years covered: 1800 to 2018
Number of countries: 178
Regions included: Africa, Americas, Asia, Europe, Oceania
Column names: ['child_mortality', 'children_per_woman', 'co2_per_capita', 'country', 'income', 'income_group', 'life_expectancy', 'pop_density', 'population', 'region', 'sub_region', 'year', 'years_in_school_men', 'years_in_school_women']


## Data Wrangling 

We want to filter the data to the most recent year with available CO₂ per capita values. 
This takes two steps: first find the most recent year that has CO₂ data, then filter the dataset to that year.


In [7]:
# Step 1: Find most recent year with CO2 data
recent_year = gapminder[gapminder.co2_per_capita.notna()].year.max()
print(f"Most recent CO2 data: {recent_year}")

# Step 2: Filter to that year
recent_data = gapminder[gapminder.year == recent_year]


Most recent CO2 data: 2014-01-01 00:00:00
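
To see why the filter needs two steps, here is a minimal sketch on a made-up table. The `toy` frame and its values are hypothetical and are not part of the Gapminder data. It shows that the latest year overall can differ from the latest year that actually has a CO2 value, and that `.year` on the resulting timestamp gives just the year number.

```python
import pandas as pd

# Hypothetical toy data: rows for 2018 exist, but their CO2 value is missing
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": pd.to_datetime(["2014", "2018", "2014", "2018"], format="%Y"),
    "co2_per_capita": [1.5, None, 4.0, None],
})

# Latest year in the table, regardless of missing values
latest_overall = toy.year.max()

# Step 1: latest year among rows that have a CO2 value
latest_with_co2 = toy[toy.co2_per_capita.notna()].year.max()

# Step 2: keep only the rows from that year
toy_recent = toy[toy.year == latest_with_co2]

print(latest_overall.year, latest_with_co2.year, len(toy_recent))
```

Filtering on `latest_overall` instead would keep only rows whose CO2 value is missing.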


## Task 1: Sorted Bar Chart - Identifying Top Emitters

Bar charts excel at showing "which countries are highest/lowest" and help us identify clear winners and outliers.

**Exploratory Question**: *Which countries are the world's biggest CO2 emitters per capita, and how do they compare to each other?*

Understanding per capita emissions helps us see which countries have the highest individual environmental impact, regardless of their total population size. This is crucial information for climate policy discussions.

---

**Your Task**: Create a horizontal bar chart showing the top 20 countries with the highest CO2 emissions per capita.

### Step-by-Step Instructions:

Create a visualization with the following specs:
- Use the **`bar`** mark
- Encode CO2 per capita (`co2_per_capita`) on the **x channel**
- Encode country names (`country`) as nominal data on the **y channel**, sorted by CO2 values (highest at top)
- Encode continent (`region`) on the **color channel**

In [21]:
# Get top 20 countries by CO2
top_co2 = recent_data.nlargest(20, 'co2_per_capita')

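# Create viz
horizontal_bar = alt.Chart(top_co2).mark_bar(strokeWidth = 5).encode(
    x = 'co2_per_capita:Q',
    y = alt.Y('country:N', sort = '-x'),  # '-x' sorts descending by x, so the highest emitter is at the top
    color = 'region:N'
)

# Shorthand form: x = 'field:type'
# Long form:      x = alt.X('field:type', <extra options such as sort or title go here>)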

# Show plot
horizontal_bar

---
## Task 2: Global Carbon Emissions
### Comparison Strategies
When analyzing change over time, the way we structure our visualization dramatically affects the story we can tell. Different stacking and grouping approaches reveal different aspects of temporal patterns. Today we'll explore three techniques for comparing categories across time periods.

**Exploratory Question**: *How have global CO2 emissions shifted between regions from 1952 to 2012, and what's the best way to show both absolute growth and changing proportions?*

### Part A: Stacked Temporal Bar Chart
**Purpose**: Show total population growth AND regional contributions

Create a visualization with the following specs:
- Use the **`bar`** mark
- Use the provided historical data from years 1952,1962,1972,1982,1992,2002,2012
- Encode year (`year`) on the **x channel** as ordinal data
- Encode sum of population (`sum(population)`) on the **y channel**
- Encode region (`region`) on the **color channel** for automatic stacking

In [32]:
### Data Wrangling

# If the year column is still datetime, convert it to an integer year first, then take the subset
if np.issubdtype(gapminder['year'].dtype, np.datetime64):
    gapminder['year'] = gapminder['year'].dt.year
subset = gapminder[gapminder.year.isin([1952,1962,1972,1982,1992,2002,2012])]


# Stacked bar showing population by continent over time
stacked_bar = alt.Chart(subset).mark_bar(strokeWidth = 10).encode(
    x = "year:O",
    y = "sum(population):Q",
    color = "region:N"
)
stacked_bar

### Part B: Normalized Stacked Bar Chart  
**Purpose**: Focus on changing proportions rather than absolute numbers

Enhance the stacked chart by:
- Use the same **x channel** and **color channel** encodings as Part A
- Encode the sum of CO2 per capita (`sum(co2_per_capita)`) on the **y channel** with `stack='normalize'` to show proportions of the total

In [None]:
# Show proportions instead of absolute values
normalized_stack_bar = alt.Chart(subset).mark_bar().encode(
    x = 'year:O',
    y = alt.Y('sum(co2_per_capita):Q', stack = 'normalize'),
    color = 'region:N'
)

# Show the normalized chart next to the regular stack for comparison.
# stacked_bar.encode(y = ...) reuses the Part A specification and only swaps the y encoding.
normalized_stack_bar | stacked_bar.encode(y = 'sum(co2_per_capita):Q')
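
To make the difference between the two stacks concrete, the sketch below reproduces the arithmetic by hand in pandas on a made-up table. The `toy` frame and its `value` column are hypothetical. It is meant as an illustration of the idea behind `stack='normalize'`, not of Altair's internals: a regular stack draws the raw sums, while a normalized stack divides each segment by the total of its bar.

```python
import pandas as pd

# Hypothetical toy data: two regions observed in two years
toy = pd.DataFrame({
    "year": [1952, 1952, 2012, 2012],
    "region": ["Asia", "Europe", "Asia", "Europe"],
    "value": [10.0, 30.0, 60.0, 40.0],
})

# Regular stack: segment heights are the raw sums per (year, region)
sums = toy.groupby(["year", "region"])["value"].sum()

# Normalized stack: each segment divided by its year's total,
# so the segments of every bar add up to 1
shares = sums / sums.groupby(level="year").transform("sum")

print(sums)
print(shares)
```

In this toy example the raw total grows from 40 to 100 between the two years, which the normalized view hides, while the shift in Asia's share from 25% to 60% is easier to read once normalized.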

### Part C: Grouped Bar Chart
**Purpose**: Enable direct comparison of each region across time

Create a grouped comparison by:
- Encode year (`year`) on the **x channel**
- Encode sum of population on the **y channel**  
- Encode region (`region`) on the **color channel**
- Encode region (`region`) on the **column channel** to create separate panels

In [34]:
# Grouped bars (side by side)
grouped_bar = alt.Chart(subset).mark_bar().encode(
    x = "year:O",
    y = "sum(population):Q",
    color = "region:N",
    column = "region:N"
)
grouped_bar

### When to Use Each Type

**Simple Bar:** Compare categories  
**Stacked Bar:** Compare parts of a whole + see totals  
**Normalized Stacked:** Compare proportions when totals vary greatly  
**Grouped Bar:** Compare multiple series across categories


## Task 3: Temporal Environmental Impact Analysis

Our next exploration takes us into one of the most critical questions of our time: how have regional CO2 emissions changed over decades? This analysis helps us understand which regions have been driving global emissions growth and how environmental responsibility has shifted over time.

**Exploratory Question**: *How have regional CO2 emission patterns evolved over the last century, and which regions show the most dramatic changes?*


**Your Task**: Create a temporal stacked bar chart showing how regional CO2 emissions have changed over decades.

### Step-by-Step Instructions:

The data wrangling has been provided for you. Create a visualization with the following specs:
- Use the `bar` mark
- Encode year (`year`) on the **x channel** as temporal data
- Encode sum of CO2 per capita (`sum(co2_per_capita)`) on the **y channel**
- Encode region (`region`) on the **color channel** with a better color scheme using `alt.Color('region:N', scale=alt.Scale(scheme='category10'))`
- Encode multiple fields on the **tooltip channel**: `year`, `region`, and `co2_per_capita`
- Set chart width to 600 pixels using `.properties(width=600)`



In [50]:
# Data Wrangling

# Convert year (int) back to datetime64
gapminder['year'] = pd.to_datetime(gapminder['year'], format='%Y')

# Filter for countries with CO2 data
co2_data = gapminder[gapminder.co2_per_capita.notna()]

# Aggregate by continent and year
co2_by_continent = co2_data.groupby(['year', 'region']).agg({
    'co2_per_capita': 'sum'
}).reset_index()


In [52]:
# Create stacked chart
co2_chart = alt.Chart(co2_by_continent).mark_bar().encode(
    x = "year:T",
    y = alt.Y("sum(co2_per_capita):Q"),
    color = alt.Color("region:N", scale = alt.Scale(scheme = 'category10')),
    tooltip = ["year:T", "region:N", "co2_per_capita:Q"]
).properties(width=600)

co2_chart

#### Follow on
Update the viz above by normalizing the `y` channel to get a better view of the patterns.



In [None]:
# Create normalized stacked chart
co2_chart_norm = ...
co2_chart_norm

## Task 4: Scatter Plot Recap Activity

**Exploratory Question**: *What is the relationship between a country's carbon emissions and life expectancy?*

This is the kind of question data analysts explore every day. Let's use our scatter plot skills to investigate this relationship and practice interpreting what we find.


**Your Task**: Create a scatter plot that reveals the relationship between carbon emissions and life expectancy using the 2014 data.

### Step-by-Step Instructions:
In the code cell below, create a visualization with the following specs:
- Use the **`circle`** mark
- Encode CO₂ per capita (`co2_per_capita`) on the **y channel**
- Encode life expectancy (`life_expectancy`) on the **x channel**
- Encode continent (`region`) on the **color channel**

**Add tooltips** showing `country`, `co2_per_capita`, and `life_expectancy`

In [None]:
scatter_plot = ...
scatter_plot


**How to tell if two variables are correlated by looking at a scatter plot**

**Positive Correlation**
- **Upward trend**: As X increases, Y tends to increase.
- **Tight clustering**: Points follow a clear upward line or curve.
- **Few outliers**: Most points conform to the pattern.

**Negative Correlation**
- **Downward trend**: As X increases, Y tends to decrease.
- **Tight clustering**: Points follow a clear downward line or curve.
- **Few outliers**: Most points fit the downward trend.

**No Correlation**
- **No clear trend**: Points are scattered randomly.
- **Wide spread**: No discernible line or curve.
- **Many outliers**: No obvious relationship between X and Y.


In [None]:
corr_value = recent_data['life_expectancy'].corr(recent_data['co2_per_capita'])
print(f"Correlation between life expectancy and carbon emissions: {corr_value:.2f}")
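
As a rough illustration of the three patterns described above, the sketch below computes the correlation coefficient for three made-up series. All values are hypothetical. `Series.corr` uses the Pearson method by default, which measures how closely the points follow a straight line: values near +1 or -1 indicate a strong upward or downward linear trend, and values near 0 indicate no linear trend.

```python
import pandas as pd

# Hypothetical x values and three made-up y series
x = pd.Series([1, 2, 3, 4, 5, 6])
up = pd.Series([2, 4, 5, 8, 9, 12])    # tends to rise as x rises
down = pd.Series([12, 9, 8, 5, 4, 2])  # tends to fall as x rises
flat = pd.Series([5, 1, 6, 6, 1, 5])   # no linear trend with x

r_up = x.corr(up)
r_down = x.corr(down)
r_flat = x.corr(flat)

print(f"positive: {r_up:.2f}")
print(f"negative: {r_down:.2f}")
print(f"none:     {r_flat:.2f}")
```

A coefficient near 0 only rules out a linear relationship, so it is still worth looking at the scatter plot itself for curved patterns and outliers.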


![Correlation](https://articles.outlier.org/_next/image?url=https%3A%2F%2Fimages.ctfassets.net%2Fkj4bmrik9d6o%2F2oArz66jpUDD00bOYo58e9%2F90ee20b033c2695c6884c5c652f75b81%2FOutlier_Graph_NegativeCorrelation-02.png&w=1080&q=75)

## Get Stepping

1. Redo the entire class. You learn by doing.
2. Go to Altair's website and get familiar with the examples there.
3. Create new questions and then create visualizations to answer them.
4. Prep for Quiz 2. It covers Tutorial 2 and Class 2 (this file) content.