# Pre-Class Tutorial 2: Scatter Charts & Bar Charts Fundamentals

## Overview

This pre-class material introduces the fundamental concepts of data visualization using Altair, focusing on scatter charts and bar charts. You'll learn about the grammar of graphics, data types, encoding channels, and basic chart creation that will prepare you for the in-class lecture and assignment.

**Time Required:** 90 minutes  
**Prerequisites:** Basic Python and pandas knowledge

---

## Learning Goals

By the end of this tutorial, you will be able to:
- Understand the grammar of graphics: data, marks, and encoding channels
- Identify and work with different data types (nominal, ordinal, quantitative, temporal)
- Create basic scatter plots and bar charts using Altair
- Apply individual encoding channels (position, color, size, shape) to visualize data
- Distinguish between different bar chart types

---

## Part 1: Introduction to Altair & Grammar of Graphics

### What is Altair?

[Altair](https://altair-viz.github.io/) is a declarative statistical visualization library for Python. By *declarative*, we mean that you can provide a high-level specification of **what** you want the visualization to include, in terms of **data**, **graphical marks**, and **encoding channels**, rather than having to specify **how** to implement the visualization in terms of for-loops, low-level drawing commands, etc.

The key idea is that you **specify associations** between **data fields** and **visual encoding channels**, such as the x-axis, y-axis, color, etc. The rest of the plot details are handled automatically.

### Setup and Data

We will be visualizing global health and population data for a number of countries. The full dataset spans 1800 to 2018, and most examples below use the year 2000. The data was collected by the [Gapminder Foundation](https://www.gapminder.org/) and shared in [Hans Rosling's popular TED talk](https://www.youtube.com/watch?v=hVimVzgtD6w). If you haven't seen the talk, we encourage you to watch it first!

In [20]:
import pandas as pd
import altair as alt
#from vega_datasets import data


#filepath = "data/world-data-gapminder.csv"
filepath = 'https://raw.githubusercontent.com/kemiolamudzengi/dsci-320-datasets/main/world-data-gapminder.csv'

# Read in the data using pandas, remember to set parse_dates!
gm = pd.read_csv(filepath, parse_dates=["year"])

# Display basic information about the dataset
print(f"Dataset shape: {gm.shape}")
print(f"Years covered: {gm.year.dt.year.min()} to {gm.year.dt.year.max()}")
print(f"Number of countries: {gm.country.nunique()}")
print(f"Regions included: {', '.join(sorted(gm.region.unique()))}")
print(f"Column names: {sorted(gm.columns)}")



# Create a subset for 2000 data for clearer examples
data2000 = gm[gm.year == '2000']   # pandas parses the string '2000' as the timestamp 2000-01-01; comparing against the integer 2000 does not work here
print(f"Year 2000 subset: {data2000.shape}")

Dataset shape: (38982, 14)
Years covered: 1800 to 2018
Number of countries: 178
Regions included: Africa, Americas, Asia, Europe, Oceania
Column names: ['child_mortality', 'children_per_woman', 'co2_per_capita', 'country', 'income', 'income_group', 'life_expectancy', 'pop_density', 'population', 'region', 'sub_region', 'year', 'years_in_school_men', 'years_in_school_women']
Year 2000 subset: (178, 14)
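
The filter above compares a datetime column with the string `'2000'`. Here is a minimal sketch of why that works, using a small made-up table in place of `gm` (the rows and values are illustrative assumptions, not taken from the dataset). pandas parses the string into the timestamp 2000-01-01 before comparing, and `.dt.year` gives an equivalent integer-based filter.

```python
import pandas as pd

# Made-up stand-in for gm; the real data is loaded from the CSV above
toy = pd.DataFrame({
    "country": ["Canada", "Canada", "Chile"],
    "year": pd.to_datetime(["1999", "2000", "2000"], format="%Y"),
    "life_expectancy": [79.0, 79.2, 76.9],
})

# String comparison: pandas converts '2000' to Timestamp('2000-01-01')
by_string = toy[toy["year"] == "2000"]

# Equivalent filter on the integer year extracted from each timestamp
by_year = toy[toy["year"].dt.year == 2000]

print(by_string["country"].tolist())
print(by_string.equals(by_year))
```

The `.dt.year` form is a little longer, but it makes the intent explicit for readers who do not know about the string parsing.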


### Your First Chart: Single Channel Encoding

Let's start with the simplest possible visualization: showing data along just one dimension. We'll create a chart that shows the distribution of fertility values (the `children_per_woman` field) using only the x-axis.

In [21]:
# Step 1: Attach data to a chart object
chart_with_data = alt.Chart(data2000)

# Step 2: Specify what kind of mark (geometric shape) to use
chart_with_marks = chart_with_data.mark_tick()

# Step 3: Encode a data field to a visual channel
simple_chart = chart_with_marks.encode(
    x='children_per_woman'  # Map fertility values to x-axis position
)

# Display the chart
simple_chart

**Let's break this down step by step:**

1. **`alt.Chart(data2000)`** - This creates a chart object and attaches our data to it. Think of this as preparing a blank canvas with your data ready to use.

2. **`.mark_tick()`** - This tells Altair to use "tick marks" (short lines) to represent each data point. Tick marks are perfect for showing distributions along one axis.

3. **`.encode(x='children_per_woman')`** - This is where the magic happens! We're telling Altair to take the `children_per_woman` field (the fertility rate) from our data and map it to the x-axis position. Each country's fertility rate determines where its tick mark appears horizontally.

We can write this more concisely:

In [22]:
# All in one line - this is the typical way to write Altair code
alt.Chart(data2000).mark_tick().encode(x='children_per_woman')

### Adding a Second Channel: Two-Dimensional Visualization

Now let's create our first scatter plot by adding a second encoding channel - the y-axis:

In [23]:
# Two-channel encoding: both x and y position
alt.Chart(data2000).mark_point().encode(
    x='children_per_woman',     # Horizontal position shows fertility rate
    y='life_expectancy'    # Vertical position shows life expectancy
)

**What changed:**
- **`.mark_point()`** instead of `.mark_tick()` - We switched to point marks (circles) because we're now showing relationships between two variables
- **Added `y='life_expectancy'`** - Now each country appears as a point whose x-position shows fertility and y-position shows life expectancy

This creates a scatter plot where we can see the relationship between fertility rates and life expectancy across countries.

### Understanding the Grammar

Every Altair chart follows this pattern:

```
alt.Chart([DATA]).mark_[TYPE]().encode([MAPPINGS])
```

- **DATA**: Your dataset (pandas DataFrame)
- **TYPE**: The geometric shape (point, bar, line, etc.)
- **MAPPINGS**: How data fields map to visual properties (x, y, color, etc.)

---

## Part 2: Data Attribute Types

The first ingredient in effective visualization is the input data. Data values can represent different forms of measurement. What kinds of comparisons do those measurements support? And what kinds of visual encodings then support those comparisons?

Data types in visualization grammars are different from data types in programming (e.g. `int`, `float`, `string`). In visualization, data is characterized as nominal (categorical), ordinal, or quantitative, and in some grammars also as temporal.

We will start by looking at the basic data types that Altair uses to inform visual encoding choices. These data types determine the kinds of comparisons we can make, and thereby guide our visualization design decisions.

### Nominal (N)

**Nominal** data represent discrete categories or groups. With nominal data we can compare the *equality* of values: *is value A the same or different than value B? (A = B)*, supporting statements like "A is equal to B" or "A is not equal to B". In the dataset above, the `country` field is nominal.

Here we can compare the equality of countries: the country value for the first row is "Afghanistan", the country value for the fourth row is "Angola", and Afghanistan ≠ Angola.

Example nominal values include movie genres (action, comedy, drama, etc.), product categories (electronics, books, clothing), and in our dataset: country names and world regions.

Let's see how nominal data affects visualization:

In [24]:
# First, let's look at the region field (nominal data)
print("Unique regions:", data2000['region'].unique())

# Simple bar chart using nominal data
alt.Chart(data2000).mark_bar().encode(
    x='region:N'  # :N explicitly tells Altair this is nominal data
)

Unique regions: ['Asia' 'Europe' 'Africa' 'Americas' 'Oceania']


**Step-by-step explanation:**
1. **`'region:N'`** - The `:N` suffix explicitly tells Altair to treat region as nominal data
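2. **No aggregation yet** - Because only `x` is encoded, Altair does not count anything: each row is drawn as a bar at its region's position, so the bars within a region overlap and bar size does not encode a value. To show how many countries are in each region, add `y='count()'`, as in the bar mark example in Part 4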
3. **Axis treatment** - Nominal data gets evenly spaced along the axis with category labels

### Ordinal (O)

**Ordinal** data represent discrete categories with a meaningful order. With ordinal data we can make *rank-order* comparisons: *is value A less than, equal to, or greater than value B? (A < B, A = B, A > B)*. Ordinal data support both equality and rank-order operations.

Examples include education levels (elementary < middle school < high school < college), income brackets (low < medium < high), survey responses (disagree < neutral < agree), and t-shirt sizes (small < medium < large).

Let's create ordinal data from our dataset:

In [25]:
# Create ordinal population categories
import numpy as np
data2000_with_pop_category = data2000.copy()
data2000_with_pop_category['pop_category'] = pd.cut(
    data2000_with_pop_category['population'], 
    bins=[0, 5e6, 20e6, 100e6, np.inf], 
    labels=['Small', 'Medium', 'Large', 'Very Large']
)

# Use ordinal encoding - notice the difference from nominal
alt.Chart(data2000_with_pop_category).mark_bar().encode(
    # :O marks the field as ordinal; the explicit sort list pins the logical order
    x=alt.X('pop_category:O', sort=['Small', 'Medium', 'Large', 'Very Large'])
)

**Key difference from nominal:** The categories appear in their logical order (Small, Medium, Large, Very Large) rather than alphabetical order.
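
The logical order originates in the `labels` list passed to `pd.cut`. The pandas-only sketch below, with made-up population values, shows that the resulting column is an ordered categorical whose order differs from the alphabetical order of the labels.

```python
import numpy as np
import pandas as pd

# Made-up populations, one per size class
populations = pd.Series([1e6, 10e6, 50e6, 1e9])
labels = ["Small", "Medium", "Large", "Very Large"]

categories = pd.cut(
    populations,
    bins=[0, 5e6, 20e6, 100e6, np.inf],
    labels=labels,
)

# pd.cut returns an ordered categorical by default
is_ordered = categories.cat.ordered
logical_order = list(categories.cat.categories)
alphabetical_order = sorted(labels)

print(is_ordered)
print(logical_order)
print(alphabetical_order)
```

If a chart ever shows these categories alphabetically, passing the same `labels` list as an explicit sort order is one way to restore the intended sequence.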

### Quantitative (Q)

**Quantitative** data represent numerical measurements. With quantitative data we can make *magnitude* comparisons: *by how much is value A larger or smaller than value B? (A - B = ?)*. Quantitative data support equality, rank-order, and arithmetic operations.

Examples include measurements such as height, weight, temperature, counts, and rates. In our dataset, `children_per_woman` (fertility), `life_expectancy` (life expectancy in years), and `population` are all quantitative.

In [26]:
# Using quantitative data - notice how the axis changes
alt.Chart(data2000).mark_tick().encode(
    x='children_per_woman:Q'  # :Q tells Altair this is quantitative
)

**Step-by-step explanation:**
1. **Continuous axis** - Unlike nominal/ordinal data, quantitative data gets a continuous numerical axis
2. **Tick marks** - Each tick mark represents one country positioned according to its exact fertility value
3. **Automatic scaling** - Altair automatically determines appropriate axis limits and tick marks

Let's compare the same data field treated as different types:

In [27]:
# Fertility as quantitative (continuous axis)
quant_chart = alt.Chart(data2000).mark_tick().encode(x='children_per_woman:Q')

# Fertility as nominal (categorical axis) - treating numbers as categories
nom_chart = alt.Chart(data2000).mark_tick().encode(x='children_per_woman:N')

# Display both to see the difference
quant_chart & nom_chart  # & stacks the charts vertically (use | to place them side by side)

### Temporal (T)

**Temporal** values measure time points or intervals. This type is a special case of quantitative values (timestamps) with rich semantics and conventions (i.e., the [Gregorian calendar](https://en.wikipedia.org/wiki/Gregorian_calendar)). The temporal type in Altair supports reasoning about time units (year, month, day, hour, etc.), and provides methods for requesting specific time intervals.

Example temporal values include date strings such as `"2019-01-04"` and `"Jan 04 2019"`, as well as standardized date-times such as the [ISO date-time format](https://en.wikipedia.org/wiki/ISO_8601): `"2019-01-04T17:50:35.643Z"`.

In [28]:
# Using year as temporal data - shows time progression
canada_data = gm[gm.country == 'Canada']
alt.Chart(canada_data).mark_line().encode(
    x='year:T',           # :T tells Altair this is temporal data
    y='life_expectancy:Q'
)

### Data Type Hierarchy

The data types presented are not mutually exclusive, but rather form a hierarchy: ordinal data support nominal (equality) comparisons, while quantitative data support ordinal (rank-order) comparisons.

Moreover, these data types do **not** provide a fixed categorization. Just because a data field is represented using a number doesn't mean we have to treat it as a quantitative type! For example, we might interpret a set of ages (10 years old, 20 years old, etc) as nominal (underage or overage), ordinal (grouped by year), or quantitative (calculate average age).
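
The age example can be sketched in pandas. The ages, the cut-off at 18, and the bracket labels below are assumptions chosen for illustration. The same column is treated first as nominal, then as ordinal, then as quantitative.

```python
import pandas as pd

# Made-up ages
ages = pd.Series([10, 20, 35, 62], name="age")

# Nominal: two unordered groups, compared only for equality
nominal = ages.map(lambda a: "minor" if a < 18 else "adult")

# Ordinal: ordered brackets that support rank-order comparisons
ordinal = pd.cut(ages, bins=[0, 17, 39, 120], labels=["0-17", "18-39", "40+"])

# Quantitative: arithmetic such as the mean is meaningful
quantitative = ages.mean()

print(nominal.tolist())
print(ordinal.tolist())
print(quantitative)
```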

---

## Part 3: Encoding Channels

When discussing data items, the word **attribute** is used to signify the data that describes the item. In the context of visualizations, we will use the words **field** and **attribute** interchangeably.

At the heart of Altair is the use of *encodings* that bind data fields (with a given data type) to available encoding *channels* of a chosen *mark* type. In this section we'll examine the following encoding channels:

- `x`: Horizontal (x-axis) position of the mark.
- `y`: Vertical (y-axis) position of the mark.
- `size`: Size of the mark. May correspond to area or length, depending on the mark type.
- `color`: Mark color, specified as a [legal CSS color](https://developer.mozilla.org/en-US/docs/Web/CSS/color_value).
- `opacity`: Mark opacity, ranging from 0 (fully transparent) to 1 (fully opaque).
- `shape`: Plotting symbol shape for `point` marks.
- `tooltip`: Tooltip text to display upon mouse hover over the mark.
- `order`: Mark ordering, determines line/area point order and drawing order.

For a complete list of available channels, see the [Altair encoding documentation](https://altair-viz.github.io/user_guide/encoding.html).

### Position Channels: The Foundation

Position is the most accurate channel for conveying quantitative information. Let's explore position channels one at a time.

#### Single Position Channel (X-axis only)

In [29]:
# Using only the x-channel to show distribution
alt.Chart(data2000).mark_tick().encode(
    x='children_per_woman:Q'  # Only x-position is encoded
)

This creates a one-dimensional plot showing how fertility values are distributed across countries.

#### Single Position Channel (Y-axis only)

In [30]:
# Using only the y-channel 
alt.Chart(data2000).mark_tick().encode(
    y='life_expectancy:Q'  # Only y-position is encoded
)

This creates a vertical distribution of life expectancy values.

#### Both Position Channels Together

In [31]:
# Using both x and y channels creates a scatter plot
alt.Chart(data2000).mark_point().encode(
    x='children_per_woman:Q',        # Horizontal position
    y='life_expectancy:Q'       # Vertical position
)

**Step-by-step explanation:**
1. **Two dimensions** - Now each country is represented by a point whose position encodes two pieces of information
2. **Relationships** - We can see how fertility and life expectancy relate to each other
3. **Mark choice** - We switched to `mark_point()` because points work better than ticks for two-dimensional data

### Color Channel: Adding a Third Dimension

The `color` channel encodes data values using color. Let's see how color works with nominal data first:

In [32]:
# Adding color for nominal data
alt.Chart(data2000).mark_point().encode(
    x='children_per_woman:Q',
    y='life_expectancy:Q',
    color='region:N'  # Color represents region membership
)

**Step-by-step explanation:**
1. **Three dimensions** - We now show fertility, life expectancy, AND region membership in one chart
2. **Automatic legend** - Altair automatically creates a legend showing which color represents which region
3. **Distinct colors** - For nominal data, Altair uses clearly different colors for each category

Now let's try color with quantitative data:

In [33]:
# Color with quantitative data
alt.Chart(data2000).mark_point().encode(
    x='children_per_woman:Q',
    y='life_expectancy:Q',
    color='population:Q'  # Color intensity represents population
)

**Notice the difference:**
- **Color gradient** - Quantitative data uses a continuous color gradient instead of distinct colors
- **Different legend** - The legend shows a color bar with numerical scale instead of categories

### Size Channel: Yet Another Dimension

The `size` channel controls the size of marks. Size is most appropriate for quantitative data:

In [34]:
# Adding size encoding
alt.Chart(data2000).mark_point().encode(
    x='children_per_woman:Q',
    y='life_expectancy:Q',
    size='population:Q'  # Point size represents population
)

**Step-by-step explanation:**
1. **Area encoding** - Point size typically represents area, so larger points have much more visual impact
2. **Magnitude comparisons** - Size is excellent for showing "how much" something is
3. **Automatic scaling** - Altair automatically scales the sizes to be visually appropriate

### Shape Channel: For Categories Only

The `shape` encoding channel sets the plotting symbol used by `point` marks. The shape encoding channel should only be used with nominal data, as perceptual rank-order and magnitude comparisons are not supported.

In [35]:
# Shape encoding for nominal data
alt.Chart(data2000).mark_point().encode(
    x='children_per_woman:Q',
    y='life_expectancy:Q',
    shape='region:N'  # Different shapes for each region
)

**Important:** Shape only works with nominal data because there's no natural ordering to shapes (is a circle "greater than" a square?).

### Opacity Channel: Transparency Effects

The `opacity` channel controls the transparency of marks, with values ranging from 0 (fully transparent) to 1 (fully opaque):

In [36]:
# Opacity encoding
alt.Chart(data2000).mark_point(filled=True).encode(
    x='children_per_woman:Q',
    y='life_expectancy:Q',
    opacity='population:Q'  # Transparency represents population
)

**Note:** We added `filled=True` to make the opacity effect more visible.

### Tooltip Channel: Additional Information

The `tooltip` channel specifies information to display when hovering over a mark:

In [37]:
# Adding tooltips
alt.Chart(data2000).mark_point().encode(
    x='children_per_woman:Q',
    y='life_expectancy:Q',
    tooltip=['country', 'population', 'region']  # Show these fields on hover
)

**Step-by-step explanation:**
1. **List of fields** - We provide a list of field names to show in the tooltip
2. **Interactive information** - Hover over any point to see additional details
3. **No visual encoding** - Tooltips don't change the appearance of the chart, just add information

---

## Part 4: Graphical Marks

A visualization represents data using a collection of *graphical marks* (bars, lines, points, etc.). The attributes of a mark — such as its position, shape, size, or color — serve as *channels* through which we can encode underlying data values.

### Point Marks

The `point` mark type conveys specific points, as in *scatter plots* and *dot plots*. Point marks are typically used to show the relationship between two or more quantitative variables, and are often the mark of choice for exploratory data analysis.

For more details, see the [Altair point mark documentation](https://altair-viz.github.io/user_guide/marks.html#point-mark).

Below is a scatter plot of `children_per_woman` vs `life_expectancy`, with the `region` field encoded using the `color` channel:

In [38]:
alt.Chart(data2000).mark_point().encode(
    x='children_per_woman:Q',
    y='life_expectancy:Q',
    color='region:N'
)

### Circle Marks

The `circle` mark type is similar to `point`, but specifically creates filled circles. Circle marks are often more visually prominent than standard points and work well when you want to emphasize individual data points.

For more details, see the [Altair circle mark documentation](https://altair-viz.github.io/user_guide/marks.html#circle-mark).

Below is the same visualization using circle marks instead of points:

In [39]:
alt.Chart(data2000).mark_circle().encode(
    x='children_per_woman:Q',
    y='life_expectancy:Q',
    color='region:N'
)

### Square Marks

The `square` mark type creates square-shaped points. Square marks can be useful for creating distinct visual styles or when you want to use shape as an additional encoding channel alongside circular points.

For more details, see the [Altair square mark documentation](https://altair-viz.github.io/user_guide/marks.html#square-mark).

Below is a scatter plot using square marks:

In [40]:
alt.Chart(data2000).mark_square().encode(
    x='children_per_woman:Q',
    y='life_expectancy:Q',
    color='region:N'
)

### Tick Marks

The `tick` mark type draws short line segments, and is often used to show the distribution of values along one or both axes. Tick marks are particularly useful for creating rug plots or showing the distribution of a single variable.

For more details, see the [Altair tick mark documentation](https://altair-viz.github.io/user_guide/marks.html#tick-mark).

Below is a tick plot showing the distribution of fertility values:

In [41]:
alt.Chart(data2000).mark_tick().encode(
    x='children_per_woman:Q'
)

### Bar Marks

The `bar` mark type creates rectangular bars and is one of the most widely used mark types for visualization. Bar marks are excellent for comparing quantities across categories and for showing distributions of categorical data.

For more details, see the [Altair bar mark documentation](https://altair-viz.github.io/user_guide/marks.html#bar-mark).

Below is a bar chart showing the count of countries by region:

In [42]:
alt.Chart(data2000).mark_bar().encode(
    x='region:N',
    y='count():Q'  # count() automatically counts rows per region
)

**Step-by-step explanation:**
1. **Automatic aggregation** - `count():Q` tells Altair to count how many rows (countries) are in each region
2. **Bar heights** - Each bar's height represents the count of countries in that region
3. **Category spacing** - Nominal data on x-axis creates evenly spaced category labels

---

## Part 5: Basic Bar Charts

The bar chart was developed by [William Playfair](https://en.wikipedia.org/wiki/William_Playfair#Bar_chart) in 1786, and it has since become one of the most widely used visual representations. One reason the chart is so effective is that it encodes quantities as bar lengths measured from a common baseline, a comparison that the human visual system makes quickly and accurately.

### Simple Bar Chart

Let's start with the most basic bar chart - counting countries by region:

In [43]:
# Basic bar chart
alt.Chart(data2000).mark_bar().encode(
    x='region:N',    # Categories on x-axis
    y='count():Q'     # Count on y-axis
)

### Horizontal Bar Chart

Sometimes horizontal orientation works better, especially when category names are long:

In [44]:
# Horizontal bar chart - swap x and y
alt.Chart(data2000).mark_bar().encode(
    y='region:N',    # Categories on y-axis
    x='count():Q'     # Count on x-axis
)

### Basic Sorting

We can sort bars to make patterns more apparent:

In [45]:
# Sort bars by count (descending)
alt.Chart(data2000).mark_bar().encode(
    x=alt.X('region:N', sort='-y'),  # Sort by y values, descending
    y='count():Q'
)

**Step-by-step explanation:**
1. **`alt.X()`** - We use the full encoding specification instead of just the string
2. **`sort='-y'`** - Sort by the y-channel values in descending order (the minus sign means descending)
3. **Result** - Regions are now ordered from most to least countries

### Simple Stacked Bar Chart

Let's create a basic stacked bar chart by adding color encoding:

In [46]:
# Simple stacked chart - let Altair handle the stacking
alt.Chart(data2000).mark_bar().encode(
    x='region:N',
    y='count():Q',
    color='region:N'  # Adding color automatically creates stacking
)

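**Note:** Because `color` uses the same field as `x`, each bar contains a single segment, so this chart looks like a colored bar chart rather than a visibly stacked one. Stacking becomes visible when `color` encodes a different field (for example `income_group`), which splits each bar into one segment per category.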

---

## Part 6: Practice Exercises

Before moving to the in-class lecture, try these exercises to reinforce your learning:

### Exercise 1: Single Channel Encoding
Create a chart showing the distribution of life expectancy using only tick marks along the y-axis.

### Exercise 2: Basic Scatter Plot
Create a scatter plot showing the relationship between population (`population`) and income (`income`).

### Exercise 3: Adding Color
Take your scatter plot from Exercise 2 and add color encoding for the `region` field.

### Exercise 4: Bar Chart Practice
Create a horizontal bar chart showing the count of countries by region, sorted from most to least countries.

### Exercise 5: Data Type Experiment
Create the same 1-dimensional chart using the fertility data (`children_per_woman`), but try it with both `:Q` (quantitative) and `:N` (nominal) data types. Compare the results.

**Example Solution for Exercise 1:**

---

## Part 7: Key Concepts Summary

### Grammar of Graphics Pattern

```
alt.Chart([DATA]).mark_[TYPE]().encode([MAPPINGS])
```

### Data Type Suffixes
- `:N` - Nominal (categories, no order)
- `:O` - Ordinal (categories with order)  
- `:Q` - Quantitative (numbers, measurements)
- `:T` - Temporal (dates and times)

### Common Mark Types
- `.mark_point()` - Scatter plot points
- `.mark_circle()` - Filled circles
- `.mark_bar()` - Bar charts
- `.mark_tick()` - Distribution marks

### Essential Encoding Channels
- `x`, `y` - Position (most important)
- `color` - Categories or intensity
- `size` - Magnitude (best suited to quantitative data)  
- `shape` - Categories only
- `opacity` - Transparency
- `tooltip` - Additional information

---

## Preparing for Class

Make sure you understand:

### Core Concepts
- How to build charts step-by-step from data + marks + encodings
- The difference between data types and when to use each
- How different encoding channels show different aspects of data
- When to use scatter plots vs. bar charts

### Essential Skills
- Creating single-channel charts (just x or just y)
- Building up to two-channel scatter plots
- Making basic bar charts with automatic counting
- Applying different data types to the same field

### Questions to Consider
- How does changing the data type (`:N` vs `:Q`) affect the same field?
- What happens when you use different mark types with the same encoding?
- When would you choose horizontal vs. vertical bar charts?
- How do you decide which encoding channel to use for a particular data field?

**Next**: In-class lecture will build on these fundamentals with advanced multi-channel encoding, complex aggregations, interpretation skills, and hands-on practice with the assignment requirements!