# Introduction to Visualizations

## Types of Visualizations


We'll focus on these today:

- Column charts (bar charts)
  - Use to compare values across categories
- Histograms
  - Use to show distribution of a single variable
- Line charts 
  - Use to show trends over time
- Scatter plots
  - Use to show relationships between two variables

In [None]:
library(ggplot2)
library(palmerpenguins)

We will use these packages today for visualization. <br>
They work best when our data is **tidy**. <br>
They will sometimes give *warnings* if our data is messy (for example, if it contains missing values).
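
A quick way to see where such warnings might come from is to count the missing values in each column. This is a minimal sketch using base R only; the name `na_counts` is just an illustrative choice.

```r
library(palmerpenguins)

# Count missing values (NA) in each column of the penguins data
na_counts <- colSums(is.na(penguins))
na_counts

# Names of the columns that have at least one missing value
names(na_counts[na_counts > 0])
```

Columns that show up here are the ones most likely to trigger a "removed rows" warning when plotted.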

In [None]:
# Some light cleaning so we avoid the warnings
library(tidyr)

penguins <- penguins |>
  drop_na(body_mass_g, flipper_length_mm)

### Review: How to view our data.

In [None]:
View(penguins) # Kind of long

In [None]:
penguins |> head(5) # short preview of the top

In [None]:
penguins |> tail(5) # short preview of the bottom

It helps to print the column / variable names. They will be needed with their exact spelling for plotting.

In [None]:
names(penguins)

## The Grammar of Graphics


- Data viz has a language with its own grammar
- Basic components include:
  - **Data** we are trying to visualize
  - **Aesthetics** (dimensions)
  - **Geom**etric objects (e.g. bar, line, scatter plot)
  - Color **scales**
  - **Themes**
  - Annotations
  


Let's start with the first two, the **data** and the **aesthetics**.

This gives us the axes without any visualization:


In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm, y = body_mass_g)

Now let's add a geom. In this case we want a scatter plot so we *add* `geom_point()`.


In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm, y = body_mass_g) +
  geom_point()

In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm, y = body_mass_g, color = species) +
  geom_point()

Specify point `size` as well as transparency (`alpha`) for better visibility

In [None]:
penguins_scatter <- ggplot(penguins) + 
 aes(x = flipper_length_mm, y = body_mass_g, color = species) +
  geom_point(size = 1.2, alpha = 0.6)

In [None]:
penguins$flipper_length_mm

Let's try a **histogram**.

You only need one numerical variable on the x-axis.

In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm) +
  geom_histogram() # the geom is what actually draws the bars

That gets the idea across but looks a little depressing, so...


...let's change the color of the columns by specifying `fill = "chartreuse4"`.

In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm) + 
  geom_histogram(fill = "chartreuse4", color = "white")

> **Tip:**
> See [here](http://sape.inf.usi.ch/quick-reference/ggplot2/colour) for more available `ggplot2` colors. 


In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm, color = species) + 
  geom_histogram()

In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm, fill = species) + 
  geom_histogram()

Note how the color of the original columns is simply overwritten.


Now let's add some **labels** with the `labs()` function:


In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm) + 
  geom_histogram(fill = "chartreuse4", color = "white") +
  labs(
    x = "Flipper Length (mm)", 
    y = "Count", 
    title = "Penguin Flipper Lengths", 
    caption = "Source: palmerpenguins"
    )

In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm) + 
  geom_histogram(bins = 30, fill = "chartreuse4", color = "white") +
  labs(
    x = "Flipper Length (mm)", 
    y = "Count", 
    title = "Penguin Flipper Lengths", 
    caption = "Source: palmerpenguins"
    ) +
theme_minimal()

> **Tip:**
> See [here](https://ggplot2.tidyverse.org/reference/ggtheme.html) for available `ggplot2` themes.


`theme_minimal()` gives us a clean, elegant look.

Let's apply that to the scatter plot from before too.

I stored the ggplot from before in the variable `penguins_scatter`. I can "add" to that the labels and the theme. 

In [None]:
penguins_scatter +
  labs(
    title = "Penguin body mass vs flipper length",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    caption = "Source: palmerpenguins"
  ) +
theme_minimal()

## Histograms
- They are generally used for continuous variables (e.g., income, age, etc.)
    - A *continuous* variable is one that can take on any value within a range (e.g., 0.5, 1.2, 3.7, etc.)
    - A *discrete* variable is one that can only take on certain values (e.g., 1, 2, 3, etc.)
- Typically, the height of the bar represents the number of observations which fall in that bin
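
To make the idea of "observations per bin" concrete, here is a small sketch that counts the bins by hand with base R. The 5 mm bin width and the names `flipper`, `breaks`, and `bin_counts` are assumptions chosen for illustration; `geom_histogram()` picks its own bin edges.

```r
library(palmerpenguins)

# Keep the non-missing flipper lengths
flipper <- penguins$flipper_length_mm
flipper <- flipper[!is.na(flipper)]

# Bin edges every 5 mm that cover the whole observed range
lower <- floor(min(flipper) / 5) * 5
upper <- ceiling((max(flipper) + 1) / 5) * 5
breaks <- seq(lower, upper, by = 5)

# Assign each observation to a bin and count the observations per bin
bin_counts <- table(cut(flipper, breaks = breaks, right = FALSE))
bin_counts

# Every observation lands in exactly one bin
sum(bin_counts) == length(flipper)
```

Each entry of `bin_counts` corresponds to the height of one bar in a count histogram with these bins.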


## Histogram Code


Note that you only need to specify the x-axis variable in the `aes()` function. `ggplot2` will automatically compute the counts for the y-axis of a histogram.


In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm) + 
  geom_histogram(bins = 30, fill = "chartreuse4", color = "white") +
  labs(
    x = "Flipper Length (mm)", 
    y = "Count", 
    title = "Penguin Flipper Lengths", 
    caption = "Source: palmerpenguins"
    ) +
theme_minimal()

Change the number of bins (bars) using the `bins` or `binwidth` arguments (default number of bins = 30):


At 50 bins...


In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm) + 
  geom_histogram(bins = 50, fill = "chartreuse4", color = "white") +
  labs(
    x = "Flipper Length (mm)", 
    y = "Count", 
    title = "Penguin Flipper Lengths", 
    caption = "Source: palmerpenguins"
    ) +
theme_minimal()

At 100 bins...probably too many!


In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm) + 
  geom_histogram(bins = 100, fill = "chartreuse4", color = "white") +
  labs(
    x = "Flipper Length (mm)", 
    y = "Count", 
    title = "Penguin Flipper Lengths", 
    caption = "Source: palmerpenguins"
    ) +
theme_minimal()

Using `binwidth` instead of `bins`... 


In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm) + 
  geom_histogram(binwidth = 2, fill = "chartreuse4", color = "white") +
  labs(
    x = "Flipper Length (mm)", 
    y = "Count", 
    title = "Penguin Flipper Lengths", 
    caption = "Source: palmerpenguins"
    ) +
theme_minimal()

Since *count* depends on the sample size and the bin width, you can use *density* instead.

In [None]:
ggplot(penguins, aes(x = flipper_length_mm, y = after_stat(density))) + 
  geom_histogram(binwidth = 5, fill = "chartreuse4", color = "white") +
  labs(
    x = "Flipper Length (mm)", 
    y = "Density", 
    title = "Penguin Flipper Lengths", 
    caption = "Source: palmerpenguins"
    ) +
theme_minimal()

For densities, the total area of the bars sums to 1. The *area* of a bar (height × bin width) represents the proportion of observations in that bin, so the height on its own is a density rather than a count or a probability.
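
We can check the "area sums to 1" property with base R's `hist()`, which reports both counts and densities. This is only a sketch: the 5 mm bins and the names `h` and `bar_area` are assumptions for illustration, and `geom_histogram()` may place its bin edges differently.

```r
library(palmerpenguins)

# Keep the non-missing flipper lengths
flipper <- penguins$flipper_length_mm
flipper <- flipper[!is.na(flipper)]

# Bin edges every 5 mm that cover the whole observed range
lower <- floor(min(flipper) / 5) * 5
upper <- ceiling((max(flipper) + 1) / 5) * 5

# Compute the histogram without drawing it
h <- hist(flipper, breaks = seq(lower, upper, by = 5), plot = FALSE)

# Area of each bar = density (height) times bin width
bar_area <- h$density * diff(h$breaks)

sum(bar_area)                                   # total area of all bars
all.equal(bar_area, h$counts / sum(h$counts))   # area equals the share of observations
```

So the share of penguins in a bin is the bar's area, and with 5 mm bins each height is that share divided by 5.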


## Try it out!


1. Pick a variable that you want to explore the distribution of <br>
   (get the correct spelling with `names()` of your dataset). 
2. Make a histogram

   a. Only specify `x = ` in `aes()`

   b. Specify geom as `geom_histogram`

3. Choose a color for the bars
4. Choose appropriate labels
5. Change number of bins
6. Change from count to density


## Fill vs. Color

Use **color** (e.g. `color = ` or `scale_color_*`) to modify the color of points, lines, or text. 
- Commonly applied to:
  - Scatter plots
  - Line charts
  - Text elements

Use **fill** (e.g. `fill = ` or `scale_fill_*`) to modify the fill color of shapes like bars, boxes, or polygons. 
- Commonly applied to:
  - Bar charts
  - Box plots
  - Histograms
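
Here is a small sketch that sets both on the same bar chart, so the difference is visible: `fill` paints the inside of each bar and `color` paints its outline. The name `fill_vs_color` and the specific colors are illustrative choices.

```r
library(ggplot2)
library(palmerpenguins)

fill_vs_color <- ggplot(penguins, aes(x = species)) +
  # fill = interior of the bars, color = outline of the bars
  geom_bar(fill = "chartreuse4", color = "black", linewidth = 1) +
  theme_minimal()

fill_vs_color
```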

## Column / Bar chart

`geom_bar()`: Use when you have **raw data** and want ggplot2 to **count** rows for you.

In [None]:
ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar() +
  labs(
    x = "Species", 
    y = "Count",
    fill = "Species",
    title = "Penguin Count by Species", 
    caption = "Source: palmerpenguins"
    ) +
theme_minimal()

`geom_col()`: Use when you already have **summarized data** with a **y-value** column.

In [None]:
library(dplyr)
penguin_counts <- penguins |>
  count(species, name = "n")

ggplot(penguin_counts, aes(x = species, y = n, fill = species)) +
  geom_col() +
  labs(
    x = "Species", 
    y = "Count", 
    title = "Penguin Count by Species", 
    caption = "Source: palmerpenguins"
    ) +
theme_minimal()

## Line chart

- Plot trends over time.
- Have the variable on the x-axis be of type date or time. (This will affect the plotting.)
- Do not use bars. Lines will be less cluttered when you have multiple trends to plot.
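
If your dates arrive as text, you can convert them before plotting. The sketch below uses a small made-up data frame (`monthly` and its values are hypothetical) and base R's `as.Date()`.

```r
library(ggplot2)

# Hypothetical monthly data where the dates are stored as text
monthly <- data.frame(
  month = c("2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01"),
  value = c(10, 12, 9, 15),
  stringsAsFactors = FALSE
)
class(monthly$month) # text, not yet a date

# Convert the text to a proper Date so ggplot2 uses a date axis
monthly$month <- as.Date(monthly$month)
class(monthly$month)

ggplot(monthly, aes(x = month, y = value)) +
  geom_line() +
  theme_minimal()
```

With a `Date` column, ggplot2 spaces the points by actual time and labels the axis with dates.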

In [None]:
class(economics$date)

In [None]:
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  labs(
    title = "U.S. unemployment over time",
    x = "Date",
    y = "Number unemployed (thousands)"
  ) +
  theme_minimal()

Set a thicker line for visibility.

In [None]:
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(linewidth = 1) +
  labs(
    title = "U.S. unemployment over time",
    x = "Date",
    y = "Number unemployed (thousands)"
  ) +
  theme_minimal()

## Comparing groups

In general, when we compare groups in plots, there are a few practices to consider:

1. Facet (multiple panels), or
2. Standardize (compare shapes), or
3. Plot two series that naturally share units.

You can standardize to make the data more similar, or you can use different plots with different axis ranges to view the shapes.

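This is an example of trying to compare data that's not alike. The two series measure different quantities (the number of unemployed people and the total population), and their magnitudes are very different, so on a shared y-axis the smaller series is flattened.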

In [None]:
ggplot(economics) +
  geom_line(aes(x = date, y = unemploy), linewidth = 1) +
  geom_line(aes(x = date, y = pop), linewidth = 1) +
  theme_minimal()
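
One way to compare the *shapes* of the two series is to standardize them first. The sketch below converts each series to z-scores (mean 0, standard deviation 1) with base R's `scale()`. The data frame `econ_z` and its column names are illustrative, and z-scores are only one of several possible standardizations.

```r
library(ggplot2)

# Standardize each series: subtract its mean and divide by its standard deviation
econ_z <- data.frame(
  date = economics$date,
  unemploy_z = as.numeric(scale(economics$unemploy)),
  pop_z = as.numeric(scale(economics$pop))
)

ggplot(econ_z, aes(x = date)) +
  geom_line(aes(y = unemploy_z, color = "unemploy"), linewidth = 1) +
  geom_line(aes(y = pop_z, color = "pop"), linewidth = 1) +
  labs(x = "Date", y = "Standardized value (z-score)", color = "Series") +
  theme_minimal()
```

The y-axis no longer shows the original units, so this view is for comparing patterns, not levels.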

In [None]:
df <- as.data.frame(Seatbelts) # Let's use a built-in time series dataset from base R (the datasets package)

df |> head(5)

In [None]:
# Seatbelts has the time hidden in its index
# time(Seatbelts) returns the time index (type <ts>) for each observation, which I convert to a numeric vector.
df$t <- as.numeric(time(Seatbelts))

df |> head(5) # time is now included as fractional years

In [None]:
ggplot(df, aes(x = t)) +
  geom_line(aes(y = front, color = "front killed")) +
  geom_line(aes(y = rear, color = "rear killed")) +
  labs(title = "UK Road Casualties Over Time", x = "Time", y = "Count", color = "Series") +
  theme_minimal()

Notice here, you can specify variables in `aes()` either inside `ggplot()` (applies to all layers) or inside a specific `geom` (applies only to that layer).
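
To see the difference, here is a sketch with a second layer that reacts to the mapping. It uses `geom_smooth()`, which is not covered elsewhere in these notes and is brought in only for illustration. The names `peng`, `p_global`, and `p_layer` are also illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Keep only the columns we need and drop rows with missing values
peng <- na.omit(penguins[, c("flipper_length_mm", "body_mass_g", "species")])

# color in ggplot(): every layer inherits it, so we get one trend line per species
p_global <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color only in geom_point(): the trend line ignores species, so we get a single line
p_layer <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_global
p_layer
```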

## Facets

Faceting means splitting one plot into **multiple small plots (panels)** based on the values of one or more categorical variables, so you can **compare patterns across groups**.

There are two ways you might go about this:


1) one variable: `facet_wrap()`

In [None]:
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~ species) + # layout is chosen automatically; three panels fit in one row
  theme_minimal()

In [None]:
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~ species, ncol = 1) +
  theme_minimal()

2) two variables: `facet_grid()`

In [None]:
penguins <- penguins |>
  drop_na(sex)

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species, shape = sex)) +
  geom_point(alpha = 0.7) +
  facet_grid(sex ~ species) +
  theme_minimal()

In [None]:
ggplot(penguins, aes(x = flipper_length_mm, fill = species)) +
  geom_histogram(bins = 10, position = "identity", alpha = 0.5, color = "white") +
  facet_wrap(~ sex) +
  labs(
    x = "Flipper Length (mm)", 
    y = "Count",
    title = "Penguin Flipper Lengths by Sex", 
    caption = "Source: palmerpenguins"
  ) +
  theme_minimal()

## Scales

Scales control **how data values map to visual properties** (position, color, size, etc.).

### 1) Continuous vs. discrete
- `scale_x_continuous()` / `scale_y_continuous()` for numeric axes  
- `scale_x_discrete()` / `scale_y_discrete()` for categorical axes  

### 2) Axis limits: dropping vs. zooming
- `scale_x_continuous(limits = ...)` **drops data outside** the limits (can change counts/statistics).
- `coord_cartesian(xlim = ...)` **zooms** without dropping data (keeps calculations based on the full dataset).
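
The difference is easiest to see by counting how many observations each version keeps. This sketch uses `layer_data()` from ggplot2 to read back the computed bin counts. The names `base_hist`, `dropped`, and `zoomed`, and the limits of 0 to 5000, are illustrative choices.

```r
library(ggplot2)

base_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

# Scale limits: values outside 0-5000 are removed before binning (expect a warning)
dropped <- base_hist + scale_x_continuous(limits = c(0, 5000))

# Coordinate limits: all values are binned, then the view is zoomed
zoomed <- base_hist + coord_cartesian(xlim = c(0, 5000))

n_dropped <- sum(layer_data(dropped)$count)
n_zoomed <- sum(layer_data(zoomed)$count)

n_dropped # only the diamonds priced within the limits
n_zoomed  # every diamond in the dataset
```

For a histogram the visible bars can look similar either way, but for statistics such as means or smoothers the dropped rows can change the result.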

### 3) Breaks and labels
- Use `breaks = ...` to control tick locations.
- Use `labels = ...` to control how ticks are printed (e.g., dollars, commas, percentages).

### 4) Transformations
- `scale_x_log10()` is a common way to handle variables spanning multiple orders of magnitude.
- Other options include `scale_x_sqrt()` or `scale_x_continuous(trans = "sqrt")`.

### 5) Color scales
- Discrete palettes: `scale_color_viridis_d()` / `scale_fill_viridis_d()`
- Continuous palettes: `scale_color_viridis_c()` / `scale_fill_viridis_c()`


## Scales: Linear vs. Log

Before choosing a scale, it helps to check the range.


In [None]:
# Built-in dataset in ggplot2
min(diamonds$price)
max(diamonds$price)
max(diamonds$price) / min(diamonds$price)

> **Interpretation:** Diamond prices span a large range (a few hundred dollars to nearly twenty thousand), so a linear scale can hide structure in the lower end.

In [None]:
log(min(diamonds$price))
log(max(diamonds$price))
log(max(diamonds$price)) / log(min(diamonds$price))

A histogram on a **linear** x-axis uses equal *dollar* widths per bin.


In [None]:
diamond_plot <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 100, fill = "hotpink4") +
  labs(
    title = "Distribution of Diamond Prices (Linear)",
    x = "Price",
    y = "Count"
  ) +
  theme_minimal()

diamond_plot

A **log10** x-axis spreads out small values and compresses large values.


In [None]:
diamond_plot + 
  scale_x_log10() + 
  labs(
    title = "Distribution of Diamond Prices (Log Scale)",
    x = "Price (log scale)",
    y = "Count"
  )

### Linear bins vs. Log bins  
- **Linear bins:** equal dollars per bin (e.g., 0–300, 300–600, 600–900, ...)  
- **Log10 axis:** equal *multiplicative* steps (powers of 10).  
  - Equal spacing corresponds to equal *factors* (e.g., 100 → 200 → 400 → 800 is evenly spaced on a log scale).
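
A quick arithmetic check of the "equal factors, equal spacing" idea, using the doubling sequence from the bullet above (the vector name `prices` is illustrative):

```r
prices <- c(100, 200, 400, 800)

# On a linear axis the gaps between these values keep growing
diff(prices)

# On a log10 axis the gaps are all the same size, because each step is a doubling
diff(log10(prices))
log10(2)
```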


## Density Comparison: Linear vs Log (Side-by-Side)

Using **density** rescales bar heights so total area is 1. This helps compare shapes.


In [None]:
d_linear <- ggplot(diamonds, aes(x = price, y = after_stat(density))) +
  geom_histogram(bins = 100, fill = "hotpink4", color = "white") +
  labs(
    title = "Diamond Prices (Linear X)",
    x = "Price",
    y = "Density"
  ) +
  theme_minimal()

d_log <- ggplot(diamonds, aes(x = price, y = after_stat(density))) +
  geom_histogram(bins = 100, fill = "forestgreen", color = "white") +
  scale_x_log10() +
  labs(
    title = "Diamond Prices (Log10 X)",
    x = "Price (log scale)",
    y = "Density"
  ) +
  theme_minimal()

library(patchwork)
d_linear + d_log

### Example: custom breaks and labels on a log scale


In [None]:
ggplot(diamonds, aes(price)) +
  geom_histogram(bins = 80, fill = "hotpink4", color = "white") +
  scale_x_log10(
    breaks = c(100, 500, 1000, 5000, 10000, 20000),
    labels = scales::dollar # dollar() comes from the scales package
  ) +
  labs(
    title = "Diamond Prices with Custom Log10 Breaks",
    x = "Price (log scale)",
    y = "Count"
  ) +
  theme_minimal()

## Color Scales: Viridis (Discrete)

Viridis palettes are designed to be **perceptually uniform** and to remain readable for viewers with common forms of color vision deficiency.


In [None]:
ggplot(penguins) + 
 aes(x = flipper_length_mm, y = body_mass_g, color = species) +
  geom_point(size = 3, alpha = 0.8) +
  scale_color_viridis_d(option = "mako") +
  labs(
    title = "Penguin body mass vs flipper length",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    caption = "Source: palmerpenguins"
  ) +
theme_minimal()

I always recommend mapping `shape` together with `color` so that the categories are clear to all viewers.
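
A minimal sketch of that recommendation: map both `color` and `shape` to `species`, and give both legends the same title so ggplot2 merges them into one. The names `peng` and `p_shape` are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Keep only the columns we need and drop rows with missing values
peng <- na.omit(penguins[, c("flipper_length_mm", "body_mass_g", "species")])

p_shape <- ggplot(peng) +
  aes(x = flipper_length_mm, y = body_mass_g, color = species, shape = species) +
  geom_point(size = 3, alpha = 0.8) +
  scale_color_viridis_d(option = "mako") +
  labs(
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species", # same title for both legends so they are combined
    shape = "Species"
  ) +
  theme_minimal()

p_shape
```

Even when printed in grayscale, the three species remain distinguishable by point shape.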

## Try it out! 

- Try `scale_x_sqrt()` instead of log.
- Try `scale_color_viridis_c()` on a numeric variable like `diamonds$carat`.
- Change `bins` and watch how the histogram changes (especially on a log axis).


In [None]:
p <- ggplot(penguins) + 
 aes(x = flipper_length_mm, y = body_mass_g, color = species) +
  geom_point(size = 3, alpha = 0.8) +
  scale_color_viridis_d(option = "mako") +
  labs(
    title = "Penguin body mass vs flipper length",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    caption = "Source: palmerpenguins"
  ) +
theme_minimal()

## Save Plots (PNG / PDF / RDS)

In [None]:
# Save as PNG and PDF
ggsave("myplot.png", plot = p, width = 6, height = 4, dpi = 300)
ggsave("myplot.pdf", plot = p, width = 6, height = 4)

# Save & reload plot object
saveRDS(p, file = "myplot.rds")
p2 <- readRDS("myplot.rds")
p2