*Analytical Information Systems*

# Tutorial 4 - Data Visualization

Matthias Griebel<br>
Lehrstuhl für Wirtschaftsinformatik und Informationsmanagement

Summer Semester 2019

<h1>Agenda<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#The-babynames-dataset" data-toc-modified-id="The-babynames-dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>The babynames dataset</a></span></li><li><span><a href="#Distribution" data-toc-modified-id="Distribution-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Distribution</a></span></li><li><span><a href="#Comparison" data-toc-modified-id="Comparison-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Comparison</a></span></li><li><span><a href="#Composition" data-toc-modified-id="Composition-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Composition</a></span></li></ul></div>

## The babynames dataset

The babynames dataset records how many children of each sex born in the US were given each name.
- Covers each year from 1880 to 2017
- Includes all names with more than 5 uses
- Five variables: `year`, `sex`, `name`, `n` and `prop` (`n` divided by the total number of applicants in that year, so `prop` is the share of children of that sex born in that year who were given that name).

In [None]:
library(tidyverse)
library(babynames)
babynames %>% head()
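
Because `prop` is defined within each year and sex, summing it over all names in one year and sex shows how much of that cohort the listed names cover. The following is a minimal sketch; the names `prop_coverage` and `total_prop` are made up for illustration. Since rarely used names are left out of the dataset, the totals are not expected to add up to exactly 1.

```r
library(dplyr)
library(babynames)

# Sum the recorded proportions within each year and sex.
# `prop_coverage` and `total_prop` are illustrative names, not part of the dataset.
prop_coverage <- babynames %>%
  group_by(year, sex) %>%
  summarise(total_prop = sum(prop), .groups = "drop")

# Inspect the first rows of the summary.
head(prop_coverage)
```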

__Recap: A taxonomy of plots__

<img src="images/03/taxonomy.png" style="width:60%">

## Distribution

___Up to you: Distribution___

a. What is the distribution of your name over time?

In [None]:
options(repr.plot.width=5, repr.plot.height=4)
babynames %>%
        filter(name=="Matthias")%>%
        ggplot(aes(x=year, y=n)) +
        geom_col()

___Up to you: Distribution___

b. What is the distribution of names starting with "A" (or "C", "F") over time? Is there a difference between the sexes?

In [None]:
options(repr.plot.width=8, repr.plot.height=4)
StartingLetter = "A"

babynames %>%
        # R strings are 1-indexed, so the first character is at position 1
        filter(str_sub(string = name, start = 1, end = 1) == StartingLetter) %>%
        ggplot(aes(x=year, y=n)) +
        geom_col() + 
        facet_wrap(~sex)
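
To compare the three letters from the exercise in one figure, the first letter can be stored as its own column and used as a second facet. This is only a sketch of one possible approach: `letters_of_interest`, `first_letter`, `births` and `by_letter` are hypothetical names, and the births are summed per year, sex and letter before plotting so that each bar corresponds to exactly one row.

```r
library(dplyr)
library(stringr)
library(ggplot2)
library(babynames)

# Hypothetical helper vector: the letters named in the exercise.
letters_of_interest <- c("A", "C", "F")

by_letter <- babynames %>%
  # Extract the first character of every name.
  mutate(first_letter = str_sub(name, start = 1, end = 1)) %>%
  filter(first_letter %in% letters_of_interest) %>%
  # Sum the births so there is one row per year, sex and letter.
  count(year, sex, first_letter, wt = n, name = "births")

# One panel per letter (rows) and sex (columns).
ggplot(by_letter, aes(x = year, y = births)) +
  geom_col() +
  facet_grid(first_letter ~ sex)
```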

## Comparison

___Up to you: Comparison___

a. What are the top 10 most common names in 2017, for both sexes?

In [None]:
options(repr.plot.width=5, repr.plot.height=4)
babynames %>%
        filter(year==2017) %>%
        group_by(sex) %>%
        top_n(n=10, wt=prop) %>%
        ggplot(aes(x=name, y=prop)) +
        geom_col() + 
        coord_flip()

... with separate plots for each sex?

In [None]:
options(repr.plot.width=8, repr.plot.height=4)
babynames %>%
        filter(year==2017) %>%
        group_by(sex) %>%
        top_n(n = 10, wt=n) %>%
        ggplot(aes(x=reorder(name, prop), y=prop)) +
        geom_col() +
        coord_flip() +
        facet_wrap(~sex, scales = "free") +
        theme_minimal()
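
In dplyr 1.0.0 and later, `top_n()` is marked as superseded in favour of `slice_max()`. A sketch of the same top 10 selection with `slice_max()` could look as follows, assuming such a dplyr version is installed; `top10_2017` is an illustrative name, and `with_ties = FALSE` is a choice made here to guarantee exactly ten rows per sex.

```r
library(dplyr)
library(babynames)

# Ten most common names per sex in 2017, selected with slice_max().
# with_ties = FALSE keeps exactly ten rows per group even if proportions tie.
top10_2017 <- babynames %>%
  filter(year == 2017) %>%
  group_by(sex) %>%
  slice_max(order_by = prop, n = 10, with_ties = FALSE) %>%
  ungroup()

top10_2017
```

The result can be piped into the same `ggplot()` calls as before.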

___Up to you: Comparison___

b. How has the popularity of your (and your neighbour's) name changed over time?

In [None]:
options(repr.plot.width=6, repr.plot.height=4)
Names = c("Matthias", "Nikolai", "Christoph")

babynames %>%
        filter(name %in% Names) %>%
        ggplot(aes(x=year, y=prop)) + 
        geom_line(aes(linetype=name))
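
Some names are recorded for both sexes, in which case a single line per name would jump between the two series within the same year. A hedged variant that keeps the sexes apart maps `name` to colour and `sex` to line type, so ggplot2 draws one line per combination. `selected_names` and `name_trends` are illustrative names; the example names are the ones used in the cell above.

```r
library(dplyr)
library(ggplot2)
library(babynames)

# Example names taken from the cell above; replace them with your own.
selected_names <- c("Matthias", "Nikolai", "Christoph")

# Colour distinguishes names, line type distinguishes sexes,
# so each name/sex combination gets its own line.
name_trends <- babynames %>%
  filter(name %in% selected_names) %>%
  ggplot(aes(x = year, y = prop, colour = name, linetype = sex)) +
  geom_line()

name_trends
```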

## Composition

__Up to you: Composition__

a. What are the most popular names from 2012 to 2015?

In [None]:
babynames %>%
        filter(year %in% 2012:2015) %>%
        group_by(year, sex) %>%
        top_n(n=6, wt=prop) %>%
        ggplot(aes(x=year, y=prop, fill=name)) +
        geom_col(position="stack") + 
        facet_wrap(~sex) 

__Up to you: Composition__

b. What is the proportion of the most common names over time?

In [None]:
babynames %>%
        group_by(name, sex) %>%
        summarise(AvProp = mean(prop)) %>%
        group_by(sex) %>%
        top_n(n=10, wt=AvProp) -> top10

In [None]:
options(repr.plot.width=8, repr.plot.height=6)
babynames %>%
        # keep each top-10 name only for the sex in which it ranks in the top 10
        semi_join(top10, by = c("name", "sex")) %>%
        ggplot(aes(x=year, y=prop, fill=name)) +
        geom_area(position = "stack") +
        facet_wrap(~sex) +
        theme_minimal()

__Up to you: Composition__

c. How did the proportion of the yearly most common names change over time?

In [None]:
options(repr.plot.width=7, repr.plot.height=4)
babynames %>%
        group_by(sex, year) %>%
        top_n(n=10, wt=prop) %>%
        summarise(Prop10 = sum(prop)) %>%
        ggplot(aes(x=year, y=Prop10)) +
        geom_line(aes(linetype=sex))

__Up to you: Composition__

d. How did the proportion of the top 10 most common names change across decades?

In [None]:
babynames %>% 
  group_by(sex, year) %>% 
  top_n(n = 10, wt = prop) %>%
  summarise(PropTop10 = sum(prop)) %>%
  ggplot(aes(x=cut_width(year, width = 10), y=PropTop10, fill=sex)) + 
  geom_col(position = "dodge") +
  scale_fill_brewer() + 
  theme(axis.ticks = element_blank(),panel.background = element_blank(), axis.text.x = element_text(angle = 45))
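
In the cell above, every decade bin contains several yearly values per sex, and with `position = "dodge"` those bars appear to be drawn on top of one another rather than combined. One possible alternative, sketched here under the assumption that the mean yearly share is a reasonable summary for a decade, aggregates to one value per sex and decade before plotting. The names `top10_by_decade`, `prop_top10`, `decade` and `mean_prop_top10` are made up for this example.

```r
library(dplyr)
library(ggplot2)
library(babynames)

top10_by_decade <- babynames %>%
  group_by(sex, year) %>%
  # Ten most common names per sex and year.
  slice_max(order_by = prop, n = 10, with_ties = FALSE) %>%
  # Combined share of those ten names in each year.
  summarise(prop_top10 = sum(prop), .groups = "drop") %>%
  # Map each year to the first year of its decade, e.g. 1987 -> 1980.
  mutate(decade = floor(year / 10) * 10) %>%
  group_by(sex, decade) %>%
  # Assumption: average the yearly shares within a decade.
  summarise(mean_prop_top10 = mean(prop_top10), .groups = "drop")

# One bar per sex and decade.
ggplot(top10_by_decade, aes(x = factor(decade), y = mean_prop_top10, fill = sex)) +
  geom_col(position = "dodge") +
  scale_fill_brewer() +
  labs(x = "decade") +
  theme_minimal()
```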
