*Analytical Information Systems*

# Tutorial 3 - Descriptive Analytics

Matthias Griebel<br>
Lehrstuhl für Wirtschaftsinformatik und Informationsmanagement

SS 2019

__Data Warehousing Analytics__

<img src="images/03/BIStack_fe.png" style="width:100%">

__Exploratory Data Analysis__

The objectives of EDA
- Suggest hypotheses
- Assess assumptions
- Support the selection of appropriate statistical techniques
- Provide a basis for further data collection

<img src="images/03/EDA.png" style="width:100%">

<h1>Agenda<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1-Descriptive-Statistics" data-toc-modified-id="1-Descriptive-Statistics-1">1 Descriptive Statistics</a></span><ul class="toc-item"><li><span><a href="#Scales-of-Measurement" data-toc-modified-id="Scales-of-Measurement-1.1">Scales of Measurement</a></span></li><li><span><a href="#Summary-Descriptive-Statistics" data-toc-modified-id="Summary-Descriptive-Statistics-1.2">Summary Descriptive Statistics</a></span></li></ul></li><li><span><a href="#2-Data-Visualization" data-toc-modified-id="2-Data-Visualization-2">2 Data Visualization</a></span><ul class="toc-item"><li><span><a href="#Data-Visualization-with-ggplot2" data-toc-modified-id="Data-Visualization-with-ggplot2-2.1">Data Visualization with ggplot2</a></span></li><li><span><a href="#Visualizing-European-Parliament-Election-2019" data-toc-modified-id="Visualizing-European-Parliament-Election-2019-2.2">Visualizing European Parliament Election 2019</a></span></li></ul></li></ul></div>

## 1 Descriptive Statistics

### Scales of Measurement

- Categorical (Qualitative) Data
    - Nominal Scale: Identity or category
    - Ordinal Scale: Order or rank

- Quantitative Data (numeric)
    - Interval Scale: Order and quantity
        - Differences can be calculated
    - Ratio Scale: Interval scale with an absolute zero

<img src="images/03/data-categories.png" style="width:100%">

__The diamonds dataset__

The diamonds dataset comes with the package `ggplot2` and contains information about ~54,000 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond.

Let's have a look at the data. What are the scales of the variables?

In [None]:
library(tidyverse)
diamonds %>% head()
?diamonds
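
One way to approach the question is to look at how R stores each column. The sketch below assumes the `diamonds` tibble from `ggplot2` is available. Ordered factors usually indicate an ordinal scale and numeric columns an interval or ratio scale, but the storage type is only a hint: the scale ultimately depends on what the variable means.

```R
library(ggplot2)  # provides the diamonds data set

# First storage class of every column (ordered factors report "ordered")
col_classes <- sapply(diamonds, function(col) class(col)[1])
col_classes

# Ordered factors carry a ranking of their levels, which suggests an ordinal scale
ordinal_cols <- names(diamonds)[sapply(diamonds, is.ordered)]
ordinal_cols

# The levels of an ordered factor are listed from lowest to highest
levels(diamonds$cut)
```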

### Summary Descriptive Statistics

In the lecture, we talked about 
- Central Tendency: What are the most typical values?
    - mean, median, mode
- Variability: How do the values vary?
    - Range
    - Percentiles
    - Standard Deviation
    - Coefficient of variation
- Shape: Are the values symmetrically or asymmetrically distributed?
    - skewness: degree of asymmetry
    - kurtosis: how heavy the tails are (often loosely described as "peakedness")
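
The measures of central tendency and variability listed above can be sketched in base R as follows. The vector `x` is a hypothetical toy sample chosen for illustration, and the mode is computed by hand because base R has no function for the statistical mode.

```R
# Hypothetical toy sample, chosen only for illustration
x <- c(2, 4, 4, 5, 7, 9, 11)

# Central tendency
mean_x   <- mean(x)
median_x <- median(x)
# Most frequent value; with ties, the smallest of the tied values is returned
mode_x   <- as.numeric(names(which.max(table(x))))

# Variability
range_x     <- range(x)                         # minimum and maximum
quartiles_x <- quantile(x, c(0.25, 0.5, 0.75))  # selected percentiles
sd_x        <- sd(x)                            # sample standard deviation (divisor n - 1)
cv_x        <- sd_x / mean_x                    # coefficient of variation

c(mean = mean_x, median = median_x, mode = mode_x, sd = sd_x, cv = cv_x)
```

The coefficient of variation is only meaningful for ratio-scaled variables with a positive mean.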

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Comparison_mean_median_mode.svg/1280px-Comparison_mean_median_mode.svg.png">

Source: *[Wikipedia](https://en.wikipedia.org/wiki/Skewness)*

__Skewness__ 
- negative skew: The left tail is longer (left-skewed, left-tailed)
    - mass of the distribution is concentrated on the right of the figure.
- positive skew: The right tail is longer (right-skewed, right-tailed)
    - mass of the distribution is concentrated on the left of the figure.

*from [Wikipedia](https://en.wikipedia.org/wiki/Skewness)*
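
As a minimal sketch, skewness can be computed from central moments. The helper `skewness_moment()` and the toy vectors are illustrative assumptions; packages such as `psych` or `e1071` provide variants with different small-sample corrections, so their results may differ slightly.

```R
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2 (divisor n)
skewness_moment <- function(x) {
    x <- x[!is.na(x)]
    d <- x - mean(x)
    mean(d^3) / mean(d^2)^1.5
}

right_skewed <- c(1, 2, 2, 3, 3, 3, 20)  # hypothetical data with a long right tail
left_skewed  <- -right_skewed            # mirrored data: long left tail
symmetric    <- c(1, 2, 3, 4, 5)

skewness_moment(right_skewed)  # positive
skewness_moment(left_skewed)   # negative
skewness_moment(symmetric)     # zero
```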

__Kurtosis__

Measure of the "tailedness" of the probability distribution

Excess kurtosis is defined as kurtosis minus 3
- Mesokurtic (excess = 0): normal distribution family, regardless of the values of its parameters
- Leptokurtic (excess > 0): distribution has fatter tails and a pointed peak, e.g., Laplace distribution, exponential distribution, Poisson distribution and the logistic distribution
- Platykurtic (excess < 0): distribution has thinner tails and a flat peak, e.g., the coin toss is the most platykurtic distribution

*from [Wikipedia](https://en.wikipedia.org/wiki/Kurtosis)*
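
The three cases can be illustrated with a moment-based estimate of excess kurtosis. The helper `excess_kurtosis()` is a hypothetical name for this sketch, and the simulated samples only approximate the theoretical values.

```R
# Excess kurtosis: fourth central moment divided by the squared
# second central moment, minus 3 (divisor n)
excess_kurtosis <- function(x) {
    x <- x[!is.na(x)]
    d <- x - mean(x)
    mean(d^4) / mean(d^2)^2 - 3
}

# Fair coin toss: equally many 0s and 1s
coin_toss <- rep(c(0, 1), times = 500)
excess_kurtosis(coin_toss)               # -2, the lower bound

set.seed(1)
normal_sample  <- rnorm(1e5)
# The difference of two independent exponential variables follows a Laplace distribution
laplace_sample <- rexp(1e5) - rexp(1e5)

excess_kurtosis(normal_sample)           # close to 0 (mesokurtic)
excess_kurtosis(laplace_sample)          # roughly 3 (leptokurtic)
```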

<img src="https://upload.wikimedia.org/wikipedia/commons/e/e6/Standard_symmetric_pdfs.png">

___Up to you:  Statistical Measures___

(a) On the diamond data set, calculate statistical measures to describe the central tendency, variability and the shape of the `price`:

We can also use the `psych::describe()` function to calculate the statistical measures as it provides more descriptive statistics than base R.

In [None]:
psych::describe(diamonds)['price', ]

Visualization often facilitates understanding of the data and its distribution. 

*(We will learn more about data visualization in the next section.)*

In [None]:
options(repr.plot.width=7, repr.plot.height=5)
diamonds %>% ggplot() + 
    geom_density(aes(x=scale(price))) +
    stat_function(n = 100, fun = dnorm, linetype='dotted') +
    theme_minimal() +
    xlim(-5,5)
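
The plot standardizes `price` with `scale()` so that its density can be compared with the standard normal curve drawn by `dnorm`. The sketch below shows what `scale()` computes by default; the values in `x` are hypothetical prices used for illustration.

```R
x <- c(326, 950, 2401, 5324, 18823)  # hypothetical prices

z_manual <- (x - mean(x)) / sd(x)    # z-score: center, then divide by the standard deviation
z_scale  <- as.numeric(scale(x))     # scale() returns a one-column matrix

all.equal(z_manual, z_scale)         # TRUE
mean(z_scale)                        # approximately 0
sd(z_scale)                          # 1
```

Standardizing changes location and spread only, so the skewness and kurtosis of the distribution are unaffected.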

___Up to you:  Statistical Measures___

We calculated statistical measures to get a basic understanding of the diamond prices. Now, we want to gain deeper insight.

(b) Describe the central tendency, variability and the shape of the prices depending on the quality (`cut`) and `color`.
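
Group-wise measures follow the `group_by()` and `summarise()` pattern from `dplyr`. The sketch below uses `mtcars` (shipped with base R) as a stand-in so that the exercise itself stays open; the column names in the result are arbitrary choices.

```R
library(dplyr)

# mtcars serves only as a stand-in data set here
grouped_stats <- mtcars %>%
    group_by(cyl, gear) %>%
    summarise(n          = n(),          # group size
              mean_mpg   = mean(mpg),
              median_mpg = median(mpg),
              sd_mpg     = sd(mpg),      # NA for groups with a single row
              min_mpg    = min(mpg),
              max_mpg    = max(mpg)) %>%
    ungroup()
grouped_stats
```

The same pattern also answers questions about group sizes and extreme values, since `n()`, `min()` and `max()` are ordinary summary functions.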

___Up to you:  Statistical Measures___

(c) How many diamonds belong to each of the groups (`cut` and `color` combinations)? What is the cheapest and the most expensive price within each group?

___Up to you:  Statistical Measures___

Let's take a closer look at the other variables in the data set.

(d) What is the average volume of the diamonds, depending on the quality (`cut`)?

___Up to you:  Statistical Measures___

(e) What are the average values of all (numeric) columns?
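
Applying one function to all numeric columns can be sketched with `across()`, assuming `dplyr` 1.0.0 or later (older versions offer `summarise_if()` for the same purpose). The built-in `iris` data is used as a stand-in.

```R
library(dplyr)

# iris (base R) is used as a stand-in; the factor column Species is skipped
numeric_means <- iris %>%
    summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
numeric_means
```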

## 2 Data Visualization

__Why use Data Visualization?__

Data visualization is the depiction of information using spatial or graphical representations to facilitate comparison, pattern recognition, change detection, and other cognitive skills by making use of the visual system.

- Problem 
    - Big datasets: How to understand them?
- Solution
    - Take better advantage of human perceptual system
    - Convert information into a graphical representation.
- Issues
    - How to convert abstract information into graphical form?
    - Do visualizations do a better job than other methods?




### Data Visualization with ggplot2

The ggplot2 package lets you make beautiful and customizable plots of your data. 
- one of the core members of the tidyverse
- based on the __grammar of graphics__, the idea that you can build every graph from the same components:
    - a data set
    - a coordinate system
    - geoms - visual marks that represent data points

[Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf)

<img src="https://www.rstudio.com/wp-content/uploads/2018/08/data-visualization-2.1-600x464.png">

A `ggplot2` graph consists of the following components.

```R
ggplot(data = <DATA> ) +
    <GEOM_FUNCTION> (mapping = aes( <MAPPINGS> ), stat = <STAT> , position = <POSITION> ) +
    (opt) <COORDINATE_FUNCTION> + 
    (opt) <FACET_FUNCTION> +
    (opt) <SCALE_FUNCTION> + 
    (opt) <THEME_FUNCTION>
``` 

Let's complete the template below to build a graph of the relationship between `carat` and `price` in the `diamonds` data:

In [None]:
options(repr.plot.width=7, repr.plot.height=5)
ggplot(data = diamonds) +
    geom_point(mapping = aes(x = carat, y = price), stat = "identity", position = "identity") + 
    theme_minimal()

__Controlling plots in Jupyter notebooks__

<img align="right" src="https://irkernel.github.io/images/irkernel-logo.svg" style="width:20%">

You can control the size and quality of plots. 
- Use `options(repr.* = ...)` and `getOption('repr.*')` to set and get them, respectively.
- Example: adjust width and height
```R
options(repr.plot.width=7, repr.plot.height=7)
```

[Documentation](https://www.rdocumentation.org/packages/repr/versions/0.7/topics/repr-options)
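
Because `options()` invisibly returns the previous values, plot settings can be changed for one cell and restored afterwards. This is a small sketch of that base R mechanism; the chosen sizes are arbitrary.

```R
# options() returns the old values of the options it changes
old <- options(repr.plot.width = 10, repr.plot.height = 3)

getOption("repr.plot.width")   # 10
getOption("repr.plot.height")  # 3

# ... create a wide plot here ...

options(old)  # restore the previous settings
```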

__Choosing Plots and Aesthetic Elements__

<img src="images/03/taxonomy.png">

### Visualizing European Parliament Election 2019

The ninth election to the European Parliament in Germany was held on 26 May 2019, electing the members of the national German constituency to the European Parliament. The results are available [here](https://www.bundeswahlleiter.de/europawahlen/2019/ergebnisse.html).

<img src="images/03/Stimmenanteile.png" style="width:100%">

__Let's download the data and reproduce the plot__

In [None]:
'https://www.bundeswahlleiter.de/dam/jcr/5441f564-1f29-4971-9ae2-b8860c1724d1/ew19_kerg2.csv' %>% 
    read_csv2(skip = 9) -> eu_elections

Get an impression of the dataset

In [None]:
glimpse(eu_elections)
eu_elections$Gebietsart %>% unique()
eu_elections$Gruppenart %>% unique()

__Filter and prepare the data__

In [None]:
eu_elections %>%
    filter(Gebietsart == 'Bund', Gruppenart == 'Partei') %>%
    mutate(Gruppenname = fct_reorder(Gruppenname, -VorpProzent), 
           Gruppenname = fct_lump(Gruppenname, prop=0.0065, w = Prozent, other_level = "Sonstige")) %>%
    group_by(Gruppenname) %>%
    summarise(Prozent = sum(Prozent),
              VorpProzent =  sum(VorpProzent),
              Anzahl =  sum(Anzahl),
              VorpAnzahl =  sum(VorpAnzahl)) -> share_data

How would you solve this using the `if_else()` function? You may need `fct_relevel()` to reorder the factor levels.

```R
    mutate(Gruppenname = if_else(Prozent < 0.65, 'Sonstige', Gruppenname),
           Gruppenname = fct_reorder(Gruppenname, -VorpProzent),
           Gruppenname = fct_relevel(Gruppenname, "Sonstige" , after = Inf))
```
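
To see what the weighted lumping does, consider a toy example. The parties and shares below are hypothetical and not taken from the election data. `fct_lump()` with `prop` and `w` collapses every level whose weighted share of the total falls below the threshold into `other_level`.

```R
library(dplyr)
library(forcats)

# Hypothetical vote shares in percent, not the real election result
toy <- tibble(party = factor(c("A", "B", "C", "D")),
              share = c(60, 39, 0.6, 0.4))

lumped <- toy %>%
    # C and D each hold less than 0.65 % of the total weight and are lumped together
    mutate(party = fct_lump(party, prop = 0.0065, w = share, other_level = "Sonstige")) %>%
    group_by(party) %>%
    summarise(share = sum(share))
lumped
```

The lumped level is placed last, which is why the `if_else()` variant needs an explicit `fct_relevel()` to achieve the same ordering.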

Colors have a special meaning in our democracy. There is no universal standard for mapping the parties to their respective colors, so we have to do it manually (here, in a named vector).

In [None]:
party_colors <- c(
    'AfD'= 'turquoise',
    'CDU'= 'darkblue',
    'CSU'='blue',
    'DIE LINKE'='purple',
    'Die PARTEI'= 'darkred',
    'FAMILIE'='pink',
    'FDP'='yellow',
    'FREIE WÄHLER'='lightblue',
    'GRÜNE'='seagreen',
    'PIRATEN'= 'orange',
    'SPD'='red',
    'Tierschutzpartei'='lightgreen',
    'Volt'='blue',
    'ÖDP'= 'orange2',
    'Sonstige'= 'gray')

__Rebuilding the graph__

In [None]:
share_data %>% 
    ggplot(aes(x = fct_reorder(Gruppenname, -VorpProzent), y = Prozent, fill=Gruppenname)) + 
    geom_col(width=0.5) + 
    geom_col(aes(y = VorpProzent), alpha=0.5, position= position_nudge(x = 0.25), width=0.5) +
    geom_text(aes(label=format(round(Prozent, 1), nsmall = 1)), size= 3, vjust = -0.5) +
    scale_fill_manual(values= party_colors) + 
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 0.95)) +
    xlab(NULL) +            # remove the x-axis title
    guides(fill = "none")   # hide the legend (fill = FALSE is deprecated in recent ggplot2 versions)

__Visualizing the vote count__

In [None]:
share_data %>%
    mutate(Gruppenname = fct_reorder(Gruppenname, Anzahl),
           Gruppenname = fct_relevel(Gruppenname, "Sonstige")) %>%
    ggplot(aes(x=Gruppenname, y=Anzahl/1000000, fill=Gruppenname)) + 
    geom_col(width=0.7) + 
    geom_text(aes(label=format(round(Anzahl/1000000, 2), nsmall = 2)), size= 3, hjust = -0.1) +
    scale_fill_manual(values= party_colors) + 
    theme_minimal() +
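    xlab(NULL) +            # remove the axis title for the party names
    ylab(label = 'Stimmen in Mio.') +
    guides(fill = "none") + # hide the legend (fill = FALSE is deprecated in recent ggplot2 versions)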
    coord_flip()

__Comparing two parties ("AfD", "GRÜNE")__   

What is the distribution of shares in the different constituencies of each federal state?

First, we need to map the federal state codes to their respective names:

In [None]:
eu_elections %>%
    filter(Gebietsart == 'Land') %>%
    select(Gebietsnummer, Gebietsname) %>%
    rename(Land = Gebietsname) %>%
    distinct() -> lookUpLand
lookUpLand
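
A lookup table like this is typically attached with `left_join()`, where the `by` argument pairs key columns that have different names in the two tables. The miniature tables below are hypothetical and only illustrate the mechanics.

```R
library(dplyr)

# Hypothetical miniature versions of the constituency table and the lookup table
kreise <- tibble(Gebietsname      = c("Kreis X", "Kreis Y", "Kreis Z"),
                 UegGebietsnummer = c(1, 1, 2))
laender <- tibble(Gebietsnummer = c(1, 2),
                  Land          = c("Land A", "Land B"))

# Match the parent-region code of each constituency to the state code
joined <- kreise %>%
    left_join(laender, by = c("UegGebietsnummer" = "Gebietsnummer"))
joined
```

A left join keeps every row of the left table, so constituencies without a matching state would show `NA` in `Land`.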

Plot the comparison: distribution of shares in the different constituencies of each federal state

In [None]:
options(repr.plot.width=8, repr.plot.height=4)
eu_elections %>%
    filter(Gebietsart == 'Kreis', Gruppenart == 'Partei') %>%
    filter(Gruppenname %in% c("AfD", "GRÜNE")) %>%
    left_join(lookUpLand, by=c("UegGebietsnummer" = "Gebietsnummer")) %>%
    ggplot(aes(x=Prozent, fill=Gruppenname)) + 
    geom_density(alpha=0.5) + 
    scale_fill_manual(values= party_colors) + 
    theme_minimal() +
    facet_wrap(~Land)

__Relationship__

Is there any relationship between the shares of two parties at constituency level?

Filter and preprocess the data

In [None]:
eu_elections %>%
    filter(Gebietsart == 'Kreis', Gruppenart == 'Partei') %>%
    select(Gebietsname, Gruppenname, Prozent) %>%
    spread(Gruppenname, Prozent) -> kreis_data
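
The reshaping step turns one row per constituency and party into one row per constituency with a column per party. A minimal sketch with hypothetical values shows the effect; `pivot_wider()` is the newer `tidyr` interface (version 1.0.0 or later) for the same operation.

```R
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per constituency and party
long <- tibble(Gebietsname = c("Kreis X", "Kreis X", "Kreis Y", "Kreis Y"),
               Gruppenname = c("P1", "P2", "P1", "P2"),
               Prozent     = c(30, 20, 25, 35))

# Wide format: one row per constituency, one column per party
wide <- long %>% spread(Gruppenname, Prozent)
wide

# Equivalent reshaping with the newer interface
wide2 <- long %>% pivot_wider(names_from = Gruppenname, values_from = Prozent)
```

Both functions require that each combination of constituency and party occurs only once in the long data.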

Plot the relationship between the shares of 'GRÜNE' and 'CSU' at constituency level.

In [None]:
set.seed(0)
kreis_data %>% 
    ggplot(aes(x=CSU,y=GRÜNE, label=Gebietsname)) + 
    geom_point() + 
    ggrepel::geom_text_repel(data=sample_frac(kreis_data, 0.15),alpha=0.8) +
    theme_minimal()