*Analytical Information Systems*

# Tutorial 3 - Descriptive Analytics

Matthias Griebel<br>
Lehrstuhl für Wirtschaftsinformatik und Informationsmanagement

SS 2019

__Data Warehousing Analytics__

<img src="images/03/BIStack_fe.png" style="width:100%">

__Exploratory Data Analysis__

The objectives of EDA
- Suggest hypotheses
- Assess assumptions
- Support the selection of appropriate statistical techniques
- Provide a basis for further data collection

<img src="images/03/EDA.png" style="width:100%">

<h1>Agenda<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Descriptive-Statistics" data-toc-modified-id="Descriptive-Statistics-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Descriptive Statistics</a></span><ul class="toc-item"><li><span><a href="#Scales-of-Measurement" data-toc-modified-id="Scales-of-Measurement-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Scales of Measurement</a></span></li><li><span><a href="#Summary-Descriptive-Statistics" data-toc-modified-id="Summary-Descriptive-Statistics-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Summary Descriptive Statistics</a></span></li></ul></li><li><span><a href="#Data-Visualization" data-toc-modified-id="Data-Visualization-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Visualization</a></span><ul class="toc-item"><li><span><a href="#Data-Visualization-with-ggplot2" data-toc-modified-id="Data-Visualization-with-ggplot2-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Data Visualization with ggplot2</a></span></li><li><span><a href="#Comparison" data-toc-modified-id="Comparison-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Comparison</a></span></li><li><span><a href="#Composition" data-toc-modified-id="Composition-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Composition</a></span></li><li><span><a href="#Distributions" data-toc-modified-id="Distributions-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Distributions</a></span></li><li><span><a href="#Relationships" data-toc-modified-id="Relationships-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Relationships</a></span></li><li><span><a href="#Coloring" data-toc-modified-id="Coloring-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Coloring</a></span></li><li><span><a href="#Faceting" data-toc-modified-id="Faceting-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Faceting</a></span></li><li><span><a href="#Graphical-Excellence" data-toc-modified-id="Graphical-Excellence-2.8"><span 
class="toc-item-num">2.8&nbsp;&nbsp;</span>Graphical Excellence</a></span></li></ul></li><li><span><a href="#Exam-Questions" data-toc-modified-id="Exam-Questions-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Exam Questions</a></span><ul class="toc-item"><li><span><a href="#Exam-AIS-WS-2018/19" data-toc-modified-id="Exam-AIS-WS-2018/19-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Exam AIS WS 2018/19</a></span></li><li><span><a href="#Exam-SS-2018" data-toc-modified-id="Exam-SS-2018-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Exam SS 2018</a></span></li></ul></li></ul></div>

## Descriptive Statistics

### Scales of Measurement

Categorical (Qualitative) Data
- Nominal Scale: Identity or category
- Ordinal Scale: Order or rank

Quantitative Data (numeric)
- Interval Scale: Order and quantity: Differences can be calculated
- Ratio Scale: Interval scale with an absolute zero

<img src="images/03/data-categories.png" style="width:100%">

__The diamonds dataset__

The diamonds dataset comes with the package `ggplot2` and contains information about ~54,000 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond.

Let's have a look at the data. What are the scales of the variables?

In [None]:
library(tidyverse)
diamonds %>% head()
?diamonds
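
As a cross-check for the question above, R's own type information can be inspected. This is a minimal sketch that assumes only that `ggplot2` is installed; mapping R types to measurement scales is an interpretation, not something R enforces.

```r
library(ggplot2)  # provides the diamonds data

# Ordered factors carry a rank order, which matches the ordinal scale
is.ordered(diamonds$cut)
levels(diamonds$cut)

# Numeric columns with a meaningful zero (price, carat) can be read as ratio-scaled
class(diamonds$price)
class(diamonds$carat)
```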

### Summary Descriptive Statistics

In the lecture, we talked about:

- __Central tendency:__ What are the most typical values?
    - mean, median, mode

- __Variability:__ How do the values vary?
    - Range
    - Percentiles
    - Standard Deviation
    - Coefficient of variation

- __Shape:__ Are the values symmetrically or asymmetrically distributed?
    - skewness: symmetry 
    - kurtosis: how peaky is the distribution
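
The measures of central tendency and variability listed above can be computed with base R. The following is a minimal sketch on the `price` column of `diamonds`; `stat_mode()` is a hypothetical helper, since base R has no function for the statistical mode.

```r
library(ggplot2)  # provides the diamonds data

price <- diamonds$price

# Hypothetical helper: returns the most frequent value of x
stat_mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

# Central tendency
c(mean = mean(price), median = median(price), mode = stat_mode(price))

# Variability
range(price)                                  # minimum and maximum
quantile(price, probs = c(0.25, 0.5, 0.75))   # percentiles (here: quartiles)
sd(price)                                     # standard deviation
cv <- sd(price) / mean(price)                 # coefficient of variation
cv
```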

__Central tendency, variability, and shape__

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Comparison_mean_median_mode.svg/1280px-Comparison_mean_median_mode.svg.png" style="width:60%">

from *[Wikipedia](https://en.wikipedia.org/wiki/Skewness)*

__Skewness__ 
- negative skew: The left tail is longer (left-skewed, left-tailed)
    - mass of the distribution is concentrated on the right of the figure.
- positive skew: The right tail is longer (right-skewed, right-tailed)
    - mass of the distribution is concentrated on the left of the figure.

*from [Wikipedia](https://en.wikipedia.org/wiki/Skewness)*

__Kurtosis__

Measure of the "tailedness" of the probability distribution

Excess kurtosis is defined as kurtosis minus 3
- Mesokurtic (excess = 0): normal distribution family, regardless of the values of its parameters
- Leptokurtic (excess > 0): distribution has fatter tails and a pointed peak
- Platykurtic (excess < 0): distribution has thinner tails and a flat peak

*from [Wikipedia](https://en.wikipedia.org/wiki/Kurtosis)*
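
To make the two shape measures concrete, they can be computed from central moments by hand. This is a sketch under the assumption that the simple population-moment definitions are used; the helper names are hypothetical, and packages such as `psych` may apply a slightly different small-sample correction.

```r
library(ggplot2)  # provides the diamonds data

# Hypothetical helpers based on population central moments
central_moment  <- function(x, k) mean((x - mean(x))^k)
skewness_moment <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
excess_kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2 - 3

skewness_moment(diamonds$price)   # positive: the right tail is longer
excess_kurtosis(diamonds$price)

# Sanity check on a large normal sample: both values should be close to 0
set.seed(1)
z <- rnorm(1e5)
skewness_moment(z)
excess_kurtosis(z)
```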

<img src="https://upload.wikimedia.org/wikipedia/commons/e/e6/Standard_symmetric_pdfs.png">

___Up to you:  Statistical Measures___

(a) On the diamond data set, calculate statistical measures to describe the central tendency, variability and the shape of the `price`:

In [None]:
diamonds %>%
    summarise(Mean = mean(price),
              Sd = sd(price),
              Skew = psych::skew(price),
              Kurt = psych::kurtosi(price))

summary(diamonds$price)

We can also use the `psych::describe()` function to calculate the statistical measures as it provides more descriptive statistics than base R.

In [None]:
psych::describe(diamonds$price)

Visualization often facilitates understanding of the data and its distribution. 

*(We will learn more about data visualization in the next section.)*

In [None]:
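options(repr.plot.width=7, repr.plot.height=5)
diamonds %>% ggplot() + 
    # standardized prices (mean 0, sd 1); as.numeric() drops the matrix returned by scale()
    geom_density(aes(x=as.numeric(scale(price)))) +
    # standard normal density as a reference (dotted line)
    stat_function(n = 100, fun = dnorm, linetype='dotted') +
    theme_minimal() +
    xlim(-5,5)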

___Up to you:  Statistical Measures___

We calculated statistical measures to get a basic understanding of the diamond prices. Now we want to gain deeper insight.

(b) Describe the central tendency, variability and the shape of the prices depending on the quality (`cut`) and `color`.

In [None]:
diamonds %>%
    group_by(cut, color) %>%
    summarise(Mean = mean(price),
             Sd = sd(price),
             Skew = psych::skew(price),
             Kurt = psych::kurtosi(price))

___Up to you:  Statistical Measures___

(c) How many diamonds belong to each of the groups (`cut` and `color` combinations)? What is the cheapest and the most expensive price within each group?

In [None]:
diamonds %>%
    group_by(cut, color) %>%
    summarise(Count = n(),
              MaxPrice = max(price),
              MinPrice = min(price))

___Up to you:  Statistical Measures___

Let's take a closer look at the other variables in the data set.

(d) What is the average volume of the diamonds, depending on the quality (`cut`)?

In [None]:
diamonds %>%
    mutate(Volume = x*y*z) %>%
    group_by(cut) %>%
    summarise(mean(Volume))

___Up to you:  Statistical Measures___

(e) What are the average values of all (numeric) columns?

In [None]:
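# Average of the numeric columns only
diamonds %>%
    summarise_if(is.numeric, mean)

# For comparison: summarise_all() also applies mean() to the factor columns
# (cut, color, clarity), which yields NA together with warnings
diamonds %>%
    summarise_all(mean)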

## Data Visualization

Parts of this chapter are based on the book "R for Data Science" by Garrett Grolemund and Hadley Wickham

- Chapter 3: [Data Visualisation](https://r4ds.had.co.nz/data-visualisation.html)
- Chapter 7: [Exploratory Data Analysis](https://r4ds.had.co.nz/exploratory-data-analysis.html)

<img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png">

__Why use Data Visualization?__

Data visualization is the depiction of information using spatial or graphical representations. By making use of the visual system, it facilitates comparison, pattern recognition, change detection, and other cognitive skills.

- Problem 
    - Big datasets: How to understand them?
- Solution
    - Take better advantage of human perceptual system
    - Convert information into a graphical representation.
- Issues
    - How to convert abstract information into graphical form?
    - Do visualizations do a better job than other methods?




### Data Visualization with ggplot2

The ggplot2 package lets you make beautiful and customizable plots of your data. 
- one of the core members of the tidyverse
- based on the __grammar of graphics__, the idea that you can build every graph from the same components:
    - a data set
    - a coordinate system
    - geoms - visual marks that represent data points

[Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf)

<img src="https://www.rstudio.com/wp-content/uploads/2018/08/data-visualization-2.1-600x464.png">

A `ggplot2` graph consists of the following components.

```R
ggplot(data = <DATA> ) +
    <GEOM_FUNCTION>(mapping=aes(<MAPPINGS>),stat=<STAT>,position=<POSITION>)+
    (opt) <COORDINATE_FUNCTION> + 
    (opt) <FACET_FUNCTION> +
    (opt) <SCALE_FUNCTION> + 
    (opt) <THEME_FUNCTION>
``` 

Let's complete the template to build a graph of the relationship between `carat` and `price` in the `diamonds` data:

In [None]:
options(repr.plot.width=7, repr.plot.height=5)
ggplot(data = diamonds) +
    geom_point(mapping = aes(x = carat, y = price)) + 
    theme_minimal()
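
The optional parts of the template can be filled in the same way. The following sketch is one possible combination; the specific choices (`coord_cartesian()`, `facet_wrap()`, `scale_y_log10()`, the `alpha` value) are illustrative assumptions, not part of the original exercise.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
    geom_point(mapping = aes(x = carat, y = price),
               stat = "identity", position = "identity", alpha = 0.1) +
    coord_cartesian(xlim = c(0, 3)) +   # <COORDINATE_FUNCTION>
    facet_wrap(~ cut) +                 # <FACET_FUNCTION>
    scale_y_log10() +                   # <SCALE_FUNCTION>
    theme_minimal()                     # <THEME_FUNCTION>
p
```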

__Controlling plots in Jupyter notebooks__

You can control the size and quality of plots. 
- Use `options(repr.* = ...)` and `getOption('repr.*')` to set and get them, respectively.
- Example: adjust width and height
```R
options(repr.plot.width=7, repr.plot.height=7)
```

[Documentation](https://www.rdocumentation.org/packages/repr/versions/0.7/topics/repr-options)
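
Because `options()` returns the previous values, a plot size can be changed temporarily and restored afterwards. This is a minimal sketch; it assumes the notebook runs on the IRkernel, where the `repr.*` options are honoured.

```r
# options() invisibly returns the old values of the options it changes
old <- options(repr.plot.width = 4, repr.plot.height = 3)
getOption("repr.plot.width")   # 4

# ... draw a small plot here ...

# Restore the previous plot size
options(old)
```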

__Choosing Plots and Aesthetic Elements__

<img src="images/03/taxonomy.png" style="width:60%">

### Comparison

__Column Charts__

If you want the heights of the bars to represent values in the data, use `geom_col()`. 

In [None]:
options(repr.plot.width=7, repr.plot.height=5)
diamonds %>%
    group_by(cut) %>%
    summarise(avgprice = mean(price)) %>%
    ggplot() + 
    geom_col(mapping = aes(x=cut, y =avgprice))

There's plenty of unused (uninformative) space in the chart, so let's change the width of the bars:

In [None]:
options(repr.plot.width=4, repr.plot.height=5)
diamonds %>%
    group_by(cut) %>%
    summarise(avgprice = mean(price)) %>%
    ggplot() + 
    geom_col(mapping = aes(x=cut, y =avgprice), width=0.2)

This chart contains the same information:

In [None]:
diamonds %>%
    group_by(cut) %>%
    summarise(avgprice = mean(price)) %>%
    ggplot() + 
    geom_point(mapping = aes(x=cut, y =avgprice))

Finding an explanation for the low average price of the ideal cut is not easy!

In [None]:
options(repr.plot.width=4, repr.plot.height=5)
diamonds %>%
    group_by(cut) %>%
    summarise(avgPricePerCarat = mean(price)/mean(carat)) %>%
    ggplot() +
    geom_col(aes(x=cut, y=avgPricePerCarat), width=0.2)

Clarity as explanation?

In [None]:
options(repr.plot.width=4, repr.plot.height=5)
diamonds %>%
    filter(clarity=="IF") %>%
    group_by(cut) %>%
    summarise(avgPricePerCarat = mean(price)/mean(carat)) %>%
    ggplot() +
    geom_col(aes(x=cut, y=avgPricePerCarat), width=0.2)

### Composition

__Bar chart__

To examine the counts of different classes of a categorical variable, use a bar chart:

In [None]:
options(repr.plot.width=7, repr.plot.height=5)
diamonds %>%
    ggplot() +
    geom_bar(aes(x=cut))

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. The figure below describes how this process works with geom_bar().

<img src="images/03/visualization-stat-bar-2.png" style="width:80%">
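
The statistical transformation can also be carried out by hand. The sketch below computes the counts first and then draws them unchanged with `geom_col()`, which should give the same picture as `geom_bar(aes(x=cut))`; the variable name `counts` is only illustrative.

```r
library(dplyr)
library(ggplot2)

# Step 1: the statistical transformation, done explicitly
counts <- diamonds %>% count(cut)   # one row per cut, count in column n
counts

# Step 2: draw the precomputed values as they are
p <- ggplot(counts) +
    geom_col(aes(x = cut, y = n))
p
```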

__Stacked Bar chart__

To examine the composition (with respect to other categorical variables) of the different classes of a categorical variable, we use a stacked bar chart:

In [None]:
options(repr.plot.width=7, repr.plot.height=5)
diamonds %>%
    ggplot() +
    geom_bar(aes(x=cut, fill=clarity))

__100% Stacked Bar chart__

To focus exclusively on the composition (with respect to other categorical variables) of the different classes of a categorical variable, we use a 100% stacked bar chart:

In [None]:
options(repr.plot.width=7, repr.plot.height=5)
diamonds %>%
    ggplot() +
    geom_bar(aes(x=cut, fill=clarity), position="fill")

### Distributions

How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is categorical if it can only take one of a small set of values. In R, categorical variables are usually saved as factors or character vectors. 

__Histograms__

To examine the distribution of a continuous variable, use a histogram. A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. 

In [None]:
ggplot(data = diamonds) +
    geom_histogram(aes(x=carat), bins=30, color="black", fill="white")
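
Instead of the number of bins, the bin width can be fixed. The sketch below follows the approach in "R for Data Science": `cut_width()` reproduces the binning as a table, and `binwidth` applies the same width in the plot. The width of 0.5 is an arbitrary choice for illustration.

```r
library(dplyr)
library(ggplot2)

# Tabulate the number of observations per bin of width 0.5 carat
bins <- diamonds %>% count(cut_width(carat, 0.5))
bins

# The same binning as a histogram
p <- ggplot(data = diamonds) +
    geom_histogram(aes(x = carat), binwidth = 0.5)
p
```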

__Smoothed density estimates__

`geom_density` computes and draws a kernel density estimate, which is a smoothed version of the histogram. This is a useful alternative to the histogram for continuous data that comes from an underlying smooth distribution.

In [None]:
ggplot(diamonds) +
    geom_density(aes(x=carat))

__Boxplots__

You can use boxplots to display the distribution of a continuous variable broken down by a categorical variable.

- Concise way to illustrate the standard quantiles, shape, and outliers of data
- Each “box” is created according to some standard composition rules

<img src="images/03/eda-boxplot.png" style="width:60%">

[Source](https://r4ds.had.co.nz/exploratory-data-analysis.html)


In [None]:
options(repr.plot.width=2, repr.plot.height=5)

__Comparing Distributions__

It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable.

In [None]:
options(repr.plot.width=7, repr.plot.height=5)
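
A minimal sketch of one way to overlay the distributions, following the approach in "R for Data Science". The choice of `geom_freqpoly()`, of `price` and `cut` as variables, and of a `binwidth` of 500 are assumptions for illustration.

```r
library(ggplot2)

# One frequency polygon per cut, distinguished by colour
p <- ggplot(data = diamonds, aes(x = price, colour = cut)) +
    geom_freqpoly(binwidth = 500)
p
```

Because the groups differ strongly in size, the overlaid lines are hard to compare.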

...better use boxplots!

In [None]:
options(repr.plot.width=3, repr.plot.height=4)
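
A minimal sketch of a grouped boxplot, assuming `price` is compared across the levels of `cut`:

```r
library(ggplot2)

# One box per cut; the box shows the quartiles, points mark outliers
p <- ggplot(data = diamonds) +
    geom_boxplot(aes(x = cut, y = price))
p
```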

If the labels on the x-axis do not fit well, flip the coordinates.

In [None]:
options(repr.plot.width=5, repr.plot.height=3)
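
A minimal sketch of flipping the axes with `coord_flip()`, again assuming a boxplot of `price` by `cut`:

```r
library(ggplot2)

# coord_flip() swaps the x and y axes, so the category labels are drawn
# horizontally along the vertical axis
p <- ggplot(data = diamonds) +
    geom_boxplot(aes(x = cut, y = price)) +
    coord_flip()
p
```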

### Relationships

If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on the type of variables involved.

__Relationship between two continuous variables__

Using a scatterplot with two variables you can see covariation as a pattern in the points. You can see an exponential relationship between the carat size and price of a diamond.

In [None]:
options(repr.plot.width=7, repr.plot.height=7)
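
A minimal sketch of such a scatterplot. The `alpha` value is an assumption, added to reduce overplotting with roughly 54,000 points.

```r
library(ggplot2)

# Each point is one diamond; transparency reveals dense regions
p <- ggplot(data = diamonds) +
    geom_point(aes(x = carat, y = price), alpha = 0.1)
p

# A numeric summary of the linear association
cor(diamonds$carat, diamonds$price)
```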

__Two categorical variables__

To visualise the covariation between categorical variables, you’ll need to count the number of observations for each combination.
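
Two ways of doing this, sketched after "R for Data Science": counting with `dplyr` and drawing tiles, or letting `geom_count()` do the counting. The choice of `color` and `cut` as the two variables is an assumption.

```r
library(dplyr)
library(ggplot2)

# Count the observations for each combination of the two variables
combos <- diamonds %>% count(color, cut)

# Option A: tiles whose fill encodes the count
p_tile <- ggplot(combos, aes(x = color, y = cut)) +
    geom_tile(aes(fill = n))
p_tile

# Option B: geom_count() counts internally and maps the count to point size
p_count <- ggplot(data = diamonds) +
    geom_count(aes(x = cut, y = color))
p_count
```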

### Coloring

__Continuous data:__
- differences between your steps should be high enough
    - Sequential palettes (single hue vs multiple hues)
    - Diverging palettes which emphasize extremes

__Qualitative data:__
- find colors which go well together and attract the reader’s eye
    - Qualitative palettes

__RColorBrewer for color palettes__

In [None]:
options(repr.plot.width=7, repr.plot.height=7)
RColorBrewer::display.brewer.all(n=NULL, type="all", select=NULL, exact.n=TRUE, colorblindFriendly=FALSE)

__Using color palettes on the `diamonds` data__

In [None]:
diamonds %>% 
    sample_n(1000) %>%
    ggplot(aes(carat, price)) +
    geom_point(aes(colour = clarity)) -> d
d

Select brewer palette to use - Sequential
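
A minimal sketch of applying a sequential palette with `scale_colour_brewer()`. The palette name `"Blues"` and the seed are arbitrary choices; the scatterplot is rebuilt here so that the example runs on its own.

```r
library(dplyr)
library(ggplot2)

set.seed(42)  # arbitrary seed, only for a reproducible sample
d <- diamonds %>%
    sample_n(1000) %>%
    ggplot(aes(carat, price)) +
    geom_point(aes(colour = clarity))

# Sequential palette: suits the ordered clarity grades
d_seq <- d + scale_colour_brewer(palette = "Blues")
d_seq
```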

Select brewer palette to use - Qualitative
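
A minimal sketch of applying a qualitative palette. The palette name `"Set1"` and the seed are arbitrary choices; note that `clarity` is actually ordinal, so the qualitative palette is used here only to show the mechanics.

```r
library(dplyr)
library(ggplot2)

set.seed(42)  # arbitrary seed, only for a reproducible sample
d <- diamonds %>%
    sample_n(1000) %>%
    ggplot(aes(carat, price)) +
    geom_point(aes(colour = clarity))

# Qualitative palette: clearly distinct hues without an implied order
d_qual <- d + scale_colour_brewer(palette = "Set1")
d_qual
```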

`scale_fill_brewer` works just the same as `scale_colour_brewer` but for fill colours

In [None]:
diamonds %>%
    ggplot(aes(x = price, fill = cut)) +
    geom_histogram(position = "dodge", binwidth = 1000) -> p


The direction of colors can be reversed
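
A minimal sketch using the `direction` argument of `scale_fill_brewer()`. The palette `"Greens"` is an arbitrary choice, and the histogram is rebuilt here so that the example runs on its own.

```r
library(dplyr)
library(ggplot2)

p <- diamonds %>%
    ggplot(aes(x = price, fill = cut)) +
    geom_histogram(position = "dodge", binwidth = 1000)

# Default order of the palette
p_default <- p + scale_fill_brewer(palette = "Greens")
p_default

# direction = -1 reverses the order of the colours
p_reversed <- p + scale_fill_brewer(palette = "Greens", direction = -1)
p_reversed
```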

### Faceting

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

In [None]:
options(repr.plot.width=9, repr.plot.height=3)
diamonds %>%
    ggplot(aes(x = price)) +
    geom_histogram(binwidth = 1000)
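
A minimal sketch of adding facets to this histogram. Splitting by `cut` and arranging the panels in a single row are assumptions, chosen to match the wide plot size set above.

```r
library(ggplot2)

# One histogram panel per cut, all in a single row
p <- ggplot(data = diamonds, aes(x = price)) +
    geom_histogram(binwidth = 1000) +
    facet_wrap(~ cut, nrow = 1)
p
```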

### Graphical Excellence

Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency.

Graphical Displays should show the data and...

- induce the viewer to think about the substance
- avoid distorting what the data says
- present many numbers in small space
- make large data sets coherent
- encourage the eye to compare different pieces of data
- reveal the data at several levels of detail, from a broad overview to the fine structure
- serve a reasonably clear purpose: description, exploration, tabulation, or decoration

__Example for graphical excellence__

Goal-Contribution Matrix for the Premier League! Special focus on Eden Hazard, Jamie Vardy, Glen Murray, and Ryan Fraser ([Link](https://twitter.com/r_by_ryo/status/1129773418184925184?s=12)/[Code](https://gist.github.com/Ryo-N7/67ca1c364c342a82c4098918082ca445)).

<img src="https://pbs.twimg.com/media/D63BjrRXoAAdbZi.png:large" style="width:60%">

## Exam Questions

### Exam AIS WS 2018/19
__Question 2: Descriptive Analytics__

(b) (3 Points) __Color palettes__: In the lecture we discussed that color palettes should reflect the underlying data types. Recommend a suitable palette choice for the following data sets:

i. (1 point) Daily stock performance measured as percent change

>

ii. (1 point) Annual sales data of a single company

>

iii. (1 point) Monthly earnings data of a single company

>

iv. (1 point) Distinguishing different companies in stock price charts

>

v. (1 point) Points in AIS exam achieved by students ranging from 0 to 60

>

vi. (1 point) Coloring countries on a map based on the population's favorite sport

> 

### Exam SS 2018
__Question 2: Descriptive Analytics__

(a) (4 points) __Plots and Colors__ In the lecture, you learned about different types of visualizations. Additionally, we talked about color palettes. Your task is to visualize the following data sets. Which plots and color palettes do you recommend? Sketch the plots and highlight for each data set what you are trying to achieve with your visualization.

\begin{array}{|l|l|}
    \hline
    \text{Fund ID} & \text{Performance} \\
    \hline
    1         & +7.03\%         \\
    2         & +3.14\%         \\
    3         & -1.12\%         \\
    4         & -5.87\%         \\
    \dots     & \dots           \\
    \hline
\end{array}
<center>
(a) Asset Management
</center>

Solution: 
>

Plotting the solution using ggplot2 (___not___ in exam) - Option 1

In [None]:
asset_performance <- tribble(
  ~FundID, ~Performance,
  "1",     7.03,
  "2",     3.14,
  "3",     -1.12,
  "4",     -5.87)  # value as given in the exam table

Plotting the solution using ggplot2 (___not___ in exam) - Option 2
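
A minimal sketch of one possible plot for the fund data: a column chart for comparison, filled with a diverging colour scale centred at zero so that gains and losses are separated visually. The colours and the use of `scale_fill_gradient2()` are assumptions; a diverging ColorBrewer palette would be an alternative.

```r
library(tibble)
library(ggplot2)

# Values as listed in the exam table
asset_performance <- tribble(
  ~FundID, ~Performance,
  "1",     7.03,
  "2",     3.14,
  "3",     -1.12,
  "4",     -5.87)

# Diverging scale: red for losses, green for gains, white at 0
p <- ggplot(asset_performance, aes(x = FundID, y = Performance, fill = Performance)) +
    geom_col() +
    scale_fill_gradient2(low = "red", mid = "white", high = "darkgreen", midpoint = 0)
p
```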

[...] Your task is to visualize the following data sets. Which plots and color palettes do you recommend? Sketch the plots and highlight for each data set what you are trying to achieve with your visualization.

\begin{array}{|l|l|l|}
    \hline
    \text{Employee ID} & \text{Gender} & \text{Income} \\
    \hline
    1       & \text{male}      & 80{,}000 \\
    2       & \text{female}    & 70{,}000 \\
    3       & \text{male}      & 35{,}000 \\
    4       & \text{female}    & 37{,}000 \\
    \dots   & \dots            & \dots \\
    \hline
\end{array}

<center>
(b) Gender Pay Gap
</center>

Solution 1: 
>

Generating sample data (___not___ in exam)

In [None]:
n <- 5000  # number of simulated employees
tibble(EmployeeID = 1:n,
       Gender = sample(c("male","female"), size = n, replace = TRUE)) %>%
       mutate(Income = if_else(Gender=="female", 
                            rnorm(n = n, mean = 40000, sd = 5000),
                            rnorm(n = n, mean = 45000, sd = 10000))) -> gender
gender %>% head()

Plotting the solution using ggplot2 (___not___ in exam)

In [None]:
RColorBrewer::display.brewer.all(type='all')
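
A minimal sketch of one possible plot for the pay-gap data: boxplots of income by gender with a qualitative palette. The sample data is simulated again so that the example runs on its own; the seed, the palette `"Set1"` and the choice of a boxplot are assumptions.

```r
library(tibble)
library(dplyr)
library(ggplot2)

set.seed(1)  # arbitrary seed, only for reproducibility
n <- 5000    # number of simulated employees
gender <- tibble(EmployeeID = 1:n,
                 Gender = sample(c("male", "female"), size = n, replace = TRUE)) %>%
    mutate(Income = if_else(Gender == "female",
                            rnorm(n = n, mean = 40000, sd = 5000),
                            rnorm(n = n, mean = 45000, sd = 10000)))

# Distribution of income per gender; the qualitative palette separates the groups
p <- ggplot(gender, aes(x = Gender, y = Income, fill = Gender)) +
    geom_boxplot() +
    scale_fill_brewer(palette = "Set1")
p
```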

Solution 2: 

> - Plot: for comparison, e.g. bar chart (x=Income Bins, y=Income)
> - Color palette: qualitative (Gender)

Plotting the solution using ggplot2 (___not___ in exam) - Option 1

Plotting the solution using ggplot2 (___not___ in exam) - Option 2