*Analytical Information Systems*

# Worksheet 3 - Descriptive Statistics

Matthias Griebel<br>
Lehrstuhl für Wirtschaftsinformatik und Informationsmanagement

SS 2020

## Exercises

First, we need to load the `tidyverse` package.

In [None]:
library(tidyverse)

To analyze skewness and kurtosis, we also need to install the `psych` package.

In [None]:
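# Install 'psych' only if it is not already available
if (!requireNamespace('psych', quietly = TRUE)) install.packages('psych')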

### 1 The diamonds dataset

The diamonds dataset comes with the package `ggplot2` and contains information about ~54,000 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond.

Let's have a look at the data. What are the scales of the variables?

In [None]:
diamonds %>% head()

In [None]:
?diamonds
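
One way to approach the question about scales is to inspect how R stores each column. The sketch below uses only base R and the data from `ggplot2`. The mapping from R classes to measurement scales in the comments is an interpretation, not something the data set declares itself.

```R
library(ggplot2)  # provides the diamonds data

# First class of each column: "ordered" factors suggest an ordinal scale,
# "numeric" and "integer" columns suggest a metric (ratio) scale
col_classes <- sapply(diamonds, function(col) class(col)[1])
print(col_classes)
```

Under this reading, `cut`, `color` and `clarity` are ordinal, while `price`, `carat` and the dimension columns are metric.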

1. 
    1. __On the diamond data set, calculate statistical measures to describe the central tendency, variability and the shape of the `price`:__

In [None]:
diamonds %>%
    summarise(Mean = mean(price),
              Sd = sd(price),
              Skew = psych::skew(price),
              Kurt = psych::kurtosi(price))
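
To make the shape measures less of a black box, the following sketch computes moment-based skewness and excess kurtosis by hand. The helper names `moment_skew` and `moment_kurt` are hypothetical. Packages differ in their small-sample corrections, so the results may deviate slightly from `psych::skew()` and `psych::kurtosi()`.

```R
library(ggplot2)  # only needed for the diamonds data

# Skewness: third central moment divided by the second central moment to the power 1.5
moment_skew <- function(x) {
    d <- x - mean(x)
    mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment divided by the squared second moment, minus 3
moment_kurt <- function(x) {
    d <- x - mean(x)
    mean(d^4) / mean(d^2)^2 - 3
}

price_skew <- moment_skew(diamonds$price)
price_kurt <- moment_kurt(diamonds$price)
c(Skew = price_skew, Kurt = price_kurt)
```

A positive skewness indicates a longer right tail, and a positive excess kurtosis indicates heavier tails than a normal distribution.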

Visualization often facilitates understanding of the data and its distribution. 

*(We will learn more about data visualization in the next section.)*

In [None]:
options(repr.plot.width=7, repr.plot.height=5)
diamonds %>% ggplot() + 
    geom_density(aes(x=scale(price))) +
    stat_function(n = 100, fun = dnorm, linetype='dotted') +
    theme_minimal() +
    xlim(-5,5)
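
The plot compares the standardized prices with the standard normal density. As a small check of what `scale()` does with its default arguments, the sketch below reproduces the z-scores manually. The variable names are chosen for illustration only.

```R
library(ggplot2)  # only needed for the diamonds data

price <- diamonds$price

# scale() centers on the mean and divides by the standard deviation (z-scores)
z_builtin <- as.numeric(scale(price))
z_manual  <- (price - mean(price)) / sd(price)

all.equal(z_builtin, z_manual)
```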

We calculated statistical measures to get a basic understanding of the diamond prices. Now we want to gain deeper insight.

1. 
    2. __Describe the central tendency, variability and the shape of the prices depending on the quality (`cut`) and `color`.__

In [None]:
diamonds %>%
    group_by(cut, color) %>%
    # Name the summary columns so the output is easier to read
    summarise(Mean = mean(price),
              Sd = sd(price),
              Skew = psych::skew(price))

1. 
    3. __How many diamonds belong to each of the groups (`cut` and `color` combinations)? What is the cheapest and the most expensive price within each group?__

In [None]:
diamonds %>%
    group_by(cut, color) %>%
    # n() counts the rows in each group
    summarise(n = n(),
              max_price = max(price),
              min_price = min(price))
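
If only the group sizes are needed, `count()` from `dplyr` is a possible shortcut for `group_by()` followed by `summarise(n())`. The column name `n_diamonds` below is an arbitrary choice for this sketch.

```R
library(dplyr)
library(ggplot2)  # only needed for the diamonds data

# One row per cut/color combination with the number of diamonds in it
group_sizes <- diamonds %>%
    count(cut, color, name = "n_diamonds")

head(group_sizes)
```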

Let's take a closer look at the other variables in the data set.

1. 
    4. __What is the average volume of the diamonds, depending on the quality (`cut`)?__

In [None]:
diamonds %>% 
    mutate(Volume = x*y*z) %>%
    group_by(cut) %>%
    summarise(mean(Volume))
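
Note that `x*y*z` is only a bounding-box approximation of the volume. The `ggplot2` documentation also lists 0 as the minimum of `x`, `y` and `z`, which suggests that some dimensions were not recorded. The sketch below is one possible way to handle this, assuming that zero values should be treated as missing. The column names `MeanVolume` and `MedianVolume` are chosen for illustration.

```R
library(dplyr)
library(ggplot2)  # only needed for the diamonds data

volume_by_cut <- diamonds %>%
    filter(x > 0, y > 0, z > 0) %>%      # assumption: a zero dimension means "not recorded"
    mutate(Volume = x * y * z) %>%
    group_by(cut) %>%
    summarise(MeanVolume = mean(Volume),
              MedianVolume = median(Volume))

volume_by_cut
```

Reporting the median next to the mean gives a rough indication of how strongly extreme values influence the average.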

1. 
    5. __What are the average values of all (numeric) columns?__

In [None]:
diamonds %>%
    summarise_if(is.numeric, mean)
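
In `dplyr` 1.0.0 and later, scoped verbs such as `summarise_if()` are superseded by `across()`. The following sketch, which assumes such a `dplyr` version is installed, shows that both forms give the same means.

```R
library(dplyr)
library(ggplot2)  # only needed for the diamonds data

# Scoped verb as used in the worksheet
means_if <- diamonds %>%
    summarise_if(is.numeric, mean)

# Equivalent formulation with across() and where()
means_across <- diamonds %>%
    summarise(across(where(is.numeric), mean))

all.equal(as.data.frame(means_if), as.data.frame(means_across))
```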

### 2 Exam Questions

WS 2018/19 Data Engineering & Integration

2. __Consider the following diamonds data set:__

In [None]:
# This code was not included in the exam, values may differ
library(tidyverse)
set.seed(5)
ggplot2::diamonds %>% 
    sample_n(10) %>% 
    arrange(cut) -> diamonds

In [None]:
diamonds

__i. (1 point) You are executing the code below. How many rows does the resulting data frame contain? Briefly explain your answer.__

```R
diamonds  %>%
    group_by(cut) %>%
    summarize(median(depth))
```

__Solution__:

In [None]:
diamonds  %>%
    group_by(cut) %>%
    summarize(median(depth))
# 4 rows: summarize() returns one row per distinct value of 'cut' in the data
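
The reasoning behind this answer can be checked on a small example. The data frame `toy` below is hypothetical and not the exam sample. It only illustrates that `summarize()` after `group_by()` yields one row per group.

```R
library(dplyr)

# Hypothetical toy data with three distinct cuts
toy <- tibble(
    cut   = c("Good", "Good", "Ideal", "Premium", "Premium", "Premium"),
    depth = c(63.1, 61.0, 61.8, 60.2, 62.4, 59.9)
)

toy_summary <- toy %>%
    group_by(cut) %>%
    summarise(median_depth = median(depth))

# The number of result rows equals the number of distinct groups
nrow(toy_summary) == n_distinct(toy$cut)
```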

__ii. (2 points) You are executing the code below. What are the column names of the resulting data frame?__

```R
diamonds %>%
    group_by(clarity , color) %>%
    filter(price > 1000) %>%
    mutate(volume = x * y * z) %>%
    summarise(x = mean(carat),y = mean(price)) %>%
    mutate(z = x * y)
```

__Solution__:

In [None]:
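diamonds %>%
    group_by(clarity , color) %>%
    filter(price > 1000) %>%
    mutate(volume = x * y * z) %>%
    summarise(x = mean(carat),y = mean(price)) %>%
    mutate(z = x * y) %>%
    colnames()
# 'clarity' 'color' 'x' 'y' 'z'
# filter() and the first mutate() do not affect the final column names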
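
Why these five names? `summarise()` keeps the grouping columns and the newly defined summaries and drops every other column, including any column created by an earlier `mutate()`. The sketch below traces this on a hypothetical toy data frame that merely reuses the column names of `diamonds`.

```R
library(dplyr)

# Hypothetical toy data, not the exam sample
toy <- tibble(
    clarity = c("SI1", "SI1", "VS2"),
    color   = c("E", "E", "G"),
    carat   = c(0.5, 0.7, 1.0),
    price   = c(1200, 1800, 5000),
    x = c(5.1, 5.7, 6.4),
    y = c(5.1, 5.7, 6.4),
    z = c(3.1, 3.5, 4.0)
)

result <- toy %>%
    group_by(clarity, color) %>%
    mutate(volume = x * y * z) %>%                     # volume exists after this step
    summarise(x = mean(carat), y = mean(price)) %>%    # only groups, x and y remain
    mutate(z = x * y)                                  # z is rebuilt from the new x and y

colnames(result)
```

The final `z` is the product of the mean carat and the mean price per group and has nothing to do with the original `z` dimension.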

__iii. (2 points) Explain in pseudo code (e.g., dplyr pipelines) how to obtain the following transformed table from the given data set.__

<table style="font-size: 100%;">
<thead>
	<tr><th scope=col>color</th><th scope=col>max_price</th><th scope=col>min_price</th></tr>
</thead>
<tbody>
	<tr><td>F</td><td>1630</td><td> 786</td></tr>
	<tr><td>G</td><td>2593</td><td>2593</td></tr>
	<tr><td>H</td><td>7604</td><td>1723</td></tr>
	<tr><td>I</td><td>4195</td><td>1840</td></tr>
	<tr><td>J</td><td>5463</td><td>5463</td></tr>
</tbody>
</table>

__Solution__:

In [None]:
diamonds %>%
    group_by(color) %>%
    summarise(max_price = max(price),
              min_price = min(price))
# optional: append %>% arrange(color)