# Statistics and Data Science

## Scales

Data can have different *scales*. It is important to distinguish clearly between these scales:

1.  metric
2.  ordinal
3.  nominal

Metric data can be measured, e.g. distance in meters \[m\]. Ordinal data can be ordered, but the distance between different values cannot be determined. Nominal data can be distinguished but not ordered.

## Assign scales to variables

| Variable        | Values                |
|-----------------|-----------------------|
| flow velocity   | 1.0, 2.3, 3.9         |
| water quality   | very good, good, poor |
| macrozoobenthos | yes, no               |

-   Types can sometimes be converted
-   Dummy variables can replace nominal variables
-   Example for male / female: woman = 1, man = 0
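The dummy coding mentioned above can be sketched in a few lines of Python. The observations and the rank mapping for water quality are hypothetical and only illustrate the idea; they are not taken from a real data set.

```python
# Hypothetical nominal observations (not from the source).
sex = ["female", "male", "female", "female"]

# Dummy coding as in the list above: woman = 1, man = 0.
dummy_codes = {"female": 1, "male": 0}
sex_dummy = [dummy_codes[value] for value in sex]

# An ordinal variable can be converted to ranks. The ranks keep the order,
# but the differences between ranks carry no meaning.
quality_ranks = {"poor": 1, "good": 2, "very good": 3}
water_quality = ["very good", "good", "poor"]
quality_rank = [quality_ranks[value] for value in water_quality]

print(sex_dummy)     # [1, 0, 1, 1]
print(quality_rank)  # [3, 2, 1]
```

The conversion only works in one direction without loss: a metric value can be reduced to a rank or a class, but a rank cannot be turned back into a measurement.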

## Data

-   Data are often contained in tables
-   The first row of a table is the header with the variable names
-   We should also indicate the scale and data type
-   The subsequent rows are the records (one per observation)
-   We should provide a legend for each variable
-   Data can be stored in Excel tables (often not advisable), relational databases (SQL), or data structures (R)
-   Publication of data sets is becoming more and more important and structured
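As a rough sketch of these points, the following Python snippet reads a small table with a header row and keeps a legend with the scale and data type of each variable. The column names, the unit and the legend layout are assumptions made for illustration, not a fixed standard.

```python
import csv
import io

# Hypothetical table: the first row is the header with the variable names.
raw = """flow_velocity,water_quality,macrozoobenthos
1.0,very good,yes
2.3,good,no
3.9,poor,yes
"""

# Hypothetical legend: scale, data type and unit of each variable.
# The unit m/s is an assumption, the source table does not state one.
legend = {
    "flow_velocity": {"scale": "metric", "type": float, "unit": "m/s"},
    "water_quality": {"scale": "ordinal", "type": str, "unit": None},
    "macrozoobenthos": {"scale": "nominal", "type": str, "unit": None},
}

# Every subsequent row is one record; convert each value to its declared type.
records = []
for row in csv.DictReader(io.StringIO(raw)):
    records.append({name: legend[name]["type"](value) for name, value in row.items()})

for name, info in legend.items():
    print(name, info["scale"], info["type"].__name__)
print(records[0])
```

Keeping the legend next to the data makes the scale of each variable explicit, which a plain spreadsheet does not enforce.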

## Software for working with data

Programs are more or less suitable for working with data. They should be explicit about the data definition, scalable to big data, and reproducible:

| Program | explicit | scalable | reproducible |
|---------|----------|----------|--------------|
| Excel   | +!       | \-       | (-)          |
| SQL     | (+)      | +!       | (+)          |
| R       | \-       | (+)      | +!           |

Excel is only suitable for small data sets (\<100000 records), SQL is scalable and can be used for $> 10^6$ records, and R is reproducible and specifies data types.


## Gauss Rule for Derived Mean

The mean of a derived variable $z = f(x; y)$ is, to first order,

$$\bar{z} = f(\bar{x};\bar{y})$$

We can simply plug in the mean values to derive the mean value of the derived variable.

## Gauss Rule for Error Propagation

The standard deviation, however, cannot simply be plugged in. For a variable $Z$ derived from two uncorrelated variables $X$ and $Y$ with standard deviations $s_x$ and $s_y$, it propagates as:

$$s_z = \sqrt{ \left( f_x( \bar{x};\bar{y}) \cdot s_x \right)^2 + \left( f_y( \bar{x};\bar{y}) \cdot s_y \right)^2}$$

with $f_x( \bar{x};\bar{y})$ being the first partial derivative of the function $f(x;y)$ with respect to $x$ at the location $x=\bar{x};y=\bar{y}$, and $f_y( \bar{x};\bar{y})$ defined analogously for $y$.
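A minimal sketch of both Gauss rules, assuming the hypothetical function $f(x;y) = x \cdot y$ and made-up means and standard deviations. The partial derivatives are approximated by central differences, which is one possible choice and not prescribed by the rule itself.

```python
import math

def propagate(f, x_mean, y_mean, s_x, s_y, h=1e-6):
    """Return (z_mean, s_z) for z = f(x, y) using the Gauss rules."""
    # Partial derivatives at (x_mean, y_mean), approximated numerically.
    f_x = (f(x_mean + h, y_mean) - f(x_mean - h, y_mean)) / (2 * h)
    f_y = (f(x_mean, y_mean + h) - f(x_mean, y_mean - h)) / (2 * h)
    z_mean = f(x_mean, y_mean)  # plug in the mean values
    s_z = math.sqrt((f_x * s_x) ** 2 + (f_y * s_y) ** 2)
    return z_mean, s_z

def product(x, y):
    # Hypothetical derived variable, e.g. an area from length and width.
    return x * y

z_mean, s_z = propagate(product, x_mean=2.0, y_mean=3.0, s_x=0.1, s_y=0.2)
print(f"{z_mean:.2f} ± {s_z:.2f}")  # 6.00 ± 0.50
```

For the product, $f_x = \bar{y} = 3$ and $f_y = \bar{x} = 2$, so $s_z = \sqrt{(3 \cdot 0.1)^2 + (2 \cdot 0.2)^2} = 0.5$, which the numerical result reproduces.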

## Descriptive Statistics

Descriptive statistics describe a set of measurements that is considered the totality of all observations.

-   Complete census of all citizens
-   Complete inventory of all cities with more than 1 million inhabitants
-   Complete list of all nuclear reactors in Germany
-   List of all federal states in Germany
-   List of all hydro-power stations in Germany

## Descriptive Statistics Example

Definition of the arithmetic mean, e.g. of river lengths:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n}{x_i}$$

with $x_i$ the values of observations and $n$ the number of observations [see @Kreyszig2011].

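The weighted arithmetic mean of $k$ subsample means $\bar{x}_j$, each based on $n_j$ observations with $n = \sum_{j=1}^{k} n_j$, is

$$\bar{x} \approx \frac{\sum_{j=1}^{k}\bar{x}_j \cdot n_j}{n}$$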
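The two means can be sketched as follows. The river lengths and their split into two subsamples are hypothetical numbers chosen so that the result is easy to check by hand.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by their number."""
    return sum(values) / len(values)

def weighted_mean(means, counts):
    """Mean of subsample means, each weighted by its number of observations."""
    return sum(m * n for m, n in zip(means, counts)) / sum(counts)

# Hypothetical river lengths in km (not real data).
lengths = [100.0, 200.0, 300.0, 400.0, 500.0]

# The same observations, grouped into two subsamples.
group_a = lengths[:2]   # 2 observations
group_b = lengths[2:]   # 3 observations
means = [arithmetic_mean(group_a), arithmetic_mean(group_b)]
counts = [len(group_a), len(group_b)]

print(arithmetic_mean(lengths))      # 300.0
print(weighted_mean(means, counts))  # 300.0
```

Without the weights, the plain average of the two subsample means would be 275.0, because the smaller group would count as much as the larger one.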

## Summary Descriptives

The summary descriptives already reveal a number of interesting elements:

-   First and third quartiles: 25 % and 75 % quantiles
-   median
-   arithmetic mean (as defined)

Missing are:

-   mode
-   other quantiles (e.g. 5 %, 10 % or 95 %)
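As a sketch, all of these values, including the missing ones, can be computed with Python's standard `statistics` module. The data are hypothetical, and the choice of `method="inclusive"` is an assumption; other quantile definitions give slightly different values.

```python
import statistics

# Hypothetical metric observations (not from the source).
data = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 9.0]

# Quartiles: three cut points that split the sorted data into four parts.
# method="inclusive" treats the data as including the extreme values.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

median = statistics.median(data)
mean = statistics.mean(data)
mode = statistics.mode(data)  # most frequent value

# With n=20 there are 19 cut points in steps of 5 %:
# the first is the 5 % quantile, the last the 95 % quantile.
cuts = statistics.quantiles(data, n=20, method="inclusive")
p05, p95 = cuts[0], cuts[-1]

print("quartiles:", q1, q2, q3)
print("median:", median, "mean:", mean, "mode:", mode)
print("5 % and 95 % quantiles:", p05, p95)
```

The second quartile and the median are the same value by definition, which is a quick consistency check for any quantile routine.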

## About the other central values

**Geometric mean**: used whenever growth is involved, i.e. for power functions and multiplicative factors that are to be averaged, e.g. growth rates or r-factors (coronavirus).

$$\bar{x}_{geom} = \sqrt[n]{\prod_{i=1}^{n}{x_i}}$$

**Harmonic mean**: used whenever we try to find the average value of travel times (velocities) or hydraulic conductivities (groundwater hydrology) [see @Helsel_etal2020]:

$$\bar{x}_{harm} = \frac{n}{\sum_{i=1}^{n}\frac{1}{x_i}}$$
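A short sketch with Python's `statistics` module shows why the choice of mean matters. The growth factors and velocities are hypothetical values.

```python
import statistics

# Hypothetical growth factors: doubling in one period, halving in the next.
growth_factors = [2.0, 0.5]
geometric = statistics.geometric_mean(growth_factors)
arithmetic = statistics.mean(growth_factors)

# Hypothetical velocities in km/h over two stretches of equal length.
velocities = [40.0, 60.0]
harmonic = statistics.harmonic_mean(velocities)

print(round(geometric, 3), arithmetic)  # geometric vs. arithmetic mean of the factors
print(harmonic)                         # 48.0
```

Doubling and then halving leaves the quantity unchanged, so the average growth factor should be 1, which is what the geometric mean gives, while the arithmetic mean of 1.25 would suggest growth. Likewise, travelling equal distances at 40 and 60 km/h gives an average speed of 48 km/h, not 50 km/h.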

## Scales and central values

| Scale   | Central value |
|---------|---------------|
| metric  | arithmetic    |
|         | geometric     |
|         | harmonic      |
|         | median        |
|         | mode          |
| ordinal | median        |
|         | mode          |
| nominal | mode          |

## Standard deviation

To characterise the spread of the observations, we can use the range

$$ S = x_{max}-x_{min}$$

If we have metric values and have already calculated the arithmetic mean, we can calculate the standard deviation, a measure of the typical deviation from that mean:

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i-\bar{x})^2}{(n-1)}}$$

Comment: For samples we divide by $(n-1)$, for the totality of observations we divide by $(n)$. For the totality, we would also use the symbol $\sigma$ instead.
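The difference between the two divisors can be seen in a small sketch with Python's `statistics` module, where `stdev` divides by $(n-1)$ and `pstdev` divides by $n$. The observations are hypothetical.

```python
import statistics

# Hypothetical metric observations (not from the source).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

value_range = max(data) - min(data)   # S = x_max - x_min
s_sample = statistics.stdev(data)     # divides by (n - 1)
sigma_total = statistics.pstdev(data) # divides by n

print(value_range)         # 7.0
print(round(s_sample, 3))  # 2.138
print(sigma_total)         # 2.0
```

The sample value is always the larger of the two, and the difference shrinks as $n$ grows.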

## The weighted standard deviation

We can also calculate a weighted standard deviation if we have a series of $k$ averages $\bar{x}_j$ from subsamples, each with $n_j$ observations, and $n = \sum_{j=1}^{k} n_j$:

$$ s \approx \sqrt{\frac{\sum_{j=1}^{k} (\bar{x}_j-\bar{x})^2\cdot n_j}{(n-1)}}$$

In some cases it is necessary and helpful not to take the square root; the result is the variance.
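A minimal sketch of the weighted standard deviation, assuming two hypothetical subsamples with means 150 and 400 and with 2 and 3 observations. The function name `weighted_std` is made up for this example.

```python
import math

def weighted_std(means, counts):
    """Approximate standard deviation from subsample means and sizes."""
    n = sum(counts)
    grand_mean = sum(m * c for m, c in zip(means, counts)) / n
    # Squared deviation of each subsample mean, weighted by its size.
    weighted_ss = sum((m - grand_mean) ** 2 * c for m, c in zip(means, counts))
    return math.sqrt(weighted_ss / (n - 1))

subsample_means = [150.0, 400.0]  # hypothetical subsample averages
subsample_sizes = [2, 3]          # hypothetical numbers of observations

s_weighted = weighted_std(subsample_means, subsample_sizes)
print(round(s_weighted, 1))  # 136.9
```

The result is only an approximation because the spread inside each subsample is no longer visible in the averages, so the value tends to underestimate the standard deviation of the individual observations.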

## Variance

For correlations we need to introduce the variance, which is just the squared standard deviation

$$ var = s^2 = \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2}{(n-1)}$$

Comment: For samples we divide by $(n-1)$, for the totality of observations we divide by $(n)$.
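The variance formula can be written out by hand and compared with the library routine. This is a sketch with hypothetical data; the `sample` switch is an illustrative parameter for choosing between the two divisors.

```python
import statistics

def variance(values, sample=True):
    """Variance with divisor (n - 1) for samples, n for the totality."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return sum_of_squares / (n - 1) if sample else sum_of_squares / n

# Hypothetical metric observations (not from the source).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(round(variance(data), 4))              # 4.5714
print(variance(data, sample=False))          # 4.0
print(round(statistics.variance(data), 4))   # 4.5714
```

Taking the square root of either value gives the corresponding standard deviation.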

## The coefficient of variation

The coefficient of variation is defined as the standard deviation divided by the arithmetic mean

$$ c_v = \frac{s}{\bar{x}}$$

It *normalizes* the standard deviation by the arithmetic mean and expresses the standard deviation as a fraction of the mean value.
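A brief sketch of why the normalization is useful: two hypothetical series that differ only by a factor of 100 have very different standard deviations but the same coefficient of variation.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the arithmetic mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical series, the second is the first multiplied by 100.
series_small = [10.0, 12.0, 8.0, 10.0]
series_large = [1000.0, 1200.0, 800.0, 1000.0]

cv_small = coefficient_of_variation(series_small)
cv_large = coefficient_of_variation(series_large)

print(round(cv_small, 4))  # 0.1633
print(round(cv_large, 4))  # 0.1633
```

The coefficient is only meaningful for metric data with a positive mean; for a mean close to zero it becomes unstable.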
