In [1]:
import pandas as pd
import plotly.express as px

### Load data

In [2]:
bodyfat = pd.read_csv("dataset/Bodyfat.csv")

### Visualization of relevant quantities

In [3]:
fig = px.box(bodyfat)
fig.show()

### a)

In [4]:
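# Descriptive statistics, computed per column
mean = bodyfat.mean()
median = bodyfat.median()
q1 = bodyfat.quantile(0.25)  # first quartile
q3 = bodyfat.quantile(0.75)  # third quartile
mode = bodyfat.mode()  # can contain several rows if a column has ties
value_range = bodyfat.max() - bodyfat.min()  # renamed so the built-in range() is not shadowed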

Mean: The arithmetic average of a variable's values. Note that the default box plot above marks the median, not the mean.

Median: The value for which 50% of a variable's values are smaller and 50% are greater than the value (visualized in the box plot above).

P-quantile: The value for which 100p% of a variable's values are smaller and 100(1-p)% are greater than the value (0.25 and 0.75-quantile visualized in the box plot above).

Mode: The value of a variable that appears most often.

Range: The distance between the maximum value and the minimum value (visualized in the box plot above, if there are no outliers).
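
To make the definitions above concrete, here is a minimal sketch on a small made-up sample (the numbers are hypothetical and not taken from `Bodyfat.csv`). It uses Python's standard `statistics` module instead of pandas so that every value can be checked by hand. The `"inclusive"` method interpolates linearly between neighbouring sorted values, which should correspond to the default behaviour of `DataFrame.quantile`.

```python
import statistics

# Hypothetical toy sample, not taken from Bodyfat.csv
sample = [2, 4, 4, 5, 7, 9, 11]

sample_mean = statistics.mean(sample)      # 42 / 7 = 6
sample_median = statistics.median(sample)  # middle value of the sorted data: 5
# Quartiles with linear interpolation between neighbouring sorted values
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
sample_mode = statistics.mode(sample)      # 4 appears twice, every other value once
sample_range = max(sample) - min(sample)   # 11 - 2 = 9

print(sample_mean, sample_median, q1, q3, sample_mode, sample_range)
```

The 0.5-quantile `q2` coincides with the median, as the definitions require.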

### b)

Variance: Measure for the dispersion of a variable defined as the average squared distance of the data to the mean.

Interquartile range (IQR): The distance between the 0.75-quantile and the 0.25-quantile (visualized in the box plot above).
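
As a hedged illustration of both dispersion measures, the sketch below uses a hypothetical toy sample and the standard `statistics` module. It also shows a detail that is easy to miss: the definition above (average squared distance) divides by n, while the sample variance divides by n - 1, which is what pandas' `var()` does by default (`ddof=1`).

```python
import statistics

sample = [2, 4, 4, 5, 7, 9, 11]  # hypothetical toy sample, mean is 6

# Population variance: average squared distance to the mean (divide by n)
population_variance = statistics.pvariance(sample)  # 60 / 7
# Sample variance: divide by n - 1 instead
sample_variance = statistics.variance(sample)       # 60 / 6 = 10

# Interquartile range: distance between the 0.75- and the 0.25-quantile
q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1  # 8 - 4 = 4

print(population_variance, sample_variance, iqr)
```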

### c)

An outlier is a data point that differs markedly from the other data points.

There is no universally accepted definition of when a point counts as an outlier.

In the box plot above, data points that lie more than 1.5 IQR below the first quartile or more than 1.5 IQR above the third quartile are marked as outliers.

Outliers can originate, for example, from human or technical measurement errors, or simply from natural processes such as mutations.
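
The 1.5 IQR rule can be sketched in a few lines. The helper `iqr_outliers` and the data are hypothetical, and the quartiles are computed with the standard `statistics` module, so the fences may differ slightly from the ones plotly draws if it uses another quartile method.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical data; the value 40 is planted as an outlier
sample = [2, 4, 4, 5, 7, 9, 11, 40]
flagged = iqr_outliers(sample)
print(flagged)
```

Here the quartiles are 4 and 9.5, so the fences lie at -4.25 and 17.75 and only the planted value is flagged.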

### d)

Mean: Not robust against outliers, because all values are summed and then divided by the number of values, so a single extreme value can shift the result arbitrarily far.

Median: Robust against outliers, because it depends only on the ordering of the values, not on how extreme the largest or smallest ones are.

P-quantile: Robust against outliers, analogous to the median (less so for p very close to 0 or 1).

Mode: Robust against outliers, because a few isolated extreme values do not change which value appears most often.

Range: Not robust against outliers, because an outlier is typically the maximum or minimum value.

Variance: Not robust against outliers, because it is calculated based on the mean and additionally squares the distances, so extreme values are weighted heavily.

Interquartile range: Robust against outliers, because it depends only on the middle 50% of the sorted values, which do not include the extreme values.
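
One way to check these statements empirically is to add a single extreme value to a sample and compare the statistics before and after. The sketch below does this for a hypothetical toy sample, with a planted outlier of 1000 and an illustrative helper `summarize`.

```python
import statistics

def summarize(values):
    """Compute the statistics discussed above for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),
        "iqr": q3 - q1,
    }

clean = [2, 4, 4, 5, 7, 9, 11]   # hypothetical toy sample
contaminated = clean + [1000]    # same sample plus one planted outlier

before = summarize(clean)
after = summarize(contaminated)
for name in before:
    print(f"{name:>8}: {before[name]:12.2f} -> {after[name]:12.2f}")
```

In this example the mean, range and variance change drastically, while the median and IQR move only slightly and the mode does not change at all.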

### e)

Outliers lead to a distorted view of the data when one uses statistics that are not robust.

They can also lead to biased estimates, especially when using a generative approach like Gaussian Discriminant Analysis.
With this technique, the class-conditional density of the data, which is assumed to be Gaussian, has to be estimated.
To do so, the mean and variance of the Gaussian have to be estimated. In the presence of outliers, the estimate of the class-conditional mean is biased (and the variance estimate is inflated).
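
The effect on a classifier can be illustrated with a deliberately simplified one-dimensional case. The sketch assumes two classes with equal priors and a shared variance, in which case the decision boundary is the midpoint between the two estimated class means. The data and the helpers `midpoint_boundary` and `predict` are hypothetical.

```python
import statistics

def midpoint_boundary(class_0, class_1):
    """Decision threshold of a 1-D Gaussian classifier with equal priors
    and a shared variance: the midpoint between the two class means."""
    return (statistics.mean(class_0) + statistics.mean(class_1)) / 2

def predict(x, boundary):
    # Assumes class 0 is the class with the smaller mean
    return 0 if x < boundary else 1

class_0 = [1, 2, 3]  # hypothetical training data, mean 2
class_1 = [7, 8, 9]  # hypothetical training data, mean 8

clean_boundary = midpoint_boundary(class_0, class_1)           # (2 + 8) / 2 = 5.0
# One outlier in class 1 pulls its estimated mean from 8 to 18.5
shifted_boundary = midpoint_boundary(class_0, class_1 + [50])  # (2 + 18.5) / 2 = 10.25

# Regular class-1 points that now fall on the wrong side of the boundary
misclassified = [x for x in class_1 if predict(x, shifted_boundary) != 1]
print(clean_boundary, shifted_boundary, misclassified)
```

A single outlier moves the boundary so far that all regular points of class 1 are assigned to class 0.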

### f)

In the one-dimensional case, outliers can be recognized very easily with a box plot.
As one can see above, the bodyfat data set contains many outliers.
For example, the weight variable contains extreme outliers.

In the multivariate case, the detection of outliers is more difficult.
Standard approaches are distance-based or density-based.
One could also use clustering algorithms for outlier detection.
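
As a minimal sketch of a distance-based approach, the code below flags points whose nearest neighbour is unusually far away. The data, the helper `nearest_neighbor_distances` and the threshold (three times the median nearest-neighbour distance) are assumptions chosen for illustration, not an established standard.

```python
import math
import statistics

def nearest_neighbor_distances(points):
    """Euclidean distance from each point to its closest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]

# Hypothetical 2-D data: five clustered points and one planted far away
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]

distances = nearest_neighbor_distances(points)
# Assumed rule of thumb: flag points whose nearest neighbour is more than
# three times as far away as the median nearest-neighbour distance
threshold = 3 * statistics.median(distances)
flagged = [p for p, d in zip(points, distances) if d > threshold]
print(flagged)
```

Note that neither coordinate of the planted point would have to be extreme on its own for this approach to work. It only looks at distances in the joint space, which is what makes it usable in the multivariate case.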

### g)

Outliers should be inspected in detail. They may draw attention to existing measurement errors.

In some cases, it may be possible to correct these errors manually. If this is not possible, the best way to cope with outliers is to leave them out of the analysis.

### h)

1. Features are not discriminative enough (leads to irreducible error = "Bayes error")
2. Chosen model class is too restrictive and does not contain a good approximation let alone the true hypothesis (leads to structure error = "bias")
3. Available data set is too small for chosen model class (leads to estimation error = "variance")