In [None]:
from mads_datasets import DatasetFactoryProvider, DatasetType
from pathlib import Path
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np

penguinsdataset = DatasetFactoryProvider.create_factory(
    DatasetType.PENGUINS
)
penguinsdataset.download_data()

df = pd.read_parquet(penguinsdataset.filepath)
select = [
    "Species",
    "Island",
    "Culmen Length (mm)",
    "Culmen Depth (mm)",
    "Flipper Length (mm)",
    "Delta 15 N (o/oo)",
    "Delta 13 C (o/oo)",
    "Sex",
    "Body Mass (g)",
]
subset = df[select].dropna()
subset["Species"] = subset["Species"].apply(lambda x: x.split()[0])

So, we have this dataset, and we have some understanding of different types of distributions. Just from looking at the feature names, it should be clear that we can split the features into continuous and discrete features.
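As a rough sketch of that split, the snippet below separates numeric from non-numeric columns with `select_dtypes`. The `toy` frame and its values are made up for illustration and are not taken from the penguins data. Note that a numeric dtype is only a hint: an integer count column would be numeric but still discrete.

```python
import pandas as pd

# hypothetical mini-frame that mimics a few of the penguin columns
toy = pd.DataFrame(
    {
        "Species": ["Adelie", "Gentoo", "Chinstrap"],
        "Sex": ["MALE", "FEMALE", "FEMALE"],
        "Body Mass (g)": [3750.0, 5000.0, 3500.0],
        "Flipper Length (mm)": [181.0, 217.0, 192.0],
    }
)

# numeric columns are candidates for continuous features
continuous = toy.select_dtypes(include="number").columns.tolist()
# everything else (labels, categories) is treated as discrete
discrete = toy.select_dtypes(exclude="number").columns.tolist()

print("continuous:", continuous)
print("discrete:", discrete)
```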

But how do we find out which distribution fits our data best? Are these all normal distributions? Some of them, probably, but maybe not. How do we find out?

## Testing for normality

## qq-plot

The easiest way to test for normality is to use a qq-plot. This is a very basic and visual test. It is not very precise, but it is a good first step.

What is a qq-plot? It stands for quantile-quantile plot. It is a plot of the quantiles of the data against the quantiles of a theoretical distribution. If the data follows that theoretical distribution (for example, a normal distribution), the points will fall on a straight line. If it does not, the points will bend away from the line, typically in the tails.
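To make the idea concrete, here is a minimal sketch that builds the two sets of quantiles by hand on synthetic data. The plotting positions `(i - 0.5) / n` are one common convention and an assumption here; `stats.probplot` uses a slightly different formula, so its numbers will differ a little.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# synthetic normal data, for illustration only
sample = rng.normal(loc=10, scale=3, size=500)

# observed quantiles: simply the sorted sample
observed = np.sort(sample)

# one probability per data point, strictly between 0 and 1
n = len(sample)
probs = (np.arange(1, n + 1) - 0.5) / n

# theoretical quantiles: where a standard normal puts those probabilities
theoretical = stats.norm.ppf(probs)

# for normal data the pairs lie close to a straight line;
# the slope estimates the scale and the intercept estimates the location
slope, intercept = np.polyfit(theoretical, observed, deg=1)
r = np.corrcoef(theoretical, observed)[0, 1]
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.4f}")
```

Plotting `theoretical` against `observed` as a scatter plot gives the qq-plot itself.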

First, as a reminder, let's plot the PDFs for a normal and a skew-normal distribution.

In [None]:
# prepare axes
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

# get the pdfs for the two distributions
x = np.linspace(-3, 3, 100)
y1 = stats.norm.pdf(x, loc=0, scale=1)
y2 = stats.skewnorm.pdf(x, a=5, loc=0, scale=1)  # a is the skewness parameter

# plot the two distributions
ax[0].plot(x, y1)
ax[1].plot(x, y2)
ax[0].set_title("Normal distribution")
ax[1].set_title("Skewed distribution")

Now, let's sample some data from a distribution and see how the qq plot works.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
n = 1000
rng = np.random.default_rng(42)


# let's sample n times from a skewed normal distribution with skewness parameter alpha
alpha = 10
data = stats.skewnorm.rvs(a=alpha, size=n, random_state=rng)

# first, we plot the data with a theoretical normal distribution
stats.probplot(data, plot=ax[0], dist=stats.norm);
ax[0].set_title("Normal Q-Q plot")

# then, we plot the data with a theoretical skewed normal distribution
stats.probplot(data, plot=ax[1], dist=stats.skewnorm, sparams=(alpha,));
ax[1].set_title("Skew-normal Q-Q plot")

# this also works the other way around! (eg generate normal data,
# and then the theoretical normal distribution will align,
# but the theoretical skewed normal distribution will not align)

As you can see, when we compare the data with a theoretical normal distribution, the points do not fall on a straight line. Theory and data do not match, which implies that the data is not normally distributed. When we compare the same data with a theoretical skew-normal distribution, however, the points do fall on a straight line, which is what we expect because that is the distribution we sampled from.
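One way to put a number on how straight a qq-plot is: when called without `plot=...`, `stats.probplot` returns the least-squares line through the points together with its correlation coefficient `r`. The sketch below regenerates skew-normal data like in the cell above and compares `r` for both reference distributions. Treat `r` as a rough indicator, not as a formal test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 10
data = stats.skewnorm.rvs(a=alpha, size=1000, random_state=rng)

# probplot returns (theoretical, ordered data), (slope, intercept, r)
_, (_, _, r_norm) = stats.probplot(data, dist=stats.norm)
_, (_, _, r_skew) = stats.probplot(data, dist=stats.skewnorm, sparams=(alpha,))

# the closer r is to 1, the straighter the qq-plot
print(f"normal reference:      r = {r_norm:.4f}")
print(f"skew-normal reference: r = {r_skew:.4f}")
```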

In [None]:
subset.head()

In [None]:
# let's test this for all continuous features from the penguins dataset

# select float columns, just from Adelie species
# question: what would happen if we used all species?
adelie = subset[subset["Species"] == "Adelie"]
floats = adelie.select_dtypes(include="float64")
features = floats.columns

# prepare axes
fig, axs = plt.subplots(2, 3, figsize=(12, 8))
axs = axs.ravel()

for i, col in enumerate(features):
    feats = floats[col]
    stats.probplot(feats, plot=axs[i], dist=stats.norm)
    axs[i].set_title(f"{col} Q-Q plot")

plt.tight_layout()


As you can see, simply visualising qq-plots already gives us a pretty good idea of which features probably do not follow a normal distribution. There can be many reasons for this: maybe there are several groups underlying the distribution (think of Simpson's paradox!), and if you split the group into subgroups a normal distribution will emerge. But there could be numerous other reasons. Typically, you will need domain knowledge to understand what is going on.

# Fitting distributions

If you have a reasonable idea what distribution your data follows, you can fit the parameters of the distribution to your data. This is called fitting a distribution.
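For the normal distribution specifically, the maximum likelihood estimates have a closed form: the sample mean and the standard deviation computed with `ddof=0`. The sketch below checks this on synthetic data using `stats.norm.fit`, the per-distribution fitting method in scipy. This is a sanity check under the assumption of normal data, not a general recipe for other distributions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# synthetic data with known parameters, for illustration only
sample = stats.norm.rvs(loc=5, scale=2, size=200, random_state=rng)

# maximum likelihood fit of loc and scale
loc_hat, scale_hat = stats.norm.fit(sample)

# closed-form MLE for a normal distribution
mean = sample.mean()
std = sample.std(ddof=0)  # ddof=0: divide by n, not n - 1

print(f"fit:         loc={loc_hat:.4f}, scale={scale_hat:.4f}")
print(f"closed form: loc={mean:.4f}, scale={std:.4f}")
```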

In [None]:
# first we generate some data. This way we can check if it works
kwargs = {"loc": 5, "scale": 2}
n = 100
dist = stats.norm
data = dist.rvs(size=n, random_state=rng, **kwargs)

# we can provide fit with a range of parameters to try to fit. This will speed up things,
# but note that there is a risk of setting the bounds too narrow and missing the true parameters!
# or you set the bounds too wide and the fit will take a long time
bounds = ((0, 10), (0, 4))
result = stats.fit(dist, data, bounds=bounds, method="mle")
result.params

As you can see, even with 100 samples we get a pretty good fit. Try to play with this: change `n` and see if it gets harder to fit the data. Also, change the parameters to see if the fit follows your changes.
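A minimal sketch of that experiment: fit the same normal distribution on samples of increasing size and look at how far the estimates are from the true parameters. To keep it fast, this uses `stats.norm.fit` instead of `stats.fit` with bounds, which is a simplification. The seed and the sample sizes are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_loc, true_scale = 5, 2

errors = {}
for n in (10, 100, 1_000, 10_000):
    sample = stats.norm.rvs(loc=true_loc, scale=true_scale, size=n, random_state=rng)
    loc_hat, scale_hat = stats.norm.fit(sample)
    # absolute error of each estimate with respect to the true value
    errors[n] = (abs(loc_hat - true_loc), abs(scale_hat - true_scale))
    print(f"n={n:>6}: loc error={errors[n][0]:.3f}, scale error={errors[n][1]:.3f}")
```

The errors tend to shrink as `n` grows, but for a single random sample this is not guaranteed to be monotonic.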

But can we know if a normal distribution is a good fit to begin with? E.g., for the penguins it seems to be the case that the Delta 13 C feature is not normally distributed, but how sure are we of the others?

We can use a [kstest](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) to find out.

In [None]:
# we use dist, which is the normal distribution from the previous cell, to create a cdf
cdf = dist(**kwargs).cdf

# now we can perform the Kolmogorov-Smirnov test on the data
stats.kstest(data, cdf)

We get a p-value. If the p-value is larger than 0.05, we cannot reject the null hypothesis that the data was drawn from the reference distribution. Note that this is not proof that the data follows that distribution; we simply did not find enough evidence against it.
For the people that like to think ahead of the lessons: yes, we could also use a Bayesian approach to test this hypothesis. But we will get to that in later lessons...
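To see what the test actually measures: the Kolmogorov-Smirnov statistic is the largest vertical distance between the empirical CDF of the data and the reference CDF. The sketch below computes that distance by hand on synthetic data and compares it with the statistic reported by `stats.kstest`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# synthetic data, for illustration only
sample = stats.norm.rvs(loc=5, scale=2, size=100, random_state=rng)
cdf = stats.norm(loc=5, scale=2).cdf

x = np.sort(sample)
n = len(x)
ref = cdf(x)  # reference CDF evaluated at every data point

# the empirical CDF jumps by 1/n at every data point,
# so we check the distance just after and just before each jump
d_plus = np.max(np.arange(1, n + 1) / n - ref)
d_minus = np.max(ref - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

result = stats.kstest(sample, cdf)
print(f"manual statistic: {d_manual:.6f}")
print(f"kstest statistic: {result.statistic:.6f}")
```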

Let's see what happens if we test the hypothesis that the data follows a uniform distribution.
Since the kstest compares the empirical cdf (cumulative distribution function) of the data with a reference cdf, we need to create a cdf for the uniform distribution and pass that as the second argument to the kstest. Note that `stats.uniform` defaults to the interval [0, 1].

In [None]:
cdf = stats.uniform.cdf
stats.kstest(data, cdf)

Ok, that is a very small p-value. So we can reject the null hypothesis that the data is a uniform distribution.

Let's automate this test for all the continuous features in the penguins dataset.

In [None]:
fig, axs = plt.subplots(2, 3, figsize=(12, 8))
axs = axs.ravel()
dist = stats.norm

for i, col in enumerate(features):
    data = floats[col]

    # let's derive reasonable bounds for the parameters from the data
    mu = data.mean()
    sigma = data.std()
    bounds = ((mu - 3 * sigma, mu + 3 * sigma), (0, sigma * 2))
    result = stats.fit(dist, data, bounds=bounds)

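    # and plot the fitted result, including the p-values
    # note: the parameters are fitted on the same data we test,
    # so these p-values are on the optimistic side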
    result.plot(ax=axs[i])
    kstest = stats.kstest(data, stats.norm(*result.params).cdf)
    axs[i].set_title(f"{col} fit (p={kstest.pvalue:.2f})")

plt.tight_layout()

We are not penguin experts, but can you come up with possible reasons why the body mass is a lousy fit? E.g., can you think of possible subgroups that would split the distribution into two groups?

It could very well be that the subgroups turn out to be normally distributed, and we are just looking at two overlapping groups. Another explanation could be that some other factor influences our sample, e.g. the penguins are fed a certain diet, or are being hunted etc.
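The overlapping-groups idea can be illustrated with synthetic data. The sketch below draws two hypothetical subgroups from normal distributions with different means (the numbers are arbitrary and not penguin measurements), pools them, and runs the kstest against a fitted normal for the pooled data and for each subgroup. Because the parameters are fitted on the same data, the p-values are optimistic; the comparison between pooled and subgroup is what matters here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# two hypothetical subgroups, each normally distributed
group_a = rng.normal(loc=0, scale=1, size=500)
group_b = rng.normal(loc=8, scale=1, size=500)
pooled = np.concatenate([group_a, group_b])


def ks_against_fitted_normal(values):
    """Fit a normal distribution to values and test the values against it."""
    loc, scale = stats.norm.fit(values)
    return stats.kstest(values, stats.norm(loc=loc, scale=scale).cdf)


ks_pooled = ks_against_fitted_normal(pooled)
ks_a = ks_against_fitted_normal(group_a)
ks_b = ks_against_fitted_normal(group_b)

print(f"pooled:  D={ks_pooled.statistic:.3f}, p={ks_pooled.pvalue:.3g}")
print(f"group a: D={ks_a.statistic:.3f}, p={ks_a.pvalue:.3g}")
print(f"group b: D={ks_b.statistic:.3f}, p={ks_b.pvalue:.3g}")
```

The pooled data is bimodal, so a single normal distribution fits it poorly, while each subgroup on its own is much closer to its fitted normal.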