# Visualizing distributions

In this notebook, we will discuss how to visualize the distribution of values effectively using seaborn.
Our running example will be the [penguins dataset](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data). 

Let's import the necessary libraries and load the dataset. 

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 100

import seaborn as sns
sns.set_style("whitegrid")

import pandas as pd
import random
import numpy as np

In [None]:
penguins = sns.load_dataset('penguins')
penguins.head()

In [None]:
penguins.info()

### Averages and more

Using the pandas `describe` method, we can get a first overview of the data. For each numerical feature we see the number of non-missing values, the range of values (minimum and maximum), the mean, the standard deviation, and the 25th, 50th and 75th percentiles (the quartiles).

In [None]:
penguins.describe()
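To make the columns of this table concrete, the following minimal sketch recomputes them by hand for a single column. The values are a small, made-up series (not taken from the penguins dataset), and the variable names `values`, `summary` and `manual` are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical flipper lengths in mm, including one missing value
values = pd.Series([181.0, 186.0, 195.0, np.nan, 193.0, 190.0],
                   name="flipper_length_mm")

summary = values.describe()

# Recompute the same quantities one by one; missing values are skipped
manual = pd.Series({
    "count": values.count(),
    "mean": values.mean(),
    "std": values.std(),           # sample standard deviation (ddof=1)
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.median(),
    "75%": values.quantile(0.75),
    "max": values.max(),
}, name="flipper_length_mm")

# Show both versions side by side
print(pd.concat([summary, manual], axis=1, keys=["describe", "manual"]))
```

Note that missing values do not contribute to any of the statistics, which is why the count can be smaller than the number of rows.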

**Discussion Questions** 

- What is good about reporting the mean and standard deviation of each feature? What are potential problems?
- Do you know another name for the 50th percentile?
- What happens to the individual measures if there are outliers in the data? Which ones would be affected?

### Histogram

Reporting numbers such as the mean and standard deviation is useful, but it reduces the amount of information in the data.
We also need to be careful that the numbers we report really tell us something meaningful about the data.
We can get a more complete picture by visualizing a histogram of the data.

In [None]:
sns.histplot(data=penguins, x="flipper_length_mm")

**Discussion question:** 

- Why is reporting the mean in this case not so useful?

### Stratification

We can get an even better picture by stratifying the dataset with regard to the species. 

In [None]:
sns.histplot(data=penguins, x="flipper_length_mm", hue="species", multiple="layer", bins=20)

We see that the stratified histogram provides us with a much more complete picture. But even this plot can be misleading.

**Discussion questions:**
- How do you choose the number of bins?
- What can go wrong when you choose the wrong number of bins?
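As a starting point for experimenting with these questions, here is a minimal sketch that uses NumPy directly. The data are synthetic (two made-up groups, not the penguins dataset), and the seed and group parameters are arbitrary assumptions. `np.histogram_bin_edges` implements several reference rules for choosing bins, and seaborn's `bins` argument also accepts the name of such a rule.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, chosen arbitrarily

# Synthetic, hypothetical data: two groups with different typical lengths (mm)
data = np.concatenate([
    rng.normal(190, 6, size=200),
    rng.normal(217, 6, size=120),
])

# Different reference rules suggest different numbers of bins
for rule in ["sturges", "fd", "auto"]:
    edges = np.histogram_bin_edges(data, bins=rule)
    width = edges[1] - edges[0]
    print(f"{rule:>8}: {len(edges) - 1} bins of width {width:.2f} mm")

# The same data counted with very few and with very many bins
coarse_counts, _ = np.histogram(data, bins=3)
fine_counts, _ = np.histogram(data, bins=60)
print("3 bins: ", coarse_counts)
print("60 bins:", fine_counts)
```

Comparing the two sets of counts with the plots above gives a feeling for how much the impression of the same data depends on the binning.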

### Density plots

Histograms divide the data into a discrete number of bins. If we have enough data and choose a small bin width, the resulting plot resembles a continuous distribution, as shown in the following experiment:

In [None]:
sns.histplot(x=np.random.randn(100000), bins=200)
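The visual impression can also be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original experiment: it draws its own sample with a fixed, arbitrarily chosen seed, normalizes the histogram so that the bars have total area 1, and compares the bar heights with the standard normal density at the bin centers.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, chosen arbitrarily
samples = rng.standard_normal(100_000)

# density=True rescales the bar heights so that the total bar area is 1
heights, edges = np.histogram(samples, bins=200, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Standard normal probability density evaluated at the bin centers
pdf = np.exp(-0.5 * centers**2) / np.sqrt(2 * np.pi)

total_area = np.sum(heights * np.diff(edges))
max_gap = np.max(np.abs(heights - pdf))
print(f"total bar area: {total_area:.3f}")
print(f"largest gap between bar height and density: {max_gap:.4f}")
```

In seaborn, the same normalization is available by passing `stat="density"` to `histplot`; by default the bars show counts.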

When we know that the underlying distribution is indeed continuous (and smooth), we can use that knowledge to smooth out the steps that arise because we only have a finite sample. This can be done using *kernel density estimation (KDE)*; the resulting plot is called a density plot and is created in seaborn with `kdeplot`.

In [None]:
sns.kdeplot(data=penguins, x="flipper_length_mm", hue="species", multiple="layer", fill=True)

If the assumptions hold, the plot is very useful and conveys a lot of information about the distribution. It can be much more informative than just reporting a mean or showing a single histogram. A density plot can also be created using a figure-level function in seaborn.

In [None]:
sns.displot(data=penguins, x="flipper_length_mm", hue="species", kind="kde", fill=True)
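The idea behind kernel density estimation can be sketched in a few lines of NumPy: a small Gaussian bump is placed on every observation and the bumps are averaged. This is a minimal illustration, not seaborn's implementation. The helper `gaussian_kde_1d`, the sample values, and the rule-of-thumb bandwidth (Scott's rule) are assumptions made for this example.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on the samples, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One row per grid point, one column per sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical flipper lengths in mm (made up for illustration)
samples = np.array([181.0, 186.0, 190.0, 193.0, 195.0,
                    210.0, 215.0, 217.0, 222.0])
grid = np.linspace(120, 280, 801)

# Scott's rule of thumb: sample standard deviation times n**(-1/5)
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should integrate to (approximately) 1 over the grid
area = np.sum(density) * (grid[1] - grid[0])
print(f"bandwidth = {bandwidth:.2f} mm, area under curve = {area:.3f}")
```

A smaller bandwidth gives a bumpier curve and a larger one a smoother curve, so the bandwidth plays a role similar to the bin width of a histogram. In seaborn it can be scaled with the `bw_adjust` parameter of `kdeplot`.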