# Fundamentals of Statistics
## Summary Statistics and Boxplots

## Data

The dataset we use here is taken from this [paper](https://www.nature.com/articles/sdata201919), which studies the life expectancy of hundreds of animals from North American zoos and aquariums. We start by importing the required libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The data file is `AZA_MLE_Jul2018.csv`. Download it from Learn and upload it to your Noteable directory, in the same folder as this notebook.
Now we read in the data, which is a csv file, using `pd.read_csv` and look at the first five rows using `.head()`. (The file is not UTF-8 encoded, so the extra argument `encoding='latin-1'` is needed for it to be read in properly.)

In [None]:
zooanimals = pd.read_csv("AZA_MLE_Jul2018.csv", encoding='latin-1')
zooanimals.head()
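
To see why the `encoding` argument matters, here is a minimal sketch that does not use the zoo data at all. It writes a tiny, made-up Latin-1 file (the names `raw`, `toy` and `toy_latin1.csv` are hypothetical) and tries to read it with and without the argument.

```python
import os
import tempfile

import pandas as pd

# Made-up two-row file. "é" is the single byte 0xE9 in Latin-1,
# which is not a valid byte sequence in UTF-8.
raw = "Name,Value\nCafé,1.0\nTea,2.0\n".encode("latin-1")

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_latin1.csv")
    with open(path, "wb") as f:
        f.write(raw)

    # The default encoding (UTF-8) cannot decode the 0xE9 byte.
    try:
        pd.read_csv(path)
        default_ok = True
    except UnicodeDecodeError:
        default_ok = False

    # Declaring the encoding explicitly reads the file as intended.
    toy = pd.read_csv(path, encoding="latin-1")

print(default_ok)
print(toy)
```

If a file fails to load with a `UnicodeDecodeError`, trying a different `encoding` is usually the first thing to check.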

The column "Overall MLE" shows the Median Life Expectancy for each animal. We take this column and visualise it with a histogram using the `hist` function from the `matplotlib` module.

In [None]:
all_lifexp = zooanimals["Overall MLE"]
plt.hist(all_lifexp)
plt.show()

We are interested in the data about mammals, so we make a subset of the dataset in which `TaxonClass == 'Mammalia'` and call that subset `Mammalia` (the rows corresponding to mammals are selected using `.loc`).

In [None]:
Mammalia = zooanimals.loc[zooanimals['TaxonClass'] == 'Mammalia']
Mammalia
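
The selection above works in two steps: the comparison builds a column of True/False values, and `.loc` keeps the rows marked True. A minimal sketch on a made-up table (the name `toy` and all its values are hypothetical, not taken from the zoo data):

```python
import pandas as pd

# Hypothetical toy table; the values are made up for illustration.
toy = pd.DataFrame({
    "TaxonClass": ["Mammalia", "Aves", "Mammalia", "Reptilia"],
    "Overall MLE": [10.0, 5.5, 20.0, 12.0],
})

# Comparing a column with a value gives one True/False entry per row.
is_mammal = toy["TaxonClass"] == "Mammalia"
print(is_mammal.tolist())          # [True, False, True, False]

# .loc keeps only the rows where the mask is True.
# The original index labels are kept, they are not renumbered.
toy_mammals = toy.loc[is_mammal]
print(toy_mammals.index.tolist())  # [0, 2]
```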

We save the "Overall MLE" column in a variable called `Mammalia_lifexp` and make a histogram of the life expectancy values for mammals. This time, we also add a title, axis labels, and a grid to the plot.

In [None]:
Mammalia_lifexp = Mammalia["Overall MLE"]
plt.hist(Mammalia_lifexp)
plt.title("Distribution of life expectancy of zoo mammals")
plt.xlabel("Life Expectancy")
plt.ylabel("Count")
plt.grid()
plt.show()

## Measures of centre of data

### Mean

We want to find the mean life expectancy of the mammals. We can use the `np.mean` function to calculate this, or sum up all the values using `np.sum` and divide by the number of data values, which is found using `len`.

In [None]:
Mammalia_mean = np.mean(Mammalia_lifexp)
Mammalia_mean

In [None]:
np.sum(Mammalia_lifexp)/len(Mammalia_lifexp)

### Median

To find the median of this variable, we can use `np.median`. Alternatively, we can sort the 175 values from smallest to largest using `sort_values()`. Since this number is odd, the median is the middle value, which is the 88th observation out of 175 (because indexing in Python starts from 0, that is the element at position 87).

In [None]:
np.median(Mammalia_lifexp)

In [None]:
Mammalia_lifexp.sort_values()

In [None]:
Mammalia_median = Mammalia_lifexp.sort_values().iloc[87]
Mammalia_median
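
The sorting approach can be sketched as a small function that also covers an even number of values, where the median is the average of the two middle elements. The helper name `middle_value` and the sample values are made up for illustration.

```python
import numpy as np

def middle_value(values):
    """Median by sorting: middle element (odd n) or mean of the two middle elements (even n)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = len(ordered)
    mid = n // 2                     # e.g. n = 175 gives 0-based position 87
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd = [7.0, 1.0, 5.0, 3.0, 9.0]      # made-up values
even = [7.0, 1.0, 5.0, 3.0]
print(middle_value(odd), np.median(odd))     # 5.0 5.0
print(middle_value(even), np.median(even))   # 4.0 4.0
```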

### Mode

Life expectancy is a continuous variable, so the mode is not informative for it. However, just to see an example of a mode, we can find the mode of the nominal variable 'TaxonClass'. The function `value_counts()` shows all the levels of this variable and their frequencies, from which the mode (the most frequent level) can be read off. Alternatively, we can import the `statistics` module and use its `mode` function.

In [None]:
zooanimals['TaxonClass'].value_counts()

In [None]:
# The statistics module provides functions for calculating mathematical statistics of numeric data.
import statistics
statistics.mode(zooanimals['TaxonClass'])
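
As a small illustration on made-up labels (not the zoo data), the mode can also be read from the counts programmatically. The sketch uses the pandas methods `idxmax()` and `Series.mode()`; the latter returns every tied label when there is more than one mode.

```python
import pandas as pd

# Made-up labels for illustration.
labels = pd.Series(["Aves", "Mammalia", "Aves", "Reptilia", "Mammalia", "Aves"])

counts = labels.value_counts()     # frequencies, most common first
print(counts.idxmax())             # label with the highest count: Aves
print(labels.mode().tolist())      # ['Aves']

# With a tie, Series.mode() returns all tied labels, sorted.
tied = pd.Series(["Aves", "Mammalia", "Aves", "Mammalia"])
print(tied.mode().tolist())        # ['Aves', 'Mammalia']
```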

We can mark the mean and the median on the histogram using `axvline`.

In [None]:
plt.hist(Mammalia_lifexp)
plt.title("Distribution of life expectancy of zoo mammals")
plt.xlabel("Life Expectancy")
plt.ylabel("Count")
plt.grid()
plt.axvline(x=Mammalia_mean, color='k', label="mean")
plt.axvline(x=Mammalia_median, color='r', label="median")
plt.legend()
plt.show()

## Measures of spread of data

### Range

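The range is the difference between the largest and the smallest value of a variable. We can read both values off the sorted output of `sort_values()` and subtract them, or use `np.ptp` ("peak to peak"), which computes the difference directly.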

In [None]:
Mammalia_lifexp.sort_values()

In [None]:
# largest value minus smallest value, read off the sorted output above
42 - 4.6

In [None]:
np.ptp(Mammalia_lifexp)
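
Reading the extremes off a printout by hand is easy to get wrong, so a third option is to let `max` and `min` do it. A minimal sketch on made-up values:

```python
import numpy as np

values = np.array([3.0, 12.5, 8.0, 40.0, 15.5])   # made-up values

# Range as largest minus smallest; np.ptp gives the same number.
value_range = values.max() - values.min()
print(value_range, np.ptp(values))   # 37.0 37.0
```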

### Variance

The most important measure of spread of the data is the variance, which is calculated using `np.var`. The argument `ddof=1` makes the function divide by n-1 (the sample variance) rather than by n.

In [None]:
Mammalia_var = np.var(Mammalia_lifexp, ddof=1)
print(Mammalia_var)

We can also follow the variance formula directly and calculate its value in four steps: (1) find the deviations from the mean, (2) square them, (3) sum them up, (4) divide the sum by n-1.

In [None]:
#1
distance = Mammalia_lifexp - np.mean(Mammalia_lifexp)
#2
sq_distance = distance ** 2
#3
sum_sq_distance = np.sum(sq_distance)
#4
variance = sum_sq_distance / (len(Mammalia_lifexp)-1)
print(variance)
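
The effect of `ddof` is easiest to see on a few made-up numbers where the arithmetic can be checked by hand. In this sketch the mean is 5 and the squared deviations sum to 28, so dividing by n gives 5.6 and dividing by n-1 gives 7.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 6.0, 9.0])   # made-up values, mean = 5
n = len(x)

sum_sq = np.sum((x - x.mean()) ** 2)      # 9 + 1 + 1 + 1 + 16 = 28

pop_var = np.var(x)              # ddof=0 (NumPy default): divide by n
sample_var = np.var(x, ddof=1)   # ddof=1: divide by n - 1

print(sum_sq / n, pop_var)
print(sum_sq / (n - 1), sample_var)
```

Note that `np.var` defaults to `ddof=0`, while the pandas method `Series.var()` defaults to `ddof=1`, so it is worth stating `ddof` explicitly.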

The other measure of variation is the standard deviation, which is just the square root of the variance. You can take the square root of the variance or use the `np.std` function, which calculates the standard deviation directly.

In [None]:
np.sqrt(Mammalia_var)

In [None]:
np.std(Mammalia_lifexp, ddof=1)

In [None]:
# The proportion of mammals whose life expectancy lies in the range (mean - st.dev, mean + st.dev) is more than 50%
len(Mammalia_lifexp[(Mammalia_lifexp > 7.22) & (Mammalia_lifexp < 22.2)])/len(Mammalia_lifexp)
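
The bounds in the cell above are typed in by hand. A sketch of the same check with the bounds computed from the data, on made-up values (the names `lower`, `upper` and `proportion` are chosen here for illustration):

```python
import numpy as np

# Made-up values for illustration.
x = np.array([3.0, 8.0, 10.0, 12.0, 13.0, 15.0, 18.0, 20.0, 25.0, 40.0])

mean = np.mean(x)
sd = np.std(x, ddof=1)                 # sample standard deviation
lower, upper = mean - sd, mean + sd

# A boolean mask averages to the proportion of True entries.
inside = (x > lower) & (x < upper)
proportion = inside.mean()
print(lower, upper, proportion)
```

Computing the bounds this way avoids rounding them and keeps the cell correct if the data change.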

## Quantiles and boxplots

Any quantile of the selected variable can be calculated using `np.quantile`. Usually we check the minimum, the 25%, 50% and 75% quantiles (the quartiles), and the maximum of the data.

In [None]:
print(np.quantile(Mammalia_lifexp, [0, 0.25, 0.5, 0.75, 1]))
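
When a quantile falls between two data values, `np.quantile` interpolates linearly between them by default. A minimal sketch on four made-up values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up values, already sorted

# With n = 4 values, the 0.25 quantile sits at 0-based position
# 0.25 * (n - 1) = 0.75, i.e. three quarters of the way from x[0] to x[1].
q = np.quantile(x, [0, 0.25, 0.5, 0.75, 1])
print(q.tolist())   # [1.0, 1.75, 2.5, 3.25, 4.0]
```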

A useful function is `describe()`, which gives a descriptive summary of the variable.

In [None]:
Mammalia_lifexp.describe()

A boxplot is a very informative way of showing the quartiles and spread of the data. The `boxplot` function in the `matplotlib` module is used below.

In [None]:
plt.boxplot(Mammalia_lifexp)
plt.xlabel("Mammalia")
plt.ylabel("Life expectancy")
plt.show()

### Outliers

The boxplot shows that there are a few outliers in the variable, shown as small circles. We can investigate the data and find out which animals are outliers with respect to this variable. The interquartile range (IQR) is calculated directly using the `iqr` function from the `scipy.stats` module. Then the lower and upper thresholds are calculated as the first quartile minus 1.5 times the IQR and the third quartile plus 1.5 times the IQR. Any data value smaller than the lower threshold or larger than the upper threshold is an outlier.

In [None]:
from scipy.stats import iqr
IQR = iqr(Mammalia_lifexp)
lower_threshold = np.quantile(Mammalia_lifexp, 0.25) - 1.5 * IQR
upper_threshold = np.quantile(Mammalia_lifexp, 0.75) + 1.5 * IQR
outliers = Mammalia_lifexp[(Mammalia_lifexp < lower_threshold) | (Mammalia_lifexp > upper_threshold)]
print(outliers)

The lower and upper thresholds are:

In [None]:
print(lower_threshold, upper_threshold)

Extract rows of the dataset corresponding to the outliers.

In [None]:
# Select the rows whose index labels match the outliers found above (below or above the thresholds)
Mammalia.loc[outliers.index]
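
The 1.5 × IQR rule used in this section can be checked by hand on a small made-up sample. This sketch computes the IQR as the difference of the two quartiles from `np.quantile` instead of calling `scipy.stats.iqr`.

```python
import numpy as np

# Made-up values with one obviously extreme entry.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])

q1, q3 = np.quantile(x, [0.25, 0.75])   # 3.25 and 7.75
iqr_value = q3 - q1                     # 4.5
lower = q1 - 1.5 * iqr_value            # -3.5
upper = q3 + 1.5 * iqr_value            # 14.5

# Values outside [lower, upper] are flagged as outliers.
outliers = x[(x < lower) | (x > upper)]
print(lower, upper, outliers.tolist())
```

Here the lower threshold is negative, so only a large value can be flagged. A negative lower threshold is common for variables that cannot go below zero.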

# Exercises

1. Let's focus on birds in this data set. Make a subset of the data in which `TaxonClass == 'Aves'` (Aves is the taxonomic class of birds). Take the column `Overall MLE` of this subset, which contains the birds' life expectancy, and name it `birds_lifexp`. Since this column has some NaN values (missing values), use `birds_lifexp = birds_lifexp.dropna()` to drop them. Make a histogram of the life expectancy of birds.

2. Calculate the summary statistics for the birds' life expectancy, that is, measures of the centre of the data (mean, median) and of the spread of the data (range, variance, standard deviation).

3. Make a boxplot of the birds' life expectancy.

4. The boxplot shows some outliers. Find which birds are outliers in terms of their life expectancy.

5. Write a paragraph and compare the life expectancy of mammals and birds.
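
On the missing values mentioned in exercise 1: a minimal sketch, on made-up numbers, of why they need to be dropped before computing summary statistics on the raw values.

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 10.0, 6.0])   # made-up values with one missing entry

# On the raw NumPy values, the missing entry propagates into the result.
median_with_nan = np.median(s.to_numpy())
print(median_with_nan)                    # nan

# dropna() returns a new Series without the missing entries.
clean = s.dropna()
print(len(s), len(clean))                 # 4 3
print(np.median(clean))                   # 6.0
```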