# Descriptive statistics in Python

Welcome to week 2 of 02402 Statistics (PF)

Today we will start using Python and in this notebook we will go through some basic descriptive statistics.

We will also start using the libraries: Numpy, Matplotlib and Pandas

In [None]:
# This is the first "code" cell in this Jupyter notebook
# all lines that start with a "#" are "commented out" 

In [None]:
# calculate 2+2
2+2

### Store sample data in a variable

In [None]:
# make a Python "list"
my_list = [1,2,3,4]

In [None]:
print(my_list)

In [None]:
my_list*3

In [None]:
print(type(my_list))

We want to be able to work with a data type that behaves as a vector. For this we use Numpy arrays. 

We can store our sample data in a Numpy array. 

In [None]:
# import the Numpy library
import numpy as np

We will now work with a sample consisting of 10 measurements of students' heights. 

The 10 observations have the values: 168, 161, 167, 179, 184, 166, 198, 187, 191 and 179

In [None]:
# store sample data in variable x:
x = np.array([168, 161, 167, 179, 184, 166, 198, 187, 191, 179])

In [None]:
print(x)

In [None]:
print(type(x))
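
To see why we prefer a Numpy array over a list, here is a small sketch comparing what multiplication by 3 does to each of them. The name `my_array` is only introduced for this illustration.

```python
import numpy as np

my_list = [1, 2, 3, 4]
my_array = np.array(my_list)   # hypothetical name, used only in this example

# multiplying a list repeats the list
repeated = my_list * 3
print(repeated)

# multiplying an array works element by element, like a vector
scaled = my_array * 3
print(scaled)
```

The list is repeated three times, while every element of the array is multiplied by 3.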

### Calculating simple statistics

In [None]:
# calculate mean of x (average height of students)
np.mean(x)

In [None]:
# "mean()" can also be called as a method
x.mean()

Have a look in the online documentation: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html

The datatype "ndarray" (also called a numpy array) has many methods.

In [None]:
# let's try some other "methods"
x.min()

In [None]:
x.max()

In [None]:
# what about variance? 
# OBS: we need to remember ddof = 1 in order to calculate the "sample variance"
x.var(ddof=1)

Why ddof=1? Have a look in the documentation for an explanation: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html
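
As a minimal sketch of what `ddof` changes, we can compute the variance by hand. The divisor is n - ddof, so ddof=1 divides by n-1 (the sample variance), while the Numpy default ddof=0 divides by n. The variable names below are only chosen for illustration.

```python
import numpy as np

x = np.array([168, 161, 167, 179, 184, 166, 198, 187, 191, 179])
n = len(x)

# sum of squared deviations from the sample mean
ss = np.sum((x - x.mean())**2)

var_sample = ss / (n - 1)   # same as x.var(ddof=1)
var_ddof0 = ss / n          # same as x.var(), since the default is ddof=0

print(var_sample, x.var(ddof=1))
print(var_ddof0, x.var())
```

The two numbers on each line should agree, and the value with ddof=0 is slightly smaller.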


In [None]:
# standard deviation (also remember ddof=1 for "sample standard deviation")
x.std(ddof=1)

In [None]:
# what about the median?
# (this cell raises an AttributeError)
x.median()

No method called median? 

OK, then we call the median() function directly from Numpy.

In [None]:
np.median(x)

In [None]:
# we can also get other percentiles (50th percentile is the same as the median)
np.percentile(x, [0,10,25,50,75,90,100], method='averaged_inverted_cdf')

In [None]:
# Numpy has two equivalent functions for calculating quantiles: "percentile" and "quantile"
np.quantile(x, [0,0.10,0.25,0.50,0.75,0.90,1.00], method='averaged_inverted_cdf')

In [None]:
# compare with sorted data
sorted_x = np.sort(x)
print(sorted_x)

Notice the method='averaged_inverted_cdf' argument!<br>

There are many different ways to define percentiles!

See the documentation: https://numpy.org/doc/stable/reference/generated/numpy.percentile.html#numpy.percentile

In this course (and in the book) we always use the 'averaged_inverted_cdf' method.
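
The rule behind 'averaged_inverted_cdf' can be sketched by hand. With n observations and a probability p: if n·p is a whole number we take the average of observation number n·p and the next one in the sorted data, otherwise we round n·p up and take that observation. The helper `quantile_by_hand` below is a hypothetical function written only for this illustration (it assumes 0 < p < 1) and is not part of Numpy.

```python
import numpy as np

x = np.array([168, 161, 167, 179, 184, 166, 198, 187, 191, 179])

def quantile_by_hand(data, p):
    """Illustrative sketch of the 'averaged_inverted_cdf' rule for 0 < p < 1."""
    s = np.sort(data)
    n = len(s)
    k = n * p
    if np.isclose(k, round(k)):
        # n*p is a whole number: average observation number k and k+1
        k = int(round(k))
        return float((s[k - 1] + s[k]) / 2)
    # otherwise: round n*p up and take that observation
    return float(s[int(np.ceil(k)) - 1])

probs = [0.10, 0.25, 0.50, 0.75, 0.90]
by_hand = [quantile_by_hand(x, p) for p in probs]
with_numpy = np.quantile(x, probs, method='averaged_inverted_cdf')

print(by_hand)
print(with_numpy)
```

Comparing the two printed lines with the sorted data shows which observations each quantile is built from.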

### More complex data

We now add to the dataset 10 measurements of student weights. We store this data in variable y:

In [None]:
y = np.array([65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9])

In [None]:
print(x)
print(y)

In [None]:
# calculate covariance:
np.cov(x,y, ddof=1)

What are the four values?

In [None]:
# calculate correlation
np.corrcoef(x,y)

Now have a look at Appendix A.1 in the book :)

What are the four values?

How do you interpret a correlation of 0.9656 ?
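
One way to check your answers is to compute the covariance and the correlation by hand and compare with the matrices above. This is only a sketch, and the names `s_xy` and `r` are chosen here for illustration.

```python
import numpy as np

x = np.array([168, 161, 167, 179, 184, 166, 198, 187, 191, 179])
y = np.array([65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9])
n = len(x)

# sample covariance: sum of products of deviations, divided by n-1
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# sample correlation: covariance scaled by the two sample standard deviations
r = s_xy / (x.std(ddof=1) * y.std(ddof=1))

cov_matrix = np.cov(x, y, ddof=1)
corr_matrix = np.corrcoef(x, y)

print(s_xy, cov_matrix[0, 1], cov_matrix[1, 0])   # off-diagonal: covariance of x and y
print(x.var(ddof=1), cov_matrix[0, 0])            # diagonal: sample variance of x
print(y.var(ddof=1), cov_matrix[1, 1])            # diagonal: sample variance of y
print(r, corr_matrix[0, 1])                       # off-diagonal: correlation of x and y
```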

KAHOOT (x1)

## Data visualization

We use the matplotlib library to produce plots

In [None]:
# import the matplotlib.pyplot package 
import matplotlib.pyplot as plt

In [None]:
# Recall our sample data:
print(sorted_x)

In [None]:
# Now make a histogram of the sample data
plt.hist(x)
plt.show()

In [None]:
# Customize your histogram
plt.hist(x, bins=8, edgecolor='black', color='red', density=True)
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Histogram Example')
plt.show()

In [None]:
# specifying bin-edges:
plt.hist(x, bins=[160,165,170,175,180,185,190,195,200], edgecolor='black', color='red', density=True)
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Histogram Example')
plt.show()

Histograms are important - they show how the data is **distributed** and are often the first choice for visualising a sample <br>

Histograms serve as *empirical distributions* ("empirical pdf")<br>

Based on the histogram above, what would you guess the height distribution in the *population* looks like? <br> 
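
To see what `density=True` does, the bar heights can be computed without plotting. The sketch below uses `np.histogram` with the same bin edges as above: each height is the count in the bin divided by (n · bin width), so the total area of the bars is 1, just like a pdf.

```python
import numpy as np

x = np.array([168, 161, 167, 179, 184, 166, 198, 187, 191, 179])
bin_edges = [160, 165, 170, 175, 180, 185, 190, 195, 200]

# number of observations in each bin
counts, edges = np.histogram(x, bins=bin_edges)

# bar heights when density=True
density, _ = np.histogram(x, bins=bin_edges, density=True)

widths = np.diff(edges)
total_area = np.sum(density * widths)

print(counts)
print(density)
print(total_area)
```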

In [None]:
# let's try with really small bins, such that the histogram displays all the details in the data:
plt.hist(x, bins=np.arange(160,200,1), edgecolor='black', color='red', density=True)
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Histogram Example')
plt.show()

### Cumulative distribution

The "detailed" histogram with small bins is perhaps not the nicest way to display the data. <br>

But histograms depend on the choice of bins, which is (sometimes) not ideal either. <br>

An alternative is to do a cumulative kind of plot:

In [None]:
# plot the "empirical cumulative distribution function" (empirical cdf)
plt.ecdf(x)
plt.show()

In [None]:
# compare with values 
print(sorted_x)

In the cumulative distribution all detailed information is kept - but it is another way to visualise the distribution of data. 
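
The values behind the plot can be sketched by hand: after the i-th smallest observation, the empirical cdf has the height i/n. The name `ecdf_values` is only used for this illustration.

```python
import numpy as np

x = np.array([168, 161, 167, 179, 184, 166, 198, 187, 191, 179])

sorted_x = np.sort(x)
n = len(sorted_x)

# height of the empirical cdf after each sorted observation: 1/n, 2/n, ..., 1
ecdf_values = np.arange(1, n + 1) / n

for value, height in zip(sorted_x, ecdf_values):
    print(value, height)

# the fraction of observations that are <= 179
fraction_below_179 = np.mean(x <= 179)
print(fraction_below_179)
```

Note that 179 occurs twice in the sample, so the plot makes a double step at that value.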


In [None]:
# let's increase the y-range slightly:
plt.ecdf(x)
plt.ylim(-0.1,1.1)
plt.xlabel('x')
plt.ylabel('ecdf(x)')
plt.title('Empirical cumulative distribution function')
plt.show()

The y-range goes from 0 to 1 (0% to 100%) <br>

Every vertical line-segment is a datapoint <br>

When the plot is "steep" there are many datapoints (corresponds to high values in the histogram). <br>

The cumulative plot can be used to understand the 'averaged_inverted_cdf' method used for percentiles. 
Example: If you want to find the 35th percentile, start by finding 0.35 on the y-axis. Then find the corresponding value on the x-axis. This is the value of the 35th percentile. <br>
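
The lookup described above can be sketched in code: find the first sorted observation where the empirical cdf reaches 0.35. This sketch assumes that p does not hit one of the steps exactly (for example p = 0.30), since in that case the method averages two neighbouring observations instead.

```python
import numpy as np

x = np.array([168, 161, 167, 179, 184, 166, 198, 187, 191, 179])

sorted_x = np.sort(x)
n = len(sorted_x)
ecdf_values = np.arange(1, n + 1) / n

p = 0.35
# index of the first observation where the empirical cdf is at least p
idx = np.searchsorted(ecdf_values, p)
from_plot = sorted_x[idx]

from_numpy = np.quantile(x, p, method='averaged_inverted_cdf')

print(from_plot)
print(from_numpy)
```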


KAHOOT (x2)

### Boxplot

In [None]:
# make a boxplot
plt.boxplot(x)
plt.show()

Now the *values* are on the **y-axis**

In [None]:
# Adding some explanation:
plt.boxplot(x)
plt.text(1.1, np.percentile(x, 0), 'Minimum', color='blue')
plt.text(1.1, np.percentile(x, 25), 'Q1', color='blue')
plt.text(1.1, np.percentile(x, 50), 'Median', color='blue')
plt.text(1.1, np.percentile(x, 75), 'Q3', color='blue')
plt.text(1.1, np.percentile(x, 100), 'Maximum', color='blue')
plt.title("Basic box plot")
plt.show()

See the documentation for the definition of the box and whiskers: 

https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.boxplot.html#matplotlib.axes.Axes.boxplot



In [None]:
# Adding an outlier to the data:
print(np.append(x, [235]))

In [None]:
plt.boxplot(np.append(x, [235]))
plt.show()

In the plot above you see that "extreme values" are plotted individually. The "whiskers" do not extend all the way to min and max by default. 
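
Why is 235 plotted individually? Here is a sketch of the rule, assuming the matplotlib default `whis=1.5` and quartiles computed with the default `np.percentile` method (not 'averaged_inverted_cdf'): the whiskers reach the most extreme observations that lie within 1.5 times the interquartile range from the box. The variable names are only chosen for illustration.

```python
import numpy as np

x = np.array([168, 161, 167, 179, 184, 166, 198, 187, 191, 179])
x_out = np.append(x, [235])

# quartiles and interquartile range (default percentile method)
q1, q3 = np.percentile(x_out, [25, 75])
iqr = q3 - q1

# limits for the whiskers with whis=1.5
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = x_out[(x_out >= lower_limit) & (x_out <= upper_limit)]
outliers = x_out[(x_out < lower_limit) | (x_out > upper_limit)]

print(lower_limit, upper_limit)
print(inside.min(), inside.max())   # where the whiskers end
print(outliers)                     # plotted as individual points
```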

You can control the whiskers by using the "whis=.." option:

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,4))  # start by splitting the figure into two 
ax1.boxplot(np.append(x, [235]))                     # define first plot in the figure - default setting for whiskers
ax2.boxplot(np.append(x, [235]), whis=(0,100))       # define second plot in the figure - set whiskers manually
plt.show()                                           # now show the entire figure

### Scatter plot

In [None]:
plt.scatter(x,y)
plt.show()

Do you remember the correlation? Does it match with the plot?

KAHOOT (x1)

### DataFrames

For more complex data (many rows and many columns) we will sometimes use "DataFrames" from the Pandas library. 

In [None]:
# import the Pandas library
import pandas as pd 

We can put our previous height and weight data into a *DataFrame*:

In [None]:
student_data = pd.DataFrame({
    'height':  x,
    'weight':  y
})
student_data

In [None]:
print(type(student_data))

In [None]:
# we could also type data directly into a DataFrame:
student_data = pd.DataFrame({
    'height':  [168, 161, 167, 179, 184, 166, 198, 187, 191, 179],
    'weight':  [65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9]
})
student_data

It is good practice to have one *observational unit* in each row and different *observational variables* in the different columns. 

(recall Definition 1.1 from chapter 1 in the book)
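
With one variable per column, the simple statistics from earlier can be calculated for all columns at once. The sketch below also shows a detail to be aware of: the Pandas `var()` and `std()` methods use ddof=1 by default, while Numpy uses ddof=0.

```python
import numpy as np
import pandas as pd

student_data = pd.DataFrame({
    'height':  [168, 161, 167, 179, 184, 166, 198, 187, 191, 179],
    'weight':  [65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9]
})

# column-wise mean and sample variance (Pandas default: ddof=1)
print(student_data.mean())
print(student_data.var())

# correlation matrix of the two columns
corr = student_data.corr()
print(corr)

# the same column as a Numpy array: remember ddof=1 to get the same variance
height = student_data['height'].to_numpy()
print(np.var(height, ddof=1))
```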



In [None]:
# The DataFrame has a direct method for making histograms:
student_data.hist()
plt.show()

In [None]:
# The DataFrame also has a direct method for making a scatter plot:
student_data.plot.scatter("height", "weight")
plt.show()

### Reading data from an external file

It is very important to learn how to read data from other files. In practice one will never type all the data into Python by hand!

In [None]:
# read data from a csv file:
csv_data = pd.read_csv("studentheights.csv", sep=';')

In [None]:
# print the number of rows in the dataset:
print(len(csv_data))

In [None]:
# view the first few rows:
csv_data.head()

What is the data in the two columns?

What is the type of data in the two columns? (quantitative, qualitative ..?)

In [None]:
csv_data.describe(include='all')

If we want to do a boxplot by gender, we need to include the "by=.." argument:

In [None]:
csv_data.boxplot(by='Gender')

In [None]:
csv_data.hist(by="Gender")

What happens if we remove the "by=.." statement in the plots above?
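
The grouping that `by=..` performs in the plots can also be used for numbers. The sketch below does not read studentheights.csv. Instead it uses a small made-up dataset typed as text, where the column name "Height" and all the values are assumptions chosen only for illustration (only the column name "Gender" is taken from the cells above).

```python
import io
import pandas as pd

# hypothetical stand-in for a csv file: made-up rows, assumed column name "Height"
csv_text = """Height;Gender
168;female
161;female
167;female
179;male
184;male
198;male
"""

# read_csv can read from a text buffer in the same way as from a file
demo_data = pd.read_csv(io.StringIO(csv_text), sep=';')

# data type of each column
print(demo_data.dtypes)

# summary statistics of the heights, calculated separately for each gender
group_means = demo_data.groupby('Gender')['Height'].mean()
print(demo_data.groupby('Gender')['Height'].describe())
```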