<a href="https://colab.research.google.com/github/shaevitz/MOL518-Intro-to-Data-Analysis/blob/main/Lecture_8/MOL518_Lecture8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# In Colab, run this cell first to set up the file structure!
%cd /content
!rm -rf MOL518-Intro-to-Data-Analysis

!git clone https://github.com/shaevitz/MOL518-Intro-to-Data-Analysis.git
%cd MOL518-Intro-to-Data-Analysis/Lecture_8

# Lecture 8: An Introduction to Statistics

In this class, we will learn how to compute some basic statistics and plot them in Python using Jupyter Notebooks running in Google Colab. These notebooks let us combine code, explanatory text, figures, and results in a single, readable document, which makes them well suited for learning, exploration, and real data analysis.

The goals of this lecture are to:

1.	Teach you the basics of statistics (hopefully this is mostly a refresher)
2.	Teach you how to compute basic statistics using Python (specifically the ```numpy``` and ```scipy.stats``` packages)

## Statistics is a Critical Part of Quantitative Data Analysis

<p align="center">
<img src="https://github.com/shaevitz/MOL518-Intro-to-Data-Analysis/blob/main/Lecture_1/media/ScientificCycle.png?raw=1" alt="Scientific Cycle" width="400" />
</p>

- Without statistics, it is not possible to do rigorous, quantitative data analysis
- Statistics underpins the reproducibility of the scientific method - it is **important**
- As you conduct your thesis research, you are very likely to be making measurements and analyzing data


## Mean, Median, Mode
When we make measurements in biology, they usually come from a population of bacteria, cells, or organisms. Very often, we want to know the *average* over that population. There are three basic ways to compute the average:

- The **mean** is calculated by summing all the values and dividing by the number of values.
- The **median** is the middle value once the measurements are sorted. If the number of measurements is even, it is the mean of the two middle values.
- The **mode** is the most common value.
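
The three definitions above can be sketched in plain Python before turning to any packages. The list `values` is a made-up toy dataset, not the egg data, and the variable names are only illustrative.

```python
from collections import Counter

# Hypothetical toy measurements (not from the egg dataset)
values = [3, 1, 4, 1, 5, 9, 2]

# Mean: sum of all values divided by the number of values
mean = sum(values) / len(values)

# Median: sort the values, then take the middle one
# (for an even count, average the two middle values)
ordered = sorted(values)
n = len(ordered)
if n % 2 == 1:
    median = ordered[n // 2]
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Mode: the value that occurs most often, together with its count
mode, count = Counter(values).most_common(1)[0]

print(mean, median, mode)
```

Here the single large value 9 pulls the mean (about 3.57) above the median (3), while the mode is 1 because it is the only value that appears twice.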

### How do we compute the mean, median, and mode using Python?

For this, we will turn to our old friends the ```numpy``` and ```scipy``` packages. The mean and median are part of the ```numpy``` package as they are very easy to compute. Here are links to the reference pages for these functions so you can learn more about them:

- [numpy.mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html)
- [numpy.median](https://numpy.org/doc/stable/reference/generated/numpy.median.html)

The mode is a little more complicated to compute, as there might be multiple modes in a complex dataset, so it is part of the ```scipy.stats``` package:

- [scipy.stats.mode](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html#scipy.stats.mode)
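
A small sketch of what "multiple modes" means in practice. The array `toy` is hypothetical data chosen so that two values are tied for most common. To my understanding, ```scipy.stats.mode``` reports only one of the tied values (the smallest), so the second half uses ```numpy.unique``` as one possible way to list all of them.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical toy data in which 2 and 5 both appear three times
toy = np.array([5, 2, 5, 2, 7, 5, 2])

# stats.mode returns a result object with .mode and .count attributes
result = stats.mode(toy)
print(result.mode, result.count)

# np.unique with return_counts=True lists every distinct value with its count,
# so we can keep all values whose count equals the maximum count
vals, counts = np.unique(toy, return_counts=True)
all_modes = vals[counts == counts.max()]
print(all_modes)
```

Depending on the installed SciPy version, ```result.mode``` may be printed as a single number or as a one-element array.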

### Example 1: Drosophila egg size
The following example is taken from [Barghi & Ramirez-Lanzas](https://www.nature.com/articles/s41598-023-30472-8) (2023) *Scientific Reports*. The authors developed a new, high-throughput method based on large particle flow cytometry for measuring Drosophila egg size, and compared their method to manual egg length measurements.

To avoid confusion, note that "size" refers to measurements made with the new method, whereas "length" refers to measurements made manually using microscopy. The dataset we are working with initially consists of 74 eggs, each of which was measured with both methods. The first column, denoted ```EggSize```, is the size of each egg, and the second column, denoted ```EggLength```, is the length of each egg.

In this initial example I will show you how to compute the mean, median, and mode of the egg size measurements.

In [None]:
# We need to import the numpy package to compute the mean and the median
import numpy as np

# We need to import the scipy.stats package to compute the mode
import scipy.stats as stats

eggarray = np.loadtxt('data/egg_measurements.csv', delimiter = ',', skiprows=1) # skip the first row since it is a header
# print(eggarray)
eggsize = eggarray[:,1]

print(np.mean(eggsize)) # compute the mean
print(np.median(eggsize)) # compute the median
print(stats.mode(eggsize)) # compute the mode

### Exercise 1

There is a bug hidden in the code below, which attempts to compute the mean egg length. It is up to you to find the bug and fix it.

In [None]:
# We need to import the numpy package to compute the mean and the median
import numpy as np

# We need to import the scipy.stats package to compute the mode
import scipy.stats as stats

eggarray = np.loadtxt('data/egg_measurements.csv', delimiter = ',', skiprows=1) # skip the first row since it is a header

egglen = eggarray[:,3]

print(np.mean(egglen))

### Exercise 2

Write code to calculate the median and mode of egg *length* rather than size.

In [None]:
# Write your own code here




#### Exercise 2 questions

Once you have run your code, please answer the following questions:

1. How do the means and medians of egg length compare to one another?

**[your answer goes here]**

2. Based on the statistics you have calculated so far, is egg length similar to egg size?

**[your answer goes here]**

3. One of the measurements I have asked you to make is flawed. Which measurement is that and why?

**[your answer goes here]**

The answers you provide here will not be graded, but will be helpful feedback for developing the course.

## Variance and Standard Deviation

Very often in biology, we care about how variable our measurements are. Hopefully, the variability is low, but sometimes it is high. To quantify the variability, we will define the variance, $\sigma^2$, as the average squared distance between each measurement $x_i$ and the mean $\mu$.

$$\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n}$$

Why are we squaring things? One reason is that the difference $x_i - \mu$ is positive when $x_i$ is greater than $\mu$ and negative when it is less than $\mu$, so the raw differences would cancel when averaged; squaring makes every term non-negative. Taking the absolute value of the difference, $|x_i-\mu|$, could also work in principle; see this [Stack Exchange thread](https://stats.stackexchange.com/questions/118/why-square-the-difference-instead-of-taking-the-absolute-value-in-standard-devia?rq=1) for further discussion. The real reason for the squaring comes from the Pythagorean theorem and the concept of Euclidean distance. We will discuss this point further when we talk about covariance and correlation coefficients later on in the course, so don't worry if this point is a little confusing at the moment.
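
A minimal sketch of the sign problem, using a hypothetical toy array `x` rather than the egg data: the raw differences from the mean sum to zero, whereas the absolute and squared differences do not.

```python
import numpy as np

# Hypothetical toy measurements (not from the egg dataset)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Difference between each measurement and the mean
deviations = x - np.mean(x)

print(np.sum(deviations))            # positive and negative differences cancel
print(np.mean(np.abs(deviations)))   # mean absolute difference
print(np.mean(deviations**2))        # mean squared difference
```

Both the absolute value and the square remove the sign, but they give different numbers for the same data, which is why the choice between them is worth discussing.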

The standard deviation $\sigma$ is simply defined as the square root of the variance $\sigma^2$.

$$\sigma = \sqrt{\frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n}}$$
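
As a small worked example of these formulas, take four made-up measurements, 1, 3, 5, and 7 (hypothetical numbers, not from the egg dataset), so that $n = 4$:

$$\mu = \frac{1 + 3 + 5 + 7}{4} = 4$$

$$\sigma^2 = \frac{(1-4)^2 + (3-4)^2 + (5-4)^2 + (7-4)^2}{4} = \frac{9 + 1 + 1 + 9}{4} = 5$$

$$\sigma = \sqrt{5} \approx 2.24$$

Note that $\sigma$ has the same units as the measurements themselves, whereas $\sigma^2$ has those units squared.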


In this section, I will show you how to calculate the variance and standard deviation of a dataset using the ```numpy.var``` and ```numpy.std``` functions, respectively, and how to do it without those functions. Here are links to the reference pages so you can learn more about them:
- [numpy.var](https://numpy.org/doc/stable/reference/generated/numpy.var.html)
- [numpy.std](https://numpy.org/doc/stable/reference/generated/numpy.std.html)

### Example 2

In this example, I will show you how to calculate the variance of egg size manually, i.e. without using the built-in functions described above. To do so, we will be using the formula for variance described above.

In [None]:
# We need to import the numpy package to load the data and compute the mean
import numpy as np

eggarray = np.loadtxt('data/egg_measurements.csv', delimiter = ',', skiprows=1) # skip the first row since it is a header
eggsize = eggarray[:,1] # egg size is the first column of the dataset

# The variance is the mean of the squared difference between each element in eggsize and the mean eggsize
eggsize_var = np.mean((eggsize - np.mean(eggsize))**2)
print(eggsize_var)

### Exercise 3

Now, calculate the variance of egg size, using ```numpy.var```.

In [None]:
# Your code goes here

### Exercise 4

Calculate the standard deviation of egg size, without using the built-in ```numpy.var``` or ```numpy.std``` function. You are, however, allowed to use the ```numpy.sqrt``` function.

In [None]:
# Your code goes here