<a href="https://colab.research.google.com/github/shaevitz/MOL518-Intro-to-Data-Analysis/blob/main/Lecture_10/MOL518_Lecture10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# In Colab, run this cell first to set up the file structure!
%cd /content
!rm -rf MOL518-Intro-to-Data-Analysis

!git clone https://github.com/shaevitz/MOL518-Intro-to-Data-Analysis.git
%cd MOL518-Intro-to-Data-Analysis/Lecture_10

/content
Cloning into 'MOL518-Intro-to-Data-Analysis'...
remote: Enumerating objects: 585, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 585 (delta 7), reused 3 (delta 3), pack-reused 570 (from 1)
Receiving objects: 100% (585/585), 22.15 MiB | 15.85 MiB/s, done.
Resolving deltas: 100% (231/231), done.
/content/MOL518-Intro-to-Data-Analysis/Lecture_10


# Lecture 10: Simulating random numbers and distributions

In this class, we will learn how to generate random numbers from different probability distributions and plot them in Python using Jupyter Notebooks running in Google Colab.

The goals of this lecture are to:

1. Teach you how to generate random numbers
2. Teach you how to simulate probability distributions using random number generation
3. Teach you how to calculate a confidence interval using bootstrapping

## Biology is complicated!



- The simple examples in the last class emphasized situations where the binomial, normal, Poisson, and exponential distributions apply
- Real biology is often more complicated than this!
- Examples of this include bimodal distributions, skewed distributions, **[OTHER EXAMPLES]**


## Why do we need to generate random numbers, anyways?



## Random number generation in Python

There are two main ways to generate random numbers in Python: the built-in ```random``` module from the standard library, and the more sophisticated random number generators in the ```numpy.random``` package. We will start with the built-in module so that you can understand how the process works.
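As a rough side-by-side sketch of the two options, the cell below draws random numbers with each of them. The seed value `518` and the variable names are arbitrary choices made for illustration; they are not part of the lecture materials.

```python
import random
import numpy as np

# Built-in module: seed the global generator, then draw one float in [0, 1)
random.seed(518)  # arbitrary seed, chosen only so the run is repeatable
x_builtin = random.random()

# NumPy: create a Generator object and draw five floats in [0, 1) at once
rng = np.random.default_rng(518)  # same arbitrary seed
x_numpy = rng.random(5)

print(x_builtin)
print(x_numpy)
```

The built-in module returns one number per call, whereas the NumPy generator can fill a whole array in a single call, which is convenient when simulating many measurements.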



### What is the simplest way to generate a random number in Python?

The simplest approach uses the aptly named ```random.random``` function, which returns a floating point number drawn uniformly from the half-open interval $[0, 1)$. In other words, the output can be exactly zero, but it is always strictly less than one.
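One way to convince yourself of this range is to draw many numbers and look at the smallest and largest values. This is only a minimal sketch: the seed and the number of draws are arbitrary assumptions, and the variable name `draws` is made up for this check.

```python
import random

random.seed(0)  # arbitrary seed so the check is repeatable

# Draw a large number of values from random.random
draws = [random.random() for _ in range(100_000)]

# The smallest value can be 0.0, but the largest must stay below 1.0
print(min(draws), max(draws))

# Check that every draw satisfies 0 <= x < 1
print(all(0.0 <= x < 1.0 for x in draws))
```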



### Example 1: Drosophila egg size
The following example is taken from [Barghi & Ramirez-Lanzas](https://www.nature.com/articles/s41598-023-30472-8) (2023) *Scientific Reports*. The authors developed a new, high-throughput method based on large particle flow cytometry for measuring Drosophila egg size, and compared their method to manual egg length measurements.

To avoid confusion, note that "size" refers to measurements made with the new method, whereas "length" refers to measurements made manually using microscopy. The dataset we are working with initially consists of 74 eggs, each of which was measured with both methods. The first column, denoted ```EggSize```, is the size of each egg, and the second column, denoted ```EggLength```, is the length of each egg.

In this initial example I will show you how to compute the mean, median, and mode of the egg size measurements.

In [5]:
# import the random package
import random

# using a for loop, generate and print random numbers
for i in range(10):
    rnd = random.random()
    print(rnd)


0.42556394995152036
0.13693627352975113
0.730544223576348
0.4522886499457973
0.7389020384046397
0.20687126240190834
0.18938527343020817
0.6733884941630983
0.8530643012515852
0.2977423628574857


### Exercise 1

Generate

In [3]:
# We need to import the numpy package to compute the mean and the median
import numpy as np

# We need to import the scipy.stats package to compute the mode
import scipy.stats as stats

# Load the data, skipping the first row since it is a header
eggarray = np.loadtxt('data/egg_measurements.csv', delimiter=',', skiprows=1)

# The array has two columns: EggSize (index 0) and EggLength (index 1)
egglen = eggarray[:, 1]

print(np.mean(egglen))

### Exercise 2

Write code to calculate the median and mode of egg *length* rather than size.

In [None]:
# Write your own code here




#### Exercise 2 questions

Once you have run your code, please answer the following questions:

1. How do the means and medians of egg length compare to one another?

**[your answer goes here]**

2. Based on the statistics you have calculated so far, is egg length similar to egg size?

**[your answer goes here]**

3. One of the measurements I have asked you to make is flawed. Which measurement is that and why?

**[your answer goes here]**

The answers you provide here will not be graded, but they will be helpful feedback for developing the course.

## Variance and Standard Deviation

Very often in biology, we care about how variable our measurements are. Hopefully, the variability is low, but sometimes it is high. To quantify the variability, we will define the variance, $\sigma^2$, as the average squared distance between each measurement $x_i$ and the mean $\mu$.

$$\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n}$$

Why are we squaring things? One reason is that it avoids negative numbers, which would otherwise arise whenever $x_i$ is less than $\mu$. However, taking the absolute value of the difference, $|x_i-\mu|$, could also work in principle; see this [stack exchange thread](https://stats.stackexchange.com/questions/118/why-square-the-difference-instead-of-taking-the-absolute-value-in-standard-devia?rq=1) for further discussion. The real reason for the squaring comes from the Pythagorean theorem and the concept of Euclidean distance. We will discuss this point further when we talk about covariance and correlation coefficients later on in the course, so don't worry if this point is a little confusing at the moment.

The standard deviation $\sigma$ is simply defined as the square root of the variance $\sigma^2$.

$$\sigma = \sqrt{\frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n}}$$


In this section, I will show you how to calculate the variance and standard deviation of a dataset using the ```numpy.var``` and ```numpy.std``` functions respectively, and how to do it without those functions. Here are links to the reference pages so you can learn more about them:
- [numpy.var](https://numpy.org/doc/stable/reference/generated/numpy.var.html)
- [numpy.std](https://numpy.org/doc/stable/reference/generated/numpy.std.html)
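As a quick check of the formulas above, the sketch below computes the variance and standard deviation of a small made-up array (not the egg data) directly from the definitions and compares the results with ```numpy.var``` and ```numpy.std```. With their default settings, both functions divide by $n$, which matches the definition used here. The array values and variable names are assumptions chosen only for illustration.

```python
import numpy as np

# Toy measurements, made up for illustration (not the egg data)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance from the definition: the average squared distance from the mean
mu = np.mean(x)
var_manual = np.sum((x - mu) ** 2) / len(x)

# Standard deviation: the square root of the variance
std_manual = np.sqrt(var_manual)

# Compare with the built-in numpy functions (default ddof=0 divides by n)
print(var_manual, np.var(x))
print(std_manual, np.std(x))
```

For this toy array the mean is 5, so the variance is 4 and the standard deviation is 2, and the manual and built-in results should agree.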

### Example 2

In this example, I will show you how to calculate the variance of egg size manually, i.e. without using the built-in functions described above. To do so, we will use the formula for the variance given above.

In [None]:
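# We need numpy to load the data and matplotlib to plot it
import numpy as np
import matplotlib.pyplot as plt

eggarray = np.loadtxt('data/egg_measurements.csv', delimiter=',', skiprows=1) # skip the first row since it is a header
# print(eggarray)
eggsize = eggarray[:, 0]  # first column: EggSize
egglen = eggarray[:, 1]   # second column: EggLength

plt.hist(eggsize, bins=10)
plt.show()

# Variance from the formula: the average squared distance from the mean
mu = np.mean(eggsize)
variance = np.sum((eggsize - mu)**2) / len(eggsize)
print(variance)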

### Exercise 3

Now, calculate the variance of egg size, using ```numpy.var```.

In [None]:
# Your code goes here

### Exercise 4

Calculate the standard deviation of egg size, without using the built-in ```numpy.var``` or ```numpy.std``` functions. You are, however, allowed to use the ```numpy.sqrt``` function.

In [None]:
# Your code goes here