# Stats Module 1: Basics of Distributions

In [None]:
# %load_ext autoreload
# %autoreload 2
import tests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import poisson

## Q1: This is how your grades are calculated.

Assume that a study measures the exam scores of 500 randomly selected people in Berkeley. The results of the survey are stored in ``scores.csv``.

### Q1a: reading in data
Using Pandas, read in the data as a DataFrame. Then, select the ``Scores`` column and convert that column to a numpy array.
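
As a rough sketch of the general pattern (not the graded solution), the snippet below reads a tiny in-memory CSV rather than ``scores.csv``. The three values and the names ``toy_csv``, ``toy_df`` and ``toy_scores`` are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; these values are invented for illustration.
toy_csv = io.StringIO("Scores\n71.5\n88.0\n93.25\n")

toy_df = pd.read_csv(toy_csv)              # parse the CSV text into a DataFrame
toy_scores = toy_df["Scores"].to_numpy()   # select one column as a numpy array

print(type(toy_scores).__name__, toy_scores.shape)
```

With a real file, the path to the file would take the place of the ``io.StringIO`` object.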

In [None]:
df = ...
scores = ...

In [None]:
tests.run('test_1a', scores)

### Q1b: histogram

Create a histogram of the scores above with 10 bins. **Make sure you label the axes appropriately.**

Hint: read the documentation for ``Axes.hist``.

In [None]:
# TODO

### Q1c: expected score

Calculate the expected value (mean) of the scores, store it in ``mean``, and print it out.

In [None]:
# TODO

### Q1d: standard deviation
Calculate the standard deviation of the scores, store it in ``std``, and print it out.

In [None]:
# TODO

### Q1e: 1 sigma

Calculate the probability that the score of a randomly chosen student in our survey is within 1.37 standard deviations of the mean.
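
To make the idea concrete, here is a minimal sketch on a small made-up array (not the survey data) with a multiplier of 1 instead of 1.37. It assumes the population standard deviation, which is what ``np.std`` computes by default; all names are illustrative.

```python
import numpy as np

# Toy data, invented for illustration; not the survey scores.
toy = np.array([2, 4, 4, 4, 5, 5, 7, 9])

toy_mean = np.mean(toy)   # sample mean
toy_std = np.std(toy)     # population standard deviation (ddof=0 is numpy's default)

k = 1.0  # number of standard deviations around the mean
within = np.abs(toy - toy_mean) <= k * toy_std   # boolean mask, True if inside the band
toy_prob = np.mean(within)                        # fraction of True entries

print(toy_mean, toy_std, toy_prob)
```

Taking the mean of a boolean mask is a compact way to get an empirical probability, since ``True`` counts as 1 and ``False`` as 0.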

In [None]:
prob = ...

In [None]:
tests.run('test_1e', prob)

### Q1f: Normal CDF

First, import ``norm`` from ``scipy.stats``. If you do not have ``scipy`` installed, install the package with pip.

As its name suggests, ``scipy`` is a package for scientific computing; among other things, it provides the normal distribution as ``norm``. Read the documentation for ``norm.cdf``, which evaluates the cumulative distribution function (CDF) of the normal distribution at a point ``x``. Then calculate the probability that a random sample from a normal distribution lies within 1.37 standard deviations of the mean. How does this compare to the probability we found in Q1e?

Hint: ``loc`` is the mean and ``scale`` is the std.
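
As a hedged illustration of how ``norm.cdf`` gives an interval probability, the sketch below uses 2 standard deviations and arbitrarily chosen parameters (``mu_demo`` and ``sigma_demo`` are made up). The probability of landing in an interval is the CDF at the upper end minus the CDF at the lower end.

```python
from scipy.stats import norm

# Illustrative parameters, chosen arbitrarily.
mu_demo, sigma_demo = 70.0, 10.0
k = 2.0  # number of standard deviations

upper = norm.cdf(mu_demo + k * sigma_demo, loc=mu_demo, scale=sigma_demo)
lower = norm.cdf(mu_demo - k * sigma_demo, loc=mu_demo, scale=sigma_demo)
p_within = upper - lower   # P(mu - k*sigma <= X <= mu + k*sigma)

# The same probability from the standard normal: it depends only on k.
p_standard = norm.cdf(k) - norm.cdf(-k)

print(p_within, p_standard)
```

Both numbers agree, which is why the answer does not depend on the particular mean and standard deviation.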

In [None]:
from scipy.stats import norm

true_prob = ...

In [None]:
tests.run('test_1f', true_prob)

### Q1g: Kind of understanding how your grades are calculated (EXTRA)

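This section has been done for you. It is a rough illustration of what we mean by the 'curve' when talking about grades at Berkeley. The method ``norm.ppf()`` takes a probability between 0 and 1 and returns the value below which that fraction of the distribution lies. When ``loc`` and ``scale`` are set to the mean and standard deviation of the scores, the result is a score; with the defaults (``loc=0``, ``scale=1``) it is a number of standard deviations from the mean. The cells below assume that ``mean`` and ``std`` hold your results from Q1c and Q1d. So to calculate the 99th percentile, we would do the following: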

In [None]:
ninety_ninth_percentile = norm.ppf(0.99, loc=mean, scale=std)
ninety_ninth_percentile

What this means is that a student at the ninety-ninth percentile (usually an A+) would have a score of about 126 under this normal model. You can also view the whole distribution of scores between the first and ninety-ninth percentiles by plotting it with matplotlib:

In [None]:
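# Scores at the 1st and 99th percentiles of the fitted normal distribution
first_percentile = norm.ppf(0.01, loc=mean, scale=std)
ninety_ninth_percentile = norm.ppf(0.99, loc=mean, scale=std)
x = np.linspace(first_percentile, ninety_ninth_percentile, len(scores))
plt.plot(x, norm.pdf(x, mean, std))
plt.xlabel('Score')
plt.ylabel('Probability density')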

However, we sometimes map this normal distribution onto the standard normal distribution, which has a mean of 0 and a standard deviation of 1. If you have ever been stressed about your grades like me, you have probably looked up what a z-score is, and you may even know how to calculate it. A z-score is simply a standardized random variable: $Z = (X - \mu)/\sigma$. In other words, if the random variable $X$ is the 'projected score you want', the z-score is your 'standardized projected score'. We can go from percentiles to z-scores and vice versa.
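
A small sketch of that round trip, assuming an arbitrary mean and standard deviation (``mu_demo``, ``sigma_demo`` and ``x_demo`` are made-up values, not taken from the survey):

```python
from scipy.stats import norm

# Illustrative mean, standard deviation and raw score, chosen arbitrarily.
mu_demo, sigma_demo = 70.0, 10.0
x_demo = 82.0

z = (x_demo - mu_demo) / sigma_demo   # standardize: z = (x - mean) / std
percentile = norm.cdf(z)              # z-score -> fraction of the distribution below it
z_back = norm.ppf(percentile)         # percentile -> z-score (inverse of the CDF)
x_back = mu_demo + z_back * sigma_demo   # undo the standardization

print(z, percentile, x_back)
```

Here ``norm.cdf`` and ``norm.ppf`` are used with their default ``loc=0`` and ``scale=1``, i.e. on the standard normal distribution.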

So, below we map the normal distribution of scores onto a standard normal distribution:

In [None]:
# Standardize the (rounded) scores: subtract the mean, divide by the standard deviation
z_scores = (np.round(scores, 0) - mean) / std
first_percentile = norm.ppf(0.01, loc=np.mean(z_scores), scale=np.std(z_scores))
ninety_ninth_percentile = norm.ppf(0.99, loc=np.mean(z_scores), scale=np.std(z_scores))
x = np.linspace(first_percentile, ninety_ninth_percentile, 500)
plt.plot(x, norm.pdf(x))  # standard normal PDF (loc=0, scale=1)

Your x-axis is now measured in standard deviations from the mean, and each z-score corresponds to a percentile. For example, we can find the percentile for a z-score of 1.2 from the data:

In [None]:
# Empirical percentile: the fraction of z-scores that fall below 1.2.
# (mean and std are left untouched so that earlier cells can be re-run.)
N = len(z_scores)
count = np.sum(z_scores < 1.2)
prob = count / N
print(prob)

Your z-score puts you in the 86.2nd percentile! 

## Q2: The Distribution named after Fish

A study shows that, on average, 50 people visit a local supermarket per day. The number of people who arrive on any given day is modeled by the Poisson distribution. For a random variable with a Poisson distribution (i.e. $X \sim \mathrm{Poisson}(\lambda)$), the expected value is the parameter $\lambda$.

To find the probability that we observe $x$ events in time $t$, we break up the time $t$ into $n$ intervals. For $n$ large enough, each interval is so short that at most one event can occur in it. The probability that an event occurs in a given interval is $\lambda(t/n)$, where $\lambda$ is the rate of events per unit time. Therefore, we can model each small time interval as a Bernoulli trial with $p=\lambda t/n$.
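
One way to check this argument numerically is to compare the binomial distribution for $n$ Bernoulli trials against the Poisson distribution with the same mean. The sketch below uses the supermarket numbers; the choice of $n = 10000$ and the range of counts are illustrative assumptions.

```python
import numpy as np
from scipy.stats import binom, poisson

lam, t = 50, 1      # rate and duration from the supermarket example
n = 10_000          # number of small intervals (an illustrative choice)
p = lam * t / n     # probability of one arrival in each interval

k = np.arange(0, 121)                   # counts at which to compare the two PMFs
binom_pmf = binom.pmf(k, n, p)          # n Bernoulli trials, each with probability p
poisson_pmf = poisson.pmf(k, lam * t)   # Poisson with mean lam * t

max_gap = np.max(np.abs(binom_pmf - poisson_pmf))
print(max_gap)
```

The largest gap between the two PMFs is small for this $n$, and it should shrink further as $n$ grows.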

### Q2a: a single Poisson process

Write a function ``poisson_process`` which simulates the number of arrivals for a single Poisson process of rate ``lam`` during time ``t`` with ``N`` time increments. For example, ``poisson_process(lam=50, t=1, N=10000)`` would simulate the number of shoppers that visit the supermarket for a given day. The function should proceed as follows:
1. Break ``t`` into ``N`` increments of time $\delta t = t/N$
2. The probability that a shopper arrives in one small time increment $\delta t$ is $\lambda\cdot \delta t$. Using a random number generator (```np.random.rand```), determine by chance if a shopper actually arrives in this time increment.
3. Repeat the previous step for all ``N`` increments, and return the total number of shoppers that arrive in your simulation.
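
The key trick in step 2 is turning a uniform random number into a yes/no outcome. The sketch below shows only that building block, with a made-up probability ``p_demo`` and trial count ``n_demo``; it is not the full ``poisson_process`` function.

```python
import numpy as np

np.random.seed(0)   # fixed seed so the sketch is repeatable

p_demo = 0.3        # illustrative success probability
n_demo = 100_000    # number of independent trials

# np.random.rand draws uniform numbers in [0, 1). Each draw is below p_demo
# with probability p_demo, so each comparison is one Bernoulli trial.
successes = np.random.rand(n_demo) < p_demo

frac = successes.mean()   # empirical success frequency
print(frac)
```

The observed frequency should land close to ``p_demo``, which is what makes the comparison a valid Bernoulli trial.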

In [None]:
def poisson_process(lam, t, N):
    # TODO
    ...

In [None]:
# Since your simulation is probabilistic, there is a small (~2 percent) chance of failing the test.
# Failing twice in a row is very unlikely, so re-run this cell once before debugging.
tests.run('test_2a', poisson_process)

### Q2b: a distribution of Poisson processes

Your function ``poisson_process`` simulates a single instance of the Poisson process (e.g. recording the number of shoppers in a single day). To get a sense of the distribution of the number of shoppers, record the results of 1000 Poisson processes in the array ``x``. We will continue with the shoppers example, so let ``lam=50, t=1, N=10000``.

In [None]:
x = ...

In [None]:
# since your simulation is probabilistic, there is a small (but non-zero) chance of failing the test 
# re-run the cell above if you believe this is the case
tests.run('test_2b', x)

### Q2c: plots!

Already provided for you is the code for plotting the expected Poisson distribution for the number of shoppers. Using your simulation results from above, plot (on the same figure) a *normalized* histogram of the simulated number of shoppers with 20 bins. Make sure to label your axes appropriately!

Hint: to create a normalized histogram, use ``density=True``.
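
To see what normalization does, the following sketch uses ``np.histogram`` on synthetic data (not your simulation results), so no plot is needed. With ``density=True`` the bar heights are rescaled so that the total area of the bars is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)   # synthetic data, for illustration only

# np.histogram accepts the same density flag as Axes.hist.
density, edges = np.histogram(samples, bins=20, density=True)
widths = np.diff(edges)           # width of each bin

area = np.sum(density * widths)   # total area under the normalized histogram
print(area)
```

Because the area is fixed at 1, a normalized histogram can be compared directly with a probability distribution drawn on the same axes.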

Do the plots agree? 

In [None]:

fig, ax = plt.subplots()

# TODO

mu = 50
xx = np.arange(poisson.ppf(0.005, mu),poisson.ppf(0.995, mu))
ax.plot(xx, poisson.pmf(xx, mu), '-', label='theory')
ax.legend()
ax.set_xlabel('Number of shoppers in a day')
ax.set_ylabel('Probability')

plt.show()

## Submission

Check to make sure that you have answered all questions. Run all the cells so that all output is visible. Finally, export this notebook as a PDF (File/Download As/PDF via LaTeX (.pdf)) and submit to bCourses.

Created and edited by the ULAB staff. Last updated: December 2021.