# Outline

- Background: Probability Distributions
- Characterizing a Distribution
- Normal Distributions
  - Central Limit Theorem
- Continuous Distributions Derived from the Normal Distribution
  - Chi-Square Distribution
  - T-Distribution
  - F-Distribution
- Other Continuous Distributions
  - Lognormal Distribution
  - Weibull Distribution
  - Exponential Distribution
  - Uniform Distribution
- `scipy.stats`

# Characterizing a distribution
- In practical data analysis we work with samples rather than whole populations
- Let's use the NumPy library to characterize a data set

In [1]:
import numpy as np
import pandas as pd

In [3]:
mtcars = pd.read_csv("../Datasets/mtcars.csv")
np.mean(mtcars.mpg)

np.float64(20.090625000000003)

In [4]:
np.median(mtcars.mpg)

np.float64(19.2)

The mode is the most frequently occurring value in a dataset.

In [5]:
from scipy import stats
stats.mode(mtcars.mpg)

ModeResult(mode=np.float64(10.4), count=np.int64(2))
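
A count of 2 is a hint that several values may tie for the most frequent one. As far as SciPy documents it, `stats.mode` then reports only a single value (the smallest of the tied ones). The sketch below lists every tied value; it uses a small hypothetical sample, not the mtcars data.

```python
import numpy as np

# Hypothetical sample in which two values tie for the highest count
sample = np.array([10.4, 10.4, 15.2, 15.2, 21.0, 22.8])

# np.unique returns the sorted distinct values and how often each occurs
values, counts = np.unique(sample, return_counts=True)

# keep every value whose count equals the maximum count
all_modes = values[counts == counts.max()]
print(all_modes)
```

For continuous measurements the mode is often not very informative, because most values occur only once or twice.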

In [6]:
# The geometric mean is another measure of a distribution's location (defined for positive values)
stats.gmean(mtcars.mpg)

np.float64(19.25006404155361)
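
To see what `stats.gmean` computes, the geometric mean of n positive values is the n-th root of their product, which is the same as exponentiating the mean of their logarithms. A minimal check on hypothetical values (not the mtcars data):

```python
import numpy as np
from scipy import stats

# Hypothetical positive values; the geometric mean needs x > 0
x = np.array([1.0, 2.0, 4.0, 8.0])

gm_scipy = stats.gmean(x)
# same quantity by hand: exp of the arithmetic mean of log(x)
gm_manual = np.exp(np.mean(np.log(x)))

print(gm_scipy, gm_manual)
```

Both values should agree up to floating-point rounding. The geometric mean is never larger than the arithmetic mean, which matches the mpg results above (19.25 versus 20.09).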

In [7]:
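# range = max - min; np.ptp means "peak to peak"
# (named mpg_range to avoid shadowing Python's built-in range)
mpg_range = np.ptp(mtcars.mpg)
print(mpg_range)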

23.5


In [8]:
# you can also find the range by subtracting the min from the max
max(mtcars.mpg) - min(mtcars.mpg)

23.5

In [9]:
# Percentiles invert the CDF: the p-th percentile is the value below which p percent of the data fall
# np.quantile takes fractions in [0, 1], so q=0.5 is the 50th percentile, i.e. the median
np.quantile(mtcars.mpg, q=[0.32, 0.5, 0.97])

array([16.352, 19.2  , 32.505])
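
The claim that the 50th percentile is the median can be checked directly. The sketch below uses a small hypothetical sample and assumes NumPy's default (linear interpolation) quantile method; `np.percentile` is the same function on a 0 to 100 scale.

```python
import numpy as np

# Hypothetical sample with an odd number of values
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0])

q50 = np.quantile(x, 0.5)     # fraction scale, 0 to 1
p50 = np.percentile(x, 50)    # percent scale, 0 to 100
med = np.median(x)

# fraction of observations at or below the 50th percentile
frac_at_or_below = np.mean(x <= q50)

print(q50, p50, med, frac_at_or_below)
```

With a finite sample the fraction at or below a quantile is only approximately the requested probability, because the empirical CDF moves in steps of 1/N.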

In [12]:
# finding the variance of a sample
# ddof = delta degrees of freedom: the divisor is N - ddof,
# so ddof=1 gives the sample variance (divide by N - 1) and the default ddof=0 divides by N
np.var(mtcars.mpg, ddof=1)

np.float64(36.32410282258064)

In [11]:
# the standard deviation is the square root of the variance
np.std(mtcars.mpg, ddof=1)

np.float64(6.026948052089104)
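
To make the role of `ddof` concrete, here is a small sketch that computes the variance by hand with both divisors and compares it with `np.var` and `np.std`. The data are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical sample: mean is 5, sum of squared deviations is 32
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

# sum of squared deviations from the sample mean
ss = np.sum((x - x.mean()) ** 2)

pop_var = ss / n            # ddof=0: divide by N (population formula)
sample_var = ss / (n - 1)   # ddof=1: divide by N - 1 (sample formula)

print(pop_var, np.var(x))                          # default ddof=0
print(sample_var, np.var(x, ddof=1))
print(np.sqrt(sample_var), np.std(x, ddof=1))      # std = sqrt(var)
```

Dividing by N - 1 compensates for the fact that the deviations are measured from the sample mean rather than the unknown population mean, which is why `ddof=1` is the usual choice when working with samples.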