# Course: Statistical Thinking in Python - Part 1
[Course link](https://www.datacamp.com/courses/statistical-thinking-in-python-part-1)


## Chapter 1: Graphical exploratory data analysis
[Slides](slides/Statistical Thinking in Python - Part 1/ch1_slides.pdf)

- Classic book [Exploratory Data Analysis](https://www.amazon.com/Exploratory-Data-Analysis-John-Tukey/dp/0201076160) by John Tukey (1977)
- **Histograms**
- Be aware of "binning bias": the same data can be interpreted differently depending on the choice of bins. **Swarm plots** help mitigate this.
- **Empirical Cumulative Distribution Functions (ECDFs)**
  - Among the most important plots in statistical analysis. A good starting point in EDA.
  - An ECDF shows all the data and gives a complete picture of how they are distributed.
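
Binning bias can be seen with a small sketch. The measurements below are made up for illustration; the only point is that the same array yields a different picture for each bin count passed to `np.histogram`.

```python
import numpy as np

# Hypothetical measurements, invented purely for illustration.
data = np.array([1.0, 1.2, 1.9, 2.1, 2.2, 2.8, 3.1, 3.9, 4.0, 4.8])

# Same data, two different bin counts.
counts_coarse, edges_coarse = np.histogram(data, bins=2)
counts_fine, edges_fine = np.histogram(data, bins=5)

# Every observation is counted once in both cases,
# but the shape suggested by the counts differs.
print("2 bins:", counts_coarse)
print("5 bins:", counts_fine)
```

An ECDF avoids this choice entirely because it has no bins.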

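```python
import numpy as np


# A typical ECDF function
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)              # number of data points
    x = np.sort(data)          # x-values: the sorted data
    y = np.arange(1, n+1) / n  # y-values: fraction of points <= each x

    return x, y
```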
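
As a rough usage sketch, the function can be applied to a small array to read off the fraction of observations at or below a given value. The function is repeated here so the block runs on its own; the `measurements` array and the threshold of 3.0 are assumptions chosen for illustration.

```python
import numpy as np


def ecdf(data):
    """Return sorted values x and cumulative fractions y for 1-D data."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y


# Hypothetical measurements for illustration only.
measurements = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
x, y = ecdf(measurements)

# y[i] is the fraction of observations less than or equal to x[i].
# Find the last position where x <= 3.0 and read off the fraction there.
idx = np.searchsorted(x, 3.0, side="right") - 1
frac_at_most_3 = y[idx]
print(frac_at_most_3)
```

With matplotlib available, the usual way to draw the result is `plt.plot(x, y, marker=".", linestyle="none")`.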

## Chapter 2: Quantitative exploratory data analysis
[Slides](slides/Statistical Thinking in Python - Part 1/ch2_slides.pdf)

- Focus: *Summary statistics*

- mean: `np.mean()`
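- median: `np.median()` (much less affected by outliers than the mean)
- **Box plots** show the 25th, 50th, and 75th percentiles as the box. The span from the 25th to the 75th percentile, which contains the middle 50% of the data, is called the interquartile range (IQR). The whiskers extend 1.5 times the IQR beyond the box or to the extent of the data, whichever is less extreme. (Box plots were invented by Tukey!)
- There is no single definition of an outlier, but a common criterion is being more than 2 IQRs away from the median.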
- variance: `np.var()`
- standard deviation: `np.std()`
- **scatter plots**
- **Covariance** - A measure of how two quantities vary together. 
  - `np.cov()`
  - `covariance = mean of [ (x(i) - x(mean)) * (y(i) - y(mean)) ]`
- **Pearson correlation coefficient** - A "unitless" or "dimensionless" measure of covariance. 
  - `np.corrcoef()`
  - **`𝜌`** ` = Pearson correlation = covariance / ((std of x) * (std of y))`
  - or `variability due to codependence / independent variability`
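
The box-plot quantities and the outlier rule of thumb above can be sketched with `np.percentile`. The sample is invented, and the 2-IQR threshold is the criterion from these notes rather than a universal definition.

```python
import numpy as np

# Hypothetical sample with one extreme value.
data = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# The three percentiles that define the box of a box plot.
q25, q50, q75 = np.percentile(data, [25, 50, 75])
iqr = q75 - q25

# Rule of thumb from the notes: more than 2 IQRs away from the median.
is_outlier = np.abs(data - q50) > 2 * iqr
outliers = data[is_outlier]

# The extreme value pulls the mean upward but leaves the median in place.
print("mean:", np.mean(data), "median:", np.median(data))
print("IQR:", iqr, "outliers:", outliers)
```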
 

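```python
import numpy as np


def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # np.corrcoef returns the 2x2 correlation matrix of x and y
    corr_mat = np.corrcoef(x, y)
    # The off-diagonal entry is the correlation between x and y
    return corr_mat[0,1]
```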
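
To connect the formulas to the NumPy calls, here is a minimal sketch that computes the covariance and the Pearson coefficient by hand and compares them with `np.cov` and `np.corrcoef`. The arrays are made up. One assumption worth flagging: the "mean of products" formula divides by *n*, while `np.cov` divides by *n* - 1 unless `bias=True` is passed.

```python
import numpy as np

# Hypothetical paired measurements.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Covariance as the mean of the products of deviations from the means.
cov_xy = np.mean((x - np.mean(x)) * (y - np.mean(y)))

# Pearson correlation: covariance divided by the product of the standard deviations.
rho = cov_xy / (np.std(x) * np.std(y))

# NumPy equivalents; bias=True makes np.cov normalize by n, like the formula above.
cov_np = np.cov(x, y, bias=True)[0, 1]
rho_np = np.corrcoef(x, y)[0, 1]

print(cov_xy, cov_np)
print(rho, rho_np)
```

The correlation coefficient is the same whichever normalization is used, because the factor cancels between numerator and denominator.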

## Chapter 3: Thinking probabilistically-- Discrete variables
[Slides](slides/Statistical Thinking in Python - Part 1/ch3_slides.pdf)

Numpy `random` module:
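- `np.random.random()` - Draws pseudo-random numbers in the interval [0, 1). Pseudo-random means the same seed will always produce the same sequence.
- `np.random.seed()` - Sets the seed of the random number generator.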
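
A quick sketch of what seeding does: resetting the seed to the same value reproduces the same draws. The seed value 42 and the coin-flip interpretation are arbitrary choices for illustration.

```python
import numpy as np

# Seed, then draw four numbers in [0, 1).
np.random.seed(42)
first = np.random.random(size=4)

# Re-seed with the same value: the draws repeat exactly.
np.random.seed(42)
second = np.random.random(size=4)

print(np.array_equal(first, second))

# Treating each draw below 0.5 as "heads" simulates four fair coin flips.
heads = first < 0.5
print(heads)
```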

Probability:
- **Statistical inference** - The process by which we go from measured data to probabilistic conclusions about what we might expect if we collected the same data again.  Actionable decisions can be made on these conclusions.
- **Probability** "precisely defines uncertainty."
  - The probability *p* of an event can be estimated as the number of occurrences *n* divided by the number of samples.
  - `p = n/samples`
- **Probability Mass Function (PMF)** - The set of probabilities of *discrete* outcomes.
  - Example: One roll of a die has a "**discrete uniform** distribution."
- **Distribution** - A mathematical description of outcomes.
- **Binomial distribution** - Example: the number of heads in *n* tosses of a coin. (A single toss is one Bernoulli trial.)
  - `np.random.binomial(n, p, size=m)`
  - The "story": The number of successes *r* in *n* Bernoulli trials with probability *p* of success is *binomially* distributed.  (`size=m` sets the number of samples drawn.)
  - (A Bernoulli trial is just an event with a boolean outcome.)
- **Poisson process** - The timing of the next event is completely independent of when the previous event occurred.
  - The number of arrivals *r* of a Poisson process in a given time interval with average rate of *λ* arrivals per interval is *Poisson distributed*.
  - **Poisson distribution** is a limit of (approximates) the Binomial distribution for *rare* events.  (An event is binary -- it either happens or doesn't.)
  - In other words: The Poisson distribution with arrival rate equal to *np* approximates a Binomial distribution for *n* Bernoulli trials with probability *p* of success (with *n* large and *p* small). 
  - `np.random.poisson(lam, size=m)`
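
The Poisson limit of the Binomial distribution can be checked by simulation. This is a sketch under assumed values (*n* = 1000, *p* = 0.005, 100,000 samples, seed 0), none of which come from the course. It also estimates a probability as occurrences divided by samples, as defined above.

```python
import numpy as np

np.random.seed(0)

# Rare events: many trials, small success probability.
n, p, m = 1000, 0.005, 100_000

# m samples from each distribution; the Poisson rate is n * p.
binom_samples = np.random.binomial(n, p, size=m)
poisson_samples = np.random.poisson(n * p, size=m)

# Both means should be close to n * p = 5.
print(binom_samples.mean(), poisson_samples.mean())

# Estimated probability of 10 or more events: occurrences / samples.
p_binom = np.sum(binom_samples >= 10) / m
p_poisson = np.sum(poisson_samples >= 10) / m
print(p_binom, p_poisson)
```

Making *p* larger while holding *np* fixed (for example *n* = 10, *p* = 0.5) should make the two sets of samples visibly disagree, which is why the approximation is stated for rare events.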



## Chapter 4: Thinking probabilistically-- Continuous variables
[Slides](slides/Statistical Thinking in Python - Part 1/ch4_slides.pdf)

- **Probability Density Function (PDF)** - The continuous analog of the PMF: a mathematical description of the relative likelihood of observing a value of a *continuous* variable.
  - The probability of observing a value in a given range is the *area under the curve* of the PDF over that range.
- **Normal (or Gaussian) distribution** - Describes a continuous variable whose PDF has a single symmetric peak.
  - `np.random.normal(mean, std)`
- **Exponential distribution**
  - The *time between events* **τ** (tau) in a Poisson process is exponentially distributed.
  - `np.random.exponential(mean)`
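
Both samplers can be exercised with a short simulation. The parameters below (mean waiting time 2.0, Normal mean 10.0 and standard deviation 3.0, seed 1) are assumptions for illustration. The last step estimates an area under the Normal PDF as a fraction of samples.

```python
import numpy as np

np.random.seed(1)

# Hypothetical mean time between events in a Poisson process.
mean_wait = 2.0
waits = np.random.exponential(mean_wait, size=100_000)

# Hypothetical Normal distribution: mean 10, standard deviation 3.
mu, sigma = 10.0, 3.0
normal_samples = np.random.normal(mu, sigma, size=100_000)

# The sample mean of the waiting times should be close to mean_wait.
print(waits.mean())

# Probability as area under the PDF: the fraction of samples within one
# standard deviation of the mean should be roughly 0.68.
frac_within_1sd = np.mean(np.abs(normal_samples - mu) < sigma)
print(frac_within_1sd)
```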

---
# Course: Statistical Thinking in Python - Part 2
[Course link](https://www.datacamp.com/courses/statistical-thinking-in-python-part-2)


## Chapter 1: Parameter estimation by optimization
[Slides](slides/Statistical Thinking in Python - Part 2/ch1_slides.pdf)

- **Optimal parameters** - Parameter values that bring the model in closest agreement with the data.
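- **Least squares** - The *process* of finding the parameters for which the sum of the squares of the residuals is minimal. A residual is the vertical distance between a data point and the fitted line, and the quantity being minimized is known as the RSS (*residual sum of squares*).
  - `slope, intercept = np.polyfit(x_data, y_data, 1)`
  - (A line is a first-degree polynomial, so the degree argument is 1.)
- Equation for a line: `y = ax + b`, where `a` is the slope and `b` is the intercept.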
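
A minimal sketch of a least-squares line fit, assuming made-up data and a hypothetical helper `rss` that is not part of NumPy. It checks that nudging the fitted parameters in either direction increases the residual sum of squares.

```python
import numpy as np

# Hypothetical data that roughly follow a straight line.
x_data = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_data = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Fit a first-degree polynomial: returns the slope a and intercept b.
slope, intercept = np.polyfit(x_data, y_data, 1)


def rss(a, b):
    """Residual sum of squares for the line y = a*x + b (illustrative helper)."""
    residuals = y_data - (a * x_data + b)
    return np.sum(residuals ** 2)


best = rss(slope, intercept)

# Any nearby parameter values give a larger RSS than the fitted ones.
print(best, rss(slope + 0.1, intercept), rss(slope, intercept - 0.1))
```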

## Chapter 2: Bootstrap confidence intervals
[Slides](slides/Statistical Thinking in Python - Part 2/ch2_slides.pdf)

- **Bootstrapping** - The use of resampled data to perform statistical inference.
- **Bootstrap sample** - An array that was drawn from the original data with replacement.
- **Bootstrap replicate** - A statistic computed from a resampled array.
- `np.random.choice(data, size=len(data))` draws a bootstrap sample (sampling with replacement is the default).
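
The three definitions above fit together as in the following sketch, which builds a 95% bootstrap confidence interval for the mean. The data, the seed, the 10,000 replicates, and the helper name `bootstrap_replicate` are all assumptions for illustration, not the course's own code.

```python
import numpy as np

np.random.seed(3)

# Hypothetical measurements.
data = np.array([4.8, 5.1, 5.5, 4.9, 6.2, 5.0, 5.7, 5.3, 4.6, 5.9])


def bootstrap_replicate(data, func):
    """Draw one bootstrap sample and compute one replicate from it."""
    # Bootstrap sample: same size as the data, drawn with replacement.
    sample = np.random.choice(data, size=len(data))
    # Bootstrap replicate: the statistic computed on that sample.
    return func(sample)


# Many replicates of the mean.
replicates = np.array([bootstrap_replicate(data, np.mean) for _ in range(10_000)])

# 95% confidence interval from the 2.5th and 97.5th percentiles.
ci_low, ci_high = np.percentile(replicates, [2.5, 97.5])
print(ci_low, np.mean(data), ci_high)
```

Passing a different function, such as `np.median` or `np.std`, gives a confidence interval for that statistic instead.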

## Chapter 3: Introduction to hypothesis testing
[Slides](slides/Statistical Thinking in Python - Part 2/ch3_slides.pdf)

## Chapter 4: Hypothesis test examples
[Slides](slides/Statistical Thinking in Python - Part 2/ch4_slides.pdf)

## Chapter 5: Putting it all together: a case study
[Slides](slides/Statistical Thinking in Python - Part 2/ch5_slides.pdf)