# Introduction to Probability and Data with R  
Duke University, on Coursera

# I. Theory

## 1. Data Basics

Numerical: continuous, discrete  
Categorical: regular categorical, ordinal

Example:  
> Student height → continuous numerical  
> Whether a student has previously taken a statistics course → categorical  
> Customer satisfaction: very unsatisfied, unsatisfied, satisfied, very satisfied → ordinal categorical.  
> Population of each state in the US → continuous numerical ← <b>WRONG</b>. Counts are discrete numerical variables because they cannot take non-whole values.  
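
The four variable types above map onto R storage types roughly as follows. This is a minimal sketch: the `students` data frame, its column names, and all values are made up for illustration.

```r
# Hypothetical student records; every value here is invented.
students <- data.frame(
  height_cm    = c(170.2, 165.5, 180.1, 158.7),        # continuous numerical
  n_siblings   = c(0L, 2L, 1L, 3L),                    # discrete numerical (a count)
  took_stats   = factor(c("yes", "no", "no", "yes")),  # regular categorical
  satisfaction = factor(
    c("satisfied", "very unsatisfied", "very satisfied", "unsatisfied"),
    levels  = c("very unsatisfied", "unsatisfied", "satisfied", "very satisfied"),
    ordered = TRUE                                     # ordinal categorical
  )
)

str(students)                          # storage type of each column

# Order comparisons are only meaningful for ordered factors
students$satisfaction > "unsatisfied"
```

Storing ordinal data as an ordered factor keeps the level order available to comparisons and sorting, which a plain factor does not.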

## 2. Studies: Observational & Experiment

> <b> 1. Observational study: </b>   
Collect data in a way that does not directly interfere with how the data arise ("observe")  
No random assignment.   
For the most part, it allows us to make only correlational statements.  
```
  - retrospective: uses past data
  - prospective: data are collected throughout the study.  
```  
> <b> 2. Experiments: </b> Randomly assign subjects to treatments  
They allow us to infer causation.  

>  <b>Confounding variables </b>: Extraneous variables that affect both the explanatory and the response variable, and that make it seem like there is a relationship between them.  

<h3> Note </h3>  
<b> - Correlation does not imply causation </b>  

<b> - Random assignment is the most important difference between observational studies and experiments.</b>
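
A small simulation can show how a confounder produces a correlation without causation. This is a sketch under made-up assumptions: the variable names (`temperature`, `ice_cream`, `sunburns`), the coefficients, and the seed are all hypothetical.

```r
set.seed(1)                                        # arbitrary seed, for reproducibility only
n <- 1000

temperature <- rnorm(n, mean = 25, sd = 5)         # hypothetical confounder
ice_cream   <- 2 * temperature + rnorm(n, sd = 5)  # driven by temperature only
sunburns    <- 3 * temperature + rnorm(n, sd = 5)  # driven by temperature only

# Strongly correlated, although neither variable causes the other
cor(ice_cream, sunburns)

# Once the confounder is included in the model, the ice_cream coefficient is close to zero
coef(lm(sunburns ~ ice_cream + temperature))["ice_cream"]
```

In a real observational study the confounder is often unmeasured, so this kind of adjustment is not always possible.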

## 3. Sampling and sources of bias

<h3>A few sources of sampling bias</h3>  

> <b> 1. Convenience sample: </b> Individuals who are easily accessible are more likely to be included in the sample.  
> <b> 2. Non-response: </b> Occurs when only a (non-random) fraction of the randomly sampled people respond to a survey, so the sample is no longer representative of the population.  
> <b> 3. Voluntary response: </b> Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. 
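
The effect of a convenience sample can be sketched with a simulation. The scenario is hypothetical: an online survey about daily hours spent online, where heavy users are assumed to be more likely to be reached. The population values are simulated, not real data.

```r
set.seed(2)                                   # arbitrary seed
hours_online <- rexp(10000, rate = 1 / 3)     # simulated population, mean about 3 hours

# Simple random sample: every person is equally likely to be chosen
srs <- sample(hours_online, 200)

# Convenience sample: chance of inclusion grows with hours spent online
convenience <- sample(hours_online, 200, prob = hours_online)

c(population  = mean(hours_online),
  srs         = mean(srs),
  convenience = mean(convenience))
```

The simple random sample should land near the population mean, while the convenience sample overestimates it.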

<h3>Sampling methods</h3>

> <b> 1. Simple Random Sample (SRS): </b> Each case is equally likely to be selected.  
> <b> 2. Stratified Sample: </b> Divide the population into homogeneous groups called strata (e.g. male and female), then randomly sample from within each stratum.  
> <b> 3. Cluster Sample: </b> Divide the population into clusters (each heterogeneous within itself, unlike strata), randomly sample a few clusters, then include all observations within those clusters.  
> <b> 4. Multistage Sample: </b> Divide the population into clusters (each heterogeneous within itself, unlike strata), randomly sample a few clusters, then randomly sample within those clusters.  
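
The four sampling methods can be sketched in base R. The population below (1,000 students in 20 schools) and all sample sizes are hypothetical choices made for illustration.

```r
set.seed(3)                                   # arbitrary seed
# Hypothetical population: 1,000 students spread over 20 schools
pop <- data.frame(
  id     = 1:1000,
  gender = sample(c("F", "M"), 1000, replace = TRUE),
  school = rep(1:20, each = 50)
)

# 1. Simple random sample: 100 students, each equally likely
srs <- pop[sample(nrow(pop), 100), ]

# 2. Stratified sample: 50 students drawn at random within each gender stratum
strata     <- split(pop, pop$gender)
stratified <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), 50), ]))

# 3. Cluster sample: pick 4 schools at random, keep every student in them
chosen  <- sample(unique(pop$school), 4)
cluster <- pop[pop$school %in% chosen, ]

# 4. Multistage sample: pick 4 schools at random, then 25 students within each
stage1     <- sample(unique(pop$school), 4)
multistage <- do.call(rbind, lapply(stage1, function(k) {
  s <- pop[pop$school == k, ]
  s[sample(nrow(s), 25), ]
}))
```

Cluster and multistage sampling differ only in the last step: the former keeps whole clusters, the latter samples again inside them.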

## 4. Experimental Design

<h3>The 4 principles of experimental design</h3>  

> <b> 1. Control: </b> To compare treatment of interest to a control group.  
> <b> 2. Randomize: </b> To randomly assign subjects to treatments.  
> <b> 3. Replicate: </b> To collect a sufficiently large sample or replicate the entire study.  
> <b> 4. Block: </b> To block for variables known or suspected to affect the outcome.  

<h4>For example: Design an experiment investigating whether energy gels help you run faster. </h4>

> Treatment group: energy gel <br> 
> Control group: No energy gel <br> 
> Since energy gels might affect pro and amateur athletes differently, we <b>block</b> for pro status:  
    - Divide the sample into pro and amateur athletes <br>
    - Within each block, randomly assign athletes to the treatment and control groups.  
    --> Pro and amateur athletes are equally represented in both groups  
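
The blocked assignment above could be carried out along these lines. This is a sketch only: the `runners` data frame, the group labels, and the counts (16 pros, 24 amateurs) are hypothetical.

```r
set.seed(4)                                   # arbitrary seed
# Hypothetical sample of 40 runners: 16 pros and 24 amateurs
runners <- data.frame(
  id     = 1:40,
  status = rep(c("pro", "amateur"), times = c(16, 24))
)

# Within each block, shuffle an equal number of treatment and control labels
runners$group <- NA_character_
for (s in unique(runners$status)) {
  rows   <- which(runners$status == s)
  labels <- rep(c("gel", "no gel"), length.out = length(rows))
  runners$group[rows] <- sample(labels)
}

table(runners$status, runners$group)          # each block is split evenly
```

Randomizing separately within each block guarantees the even split, whereas a single randomization over all 40 runners would only balance pro status on average.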
    
<h4> Blocking vs. explanatory variables </h4>  

> Explanatory variables (aka factors): conditions we can impose on experimental units.  
> Blocking variables: Characteristics that experimental units come with, that we would like to control for.  
> Blocking is like stratifying:  
> - blocking during random assignment  
> - stratifying during random sampling  

<h4> Example: </h4>  
A study is designed to test the effect of light level and noise level on the exam performance of students. The researchers also believe that light and noise levels might have different effects on males and females, so they want to make sure both genders are represented equally under the different conditions. <br>
Therefore, there are <b>2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance)</b>. The researchers are interested in the effect of light and noise on exam performance, so light and noise are the explanatory variables and exam performance is the response variable. Gender is a nuisance variable they want to control for, so they block for it. Unlike light and noise, gender is not a treatment imposed on the subjects.

<h4> Random sampling vs. Random assignment </h4>  

> <b>Random sampling</b>: Population -> sample (subjects are selected for the study). If each subject in the population is equally likely to be selected, the resulting sample is likely to be representative of the population, so the results are <b>generalizable</b>.  
> <b>Random assignment</b>: Sample -> treatment groups (subjects are assigned to the various treatments). Random assignment ensures that the different characteristics of subjects are represented equally, on average, in the treatment and control groups. This allows us to attribute any observed difference between the groups to the treatment, so we can draw <b>causal conclusions</b> from the study.   

<h4> Confounder (Confounding) variables </h4>  

> E.g. <b>Conduct a study evaluating whether people read Serif or Sans Serif fonts faster.</b>  
> Design: Population --> Sample --> Assign to two treatment groups: Group 1 reads a text in a Serif font and Group 2 reads the same text in a Sans Serif font, then compare reading speeds. Through random assignment, we ensure that <b> other factors that may contribute to reading speed, for example fluency or how often the subject reads for leisure, are represented equally in the two groups. We call such variables confounders or confounding variables. </b> <br>  
 
> Note: Stratified sampling allows for controlling for possible confounders in the sampling stage, while blocking allows for controlling for such variables during random assignment.
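
To see how random assignment tends to balance a possible confounder, here is a short simulation of the font study. The `readers` data frame, the fluency scores, and the group sizes are hypothetical.

```r
set.seed(5)                                   # arbitrary seed
# Hypothetical sample of 200 readers with a baseline fluency score
readers <- data.frame(fluency = rnorm(200, mean = 100, sd = 15))

# Random assignment: shuffle 100 "serif" and 100 "sans" labels
readers$font <- sample(rep(c("serif", "sans"), each = 100))

# Mean fluency should be similar in the two groups
tapply(readers$fluency, readers$font, mean)
```

The balance holds on average rather than exactly. If a confounder is known in advance, blocking on it gives tighter control than randomization alone.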

#### Install the statsr R package

```
library(devtools)

# Uncomment and run the code below once to install the statsr package
# devtools::install_github("statswithr/statsr",
#                          dependencies=TRUE,
#                          upgrade_dependencies = TRUE)
```

# II. Exploratory Data Analysis and Introduction to Inference

# III. Introduction to Probability

<b>Frequentist interpretation (traditional method)</b>: The probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times.  
<b>Bayesian interpretation</b>: A Bayesian interprets probability as a subjective degree of belief. E.g. for the same event, two separate people could have different viewpoints and so assign different probabilities to it.  
This interpretation allows for prior information to be integrated into the inferential framework.

<b>Law of large numbers</b> states that as more observations are collected, the proportion of occurrences of a particular outcome converges to the probability of that outcome.
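
The law of large numbers can be illustrated by simulating rolls of a fair die and tracking the proportion of sixes. The number of rolls and the seed are arbitrary choices.

```r
set.seed(6)                                   # arbitrary seed
rolls <- sample(1:6, 10000, replace = TRUE)   # simulated rolls of a fair die

# Running proportion of sixes after each roll
running_prop <- cumsum(rolls == 6) / seq_along(rolls)

# Proportion after 10, 100, 1000 and 10000 rolls, compared with 1/6
running_prop[c(10, 100, 1000, 10000)]
1 / 6
```

Early proportions can be far from 1/6, but they settle toward it as the number of rolls grows.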

<b>Disjoint events (mutually exclusive)</b>: events that cannot happen at the same time. P(A and B) = 0  
<b>Non-disjoint events</b>: events that can happen at the same time. P(A and B) > 0  
<b>Independent events</b>: A and B are independent if P(A and B) = P(A) × P(B)  
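
These definitions can be checked by enumerating the outcomes of one fair die. The helper `p()` and the event names are hypothetical, chosen only for this sketch.

```r
outcomes <- 1:6                                # sample space of one fair die
p <- function(event) length(event) / length(outcomes)

even <- c(2, 4, 6)
odd  <- c(1, 3, 5)
low  <- c(1, 2)
high <- c(4, 5, 6)

p(intersect(even, odd))                        # 0: disjoint events
p(intersect(even, high))                       # greater than 0: non-disjoint events

# Independence check: does P(A and B) equal P(A) * P(B)?
isTRUE(all.equal(p(intersect(even, low)),  p(even) * p(low)))    # TRUE: independent
isTRUE(all.equal(p(intersect(even, high)), p(even) * p(high)))   # FALSE: dependent
```

Note that disjoint events with non-zero probabilities are never independent: knowing that one occurred tells you the other did not.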