To start a SAS Academic session, run:

In [10]:
import saspy
my_session = saspy.SASsession()

Using SAS Config named: oda
SAS Connection established. Subprocess id is 640



You can then type SAS commands. "The %%SAS magic enables you to submit the contents of a cell to your SAS session. The cell magic executes the contents of the cell and returns any results."

In [5]:
%%SAS my_session
proc print data=sashelp.class;
run;

data work.a;
  set sashelp.cars;
run;

'Invalid SAS Session object supplied'

## Discrete random variables
* Probability Mass Function (PMF)
  * Pick a point and the PMF shows the probability of that point.
* Cumulative Distribution Function (CDF)

## Continuous random variables
Continuous variables take on any value in an interval; that is, their **support** is an interval such as (0,$\infty$). To summarize the distribution of continuous variables, we commonly summarize center and spread. 
* Population average (mean), or the expected value, **measures center**
$$
\mu = E(Y) = \int_{-\infty}^\infty yf(y)dy
$$
$\mu$ is the average value of the random variable we would get if we repeated the experiment again and again. With discrete random variables, we multiplied values by their probabilities and summed them. With continuous variables, since each individual value has a probability of 0, we take the value of _y_ and multiply it by its density value, f(y), then integrate. In words: "y times f(y), integrated from negative infinity to infinity".
* Population variance (and standard deviation) **measure spread**. Standard deviation, which is the square root of variance, is in the units of our random variable and is often easier to interpret. It can be thought of, roughly, as the typical distance from the mean.
$$
\sigma^2 = \text{Var(Y)} = E[(Y-\mu)^2] = E(Y^2)-[E(Y)]^2 \\
\sqrt{\sigma^2} = \sigma = \text{SD(Y)}
$$
* We cannot assign a positive probability to any individual value (with infinitely many fractional values in any interval, each single value has probability 0). 
* To describe their distributions, we need to look at the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF).

### PDF = f(y)
PDF = f(y) = smoothed curve that shows the "relative likelihood" of observing y; where the curve is higher, a tiny interval of values around y is more likely (even though every single value has probability 0). Probabilities in a PDF correspond to the **area under the curve** (rather than the height of a single bar, as with a PMF) and the total area under the curve must equal 1. 
    <center><img src="continuous_pdf.png" style="width:400px"/></center>
    <center><img src="continuous_pdf_2.png" style="width:400px"/></center>
    The value of the shaded blue area is the probability of us getting a value between 5 and 10 (i.e. the probability of us spending between 5-10 minutes reading a news article). Mathematically, this area is found using integration. 
    $$
    P(a \lt Y \lt b) = \int_a^b f(y)dy
    $$
    Individual PDF values can be larger than 1, but the total area under the curve must be 1.

#### Example
Consider a continuous random variable _X_ that denotes the time a person waits for an elevator to arrive. The PDF of _X_ is given by
$$
f(x) = \left\{
\begin{array}{ll}
      x, & \text{for } 0 \le x \le 1 \\
      2-x, & \text{for } 1 \lt x \le 2 \\
      0, & \text{otherwise} \\
\end{array} 
\right.  
$$

The expected value of _X_, _E(X)_, is therefore
$$
E(X) = \int_0^1 x\cdot xdx + \int_1^2 x\cdot(2-x)dx = \int_0^1 x^2dx + \int_1^2 (2x-x^2)dx = \frac{1}{3} + \frac{2}{3} = 1
$$
Thus, we expect that a person will wait an average of 1 minute for the elevator.
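
The integrals in this example can be checked numerically. The following is a minimal sketch, assuming a simple midpoint rule is accurate enough here; the helper name `integrate` and the step count are illustrative choices, not part of the original notes. It also computes the variance with $E(X^2)-[E(X)]^2$.

```python
def f(x):
    """Triangular PDF from the elevator example."""
    if 0 <= x <= 1:
        return x
    if 1 < x <= 2:
        return 2 - x
    return 0.0


def integrate(g, a, b, n=20_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))


total_area = integrate(f, 0, 2)                          # area under the PDF
mean = integrate(lambda x: x * f(x), 0, 2)               # E(X)
second_moment = integrate(lambda x: x**2 * f(x), 0, 2)   # E(X^2)
variance = second_moment - mean**2                       # E(X^2) - [E(X)]^2

print(total_area, mean, variance)
```

The total area should come out very close to 1 and the mean very close to 1, matching the hand calculation; the variance works out to 1/6.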

### CDF = F(y)
The CDF is also an integral, but rather than being between two values, it is the probability that Y is _less than or equal to_ some value _k_.
    $$
    P(Y \le k) = \int_{-\infty}^k f(y)dy
    $$   

### PDF vs. CDF 
$$
P(Y \le 6) = \int_0^6 f(y)dy = 0.7898
$$
<center><img src="pdf_vs_cdf.png" style="width:800px"/></center>

6 is the **0.79 quantile** or the **79th percentile** of the data, so 79% of our data is below 6. Other quantiles of interest include
* Q1 = 1st quartile = 0.25 quantile = 25th percentile
* Median = 0.5 quantile or 50th percentile. 50% of values are above our median and 50% of values are below our median, which gives us a good measure of center.
* Q3 = 3rd quartile = 0.75 quantile = 75th percentile
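
A quantile is the value where the CDF reaches a given probability, so it can be found by inverting the CDF. As a rough sketch, the code below does this for the elevator example from earlier. The closed-form CDF is obtained by integrating that triangular PDF by hand, and the bisection helper `quantile` is an illustrative name, not something from the notes.

```python
def cdf(x):
    """CDF of the elevator waiting time (integral of the triangular PDF)."""
    if x <= 0:
        return 0.0
    if x <= 1:
        return x**2 / 2
    if x <= 2:
        return 1 - (2 - x)**2 / 2
    return 1.0


def quantile(p, lo=0.0, hi=2.0, tol=1e-10):
    """Find x with cdf(x) = p by bisection (the CDF is increasing on [0, 2])."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


q1, median, q3 = quantile(0.25), quantile(0.5), quantile(0.75)
print(q1, median, q3)
```

Because this PDF is symmetric about 1, the median equals the mean, and Q1 and Q3 sit the same distance on either side of it.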

### Named continuous distributions
<center><img src="continuous_distributions.png" style="width:800px"/></center>

#### Uniform distribution
Comparable to its discrete random variable counterpart. Every value between _a_ and _b_ is equally likely.  
$$
Y \sim Unif(a,b) \text{ or } Y \sim U(a,b) \\
f(y) = \frac{1}{b-a}, \text{ for } a \lt y \lt b \\
E(Y) = \frac{a+b}{2} \text{ (the average of a and b)}\\
Var(Y) = \frac{(b-a)^2}{12} \\
$$

$Y\sim U(0,1)$ is called the **standard uniform distribution**.

**Example uniform distributions**

_Example 1_:

<center><img src="example_uniform.png" style="width:800px"/></center>

_Example 2_:

Suppose the time it takes to "Find Waldo" is _equally likely_ to be anywhere from 0 to 60 seconds. Define Y as the time it takes to "Find Waldo". 
$$
Y \sim U(0,60) \\
f(y) = \frac{1}{60} \text{ for } 0 \lt y \lt 60 \\
E(Y) = \frac{a+b}{2} = \frac{0+60}{2} = 30 \\
Var(Y) = \frac{(b-a)^2}{12} = \frac{(60-0)^2}{12} = 300
$$

What is the probability of finding Waldo between 40 and 60 seconds?
<center><img src="waldo_pdf.png" style="width:400px"/></center>
$$
P(40 \lt Y \lt 60) = \int_{40}^{60} f(y)dy = \int_{40}^{60} \frac{1}{60}dy = \frac{1}{3} = 0.33
$$

We can also use the CDF instead. 
<center><img src="waldo_cdf.png" style="width:400px"/></center>
$$
P(40 \lt Y \lt 60) \\
= P(Y \lt 60) - P(Y \lt 40) \\
= P(Y \le 60) - P(Y \le 40) \\
= 1 - 0.67 \\
= 0.33
$$
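
Both routes to the Waldo probability can be reproduced in a few lines. This is a small sketch in plain Python; the function names `uniform_pdf` and `uniform_cdf` are made up for illustration.

```python
def uniform_pdf(y, a, b):
    """Density of Uniform(a, b): constant 1/(b-a) inside the interval."""
    return 1 / (b - a) if a < y < b else 0.0


def uniform_cdf(y, a, b):
    """P(Y <= y) for Y ~ Uniform(a, b)."""
    if y <= a:
        return 0.0
    if y >= b:
        return 1.0
    return (y - a) / (b - a)


a, b = 0, 60
mean = (a + b) / 2              # (a+b)/2
variance = (b - a) ** 2 / 12    # (b-a)^2/12

# Area under a flat PDF is a rectangle: width times height.
p_area = (60 - 40) * uniform_pdf(50, a, b)
# Same probability as a difference of CDF values.
p_cdf = uniform_cdf(60, a, b) - uniform_cdf(40, a, b)

print(mean, variance, p_area, p_cdf)
```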

**Quiz**

If you are told that $Y \sim Uniform(10,20)$ or $Y \sim U(10,20)$, what does this mean?

This implies that we are using the uniform distribution with lower endpoint 10 and upper endpoint 20 as a model for how we observe the RV Y.  This distribution implies that every value is ‘equally likely’ between 10 and 20.  This implies that any interval of the same width between 10 and 20 has the same probability of being observed.

#### Normal distribution
A bell-shaped curve that is completely defined by its mean, $\mu$, and variance/standard deviation ($\sigma^2$ or $\sigma$). About 95% of the data in a normal distribution is within two standard deviations ($2\sigma$) of the mean.
<center><img src="normal_distribution.png" style="width:400px"/></center>

If $Y \sim N(\mu,\sigma)$, then
$$
f(y) = \frac{1}{\sqrt{2 \pi \sigma^2}}e^{\frac{-1}{2 \sigma^2}(y-\mu)^2} \\
E(Y) = \mu \\
Var(Y) = \sigma^2
$$

A "standard" normal distribution has a mean of 0 and a standard deviation of 1. We can take any normal distribution and convert it into a _standard_ normal distribution. If $Y \sim N(\mu,\sigma)$, then 
$$
Z = \frac{Y-\mu}{\sigma} \sim N(0,1)
$$
where $Z$ is the standard normal distribution's common random variable notation. Subtracting $\mu$ from $Y$ centers us at 0 (i.e. our new mean is 0) and dividing by $\sigma$ re-scales $Y$.

#### PDF and CDF for normal distributions
Standard normal PDF and CDF are denoted as
$$
\text{PDF: }\phi(z) \\
\text{CDF: }\Phi(z)
$$
Software will give us the CDF. 
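
As one illustration of getting these from software, the standard normal PDF and CDF can be written with Python's `math` module, using the identity $\Phi(z) = \frac{1}{2}\left[1 + \text{erf}(z/\sqrt{2})\right]$. The function names `phi` and `Phi` simply mirror the notation above.

```python
import math


def phi(z):
    """Standard normal PDF."""
    return math.exp(-z**2 / 2) / math.sqrt(2 * math.pi)


def Phi(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Probability of falling within two standard deviations of the mean.
within_two_sd = Phi(2) - Phi(-2)
print(within_two_sd)
```

The result is a little over 0.95, which is where the "95% within $2\sigma$" rule of thumb comes from.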

**Example**

Suppose $Y$ is the time spent reading a news article and suppose we have 40 independent observations. A reasonable distribution for the _sample_ mean is 
$$
\bar{Y} \sim N(4.29,\frac{2.47}{\sqrt{40}})
$$
where $\bar{Y}$ is the average time spent reading the news article by 40 people, the mean is 4.29, and the standard deviation is $\frac{2.47}{\sqrt{40}}$.

What is the probability that the average time is less than 4 minutes?
$$
P(\bar{Y} \lt 4) = P(\bar{Y} \le 4)
$$

We can find the CDF using software. Often, we revert to the standard normal distribution (convert to a z-score) and work with that standard normal distribution instead.
$$
Z = \frac{\bar{Y}-\mu}{\frac{\sigma}{\sqrt{n}}}
$$
This can be plugged into a calculator.
$$
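P(\bar{Y} \le 4) = P\left(Z \le \frac{4-4.29}{\frac{2.47}{\sqrt{40}}}\right) \approx P(Z \le -0.74) = \Phi(-0.74) \approx 0.23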
$$
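
The same probability can be computed with Python's standard library as a cross-check. This sketch assumes Python 3.8 or later for `statistics.NormalDist`; it evaluates $P(\bar{Y} \le 4)$ directly and again after converting to a z-score, and the two answers should agree.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 4.29, 2.47, 40
se = sigma / sqrt(n)             # standard deviation of the sample mean

# Route 1: CDF of the sample mean's own normal distribution.
p_direct = NormalDist(mu, se).cdf(4)

# Route 2: convert to a z-score and use the standard normal CDF.
z = (4 - mu) / se
p_standard = NormalDist().cdf(z)

print(z, p_direct, p_standard)
```

The z-score is about -0.74 and both routes give a probability of roughly 0.23.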

In [28]:
%%SAS my_session

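/* Uniform(0, 60) PDF. PDF('UNIFORM', x, l, r) evaluates the density at x;
   any x between 0 and 60 (here x = 1/60) returns 1/60. */
DATA UniformPDF;
   pdf=pdf('UNIFORM',1/60,0,60);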
RUN;

PROC PRINT DATA = UniformPDF;
RUN;


Obs,pdf,cdf
1,0.016667,0


### February 27
## Multinomial distribution
The multinomial distribution assesses the probability of counts in multiple categories occurring at the same time. We make the same kind of assumptions as for a binomial:
* Trials result in only 1 of _k_ outcomes
* Sequences of _n_ independent trials
* Same probabilities associated with each trial


**Example:** Among residential customers traveling from Raleigh to Washington, DC, there are three main choices:
* Plane - 45% of travelers
* Car/bus - 35% of travelers
* Train - 20% of travelers

If 12 randomly selected people travel, what is the probability that 5 travel by plane, 4 travel by car/bus, and 3 travel by train? 

**Answer:**
Probabilities must sum to 1, _n_ is the number of independent trials, and _y_'s must sum to _n_. In this case
$$
 y_1+y_2+y_3 = 5 + 4 + 3 = 12 \\
$$
The multinomial distribution is given by
$$
\begin{pmatrix}
Y_1 \\ Y_2 \\ Y_3
\end{pmatrix}
\sim multinomial(12,0.45,0.35,0.2) \\
p(y_1,y_2,y_3) = p(5,4,3) = \frac { 12! } { y_1!y_2!y_3! }0.45^{y_1}0.35^{y_2}0.2^{y_3} = \frac { 12! } { 5!4!3! }0.45^{5}0.35^{4}0.2^{3} = 0.0614
$$
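
The arithmetic above is easy to verify in code. Below is a minimal sketch of the multinomial PMF using only the standard library (`math.prod` needs Python 3.8 or later); the function name `multinomial_pmf` is illustrative.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """P(Y_1 = y_1, ..., Y_k = y_k) for n = sum(counts) independent trials."""
    coef = factorial(sum(counts))
    for y in counts:
        coef //= factorial(y)    # n! / (y_1! y_2! ... y_k!), exact integer
    return coef * prod(p**y for p, y in zip(probs, counts))


# 5 by plane, 4 by car/bus, 3 by train out of 12 travelers.
p = multinomial_pmf([5, 4, 3], [0.45, 0.35, 0.20])
print(round(p, 4))
```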

#### Quiz
---
**$Q_{1}$:** Why is the multinomial considered a _joint_ distribution?

**$A_{1}$:** The multinomial is concerned with the distribution of more than one random variable (RV).  This makes it a joint distribution!  Specifically, it is concerned with


$Y_1$ = number of observations in category 1

$Y_2$ = # of observations in category 2  
...    
$Y_k$ = # of observations in category _k_ 

---
**$Q_{2}$:** How are the multinomial and the binomial related?

**$A_{2}$:** Take one of the categories of the multinomial and label it a “success”. Label all the other categories a “failure”. The random variable corresponding to the number of observations falling into the success category is a binomial random variable. 

This is true if you look at any one random variable from the multinomial.  We say that ‘marginally’ each $Y_{i}$ follows a binomial.
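
To make the "marginally binomial" claim concrete, here is a small check using the travel example: summing the multinomial probabilities over every split of the remaining travelers between car/bus and train should reproduce the Binomial(12, 0.45) probabilities for the number of plane travelers. This is a numerical illustration under the example's numbers, not a proof, and the helper names are made up.

```python
from math import comb, factorial

n = 12
p_plane, p_car, p_train = 0.45, 0.35, 0.20


def multinomial_pmf(y1, y2, y3):
    """Joint probability of (plane, car/bus, train) counts."""
    coef = factorial(n) // (factorial(y1) * factorial(y2) * factorial(y3))
    return coef * p_plane**y1 * p_car**y2 * p_train**y3


def marginal_plane(y1):
    """P(Y_1 = y1): sum the joint PMF over all values of the other counts."""
    return sum(multinomial_pmf(y1, y2, n - y1 - y2) for y2 in range(n - y1 + 1))


def binomial_pmf(y, size, p):
    return comb(size, y) * p**y * (1 - p)**(size - y)


# Largest disagreement between the marginal and the binomial over all y1.
max_gap = max(abs(marginal_plane(y) - binomial_pmf(y, n, p_plane))
              for y in range(n + 1))
print(max_gap)
```

The gap should be zero up to floating-point rounding.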
_________________________________________________________________________________________________

## Contingency tables
Contingency tables summarize the _frequency_ or the _proportion_ of observations that fall into a category (or a combination of categories when looking at multi-way tables).  
* One-way tables summarize one variable
* Two-way tables summarize two variables
* _N_-way tables summarize _N_ variables, but these are more difficult to visualize

## Conditional probability ideas
With more than one variable, we can also consider conditional probabilities. For instance, what is the probability that a passenger survives _given that_ (|) they embarked on a journey at port _S_? Compare this against the probability that a passenger survives overall ($P$(passenger survives)).

<div class="alert alert-block alert-success">
<b>Conditional example:</b> The probability that a passenger survives given that they embarked on a journey at port S.
<br>
    <br><center>$P$(passenger survives $\mid$ embarked at port S)</center>
</div>
<center><img src="conditional_probability.png" style="width:400px"/></center>

More generally, if we consider two events _A_ and _B_ that may occur, the conditional probability of event A _given that_ (|) event B has occurred is defined as

$$
P(A \mid B) = \frac { P(A \cap B) } { P(B) } = \frac { P(A {\sf and} B) } { P(B) }
$$
<center><img src="conditional_probability_venn.png" style="width:400px"/></center>

Conversely, the conditional probability of B _given that_ (|) A has occurred is defined as

$$
P(B \mid A) = \frac { P(A \cap B) } { P(A) } = \frac { P(A {\sf and} B) } { P(A) }
$$
Notice that the denominator is what we are conditioning upon. Referring back to the SAS contingency table, the probability that a passenger survives given that they embarked from port _S_ is given by 

$$
P(\text{survives} \mid \text{embarks from port S}) = \frac { P(\text{survives and embarks from port S}) } { P(\text{embarks from port S}) } = \frac { 304 } { 914 }
$$
<center><img src="survives_and_embarks.png" style="width:400px"/></center>
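
When working from a table of counts, the conditional probability is just a ratio of counts, because the table total cancels from the numerator and denominator. A tiny sketch using the two counts quoted above (the function name is illustrative):

```python
def conditional_probability(n_a_and_b, n_b):
    """P(A | B) from counts: observations in both A and B, divided by those in B."""
    if n_b == 0:
        raise ValueError("cannot condition on an event that never occurred")
    return n_a_and_b / n_b


survived_and_port_s = 304   # survived AND embarked at port S
embarked_port_s = 914       # embarked at port S

p_survive_given_s = conditional_probability(survived_and_port_s, embarked_port_s)
print(round(p_survive_given_s, 3))
```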

## Multiplication law
<div class="alert alert-block alert-success">
<b>Tip:</b> The <b>generic</b> multiplication law is
<br>$P(A \cap B) = P(A \mid B)P(B)$. 
<br> and
<br>$P(A \cap B) = P(B \mid A)P(A)$
</div>

We get this by re-writing the conditional probability equation as

$$
P(A \cap B) = P(A \mid B)P(B)
$$

and

$$
P(A \cap B) = P(B \mid A)P(A)
$$

If conditioning on an event does _not_ change the probability of the other event, then we have **independent events**! That is, if

$$
P(A \mid B) = P(A)
$$

or, equivalently,

$$
P(B \mid A) = P(B)
$$

then the multiplication law reduces to

$$
P(A \cap B) = P(A)P(B) 
$$

and A and B are independent. That is, _A_ happening has no impact on the probability of _B_ happening. The same idea applies across observations: when knowing one observation tells us nothing about the probabilities of any other observation, and all observations come from the same distribution, we have an **independent and identically distributed (iid)** sample. 

<div class="alert alert-block alert-success">
<b>Tip:</b> If we know that our events are independent, the multiplication law for <b>independent events</b> is
<br>$P(A \cap B) = P(A)P(B)$. 
</div>
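
One way to see the independence rule in action is to enumerate a small sample space and compare $P(A \cap B)$ with $P(A)P(B)$. The dice experiment and the events below are hypothetical and chosen only for illustration; they are not part of the port example.

```python
from fractions import Fraction
from itertools import product

# Hypothetical experiment: roll two fair six-sided dice (36 equally likely outcomes).
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event as (favorable outcomes) / (all outcomes)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def first_even(o):
    return o[0] % 2 == 0


def second_high(o):
    return o[1] >= 5


def total_at_least_10(o):
    return o[0] + o[1] >= 10


def independent(e1, e2):
    """True when P(e1 and e2) equals P(e1) * P(e2)."""
    return prob(lambda o: e1(o) and e2(o)) == prob(e1) * prob(e2)


print(independent(first_even, second_high))        # True
print(independent(first_even, total_at_least_10))  # False
```

The two dice do not influence each other, so events about different dice are independent, while an event about the total depends on what the first die shows.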

With our port example, if the probability of surviving given that we embarked from S port was the same as the _overall_ probability of survival, it would not matter whether or not we embarked from port S; we would still survive with the same probability. 

<center><img src="independence_example.png" style="width:400px"/></center>
<center><img src="independence_example_2.png" style="width:400px"/></center>


#### Quiz
**$Q_{2}$:** What is the idea of statistical independence?

**$A_{2}$:** Statistical independence is the idea that knowledge of one event tells us nothing about the probabilities associated with another event.  That is, we don’t know any more about the 2nd event given knowledge of the 1st event. 

---

**$Q_{3}$:** Suppose the probability that it rains on Saturday is 0.25 and the probability that it rains on Sunday is 0.25.  Is the probability that it rains on the weekend then $P(\text{Rain on Sunday OR Rain on Saturday}) = P(\text{Rain on Sunday}) + P(\text{Rain on Saturday}) - P(\text{Rain on Saturday})P(\text{Rain on Sunday})$? 

**$A_{3}$:** No – this would only be the case if $P(\text{Rain on Saturday AND Rain on Sunday}) = P(\text{Rain on Saturday})P(\text{Rain on Sunday})$.  This would be true if it raining on Saturday was independent of it raining on Sunday.  This is highly unlikely.  Many times there are storms that can last multiple days.  This means that knowing it is raining on Saturday may change our probabilities associated with it raining on Sunday. 

---

**$Q_{4}$:** Are the terms ‘Random Sample’ and ‘iid’ equivalent?

**$A_{4}$:** Yes! They both imply that our sample comes from the same distribution (identically distributed) and that the individual variables are independent of one another.

---

**$Q_{5}$:** How does the multiplication law/rule $P(A \cap B) = P(A)P(B)$ apply to joint distributions from a random sample?

**$A_{5}$:** Similar to the multiplication law/rule, the joint distribution of $Y_1$ and $Y_2$ can be written as $f(y_1,y_2) = f(y_1)f(y_2)$.  This will be really important when we start looking at maximum likelihood!
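
As a rough sketch of why this matters, the joint density of an iid sample is just the product of the individual densities. The sample values below are made up, and the normal model with mean 4.29 and standard deviation 2.47 is an assumption borrowed from the reading-time example purely for illustration.

```python
from math import exp, log, prod
from statistics import NormalDist

sample = [3.1, 4.8, 5.2, 2.9]            # hypothetical iid observations
model = NormalDist(mu=4.29, sigma=2.47)  # assumed common distribution

# Independence lets the joint density factor into a product of marginals.
joint_density = prod(model.pdf(y) for y in sample)

# Taking logs turns the product into a sum, which is easier to work with.
log_joint = sum(log(model.pdf(y)) for y in sample)

print(joint_density, exp(log_joint))
```

Viewed as a function of the unknown parameters rather than of the data, this product is the likelihood that maximum likelihood works with.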

### Hypergeometric
In a hypergeometric distribution, the trials are **not independent**, because we sample without replacement. If we have 10 total units, 3 of which are good and 7 of which are bad, we can sample 4 without replacement and define _Y_ as the number of good units chosen. 

$$ 
\text{3 good} + \text{7 bad} = \text{10 total} \\
\textit{Y} = \text{number of good}
$$

_Y_ is almost a Binomial but the trials are not independent! On our first trial, $P(success) = 3/10 = 0.3$. Given a success on the first trial (finding a good unit), $P(success) = 2/9 = 0.2222$. 
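
For reference, the hypergeometric probabilities for this setup can be computed by counting. The formula used below, $P(Y=y) = \binom{3}{y}\binom{7}{4-y} / \binom{10}{4}$, is the standard hypergeometric PMF and is not stated in the notes above; the function name is illustrative. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(y, good, bad, draws):
    """P(Y = y): choose y of the good units and draws - y of the bad units."""
    return Fraction(comb(good, y) * comb(bad, draws - y), comb(good + bad, draws))


# 3 good, 7 bad, sample 4 without replacement; Y can be 0, 1, 2, or 3.
pmf = {y: hypergeom_pmf(y, good=3, bad=7, draws=4) for y in range(4)}

# Sequential view from the text: success on trial 1, then success on trial 2.
p_first_two_good = Fraction(3, 10) * Fraction(2, 9)

print(pmf)
print(p_first_two_good)
```

The second success probability changes from 3/10 to 2/9 because the first draw is not put back, which is exactly what breaks the binomial's independence assumption.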