In [7]:
import scipy.stats as stats
import pandas as pd
import numpy as np


## Population
---
### Descriptive Statistics
* Central Tendency
    * Mean ($\mu$)
    * Median
    * Mode
* Variability
    * Variance ($\sigma^2$)
    * Standard Deviation ($\sigma$)
    * Range (max - min)
    * IQR (Inter Quartile Range) ($Q_3 - Q_1$)
* Shape
    * Skewness
    * Kurtosis
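
A minimal sketch of the measures listed above, computed with NumPy only. The data values are made up for illustration, and the skewness and excess-kurtosis lines use the plain population moment formulas (an assumption; library routines may apply bias corrections).

```python
import numpy as np

# Illustrative data (hypothetical values, not from the notes)
data = np.array([2, 3, 3, 5, 7, 8, 8, 8, 10, 12], dtype=float)

# Central tendency
mean = data.mean()
median = np.median(data)
values, counts = np.unique(data, return_counts=True)
mode = values[counts.argmax()]          # most frequent value

# Variability (population versions, ddof=0)
variance = data.var()
std = data.std()
data_range = data.max() - data.min()
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Shape: third and fourth standardized moments
z = (data - mean) / std
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3

print(mean, median, mode)
print(variance, std, data_range, iqr)
print(skewness, excess_kurtosis)
```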
---
### Inferential Statistics
* Estimation
    * Point Estimation
    * Interval Estimation
* Hypothesis Test

---
### Qualitative Data / Categorical Data
* Nominal Data
* Ordinal Data
---
### Quantitative Data / Numerical Data
* Continuous Data
* Discrete Data

## Parameter
1. Expected Value (Mean)
> $E(X)=\mu_{X}=\sum xp(x)\,\,\,\,\,\,\,\,\,\,$ *(Discrete Probability Distribution)*
>
> $E(X)=\mu_{X}=\int xf(x)\,dx\,\,\,$ *(Continuous Probability Distribution)*

    * Corollary
    > $E(X+Y)=E(X)+E(Y)$
    >
    > $E(aX)=a\,E(X)\,\,\,\,$ *(a is constant)*
    >
    > $E(a)=a$
    >
    > if $X,\,Y$ are independent $\implies$ $E(XY)=E(X)E(Y)\,\,\,$ $(\because\,COV(X,Y)=0)$
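
The discrete formula $E(X)=\sum xp(x)$ can be checked exactly on a small distribution. The sketch below assumes a fair six-sided die (an example chosen here, not from the notes) and uses `fractions.Fraction` so the results carry no rounding error.

```python
from fractions import Fraction

# Fair six-sided die: p(x) = 1/6 for x = 1..6 (illustrative assumption)
faces = range(1, 7)
p = Fraction(1, 6)

# E(X) = sum of x * p(x)
expected = sum(x * p for x in faces)

# E(aX) = a E(X) and E(a) = a for a constant a
a = 3
expected_scaled = sum(a * x * p for x in faces)
expected_const = sum(a * p for _ in faces)

print(expected, expected_scaled, expected_const)
```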

In [32]:
# random_seed = 123
sample_size = 100
sample_scale = 10
# np.random.seed(seed=random_seed)

sample_x = np.random.random_sample(size=sample_size) * sample_scale
sample_y = np.random.random_sample(size=sample_size) * sample_scale

a = np.random.randint(2,10)
print(f"E(X) = {np.mean(sample_x)}")
print(f"E(Y) = {np.mean(sample_y)}")
print()
print(f"E(X) + E(Y) = {np.mean(sample_x) + np.mean(sample_y)}")
print(f"E(X + Y) = {np.mean(sample_x + sample_y)}")
print()
print(f"{a} x E(X) = {a * np.mean(sample_x)}")
print(f"E({a}X) = {np.mean(a * sample_x)}")
print()
print(f"E(XY) = {np.mean(sample_x * sample_y)}")
print(f"E(X) x E(Y) = {np.mean(sample_x) * np.mean(sample_y)}")
print(f"COV(X,Y) = {np.cov(sample_x,sample_y)[0][1]}")


E(X) = 5.2322443863273165
E(Y) = 5.080511770962427

E(X) + E(Y) = 10.312756157289744
E(X + Y) = 10.312756157289744

8 x E(X) = 41.85795509061853
E(8X) = 41.85795509061853

E(XY) = 25.789806435145657
E(X) x E(Y) = 26.58247919328801
COV(X,Y) = -0.8006795536791428


2. Variance
> $V(X)=\sigma^2=E[(X-\mu)^2]$

    * Corollary
    > $V(X)=E(X^2)-E(X)^2$
    >
    > $V(X) \ge 0\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,$ *(for any $X$)*
    >
    > $V(aX)=a^2V(X)\,\,\,\,$ *(a is constant)*
    >
    > $V(X + a)=V(X)\,\,\,\,$ *(a is constant)*
    >
    > $V(X+Y)=V(X)+V(Y)+2\,COV(X,Y)\,\,\,\,\,\,\,(\sigma_{XY}\equiv COV(X,Y))$
    >
    > $V(X-Y)=V(X)+V(Y)-2\,COV(X,Y)$
    >
    > if $X,\,Y$ are independent $\implies$ $V(X\pm Y)=V(X)+V(Y)\,\,\,$ $(\because\,COV(X,Y)=0)$

3. Standard Deviation
> $S(X)=\sigma_{X}=\sqrt{V(X)}$
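
The variance identities above hold exactly, not just approximately. A small sketch, again assuming a fair six-sided die and a hypothetical helper `expectation`, verifies $V(X)=E(X^2)-E(X)^2$, $V(aX)=a^2V(X)$ and $V(X+a)=V(X)$ with exact fractions.

```python
from fractions import Fraction

# Fair six-sided die (illustrative assumption)
faces = range(1, 7)
p = Fraction(1, 6)

def expectation(g):
    """E[g(X)] for the die; helper name is illustrative."""
    return sum(g(x) * p for x in faces)

mu = expectation(lambda x: x)

# Definition and shortcut form of the variance
var_def = expectation(lambda x: (x - mu) ** 2)
var_short = expectation(lambda x: x ** 2) - mu ** 2

# Scaling by a constant multiplies the variance by a^2; shifting leaves it unchanged
a = 4
var_scaled = expectation(lambda x: (a * x - a * mu) ** 2)
var_shifted = expectation(lambda x: ((x + a) - (mu + a)) ** 2)

# Standard deviation is the square root of the variance
std = float(var_def) ** 0.5

print(var_def, var_short, var_scaled, var_shifted, std)
```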

In [59]:
# random_seed = 123
sample_size = 100
sample_scale = 10
# np.random.seed(seed=random_seed)

sample_x = np.random.random_sample(size=sample_size) * sample_scale
sample_y = np.random.random_sample(size=sample_size) * sample_scale

a = np.random.randint(2,10)

# default: ddof=0 
# print(f"V(X) = {np.var(sample_x, ddof=0)}")
print(f"V(X) = {np.var(sample_x)}")
print(f"V(Y) = {np.var(sample_y)}")
print()
print(f"{a**2} x V(X) = {a**2 * np.var(sample_x)}")
print(f"V({a}X) = {np.var(a * sample_x)}")
print()
print(f"V(X) + V(Y) = {np.var(sample_x) + np.var(sample_y)}")
print(f"V(X + Y) = {np.var(sample_x + sample_y)}")
print(f"V(X - Y) = {np.var(sample_x - sample_y)}")
print(f"COV(X,Y) = {np.cov(sample_x,sample_y, ddof=0)[0][1]}")
print(f"V(X) + V(Y) + 2COV(X,Y) = {np.cov(sample_x,sample_y, ddof=0)[0][1] * 2 + np.var(sample_x) + np.var(sample_y)}")
print(f"V(X) + V(Y) - 2COV(X,Y) = {np.cov(sample_x,sample_y, ddof=0)[0][1] * -2 + np.var(sample_x) + np.var(sample_y)}")

V(X) = 8.210661929832966
V(Y) = 8.5332251557808

36 x V(X) = 295.5838294739868
V(6X) = 295.5838294739868

V(X) + V(Y) = 16.743887085613764
V(X + Y) = 17.575536524668934
V(X - Y) = 15.912237646558596
COV(X,Y) = 0.4158247195275833
V(X) + V(Y) + 2COV(X,Y) = 17.575536524668934
V(X) + V(Y) - 2COV(X,Y) = 15.912237646558598


4. Covariance
> $COV(X,Y) \equiv \sigma_{XY}=E[(X-\mu_{X})(Y-\mu_{Y})]$
>
> $COV(X,Y)=E(XY)-E(X)E(Y)$

    * Corollary
    > $COV(X,X) \equiv \sigma_{XX}=V(X)=\sigma_{X}^2$
    >
    > $COV(X,Y) = COV(Y,X)$
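
A short numerical check of the covariance formulas above, as a sketch. The seed, sample size, and the way `y` is built from `x` are illustrative assumptions; `ddof=0` is used so that `np.cov` matches the population-style formulas.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.uniform(0, 10, size=1000)
y = 0.5 * x + rng.normal(0, 1, size=1000)  # correlated with x by construction

# Definition: E[(X - mu_X)(Y - mu_Y)]
cov_def = np.mean((x - x.mean()) * (y - y.mean()))
# Shortcut: E(XY) - E(X)E(Y)
cov_short = np.mean(x * y) - x.mean() * y.mean()

# np.cov returns a 2x2 matrix; the off-diagonal entry is COV(X,Y)
cov_xy = np.cov(x, y, ddof=0)[0, 1]
cov_yx = np.cov(y, x, ddof=0)[0, 1]
cov_xx = np.cov(x, x, ddof=0)[0, 1]

print(cov_def, cov_short, cov_xy, cov_yx)
print(cov_xx, x.var())
```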
    
5. Correlation Coefficient
> $CORR(X,Y) \equiv \rho_{XY}=\frac{COV(X,Y)}{\sigma_{X}\sigma_{Y}}=\frac{\sigma_{XY}}{\sigma_{X}\sigma_{Y}}$

    * Corollary
    > $\rho_{XX}=1$
    >
    > $-1 \le \rho_{XY} \le 1$
    
    <details>
        <summary>Proof</summary>
        <p>
        
        * Proof
        > $U=\frac{X-\mu_{X}}{\sigma_{X}},\,\,V=\frac{Y-\mu_{Y}}{\sigma_{Y}}$
        >
        > $\implies E(U)=0,V(U)=1,\,\,E(V)=0,V(V)=1$
        >
        > $V(U+V)=V(U)+V(V)+2COV(U,V)$
        >
        > $V(U-V)=V(U)+V(V)-2COV(U,V)$
        >
        > $\because\forall X,\,V(X)\ge0,\,\,\,$
        >
        > $V(U)+V(V)+2COV(U,V)\ge0$
        >
        > $V(U)+V(V)-2COV(U,V)\ge0$
        >
        > $\therefore -1 \le COV(U,V) \le 1\,\,\,\,(\because V(U)=V(V)=1)$
        >
        > $\because COV(U,V)=CORR(X,Y),$
        >
        > $\therefore -1 \le \rho_{XY} \le 1$
        
        </p>
    </details>
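
The key step of the proof, $COV(U,V)=CORR(X,Y)$ for the standardized variables $U$ and $V$, can be checked numerically. This is a sketch on simulated data; the seed and the linear relation between `x` and `y` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, chosen only for reproducibility
x = rng.normal(5, 2, size=500)
y = -0.8 * x + rng.normal(0, 1, size=500)

# Standardize: U and V have mean 0 and variance 1
u = (x - x.mean()) / x.std()
v = (y - y.mean()) / y.std()

# U and V are centered, so COV(U,V) reduces to E(UV)
cov_uv = np.mean(u * v)
rho = np.corrcoef(x, y)[0, 1]

# V(U + V) = 2 + 2 COV(U,V) >= 0 and V(U - V) = 2 - 2 COV(U,V) >= 0
var_sum = np.var(u + v)
var_diff = np.var(u - v)

print(cov_uv, rho, var_sum, var_diff)
```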


In [63]:
# random_seed = 123
sample_size = 100
sample_scale = 10
# np.random.seed(seed=random_seed)

sample_x = np.random.random_sample(size=sample_size) * sample_scale
sample_y = np.random.random_sample(size=sample_size) * sample_scale

a = np.random.randint(2,10)

# np.cov defaults to bias=False (ddof=1); ddof=0 gives the population covariance.
# print(f"COV(X,Y) = {np.cov(sample_x, sample_y, ddof=0)[0][1]}")
# Both arguments are sample_x, so this is COV(X,X), which equals V(X).
print(f"COV(X,X) = {np.cov(sample_x, sample_x, ddof=0)[0][1]}")

print(f"CORR(X,X) = {np.corrcoef(sample_x, sample_x)[0][1]}")
print(f"CORR(X,Y) = {np.corrcoef(sample_x, sample_y)[0][1]}")

COV(X,X) = 8.645067355073884
CORR(X,X) = 1.0
CORR(X,Y) = -0.1414260882090568


## Sample

---
## Statistic

### $\bar{X}$ : sample mean 

> $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

* Corollary
    > $E(\bar{X})=\mu$
    >
    > $V(\bar{X})=\frac{\sigma^2}{n}$
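
A simulation sketch of $E(\bar{X})=\mu$ and $V(\bar{X})=\frac{\sigma^2}{n}$. The normal population, its parameters, the number of trials, and the seed are all illustrative assumptions; the results are approximate because they come from finitely many trials.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed, chosen only for reproducibility
mu, sigma, n, trials = 3.0, 2.0, 25, 20000

# Each row is one sample of size n; one sample mean per row
samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1)

print(f"mean of sample means     = {xbar.mean()}  (mu = {mu})")
print(f"variance of sample means = {xbar.var()}  (sigma^2/n = {sigma**2 / n})")
```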

### $S_{X}^2$ : sample variance 
> $S^2 = \frac{\sum_{i=1}^{n} (X_i-\bar{X})^2}{n\,-\,1}$
* Corollary
    > $E(S^2)=\sigma^2$

<details>
    <summary>Proof</summary>
    <p>
        
* Proof
    > $E[\,\sum_{1}^{n} (X_{i}-\bar{X})^2]$
    >
    > $=E[\,\sum_{1}^{n} (X_{i}^2-2X_{i}\bar{X}+\bar{X}^2)]$
    >
    > $=E[\,\sum_{1}^{n}(X_{i}^2)-2\sum_{1}^{n}(X_{i}\bar{X})+\sum_{1}^{n}(\bar{X}^2)]$
    >
    > $=E[\,\sum_{1}^{n}(X_{i}^2)-2\bar{X}\sum_{1}^{n}(X_{i})+n\bar{X}^2]$
    >
    > $=E[\,\sum_{1}^{n}(X_{i}^2)-2\bar{X}\,(n\bar{X})+n\bar{X}^2]$
    >
    > $=E[\,\sum_{1}^{n}(X_{i}^2)-2\,n\bar{X}^2+n\bar{X}^2]$
    >
    > $=E[\,\sum_{1}^{n}(X_{i}^2)-n\bar{X}^2]$
    >
    > $=E[\,\sum_{1}^{n}(X_{i}^2)] - nE[\bar{X}^2]$
    >
    > $=\sum_{1}^{n}E[X_{i}^2] - nE[\bar{X}^2]$
    >
    > $=\sum_{1}^{n}(\sigma^2+\mu^2) - n(\frac{\sigma^2}{n}+\mu^2)\,\,\,\,(\because E(X^2)=V(X)+E(X)^2,\,\,E(\bar{X})=\mu,\,V(\bar{X})=\frac{\sigma^2}{n})$
    >
    > $=\,n(\sigma^2+\mu^2) - n(\frac{\sigma^2}{n}+\mu^2)$
    >
    > $=(n-1)\,\sigma^2$
    >
    > $\therefore E(S^2)=\frac{1}{n-1}E[\,\sum_{1}^{n} (X_{i}-\bar{X})^2]=\sigma^2$

    </p>
</details>
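
The result $E(S^2)=\sigma^2$ is why the divisor is $n-1$ rather than $n$. The simulation below is a sketch under assumed parameters (normal population, small $n$, fixed seed): averaging many sample variances with `ddof=1` should land near $\sigma^2$, while `ddof=0` should land near $\frac{n-1}{n}\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed, chosen only for reproducibility
sigma2, n, trials = 4.0, 5, 50000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

# Average of the sample variances over many samples
s2_unbiased = samples.var(axis=1, ddof=1).mean()  # divisor n - 1
s2_biased = samples.var(axis=1, ddof=0).mean()    # divisor n

print(f"ddof=1: {s2_unbiased}  (sigma^2 = {sigma2})")
print(f"ddof=0: {s2_biased}  ((n-1)/n * sigma^2 = {(n - 1) / n * sigma2})")
```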
    
### $S_{X}$ : sample standard deviation 
> $S=\sqrt{\frac{\sum_{i=1}^{n} (X_i-\bar{X})^2}{n\,-\,1}}$


### $S_{XY}$ : sample covariance
> $S_{XY}=\frac{\sum_{i=1}^{n} (X_i-\bar{X})(Y_i-\bar{Y})}{n - 1}$

### $R$ : sample correlation coefficient
> $R=\frac{S_{XY}}{S_{X}S_{Y}}=\frac{\sum_{i=1}^{n} (X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_{i}-\bar{X})^{2}}\sqrt{\sum_{i=1}^{n} (Y_{i}-\bar{Y})^{2}}}$
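
The sample formulas above can be written out directly and compared with NumPy's built-ins, which use the $n-1$ divisor by default for `np.cov`. This is a sketch on simulated data; the seed and the relation between `x` and `y` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)  # fixed seed, chosen only for reproducibility
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(0, 3, size=50)
n = len(x)

dx, dy = x - x.mean(), y - y.mean()

# Sample covariance and sample standard deviations (divisor n - 1)
s_xy = np.sum(dx * dy) / (n - 1)
s_x = np.sqrt(np.sum(dx ** 2) / (n - 1))
s_y = np.sqrt(np.sum(dy ** 2) / (n - 1))

# Two equivalent forms of R: the (n - 1) factors cancel
r = s_xy / (s_x * s_y)
r_sums = np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

print(s_xy, np.cov(x, y)[0, 1])
print(r, r_sums, np.corrcoef(x, y)[0, 1])
```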

### $\hat{p}$ : sample proportion
> $\hat{p}=\frac{X}{n}\,\,\,(\,X$~$Bin(n,p)\,)$
>
> $E(\hat{p})=p$
>
> $V(\hat{p})=\frac{pq}{n}\,\,\,(q=1-p)$
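
A simulation sketch of $E(\hat{p})=p$ and $V(\hat{p})=\frac{pq}{n}$. The values of `n`, `p`, the number of trials, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)  # fixed seed, chosen only for reproducibility
n, p, trials = 40, 0.3, 50000
q = 1 - p

# X ~ Bin(n, p); p_hat = X / n for each simulated sample
x = rng.binomial(n, p, size=trials)
p_hat = x / n

print(f"mean of p_hat     = {p_hat.mean()}  (p = {p})")
print(f"variance of p_hat = {p_hat.var()}  (pq/n = {p * q / n})")
```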

### Theorem
* **Generalized Theorem** --- $\bar{X}$

> If $X$ ~ $N(\mu, \sigma^2)$, then
> $\bar{X}$ ~ $N(\mu,\frac{\sigma^2}{n})$<br/>
>
> $Z = \frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}}$ ~ $N(0,1)$

* **Central Limit Theorem (CLT)** --- $\bar{X}$

> If $E(X)=\mu$, $V(X)=\sigma^2$, and $n \gt 30$ (a common rule of thumb), then approximately
> $\bar{X}$ ~ $N(\mu,\frac{\sigma^2}{n})$<br/>
>
> $Z = \frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}}$ ~ $N(0,1)$
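
To see the CLT at work on a clearly non-normal population, the sketch below draws samples from an exponential distribution (an assumption made for illustration) and standardizes the sample means. If the approximation is good, $Z$ should have mean near 0, standard deviation near 1, and roughly 95% of values inside $\pm1.96$.

```python
import numpy as np

rng = np.random.default_rng(6)  # fixed seed, chosen only for reproducibility
n, trials = 50, 20000
mu = sigma = 1.0  # Exponential(scale=1) has mean 1 and standard deviation 1

samples = rng.exponential(scale=1.0, size=(trials, n))

# Standardized sample mean: Z = (Xbar - mu) / (sigma / sqrt(n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

coverage = np.mean(np.abs(z) < 1.96)
print(f"mean(Z) = {z.mean()}, std(Z) = {z.std()}")
print(f"P(|Z| < 1.96) = {coverage}  (standard normal: about 0.95)")
```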

* **Sample Variance Distribution** --- $S^2$

> If $X$ ~ $N(\mu, \sigma^2)$, then
>
> $\frac{(n\,-\,1)\,S^2}{\sigma^2}$ ~ $\chi^2(n-1)$<br/>
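
A $\chi^2(k)$ variable has mean $k$ and variance $2k$, which gives a simple check of the statement above. The sketch assumes a normal population with illustrative parameters and a fixed seed.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed, chosen only for reproducibility
mu, sigma, n, trials = 10.0, 3.0, 8, 50000

samples = rng.normal(mu, sigma, size=(trials, n))

# (n - 1) S^2 / sigma^2 for each simulated sample
stat = (n - 1) * samples.var(axis=1, ddof=1) / sigma ** 2

print(f"mean     = {stat.mean()}  (chi-square mean: n - 1 = {n - 1})")
print(f"variance = {stat.var()}  (chi-square variance: 2(n - 1) = {2 * (n - 1)})")
```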

* **Sample Mean and Sample Variance** --- $\bar{X},\,S^2$

> If $X$ ~ $N(\mu, \sigma^2)$, then
>
> $\frac{\bar{X}-\mu}{\frac{S}{\sqrt{n}}}$ ~ $t(n-1)$<br/>
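
Replacing $\sigma$ with $S$ makes the statistic heavier-tailed than the standard normal. The sketch below, with assumed parameters and seed, checks two consequences for $t(\nu)$ with $\nu=n-1$: the variance is $\frac{\nu}{\nu-2}$ (for $\nu>2$), and more than 5% of values fall outside $\pm1.96$.

```python
import numpy as np

rng = np.random.default_rng(8)  # fixed seed, chosen only for reproducibility
mu, sigma, n, trials = 0.0, 1.0, 10, 50000
nu = n - 1  # degrees of freedom

samples = rng.normal(mu, sigma, size=(trials, n))

# T = (Xbar - mu) / (S / sqrt(n)), using the sample standard deviation S
t_stat = (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

tail = np.mean(np.abs(t_stat) > 1.96)
print(f"variance = {t_stat.var()}  (nu/(nu-2) = {nu / (nu - 2)})")
print(f"P(|T| > 1.96) = {tail}  (standard normal: about 0.05)")
```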

* **Sample Proportion** --- $\hat{p}\,\,(\frac{X}{n})$

> If $X$ ~ $Bin(n,p)$ and $n\gt30$, then approximately (with $q=1-p$)
>
> $\hat{p}$ ~ $N(p, \frac{pq}{n})$<br/>
>
> $\frac{\hat{p}-p}{\sqrt{\frac{pq}{n}}}$ ~ $N(0,1)$
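
A sketch of the normal approximation for $\hat{p}$: standardize simulated proportions and compare the share inside $\pm1.96$ with the standard normal value of about 0.95. The values of `n`, `p`, and the seed are illustrative assumptions, and the match is only approximate because $\hat{p}$ is discrete.

```python
import numpy as np

rng = np.random.default_rng(9)  # fixed seed, chosen only for reproducibility
n, p, trials = 100, 0.4, 50000
q = 1 - p

p_hat = rng.binomial(n, p, size=trials) / n

# Z = (p_hat - p) / sqrt(pq / n)
z = (p_hat - p) / np.sqrt(p * q / n)

coverage = np.mean(np.abs(z) < 1.96)
print(f"mean(Z) = {z.mean()}, std(Z) = {z.std()}")
print(f"P(|Z| < 1.96) = {coverage}  (standard normal: about 0.95)")
```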