# Bivariate Overview

If we change X, what is the effect on Y?


### Parametric Approach

- makes many assumptions (e.g. about the distribution of the data)

- relies on larger sample sizes

- higher power than the non-parametric approach when its assumptions hold

- nice mathematical properties

- sensitive to outliers

### Non-parametric Approach

- fewer assumptions

    - does not require large sample sizes

    - does not assume normality

- works with smaller sample sizes

- lower power than the parametric approach

    - if the null hypothesis is false, it is less likely to reject the null than the parametric approach

- less sensitive to outliers

- works on the ranks of the observed data rather than the raw values

### Resampling Approach

- fewer assumptions

- does not need large sample sizes

- more flexible

- no nice closed-form mathematical results
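
As a rough illustration of the resampling idea, the sketch below bootstraps a 95% percentile interval for the mean paired difference. It assumes the before/after measurements from the paired example in these notes; the seed, the number of resamples (`n_boot`), and the percentile method are illustrative choices, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# after - before differences from the paired example in these notes
before = np.array([135, 142, 137, 122, 147, 151, 131, 117, 154, 143, 133])
after = np.array([127, 145, 131, 125, 132, 147, 119, 125, 132, 139, 122])
diff = after - before

n_boot = 10_000  # assumed number of bootstrap resamples

# each row is one resample of the differences, drawn with replacement
resamples = rng.choice(diff, size=(n_boot, diff.size), replace=True)
boot_means = resamples.mean(axis=1)

# percentile interval: middle 95% of the bootstrap means
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"observed mean difference: {diff.mean():.2f}")
print(f"95% bootstrap interval: ({ci_low:.2f}, {ci_high:.2f})")
```

No distributional formula is used here: the spread of `boot_means` stands in for the sampling distribution of the mean.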

| X var | Y var | Example | Plot | Methods |
| --- | --- | --- | --- | --- |
| cat | num | X - drug (A/B); Y - delta value | side-by-side box plot | two-sample t-test; one-way ANOVA; Wilcoxon rank-sum test |
| cat | cat | X - smoke (Y/N); Y - cancer (Y/N) | side-by-side bar plot | chi-square test; Fisher's exact test; relative risk (RR); odds ratio (OR)... |
| num | num | X - years of education; Y - salary | scatter plot | correlation; simple linear regression |

* multi-variable analysis

| approach | 2 paired | 3+ paired | 2 indep | 3+ indep |
| --- | --- | --- | --- | --- |
| parametric | paired t-test | repeated-measures ANOVA | two-sample t-test | one-way ANOVA |
| non-parametric | Wilcoxon signed-rank | Friedman's test | Wilcoxon rank-sum (a.k.a. Mann-Whitney U) | Kruskal-Wallis |

* bootstrapping can also be used in any of these settings
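
The table above maps fairly directly onto functions in `scipy.stats`. The sketch below only shows the call for each test; the arrays `a`, `b`, `c` are hypothetical random data, and the same arrays are reused for the paired and the independent tests purely to keep the example short (real data would be one or the other). Repeated-measures ANOVA is left out because `scipy.stats` has no function for it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed

# Hypothetical measurements, for illustration only
a = rng.normal(10.0, 2.0, size=20)
b = a + rng.normal(0.5, 1.0, size=20)
c = a + rng.normal(1.0, 1.0, size=20)

results = {
    # 2 paired groups
    "paired t-test": stats.ttest_rel(a, b),
    "Wilcoxon signed-rank": stats.wilcoxon(a, b),
    # 3+ paired groups
    "Friedman": stats.friedmanchisquare(a, b, c),
    # 2 independent groups
    "two-sample t-test": stats.ttest_ind(a, b),
    "Mann-Whitney U": stats.mannwhitneyu(a, b),
    # 3+ independent groups
    "one-way ANOVA": stats.f_oneway(a, b, c),
    "Kruskal-Wallis": stats.kruskal(a, b, c),
}

for name, res in results.items():
    print(f"{name}: statistic = {res.statistic:.3f}, p-value = {res.pvalue:.4f}")
```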

# Paired 

## T-test (param) 
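
As a minimal sketch of the paired t-test, assuming the before/after measurements used in the Wilcoxon example of these notes: the test is a one-sample t-test on the differences, so the statistic from `scipy.stats.ttest_rel` can be checked against the formula mean(diff) / (sd(diff) / sqrt(n)). The variable names are illustrative.

```python
import numpy as np
from scipy import stats

before = np.array([135, 142, 137, 122, 147, 151, 131, 117, 154, 143, 133])
after = np.array([127, 145, 131, 125, 132, 147, 119, 125, 132, 139, 122])

# paired t-test (two-sided by default)
res = stats.ttest_rel(after, before)

# the same statistic by hand: t = mean(d) / (sd(d) / sqrt(n)), with n - 1 df
diff = after - before
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size))

print(f"t = {res.statistic:.3f} (by hand: {t_manual:.3f}), p-value = {res.pvalue:.4f}")
```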


## Wilcoxon Signed-Rank test (non-param)

- small sample size

- tests a hypothesis about the median change rather than the mean change

- Assumption: 

    - similar to the sign test, but it also uses the ranks of the differences

    - if H0 is true, we expect about half of the differences to be (+) and half to be (-)

    - rank the absolute differences, then check whether the rank sums of the (+) and (-) differences are about equal

# Sign Test 

The sign test only looks at whether each value increased or decreased; it ignores the magnitude of the change.

## Assumption

If the null hypothesis is true, i.e. if the median difference is 0, then what we expect to see is: 

half the people should show a decrease and half should show an increase

p-value: the probability of observing what we did in our sample, or something more extreme, if the null is true

observed: 8 -> (-), 3 -> (+)

expected: 5.5 -> (-), 5.5 -> (+)

p-value: 

- probability of 8 or more (-) when we expect 5.5 (-)

- P(X >= 8 | X ~ Binomial(n = 11, p = .5)) = .11
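
The tail probability above can be checked with `scipy.stats.binom`. This sketch assumes the before/after measurements from the paired example in these notes, which give the same 8 (-) and 3 (+) signs; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

before = np.array([135, 142, 137, 122, 147, 151, 131, 117, 154, 143, 133])
after = np.array([127, 145, 131, 125, 132, 147, 119, 125, 132, 139, 122])
diff = after - before

n_neg = int((diff < 0).sum())  # number of decreases
n_pos = int((diff > 0).sum())  # number of increases
n = n_neg + n_pos              # zero differences would be excluded

# sf(k) = P(X > k), so P(X >= n_neg) = sf(n_neg - 1)
p_value = stats.binom.sf(n_neg - 1, n, 0.5)

print(f"{n_neg} (-), {n_pos} (+), one-sided p-value = {p_value:.4f}")
# 8 (-), 3 (+), one-sided p-value = 0.1133
```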

## Binomial Distribution

- models an event that either occurs or does not occur on each trial

- the probability of the event occurring is constant (p)

- each trial is independent
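
Under these conditions the number of events X in n trials has probability P(X = k) = C(n, k) p^k (1 - p)^(n - k). A minimal sketch with only the standard library, using n = 11 and p = 0.5 from the sign test above (the helper name `binom_pmf` is made up for this example):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 11, 0.5
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

# the probabilities over k = 0..n add up to 1
total = sum(pmf)

# upper tail used in the sign test: P(X >= 8)
upper_tail = sum(pmf[8:])

print(f"sum of pmf = {total:.4f}, P(X >= 8) = {upper_tail:.4f}")
```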


```python
import numpy as np
from scipy import stats

before_data = np.array([135, 142, 137, 122, 147, 151, 131, 117, 154, 143, 133])
after_data = np.array([127, 145, 131, 125, 132, 147, 119, 125, 132, 139, 122])
diff = after_data - before_data

def rerank_4_same_value_no_sort(rank_no_sorted, array_value_no_sorted, relate_ind_no_sorted):
    '''
    input:
        rank_no_sorted: 0-based ranks of the absolute values, in the original order
        array_value_no_sorted: absolute values, in the original order
        relate_ind_no_sorted: indices that sort the absolute values (from argsort)
    output:
        the new 1-based ranks and the absolute values, both in the original order
    purpose:
        recompute the ranks so that tied absolute values share their average rank
    '''
    data_len = len(rank_no_sorted)
    rank_no_sorted = rank_no_sorted + 1.0  # 1-based; float so averaged ranks fit
    ind = 0

    while ind < data_len:
        # find the last position (in sorted order) holding the same value as position ind
        last = ind
        while (last + 1 < data_len and
               array_value_no_sorted[relate_ind_no_sorted[last + 1]]
               == array_value_no_sorted[relate_ind_no_sorted[ind]]):
            last += 1

        # ranks inside a run of ties are consecutive, so their mean is the
        # midpoint of the first and last rank (a run of length 1 is unchanged)
        rank_avg = (rank_no_sorted[relate_ind_no_sorted[ind]]
                    + rank_no_sorted[relate_ind_no_sorted[last]]) / 2.0
        rank_no_sorted[relate_ind_no_sorted[ind:last + 1]] = rank_avg
        ind = last + 1

    return (rank_no_sorted, array_value_no_sorted)

def wilcoxon_sign_rank(data_array):
    '''
    input: array of paired differences
    output: ranks of the (+) differences; ranks of the (-) differences
    purpose: rank the absolute differences, then split the ranks by sign
    '''
    # zero differences carry no sign and are dropped, as zero_method='wilcox' does
    data_array = data_array[data_array != 0]
    abs_data = abs(data_array)
    temp = abs_data.argsort()  # temp[i] is the index of the i-th smallest absolute value
    ranks = np.empty_like(temp)
    ranks[temp] = np.arange(len(abs_data))  # 0-based rank of each element; rank value = this + 1

    rerank_no_sort, _ = rerank_4_same_value_no_sort(ranks, abs_data, temp)

    pos_rank = rerank_no_sort[data_array > 0]
    neg_rank = rerank_no_sort[data_array < 0]
    return (pos_rank, neg_rank)

pos_rank, neg_rank = wilcoxon_sign_rank(diff)
# under H0 the total rank sum is split evenly between the two signs
avg_rank = (pos_rank.sum() + neg_rank.sum()) / 2
print('expect sum of (-) is about {}; and sum of (+) is about {}'.format(avg_rank, avg_rank))
print()
print('observe sum of (-) is about {}; and sum of (+) is about {}'.format(neg_rank.sum(), pos_rank.sum()))
print()
print('p-value = prob of sum of (-) >= {} if expectation is {}'.format(neg_rank.sum(), avg_rank))

# one-sided alternative: the median change (after - before) is less than 0
w, p = stats.wilcoxon(diff, zero_method='wilcox', correction=True, alternative='less')
print()
print('p-value is {:.4f}, reject the null hypothesis: the median change is less than 0'.format(p))
```

```
expect sum of (-) is about 33.0; and sum of (+) is about 33.0

observe sum of (-) is about 56.5; and sum of (+) is about 9.5

p-value = prob of sum of (-) >= 56.5 if expectation is 33.0

p-value is 0.0203, reject the null hypothesis: the median change is less than 0
```


# Pearson's Chi-Square Test of Independence

- The chi-square test checks for independence between two variables (X & Y) that are both categorical (factors)

- X & Y can each have two or more levels, but the groups formed by X must be independent

    - if X is an exposure, the exposed and non-exposed groups must be independent of one another

- The chi-square test is usually classed as a non-parametric test, but

    - it relies on having a large sample size

    - it uses a theoretical probability distribution (the chi-square distribution)

    - in both respects it behaves just like a parametric approach

- Notice: 

    - for large sample sizes, small effects may show up as statistically significant

    - for small sample sizes, big effects may NOT show up as statistically significant

    - so the p-value is just a guide, not a magic number for grading a result

    - the test does not describe the direction or the strength of the association

- Assumptions: 

    - groups are independent; observations are independent

    - every cell count >= 1

    - every expected cell count >= 5 <- just a guideline, not a magic number

        - if this is not met, use a method that does not rely on the large-sample approximation -> Fisher's exact test / bootstrap
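
When the expected counts are too small, Fisher's exact test is available as `scipy.stats.fisher_exact`. A minimal sketch on a made-up 2x2 table (the counts are hypothetical and chosen only so that every expected count falls below 5):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: rows = exposure (yes/no), cols = outcome (yes/no)
table = np.array([[3, 1],
                  [1, 3]])

# expected counts under independence: row total x col total / overall total
row_totals = table.sum(axis=1)
col_totals = table.sum(axis=0)
expected = np.outer(row_totals, col_totals) / table.sum()
print("expected counts:\n", expected)
print("any expected count below 5:", bool((expected < 5).any()))

# Fisher's exact test (two-sided by default) does not use the chi-square approximation
odds_ratio, p_value = stats.fisher_exact(table)
print(f"sample odds ratio = {odds_ratio:.2f}, p-value = {p_value:.4f}")
```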

## Example

- X: vaccinated for MMR (yes or no); Y: diagnosed with autism (yes or no)

- question: is there any relationship between X and Y?

- H0: X and Y are independent (unassociated) ---> H0: P1 = P2 ---> (P1 - P2) = 0

- H1: X and Y are dependent (associated) ---> H1: P1 <> P2 ---> (P1 - P2) <> 0

- observed

    - P1 = P(aut | vacc) = 621 / 440,655 = .00141

    - P2 = P(aut | not vacc) = 117 / 96,648 = .00121


| item | aut? yes | aut? no | total |
| --- | --- | --- | --- |
| vacc? yes | 621 | 440,034 | 440,655 |
| vacc? no | 117 | 96,531 | 96,648 |
| total | 738 | 536,565 | 537,303 |

- expected (if H0 is true)

    - under independence, P(vacc and aut) = P(vacc) x P(aut) = (440,655 / 537,303) x (738 / 537,303)

    - expected # people in that cell = "np" = 537,303 x P(vacc) x P(aut) = 605.25 => in general, E = row total x col total / overall total

    - degrees of freedom: among the four cells there is only one degree of freedom, because the row and column totals are fixed, so once one cell is fixed the other three are fixed too

| item | aut? yes | aut? no | total |
| --- | --- | --- | --- |
| vacc? yes | 605.25 | 440,049.75 | 440,655 |
| vacc? no | 132.75 | 96,515.25 | 96,648 |
| total | 738 | 536,565 | 537,303 |

- test statistic: chi-square = sum over all cells of (observed - expected)^2 / expected = (621 - 605.25)^2 / 605.25 + ... + (96,531 - 96,515.25)^2 / 96,515.25 = 2.28

- chi-square distribution: if H0 is true, the statistic follows a chi-square distribution with df = (#rows - 1) x (#cols - 1) = 1

- p-value = P(chi-square >= 2.28 | H0 is true) = .1309, which is bigger than 5%, so we fail to reject H0: we have no evidence that X and Y are related

```python
from scipy import stats
import numpy as np

def pearson_chi_square(x_expected, x_actual):
    '''
    input: two arrays, the expected counts and the observed counts, cell by cell
    process:
        - square each (expected - observed) difference and divide by the expected count
        - sum over all cells -> this is the chi-square statistic
        - df = (# col - 1) * (# row - 1), used when looking up the p-value
    '''
    sum_up = 0
    for a, b in zip(x_expected, x_actual):
        print(a, b)  # show each expected / observed pair
        sum_up += (a - b) ** 2 / a

    return sum_up

x_expected = np.array([605.25, 440049.75, 132.75, 96515.25])
x_actual = np.array([621, 440034, 117, 96531])

chi2_stat = pearson_chi_square(x_expected, x_actual)
print(chi2_stat)

# p-value: upper-tail probability of the chi-square distribution with 1 df
p_value = 1 - stats.chi2.cdf(chi2_stat, df=1)
print(p_value)
```

```
605.25 621
440049.75 440034
132.75 117
96515.25 96531
2.2816292732677996
0.13091428382458448
```
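
As a cross-check, `scipy.stats.chi2_contingency` computes the expected counts, the statistic, the degrees of freedom, and the p-value directly from the observed table. The sketch below passes `correction=False` because the hand calculation above does not use Yates' continuity correction; small differences in the last digits are expected because the expected counts above were rounded to two decimals.

```python
import numpy as np
from scipy import stats

# observed counts: rows = vaccinated (yes, no), cols = autism (yes, no)
observed = np.array([[621, 440034],
                     [117, 96531]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

print("expected counts:\n", np.round(expected, 2))
print(f"chi-square = {chi2_stat:.4f}, df = {dof}, p-value = {p_value:.4f}")
```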

# Measures of Association for 2x2 Tables