This notebook contains simulations to illustrate puzzles and doubts.

Todo list:
- Does standardization change the significance of coefficients of linear regressions?

## Expected number of flips
- Q: What is the expected number of fair coin flips to get two consecutive heads?
- Ans: 6

In [1]:
import numpy as np

In [8]:
## Count the number of flips to get HH for one trial

count = 0 # number of flips
hh = 0    # number of consecutive heads

while hh < 2:
    # flip the coin and count
    count += 1
    # get the result: 1=Head, 0=Tail
    r = np.random.binomial(1, 0.5)
    if r == 1:
        hh += 1
    elif r == 0 and hh ==1:
        hh -= 1
        
print('Number of flips: %d' % count)

Number of flips: 6


In [11]:
## Average number of flips to get HH for n trials

def flips_2h(n=1):
    '''Returns a list of size n. 
    Each element is the number of flips to get 2 consecutive heads. '''
    l = [] 
    
    for i in range(n):
        count = 0
        hh = 0
        
        while hh < 2:
            count += 1
            r = np.random.binomial(1, 0.5)
            if r == 1:
                hh += 1
            elif r == 0 and hh == 1:
                hh -= 1
            
        l.append(count)
    
    return l 

flips_2h(10)

[3, 3, 3, 2, 3, 3, 2, 4, 7, 7]

According to the Law of Large Numbers, the average number of flips converges to its expected value as n becomes large.

In [15]:
print('Averge number of flips with 1000 trials: %.2f' % np.mean(flips_2h(1000)))

Averge number of flips with 1000 trials: 6.17


**Variations:**
- what happens if we change 2 consecutive Heads to k?
- recursive way of writing code? similar to Hanoi Tower .


## Correlation of Error Terms
- In linear regressions, error terms are assumed to be i.i.d. 
- Correlation of error terms will lead to smaller estimated standard error. And coefficients are more likely to be statistically significant.
- E.g., accidentally double the data (n -> 2n)
- This simulation tries to illustrate how estimated coef. and Std.Err changes when: 
    1. copy the sample n many times
    2. bootstrap the sample many times
    
True model: `Y = 2X + e`

In [93]:
import numpy as np
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
import random

In [102]:
## DGP

random.seed(20201208)

n = 100
mu, sigma = 9, 3
sigma_e = 4
scale = 10

x = np.random.normal(mu, sigma, n)
e = np.random.normal(0, sigma_e, size=n)
y = 2*x + e

print(x.shape, y.shape)

(100,) (100,)
(100, 2)


In [None]:
## Model: benchmark OLS

X = sm.add_constant(x)
ols = sm.OLS(y, x).fit()

print(X.shape)

In [103]:
## Duplicate the sample: copy - paste

x_dup = np.repeat(x, scale)
y_dup = np.repeat(y, scale)

print(y_dup.shape, x_dup.shape)

X = sm.add_constant(x_dup) # keep same var name
dup = sm.OLS(y_dup, X).fit()


(1000,) (1000,)


In [104]:
## Draw bootstrap sample

# Trick: bootstrap (y, x) pair using index
index = np.arange(len(sample))
index_bs = np.random.choice(index, scale*len(index))

y_bs = y[index_bs]
x_bs = x[index_bs]

print(y_bs.shape, x_bs.shape)

X = sm.add_constant(x_bs)
bs = sm.OLS(y_bs, X).fit()


(1000,) (1000,)


In [105]:
## Display results

summary_col([ols, dup, bs], 
            model_names=['OLS', 'Duplicate', 'Bootstrap'],
            stars=True, float_format='%0.3f',
            regressor_order=['x1', 'const'],
            drop_omitted=True,
            info_dict={'N':lambda x: "{0:d}".format(int(x.nobs))})

0,1,2,3
,OLS,Duplicate,Bootstrap
x1,1.931***,1.776***,1.782***
,(0.038),(0.037),(0.036)
const,,1.578***,1.672***
,,(0.359),(0.354)
R-squared,0.962,0.697,0.706
R-squared Adj.,0.962,0.696,0.706
N,100,1000,1000


**Findings**
- When the error terms have relatively small standard deviation, e.g., e ~ N(0,1), the estimated Std.Err of beta1 becomes smaller when n increases.
- When the standard deviation of error terms increases, e ~ N(0,2), the estimated Std.Err of beta1 becomes similar to the original sample when n increases.
- The model fit `R2` becomes worse with duplicated obs. and with more noise (sigma_e).
- The Duplicate and Bootstrap sample return very similar results.
- Q: for unbalanced data, is it good to enlarge the minority group or shrink the majority group?

## Next problem