## On which weekday should we purchase VTI (Vanguard Total Stock Market ETF)?

Label the weekdays 0 = Monday, ..., 4 = Friday, compare the average price of VTI on each weekday, and find out which day is best for purchasing index funds when we want to avoid checking the price frequently, or when we want to automate the purchase.

### Sampling:
1. Collect 30 weeks of data (30 Mondays, Tuesdays, Wednesdays, Thursdays, and Fridays), assuming there is no bias, and then apply test statistics to see what we get.

2. Assuming that election results strongly affect the stock market, use data from before the Democratic primary elections (up to Jan 31st, 2020) and apply the same test statistics.

3. News of the coronavirus might also have a large effect on stock prices. If that seems to be the case, I will mitigate its effect, probably by using data from last year (up to Nov 29th, 2019).

**To keep the sample unbiased, I will remove weeks that include a holiday. For example, if Monday is a holiday, Tuesday might effectively function as Monday, which would bias the sample.**
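The holiday rule above can also be expressed directly in pandas. The following is a minimal sketch, not the method used in the cells below: the helper name `full_weeks` and its parameters are hypothetical, and it assumes the price table has a `Date` column in `YYYY-MM-DD` form, as in `VTI.csv`.

```python
import pandas as pd

def full_weeks(prices: pd.DataFrame, end: str, n_weeks: int) -> pd.DataFrame:
    """Return the last n_weeks holiday-free weeks ending on or before `end`.

    Hypothetical helper: assumes `prices` has a 'Date' column.
    """
    df = prices.copy()
    df["Date"] = pd.to_datetime(df["Date"])
    df = df[df["Date"] <= pd.Timestamp(end)].sort_values("Date")
    # Label each row with the Monday that starts its calendar week.
    df["week"] = df["Date"] - pd.to_timedelta(df["Date"].dt.dayofweek, unit="D")
    # A week without holidays has exactly five trading days.
    days_in_week = df.groupby("week")["Date"].transform("count")
    df = df[days_in_week == 5]
    # Keep only the most recent n_weeks of the remaining weeks.
    keep = df["week"].drop_duplicates().tail(n_weeks)
    return df[df["week"].isin(keep)].reset_index(drop=True)
```

For example, `full_weeks(pd.read_csv('VTI.csv'), '2020-01-31', 30)` would select a window ending on Jan 31st, 2020. A partial week at the end of the window is dropped by the same five-day rule.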

Since a true daily average is hard to compute from the available data, define the average in two ways:
1. (max + min) / 2, i.e. (High + Low) / 2
2. Closing value
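As an illustration of the two definitions, here is a small pandas sketch that averages both by weekday. The helper name `weekday_means` is hypothetical, and it assumes the `Date`, `High`, `Low`, and `Close` columns read from `VTI.csv`.

```python
import pandas as pd

def weekday_means(prices: pd.DataFrame) -> pd.DataFrame:
    """Mean of (High + Low) / 2 and of Close for each weekday (0 = Monday)."""
    df = prices.copy()
    df["weekday"] = pd.to_datetime(df["Date"]).dt.dayofweek
    # Definition 1: midpoint of the daily high and low.
    df["midpoint"] = (df["High"] + df["Low"]) / 2
    # Definition 2 is the Close column itself.
    return df.groupby("weekday")[["midpoint", "Close"]].mean()
```

The result has one row per weekday, which makes it easy to compare the two definitions side by side.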

In [1]:
import pandas as pd
from scipy.stats import f

In [2]:
# utility functions

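def days(year: int, month: int):
    # Number of days in the given month, using the full Gregorian leap-year rule.
    days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

    is_leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    if is_leap and month == 2:
        return 29
    return days_in_month[month - 1]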

def eval_date(date: str):
    return list(map(int, date.split("-")))

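def is_sequence(dates):
    # dates holds five [year, month, day] entries, newest first.
    # Return True when the day-of-month values step back one calendar day at a time,
    # which for five trading days means a full Friday-to-Monday week.
    n = dates[0][2]
    for i in range(4):
        c_date = dates[i]
        if c_date[2] != n:
            return False
        if c_date[2] == 1:
            # Step back across a month boundary to the last day of the previous month.
            if c_date[1] != 1:
                month, year = c_date[1] - 1, c_date[0]
            else:
                month, year = 12, c_date[0] - 1
            n = days(year, month)
        else:
            n = c_date[2] - 1

    return dates[4][2] == n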

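def collect_data(mem, vals):
    # vals holds one week of rows, newest first: Friday down to Monday.
    # mem[0] collects Mondays, ..., mem[4] collects Fridays.
    fri, thu, wed, tue, mon = vals
    for bucket, row in zip(mem, (mon, tue, wed, thu, fri)):
        bucket.append(row)
    return mem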

def avg(lst):
    return list(map(lambda x : sum(x) / len(x), lst))

def get_high_and_low(data):
    return list(map(lambda weekday : list(map(lambda x : (x[2] + x[3]) / 2, weekday)), data))

def get_close(data):
    return list(map(lambda weekday : list(map(lambda x : x[4], weekday)), data))

def get_diff_open_and_close(data):
    return list(map(lambda weekday : list(map(lambda x : x[4] - x[1], weekday)), data))

def get_diff_high_and_low(data):
    return list(map(lambda weekday : list(map(lambda x : x[2] - x[3], weekday)), data))

def sst(avgs, n):
    avg = sum(avgs) / len(avgs)
    res = 0
    for i in range(len(avgs)):
        res += n * ((avgs[i] - avg) ** 2)
    return res

def s_squares(lst):
    res = []
    for i in range(len(lst)):
        s = 0
        avg = sum(lst[i]) / len(lst[i])
        for j in range(len(lst[i])):
            s += (lst[i][j] - avg) ** 2
        res.append(s / (len(lst[i]) - 1))
    return res

def sse(lst):
    ss = s_squares(lst)
    res = 0
    for i in range(len(ss)):
        res += (len(lst[0]) - 1) * ss[i]
    return res

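def mst(avgs, n):
    # Between-group mean square: sst divided by (number of groups - 1).
    return sst(avgs, n) / (len(avgs) - 1)

def mse(lst):
    # Error mean square as computed here: sse divided by (total observations - 1).
    return sse(lst) / ((len(lst) * len(lst[0])) - 1)

def test_stat(lst, avgs):
    # Ratio of the two mean squares; the n passed to mst is the total number of observations.
    return mst(avgs, len(lst) * len(lst[0])) / mse(lst)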

def test_many_cases(data):
    # Each case pairs a label with the function that extracts one value per trading day.
    cases = [
        ("Average by (high + low) / 2:", get_high_and_low),
        ("Average by close:", get_close),
        ("Average by close - open:", get_diff_open_and_close),
        ("Average by high - low:", get_diff_high_and_low),
    ]
    for label, extract in cases:
        values = extract(data)
        print(label)
        print(avg(values))
        t = test_stat(values, avg(values))
        print(t)
        print("p-value: ")
        # Upper-tail probability of F with (k - 1, k * (n - 1)) degrees of freedom.
        print(1 - f.cdf(t, len(data) - 1, (len(data[0]) - 1) * len(data)))

### Hypothesis 1:
Let u0 = average price of VTI on Mondays, u1 = average price of VTI on Tuesdays, ..., u4 = average price of VTI on Fridays.
1. H0: u0 = u1 = u2 = u3 = u4
2. H1: at least one of the u's is different
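Hypotheses of this form are usually tested with a one-way ANOVA F-test. The sketch below computes the textbook statistic by hand and cross-checks it against `scipy.stats.f_oneway`; the helper name `one_way_anova` is hypothetical. Note that the textbook form weights each group mean by the per-group sample size and divides the within-group sum of squares by N - k, which is not the same scaling that `mst` and `mse` in the utility cell apply, so the two statistics need not agree.

```python
from scipy.stats import f, f_oneway

def one_way_anova(groups):
    """Textbook one-way ANOVA: return (F statistic, p-value).

    `groups` is a list of lists of floats, one inner list per weekday.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares, weighted by each group's size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares around each group's own mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    # Upper-tail probability of F with (k - 1, N - k) degrees of freedom.
    p_value = f.sf(f_stat, k - 1, n_total - k)
    return f_stat, p_value

# Cross-check on toy data (not VTI prices).
toy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
reference = f_oneway(*toy)
```

With the notebook's data, the same call would be `one_way_anova(get_close(data))` or `f_oneway(*get_close(data))`.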

## Sample 1:
30 weeks (30 Mondays, Tuesdays, Wednesdays, Thursdays, and Fridays) up to 3/6/2020

In [9]:
vtis = pd.read_csv('VTI.csv', usecols=['Date','Open','High','Low','Close'])
vtis['Date'] = vtis['Date'].apply(eval_date)
vtis = vtis.iloc[::-1]

data = [[], [], [], [], []]

# remove weeks with holidays
while len(data[0]) < 31:
    if is_sequence(list(vtis.head(5)['Date'])):
        data = collect_data(data, vtis.head(5).values)
        vtis = vtis[5:]
    else:
        vtis = vtis[1:]

from statistics import variance

# Sample variance of close - open for each weekday (Monday to Friday).
for weekday_values in get_diff_open_and_close(data):
    print(variance(weekday_values))
test_many_cases(data)

0.5317120387042581
2.0092355157928115
0.864710758431361
0.994510747301108
1.5722591557770513
Average by (high + low) / 2:
[155.02403233870967, 155.06080680645158, 154.90177466129032, 155.13129006451612, 155.2137091451613]
0.044973023067741605
p-value: 
0.9961446381354158
Average by close:
[155.02387161290324, 154.92741935483875, 155.00677490322582, 155.25354883870963, 155.28096796774193]
0.08215510074788068
p-value: 
0.9877652493165315
Average by close - open:
[-0.009998516129031495, -0.29709690322580457, 0.03032267741935544, 0.07064612903225738, -0.05645122580645351]
2.786838415415574
p-value: 
0.02861967406744914
Average by high - low:
[1.0829049354838738, 1.514515548387096, 1.39387158064516, 1.4387099354838728, 1.465483709677421]
4.730842696414332
p-value: 
0.0012685948520273493


## Sample 2:
30 weeks prior to the Democratic primary elections (up to Jan 31st, 2020).

In [4]:
vtis = pd.read_csv('VTI.csv', usecols=['Date','Open','High','Low','Close'])
vtis['Date'] = vtis['Date'].apply(eval_date)
vtis = vtis.iloc[::-1]

while list(vtis.head(1)['Date']) != [[2020, 1, 31]]:
    vtis = vtis[1:]

data = [[], [], [], [], []]

# remove weeks with holidays
while len(data[0]) < 31:
    if is_sequence(list(vtis.head(5)['Date'])):
        data = collect_data(data, vtis.head(5).values)
        vtis = vtis[5:]
    else:
        vtis = vtis[1:]
        
test_many_cases(data)

Average by (high + low) / 2:
[152.80290343548387, 152.9258066451613, 152.82048466129035, 153.2332259032258, 153.60112833870966]
0.4645653956758006
p-value: 
0.7616530050069177
Average by close:
[152.77225861290322, 152.92806464516127, 152.99419429032255, 153.46064554838708, 153.56354841935482]
0.48183508600806924
p-value: 
0.7490410819810378
Average by close - open:
[-0.11709503225806327, -0.018064419354838112, 0.13709729032258208, 0.19774199999999786, -0.16129000000000357]
4.829111839022657
p-value: 
0.0010831813239968735
Average by high - low:
[1.060646806451616, 1.3761287741935508, 1.359678096774192, 1.3277425806451633, 1.371935516129034]
5.593339756936002
p-value: 
0.00031796055436794646


## Sample 3:
Before the outbreak of coronavirus (up to Nov 29th, 2019).

In [5]:
vtis = pd.read_csv('VTI.csv', usecols=['Date','Open','High','Low','Close'])
vtis['Date'] = vtis['Date'].apply(eval_date)
vtis = vtis.iloc[::-1]

while list(vtis.head(1)['Date']) != [[2019, 11, 29]]:
    vtis = vtis[1:]

data = [[], [], [], [], []]

# remove weeks with holidays
while len(data[0]) < 31:
    if is_sequence(list(vtis.head(5)['Date'])):
        data = collect_data(data, vtis.head(5).values)
        vtis = vtis[5:]
    else:
        vtis = vtis[1:]
        
test_many_cases(data)

Average by (high + low) / 2:
[149.81790308064518, 149.9532257096774, 149.77790401612904, 150.01854875806455, 150.43145137096778]
0.731567979926648
p-value: 
0.571755714891763
Average by close:
[149.84677467741935, 149.97096745161292, 149.91548499999996, 150.21580645161296, 150.52935499999998]
0.8402683785194143
p-value: 
0.5016745729982786
Average by close - open:
[-0.02257909677419159, -0.05032248387096714, 0.04483919354838819, 0.09129077419354674, 0.04870941935483632]
0.6764852783746641
p-value: 
0.609281694572928
Average by high - low:
[1.031936161290324, 1.4535478709677423, 1.4067740967741906, 1.3448388064516135, 1.3725808064516147]
8.563205180923461
p-value: 
2.9868225613904897e-06


### Hypothesis 2
However, there may always be a dominant narrative in society that significantly affects the stock market, so it may be better not to remove those factors.
Indeed, this is yet another hypothesis, which can be tested by comparing the results obtained when trying to eliminate each particular potential factor.
**If there is no significant difference between the data collected by removing factor A and the data collected by removing factor B, then we can conclude that the best day to buy VTI is not affected by a dominant narrative created by breaking news.**
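One possible way to make "no significant difference" concrete is to compare two samples weekday by weekday. The sketch below swaps in Welch's t-test from `scipy.stats.ttest_ind`, which is an assumption of this example and not the test used in the cells of this notebook. The helper name `compare_samples` is hypothetical, and the test assumes the two windows do not share any weeks.

```python
from scipy.stats import ttest_ind

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]

def compare_samples(sample_a, sample_b):
    """Welch's t-test per weekday between two samples.

    Each sample is a list of five lists of floats (Monday to Friday),
    for example the output of get_diff_open_and_close(data).
    """
    results = {}
    for name, a, b in zip(WEEKDAYS, sample_a, sample_b):
        # equal_var=False selects Welch's test, which does not assume equal variances.
        stat, p = ttest_ind(a, b, equal_var=False)
        results[name] = (float(stat), float(p))
    return results
```

A large p-value for every weekday would be consistent with the two samples behaving alike, although it would not prove that they do.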

In [6]:
vtis = pd.read_csv('VTI.csv', usecols=['Date','Open','High','Low','Close'])
vtis['Date'] = vtis['Date'].apply(eval_date)
vtis = vtis.iloc[::-1]

while list(vtis.head(1)['Date']) != [[2020, 1, 31]]:
    vtis = vtis[1:]

data = [[], [], [], [], []]

# remove weeks with holidays
while len(data[0]) < 501:
    if is_sequence(list(vtis.head(5)['Date'])):
        data = collect_data(data, vtis.head(5).values)
        vtis = vtis[5:]
    else:
        vtis = vtis[1:]
        
test_many_cases(data)

Average by (high + low) / 2:
[94.36944111876245, 94.40893207684637, 94.45311386427153, 94.43399212175652, 94.51512969560882]
0.006586036026860385
p-value: 
0.9999139387191973
Average by close:
[94.3750498962076, 94.45872260079847, 94.47321382634719, 94.524690720559, 94.56243520758484]
0.011402019336268387
p-value: 
0.9997437085168427
Average by close - open:
[-0.06532915768463016, 0.033872241516965475, -0.021317103792414808, 0.007305674650698315, 0.007205604790419191]
6.338801043039896
p-value: 
4.5068139671866625e-05
Average by high - low:
[0.9627944690618766, 1.0092813632734527, 1.0468663313373252, 1.1380438722554889, 0.9764870558882242]
15.116957537769538
p-value: 
3.2391866966463567e-12


### Motivation:
I read three books by Yuval Noah Harari: Sapiens: A Brief History of Humankind, Homo Deus: A Brief History of Tomorrow, and 21 Lessons for the 21st Century. One of the arguments the author makes in these books is that the stock market is the largest and most complex algorithm humans have invented. I found this idea fascinating and wanted to play with stock market data. Also, for my personal finances, I would like to know when I should purchase index funds.