## On which weekday should we purchace VTI (Vanguard Total Stock Market ETF)?

For each weekday defined by weekday 0 = Monday, ..., weekday 4 = Friday, compare the average prices of VTI and find out on which day we should purchase index funds if we want to avoid checking its price frequently or potentially for automation. 

### Sampling:
1. Collect data of 30 weeks (30 Mondays, Tuesdays, Wednessday, Thursdays and Fridays) assuming that there is no bias and then apply test statistics to see what we get.  

2. Based on the assumption that election results affect the stock market a lot, use data prior to the democrat's primary election and then apply test statistics to see what I get (~ Jan 31st, 2020).

3. The news of coronavirus might have a huge effect on the stock prices as well so if that seems to be the case, I will think about a way to mitigate its effect: probably by using the data from last year (~ Nov 29th, 2019).

4. 500 weeks of data

**To make the sample unbiased, I will remove weeks that include holidays during the week, because, for example, if Monday is a holiday, then Tuesday might kind of function as Monday and this can be considered as a bias in the sample.**

Since taking average is difficult given the availability of data, define average in two ways:
1. The (max + min) / 2
2. Closing value**

In [1]:
import pandas as pd
from scipy.stats import f
from statistics import variance

In [20]:
# utility functions

def days(year: int, month: int):
    days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

    if year % 4 == 0 and month == 2:
        return 29
    else:
        return days[month - 1]

def eval_date(date: str):
    return list(map(int, date.split("-")))

def is_sequence(dates):
    n = dates[0][2]
    for i in range(4):
        c_date = dates[i]
        if c_date[2] != n:
            return False
        if c_date[2] == 1:
            month, year = 0, 0
            if c_date[1] != 1:
                month = c_date[1] - 1
                year = c_date[0]
            else:
                month = 12
                year = c_date[0] - 1
            n = days(year, month)
        else:
            n = c_date[2] - 1
            
    return dates[4][2] == n

def collect_data(mem, vals):
    [f, h, w, t, m] = vals
    mem[0].append(m)
    mem[1].append(t)
    mem[2].append(w)
    mem[3].append(h)
    mem[4].append(f)
    return mem

def extract_weeks_without_holidays(n, remove):
    vtis = pd.read_csv('VTI.csv', usecols=['Date','Open','High','Low','Close'])
    vtis['Date'] = vtis['Date'].apply(eval_date)
    vtis = vtis.iloc[::-1]

    if remove:
        while list(vtis.head(1)['Date']) != [remove]:
            vtis = vtis[1:]

    data = [[], [], [], [], []]

    # remove weeks with holidays
    while len(data[0]) < n + 1:
        if is_sequence(list(vtis.head(5)['Date'])):
            data = collect_data(data, vtis.head(5).values)
            vtis = vtis[5:]
        else:
            vtis = vtis[1:]
    
    return data

def avg(lst):
    return list(map(lambda x : sum(x) / len(x), lst))

def get_high_and_low(data):
    return list(map(lambda weekday : list(map(lambda x : (x[2] + x[3]) / 2, weekday)), data))

def get_close(data):
    return list(map(lambda weekday : list(map(lambda x : x[4], weekday)), data))

def get_diff_open_and_close(data):
    return list(map(lambda weekday : list(map(lambda x : x[4] - x[1], weekday)), data))

def get_diff_high_and_low(data):
    return list(map(lambda weekday : list(map(lambda x : x[2] - x[3], weekday)), data))

def sst(avgs, n):
    avg = sum(avgs) / len(avgs)
    res = 0
    for i in range(len(avgs)):
        res += n * ((avgs[i] - avg) ** 2)
    return res

def s_squares(lst):
    res = []
    for i in range(len(lst)):
        s = 0
        avg = sum(lst[i]) / len(lst[i])
        for j in range(len(lst[i])):
            s += (lst[i][j] - avg) ** 2
        res.append(s / (len(lst[i]) - 1))
    return res

def sse(lst):
    ss = s_squares(lst)
    res = 0
    for i in range(len(ss)):
        res += (len(lst[0]) - 1) * ss[i]
    return res

def mst(avgs, n):
    return sst(avgs, n) / (len(avgs) - 1)

def mse(lst):
    return sse(lst) / ((len(lst) * len(lst[0])) - 1)

def test_stat(lst, avgs):
    return mst(avgs, len(lst) * len(lst[0])) / mse(lst)

def test_many_cases(data):
    funcs = [["(high + low) / 2", get_high_and_low], ["close", get_close], ["close - open", get_diff_open_and_close], ["high - low", get_diff_high_and_low]]
    
    for i in range(len(funcs)):
        print(funcs[i][0])
        print("Average: ", avg(funcs[i][1](data)))
        variances = list(map(variance, funcs[i][1](data)))
        print("Variances: ", variances)
        if max(variances) / min(variances) < 2:
              t = test_stat(funcs[i][1](data), avg(funcs[i][1](data)))
              print("test-stat: ", t)
              print("p-value: ", 1 - f.cdf(t, len(data) - 1, (len(data[0]) - 1) * len(data)))
        else:
              print("test-stat cannot be applied because variances differ: max(v) / min(v) = ", max(variances) / min(variances))
        print("\n")
        
def test(n, remove):
    print(n, " weeks:")
    test_many_cases(extract_weeks_without_holidays(n, remove))

### Hypothesis 1:
Let u0 = Average price of VTI on Mondays, u1 = Average price of VTI on Tuesdays, â€¦).
1. H0: u0 = u1 = u2 = u3 = u4
2. H1: u's are different

## Sample 1:
30 weeks (30 Mondays, Tuesdays, Wednessday, Thursdays and Fridays) until 3/6/2020

In [26]:
test(60, False)
test(500, False)

60  weeks:
(high + low) / 2
Average:  [148.6676228114754, 148.8587703114754, 148.93983627049178, 148.86934436885244, 148.9853278442623]
Variances:  [81.59372384479924, 80.02245951483998, 77.7181123925901, 81.82379921399107, 82.87483442073503]
test-stat:  0.056502702750279105
p-value:  0.9940424502500522


close
Average:  [148.6901645245902, 148.8890165573771, 148.90770608196723, 149.0268854590164, 149.0396722459016]
Variances:  [82.41988566582987, 77.47863813807369, 80.13055345792415, 81.58440777931393, 84.01782528113868]
test-stat:  0.07547769811975427
p-value:  0.9896351096676572


close - open
Average:  [-0.06393260655737475, -0.014097852459015105, -0.15229421311475244, 0.12147555737704852, -0.050819819672132546]
Variances:  [0.800618204215008, 1.5172410707614872, 1.5126239978972702, 1.064561691814715, 1.2780245117250157]
test-stat:  2.4858099398846227
p-value:  0.043671488425366056


high - low
Average:  [1.346394540983609, 1.487048786885245, 1.5793446065573753, 1.6065571639344267,

## Sample 2:
30 weeks prior to the democrat's primary (~ Jan 31st, 2020).

In [27]:
test(50, [2020, 1, 31])
test(500, [2020, 1, 31])

50  weeks:
(high + low) / 2
Average:  [148.12078424509804, 148.3242153529412, 148.42764766666667, 148.53862747058824, 148.8704901960784]
Variances:  [72.93230093396711, 69.67628623512594, 67.2311619498688, 72.29877268209425, 73.88728603701071]
test-stat:  0.2812282812566758
p-value:  0.8899860820981423


close
Average:  [148.17647088235296, 148.36196086274512, 148.50529537254903, 148.79098056862745, 148.90352931372547]
Variances:  [73.04017394305035, 69.26829415293722, 68.67263557142789, 72.20909055927409, 74.93984460044129]
test-stat:  0.32545791941038565
p-value:  0.8607584837014394


close - open
Average:  [-0.012939470588232846, -0.020587725490195916, 0.03549105882353118, 0.24039213725490044, -0.048235803921571396]
Variances:  [0.7314386579375323, 0.9209467352993994, 1.0301787808735794, 0.700283626790761, 1.1172273585931616]
test-stat:  3.9170400810506227
p-value:  0.004196570831560997


high - low
Average:  [1.2545115098039243, 1.3684307058823533, 1.4780395294117632, 1.46392164705

## Sample 3:
Before the outbreak of coronavirus (~ Nov 29th, 2019).

In [29]:
test(70, [2019, 11, 29])
test(500, [2019, 11, 29])

70  weeks:
(high + low) / 2
Average:  [145.2440843239436, 145.40809787323946, 145.47063406338034, 145.47866228873247, 145.6559154295775]
Variances:  [39.3477758375666, 38.27327135816502, 34.900768865516405, 38.165484651043705, 39.14587806331109]
test-stat:  0.20757217949551734
p-value:  0.9341523947136919


close
Average:  [145.25112697183096, 145.50154885915495, 145.4771844507043, 145.6502816478874, 145.70098588732395]
Variances:  [39.86713514380583, 37.32111195278459, 36.986583619738326, 37.39292058698853, 40.652191350159086]
test-stat:  0.289822375867997
p-value:  0.8845298119511471


close - open
Average:  [-0.12408321126760413, 0.08619730985915468, -0.09267477464788473, 0.13042294366197127, -0.03577501408450841]
Variances:  [0.6479256772402249, 0.9525591018575575, 1.3841271722718318, 0.7421589511684272, 0.8934166925161857]
test-stat cannot be applied because variances differ: max(v) / min(v) =  2.136243740435453


high - low
Average:  [1.2405640845070443, 1.3677462816901407, 1.494

## Sample 4:
500 weeks of data

In [24]:
test(500, False)

500  weeks:
(high + low) / 2
Average:  [94.9706686477046, 95.01385223752503, 95.05758492614778, 95.03335340119764, 95.10419156986029]
Variances:  [1137.4162038217598, 1139.06712311682, 1141.1425295709569, 1145.3862067697828, 1143.9822942658766]
test-stat:  0.005430449344717468
p-value:  0.999941399586404


close
Average:  [94.9776846367266, 95.05742518363279, 95.07822381636714, 95.11930150099813, 95.15710588423157]
Variances:  [1138.3773609282173, 1138.678400006428, 1142.7404869688294, 1144.1014058886103, 1143.7454425312292]
test-stat:  0.01013803213184338
p-value:  0.9997970403431153


close - open
Average:  [-0.059341129740518354, 0.022654666666666136, -0.020758221556885864, 0.003014283433133524, 0.015928161676646742]
Variances:  [0.5298185553138289, 0.6541830597424705, 0.6108055928851722, 0.636509186016767, 0.5032238852260243]
test-stat:  4.736457210193769
p-value:  0.0008290464578053491


high - low
Average:  [0.9676447604790421, 1.0199600119760475, 1.051856359281437, 1.14774446706

### Hypothesis 2
However, there might always huge narrative in our society that affect the stock market significantly, so maybe it is good to not remove those factors.
Indeed, this is yet another hypothesis to be tested by comparing the results I get by trying to eliminate a particular potential factor.
**if there is no significant difference between data collected by removing a factor A vs. the data collected by removing a factor B then we can conclude that on which day to buy VTI is not affected by a dominant narrative created by breaking news.**

### Motivation:
I read books called Sapiens: A Brief History of Humankind, Homo Deus: A Brief History of Tomorrow, and 21 Lessons for the 21st Century all written by the same author, Yuval Noah Harari. In these books, one of the arguments the author makes is that the stock market is the largest and the most complex algorithm humans have invented and I found this idea fascinating and wanted to play with the stock market's data. Also, for my personal finance, I would like to know when I should purchase index funds.