# Incomplete Data 

Creating accurate predictions is one of the most valuable skills in the job market today. Statisticians, economists, and data scientists use data gathered from specific populations in order to make predictions about what behaviors are likely to occur in the future, or what the truth is about what has already occured. Through computational and statistical techniques, we can make _statistical inferences_ to draw conclusions from data that are often incomplete.

As far as estimations of parameters that already exist, having full population data would mean that our questions about that population are answered. But because the cost of gathering full population would, usually, outweigh the benefit of having perfectly accurate data, we are okay with using incomplete samples to make inferences. 

## Are Mutual Funds better than Broad-market index funds? 

The term “index fund” refers to the investment approach of a fund. Specifically, it is a fund that that aims to match the performance of a particular market index, such as the S&P 500 or Russell 2,000. The index fund simply tries to match the market. This differs from a more actively managed fund, in which investments are picked by a fund manager in an attempt to beat the market. The age old question is: are the fees payed to an actively-managed mutual fund worth it? 

We could simply compare the mean return from a given date range for a mutual fund and compare it to the S&P500s mean return from the same time interval, and see which is higher. But the fact that mutual funds have a finite number of time intervals which we can sample means we cannnot rule out the possibility that higher or lower returns from the fund were a result of random variation, and not an indicator of the true quality of the fund. The S&P500 is , essentially, the market. We know all of the information we need about it, because it isnt a sample. The mutual fund data, on the other hand, is incomplete. So, we need to analyze the two funds using statistical techniques which account for random variation that is possible from incomplete data. 

We would like to ananlyze which(if any) mutual funds have out-performed the market, fees included. To start, we read the CSV downloaded from [Stock Market MBA](https://stockmarketmba.com/listoftop100activelymanagedusstockmutualfunds.php), which shows the 100 largest actively-managed mutual funds in the US. 

In [1]:
import numpy as np
import pandas as pd

mutual_fund_data = pd.read_csv("Top100MutualFunds.csv")
to_drop = ["Category2", "Category1","Category3", "Morningstar Rating","Current yield", "Action"]
mutual_fund_data = mutual_fund_data.drop(columns=to_drop)

In [2]:
mutual_fund_data

Unnamed: 0,Symbol,Name,Morningstar Category,Market cap,Fees
0,AGTHX,American Funds The Growth Fund of America Class A,Large Growth,"$138,592,080,000",0.62%
1,FCNTX,Fidelity Contrafund Fund,Large Growth,"$121,762,870,000",0.74%
2,CWMAX,American Funds Washington Mutual Investors Fun...,Large Blend,"$113,300,000,000",0.63%
3,CWMCX,American Funds Washington Mutual Investors Fun...,Large Blend,"$113,300,000,000",1.40%
4,CWMEX,American Funds Washington Mutual Investors Fun...,Large Blend,"$113,300,000,000",0.87%
...,...,...,...,...,...
95,FDTRX,Franklin DynaTech Fund Class R6,Large Growth,"$9,600,000,000",0.51%
96,FDYZX,Franklin DynaTech Fund Advisor Class,Large Growth,"$9,600,000,000",0.62%
97,BBVLX,Bridge Builder Large Cap Value Fund,Large Value,"$9,500,000,000",0.25%
98,PEYAX,Putnam Large Cap Value Fund Class A,Large Value,"$9,393,340,000",0.91%


In [3]:
import pandas_datareader as web
import datetime as dt

mutual_fund_dict = {}
symbols = mutual_fund_data["Symbol"][25:50]
# will only analyze the first 25 funds for now, but this should illustrate how to do it for any fund 
start= dt.datetime(1970, 1, 1)
end = dt.datetime.today()
for symbol in symbols: 
    #pull mutual fund data for the longest timeframe avaliable, and cpnvert to monthly percent change data 
    fund_data = web.DataReader(symbol, 'yahoo', start, end)#["Adj Close"].resample('M').first().pct_change()
    mutual_fund_dict[symbol] = fund_data
    

In [4]:
mutual_fund_dict

{'EAGRX':                  High        Low       Open      Close  Volume  Adj Close
 Date                                                                     
 2018-01-18  61.389999  61.389999  61.389999  61.389999     0.0  49.962879
 2018-01-19  61.590000  61.590000  61.590000  61.590000     0.0  50.125648
 2018-01-22  61.950001  61.950001  61.950001  61.950001     0.0  50.418644
 2018-01-23  62.119999  62.119999  62.119999  62.119999     0.0  50.556995
 2018-01-24  62.349998  62.349998  62.349998  62.349998     0.0  50.744183
 ...               ...        ...        ...        ...     ...        ...
 2022-02-08  64.930000  64.930000  64.930000  64.930000     0.0  64.930000
 2022-02-09  65.639999  65.639999  65.639999  65.639999     0.0  65.639999
 2022-02-10  65.010002  65.010002  65.010002  65.010002     0.0  65.010002
 2022-02-11  64.919998  64.919998  64.919998  64.919998     0.0  64.919998
 2022-02-14  64.680000  64.680000  64.680000  64.680000     0.0  64.680000
 
 [1027 rows x 

To start, you formulate your __hypotheses__. These are mutually exclusive, falsifiable statements. Only one can be true, and one of them will be true. We create these two hypotheses: 

- The _null_ hypothesis $H_o$: The true means of the the sample populations do not differ.
- The _alternate_ hypothesis $H_a$: The true means of the sample populations do differ.

### 4 Steps of Hypothesis Testing

All hypotheses are tested using a four-step process:

1. State the two hypotheses so that only one can be right. 
2. Formulate an analysis plan, which outlines how the data will be evaluated.
3. Carry out the plan and physically analyze the sample data.
4. Analyze the results and either reject the null hypothesis, or state that the null hypothesis is plausible, given the data.

Hypothesis testing can be done mentally. It would be burdensome to have to state your _null_ and _alternate_ hypotheses, and run through these four steps explicitly every time you made a predictive computer model. The point is that in means testing, there is a clear process and result that deliniates "Yes, the true means of these samples are different" and "No, they're not significantly different"

In the case of us determining the efficacy of our company's marketing campaign, these are our hypotheses: 

- $H_o$: There is no difference between the mutual fund's and S&P500's average monthly return. 
- $H_a$: The mutual funds have a higher mean gain than the S&P500. 

In [5]:
keys = mutual_fund_dict.keys()
keys

dict_keys(['EAGRX', 'AMRMX', 'FMAGX', 'HACAX', 'FLPSX', 'PRGFX', 'PRNHX', 'DFQTX', 'DFEOX', 'FLPKX', 'RPMGX', 'FOCPX', 'CNGAX', 'CNGCX', 'CNGEX', 'CNGFX', 'FNEFX', 'FOCKX', 'DFLVX', 'CDDRX', 'CDDYX', 'CDIRX', 'CVIRX', 'TWCUX', 'EGFFX'])

In [6]:
# create empty dictionary to hold average yearly gain for each mutual fund 
monthly_returns_dict = {}
keys = mutual_fund_dict.keys()
for key in keys:
    # for each mutual fund, find average yearly gain
    monthly_returns_dict[key] = mutual_fund_dict[key]["Adj Close"].resample("Y").first().pct_change().mean()
monthly_returns_dict

{'EAGRX': 0.0752911784487113,
 'AMRMX': 0.09748249533765245,
 'FMAGX': 0.14698700612271093,
 'HACAX': 0.13310212612523584,
 'FLPSX': 0.1436571263499804,
 'PRGFX': 0.11315218175286379,
 'PRNHX': 0.1437720383779023,
 'DFQTX': 0.11225230181469795,
 'DFEOX': 0.11508386237797634,
 'FLPKX': 0.11686041803216049,
 'RPMGX': 0.14861602206647556,
 'FOCPX': 0.17356664595967675,
 'CNGAX': 0.12667358456504008,
 'CNGCX': 0.11903825641585608,
 'CNGEX': 0.12149481117395071,
 'CNGFX': 0.12641272602915776,
 'FNEFX': 0.19470481988305616,
 'FOCKX': 0.19586175099645883,
 'DFLVX': 0.11722244984296658,
 'CDDRX': 0.1303376720461748,
 'CDDYX': 0.13077736390683284,
 'CDIRX': 0.10960190921706282,
 'CVIRX': 0.12966492275724245,
 'TWCUX': 0.15064326901331127,
 'EGFFX': 0.15268094995867576}

In [7]:
yearly_returns_dict = {}
for key in keys:
    # for each mutual fund, find average yearly gain
    yearly_returns_dict[key] = mutual_fund_dict[key]["Adj Close"].resample("Y").first().pct_change().dropna()

In [8]:
yearly_returns_dict

{'EAGRX': Date
 2019-12-31   -0.115169
 2020-12-31    0.210248
 2021-12-31    0.080210
 2022-12-31    0.125876
 Freq: A-DEC, Name: Adj Close, dtype: float64,
 'AMRMX': Date
 1987-12-31    0.129006
 1988-12-31    0.017496
 1989-12-31    0.051963
 1990-12-31    0.229782
 1991-12-31   -0.044868
 1992-12-31    0.183723
 1993-12-31    0.033482
 1994-12-31    0.078716
 1995-12-31   -0.032689
 1996-12-31    0.258144
 1997-12-31    0.151025
 1998-12-31    0.270376
 1999-12-31    0.145518
 2000-12-31   -0.017953
 2001-12-31    0.100673
 2002-12-31    0.076262
 2003-12-31   -0.098698
 2004-12-31    0.201578
 2005-12-31    0.099177
 2006-12-31    0.070008
 2007-12-31    0.150431
 2008-12-31    0.017657
 2009-12-31   -0.272640
 2010-12-31    0.238247
 2011-12-31    0.116737
 2012-12-31    0.049732
 2013-12-31    0.136999
 2014-12-31    0.241262
 2015-12-31    0.138669
 2016-12-31   -0.040841
 2017-12-31    0.161633
 2018-12-31    0.178324
 2019-12-31   -0.029867
 2020-12-31    0.224195
 2021-12-31

These values will be compared to the monthly returns of the stock market: 

In [19]:
from datlib.stats import *

sp500 = web.DataReader('^GSPC', 'yahoo', start, end)['Adj Close'].resample('Y').first().pct_change().dropna()
mean_sp500_gain = mean(sp500)

mean_sp500_gain

0.09179442431292621

##### T Distributions
All of the t-distributions below are normal distributions. As the degrees of freedom increases past 30 or so, the distribution becomes the _standard normal distribution_(see Central Limit Theorem below), which has a standard deviation of 1 and mean of 0, and we use z-scores to analyze this. 

__The $t$ value tells us how many standard deviations away from the mean our sample sits on a $t$ distribution of the _differences_ of these two means, where the mean of the distribution is zero.__
The t-distribution changes based on sample size, as increased sample size allows for higher _degrees of freedom_, which are defined for two samples as: 

- $df = (N_1 + N_2)  – 2$

And for a single sample as: 

- $df = N - 1$

# Comparisons of Means

When dealing with a population of known parameters $\mu$ and $\sigma^2$, we can take any mean $\bar{X}$ gotten from a sample and determine the likelihood that the sample came from out known population, or a population with same mean as our known population. We do this using a z-score: 
<h3 align="center">
    <font size="5">
        $ z = \frac{\bar{X} - \mu}{\sigma}$
    </font>
</h3>

### Central Limit Theorem:


If $\bar{X}$ is the mean of a random sample of size $n$ taken
from a population with mean $\mu$ and finite variance $\sigma^2$, then the limiting form of
the distribution of
<h3 align="center">
    <font size="5">
        $ z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}$
    </font>
</h3>

 as $\lim_{n \to \infty}$, is the *standard normal distribution* 
 
 The power of the CLT is that this holds no matter the type of distribution we are sampling from. So, for instance, if we took 30 random samples from a lognormal distribution, the means of the samples would be normally distributed. 
 
 The Z-value tells us: what is the probability that a given sample mean would occur given the sample size and population mean? As n gets larger, the mean is expected to get more accurate if it does follow the population mean $\mu$
 
The gotten _z-score_ tells us how many standard deviations our sample mean $\bar{X}$ is from our population mean $\mu$.
 
 The normal approximation for $\bar{X}$ will generally be good if $n$ ≥ 30, provided the population distribution is not terribly skewed. If $n$ < 30, the approximation is good only if the population is not too different from a normal distribution and, as stated above, if the population is known to be normal, the sampling distribution of $\bar{X}$ will follow a normal distribution exactly, no matter how small the size of the samples.
 
For the following demonstration, [ImageMagick](https://imagemagick.org/script/download.php#windows) must be installed as well as wand through pip install. 

In [10]:
# !pip install wand

# must install ImageMagick as well; but not necessary if you just want to view the plot included. 

In [100]:
# Central Limit Theorem
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation as animation
from wand.image import Image
from wand.display import display
from IPython import display
%matplotlib notebook
# number of simulations of die roll
n = 1000

# In each simulation, there is one trial more than the previous simulation
avg = []
for i in range(2,n):
    a = np.random.randint(1,7,i)
    avg.append(np.average(a))
print(avg[0:5])
# Function that will plot the histogram, where current is the latest figure
def clt(current):
    # if animation is at the last frame, stop it
    plt.cla()
    if current == 1000:
        a.event_source.stop()
    plt.hist(avg[0:current], bins= int(current/10 + 1))
    plt.gca().set_title('Expected value of die rolls')
    plt.gca().set_xlabel('Average from die roll')
    plt.gca().set_ylabel('Frequency')
    plt.annotate('Die roll = {}'.format(current), [3,27])
fig = plt.figure()
a = animation(fig, clt, interval=50, frames=200 )
video = a.to_html5_video()
html = display.HTML(video)
display.display(html)
# a.save('clt2.gif', writer='imagemagick', fps=10)
plt.show()

[2.0, 4.666666666666667, 1.5, 4.4, 3.8333333333333335]


<IPython.core.display.Javascript object>

In [101]:
a.save("clt2.gif")

This is the gif we just produced, embedded: 

<img src="clt2.gif">

So, for any sample with $n$ > 30, $\bar{x}$ can be substituted for $\mu$ and $s$ can be substituted for $\sigma$
 
This Z-test asunes that we have access to the population standard deviation and mean _or_ that $n$ is large enough (>30) for $s^2$ and $\bar{x}$ to be used as a reliable estimate for $\sigma^2$ and $\mu$. When these conditions do not hold, and we do not have a large enough sample or sufficient population data, we need another estimator.  
 

The __T-test__ is used when we are dealing with a population of unknown distribution, and would like to compare a given sample mean to one of three options: 

- **One Sample T-test:** The one sample t test compares the mean of your sample data to a known value. For example, you might want to know how your sample mean compares to the population mean, like our value of 120,000 for average mothly store revenue
<h3 align="center">
    <font size="7">
        $ t = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{n}}}$
    </font>
    </h3> 
    
    - Null Hypothesis: sample mean is the same as hypothesized or theoretical mean
    - Alternative Hypothesis: sample mean is different from the hypothesized or theoretical mean
    

- **Independent Samples T-test:** The The independent samples t test (also called the unpaired samples t test) is the most common form of the T test. It helps you to compare the means of two sets of data. Normally, we are checking to see if the means of the data are significantly different from a differnece of zero. But , we can also check if they are significantly different from a hypothesized or theoretical value. For instance, say we had two groups of males and one group of females and we wanted to compare average heights between the groups. For the males, we would check to see if they differed significantly from an average height difference of zero, whereas when comparing the males to the females we may want to see if they were significantly different from an average difference of 2 inches, or whatever the average height between males and females is. **This hypothesized difference, $(\mu_1 - \mu_2)$, will usually be zero, but not always.**

<h3 align="center">
    <font size="7">
        $ t = \frac{(\bar{x_1}-\bar{x_2})-(\mu_1 - \mu_2)}{\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}}$
    </font>
    </h3> 
    
   
   - Note that this t-test is one variation of the independent samples t-test that _does not_ assume equal variance between the samples
   
    - Null Hypothesis: sample mean is the same as hypothesized or theoretical mean
    - Alternative Hypothesis: sample mean is different from the hypothesized or theoretical mean
    
    
- **Paired Samples T-test:** A paired t test (also called a correlated pairs t-test, a paired samples t test or dependent samples t test) is where you run a t test on dependent samples. Dependent samples are essentially connected — they are tests on the same person or thing. This would be useful if we chose a random sample of stores and measured their mean revenues before and after implementation of the new marketing campaign as our two means. For our function, we can simply add an optional argument "equal_var" to our independent t-test funtion which will cause it to act like a paired samples t-test. 

The t-value we obtain will lie on the horizontal axis of our t-distribution, representing the number of standard deviations the difference between our sample mean and theoretical mean lies from zero, with a corresponding p-value on the y-axis that tells us how likely our result would be if the population our sample was drawn from had the same mean as our theorized or population mean. This t-distribution takes the form: 
<h3 align="center">
    <font size="6">
        $ f(T) = \frac{(1 + \frac{T^2}{\nu})^{\frac{-(\nu+1)}{2}}}{B(0.5,0.5\nu)\sqrt(\nu)}$
    </font>
    </h3> 
    
    
- Where $\nu$ is the degrees of freedom of the distribution and B is the beta function, which is beyond the scope of this book and can be pulled from the scipy.stats library. 

### T-distribution p-value

As we can see, a lower sample size, and hence a lower degrees of freedom, leads to a lower probaility that our t-score is near 0 when our population means are the same, because more random variation is likely when the sample size is so low. The point of a t-score is to determine if the difference in the two means of the samples is too drastic for the true population means to be the same. As we approacch 30 with our degrees of freedom, the graph doesnt change much, and this is a standard normal distribution, which the z-score uses. That is why we use z-score for large sample sizes. 

Once we get our t-score based on the t-distribution, shown on the x-axis of the above graph, we get a corresponding __p-value__, shown on the y-axis. This value is the probability of our gotten t-value if the true means were the same. 

- If the corresponding p-value from our t-value is too low, we choose to __reject the null hypothesis $H_o$__, and say that our samples come from different populations who's means are different. This is a "statistically significant" result. 


- If the p-value is sufficiently high, we __fail to reject the null hypothesis $H_o$__, and say that there is a high enough chance that the samples came from populations with the same means. This is a "statistically insignificant" result. 


- The value at which a non-significant result becomes a significant one is called the __*critical value*__, denoted $\alpha$, and is most commonly 0.05.

We now implement the theory into code: 

First, we create a funtion that uses a flat np.linspace array from -10 to 10(very extreme t vaues) and transform it according to the t-distributions density function:

In [102]:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(15, 8))
cauchy = create_t_distribution(1)
ax.plot(cauchy, '-', lw=3, alpha=1,  label = "Cauchy", color='b')
t_df = [2, 3, 4, 5, 10, 20]
for df in t_df:
    dist = create_t_distribution(df)
    ax.plot(dist, '-', lw=1, alpha=df/20,  label = "df: "+ str(df), color='k')
gaussian = create_t_distribution(30)
ax.plot(gaussian, lw=3, alpha=1, color = 'r',  label='Standard Normal Distribution')
plt.rcParams.update({"font.size": 15})
ax.set_ylabel("Probability of t-score")
ax.set_xlabel("Standard Deviations away from mean( this varies as the distributions have different SD's)")
plt.title("T-distribution with varying degrees of freedom")
ax.set_xticklabels(labels = "")
plt.legend()

<IPython.core.display.Javascript object>

<matplotlib.legend.Legend at 0x28274167700>

In [20]:
import datlib.stats
import scipy.special as sc
import scipy.stats as stats

# define a function to create the actual distribution from which we can analyze our t value from the t test
def create_t_distribution(df):
    x = np.linspace(-5, 5, 1000) # large number of points will ensure accuracy
    # transform flat array of x values into t distribution
    t_distribution = ((1+x**2/df)**(-(df+1)/2))/(sc.beta(.5, .5*df)*np.sqrt(df))
    return t_distribution

In [21]:
# function that allows us to evaluate the t distribution at a given t value and df
def t_prob(t, df): 
    p_value = stats.t.sf(t, df)
    # equivalent to = ((1 + t**2 / df)**(-(df + 1) / 2)) / (sc.beta(.5, .5 * df) * np.sqrt(df))
    return p_value

Next, we create a function that will calculate a t value from a given sample set and $\mu$ value and output the t value and its corresponding p-value from the distribution we created. 

In [22]:
from datlib.stats import *
def ttest_one_sample(data, mu):
    x_bar = mean(data)
    s = SD(data, sample=True)
    n = len(data)
    df = n - 1
    t = (x_bar - mu) / (s / np.sqrt(n))
    p_value = t_prob(t, df)
    if p_value > .05:
        return_string = "T-value: " + str(t) + ", P-value: " + str(
            p_value) + ", Fail to reject null hypothesis."
    else:
        return_string = "T-value: " + str(t) + ", P-value: " + str(
            round(p_value, 5)) + ", Reject null hypothesis."

    return return_string

To show the utility of the single-sample t-test, we can check to see if the S&P500's average yearly return was significantly different from some arbitrary value, like 10%. Can we confidently say that the S&P500's return was different from 10%, or could the observed difference have been due to random variation? 

In [23]:
ttest_one_sample(sp500, .10)

'T-value: -0.36401676860591037, P-value: 0.6413248366120687, Fail to reject null hypothesis.'

We can test our results vs the SciPy library's one sample t test: 

In [24]:
stats.ttest_1samp(sp500, .10)

Ttest_1sampResult(statistic=-0.3640167686059105, pvalue=0.7173503267758627)

And the results are essentially the same, but the loss of significant figures many decimal places out become multiplied later on in our function to give slightly different results. 

From the test, we fail to reject our null hypothesis that the average yearly S&P500 gain was not significantly different from 10%. 

In [43]:
# independent samples t-test, setting equal_var=True will turn this test into a paired samples t-test where equal variance 
# is assumed 
def ttest_ind_samp(a, b, hypothesized_difference = 0, equal_var=False): 
    s1 = variance(a)
    s2 = variance(b)
    n1 = len(a)
    n2 = len(b)
# if paired samples, function is simpler 
    if (equal_var):
        df = n1 + n2 - 2
        svar = ((n1 - 1) * s1 + (n2 - 1) * s2) / float(df)
        denom = np.sqrt(svar * (1.0 / n1 + 1.0 / n2))
    else:
        vn1 = s1 / n1
        vn2 = s2 / n2
        df = ((vn1 + vn2)**2) / ((vn1**2) / (n1 - 1) + (vn2**2) / (n2 - 1))
        denom = np.sqrt(vn1 + vn2)
  
    x_bar1 = np.mean(a)
    x_bar2 = np.mean(b)
    d = np.mean(a) - np.mean(b) - hypothesized_difference
    t = d / denom
    #df_approx = round(df, 0)
    # t = ((x_bar1 - x_bar2) - (hypothesized_difference)) / (np.sqrt( (s1 / n1) + ( s2 / n2)))
    p_value = t_prob(t, df)
    if p_value > .05:
        return_string = "T-value: " + str(t) + ", P-value: " + str(
            p_value) + ", Fail to reject null hypothesis."
    else:
        return_string = "T-value: " + str(t) + ", P-value: " + str(
            round(p_value, 5)) + ", Reject null hypothesis."

    return return_string
    
    

In [35]:
i = 0
for key in yearly_returns_dict.keys():
    print("\n"+key+": ")
    print(stats.ttest_ind(yearly_returns_dict[key], sp500, alternative="less"))
    i+=1
    if i == 7: 
        break


EAGRX: 
Ttest_indResult(statistic=-0.19720701504357524, pvalue=0.42220297879352464)

AMRMX: 
Ttest_indResult(statistic=0.17957327495976705, pvalue=0.5710449941328631)

FMAGX: 
Ttest_indResult(statistic=1.3953138677906483, pvalue=0.9168598696267594)

HACAX: 
Ttest_indResult(statistic=1.0259475206132025, pvalue=0.8460864174269368)

FLPSX: 
Ttest_indResult(statistic=1.4290991916532547, pvalue=0.9216356810618928)

PRGFX: 
Ttest_indResult(statistic=0.6128209348938095, pvalue=0.7292466643407904)

PRNHX: 
Ttest_indResult(statistic=1.290973032582562, pvalue=0.9000261897468775)


In [42]:
i = 0
for key in yearly_returns_dict.keys():
    print("\n"+key+": ")
    print(ttest_ind_samp(yearly_returns_dict[key], sp500))
    i+=1
    if i == 7: 
        break


EAGRX: 
T-value: -0.2274615742719947, P-value: 0.5839210426856163, Fail to reject null hypothesis.

AMRMX: 
T-value: 0.19008567473644125, P-value: 0.424845584650113, Fail to reject null hypothesis.

FMAGX: 
T-value: 1.3514820743437954, P-value: 0.0903440757733132, Fail to reject null hypothesis.

HACAX: 
T-value: 0.9741543720437927, P-value: 0.16694980636084672, Fail to reject null hypothesis.

FLPSX: 
T-value: 1.4265027585330028, P-value: 0.07915448710425667, Fail to reject null hypothesis.

PRGFX: 
T-value: 0.6081638020790008, P-value: 0.27234999095161605, Fail to reject null hypothesis.

PRNHX: 
T-value: 1.2468546506174194, P-value: 0.10824584589891259, Fail to reject null hypothesis.


The results are not exactly the same but they are reasonably close 

# ANOVA 

While using T-tests and Z-tests to analyze means of groups, we were restricted to only being able to compare two groups at a time. What if we wanted to see of there was significant differences between more than two groups? The **ANOVA**, or **Analysis of Variance** techniques allow us to test the null hypothesis that there is no significant difference between $k$ (some integer larger than 2) groups. 

- $H_o$: $\mu_1 = \mu_2 = \cdots = \mu_k$
- $H_a$: At least two of the means are not equal. 

### Assumptions needed for ANOVA
There are three assumptions that must be met in order to carry out an ANOVA test: 

1. The experimental errors of yoyr data are normally distributed
2. Homoscedasticity - the variances of your factors are all roughly the same (and at least follow the same distribution)
3. Samples are independent - Selection of one sample had no effect on any other sample

### F-Statistic
The distribution used for the hypothesis test is a new one. It is called the F distribution, named after Sir Ronald Fisher, an English statistician. The F-statistic is a ratio. There are two sets of degrees of freedom; one for the numerator and one for the denominator. 

The F distribution is derived from the t-distribution. The values of the F distribution are squares of the
corresponding values of the t-distribution. One-Way ANOVA expands the t-test for comparing more than two groups.
The scope of that derivation is beyond the level of this textbook. 

To calculate the F ratio, two estimates of the variance are made:

1. **Variance between samples**: An estimate of $\sigma^2$ that is the variance of the sample means multiplied by n (when the sample sizes are the same.). If the samples are different sizes, the variance between samples is weighted to account for the different sample sizes. The variance is also called **variation due to treatment or explained variation.**

2. **Variance within samples**: An estimate of $\sigma^2$ that is the average of the sample variances (also known as a pooled variance). When the sample sizes are different, the variance within samples is weighted. The variance is also called **the variation due to error or unexplained variation.**

- $SS_b$ = the sum of squares that represents the variation among the different samples

- $SS_w$ = the sum of squares that represents the variation within samples that is due to chance.

To find a "sum of squares" means to add together squared quantities that, in some cases, may be weighted. We used sum of squares to calculate the sample variance and the sample standard deviation. 

MS means "mean square." $MS_b$ is the variance between groups, and $MS_w$ is the variance within groups.

### Caluculating the F-Statistic

- $k$ = the number of different groups
- $n_j$ = the size of the $j^{th}$ group
- $s_j$ = the sum of the values in the $j^{th}$ group
- $n$ = total number of all the values combined (total sample size: $\sum{n_j}$)
- $x$ = one value: $\sum{x}$
- Between group variability = $SS_{total} = \sum{x^2} - \frac{\sum{x^2}}{n}$
- Explained variation: sum of squares representing variation among the different samples: $SS_{b} = \sum{\frac{(s_j)^2}{n_j}} - \frac{(\sum{(s_j)^2}}{n}$
- Unexplained variation: sum of squares representing variation within samples due to chance:
$SS_w = SS_{total} – SS_b$
- $df$'s for the numerator(between samples): $df_b = k – 1$
- $df$'s for the denominator($df$'s within samples): $df_w = k – 1$
- Mean square (variance estimate) explained by the different groups:
$MS_b = \frac{SS_b}{df_b} = \frac{SS_b}{k-1}$
- Mean square (variance estimate) that is due to chance (unexplained): $MS_w = \frac{SS_w}{df_w} = \frac{SS_w}{n - k}$

The one-way ANOVA test depends on the fact that $MS_b$ can be influenced by population differences among means of the several groups. Since $MS_w$ compares values of each group to its own group mean, the fact that group means might
be different does not affect $MS_w$. The null hypothesis says that all groups are samples from populations having the same normal distribution. The alternate
hypothesis says that at least two of the sample groups come from populations with different normal distributions. If the null hypothesis is true, $MS_b$ and $MS_w$ should both estimate the same value. 

Finally, we arrive at the **F-Statistic**, which will function for us as the T-Statistic did earlier this chapter, as an input into its density function to recieve a p-value telling us the likelihood of its occurence if our null hypothesis was true. 

- $ F = \frac{MS_b}{MS_w}$

With a density function:
<h3 align="center">
    <font size="5">
        $ f(x, df_1, df_2) = \frac{df_2^{df_2/2} df_1^{df_1/2} x^{df_1 / 2-1}}
                        {(df_2+df_1 x)^{(df_1+df_2)/2}
                         \beta(df_1/2, df_2/2)}$
    </font>
    </h3> 



where $df_1$ and $df_2$ are the
shape parameters and
$\beta$ is the beta function.  The formula for the beta function
is
<ul>
$B(a, b) = \int_0^1 t^{a-1}(1-t)^{b-1}dt
        = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$,
 
<p>where <span class="math notranslate nohighlight">\(\Gamma\)</span> is the gamma function.</p>
    
These funtions could be implememnted manually using basic math symbols, but for our purposes, importing them from Scipy will be much more pragmatic. 
    
In a testing context, the F distribution is treated as  "standardized distribution" (i.e., no location or scale parameters).
However, in a distributional modeling context (as with other probability distributions), the F distribution itself can be
transformed with a <a href="eda364.htm">location parameter</a>, $\mu$, and a <a href="eda364.htm">scale parameter</a>, $\sigma$.
<p>

In [45]:
from scipy.stats import f as f_dist
# create function to quickly sum squares within ANOVA test
def sum_squares(a):
    s = 0
    for i in range(len(a)): 
        s += a[i]**2
    return s

# and one to square sums
def square_sums(a):
    s = 0
    for i in range(len(a)): 
        s += a[i]
    return s * s

# finds value of f-distribution for given f and df1, df2
def f_prob(f, df1, df2):
    # use scipy to plug f-value into f distribution to return p-value
    p_value = f_dist.sf(df1, df2, f)
    # could attempt to manually implement, i.e
    # [f(x, df_1, df_2) = (df_2^{df_2/2} df_1^{df_1/2} x^{df_1 / 2-1} / 
    #                   {(df_2+df_1 x)^{(df_1+df_2)/2}*sc.beta(df_1/2, df_2/2)}\]
    return p_value

In [46]:
# use *args command to accept variable number of arguments
def ANOVA(*args):
    # Create numpy array of arguments using generator function
    args = [np.asarray(arg) for arg in args]
    # ANOVA on N groups, each in its own array
    k = len(args)
    # use print statements throughout funtion to ensure proper logic
    print("k: ", k)
    alldata = np.concatenate(args)
    bign = 0
    for i in range(k):
        bign += len(args[i])
    
    print("n: ", bign)

    # do the (x_i - x_mean) calculation that happens for every calculation of variance. This is simpler
    # than doing the same calculation for each variable
    grand_mean = alldata.mean()
    alldata -= grand_mean
    
    
    # 
    sstot = sum_squares(alldata) - (square_sums(alldata) / float(bign))
    print("sstot: ", sstot)
    
    
    ssbn = 0
    for a in args:
        ssbn += square_sums(a - grand_mean) / float(len(a))
    ssbn -= (square_sums(alldata) / float(bign))
    print("ssbn: ",ssbn)

    sswn = sstot - ssbn
    print("sswn: ", sswn)
    
    dfbn = k - 1
    dfwn = bign - k
    print("k: ", k, "n: ", bign)
    print("DFbn, DFwn: ", dfbn,",", dfwn)
    msb = ssbn / float(dfbn)
    print("msb: ", msb)
    msw = sswn / float(dfwn)
    print("msw: ", msw)
    f = msb / msw

   
    print("sswn: ", sswn)
    p_value = float(f_prob(f, dfbn, dfwn))

    if p_value > .05:
        return_string = "F-value: " + str(f) + ", P-value: " + str(
            p_value) + ", Fail to reject null hypothesis."
    else:
        return_string = "F-value: " + str(f) + ", P-value: " + str(
            round(p_value, 5)) + ", Reject null hypothesis."

    return return_string

Now that we have built our function, we can test it. In the case of our mutual fund analysis, a relevant ANOVA problem would be comparing mean returns across the categorical variable of fund category. In one of our data columns, "Morningstar Category", Morningstar provides categorizations of funds based of what they focus on. Does one specialization do better than the others? Should investors choose one category of fund over the others? 

In [47]:
# find what categories are listed in our dataset: 
fund_types = mutual_fund_data["Morningstar Category"].unique()
fund_types

array(['Large Growth', 'Large Blend', 'Large Value', 'Mid-Cap Value',
       'Mid-Cap Growth', 'Health', 'Mid-Cap Blend', 'Communications',
       'Small Value'], dtype=object)

In [48]:
# create lists of categories to be analyzed and pull mean yeary returns for each category
fund_type_returns_dict = {}
for ftype in fund_types: 
    fund_type_returns_dict[ftype] = []
    for i in range(25, 50): 
        if mutual_fund_data["Morningstar Category"][i] == ftype:
            fund_type_returns_dict[ftype].extend(yearly_returns_dict[mutual_fund_data["Symbol"][i]])
  


In [49]:
mutual_fund_data["Morningstar Category"]

0     Large Growth
1     Large Growth
2      Large Blend
3      Large Blend
4      Large Blend
          ...     
95    Large Growth
96    Large Growth
97     Large Value
98     Large Value
99    Large Growth
Name: Morningstar Category, Length: 100, dtype: object

In [50]:
fund_type_returns_dict

{'Large Growth': [-0.11516896890536632,
  0.21024800399894827,
  0.08020978179860982,
  0.12587589690265344,
  0.7570397712188528,
  -0.2178822999791835,
  0.3708926948067228,
  0.33955412977647326,
  -0.09504486420584668,
  0.3726674521719533,
  0.1671760461917593,
  0.022323115049480036,
  0.18379775992407366,
  0.3698573656836046,
  -0.0639154346738624,
  0.42102835219271295,
  0.07224116612941778,
  0.24021594664006396,
  -0.023084630156390284,
  0.3853180227350925,
  0.10483167171142704,
  0.2803295935593004,
  0.3310894582637056,
  0.23022505499499357,
  -0.11142582776710441,
  -0.08432038752136328,
  -0.21433603325183292,
  0.20527696877759238,
  0.07052203257335221,
  0.09380426609667669,
  0.05333963814540654,
  0.17119553791216857,
  -0.4695820727724316,
  0.3949316679484096,
  0.11337157514846319,
  -0.11079299263718723,
  0.1897765777975038,
  0.31113657907551207,
  0.14651757257330433,
  0.024778792472980538,
  0.08310495073039559,
  0.26310395346919657,
  -0.0636658307753

In [51]:
fund_type_returns_dict.keys()

dict_keys(['Large Growth', 'Large Blend', 'Large Value', 'Mid-Cap Value', 'Mid-Cap Growth', 'Health', 'Mid-Cap Blend', 'Communications', 'Small Value'])

There are almost no funds with the last 5 categories so we will just do the analysis on the fist 4. 

In [52]:
ANOVA(fund_type_returns_dict["Large Growth"],
      fund_type_returns_dict['Large Value'],
      fund_type_returns_dict['Large Blend'],
      fund_type_returns_dict['Mid-Cap Growth'])

k:  4
n:  532
sstot:  19.99037561252724
ssbn:  0.08610202346508852
sswn:  19.904273589062154
k:  4 n:  532
DFbn, DFwn:  3 , 528
msb:  0.028700674488362842
msw:  0.03769748785807226
sswn:  19.904273589062154


'F-value: 0.7613418325491175, P-value: 0.4953731492515171, Fail to reject null hypothesis.'

Compare to equivalent SciPy function: 

In [53]:
stats.f_oneway(fund_type_returns_dict["Large Growth"],
      fund_type_returns_dict['Large Value'],
      fund_type_returns_dict['Large Blend'],
      fund_type_returns_dict['Mid-Cap Growth'])

F_onewayResult(statistic=0.7613418325491174, pvalue=0.5161213957671515)

These results suggest that almost all of the total variance in the data is caused by within-data variance, not between-data variance. The practical interperetation is that the type of mutual fund does not significantly change yearly returns. This result may support the EMH because specific specialization of funds has no effect on returns. The market gave the same returns to the funds no matter their specialization. 