# The F-Test (First Example)

### Intro and objectives


### In this lab you will learn:
1. Examples of multiple regression models.
2. How to fit multiple regression models in Python.
3. How to conduct F-Tests.


## What I hope you'll get out of this lab
* The feeling that you'll "know where to start" when you need to fit a multiple regression model.
* Worked examples of multiple regression models
* How to interpret the results obtained

In [25]:
!pip install wooldridge

import wooldridge as woo
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np




# Example 1. Determinants of salaries in Major League Baseball




#### In this case we fit a multiple linear regression model that explains log salary in terms of several factors: years in the league, games played per year, batting average, home runs per year, and runs batted in per year

$ \log(salary) = \beta_0 + \beta_1 \cdot years + \beta_2 \cdot gamesyr + \beta_3 \cdot bavg + \beta_4 \cdot hrunsyr + \beta_5 \cdot rbisyr + u $



### We use the MLB1 data set, which contains n = 353 players

In [26]:
salaries = woo.dataWoo('mlb1')


In [27]:
salaries.head()

Unnamed: 0,salary,teamsal,nl,years,games,atbats,runs,hits,doubles,triples,...,runsyr,percwhte,percblck,perchisp,blckpb,hispph,whtepw,blckph,hisppb,lsalary
0,6329213.0,38407380.0,1,12,1705,6705,1076,1939,320,67,...,89.666664,70.277969,18.844229,10.877804,0.0,0.0,70.277969,0.0,0.0,15.660686
1,3375000.0,38407380.0,1,8,918,3333,407,863,156,38,...,50.875,70.277969,18.844229,10.877804,18.844229,0.0,0.0,10.877804,0.0,15.031906
2,3100000.0,38407380.0,1,5,751,2807,370,840,148,18,...,74.0,70.277969,18.844229,10.877804,0.0,0.0,70.277969,0.0,0.0,14.946913
3,2900000.0,38407380.0,1,8,1056,3337,405,816,143,18,...,50.625,70.277969,18.844229,10.877804,0.0,0.0,70.277969,0.0,0.0,14.880221
4,1650000.0,38407380.0,1,12,1196,3603,437,928,19,16,...,36.416668,70.277969,18.844229,10.877804,18.844229,0.0,0.0,10.877804,0.0,14.316286


In [28]:
salaries.describe()

Unnamed: 0,salary,teamsal,nl,years,games,atbats,runs,hits,doubles,triples,...,runsyr,percwhte,percblck,perchisp,blckpb,hispph,whtepw,blckph,hisppb,lsalary
count,353.0,353.0,353.0,353.0,353.0,353.0,353.0,353.0,353.0,353.0,...,353.0,330.0,330.0,330.0,330.0,330.0,330.0,330.0,330.0,353.0
mean,1345672.0,30794830.0,0.475921,6.325779,648.424929,2168.592068,290.402266,584.736544,103.88102,16.733711,...,38.582988,72.631088,16.548926,10.819986,5.144886,2.004179,36.667635,2.931898,3.046493,13.492183
std,1407352.0,8722411.0,0.500129,3.880142,538.697499,2025.059766,301.010915,575.378562,104.323511,21.533336,...,23.862686,15.227258,13.668326,9.387961,11.109446,5.858419,37.637047,6.66769,8.415198,1.182466
min,109000.0,8854000.0,0.0,1.0,7.0,7.0,1.0,1.0,0.0,0.0,...,0.5,20.296301,3.741786,0.54087,0.0,0.0,0.0,0.0,0.0,11.599103
25%,253600.0,24557330.0,0.0,3.0,230.0,632.0,73.0,164.0,26.0,3.0,...,18.6,67.668961,8.007545,1.96208,0.0,0.0,0.0,0.0,0.0,12.443514
50%,675000.0,34136500.0,0.0,6.0,520.0,1585.0,191.0,419.0,70.0,9.0,...,35.333332,74.131294,14.453978,10.877804,0.0,0.0,20.296301,0.0,0.0,13.422468
75%,2250000.0,37792000.0,1.0,9.0,930.0,3071.0,407.0,818.0,147.0,23.0,...,55.888889,82.94886,18.755629,16.330647,7.98687,0.0,73.642937,1.561949,0.0,14.626441
max,6329213.0,42866000.0,1.0,20.0,2729.0,10554.0,1570.0,3025.0,634.0,142.0,...,105.14286,94.696266,73.96003,31.037498,73.96003,31.037498,94.696266,31.037498,73.96003,15.660686


In [29]:
type(salaries)

pandas.core.frame.DataFrame

In [30]:
# OLS regression:
reg = smf.ols(
    formula='np.log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr',
    data=salaries)

In [31]:
# We fit the model
results = reg.fit()


In [32]:
results.params

Intercept    11.192418
years         0.068863
gamesyr       0.012552
bavg          0.000979
hrunsyr       0.014429
rbisyr        0.010766
dtype: float64

## Based on the previous output, we have fitted the following model:

$ \widehat{\log(salary)} = 11.19 + 0.0689 \cdot years + 0.0126 \cdot gamesyr + 0.00098 \cdot bavg + 0.0144 \cdot hrunsyr + 0.0108 \cdot rbisyr $
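
Because the dependent variable is in logs, each coefficient is read as an approximate proportional change in salary. The sketch below illustrates this for the `years` coefficient, copied by hand from the `results.params` output above. The helper name `exact_percent_change` is hypothetical and introduced here only for illustration; it is not part of statsmodels.

```python
import math


def exact_percent_change(coef, delta=1.0):
    """Exact percentage change in y implied by a log-level coefficient
    when the regressor increases by `delta` units."""
    return 100.0 * (math.exp(coef * delta) - 1.0)


# Coefficient on `years`, copied from the results.params output above
beta_years = 0.068863

# Usual approximation: 100 * beta percent per extra year in the league
approx = 100.0 * beta_years
# Exact change implied by the log specification
exact = exact_percent_change(beta_years)

print(f'approximate: {approx:.2f}%  exact: {exact:.2f}%')
# approximate: 6.89%  exact: 7.13%
```

Holding the other regressors fixed, one more year in the league is associated with a salary that is roughly 7% higher. The approximation and the exact figure stay close as long as the coefficient is small.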


## F-Test of statistical significance

#### We are interested in determining whether, once years in the league and games per year have been controlled for, the performance metrics (bavg, hrunsyr, rbisyr) have no effect on salary.

#### Essentially we want to test if productivity, measured by baseball statistics, has no effect on salary.

#### This test corresponds to the following null hypothesis:

 $H_0: \beta_3=0, \beta_4=0, \beta_5=0 $
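
For reference, the statistic behind this kind of joint test compares the fit of a restricted model (here, one without bavg, hrunsyr and rbisyr) with the fit of the unrestricted model. A sketch of the standard textbook form follows; the symbols $SSR_r$, $SSR_{ur}$, $q$ and $k$ are notation introduced here and do not appear elsewhere in this lab:

$ F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)} $

In this example there are $q = 3$ restrictions, $n = 353$ observations and $k = 5$ regressors in the unrestricted model, so the denominator has $353 - 5 - 1 = 347$ degrees of freedom. Under $H_0$ and the classical linear model assumptions, $F$ follows an $F_{3,347}$ distribution.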


In [33]:
# automated F test:
hypotheses = ['bavg = 0', 'hrunsyr = 0', 'rbisyr = 0']
ftest = results.f_test(hypotheses)
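# Depending on the statsmodels version, the statistic is returned either as a
# 1x1 array or as a scalar; np.squeeze handles both cases
fstat = float(np.squeeze(ftest.statistic))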
fpval = ftest.pvalue

In [34]:
print(f'F statistic: {fstat}\n')
print(f'F p-value: {fpval}\n')

F statistic: 9.550253521951879

F p-value: 4.4737081398389455e-06
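
The statistic reported by `f_test` can also be computed by hand from the sums of squared residuals of a restricted and an unrestricted fit. The function below is a minimal sketch of that arithmetic. The name `f_statistic` and its arguments are hypothetical, and the numbers in the example are made up for illustration; they are not taken from the MLB1 data.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, q, df_resid):
    """F statistic for q exclusion restrictions.

    ssr_restricted   : sum of squared residuals of the restricted model
    ssr_unrestricted : sum of squared residuals of the unrestricted model
    q                : number of restrictions being tested
    df_resid         : residual degrees of freedom of the unrestricted model
    """
    if q <= 0 or df_resid <= 0:
        raise ValueError('q and df_resid must be positive')
    numerator = (ssr_restricted - ssr_unrestricted) / q
    denominator = ssr_unrestricted / df_resid
    return numerator / denominator


# Made-up numbers, chosen only so the arithmetic is easy to follow:
# (200 - 100) / 2 = 50 and 100 / 10 = 10, so F = 5
example = f_statistic(200.0, 100.0, q=2, df_resid=10)
print(example)  # 5.0
```

In this lab, the inputs could be taken from the fitted models: `results.ssr` and `results.df_resid` for the unrestricted fit, and the `ssr` attribute of a second fit that uses the formula `'np.log(salary) ~ years + gamesyr'`, with `q=3`. The result should agree with the `f_test` output above up to floating-point error.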



### Based on the previous results:

#### The F-statistic is large (about 9.55) and its associated p-value is almost zero (about 4.5e-06).

#### Therefore we reject the null hypothesis that bavg, hrunsyr, and rbisyr jointly have no effect on salary once years and gamesyr are controlled for
## Performance metrics are therefore jointly significant: they DO HAVE an impact on salary

