### <font color='darkred'> HW4_1

1. Compute Adj $R^2$, AIC and BIC for each of your five regression models in HW2
    
    * Remember that, after running a regression using statsmodels, you can access the AIC (or BIC or Adj $R^2$) of the regression by typing "$***$.aic" (or "$***$.bic" or "$***$.rsquared_adj") where $***$ is the name of the variable containing the regression result 

    
2. Draw three graphs for the three criteria like Figure 6.2 on page 211 of the textbook 
    
    * You can use any type of figure from "matplotlib.pyplot"
    * X-axis may be defined as "number of predictors" or "model number" (set by you)
    * Add a mark or anything onto each graph that indicates the selected model from each criterion 
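
The attribute access and the selection rule described in the list above can be sketched as follows. This is only an illustration on synthetic data, not the College data: the DataFrame `toy`, the labels `Model A` to `Model C`, and the names `fits`, `criteria` and `best` are all hypothetical. The one rule it relies on is that Adj $R^2$ is maximized, while AIC and BIC are minimized.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (assumption: not the homework data)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 1.0 + 2.0 * toy["x1"] + rng.normal(size=n)

# Hypothetical candidate models, keyed by a label
fits = {
    "Model A": smf.ols("y ~ x1", data=toy).fit(),
    "Model B": smf.ols("y ~ x2", data=toy).fit(),
    "Model C": smf.ols("y ~ x1 + x2", data=toy).fit(),
}

# One row per model, one column per criterion
criteria = pd.DataFrame({
    "n_predictors": {k: int(r.df_model) for k, r in fits.items()},
    "adj_r2": {k: r.rsquared_adj for k, r in fits.items()},
    "aic": {k: r.aic for k, r in fits.items()},
    "bic": {k: r.bic for k, r in fits.items()},
})

# Larger is better for Adj R^2; smaller is better for AIC and BIC
best = {
    "adj_r2": criteria["adj_r2"].idxmax(),
    "aic": criteria["aic"].idxmin(),
    "bic": criteria["bic"].idxmin(),
}

print(criteria.round(2))
print(best)
```

Collecting the criteria in one table makes it straightforward to plot each column against `n_predictors` or the model label and to mark the entry named in `best`.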

In [3]:
import os
os.chdir('/Users/robbyjeffries/MSEA2022/Spring 2022/ECON 5763, Economic Analytics/Data')

In [10]:
import numpy as np 
import pandas as pd
import math
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

In [5]:
raw0 = pd.read_csv('College.csv')

In [6]:
raw0.head()

Unnamed: 0.1,Unnamed: 0,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F.Undergrad,P.Undergrad,Outstate,Room.Board,Books,Personal,PhD,Terminal,S.F.Ratio,perc.alumni,Expend,Grad.Rate
0,Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
1,Adelphi University,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56
2,Adrian College,Yes,1428,1097,336,22,50,1036,99,11250,3750,400,1165,53,66,12.9,30,8735,54
3,Agnes Scott College,Yes,417,349,137,60,89,510,63,12960,5450,450,875,92,97,7.7,37,19016,59
4,Alaska Pacific University,Yes,193,146,55,16,44,249,869,7560,4120,800,1500,76,72,11.9,2,10922,15


#### Additional Data Cleaning

In [7]:
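# Convert the "Private" variable (Yes/No) to a 0/1 dummy using a built-in function
# dtype=int keeps the column numeric, so the formula API treats it as a plain regressor
raw0['Private'] = pd.get_dummies(raw0['Private'], drop_first=True, dtype=int)['Yes']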

# Rename columns to remove periods so the names can be used directly in formulas
raw0.rename(columns={'perc.alumni': 'palumni',
                     'Room.Board': 'RoomBoard',
                     'Grad.Rate': 'GradRate',
                     'F.Undergrad': 'FUndergrad',
                     'P.Undergrad': 'PUndergrad',
                     'S.F.Ratio': 'SFRatio'}, inplace=True)

#### Regressions

In [8]:
# Reg 1 will try to capture the incoming students' ability
HWOLS1 = smf.ols('palumni ~ Top10perc', data=raw0).fit()

# Reg 2 will try to capture the prestige of the college
HWOLS2 = smf.ols('palumni ~ Top10perc + Private + Terminal + GradRate', data=raw0).fit()

# Reg 3 will try to capture the size of the college 
HWOLS3 = smf.ols('palumni ~ Apps + Accept + Enroll + FUndergrad + PUndergrad + SFRatio', data=raw0).fit()

# Reg 4 will try to capture the financial status of the student
HWOLS4 = smf.ols('palumni ~ Outstate + RoomBoard + Books + Personal + Expend', data=raw0).fit()

# Reg 5 will try to capture everything from the previous models
HWOLS5 = smf.ols('palumni ~ Top10perc + Private + Terminal + GradRate + Outstate + RoomBoard + Books + Personal + Expend + Apps + Accept + Enroll + FUndergrad + PUndergrad + SFRatio', data=raw0).fit()

In [11]:
HW_info_dict={'BIC' : lambda x: f"{x.bic:.2f}",
    'No. observations' : lambda x: f"{int(x.nobs):d}",
    'F-stat' : lambda x: f"{x.fvalue:.2f}",
    'F P-value' : lambda x: f"{x.f_pvalue:6.3g}"}

# A dictionary is another way to store data; it indexes elements by "keys" (instead of numbers), as key-value pairs

HW_results_table = summary_col(results=[HWOLS1,HWOLS2,HWOLS3,HWOLS4,HWOLS5],
                            float_format='%0.2f',
                            stars = True,
                            model_names=['Model 1',
                                         'Model 2',
                                         'Model 3',
                                         'Model 4',
                                         'Model 5'],
                            info_dict=HW_info_dict,
                            regressor_order=['Top10perc',
                                             'Intercept',
                                             'Private',
                                             'Terminal',
                                             'GradRate',
                                             'Apps',
                                             'Accept',
                                             'Enroll',
                                             'Top25perc',
                                             'FUndergrad',
                                             'PUndergrad',
                                             'SFRatio',
                                             'Outstate',
                                             'RoomBoard',
                                             'Books',
                                             'Personal',
                                             'PhD',
                                             'palumni',
                                             'Expend'])

HW_results_table.add_title('Robby Jeffries - OLS Regressions')

print(HW_results_table)

              Robby Jeffries - OLS Regressions
                 Model 1  Model 2  Model 3  Model 4  Model 5 
-------------------------------------------------------------
Top10perc        0.32***  0.16***                    0.10*** 
                 (0.02)   (0.03)                     (0.03)  
Intercept        13.93*** -7.86*** 38.07*** 15.23*** 5.39    
                 (0.73)   (2.36)   (1.54)   (1.99)   (3.55)  
Private                   8.77***                    3.13**  
                          (0.87)                     (1.22)  
Terminal                  0.11***                    0.10*** 
                          (0.03)                     (0.03)  
GradRate                  0.17***                    0.15*** 
                          (0.02)                     (0.03)  
Apps                               0.00***           -0.00   
                                   (0.00)            (0.00)  
Accept                             -0.00***          -0.00** 
                       

In [16]:
print(HWOLS1.aic)
print(HWOLS2.aic)
print(HWOLS3.aic)
print(HWOLS4.aic)
print(HWOLS5.aic)
print('\n')
print(HWOLS1.bic)
print(HWOLS2.bic)
print(HWOLS3.bic)
print(HWOLS4.bic)
print(HWOLS5.bic)

5938.833609337911
5749.924614019112
5935.996898599233
5780.411386982302
5673.235393277424


5948.144490038647
5773.20181577095
5968.584981051807
5808.344029084508
5747.7224388833065
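
One possible way to turn the printed AIC and BIC values into the requested graphs is sketched below. The numbers are copied (rounded) from the output above; the variable names `model_no`, `aic`, `bic` and `selected` are illustrative. Adj $R^2$ is not printed above, so it is left out here; it could be added as a third panel in the same way, using `np.argmax` instead of `np.argmin`.

```python
import numpy as np
import matplotlib.pyplot as plt

model_no = np.arange(1, 6)
# Values copied (rounded to two decimals) from the printed output above
aic = np.array([5938.83, 5749.92, 5936.00, 5780.41, 5673.24])
bic = np.array([5948.14, 5773.20, 5968.58, 5808.34, 5747.72])

fig, axes = plt.subplots(1, 2, figsize=(9, 3.5))
selected = {}
for ax, (name, vals) in zip(axes, {"AIC": aic, "BIC": bic}.items()):
    k = int(np.argmin(vals))              # smaller is better for AIC and BIC
    selected[name] = int(model_no[k])
    ax.plot(model_no, vals, marker="o")
    # Red cross marks the model selected by this criterion
    ax.plot(model_no[k], vals[k], marker="x", color="red",
            markersize=14, linestyle="none")
    ax.set_xlabel("Model number")
    ax.set_ylabel(name)
    ax.set_xticks(model_no)
fig.tight_layout()

print(selected)   # {'AIC': 5, 'BIC': 5}
```

Both criteria are smallest for Model 5, so the mark lands on the same model in both panels.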


## <font color='green'> Forward and Backward selection

In [1]:
import os
os.chdir('/Users/robbyjeffries/MSEA2022/Spring 2022/ECON 5763, Economic Analytics/Data')

import numpy as np
import pandas as pd
import math

raw0 = pd.read_csv('Boston.csv')
raw0.head()

Unnamed: 0.1,Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,3,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [2]:
import statsmodels.api as sm
#import statsmodels.formula.api as smf

# define y and X
raw0 = raw0.iloc[:,1:].values
Y = raw0[:,-1] # -1 means the last one
X = raw0[:,:-1] 
ncol=X.shape[1]

In [17]:
# Code for forward selection based on BIC

pcand = list(range(ncol)) # A list to keep track of the set of predictors left to add to the model at each iteration
psel = [] # A list to keep track of the selected predictors at each iteration (the order of the selected predictors)
tb = np.zeros(ncol) # A vector to store the BIC of the selected model at each iteration
p = 0 # Iteration index

while len(psel) != ncol: # Repeat below until the model includes all the predictors
    tb0 = np.zeros((len(pcand),2)) # Store the R-squared and BIC of each model under consideration at this iteration
    
    # The for loop computes R-squared and BIC for each candidate predictor
    for i in range(0,len(pcand)):
        psel0 = psel + [pcand[i]] # "psel0" is a temporary version of psel: the predictors in psel plus one candidate from pcand
        # Caution: "+" concatenates two lists, but not a list and an integer (hence [pcand[i]], not pcand[i])
        XX = X[:,psel0]
        XX = sm.add_constant(XX)
        model = sm.OLS(Y, XX)
        res = model.fit()
        tb0[i,:] = [res.rsquared, res.bic]
    
    # Find the regressor that gives the largest R-squared when added to the model
    # (all candidates have the same number of predictors, so this is also the one with the smallest BIC)
    ind = np.argmax(tb0[:,0])
    psel = psel + [pcand[ind]] # Add the selected regressor to psel
    pcand.remove(pcand[ind]) # Remove the selected regressor from pcand
    tb[p] =  tb0[ind,1] # Store the BIC of the selected model at this iteration
    p += 1

In [18]:
tb

array([3295.42803024, 3184.22192431, 3131.0034141 , 3118.49172821,
       3094.79785318, 3087.5248064 , 3082.25067607, 3080.31382312,
       3082.41992887, 3078.56060506, 3072.44482786, 3078.55639492,
       3084.78010745])

In [19]:
psel

[12, 5, 10, 7, 4, 3, 11, 1, 0, 8, 9, 2, 6]

In [20]:
# Select the model that has the smallest BIC among the selected models over the iterations
psel[:(np.argmin(tb)+1)]

[12, 5, 10, 7, 4, 3, 11, 1, 0, 8, 9]
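
The selected indices are easier to read as variable names. The sketch below assumes the columns of `X` follow the order shown in `raw0.head()` above, after dropping `Unnamed: 0` and the response `medv`; the list `names` is written out by hand under that assumption, and `n_selected` is the length of the selected list printed above.

```python
# Column names of X, assuming the column order shown in raw0.head() above
# (after dropping "Unnamed: 0" and the response medv)
names = ["crim", "zn", "indus", "chas", "nox", "rm", "age",
         "dis", "rad", "tax", "ptratio", "black", "lstat"]

psel = [12, 5, 10, 7, 4, 3, 11, 1, 0, 8, 9, 2, 6]  # order of entry printed above
n_selected = 11                                    # length of the BIC-minimizing list printed above

chosen = [names[j] for j in psel[:n_selected]]
dropped = [names[j] for j in psel[n_selected:]]

print(chosen)
print(dropped)
```

Under that assumption, `lstat` enters first, followed by `rm` and `ptratio`, and the two predictors left out of the selected model are `indus` and `age`.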

### <font color='darkred'> HW4_2
    
1. Write code to implement Backward Selection on the same data by modifying the code above

2. Check if the forward and backward selections select the same model