This notebook contains the PySAL/spreg code for Chapter 6 - 2SLS 

in
Modern Spatial Econometrics in Practice: A Guide to GeoDa, GeoDaSpace and PySAL.

by Luc Anselin and Sergio J. Rey

(c) 2014 Luc Anselin and Sergio J. Rey, All Rights Reserved

In [34]:
__author__ = "Luc Anselin luc.anselin@asu.edu"

## Basic Regression Setup ##

**Creating arrays for y, x, the endogenous variables yend and the instruments q**

Using the **natregimes.dbf** example

Preliminaries, importing **numpy** and **pysal**

In [35]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


In [36]:
import numpy as np
import pysal

Loading the data set and creating the data object

In [37]:
db = pysal.open('data/natregimes.dbf','r')

In [38]:
len(db)

3085

In [39]:
db.header

['REGIONS',
 'NOSOUTH',
 'POLY_ID',
 'NAME',
 'STATE_NAME',
 'STATE_FIPS',
 'CNTY_FIPS',
 'FIPS',
 'STFIPS',
 'COFIPS',
 'FIPSNO',
 'SOUTH',
 'HR60',
 'HR70',
 'HR80',
 'HR90',
 'HC60',
 'HC70',
 'HC80',
 'HC90',
 'PO60',
 'PO70',
 'PO80',
 'PO90',
 'RD60',
 'RD70',
 'RD80',
 'RD90',
 'PS60',
 'PS70',
 'PS80',
 'PS90',
 'UE60',
 'UE70',
 'UE80',
 'UE90',
 'DV60',
 'DV70',
 'DV80',
 'DV90',
 'MA60',
 'MA70',
 'MA80',
 'MA90',
 'POL60',
 'POL70',
 'POL80',
 'POL90',
 'DNL60',
 'DNL70',
 'DNL80',
 'DNL90',
 'MFIL59',
 'MFIL69',
 'MFIL79',
 'MFIL89',
 'FP59',
 'FP69',
 'FP79',
 'FP89',
 'BLK60',
 'BLK70',
 'BLK80',
 'BLK90',
 'GI59',
 'GI69',
 'GI79',
 'GI89',
 'FH60',
 'FH70',
 'FH80',
 'FH90',
 'West']

**y** - dependent variable HR90

In [40]:
y_name = "HR90"
y = np.array([db.by_col(y_name)]).T

In [41]:
y.shape

(3085, 1)
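The transpose in the cell above is what turns a single column of values into the n by 1 array shape shown. A minimal sketch of that reshaping, using a short made-up list in place of `db.by_col(...)` (the values and the names `col_a`, `col_b` are hypothetical):

```python
import numpy as np

# Hypothetical stand-ins for what db.by_col(name) returns: plain lists
col_a = [1.5, 0.0, 3.2, 4.1]
col_b = [10.0, 20.0, 30.0, 40.0]

# One variable: wrapping the list gives shape (1, n), the transpose (n, 1)
row = np.array([col_a])
y_demo = row.T

# Same result by reshaping a flat array into a single column
y_alt = np.array(col_a).reshape(-1, 1)

# Several variables: one row per variable, transposed to one column each
x_demo = np.array([col_a, col_b]).T

print(row.shape, y_demo.shape, x_demo.shape)
```

The same pattern is used for **x**, **yend** and **q** in the cells that follow, with one column per variable name.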

**x** - array with observations on explanatory variables

In [42]:
x_names = ['RD90','MA90','PS90']
x = np.array([db.by_col(var) for var in x_names]).T

In [43]:
x.shape

(3085, 3)

**yend** - endogenous explanatory variable, UE90

In [44]:
yend_names = ['UE90']
yend = np.array([db.by_col(var) for var in yend_names]).T

In [45]:
yend.shape

(3085, 1)

**q** - array of instruments

In [46]:
q_names = ['FH90','FP89','GI89']
q = np.array([db.by_col(var) for var in q_names]).T

In [47]:
q.shape

(3085, 3)

**Creating the model weights, queen contiguity for natregimes.shp**, using FIPSNO as the ID variable

In [48]:
w = pysal.queen_from_shapefile('data/natregimes.shp',idVariable="FIPSNO")

In [49]:
w.n

3085

row-standardize

In [50]:
w.transform = 'r'
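Row-standardization divides each weight by the sum of the weights in its row, so that every row sums to one. A minimal sketch on a toy neighbor structure (the three-area layout and the helper `row_standardize` are made up for illustration and are not part of PySAL):

```python
# Hypothetical toy layout: three areas on a line (1 - 2 - 3)
neighbors = {1: [2], 2: [1, 3], 3: [2]}

# Binary contiguity weights: 1.0 for every neighbor
binary_weights = {i: [1.0] * len(js) for i, js in neighbors.items()}

def row_standardize(weights):
    """Divide each row of weights by its row sum (rows summing to 0 are kept)."""
    out = {}
    for i, row in weights.items():
        total = sum(row)
        out[i] = [v / total for v in row] if total > 0 else list(row)
    return out

w_std = row_standardize(binary_weights)
for i in sorted(w_std):
    print(i, neighbors[i], w_std[i])
```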

**Creating the kernel weights, triangular with k=20**

since natregimes.shp coordinates are in lat-lon, use **radius** to get great circle distance

note that **diagonal = True** is set to ensure that each diagonal element of the kernel weights equals 1

In [51]:
kw = pysal.adaptive_kernelW_from_shapefile('data/natregimes.shp',
                                             k=20,radius=pysal.cg.RADIUS_EARTH_MILES,
                                             diagonal=True,idVariable='FIPSNO')
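As a rough illustration of what a kernel weight is, the sketch below evaluates the usual triangular and quadratic kernel functions at a few distances. The function names, the distances and the bandwidth are hypothetical, and PySAL's own handling of the bandwidth and of great circle distance is not reproduced here:

```python
def triangular_kernel(d, h):
    """Triangular kernel: 1 - |z| for |z| <= 1, else 0, with z = d / h."""
    z = abs(float(d) / h)
    return 1.0 - z if z <= 1.0 else 0.0

def quadratic_kernel(d, h):
    """Quadratic (Epanechnikov) kernel: 0.75 * (1 - z^2) for |z| <= 1, else 0."""
    z = abs(float(d) / h)
    return 0.75 * (1.0 - z * z) if z <= 1.0 else 0.0

# Hypothetical distances (in miles) from one county to others; with an
# adaptive bandwidth, h is roughly the distance to the k-th nearest neighbor
h = 40.0
for d in [0.0, 10.0, 20.0, 40.0, 55.0]:
    print(d, triangular_kernel(d, h), quadratic_kernel(d, h))
```

At distance zero the triangular kernel already returns 1, while the quadratic kernel returns 0.75, which is the kind of case where forcing the diagonal to 1 makes a difference.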

## Basic Two Stage Least Squares ##

**default settings** including variable names and data set name

In [52]:
reg1 = pysal.spreg.TSLS(y,x,yend,q,name_y=y_name,name_x=x_names,
                        name_yend=yend_names,name_q=q_names,name_ds='nat.dbf')

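regression coefficients, in the order constant, the variables in **x** (as listed in x_names), then the endogenous variable (compare the values with the summary listing)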

In [53]:
reg1.betas

array([[ 15.64555155],
       [  5.72924882],
       [ -0.09837584],
       [  1.8770506 ],
       [ -0.91445539]])
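The betas array carries no labels. A small sketch that pairs each value with a name, assuming the ordering is the constant, then `x_names`, then `yend_names` (an assumption that is consistent with matching the values above against the coefficients in the summary listing):

```python
# Values copied from the reg1.betas output above
betas = [15.64555155, 5.72924882, -0.09837584, 1.8770506, -0.91445539]

x_names = ['RD90', 'MA90', 'PS90']
yend_names = ['UE90']

# Assumed ordering: constant first, then x, then the endogenous variable
labels = ['CONSTANT'] + x_names + yend_names

coef = dict(zip(labels, betas))
for name in labels:
    print('%-10s %12.7f' % (name, coef[name]))
```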

pretty listing

In [54]:
print(reg1.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: TWO STAGE LEAST SQUARES
------------------------------------------
Data set            :     nat.dbf
Dependent Variable  :        HR90                Number of Observations:        3085
Mean dependent var  :      6.1829                Number of Variables   :           5
S.D. dependent var  :      6.6414                Degrees of Freedom    :        3080
Pseudo R-squared    :      0.3570

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      15.6455516       1.3545018      11.5507796       0.0000000
                MA90      -0.0983758       0.0299492      -3.2847583       0.0010207
                PS90       1.8770506       0.1070934      17.5272273       0.0000000
                RD90       5.7292488       0.2129126      

## The Two Stages of 2SLS ##

create a matrix with all the instruments, i.e., both **x** and **q**

In [55]:
bigx = np.hstack((x,q))

In [56]:
bigx.shape

(3085, 6)

OLS regression of endogenous variable on all the instruments (**x** and **q**)

In [57]:
step1 = pysal.spreg.OLS(yend,bigx)

predicted values for endogenous variable

In [58]:
y2 = step1.predy

replace the endogenous variable by its predicted value

In [59]:
newx = np.hstack((x,y2))

second step OLS regression

In [60]:
step2 = pysal.spreg.OLS(y,newx)

In [61]:
print(step2.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Dependent Variable  :     dep_var                Number of Observations:        3085
Mean dependent var  :      6.1829                Number of Variables   :           5
S.D. dependent var  :      6.6414                Degrees of Freedom    :        3080
R-squared           :      0.4027
Adjusted R-squared  :      0.4019
Sum squared residual:   81252.812                F-statistic           :    519.1009
Sigma-square        :      26.381                Prob(F-statistic)     :           0
S.E. of regression  :       5.136                Log likelihood        :   -9422.964
Sigma-square ML     :      26.338                Akaike info criterion :   18855.928
S.E of regression ML:      5.1321                Schwarz criterion     :   18886.100

------------------------------------------------------------------------------------
            Variable     C

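ignore the measures of fit and the diagnostics: the coefficient estimates match those of TSLS, but the estimated
standard errors do not, because the second-step OLS bases them on the wrong residuals (computed with the predicted rather than the observed values of the endogenous variable)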
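The point about the residuals can be checked with plain numpy. The sketch below uses synthetic data (not natregimes) and a hypothetical helper `ols`. It runs the two stages by hand and then computes standard errors from both sets of residuals. The divisor n for the error variance is an assumption, and software may use n - k instead:

```python
import numpy as np

# Synthetic data (hypothetical): two exogenous x, two instruments q,
# and one endogenous regressor that is correlated with the error e
rng = np.random.RandomState(12345)
n = 500
const = np.ones((n, 1))
x = rng.normal(size=(n, 2))
q = rng.normal(size=(n, 2))
e = rng.normal(size=(n, 1))
yend = 1.0 + q.dot(np.array([[1.0], [0.8]])) + 0.8 * e + rng.normal(size=(n, 1))
y = 2.0 + x.dot(np.array([[0.5], [-0.3]])) + 1.5 * yend + e

def ols(X, dep):
    """Least squares coefficients from the normal equations."""
    return np.linalg.solve(X.T.dot(X), X.T.dot(dep))

# Stage 1: regress yend on all instruments (constant, x and q)
H = np.hstack((const, x, q))
yend_hat = H.dot(ols(H, yend))

# Stage 2: regress y on x and the predicted yend
Z_hat = np.hstack((const, x, yend_hat))
betas = ols(Z_hat, y)

# Second-step OLS residuals use yend_hat; 2SLS residuals use observed yend
Z = np.hstack((const, x, yend))
u_ols = y - Z_hat.dot(betas)
u_2sls = y - Z.dot(betas)

# Both variances share (Z_hat'Z_hat)^-1 and differ only through sigma^2
zz_inv_diag = np.diag(np.linalg.inv(Z_hat.T.dot(Z_hat)))
se_ols = np.sqrt((u_ols ** 2).sum() / n * zz_inv_diag)
se_2sls = np.sqrt((u_2sls ** 2).sum() / n * zz_inv_diag)

print(betas.ravel())
print(se_ols)
print(se_2sls)
```

The coefficients are the same either way, since only the residuals entering the error variance change.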

## 2SLS with Spatial Diagnostics ##

set **spat_diag = True** and specify a weights object **w** (and, optionally, its name in **name_w**)

In [62]:
reg2 = pysal.spreg.TSLS(y,x,yend,q,w=w,spat_diag=True,
                        name_y=y_name,name_x=x_names,name_yend=yend_names,
                        name_q=q_names,name_w="nat_queen",name_ds="nat.dbf")

In [63]:
print(reg2.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: TWO STAGE LEAST SQUARES
------------------------------------------
Data set            :     nat.dbf
Weights matrix      :   nat_queen
Dependent Variable  :        HR90                Number of Observations:        3085
Mean dependent var  :      6.1829                Number of Variables   :           5
S.D. dependent var  :      6.6414                Degrees of Freedom    :        3080
Pseudo R-squared    :      0.3570

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      15.6455516       1.3545018      11.5507796       0.0000000
                MA90      -0.0983758       0.0299492      -3.2847583       0.0010207
                PS90       1.8770506       0.1070934      17.5272273       0.0000000
                RD90    

## 2SLS with White Standard Errors ##

set **robust = 'white'**

In [64]:
reg3 = pysal.spreg.TSLS(y,x,yend,q,robust='white',
                        name_y=y_name,name_x=x_names,name_yend=yend_names,
                        name_q=q_names,name_ds="nat.dbf")
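For intuition, a heteroskedasticity-consistent variance for 2SLS can be sketched in plain numpy as a sandwich around the squared residuals. This is the textbook form without any degrees-of-freedom scaling. Whether spreg applies additional scaling is not shown here, and the function name `tsls_with_white` and the synthetic data are hypothetical:

```python
import numpy as np

def tsls_with_white(y, Z, H):
    """2SLS betas with classic and White-type standard errors (a sketch)."""
    # Project the regressors Z on the space spanned by the instruments H
    Z_hat = H.dot(np.linalg.solve(H.T.dot(H), H.T.dot(Z)))
    a_inv = np.linalg.inv(Z_hat.T.dot(Z_hat))
    betas = a_inv.dot(Z_hat.T.dot(y))
    u = y - Z.dot(betas)                      # residuals use the observed Z
    n = y.shape[0]
    v_classic = (u ** 2).sum() / n * a_inv    # divisor n is an assumption
    meat = (Z_hat * u ** 2).T.dot(Z_hat)      # sum of u_i^2 * z_i z_i'
    v_white = a_inv.dot(meat).dot(a_inv)
    return betas, np.sqrt(np.diag(v_classic)), np.sqrt(np.diag(v_white))

# Synthetic data with an error variance that depends on x
rng = np.random.RandomState(2014)
n = 400
const = np.ones((n, 1))
x = rng.normal(size=(n, 1))
q = rng.normal(size=(n, 2))
e = rng.normal(size=(n, 1)) * (1.0 + np.abs(x))
yend = q.dot(np.array([[1.0], [0.5]])) + 0.5 * e + rng.normal(size=(n, 1))
y = 1.0 + 2.0 * x - 1.0 * yend + e

Z = np.hstack((const, x, yend))
H = np.hstack((const, x, q))
betas, se_classic, se_white = tsls_with_white(y, Z, H)
print(betas.ravel())
print(se_classic)
print(se_white)
```

As in the summaries in this notebook, the coefficients do not change with the choice of variance estimator, only the standard errors do.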

In [65]:
print(reg3.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: TWO STAGE LEAST SQUARES
------------------------------------------
Data set            :     nat.dbf
Dependent Variable  :        HR90                Number of Observations:        3085
Mean dependent var  :      6.1829                Number of Variables   :           5
S.D. dependent var  :      6.6414                Degrees of Freedom    :        3080
Pseudo R-squared    :      0.3570

White Standard Errors
------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      15.6455516       1.5393092      10.1640082       0.0000000
                MA90      -0.0983758       0.0316213      -3.1110577       0.0018642
                PS90       1.8770506       0.1688432      11.1171261       0.0000000
                RD90       5.7292488

## 2SLS with HAC Standard Errors ##

set **robust = 'hac'** and specify a kernel weights object as **gwk** (**name_gwk** is optional)

In [66]:
reg4 = pysal.spreg.TSLS(y,x,yend,q,robust='hac',gwk=kw,
                        name_y=y_name,name_x=x_names,name_yend=yend_names,
                        name_q=q_names,name_gwk="nat_k20_triang",
                        name_ds="nat.dbf")

In [67]:
print(reg4.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: TWO STAGE LEAST SQUARES
------------------------------------------
Data set            :     nat.dbf
Dependent Variable  :        HR90                Number of Observations:        3085
Mean dependent var  :      6.1829                Number of Variables   :           5
S.D. dependent var  :      6.6414                Degrees of Freedom    :        3080
Pseudo R-squared    :      0.3570

HAC Standard Errors; Kernel Weights: nat_k20_triang
------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      15.6455516       1.6405678       9.5366688       0.0000000
                MA90      -0.0983758       0.0341965      -2.8767776       0.0040176
                PS90       1.8770506       0.1982054       9.4702289       0.0000000
      

## Practice ##

Repeat the 2SLS regression with spatial diagnostics using the natregimes data set, but for a different year, for example with HR60 as the dependent variable.

Check the effect of HAC estimates on the standard errors.

Create different model weights using k=6 nearest neighbors, and kernel weights using a quadratic (Epanechnikov) kernel.