# What is the handler?

The `spreg-handler` provides a unified interface for applying any specified regression function in `pysal` to data, much like a call to `lm` in `R`.

It has two components.

1. `registry.py`, which finds all of the valid model classes in `pysal.spreg`
2. `handler.py`, which provides *one* function to estimate all of those classes.

Thanks to the `registry`, `handler.Model` can serve as the single point of access for `patsy`/`pandas` interface logic, as well as anything else that we might want to add to regression classes without forcing it through inheritance.
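To make the split between the two files concrete, here is a minimal sketch of the registry-plus-dispatch idea. It is not the code in `registry.py` or `handler.py`: `toy_spreg`, `BaseModel`, `build_registry`, and `fit` are hypothetical names, and the toy classes only stand in for the real `pysal.spreg` models.

```python
import inspect
import types


# Toy stand-ins for model classes (assumption: not the real pysal.spreg classes).
class BaseModel:
    def __init__(self, y, X):
        self.y, self.X = y, X


class OLS(BaseModel):
    pass


class TSLS(BaseModel):
    pass


# A fake "module" holding the classes plus one non-class member.
toy_spreg = types.SimpleNamespace(BaseModel=BaseModel, OLS=OLS, TSLS=TSLS, helper=len)


def build_registry(namespace, base):
    """Map class names to classes for every strict subclass of `base`."""
    return {
        name: obj
        for name, obj in vars(namespace).items()
        if inspect.isclass(obj) and issubclass(obj, base) and obj is not base
    }


registry = build_registry(toy_spreg, BaseModel)


def fit(*args, mtype="OLS", **kwargs):
    """Look up `mtype` in the registry and forward every argument to it."""
    try:
        model_class = registry[mtype]
    except KeyError:
        raise ValueError("unknown model type: {!r}".format(mtype)) from None
    return model_class(*args, **kwargs)


model = fit([1.0, 2.0], [[1.0], [2.0]], mtype="TSLS")
print(sorted(registry), type(model).__name__)
```

The dispatcher never inspects the arguments; it only needs the name-to-class mapping, which is why new behaviour can be added in one place.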

First, let's set up and estimate some models. 

In [1]:
import handler as h
import pysal as ps
import geopandas as gpd

In [2]:
df = gpd.read_file(ps.examples.get_path('columbus.json'))
dbf = ps.open(ps.examples.get_path('columbus.dbf'))
y = dbf.by_col_array(['HOVAL'])
X = dbf.by_col_array(['INC', 'CRIME'])
W = ps.open(ps.examples.get_path('columbus.gal')).read()

In [3]:
original = ps.spreg.OLS(y,X,W, name_x=['INC', 'CRIME'], name_y='HOVAL')

The handler's default model is `OLS`. So, for a model of type `OLS`, no extra argument needs to be passed. However, for the sake of clarity, I'll pass the model specification argument, `mtype`.

In [4]:
handled = h.Model(y,
                  X,
                  W,
                  name_x=['INC', 'CRIME'], 
                  name_y='HOVAL', 
                  mtype='OLS')

In [5]:
print(original.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :       HOVAL                Number of Observations:          49
Mean dependent var  :     38.4362                Number of Variables   :           3
S.D. dependent var  :     18.4661                Degrees of Freedom    :          46
R-squared           :      0.3495
Adjusted R-squared  :      0.3212
Sum squared residual:   10647.015                F-statistic           :     12.3582
Sigma-square        :     231.457                Prob(F-statistic)     :   5.064e-05
S.E. of regression  :      15.214                Log likelihood        :    -201.368
Sigma-square ML     :     217.286                Akaike info criterion :     408.735
S.E of regression ML:     14.7406                Schwarz criterion     :     414.411

-----------------------------------------------------------------------------

In [6]:
print(handled.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :       HOVAL                Number of Observations:          49
Mean dependent var  :     38.4362                Number of Variables   :           3
S.D. dependent var  :     18.4661                Degrees of Freedom    :          46
R-squared           :      0.3495
Adjusted R-squared  :      0.3212
Sum squared residual:   10647.015                F-statistic           :     12.3582
Sigma-square        :     231.457                Prob(F-statistic)     :   5.064e-05
S.E. of regression  :      15.214                Log likelihood        :    -201.368
Sigma-square ML     :     217.286                Akaike info criterion :     408.735
S.E of regression ML:     14.7406                Schwarz criterion     :     414.411

-----------------------------------------------------------------------------

The long and short of it is that `Model` classes pass estimation to the function specified in `mtype`, and then contain the results in a reasonable way. 

In fact, the "real" `PySAL` model class sits under `handled._called`, so, at worst, we can just reference aspects of `handled` down to `_called`. I currently do this by iterating through `dir(handled._called)` and using `eval` to flatten all of `_called`'s attributes into `handled` at initialization. 
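For comparison, the same flattening can be sketched with `getattr` and `setattr` instead of `eval`. This is an illustration under assumptions, not the handler's code: `Inner` and `Wrapper` are hypothetical classes, the `betas` values are made up, and skipping double-underscore names is a choice made for this sketch.

```python
class Inner:
    """Stand-in for an estimated PySAL model (hypothetical)."""

    def __init__(self, y, X):
        self.y = y
        self.X = X
        self.betas = [0.5, 1.5]  # made-up values, for illustration only


class Wrapper:
    def __init__(self, *args, **kwargs):
        self._called = Inner(*args, **kwargs)
        for name in dir(self._called):
            if name.startswith("__"):
                continue  # leave the wrapper's own special methods alone
            # Bind the same object under the same name; nothing is copied.
            setattr(self, name, getattr(self._called, name))


w = Wrapper([1, 2, 3], [[1], [2], [3]])
print(w.betas is w._called.betas)
```

Because `setattr` only binds a name to an existing object, the outer and inner attributes are the very same objects.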

But, eventually, I am thinking about adding plotting, visual diagnostics, out-of-sample prediction, or other features to this wrapper. So, I will probably not duplicate the access points for intermediate computations, like `X'X`, `e`, or `TSLS`'s arcane-sounding `zthhthi`.

I'd like to clean up this `Model` interface so that only X, Y, residuals, and some statistics are directly exposed. 

Keep in mind, since [assignment **never** copies data](https://youtu.be/_AEJHKGk9ns?t=296), and the original model sits in `handled._called`, this isn't actually a *loss* of information, just a *hiding*, which is a standard OOP principle. 

### Isn't this wastefully storing multiple copies of data in memory?

No. Let's see where everything lives using the python built-in `id` function. 

Recall that the original model is stuffed into `Model._called`. So, if anything in there has a different memory address from what's being displayed by `Model`, the data is duplicated:

In [7]:
for atname in dir(handled._called):
    attr = eval("handled._called.{}".format(atname))
    composed_id = hex(id(attr))
    outattr = eval("handled.{}".format(atname))
    outer_id = hex(id(outattr))
    if composed_id != outer_id:
        print(atname + "is in two different addresses.")
        print("\t Outer is at " + outer_id +"\n\t Inner is at " + composed_id)

__init__is in two different addresses.
	 Outer is at 0x7f11fbb2dcd0
	 Inner is at 0x7f11fbb92820


Only the `__init__` function exposed by `Model` is different from `Model._called`, which makes sense, since an `__init__` function can't write over itself. 

If we wanted to access `Model._called.__init__`, it's still there. This means we could implement some "refit" method, `Model.refit(y=Model.y, X=Model.X, ...)`, which could use `Model._called.__init__` to revise the estimates in `Model` in place or return a new model.
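A rough sketch of what such a method might look like, assuming a toy estimator: `Mean`, `Wrapper`, and the `inplace` flag are all hypothetical, and nothing like this exists in the handler yet.

```python
class Mean:
    """Toy estimator standing in for a PySAL model class (hypothetical)."""

    def __init__(self, y):
        self.y = y
        self.estimate = sum(y) / len(y)


class Wrapper:
    def __init__(self, y):
        self._called = Mean(y)
        self._sync()

    def _sync(self):
        # Re-expose the inner model's attributes on the wrapper.
        self.y = self._called.y
        self.estimate = self._called.estimate

    def refit(self, y=None, inplace=False):
        """Re-estimate with new data (hypothetical method)."""
        y = self.y if y is None else y
        if not inplace:
            return type(self)(y)  # leave this wrapper untouched
        self._called.__init__(y)  # rerun the inner estimator's constructor
        self._sync()
        return self


w = Wrapper([1, 2, 3])
w2 = w.refit([2, 4, 6])            # new model, w is unchanged
w.refit([10, 20], inplace=True)    # w itself is re-estimated
print(w2.estimate, w.estimate)
```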

I don't know why we might want to do this, but it's kinda neat :)

# What does this buy us?

Regardless, everything the wrapping `Model` class does is layered *around* the underlying PySAL classes. That is, the wrapper only injects commands into the API. At minimum, it *is exactly* the underlying class.

This is because it dispatches the arguments to the specified model type without knowing any special information about the function call.  

This means we can do some pretty cool things, while keeping the actual wrapper at ~40 LoC!

In [8]:
ML = ps.spreg.ML_Lag(y,X,W)



In [9]:
handled_ML = h.Model(y,X,W,mtype='ML_Lag')

In [10]:
print(ML.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL)
-----------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     dep_var                Number of Observations:          49
Mean dependent var  :     38.4362                Number of Variables   :           4
S.D. dependent var  :     18.4661                Degrees of Freedom    :          45
Pseudo R-squared    :      0.3639
Spatial Pseudo R-squared:  0.3384
Sigma-square ML     :     212.490                Log likelihood        :    -200.903
S.E of regression   :      14.577                Akaike info criterion :     409.807
                                                 Schwarz criterion     :     417.374

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
-----------------------------

In [11]:
print(handled_ML.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL)
-----------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     dep_var                Number of Observations:          49
Mean dependent var  :     38.4362                Number of Variables   :           4
S.D. dependent var  :     18.4661                Degrees of Freedom    :          45
Pseudo R-squared    :      0.3639
Spatial Pseudo R-squared:  0.3384
Sigma-square ML     :     212.490                Log likelihood        :    -200.903
S.E of regression   :      14.577                Akaike info criterion :     409.807
                                                 Schwarz criterion     :     417.374

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
-----------------------------

### Intercepting formulas

So, this is pretty neat, but gives us nothing beyond using *one* function to dispatch models. That's cool and R-like, but it's not necessarily better. Where it does add functionality is in its ability to intercept model formulas.

In [12]:
handled_eq = h.Model("HOVAL ~ INC + CRIME", data=df)

In [13]:
print(handled_eq.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Dependent Variable  :     dep_var                Number of Observations:          49
Mean dependent var  :     38.4362                Number of Variables   :           3
S.D. dependent var  :     18.4661                Degrees of Freedom    :          46
R-squared           :      0.3495
Adjusted R-squared  :      0.3212
Sum squared residual:   10647.015                F-statistic           :     12.3582
Sigma-square        :     231.457                Prob(F-statistic)     :   5.064e-05
S.E. of regression  :      15.214                Log likelihood        :    -201.368
Sigma-square ML     :     217.286                Akaike info criterion :     408.735
S.E of regression ML:     14.7406                Schwarz criterion     :     414.411

------------------------------------------------------------------------------------
            Variable     C

That means `HOVAL`, `CRIME`, and `INC` all get drawn out of the dataframe using patsy and pushed into arrays. This works for any class, since we're just turning the equations into their constituent arrays.

Where there is a possible bikeshedding point is over the syntax for TSLS-type models. Right now, I have it specified with (what I think is) a clear syntax reflecting the simultaneous equations approach:

`y ~ x1 + x2 || yend ~ xend1 + xend2`

implies an equation where your exogenous relationship is `y ~ x1 + x2` and your endogenous relationship is `yend ~ xend1 + xend2`. 

For any simultaneous equation-type model, I would suggest using double pipe as the separator. Under the hood, I'm just using `string.split('||')`, since patsy doesn't use the double pipe. 
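As a small sketch of that splitting step: the helpers `split_formula` and `split_sides` below are hypothetical, and `split_sides` is a naive stand-in for what patsy would do with each half in the real handler.

```python
def split_formula(formula, sep="||"):
    """Split a two-part formula into (exogenous, endogenous) strings."""
    parts = [part.strip() for part in formula.split(sep)]
    if len(parts) == 1:
        return parts[0], None  # no endogenous part was given
    if len(parts) != 2:
        raise ValueError("expected at most one {!r} in {!r}".format(sep, formula))
    return parts[0], parts[1]


def split_sides(formula):
    """Naively split 'lhs ~ a + b' into ('lhs', ['a', 'b'])."""
    lhs, rhs = formula.split("~")
    return lhs.strip(), [term.strip() for term in rhs.split("+")]


exog, endog = split_formula("CRIME ~ INC || HOVAL ~ DISCBD")
print(split_sides(exog))   # ('CRIME', ['INC'])
print(split_sides(endog))  # ('HOVAL', ['DISCBD'])
```

A formula without the separator falls through as a single exogenous equation, so ordinary models would not need special handling.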

In [14]:
y = dbf.by_col_array(['CRIME'])
X = dbf.by_col_array(['INC'])
yend = dbf.by_col_array(['HOVAL'])
q = dbf.by_col_array(['DISCBD'])

In [15]:
tsls = ps.spreg.TSLS(y,X,yend,q,W)

In [16]:
handledtsls = h.Model(y,X,yend,q,W,mtype='TSLS')

In [17]:
handledtsls_eq = h.Model("CRIME ~ INC || HOVAL ~ DISCBD", W, data=df, mtype='TSLS')

In [18]:
print(tsls.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: TWO STAGE LEAST SQUARES
------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     dep_var                Number of Observations:          49
Mean dependent var  :     35.1288                Number of Variables   :           3
S.D. dependent var  :     16.7321                Degrees of Freedom    :          46
Pseudo R-squared    :      0.2794

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      88.4657958      15.1346096       5.8452645       0.0000000
        endogenous_1      -1.5821659       0.7931892      -1.9946891       0.0460768
               var_1       0.5200379       1.4146781       0.3676016       0.7131703
------------------------

In [19]:
print(handledtsls.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: TWO STAGE LEAST SQUARES
------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     dep_var                Number of Observations:          49
Mean dependent var  :     35.1288                Number of Variables   :           3
S.D. dependent var  :     16.7321                Degrees of Freedom    :          46
Pseudo R-squared    :      0.2794

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      88.4657958      15.1346096       5.8452645       0.0000000
        endogenous_1      -1.5821659       0.7931892      -1.9946891       0.0460768
               var_1       0.5200379       1.4146781       0.3676016       0.7131703
------------------------

In [20]:
print(handledtsls_eq.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: TWO STAGE LEAST SQUARES
------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     dep_var                Number of Observations:          49
Mean dependent var  :     35.1288                Number of Variables   :           3
S.D. dependent var  :     16.7321                Degrees of Freedom    :          46
Pseudo R-squared    :      0.2794

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      88.4657958      15.1346096       5.8452645       0.0000000
        endogenous_1      -1.5821659       0.7931892      -1.9946891       0.0460768
               var_1       0.5200379       1.4146781       0.3676016       0.7131703
------------------------

This would also enable adding plotting capabilities to spatial regression models, like the standard four-plot output from plotting an `lm` in `R`, without having to hack them into each and every model class.

# Pitfalls

This wrapper is neat, but could lead us down a weird road.

Right now, I clean up the mandatory arguments tuple `*args` such that only the arguments that match the type signature needed by PySAL get passed. This is to prevent runtime errors.

To explain, let some function be defined using the default Python argument expansion (a.k.a. splatting) operator, `*`:

        def foo(*args, message='Hello'):
            for arg in args:
                print(message + ' ' + arg)
        
        >>> foo("world!", "users!")
        Hello world!
        Hello users!
        
Cool. Now, let's say we want to print our message right before we execute some function, `bar`. `bar` is a model class with a well-defined API. It requires two positional arguments of type `ndarray`, `y` and `X`, can accept an arbitrary number of keyword arguments, and returns `True` if a model is fit correctly. 

For the sake of argument (no pun intended), let's say that `foo` only needs the strings passed as positional arguments. Depending on your Python version (keyword-only arguments after `*args` require Python 3), we can wrap `bar` with `foo` this way:

        def foo(*args, message='Hello', **kwargs):
            # strings are printed by foo; everything else is forwarded to bar
            strargs = [arg for arg in args if isinstance(arg, str)]
            barargs = [arg for arg in args if not isinstance(arg, str)]
            for arg in strargs:
                print(message + ' ' + arg)
            return bar(*barargs, **kwargs)

If we do this, arbitrary arguments could get passed to `foo` that, if `bar` isn't expecting them, will cause a `TypeError`.

        >>> foo("world!", "users!", 12313, y, X)
        Hello world!
        Hello users!
        -------------------------------------------
        TypeError
        return bar(*barargs, **kwargs)

        TypeError: bar() takes exactly 2 arguments (3 given)
 
Some libraries (*ahem* Matplotlib) get around this by making almost every function take arbitrary arguments, and each function just peels off of `args` and `kwargs` what it needs. 

This isn't us, though. I would suggest that the standard be that we use input types or ordering of arguments to define how to wrap the underlying functions. This means that, if we know `bar` accepts a `y` and an `X` of type `ndarray` in that order, we pop off of a stack of arguments with the current scope's arguments at the top of the stack.

Here, since we know that we're wrapping one function with two arguments, its arguments are at the bottom of the stack; that is, they are the last two positional arguments.

        def foo(*args, message="Hello", **kwargs):
            args = list(args)  # *args arrives as a tuple, which has no pop()
            X = args.pop()
            y = args.pop()
            for arg in args:
                print(message + ' ' + arg)
            return bar(y, X, **kwargs)
       
        >>> foo("world!", "users!", y, X)
        Hello world!
        Hello users!
        True #model fit successfully
        
Of course, we could also get even more abstract. We could construct a list of arguments for the wrapped function based on the *types* expected by that function. If we can identify or construct subsequences of *args* that match the function we're wrapping, we're good to go. 
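One way this type-based matching could look, as a sketch only: `match_by_type` is a hypothetical helper, plain lists stand in for `ndarray`, and `bar` is a dummy that always reports success.

```python
def match_by_type(args, expected_types):
    """Return the first run of consecutive args whose types match, plus the rest."""
    n = len(expected_types)
    for start in range(len(args) - n + 1):
        window = args[start:start + n]
        if all(isinstance(a, t) for a, t in zip(window, expected_types)):
            rest = args[:start] + args[start + n:]
            return window, rest
    raise TypeError("no run of arguments matches {}".format(expected_types))


def bar(y, X):
    """Dummy model: pretends the fit always succeeds."""
    return True


def foo(*args, message="Hello"):
    # Lists stand in for ndarrays here (assumption for this sketch).
    (y, X), rest = match_by_type(args, (list, list))
    for arg in rest:
        print(message + " " + str(arg))
    return bar(y, X)


result = foo("world!", [1.0, 2.0], [[1.0], [2.0]], "users!")
print(result)
```

Unlike the positional `pop` approach, the wrapped function's arguments can sit anywhere in the call, as long as they appear together and in order.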

However, these strategies have the strange side effect that, if extra arguments are passed, no error is raised. This could be desirable, but may not be if the user expects the arguments to **do** something. 

The call with ignored arguments looks like this:

In [21]:
handledtsls_ignored = h.Model("CRIME ~ INC || HOVAL ~ DISCBD",
                              True, #gets ignored
                              W, 
                              42, #gets ignored too
                              data=df, 
                              mtype='GM_Lag')

In [22]:
print(handledtsls_ignored.summary) #woah! no error or notice!

REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     dep_var                Number of Observations:          49
Mean dependent var  :     35.1288                Number of Variables   :           4
S.D. dependent var  :     16.7321                Degrees of Freedom    :          45
Pseudo R-squared    :      0.2377
Spatial Pseudo R-squared:  0.2477

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      96.5182979      52.5305917       1.8373731       0.0661548
           W_dep_var      -0.0148182       0.0857203      -0.1728667       0.8627562
        endogenous_1      -1.8097627       1.7652028      -

Those other positional arguments get ignored in the equation framework. 

Practically, this means you couldn't provide some variables in equations and some in vectors... you'd have to either have equations or vectors.
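If silently dropping arguments turns out to be a problem, one option is to at least warn about them. The sketch below is an assumption about how that could be done, not current handler behaviour: `keep_expected` is a hypothetical helper, and a `dict` stands in for the weights object.

```python
import warnings


def keep_expected(args, expected_types):
    """Keep args of an expected type; warn about everything else."""
    kept, ignored = [], []
    for arg in args:
        (kept if isinstance(arg, expected_types) else ignored).append(arg)
    if ignored:
        warnings.warn(
            "ignoring positional arguments: {!r}".format(ignored),
            UserWarning,
            stacklevel=2,
        )
    return kept


# Capture the warning so the example can show it explicitly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = keep_expected(("CRIME ~ INC || HOVAL ~ DISCBD", True, {"w": 1}, 42), (str, dict))

print(kept)
print(caught[0].message)
```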