Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(contrib) GWR bandwidth selection fails when data contains zeros #1010

Closed
mattwigway opened this issue Nov 20, 2017 · 6 comments
Closed

(contrib) GWR bandwidth selection fails when data contains zeros #1010

mattwigway opened this issue Nov 20, 2017 · 6 comments
Assignees

Comments

@mattwigway
Copy link

When I run the Geographically Weighted Regression Sel_BW process to find the optimal bandwidth for GWR on my data (with a dependent variable that is exactly zero in some observations, and an assumption of a gaussian distribution for the dependent variable), I get the following error.

/Users/matthewc/anaconda/envs/chicago-transit/lib/python2.7/site-packages/pysal/contrib/gwr/gwr.py:275: RuntimeWarning: divide by zero encountered in divide
  S = S * (1.0/z)
/Users/matthewc/anaconda/envs/chicago-transit/lib/python2.7/site-packages/pysal/contrib/gwr/gwr.py:275: RuntimeWarning: invalid value encountered in multiply
  S = S * (1.0/z)

There is a minimum working example here, and data is here.

Sel_BW then returns a bandwidth of 1352, which seems much too large. GWR4 finds a bandwidth of 399. 1352 is one of the starting values in the golden section search from GWR4 (below) so I'm wondering if the search is just never getting started. I get the same error if I change the kernel to gaussian or the criterion to BIC, which is not surprising since it seems to be in the GWR model-fitting code. The issue seems to have to do with zeros in the dependent variable; if I add one to all observations of the dependent variable, I get a bandwidth of 400, which is roughly what I would expect.

Page 189 of Fotheringham, Brunsdon and Charlton (2002), which is referenced in the IWLS source, suggests that for a Gaussian distribution of the dependent variable, z is equal to y. I'm not quite following the algebra to derive that, but assuming that is correct it would explain this issue.

So I guess the fix is either to fix the algorithm to work with zeros, or if this is just a limitation of the algorithm, throw an error that stops the search process without returning anything (and preferably with an error message that explains that it doesn't work with zeros in the dependent variable), rather than printing some warning and returning a bandwidth value that is incorrect.

  • Platform information: posix darwin
  • Python version: 2.7.14 |Anaconda, Inc.| (default, Nov 8 2017, 16:45:25)
    [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
  • SciPy version: 0.19.0
  • NumPy version: 1.13.1
  • PySAL version: 1.14.3

Bandwidth selection output from GWR4:

Bandwidth search <golden section search>
  Limits: 70,  2144
 Golden section search begins...
 Initial values
  pL            Bandwidth:    70.000 Criterion:  -2072.307
  p1            Bandwidth:   862.198 Criterion:  -3375.784
  p2            Bandwidth:  1351.802 Criterion:  -3271.067 # <---- 1352 is an initial value for Golden Section Search
  pU            Bandwidth:  2144.000 Criterion:  -3119.078
 iter    1 (p1) Bandwidth:   862.198 Criterion:  -3375.784 Diff:    489.605
 iter    2 (p1) Bandwidth:   559.605 Criterion:  -3430.592 Diff:    302.593
 iter    3 (p1) Bandwidth:   372.593 Criterion:  -3446.235 Diff:    187.012
 iter    4 (p2) Bandwidth:   372.593 Criterion:  -3446.235 Diff:    115.580
 iter    5 (p2) Bandwidth:   444.025 Criterion:  -3447.081 Diff:     71.432
 iter    6 (p1) Bandwidth:   444.025 Criterion:  -3447.081 Diff:     44.148
 iter    7 (p1) Bandwidth:   416.740 Criterion:  -3447.846 Diff:     27.285
 iter    8 (p1) Bandwidth:   399.877 Criterion:  -3448.098 Diff:     16.863
 iter    9 (p2) Bandwidth:   399.877 Criterion:  -3448.098 Diff:     10.422
Best bandwidth size 399.000
Minimum AICc    -3448.098
@TaylorOshan
Copy link
Contributor

Hey @mattwigway, thanks for writing this up. It is curious that 1352 is close to one of the starting values used for the search in GWR4, but since GWR4 isn't open source, it is unfortunately not possible to look inside and see how the algorithms compare. You're right, that it looks like a zero division that is probably producing NaN or inf , which is then causing the selection criterion diagnostic to be erroneous and is making the selection routine get stuck.

Theoretically, there shouldn't be an issue with having zero observations of the dependent variable. I think the problem may be even more elemental than the gwr routine, and is due to some decision made in the glm contrib module that gwr is built upon. z is not necessarily an observed value, it is actually output from the iteratively weighted least squares (IWLS) algorithm, though how the IWLS algorithm treats zero values of y to produce z is probably what is causing an issue. I think the GWR4 code uses an OLS routine for a Gaussian probability model, but GLM for Poisson and Binomial, whereas this gwr code uses a GLM for all three. Though the estimates should be equivalent between OLS and Gaussian GLM, the algorithm is slightly different and that is why I suspect the issue is within the IWLS routine and could be causing the difference between here and GWR4.

I'll plan to double check the math and then see if the IWLS routine can be bolstered to avoid this issue but it might not be until next week when I have access to the GWR book you cited and a bit more time. In the meantime I'd recommend using GWR4 for that dataset and apologize for the inconvenience. Also, if you wanted to further investigate, the glm module is based on and validated on functionality from the statsmodels project, thought I don't think my tests included a Gaussian model and zero values of the dependent variable, so when I get the chance I'll probably start by looking to see if/how such an instance is handled there.

@mattwigway
Copy link
Author

@TaylorOshan thanks! I don't know that I'll have time to dig into it more in the next few days either, but do let me know if there's anything else I can do to help.

@mattwigway
Copy link
Author

FWIW, fitting a Gaussian GLM in statsmodels to this data works and produces the same coefficients that OLS does.

@TaylorOshan
Copy link
Contributor

Ok, thanks for letting me know. Will look into it.

@TaylorOshan
Copy link
Contributor

TaylorOshan commented Jan 19, 2018

Hi @mattwigway, finally got around to taking a closer look at this. There are kind of two different things going on here.

First, In that error, v is a transformed version of the dependent variable. However, for a Gaussian GWR I'm pretty sure the there is no substantive transformation and so zero values remain zero, and then are used to compute the hat matrix, S. Interesting, @Ziqi-Li recently contributed some enhancements here , which should ameliorate this issue (note that active dev is now occurring in a gwr-specific project under the PySAL org as per the recent refactoring initiative). Essentially, the divide-by-zero issue occurs during the computation of the hat matrix, which is not actually needed during bandwidth selection. Ziqi's update only computes necessary diagnostics during bandwidth selection, avoiding zero division and finding the correct bandwidth. However, this will still be an issue once the bandwidth is selected and a final gwr model is fit, which does require the computation and application of the hat matrix, S.

Second, I've actually realized that the current code multiplies each row of S by a row of Z in a row-by-row fashion and then divides S by Z. This is redundant as the operations cancel each other out. I believe it was a transcription error on my part when parsing the math in this publication about Poisson GWR using a GLM framework. I think most other software uses a regular OLS routine for Gaussian GWR and then a GLM framework for Poisson, Binomial etc. But, I figured since the OLS and Gaussian GLM produce the same output it would be easier/simpler to develop the entire framework using GLM. Didn't realize it was an issue because the dataset used to test the Gaussian GWR routine doesn't have any zeros. So far, it looks as if the line S = S * (1.0/z) can be removed so long as ri*np.reshape(rslt[4].flatten(), (1,-1)) is also truncated to simply be S[i] = ri. In this case you can actually just skip the instantiation of the R array and assign each row of R directly to each row of S, such as S[i] = np.dot(self.X[i], rslt[5]). With this update, all tests are passing for Gaussian, Poisson, and Binomial routines, but I'm still going to think on it to make sure I'm not missing anything else.

@ljwolf
Copy link
Member

ljwolf commented Jul 14, 2018

Addressed in pysal/gwr#9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants