(contrib) GWR bandwidth selection fails when data contains zeros #1010
Comments
Hey @mattwigway, thanks for writing this up. It is curious that 1352 is close to one of the starting values used for the search in GWR4, but since GWR4 isn't open source, it is unfortunately not possible to look inside and see how the algorithms compare. You're right that it looks like a zero division is probably producing the problem. Theoretically, there shouldn't be an issue with having zero observations of the dependent variable. I think the problem may be even more elemental than the gwr routine, and is due to some decision made in the glm contrib module that gwr is built upon. I plan to double-check the math and then see if the IWLS routine can be bolstered to avoid this issue, but it might not be until next week, when I have access to the GWR book you cited and a bit more time. In the meantime I'd recommend using GWR4 for that dataset, and I apologize for the inconvenience. Also, if you want to investigate further, the glm module is based on, and validated against, functionality from the statsmodels project, though I don't think my tests included a Gaussian model with zero values of the dependent variable. When I get the chance I'll probably start by looking to see if and how such an instance is handled there. |
@TaylorOshan thanks! I don't know that I'll have time to dig into it more in the next few days either, but do let me know if there's anything else I can do to help. |
FWIW, fitting a Gaussian GLM in statsmodels to this data works and produces the same coefficients that OLS does. |
Ok, thanks for letting me know. Will look into it. |
Hi @mattwigway, finally got around to taking a closer look at this. There are two different things going on here. First, in that error, v is a transformed version of the dependent variable. However, for a Gaussian GWR I'm pretty sure there is no substantive transformation, so zero values remain zero and are then used to compute the hat matrix, S. Interestingly, @Ziqi-Li recently contributed some enhancements here, which should ameliorate this issue (note that active dev is now occurring in a gwr-specific project under the PySAL org as per the recent refactoring initiative). Essentially, the divide-by-zero issue occurs during the computation of the hat matrix, which is not actually needed during bandwidth selection. Ziqi's update computes only the necessary diagnostics during bandwidth selection, which avoids the zero division and finds the correct bandwidth. However, this will still be an issue once the bandwidth is selected and a final gwr model is fit, because that does require the computation and application of the hat matrix, S. Second, I've realized that the current code multiplies each row of S by a row of Z in a row-by-row fashion and then divides S by Z. This is redundant, as the operations cancel each other out. I believe it was a transcription error on my part when parsing the math in this publication about Poisson GWR using a GLM framework. I think most other software uses a regular OLS routine for Gaussian GWR and then a GLM framework for Poisson, Binomial, etc. But I figured that since OLS and the Gaussian GLM produce the same output, it would be simpler to develop the entire framework using GLM. I didn't realize it was an issue because the dataset used to test the Gaussian GWR routine doesn't have any zeros. So far, it looks as if the line |
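The redundant multiply-then-divide described above can be illustrated with a small NumPy sketch. This is not the code from the glm or gwr modules: the 3x3 matrix `S`, the vectors `z_nonzero` and `z_with_zero`, the helper `scale_and_unscale`, and the broadcasting orientation are all assumptions made for illustration. It only shows why the round trip is harmless without zeros and yields NaN as soon as one entry of z is zero.

```python
import numpy as np

# Toy stand-ins (hypothetical values): S plays the role of a hat matrix,
# z the transformed dependent variable, which equals y for a Gaussian model.
S = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
z_nonzero = np.array([2.0, 4.0, 5.0])
z_with_zero = np.array([2.0, 0.0, 5.0])


def scale_and_unscale(S, z):
    """Multiply S by z, then divide by z again (the redundant round trip)."""
    # Silence the NumPy warnings so the NaN result can be inspected directly.
    with np.errstate(divide="ignore", invalid="ignore"):
        return (S * z) / z


# Without zeros, the two operations cancel and S comes back unchanged.
ok = scale_and_unscale(S, z_nonzero)

# With a zero in z, the affected entries become 0 / 0, which is NaN.
bad = scale_and_unscale(S, z_with_zero)
```

Dropping the round trip altogether would leave `S` untouched in both cases, which is consistent with the observation that the two operations cancel.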
Addressed in pysal/gwr#9 |
When I run the Geographically Weighted Regression Sel_BW process to find the optimal bandwidth for GWR on my data (with a dependent variable that is exactly zero in some observations, and an assumption of a Gaussian distribution for the dependent variable), I get the following error.
There is a minimum working example here, and data is here.
Sel_BW then returns a bandwidth of 1352, which seems much too large. GWR4 finds a bandwidth of 399. 1352 is one of the starting values in the golden section search from GWR4 (below), so I'm wondering if the search is just never getting started. I get the same error if I change the kernel to gaussian or the criterion to BIC, which is not surprising since it seems to be in the GWR model-fitting code. The issue seems to have to do with zeros in the dependent variable; if I add one to all observations of the dependent variable, I get a bandwidth of 400, which is roughly what I would expect.

Page 189 of Fotheringham, Brunsdon and Charlton (2002), which is referenced in the IWLS source, suggests that for a Gaussian distribution of the dependent variable, z is equal to y. I'm not quite following the algebra to derive that, but assuming that is correct it would explain this issue.
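The claim that z equals y in the Gaussian case can be checked numerically. The sketch below assumes the standard IWLS working response for a GLM, z = g(mu) + (y - mu) * g'(mu), with g the link function; whether the contrib module uses exactly this form is an assumption. The function name and the sample values are hypothetical. With the identity link, g(mu) = mu and g'(mu) = 1, so the mu terms cancel and z = y, zeros included.

```python
def adjusted_dependent_variable(y, mu, link, link_deriv):
    """IWLS working response: z = g(mu) + (y - mu) * g'(mu)."""
    return [link(m) + (yi - m) * link_deriv(m) for yi, m in zip(y, mu)]


def identity(m):
    """Identity link used by a Gaussian GLM: g(mu) = mu."""
    return m


def identity_deriv(m):
    """Derivative of the identity link: g'(mu) = 1."""
    return 1.0


y = [0.0, 3.0, 7.5, 0.0]    # toy observations, including exact zeros
mu = [1.2, 2.5, 6.0, 0.4]   # arbitrary current fitted means
z = adjusted_dependent_variable(y, mu, identity, identity_deriv)
```

Whatever the current fitted means are, `z` matches `y` here, so any later step that divides by z divides by zero wherever y is zero.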
So I guess the fix is either to fix the algorithm to work with zeros, or if this is just a limitation of the algorithm, throw an error that stops the search process without returning anything (and preferably with an error message that explains that it doesn't work with zeros in the dependent variable), rather than printing some warning and returning a bandwidth value that is incorrect.
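The second option, stopping the search rather than returning a misleading bandwidth, might look roughly like the following. This is a generic golden-section minimiser, not the Sel_BW implementation: `golden_section_min`, its parameters, and the toy criterion with a minimum at 400 are all hypothetical. The guard matters because comparisons against NaN are always false, so a search loop can keep running on NaN scores without noticing; whether that is what happens in this issue is not confirmed in the thread.

```python
import math


def golden_section_min(score, lo, hi, tol=1.0, max_iter=200):
    """Minimise score(bw) on [lo, hi]; raise if any score is not finite."""
    def checked(bw):
        value = score(bw)
        if not math.isfinite(value):
            raise ValueError(
                f"criterion is not finite at bandwidth {bw:.1f}; "
                "check the dependent variable for zeros"
            )
        return value

    ratio = (math.sqrt(5) - 1) / 2  # inverse golden ratio, about 0.618
    a, b = lo, hi
    c, d = b - ratio * (b - a), a + ratio * (b - a)
    fc, fd = checked(c), checked(d)
    for _ in range(max_iter):
        if abs(b - a) < tol:
            break
        if fc < fd:
            # Minimum lies in [a, d]: shrink from the right, reuse c as new d.
            b, d, fd = d, c, fc
            c = b - ratio * (b - a)
            fc = checked(c)
        else:
            # Minimum lies in [c, b]: shrink from the left, reuse d as new c.
            a, c, fc = c, d, fd
            d = a + ratio * (b - a)
            fd = checked(d)
    return (a + b) / 2


# Toy criterion with a single minimum at a bandwidth of 400.
best = golden_section_min(lambda bw: (bw - 400.0) ** 2, 50.0, 2000.0)

# A criterion that evaluates to NaN stops the search with an explicit error.
try:
    golden_section_min(lambda bw: float("nan"), 50.0, 2000.0)
    raised = False
except ValueError:
    raised = True
```

Raising inside the objective wrapper keeps the search logic itself unchanged and surfaces the failure at the first bad evaluation.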
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Bandwidth selection output from GWR4: