Thoughts about adaptive gaussian #3
Original issue text: The

@Ziqi-Li comment: By default, the

@ljwolf comment: might the PySAL gaussian and arcgis gaussian differ precisely by
We compute the distance matrix divided by the bandwidth on line 82. However, the D-matrix is symmetric pairwise distances. Taking @Ziqi-Li's code:

```python
from pysal.contrib.gwr.kernels import _Kernel

# like the adaptive call happening in the GWR class
test_kernel = _Kernel(coords, bandwidth=None, fixed=False, k=10)
test_kernel.bandwidth.shape  # (159, 1)
test_kernel.dmat.shape       # (159, 159)
```

Immediately, I think something's wrong with then dividing `dmat` by `bandwidth`, because:

```python
import numpy as np

test_square = np.arange(25).reshape(5, 5)
test_divisor = np.ones((5, 1)) * 5
print(test_square)
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]
#  [15 16 17 18 19]
#  [20 21 22 23 24]]
print(test_square / test_divisor)
# [[0.  0.2 0.4 0.6 0.8]
#  [1.  1.2 1.4 1.6 1.8]
#  [2.  2.2 2.4 2.6 2.8]
#  [3.  3.2 3.4 3.6 3.8]
#  [4.  4.2 4.4 4.6 4.8]]
```

For our purposes, this makes the "kernel" matrix no longer symmetric, since it's dividing each column by each element of the bandwidth vector. So each column of the kernel is only matched to the

```python
np.testing.assert_allclose(test_kernel.dmat, test_kernel.dmat.T)      # passes
np.testing.assert_allclose(test_kernel.kernel, test_kernel.kernel.T)  # fails!
```

Now, to be fair, this also fails in PySAL, which is also not something I would expect:

```python
psW = pysal.weights.Distance.Kernel(coords, k=10, fixed=False,
                                    function='gaussian').full()[0]
np.testing.assert_allclose(psW, psW.T)  # fails!
```
Not sure what this means, but I am confused.
I feel it should use the column because, as I understand the adaptive bandwidth, it should be the vector containing the distances for observation i divided by observation i's unique bandwidth. That would be the column, not the row.
No, sorry, wrong again. The broadcasting is right:

```python
np.ones((5, 5)) / np.arange(5, 10).reshape(5, 1)
# array([[0.2       , 0.2       , 0.2       , 0.2       , 0.2       ],
#        [0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667],
#        [0.14285714, 0.14285714, 0.14285714, 0.14285714, 0.14285714],
#        [0.125     , 0.125     , 0.125     , 0.125     , 0.125     ],
#        [0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111]])
```

but the fact that the kernel is asymmetric is throwing me off.
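To pin down which axis a column vector divides under NumPy broadcasting, here is a toy example (the variable names are mine, purely illustrative):

```python
import numpy as np

# Stand-in for a symmetric distance matrix and per-observation bandwidths.
D = np.ones((3, 3))
bw_col = np.array([[1.0], [2.0], [4.0]])  # shape (3, 1): one bandwidth per row
bw_row = bw_col.T                          # shape (1, 3): one bandwidth per column

# (3, 3) / (3, 1): row i is divided by bw_col[i] -- each ROW gets "its" bandwidth
print(D / bw_col)
# [[1.   1.   1.  ]
#  [0.5  0.5  0.5 ]
#  [0.25 0.25 0.25]]

# (3, 3) / (1, 3): column j is divided by bw_row[0, j] instead
print(D / bw_row)
# [[1.   0.5  0.25]
#  [1.   0.5  0.25]
#  [1.   0.5  0.25]]
```

So dividing the symmetric D-matrix by a `(n, 1)` bandwidth vector scales each row by that observation's bandwidth, which is the behavior being debated above.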
So I was taking a look at this, and I see how this line could make the D-matrix asymmetric. That line of code first sorts the values and then selects only "k" nearest neighbors, obviously only when k is not None, which indicates it should be an adaptive kernel. But, does it matter that it is not symmetric? At that point, it's not a comprehensive D-matrix; it just encapsulates the NN distances and wouldn't be expected to be symmetric. Does that make sense? I imagine if it was a big issue it would cause lots of trouble downstream.
Yeah, it makes sense. I got ahead of myself, since I'm using them in another context where I'd expected them to be symmetric (or assumed them to be). Really shot myself in the foot doing so, now that I realize they're not. In general, if each observation has its own bandwidth, then the "kernel" becomes just a list of vectors of the weights for each observation given those bandwidths. It's my fault for assuming their concatenation is a "kernel" in the basis-function sense, which is symmetric.
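The point above can be demonstrated end to end; this is an illustrative sketch (not the package code) in which per-observation adaptive bandwidths make the stacked weight vectors asymmetric even though the distances themselves are symmetric:

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.uniform(size=(20, 2))

# pairwise Euclidean distances (symmetric by construction)
diff = coords[:, None, :] - coords[None, :, :]
dmat = np.sqrt((diff ** 2).sum(axis=-1))

# adaptive bandwidth: each row's distance to its 5th nearest neighbor
# (column 0 of the sorted row is the self-distance of 0)
bw = np.sort(dmat, axis=1)[:, [5]]      # shape (20, 1)

# Gaussian weights with a row-wise bandwidth
W = np.exp(-0.5 * (dmat / bw) ** 2)

assert np.allclose(dmat, dmat.T)        # distances are symmetric...
assert not np.allclose(W, W.T)          # ...but the adaptive "kernel" is not
```

Row i of `W` is observation i's weight vector under its own bandwidth; W[i, j] and W[j, i] use different bandwidths, so they generally differ.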
So, I don't know what is going on with the ArcGIS kernels, but from Ziqi's plots, it seems either they are indeed using bisquare and not Gaussian, or they have some very sophisticated mechanism to truncate a Gaussian into something very similar to a bisquare. But I don't think it's productive to focus on that any further, because we may never know what they actually have going on under the hood. Plus, I'm not really getting the sense that what we currently have is wrong for any particular reason. Rather, by design, the Gaussian kernel gives a much different curve than the bisquare. This is also the case with the exponential kernel (shown below), which, for me, is reassuring, as I would expect very similar results between Gaussian and exponential, and it heightens my suspicion that the ArcGIS results are not simply a Gaussian kernel.

I did some experimenting by comparing different kernel functions across each type of bandwidth (i.e., fixed vs. adaptive). First, for the same Georgia data example... the top is adaptive kernels and the bottom is fixed kernels. The solid vertical lines represent the optimal bandwidth for each kernel function, while the dashed vertical lines (only red and yellow, for Gaussian and exponential) represent the "truncated" bandwidth value. However, it's not a real truncation, but rather the distance/NN at which the data is essentially weighted at zero (approx. < 0.05) for the given optimal bandwidth; it was obtained by plugging in the optimal bandwidth and the desired zero-weight approximation value and then solving for the associated distance/NN. In reality, I just did some plugging and chugging to approximate the value, but I assume it could be solved for analytically.

Anyway, I don't think there is anything weird going on with fixed vs. adaptive or bisquare vs. Gaussian. The behavior of each kernel function is consistent for either fixed or adaptive kernels.
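The shape difference being described can be seen numerically. A small sketch using the standard GWR kernel formulas (variable names are mine; check against the package's own kernel definitions before relying on this):

```python
import numpy as np

b = 1.0                            # common bandwidth for comparison
d = np.linspace(0, 2, 201)         # distances out to twice the bandwidth

gaussian = np.exp(-0.5 * (d / b) ** 2)
exponential = np.exp(-d / b)
bisquare = np.where(d <= b, (1 - (d / b) ** 2) ** 2, 0.0)

# The bisquare truncates exactly at d = b; the Gaussian and exponential
# never actually reach zero, they only decay toward it.
print(bisquare[d > b].max())       # 0.0
print(gaussian[-1])                # ~0.135 at d = 2b
print(exponential[-1])             # ~0.135 at d = 2b
```

This is why, at the same nominal bandwidth, the bisquare excludes data outright while the Gaussian/exponential merely down-weight it, and why their optimal bandwidths are not directly comparable.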
I just think the nature of the Gaussian/exponential kernels is to weight the data much less intensely and without a real truncation, and so they produce a much smaller bandwidth value compared to a bisquare function. Solving for an approximate zero-weight truncation point yields a distance/NN that is closer to that from the bisquare function but is ultimately not equivalent.

I also did the same experiment for data simulated with a single covariate/simulated coefficient surface using some of the data from the comparison paper. I thought it might be useful to see if all of the kernels produce the expected order of bandwidths for B1 (regional) and B2 (local), where the expectation is Bandwidth_B1 > Bandwidth_B2. It's hard to tell in the results below, but for all of the kernels it always holds that B1 > B2, so long as there aren't rules imposed on the minimum search value. These curve shapes are consistent with the Georgia dataset, so overall I don't think there is anything inherently wrong with any of the kernels, and I imagine all of the kernels here produce similar coefficients and could potentially be useful for producing "relative" scales, in that the bandwidths from MGWR could be compared to each other.

However, interpretatively, it seems like the bisquare is at an advantage, since the bandwidth quantifies both the intensity of decay and exactly the extent of the non-zero data included in the model, which is more intuitive and may be useful for comparing across study areas or different models. Perhaps we could still achieve something similar for the fixed Gaussian/exponential kernels by standardizing the distance matrix. I'm not sure how one would standardize a nearest-neighbor metric, so it might be useful to exclude an adaptive Gaussian/exponential specification. I can confirm that, as the code stands, the fixed bisquare (and only the fixed bisquare) will truncate the data like the adaptive bisquare.
At first glance, it seems like the derived fixed distance is indicative of the distance associated with the Nth nearest neighbor... well, the average of the Nth-nearest-neighbor distances across the rows of the weights... so I think another advantage of the bisquare is that the fixed and adaptive bandwidths may be more closely related. Perhaps the same holds for Gaussian/exponential if the weights are manually truncated. Gotta think about it a bit more...
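The "plug and chug" mentioned above can indeed be done analytically: setting the Gaussian weight exp(-0.5 (d/b)^2) equal to a target w and solving gives d = b * sqrt(-2 ln w), and for the exponential exp(-d/b) = w gives d = -b ln w. A sketch (function names are mine, not the package's):

```python
import numpy as np

def gaussian_cutoff(b, w=0.05):
    """Distance at which a Gaussian kernel with bandwidth b decays to weight w."""
    return b * np.sqrt(-2.0 * np.log(w))

def exponential_cutoff(b, w=0.05):
    """Distance at which an exponential kernel with bandwidth b decays to weight w."""
    return -b * np.log(w)

b = 10.0
d_g = gaussian_cutoff(b)     # ~24.48: Gaussian weight falls to 0.05 here
d_e = exponential_cutoff(b)  # ~29.96

# verify by plugging back into the kernel functions
assert np.isclose(np.exp(-0.5 * (d_g / b) ** 2), 0.05)
assert np.isclose(np.exp(-d_e / b), 0.05)
```

So the approximate zero-weight truncation point is a fixed multiple of the optimal bandwidth (about 2.45b for the Gaussian and 3b for the exponential at w = 0.05), which matches the observation that it lands closer to, but not at, the bisquare bandwidth.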
@ASFotheringham
In general, shouldn't these work for any kernel function? Like... Cleveland's original exploration in the univariate space used tricube functions, something we don't even consider. I think any arbitrary kernel should (in theory) be useful, even though we have our preferences. I do bet @ASFotheringham is right on the scoring, and we could get different (maybe better) results using a more explicit tradeoff than the AICc, and looking into the bias and local variance would be a good way to get into this. We still get a site-specific MSE in both GWR and MGWR, so I'd use that as a starting point for that variance estimate.
@Ziqi-Li more relevant results, I think: I find that the adaptive Gaussian is sometimes hyperlocal when the fixed version is definitely not. I also find that, sometimes, the kernel and the score can be really influential, but the fixed Gaussian and fixed bisquare behave really well, and the adaptive bisquare, too. Also, BIC is local for the adaptive bisquare and global for the fixed bisquare. So, not sure what to make of that.
Looking at @TaylorOshan's results, I think there's a substantive difference in how we're storing the function. I'm inclined to prefer how I do it, but I don't understand why they'd be different. I prefer how I'm doing it in that notebook because I'm storing the actual callable function. Then, I'm plotting that function over a range of values. You can see that, in the notebook, identified solutions either clearly mark the minima of those functions or reflect the "maximum" allowable bandwidth; I don't see how this is similarly shown in the other method. So, I'm not sure how the objective function that we're actually optimizing and the final model scores could be so different. I'm fitting exactly the same model as Taylor, so far as I can tell.
I need to take a closer look at your notebook, @ljwolf, but I noticed we do have different models fit. Mine was following Ziqi's rather than what is used in the tests. So it's pct_Bach ~ pctFB + pctPov + pctBlack rather than pct_Bach ~ pctRural + pctPov + pctBlack. I'd be surprised if that was causing such a large difference, though... I need to take a closer look to see if I can spot why caching the opt function should be any different than just running different GWRs.
OK, I can update that and rerun the notebook in like 5 minutes. I'm less concerned about the model and more concerned that the AIC you're pulling off using:

```python
results = []
for bandwidth_i in range(min_bw, max_bw):
    results.append(GWR(coords, X, Y, bandwidth_i).fit().aicc)
```

is not the same surface as what happens if you just evaluate the objective function actually being optimized in our numerical routines.
Changing the model does change a few of the score functions being optimized, but they still don't really look like what's posted above. Attached is a notebook with the updated model. A few things change, but... it still looks different from peeling the AIC off after manually setting the bw.
OK, did a quick test, because it looked like we were using different supports. I originally had evaluated the adaptive bisquare/Gaussian on the interval [10, 160]; if I instead evaluate them on the interval you used of [20, 180], then I get a result that looks to be right on track with yours. I think more than anything it's just that I plotted the two functions on the same figure, and the AICc values for the adaptive bisquare are relatively large over the range [10, 20] that I included, so much of the fine-grain trend is obfuscated. It could also be that, in some cases where different bandwidths were found, allowing golden-section to search in the area < 20 NN results in a different answer than starting at 20.
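For reference, the golden-section routine being discussed works roughly like this (a generic sketch, not the package's implementation):

```python
import math

def golden_section(f, a, b, tol=1e-4):
    """Minimal golden-section search. Only guaranteed to find the minimum when
    f is unimodal on [a, b]; on a bumpy AICc surface, changing the interval
    endpoints (e.g. [10, 160] vs. [20, 180]) can change which local minimum
    the search converges to."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while abs(b - a) > tol:
        if fc < fd:                       # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                             # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2

# sanity check on a unimodal objective with its minimum at 50
print(round(golden_section(lambda x: (x - 50.0) ** 2, 20, 180), 2))  # 50.0
```

In the GWR setting, `f` would be the bandwidth-to-AICc objective, which is why a lower search bound of 10 vs. 20 NN can produce different answers when the surface is not unimodal below 20.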
Ahh... OK, just a scale issue then, good: they're the same, just appearing different. Great. Then, yes, I think we should change the maximum value over in the other PR, since half the maximum interpoint distance seems to me to be too small a search zone. I think it should be plausible that, even if a bandwidth hits that "full map" value, it's still distance-weighting all observations, so it'll be a "local" model around the point.
Was just thinking about this... Stewart has asked me a few times if we should totally remove the option to do an adaptive Gaussian/exponential, since they do not truncate to 0 at the Nth nearest neighbor like the bisquare, but I think these might be useful for Poisson and binomial regression, where it can sometimes be an issue to make sure every sub-sample is not saturated with zeros and then becomes ill-conditioned. Just a thought I wanted to get down before it escapes me!
We could always adopt a
In association with Ziqi's original issue and moved here to protect any potential intellectual merit.