ML Lag indexing error on optimization result #587

Closed
ratishm1 opened this issue Mar 8, 2015 · 20 comments

@ratishm1

ratishm1 commented Mar 8, 2015

When running ml_lag, I get the following error: IndexError: invalid index to scalar variable. It comes from this line of code: self.rho = res.x[0][0]. I went into the ml_lag.py file in spreg and printed out res, which is the OptimizeResult object from scipy. The output was:

status: 0
    nfev: 36
 success: True
     fun: array([[ nan]])
       x: -0.23606797749978981
 message: 'Solution found.'

x should be a solution array, but here it is just a scalar. I changed rho = res.x[0][0] to rho = res.x; however, that caused problems elsewhere.
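For illustration, a defensive extraction that tolerates both shapes could look like this sketch (extract_rho is a hypothetical helper, not part of spreg):

```python
import numpy as np

def extract_rho(x):
    # Hypothetical helper: scipy optimizers can return x as a bare scalar
    # (e.g. minimize_scalar) or as an array (e.g. minimize).
    # np.atleast_2d normalizes every case before indexing.
    return float(np.atleast_2d(x)[0, 0])
```

This sidesteps the IndexError regardless of which optimizer produced the result.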

@sjsrey
Member

sjsrey commented Mar 10, 2015

Would it be possible for you to share the data and specification you are running? That would help us to debug this.

@sjsrey sjsrey added this to the Wishlist milestone Jul 9, 2015
@jaketangosierra

I'm seeing this as well, though I'm using GeoPandas (rather than the native pysal shapefile reading). I do some cleanup and compute a few fields that aren't in the shapefile before passing the data to ML_Lag, but I don't see why that would matter; the input formats are right. Specifically, I'm running:

import numpy as np
import pandas as pd
import pysal
from geopandas import GeoDataFrame
# stringify and real_data are defined earlier in my notebook

chicago = GeoDataFrame.from_file('./other_chicago.shp')
chicago['FIPS_TEXT'] = chicago['FIPS_TEXT'].map(stringify)
chicagoEQAreaConic = chicago.to_crs(epsg='2790')
chicagoEQAreaConic = pd.merge(chicagoEQAreaConic, real_data, how='left', on='FIPS_TEXT')
chicago_weights = pysal.queen_from_shapefile("./other_chicago.shp")
chicagoEQAreaConic['hinc_10k'] = chicagoEQAreaConic['ACS_MED_HI'].map(lambda x: (x/10000))
iv = []
iv.append(chicagoEQAreaConic['WAIT_S'].astype(float).values)
iv = np.array(iv).T
dvs = []
dvs.append(chicagoEQAreaConic['hinc_10k'].astype(float).values)
dvs.append(chicagoEQAreaConic['pop2012_dens'].astype(float).values)
dvs = np.array(dvs).T
chicago_weights.transform = 'r'
pysal.spreg.ml_lag.ML_Lag(iv, dvs, chicago_weights)

Unfortunately, I'm not really in a position to share my shapefile, as our results haven't been published yet.

@ljwolf
Member

ljwolf commented Feb 5, 2016

Hmm... I wondered if this had to do with some weirdness with input array shaping, so I tried what I could with your code and could not replicate.

import numpy as np
import pysal as ps; assert ps.version == '1.11.0'
df = ps.pdio.read_files(ps.examples.get_path('columbus.shp'))
w = ps.queen_from_shapefile(ps.examples.get_path('columbus.shp'))
iv = []
iv.append(df['HOVAL'].astype(float).values)
iv = np.array(iv).T
dvs = []
dvs.append(df['INC'].astype(float).values)
dvs.append(df['CRIME'].astype(float).values)
dvs = np.array(dvs).T
reg = ps.spreg.ml_lag.ML_Lag(iv, dvs, w)

If you're able, can you post the exact traceback?

@jaketangosierra

I've anonymized the paths a bit, but otherwise this is the same. I had removed the method='FULL' argument from the original snippet because the error happens either way, so ignore that piece in the top-level call.

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-18-1bd724ff8213> in <module>()
     14 chicago_weights.transform = 'r'
     15 
---> 16 pysal.spreg.ml_lag.ML_Lag(iv, dvs, chicago_weights, method='FULL')

/Users/username/Work/Development/venvs/py3/jupyter/lib/python3.4/site-packages/pysal/spreg/ml_lag.py in __init__(self, y, x, w, method, epsilon, spat_diag, vm, name_y, name_x, name_w, name_ds)
    542         method = method.upper()
    543         BaseML_Lag.__init__(
--> 544             self, y=y, x=x_constant, w=w, method=method, epsilon=epsilon)
    545         # increase by 1 to have correct aic and sc, include rho in count
    546         self.k += 1

/Users/username/Work/Development/venvs/py3/jupyter/lib/python3.4/site-packages/pysal/spreg/ml_lag.py in __init__(self, y, x, w, method, epsilon)
    225             return
    226 
--> 227         self.rho = res.x[0][0]
    228 
    229         # compute full log-likelihood, including constants

IndexError: invalid index to scalar variable.

@ljwolf
Member

ljwolf commented Feb 5, 2016

Cool, thanks, that helps!

@jaketangosierra

Great! I'd love a bug-fix release, if this actually lets you solve the problem!

@ljwolf
Member

ljwolf commented Feb 5, 2016

I notice in your traceback, you have method="FULL".

If you have relatively large data, this results in computing np.linalg.det for a dense (n,n) matrix. In my experience, you lose machine accuracy on that dense determinant at around 1.5k observations.

Do you get this error if you use method = "ORD" or method = "LU"?
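The accuracy concern about dense determinants can be seen with plain numpy: the raw determinant of even a well-behaved matrix overflows float64 long before the log-determinant does (a toy illustration, not spreg's actual code):

```python
import numpy as np

# A 1200x1200 diagonal matrix of 2s: its determinant is 2**1200,
# which overflows float64, while the log-determinant stays finite.
A = 2.0 * np.eye(1200)
plain = np.linalg.det(A)             # overflows to inf
sign, logdet = np.linalg.slogdet(A)  # stable: sign=1.0, logdet=1200*log(2)
```

This is why methods that work with the log-determinant directly stay usable at sizes where a dense det has already lost all meaning.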

@jaketangosierra

Yes, in all three cases.

My dataset isn't huge, I have ~1,000 census tracts, so it may be within the range of what you're talking about. That said, given that all three methods produce the same error, my guess is that's not the issue?

@ljwolf
Member

ljwolf commented Feb 5, 2016

Hmm... Okay. Bummer. I'll keep pushing on this, but I need to be able to replicate the error to solve the problem.

@jaketangosierra

I'm not sure what's going on. I think it's related to some joining I was doing between dataframes, but I started over with a different way of accessing the data, and now I cannot replicate it either.

From my perspective, I solved my problem, but I'm not clear on what is causing this.

@ljwolf
Member

ljwolf commented Feb 8, 2016

Good, I'm glad you fixed it!

My search is based on @ratishm1's optimize result. If the objective value fun is array([[ nan]]), then something has gone awry in the likelihood minimization. Since you get this error regardless of which likelihood method you use, I'm looking at all of the lag_c_loglik* functions starting around l559 in ml_lag.py.

The only way I see for those all to nan out is if the determinant fails, or n is somehow not what we expect it to be. Without a dataset to reliably reproduce this, we can't really figure out what needs to get wrapped.
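The nan-out behavior is easy to demonstrate in isolation: a single nan anywhere in the inputs poisons any sum-of-squares style term of the kind a concentrated log-likelihood is built from (a toy illustration of IEEE nan arithmetic, not spreg's actual likelihood code):

```python
import numpy as np

# One nan observation in y is enough to make the whole
# residual-sum-of-squares term nan, and nan propagates through
# every subsequent arithmetic operation in the objective.
y = np.array([[1.0], [np.nan], [3.0]])
ssr = float((y ** 2).sum())  # nan
```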

@sjsrey
Member

sjsrey commented Feb 26, 2016

Can't reproduce.

@sjsrey sjsrey closed this as completed Feb 26, 2016
@jaketangosierra

Hi all, I ran into this again with a different dataset. I wanted to provide this as a reproduction of the bug. See reproducing_bug_587.zip

I've included the requisite datafiles and a jupyter notebook (which is expecting Python3) that shows the error above.

@ljwolf
Member

ljwolf commented Apr 15, 2016

Awesome, now we're cooking; I can replicate using the notebook provided. Will hopefully identify & patch if necessary.

@ljwolf ljwolf reopened this Apr 15, 2016
@ljwolf ljwolf changed the title ML Lag ML Lag indexing error on optimization result Apr 15, 2016
@ljwolf
Member

ljwolf commented Apr 15, 2016

For a fiona-less version of minimal replication from the data provided:

import pysal as ps

weights = ps.queen_from_shapefile('shapefile_for_weights.shp', idVariable='gid')
yellow = ps.pdio.read_files('yellow.shp')

y = yellow[['total_spen']].values
X = yellow[['num_rides']].values

ps.spreg.ML_Lag(y, X, weights)

@ljwolf
Member

ljwolf commented Apr 15, 2016

Ok, I think this is the cause:

82]> np.isnan(X).sum() > 0
True
83]> np.isnan(y).sum() > 0
True
84]> yellow[np.isnan(X)]
#omitted, but will show obs. w/ nan fields

Currently, there is no masking of nan values, so any nan will cause the optimization procedure to return nan. This nan array is of the wrong shape, so the element extraction fails.
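A pre-estimation guard, if one were added, could be as simple as the following sketch (check_finite is a hypothetical function, not anything spreg currently provides):

```python
import numpy as np

def check_finite(y, X):
    # Hypothetical guard: fail fast with a clear message instead of
    # letting nan/inf propagate silently into the optimizer.
    if not np.isfinite(y).all():
        raise ValueError("y contains nan or inf values")
    if not np.isfinite(X).all():
        raise ValueError("X contains nan or inf values")
```

Called at the top of the estimator's __init__, this would turn the cryptic IndexError into an actionable message about the data.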

@ljwolf ljwolf closed this as completed Apr 15, 2016
@jaketangosierra

Sorry, to be clear: you're saying this is known/intended behavior, because there is no masking of nan values?

I upgraded to 1.11.1 and I'm still getting the error, but if you're saying this is expected behavior because the data has nans, then that makes sense (and is frustrating, but that's a different issue).

@ljwolf
Member

ljwolf commented Apr 18, 2016

I believe this is known/intended behavior, since we don't do any masking anywhere.

This also happens in other econometrics packages, like statsmodels:

1]> import statsmodels.api as sm
2]> import pysal as ps
3]> import numpy as np
4]> data = ps.pdio.read_files(ps.examples.get_path('columbus.shp'))

5]> y = data[['HOVAL']].values
6]> X = data[['CRIME', 'INC']].values

7]> y[5] = X[5] = np.nan

8]> sm.OLS(y,X).fit().summary()
[...]
[...]

will yield a summary output entirely of nan.

In light of our other pre-estimation checks, it might make sense to add a check for nan?

@darribas
Member

@jtsmn Yes, spreg assumes throughout the codebase that you do not have missing values. This is for several reasons, but essentially we expect the user to clean missing values and holes in the data before coming to spreg.

Also note that the W object you pass needs to be aligned with the data, so if you had any nan and removed that row, you'd have to also remove it from W. The w_subset operation and similar ones in Wsets should help you with that.
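The cleanup described above might look like the following sketch (complete_cases is a hypothetical helper; the final w_subset call, shown as a comment because it needs a live W object, is the pysal piece and assumes w is indexed by the same ids):

```python
import numpy as np

def complete_cases(y, X, ids):
    # Hypothetical helper: keep only rows where both y and X are finite,
    # and return the surviving ids so the W object can be subset to match.
    keep = np.isfinite(y).all(axis=1) & np.isfinite(X).all(axis=1)
    kept_ids = [i for i, k in zip(ids, keep) if k]
    return y[keep], X[keep], kept_ids

# afterwards (not run here), align the weights with the cleaned data:
#   w_clean = pysal.w_subset(w, kept_ids)
```

Dropping rows from y and X without the matching w_subset step would misalign the weights and produce garbage estimates, which is why both must happen together.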

@ljwolf +1 to add a check for nan's at the user level.

@jaketangosierra

This assumption is not made clear anywhere I've found (either in the overview documentation or in the API documentation itself). Thank you for clarifying, though.

I agree that adding pre-estimation checks and documenting this requirement would be good.
