ML Lag indexing error on optimization result #587

Closed
ratishm1 opened this issue Mar 8, 2015 · 20 comments

@ratishm1

ratishm1 commented Mar 8, 2015

When running ml_lag, I get the following error: IndexError: invalid index to scalar variable. It comes from this line of code: self.rho = res.x[0][0]. I went into the ml_lag.py file in spreg and printed out res, which is the OptimizeResult object from scipy. The output was:

status: 0
    nfev: 36
 success: True
     fun: array([[ nan]])
       x: -0.23606797749978981
 message: 'Solution found.'

x should be a solution array, but here it is just a scalar. I changed rho = res.x[0][0] to rho = res.x; however, that caused problems elsewhere.
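For illustration, a defensive extraction that tolerates both shapes could look like this sketch (extract_rho is a hypothetical helper, not part of spreg):

```python
import numpy as np

def extract_rho(x):
    # Hypothetical helper: scipy optimizers can return x as a bare scalar
    # (e.g. minimize_scalar) or as an array (e.g. minimize).
    # np.atleast_2d normalizes every case before indexing.
    return float(np.atleast_2d(x)[0, 0])
```

This sidesteps the IndexError regardless of which optimizer produced the result.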

@sjsrey
Member

sjsrey commented Mar 10, 2015

Would it be possible for you to share the data and specification you are running? That would help us to debug this.

@sjsrey sjsrey added this to the Wishlist milestone Jul 9, 2015
@jaketangosierra

I'm seeing this as well, though I'm using GeoPandas (rather than the native pysal shapefile reading). I do some cleanup and compute a few fields that aren't in the shapefile before passing the data to ML_Lag, but I don't see why that would matter; the input formats are right. Specifically, I'm running:

import numpy as np
import pandas as pd
import pysal
from geopandas import GeoDataFrame
# stringify and real_data are defined earlier in my notebook

chicago = GeoDataFrame.from_file('./other_chicago.shp')
chicago['FIPS_TEXT'] = chicago['FIPS_TEXT'].map(stringify)
chicagoEQAreaConic = chicago.to_crs(epsg='2790')
chicagoEQAreaConic = pd.merge(chicagoEQAreaConic, real_data, how='left', on='FIPS_TEXT')
chicago_weights = pysal.queen_from_shapefile("./other_chicago.shp")
chicagoEQAreaConic['hinc_10k'] = chicagoEQAreaConic['ACS_MED_HI'].map(lambda x: (x/10000))
iv = []
iv.append(chicagoEQAreaConic['WAIT_S'].astype(float).values)
iv = np.array(iv).T
dvs = []
dvs.append(chicagoEQAreaConic['hinc_10k'].astype(float).values)
dvs.append(chicagoEQAreaConic['pop2012_dens'].astype(float).values)
dvs = np.array(dvs).T
chicago_weights.transform = 'r'
pysal.spreg.ml_lag.ML_Lag(iv, dvs, chicago_weights)

Unfortunately, I'm not really in a position to share my shapefile, as our results haven't been published yet.

@ljwolf
Member

ljwolf commented Feb 5, 2016

Hmm... I wondered if this had to do with some weirdness with input array shaping, so I tried what I could with your code and could not replicate.

import numpy as np
import pysal as ps; assert ps.version == '1.11.0'
df = ps.pdio.read_files(ps.examples.get_path('columbus.shp'))
w = ps.queen_from_shapefile(ps.examples.get_path('columbus.shp'))
iv = []
iv.append(df['HOVAL'].astype(float).values)
iv = np.array(iv).T
dvs = []
dvs.append(df['INC'].astype(float).values)
dvs.append(df['CRIME'].astype(float).values)
dvs = np.array(dvs).T
reg = ps.spreg.ml_lag.ML_Lag(iv, dvs, w)

If you're able, can you post the exact traceback?

@jaketangosierra

I've anonymized the paths a bit, but otherwise this is the same. I had removed the method='FULL' argument from the original snippet because the error happens either way, so ignore that piece in the top-level call.

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-18-1bd724ff8213> in <module>()
     14 chicago_weights.transform = 'r'
     15 
---> 16 pysal.spreg.ml_lag.ML_Lag(iv, dvs, chicago_weights, method='FULL')

/Users/username/Work/Development/venvs/py3/jupyter/lib/python3.4/site-packages/pysal/spreg/ml_lag.py in __init__(self, y, x, w, method, epsilon, spat_diag, vm, name_y, name_x, name_w, name_ds)
    542         method = method.upper()
    543         BaseML_Lag.__init__(
--> 544             self, y=y, x=x_constant, w=w, method=method, epsilon=epsilon)
    545         # increase by 1 to have correct aic and sc, include rho in count
    546         self.k += 1

/Users/username/Work/Development/venvs/py3/jupyter/lib/python3.4/site-packages/pysal/spreg/ml_lag.py in __init__(self, y, x, w, method, epsilon)
    225             return
    226 
--> 227         self.rho = res.x[0][0]
    228 
    229         # compute full log-likelihood, including constants

IndexError: invalid index to scalar variable.

@ljwolf
Member

ljwolf commented Feb 5, 2016

Cool, thanks, that helps!

@jaketangosierra

Great! I'd love a bug-fix release, if this actually lets you solve the problem!

@ljwolf
Member

ljwolf commented Feb 5, 2016

I notice in your traceback, you have method="FULL".

If you have relatively large data, this results in computing np.linalg.det for a dense (n,n) matrix. In my experience, you lose machine accuracy on that dense determinant at around 1.5k observations.

Do you get this error if you use method = "ORD" or method = "LU"?
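The accuracy concern about dense determinants can be seen with plain numpy: the raw determinant of even a well-behaved matrix overflows float64 long before the log-determinant does (a toy illustration, not spreg's actual code):

```python
import numpy as np

# A 1200x1200 diagonal matrix of 2s: its determinant is 2**1200,
# which overflows float64, while the log-determinant stays finite.
A = 2.0 * np.eye(1200)
plain = np.linalg.det(A)             # overflows to inf
sign, logdet = np.linalg.slogdet(A)  # stable: sign=1.0, logdet=1200*log(2)
```

This is why methods that work with the log-determinant directly stay usable at sizes where a dense det has already lost all meaning.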

@jaketangosierra

Yes, in all three cases.

My dataset isn't huge, I have ~1,000 census tracts, so it may be within the range of what you're talking about. That said, given that all three methods produce the same error, my guess is that's not the issue?

@ljwolf
Member

ljwolf commented Feb 5, 2016

Hmm... Okay. Bummer. I'll keep pushing on this, but I need to be able to replicate the error to solve the problem.

@jaketangosierra

I'm not sure what's going on. I think it's related to some joining I was doing between dataframes, but I started over with a different way of accessing the data, and now I cannot replicate it either.

From my perspective, I solved my problem, but I'm not clear on what is causing this.

@ljwolf
Member

ljwolf commented Feb 8, 2016

Good, I'm glad you fixed it!

My search is based on @ratishm1's optimize result. If the objective value fun is array([[ nan]]), then something has gone awry in the likelihood minimization. Since you get this error regardless of which likelihood method you use, I'm looking at all of the lag_c_loglik* functions starting around l559 in ml_lag.py.

The only way I see for those all to nan out is if the determinant fails, or n is somehow not what we expect it to be. Without a dataset to reliably reproduce this, we can't really figure out what needs to get wrapped.
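The nan-out behavior is easy to demonstrate in isolation: a single nan anywhere in the inputs poisons any sum-of-squares style term of the kind a concentrated log-likelihood is built from (a toy illustration of IEEE nan arithmetic, not spreg's actual likelihood code):

```python
import numpy as np

# One nan observation in y is enough to make the whole
# residual-sum-of-squares term nan, and nan propagates through
# every subsequent arithmetic operation in the objective.
y = np.array([[1.0], [np.nan], [3.0]])
ssr = float((y ** 2).sum())  # nan
```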

@sjsrey
Member

sjsrey commented Feb 26, 2016

Can't reproduce.

@sjsrey sjsrey closed this as completed Feb 26, 2016
@jaketangosierra

Hi all, I ran into this again with a different dataset. I wanted to provide this as a reproduction of the bug. See reproducing_bug_587.zip

I've included the requisite datafiles and a jupyter notebook (which is expecting Python3) that shows the error above.

@ljwolf
Member

ljwolf commented Apr 15, 2016

Awesome, now we're cooking; I can replicate using the notebook provided. Will hopefully identify & patch if necessary.

@ljwolf ljwolf reopened this Apr 15, 2016
@ljwolf ljwolf changed the title ML Lag ML Lag indexing error on optimization result Apr 15, 2016
@ljwolf
Member

ljwolf commented Apr 15, 2016

For a fiona-less version of minimal replication from the data provided:

import pysal as ps

weights = ps.queen_from_shapefile('shapefile_for_weights.shp', idVariable='gid')
yellow = ps.pdio.read_files('yellow.shp')

y = yellow[['total_spen']].values
X = yellow[['num_rides']].values

ps.spreg.ML_Lag(y, X, weights)

@ljwolf
Member

ljwolf commented Apr 15, 2016

Ok, I think this is the cause:

82]> np.isnan(X).sum() > 0
True
83]> np.isnan(y).sum() > 0
True
84]> yellow[np.isnan(X)]
#omitted, but will show obs. w/ nan fields

Currently, there is no masking of nan values, so any nan will cause the optimization procedure to return nan. This nan array is of the wrong shape, so the element extraction fails.
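A pre-estimation guard, if one were added, could be as simple as the following sketch (check_finite is a hypothetical function, not anything spreg currently provides):

```python
import numpy as np

def check_finite(y, X):
    # Hypothetical guard: fail fast with a clear message instead of
    # letting nan/inf propagate silently into the optimizer.
    if not np.isfinite(y).all():
        raise ValueError("y contains nan or inf values")
    if not np.isfinite(X).all():
        raise ValueError("X contains nan or inf values")
```

Called at the top of the estimator's __init__, this would turn the cryptic IndexError into an actionable message about the data.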

@ljwolf ljwolf closed this as completed Apr 15, 2016
@jaketangosierra

Sorry, to be clear: you're saying this is known/intended behavior, because there is no masking of nan values?

I upgraded to 1.11.1 and I'm still getting the error, but if you're saying this is expected behavior because the data has nans, then that makes sense (and is frustrating, but that's a different issue).

@ljwolf
Member

ljwolf commented Apr 18, 2016

I believe this is known/intended behavior, since we don't do any masking anywhere.

This also happens in other econometrics packages, like statsmodels:

1]> import statsmodels.api as sm
2]> import pysal as ps
3]> import numpy as np
4]> data = ps.pdio.read_files(ps.examples.get_path('columbus.shp'))

5]> y = data[['HOVAL']].values
6]> X = data[['CRIME', 'INC']].values

7]> y[5] = X[5] = np.nan

8]> sm.OLS(y,X).fit().summary()
[...]
[...]

will yield a summary output entirely of nan.

In light of our other pre-estimation checks, it might make sense to add a check for nan?

@darribas
Member

@jtsmn Yes, spreg assumes throughout the codebase that you do not have missing values. This is for several reasons, but essentially we expect the user to clean missing values and holes in the data before coming to spreg.

Also note that the W object you pass needs to be aligned with the data, so if you had any nan and removed that row, you'd have to also remove it from W. The w_subset operation and similar ones in Wsets should help you with that.
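The cleanup described above might look like the following sketch (complete_cases is a hypothetical helper; the final w_subset call, shown as a comment because it needs a live W object, is the pysal piece and assumes w is indexed by the same ids):

```python
import numpy as np

def complete_cases(y, X, ids):
    # Hypothetical helper: keep only rows where both y and X are finite,
    # and return the surviving ids so the W object can be subset to match.
    keep = np.isfinite(y).all(axis=1) & np.isfinite(X).all(axis=1)
    kept_ids = [i for i, k in zip(ids, keep) if k]
    return y[keep], X[keep], kept_ids

# afterwards (not run here), align the weights with the cleaned data:
#   w_clean = pysal.w_subset(w, kept_ids)
```

Dropping rows from y and X without the matching w_subset step would misalign the weights and produce garbage estimates, which is why both must happen together.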

@ljwolf +1 to add a check for nan's at the user level.

@jaketangosierra

This assumption is not made clear anywhere I've found (either in the overview documentation or in the API documentation itself). Thank you for clarifying, though.

I agree that adding pre-estimation checks and documenting this requirement would be good.
