Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Change the handling of IDs in from_dataframe constructors #477

Merged
merged 15 commits into from
Jan 10, 2023

Conversation

martinfleis
Copy link
Member

@martinfleis martinfleis commented Sep 3, 2022

As a follow-up of the discussion we had yesterday during the dev call, I am starting this PR more as a place for a discussion on top of a code rather than an actual code change.

I have created a from_dataframe_new method to match the ideal outcome of the transition from ids, idVariable and id_order to a cleaner API. What needs to be figured out is a transitioning phase.

Looking at the code, I also think that @ljwolf's suggestion to deprecate id_order should be done at the same time as the rest of the changes as these are quite closely linked together.

There is one question we haven't discussed though. What shall we do with the actual dataframe index? We currently completely ignore it and when no ids are passed use positional indexing to encode weights. But the most sensible default would be to use the dataframe index instead imho. However, this is quite a big breaking change, so I am not sure what would be the best on that front. edit: the docstring now says If nothing is provided, the dataframe index is used but the reality does not follow that.

The API change proposed here on the example of Rook should be then mirrored to all relevant from_dataframe methods. In some cases (KNN, DistanceBand), it involves nearly no change as ids behave as we want it and there is no idVariable. So the main goal here is to ensure the API is consistent.

Relevant discussions: #473, #183, #184 (which does essentially the same thing as this PR), #102

geom_col = df.geometry.name

if isinstance(ids, str):
ids = df[ids]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not using get here as if you pass a string that does not match a column name we should raise an informative warning. The current code above gives AttributeError: 'NoneType' object has no attribute 'tolist', this one gives KeyError: 'missing', which is sensible enough.

@codecov
Copy link

codecov bot commented Sep 3, 2022

Codecov Report

Merging #477 (c796673) into master (d291e01) will increase coverage by 0.0%.
The diff coverage is 69.9%.

Impacted file tree graph

@@          Coverage Diff           @@
##           master    #477   +/-   ##
======================================
  Coverage    78.8%   78.8%           
======================================
  Files         122     122           
  Lines       13100   13161   +61     
======================================
+ Hits        10320   10371   +51     
- Misses       2780    2790   +10     
Impacted Files Coverage Δ
libpysal/weights/contiguity.py 77.2% <64.4%> (-1.9%) ⬇️
libpysal/weights/gabriel.py 47.1% <65.2%> (+3.3%) ⬆️
libpysal/weights/distance.py 85.4% <77.8%> (ø)
libpysal/weights/tests/test_contiguity.py 95.1% <100.0%> (+0.3%) ⬆️
libpysal/_version.py 40.7% <0.0%> (+2.7%) ⬆️

@sjsrey
Copy link
Member

sjsrey commented Sep 8, 2022

I have created a from_dataframe_new method to match the ideal outcome of the transition from ids, idVariable and id_order to a cleaner API. What needs to be figured out is a transitioning phase.

So the plan would be to deprecate idVariable and id_order, while refactoring ids for a new api, correct?

If yes, I can start exploring deprecation.

@sjsrey
Copy link
Member

sjsrey commented Sep 9, 2022

On a related note, turning off the sorting of the ids on init can be done. I explored this 13c89b6 and two tests broke in all of weights. These had to do with block_weights in both cases. I updated the doctests and unittests to correct for the case of turning off sort on init fc0f8b6. So removing the behavior of sorting on init could be done in the no sort branch.

@martinfleis
Copy link
Member Author

So the plan would be to deprecate idVariable and id_order, while refactoring ids for a new api, correct?

Yes. And figure out what to do with positional index vs the dataframe index of when nothing else is passed.

@sjsrey
Copy link
Member

sjsrey commented Sep 10, 2022

So the plan would be to deprecate idVariable and id_order, while refactoring ids for a new api, correct?

Yes. And figure out what to do with positional index vs the dataframe index of when nothing else is passed.

Since 3.6, dicts order the keys based on insertion order, so the positional index for our w would be w.neighbors.keys(). Prior to 3.6 we had to do all the id_sort business to support alignment between w and the arrays they would be applied to (in spatial lags).

So one thought is the new w.ids could be a property with a getter that returns w.neighborhs.keys().

Users could then sort the array to align with the w.ids order and then pass the sorted array into lag_spatial or we provide a way to reorder the w.ids and pass that into the lag_spatial function with the original (unsorted) array.

@sjsrey
Copy link
Member

sjsrey commented Sep 19, 2022

The timing for the changes might be something like this:

For meta 23.01 released Jan 2023, deprecation warnings have been added for the old id-related api.

We then target 23.07 (July release) for the new api.

Maintaining two api's seems like a painful route, so the above is suggested as a way to avoid this.

What do devs think?

@martinfleis
Copy link
Member Author

Not sure I fully understand. Esp. this

Maintaining two api's seems like a painful route, so the above is suggested as a way to avoid this.

I agree that we should not have two APIs but we need a transition period. We cannot just raise a warning in January and make the change in June. Everything breaks in June if you don't give a way of adjusting to the new API in January.

@sjsrey
Copy link
Member

sjsrey commented Sep 20, 2022

Not sure I fully understand. Esp. this

Maintaining two api's seems like a painful route, so the above is suggested as a way to avoid this.

I agree that we should not have two APIs but we need a transition period. We cannot just raise a warning in January and make the change in June. Everything breaks in June if you don't give a way of adjusting to the new API in January.

Not sure I fully understand. Esp. this

Maintaining two api's seems like a painful route, so the above is suggested as a way to avoid this.

I agree that we should not have two APIs but we need a transition period. We cannot just raise a warning in January and make the change in June. Everything breaks in June if you don't give a way of adjusting to the new API in January.

What I am trying to understand is how we go about this. Given the nature of the changes we are proposing, I can't see a way (yet) where say; [1] we have deprecation warnings around ids and id_order which would permit legacy code to work, but alert the user to upcoming changes; and [2] also allow for the new functionality of ids in order to point the user to the functionality to adopt.

Its easier if we have a mapping of: old_property -> new_property with different names. For ids we don't have this to leverage as its ids -> ids.

Maybe this suggests a different name for the property/method in the new API?
old is ids new maybe something like iloc and loc?

The other problem is that if we deprecate idVariable and the sort of ids on init, then that functionality has to coexist peacefully with the new API (which we agree should not sort on init).

Fleshing all this out here is really helpful :->

@martinfleis
Copy link
Member Author

We don't have to deprecate anything around ids, we just make that variable smarter to consume either string representing the columns name, or any array-like of the matching length. That is fully backwards compatible and can be done immediately.

At the same time, we can raise a deprecation warning on idVariable but pass its value temporarily to enhanced ids.

The id_order can now raise a warning and since we want to remove it without a replacement, that is it.

One thing I still don't know how to approach is the issue of index vs positional index. We currently claim in the docstring that index is used automatically but it is not. If we just start doing that instead of using the position (iloc), but may break some things (e.g. it would break small bits in momepy). So I would say we should add a new keyword to control that behaviour.

Something like this

    @classmethod
    def from_dataframe(
        cls, df, geom_col=None, idVariable=None, ids=None, id_order=None, use_index=None, **kwargs
    ):
        """
        Construct a weights object from a pandas dataframe with a geometry
        column. This will cast the polygons to PySAL polygons, then build the W
        using ids from the dataframe.

        Parameters
        ----------
        df          : DataFrame
                      a :class: `pandas.DataFrame` containing geometries to use
                      for spatial weights
        geom_col    : string
                      the name of the column in `df` that contains the
                      geometries. Defaults to active geometry column.
        idVariable  : string
                      the name of the column to use as IDs. If nothing is
                      provided, the dataframe index is used DEPRECATED - use ids
        ids         : list
                      a list of ids to use to index the spatial weights object.
                      Order is not respected from this list.
        id_order    : list
                      an ordered list of ids to use to index the spatial weights
                      object. If used, the resulting weights object will iterate
                      over results in the order of the names provided in this
                      argument. DEPRECATED
        use_index   : bool
                      use index of `df` as ids to index the spatial weights object

        See Also
        --------
        :class:`libpysal.weights.weights.W`
        :class:`libpysal.weights.contiguity.Rook`
        """
        if geom_col is None:
            geom_col = df.geometry.name

        if id_order is not None:
            warnings.warn("deprecated, don't use")
            if id_order is True and ((idVariable is not None) or (ids is not None)):
                # if idVariable is None, we want ids. Otherwise, we want the
                # idVariable column
                id_order = list(df.get(idVariable, ids))
            else:
                id_order = df.get(id_order, ids)
        
        if idVariable is not None:
            if ids is None:
                warnings.warn("deprecated, use ids")
                ids = idVariable
            else:
                warnings.warn("both idVariable and ids passed, using ids")

        if ids is None:
            if use_index is None:
                warnings.warn("use_index defaults to False but will default to True in future. Set True/False directly to control this behaviour and silence this warning")
                use_index = False
            if use_index:
                ids = df.index.tolist()

        if isinstance(ids, str):
            ids = df[ids]

        if not isinstance(ids, list):
            ids = ids.tolist()


        return cls.from_iterable(
            df[geom_col].tolist(), ids=ids, id_order=id_order, **kwargs

This already supports all we want in the future but is backwards compatible and provides a deprecation period for transition.

@jGaboardi jGaboardi mentioned this pull request Oct 28, 2022
50 tasks
@martinfleis martinfleis marked this pull request as ready for review October 28, 2022 19:06
@martinfleis
Copy link
Member Author

I have updated from_dataframe in Rook, and Queen, adding the deprecation and ensuring all is controllable (use of index or not) and matching the docstring. I have also updated from_dataframe in KNN, Kernel, and DistanceBand by adding use_index keyword to ensure consistency between all methods.

All changes are backwards compatible. In contiguity, we will want to change the default in use_index to True to match the default in distance weights (that has already been used).

@ljwolf how are Delaunay weights indexed? It would be good to have the same options there if possible.

@ljwolf
Copy link
Member

ljwolf commented Oct 28, 2022

They only take input, and do no id processing iirc, so they create a range index. They're in weights.gabriel

@martinfleis
Copy link
Member Author

Okay, I added the same options to handle index to Delaunay.from_dataframe as well. The default on use_index will need to change there as well, the situation is the same as we have in contiguity right now.

Copy link
Member

@jGaboardi jGaboardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ljwolf
Copy link
Member

ljwolf commented Nov 3, 2022

OK, still some odd behavior around weights ordering here, akin to the issues I outlined in the test bench from #184. I think that part of this fix has to be stopping sorting ids by default.

An example using FIPS codes from the south example data:

import geopandas
from libpysal import weights, examples
examples.load_example("south.shp") # I can't locate south if I don't load it first, even if it's downloaded?
south = geopandas.read_file(examples.get_path("south.shp"))
south_shuffled = south.sample(frac=1, replace=False)
print(weights.Rook.from_dataframe(south, ids="FIPS").id_order[:5])
['01001', '01003', '01005', '01007', '01009']
print(weights.DistanceBand.from_dataframe(south, threshold=2, ids="FIPS").id_order[:5])
['54029', '54009', '54069', '54051', '10003']

Sorting of the ids is happening for contiguity weights, but not distance weights. The passing isn't working at all for the Delaunay-family weights:

print(weights.Delaunay.from_dataframe(south.set_geometry(south.centroid), ids="FIPS").id_order[:5])
[0, 1, 2, 3, 4, 5]
print(weights.Gabriel.from_dataframe(south.set_geometry(south.centroid), ids="FIPS").id_order[:5])
[0, 1, 2, 3, 4, 5]

And this fails entirely, since the columns aren't propagated down to the FIPS

weights.Voronoi(south.set_geometry(south.centroid), ids='FIPS')

because columns aren't correctly propagated through to the Rook.from_dataframe() call.

"The numba package is used extensively in this module"
" to accelerate the computation of graphs. Without numba,"
" these computations may become unduly slow on large data."
)
edges, _ = self._voronoi_edges(coordinates)
voronoi_neighbors = pandas.DataFrame(edges).groupby(0)[1].apply(list).to_dict()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is where we have to intervene to get ids to "work":

  1. Build pandas.DataFrame(edges) and store as adjlist
  2. re-assign the head/tail values for each edge according to the id list (e.g. edges.loc[:,0] = ids[edges.loc[:,0])
  3. then proceed with the group by and dict to get the weights into W input format.

I believe this would have to happen at the same location in Gabriel neighbors (line 181) and I don't yet understand how to do this at 223 (maybe pass WSP(sp, ids=...?)

Copy link
Member

@ljwolf ljwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, too, the Voronoi fix would be to set the index of the voronoi cell dataframe in the Voronoi function...

@martinfleis
Copy link
Member Author

@ljwolf whoops. I'll have a look at Delaunay/Voronoi stuff later unless you want to commit some changes yourself (which may be faster).

some odd behavior around weights ordering here

Can we solve the ordering issue in another PR and have this focused on passing the ids through? Since this hasn't changed from what we have on master.

@ljwolf
Copy link
Member

ljwolf commented Nov 3, 2022

Sure! I'm drafting a fix for these inconsistencies now, will send

@ljwolf
Copy link
Member

ljwolf commented Nov 3, 2022

I've sent a PR upstream (martinfleis/libpysal#35), since I'm not sure if this counts as "fixing" sorting by default.... What I'm referring to is this:

.from_dataframe() methods should always provide weights that are aligned to the input dataframe.

The current state of things is:

  1. contiguity weights' .from_dataframe() will break the ordering
  2. distance weights' .from_dataframe() does not break the ordering
  3. With martinfleis/libpysal#35, Delaunay-based weights will not break the ordering.

I'd encourage us to not break order in from_dataframe(), which means that we consider the input to from_dataframe as having a "set" id_order, rather than sorting it.

@martinfleis
Copy link
Member Author

The order should now be preserved for contiguity as well.

I think I'll add some tests for this before merging.

@martinfleis martinfleis mentioned this pull request Nov 15, 2022
@martinfleis
Copy link
Member Author

Should be ready now.

Copy link
Member

@knaaptime knaaptime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sweet

@sjsrey
Copy link
Member

sjsrey commented Jan 6, 2023

@martinfleis should we merge this?

@martinfleis
Copy link
Member Author

Unless there are any other comments it is ready to merge, yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants