ENH: include Graph.describe() to describe neighbourhood values #717

Merged 10 commits into pysal:main on Jun 5, 2024

Conversation

u3ks
Contributor

@u3ks u3ks commented Jun 4, 2024

This PR adds a method to the Graph API that takes an array of values and calculates descriptive statistics within each neighborhood.
Optionally, some neighbors can be filtered out based on percentiles of the passed values.
The supported statistics are "count", "mean", "median", "std", "min", "max", "sum", "nunique" and "mode".

The method is similar to .apply, but all values are calculated in one grouping operation and all functions are jitted.
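
As a rough illustration of the idea described above (not the code from this PR), the per-neighborhood statistics can be sketched with a single pandas groupby. The toy values, the (focal, neighbor) adjacency table, and all variable names are assumptions made for the example; "mode" and the jitted implementation are left out.

```python
import pandas as pd

# Hypothetical toy data: one value per observation.
y = pd.Series([10.0, 20.0, 30.0, 40.0], index=["a", "b", "c", "d"])

# Hypothetical adjacency as (focal, neighbor) pairs; "d" has no neighbors.
adjacency = pd.DataFrame(
    {
        "focal": ["a", "a", "b", "b", "c", "c"],
        "neighbor": ["b", "c", "a", "c", "a", "b"],
    }
)

# Look up each neighbor's value and index the result by the focal observation.
neighbor_values = pd.Series(
    y.loc[adjacency["neighbor"]].to_numpy(), index=adjacency["focal"]
)

# One grouping operation computes every statistic per neighborhood.
stats = neighbor_values.groupby(level=0).agg(
    ["count", "mean", "median", "std", "min", "max", "sum", "nunique"]
)

# Reindex so observations without neighbors still get a row (all NaN).
stats = stats.reindex(y.index)
print(stats)
```

Each row of `stats` describes the values found among the neighbors of one focal observation, which is the shape of result the PR description suggests.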

@martinfleis
Member

Just to add some context to this. As we are refactoring momepy, we realised that we rely very often on this internal function, which is fairly generic and shall be tied directly to Graph.

The idea of q limiting the range comes from morphology. We often want some sort of spatial average, but given the high likelihood of outliers (think of a church in the middle of a neighborhood), we can't include all the values within each neighborhood.
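
A minimal sketch of the outlier argument above, assuming that q is a pair of percentiles and that values outside that percentile range are dropped before the statistic is computed. The function name `trimmed_mean` and the building heights are hypothetical, and this is not the PR's implementation.

```python
import numpy as np


def trimmed_mean(values, q=None):
    """Mean of `values`, optionally limited to the percentile range `q`."""
    values = np.asarray(values, dtype=float)
    if q is not None:
        # Percentile bounds computed from the neighborhood's own values.
        lower, upper = np.nanpercentile(values, q)
        values = values[(values >= lower) & (values <= upper)]
    return values.mean() if values.size else np.nan


# Hypothetical building heights in one neighborhood; 60.0 is the "church".
heights = [9.0, 10.0, 11.0, 12.0, 60.0]

plain = trimmed_mean(heights)                # pulled upwards by the outlier
trimmed = trimmed_mean(heights, q=(10, 90))  # outlier (and the minimum) dropped
print(plain, trimmed)
```

With q left as None this is an ordinary mean over the neighborhood, which matches the point made later in the thread that the trimming only applies when q is set.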

@ljwolf
Member

ljwolf commented Jun 4, 2024

I think, for generality, this should be called a truncated or trimmed reduction/lag?

This is very useful generally... @weikang9009 and I have been working on related concepts recently, so it'd be very nice to have something core here!

@martinfleis
Member

I think, for generality, this should be called a truncated or trimmed reduction/lag?

Only if q is not None. Otherwise it is just a generic lag. I am also not sure what can be called a lag (nunique?). The describe terminology comes from pandas. It felt close enough to what we're doing here.

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
@martinfleis martinfleis changed the title Describe neighbourhood values ENH: include Graph.describe() to describe neighbourhood values Jun 4, 2024

codecov bot commented Jun 4, 2024

Codecov Report

Attention: Patch coverage is 97.82609% with 2 lines in your changes missing coverage. Please review.

Project coverage is 85.1%. Comparing base (bcabdbc) to head (879f3f5).
Report is 18 commits behind head on main.

Additional details and impacted files


@@          Coverage Diff           @@
##            main    #717    +/-   ##
======================================
  Coverage   85.0%   85.1%            
======================================
  Files        141     145     +4     
  Lines      15203   15483   +280     
======================================
+ Hits       12924   13169   +245     
- Misses      2279    2314    +35     
Files                                Coverage Δ
libpysal/graph/tests/test_base.py    100.0% <100.0%> (ø)
libpysal/graph/_utils.py             97.1% <97.6%> (+2.2%) ⬆️
libpysal/graph/base.py               96.8% <92.9%> (-1.1%) ⬇️

... and 6 files with indirect coverage changes

libpysal/graph/_utils.py

Parameters
----------
grouper : pandas.GroupBy
Member

Should this be pandas.Grouper?

Contributor Author

I think pandas.Grouper is a different type of object, one that is used for filtering columns. I used the name grouper since it's used in other functions, and the type is GroupBy since pandas groupby returns a GroupBy object.
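
The distinction between the two pandas objects discussed in this thread can be checked directly. The snippet below is only an illustration with a made-up DataFrame: `pd.Grouper` is a grouping specification that can be passed to `.groupby()`, while the object returned by `.groupby()` is a GroupBy instance bound to the data.

```python
import pandas as pd

# Hypothetical toy table of focal ids and values.
df = pd.DataFrame({"focal": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]})

# A Grouper only describes how to group; it holds no data.
spec = pd.Grouper(key="focal")

# .groupby() returns a GroupBy object that can be aggregated.
grouped = df.groupby(spec)["value"]
means = grouped.mean()

print(type(spec).__name__)     # Grouper
print(type(grouped).__name__)  # SeriesGroupBy
```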

libpysal/graph/_utils.py

Weight values do not affect the calculations, only adjacency does.

Returns nan for all isolates.
Member

Maybe 'nan' is OK here, but also maybe NaN or numpy.nan (or something else?)

Probably not a big deal either way.

Contributor Author

changed to numpy.nan
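
The two docstring statements under review (weights are ignored, isolates yield NaN) can be illustrated with a small pandas sketch. The adjacency table, the `weight` column, and the helper `neighbor_mean` are hypothetical and only mimic the described behaviour; they are not taken from libpysal.

```python
import pandas as pd

# Hypothetical values; "c" is an isolate with no entry in the adjacency table.
y = pd.Series([1.0, 2.0, 4.0], index=["a", "b", "c"])

adjacency = pd.DataFrame(
    {"focal": ["a", "b"], "neighbor": ["b", "a"], "weight": [0.1, 99.0]}
)


def neighbor_mean(adj, values):
    """Mean of neighbor values per focal id; the weight column is never read."""
    looked_up = pd.Series(
        values.loc[adj["neighbor"]].to_numpy(), index=adj["focal"]
    )
    # Reindexing adds the isolates back as NaN rows.
    return looked_up.groupby(level=0).mean().reindex(values.index)


result = neighbor_mean(adjacency, y)
reweighted = neighbor_mean(adjacency.assign(weight=1.0), y)
print(result)
```

Changing the weights leaves the result untouched because only the (focal, neighbor) pairs enter the calculation, and the isolate ends up with a missing value.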

u3ks and others added 2 commits June 4, 2024 19:24
Co-authored-by: James Gaboardi <jgaboardi@gmail.com>
@u3ks u3ks requested a review from jGaboardi June 5, 2024 09:00
Comment on lines 2042 to 2043
if not isinstance(y, pd.Series):
y = pd.Series(y)
y = pd.Series(y, index=self.unique_ids)
Member

Looking at this, we may want to check that y.index matches self.unique_ids in case a custom Series is passed. I suppose that a non-matching index may break this.

Contributor Author

added a check
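
The check that was actually added is not shown in this conversation. The sketch below is one possible shape for it, with a hypothetical helper `coerce_values` and a plain `pd.Index` standing in for `self.unique_ids`. It also shows why the concern matters: passing a Series together with a new index to `pd.Series` reindexes it, so a non-matching index silently yields NaN.

```python
import pandas as pd

# Stand-in for self.unique_ids (an assumption for this example).
unique_ids = pd.Index(["a", "b", "c"])


def coerce_values(y, ids):
    """Return `y` as a Series indexed by `ids`, rejecting mismatched Series."""
    if isinstance(y, pd.Series):
        # Index.equals also requires the same order; a real check could be looser.
        if not y.index.equals(ids):
            raise ValueError("The index of `y` must match the graph's unique ids.")
        return y
    return pd.Series(y, index=ids)


# Without a check, a Series with a default RangeIndex is reindexed to all NaN.
misaligned = pd.Series(pd.Series([1.0, 2.0, 3.0]), index=unique_ids)
print(misaligned.isna().all())  # True

# Plain arrays and lists are simply labelled with the ids.
print(coerce_values([1.0, 2.0, 3.0], unique_ids))
```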

@martinfleis martinfleis merged commit 34a7fe1 into pysal:main Jun 5, 2024
11 checks passed
@knaaptime
Member

I think I would call this describe_cardinalities or something, because "Graph.describe() to describe neighbourhood values" implies we're looking at the neighbor values.

@martinfleis
Member

But this is not describing cardinalities, no? Where cardinality is a number of elements in a set. It is describing distribution of values within a neighbourhood.

@knaaptime
Member

Oh, I see. It was this note on line 2014 that tripped me up:

'Weight values do not affect the calculations, only adjacency does.'

5 participants