
Support for Fixing the number of clusters #89

Open
Alex-Mathai-98 opened this issue Nov 9, 2021 · 14 comments
Labels
enhancement New feature or request

Comments

@Alex-Mathai-98

Hi @vtraag - thank you for this amazing repo. It has been immensely useful.

I have a feature request: can you add an additional argument that fixes the number of communities? Say I want exactly 5 communities, even if that results in a partition with a lower modularity score; is there a way to fix that?

While going through your implementation, I have thought of a possible way of doing this (more like a post-processing step, with the assumption that Leiden will always give more communities than I need).

But I wanted to check with you first before going there.

@vtraag (Owner) commented Nov 9, 2021

> I have a feature request: can you add an additional argument that fixes the number of communities? Say I want exactly 5 communities, even if that results in a partition with a lower modularity score; is there a way to fix that?

In principle, you can do this by optimising an existing partition to which you initially assign a random membership, limited to the desired number of communities. For example,

n_comms = 5
partition = la.CPMVertexPartition(G,  # here G is assumed to have 100 nodes
                                  initial_membership=np.random.choice(n_comms, 100),
                                  resolution_parameter=0.5)

where np refers to numpy and la to leidenalg.

If you then turn off the option to also consider empty communities when moving nodes, the optimiser only considers existing communities, i.e. it will never go beyond the indicated number of communities n_comms.

opt = la.Optimiser()
opt.consider_empty_community = False  # do not allow moves into new, empty communities
opt.optimise_partition(partition)

Of course, it is possible that you will end up with fewer communities than n_comms, because such a partition will have a higher quality value. However, you will never have more communities than n_comms.
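
Putting the two snippets together, here is a minimal end-to-end sketch; the random graph is only a stand-in for your own data.

import igraph as ig
import leidenalg as la
import numpy as np

# placeholder graph; substitute your own igraph graph here
G = ig.Graph.Erdos_Renyi(n=100, p=0.1)

n_comms = 5
partition = la.CPMVertexPartition(G,
                                  initial_membership=np.random.choice(n_comms, G.vcount()),
                                  resolution_parameter=0.5)

opt = la.Optimiser()
opt.consider_empty_community = False  # only existing communities are considered
opt.optimise_partition(partition)

# at most n_comms communities remain (possibly fewer)
assert len(set(partition.membership)) <= n_comms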

> While going through your implementation, I have thought of a possible way of doing this (more like a post-processing step, with the assumption that Leiden will always give more communities than I need).
>
> But I wanted to check with you first before going there.

What was the idea you had in mind? I'm curious to hear about it!

@Alex-Mathai-98 (Author) commented Nov 9, 2021

@vtraag - thanks for your immediate response. I will consider your snippet as one of the possible ways to attack this.

As for my idea, it goes like this:

  1. After performing the partitioning algorithm, let's say we get K clusters. I assume that K typically won't be above 1000.
  2. I then create the aggregated graph of the partition.
  3. If my target number of clusters is, say, M: I intend to iterate through each of the K-M smallest partitions, and for each partition I check the decrease in modularity when combining it with one of the M largest partitions. The partition for which the decrease is minimal is the one chosen for the merge.
  4. I think the time complexity should be O(K^2 x time_taken_for_difference_in_modularity).

Do you see any problem with this approach?

@vtraag (Owner) commented Nov 9, 2021

This is also an interesting approach. It is doing something else, though: it effectively tries to deal with minimum community sizes. This is what is discussed in #53, with the approach that I outlined there being very similar to what you propose here.

@Alex-Mathai-98 (Author)

@vtraag - I am not sure I understand why this would ensure a minimum community size.

If K = 10 and M = 5 and the cluster sizes are [10, 9, 8, 6, 1, 1, 1, 1, 1, 1], then using the above approach, is it not possible that the new list of cluster sizes becomes [16, 9, 8, 6, 1]? That is, all the single nodes go to the size-10 cluster.

If this is not possible, can you shed some light on why that wouldn't happen?

@vtraag (Owner) commented Nov 9, 2021

> If K = 10 and M = 5 and the cluster sizes are [10, 9, 8, 6, 1, 1, 1, 1, 1, 1], then using the above approach, is it not possible that the new list of cluster sizes becomes [16, 9, 8, 6, 1]? That is, all the single nodes go to the size-10 cluster.

Yes, it is certainly possible that this would be an outcome.

> 3. I check the decrease in modularity when combining it with one of the M largest partitions. The partition for which the decrease is minimal is the one chosen for the merge.

The way you formulated this, you were interested in simply keeping the M largest communities, not just in retaining a specific number of communities.

All in all, it depends a bit on what you are trying to do. If you are only interested in restricting the number of communities, it might be that a more even distribution of communities is better than simply aggregating existing communities. For example, it could be that communities of sizes [10, 9, 8, 6, 1, 1, 1, 1, 1, 1] should become communities of sizes [8, 8, 8, 8, 7] if you want to restrict their number; by strictly aggregating existing communities you don't necessarily arrive at that outcome. If you simply want to restrict the number of communities, I think the approach that I outlined in my earlier reply is preferable to aggregating communities.

@Alex-Mathai-98 (Author)

@vtraag - I see the confusion now.

It is not a requirement for me that the clusters are evenly distributed. I just need to make sure that exactly M clusters are provided.

I coded up the routine I mentioned in my previous messages. This is the trend I observe: the Leiden algorithm outputs more than 20 clusters, which I iteratively decrease to 8.

[136, 41, 86, 121, 54, 50, 48, 42, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2, 2, 2, 2, 2]
[136, 41, 86, 121, 54, 50, 48, 44, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2, 2, 2, 2]
[136, 41, 86, 121, 54, 50, 48, 46, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2, 2, 2]
[136, 41, 86, 121, 54, 50, 48, 48, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2, 2]
[136, 41, 86, 121, 54, 50, 48, 50, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2]
[136, 41, 86, 121, 54, 50, 48, 52, 28, 28, 14, 14, 8, 7, 5, 5, 4]
[136, 41, 86, 121, 54, 50, 48, 56, 28, 28, 14, 14, 8, 7, 5, 5]
[136, 41, 91, 121, 54, 50, 48, 56, 28, 28, 14, 14, 8, 7, 5]
[141, 41, 91, 121, 54, 50, 48, 56, 28, 28, 14, 14, 8, 7]
[141, 41, 91, 121, 61, 50, 48, 56, 28, 28, 14, 14, 8]
[141, 41, 91, 121, 61, 50, 48, 64, 28, 28, 14, 14]
[141, 55, 91, 121, 61, 50, 48, 64, 28, 28, 14]
[141, 55, 91, 121, 61, 50, 62, 64, 28, 28]
[141, 55, 91, 121, 89, 50, 62, 64, 28]

@vtraag (Owner) commented Nov 10, 2021

OK, good to see you got something working that does the trick for you! If you post your code here, it might also help some other people who would like to do the same.

One final suggestion: on your final result consisting of 8 communities, you can still let the Leiden algorithm run while setting consider_empty_community = False. That way, it will keep only the 8 communities that you have already detected, but it will still try to find a better assignment.
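
In code, that final refinement would be something like the following sketch, with partition holding your 8-community result:

opt = la.Optimiser()
opt.consider_empty_community = False  # stick to the 8 existing communities
opt.optimise_partition(partition)     # refine the node assignment in place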

@Alex-Mathai-98 (Author) commented Nov 11, 2021

@vtraag - thanks for your suggestion. I have incorporated it into my code. Please take a look at the code below and, if possible, provide feedback.

import numpy as np
from leidenalg import Optimiser

def reduce_to_num_clusters(partition, num_clusters):
    ### assuming that 'partition' is already an output that has come from running Leiden,
    ### and that the number of communities in it is too large. We want to reduce
    ### the number of communities to 'num_clusters'

    # instantiate optimiser
    opt = Optimiser()
    opt.consider_empty_community = False

    # get the sizes of the community clusters
    sizes = np.asarray(partition.sizes())

    if len(sizes) < num_clusters:
        raise ValueError("Number of communities is already less than 'num_clusters'")
    elif len(sizes) == num_clusters:
        return partition

    orig_num_clusters = len(sizes)

    for i in range(num_clusters, orig_num_clusters):

        # from the node-level partition, aggregate to form the super nodes
        aggregate_partition = partition.aggregate_partition(partition)

        # get the sizes of the community clusters
        sizes = np.asarray(partition.sizes())

        # take the smallest community with the largest index
        minimum = np.min(sizes)
        smallest = int(np.where(sizes == minimum)[0][-1])

        # the 'num_clusters' largest communities
        arg_sort = np.argsort(sizes)
        fixed = arg_sort[len(sizes) - num_clusters:]

        max_diff = -np.inf
        cluster_idx = -1
        for fix in fixed:
            diff = aggregate_partition.diff_move(smallest, int(fix))
            if diff > max_diff:
                max_diff = diff
                cluster_idx = int(fix)

        # found the cluster that it should be a part of
        aggregate_partition.move_node(smallest, cluster_idx)

        # now one of the communities has size zero. That community can have the 'last' index
        # in the sizes() array, or it can be somewhere in the 'middle'. If it has the 'last' index,
        # then the 'from_coarse_partition' function will automatically remove the last cluster, but
        # if the empty community is in the middle (and not at the last index), then 'from_coarse_partition'
        # won't. To solve the latter case, we optimise the partition.
        if not aggregate_partition.sizes()[-1] == 0:
            # remove the empty community that exists in the 'middle'
            opt.optimise_partition(aggregate_partition)

        # update partition with the new communities
        partition.from_coarse_partition(aggregate_partition)

    # final refinement
    opt.optimise_partition(partition)

    # sanity check
    assert len(partition.sizes()) == num_clusters

    return partition

I also have a small question regarding your suggestion: if I run the Leiden algorithm on my final output, is it possible that I will get fewer than M clusters?

@Alex-Mathai-98 (Author)

@vtraag - I have also found the function renumber_communities, which renumbers the communities (and the sizes array).

This is to avoid the complicated logic below.

# now one of the communities has size zero. That community can have the 'last' index
# in the sizes() array, or it can be somewhere in the 'middle'. If it has the 'last' index,
# then the 'from_coarse_partition' function will automatically remove the last cluster, but
# if the empty community is in the middle (and not at the last index), then 'from_coarse_partition'
# won't. To solve the latter case, we optimise the partition.
if not aggregate_partition.sizes()[-1] == 0:
    # remove the empty community that exists in the 'middle'
    opt.optimise_partition(aggregate_partition)

Can you please verify from your end that I can use renumber_communities() to keep the zero-sized cluster at the end?

@vtraag (Owner) commented Nov 16, 2021

Overall, the script looks good to me, thanks for sharing!

    if not aggregate_partition.sizes()[-1] == 0:
        # remove the empty community that exists in the 'middle'
        opt.optimise_partition(aggregate_partition)

In principle this not only removes the last empty community; it also optimises the remaining assignments (without considering an empty community). It is possible that this already gets you fewer than M clusters, but this is unlikely.

> I also have a small question regarding your suggestion: if I run the Leiden algorithm on my final output, is it possible that I will get fewer than M clusters?

Yes, similar to the optimise_partition call that you use earlier, in principle the last call could result in fewer than M clusters. However, given that the original quality is presumably higher with more clusters, I think it is unlikely to result in fewer clusters.

> @vtraag - I have also found the function renumber_communities, which renumbers the communities (and the sizes array).
> Can you please verify from your end that I can use renumber_communities() to keep the zero-sized cluster at the end?

Yes, this will remove empty communities. In principle, it would probably be better to remove empty communities using renumber_communities instead of optimise_partition, and to run optimise_partition only once, when you have obtained the desired number of clusters.
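
As a sketch, the end of the loop body in your script would then become:

# found the cluster that it should be a part of
aggregate_partition.move_node(smallest, cluster_idx)

# renumbering drops the now-empty community (and reorders communities by size),
# so no optimise_partition call is needed here
aggregate_partition.renumber_communities()

# update partition with the new communities
partition.from_coarse_partition(aggregate_partition)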

Finally, one other comment: I don't think it's necessary to get the aggregate partition and update the original partition for each cluster that you change. You can get the aggregate partition once, then update the original partition using from_coarse_partition at the very end. You only need to make sure that you carefully track the total size of each community in sizes. However, if this script already does the job for you and you don't need it to be more efficient, you can simply forget about this.
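
A rough sketch of that variant, reusing partition, num_clusters and opt from your script (for simplicity it fixes the merge targets once, rather than re-sorting them in every iteration):

import numpy as np

# aggregate once; in the aggregate graph, node i initially sits in community i
aggregate_partition = partition.aggregate_partition(partition)
sizes = np.asarray(partition.sizes())

order = np.argsort(sizes)
targets = [int(c) for c in order[-num_clusters:]]   # the num_clusters largest communities
to_merge = [int(c) for c in order[:-num_clusters]]  # everything else, smallest first

for node in to_merge:
    # merge this community into the target where the quality change is least harmful
    best = max(targets, key=lambda c: aggregate_partition.diff_move(node, c))
    aggregate_partition.move_node(node, best)

# update the original partition only once, at the very end
partition.from_coarse_partition(aggregate_partition)
opt.optimise_partition(partition)  # final refinement, as before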

@Alex-Mathai-98 (Author)

@vtraag - thanks for your approval. As of now, performance does not seem to be a concern.

@Alex-Mathai-98 (Author)

@vtraag - do you think this would be a useful addition to your code repository? If you think so, I do not mind creating a pull request for this functionality.

@vtraag (Owner) commented Dec 16, 2021

Sorry for the late response @Alex-Mathai-98! In principle, the functionality could be interesting for others. However, I would prefer to have an implementation in the C++ layer that could then be exposed in Python, instead of doing it all in Python. Moreover, performance will become an issue at some point, so I prefer to make sure it is sufficiently performant before integrating it into the library. If you are open to doing this, I would be willing to work with you on a PR! If not, I'll try to see if I can find some time in the future to implement such a feature.

@Gilquin commented Jul 18, 2023

Sorry for bumping this issue, but I have a related question.

I would be interested in using the consider_empty_community feature from the leiden R package, which interfaces with leidenalg. Unfortunately, this is not currently possible, as the main function of the leiden package relies on leidenalg.find_partition, which instantiates the optimiser object internally (see issue #27).

A possible solution would be to add a new argument empty_community to find_partition (in the same way as max_comm_size):

+ def find_partition(graph, partition_type, initial_membership=None, weights=None, n_iterations=2, max_comm_size=0, empty_community=True, seed=None, **kwargs):
  ....
  optimiser.max_comm_size = max_comm_size
+ optimiser.consider_empty_community = empty_community

Would such a change be possible?
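
For illustration only, a call with the proposed argument would then look like this; note that empty_community is the proposed argument and does not exist in the current API:

import igraph as ig
import leidenalg as la

G = ig.Graph.Famous('Zachary')
partition = la.find_partition(G, la.ModularityVertexPartition,
                              empty_community=False)  # proposed argument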
