Input to function seriate #4

Closed
DeepaMahm opened this issue Sep 21, 2019 · 6 comments
Comments

@DeepaMahm

Hi,

The docstring in seriate.py says:
:param dists: Either a condensed pdist-like or a symmetric square distance matrix.

Does that mean a correlation matrix shouldn't be used as input? Should the correlation matrix be
converted to a distance matrix?

@Guillemdb
Contributor

The fact that correlations can be negative could influence the calculation of the TSP using ortools, but you can do something like seriate(pdist(corr_matrix)) to solve that problem.

In the docs, pdist-like refers to using scipy.spatial.distance.pdist to turn a non-square input matrix into a condensed distance vector before seriation.
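To make the two accepted input forms concrete, here is a minimal sketch using only SciPy; the toy matrix `rows` is made up for illustration and does not come from this thread. Note that `pdist` treats each row as a feature vector, so when it is applied to a correlation matrix the result is the Euclidean distance between rows of correlations, not a direct transformation of the correlations themselves.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy observation matrix (illustrative only): 4 rows to be ordered, 3 features each.
rows = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
])

# pdist uses the Euclidean metric by default and returns a condensed vector
# holding the n * (n - 1) / 2 pairwise distances of the upper triangle.
condensed = pdist(rows)

# squareform expands the condensed vector into the symmetric square matrix.
square = squareform(condensed)

print(condensed.shape)  # (6,)
print(square.shape)     # (4, 4)
```

According to the docstring quoted above, either `condensed` or `square` should be acceptable as the `dists` argument.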

@vmarkovtsev
Collaborator

The TSP has no solution when distances can be negative: we can keep following the corresponding negative cycle and drive the loss toward negative infinity.

The triangle inequality does not have to hold, though, so I don't think the matrix must be positive definite.
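One way to avoid negative entries altogether is to map correlations to dissimilarities before seriation. The sketch below uses the common `1 - r` convention, which is an assumption on my part and not something proposed in this thread; `correlation_to_distance` is a hypothetical helper name.

```python
import numpy as np


def correlation_to_distance(corr):
    """Map correlations in [-1, 1] to dissimilarities in [0, 2] via 1 - r.

    Hypothetical helper: 1 - r is one common convention, not the only one.
    """
    corr = np.asarray(corr, dtype=float)
    dist = 1.0 - corr
    # Clip tiny negative values caused by floating-point round-off.
    dist = np.clip(dist, 0.0, None)
    # Enforce exact symmetry and a zero diagonal.
    dist = (dist + dist.T) / 2.0
    np.fill_diagonal(dist, 0.0)
    return dist


# Toy correlation matrix (illustrative only) with some negative entries.
corr = np.array([
    [1.0, 0.8, -0.5],
    [0.8, 1.0, -0.2],
    [-0.5, -0.2, 1.0],
])

dist = correlation_to_distance(corr)
print(dist)
```

The result is a symmetric square matrix with non-negative entries, so strongly anticorrelated rows end up far apart instead of contributing negative edge weights.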

@DeepaMahm
Author

@Guillemdb I tried the following:

import os
import pickle
import matplotlib.pyplot as plt
from pprint import pprint
from seriate import seriate
from scipy.spatial.distance import pdist


def serialize_data(f_input):
    if os.path.exists(f_input):
        with open(f_input, "rb") as f:
            # prior to seriation
            df = pickle.load(f)
            pprint(df.head())
            input_np = df.values    # correlation matrix as a NumPy ndarray
            dist = pdist(input_np)  # condensed Euclidean distances between rows

            # matplotlib
            fig, ax = plt.subplots()
            im = ax.imshow(input_np)
            fig.tight_layout()
            plt.show()

            # seriation
            idx = seriate(dist, timeout=50)
            fig1, ax1 = plt.subplots()
            im1 = ax1.imshow(input_np[idx])
            fig1.tight_layout()
            plt.show()


if __name__ == '__main__':
    f_input = ""  # input: path to the pickled DataFrame (left blank in the original post)
    serialize_data(f_input)

This is the plot of the input data containing the correlation matrix.

This is the plot of the seriated data. I could observe streaks of blue patterns. However, these streaks aren't grouped together. I expect these streaks to be grouped :(

@Guillemdb
Contributor

To me your output looks fine. This is probably as grouped as the streaks can be; results like this are normal when working with such big matrices.

@DeepaMahm
Author

@Guillemdb Many thanks for the response. Shouldn't the diagonal remain unchanged? Before seriation, I could see a yellow pattern. After seriation, the rows are sorted according to the Euclidean distance.

Would it be a good idea to sort the columns as well? Since the diagonal entries of the correlation matrix are expected to exhibit high correlation, I am a bit confused.

@DeepaMahm
Author

I came across a post on Stack Overflow that suggests sorting both the columns and the rows of the correlation matrix.
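The difference between reordering rows only and reordering rows and columns together can be sketched with NumPy alone. The matrix and the ordering `idx` below are made up for illustration; `idx` merely stands in for whatever ordering `seriate` returns in the snippet above.

```python
import numpy as np

# Toy symmetric correlation matrix (illustrative only) with ones on the diagonal.
corr = np.array([
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.4],
    [0.2, 0.8, 0.4, 1.0],
])

# Hypothetical ordering standing in for the result of seriate(...).
idx = [0, 2, 3, 1]

# Rows only, as in input_np[idx]: the columns keep their old order,
# so the self-correlations no longer sit on the diagonal.
rows_only = corr[idx]

# Rows and columns with the same permutation: the matrix stays symmetric
# and the self-correlations stay on the diagonal.
both = corr[np.ix_(idx, idx)]

print(np.diag(rows_only))
print(np.diag(both))
```

Applying the same permutation to both axes keeps each item's self-correlation on the diagonal, which is why the diagonal pattern disappears when only the rows are reordered.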
