Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: spatial: Rewriting distance metrics to _distance_pybind #14611

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

czgdp1807
Copy link
Member

@czgdp1807 czgdp1807 commented Aug 19, 2021

Reference issue

#14077

What does this implement/fix?

This PR aims to re-write distance metrics in scipy.spatial.distance to _distance_pybind. As of now this is a work in progress. Following metrics will be added here,

  • Correlation (pdist)
  • Cosine (pdist)
  • Mahalanobis (pdist)
  • Seucledian (pdist)
  • Jensenshannon (pdist)

Refer this comment for more details.

Please feel free to leave your reviews meanwhile this gets completed. Thanks.

Additional information

Benchmarks result.

+      10.1±0.1μs       24.9±0.2μs     2.47  spatial.Xdist.time_pdist(10, 'jensenshannon')
+     1.06±0.04ms      1.13±0.01ms     1.07  spatial.XdistWeighted.time_cdist(10, 'russellrao')
+      9.57±0.3ms      10.1±0.08ms     1.05  spatial.XdistWeighted.time_cdist(20, 'kulsinski')
-      61.7±0.6μs       58.7±0.5μs     0.95  spatial.Xdist.time_pdist(10, 'mahalanobis')
-     2.95±0.07ms      2.80±0.02ms     0.95  spatial.XdistWeighted.time_pdist(20, 'jaccard')
-     2.19±0.02ms      2.08±0.02ms     0.95  spatial.XdistWeighted.time_pdist(20, 'russellrao')
-     4.60±0.08ms      4.37±0.05ms     0.95  spatial.XdistWeighted.time_pdist(20, 'sokalmichener')
-     4.86±0.03μs       4.61±0.1μs     0.95  spatial.Xdist.time_pdist(10, 'dice')
-     4.84±0.07μs       4.59±0.1μs     0.95  spatial.Xdist.time_pdist(10, 'sokalmichener')
-     5.32±0.02μs      5.04±0.02μs     0.95  spatial.XdistWeighted.time_pdist(20, 'canberra')
-     4.81±0.07μs       4.55±0.1μs     0.95  spatial.Xdist.time_pdist(10, 'russellrao')
-      40.0±0.1μs      37.7±0.05μs     0.94  spatial.Xdist.time_pdist(100, 'correlation')
-     2.45±0.04ms       2.30±0.1ms     0.94  spatial.Xdist.time_pdist(1000, 'hamming')
-     4.58±0.06μs      4.30±0.02μs     0.94  spatial.XdistWeighted.time_pdist(20, 'chebyshev')
-     4.56±0.02μs      4.28±0.09μs     0.94  spatial.Xdist.time_pdist(10, 'jaccard')
-      13.1±0.1μs       12.3±0.2μs     0.94  spatial.XdistWeighted.time_pdist(10, 'minkowski-P3')
-     1.25±0.01ms      1.17±0.01ms     0.94  spatial.XdistWeighted.time_pdist(10, 'sokalsneath')
-      4.46±0.1ms      4.18±0.01ms     0.94  spatial.Xdist.time_cdist(1000, 'seuclidean')
-     4.97±0.02μs      4.64±0.09μs     0.93  spatial.Xdist.time_pdist(10, 'yule')
-     4.96±0.02μs       4.62±0.1μs     0.93  spatial.Xdist.time_pdist(10, 'sokalsneath')
-      16.6±0.5μs       15.4±0.1μs     0.93  spatial.XdistWeighted.time_pdist(10, 'hamming')
-      17.6±0.4μs       16.3±0.2μs     0.93  spatial.XdistWeighted.time_pdist(20, 'hamming')
-      28.2±0.2μs       26.1±0.1μs     0.93  spatial.XdistWeighted.time_pdist(100, 'sqeuclidean')
-      27.8±0.2μs       25.7±0.3μs     0.92  spatial.XdistWeighted.time_pdist(100, 'cityblock')
-     1.80±0.06ms      1.66±0.01ms     0.92  spatial.Xdist.time_pdist(1000, 'correlation')
-        71.4±2μs         64.9±2μs     0.91  spatial.Xdist.time_cdist(10, 'mahalanobis')
-      25.3±0.5μs       22.5±0.2μs     0.89  spatial.Xdist.time_cdist(10, 'correlation')
-        1.85±0ms      1.62±0.01ms     0.88  spatial.Xdist.time_pdist(1000, 'cosine')
-     24.7±0.04μs      21.5±0.03μs     0.87  spatial.Xdist.time_pdist(100, 'cosine')
-      40.4±0.5μs       34.2±0.1μs     0.85  spatial.Xdist.time_cdist(100, 'braycurtis')
-      23.0±0.1μs       19.4±0.2μs     0.84  spatial.Xdist.time_pdist(100, 'braycurtis')
-        70.1±1μs       58.6±0.1μs     0.84  spatial.Xdist.time_cdist(100, 'correlation')
-         170±2μs          141±1μs     0.83  spatial.Xdist.time_pdist(100, 'mahalanobis')
-     3.53±0.05ms      2.91±0.02ms     0.82  spatial.Xdist.time_cdist(1000, 'braycurtis')
-     1.80±0.01ms         1.47±0ms     0.81  spatial.Xdist.time_pdist(1000, 'braycurtis')
-      16.0±0.4μs       12.8±0.4μs     0.80  spatial.Xdist.time_cdist(10, 'jensenshannon')
-         288±3μs          227±2μs     0.79  spatial.Xdist.time_cdist(100, 'mahalanobis')
-     3.65±0.07ms      2.83±0.02ms     0.78  spatial.Xdist.time_cdist(1000, 'cosine')
-     3.71±0.06ms      2.86±0.01ms     0.77  spatial.Xdist.time_cdist(1000, 'correlation')
-      44.6±0.8μs       34.3±0.1μs     0.77  spatial.Xdist.time_cdist(100, 'cosine')
-      9.80±0.2ms      7.30±0.09ms     0.75  spatial.Xdist.time_pdist(1000, 'mahalanobis')
-      19.9±0.2ms       14.5±0.2ms     0.73  spatial.Xdist.time_cdist(1000, 'mahalanobis')
-         145±3ms       99.3±0.1ms     0.68  spatial.Xdist.time_cdist(1000, 'jensenshannon')
-      4.45±0.1μs      3.01±0.03μs     0.68  spatial.Xdist.time_pdist(10, 'cosine')
-     4.71±0.02μs       3.19±0.1μs     0.68  spatial.Xdist.time_cdist(10, 'cosine')
-     1.49±0.02ms      1.01±0.01ms     0.68  spatial.Xdist.time_cdist(100, 'jensenshannon')
-         766±6μs          503±2μs     0.66  spatial.Xdist.time_pdist(100, 'jensenshannon')
-      73.3±0.1ms       46.7±0.2ms     0.64  spatial.Xdist.time_pdist(1000, 'jensenshannon')
-        90.9±2μs       49.9±0.2μs     0.55  spatial.Xdist.time_cdist(100, 'seuclidean')
-        56.9±1μs      27.3±0.07μs     0.48  spatial.Xdist.time_pdist(100, 'seuclidean')
-      30.1±0.2μs      3.17±0.03μs     0.11  spatial.Xdist.time_pdist(10, 'seuclidean')
-      35.5±0.7μs       3.48±0.1μs     0.10  spatial.Xdist.time_cdist(10, 'seuclidean')
-     3.26±0.02ms       32.8±0.2μs     0.01  spatial.XdistWeighted.time_pdist(10, 'correlation')
-      7.16±0.2ms       23.0±0.3μs     0.00  spatial.XdistWeighted.time_cdist(10, 'correlation')
-      13.1±0.3ms       34.3±0.2μs     0.00  spatial.XdistWeighted.time_pdist(20, 'correlation')
-     2.21±0.02ms      3.59±0.04μs     0.00  spatial.XdistWeighted.time_pdist(10, 'cosine')
-      28.7±0.4ms       25.8±0.9μs     0.00  spatial.XdistWeighted.time_cdist(20, 'correlation')
-      4.92±0.2ms      3.86±0.02μs     0.00  spatial.XdistWeighted.time_cdist(10, 'cosine')
-     9.34±0.05ms       4.81±0.1μs     0.00  spatial.XdistWeighted.time_pdist(20, 'cosine')
-      19.1±0.3ms      5.65±0.05μs     0.00  spatial.XdistWeighted.time_cdist(20, 'cosine')
-         347±5ms         62.9±2μs     0.00  spatial.XdistWeighted.time_pdist(100, 'correlation')
-         241±3ms       27.7±0.9μs     0.00  spatial.XdistWeighted.time_pdist(100, 'cosine')
-         722±6ms       72.3±0.8μs     0.00  spatial.XdistWeighted.time_cdist(100, 'correlation')
-         485±3ms       46.5±0.3μs     0.00  spatial.XdistWeighted.time_cdist(100, 'cosine')

This PR is ready for now. Merged main into this branch after my latest commit.

@czgdp1807 czgdp1807 added enhancement A new feature or improvement scipy.spatial labels Aug 19, 2021
@czgdp1807 czgdp1807 changed the title ENH: Port correlation to _distance_pybind ENH: Rewriting distance metrics to _distance_pybind Aug 19, 2021
@tupui
Copy link
Member

tupui commented Aug 19, 2021

Could you please fill up a bit more the description of your PRs? 😅 We try to make it clear enough so that we don't need to go through the code to know what is the intent (it takes quite some time otherwise to do this for all PRs). Thanks a lot 🙏

@czgdp1807
Copy link
Member Author

czgdp1807 commented Aug 19, 2021

For this PR, there is nothing more info to provide other than the title at the moment. Once ready and things get clear enough, I will provide the benchmark code and results in the description. Thanks.

@tupui
Copy link
Member

tupui commented Aug 20, 2021

From the current code it seems that this just addresses Hamming, and not the whole issue which is unclear from your description. If indeed you want to add all, then it should read Closes ...

@czgdp1807
Copy link
Member Author

First I added, Correlation distance and it seems like adding it would require a bunch of changes in the framework itself (or may be not). So, I changed to adding Hamming. I would be pushing more commits to add Cosine and Correlation. This PR may not be able to close the referenced issue at once (to avoid large code diff making it hard to review). In order to avoid confusion, I kept it in WIP mode. Once it would be ready for review I will explain my changes in the description, add benchmark code and results.

@tupui
Copy link
Member

tupui commented Aug 20, 2021

In order to avoid confusion, I kept it in WIP mode. Once it would be ready for review I will explain my changes in the description, add benchmark code and results.

In the future, please still add a description. In this case you can just list what you intend to do with a list for instance. It's not a problem if you don't tick everything and modify it latter. It's just clearer.

If your work is very experimental though, I would argue that you should not open yet a PR and work on your fork first. This simply helps our CI consumption.

@czgdp1807
Copy link
Member Author

czgdp1807 commented Aug 20, 2021

In the future, please still add a description.

Sure. I didn't think that adding Correlation/Cosine would be tricky. Though it turned out to be. I will add some description soon. Once I push the commits, I will update here.

Edit - Updated the description for clarity.

@czgdp1807
Copy link
Member Author

czgdp1807 commented Feb 12, 2022

I am unable to run benchmarks with following command.

Command

python runtests.py --bench-compare main spatial.Xdist

Error

· No `environment_type` specified in asv.conf.json. This will be required in the future.
· Creating environments....................
·· Error running /home/czgdp1807ssd/anaconda3/bin/conda env create -f /tmp/tmpnufhm_31.yml -p /home/czgdp1807ssd/scipy_project/scipy/benchmarks/env/9e6b5bad4b437052513fdb7dfc5533cc --force (exit status 1)
   STDOUT -------->
   Collecting package metadata (repodata.json): ...working... done
   Solving environment: ...working... 
   Found conflicts! Looking for incompatible packages.
   This can take several minutes.  Press CTRL-C to abort.
   failed
   Solving environment: ...working... 
   Found conflicts! Looking for incompatible packages.
   This can take several minutes.  Press CTRL-C to abort.
   failed
   STDERR -------->
Examining @/linux-64::__cuda==11.2=0:  23%|██▎       | 3/13 [00:00<00:00, 23.68iExamining @/linux-64::__cuda==11.2=0:  31%|███       | 4/13 [00:00<00:00, 31.55iExamining @/linux-64::__glibc==2.27=0:  31%|███       | 4/13 [00:00<00:00, 31.55Examining @/linux-64::__unix==0=0:  38%|███▊      | 5/13 [00:00<00:00, 31.55it/sExamining pybind11:  46%|████▌     | 6/13 [00:00<00:00, 31.55it/s]              Examining @/linux-64::__archspec==1=x86_64:  54%|█████▍    | 7/13 [00:00<00:00, Examining @/linux-64::__archspec==1=x86_64:  62%|██████▏   | 8/13 [00:00<00:00, Examining pytest:  62%|██████▏   | 8/13 [00:00<00:00, 33.21it/s]                Examining conflict for pythran cython pybind11 pytest numpy python pip wheel:   Examining conflict for numpy pythran:   8%|▊         | 1/13 [00:01<00:13,  1.12sExamining conflict for numpy pythran:  15%|█▌        | 2/13 [00:01<00:06,  1.79iExamining conflict for pythran cython pytest python pip wheel:  15%|█▌        | Examining conflict for numpy pythran pytest:  23%|██▎       | 3/13 [00:01<00:05,Examining conflict for numpy pythran pytest:  31%|███       | 4/13 [00:01<00:02,Examining conflict for pythran cython pybind11 numpy python:  31%|███       | 4/Examining conflict for pythran cython pybind11 numpy python:  38%|███▊      | 5/Examining conflict for pythran pybind11 pytest numpy python pip wheel:  38%|███▊Examining conflict for pythran pybind11 pytest numpy python pip wheel:  46%|████Examining conflict for pythran cython __glibc pybind11 numpy python:  46%|████▌ Examining conflict for pythran cython __glibc pybind11 numpy python:  54%|█████▍Examining conflict for pytest pip cython wheel:  54%|█████▍    | 7/13 [00:05<00:Examining conflict for pytest pip cython wheel:  62%|██████▏   | 8/13 [00:05<00:Examining conflict for pybind11 pytest numpy python pip wheel:  62%|██████▏   | Examining conflict for pybind11 pytest numpy python pip wheel:  69%|██████▉   | Examining conflict for python pip wheel:  69%|██████▉   | 9/13 [00:06<00:02,  1.Examining conflict for python pip wheel:  77%|███████▋  | 10/13 [00:06<00:02,  1                                                                                Examining @/linux-64::__cuda==11.2=0:  23%|██▎       | 3/13 [00:00<00:00, 22.65iExamining @/linux-64::__cuda==11.2=0:  31%|███       | 4/13 [00:00<00:00, 30.18iExamining @/linux-64::__glibc==2.27=0:  31%|███       | 4/13 [00:00<00:00, 30.18Examining @/linux-64::__unix==0=0:  38%|███▊      | 5/13 [00:00<00:00, 30.18it/sExamining pybind11:  46%|████▌     | 6/13 [00:00<00:00, 30.18it/s]              Examining @/linux-64::__archspec==1=x86_64:  54%|█████▍    | 7/13 [00:00<00:00, Examining @/linux-64::__archspec==1=x86_64:  62%|██████▏   | 8/13 [00:00<00:00, Examining pytest:  62%|██████▏   | 8/13 [00:00<00:00, 31.69it/s]                Examining conflict for pythran cython pybind11 pytest numpy python pip wheel:   Examining conflict for numpy pythran:   8%|▊         | 1/13 [00:01<00:14,  1.24sExamining conflict for numpy pythran:  15%|█▌        | 2/13 [00:01<00:06,  1.61iExamining conflict for pythran cython pytest python pip wheel:  15%|█▌        | Examining conflict for numpy pythran pytest:  23%|██▎       | 3/13 [00:01<00:06,Examining conflict for numpy pythran pytest:  31%|███       | 4/13 [00:01<00:03,Examining conflict for pythran cython pybind11 numpy python:  31%|███       | 4/Examining conflict for pythran cython pybind11 numpy python:  38%|███▊      | 5/Examining conflict for pythran pybind11 pytest numpy python pip wheel:  38%|███▊Examining conflict for pythran pybind11 pytest numpy python pip wheel:  46%|████Examining conflict for pythran cython __glibc pybind11 numpy python:  46%|████▌ Examining conflict for pythran cython __glibc pybind11 numpy python:  54%|█████▍Examining conflict for pytest pip cython wheel:  54%|█████▍    | 7/13 [00:06<00:Examining conflict for pytest pip cython wheel:  62%|██████▏   | 8/13 [00:06<00:Examining conflict for pybind11 pytest numpy python pip wheel:  62%|██████▏   | Examining conflict for pybind11 pytest numpy python pip wheel:  69%|██████▉   | Examining conflict for python pip wheel:  69%|██████▉   | 9/13 [00:07<00:02,  1.Examining conflict for python pip wheel:  77%|███████▋  | 10/13 [00:07<00:02,  1                                                                                        
   UnsatisfiableError: The following specifications were found to be incompatible with each other:
   
   Output in format: Requested package -> Available versions
   
   Package ca-certificates conflicts for:
   wheel -> python -> ca-certificates
   pip -> python[version='>=2.7,<2.8.0a0'] -> ca-certificates
   python=3.10 -> openssl[version='>=1.1.1l,<1.1.2a'] -> ca-certificates
   numpy -> python[version='>=2.7,<2.8.0a0'] -> ca-certificates
   pybind11 -> python[version='>=2.7,<2.8.0a0'] -> ca-certificates
   pytest -> python[version='>=2.7,<2.8.0a0'] -> ca-certificates
   
   Package python conflicts for:
   cython=0.29.21 -> python[version='>=3.6,<3.7.0a0|>=3.8,<3.9.0a0|>=3.9,<3.10.0a0|>=3.7,<3.8.0a0']
   pythran -> beniget=0.3 -> python[version='>=2.7,<2.8.0a0|>=3.5|>=3.5,<3.6.0a0|>=3.7|>=3.6']
   pytest -> attrs[version='>=19.2.0'] -> python[version='>=2.7|>=3.5|>=3.6|>=3|>=3.4']
   pybind11 -> python[version='>=2.7,<2.8.0a0|>=3.10,<3.11.0a0|>=3.7,<3.8.0a0|>=3.8,<3.9.0a0|>=3.9,<3.10.0a0|>=3.6,<3.7.0a0|>=3.5,<3.6.0a0']
   pytest -> python[version='>=2.7,<2.8.0a0|>=3.10,<3.11.0a0|>=3.7,<3.8.0a0|>=3.8,<3.9.0a0|>=3.9,<3.10.0a0|>=3.6,<3.7.0a0|>=3.5,<3.6.0a0']
   numpy -> python[version='>=2.7,<2.8.0a0|>=3.10,<3.11.0a0|>=3.8,<3.9.0a0|>=3.7,<3.8.0a0|>=3.9,<3.10.0a0|>=3.6,<3.7.0a0|>=3.5,<3.6.0a0']
   pip -> wheel -> python
   wheel -> setuptools -> python[version='>=3.10,<3.11.0a0|>=3.9,<3.10.0a0']
   cython=0.29.21 -> setuptools -> python[version='>=2.7,<2.8.0a0|>=3.10,<3.11.0a0|>=3.5,<3.6.0a0']
   wheel -> python[version='>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.8,<3.9.0a0|>=3.7,<3.8.0a0|>=3.5,<3.6.0a0']
   pip -> python[version='>=2.7,<2.8.0a0|>=3.10,<3.11.0a0|>=3.9,<3.10.0a0|>=3.8,<3.9.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.5,<3.6.0a0']
   pythran -> python[version='>=3.10,<3.11.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.8,<3.9.0a0|>=3.9,<3.10.0a0']
   
   Package six conflicts for:
   numpy -> mkl-service[version='>=2.3.0,<3.0a0'] -> six
   pytest -> six[version='>=1.10.0']
   pytest -> more-itertools[version='>=4.0.0'] -> six[version='>=1.0.0,<2.0.0']
   pythran -> six
   
   Package numpy conflicts for:
   pythran -> numpy[version='>=1.16.6,<2.0a0|>=1.21.2,<2.0a0']
   pythran -> networkx[version='>=2'] -> numpy[version='>=1.19']
   numpy
   
   Package bzip2 conflicts for:
   numpy -> python[version='>=3.10,<3.11.0a0'] -> bzip2[version='>=1.0.8,<2.0a0']
   python=3.10 -> bzip2[version='>=1.0.8,<2.0a0']
   pip -> python[version='>=3.10,<3.11.0a0'] -> bzip2[version='>=1.0.8,<2.0a0']
   pybind11 -> python[version='>=3.10,<3.11.0a0'] -> bzip2[version='>=1.0.8,<2.0a0']
   wheel -> python -> bzip2[version='>=1.0.8,<2.0a0']
   pythran -> python[version='>=3.10,<3.11.0a0'] -> bzip2[version='>=1.0.8,<2.0a0']
   pytest -> python[version='>=3.10,<3.11.0a0'] -> bzip2[version='>=1.0.8,<2.0a0']
   
   Package setuptools conflicts for:
   cython=0.29.21 -> setuptools
   wheel -> setuptools
   pip -> setuptools
   pythran -> networkx[version='>=2'] -> setuptools
   pytest -> setuptools[version='>=40.0']
   python=3.10 -> pip -> setuptools
   
   Package certifi conflicts for:
   cython=0.29.21 -> setuptools -> certifi[version='>=2016.09|>=2016.9.26']
   pip -> setuptools -> certifi[version='>=2016.09|>=2016.9.26']
   wheel -> setuptools -> certifi[version='>=2016.09|>=2016.9.26']
   pytest -> setuptools[version='>=40.0'] -> certifi[version='>=2016.09|>=2016.9.26']
   
   Package _openmp_mutex conflicts for:
   python=3.10 -> libgcc-ng[version='>=7.5.0'] -> _openmp_mutex[version='>=4.5']
   cython=0.29.21 -> libgcc-ng[version='>=7.3.0'] -> _openmp_mutex[version='>=4.5']
   numpy -> libgcc-ng[version='>=7.5.0'] -> _openmp_mutex[version='>=4.5']
   pybind11 -> libgcc-ng[version='>=7.5.0'] -> _openmp_mutex[version='>=4.5']
   pythran -> libgcc-ng[version='>=7.5.0'] -> _openmp_mutex[version='>=4.5']
   
   Package wheel conflicts for:
   pip -> wheel
   wheelThe following specifications were found to be incompatible with your system:
   
     - feature:/linux-64::__glibc==2.27=0
     - feature:|@/linux-64::__glibc==2.27=0
     - cython=0.29.21 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
     - numpy -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
     - pybind11 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
     - pythran -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
   
   Your installed version is: 2.27
   
   

·· Failure creating environment for conda-py3.10-Cython0.29.21-numpy-pybind11-pytest-pythran
Traceback (most recent call last):
  File "/home/czgdp1807ssd/anaconda3/envs/scipy-dev/bin/asv", line 10, in <module>
    sys.exit(main())
  File "/home/czgdp1807ssd/anaconda3/envs/scipy-dev/lib/python3.10/site-packages/asv/main.py", line 38, in main
    result = args.func(args)
  File "/home/czgdp1807ssd/anaconda3/envs/scipy-dev/lib/python3.10/site-packages/asv/commands/__init__.py", line 49, in run_from_args
    return cls.run_from_conf_args(conf, args)
  File "/home/czgdp1807ssd/anaconda3/envs/scipy-dev/lib/python3.10/site-packages/asv/commands/continuous.py", line 75, in run_from_conf_args
    return cls.run(
  File "/home/czgdp1807ssd/anaconda3/envs/scipy-dev/lib/python3.10/site-packages/asv/commands/continuous.py", line 114, in run
    result = Run.run(
  File "/home/czgdp1807ssd/anaconda3/envs/scipy-dev/lib/python3.10/site-packages/asv/commands/run.py", line 294, in run
    Setup.perform_setup(environments, parallel=parallel)
  File "/home/czgdp1807ssd/anaconda3/envs/scipy-dev/lib/python3.10/site-packages/asv/commands/setup.py", line 89, in perform_setup
    list(map(_create, environments))
  File "/home/czgdp1807ssd/anaconda3/envs/scipy-dev/lib/python3.10/site-packages/asv/commands/setup.py", line 21, in _create
    env.create()
  File "/home/czgdp1807ssd/anaconda3/envs/scipy-dev/lib/python3.10/site-packages/asv/environment.py", line 704, in create
    self._setup()
  File "/home/czgdp1807ssd/anaconda3/envs/scipy-dev/lib/python3.10/site-packages/asv/plugins/conda.py", line 174, in _setup
    self._run_conda(['env', 'create', '-f', env_file_name,
  File "/home/czgdp1807ssd/anaconda3/envs/scipy-dev/lib/python3.10/site-packages/asv/plugins/conda.py", line 227, in _run_conda
    return util.check_output([conda] + args, env=env)
  File "/home/czgdp1807ssd/anaconda3/envs/scipy-dev/lib/python3.10/site-packages/asv/util.py", line 754, in check_output
    raise ProcessError(args, retcode, stdout, stderr)
asv.util.ProcessError: Command '/home/czgdp1807ssd/anaconda3/bin/conda env create -f /tmp/tmpnufhm_31.yml -p /home/czgdp1807ssd/scipy_project/scipy/benchmarks/env/9e6b5bad4b437052513fdb7dfc5533cc --force' returned non-zero exit status 1

cc: @rgommers @tupui Any ideas why the above might be happening?

@czgdp1807
Copy link
Member Author

Well I made the benchmarks work by removing 0.29.21 from scipy/benchmarks/asv.conf.json from Cython.

@czgdp1807
Copy link
Member Author

Work on cdist for the distance metric listed in the OP is complete. Performance increases in general (except for one case). I am marking it as ready for review. Meanwhile I will keep updating for pdist and accommodate for reviews as well.

@czgdp1807
Copy link
Member Author

This PR is ready now. I have added pdist and cdist support for all the metrics listed in the OP. Benchmarks have improved (except for very a few metrics which degrade on small sized inputs, corner cases I'd say). Please feel free to review.

cc: @rgommers @peterbell10 @tupui

Also, note that #16023 is open which discusses some important aspects related to the new framework for distance metrics. I am not sure why tests on Windows fail. The log is shown below,

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "runtests.py", line 612, in <module>
    main(argv=sys.argv[1:])
  File "runtests.py", line 307, in main
    test, version, mod_path = import_module()
  File "runtests.py", line 356, in import_module
    __import__(PROJECT_MODULE)
ModuleNotFoundError: No module named 'scipy'
##[error]PowerShell exited with code '1'.
Finishing: Run SciPy Test Suite

@tupui
Copy link
Member

tupui commented Apr 28, 2022

Also, note that #16023 is open which discusses some important aspects related to the new framework for distance metrics. I am not sure why tests on Windows fail. The log is shown below,

On these jobs when this happens it's always because the previous stage failed. I know, it's confusing. You have build errors there.

@czgdp1807
Copy link
Member Author

So, should I rebase on top of main branch because it might be possible that one of my commits was WIP and that resulted in a failure on Windows. Let me know if rebasing will be okay here.

@tupui
Copy link
Member

tupui commented Apr 28, 2022

Feel free to rebase. When there are no review there is no problem with that 😃 When you rebase on GH, you should just be careful about which part of the history you are rewriting. GH is using the hash of the commits to links comments (seemingly) and if the hash change, comments on specific lines become dead links. I often force push my last commit because I have a typo, and this is fine until a new comment is made.

@czgdp1807
Copy link
Member Author

I see. Probably that's why its safe to rebase here because currently there are on comments on the diff. Am I right?

@tupui
Copy link
Member

tupui commented Apr 28, 2022

Yes, exactly. And during a review which already started, you can also ask about rebasing. Most of the time people will be fine with it. They just need to know before it happens and they are not surprise about their reviews disappearing.

@czgdp1807
Copy link
Member Author

I rebased on top of main branch and modified the history to 3 commits for easier review. Thanks for clarifying my queries swiftly. :-)

@czgdp1807
Copy link
Member Author

Well, C++ on Windows doesn't allow creating variable sized arrays in stack memory. Will fix this.

@czgdp1807
Copy link
Member Author

Tests pass now I think.

@patacca
Copy link

patacca commented May 18, 2022

Probably I am doing something wrong but I cannot compile the branch:

git clone https://github.com/czgdp1807/scipy.git
cd scipy
git checkout pybind11_dist
pip install .

[snip]
      /tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/numpy/distutils/system_info.py:933: UserWarning: Specified path /usr/local/include/python3.10 is invalid.
        return self.get_paths(self.section, key)
      /tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/numpy/distutils/system_info.py:933: UserWarning: Specified path /home/patacca/software/scipy/venv/include/python3.10 is invalid.
        return self.get_paths(self.section, key)
      non-existing path in 'scipy/linalg': 'src/lapack_deprecations/LICENSE'
      Traceback (most recent call last):
        File "/home/patacca/software/scipy/venv/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
          main()
        File "/home/patacca/software/scipy/venv/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/home/patacca/software/scipy/venv/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 164, in prepare_metadata_for_build_wheel
          return hook(metadata_directory, config_settings)
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 174, in prepare_metadata_for_build_wheel
          self.run_setup()
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 267, in run_setup
          super(_BuildMetaLegacyBackend,
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 158, in run_setup
          exec(compile(code, __file__, 'exec'), locals())
        File "setup.py", line 532, in <module>
          setup_package()
        File "setup.py", line 528, in setup_package
          setup(**metadata)
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/numpy/distutils/core.py", line 135, in setup
          config = configuration()
        File "setup.py", line 438, in configuration
          config.add_subpackage('scipy')
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/numpy/distutils/misc_util.py", line 1016, in add_subpackage
          config_list = self.get_subpackage(subpackage_name, subpackage_path,
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/numpy/distutils/misc_util.py", line 982, in get_subpackage
          config = self._get_configuration_from_setup_py(
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/numpy/distutils/misc_util.py", line 924, in _get_configuration_from_setup_py
          config = setup_module.configuration(*args)
        File "scipy/setup.py", line 18, in configuration
          config.add_subpackage('optimize')
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/numpy/distutils/misc_util.py", line 1016, in add_subpackage
          config_list = self.get_subpackage(subpackage_name, subpackage_path,
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/numpy/distutils/misc_util.py", line 982, in get_subpackage
          config = self._get_configuration_from_setup_py(
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/numpy/distutils/misc_util.py", line 924, in _get_configuration_from_setup_py
          config = setup_module.configuration(*args)
        File "scipy/optimize/setup.py", line 130, in configuration
          config.add_subpackage('_highs')
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/numpy/distutils/misc_util.py", line 1016, in add_subpackage
          config_list = self.get_subpackage(subpackage_name, subpackage_path,
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/numpy/distutils/misc_util.py", line 982, in get_subpackage
          config = self._get_configuration_from_setup_py(
        File "/tmp/pip-build-env-z3vckssd/overlay/lib/python3.10/site-packages/numpy/distutils/misc_util.py", line 924, in _get_configuration_from_setup_py
          config = setup_module.configuration(*args)
        File "scipy/optimize/_highs/setup.py", line 63, in configuration
          _major_dot_minor = _get_version(
        File "scipy/optimize/_highs/setup.py", line 50, in _get_version
          with open(CMakeLists, 'r', encoding='utf-8') as f:
      FileNotFoundError: [Errno 2] No such file or directory: '/home/patacca/software/scipy/scipy/_lib/highs/CMakeLists.txt'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

@tupui
Copy link
Member

tupui commented May 18, 2022

Hi @patacca. This is not how you would install a development version. Please refer to our contributors' documentation. If you have further issues after strictly following the steps, I invite you to open an issue to not mix things with this PR.

@patacca
Copy link

patacca commented May 22, 2022

Is there an ETA for this PR? It seems to pass all the checks

@tupui
Copy link
Member

tupui commented May 22, 2022

It's just waiting for someone knowledgeable to review 😃 Any help is appreciated.

Copy link
Contributor

@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the short summary here is that we basically have a few more functions migrated from Python -> C++ for the distrance metrics, that's something we've recently been +1 on. So, if the performance has generally improved, it seems "ok" in principle at least.

Are we reasonably confident in the state of the testing for these distance functions?

// Under NumPy's relaxed stride checking, dimensions with
// 1 or fewer elements are ignored.
desc.strides[i] = 0;
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff is a bit hard to read because of its size--did you actually delete any of Peter's old infrastructure, or is just getting relocated because of the number of changes here?

@rgommers
Copy link
Member

I'd say this should get a review from @peterbell10. I'll bump the milestone - if it still gets merged in time that's nice to include, otherwise it's for the next release.

@rgommers rgommers modified the milestones: 1.9.0, 1.10.0 May 28, 2022

StridedView1D<T> rownorms_view;
rownorms_view.shape = {x.shape[0]};
rownorms_view.data = new T[x.shape[0]];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid raw new and delete:

Suggested change
rownorms_view.data = new T[x.shape[0]];
auto rownorms_data = std::make_unique<T[]>(x.shape[0]);
rownorms_view.data = rownorm_data.get();

T x_ij = in_data[i*x.strides[0] + j*x.strides[1]];
rownorm += x_ij * x_ij;
}
rownorms_view.data[i] = sqrt(rownorm);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always use std:: because the C versions names don't necessarily have float overloads

Suggested change
rownorms_view.data[i] = sqrt(rownorm);
rownorms_view.data[i] = std::sqrt(rownorm);


StridedView1D<T> rownorms_view;
rownorms_view.shape = {x.shape[0]};
rownorms_view.data = new T[x.shape[0]];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid new

T w_ij = w_data[j*w.strides[0]];
rownorm += w_ij * x_ij * x_ij;
}
rownorms_view.data[i] = sqrt(rownorm);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rownorms_view.data[i] = sqrt(rownorm);
rownorms_view.data[i] = std::sqrt(rownorm);


StridedView1D<T> xrownorms_view;
xrownorms_view.shape = {x.shape[0]};
xrownorms_view.data = new T[x.shape[0]];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid new

}

ForceUnroll<ilp_factor>{}([&](int k) {
out(i + k, 0) = 1.0 - project(dist[k])/(xrownorms(ix.shape[0]) * yrownorms(i + k));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having completely new templates for cosine, seclidean and jensenshannon, what if we have just one new template that passes the indices i and j to map, and i to project. Then the metrics can use or ignore the indices as they require. If that performs the same then it would save a lot of duplication.

@@ -1856,8 +1895,8 @@ class MetricInfo:
canonical_name='jensenshannon',
aka={'jensenshannon', 'js'},
dist_func=jensenshannon,
cdist_func=CDistMetricWrapper('jensenshannon'),
pdist_func=PDistMetricWrapper('jensenshannon'),
cdist_func=_distance_pybind.cdist_jensenshannon,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I run the benchmarks I'm seeing ~50% slow down in unweighted jensenshannon from this PR.

auto x_ratio = x_i / m_i + error_x;
auto y_ratio = y_i / m_i + error_y;
auto log_e = log2(exp(1));
auto s_x = ((T) x_i > 0.0) * x_i * (log2(x_ratio)/log_e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't log2(x)/log_e equal to log(x)? Also need to use std:: here as well.

template <typename T>
struct Acc {
Acc(): js_diverg(0) {}
T js_diverg;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll point out that the existing metrics that have an Acc struct only use it because there are multiple values being accumulated, not as a requirement. If you just wanted to name the value then that's fine though.

T error_y = (y_i == 0.0) * 1e-6;
m_i += (x_i == 0.0 && y_i == 0.0) * 1e-6;
auto x_ratio = x_i / m_i + error_x;
auto y_ratio = y_i / m_i + error_y;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did this come from, I can't see it in the existing C code?

@rgommers
Copy link
Member

rgommers commented Dec 1, 2022

There are unaddressed review comments, so I will clear the milestone from this PR. It'd be great to see this land if you find time at some point @czgdp1807

@rgommers rgommers removed this from the 1.10.0 milestone Dec 1, 2022
@lucascolley lucascolley added needs-work Items that are pending response from the author C/C++ Items related to the internal C/C++ code base labels Dec 22, 2023
@lucascolley lucascolley changed the title ENH: Rewriting distance metrics to _distance_pybind ENH: Rewriting distance metrics to _distance_pybind Jun 2, 2024
@lucascolley lucascolley changed the title ENH: Rewriting distance metrics to _distance_pybind ENH: spatial: Rewriting distance metrics to _distance_pybind Jun 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C/C++ Items related to the internal C/C++ code base enhancement A new feature or improvement needs-work Items that are pending response from the author scipy.spatial
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Different behaviour of scipy.spatial.distance.cdist depending on parameters
7 participants