Optimize K-means #97

Merged: 15 commits merged into rust-ml:master from kmeans-opt on Mar 18, 2021
Conversation

@YuhanLiin (Collaborator)

Optimized the averaging step of K-means to use non-moving averages and arrays instead of hashmaps. The code also updates the new centroids in place. Also fixed the deprecation warnings on all benchmarks in linfa-clustering (other benchmarks may have the same warnings). I'm curious about the effect of my changes on benchmark speeds, since my machine isn't well suited to running benchmarks.
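
As a rough illustration of the approach (not the PR's actual code; the function name and signature here are made up), the non-moving averaging step with plain arrays looks roughly like this:

use ndarray::{Array1, Array2, ArrayView1, ArrayView2};

// Illustrative sketch only: sum each cluster's samples into a plain 2-D array,
// count members in a 1-D array, then divide once at the end (a standard,
// non-moving average) instead of keeping a running mean in a HashMap.
fn compute_centroids_sketch(
    n_clusters: usize,
    observations: ArrayView2<f64>,  // (n_samples, n_features)
    memberships: ArrayView1<usize>, // cluster index assigned to each sample
) -> Array2<f64> {
    let n_features = observations.ncols();
    let mut centroids = Array2::<f64>::zeros((n_clusters, n_features));
    let mut counts = Array1::<f64>::zeros(n_clusters);

    for (sample, &cluster) in observations.outer_iter().zip(memberships.iter()) {
        // Update the centroid buffer in place: add the sample to its cluster's sum.
        let mut row = centroids.row_mut(cluster);
        row += &sample;
        counts[cluster] += 1.0;
    }

    // One division per cluster turns the sums into plain averages.
    for (mut row, &count) in centroids.outer_iter_mut().zip(counts.iter()) {
        if count > 0.0 {
            row /= count;
        }
    }
    centroids
}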

@Sauro98 (Member) commented Mar 15, 2021

Hi, thanks for the PR! I've been busy today but I will gladly review this tomorrow 👍🏻

@Sauro98 (Member) commented Mar 16, 2021

I'm taking some more time to review this because on my machine these changes actually bring the performance down significantly, at least as far as the benchmarks are concerned, but I am also getting some outliers, so I'd like to investigate this a little more. Also, I want to make sure that the performance is not impacted by the change in the bencher (it shouldn't be, but I'd still like to check).

These are the results that I get when I run cargo bench k_means on my machine:

  • On branch master:
Benchmarking naive_k_means/10: Warming up for 3.0000 s
naive_k_means/10        time:   [1.0439 ms 1.1075 ms 1.1703 ms]          
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe
naive_k_means/100       time:   [5.1121 ms 5.4133 ms 5.7409 ms]                         
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
naive_k_means/1000      time:   [35.332 ms 37.381 ms 39.503 ms]              
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
Benchmarking naive_k_means/10000: Warming up for 3.0000 s
naive_k_means/10000     time:   [233.40 ms 243.14 ms 253.00 ms]   
  • On branch kmeans-opt:
naive_k_means/naive_k_means/10                                                                            
                        time:   [5.3906 ms 8.3629 ms 11.660 ms]
Found 19 outliers among 100 measurements (19.00%)
  19 (19.00%) high severe
naive_k_means/naive_k_means/100                                                                            
                        time:   [16.939 ms 24.063 ms 31.794 ms]
Found 24 outliers among 100 measurements (24.00%)
  24 (24.00%) high severe
Benchmarking naive_k_means/naive_k_means/1000: Warming up for 3.0000 s
naive_k_means/naive_k_means/1000                                                                            
                        time:   [66.666 ms 94.997 ms 126.49 ms]
Found 19 outliers among 100 measurements (19.00%)
  1 (1.00%) high mild
  18 (18.00%) high severe
Benchmarking naive_k_means/naive_k_means/10000: Warming up for 3.0000 s
naive_k_means/naive_k_means/10000                                                                            
                        time:   [284.80 ms 436.71 ms 613.42 ms]
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe

Have you tried running the benches on your machine? Do you get similar results?
My machine is definitely not ideal for benchmarking, so the worse performance could be the machine's fault. If you get different results from me, please share them so that I know what results to expect.

@YuhanLiin (Collaborator, Author)

It's also slower on my machine, but I'm on a 2-core i7-6500U that runs faster without Rayon, so I didn't take the results too seriously. I mainly wanted someone with a better machine to confirm my results. I'm also curious what flamegraphs you get from the benchmarks, because my flamegraphs for both versions indicated that the assignment step was the bottleneck, not the averaging step.

@YuhanLiin (Collaborator, Author)

I'm seeing similar results to yours in my benchmarks, but my flamegraphs indicate that the averaging step isn't the bottleneck.

@YuhanLiin (Collaborator, Author)

These are my results with cargo bench k_means -- --verbose.
With just the benchmark fixes

naive_k_means/naive_k_means/10                                                                                                                                                      
                          time:   [1.1976 ms 1.2291 ms 1.2623 ms]                                                                                                                     
                          change: [-14.458% -7.8971% -1.1469%] (p = 0.03 < 0.05)                                                                                                      
                          Performance has improved.                                                                                                                                   
  Found 4 outliers among 100 measurements (4.00%)                                                                                                                                     
    4 (4.00%) high mild                                                                                                                                                               
  slope  [1.1976 ms 1.2623 ms] R^2            [0.5064945 0.5050213]                                                                                                                   
  mean   [1.2193 ms 1.2825 ms] std. dev.      [132.98 us 187.69 us]                                                                                                                   
  median [1.1873 ms 1.2554 ms] med. abs. dev. [105.86 us 182.06 us]                                                                                                                                                                                        
  naive_k_means/naive_k_means/100                                                                                                                                                     
                          time:   [7.4920 ms 7.7974 ms 8.1227 ms]                                                                                                                     
                          change: [-0.9476% +5.6558% +12.591%] (p = 0.10 > 0.05)                                                                                                      
                          No change in performance detected.                                                                                                                          
  Found 3 outliers among 100 measurements (3.00%)                                                                                                                                     
    2 (2.00%) high mild                                                                                                                                                               
    1 (1.00%) high severe                                                                                                                                                             
  mean   [7.4920 ms 8.1227 ms] std. dev.      [1.2859 ms 1.9477 ms]                                                                                                                   
  median [7.1343 ms 7.8455 ms] med. abs. dev. [921.75 us 1.7574 ms]                                                                                                                                                                                          
  naive_k_means/naive_k_means/1000                                                                                                                                                    
                          time:   [46.172 ms 48.628 ms 51.174 ms]                                                                                                                     
                          change: [+3.4581% +10.997% +19.390%] (p = 0.00 < 0.05)                                                                                                      
                          Performance has regressed.                                                                                                                                  
  Found 1 outliers among 100 measurements (1.00%)                                                                                                                                     
    1 (1.00%) high mild                                                                                                                                                               
  mean   [46.172 ms 51.174 ms] std. dev.      [10.782 ms 14.653 ms]                                                                                                                   
  median [44.011 ms 49.039 ms] med. abs. dev. [8.9856 ms 15.345 ms]                                                                                                                   
  naive_k_means/naive_k_means/10000                                                                                                                                                   
                          time:   [282.75 ms 296.74 ms 311.53 ms]                                                                                                                     
                          change: [-5.9549% +0.8865% +8.3379%] (p = 0.81 > 0.05)                                                                                                      
                          No change in performance detected.                                                                                                                          
  Found 3 outliers among 100 measurements (3.00%)                                                                                                                                     
    3 (3.00%) high mild                                                                                                                                                               
  mean   [282.75 ms 311.53 ms] std. dev.      [59.775 ms 87.050 ms]                                                                                                                   
  median [261.28 ms 300.02 ms] med. abs. dev. [46.431 ms 80.503 ms]  

With all changes

naive_k_means/naive_k_means/10                                                                                                                                                      
                          time:   [5.0783 ms 7.6619 ms 10.578 ms]                                                                                                                     
                          change: [+312.70% +512.85% +725.71%] (p = 0.00 < 0.05)                                                                                                      
                          Performance has regressed.                                                                                                                                  
  Found 19 outliers among 100 measurements (19.00%)                                                                                                                                   
    1 (1.00%) high mild                                                                                                                                                               
    18 (18.00%) high severe                                                                                                                                                           
  mean   [5.0783 ms 10.578 ms] std. dev.      [11.457 ms 16.126 ms]                                                                                                                   
  median [1.0480 ms 1.1344 ms] med. abs. dev. [127.76 us 227.01 us]                                                                                                                                                                                                                                         
  naive_k_means/naive_k_means/100                                                                                                                                                     
                          time:   [23.279 ms 33.091 ms 43.804 ms]                                                                                                                     
                          change: [+195.19% +324.39% +472.32%] (p = 0.00 < 0.05)                                                                                                      
                          Performance has regressed.                                                                                                                                  
  Found 24 outliers among 100 measurements (24.00%)                                                                                                                                   
    24 (24.00%) high severe                                                                                                                                                           
  mean   [23.279 ms 43.804 ms] std. dev.      [41.019 ms 62.016 ms]                                                                                                                   
  median [5.4170 ms 6.4924 ms] med. abs. dev. [1.2616 ms 2.6658 ms]                                                                                                                                                                                                                                         
  naive_k_means/naive_k_means/1000                                                                                                                                                    
                          time:   [77.660 ms 111.70 ms 148.71 ms]                                                                                                                     
                          change: [+57.755% +129.70% +219.54%] (p = 0.00 < 0.05)                                                                                                      
                          Performance has regressed.                                                                                                                                  
  Found 18 outliers among 100 measurements (18.00%)                                                                                                                                   
    18 (18.00%) high severe                                                                                                                                                           
  mean   [77.660 ms 148.71 ms] std. dev.      [140.19 ms 217.17 ms]                                                                                                                   
  median [29.247 ms 33.316 ms] med. abs. dev. [6.4509 ms 13.765 ms]                                                                                                                                                                                                                                      
naive_k_means/naive_k_means/10000                                                                                                                                                   
                          time:   [337.02 ms 513.33 ms 720.10 ms]                                                                                                                     
                          change: [+9.3530% +72.989% +143.89%] (p = 0.03 < 0.05)                                                                                                      
                          Performance has regressed.                                                                                                                                  
  Found 10 outliers among 100 measurements (10.00%)                                                                                                                                   
    10 (10.00%) high severe                                                                                                                                                           
  mean   [337.02 ms 720.10 ms] std. dev.      [730.82 ms 1.2520 s]                                                                                                                    
  median [170.21 ms 188.99 ms] med. abs. dev. [34.806 ms 56.924 ms]   

The perf problems seem to be caused entirely by the outliers, since the median is so much lower than the mean.

@YuhanLiin (Collaborator, Author) commented Mar 17, 2021

I've tried profiling the code with flamegraph as well as measuring the times directly, and compute_centroids has never shown up as a bottleneck. The only way to improve benchmark performance beyond the original version was to track averages using an empty HashMap. Any other way of tracking averages (using a vector, using a pre-filled HashMap) ends up severely degrading the performance. Vector/array bounds checks may have been the culprit in earlier iterations, but I have absolutely no idea why this solution is so much better. One thing to note is that hashing functions aren't called for HashMaps with fewer than 32 entries, so this solution (and the solution in master) may perform differently with more than 32 clusters.

To see how weird this is, go to line 272 of algorithm.rs and add

(0..n_clusters).for_each(|i| {                                                                                                                              
    averages.insert(i, (Array1::zeros(n_features), F::from(0.0).unwrap()));                                                                                                
});  

and see how that affects the benchmarks.

@bytesnake (Member)

I've tried profiling the code with flamegraph as well as measuring the times directly, and compute_centroids has never shown up as a bottleneck. […] To see how weird this is, go to line 272 of algorithm.rs, add the pre-filling snippet above, and see how that affects the benchmarks.

I will try to run the benchmarks now. This seems really strange; one thing that stood out to me is that performance degrades when you pre-fill both the hashmap and the vector, so the problem probably lies there. In commit 561d9b7 you could also use a 2D array directly with shape (num_clusters + 1, num_features), where the last row is your count, then normalize and drop the last row.
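
For illustration only, one way to read that suggestion is to keep the running sums and the per-cluster counts in a single contiguous 2-D array and drop the extra slice after normalizing (sketched here with the count in a trailing column; names and shapes are made up, this is not the code from 561d9b7):

use ndarray::{s, Array2, ArrayView1, ArrayView2};

// Accumulate sums and counts in one buffer: each row holds a cluster's feature
// sums plus, in the last column, the number of samples assigned to it.
fn centroids_single_buffer(
    n_clusters: usize,
    observations: ArrayView2<f64>,
    memberships: ArrayView1<usize>,
) -> Array2<f64> {
    let n_features = observations.ncols();
    let mut acc = Array2::<f64>::zeros((n_clusters, n_features + 1));

    for (sample, &cluster) in observations.outer_iter().zip(memberships.iter()) {
        for (j, &x) in sample.iter().enumerate() {
            acc[[cluster, j]] += x;
        }
        acc[[cluster, n_features]] += 1.0; // count column
    }

    // Normalize each row by its count, then drop the count column.
    for mut row in acc.outer_iter_mut() {
        let count = row[n_features];
        if count > 0.0 {
            row /= count;
        }
    }
    acc.slice(s![.., ..n_features]).to_owned()
}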

@bytesnake (Member) commented Mar 17, 2021

Without changes to algorithm.rs:

naive_k_means/naive_k_means/10                        
                        time:   [1.5642 ms 1.6362 ms 1.7137 ms]
                        change: [-9.0982% -3.8257% +1.9977%] (p = 0.19 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
Benchmarking naive_k_means/naive_k_means/100: Collecting 100 samples in estimated 5.9021 s
naive_k_means/naive_k_means/100
                        time:   [8.9471 ms 9.3102 ms 9.6769 ms]
                        change: [-0.8811% +4.6403% +10.543%] (p = 0.10 > 0.05)
                        No change in performance detected.
Benchmarking naive_k_means/naive_k_means/1000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.9s, or reduce sample count to 80.
Benchmarking naive_k_means/naive_k_means/1000: Collecting 100 samples in estimated 5.8972 s
naive_k_means/naive_k_means/1000
                        time:   [45.065 ms 47.675 ms 50.373 ms]
                        change: [-11.544% -4.7009% +3.0629%] (p = 0.21 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
Benchmarking naive_k_means/naive_k_means/10000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 28.8s, or reduce sample count to 10.
Benchmarking naive_k_means/naive_k_means/10000: Collecting 100 samples in estimated 28.840 s
naive_k_means/naive_k_means/10000
                        time:   [318.61 ms 336.68 ms 358.62 ms]
                        change: [+13.476% +22.160% +31.329%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

With changes (794d8a5)

naive_k_means/naive_k_means/10                        
                        time:   [1.3526 ms 1.3857 ms 1.4196 ms]
                        change: [-17.931% -14.437% -10.805%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Benchmarking naive_k_means/naive_k_means/100: Collecting 100 samples in estimated 5.3013 s
naive_k_means/naive_k_means/100
                        time:   [11.366 ms 12.488 ms 13.704 ms]
                        change: [+21.267% +34.128% +49.536%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
Benchmarking naive_k_means/naive_k_means/1000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.3s, or reduce sample count to 70.
Benchmarking naive_k_means/naive_k_means/1000: Collecting 100 samples in estimated 6.3159 s
naive_k_means/naive_k_means/1000
                        time:   [46.636 ms 50.043 ms 53.764 ms]
                        change: [-4.1604% +4.9669% +14.524%] (p = 0.30 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
Benchmarking naive_k_means/naive_k_means/10000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 26.3s, or reduce sample count to 10.
Benchmarking naive_k_means/naive_k_means/10000: Collecting 100 samples in estimated 26.330 s
naive_k_means/naive_k_means/10000
                        time:   [277.86 ms 291.15 ms 305.68 ms]
                        change: [-20.009% -13.525% -6.7332%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

With changes: (4e38c38)

naive_k_means/naive_k_means/10                        
                        time:   [5.9038 ms 8.9010 ms 12.260 ms]
                        change: [+327.31% +542.32% +809.45%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  1 (1.00%) high mild
  18 (18.00%) high severe
Benchmarking naive_k_means/naive_k_means/100: Collecting 100 samples in estimated 5.3996 s
naive_k_means/naive_k_means/100
                        time:   [23.784 ms 33.923 ms 44.968 ms]
                        change: [+91.759% +171.66% +264.89%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 24 outliers among 100 measurements (24.00%)
  24 (24.00%) high severe
Benchmarking naive_k_means/naive_k_means/1000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 57.5s, or reduce sample count to 10.
Benchmarking naive_k_means/naive_k_means/1000: Collecting 100 samples in estimated 57.475 s
naive_k_means/naive_k_means/1000
                        time:   [97.606 ms 139.01 ms 184.61 ms]
                        change: [+88.374% +177.79% +277.91%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  18 (18.00%) high severe
Benchmarking naive_k_means/naive_k_means/10000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 244.1s, or reduce sample count to 10.
Benchmarking naive_k_means/naive_k_means/10000: Collecting 100 samples in estimated 244.07 s
naive_k_means/naive_k_means/10000
                        time:   [372.39 ms 595.09 ms 848.03 ms]
                        change: [+22.358% +104.39% +191.14%] (p = 0.01 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe
cpu info

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 69
model name	: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
stepping	: 1
microcode	: 0x26
cpu MHz		: 1846.742
cache size	: 3072 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags	: vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips	: 4591.59
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processors 1-3	: same Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz package (2 cores, 4 threads), with identical family, cache, flags and bugs as processor 0

@YuhanLiin (Collaborator, Author)

I've looked a bit more and the problem seems to be with the benchmarks. For some reason, if I move all the initialization code out of the benchmark function, I observe more reasonable performance numbers. I'm still observing runtime spikes with a cluster size of 100, which look to be RNG-dependent.

@bytesnake (Member)

Which is still strange, because only the code in bencher.iter(|| ...) should be relevant.

@bytesnake (Member)

In commit 561d9b7 you could also use a 2D array directly with shape (num_clusters + 1, num_features), where the last row is your count, then normalize and drop the last row.

just saw that this was your original version

@bytesnake (Member) commented Mar 17, 2021

Reducing the number of iterations to 100 seems to make the benchmark results more reasonable. If you don't know already, you can look at target/criterion/naive_k_means/report/index.html for an HTML report of the benchmark results.

@bytesnake (Member)

Can you try reinitializing the RNG for each new parameter set? This gives me much more consistent results.
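
For reference, a minimal sketch of what re-seeding per parameter set could look like in a criterion benchmark (hypothetical names and a dummy workload, not the actual linfa-clustering bench):

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use rand::{Rng, SeedableRng};
use rand_isaac::Isaac64Rng;

fn k_means_bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("naive_k_means");
    for &cluster_size in &[10usize, 100, 1_000, 10_000] {
        // Re-seed the RNG for every parameter set, so each case sees identical
        // data no matter how many samples earlier cases consumed.
        let mut rng = Isaac64Rng::seed_from_u64(42);
        let data: Vec<f64> = (0..cluster_size * 2).map(|_| rng.gen()).collect();

        group.bench_with_input(BenchmarkId::new("naive_k_means", cluster_size), &data, |b, d| {
            // A real benchmark would fit KMeans on `d`; a sum stands in here.
            b.iter(|| d.iter().sum::<f64>())
        });
    }
    group.finish();
}

criterion_group!(benches, k_means_bench);
criterion_main!(benches);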

@YuhanLiin (Collaborator, Author)

I did some more digging and the real reason behind the perf issues is that some inputs run for all 1000 iterations, which doesn't usually happen in the benchmarks. This happens because somewhere along the way the centroids are calculated as NaN, which messes up the algorithm.

@YuhanLiin (Collaborator, Author)

Performance issues were caused by empty clusters producing an average of NaN (a bug in my code, not in master). I'm setting empty centroids to 0 right now, but there may be a better value. This may also indicate an issue with the random cluster initialization we're using (can we use a better initialization?).
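
A tiny illustration of the failure mode and the guard (hypothetical helper, not the PR's code): an empty cluster divides a zero sum by a zero count, which is NaN in floating point.

// Empty clusters divide a zero sum by a zero count; without a guard the
// centroid becomes NaN and the algorithm never converges.
fn average_or_zero(sum: f64, count: f64) -> f64 {
    if count > 0.0 {
        sum / count
    } else {
        0.0 // empty cluster: fall back to 0 here, as the PR currently does
    }
}

fn main() {
    assert!((0.0_f64 / 0.0).is_nan()); // the buggy path
    assert_eq!(average_or_zero(0.0, 0.0), 0.0); // the guarded path
}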

@Sauro98 (Member) commented Mar 18, 2021

Maybe it would be a good idea to look into k-means++ for center initialization? I see that the kmeans crate implements it, so maybe that could be used as a reference. Provided that you have the time, of course 👍🏻
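
For reference, a rough sketch of the k-means++ seeding idea (plain Vecs and the rand crate, not the kmeans crate's implementation): the first center is sampled uniformly, and each further center is sampled with probability proportional to its squared distance from the nearest center chosen so far.

use rand::{distributions::WeightedIndex, prelude::*};

// Pick `k` initial centers from `points` following the k-means++ weighting.
fn kmeans_pp_init(points: &[Vec<f64>], k: usize, rng: &mut impl Rng) -> Vec<Vec<f64>> {
    let mut centers = Vec::with_capacity(k);
    // First center: a uniformly random point.
    centers.push(points.choose(rng).expect("empty dataset").clone());

    while centers.len() < k {
        // Weight every point by its squared distance to the closest chosen center,
        // so far-away points are more likely to become the next center.
        let weights: Vec<f64> = points
            .iter()
            .map(|p| {
                centers
                    .iter()
                    .map(|c| p.iter().zip(c).map(|(a, b)| (a - b).powi(2)).sum::<f64>())
                    .fold(f64::INFINITY, f64::min)
            })
            .collect();
        let dist = WeightedIndex::new(&weights).expect("at least one non-zero weight");
        centers.push(points[dist.sample(rng)].clone());
    }
    centers
}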

@bytesnake (Member) commented Mar 18, 2021

I'm getting similar issues with rand::rngs::SmallRng, but with thread_rng everything seems to work fine. Cluster collapse should terminate the algorithm and return an error IMO, or the API has to handle it specifically (either by setting centroids to NaN or by using optional centroids, but then we can't store everything in contiguous memory anymore).
The cluster collapse is caused by using standard normal distributions with a shift (https://github.com/rust-ml/linfa/blob/master/algorithms/linfa-clustering/src/utils.rs#L44), which makes the sample distribution very sparse, and then using random initialization by sampling the dataset just four times, which makes it very likely that two centroids land in the same cluster. @Sauro98 yep, we should add different initialization strategies. For a quick fix: you can add a parameter to utils::generate_blobs for the standard deviation and then set σ=10.0.

@bytesnake (Member)

btw: have you heard about the iai project? It's from the same author as criterion and can directly measure cache/memory accesses and instruction counts instead of wall-clock time. The beauty is that it can be reliably deployed to CI systems (as opposed to criterion.rs) and only needs a single pass. We could probably use the metrics for a dashboard like the rustc performance one and publish it with every new release.
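
For anyone curious, an iai benchmark is just a plain function registered with its main! macro, run under Cachegrind to count instructions and cache accesses; something like the hypothetical sketch below (the workload is a stand-in, not the real linfa benchmark):

use iai::black_box;

// Placeholder workload; a real benchmark would fit KMeans on a fixed dataset.
fn sum_of_squares() -> f64 {
    (0..10_000).map(|i| black_box(i as f64).powi(2)).sum()
}

iai::main!(sum_of_squares);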

@bytesnake (Member)

So what are the action items here now: should we merge the PR? Does the performance gain justify having two implementations, one for incremental K-means and one not?

@Sauro98 (Member) commented Mar 18, 2021

btw: have you heard about the iai project? It's from the same author as criterion and can directly measure cache/memory accesses and instruction counts instead of wall-clock time. […]

That project sounds promising, especially since it can be integrated with the CI system; it would give benchmarks some more importance, which is cool. Having a dashboard like the one you linked on the website could, I believe, get some more people interested in the project. I will definitely explore it a little bit 👍🏻

So what are the action items here now: should we merge the PR? Does the performance gain justify having two implementations, one for incremental K-means and one not?

This evening I'll be able to run the benchmarks again with the latest changes to get a feel for the extent of the improvements. Why do you mention two implementations? Doesn't this PR just overwrite the existing one?

@bytesnake (Member)

This evening I'll be able to run the benchmarks again with the latest changes to get a feel for the extent of the improvements. Why do you mention two implementations? Doesn't this PR just overwrite the existing one?

Sorry, I meant that in the context of scikit-learn: it contains two implementations, KMeans and MiniBatchKMeans. The moving average is currently only used in our single-fit implementation, but could be used for an incremental version (by implementing https://github.com/rust-ml/linfa/blob/master/src/traits.rs#L38) in the future.

@YuhanLiin (Collaborator, Author)

The PR changed the averaging algorithm from a moving average to a standard average. For handling cluster collapse I vastly prefer returning an Error. I'm definitely against setting centroids to NaN because it causes perf drops in the benchmarks. Center initialization and incremental fit are out of scope for this PR.
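
For contrast, the moving (incremental) average that master keeps updates each centroid as samples arrive instead of summing first and dividing once; roughly (illustrative code, not the helpers.rs implementation):

// new_mean = old_mean + (x - old_mean) / n, applied per feature.
fn update_running_mean(mean: &mut [f64], count: &mut f64, sample: &[f64]) {
    *count += 1.0;
    for (m, &x) in mean.iter_mut().zip(sample) {
        *m += (x - *m) / *count;
    }
}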

@Sauro98 (Member) commented Mar 18, 2021

Sorry, I meant that in the context of scikit-learn: it contains two implementations, KMeans and MiniBatchKMeans. The moving average is currently only used in our single-fit implementation, but could be used for an incremental version (by implementing https://github.com/rust-ml/linfa/blob/master/src/traits.rs#L38) in the future.

Just from reading scikit-learn's user guide, it seems that they do use two different approaches for mini-batch and regular k-means, so maybe it would make sense to keep the non-moving average implementation for k-means if it increases performance. With a future incremental fit implementation in mind, maybe it is better to keep the helpers module so that a moving average struct is already available?

Performance issues were caused by empty clusters producing an average of NaN (bug in my code, not master).

What is the behavior of the implementation in master in such cases?

@bytesnake (Member)

When we merge I would like to keep the moving average implementation as well. I just ran the benchmark, new version vs. old version:

Benchmarking naive_k_means/10: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.0s, enable flat sampling, or reduce sample count to 50.
naive_k_means/10        time:   [1.4124 ms 1.4485 ms 1.4861 ms]                              
                        change: [-5.3063% -2.6335% +0.1510%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
naive_k_means/100       time:   [8.6071 ms 8.9256 ms 9.2593 ms]                              
                        change: [-5.9313% -1.2229% +4.2687%] (p = 0.64 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking naive_k_means/1000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.6s, or reduce sample count to 80.
naive_k_means/1000      time:   [54.948 ms 58.965 ms 63.304 ms]                               
                        change: [+6.7258% +16.787% +27.683%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Benchmarking naive_k_means/10000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 32.6s, or reduce sample count to 10.
naive_k_means/10000     time:   [303.20 ms 317.37 ms 332.34 ms]                                
                        change: [-7.3419% -1.4865% +5.0620%] (p = 0.65 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

@Sauro98 (Member) commented Mar 18, 2021

On my machine:

master:

Benchmarking naive_k_means/10: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.8s, enable flat sampling, or reduce sample count to 60.
naive_k_means/10        time:   [1.2289 ms 1.2922 ms 1.3656 ms]                              
                        change: [-14.414% -5.6616% +3.8992%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
naive_k_means/100       time:   [5.3927 ms 5.6523 ms 5.9290 ms]                               
                        change: [-3.3688% +3.8261% +11.909%] (p = 0.33 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
naive_k_means/1000      time:   [38.752 ms 41.189 ms 43.716 ms]                               
                        change: [-7.9562% +1.1587% +10.389%] (p = 0.80 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
Benchmarking naive_k_means/10000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 27.1s, or reduce sample count to 10.
naive_k_means/10000     time:   [250.79 ms 261.14 ms 271.86 ms]                                
                        change: [-0.2075% +5.6954% +11.811%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

kmeans-opt:

Benchmarking naive_k_means/naive_k_means/10: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.6s, enable flat sampling, or reduce sample count to 60.
naive_k_means/naive_k_means/10                                                                             
                        time:   [1.2419 ms 1.2810 ms 1.3237 ms]
                        change: [-9.7741% +1.1279% +12.853%] (p = 0.87 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) low mild
  6 (6.00%) high mild
  7 (7.00%) high severe
naive_k_means/naive_k_means/100                                                                             
                        time:   [3.6659 ms 3.8628 ms 4.0931 ms]
                        change: [+1.2612% +8.0094% +15.932%] (p = 0.02 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
naive_k_means/naive_k_means/1000                                                                            
                        time:   [23.170 ms 24.227 ms 25.356 ms]
                        change: [-10.836% -3.4829% +4.3460%] (p = 0.38 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe
Benchmarking naive_k_means/naive_k_means/10000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 11.1s, or reduce sample count to 40.
naive_k_means/naive_k_means/10000                                                                            
                        time:   [104.28 ms 107.76 ms 112.22 ms]
                        change: [-25.250% -19.050% -11.754%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

I would say that the improvement in speed is quite good 👍🏻

@YuhanLiin (Collaborator, Author) commented Mar 18, 2021

What is the behavior of the implementation in master in such cases?

Since master uses moving averages, it just sets the centroid to 0.
I'll look into the perf of moving average vs. standard average. Also, the benchmarks for this PR have more outliers than master's, which is strange.

@bytesnake (Member) commented Mar 18, 2021 via email

@Sauro98 (Member) left a comment


could you undelete algorithms/linfa-clustering/src/k_means/helpers.rs so that we keep the incremental mean implementation even if it is not currently used?

@bytesnake (Member)

nah you can always find the file in the git history, no need to add dangling files 😆

@bytesnake mentioned this pull request on Mar 18, 2021
@bytesnake (Member)

without changes:

naive_k_means/10        time:   [1.4574 ms 1.4977 ms 1.5400 ms]                              
                        change: [+1.0481% +4.3349% +7.5120%] (p = 0.01 < 0.05)
                        Performance has regressed.
naive_k_means/100       time:   [8.8972 ms 9.2581 ms 9.6355 ms]                              
                        change: [-1.6644% +3.7247% +9.4769%] (p = 0.19 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking naive_k_means/1000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, or reduce sample count to 80.
Benchmarking naive_k_means/1000: Collecting 100 samples in estimated 6.0698 s (100 iterations)
naive_k_means/1000      time:   [46.133 ms 48.795 ms 51.541 ms]
                        change: [-24.440% -17.248% -9.3971%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking naive_k_means/10000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 29.2s, or reduce sample count to 10.
Benchmarking naive_k_means/10000: Collecting 100 samples in estimated 29.184 s (100 iterations)
naive_k_means/10000     time:   [312.75 ms 327.89 ms 345.63 ms]
                        change: [-3.0299% +3.3159% +9.9612%] (p = 0.36 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

with changes:

naive_k_means/naive_k_means/10                                                                             
                        time:   [1.3423 ms 1.3659 ms 1.3937 ms]
                        change: [+4.5462% +6.5943% +8.6149%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
naive_k_means/naive_k_means/100                                                                            
                        time:   [5.5286 ms 5.6984 ms 5.9126 ms]
                        change: [+12.994% +16.524% +20.862%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
naive_k_means/naive_k_means/1000                                                                            
                        time:   [30.290 ms 31.026 ms 31.879 ms]
                        change: [+4.2413% +6.5199% +9.3345%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  8 (8.00%) high mild
  9 (9.00%) high severe
Benchmarking naive_k_means/naive_k_means/10000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 13.2s, or reduce sample count to 30.
naive_k_means/naive_k_means/10000                                                                            
                        time:   [126.78 ms 127.38 ms 128.09 ms]
                        change: [+0.0205% +1.1580% +2.1152%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

👍 if no objections are raised I would merge this now

@bytesnake merged commit da18edd into rust-ml:master on Mar 18, 2021
@YuhanLiin deleted the kmeans-opt branch on Mar 18, 2021 at 21:24