Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed. Hot nasty bad-ass speed! #8

Closed
JimBobSquarePants opened this issue Oct 30, 2016 · 60 comments
Closed

Speed. Hot nasty bad-ass speed! #8

JimBobSquarePants opened this issue Oct 30, 2016 · 60 comments

Comments

@JimBobSquarePants
Copy link
Member

From @JimBobSquarePants on March 16, 2016 9:44

We've got a pretty good feature set for a V1 release.

Now we need to make it fast.

Things to look at:

  • Reducing array allocation ArrayPool, Slice<T>
  • Unsafe code where applicable
  • Moar SIMD
  • Algorithmic tightening
  • Reduce pixel looping
  • Other things

For benchmarking we can use BenchMarkDotNet now that the prerelease supports CoreFX

i-wanna-go-fast-ricky-bobby

Add your thoughts below.

Copied from original issue: JimBobSquarePants/ImageProcessor#347

@JimBobSquarePants
Copy link
Member Author

From @mavenius on March 16, 2016 10:11

Might want to check out Scientist.Net for comparing performance and
consistency between old and optimized branches. It's fairly new, but I've
read good things.

Thank you,

Mark Avenius
On Mar 16, 2016 5:44 AM, "James Jackson-South" notifications@github.com
wrote:

We've got a pretty good feature set
https://github.com/JimBobSquarePants/ImageProcessor#what-works-so-far-what-is-planned
for a V1 release.

Now we need to make it fast.

Things to look at:

For benchmarking we can use BenchMarkDotNet
https://github.com/PerfDotNet/BenchmarkDotNet now that the prerelease
supports CoreFX

[image: i-wanna-go-fast-ricky-bobby]
https://cloud.githubusercontent.com/assets/385879/13807908/b9e66592-ebb7-11e5-8c21-c7596ff3264e.gif

Add your thoughts below.


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
https://github.com/JimBobSquarePants/ImageProcessor/issues/347

@JimBobSquarePants
Copy link
Member Author

From @rold2007 on March 17, 2016 8:35

The first task is obviously to identify the slow parts. I have never used BenchMarkDotNet but I heard about it recently, I guess this is a good opportunity to try it!
Maybe a new set of tests (performance) should be created for this? I think the benchmark should be done on small, medium, large and huge images (32x32, 512x512, 1024x1024, 10Kx10K ?).

I never heard of Scientist.Net before but maybe it could be used to compare even with other similar libraries (AForge.Net, etc.) ?

I'll do some tests with BenchMarkDotNet to see if it does the job, and report back here.

@JimBobSquarePants
Copy link
Member Author

That would be great if you could. I had a read through the instructions but I am terrible when it comes to following them.

@JimBobSquarePants
Copy link
Member Author

From @rold2007 on March 23, 2016 8:9

I haven't been able to use BenchmarkDotNet with DNX so far, even on the most basic code. I'll report my issues to the BenchmarkDotNet team and see what can be done. Sorry about the delay.

@JimBobSquarePants
Copy link
Member Author

@rold2007 No worries, thanks for trying. I'm sure someone there will be happy to give us pointers.

@JimBobSquarePants
Copy link
Member Author

From @rold2007 on March 23, 2016 8:50

@JimBobSquarePants It is now resolved. I'll be quite busy until next week but I'll continue with this asap.

@JimBobSquarePants
Copy link
Member Author

@rold2007 Awesome! Here's hoping we can get some great info to find bottlenecks.

@JimBobSquarePants
Copy link
Member Author

From @voidstar69 on May 1, 2016 9:4

I have tried installing BenchmarkDotNet into ImageProcessorCore.Tests via Nuget, both the official release and the prerelease. I cannot use either - the prerelease appears to need the NETStandard libraries, and the official release appears to not support DNXCore 5.0.

@JimBobSquarePants @rold2007 Have you managed to get BenchmarkDotNet working to benchmark an existing unit test in ImageProcessorCore?

@JimBobSquarePants
Copy link
Member Author

@voidstar69 Truth be told I've not tried, I've only used it outside for testing. I'll have a look as soon as I can.

It's probably best, however, to put the benchmark tests into a separate NET 4.6 console project so that I can do direct comparisons against System.Drawing.

@JimBobSquarePants
Copy link
Member Author

@voidstar69 Just added it to the repo now 😄

@JimBobSquarePants
Copy link
Member Author

From @voidstar69 on May 19, 2016 21:12

@JimBobSquarePants Thanks for adding BenchmarkDotNet to the repo, looks like it will be very useful to compare against System.Drawing.

I have started off by looking at the performance of Crop. On my laptop the System.Drawing version take about 800us, while the Core version takes about 1380us.
I tried a minor optimisation in Crop.cs, changing this:

for (int x = startX; x < endX; x++)
{
    target[x, y] = source[x + sourceX, y + sourceY];
}

to this:

Array.Copy(source.Pixels, (((y + sourceY) * source.Width) + (startX + sourceX)) * 4,
    target.Pixels, ((y * target.Width) + startX) * 4,
    (endX - startX) * 4);

This appears to speed up the crop from 1380us to about 1330us. A small but significant improvement. You should test for a similar performance improvement on your PC, in case this is specific to my PC.

One downside is that the array index calculation is now exposed in Crop.cs, whereas before it was hidden away in ImageBase.this[].
Some other things I noticed:

  • If the target rectangle (top-left) x and y are non-zero, I believe the crop will be incorrect because the source coordinates are offset by this target x and y.
  • ImageBase pins the memory of pixelArray, but never calls GCHandle.Free to unpin this memory. Additionally keeping lots of memory blocks pinned for long periods will degrade the garbage collector. As long as ImageBase objects are constructed and then disposed not long later, this probably is not a problem. The alternative would be to pin this memory on-demand, and unpin it at a convenient time, i.e. at the end of each image operation.

@JimBobSquarePants
Copy link
Member Author

Hi @voidstar69 Thanks for having a look at this. Have you go the latest version of the codebase?

I have a test against Crop here which resulted in Crop taking half the time of System.Drawing on my machine. Admittedly results will vary depending on how many cores you have (Also make sure you are running in release mode?)

If the target rectangle (top-left) x and y are non-zero, I believe the crop will be incorrect because the source coordinates are offset by this target x and y.

I'll need to add a unit test to double check against the target. The current code was submitted as a bug fix.

ImageBase pins the memory of pixelArray, but never calls GCHandle.Free to unpin this memory.

I do call GCHandle.Free() I think you must have just missed it. Image now implements IDisposable and the method is called when disposing of the instance.

What I do need to know is whether I should be checking and calling Free() when using SetPixels() or whether simply updating the handle is sufficient.

@JimBobSquarePants
Copy link
Member Author

From @mattwarren on May 20, 2016 8:53

Hi, I'm one of the developers for BenchmarkDotNet and it's cool to see you using it, I added you to our list.

BTW there's a Diagnostics Nuget package that might be help you out. It gives you information about memory allocations and method inlining that might be useful to have in your benchmarks

@JimBobSquarePants
Copy link
Member Author

Hi @mattwarren thanks for dropping by.

That package sounds like it would definitely be very useful thank you! Happy to have made your list. 😄

@JimBobSquarePants
Copy link
Member Author

From @voidstar69 on May 21, 2016 18:17

Hi @JimBobSquarePants, I got the latest code from the Core branch before I tried my changes, and I did test them in Release mode.

That is the same test that I ran on my machine, but the performance profile is different for me (i5 CPU, 4 logical processors). What were your timings in us (microseconds)? If you try out my Array.Copy code, does it make Crop faster or slower on your machine?

You are right about GCHandle.Free - I was searching the file for this exact text, rather than pixelHandles.Free.

As for SetPixels() I suspect that if Free() is not called before calling GCHandle.Alloc() again, the previous memory block will remain pinned in memory. So Free() should be called if pixelsHandle is not null.

@JimBobSquarePants
Copy link
Member Author

@voidstar69 I've taken your advise on board re SetPixels() as that seems like a logical interpretation of expected behaviour. The code will be in the next push.

As for timings. Here's the timing on my main laptop. It's got waaay less cores than my work machine which has twelve. On that machine ImageProcessorCore is 0.46 when scaled which is good and quick.

BenchmarkDotNet=v0.9.6.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Processor 5Y70 CPU @ 1.10GHz, ProcessorCount=4
Frequency=1266597 ticks, Resolution=789.5171 ns, Timer=TSC
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
JitModules=clrjit-v4.6.1080.0

Type=Crop  Mode=Throughput

                  Method |    Median |    StdDev | Scaled |
------------------------ |---------- |---------- |------- |
     System.Drawing Crop | 1.0361 ms | 0.1025 ms |   1.00 |
 ImageProcessorCore Crop | 1.7126 ms | 0.0806 ms |   1.65 |

Using your code addition

BenchmarkDotNet=v0.9.6.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Processor 5Y70 CPU @ 1.10GHz, ProcessorCount=4
Frequency=1266597 ticks, Resolution=789.5171 ns, Timer=TSC
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
JitModules=clrjit-v4.6.1080.0

Type=Crop  Mode=Throughput

                  Method |        Median |     StdDev | Scaled |
------------------------ |-------------- |----------- |------- |
     System.Drawing Crop |   847.2536 us | 26.1538 us |   1.00 |
 ImageProcessorCore Crop | 1,397.4700 us | 59.5903 us |   1.65 |

I ran the tests a few times and the difference was negligible so I don't want to change the code yet (big emphasis on the yet)

You have got me thinking though which is awesome. IImageBase should have a method to set a row in one move. There must be a way we can copy an entire row in one in an unsafe context to maximize performance. So if you, I or anyone reading wants to set aside the time to tests and benchmark an approach that would be excellent.

@JimBobSquarePants
Copy link
Member Author

One thing that we should definitely have a look at which affects all processes is how I manage threading.

in ParallelImageProcessor.cs I split up jobs into tasks

https://github.com/JimBobSquarePants/ImageProcessor/blob/0300f596b0c59a7187ba7ce0efcf0d9e98932a4f/src/ImageProcessorCore/ParallelImageProcessor.cs#L45

I determine the number of tasks based on multiplying the processor count by 2 which I will freely admit was a number plucked out from the air.

I would love to have someone review this process to see if I am falling short somewhere and taking the wrong approach.

@JimBobSquarePants
Copy link
Member Author

Well ain't this nice... 😄

// * Summary *

BenchmarkDotNet=v0.9.7.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Xeon(R) CPU E5-1650 0 3.20GHz, ProcessorCount=12
Frequency=3117489 ticks, Resolution=320.7710 ns, Timer=TSC
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
JitModules=clrjit-v4.6.1063.1

Type=Resize  Mode=Throughput

                    Method |    Median |    StdDev | Scaled |
-------------------------- |---------- |---------- |------- |
     System.Drawing Resize | 2.6414 ms | 0.0395 ms |   1.00 |
 ImageProcessorCore Resize | 1.7930 ms | 0.0773 ms |   0.68 |

// ***** BenchmarkRunner: End *****

Global total time: 00:00:30 (30.42 sec)

@JimBobSquarePants
Copy link
Member Author

From @alexmbaker on June 7, 2016 9:10

You may find something useful in https://github.com/Microsoft/Microsoft.IO.RecyclableMemoryStream

@JimBobSquarePants
Copy link
Member Author

@mattwarren Just to follow up on your package recommendation. My project threw a wobbler citing a missing .dll when I tried to use the diagnostic package. Are there known issues on .NET Core or am I doing something daft?

https://github.com/JimBobSquarePants/ImageProcessor/blob/0c0c868ae14312f4ab5d7bc78a0c8e118d10f404/tests/ImageProcessorCore.Benchmarks/Program.cs#L31

@JimBobSquarePants
Copy link
Member Author

From @mattwarren on June 29, 2016 8:40

Have you added the BenchmarkDotNet.Diagnostics.Windows package to your project as well?

@JimBobSquarePants
Copy link
Member Author

@mattwarren Yeah I do. Here's the message. I'll open an issue if you like?

Error loading BenchmarkDotNet.Diagnostics.Windows.dll: FileNotFoundException - Could not load file or assembly 'file:///C:\github\ImageProcessor\tests\ImageProcessorCore.Benchmarks\BenchmarkDotNet.Diagnostics.Windows.dll' or one of its dependencies. The system cannot find the file specified.

Unhandled Exception: System.InvalidOperationException: memorydiagnoser is an unrecognised Diagnoser
   at BenchmarkDotNet.Configs.ConfigParser.ParseDiagnosers(String value)
   at BenchmarkDotNet.Configs.ConfigParser.<>c.<.cctor>b__18_9(ManualConfig config, String value)
   at BenchmarkDotNet.Configs.ConfigParser.Parse(String[] args)
   at BenchmarkDotNet.Running.BenchmarkSwitcher.RunBenchmarks(String[] args)
   at ImageProcessorCore.Benchmarks.Program.Main(String[] args) in C:\github\ImageProcessor\tests\ImageProcessorCore.Benchmarks\Program.cs:line 35
Press any key to continue . . .

@JimBobSquarePants
Copy link
Member Author

From @danijel-peric on July 1, 2016 22:9

my test for resize, used jpeg image 960x540 and resized it to 320x240, code

using (MemoryStream inStream = new MemoryStream(Image))
using (MemoryStream outStream = new MemoryStream())
using (ImageFactory imageFactory = new ImageFactory())
    imageFactory.Load(inStream)
                .Resize(new Size(320, 240))
                .Format(new JpegFormat { Quality = 50 })
                .Save(outStream);

if you want code used for Windows.Media.Imaging or System.Drawing.Graphics let me know

note i tested your library because i was trying to find fast resize library which doesn't use much cpu, from this test Windows.Media.Imaging was using 15% cpu on my machine, your resize used 40-50%
Parallel is the problem, but very nice library, thanks


BenchmarkDotNet=v0.9.7.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4790 CPU 3.60GHz, ProcessorCount=8
Frequency=3515620 ticks, Resolution=284.4448 ns, Timer=ACPI
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
JitModules=clrjit-v4.6.1055.0
Type=ImageResizeBenchmark  Mode=Throughput  
MethodMedianStdDev
WindowsMediaImaging2.1111 ms0.1916 ms
SystemDrawingGraphics5.3590 ms0.5527 ms
ImageProcessor23.0896 ms0.4548 ms

@JimBobSquarePants
Copy link
Member Author

Hi @danijel-peric

Thanks for your info but you're actually benchmarking the wrong codebase. This issue has been raised specifically for ImageProcessor.Core.

There's going to be some overhead using the legacy ImageFactory code as I copy the image over to a new pixel format when loading. If you were going to benchmark you should be benchmarking the Resize method only. There is absolutely no parallel code used in the example workflow you have provided.

@JimBobSquarePants
Copy link
Member Author

From @rold2007 on July 3, 2016 1:0

@danijel-peric : Also, make sure you compare apples with apples. Resize methods like nearest neighbor and bicubic have a very different speed, and quality. But I'm not even sure what ImageProcessor is using.

@JimBobSquarePants
Copy link
Member Author

From @danijel-peric on July 3, 2016 7:25

:), it looks i have tested with old ImageFactory which was on https://www.nuget.org/packages/ImageProcessor

sorry, will do another test with ImageProcessor.Core, so in order to get it i need to download source? and build it, do you have binary files somewhere for download?

@JimBobSquarePants
Copy link
Member Author

Hi @NeelBhatt Thanks for the publicity! 😄

I'd just like to suggest some updates to your blog if that's ok to correct a few details.

Just to clarify. The ImageProcessor version (2.4.4) published in Nuget is the legacy version that runs on the full .NET Framework. That version is in a separate branch in this repo called Framework.

What you are looking for to work on .NET Standard 1.1 + is the ImageProcessorCore package which is only on MyGet for the time being. That is hosted here. https://www.myget.org/gallery/imageprocessor

I hope that clears things up a little.

Cheers

James

@JimBobSquarePants
Copy link
Member Author

From @NeelBhatt 14th July 2016

Oh okay so you mean for .Net core, we need to download it from the link you mentioned in your comment correct? Also can it be downloaded from your Github repository?

I will surely change that. Thanks :)

@JimBobSquarePants
Copy link
Member Author

Just to note. Focus is now on Encoders/Decoders. Bmp is great! The rest....

@antonfirsov
Copy link
Member

Hi, I "ported" your projects to .NET 4.6 & managed to play with DotPeek.
I think the main issue is having classes where using structs would be a much better choice. The code needs a LOT of systematic refactor. I would be happy to contribute, but I also want to avoid duplicate work. Any advices? (What to touch/not touch?)

@JimBobSquarePants
Copy link
Member Author

Hey @antonfirsov That's great! Which classes in particular were you looking at? Please say the Jpeg decoder/encoder...

I would ❤️ if you could have a look at that if you could? I'm focusing on png just now.

@antonfirsov
Copy link
Member

Yes, its the jpeg :)
I will start with a few small changes this week to proof my ideas first.

@JimBobSquarePants
Copy link
Member Author

Excellent stuff! I got memory usage down on the encoder but the decoder is a mess!

@antonfirsov
Copy link
Member

Lost some of my initial optimism. Only managed to gain ~10% speed growth by reducing allocations + array flattening in the jpeg decoder. Even with ArrayPool-s!

The main bottlenecks are the IDCT.Transform() and JpegDecoderCore.DecodeHuffman() functions.

After analyzing the libjpeg code, I have the following conclusions:

  1. Transform could be replaced by a faster floating point implementation:
    https://dev.w3.org/Amaya/libjpeg/jidctflt.c
  2. It might be not enough, libjpeg DCT logic is much more complex

I can play a bit more this weekend, but if you want to go 100% managed, someone has to become a jpeg expert sooner or later.

@JimBobSquarePants
Copy link
Member Author

@antonfirsov That's still progress so don't lose heart yet. I have a big book on the jpeg spec at home and am investigating the different implementations. Already forked libjpeg-turbo 😄

Translating to the floating point version shouldn't be beyond us. There's a faster, slightly less accurate version here also https://github.com/libjpeg-turbo/libjpeg-turbo/blob/master/jidctflt.c

FluxJpeg has implementations we can adapt that i'd like to investigate also.
https://github.com/briandonahue/FluxJpeg.Core/blob/master/FJCore/FDCT.cs
https://github.com/briandonahue/FluxJpeg.Core/blob/master/FJCore/DCT.cs

Intrigued by their inverse... It seems very simplistic.

@antonfirsov
Copy link
Member

I tried the FluxJpeg implementation ... actually it's slower. Maybe we need a floating point IDCT with System.Numerics.

@antonfirsov
Copy link
Member

This library contains a libjpeg port in C#:
https://imagetools.codeplex.com/

Shouldn't we adapt this instead of the golang port?

@JimBobSquarePants
Copy link
Member Author

It's libjpeg.net. You don't wanna see the source, its brutal and super slow.

@JimBobSquarePants
Copy link
Member Author

Could you PR with what you have please and I'll have a tinker from there.

@boguscoder
Copy link

As an initial exercise I'll try to PR scanline buffer reuse in PNG decoder filters, its seems there are redundant allocations hapenning there

@JimBobSquarePants
Copy link
Member Author

@boguscoder Thanks!

@antonfirsov
Copy link
Member

Just sent a PR, but conflicted :( I'm not sure if it's worth to merge in it's current form.

Currently I'm experimenting with SIMD IDCT implementations. Trying to port this:
https://github.com/norishigefukushima/dct_simd

@antonfirsov
Copy link
Member

I have good news with JpegDecoder!
DecodeJpeg.JpegCore() (with Calliphora.jpg) on my machine:

Host Process Environment Information:
BenchmarkDotNet-Dev.Core=v0.9.9.0
OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4810MQ CPU 2.80GHz, ProcessorCount=8
Frequency=2728115 ticks, Resolution=366.5535 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1076.0

Type=DecodeJpeg  Mode=Throughput  

Current master branch:

Method Median StdDev Gen 0 Gen 1 Gen 2 Bytes Allocated/Op
ImageSharp Jpeg 48.7835 ms 0.4524 ms 724.00 - 144.00 6,063,638.46

My (experimental) master with SIMD optimizations:

Method Median StdDev Gen 0 Gen 1 Gen 2 Bytes Allocated/Op
ImageSharp Jpeg 31.8167 ms 0.5176 ms - - 130.00 1,979,222.11

Are you interested to pull this in? :) There are lots of additional classes & test cases this time.

@JimBobSquarePants
Copy link
Member Author

Awwwwwww yeaaaaaaaaah!

That's a brilliant improvement!

Definitely looking to pull the changes in though I'll need you to strip your PR down to the minimum changes please first. You've got 28 files changed so far, some of which are completely unrelated (png filters)

@olivif
Copy link
Contributor

olivif commented Dec 17, 2016

@JimBobSquarePants I'd like to help out on perf investigations, any pointers on where I can start/what to focus on? Should we start with some documentation on bechmarking in ImageSharp and an overview of the areas and if anyone is actively looking into them? 😄

Btw, I ran benchmark.cmd on a few of the profiles, but it's a bit difficult to go from there.

Also, I searched a bit on benchmarking with xUnit, found NBench and xUnit Perf, not sure if anyone has already looked into these.

@antonfirsov
Copy link
Member

antonfirsov commented Dec 18, 2016

@olivif benchmarks are very good starters, but you don't get too far in performance analysis & optimization without a profiler. There is a built-in profiler in VS, but it's quite slow, it's better to grab JetBrains dotTrace if possible.

And here is the tricky part:
None of these tools worked for me with a .NET core project. My workaround was to define a temporal 4.6 project, and reference ImageSharp as a dll (bin\Release\net45\ImageSharp.dll).

If the infrastructure is ready, it's really easy to profile .NET 4.6 code with dotTrace or with VS profiler. Finding and eliminating bottlenecks is an exciting game, think of a detective movie! :)

@olivif
Copy link
Contributor

olivif commented Dec 18, 2016

Ahh I think I see now. Benchmarking will give you the overall numbers and help you compare against a baseline (be it a different lib or an earlier version of your lib), whereas the profiler will actually tell you what is slow (or provide enough info on all the parts so you can hunt yourself).

Hmm I'm surprised the VS profiler didn't work with a core project, maybe the tooling for core is not all fully out there.

@JimBobSquarePants
Copy link
Member Author

Closing this as we have come a long way performance wise and it does no good to keep this issue open.

JimBobSquarePants pushed a commit that referenced this issue Nov 27, 2019
antonfirsov pushed a commit that referenced this issue Jan 18, 2020
prevent stylecop.json form being included in package
JimBobSquarePants pushed a commit that referenced this issue Feb 17, 2021
prevent stylecop.json form being included in package
JimBobSquarePants pushed a commit that referenced this issue Feb 17, 2021
prevent stylecop.json form being included in package
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants