regorov/jpegcc

jpegcc

JPEG Most Prevalent Colors Counter Package and Command Line Tool.

Requirements

  • The input is a list of links, each leading to an image.
  • The solution should create a CSV file with the 3 most prevalent colors of each image.
  • The solution should be able to handle input files with more than a billion URLs.
  • The solution should work under limited resources (e.g. 1 CPU, 512MB RAM).
  • There is no limit on the execution time.
  • The provided resources should be utilized as fully as possible at any time during program execution.
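
The requirement of the 3 most prevalent colors can be illustrated with a naive counter. This is only a sketch and not necessarily how the package counts: `topColors` and `sample` are hypothetical names, and a map of counters is used for brevity even though it allocates more than a memory-constrained implementation would like.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
	"sort"
)

// topColors returns up to n most frequent colors of img as "#RRGGBB"
// strings, most frequent first. Ties are broken by the numeric color
// value so that the output is deterministic.
func topColors(img image.Image, n int) []string {
	counts := make(map[uint32]int)
	b := img.Bounds()
	for y := b.Min.Y; y < b.Max.Y; y++ {
		for x := b.Min.X; x < b.Max.X; x++ {
			r, g, bl, _ := img.At(x, y).RGBA() // 16-bit channels
			key := (r>>8)<<16 | (g>>8)<<8 | bl>>8
			counts[key]++
		}
	}
	keys := make([]uint32, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	if len(keys) > n {
		keys = keys[:n]
	}
	out := make([]string, len(keys))
	for i, k := range keys {
		out[i] = fmt.Sprintf("#%06X", k)
	}
	return out
}

// sample builds a 4x4 test image: 8 red, 5 green and 3 blue pixels.
func sample() image.Image {
	img := image.NewRGBA(image.Rect(0, 0, 4, 4))
	for i := 0; i < 16; i++ {
		c := color.RGBA{B: 255, A: 255}
		if i < 8 {
			c = color.RGBA{R: 255, A: 255}
		} else if i < 13 {
			c = color.RGBA{G: 255, A: 255}
		}
		img.Set(i%4, i/4, c)
	}
	return img
}

func main() {
	fmt.Println(topColors(sample(), 3))
}
```

For real input the image would first be decoded with `image/jpeg`; the synthetic image above just keeps the example independent of any file.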

Result

Research and Insights

  • Input file: there are duplicated URLs => avoid downloading and processing duplicates. Reuse the result calculated before.
  • Input file: there are broken URLs => remember broken URLs and do not attempt to download them again when they reappear.
  • Input file: a link with a .jpeg suffix may in fact refer to a .png file => mark such a URL as broken.
  • Input file: all URLs refer to a limited number of hosts (2-3) => opening many simultaneous HTTP connections can look like a DDoS attack and get the client blocked. Limit the number of HTTP connections per host.
  • Input file: too short for benchmarking => a bigger file with links is needed. One such repository (18+ content): https://github.com/EBazarov/nsfw_data_source_urls
  • Limited RAM => reduce garbage generation with zero-copy techniques, object pools and escape analysis.
  • Network utilization => support simultaneous HTTP connections.
  • Storage utilization => buffered reading from the input file and buffered writing of results.
  • 1 CPU => identify and isolate the part of the program that loads the CPU the most.
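
The first three insights can be sketched together: remember the outcome per URL, including broken ones, and treat a body that is not really a JPEG as broken. This is an illustration under assumptions, not the package's code: `entry`, `cache` and `resolve` are hypothetical names, the content check uses the standard library's `http.DetectContentType`, the map is not safe for concurrent use, and an in-memory map only fits when the number of distinct URLs is modest.

```go
package main

import (
	"fmt"
	"net/http"
)

// entry is the remembered outcome for one URL (hypothetical type).
type entry struct {
	result string // processing result; empty when broken
	broken bool
}

// cache maps a URL to its remembered outcome. Not safe for concurrent use.
type cache map[string]entry

// resolve returns the remembered outcome for url, or calls fetch and count
// once and remembers what happened. A download error or a body that does
// not carry a JPEG signature marks the URL as broken.
func (c cache) resolve(url string, fetch func(string) ([]byte, error), count func([]byte) string) entry {
	if e, ok := c[url]; ok {
		return e // duplicate: no download, no processing
	}
	var e entry
	body, err := fetch(url)
	if err != nil || http.DetectContentType(body) != "image/jpeg" {
		e.broken = true
	} else {
		e.result = count(body)
	}
	c[url] = e
	return e
}

func main() {
	// Stand-ins for a real downloader and color counter.
	fetchPNG := func(string) ([]byte, error) {
		return []byte("\x89PNG\r\n\x1a\n"), nil // PNG signature behind a .jpeg link
	}
	count := func(b []byte) string { return fmt.Sprintf("%d bytes", len(b)) }

	c := cache{}
	fmt.Println(c.resolve("http://example.com/a.jpeg", fetchPNG, count).broken)
}
```

Storing broken URLs in the same map means a repeated broken link costs one lookup instead of another network round trip.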

Pipeline Concept

Input file -> [Input file reader] -> [1] -> N x [Image downloader] -> [1] -> M x [Image processor] -> [1] -> [Buffered result writer] -> Output file

[1] - channel with a buffer length of 1.

N, M - number of simultaneous downloader and processor goroutines.

  • Reading from the input file uses Scanner, which buffers reads internally.
  • N downloaders are launched. The downloader is based on the fasthttp library, which provides a per-host limit on HTTP connections, pooled buffers for reading the HTTP body and zero-allocation request handling. Because the time of a single image download is not guaranteed, several downloads run in parallel so that the channel read by the processing stage always has a downloaded image waiting.
  • Processing is the most CPU-intensive part. With 1 CPU there is no sense in running more than one processing goroutine at a time: two goroutines would compete for the CPU cache and reduce overall performance. Still, a command line option allows M>1 processing goroutines.
  • The buffered result writer keeps the image processing goroutine from blocking on file I/O for every single result.

Installation

go get github.com/regorov/jpegcc
cd ${GOPATH}/src/github.com/regorov/jpegcc/cmd/jpegcc
go build

Usage

jpegcc help
jpegcc help start
export GOMAXPROCS=1

jpegcc start -i ./input.txt --pw 1 --dw 8 -o ./result.csv

Profiling Results

jpegcc --pl ":6001" start -i ./input.txt --pw 1 --dw 8 -o ./result.csv
curl -s http://127.0.0.1:6001/debug/pprof/heap > heap.out
curl -s http://127.0.0.1:6001/debug/pprof/profile > cpu.out
go tool pprof -http=":8086" ./heap.out
  • The Go runtime does not return freed memory to the OS immediately.

Build Your Own Image Processing Tool

There are several interfaces that can be implemented to change the input source, the download approach, the image processing logic and the output destination.

// Resulter is the interface that wraps Result and Header methods.
//
// Result returns string representation of processing result.
//
// Header returns header if output format expects header (e.g. CSV file format).
// If output format does not require header, method implementation can return
// empty string.
type Resulter interface {
	Result() string
	Header() string
}

// Outputer is the interface that wraps Save and Close methods.
//
// Save receives Resulter to be written to the output.
//
// Close flushes output buffer and closes output.
type Outputer interface {
	Save(Resulter) error
	Close() error
}

// Counter is the interface that wraps the basic Count method.
//
// Count receives Imager, processes it according to the implementation and
// returns Resulter or error if processing failed.
type Counter interface {
	Count(Imager) (Resulter, error)
}

// Inputer is the interface that wraps the basic Next method.
//
// Next returns channel of URLs read from input. Channel closes
// when input EOF is reached.
type Inputer interface {
	Next() <-chan string
}

// Downloader is the interface that groups methods Download and Next.
//
// Download downloads image addressed by url and returns it wrapped into Imager.
//
// Next returns channel of Imager downloaded and ready to be processed. Channel
// closes when there is nothing left to download.
type Downloader interface {
	Download(ctx context.Context, url string) (Imager, error)
	Next() <-chan Imager
}

// Imager is the interface that groups methods to deal with
// downloaded image.
//
// Bytes returns downloaded image as []byte.
//
// Reset releases []byte of HTTP Body. Do not call Bytes() after
// calling Reset.
//
// URL returns the URL of downloaded image.
type Imager interface {
	Bytes() []byte
	Reset()
	URL() string
}

Further research

  • Create a RAM disk.
  • Use the RAM disk as shared storage.
  • Split the current jpegcc application into an image downloader daemon and image processing apps (similar to FaaS). Because the downloader part of the application does not consume much memory and does not generate garbage, it can stay running as a daemon.

Prague 2020
