Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(sql): fix COPY performance for pre-created tables with non-default symbol capacity #3305

Merged
merged 4 commits into from
May 5, 2023

Conversation

puzpuzpuz
Copy link
Contributor

Refs #3269

COPY was using default symbol table capacity for temporary tables. This led to sub-optimal ingestion performance due to slow symbol lookups. With this patch, temporary tables use the same symbol capacities as the end table.

Benchmark

Create a 100M test CSV file with this Golang program:

package main

import (
	"bufio"
	"os"
	"strconv"
	"time"
)

func main() {
	f, err := os.Create("/tmp/input.csv")
	if err != nil {
		panic(err)
	}

	defer f.Close()

	_, err = f.WriteString("sym,ts\n")
	if err != nil {
		panic(err)
	}
	f.Sync()

	w := bufio.NewWriter(f)
	for i := 1; i <= 100_000_000; i++ {
		t := time.UnixMicro(int64(i) * 1_000_000)
		_, err := w.WriteString(strconv.Itoa(i%234870) + "," + t.Format("2006-01-02T15:04:05.000000Z") + "\n")
		if err != nil {
			panic(err)
		}
	}

	w.Flush()
	f.Sync()
}

Pre-create the table and run COPY on it:

CREATE TABLE test (sym SYMBOL CAPACITY 300000, ts TIMESTAMP) TIMESTAMP(ts) PARTITION BY YEAR;
COPY test FROM 'input.csv' WITH HEADER true;

Before this patch:

2023-05-05T08:31:18.791182Z I i.q.c.t.ParallelCsvFileImporter import complete [importId=76f11f07197e248b, file=`/tmp/input.csv`', time=1180s]

With this patch:

2023-05-05T08:10:23.145525Z I i.q.c.t.ParallelCsvFileImporter import complete [importId=76ee51f64b9008b9, file=`/tmp/input.csv`', time=16s]

@puzpuzpuz puzpuzpuz added Bug Incorrect or unexpected behavior Performance Performance improvements labels May 5, 2023
@puzpuzpuz puzpuzpuz self-assigned this May 5, 2023
@puzpuzpuz
Copy link
Contributor Author

@amyshwang we might want to stress the importance of symbol capacity configuration in the COPY guide in the docs.

@puzpuzpuz puzpuzpuz force-pushed the puzpuzpuz_fix_copy_perf_for_symbols branch from 9e81fac to 43429fc Compare May 5, 2023 09:49
bluestreak01
bluestreak01 previously approved these changes May 5, 2023
@ideoma
Copy link
Collaborator

ideoma commented May 5, 2023

[PR Coverage check]

😍 pass : 19 / 20 (95.00%)

file detail

path covered line new line coverage
🔵 io/questdb/cairo/vm/NullMapWriter.java 0 1 00.00%
🔵 io/questdb/cairo/SymbolMapWriter.java 2 2 100.00%
🔵 io/questdb/cutlass/text/ParallelCsvFileImporter.java 17 17 100.00%

@bluestreak01 bluestreak01 merged commit 1fb6fcd into master May 5, 2023
20 checks passed
@bluestreak01 bluestreak01 deleted the puzpuzpuz_fix_copy_perf_for_symbols branch May 5, 2023 14:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Incorrect or unexpected behavior Performance Performance improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants