This repository has been archived by the owner on Jan 28, 2021. It is now read-only.

sql/index/pilosa: parallelize index creation #644

Merged
merged 3 commits into from
Apr 10, 2019

Conversation

erizocosmico
Contributor

Closes #346

@erizocosmico erizocosmico requested a review from a team March 22, 2019 14:58
@kuba--
Contributor

kuba-- commented Mar 22, 2019

It looks like we now enumerate columns per partition as well (they are no longer global), so it doesn't make sense to merge bitmaps across partitions. Am I right?

@kuba--
Contributor

kuba-- commented Mar 22, 2019

I'm trying to work out whether the following scenario is possible:

You have 2 partitions.
Normally, if you iterate over locations, you get global columns.
Right now we have a column enumerator per partition, so I wonder whether, given 2 bitmaps from 2 different partitions, you might end up with:
[1, 0, 0]
[0, 1, 0]

instead of
[1, 0, 0]
[1, 0, 0]

so after the AND we would get [0, 0, 0] instead of [1, 0, 0].

Is that possible?
Merging is still global, but with the mapping you go to a per-partition mapping instead of the global index mapping. Correct me if I'm wrong.
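To make the concern concrete, here is a minimal sketch of the scenario, assuming bitmaps are plain boolean slices and that the same logical row can end up in different columns when column numbers are assigned independently. The helpers `and` and `bitmapFor` are hypothetical and are not part of the pilosa driver.

```go
package main

import "fmt"

// and returns the element-wise AND of two equal-length bitmaps.
func and(a, b []bool) []bool {
	out := make([]bool, len(a))
	for i := range a {
		out[i] = a[i] && b[i]
	}
	return out
}

// bitmapFor builds a bitmap of the given width with a single bit set at col.
// Hypothetical helper: col stands for the column assigned to one row.
func bitmapFor(width, col int) []bool {
	bm := make([]bool, width)
	bm[col] = true
	return bm
}

func main() {
	// Assumption: the row was assigned column 0 by both enumerators.
	consistent := and(bitmapFor(3, 0), bitmapFor(3, 0))

	// Assumption: the row was assigned column 0 by one enumerator and
	// column 1 by the other, so the bits no longer line up.
	inconsistent := and(bitmapFor(3, 0), bitmapFor(3, 1))

	fmt.Println(consistent)   // [true false false]
	fmt.Println(inconsistent) // [false false false]
}
```

The sketch only shows that an AND is meaningful when both operands share one column numbering. It does not show whether the driver ever combines bitmaps that were numbered independently.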

@erizocosmico
Contributor Author

As far as I can see, there are already tests using ANDs and ORs (https://github.com/src-d/go-mysql-server/blob/master/engine_test.go#L1665-L1726), and the tests use partitioned tables (https://github.com/src-d/go-mysql-server/blob/master/engine_test.go#L1508-L1511), so I assume that if the tests pass, it should work 🤷‍♂️

@erizocosmico
Contributor Author

I've tried this with gitbase and all tests that use indexes still pass. I also ran queries manually, and they return exactly the same results.

@kuba--
Contributor

kuba-- commented Mar 28, 2019

I tested this PR against the following repos:

pga list -l javascript -f json | head -n 5 | jq -r '.sivaFilenames[]' | pga get -i -o repos
pga list -l python -f json | head -n 5 | jq -r '.sivaFilenames[]' | pga get -i -o repos

In general, I think something is broken in the current implementation of indexes, because for the following query:

SELECT count(*) FROM commits WHERE commit_author_email!='sjdflkjsdlfkj';

I get 4344 without indexes but after creating an index:
CREATE INDEX email_idx ON commits USING pilosa (commit_author_email);
I get 4301.

With your implementation both results are OK!
But I came across a couple of crashes. The first one, a segfault, happened when I created 2 indexes in quick succession:

CREATE INDEX email_idx ON commits USING pilosa (commit_author_email);
CREATE INDEX ref_idx ON refs USING pilosa (ref_name);

I got:

INFO[0025] audit trail                                   action=query address="127.0.0.1:51427" connection_id=1 duration=129.61007ms err="invalid view" pid=44 query="SELECT count(*) FROM commits WHERE commit_author_email!='sjdflkjsdlfkj'" success=false system=audit user=root
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x4b0cb18]

goroutine 89 [running]:
github.com/src-d/gitbase/vendor/github.com/pilosa/pilosa.(*Field).Name(...)
github.com/src-d/gitbase/vendor/github.com/pilosa/pilosa/field.go:210
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa.(*negateLookup).intersectExpressions(0xc0002ba230, 0x518b920, 0xc000ce6600, 0x0, 0x30, 0x10)
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa/lookup.go:424 +0x2a8
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa.(*negateLookup).values(0xc0002ba230, 0x518b920, 0xc000ce6600, 0x0, 0x0, 0x0)
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa/lookup.go:468 +0xed
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa.(*negateLookup).Values(0xc0002ba230, 0x518b920, 0xc000ce6600, 0x0, 0x0, 0x0, 0x0)
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa/lookup.go:507 +0x49
github.com/src-d/gitbase.(*commitsTable).PartitionRows.func1(0xc0010cafc0, 0xb, 0xb, 0x4dfe293, 0x7)
	/github.com/src-d/gitbase/commits.go:100 +0x352
github.com/src-d/gitbase.rowIterWithSelectors(0xc000cc0460, 0x5f2dc00, 0xb, 0xb, 0x4dfe293, 0x7, 0xc00226b980, 0x1, 0x1, 0xc00189dd68, ...)
	/github.com/src-d/gitbase/filters.go:261 +0x115
github.com/src-d/gitbase.(*commitsTable).PartitionRows(0xc001b1b620, 0xc000cc0460, 0x518b920, 0xc000ce6600, 0x20, 0x4d1fe60, 0xc001cef301, 0xc001cef380)
	/github.com/src-d/gitbase/commits.go:89 +0x1f7
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*ProcessIndexableTable).PartitionRows(0xc001ceefa0, 0xc001cecb40, 0x518b920, 0xc000ce6600, 0x6f3d320, 0x518b920, 0xc000ce6610, 0xc000ce6600)
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/process.go:94 +0x55
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*exchangePartition).RowIter(0xc001cef380, 0xc001cecb40, 0x51b8960, 0xc001cef380, 0x0, 0x0)
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/exchange.go:306 +0x4f
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*exchangeRowIter).iterPartition(0xc001605ce0, 0x518b920, 0xc000ce6600)
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/exchange.go:227 +0xe6
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*exchangeRowIter).start.func1(0xc001605ce0, 0xc000ede214, 0x518b920, 0xc000ce6600)
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/exchange.go:170 +0x3f
created by github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*exchangeRowIter).start
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/exchange.go:169 +0x10d

If I do it slowly (create the first one, wait, then create the next one), everything looks fine.

The other crash, a panic, happens when I create an index on an expression:

CREATE INDEX files_lang_idx ON files USING pilosa (language(file_path, blob_content));

panic: page 252 already freed

goroutine 182 [running]:
github.com/src-d/gitbase/vendor/go.etcd.io/bbolt.(*freelist).free(0xc002462280, 0x3, 0xa5bd000)
	/github.com/src-d/gitbase/vendor/go.etcd.io/bbolt/freelist.go:175 +0x3d6
github.com/src-d/gitbase/vendor/go.etcd.io/bbolt.(*Tx).Commit(0xc0018f40e0, 0x0, 0x0)
	/github.com/src-d/gitbase/vendor/go.etcd.io/bbolt/tx.go:171 +0x1ab
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa.(*mapping).transaction(0xc00029a0e0, 0x4bf2e01, 0xc000d75d38, 0x0, 0x0)
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa/mapping.go:171 +0x13c
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa.(*mapping).getRowID(0xc00029a0e0, 0xc00260bbf0, 0x2c, 0x4bf2e80, 0xc0019c4440, 0x99, 0x10f, 0x0)
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa/mapping.go:183 +0xf9
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa.(*Driver).savePartition(0xc000336660, 0xc00211c000, 0x518b920, 0xc002bb68b0, 0x5199760, 0xc0014d3db0, 0xc001d1a580, 0xc001de4e40, 0xc000d75f18, 0x0, ...)
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa/driver.go:283 +0x407
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa.(*Driver).Save.func1(0xc00211e050, 0xc002128300, 0xc001d1a580, 0xc000336660, 0xc00211c000, 0x518b920, 0xc002bb68b0, 0x5199760, 0xc0014d3db0, 0xc001de4e40, ...)
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa/driver.go:393 +0x1ec
created by github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa.(*Driver).Save
	/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/index/pilosa/driver.go:380 +0x60b

The first one looks more like a race in the pilosa structs (some file is not closed/opened, so the data didn't sync). The second one is related to the mapping. PTAL and see if you can reproduce it; maybe it's just my messed-up env.

@erizocosmico
Contributor Author

Pausing this until src-d/gitbase#769 is fixed (which is kind of related to this)

@erizocosmico
Contributor Author

The first issue has been solved in gitbase, as it was not an index issue per se.

The other issues have been solved by the last commit: mapping transactions were not being guarded correctly.
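For readers unfamiliar with this kind of bug, here is a minimal sketch of guarding a get-or-create mapping so that concurrent partition workers cannot interleave inside it. It is an illustration under assumptions, not the code from this PR: `rowMapping`, `newRowMapping`, and `size` are hypothetical, and an in-memory map stands in for the bbolt-backed store.

```go
package main

import (
	"fmt"
	"sync"
)

// rowMapping is a hypothetical stand-in for a value -> row ID mapping.
// An in-memory map replaces the on-disk store used by the real driver.
type rowMapping struct {
	mu   sync.Mutex
	ids  map[string]uint64
	next uint64
}

func newRowMapping() *rowMapping {
	return &rowMapping{ids: make(map[string]uint64)}
}

// getRowID returns the ID for value, allocating a new one if needed.
// The whole lookup-or-insert runs under the lock, so two goroutines
// cannot both decide that the same value is missing.
func (m *rowMapping) getRowID(value string) uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	if id, ok := m.ids[value]; ok {
		return id
	}
	id := m.next
	m.next++
	m.ids[value] = id
	return id
}

// size reports how many distinct values have been assigned an ID.
func (m *rowMapping) size() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return len(m.ids)
}

func main() {
	m := newRowMapping()
	var wg sync.WaitGroup
	// Eight workers, standing in for partitions indexed in parallel.
	for p := 0; p < 8; p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				m.getRowID(fmt.Sprintf("value-%d", i%10))
			}
		}()
	}
	wg.Wait()
	fmt.Println(m.size()) // 10
}
```

Holding one lock across the whole read-then-write step is the simplest option. It serializes all mapping access, which may or may not match the trade-off chosen in the actual commit.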

@erizocosmico
Contributor Author

TODO: config for setting the number of threads for creating indexes
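One way such a setting could be wired in is a fixed pool of workers that pull partitions from a channel. This is a rough sketch under assumptions: `indexPartitions`, its `parallelism` parameter, and the string partition names are all hypothetical and do not reflect the option that was eventually added.

```go
package main

import (
	"fmt"
	"sync"
)

// indexPartitions calls index once per partition, using at most
// parallelism goroutines. It returns the first error reported, but
// still lets the remaining partitions run (a simplification).
func indexPartitions(partitions []string, parallelism int, index func(string) error) error {
	if parallelism < 1 {
		parallelism = 1 // fall back to sequential indexing
	}

	jobs := make(chan string)
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)

	for w := 0; w < parallelism; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				if err := index(p); err != nil {
					once.Do(func() { firstErr = err })
				}
			}
		}()
	}

	for _, p := range partitions {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return firstErr
}

func main() {
	var (
		mu   sync.Mutex
		done int
	)
	err := indexPartitions([]string{"p0", "p1", "p2", "p3"}, 2, func(string) error {
		mu.Lock()
		done++
		mu.Unlock()
		return nil
	})
	fmt.Println(done, err) // 4 <nil>
}
```

A value of 1 degrades to sequential indexing, which can help when checking whether a failure is caused by concurrency.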

Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
@erizocosmico
Contributor Author

Done

@ajnavarro ajnavarro merged commit c9ee09f into src-d:master Apr 10, 2019
3 participants