janitor supports incremental execution #2094
indexer command: quickwit --service indexer --service metastore

index config:

    version: 0
    index_id: logs
    indexing_settings:
      timestamp_field: unix_ms
    search_settings:
      default_search_fields: [log]
    sources:
      - source_id: kafka
        source_type: kafka
        num_pipelines: 3
        params:
          topic: log.log-container.stdout
          client_log_level: debug
          client_params:
            bootstrap.servers: '${ip}'
            group.id: quickwit
            debug: all
            log_level: 7
            auto.offset.reset: latest
    doc_mapping:
      partition_key: app_name
      max_num_partitions: 50
      tag_fields: [app_name]
      field_mappings:
        - name: app_name
          type: text
          tokenizer: raw
          fast: true
        - name: unit_name
          type: text
          tokenizer: raw
          fast: true
        - name: version
          type: text
          tokenizer: raw
          fast: true
        - name: idc
          type: text
          tokenizer: raw
          fast: true
        - name: container_name
          type: text
          tokenizer: raw
          fast: true
        - name: pod
          type: text
          tokenizer: raw
          fast: true
        - name: log
          type: text
        - name: unix_ms
          type: i64
          fast: true
Interesting, @evanxg852000 can you take a look at this?
@guidao I will try to reproduce this. At first sight: in the janitor we run GC, the retention policy, and delete tasks. Given that you have one index without a retention policy configured, I think the retention policy can be ruled out.
Yes, and we don't have a "delete task", so I suspect it's the GC task. We have a lot of splits to be deleted, which could also be the cause of the long transaction.

    metastore=> select split_state, count(1) from splits group by split_state;
        split_state    |  count
    -------------------+---------
     MarkedForDeletion | 1731338
     Published         |   95184
     Staged            |     400
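A hedged aside, not part of the original thread: one generic way to confirm whether the GC delete is the statement holding a long transaction is the standard pg_stat_activity view. The 80-character truncation and the 10-row limit below are arbitrary choices for readability.

    -- List the oldest open transactions together with the statement they are running.
    -- Long-lived entries whose query touches the splits table would point at the GC.
    SELECT pid,
           now() - xact_start AS xact_age,
           state,
           left(query, 80)    AS current_query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_age DESC
    LIMIT 10;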
@guilload @fmassot I think using the system correctly (running all necessary services) should not yield this huge number of deletable splits. However, we need to guard against having to load a huge number of splits at once.
It's not normal at all. We still need an explanation for this high number of MarkedForDeletion splits.
What I have observed is that the janitor works fine at first, but after a certain point it starts to time out on database operations and then never recovers. The error log has been lost for testing reasons, but I can re-run it tomorrow after clearing the data. Our consumption has not been very stable, and I suspect that a large number of splits in the Staged state, or a large number of merges, may be causing this problem.
A few things; could you run the following queries against the metastore?

Locks (per relation and mode):

    SELECT mode, pg_class.relname, count(*)
    FROM pg_locks
    JOIN pg_class ON (pg_locks.relation = pg_class.oid)
    WHERE pg_locks.mode IS NOT NULL AND pg_class.relname NOT LIKE 'pg_%%'
    GROUP BY pg_class.relname, mode;

Deadlocks (per database) (9.2+):

    SELECT datname, deadlocks FROM pg_stat_database;

Dead rows:

    SELECT relname, n_dead_tup FROM pg_stat_user_tables;

In the meantime, I'm going to add some log statements and metrics in the GC and metastore.
@guilload We have 50 indexers in our cluster. I will try it tomorrow.
    partition_key: app_name
    max_num_partitions: 50

Does the number of splits in the "deleted" state have anything to do with this configuration?
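As a rough, hedged calculation (not stated in the thread, and assuming each indexing pipeline can publish up to one split per partition per commit): with num_pipelines: 3 and max_num_partitions: 50, a single commit interval can produce up to 3 × 50 = 150 new splits before merging, so a partitioned doc mapping does multiply the number of splits the merge pipeline and the janitor have to work through.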
janitor logs (Postgres statement_timeout: 300000):

    metastore=> SELECT mode, pg_class.relname, count(*) FROM pg_locks JOIN pg_class ON (pg_locks.relation = pg_class.oid) WHERE pg_locks.mode IS NOT NULL AND pg_class.relname NOT LIKE 'pg_%%' GROUP BY pg_class.relname, mode;
           mode       |   relname    | count
    ------------------+--------------+-------
     RowExclusiveLock | splits       |   143
     RowShareLock     | indexes      |   137
     ExclusiveLock    | indexes      |     5
     RowExclusiveLock | indexes_pkey |   143
     AccessShareLock  | indexes_pkey |   137
     RowExclusiveLock | indexes      |   143
     RowExclusiveLock | splits_pkey  |   143

    metastore=> SELECT datname, deadlocks FROM pg_stat_database;
      datname  | deadlocks
    -----------+-----------
     postgres  |         0
     metastore |         0
     template1 |         0
     template0 |         0
    (4 rows)

    metastore=> SELECT relname, n_dead_tup FROM pg_stat_user_tables;
         relname      | n_dead_tup
    ------------------+------------
     spatial_ref_sys  |          0
     splits           |      33461
     delete_tasks     |          0
     indexes          |        120
     _sqlx_migrations |          5
    (5 rows)

    metastore=> select count(1), split_state from splits group by split_state;
     count  |    split_state
    --------+-------------------
     283923 | MarkedForDeletion
      68598 | Published
     164837 | Staged
    (3 rows)
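One generic Postgres note, not something stated by the maintainers: statement_timeout is expressed in milliseconds, so 300000 means any single metastore statement, including a GC delete over hundreds of thousands of split rows, is cancelled after 5 minutes. A minimal sketch of how such a timeout is usually inspected or changed, assuming the database is named metastore as in the prompts above:

    -- Show the timeout in effect for the current session.
    SHOW statement_timeout;

    -- Make a 5-minute timeout the default for the metastore database
    -- (applies to new connections only; existing sessions keep their value).
    ALTER DATABASE metastore SET statement_timeout = '5min';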
Linking this with #2126 so it's easier to follow the paper trail.
@guidao what is the general overview of your Postgres server?
Can you detail where the 300 connections are coming from? I assume we have 50 indexers; are the extra 250 searchers?
It should be related to this configuration.
Ah yes, we have a connection pool! Understood.
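A hedged sketch, not from the thread, of how the "where do the 300 connections come from" question is usually answered on the Postgres side; pg_stat_activity is a standard system view, and the grouping below simply attributes open connections to databases, roles, and client applications:

    -- Count current connections per database, role, application, and state,
    -- to see whether indexers, searchers, or a connection pool hold most of them.
    SELECT datname, usename, application_name, state, count(*) AS connections
    FROM pg_stat_activity
    GROUP BY datname, usename, application_name, state
    ORDER BY connections DESC;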
We pushed a bunch of fixes that have increased the performance of the janitor. I'm going to close this issue. Feel free to reopen if necessary.
During our usage we found that Postgres has a lot of transactions stuck waiting on locks. This caused the indexing pipeline to restart due to connection-fetch timeouts.
When I stopped the janitor, I noticed there were far fewer transactions waiting. This might be related to the fact that the janitor fetches a very large number of splits at once, so maybe it could be executed in batches, with a fixed number of splits at a time.
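As an illustration only, not the janitor's actual implementation: batching could mean having the garbage collector claim a bounded chunk of MarkedForDeletion splits per transaction instead of all of them at once. The split_id column name and the batch size of 1,000 are assumptions, and a real cleanup also has to delete the corresponding split files from storage before removing the metastore rows.

    -- Hypothetical batched cleanup: remove at most 1,000 MarkedForDeletion rows
    -- per statement so each transaction, and the locks it holds, stays short.
    -- Repeat until no rows are affected.
    DELETE FROM splits
    WHERE split_id IN (
        SELECT split_id
        FROM splits
        WHERE split_state = 'MarkedForDeletion'
        LIMIT 1000
    );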