TT returns inconsistent data on query #48
Hi Jens, I haven't looked at your case closely yet, but in general errno 12 indicates out of memory. It's very possible that TT can't fetch data with mmap due to OOM. On a 32-bit OS, TT may sometimes run out of virtual memory, but that seems unlikely in your case.
Btw, please use the latest version 0.11.7 if possible.
|
Jens, how big is your data? Would you mind sharing it with us for a repro?
OpenTSDB does not support the Influx line protocol, does it? For writes it only supports HTTP JSON and TCP plain put, AFAIK. Maybe I missed something? Regards |
Hi Jens, you are correct on the 503 response. We will make the change in the next release. Thanks. As for using wget to retrieve a large amount of data in one shot, it's not going to work, unfortunately. Our HTTP server does not support chunked encoding, which means we will try to return all the data in one response, and that will result in OOM. To dump all data out, your best bet is the 'inspect' tool. For example,
It will spit out all your data on stdout, although the format is not ideal. We can introduce a new output format that can be used to inject the data back into TT, if that helps. Thanks
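As a purely hypothetical sketch of such an invocation (the binary path, the option, and the data directory are assumptions; the actual flags of the inspect tool are not shown in this thread):
```bash
# Hypothetical only: the flag and both paths are guesses, not confirmed options of the inspect tool.
/path/to/ticktock/bin/inspect -d /path/to/ticktock/data > all_data_dump.txt
```
|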
I transferred all data to my laptop (i7-6600U CPU, 16 GB memory, running TT 0.11.7) and re-executed exactly the same queries -> all worked fine.
I'll switch soon :) |
It's about 50 megabytes as a gzip-compressed tar. I thought about sharing it, but I believe you won't be able to reproduce the issue on a larger-scale machine. As I wrote above, everything worked fine for me when running TT on my laptop. Maybe a repro would be possible on a Raspberry Pi Model 2B - but that is significantly smaller than my Odroid HC1. A repro should be possible on an Odroid XU4, which was the base platform for the HC1.
Yes, sure. But TT supports both line protocols, OpenTSDB's and Influx's. I tested with a mix of both when writing data to TT, to find out which one fits better for me. In the end I decided to use the Influx line protocol.
|
The largest response I got was smaller than half a megabyte. This, I believe, shouldn't be a reason for OOM.
I tried
Thanks Jens |
Hi, I just updated to 0.11.7 and ran another experiment (on the ARMv7 machine again):
When executing roughly the same queries against exactly the same database, with TT running on my i7-6600U CPU / 16 GB memory machine, everything works fine. Some info regarding my database:
|
Hello Yongtao,
I just checked about this. When querying some of my metrics for a whole day, I receive a So this is handled correctly without an OOM happening. I ran this a second time with the config property Thanks Jens |
I see you refer to OpenTSDB plain put (e.g.,
Yes, the Influx line protocol is the best. We are stress-testing JSON, OpenTSDB plain put, and Influx line right now. We will publish the results in the wiki.
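For reference, the two write formats being compared look roughly like this (metric and tag names are made-up examples, and 6181 as TT's TCP write port is an assumption):
```bash
# OpenTSDB plain "put" line (one data point; names are made up):
#   put sensors.temperature 1682083200 21.5 host=odroid room=lab
#
# The same point in InfluxDB line protocol:
#   sensors.temperature,host=odroid,room=lab value=21.5 1682083200
#
# e.g. piping a put line to TT's TCP port (port number is an assumption):
echo "put sensors.temperature 1682083200 21.5 host=odroid room=lab" | nc localhost 6181
```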
27530 is the delta of the timestamp, if I remember correctly. @ytyou can confirm.
You can find the ts-id to ts<metric, tags> mapping in data/ticktock.meta, which is a text file. Inspect is an internal tool; we use it to verify data integrity, missing points, etc. It is not very friendly to end users. |
We use mmap when loading data files for queries. This indicates OOM.
The canonical data size of 6.5M data points is only 12MB (1-2 bytes per data point on average). The OOM above seems more likely due to virtual memory than to physical memory, especially since it is on a 32-bit ARM OS. Did you happen to observe the size of RSS and VM during the query? top may be enough. We will see if we can repro with a similar data size. If it is indeed OOM (very likely), do you have any suggestions for the response code and status?
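For example, something along these lines samples VSZ/RSS once per second while the query runs (the PID is a placeholder to fill in):
```bash
# Sample TT's virtual and resident memory once per second while the query runs.
# Replace <tt_pid> with the actual TickTockDB process id (e.g. from ps or top).
while true; do
    date +%T
    ps -o vsz=,rss= -p <tt_pid>   # VSZ and RSS, in kB
    sleep 1
done
```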
Yes, if the query result size is larger than tcp.buffer.size, the query will fail. You have to reduce the result size, either by increasing the downsample interval or by reducing the query range, etc.
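As an illustration, a query shaped like the following keeps the result small by downsampling; the /api/query endpoint, port 6182, the metric name, and the epoch timestamps are assumptions/placeholders:
```bash
# Downsampled, range-limited query to keep the response under tcp.buffer.size.
curl -s http://localhost:6182/api/query \
  -H 'Content-Type: application/json' \
  -d '{
        "start": 1682035200,
        "end":   1682121600,
        "queries": [
          { "metric": "sensors.temperature", "aggregator": "avg", "downsample": "1h-avg" }
        ]
      }'
```
|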
@jens-ylja Are these numbers for the whole DB, or just specific to your query results? Can you please share the TicktockDB config with us? How many files are in each TSDB, and what is the file size? |
Hello @ylin30,
Not yet, I'll try to capture it today.
I would propose an HTTP 503 Service Unavailable. |
These are the counters of the whole DB.
Attached please find a listing of all data files ( |
I just executed a query accessing data from a test instance (a clone of the original) for the last 15 days - this worked.
After query execution state:
Next try with a larger clone.
After query execution - producing four
Indeed, these errors seem to be related to the virtual memory size limit. Do I have any way to influence TT's vmem usage? |
This is consistent with our observation: TT will go OOM if virtual memory reaches 2.5-3GB.
2.8GB vsz doesn't make sense, since your DB is around 100MB. BTW, each data file is configured to keep 2**15 = 32768 pages, which corresponds to 4MB by default. Unless you have some 600 data files open, your vsz should not be that large. I don't know if I can repro it, but we will try in an RPI environment.
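A quick back-of-the-envelope check of that estimate (the 128-byte page size is the default mentioned later in this thread):
```bash
# 32768 pages * 128 bytes/page = 4 MiB per data file by default
echo $((32768 * 128))                              # 4194304 bytes
# number of 4 MiB files needed to account for ~2.8 GB of vsize
echo $(( (2800 * 1024 * 1024) / (32768 * 128) ))   # ~700 files
```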
The fix is ready in dev by @ytyou. No partial results are returned, and the resp code is 503. It will be released in 0.11.8.
It is indeed a bug in inspect. The fix is ready in dev by @ytyou. It will be released in 0.11.8. |
I cross-checked the memory behaviour with almost the same DB on my laptop (i7-6600U CPU, 16 GB memory, Ubuntu 20.04 LTS) and I see the following. TT instance is new-born:
After query execution (which worked without an error):
This is a diff in vmem of almost 4GB - very large too in my opinion. BTW - the complete database I used for this test is:
|
"This is a diff in vmem of almost 4GB - very large too in my opinion." I agree. We will look into this... |
I verified that vsize was already 2GB ( Updated: vsize of TT on fresh start (x86, 64bit Ubuntu 22.04):
|
@jens-ylja I assume you ran the test on an 8-core CPU with a 32-bit OS. I simulated your scenario by setting http.listener.count and tcp.listener.count to 8 (basically 1 listener per core) on an RPI-0-w. You can verify that TT should have 71 threads ( I checked TT's mem usage with
The vsize is 580980kB, RSS 4284kB. So this is consistent with your case.
I backfilled 30 days of data (at a 10-second interval) for 50 time series. After the backfill, TT's vsize and rss remained almost the same.
I checked: TT's vsize and rss grew to 654MB and 11MB. They are significantly smaller than yours (which went from 662MB to 1.7GB).
I continued with a 30-day query and it succeeded. The final vsize of TT was 738MB, significantly smaller than yours (2.9GB).
So we can repro the vsize value expected at fresh start. However, the vsize after the query can't be reproed.
In general, TT's vsize is affected by a) the number of threads, and b) the number of files (data and headers) opened. For a), you can limit the number of listeners to the number of CPU cores (as by default). It is hard to estimate and control exactly how much vsize they cost, but on 32-bit, each thread will reserve 8MB of vsize for its stack. On 64-bit, there is an additional 64MB reserved for the heap (shared by other threads too, I think). There is a glibc flag to control this, but we don't want to change it. That's why you see a very large vsize on 64-bit even upon a fresh start. For b), we do have control over the number of data and header files opened for reads/writes. But with only 30 data files (each 4MB), their vsize should be small (120MB). Could you please help us capture TT's memory usage with Thanks for your efforts to improve TickTockDB
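As a concrete sketch of checking a) and limiting the listeners (the PID is a placeholder; the two config keys are the ones named above, while the key = value form and the value 4 are assumptions):
```bash
# Count TT's threads (replace <tt_pid> with the actual process id):
ls /proc/<tt_pid>/task | wc -l

# Limit listeners to the number of CPU cores in the TT config file
# (key names from this thread; the value is just an example):
#   http.listener.count = 4
#   tcp.listener.count = 4
```
|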
@ylin30 indeed my system has 8 cores (ARM Cortex-A15 (quad core) & ARM Cortex-A7 (quad core) in a big.LITTLE pack). When running with defaults for When executing a query hitting all files Attached please find the output of |
The direct cause is that there are 17 data.0 files opened and each uses 131072KB of vsize. In total they already contribute 2.2GB of vsize. What puzzles me is why each data file is so large (131072KB). By default its size should be (page_size * page_count) = (128B * 32768) = 4,194,304B. I verified that on a Raspberry-PI-0 (ARMv6, 32-bit OS, Raspbian GNU/Linux 11 (bullseye)). I don't have an Odroid HC1, but could you tell me what OS version you use? Lastly, if you don't mind, you could replace line 610 in src/core/mmap.cpp to print out the file size.
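One way to cross-check this without patching the code is to compare the on-disk size of the data files against what is actually mapped (the data directory path and the PID are placeholders; the data.* file names are taken from this thread):
```bash
# Size of the data files on disk (data directory path is an assumption):
find /path/to/ticktock/data -name 'data.*' -exec ls -l {} \;

# Size of the corresponding mmap'ed regions in TT's address space:
pmap -x <tt_pid> | grep 'data\.'
```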
|
@ylin30 My Odroid HC1 is running Ubuntu 20.04 in a minimized server edition by Hardkernel (the manufacturer of the Odroids).
I'll replace the code line and come back with the new logs - hopefully soon :) |
I found something fishy in your log.
20960870762807296 looks like an overflow. Will check. |
Here is the new log file of the line 610 patched version: @ylin30 you wrote about |
I suspect it might be caused by compaction, which compacts old data into new pages with a larger page size (e.g., the system page size of 4KB). It doesn't matter on a 64-bit OS, but it has a huge effect on a 32-bit OS. We will confirm whether this is the cause. Cheers
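A quick check of the numbers supports this hypothesis: with 4 KiB pages instead of 128 B pages, each file's mapping comes out at exactly the 131072 KB observed above (arithmetic only, not taken from the code):
```bash
echo $((32768 * 4096))        # 134217728 bytes = 131072 KiB per data file
echo $((17 * 32768 * 4096))   # roughly the 2.2GB contributed by the 17 open data.0 files
```
|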
@jens-ylja This is confirmed to be caused by compaction. It will be fixed in 0.11.8 soon. Thx! Assigned to @ytyou.
I just re-tested with the new v0.11.8 version and can confirm the issue is fixed. Thanks a lot
@jens-ylja only 571MB vsize after a 90-day query on ARMv7 32-bit! Good to see that it works. Again, thank you so much for discovering the bug. We wouldn't have realized it without you. Let TickTockDB rock! |
Hello,
I experimented a lot with the line format (OpenTSDB vs. Influx) and with the general layout of my time series.
With this, the data stored in TT became something of a muddle.
So I decided to clean up. But I don't want to lose my data, so I also decided to fetch it all from TT, reorganize it, clean it up, etc., and finally re-insert it into a new, clean TT instance.
Unfortunately I failed at the first task - fetching it all.
I tried the following (as a bash script):
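(Roughly along these lines - this is only a sketch; the metric names, port, endpoint, and time range are placeholders and assumptions, not the original script:)
```bash
#!/bin/bash
# Illustrative sketch only: metric names, port 6182, the /api/query endpoint,
# and the time range are assumptions.
METRICS="sensors.temperature sensors.humidity"
START=1677628800   # beginning of March 2023 (epoch seconds)
END=$(date +%s)

for metric in $METRICS; do
    wget -q -O "${metric}.json" \
        "http://localhost:6182/api/query?start=${START}&end=${END}&m=avg:${metric}"
done
```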
This wrote me some files with data and some with only `[]`. Running it a second time would sometimes return different results.
To verify whether the data was really gone, I checked with Grafana and sometimes got data, other times not. After waiting a couple of minutes and just running the Grafana query again, all is fine.
Finally I checked the logs and found a lot of error messages like:
Maybe these are the reasons for the `[]` answers. I'm running TT (still version 0.11.4) on an ODROID HC1, which has a total of only 2 GB RAM. Is it a valid assumption that this low memory availability causes the misbehaviour?
The effect seems to self-heal after a while. If the low memory availability is the reason, I would propose answering with HTTP 503 (temporarily unavailable) instead of 200 and `[]`. The 503 status code - in my opinion - would convey roughly the right information: "For now I cannot answer, please try later." As a workaround, I will slow down my queries for the data export, or even move everything (temporarily) to a larger-scale machine.
But I assume the same effect will happen if I run a query over the full time range (start == beginning of March, end == now).
Thanks
Jens