
Search query crashes container (depends on host) #78

Closed
tmnhy opened this issue Jul 27, 2021 · 8 comments
Labels
bug Something isn't working

Comments

tmnhy commented Jul 27, 2021

On one host system, a search query crashes the container. On another host with the same data set, it works fine.

Current Behavior

After a search query, the Docker container crashed.

Docker logs:
docker run -p 6333:6333 -l debug -v $(pwd)/qdrant/storage:/qdrant/storage generall/qdrant:

[2021-07-27T16:09:49Z INFO  wal::segment] Segment { path: "./storage/collections/test_collection/wal/open-5", entries: 31, space: (2424040/33554432)
}: opened
[2021-07-27T16:09:49Z INFO  wal::segment] Segment { path: "./storage/collections/test_collection/wal/open-6", entries: 0, space: (8/33554432) }: opened
[2021-07-27T16:09:49Z INFO  wal::segment] Segment { path: "./storage/collections/test_collection/wal/closed-1374", entries: 418, space: (33474360/33554432) }: opened
[2021-07-27T16:09:49Z INFO  wal] Wal { path: "./storage/collections/test_collection/wal", segment-count: 2, entries: [1374, 1823)  }: opened
[2021-07-27T16:09:56Z INFO  qdrant] loaded collection: test_collection
[2021-07-27T16:09:56Z INFO  actix_server::builder] Starting 12 workers
[2021-07-27T16:09:56Z INFO  actix_server::builder] Starting "actix-web-service-0.0.0.0:6333" service on 0.0.0.0:6333
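
A quick way to confirm how the container actually died is to inspect its final state; the container name qdrant_test below is hypothetical:

# Find the exited container
docker ps -a --filter status=exited

# OOMKilled distinguishes a kernel OOM kill from a crash inside the process
docker inspect -f 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}' qdrant_test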

Docker logs with strace:
strace docker run -p 6333:6333 -l debug -v $(pwd)/qdrant/storage:/qdrant/storage generall/qdrant:

futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL[2021-07-27T16:13:27Z INFO  wal::segment] Segment { path: "./storage/collections/test_collection/wal/open-5", entries: 31, space: (2424040/33554432) }: opened
[2021-07-27T16:13:27Z INFO  wal::segment] Segment { path: "./storage/collections/test_collection/wal/open-6", entries: 0, space: (8/33554432) }: opened
) = 0
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL[2021-07-27T16:13:27Z INFO  wal::segment] Segment { path: "./storage/collections/test_collection/wal/closed-1374", entries: 418, space: (33474360/33554432) }: opened
[2021-07-27T16:13:27Z INFO  wal] Wal { path: "./storage/collections/test_collection/wal", segment-count: 2, entries: [1374, 1823)  }: opened
) = 0
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL[2021-07-27T16:13:35Z INFO  qdrant] loaded collection: test_collection
[2021-07-27T16:13:35Z INFO  actix_server::builder] Starting 12 workers
[2021-07-27T16:13:35Z INFO  actix_server::builder] Starting "actix-web-service-0.0.0.0:6333" service on 0.0.0.0:6333
) = 0
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
futex(0x55d8f6e79c40, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigreturn({mask=[]})                 = 202
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x55d8f6e5cce8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 0)     = 0
epoll_pwait(4,  <unfinished ...>)       = ?
+++ exited with 132 +++
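
A note on that status: exit codes above 128 conventionally mean the process was terminated by signal (code - 128); here 132 - 128 = 4, i.e. SIGILL (illegal instruction):

kill -l 4   # prints: ILL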

Steps to Reproduce

The collection is created like this:

{
    "create_collection": {
        "name": "test_collection",
        "vector_size": 256,
        "distance": "Cosine"
    }
}
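
For reference, a runnable sketch of creating this collection over the REST API. The POST /collections route with a create_collection wrapper matches the qdrant 0.3.x API this report is about; treat the exact route as an assumption for other versions:

# Create the collection (endpoint shape assumed from the 0.3.x REST API)
curl -s -X POST http://localhost:6333/collections \
    -H 'Content-Type: application/json' \
    -d '{
        "create_collection": {
            "name": "test_collection",
            "vector_size": 256,
            "distance": "Cosine"
        }
    }'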

The number of vectors does not affect the error.

The search request looks like this:
POST /collections/test_collection/points/search

{
    "vector": [...],
    "top": 10
}
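
A runnable version of that request; the vector contents are arbitrary dummy values, since per the report the crash does not depend on them:

# Build a 256-dimensional dummy vector: "0.1,0.1,...,0.1"
VEC=$(yes 0.1 | head -n 256 | paste -sd, -)

curl -s -X POST http://localhost:6333/collections/test_collection/points/search \
    -H 'Content-Type: application/json' \
    -d "{\"vector\": [${VEC}], \"top\": 10}"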

Expected Behavior

The container should work stably and return the search result. It should also check resource availability (e.g. CPU features, memory) up front instead of crashing.

Possible Solution

Possibly this is affected by the processor instruction set, or by a lack of free memory? How large a block of free memory does a search require?

Context (Environment)

  • Host system based on an Intel(R) Core(TM) i7-3930
  • Free memory: ~1.8-2 GB
  • Ubuntu 18.04
  • Docker version 20.10.2
  • Latest pre-built image from Docker Hub
tmnhy added the bug (Something isn't working) label on Jul 27, 2021
@generall
Member

Hi @tmnhy, thanks for reporting the issue. It does not look to me like the issue is related to memory or CPU. The processor architecture is relatively new; Qdrant supports much older processors by default. The amount of free memory should also be enough; we run our demos with even more limited resources.

Is there any chance you could try to build and run Qdrant on your host machine without Docker and comment here with the result?

  • Is it a segfault or a panic?
  • What is the exit code?
  • Are there any mentions of the OOM killer in the output of dmesg? (A few commands that answer these are sketched below.)
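
A sketch of those checks, assuming a plain cargo release build (the binary path is an assumption):

# Run the natively built binary and capture its exit status;
# a status above 128 means death by signal (status - 128)
./target/release/qdrant
echo "exit code: $?"

# Check the kernel log for OOM kills or CPU traps
dmesg | grep -iE 'out of memory|oom|invalid opcode|segfault|trap'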


tmnhy commented Jul 28, 2021

Hi, I will try to build and run it without Docker later.

With Docker, the only relevant entry in dmesg at the time of the crash is:

traps: tokio-runtime-w[18989] trap invalid opcode ip:55aa9cf31de7 sp:7f4ce601bde0 error:0 in qdrant[55aa9c890000+8f4000]
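
That trap line already narrows things down: "invalid opcode" means the process executed an instruction this CPU does not implement, which matches the SIGILL exit seen earlier. The ip value is the faulting address and qdrant[55aa9c890000+8f4000] gives the mapping base, so the offset of the bad instruction inside the binary can be computed and disassembled (the binary path is an assumption, and the objdump addresses only line up for a typical PIE layout):

# Offset of the faulting instruction inside the qdrant binary
printf '%x\n' $(( 0x55aa9cf31de7 - 0x55aa9c890000 ))   # -> 6a1de7

# Disassemble around that offset to see which instruction trapped
objdump -d ./qdrant | grep -B2 -A2 '6a1de7:'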

@generall
Member

I have checked the compiled binary with https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html and indeed, it contains some AVX2 instructions, which are not supported by the i7-3930 CPU. I am currently trying to figure out how to compile it with the required instruction set.
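
For anyone without Intel SDE at hand, a rough equivalent is to grep the disassembly for AVX2-only mnemonics such as vpermd, vperm2i128 or vpbroadcast* (the binary path is an assumption):

# Count AVX2-only instructions in the shipped binary
objdump -d ./qdrant | grep -cE 'vpermd|vperm2i128|vpbroadcast'

# Compare with the flags the host CPU actually advertises
grep -o 'avx2' /proc/cpuinfo | sort -u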


tmnhy commented Jul 29, 2021

Hi, building the container from source solved this problem.

@generall
Member

https://github.com/qdrant/qdrant/releases/tag/v0.3.6 now uses a dynamic arch for OpenBLAS, but I have had no chance to test it on the actual CPU. I would appreciate it if you could try it on your machine.
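
A sketch of how to retest with that release; the image tag is an assumption, adjust it to whatever tag the release was actually published under:

docker pull generall/qdrant:v0.3.6
docker run -p 6333:6333 -v $(pwd)/qdrant/storage:/qdrant/storage generall/qdrant:v0.3.6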


generall commented Oct 5, 2021

The error can still be reproduced on Google Cloud virtual machines of the E2 and N1 types.
Steps to reproduce (consolidated into the shell sketch below):

  • Create a fresh VM on GCP, e.g. of type e2-small (2 vCPUs, 2 GB memory)
  • Install Docker
  • Run Qdrant: sudo docker run -it --network=host generall/qdrant
  • Create a collection with vector size >= 250 and Cosine distance
  • Insert just a few random vectors
  • Try a search request with another random vector
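
The steps above as one shell sketch. The create and search bodies come from this issue; the upsert_points payload and its POST /collections/{name} route follow the 0.3.x REST API and are an assumption:

# Dummy 256-dimensional vector
VEC=$(yes 0.1 | head -n 256 | paste -sd, -)

# Create the collection
curl -s -X POST http://localhost:6333/collections \
    -H 'Content-Type: application/json' \
    -d '{"create_collection": {"name": "test_collection", "vector_size": 256, "distance": "Cosine"}}'

# Insert one point (endpoint shape is an assumption, see above)
curl -s -X POST http://localhost:6333/collections/test_collection \
    -H 'Content-Type: application/json' \
    -d "{\"upsert_points\": {\"points\": [{\"id\": 1, \"vector\": [${VEC}]}]}}"

# This is the request that kills the container on affected CPUs
curl -s -X POST http://localhost:6333/collections/test_collection/points/search \
    -H 'Content-Type: application/json' \
    -d "{\"vector\": [${VEC}], \"top\": 10}"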

Actual result:

  • the Qdrant Docker container crashes with exit code 132

Expected:

  • normal execution of the search request

cat /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping        : 0
microcode       : 0x1
cpu MHz         : 2199.998
cache size      : 56320 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa
bogomips        : 4399.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
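
Worth noting in the dump above: this vCPU does advertise avx2 (see the flags line), so the faulting instruction presumably belongs to an extension this CPU lacks, or to a misdetected OpenBLAS kernel; that reading is an assumption, though it is consistent with the dynamic-arch OpenBLAS change above. A quick way to list the relevant flags:

# List the SIMD-related flags the VM's CPU reports
grep -oE '\b(sse4_[12]|avx|avx2|fma|avx512[a-z]*)\b' /proc/cpuinfo | sort -u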


generall reopened this on Oct 27, 2021
@generall
Member

Update: fixed in v0.4.2.
