
Master timeouts during dirAssign volume growth #5213

Closed
bvanelst opened this issue Jan 17, 2024 · 3 comments

bvanelst commented Jan 17, 2024

Describe the bug
Timeouts when requesting /dir/assign on the master(s).

System Setup

  • masters: 3 of them, but I get the same issue when I start only one:
  • /usr/local/bin/weed -v=3 -logdir=/var/log/seaweedfs master -mdir=/etc/seaweedfs -ip=10.0.9.15 -port=9333 -metrics.address=10.0.9.17:9091 -defaultReplication=010 -volumePreallocate -garbageThreshold=0.3 -volumeSizeLimitMB=20000 -peers=10.0.9.17:9333,10.0.9.14:9333,10.0.9.15:9333
  • volume servers: 7 of them, using different IPs, racks, and volumes.
  • /usr/local/bin/weed -v=3 -logdir=/var/log/seaweedfs volume -index=leveldb -mserver=10.0.9.17:9333,10.0.9.14:9333,10.0.9.15:9333 -dir=/volumes/98fb3388c280,/volumes/LHHGS,/volumes/e000c055cbe4,/volumes/c5a9aff45527,/volumes/619c9a0827f4,/volumes/f8c44345756f,/volumes/eeedca023938,/volumes/cae089cd2dd9,/volumes/20F30GRVRD,/volumes/KWEGS,/volumes/3d5638f4fd34,/volumes/18f39b04390d,/volumes/6a9e8c97ba2a,/volumes/LDTGS,/volumes/a9a6e2d048de,/volumes/20F308T27D,/volumes/19641ea6d6c5,/volumes/20F30JB3JE,/volumes/6e19dd8da77b,/volumes/3d1614841bcc,/volumes/372cb7e5ac18,/volumes/152d865d39ce,/volumes/20F305249D,/volumes/1289675b7f03,/volumes/222079443d03,/volumes/cc66d284719d,/volumes/ca6f98cd3c16,/volumes/6611045c7cf2,/volumes/381ee044d930,/volumes/ff81968af32c,/volumes/9d611128cfed,/volumes/21F306AD4F,/volumes/595892cb8709,/volumes/0553ccb52b90 -max=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 -concurrentDownloadLimitMB=20000 -concurrentUploadLimitMB=20000 -hasSlowRead=true -readBufferSizeMB=8 -compactionMBps=10 -rack=store02 -ip=10.0.9.2
  • OS version
  • Debian GNU/Linux 12 (bookworm) / Linux store02 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
  • output of weed version
  • version 30GB 3.62 59b8af99b0aca1b9e88fec7b5f27c7d15e5e8604 linux amd64
  • no filer, only masters and volume servers; we have our own metadata store.

Expected behavior
When we request one or more keys from the leader master, we don't expect timeouts; normally we get an instant response. But sometimes (every x minutes) we get a timeout on a request like http://10.0.9.17:9333/dir/assign?collection=nntp&count=10000&replication=001, even when we lower the count. Unfortunately the request itself doesn't return an error, but whenever this happens I see the following log entry:
seaweedfs-master[459769]: I0116 14:50:10.468052 master_server_handlers.go:125 dirAssign volume growth {"collection":"nntp","replication":{"node":1},"ttl":{"Count":0,"Unit":0}} from 10.0.9.12:40308

It looks like this always happens when there is a dirAssign volume growth.
In parallel there are constant POST requests going directly to the volume servers to store data.

I thought a workaround would be to use replication 010 or 002, but that only works for a while.

We just started testing SeaweedFS with 3.60, and after upgrading to 3.62 we still see this issue. (I didn't test older versions.)

Additional context
How I test/reproduce it:

while true; do curl --max-time 3 'http://localhost:9333/dir/assign?collection=nntp&replication=001'; echo "" ; done

Once every couple of seconds/minutes I get a timeout (also when I increase the max-time):

curl: (28) Operation timed out after 3000 milliseconds with 0 bytes received

At that moment I always see a volume growth message in the log:

seaweedfs-master[459769]: I0116 14:50:10.468052 master_server_handlers.go:125 dirAssign volume growth {"collection":"nntp","replication":{"node":1},"ttl":{"Count":0,"Unit":0}} from 10.0.9.12:40308
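
The same loop as a small Go program, in case that's easier to run (a sketch only; the endpoint, collection, and 3-second timeout are taken from the curl command above). It prints how long each assign takes so slow calls can be matched against the volume growth log entries:

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

func main() {
    // Same request as the curl loop, with a 3s client-side timeout.
    client := &http.Client{Timeout: 3 * time.Second}
    for {
        start := time.Now()
        resp, err := client.Get("http://localhost:9333/dir/assign?collection=nntp&replication=001")
        elapsed := time.Since(start)
        if err != nil {
            // This is where the timeouts show up.
            fmt.Printf("assign failed after %v: %v\n", elapsed, err)
            continue
        }
        body, _ := io.ReadAll(resp.Body)
        resp.Body.Close()
        fmt.Printf("assign took %v: %s\n", elapsed, body)
    }
}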

Screen shot
[Screenshot 2024-01-17 at 23:09:24]


bvanelst commented Jan 18, 2024

I downgraded to 3.59 and the timeouts/failed assigns seem to be gone. Also the log message below disappeared.

seaweedfs-master[375357]: I0118 11:52:34.725059 master_server_handlers.go:123 dirAssign volume growth {"collection":"nntp","replication":{"rack":1},"ttl":{"Count":0,"Unit":0},"preallocate":20971520000} from 10.0.9.12:36966

As you can see the traffic on our POST server is much more stable after the downgrade.

[Screenshot 2024-01-18 at 12:52:56]

The disk utilization of the volume servers is also much better.

[Screenshot 2024-01-18 at 14:58:34]

I think it could be related to #5154

@chrislusf
Collaborator

Added a fix for #5154

@BenoitKnecht
Contributor

> Added a fix for #5154

I'm still getting the timeouts described by @bvanelst with that fix, and I think I found what's causing it. In

if err := <-errCh; err != nil {
    writeJsonError(w, r, http.StatusInternalServerError, fmt.Errorf("cannot grow volume group! %v", err))
    return
}

we wait on the req.ErrCh channel. That channel is closed when we exit ProcessGrowRequest() here

if req.ErrCh != nil {
    req.ErrCh <- err
    close(req.ErrCh)
}

but not if we exit it there

} else {
    glog.V(4).Infoln("discard volume grow request")
    time.Sleep(time.Millisecond * 211)
    vl.DoneGrowRequest()
}

It would be easy enough to explicitly close it in that else branch (which I did locally, and it got rid of the timeouts), but I wonder if there are other situations where this channel is not properly closed; e.g. should it be closed when we skip the loop too?

if !ms.Topo.IsLeader() {
    //discard buffered requests
    time.Sleep(time.Second * 1)
    continue
}
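
For reference, this is roughly the change I tried locally in that else branch (same identifiers as in the snippets above; a sketch of the idea, not a polished patch). Closing req.ErrCh without sending anything makes the handler's <-errCh receive a nil error, so the assign request proceeds instead of hanging until the client times out:

} else {
    glog.V(4).Infoln("discard volume grow request")
    time.Sleep(time.Millisecond * 211)
    vl.DoneGrowRequest()
    // Unblock the /dir/assign handler waiting on req.ErrCh; a closed
    // channel yields a nil error, so the handler continues normally.
    if req.ErrCh != nil {
        close(req.ErrCh)
    }
}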

@chrislusf What do you think?
