
Too much memory used when doing aofshrink in master/slave mode #258

Closed

zycbobby opened this issue Jan 27, 2018 · 9 comments

zycbobby commented Jan 27, 2018

Hi, I inserted over 5 million points into the master, each set to expire in 72 hours.

When I run aofshrink on the master, I find that the slave uses a lot of memory and cannot release it through GC (not even with debug.FreeOSMemory()).

I profiled the memory usage with pprof. I think there might be a problem with how a point that will expire in the future is set on the slave.

c.exlist = append(c.exlist, exitem{key, id, at})

func (c *Controller) expireAt(key, id string, at time.Time) {
	m := c.expires[key]
	if m == nil {
		m = make(map[string]time.Time)
		c.expires[key] = m
	}
	m[id] = at
	c.exlistmu.Lock()
	c.exlist = append(c.exlist, exitem{key, id, at})
	c.exlistmu.Unlock()
}

After the aofshrink process finishes on the master, the master forces the slave to reconnect. The slave then re-follows from some point in the AOF file, which includes setting the points again.

In the Controller.expireAt function in expire.go, we always append a new exitem to the exlist, ignoring that the key and id might already exist there.

I think this can simply be avoided.
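Something like this is what I have in mind (just a sketch; it assumes the background sweep re-checks c.expires before purging and that a hint stays in exlist until the item is actually removed):

// Sketch: skip the append when the id already has a pending expiration,
// since the authoritative deadline lives in c.expires and exlist is only
// a hint of what may need purging later.
func (c *Controller) expireAt(key, id string, at time.Time) {
	m := c.expires[key]
	if m == nil {
		m = make(map[string]time.Time)
		c.expires[key] = m
	}
	_, already := m[id]
	m[id] = at
	if already {
		return // a hint for this key/id is already in exlist
	}
	c.exlistmu.Lock()
	c.exlist = append(c.exlist, exitem{key, id, at})
	c.exlistmu.Unlock()
}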


tidwall commented Jan 27, 2018

Hi Curry, this does sound like a problem. I'll need to investigate further, but just so I understand: is this issue occurring only on the slave, and only after the slave reconnects?

zycbobby (Author)

@tidwall You are right. It is a slave problem, and it happens after the reconnect.


zycbobby commented Jan 30, 2018

Frankly speaking,

c.exlist = append(c.exlist, exitem{key, id, at})

is a bad design. I think a priority queue would be better for background expiring, because we want to find the item whose expiration time is closest to now.
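For example, something along these lines with container/heap (only an illustration; the exitem type here is a local copy, not tile38's):

package main

import (
	"container/heap"
	"fmt"
	"time"
)

type exitem struct {
	key, id string
	at      time.Time
}

// expireHeap keeps the item with the nearest expiration time at the root.
type expireHeap []exitem

func (h expireHeap) Len() int            { return len(h) }
func (h expireHeap) Less(i, j int) bool  { return h[i].at.Before(h[j].at) }
func (h expireHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *expireHeap) Push(x interface{}) { *h = append(*h, x.(exitem)) }
func (h *expireHeap) Pop() interface{} {
	old := *h
	n := len(old)
	it := old[n-1]
	*h = old[:n-1]
	return it
}

func main() {
	h := &expireHeap{}
	heap.Push(h, exitem{"fleet", "truck2", time.Now().Add(2 * time.Hour)})
	heap.Push(h, exitem{"fleet", "truck1", time.Now().Add(1 * time.Hour)})
	// The background loop only has to look at the root to find the next
	// item due to expire.
	fmt.Println((*h)[0].id) // truck1
}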

tidwall added a commit that referenced this issue Jan 30, 2018

tidwall commented Jan 30, 2018

is a bad design. I think a priority queue would be better for background expiring

Ouch. Well I can say that some thought has been put into the current implementation. I found that an array outperforms a data structure such as a heap queue for systems with a high frequency of SETs with TTLs. It also avoids bunches of pointers to alleviate extra stress on the heap and GC.

Please keep in mind that the exlist array only contains hints at what may need to be purged at some point in the future. The backgroundExpiring function does the heavy lifting, which is based on the Redis implementation.

#156 has a bit of discussion around this topic and was the catalyst for the current system.
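Roughly, the sweep works like this (a simplified sketch, not the actual backgroundExpiring code; sweepExpires and deleteObject are stand-in names for illustration):

// Simplified sketch of a hint-based sweep: walk the exlist hints, and only
// delete an object when the authoritative deadline in c.expires agrees it
// has passed. Hints that are not yet due are kept for a later pass.
func (c *Controller) sweepExpires(now time.Time) {
	c.exlistmu.Lock()
	defer c.exlistmu.Unlock()
	kept := c.exlist[:0]
	for _, it := range c.exlist {
		if it.at.After(now) {
			kept = append(kept, it) // hint not due yet
			continue
		}
		if m := c.expires[it.key]; m != nil {
			if at, ok := m[it.id]; ok && !at.After(now) {
				c.deleteObject(it.key, it.id) // stand-in for the real purge
				delete(m, it.id)
			}
		}
	}
	c.exlist = kept
}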

As for the original issue: after AOFSHRINK on a leader has completed, the followers must resync their AOF file and reset their working data. This reset operation did not include clearing the exlist, so upon the next sync the exlist grew needlessly. That may be why you are seeing out-of-memory panics on your server. I just pushed an update that clears the exlist on a reset.
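In rough terms, the reset now also does something like this (a sketch, not the exact change in the commit referenced above):

// Sketch: as part of the follower's reset, drop all expiration hints so a
// re-follow starts with an empty exlist (the working data is reset separately).
c.exlistmu.Lock()
c.exlist = nil
c.exlistmu.Unlock()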

Please let me know if this update suffices. Thanks!

zycbobby (Author)

This reset operation did not include clearing the exlist, so upon the next sync the exlist grew needlessly. That may be why you are seeing out-of-memory panics on your server. I just pushed an update that clears the exlist on a reset.

You hit the point. I think the update works.


tidwall commented Jan 31, 2018

Great news!


zycbobby commented Jan 31, 2018

You hit the point. I think the update works.

Sorry, I think I missed something. After aofshrink (on the master):

The slave then re-follows from some point in the AOF file, which includes setting the points again.

So, here is how tile38 finds that "some point":

match, err := c.matchChecksums(conn, min, checksumsz)
if err != nil {
	return 0, err
}
if match {
	min += checksumsz // bump up the min
	for {
		if max < min || max+checksumsz > limit {
			pos = min
			break
		} else {
			match, err = c.matchChecksums(conn, max, checksumsz)
			if err != nil {
				return 0, err
			}
			if match {
				min = max + checksumsz
			} else {
				limit = max
			}
			max = (limit-min)/2 - checksumsz/2 + min // multiply
		}
	}
}
fullpos := pos
fname := c.aof.Name()
if pos == 0 {
	c.aof.Close()
	c.aof, err = os.Create(fname)
	if err != nil {
		log.Fatalf("could not recreate aof, possible data loss. %s", err.Error())
		return 0, err
	}
	return 0, nil
}

So after aofshrink, most of the time c.matchChecksums(conn, min, checksumsz) will return match=false when min=0, because the master's AOF has been rewritten and re-ordered, so its leading bytes no longer match the slave's copy.

So the reset function will not be called (on the slave) after aofshrink (on the master), and Controller.followCheckSome returns through this branch, with pos=0 and fullpos=0:

if pos == fullpos {
	if core.ShowDebugMessages {
		log.Debug("follow: aof fully intact")
	}
	return pos, nil
}


tidwall commented Jan 31, 2018

I can confirm this issue on my side too. I'll do some digging.


tidwall commented Jun 16, 2019

This was fixed in v1.16.2.

tidwall closed this as completed Jun 16, 2019