
Too much memory used when doing aofshrink in master/slave mode #258

Closed

zycbobby opened this issue Jan 27, 2018 · 9 comments

zycbobby commented Jan 27, 2018

Hi, I inserted over 5 million points into the master, each set to expire in 72 hours.

When I run aofshrink on the master, I find that the slave uses a lot of memory and cannot release it through GC (not even with debug.FreeOSMemory()).

I profiled the memory usage with pprof. I think there might be a problem with how a point that will expire in the future is set on the slave.

c.exlist = append(c.exlist, exitem{key, id, at})

func (c *Controller) expireAt(key, id string, at time.Time) {
	m := c.expires[key]
	if m == nil {
		m = make(map[string]time.Time)
		c.expires[key] = m
	}
	m[id] = at
	c.exlistmu.Lock()
	c.exlist = append(c.exlist, exitem{key, id, at})
	c.exlistmu.Unlock()
}

After the aofshrink process finishes on the master, the master forces the slave to reconnect. The slave then re-follows from some point in the AOF file, which includes setting the points again.

In the Controller.expireAt function in expire.go, we always append a new exitem to the exlist, ignoring that the key and id might already exist there.

I think this can simply be avoided.
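Something like this is what I have in mind (just a sketch; it assumes the background sweep re-checks c.expires before purging and that a hint stays in exlist until the item is actually removed):

// Sketch: skip the append when the id already has a pending expiration,
// since the authoritative deadline lives in c.expires and exlist is only
// a hint of what may need purging later.
func (c *Controller) expireAt(key, id string, at time.Time) {
	m := c.expires[key]
	if m == nil {
		m = make(map[string]time.Time)
		c.expires[key] = m
	}
	_, already := m[id]
	m[id] = at
	if already {
		return // a hint for this key/id is already in exlist
	}
	c.exlistmu.Lock()
	c.exlist = append(c.exlist, exitem{key, id, at})
	c.exlistmu.Unlock()
}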


tidwall commented Jan 27, 2018

Hi Curry, this does sound like a problem. I'll need to investigate further, but just so I understand: is this issue occurring only on the slave, and only after the slave reconnects?

zycbobby (Author)

@tidwall You are right. It is a slave problem, and it happens after the reconnect.


zycbobby commented Jan 30, 2018

Frankly speaking,

c.exlist = append(c.exlist, exitem{key, id, at})

is a bad design. I think a priority queue would be better for background expiring, because we want to find the item whose expiration time is closest to now.
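For example, something along these lines with container/heap (only an illustration; the exitem type here is a local copy, not tile38's):

package main

import (
	"container/heap"
	"fmt"
	"time"
)

type exitem struct {
	key, id string
	at      time.Time
}

// expireHeap keeps the item with the nearest expiration time at the root.
type expireHeap []exitem

func (h expireHeap) Len() int            { return len(h) }
func (h expireHeap) Less(i, j int) bool  { return h[i].at.Before(h[j].at) }
func (h expireHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *expireHeap) Push(x interface{}) { *h = append(*h, x.(exitem)) }
func (h *expireHeap) Pop() interface{} {
	old := *h
	n := len(old)
	it := old[n-1]
	*h = old[:n-1]
	return it
}

func main() {
	h := &expireHeap{}
	heap.Push(h, exitem{"fleet", "truck2", time.Now().Add(2 * time.Hour)})
	heap.Push(h, exitem{"fleet", "truck1", time.Now().Add(1 * time.Hour)})
	// The background loop only has to look at the root to find the next
	// item due to expire.
	fmt.Println((*h)[0].id) // truck1
}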

tidwall added a commit that referenced this issue Jan 30, 2018

tidwall commented Jan 30, 2018

is a bad design. I think a priority queue would be better for background expiring

Ouch. Well I can say that some thought has been put into the current implementation. I found that an array outperforms a data structure such as a heap queue for systems with a high frequency of SETs with TTLs. It also avoids bunches of pointers to alleviate extra stress on the heap and GC.

Please keep in mind that the exlist array only contains hints at what may need to be purged at some point in the future. The backgroundExpiring function does the heavy lifting, which is based on the Redis implementation.

#156 has a bit of discussion around this topic and was the catalyst for the current system.
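Roughly, the sweep works like this (a simplified sketch, not the actual backgroundExpiring code; sweepExpires and deleteObject are stand-in names for illustration):

// Simplified sketch of a hint-based sweep: walk the exlist hints, and only
// delete an object when the authoritative deadline in c.expires agrees it
// has passed. Hints that are not yet due are kept for a later pass.
func (c *Controller) sweepExpires(now time.Time) {
	c.exlistmu.Lock()
	defer c.exlistmu.Unlock()
	kept := c.exlist[:0]
	for _, it := range c.exlist {
		if it.at.After(now) {
			kept = append(kept, it) // hint not due yet
			continue
		}
		if m := c.expires[it.key]; m != nil {
			if at, ok := m[it.id]; ok && !at.After(now) {
				c.deleteObject(it.key, it.id) // stand-in for the real purge
				delete(m, it.id)
			}
		}
	}
	c.exlist = kept
}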

As for the original issue: after AOFSHRINK on a leader has completed, the followers must resync their AOF file and reset their working data. This reset operation did not include clearing the exlist, so upon the next sync the exlist grew needlessly. That may be why you are seeing out-of-memory panics on your server. I just pushed an update that clears the exlist on a reset.
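In rough terms, the reset now also does something like this (a sketch, not the exact change in the commit referenced above):

// Sketch: as part of the follower's reset, drop all expiration hints so a
// re-follow starts with an empty exlist (the working data is reset separately).
c.exlistmu.Lock()
c.exlist = nil
c.exlistmu.Unlock()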

Please let me know if this update suffices. Thanks!

zycbobby (Author)

This reset operation did not include clearing the exlist, so upon the next sync the exlist grew needlessly. That may be why you are seeing out-of-memory panics on your server. I just pushed an update that clears the exlist on a reset.

You hit the point. I think the update works.


tidwall commented Jan 31, 2018

Great news!


zycbobby commented Jan 31, 2018

You hit the point. I think the update works.

Sorry, I think I missed something. After aofshrink (on the master):

The slave then re-follows from some point in the AOF file, which includes setting the points again.

So, here is how tile38 finds that "some point":

match, err := c.matchChecksums(conn, min, checksumsz)
if err != nil {
	return 0, err
}
if match {
	min += checksumsz // bump up the min
	for {
		if max < min || max+checksumsz > limit {
			pos = min
			break
		} else {
			match, err = c.matchChecksums(conn, max, checksumsz)
			if err != nil {
				return 0, err
			}
			if match {
				min = max + checksumsz
			} else {
				limit = max
			}
			max = (limit-min)/2 - checksumsz/2 + min // multiply
		}
	}
}
fullpos := pos
fname := c.aof.Name()
if pos == 0 {
	c.aof.Close()
	c.aof, err = os.Create(fname)
	if err != nil {
		log.Fatalf("could not recreate aof, possible data loss. %s", err.Error())
		return 0, err
	}
	return 0, nil
}

So after aofshrink, most of the time c.matchChecksums(conn, min, checksumsz) will return match=false when min=0, because the master's AOF has been rewritten and re-ordered, so its leading bytes no longer match the slave's copy.

So the reset function will not be called (on the slave) after aofshrink (on the master), and Controller.followCheckSome returns through this branch, with pos=0 and fullpos=0:

if pos == fullpos {
	if core.ShowDebugMessages {
		log.Debug("follow: aof fully intact")
	}
	return pos, nil
}


tidwall commented Jan 31, 2018

I can confirm this issue on my side too. I'll do some digging.


tidwall commented Jun 16, 2019

This was fixed in v1.16.2.

tidwall closed this as completed Jun 16, 2019