
Multiple calls to the RpcClient # getConnection method fail to keep the connection persistent #210

Closed
Synex-wh opened this issue Dec 16, 2019 · 3 comments

@Synex-wh

Describe the bug

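Symptoms

  • The registry backend needs to keep a connection alive between two service nodes. After the connection is dropped because system memory is exhausted or for some other reason, we keep calling com.alipay.remoting.rpc.RpcClient#getConnection(com.alipay.remoting.Url, int), expecting it to restore the connection.

  • The task that repeatedly calls this method keeps running, but the Connection it gets back is null. The two nodes stay disconnected and no reconnect is attempted.

  • A memory dump shows that com.alipay.remoting.DefaultConnectionManager#connTasks still holds the RunStateRecordedFutureTask keyed by the ip+port of the connection, and its outcome, a ConnectionPool, is not null. However, the Connection list inside that ConnectionPool is empty. As a result, every call to getConnection returns a null Connection and never triggers a reconnect, because a new connection is only established (via task.run) after the task for that key has been removed from connTasks.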

image

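  • Looking at the code, the RunStateRecordedFutureTask is only removed in two ways: an IO disconnect event triggers com.alipay.remoting.DefaultConnectionManager#remove(com.alipay.remoting.Connection), or the scheduled RpcTaskScanner triggers com.alipay.remoting.DefaultConnectionManager#scan. I suspect the condition checked by that scan cleanup is wrong.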

image

  • In summary, there is some probability that this task is never removed, so repeated calls to getConnection cannot guarantee that the connection is rebuilt.
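
The stuck state described above can be reproduced with a small stand-alone model. This is only a sketch, not SOFABolt source code: `StuckTaskDemo`, `Pool`, the `connectAttempts` counter and the `127.0.0.1:12200` key are hypothetical stand-ins, and a plain `FutureTask` plays the role of `RunStateRecordedFutureTask`. It assumes, as the report says, that a connect attempt only happens when no task is cached for the key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.FutureTask;

/** Simplified stand-in for the task caching described above; not SOFABolt code. */
final class StuckTaskDemo {

    /** Hypothetical pool: just a list of connection names. */
    static final class Pool {
        final List<String> connections = new ArrayList<>();

        /** Returns the first connection, or null when the pool is empty. */
        String get() {
            return connections.isEmpty() ? null : connections.get(0);
        }
    }

    /** Plays the role of connTasks: one cached task per "ip:port" key. */
    final ConcurrentHashMap<String, FutureTask<Pool>> connTasks = new ConcurrentHashMap<>();
    int connectAttempts = 0;

    String getConnection(String key) throws Exception {
        FutureTask<Pool> task = connTasks.get(key);
        if (task == null) {
            // Only a missing task leads to a new connect attempt.
            FutureTask<Pool> created = new FutureTask<>(() -> {
                connectAttempts++;
                Pool pool = new Pool();
                pool.connections.add("conn-" + connectAttempts);
                return pool;
            });
            task = connTasks.putIfAbsent(key, created);
            if (task == null) {
                task = created;
                task.run();
            }
        }
        // A cached task always hands back the same pool, even if it is empty.
        return task.get().get();
    }

    public static void main(String[] args) throws Exception {
        StuckTaskDemo manager = new StuckTaskDemo();
        String key = "127.0.0.1:12200"; // hypothetical address

        System.out.println(manager.getConnection(key)); // first call connects

        // The connection dies and is dropped from the pool, but the task stays cached.
        manager.connTasks.get(key).get().connections.clear();

        System.out.println(manager.getConnection(key)); // empty pool, no reconnect
        System.out.println(manager.getConnection(key)); // same result on every retry

        // Removing the cached task is what allows a new connect attempt.
        manager.connTasks.remove(key);
        System.out.println(manager.getConnection(key));
    }
}
```

In this model the retries between clearing the pool and removing the task never increment `connectAttempts`, which matches the behaviour seen in the memory dump: the task exists, its pool is non-null, and the pool is empty.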

Environment

  • SOFABolt version:1.5.2
  • JVM version (e.g. java -version):
  • OS version (e.g. uname -a):
  • Maven version:
  • IDE version:
@sofastack-bot sofastack-bot bot changed the title 多次调用RpcClient#getConnection方法无法持续保持连接 Multiple calls to the RpcClient # getConnection method fail to keep the connection persistent Dec 16, 2019
@sofastack-bot

sofastack-bot bot commented Dec 16, 2019

Hi @Synex-wh, we detect non-English characters in the issue. This comment is an auto translation by @sofastack-robot to help other users to understand this issue.

We encourage you to describe your issue in English which is more friendly to other users.

Describe the bug

Phenomenon

  • Because the backend of the registration center needs to maintain the connection between two service nodes, after the connection is broken because system memory is full or for other reasons, we want to keep calling com.alipay.remoting.rpc.RpcClient#getConnection(com.alipay.remoting.Url, int) to ensure the connection.

  • At present, the task that continuously calls this method keeps running, but the Connection object it obtains is null, which shows that the two nodes are disconnected and no reconnect takes place.

  • A memory dump shows that in com.alipay.remoting.DefaultConnectionManager#connTasks the RunStateRecordedFutureTask keyed by the ip+port of the connection exists, and its outcome object, a ConnectionPool, is not null, but the Connection list in the ConnectionPool is empty. So every call to getConnection returns a null Connection and does not trigger re-establishment of the connection, because a connection is only established when task.run is triggered, and that requires the task for this key to be cleared from connTasks first.

![Image](https://user-images.githubusercontent.com/8018119/70888770-08617800-201c-11ea-816f-54cfdf0adc6d.png)

  • Looking at the code, this RunStateRecordedFutureTask object is only cleaned up when an IO disconnect event triggers com.alipay.remoting.DefaultConnectionManager#remove(com.alipay.remoting.Connection), or when RpcTaskScanner triggers com.alipay.remoting.DefaultConnectionManager#scan, but the condition checked by this cleanup task is suspected to be wrong.

![Image](https://user-images.githubusercontent.com/8018119/70889032-9c334400-201c-11ea-9a7a-3d9578778409.png)

  • Therefore, in summary, there is a certain chance that this task cannot be deleted, so continuously calling getConnection does not guarantee that the connection can be re-established.

  • JVM version (e.g. java -version):
  • OS version (e.g. uname -a):
  • Maven version:
  • IDE version:

@dbl-x dbl-x added bug Something isn't working and removed bug Something isn't working labels Dec 16, 2019
@cytnju
Contributor

cytnju commented Dec 16, 2019

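@Synex-wh Could you provide some bolt logs to help us locate this problem? It does look like there is a bug that can leave the connection pool empty and unable to reconnect. Normally, though, a disconnect event triggers the logic that clears the connection pool, so an empty pool should not occur. More logs would make this easier to investigate.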

@cytnju cytnju added the bug Something isn't working label Dec 16, 2019
@dbl-x dbl-x removed the bug Something isn't working label Dec 16, 2019
@dbl-x
Contributor

dbl-x commented Jan 2, 2020

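Background: the user is on bolt 1.5.2, and the application experienced a Full GC.
Possible causes:

  • The user's application hit a Full GC. Creating the Channel and creating the Connection are not a single atomic operation, so ConnectionManager#remove(Connection) may not have been executed when the Channel was disconnected.
  • If ConnectionManager#remove(Connection) is not executed, the ConnectionPool keeps holding the unusable connection until ConnectionPool#scan removes the Connection from the ConnectionPool. That step does not remove the RunStateRecordedFutureTask associated with the ConnectionPool.
  • During ConnectionManager#scan, a RunStateRecordedFutureTask is removed if its ConnectionPool contains no Connection and its accessTime is older than a certain period. In this application the user keeps calling getConnection, which refreshes accessTime, so the RunStateRecordedFutureTask is never removed.
  • Taken together, these points leave the user with a ConnectionPool that contains no Connection, while RpcClient#getConnection cannot create a new one.

Solution:

  • When scanning tasks, remove the RunStateRecordedFutureTask if the connectionPool is empty and AccessTime is older than the configured time;
  • Update AccessTime only when the ConnectionPool is created or written to, not when it is read.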
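
The interaction between polling and the scan rule can be illustrated with a small model. This is a sketch under stated assumptions, not the actual change made in SOFABolt: `ScanDemo`, `Pool`, the `refreshOnRead` flag, the 10-second idle limit and the one-second polling interval are all hypothetical. Time is passed in explicitly so the example runs without sleeping.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified model of the scan rule discussed above; not SOFABolt code. */
final class ScanDemo {

    /** Hypothetical pool that tracks only a connection count and a timestamp. */
    static final class Pool {
        int size;
        long accessTime;
    }

    final Map<String, Pool> pools = new ConcurrentHashMap<>();
    final boolean refreshOnRead;
    final long maxIdleMillis;

    ScanDemo(boolean refreshOnRead, long maxIdleMillis) {
        this.refreshOnRead = refreshOnRead;
        this.maxIdleMillis = maxIdleMillis;
    }

    /** Write path: creating or filling a pool always refreshes accessTime. */
    void add(String key, long now) {
        Pool pool = pools.computeIfAbsent(key, k -> new Pool());
        pool.size++;
        pool.accessTime = now;
    }

    /** Models the pool-level scan dropping dead connections but keeping the pool. */
    void dropDead(String key) {
        Pool pool = pools.get(key);
        if (pool != null) {
            pool.size = 0;
        }
    }

    /** Read path: refreshes accessTime only in the old (buggy) mode. */
    boolean hasConnection(String key, long now) {
        Pool pool = pools.get(key);
        if (pool == null) {
            return false;
        }
        if (refreshOnRead) {
            pool.accessTime = now;
        }
        return pool.size > 0;
    }

    /** Manager-level scan: evict pools that are empty and idle for too long. */
    void scan(long now) {
        pools.entrySet().removeIf(e ->
                e.getValue().size == 0 && now - e.getValue().accessTime > maxIdleMillis);
    }

    /** Returns true if the empty pool is still cached after a minute of polling. */
    static boolean survivesPolling(boolean refreshOnRead) {
        ScanDemo manager = new ScanDemo(refreshOnRead, 10_000);
        String key = "127.0.0.1:12200"; // hypothetical address
        manager.add(key, 0);
        manager.dropDead(key);
        for (long now = 1_000; now <= 60_000; now += 1_000) {
            manager.hasConnection(key, now); // caller polls every second
            manager.scan(now);
        }
        return manager.pools.containsKey(key);
    }

    public static void main(String[] args) {
        System.out.println("refresh on read:  stuck = " + survivesPolling(true));
        System.out.println("refresh on write: stuck = " + survivesPolling(false));
    }
}
```

With `refreshOnRead` enabled, every poll resets the idle clock, so the empty pool is never old enough to evict. With write-only refresh, the same polling leaves `accessTime` untouched and the scan removes the entry once the idle limit has passed, which lets the next lookup start from a clean state.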

@dbl-x dbl-x added this to the 1.6.2 release milestone Jul 1, 2020
dbl-x added a commit to dbl-x/sofa-bolt that referenced this issue Jul 6, 2020
dbl-x added a commit to dbl-x/sofa-bolt that referenced this issue Jul 6, 2020
@dbl-x dbl-x mentioned this issue Jul 7, 2020
cytnju pushed a commit that referenced this issue Jul 13, 2020
* fix code style

* fix #210

* fix #210

* fix #222

* support check and create connection in async way

* return false if no connection and try create in async way

* switch ClassLoader in try-finally
@dbl-x dbl-x closed this as completed Jul 22, 2020