dial tcp: i/o timeout #1327

Closed

srinidhis94 opened this issue May 14, 2020 · 15 comments

Comments

@srinidhis94

I am using go-redis v6.14.2. My application runs in an AWS cluster behind a load balancer, with ElastiCache as the Redis backend. All Redis requests failed on one of the nodes in the cluster; the rest of the nodes worked as expected. The application started working properly again after a restart.
Can you please help me identify the issue?
If this is a previously known issue that has been fixed in a later version, can you point me to that fix?

The error was "dial tcp: i/o timeout".

Below is my cluster configuration, excluding the Redis host addresses and password:

  • ReadOnly : true
  • RouteByLatency : true
  • RouteRandomly : true
  • DialTimeout : 300ms
  • ReadTimeout : 30s
  • WriteTimeout : 30s
  • PoolSize : 12000
  • PoolTimeout : 32
  • IdleTimeout : 120s
  • IdleCheckFrequency : 1s
import (
	"sync"

	goRedisClient "github.com/go-redis/redis"
	"github.com/spf13/viper"
)

// log and errorCreatingRedisClusterClient are package-level items not shown here.
var (
	clusterClientOnce  sync.Once
	redisClusterClient *goRedisClient.ClusterClient
)

// GetRedisClient lazily builds a single shared cluster client from viper config.
func GetRedisClient() *goRedisClient.ClusterClient {
	clusterClientOnce.Do(func() {
		redisClusterClient = goRedisClient.NewClusterClient(
			&goRedisClient.ClusterOptions{
				Addrs:          viper.GetStringSlice("redis.hosts"),
				ReadOnly:       true,
				RouteByLatency: true,
				RouteRandomly:  true,
				Password:       viper.GetString("redis.password"),

				DialTimeout:  viper.GetDuration("redis.dial_timeout"),
				ReadTimeout:  viper.GetDuration("redis.read_timeout"),
				WriteTimeout: viper.GetDuration("redis.write_timeout"),

				PoolSize:           viper.GetInt("redis.max_active_connections"),
				PoolTimeout:        viper.GetDuration("redis.pool_timeout"),
				IdleTimeout:        viper.GetDuration("redis.idle_connection_timeout"),
				IdleCheckFrequency: viper.GetDuration("redis.idle_check_frequency"),
			},
		)

		if err := redisClusterClient.Ping().Err(); err != nil {
			log.WithError(err).Error(errorCreatingRedisClusterClient)
		}
	})
	return redisClusterClient
}

As suggested in the comments on #1194,
I wrote the following snippet to dial every node reported for the cluster slots and test its health. There were no errors. As mentioned, this happens randomly on one of the clients, not always; it happened again after 3-4 months, and it is always fixed by a restart.

// CheckRedisSlotConnection dials every node address returned by CLUSTER SLOTS
// and reports whether a TCP connection can be established within 500ms.
func CheckRedisSlotConnection(testCase string) {
	fmt.Println(viper.GetStringSlice("redis.hosts"))
	fmt.Println("Checking testcase " + testCase)
	client := redis.GetRedisClient()
	slots := client.ClusterSlots().Val()
	addresses := []string{}
	for _, slot := range slots {
		for _, node := range slot.Nodes {
			addresses = append(addresses, node.Addr)
		}
	}
	fmt.Println("Received " + strconv.Itoa(len(addresses)) + " node addresses")
	for _, address := range addresses {
		fmt.Println("Testing address " + address)
		conn, err := net.DialTimeout("tcp", address, 500*time.Millisecond)
		if err != nil {
			fmt.Println("Error dialing to address " + address + " Error " + err.Error())
			continue
		}
		fmt.Println("Successfully dialled to address " + address)
		err = conn.Close()
		if err != nil {
			fmt.Println("Error closing connection " + err.Error())
			continue
		}
	}
}
@srinidhis94
Author

srinidhis94 commented May 14, 2020

If this issue has been identified and fixed, can you point me to the commit where that was done? Any suggestions would help. I can try upgrading the go-redis version, but the releases don't carry information about what changed, so I don't know whether an upgrade alone would fix this.

@ken5scal

I'm facing the same problem

@srinidhis94
Author

srinidhis94 commented Jul 14, 2020

Couldn't get to the root of it, but scaling the system seems to have fixed the issue for me. I think we may have hit one of the system limits.

@srinidhis94
Author

Adding more details: in our case it was a CPU issue. We were using T2 instances, and this was occurring when the CPU credits hit zero. Since our instances were behind an ASG, it became hard to diagnose the issue after the instance went down.

@yanksyoon

Any updates on this?

cli := redis.NewClusterClient(&redis.ClusterOptions{
	Addrs: []string{"127.0.0.1:6379"},
})
val, err := cli.Ping().Result()
fmt.Println(val, err)

This code gives me a dial tcp: i/o timeout error (I port-forwarded the Redis cluster service to localhost),
but running redis-cli ping returns PONG instantly.

@wziww
Contributor

wziww commented Sep 18, 2020

Any updates on this?

cli := redis.NewClusterClient(&redis.ClusterOptions{
	Addrs: []string{"127.0.0.1:6379"},
})
val, err := cli.Ping().Result()
fmt.Println(val, err)

This code gives me a dial tcp: i/o timeout error (I port-forwarded the Redis cluster service to localhost),
but running redis-cli ping returns PONG instantly.

Could you provide more information about the TCP remote address? Ping with NewClusterClient may not go to your target address, because the node is picked with the hashtag.RandomSlot() func.
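
For reference, a minimal sketch (not from this thread) of pinging every node explicitly instead of relying on the random-slot Ping. It assumes go-redis v6 and reuses the GetRedisClient helper and goRedisClient alias from the snippets above; PingEveryNode is a hypothetical name:

func PingEveryNode() {
	client := redis.GetRedisClient()
	// ForEachNode runs the callback once per known cluster node (masters and replicas).
	err := client.ForEachNode(func(node *goRedisClient.Client) error {
		addr := node.Options().Addr
		if pingErr := node.Ping().Err(); pingErr != nil {
			fmt.Println("ping failed for " + addr + ": " + pingErr.Error())
			return nil // keep checking the remaining nodes
		}
		fmt.Println("ping ok for " + addr)
		return nil
	})
	if err != nil {
		fmt.Println("ForEachNode error: " + err.Error())
	}
}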

@yanksyoon

Would you elaborate on what you mean by more information about the TCP remote address?
Here is my cluster slots command output for your reference.

127.0.0.1:6379> cluster slots
1) 1) (integer) 5461
   2) (integer) 10922
   3) 1) "172.18.0.30"
      2) (integer) 6379
      3) "ac153c4a977513c85d58926a9d607ccafb45f080"
   4) 1) "172.18.0.17"
      2) (integer) 6379
      3) "f327e1fb9a9949d6381d9718543c7c80ad5fbda0"
2) 1) (integer) 0
   2) (integer) 5460
   3) 1) "172.18.0.29"
      2) (integer) 6379
      3) "9058ce5495d11976ae9faaa7b83f8ce101418386"
   4) 1) "172.18.0.11"
      2) (integer) 6379
      3) "b5b48f467b7f3707d8315616b8c30b0518c5a0fb"
3) 1) (integer) 10923
   2) (integer) 16383
   3) 1) "172.18.0.28"
      2) (integer) 6379
      3) "c7c9c4b3e0c591e61b240b989cd4a74141a14c02"
   4) 1) "172.18.0.20"
      2) (integer) 6379
      3) "a42af00cb515d90dc8ed5560919ea084fc399373"

Output from running a command with redis-cli:

127.0.0.1:6379> LRANGE TEST 0 -1
(error) MOVED 2058 172.18.0.29:6379

@yanksyoon

I was port forwarding from the server's Redis cluster service (deployed with the Helm chart's default config) to my localhost, which messed up the behaviour for redis-cli. SSH-ing into the node and then running redis-cli operations works fine.

@wziww
Contributor

wziww commented Sep 18, 2020

I was port forwarding from the server's Redis cluster service (deployed with the Helm chart's default config) to my localhost, which messed up the behaviour for redis-cli. SSH-ing into the node and then running redis-cli operations works fine.

That's the point. When you SSH into the node, the client (redis-cli) can reach the node addresses (like 172.18.0.x). But can those addresses be reached from localhost? If not, your code may not be doing what you want :)

@vasundhara785

vasundhara785 commented Feb 12, 2021

@srinidhis94 Were you able to fix the issue? I am facing the same problem with the error dial tcp x.x.x.x:6380: i/o timeout, and I urgently need help with this.
More insights:
Do operations made on Redis also create a client connection internally?

@srinidhis94
Author

I found this to be a system limitation, not a bug in the library. Check your CPU, memory, and network limits. A connection will be established with every node in the cluster; Redis won't proxy the requests. Each node either executes the request or responds with a MOVED error containing the address of the node that handles the given key.
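
To illustrate the MOVED behaviour, here is a sketch (not from the thread) assuming go-redis v6 and the goRedisClient alias used earlier; the node addresses are placeholders taken from the CLUSTER SLOTS output above:

// A plain single-node client pointed at one cluster node surfaces the redirect:
// if that node does not own the slot for "TEST", the command fails with an error
// like "MOVED 2058 172.18.0.29:6379".
single := goRedisClient.NewClient(&goRedisClient.Options{Addr: "172.18.0.30:6379"})
fmt.Println(single.LRange("TEST", 0, -1).Err())

// ClusterClient discovers the slot map and dials the owning node itself,
// so the same command is executed instead of being answered with MOVED.
cluster := goRedisClient.NewClusterClient(&goRedisClient.ClusterOptions{
	Addrs: []string{"172.18.0.30:6379"},
})
fmt.Println(cluster.LRange("TEST", 0, -1).Err())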

@vasundhara785

vasundhara785 commented Feb 15, 2021

#1659.

Can you please look into this? The number of connected clients keeps increasing even though the client connection was established at application start.

How many CPU cores did it take to resolve the issue?

@sjmunizcomplytrust

Hello, I can consistently reproduce this issue on EC2 instances at AWS.
My setup is one EC2 instance running Redis and another in the same subnet running a Docker container with a Go app.
On client start it always throws:

2022/02/21 21:39:42 ApplicationStartupError: Error while trying to connect to Redis: OriginalError: dial tcp 10.0.1.53:6379: i/o timeout
panic: ApplicationStartupError: Error while trying to connect to Redis: OriginalError: dial tcp 10.0.1.53:6379: i/o timeout
goroutine 1 [running]:
main.main()
github.com/business/app-name/main.go:36 +0x4e5

Restarting the Docker container always fixes the issue.

@yanksyoon

@sjmunizcomplytrust
Do you have Istio or any other service mesh / sidecar running in your cluster? Sometimes the application boots up faster than the sidecar, making the connections time out.
For Istio, you can add holdApplicationUntilProxyStarts: true to fix the issue.
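
Independent of Istio, a minimal workaround sketch for that startup race: retry the initial Ping with a growing delay so the app tolerates the network (for example a sidecar proxy) becoming ready slightly after the container starts. waitForRedis is a hypothetical helper, not part of go-redis, and the retry count and delay are arbitrary; it assumes the go-redis v6 ClusterClient from the snippet above:

func waitForRedis(client *redis.ClusterClient) error {
	var err error
	for attempt := 1; attempt <= 10; attempt++ {
		if err = client.Ping().Err(); err == nil {
			return nil
		}
		fmt.Printf("redis not reachable yet (attempt %d): %v\n", attempt, err)
		time.Sleep(time.Duration(attempt) * 500 * time.Millisecond)
	}
	return err
}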

@vmihailenco
Collaborator

@charlie4284 thanks for the advice - I've updated the docs at https://redis.uptrace.dev/guide/server.html#dial-tcp-i-o-timeout

@redis redis locked and limited conversation to collaborators Feb 22, 2022
@vmihailenco vmihailenco converted this issue into discussion #2032 Feb 22, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
