
Possible locking in SCTP/IP4 rcv thread #633

Open
kanakagrawal opened this issue Aug 19, 2021 · 16 comments

@kanakagrawal

Hi,

We are using this userland SCTP stack with a raw socket for our application. We ran into a performance limit: our application could not get beyond ~40k tps. We found that the bottleneck was the SCTP/IP4 rcv thread and tried to parallelize it. To do that, we created n worker threads that do the actual work of the stack; the SCTP/IP4 rcv thread only reads packets from the raw socket and passes them to the worker threads. A hash function ensures that all packets of the same association go to the same worker thread. This increased the performance a lot (around 260k tps). I think more tps could be supported, but there is still some locking between the worker threads: the SCTP/IP4 rcv thread was waiting for the worker threads to be ready, which caused a performance drop even though the CPU usage of the worker threads was only around 50%.
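Roughly, the dispatch works as in the sketch below (simplified and with illustrative names, not our actual code): the rcv thread hashes each packet to a worker and pushes it into that worker's bounded queue, and it blocks when the queue is full, which is where the stall shows up.

/*
 * Simplified sketch of the dispatch model (illustrative names, not our
 * actual code): the rcv thread pushes each packet into the bounded
 * queue of the worker selected by the hash; each worker drains its own
 * queue and calls into the stack.
 */
#include <pthread.h>
#include <stddef.h>

#define QUEUE_DEPTH 1024

struct pkt_queue {
	void		*slots[QUEUE_DEPTH];
	size_t		 head, tail, count;
	pthread_mutex_t	 lock;
	pthread_cond_t	 not_empty, not_full;
};

void pkt_queue_init(struct pkt_queue *q)
{
	q->head = q->tail = q->count = 0;
	pthread_mutex_init(&q->lock, NULL);
	pthread_cond_init(&q->not_empty, NULL);
	pthread_cond_init(&q->not_full, NULL);
}

/* Called by the rcv thread after hashing the packet to a worker. */
void pkt_queue_push(struct pkt_queue *q, void *pkt)
{
	pthread_mutex_lock(&q->lock);
	while (q->count == QUEUE_DEPTH)		/* rcv thread waits here */
		pthread_cond_wait(&q->not_full, &q->lock);
	q->slots[q->tail] = pkt;
	q->tail = (q->tail + 1) % QUEUE_DEPTH;
	q->count++;
	pthread_cond_signal(&q->not_empty);
	pthread_mutex_unlock(&q->lock);
}

/* Called by a worker thread; the packet is then fed into the stack. */
void *pkt_queue_pop(struct pkt_queue *q)
{
	void *pkt;

	pthread_mutex_lock(&q->lock);
	while (q->count == 0)
		pthread_cond_wait(&q->not_empty, &q->lock);
	pkt = q->slots[q->head];
	q->head = (q->head + 1) % QUEUE_DEPTH;
	q->count--;
	pthread_cond_signal(&q->not_full);
	pthread_mutex_unlock(&q->lock);
	return pkt;
}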

Can you tell us whether there is some global data accessed across associations for which there might be locking? Our assumption was that all locking would be within an association, so having a single worker thread process all packets of an association would reduce the performance impact of locking.

@tuexen
Member

tuexen commented Aug 19, 2021

There is, for example, a global table for looking up the association... In the kernel we use a lock to protect this data structure and differentiate between 'locking for reading' and 'locking for writing': multiple readers are allowed to use the data structure at a time, but only a single writer. This is all managed by macros, see SCTP_INP_INFO_RLOCK() and SCTP_INP_INFO_WLOCK(). In userland, we map all of this to a simple pthread mutex, so this will be a bottleneck.

The code is written in a way that you can have multiple threads getting packets and calling sctp_common_input_processing(), and it would be very good if packets belonging to the same association end up in the same thread. This can easily be ensured by looking at the verification tag of the incoming packets (except for the ones with the T-bit set, but you don't have a lot of those).

So you might want to replace SCTP_INP_INFO_RLOCK() and SCTP_INP_INFO_WLOCK() with something that allows multiple readers. I guess a pthread_rwlock would be better, but I did not take the time to do this, because the consumers of the library, up to now and to my knowledge, did not focus on performance with respect to multiple associations.
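A minimal sketch of such a replacement could look like the following (untested; the unlock macro names are assumed to mirror the corresponding kernel macros):

/*
 * Sketch: back the INP_INFO lock macros with a pthread rwlock instead
 * of a plain mutex, so that multiple readers can look up associations
 * concurrently. Untested; the *_RUNLOCK/*_WUNLOCK names are assumed to
 * mirror the kernel macros.
 */
#include <pthread.h>

static pthread_rwlock_t sctp_inp_info_rwlock = PTHREAD_RWLOCK_INITIALIZER;

#define SCTP_INP_INFO_RLOCK()	(void)pthread_rwlock_rdlock(&sctp_inp_info_rwlock)
#define SCTP_INP_INFO_WLOCK()	(void)pthread_rwlock_wrlock(&sctp_inp_info_rwlock)
#define SCTP_INP_INFO_RUNLOCK()	(void)pthread_rwlock_unlock(&sctp_inp_info_rwlock)
#define SCTP_INP_INFO_WUNLOCK()	(void)pthread_rwlock_unlock(&sctp_inp_info_rwlock)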

@cukiernick

@tuexen
Please take a look at this commit: cukiernick@0a00e77
Maybe it can be useful somehow.

@kanakagrawal
Author

Thanks for the reply!

The code is written in a way that you can have multiple threads getting packets and calling sctp_common_input_processing(), and it would be very good if packets belonging to the same association end up in the same thread. This can easily be ensured by looking at the verification tag of the incoming packets (except for the ones with the T-bit set, but you don't have a lot of those).

We did exactly this. Our hash function, which selects the thread for an incoming packet, is based on the verification tag:

/*
All packets for one association are handled by a single worker thread.
The target thread is chosen based on the verification tag from the SCTP header.
*/
int get_target_thread(struct mbuf **recvmbuf, bool ipv6)
{
	struct sctphdr *sh;

	if (ipv6) {
		/* The IPv6 receive path delivers the packet without the IP
		 * header, so the mbuf starts at the SCTP header. */
		sh = mtod(recvmbuf[0], struct sctphdr *);
	} else {
		/* For IPv4 the IP header is included; skip it. We assume
		 * packets without IP options here. */
		struct ip *iphdr = mtod(recvmbuf[0], struct ip *);
		sh = (struct sctphdr *)((caddr_t)iphdr + sizeof(struct ip));
	}
	/* The byte order of v_tag does not matter for load distribution. */
	uint32_t v_tag = sh->v_tag;

	return v_tag % worker_threads_number;
}

So you might want to replace SCTP_INP_INFO_RLOCK() and SCTP_INP_INFO_WLOCK() with something that allows multiple readers. I guess a pthread_rwlock would be better, but I did not take the time to do this, because the consumers of the library, up to now and to my knowledge, did not focus on performance with respect to multiple associations.

We also thought about this and already changed it. The commit is the one @cukiernick mentioned above.

Do you think there is any other possible area where performance can be improved?

@tuexen
Member

tuexen commented Aug 19, 2021

@tuexen
Please take a look at this commit: cukiernick@0a00e77
Maybe it can be useful somehow.

That looks exactly like what I had in mind, except for replacing inp_mtx. The macros allow it, but we have only used them with a mutex so far, so using a rwlock for inp_mtx is basically untested... I don't expect any substantial gain from that.

Can I take (part of) the patch and integrate it?

@tuexen
Member

tuexen commented Aug 19, 2021

Do you think there is any other possible area where performance can be improved?

I think that is it from the perspective of distributing the load over multiple cores. So you can try to reduce the CPU load:

  1. Not sure if it is possible in your use case to use CRC32c offloading to the NIC. Intel cards support this, and in the FreeBSD kernel implementation we make use of it. However, this is not easy for a userland stack. I think I discussed this a while ago in the context of netmap.
  2. If CRC32c offloading is not possible but you are running on ARM64, you could use assembler code for the CRC32c using specific processor support. The userland stack provides a way to use your own CRC32c code. A similar technique is possible on AMD64, but the code required is more complex (see the sketch after this list).
  3. Are you using the AUTH extension? Then optimising the SHA-1 and SHA-256 code will gain you something. ARM64 allows some optimisations when using assembler code (similar to CRC32c). But it depends on whether you are using ARM64 or not...
  4. Reduce the amount of copying of user data. Originally the userland stack did not copy user data, but when the API was polished to be used within web browsers, the request was for a simple and portable API. Right now the user data is copied at the boundary to the application. If you expose the mbuf API you might be able to reduce the overhead.
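For AMD64, a minimal sketch of what a hardware-assisted CRC32c could look like is below, using the SSE4.2 crc32 instruction (compile with -msse4.2). How it gets hooked into the stack's checksum path, and whether an extra byte swap is needed at that point, depends on the integration and is not shown here:

/*
 * Sketch: CRC32c (Castagnoli) using the SSE4.2 crc32 instruction on
 * x86-64. Compile with -msse4.2. Integration into the stack's checksum
 * path is not shown.
 */
#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

uint32_t crc32c_sse42(const void *data, size_t len)
{
	const uint8_t *p = (const uint8_t *)data;
	uint64_t crc = 0xFFFFFFFF;	/* initial value per RFC 3309/4960 */

	while (len >= 8) {		/* process 8 bytes per instruction */
		uint64_t chunk;

		memcpy(&chunk, p, 8);	/* avoid unaligned access issues */
		crc = _mm_crc32_u64(crc, chunk);
		p += 8;
		len -= 8;
	}
	while (len--)			/* remaining tail bytes */
		crc = _mm_crc32_u8((uint32_t)crc, *p++);

	return ~(uint32_t)crc;		/* final one's complement */
}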

For more optimisations I guess you need to run the stack with the workload you are expecting and do some profiling. I would expect that you can gain some performance by organising the data in a way that reduces the number of cache misses, for example.

@cukiernick

Can I take (part of) the patch and integrate it?

Sure, feel free to use the code.

@tuexen
Member

tuexen commented Aug 19, 2021

Can I take (part of) the patch and integrate it?

Sure, feel free to use the code.

Great, thanks!

@kanakagrawal
Author

We are running our application in a container environment on AMD64.

1. Not sure if it is possible in your use case to use CRC32c offloading to the NIC. Intel cards support this, and in the FreeBSD kernel implementation we make use of it. However, this is not easy for a userland stack. I think I discussed this a while ago in the context of netmap.

2. If CRC32c offloading is not possible but you are running on ARM64, you could use assembler code for the CRC32c using specific processor support. The userland stack provides a way to use your own CRC32c code. A similar technique is possible on AMD64, but the code required is more complex.

We already did this using some help from @weinrank's code: kanakagrawal@e46c0f4

4. Reduce the amount of copying of user data. Originally the userland stack did not copy user data, but when the API was polished to be used within web browsers, the request was for a simple and portable API. Right now the user data is copied at the boundary to the application. If you expose the mbuf API you might be able to reduce the overhead.

This sounds interesting. @protodef you had some similar thoughts. How about sharing them here?

For more optimisations I guess you need to run the stack with the workload you are expecting and do some profiling. I would expect that you can gain some performance by organising the data in a way that reduces the number of cache misses, for example.

Thanks! We'll think on this.

@ashish-billore

For more optimisations I guess you need to run the stack with the workload you are expecting and do some profiling. I would expect that you can gain some performance by organising the data in a way that reduces the number of cache misses, for example.

Just to add to this, we also tried pinning the SCTP/IP4 recv thread and the worker threads (with all the above changes in place: splitting the processing from the recv thread across multiple workers using vtag hashing, and HW-offloaded CRC) to dedicated CPU cores to avoid context switching and jitter.
However, the performance gains were modest (<10k tps).
So, we are investigating places in the code which might be causing bottlenecks, such as global locks.
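For reference, the pinning itself was done along these lines (a minimal, Linux-specific sketch; the thread handle and core number passed in are just examples):

/*
 * Minimal sketch of pinning a thread to a dedicated core (Linux-specific,
 * needs _GNU_SOURCE; the core number is chosen by the caller).
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_thread_to_core(pthread_t thread, int core)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(core, &set);
	/* Returns 0 on success, otherwise an errno value. */
	return pthread_setaffinity_np(thread, sizeof(set), &set);
}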

Any help in this regard is appreciated. Thanks!

@tuexen
Member

tuexen commented Aug 20, 2021

We are running our application in a container environment on AMD64.

Any reason you can't use the kernel SCTP of the host? I think for TCP or UDP based applications in container environments this is done...

@tuexen
Member

tuexen commented Aug 20, 2021

For more optimisations I guess you need to run the stack with the workload you are expecting and do some profiling. I would expect that you can gain some performance by organising the data in a way that reduces the number of cache misses, for example.

Just to add to this, we also tried pinning the SCTP/IP4 recv thread and the worker threads (with all the above changes in place: splitting the processing from the recv thread across multiple workers using vtag hashing, and HW-offloaded CRC) to dedicated CPU cores to avoid context switching and jitter.
However, the performance gains were modest (<10k tps).
So, we are investigating places in the code which might be causing bottlenecks, such as global locks.

Any help in this regard is appreciated. Thanks!

Not off the top of my head. I guess using some profiler should give you some hints. I know that @stewrtrs did some rounds of optimisations for the kernel stack years ago. But this hasn't been done for a while, and especially not for the userland stack. So I guess there is some low-hanging fruit...

@kanakagrawal
Author

Any reason you can't use the kernel SCTP of the host? I think for TCP or UDP based applications in container environments this is done...

Our application is a load balancer for a Kubernetes environment, and we needed the ability to bind a socket to an address that is not on that machine. So, we modified some parts of this userland stack to make that happen. This was not possible with kernel SCTP, as we couldn't modify anything in the kernel in a container environment.

@tuexen
Member

tuexen commented Aug 20, 2021

Any reason you can't use the kernel SCTP of the host? I think for TCP or UDP based applications in container environments this is done...

Our application is a load balancer for a Kubernetes environment, and we needed the ability to bind a socket to an address that is not on that machine. So, we modified some parts of this userland stack to make that happen. This was not possible with kernel SCTP, as we couldn't modify anything in the kernel in a container environment.

Ahh, I see. This also explains why you are optimising for performance. Thanks for the explanation.

@tuexen
Member

tuexen commented Aug 20, 2021

Any reason you can't use the kernel SCTP of the host? I think for TCP or UDP based applications in container environments this is done...

Our application is a load balancer for a Kubernetes environment, and we needed the ability to bind a socket to an address that is not on that machine. So, we modified some parts of this userland stack to make that happen. This was not possible with kernel SCTP, as we couldn't modify anything in the kernel in a container environment.

I have a question, since I'm not familiar with the environment you are using: it seems that the load balancer terminates the incoming SCTP associations. Is that needed? Do the nodes behind the load balancer appear under the same IP addresses as the load balancer?

I was wondering if you could have a load balancer with its own IP address. Peers send the packet containing the INIT chunk to the load balancer, the load balancer decides which node should handle it and forwards the packet containing the INIT chunk to that node. That node would send back an INIT-ACK, using its own addresses, and a new, yet to be defined parameter saying "you sent your packet containing the INIT chunk to the address contained in this parameter", so the peer can match it.
I know that this needs to be supported by the client and the server, but it is a simple extension, which could be standardised and would make life easier. Basically, the node distributing the load would only have to handle incoming packets containing the INIT chunk.

Would this work? Does it make sense?

@kanakagrawal
Author

I have a question, since I'm not familiar with the environment you are using: it seems that the load balancer terminates the incoming SCTP associations. Is that needed? Do the nodes behind the load balancer appear under the same IP addresses as the load balancer?

The load balancer currently terminates the incoming SCTP associations. The peer sending traffic doesn't know the IP addresses of the nodes behind the load balancer. Let me describe our use case in detail.

We are building a load balancer for communication between the RAN and the AMF in 5G telecom. There are the following specific requirements:

  • the server should know the IP of the client (so the load balancer, when it sends packets to the server, should use the client's IP as the source IP)
  • if one server goes down, the load balancer should start a new connection with another server without notifying the client (this is why we terminate the SCTP association at the load balancer and maintain another association between the load balancer and the server)

I was wondering if you could have a load balancer with its own IP address. Peers send the packet containing the INIT chunk to the load balancer, the load balancer decides which node should handle it and forwards the packet containing the INIT chunk to that node. That node would send back an INIT-ACK, using its own addresses, and a new, yet to be defined parameter saying "you sent your packet containing the INIT chunk to the address contained in this parameter", so the peer can match it.
I know that this needs to be supported by the client and the server, but it is a simple extension, which could be standardised and would make life easier. Basically, the node distributing the load would only have to handle incoming packets containing the INIT chunk.

Would this work? Does it make sense?

I couldn't fully understand what you mean. But I think you are assuming the client and server can talk directly after the INIT packet is handled, using the information in the INIT-ACK. This won't work for us because the servers' IP addresses are behind the load balancer and not exposed to the client.

@tuexen
Member

tuexen commented Aug 21, 2021

I have a question, since I'm not familiar with the environment you are using: it seems that the load balancer terminates the incoming SCTP associations. Is that needed? Do the nodes behind the load balancer appear under the same IP addresses as the load balancer?

The load balancer currently terminates the incoming SCTP associations. The peer sending traffic doesn't know the IP addresses of the nodes behind the load balancer. Let me describe our use case in detail.

We are building a load balancer for communication between the RAN and the AMF in 5G telecom. There are the following specific requirements:

  • the server should know the IP of the client (so the load balancer, when it sends packets to the server, should use the client's IP as the source IP)

OK. (although I don't like the load balancer spoofing the source addresses)

  • if one server goes down, the load balancer should start a new connection with another server without notifying the client (this is why we terminate the SCTP association at the load balancer and maintain another association between the load balancer and the server)

Why hide this failure? I don't think you can guarantee the sequencing or reliability requirements when this happens. Wouldn't it be much simpler if the client recovered the messages it wants and reestablished a new association with the load balancer?

I was wondering if you could have a load balancer with its own IP address. Peers send the packet containing the INIT chunk to the load balancer, the load balancer decides which node should handle it and forwards the packet containing the INIT chunk to that node. That node would send back an INIT-ACK, using its own addresses, and a new, yet to be defined parameter saying "you sent your packet containing the INIT chunk to the address contained in this parameter", so the peer can match it.
I know that this needs to be supported by the client and the server, but it is a simple extension, which could be standardised and would make life easier. Basically, the node distributing the load would only have to handle incoming packets containing the INIT chunk.
Would this work? Does it make sense?

I couldn't fully understand what you mean. But I think you are assuming the client and server can talk

The client sends an INIT chunk to the address of the load balancer. It must know this address in advance. When the load balancer receives that packet, it decides which server will handle this association and forwards the packet containing the INIT chunk to the server. The server receives this packet (it still has the source address of the client, since the load balancer did not modify the packet) and responds with an INIT-ACK. The server uses its own addresses. To allow the client to match the INIT-ACK with the INIT it sent earlier, the server adds a new parameter to the INIT-ACK, let's call it INIT-FORWARDED, which contains the IP address of the load balancer.

So this extension would need to be supported by the client and the server, and would allow the load balancer to deal only with packets containing INIT chunks. The scalability should be much better, and you don't have a single point of failure.
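Purely as an illustration (the parameter is explicitly yet to be defined, so the name and layout below are made up), such a parameter could follow the usual SCTP type-length-value format and simply carry the address the INIT was originally sent to:

/*
 * Hypothetical INIT-FORWARDED parameter, for illustration only: the
 * name and layout are made up, since the parameter is yet to be
 * defined. It carries the address the packet containing the INIT chunk
 * was originally sent to (i.e. the load balancer's address).
 */
#include <stdint.h>

struct sctp_init_forwarded_param {
	uint16_t type;		/* parameter type, to be assigned */
	uint16_t length;	/* total length of this parameter in bytes */
	uint16_t addr_family;	/* AF_INET or AF_INET6 */
	uint16_t reserved;	/* keeps the address 4-byte aligned */
	uint8_t	 address[16];	/* address the INIT was sent to */
};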

directly after the INIT packet is handled, using the information in the INIT-ACK. This won't work for us because the servers' IP addresses are behind the load balancer and not exposed to the client.

Why can't you expose the addresses of the servers to the client? The above solution would do this, but I don't see why that is not possible...
