deadlock in pthread implementation on SMP platforms #36652

evgeniy-paltsev · 2021-06-30T19:28:40Z

I've bumped into deadlock with pthread when pthread_join never exit even after thread we wait finished.
Initially I've found with issue on HSDK board (ARC HS 4 cores) with real payload application, but I was able to reproduce it with simple test case on 4 cores nSIM ARC HS simulation board.

Here is test case - we never return from second pthread_join in level_2_thread. When we check CPUs - all the cores are in arch_cpu_idle.

I can reproduce this issue only on 4 cores configuration - for some reason the test works fine on 2 cores configuration.

#define N_THR_R 2
#define STACKS (2048)

K_THREAD_STACK_ARRAY_DEFINE(stack_r, N_THR_R, STACKS);

void *thread_retval(void *p1)
{
	k_busy_wait(10000000);
	return p1;
}

void *level_2_thread(void *p1)
{
	pthread_t newthread[N_THR_R];
	pthread_attr_t attr[N_THR_R];

	for (int i = 0; i < N_THR_R; i++) {
		pthread_attr_init(&attr[i]);
		pthread_attr_setstack(&attr[i], &stack_r[i][0], STACKS);
		pthread_create(&newthread[i], &attr[i], thread_retval, INT_TO_POINTER(i));
	}

	for (int i = 0; i < N_THR_R; i++)
		// we never return from second pthread_join here
		pthread_join(newthread[i], NULL);

	printk(" exit level_2_thread\n");
	return NULL;
}

K_THREAD_STACK_ARRAY_DEFINE(stack_1, 1, STACKS);

void level_1_thread(void)
{
	pthread_t newthread;
	pthread_attr_t attr;
	void *context_ret;

	pthread_attr_init(&attr);
	pthread_attr_setstack(&attr, &stack_1, STACKS);

	pthread_create(&newthread, &attr, level_2_thread, NULL);
	pthread_join(newthread, &context_ret);

	printk("exit level_1_thread\n");
}

void main(void)
{
	printk("Hello World! %s\n", CONFIG_BOARD);

	level_1_thread();
}

NOTE: to reproduce it on other platform you probably need to play with k_busy_wait value.
NOTE: to reproduce it on 4 cores nsim you need to set instrs_per_pass=10 instead of instrs_per_pass=512 in boards/arc/nsim/support/mdb_hs_smp.args

I'm currently debugging this issue.

The text was updated successfully, but these errors were encountered:

evgeniy-paltsev · 2021-07-01T18:01:43Z

After debugging this a bit it doesn't look like the issue with pthread layer but it looks like an issue with SMP global_lock.

The deadlock in test example is caused by a race when cond_wait (called from pthread_join) and pthread_cond_broadcast (called from pthread_exit) are executed simultaneously on different CPUs.
This shouldn't be happening! - both of them do locking with irq_lock which is translated into z_smp_global_lock in SMP case.

I've check SMP global_lock implementation and it looks quite suspicious. We have complimentary z_smp_global_lock/z_smp_global_lock functions (which is OK) and one outstanding z_smp_release_global_lock which is called from do_swap.

The z_smp_release_global_lock previously had complimentary z_smp_reacquire_global_lock function which reacquired (if it is required) SMP global_lock on the thread brought online. However it was removed during huge scheduler rewrite 1acd8c2

This looks incorrect as after thread swap (and call z_smp_release_global_lock) and swap back we will get global_lock_count inconsistent with global_lock value - the global_lock is reseted by z_smp_release_global_lock, but thread's global_lock_count value is kept the same and all other calls to z_smp_global_lock and `z_smp_global_unlock1 by this thread won't work correctly.

This looks like a bug introduced by 1acd8c2.
@andyross could you please take a look on this as an author of SMP global_lock code and this scheduler rework code. Thanks!

Here is the current implementation of SMP global_lock code kernel/smp.c

static atomic_t global_lock;

unsigned int z_smp_global_lock(void)
{
	unsigned int key = arch_irq_lock();

	if (!_current->base.global_lock_count) {
		while (!atomic_cas(&global_lock, 0, 1)) {
		}
	}

	_current->base.global_lock_count++;

	return key;
}

void z_smp_global_unlock(unsigned int key)
{
	if (_current->base.global_lock_count) {
		_current->base.global_lock_count--;

		if (!_current->base.global_lock_count) {
			atomic_clear(&global_lock);
		}
	}

	arch_irq_unlock(key);
}

/* Called from within z_swap(), so assumes lock already held */
void z_smp_release_global_lock(struct k_thread *thread)
{
	if (!thread->base.global_lock_count) {
		atomic_clear(&global_lock);
	}
}

nashif · 2021-07-06T17:26:45Z

duplicate of #36736

evgeniy-paltsev added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug area: SMP Symmetric multiprocessing area: POSIX POSIX API Library labels Jun 30, 2021

evgeniy-paltsev mentioned this issue Jul 5, 2021

kernel: SMP global lock (and therefore irq_lock) works incorrectly on SMP platforms #36736

Closed

evgeniy-paltsev added this to the v2.7.0 milestone Jul 5, 2021

nashif assigned andyross Jul 6, 2021

nashif closed this as completed Jul 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

deadlock in pthread implementation on SMP platforms #36652

deadlock in pthread implementation on SMP platforms #36652

evgeniy-paltsev commented Jun 30, 2021 •

edited

Loading

evgeniy-paltsev commented Jul 1, 2021

nashif commented Jul 6, 2021

deadlock in pthread implementation on SMP platforms #36652

deadlock in pthread implementation on SMP platforms #36652

Comments

evgeniy-paltsev commented Jun 30, 2021 • edited Loading

evgeniy-paltsev commented Jul 1, 2021

nashif commented Jul 6, 2021

evgeniy-paltsev commented Jun 30, 2021 •

edited

Loading