Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deadlock in pthread implementation on SMP platforms #36652

Closed
evgeniy-paltsev opened this issue Jun 30, 2021 · 2 comments
Closed

deadlock in pthread implementation on SMP platforms #36652

evgeniy-paltsev opened this issue Jun 30, 2021 · 2 comments
Assignees
Labels
area: POSIX POSIX API Library area: SMP Symmetric multiprocessing bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug
Milestone

Comments

@evgeniy-paltsev
Copy link
Collaborator

evgeniy-paltsev commented Jun 30, 2021

I've bumped into deadlock with pthread when pthread_join never exit even after thread we wait finished.
Initially I've found with issue on HSDK board (ARC HS 4 cores) with real payload application, but I was able to reproduce it with simple test case on 4 cores nSIM ARC HS simulation board.

Here is test case - we never return from second pthread_join in level_2_thread. When we check CPUs - all the cores are in arch_cpu_idle.

I can reproduce this issue only on 4 cores configuration - for some reason the test works fine on 2 cores configuration.

#define N_THR_R 2
#define STACKS (2048)

K_THREAD_STACK_ARRAY_DEFINE(stack_r, N_THR_R, STACKS);

void *thread_retval(void *p1)
{
	k_busy_wait(10000000);
	return p1;
}

void *level_2_thread(void *p1)
{
	pthread_t newthread[N_THR_R];
	pthread_attr_t attr[N_THR_R];

	for (int i = 0; i < N_THR_R; i++) {
		pthread_attr_init(&attr[i]);
		pthread_attr_setstack(&attr[i], &stack_r[i][0], STACKS);
		pthread_create(&newthread[i], &attr[i], thread_retval, INT_TO_POINTER(i));
	}

	for (int i = 0; i < N_THR_R; i++)
		// we never return from second pthread_join here
		pthread_join(newthread[i], NULL);

	printk(" exit level_2_thread\n");
	return NULL;
}

K_THREAD_STACK_ARRAY_DEFINE(stack_1, 1, STACKS);

void level_1_thread(void)
{
	pthread_t newthread;
	pthread_attr_t attr;
	void *context_ret;

	pthread_attr_init(&attr);
	pthread_attr_setstack(&attr, &stack_1, STACKS);

	pthread_create(&newthread, &attr, level_2_thread, NULL);
	pthread_join(newthread, &context_ret);

	printk("exit level_1_thread\n");
}

void main(void)
{
	printk("Hello World! %s\n", CONFIG_BOARD);

	level_1_thread();
}

NOTE: to reproduce it on other platform you probably need to play with k_busy_wait value.
NOTE: to reproduce it on 4 cores nsim you need to set instrs_per_pass=10 instead of instrs_per_pass=512 in boards/arc/nsim/support/mdb_hs_smp.args

I'm currently debugging this issue.

@evgeniy-paltsev evgeniy-paltsev added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug area: SMP Symmetric multiprocessing area: POSIX POSIX API Library labels Jun 30, 2021
@evgeniy-paltsev
Copy link
Collaborator Author

After debugging this a bit it doesn't look like the issue with pthread layer but it looks like an issue with SMP global_lock.

The deadlock in test example is caused by a race when cond_wait (called from pthread_join) and pthread_cond_broadcast (called from pthread_exit) are executed simultaneously on different CPUs.
This shouldn't be happening! - both of them do locking with irq_lock which is translated into z_smp_global_lock in SMP case.

I've check SMP global_lock implementation and it looks quite suspicious. We have complimentary z_smp_global_lock/z_smp_global_lock functions (which is OK) and one outstanding z_smp_release_global_lock which is called from do_swap.

The z_smp_release_global_lock previously had complimentary z_smp_reacquire_global_lock function which reacquired (if it is required) SMP global_lock on the thread brought online. However it was removed during huge scheduler rewrite 1acd8c2

This looks incorrect as after thread swap (and call z_smp_release_global_lock) and swap back we will get global_lock_count inconsistent with global_lock value - the global_lock is reseted by z_smp_release_global_lock, but thread's global_lock_count value is kept the same and all other calls to z_smp_global_lock and `z_smp_global_unlock1 by this thread won't work correctly.

This looks like a bug introduced by 1acd8c2.
@andyross could you please take a look on this as an author of SMP global_lock code and this scheduler rework code. Thanks!

Here is the current implementation of SMP global_lock code kernel/smp.c

static atomic_t global_lock;

unsigned int z_smp_global_lock(void)
{
	unsigned int key = arch_irq_lock();

	if (!_current->base.global_lock_count) {
		while (!atomic_cas(&global_lock, 0, 1)) {
		}
	}

	_current->base.global_lock_count++;

	return key;
}

void z_smp_global_unlock(unsigned int key)
{
	if (_current->base.global_lock_count) {
		_current->base.global_lock_count--;

		if (!_current->base.global_lock_count) {
			atomic_clear(&global_lock);
		}
	}

	arch_irq_unlock(key);
}

/* Called from within z_swap(), so assumes lock already held */
void z_smp_release_global_lock(struct k_thread *thread)
{
	if (!thread->base.global_lock_count) {
		atomic_clear(&global_lock);
	}
}

@nashif
Copy link
Member

nashif commented Jul 6, 2021

duplicate of #36736

@nashif nashif closed this as completed Jul 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: POSIX POSIX API Library area: SMP Symmetric multiprocessing bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug
Projects
None yet
Development

No branches or pull requests

3 participants