This repository has been archived by the owner on Dec 18, 2022. It is now read-only.

Various changes to AudioIO subsystem #423

Closed
wants to merge 9 commits into from

Conversation

emabrey
Member

@emabrey emabrey commented Jul 31, 2021

Improves performance of project loading substantively.

Signed-off-by: Emily Mabrey <emabrey@tenacityaudio.org>
Helped-by: Alex Disibio <alexdisibio@gmail.com>

Checklist
  • I have signed off my commits using -s or Signed-off-by* (See: Contributing § DCO)
  • I made sure the code compiles on my machine
  • I made sure there are no unnecessary changes in the code*
  • I made sure the title of the PR reflects the core meaning of the issue you are solving*
  • I made sure the commit message(s) contain a description and answer the question "Why do those changes fix that particular issue?" or "Why are those changes really necessary as improvements?"*

* indicates required

@Be-ing
Contributor

Be-ing commented Jul 31, 2021

What are the substantive changes here? I mostly see whitespace fixes?

@emabrey
Member Author

emabrey commented Jul 31, 2021

Lines 26-31 and Lines 36-39. I just changed how we allocate the floats so it causes fewer cache misses.

@emabrey
Member Author

emabrey commented Jul 31, 2021

Akleja tested it on GCC and it doesn't improve speed, but I think MSVC didn't make the same optimizer decisions. For me it seems much faster to load on Windows. If the optimizer doesn't come into play this saves numPlaybackChannels - 1 calls to new.
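
A minimal sketch of the allocation pattern described above, not the code from this PR: `ChannelBuffers` and its members are hypothetical names, and the sketch assumes every channel needs the same number of frames. It makes one heap allocation for all channels and hands out per-channel slices, which is where the saving of `numPlaybackChannels - 1` calls to `new` would come from.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper (not the PR's AudioIOBufferHelper): one contiguous
// block of floats shared by all playback channels.
class ChannelBuffers {
public:
   ChannelBuffers(std::size_t numChannels, std::size_t framesPerBuffer)
      : mChannels{ numChannels }
      , mFrames{ framesPerBuffer }
      // A single allocation replaces numChannels separate calls to new[].
      // make_unique<float[]> value-initializes, so the samples start at 0.
      , mStorage{ std::make_unique<float[]>(numChannels * framesPerBuffer) }
   {}

   // Pointer to the first sample of the given channel's slice.
   float *channel(std::size_t index)
   {
      assert(index < mChannels);
      return mStorage.get() + index * mFrames;
   }

   std::size_t frames() const { return mFrames; }

private:
   std::size_t mChannels;
   std::size_t mFrames;
   std::unique_ptr<float[]> mStorage;
};
```

Because the slices are adjacent, consecutive channels sit next to each other in memory. Whether that helps measurably depends on the compiler and platform, as the GCC result mentioned above suggests.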

Signed-off-by: Emily Mabrey <emabrey@tenacityaudio.org>
@Paul-Licameli
Contributor

Paul-Licameli commented Jul 31, 2021

Stop! Think! I beg you!

Was #412 really a good idea to start with?

I was actually wasting your time with the memory leak fix, though it might have been good C++ education.

THE “FIX” OF ALLOCA WAS THE BIGGER MISTAKE.

It was not broken to begin with!

You must learn not to let the static analyzer substitute for real understanding of the program.

Signed-off-by: Emily Mabrey <emabrey@tenacityaudio.org>
src/AudioIO.cpp Outdated
@@ -3979,6 +3979,7 @@ bool AudioIoCallback::FillOutputBuffers(
// wxASSERT( maxLen == toGet );

em.RealtimeProcessEnd();
delete bufHelper.release();
Contributor

I think this is a double delete? unique_ptr::release should be sufficient.

Contributor

P.P.S. No, not a double delete. But an unnecessarily verbose single delete that could be left completely implicit. Whereas, “release” alone (distinct from “reset”) would be wrong, a memory leak.

Learn RAII. Learn “deterministic destruction.” Learn what makes C++ not like C on the one hand and not like Java on the other hand. You will like it, love it. I know I do.
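
The three behaviours contrasted in this review thread can be sketched as follows. `Tracked` and the `aliveAfter...` functions are hypothetical names used only for illustration; the `std::unique_ptr` calls are the standard ones.

```cpp
#include <cassert>
#include <memory>

// Number of Tracked objects currently alive, so each effect is observable.
int liveObjects = 0;

struct Tracked {
   Tracked() { ++liveObjects; }
   ~Tracked() { --liveObjects; }
};

// Scope exit: the unique_ptr destructor deletes the object implicitly.
int aliveAfterScopeExit()
{
   {
      auto p = std::make_unique<Tracked>();
      assert(p != nullptr);
   }
   return liveObjects;
}

// reset(): deletes the object immediately and leaves the pointer empty.
int aliveAfterReset()
{
   auto p = std::make_unique<Tracked>();
   p.reset();
   return liveObjects;
}

// release(): gives up ownership WITHOUT deleting. The caller must delete
// the returned raw pointer, otherwise the object leaks.
int aliveAfterRelease()
{
   auto p = std::make_unique<Tracked>();
   Tracked *raw = p.release();
   const int count = liveObjects; // still 1: nothing was destroyed yet
   delete raw;                    // the explicit half of `delete p.release()`
   return count;
}
```

So `delete bufHelper.release();` deletes exactly once, `release()` on its own would leak, and doing nothing at all lets the destructor run at the end of the scope.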

Member Author

Yes, I know that it removes the unique_ptr at the end of scope automatically. It was a bug testing change made for someone reporting issues on gcc that I couldn't repro on my system.

Contributor

https://en.cppreference.com/w/cpp/memory/unique_ptr

Release vs. reset and much else explained there.

@Paul-Licameli
Contributor

Paul-Licameli commented Jul 31, 2021

P. S. RealtimeEffectManager.cpp had the OTHER memory leak, obvious to me on inspection, that you did not figure out deductively.

But the real fix you need is reverting #412 altogether.

Cordially.

@emabrey
Member Author

emabrey commented Aug 1, 2021

The real fix would be moving memory allocation completely out of the real-time callback entirely. I haven't gotten to that yet, so I'm doing a half-measure and making sure that we don't get UB on a user's machine because the stack wasn't as big as it is on the developer's machine. Just as an example, MSVC gives 1MB stack by default and GCC gives 8MB. It's asking a lot to be 100% sure you aren't going to smash the stack when you use alloca in multiple places. Additionally, there are alloca calls inside loops which is just an anti-pattern (CWE-770 specifically). The performance of the memory allocation is completely beside the point because you shouldn't have been allocating memory inside of a realtime callback to begin with.

@Be-ing
Contributor

Be-ing commented Aug 1, 2021

moving memory allocation completely out of the real-time callback

There's memory allocation in the audio thread? 😬

@Paul-Licameli
Contributor

The real fix would be moving memory allocation completely out of the real-time callback entirely. I haven't gotten to that yet, so I'm doing a half-measure and making sure that we don't get UB on a user's machine because the stack wasn't as big as it is on the developer's machine. Just as an example, MSVC gives 1MB stack by default and GCC gives 8MB. It's asking a lot to be 100% sure you aren't going to smash the stack when you use alloca in multiple places. Additionally, there are alloca calls inside loops which is just an anti-pattern (CWE-770 specifically). The performance of the memory allocation is completely beside the point because you shouldn't have been allocating memory inside of a realtime callback to begin with.

If it were a real problem in practice, affecting playback itself, an essential function— wouldn’t it have been well known and corrected long before now?

@Be-ing
Contributor

Be-ing commented Aug 1, 2021

Not necessarily. Glitches in audio playback can be difficult to reproduce. The allocation may be fast enough in the vast majority of cases but on rare occasions the OS may need more time to allocate and cause an audible glitch.

@Paul-Licameli
Contributor

Paul-Licameli commented Aug 1, 2021

moving memory allocation completely out of the real-time callback

There's memory allocation in the audio thread? 😬

Yes, friend, if “alloca” is considered “allocation,” but it is adjustment of some registers for variable sized stack allocation, in this real time critical thread, which always has a shallow stack.

I still think this is fixin what ain’t broke, just because a static analyzer flags it, but that is not wisdom.

@Paul-Licameli
Contributor

Not necessarily. Glitches in audio playback can be difficult to reproduce. The allocation may be fast enough in the vast majority of cases but on rare occasions the OS may need more time to allocate and cause an audible glitch.

Such hypotheses will be devilishly hard to demonstrate. I don’t say you are wrong.

There is a certain dance of three threads, the main, which updates the screen and polls the stop button; the real time critical portAudio callback, in which these alloca calls happen (but would new and delete really improve on them?), and another thread that reads from (play) and/or writes to (record) the drive, with a lesser frequency and bigger grain. And RingBuffer mediating between the last two.

I heard rumors of unusual bugs that I could not reproduce at will, I was given data, I saw odd patterns suggesting “time travel” of recorded data by numbers of samples that were powers of 2. I made speculative fixes to RingBuffer using std::atomic more rigorously. I didn’t hear further rumors of the recording glitches.

Did I really fix it? I still don’t know, but I made my best educated guess.
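
For readers unfamiliar with the pattern, here is a minimal single-producer, single-consumer ring buffer using `std::atomic` with acquire/release ordering. It is only an illustration of the general technique, not Audacity's `RingBuffer` class or the fix described above; `SpscRing` and its members are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative SPSC ring buffer: exactly one thread may call push() and
// exactly one other thread may call pop().
class SpscRing {
public:
   explicit SpscRing(std::size_t capacity)
      : mData(capacity + 1) // one slot stays empty to tell full from empty
   {
      assert(capacity > 0);
   }

   // Producer thread only. Returns false when the buffer is full.
   bool push(float value)
   {
      const auto end = mEnd.load(std::memory_order_relaxed);
      const auto next = (end + 1) % mData.size();
      if (next == mStart.load(std::memory_order_acquire))
         return false;
      mData[end] = value;
      // Release: the sample is visible before the new end index is.
      mEnd.store(next, std::memory_order_release);
      return true;
   }

   // Consumer thread only. Returns false when the buffer is empty.
   bool pop(float &value)
   {
      const auto start = mStart.load(std::memory_order_relaxed);
      if (start == mEnd.load(std::memory_order_acquire))
         return false;
      value = mData[start];
      // Release: the slot is free only after the sample has been read.
      mStart.store((start + 1) % mData.size(), std::memory_order_release);
      return true;
   }

private:
   std::vector<float> mData;
   std::atomic<std::size_t> mStart{ 0 };
   std::atomic<std::size_t> mEnd{ 0 };
};
```

If the indices were plain integers, one thread could see an updated index before the sample it refers to, which is one plausible way for data to appear out of order.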

@emabrey
Member Author

emabrey commented Aug 1, 2021

Not necessarily. Glitches in audio playback can be difficult to reproduce. The allocation may be fast enough in the vast majority of cases but on rare occasions the OS may need more time to allocate and cause an audible glitch.

Could this be related to https://github.com/tenacityteam/tenacity/issues/288 ?

Signed-off-by: Emily Mabrey <emabrey@tenacityaudio.org>
@Be-ing
Contributor

Be-ing commented Aug 1, 2021

Is this new commit strictly whitespace changes? What are you using to format it? I doubt we should be doing such mass changes before automating coding standards.

@emabrey
Member Author

emabrey commented Aug 1, 2021

Is this new commit strictly whitespace changes? What are you using to format it? I doubt we should be doing such mass changes before automating coding standards.

You're right. I was just trying to standardize the layout to something consistent. I'll avoid doing that again until we set a consistent standard.

Signed-off-by: Emily Mabrey <emabrey@tenacityaudio.org>
@emabrey emabrey changed the title from "Make AudioIOBufferHelper.h cache friendly" to "Various changes to AudioIO subsystem" Aug 1, 2021
@Paul-Licameli
Contributor

The real fix would be moving memory allocation completely out of the real-time callback entirely. I haven't gotten to that yet, so I'm doing a half-measure and making sure that we don't get UB on a user's machine because the stack wasn't as big as it is on the developer's machine. Just as an example, MSVC gives 1MB stack by default and GCC gives 8MB. It's asking a lot to be 100% sure you aren't going to smash the stack when you use alloca in multiple places. Additionally, there are alloca calls inside loops which is just an anti-pattern (CWE-770 specifically). The performance of the memory allocation is completely beside the point because you shouldn't have been allocating memory inside of a realtime callback to begin with.

Emily, you are letting a static analyzer or a checklist of alleged "anti-patterns" substitute for thinking about the code you are reading.

What if I told you the bound of the loops in question was always a very small number? That bound is the number of playback channels, which is not more than two for most users of Audacity, and for more complicated setups is still not a large number.

Did you pause to figure this out? Did you put breakpoints in the code, figure out how to get there, and observe the loop bound?

Did you also observe in your breakpoint that this isn't the main thread, and did you wonder, is this a realtime-critical thread perhaps, in which breaking the usual recommended rules might be important and specially justified?

No, you just took the output of an analyzer as unquestionable advice and wasted time fixing things not really broken.

Please just revert #412 and move on, or prove me wrong by demonstrating a real bug traceable to these allocas and fixed by other allocation.

This is painful to watch.

@Semisol
Contributor

Semisol commented Aug 1, 2021

Note, you can use Co-Authored-By instead of Helped-by

@Paul-Licameli
Contributor

Now I'm wasting MY time too, but it's Sunday morning and this is kinda fun. Added to FillOutputBuffers (non-portable macOS code):

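   auto self = pthread_self();
   // Total stack size of this thread, in bytes
   auto stacksize = pthread_get_stacksize_np(self);
   // Base of the stack (its highest address, since the stack grows downward)
   auto addr = pthread_get_stackaddr_np(self);
   // The address of a local variable approximates the current stack pointer
   auto here = &self;
   // Bytes of stack already in use
   auto used = reinterpret_cast<char*>(addr) - reinterpret_cast<char*>(here);
   // Number of floats that still fit in the remaining stack
   auto avail = (stacksize - used) / sizeof(float);
   // Number of floats the callback needs for its buffers
   auto needed = numPlaybackChannels * framesPerBuffer;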

That tells me stacksize is 524288 -- even less than that Windows default. But avail is 122390, and needed is 7818 or just about 6.4% of what's available.

Of course all this can vary a lot. I can influence "needed" by changing Buffer Length in Device preferences, but it won't go above 8192 no matter how big I make that value.

But by all means, try this for yourself with other OS equivalents. Try to prove that you can come anywhere close to the alloca stack overflow and smash.

Or instead, just stop and think about it. These alloca-s have been there several years. The Audacity team left them there. Don't you think the stack smashes, if they happened -- "Audacity crashes every time I play!" -- would have been a notorious, high priority bug that would have been fixed by now?
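
The arithmetic behind the percentage quoted in this comment can be checked with a small sketch. The function names are hypothetical, and the byte count assumes 4-byte floats, which the code asserts.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "the byte figures assume 4-byte floats");

// Fraction of the remaining stack that one callback's buffers would occupy.
// Both arguments are counts of floats, as in the measurement above.
inline double stackUseFraction(std::size_t neededFloats, std::size_t availFloats)
{
   assert(availFloats > 0);
   return static_cast<double>(neededFloats) / static_cast<double>(availFloats);
}

// The same request expressed in bytes.
inline std::size_t bytesNeeded(std::size_t neededFloats)
{
   return neededFloats * sizeof(float);
}
```

With the reported values, `stackUseFraction(7818, 122390)` is a little under 0.064, and `bytesNeeded(7818)` is 31272 bytes. At the observed ceiling of 8192 floats the fraction is still below 0.07 on that machine.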

@emabrey
Member Author

emabrey commented Aug 1, 2021

Stop! Think! I beg you!
Was #412 really a good idea to start with?

I was actually wasting your time with the memory leak fix, though it might have been good C++ education.
THE “FIX” OF ALLOCA WAS THE BIGGER MISTAKE.
It was not broken to begin with!
You must learn not to let the static analyzer substitute for real understanding of the program.

P. S. RealtimeEffectManager.cpp had the OTHER memory leak, obvious to me on inspection, that you did not figure out deductively.

But the real fix you need is reverting #412 altogether.

Cordially.

Using alloca is only safe if you can be sure that you absolutely 100% NEVER exceed the stack boundary. If alloca runs into any issues at all, it's UB. It has no graceful error mode.

The real fix would be moving memory allocation completely out of the real-time callback entirely. I haven't gotten to that yet, so I'm doing a half-measure and making sure that we don't get UB on a user's machine because the stack wasn't as big as it is on the developer's machine. Just as an example, MSVC gives 1MB stack by default and GCC gives 8MB. It's asking a lot to be 100% sure you aren't going to smash the stack when you use alloca in multiple places. Additionally, there are alloca calls inside loops which is just an anti-pattern (CWE-770 specifically). The performance of the memory allocation is completely beside the point because you shouldn't have been allocating memory inside of a realtime callback to begin with.

Emily, you are letting a static analyzer or a checklist of alleged "anti-patterns" substitute for thinking about the code you are reading.

What if I told you the bound of the loops in question was always a very small number? That bound is the number of playback channels, which is not more than two for most users of Audacity, and for more complicated setups is still not a large number.

Did you pause to figure this out? Did you put breakpoints in the code, figure out how to get there, and observe the loop bound?

Did you also observe in your breakpoint that this isn't the main thread, and did you wonder, is this a realtime-critical thread perhaps, in which breaking the usual recommended rules might be important and specially justified?

No, you just took the output of an analyzer as unquestionable advice and wasted time fixing things not really broken.

Please just revert #412 and move on, or prove me wrong by demonstrating a real bug traceable to these allocas and fixed by other allocation.

This is painful to watch.

You have made a bunch of incorrect assumptions. I knew it was in a real-time thread. That's why there shouldn't be any memory allocation happening there at all. I know you said you don't think modifying the stack segment register is a memory allocation, but you aren't really considering that the SSR points to memory and it still has to store those bits somewhere. On x86 it decrements the SP register and copies the new allocation to the memory area whose size is the difference between the two SP values, thus expanding the stack frame. It's still a memory allocation, but it is faster of course (if the memory area is something like L1 cache, that is).

There is just no portable, standards-compliant way to do this. I don't want to have to worry about stack overflows where it's not even needed. You also made a bunch of incorrect statements about alloca. For one, it isn't even guaranteed to be on the stack; it can actually be a malloc behind the scenes (see GCC 2.95 if you don't believe me, where they use a combo of free and malloc). It's machine and compiler dependent and we don't even need it, so why are you so attached to it? Additionally, given that alloca can be a malloc and you get NO guarantees about how it allocates the memory (stack or otherwise), can you prove that you aren't violating the instructions that PortAudio itself provides about not allocating memory within the PA callback?

And yes, I did do the math on the stack allocations. It was 32 bits (per float) * 740 (framesPerBuffer) * 2 (channels) on my setup when I calculated it. That assumes, of course, that no inlining is done, as some versions of certain compilers will inline functions with calls to alloca and that obviously can be very bad. I figured there was a safety factor of about 600, though the numbers you gave are actually worse than that.

Also, it just occurred to me: what happens if any of those values are zero? alloca(0) is not well-defined and it returns anything from NULL to a pointer to SP. On Windows alloca(0) actually even does allocate some memory to the stack.

@emabrey
Member Author

emabrey commented Aug 1, 2021

Here is the alloca implementation I referenced, so you don't have to go look it up:

/* alloca.c -- allocate automatically reclaimed memory
   (Mostly) portable public-domain implementation -- D A Gwyn

   This implementation of the PWB library alloca function,
   which is used to allocate space off the run-time stack so
   that it is automatically reclaimed upon procedure exit,
   was inspired by discussions with J. Q. Johnson of Cornell.
   J.Otto Tennant <jot@cray.com> contributed the Cray support.

   There are some preprocessor constants that can
   be defined when compiling for your specific system, for
   improved efficiency; however, the defaults should be okay.

   The general concept of this implementation is to keep
   track of all alloca-allocated blocks, and reclaim any
   that are found to be deeper in the stack than the current
   invocation.  This heuristic does not reclaim storage as
   soon as it becomes invalid, but it will do so eventually.

   As a special case, alloca(0) reclaims storage without
   allocating any.  It is a good idea to use alloca(0) in
   your main control loop, etc. to force garbage collection.  */

#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#ifdef HAVE_STRING_H
#include <string.h>
#endif
#ifdef HAVE_STDLIB_H
#include <stdlib.h>
#endif

#ifdef emacs
#include "blockinput.h"
#endif

/* If compiling with GCC 2, this file's not needed.  Except of course if
   the C alloca is explicitly requested.  */
#if defined (USE_C_ALLOCA) || !defined (__GNUC__) || __GNUC__ < 2

/* If someone has defined alloca as a macro,
   there must be some other way alloca is supposed to work.  */
#ifndef alloca

#ifdef emacs
#ifdef static
/* actually, only want this if static is defined as ""
   -- this is for usg, in which emacs must undefine static
   in order to make unexec workable
   */
#ifndef STACK_DIRECTION
you
lose
-- must know STACK_DIRECTION at compile-time
#endif /* STACK_DIRECTION undefined */
#endif /* static */
#endif /* emacs */

/* If your stack is a linked list of frames, you have to
   provide an "address metric" ADDRESS_FUNCTION macro.  */

#if defined (CRAY) && defined (CRAY_STACKSEG_END)
long i00afunc ();
#define ADDRESS_FUNCTION(arg) (char *) i00afunc (&(arg))
#else
#define ADDRESS_FUNCTION(arg) &(arg)
#endif

#if __STDC__
typedef void *pointer;
#else
typedef char *pointer;
#endif

#ifndef NULL
#define	NULL	0
#endif

/* Different portions of Emacs need to call different versions of
   malloc.  The Emacs executable needs alloca to call xmalloc, because
   ordinary malloc isn't protected from input signals.  On the other
   hand, the utilities in lib-src need alloca to call malloc; some of
   them are very simple, and don't have an xmalloc routine.

   Non-Emacs programs expect this to call use xmalloc.

   Callers below should use malloc.  */

#ifndef emacs
#define malloc xmalloc
#endif
extern pointer malloc ();

/* Define STACK_DIRECTION if you know the direction of stack
   growth for your system; otherwise it will be automatically
   deduced at run-time.

   STACK_DIRECTION > 0 => grows toward higher addresses
   STACK_DIRECTION < 0 => grows toward lower addresses
   STACK_DIRECTION = 0 => direction of growth unknown  */

#ifndef STACK_DIRECTION
#define	STACK_DIRECTION	0	/* Direction unknown.  */
#endif

#if STACK_DIRECTION != 0

#define	STACK_DIR	STACK_DIRECTION	/* Known at compile-time.  */

#else /* STACK_DIRECTION == 0; need run-time code.  */

static int stack_dir;		/* 1 or -1 once known.  */
#define	STACK_DIR	stack_dir

static void
find_stack_direction ()
{
  static char *addr = NULL;	/* Address of first `dummy', once known.  */
  auto char dummy;		/* To get stack address.  */

  if (addr == NULL)
    {				/* Initial entry.  */
      addr = ADDRESS_FUNCTION (dummy);

      find_stack_direction ();	/* Recurse once.  */
    }
  else
    {
      /* Second entry.  */
      if (ADDRESS_FUNCTION (dummy) > addr)
	stack_dir = 1;		/* Stack grew upward.  */
      else
	stack_dir = -1;		/* Stack grew downward.  */
    }
}

#endif /* STACK_DIRECTION == 0 */

/* An "alloca header" is used to:
   (a) chain together all alloca'ed blocks;
   (b) keep track of stack depth.

   It is very important that sizeof(header) agree with malloc
   alignment chunk size.  The following default should work okay.  */

#ifndef	ALIGN_SIZE
#define	ALIGN_SIZE	sizeof(double)
#endif

typedef union hdr
{
  char align[ALIGN_SIZE];	/* To force sizeof(header).  */
  struct
    {
      union hdr *next;		/* For chaining headers.  */
      char *deep;		/* For stack depth measure.  */
    } h;
} header;

static header *last_alloca_header = NULL;	/* -> last alloca header.  */

/* Return a pointer to at least SIZE bytes of storage,
   which will be automatically reclaimed upon exit from
   the procedure that called alloca.  Originally, this space
   was supposed to be taken from the current stack frame of the
   caller, but that method cannot be made to work for some
   implementations of C, for example under Gould's UTX/32.  */

pointer
alloca (size)
     unsigned size;
{
  auto char probe;		/* Probes stack depth: */
  register char *depth = ADDRESS_FUNCTION (probe);

#if STACK_DIRECTION == 0
  if (STACK_DIR == 0)		/* Unknown growth direction.  */
    find_stack_direction ();
#endif

  /* Reclaim garbage, defined as all alloca'd storage that
     was allocated from deeper in the stack than currently.  */

  {
    register header *hp;	/* Traverses linked list.  */

#ifdef emacs
    BLOCK_INPUT;
#endif

    for (hp = last_alloca_header; hp != NULL;)
      if ((STACK_DIR > 0 && hp->h.deep > depth)
	  || (STACK_DIR < 0 && hp->h.deep < depth))
	{
	  register header *np = hp->h.next;

	  free ((pointer) hp);	/* Collect garbage.  */

	  hp = np;		/* -> next header.  */
	}
      else
	break;			/* Rest are not deeper.  */

    last_alloca_header = hp;	/* -> last valid storage.  */

#ifdef emacs
    UNBLOCK_INPUT;
#endif
  }

  if (size == 0)
    return NULL;		/* No allocation required.  */

  /* Allocate combined header + user data storage.  */

  {
    register pointer new = malloc (sizeof (header) + size);
    /* Address of header.  */

    if (new == 0)
      abort();

    ((header *) new)->h.next = last_alloca_header;
    ((header *) new)->h.deep = depth;

    last_alloca_header = (header *) new;

    /* User storage begins just after header.  */

    return (pointer) ((char *) new + sizeof (header));
  }
}

#if defined (CRAY) && defined (CRAY_STACKSEG_END)

#ifdef DEBUG_I00AFUNC
#include <stdio.h>
#endif

#ifndef CRAY_STACK
#define CRAY_STACK
#ifndef CRAY2
/* Stack structures for CRAY-1, CRAY X-MP, and CRAY Y-MP */
struct stack_control_header
  {
    long shgrow:32;		/* Number of times stack has grown.  */
    long shaseg:32;		/* Size of increments to stack.  */
    long shhwm:32;		/* High water mark of stack.  */
    long shsize:32;		/* Current size of stack (all segments).  */
  };

/* The stack segment linkage control information occurs at
   the high-address end of a stack segment.  (The stack
   grows from low addresses to high addresses.)  The initial
   part of the stack segment linkage control information is
   0200 (octal) words.  This provides for register storage
   for the routine which overflows the stack.  */

struct stack_segment_linkage
  {
    long ss[0200];		/* 0200 overflow words.  */
    long sssize:32;		/* Number of words in this segment.  */
    long ssbase:32;		/* Offset to stack base.  */
    long:32;
    long sspseg:32;		/* Offset to linkage control of previous
				   segment of stack.  */
    long:32;
    long sstcpt:32;		/* Pointer to task common address block.  */
    long sscsnm;		/* Private control structure number for
				   microtasking.  */
    long ssusr1;		/* Reserved for user.  */
    long ssusr2;		/* Reserved for user.  */
    long sstpid;		/* Process ID for pid based multi-tasking.  */
    long ssgvup;		/* Pointer to multitasking thread giveup.  */
    long sscray[7];		/* Reserved for Cray Research.  */
    long ssa0;
    long ssa1;
    long ssa2;
    long ssa3;
    long ssa4;
    long ssa5;
    long ssa6;
    long ssa7;
    long sss0;
    long sss1;
    long sss2;
    long sss3;
    long sss4;
    long sss5;
    long sss6;
    long sss7;
  };

#else /* CRAY2 */
/* The following structure defines the vector of words
   returned by the STKSTAT library routine.  */
struct stk_stat
  {
    long now;			/* Current total stack size.  */
    long maxc;			/* Amount of contiguous space which would
				   be required to satisfy the maximum
				   stack demand to date.  */
    long high_water;		/* Stack high-water mark.  */
    long overflows;		/* Number of stack overflow ($STKOFEN) calls.  */
    long hits;			/* Number of internal buffer hits.  */
    long extends;		/* Number of block extensions.  */
    long stko_mallocs;		/* Block allocations by $STKOFEN.  */
    long underflows;		/* Number of stack underflow calls ($STKRETN).  */
    long stko_free;		/* Number of deallocations by $STKRETN.  */
    long stkm_free;		/* Number of deallocations by $STKMRET.  */
    long segments;		/* Current number of stack segments.  */
    long maxs;			/* Maximum number of stack segments so far.  */
    long pad_size;		/* Stack pad size.  */
    long current_address;	/* Current stack segment address.  */
    long current_size;		/* Current stack segment size.  This
				   number is actually corrupted by STKSTAT to
				   include the fifteen word trailer area.  */
    long initial_address;	/* Address of initial segment.  */
    long initial_size;		/* Size of initial segment.  */
  };

/* The following structure describes the data structure which trails
   any stack segment.  I think that the description in 'asdef' is
   out of date.  I only describe the parts that I am sure about.  */

struct stk_trailer
  {
    long this_address;		/* Address of this block.  */
    long this_size;		/* Size of this block (does not include
				   this trailer).  */
    long unknown2;
    long unknown3;
    long link;			/* Address of trailer block of previous
				   segment.  */
    long unknown5;
    long unknown6;
    long unknown7;
    long unknown8;
    long unknown9;
    long unknown10;
    long unknown11;
    long unknown12;
    long unknown13;
    long unknown14;
  };

#endif /* CRAY2 */
#endif /* not CRAY_STACK */

#ifdef CRAY2
/* Determine a "stack measure" for an arbitrary ADDRESS.
   I doubt that "lint" will like this much.  */

static long
i00afunc (long *address)
{
  struct stk_stat status;
  struct stk_trailer *trailer;
  long *block, size;
  long result = 0;

  /* We want to iterate through all of the segments.  The first
     step is to get the stack status structure.  We could do this
     more quickly and more directly, perhaps, by referencing the
     $LM00 common block, but I know that this works.  */

  STKSTAT (&status);

  /* Set up the iteration.  */

  trailer = (struct stk_trailer *) (status.current_address
				    + status.current_size
				    - 15);

  /* There must be at least one stack segment.  Therefore it is
     a fatal error if "trailer" is null.  */

  if (trailer == 0)
    abort ();

  /* Discard segments that do not contain our argument address.  */

  while (trailer != 0)
    {
      block = (long *) trailer->this_address;
      size = trailer->this_size;
      if (block == 0 || size == 0)
	abort ();
      trailer = (struct stk_trailer *) trailer->link;
      if ((block <= address) && (address < (block + size)))
	break;
    }

  /* Set the result to the offset in this segment and add the sizes
     of all predecessor segments.  */

  result = address - block;

  if (trailer == 0)
    {
      return result;
    }

  do
    {
      if (trailer->this_size <= 0)
	abort ();
      result += trailer->this_size;
      trailer = (struct stk_trailer *) trailer->link;
    }
  while (trailer != 0);

  /* We are done.  Note that if you present a bogus address (one
     not in any segment), you will get a different number back, formed
     from subtracting the address of the first block.  This is probably
     not what you want.  */

  return (result);
}

#else /* not CRAY2 */
/* Stack address function for a CRAY-1, CRAY X-MP, or CRAY Y-MP.
   Determine the number of the cell within the stack,
   given the address of the cell.  The purpose of this
   routine is to linearize, in some sense, stack addresses
   for alloca.  */

static long
i00afunc (long address)
{
  long stkl = 0;

  long size, pseg, this_segment, stack;
  long result = 0;

  struct stack_segment_linkage *ssptr;

  /* Register B67 contains the address of the end of the
     current stack segment.  If you (as a subprogram) store
     your registers on the stack and find that you are past
     the contents of B67, you have overflowed the segment.

     B67 also points to the stack segment linkage control
     area, which is what we are really interested in.  */

  stkl = CRAY_STACKSEG_END ();
  ssptr = (struct stack_segment_linkage *) stkl;

  /* If one subtracts 'size' from the end of the segment,
     one has the address of the first word of the segment.

     If this is not the first segment, 'pseg' will be
     nonzero.  */

  pseg = ssptr->sspseg;
  size = ssptr->sssize;

  this_segment = stkl - size;

  /* It is possible that calling this routine itself caused
     a stack overflow.  Discard stack segments which do not
     contain the target address.  */

  while (!(this_segment <= address && address <= stkl))
    {
#ifdef DEBUG_I00AFUNC
      fprintf (stderr, "%011o %011o %011o\n", this_segment, address, stkl);
#endif
      if (pseg == 0)
	break;
      stkl = stkl - pseg;
      ssptr = (struct stack_segment_linkage *) stkl;
      size = ssptr->sssize;
      pseg = ssptr->sspseg;
      this_segment = stkl - size;
    }

  result = address - this_segment;

  /* If you subtract pseg from the current end of the stack,
     you get the address of the previous stack segment's end.
     This seems a little convoluted to me, but I'll bet you save
     a cycle somewhere.  */

  while (pseg != 0)
    {
#ifdef DEBUG_I00AFUNC
      fprintf (stderr, "%011o %011o\n", pseg, size);
#endif
      stkl = stkl - pseg;
      ssptr = (struct stack_segment_linkage *) stkl;
      size = ssptr->sssize;
      pseg = ssptr->sspseg;
      result += size;
    }
  return (result);
}

#endif /* not CRAY2 */
#endif /* CRAY */

#endif /* no alloca */
#endif /* not GCC version 2 */

@Paul-Licameli
Contributor

Is the ancient history of gcc versions from 2001 actually relevant?

Audacity/Tenacity presupposes a compiler that can do C++17. Those versions, predating 2017, surely won't.

alloca is non-standard, it's true.

But the pertinent question is, what versions of gcc support C++17, and how is alloca implemented in them?

Don't hate me, I'm just trying to help you.

@emabrey
Member Author

emabrey commented Aug 2, 2021

I don't personally know the implementations of every c++17 compiler, across every version and across every machine. If you do then I applaud you. My point is that it's basically relying on a kludge to get a little bit of extra speed. I want to move memory allocation outside of the real-time thread, and because of the way that alloca works (memory is released at the end of the function's scope, or on jmps) that is extremely difficult to do correctly and almost impossible to do in a way that isn't really fragile. If you want to allocate memory that lasts outside the scope of the current function then you shouldn't use alloca, that isn't what it's for. I'm just trying to achieve something that you aren't trying to do. When I first made the change it was in response to the static analysis, sure, but it did take me a while to go through the 4000+ lines of code in AudioIO.cpp. Now I'm moving forward with this because it is a needed part of making the real-time thread as allocation-free as possible. Ideally we would dynamically allocate a global pool of memory exclusively for the use of the real-time thread as part of thread initialization/program startup. Then we use that memory until we cleanup the real-time thread.

@Paul-Licameli
Contributor

I don't personally know the implementations of every c++17 compiler, across every version and across every machine.

And neither do I. But if Tenacity really wants to support a compiler minimum of GCC 2.95, well then, be my guest: update your BUILDING.md, and also change a whole lot of source code using C++11 syntax or later. Say goodbye to lambdas.

If you want to support some other variety of compilers, not quite that wide, well still, document it there.

Meanwhile, I'll work on Audacity, where we are content to drop support for really old compilers and stipulate which small few are guaranteed to work.

If you do then I applaud you. My point is that it's basically relying on a kludge to get a little bit of extra speed. I want to move memory allocation outside of the real-time thread,

You do? Then pray tell, why did you add calls to operator new (!) within that thread? They were much more expensive than alloca, they were not paired with delete (!) and therefore leaked, you pushed them to your master without review from your teammates, and you wouldn't even have known they leaked if I hadn't spoken up. (Well maybe @nyanpasu64 at least would have figured it out soon enough.)

I honestly didn't know how horribly alloca was sometimes implemented circa 2001. Thank you for that bit of education.

But even thick-witted I have long known about std::auto_ptr superseding explicit delete, which was in the standard library since 1998 (deprecated in C++11 and removed in C++17; please use unique_ptr). And about the general principle of RAII, which existed in C++ since 1979, before Bjarne had even renamed it C++. https://youtu.be/u0aU0NTJ0hQ?t=858
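
To make the RAII point concrete, here is a minimal sketch, not code from AudioIO.cpp. `ChannelBuffers` and its members are hypothetical names chosen for illustration. It assumes the per-channel float arrays can share one contiguous block, so a single allocation replaces one `new` per channel, and `std::unique_ptr` releases the block on every exit path without an explicit `delete`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper for illustration only; not a type from AudioIO.cpp.
// Owns one contiguous block of floats and hands out per-channel views.
class ChannelBuffers {
public:
    ChannelBuffers(std::size_t channels, std::size_t framesPerChannel)
        // One allocation for all channels; make_unique zero-initializes.
        : storage_(std::make_unique<float[]>(channels * framesPerChannel)),
          channels_(channels),
          frames_(framesPerChannel) {}

    // Pointer to the first sample of channel i (non-owning).
    float* channel(std::size_t i) {
        assert(i < channels_);
        return storage_.get() + i * frames_;
    }

    std::size_t channels() const { return channels_; }
    std::size_t frames() const { return frames_; }

private:
    // Freed automatically when the ChannelBuffers object is destroyed,
    // including on early return or exception, so nothing can leak.
    std::unique_ptr<float[]> storage_;
    std::size_t channels_;
    std::size_t frames_;
};
```

Constructing this still calls the heap allocator once, so in a real-time callback it would have to be built ahead of time rather than inside the callback itself.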

If you want to allocate memory that lasts outside the scope of the current function then you shouldn't use alloca,

Um, no? I mean, yeah? That was not, ever, the intended lifetime of those arrays in question.

Ideally we would dynamically allocate a global pool of memory exclusively for the use of the real-time thread as part of thread initialization/program startup. Then we use that memory until we cleanup the real-time thread.

Now THAT makes sense as a possible improvement of AudioIO.cpp. Honestly.
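
One possible shape for such a pool, as a rough sketch under stated assumptions: `RealtimeFloatPool`, `take`, and `reset` are hypothetical names, not existing Audacity or Tenacity APIs. It assumes a single real-time thread is the only user (there is no locking), that only float storage is needed, and that the capacity is chosen at startup before the audio callback ever runs.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical bump-style pool for illustration only.
// All heap allocation happens in the constructor, off the real-time thread.
class RealtimeFloatPool {
public:
    explicit RealtimeFloatPool(std::size_t capacity)
        : storage_(std::make_unique<float[]>(capacity)),
          capacity_(capacity) {}

    // Hands out `count` floats from the preallocated block.
    // Never allocates; returns nullptr when the pool is exhausted.
    float* take(std::size_t count) noexcept {
        if (count > capacity_ - used_)
            return nullptr;
        float* p = storage_.get() + used_;
        used_ += count;
        return p;
    }

    // Makes the whole block available again, e.g. at the start of
    // each callback. Pointers handed out earlier must not be reused.
    void reset() noexcept { used_ = 0; }

    std::size_t remaining() const noexcept { return capacity_ - used_; }

private:
    std::unique_ptr<float[]> storage_;
    std::size_t capacity_;
    std::size_t used_ = 0;
};
```

In this sketch the pool would be constructed during thread initialization and destroyed when the real-time thread is cleaned up. The callback only calls `take` and `reset`, both of which are plain pointer arithmetic.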

Signed-off-by: Emily Mabrey <emabrey@tenacityaudio.org>
@nyanpasu64
Contributor

Well maybe @nyanpasu64 at least would have figured it out soon enough.

I don't appreciate you speaking on my behalf about how I'd respond to code I haven't reviewed.

@Paul-Licameli
Contributor

Well maybe @nyanpasu64 at least would have figured it out soon enough.

I don't appreciate you speaking on my behalf about how I'd respond to code I haven't reviewed.

🤷

I was complimenting you.

@emabrey emabrey marked this pull request as draft August 2, 2021 02:27
Signed-off-by: Emily Mabrey <emabrey@tenacityaudio.org>
Signed-off-by: Emily Mabrey <emabrey@tenacityaudio.org>
Signed-off-by: Emily Mabrey <emabrey@tenacityaudio.org>
@emabrey emabrey closed this Aug 5, 2021
@Be-ing Be-ing mentioned this pull request Aug 8, 2021
5 tasks
@emabrey emabrey deleted the cache-friendly-audioiobufferhelper branch August 9, 2021 03:13