Skip to content

Conversation

chaburkland
Copy link
Collaborator

A first pass at splitting up _arraykit.c into multiple source files.

I split it up as such:

  1. utilities.c: defines internal arraykit functions/macros that are used by other functions across multiple modules
  2. methods.c: defines the public arraykit functions
  3. block_index.c: defines all BlockIndex related code
  4. array_go.c: defines all ArrayGO related code
  5. file_parsing.c: defines all code related to delimited file parsing
  6. tri_map.c: defines all TriMap related code
  7. _arraykit: defines the arraykit public interface

All source files, except _arraykit.c have related header files.

A main note
Many functions are no longer static or static inline. Through splitting this code up, I gained a deeper understanding of what those keywords mean.

static means the function defined is specific to the scope of the source file it lives in. As such, it is non-sensical to define a static function in a header file, because it is essentially defining an interface that cannot ever be referenced outside the header file itself. As such, the public functions/typedefs in each of the header files are not static, and their corresponding definitions in the source files are also not static.

inline is a way to speed up performance at the cost of binary size.
With inlining all possible functions, the binary sizes (on my machine) increases from:
841,184 KBs -> 1,041,664 KBs
I'm testing the performance diff now

@chaburkland chaburkland self-assigned this Jun 15, 2024
@flexatone
Copy link
Contributor

@chaburkland : very happy to see this refactoring!

One initial comment: I would favor leaving in the existing inline usage for a number of reasons.

  • It will be hard re-run sufficient performance tests
    accross all AK utilities to gain confidence in the difference. That said, a focused isolated study would be informative for future usage.
  • My understanding is that inline is a suggestion to the compiler that may not always be observed. Thus, over inlineing seems fine.
  • Our deployment targets can afford larger binaries for the benefit of performance. For binary size comparison, we should not compare AK to AK, but AK to, say, Numpy!

@flexatone
Copy link
Contributor

@chaburkland : one more:

Can we name file_parsing.c something more specific like delimited_to_arrays.c? I guess there might be a name conflict, but a more specific name seems appropriate as, after this change, we are likely to create new, separate files for any other file parsing routines we might add.

@chaburkland
Copy link
Collaborator Author

@flexatone

Thanks for the push to rename. I wasn't sure what to call it, but delimited_to_arrays makes the most sense.

As for inline, I went back through and discovered I marked a lot more functions inline than previously.

I wasn't able to mark these 4 functions inline (they were previously), because they are defined in utilites.h:
- AK_build_pair_ssize_t
- AK_build_slice
- AK_slice_to_ascending_slice
- AK_nonzero_1d

@flexatone
Copy link
Contributor

flexatone commented Jun 18, 2024

I think (at least an AI told me) if we put the actual implementation (not just the header) in the .h file, we can inline some of those remaining functions, at least the smaller ones more natural for inlining.

@flexatone
Copy link
Contributor

flexatone commented Jun 21, 2024

@chaburkland : any thoughts on the above (putting implementations in header files for inlining)?

Also, do you think we should merge this before merging #160, or vice versa?

Copy link
Contributor

@flexatone flexatone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is such a long-overdue improvement!

@flexatone flexatone merged commit f6b3008 into static-frame:master Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants