Open
Description
Problem
Linked list is one of the essential data structures beside an array. It allows you to add, insert, or remove elements in constant time, without re-allocating the whole structure.
Fortran doesn't have a linked list. There are 3rd party libraries, but no obvious go-to solution. Fortran stdlib should have a linked list. I would use it.
Examples
- FLIBS by @arjenmarkus
- fortran-list by @LadaF
- flist by @jacobwilliams
- PolyCon by @cmacmackin
- Petaca by @nncarlson
- Modern Fortran Explained by MRC has an implementation in Appendix C
- Many others?
What kind of data can the linked list hold?
There's various levels of capability we could pursue:
- Single type: Basically just like an array, but allows insertion in constant time;
- Elements can be of any intrinsic type in a single list;
- Can take intrinsic type and user-defined derived types (is this even possible in current Fortran?)
API
I don't know, something like this?
use stdlib_experimental_collections, only :: List
type(List) :: a = List()
call a % append(42)
call a % append(3.141)
call a % append('text')
print *, a % get(2) ! prints 3.141
call a % remove(3) ! a is now List([42, 3.141])
call a % insert(2, 'hello') ! a is now List([42, 'hello', 3.141])
a = List([1, 2, 3]) ! instantiate a list from an array
Activity
certik commentedon Jan 2, 2020
C++ has std::list for this. (I added Petaca to your Examples above.)
I would mention that I personally have never had a need for a
std::list
in C++, nor any linked list implementation in Fortran, because linked list is very slow (to create, traverse, destrogy, ...) compared to just a regular array orstd::vector
. The only operation that might be faster is insertion or deletion of individual items in the middle. In my use cases, I typically need to add elements to the end, in which case array works great.But since there are at least 6 different people who reimplemented this already in Fortran and given that C++ has it too in their standard library, I would say that this would be a good candidate to include in stdlib, so that if people want to use it, they can. So +1 from me.
milancurcic commentedon Jan 2, 2020
Me too, but how do you do it? I thought that appending to an array always re-allocates on heap, e.g.:
It's okay, for small-to-moderate arrays, but for very large ones, isn't it crippling?
certik commentedon Jan 2, 2020
The canonical way is to pre-allocate the array and then append to it, like this:
Then you use your actual application to figure out what the maximum size of the array is (100000000 in this example), and then you can either keep
a
as is (only use the first 1000 elements, as in this example), or you can copy it to a smaller array. A real world example is e.g. here: https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/sparse.f90#L111, theBj_
array is pre-allocated to the maximum size first (determined from the sparse arrays), and then downsized before returning to the user: https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/sparse.f90#L127. This is typically still much faster than a linked list implementation. If you don't know the size ahead of time, then you can set some maximum at compile time and fail the program if you go over it (real world example: https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/basis.f90#L230) --- many times this is fine, as you can recompile the code easily. But sometimes that's not appropriate, so then you can also do whatstd::vector
does --- it doubles the allocation every time you reach it, and copies the data. Here is a fast implementation of that that I use in LFortran (that's in C++, but one can do something similar in Fortran also): https://gitlab.com/lfortran/lfortran/blob/57d3b8077d884f0ff3945ad3a86b2da920e4b6b3/src/lfortran/parser/parser_stype.h#L22. All of these are fast options.But as I said, it's good to have linked list in stdlib, if people prefer that, so that they do not need to reimplement it.
rweed commentedon Jan 2, 2020
First I think we need to define which types of linked list we need. I prefer a circular double-linked list as the basic type since its the type I use most in FEM codes etc. I also think we would need a single-link list to implement stacks and queues. Also do we need some form of reference counting. As to @milancurcic question as to current Fortran support list that can contain both intrinsic and user defined types, yes it can. I've implemented both a circular list class and a single link class using unlimited polymorphic variables. They works but are not pretty and will probably have poor perfomance when compared to a type specific list generated by pre-processing/templating methods ala the
Fortran Template Library approach.
everythingfunctional commentedon Jan 3, 2020
Generic linked-list, or really any generic data structure, is really cumbersome with the current Fortran capabilities. They work, but you end up having to use a
select type
block every time you want to access the data. So for convenience you'd end up with some wrapper class or library, at which point you might as well have re-implemented for your specific use case. Until we get fully parameterized types or template capabilities I don't think these are a great idea.victorsndvg commentedon Jan 3, 2020
I think the supported data types should be wrapped with containers in order to be extendible.
In the main issue (#1) containers are mentioned, but I don't see any other specific issue.
I think FPL (https://github.com/victorsndvg/FPL) contains a smart implementation strategy for supporting native data types and allow to extend to other user defined data types. It contains lists, hash tables, etc. All of them depend on containers (aka wrappers) in order to manage different data types.
I agree that with this kind of data types you don't get performance, but amazing flexibility. This kind of data types (usually) are not for computation purposes.
Edit:
ivan-pi commentedon Jan 3, 2020
I think many of the projects in the list of popular projects contain linked list implementations. Perhaps it would be good to do a grep over all of those repositories to get a feeling for linked list usage in production codes (e.g. whether they use generic lists supporting multiple kinds or only specific ones for the intrinsic kinds and potentially derived types).
nshaffer commentedon Jan 3, 2020
I agree with @everythingfunctional on this issue. There's a ton of up-front labor in implementing fully polymorphic containers, and I'm not convinced that they're that much more useful than having generic (but homogeneous) containers. That is, I don't think it's worthwhile to support, say, linked lists where each element is of arbitrary type.
The more common use case I find is to need a linked list of
int32
, say. Or a binary tree ofclass(my_derived_t)
. These can be implemented withoutselect type
all over the place. There's still the labor of implementing all the intrinsic types, but this can be templated.Letting users make containers of derived types is tricker. The common solution is to provide an abstract base class that users need to extend in order to have containers of derived types. I think that solution kind of sucks, but I have an alternate idea... Just ship source code templates that implement each container for
class(__T__)
and then let users runsed s/__T__/mytype/g
on it to produce derived type containers on demand. (This will be slightly more involved for, e.g., mapping types, but just slightly).I confess I have not thought through if there is some great pitfall to this approach besides being slightly "icky" from a distribution p.o.v.
nncarlson commentedon Jan 3, 2020
I'm in agreement with @nshaffer here. I've done the linked-list-of
class(*)
variables in my own library which I use as a backend for some very specific things where such generality is needed. Otherwise it is incredibly clunky to use with all theselect type
and isn't an acceptable for general use, imo.Someone else seemed to suggest that perhaps performance shouldn't be a concern here. I think it would be a big mistake to ignore performance. Linked lists come with their intrinsic performance overhead that most would be aware of, but any implementation that significantly added to that I would find unacceptable to include in a standard library.
I think the best solution beyond intrinsic types, which could all have very performant implementations, would be, as @nshaffer suggested, to provide a literal template that a user could adapt for their particular case. In fact that's more or less what I do myself.
zbeekman commentedon Jan 3, 2020
A note on performance:
I think there is merit to providing classic data structures and algorithms. I would add hash tables to this list as well as binary-trees, octrees, K-D trees, and a number of others. Obviously they are not useful to all users and applications but having a decent implementation is worthwhile.
I agree that right now the
select type
combinatorial explosion makes unlimited polymorphics nearly useless, and very awkward. In my opinion better generic programming should be the highest priority for the next major standard revision.certik commentedon Jan 3, 2020
@zbeekman Generic programming will not make it to the next standard revision -- simply because there is no proposal that is ready. I think the latest most developed idea is pursued at j3-fortran/fortran_proposals#125, and we need everybody's help to help transform the idea into a solid proposal. Once we have a proposal that is community backed, I'll be happy to bring it to the committee and try to get it into the next standard.
zbeekman commentedon Jan 3, 2020
I know @rouson is working with Magne who leads the Bergen Language Design Lab and also @tclune on generics. They have something here but I don't know how up to date it is with their current efforts. Hopefully they can combine efforts and we can get something in, we'll see.
certik commentedon Jan 3, 2020
Yes, the issue j3-fortran/fortran_proposals#125 is the latest based on our discussion with Magne at the last meeting. Anyway, let's move the discussion about this there, I just wanted to point this out, that we need help.
tclune commentedon Jan 3, 2020
In the mean time I have a project https://github.com/Goddard-Fortran-Ecosystem/gFTL which provides (by far less elegant means) a generic container system. Currently it supports Vector and Map (ala C++ STL), but also has Set which is used under the hood.
gFTL uses the C preprocessor and requires explicit instantiation, but is still a real game changer for doing some common operations within Fortran. I have a separate project gFTL-shared that provides common instantiations.
But I do look forward to the day that this could be done much more elegantly through a proper generic facility. (And yes, I realize that other preprocessors could do what I have done more elegantly than the C preprocessor, but ... cpp is already integrated into the build systems for the other projects I work with.
gronki commentedon Jan 3, 2020
I agree here with @zbeekman that linked lists are essential and I think the approach to preallocate array is very ineffective (cause then you have to check for overflow and re-allocate it etc). I also sadly agree that this is undoable in the current Fortran. Gotta wait for generics (or hopefully an intrinsic highly-optimized types for lists and dicts).
68 remaining items
certik commentedon Oct 21, 2022
Let's keep the discussion here to have it in one place, it's all related.
a = [1, 2, 3]
and the operations that you can do on it, like "append" and "extent" and concatenationa = [1, 2, 3] + [10, 11]
. However, Python underneath does not use linked list, but rather implements the list using a size, capacity and a pointer. So I would recommend using the term "list" for this datastructure. If we want to specify a particular implementation, we can have both List and LinkedList. Using the term "vector" for this I think is confusing, because vector in this sense is the same as an array and Fortran already has arrays. So I would stick with Python's terminology, which is "list" (cheap appending) and "array" (NumPy array, for our purposes equivalent to Fortran arrays; expensive appending). (In my mind, the C++ implementation of the list is "std::vector". There is alsostd::list
which I think is a linked list and I don't think I've ever used it, as it is very slow. I recommend using Python, not C++ terminology.)List[f64]
(using LPython's syntax), in ASR notation (here is the full ASR so that you can see the context of the nodes below: https://github.com/lfortran/lfortran/blob/383831d7efb9b5b1bc00c74e928c8997b7470a95/src/libasr/ASR.asdl):milancurcic commentedon Oct 21, 2022
OK, your terminology is confusing to me, but we can pick any that most people here agree on. For this conversation, we can just call them "linked list" and "list", as you suggest.
Based on your description, as I understand it, your list is the same as a dynamic array. All elements of the same type, cheap indexing and prepending/appending, slow inserting.
The difference is not just an implementation detail. They're different data structures. I initially opened this thread to ask whether there's a desire for a structure with fast insertion (a linked list). I was motivated by the fact that there are many home-cooked implementations on the internet of Fortran linked lists, so it's possible that the community would benefit from a stdlib implementation. For me personally, your list is a more useful structure.
Considering the option of having both structures (linked list and list), do you think it would be a good design if they had the same (to the extent possible) API?
awvwgk commentedon Oct 21, 2022
There is also need for a list holding
class(*)
values, for example to implement support for #671 using stdlib. If only one type is allowed user will need to manually instantiate a list using:And deal with the class dispatch themselves. In this case stdlib could directly provide such a wrapper class.
tclune commentedon Oct 21, 2022
Related note: In the initial release of generics in Fortran, one will not be able to write
CLASS(T)
where T is a deferred type. So one cannot directly use a single template that works for bothTYPE(T)
andCLASS(T)
. Well, a user could implement their own wrapper and then either swear up and down in packing/unpacking at each interface, or write their own wrappers.And you cannot even write an "outer" template to provide the polymorhic case, because you cannot even define the wrapper type within the template. Again - there is no way to write
CLASS(T)
.And no, I'm not happy about this, as well over half of my uses of containers are polymorphic. The majority of the subgroup prefers to wait for the initial set of features to be implemented so that we can more accurately plan subsequent extensions. I.e., so that we worry about the things that matter most after experience is gained.
certik commentedon Oct 21, 2022
We can discuss the terminology at our next Fortran meeting. :)
@tclune, we have our former GSoC student (@ansharlubis) working on the generics in LFortran. If you know about anybody else who can help, please let me know. While we are at this: there does not seem to be many good examples of the syntax. Here is our issue for this: lfortran/lfortran#929 where I linked all examples that I was able to find so far. If you create more examples, that would be a huge help.
tclune commentedon Oct 21, 2022
@certik The only examples that properly use the syntax as it appears in the formal syntax paper are:
linear_algebra
andfundamental_requirements
. (These are in the last link you provide in the reference above.)Until now, the syntax has been a (slowly) moving target, and maintaining the examples takes time. If/when the syntax paper is approved, we can start producing "final" examples and deleting the ones with obsolete syntax.
certik commentedon Oct 21, 2022
I see, thanks @tclune!
beddalumia commentedon Oct 21, 2022
About terminology
It may be worth noting that Wikipedia explicitly states that a list can be implemented either as a linked list or a dynamic array (plus a generic statement about the former being more frequent for lisp dialects, where the terms "list" and "linked list" are often conflated.)
https://en.m.wikipedia.org/wiki/List_(abstract_data_type)
To be fair the first picture you see is a depiction of a linked list, so I totally get that it can be misleading. I've also being taught in bachelor that a "list" is a "linked list".
In general, I wouldn't want to give too much importance to Wikipedia per sé, but I like that is more or less language agnostic, and we do not necessarily want to follow conventions from either the C++, or python community or anything else, rather use terms that would be familiar for people with different backgrounds and, even more importantly, would not result confusing after googling them. The C++ case gives an important lesson in my opinion, as
to me it violates the principle of least surprise with std::vector (no one that is not aware of C++ specific lore would ever imagine what it is, from the name... if googling works now it's just for how much C++ containers are pervasive at this point)
given the diffusion and extensive usage nowadays, it's basically impossible to amend the name, so yes, it would be important to avoid choosing a misleading/debatable name from the start.
Relevant quote from stack overflow:
original link with reference and discussion
beddalumia commentedon Oct 21, 2022
I think this is exactly what you'll want to do, it would make easy for client code to switch from one to the other, according to different needs / requirements changes / benchmarks.
I may just comment that I know that in the codebases that I'm working on during the PhD there is currently a homebrew implementation of a dynamic array (for sparse rows...) Older people told me that that was initially implemented as a linked list and used in many performance critical parts of the code. As they eventually scaled up the requirements, crafted a massively parallel algorithm for the iterative solver and started profiling the MPI performance they immediately spotted these linked lists as the main bottleneck and changed to dynamic arrays to give the very same "list functionality" but with acceptable performance.
beddalumia commentedon Oct 21, 2022
As for
class(*)
containers, what about keeping it separated from lists, also in naming convention?They would be inherently inferior for performance, but provide a whole new level of flexibility... this should suffice to qualify them as a totally separated thing, targeting very different use cases. Matlab calls such things1 "cells". How do you feel about such a name? Alternatives?
Footnotes
not really sure about the underlying implementation, but the API is somewhat the same: a collection of heterogeneous things, generally slower than a builtin array (remember that in matlab all arrays are dynamic, so satisfy Ondrej's definition of list), and with quite limited support to be passed around. E.g. you cannot do plot(x, y) if x and y are cells, nor you can do linear algebra, etc. They provide a suitable
cell2mat
method to convert homogenous cells to arrays, when needed. ↩milancurcic commentedon Oct 21, 2022
Since the terminology varies between communities. I suggest being explicit and descriptive such that the name reflects the properties of the structure. Perhaps those would be:
type linked_list
)type dynamic_array
)ivan-pi commentedon Oct 22, 2022
@milancurcic, is the difference that
linked_list
can hold items of different types, whiledynamic_array
contains items of the same type? Fortran allocatable arrays are dynamically allocated in the sense they can be resized. Thedynamic_array
on the other hand allows inserting/appending/removing (Wiki - Dynamic Array)?I know a few algorithms which require a
dynamic_array
, and I implement them currently with an allocatable array and a smart reallocation procedure (#598). Recently, there was a query about a dynamic array at Fortran Discourse: https://fortran-lang.discourse.group/t/what-package-would-you-recommend-for-typed-lists/4359. In that case, the user only needed the append operation.certik commentedon Oct 22, 2022
For what is worth, the first time I heard the term dynamic array is from Milan above, and I wouldn't know what it means without this thread, since Fortran arrays are also "dynamic" (in my mind). But I don't have a better term for it right now (besides just List), so we can use DynamicArray, to mean the "size+capacity+pointer" implementation (like
std::vector
).Thanks @bellomia for the comments! Yes, in LFortran's ASR (intermediate representation) we use the term List, and indeed it is independent of how it is actually implemented. So in order to be able to represent both linked list and size+capacity+pointer implementation, we could just add a flag to the type, such as "implementation=LinkedList | DynamicArray". So the API could indeed be exactly the same for both.
We can use "linked_list" and "dynamic_array" for implementations and then somehow alias "list" to mean "dynamic_array", as I think "list" is easier for people to understand what it means. Just "list" and "array".
milancurcic commentedon Oct 22, 2022
Great, yes, I'm not so concerned about the final terminology in stdlib because we can arrive there through consensus in issues and PRs; but more that we are on the same page in this thread when we discuss these structures so that we know what each name refers to.
milancurcic commentedon Oct 22, 2022
@ivan-pi As I understand it, the time complexity (fast insertion vs. fast indexing) and the type of elements are orthogonal.
A linked list has fast insertion and growing and shrinking but slow indexing. A list/vector has fast indexing and fast (amortized) growing and shrinking but slow insertion.
A linked list with
class(*)
elements allows you to mix types of elements and insert user-derived type instances as elements. But consuming data from the list is awkward because you need all theselect type
boilerplate. This is the kind of list that @ChetanKarwa worked on during his GSoC project.But a regular list/vector can, I think, also be implemented with
class(*)
to allow the same flexibility with types.And either structure with fixed (intrinsic) types is easy to consume elements from (no
select type
boilerplate needed) because the types are known at compile-time. And it's probably easier for a compiler to optimize because all elements have the same size in memory.So, I think that choosing the specific structures to implement requires deciding along both dimensions, time complexity of operations, and type-flexibility vs. ease of use.