Skip to content

Linked list #68

Open
Open
@milancurcic

Description

@milancurcic
Member

Problem

Linked list is one of the essential data structures beside an array. It allows you to add, insert, or remove elements in constant time, without re-allocating the whole structure.

Fortran doesn't have a linked list. There are 3rd party libraries, but no obvious go-to solution. Fortran stdlib should have a linked list. I would use it.

Examples

What kind of data can the linked list hold?

There's various levels of capability we could pursue:

  1. Single type: Basically just like an array, but allows insertion in constant time;
  2. Elements can be of any intrinsic type in a single list;
  3. Can take intrinsic type and user-defined derived types (is this even possible in current Fortran?)

API

I don't know, something like this?

use stdlib_experimental_collections, only :: List
type(List) :: a = List()

call a % append(42)
call a % append(3.141)
call a % append('text')
print *, a % get(2) ! prints 3.141
call a % remove(3) ! a is now List([42, 3.141])
call a % insert(2, 'hello') ! a is now List([42, 'hello', 3.141])

a = List([1, 2, 3]) ! instantiate a list from an array

Activity

certik

certik commented on Jan 2, 2020

@certik
Member

C++ has std::list for this. (I added Petaca to your Examples above.)

I would mention that I personally have never had a need for a std::list in C++, nor any linked list implementation in Fortran, because linked list is very slow (to create, traverse, destrogy, ...) compared to just a regular array or std::vector. The only operation that might be faster is insertion or deletion of individual items in the middle. In my use cases, I typically need to add elements to the end, in which case array works great.

But since there are at least 6 different people who reimplemented this already in Fortran and given that C++ has it too in their standard library, I would say that this would be a good candidate to include in stdlib, so that if people want to use it, they can. So +1 from me.

milancurcic

milancurcic commented on Jan 2, 2020

@milancurcic
MemberAuthor

In my use cases, I typically need to add elements to the end, in which case array works great.

Me too, but how do you do it? I thought that appending to an array always re-allocates on heap, e.g.:

integer :: i
integer, allocatable :: a(:)
a = [integer ::]
do i = 1, 100000000
  a = [a, i] ! re-allocates a on every append
end do

It's okay, for small-to-moderate arrays, but for very large ones, isn't it crippling?

certik

certik commented on Jan 2, 2020

@certik
Member

The canonical way is to pre-allocate the array and then append to it, like this:

integer :: i
integer, allocatable :: a(:)
allocate(a(100000000))
do i = 1, 1000
    a(i) = i
end do

Then you use your actual application to figure out what the maximum size of the array is (100000000 in this example), and then you can either keep a as is (only use the first 1000 elements, as in this example), or you can copy it to a smaller array. A real world example is e.g. here: https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/sparse.f90#L111, the Bj_ array is pre-allocated to the maximum size first (determined from the sparse arrays), and then downsized before returning to the user: https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/sparse.f90#L127. This is typically still much faster than a linked list implementation. If you don't know the size ahead of time, then you can set some maximum at compile time and fail the program if you go over it (real world example: https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/basis.f90#L230) --- many times this is fine, as you can recompile the code easily. But sometimes that's not appropriate, so then you can also do what std::vector does --- it doubles the allocation every time you reach it, and copies the data. Here is a fast implementation of that that I use in LFortran (that's in C++, but one can do something similar in Fortran also): https://gitlab.com/lfortran/lfortran/blob/57d3b8077d884f0ff3945ad3a86b2da920e4b6b3/src/lfortran/parser/parser_stype.h#L22. All of these are fast options.

But as I said, it's good to have linked list in stdlib, if people prefer that, so that they do not need to reimplement it.

rweed

rweed commented on Jan 2, 2020

@rweed

First I think we need to define which types of linked list we need. I prefer a circular double-linked list as the basic type since its the type I use most in FEM codes etc. I also think we would need a single-link list to implement stacks and queues. Also do we need some form of reference counting. As to @milancurcic question as to current Fortran support list that can contain both intrinsic and user defined types, yes it can. I've implemented both a circular list class and a single link class using unlimited polymorphic variables. They works but are not pretty and will probably have poor perfomance when compared to a type specific list generated by pre-processing/templating methods ala the
Fortran Template Library approach.

everythingfunctional

everythingfunctional commented on Jan 3, 2020

@everythingfunctional
Member

Generic linked-list, or really any generic data structure, is really cumbersome with the current Fortran capabilities. They work, but you end up having to use a select type block every time you want to access the data. So for convenience you'd end up with some wrapper class or library, at which point you might as well have re-implemented for your specific use case. Until we get fully parameterized types or template capabilities I don't think these are a great idea.

victorsndvg

victorsndvg commented on Jan 3, 2020

@victorsndvg

I think the supported data types should be wrapped with containers in order to be extendible.
In the main issue (#1) containers are mentioned, but I don't see any other specific issue.

I think FPL (https://github.com/victorsndvg/FPL) contains a smart implementation strategy for supporting native data types and allow to extend to other user defined data types. It contains lists, hash tables, etc. All of them depend on containers (aka wrappers) in order to manage different data types.

I agree that with this kind of data types you don't get performance, but amazing flexibility. This kind of data types (usually) are not for computation purposes.

Edit:

  • Note 1: FPL data types naming is probably not good
  • Note 2: Multiple dimensions management is another usual issue also treated by FPL
ivan-pi

ivan-pi commented on Jan 3, 2020

@ivan-pi
Member

I think many of the projects in the list of popular projects contain linked list implementations. Perhaps it would be good to do a grep over all of those repositories to get a feeling for linked list usage in production codes (e.g. whether they use generic lists supporting multiple kinds or only specific ones for the intrinsic kinds and potentially derived types).

nshaffer

nshaffer commented on Jan 3, 2020

@nshaffer
Contributor

I agree with @everythingfunctional on this issue. There's a ton of up-front labor in implementing fully polymorphic containers, and I'm not convinced that they're that much more useful than having generic (but homogeneous) containers. That is, I don't think it's worthwhile to support, say, linked lists where each element is of arbitrary type.

The more common use case I find is to need a linked list of int32, say. Or a binary tree of class(my_derived_t). These can be implemented without select type all over the place. There's still the labor of implementing all the intrinsic types, but this can be templated.

Letting users make containers of derived types is tricker. The common solution is to provide an abstract base class that users need to extend in order to have containers of derived types. I think that solution kind of sucks, but I have an alternate idea... Just ship source code templates that implement each container for class(__T__) and then let users run sed s/__T__/mytype/g on it to produce derived type containers on demand. (This will be slightly more involved for, e.g., mapping types, but just slightly).

I confess I have not thought through if there is some great pitfall to this approach besides being slightly "icky" from a distribution p.o.v.

nncarlson

nncarlson commented on Jan 3, 2020

@nncarlson
Contributor

I'm in agreement with @nshaffer here. I've done the linked-list-of class(*) variables in my own library which I use as a backend for some very specific things where such generality is needed. Otherwise it is incredibly clunky to use with all the select type and isn't an acceptable for general use, imo.

Someone else seemed to suggest that perhaps performance shouldn't be a concern here. I think it would be a big mistake to ignore performance. Linked lists come with their intrinsic performance overhead that most would be aware of, but any implementation that significantly added to that I would find unacceptable to include in a standard library.

I think the best solution beyond intrinsic types, which could all have very performant implementations, would be, as @nshaffer suggested, to provide a literal template that a user could adapt for their particular case. In fact that's more or less what I do myself.

zbeekman

zbeekman commented on Jan 3, 2020

@zbeekman
Member

A note on performance:

  1. Yes linked lists have some overhead compared with arrays
  2. They perform well for sorted data, and in instances where you're always manipulating one end of the list or the other, e.g., stacks, but, in general are NOT constant time lookup for random access read or insertion unless you're always operating on data "nearby"
  3. They are a building block component for hash tables which are in general constant time insertion and lookup.
  4. In some cases where storage needs vary greatly and dynamically in complex ways pre-allocating a huge array may not be feasible and you may want/need to use a linked list

I think there is merit to providing classic data structures and algorithms. I would add hash tables to this list as well as binary-trees, octrees, K-D trees, and a number of others. Obviously they are not useful to all users and applications but having a decent implementation is worthwhile.

I agree that right now the select type combinatorial explosion makes unlimited polymorphics nearly useless, and very awkward. In my opinion better generic programming should be the highest priority for the next major standard revision.

certik

certik commented on Jan 3, 2020

@certik
Member

@zbeekman Generic programming will not make it to the next standard revision -- simply because there is no proposal that is ready. I think the latest most developed idea is pursued at j3-fortran/fortran_proposals#125, and we need everybody's help to help transform the idea into a solid proposal. Once we have a proposal that is community backed, I'll be happy to bring it to the committee and try to get it into the next standard.

zbeekman

zbeekman commented on Jan 3, 2020

@zbeekman
Member

I know @rouson is working with Magne who leads the Bergen Language Design Lab and also @tclune on generics. They have something here but I don't know how up to date it is with their current efforts. Hopefully they can combine efforts and we can get something in, we'll see.

certik

certik commented on Jan 3, 2020

@certik
Member

Yes, the issue j3-fortran/fortran_proposals#125 is the latest based on our discussion with Magne at the last meeting. Anyway, let's move the discussion about this there, I just wanted to point this out, that we need help.

tclune

tclune commented on Jan 3, 2020

@tclune

In the mean time I have a project https://github.com/Goddard-Fortran-Ecosystem/gFTL which provides (by far less elegant means) a generic container system. Currently it supports Vector and Map (ala C++ STL), but also has Set which is used under the hood.

gFTL uses the C preprocessor and requires explicit instantiation, but is still a real game changer for doing some common operations within Fortran. I have a separate project gFTL-shared that provides common instantiations.

But I do look forward to the day that this could be done much more elegantly through a proper generic facility. (And yes, I realize that other preprocessors could do what I have done more elegantly than the C preprocessor, but ... cpp is already integrated into the build systems for the other projects I work with.

gronki

gronki commented on Jan 3, 2020

@gronki

I agree here with @zbeekman that linked lists are essential and I think the approach to preallocate array is very ineffective (cause then you have to check for overflow and re-allocate it etc). I also sadly agree that this is undoable in the current Fortran. Gotta wait for generics (or hopefully an intrinsic highly-optimized types for lists and dicts).

68 remaining items

certik

certik commented on Oct 21, 2022

@certik
Member

Let's keep the discussion here to have it in one place, it's all related.

  • In LPython we don't allow mixing types in a list, all items have to be of the same type. Consequently also in LFortran.
  • We can allow mixing types in stdlib, and then LFortran would only optimize lists of the same type, and if you mix types, it would not be optimized; I would lean to not allow mixing of types.
  • the term "list" in my mind visualizes Python list like a = [1, 2, 3] and the operations that you can do on it, like "append" and "extent" and concatenation a = [1, 2, 3] + [10, 11]. However, Python underneath does not use linked list, but rather implements the list using a size, capacity and a pointer. So I would recommend using the term "list" for this datastructure. If we want to specify a particular implementation, we can have both List and LinkedList. Using the term "vector" for this I think is confusing, because vector in this sense is the same as an array and Fortran already has arrays. So I would stick with Python's terminology, which is "list" (cheap appending) and "array" (NumPy array, for our purposes equivalent to Fortran arrays; expensive appending). (In my mind, the C++ implementation of the list is "std::vector". There is also std::list which I think is a linked list and I don't think I've ever used it, as it is very slow. I recommend using Python, not C++ terminology.)
  • We have to figure out the API of a List, that is the main task.
  • We can implement any API into LFortran's frontend; the intermediate representation uses the following API: it has statement nodes and expression nodes and a type List which has a type of the element, such as List[f64] (using LPython's syntax), in ASR notation (here is the full ASR so that you can see the context of the nodes below: https://github.com/lfortran/lfortran/blob/383831d7efb9b5b1bc00c74e928c8997b7470a95/src/libasr/ASR.asdl):
stmt
    | ListAppend(expr a, expr ele)
    | ListInsert(expr a, expr pos, expr ele)
    | ListRemove(expr a, expr ele)
    | ListClear(expr a)

expr
    | ListConstant(expr* args, ttype type)
    | ListLen(expr arg, ttype type, expr? value)
    | ListConcat(expr left, expr right, ttype type, expr? value)
    | ListItem(expr a, expr pos, ttype type, expr? value)
    | ListSection(expr a, array_index section, ttype type, expr? value)
    | ListPop(expr a, expr? index, ttype type, expr? value)

ttype
    | List(ttype type)
milancurcic

milancurcic commented on Oct 21, 2022

@milancurcic
MemberAuthor

OK, your terminology is confusing to me, but we can pick any that most people here agree on. For this conversation, we can just call them "linked list" and "list", as you suggest.

Based on your description, as I understand it, your list is the same as a dynamic array. All elements of the same type, cheap indexing and prepending/appending, slow inserting.

The difference is not just an implementation detail. They're different data structures. I initially opened this thread to ask whether there's a desire for a structure with fast insertion (a linked list). I was motivated by the fact that there are many home-cooked implementations on the internet of Fortran linked lists, so it's possible that the community would benefit from a stdlib implementation. For me personally, your list is a more useful structure.

Considering the option of having both structures (linked list and list), do you think it would be a good design if they had the same (to the extent possible) API?

awvwgk

awvwgk commented on Oct 21, 2022

@awvwgk
Member

There is also need for a list holding class(*) values, for example to implement support for #671 using stdlib. If only one type is allowed user will need to manually instantiate a list using:

type :: wrapped_class
  class(*), allocatable :: raw
end type wrapped_class

And deal with the class dispatch themselves. In this case stdlib could directly provide such a wrapper class.

tclune

tclune commented on Oct 21, 2022

@tclune

Related note: In the initial release of generics in Fortran, one will not be able to write CLASS(T) where T is a deferred type. So one cannot directly use a single template that works for both TYPE(T) and CLASS(T). Well, a user could implement their own wrapper and then either swear up and down in packing/unpacking at each interface, or write their own wrappers.

And you cannot even write an "outer" template to provide the polymorhic case, because you cannot even define the wrapper type within the template. Again - there is no way to write CLASS(T).

And no, I'm not happy about this, as well over half of my uses of containers are polymorphic. The majority of the subgroup prefers to wait for the initial set of features to be implemented so that we can more accurately plan subsequent extensions. I.e., so that we worry about the things that matter most after experience is gained.

certik

certik commented on Oct 21, 2022

@certik
Member

We can discuss the terminology at our next Fortran meeting. :)

@tclune, we have our former GSoC student (@ansharlubis) working on the generics in LFortran. If you know about anybody else who can help, please let me know. While we are at this: there does not seem to be many good examples of the syntax. Here is our issue for this: lfortran/lfortran#929 where I linked all examples that I was able to find so far. If you create more examples, that would be a huge help.

tclune

tclune commented on Oct 21, 2022

@tclune

@certik The only examples that properly use the syntax as it appears in the formal syntax paper are: linear_algebra and fundamental_requirements. (These are in the last link you provide in the reference above.)

Until now, the syntax has been a (slowly) moving target, and maintaining the examples takes time. If/when the syntax paper is approved, we can start producing "final" examples and deleting the ones with obsolete syntax.

certik

certik commented on Oct 21, 2022

@certik
Member

I see, thanks @tclune!

beddalumia

beddalumia commented on Oct 21, 2022

@beddalumia

About terminology

It may be worth noting that Wikipedia explicitly states that a list can be implemented either as a linked list or a dynamic array (plus a generic statement about the former being more frequent for lisp dialects, where the terms "list" and "linked list" are often conflated.)

https://en.m.wikipedia.org/wiki/List_(abstract_data_type)

To be fair the first picture you see is a depiction of a linked list, so I totally get that it can be misleading. I've also being taught in bachelor that a "list" is a "linked list".

In general, I wouldn't want to give too much importance to Wikipedia per sé, but I like that is more or less language agnostic, and we do not necessarily want to follow conventions from either the C++, or python community or anything else, rather use terms that would be familiar for people with different backgrounds and, even more importantly, would not result confusing after googling them. The C++ case gives an important lesson in my opinion, as

  • to me it violates the principle of least surprise with std::vector (no one that is not aware of C++ specific lore would ever imagine what it is, from the name... if googling works now it's just for how much C++ containers are pervasive at this point)

  • given the diffusion and extensive usage nowadays, it's basically impossible to amend the name, so yes, it would be important to avoid choosing a misleading/debatable name from the start.

Relevant quote from stack overflow:

It's called a vector because Alex Stepanov, the designer of the Standard Template Library, was looking for a name to distinguish it from built-in arrays. He admits now that he made a mistake, because mathematics already uses the term 'vector' for a fixed-length sequence of numbers. C++11 compounds this mistake by introducing a class 'array' that behaves similarly to a mathematical vector. Alex's lesson: be very careful every time you name something.

original link with reference and discussion

beddalumia

beddalumia commented on Oct 21, 2022

@beddalumia

Considering the option of having both structures (linked list and list), do you think it would be a good design if they had the same (to the extent possible) API?

I think this is exactly what you'll want to do, it would make easy for client code to switch from one to the other, according to different needs / requirements changes / benchmarks.

I was motivated by the fact that there are many home-cooked implementations on the internet of Fortran linked lists, so it's possible that the community would benefit from a stdlib implementation. For me personally, your list is a more useful structure.

I may just comment that I know that in the codebases that I'm working on during the PhD there is currently a homebrew implementation of a dynamic array (for sparse rows...) Older people told me that that was initially implemented as a linked list and used in many performance critical parts of the code. As they eventually scaled up the requirements, crafted a massively parallel algorithm for the iterative solver and started profiling the MPI performance they immediately spotted these linked lists as the main bottleneck and changed to dynamic arrays to give the very same "list functionality" but with acceptable performance.

beddalumia

beddalumia commented on Oct 21, 2022

@beddalumia

As for class(*) containers, what about keeping it separated from lists, also in naming convention?

They would be inherently inferior for performance, but provide a whole new level of flexibility... this should suffice to qualify them as a totally separated thing, targeting very different use cases. Matlab calls such things1 "cells". How do you feel about such a name? Alternatives?

Footnotes

  1. not really sure about the underlying implementation, but the API is somewhat the same: a collection of heterogeneous things, generally slower than a builtin array (remember that in matlab all arrays are dynamic, so satisfy Ondrej's definition of list), and with quite limited support to be passed around. E.g. you cannot do plot(x, y) if x and y are cells, nor you can do linear algebra, etc. They provide a suitable cell2mat method to convert homogenous cells to arrays, when needed.

milancurcic

milancurcic commented on Oct 21, 2022

@milancurcic
MemberAuthor

Since the terminology varies between communities. I suggest being explicit and descriptive such that the name reflects the properties of the structure. Perhaps those would be:

  • Linked List (type linked_list)
  • Dynamic Array (type dynamic_array)
ivan-pi

ivan-pi commented on Oct 22, 2022

@ivan-pi
Member

@milancurcic, is the difference that linked_list can hold items of different types, while dynamic_array contains items of the same type? Fortran allocatable arrays are dynamically allocated in the sense they can be resized. The dynamic_array on the other hand allows inserting/appending/removing (Wiki - Dynamic Array)?

I know a few algorithms which require a dynamic_array, and I implement them currently with an allocatable array and a smart reallocation procedure (#598). Recently, there was a query about a dynamic array at Fortran Discourse: https://fortran-lang.discourse.group/t/what-package-would-you-recommend-for-typed-lists/4359. In that case, the user only needed the append operation.

certik

certik commented on Oct 22, 2022

@certik
Member

For what is worth, the first time I heard the term dynamic array is from Milan above, and I wouldn't know what it means without this thread, since Fortran arrays are also "dynamic" (in my mind). But I don't have a better term for it right now (besides just List), so we can use DynamicArray, to mean the "size+capacity+pointer" implementation (like std::vector).

Thanks @bellomia for the comments! Yes, in LFortran's ASR (intermediate representation) we use the term List, and indeed it is independent of how it is actually implemented. So in order to be able to represent both linked list and size+capacity+pointer implementation, we could just add a flag to the type, such as "implementation=LinkedList | DynamicArray". So the API could indeed be exactly the same for both.

We can use "linked_list" and "dynamic_array" for implementations and then somehow alias "list" to mean "dynamic_array", as I think "list" is easier for people to understand what it means. Just "list" and "array".

milancurcic

milancurcic commented on Oct 22, 2022

@milancurcic
MemberAuthor

Great, yes, I'm not so concerned about the final terminology in stdlib because we can arrive there through consensus in issues and PRs; but more that we are on the same page in this thread when we discuss these structures so that we know what each name refers to.

milancurcic

milancurcic commented on Oct 22, 2022

@milancurcic
MemberAuthor

@ivan-pi As I understand it, the time complexity (fast insertion vs. fast indexing) and the type of elements are orthogonal.

A linked list has fast insertion and growing and shrinking but slow indexing. A list/vector has fast indexing and fast (amortized) growing and shrinking but slow insertion.

A linked list with class(*) elements allows you to mix types of elements and insert user-derived type instances as elements. But consuming data from the list is awkward because you need all the select type boilerplate. This is the kind of list that @ChetanKarwa worked on during his GSoC project.

But a regular list/vector can, I think, also be implemented with class(*) to allow the same flexibility with types.

And either structure with fixed (intrinsic) types is easy to consume elements from (no select type boilerplate needed) because the types are known at compile-time. And it's probably easier for a compiler to optimize because all elements have the same size in memory.

So, I think that choosing the specific structures to implement requires deciding along both dimensions, time complexity of operations, and type-flexibility vs. ease of use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    topic: container(Abstract) data structures and containerstopic: utilitiescontainers, strings, files, OS/environment integration, unit testing, assertions, logging, ...

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      Participants

      @certik@zbeekman@LadaF@everythingfunctional@arjenmarkus

      Issue actions

        Linked list · Issue #68 · fortran-lang/stdlib