Heterogeneous column types #6
If you don't have type-safety (option 3), then why use Rust? Pandas is good enough (actually I think it's better). On the point of bad type signatures: this doesn't help when Rust reports errors, but it does make it possible to describe a dataframe's type with a readable alias.
(Of course, as soon as you do a join or a groupby or whatever, the type changes.)
That does seem pretty useful. Also, since macros can be used in type contexts, you could have something like `type MyFrame = Frame![Col1, Col2, Col3];`. I did something similar for creating label lists in agnes. The type signatures in errors are still a problem, though.
It looks like it's a bit of a trade-off between compile-time type-checking and the readability and usefulness of error messages.
I'm pretty happy with how it's coming along. I think it'll be useful for what we're doing in that it'll provide a framework for our goal of compile-time-checked, non-stringy column indexing. It should make column lookup easier to use and a lot less messy.
You are not alone! https://h2oai.github.io/db-benchmark/
We use
Because that type-safety is for the LANG, but data CAN'T BE type-safe in a universal way. Imagine I open psql and start typing queries. It's interactive, and it can't be type-safe at all. My opinion is that any library that needs to operate on arbitrary data MUST be dynamic; you could add a type-safe layer on top "easily", but not the other way around.
Is this discussion/project still alive?
I think the issue with the third proposal is that I still have not understood it. It seems complicated. What are cons-lists? And regarding the discussion of enums in #3:
is better than
You still have the issue that you need to handle these enums in your eventual calculations. I mean, it is nice that your dataframe can store arbitrary types, but you do not just want to store them; this is not supposed to be a database. The problems start once you try to implement simple functions you could apply to these columns, like means, for enum types like the one above.
It might be possible to provide a default implementation for arbitrary enums by making the first element of the enum the baseline, since (if you want to do a linear regression) the effects of regular smoking are added on top of the occasional-smoking effects. Anyway, the point is: if you want to enable basic statistics/machine-learning modules to be built on top of dataframes, then they need to be able to make some assumptions about the dataset. And if you want to use truly unique types, then you would have to reimplement all the functionality the dataframe provides for this particular type, which at that point defeats the purpose of using a dataframe framework. And you would not want to build this functionality on top of this dataframe framework if the types are encoded as some strange cons-list, which you would first have to understand before you can build anything on top of it. But I think it is unlikely that there are more kinds of types than numeric, categorical and string (see the sketch below).
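For illustration, a minimal sketch of the baseline-encoding idea (the `Smoking` enum and the cumulative encoding below are assumptions based on the #3 discussion, not settled API):

```rust
/// Illustrative ordered categorical; the first variant acts as the baseline.
enum Smoking {
    NonSmoker,
    Occasional,
    Regular,
}

/// Cumulative dummy encoding for regression: each level adds its effect
/// on top of the previous one, and the baseline is all zeros.
fn dummies(s: &Smoking) -> [f64; 2] {
    match s {
        Smoking::NonSmoker => [0.0, 0.0],
        Smoking::Occasional => [1.0, 0.0],
        Smoking::Regular => [1.0, 1.0],
    }
}
```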
[Edit: Added warning to skip some of my comments.] WARNING: To anybody who hasn't yet suffered through the next several comments by me and @FelixBenning: I suggest you skip all of my comments starting from this one (inclusive) up to (and including) #6 (comment). The only comment in this mess that I feel may have a sufficiently good benefit/cost ratio for most readers is #6 (comment). Just watch out for @FelixBenning's comment in between. I personally don't think it's worthwhile reading it, but he might disagree. 😉
@FelixBenning Regarding cons-lists: I think @jblondin meant HLists. A cons-list is just a special type of linked list. An HList is a cons-list implemented in the type system. This makes it possible to define and reason about arbitrary-length heterogeneous lists of types at compile time. So, for example, you can statically check the type of the i-th element in HList![T1, T2, ..., Tn] for any `i` known at compile time.
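For concreteness, here is what such a type-level cons-list looks like with the frunk crate (the column types below are made up):

```rust
use frunk::{HCons, HNil};

// frunk's `HList![usize, f64, String]` type macro expands to this nested
// cons-list; the whole shape is known to the compiler.
type Columns = HCons<usize, HCons<f64, HCons<String, HNil>>>;
```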
Edit: You might want to skip this... Oh. Another disadvantage of HLists is that, since their type must be completely known at compile time, you have to specify the exact number and types of columns at compile time. Therefore, even though spelling the columns out explicitly might work, reading data whose schema is only known at runtime won't work if we use HLists internally, since the resulting HList type cannot be written down at compile time (see the sketch below).
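A sketch of the contrast, using a hypothetical dataframe API (`DataFrame`, `from_rows` and `from_csv` are made-up names):

```rust
// Fine: every column type is spelled out at compile time.
let df: DataFrame<HList![u64, f64, String]> = DataFrame::from_rows(rows);

// Not expressible with a purely HList-backed frame: the schema is only
// discovered while parsing, so no HList type can be written for `df`.
let df = DataFrame::from_csv("data.csv")?;
```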
Edit: You might want to skip this...
A proposed fourth approach to heterogeneous column types
We seem to have a trade-off: on the one hand it would be very nice if we could have compile-time type checks and zero-cost abstractions, but on the other hand we need the ability to construct dataframes based on data that might only be available at runtime. (Bonuses like fast and ergonomic indexing, slicing and iteration, and consistently good error messages would also be great.) Can we perhaps eliminate the trade-off (or at least give the user the option to choose) and yet have a single consistent type? We might be able to after all. Here's what I think:
Approach 4: A typed front-end for a byte-array (or usize-array) back-end
PS: Apologies for the walls of text. Especially since my previous long comment was mostly just a somewhat off-topic reply, and I don't know whether the idea presented in this comment is any good either. ;-)
No, but I apparently did not manage to get my point across very well. I am not interested in this particular example, i.e. I am not interested in a certain use case, which is why this does not belong in #3. I instead tried to make a more general point, using this as an example. So I'll take another shot at clarifying what I meant: I am going to play dumb and claim there are only 5 types of objects in Rust:
So we can simply create an enum of these 5 types and have covered all possible types. This is of course not quite true. So what are the issues with this statement?
But multi-indexes could simply be implemented by creating a column that itself contains dataframes. So what is left is basically just 5 possible types (times possible sizes), which is perfectly suited for an enum. The only thing this could not deal with is oversized types, if someone starts to implement them. But I very much doubt that anyone is really going to need more than 512 bytes for any kind of numeric value. And for SIMD we only need to keep our enum up to date with the current chip generation, which is very little work. The only thing which might possibly cause issues is string lengths. But if someone tries to store books in dataframe cells, then they probably care so little about performance that we can also store string pointers. Another argument is practicability (the above should be completely sufficient, but I started with this argument, so I will try to clarify this a bit more in order to explain my past comments): when people use a dataframe library, they want to use prebuilt methods and functions. They want to be able to use stuff like means, group-bys and joins out of the box (a sketch of the enum idea follows below).
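A minimal sketch of the "one enum covers the primitive kinds" idea (the variant set is one possible reading of the five kinds above; in a real implementation each numeric kind would be multiplied by its possible sizes):

```rust
/// One variant per rough kind of value.
enum Cell {
    Bool(bool),
    Int(i64),
    Float(f64),
    Char(char),
    Str(String), // variable-length data stored behind a pointer
}
```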
[Edit6: Added a section.] Sorry everyone. I was an ass. I wrote 3 very long, badly structured and badly-thought-through comments above. (Note to self, @dexterlemmer: never write stuff on the Internet when you're tired! You knew that already.)
Terminology note: implementations
The OP talks about three implementation strategies for dataframes with heterogeneous column types. After this section, I'll just refer to them as impl1/2/3. Here's a recap of the implementation strategies:
1. impl1: enum-based columns or values;
2. impl2: Any-based storage with runtime type metadata;
3. impl3: heterogeneous lists (cons-lists/HLists).
I'll also propose a new implementation (and API): impl4, a typed front-end for a byte-array back-end (Approach 4 above).
Overview
1. Explanation of impl3 (HLists)
This is simply a way for us to fully statically type our heterogeneous columns. In Rust, the frunk crate implements HLists. Static typing is great. Most of what makes Rust great is that it's statically typed by default. Sure, many people probably won't claim they like Rust because it's statically typed, but I can guarantee you that almost anything most people might say is great about Rust either requires static typing or interacts with other great features in a way that requires static typing. Here's an example of using HLists, along the lines of frunk's README (note the compiler allows this to type-check, and would've raised a type error if the types of the elements didn't match):
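```rust
use frunk::hlist;

// Build a heterogeneous list; its full type is known at compile time.
let h = hlist![true, "hello", Some(41)];

// Destructure it back into statically typed parts.
let (a, (b, c)) = h.into_tuple2();
assert_eq!(a, true);
assert_eq!(b, "hello");
assert_eq!(c, Some(41));
```

(This mirrors the example in frunk's README.)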
The advantage of HLists over tuples is that they are lists, so we can do things that are impossible with tuples, such as iterating over an HList, pushing and popping elements, etc.
2. I feel impl3 (HLists) alone is insufficient
HLists' strength (full static typing) is also their weakness:
3. We can (and IMHO should) do better than impl1 (df_crate::ColEnum) by using impl2 (`Any`-based columns), sketched below.
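A minimal sketch of an `Any`-based (impl2-style) column; the `AnyCol` name is made up:

```rust
use std::any::Any;

/// The element type is erased behind `dyn Any` and recovered at runtime.
struct AnyCol {
    name: String,
    data: Box<dyn Any>,
}

fn main() {
    let col = AnyCol {
        name: "x".into(),
        data: Box::new(vec![1.0_f64, 2.0, 3.0]),
    };
    // The caller must know (or look up in metadata) the real element type.
    if let Some(v) = col.data.downcast_ref::<Vec<f64>>() {
        println!("mean of {} = {}", col.name, v.iter().sum::<f64>() / v.len() as f64);
    }
}
```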
Edit: You might want to skip this. OK, since @FelixBenning has replied to one of my previous comments while I wrote the above one, I'll rather not delete my previous comments. @FelixBenning, I'll read your reply and get back to you.
Edit: You might want to skip this if you're not @FelixBenning... @FelixBenning I think I've answered you already, here:
(The text in brackets was not in the original.) I'll just add that I can see a user actually needing a prebuilt function/method for some type we don't provide. Who says we have to be the ones to provide it? Btw, you may have a misunderstanding about how functions like these get implemented. Furthermore, I see my impl4 as letting the user decide what functionality he needs for his use case: static or dynamic typing, encodings or a single column storing weird data, etc. It's his choice. As it should be.
Edit: You might want to skip this if you're not @FelixBenning... Oh, BTW, @FelixBenning, just in case I still haven't gotten my point across very well: there's no reason whatsoever why a user cannot use his own types, with his own functions, in the dataframe. Why would we arbitrarily not allow our user to do that? All of the above said: sometimes static typing just doesn't make sense, and sometimes it does make sense but not enough to overcome the disadvantages of working with HLists (or tuples). However, I've never thought nor claimed that impl3 alone is the answer.
[Edit 1: changed the multilevel dataframe implementation] @dexterlemmer I only skimmed your proposal since I am in my exam weeks currently. But from what I got, you essentially propose to implement an enum of first-class citizens and handle the rest with dynamic typing (something `Any`-like). I think we all agree that we want to avoid dynamic typing as much as possible. So I think it is worth discussing how far we can get with enums of types, since everything we cannot cover will end up in the fallback case.
Good point - I completely forgot about methods. I also realized that I forgot pointers (i.e. arrays, lists, etc.). I still think that we can deal with structs, though, even with methods. I'll use these structs as a running example:

```rust
struct Point {
    x: f64,
    y: f64,
}

struct Measurement {
    value: f64,
    point: Point,
}
```

If we did not care about performance at all, we could just store any kind of data as a list of structs, where the struct is the measurement and thus fundamentally heterogeneous. The issue is: most often you want to access the same field for all the measurements, and not the entire measurement. For this reason dataframes want to store things as a struct of arrays and not an array of structs. So how would you store that in a dataframe? You could store it flattened, as sketched below.
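A sketch of the flattened struct-of-arrays layout (field names are illustrative):

```rust
/// `Vec<Measurement>` (array of structs) flattened into a struct of arrays:
/// one contiguous column per leaf field, so scanning a single field is fast.
struct MeasurementFrame {
    value: Vec<f64>,
    point_x: Vec<f64>,
    point_y: Vec<f64>,
}
```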
But what if you suspected that only the x coordinate of the point determines the value? Well, then you would only query the x coordinate, and we would again run into the same speed-of-access discussion as above, which would probably result in a multi-index dataframe:
You could implement dataframes like this, and get multi-indices for free:

```rust
// Owned boxes stand in for the original bare references (`&Dataframe` etc.),
// which would not compile without explicit lifetimes.
enum MultiColumn {
    MultiC(Box<Dataframe>),
    StaticC(Box<ColumnStatic>),
    DynC(Box<ColumnDynamic>),
}

struct Dataframe {
    columns: Vec<MultiColumn>,
}
```
For the Method Problem
Fundamentally, any row is always an object (i.e. a measurement). So why not simply include the row's type in the dataframe?

```rust
use std::any::TypeId;

struct Dataframe {
    // The original sketch left this field's type open (`?`); `TypeId` is
    // one way to record "the type of a row" at runtime.
    struct_type: TypeId,
    columns: Vec<MultiColumn>,
}
```

You could store the type of the struct the rows represent, and use it to cast rows back into proper objects.
So let's say you want to apply a method to every point in your dataset. In order to do that, you iterate through the rows of the lower-level dataframe, cast every row to Point, apply the method, and store the result in your dataframe as a row. Since you iterate through your dataframe, you can still fetch aligned slices from memory. It might even be possible to avoid fetching the entire row, if it is possible to figure out ahead of time which fields a method uses. For example, a method on Point would only ever touch the point columns (see the sketch below).
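A sketch of that idea, assuming the `Point` struct above plus a made-up `norm()` method:

```rust
impl Point {
    /// A made-up method standing in for whatever the user wants to apply.
    fn norm(&self) -> f64 {
        (self.x * self.x + self.y * self.y).sqrt()
    }
}

/// Reassemble a `Point` per row from the two point columns, apply the
/// method, and collect the results as a new column.
fn apply_norm(point_x: &[f64], point_y: &[f64]) -> Vec<f64> {
    point_x
        .iter()
        .zip(point_y)
        .map(|(&x, &y)| Point { x, y }.norm())
        .collect()
}
```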
So there is the potential to gain a lot of efficiency by explicitly not saving structs as structs. Additionally, compatibility with Arrow would force us to implement something like this anyway, as we would have to have a method which converts our list of structs into a struct of lists in order for it to be saved in a format Arrow would accept.
Arrays, Lists, Vec, etc.
You could convert same-length iterables into columns. But beyond that, we are probably forced to save a pointer and type information in an `Any`-like fallback column.
Built-in Statistics
I think the main types of data are either numeric, categorical or strings (and of course collections of such types - mostly structs). Categorical data (i.e. slim enums) should have sensible defaults for conversion into dummy variables. This would make statistics/machine learning so much easier, if you could just rely on the dataframe framework to do the necessary conversion if needed, or specify the conversion with very simple syntax. (Similarly, missing data should of course be treated, but bitmasks can be slapped on any data type, so I am not that worried.) A dataframe without quality-of-life improvements for statistics is only a database. So prebuilt features are quite important. And dynamic types can make those really difficult. I mean, even if you think that this is not the task of the dataframe itself - libraries built on top of the dataframe have to deal with these issues. And if you make their life incredibly difficult by forcing them to deal with arbitrary types, then you won't get many takers.
I'm building a relational language on the side and hit the same issue: how to actually store the data and yet provide easy operations on it. For a language this looks harder to me, but a library can constrain the problem better. Here are some ideas I have about this:

Instead of thinking in terms of method + traits, do method + arity (like kdb/J). Rust makes things a little convoluted, where it is not possible to easily say "support any function of a given arity". This means that I could pass freely:

```rust
// Pseudocode: named arguments are not actual Rust syntax.
data.column(apply: std::ops::add, ...)
data.column(apply: StrOps::concat, ...)
// ...etc.
```

This, I think, solves it: if the dataframe API only worries about arity, ANY function can be supplied.
This one is harder to implement, but I think transducers provide the answer. Here, we *don't* supply the actual storage (or, more exactly, we supply it later for convenience) and only provide transducers that operate on anything dataframe-like. This aligns with the idea of Building Efficient Query Engines in a High-Level Language. Combined with the potential of transducers to support not only static storage but also being fed by vectors, channels, iterators, streams, etc., we could get the best of all worlds! This opens the possibility to do magic like:

```rust
// Pseudocode sketch (pandas-style names kept from the original).
let titanic = pd.read_csv("data/titanic.csv"); // user only cares about dataframe ops, brings his own source
let query = transducer::map(titanic, select_field("age") | toi32); // dynamically parse the CSV somehow
let mean = query.mean(); // dataframe op

// But I want types!
struct Titanic {
    age: Vec<i32>, // columnar storage!
}
// ...implement dataframe for Titanic...

let titanic: Titanic = pd.read_csv("data/titanic.csv").into(); // now we want fast columnar ops
let query = transducer::map(titanic, age); // statically typed, fast! (same dataframe op)
let mean = query.mean(); // dataframe op, same
```

In other words: how about decoupling ops from storage, so that instead of two camps fighting ("I want static types!", "I want dynamic data!") we marry them and allow picking the best backend?
I honestly do not understand how fixing the arity (number of arguments?) fixes our issues. I mean, do you have types or not? If you do, then I do not see how arity changes anything. If you do not, I do not see how arity improves anything either. It just seems to introduce a limitation on the operations you can use. But why? What do you gain from that?
It helps in how to provide functionality BESIDES the types. It is only a part of the solution. Note that this was related to:
What I am trying to say is that one problem is what functionality (ops) is provided (like .sum, .mean, etc.). If that depends on the internal types of the dataframe implementation, then it is fixed. If I'm loading data from CSV, I could wish to do "sum" directly (mapping lines and then converting the column to a number), without first loading it into a vec of numbers and then doing the sum.
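A minimal sketch of that streaming idea (plain std; the helper is hypothetical):

```rust
/// Sum one column of an in-memory CSV without building an intermediate
/// Vec<f64>: parse and fold in a single streaming pass.
fn sum_column(csv: &str, col: usize) -> f64 {
    csv.lines()
        .skip(1) // skip the header row
        .filter_map(|line| line.split(',').nth(col))
        .filter_map(|field| field.trim().parse::<f64>().ok())
        .sum()
}

fn main() {
    let csv = "age,fare\n22,7.25\n38,71.28\n26,7.92";
    assert_eq!(sum_column(csv, 0), 86.0);
}
```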
[Edit4: See below, it's not far down (relatively speaking).]
Hehe. I actually do the same. At least in my case, I'm proposing what I think is very similar to how blackjack already does it (except that blackjack stores a hashmap of Any + metadata, rather than a vec/array/tuple of Any + metadata, like me).
OK. I definitely did not get my point across well. (Hopefully not because I was unclear, but rather because you only skimmed my proposal. 😄 ) Edit1: We already have impl1/2/3 to handle some situations. I propose doing six times the work (in terms of functionality: two column types multiplied by three dataframe types), since it would be barely more than 1-2 times the work in practice, and IMHO we really do need a single dataframe crate that provides all the functionality necessary for others to build a data-science stack on top of it and ndarray/nalgebra. The implementation detail is rather important. If you are referring to Column vs AnyCol, that's necessary, but I guess we could leave AnyCol for later. I'd rather not, though, since I want to make sure it works, and we will probably eventually be asked to support it -- especially since impl4 needs to be competitive with impl2, which does provide AnyCol-like behavior.
Actually, in impl4 "enums of types" is a fallback case for when that is what the user wants. However, he gets to choose which enums (even his own), and he gets to choose per dataframe instance. Edit1: Additionally, actually anyone can choose. I'll show below how either we, or another library crate building on top of ours, or the user could implement dataframes for specific types.
You and I have very different understandings of the meaning of the word "heterogeneous" in the context of this discussion. That may be part of our other misunderstandings:
Sure you can...
Interesting proposal. I think impl4 is both more expressive and more intuitive, but I guess that'll work. That said, I'd rather support all datatypes a user has any business putting in a dataframe, and save multi-indices for cases where the dataframe is truly hierarchical, rather than use them as a hack to enable users to encode unsupported datatypes.
Not... quite. You're supposed to work with dataframes where every row is a measurement, but you might be handed dataframes by someone else that weren't built that way. It's often useful to be able to record dataframes in formats where not every row is a measurement. Or so I'm told... And then I need to go "tidy up" (as R calls it) the mess. It would be a nightmare to tidy up the mess if my dataframe crate cannot encode the mess (how do I even get the data imported?). Well, truthfully, the last time this happened to me was before I did any data science of my own, when I needed to turn a customer's Excel sheets into a normalized database. However, it's apparently still common enough that many R textbooks insist on spending a chapter or two teaching you how to do it with dataframes.
Actually, you are starting to figure out what I did. Look here:

```rust
use std::marker::PhantomData;

struct ColsTuple<S, R: Row<S>> {
    // [Edit1: `_marker: PhantomData` serves a similar purpose as your `structType`
    // field, except it allows the **compiler** to remember the type and the
    // **program** to forget the type.]
    _marker: PhantomData<(S, R)>,
    // [Edit1: `strides`, `buff` and `shape` (fields below) combine to reimplement
    // `Vec<R>`. The problem with trying to store `Vec<MultiColumn>` is as follows:
    // What is the datatype of `MultiColumn`? If it is a tuple, you're stuck with
    // macros in your **indexing**, i.e. you are restricted to impl3. Furthermore,
    // if it is a tuple, we end up storing data row-wise, which is very slow in
    // practice for typical use cases of dataframes due to a lot of cache misses.
    // If it's a `Vec`, it won't compile, since `Vec` isn't heterogeneous. Oh! And
    // again, we're storing the data row-wise.
    // [Edit2: And if `MultiIndex` works the way you've shown us above, then your
    // proposal becomes like my original (which was **more** complex than impl4),
    // except that you drop a lot of my original complexity and my original
    // advantages. You don't say how &StaticColumn or &DynColumn are implemented,
    // but I'll give you a hint: &StaticColumn needing to be static and the
    // recursive nature of your dataframe (via MultiColumn) eventually rediscovers
    // HLists as you try to make it work, and &DynCol ends up working like my
    // AnyCol. As you then battle the difficulties involved in HLists and try to
    // add fast and ergonomic indexing and the ability to interface with anything
    // that doesn't use HLists (including your &DynCol), you find yourself needing
    // to separate your storage from your API. Additionally, you've now forced both
    // us and our users to care about multi-indexes, even if the user's index isn't
    // naturally hierarchical and even if we want to leave implementing
    // MultiIndexes for later.]
    // Therefore we are forced to implement our own heterogeneous vec-like thing
    // [Edit2: for the actual data storage. This vec-like thing should] allow us to
    // store our data column-wise. (Such a vec-like thing [Edit2: (a vec-like thing
    // capable of heterogeneous item types and statically typed)] doesn't yet exist
    // in Rust [Edit2: as far as I'm aware].)]
    // Note: we can rather use `strides: [usize]` if we also move this definition
    // into the builder macros below.
    strides: Vec<usize>,
    buff: *mut DFBuffer, // note the raw pointer
    shape: (usize, usize),
}

// (Pseudocode for a procedural macro; `#proc_macro` is not real Rust syntax.)
#proc_macro ColsTuple(...) -> ... {
    // Parse the user's code. Determine the types of his columns; for example,
    // if his column types are (respectively) T1, T2, ..., Tn, we do
    type S = (T1, T2, ..., Tn);
    // Check whether S is already implemented; if not, do:
    impl ColsTuple<S, R: Row<S>> {
        // impl details
    }
    impl ColsTuple<S, R: Row<S>> for AbstractDF<R> {
        // impl details
    }
    // [Edit1: Also provide the info necessary for the #DeriveGeneric*OnColsTuple
    // macros to know for which generic types they need to derive their trait impls.]
}

struct ColsArray<T, C: Column<T>> {
    _marker: PhantomData<(C, [T])>,
    strides: usize, // note this is different from ColsTuple.strides
    buff: *mut DFBuffer, // note the raw pointer
    shape: (usize, usize),
}

struct ColsVec<T, C: Column<T>> {
    _marker: PhantomData<(C, Vec<T>)>,
    strides: usize,
    buff: *mut DFBuffer, // note the raw pointer
    shape: (usize, usize),
    capacity: (usize, usize), // This is new!
}
```

Note that all three Cols* structs in impl4 know their exact type at compile time. They zero-cost cast their types to &[u8] and back again using (unsafe) pointer casts.
You are on the right track. The compiler already does this. Like I said, my dataframe types work just like tuple, array and Vec internally, and these types store pointers to an internal buffer, which then get updated during indexing by compiler-generated code. Methods don't need to keep track of which fields they are operating on. They know on which column, dataframe or &dataframe they are working, and that type stores pointers to their start and computes strides at compile time, so that pointers for indexes can be computed efficiently at runtime.
Yes... We save structs as columns, one column per field, rather than as structs.
I don't understand what you are saying. Maybe I'm getting tired. Edit1: Oh, now I get it. I had forgotten what you had said near the very top of your post. Nah! Your comment is not applicable to impl4, since it doesn't store lists of structs. It stores a heterogeneous list of columns (just like Arrow), not a list of structs.
Basically, yes. Except that the type information is stored in a PhantomData, which just type-erases it at runtime. We only care that the compiler knows the type, since then it can compute everything we need to know to implement the transmutations and compute the strides for us. What's more, you may not know this (but you probably do): in Rust, a PhantomData is zero-sized, and an enum always occupies the size of its largest variant (plus the discriminant).
That's why enums can be accessed very fast in arrays (and in my own types). However, that also makes them very memory-inefficient.
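A quick illustration of both claims (zero-sized PhantomData; enums sized to their largest variant):

```rust
use std::marker::PhantomData;
use std::mem::size_of;

enum Cell {
    Byte(u8),
    Big([u64; 8]),
}

fn main() {
    // PhantomData stores nothing at runtime; the type lives only in the compiler.
    assert_eq!(size_of::<PhantomData<(u64, String)>>(), 0);
    // Every `Cell` occupies at least the space of the largest variant, so arrays
    // of `Cell` index fast but waste memory on the small variants.
    assert!(size_of::<Cell>() >= size_of::<[u64; 8]>());
}
```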
Sure. But in your proposal, neither the numeric type nor its size is known to the compiler, so you pay the enum overhead regardless.
Edit1: IIRC, Rust does support packed enums -- which store their variants as small as possible per variant. However, I doubt packed enums are Sized, which means you cannot even store them in DFBuff (unless our impl can somehow prove which variant is used in each column and that only that variant gets used throughout that column), let alone in a Vec (Vecs only store T: Sized). So that doesn't help us much in terms of memory efficiency.
I couldn't agree more. 😉 That said, if we provide the right dataframe abstraction and some basic support for statistics, other crates can provide the rest, or we can add the rest later. For example, I've shown how we (or preferably a crate built on top of ours) could provide prebuilt functions for specific column types.
And the user can do this:
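For instance, a hypothetical sketch (the `Mean` trait and `Celsius` type are made up for illustration):

```rust
/// A statistics trait the dataframe crate could expose.
trait Mean {
    fn mean(&self) -> f64;
}

/// A user-defined element type the crate knows nothing about.
struct Celsius(f64);

/// The user supplies the impl for columns (here: slices) of his own type.
impl Mean for [Celsius] {
    fn mean(&self) -> f64 {
        self.iter().map(|c| c.0).sum::<f64>() / self.len() as f64
    }
}

fn main() {
    let column = vec![Celsius(20.0), Celsius(22.0), Celsius(24.0)];
    assert_eq!(column.mean(), 22.0);
}
```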
Now isn't that nice!
PS. Hey, thanks. You made me really think hard about my proposed impl4, its capabilities and limitations, and how to explain it all, as well as how I'll go about it when I finally get to write out its code. That said, I thought I was going to be able to actually code some Rust tonight to reacquaint myself with it... until I saw your post. Oh well.
PPS. Good luck with your exams.
[Edit4: Fixed bug in code.] @mamcx Wow! Looks like you have given us an impl5 proposal. I think mine is better, but after my long answer to Felix (may I call you that, @FelixBenning?), I'm now too tired and hasty to really look into it. And it looks to me like you are still in the early figure-it-out stage that I was in when I wrote some of the posts I ended up editing to suggest people skip. I hope you can show us what you have in mind and write up a more complete and understandable proposal. Or that tomorrow I'm less tired and actually realize it's obvious and well written and I understand it. 😄 On the other hand, I don't think the arity approach makes sense for a dataframe. Sure, it might make sense for databases, but if you have the full power of Rust and the full generality of a good Rust dataframe API, I don't think we should force our users (or ourselves) to worry about arity and ops. Let the compiler figure it out, and simply (for example) make functions like mean into trait methods.
Now the user of our respective libs doesn't care about our impls at all, only that the compiler can find some impl of the functionality he uses somewhere.
Now if it's some function that neither we nor the author of the type the user puts in his columns supports, the user can write the missing impl himself. Though he may end up needing to wrap the type in a newtype, due to Rust's orphan (coherence) rules (see the sketch below).
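A sketch of that last point, reusing the hypothetical `Mean` trait from above (the newtype works around the orphan rule, which forbids implementing a foreign trait for a foreign type):

```rust
use std::time::Duration;

/// Neither the dataframe crate nor std implements `Mean` for `Duration`,
/// so the user wraps it in a local newtype and writes the impl himself.
struct Seconds(Duration);

impl Mean for [Seconds] {
    fn mean(&self) -> f64 {
        self.iter().map(|s| s.0.as_secs_f64()).sum::<f64>() / self.len() as f64
    }
}
```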
@dexterlemmer Since I apparently have not understood what you are intending, and I am not quite sure you understood my intention either, I think that at this point it might be more productive if we just talked about it in person and cleared up all misunderstandings - what about tomorrow, 11 June 2020 @ 20:00 Central European Summer Time (GMT+2)? Maybe the Rust Discord? (Actually, I'm not sure whether it has channels.)
@FelixBenning, I heartily agree that we should get this discussion off of this issue. We could try to talk in person. However, there's no on-topic Discord channel on the Rust Discord yet. We could maybe friend each other and talk privately? I don't currently use any other suitable social media, and I don't intend to either. Oh, we could move the discussion to a new issue in this repo or on one of my or your repos. I made a possible start at dexterlemmer/impl4#2 (comment). Note: it would make sense if we still keep to the 20:00 GMT+2 time you've suggested, even if we use GitHub instead of Discord. Whatever we decide to use, you might want to first read dexterlemmer/impl4#2 (comment), since I've already made a start there with two posts about (1) "What exactly are our goals with this discussion?" and (2) "Can we clean up the mess we've already made after we've decided we've reached our goal?"
Hello! TL;DR: Check out this GitHub repo: https://github.com/etemesi-ke/dami - and an example: https://github.com/etemesi-ke/dami/blob/master/doc/10-minutes.ipynb
I am quite late to this discussion, but I once needed a DataFrame implementation in Rust too, and came up with an implementation for heterogeneous columns using the following methods.
```rust
use std::any::Any;
use std::collections::HashMap;
use ndarray::Array1; // the repo stores each series as an ndarray Array1

/// An enum of data types.
/// Note this isn't exhaustive.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum DataTypes {
    F32,
    F64,
    Unknown,
}

struct DataFrame {
    // The DataFrame block holder: one type-erased block per data type.
    blocks: HashMap<DataTypes, Box<dyn Any>>,
    // Maps the name of a Series to the DataType whose block holds it.
    values: HashMap<String, DataTypes>,
    // Names of the series, kept to preserve insertion order.
    names: Vec<String>,
}
```

We can define a `Series` as

```rust
struct Series<T> {
    // The underlying array.
    array: Array1<T>,
    // The DataType of the array.
    dtype: DataTypes,
    name: String,
}
```

and determine the data type of a value like this:

```rust
fn get_dtype(value: &dyn Any) -> DataTypes {
    // (Fixed from the original, which tested an undefined `T`.)
    if value.is::<f32>() {
        DataTypes::F32
    } else if value.is::<f64>() {
        DataTypes::F64
    } else {
        DataTypes::Unknown
    }
}
```

Adding a Series to a DataFrame then becomes

```rust
impl DataFrame {
    /// Add a series to the DataFrame.
    pub fn add_to_df<T: Default + 'static + Clone>(&mut self, other: Series<T>) {
        // Maintain order of insertion.
        self.names.push(other.get_name());
        self.values.insert(other.get_name(), other.get_dtype());
        self.get_appropriate_block(other);
    }
}
```

What the `get_appropriate_block` method does is locate (or create) the block for the Series' data type and move the Series into it. The above-quoted GitHub repo uses an interesting way of grouping homogeneous columns together, because blocks are keyed by data type. Therefore, assuming we have 2 f32 columns and 2 String columns (in that order), the two f32 columns share one block and the two String columns share another.

What about applying a function to columns in the DataFrame? Consider a function that squares all f32s:

```rust
/// Square a number.
pub fn square(a: f32) -> f32 {
    a.powi(2)
}
```

To apply this function to a DataFrame, you could write `df.apply::<f32, _>(square);`. It will find the block containing all the f32 series and apply the function to each of them.

Added Bonuses
Personally, I think that is quite amazing. What do you think?
So, I thought I'd start opening up issues to enable discussion of individual dataframe features we'd like to see. I'd like to start with 'heterogeneous column types': the ability to have a dataframe with columns of different types.
In looking through existing WIPs in #4, I came across a few different methods of implementing this:

1. `enum`s, for either a column or for individual values. utah (and really any arbitrarily-typed dataframe library) can house enums as values, which allows you to mix types however you want (even within the same column), at the cost of run-time type safety and some performance. I didn't see any library currently use column-based enums, but I could see having something like a per-column data enum (sketched below), and in fact did it this way in an early version of agnes.
2. `Any`-based storage, along with some metadata for relating columns to data types at run-time. Used by rust-dataframe and black-jack.
3. Heterogeneous lists (cons-lists) of column types, checked at compile time. This is what agnes currently uses.

Each of these has its own advantages and disadvantages. For example, 1 and 2 lack compile-time type-checking, but have much cleaner type signatures and potentially cleaner error messages than 3 (where you have something like `DataFrame<Cons<usize, Cons<f64, Cons<String, Nil>>>>` for a relatively simple three-column dataframe).

You could also have a combination of the above techniques -- I could see something like a cons-list type-checking column metadata structure while the data itself is stored in some sort of `Any`-based structure.

I'm personally a fan of the compile-time type-checking that cons-lists provide, but they can be hard to work with for those unfamiliar with them. I've started work on a labeled cons-list library (which will replace the one I'm using in agnes) to hopefully help out with some of these issues.

What are everyone's thoughts / opinions? Are there other options we should consider? I'd love to hear what people think the best approach is!
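A minimal sketch of what such a column-based enum (option 1) could look like (names are illustrative):

```rust
/// One variant per supported column type; a whole column shares one type.
enum ColumnData {
    Float(Vec<f64>),
    Int(Vec<i64>),
    Text(Vec<String>),
    Bool(Vec<bool>),
}
```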