
Implement a generic read_csv method #66

Closed
VolodymyrOrlov opened this issue Jan 14, 2021 · 5 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@VolodymyrOrlov
Collaborator

In many cases, data analysis starts with loading a dataset into memory. Some datasets come as CSV files. We need a new default function read_csv that is defined on the BaseMatrix trait.

This story is not fully defined, and a lot of details should be discussed before working on the implementation. For example, I am not sure what parameters (if any) this function should take. Some ideas can be borrowed from the similar function in Pandas.

@VolodymyrOrlov added the enhancement, help wanted, and good first issue labels on Jan 14, 2021

@abhikjain360

I'd like to give it a try!

@VolodymyrOrlov
Collaborator Author

VolodymyrOrlov commented Feb 27, 2021

Hi @abhikjain360, sounds good!

The basic idea is to define a new function in the BaseMatrix trait that loads data from a CSV file. Once read_csv is defined in the BaseMatrix trait, it will automatically be available for every type of matrix we support. If the default definition is too generic, a concrete matrix implementation can always override it later.

One way to implement the function is to read the file first, use one of the matrix initialization functions to create an instance of BaseMatrix, and then push the values into the matrix using the set method. I am also open to any additional abstract methods you might find useful. For example, you might want to define a new method on BaseMatrix that initializes a matrix directly from an iterator.
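
For illustration, here is a minimal sketch of that approach. It does not use the library's actual traits: `SimpleMatrix` and `ToyDenseMatrix` are hypothetical stand-ins with only the `zeros`, `set`, and `get` methods assumed, and the parser handles plain comma-separated floats only (no quoting, no header). The point is that `read_csv` is a default method, so any implementor gets it without extra code.

```rust
use std::io::{self, BufRead};

/// Hypothetical, simplified stand-in for a matrix trait (not the real BaseMatrix).
pub trait SimpleMatrix: Sized {
    fn zeros(nrows: usize, ncols: usize) -> Self;
    fn set(&mut self, row: usize, col: usize, value: f64);
    fn get(&self, row: usize, col: usize) -> f64;

    /// Default method: read all rows, allocate with `zeros`, fill with `set`.
    fn read_csv<R: BufRead>(reader: R) -> io::Result<Self> {
        let mut rows: Vec<Vec<f64>> = Vec::new();
        for line in reader.lines() {
            let line = line?;
            if line.trim().is_empty() {
                continue; // skip blank lines
            }
            let row = line
                .split(',')
                .map(|field| field.trim().parse::<f64>())
                .collect::<Result<Vec<f64>, _>>()
                .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
            rows.push(row);
        }
        // All rows must have the same number of columns.
        let ncols = rows.first().map_or(0, |r| r.len());
        if rows.iter().any(|r| r.len() != ncols) {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "rows have different lengths",
            ));
        }
        let mut matrix = Self::zeros(rows.len(), ncols);
        for (i, row) in rows.iter().enumerate() {
            for (j, &value) in row.iter().enumerate() {
                matrix.set(i, j, value);
            }
        }
        Ok(matrix)
    }
}

/// Toy row-major dense matrix used only to exercise the default method.
pub struct ToyDenseMatrix {
    ncols: usize,
    values: Vec<f64>,
}

impl SimpleMatrix for ToyDenseMatrix {
    fn zeros(nrows: usize, ncols: usize) -> Self {
        ToyDenseMatrix { ncols, values: vec![0.0; nrows * ncols] }
    }
    fn set(&mut self, row: usize, col: usize, value: f64) {
        self.values[row * self.ncols + col] = value;
    }
    fn get(&self, row: usize, col: usize) -> f64 {
        self.values[row * self.ncols + col]
    }
}
```

Taking a `BufRead` rather than a file path keeps the sketch testable with in-memory data; a path-based wrapper could open the file and wrap it in a `std::io::BufReader`.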

A few things to keep in mind: I plan to redesign BaseMatrix and BaseVector in #85. One of the problems you will face is the lack of support for integer and string data types. For now, feel free to limit read_csv to floats only.

Let me know if you are stuck or have any questions!

@abhikjain360

okay!

Looking at pandas' read_csv, I think it would be better to provide something like a builder struct that implements Default, with methods to change the reading options. Should I add a ReadCsv struct in the same file as BaseMatrix<T>, or should I create a separate file?
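
To make the builder idea concrete, here is a rough sketch. Only the struct name `ReadCsv` and the use of `Default` come from the proposal above; the two options (`delimiter`, `has_header`) and the `parse_str` method are assumptions chosen for illustration, and the output is a plain `Vec<Vec<f64>>` rather than a library matrix.

```rust
use std::num::ParseFloatError;

/// Hypothetical builder holding CSV reading options.
#[derive(Debug, Clone)]
pub struct ReadCsv {
    delimiter: char,
    has_header: bool,
}

impl Default for ReadCsv {
    /// Assumed defaults: comma-separated, no header row.
    fn default() -> Self {
        ReadCsv { delimiter: ',', has_header: false }
    }
}

impl ReadCsv {
    /// Set the field delimiter.
    pub fn delimiter(mut self, delimiter: char) -> Self {
        self.delimiter = delimiter;
        self
    }

    /// Declare whether the first line is a header that should be skipped.
    pub fn has_header(mut self, has_header: bool) -> Self {
        self.has_header = has_header;
        self
    }

    /// Parse CSV text into rows of floats using the configured options.
    pub fn parse_str(&self, text: &str) -> Result<Vec<Vec<f64>>, ParseFloatError> {
        text.lines()
            .skip(if self.has_header { 1 } else { 0 })
            .filter(|line| !line.trim().is_empty())
            .map(|line| {
                line.split(self.delimiter)
                    .map(|field| field.trim().parse::<f64>())
                    .collect::<Result<Vec<f64>, _>>()
            })
            .collect()
    }
}
```

With this shape, a call such as `ReadCsv::default().delimiter(';').has_header(true).parse_str(text)` reads naturally, and new options can be added later without breaking existing callers.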

Also, in case of errors, should I just reuse Failed? It uses FailedError, which does not cover errors that occur when reading a file, so should I add more variants to it or create a new error type specific to parsing files? In the latter case, I think we could just use std::io::Error from the standard library.
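
One possible shape for the "new type" option, sketched under assumptions: `ReadCsvError` and `parse_rows` are hypothetical names, not part of the library. The enum wraps `std::io::Error` for file problems and keeps the line number for parse problems, and a `From` impl lets `?` convert I/O errors automatically.

```rust
use std::fmt;
use std::io::{self, BufRead};
use std::num::ParseFloatError;

/// Hypothetical error type specific to CSV reading.
#[derive(Debug)]
pub enum ReadCsvError {
    /// The underlying reader failed.
    Io(io::Error),
    /// A field on the given 1-based line was not a valid float.
    Parse { line: usize, source: ParseFloatError },
}

impl fmt::Display for ReadCsvError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ReadCsvError::Io(e) => write!(f, "I/O error: {}", e),
            ReadCsvError::Parse { line, source } => {
                write!(f, "parse error on line {}: {}", line, source)
            }
        }
    }
}

impl std::error::Error for ReadCsvError {}

impl From<io::Error> for ReadCsvError {
    fn from(e: io::Error) -> Self {
        ReadCsvError::Io(e)
    }
}

/// Parse comma-separated floats, reporting which line failed.
pub fn parse_rows<R: BufRead>(reader: R) -> Result<Vec<Vec<f64>>, ReadCsvError> {
    let mut rows = Vec::new();
    for (idx, line) in reader.lines().enumerate() {
        let line = line?; // io::Error is converted via the From impl
        if line.trim().is_empty() {
            continue; // skip blank lines
        }
        let row = line
            .split(',')
            .map(|field| field.trim().parse::<f64>())
            .collect::<Result<Vec<f64>, _>>()
            .map_err(|source| ReadCsvError::Parse { line: idx + 1, source })?;
        rows.push(row);
    }
    Ok(rows)
}
```

Compared with returning `std::io::Error` directly, a dedicated enum lets callers distinguish a missing file from malformed data; whether that is worth a new type over extending the existing error is the open question here.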

@kastolars

Has this been implemented/resolved?

@titoeb
Collaborator

titoeb commented Sep 2, 2022

I could not find anything in the library so far, and I am currently looking into it :)
