
API for returning datasets as DataFrames #10733

Closed
jnothman opened this issue Mar 1, 2018 · 79 comments
Labels: Easy, Enhancement, good first issue, Sprint

Comments

@jnothman
Member

jnothman commented Mar 1, 2018

Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age, for those datasets otherwise providing a Bunch with feature_names. This would be controlled with an as_frame parameter (though return_X_y='frame' would mean the common usage is more succinct).
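For illustration, the two proposed spellings might read like this (a sketch of the proposal only; neither parameter existed at this point):

from sklearn.datasets import load_iris

# Hypothetical calls sketching the proposal above:
bunch = load_iris(as_frame=True)      # Bunch whose 'data' is a pandas DataFrame
X, y = load_iris(return_X_y='frame')  # proposed shorthand: DataFrame X, Series y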

@jnothman jnothman added the Easy, Enhancement, and help wanted labels Mar 1, 2018
@layog

layog commented Mar 1, 2018

I would like to work on this. It seems like a good first issue. If I understand correctly, the task is to add an argument to the load functions for returning a DataFrame. So in the case of the iris dataset, this could be datasets.load_iris(as_frame=True). Am I right?

@jnothman
Member Author

jnothman commented Mar 1, 2018 via email

@layog

layog commented Mar 1, 2018

Okay, but I might as well create a PR and see how the conversation turns out.

@jnothman
Member Author

jnothman commented Mar 1, 2018 via email

@nsorros

nsorros commented Apr 11, 2018

@layog are you still working on this? If not, I would like to give it a try.

@layog

layog commented Apr 11, 2018

@nsorros No, I have not started working on this one. You can take it.

@nsorros

nsorros commented Apr 13, 2018

@jnothman It would be helpful to get some thoughts on the PR, which simply implements what you suggested for one dataset.

@jorisvandenbossche
Member

@jnothman is there agreement on the desired interface? as_frame=True/False? (Or return_frame, to be consistent with return_X_y?)

On the PR it was suggested to have return_X_y='pandas' instead.
For openml, I was thinking that a return='array' could also make sense, to ensure a numerical array is returned (dropping string features, #11819). A return='array'|'pandas'|... might be more flexible future-wise for adding new things without the need for new keywords, but the question is of course whether this would be needed.
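Spelled out, the variants under discussion would look something like this (hypothetical spellings except where noted):

from sklearn.datasets import load_iris

# Option A: a boolean flag next to the existing return_X_y (the spelling
# that was eventually adopted)
bunch = load_iris(as_frame=True)

# Option B (hypothetical): overload return_X_y itself
# X, y = load_iris(return_X_y='pandas')

# Option C (hypothetical): a general format keyword, extensible without new
# flags; a literal `return=` is impossible since `return` is reserved in
# Python, so it would need another name such as return_format
# data = fetch_openml('titanic', return_format='pandas')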

@jnothman
Member Author

jnothman commented Aug 18, 2018 via email

@jorisvandenbossche
Member

@jnothman Yes, that is how I understood it, and also what I meant above with return_frame (to return a frame instead of an array, either in the bunch or as a direct value, depending on whether return_X_y is True or False).
(I meant return_frame only as a name suggestion compared to as_frame, not different behaviour.)

@daniel-cortez-stevenson

daniel-cortez-stevenson commented Feb 2, 2019

Hey y'all,

Seems like this has gone cold. I'd be keen to take care of the last bit.

Currently, this change requires modifying each load_*() function. What are y'all's thoughts on creating an abstract class DataSetLoader that is subclassed for each type of dataset, like:

  • SimpleDataSetLoader (only a single data file is loaded from)
  • TargetDataSetLoader (sources from a target data file)
  • ImageDataSetLoader (image data)
  • and so on

Then those classes could hide dataset specific implementation in classes like:

  • Iris
  • Digits
  • BreastCancer
  • and so on

The abstract class DataSetLoader would have a method load(return_X_y=False, as_frame=False) -> Bunch, which would call overridable getter methods (I'm not married to the 'getter' method pattern, but it might do) to modify the behavior of DataSetLoader.load, like:

  • get_data_file_basename() -> str
  • get_feature_names() -> List[str]
  • etc

Just spitballing here, but while the SimpleDataSetLoader, TargetDataSetLoader, ImageDataSetLoader, etc. classes could include some implementation, they could also include overridable methods like:

  • preprocess_data()
  • preprocess_target()
  • preprocess_images()
  • get_target_file_basename() -> str
  • get_image_file_basename() -> str

to leave open for classes like Iris or Digits to take care of any dataset-specific implementation.

My thought is that by adopting an OO approach it will be easier to add datasets in the future, or add features to dataset loading that make it compatible with other Python packages (like this issue originally was trying to do with Pandas).

I'd really appreciate y'all's thoughts and feedback, as this is my first attempt to contribute to scikit-learn (aka newbie here!).

Thanks for reading :)

P.S. I'd be keen to collaborate with anyone interested (especially if they're experienced in OO design) - it would be great to learn from y'all.

@daniel-cortez-stevenson

daniel-cortez-stevenson commented Feb 3, 2019

And here's a preview of the suggested API, which could replace the load_wine and load_iris functions:

NOTE: this preview is outdated. A more recent version of the API preview can be found in the comments below

from os.path import dirname, join
from typing import Any, List, Tuple, Union

# Assumed helpers from sklearn/datasets/base.py, where this code would live:
# load_data reads the bundled CSV and returns (data, target, target_names).
from sklearn.datasets.base import load_data
from sklearn.utils import Bunch


class DataSetLoader:

    descr_string = 'descr'
    csv_ext = '.csv'
    rst_ext = '.rst'

    def _load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        raise NotImplementedError


class SimpleDataSetLoader(DataSetLoader):

    def _load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        module_path = dirname(__file__)
        base_filename = self.get_data_file_basename()
        data, target, target_names = load_data(module_path, base_filename + self.csv_ext)
        dataset_csv_filename = join(module_path, 'data', base_filename + self.csv_ext)

        with open(join(module_path, self.descr_string, base_filename + self.rst_ext)) as rst_file:
            fdescr = rst_file.read()

        if return_X_y:
            return data, target

        return Bunch(data=data, target=target,
                     target_names=target_names,
                     DESCR=fdescr,
                     feature_names=self.get_feature_names(),
                     filename=dataset_csv_filename)
    
    def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        return self._load(return_X_y=return_X_y)

    def get_data_file_basename(self) -> str:
        raise NotImplementedError

    def get_feature_names(self) -> List[str]:
        raise NotImplementedError


class Wine(SimpleDataSetLoader):

    def get_data_file_basename(self) -> str:
        return 'wine_data'

    def get_feature_names(self) -> List[str]:
        return ['alcohol',
                'malic_acid',
                'ash',
                'alcalinity_of_ash',
                'magnesium',
                'total_phenols',
                'flavanoids',
                'nonflavanoid_phenols',
                'proanthocyanins',
                'color_intensity',
                'hue',
                'od280/od315_of_diluted_wines',
                'proline']
    
    def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the wine dataset (classification).
        *** omitted some docstring for clarity ***
        Examples
        --------
        Let's say you are interested in the samples 10, 80, and 140, and want to
        know their class name.

        >>> from sklearn.datasets import Wine
        >>> data = Wine().load()
        >>> data.target[[10, 80, 140]]
        array([0, 1, 2])
        >>> list(data.target_names)
        ['class_0', 'class_1', 'class_2']
        """
        return self._load(return_X_y=return_X_y)


class Iris(SimpleDataSetLoader):

    def get_data_file_basename(self) -> str:
        return 'iris'

    def get_feature_names(self) -> List[str]:
        return ['sepal length (cm)', 'sepal width (cm)',
                'petal length (cm)', 'petal width (cm)']
    
    def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the iris dataset (classification).
        *** omitted some docstring for clarity ***
        Examples
        --------
        Let's say you are interested in the samples 10, 25, and 50, and want to
        know their class name.

        >>> from sklearn.datasets import Iris
        >>> data = Iris().load()
        >>> data.target[[10, 25, 50]]
        array([0, 0, 1])
        >>> list(data.target_names)
        ['setosa', 'versicolor', 'virginica']
        """
        return self._load(return_X_y=return_X_y)


# Backward compatibility
load_wine = Wine().load
load_iris = Iris().load

@vnmabus
Contributor

vnmabus commented Feb 3, 2019

It would be great if target_column were an argument of load for every dataset (as in fetch_openml). That would allow using the same dataset for different tasks, or having all the data (including the target) in the same array. Also, are you proposing something similar for the fetchers?
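For comparison, fetch_openml already exposes this; a bundled loader might mirror it. A minimal sketch (the fetch_openml call is real usage, the load_iris one is hypothetical):

from sklearn.datasets import fetch_openml

# existing: pick the target column when fetching from OpenML
X, y = fetch_openml('iris', version=1, return_X_y=True, target_column='class')

# hypothetical equivalent for a bundled loader:
# X, y = load_iris(return_X_y=True, target_column='petal width (cm)')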

@daniel-cortez-stevenson

@vnmabus I agree that being able to set the target_column would be a helpful feature - I've run into the same issue of wanting all of the data in the same array. But I'm hesitant to add it, as I'm wary of feature creep. Is there a reason why a target_column param would ease the API restructuring?

I really like the idea of bringing fetch_openml and the openml API into this restructuring. In the code below, the abstract _fetch_dataset method could be overridden in RemoteDataSetLoader to work with openml remote datasets. I've also added an entry point and pseudocode for the original as_frame param in AbstractDataSetLoader:

from abc import ABC, abstractmethod
from os.path import dirname, join
from typing import Any, List, Tuple, Union

import pandas as pd

# As above, assumed helpers from sklearn/datasets/base.py:
from sklearn.datasets.base import load_data
from sklearn.utils import Bunch


class AbstractDataSetLoader(ABC):

    descr_string = 'descr'
    csv_ext = '.csv'
    rst_ext = '.rst'

    def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Override this method to insert the docstring of the dataset"""

        fetched_dataset = self._fetch_dataset()

        if as_frame:
            # Convert 'data' and 'target' of the fetched Bunch to pandas
            # objects, labelling columns with the dataset's feature names.
            # (A Series is used here for the 1-D target; the original
            # pseudocode suggested a DataFrame.)
            fetched_dataset['data'] = pd.DataFrame(
                fetched_dataset['data'],
                columns=fetched_dataset.get('feature_names'))
            fetched_dataset['target'] = pd.Series(
                fetched_dataset['target'], name='target')

        if return_X_y:
            return fetched_dataset['data'], fetched_dataset['target']

        return fetched_dataset

    @abstractmethod
    def _fetch_dataset(self) -> Bunch:
        pass


class RemoteDataSetLoader(AbstractDataSetLoader):
    pass


class LocalDataSetLoader(AbstractDataSetLoader):
    def _fetch_dataset(self) -> Bunch:
        module_path = dirname(__file__)
        base_filename = self._get_data_file_basename()
        data, target, target_names = load_data(module_path, base_filename + self.csv_ext)
        dataset_csv_filename = join(module_path, 'data', base_filename + self.csv_ext)

        with open(join(module_path, self.descr_string, base_filename + self.rst_ext)) as rst_file:
            fdescr = rst_file.read()

        return Bunch(data=data, target=target,
                     target_names=target_names,
                     DESCR=fdescr,
                     feature_names=self._get_feature_names(),
                     filename=dataset_csv_filename)

    @abstractmethod
    def _get_data_file_basename(self) -> str:
        pass

    @abstractmethod
    def _get_feature_names(self) -> List[str]:
        pass


class Wine(LocalDataSetLoader):
    def _get_data_file_basename(self):
        return 'wine_data'

    def _get_feature_names(self):
        return ['alcohol',
                'malic_acid',
                'ash',
                'alcalinity_of_ash',
                'magnesium',
                'total_phenols',
                'flavanoids',
                'nonflavanoid_phenols',
                'proanthocyanins',
                'color_intensity',
                'hue',
                'od280/od315_of_diluted_wines',
                'proline']

    def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the wine dataset (classification).
        *** omitted some docstring for clarity ***
        Examples
        --------
        Let's say you are interested in the samples 10, 80, and 140, and want to
        know their class name.

        >>> from sklearn.datasets import Wine
        >>> data = Wine().load()
        >>> data.target[[10, 80, 140]]
        array([0, 1, 2])
        >>> list(data.target_names)
        ['class_0', 'class_1', 'class_2']
        """
        return super().load(return_X_y=return_X_y, as_frame=as_frame)


class Iris(LocalDataSetLoader):
    def _get_data_file_basename(self):
        return 'iris'

    def _get_feature_names(self):
        return ['sepal length (cm)', 'sepal width (cm)',
                'petal length (cm)', 'petal width (cm)']

    def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the iris dataset (classification).
        *** omitted some docstring for clarity ***
        >>> from sklearn.datasets import Iris
        >>> data = Iris().load()
        >>> data.target[[10, 25, 50]]
        array([0, 0, 1])
        >>> list(data.target_names)
        ['setosa', 'versicolor', 'virginica']
        """
        return super().load(return_X_y=return_X_y, as_frame=as_frame)


# Backward compatibility
load_wine = Wine().load
load_iris = Iris().load

@jnothman
Member Author

jnothman commented Feb 3, 2019 via email

@daniel-cortez-stevenson

@jnothman Yes, I agree that the OO API isn't strictly necessary for implementing the return of DataFrames. However, without an OO API, each function would need to be modified individually.

IMO, refactoring the API would make a change like this trivial, and I see it as a good opportunity to refactor the API rather than add weight and complexity to the current functions.

Refactoring the datasets API is getting away from the spirit of this issue, so I'll open up a new one with a reference here.

@amueller
Member

also see #13902

@amueller
Member

Is this solved by #13902 being merged? Do we want this for all dataset loaders?

@jnothman
Member Author

jnothman commented Jul 31, 2019 via email

@amueller
Member

amueller commented Aug 1, 2019

OK cool. These are actually nicely sprintable. Should we tag them "good first issue"?

@thomasjpfan
Member

This may be slightly harder, since we usually block the fetch_* functions from running on the CI in pull requests.

@bsipocz
Contributor

bsipocz commented Jun 6, 2020

@lucyleeow and @cmarmo - are your comments suggesting this issue for the sprint, or saying that you're planning to work on it?

@cmarmo
Contributor

cmarmo commented Jun 6, 2020

@lucyleeow and @cmarmo - are your comments suggesting this issue for the sprint, or saying that you're planning to work on it?

Suggesting for the sprint. :)

@bsipocz
Contributor

bsipocz commented Jun 6, 2020

OK, then @tianchuliang and I will take this for the sprint.

@thomasjpfan
Member

When testing, be sure to set SKLEARN_SKIP_NETWORK_TESTS=0 and use the fixtures in sklearn/datasets/tests/conftest.py.

@erodrago
Contributor

erodrago commented Jun 6, 2020

take

@thomasjpfan
Member

@erodrago I believe @bsipocz and @tianchuliang are working on this.

@cmarmo
Contributor

cmarmo commented Jun 6, 2020

@bsipocz, @erodrago there are a number of functions left; do you mind specifying which ones you are working on? Thanks!

@tianchuliang
Contributor

@cmarmo @bsipocz and I will basically split the four half and half; I will work on the first two, and @bsipocz will work on the second two.

@cmarmo cmarmo removed the help wanted label Jun 6, 2020
@tianchuliang
Contributor

@cmarmo can we each make a PR, since our tasks are pretty well divided? Or do you suggest one combined PR for this issue?

@erodrago
Contributor

erodrago commented Jun 6, 2020

@thomasjpfan sorry, I picked it up since it was still in the to-do pane. Okay, I will leave you guys to work on it.

@cmarmo
Contributor

cmarmo commented Jun 6, 2020

@tianchuliang @bsipocz I think that one PR per dataset fetcher will be easier to review. Thanks!

@bsipocz
Contributor

bsipocz commented Jun 6, 2020

@erodrago - I've started to work on the 20newsgroups one; if you want to pick up fetch_rcv1, just let me know so I don't duplicate the effort.

@tianchuliang
Contributor

I've started working on the first one, fetch_covtype.

@jaketae
Contributor

jaketae commented Jun 6, 2020

Hi, can we start working on fetch_rcv1? Just wanted to make sure there is no duplicate work being done before we begin.

@bsipocz
Contributor

bsipocz commented Jun 6, 2020

@jaketae go for it!

@thomasjpfan
Member

Be sure to look at #10733 (comment) when working on this locally.

@tianchuliang
Contributor

FYI: Started working on adding as_frame for fetch_kddcup99

@jaketae
Contributor

jaketae commented Jun 6, 2020

@thomasjpfan Should the returned DataFrame have labeled columns? As noted in the comments above, I'm not entirely sure where to get the column labels (words) for the dataset.

cc: @amueller

@thomasjpfan
Member

Let's not work on fetch_rcv1 since the words are difficult to get.

@cmarmo
Contributor

cmarmo commented Nov 1, 2020

Closing: all data fetchers suitable for this feature have been modified.
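For the record, the shipped pattern looks like this (as_frame landed in scikit-learn 0.23 for most of the loaders; exact availability varies per fetcher):

from sklearn.datasets import load_iris

bunch = load_iris(as_frame=True)
print(type(bunch.data))    # pandas DataFrame with named feature columns
print(type(bunch.target))  # pandas Series
print(bunch.frame.shape)   # combined frame: features plus the target column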

@cmarmo cmarmo closed this as completed Nov 1, 2020
@deliciouslytyped

deliciouslytyped commented Nov 29, 2023

This can be really nice, thanks!

I have two issues / things I would like:

  • There does not seem to be a way to get the target_names attribute without loading a Bunch, which reduces the convenience of getting the data as a DataFrame: if I want these, I have to load both the DataFrame and the Bunch, or construct the DataFrame manually.
  • pandas has a Categorical dtype for categorical data; could this be used for categorical data in the datasets? This can supposedly improve the behavior of various libraries and plotting tools that use the data, and prevents numerical operations where they don't make sense.

Edit:
Of course, 15 minutes later, I found that load_iris(as_frame=True) has a .frame attribute, while load_iris(as_frame=True, return_X_y=True) does not seem to have one.

With only as_frame=True, I can do something like this:

import sklearn.datasets as skd

ir = skd.load_iris(as_frame=True)
ir.frame["target"] = ir.frame["target"].astype('category')
# rename_categories returns a new Series, so assign it back
ir.frame["target"] = ir.frame["target"].cat.rename_categories(ir.target_names)
