
API for returning datasets as DataFrames #10733

Open
jnothman opened this issue Mar 1, 2018 · 54 comments · May be fixed by #10972

Comments


@jnothman jnothman commented Mar 1, 2018

Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age, for those datasets otherwise providing a Bunch with feature_names. This would be controlled with an as_frame parameter (though return_X_y='frame' would mean the common usage is more succinct).
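For concreteness, a minimal sketch of the behaviour being proposed (the `as_frame` name comes from the proposal above; the `load_toy` helper and its data are purely illustrative, not an existing API):

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: wrap an existing (data, target, feature_names)
# triple in a DataFrame with named columns when as_frame=True.
def load_toy(as_frame=False):
    data = np.array([[5.1, 3.5], [4.9, 3.0]])
    target = np.array([0, 0])
    feature_names = ["sepal length (cm)", "sepal width (cm)"]
    if as_frame:
        frame = pd.DataFrame(data, columns=feature_names)
        frame["target"] = target
        return frame
    return data, target

df = load_toy(as_frame=True)
print(df.shape)  # (2, 3)
```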


@layog layog commented Mar 1, 2018

I would like to work on this. Seems like a good first issue. If I understand it correctly, the task is to add an argument to the load functions for returning a DataFrame. So in the case of the iris dataset, the call would be datasets.load_iris(as_frame=True). Am I right?


@jnothman jnothman commented Mar 1, 2018


@layog layog commented Mar 1, 2018

Okay, but I might as well create a PR and see how the conversation turns out.


@jnothman jnothman commented Mar 1, 2018


@nsorros nsorros commented Apr 11, 2018

@layog are you still working on this? If not, I would like to give it a try.


@layog layog commented Apr 11, 2018

@nsorros No, I have not started working on this one. You can take it.


@nsorros nsorros commented Apr 13, 2018

@jnothman It would be helpful to get some thoughts in the PR which simply implements what you suggested for one dataset.


@jorisvandenbossche jorisvandenbossche commented Aug 17, 2018

@jnothman is there agreement on the desired interface? as_frame=True/False? (Or return_frame, to be consistent with return_X_y?)

On the PR it was suggested to have return_X_y='pandas' instead.
For openml, I was thinking that a return='array' could also make sense, to ensure a numerical array is returned (drop string features, #11819). A return='array'|'pandas'|... might be more flexible future-wise for adding new things without needing new keywords, but the question is of course whether this would be needed.


@jnothman jnothman commented Aug 18, 2018


@jorisvandenbossche jorisvandenbossche commented Aug 20, 2018

@jnothman Yes, that is how I understood it, and also what I meant above with return_frame (to return a frame instead of an array, either in the bunch or as a direct value depending on whether return_X_y is True or False).
(I meant return_frame only as a name suggestion compared to as_frame, not as different behaviour.)


@daniel-cortez-stevenson daniel-cortez-stevenson commented Feb 2, 2019

Hey y'all,

Seems like this has gone cold. I'd be keen to take care of the last bit.

Currently, this change requires modifying each load_*() function. What are y'all's thoughts on creating an abstract class DataSetLoader that is subclassed for each type of dataset, like:

  • SimpleDataSetLoader (loads from only a single data file)
  • TargetDataSetLoader (loads from a separate target data file)
  • ImageDataSetLoader (image data)
  • and so on

Then those classes could hide dataset specific implementation in classes like:

  • Iris
  • Digits
  • BreastCancer
  • and so on

The abstract class DataSetLoader would have a method load(return_X_y=False, as_frame=False) -> Bunch, which would call overrideable getter methods (I'm not married to the 'getter' method pattern, but it might do) to modify the behavior of DataSetLoader.load, like:

  • get_data_file_basename() -> str
  • get_feature_names() -> List[str]
  • etc

Just spitballing here, but while the SimpleDataSetLoader, TargetDataSetLoader, ImageDataSetLoader, etc. classes could include some implementation, they could also include overrideable methods like:

  • preprocess_data()
  • preprocess_target()
  • preprocess_images()
  • get_target_file_basename() -> str
  • get_image_file_basename() -> str

to leave open for classes like Iris or Digits to take care of any dataset-specific implementation.

My thought is that adopting an OO approach will make it easier to add datasets in the future, or to add features to dataset loading that make it compatible with other Python packages (as this issue originally set out to do with pandas).

I'd really appreciate y'all's thoughts and feedback, as this is my first attempt to contribute to scikit-learn (aka newbie here!).

Thanks for reading :)

p.s I'd be keen to collaborate with anyone interested (especially if they're experienced in OO design) - it would be great to learn from y'all


@daniel-cortez-stevenson daniel-cortez-stevenson commented Feb 3, 2019

And here's a preview of the suggested API - which could replace the load_wine and load_iris functions:

NOTE: this preview is outdated. A more recent version of the API preview can be found in the comments below

from os.path import dirname, join
from typing import Any, List, Tuple, Union

# (assumes Bunch and load_data are imported from sklearn.datasets.base,
# as in the existing loaders)


class DataSetLoader:

    descr_string = 'descr'
    csv_ext = '.csv'
    rst_ext = '.rst'

    def _load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        raise NotImplementedError


class SimpleDataSetLoader(DataSetLoader):

    def _load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        module_path = dirname(__file__)
        base_filename = self.get_data_file_basename()
        data, target, target_names = load_data(module_path, base_filename + self.csv_ext)
        dataset_csv_filename = join(module_path, 'data', base_filename + self.csv_ext)

        with open(join(module_path, self.descr_string, base_filename + self.rst_ext)) as rst_file:
            fdescr = rst_file.read()

        if return_X_y:
            return data, target

        return Bunch(data=data, target=target,
                     target_names=target_names,
                     DESCR=fdescr,
                     feature_names=self.get_feature_names(),
                     filename=dataset_csv_filename)

    def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        return self._load(return_X_y=return_X_y)

    def get_data_file_basename(self) -> str:
        raise NotImplementedError

    def get_feature_names(self) -> List[str]:
        raise NotImplementedError


class Wine(SimpleDataSetLoader):

    def get_data_file_basename(self) -> str:
        return 'wine_data'

    def get_feature_names(self) -> List[str]:
        return ['alcohol',
                'malic_acid',
                'ash',
                'alcalinity_of_ash',
                'magnesium',
                'total_phenols',
                'flavanoids',
                'nonflavanoid_phenols',
                'proanthocyanins',
                'color_intensity',
                'hue',
                'od280/od315_of_diluted_wines',
                'proline']

    def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the wine dataset (classification).
        *** omitted some docstring for clarity ***
        Examples
        --------
        Let's say you are interested in the samples 10, 80, and 140, and want to
        know their class name.

        >>> from sklearn.datasets import Wine
        >>> data = Wine().load()
        >>> data.target[[10, 80, 140]]
        array([0, 1, 2])
        >>> list(data.target_names)
        ['class_0', 'class_1', 'class_2']
        """
        return self._load(return_X_y=return_X_y)


class Iris(SimpleDataSetLoader):

    def get_data_file_basename(self) -> str:
        return 'iris'

    def get_feature_names(self) -> List[str]:
        return ['sepal length (cm)', 'sepal width (cm)',
                'petal length (cm)', 'petal width (cm)']

    def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the iris dataset (classification).
        *** omitted some docstring for clarity ***
        Examples
        --------
        Let's say you are interested in the samples 10, 25, and 50, and want to
        know their class name.

        >>> from sklearn.datasets import Iris
        >>> data = Iris().load()
        >>> data.target[[10, 25, 50]]
        array([0, 0, 1])
        >>> list(data.target_names)
        ['setosa', 'versicolor', 'virginica']
        """
        return self._load(return_X_y=return_X_y)


# Backward compatibility
load_wine = Wine().load
load_iris = Iris().load

@vnmabus vnmabus commented Feb 3, 2019

It would be great if target_column were an argument of load for every dataset (as in fetch_openml). That would allow using the same dataset for different tasks, or having all the data (including the target) in the same array. Also, are you proposing something similar for the fetchers?
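A rough sketch of how such a parameter could behave (split_frame is a hypothetical helper, loosely modelled on fetch_openml's target_column; names and defaults are illustrative):

```python
import pandas as pd

# Hypothetical helper: choose which column becomes y, or keep
# everything (including the target) together by passing None.
def split_frame(frame, target_column="target"):
    if target_column is None:
        return frame, None
    X = frame.drop(columns=[target_column])
    y = frame[target_column]
    return X, y

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "target": [0, 1]})
X, y = split_frame(frame, target_column="b")
print(list(X.columns))  # ['a', 'target']
```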


@daniel-cortez-stevenson daniel-cortez-stevenson commented Feb 3, 2019

@vnmabus I agree that being able to set the target_column would be a helpful feature - I've run into the same issue of wanting all of the data in the same array. But I'm hesitant to add it, as I'm wary of feature creep. Is there a reason a target_column param would ease the API restructuring?

I really like the idea of bringing fetch_openml and the openml API into this restructuring. In the code below, the abstract _fetch_dataset method could be overridden in RemoteDataSetLoader to work with openml remote datasets. I've also added an entry point and pseudocode for the original as_frame param in AbstractDataSetLoader:

from abc import ABC, abstractmethod
from os.path import dirname, join
from typing import Any, List, Tuple, Union

# (again assumes Bunch and load_data as in sklearn.datasets.base)


class AbstractDataSetLoader(ABC):

    descr_string = 'descr'
    csv_ext = '.csv'
    rst_ext = '.rst'

    def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Override this method to insert the docstring of the dataset"""

        fetched_dataset = self._fetch_dataset()

        if as_frame:
            # Pseudo Code: Convert 'data' and 'target' of fetched_dataset bunch to pandas DataFrames
            # fetched_dataset['data'] = pd.DataFrame(fetched_dataset['data'])
            # fetched_dataset['target'] = pd.DataFrame(fetched_dataset['target'])
            pass

        if return_X_y:
            return fetched_dataset['data'], fetched_dataset['target']

        return fetched_dataset

    @abstractmethod
    def _fetch_dataset(self) -> Bunch:
        pass


class RemoteDataSetLoader(AbstractDataSetLoader):
    # Placeholder: would override _fetch_dataset for remote (e.g. openml) data.
    pass


class LocalDataSetLoader(AbstractDataSetLoader):
    def _fetch_dataset(self) -> Bunch:
        module_path = dirname(__file__)
        base_filename = self._get_data_file_basename()
        data, target, target_names = load_data(module_path, base_filename + self.csv_ext)
        dataset_csv_filename = join(module_path, 'data', base_filename + self.csv_ext)

        with open(join(module_path, self.descr_string, base_filename + self.rst_ext)) as rst_file:
            fdescr = rst_file.read()

        return Bunch(data=data, target=target,
                     target_names=target_names,
                     DESCR=fdescr,
                     feature_names=self._get_feature_names(),
                     filename=dataset_csv_filename)

    @abstractmethod
    def _get_data_file_basename(self) -> str:
        pass

    @abstractmethod
    def _get_feature_names(self) -> List[str]:
        pass


class Wine(LocalDataSetLoader):
    def _get_data_file_basename(self):
        return 'wine_data'

    def _get_feature_names(self):
        return ['alcohol',
                'malic_acid',
                'ash',
                'alcalinity_of_ash',
                'magnesium',
                'total_phenols',
                'flavanoids',
                'nonflavanoid_phenols',
                'proanthocyanins',
                'color_intensity',
                'hue',
                'od280/od315_of_diluted_wines',
                'proline']

    def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the wine dataset (classification).
        *** omitted some docstring for clarity ***
        Examples
        --------
        Let's say you are interested in the samples 10, 80, and 140, and want to
        know their class name.

        >>> from sklearn.datasets import Wine
        >>> data = Wine().load()
        >>> data.target[[10, 80, 140]]
        array([0, 1, 2])
        >>> list(data.target_names)
        ['class_0', 'class_1', 'class_2']
        """
        return super().load(return_X_y=return_X_y, as_frame=as_frame)


class Iris(LocalDataSetLoader):
    def _get_data_file_basename(self):
        return 'iris'

    def _get_feature_names(self):
        return ['sepal length (cm)', 'sepal width (cm)',
                'petal length (cm)', 'petal width (cm)']

    def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the iris dataset (classification).
        *** omitted some docstring for clarity ***
        >>> from sklearn.datasets import Iris
        >>> data = Iris().load()
        >>> data.target[[10, 25, 50]]
        array([0, 0, 1])
        >>> list(data.target_names)
        ['setosa', 'versicolor', 'virginica']
        """
        return super().load(return_X_y=return_X_y, as_frame=as_frame)


# Backward compatibility
load_wine = Wine().load
load_iris = Iris().load

@jnothman jnothman commented Feb 3, 2019


@daniel-cortez-stevenson daniel-cortez-stevenson commented Feb 3, 2019

@jnothman Yes, I agree that the OO API isn't strictly necessary to implement returning dataframes. However, without an OO API, each function would need to be modified individually.

IMO, refactoring the API would make a change like this trivial, and I see this as a good opportunity to refactor the API rather than add weight and complexity to the current functions.

Refactoring the datasets API is getting away from the spirit of this issue, though, so I'll open a new one with a reference back here.


@amueller amueller commented Jul 12, 2019

also see #13902


@amueller amueller commented Jul 30, 2019

Is this solved by #13902 being merged? Do we want this for all dataset loaders?


@jnothman jnothman commented Jul 31, 2019


@amueller amueller commented Aug 1, 2019

OK, cool. These are actually nicely sprintable. Should we tag this "good first issue"?


@aditi9783 aditi9783 commented Aug 24, 2019

@vmanisha and I would like to work on this: adding as_frame=True to the relevant functions.


@amueller amueller commented Aug 24, 2019

@aditi9783 great, maybe pick one dataset loader to start with

@adrinjalali adrinjalali added this to To do in Pandas Oct 21, 2019

@gitsteph gitsteph commented Nov 2, 2019

Hi! I'd like to work on adding as_frame=True for the California housing dataset. It doesn't seem to have anyone working on it yet, so I'll just get started with it now at the sprint. :)


@ltcguthrie ltcguthrie commented Nov 2, 2019

I'd like to work on this by adding as_frame=True to the load_diabetes loader.


@wconnell wconnell commented Nov 2, 2019

I'll work on this by adding as_frame=True to the load_wine loader!


@reganconnell reganconnell commented Nov 2, 2019

I will work on this by adding as_frame=True to the load_breast_cancer loader.


@trista-paul trista-paul commented Nov 2, 2019

I'm working on this today by adding as_frame=True to load_linnerud(). Afterwards, I will look into the same for one of the real life datasets.


@wconnell wconnell commented Nov 2, 2019

Hey, I think we should write a general _bunch_to_dataframe() function that can be called in all of these loaders, instead of writing separate code in each loader. Are we on the same page?


@trista-paul trista-paul commented Nov 2, 2019

Yes, I had thought that was the request originally.


@trista-paul trista-paul commented Nov 2, 2019

I'd like to claim it if no one who posted before me this morning does (or if they don't respond soon).


@wconnell wconnell commented Nov 2, 2019

I think we have made some progress already if you don't mind!


@trista-paul trista-paul commented Nov 2, 2019

Okay


@gitsteph gitsteph commented Nov 2, 2019

Referencing the example from the linked issue's implementation (#13902), which adds as_frame to fetch_openml -- it seems that some of these datasets have different formats that require custom handling. For example the openml dataset has an ARFF file format, while the one (_california_housing.py) I'm working on uses RST.

I agree that it'd be good to follow a standard, generalized naming convention and pattern, though I don't think we could simply call a shared _bunch_to_dataframe without shadowing it in individual loaders to handle (or at least check) these varying formats. Also, the existing merged openml implementation adds frame to the bunch, rather than converting the bunch to a dataframe as the proposed name (_bunch_to_dataframe) implies. I propose an abstract method called _convert_data_dataframe, which is a generalization of openml.py's _convert_arff_data_dataframe method.
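To illustrate the shape of that proposal (all class and method names here are hypothetical sketches, not the merged implementation): a shared default conversion hook, overridden where a format needs custom handling.

```python
import numpy as np
import pandas as pd

# Hypothetical base: a generic data-to-frame hook shared by loaders.
class BaseLoader:
    def _convert_data_dataframe(self, data, target, feature_names):
        frame = pd.DataFrame(data, columns=feature_names)
        frame["target"] = target
        return frame

# A format-specific loader (e.g. for ARFF) could preprocess its raw
# data first, then delegate to the shared conversion.
class ArffLoader(BaseLoader):
    def _convert_data_dataframe(self, data, target, feature_names):
        # format-specific preprocessing would go here
        return super()._convert_data_dataframe(data, target, feature_names)

frame = ArffLoader()._convert_data_dataframe(
    np.ones((2, 2)), np.array([0, 1]), ["a", "b"])
print(list(frame.columns))  # ['a', 'b', 'target']
```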


@amueller amueller commented Nov 2, 2019

@gitsteph I'm not sure I follow. I think for many of the cases (though probably not for openml) we can use the existing bunch to create a dataframe. The bunch has numpy arrays and standardized ways to name features.


@gitsteph gitsteph commented Nov 2, 2019

@amueller it seems cleaner to create the frame before the bunch, and add it to the bunch as I do in PR #15486 -- and more consistent with what happens in the existing merged example for openml.

Though I'm open to revising what I have there already, especially if you're already noticing that most cases can use the same method/are often in the same formats.

Erg, I'm noticing a couple things I want to change already... so going to temporarily close the PR for a few moments... whoops...


@amueller amueller commented Nov 2, 2019

Just an FYI for everybody: the decision was to include the target into the dataframe.


@wconnell wconnell commented Nov 2, 2019

It seems to me that each dataset loader massages the data into an appropriate format for a data Bunch; a simple function to build a dataframe from that structure would be trivial.


@reganconnell reganconnell commented Nov 2, 2019

import numpy as np
import pandas as pd


def bunch_to_dataframe(bunch):
    data = np.c_[bunch.data, bunch.target]
    target_cols = ["target"]

    # A multi-output target gets one named column per output
    if len(bunch.target.shape) > 1:
        target_cols = ["target_" + col for col in bunch.target_names]

    columns = np.append(bunch.feature_names, target_cols)
    return pd.DataFrame(data, columns=columns)

So add the following code into the dataset loaders, not including image datasets:

bunch = Bunch(data=data,
              target=target,
              feature_names=feature_names,
              DESCR=descr)
if as_frame:
    return bunch_to_dataframe(bunch)

This doesn't include metadata for feature_names or DESCR in the final data frame.
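For illustration, the helper could be exercised like this (the Bunch stand-in is a toy, and the function body is repeated so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

# Toy stand-in for sklearn's Bunch: a dict with attribute access.
class Bunch(dict):
    __getattr__ = dict.__getitem__

def bunch_to_dataframe(bunch):
    # Same helper as above, repeated for a self-contained snippet.
    data = np.c_[bunch.data, bunch.target]
    target_cols = ["target"]
    if len(bunch.target.shape) > 1:
        target_cols = ["target_" + col for col in bunch.target_names]
    columns = np.append(bunch.feature_names, target_cols)
    return pd.DataFrame(data, columns=columns)

b = Bunch(data=np.array([[1.0, 2.0], [3.0, 4.0]]),
          target=np.array([0, 1]),
          feature_names=["f1", "f2"])
df = bunch_to_dataframe(b)
print(list(df.columns))  # ['f1', 'f2', 'target']
```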


@amueller amueller commented Nov 2, 2019

The target_names are the names of the categories in classification, not different columns, right?


@wconnell wconnell commented Nov 2, 2019

Yes, except for the linnerud dataset, which is for multi-output multivariate regression. There are 3 target columns, which are named under target_names.


@gitsteph gitsteph commented Nov 2, 2019

🤔 Hmm... I think I might be a bit confused.
Are we only returning the dataframe, or a Bunch containing the frame if as_frame=True (as you see in the merged OpenML example)? I implemented it to be the latter, for consistency with that other merged example.


@gitsteph gitsteph commented Nov 2, 2019

That's what I meant by saying it was confusing to me that we'd convert bunch_to_dataframe. The way I've implemented it in the above PR, when as_frame=True, I provide data and target as pandas objects (instead of np arrays) and also provide a frame that isn't None, all wrapped in a returned bunch. The bunch is still there, but with the data further altered/augmented if as_frame=True.
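A minimal sketch of that shape (a plain dict standing in for Bunch; the make_bunch name and the keys beyond data/target/frame are illustrative):

```python
import numpy as np
import pandas as pd

# Sketch of the pattern described above: with as_frame=True the
# returned mapping keeps its usual keys, but data/target become
# pandas objects and a combined frame is added alongside them.
def make_bunch(data, target, feature_names, as_frame=False):
    if as_frame:
        data = pd.DataFrame(data, columns=feature_names)
        target = pd.Series(target, name="target")
        frame = pd.concat([data, target], axis=1)
    else:
        frame = None
    return {"data": data, "target": target,
            "feature_names": feature_names, "frame": frame}

b = make_bunch(np.eye(2), np.array([0, 1]), ["x", "y"], as_frame=True)
print(b["frame"].shape)  # (2, 3)
```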


@amueller amueller commented Nov 2, 2019

@wconnell yeah that sounds about right.

@gitsteph yes we should return a bunch containing a dataframe I think.


@amueller amueller commented Nov 2, 2019

linnerud probably needs its own thing, but most of the rest should be able to share a function.


@amueller amueller commented Nov 2, 2019

#15486 has a good helper


@wconnell wconnell commented Nov 4, 2019

Waiting for #15486 to merge so I can apply this function to the built-in data loaders. Although I am curious: is there a good way to work with the new code from this PR locally so I can begin before it is officially merged? Would I have to clone @gitsteph's branch and set it as a remote? Or is this a bad idea, and patience a better virtue in this case?
