API for returning datasets as DataFrames #10733
I would like to work on this. Seems like a good first issue. If I understand it correctly, the task is to add an argument to the load function for returning a DataFrame. So in the case of the iris dataset, this can be |
Yes, I think it would be good, but there may not be consensus on making this change. So if you make a pull request, there is some risk that other core devs will object to it.

Okay, but I might as well create a PR and see how the conversation turns out.

You are welcome to.
@layog are you still working on this? If not, I would like to give it a try.

@nsorros No, I have not started working on this one. You can take it.

@jnothman It would be helpful to get some thoughts on the PR, which simply implements what you suggested for one dataset.
@jnothman is there agreement on the desired interface? as_frame=True/False? (or return_frame to be consistent with return_X_y?) On the PR it was suggested to have a return_X_y='pandas' instead.
I believe it should be possible to get a dataframe as the bunch's data attribute, rather than this being an alternative value of return_X_y. And that's why as_frame or frame is better than return_frame.
…On 18 August 2018, Joris Van den Bossche wrote:

@jnothman is there agreement on the desired interface? as_frame=True/False? (or return_frame to be consistent with return_X_y?) On the PR it was suggested to have a return_X_y='pandas' instead.

For openml, I was thinking that a return='array' could also make sense to ensure a numerical array is returned (drop string features, #11819). A return='array'|'pandas'|... might be more flexible future-wise to add new things without the need to add new keywords, but the question is of course whether this would be needed.
@jnothman Yes, that is how I understood it, and also what I meant above with |
Hey y'all, seems like this has gone cold. I'd be keen to take care of the last bit. Currently, this change requires modifying each dataset loading function individually.

Instead, those functions could hide dataset-specific implementation in classes. Just spitballing here, but an abstract class could hold the shared loading logic and leave the dataset-specific details open for subclasses.

My thought is that by adopting an OO approach it will be easier to add datasets in the future, or to add features to dataset loading that make it compatible with other Python packages (like this issue was originally trying to do with pandas). I'd really appreciate y'all's thoughts and feedback, as this is my first attempt to contribute to scikit-learn (aka newbie here!). Thanks for reading :)

P.S. I'd be keen to collaborate with anyone interested (especially if they're experienced in OO design); it would be great to learn from y'all.
And here's a preview of the suggested API, which could replace the current load functions.

NOTE: this preview is outdated. A more recent version of the API preview can be found in the comments below.

from typing import Any, List, Tuple, Union
from os.path import dirname, join

from sklearn.utils import Bunch
from sklearn.datasets.base import load_data  # sklearn's internal CSV helper (location varies by version)


class DataSetLoader:
    descr_string = 'descr'
    csv_ext = '.csv'
    rst_ext = '.rst'

    def _load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        raise NotImplementedError


class SimpleDataSetLoader(DataSetLoader):
    def _load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        module_path = dirname(__file__)
        base_filename = self.get_data_file_basename()
        data, target, target_names = load_data(module_path, base_filename + self.csv_ext)
        dataset_csv_filename = join(module_path, 'data', base_filename + self.csv_ext)
        with open(join(module_path, self.descr_string, base_filename + self.rst_ext)) as rst_file:
            fdescr = rst_file.read()
        if return_X_y:
            return data, target
        return Bunch(data=data, target=target,
                     target_names=target_names,
                     DESCR=fdescr,
                     feature_names=self.get_feature_names(),
                     filename=dataset_csv_filename)

    def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        return self._load(return_X_y=return_X_y)

    def get_data_file_basename(self) -> str:
        raise NotImplementedError

    def get_feature_names(self) -> List[str]:
        raise NotImplementedError


class Wine(SimpleDataSetLoader):
    def get_data_file_basename(self) -> str:
        return 'wine_data'

    def get_feature_names(self) -> List[str]:
        return ['alcohol',
                'malic_acid',
                'ash',
                'alcalinity_of_ash',
                'magnesium',
                'total_phenols',
                'flavanoids',
                'nonflavanoid_phenols',
                'proanthocyanins',
                'color_intensity',
                'hue',
                'od280/od315_of_diluted_wines',
                'proline']

    def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the wine dataset (classification).

        *** omitted some docstring for clarity ***

        Examples
        --------
        Let's say you are interested in the samples 10, 80, and 140, and want
        to know their class names.

        >>> from sklearn.datasets import Wine
        >>> data = Wine().load()
        >>> data.target[[10, 80, 140]]
        array([0, 1, 2])
        >>> list(data.target_names)
        ['class_0', 'class_1', 'class_2']
        """
        return self._load(return_X_y=return_X_y)


class Iris(SimpleDataSetLoader):
    def get_data_file_basename(self) -> str:
        return 'iris'

    def get_feature_names(self) -> List[str]:
        return ['sepal length (cm)', 'sepal width (cm)',
                'petal length (cm)', 'petal width (cm)']

    def load(self, return_X_y=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the iris dataset (classification).

        *** omitted some docstring for clarity ***

        Examples
        --------
        Let's say you are interested in the samples 10, 25, and 50, and want
        to know their class names.

        >>> from sklearn.datasets import Iris
        >>> data = Iris().load()
        >>> data.target[[10, 25, 50]]
        array([0, 0, 1])
        >>> list(data.target_names)
        ['setosa', 'versicolor', 'virginica']
        """
        return self._load(return_X_y=return_X_y)


# Backward compatibility
load_wine = Wine().load
load_iris = Iris().load
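To make the backward-compatibility trick at the end concrete, here is a tiny self-contained sketch of the loader-class-plus-module-level-alias pattern. The two data rows and the minimal Bunch stand-in are made up for illustration; they are not sklearn's real data or implementation.

```python
class Bunch(dict):
    """Dict whose keys are also readable as attributes (like sklearn.utils.Bunch)."""
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as err:
            raise AttributeError(key) from err


class Wine:
    def load(self, return_X_y=False):
        # Fake rows standing in for the wine CSV.
        data = [[13.2, 1.78], [12.4, 0.92]]
        target = [0, 1]
        if return_X_y:
            return data, target
        return Bunch(data=data, target=target,
                     target_names=['class_0', 'class_1'])


# Backward compatibility: the old function-style API keeps working,
# because the bound method is exposed under the old name.
load_wine = Wine().load

X, y = load_wine(return_X_y=True)
bunch = load_wine()
```

Existing user code calling `load_wine(...)` would be untouched by the refactor, which is the main selling point of the alias.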
It would be great to be able to set the target column as well, so that all of the data can end up in the same frame.
@vnmabus I agree that being able to set the target_column would be a helpful feature; I've run into the same issue with wanting all of the data in the same array. But I'm hesitant to add it, as I'm wary of feature creep. Is there a rationale for how a target_column parameter should behave?

I really like the idea of using an abstract base class, so here's an updated preview of the API:

from abc import ABC, abstractmethod
from typing import Any, List, Tuple, Union
from os.path import dirname, join

from sklearn.utils import Bunch
from sklearn.datasets.base import load_data  # sklearn's internal CSV helper (location varies by version)


class AbstractDataSetLoader(ABC):
    descr_string = 'descr'
    csv_ext = '.csv'
    rst_ext = '.rst'

    def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Override this method to insert the docstring of the dataset."""
        fetched_dataset = self._fetch_dataset()
        if as_frame:
            # Convert 'data' and 'target' of the fetched bunch to pandas objects.
            # Local import: pandas is an optional dependency of sklearn.
            import pandas as pd
            fetched_dataset['data'] = pd.DataFrame(
                fetched_dataset['data'],
                columns=fetched_dataset['feature_names'])
            fetched_dataset['target'] = pd.Series(
                fetched_dataset['target'], name='target')
        if return_X_y:
            return fetched_dataset['data'], fetched_dataset['target']
        return fetched_dataset

    @abstractmethod
    def _fetch_dataset(self) -> Bunch:
        pass


class RemoteDataSetLoader(AbstractDataSetLoader):
    pass


class LocalDataSetLoader(AbstractDataSetLoader):
    def _fetch_dataset(self) -> Bunch:
        module_path = dirname(__file__)
        base_filename = self._get_data_file_basename()
        data, target, target_names = load_data(module_path, base_filename + self.csv_ext)
        dataset_csv_filename = join(module_path, 'data', base_filename + self.csv_ext)
        with open(join(module_path, self.descr_string, base_filename + self.rst_ext)) as rst_file:
            fdescr = rst_file.read()
        return Bunch(data=data, target=target,
                     target_names=target_names,
                     DESCR=fdescr,
                     feature_names=self._get_feature_names(),
                     filename=dataset_csv_filename)

    @abstractmethod
    def _get_data_file_basename(self) -> str:
        pass

    @abstractmethod
    def _get_feature_names(self) -> List[str]:
        pass


class Wine(LocalDataSetLoader):
    def _get_data_file_basename(self):
        return 'wine_data'

    def _get_feature_names(self):
        return ['alcohol',
                'malic_acid',
                'ash',
                'alcalinity_of_ash',
                'magnesium',
                'total_phenols',
                'flavanoids',
                'nonflavanoid_phenols',
                'proanthocyanins',
                'color_intensity',
                'hue',
                'od280/od315_of_diluted_wines',
                'proline']

    def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the wine dataset (classification).

        *** omitted some docstring for clarity ***

        Examples
        --------
        Let's say you are interested in the samples 10, 80, and 140, and want
        to know their class names.

        >>> from sklearn.datasets import Wine
        >>> data = Wine().load()
        >>> data.target[[10, 80, 140]]
        array([0, 1, 2])
        >>> list(data.target_names)
        ['class_0', 'class_1', 'class_2']
        """
        return super().load(return_X_y=return_X_y, as_frame=as_frame)


class Iris(LocalDataSetLoader):
    def _get_data_file_basename(self):
        return 'iris'

    def _get_feature_names(self):
        return ['sepal length (cm)', 'sepal width (cm)',
                'petal length (cm)', 'petal width (cm)']

    def load(self, return_X_y=False, as_frame=False) -> Union[Bunch, Tuple[Any, Any]]:
        """Load and return the iris dataset (classification).

        *** omitted some docstring for clarity ***

        >>> from sklearn.datasets import Iris
        >>> data = Iris().load()
        >>> data.target[[10, 25, 50]]
        array([0, 0, 1])
        >>> list(data.target_names)
        ['setosa', 'versicolor', 'virginica']
        """
        return super().load(return_X_y=return_X_y, as_frame=as_frame)


# Backward compatibility
load_wine = Wine().load
load_iris = Iris().load
I haven't yet had time to properly evaluate the benefits of an object interface. I am not readily convinced that this particular feature requirement necessitates the redesign: until we have to worry about heterogeneous dtype, a DataFrame can be constructed from a Bunch, can't it?
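That point can be illustrated with a small sketch. The values and column names below are made up to mimic the shape of a loader's Bunch, not real loader output:

```python
import numpy as np
import pandas as pd

# A Bunch-like dict with homogeneous numeric data, mimicking the shape of
# what load_iris() and friends return (the numbers here are fabricated).
bunch = {
    "data": np.array([[5.1, 3.5], [4.9, 3.0]]),
    "target": np.array([0, 0]),
    "feature_names": ["sepal length (cm)", "sepal width (cm)"],
}

# Building a named-column DataFrame from the Bunch is a couple of lines:
df = pd.DataFrame(bunch["data"], columns=bunch["feature_names"])
df["target"] = bunch["target"]
```

As long as `data` is a single homogeneous array, nothing more than this is needed on the user side; the as_frame argument is mainly a convenience.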
@jnothman Yes, I agree that the OO API isn't strictly necessary to implement returning dataframes. However, without an OO API, each function would need to be modified individually. IMO, refactoring the API would make a change like this trivial, and I see it as a good opportunity to refactor rather than add weight and complexity to the current functions. Refactoring the datasets API is getting away from the spirit of this issue, though, so I'll open a new one with a reference back here.
Also see #13902.

Is this solved by #13902 being merged? Do we want this for all dataset loaders?

I'm happy to see as_frame=True for other dataset loaders...

OK cool, these are actually nicely sprintable. Should we tag "good first issue"?
This may be slightly harder since we usually block the |
@lucyleeow and @cmarmo - are your comments suggesting this issue for the sprint, or saying that you're planning to work on it?
Suggesting for the sprint. :) |
ok, then @tianchuliang and I would take this for the sprint |
When testing be sure to set |
take |
@erodrago I believe @bsipocz and @tianchuliang are working on this.
@cmarmo can we each make a PR, since our tasks are pretty well divided? Or do you suggest one combined PR for this issue?
@thomasjpfan sorry I picked it up since it was still on the to do pane.. okay will leave you guys to work on it |
@tianchuliang @bsipocz I think that one PR per dataset fetcher will be easier to review. Thanks! |
@erodrago - I've started to work on the 20newsgroup one, if you want to pick up |
I've started working on the first one, fetch_covtype.
Hi, can we start working on |
@jaketae go for it! |
Be sure to look at #10733 (comment) when working on this locally. |
FYI: Started working on adding as_frame for fetch_kddcup99 |
@thomasjpfan Should the returned DataFrame have labeled columns? As noted in the comments above, I'm not entirely sure where to get the column labels (words) for the dataset. cc: @amueller |
Let's not work on |
Closing, all data fetchers suitable for this feature have been modified. |
This can be really nice, thanks! I have two issues / things I would like:

Edit: with only as_frame=True, I can do something like this:
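One plausible shape for that snippet, assuming X and y come back as a DataFrame and Series the way as_frame=True-style loaders return them (the use of pd.concat and the column names are my assumption, not the commenter's code):

```python
import pandas as pd

# Hypothetical X/y, shaped like a loader's as_frame=True output:
X = pd.DataFrame({"sepal length (cm)": [5.1, 4.9],
                  "sepal width (cm)": [3.5, 3.0]})
y = pd.Series([0, 0], name="target")

# One combined frame with features and target side by side:
df = pd.concat([X, y], axis=1)
```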
Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age, for those datasets otherwise providing a Bunch with feature_names. This would be controlled with an as_frame parameter (though return_X_y='frame' would mean the common usage is more succinct).