MNT Add type annotations for OpenML fetcher #17053

Merged
rth merged 21 commits into scikit-learn:master on Jun 25, 2020

Member

@rth rth commented Apr 26, 2020

I find the OpenML fetcher code currently very difficult to read: there are a bunch of private functions with undocumented signatures that manipulate objects with non-trivial types.

This PR adds some type annotations to function signatures, to at least make the expected inputs and outputs of each function a bit clearer.

I'll contribute some of it to liac-arff upstream as soon as it drops Python 2 support (renatopp/liac-arff#107).

@@ -81,15 +88,15 @@ def _open_openml_url(openml_path, data_home):
result : stream
A stream to the OpenML resource
"""
-def is_gzip(_fsrc):
+def is_gzip_encoded(_fsrc):
Member Author

@rth rth Apr 26, 2020

Renamed because I found the name confusing. The function does not check whether the file is actually gzipped, but whether the response header has Content-Encoding == 'gzip'. So we need to use gzip for local files when this function returns False.
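A minimal sketch of the distinction described in this comment, not the code from the PR. It assumes the local cache is always stored gzip-compressed; `to_cache_bytes` and the `headers` argument are hypothetical (the function in the diff takes the response object `_fsrc`).

```python
import gzip
from email.message import Message


def is_gzip_encoded(headers: Message) -> bool:
    """Return True if the response declares Content-Encoding: gzip."""
    return headers.get("Content-Encoding", "") == "gzip"


def to_cache_bytes(payload: bytes, headers: Message) -> bytes:
    """Return the bytes to write to a gzip-compressed cache file.

    Hypothetical helper, used only to illustrate the comment above.
    """
    if is_gzip_encoded(headers):
        # The body is already compressed by the server: store it unchanged.
        return payload
    # Plain body: compress it ourselves so the cached file is always gzip.
    return gzip.compress(payload)
```

In both branches the cached bytes can be read back with `gzip.decompress`, which is why the check is about the header and not about the file.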

Whether to raise an error if OpenML returns an acceptable error (e.g.,
date not found). If this argument is set to False, a None is returned
in case of acceptable errors. Note that all other errors (e.g., 404)
will still be raised as normal.
Member Author

Removed this argument; the exception is now caught directly in the one case where that is necessary. This makes the function return a single type, instead of Dict or None.

-if raise_if_error:
-    raise ValueError(error_message)
-return None
+raise OpenMLError(error_message)
Member Author

OpenMLError inherits from ValueError, so it is backward compatible.
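To illustrate why this is backward compatible, here is a small sketch under stated assumptions: `fetch_json` and `fetch_json_or_none` are hypothetical stand-ins, not functions from the PR. Only the `OpenMLError(ValueError)` relationship comes from the comment above.

```python
from typing import Dict, Optional


class OpenMLError(ValueError):
    """Error reported by OpenML; subclasses ValueError as stated above."""


def fetch_json(found: bool) -> Dict:
    """Hypothetical fetcher that always returns a Dict or raises."""
    if not found:
        raise OpenMLError("Dataset not found")
    return {"data_set_description": {"id": "1"}}


def fetch_json_or_none(found: bool) -> Optional[Dict]:
    """The one caller that tolerates the error catches it explicitly."""
    try:
        return fetch_json(found)
    except OpenMLError:
        return None


# Existing callers that catch ValueError keep working unchanged.
try:
    fetch_json(found=False)
except ValueError as exc:
    caught = type(exc).__name__
```

With this layout the fetcher has a single return type, and the optional result is confined to the wrapper that needs it.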

conda create --name flake8_env --yes python=3.8
-conda activate flake8_env
+source activate flake8_env
Member Author

conda activate was actually not working in master; it failed with

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

and conda init was not working either. We were just not checking the exit status (now done with set -ex). So I reverted to source activate, as I don't want to debug this.

Member

One issue with conda is that it expects the shell function, not the executable, to be the entry point, so having the executable in the PATH is not enough. This has already given me a headache in other places.

@@ -171,6 +172,18 @@
_RE_SPARSE_LINE = re.compile(r'^\s*\{.*\}\s*$', re.UNICODE)
_RE_NONTRIVIAL_DATA = re.compile('["\'{}\\s]', re.UNICODE)

ArffDataType = Tuple[List, ...]

if typing.TYPE_CHECKING:
Member

According to the TYPE_CHECKING docs, this is used for performance reasons. If this is the case, let's have a comment here.

Member Author

@rth rth Jun 3, 2020

No, the reason is that typing-extensions (https://pypi.org/project/typing-extensions/, imported below) is a mypy dependency, so we can only import it when we are checking types, in which case it is installed. I'll add a comment.
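The pattern described here can be sketched as follows. This is an illustration under assumptions, not the code from the PR: `ContainerType` and `relation_name` are hypothetical names, and the runtime fallback alias is one possible choice.

```python
import typing
from typing import Any, Dict, List

if typing.TYPE_CHECKING:
    # Only evaluated by static type checkers such as mypy. mypy depends on
    # typing_extensions, so the import is safe in this branch.
    from typing_extensions import TypedDict

    class ContainerType(TypedDict):  # hypothetical name
        relation: str
        attributes: List
else:
    # At runtime typing_extensions may not be installed, so fall back to a
    # loose alias that keeps the annotations below resolvable.
    ContainerType = Dict[str, Any]


def relation_name(container: ContainerType) -> str:
    """Return the relation name stored in the container."""
    return container["relation"]
```

At runtime `typing.TYPE_CHECKING` is `False`, so only the `else` branch executes and no optional dependency is imported.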

rth and others added 2 commits June 3, 2020 22:29
@@ -311,6 +333,9 @@ def _convert_arff_data_dataframe(arff, columns, features_dict):
attributes = OrderedDict(arff['attributes'])
arff_columns = list(attributes)

if isinstance(arff['data'], tuple):
raise ValueError("Unreachable code. arff['data'] must be a generator.")

# calculate chunksize
first_row = next(arff['data'])
Member Author

@rth rth Jun 9, 2020

The above check is necessary because arff['data'] could technically be a tuple for sparse data, and then next would fail, as it is not implemented for tuples.
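A short sketch of the failure mode this guard protects against. `first_row` is a hypothetical helper that mirrors the check in the hunk above; it is not the function from the PR.

```python
from typing import Any, Iterator, Tuple, Union


def first_row(data: Union[Iterator[Any], Tuple[Any, ...]]) -> Any:
    """Return the first row of streamed data (hypothetical helper)."""
    if isinstance(data, tuple):
        # A tuple is iterable but is not an iterator, so next() would raise
        # TypeError; fail early with a clearer message instead.
        raise ValueError("data must be a generator, got a tuple")
    return next(data)


# Dense data arrives as a generator of rows, so next() works.
rows = (row for row in [[5.1, 3.5], [4.9, 3.0]])
head = first_row(rows)
```

Calling `next` directly on a tuple raises `TypeError`, which is what the explicit `isinstance` check turns into a more readable error.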

Member

Can we directly check that arff['data'] is not iterable?

if not isinstance(arff['data'], Iterable):

(Not for this PR: This reminds me that we can likely return dataframes with sparse extension arrays)

Member Author

Yes, done.
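A side note on the choice of abstract base class, offered as a sketch and not as a description of the final code: a tuple is itself an Iterable, so an Iterable test alone would not separate the tuple case from the generator case. The later hunks in this conversation test against Generator instead. `describe` is a hypothetical helper used only for illustration.

```python
from collections.abc import Generator, Iterable
from typing import Any


def describe(data: Any) -> str:
    """Classify a container by the two isinstance checks discussed here."""
    if isinstance(data, Generator):
        return "generator"
    if isinstance(data, Iterable):
        return "iterable, but not a generator"
    return "not iterable"


dense = (row for row in [[1, 2], [3, 4]])  # streamed rows
sparse = ([1.0], [0], [0])  # tuple stand-in for sparse data
kinds = (describe(dense), describe(sparse))
```

Only the Generator check tells the two apart, since both objects pass an Iterable check.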

Member Author

rth commented Jun 9, 2020

@thomasjpfan I addressed your comments, please let me know if you have others.

Member

@thomasjpfan thomasjpfan left a comment

I am happy to try out type hints here. Thank you @rth!

Member

@jnothman jnothman left a comment

Sure, let's try it out.

Member

@jnothman jnothman left a comment

nitpicks

@@ -245,6 +261,10 @@ def _convert_arff_data(arff, col_slice_x, col_slice_y, shape=None):
"""
arff_data = arff['data']
if isinstance(arff_data, Generator):
if shape is None:
raise ValueError(
Member

Shouldn't this have test coverage?

@@ -311,6 +333,11 @@ def _convert_arff_data_dataframe(arff, columns, features_dict):
attributes = OrderedDict(arff['attributes'])
arff_columns = list(attributes)

if not isinstance(arff['data'], Generator):
raise ValueError(
Member

Test coverage?

Member Author

rth commented Jun 25, 2020

Thanks for the review @jnothman, I added a test to cover those two lines and coverage is now green. Merging.

@rth rth changed the title Add type annotations for OpenML fetcher MNT Add type annotations for OpenML fetcher Jun 25, 2020
@rth rth merged commit 361c052 into scikit-learn:master Jun 25, 2020
@rth rth deleted the annotations-openml branch June 25, 2020 16:30
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jul 17, 2020
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@rth rth mentioned this pull request Aug 31, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
cmarmo added a commit to cmarmo/scikit-learn that referenced this pull request Mar 2, 2021