MNT Add type annotations for OpenML fetcher #17053

rth · 2020-04-26T20:02:03Z

I find the OpenML fetcher code currently very difficult to read: there are a bunch of private functions with undocumented signatures that manipulate objects with non-trivial types.

This PR adds type annotations to some function signatures, to at least make the expected inputs and outputs a bit clearer.

I'll contribute some of it to liac-arff upstream as soon as it drops Python 2 support (renatopp/liac-arff#107).

rth · 2020-04-26T20:49:18Z

sklearn/datasets/_openml.py

@@ -81,15 +88,15 @@ def _open_openml_url(openml_path, data_home):
    result : stream
        A stream to the OpenML resource
    """
-    def is_gzip(_fsrc):
+    def is_gzip_encoded(_fsrc):


Renamed because I found the name confusing. The check is not whether the file is actually gzipped, but whether the response header has Content-Encoding == 'gzip'. So we need to apply gzip ourselves to local files when this function returns False.
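To make the distinction concrete, here is a minimal sketch (not the actual fetcher code) of a header check and the caching decision it drives. The helper name `cache_response` and the in-memory destination are assumptions for illustration; `email.message.Message` stands in for the headers object, since the object returned by a urllib response's `.info()` is a subclass of it.

```python
import gzip
import io
import shutil
from email.message import Message
from typing import BinaryIO


def is_gzip_encoded(headers: Message) -> bool:
    """Return True if the response declares Content-Encoding: gzip."""
    return headers.get("Content-Encoding", "") == "gzip"


def cache_response(body: BinaryIO, headers: Message, dst: BinaryIO) -> None:
    """Store `body` in `dst` so that `dst` is always gzip-compressed.

    Hypothetical helper: if the server already gzip-encoded the payload,
    the bytes are copied as-is; otherwise we compress them ourselves.
    """
    if is_gzip_encoded(headers):
        shutil.copyfileobj(body, dst)
    else:
        # Closing the GzipFile flushes the stream but leaves `dst` open.
        with gzip.GzipFile(fileobj=dst, mode="wb") as gz:
            shutil.copyfileobj(body, gz)
```

With this layout the reading side can always open the cached file with gzip, regardless of how the server delivered it.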

rth · 2020-04-26T20:50:40Z

sklearn/datasets/_openml.py

-        Whether to raise an error if OpenML returns an acceptable error (e.g.,
-        date not found). If this argument is set to False, a None is returned
-        in case of acceptable errors. Note that all other errors (e.g., 404)
-        will still be raised as normal.


Removed the raise_if_error parameter; the exception is now caught directly in the one case where that is necessary. This makes the function return a single type, instead of Dict or None.

rth · 2020-04-26T20:52:35Z

sklearn/datasets/_openml.py

-    if raise_if_error:
-        raise ValueError(error_message)
-    return None
+    raise OpenMLError(error_message)


OpenMLError inherits from ValueError, so it is backward compatible.
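A minimal sketch of why this stays backward compatible, assuming a hypothetical stand-in function `get_json_content` (the real fetcher's signature is not reproduced here): code that catches ValueError still works, while the one caller that tolerates the error can catch OpenMLError explicitly and pick its own fallback.

```python
from typing import Dict


class OpenMLError(ValueError):
    """Error reported by OpenML; subclasses ValueError for compatibility."""


def get_json_content(found: bool) -> Dict:
    """Hypothetical stand-in: always returns a Dict or raises."""
    if not found:
        raise OpenMLError("Dataset not found.")
    return {"data_set_description": {}}


# Existing callers that catch ValueError keep working unchanged.
try:
    get_json_content(found=False)
except ValueError as exc:
    caught = exc

# A caller that tolerates the error handles it explicitly,
# instead of asking the function to return None.
try:
    qualities = get_json_content(found=False)
except OpenMLError:
    qualities = None
```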

rth · 2020-04-26T20:53:12Z

azure-pipelines.yml

        conda create --name flake8_env --yes python=3.8
-        conda activate flake8_env
+        source activate flake8_env


conda activate was actually not working in master; it failed with:

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'. To initialize your shell, run $ conda init <SHELL_NAME>

and conda init was not working either. We were just not checking the exit status (now done with set -ex). So I reverted to source activate, as I don't want to debug this.

One issue with conda is that it expects the shell function to be the entry point and not the executable, so having the executable in the PATH is not enough. This has already given me a headache in other places.

thomasjpfan · 2020-05-23T18:11:05Z

sklearn/externals/_arff.py

@@ -171,6 +172,18 @@
 _RE_SPARSE_LINE = re.compile(r'^\s*\{.*\}\s*$', re.UNICODE)
 _RE_NONTRIVIAL_DATA = re.compile('["\'{}\\s]', re.UNICODE)

+ArffDataType = Tuple[List, ...]
+
+if typing.TYPE_CHECKING:


According to the TYPE_CHECKING docs, this is used for performance reasons. If this is the case, let's have a comment here.

No, the reason is that typing-extensions (https://pypi.org/project/typing-extensions/, imported below) is a mypy dependency, so we can only import it when we are checking types and it is therefore installed. I'll add a comment.
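The pattern being described can be sketched as follows. The names `ArffContainer` and `relation_name` are illustrative assumptions, not the identifiers used in the PR; the point is only that the typing_extensions import is never executed at runtime, so the module still imports when that package is absent.

```python
import typing
from typing import Any, Dict, List

if typing.TYPE_CHECKING:
    # Only evaluated by the type checker; typing_extensions is
    # available whenever mypy is installed.
    from typing_extensions import TypedDict

    class ArffContainer(TypedDict):
        relation: str
        attributes: List
        data: Any
else:
    # Runtime fallback that needs no extra dependency.
    ArffContainer = Dict[str, Any]


def relation_name(container: ArffContainer) -> str:
    """Return the relation name stored in an ARFF-like container."""
    return container["relation"]
```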

sklearn/datasets/_openml.py

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

rth · 2020-06-09T10:58:03Z

sklearn/datasets/_openml.py

@@ -311,6 +333,9 @@ def _convert_arff_data_dataframe(arff, columns, features_dict):
    attributes = OrderedDict(arff['attributes'])
    arff_columns = list(attributes)

+    if isinstance(arff['data'], tuple):
+        raise ValueError("Unreachable code. arff['data'] must be a generator.")
+
    # calculate chunksize
    first_row = next(arff['data'])


The above check is necessary as technically arff['data'] could be a tuple for sparse data, and then next would fail as it's not implemented for tuples.

Can we directly check that arff['data'] is not iterable?

if not isinstance(arff['data'], Iterable)

(Not for this PR: This reminds me that we can likely return dataframes with sparse extension arrays)
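As a side note on the choice of abstract base class, the short sketch below shows how the standard `collections.abc` checks behave for the two shapes discussed here. The concrete values are made up for illustration: a tuple is itself an Iterable, so only an Iterator or Generator check separates it from a generator that supports `next`.

```python
from collections.abc import Generator, Iterable, Iterator


def rows():
    """Yield rows one at a time, as a dense data stream would."""
    yield [1.0, 2.0]
    yield [3.0, 4.0]


dense = rows()               # generator (assumed dense layout)
sparse = ([1.0], [0], [0])   # tuple of lists (assumed sparse layout)

checks = {
    "tuple is Iterable": isinstance(sparse, Iterable),
    "tuple is Iterator": isinstance(sparse, Iterator),
    "generator is Iterable": isinstance(dense, Iterable),
    "generator is Generator": isinstance(dense, Generator),
}

# next() works on the generator; on the tuple it raises TypeError.
first_row = next(dense)
```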

rth · 2020-06-09T11:28:31Z

@thomasjpfan I addressed your comments, please let me know if you have others.

thomasjpfan

I am happy to try out typehints here. Thank you @rth!

jnothman

Sure, let's try it out.

jnothman

nitpicks

jnothman · 2020-06-25T14:43:38Z

sklearn/datasets/_openml.py

@@ -245,6 +261,10 @@ def _convert_arff_data(arff, col_slice_x, col_slice_y, shape=None):
    """
    arff_data = arff['data']
    if isinstance(arff_data, Generator):
+        if shape is None:
+            raise ValueError(


Shouldn't this have test coverage?

jnothman · 2020-06-25T14:43:50Z

sklearn/datasets/_openml.py

@@ -311,6 +333,11 @@ def _convert_arff_data_dataframe(arff, columns, features_dict):
    attributes = OrderedDict(arff['attributes'])
    arff_columns = list(attributes)

+    if not isinstance(arff['data'], Generator):
+        raise ValueError(


Test coverage?

rth · 2020-06-25T16:30:16Z

Thanks for the review, @jnothman. I added a test to cover those two lines and coverage is now green. Merging.

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

rth added 11 commits April 26, 2020 20:18

Type annotations for OpenML fetcher

7692122

More types

ffa469d

Merge remote-tracking branch 'upstream/master' into annotations-openml

c5bfa10

Fix merge conflicts

14d34c4

More fixes

a31781b

Fixing CI

59231e2

Lint

5fa0edd

Conda activate doesn't work

d202ecc

Simplify exception handling

e50203d

Simplify data_qualities

00edef4

Fix tests

937f606

rth commented Apr 26, 2020

rth requested a review from thomasjpfan April 26, 2020 20:57

rth mentioned this pull request May 10, 2020

MNT properly activate the env in the linting CI #17177

Merged

rth mentioned this pull request May 23, 2020

ENH Verify md5-checksums received from openml arff file metadata #14800

Merged

thomasjpfan reviewed May 23, 2020

rth and others added 2 commits June 3, 2020 22:29

Apply suggestions from code review

55ad34b

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Merge branch 'master' into annotations-openml

bef7672

github-actions bot added the module:datasets label Jun 9, 2020

rth added 2 commits June 9, 2020 10:52

Address review comments

3856d5b

More fixes

c514c53

rth commented Jun 9, 2020

rth added 2 commits June 9, 2020 13:03

Another fix

ebf1b59

Another typo

55ec82b

thomasjpfan approved these changes Jun 9, 2020

Check arff['data'] for being an iterable instead

ebf80d1

jnothman approved these changes Jun 25, 2020

jnothman reviewed Jun 25, 2020

Merge remote-tracking branch 'upstream/master' into annotations-openml

fcd0b62

rth added 2 commits June 25, 2020 17:40

Add requested tests

8c1fbfa

Only run test if pandas is installed

c9a974a

rth changed the title ~~Add type annotations for OpenML fetcher~~ MNT Add type annotations for OpenML fetcher Jun 25, 2020

rth merged commit 361c052 into scikit-learn:master Jun 25, 2020

rth deleted the annotations-openml branch June 25, 2020 16:30

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jul 17, 2020

MNT Add type annotations for OpenML fetcher (scikit-learn#17053)

9ef07a7

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

rth mentioned this pull request Aug 31, 2020

Support typing #16705

Open

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020

MNT Add type annotations for OpenML fetcher (scikit-learn#17053)

d9adfae

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

cmarmo added a commit to cmarmo/scikit-learn that referenced this pull request Mar 2, 2021

Revert Roman modifications from scikit-learn#17053.

5515ca4
