New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] Fix PermissionError in datasets fetchers on Windows #9847

Merged
merged 1 commit into from Oct 3, 2017

Conversation

Projects
None yet
3 participants
@massich
Contributor

massich commented Sep 28, 2017

Reference Issue

Fixes #9820,

What does this implement/fix? Explain your changes.

Any other comments?

@massich

This comment has been minimized.

Show comment
Hide comment
@massich

massich Sep 28, 2017

Contributor

@lesteve what I don't see is why fetch_rcv1 and fetch_species_distributions are affected. Both fetchers remove temporal files indeed. But none of them uses a reference of the content after deleting the temporal files as was the case for california housing.

I'll find myself a windows machine to ensure that everything works but I don't see the issue.

Contributor

massich commented Sep 28, 2017

@lesteve what I don't see is why fetch_rcv1 and fetch_species_distributions are affected. Both fetchers remove temporal files indeed. But none of them uses a reference of the content after deleting the temporal files as was the case for california housing.

I'll find myself a windows machine to ensure that everything works but I don't see the issue.

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Sep 28, 2017

Member

No suggestion off the top of my head, I'll try to take a look shortly. Welcome to the beautiful world of file locks in Windows in any case.

My personal setting for testing on Windows is VirtualBox and there was an official Windows 10 iso that you could download for free at some point. Not sure whether that's still the case, google it.

Member

lesteve commented Sep 28, 2017

No suggestion off the top of my head, I'll try to take a look shortly. Welcome to the beautiful world of file locks in Windows in any case.

My personal setting for testing on Windows is VirtualBox and there was an official Windows 10 iso that you could download for free at some point. Not sure whether that's still the case, google it.

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Sep 28, 2017

Member

Just tested this solution on Windows for fetch_california_housing and nope this does not work, i.e. I still get:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\lesteve\\scikit_learn_data\\cal_housing.tgz'
Member

lesteve commented Sep 28, 2017

Just tested this solution on Windows for fetch_california_housing and nope this does not work, i.e. I still get:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\lesteve\\scikit_learn_data\\cal_housing.tgz'
@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Sep 28, 2017

Member

With this patch it seems to work:

diff --git a/sklearn/datasets/california_housing.py b/sklearn/datasets/california_housing.py
index 581e962..727a9cb 100644
--- a/sklearn/datasets/california_housing.py
+++ b/sklearn/datasets/california_housing.py
@@ -100,9 +100,10 @@ def fetch_california_housing(data_home=None, download_if_missing=True):

         archive_path = _fetch_remote(ARCHIVE, dirname=data_home)

-        with tarfile.open(mode="r:gz", name=archive_path).extractfile(
-                'CaliforniaHousing/cal_housing.data') as f:
-            cal_housing = np.loadtxt(f, delimiter=',')
+        with tarfile.open(mode="r:gz", name=archive_path) as f:
+            cal_housing = np.loadtxt(
+                f.extractfile('CaliforniaHousing/cal_housing.data'),
+                delimiter=',')
             # Columns are not in the same order compared to the previous
             # URL resource on lib.stat.cmu.edu
             columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
Member

lesteve commented Sep 28, 2017

With this patch it seems to work:

diff --git a/sklearn/datasets/california_housing.py b/sklearn/datasets/california_housing.py
index 581e962..727a9cb 100644
--- a/sklearn/datasets/california_housing.py
+++ b/sklearn/datasets/california_housing.py
@@ -100,9 +100,10 @@ def fetch_california_housing(data_home=None, download_if_missing=True):

         archive_path = _fetch_remote(ARCHIVE, dirname=data_home)

-        with tarfile.open(mode="r:gz", name=archive_path).extractfile(
-                'CaliforniaHousing/cal_housing.data') as f:
-            cal_housing = np.loadtxt(f, delimiter=',')
+        with tarfile.open(mode="r:gz", name=archive_path) as f:
+            cal_housing = np.loadtxt(
+                f.extractfile('CaliforniaHousing/cal_housing.data'),
+                delimiter=',')
             # Columns are not in the same order compared to the previous
             # URL resource on lib.stat.cmu.edu
             columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Sep 28, 2017

Member

Basically the context manager was not closing the right file ...

Member

lesteve commented Sep 28, 2017

Basically the context manager was not closing the right file ...

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Sep 29, 2017

Member

OK I think everything should be fixed now.

Member

lesteve commented Sep 29, 2017

OK I think everything should be fixed now.

@lesteve lesteve changed the title from [WIP] Add context manager to california housing fetcher to [MRG] FIX PermissionError in datasets fetchers on Windows Sep 29, 2017

FIX PermissionError in datasets fetchers on Windows
PermissionError: [WinError 32] The process cannot access the file
because it is being used by another process. This was happening when
trying to remove the downloaded archive because the archive was not
properly closed.

@lesteve lesteve added this to the 0.19.1 milestone Sep 29, 2017

@lesteve lesteve changed the title from [MRG] FIX PermissionError in datasets fetchers on Windows to [MRG] Fix PermissionError in datasets fetchers on Windows Sep 29, 2017

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Sep 29, 2017

Member

I ran the script I used for the figshare work on Windows to make sure I could download all the datasets from scratch and everything went fine.

Member

lesteve commented Sep 29, 2017

I ran the script I used for the figshare work on Windows to make sure I could download all the datasets from scratch and everything went fine.

@massich

This comment has been minimized.

Show comment
Hide comment
@massich

massich Sep 29, 2017

Contributor

There's something I don't understand tough. My first solution was:

with tarfile.open(...) as f:
       fileobj = f.extractfile(...)
       cal_housing = np.loadtxt(fileobj, ...) 

And this was not working for me. So why would adding it in the call work?
Anyhow. I just checked your commit and everything looks good.

Contributor

massich commented Sep 29, 2017

There's something I don't understand tough. My first solution was:

with tarfile.open(...) as f:
       fileobj = f.extractfile(...)
       cal_housing = np.loadtxt(fileobj, ...) 

And this was not working for me. So why would adding it in the call work?
Anyhow. I just checked your commit and everything looks good.

@massich massich changed the title from [MRG] Fix PermissionError in datasets fetchers on Windows to [MRG+1] Fix PermissionError in datasets fetchers on Windows Sep 29, 2017

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Sep 29, 2017

Member

And this was not working for me. So why would adding it in the call work?
Anyhow. I just checked your commit and everything looks good.

Weird, oh well, it's this kind of stuff you want to not ever touch again and forget about when everything starts working ;-).

And just a side-comment, I seem to remember (but maybe it's my imagination) that you added this remove for consistency between fetchers (in some fetchers the archive were removed in some other they were not). Personally I took it as a cautionary talke that tells us that it is very easy to break something even if the change looks really innocuous. I guess this is even worse in this case because sklearn.datasets coverage is close to non-existent (on Linux too, i.e. not Windows-specific).

Member

lesteve commented Sep 29, 2017

And this was not working for me. So why would adding it in the call work?
Anyhow. I just checked your commit and everything looks good.

Weird, oh well, it's this kind of stuff you want to not ever touch again and forget about when everything starts working ;-).

And just a side-comment, I seem to remember (but maybe it's my imagination) that you added this remove for consistency between fetchers (in some fetchers the archive were removed in some other they were not). Personally I took it as a cautionary talke that tells us that it is very easy to break something even if the change looks really innocuous. I guess this is even worse in this case because sklearn.datasets coverage is close to non-existent (on Linux too, i.e. not Windows-specific).

@massich

This comment has been minimized.

Show comment
Hide comment
@massich

massich Sep 29, 2017

Contributor

I guess this is even worse in this case because sklearn.datasets coverage is close to non-existent (on Linux too, i.e. not Windows-specific).

This is exactly what I was thinking, that we should find a manner to test all this either with an overnight process or with mocking objects or something. I'll review all this when we retake the partial fetcher.

Contributor

massich commented Sep 29, 2017

I guess this is even worse in this case because sklearn.datasets coverage is close to non-existent (on Linux too, i.e. not Windows-specific).

This is exactly what I was thinking, that we should find a manner to test all this either with an overnight process or with mocking objects or something. I'll review all this when we retake the partial fetcher.

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Sep 29, 2017

Member

This is exactly what I was thinking, that we should find a manner to test all this either with an overnight process or with mocking objects or something. I'll review all this when we retake the partial fetcher.

nilearn has some testing of the fetchers. From what I remembered it was not pretty (monkey-patching of some urllib functions maybe?) but it was doing its job.

Member

lesteve commented Sep 29, 2017

This is exactly what I was thinking, that we should find a manner to test all this either with an overnight process or with mocking objects or something. I'll review all this when we retake the partial fetcher.

nilearn has some testing of the fetchers. From what I remembered it was not pretty (monkey-patching of some urllib functions maybe?) but it was doing its job.

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Sep 29, 2017

Member

Having said that this problem may not have been spotted with mocks. Basically you need a real file to realise that on Windows (tough luck) the file is locked. All I am saying is that mocks may not cover all the edge cases. It is better than no tests as we currently have of course.

Member

lesteve commented Sep 29, 2017

Having said that this problem may not have been spotted with mocks. Basically you need a real file to realise that on Windows (tough luck) the file is locked. All I am saying is that mocks may not cover all the edge cases. It is better than no tests as we currently have of course.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Oct 3, 2017

Member

LGTM

Member

jnothman commented Oct 3, 2017

LGTM

@jnothman jnothman changed the title from [MRG+1] Fix PermissionError in datasets fetchers on Windows to [MRG+2] Fix PermissionError in datasets fetchers on Windows Oct 3, 2017

@jnothman jnothman merged commit 534f68b into scikit-learn:master Oct 3, 2017

5 of 6 checks passed

codecov/patch 0% of diff hit (target 96.17%)
Details
ci/circleci Your tests passed on CircleCI!
Details
codecov/project 96.16% (-0.01%) compared to daeb3ad
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
lgtm analysis: Python No alert changes
Details

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Oct 3, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

@massich massich deleted the massich:9820 branch Jun 6, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment