{{ message }}

# [MRG+2] Fix PermissionError in datasets fetchers on Windows #9847

Merged
merged 1 commit into from Oct 3, 2017
Merged

# [MRG+2] Fix PermissionError in datasets fetchers on Windows#9847

merged 1 commit into from Oct 3, 2017

## Conversation

Fixes #9820,

### massich commented Sep 28, 2017

 @lesteve what I don't see is why fetch_rcv1 and fetch_species_distributions are affected. Both fetchers remove temporal files indeed. But none of them uses a reference of the content after deleting the temporal files as was the case for california housing. I'll find myself a windows machine to ensure that everything works but I don't see the issue.

### lesteve commented Sep 28, 2017 • edited

 No suggestion off the top of my head, I'll try to take a look shortly. Welcome to the beautiful world of file locks in Windows in any case. My personal setting for testing on Windows is VirtualBox and there was an official Windows 10 iso that you could download for free at some point. Not sure whether that's still the case, google it.

### lesteve commented Sep 28, 2017 • edited

 Just tested this solution on Windows for fetch_california_housing and nope this does not work, i.e. I still get: PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\lesteve\\scikit_learn_data\\cal_housing.tgz' 

### lesteve commented Sep 28, 2017

 With this patch it seems to work: diff --git a/sklearn/datasets/california_housing.py b/sklearn/datasets/california_housing.py index 581e962..727a9cb 100644 --- a/sklearn/datasets/california_housing.py +++ b/sklearn/datasets/california_housing.py @@ -100,9 +100,10 @@ def fetch_california_housing(data_home=None, download_if_missing=True): archive_path = _fetch_remote(ARCHIVE, dirname=data_home) - with tarfile.open(mode="r:gz", name=archive_path).extractfile( - 'CaliforniaHousing/cal_housing.data') as f: - cal_housing = np.loadtxt(f, delimiter=',') + with tarfile.open(mode="r:gz", name=archive_path) as f: + cal_housing = np.loadtxt( + f.extractfile('CaliforniaHousing/cal_housing.data'), + delimiter=',') # Columns are not in the same order compared to the previous # URL resource on lib.stat.cmu.edu columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]

### lesteve commented Sep 28, 2017

 Basically the context manager was not closing the right file ...

### lesteve commented Sep 29, 2017

 OK I think everything should be fixed now.
changed the title [WIP] Add context manager to california housing fetcher [MRG] FIX PermissionError in datasets fetchers on Windows Sep 29, 2017
 FIX PermissionError in datasets fetchers on Windows 
 6e3d73b 
PermissionError: [WinError 32] The process cannot access the file
because it is being used by another process. This was happening when
properly closed.
added this to the 0.19.1 milestone Sep 29, 2017
changed the title [MRG] FIX PermissionError in datasets fetchers on Windows [MRG] Fix PermissionError in datasets fetchers on Windows Sep 29, 2017

### lesteve commented Sep 29, 2017

 I ran the script I used for the figshare work on Windows to make sure I could download all the datasets from scratch and everything went fine.

### massich commented Sep 29, 2017

 There's something I don't understand tough. My first solution was: with tarfile.open(...) as f: fileobj = f.extractfile(...) cal_housing = np.loadtxt(fileobj, ...)  And this was not working for me. So why would adding it in the call work? Anyhow. I just checked your commit and everything looks good.
changed the title [MRG] Fix PermissionError in datasets fetchers on Windows [MRG+1] Fix PermissionError in datasets fetchers on Windows Sep 29, 2017

### lesteve commented Sep 29, 2017

 And this was not working for me. So why would adding it in the call work? Anyhow. I just checked your commit and everything looks good. Weird, oh well, it's this kind of stuff you want to not ever touch again and forget about when everything starts working ;-). And just a side-comment, I seem to remember (but maybe it's my imagination) that you added this remove for consistency between fetchers (in some fetchers the archive were removed in some other they were not). Personally I took it as a cautionary talke that tells us that it is very easy to break something even if the change looks really innocuous. I guess this is even worse in this case because sklearn.datasets coverage is close to non-existent (on Linux too, i.e. not Windows-specific).

### massich commented Sep 29, 2017

 I guess this is even worse in this case because sklearn.datasets coverage is close to non-existent (on Linux too, i.e. not Windows-specific). This is exactly what I was thinking, that we should find a manner to test all this either with an overnight process or with mocking objects or something. I'll review all this when we retake the partial fetcher.

### lesteve commented Sep 29, 2017

 This is exactly what I was thinking, that we should find a manner to test all this either with an overnight process or with mocking objects or something. I'll review all this when we retake the partial fetcher. nilearn has some testing of the fetchers. From what I remembered it was not pretty (monkey-patching of some urllib functions maybe?) but it was doing its job.

### lesteve commented Sep 29, 2017

 Having said that this problem may not have been spotted with mocks. Basically you need a real file to realise that on Windows (tough luck) the file is locked. All I am saying is that mocks may not cover all the edge cases. It is better than no tests as we currently have of course.

### jnothman commented Oct 3, 2017

 LGTM
changed the title [MRG+1] Fix PermissionError in datasets fetchers on Windows [MRG+2] Fix PermissionError in datasets fetchers on Windows Oct 3, 2017
merged commit 534f68b into scikit-learn:master Oct 3, 2017
5 of 6 checks passed
5 of 6 checks passed
codecov/patch 0% of diff hit (target 96.17%)
Details
ci/circleci Your tests passed on CircleCI!
Details
codecov/project 96.16% (-0.01%) compared to daeb3ad
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
lgtm analysis: Python No alert changes
Details
added a commit to jnothman/scikit-learn that referenced this pull request Oct 3, 2017
 FIX PermissionError in datasets fetchers on Windows (scikit-learn#9847) 
 ab33915 
mentioned this pull request Oct 18, 2017
mentioned this pull request Oct 18, 2017
added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
 FIX PermissionError in datasets fetchers on Windows (scikit-learn#9847) 
 ee52996 
pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
 FIX PermissionError in datasets fetchers on Windows (scikit-learn#9847) 
 815db8f 
deleted the massich:9820 branch Jun 6, 2018