
Custom dataset in skorch when using sklearn GridSearchCV? #212

Closed
johnny5550822 opened this issue May 9, 2018 · 6 comments

Comments

@johnny5550822

johnny5550822 commented May 9, 2018

According to the tutorial (https://nbviewer.jupyter.org/github/dnouri/skorch/blob/master/notebooks/Advanced_Usage.ipynb), skorch supports custom Dataset objects (trainset in this case) in training, which is perfect. As long as I provide the y, I can train the model:

net.fit(trainset, y=trainset_label)

However, when I tried doing a grid search using sklearn, I got an inconsistent X, y dimension error. The code that I had is below:

params = {
    'module__num_units_fc3': [84, 42, 21],
}
gs = GridSearchCV(net, params, cv=5, scoring='roc_auc')
gs.fit(trainset, y=trainset_label)

Error:
X and y have inconsistent lengths.

Is the custom dataset being supported in skorch when using sklearn GridSearchCV?

@BenjaminBossan
Collaborator

Getting pytorch Datasets to work with GridSearchCV is not trivially possible. The problem is that eventually, the Dataset leaves the skorch domain and is handled directly by sklearn. sklearn only works with a couple of data types (ndarray, scipy sparse, pandas DataFrame), so you will encounter an error sooner or later.

My recommendation for now would be to try out a different data format that works with GridSearchCV, e.g. one of the formats mentioned above. Additionally, by using skorch.helper.SliceDict, you can even run GridSearchCV with dicts.
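To illustrate why a dict of arrays works once it is wrapped: sklearn's cross-validation machinery only needs X to support len() and row-wise indexing. The class below is a simplified stand-in sketching that contract, not skorch's actual SliceDict implementation; names are invented for the example.

```python
import numpy as np

# sklearn's CV splitters only need X to support len() and row-wise
# indexing, which a plain dict of arrays lacks. skorch.helper.SliceDict
# provides exactly that; this class is a simplified stand-in to
# illustrate the contract (not skorch's real implementation).
class SliceDictSketch:
    def __init__(self, **data):
        lengths = {len(v) for v in data.values()}
        assert len(lengths) == 1, "all values must share the first dimension"
        self._data = data

    def __len__(self):
        # number of samples, not number of keys
        return len(next(iter(self._data.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._data[key]  # normal dict access by name
        # slice/array indexing selects the same rows from every value
        return SliceDictSketch(**{k: v[key] for k, v in self._data.items()})
```

With the real skorch.helper.SliceDict, such an X can be passed straight to GridSearchCV, and each key is forwarded to the module as a keyword argument.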

That being said, we could surely think about providing a helper class that wraps a Dataset and can be handled by sklearn (similar to skorch.helper.SliceDict). However, it will be almost impossible to handle all cases. Say you want to load your data lazily. Since GridSearchCV wants to make a cv split on your data, it will more or less have to load everything at once, so this cannot work.

Minor note: You only need to provide the y explicitly to fit if you need a stratified CVSplit; otherwise, fitting should work with just the dataset.

@johnny5550822
Author

@BenjaminBossan Thanks for the clarification. Does that mean that if I want to use sklearn GridSearchCV, I most likely have to either (1) load all the data into memory and store it as numpy so that I can use GridSearchCV, or (2) ...[no idea]? My custom Dataset follows almost the same pattern as the skorch tutorial (https://nbviewer.jupyter.org/github/dnouri/skorch/blob/master/notebooks/Advanced_Usage.ipynb):

    def __getitem__(self, i):
        return self.xx[i], self.yy[i]

except that xx[i] and yy[i] are pytorch cuda tensors. One of the advantages of using skorch is that it is sklearn-compatible, which is very attractive. Hmm, if I use my custom dataset (like above), do I have no way to work with sklearn GridSearchCV?

@BenjaminBossan
Collaborator

Maybe we could find a solution for your problem if you tell us in more detail what your use case is. For example, when loading everything into memory is an issue, there is a solution where your X is just a numpy array of indices (or file names), which will work with sklearn. Then you can write a custom Dataset that will return the data indicated by the index/name when __getitem__ is called.
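A minimal sketch of that indices-as-X idea, assuming the real samples live in an in-memory dict keyed by patient id; all names below are invented for illustration (with skorch you would plug such a class in via the net's dataset argument, typically subclassing torch.utils.data.Dataset):

```python
import numpy as np

# Hypothetical in-memory store keyed by patient id (placeholder data).
data = {
    'p0': {'imgs': np.zeros((4, 4)), 'label': 0},
    'p1': {'imgs': np.ones((4, 4)), 'label': 1},
}

# X is a plain ndarray of ids, so sklearn can split and index it freely.
patients = np.array(['p0', 'p1'])
labels = np.array([data[p]['label'] for p in patients])

class LazyLookupDataset:
    """Resolves an id to the actual sample only when __getitem__ runs."""
    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, index):
        pid = self.X[index]          # look up the real data lazily
        return data[pid]['imgs'], self.y[index]
```

GridSearchCV then only ever sees the ndarray of ids, while the heavy data is fetched per sample inside the Dataset.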

@johnny5550822
Author

johnny5550822 commented May 11, 2018

@BenjaminBossan Thanks a lot for your help! So, I basically preload all my data into memory; it is 4-D data. I wrote a custom Dataset and a custom Transform because I want to do some specific data manipulation. Do you see anything below that would prevent using GridSearchCV (I did encounter an error)?

I wrote my getitem as follows:

def __getitem__(self, index):
    """
    imagetype x depth x height x width <-- we treat imagetype as the color channel

    :param index: the index of the item
    """
    idx = self.patients[index]
    img = self.data[idx]['imgs']  # can change to imgs_precise
    label = self.data[idx]['label']

    # transform the data first
    if self.transform:
        img = self.transform(img)

    # permute the axes
    if type(img) is torch.Tensor:
        img = img.permute(3, 2, 0, 1)
    else:
        img = np.transpose(img, (3, 2, 0, 1))

    return img, label

@BenjaminBossan
Collaborator

Okay, since your data is completely in memory, there should be a solution. What you probably don't know is that NeuralNet has a dataset argument. This is where you should pass your custom dataset. This way, your data (ndarray, dict, ...) can be transformed to a dataset after sklearn has touched it.

So for your case specifically, something like this should work:

# SliceDict is like a Python dict, but works with GridSearchCV
X = skorch.helper.SliceDict(data=data, patients=patients)
y = data['label']

class MyDataset(skorch.dataset.Dataset):
    def __getitem__(self, index):
        idx = self.X['patients'][index]
        img = self.X['data'][idx]['imgs']
        label = self.y[index]
        ...

net = NeuralNet(..., dataset=MyDataset)
net.fit(X, y)

I don't know exactly what your data looks like, thus there might be some more minor adjustments you need to make, but at the end of the day this should work.

Note 1: We generally recommend fitting with Datasets only if there is no other way (e.g. when the data cannot be loaded into memory at once).

Note 2: If you perform random image augmentation within your dataset, you should be careful since the same augmentations would also be applied during prediction, which is not always what you want.

@johnny5550822
Author

Great, thanks @BenjaminBossan. Let me try this and see how it goes.
