
Custom dataset in skorch when using sklearn GridSearchCV? #212

Closed
johnny5550822 opened this issue May 9, 2018 · 6 comments

Comments

@johnny5550822

johnny5550822 commented May 9, 2018

According to the tutorial (https://nbviewer.jupyter.org/github/dnouri/skorch/blob/master/notebooks/Advanced_Usage.ipynb), skorch supports custom Dataset objects (trainset in this case) in training, which is perfect. As long as I provide the y, I can train the model:

net.fit(trainset, y=trainset_label)

However, when I tried doing a grid search using sklearn, I got an inconsistent X, y dimension error. The code that I had is below:

params = {
    'module__num_units_fc3': [84, 42, 21],
}
gs = GridSearchCV(net, params, cv=5, scoring='roc_auc')
gs.fit(trainset, y=trainset_label)

Error:
X and y have inconsistent lengths.

Is the custom dataset being supported in skorch when using sklearn GridSearchCV?

@BenjaminBossan
Collaborator

Getting pytorch Datasets to work with GridSearchCV is not trivially possible. The problem is that eventually, the Dataset leaves the skorch domain and is handled directly by sklearn. sklearn only works with a couple of data types (ndarray, scipy sparse, pandas DataFrame), so you will encounter an error sooner or later.

My recommendation for now would be to try out a different data format that works with GridSearchCV, e.g. one of the formats mentioned above. Additionally, by using skorch.helper.SliceDict, you can even run GridSearchCV with dicts.
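To illustrate why a dict of arrays works once it is wrapped: sklearn's cross-validation machinery only needs X to support len() and row-wise indexing. The class below is a simplified stand-in sketching that contract, not skorch's actual SliceDict implementation; names are invented for the example.

```python
import numpy as np

# sklearn's CV splitters only need X to support len() and row-wise
# indexing, which a plain dict of arrays lacks. skorch.helper.SliceDict
# provides exactly that; this class is a simplified stand-in to
# illustrate the contract (not skorch's real implementation).
class SliceDictSketch:
    def __init__(self, **data):
        lengths = {len(v) for v in data.values()}
        assert len(lengths) == 1, "all values must share the first dimension"
        self._data = data

    def __len__(self):
        # number of samples, not number of keys
        return len(next(iter(self._data.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._data[key]  # normal dict access by name
        # slice/array indexing selects the same rows from every value
        return SliceDictSketch(**{k: v[key] for k, v in self._data.items()})
```

With the real skorch.helper.SliceDict, such an X can be passed straight to GridSearchCV, and each key is forwarded to the module as a keyword argument.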

That being said, we could surely think about providing a helper class that wraps a Dataset and can be handled by sklearn (similar to skorch.helper.SliceDict). However, it will be almost impossible to handle all cases. Say you want to load your data lazily. Since GridSearchCV wants to make a cv split on your data, it will more or less have to load everything at once, so this cannot work.

Minor note: You only need to provide the y explicitly to fit if you need a stratified CVSplit; otherwise, fitting should work with just the dataset.

@johnny5550822
Author

@BenjaminBossan Thanks for the clarification. Does that mean that if I want to use sklearn GridSearchCV, I most likely have to either (1) load all the data into memory and store it as numpy so that I can use GridSearchCV, or (2) ...[no idea]? My custom Dataset follows almost the same pattern as the skorch tutorial (https://nbviewer.jupyter.org/github/dnouri/skorch/blob/master/notebooks/Advanced_Usage.ipynb):

    def __getitem__(self, i):
        return self.xx[i], self.yy[i]

except that xx[i] and yy[i] are pytorch cuda tensors. One of the advantages of using skorch is that it is sklearn-compatible, which is very attractive. Hmm, if I use my custom dataset (like above), do I have no way to work with sklearn GridSearchCV?

@BenjaminBossan
Collaborator

Maybe we could find a solution for your problem if you tell us in more detail what your use case is. For example, when loading everything into memory is an issue, there is a solution where your X is just a numpy array of indices (or file names), which will work with sklearn. Then you can write a custom Dataset that will return the data indicated by the index/name when __getitem__ is called.
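A minimal sketch of that indices-as-X idea, assuming the real samples live in an in-memory dict keyed by patient id; all names below are invented for illustration (with skorch you would plug such a class in via the net's dataset argument, typically subclassing torch.utils.data.Dataset):

```python
import numpy as np

# Hypothetical in-memory store keyed by patient id (placeholder data).
data = {
    'p0': {'imgs': np.zeros((4, 4)), 'label': 0},
    'p1': {'imgs': np.ones((4, 4)), 'label': 1},
}

# X is a plain ndarray of ids, so sklearn can split and index it freely.
patients = np.array(['p0', 'p1'])
labels = np.array([data[p]['label'] for p in patients])

class LazyLookupDataset:
    """Resolves an id to the actual sample only when __getitem__ runs."""
    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, index):
        pid = self.X[index]          # look up the real data lazily
        return data[pid]['imgs'], self.y[index]
```

GridSearchCV then only ever sees the ndarray of ids, while the heavy data is fetched per sample inside the Dataset.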

@johnny5550822
Author

johnny5550822 commented May 11, 2018

@BenjaminBossan Thanks a lot for your help! So, I basically preload all my data into memory; it is 4-D data. I wrote a custom Dataset and a custom Transform because I want to do some specific data manipulation. Do you see anything below that would prevent using GridSearchCV (I did encounter an error)?

I wrote my getitem as follows:

def __getitem__(self, index):
    """
    imagetype x depth x height x width <-- we treat imagetype as the color channel

    :param index: the index of the item
    """
    idx = self.patients[index]
    img = self.data[idx]['imgs']  # can change to imgs_precise
    label = self.data[idx]['label']

    # transform the data first
    if self.transform:
        img = self.transform(img)

    # permute the axes
    if type(img) is torch.Tensor:
        img = img.permute(3, 2, 0, 1)
    else:
        img = np.transpose(img, (3, 2, 0, 1))

    return img, label

@BenjaminBossan
Collaborator

Okay, since your data is completely in memory, there should be a solution. What you probably don't know is that NeuralNet has a dataset argument. This is where you should pass your custom dataset. This way, your data (ndarray, dict, ...) can be transformed to a dataset after sklearn has touched it.

So for your case specifically, something like this should work:

# SliceDict is like a Python dict, but works with GridSearchCV
X = skorch.helper.SliceDict(data=data, patients=patients)
y = data['label']

class MyDataset(skorch.dataset.Dataset):
    def __getitem__(self, index):
        idx = self.X['patients'][index]
        img = self.X['data'][idx]['imgs']
        label = self.y[index]
        ...

net = NeuralNet(..., dataset=MyDataset)
net.fit(X, y)

I don't know exactly what your data looks like, thus there might be some more minor adjustments you need to make, but at the end of the day this should work.

Note 1: We generally recommend fitting with Datasets only if there is no other way (e.g. when the data cannot be loaded into memory at once).

Note 2: If you perform random image augmentation within your dataset, you should be careful since the same augmentations would also be applied during prediction, which is not always what you want.

@johnny5550822
Author

Great, thanks @BenjaminBossan. Let me try this and see how it goes.
