
Error while creating data #1

Closed
saurabh502 opened this issue Dec 15, 2018 · 7 comments

Comments

@saurabh502

Hi, I am getting the error below when running the following code.

code:

data = (
    ImageItemList
        .from_folder('data/whale/input/train')
        .random_split_by_pct(seed=SEED)
        .label_from_func(lambda path: fn2label[path.name])
        .add_test(ImageItemList.from_folder('data/whale/input/test'))
        .transform(get_transforms(do_flip=False, max_zoom=1, max_warp=0, max_rotate=2), size=SZ, resize_method=ResizeMethod.SQUISH)
        .databunch(bs=BS, num_workers=NUM_WORKERS, path='data/whale/input')
)

error:

KeyError Traceback (most recent call last)
~\fastai1\fastai\courses\dl1\fastai\data_block.py in process_one(self, item)
277 def process_one(self,item):
--> 278 try: return self.c2i[item] if item is not None else None
279 except:

KeyError: 'w_d8a08f8'

During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last)
in
3 .from_folder('data/whale/input/train')
4 .random_split_by_pct(seed=SEED)
----> 5 .label_from_func(lambda path: fn2label[path.name])
6 .add_test(ImageItemList.from_folder('data/whale/input/test'))
7 .transform(get_transforms(do_flip=False, max_zoom=1, max_warp=0, max_rotate=2), size=SZ, resize_method=ResizeMethod.SQUISH)

~\fastai1\fastai\courses\dl1\fastai\data_block.py in _inner(*args, **kwargs)
391 self.valid = fv(*args, **kwargs)
392 self.__class__ = LabelLists
--> 393 self.process()
394 return self
395 return _inner

~\fastai1\fastai\courses\dl1\fastai\data_block.py in process(self)
438 "Process the inner datasets."
439 xp,yp = self.get_processors()
--> 440 for i,ds in enumerate(self.lists): ds.process(xp, yp, filter_missing_y=i==0)
441 return self
442

~\fastai1\fastai\courses\dl1\fastai\data_block.py in process(self, xp, yp, filter_missing_y)
563 def process(self, xp=None, yp=None, filter_missing_y:bool=False):
564 "Launch the processing on self.x and self.y with xp and yp."
--> 565 self.y.process(yp)
566 if filter_missing_y and (getattr(self.x, 'filter_missing_y', None)):
567 filt = array([o is None for o in self.y])

~\fastai1\fastai\courses\dl1\fastai\data_block.py in process(self, processor)
66 if processor is not None: self.processor = processor
67 self.processor = listify(self.processor)
---> 68 for p in self.processor: p.process(self)
69 return self
70

~\fastai1\fastai\courses\dl1\fastai\data_block.py in process(self, ds)
284 ds.classes = self.classes
285 ds.c2i = self.c2i
--> 286 super().process(ds)
287
288 def __getstate__(self): return {'classes':self.classes}

~\fastai1\fastai\courses\dl1\fastai\data_block.py in process(self, ds)
36 def __init__(self, ds:Collection=None): self.ref_ds = ds
37 def process_one(self, item:Any): return item
---> 38 def process(self, ds:Collection): ds.items = array([self.process_one(item) for item in ds.items])
39
40 class ItemList():

~\fastai1\fastai\courses\dl1\fastai\data_block.py in <listcomp>(.0)
36 def __init__(self, ds:Collection=None): self.ref_ds = ds
37 def process_one(self, item:Any): return item
---> 38 def process(self, ds:Collection): ds.items = array([self.process_one(item) for item in ds.items])
39
40 class ItemList():

~\fastai1\fastai\courses\dl1\fastai\data_block.py in process_one(self, item)
278 try: return self.c2i[item] if item is not None else None
279 except:
--> 280 raise Exception("Your validation data contains a label that isn't present in the training set, please fix your data.")
281
282 def process(self, ds):

Exception: Your validation data contains a label that isn't present in the training set, please fix your data.

Thanks in advance for your help!

@radekosmulski
Owner

Did you change the SEED value? I wonder if this might be because you have a different directory structure than the one in the readme. Could you try running this with a directory structure as in the readme?

@josemontiel

I'm new to fastai, so thanks for sharing this repo! I'm having the same issue; I changed my file structure to match yours, but the error persists.

@radekosmulski
Owner

I won't have access to my computer until Monday. If no one finds a solution to this by then, I will probably just move the creation of a better validation set from the later NBs to the first one.

In the meantime, if you'd like to play with this, you could jump over to the later NBs, or try moving the way the validation set is created there to here.

The other validation set is much less forgiving than this one so don't worry if you get a much poorer score locally.

@josemontiel

@radekosmulski playing with it a bit, and with the help of the fastai documentation, it works if you call no_split instead of random_split_by_pct. So it seems to me that the issue might be caused by whales that have only a single image: that one image can fall into the validation set instead of the training set, hence the missing label. Thoughts?
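A quick way to check this hypothesis is to count how many labels occur only once in the training data. This is a sketch in plain Python; the real notebook builds fn2label (filename → whale id) from train.csv, so the toy dict below is a stand-in:

```python
from collections import Counter

# Toy stand-in for the notebook's fn2label mapping (filename -> whale id).
fn2label = {
    'a.jpg': 'w_1', 'b.jpg': 'w_1',
    'c.jpg': 'w_2',               # w_2 has only a single image
    'd.jpg': 'w_3', 'e.jpg': 'w_3',
}

counts = Counter(fn2label.values())
singletons = [whale for whale, n in counts.items() if n == 1]

# Any random split can send a singleton's only image to the validation
# set, leaving that label absent from training -> the KeyError above.
print(singletons)  # -> ['w_2']
```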

@radekosmulski
Owner

You are right, that is exactly what is happening. Some whales have only a single image in the train set, and this error message implies that there exists a whale all of whose images (be that one or more) got assigned to the validation set.
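One possible workaround (a sketch, not the notebook's actual code) is to draw validation indices only from whales that have at least two images, so every class keeps at least one image in training. The helper below is hypothetical; in fastai v1 the resulting indices could then be passed to the data block via a split-by-index method instead of random_split_by_pct, though wiring it into this exact notebook is an assumption:

```python
import random
from collections import Counter

def safe_valid_idxs(labels, valid_pct=0.2, seed=42):
    """Pick validation indices only among labels with >= 2 samples,
    so no class loses all of its images to the validation set."""
    counts = Counter(labels)
    eligible = [i for i, lab in enumerate(labels) if counts[lab] > 1]
    rng = random.Random(seed)
    rng.shuffle(eligible)
    n_valid = int(valid_pct * len(labels))
    return sorted(eligible[:n_valid])

# Toy labels: 'w_2' (index 2) appears once, so it can never be chosen.
labels = ['w_1', 'w_1', 'w_2', 'w_3', 'w_3']
val_idxs = safe_valid_idxs(labels, valid_pct=0.4)
assert 2 not in val_idxs
```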

@radekosmulski
Owner

I understand why the issue is happening, but I do not know why, despite having the same seed and folder structure, some people are getting the error.

My only thought is that maybe they removed some files or are using data from the playground competition.

I started to change the first_submission notebook to sample the validation set like in the later notebooks, but this makes the first_submission notebook overly complex.

Either way, I am not going to make the changes; I will add a note to the readme instead. If anyone is encountering the issue, please skip to only_known_train.ipynb. By running all cells in that notebook you should get to ~0.760 on the LB. All other notebooks should work for you even if you are having issues with the first_submission one.

I think there is more value in keeping the first_submission NB simple and showing the natural evolution of code as I work towards solving a classification problem rather than backporting the creation of the validation set.

@radekosmulski
Owner

Additional reasoning behind not making this change is posted on Kaggle here.
