Tutorial on imbalanced datasets #236

ptrblck · 2018-05-06T23:38:59Z

This tutorial deals with the problem of an imbalanced dataset and how to train a classifier on it.

After training a CNN on the original CIFAR10 dataset, we resample it to create an artificially imbalanced dataset. Since the CNN performs quite poorly on this new dataset, we tackle the problem first with a WeightedRandomSampler and then with a weighted criterion.
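The sampler step described above can be sketched roughly as follows. This is a minimal illustration, not the tutorial's code: the synthetic labels, the dummy feature tensor, and the inverse-frequency weighting are assumptions chosen to mimic a dataset where five classes are ten times rarer than the other five.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Synthetic stand-in for the imbalanced CIFAR10 labels (assumption):
# classes 0-4 have 100 samples each, classes 5-9 have 1000 each.
class_sizes = [100] * 5 + [1000] * 5
labels = torch.cat([torch.full((n,), c, dtype=torch.long)
                    for c, n in enumerate(class_sizes)])
features = torch.randn(len(labels), 8)  # dummy inputs instead of images

# Weight each sample by the inverse frequency of its class, so every class
# carries the same total sampling probability.
class_counts = torch.bincount(labels, minlength=len(class_sizes))
sample_weights = 1.0 / class_counts[labels].double()

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)
loader = DataLoader(TensorDataset(features, labels),
                    batch_size=128, sampler=sampler)

# Count how often each class is drawn during one pass over the loader.
drawn = torch.zeros(len(class_sizes), dtype=torch.long)
for _, batch_labels in loader:
    drawn += torch.bincount(batch_labels, minlength=len(class_sizes))

minority_fraction = drawn[:5].sum().item() / drawn.sum().item()
print(f"fraction of draws from the 5 rare classes: {minority_fraction:.2f}")
```

Without the sampler the rare classes would make up about 9% of each epoch; with inverse-frequency weights they should make up roughly half, at the cost of repeating rare samples many times.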

I've created the tutorial in the intermediate section, but I'm not sure if it's the right place.

Feedback regarding the text and code is very welcome!

intermediate_source/imbalanced_data_tutorial.py

+###############################################################################
+# Let's have a look at the class distribution in the datasets.
+
+# Get all training targets and count the number of class instances


intermediate_source/imbalanced_data_tutorial.py

+# The last 5 classes will keep their samples.
+
+# Create class proportions
+imbal_class_prop = np.hstack(([0.1] * 5, [1.0] * 5))
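One way such class proportions could be turned into an imbalanced subset is sketched below. The helper name get_imbalanced_indices and the synthetic targets array are hypothetical and not taken from the PR; only the proportions array follows the hunk above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the CIFAR10 training targets: 10 classes, 5000 samples each.
targets = np.repeat(np.arange(10), 5000)

# Proportions as in the hunk above: the first 5 classes keep 10% of their
# samples, the last 5 classes keep all of them.
imbal_class_prop = np.hstack(([0.1] * 5, [1.0] * 5))


def get_imbalanced_indices(targets, class_prop, rng):
    """Hypothetical helper: pick a random fraction of indices per class."""
    keep = []
    for cls, prop in enumerate(class_prop):
        cls_idx = np.flatnonzero(targets == cls)
        n_keep = int(round(len(cls_idx) * prop))
        keep.append(rng.choice(cls_idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))


imbal_idx = get_imbalanced_indices(targets, imbal_class_prop, rng)
imbal_counts = np.bincount(targets[imbal_idx], minlength=10)
print(imbal_counts)
```

The resulting index array could then be used to slice a dataset's data and labels, or be passed to torch.utils.data.Subset.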


chsasank

I have roughly gone through half of the tutorial.

I also have some minor comments with formatting etc. But I'll create a PR to your branch later with those changes rather than commenting.

intermediate_source/imbalanced_data_tutorial.py

+# The last 5 classes will keep their samples.
+
+# Create class proportions
+imbal_class_prop = np.hstack(([0.1] * 5, [1.0] * 5))


ptrblck · 2018-05-14T20:36:36Z

@chsasank Thank you for the review! I've added your suggestions. Let me know what you think about the changes.

intermediate_source/imbalanced_data_tutorial.py

+# Let's have a look at the class distribution in the datasets.
+
+
+def get_labels_and_class_counts(labels_list):
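The body of get_labels_and_class_counts is not shown in this hunk. Judging only from how it is called later in the thread (it returns the targets and the per-class counts), a plausible sketch might look like this; the use of np.unique is an assumption.

```python
import numpy as np


def get_labels_and_class_counts(labels_list):
    """Return the labels as an array plus the number of samples per class.

    Assumed behaviour, inferred from the call sites in this PR.
    """
    labels = np.array(labels_list)
    # np.unique sorts the class ids and counts the occurrences of each one.
    _, class_counts = np.unique(labels, return_counts=True)
    return labels, class_counts


labels, class_counts = get_labels_and_class_counts([0, 1, 1, 2, 2, 2])
print(class_counts)
```

Note that np.unique only reports classes that actually occur, so a class with zero samples would be missing from the counts; np.bincount with minlength would be an alternative if that matters.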


intermediate_source/imbalanced_data_tutorial.py

+f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15, 6))
+ax1.bar(class_names, train_class_counts)
+ax1.set_title('Training dataset distribution')
+ax1.set_xlabel('Classes')


intermediate_source/imbalanced_data_tutorial.py

+optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
+
+
+def train(epoch):
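The body of train is elided in this hunk. The sketch below shows what one epoch could look like with the quoted optimizer settings, combined with the weighted criterion mentioned in the PR description. The linear model, the synthetic two-class data, and the inverse-frequency class weights are assumptions made to keep the example small.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Tiny synthetic two-class problem standing in for the CIFAR10 loader
# (assumption): class 0 is ten times rarer than class 1.
labels = torch.cat([torch.zeros(20, dtype=torch.long),
                    torch.ones(200, dtype=torch.long)])
inputs = torch.randn(len(labels), 16)
train_loader = DataLoader(TensorDataset(inputs, labels),
                          batch_size=32, shuffle=True)

model = nn.Linear(16, 2)  # placeholder for the tutorial's CNN
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# Weighted criterion: weight each class by its inverse frequency, so errors
# on the rare class contribute more to the loss.
class_counts = torch.bincount(labels).float()
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)


def train(epoch):
    """Run one pass over train_loader and return the mean batch loss."""
    model.train()
    running_loss = 0.0
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    avg_loss = running_loss / len(train_loader)
    print(f"epoch {epoch}: mean batch loss {avg_loss:.4f}")
    return avg_loss


losses = [train(epoch) for epoch in range(1, 4)]
```

With 20 and 200 samples the class weights come out as 5.5 and 0.55, so a misclassified rare sample counts ten times as much as a misclassified common one.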


chsasank · 2018-05-26T15:24:39Z

Thanks, I think this looks good. Let me add some small formatting changes to your branch.

Angelina1996 · 2020-07-01T12:20:40Z

intermediate_source/imbalanced_data_tutorial.py

+# Let's have a look at the class distribution in the datasets.
+
+# Get all training targets and count the number of class instances
+train_targets = np.array(train_dataset.train_labels)


I'm getting the following error on this line:
AttributeError: 'CIFAR10' object has no attribute 'train_labels'
Could you please help with this?

facebook-github-bot · 2020-10-30T17:35:30Z

Hi @ptrblck!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

facebook-github-bot · 2021-03-21T15:05:10Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

kyscg

Updated the attribute names: in recent torchvision versions, the labels of the training samples in torchvision.datasets.CIFAR10 are stored in targets instead of train_labels.

kyscg · 2021-10-25T07:48:51Z

intermediate_source/imbalanced_data_tutorial.py

+
+# Get all training targets and count the number of class instances
+train_targets, train_class_counts = get_labels_and_class_counts(
+    train_dataset.train_labels)


Suggested change

-    train_dataset.train_labels)
+    train_dataset.targets)

kyscg · 2021-10-25T07:49:11Z

intermediate_source/imbalanced_data_tutorial.py

+        '''
+        if self.train:
+            targets, class_counts = get_labels_and_class_counts(
+                self.dataset.train_labels)


Suggested change

-                self.dataset.train_labels)
+                self.dataset.targets)
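For code that has to run against both older and newer torchvision releases, the rename discussed above could be absorbed by a small accessor. The helper get_dataset_labels and the two dummy dataset classes are hypothetical and only illustrate the fallback; they are not part of torchvision or of this PR.

```python
def get_dataset_labels(dataset):
    """Return a dataset's labels, whichever attribute name it exposes.

    Hypothetical helper: tries the newer `targets` attribute first and
    falls back to the older `train_labels`.
    """
    for attr in ("targets", "train_labels"):
        if hasattr(dataset, attr):
            return list(getattr(dataset, attr))
    raise AttributeError(
        "dataset exposes neither 'targets' nor 'train_labels'")


class NewStyleDataset:
    """Dummy object mimicking a dataset that stores labels in `targets`."""
    def __init__(self):
        self.targets = [0, 1, 1]


class OldStyleDataset:
    """Dummy object mimicking a dataset that stores labels in `train_labels`."""
    def __init__(self):
        self.train_labels = [2, 2, 0]


new_labels = get_dataset_labels(NewStyleDataset())
old_labels = get_dataset_labels(OldStyleDataset())
print(new_labels, old_labels)
```

If only recent torchvision versions need to be supported, using dataset.targets directly, as in the suggested changes, is simpler.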

svekars · 2023-03-24T20:49:21Z

Closing this as it's been quite some time since it was created and no longer relevant.

initial commit

37bfb4d

chsasank self-assigned this May 8, 2018

chsasank reviewed May 10, 2018

added suggestions from pytorch#236

bebeb44

chsasank reviewed May 19, 2018

chsasank reviewed May 19, 2018

ptrblck added 3 commits May 23, 2018 00:04

-wrapped some code in functions

aa41525

-renaming (pytorch#236)

6b77297

-updated model (pytorch#236)

5d0aefd

Angelina1996 reviewed Jul 1, 2020

facebook-github-bot added the cla signed label Nov 2, 2020

Base automatically changed from master to main February 16, 2021 19:32

Base automatically changed from main to master February 16, 2021 19:37

kyscg suggested changes Oct 25, 2021

svekars closed this Mar 24, 2023
