<a href="https://colab.research.google.com/github/skyprince999/100-Days-Of-ML/blob/master/Day_13_Imbalanced_Datasets_in_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In many machine learning competitions we come across imbalanced datasets. In a rare-disease identification challenge, for example, the positive class may make up just 0.1% of the entire dataset. This is a challenge for most ML practitioners, because a model can score very high accuracy simply by always predicting the majority class.

The usual way to overcome this is to balance the dataset in one of two ways:


*   Undersampling the majority class
*   Oversampling the minority class

Undersampling the majority class generally leads to a loss of information, because most of the majority examples are thrown away. Oversampling the minority class, on the other hand, can lead to overfitting to the few minority examples that get repeated.
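To make the trade-off concrete, here is a minimal sketch of both strategies on a toy label list, using only the Python standard library. The helper names `undersample` and `oversample` and the 95/5 class split are made up for illustration; they are not part of any library used in this notebook.

```python
import random
from collections import Counter


def group_by_class(labels):
    """Map each class label to the list of indices that carry it."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    return by_class


def undersample(labels, seed=0):
    """Keep only as many indices per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = group_by_class(labels)
    target = min(len(idxs) for idxs in by_class.values())
    picked = []
    for idxs in by_class.values():
        # Sampling without replacement: the rest of the class is discarded.
        picked.extend(rng.sample(idxs, target))
    return sorted(picked)


def oversample(labels, seed=0):
    """Repeat indices of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = group_by_class(labels)
    target = max(len(idxs) for idxs in by_class.values())
    picked = []
    for idxs in by_class.values():
        picked.extend(idxs)
        # Sampling with replacement: the same few examples are duplicated.
        picked.extend(rng.choices(idxs, k=target - len(idxs)))
    return sorted(picked)


# Hypothetical toy labels: 95 majority examples (0) and 5 minority examples (1).
labels = [0] * 95 + [1] * 5
under = undersample(labels)
over = oversample(labels)

print(Counter(labels[i] for i in under))  # Counter({0: 5, 1: 5})
print(Counter(labels[i] for i in over))   # Counter({0: 95, 1: 95})
```

Undersampling leaves only 10 of the 100 examples, while oversampling repeats each of the 5 minority examples many times over, which is where the information loss and the overfitting risk come from.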

In this Colab notebook, we use `ImbalancedDatasetSampler`, a sampler for PyTorch from the third-party `torchsampler` package, to do the following:



1.   Rebalance the class distribution when sampling from the imbalanced dataset
2.   Estimate the sampling weights automatically
3.   Avoid creating a new balanced dataset
4.   Mitigate overfitting through the use of data augmentation
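The first three points can be illustrated without PyTorch at all. The sketch below assumes inverse class frequency as the weighting rule, which is a common choice but is an assumption here rather than something taken from the library's source. The function name `inverse_frequency_weights` and the 990/10 split are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give every sample a weight of 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical toy labels: 990 majority examples (0) and 10 minority examples (1).
labels = [0] * 990 + [1] * 10
weights = inverse_frequency_weights(labels)

# Draw indices in proportion to the weights. The dataset itself is never
# copied or resized; only the order and frequency of index draws changes.
rng = random.Random(0)
drawn = rng.choices(range(len(labels)), weights=weights, k=10_000)

drawn_counts = Counter(labels[i] for i in drawn)
minority_share = drawn_counts[1] / len(drawn)

print(f"minority share in the data:  {labels.count(1) / len(labels):.3f}")
print(f"minority share when sampled: {minority_share:.3f}")
```

Because each class's weights sum to the same total, both classes are drawn about equally often, even though the minority class is only 1% of the underlying data.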








In [1]:
# First, clone the repo into the current working directory

!git clone https://github.com/ufoym/imbalanced-dataset-sampler

Cloning into 'imbalanced-dataset-sampler'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects:  72% (8/11) ...

In [2]:
!ls

imbalanced-dataset-sampler  sample_data


In [5]:
!ls imbalanced-dataset-sampler

examples  MANIFEST.in  requirements.txt  torchsampler
LICENSE   README.md    setup.py


In [6]:
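# This attempt fails: setup.py is run from outside the repo, so the relative 'torchsampler' package directory is not found
!python imbalanced-dataset-sampler/setup.py install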

Partial import during the build process.
running install
running bdist_egg
running egg_info
creating torchsampler.egg-info
writing torchsampler.egg-info/PKG-INFO
writing dependency_links to torchsampler.egg-info/dependency_links.txt
writing requirements to torchsampler.egg-info/requires.txt
writing top-level names to torchsampler.egg-info/top_level.txt
writing manifest file 'torchsampler.egg-info/SOURCES.txt'
error: package directory 'torchsampler' does not exist


In [7]:
!pip install imbalanced-dataset-sampler/.

Processing ./imbalanced-dataset-sampler
Building wheels for collected packages: torchsampler
  Building wheel for torchsampler (setup.py) ... done
  Created wheel for torchsampler: filename=torchsampler-0.1-cp36-none-any.whl size=3634 sha256=a2b4ad34d4ceb35792f249a52eae9f50190bdd2723de505d54e4bdb5281d1ffd
  Stored in directory: /root/.cache/pip/wheels/38/2b/6a/c92da1292ef596800afc50058a85ca91c768176288a586ecbe
Successfully built torchsampler
Installing collected packages: torchsampler
Successfully installed torchsampler-0.1


In [0]:
import torch
from torchsampler import ImbalancedDatasetSampler

In [0]:
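# `train_dataset`, `args.batch_size`, and `kwargs` are assumed to be defined
# earlier (for example, a torchvision dataset and parsed command-line arguments).
# A custom sampler cannot be combined with shuffle=True, so leave shuffle unset.
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    sampler=ImbalancedDatasetSampler(train_dataset),
    batch_size=args.batch_size,
    **kwargs
)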
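The cell above depends on objects that are not defined in this notebook. As a self-contained point of comparison, here is a minimal sketch of the same idea using PyTorch's built-in `WeightedRandomSampler` on a toy `TensorDataset`. This is a stand-in, not the `torchsampler` implementation; the toy data, the 950/50 split, and the inverse-frequency weights are all assumptions made for illustration.

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical toy data: 950 majority samples (class 0), 50 minority (class 1).
features = torch.randn(1000, 4)
targets = torch.cat([
    torch.zeros(950, dtype=torch.long),
    torch.ones(50, dtype=torch.long),
])
toy_dataset = TensorDataset(features, targets)

# One weight per sample: the inverse of its class count (an assumed rule).
class_counts = torch.bincount(targets)
sample_weights = 1.0 / class_counts[targets].float()

# Draw as many indices per epoch as the dataset has samples, with replacement.
sampler = WeightedRandomSampler(
    sample_weights, num_samples=len(toy_dataset), replacement=True
)
toy_loader = DataLoader(toy_dataset, sampler=sampler, batch_size=100)

# Count how often each class shows up across one epoch of batches.
seen = Counter()
for _, batch_targets in toy_loader:
    seen.update(batch_targets.tolist())

print(seen)
```

Over one epoch the two classes should appear in roughly equal numbers, although the underlying dataset is still 95% class 0.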