<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Introduction</p>


When loading data with PyTorch, we mostly rely on the default data sampling technique. Yet data sampling is one of the factors that affect training stability. In this notebook I will show you the data sampling methods available in PyTorch and how to write a custom data sampler for your dataloader.

Sequence bucketing is another method which I will describe and implement in this notebook. It speeds up model training by padding each batch dynamically instead of padding every sample to one fixed length.

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Data Samplers</p>

The purpose of a data sampler is to determine how batches of data are formed from the given pool of data with a given batch size. It is also responsible for determining the order in which the dataset is fed into the model for learning.


![](https://www.programmersought.com/images/340/df42b44fcd8753e385e6413c3354ae8c.png)


- When the dataloader is initialized, a sampler is passed to it (`SequentialSampler` by default, or `RandomSampler` when `shuffle=True`). The sampler first creates the order in which the samples in the dataset are accessed by index, i.e. some ordering of (0, 1, 2, ..., N-1), where N is the size of the dataset.
- Then, using this sequence of indices, the data is pulled from the dataset for each batch with the given batch size.
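The two steps above can be reproduced on a toy example. This is a minimal sketch that uses `SequentialSampler` and `BatchSampler` from `torch.utils.data`, with a plain `range(10)` standing in for a real dataset (an assumption made only for illustration).

```python
from torch.utils.data import BatchSampler, SequentialSampler

toy_data = range(10)  # stand-in for a dataset of 10 samples

# Step 1: the sampler decides the order in which indices are visited.
sampler = SequentialSampler(toy_data)
index_order = list(sampler)

# Step 2: the indices are grouped into batches of the requested size.
batches = list(BatchSampler(sampler, batch_size=4, drop_last=False))

print(index_order)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(batches)      # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With `drop_last=False` the final, smaller batch is kept; setting it to `True` would discard `[8, 9]`.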

So, let's see some of the data samplers available in PyTorch.

In [None]:
import pandas as pd
import numpy as np
from torch.utils.data import DataLoader,Dataset
from transformers import AutoTokenizer
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns
import torch
from torch.utils.data import Sampler,SequentialSampler,RandomSampler,SubsetRandomSampler
from collections import defaultdict
sns.set_theme()  # seaborn styling; the 'seaborn' matplotlib style name is no longer available in recent matplotlib releases
seed = 42

I will read the data and split it into bins like in this notebook.
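To see what the binning step does, here is a small sketch of `pd.cut` with `labels=False` on made-up target values (the numbers are illustrative, not taken from the competition data). It splits the value range into equal-width intervals and returns the integer index of the interval each value falls into.

```python
import pandas as pd

# Toy regression targets (made-up values, not from the competition data).
targets = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])

# labels=False returns the integer bin index instead of an Interval label.
toy_bins = pd.cut(targets, bins=2, labels=False)

print(toy_bins.tolist())  # [0, 0, 0, 1, 1]
```

These integer bins can then act as class labels for a continuous target, which is what makes stratified splitting and per-bin counting possible later on.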

In [None]:
train_data = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
# num_bins = int(np.floor(1 + np.log2(len(train_data))))
num_bins = 5
train_data.loc[:,'bins'] = pd.cut(train_data['target'],bins=num_bins,labels=False)
bins = train_data.bins.to_numpy()
target = train_data.target.to_numpy()


In [None]:
x = train_data.bins.value_counts()*100/train_data.shape[0]
sns.barplot(x=list(map(str,x.index)),y=x.values,palette='flare')  # keyword arguments are required by recent seaborn versions
plt.plot(x.values,marker='*',color='blue')
plt.gca().set_ylabel("percentage of samples",fontsize=14)
plt.gca().set_title("Bins distribution",fontsize=18)
plt.show()

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:left">Dataset</p>

I will make a simple pytorch dataset for our exercise which we can use to experiment with different sampling techniques.

In [None]:
class CLRPDataset(Dataset):
    def __init__(self,df,tokenizer,max_len=128):
        self.excerpt = df['excerpt'].to_numpy()
        self.targets = df['target'].to_numpy()
        self.bins = df['bins'].to_numpy()
        self.max_len = max_len
        self.tokenizer = tokenizer
    
    def __getitem__(self,idx):
        curr_sent = {}
        curr_sent['target'] = self.targets[idx]
        curr_sent['bin'] = self.bins[idx]
    
        return curr_sent
    
    def __len__(self):
        return len(self.excerpt)

In [None]:
tokenizer = AutoTokenizer.from_pretrained('roberta-base')

In [None]:
def get_fold_loader(sampler=None,indices=None):
    # Build a DataLoader over the full training set with the requested sampler.
    train_ds = CLRPDataset(train_data,tokenizer)
    
    if sampler=='random':
        # Indices are drawn uniformly at random, with replacement
        train_sampler = RandomSampler(train_ds,replacement=True)
    elif sampler=='sequential':
        train_sampler = SequentialSampler(train_ds)
    elif sampler=='subset':
        train_sampler = SubsetRandomSampler(indices)
    elif sampler=="weighted":
        train_sampler = weightedsampler(train_ds)
    else:
        # Default: shuffle=True makes the DataLoader shuffle the indices itself
        return DataLoader(train_ds,
                          shuffle=True,
                          batch_size=128,
                          drop_last=False)
    
    return DataLoader(train_ds,
                      sampler=train_sampler,
                      batch_size=128,
                      drop_last=False)
        



In [None]:
def get_batch_count(train_dataloader):
    # For every batch, count how many samples fall into each bin.
    all_batches={f'bin_{b}':[] for b in range(num_bins)}
    for data in train_dataloader:
        curr = Counter(data['bin'].tolist())
        for b in range(num_bins):
            all_batches[f'bin_{b}'].append(curr.get(b,0))

    return all_batches
        


In [None]:
import plotly.graph_objects as go

def plot_batches(all_batches):
    
    x = [str(x) for x in range(len(all_batches['bin_0']))]
    data_append = []
    for bin_no in range(num_bins):
        data_append.append(go.Bar(
                                   x=x,
                                   y=all_batches[f'bin_{bin_no}'],
                                   text=all_batches[f'bin_{bin_no}'],
                                   textposition='auto',name=f'bin_{bin_no}'))
        
        

        
        
    fig = go.Figure(data=data_append)
    fig.update_layout(title="Batch bin frequency", xaxis_title="Batch number", yaxis_title="Bin frequency",
                      xaxis_visible=True,  xaxis_showticklabels=True,  yaxis_visible=True,  yaxis_showticklabels=False,
                      xaxis_tickmode='linear', barmode='stack')
    fig.show()

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:left">Random Sampler</p>

First, let's try random sampling (the `shuffle=True` default of our helper): the indices are shuffled once per epoch, and each batch takes the next `batch_size` indices from that shuffled order.
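`RandomSampler` can either shuffle the indices or draw them with replacement. The sketch below compares the two on a toy `range(8)` used as a stand-in dataset; the fixed `torch.Generator` seed is an assumption added only to make reruns repeatable.

```python
import torch
from torch.utils.data import RandomSampler

toy_data = range(8)  # stand-in for a dataset of 8 samples
gen = torch.Generator().manual_seed(0)  # fixed seed so reruns give the same order

# Without replacement: a random permutation, every index appears exactly once.
no_repl = list(RandomSampler(toy_data, replacement=False, generator=gen))

# With replacement: indices are drawn independently, so repeats are possible.
with_repl = list(RandomSampler(toy_data, replacement=True, generator=gen))

print(sorted(no_repl) == list(range(8)))  # True
print(len(with_repl))                     # 8
```

With replacement, some samples may be seen several times in an epoch while others are skipped.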


In [None]:

train_dataloader = get_fold_loader()
all_batches = get_batch_count(train_dataloader)
plot_batches(all_batches)

You can see that because the samples are drawn randomly, the batches do not follow any fixed distribution: the class distribution differs noticeably from batch to batch, which can add noise to training.

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:left">SubsetRandom Sampler</p>

Samples elements randomly from a given list of indices, without replacement. This can be used with K-fold cross-validation to sample randomly from only the indices of the training fold.
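As a quick check of this behaviour, the sketch below runs `SubsetRandomSampler` on a hypothetical list of training-fold indices (the index values are made up for illustration).

```python
from torch.utils.data import SubsetRandomSampler

# Hypothetical training-fold indices: everything except rows 2 and 5.
train_indices = [0, 1, 3, 4, 6, 7]

sampler = SubsetRandomSampler(train_indices)
drawn = list(sampler)

# Each listed index is drawn exactly once; the held-out rows never appear.
print(sorted(drawn) == train_indices)  # True
```

The sampler yields the values in the list, not positions in it, so the list must contain valid indices into the dataset passed to the dataloader.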

In [None]:
train_data['Fold'] = -1
kfold = StratifiedKFold(n_splits=5,shuffle=True,random_state=seed)
for k , (train_idx,valid_idx) in enumerate(kfold.split(X=train_data,y=bins)):
    train_data.loc[valid_idx,'Fold'] = k


In [None]:
fold=1
x_train,x_valid = train_data.query(f"Fold != {fold}"),train_data.query(f"Fold == {fold}")
indices_to_sample = x_train.index.tolist()

In [None]:
train_dataloader = get_fold_loader(sampler='subset',indices=indices_to_sample)
all_batches = get_batch_count(train_dataloader)
plot_batches(all_batches)

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:left">Sequential Sampler</p>

This draws the samples sequentially, always in the same order.

In [None]:
train_dataloader = get_fold_loader(sampler='sequential')
all_batches = get_batch_count(train_dataloader)
plot_batches(all_batches)

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:left">Custom Sampler (weighted)</p>

To implement a custom sampler, you can inherit from the `Sampler` class in `torch.utils.data`.

This sampler is used to ensure that each batch sees roughly the same number of samples from every class, on average.
* Get all the target classes.
* Get the class weights. Class weights are the reciprocal of the number of items per class.
* Obtain corresponding weight for each target sample.

This will sample the data points from a multinomial distribution defined by these weights.
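The three steps above can also be sketched with PyTorch's built-in `WeightedRandomSampler`, which draws indices in proportion to the per-sample weights it is given. The labels below are toy values chosen for illustration, not the notebook's bins.

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

# Toy imbalanced labels: class 0 has six samples, class 1 has two.
labels = [0, 0, 0, 0, 0, 0, 1, 1]

# Steps 1-2: class weight = 1 / (number of items in that class).
class_count = Counter(labels)
# Step 3: look up the weight of each sample's class.
sample_weights = [1.0 / class_count[y] for y in labels]

sampler = WeightedRandomSampler(
    weights=torch.tensor(sample_weights, dtype=torch.double),
    num_samples=len(labels),
    replacement=True,
)
drawn = list(sampler)  # indices into `labels`, drawn with replacement

# Total weight per class: both classes end up with the same share.
mass = {c: sum(w for w, y in zip(sample_weights, labels) if y == c)
        for c in class_count}
print(mass)
```

Because each class carries the same total weight, every draw is equally likely to land in either class, regardless of how many samples the class contains.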

> the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for each side of a k-sided die rolled n times. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.

In [None]:
class weightedsampler(Sampler):
    
    def __init__(self,dataset):
        
        self.indices = list(range(len(dataset)))
        self.num_samples = len(dataset)
        # Number of samples in each bin
        self.label_to_count = dict(Counter(dataset.bins))
        # Per-sample weight: reciprocal of the size of the sample's bin
        weights = [1/self.label_to_count[i] for i in dataset.bins]
        
        self.weights = torch.tensor(weights,dtype=torch.double)
        
    def __iter__(self):
        # Draw num_samples indices with replacement, in proportion to the weights
        drawn = torch.multinomial(self.weights, self.num_samples, replacement=True)
        return (self.indices[i] for i in drawn.tolist())
    
    def __len__(self):
        return self.num_samples
        
        

In [None]:
train_dataloader = get_fold_loader(sampler='weighted')
all_batches = get_batch_count(train_dataloader)
plot_batches(all_batches)

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:left">Sequence Bucketing</p>


After the indices from the sampler are used to fetch a list of samples, the function passed as the `collate_fn` argument collates that list into a batch.

Sequence bucketing is done here by implementing a custom `collate_fn` for the dataloader, which pads each batch dynamically to the length of its longest sequence.
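Before the full implementation, here is the core idea as a minimal sketch in plain Python. `pad_to_longest` is a hypothetical helper written for this illustration, the token ids are made up, and `pad_id=1` is assumed because that is RoBERTa's pad token id.

```python
def pad_to_longest(batch_ids, pad_id):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


# Made-up token ids; pad_id=1 is assumed here (RoBERTa's pad token id).
batch = [[0, 11, 12, 2], [0, 21, 2], [0, 31, 32, 33, 34, 2]]
padded, masks = pad_to_longest(batch, pad_id=1)

print(padded[1])  # [0, 21, 2, 1, 1, 1]
print(masks[1])   # [1, 1, 1, 0, 0, 0]
```

The padded length depends only on the sequences in the current batch, so a batch of short texts produces a smaller tensor than a batch of long ones.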

In [None]:
class CLRPDataset(Dataset):
    def __init__(self,df,tokenizer,max_len=128):
        self.excerpt = df['excerpt'].to_numpy()
        self.targets = df['target'].to_numpy()
        self.max_len = max_len
        self.tokenizer = tokenizer
    
    def __getitem__(self,idx):
        encode = self.tokenizer(self.excerpt[idx],
                                padding=False,
                                truncation=False,
                                return_attention_mask=True,
                                return_token_type_ids=True,
                                )

        
        curr_sent = {}
        curr_sent['input_ids'] = encode['input_ids']
        curr_sent['attention_mask'] =encode['attention_mask'] 
        curr_sent['token_type_ids'] = encode['token_type_ids']
        curr_sent['target'] = self.targets[idx]
    
        return curr_sent
    
    def __len__(self):
        return len(self.excerpt)

In [None]:
class CLMCollate:
    
    def __init__(self,config):
        self.config = config
        self.seq_dic = defaultdict(int)  ## used to track max_length
        self.batch_record = defaultdict(list)
        self.bn = 0
        
    def __call__(self,batch):
                
        out = {'input_ids' :[],
               'attention_mask':[],
               'token_type_ids':[],
                'target':[]
            
        }
        
        for i in batch:
            for k,v in i.items():
                out[k].append(v)
                
        if self.config['bucket']:
            max_pad =0
            
            for p in out['input_ids']:
                if max_pad < len(p):
                    max_pad = len(p)
                    
        else:
            max_pad = self.config['max_len']
            
        
        self.batch_record[str(self.bn)] = [len(x) for x in out['input_ids']]  
        self.seq_dic[str(self.bn)] = max_pad
        self.bn+=1

        for i in range(len(batch)):
            
            input_id = out['input_ids'][i]
            att_mask = out['attention_mask'][i]
            token_type_id = out['token_type_ids'][i]
            text_len = len(input_id)
            
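            # Pad on the right up to max_pad, then cut anything longer than max_pad.
            # 1 is the pad token id of RoBERTa; padded positions get attention mask 0.
            out['input_ids'][i] = (out['input_ids'][i] + [1] * (max_pad - text_len))[:max_pad]
            out['attention_mask'][i] = (out['attention_mask'][i] + [0] * (max_pad - text_len))[:max_pad]
            out['token_type_ids'][i] = (out['token_type_ids'][i] + [0] * (max_pad - text_len))[:max_pad]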
        
        out['input_ids'] = torch.tensor(out['input_ids'],dtype=torch.long)
        out['attention_mask'] = torch.tensor(out['attention_mask'],dtype=torch.long)
        out['token_type_ids'] = torch.tensor(out['token_type_ids'],dtype=torch.long)
        out['target'] = torch.tensor(out['target'],dtype=torch.float)
        
        return out

#### Let's try without sequence bucketing
For this I have set `bucket=False` in config.

In [None]:
config = {
    'bucket':False,
    'max_len':256,
    'batch_size':256
}
train_ds = CLRPDataset(train_data,tokenizer)
sequence = CLMCollate(config)
train_dataloader = DataLoader(train_ds,
                              batch_size=config['batch_size'],
                             collate_fn=sequence,
                             shuffle=True,)
for i,data in enumerate(train_dataloader):
    pass

Let's visualize the maximum length of each batch. The red line shows the padded length, and the histogram shows the distribution of token lengths in the respective batch.

In [None]:
def plot_sequence(sequence):
    plt.figure(figsize=(15,15))
    for i in sequence.seq_dic.keys():
        fig = plt.subplot(3,4,int(i)+1)
        plt.hist(sequence.batch_record[str(i)],density=True,color='yellow', edgecolor='white', alpha=0.5)
        plt.axvline(sequence.seq_dic[str(i)],color='red')
        plt.gca().set_title('batch '+str(i))
        plt.gca().axes.get_yaxis().set_visible(False)
        
    plt.show()
    

plot_sequence(sequence)

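You can see that when the max length is fixed, we are losing important information: any sequence longer than `max_len` is truncated, while batches of shorter sequences are still padded all the way to `max_len`.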
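To make this trade-off concrete, the sketch below counts padding and truncation for two hypothetical batches. The token counts are made up, and `fixed_length_cost` and `dynamic_cost` are helper names introduced only for this illustration.

```python
def fixed_length_cost(lengths, max_len):
    """Return (pad tokens added, real tokens cut off) when forcing every sequence to max_len."""
    pad_tokens = sum(max(max_len - n, 0) for n in lengths)
    truncated = sum(max(n - max_len, 0) for n in lengths)
    return pad_tokens, truncated


def dynamic_cost(lengths):
    """Return pad tokens added when padding only to the longest sequence in the batch."""
    longest = max(lengths)
    return sum(longest - n for n in lengths)


# Hypothetical token counts for two batches of four excerpts each.
batch_long = [180, 210, 240, 300]
batch_short = [150, 160, 170, 200]

print(fixed_length_cost(batch_long, max_len=256))   # (138, 44)
print(dynamic_cost(batch_long))                     # 270
print(fixed_length_cost(batch_short, max_len=256))  # (344, 0)
print(dynamic_cost(batch_short))                    # 120
```

In the first batch a fixed length of 256 cuts 44 real tokens, while dynamic padding keeps them all. In the second batch nothing is cut either way, but dynamic padding adds far fewer pad tokens.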

#### Now let's try with sequence bucketing

In [None]:
config = {
    'bucket':True,
    'max_len':256,
    'batch_size':256
}
train_ds = CLRPDataset(train_data,tokenizer)
sequence = CLMCollate(config)
train_dataloader = DataLoader(train_ds,
                              batch_size=config['batch_size'],
                             collate_fn=sequence,
                             shuffle=True,)
for i,data in enumerate(train_dataloader):
    pass

In [None]:
plot_sequence(sequence)

You can see that the max length now varies from batch to batch, according to the longest input sample in each batch.
That's what we wanted to do :)
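Dynamic padding saves the most when the sequences within a batch have similar lengths. One possible extension, which is not part of this notebook's code, is to group indices by length before batching. The sketch below is a minimal version under that assumption; `length_grouped_batches` is a hypothetical helper and the token counts are made up.

```python
import random


def length_grouped_batches(lengths, batch_size, seed=0):
    """Group indices of similar length into batches, then shuffle the batch order."""
    # Sort indices by sequence length so that neighbours have similar lengths.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle whole batches so the model does not see lengths in increasing order.
    random.Random(seed).shuffle(batches)
    return batches


# Hypothetical token counts for eight excerpts.
lengths = [120, 300, 130, 290, 125, 310, 128, 295]
batches = length_grouped_batches(lengths, batch_size=4)

for b in batches:
    print(sorted(lengths[i] for i in b))
```

A list of index batches like this can be handed to `DataLoader` through its `batch_sampler` argument, combined with a padding `collate_fn`. Note that grouping by length makes batches less random, which may matter for training stability.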