# Subsampling

## Ideas

### Rationale

Subsampling can change: 
- the number of participants
- the number of classes (signs or words)
- the occurrence or counts per class

#### On the number of participants

Different participants sign the same word or sign in different ways. A model trained on more participants sees more variations of each class (sign or word). We can therefore assume that model accuracy increases with the number of participants.

#### On the number of classes 

"This shows that '60%' accuracy 'means more' if you have more classes: a binary classifier with 60% accuracy is almost random, but achieving 60% accuracy for 100 classes is godlike (assuming the classes are relatively balanced)." (Source: https://stackoverflow.com/questions/71632842/does-the-number-of-classes-in-the-target-variable-affect-the-accuracy-of-a-class)
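To make the quote concrete, the sketch below compares a fixed 60% accuracy against the accuracy of uniform random guessing, which is 1/K for K balanced classes. The helper name `chance_accuracy` is hypothetical and only serves this illustration; the value 250 is the number of signs reported for `train.csv` further down.

```python
def chance_accuracy(n_classes: int) -> float:
    """Expected accuracy of uniform random guessing over balanced classes."""
    if n_classes < 1:
        raise ValueError("n_classes must be at least 1")
    return 1.0 / n_classes


# Compare 60% accuracy against the chance level for different class counts.
for k in (2, 100, 250):
    base = chance_accuracy(k)
    ratio = 0.6 / base
    print(f"{k:>3} classes: chance = {base:.3f}, 60% accuracy is {ratio:.1f}x chance")
```

Reducing the number of classes by subsampling therefore raises the chance level, so accuracies from subsamples with different class counts are not directly comparable.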

#### On the occurrence or counts per class

The occurrence, or count, per class is the number of examples of the same class (word or sign) that the model sees during training. We can assume that the higher this "coverage", the higher the accuracy of the model, since it has been trained on more instances of each class.

Based on this, there are a few meaningful options to subsample the whole data set. 

### Option 1
Randomly sample a fixed proportion of the data, e.g. 1% of the whole data set:
- covers all participants
- covers all classes
- low occurrence or counts per class
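Whether a given subsample really has these properties can be checked by counting participants, classes, and examples per class. The following is a minimal sketch on toy data; the helper `coverage` is hypothetical, and the column name `participant_id` is an assumption (only the `sign` column is confirmed by this notebook).

```python
import pandas as pd


def coverage(df, participant_col="participant_id", class_col="sign"):
    """Summarize the three quantities that subsampling can change."""
    counts = df[class_col].value_counts()  # number of examples per class
    return {
        "participants": df[participant_col].nunique(),
        "classes": df[class_col].nunique(),
        "min_count": int(counts.min()),
        "median_count": float(counts.median()),
    }


# Toy data standing in for train.csv (column names are assumptions).
toy = pd.DataFrame({
    "participant_id": [1, 1, 2, 2, 3, 3],
    "sign": ["cat", "dog", "cat", "dog", "cat", "cat"],
})
summary = coverage(toy)
print(summary)
```

Running the same summary on the full data and on the subsample shows how much each of the three quantities dropped.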

### Option 2
Fix the number of examples (counts) per sign.
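This notebook does not implement Option 2, so the following is only one possible sketch, not the author's method. It shuffles the rows once and then keeps at most `n_per_class` rows per sign; the function name `subsample_per_class` and the toy columns are hypothetical.

```python
import pandas as pd


def subsample_per_class(df, n_per_class, class_col="sign", seed=42):
    """Keep at most n_per_class randomly chosen rows for each class."""
    # Shuffle all rows reproducibly, then take the first n rows of each group.
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby(class_col).head(n_per_class)


# Toy data: 5 examples of "cat", 3 of "dog", 1 of "owl".
toy = pd.DataFrame({
    "sign": ["cat"] * 5 + ["dog"] * 3 + ["owl"] * 1,
    "sequence_id": range(9),
})
sub = subsample_per_class(toy, n_per_class=2)
print(sub["sign"].value_counts())
```

Shuffling followed by `groupby(...).head(n)` is used here because classes with fewer than `n_per_class` rows are simply kept whole, rather than raising an error.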

## Install Dependencies

## Import Dependencies

In [4]:
import numpy as np
import pandas as pd

import random
from random import randint, randrange

## Load train.csv

In [5]:
df_train = pd.read_csv('../data/asl-signs/train.csv')
df_train.shape

(94477, 4)

In [7]:
df_train.sign.nunique()

250

## Parameters

In [81]:
# subsampling mode: 'random' (Option 1) or 'sign_counts' (Option 2)
# MODE = 'random'
MODE = 'sign_counts'

# random seed used for subsampling
RSEED = 42

# random subsampling proportion of whole data
p = 0.01



## Option 1: Random subsampling

In [82]:
# set random state
random.seed(RSEED)

# set length of subsample
length = int(df_train.shape[0] * p) # number of rows in train.csv times p

# random sampling of row indices without replacement (no duplicate rows)
r = random.sample(range(df_train.shape[0]), length)

# subsampling df_train
df_train_sub = df_train.iloc[r]
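As an alternative to drawing indices by hand, pandas offers `DataFrame.sample`, which samples without replacement by default and accepts a seed through `random_state`. The sketch below uses toy data rather than `train.csv`; the column `sequence_id` is an assumption. Note that with `frac` pandas rounds the row count, so the result can differ by one row from the `int(...)` truncation used above.

```python
import pandas as pd

# Toy data standing in for train.csv (column names are assumptions).
toy = pd.DataFrame({
    "sign": ["cat", "dog"] * 50,
    "sequence_id": range(100),
})

# Draw 10% of the rows without replacement, reproducibly.
sub = toy.sample(frac=0.1, random_state=42)
print(sub.shape)
```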


## Save subsample 

In [83]:
f'../data/train_sub{length}.csv'

'../data/train_sub944.csv'

In [84]:
df_train_sub.to_csv(f'../data/train_sub{length}.csv', index=False)