# Simple random sampling


- Pick items at random, one at a time, without replacement
- Every item has the same probability of being picked (e.g., picking one chocolate out of a box of 10)
- Example: `df.sample(n=5, random_state=42)` (`random_state` makes the draw reproducible)
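
A minimal sketch of the call above on a toy dataset. The `chocolates` dataframe and its `chocolate_id` column are made up for illustration.

```python
import pandas as pd

# Hypothetical toy population: 10 chocolates, each equally likely to be picked.
chocolates = pd.DataFrame({"chocolate_id": range(1, 11)})

# Draw 3 rows without replacement; random_state makes the draw repeatable.
sample_a = chocolates.sample(n=3, random_state=42)
sample_b = chocolates.sample(n=3, random_state=42)

print(sample_a)
print(sample_a.equals(sample_b))  # same seed, so the same rows are drawn
```

Dropping `random_state` gives a different sample on each run.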

# Systematic Sampling

- Sample the population at regular intervals (e.g., take every 5th row of a dataset)
    - Example: `df.iloc[::interval]`
- May introduce bias if the row order is related to the variable being measured
- To check for bias, plot the variable of interest against the row index and look for any hidden relationship, sequence, or pattern (see the example under Safe systematic sampling)
    - The plot should look like pure noise, with no trend or cycle
    - If it does not, randomize the row order before sampling with `df.sample(frac=1)`
- Every n-th row keeps its original index label. Reset the index with `.reset_index(drop=True)` so the stored sample gets a fresh 0-to-n-1 index


### Unsafe systematic sampling

In [1]:
# sample_size = 5
# pop_size = len(df)
# print(pop_size)
# interval = pop_size // sample_size
# print(interval)
# df.iloc[::interval]
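
The cell above can be made runnable on synthetic data to show how a hidden pattern biases the result. This is only a sketch: the `sales` column and its 20-row cycle are invented so that the sampling interval lines up with the pattern.

```python
import pandas as pd

# Hypothetical population with a hidden 20-row cycle in "sales".
pop_size = 100
df = pd.DataFrame({"sales": [100 + 10 * (i % 20) for i in range(pop_size)]})

sample_size = 5
interval = pop_size // sample_size  # 20 rows between picks

# Every 20th row lands on the same point of the cycle.
systematic = df.iloc[::interval]

print(df["sales"].mean())          # population mean
print(systematic["sales"].mean())  # sample mean, far from the population mean
print(systematic.index.tolist())   # original row labels are kept

# Reset the labels so the stored sample is indexed 0..n-1.
systematic = systematic.reset_index(drop=True)
```

Note that when `pop_size` is not a multiple of `sample_size`, `df.iloc[::interval]` can return one more row than requested.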


### Safe systematic sampling


- Steps:
    1. Use `df.sample(frac=1)` to create a shuffled copy of the dataset
    2. The shuffled rows still carry their original index labels; `df.reset_index()` on its own would expose them as a column
    3. Use `df.reset_index(drop=True).reset_index()` to build a new index for the shuffled dataframe
        - `.reset_index(drop=True)` discards the old index labels while keeping the shuffled row order
        - `.reset_index()` then adds the new 0-to-n-1 index as an `index` column, which can be used as the x-axis of a plot

In [2]:
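# import matplotlib.pyplot as plt
#
# shuffled = coffee_ratings.sample(frac=1)  # shuffle all rows
# shuffled = shuffled.reset_index(drop=True).reset_index()  # fresh 0..n-1 "index" column
# shuffled.plot(x="index", y="aftertaste", kind="scatter")  # should look like noise
# plt.show()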
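
The steps above can be sketched end to end on synthetic data. Instead of a plot, this version uses the correlation between row position and value as a rough numeric check. The `score` column, the sample size, and the seed are assumptions made for the example.

```python
import pandas as pd

# Hypothetical ordered population: "score" rises with row position.
df = pd.DataFrame({"score": range(1000)})

# 1. Shuffle all rows (random_state is only here for repeatability).
shuffled = df.sample(frac=1, random_state=0)

# 2. Drop the old labels, then add the new 0..n-1 positions as an "index" column.
shuffled = shuffled.reset_index(drop=True).reset_index()

# 3. Take every interval-th row of the shuffled frame.
sample_size = 50
interval = len(shuffled) // sample_size
safe_sample = shuffled.iloc[::interval]

# Correlation between row position and value: about 1 before shuffling, near 0 after.
corr_before = df.reset_index()["index"].corr(df["score"])
corr_after = shuffled["index"].corr(shuffled["score"])
print(corr_before, corr_after)
```

Once the rows are shuffled, systematic sampling behaves essentially like simple random sampling.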

# Stratified Sampling

- Sample from every sub-group (stratum) of the population
- Check the sub-group proportions in the original population: `df["col"].value_counts(normalize=True)`
- Simple random sampling may produce sub-group proportions that differ from the population's
- Proportional stratified sampling keeps the sub-group proportions of the sample the same as the population's
- Steps:
    - First use `.groupby()` to group the dataframe by the sub-group column
    - Then call `.sample()` on the grouped dataframe
- Two ways to create such samples:
    1. Sample a fraction of each sub-group (`frac=`): proportional stratified sampling
    2. Sample a fixed number of rows from each sub-group (`n=`): equal counts stratified sampling

In [3]:
# stratified_sample_proportion = df.groupby("class").sample(frac=0.1, random_state=42)  # 10% of each class
# stratified_sample_counts = df.groupby("class").sample(n=10, random_state=42)  # 10 rows from each class

# Verify: proportions should match the population's value_counts(normalize=True)
# stratified_sample_proportion['class'].value_counts(normalize=True)
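
To see the difference between the two variants, here is a small runnable sketch. The dataframe, the `class` labels `a`/`b`/`c`, and the 60/30/10 split are invented for illustration.

```python
import pandas as pd

# Hypothetical population: three classes in a 60/30/10 split.
df = pd.DataFrame({
    "class": ["a"] * 600 + ["b"] * 300 + ["c"] * 100,
    "value": range(1000),
})

pop_props = df["class"].value_counts(normalize=True)

# Proportional: 10% of each class, so the 60/30/10 split is preserved.
prop_sample = df.groupby("class").sample(frac=0.1, random_state=42)

# Equal counts: 10 rows per class, regardless of class size.
count_sample = df.groupby("class").sample(n=10, random_state=42)

print(pop_props)
print(prop_sample["class"].value_counts(normalize=True))
print(count_sample["class"].value_counts())
```

Equal counts sampling over-represents small classes relative to the population, which is useful when each sub-group needs enough rows to be analyzed on its own.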

# Weighted random sampling

- Sample rows with probability proportional to a specified weight
- Steps:
    1. Create a weight column in the dataset
    2. Sample the dataset using that column as `weights`

In [4]:
# weighted_sample = df.sample(frac=0.1, weights="weighted_col")
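
Both steps can be sketched on made-up data as follows. The `group` labels and the weights 0, 1, and 3 are assumptions chosen for the example. pandas normalizes the weights, so they do not need to sum to 1, and rows with weight 0 are never drawn.

```python
import pandas as pd

# Hypothetical population with three groups of rows.
df = pd.DataFrame({"group": ["never"] * 100 + ["low"] * 450 + ["high"] * 450})

# Step 1: build the weight column (relative weights, not probabilities).
weight_by_group = {"never": 0, "low": 1, "high": 3}
df["weighted_col"] = df["group"].map(weight_by_group)

# Step 2: sample using that column as the weights.
weighted_sample = df.sample(n=200, weights="weighted_col", random_state=42)

print(weighted_sample["group"].value_counts())
```

Rows in the `high` group should appear noticeably more often than rows in the `low` group, even though both groups are the same size in the population.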

# Cluster sampling

- Randomly pick a subset of the sub-groups
- Then sample only within the picked sub-groups
- Usually cheaper than stratified sampling, which requires sampling from every sub-group
- Cluster sampling is a type of multistage sampling
    - Can have > 2 stages
    - E.g., countrywide surveys may sample states, counties, cities, and neighborhoods
- Steps:
    1. Randomly select a specific number of subgroups
    2. Mask the population dataframe to keep only the selected subgroups
    3. Group the masked dataframe by subgroup, then sample within each group (same as in stratified sampling)
        - use `masked_df['class'] = masked_df['class'].cat.remove_unused_categories()` to drop categories that no longer appear in the masked dataframe (this applies when the column has a categorical dtype)

In [5]:
# unique_class = list(df['class'].unique())

# import random
# sampled_class = random.sample(unique_class, k=3)

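# mask = df['class'].isin(sampled_class)
# masked_df = df[mask].copy()  # copy so the assignment below does not write to a slice of df
# masked_df['class'] = masked_df['class'].cat.remove_unused_categories()  # requires a categorical column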

# clustered_sample = masked_df.groupby("class", observed = True).sample(n=5, random_state=2021)
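
A runnable version of the cell above, on synthetic data. This is a sketch: the six class labels, the group sizes, and the seeds are assumptions, and `class` is created as a categorical column so that `.cat.remove_unused_categories()` applies.

```python
import random

import pandas as pd

# Hypothetical population: 6 subgroups, 50 rows each, stored as a categorical.
classes = ["a", "b", "c", "d", "e", "f"]
df = pd.DataFrame({
    "class": pd.Categorical(classes * 50),
    "value": range(300),
})

# Stage 1: randomly pick 3 of the subgroups.
random.seed(2021)
sampled_class = random.sample(classes, k=3)

# Stage 2: keep only rows from the picked subgroups, then sample inside each.
masked_df = df[df["class"].isin(sampled_class)].copy()
masked_df["class"] = masked_df["class"].cat.remove_unused_categories()
clustered_sample = masked_df.groupby("class", observed=True).sample(n=5, random_state=2021)

print(sampled_class)
print(clustered_sample["class"].value_counts())
```

The result has 5 rows from each of the 3 picked subgroups, and the other 3 subgroups are absent from both the data and the category list.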