Sampling sequences by number is not truly random #386

alvanuffelen · 2023-05-16T09:00:58Z

Prerequisites

make sure you're are using the latest version by seqkit version
read the usage

Describe your issue

describe the problem
When using seqkit sample -n, the first number of sequences in the file will a higher chance of being picked than later sequences.

provide a reproducible example
Porting your logic:

Lines 155 to 163 in 22f71ff

    
           proportion = float64(number) / float64(len(records)) 
        
           for _, record := range records { 
        
           	// if <-randg <= proportion { 
        
           	if rand.Float64() <= proportion { 
        
           		n++ 
        
           		record.FormatToWriter(outfh, config.LineWidth) 
        
           		if n == number { 
        
           			break

to Python:

import random
import matplotlib.pyplot as plt

# Toy data to sample from
sample = range(1, 101)

# Wanted number of elements (-n)
number = 10
# Calculate proportion based of the wanted number and total number
proportion = number / len(sample)
# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}

# Iterate a million times
for _ in range(1_000_000):
    # Counter for the number of selected elements in each iteration
    added_element_counter = 0
    for element in sample:
        if random.random() <= proportion:
            element_count[element] += 1
            added_element_counter += 1
        if number == added_element_counter:
            break

plt.bar(element_count.keys(), height=element_count.values())
plt.show()

A solution could be to shuffle the records before iterating over them:

import random
import matplotlib.pyplot as plt

# Wanted number of elements (-n)
number = 10
# Calculate proportion based of the wanted number and total number
proportion = number / len(sample)
# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}
for _ in range(1_000_000):
    # Toy data to sample from
    sample = list(range(1, 101))
    # Shuffle the list each iteration
    random.shuffle(sample)
    # Counter for the number of selected elements in each iteration
    added_element_counter = 0
    for element in sample:
        if random.random() <= proportion:
            element_count[element] += 1
            added_element_counter += 1
        if number == added_element_counter:
            break

plt.bar(element_count.keys(), height=element_count.values())
plt.xlabel('Element')
plt.ylabel('Number of times element was picked.')
plt.tight_layout()
plt.show()

The text was updated successfully, but these errors were encountered:

shenwei356 · 2023-05-18T16:18:13Z

Thre are no truly random numbers.

Seeds also affect the chosen positions: https://bioinf.shenwei.me/seqkit/note/#location-distribution-of-sampled-records

alvanuffelen · 2023-05-18T21:24:09Z

Agreed that there are no truly random numbers and that the seed affects the selected sequences.
But aren't you actively selecting against the later sequences in your implementation?

shenwei356 · 2023-05-19T15:27:30Z

Splitting sequences into N bins of equal size and randomly choosing one record in each bin? This could be the best way.

But most of the time, we (I) don't care if they are uniformly located in the sequences file, or if each record has an equal chance been chosen. What I need is just N sequences down-sampled from the original file.

If the sequence file is not too big, shuffling before down-sampling could be a more rigorous way.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sampling sequences by number is not truly random #386

Sampling sequences by number is not truly random #386

alvanuffelen commented May 16, 2023 •

edited

Loading

shenwei356 commented May 18, 2023

alvanuffelen commented May 18, 2023

shenwei356 commented May 19, 2023

Sampling sequences by number is not truly random #386

Sampling sequences by number is not truly random #386

Comments

alvanuffelen commented May 16, 2023 • edited Loading

Prerequisites

Describe your issue

shenwei356 commented May 18, 2023

alvanuffelen commented May 18, 2023

shenwei356 commented May 19, 2023

alvanuffelen commented May 16, 2023 •

edited

Loading