Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sampling sequences by number is not truly random #386

Open
4 tasks done
alvanuffelen opened this issue May 16, 2023 · 3 comments
Open
4 tasks done

Sampling sequences by number is not truly random #386

alvanuffelen opened this issue May 16, 2023 · 3 comments

Comments

@alvanuffelen
Copy link

alvanuffelen commented May 16, 2023

Prerequisites

  • make sure you're are using the latest version by seqkit version
  • read the usage

Describe your issue

  • describe the problem
    When using seqkit sample -n, the first number of sequences in the file will a higher chance of being picked than later sequences.
  • provide a reproducible example
    Porting your logic:

    seqkit/seqkit/cmd/sample.go

    Lines 155 to 163 in 22f71ff

    proportion = float64(number) / float64(len(records))
    for _, record := range records {
    // if <-randg <= proportion {
    if rand.Float64() <= proportion {
    n++
    record.FormatToWriter(outfh, config.LineWidth)
    if n == number {
    break

    to Python:
import random
import matplotlib.pyplot as plt

# Toy data to sample from
sample = range(1, 101)

# Wanted number of elements (-n)
number = 10
# Calculate proportion based of the wanted number and total number
proportion = number / len(sample)
# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}

# Iterate a million times
for _ in range(1_000_000):
    # Counter for the number of selected elements in each iteration
    added_element_counter = 0
    for element in sample:
        if random.random() <= proportion:
            element_count[element] += 1
            added_element_counter += 1
        if number == added_element_counter:
            break

plt.bar(element_count.keys(), height=element_count.values())
plt.show()

image

A solution could be to shuffle the records before iterating over them:

import random
import matplotlib.pyplot as plt

# Wanted number of elements (-n)
number = 10
# Calculate proportion based of the wanted number and total number
proportion = number / len(sample)
# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}
for _ in range(1_000_000):
    # Toy data to sample from
    sample = list(range(1, 101))
    # Shuffle the list each iteration
    random.shuffle(sample)
    # Counter for the number of selected elements in each iteration
    added_element_counter = 0
    for element in sample:
        if random.random() <= proportion:
            element_count[element] += 1
            added_element_counter += 1
        if number == added_element_counter:
            break

plt.bar(element_count.keys(), height=element_count.values())
plt.xlabel('Element')
plt.ylabel('Number of times element was picked.')
plt.tight_layout()
plt.show()

image

@shenwei356
Copy link
Owner

Thre are no truly random numbers.

Seeds also affect the chosen positions: https://bioinf.shenwei.me/seqkit/note/#location-distribution-of-sampled-records

@alvanuffelen
Copy link
Author

Agreed that there are no truly random numbers and that the seed affects the selected sequences.
But aren't you actively selecting against the later sequences in your implementation?

@shenwei356
Copy link
Owner

Splitting sequences into N bins of equal size and randomly choosing one record in each bin? This could be the best way.

But most of the time, we (I) don't care if they are uniformly located in the sequences file, or if each record has an equal chance been chosen. What I need is just N sequences down-sampled from the original file.

If the sequence file is not too big, shuffling before down-sampling could be a more rigorous way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants