```python
import random

import matplotlib.pyplot as plt

# Toy data to sample from
sample = range(1, 101)

# Wanted number of elements (-n)
number = 10

# Calculate proportion based on the wanted number and total number
proportion = number / len(sample)

# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}

# Iterate a million times
for _ in range(1_000_000):
    # Counter for the number of selected elements in each iteration
    added_element_counter = 0
    for element in sample:
        if random.random() <= proportion:
            element_count[element] += 1
            added_element_counter += 1
            if number == added_element_counter:
                break

plt.bar(element_count.keys(), height=element_count.values())
plt.show()
```
A solution could be to shuffle the records before iterating over them:
```python
import random

import matplotlib.pyplot as plt

# Toy data to sample from (defined up front so the script runs on its own)
sample = list(range(1, 101))

# Wanted number of elements (-n)
number = 10

# Calculate proportion based on the wanted number and total number
proportion = number / len(sample)

# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}

for _ in range(1_000_000):
    # Shuffle the list each iteration
    random.shuffle(sample)

    # Counter for the number of selected elements in each iteration
    added_element_counter = 0
    for element in sample:
        if random.random() <= proportion:
            element_count[element] += 1
            added_element_counter += 1
            if number == added_element_counter:
                break

plt.bar(element_count.keys(), height=element_count.values())
plt.xlabel('Element')
plt.ylabel('Number of times element was picked.')
plt.tight_layout()
plt.show()
```
Agreed that there are no truly random numbers and that the seed affects the selected sequences.
But aren't you actively selecting against the later sequences in your implementation?
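The bias in the Python port can also be worked out without simulation. A minimal sketch, assuming each record is kept independently with probability `number / total` and the loop stops once `number` records have been kept, as in the Python script; the function name `inclusion_probability` is made up for illustration. A record at a given position is picked only if fewer than `number` of the records before it were kept, which is a binomial tail:

```python
from math import comb


def inclusion_probability(position, total, number):
    """Chance that the record at 1-based `position` is picked.

    Assumes the early-stopping scheme of the Python port: every record is
    kept with probability number / total until `number` records are kept.
    """
    p = number / total
    trials = position - 1  # records seen before this one
    # Probability that fewer than `number` of the earlier records were kept,
    # i.e. that the loop has not stopped before reaching this position.
    still_running = sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(min(number, trials + 1))
    )
    return p * still_running


probs = [inclusion_probability(i, 100, 10) for i in range(1, 101)]
for position in (1, 10, 50, 100):
    print(position, round(probs[position - 1], 4))
```

Under these assumptions the first `number` positions all keep the nominal probability, and every later position is multiplied by a factor below one that shrinks toward the end of the file.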
Splitting sequences into N bins of equal size and randomly choosing one record in each bin? This could be the best way.
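The binning idea could look roughly like the sketch below. It is not how seqkit works; `sample_one_per_bin` is a hypothetical helper, and it assumes the records are already in memory so that their total count is known, which a streaming reader would not have.

```python
import random


def sample_one_per_bin(records, number, rng=random):
    """Split `records` into `number` contiguous bins and pick one per bin."""
    total = len(records)
    if not 0 < number <= total:
        raise ValueError("number must be between 1 and len(records)")
    picked = []
    for bin_index in range(number):
        # Integer arithmetic keeps the bin sizes within one record of each other.
        start = bin_index * total // number
        end = (bin_index + 1) * total // number
        picked.append(records[rng.randrange(start, end)])
    return picked


records = list(range(1, 101))
print(sample_one_per_bin(records, 10, random.Random(42)))
```

This always returns exactly `number` records spread across the file. When the total is a multiple of `number`, every record has the same chance of being picked; otherwise records in the slightly smaller bins are marginally favoured.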
But most of the time, we (I) don't care whether they are uniformly located in the sequence file, or whether each record has an equal chance of being chosen. What I need is just N sequences down-sampled from the original file.
If the sequence file is not too big, shuffling before down-sampling could be a more rigorous way.
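For a file that fits in memory, shuffling before down-sampling can be as simple as the sketch below. It uses plain Python lists rather than anything from seqkit, and `shuffle_then_take` is a hypothetical name. Unlike the proportion-based loop, it does not need a per-record random test at all: the first `number` records of a shuffled copy are taken directly.

```python
import random


def shuffle_then_take(records, number, seed=None):
    """Return `number` records chosen uniformly by shuffling a copy."""
    rng = random.Random(seed)
    shuffled = list(records)  # copy, so the caller's order is untouched
    rng.shuffle(shuffled)
    return shuffled[:number]


records = list(range(1, 101))
print(shuffle_then_take(records, 10, seed=11))
```

Every record has the same chance of being picked and the result always has exactly `number` records (or all of them if there are fewer). The standard library's `random.sample(records, number)` gives the same kind of result in one call. The cost is holding every record in memory, which is why this only suits files that are not too big.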
Prerequisites
seqkit version
Describe your issue
When using `seqkit sample -n`, the first sequences in the file have a higher chance of being picked than later sequences. Porting your logic from seqkit/seqkit/cmd/sample.go (lines 155 to 163 in 22f71ff) to Python gives the first script above; the second script shows a possible solution, shuffling the records before iterating over them.