-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reduced_genome.sh error - grep: memory exhausted #26
Comments
Hi thanks a lot for re-writing this to be less memory intensive. I have never had problems running this script but I know other people have. Would you be ok with me giving this line to people if they are having the same problems? |
Certainly! |
Thanks! :) |
For me, that awk one-liner doesn't give the same results as the original command. Full test with example data below, and an alternative approach that avoids grep and does match results from the original comand. 39MB example data here: https://helix.nih.gov/~dalerr/dm6_DpnII_25mer_flanking.fa Setupwget https://helix.nih.gov/~dalerr/dm6_DpnII_25mer_flanking.fa
INPUT=dm6_DpnII_25mer_flanking.fa
OUTPUT=dm6_DpnII_25mer_flanking_unique.fa Option 1As used in grep -v '^>' $INPUT \
| sort \
| uniq -i -u \
| grep -xF -f - -B 1 $INPUT \
| grep -v '^--' > ${OUTPUT}.1 Option 2From @mimi31k in above post; ~4 seconds awk \
'$1~/^>/{seq=$1; next};\
{$1=toupper($1)};\
a[$1]>0{a[$1]-=1;next};\
{a[$1]=seq};\
END\
{for(x in a){if(a[x]>0){print a[x];print x}}}' \
$INPUT > ${OUTPUT}.2 Option 3My alternative attempt. Uses 2 arrays; paste - - < $INPUT | awk \
'{a[toupper($2)]++; b[$2]=$1}; END \
{for(n in a) if (a[n] == 1) print b[n]"\n"n}' \
> ${OUTPUT}.3 ComparisonsSince the awk methods do not retain the sorting of the original, first I'm sorting the files just for comparison purposes: sort ${OUTPUT}.1 > out1.sorted
sort ${OUTPUT}.2 > out2.sorted
sort ${OUTPUT}.3 > out3.sorted Then compare them. Option 2 is reporting about 1% extra sites compared to option 1, but option 1 and option 3 output match. diff out1.sorted out3.sorted | wc -l # 0
# Compare option 1 and option 2
# in both:
comm out1.sorted out2.sorted -1 -2 | wc -l # 1324242
# only in option 2
comm out1.sorted out2.sorted -1 -3 | wc -l # 15638
# only in option 1
comm out1.sorted out2.sorted -2 -3 | wc -l # 0 So I think option 3 would be a better approach, as long as it's OK that the de-duped sequences have a different order from the original. |
Update to get sorted output. This runs slower than the original, at ~20s. paste - - < $INPUT | awk \
'{a[toupper($2)]++; b[$2]=$1}; END \
{for(n in a) if (a[n] == 1) print b[n]"\n"n}' \
| sort -k1,1 -k2,2n -k3,3n \
| awk '{print $1":"$2"-"$3"\n"$4}' \
> ${OUTPUT}.3 |
I had some issues with the script above so I re-wrote this: |
Hi,
I get an error running the reduce_genome.sh script on the full hg19 genome (seems to have issues with the line which selects unique fasta sequences). Increasing the session memory doesn't seem to help (as much as 96g); it might be a built in limitation in grep?
Ended up replacing:
grep -v '^>' ${genome}_${enzyme}flanking_sequences${fl}unique_2.fa | sort | uniq -i -u | grep -xF -f hg19_dpnii_flanking_sequences_100_unique_2.fa.temp -B 1 ${genome}${enzyme}flanking_sequences${fl}unique_2.fa | grep -v '^--' > ${genome}${enzyme}flanking_sequences${fl}_unique.fa
with an awk liner:
awk '$1~/^>/{seq=$1; next};{$1=toupper($1)};a[$1]>0{a[$1]-=1;next};{a[$1]=seq};END{for(x in a){if(a[x]>0){print a[x];print x}}}' hg19_dpnii_flanking_sequences_100_unique_2.fa > hg19_dpnii_flanking_sequences_100_unique.fa
which seems to fix things...
The text was updated successfully, but these errors were encountered: