
reduced_genome.sh error - grep: memory exhausted #26

Open
ml31k opened this issue Mar 8, 2017 · 6 comments

ml31k commented Mar 8, 2017

Hi,

I get an error running the reduce_genome.sh script on the full hg19 genome (it seems to have trouble with the line that selects unique FASTA sequences). Increasing the session memory doesn't seem to help (I tried as much as 96 GB); it might be a built-in limitation of grep?

Ended up replacing:

grep -v '^>' ${genome}_${enzyme}_flanking_sequences_${fl}_unique_2.fa | sort | uniq -i -u | grep -xF -f hg19_dpnii_flanking_sequences_100_unique_2.fa.temp -B 1 ${genome}_${enzyme}_flanking_sequences_${fl}_unique_2.fa | grep -v '^--' > ${genome}_${enzyme}_flanking_sequences_${fl}_unique.fa

with an awk one-liner:

awk '$1~/^>/{seq=$1; next};{$1=toupper($1)};a[$1]>0{a[$1]-=1;next};{a[$1]=seq};END{for(x in a){if(a[x]>0){print a[x];print x}}}' hg19_dpnii_flanking_sequences_100_unique_2.fa > hg19_dpnii_flanking_sequences_100_unique.fa

which seems to fix things...
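
For readability, here is the same program laid out with comments (the logic is unchanged):

awk '
  $1 ~ /^>/ { seq = $1; next }    # header line: remember it for the sequence that follows
  { $1 = toupper($1) }            # uppercase the sequence so matching is case-insensitive
  a[$1] > 0 { a[$1] -= 1; next }  # repeat sighting: knock the entry down (a stored header
                                  # string also compares greater than 0 in awk)
  { a[$1] = seq }                 # first sighting: file the header under the sequence
  END {
    for (x in a)
      if (a[x] > 0) {             # entries still holding a header are treated as unique
        print a[x]                # header...
        print x                   # ...then sequence
      }
  }' hg19_dpnii_flanking_sequences_100_unique_2.fa > hg19_dpnii_flanking_sequences_100_unique.fa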

rr1859 (Owner) commented Mar 8, 2017

Hi, thanks a lot for rewriting this to be less memory-intensive. I have never had problems running this script, but I know other people have. Would you be OK with me giving this line to people who run into the same problem?

ml31k (Author) commented Mar 8, 2017

Certainly!
I made a mistake earlier when copying the one-liner; it's now fixed in the comment above.

rr1859 (Owner) commented Mar 9, 2017

Thanks! :)

daler commented Apr 5, 2017

For me, that awk one-liner doesn't give the same results as the original command. Full test with example data below, plus an alternative approach that avoids grep and does match the results of the original command.

39MB example data here: https://helix.nih.gov/~dalerr/dm6_DpnII_25mer_flanking.fa

Setup

wget https://helix.nih.gov/~dalerr/dm6_DpnII_25mer_flanking.fa
INPUT=dm6_DpnII_25mer_flanking.fa
OUTPUT=dm6_DpnII_25mer_flanking_unique.fa

Option 1

As used in reduce_genome.sh; ~12 seconds

grep -v '^>' $INPUT \
  | sort \
  | uniq -i -u \
  | grep -xF -f - -B 1 $INPUT \
  | grep -v '^--' > ${OUTPUT}.1
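
The same steps spelled out with an intermediate file; this is equivalent, just easier to inspect (singletons.txt is only a scratch name):

grep -v '^>' $INPUT | sort | uniq -i -u > singletons.txt
# singletons.txt now holds the sequences that occur exactly once (case-insensitively)
grep -xF -f singletons.txt -B 1 $INPUT | grep -v '^--' > ${OUTPUT}.1
# -x matches whole lines only, -F treats the patterns as fixed strings, and -B 1
# pulls in the preceding header line; the last grep drops the "--" separators
# that grep emits between match groups

This also shows where the memory presumably goes: grep has to hold the entire singleton list in memory as its pattern set.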

Option 2

From @ml31k in the post above; ~4 seconds

awk \
'$1~/^>/{seq=$1; next};\
{$1=toupper($1)};\
a[$1]>0{a[$1]-=1;next};\
{a[$1]=seq};\
END\
{for(x in a){if(a[x]>0){print a[x];print x}}}' \
$INPUT > ${OUTPUT}.2

Option 3

My alternative attempt. It uses two arrays: a to keep track of the sequence count and b to keep track of the header for the last-seen occurrence of each sequence. Since we only care about sequences that occur exactly once, it's OK that the b entries get overwritten on subsequent occurrences. It also keeps memory usage low. ~3 seconds

paste - - < $INPUT | awk \
  '{a[toupper($2)]++; b[$2]=$1}; END \
  {for(n in a) if (a[n] == 1) print b[n]"\n"n}' \
  > ${OUTPUT}.3
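
For reference, paste - - folds each header/sequence pair onto one tab-separated line, so $1 is the header and $2 the sequence. The same awk with comments (annotation mine):

paste - - < $INPUT | awk '
  {
    a[toupper($2)]++       # count occurrences of each sequence, case-insensitively
    b[$2] = $1             # remember the header of the last-seen occurrence; note this
                           # is keyed by the raw sequence, so the END lookup below
                           # assumes the FASTA is already uppercase
  }
  END {
    for (n in a)
      if (a[n] == 1)       # sequences seen exactly once...
        print b[n] "\n" n  # ...go back out as FASTA: header, then sequence
  }' > ${OUTPUT}.3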

Comparisons

Since the awk methods do not retain the ordering of the original, I first sort the files, purely for comparison purposes:

sort ${OUTPUT}.1 > out1.sorted
sort ${OUTPUT}.2 > out2.sorted
sort ${OUTPUT}.3 > out3.sorted

Then compare them. Option 2 reports about 1% extra sites compared to option 1, while the outputs of options 1 and 3 match exactly.

diff out1.sorted out3.sorted | wc -l  # 0

# Compare option 1 and option 2
# in both:
comm out1.sorted out2.sorted -1 -2 | wc -l  # 1324242

# only in option 2
comm out1.sorted out2.sorted -1 -3 | wc -l  # 15638

# only in option 1
comm out1.sorted out2.sorted -2 -3 | wc -l  # 0
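
If I'm reading the one-liner's comparison semantics right, the extra sites come from sequences occurring an odd number of times greater than one: the first repeat knocks the stored header out of a, but a third occurrence stores a header again, so the sequence is reported as unique. A minimal reproduction with a made-up toy.fa:

printf '>h1\nAAAA\n>h2\nAAAA\n>h3\nAAAA\n>h4\nCCCC\n' > toy.fa
# option 1 correctly reports only the CCCC record:
grep -v '^>' toy.fa | sort | uniq -i -u | grep -xF -f - -B 1 toy.fa | grep -v '^--'
# option 2 also reports AAAA (under >h3), even though it occurs three times:
awk '$1~/^>/{seq=$1; next};{$1=toupper($1)};a[$1]>0{a[$1]-=1;next};{a[$1]=seq};END{for(x in a){if(a[x]>0){print a[x];print x}}}' toy.fa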

So I think option 3 would be a better approach, as long as it's OK that the de-duped sequences have a different order from the original.

daler commented Apr 5, 2017

Update to get sorted output. This runs slower than the original, at ~20s.

paste - - < $INPUT | awk \
  '{a[toupper($2)]++; b[$2]=$1}; END \
  {for(n in a) if (a[n] == 1) print b[n]"\t"n}' \
  | sed 's/:/\t/;s/-/\t/' \
  | sort -k1,1 -k2,2n -k3,3n \
  | awk '{print $1":"$2"-"$3"\n"$4}' \
  > ${OUTPUT}.3

rr1859 (Owner) commented Aug 16, 2019

I had some issues with the script above, so I rewrote it:
awk '{if(NR%2){printf("%s\t",$1)}else{print}}' ${genome}_${enzyme}_flanking_sequences_${fl}_unique_2.fa \
  | sort -k2,2 \
  | awk -v IGNORECASE=1 '{if($2==SEQ){gsub(">","",$1);ID=ID"|"$1;counter++}else{if(SEQ!="" && counter==1){printf("%s\n%s\n", ID,SEQ)};SEQ=$2;ID=$1;counter=1}}END{if(counter==1){printf("%s\n%s\n", ID,SEQ)}}' \
  > ${genome}_${enzyme}_flanking_sequences_${fl}_unique.fa
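
Stage by stage: the first awk joins each header/sequence pair onto one tab-separated line, sort -k2,2 groups identical sequences together, and the second awk (gawk, since it relies on IGNORECASE) prints a record only when its sequence group has size one. A quick sanity check with made-up values:

genome=toy; enzyme=test; fl=0   # example values only
printf '>h1\nAAAA\n>h2\nAAAA\n>h3\nCCCC\n' > ${genome}_${enzyme}_flanking_sequences_${fl}_unique_2.fa
# running the command above should leave only the CCCC record in
# ${genome}_${enzyme}_flanking_sequences_${fl}_unique.fa:
# >h3
# CCCC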

rr1859 pinned this issue Aug 16, 2019