
reduced_genome.sh error - grep: memory exhausted #26

Open
ml31k opened this issue Mar 8, 2017 · 6 comments

ml31k commented Mar 8, 2017

Hi,

I get an error running the reduce_genome.sh script on the full hg19 genome (it seems to have trouble with the line that selects unique FASTA sequences). Increasing the session memory doesn't seem to help (I tried as much as 96 GB); it might be a built-in limitation of grep?

Ended up replacing:

grep -v '^>' ${genome}_${enzyme}_flanking_sequences_${fl}_unique_2.fa | sort | uniq -i -u | grep -xF -f hg19_dpnii_flanking_sequences_100_unique_2.fa.temp -B 1 ${genome}_${enzyme}_flanking_sequences_${fl}_unique_2.fa | grep -v '^--' > ${genome}_${enzyme}_flanking_sequences_${fl}_unique.fa

with an awk one-liner:

awk '$1~/^>/{seq=$1; next};{$1=toupper($1)};a[$1]>0{a[$1]-=1;next};{a[$1]=seq};END{for(x in a){if(a[x]>0){print a[x];print x}}}' hg19_dpnii_flanking_sequences_100_unique_2.fa > hg19_dpnii_flanking_sequences_100_unique.fa

which seems to fix things...
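
For readability, here is the same program laid out with comments (the logic is unchanged):

awk '
  $1 ~ /^>/ { seq = $1; next }    # header line: remember it for the sequence that follows
  { $1 = toupper($1) }            # uppercase the sequence so matching is case-insensitive
  a[$1] > 0 { a[$1] -= 1; next }  # repeat sighting: knock the entry down (a stored header
                                  # string also compares greater than 0 in awk)
  { a[$1] = seq }                 # first sighting: file the header under the sequence
  END {
    for (x in a)
      if (a[x] > 0) {             # entries still holding a header are treated as unique
        print a[x]                # header...
        print x                   # ...then sequence
      }
  }' hg19_dpnii_flanking_sequences_100_unique_2.fa > hg19_dpnii_flanking_sequences_100_unique.fa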

rr1859 (Owner) commented Mar 8, 2017

Hi, thanks a lot for rewriting this to be less memory-intensive. I have never had problems running this script, but I know other people have. Would you be OK with me giving this line to people who run into the same problem?

ml31k (Author) commented Mar 8, 2017

Certainly!
I made a mistake earlier when copying the one-liner; it's now fixed in the comment above.

rr1859 (Owner) commented Mar 9, 2017

Thanks! :)

daler commented Apr 5, 2017

For me, that awk one-liner doesn't give the same results as the original command. Full test with example data below, plus an alternative approach that avoids grep and does match the results of the original command.

39MB example data here: https://helix.nih.gov/~dalerr/dm6_DpnII_25mer_flanking.fa

Setup

wget https://helix.nih.gov/~dalerr/dm6_DpnII_25mer_flanking.fa
INPUT=dm6_DpnII_25mer_flanking.fa
OUTPUT=dm6_DpnII_25mer_flanking_unique.fa

Option 1

As used in reduce_genome.sh; ~12 seconds

grep -v '^>' $INPUT \
  | sort \
  | uniq -i -u \
  | grep -xF -f - -B 1 $INPUT \
  | grep -v '^--' > ${OUTPUT}.1
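
The same steps spelled out with an intermediate file; this is equivalent, just easier to inspect (singletons.txt is only a scratch name):

grep -v '^>' $INPUT | sort | uniq -i -u > singletons.txt
# singletons.txt now holds the sequences that occur exactly once (case-insensitively)
grep -xF -f singletons.txt -B 1 $INPUT | grep -v '^--' > ${OUTPUT}.1
# -x matches whole lines only, -F treats the patterns as fixed strings, and -B 1
# pulls in the preceding header line; the last grep drops the "--" separators
# that grep emits between match groups

This also shows where the memory presumably goes: grep has to hold the entire singleton list in memory as its pattern set.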

Option 2

From @ml31k in the post above; ~4 seconds

awk \
'$1~/^>/{seq=$1; next};\
{$1=toupper($1)};\
a[$1]>0{a[$1]-=1;next};\
{a[$1]=seq};\
END\
{for(x in a){if(a[x]>0){print a[x];print x}}}' \
$INPUT > ${OUTPUT}.2

Option 3

My alternative attempt. It uses two arrays: a to keep track of the sequence count and b to keep track of the header for the last-seen occurrence of each sequence. Since we only care about sequences that occur exactly once, it's OK that the b entries get overwritten on subsequent occurrences. It also keeps memory usage low. ~3 seconds

paste - - < $INPUT | awk \
  '{a[toupper($2)]++; b[$2]=$1}; END \
  {for(n in a) if (a[n] == 1) print b[n]"\n"n}' \
  > ${OUTPUT}.3
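
For reference, paste - - folds each header/sequence pair onto one tab-separated line, so $1 is the header and $2 the sequence. The same awk with comments (annotation mine):

paste - - < $INPUT | awk '
  {
    a[toupper($2)]++       # count occurrences of each sequence, case-insensitively
    b[$2] = $1             # remember the header of the last-seen occurrence; note this
                           # is keyed by the raw sequence, so the END lookup below
                           # assumes the FASTA is already uppercase
  }
  END {
    for (n in a)
      if (a[n] == 1)       # sequences seen exactly once...
        print b[n] "\n" n  # ...go back out as FASTA: header, then sequence
  }' > ${OUTPUT}.3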

Comparisons

Since the awk methods do not retain the ordering of the original, I first sort the files, purely for comparison purposes:

sort ${OUTPUT}.1 > out1.sorted
sort ${OUTPUT}.2 > out2.sorted
sort ${OUTPUT}.3 > out3.sorted

Then compare them. Option 2 reports about 1% extra sites compared to option 1, while the outputs of options 1 and 3 match exactly.

diff out1.sorted out3.sorted | wc -l  # 0

# Compare option 1 and option 2
# in both:
comm out1.sorted out2.sorted -1 -2 | wc -l  # 1324242

# only in option 2
comm out1.sorted out2.sorted -1 -3 | wc -l  # 15638

# only in option 1
comm out1.sorted out2.sorted -2 -3 | wc -l  # 0
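
If I'm reading the one-liner's comparison semantics right, the extra sites come from sequences occurring an odd number of times greater than one: the first repeat knocks the stored header out of a, but a third occurrence stores a header again, so the sequence is reported as unique. A minimal reproduction with a made-up toy.fa:

printf '>h1\nAAAA\n>h2\nAAAA\n>h3\nAAAA\n>h4\nCCCC\n' > toy.fa
# option 1 correctly reports only the CCCC record:
grep -v '^>' toy.fa | sort | uniq -i -u | grep -xF -f - -B 1 toy.fa | grep -v '^--'
# option 2 also reports AAAA (under >h3), even though it occurs three times:
awk '$1~/^>/{seq=$1; next};{$1=toupper($1)};a[$1]>0{a[$1]-=1;next};{a[$1]=seq};END{for(x in a){if(a[x]>0){print a[x];print x}}}' toy.fa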

So I think option 3 would be a better approach, as long as it's OK that the de-duped sequences have a different order from the original.

daler commented Apr 5, 2017

Update to get sorted output. This runs slower than the original, at ~20s.

paste - - < $INPUT | awk \
  '{a[toupper($2)]++; b[$2]=$1}; END \
  {for(n in a) if (a[n] == 1) print b[n]"\t"n}' \
  | sed 's/:/\t/;s/-/\t/' \
  | sort -k1,1 -k2,2n -k3,3n \
  | awk '{print $1":"$2"-"$3"\n"$4}' \
  > ${OUTPUT}.3

rr1859 (Owner) commented Aug 16, 2019

I had some issues with the script above, so I rewrote it:
awk '{if(NR%2){printf("%s\t",$1)}else{print}}' ${genome}_${enzyme}_flanking_sequences_${fl}_unique_2.fa \
  | sort -k2,2 \
  | awk -v IGNORECASE=1 '{if($2==SEQ){gsub(">","",$1);ID=ID"|"$1;counter++}else{if(SEQ!="" && counter==1){printf("%s\n%s\n", ID,SEQ)};SEQ=$2;ID=$1;counter=1}}END{if(counter==1){printf("%s\n%s\n", ID,SEQ)}}' \
  > ${genome}_${enzyme}_flanking_sequences_${fl}_unique.fa
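
Stage by stage: the first awk joins each header/sequence pair onto one tab-separated line, sort -k2,2 groups identical sequences together, and the second awk (gawk, since it relies on IGNORECASE) prints a record only when its sequence group has size one. A quick sanity check with made-up values:

genome=toy; enzyme=test; fl=0   # example values only
printf '>h1\nAAAA\n>h2\nAAAA\n>h3\nCCCC\n' > ${genome}_${enzyme}_flanking_sequences_${fl}_unique_2.fa
# running the command above should leave only the CCCC record in
# ${genome}_${enzyme}_flanking_sequences_${fl}_unique.fa:
# >h3
# CCCC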

rr1859 pinned this issue Aug 16, 2019