<a href="https://colab.research.google.com/github/zephyris/discoba_alphafold/blob/main/DiscobaHMMER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@markdown Enter the query name and sequence, then select "Runtime > Run all".
from google.colab import files
import re
import hashlib

query_name = 'Tb927.8.4970_PFR2' #@param {type:"string"}
query_sequence = 'MLGTVDAIDYDGDRLHKVVLRFPAVRSGESEIVKEVWPCERIGQGSFGTVYRAVSSDYPRLALKISTGKSTRLRQELDVLSRVCTKGRLLLPRFEFGALNKTADLIVIGMELCVPSTLHDLLLSTRITSEAEMLFMAHQAVQAVSYVHAEGCIHRDIKLQNFVFDLDGNLKLIDFGLACNSLKPPAGDVVAGTVSFMSPEMAHNALHKDRRVSVGVAADVWSLGIVLFSIFTQRNPYPAPETPAPAAGSTPGGAGAAGVTGRGDITHGAEGEKGNDLSQQHRMNERLLRRVAAGDWQWPVGVTVSQDLKQLVNSILVVNPEERPSVSTILENKLWNLRRRYPPAAVAAFLGVQDDFLLSHDEAHLMRAVEERSAGVAASLRNSRLHSPASASNEDNGETDAQHSSSGSARNGGLNTSGATTSPARSSLKVVQRCGEGGIDGAVTVQVYDVRASTRKRSKPIREISVVMAEETAKTRRSKSARRATGAVSAPSSRVVSRAASTEYSRRIAPPAGARLQSSAAHTCANSGDDEGEAEAEAVNRGTSTSQRHSRGMSPVRLQDALETAGSVAAGQRDSVGHPKALVEPSCATPPLLTSGKQQPLSEHMRPDIQRSGSVELLEDAEAPTEASAMPTSHKRAASSGKKRRDASLRQPSSLILKGSTRDLSADAPRSTTTAASTQASTSLLASRTTLPLSAVNPSPSSSRQASLRRQASASAAAVSSAQGCAGHRGSSPVMKRAQRVALELGLDVIWHDEADHRRALSAMLLIEHAWLLASFRLTIEEDQERYSITWLAEEQEKSAAHPHRFKEVMQVMSKKYQYGFVCDMCDYEFLPTGPGEKDLHFFHCPCGRDLCPDCYTAYQRQCTCSCCRAVHSNSCVLREHLLLTGGTQYYSGSRKTNAAARADAVRGSFQAAASLNEEAESGDEASAPPEPPRRRGRPPKQDKNRSAVKQKGSRAAKDSSRRRRGAQDTLDVSVDDAHEVEQINLPRISIAAMQQQEERSSNGSHRGGGTAAVGVAPRPQRPEDVEVKQRPVESVPEGPWRPFARFKKDRRDEVAQQPTPEERDALLNGEWIRHFYLFPQAEPERVAASGTWAEGEEEPYAFVYHAQPGRTGAIFLTSDFPMHSAVFSMLERQFFVVNQVDTVEGVDSTRATSLLKAKGHPELRIAFHALQDIVAYDTNMMKQQRTPGTVSVYQAPRSAYSCNGEPFLYVRWFRFNENRTLSAFLLSNGAVQVFVNNEYELRWFDESRKFLIRYNGVCELVDDGTFALAPGINHLLYDSFDA' #@param {type:"string"}

def add_hash(x,y):
  return x+"_"+hashlib.sha1(y.encode()).hexdigest()[:5]

query_name = "".join(query_name.split())
query_name = re.sub(r'\W+', '', query_name)
query_name = add_hash(query_name, query_sequence)

with open(f"{query_name}.fasta", "w") as text_file:
    # Write a single-record FASTA file, ending with a newline
    text_file.write(f">{query_name}\n{query_sequence}\n")

#@markdown Searches take several minutes. The first run will set up the HMMER software and download a large sequence database so will take a little longer.

#@markdown All sequence and sequencing data used in this database are publicly available, but the raw data deserve proper citation, and building this database was a significant time investment. Please watch this space for full details and how to credit this resource.
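
A pasted sequence can easily pick up stray whitespace, line breaks, or digits. The sketch below shows one way to normalise and check the query before it is written to the FASTA file. `clean_sequence` and `VALID_RESIDUES` are hypothetical names that are not part of this notebook, and the accepted alphabet is an assumption (the 20 standard residues plus common ambiguity and rare-residue codes).

```python
import re

# Assumed alphabet: 20 standard amino acids plus B, J, Z, X, U and O.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBJZXUO")

def clean_sequence(raw):
    """Hypothetical helper: strip whitespace, upper-case, reject other characters."""
    seq = re.sub(r"\s+", "", raw).upper()
    if not seq:
        raise ValueError("empty sequence")
    bad = sorted(set(seq) - VALID_RESIDUES)
    if bad:
        raise ValueError(f"unexpected characters: {''.join(bad)}")
    return seq
```

If used, it would be called as `query_sequence = clean_sequence(query_sequence)` before `add_hash`, so that the hash reflects the cleaned sequence.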

In [None]:
#@title Install HMMER, fetch Discoba database

#@markdown Downloads a database from wheelerlab.net, built by Richard Wheeler from various publicly available non-UniProt sources.

#@markdown The majority of these sequences are not in the UniRef100 database, which is commonly used by online database search/alignment tools. Many are derived from raw sequencing data and are not (to my knowledge) found in any searchable database.

#@markdown The database is updated periodically. To update your copy, delete the "DISCOBA_READY" file using the "Files" interface on the left and re-run this section.

%%bash -s
#Install HMMER, for search and alignment
if [ ! -f HMMER_READY ]; then
  # -y avoids a confirmation prompt; only mark as ready if the install succeeds
  apt-get install -y hmmer > /dev/null 2>&1 && touch HMMER_READY
fi

#Download the custom Discoba database
if [ ! -f DISCOBA_READY ]; then
  rm -rf discoba
  mkdir discoba
  # Print the current database statistics
  curl -s -L http://wheelerlab.net/discobaStats.txt
  # Only mark the database as ready if download and decompression both succeed
  curl -s -L -f -o discoba/discoba.fasta.gz http://wheelerlab.net/discoba.fasta.gz \
    && gzip -d discoba/discoba.fasta.gz \
    && touch DISCOBA_READY
fi

#Fetch hh-suite (only its reformat.pl script is used, to convert the alignment to a3m)
if [ ! -f HHSUITE_READY ]; then
  rm -rf hh-suite
  # Only mark as ready if the clone succeeds
  git clone --depth 1 https://github.com/soedinglab/hh-suite && touch HHSUITE_READY
fi

In [None]:
#@title Run HMMER to get Discoba MSA

#@markdown It is normal for this to take several minutes.
%%bash -s $query_name

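if [ ! -f $1.hmm.a3m ]; then
  echo "Searching discoba/discoba.fasta with $1.fasta"
  # Iterative search; -A saves the alignment of hits in Stockholm format
  jackhmmer -A $1.hmm.sto -o $1.hmm.out $1.fasta discoba/discoba.fasta
  # Convert the Stockholm alignment to a3m
  perl hh-suite/scripts/reformat.pl sto a3m $1.hmm.sto $1.hmm.a3m
fi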
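
After the search finishes, a quick look at the alignment depth can show whether the Discoba database contributed any homologues. A minimal sketch, assuming the a3m file follows the usual convention of one `>` header line per sequence; `count_a3m_sequences` is a hypothetical helper, not part of this notebook.

```python
def count_a3m_sequences(a3m_text):
    """Hypothetical helper: count records by counting '>' header lines."""
    return sum(1 for line in a3m_text.splitlines() if line.startswith(">"))
```

For example, `count_a3m_sequences(open(f"{query_name}.hmm.a3m").read())` would report the number of sequences in the alignment written by the cell above, with the query included in the count.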

In [None]:
#@title Download the a3m file
#@markdown This a3m file can be used as a standalone alignment for your favourite AlphaFold2 implementation.
#@markdown It uses HMMER for alignment, as used in [the official AlphaFold Colab notebook](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb), although that notebook does not have an easy interface to provide a custom MSA.

#@markdown In my experience it can, but does not always, give significantly better results than the [AlphaFold](https://alphafold.ebi.ac.uk/) database for _L. infantum_ and _T. cruzi_ genes.

#@markdown The content of this a3m can be appended to an a3m from searching a different database, so long as the same input sequence was used.
# The search cell writes the alignment to <query_name>.hmm.a3m
files.download(f"{query_name}.hmm.a3m")
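
Appending one a3m to another can be sketched as below. This is an illustration under stated assumptions, not the method of any particular AlphaFold2 implementation: it assumes the first record of each file is the query, checks that the two queries match once gaps are removed, and drops the duplicate query from the second file. `parse_a3m` and `merge_a3m` are hypothetical helpers.

```python
def parse_a3m(text):
    """Return a list of (header, sequence) pairs; lines before the first '>' are ignored."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def merge_a3m(primary_text, secondary_text):
    """Append the hits of secondary_text to primary_text if both share the same query."""
    primary = parse_a3m(primary_text)
    secondary = parse_a3m(secondary_text)
    if not primary or not secondary:
        raise ValueError("both alignments must contain at least one sequence")

    def ungapped(seq):
        return seq.replace("-", "").upper()

    if ungapped(primary[0][1]) != ungapped(secondary[0][1]):
        raise ValueError("the two alignments were built from different query sequences")
    # Keep everything from the first file, then the second file minus its query row.
    merged = primary + secondary[1:]
    return "\n".join(f"{header}\n{seq}" for header, seq in merged) + "\n"
```

The query check is the important part: rows in an a3m are aligned relative to the first sequence, so combining files built from different queries would give a meaningless alignment.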