# How diverse are archaeal histones?
## Approach 
To look at the diversity of histones in archaea, we need to:
1. Collect all archaeal proteins.
2. Download an HMM profile to search against.
3. Identify which proteins are histones using hmmsearch.
4. Create a database with at least the gene name, sequence, genomic GC content, and type of "-phile" (use keywords and look through each species).
5. Visualize these histones on a 3D scatterplot (length vs. genomic GC vs. pI).
6. Use half_blast to identify doublet proteins.
7. Use sklearn to cluster the proteins:
    1. 3 dimensions: length, genomic GC, pI.
    2. n dimensions: 3-mers (or other k-mers) of the aligned proteins.
    3. n dimensions: each amino acid of the aligned proteins.
8. Evaluate each cluster:
    1. Find the average or median protein of each cluster.
    2. See what the characteristics of each group are.
    3. Figure out which phyla are represented in each cluster.
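
The k-mer clustering step in the plan above needs each protein turned into a fixed-length numeric vector. The following is a minimal sketch of one way to do that with the standard library only. The function names `kmer_counts` and `kmer_matrix` are hypothetical, and skipping any k-mer that contains a gap or non-standard residue is an assumption, not something the plan specifies.

```python
from collections import Counter

# The 20 standard amino acids; anything else (gaps, X, etc.) is skipped.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in one protein sequence.

    Assumption: k-mers containing a gap or non-standard residue are ignored.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(res in AMINO_ACIDS for res in kmer):
            counts[kmer] += 1
    return counts


def kmer_matrix(seqs, k=3):
    """Return (sorted k-mer vocabulary, one count vector per sequence)."""
    per_seq = [kmer_counts(s, k) for s in seqs]
    vocab = sorted(set().union(*per_seq)) if per_seq else []
    rows = [[counts[kmer] for kmer in vocab] for counts in per_seq]
    return vocab, rows


vocab, rows = kmer_matrix(["MAKKA", "MAK-KA"])
print(vocab)
print(rows)
```

Each row has one column per observed k-mer, so the resulting list of lists could be passed to an sklearn clusterer such as `sklearn.cluster.KMeans` as the feature matrix.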

#### 1. Collect all archaeal proteins using the Slurm script on fiji (requires NCBI's EDirect; takes ~11 hrs for 6.3 million proteins)

In [7]:
!sbatch fetch_archaeal_proteins.q

/bin/sh: esearch: command not found
/bin/sh: efetch: command not found


#### The .fa file is named with a date so that it is not overwritten on the next run (e.g. archaeal_proteins_20191203.fa)
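
The contents of fetch_archaeal_proteins.q are not shown here, so the following is only a sketch of what the date-stamped fetch might look like. It builds the output filename and an EDirect command line without running anything. The helper names are hypothetical, and the query `txid2157[Organism:exp]` (NCBI taxonomy ID 2157, Archaea) is an assumption about what the script searches for.

```python
import datetime
import shlex


def dated_fasta_name(prefix="archaeal_proteins", day=None):
    """Build a name like archaeal_proteins_20191203.fa (today if no day given)."""
    day = day or datetime.date.today()
    return f"{prefix}_{day:%Y%m%d}.fa"


def edirect_fetch_command(out_path, query="txid2157[Organism:exp]"):
    """Return (but do not run) an esearch | efetch pipeline as a shell string.

    Assumption: the real Slurm script may use a different query or options.
    """
    return (
        f"esearch -db protein -query {shlex.quote(query)} "
        f"| efetch -format fasta > {shlex.quote(out_path)}"
    )


out_path = dated_fasta_name(day=datetime.date(2019, 12, 3))
print(out_path)
print(edirect_fetch_command(out_path))
```

Returning the command as a string keeps the sketch safe to run on a machine without EDirect installed; the string can be pasted into a Slurm script on fiji.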
#### 2. Now we need to download the eukaryotic histone HMM profile to search our database against.
To do this, go to http://pfam.xfam.org/family/PF00125#tabview=tab3

Under "format alignment", select:
- Alignment: Seed
- Format: Stockholm
- Order: Tree
- Sequence: Inserts lower case
- Gaps: Gaps as "-" (dashes)
- Download/view: Download

Once downloaded, move the file into the working directory (e.g. PF00125_seed.txt).
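
Before building the profile, it can help to confirm that the download really is a Stockholm alignment and to see how many sequences it contains. The following is a minimal sketch with a hypothetical helper name, `stockholm_sequence_names`; it assumes the standard layout of a `# STOCKHOLM` header line, `#` annotation lines, and a closing `//`.

```python
def stockholm_sequence_names(path):
    """Return the unique sequence names in a Stockholm file, in file order."""
    names = []
    seen = set()
    with open(path) as handle:
        first = handle.readline()
        if not first.startswith("# STOCKHOLM"):
            raise ValueError(f"{path} does not look like a Stockholm file")
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("//"):
                break  # end of the alignment
            if not line.strip() or line.startswith("#"):
                continue  # blank line or annotation
            name = line.split()[0]
            # Interleaved files repeat names across blocks; keep each once.
            if name not in seen:
                seen.add(name)
                names.append(name)
    return names
```

For example, `len(stockholm_sequence_names("PF00125_seed.txt"))` gives the number of seed sequences, which can be compared with the `nseq` value that hmmbuild reports.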

#### Time to build the hmm-profile
On fiji (or on a local machine with HMMER installed), run the following to build an .hmm file called euk_histones.hmm (you can also run build_hmm_profile.sh).

In [1]:
%%bash
module purge
module load hmmer
hmmbuild euk_histones.hmm PF00125_seed.txt

# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.1b2 (February 2015); http://hmmer.org/
# Copyright (C) 2015 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file:             PF00125_seed.txt
# output HMM file:                  euk_histones.hmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# idx name                  nseq  alen  mlen eff_nseq re/pos description
#---- -------------------- ----- ----- ----- -------- ------ -----------
1     PF00125_seed            29   142   131     6.04  0.590 

# CPU time: 0.13u 0.01s 00:00:00.14 Elapsed: 00:00:00.15


#### 3. Use the archaeal protein FASTA file and the HMM profile to hmmsearch for archaeal histones (you might want to submit hmmsearch.q through sbatch instead).

In [4]:
%%bash
module purge
module load hmmer
# Date-stamp the output so earlier results are not overwritten
d=$(date +%Y%m%d)
hmmsearch euk_histones.hmm archaeal_proteins_20191203.fa > arc_histones_hits_${d}.out

bash: line 1: sbatch: command not found


#### Output looks like:

 hmmsearch :: search profile(s) against a sequence database
 HMMER 3.1b2 (February 2015); http://hmmer.org/
 Copyright (C) 2015 Howard Hughes Medical Institute.
 Freely distributed under the GNU General Public License (GPLv3).
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 query HMM file:                  euk_histones.hmm
 target sequence database:        archaeal_proteins_20191203.fa
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query:       PF00125_seed  [M=131]
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence               Description
    ------- ------ -----    ------- ------ -----   ---- --  --------               -----------
    1.6e-48  172.7   1.5    1.7e-48  172.5   1.5    1.0  1  RYG58632.1              hypothetical protein EON64_21105 [arc
    8.8e-47  167.0   0.1    1.1e-46  166.7   0.1    1.0  1  RYG46762.1              hypothetical protein EON67_09185, par
    8.9e-40  144.3   0.6      1e-39  144.1   0.6    1.0  1  RYH04814.1              hypothetical protein EON65_46445 [arc
    8.8e-33  121.7   0.3    1.6e-20   82.0   0.2    2.0  2  RYG51695.1              hypothetical protein EON67_02840 [arc
    5.8e-17   70.5   0.2    6.3e-17   70.4   0.2    1.0  1  RYH19767.1              hypothetical protein EON65_25540 [arc
    8.9e-16   66.7   0.0    1.1e-15   66.4   0.0    1.2  1  RYG62260.1              hypothetical protein EON64_18045, par
    9.6e-16   66.6   0.0    1.2e-15   66.3   0.0    1.2  1  RYH06781.1              hypothetical protein EON65_42495 [arc
    1.6e-15   65.8   0.4    6.5e-08   41.2   0.4    2.0  2  EJN60957.1              histone-like protein [Halogranum sala
      2e-15   65.5   0.6    6.3e-08   41.3   0.4    2.0  2  WP_049893615.1          aldolase [Halogranum salarium]
    2.9e-15   65.0   0.1    1.2e-13   59.8   0.0    2.1  1  RYG62261.1              hypothetical protein EON64_18050 [arc
    6.2e-15   63.9   0.5    6.3e-08   41.3   0.4    2.0  2  SFL21946.1              histone H3/H4 [Halogranum rubrum]
    6.2e-15   63.9   0.5    6.3e-08   41.3   0.4    2.0  2  WP_089870122.1          aldolase [Halogranum rubrum]
    7.9e-15   63.6   0.2    8.4e-15   63.5   0.2    1.0  1  RYG45536.1              hypothetical protein EON67_10485, par
    1.9e-14   62.4   0.0    2.5e-14   62.0   0.0    1.1  1  RYH21862.1              hypothetical protein EON65_20030 [arc
    2.1e-14   62.2   0.2    4.7e-08   41.7   0.1    2.0  2  ELY56629.1              Transcription factor CBF/NF-Y/histone
    2.1e-14   62.2   0.2    4.7e-08   41.7   0.1    2.0  2  WP_007259217.1          transcription factor CBF/NF-Y/histone
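
To work with these hits on a local machine, the per-sequence score table can be read into Python. The following is a rough sketch that parses rows in the layout shown above. The function names and the `max_evalue=1e-5` cutoff are hypothetical choices, not values used in this analysis. hmmsearch can also write a tab-style table with its `--tblout` option, which is likely more robust to parse than the human-readable output.

```python
def parse_hit_line(line):
    """Parse one row of the 'Scores for complete sequences' table.

    Returns a dict, or None for headers, separators, and other lines.
    """
    fields = line.strip().split(None, 9)
    if len(fields) < 9:
        return None
    try:
        return {
            "evalue": float(fields[0]),
            "score": float(fields[1]),
            "bias": float(fields[2]),
            "best_dom_evalue": float(fields[3]),
            "best_dom_score": float(fields[4]),
            "n_domains": int(fields[7]),
            "accession": fields[8],
            "description": fields[9] if len(fields) > 9 else "",
        }
    except ValueError:
        return None  # non-numeric columns: not a hit row


def significant_hits(lines, max_evalue=1e-5):
    """Keep hits whose full-sequence E-value is at or below max_evalue.

    Assumption: 1e-5 is an illustrative threshold only.
    """
    hits = (parse_hit_line(line) for line in lines)
    return [hit for hit in hits if hit and hit["evalue"] <= max_evalue]
```

A typical use would be `significant_hits(open("arc_histones_hits_20191203.out"))`, where the file name is whatever the date-stamped hmmsearch run produced.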


#### I'm going to move to my local machine now.