# GWSS Morphological Marker Gene Ortholog Search

By: Cassie Ettinger
Email: cassandra.ettinger@ucr.edu

In [30]:
#loads some basic os/ipython functionality
from os import chdir, mkdir
from os.path import join
from IPython.display import FileLinks, FileLink
from Bio import Phylo

## Data processing

Download from Flybase all gene sequences for selected marker genes

Table of marker genes of interest and flybase IDs, etc

In [11]:
!cat GWSS\ Morphological\ Gene\ Targets\ for\ Homology\ Searches.tsv

Morphological group	Gene name	Flybase ID	Flybase: Closest Ortholog ID	Flybase: Closest Reference Species	Literature: GenBank Assession No.	Literature: Reference Species	Literature: Citation	Notes
Eye color markers	white	FBgn0003996	CLEC000648	Cimex lectularius	MK480204	Limnogonus franciscanus	https://www.pnas.org/content/116/38/19046	
Eye color markers	brown	FBgn0000241	ACYPI008444	Acyrthosiphon pisum	MK480212	Limnogonus franciscanus	https://www.pnas.org/content/116/38/19046	
Eye color markers	scarlet	FBgn0003515	CLEC004040	Cimex lectularius	MK480213	Limnogonus franciscanus	https://www.pnas.org/content/116/38/19046	
Eye color markers	punch	FBgn0003162	CLEC005231	Cimex lectularius	MK480217	Limnogonus franciscanus	https://www.pnas.org/content/116/38/19046	
Eye color markers	purple	FBgn0003141	CLEC001054	Cimex lectularius	MK480218	Limnogonus franciscanus	https://www.pnas.org/content/116/38/19046	
Eye color markers	DhpD	FBgn0261436	CLEC005050	Cimex lectularius	MK480214	Limnogon

First - turn multiline fasta into single line fasta file

In [68]:
!awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}'\
< data/all_marker_protein_seqs.fasta > data/all_marker_protein_seqs_fixed.fasta

In [67]:
!awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}'\
< data/Homalodisca_vitripennis_A6A7A9_masurca_v1.proteins.fa > data/Homalodisca_vitripennis_A6A7A9_masurca_v1.proteins.fixed.fa

In [69]:
!awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}'\
< data/Homalodisca_vitripennis_A6A7A9_masurca_v1.cds-transcripts.fa > data/Homalodisca_vitripennis_A6A7A9_masurca_v1.cds-transcripts.fixed.fa

Split fasta file into indivdual marker genes

In [79]:
for gene in $(cat genes.txt);
do grep -A 1 '>'$gene data/all_marker_protein_seqs_fixed.fasta | sed '/^--$/d' > $gene'.fasta'; 
done

SyntaxError: invalid syntax (<ipython-input-79-fcd74c86010c>, line 1)

## Run phmmer to identify ortholog candidates

Run phmmer on each marker gene fasta and output both full output with alingments and also a table that can be quickly looked at to ID top hits

In [70]:
for gene in $(cat genes.txt);
do phmmer --tblout $gene'phmmer_out_table' -o $gene'phmmer_out' $gene'.fasta' data/Homalodisca_vitripennis_A6A7A9_masurca_v1.proteins.fixed.fa;
done                   

SyntaxError: invalid syntax (<ipython-input-70-8241b04525fe>, line 1)

Clean up outputs

In [None]:
for gene in $(cat genes.txt);
do mkdir $gene'_results';
done

In [None]:
for gene in $(cat genes.txt);
do mv $gene* $gene'_results';
done
#will error about the results folders but thats OK

Look at each output and make a list of top hits for each gene

### Eye Color Marker Genes

Jason has already done white, so let's use it sanity check this workflow

In [3]:
!cat white_results/whitephmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_036960-T1     -          white_CLEC000648     -            1.9e-94  319.9   3.9   4.2e-86  292.4   0.8   2.3   2   1   0   2   2   2   2 HOVITM_036960
HOVITM_106550-T1     -          white_CLEC000648     -            1.5e-93  317.0   0.0   1.9e-93  316.6   0.0   1.0   1   0   0   1   1   1   1 HOVITM_106550
HOVITM_106548-T1     -          white_CLEC000648     -              3e-93  315.9  11.6   2.1e-70  240.5   7.2   3.0   1   1   1   2   2   2   2 HOVITM_106548
HOVITM_064641-T1     -          white_CLEC00

Mulitple hits here with high scores

HOVITM_036960
HOVITM_106550 
HOVITM_106548

In [5]:
!cat brown_results/brownphmmer_out_table

#                                                                                 --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name                             accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ----------                   -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_036960-T1     -          brown_lcl|MK480212.1_prot_QEO19124.1_1 -           3.4e-131  441.7   2.8  4.2e-131  441.4   2.8   1.0   1   0   0   1   1   1   1 HOVITM_036960
HOVITM_064641-T1     -          brown_lcl|MK480212.1_prot_QEO19124.1_1 -           2.5e-113  382.7   1.4  3.3e-113  382.3   1.4   1.0   1   0   0   1   1   1   1 HOVITM_064641
HOVITM_102901-T1     -          brown_lcl|MK480212.1_prot_QEO19124.1_1 -            3.8e-66  226.8  15.2   3.9e-6

Multiple hits - top hit overlaps with white & scarlet - but HOVITM_035934 has higher hit scores here

HOVITM_036960
HOVITM_064641

In [6]:
!cat scarlet_results/scarletphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_064641-T1     -          scarlet_CLEC004040   -           5.3e-108  364.5   9.6   7.5e-71  241.9   0.3   2.0   2   0   0   2   2   2   2 HOVITM_064641
HOVITM_036960-T1     -          scarlet_CLEC004040   -           2.2e-107  362.4  15.2   5.8e-59  202.7   0.6   3.0   2   1   0   2   2   2   2 HOVITM_036960
HOVITM_102901-T1     -          scarlet_CLEC004040   -              4e-62  213.1  14.5   2.9e-50  173.9   0.2   2.0   2   0   0   2   2   2   2 HOVITM_102901
HOVITM_106550-T1     -          scarlet_CLEC

Both hits overlap with brown - but HOVITM_062953 slightly higher here for bed bug, HOVITM_035934 higher fly & ws

HOVITM_036960
HOVITM_064641

In [7]:
!cat punch_results/punchphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_066389-T1     -          punch_CLEC005231     -            9.9e-56  191.3   0.0   1.2e-55  191.0   0.0   1.1   1   0   0   1   1   1   1 HOVITM_066389
HOVITM_066388-T1     -          punch_CLEC005231     -            1.3e-09   40.4   0.2   1.5e-09   40.2   0.2   1.2   1   0   0   1   1   1   1 HOVITM_066388
HOVITM_066389-T1     -          punch_lcl|MK480217.1_prot_QEO19129.1_1 -            1.6e-67  227.0   0.3   2.1e-67  226.6   0.3   1.1   1   0   0   1   1   1   1 HOVITM_066389
HOVITM_052825-T1     -    

HOVITM_066389

In [8]:
!cat purple_results/purplephmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_118566-T1     -          purple_CLEC001054    -            1.1e-56  193.7   0.7   1.2e-56  193.5   0.7   1.0   1   0   0   1   1   1   1 HOVITM_118566
HOVITM_090371-T1     -          purple_CLEC001054    -             0.0044   19.6   1.9        12    8.5   0.0   4.9   2   2   3   5   5   5   1 HOVITM_090371
HOVITM_092621-T1     -          purple_CLEC001054    -               0.12   14.9   0.0      0.22   14.1   0.0   1.4   1   0   0   1   1   1   0 HOVITM_092621
HOVITM_118566-T1     -          purple_lcl|M

HOVITM_118566

In [9]:
!cat DhpD_results/DhpDphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_094535-T1     -          DhpD_CLEC005050      -            2.9e-30  108.2   0.1   3.1e-30  108.1   0.1   1.0   1   0   0   1   1   1   1 HOVITM_094535
HOVITM_094531-T1     -          DhpD_CLEC005050      -            2.3e-20   75.6   0.0   3.4e-20   75.0   0.0   1.2   1   0   0   1   1   1   1 HOVITM_094531
HOVITM_043802-T1     -          DhpD_CLEC005050      -            1.7e-06   30.0   0.1   1.8e-06   29.9   0.1   1.0   1   0   0   1   1   1   1 HOVITM_043802
HOVITM_063820-T1     -          DhpD_CLEC005

HOVITM_094535
HOVITM_094531

In [10]:
!cat sepia_results/sepiaphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_021241-T1     -          sepia_CLEC008904     -            5.8e-63  214.9   2.4   7.3e-63  214.6   2.4   1.0   1   0   0   1   1   1   1 HOVITM_021241
HOVITM_021242-T1     -          sepia_CLEC008904     -            5.5e-43  149.5   0.0   7.1e-43  149.1   0.0   1.0   1   0   0   1   1   1   1 HOVITM_021242
HOVITM_021244-T1     -          sepia_CLEC008904     -            1.5e-23   85.8   2.3   1.2e-06   30.4   0.0   3.0   1   1   2   3   3   3   3 HOVITM_021244
HOVITM_021243-T1     -          sepia_CLEC00

HOVITM_021241
HOVITM_021242

In [11]:
!cat cinnabar_results/cinnabarphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_100409-T1     -          cinnabar_CLEC025106  -           4.6e-161  538.9   0.0  5.6e-161  538.6   0.0   1.0   1   0   0   1   1   1   1 HOVITM_100409
HOVITM_062936-T1     -          cinnabar_CLEC025106  -            9.1e-15   56.9   0.0   9.4e-15   56.8   0.0   1.0   1   0   0   1   1   1   1 HOVITM_062936
HOVITM_047469-T1     -          cinnabar_CLEC025106  -             0.0018   19.7   0.0      0.03   15.6   0.0   2.1   2   0   0   2   2   2   1 HOVITM_047469
HOVITM_083683-T1     -          cinnabar_CLE

One very strong hit

HOVITM_100409

In [12]:
!cat rosy_results/rosyphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_114983-T1     -          rosy_CLEC006546      -                  0 1787.8   2.2         0 1781.7   2.2   2.0   1   1   0   1   1   1   1 HOVITM_114983
HOVITM_038785-T1     -          rosy_CLEC006546      -           6.1e-159  533.4   0.0  7.3e-159  533.1   0.0   1.0   1   0   0   1   1   1   1 HOVITM_038785
HOVITM_079781-T1     -          rosy_CLEC006546      -             5e-157  527.0   1.0  6.3e-156  523.4   1.0   1.9   1   1   0   1   1   1   1 HOVITM_079781
HOVITM_055557-T1     -          rosy_CLEC006

HOVITM_114983

In [13]:
!cat vermillion_results/vermillionphmmer_out_table

#                                                                --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name            accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ----------  -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_040850-T1     -          vermillion_CLEC006165 -            7.3e-93  314.2   0.0   8.7e-93  313.9   0.0   1.0   1   0   0   1   1   1   1 HOVITM_040850
HOVITM_004546-T1     -          vermillion_CLEC006165 -            1.1e-10   44.0   0.2   1.3e-10   43.6   0.2   1.1   1   0   0   1   1   1   1 HOVITM_004546
HOVITM_045698-T1     -          vermillion_CLEC006165 -              0.081   14.8   0.0     0.081   14.8   0.0   1.0   1   0   0   1   1   1   0 HOVITM_045698
HOVITM_040850-T1     -          vermil

HOVITM_040850

### Body markers

In [14]:
!cat yellow_results/yellowphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_101006-T1     -          yellow_CLEC000391    -           5.8e-167  558.2   5.4  6.9e-167  557.9   5.4   1.0   1   0   0   1   1   1   1 HOVITM_101006
HOVITM_066476-T1     -          yellow_CLEC000391    -            1.1e-58  201.5   1.3   1.5e-58  201.0   1.3   1.1   1   0   0   1   1   1   1 HOVITM_066476
HOVITM_108993-T1     -          yellow_CLEC000391    -            6.6e-47  162.6   0.2     9e-47  162.2   0.2   1.0   1   0   0   1   1   1   1 HOVITM_108993
HOVITM_074737-T1     -          yellow_CLEC0

HOVITM_101006
HOVITM_074737
HOVITM_011069
HOVITM_046721

In [16]:
!cat ebony_results/ebonyphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_049649-T1     -          ebony_CLEC007608     -           2.3e-138  465.1   0.1  2.6e-138  465.0   0.1   1.0   1   0   0   1   1   1   1 HOVITM_049649
HOVITM_024223-T1     -          ebony_CLEC007608     -            6.5e-55  189.2   0.1   8.1e-55  188.8   0.1   1.0   1   0   0   1   1   1   1 HOVITM_024223
HOVITM_070686-T1     -          ebony_CLEC007608     -            3.8e-37  130.4   0.0   4.8e-36  126.8   0.0   2.0   1   1   0   1   1   1   1 HOVITM_070686
HOVITM_024224-T1     -          ebony_CLEC00

HOVITM_049649

In [17]:
!cat tubby_results/tubbyphmmer_out_table

#                                                                    --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name                accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ----------      -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_027490-T1     -          tubby_L798_11124:KDR14905 -            0.00014   24.6   1.7   0.00014   24.6   1.7   3.1   2   1   1   3   3   3   1 HOVITM_027490
HOVITM_081800-T1     -          tubby_L798_11124:KDR14905 -              0.004   19.9   0.0   1.7e+02    4.8   0.0   3.8   1   1   4   5   5   5   0 HOVITM_081800
HOVITM_090004-T1     -          tubby_L798_11124:KDR14905 -              0.036   16.8   0.0     0.047   16.4   0.0   1.1   1   0   0   1   1   1   0 HOVITM_090004
HOVITM_010816-

May not have good match to this

HOVITM_040559

### Wing markers 

In [18]:
!cat wingless_results/winglessphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_053081-T1     -          wingless_CLEC012013  -            1.9e-93  315.6   6.8   2.5e-93  315.2   6.8   1.0   1   0   0   1   1   1   1 HOVITM_053081
HOVITM_026070-T1     -          wingless_CLEC012013  -            1.7e-89  302.6  16.0   2.1e-89  302.3  16.0   1.0   1   0   0   1   1   1   1 HOVITM_026070
HOVITM_026395-T1     -          wingless_CLEC012013  -            7.1e-85  287.4  15.9   8.3e-85  287.2  15.9   1.1   1   0   0   1   1   1   1 HOVITM_026395
HOVITM_053080-T1     -          wingless_CLE

HOVITM_053081
HOVITM_026070
HOVITM_117535
HOVITM_026395

In [19]:
!cat curly_results/curlyphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_050937-T1     -          curly_CLEC009522     -                  0 2342.3   6.4         0 2342.0   6.4   1.0   1   0   0   1   1   1   1 HOVITM_050937
HOVITM_050936-T1     -          curly_CLEC009522     -            3.6e-66  226.0  14.8   9.8e-31  108.5   0.1   4.0   1   1   3   4   4   4   4 HOVITM_050936
HOVITM_036635-T1     -          curly_CLEC009522     -            9.2e-64  218.0  20.1   5.1e-44  152.5  14.0   3.9   3   2   0   3   3   3   2 HOVITM_036635
HOVITM_050938-T1     -          curly_CLEC00

HOVITM_050937

In [20]:
!cat dumpy_results/dumpyphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_089588-T1     -          dumpy_CLEC025055     -                  0 7979.0 1731.5         0 1366.7 283.3   7.0   1   1   6   7   7   7   7 HOVITM_089588
HOVITM_089587-T1     -          dumpy_CLEC025055     -                  0 7383.2 1574.9         0 4942.6 1023.6   3.0   1   1   2   3   3   3   3 HOVITM_089587
HOVITM_089591-T1     -          dumpy_CLEC025055     -                  0 4523.1 1064.8         0 1905.7 410.5   7.0   1   1   6   7   7   7   7 HOVITM_089591
HOVITM_089596-T1     -          dumpy_CL

Several hits - but several alleles (and also maybe copies?) for this gene so maybe not unexpected; may not be good target

HOVITM_089588
HOVITM_089587
HOVITM_089591
HOVITM_089596
HOVITM_089585
HOVITM_087366

In [21]:
!cat miniature_results/miniaturephmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_009639-T1     -          miniature_CLEC002209 -           9.2e-266  885.1   1.9  1.2e-265  884.7   1.9   1.1   1   0   0   1   1   1   1 HOVITM_009639
HOVITM_045920-T1     -          miniature_CLEC002209 -            1.7e-91  309.9   1.2   3.9e-84  285.5   0.5   2.3   1   1   1   2   2   2   2 HOVITM_045920
HOVITM_045631-T1     -          miniature_CLEC002209 -            6.3e-84  284.8   5.5   6.3e-84  284.8   5.5   1.8   1   1   0   1   1   1   1 HOVITM_045631
HOVITM_074771-T1     -          miniature_CL

HOVITM_009639

In [22]:
!cat vestigal_results/vestigalphmmer_out_table

#                                                                 --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name             accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ----------   -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_088486-T1     -          vestigal_ACYPI34460-PA -            5.4e-28  101.0  26.7   9.3e-28  100.3  26.7   1.3   1   0   0   1   1   1   1 HOVITM_088486
HOVITM_039378-T1     -          vestigal_ACYPI34460-PA -            8.7e-08   34.6   2.2   1.1e-07   34.2   2.2   1.1   1   0   0   1   1   1   1 HOVITM_039378
HOVITM_103681-T1     -          vestigal_ACYPI34460-PA -              0.045   15.7   1.6     0.077   15.0   1.6   1.3   1   0   0   1   1   1   0 HOVITM_103681
HOVITM_062214-T1     -          

HOVITM_088486


### Bristle markers

In [23]:
!cat singed_results/singedphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_097669-T1     -          singed_CLEC025340    -           3.8e-118  398.0   6.9   1.6e-62  214.4   0.3   3.0   2   1   0   2   2   2   2 HOVITM_097669
HOVITM_097664-T1     -          singed_CLEC025340    -            7.3e-77  261.8   0.0   8.5e-77  261.6   0.0   1.0   1   0   0   1   1   1   1 HOVITM_097664
HOVITM_097663-T1     -          singed_CLEC025340    -              5e-65  222.7   1.3   1.3e-64  221.4   1.3   1.6   1   1   0   1   1   1   1 HOVITM_097663
HOVITM_036479-T1     -          singed_CLEC0

HOVITM_097669
HOVITM_097664
HOVITM_097663

### Eye shape markers

In [24]:
!cat bar_results/barphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_073654-T1     -          bar_CLEC007883       -            3.9e-48  165.6  24.0   4.8e-48  165.3  24.0   1.1   1   0   0   1   1   1   1 HOVITM_073654
HOVITM_049196-T1     -          bar_CLEC007883       -               0.11   14.6   0.0      0.13   14.5   0.0   1.0   1   0   0   1   1   1   0 HOVITM_049196
HOVITM_042995-T1     -          bar_CLEC007883       -               0.35   13.0   5.4      0.46   12.7   5.4   1.2   1   0   0   1   1   1   0 HOVITM_042995
HOVITM_055515-T1     -          bar_CLEC0078

HOVITM_073654
HOVITM_073652

In [25]:
!cat glass_results/glassphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITM_093440-T1     -          glass_ACYPI008619-PA -                  0 1434.8 269.3   8.1e-39  136.6  11.2  16.6   2   1  15  17  17  16  16 HOVITM_093440
HOVITM_047639-T1     -          glass_ACYPI008619-PA -           5.6e-110  371.2  84.8   6.3e-24   87.5  15.0   5.0   1   1   4   5   5   5   5 HOVITM_047639
HOVITM_078038-T1     -          glass_ACYPI008619-PA -           1.9e-109  369.4  20.0  2.8e-109  368.9  20.0   1.2   1   0   0   1   1   1   1 HOVITM_078038
HOVITM_007912-T1     -          glass_ACYPI0

HOVITM_093440

### Compiled Hit List

In [26]:
!cat CandidateHits.txt

white	HOVITM_036960
white	HOVITM_106550 
white	HOVITM_106548
cinnabar	HOVITM_100409
brown	HOVITM_064641
brown	HOVITM_036960
scarlet	HOVITM_064641
scarlet	HOVITM_036960
punch	HOVITM_066389
purple	HOVITM_118566
DhpD	HOVITM_094535
DhpD	HOVITM_094531
sepia	HOVITM_021242
sepia	HOVITM_021241
rosy	HOVITM_114983
vermillion	HOVITM_040850
yellow	HOVITM_101006
yellow	HOVITM_074737
yellow	HOVITM_011069
yellow	HOVITM_046721
ebony	HOVITM_049649
tubby	HOVITM_040559
wingless	HOVITM_053081
wingless	HOVITM_026070
wingless	HOVITM_117535
wingless	HOVITM_026395
curly	HOVITM_050937
dumpy	HOVITM_089588
dumpy	HOVITM_089587
dumpy	HOVITM_089591
dumpy	HOVITM_089596
dumpy	HOVITM_089585
dumpy	HOVITM_087366
miniature	HOVITM_009639
vestigal	HOVITM_088486
singed	HOVITM_097669
singed	HOVITM_097664
singed	HOVITM_097663
bar	HOVITM_073654
bar	HOVITM_073652
glass	HOVITM_093440

## Protein alignments of all candidates against references

Split candidate list by gene name

In [56]:
for gene in $(cat genes.txt);
do grep $gene CandidateHits.txt | cut -f 2 > $gene'.hits.txt';
done

zsh:1: parse error near `do'
zsh:1: parse error near `done'


For each gene, get all the protein seq hits from the genome

In [None]:
for gene in $(cat genes.txt);
    for hit in $(cat $gene'.hits.txt');
    do grep -A 1 '>'$hit data/Homalodisca_vitripennis_A6A7A9_masurca_v1.proteins.fixed.fa | sed '/^--$/d' >> $gene'.hits.fasta';
done

For each gene, combine the protein seqs of references and candidate hits in preparation for aligning

In [None]:
for gene in $(cat genes.txt);
do cat $gene'_results'/$gene'.fasta' $gene'.hits.fasta' > $gene'.aln.fasta'; 
done

Alignment with muscle

In [None]:
for gene in $(cat genes.txt);
do muscle -in $gene'.aln.fasta' -out $gene'.aln' -clw;
done

In [None]:
#output is html and has some colors; not sure 
for gene in $(cat genes.txt);
do muscle -in $gene'.aln.fasta' -out $gene'.aln.html' -html;
done

In [None]:
#output for trees
for gene in $(cat genes.txt);
do muscle -in $gene'.aln.fasta' -out $gene'.aln.tre.fasta';
done

Make protein phylogenies

In [None]:
for gene in $(cat genes.txt);
do FastTree $gene'.aln.tre.fasta' > $gene'.aln.tre';
done

Clean up output

In [None]:
for gene in $(cat genes.txt);
do mv $gene* $gene'_results';
done
#will error about the results folders but thats OK

## Look at alignment results

### Eye color markers

In [28]:
!cat white_results/white.aln

MUSCLE (3.8) multiple sequence alignment


HOVITM_036960-T1                      --MFTIEKSSISQGPLRLRRVTPPPPVQFSTSQSREKLSGNQSSGNTCT-----------
white_CLEC000648                      MKDFTPWVFVLPSGG-----VDDPTLTQQSSALSLD------------------------
white_FBpp0070468                     --MGQEDQELLIRGG-----SKHPSAEHLNNGDSGAASQSCINQGFGQAKNY--------
HOVITM_106550-T1                      -MVISPCVNRFSIDS-----RNKPSYLRQISHSCLHYTTASCTCSLATVLGFFNQNHRNG
white_lcl|MK480204.1_prot_QEO191      -----------MTGG-----HDEREPLLITANGNGSKVTYKAVSDLGKDDDF--------
HOVITM_106548-T1                      ------------------------------------------------------------
                                                                                                  

HOVITM_036960-T1                      --------------------------------------------PGLTLTWRDLSVYAKI
white_CLEC000648                      --------------------------------------------DHIIYTWLGVTVTCNV
white_FBpp0070468                     --------------

In [31]:
tree = Phylo.read("white_results/white.aln.tre", "newick")
#print(tree)
#rooting to Dmel as bed bug + gwss white genes should be closer if follows species tree, but not best outgroup 
tree.root_with_outgroup({"name": "white_FBpp0070468"}) 
Phylo.draw_ascii(tree)

                 _____________ white_CLEC000648
               _|
              | |_______ HOVITM_106550-T1
          ____|
         |    |  _____ white_lcl|MK480204.1_prot_QEO19116.1_1
  _______|    |_|
 |       |      |____ HOVITM_106548-T1
_|       |
 |       |_____________________________ HOVITM_036960-T1
 |
 | white_FBpp0070468



#need to get jbrowse image
<img src="images/jbrowse_white.png">

HOVITM_106550 looks super nice for 1st half and HOVITM_106548 looks good for 2nd half? whats happening?? 

In [34]:
!cat brown_results/brown.aln

MUSCLE (3.8) multiple sequence alignment


brown_FBpp0312192                     ------------------MQES----------------GGSSGQGGPS----LCLEWKQL
brown_ACYPI008444-PA                  MATKKLLDLKYNEMWKAWNSTN----------------EEDDDYSSPLFKRDLVLSWKQL
HOVITM_036960-T1                      --MFTIEKSSISQGPLRLRRVTPPPPVQFSTSQSREKLSGNQSSGNTC-TPGLTLTWRDL
brown_lcl|MK480212.1_prot_QEO191      MSYPNLMDSNVMEISLLTGQEGCPSP----------GLGKRSGSGSPV-QGGLTLSWHEL
HOVITM_064641-T1                      ------------------------------------------MYRSPS-KCHLGI-----
                                                                                   ..     * :     

brown_FBpp0312192                     NYYVPDQEQSNYSFWNECRKKRELRILQDASGHMKTGDLIAILGGSGAGKTTLLAAISQR
brown_ACYPI008444-PA                  NVTVVRKIPKLFGSS----EVVTKQILNNVSGNVECGTLLGIMGPSGSGKTTLMATISHR
HOVITM_036960-T1                      SVYAKIKKESLFKSSST--EY--RKIINNVSGAVPPGTLVALMGASGAGKSTLMAALAYQ
brown_lcl|MK480212.1_prot_QEO191      SVWIKKKDMEKSNF

In [36]:
tree = Phylo.read("brown_results/brown.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "brown_FBpp0312192"}) 
Phylo.draw_ascii(tree)

                       ________ brown_lcl|MK480212.1_prot_QEO19124.1_1
                   ___|
                  |   |  ______________ HOVITM_036960-T1
  ________________|   |_|
 |                |     |___________ HOVITM_064641-T1
_|                |
 |                |_______________ brown_ACYPI008444-PA
 |
 | brown_FBpp0312192



Unclear

In [37]:
!cat scarlet_results/scarlet.aln

MUSCLE (3.8) multiple sequence alignment


scarlet_FBpp0075149                   MSDSDSKRIDVEAPERVEQHELQVMPVGSTIEVPSLDSTPKLSKRNSSERSLPLRSYS--
scarlet_lcl|MK480213.1_prot_QEO1      ---------------------------MALVPATEINQ-------------MNFSTKGFV
HOVITM_036960-T1                      ---------------------------MFTIEKSSISQGPLRLRRVTPPPPVQFSTSQSR
scarlet_CLEC004040                    ---------------------------MD-------------------------------
HOVITM_064641-T1                      ---------------------------MY-------------------------------
                                                                                                  

scarlet_FBpp0075149                   ---KWSPTE-----QGATLVWRDLCVYTNVG------GSGQRMKRIINN-----------
scarlet_lcl|MK480213.1_prot_QEO1      KEEIWEQPE-----DGSTLTWTDLSIYVRCKKPRMLRPAKFSYKRIVNN-----------
HOVITM_036960-T1                      EKLSGNQSSGNTCTPGLTLTWRDLSVYAKIKKESLFKSSSTEYRKIINN-----------
scarlet_CLEC004040                    --------------

In [39]:
tree = Phylo.read("scarlet_results/scarlet.aln.tre", "newick")
#print(tree)
#rooting to Dmel as bed bug + gwss genes should be closer if follows species tree, but not best outgroup 
tree.root_with_outgroup({"name": "scarlet_FBpp0075149"}) 
Phylo.draw_ascii(tree)

              _________ scarlet_lcl|MK480213.1_prot_QEO19125.1_1
            _|
           | |_____ HOVITM_036960-T1
  _________|
 |         |              _______ scarlet_CLEC004040
_|         |_____________|
 |                       |___________ HOVITM_064641-T1
 |
 | scarlet_FBpp0075149



Unclear

Overlapping hits between white / brown / scarlet (though white may have other problems) - may want to expand #ref seqs and # hits investigated for these genes to really narrow in on which are 'white' vs. 'scarlet' vs. 'brown'

Biologically scarlet - brown - white must have similar domains to dimerize 

https://www.pnas.org/content/pnas/116/38/19046.full.pdf
<img src="images/Fig3.png" >

In [40]:
!cat punch_results/punch.aln

MUSCLE (3.8) multiple sequence alignment


punch_CLEC005231                      ------MNGTQGKEVQLRPPRLRTVSWQEEITEGDNDAPGTPKTPR--------------
punch_FBpp0071505                     MKPQTSEQNGSGQNGEGAADAVAVATIPTGEASAASATSGTDLTVSKNSQQLKLEMLNLE
punch_lcl|MK480217.1_prot_QEO191      ------------------------------------------------------------
HOVITM_066389-T1                      ------------------------------------------------------------
                                                                                                  

punch_CLEC005231                      -TSTTPGHEKCTFHHDLELDHKPPTRESLIPDMSRSYKMLLSSLGENPEREGLLKTPERA
punch_FBpp0071505                     LASNGSGHEKCTFHHDLELDHKPPTREALLPDMARSYRLLLGGLGENPDRQGLIKTPERA
punch_lcl|MK480217.1_prot_QEO191      ------------------------------------------------------------
HOVITM_066389-T1                      --------------------------------MASSYRMLLGSLGENPDRQGLLKTPERA
                                                    

In [41]:
tree = Phylo.read("punch_results/punch.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "punch_FBpp0071505"}) 
Phylo.draw_ascii(tree)

       _________________________________ punch_CLEC005231
     _|
  __| | HOVITM_066389-T1
 |  |
_|  | punch_lcl|MK480217.1_prot_QEO19129.1_1
 |
 | punch_FBpp0071505



HOVITM_066389 = punch

In [42]:
!cat purple_results/purple.aln

MUSCLE (3.8) multiple sequence alignment


purple_FBpp0088417                    MSQQPVAFLTRRETFSACHRLH-------------------SPQLSDAENLEVFGKCNNF
HOVITM_118566-T1                      ---MKQCFVVKILNLPTYFSKA--------LFLFYVPVFFFSPQLNDEENLETYGKCNNY
purple_CLEC001054                     -MASPIVYLTRVEKFSACHRLHRDKEVLRSSLIKTSRAFSGCPQLSDQVNKDVYGKCNNP
purple_lcl|MK480218.1_prot_QEO19      ---MAIAYLTRVEKFSACHRLH-------------------SPLLSDEDNLAVYGKCNNF
                                             ::..  .:.: .                      .* *.*  *  .:***** 

purple_FBpp0088417                    HGHGHNYTVEITVRGPIDRRTGMVLNITELKEAIETVIMKRLDHKNLDKDVEYFANTPST
HOVITM_118566-T1                      HGHGHNYTVEVTLKGPVTADTGMVMNINDLKKHMNKAIMEPMDHKNLDKDVPYFKNVVST
purple_CLEC001054                     NGHGHNYRVEVTVCGPVSKDTGMVMNLSDLKAHMNAAIMETLDHKNLDLDVPYFKDVVST
purple_lcl|MK480218.1_prot_QEO19      HGHGHNYTLEVTLRGPVSPDTGMVMNINDLKKIIQEAVMDTLDHKNIDKDVPYFKDVVST
                                      :****** :*:*: 

In [43]:
tree = Phylo.read("purple_results/purple.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "purple_FBpp0088417"}) 
Phylo.draw_ascii(tree)

                      _________________ HOVITM_118566-T1
                    _|
  _________________| |______ purple_lcl|MK480218.1_prot_QEO19130.1_1
 |                 |
_|                 |_________ purple_CLEC001054
 |
 | purple_FBpp0088417



HOVITM_118566 = purple

In [45]:
!cat DhpD_results/DhpD.aln

MUSCLE (3.8) multiple sequence alignment


HOVITM_094531-T1                      -------MRMVLLPMLTEQTRNLSSSCSGKKWLIDSRGWPRHQCHATKNHGCLRCRCRTT
HOVITM_094535-T1                      ------------------------------------------------------------
DhpD_FBpp0078625                      ------MATVFLGTVVHTKSFSEFESFEGGFLAVDDAG---KIIGVGQDYHAWASSNPAH
DhpD_CLEC005050                       ------PPIIIQGPIVHSVSKDRITALENKLIAVKD-G---KIVAL-EDSECMDEIRRMI
DhpD_lcl|MK480214.1_prot_QEO1912      MNFKQHENFVIQGPIIHSLSSNEIGYYENATIVVKK-G---KIVSF----DSEGKIKVSA
                                                                                                  

HOVITM_094531-T1                      QNLFENIHTESEMMEIEAKYSPNEVVVLEKGQFLIPGLIDTHTHAPQFPNKGLGYDKTLL
HOVITM_094535-T1                      ------------------------------------------------------------
DhpD_FBpp0078625                      AKGLTEVH-------------------LSDYQFLMPGFVDCHIHAPQFAQLGLGLDMPLL
DhpD_CLEC005050                       GDNFIFFK------

In [46]:
tree = Phylo.read("DhpD_results/DhpD.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "DhpD_FBpp0078625"}) 
Phylo.draw_ascii(tree)

            ___________ HOVITM_094535-T1
          ,|
          ||_____ DhpD_lcl|MK480214.1_prot_QEO19126.1_1
  ________|
 |        | ____________________________ HOVITM_094531-T1
_|        ||
 |         |___ DhpD_CLEC005050
 |
 | DhpD_FBpp0078625



HOVITM_094535?

In [47]:
!cat sepia_results/sepia.aln

MUSCLE (3.8) multiple sequence alignment


sepia_FBpp0076349                     MSNGRHLAKGSPMPDVPEDGILRLYSMRFCPFAQRVHLVLDAKQIPYHSIYINLTDKPEW
sepia_CLEC008904                      -MTVEHLAAGSKAVP-LQEGKLRLYSMRFCPYAQRAHLILNAKNIPHDTVYINLKNKPEW
sepia_lcl|MK480215.1_prot_QEO191      -MAPKHLSVGSSDVP-PEEGKLRLYSMRFCPYAQRVHLALNAKKIPYDVVYVNLKQKPEW
HOVITM_021242-T1                      -MRLLTLA-GSTDPP-LVAGKIRLYSMRYCPYSHRAHLVLLAKNISFDPIFINLKTKPEW
HOVITM_021241-T1                      --------------M-CEEGK---WDRRQRPRSK----AWRSSSL-HDDVWINLKSKPEW
                                                         *    :. *  * ::       :..: .  :::**. ****

sepia_FBpp0076349                     LLEKNPQGKVPALEIVREPGPPVLTESLLICEYLDEQYPLRPLYPRDPLKKVQDKLLIER
sepia_CLEC008904                      YLEQFPLGKVPAICVDGD----RIYESLIICDFLDEKYPENPMYPKDPLKKAKDRILIER
sepia_lcl|MK480215.1_prot_QEO191      FLERFPLSKVPALVVNNT----DLYESLVIADYLDEAYPGEKIFSQDPLQKAKDRILIEM
HOVITM_021242-T1                      YTNTVPSGKVPALL

In [49]:
tree = Phylo.read("sepia_results/sepia.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "sepia_FBpp0076349"}) 
Phylo.draw_ascii(tree)

                             _____ sepia_CLEC008904
                     _______|
                    |       |_________ sepia_lcl|MK480215.1_prot_QEO19127.1_1
  __________________|
 |                  |     ________ HOVITM_021242-T1
_|                  |____|
 |                       |_____________ HOVITM_021241-T1
 |
 | sepia_FBpp0076349



Unclear - look at jbrowse

In [50]:
!cat cinnabar_results/cinnabar.aln

MUSCLE (3.8) multiple sequence alignment


cinnabar_FBpp0311059                  MSPGIVSQEVNGRQEPTEAARDERHGRRRRVAVIGAGLVGSLAALNFARMGNHVDLYEYR
cinnabar_CLEC025106                   -----------------------MDRANLKVVVVGGGLVGSLIATYFGQRGYNVHLYEYR
cinnabar_lcl|MK480216.1_prot_QEO      -------------------MENTENGKKLRVAIIGGGLVGSLSACYFGKRGHEVHLYEYR
HOVITM_100409-T1                      ---------------------MEEQGKPLKIIIVGGGLVGSLSACYFGKRGHEVHLYEYR
                                                                   .: ::*.****** *  *.. *  * *****

cinnabar_FBpp0311059                  EDIRQALVVQGRSINLALSQRGRKALAAVGLEQEVLA-TAIPMRGRMLHDVRGNSSVVLY
cinnabar_CLEC025106                   EDIRTSELVQGKSINLALSVRGLKALEGIGIADSVRA-YGIPMYGRMIHSVTGKTRPIPY
cinnabar_lcl|MK480216.1_prot_QEO      KDIRKDELARGRSINLALSTRGRRALAGVGLEDKLVSHHGLPMYARMLHMTDGSTRAVPY
HOVITM_100409-T1                      QDIRTTELVQGRSINLAMSARARAALREVGLEDTMLR-HGIPMHARMIHGLDGSLHQIPY
                                      :***   :..*.**

In [51]:
tree = Phylo.read("cinnabar_results/cinnabar.aln.tre", "newick")
#print(tree)
#rooting to Dmel as bed bug + gwss genes should be closer if follows species tree, but not best outgroup 
tree.root_with_outgroup({"name": "cinnabar_FBpp0311059"}) 
Phylo.draw_ascii(tree)

                       _______________ cinnabar_CLEC025106
                   ___|
  ________________|   |____________ HOVITM_100409-T1
 |                |
_|                |________ cinnabar_lcl|MK480216.1_prot_QEO19128...
 |
 | cinnabar_FBpp0311059



HOVITM_100409 = cinnabar

In [52]:
!cat rosy_results/rosy.aln

MUSCLE (3.8) multiple sequence alignment


rosy_FBpp0082172                      -----MSNSVLVFFVNGKKVTEVSPDPECTLLTFLREKLRLCGTKLGCAEGGCGACTVMV
rosy_CLEC006546                       --MEVKETSVLVFFVNGRKVVDHSADPEWTLIHYLRKK--LCGTKLGCSEGGCGACTVMV
rosy_lcl|MK480206.1_prot_QEO1911      MKDTPQESDTLVFFVNGVKVVDKEVDPEWTLLFYLRNKLRLCGTKLGCAEGGCGACTVMV
HOVITM_114983-T1                      --MSGSSSSTLVFYVNGKKVEDSNVDPEWTLLYYLRNKLRLTGTKLGCAEGGCGACTVMV
                                            ....***:*** ** : . *** **: :**:*  * ******:***********

rosy_FBpp0082172                      SRLDRRANKIRHLAVNACLTPVCSMHGCAVTTVEGIGSTKTRLHPVQERLAKAHGSQCGF
rosy_CLEC006546                       SKYDRKKNHPIHFTVNACLTPVCAMHGLAVTTVEGIGSVKTKLHPVQERIAKSHGSQCGF
rosy_lcl|MK480206.1_prot_QEO1911      SKYDRKRQKILHYAVNACLAPVCSMHGLAVTTVEGIGSTKTRLHPVQERIAKAHGSQCGF
HOVITM_114983-T1                      SKFDRTSRKLIHFSANACLAPVCSMHGLAVTTVEGIGSTQTRLHPVQERIAKAHGSQCGF
                                      *. **  .:  * :

In [53]:
tree = Phylo.read("rosy_results/rosy.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "rosy_FBpp0082172"}) 
Phylo.draw_ascii(tree)

                    ___________ HOVITM_114983-T1
  _________________|
 |                 |   _________ rosy_lcl|MK480206.1_prot_QEO19118.1_1
_|                 |__|
 |                    |__________________ rosy_CLEC006546
 |
 | rosy_FBpp0082172



HOVITM_114983 = rosy

In [54]:
!cat vermillion_results/vermillion.aln

MUSCLE (3.8) multiple sequence alignment


vermillion_CLEC006165       ------------------------MLYADYLQLNKILTAQRMLSTEHNATVHDEHLFIIT
vermillion_FBpp0073242      MSCPYAGNGNDHDDSAVPLTTEVGKIYGEYLMLDKLLDAQCMLSEEDKRPVHDEHLFIIT
HOVITM_040850-T1            ------------------------------------------------------------
                                                                                        

vermillion_CLEC006165       HQAYELWFKQIIHELDSIRVIFNKPEGLEESETLEILKRLSRIVLILKLLVDQVMILETM
vermillion_FBpp0073242      HQAYELWFKQIIFEFDSIRDMLD-AEVIDETKTLEIVKRLNRVVLILKLLVDQVPILETM
HOVITM_040850-T1            ------------------------------------------------------------
                                                                                        

vermillion_CLEC006165       TPLDFMEFREYLCPASGFQSVQFRLIENKLGMKQENRVRFNQSYHSVFGRDEVALADIAQ
vermillion_FBpp0073242      TPLDFMDFRKYLAPASGFQSLQFRLIENKLGVLTEQRVRYNQKYSDVFS-DEEARNSIRN
HOVITM_040850-T1            ----------------------

In [55]:
tree = Phylo.read("vermillion_results/vermillion.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "vermillion_FBpp0073242"}) 
Phylo.draw_ascii(tree)

                           ___________________________ vermillion_CLEC006165
  ________________________|
_|                        |_____________________________ HOVITM_040850-T1
 |
 | vermillion_FBpp0073242



Missing a bit at the beginning - but HOVITM_040850?

In [56]:
!cat yellow_results/yellow.aln

MUSCLE (3.8) multiple sequence alignment


yellow_CLEC000391       LSLFSYLTLAMSSLPLLLLGLSY---------VAGELEIMYQWTLAKFDTPFNYPPNTK-
HOVITM_101006-T1        --------MALWPLLFALVGLAS-----------AELEVVNQWNLFDFDIPYGYPTNEN-
yellow_FBpp0070070      MFQDKGWILVTLITLVTPSWAAY------------KLQERYSWSQLDFAFPNTRLKDQAL
HOVITM_046721-T1        -------MLLLSPKLHQLTGVAK----------------------------KGYCKETAQ
HOVITM_074737-T1        ------------MNVTALCWISG-----LSGQPNNPLKEKFAWKTLDYVFEEEWIAQEAK
HOVITM_011069-T1        MSAEQLLYRAIHPSPAIVCSVDSEPNLQRAGQPNNPLKEKFAWKTLDYVFEEEWIAQEAK
                                                                                :   

yellow_CLEC000391       ----YRADTTFINSVEVGWDRVFVTLPRI----WSGNPASLAWVPRPRKGQPNDPSPPLQ
HOVITM_101006-T1        ----YSTSQSPSTGLEVGWDRLFLALPRF----MPGAPLSLAFIPRNQPGGYEELSPKLQ
yellow_FBpp0070070      ASGDYIPQNALPVGVEHFGNRLFVTVPRW----RDGIPATLTYINMDRSLTGS---PELI
HOVITM_046721-T1        LTRLG--ERNPA--------HIFLELSNPTHTERSRVPATLNYLPLDEAPVEE---PKLI
HOVITM

In [57]:
tree = Phylo.read("yellow_results/yellow.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "yellow_FBpp0070070"}) 
Phylo.draw_ascii(tree)

                ______ HOVITM_046721-T1
             __|
            |  |          , HOVITM_074737-T1
            |  |__________|
  __________|             | HOVITM_011069-T1
 |          |
 |          |                                  ____________ yellow_CLEC000391
_|          |_________________________________|
 |                                            |___ HOVITM_101006-T1
 |
 | yellow_FBpp0070070



Need to add in more orthologs - one looks very much like bed bug verison, and the other two look like a paralogs and similar to Dmel

In [58]:
!cat ebony_results/ebony.aln

MUSCLE (3.8) multiple sequence alignment


ebony_CLEC007608       --------------------------------------------MFLDGKHPTKSLSYGE
ebony_FBpp0083505      MGSLPQLSIVKGLQQDFVPRALHRIFEEQQLRHADKVALIYQPSTTGQGMAPSQS-SYRQ
HOVITM_049649-T1       ------------------------------------------------------------
                                                                                   

ebony_CLEC007608       VEERSNRLARALLLATKDK--SPNDDGDRVIGLCMEPSPELIIAILAVWKSGCSYLTFAP
ebony_FBpp0083505      MNERANRAARLLVAETHGRFLQPNSDGDFIVAVCMQPSEGLVTTLLAIWKAGGAYLPIDP
HOVITM_049649-T1       ------------------------------------------------------------
                                                                                   

ebony_CLEC007608       NAPVNRTRHIVQEARPVLVVTDKSGTGDLYAPTECVEFSGLEAVSADLSESTLEEDESYP
ebony_FBpp0083505      SFPANRIHHILLEAKPTLVIRDDDIDAGRFQGTPTLSTTELYAKSLQLAGSNLLSEEMLR
HOVITM_049649-T1       ------------------------------------------------------------
               

In [59]:
tree = Phylo.read("ebony_results/ebony.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "ebony_FBpp0083505"}) 
Phylo.draw_ascii(tree)

                            _________________________________ ebony_CLEC007608
  _________________________|
_|                         |______________________ HOVITM_049649-T1
 |
 | ebony_FBpp0083505



Missing first half at beginning but otherwise looks really good?
HOVITM_049649?

In [60]:
!cat tubby_results/tubby.aln

MUSCLE (3.8) multiple sequence alignment


tubby_L798_11124:KDR14905      ------MAAAVAKRTSAAVPPVVAGTVLYQPAVSCYRRPAFEAINLADADFTSDSHTSTH
HOVITM_040559-T1               -----------ISEDQISEGPISEGQTSEGPISEGPISEGQISEGQISEGPISEGQISEG
tubby_FBpp0084408              MRGFIIFAVLAVARADVGGYNYGAGIGSGGSISGGSLSGGSISGGSISGGSISGGSISGG
                                              . .      *     .        .  : .  . .  * .  *  

tubby_L798_11124:KDR14905      RYNNGFSSATNSSLGR-----------------------PTVDGNRHLSH----------
HOVITM_040559-T1               LISEGQTSEGQISEG------------------------PISEGQISEGQIS--------
tubby_FBpp0084408              SISGGSLSSGSLSGGSYSTNYAPVNTEFNKEFFTYSAPEADFEDNKSVSDLAATLKKNLR
                                 . *  *  . * *                        .  :.:   .           

tubby_L798_11124:KDR14905      ----SDHINRGNEIRVRNDTSAATDWPQNFTSVFCNHGRTEVNEL---------------
HOVITM_040559-T1               ----EGPISEGQI---SEGPISEGQISEGQIS------EGQISEG---------------
tubby_FBpp0084408   

In [61]:
tree = Phylo.read("tubby_results/tubby.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "tubby_FBpp0084408"}) 
Phylo.draw_ascii(tree)

                _____________________________________________ tubby_L798_11124
  _____________|
_|             |_________________ HOVITM_040559-T1
 |
 | tubby_FBpp0084408



I'm not really conviced by this - not sure this ones worth pursuing

In [62]:
!cat wingless_results/wingless.aln

MUSCLE (3.8) multiple sequence alignment


HOVITM_117535-T1          ---------MSMVRVMKRGDLNSAAPTSGPGGVEVEVLVFWSVDFVLKFICENCSWRQST
HOVITM_026070-T1          --------MLRPMSRPTPGVDDELWWARLLGLQGPSQGAVSLGPLSHKENCHRLQYLVER
wingless_FBpp0079060      -MDISYIFVICLMALCSGGSSLSQVEGKQKSGRGRGSMWWGIAKVGEPNNITPIMYMDPA
HOVITM_026395-T1          ------------------------------------------------------------
wingless_CLEC012013       MFKFKTIYYFCFFRNIEPANFNDVNGCEVTKGLSIG------------------------
HOVITM_053081-T1          ---------MGLLE-----ETSEPLPCGRTPGLSPG------------------------
                                                                                      

HOVITM_117535-T1          -REHFQCVDLFVLESTTNPAYPTGKSFSSAASKTQASPGNEALNCNEVECIDKNEGSLAV
HOVITM_026070-T1          --------QQQLCGLSENVLAAVGNGAKMSIEECQHQFRMSRWNCTTF----SNTSSVFG
wingless_FBpp0079060      IHSTLRRKQRRLVRDNPGVLGALVKGANLAISECQHQFRNRRWNCSTRNF--SRGKNLFG
HOVITM_026395-T1          -----------MCRAAPDAMIAVGDGIRLATLECQQQFKTHRWNCS

In [63]:
tree = Phylo.read("wingless_results/wingless.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "wingless_FBpp0079060"}) 
Phylo.draw_ascii(tree)

                      ____________________ HOVITM_026070-T1
         ____________|
        |            |     __________________ HOVITM_026395-T1
        |            |____|
  ______|                 |              _________ wingless_CLEC012013
 |      |                 |_____________|
 |      |                               |_______________ HOVITM_053081-T1
_|      |
 |      |____________________________________ HOVITM_117535-T1
 |
 | wingless_FBpp0079060



remove HOVITM_117535 and re-do alignment

In [64]:
!cat curly_results/curly.aln

MUSCLE (3.8) multiple sequence alignment


curly_FBpp0289611      MSVPSAPHQRAESKNRVPRPGQKNRKLPKLRLHWPGATYGGALLLLLISYGLELGSVHCY
curly_CLEC009522       ------------------------------------------------------------
HOVITM_050937-T1       ------------------------------------------------------------
                                                                                   

curly_FBpp0289611      EKMYSQTEKQRYDGWYNNLAHPDWGSVDSHLVRKAPPSYSDGVYAMAGANRPSTRRLSRL
curly_CLEC009522       ------------------------------MTRKTPAAYKDGVYMMSGEDRPSARRISQL
HOVITM_050937-T1       ------------------------------------------------------------
                                                                                   

curly_FBpp0289611      FMRGKDGLGSKFNRTALLAFFGQLVANEIVMASESGCPIEMHRIEIEKCDEMYDRECRGD
curly_CLEC009522       FMKGSDGLPSSRNRTALLAFFGQVVSSEVVMASESGCPIEVHQIPIDRCDDMYDPECKGG
HOVITM_050937-T1       ------------------------------------------------------------
               

In [65]:
tree = Phylo.read("curly_results/curly.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "curly_FBpp0289611"}) 
Phylo.draw_ascii(tree)

                                      ________________ curly_CLEC009522
  ___________________________________|
_|                                   |_______________________ HOVITM_050937-T1
 |
 | curly_FBpp0289611



HOVITM_050937 = curly

In [66]:
!cat dumpy_results/dumpy.aln

MUSCLE (3.8) multiple sequence alignment


HOVITM_087366-T1       ------------------------------------------------------------
HOVITM_089596-T1       ---------------------------------MKFRTGAAHTAPLAPVSLTCVITELLR
HOVITM_089585-T1       ------------------------------------------------------------
dumpy_CLEC025055       ----------------------------------------MNIIFDQFCENKIYNNRGAL
HOVITM_089588-T1       ------------------------------------------------------------
HOVITM_089587-T1       ------------------------------------------------------------
dumpy_FBpp0304628      MKIFLPLVTWIVLLLSSAVHSQYSQQPQPFKTNLRANSRFRGEVFYLNLENGYFGCQVNE
HOVITM_089591-T1       ------------------------------------------------------------
                                                                                   

HOVITM_087366-T1       -------------------------------------------MSVGVTDSRRDLCYARY
HOVITM_089596-T1       TTDKFDKLPKRSSCRGNSCGIRITRCTASRRSRHDCHKKDGVRCQNGACLDSQCHCNDGY
HOVITM_089585-T1 

In [67]:
tree = Phylo.read("dumpy_results/dumpy.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "dumpy_FBpp0304628"}) 
Phylo.draw_ascii(tree)

                           _________________________________ HOVITM_087366-T1
               ___________|
              |           |_______________________ HOVITM_089596-T1
              |
           ___|        ________________ HOVITM_089585-T1
          |   |    ___|
          |   |   |   |_ dumpy_CLEC025055
          |   |___|
  ________|       |___ HOVITM_089587-T1
 |        |       |
 |        |       |____ HOVITM_089588-T1
_|        |
 |        |_____ HOVITM_089591-T1
 |
 | dumpy_FBpp0304628



This ones pretty gross - I noticed had many copies on flybase - so may be poor choice - maybe try removing HOVITM_087366 and HOVITM_089596 and re-do alignment - but am not hopeful about clarity here

In [68]:
!cat miniature_results/miniature.aln

MUSCLE (3.8) multiple sequence alignment


miniature_FBpp0073400      ------------------------------MWSPQKGPTRLWDLRFSSCIFILHLMFSLV
miniature_CLEC002209       -------------------------------------------------------LFQRN
HOVITM_009639-T1           MRTITGPYSWQFPTSGSSRSQHYVGPDLGMSWLLAAISDSSWDLASSGS---WELLTSIA
                                                                                  : .  

miniature_FBpp0073400      IAGN-ELWPMERPDGMPNIVSLEVMCGKDHMDVHLTFSHPFEGIVSSKGQHSDPRCVYVP
miniature_CLEC002209       VLGG-EIWPLERPEGMPAIQSLEVMCGKDHMDVHLSFTQPFEGIVSSKGQYADPRCVYVP
HOVITM_009639-T1           ETGNSDIWPLERPDGMPAIQSLEVMCGKDHMDVHLSFTQPFEGIVSSKGQYGDPRCVYVP
                             *. ::**:***:*** * ***************:*::***********:.********

miniature_FBpp0073400      PSTGKTFFSFRISYSRCGTKPDLNGQFYENTVVVQYDKDLLEVWDEAKRLRCEWFNDYEK
miniature_CLEC002209       PSTGKTFFSFRIAYARCGTKPDLNGQFYENTVVVQYDKDLLEVWDEAKRLRCEWYNDYEK
HOVITM_009639-T1           PSTGKTFFSFRIAYARCGTKPDLHGQFYENTVV

In [69]:
tree = Phylo.read("miniature_results/miniature.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "miniature_FBpp0073400"}) 
Phylo.draw_ascii(tree)

                                      _______________ miniature_CLEC002209
  ___________________________________|
_|                                   |___________________ HOVITM_009639-T1
 |
 | miniature_FBpp0073400



HOVITM_009639 = miniature

In [70]:
!cat vestigal_results/vestigal.aln

MUSCLE (3.8) multiple sequence alignment


HOVITM_088486-T1            -----------------------------------------------MSGRLRPENPPVN
vestigal_ACYPI34460-PA      --MSCTEVMYQAYYPYLYQR--------------------------SSGTAAPPTRAPHH
vestigal_FBpp0086898        MAVSCPEVMYGAYYPYLYGRAGTSRSFYQYERFNQDLYSSSGVNLAASSSASGSSHSPCS
                                                                            .    . ..*  

HOVITM_088486-T1            R-----------------LHRNCVDDIMLRSR----------GGTPG-------------
vestigal_ACYPI34460-PA      H-FPPF------------THQYDRLRALESHQQASTSSPIGGGDSPANRWTTIADHTDSH
vestigal_FBpp0086898        PILPPSVSANAAAAVAAAAHNSAAAAVAVAANQASSSGGIGGGGLGGLGGLGGGPASGLL
                                               *.          .          *.  .             

HOVITM_088486-T1            -----RGRHSWERCGFI-----------TIECALKFSS----------------------
vestigal_ACYPI34460-PA      HSSVGSGASSVGPVGGIASPASSPATANSVSVVHKEED----------TSRDDVRTEMDE
vestigal_FBpp0086898        GSNVVPGSSSVGSVGLGMSPVL

In [71]:
tree = Phylo.read("vestigal_results/vestigal.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "vestigal_FBpp0086898"}) 
Phylo.draw_ascii(tree)

                         ______________________________ HOVITM_088486-T1
  ______________________|
_|                      |___________ vestigal_ACYPI34460-PA
 |
 | vestigal_FBpp0086898



HOVITM_088486?

In [72]:
!cat singed_results/singed.aln

MUSCLE (3.8) multiple sequence alignment


HOVITM_097663-T1        ------MNGHHENGELSNGSNKGVWTIGLINCQFKYLTAETFGFKTFDQNSRLGSVIINA
singed_FBpp0290891      MNGQGCELGHSNGDIISQNQQKGWWTIGLINGQHKYMTAETFGFK------------LNA
HOVITM_097664-T1        ------------------------------------------------------------
HOVITM_097669-T1        ------------------------------------------------------------
singed_CLEC025340       MNGVNGMNGHTNGEL--NGVGRGTWTIGLINVQYRYLTAETFGFK------------INA
                                                                                    

HOVITM_097663-T1        NGASLKKKQIWTLEPAGGEAAAIYLRSHLGKYLAVDSFGNVTCESDEKEQGGKFLITVPD
singed_FBpp0290891      NGASLKKKQLWTLEPSNTGESIIYLRSHLNKYLSVDQFGNVLCESDERDAGSRFQISISE
HOVITM_097664-T1        ------------------------------------------------------------
HOVITM_097669-T1        ------------------------------------------------------------
singed_CLEC025340       NGSSLKKKQMWTLEPAPGETNTVYLRSHLDKYLAVDSFGNVTCESEEKDPGSKFQIVISE
      

In [73]:
tree = Phylo.read("singed_results/singed.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "singed_FBpp0290891"}) 
Phylo.draw_ascii(tree)

                               ____________________________ HOVITM_097663-T1
                          ____|
                         |    |_________________ HOVITM_097664-T1
  _______________________|
 |                       |         __________________ HOVITM_097669-T1
_|                       |________|
 |                                |_ singed_CLEC025340
 |
 | singed_FBpp0290891



split in half? check jbrowse

In [74]:
!cat bar_results/bar.aln

MUSCLE (3.8) multiple sequence alignment


HOVITM_100409-T1      ------------------------------------------------------------
HOVITM_073652-T1      ------------------------------------------------------------
bar_FBpp0074204       MKDSMSILTQTPSEPNAAHPQLHHHLSTLQQQHHQHHLHYGLQPPAVAHSIHSTTTMSSG
bar_CLEC007883        ------------------------------------------------------------
HOVITM_073654-T1      ------------------------------------------------------------
                                                                                  

HOVITM_100409-T1      ------MEEQGKPLKIIIVGGGLVGSLSACYFGKRGHEVHLYEYRQDIRT----------
HOVITM_073652-T1      ------------------------------------------------------------
bar_FBpp0074204       GSTTTASGIGKPNRSRFMINDILAGSAAAAFYKQQQHHQQLHHHNNNNNSGSSGGSSPAH
bar_CLEC007883        ------------------------------------------------------------
HOVITM_073654-T1      ------------------------------------------------------------
                            

In [75]:
tree = Phylo.read("bar_results/bar.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "bar_FBpp0074204"}) 
Phylo.draw_ascii(tree)

         ___ HOVITM_073654-T1
    ____|
   |    |    ________________________________________________ HOVITM_100409-T1
  _|    |___|
 | |        | bar_CLEC007883
_| |
 | |_______________ HOVITM_073652-T1
 |
 | bar_FBpp0074204



Get rid of HOVITM_100409 and re-do alignment

In [77]:
!cat glass_results/glass.aln

MUSCLE (3.8) multiple sequence alignment


HOVITM_093440-T1          MDDQQKEYKCEVCSKVFPYPSALRKHSQIHNGEKQYGCDECGKSFTLKGNLKAHQFLHTG
glass_ACYPI008619-PA      ------------------------------------------------------------
glass_FBpp0083006         ------------------------------------------------------------
                                                                                      

HOVITM_093440-T1          ELPYACGICQKKFATESSLNSHIDVHTGVKAFSCSMCERKFRHRRGLETHERLHTGENMF
glass_ACYPI008619-PA      ------------------------------------------------------------
glass_FBpp0083006         ------------------------------------------------------------
                                                                                      

HOVITM_093440-T1          ECTYCDKKYNLKASLNNHLLLHTGEKPFSCDTCGKSFRSKMALSSHILIHTGERPYPCSI
glass_ACYPI008619-PA      ------------------------------------------------------------
glass_FBpp0083006         -----------------MGLLYKGSKL-----------------

In [78]:
tree = Phylo.read("glass_results/glass.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "glass_FBpp0083006"}) 
Phylo.draw_ascii(tree)

                     _____________________________________ HOVITM_093440-T1
  __________________|
_|                  |_______ glass_ACYPI008619-PA
 |
 | glass_FBpp0083006



Maybe? I think this is one where more reference seqs are needed to be sure

### All hits further confirmed by reverse blasting on uniprot 

Need to pull out stats from phmmer results for these top candidates & record 

## Refining results  

Candidate genes fall into three groups:

(a) remove bad hit and rerun alignment

(b) get more othologs, re-run pipeline

(c) check JBrowse, then add more orthologs, re-run pipeline

(d) misassembly?

Assigning genes to groups:

(a) wingless, bar

(b) brown, scarlet, yellow, dumpy, vestigal, glass, DhpD 

(c) vermillion, ebony

(d) white, sepia