# GWSS Morphological Marker Gene Ortholog Search

By: Cassie Ettinger
Email: cassandra.ettinger@ucr.edu

In [3]:
#loads some basic os/ipython functionality
from os import chdir, mkdir
from os.path import join
from IPython.display import FileLinks, FileLink
from Bio import Phylo

## Data processing

Download from Flybase all gene sequences for selected marker genes

Table of marker genes of interest and flybase IDs, etc

In [11]:
!cat GWSS\ Morphological\ Gene\ Targets\ for\ Homology\ Searches.tsv

Morphological group	Gene name	Flybase ID	Flybase: Closest Ortholog ID	Flybase: Closest Reference Species	Literature: GenBank Assession No.	Literature: Reference Species	Literature: Citation	Notes
Eye color markers	white	FBgn0003996	CLEC000648	Cimex lectularius	MK480204	Limnogonus franciscanus	https://www.pnas.org/content/116/38/19046	
Eye color markers	brown	FBgn0000241	ACYPI008444	Acyrthosiphon pisum	MK480212	Limnogonus franciscanus	https://www.pnas.org/content/116/38/19046	
Eye color markers	scarlet	FBgn0003515	CLEC004040	Cimex lectularius	MK480213	Limnogonus franciscanus	https://www.pnas.org/content/116/38/19046	
Eye color markers	punch	FBgn0003162	CLEC005231	Cimex lectularius	MK480217	Limnogonus franciscanus	https://www.pnas.org/content/116/38/19046	
Eye color markers	purple	FBgn0003141	CLEC001054	Cimex lectularius	MK480218	Limnogonus franciscanus	https://www.pnas.org/content/116/38/19046	
Eye color markers	DhpD	FBgn0261436	CLEC005050	Cimex lectularius	MK480214	Limnogon

First - turn multiline fasta into single line fasta file

In [68]:
!awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}'\
< data/all_marker_protein_seqs.fasta > data/all_marker_protein_seqs_fixed.fasta

In [None]:
!awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}'\
< data/Homalodisca_vitripennis_A6A7A9_masurca_v1_ragtag_v1.proteins.fa > data/Homalodisca_vitripennis_A6A7A9_masurca_v1_ragtag_v1.proteins.fixed.fa

In [69]:
!awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}'\
< data/Homalodisca_vitripennis_A6A7A9_masurca_v1.cds-transcripts.fa > data/Homalodisca_vitripennis_A6A7A9_masurca_v1.cds-transcripts.fixed.fa

Split fasta file into indivdual marker genes

In [79]:
for gene in $(cat genes.txt);
do grep -A 1 '>'$gene data/all_marker_protein_seqs_fixed.fasta | sed '/^--$/d' > $gene'.fasta'; 
done

SyntaxError: invalid syntax (<ipython-input-79-fcd74c86010c>, line 1)

## Run phmmer to identify ortholog candidates

Run phmmer on each marker gene fasta and output both full output with alingments and also a table that can be quickly looked at to ID top hits

In [70]:
for gene in $(cat genes.txt);
do phmmer --tblout $gene'phmmer_out_table' -o $gene'phmmer_out' $gene'.fasta' data/Homalodisca_vitripennis_A6A7A9_masurca_v1_ragtag_v1.proteins.fixed.fa;
done                   

SyntaxError: invalid syntax (<ipython-input-70-8241b04525fe>, line 1)

Clean up outputs

In [None]:
for gene in $(cat genes.txt);
do mkdir $gene'_results';
done

In [None]:
for gene in $(cat genes.txt);
do mv $gene* $gene'_results';
done
#will error about the results folders but thats OK

Look at each output and make a list of top hits for each gene

### Eye Color Marker Genes

Jason has already done white, so let's use it sanity check this workflow

In [2]:
!cat white_results/whitephmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_063422-T1    -          white_CLEC000648     -            1.7e-94  319.9   3.9   3.7e-86  292.4   0.8   2.3   2   1   0   2   2   2   2 HOVITMR_063422
HOVITMR_025764-T1    -          white_CLEC000648     -              1e-93  317.4   0.0   1.3e-93  317.0   0.0   1.0   1   0   0   1   1   1   1 HOVITMR_025764
HOVITMR_025763-T1    -          white_CLEC000648     -            3.7e-92  312.2  11.3   5.8e-70  238.9   7.4   3.1   1   1   1   2   2   2   2 HOVITMR_025763
HOVITMR_046532-T1    -          white_CLE

Mulitple hits here with high scores

HOVITMR_063422
HOVITMR_025764
HOVITMR_025763

In [3]:
!cat brown_results/brownphmmer_out_table

#                                                                                 --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name                             accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ----------                   -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_063422-T1    -          brown_lcl|MK480212.1_prot_QEO19124.1_1 -           3.1e-131  441.7   2.8  3.7e-131  441.4   2.8   1.0   1   0   0   1   1   1   1 HOVITMR_063422
HOVITMR_023567-T1    -          brown_lcl|MK480212.1_prot_QEO19124.1_1 -           3.1e-112  378.9   1.2  4.2e-112  378.5   1.2   1.0   1   0   0   1   1   1   1 HOVITMR_023567
HOVITMR_046532-T1    -          brown_lcl|MK480212.1_prot_QEO19124.1_1 -            3.5e-78  266.4   7.6   2.2e

Multiple hits - top hit overlaps with white & scarlett

HOVITMR_063422
HOVITMR_023567

In [5]:
!cat scarlet_results/scarletphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_063422-T1    -          scarlet_CLEC004040   -             2e-107  362.4  15.2   5.2e-59  202.7   0.6   3.0   2   1   0   2   2   2   2 HOVITMR_063422
HOVITMR_023567-T1    -          scarlet_CLEC004040   -             4e-107  361.5   9.1   7.5e-71  241.8   0.3   2.0   2   0   0   2   2   2   2 HOVITMR_023567
HOVITMR_046532-T1    -          scarlet_CLEC004040   -            6.2e-75  255.2  12.1   1.5e-31  112.1   0.0   4.1   4   0   0   4   4   4   3 HOVITMR_046532
HOVITMR_060010-T1    -          scarlet_C

Both hits overlap with brown

HOVITMR_063422
HOVITMR_023567

In [6]:
!cat punch_results/punchphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_079319-T1    -          punch_CLEC005231     -            1.8e-74  252.5   0.2     3e-74  251.8   0.2   1.2   1   0   0   1   1   1   1 HOVITMR_079319
HOVITMR_079320-T1    -          punch_CLEC005231     -            1.8e-09   39.8   0.1   2.4e-09   39.4   0.1   1.2   1   0   0   1   1   1   1 HOVITMR_079320
HOVITMR_010707-T1    -          punch_CLEC005231     -               0.24   13.2   0.1      0.26   13.1   0.1   1.2   1   0   0   1   1   1   0 HOVITMR_010707
HOVITMR_018474-T1    -          punch_CLE

HOVITMR_079319

In [7]:
!cat purple_results/purplephmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_010106-T1    -          purple_CLEC001054    -            4.6e-56  191.5   2.2     5e-56  191.4   2.2   1.0   1   0   0   1   1   1   1 HOVITMR_010106
HOVITMR_010106-T1    -          purple_lcl|MK480218.1_prot_QEO19130.1_1 -            4.3e-51  174.5   1.6   4.9e-51  174.3   1.6   1.0   1   0   0   1   1   1   1 HOVITMR_010106
HOVITMR_048243-T1    -          purple_lcl|MK480218.1_prot_QEO19130.1_1 -               0.24   13.9   0.1      0.83   12.2   0.1   1.9   1   0   0   1   1   1   0 HOVITMR_048243
HOV

HOVITMR_010106

In [8]:
!cat DhpD_results/DhpDphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_014112-T1    -          DhpD_CLEC005050      -              1e-29  106.2   0.1   1.2e-29  106.0   0.1   1.0   1   0   0   1   1   1   1 HOVITMR_014112
HOVITMR_014113-T1    -          DhpD_CLEC005050      -            4.6e-21   77.8   1.9   9.5e-21   76.7   0.1   1.7   1   1   1   2   2   2   1 HOVITMR_014113
HOVITMR_029331-T1    -          DhpD_CLEC005050      -            3.1e-06   28.9   0.0   4.7e-06   28.3   0.0   1.2   1   0   0   1   1   1   1 HOVITMR_029331
HOVITMR_107863-T1    -          DhpD_CLEC

HOVITMR_014112
HOVITMR_014113

In [10]:
!cat sepia_results/sepiaphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_059208-T2    -          sepia_CLEC008904     -            3.4e-68  231.9   0.1   4.2e-68  231.6   0.1   1.0   1   0   0   1   1   1   1 HOVITMR_059208
HOVITMR_059208-T1    -          sepia_CLEC008904     -              1e-59  204.1   0.3   1.2e-59  203.9   0.3   1.0   1   0   0   1   1   1   1 HOVITMR_059208
HOVITMR_086766-T1    -          sepia_CLEC008904     -            1.2e-06   30.3   0.2   1.4e-06   30.0   0.2   1.2   1   0   0   1   1   1   1 HOVITMR_086766
HOVITMR_086765-T1    -          sepia_CLE

HOVITMR_059208

In [2]:
!cat cinnabar_results/cinnabarphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_030756-T1    -          cinnabar_CLEC025106  -            1.4e-97  329.6   0.4   1.6e-97  329.3   0.4   1.0   1   0   0   1   1   1   1 HOVITMR_030756
HOVITMR_030755-T1    -          cinnabar_CLEC025106  -            5.1e-35  123.4   0.1   6.5e-35  123.1   0.1   1.0   1   0   0   1   1   1   1 HOVITMR_030755
HOVITMR_107759-T1    -          cinnabar_CLEC025106  -              1e-14   56.6   0.0     1e-14   56.5   0.0   1.0   1   0   0   1   1   1   1 HOVITMR_107759
HOVITMR_022906-T1    -          cinnabar_

One very strong hit

HOVITMR_030756

In [3]:
!cat rosy_results/rosyphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_021669-T1    -          rosy_CLEC006546      -                  0 1787.8   2.2         0 1781.7   2.2   2.0   1   1   0   1   1   1   1 HOVITMR_021669
HOVITMR_021667-T1    -          rosy_CLEC006546      -           5.4e-159  533.4   0.0  6.5e-159  533.1   0.0   1.0   1   0   0   1   1   1   1 HOVITMR_021667
HOVITMR_072387-T1    -          rosy_CLEC006546      -           3.8e-121  408.0   0.0  5.4e-121  407.5   0.0   1.1   1   0   0   1   1   1   1 HOVITMR_072387
HOVITMR_069740-T1    -          rosy_CLEC

HOVITMR_021669

In [4]:
!cat vermillion_results/vermillionphmmer_out_table

#                                                                --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name            accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ----------  -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_086284-T1    -          vermillion_CLEC006165 -            6.5e-93  314.2   0.0   7.8e-93  313.9   0.0   1.0   1   0   0   1   1   1   1 HOVITMR_086284
HOVITMR_086284-T2    -          vermillion_CLEC006165 -            6.7e-54  185.9   0.0   7.8e-54  185.7   0.0   1.0   1   0   0   1   1   1   1 HOVITMR_086284
HOVITMR_086280-T1    -          vermillion_CLEC006165 -            9.5e-11   44.0   0.2   1.2e-10   43.6   0.2   1.1   1   0   0   1   1   1   1 HOVITMR_086280
HOVITMR_099233-T1    -          ver

HOVITMR_086284

### Body markers

In [5]:
!cat yellow_results/yellowphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_038735-T1    -          yellow_CLEC000391    -           5.2e-167  558.2   5.4  6.2e-167  557.9   5.4   1.0   1   0   0   1   1   1   1 HOVITMR_038735
HOVITMR_024547-T1    -          yellow_CLEC000391    -              4e-59  202.7   1.3     5e-59  202.4   1.3   1.0   1   0   0   1   1   1   1 HOVITMR_024547
HOVITMR_037366-T1    -          yellow_CLEC000391    -            3.5e-42  146.9   0.7   5.3e-42  146.3   0.7   1.1   1   0   0   1   1   1   1 HOVITMR_037366
HOVITMR_096073-T1    -          yellow_CL

HOVITMR_038735 HOVITMR_037366 HOVITMR_096073 HOVITMR_070904

In [6]:
!cat ebony_results/ebonyphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_055645-T1    -          ebony_CLEC007608     -           3.3e-115  388.4   0.1  3.6e-115  388.3   0.1   1.0   1   0   0   1   1   1   1 HOVITMR_055645
HOVITMR_055646-T1    -          ebony_CLEC007608     -            1.9e-85  290.0   0.2   2.5e-85  289.6   0.2   1.0   1   0   0   1   1   1   1 HOVITMR_055646
HOVITMR_000360-T1    -          ebony_CLEC007608     -            6.2e-40  139.5   0.0   2.1e-38  134.4   0.0   2.0   1   1   1   2   2   2   1 HOVITMR_000360
HOVITMR_001305-T1    -          ebony_CLE

HOVITMR_055645

In [7]:
!cat tubby_results/tubbyphmmer_out_table

#                                                                    --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name                accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ----------      -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_043349-T1    -          tubby_L798_11124:KDR14905 -            0.00012   24.7   1.5   0.00012   24.7   1.5   3.1   2   1   1   3   3   3   1 HOVITMR_043349
HOVITMR_042796-T1    -          tubby_L798_11124:KDR14905 -              0.013   18.0   0.0   2.2e+02    4.3   0.0   3.8   1   1   4   5   5   5   0 HOVITMR_042796
HOVITMR_031811-T1    -          tubby_L798_11124:KDR14905 -              0.032   16.8   0.0     0.042   16.4   0.0   1.1   1   0   0   1   1   1   0 HOVITMR_031811
HOVITMR_016

May not have good match to this

HOVITMR_043349

### Wing markers 

In [8]:
!cat wingless_results/winglessphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_057127-T1    -          wingless_CLEC012013  -            1.3e-92  312.7   8.8   1.7e-92  312.3   8.8   1.1   1   0   0   1   1   1   1 HOVITMR_057127
HOVITMR_070598-T1    -          wingless_CLEC012013  -            1.6e-89  302.6  16.0   1.9e-89  302.3  16.0   1.0   1   0   0   1   1   1   1 HOVITMR_070598
HOVITMR_020388-T1    -          wingless_CLEC012013  -            3.6e-86  291.5  16.2   4.6e-86  291.2  16.2   1.1   1   0   0   1   1   1   1 HOVITMR_020388
HOVITMR_029194-T1    -          wingless_

HOVITMR_057127 HOVITMR_070598 HOVITMR_020388 HOVITMR_029198

In [9]:
!cat curly_results/curlyphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_045190-T1    -          curly_CLEC009522     -                  0 2650.6  12.5         0 2605.9   6.8   2.0   1   1   1   2   2   2   2 HOVITMR_045190
HOVITMR_013885-T1    -          curly_CLEC009522     -              5e-70  238.6  18.8     7e-50  171.8  16.3   3.1   2   2   0   2   2   2   2 HOVITMR_013885
HOVITMR_006514-T1    -          curly_CLEC009522     -            7.7e-49  168.4   0.0   1.2e-48  167.8   0.0   1.1   1   0   0   1   1   1   1 HOVITMR_006514
HOVITMR_033987-T1    -          curly_CLE

HOVITMR_045190

In [10]:
!cat dumpy_results/dumpyphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_031383-T1    -          dumpy_CLEC025055     -                  0 15944.6 3449.1         0 4733.3 985.5  10.0   1   1   9  10  10  10  10 HOVITMR_031383
HOVITMR_031381-T1    -          dumpy_CLEC025055     -                  0 4523.1 1064.8         0 1905.7 410.5   7.0   1   1   6   7   7   7   7 HOVITMR_031381
HOVITMR_031374-T1    -          dumpy_CLEC025055     -                  0 2208.6 1283.0  4.8e-148  495.4 145.3  29.6   2   1  22  25  25  22  22 HOVITMR_031374
HOVITMR_031379-T1    -          dumpy

Several hits - but several alleles (and also maybe copies?) for this gene so maybe not unexpected; may not be good target

HOVITMR_031383
HOVITMR_031374
HOVITMR_031381
HOVITMR_031379
HOVITMR_031385

In [11]:
!cat miniature_results/miniaturephmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_040001-T1    -          miniature_CLEC002209 -           1.2e-252  841.7   1.4  1.5e-252  841.3   1.4   1.1   1   0   0   1   1   1   1 HOVITMR_040001
HOVITMR_009464-T1    -          miniature_CLEC002209 -            4.1e-89  301.8   1.6   6.2e-84  284.7   0.5   2.8   1   1   2   3   3   3   2 HOVITMR_009464
HOVITMR_052857-T1    -          miniature_CLEC002209 -            6.8e-84  284.6   6.4   6.8e-84  284.6   6.4   1.8   1   1   0   1   1   1   1 HOVITMR_052857
HOVITMR_104598-T1    -          miniature

HOVITMR_040001

In [12]:
!cat vestigal_results/vestigalphmmer_out_table

#                                                                 --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name             accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ----------   -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_019057-T1    -          vestigal_ACYPI34460-PA -            2.8e-30  108.4  22.3   3.7e-30  108.0  22.3   1.1   1   0   0   1   1   1   1 HOVITMR_019057
HOVITMR_019053-T1    -          vestigal_ACYPI34460-PA -            3.7e-07   32.3   3.3   4.6e-07   32.0   3.3   1.1   1   0   0   1   1   1   1 HOVITMR_019053
HOVITMR_029112-T1    -          vestigal_ACYPI34460-PA -              0.041   15.7   1.6     0.069   15.0   1.6   1.3   1   0   0   1   1   1   0 HOVITMR_029112
HOVITMR_028380-T1    -       

HOVITMR_019057


### Bristle markers

In [13]:
!cat singed_results/singedphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_016219-T1    -          singed_CLEC025340    -           3.4e-118  398.0   6.9   1.5e-62  214.4   0.3   3.0   2   1   0   2   2   2   2 HOVITMR_016219
HOVITMR_016213-T1    -          singed_CLEC025340    -            1.4e-78  267.3   0.0   1.7e-78  267.0   0.0   1.0   1   0   0   1   1   1   1 HOVITMR_016213
HOVITMR_016212-T1    -          singed_CLEC025340    -            4.7e-69  235.9   3.6   1.2e-68  234.5   3.6   1.6   1   1   0   1   1   1   1 HOVITMR_016212
HOVITMR_007353-T1    -          singed_CL

HOVITMR_016219 HOVITMR_016213 HOVITMR_016212

### Eye shape markers

In [14]:
!cat bar_results/barphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_017333-T1    -          bar_CLEC007883       -            3.5e-48  165.6  24.0   4.3e-48  165.3  24.0   1.1   1   0   0   1   1   1   1 HOVITMR_017333
HOVITMR_030802-T1    -          bar_CLEC007883       -              0.062   15.3   4.5     0.062   15.3   4.5   3.0   3   0   0   3   3   3   0 HOVITMR_030802
HOVITMR_036980-T1    -          bar_CLEC007883       -               0.37   12.8   0.3       0.5   12.4   0.3   1.2   1   0   0   1   1   1   0 HOVITMR_036980
HOVITMR_080490-T1    -          bar_CLEC0

HOVITMR_017333

In [15]:
!cat glass_results/glassphmmer_out_table

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
HOVITMR_029041-T1    -          glass_ACYPI008619-PA -                  0 1447.1 279.4   7.4e-39  136.5  11.2  17.4   2   1  16  18  18  17  17 HOVITMR_029041
HOVITMR_108614-T1    -          glass_ACYPI008619-PA -           5.3e-113  381.0  92.7   2.3e-28  101.9   6.7   5.0   1   1   4   5   5   5   5 HOVITMR_108614
HOVITMR_028430-T1    -          glass_ACYPI008619-PA -           4.6e-111  374.6  70.3   2.5e-33  118.3  14.1   4.0   1   1   3   4   4   4   4 HOVITMR_028430
HOVITMR_032329-T1    -          glass_ACY

HOVITMR_029041

### Compiled Hit List

In [16]:
!cat CandidateHits.txt

white	HOVITMR_063422
white	HOVITMR_025764
white	HOVITMR_025763
cinnabar	HOVITMR_030756
brown	HOVITMR_063422
brown	HOVITMR_023567
scarlet	HOVITMR_063422
scarlet	HOVITMR_023567
punch	HOVITMR_079319
purple	HOVITMR_010106
DhpD	HOVITMR_014112
DhpD	HOVITMR_014113
sepia	HOVITMR_059208
rosy	HOVITMR_021669
vermillion	HOVITMR_086284
yellow	HOVITMR_038735 
yellow	HOVITMR_037366 
yellow	HOVITMR_096073 
yellow	HOVITMR_070904
ebony	HOVITMR_055645
tubby	HOVITMR_043349
wingless	HOVITMR_057127 
wingless	HOVITMR_070598 
wingless	HOVITMR_020388 
wingless	HOVITMR_029198
curly	HOVITMR_045190
dumpy	HOVITMR_031383
dumpy	HOVITMR_031374
dumpy	HOVITMR_031381
dumpy	HOVITMR_031379
dumpy	HOVITMR_031385
miniature	HOVITMR_040001
vestigal	HOVITMR_019057
singed	HOVITMR_016219 
singed	HOVITMR_016213 
singed	HOVITMR_016212
bar	HOVITMR_017333
glass	HOVITMR_029041

## Protein alignments of all candidates against references

Split candidate list by gene name

In [56]:
for gene in $(cat genes.txt);
do grep $gene CandidateHits.txt | cut -f 2 > $gene'.hits.txt';
done

zsh:1: parse error near `do'
zsh:1: parse error near `done'


For each gene, get all the protein seq hits from the genome

In [None]:
for gene in $(cat genes.txt);
    for hit in $(cat $gene'.hits.txt');
    do grep -A 1 '>'$hit data/Homalodisca_vitripennis_A6A7A9_masurca_v1_ragtag_v1.proteins.fixed.fa | sed '/^--$/d' >> $gene'.hits.fasta';
done

For each gene, combine the protein seqs of references and candidate hits in preparation for aligning

In [None]:
for gene in $(cat genes.txt);
do cat $gene'_results'/$gene'.fasta' $gene'.hits.fasta' > $gene'.aln.fasta'; 
done

Alignment with muscle

In [None]:
for gene in $(cat genes.txt);
do muscle -in $gene'.aln.fasta' -out $gene'.aln' -clw;
done

In [None]:
#output is html and has some colors; not sure 
#for gene in $(cat genes.txt);
#do muscle -in $gene'.aln.fasta' -out $gene'.aln.html' -html;
#done

In [None]:
#output for trees
for gene in $(cat genes.txt);
do muscle -in $gene'.aln.fasta' -out $gene'.aln.tre.fasta';
done

Make protein phylogenies

In [None]:
for gene in $(cat genes.txt);
do FastTree $gene'.aln.tre.fasta' > $gene'.aln.tre';
done

Clean up output

In [None]:
for gene in $(cat genes.txt);
do mv $gene* $gene'_results';
done
#will error about the results folders but thats OK

## Look at alignment results

### Eye color markers

In [17]:
!cat white_results/white.aln

MUSCLE (3.8) multiple sequence alignment


HOVITMR_063422-T1                     --------------------------------MFTIEKSSISQGPLRLRRVT-----PPP
HOVITMR_025763-T1                     MKNKTNKYPCGKCLANVSKCSKAVLCKGSCNQWLHLKCTDLSKEDYERIKKSIIKKWLCS
white_CLEC000648                      -------------------------MKDFTPWVFVLPSGGVDDPTLTQQS-------SAL
white_FBpp0070468                     --------------------------MGQEDQELLIRGGS-KHPSAEHLNNGDSGAASQS
white_lcl|MK480204.1_prot_QEO191      ------------------------MTGGHDEREPLLITANGNGSKVTYKAVSDLGKDDDF
HOVITMR_025764-T1                     -----------------------MLIREVANKVIEMPCAQ-NHRNGSQKELS-----NGL
                                                                         :     .                  

HOVITMR_063422-T1                     PVQFSTSQSREKLSG-----------NQSSGNTCTP-------------GLTLTWRDLSV
HOVITMR_025763-T1                     NCELADIDEEIEVEDRVLYVELKEELESQELIIKNLSEDLAKANDEIKNVSTYTLNLETL
white_CLEC000648                      SLD-----------

In [18]:
tree = Phylo.read("white_results/white.aln.tre", "newick")
#print(tree)
#rooting to Dmel as bed bug + gwss white genes should be closer if follows species tree, but not best outgroup 
tree.root_with_outgroup({"name": "white_FBpp0070468"}) 
Phylo.draw_ascii(tree)

             _____________ white_CLEC000648
         ___|
        |   |  _____ HOVITMR_025764-T1
        |   |_|
  ______|     | ____ white_lcl|MK480204.1_prot_QEO19116.1_1
 |      |     ||
 |      |      |_______________________ HOVITMR_025763-T1
_|      |
 |      |____________________________ HOVITMR_063422-T1
 |
 | white_FBpp0070468



need to check jbrowse - is it 64 +63? Missassembly?

In [20]:
!cat brown_results/brown.aln

MUSCLE (3.8) multiple sequence alignment


brown_FBpp0312192                     ------------------MQES----------------GGSSGQGGPS----LCLEWKQL
brown_ACYPI008444-PA                  MATKKLLDLKYNEMWKAWNSTN----------------EEDDDYSSPLFKRDLVLSWKQL
HOVITMR_063422-T1                     --MFTIEKSSISQGPLRLRRVTPPPPVQFSTSQSREKLSGNQSSGNTC-TPGLTLTWRDL
HOVITMR_023567-T1                     ------------------------------------------MYRSPS-KCHLGI-----
brown_lcl|MK480212.1_prot_QEO191      MSYPNLMDSNVMEISLLTGQEGCPSP----------GLGKRSGSGSPV-QGGLTLSWHEL
                                                                                   ..     * :     

brown_FBpp0312192                     NYYVPDQEQSNYSFWNECRKKRELRILQDASGHMKTGDLIAILGGSGAGKTTLLAAISQR
brown_ACYPI008444-PA                  NVTVVRKIPKLFGSS----EVVTKQILNNVSGNVECGTLLGIMGPSGSGKTTLMATISHR
HOVITMR_063422-T1                     SVYAKIKKESLFKSSST--EY--RKIINNVSGAVPPGTLVALMGASGAGKSTLMAALAYQ
HOVITMR_023567-T1                     -----QEDVE----

In [21]:
tree = Phylo.read("brown_results/brown.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "brown_FBpp0312192"}) 
Phylo.draw_ascii(tree)

                       _______________ HOVITMR_063422-T1
                    __|
                   |  |  ______________ HOVITMR_023567-T1
  _________________|  |_|
 |                 |    |_______ brown_lcl|MK480212.1_prot_QEO19124.1_1
_|                 |
 |                 |________________ brown_ACYPI008444-PA
 |
 | brown_FBpp0312192



Unclear

In [23]:
!cat scarlet_results/scarlet.aln

MUSCLE (3.8) multiple sequence alignment


scarlet_FBpp0075149                   MSDSDSKRIDVEAPERVEQHELQVMPVGSTIEVPSLDSTPKLSKRNSSERSLPLRSYS--
scarlet_lcl|MK480213.1_prot_QEO1      ---------------------------MALVPATEINQ-------------MNFSTKGFV
HOVITMR_063422-T1                     ---------------------------MFTIEKSSISQGPLRLRRVTPPPPVQFSTSQSR
scarlet_CLEC004040                    ---------------------------MD-------------------------------
HOVITMR_023567-T1                     ---------------------------MY-------------------------------
                                                                                                  

scarlet_FBpp0075149                   ---KWSPTE-----QGATLVWRDLCVYTNVG------GSGQRMKRIINN-----------
scarlet_lcl|MK480213.1_prot_QEO1      KEEIWEQPE-----DGSTLTWTDLSIYVRCKKPRMLRPAKFSYKRIVNN-----------
HOVITMR_063422-T1                     EKLSGNQSSGNTCTPGLTLTWRDLSVYAKIKKESLFKSSSTEYRKIINN-----------
scarlet_CLEC004040                    --------------

In [24]:
tree = Phylo.read("scarlet_results/scarlet.aln.tre", "newick")
#print(tree)
#rooting to Dmel as bed bug + gwss genes should be closer if follows species tree, but not best outgroup 
tree.root_with_outgroup({"name": "scarlet_FBpp0075149"}) 
Phylo.draw_ascii(tree)

              ________ scarlet_lcl|MK480213.1_prot_QEO19125.1_1
            _|
           | |_____ HOVITMR_063422-T1
  _________|
 |         |             _______ scarlet_CLEC004040
_|         |____________|
 |                      |____________ HOVITMR_023567-T1
 |
 | scarlet_FBpp0075149



Unclear

Overlapping hits between white / brown / scarlet (though white may have other problems) - may want to expand #ref seqs and # hits investigated for these genes to really narrow in on which are 'white' vs. 'scarlet' vs. 'brown'

Biologically scarlet - brown - white must have similar domains to dimerize 

https://www.pnas.org/content/pnas/116/38/19046.full.pdf
<img src="images/Fig3.png" >

In [25]:
!cat punch_results/punch.aln

MUSCLE (3.8) multiple sequence alignment


punch_CLEC005231                      ------------------------------------MNGTQGKEVQLRPPRLRTVSWQEE
punch_FBpp0071505                     ---------------------------MKPQTSEQNGSGQNGEGAADAVAVATIPTGEAS
punch_lcl|MK480217.1_prot_QEO191      ------------------------------------------------------------
HOVITMR_079319-T1                     MAKDSRFQESRLRYHQISDTARTRGRCYDDAASYLKDYEISREHALYVFLIDSNEKWFIS
                                                                                                  

punch_CLEC005231                      ITEGDNDAPGTPKTP---------------------------------------------
punch_FBpp0071505                     AASATSGTDLTVSKNSQQLKLEMLNLE---------------------------------
punch_lcl|MK480217.1_prot_QEO191      ------------------------------------------------------------
HOVITMR_079319-T1                     ANDNVCSALITESDSLHNSRIEIFHVEDYPGLTTITADGQLILSRVYSLHNVHPCLAPRQ
                                                    

In [27]:
tree = Phylo.read("punch_results/punch.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "punch_FBpp0071505"}) 
Phylo.draw_ascii(tree)

          , punch_lcl|MK480217.1_prot_QEO19129.1_1
         _|
  ______| |______ HOVITMR_079319-T1
 |      |
_|      |_______________________________ punch_CLEC005231
 |
 | punch_FBpp0071505



HOVITMR_079319 = punch

In [29]:
!cat purple_results/purple.aln

MUSCLE (3.8) multiple sequence alignment


purple_FBpp0088417                    MSQQPVAFLTRRETFSACHRLH-------------------SPQLSDAENLEVFGKCNNF
HOVITMR_010106-T1                     ----------------------------------------MIPQLNDEENLETYGKCNNY
purple_CLEC001054                     -MASPIVYLTRVEKFSACHRLHRDKEVLRSSLIKTSRAFSGCPQLSDQVNKDVYGKCNNP
purple_lcl|MK480218.1_prot_QEO19      ---MAIAYLTRVEKFSACHRLH-------------------SPLLSDEDNLAVYGKCNNF
                                                                                * *.*  *  .:***** 

purple_FBpp0088417                    HGHGHNYTVEITVRGPIDRRTGMVLNITELKEAIETVIMKRLDHKNLDKDVEYFANTPST
HOVITMR_010106-T1                     HGHGHNYTVEVTLKGPVTADTGMVMNINDLKKHMNKAIMEPMDHKNLDKDVPYFKNVVST
purple_CLEC001054                     NGHGHNYRVEVTVCGPVSKDTGMVMNLSDLKAHMNAAIMETLDHKNLDLDVPYFKDVVST
purple_lcl|MK480218.1_prot_QEO19      HGHGHNYTLEVTLRGPVSPDTGMVMNINDLKKIIQEAVMDTLDHKNIDKDVPYFKDVVST
                                      :****** :*:*: 

In [30]:
tree = Phylo.read("purple_results/purple.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "purple_FBpp0088417"}) 
Phylo.draw_ascii(tree)

                           __________ purple_CLEC001054
  ________________________|
 |                        |  _________ HOVITMR_010106-T1
_|                        |_|
 |                          |__________ purple_lcl|MK480218.1_prot_QEO19130.1_1
 |
 | purple_FBpp0088417



HOVITMR_010106 = purple

In [31]:
!cat DhpD_results/DhpD.aln

MUSCLE (3.8) multiple sequence alignment


HOVITMR_014113-T1                     -------------------------------------------------MMEIEAKYSPN
HOVITMR_014112-T1                     ------------------------------------------------------------
DhpD_FBpp0078625                      ------MATVFLGTVVHTKSFSEFESFEGGFLAVDDAGKIIGVGQDYHAWASSNPAHAKG
DhpD_CLEC005050                       ------PPIIIQGPIVHSVSKDRITALENKLIAVKD-GKIVAL-EDSECMDEIRRMIGDN
DhpD_lcl|MK480214.1_prot_QEO1912      MNFKQHENFVIQGPIIHSLSSNEIGYYENATIVVKK-GKIVSF-DSE---GKIKVSANDG
                                                                                                  

HOVITMR_014113-T1                     EV-VVLEKGQFLIPGLIDTHTHAPQFPNKGLGYDKTLLEWLNVYTFPLESKYEDENIALK
HOVITMR_014112-T1                     ---------------------MPPPLQNF----------WKCCWLL--------------
DhpD_FBpp0078625                      LTEVHLSDYQFLMPGFVDCHIHAPQFAQLGLGLDMPLLDWLNTYTFPLEAKFSNHQYAQQ
DhpD_CLEC005050                       FIFFKLEPGMFLCP

In [32]:
tree = Phylo.read("DhpD_results/DhpD.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "DhpD_FBpp0078625"}) 
Phylo.draw_ascii(tree)

               ______ DhpD_CLEC005050
          ____|
         |    |   ________________ HOVITMR_014112-T1
  _______|    |__|
 |       |       |_____ DhpD_lcl|MK480214.1_prot_QEO19126.1_1
_|       |
 |       |______________________________ HOVITMR_014113-T1
 |
 | DhpD_FBpp0078625



HOVITMR_014112?

In [33]:
!cat sepia_results/sepia.aln

MUSCLE (3.8) multiple sequence alignment


HOVITMR_059208-T1                     ------------------------------------------------------MRYCPY
HOVITMR_059208-T2                     MNILFSSACFVLTCFQLGSLRTVYAAAAMAGKHLSSGS--TDPPLVAGKIRLYSMRYCPY
sepia_FBpp0076349                     ---------------------------MSNGRHLAKGSPMPDVP-EDGILRLYSMRFCPF
sepia_CLEC008904                      ----------------------------MTVEHLAAGS--KAVPLQEGKLRLYSMRFCPY
sepia_lcl|MK480215.1_prot_QEO191      ----------------------------MAPKHLSVGS--SDVPPEEGKLRLYSMRFCPY
                                                                                            **:**:

HOVITMR_059208-T1                     SHRAHLVLLAKNISFDPIFINLKTKPEWYTNTVPSGKVPALLVDGQ----IVSDSLIIAD
HOVITMR_059208-T2                     SHRAHLVLLAKNISFDPIFINLKTKPEWYTNTVPSGKVPALLVDGQ----IVSDSLIIAD
sepia_FBpp0076349                     AQRVHLVLDAKQIPYHSIYINLTDKPEWLLEKNPQGKVPALEIVREPGPPVLTESLLICE
sepia_CLEC008904                      AQRAHLILNAKNIP

In [34]:
tree = Phylo.read("sepia_results/sepia.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "sepia_FBpp0076349"}) 
Phylo.draw_ascii(tree)

                        _____ sepia_CLEC008904
                    ___|
                   |   |_______ sepia_lcl|MK480215.1_prot_QEO19127.1_1
  _________________|
 |                 |                  , HOVITMR_059208-T1
_|                 |__________________|
 |                                    | HOVITMR_059208-T2
 |
 | sepia_FBpp0076349



HOVITMR_059208 - same gene different splicing 

In [35]:
!cat cinnabar_results/cinnabar.aln

MUSCLE (3.8) multiple sequence alignment


cinnabar_FBpp0311059                  MSPGIVSQEVNGRQEPTEAARDERHGRRRRVAVIGAGLVGSLAALNFARMGNHVDLYEYR
cinnabar_CLEC025106                   -----------------------MDRANLKVVVVGGGLVGSLIATYFGQRGYNVHLYEYR
cinnabar_lcl|MK480216.1_prot_QEO      -------------------MENTENGKKLRVAIIGGGLVGSLSACYFGKRGHEVHLYEYR
HOVITMR_030756-T1                     ------------------------------------------------------------
                                                                                                  

cinnabar_FBpp0311059                  EDIRQALVVQGRSINLALSQRGRKALAAVGLEQEVLA-TAIPMRGRMLHDVRGNSSVVLY
cinnabar_CLEC025106                   EDIRTSELVQGKSINLALSVRGLKALEGIGIADSVRA-YGIPMYGRMIHSVTGKTRPIPY
cinnabar_lcl|MK480216.1_prot_QEO      KDIRKDELARGRSINLALSTRGRRALAGVGLEDKLVSHHGLPMYARMLHMTDGSTRAVPY
HOVITMR_030756-T1                     ------------------------------------------------------------
                                                    

In [36]:
tree = Phylo.read("cinnabar_results/cinnabar.aln.tre", "newick")
#print(tree)
#rooting to Dmel as bed bug + gwss genes should be closer if follows species tree, but not best outgroup 
tree.root_with_outgroup({"name": "cinnabar_FBpp0311059"}) 
Phylo.draw_ascii(tree)

                   ________ cinnabar_lcl|MK480216.1_prot_QEO19128...
  ________________|
 |                |    _______________ cinnabar_CLEC025106
_|                |___|
 |                    |_____________ HOVITMR_030756-T1
 |
 | cinnabar_FBpp0311059



HOVITMR_030756 = cinnabar

In [37]:
!cat rosy_results/rosy.aln

MUSCLE (3.8) multiple sequence alignment


rosy_FBpp0082172                      -----MSNSVLVFFVNGKKVTEVSPDPECTLLTFLREKLRLCGTKLGCAEGGCGACTVMV
rosy_CLEC006546                       --MEVKETSVLVFFVNGRKVVDHSADPEWTLIHYLRKK--LCGTKLGCSEGGCGACTVMV
rosy_lcl|MK480206.1_prot_QEO1911      MKDTPQESDTLVFFVNGVKVVDKEVDPEWTLLFYLRNKLRLCGTKLGCAEGGCGACTVMV
HOVITMR_021669-T1                     --MSGSSSSTLVFYVNGKKVEDSNVDPEWTLLYYLRNKLRLTGTKLGCAEGGCGACTVMV
                                            ....***:*** ** : . *** **: :**:*  * ******:***********

rosy_FBpp0082172                      SRLDRRANKIRHLAVNACLTPVCSMHGCAVTTVEGIGSTKTRLHPVQERLAKAHGSQCGF
rosy_CLEC006546                       SKYDRKKNHPIHFTVNACLTPVCAMHGLAVTTVEGIGSVKTKLHPVQERIAKSHGSQCGF
rosy_lcl|MK480206.1_prot_QEO1911      SKYDRKRQKILHYAVNACLAPVCSMHGLAVTTVEGIGSTKTRLHPVQERIAKAHGSQCGF
HOVITMR_021669-T1                     SKFDRTSRKLIHFSANACLAPVCSMHGLAVTTVEGIGSTQTRLHPVQERIAKAHGSQCGF
                                      *. **  .:  * :

In [38]:
tree = Phylo.read("rosy_results/rosy.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "rosy_FBpp0082172"}) 
Phylo.draw_ascii(tree)

                    ___________ HOVITMR_021669-T1
  _________________|
 |                 |   _________ rosy_lcl|MK480206.1_prot_QEO19118.1_1
_|                 |__|
 |                    |__________________ rosy_CLEC006546
 |
 | rosy_FBpp0082172



HOVITMR_021669 = rosy

In [39]:
!cat vermillion_results/vermillion.aln

MUSCLE (3.8) multiple sequence alignment


HOVITMR_086284-T1           ------------------------------------------------------------
HOVITMR_086284-T2           ------------------------------------------------------------
vermillion_FBpp0073242      MSCPYAGNGNDHDDSAVPLTTEVGKIYGEYLMLDKLLDAQCMLSEEDKRPVHDEHLFIIT
vermillion_CLEC006165       ------------------------MLYADYLQLNKILTAQRMLSTEHNATVHDEHLFIIT
                                                                                        

HOVITMR_086284-T1           ------------------------------------------------------------
HOVITMR_086284-T2           ------------------------------------------------------------
vermillion_FBpp0073242      HQAYELWFKQIIFEFDSIRDMLD-AEVIDETKTLEIVKRLNRVVLILKLLVDQVPILETM
vermillion_CLEC006165       HQAYELWFKQIIHELDSIRVIFNKPEGLEESETLEILKRLSRIVLILKLLVDQVMILETM
                                                                                        

HOVITMR_086284-T1           ----------------------

In [40]:
tree = Phylo.read("vermillion_results/vermillion.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "vermillion_FBpp0073242"}) 
Phylo.draw_ascii(tree)

                    __________________ vermillion_CLEC006165
  _________________|
 |                 |                     , HOVITMR_086284-T1
_|                 |_____________________|
 |                                       |______________ HOVITMR_086284-T2
 |
 | vermillion_FBpp0073242



Missing a bit at the beginning - but HOVITMR_086284-T1?

In [41]:
!cat yellow_results/yellow.aln

MUSCLE (3.8) multiple sequence alignment


yellow_CLEC000391       ------------------------------LSLFSYLTLAMSSLPLLLLGLSYVAGELEI
HOVITMR_038735-T1       ------------------------------------MALW----PLLFALVGLASAELEV
yellow_FBpp0070070      ---------------------------------MFQDKGWILVTLITLVTPSWAAYKLQE
HOVITMR_070904-T1       ------------------------------------------------------------
HOVITMR_037366-T1       --------------------------MTVNTVAVCVALCWISGLSGQPNNP------LKE
HOVITMR_096073-T1       MSAEQLLYRAIHPSPAIVCSVDSEPNLQRAYCYSFETLCWISGLSGQPNNP------LKE
                                                                                    

yellow_CLEC000391       MYQWTLAKFDTPFNYPPNT-----KYRADTTFINSVEVGWDRVFVTLPRIWSGNPASLAW
HOVITMR_038735-T1       VNQWNLFDFDIPYGYPTNE-----NYSTSQSPSTGLEVGWDRLFLALPRFMPGAPLSLAF
yellow_FBpp0070070      RYSWSQLDFAFPNTRLKDQALASGDYIPQNALPVGVEHFGNRLFVTVPRWRDGIPATLTY
HOVITMR_070904-T1       --------------------------------------------------MSRVPATLNY
HOVITM

In [42]:
tree = Phylo.read("yellow_results/yellow.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "yellow_FBpp0070070"}) 
Phylo.draw_ascii(tree)

                ____ HOVITMR_070904-T1
            ___|
           |   |          , HOVITMR_037366-T1
           |   |__________|
  _________|              | HOVITMR_096073-T1
 |         |
 |         |                                   ___________ yellow_CLEC000391
_|         |__________________________________|
 |                                            |____ HOVITMR_038735-T1
 |
 | yellow_FBpp0070070



Need to add in more orthologs - one looks very much like bed bug verison, and the others look similar to Dmel

In [43]:
!cat ebony_results/ebony.aln

MUSCLE (3.8) multiple sequence alignment


ebony_CLEC007608       --------------------------------------------MFLDGKHPTKSLSYGE
ebony_FBpp0083505      MGSLPQLSIVKGLQQDFVPRALHRIFEEQQLRHADKVALIYQPSTTGQGMAPSQS-SYRQ
HOVITMR_055645-T1      ------------------------------------------------------------
                                                                                   

ebony_CLEC007608       VEERSNRLARALLLATKDK--SPNDDGDRVIGLCMEPSPELIIAILAVWKSGCSYLTFAP
ebony_FBpp0083505      MNERANRAARLLVAETHGRFLQPNSDGDFIVAVCMQPSEGLVTTLLAIWKAGGAYLPIDP
HOVITMR_055645-T1      ------------------------------------------------------------
                                                                                   

ebony_CLEC007608       NAPVNRTRHIVQEARPVLVVTDKSGTGDLYAPTECVEFSGLEAVSADLSESTLEEDESYP
ebony_FBpp0083505      SFPANRIHHILLEAKPTLVIRDDDIDAGRFQGTPTLSTTELYAKSLQLAGSNLLSEEMLR
HOVITMR_055645-T1      ------------------------------------------------------------
               

In [44]:
tree = Phylo.read("ebony_results/ebony.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "ebony_FBpp0083505"}) 
Phylo.draw_ascii(tree)

                          ___________________________________ ebony_CLEC007608
  _______________________|
_|                       |________________________ HOVITMR_055645-T1
 |
 | ebony_FBpp0083505



Missing first half at beginning but otherwise looks really good?
HOVITMR_055645?

In [45]:
!cat tubby_results/tubby.aln

MUSCLE (3.8) multiple sequence alignment


tubby_FBpp0084408              MRGFI-----------------------------------IFAVLAVARADV-------G
tubby_L798_11124:KDR14905      MAAAVAK-------RTSAAVPPVVAGTVLYQPAVSCYRRPAFEAINLADADF-TSDSHTS
HOVITMR_043349-T1              MPLELRKSLRTFNMKDDPRVPPV-------RPPAASYRRPSFEPINLADADFAASPPKSS
                               *   :                                    *  : :* **.       .

tubby_FBpp0084408              GYNYGAGIGSGGSISGGSL-----------------------------------------
tubby_L798_11124:KDR14905      THRYNNGFSSATNSSLG-------------------------------------------
HOVITMR_043349-T1              GQSFSRGISGKKNETEGTLAFLSNKAQSNNRNYIHTEENVVSSINNEVNINKQLLFESED
                                  :. *:..  . : *                                           

tubby_FBpp0084408              ------------------------------------------------------------
tubby_L798_11124:KDR14905      -----------------RPTV---------------------------------------
HOVITMR_043349-T1   

In [46]:
tree = Phylo.read("tubby_results/tubby.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "tubby_FBpp0084408"}) 
Phylo.draw_ascii(tree)

                                            _________ tubby_L798_11124
  _________________________________________|
_|                                         |_________________ HOVITMR_043349-T1
 |
 | tubby_FBpp0084408



I'm not really conviced by this - not sure this ones worth pursuing

In [47]:
!cat wingless_results/wingless.aln

MUSCLE (3.8) multiple sequence alignment


HOVITMR_020388-T1         ------------------------------MTVSVYRLECGGC--------HLAKTPVVS
wingless_CLEC012013       --------------------------------MFKFKTIYYFC-----------------
HOVITMR_057127-T1         MPPSIWLVHCRYCHTIGKDSAGEVAAVRPSLTDIVVQQLSFTCELAIKAGGRLSTDQLSS
HOVITMR_070598-T1         -----------------------MLRPMSRPTPGVDDELWWAR--------LLGLQGPSQ
wingless_FBpp0079060      MDISYIFVICLMALCSGGSSLSQVEGKQKS--GRGRGSMWWGI-------AKVGEPNNIT
HOVITMR_029198-T1         ------------------------------------------------------------
                                                                                      

HOVITMR_020388-T1         SVVALGAHV--------------ICSRVPGLTPRQRDMCRAAPDAMIAVGDGIRLATLEC
wingless_CLEC012013       FFRNIEPANFNDVN---------GCEVTKGLSIGQVRLCQVYSDHMERVTIGAKKGISEC
HOVITMR_057127-T1         VLDSISPCCCRHMGLLEETSEPLPCGRTPGLSPGQTKLCHLYEDHMPGVGRGARAGIDEC
HOVITMR_070598-T1         GAVSLGPLSHKE-----------NCHRLQYLVERQQQLCGLSENVL

In [48]:
tree = Phylo.read("wingless_results/wingless.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "wingless_FBpp0079060"}) 
Phylo.draw_ascii(tree)

                            __________________ HOVITMR_020388-T1
                     ______|
                    |      |             _________ wingless_CLEC012013
       _____________|      |____________|
      |             |                   |_______________ HOVITMR_057127-T1
  ____|             |
 |    |             |__________________ HOVITMR_070598-T1
_|    |
 |    |_____ HOVITMR_029198-T1
 |
 | wingless_FBpp0079060



remove HOVITMR_029198 and re-do alignment

In [49]:
!cat curly_results/curly.aln

MUSCLE (3.8) multiple sequence alignment


curly_FBpp0289611      MSVPSAPHQRAESKNRVPRPGQKNRKLPKLRLHWPGATYGGALLLLLISYGLELGSVHCY
curly_CLEC009522       ------------------------------------------------------------
HOVITMR_045190-T1      ------MYKQKKKKHTTNKPTQHQKKKTKK------------------------------
                                                                                   

curly_FBpp0289611      EKMYSQTEKQRYDGWYNNLAHPDWGSVDSHLVRKAPPSYSDGVYAMAGANRPSTRRLSRL
curly_CLEC009522       ------------------------------MTRKTPAAYKDGVYMMSGEDRPSARRISQL
HOVITMR_045190-T1      ------------------------------NKKQPPPPTPNQHQKQVRVCNNQIENCDEI
                                                       .:.*..  :         . .  . . :

curly_FBpp0289611      FMRGKDGLGS-KFNRTALLAFFGQLVANEIVMASESGCPIEMHRIEIEKCDEMYDRECRG
curly_CLEC009522       FMKGSDGLPS-SRNRTALLAFFGQVVSSEVVMASESGCPIEVHQIPIDRCDDMYDPECKG
HOVITMR_045190-T1      YNKECEGGKSIPFHRAGY--------------DRKTGQSPNSPREQIENCDEIYNKECEG
               

In [50]:
tree = Phylo.read("curly_results/curly.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "curly_FBpp0289611"}) 
Phylo.draw_ascii(tree)

                                 ______________ curly_CLEC009522
  ______________________________|
_|                              |____________________________ HOVITMR_045190-T1
 |
 | curly_FBpp0289611



HOVITM_050937 = curly

In [51]:
!cat dumpy_results/dumpy.aln

MUSCLE (3.8) multiple sequence alignment


HOVITMR_031385-T1      ------------------------------------------------------------
HOVITMR_031374-T1      ------------------------------------------------------------
HOVITMR_031379-T1      ------------------------------------------------------------
dumpy_FBpp0304628      MKIFLPLVTWIVLLLSSAVHSQYSQQPQPFKTNLRANSRFRGEVFYLNLENGYFGCQVNE
HOVITMR_031381-T1      ------------------------------------------------------------
HOVITMR_031383-T1      ------------------------------------------------------------
dumpy_CLEC025055       ------------------------------------------------------------
                                                                                   

HOVITMR_031385-T1      ------------------------------------------------------------
HOVITMR_031374-T1      --------------------------------MFDDCHKKDGVRCQNGACLDSQCHCNDG
HOVITMR_031379-T1      --------------------------------------------------------MLEA
dumpy_FBpp0304628

In [52]:
tree = Phylo.read("dumpy_results/dumpy.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "dumpy_FBpp0304628"}) 
Phylo.draw_ascii(tree)

              ____________________________ HOVITMR_031379-T1
            _|
           | |        ______________________________________ HOVITMR_031385-T1
           | |_______|
          ,|         |______________________________ HOVITMR_031374-T1
          ||
          ||  _____ HOVITMR_031383-T1
  ________||_|
 |        |  |______ dumpy_CLEC025055
_|        |
 |        |_____ HOVITMR_031381-T1
 |
 | dumpy_FBpp0304628



This ones pretty gross - I noticed had many copies on flybase - so may be poor choice - maybe try removing HOVITMR_031385 and HOVITMR_031374 and re-do alignment - but am not hopeful about clarity here

In [53]:
!cat miniature_results/miniature.aln

MUSCLE (3.8) multiple sequence alignment


miniature_FBpp0073400      MWSPQKGPTRLWDLRFSSCIFILHLMFSLVIAGNELWPMERPDGMPNIVSLEVMCGKDHM
miniature_CLEC002209       -------------------------LFQRNVLGGEIWPLERPEGMPAIQSLEVMCGKDHM
HOVITMR_040001-T1          -----------------------------------------------------MCGKDHM
                                                                                *******

miniature_FBpp0073400      DVHLTFSHPFEGIVSSKGQHSDPRCVYVPPSTGKTFFSFRISYSRCGTKPDLNGQFYENT
miniature_CLEC002209       DVHLSFTQPFEGIVSSKGQYADPRCVYVPPSTGKTFFSFRIAYARCGTKPDLNGQFYENT
HOVITMR_040001-T1          DVHLSFTQPFEGIVSSKGQYGDPRCVYVPPSTGKTFFSFRIAYARCGTKPDLHGQFYENT
                           ****:*::***********:.********************:*:********:*******

miniature_FBpp0073400      VVVQYDKDLLEVWDEAKRLRCEWFNDYEKTASKPPMVIADLDVIQLDFRGDNVDCWMEIQ
miniature_CLEC002209       VVVQYDKDLLEVWDEAKRLRCEWYNDYEKTASKPPMVIADLDVIQLDFRGDNVDCWMEIQ
HOVITMR_040001-T1          VVVQYDKDLLEVWDEAKRLRCEWYNDYEKTASK

In [54]:
tree = Phylo.read("miniature_results/miniature.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "miniature_FBpp0073400"}) 
Phylo.draw_ascii(tree)

                                      ________________ miniature_CLEC002209
  ___________________________________|
_|                                   |___________________ HOVITMR_040001-T1
 |
 | miniature_FBpp0073400



HOVITMR_040001 = miniature

In [55]:
!cat vestigal_results/vestigal.aln

MUSCLE (3.8) multiple sequence alignment


HOVITMR_019057-T1           --MN--------------------------------------------------------
vestigal_ACYPI34460-PA      --MSCTEVMYQAYYPYLYQR--------------------------SSGTAAPPTRAPHH
vestigal_FBpp0086898        MAVSCPEVMYGAYYPYLYGRAGTSRSFYQYERFNQDLYSSSGVNLAASSSASGSSHSPCS
                              :.                                                        

HOVITMR_019057-T1           --FEPLV-------------------------------PLSRGGTPGRGR----------
vestigal_ACYPI34460-PA      H-FPPF------------THQYDRLRALESHQQASTSSPIGGGDSPANRWTTIADHTDSH
vestigal_FBpp0086898        PILPPSVSANAAAAVAAAAHNSAAAAVAVAANQASSSGGIGGGGLGGLGGLGGGPASGLL
                              : *                                  :. *.  .             

HOVITMR_019057-T1           ------------------------------------------------HSWER-------
vestigal_ACYPI34460-PA      HSSVGSGASSVGPVGGIASPASSPATANSVSVVHKEED----------TSRDDVRTEMDE
vestigal_FBpp0086898        GSNVVPGSSSVGSVGLGMSPVL

In [56]:
tree = Phylo.read("vestigal_results/vestigal.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "vestigal_FBpp0086898"}) 
Phylo.draw_ascii(tree)

                            ____________________________ HOVITMR_019057-T1
  _________________________|
_|                         |________ vestigal_ACYPI34460-PA
 |
 | vestigal_FBpp0086898



HOVITMR_019057

In [57]:
!cat singed_results/singed.aln

MUSCLE (3.8) multiple sequence alignment


HOVITMR_016213-T1       MN----------------------------------------------------------
singed_FBpp0290891      MNGQGCELGHSNGDIISQNQQKGWWTIGLINGQHKYMTAETFGFKLNANGASLKKKQLWT
HOVITMR_016212-T1       MNGVNGMNGHHENGELSNGSNKGVWTIGLINCQFKYLTAETFGFK---------------
HOVITMR_016219-T1       ------------------------------------------------------------
singed_CLEC025340       MNGVNGMNGHTNGEL--NGVGRGTWTIGLINVQYRYLTAETFGFKINANGSSLKKKQMWT
                                                                                    

HOVITMR_016213-T1       ------------------------------------------------------------
singed_FBpp0290891      LEPSNTGESIIYLRSHLNKYLSVDQFGNVLCESDERDAGSRFQISISEDGSGRWALKNES
HOVITMR_016212-T1       ------------------------------------------------------------
HOVITMR_016219-T1       ------------------------------------------------------------
singed_CLEC025340       LEPAPGETNTVYLRSHLDKYLAVDSFGNVTCESEEKDPGSKFQIVISEDASGRWALKNTA
      

In [58]:
tree = Phylo.read("singed_results/singed.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "singed_FBpp0290891"}) 
Phylo.draw_ascii(tree)

             ______________ HOVITMR_016213-T1
  __________|
 |          |  , singed_CLEC025340
 |          |__|
_|             |   ____ HOVITMR_016219-T1
 |             |__|
 |                |_______________________________________ HOVITMR_016212-T1
 |
 | singed_FBpp0290891



unclear

In [59]:
!cat bar_results/bar.aln

MUSCLE (3.8) multiple sequence alignment


HOVITMR_030756-T1      ------------------------------------------------------------
bar_FBpp0074204        MKDSMSILTQTPSEPNAAHPQLHHHLSTLQQQHHQHHLHYGLQPPAVAHSIHSTTTMSSG
bar_CLEC007883         ------------------------------------------------------------
HOVITMR_017333-T1      ------------------------------------------------------------
                                                                                   

HOVITMR_030756-T1      ------------------------------------------------------------
bar_FBpp0074204        GSTTTASGIGKPNRSRFMINDILAGSAAAAFYKQQQHHQQLHHHNNNNNSGSSGGSSPAH
bar_CLEC007883         ------------------------------------------------------------
HOVITMR_017333-T1      ------------------------------------------------------------
                                                                                   

HOVITMR_030756-T1      -MKRPMFDYNQTYIQHGYMELCIPPTQDGQFAMKPNYLHIWPRGSFMMIALPNQDCSWTV
bar_FBpp0074204

In [60]:
tree = Phylo.read("bar_results/bar.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "bar_FBpp0074204"}) 
Phylo.draw_ascii(tree)

          ____ HOVITMR_017333-T1
  _______|
 |       |   ________________________________________________ HOVITMR_030756-T1
_|       |__|
 |          | bar_CLEC007883
 |
 | bar_FBpp0074204



Get rid of HOVITMR_030756 and re-do alignment

In [61]:
!cat glass_results/glass.aln

MUSCLE (3.8) multiple sequence alignment


HOVITMR_029041-T1         MSAWKKPSTTNIMKSLLYAKKSKPKQVDKDYSWTTKPSRKRGETFQCDFCKKIFLQKGNL
glass_ACYPI008619-PA      ------------------------------------------------------------
glass_FBpp0083006         ---------------------------------------------------MGLLYKGSK
                                                                                      

HOVITMR_029041-T1         RSHIFTFHMDDQQKEYKCEVCSKVFPYPSALRKHSQIHNGEKQYGCDECGKSFTLKGNLK
glass_ACYPI008619-PA      ------------------------------------------------------------
glass_FBpp0083006         LLNTILDSLEDQE-----------------------------------------------
                                                                                      

HOVITMR_029041-T1         AHQFLHTGELPYACGICQKKFATESSLNSHIDVHTGVKAFSCSMCERKFRHRRGLETHER
glass_ACYPI008619-PA      ------------------------------------------------------------
glass_FBpp0083006         --------------------------------------------

In [62]:
tree = Phylo.read("glass_results/glass.aln.tre", "newick")
#print(tree)
#rooting to Dmel 
tree.root_with_outgroup({"name": "glass_FBpp0083006"}) 
Phylo.draw_ascii(tree)

                     _____________________________________ HOVITMR_029041-T1
  __________________|
_|                  |_______ glass_ACYPI008619-PA
 |
 | glass_FBpp0083006



Maybe? I think this is one where more reference seqs are needed to be sure

### All hits further confirmed by reverse blasting on uniprot 

Need to pull out stats from phmmer results for these top candidates & record 

## Refining results  

Candidate genes fall into three groups:

(a) remove bad hit and rerun alignment

(b) get more othologs, re-run pipeline

(c) check JBrowse, then add more orthologs, re-run pipeline

(d) misassembly?

Assigning genes to groups:

(a) wingless, bar

(b) brown, scarlet, yellow, dumpy, vestigal, glass, DhpD 

(c) vermillion, ebony

(d) white, sepia

In [None]:
#combined white / brown / scarlet tree

muscle -in w_b_s.aln.fasta -out w_b_s.aln -clw
muscle -in w_b_s.aln.fasta -out w_b_s.aln.tre.fasta
FastTree w_b_s.aln.tre.fasta > w_b_s.aln.tre
mkdir w_b_s_results
mv w_b_s.* w_b_s_results

In [64]:
!cat w_b_s_results/w_b_s.aln

MUSCLE (3.8) multiple sequence alignment


brown_FBpp0312192                     ----------------------------------MQESGG-SSGQGG-------------
brown_ACYPI008444-PA                  ---------------------------------------M-ATKKLLDLKYNEMWKAWNS
HOVITMR_023567-T1                     ------------------------------------------------------------
brown_lcl|MK480212.1_prot_QEO191      ------------------------MSYPNLMDSNVMEISL-LTGQEGCPSPGLGKRSGSG
scarlet_CLEC004040                    ------------------------------------------------------------
scarlet_FBpp0075149                   -----------MSDSDSKRIDVEAPERVEQHELQVMPVGSTIEVPSLDSTPKLSKRNSSE
scarlet_lcl|MK480213.1_prot_QEO1      --------------------------------MALVPATE-INQMNFSTKGFVKEEIWEQ
HOVITMR_063422-T1                     ----------------MFTIEKSSISQGPLRLRRVTPPPP-VQFSTSQSREKLSGNQSSG
HOVITMR_025763-T1                     MKNKTNKYPCGKCLANVSKCSKAVLCKGSCNQWLHLKCTD-LSKEDYERIKKSIIKKWLC
white_CLEC000648                      ----------------

In [66]:
tree = Phylo.read("w_b_s_results/w_b_s.aln.tre", "newick")
#print(tree)
#rooting to Dmel white
tree.root_with_outgroup({"name": "white_FBpp0070468"}) 
Phylo.draw_ascii(tree)

                      ______ brown_lcl|MK480212.1_prot_QEO19124.1_1
                    _|
                   | |   _______ HOVITMR_023567-T1
                   | |__|
                ___|    |_____ scarlet_CLEC004040
               |   |
               |   |  ______________ brown_FBpp0312192
      _________|   |_|
     |         |     |____________ brown_ACYPI008444-PA
     |         |
     |         |    ______ scarlet_FBpp0075149
     |         |___|
     |             | _______ scarlet_lcl|MK480213.1_prot_QEO19125.1_1
  ___|             ||
 |   |              |____ HOVITMR_063422-T1
 |   |
 |   |   ____________ HOVITMR_025763-T1
 |   | ,|
_|   | || __ white_lcl|MK480204.1_prot_QEO19116.1_1
 |   |_|||
 |     | |__ HOVITMR_025764-T1
 |     |
 |     |______ white_CLEC000648
 |
 | white_FBpp0070468



HOVITMR_023567 = brown, HOVITMR_063422 = scarlet, HOVITMR_025763 &/or HOVITMR_025764 = white

In [68]:
tree = Phylo.read("w_b_s_results/w_b_s.aln.tre", "newick")
#print(tree)
#rooting to Dmel white
tree.root_with_outgroup({"name": "brown_FBpp0312192"}) 
Phylo.draw_ascii(tree)

               _____ brown_lcl|MK480212.1_prot_QEO19124.1_1
             ,|
             ||  ______ HOVITMR_023567-T1
             ||_|
             |  |____ scarlet_CLEC004040
             |
             |             _________ HOVITMR_025763-T1
            _|           ,|
           | |           ||_ white_lcl|MK480204.1_prot_QEO19116.1_1
           | |          _||
           | |         | ||__ HOVITMR_025764-T1
           | |   ______| |
           | |  |      | |____ white_CLEC000648
           | |  |      |
  _________| |__|      |___ white_FBpp0070468
 |         |    |
 |         |    |   ____ scarlet_FBpp0075149
 |         |    |__|
 |         |       | _____ scarlet_lcl|MK480213.1_prot_QEO19125.1_1
_|         |       ||
 |         |        |__ HOVITMR_063422-T1
 |         |
 |         |_________ brown_ACYPI008444-PA
 |
 | brown_FBpp0312192



## Problem gene - white, in 2 gene calls?

In [None]:
#align nucleotides

muscle -in white_nucl_results/white_nucl.fasta -out white_nucl_results/white_nucl.aln -clw
muscle -in white_nucl_results/white_nucl.fasta -out white_nucl_results/white_nucl.aln.tre.fasta
FastTree white_nucl.aln.tre.fasta > white_nucl_results/white_nucl.aln.tre


In [6]:
!cat white_nucl_results/white_nucl.aln

MUSCLE (3.8) multiple sequence alignment


vasa_FBpp0401446      ATGTCTGACGACTGGGATGATGAGCCCATTGTTGATACTCGCGGCGCCCGCGGTGGAGAT
HOVITMR_025763        ATGAAAAATAAAACAAACAAGTATCC------TTGTGGTAAATGTCTTGCTAATGTAAGT
HOVITMR_025764        ATGC----------------------------TTATACGAGAAGTA--------GCAAAT
                      ***                             *  *       *          * *  *

vasa_FBpp0401446      TGGAGCGATGATG--AGGACACGGCCAAGAGCTTCAGCGGCGAAGCTGAAGGCGATGGTG
HOVITMR_025763        AAGTGTAGCAAGGCAGTGCTATG--CAAAGGC-----------AGCTGCAATCAATGGTT
HOVITMR_025764        AAGGTTATTGAGAT-GCCATGTGCCCAAAACCACC----------------GCAATGG--
                        *       *           *  ***   *                    * ****  

vasa_FBpp0401446      T--TGGAGGGAGCGGTGGTGAAGGCGGCGGCTACCAAGGAGGAAATCGAGA-------TG
HOVITMR_025763        GCATTTAAAATGCACTGATTTAAGTAAAGAGGACTAT-GAAAGAATTAAAAAATCAATTA
HOVITMR_025764        --------------------------------ATCACAGAAGGAACTGAGCAATGGACTG
                          

In [8]:
tree = Phylo.read("white_nucl_results/white_nucl.aln.tre", "newick")
#print(tree)
#rooting to Dmel white
tree.root_with_outgroup({"name": "vasa_FBpp0401446"}) 
Phylo.draw_ascii(tree)

                                     _________________________ HOVITMR_025763
  __________________________________|
_|                                  |___________________ HOVITMR_025764
 |
 | vasa_FBpp0401446



Maybe not split? Maybe a duplication?

In [None]:
#align protiens

muscle -in white_prot_results/white_prot.fasta -out white_prot_results/white_prot.aln -clw
muscle -in white_prot_results/white_prot.fasta -out white_prot_results/white_prot.aln.tre.fasta
FastTree white_prot_results/white_prot.aln.tre.fasta > white_prot_results/white_prot.aln.tre


In [9]:
!cat white_prot_results/white_prot.aln

MUSCLE (3.8) multiple sequence alignment


HOVITMR_025763         MKNKTNKYPCGKCLANVSKCSKAVLCKGSCNQWLHLKCTDLSKEDYERIKKSIIKKWLCS
HOVITMR_025764         -----------------------MLIREVANKVIEMPCAQNHRNG---------------
white_FBpp0070468      ----------------MGQEDQELLIRGGSKH----PSAEHLNNG---------------
                                              :* .  .::     .::  .:.               

HOVITMR_025763         NCELADIDEEIEVEDRVLYVELKEELESQELIIKNLSEDLAKANDEIKNVSTYTLNLETL
HOVITMR_025764         -------------SQKELSNGLYLELSDTEKKPQVNGYSSGSGGSGSPYSTVSILPKENI
white_FBpp0070468      --------DSGAASQSCINQGFGQAKNYGTLRPPSPPEDSGSGSGQLA---------ENL
                                    .:  :   :    .           . .....            *.:

HOVITMR_025763         VLKKEESIELLERKVKEISENKSECKCLNKYKATGPRRSLPAKLSTPVSSIMNKNNRLDS
HOVITMR_025764         TYTWSN---------------------VNVFTRLEHRQN---RVVNLVHNVVHRDR--PQ
white_FBpp0070468      TYAWHN---------------------MDIFGAVNQPGSGWRQLVNRTRGLFCNERHIPA
               

In [11]:
tree = Phylo.read("white_prot_results/white_prot.aln.tre", "newick")
#print(tree)
#rooting to Dmel white
#tree.root_with_outgroup({"name": "vasa_FBpp0401446"}) 
Phylo.draw_ascii(tree)

  ___________________________________________________________ HOVITMR_025763
 |
_|___________________ HOVITMR_025764
 |
 |___ white_FBpp0070468

