# Analysis outline

The analysis has the following components:

1. [orthologs.ipynb](orthologs.ipynb): collect primate orthologs of each human gene from NCBI, remove redundant sequences, and save them to a `*.fasta` file.

1. [align.ipynb](align.ipynb): align the orthologous sequences in each `*.fasta` file and save the alignment to a `*.aln` file.

1. [pathogen.ipynb](pathogen.ipynb): extract pathogenic variants from ClinVar data.

1. [clean_up.ipynb](clean_up.ipynb): establish connections between transcripts in ClinVar data and proteins in ortholog collections.

1. [find_CPDs.ipynb](find_CPDs.ipynb): find putative compensated pathogenic deviations (CPDs).

# Set up

Run the following cell.  It loads some required functions.

In [1]:
run kondrashov

Run the following two cells.  The first displays the path to the current folder.  The second displays the contents of that folder.  (Note: your output will be different from mine.)

In [3]:
pwd

'/Users/rbazev/Documents/GitHub/kondrashov'

In [7]:
ls

README.md        clean_up.ipynb   kondrashov.py
align.ipynb      fasta/           orthologs.ipynb
aln/             find_CPDs.ipynb  pathogen.ipynb


If there is no folder called `fasta`, create it by running the following command.

In [5]:
mkdir fasta

If there is no folder called `aln`, create it by running the following command.

In [6]:
mkdir aln

Rerun the `ls` command above to confirm that you have created those folders successfully. 
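
The same folders can also be created from Python, which does not fail when a folder already exists. This is a minimal sketch, assuming the notebook runs from the repository root; `ensure_folders` is a hypothetical helper and not part of `kondrashov.py`.

```python
from pathlib import Path

def ensure_folders(names=("fasta", "aln"), root="."):
    """Create each folder under root if it is missing; return the paths."""
    paths = []
    for name in names:
        path = Path(root) / name
        # exist_ok=True makes the call safe to repeat
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths

created = ensure_folders()
print([str(p) for p in created])
```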

The following defines a function that gets the species name from a NCBI sequence record.
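
The definition is not reproduced on this page (presumably it is loaded by `run kondrashov`). As a rough sketch of what such a function might look like, and not the repository's actual code: records that Biopython parses from GenBank text carry the organism name in `annotations['organism']`, and the NCBI protein descriptions shown later on this page end with the species in square brackets, which is used here as a fallback. The name `get_species_sketch` is hypothetical.

```python
import re
from types import SimpleNamespace

def get_species_sketch(seq_record):
    """Return the species name of a sequence record (illustrative only)."""
    annotations = getattr(seq_record, "annotations", None) or {}
    organism = annotations.get("organism")
    if organism:
        return organism
    # Fallback: take the last [bracketed] name in the description.
    description = getattr(seq_record, "description", "") or ""
    matches = re.findall(r"\[([^\[\]]+)\]", description)
    return matches[-1] if matches else None

# Stand-in objects with the same attribute names as a Biopython SeqRecord.
human = SimpleNamespace(annotations={"organism": "Homo sapiens"}, description="")
monkey = SimpleNamespace(
    annotations={},
    description="alkaline phosphatase, tissue-nonspecific isozyme isoform X1 "
                "[Saimiri boliviensis boliviensis]",
)
print(get_species_sketch(human))
print(get_species_sketch(monkey))
```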

# Get protein sequences for primate orthologs of human genes

**Note: this is a new version of the `orthologs.ipynb` notebook that prevents unique human sequences from being dropped.**

* Enter the name of the gene you are working on in the following cell.  On this page I'll use ALPL as an example.

In [7]:
# enter gene name here
gene = 'ALPL'

* In the [NCBI query page](https://www.ncbi.nlm.nih.gov/labs/gquery/) search for `Homo sapiens ALPL`.  

* When the search results come up ([this is the page for ALPL](https://www.ncbi.nlm.nih.gov/labs/gquery/all/?term=Homo+sapiens+ALPL)) click on `Orthologs`.  Here's what the output looks like for [ALPL](https://www.ncbi.nlm.nih.gov/gene/249/ortholog/?scope=9443&term=ALPL).

* Use the Taxonomy Tree to select **mammals > placentals > primates**.  For ALPL, this reduces the number of orthologs from 335 to 29.

* In the list of sequences click to select all (box labelled `0 selected`).

* Click on `Add to cart`.  The items appear in a box with a shopping cart symbol.  Click on that box.

* A popup window appears.  Click on `Protein alignment`.  Choose `all sequences per gene`.  Click `Align`.

* Select all the accession numbers in the box (XP_039330133.1, XP_037856105.1, etc) and copy them.

* Run the following cell (it pulls the names from the clipboard and puts them in a list).

In [10]:
# copy IDs from NCBI ortholog protein alignment dialog (all sequences per gene)
tmp = pd.read_clipboard(sep='\n', names=['id'])
seq_ids = [i for i in tmp['id']]
print('{0} orthologs: {1} primate sequences (retrieved on {2})\n'.format(gene, len(seq_ids), date.today()))
print(seq_ids)

ALPL orthologs: 134 primate sequences (retrieved on 2021-09-10)

['XP_039330133.1', 'XP_037856105.1', 'XP_037856035.1', 'XP_037856000.1', 'XP_037855971.1', 'XP_037855931.1', 'XP_037855886.1', 'XP_037855836.1', 'XP_037855813.1', 'XP_037586579.1', 'XP_037586562.1', 'XP_035162547.1', 'XP_035162546.1', 'XP_033083840.1', 'XP_033083839.1', 'XP_033083838.1', 'XP_033083837.1', 'XP_032615100.1', 'XP_032615099.1', 'XP_032156584.1', 'XP_032156583.1', 'XP_032156582.1', 'XP_032156581.1', 'XP_032156580.1', 'XP_032156579.1', 'XP_032156578.1', 'XP_032156577.1', 'XP_032156576.1', 'XP_031521026.1', 'XP_031521023.1', 'XP_031521022.1', 'XP_031521018.1', 'XP_030661091.1', 'XP_028701740.1', 'NP_001356734.1', 'NP_001356733.1', 'NP_001356732.1', 'XP_025210047.1', 'XP_025210038.1', 'XP_025210030.1', 'XP_025210024.1', 'XP_002811389.2', 'XP_012657049.2', 'XP_023365504.1', 'XP_023075500.1', 'XP_023075499.1', 'XP_023075498.1', 'XP_023075497.1', 'XP_017829425.1', 'XP_017742028.1', 'XP_017742025.1', 'XP_017742024.1'
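
If reading from the clipboard fails (for example, some pandas versions may reject `sep='\n'`), the copied text can be pasted into a string and parsed directly. This is a minimal sketch using only the standard library. `extract_accessions` and its regular expression are illustrative assumptions that cover the `XP_` and `NP_` protein accessions seen above, not every RefSeq prefix, and the sketch also drops repeated IDs, which the cell above does not do.

```python
import re

# Matches accessions such as XP_039330133.1 or NP_000469.3
ACCESSION = re.compile(r"\b[NX]P_\d+\.\d+\b")

def extract_accessions(text):
    """Return protein accessions in order of first appearance, without repeats."""
    seen = set()
    ids = []
    for acc in ACCESSION.findall(text):
        if acc not in seen:
            seen.add(acc)
            ids.append(acc)
    return ids

pasted = """XP_039330133.1
XP_037856105.1, NP_000469.3
XP_039330133.1"""
print(extract_accessions(pasted))
```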

In [5]:
# you can do it by hand if you already have all the sequences
seq_ids = ['XP_039330133.1', 'XP_037856105.1', 'XP_037856035.1', 'XP_037856000.1', 'XP_037855971.1', 'XP_037855931.1', 
           'XP_037855886.1', 'XP_037855836.1', 'XP_037855813.1', 'XP_037586579.1', 'XP_037586562.1', 'XP_035162547.1', 
           'XP_035162546.1', 'XP_033083840.1', 'XP_033083839.1', 'XP_033083838.1', 'XP_033083837.1', 'XP_032615100.1', 
           'XP_032615099.1', 'XP_032156584.1', 'XP_032156583.1', 'XP_032156582.1', 'XP_032156581.1', 'XP_032156580.1', 
           'XP_032156579.1', 'XP_032156578.1', 'XP_032156577.1', 'XP_032156576.1', 'XP_031521026.1', 'XP_031521023.1', 
           'XP_031521022.1', 'XP_031521018.1', 'XP_030661091.1', 'XP_028701740.1', 'NP_001356734.1', 'NP_001356733.1', 
           'NP_001356732.1', 'XP_025210047.1', 'XP_025210038.1', 'XP_025210030.1', 'XP_025210024.1', 'XP_002811389.2', 
           'XP_012657049.2', 'XP_023365504.1', 'XP_023075500.1', 'XP_023075499.1', 'XP_023075498.1', 'XP_023075497.1', 
           'XP_017829425.1', 'XP_017742028.1', 'XP_017742025.1', 'XP_017742024.1', 'XP_017742023.1', 'XP_017362337.1', 
           'XP_017362322.1', 'XP_017362309.1', 'XP_017362307.1', 'XP_016856392.1', 'XP_016811347.1', 'XP_016811339.1', 
           'XP_016811335.1', 'XP_016811328.1', 'XP_016811317.1', 'XP_016811306.1', 'XP_015296763.1', 'XP_014985703.1', 
           'XP_014985698.1', 'XP_014985689.1', 'XP_014985684.1', 'XP_014199352.1', 'XP_012657050.1', 'XP_012624163.1', 
           'XP_012624161.1', 'XP_012624159.1', 'XP_012624158.1', 'XP_012506390.1', 'XP_012506389.1', 'XP_012506388.1', 
           'XP_012506387.1', 'XP_012506386.1', 'XP_012506384.1', 'XP_012506383.1', 'XP_012295598.1', 'XP_012295596.1', 
           'XP_012295595.1', 'XP_012295594.1', 'XP_012295593.1', 'XP_011741980.1', 'XP_011741979.1', 'XP_011741978.1', 
           'XP_011935677.1', 'XP_011935676.1', 'XP_011935675.1', 'XP_011935674.1', 'XP_011935673.1', 'XP_011935672.1', 
           'XP_011935671.1', 'XP_011845537.1', 'XP_011789365.1', 'XP_011789364.1', 'XP_011789363.1', 'XP_011789361.1', 
           'XP_010343782.1', 'XP_010362644.1', 'XP_010362643.1', 'XP_009199482.1', 'XP_008964792.1', 'XP_008964791.1', 
           'XP_008998772.1', 'XP_008998771.1', 'XP_008998770.1', 'XP_008998769.1', 'XP_008998768.1', 'XP_008998767.1', 
           'XP_008060915.1', 'XP_008060914.1', 'XP_008060913.1', 'XP_007978339.1', 'XP_005544583.1', 'XP_005544582.1', 
           'XP_004024893.1', 'XP_004024892.1', 'XP_003934954.1', 'XP_003934953.1', 'XP_003891323.1', 'XP_003813998.1', 
           'XP_003813997.1', 'XP_003813996.1', 'NP_001253798.1', 'XP_003271616.1', 'NP_001171963.1', 'NP_001170991.1', 
           'NP_001120973.2', 'NP_000469.3']
print('{0} orthologs: {1} primate sequences\n'.format(gene, len(seq_ids)))
print(seq_ids)

ALPL orthologs: 134 primate sequences

['XP_039330133.1', 'XP_037856105.1', 'XP_037856035.1', 'XP_037856000.1', 'XP_037855971.1', 'XP_037855931.1', 'XP_037855886.1', 'XP_037855836.1', 'XP_037855813.1', 'XP_037586579.1', 'XP_037586562.1', 'XP_035162547.1', 'XP_035162546.1', 'XP_033083840.1', 'XP_033083839.1', 'XP_033083838.1', 'XP_033083837.1', 'XP_032615100.1', 'XP_032615099.1', 'XP_032156584.1', 'XP_032156583.1', 'XP_032156582.1', 'XP_032156581.1', 'XP_032156580.1', 'XP_032156579.1', 'XP_032156578.1', 'XP_032156577.1', 'XP_032156576.1', 'XP_031521026.1', 'XP_031521023.1', 'XP_031521022.1', 'XP_031521018.1', 'XP_030661091.1', 'XP_028701740.1', 'NP_001356734.1', 'NP_001356733.1', 'NP_001356732.1', 'XP_025210047.1', 'XP_025210038.1', 'XP_025210030.1', 'XP_025210024.1', 'XP_002811389.2', 'XP_012657049.2', 'XP_023365504.1', 'XP_023075500.1', 'XP_023075499.1', 'XP_023075498.1', 'XP_023075497.1', 'XP_017829425.1', 'XP_017742028.1', 'XP_017742025.1', 'XP_017742024.1', 'XP_017742023.1', 'XP_01

* Copy the output to the cell below

* Run the next cell.  It pulls the sequences from NCBI and grabs any unique human sequences first.  Then it flags any redundant sequences and drops those.  For ALPL, 57 of the 134 sequences are unique, so 77 are dropped.

In [11]:
# Fetch every record once, so each sequence is downloaded and counted only once
print('{0} orthologs: all primate sequences\n'.format(gene))
all_sequences = []
seq_records = []
inc = 0
exc = 0
with Entrez.efetch(
    db="protein", rettype="gb", retmode="text", id=seq_ids,
) as handle:
    records = list(SeqIO.parse(handle, "gb"))
# 1. Collect unique human sequences
# 2. Collect other unique sequences
human = [r for r in records if get_species(r) == 'Homo sapiens']
other = [r for r in records if get_species(r) != 'Homo sapiens']
for seq_record in human + other:
    seq = seq_record.seq
    if seq in all_sequences:
        print("\t{0}\t({1} aa)\t{2}\t*** the same as sequence {3} (excluded) ***".format(seq_record.id, len(seq), seq_record.description, all_sequences.index(seq)))
        exc += 1
    else:
        all_sequences.append(seq)
        print("{3}:\t{0}\t({1} aa)\t{2}".format(seq_record.id, len(seq), seq_record.description, inc))
        seq_records.append(seq_record)
        inc += 1
print('\n\tTotal:\t', inc, ' unique sequences (', exc, ' excluded)', sep='')

ALPL orthologs: all primate sequences

0:	NP_001356734.1	(524 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform 1 precursor preproprotein [Homo sapiens]
	NP_001356733.1	(524 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform 1 preproprotein [Homo sapiens]	*** the same as sequence 0 (excluded) ***
	NP_001356732.1	(524 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform 1 preproprotein [Homo sapiens]	*** the same as sequence 0 (excluded) ***
1:	XP_016856392.1	(472 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform X2 [Homo sapiens]
2:	NP_001170991.1	(447 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform 3 [Homo sapiens]
3:	NP_001120973.2	(469 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform 2 [Homo sapiens]
	NP_000469.3	(524 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform 1 preproprotein [Homo sapiens]	*** the same as sequence 0 (excluded) ***
4:	XP_039330133.1	(524 aa)	alkaline phosphatase, tissue-nonspecific is
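
The human-first logic can be illustrated without a network connection. The sketch below uses made-up `(id, species, sequence)` tuples in place of NCBI records, and `unique_human_first` is a hypothetical name. Human sequences are registered first so that a human record is never discarded in favour of an identical non-human one.

```python
def unique_human_first(records):
    """Keep one record per distinct sequence, giving human records priority.

    records: list of (seq_id, species, sequence) tuples.
    Returns (kept, dropped) lists of sequence IDs.
    """
    human = [r for r in records if r[1] == "Homo sapiens"]
    other = [r for r in records if r[1] != "Homo sapiens"]
    seen = set()
    kept, dropped = [], []
    for seq_id, _species, sequence in human + other:
        if sequence in seen:
            dropped.append(seq_id)
        else:
            seen.add(sequence)
            kept.append(seq_id)
    return kept, dropped

# Toy data: invented IDs and sequences, for illustration only.
toy = [
    ("chimp_1", "Pan troglodytes", "MISPFL"),
    ("human_1", "Homo sapiens", "MISPFL"),
    ("human_2", "Homo sapiens", "MISPFL"),
    ("macaque_1", "Macaca mulatta", "MISAFL"),
]
kept, dropped = unique_human_first(toy)
print(kept)
print(dropped)
```

Every input record ends up in exactly one of the two lists, so `len(kept) + len(dropped)` equals the number of records fetched.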

* Copy the output to the cell below

* The following cell writes the sequences to the file `ALPL.fasta` (fasta format).

In [12]:
# write the sequences to a fasta file
print('{0} orthologs: unique primate sequences\n'.format(gene))
with open("fasta/{0}.fasta".format(gene), "w") as output:
    for i, seq_record in enumerate(seq_records):
        # use this record's own sequence length
        print("{3}:\t{0}\t({1} aa)\t{2}".format(seq_record.id, len(seq_record.seq), seq_record.description, i))
        SeqIO.write(seq_record, output, "fasta")
print("\n{0}.fasta saved!".format(gene))

ALPL orthologs: unique primate sequences

0:	NP_001356734.1	(524 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform 1 precursor preproprotein [Homo sapiens]
1:	XP_016856392.1	(472 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform X2 [Homo sapiens]
2:	NP_001170991.1	(447 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform 3 [Homo sapiens]
3:	NP_001120973.2	(469 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform 2 [Homo sapiens]
4:	XP_039330133.1	(524 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform X1 [Saimiri boliviensis boliviensis]
5:	XP_037856105.1	(524 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform X4 [Chlorocebus sabaeus]
6:	XP_037856035.1	(524 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform X1 [Chlorocebus sabaeus]
7:	XP_037856000.1	(524 aa)	alkaline phosphatase, tissue-nonspecific isozyme isoform X3 [Chlorocebus sabaeus]
8:	XP_037855971.1	(524 aa)	alkaline phosphatase, tissue-nonspecific isozyme isofo
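
One way to check the saved file is to read it back and list each record's ID and length. This is a minimal sketch using only the standard library. `read_fasta_lengths` is a hypothetical helper, and the demo writes a tiny made-up file to a temporary folder rather than touching `fasta/ALPL.fasta`.

```python
import os
import tempfile

def read_fasta_lengths(path):
    """Return a list of (record_id, sequence_length) from a FASTA file."""
    lengths = []
    current_id, current_len = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if current_id is not None:
                    lengths.append((current_id, current_len))
                header = line[1:].split()
                current_id, current_len = (header[0] if header else ""), 0
            elif line and current_id is not None:
                current_len += len(line)
    if current_id is not None:
        lengths.append((current_id, current_len))
    return lengths

# Demo with invented records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta")
    with open(demo_path, "w") as out:
        out.write(">seq_a first demo record\nMISPF\nLVL\n>seq_b\nMPW\n")
    demo_lengths = read_fasta_lengths(demo_path)
print(demo_lengths)
```

For the real file, `read_fasta_lengths("fasta/ALPL.fasta")` should report the same lengths as the cell that collected the sequences.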

* The following cell aligns the sequences and writes them to the file `ALPL.aln` (also in fasta format).  To look at the alignment open it in [this viewer application](https://alignmentviewer.org/).  (Hannah: you may not be able to run this step.  Just skip it and repeat the sequence processing above for the other genes.)

In [13]:
# align sequences
print('{0} orthologs\n'.format(gene))
muscle_cline = MuscleCommandline(input='fasta/{0}.fasta'.format(gene))
stdout, stderr = muscle_cline()
align = AlignIO.read(StringIO(stdout), 'fasta')
AlignIO.write(align, 'aln/{0}.aln'.format(gene), 'fasta')
print(align)
print("\n{0}.aln saved!".format(gene))

ALPL orthologs

Alignment with 57 rows and 598 columns
-----------------MISPFLVLAIGTCLTNSLVPEKEKDPK...--- XP_011935671.1
-----------------MISAFLILAIGTCLTNSFVPEKEKDPE...ILF XP_012657049.2
-----------------------------------------MPW...ILF XP_012657050.1
-----------------MISAFLVLAIGTCLANSLVPEKEKDPK...SLF XP_012624161.1
--------------------------------------------...SLF XP_012506389.1
-----------------MISAFLVLAIGTCLTNSLVPEKEKDPK...SLF XP_012506388.1
-----------------------------------------MPW...SLF XP_012506390.1
-----------------------------------------MPW...SLF XP_012624163.1
-----------------MISPFLVLVIGTCLTHCLVPEKEKDPK...MLF XP_008060913.1
--------------------------------------------...MLF XP_008060915.1
--------------------------------------------...MLF XP_008060914.1
-----------------MISPFLVLAIGTCLTNSLVPGMLGDTG...ILF XP_011845537.1
-----------------MISPFLVLAIGTCLTNSLVPEKEKDPK...VLF XP_012295596.1
-----------------MISPFLVLAIGTCLTNSLVPEKEKDPK...ILF XP_035162547.1
-----------------MISP
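
Because the `.aln` file is in fasta format, every row of a valid alignment has the same length (598 columns for the 57 ALPL sequences above). A quick sanity check can be sketched with the standard library only; `alignment_shape` is a hypothetical helper and the demo alignment is invented.

```python
def alignment_shape(fasta_text):
    """Return (rows, columns) for an aligned FASTA string.

    Raises ValueError if the rows differ in length.
    """
    rows = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            rows.append("")          # start a new aligned row
        elif line and rows:
            rows[-1] += line         # sequence may span several lines
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError("rows have different lengths: {0}".format(sorted(widths)))
    return len(rows), (widths.pop() if widths else 0)

demo = ">a\n--MISPF\nLV\n>b\nMPMISAFLI\n"
print(alignment_shape(demo))
```

For a real file, pass the file contents, for example `alignment_shape(open('aln/ALPL.aln').read())`.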

* Repeat for other genes.

In [15]:
!open .