# Analysis outline

The analysis has the following components:

1. [orthologs.ipynb](orthologs.ipynb) collect primate orthologs of each human gene from NCBI, remove redundant sequences, save them to a `*.fasta` file.  

1. [align.ipynb](align.ipynb) align orthologous sequences in each `*.fasta` file and save it to a `*.aln` file.

1. [pathogen.ipynb](pathogen.ipynb) extract pathogenic variants from ClinVar data. 

1. [clean_up.ipynb](clean_up.ipynb) establish connections between transcripts in ClinVar data and proteins in ortholog collections.

1. [find_CPDs.ipynb](find_CPDs.ipynb) find putative CPDs.

# Set up

Run the following cell.  It loads some required functions.

In [14]:
run kondrashov

Run the following two cell.  The first displays the path to the current folder.  The second displays the contents of that folder.  (Note: your output will be different from mine.)

In [2]:
pwd

'/Users/arnav/hhh Dropbox/arnav mehta/Mac/Documents/GitHub/kondrashov'

In [3]:
ls

ACE2 ortholog-Copy1.ipynb  [34mfasta[m[m/
ACE2 ortholog.ipynb        find_CPDs.ipynb
README.md                  kondrashov.py
TP53 ortholog.ipynb        orthologs-Copy1.ipynb
align.ipynb                orthologs.ipynb
[34maln[m[m/                       pathogen.ipynb
clean_up.ipynb


If there is no folder called `fasta`, create it by running the following command.

In [4]:
mkdir fasta

mkdir: fasta: File exists


If there is no folder called `aln`, create it by running the following command.

In [5]:
mkdir aln

mkdir: aln: File exists


Rerun the `ls` command above to confirm that you have created those folders successfully. 

The following defines a function that gets the species name from a NCBI sequence record.

# Get protein sequences for primate orthologs of human genes

**Note: this is a new version of the `orthologs.ipynb` notebook that prevents unique human sequences from being dropped.**

TP53


## TP53

In [6]:
# enter gene name here
gene = 'TP53'

* In the [NCBI query page](https://www.ncbi.nlm.nih.gov/labs/gquery/) search for `Homo sapiens ALPL`.  

* When the search results come up ([this is the page for ALPL](https://www.ncbi.nlm.nih.gov/labs/gquery/all/?term=Homo+sapiens+ALPL)) click on `Orthologs`.  Here's what the output looks like for [ALPL](https://www.ncbi.nlm.nih.gov/gene/249/ortholog/?scope=9443&term=ALPL).

* Use the Taxonomy Tree to select **mammals > placentals > primates**.  For ALPL, this reduces the number of orthologs from 335 to 29.

* In the list of sequences click to select all (box labelled `0 selected`).

* Click on `Add to cart`.  The items appear in a box with a shopping cart symbol.  Click on that box.

* A popup window appears.  Click on `Protein alignment`.  Choose `all sequences per gene`.  Click `Align`.

* Select all the accession numbers in the box (XP_039330133.1, XP_037856105.1, etc) and copy them.

* Run the following cell (it pulls the names from the clipboard and puts them on a list).

In [22]:
# copy IDs from NCBI ortholog protein alignment dialog (all sequences per gene)
tmp = pd.read_clipboard(sep= '/n', names=['id'])
seq_ids = tmp['id'].tolist()
print('{0} orthologs: {1} primate sequences (retrieved on {2})\n'.format(gene, len(seq_ids), date.today()))
print(seq_ids)

TP53 orthologs: 82 primate sequences (retrieved on 2022-07-21)

['NP_001394193.1', 'NP_001394192.1', 'NP_001394197.1', 'NP_001394198.1', 'NP_001394200.1', 'NP_001394199.1', 'NP_001394191.1', 'NP_001394196.1', 'NP_001394194.1', 'NP_001394195.1', 'XP_045231359.1', 'XP_045231358.1', 'XP_039331970.1', 'XP_037843787.1', 'XP_037843786.1', 'XP_008008383.2', 'XP_035155777.1', 'XP_003810114.2', 'XP_033040609.1', 'XP_033040608.1', 'XP_032136324.1', 'XP_032136323.1', 'XP_032136322.1', 'XP_032136321.1', 'XP_031995228.1', 'XP_004058559.2', 'XP_018868682.2', 'XP_018868681.2', 'XP_030656345.1', 'XP_028691546.1', 'XP_025219304.1', 'XP_025219303.1', 'XP_025219302.1', 'XP_025219301.1', 'XP_025219300.1', 'XP_016786959.2', 'XP_002827020.2', 'XP_023049083.1', 'XP_023049082.1', 'XP_021530379.1', 'XP_017706034.1', 'XP_017706033.1', 'XP_017387084.1', 'XP_017387083.1', 'XP_012661495.1', 'XP_012631521.1', 'XP_012631512.1', 'XP_012511070.1', 'XP_011725915.1', 'XP_011813950.1', 'XP_011813949.1', 'XP_011813948.1',

In [24]:
# you can do it by hand if you already have all the sequences
seq_ids = ['NP_001394193.1', 'NP_001394192.1', 'NP_001394197.1', 'NP_001394198.1', 'NP_001394200.1', 'NP_001394199.1', 'NP_001394191.1', 'NP_001394196.1', 'NP_001394194.1', 'NP_001394195.1', 'XP_045231359.1', 'XP_045231358.1', 'XP_039331970.1', 'XP_037843787.1', 'XP_037843786.1', 'XP_008008383.2', 'XP_035155777.1', 'XP_003810114.2', 'XP_033040609.1', 'XP_033040608.1', 'XP_032136324.1', 'XP_032136323.1', 'XP_032136322.1', 'XP_032136321.1', 'XP_031995228.1', 'XP_004058559.2', 'XP_018868682.2', 'XP_018868681.2', 'XP_030656345.1', 'XP_028691546.1', 'XP_025219304.1', 'XP_025219303.1', 'XP_025219302.1', 'XP_025219301.1', 'XP_025219300.1', 'XP_016786959.2', 'XP_002827020.2', 'XP_023049083.1', 'XP_023049082.1', 'XP_021530379.1', 'XP_017706034.1', 'XP_017706033.1', 'XP_017387084.1', 'XP_017387083.1', 'XP_012661495.1', 'XP_012631521.1', 'XP_012631512.1', 'XP_012511070.1', 'XP_011725915.1', 'XP_011813950.1', 'XP_011813949.1', 'XP_011813948.1', 'XP_011907826.1', 'XP_011907813.1', 'XP_011849932.1', 'XP_011849931.1', 'XP_011849930.1', 'XP_010360689.1', 'XP_008060532.1', 'XP_008008385.1', 'NP_001274608.1', 'XP_005582844.1', 'NP_001263628.1', 'NP_001263627.1', 'NP_001263626.1', 'NP_001263625.1', 'NP_001263624.1', 'NP_001263690.1', 'NP_001263689.1', 'XP_003929235.1', 'XP_003912321.1', 'NP_001119590.1', 'XP_001172077.2', 'XP_002747994.1', 'NP_001119589.1', 'NP_001119588.1', 'NP_001119587.1', 'NP_001119586.1', 'NP_001119585.1', 'NP_001119584.1', 'NP_000537.3', 'NP_001040616.1']
print('{0} orthologs: {1} primate sequences\n'.format(gene, len(seq_ids)))
print(seq_ids)

TP53 orthologs: 82 primate sequences

['NP_001394193.1', 'NP_001394192.1', 'NP_001394197.1', 'NP_001394198.1', 'NP_001394200.1', 'NP_001394199.1', 'NP_001394191.1', 'NP_001394196.1', 'NP_001394194.1', 'NP_001394195.1', 'XP_045231359.1', 'XP_045231358.1', 'XP_039331970.1', 'XP_037843787.1', 'XP_037843786.1', 'XP_008008383.2', 'XP_035155777.1', 'XP_003810114.2', 'XP_033040609.1', 'XP_033040608.1', 'XP_032136324.1', 'XP_032136323.1', 'XP_032136322.1', 'XP_032136321.1', 'XP_031995228.1', 'XP_004058559.2', 'XP_018868682.2', 'XP_018868681.2', 'XP_030656345.1', 'XP_028691546.1', 'XP_025219304.1', 'XP_025219303.1', 'XP_025219302.1', 'XP_025219301.1', 'XP_025219300.1', 'XP_016786959.2', 'XP_002827020.2', 'XP_023049083.1', 'XP_023049082.1', 'XP_021530379.1', 'XP_017706034.1', 'XP_017706033.1', 'XP_017387084.1', 'XP_017387083.1', 'XP_012661495.1', 'XP_012631521.1', 'XP_012631512.1', 'XP_012511070.1', 'XP_011725915.1', 'XP_011813950.1', 'XP_011813949.1', 'XP_011813948.1', 'XP_011907826.1', 'XP_011

* Run the next cell.  It pulls the sequences from NCBI and grabs any unique human sequences.  Then it flags any redundant sequences and drops those.  For ALPL it drops 84 sequences and we end up with 57.

In [28]:
# 1. Collect unique human sequences
print('{0} orthologs: all primate sequences\n'.format(gene))
all_sequences = []
seq_records = []
inc = 0
exc = 0
with Entrez.efetch(
    db="protein", rettype="gb", retmode="text", id=seq_ids,
) as handle:
    for seq_record in SeqIO.parse(handle, "gb"):
        sp = get_species(seq_record)
        if sp == 'Homo sapiens':
            seq = seq_record.seq
            if seq in all_sequences:
                print("\t{0}\t({1} aa)\t{2}\t*** the same as sequence {3} (excluded) ***".format(seq_record.id, len(seq), seq_record.description, all_sequences.index(seq)))
                exc += 1
            else:
                all_sequences.append(seq)
                print("{3}:\t{0}\t({1} aa)\t{2}".format(seq_record.id, len(seq), seq_record.description, inc))
                seq_records.append(seq_record)
                inc += 1
# # 2. Collect other unique sequences
with Entrez.efetch(
    db="protein", rettype="gb", retmode="text", id=seq_ids,
) as handle:
    for seq_record in SeqIO.parse(handle, "gb"):
        seq = seq_record.seq
        if seq in all_sequences:
            print("\t{0}\t({1} aa)\t{2}\t*** the same as sequence {3} (excluded) ***".format(seq_record.id, len(seq), seq_record.description, all_sequences.index(seq)))
            exc += 1
        else:
            all_sequences.append(seq)
            print("{3}:\t{0}\t({1} aa)\t{2}".format(seq_record.id, len(seq), seq_record.description, inc))
            seq_records.append(seq_record)
            inc += 1
print('\n\tTotal:\t', inc, ' unique sequences (', exc, ' excluded)', sep='')

TP53 orthologs: all primate sequences

0:	NP_001394193.1	(393 aa)	cellular tumor antigen p53 isoform a [Homo sapiens]
1:	NP_001394192.1	(354 aa)	cellular tumor antigen p53 isoform g [Homo sapiens]
2:	NP_001394197.1	(341 aa)	cellular tumor antigen p53 isoform b [Homo sapiens]
3:	NP_001394198.1	(302 aa)	cellular tumor antigen p53 isoform i [Homo sapiens]
	NP_001394200.1	(302 aa)	cellular tumor antigen p53 isoform i [Homo sapiens]	*** the same as sequence 3 (excluded) ***
	NP_001394199.1	(341 aa)	cellular tumor antigen p53 isoform b [Homo sapiens]	*** the same as sequence 2 (excluded) ***
	NP_001394191.1	(393 aa)	cellular tumor antigen p53 isoform a [Homo sapiens]	*** the same as sequence 0 (excluded) ***
	NP_001394196.1	(354 aa)	cellular tumor antigen p53 isoform g [Homo sapiens]	*** the same as sequence 1 (excluded) ***
	NP_001394194.1	(354 aa)	cellular tumor antigen p53 isoform g [Homo sapiens]	*** the same as sequence 1 (excluded) ***
	NP_001394195.1	(393 aa)	cellular tumor antigen p5

	NP_001274608.1	(393 aa)	cellular tumor antigen p53 [Macaca fascicularis]	*** the same as sequence 12 (excluded) ***
	XP_005582844.1	(393 aa)	cellular tumor antigen p53 isoform X2 [Macaca fascicularis]	*** the same as sequence 12 (excluded) ***
	NP_001263628.1	(187 aa)	cellular tumor antigen p53 isoform l [Homo sapiens]	*** the same as sequence 4 (excluded) ***
	NP_001263627.1	(182 aa)	cellular tumor antigen p53 isoform k [Homo sapiens]	*** the same as sequence 5 (excluded) ***
	NP_001263626.1	(234 aa)	cellular tumor antigen p53 isoform j [Homo sapiens]	*** the same as sequence 6 (excluded) ***
	NP_001263625.1	(302 aa)	cellular tumor antigen p53 isoform i [Homo sapiens]	*** the same as sequence 3 (excluded) ***
	NP_001263624.1	(307 aa)	cellular tumor antigen p53 isoform h [Homo sapiens]	*** the same as sequence 7 (excluded) ***
	NP_001263690.1	(354 aa)	cellular tumor antigen p53 isoform g [Homo sapiens]	*** the same as sequence 1 (excluded) ***
	NP_001263689.1	(354 aa)	cellular tumor a

In [27]:
seq_records

[SeqRecord(seq=Seq('MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWF...DSD'), id='NP_001394193.1', name='NP_001394193', description='cellular tumor antigen p53 isoform a [Homo sapiens]', dbxrefs=[]),
 SeqRecord(seq=Seq('MDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPL...DSD'), id='NP_001394192.1', name='NP_001394192', description='cellular tumor antigen p53 isoform g [Homo sapiens]', dbxrefs=[]),
 SeqRecord(seq=Seq('MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWF...ENC'), id='NP_001394197.1', name='NP_001394197', description='cellular tumor antigen p53 isoform b [Homo sapiens]', dbxrefs=[]),
 SeqRecord(seq=Seq('MDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPL...ENC'), id='NP_001394198.1', name='NP_001394198', description='cellular tumor antigen p53 isoform i [Homo sapiens]', dbxrefs=[]),
 SeqRecord(seq=Seq('MAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR...NSS'), id='NP_001263628.1', name='NP_001263628', description='cellular tumor antigen p53 isoform l 

* Copy the output to the cell below

* The following cell writes the sequences to the file `ALPL.fasta` (fasta format).

In [29]:
# write the sequences to a fasta file
print('{0} orthologs: unique primate sequences\n'.format(gene))
i = 0
with open("fasta/{0}.fasta".format(gene), "w") as output:
    for seq_record in seq_records:
        print("{3}:\t{0}\t({1} aa)\t{2}".format(seq_record.id, len(seq), seq_record.description, i))
        SeqIO.write(seq_record, output, "fasta")
        i += 1
print("\n{0}.fasta saved!".format(gene))

TP53 orthologs: unique primate sequences

0:	NP_001394193.1	(393 aa)	cellular tumor antigen p53 isoform a [Homo sapiens]
1:	NP_001394192.1	(393 aa)	cellular tumor antigen p53 isoform g [Homo sapiens]
2:	NP_001394197.1	(393 aa)	cellular tumor antigen p53 isoform b [Homo sapiens]
3:	NP_001394198.1	(393 aa)	cellular tumor antigen p53 isoform i [Homo sapiens]
4:	NP_001263628.1	(393 aa)	cellular tumor antigen p53 isoform l [Homo sapiens]
5:	NP_001263627.1	(393 aa)	cellular tumor antigen p53 isoform k [Homo sapiens]
6:	NP_001263626.1	(393 aa)	cellular tumor antigen p53 isoform j [Homo sapiens]
7:	NP_001263624.1	(393 aa)	cellular tumor antigen p53 isoform h [Homo sapiens]
8:	NP_001119589.1	(393 aa)	cellular tumor antigen p53 isoform f [Homo sapiens]
9:	NP_001119588.1	(393 aa)	cellular tumor antigen p53 isoform e [Homo sapiens]
10:	NP_001119587.1	(393 aa)	cellular tumor antigen p53 isoform d [Homo sapiens]
11:	NP_001119585.1	(393 aa)	cellular tumor antigen p53 isoform c [Homo sapiens]
12:	XP_0

* The following cell aligns the sequences and writes them to the file `ALPL.aln` (also in fasta format).  To look at the alignment open it in [this viewer application](https://alignmentviewer.org/).  (Hannah: you may not be able to run this step.  Just skip it and repeat the sequence processing above for the other genes.)

In [30]:
# align sequences
print('{0} orthologs\n'.format(gene))
muscle_cline = MuscleCommandline(input='fasta/{0}.fasta'.format(gene))
stdout, stderr = muscle_cline()
align = AlignIO.read(StringIO(stdout), 'fasta')
AlignIO.write(align, 'aln/{0}.aln'.format(gene), 'fasta')
print(align)
print("\n{0}.aln saved!".format(gene))

TP53 orthologs



ApplicationError: Non-zero return code 127 from 'muscle -in fasta/TP53.fasta', message '/bin/sh: muscle: command not found'

* Repeat for other genes.

In [31]:
!open .