GitHub - yafeng/SpliceVista: Splice variant analysis in shotgun proteomics data

yafeng / SpliceVista Public

Notifications You must be signed in to change notification settings
Fork 3
Star 8

Splice variant analysis in shotgun proteomics data

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
ECgene2HUGO.txt		ECgene2HUGO.txt
ENSP_ENSG_Gene.txt		ENSP_ENSG_Gene.txt
ENSP_Gene.txt		ENSP_Gene.txt
IDmap_file_example.txt		IDmap_file_example.txt
IPIaccesion_gene.txt		IPIaccesion_gene.txt
Noxchi_Arial.ttf		Noxchi_Arial.ttf
Readme.txt		Readme.txt
SpliceVista.py		SpliceVista.py
Uniprotid_Gene.txt		Uniprotid_Gene.txt
clusterpeptide.py		clusterpeptide.py
converter.py		converter.py
download.py		download.py
heavy_testfile.txt		heavy_testfile.txt
map_peptide2genome.py		map_peptide2genome.py
mapping_ECgene.py		mapping_ECgene.py
mapping_EVDB.py		mapping_EVDB.py
mergepsm.py		mergepsm.py
splicingvar.txt		splicingvar.txt
subexon.txt		subexon.txt
varseq.fa		varseq.fa
visualization-ECgene.py		visualization-ECgene.py
visualization.py		visualization.py

Repository files navigation

SpliceVista manual

Developed by Yafeng Zhu. Email: yafeng.zhu@scilifelab.se

OVERVIEW
SpliceVista is a tool for identification and visualization of splice variants based on shotgun proteomics data.

SpliceVista google discussion group: (preferably post your questions at the forum so that other people can see)
https://groups.google.com/forum/#!forum/splicevista-discussion-group

SpliceVista was written in Python 2.7.2. It contains six python scripts: converter.py, mergepsm.py, download.py, clusterpeptide.py, mapping.py and visualization.py.

HOW TO INSTALL
1. prerequisite packages.The following python packages need to be installed for SpliceVista to work:

Prerequisite packages: Biopython, numpy, scipy, pillow

Windows
download Winpython2.7 here http://sourceforge.net/projects/winpython/files/
numpy, scipy, pillow are included in Winpython.
next is to install biopython.Download it here http://www.lfd.uci.edu/~gohlke/pythonlibs/
choose one of them based on if your symstem is 32 or 64bit
biopython‑1.65‑cp27‑none‑win32.whl (32bit)
biopython‑1.65‑cp27‑none‑win_amd64.whl (64bit)

open winpython and use downloaed .whl file to install biopython.

2. Download SpliceVista using the following command if you use UNIX or LINUX system: (you might need to install git first, type the command in a terminal: apt-get install git-core)
git clone https://github.com/yafeng/SpliceVista

For windows users, go to website https://github.com/yafeng/SpliceVista and click Download ZIP at right side panel.
Note: After you have download SpliceVista, DO NOT move or rename original and output files generated by SpliceVista.

3. Open a terminal and cd to the SpliceVista folder. No need to make compile, you can start to use it.

HOW TO USE:
The manual consists of two walk through examples.
EXAMPLE I: known splice variant study

EXAMPLE II: novel splice variants study

EAMPLE III: Exon/Variant centric analysis

EAMPLE IV: Map peptides to genome and Return an GFF file

EXAMPLE I)For known splice variant study, you should still search your raw MS data as usual in conventional protein databases, such as Ensembl or Uniprot protein database.
With the results, you can do conventional proteomics analysis while still have the possibility of using SpliceVista to identify and quantify known splice variants.

EXAMPLE II)For novel splice variant study, you should search your raw MS data in ECgene protein database which contains predicted splice variants.
I have created a fasta format of ECgene protein database, you can download it from this link.
https://www.dropbox.com/sh/tgid20tv6h31fqj/AAA6TZ8OJzlNbQRkEf7xnWxia?dl=0

EXAMPLE I - Investigation of known splice variants in your data

Step1: prepare the input file for SpliceVista
First, you need to extract PSM data from the output file of database searching and format them in a tab-delimited text file like this:
Sequence Accession ratio_s1 ratio_s2 ratio_s3 ratio_s4
qDmPNAmPVSELTDk ENSP00000000233 1.046 0.954 0.835 0.984

(you need to calculate peptide ratio (relative abundance in each sample) before you run next step, the header is not necessary to be same as here)
Tip: arrange the needed columns in EXCEL, and then save it in a tab-delimited text file.
It is OK to have multiple IDs in the Protein accession column, but it has to be separated by semi comma (;) symbol. Peptide sequence can contain letters in lower case.

Here, we will use the heavy_testfile.txt file (a tab-delimited text file as described in step1) to illustrate the workflow.

(If you have already grouped PSMs into peptides, run step 2 and step 3 as usual.)
Step2: insert gene symbol for each PSM- converter.py
Insert gene symbol for each PSM. Each protein accession is assigned a gene symbol which will be used to retrieve its known splice variants in the next step. This is done by converter.py.

Command: Python converter.py --i heavy_testfile.txt --prefix heavy --database ensembl

The first argument --i is the input file, the second is the prefix of output file. Given the prefix in the example, you will get an output file named as heavy_psmdata.txt.
--database specify the database was used to search peptide spectra. Options are “ensembl”, “uniprot”, “IPI”, “ECgene”.

(NOTE: when you start to analyse your own data, this command can only be used if you have HUMAN protein ID in your data.
For non-human dataset, you need to insert gene symbol (in the column between peptide sequence and protein ID) by yourself so that it looks like this:
peptide gene_symbol protein_ID ratio_sample1 ratio_sample2 ratio_sample3 ...)

Then you can continue to step3.

Step3: group PSMs into peptides
For each peptide, calculate the median (default) or the mean of all PSMs' relative abundance in each sample.
Step3 will not make a difference on quantitative data if PSMs are already grouped into peptides)

Command: Python mergepsm.py --prefix heavy --method median
--method options are median, mean.
output: heavy_pepdata.txt

Use the same prefix in previous step to group PSMs into peptides, if mean is used as method,
the ratio of peptide is the mean of all the peptides PSM's relative ratios, standard deviation is calculated.
If median is used as method, the ratio of peptide is the median of all the peptides PSM's relative ratios, median absolute deviation is calculated.
Before clusterpeptide.py is used, all peptides will be assigned to cluster 0.

Step4: Download data from EVDB and GenBank - download.py (You can skip this and download HUMAN splice variant database (EVDB) files directly from the following link: https://www.dropbox.com/sh/q8vo542udkdbdwl/7SYS9HF3O8. After that, copy and replace the three files in SpliceVista directory)
The script in this step retrieves splice variants in EVDB by gene symbol and the translated sequences of these splice variants in GenBank.

Command: Python download.py --prefix heavy --organism 9606 --email your_email_address
Currently, nine species below are available to choose to search splice variant in EVDB.
organism="3702">A. thaliana

organism="9913">B. taurus

organism="6239">C. elegans

organism="7227">D. melanogaster

organism="7955">D. rerio

organism="9606">H. sapiens (default)

organism="10090">M. musculus

organism="39947">O. sativa

organism="10116">R. norvegicus

Use the same prefix in previous step and provide a valid email address to access NCBI GenBank.
Output: splicingvar.txt, subexon.txt, varseq.fa, gene_notfound.txt

You can run step4 command many times, only new splice variants that are not in the files will be downloaded

Note: DO NOT download splice variants for different species under the same directory, or files will be overwritten. SpliceVista Package contains three empty files splicingvar.txt, subexon.txt, varseq.fa that will store downloaded splice variant information. The output files splicingvar.txt and subexon.txt contain exon composition of each variant, both genomic and transcript coordinates. The file varseq.fa will store the amino acid sequences of all splice variants. The gene_notfound.txt file contains gene symbols which are not found in the EVDB database.
Tip: if you have multiple sample files in one project, it is better to put all the files under one directory named by the project and use different sample names for each file. Because there is usually an overlap of identified proteins among different samples, the script download.py first check if the splice variants of identified protein and the variant sequences have been downloaded so that it avoids downloading the data for same proteins multiple times.
This step will take time depending the number of new splice variant to be downloaded from the database (on average, it takes 1 second to retrieve information from EVDB and NCBI for one splice variant). In order not to cause Internet traffic, it is highly recommended to download HUMAN splice variant database files first from the link provided above.

Step5: cluster peptides based on quantitative pattern
This step mimics the PQPQ algorithm but much simplified. It only does the peptide clustering which groups peptides based on their quantitative patterns over samples, each peptide will be assigned a number to indicate which cluster it belongs to. If one protein has only one unique peptide, then this peptide will get a cluster '0' instead.
Command: python clusterpeptide.py --i heavy_pepdata.txt --o heavy_pepcluster.txt --metric correlation --method average --t 0.4

--i argument should be _pepdata.txt file from step3
--o argument is the name of output file. There are some other options to use.
--metric defines the way to calculate the distance, “correlation is the default. Other options such as "cosine", "correlation" are also available. (if cosine is chosen, make sure vectors do not contain value greater than 1)
For all available clustering metric, check on website http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html?highlight=distance.pdist#scipy.spatial.distance.pdist
--method defines the way of calculating the distance between the newly formed clusters. Available options are "single", "complete", "average" and "weighted". "average" is the default.
--t set distance cutoff for clustering, default is 0.4. (for pearson correlation, distance=1-correlation coefficient)

Step6: Map peptides to transcripts
Map peptides to known splice variants in EVDB database-mapping_EVDB.py
The script in this step maps identified peptides from (uniprot, ensemble and IPI database) to splice variants in EVDB database (assembly hg19 for human).
NOTE: if step5 is skipped, the input file of this step is the output of step3, which is heavy_pepdata.txt. If step5 is not skipped, the input file of this step is the output of step5, which is heavy_pepcluster.txt.

Command: Python mapping_EVDB.py heavy_pepdata.txt heavy (if step5 is skipped)
Command: Python mapping_EVDB.py heavy_pepcluster.txt heavy (if step5 is not skipped)

The first argument is input file and the second is the prefix of output file
Output: heavy_mappingout.txt, heavy_genestatistic.txt (The content of these two output files can be seen in supplementary file 1 table S1 and S2.)
If you encounter “KeyError” (meaning one splice variant was not sucessfully downloaed in step4), so please run step4 again.

The file mappingout.txt is peptide based format in which each row is one unique peptide. This file will be used for visualization. The file genestatistic.txt is gene based format in which each row is one gene.
Very few peptides in mapping output might have no coordinates reported. Since peptide sequence are mapped to splice transcripts sequences extracted in Genbank, there might be very few peptides identified (depending on the reference database used in the searching engine) that are not found in Genbank. And peptides with unknown letters (X) in the sequence will not be given any coordinate.

Step7: Visualize splice variants, peptides, peptide quantitative patterns
Use visualization.py when peptides are mapped to EVDB splice variants
Visualize the gene of interest. Given a gene symbol, it will generate a high quality picture which contains the exon composition of known splice variants in EVDB, transcriptional positions of identified peptides and quantitative patterns of peptides in each cluster.

Command: Python visualization.py --prefix heavy --gene ARF5 --f png
—-f is to set image output format, default is png, other formats such as tiff, emf, jpg can be used depending on whether you installed corresponding library.
Output:ARF5_pattern_heavy.png

Always specify prefix first (indicating in which sample file you want to see the identified peptides and splice variants for this gene) and then the gene name you would like to investigate. Gene symbol should be exactly the same as the one you see in the file genestatistic.txt. Look Fig.2 in the paper for interpretation of the figure.

Now you are finished with the example I.

EXAMPLE II - Investigation of novel splice variants in your data
1)First, you need to search your raw peptide spectra data in ECgene protein databases, arrange the peptide identification output similarly as in step1 at section I.
Say you have the input file name as Exp1_searchdata.txt.
2)Insert gene symbol
Command: Python converter.py --i Exp1_searchdata.txt --prefix Exp1 --database ECgene
3)Group PSMs into peptides
Command: Python mergepsm.py --prefix Exp1
4)Cluster peptides based on quantitative pattern.
Command: python clusterpeptide.py --i Exp1_pepdata.txt --o Exp1_pepcluster.txt
5)Map peptides to splice variants in ECgene database and assign genomic coordinates.
Before running this step, you need to download ECgene protein sequence file (same evidence level used in searching peptide spectra) and gene structure file from ECgene database website. http://genome.ewha.ac.kr/ECgene/.

python mapping_ECgene.py --i Exp1_pepcluster.txt --prefix Exp1 --ECgene hg18_b1_high_gene.txt --ECprotein hg18_b1_high_pep.txt
--i input peptide data file name from step 3 or 5
--prefix prefix name for output file
--ECgene gene structure file from ECgene database
--ECprotein protein sequence file from ECgene database

6) Visualise peptides aligned with ECgene splice variants.
If you want to visualize a peptide along with ECgene splice variants from gene H6C12216 in Exp1, you can use the following command.
Command: python visualization-ECgene.py --sample Exp1 --id H6C12216 --scale 5 --gene-structure-file hg18_b1_high_gene.txt --f png
You should always use the same gene structure file as used in mapping step.
--scale is to control the scaling of intron and exon size.
--f is to set image output format,All images generated have dpi=300 resolution.
Output: H6C12216_pattern_Exp1.png

For any problems related to the program, please contact:
yafeng.zhu@scilifelab.se

EXAMPLE III: Exon/Variant centric analysis
1) Search your peptide spectra data for exmaple in Ensembl protein database.
2) Download gene annotation GTF file from Ensembl database, and put in same folder with Fasta db
3) Download ID map file using Ensembl bioMart tool, three columns are ENSG_id, ENST_id, ENSP_id. See IDmap_file_example.txt
4) run SpliceVista.py

python SpliceVista.py --input input_filename --gtf Homo_sapiens.GRCh37.75.protein_coding.gtf --fasta Homo_sapiens.GRCh37.75.pep.all.fa --IDmap Ensembl75_IDlist.txt --output output_filename

NOTE: First three columns in the input file must be : peptide, ENSP_IDs, ENSG_IDs.

5) In the output, all peptides are assigned to exons and splice variants. Summarize peptide quantitative data to exon/variant level.

EAMPLE IV: Map peptides to genome and Return an GFF file
1) Do the first three steps as in Example III
2) run map_peptide2genome.py

python map_peptide2genome.py --input peptide_table.txt --gtf Homo_sapiens.GRCh37.75.gtf --fasta Homo_sapiens.GRCh37.75.pep.all.fa --IDmap Ensembl75_IDlist.txt --output output.gff3

NOTE: The first two columns must be : peptide, ENSP_ids