Make personal protein databases using next generation sequencing data.


Installation on a Unix/Linux distribution (including OS X) is straightforward and enumerated below. To simplify this further (and to offer a solution that should work on most platforms), we have included a Dockerfile that will set up a Docker container with all necessary dependencies installed.

With Docker

First, install the community edition Docker application for your system.

Next, clone the GitHub repo, which contains the latest Dockerfile for GenPro, and use it to build an image with all dependencies (and GenPro) installed.

git clone
cd GenPro
docker build -t genpro .

There are a few options for running GenPro with the Docker container. The simplest is to log in to the container using the command below. For this to be useful, the container needs to be able to write to the host system. This is done with the -v option, as in -v /your/machine:/docker/container. Set the host directory accordingly; in the example below it is the working directory from which you launch the GenPro container.

docker run -it -v $(pwd):/GenProData genpro bash

Installing on Unix/Linux

The approach is:

  1. Install local::lib, which allows installation of GenPro (and other Perl packages) without using sudo.
  2. Install App::cpanminus, which simplifies and automates the installation of Perl packages.
  3. Clone the GenPro repository.
  4. Install the prebuilt GenPro package in the repository using cpanm.

Install local::lib by downloading the latest tarball and unpacking it. See the bootstrapping section of the documentation.


curl -O
tar xzvf local-lib-2.000019.tar.gz
cd local-lib-2.000019
perl Makefile.PL --bootstrap
make test && make install
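After the bootstrap finishes, local::lib must be activated in each new shell. Per the local::lib bootstrapping documentation, this is done by adding one line to your shell startup file (the snippet assumes bash and the default ~/perl5 install target):

```shell
# Activate local::lib in every new shell (default install target is ~/perl5).
eval "$(perl -I$HOME/perl5/lib/perl5 -Mlocal::lib)"
```

This sets PERL5LIB, PATH, and the MakeMaker/Module::Build environment variables so that cpanm and make install place modules under ~/perl5 without sudo.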

Install App::cpanminus.


curl -L | perl - App::cpanminus

Clone the GenPro repository with git. Install GenPro with cpanm, which will automatically install any needed dependencies.


git clone
cd GenPro
cpanm GenPro.tar.gz

An alternative approach is to clone the repository, unpack the GenPro.tar.gz tarball, and install it manually. Unless local::lib is installed, this approach will require sudo.


git clone
cd GenPro
tar xzvf GenPro.tar.gz
cd GenPro
perl Makefile.PL
make test
make install    


Memory requirements

The two programs that require the most memory are and . For the first, the memory requirement scales with genome size and gene density; for the second, it scales with the size of the reference protein database and the number of samples.

The table below gives representative memory consumption. The WGS samples were obtained from 1000 Genomes phase 1 (hg19) VCF files and converted to GenPro's snp format using vcfToSnp.

program | perl    | memory use | processed
--------|---------|------------|----------
        | v5.16.3 | 18.8 Gb    | hg38, chromosome 1
        | v5.16.3 | 795.7 Mb   | hg38, chromosome 1
        | v5.16.3 | 21.5 Gb    | 50 HapMap WGS, phase 1
        | v5.16.3 | 603 Mb     | HapMap NA06994, phase 1

1. Download genomic data for a particular organism. For example:

 -d hg38 -g hg38

The above command performs a dry-run download of hg38 (the genome and annotated gene coordinates). It relies on rsync being installed, which should be present by default on Unix, Linux, and OS X. In the example, the data will be downloaded into the hg38 directory, which will be created if it does not already exist. The knownGene track and the genome of the organism are downloaded by default. Also by default, the command performs a dry run (i.e., no download); use the --act or -a switch to "act", i.e., actually download the data. Take care when using it, since it will download from a remote server.
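The act/dry-run switch follows a common shell pattern; a minimal sketch of the idea (run_or_echo is a hypothetical helper written for illustration, not part of GenPro):

```shell
#!/usr/bin/env bash
# Sketch of an act/dry-run switch: print the command unless ACT=1.
# run_or_echo is a hypothetical helper for illustration, not part of GenPro.
run_or_echo() {
  if [ "${ACT:-0}" -eq 1 ]; then
    "$@"                        # act: actually run the command
  else
    echo "DRY RUN: $*"          # default: just show what would happen
  fi
}

ACT=0 run_or_echo mkdir -p hg38   # prints the command only
ACT=1 run_or_echo mkdir -p hg38   # creates the directory
```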

2. Generate a binary index of the genome for the organism.

  • Use
  • There are helper scripts in sh/ that work with SGE to build all chromosomes on a cluster. For example:
qsub -v USER -v PATH -cwd -t 1-26 <genome>
  • Alternatively, iterate over all the chromosomes. For example:
for ((i=1;i<27;i++)); do
     -g hg38 -c $i                  \
    --genedir hg38/gene             \
    --geneset knownGene             \
    -f hg38/chr -o hg38/idx
done
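With SGE, the array task index arrives in the SGE_TASK_ID environment variable, and the per-task script must turn that index into a chromosome. A sketch of one way to do the mapping (the assignment of indices 23–26 to X, Y, M, and Un is an assumption for illustration; match it to your build's actual chromosome list):

```shell
#!/usr/bin/env bash
# Map a 1-based SGE task index (1-26) to a UCSC-style chromosome name.
# The handling of indices 23-26 is an assumption for illustration only;
# adjust it to whatever chromosome list your genome build uses.
chrom_label() {
  case "$1" in
    23) echo "chrX" ;;
    24) echo "chrY" ;;
    25) echo "chrM" ;;
    26) echo "chrUn" ;;
     *) echo "chr$1" ;;
  esac
}

chrom_label 1     # chr1
chrom_label 23    # chrX
```

In an array job the call would be chrom_label "$SGE_TASK_ID", with qsub -t 1-26 supplying the indices.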

3. Make reference proteins for the organism.

  • The program uses the indexed binary annotation and creates all proteins for a given chromosome.
  • A helper script automates this for SGE. For example:
qsub -v USER -v PATH -cwd -t 1-26 \
  <genome>              \
  <binary genome index> \
  <output dir>

4. Make the personal proteins for the sample.

  • This is a 2-step process that uses and
  • creates a per-chromosome database of all relevant variants (i.e., nonsense/missense) for each sample in the snp file.
  • creates a finished personal protein database for each sample. It provides two outputs:
    1. A JSON-encoded file that enumerates the variant protein information
    2. A fasta file with all full-length reference and variant proteins, which may be used as input for a proteomics search program.
  • Creating the final db is done on a per-sample basis, but it can take quite a while, depending on how many proteins have multiple variants. It is especially slow for proteins with >10 variants. For example, if a protein has 20 substitutions, there are 20! permutations. By design, all 20! proteins will be considered, and only the proteins that contribute unique peptides will be retained. Any protein with more than 20 sites will have all variants inserted into the reference protein without performing any permutation.
  • To begin, you will need genotype calls in the snp file format. To convert VCF to snp format, you will need bcftools installed. The helper script, bin/vcfToSnp, calls bcftools internally, so bcftools must be on your PATH.
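To see why >10 variant sites gets slow, it helps to look at the factorial growth directly; a quick illustration in plain bash arithmetic (nothing GenPro-specific):

```shell
#!/usr/bin/env bash
# Factorial growth of the number of orderings of n variant sites.
# 20! still fits in 64-bit shell arithmetic (about 2.4e18).
fact() {
  local n=$1 r=1
  while [ "$n" -gt 1 ]; do
    r=$((r * n))
    n=$((n - 1))
  done
  echo "$r"
}

fact 5    # 120
fact 10   # 3628800
fact 20   # 2432902008176640000
```

Going from 10 to 20 sites multiplies the work by roughly 670 billion, which is why proteins above the 20-site cutoff skip permutation entirely.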





