genomepy
This page describes the core genomepy functionality. These classes and functions can be found on the top level of the genomepy module (e.g. genomepy.search), and are made available when running from genomepy import * (we won't judge you).
Additional functions that do not fit the core functionality, but we feel are still pretty cool, are also described.
When looking to download a new genome/gene annotation, your first step would be genomepy.search. This function checks either one provider or all of them. Advanced users may want to specify a provider for their search to speed up the process. To see which providers are available, use genomepy.list_providers or genomepy.list_online_providers:
list_providers
list_online_providers
search
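A minimal sketch of a search restricted to one provider, with the results grouped per provider. It assumes that genomepy.search yields rows that start with the assembly name followed by the provider name. That row layout is an assumption of this example, as are the helper names group_by_provider and search_assemblies, which are not part of genomepy.

```python
from collections import defaultdict


def group_by_provider(rows):
    """Group search result rows by provider.

    Assumes each row is a sequence that starts with
    (assembly name, provider name, ...).
    """
    grouped = defaultdict(list)
    for row in rows:
        name, provider = row[0], row[1]
        grouped[provider].append(name)
    return dict(grouped)


def search_assemblies(term, provider=None):
    """Run genomepy.search and group the hits (hypothetical helper).

    Needs genomepy and internet access. Leaving provider as None
    searches all providers, which is slower.
    """
    import genomepy  # imported lazily so group_by_provider works without it

    return group_by_provider(genomepy.search(term, provider=provider))
```

Calling search_assemblies("homo sapiens", provider="Ensembl") would then return a dictionary with one key per provider that had a hit.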
If you have no idea what you are looking for, you could even check out all available genomes. Be warned, genomepy.list_available_genomes is like watching the Star Wars title crawl.
list_available_genomes
If we search for homo sapiens, for instance, we find that GRCh38.p13 and hg38 are the latest versions. These names describe the same genome, but different assemblies, and the assemblies are not identical.
One of these differences is the quality of the gene annotation. We can inspect the annotations with genomepy.head_annotations:
head_annotations
Now that you have seen what's available, it's time to download a genome. The default parameters for genomepy.install_genome are optimized for sequence alignment and gene counting, but you have full control over them, so have a look!
genomepy won't overwrite any files you already downloaded (unless specified), but you can review your local genomes with genomepy.list_installed_genomes.
install_genome
list_installed_genomes
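The install step could look roughly like the sketch below. It assumes the keyword arguments provider, genomes_dir and annotation on genomepy.install_genome, and an optional genomes_dir argument on genomepy.list_installed_genomes; check the reference entries above for the full parameter list. The wrapper name install_with_annotation is hypothetical.

```python
def install_with_annotation(name, provider=None, genomes_dir=None):
    """Install a genome plus its gene annotation, then list local genomes.

    Hypothetical wrapper. Needs genomepy, internet access and disk space.
    """
    import genomepy  # imported lazily so defining this function has no side effects

    # annotation=True also requests the gene annotation (assumed keyword).
    genomepy.install_genome(
        name,
        provider=provider,
        genomes_dir=genomes_dir,
        annotation=True,
    )
    # Review what is now available locally.
    return genomepy.list_installed_genomes(genomes_dir)
```

For example, install_with_annotation("hg38", provider="UCSC") would download the assembly into the default genomes directory and return the names of all installed genomes.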
If you want to download a sequence blacklist, or create an aligner index, you might wanna look at plugins! Don't worry, you can rerun the genomepy.install_genome command, and genomepy will only run the new parts.
manage_plugins
The genome and gene annotations are installed in the genomes directory (unless specified otherwise). If you have a specific location in mind, you could set this as default in the genomepy config. To find and inspect it, use genomepy.manage_config:
manage_config
Did something go wrong? Oh noes! If the problem persists, clear the genomepy cache with genomepy.clean, and try again.
clean
Alright, you've got the goods! You can browse the genome's sequences and metadata with the genomepy.Genome class. This class builds on the pyfaidx.Fasta class and adds several options to get specific sequences from your genome and save them to file.
Genome
Methods
Genome.close
Genome.get_random_sequences
Genome.get_seq
Genome.get_spliced_seq
Genome.items
Genome.keys
Genome.track2fasta
Genome.values
Attributes
Genome.gaps
Genome.plugin
Genome.sizes
Genome.genomes_dir
Genome.name
Genome.genome_file
Genome.genome_dir
Genome.index_file
Genome.sizes_file
Genome.gaps_file
Genome.annotation_gtf_file
Genome.annotation_bed_file
Genome.readme_file
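As a rough illustration of the Genome class, the sketch below fetches the first bases of a chromosome from an installed genome. It assumes Genome.get_seq takes a sequence name with 1-based, inclusive start and end coordinates (as pyfaidx.Fasta.get_seq does), and that the constructor accepts a genomes_dir keyword. The function name first_bases is hypothetical.

```python
def first_bases(genome_name, chrom, n=50, genomes_dir=None):
    """Return the first n bases of a chromosome as a string.

    Hypothetical helper. Needs genomepy and a locally installed genome.
    """
    import genomepy  # imported lazily so defining this function has no side effects

    genome = genomepy.Genome(genome_name, genomes_dir=genomes_dir)
    try:
        # Coordinates are assumed to be 1-based and inclusive.
        seq = genome.get_seq(chrom, 1, n)
        return str(seq)
    finally:
        # Release the file handle on the underlying FASTA file.
        genome.close()
```

With hg38 installed, first_bases("hg38", "chr1", n=20) would return a 20-character string.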
You can obtain genomic sequences from a wide variety of inputs with as_seqdict. To use the function, it must be explicitly imported with from genomepy.seq import as_seqdict.
genomepy.seq.as_seqdict
A non-core function worth mentioning is genomepy.files.filter_fasta, for when you wish to filter a FASTA file by chromosome name using regex, and want the output written straight to (another) FASTA file.
genomepy.files.filter_fasta
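To make the idea of regex filtering by chromosome name concrete, here is a pure-Python sketch of the technique. It is not genomepy's implementation: the function name filter_fasta_records, its parameters, and the choice of re.search on the first word of each header are assumptions of this example.

```python
import re


def filter_fasta_records(lines, regex=".*", invert_match=False):
    """Yield the FASTA lines of records whose name matches the regex.

    The record name is the first word after '>' in the header line.
    With invert_match=True, the non-matching records are kept instead.
    """
    pattern = re.compile(regex)
    keep = False  # lines before the first header are dropped
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            name = header.split()[0] if header else ""
            keep = bool(pattern.search(name)) != invert_match
        if keep:
            yield line


# Keep only the numbered chromosomes from a small in-memory example.
example = [">chr1 demo", "ACGT", ">chrUn_random", "NNNN", ">chr2", "GG"]
numbered = list(filter_fasta_records(example, r"^chr\d+$"))
```

Anchoring the pattern with ^ and $ avoids partial matches such as chr1 matching inside chr1_random.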
Similarly, the genomepy.Annotation class helps you get the genes in check. This class returns a number of neat pandas dataframes, such as the named_gtf, or an annotation with the gene or chromosome names remapped to another type. Remapping gene names to another type is also possible with Annotation.map_genes. This feature is also available as a separate function, genomepy.query_mygene, as it's just so darn useful.
Annotation
query_mygene
Another non-core function worth mentioning is genomepy.annotation.filter_regex, which allows you to filter a dataframe by any column using regex.
genomepy.annotation.filter_regex