tbrann99/Subtelomere

This repository contains the code used for the analysis documented in the publication entitled "Subtelomeric plasticity contributes to gene family expansion in the human parasitic flatworm Schistosoma mansoni". The publication was accepted at BMC Genomics:

Brann, T., Beltramini, A., Chaparro, C. et al. Subtelomeric plasticity contributes to gene family expansion in the human parasitic flatworm Schistosoma mansoni. BMC Genomics 25, 217 (2024). https://doi.org/10.1186/s12864-024-10032-8

Supplementary files can be found in their native formats, alongside a bioinformatics workbook at https://zenodo.org/records/10721345
Please address any comments or questions to avp25[AT]cam.ac.uk


auto_sub.py
- Script to take p-values per chromosome and identify where enrichment ends at the chromosome termini.
- This script takes a single input: the p-values from a Wilcoxon test of sliding windows of repeat content across S. mansoni chromosomes, after smoothing with a Savitzky-Golay filter.
- Specifically, it calculates the subtelomere bounds as the points at which a threshold of p<0.1 for enrichment is achieved and subsequently lost, at the start and end of each chromosome.
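
The bound-finding rule described above can be sketched as follows. This is a minimal illustration, not the code in auto_sub.py: it assumes the smoothed p-values for one chromosome are already loaded as a list ordered by window position, and the function names `enrichment_end` and `subtelomere_bounds` are hypothetical.

```python
from typing import List, Optional, Tuple


def enrichment_end(pvalues: List[float], threshold: float = 0.1) -> Optional[int]:
    """Index of the first window where enrichment is lost after being achieved.

    Returns None if the threshold is never reached, and len(pvalues) if
    enrichment is reached but never lost.
    """
    achieved = False
    for i, p in enumerate(pvalues):
        if p < threshold:
            achieved = True      # enrichment threshold reached
        elif achieved:
            return i             # enrichment subsequently lost here
    return len(pvalues) if achieved else None


def subtelomere_bounds(
    pvalues: List[float], threshold: float = 0.1
) -> Tuple[Optional[int], Optional[int]]:
    """Return (left, right) window indices for one chromosome.

    Windows [0, left) form the subtelomere at the chromosome start and
    windows [right, end) form the subtelomere at the chromosome end.
    """
    left = enrichment_end(pvalues, threshold)
    # Scan the same values from the other terminus by reversing the list.
    from_end = enrichment_end(pvalues[::-1], threshold)
    right = None if from_end is None else len(pvalues) - from_end
    return left, right


# Hypothetical p-values for eight windows along one chromosome.
example = [0.5, 0.05, 0.01, 0.3, 0.8, 0.9, 0.04, 0.02]
bounds = subtelomere_bounds(example)
```

Window indices would still need converting to base-pair coordinates using the window size and step, which are not stated here.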

subtel.py
- Script takes paired loci and returns the list with a tag indicating whether each pair is subtelomeric or in the chromosome body (using bounds from subtelomere.bed).
- This script takes two inputs:
  - A file of paired loci, in our case from biser, which reports two sets of coordinates that share homology.
  - A BED file of subtelomere bounds, against which the locations reported in the first input are assessed. If either side of a pair falls within these bounds, the pair is tagged as "Subtelomere"; otherwise it is tagged as "No".
- NOTE: Whilst this is a very simple problem, the use of paired sets of coordinates makes a tool like bedtools intersect less than ideal for the purpose.
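
A minimal sketch of the tagging logic is shown below. It is not the code in subtel.py. It assumes that "falls within" means a locus is fully contained in a subtelomere interval, and that both inputs have already been parsed into Python tuples. The chromosome names, coordinates, and the function names `within` and `tag_pair` are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]
Pair = Tuple[str, int, int, str, int, int]  # chrom1, start1, end1, chrom2, start2, end2


def within(regions: Dict[str, List[Interval]], chrom: str, start: int, end: int) -> bool:
    """True if [start, end] lies entirely inside any interval on chrom.

    Assumption: full containment, not partial overlap.
    """
    return any(s <= start and end <= e for s, e in regions.get(chrom, []))


def tag_pair(regions: Dict[str, List[Interval]], pair: Pair) -> str:
    """Tag a pair as "Subtelomere" if either side falls within the bounds."""
    c1, s1, e1, c2, s2, e2 = pair
    hit = within(regions, c1, s1, e1) or within(regions, c2, s2, e2)
    return "Subtelomere" if hit else "No"


# Hypothetical subtelomere bounds, as they might be read from a BED file.
subtelomeres = {"chr1": [(0, 50_000), (950_000, 1_000_000)]}

# Hypothetical paired loci.
pairs: List[Pair] = [
    ("chr1", 10_000, 12_000, "chr2", 400_000, 402_000),   # first side is inside
    ("chr1", 300_000, 301_000, "chr2", 10_000, 11_000),   # neither side is inside
]
tags = [tag_pair(subtelomeres, p) for p in pairs]
```

Because the tag depends on both loci of a pair at once, each pair is checked as a unit here.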

gene_viz.py
- Script creates a suitably formatted input for the R package "gggenes", which is used to visualise gene models and, in our case, demonstrate their similarity.
- Takes a single input: the GFF of the gene to be visualised.
- The script then recalculates the coordinates of the CDS fields relative to 0 (the start of the first coding sequence), irrespective of orientation.
- This is best used with a wrapper that holds the list of Smp_ids and iteratively calls this script for each gene in the cluster.
- Note, as outlined in the manuscript, some gene models were re-coloured and aligned to exon 3.
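
The coordinate recalculation can be sketched as below. This is an illustration, not the code in gene_viz.py. It assumes a standard nine-column, tab-separated GFF for a single gene. For a minus-strand gene it measures from the highest CDS coordinate, so the first coding sequence starts at 0 in both orientations. The function names `parse_cds` and `relative_cds` are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_cds(gff_lines: Iterable[str]) -> Tuple[List[Tuple[int, int]], str]:
    """Collect (start, end) for every CDS row and the strand of the gene."""
    cds: List[Tuple[int, int]] = []
    strand = "+"
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9 and cols[2] == "CDS":
            cds.append((int(cols[3]), int(cols[4])))
            strand = cols[6]
    return cds, strand


def relative_cds(cds: List[Tuple[int, int]], strand: str) -> List[Tuple[int, int]]:
    """Shift CDS coordinates so the first coding sequence starts at 0."""
    if strand == "+":
        origin = min(start for start, _ in cds)
        shifted = [(start - origin, end - origin) for start, end in cds]
    else:
        # On the minus strand the first CDS has the highest coordinates,
        # so distances are measured back from the largest end position.
        origin = max(end for _, end in cds)
        shifted = [(origin - end, origin - start) for start, end in cds]
    return sorted(shifted)


# Hypothetical GFF rows for a two-exon gene on the minus strand.
gff = [
    "chr1\tsrc\tgene\t100\t450\t.\t-\t.\tID=geneA",
    "chr1\tsrc\tCDS\t100\t200\t.\t-\t0\tParent=geneA",
    "chr1\tsrc\tCDS\t300\t450\t.\t-\t0\tParent=geneA",
]
coords, strand = parse_cds(gff)
relative = relative_cds(coords, strand)
```

The shifted intervals could then be written out with whichever column names the plotting code expects.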

anno_domain_viz.py
- Script that generates domain data in an appropriate format for use in iTOL as an "additional dataset".
- Script takes a single input: the TSV generated by InterProScan domain searching of an amino acid sequence.
- Note, this script is not versatile: it requires predetermined protein domains, colours and shapes to be specified (lines 30:66), though these can be altered to fit another user's requirements.
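
A rough sketch of this kind of conversion is given below. It is not the code in anno_domain_viz.py. It assumes the usual InterProScan TSV column order (protein accession, MD5, sequence length, analysis, signature accession, description, start, stop, ...). It also assumes the iTOL protein-domain data line format of `ID,length,SHAPE|START|END|COLOUR|LABEL,...`, which should be checked against the current iTOL dataset template. The domain accessions, colours, shapes, labels, and the function name `itol_domain_lines` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping: signature accession -> (iTOL shape, colour, label).
DOMAIN_STYLE: Dict[str, Tuple[str, str, str]] = {
    "PF00001": ("RE", "#ff0000", "DomainA"),
    "PF00002": ("EL", "#0000ff", "DomainB"),
}


def itol_domain_lines(
    tsv_lines: Iterable[str], styles: Dict[str, Tuple[str, str, str]]
) -> List[str]:
    """Build one iTOL domain data line per protein from InterProScan TSV rows."""
    lengths: Dict[str, str] = {}
    domains = defaultdict(list)
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        # Keep only rows whose signature has a predetermined style.
        if len(cols) < 8 or cols[4] not in styles:
            continue
        shape, colour, label = styles[cols[4]]
        lengths[cols[0]] = cols[2]
        domains[cols[0]].append((int(cols[6]), int(cols[7]), shape, colour, label))
    out = []
    for protein, hits in domains.items():
        fields = [
            f"{shape}|{start}|{end}|{colour}|{label}"
            for start, end, shape, colour, label in sorted(hits)
        ]
        out.append(",".join([protein, lengths[protein]] + fields))
    return out


# Hypothetical InterProScan rows for one protein (first eight columns only).
rows = [
    "protA\tmd5\t300\tPfam\tPF00002\tdesc\t120\t200",
    "protA\tmd5\t300\tPfam\tPF00001\tdesc\t10\t90",
    "protA\tmd5\t300\tPfam\tPF99999\tdesc\t210\t250",  # no style defined, skipped
]
lines = itol_domain_lines(rows, DOMAIN_STYLE)
```

The resulting lines would go under the `DATA` section of an iTOL dataset file, after the header fields the template requires.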





