This repository provides the Python scripts used to perform the analyses and generate the figures for the now-published article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9176631/.
The instructions below set up a Miniconda environment containing the required dependencies and then run the manuscript's analysis scripts.
Miniconda is a minimal installer for conda, a barebones alternative to the full Anaconda distribution. Using it means only the dependencies that are actually needed get installed, which reduces installation size and time. Installation instructions are available on the Miniconda website.
$ conda create -n gsalaris_phylo python=3.7
$ conda activate gsalaris_phylo
$ conda install -c etetoolkit ete3==3.1.2 ete_toolchain
$ conda install -c anaconda seaborn
$ conda install -c conda-forge biopython
$ conda install -c bioconda iqtree
$ cd 00.src
$ python3 000.get_data.py
$ uncompress ./Gblocks_Linux64_0.91b.tar.Z
$ tar -xvf ./Gblocks_Linux64_0.91b.tar
$ chmod +x muscle5.1.linux_intel64
$ chmod +x usearch11.0.667_i86linux32
$ ./usearch11.0.667_i86linux32 -cluster_fast ../01.data_raw/UniProt_BLAST_results.fasta -id 0.80 -centroids ../02.data/UniProt_BLAST_results.centroids.fasta
$ python3 001.clean_CAs.py
$ ./Gblocks_0.91b/Gblocks ../02.data/UniProt_BLAST_results.centroids.BCAs.fasta.muscle_aligned -t=p -b2=6 -b3=20 -b4=2 -b5=h -d=y -v=240
$ iqtree -s ../02.data/UniProt_BLAST_results.centroids.BCAs.fasta.muscle_aligned-gb -st AA -alrt 100000 -bb 100000 -nt AUTO -m TESTNEW+LM
$ python3 002.ete_tree.py
000.get_data.py
DESCRIPTION: Downloads Muscle 5.1, USEARCH 11, and Gblocks 0.91b, which are used in the analysis.
INPUTS: N/A
OUTPUTS:
- (/src) program executables downloaded here.
001.clean_CAs.py
DESCRIPTION: Runs a basic regex test to check whether each BLAST result contains both canonical beta-carbonic anhydrase (BCA) amino acid motifs (CxDxR and HxxC), then performs a Muscle alignment of the passing sequences.
INPUTS:
- (/data_raw/Gsalaris_novelBCA.fasta) Novel G. salaris BCA sequence.
- (/data/UniProt_BLAST_results.centroids.fasta) Centroid sequences identified by USEARCH cluster analysis of the BLAST results.
OUTPUTS:
- (/data/UniProt_BLAST_results.centroids.BCAs.fasta) Sequences containing BCA AA motifs.
- (/data/UniProt_BLAST_results.centroids.BCAs.fasta.muscle_aligned) Muscle-aligned sequences containing BCA AA motifs, with novel G. salaris BCA added before alignment.
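
The motif test described above can be sketched roughly as follows. This is a minimal illustration and not the code in 001.clean_CAs.py: it assumes that "x" in CxDxR and HxxC stands for any single residue, and the function names (parse_fasta, has_bca_motifs, filter_bcas) are hypothetical. It uses only the standard library, whereas the actual script may rely on Biopython for FASTA handling.

```python
import re

# Assumption: "x" in CxDxR / HxxC means exactly one residue of any kind.
CXDXR = re.compile(r"C.D.R")
HXXC = re.compile(r"H..C")


def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join multi-line sequences into a single string per record.
    return {h: "".join(parts) for h, parts in records.items()}


def has_bca_motifs(seq):
    """Return True only if the sequence contains both motifs."""
    seq = seq.upper()
    return bool(CXDXR.search(seq)) and bool(HXXC.search(seq))


def filter_bcas(records):
    """Keep only the records whose sequence passes the motif test."""
    return {h: s for h, s in records.items() if has_bca_motifs(s)}
```

A sequence that contains only one of the two motifs is dropped, so the aligned output holds only candidates carrying both.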
002.ete_tree.py
DESCRIPTION: Renders the consensus tree produced by the IQ-TREE analysis as a phylogram of the BCA sequences, colored by phylum, and writes a legend pairing each phylum with its color.
INPUTS:
- (/data/UniProt_BLAST_results.centroids.BCAs.fasta.muscle_aligned-gb.contree) Consensus tree generated by the IQ-TREE analysis.
OUTPUTS:
- (/output/Gsalaris_BCA_phylogram.svg) Phylogram of BCA sequences, colored by phylum.
- (/output/[color_pal].legend.svg) Legend pairing phyla and associated colors.
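
The pairing of phyla with colors behind the phylogram and its legend can be illustrated with a small sketch. It is not taken from 002.ete_tree.py: the function names and the leaf-to-phylum mapping are hypothetical, and the colors come from the standard-library colorsys module, whereas the repository's script may draw them from a seaborn palette (seaborn is installed in the environment above).

```python
import colorsys


def phylum_palette(phyla):
    """Assign one evenly spaced hue (as a hex color) to each distinct phylum."""
    unique = sorted(set(phyla))
    n = len(unique)
    palette = {}
    for i, phylum in enumerate(unique):
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.65, 0.9)
        palette[phylum] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return palette


def color_leaves(leaf_to_phylum, palette, default="#000000"):
    """Map each tree leaf name to the color of its phylum."""
    return {
        leaf: palette.get(phylum, default)
        for leaf, phylum in leaf_to_phylum.items()
    }
```

The same palette dict can drive both the leaf colors and the legend, which keeps the two output files consistent with each other.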