## Variability Analysis

#### `Author: Simon Hackl`
#### `Project: The OMPeome of Treponema pallidum`
#### `Contact: simon.hackl@uni-tuebingen.de`
#### `Date: 15.02.2022`

This _Python_ notebook documents the steps of the variability assessment and walks through them in order.

### 1. Variant Filtering and Structure Allocation with MUSIAL2.0

To filter the samples' variant calls and allocate them to the respective protein structures, the directory `./R5_TPOMPeome_Hackl2022_VariabilityAnalysis/MUSIAL2-0_TPANIC_74SAMPLES_30OMPS` was created manually. `MUSIAL2.0` was downloaded (https://github.com/Integrative-Transcriptomics/MUSIAL) and stored in this directory.

Next, the following specification files for the genes and samples were generated in the directory.

`geneSpecification.txt`:
```
TPANIC_RS00045,TP0011,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0011.SignalP.pred.PPM3.pdb
TPANIC_RS00590,TP0117,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0117.pdb
TPANIC_RS00635,TP0126,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0126.PPM3.pdb
TPANIC_RS00665,TP0131,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0131.torf.SignalP.pred.PPM3.pdb
TPANIC_RS01545,TP0313,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0313.pdb
TPANIC_RS00695,TP0316,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0316.pdb
TPANIC_RS01560,TP0317,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0317.pdb
bamA,TP0326,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0326.PPM3.pdb
TPANIC_RS01695,TP0346,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0346.SignalP.pred.PPM3.pdb
TPANIC_RS02330,TP0479,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0479.PPM3.pdb
TPANIC_RS02520,TP0515,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0515.pdb
TPANIC_RS02695,TP0548,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0548.PPM3.pdb
TPANIC_RS02750,TP0558,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0558.SignalP.pred.PPM3.pdb
TPANIC_RS03015,TP0610,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0610.pdb
TPANIC_RS03065,TP0620,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0620.pdb
TPANIC_RS03070,TP0621,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0621.pdb
TPANIC_RS03470,TP0698,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0698.PPM3.pdb
TPANIC_RS03635,TP0733,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0733.PPM3.pdb
TPANIC_RS04235,TP0856,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0856.PPM3.pdb
TPANIC_RS04240,TP0858,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0858.PPM3.pdb
TPANIC_RS04245,TP0859,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0859.PPM3.pdb
TPANIC_RS04270,TP0865,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0865.pdb
TPANIC_RS04420,TP0897,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0897.pdb
TPANIC_RS04760,TP0966,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0966.PPM3.pdb
TPANIC_RS04765,TP0967,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0967.PPM3.pdb
TPANIC_RS04770,TP0968,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0968.pdb
TPANIC_RS04775,TP0969,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0969.PPM3.pdb
TPANIC_RS05095,TP1031,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP1031.pdb
TPANIC_RS02370,TP0488,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0488.pred.PPM3.pdb
TPANIC_RS03500,TP0705,../../R3_TPOMPeome_Hackl2022_ProteinStructurePreparation/ProteinStructures/TP0705.pred.PPM3.pdb
```

`sampleSpecification.txt`:
```
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/A2_1/A2_1-NC_021490.2-HC-variants.fxd.vcf,A2_1
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/A2_2/A2_2-NC_021490.2-HC-variants.fxd.vcf,A2_2
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/AR2/AR2-NC_021490.2-HC-variants.fxd.vcf,AR2
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/AU13/AU13-NC_021490.2-HC-variants.fxd.vcf,AU13
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/AU15/AU15-NC_021490.2-HC-variants.fxd.vcf,AU15
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/AU16/AU16-NC_021490.2-HC-variants.fxd.vcf,AU16
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/AU17/AU17-NC_021490.2-HC-variants.fxd.vcf,AU17
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/B3/B3-NC_021490.2-HC-variants.fxd.vcf,B3
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/BAL3/BAL3-NC_021490.2-HC-variants.fxd.vcf,BAL3
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/BAL73/BAL73-NC_021490.2-HC-variants.fxd.vcf,BAL73
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/BosniaA/BosniaA-NC_021490.2-HC-variants.fxd.vcf,BosniaA
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/C3/C3-NC_021490.2-HC-variants.fxd.vcf,C3
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/CDC2/CDC2-NC_021490.2-HC-variants.fxd.vcf,CDC2
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/Chicago/Chicago-NC_021490.2-HC-variants.fxd.vcf,Chicago
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/CZ27/CZ27-NC_021490.2-HC-variants.fxd.vcf,CZ27
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/CZ33/CZ33-NC_021490.2-HC-variants.fxd.vcf,CZ33
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/Dallas/Dallas-NC_021490.2-HC-variants.fxd.vcf,Dallas
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/Fribourg/Fribourg-NC_021490.2-HC-variants.fxd.vcf,Fribourg
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/Gauthier/Gauthier-NC_021490.2-HC-variants.fxd.vcf,Gauthier
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/GHA1/GHA1-NC_021490.2-HC-variants.fxd.vcf,GHA1
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/GRA2/GRA2-NC_021490.2-HC-variants.fxd.vcf,GRA2
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/IND1/IND1-NC_021490.2-HC-variants.fxd.vcf,IND1
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/K3/K3-NC_021490.2-HC-variants.fxd.vcf,K3
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/MexicoA/MexicoA-NC_021490.2-HC-variants.fxd.vcf,MexicoA
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/NE12/NE12-NC_021490.2-HC-variants.fxd.vcf,NE12
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/NE13/NE13-NC_021490.2-HC-variants.fxd.vcf,NE13
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/NE14/NE14-NC_021490.2-HC-variants.fxd.vcf,NE14
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/NE15/NE15-NC_021490.2-HC-variants.fxd.vcf,NE15
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/NE17/NE17-NC_021490.2-HC-variants.fxd.vcf,NE17
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/NE19/NE19-NC_021490.2-HC-variants.fxd.vcf,NE19
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/NE20/NE20-NC_021490.2-HC-variants.fxd.vcf,NE20
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/NIC1/NIC1-NC_021490.2-HC-variants.fxd.vcf,NIC1
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/NIC2/NIC2-NC_021490.2-HC-variants.fxd.vcf,NIC2
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/P3/P3-NC_021490.2-HC-variants.fxd.vcf,P3
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF0697/PT_SIF0697-NC_021490.2-HC-variants.fxd.vcf,PT_SIF0697
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF0751/PT_SIF0751-NC_021490.2-HC-variants.fxd.vcf,PT_SIF0751
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF0857/PT_SIF0857-NC_021490.2-HC-variants.fxd.vcf,PT_SIF0857
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF0877_3/PT_SIF0877_3-NC_021490.2-HC-variants.fxd.vcf,PT_SIF0877_3
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF0908/PT_SIF0908-NC_021490.2-HC-variants.fxd.vcf,PT_SIF0908
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF0954/PT_SIF0954-NC_021490.2-HC-variants.fxd.vcf,PT_SIF0954
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1002/PT_SIF1002-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1002
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1020/PT_SIF1020-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1020
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1063/PT_SIF1063-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1063
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1127/PT_SIF1127-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1127
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1135/PT_SIF1135-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1135
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1140/PT_SIF1140-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1140
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1142/PT_SIF1142-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1142
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1156/PT_SIF1156-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1156
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1167/PT_SIF1167-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1167
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1183/PT_SIF1183-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1183
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1196/PT_SIF1196-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1196
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1200/PT_SIF1200-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1200
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1242/PT_SIF1242-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1242
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1252/PT_SIF1252-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1252
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1261/PT_SIF1261-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1261
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1278/PT_SIF1278-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1278
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1280/PT_SIF1280-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1280
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1299/PT_SIF1299-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1299
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/PT_SIF1348/PT_SIF1348-NC_021490.2-HC-variants.fxd.vcf,PT_SIF1348
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/Q3/Q3-NC_021490.2-HC-variants.fxd.vcf,Q3
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/SAM1/SAM1-NC_021490.2-HC-variants.fxd.vcf,SAM1
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/SamoaD/SamoaD-NC_021490.2-HC-variants.fxd.vcf,SamoaD
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/SEA86/SEA86-NC_021490.2-HC-variants.fxd.vcf,SEA86
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/Seattle81/Seattle81-NC_021490.2-HC-variants.fxd.vcf,Seattle81
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/SHC-0/SHC-0-NC_021490.2-HC-variants.fxd.vcf,SHC-0
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/SHD-R/SHD-R-NC_021490.2-HC-variants.fxd.vcf,SHD-R
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/SHE-V/SHE-V-NC_021490.2-HC-variants.fxd.vcf,SHE-V
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/SHG-I2/SHG-I2-NC_021490.2-HC-variants.fxd.vcf,SHG-I2
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/SS14/SS14-NC_021490.2-HC-variants.fxd.vcf,SS14
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/SW1/SW1-NC_021490.2-HC-variants.fxd.vcf,SW1
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/SW4/SW4-NC_021490.2-HC-variants.fxd.vcf,SW4
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/SW6/SW6-NC_021490.2-HC-variants.fxd.vcf,SW6
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/SW8/SW8-NC_021490.2-HC-variants.fxd.vcf,SW8
../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling/UW1/UW1-NC_021490.2-HC-variants.fxd.vcf,UW1
```
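
Every line of `sampleSpecification.txt` follows the same pattern: the path to a sample's VCF file, a comma, and the sample name. The sketch below shows one way such lines could be generated instead of typed by hand. The function and constant names are hypothetical and not part of MUSIAL; the path layout is taken from the listing above.

```python
# Hypothetical helper for generating sampleSpecification.txt entries.
# The path layout is copied from the listing above; names are illustrative.
VCF_ROOT = "../../R4_TPOMPeome_Hackl2022_VariantCalling/VariantCalling"
REFERENCE = "NC_021490.2"


def sample_specification_line(sample):
    """Return '<path to VCF>,<sample name>' for one sample."""
    return f"{VCF_ROOT}/{sample}/{sample}-{REFERENCE}-HC-variants.fxd.vcf,{sample}"


# Only the first three of the 74 samples are used here for illustration.
samples = ["A2_1", "A2_2", "AR2"]
lines = [sample_specification_line(s) for s in samples]
print("\n".join(lines))
```

Writing `"\n".join(lines)` to `sampleSpecification.txt` for the full sample list would reproduce the file shown above.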

Finally, the command ``java -jar ./MUSIAL-v2.0.jar -o ./output/ -r ../../R4_TPOMPeome_Hackl2022_VariantCalling/ReferenceGenome/NC_021490.2.fasta -a ../../R4_TPOMPeome_Hackl2022_VariantCalling/ReferenceGenome/NC_021490.2.gff3 -s ./sampleSpecification.txt -gf ./geneSpecification.txt -nt 4`` was run. The output files are written per gene to the `./R5_TPOMPeome_Hackl2022_VariabilityAnalysis/MUSIAL2-0_TPANIC_74SAMPLES_30OMPS/output` directory.
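
For reproducibility, the same call could also be assembled from Python. The following is a minimal sketch, assuming it is executed from within the MUSIAL directory; the function name `build_musial_command` is hypothetical, and all flags and paths are copied verbatim from the command above.

```python
import shlex


def build_musial_command(jar="./MUSIAL-v2.0.jar", threads=4):
    """Assemble the MUSIAL call as an argument list (flags copied from the text)."""
    reference_dir = "../../R4_TPOMPeome_Hackl2022_VariantCalling/ReferenceGenome"
    return [
        "java", "-jar", jar,
        "-o", "./output/",
        "-r", f"{reference_dir}/NC_021490.2.fasta",
        "-a", f"{reference_dir}/NC_021490.2.gff3",
        "-s", "./sampleSpecification.txt",
        "-gf", "./geneSpecification.txt",
        "-nt", str(threads),
    ]


command = build_musial_command()
# Print the command as a single shell-safe string for documentation purposes.
print(shlex.join(command))

# To actually run MUSIAL, uncomment the following lines:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` would raise an error if the Java process exits with a non-zero status.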

### 2. Visual Assessment with MUSIAL2.0$^{\textbf{IVE}}$

The per-gene results were analyzed manually using the companion visualization tool (https://integrative-transcriptomics.github.io/MUSIAL-IVE/). All collected information was stored in `./R5_TPOMPeome_Hackl2022_VariabilityAnalysis/VariabilityStatistics.csv`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams.update( { "text.usetex": False, "font.family": "serif" } )

from scipy.stats import spearmanr
from decimal import Decimal

In [None]:
# The statistics file is tab-separated despite its .csv extension.
df = pd.read_csv( "./R5_TPOMPeome_Hackl2022_VariabilityAnalysis/VariabilityStatistics.csv", delimiter = "\t" )
df

In [None]:
# Remove rows/genes with no or only one variant, for which the KS-test p-value is null, and run a Spearman rank correlation test
df = df.loc[ pd.notnull( df.KSPValue ) ]
spearmanrResults = spearmanr( df.filter( [ "PercVariablePositions", "MeanDiffVariants", "TotalProteoforms", "KSPValue" ] ).to_numpy( ) )
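
As background for the test above: Spearman's coefficient is the Pearson correlation of the rank-transformed values, so it measures monotonic rather than linear association. The sketch below illustrates this with the standard library only. It is not the SciPy implementation used in the notebook, the helper names are hypothetical, and constant inputs (zero rank variance) are not handled.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        tied_rank = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
        for k in order[pos:end + 1]:
            ranks[k] = tied_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed samples x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# A monotonic but non-linear relationship still yields a coefficient of 1.
print(spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))  # 1.0
```

When `spearmanr` is given a two-dimensional array, as in the cell above, it treats each column as one variable and returns a matrix of pairwise coefficients together with a matrix of p-values.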

In [None]:
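# Keep only the strict lower triangle of the coefficient matrix; all other cells are set to NaN and stay blank in the plot
spearmanrCoeffs = [ ]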
for i in range( 4 ) :
    d = [ ]
    for j in range( 4 ) :
        if i >= 1 and j < i :
            d.append( spearmanrResults[ 0 ][ i, j ] )
        else :
            d.append( np.nan )
    spearmanrCoeffs.append( d )
    
fig, ax = plt.subplots( figsize = ( 9, 9 ) )
im = ax.imshow( spearmanrCoeffs, cmap='coolwarm', interpolation='nearest', vmin=-1, vmax=1 )
labels = [ "Variable\nPositions [%]", "Mean Number\nof Variants", "Total Observed\nProteoforms", "Kolmogorov-Smirnov\n$p$-value" ]

ax.set_xticks( np.arange( 0, 3, 1 ) )
ax.set_xticklabels( labels[ :-1 ], size = 14 )
ax.set_yticks( np.arange( 1, 4, 1 ) )
ax.set_yticklabels( labels[ 1: ], size = 14 )

plt.setp( ax.get_xticklabels( ), rotation=45, ha="right", rotation_mode="anchor" )

for i in range( 4 ):
    for j in range( 4 ):
        if i >= 1 and j < i :
            if round( spearmanrResults[ 0 ][ i, j ], 3 ) >= 0 :
                c = "black"
            else :
                c = "white"
            text = ax.text( j, i, round( spearmanrResults[ 0 ][ i, j ], 3 ), ha = "center", va = "center", color = c, size = 14 )

for spine in ax.spines.values( ) :
    spine.set_visible( False )
ax.tick_params( which = "minor", bottom = False, left = False)
    
ax.set_title( "Spearman's Rank Correlation Coefficients\nfor Variability Statistics", size = 18, x = 0.4, y = 0.8 )
fig.tight_layout( )
plt.show( )

In [None]:
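# Keep only the strict lower triangle of the p-value matrix; all other cells are set to NaN and stay blank in the plot
spearmanrPVals = [ ]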
for i in range( 4 ) :
    d = [ ]
    for j in range( 4 ) :
        if i >= 1 and j < i :
            d.append( spearmanrResults[ 1 ][ i, j ] )
        else :
            d.append( np.nan )
    spearmanrPVals.append( d )
    
fig, ax = plt.subplots( figsize = ( 9, 9 ) )
im = ax.imshow( spearmanrPVals, cmap='coolwarm', interpolation='nearest', vmin=0.0, vmax=0.0083 )
labels = [ "Variable\nPositions [%]", "Mean Number\nof Variants", "Total Observed\nProteoforms", "Kolmogorov-Smirnov\n$p$-value" ]

ax.set_xticks( np.arange( 0, 3, 1 ) )
ax.set_xticklabels( labels[ :-1 ], size = 14 )
ax.set_yticks( np.arange( 1, 4, 1 ) )
ax.set_yticklabels( labels[ 1: ], size = 14 )

plt.setp( ax.get_xticklabels( ), rotation=45, ha="right", rotation_mode="anchor" )

for i in range( 4 ):
    for j in range( 4 ):
        if i >= 1 and j < i :
            c = "white"
            # Split the scientific notation into mantissa and exponent for the LaTeX label
            mantissa, exponent = "{:.3E}".format( Decimal( spearmanrResults[ 1 ][ i, j ] ) ).split( "E" )
            text = ax.text( j, i, mantissa + r"$\cdot 10^{" + exponent + "}$", ha = "center", va = "center", color = c, size = 14 )

for spine in ax.spines.values( ) :
    spine.set_visible( False )
ax.tick_params( which = "minor", bottom = False, left = False)
    
ax.set_title( "Spearman's Rank Correlation $p$-values\nfor Variability Statistics", size = 18, x = 0.4, y = 0.8 )
fig.tight_layout( )
plt.show( )
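
The notebook does not explain the upper color limit `vmax=0.0083` used in the p-value heatmap. The value coincides with a Bonferroni-corrected significance level for the six pairwise tests among the four statistics. The arithmetic below is a sketch under that assumption; the family-wise level of 0.05 is assumed and not stated in the source.

```python
from math import comb

alpha = 0.05        # assumed family-wise significance level (not stated in the notebook)
n_statistics = 4    # number of columns passed to spearmanr
n_tests = comb(n_statistics, 2)  # number of distinct pairwise comparisons

# Bonferroni correction: divide the family-wise level by the number of tests.
bonferroni_threshold = alpha / n_tests
print(n_tests, round(bonferroni_threshold, 4))  # 6 0.0083
```

Under this reading, cells at the upper end of the color scale lie at or above the corrected threshold.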