# recreating HGTDB

G+C content, codon usage, or amino acid composition

**GC content:**

- We considered genes as extraneous in terms of the G+C content if their G+C(T) content deviated by >1.5sigma from the mean value of their genome or 
- if deviations of G+C(1) and G+C(3) were of the same sign and at least one was >1.5sigma
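
As a rough illustration of the quantities in this criterion, the sketch below computes G+C(T) and the position-specific G+C(1), G+C(2), G+C(3) of a coding sequence and then applies the two-part test. The names `gc_by_position` and `is_gc_extraneous` are hypothetical helpers, not part of HGTDB or of the `NCBIDataLoader` class used later; the genome means and standard deviations are assumed to be passed in as plain `(T, 1, 2, 3)` tuples.

```python
def gc_by_position(seq):
    """Return (GC_T, GC_1, GC_2, GC_3) in percent for an in-frame coding sequence."""
    seq = seq.upper()
    seq = seq[: len(seq) - len(seq) % 3]  # drop a trailing partial codon
    if not seq:
        raise ValueError("sequence is shorter than one codon")

    def gc_percent(s):
        return 100.0 * sum(base in "GC" for base in s) / len(s)

    return (gc_percent(seq), gc_percent(seq[0::3]),
            gc_percent(seq[1::3]), gc_percent(seq[2::3]))


def is_gc_extraneous(gene_gc, genome_mean, genome_std, k=1.5):
    """Apply the G+C criterion; every argument is a (T, 1, 2, 3) tuple."""
    dev = [g - m for g, m in zip(gene_gc, genome_mean)]
    # Rule 1: total G+C deviates by more than k sigma, in either direction
    if abs(dev[0]) > k * genome_std[0]:
        return True
    # Rule 2: G+C(1) and G+C(3) deviate in the same direction
    # and at least one of them exceeds k sigma
    same_sign = dev[1] * dev[3] > 0
    return same_sign and (abs(dev[1]) > k * genome_std[1]
                          or abs(dev[3]) > k * genome_std[3])
```

The absolute values reflect one reading of "deviated by >1.5sigma", namely that both G+C-rich and G+C-poor genes count as extraneous.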

**Codon Usage:**
- Mahalanobis distance is used as a measure of the distance between the codon usage of a gene (X) and the mean of an organism (Xhat)

- each gene is a vector in 61-D space: the relative frequencies of the 61 codons (stop codons not included)

- Mahalanobis distance is defined as
    - dM(X,Xhat) = (X-Xhat)^T * S^-1 * (X-Xhat)

- The covariance S is a 61 x 61 covariance matrix
    - Sij = sum_k( (Xki - Xhati) * (Xkj - Xhatj) ), where k runs over the genes of the genome

- Xhat is the vector of mean values, one per codon
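
The two formulas above can be sketched in NumPy as follows. This is a minimal illustration on arbitrary small arrays, not the HGTDB implementation; `mahalanobis_sq` and `covariance_sum` are hypothetical names, and the distance is kept in the squared form quoted in these notes (no square root, no 1/(n-1) normalisation of S).

```python
import numpy as np


def covariance_sum(X):
    """S_ij = sum_k (X_ki - Xhat_i)(X_kj - Xhat_j) for a (genes x codons) array."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)      # subtract the per-codon mean Xhat
    return centered.T @ centered


def mahalanobis_sq(x, mean, cov):
    """(x - mean)^T S^-1 (x - mean), as written in the notes above."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # solve S y = diff rather than forming the inverse explicitly;
    # this raises LinAlgError if S is singular
    return float(diff @ np.linalg.solve(cov, diff))
```

With an identity covariance the result reduces to the squared Euclidean distance, which makes a convenient sanity check.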

important part is 

"We calculated the Mahalanobis distance from each gene to the mean value of its own organism. These distances did not follow a normal distribution, so we could not apply the criteria regarding deviations >1.5sigma from the mean value to identify extraneous genes from codon usage. Instead we used a Montecarlo procedure"

- generating a random sample of 10,000 sequences from the means and standard deviations of the codon usage of each genome

- The Mahalanobis distances of these sets of random sequences had a normal distribution, and so, we could calculate a mean value and a standard deviation

**Amino acid:**

What large deviations from the mean values of amino acid composition represent is very ambiguous. They may be caused either by functional constraints or by the result of the extraneous codon usage or G+C content of a horizontally transferred gene. We therefore chose the restricting criterion:

We excluded from our set of genes predicted as being acquired by HGT those isolated genes whose derived protein has deviations of >3sigma in at least one amino acid content. Only genes included in some of the alien genomic strips could present such deviation.
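
A possible way to express the >3sigma amino acid test is sketched below. It assumes that amino acid content means the fraction of each of the 20 standard residues in the translated protein, and that genome-wide means and standard deviations are available as dictionaries keyed by one-letter code; `aa_composition` and `has_extreme_amino_acid` are hypothetical names.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(protein):
    """Fraction of each of the 20 standard amino acids in a protein string."""
    counts = Counter(protein.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)   # ignores '*', 'X', etc.
    if total == 0:
        raise ValueError("no standard amino acids in sequence")
    return {a: counts[a] / total for a in AMINO_ACIDS}


def has_extreme_amino_acid(protein, genome_mean, genome_std, k=3.0):
    """True if at least one amino acid deviates by more than k sigma."""
    comp = aa_composition(protein)
    return any(abs(comp[a] - genome_mean[a]) > k * genome_std[a]
               for a in AMINO_ACIDS)
```

Under the criterion quoted above, an isolated gene for which this returns `True` would be dropped from the HGT candidates, while a gene inside an alien genomic strip would be kept.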

In [1]:
from data.utils.NCBI.data_loader import NCBIDataLoader
yakult = NCBIDataLoader('ASM82905v1')

found 1 ids
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/829/055/GCF_000829055.1_ASM82905v1/GCF_000829055.1_ASM82905v1_cds_from_genomic.fna.gz


# GC content deviation

## Part 1
We considered genes as extraneous in terms of the G+C content if their G+C(T) content deviated by >1.5sigma from the mean value of their genome or if deviations of G+C(1) and G+C(3) were of the same sign and at least one was >1.5sigma

In [2]:
yakult.print_genome_summary()

Mean GC Content-> T:47.99780371380697, 1:54.43022578518216, 2:38.120079456762745, 3:51.4432288224944
Std GC content-> T:3.9333303002550197, 1:5.59620581567793, 2:5.1420556965271995, 3:7.1901061819491865
Relative nucleotide frequency: {'G': 0.25390022015214475, 'A': 0.26340302506088364, 'C': 0.2329938908275224, 'T': 0.2497028639594492}
Nucleotide Identity: {'A': [0.2553701468935958, 0.2571262119761974, 0.2777127163128578], 'T': [0.26331821397361743, 0.2391815333765315, 0.2466088445281987], 'G': [0.24067365367352883, 0.27363940624312644, 0.24738760053977898], 'C': [0.240637985459258, 0.23005284840414464, 0.22829083861916452]}
Dinucleotide Identity: {'AA': [0.08006800739520976, 0.08457528073856982, 0.08832173718677326], 'AG': [0.055446239084040255, 0.04828525059892876, 0.05131829876682947], 'AT': [0.06612173561530642, 0.06895022500698503, 0.07916569569745709], 'AC': [0.053734164799039336, 0.0553154556317138, 0.05890612590419031], 'TA': [0.04349738730330466, 0.04250224412514787, 0.03745404

In [3]:
yakult.print_gene_summary('LBCZ_RS00005')

Mean GC Content-> T: 48.44444444444444, 1:56.44444444444444, 2:32.22222222222222, 3:56.666666666666664
Std GC content-> T:0.11355281569120143, 1:0.3599257649923083, 2:-1.1469843157326693, 3:0.7264757587705374
Relative nucleotide frequency: {'G': 0.24592592592592594, 'A': 0.2903703703703704, 'C': 0.23851851851851852, 'T': 0.22518518518518518}
Nucleotide Identity: {'A': [0.3111111111111111, 0.3844444444444444, 0.17555555555555555], 'T': [0.12444444444444444, 0.29333333333333333, 0.2577777777777778], 'G': [0.3377777777777778, 0.10666666666666667, 0.29333333333333333], 'C': [0.22666666666666666, 0.21555555555555556, 0.2733333333333333]}
Dinucleotide Identity: {'AA': [0.13111111111111112, 0.13111111111111112, 0.060133630289532294], 'AG': [0.017777777777777778, 0.08, 0.0645879732739421], 'AT': [0.08444444444444445, 0.1111111111111111, 0.022271714922048998], 'AC': [0.07777777777777778, 0.06222222222222222, 0.026726057906458798], 'TA': [0.028888888888888888, 0.017777777777777778, 0.07126948775

In [4]:
# gene - genome
print(" Check if gene.GCT - genome.GCT deviates more than 1.5sigma")
print(f" GCT: {yakult['LBCZ_RS00005']['GCT'] - yakult.mean_GCT}  > 1.5sigma: {1.5*yakult.std_GCT}")
print("\n")
print(" Check if either gene.GC1 - genome.GC1 and gene.GC3 - genome.GC3 has same sign")
print( " AND at least one of them is more than 1.5sigma")
print(f" GC1: {yakult['LBCZ_RS00005']['GC1'] - yakult.mean_GC1}  > 1.5sigma: {1.5*yakult.std_GC1}")
print(f" GC3: {yakult['LBCZ_RS00005']['GC3'] - yakult.mean_GC3}  > 1.5sigma: {1.5*yakult.std_GC3}")



 Check if gene.GCT - genome.GCT deviates more than 1.5sigma
 GCT: 0.4466407306374762  > 1.5sigma: 5.89999545038253


 Check if either gene.GC1 - genome.GC1 and gene.GC3 - genome.GC3 has same sign
 AND at least one of them is more than 1.5sigma
 GC1: 2.0142186592622835  > 1.5sigma: 8.394308723516895
 GC3: 5.223437844172267  > 1.5sigma: 10.78515927292378


let us loop this damn thing

In [5]:
# need to rethink gene listing i guess

list_of_extraneous_genes = []
for i in yakult.genes:
    # only genes longer than 300 bp are considered
    if len(yakult[i]['sequence']) <= 300:
        continue

    dev_GCT = yakult[i]['GCT'] - yakult.mean_GCT
    dev_GC1 = yakult[i]['GC1'] - yakult.mean_GC1
    dev_GC3 = yakult[i]['GC3'] - yakult.mean_GC3
    # a positive product means GC1 and GC3 deviate in the same direction
    same_sign = dev_GC1 * dev_GC3 > 0

    # NOTE: these comparisons are one-sided, so only genes richer in G+C
    # than the genome mean are flagged; wrapping each deviation in abs()
    # would also catch G+C-poor genes, as the criterion above reads
    if dev_GCT > 1.5 * yakult.std_GCT:
        list_of_extraneous_genes.append(i)
    elif same_sign and (dev_GC1 > 1.5 * yakult.std_GC1
                        or dev_GC3 > 1.5 * yakult.std_GC3):
        list_of_extraneous_genes.append(i)
        
                

In [6]:
len(list_of_extraneous_genes)

122

In [7]:
list_of_extraneous_genes

['LBCZ_RS00130',
 'LBCZ_RS00135',
 'LBCZ_RS00170',
 'LBCZ_RS00175',
 'LBCZ_RS00185',
 'LBCZ_RS00190',
 'LBCZ_RS00240',
 'LBCZ_RS00295',
 'LBCZ_RS00300',
 'LBCZ_RS00700',
 'LBCZ_RS00730',
 'LBCZ_RS01290',
 'LBCZ_RS01380',
 'LBCZ_RS01385',
 'LBCZ_RS01485',
 'LBCZ_RS01545',
 'LBCZ_RS01590',
 'LBCZ_RS01990',
 'LBCZ_RS02030',
 'LBCZ_RS02070',
 'LBCZ_RS02150',
 'LBCZ_RS02175',
 'LBCZ_RS02205',
 'LBCZ_RS02220',
 'LBCZ_RS02250',
 'LBCZ_RS02255',
 'LBCZ_RS02290',
 'LBCZ_RS02415',
 'LBCZ_RS02735',
 'LBCZ_RS02740',
 'LBCZ_RS04065',
 'LBCZ_RS04240',
 'LBCZ_RS04450',
 'LBCZ_RS04550',
 'LBCZ_RS04880',
 'LBCZ_RS05375',
 'LBCZ_RS05450',
 'LBCZ_RS05510',
 'LBCZ_RS05525',
 'LBCZ_RS05575',
 'LBCZ_RS05595',
 'LBCZ_RS05610',
 'LBCZ_RS06065',
 'LBCZ_RS06535',
 'LBCZ_RS06675',
 'LBCZ_RS06785',
 'LBCZ_RS06870',
 'LBCZ_RS06955',
 'LBCZ_RS06985',
 'LBCZ_RS07380',
 'LBCZ_RS07575',
 'LBCZ_RS07690',
 'LBCZ_RS08005',
 'LBCZ_RS08250',
 'LBCZ_RS08295',
 'LBCZ_RS08605',
 'LBCZ_RS08750',
 'LBCZ_RS09010',
 'LBCZ_RS09055

In [8]:
print(" Check if gene.GCT - genome.GCT deviates more than 1.5sigma")
print(f" GCT: {yakult['LBCZ_RS02070']['GCT'] - yakult.mean_GCT}  > 1.5sigma: {1.5*yakult.std_GCT}")
print("\n")
print(" Check if either gene.GC1 - genome.GC1 and gene.GC3 - genome.GC3 has same sign")
print( " AND at least one of them is more than 1.5sigma")
print(f" GC1: {yakult['LBCZ_RS02070']['GC1'] - yakult.mean_GC1}  > 1.5sigma: {1.5*yakult.std_GC1}")
print(f" GC3: {yakult['LBCZ_RS02070']['GC3'] - yakult.mean_GC3}  > 1.5sigma: {1.5*yakult.std_GC3}")

 Check if gene.GCT - genome.GCT deviates more than 1.5sigma
 GCT: 6.305226589223338  > 1.5sigma: 5.89999545038253


 Check if either gene.GC1 - genome.GC1 and gene.GC3 - genome.GC3 has same sign
 AND at least one of them is more than 1.5sigma
 GC1: 8.115228760272387  > 1.5sigma: 8.394308723516895
 GC3: 4.920407541141969  > 1.5sigma: 10.78515927292378


## Part 2
We also ran an 11-gene window through each genome. Five or more extraneous genes in a given window indicated the presence of an alien genomic strip. Finally, we filtered these strips to disregard short isolated segments and to include genes that we did not consider extraneous but that had a deviation of their G+C content of the same sign as the deviation of the strip to which they belong

In [9]:
from Bio.SeqUtils import GC123

In [10]:
list_of_genes = list(yakult.genes.keys())

genomic_strips = []


# slide an 11-gene window along the genome
for k in range(len(list_of_genes) - 10):
    # keep only the genes in the window that are longer than 300 bp
    window = {}
    for locus_tag in list_of_genes[k:k + 11]:
        if len(yakult[locus_tag]['sequence']) > 300:
            window[locus_tag] = yakult[locus_tag]

    # count the extraneous genes in the window
    extraneous_counter = sum(tag in list_of_extraneous_genes for tag in window)

    # five or more extraneous genes mark an alien genomic strip
    if extraneous_counter >= 5:
        # concatenate the sequences of the strip
        sequences = ''.join(gene['sequence'] for gene in window.values())

        # deviation of the strip from the genome mean, in units of sigma
        GCT, GC1, GC2, GC3 = GC123(sequences)
        SDT = (GCT - yakult.mean_GCT) / yakult.std_GCT
        SD1 = (GC1 - yakult.mean_GC1) / yakult.std_GC1
        SD2 = (GC2 - yakult.mean_GC2) / yakult.std_GC2
        SD3 = (GC3 - yakult.mean_GC3) / yakult.std_GC3

        # also tag genes whose deviations have the same sign as the strip's
        for n in window:
            # check only genes not in the current list
            if n not in list_of_extraneous_genes:
                same_sign = (window[n]['SDT'] * SDT > 0
                             and window[n]['SD1'] * SD1 > 0
                             and window[n]['SD2'] * SD2 > 0
                             and window[n]['SD3'] * SD3 > 0)
                if same_sign:
                    list_of_extraneous_genes.append(n)

        genomic_strips.append(window)


In [11]:
print(len(list_of_extraneous_genes))

127


In [12]:
for i in genomic_strips:
    print(len(i))

11
11
11
11
11
11
11
11
11
10
10
10
10


In [13]:
len(genomic_strips)

13

In [15]:
genomic_strips[0].keys()

dict_keys(['LBCZ_RS09705', 'LBCZ_RS09710', 'LBCZ_RS09715', 'LBCZ_RS09720', 'LBCZ_RS09725', 'LBCZ_RS09730', 'LBCZ_RS09735', 'LBCZ_RS09740', 'LBCZ_RS09745', 'LBCZ_RS09750', 'LBCZ_RS09755'])
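
Neighbouring window positions that both pass the threshold share most of their genes, so the entries of `genomic_strips` are overlapping windows rather than distinct strips. One way to collapse them is sketched below; `merge_windows` is a hypothetical helper, and treating overlapping or directly adjacent windows as one strip is an assumption about what "strip" should mean here.

```python
def merge_windows(windows, gene_order):
    """Merge windows (iterables of locus tags) that overlap or touch into strips."""
    position = {tag: i for i, tag in enumerate(gene_order)}
    # represent each non-empty window by the index range it spans in genome order
    spans = sorted((min(position[t] for t in w), max(position[t] for t in w))
                   for w in windows if w)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1] + 1:
            # overlaps or directly follows the previous strip: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # return each strip as the list of locus tags it covers
    return [gene_order[s:e + 1] for s, e in merged]
```

Called as `merge_windows(genomic_strips, list_of_genes)`, this would return one list of locus tags per contiguous strip. Note that the returned strips include any genes of 300 bp or less that lie inside the span, even though the windows themselves skipped them.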

In [16]:
window = {}
len(window)

0

In [17]:
for i in yakult.genes.keys():
    print(i)

LBCZ_RS00005
LBCZ_RS00010
LBCZ_RS00015
LBCZ_RS00020
LBCZ_RS00025
LBCZ_RS00030
LBCZ_RS00035
LBCZ_RS00040
LBCZ_RS00045
LBCZ_RS00050
LBCZ_RS00055
LBCZ_RS00060
LBCZ_RS00065
LBCZ_RS00070
LBCZ_RS00075
LBCZ_RS00080
LBCZ_RS00085
LBCZ_RS00090
LBCZ_RS00095
LBCZ_RS00100
LBCZ_RS00110
LBCZ_RS00115
LBCZ_RS00120
LBCZ_RS00125
LBCZ_RS00130
LBCZ_RS00135
LBCZ_RS00140
LBCZ_RS00145
LBCZ_RS00150
LBCZ_RS00155
LBCZ_RS00160
LBCZ_RS00165
LBCZ_RS00170
LBCZ_RS00175
LBCZ_RS00180
LBCZ_RS00185
LBCZ_RS00190
LBCZ_RS00195
LBCZ_RS00200
LBCZ_RS00205
LBCZ_RS00210
LBCZ_RS00215
LBCZ_RS00220
LBCZ_RS00225
LBCZ_RS00230
LBCZ_RS00240
LBCZ_RS00245
LBCZ_RS00250
LBCZ_RS00255
LBCZ_RS00260
LBCZ_RS00265
LBCZ_RS00270
LBCZ_RS00275
LBCZ_RS00280
LBCZ_RS00285
LBCZ_RS00290
LBCZ_RS00295
LBCZ_RS00300
LBCZ_RS00305
LBCZ_RS00310
LBCZ_RS00315
LBCZ_RS00320
LBCZ_RS15910
LBCZ_RS00325
LBCZ_RS16520
LBCZ_RS00330
LBCZ_RS00335
LBCZ_RS00340
LBCZ_RS00345
LBCZ_RS16245
LBCZ_RS00355
LBCZ_RS16250
LBCZ_RS16255
LBCZ_RS00365
LBCZ_RS00370
LBCZ_RS00375
LBCZ_RS00380

# Codon Usage

I guess codon usage is calculated differently: GCT, GC1, GC2 and GC3 are percentages, but `cub` holds raw codon counts.
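
Raw counts grow with gene length, so they are not directly comparable between genes. A minimal sketch of one normalisation follows, assuming "relative frequency" means the count of each codon divided by the total number of sense codons in the gene (the paper may normalise differently, for example per amino acid); `SENSE_CODONS` and `codon_frequencies` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# the 61 sense codons in a fixed (alphabetical) order
SENSE_CODONS = sorted(
    a + b + c
    for a in "ACGT" for b in "ACGT" for c in "ACGT"
    if a + b + c not in STOP_CODONS
)


def codon_frequencies(counts):
    """Turn a codon -> count mapping into relative frequencies over the 61 sense codons."""
    total = sum(counts.get(codon, 0) for codon in SENSE_CODONS)
    if total == 0:
        raise ValueError("no sense codons counted")
    # stop codons and unknown keys in `counts` are ignored
    return [counts.get(codon, 0) / total for codon in SENSE_CODONS]
```

Using `counts.get` avoids adding keys to a `defaultdict` such as the per-gene `cub` shown below, and the fixed codon order keeps every gene's vector aligned with the others.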

The Mahalanobis distance

In [2]:
len(yakult.cub)

64

In [3]:
yakult.cub

{'TTT': 20757,
 'TCT': 4555,
 'TAT': 16767,
 'TGT': 1777,
 'TTC': 13655,
 'TCC': 6408,
 'TAC': 10686,
 'TGC': 2450,
 'TTA': 14470,
 'TCA': 8185,
 'TAA': 1525,
 'TGA': 1241,
 'TTG': 23773,
 'TCG': 7634,
 'TAG': 785,
 'TGG': 9743,
 'CTT': 11449,
 'CCT': 6310,
 'CAT': 11819,
 'CGT': 8440,
 'CTC': 7674,
 'CCC': 3784,
 'CAC': 8347,
 'CGC': 13287,
 'CTA': 5704,
 'CCA': 9109,
 'CAA': 21000,
 'CGA': 5136,
 'CTG': 19297,
 'CCG': 15093,
 'CAG': 19724,
 'CGG': 10349,
 'ATT': 29989,
 'ACT': 9099,
 'AAT': 19437,
 'AGT': 10062,
 'ATC': 20767,
 'ACC': 19285,
 'AAC': 14983,
 'AGC': 10407,
 'ATA': 2081,
 'ACA': 9810,
 'AAA': 25838,
 'AGA': 1806,
 'ATG': 21493,
 'ACG': 15758,
 'AAG': 23433,
 'AGG': 1318,
 'GTT': 20731,
 'GCT': 17385,
 'GAT': 31296,
 'GGT': 16016,
 'GTC': 17209,
 'GCC': 25064,
 'GAC': 18108,
 'GGC': 24492,
 'GTA': 5107,
 'GCA': 18005,
 'GAA': 28551,
 'GGA': 7033,
 'GTG': 15608,
 'GCG': 19290,
 'GAG': 12108,
 'GGG': 8553}

In [4]:
yakult.mean_cub

{'TTT': 7.4,
 'TCT': 1.623885918003565,
 'TAT': 5.977540106951872,
 'TGT': 0.633511586452763,
 'TTC': 4.8680926916221035,
 'TCC': 2.284491978609626,
 'TAC': 3.809625668449198,
 'TGC': 0.8734402852049911,
 'TTA': 5.158645276292335,
 'TCA': 2.9180035650623886,
 'TAA': 0.5436720142602496,
 'TGA': 0.44242424242424244,
 'TTG': 8.475222816399286,
 'TCG': 2.7215686274509805,
 'TAG': 0.2798573975044563,
 'TGG': 3.4734402852049913,
 'CTT': 4.081639928698753,
 'CCT': 2.249554367201426,
 'CAT': 4.213547237076649,
 'CGT': 3.0089126559714794,
 'CTC': 2.7358288770053476,
 'CCC': 1.3490196078431373,
 'CAC': 2.9757575757575756,
 'CGC': 4.7368983957219255,
 'CTA': 2.033511586452763,
 'CCA': 3.247415329768271,
 'CAA': 7.4866310160427805,
 'CGA': 1.8310160427807487,
 'CTG': 6.8795008912655975,
 'CCG': 5.380748663101604,
 'CAG': 7.031729055258467,
 'CGG': 3.689483065953654,
 'ATT': 10.69126559714795,
 'ACT': 3.243850267379679,
 'AAT': 6.929411764705883,
 'AGT': 3.5871657754010697,
 'ATC': 7.40356506238859

In [5]:
yakult.std_cub

{'TTT': 6.192664281962268,
 'TCT': 3.1169823861455095,
 'TAT': 5.271846491356415,
 'TGT': 1.1937522906807942,
 'TTC': 4.742615774397596,
 'TCC': 2.8674195049097984,
 'TAC': 3.623863278997556,
 'TGC': 2.623496687010433,
 'TTA': 4.389519313621071,
 'TCA': 4.372623503053068,
 'TAA': 0.8819642997963005,
 'TGA': 1.4625389389702335,
 'TTG': 7.0398794751912535,
 'TCG': 4.044738449623214,
 'TAG': 0.7032676324904479,
 'TGG': 3.8603644920059224,
 'CTT': 3.498321698777621,
 'CCT': 2.286074400593279,
 'CAT': 3.66819374479769,
 'CGT': 2.98843744358569,
 'CTC': 2.940845743842563,
 'CCC': 1.600947781218815,
 'CAC': 2.8747069301234953,
 'CGC': 4.668238760512265,
 'CTA': 2.20102470957216,
 'CCA': 3.414465075359898,
 'CAA': 7.00879886681543,
 'CGA': 2.3061493864475904,
 'CTG': 6.208535074579245,
 'CCG': 5.2914810770258764,
 'CAG': 6.839702074019636,
 'CGG': 3.840052535590471,
 'ATT': 8.369638619822721,
 'ACT': 3.8694933925301895,
 'AAT': 6.5813982858732,
 'AGT': 4.229677761201908,
 'ATC': 6.178138801859

The sum in S runs over all genes (index k), so the number of terms equals the number of genes; the result is a single 61 x 61 covariance matrix per genome.

In [7]:
len(yakult.genes)

2805

In [8]:
# exclude stop codons
yakult.mean_cub

{'TTT': 7.4,
 'TCT': 1.623885918003565,
 'TAT': 5.977540106951872,
 'TGT': 0.633511586452763,
 'TTC': 4.8680926916221035,
 'TCC': 2.284491978609626,
 'TAC': 3.809625668449198,
 'TGC': 0.8734402852049911,
 'TTA': 5.158645276292335,
 'TCA': 2.9180035650623886,
 'TAA': 0.5436720142602496,
 'TGA': 0.44242424242424244,
 'TTG': 8.475222816399286,
 'TCG': 2.7215686274509805,
 'TAG': 0.2798573975044563,
 'TGG': 3.4734402852049913,
 'CTT': 4.081639928698753,
 'CCT': 2.249554367201426,
 'CAT': 4.213547237076649,
 'CGT': 3.0089126559714794,
 'CTC': 2.7358288770053476,
 'CCC': 1.3490196078431373,
 'CAC': 2.9757575757575756,
 'CGC': 4.7368983957219255,
 'CTA': 2.033511586452763,
 'CCA': 3.247415329768271,
 'CAA': 7.4866310160427805,
 'CGA': 1.8310160427807487,
 'CTG': 6.8795008912655975,
 'CCG': 5.380748663101604,
 'CAG': 7.031729055258467,
 'CGG': 3.689483065953654,
 'ATT': 10.69126559714795,
 'ACT': 3.243850267379679,
 'AAT': 6.929411764705883,
 'AGT': 3.5871657754010697,
 'ATC': 7.40356506238859

In [24]:
yakult['LBCZ_RS00005']['cub']

defaultdict(int,
            {'ATG': 10,
             'CCC': 4,
             'AAT': 16,
             'TTA': 4,
             'GAG': 9,
             'CTT': 4,
             'TGG': 3,
             'GCT': 8,
             'TAC': 6,
             'CTG': 16,
             'GAT': 23,
             'AAA': 24,
             'TTC': 5,
             'CGT': 4,
             'GAA': 23,
             'TTG': 12,
             'ACC': 12,
             'CCA': 4,
             'GTC': 16,
             'GGC': 10,
             'AGC': 6,
             'ACA': 3,
             'ATT': 16,
             'CAA': 11,
             'GCC': 16,
             'GTT': 6,
             'CTC': 7,
             'ATC': 11,
             'CCG': 7,
             'GCA': 2,
             'TCG': 4,
             'CAT': 5,
             'AAG': 10,
             'GTG': 9,
             'GGG': 3,
             'TAT': 6,
             'TTT': 12,
             'CAG': 17,
             'GAC': 8,
             'ACG': 17,
             'CAC': 5,
             'GCG': 11

In [12]:
import numpy as np

In [21]:
CODE_COVARMAT = {
    'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
    'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
    'tta': 'L', 'tca': 'S', 'ttg': 'L', 'tcg': 'S',
    'tgg': 'W', 'ctt': 'L', 'cct': 'P', 'cat': 'H',
    'cgt': 'R', 'ctc': 'L', 'ccc': 'P', 'cac': 'H',
    'cgc': 'R', 'cta': 'L', 'cca': 'P', 'caa': 'Q',
    'cga': 'R', 'ctg': 'L', 'ccg': 'P', 'cag': 'Q',
    'cgg': 'R', 'att': 'I', 'act': 'T', 'aat': 'N',
    'agt': 'S', 'atc': 'I', 'acc': 'T', 'aac': 'N',
    'agc': 'S', 'ata': 'I', 'aca': 'T', 'aaa': 'K',
    'aga': 'R', 'atg': 'M', 'acg': 'T', 'aag': 'K',
    'agg': 'R', 'gtt': 'V', 'gct': 'A', 'gat': 'D',
    'ggt': 'G', 'gtc': 'V', 'gcc': 'A', 'gac': 'D',
    'ggc': 'G', 'gta': 'V', 'gca': 'A', 'gaa': 'E',
    'gga': 'G', 'gtg': 'V', 'gcg': 'A', 'gag': 'E',
    'ggg': 'G'
}

calculate covarmat

In [33]:
# init covariance matrix
covarmat = np.zeros((61,61))
# fill covariance matrix
for i,i_tag in enumerate(CODE_COVARMAT):
    for j,j_tag in enumerate(CODE_COVARMAT):
        diff_A = yakult['LBCZ_RS00005']['cub'][i_tag.upper()] - yakult.cub[i_tag.upper()]
        diff_B = yakult['LBCZ_RS00005']['cub'][j_tag.upper()] - yakult.cub[j_tag.upper()]
        covarmat[i][j]= diff_A * diff_B

check covarmat

In [34]:
covarmat

array([[4.30355025e+08, 9.44519850e+07, 3.47706945e+08, ...,
        3.99942855e+08, 2.50993755e+08, 1.77369750e+08],
       [9.44519850e+07, 2.07298090e+07, 7.63128330e+07, ...,
        8.77772870e+07, 5.50867470e+07, 3.89281500e+07],
       [3.47706945e+08, 7.63128330e+07, 2.80931121e+08, ...,
        3.23135319e+08, 2.02791339e+08, 1.43306550e+08],
       ...,
       [3.99942855e+08, 8.77772870e+07, 3.23135319e+08, ...,
        3.71679841e+08, 2.33256621e+08, 1.64835450e+08],
       [2.50993755e+08, 5.50867470e+07, 2.02791339e+08, ...,
        2.33256621e+08, 1.46385801e+08, 1.03446450e+08],
       [1.77369750e+08, 3.89281500e+07, 1.43306550e+08, ...,
        1.64835450e+08, 1.03446450e+08, 7.31025000e+07]])

In [37]:
print(covarmat[1][21])
print(covarmat[21][1])
print(covarmat[50][10])
print(covarmat[10][50])

25961206.0
25961206.0
595165528.0
595165528.0


calculate Mahalanobis distance

In [48]:
# init zeros
X = np.zeros((1,61))
Xhat = np.zeros((1,61))

for i,cds in enumerate(CODE_COVARMAT):
    X[0][i] = yakult['LBCZ_RS00005']['cub'][cds.upper()]
    Xhat[0][i] = yakult.cub[cds.upper()]

In [52]:
# matrix products (@), with X and Xhat stored as 1 x 61 row vectors
(X - Xhat) @ np.linalg.inv(covarmat) @ (X - Xhat).T

LinAlgError: Singular matrix

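The error above is caused by the covariance matrix being singular (its inverse cannot be computed).
- `covarmat` is the outer product of a single gene's deviation vector, so it has rank 1; the S defined earlier sums that product over all genes
- the cell also subtracts the genome totals in `yakult.cub` rather than the per-codon means in `yakult.mean_cub`
- so the singularity does not come from the distances being non-normal; the Monte Carlo procedure deals with the distribution of the distances, not with the inversion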

In [56]:
np.linalg.det(covarmat)

0.0
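
The rank problem can be reproduced on synthetic data. The sketch below uses random Dirichlet vectors as a stand-in for the (genes x 61) table of relative codon frequencies, which is an assumption made purely for illustration. It also shows a second source of singularity: when every row sums to 1, S loses one rank even when summed over all genes, so a pseudo-inverse is used here instead of `np.linalg.inv`. That substitution is a workaround chosen for this sketch, not something stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-gene relative codon frequencies (rows sum to 1)
n_genes, n_codons = 500, 61
X = rng.dirichlet(np.ones(n_codons), size=n_genes)

Xhat = X.mean(axis=0)
centered = X - Xhat

# One gene only: an outer product of a single vector always has rank 1
S_single = np.outer(centered[0], centered[0])
rank_single = np.linalg.matrix_rank(S_single)

# All genes: S_ij = sum_k (X_ki - Xhat_i)(X_kj - Xhat_j)
S = centered.T @ centered
rank_all = np.linalg.matrix_rank(S)   # at most 60, because rows sum to 1

# The pseudo-inverse ignores the degenerate direction
S_pinv = np.linalg.pinv(S, rcond=1e-10)

# (X_k - Xhat)^T S^+ (X_k - Xhat) for every gene k at once
d = np.einsum("ki,ij,kj->k", centered, S_pinv, centered)
print(rank_single, rank_all, d[:5])
```

An alternative with the same effect would be to drop one codon column before inverting.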

playing around with monte carlo first:

- " generating a random sample of 10,000 sequences from the means and standard deviations of the codon usage of each genome. "

I can get the means and std deviations of codon usage but I guess I need to generate sequences from the mean and standard deviations
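
One possible reading of the quoted step is sketched below: instead of building nucleotide sequences, draw each codon's frequency independently from a normal distribution with that codon's genome mean and standard deviation, then compute the distance of every random vector to the mean. The independence, the normal draws, the toy inputs and the name `monte_carlo_distances` are all assumptions; the paper may generate actual sequences, and negative draws are not handled here.

```python
import numpy as np


def monte_carlo_distances(mean, std, S_inv, n_samples=10_000, seed=0):
    """Distances of random codon-usage vectors drawn around the genome mean."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    # assumption: each codon is drawn independently from N(mean_i, std_i)
    samples = rng.normal(mean, std, size=(n_samples, mean.size))
    diff = samples - mean
    # (x - mean)^T S^-1 (x - mean) for every sample at once
    return np.einsum("ki,ij,kj->k", diff, S_inv, diff)


# Toy inputs: 61 made-up means and standard deviations, diagonal covariance
mean = np.full(61, 1 / 61)
std = np.full(61, 0.005)
S_inv = np.diag(1 / std**2)

d = monte_carlo_distances(mean, std, S_inv)
# reference values that a cutoff on the real genes' distances could use
d_mean, d_std = d.mean(), d.std()
```

With real data, `mean` and `std` would come from the per-codon genome statistics and `S_inv` from the inverse (or pseudo-inverse) of the covariance over all genes.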