# Edge Addition Algorithm - simple implementation example

<font size="3">Run time is about 5 minutes with the default input. \
\
In this example we use the Edge Addition Algorithm (E.A.A.) to build a low-connectivity DCA model. \
The information we have about the training RNA family consists of the sequence alignment and the consensus secondary structure (both through the Covariance Model), and the 3D contacts through the PDB file. </font> 


In [9]:
include("FCSeqTools.jl");

<font size="3">Here is an example of an RF00379 molecule and its associated consensus secondary structure. \
To make the execution faster, we will not generate full-length molecules but only the portion from nucleotide 55 to 102. </font>

In [10]:
natural_sequences = do_number_matrix_prot(do_letter_matrix("CM_130530_MC.fasta"), 0.2);

<font size="3">Here is a segment example with its associated secondary structure. \
The database has a different size because the data-cleaning procedure depends on the region selected. \
Now we run the E.A.A., building up our interaction network edge by edge until we reach a generative model with good performance. \
At each iteration the algorithm reports the added edge, the iteration number, the total number of added edges, and the connectivity as a percentage of the fully connected case. \
At regular intervals (every 20 iterations in the output below) the algorithm also reports the model score (Pearson correlation between natural and artificial two-point correlations), the model mean energy, the logarithm of the model partition function, and the model entropy. </font>
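
The score mentioned above can be sketched as follows. This is a minimal illustration, not the FCSeqTools.jl implementation: `pair_frequencies` and `two_point_score` are hypothetical helper names, sequences are assumed to be stored as integer matrices (one row per sequence, states in `1:q`), and raw pair frequencies are compared, whereas the package may use connected correlations or sequence reweighting.

```julia
using Statistics  # provides cor

# Hypothetical helper: frequency of every state pair (a, b) at every site pair i < j.
# `seqs` is assumed to be an N x L integer matrix with entries in 1:q.
function pair_frequencies(seqs::AbstractMatrix{<:Integer}, q::Integer)
    N, L = size(seqs)
    freqs = Float64[]
    for i in 1:L-1, j in i+1:L
        counts = zeros(q, q)
        for n in 1:N
            counts[seqs[n, i], seqs[n, j]] += 1
        end
        append!(freqs, vec(counts) ./ N)
    end
    return freqs
end

# Pearson correlation between the two-point statistics of two samples.
two_point_score(natural, artificial, q) =
    cor(pair_frequencies(natural, q), pair_frequencies(artificial, q))

# Toy data, for illustration only.
toy_natural    = [1 2 3; 1 2 3; 2 1 3; 2 2 1]
toy_artificial = [1 2 3; 2 1 3; 2 2 1; 1 1 3]
println(two_point_score(toy_natural, toy_natural, 3))      # identical samples
println(two_point_score(toy_natural, toy_artificial, 3))
```

A score close to 1 means the artificial sample reproduces the pairwise statistics of the natural one; comparing a sample with itself gives the upper bound.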

In [11]:
using Random

n_step = 100_000
method = "full_edge"

s = time()
Random.seed!(2) 
#                                                                                                 #21            #12000                                          #stats  #cumulative
score, likelihood_gain, generated_sequences, Jij, h, contact_list, site_degree, edge_list = E_A_A(21, n_step, 0.05, 12000, natural_sequences, "example_output.txt", method); 

s = time() - s

Fully connected model has 4560 edges, 2010960 elements and a score around ~ 0.95

iteration = 20,   Score = 0.164

 <E> = 196.36,  log(Z) = 0.82,   S = 197.18
 edges: 20,   elements: 8820,   edge complexity: 0.44 %,  elements complexity: 0.44 %

iteration = 40,   Score = 0.286

 <E> = 185.3,  log(Z) = 1.62,   S = 186.92
 edges: 40,   elements: 17640,   edge complexity: 0.88 %,  elements complexity: 0.88 %

iteration = 60,   Score = 0.413

 <E> = 175.23,  log(Z) = 2.44,   S = 177.67
 edges: 60,   elements: 26460,   edge complexity: 1.32 %,  elements complexity: 1.32 %

iteration = 80,   Score = 0.488

 <E> = 165.46,  log(Z) = 3.16,   S = 168.61
 edges: 80,   elements: 35280,   edge complexity: 1.75 %,  elements complexity: 1.75 %

iteration = 100,   Score = 0.566

 <E> = 157.06,  log(Z) = 3.93,   S = 160.99
 edges: 100,   elements: 44100,   edge complexity: 2.19 %,  elements complexity: 2.19 %

iteration = 120,   Score = 0.63

 <E> = 149.28,  log(Z) = 4.86,   S = 154.14
 edges: 120,   e

4250.3807780742645
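
The counts printed in the log follow from simple bookkeeping. The sketch below assumes `L = 96` sites and `q = 21` states per site, with one `q × q` coupling block per edge; these values are inferred from the printed totals and are not stated by the notebook, and the helper names are illustrative. It also checks that the logged values satisfy S = <E> + log(Z), the identity that holds for a Boltzmann distribution P(s) = exp(-E(s))/Z.

```julia
# Number of site pairs in a fully connected model with L sites.
n_edges_full(L) = L * (L - 1) ÷ 2

# Each edge carries one q x q coupling block.
n_elements(n_edges, q) = n_edges * q^2

# Connectivity of a sparse model as a percentage of the fully connected one.
edge_complexity(n_edges, L) = 100 * n_edges / n_edges_full(L)

L, q = 96, 21   # assumed: inferred from the totals in the log above

println(n_edges_full(L))                           # 4560
println(n_elements(n_edges_full(L), q))            # 2010960
println(n_elements(20, q))                         # 8820
println(round(edge_complexity(20, L), digits=2))   # 0.44

# (<E>, log(Z), S) triples copied from the log; the identity holds up to rounding.
logged = [(196.36, 0.82, 197.18), (185.3, 1.62, 186.92), (149.28, 4.86, 154.14)]
for (E, logZ, S) in logged
    println(isapprox(E + logZ, S; atol=0.011))
end
```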

<font size="3">The model obtained has a performance comparable to the fully connected DCA while having just ~20% of its connectivity. The entropy of the model is 35.08. This means that it can generate on the order of e³⁵ (about 1.6×10¹⁵) different 55-102 segments for the RF00379 family. \
Now we can test our artificial sequences. We run the classical statistical checks: the PCA projection and the two-point correlation comparison. \
We test the performance of our model against that of the Covariance Model. The CM only contains trivial one-point and secondary-structure information, so our model should outperform it. </font>
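
The conversion from entropy to an effective number of sequences is a one-line computation. A small sketch, assuming the entropy is expressed in nats (natural logarithm); `effective_count` is an illustrative name, not a package function.

```julia
# exp(S) is the size of a uniform ensemble with the same entropy S (in nats).
effective_count(S) = exp(S)

S = 35.08                      # entropy quoted in the text
n_eff = effective_count(S)
println(n_eff)                 # about 1.7e15
println(S / log(10))           # the same number expressed as a power of ten
```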

In [None]:
cm_sequences = rna_cm_model_generation(0.8,0.05,7000,natural_sequences,ss_contact_matrix);


In [None]:
plot_stat_check(natural_sequences, generated_sequences, cm_sequences)

<font size="3">The E.A.A. artificial molecules are practically statistically indistinguishable from the natural ones. They have a very similar PCA projection (the artificial one looks richer only because we have more artificial sequences than natural ones), while the Covariance Model fails to capture the details of the distribution. 
The selected model has almost perfect two-point statistics for all site pairs, while the CM only captures them for the pairs involved in secondary-structure contacts. \
</font>


<font size="3">Interpretability is one of the main reasons for our quest to find parsimonious generative models. Now that we are sure we have obtained a good generative model with relatively few parameters, we can try to interpret them. \
Dividing the added edges into secondary-structure contacts and 3D contacts, we have: </font>

In [None]:
edge_interpretation_plot(len,ss_contact_matrix,tertiary_contact_matrix,edge_list[1:50,:])

<font size="3">We see that the secondary-structure contacts are picked up in the first few iterations. Many of the added edges connect neighbouring sites, probably because of phylogenetic effects. It is striking that some 3D contacts (in particular around site 40) appear before the NONE edges. This suggests that our algorithm effectively captures some information about the tertiary structure. \
These results, which are far more general than this simple example, suggest that the added edges have a co-evolutionary interpretation. </font>
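
The classification behind this kind of analysis can be sketched as below. This is an illustrative version, not the code of `edge_interpretation_plot`: it assumes each row of the edge list holds the two site indices of an edge, that the contact matrices are symmetric Boolean arrays, and that a pair within a hypothetical `neighbour_range` along the chain counts as neighbouring. The order of precedence between the labels is also an assumption.

```julia
# Hypothetical helper: label one edge (i, j).
function classify_edge(i, j, ss_contacts, tertiary_contacts; neighbour_range=2)
    if ss_contacts[i, j]
        return "secondary"
    elseif tertiary_contacts[i, j]
        return "3D"
    elseif abs(i - j) <= neighbour_range
        return "neighbour"
    else
        return "NONE"
    end
end

# Toy example with 6 sites: one base pair (1, 6) and one 3D contact (2, 5).
n_sites = 6
ss = falses(n_sites, n_sites);  ss[1, 6] = ss[6, 1] = true
td = falses(n_sites, n_sites);  td[2, 5] = td[5, 2] = true
toy_edges = [1 6; 2 5; 3 4; 1 4]

labels = [classify_edge(toy_edges[k, 1], toy_edges[k, 2], ss, td)
          for k in 1:size(toy_edges, 1)]
println(labels)
```

Counting the labels in the order the edges were added shows which kind of contact the algorithm selects first.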

<font size="3">
This notebook serves as an example of the application of the techniques described in the main text. </font>
