## Plot showing how useful the DMS constraints are for RNAstructure, across different datasets

Run RNAstructure with and without the DMS constraints, then compute the F1 score between the two predicted structures. A high F1 score means that the DMS constraints add little information, because RNAstructure already makes a good prediction without them.
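The comparison between the two runs can be illustrated with a small sketch. The notebook only loads precomputed `F1` values, so the details here are assumptions: structures are taken to be in dot-bracket notation without pseudoknots, the F1 score is computed over sets of base pairs, and two structures with no pairs at all are scored 1.0 by convention. The function names are hypothetical.

```python
def pairs_from_dot_bracket(structure):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack = []
    pairs = set()
    for i, ch in enumerate(structure):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            pairs.add((stack.pop(), i))
    return pairs


def f1_between_structures(struct_a, struct_b):
    """F1 score between the base-pair sets of two structures."""
    pairs_a = pairs_from_dot_bracket(struct_a)
    pairs_b = pairs_from_dot_bracket(struct_b)
    if not pairs_a and not pairs_b:
        # Assumed convention: two fully unpaired structures agree perfectly.
        return 1.0
    shared = len(pairs_a & pairs_b)
    # F1 = 2*TP / (2*TP + FP + FN) = 2*|A & B| / (|A| + |B|)
    return 2 * shared / (len(pairs_a) + len(pairs_b))


# Toy structures, standing in for predictions with and without DMS constraints
with_dms = "((..))"
without_dms = "(....)"
score = f1_between_structures(with_dms, without_dms)
```

Because the score is symmetric in its two arguments, it does not matter which run is treated as the reference.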

One figure with two subplots:
- First subplot: violin or box plot of the F1 score distribution, colored by dataset (Ribonanza, UTR, pri-miRNA, ArchiveII, RNAstralign)
- Second subplot: the same distributions, shown as histograms overlaid on a single plot (*might be redundant*)

**Assigned to**: Alberic

Use Plotly, with a white background.

In [30]:
import pandas as pd

results = pd.read_feather('../Figure2/saved_data_plot/results_benchmark_dataset.feather').set_index('reference')
results.loc[results['dataset'] == 'pri_miRNA', 'dataset'] = 'pri-miRNA'
results.loc[results['dataset'] == 'human_mRNA', 'dataset'] = 'mRNA'


# Violin plot per dataset
import plotly.graph_objects as go

fig = go.Figure()
dataset_names = ['ribonanza', 'pri-miRNA', 'mRNA', 'archiveII', 'RNAstralign']
# Append the number of sequences to each dataset label, e.g. "mRNA (N=1,234)"
counts = results['dataset'].value_counts()
results['dataset'] = results['dataset'].map(lambda x: '{} (N={:,})'.format(x, counts[x]))

# One violin per dataset: mean shown as a line, individual points hidden
for dataset in results['dataset'].unique():
    mask = results['dataset'] == dataset
    fig.add_trace(go.Violin(x=results.loc[mask, 'dataset'],
                            y=results.loc[mask, 'F1'],
                            box_visible=False,
                            points=False,
                            meanline_visible=True))

fig.update_layout(
    # title="RNAstructure prediction performance on different datasets",
    width=800, height=400,
    yaxis_range=[0, 1],
    xaxis_title="",
    yaxis_title="F1 score",  # between RNAstructure predictions w/ and w/o chemical probing
    template='plotly_white', font_size=15, font_color='black',
    showlegend=False,
)
# Hide the horizontal grid lines and set evenly spaced ticks
fig.update_yaxes(showgrid=False, tickvals=[0, 0.2, 0.4, 0.6, 0.8, 1])
fig.show()

In [31]:
# Save the figure as a PDF (static image export requires the kaleido package)
fig.write_image("images/S1/a_benchmark_full_dataset.pdf")