# Read me
This template is a starting point for your own analysis of DREEM output data.

- To install this library, see the installation instructions in the [Git repo](https://github.com/yvesmartindestaillades/NAP).
- To learn how to use this library, work through the [tutorial](tutorial.ipynb).


# Your Project Name Here

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from os.path import exists, dirname
import os, sys
import string
import seaborn as sns


sys.path.append(os.path.abspath(".."))

from nap import data_wrangler, firebase, plot, utils

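# Make the local copy of dreem importable, if it is present
dreem_path = dirname('libs/dreem/dreem')
if exists(dreem_path):
    sys.path.append(dreem_path)
else:
    print("libs/dreem not found. If dreem isn't installed on your computer, the code won't run")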

# Step 1: Data wrangling
### Step 1.1: Define your study and some basics about your project

In [5]:
# Set your username for the database (at the moment, keep Yves)
username = 'TO DO'

# Select your study
study = 'tutorial' 

## Set your base coverage high-pass filter value
min_bases_cov = 1000 

# Set the resolution for the plots
mpl.rcParams['figure.dpi'] = 860 # the higher the resolution, the slower the plotting

# Depending on the study you select, you'll get a series of tubes. You can also create new studies using this dictionary.
# Here's an example.
tubes_per_study = {   
    'tutorial':             ['A6', 'D6'],
    'replicates':           ['C5', 'A4', 'F4', 'A6', 'A7'],
    'salt':                 ['A6', 'B6', 'C6', 'D6', 'E6'], 
    'temperature':          ['D7', 'E7', 'F7', 'G7', 'H7', 'A8', 'B8', 'C8'], 
    'magnesium':            ['F6', 'G6', 'H6', 'A7', 'B7', 'C7'],
    '60 mM DMS kinetics':   ['D8', 'E8', 'F8', 'G8', 'H8', 'A9'],
    'all_tubes': [ele for ele in [f"{a}{b}" for a in string.ascii_uppercase[0:8] for b in range(1,11)] \
                if ele not in ['C3','C10','D10','E10','F10','G10','H10', 'E4']]
                + ['C5_realignment_v3']
    }

tubes = tubes_per_study[study]
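
To see what the `all_tubes` entry expands to, and how a new study could be registered, here is a minimal standalone sketch. It assumes that tube names follow a plate layout with rows A to H and columns 1 to 10, as the comprehension above suggests. The `my_new_study` key and its tube list are hypothetical.

```python
import string

# Rows A-H and columns 1-10, as in the 'all_tubes' comprehension above
rows = string.ascii_uppercase[0:8]
all_wells = [f"{row}{col}" for row in rows for col in range(1, 11)]

# Tubes excluded in the 'all_tubes' entry above
excluded = ['C3', 'C10', 'D10', 'E10', 'F10', 'G10', 'H10', 'E4']
all_tubes = [well for well in all_wells if well not in excluded] + ['C5_realignment_v3']

# Hypothetical new study, registered next to an existing one
tubes_per_study = {
    'tutorial': ['A6', 'D6'],
    'my_new_study': ['A6', 'B6'],  # hypothetical name and tubes
}

# Quick check that every tube of the new study is a known tube name
unknown = [t for t in tubes_per_study['my_new_study'] if t not in all_tubes]
print(len(all_tubes), unknown)
```

A check like this catches a mistyped tube name before anything is requested from the database.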

### Step 1.2: Process new pickle files and push them to Firebase
- Select which tubes you want to push to Firebase. To automatically plot arrays of tubes, see the [tutorial](tutorial.ipynb), section 3.2.
- Process the tubes and push them to Firebase.

In [7]:
## Pickle files to process and to push to Firebase
# Can be tubes if you want to process the tubes from your study, or [] if they are already on the database 
pickles_list = [] 

pickles = data_wrangler.generate_pickles(path_to_data='data/FULLSET',
                                         pickles_list=pickles_list)

# Indicate the location of your RNA structure file
RNAstructureFile = 'data/RNAstructureFile.csv'

# Default location for your local database (JSON file)
json_file = 'data/db.json'

# If new pickle files were given, push them to Firebase; the data is pulled back in Step 1.3
if len(pickles): 
    data_wrangler.push_pickles_to_firebase(pickles = pickles,
                                            RNAstructureFile = RNAstructureFile,
                                            min_bases_cov = min_bases_cov, 
                                            username=username)

### Step 1.3: Pull the data from the Firebase and clean/reformat it.
`df` is used for the analysis. It keeps only the constructs that have more than `min_bases_cov` (here 1000) reads in every tube.  
`df_full` is used for the quality analysis. It keeps, for each tube individually, all constructs with more than `min_bases_cov` valid reads.
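
The difference between the two filters can be illustrated on a toy table. This is only a sketch of the idea, not the implementation of `data_wrangler.clean_dataset`: the `coverage` column name, the toy values, and the use of `>=` for the threshold are assumptions made for illustration.

```python
import pandas as pd

min_bases_cov = 1000

# Toy data: one row per (tube, construct) pair; 'coverage' is a hypothetical column name
toy = pd.DataFrame({
    "tube":      ["A6", "A6", "A6", "D6", "D6", "D6"],
    "construct": ["c1", "c2", "c3", "c1", "c2", "c3"],
    "coverage":  [1500,  800, 2000, 1200, 1500,  900],
})

# Per-tube filter (the idea behind df_full): keep each row that passes on its own
toy_full = toy[toy["coverage"] >= min_bases_cov]

# Strict filter (the idea behind df): keep constructs that pass in every tube
n_tubes = toy["tube"].nunique()
passing_tubes = toy_full.groupby("construct")["tube"].nunique()
kept_constructs = passing_tubes[passing_tubes == n_tubes].index
toy_strict = toy_full[toy_full["construct"].isin(kept_constructs)]

print(toy_full)
print(toy_strict)
```

In this toy example, `c2` and `c3` each pass in only one tube, so they remain in the per-tube table but are dropped from the strict one.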

In [8]:
# Pull the firebase
df_rough = firebase.load(tubes=tubes, username=username)

# Clean and reformat the dataset
df, df_full = data_wrangler.clean_dataset(df_rough=df_rough,
                                             tubes=tubes, 
                                             min_bases_cov=min_bases_cov)

Load data from Firebase

Tube A6 not found on Firebase

Tube D6 not found on Firebase
Tubes ['A6', 'D6'] couldn't be loaded from Firebase


ValueError: No objects to concatenate
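
The `ValueError` above carries the message that pandas raises when `pd.concat` receives an empty list, which is consistent with none of the requested tubes having been loaded. That reading is an assumption about how the loader assembles its result, not a confirmed detail of `nap`. A minimal sketch of the failure and of a possible guard, where `frames_per_tube` and `concat_or_explain` are hypothetical names:

```python
import pandas as pd

frames_per_tube = []  # hypothetical: no tube was found, so there is nothing to combine

try:
    pd.concat(frames_per_tube)
    message = None
except ValueError as err:
    message = str(err)

print(message)


def concat_or_explain(frames, tubes):
    """Hypothetical guard: fail early with a message that names the tubes."""
    if not frames:
        raise ValueError(f"None of the tubes {tubes} could be loaded; "
                         "check the username and the tube names.")
    return pd.concat(frames)
```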

# Step 2: Data quality analysis

It's always hard to realize that you were analysing noise. Here, we'll go through a series of plots to check the sanity of the data.

### Get the list of tubes and constructs:

In [None]:
print(f"tubes are: {tubes}")
print(f"constructs are: {df.construct.unique()}")

### Explore the data
`utils.get_roi_info(df=df, tube=tube, construct=construct)` gives information about the ROI of a (tube, construct) pair.

In [None]:
tube, construct = utils.rand_tube_construct(df)
utils.get_roi_info(df=df, tube=tube, construct=construct).xs((True, '0'),level=('paired','roi_structure_comparison'))   
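
The `.xs(...)` call above is the pandas cross-section selector: it keeps the rows whose `paired` level is `True` and whose `roi_structure_comparison` level is `'0'`, and drops those two levels from the result. Here is a small sketch on a toy table. Only the two level names come from the call above; the `position` level, the `mut_rate` column, and the values are hypothetical.

```python
import pandas as pd

# Toy table indexed by three levels; 'position' and 'mut_rate' are hypothetical
index = pd.MultiIndex.from_tuples(
    [(True, '0', 1), (True, '1', 2), (False, '0', 3), (True, '0', 4)],
    names=['paired', 'roi_structure_comparison', 'position'],
)
toy_roi = pd.DataFrame({'mut_rate': [0.01, 0.05, 0.08, 0.02]}, index=index).sort_index()

# Same selection pattern as in the cell above
selected = toy_roi.xs((True, '0'), level=('paired', 'roi_structure_comparison'))
print(selected)
```

The result is indexed by the remaining level only, here `position`.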

### Plot the base coverage per construct distribution

In [None]:
plot.base_coverage_for_all_constructs(df=df_full, 
                                      min_bases_cov=min_bases_cov)

### Sanity-check (tube, construct)-wise base coverage plots
Plot randomly picked sequences to check the quality of the data.

In [None]:
plot.random_9_base_coverage(df=df, 
                            min_bases_cov=min_bases_cov)

### Specific (tube, construct) base coverage plot
Plot specified (tube, construct) to check its quality.

In [None]:
plot.base_coverage(df, tube, construct, min_bases_cov=min_bases_cov, figsize=(15,7))

### Heatmap of the ROI coverage

In [None]:
plot.heatmap(df = df, 
             column="cov_bases_roi")

### Heatmap of the second half coverage

In [None]:
plot.heatmap(df = df, 
             column="cov_bases_sec_half")

# Step 3: Data analysis
At this point, we know that the data is good, and we want to visualize it through different plots.

### Analysis parameters

In [None]:
# Display the plots in this notebook? Not recommended if there are many plots
show_plots = True

# Constructs used
a_few_constructs = df.construct.unique()[:3].tolist()
first_construct = df.construct.unique().tolist()[0]
constructs_per_name = {
    'all_constructs': df.construct.unique().tolist(),
    str(a_few_constructs) : a_few_constructs,
    str(first_construct): [first_construct]
}

# Select constructs here
constructs_name = str(a_few_constructs)

# Define what you will analyse
constructs = constructs_per_name[constructs_name]

### Big script to run every selected function

In [None]:
# Analysis run in this script
analysis = {'base_per_base_partition':False,
            'base_per_base_sequence': True,
            'deltaG': True,
            'tube_comparison':False,
            'columns_csv': True,
            'deltaG_construct': True
            }


# Write here a script to run all of your plots
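
One possible shape for such a script is a small dispatcher that runs only the analyses whose flag is `True`. The sketch below is standalone: the step functions are hypothetical placeholders that only return a string, and in a real script each would wrap the corresponding plotting or export cell.

```python
# Same flags as in the cell above
analysis = {'base_per_base_partition': False,
            'base_per_base_sequence': True,
            'deltaG': True,
            'tube_comparison': False,
            'columns_csv': True,
            'deltaG_construct': True
            }


def make_placeholder_step(name):
    """Build a hypothetical step; replace the body with the real plotting code."""
    def step():
        return f"ran {name}"
    return step


# Map each analysis name to the function that performs it
steps = {name: make_placeholder_step(name) for name in analysis}

# Run only the enabled analyses, in the order of the dictionary
results = [steps[name]() for name, enabled in analysis.items() if enabled]
print(results)
```

Keeping the flags and the functions in two dictionaries with the same keys makes it easy to switch an analysis on or off without touching the code that runs it.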

### Mutation sequence-wise

`plot.mutation_rate(df, tube, construct, plot_type, index, normalize)` plots the mutation rate base-wise for a given construct of a given tube as a barplot. 
Arguments:
- `plot_type` :
    - `'sequence'` : each bar is colored according to the base of the original sequence.
    - `'partition'` : each bar shows the breakdown of which bases this base mutates into.
- `index`:
    - `'index'`: each base is identified with its position number
    - `'base'`: each base is identified with its type (A, C, G, T)
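
To make the `'sequence'` plot type and the two `index` options concrete, here is a standalone matplotlib sketch of the idea. It is not the code of `plot.mutation_rate`: the toy sequence, the mutation rates, the color mapping, the 1-based positions, and the `bar_labels` helper are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch also runs outside a notebook
import matplotlib.pyplot as plt

# Hypothetical toy data
sequence = "ACGTAC"
mut_rates = [0.02, 0.05, 0.001, 0.0, 0.03, 0.04]
base_colors = {"A": "red", "C": "blue", "G": "orange", "T": "green"}


def bar_labels(sequence, index):
    """Label each bar by its position ('index') or by its base ('base')."""
    if index == "index":
        return [str(i) for i in range(1, len(sequence) + 1)]  # 1-based positions (assumption)
    if index == "base":
        return list(sequence)
    raise ValueError(f"unknown index option: {index}")


fig, ax = plt.subplots()
positions = range(len(sequence))
# 'sequence' style: one bar per base, colored by the base of the original sequence
ax.bar(positions, mut_rates, color=[base_colors[base] for base in sequence])
ax.set_xticks(list(positions))
ax.set_xticklabels(bar_labels(sequence, "index"))
ax.set_ylabel("mutation rate")
```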

In [None]:
for construct in constructs:
    for tube in tubes:
        plot.mutation_rate(df=df,
                           tube=tube,
                           construct=construct,
                           plot_type='sequence',
                           index='index')
        plot.save_fig(path=f"data/output/date/{study}/mut_per_base/sequence/{construct}/", 
                    title=f"base_per_base_sequence_{tube}_{construct}")
        if not show_plots:
            plt.close()  # close the current figure instead of displaying it

### DeltaG plots

In [None]:
for tube in tubes:
    plot.deltaG(df=df, tube=tube)

    plot.save_fig(path=f"data/output/date/{study}/deltaG/", 
             title=f"deltaG_{tube}")

    if not show_plots:
        plt.close()  # close the current figure instead of displaying it

### Tubes correlation

In [None]:
for construct in constructs:
    plot.compare_n_tubes(df, tubes, construct)
    plot.save_fig(path=f"data/output/date/comparison/{study}", 
                  title=f"comparison_{study}_{construct}")
    if not show_plots:
        plt.close()  # close the current figure instead of displaying it
    print(construct, end=' ')

### Save columns to a csv file

In [None]:
utils.columns_to_csv(df=df,
                   tubes=tubes,
                   columns=['tube', 'construct','full_sequence','roi_sequence','mut_bases','info_bases'],
                   title=f'seq_and_reactivity_{study}',
                   path=f'data/output/date/{study}'
                   )
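
As a rough picture of what such an export amounts to, the sketch below selects the requested tubes and columns from a toy table and writes them with pandas. It is not the implementation of `utils.columns_to_csv`; the toy table and the temporary output folder are assumptions. Note that a `{study}` placeholder is only filled in when the string is an f-string.

```python
import os
import tempfile

import pandas as pd

study = 'tutorial'
tubes = ['A6', 'D6']
columns = ['tube', 'construct', 'full_sequence']

# Hypothetical toy table with one tube and one column that should not be exported
toy = pd.DataFrame({
    'tube': ['A6', 'D6', 'B6'],
    'construct': ['c1', 'c1', 'c1'],
    'full_sequence': ['ACGT', 'ACGT', 'ACGT'],
    'unused_column': [1, 2, 3],
})

# Write into a temporary folder so the sketch does not touch the project data
out_dir = os.path.join(tempfile.mkdtemp(), f"data/output/date/{study}")
os.makedirs(out_dir, exist_ok=True)
out_file = os.path.join(out_dir, f"seq_and_reactivity_{study}.csv")

# Keep only the selected tubes and columns, then save without the row index
toy[toy['tube'].isin(tubes)][columns].to_csv(out_file, index=False)

written = pd.read_csv(out_file)
print(written)
```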

### Save construct vs deltaG 

In [None]:
utils.deltaG_vs_construct_to_csv(df=df, title="deltaG_vs_construct.csv", path=f"data/output/date/{study}", tubes=tubes)