# Imports

On first use, if dreem isn't installed on your computer, open a terminal and run the following commands:
```
$ cd [YOUR PATH TO THIS NAP REPO]
$ cd libs
$ git clone https://github.com/jyesselm/dreem
```
Then import the usual libraries.

In [None]:
import os
import sys
from os.path import exists, dirname

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from nap import *

# Make the local dreem clone importable
sys.path.append(dirname('libs/dreem/dreem'))
if not exists('libs/dreem'):
    print("If dreem isn't installed on your computer, the code won't run")

# Step 1.1: Data wrangling

### Resources used

Here, we set the username (for now, we'll call you Yves, it's a nice name). This is your main folder in the database. Check it yourself [on the database!](https://console.firebase.google.com/u/0/project/dreem-542b7/database/dreem-542b7-default-rtdb/data)

The **tubes** that you choose here will be pulled from the database. Every tube corresponds to a physical tube, also known as an "experiment" during the wet lab part.  

The **constructs** are specific RNA sequences. They are referred to by their name, such as 8584 or 9572, and each tube has the same series of constructs.

A **study** is a group of tubes that are relevant to be studied together. For example, they are all replicates, or the salt concentration was increased along the tubes, etc.

The **pickles** are a dictionary that maps each tube's name to its path+title.

Set **switch_study** to True when you use another study for the first time. This will remove your former local json file and download a new one from Firebase.
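
As a rough illustration of what `switch_study` controls, the local-cache logic can be sketched as below. This is a minimal sketch, not nap's implementation: `load_with_cache` and `fake_fetch` are hypothetical names, and a plain dictionary stands in for the Firebase download.

```python
import json
import os
import tempfile

def load_with_cache(cache_path, fetch, refresh=False):
    """Hypothetical helper: return cached JSON, calling `fetch` only when no cache exists."""
    if refresh and os.path.exists(cache_path):
        os.remove(cache_path)              # drop the stale local copy
    if not os.path.exists(cache_path):
        cache_dir = os.path.dirname(cache_path)
        if cache_dir:
            os.makedirs(cache_dir, exist_ok=True)
        data = fetch()                     # stand-in for the database download
        with open(cache_path, "w") as handle:
            json.dump(data, handle)
        return data
    with open(cache_path) as handle:
        return json.load(handle)

# Toy usage: count how many times the "download" really happens.
calls = []

def fake_fetch():
    calls.append(1)
    return {"A6": {"reads": 1200}}

cache_file = os.path.join(tempfile.mkdtemp(), "data", "db.json")
first = load_with_cache(cache_file, fake_fetch, refresh=True)   # downloads
second = load_with_cache(cache_file, fake_fetch)                # served from disk
```

With `refresh=False`, the second call never touches the download function, which is why the switch only needs to be True the first time you change study.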

In [None]:
## EDIT THIS ZONE 
username = 'Yves'
study = 'tutorial'  
switch_study = True
# END OF EDIT ZONE

## Database path
json_file = 'data/db.json'

## Constants
min_bases_cov = 1000 
mpl.rcParams['figure.dpi'] = 860 # the higher the resolution, the slower the plotting

tubes_per_study = {   'tutorial':['A6','D6'],
                      'replicates':['C5','A4' , 'F4', 'A6', 'A7'],
                      'salt': ['A6','B6','C6','D6','E6'], 
                      'temperature':['D7','E7','F7','G7','H7','A8','B8','C8'], 
                      'magnesium':['F6', 'G6', 'H6', 'A7', 'B7', 'C7'],
                      '60 mM DMS kinestics':['D8', 'E8', 'F8', 'G8', 'H8', 'A9']
                      }

tubes = tubes_per_study[study]

### Load the data

![figure](tutorial_pics/firebase_schematics.png)

In [None]:
# If the study changed, remove the former dataset
if switch_study:
    try:
        os.remove(json_file)
    except FileNotFoundError:
        print('No json file to delete')

# If there is no local copy of the Firebase data, pull it; otherwise, load your copy
if not exists(json_file):
    if not exists('data'):
        os.mkdir('data')
    df_rough = data_wrangler.load_data_from_firebase(tubes=tubes, username=username)
    data_wrangler.dump_dict_json(JSONFileDict=json_file,
                                 df=df_rough)
else:
    df_rough = data_wrangler.load_dict_json(json_file)

If everything went well, a json file has now been downloaded as `data/db.json`. Next, we'll extract two dataframes from this file: `df_full` for data quality analysis and `df` for data analysis. Check out the difference below.

### Clean and reformat the dataset. 
`df` is used for the analysis. Each of its constructs has more than 1000 reads in every tube.     
`df_full` is used for data quality analysis. It has all constructs with more than 1000 valid reads, for each tube individually.
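
The difference between the two filters can be sketched on a toy table. This is only an illustration of the idea, not the code of `clean_dataset`: the column name `worst_cov` is hypothetical, and treating the threshold as inclusive (`>=`) is an assumption.

```python
import pandas as pd

min_bases_cov = 1000

# Toy data: construct 9572 has too few reads in tube D6 only.
toy = pd.DataFrame({
    "tube":      ["A6", "A6", "D6", "D6"],
    "construct": ["8584", "9572", "8584", "9572"],
    "worst_cov": [1500, 2000, 1200, 400],   # hypothetical column name
})

# df_full-like filter: each (tube, construct) row is judged on its own.
toy_full = toy[toy["worst_cov"] >= min_bases_cov]

# df-like filter: keep only constructs that pass in every tube.
n_tubes = toy["tube"].nunique()
tubes_passed = toy_full.groupby("construct")["tube"].nunique()
kept = tubes_passed[tubes_passed == n_tubes].index
toy_df = toy_full[toy_full["construct"].isin(kept)]
```

Here `toy_full` keeps construct 9572 for tube A6, while `toy_df` drops it entirely because it fails in D6.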

In [None]:
df, df_full = turner_overthrow.clean_dataset(df_rough=df_rough,
                                             tubes=tubes, 
                                             min_bases_cov=min_bases_cov)

# Step 1.2: Data quality analysis

It's always hard to realize that you were analysing noise. Here, we'll go through a series of plots to check the data sanity.

### Show the tube's quantity of valid structures (good indicator of the tube's quality)

In [None]:
plot.valid_construct_per_tube(df=df_full,
                              min_bases_cov=min_bases_cov)

### Show the tube coverage distribution

In [None]:
plot.tube_coverage_distribution(df=df_full)

### Plot the base coverage per construct distribution

In [None]:
plot.base_coverage_for_all_constructs(df=df_full, 
                                      min_bases_cov=min_bases_cov)

### Sanity-check construct-wise base coverage plots
Plot randomly picked sequences to check the quality of the data.

In [None]:
plot.random_base_coverage_plot_wise(df=df, 
                                    min_bases_cov=min_bases_cov)

### Heatmap of the var part coverage

In [None]:
plot.heatmap(df = df, 
             column="cov_bases_var")

### Heatmap of the second half coverage

In [None]:
plot.heatmap(df = df, 
                column="cov_bases_sec_half")

# Step 1.3: Your turn to play

These plots showed the tutorial study. Now try to:
- Change the study to another study, such as `'temperature'`, and replot this test routine
- Write a new study called `'my_new_study'`, using the tubes `['C1','D5','E6','F7']`, and replot this data sanity test routine.

# Step 2: Data Analysis
At this point, we know that our data is good, and we want to explore it through different plots. Let's go through them.

So far, we've seen that we analyse our data through tubes and constructs. Each plot type requires a (tube, construct) pair, a given tube, or a given construct. For example, a deltaG plot is tube-wise, because it shows all of the constructs of a given tube. 

### Step 2.1: Get the list of tubes and constructs:

`tubes` comes from your previous study choice, and is the list of the tubes that you want to use.

`df.construct.unique()` gives you the list of constructs.
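
To make the two lists concrete, here is a small sketch on a toy dataframe (the values are made up for illustration). It also shows how the (tube, construct) pairs used by some plot types can be enumerated with `itertools.product`.

```python
from itertools import product

import pandas as pd

toy = pd.DataFrame({
    "tube":      ["A6", "A6", "D6", "D6"],
    "construct": ["8584", "9572", "8584", "9572"],
})

toy_tubes = ["A6", "D6"]
toy_constructs = toy.construct.unique()      # unique values, in order of appearance

# Every (tube, construct) pair that a pair-wise plot could be called with.
pairs = list(product(toy_tubes, toy_constructs))
```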

In [None]:
print(f"tubes are: {tubes}")
print(f"constructs are: {df.construct.unique()}")

### Step 2.2: Explore the data
`utils.get_var_info(df=df, tube=tube, construct=construct)` gives information about the variable part of a (tube, construct) pair.

Let's explore the data using the tubes and constructs lists from the previous step.

In [None]:
# Select a (tube, construct) pair
tube = tubes[0] 
construct = df.construct.unique()[0]

utils.get_var_info(df=df, tube=tube, construct=construct).xs((True, '0'),level=('paired','var_structure_comparison'))
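
The `.xs(...)` call above is a pandas cross-section on a MultiIndex. The sketch below reproduces it on a toy frame; the level names `paired` and `var_structure_comparison` come from the cell above, while the `position` level and the `mut_rate` values are invented for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(True, "0", 1), (True, "1", 2), (False, "0", 3), (True, "0", 4)],
    names=["paired", "var_structure_comparison", "position"],
)
toy = pd.DataFrame({"mut_rate": [0.01, 0.02, 0.05, 0.03]}, index=index)

# Keep rows where paired is True and var_structure_comparison is '0';
# the two selected levels are dropped from the resulting index.
selected = toy.xs((True, "0"), level=["paired", "var_structure_comparison"])
```

Only the rows at positions 1 and 4 match both conditions, so `selected` is indexed by `position` alone.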

### Step 2.3: DeltaG plots
Step 2.3.1: Let's start with a first plot, deltaG. DeltaG plots the mutation frequency of the paired bases of the variable part of each construct for a given tube. Give this function a tube and plot it! 

In [None]:
plot.deltaG(df=df,
            tube= "EDIT ME")

Step 2.3.2: How about saving this plot directly to your files? Use the following code to save your plot to your files.

In [None]:
tube = 'EDIT ME'

plot.deltaG(df=df, tube=tube)

plot.save_fig(path=f"data/figs/date/{study}/deltaG/", 
                title=f"deltaG_{tube}")
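
For readers curious about what saving to a nested folder involves, here is a minimal sketch with plain matplotlib. It is not the code of `plot.save_fig`: `save_current_figure` is a hypothetical helper, and the PNG extension is an assumption.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")   # headless backend so the sketch runs anywhere; skip this in a notebook
import matplotlib.pyplot as plt

def save_current_figure(path, title):
    """Hypothetical helper: create `path` if needed, then save the active figure."""
    os.makedirs(path, exist_ok=True)          # nested folders are created in one go
    out_file = os.path.join(path, f"{title}.png")
    plt.savefig(out_file)
    return out_file

root = tempfile.mkdtemp()                     # throwaway folder for the demo
plt.plot([0, 1, 2], [0.01, 0.03, 0.02])
saved = save_current_figure(os.path.join(root, "figs", "tutorial", "deltaG"),
                            "deltaG_A6")
plt.close()
```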

Step 2.3.3: Let's say that you want to save the plots for all of your tubes. Let's make a loop for that.

In [None]:
for tube in tubes:
    plot.deltaG(df=df, tube=tube)

    plot.save_fig(path=f"data/figs/date/{study}/deltaG/", 
                  title=f"deltaG_{tube}")

Step 2.3.4: These plots are a bit overwhelming, right? Just close them right after saving them to your files.

In [None]:
for tube in tubes:
    plot.deltaG(df=df, tube=tube)

    plot.save_fig(path=f"data/figs/date/{study}/deltaG/", 
                  title=f"deltaG_{tube}")

    plt.close()
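
Closing figures matters for more than screen clutter: matplotlib keeps every open figure in memory. The sketch below, using only matplotlib and dummy data, counts the open figures with and without `plt.close()`.

```python
import matplotlib
matplotlib.use("Agg")   # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

plt.close("all")        # start from a clean state

# Without close(): figures pile up.
for tube in ["A6", "D6"]:
    plt.figure()
    plt.plot([0, 1], [0, 1])
open_without_close = len(plt.get_fignums())

plt.close("all")

# With close(): each figure is released right after use.
for tube in ["A6", "D6"]:
    plt.figure()
    plt.plot([0, 1], [0, 1])
    plt.close()
open_with_close = len(plt.get_fignums())
```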

It's been a long way together! Let's apply our new knowledge to another plot type.

### Step 2.4: Mutation sequence-wise

`plot.mutation_rate(df, tube, construct, plot_type, index, normalize)` plots the base-wise mutation rate for a given (tube, construct) pair as a barplot. 
Arguments:
- `plot_type` :
    - `'sequence'` : each bar is colored according to the base of the original sequence.
    - `'partition'` : each bar shows the breakdown of the bases into which this base mutates.
- `index`:
    - `'index'`: each base is identified with its position number
    - `'base'`: each base is identified with its type (A, C, G, T)
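
To picture what the `index` and `'sequence'` options change, here is a small sketch of how bar labels and bar colors could be derived from a sequence. These helpers are hypothetical and are not part of nap; the color choices and the 1-based numbering are assumptions.

```python
BASE_COLORS = {"A": "red", "C": "blue", "G": "orange", "T": "green"}  # arbitrary colors

def bar_labels(sequence, index):
    """Hypothetical helper: x-axis labels for each bar."""
    if index == "index":
        return [str(i) for i in range(1, len(sequence) + 1)]  # assumes 1-based positions
    if index == "base":
        return list(sequence)
    raise ValueError(f"unknown index option: {index!r}")

def bar_colors(sequence):
    """Hypothetical helper: one color per bar, chosen by the original base."""
    return [BASE_COLORS[base] for base in sequence]

labels_by_index = bar_labels("GACT", "index")
labels_by_base = bar_labels("GACT", "base")
colors = bar_colors("GACT")
```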

This plot type takes a (tube, construct) pair as an argument. That's fine, we know how to find our tubes list `tubes` and our construct list `df.construct.unique()`. 

Step 2.4.1: Let's do this plot:
- select a tube and a construct in your lists
- select `plot_type` : `'sequence'` and  `index`: `'index'`
- make the plot

Sequence type

In [None]:
plot.mutation_rate(df=df,
                tube= "TO DO",
                construct="TO DO",
                plot_type="TO DO",
                index="TO DO")

Step 2.4.2: Now, use the following parameters:
- keep the same tube and construct
- select `plot_type` : `'sequence'` and  `index`: `'base'`
- make the plot
- what's the difference?

In [None]:
plot.mutation_rate(df=df,
                tube= "TO DO",
                construct="TO DO",
                plot_type="TO DO",
                index="TO DO")

Step 2.4.3: Let's go for a last round. Use the following parameters:
- keep the same tube and construct
- select `plot_type` : `'partition'` and  `index`: `'base'`
- make the plot
- what's the difference?

In [None]:
plot.mutation_rate(df=df,
                tube= "TO DO",
                construct="TO DO",
                plot_type="TO DO",
                index="TO DO")

Step 2.4.4: Generate a lot of plots. 
- Pick your favorite plot type and paste it in the loop.
- Define the list of constructs that you want to plot.
- Run your code
- Check your results in the folder `data/figs/date/{study}/mut_per_base/sequence/{construct}`

/!\ WARNING: it takes a few seconds to generate one plot. If you generate too many plots, like the entire `df.construct.unique()` list for all of the `tubes`, it will take a while (on my computer, it takes ~25 minutes). Select subsets of these lists instead. 
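
One way to pick a manageable subset is to sample a few constructs and check the number of plots before launching the loop. The sketch below uses made-up construct names and the standard `random` module.

```python
import random

toy_tubes = ["A6", "D6"]
all_constructs = [str(n) for n in range(8584, 8604)]   # 20 made-up construct names

random.seed(0)                                   # reproducible pick
subset = random.sample(all_constructs, k=3)      # 3 distinct constructs

n_plots_full = len(toy_tubes) * len(all_constructs)
n_plots_subset = len(toy_tubes) * len(subset)
print(f"{n_plots_subset} plots instead of {n_plots_full}")
```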

In [None]:
constructs = ['TO DO']

for tube in tubes:
    for construct in constructs:
        # PASTE THE CODE FOR YOUR FAVORITE PLOT HERE
        plot.save_fig(path=f"data/figs/date/{study}/mut_per_base/sequence/{construct}/", 
                    title=f"base_per_base_sequence_{tube}_{construct}")
        plt.close()

### Step 2.5: Tubes comparison
This plot type is construct-wise. It compares the mutation rate of each base of this construct across the tubes list, two tubes at a time. The idea is to see the evolution of the data through the study.
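
Comparing tubes two at a time means working on pairs of tubes. As a sketch, assuming every unordered pair is compared (the exact pairing used by `plot.compare_n_tubes` is not shown here), the pairs can be listed with `itertools.combinations`:

```python
from itertools import combinations

toy_tubes = ["A6", "B6", "C6"]

# Every unordered pair of tubes, in the order of the list.
tube_pairs = list(combinations(toy_tubes, 2))
```

The number of pairs grows quickly with the number of tubes (n * (n - 1) / 2), which is worth keeping in mind for large studies.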

Step 2.5.1: select a construct and plot this.

In [None]:
plot.compare_n_tubes(df=df,
                     tubes = tubes,
                     construct= 'TODO')

Step 2.5.2: Batch plotting
- Select multiple constructs
- Produce multiple plots
- Open the corresponding folder and check that it worked fine

/!\ WARNING: it takes a few seconds to generate one plot. If you generate too many plots, like the entire `df.construct.unique()` list, it will take a while (on my computer, it takes ~10 minutes). Select subsets of this list instead. 

In [None]:
constructs = ['TO DO']

for construct in constructs:
    plot.compare_n_tubes(df, tubes, construct)
    plot.save_fig(path=f"data/figs/date/{study}/comparison/", 
                  title=f"comparison_{study}_{construct}")
    plt.close()
    print(construct, end=' ')

### Step 2.6: Save columns to a csv file

It can be useful to save relevant data from your dataset.

Step 2.6.1: Save columns to a csv file
- Set columns to `['tube', 'construct','full_sequence','var_sequence','mut_bases','info_bases']`
- Run the code
- Check the result

In [None]:
utils.columns_to_csv(df=df,
                   tubes=tubes,
                   columns='TO DO',
                   title=f'about_{study}',
                   path=f'data/figs/date/{study}'
                   )
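
For reference, the same idea with plain pandas looks like the sketch below. It is not the code of `utils.columns_to_csv`; the toy data and file name are invented for illustration.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "tube":         ["A6", "D6"],
    "construct":    ["8584", "8584"],
    "var_sequence": ["GACT", "GACT"],
    "extra":        [1, 2],              # a column we do not want to export
})

columns = ["tube", "construct", "var_sequence"]
selected_tubes = ["A6"]

out_file = os.path.join(tempfile.mkdtemp(), "about_tutorial.csv")

# Keep the rows of the selected tubes and only the requested columns.
toy.loc[toy["tube"].isin(selected_tubes), columns].to_csv(out_file, index=False)

reloaded = pd.read_csv(out_file, dtype=str)
```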

### Step 2.6.2: Save construct vs deltaG 

- Run the code
- Check the result

In [None]:
utils.deltaG_vs_construct_to_csv(df=df, title=f"deltaG_vs_construct.csv", path = f"data/figs/date", tubes=tubes)

# Step 3: Advanced data management

In this part, we will learn how to:
- Process pickle files, the output of DREEM. 
- Push to the database using your own username.

/!\ One pickle file corresponds to one tube of the wet lab experimentation.


### Step 3.1 Process pickle files
Step 3.1.1: Get a sample dataset (pickles + additional content)

- Download it from [this link](https://drive.google.com/drive/folders/1sf7ZkF_TZOjU9MWjm9aB9nqxBTGnxC_d?usp=sharing).
- Store the pickle files under `'data/FULLSET/[tube name]/mutation_histos.p'`.
- Store the RNAstructure file under `'data/delta_g_plus_bracket_q1.csv'`.
- The pickle files you want to process are ['A6','D6']. They correspond to your tubes.

In [None]:
tubes = ['A6','D6']


### Step 3.1.2: Generate pickles dictionary
`pickles` is a dictionary that has the tube names as keys and their respective paths as values. We use `pickles` to load the pickle files from your computer.

To create `pickles`, we use the following function:

```
pickles = data_wrangler.generate_pickles()
```


In [None]:
pickles = data_wrangler.generate_pickles(path_to_data='data/FULLSET',
                                         pickle_list= tubes)
print(pickles)
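
Given the folder layout from Step 3.1.1, a dictionary of this shape can be sketched with a simple comprehension. `build_pickles` is a hypothetical stand-in, not the implementation of `data_wrangler.generate_pickles`.

```python
def build_pickles(path_to_data, tube_names):
    """Hypothetical helper: map each tube name to the path of its pickle file."""
    return {tube: f"{path_to_data}/{tube}/mutation_histos.p" for tube in tube_names}

toy_pickles = build_pickles("data/FULLSET", ["A6", "D6"])
print(toy_pickles)
```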

### Bonus Step: Play with data_wrangler.generate_pickles()
If you're using a series of tubes with repetitive names, for example ['A1','A2','A3','B1','B2','B3'], you may want to automate this. 

Then, you can use additional arguments for data_wrangler.generate_pickles():
- `letters_boundaries`: the first and last letters of your tube set (for example, ['A','D']).
- `numbers_boundaries`: the first and last numbers of your tube set (for example, [1, 3]).
- `remove_pickles`: tube names that are in the set generated by (letters_boundaries, numbers_boundaries), but that you don't want (for example, ['A2','C3'])
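
The name expansion described by these arguments can be sketched as follows. `expand_tube_names` is a hypothetical function written for illustration, and it assumes single uppercase letters and inclusive boundaries.

```python
import string

def expand_tube_names(letters_boundaries, numbers_boundaries, remove=()):
    """Hypothetical helper: list tube names between the given letter and number bounds."""
    first_letter, last_letter = letters_boundaries
    first_number, last_number = numbers_boundaries
    alphabet = string.ascii_uppercase
    letters = alphabet[alphabet.index(first_letter): alphabet.index(last_letter) + 1]
    names = [f"{letter}{number}"
             for letter in letters
             for number in range(first_number, last_number + 1)]
    return [name for name in names if name not in remove]

expanded = expand_tube_names(["A", "B"], [1, 3], remove=["A2"])
```

With these inputs the result is `['A1', 'A3', 'B1', 'B2', 'B3']`.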

Bonus exercise:
- generate the following pickles list using the arguments above:
```
['A1',     'A3','A4','A5','A6',
      'B2','B3','B4','B5','B6',
 'C1','C2','C3','C4','C5','C6',
 'VERY_LONG_NAME_1','VERY_LONG_NAME_2']
```

In [None]:
print(data_wrangler.generate_pickles(path_to_data='data/FULLSET',
                            pickles_list= ['TODO'],
                            letters_boundaries =  ['TODO','TODO'],
                            number_boundaries=  ['TODO','TODO'],
                            remove_pickles= ['TODO']))

### Step 3.1.3: Push pickles to Firebase
Firebase is a database service provided by Google.


`data_wrangler.push_pickles_to_firebase()` does the following operations:
- loads additional content (typically, the output of RNAstructure)
- for each pickle file:
    - unpacks and reformats the pickle file.
    - merges it with additional content (a sequence-wise security check is performed).
    - filters out every construct for which the worst base coverage in the region of interest is below `min_bases_cov`. 
    - pushes the result to Firebase
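
The coverage filter in the list above can be illustrated with a few lines of Python. This is a sketch of the rule as described, not the library's code: `passes_coverage` and the way the region of interest is given (start and stop positions) are assumptions.

```python
def passes_coverage(cov_bases, roi_start, roi_stop, min_bases_cov):
    """Hypothetical helper: True if the worst coverage inside the ROI reaches the minimum."""
    worst = min(cov_bases[roi_start:roi_stop])   # stop position is exclusive
    return worst >= min_bases_cov

coverage = [3000, 2500, 900, 1800, 2200]         # made-up per-base coverage

kept_full = passes_coverage(coverage, 0, 5, min_bases_cov=1000)   # ROI includes the 900
kept_tail = passes_coverage(coverage, 3, 5, min_bases_cov=1000)   # ROI avoids the 900
```

A single poorly covered base inside the region of interest is enough to filter the construct out, while the same base outside the region has no effect.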

In [None]:
username = 'CHANGE YOUR USERNAME HERE'
RNAstructureFile = 'data/RNAstructureFile.csv'

data_wrangler.push_pickles_to_firebase(pickles = pickles,
                                        RNAstructureFile = RNAstructureFile,
                                        min_bases_cov = min_bases_cov, 
                                        username = username)


### About usernames

Your username corresponds to a folder at the database's root. Using different usernames lets you create separate databases, which is very useful to separate:
- users
- projects
- versions
- filtering types of the pickle files (such as different values of `min_bases_cov`)

A username can contain `/`, to make hierarchical folders. For example, if you create 
- `Animal/Dog`
- `Animal/Cat`
- `Animal/Englishman`

You'll get:
- `Animal`
    - `Dog`
    - `Cat`
    - `Englishman`

/!\ If a username is already used, THIS WILL OVERWRITE THE PREVIOUS DATA. So be careful with your naming :) 
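
Both behaviours, the folder hierarchy and the overwrite, can be mimicked with a nested dictionary. This is only a mental model, not Firebase code: `push` is a hypothetical function and the payloads are invented.

```python
def push(database, username, payload):
    """Hypothetical helper: store `payload` under a '/'-separated username."""
    *folders, leaf = username.split("/")
    node = database
    for folder in folders:
        node = node.setdefault(folder, {})   # create intermediate folders as needed
    node[leaf] = payload                     # replaces anything already stored there
    return database

db = {}
push(db, "Animal/Dog", {"tubes": ["A6"]})
push(db, "Animal/Cat", {"tubes": ["D6"]})
push(db, "Animal/Dog", {"tubes": ["B6"]})    # same username: the first Dog data is lost
```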

### Step 3.1.4: Pull from Firebase

You made it this far, good job! Now, pull your new dataset and play with it!

Exercise 3.1.4:
- Pull your data from Firebase
- Go through the data sanity analysis to check that you've done a good job :) 

In [None]:
# Remove your former dataset
try:
    os.remove(json_file)
except FileNotFoundError:
    print('No json file to delete')

# Pull from the firebase
df_rough = data_wrangler.load_data_from_firebase(tubes=tubes, username=username)
data_wrangler.dump_dict_json(JSONFileDict=json_file,
                             df=df_rough)

# Clean the data
df, df_full = turner_overthrow.clean_dataset(df_rough=df_rough,
                                             tubes=tubes, 
                                             min_bases_cov=min_bases_cov)