
Analyze yeast cytosolic 8x dicodon screen data #3

Open · 3 of 4 tasks
kychen37 opened this issue Apr 1, 2022 · 42 comments
Labels: analysis (An analysis that we decide to carry out)
kychen37 commented Apr 1, 2022

GanttStart: 2022-03-29

Background

Library transformed at high efficiency using the Bloom lab's liquid recovery protocol: https://github.com/rasilab/rqc_aggregation_aging/issues/98
Library prepped: https://github.com/rasilab/rqc_aggregation_aging/issues/104

Ideas TODO

  • Analyze the endogenous fragment data in the same library and see if the codon or amino acid effects can account for their mRNA effects.
  • Put the same library in different quality control factor mutants to see their effects on codon and aa mRNA effects.
  • Correlate with CSC, AASC, tAI
  • Correlate with ribosome pause scores from Weinberg 2016

Analysis Links

Brief conclusion

  • analysis used for publication here and here
kychen37 added the "analysis" label on Apr 1, 2022
kychen37 self-assigned this on Apr 1, 2022
kychen37 commented:

@rasi

rasi commented Apr 12, 2022

@kychen37 Ok. I remember you showing some data in the short updates, but I don't remember us ever discussing this. Can you update the issue whenever you finish an analysis (I think you used to do this previously)? I can give more useful feedback if I have some time to think about your results, rather than seeing them for the first time during group meeting.

kychen37 commented:

Preliminary analysis: just making a heatmap and starting to look at inserts that are still missing despite good library representation:

Code: https://github.com/rasilab/rqc_aggregation_aging/blob/master/analysis/deepseq/20220329_exp51_cyto_8x_dicodon/scripts/plot_insert_mrna_levels.ipynb

kychen37 commented:

Note to self: my branches are somewhat messed up right now. My most recent analyses were done in master (I had switched to doing these analyses in the deepseq branch, forgot, then continued in master). Pulling from master didn't update the deepseq branch, so I may need to sync them. For now, the master branch is the most up to date for the scripts of this analysis only.

kychen37 commented:

@rasi Update: I will revisit this data next week to see what is pertinent for my committee meeting.

rasi commented Jun 6, 2022

@kychen37 Post the figures as Issue comments so that we can come back to them easily (the commit links above are enough to figure out where the figure is).

Why is the data not centered around 0 unlike here: https://github.com/rasilab/lab_analysis_code/blob/master/rasi/analysis/deepseq/20210703_pb_8xdicodon_resequence_grna_mrna/scripts/plot_human_8xdicodon_effects_files/figure-gfm/unnamed-chunk-6-1.png?

kychen37 commented Jun 6, 2022

@rasi I've been trying to figure that out but I'm not sure. Could it have to do with the fact that the mean of all the dipeptide lfcs is slightly negative?

rasi commented Jun 6, 2022

@kychen37 The mean and median cannot be this different. First, calculate simple log fold change without bootstrapping and make sure you understand what is going on before doing the bootstrap. The bootstrap is just for error bars.
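A minimal sketch of that un-bootstrapped check (the table `insert_counts` and its column names are hypothetical):

```r
library(tidyverse)

# simple per-insert log2 fold change, no bootstrap
insert_counts %>%
  mutate(lfc = log2(mrna_count) - log2(gdna_count)) %>%
  summarize(mean_lfc = mean(lfc), median_lfc = median(lfc))
```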

kychen37 commented Jun 6, 2022

@rasi Sorry, I forgot that my lfc column had already been median-normalized when I calculated the median and said it was zero. The median is actually -1.5 and the mean is -1.7.

kychen37 commented Jun 6, 2022

After median-normalization, the median = 0 and mean = -0.19

rasi commented Jun 6, 2022

Either way, the above plot should be approximately centered around 0 if everything is done correctly.

rasi commented Jun 6, 2022

Can you also add horizontal lines for each amino acid similar to what is in Phil's paper?
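One way to draw per-amino-acid reference lines in ggplot2 (a sketch only; the table `codon_lfc` and its columns are hypothetical, and this is not necessarily how the published figure was made):

```r
library(tidyverse)

# per-amino-acid median LFC to use as the reference line in each facet
aa_medians <- codon_lfc %>%
  group_by(aa) %>%
  summarize(aa_median_lfc = median(lfc))

ggplot(codon_lfc, aes(x = codon, y = lfc)) +
  geom_point() +
  geom_hline(data = aa_medians, aes(yintercept = aa_median_lfc), linetype = "dashed") +
  facet_wrap(~ aa, scales = "free_x")
```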

kychen37 commented Jun 6, 2022

@rasi I replotted the dipeptide heatmap using your code because I realized I wasn't looking at missing dipeptides correctly; it turns out there is a group of dipeptides that are missing (in grey):

kychen37 commented Jun 7, 2022

@rasi Does bootstrapping involve normalizing lfcs to the mean (I thought it didn't)? My data appears to skew negative (unnormalized lfcs range from -5 to 0.1), and the bootstrap function samples from these. Should I have it sample from median-normalized lfcs instead (which are centered on zero)? It didn't look like you had done that for Phil's data, so I didn't do it for mine.

rasi commented Jun 7, 2022

@kychen37 Ok, your explanation makes sense. It looks like I did not median normalize because my mRNA and gRNA had almost equal counts. But it will be good to do it for yours since your read counts are skewed.

rasi commented Jun 7, 2022

@kychen37 Most of the missing dipeptides are hydrophobic or bulky. Maybe these are just toxic, as you noticed in your plasmid library. Can you calculate the bootstrap error bars for the remaining dipeptides, so that you can highlight ones that are at least 2×SD below the median value?

Also, can you make the above plot for all codons? I assume Arg is so destabilizing primarily because it has two rare codons CGG and CGA, but this will be good to see.

kychen37 commented Jun 7, 2022

@rasi I went with mean-normalization since the bootstrap used the mean, so I figured I would keep it consistent:

code

rasi commented Jun 7, 2022

The codon plot looks good! Will be useful to spread out the Y-axis a bit.

rasi commented Jun 7, 2022

I recommend making frame plots similar to 1E in Phil's paper and a schematic that mirrors 1A as closely as possible.

kychen37 commented Jun 7, 2022

> @kychen37 Most of the missing dipeptides are hydrophobic or bulky. Maybe these are just toxic as you noticed in your plasmid library.

@rasi I'm trying to figure out what the effect of slice(1) is in the human code shown here:
Screen Shot 2022-06-06 at 10 28 06 PM

With the way I had plotted dipeptides before (which did not include slice(1) and the associated grouping), I get fewer missing dipeptides than when I include it; I directly compared the two methods here. I think I'm just not understanding what the slice function is doing to the data and whether I need it.

rasi commented Jun 7, 2022

@kychen37: slice(1) takes the first element of each group (https://dplyr.tidyverse.org/reference/slice.html).

It should not drop any dipeptides since you are grouping by dipeptides in the previous step.
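A toy illustration of that behavior:

```r
library(tidyverse)

df <- tibble(diaa = c("FK", "FK", "VK"), lfc = c(-2.1, -2.3, -1.0))

df %>%
  group_by(diaa) %>%
  slice(1) %>%   # keeps the first row of each dipeptide group
  ungroup()
# returns two rows, one per dipeptide; no groups are dropped
```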

kychen37 commented Jun 7, 2022

@rasi

code

  • median-normalized frame effects if we want it zero-centered:

  • not normalized:

rasi commented Jun 7, 2022

Looks good and better than what I might have guessed :-) It will be useful to carefully look at the off-diagonal elements of the above plot to see whether it is noise or something interesting.

Try the same frame plot for codon pairs as well.

Look through the language in Phil's paper and see whether you can place the observations above in the context of what is known about translation in yeast (a lot more is known than in human).

See some of my TODOs in the top comment. Add to them if you think of additional experiments to do (this can be a useful checklist to come back to as you dig deeper).

kychen37 commented Jun 7, 2022

@rasi Bootstrapped error bars for dipeptide levels of just the VK/FK dipeptides plus some controls. I'm going to show this one because the full dipeptide plot is too big.

rasi commented Jun 7, 2022

Sounds good. Add end caps to the error bars to make it a bit easier to see.
I would still start showing the full heat map from #3 after you show the single codon and single aa effects.
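For reference, in ggplot2 the end caps come from `geom_errorbar()` and its `width` argument; a sketch with hypothetical column names:

```r
library(tidyverse)

ggplot(diaa_boot, aes(x = diaa, y = lfc)) +
  geom_point() +
  # geom_errorbar (unlike geom_linerange) draws end caps; width sets the cap size
  geom_errorbar(aes(ymin = lfc - lfc_sd, ymax = lfc + lfc_sd), width = 0.3)
```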

kychen37 commented Jun 10, 2022

Breakdown of dicodons that are missing from this sequencing run, code

| library_missing_from | n_dicodons_missing | possible reason for the dicodon being missing |
|---|---|---|
| gdna only | 0 | |
| mrna only | 108 | since it's present in gdna and linkageseq, these mRNAs may be severely destabilized, or the mRNA library was bottlenecked |
| gdna and mrna only | 141 | since it's present in linkageseq but not gdna or mrna, this implies either toxicity or bottlenecking |
| gdna, mrna, and linkage | 180 | something to do with oligo pool synthesis or cloning of the plasmid library; maybe bottlenecked at the E. coli step? |
| TOTAL | 429 | |

For the 180 dicodons that are missing from linkage sequencing, this is the dipeptide breakdown (what proportion of inserts for each dipeptide were missing in linkage sequencing, top ~20):

Screen Shot 2022-06-10 at 3 29 56 PM

For the 180 dicodons that are missing from linkage sequencing of the integrating plasmid library, 130 of these dicodons are also missing in the dual-barcode 2um plasmid library. The dipeptide breakdown of these 130 common missing dicodons (top ~20):

Screen Shot 2022-06-10 at 3 33 04 PM

@rasi

kychen37 commented:

Update

@rasi

Next week I will look into endogenous inserts from this sequencing run and plan remaining experiments + paper

kychen37 commented Jun 15, 2022

CODE

Remake Fig 1A from Presnyak 2015 to make sure the CSCs I calculated matched theirs.

  • CSC = correlation coefficient of (frequency of a codon within a gene) vs (total RNA half-life from Presnyak2015 supplemental table 1); see the sketch after this list
  • My remake:

  • Presnyak 2015:

  • There are a couple of codons where the order is a little off; there might be a normalization step I'm missing, but overall it looks very similar
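A minimal sketch of the CSC calculation described above, assuming hypothetical tibbles `codon_freqs` (per-gene codon frequencies) and `half_lives` (Presnyak 2015 Table S1):

```r
library(tidyverse)

# CSC for each codon = Pearson correlation, across genes, of the codon's
# frequency within a gene vs that gene's total-RNA half-life
csc <- codon_freqs %>%                      # columns: gene, codon, freq
  inner_join(half_lives, by = "gene") %>%   # columns: gene, half_life
  group_by(codon) %>%
  summarize(csc = cor(freq, half_life, method = "pearson"))
```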

Correlating the Coller lab CSCs with my LFCs

  • after redoing the CSCs above, I found their published CSCs in the Forrest2020 supplement. They are very similar to my re-calculated values, but I decided to just plot using their published data instead
  • maybe a moderate correlation between the two metrics

Plotting all data together

kychen37 commented Jun 15, 2022

8x dicodon median-normalized codon average LFCs color-coded by Coller designations of 'unstable' vs 'stable' (aka CSC < 0 or > 0)

Correlating AASC with bootstrapped average amino acid LFC

kychen37 commented:

@rasi There is a moderate correlation between Coller lab CSCs and my average codon LFCs, but basically no correlation between Coller lab AASCs and my average amino acid effects; see comments above.

phiburke commented Jun 15, 2022

@Katharine Just wanted to say that the colored codon plot looks really good.

I wouldn't worry too much about the amino acid level effects not matching up well; AA-level effects in your library are probably driven more by localized repeats, whereas they're looking at transcriptome-wide aggregate effect measurements, so it's a very different thing.

Also, if you're doing literature comparisons you should totally compare your data to Gamble2016, Adjacent Codons Act in Concert to Modulate Translation Efficiency in Yeast, from Stan Fields' group. That's probably the closest published experimental measurement to what you've done. Also, they only look at protein levels, so even if things don't match up well it might still be interesting (e.g., some dicodons might affect mRNA but not protein levels).

kychen37 commented:

Thanks @phiburke! My memory of Gamble2016 is that they just found a bunch of rare codons, but it's been a while, so it's definitely worth a re-look!

kychen37 commented:

Update
@rasi
In addition to plots above, I am still interested in trying to correlate LFCs with tAI, CAI, and ribosome pause scores.

kychen37 commented Jul 1, 2022

Update & TODO

  • I want to use the below resources to calculate tAI and CAI for my 6000-insert library
  • For the CAI plot: use the tool (or similar tools) described here to generate a codon usage table and CAI scores from the 6000-insert library, then correlate with LFC
  • For tAI plot:

@rasi

kychen37 commented Jul 9, 2022

@rasi
Update

  • calculated tAI for the 6000-insert library here
  • plotted LFC vs tAI here
  • weak correlation between LFC and tAI:

  • I'm currently thinking of other ways to analyze the data
  • I will also do this plot for CAI = geometric mean of relative adaptiveness of all codons in the gene sequence (already calculated in the code above); see the sketch below
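Both CAI and tAI reduce to a geometric mean of per-codon weights over the insert; a sketch, assuming a hypothetical `weights` tibble with one weight per codon (relative adaptiveness for CAI, or the per-codon tAI weight):

```r
library(tidyverse)

# geometric mean of per-codon weights over a vector of codons
geom_mean_weight <- function(codons, weights) {
  w <- weights$w[match(codons, weights$codon)]   # weights: columns codon, w
  exp(mean(log(w)))
}

# e.g., score one 8x dicodon insert (codons here are illustrative)
# geom_mean_weight(rep(c("CGA", "CGG"), 8), weights)
```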

kychen37 commented:

  • Frydman optimized tAI, maybe Weinberg2016, find tAI for each codon
  • separate dicodon and endogenous (keep as separate experiments)

kychen37 commented Aug 12, 2022

Update

@rasi

  • Plots using tAI

  • Plots using Frydman normalized tAI (nTE): code

  • I want to play around with how to plot these like we discussed in short updates (look more closely at the cloud of points with very low nTE/tAI and LFC, slice and dice some more)

kychen37 commented Aug 16, 2022

Update

  • Plots for CAI

kychen37 commented Dec 22, 2022

Summary of different types of data filtering

Read cutoffs based on insert & barcode-level plots

  • alignment stats as described above would suggest the following cutoffs:
      insert_reads_cutoff = 500
      barcode_reads_cutoff = 10
      n_barcodes_cutoff = 6
  • dipeptide heat map applying the above insert/barcode-level cutoffs:

Screenshot 2022-12-22 at 1 33 46 PM

  • code:

```r
library(tidyverse)

diaa_lfc <- barcode_counts %>%
    filter(insert_num != 7000) %>% # remove spike-ins
    filter(barcode_count >= barcode_reads_cutoff) %>%
    # sum barcode counts up to the insert level
    group_by(sample_name, insert_num) %>%
    summarize(insert_count = sum(barcode_count), n_barcodes = n(), .groups = 'drop') %>%
    inner_join(insert_annotations, by = "insert_num") %>%
    pivot_wider(names_from = sample_name, values_from = c(insert_count, n_barcodes)) %>%
    drop_na() %>%
    # impose barcode-number and insert-read cutoffs in both gDNA and mRNA samples
    filter((n_barcodes_wt_gdna >= n_barcodes_cutoff) & (n_barcodes_wt_mrna >= n_barcodes_cutoff)) %>%
    filter((insert_count_wt_gdna >= insert_reads_cutoff) & (insert_count_wt_mrna >= insert_reads_cutoff)) %>%
    ungroup() %>%
    # aggregate the filtered inserts to the dipeptide level and compute LFC
    group_by(diaa) %>%
    summarize(diaa_count_gdna = sum(insert_count_wt_gdna), diaa_count_mrna = sum(insert_count_wt_mrna), diaa_n_barc_gdna = sum(n_barcodes_wt_gdna), diaa_n_barc_mrna = sum(n_barcodes_wt_mrna)) %>%
    mutate(diaa_lfc = log2(diaa_count_mrna) - log2(diaa_count_gdna)) %>%
    ungroup() %>%
    mutate(lfc_med = diaa_lfc - median(diaa_lfc))
```

Read cutoffs using Burke2022 code

  • The code from Burke2022 is slightly different from above in that it groups by dipeptide first and then imposes the insert_reads_cutoff/barcode cutoffs, so the filtering is much more lenient
  • dipeptide heat map using Burke2022 code:

Screenshot 2022-12-22 at 1 33 59 PM

  • code:

```r
diaa_lfc_burke2022 <- barcode_counts %>%
  filter(insert_num != 7000) %>% # remove spike-ins
  group_by(insert_num, sample_name) %>%
  summarize(count = sum(barcode_count), n_barcodes = dplyr::n()) %>%
  ungroup() %>%
  inner_join(insert_annotations, by = "insert_num") %>%
  # aggregate to the dipeptide level before imposing cutoffs (hence the more lenient filtering)
  group_by(diaa, aa1, aa2, sample_name) %>%
  mutate(count = sum(count), n_barcodes = sum(n_barcodes)) %>%
  ungroup() %>%
  group_by(diaa, aa1, aa2) %>%
  filter(sum(count) >= insert_reads_cutoff, sum(n_barcodes) >= n_barcodes_cutoff) %>%
  ungroup() %>%
  pivot_wider(names_from = sample_name, values_from = c("count", "n_barcodes")) %>%
  mutate(lfc = log2(count_wt_mrna) - log2(count_wt_gdna)) %>%
  group_by(diaa, aa1, aa2) %>%
  slice(1) %>% # one row per dipeptide
  ungroup() %>%
  drop_na() %>%
  mutate(lfc_med = lfc - median(lfc))
```

Minimal read cutoffs -> bootstrap by dipeptide -> filter by SD

Screenshot 2022-12-22 at 1 34 13 PM

  • code

```r
barcode_reads_cutoff = 10
n_barcodes_cutoff = 2

# statistic passed to boot::boot: log2 fold change of summed mRNA vs gDNA barcode counts
calc_lfc_bootstrap <- function(data, indices) {
  d <- data[indices, ]
  log2(sum(d$barcode_count_mrna)) - log2(sum(d$barcode_count_gdna))
}

wt_diaa_bootstrap_diaa <- wt_dicodon %>%
  filter(sample_name == 'wt') %>%
  group_by(diaa) %>%
  nest() %>%
  # resample barcodes within each dipeptide (R = 100 bootstrap replicates)
  mutate(lfc_boot = map(data, function(df) boot::boot(data = df, statistic = calc_lfc_bootstrap, R = 100)$t)) %>%
  select(-data) %>%
  mutate(lfc = map_dbl(lfc_boot, mean)) %>%
  mutate(lfc_sd = map_dbl(lfc_boot, sd)) %>%
  select(-lfc_boot) %>%
  ungroup() %>%
  mutate(lfc_mean = lfc - mean(lfc)) %>%
  mutate(lfc_med = lfc - median(lfc)) %>%
  mutate(lfc_max = lfc - max(lfc)) %>%
  mutate(strain = 'wt') %>%
  # keep only dipeptides with low bootstrap uncertainty
  filter(lfc_sd <= 0.25)
```

@rasi let me know if you have any thoughts on the proper/best way to plot this

kychen37 commented Jan 3, 2023

Destabilized dipeptides vs frameshifts

  • bootstrapping grouped by dipeptides
  • filter dipeptides with bootstrapped sds <= 0.15 based on:

  • find dipeptides whose bootstrapped dipeptide lfcs are >= 2 standard deviations below the median; these dipeptides are labeled here in the plot of mrna & grna read counts in log2 space:

  • for these destabilized dipeptides, plot their effects against frameshift effects (see the sketch after this list)
    • get the +1 frameshift of each dicodon belonging to a destabilized dipeptide
    • get the bootstrapped dipeptide lfc for the frameshifts, removing any dicodons that happen to fall in the destabilized set of dicodons or that include CGG/CGA
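A sketch of the two steps above; the "2 SD below the median" filter here uses each dipeptide's own bootstrap SD, which is an assumption about the intended criterion:

```r
library(tidyverse)

# 1. destabilized dipeptides: bootstrapped LFC at least 2 SD below the median
destabilized_diaa <- wt_diaa_bootstrap_diaa %>%
  filter(lfc_sd <= 0.15) %>%
  filter(lfc <= median(lfc) - 2 * lfc_sd)

# 2. +1 frameshift of a 6-nt dicodon: because the insert is the dicodon repeated 8x,
#    reading in the +1 frame is equivalent to taking positions 2-7 of the doubled dicodon
frameshift_plus1 <- function(dicodon) {
  substr(paste0(dicodon, dicodon), 2, 7)
}
frameshift_plus1("GTTAAA")
#> "TTAAAG"
```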

kychen37 commented Jan 3, 2023

@rasi

code

  • median-normalized frame effects if we want it zero-centered:
  • not normalized:
  • remake of frame effects plot above, this time imposing read cutoffs (500 reads per insert, 10 reads per barcode, 6 barcodes per insert) and removing any dicodons that go missing:

MariaT0 transferred this issue from another repository on Aug 17, 2024
rasi reopened this on Aug 17, 2024