# STAT 450: Case Studies in Statistics

## Case study: relation between mRNA and protein levels

The picture below illustrates what is known as the Central Dogma of Molecular Biology.


![](prot_gene.png)

Despite expectations of a high correlation between mRNA and protein levels, experimental results have shown very low correlation values.

Many research groups have investigated the relation between mRNA and protein levels. 

In 2014, a research group claimed to have found a "predictive model" that can be used to predict protein levels from mRNA levels!

We'll use the data this group submitted to the journal as if they were "our client's data".


### Claim

Using the median ratio of protein to mRNA levels per gene as a proxy for translation rates, our data show that [...] ***it now becomes possible to predict protein abundance in any given tissue with good accuracy from the measured mRNA abundance***

### Question

- Is our analysis statistically correct?

- Is there another way to analyze the data? If so, do we get similar results?


![](nature_res.png)

# Let's take a look at their data

### Can we reproduce their analysis?

### Do we get the same conclusions?

## From The Art of Data Science (by Peng and Matsui)

### Exploratory Data Analysis: Checklist

1. Formulate your question
2. Read in your data
3. Check the packaging

### Checklist for The Art of Data Science (cont.)


4. Look at the top and the bottom of your data
5. Check your “n”s
6. Validate with at least one external data source
7. Make a plot
8. Try the easy solution first
9. Follow up

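```r
library(tidyverse)  # attaches ggplot2, dplyr, tidyr, and friends
# library(repr)
# options(repr.plot.width = 7, repr.plot.height = 4)
library(ggplot2)    # already attached by tidyverse; kept for explicitness
library(broom)
# library(gridExtra)
# library(GGally)

# Read protein and mRNA levels as numeric matrices
# (rows = genes, columns = tissues)
prot <- data.matrix(read.csv("proteinUN.csv", row.names = 1))
mrna <- data.matrix(read.csv("geneUN.csv", row.names = 1))
tissues <- colnames(mrna)
genes <- proteins <- rownames(prot)
```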



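```r
# Quick checks: both matrices must list the same tissues (columns)
# and the same genes (rows), in the same order
stopifnot(identical(colnames(prot), colnames(mrna)))
stopifnot(identical(rownames(prot), rownames(mrna)))
```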

# Exploring the data

- What is the dimension of this dataset?

- How many genes have been measured? How many tissues?

- What kinds of variables are present in this dataset (factor? numeric?)?

- Are there missing values? 

- If you find missing values in a gene, are both the mRNA and the protein levels missing for that gene? 

- Is the proportion of missing values the same in all genes?
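One possible way to start answering these questions is sketched below. It runs on small simulated matrices (`mrna_toy` and `prot_toy`, hypothetical stand-ins for `mrna` and `prot`) so that it is self-contained; the same calls should apply to the real matrices, assuming genes are in rows and tissues in columns.

```r
# Hypothetical toy data standing in for `mrna` and `prot`:
# 5 genes (rows) x 4 tissues (columns), with a few values set to NA.
set.seed(450)
toy_genes   <- paste0("gene", 1:5)
toy_tissues <- paste0("tissue", 1:4)
mrna_toy <- matrix(rexp(20), nrow = 5,
                   dimnames = list(toy_genes, toy_tissues))
prot_toy <- matrix(rexp(20), nrow = 5,
                   dimnames = list(toy_genes, toy_tissues))
mrna_toy[1, 2] <- NA
prot_toy[1, 2] <- NA   # missing in both matrices
prot_toy[3, 4] <- NA   # missing in the protein matrix only

# Dimension: number of genes (rows) and tissues (columns)
dim(prot_toy)

# Variable type of the stored values
typeof(prot_toy)

# Total number of missing values in each matrix
n_missing <- c(mrna = sum(is.na(mrna_toy)), protein = sum(is.na(prot_toy)))

# Are the missing entries in exactly the same positions in both matrices?
same_pattern <- identical(is.na(mrna_toy), is.na(prot_toy))

# Proportion of missing protein values per gene
prop_missing_by_gene <- rowMeans(is.na(prot_toy))
```

Comparing `is.na(mrna_toy)` with `is.na(prot_toy)` position by position is what tells us whether a missing mRNA value always comes with a missing protein value.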

### Understanding missing values

- Count how many genes have $K$ missing values (mRNA or protein), for $K=0, \ldots, 12$


- Make a plot that illustrates your analysis


- Is the correlation between mRNA and protein affected by the number of missing values?


- For genes with more than 9 complete pairs, what is the distribution of the correlation values between mRNA and protein? 


- For a gene with complete data, illustrate the relation between mRNA and protein
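The steps above could be approached as in the following sketch. It uses simulated data (`mrna_toy`, `prot_toy`, 6 hypothetical genes and 12 hypothetical tissues) and assumes that a tissue counts as "missing" for a gene when either the mRNA or the protein value is `NA`. The cutoff of at least 3 complete pairs before computing a correlation is an arbitrary choice made for this illustration.

```r
# Hypothetical toy data: 6 genes (rows) x 12 tissues (columns)
set.seed(450)
n_genes   <- 6
n_tissues <- 12
dn <- list(paste0("gene", 1:n_genes), paste0("tissue", 1:n_tissues))
mrna_toy <- matrix(rexp(n_genes * n_tissues), nrow = n_genes, dimnames = dn)
prot_toy <- 2 * mrna_toy +
  matrix(rexp(n_genes * n_tissues), nrow = n_genes, dimnames = dn)
prot_toy[2, 1:3]  <- NA   # gene2: 3 incomplete pairs
mrna_toy[3, 5]    <- NA   # gene3: 1 incomplete pair
prot_toy[4, 1:11] <- NA   # gene4: only 1 complete pair left

# A (mRNA, protein) pair is incomplete when either value is missing
incomplete <- is.na(mrna_toy) | is.na(prot_toy)
k_missing  <- rowSums(incomplete)

# Number of genes with K incomplete pairs, for K = 0, ..., 12
# (factor levels keep the K values that have zero genes)
k_table <- table(factor(k_missing, levels = 0:n_tissues))

# Per-gene Pearson correlation, using complete pairs only
gene_cor <- sapply(seq_len(n_genes), function(j) {
  ok <- !incomplete[j, ]
  if (sum(ok) < 3) return(NA_real_)  # too few pairs to be meaningful
  cor(mrna_toy[j, ok], prot_toy[j, ok])
})
names(gene_cor) <- rownames(mrna_toy)

# Correlations for genes with more than 9 complete pairs
enough <- (n_tissues - k_missing) > 9
summary(gene_cor[enough])
```

From here, `barplot(k_table)` gives one simple picture of the missing-value counts, `hist(gene_cor[enough])` shows the distribution of correlations, and `plot(mrna_toy[1, ], prot_toy[1, ])` shows the relation for a single gene with complete data.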

## The proposed prediction method

### For each gene $j$, the predicted protein level in the $i$-th tissue is given by

### $$\hat{p}_{ij} = \hat{r}_j \times \text{mRNA}_{ij},$$

### where $\hat{r}_j$ is the median protein-to-mRNA ratio for gene $j$
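A minimal sketch of this prediction rule is given below. It uses simulated data (`mrna_toy`, `prot_toy`, both hypothetical) and assumes, following the quoted claim, that the ratio for each gene is the median across tissues of protein divided by mRNA. Whether this matches every detail of the original analysis is something we still need to check. Note that the matrices store genes in rows, so entry `[j, i]` corresponds to $\hat{p}_{ij}$ in the formula.

```r
# Hypothetical toy data: 4 genes (rows) x 6 tissues (columns)
set.seed(450)
dn <- list(paste0("gene", 1:4), paste0("tissue", 1:6))
mrna_toy <- matrix(rexp(24) + 0.5, nrow = 4, dimnames = dn)
# Simulated protein levels: a gene-specific multiple of mRNA, with noise
prot_toy <- mrna_toy * c(1, 2, 5, 10) * matrix(exp(rnorm(24, sd = 0.3)), nrow = 4)
prot_toy[1, 2] <- NA   # one missing protein value

# Protein-to-mRNA ratio for every gene and tissue
ratio <- prot_toy / mrna_toy

# Estimated ratio per gene: median across tissues, ignoring missing values
r_hat <- apply(ratio, 1, median, na.rm = TRUE)

# Predicted protein: multiply row j of the mRNA matrix by r_hat[j]
prot_hat <- sweep(mrna_toy, 1, r_hat, `*`)
```

`sweep()` with `MARGIN = 1` makes the row-wise multiplication explicit, which avoids relying on R's recycling rules.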

### <font color="blue"> Is this a linear regression?? </font>

### [class discussion]