# Homework Set - III

Each question/subquestion is worth **5 points**, unless indicated otherwise.

### ENCODE: Testing for Enrichment

Let's try to learn about possible genetic mechanisms of Alzheimer's using ENCODE data.

SNPs associated with this disease can be found in `ALZ_SNPs_hg38.bed`.

Choose a data set from the ENCODE website that you think may be relevant to the disease (note: feel free to choose a tissue other than brain; other tissues could be linked to the disease as well). Think about both the tissue type and the ENCODE mark. Please do not use DNase hypersensitivity or RNA-seq data sets.

**Hint: it will be easiest if you find a data set with data in the bed file format.**

**Note: Make sure the coordinates of the file you pick also use the hg38 reference genome.**

**Note: PLEASE try to pick a file that is smaller than 1 GB. This will make the analysis faster and feasible on CoCalc!**

**Q1.** (2 points) What dataset did you choose and why? Include the tissue, epigenetic mark tested, and any identifying information, such as the name of the sample.

**Q2.** (5 points) Now, test for significant overlap between the Alzheimer's SNPs and your ENCODE mark. This time, instead of generating fake SNP sets, use the Fisher's exact test implemented in bedtools (http://bedtools.readthedocs.org/en/latest/content/tools/fisher.html).

**NOTE: The fisher tool requires that your data is pre-sorted by chromosome and then by start position. e.g.**

```bash
$ sort -k1,1 -k2,2n in.bed > in.sorted.bed 
```
    
**for BED files, where `in.bed` is your input bed file and `in.sorted.bed` is the name that you want for the sorted output.**
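As a minimal sketch of what this sort does, the following builds a tiny, deliberately unsorted BED file and sorts it. The intervals and the temporary directory are made up for illustration and are not part of the homework data.

```bash
# Build a tiny, deliberately unsorted BED file (hypothetical example intervals).
tmpdir=$(mktemp -d)
printf 'chr2\t500\t600\nchr1\t1000\t1100\nchr1\t200\t300\n' > "$tmpdir/in.bed"

# Sort by chromosome name (column 1), then numerically by start position (column 2).
sort -k1,1 -k2,2n "$tmpdir/in.bed" > "$tmpdir/in.sorted.bed"

# Show the sorted result: both chr1 intervals come first, ordered by start.
cat "$tmpdir/in.sorted.bed"
```

The `n` in `-k2,2n` matters: without it, start positions would be compared as text, and `1000` would sort before `200`.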

This test looks for an association between two classifications. In our case, we are looking for an association between being a SNP in ALZ_SNPs_hg38.bed and being in an interval from your ENCODE data set. Run this test on your datasets. 

**NOTE: We have provided you a pre-sorted genome file in this directory that you can use for this analysis (human.hg38.genome).**

What is the two-tailed p-value? This is the probability, under the null hypothesis (overlap arises by random chance alone), of observing an overlap at least as extreme as the one in our data. Please include your unix commands in the top box and your answer in the bottom box.

**Q3.** (5 points) Was there a significant association between the disease-associated genetic variants and the ENCODE mark? Explain your findings to the best of your abilities. Are you surprised by the result? If so, why? If not, why not?

### ENCODE: Finding Active regulatory regions

You also know that H3K9me3 is a sign of *repressed* chromatin. Download and unzip the narrowPeak bed file for mouse H3K9me3 data (file **ENCFF234GVB**). Then, using the intersect command and -v flag, count the number of genes from MeAc_genes.bed that do **NOT** also have an H3K9me3 mark. MeAc_genes.bed was generated in the 2nd ENCODE module and has been provided in this folder.

**Q4.** (5 points) Write the bedtools command(s) to do this.

**Q5.** (5 points) How many genes from MeAc_genes.bed do not have an H3K9me3 mark?

**Q6.** (5 points) Are you convinced that these genes are truly active genes in embryonic mouse liver? Do you think that there may be some genes in liver_genes.txt that are expressed that we missed? Why or why not? What complexities are we overlooking?

*Hint: Think about issues such as where different marks should be located relative to the gene (from the Enhancer paper in the prelab), whether the bedtools window command was the ideal one to use, and cell type heterogeneity*.

### Creating Pipelines and Automation of analysis (Rscripts)

In Module 22, you created an Rscript that you can use for high-throughput analysis of some data.

**Q7.** (5 points) Provide a copy of the Rscript that you created in your `H03_Homework-III` directory!

**Q8.** (5 points) Now, let's unlock the power of automation. We have provided you with 3 data sets (exp1, exp2, exp3):

    /data
    
Using your script, analyze each of the 3 data sets provided and generate outputs!

**Q9a** (3 points) In an R code block, read each of the 3 output files you created into R and print the contents to your notebook. 

Provide your code below:

**Q9b** (2 points) Using a **SINGLE** UNIX command, view the contents of all three output files. HINT: the commands `head` and `cat` can display multiple files with only one command - you might need to look up how to do this.

Provide your code below:

**Q10.** (5 points) You'll note that if you had 1000s of experiments to analyze, it would be a real pain to write out the command line for each one -- you have better things to do than that!

What could you do to save yourself from needing to do that? How would you modify the code to achieve that? (Provide a general description, but no need to write specific code)

### Automation / re-analysis Using Jupyter Notebooks

In Module 23 (Pharmacology), you probably noticed that you could have tabulated virtually all of the above in Excel.

However, in a true high-throughput screening assay, you will have **hundreds** of plates to process. That's too much for one person to do perfectly in Excel!

You may also have noticed that through doing this assignment, you have written a 'generic' pipeline to process a single plate.

**Q11.** (5 points) Return to the in-class module. In order to process a different plate, called `plate2`, what would you change in the pipeline you created?

**Q12.** (5 points) In your module 23 directory, we included data from 6 additional plates. 

Process each and report here (excluding controls):
- the Z prime factor for each plate  
- the number of cells with a normalized score lower than **-4**, excluding controls, per plate. Note that rather than counting the results on the heatmap, you could `sum()` within the appropriate part of the heatmap table (excluding the controls, of course)

To do this, you could change the plate and re-run the markdown, and record the results in the cells below.

You will notice that in these data, you do not actually have any sample names attached to your data, e.g. what genes you actually screened.

Imagine that you were provided a file that looked like the following:

    sampleid,row,col,plate
    87234,C,3,1
    7134,C,4,1
    ...
    81672,P,22,7

i.e. a file with 2240 rows (+1 header) where each sampleid is mapped to a corresponding row, column, and plate. Note that positive and negative control columns are excluded.
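The row count quoted above can be sanity-checked with simple arithmetic. The sketch below assumes a layout that the text does not state explicitly: 16 rows (A-P) by 24 columns per plate, 4 control columns excluded, and 7 plates in total. Treat these numbers as assumptions to verify against your own plate data.

```bash
# Assumed plate layout (not stated in the text): 16 rows x 24 columns,
# with 4 control columns excluded, across 7 plates.
rows=16
sample_cols=$((24 - 4))
plates=7

# Expected number of data rows in the sample sheet (header not counted).
expected=$((rows * sample_cols * plates))
echo "$expected"
```

If a sample sheet like the one described were available, comparing this number against `tail -n +2 <file> | wc -l` would be a quick way to confirm that no wells are missing or duplicated.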

**Q13.** (5 points) Imagine that you now wanted to obtain the sampleids (i.e., the gene code ids!) for a set of cells that were of interest, the 'hits' from the screen.

For that, imagine that you had a second file which collated all of the cells across all plates which had a normalized Z-score less than -5. For example, it might look like this:

    row,col,plate
    C,3,1
    D,5,2
    P,6,7

Describe in words the steps that would allow a computer to print out the sampleids ONLY for the entries listed in this new file. To help you, we have provided the first two steps in the process. You complete the rest!
   * Be specific in the details of what you would check for during your look-up.
   * Hint: Pretend you had two sheets of paper, each with one of your lists, and you had to do this 'by hand'. What would you do, step-by-step?

**Q13 BONUS** (0 points) The following problem is optional, and will not affect your score on this homework in any way. However, if you provide a correct solution, you will get a "first dibs" certificate for an event on the last day of class. *This will have nothing to do with your grade*.

Write code in R that implements your solution to Q13. You may use tidyverse. We have provided both files described in the question in the `q13_bonus` directory. Note that this data was randomly generated, and does not match up with data from the Pharmacology module.

### Prediction measurements and Machine Learning

**Q14.** (6 points) In the context of machine learning problems, describe what is meant by the following terms:

- Features
- Examples
- Labels


**Q15.** Imagine data from two trained models applied to a test set:

Model 1:
- Correctly labelled true positive examples: 601
- Correctly labelled true negative examples: 999
- Actually negative examples labelled as positive: 371
- Actually positive examples labelled as negative: 29

Model 2:
- Correctly labelled true positive examples: 255
- Correctly labelled true negative examples: 1345
- Actually negative examples labelled as positive: 25
- Actually positive examples labelled as negative: 375

For calculations, we've entered these numbers in R for you as follows:

```r
m1_tp = 601
m1_tn = 999
m1_fp = 371
m1_fn = 29

m2_tp = 255
m2_tn = 1345
m2_fp = 25
m2_fn = 375
```
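For orientation, here is a minimal sketch of how common summary metrics are computed from the four confusion-matrix counts. The counts below are hypothetical and deliberately differ from Model 1 and Model 2; the metric names are the standard ones (accuracy, precision, recall, specificity), and the variable names are chosen only for this illustration.

```bash
# Hypothetical confusion-matrix counts (NOT the homework's models).
tp=40   # actually positive, labelled positive
tn=45   # actually negative, labelled negative
fp=5    # actually negative, labelled positive
fn=10   # actually positive, labelled negative

# awk does the floating-point arithmetic; output is one tab-separated metric per line.
metrics=$(awk -v tp="$tp" -v tn="$tn" -v fp="$fp" -v fn="$fn" 'BEGIN {
    printf "accuracy\t%.3f\n",    (tp + tn) / (tp + tn + fp + fn)
    printf "precision\t%.3f\n",   tp / (tp + fp)
    printf "recall\t%.3f\n",      tp / (tp + fn)
    printf "specificity\t%.3f\n", tn / (tn + fp)
}')

echo "$metrics"
```

Each metric uses a different denominator, which is why two models can agree on one metric and still differ sharply on the others.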

**Q15a.** (6 points) What is the accuracy of Model 1 and Model 2, respectively?

**Q15b.** (6 points) Of the examples that were *labelled* positive, what proportion are correctly predicted in Model 1? Model 2?

**Q15c.** (6 points) Of the examples that are *actually* negative, what proportion are correctly predicted in Model 1? Model 2?

**Q15d.** (6 points) Of the examples that are *actually* positive, what proportion are correctly predicted in Model 1? Model 2?

**Q15e.** (8 points) Based on the above, are these two models producing equivalent performance? Why or why not? Describe a situation where application of Model 1 would be preferable to Model 2; and conversely, a situation where application of Model 2 might be preferable to Model 1.