# What to expect

In notebook 2B we looked at the output of STAR and combined the results for each sample in the <i>Schistosoma mansoni</i> example dataset into a single dataframe. We considered ways to normalize the gene count data and viewed the results using Principal Component Analysis (PCA). In this second part of the session, you will repeat most of this process for a dataset of your choice.

### The datasets
The following two datasets are available. Click on the links to find out more about each one:
1. *Plasmodium*
2. *Trypanosoma*

In [None]:
# Which dataset are you considering?
dataset = 

# Inspect STAR results
As before, open `analysis/<dataset>/multiqc/multiqc_report.html` by double-clicking it and take a look at the report.

<div class="alert alert-block alert-warning">
Questions:
    
1. What percentage of reads aligned to the reference (provide a range)? How does this compare to the example?
2. What percentage of reads mapped exactly once in each sample? Does this look reasonable?
3. Are there any samples which look less good? In what way?

<div class="alert alert-block alert-success">
Answers:
    


# Strandedness
Choose an accession and take a look at `analysis/<dataset>/star/<accession>/<accession>ReadsPerGene.out.tab`. 

<div class="alert alert-block alert-warning">
Question:
    
4. Do you think this library was stranded, reverse stranded or unstranded?


<details>
<summary><i>Hint</i></summary>

Use `head` on the command line to view the first few lines of a file. You can set the number of lines with `-n`
</details>
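If you prefer to check this in Python, the following is a minimal sketch on a made-up table rather than one of the course files. It relies on STAR's `ReadsPerGene.out.tab` layout: four summary rows starting with `N_`, then one row per gene, with column 2 holding unstranded counts, column 3 counts for reads on the same strand as the gene, and column 4 counts for reads on the opposite strand. The column names `unstranded`, `forward` and `reverse` are labels chosen here for illustration; the file itself has no header.

```python
import io
import pandas as pd

# Synthetic stand-in for a ReadsPerGene.out.tab file (all values are made up).
example = io.StringIO(
    "N_unmapped\t1000\t1000\t1000\n"
    "N_multimapping\t500\t500\t500\n"
    "N_noFeature\t200\t9000\t300\n"
    "N_ambiguous\t50\t5\t40\n"
    "gene_a\t400\t10\t390\n"
    "gene_b\t600\t20\t580\n"
    "gene_c\t250\t5\t245\n"
)

# The file has no header row, so we supply illustrative column names.
cols = ["gene_id", "unstranded", "forward", "reverse"]
tab = pd.read_csv(example, sep="\t", header=None, names=cols)

# Drop the four summary rows and total each count column over the genes.
genes = tab[~tab["gene_id"].str.startswith("N_")]
totals = genes[["unstranded", "forward", "reverse"]].sum()
print(totals)

# Share of the strand-specific counts that fall in the forward column.
forward_fraction = totals["forward"] / (totals["forward"] + totals["reverse"])
print(f"forward fraction: {forward_fraction:.2f}")
```

As a rough guide, a fraction near 0.5 suggests an unstranded library, while a fraction close to 1 or close to 0 suggests that one strand dominates.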

<div class="alert alert-block alert-success">
Answer:
    


# Combining data across samples
To ensure you have the correct results going forward, we have already combined the outputs for this dataset into `analysis/<dataset>/star/ReadsPerGene.csv`. Use pandas to load the dataframe and take a look at it.
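As a reminder of the pattern, here is a minimal sketch that reads a small made-up CSV from memory. The layout (identifiers in the first column, used as the index) is an assumption for illustration; check the real file to see how it is organised before choosing `index_col`.

```python
import io
import pandas as pd

# Synthetic stand-in for a combined counts file; the layout is assumed.
csv_text = io.StringIO(
    "gene_id,sample_1,sample_2\n"
    "gene_a,10,12\n"
    "gene_b,0,3\n"
)

# index_col=0 uses the first column as row labels instead of a data column.
example_df = pd.read_csv(csv_text, index_col=0)
print(example_df.shape)
print(example_df.head())
```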

In [None]:
import pandas as pd

df = 

In [None]:
df

# Normalization with DESeq2

As with the example dataset, we will load this dataset into a DeseqDataSet and use DESeq2 to normalize the counts. For each dataset we have provided metadata in `data/<dataset>/metadata.csv`. First use pandas to load the metadata and have a look. 

In [None]:
metadata = 
counts = 

In [None]:
metadata

For *Plasmodium*, we want to compare the wildtype at the different timepoints. For *Trypanosoma*, we want to compare *Trypanosoma brucei brucei* stages with different morphologies. Filter the datasets to contain just these samples. 
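One way to do this kind of filtering is with a boolean mask on the metadata, then using the surviving sample names to select the matching rows of the counts. The sketch below uses made-up tables; the column names `genotype` and `timepoint` and the sample names are hypothetical, so substitute whatever your metadata actually contains. It assumes samples are rows in both tables, which is the orientation pydeseq2 expects; if your counts have genes as rows, transpose them first with `.T`.

```python
import pandas as pd

# Hypothetical metadata: sample names as the index, one row per sample.
example_metadata = pd.DataFrame(
    {
        "genotype": ["wildtype", "wildtype", "mutant"],
        "timepoint": ["0h", "24h", "0h"],
    },
    index=["s1", "s2", "s3"],
)

# Hypothetical counts: samples as rows, genes as columns.
example_counts = pd.DataFrame(
    {"gene_a": [10, 12, 7], "gene_b": [0, 3, 5]},
    index=["s1", "s2", "s3"],
)

# Keep only the samples of interest, then select the same samples from counts.
keep = example_metadata["genotype"] == "wildtype"
metadata_subset = example_metadata.loc[keep]
counts_subset = example_counts.loc[metadata_subset.index]

# Both tables should now list the same samples in the same order.
print(counts_subset.index.equals(metadata_subset.index))
```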

In [None]:
counts_s = 
metadata_s = 

Generate a DeseqDataSet object for your analysis, thinking about what to pass to the `design_factors` parameter.

In [None]:
from pydeseq2.dds import DeseqDataSet

dds = DeseqDataSet(
    counts=counts_s,
    metadata=metadata_s,
    refit_cooks=True,
    design_factors=
)

dds.deseq2()
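For intuition about the normalization step, DESeq2 scales each sample by a size factor estimated with the median-of-ratios method. The following is a minimal NumPy sketch of that idea on made-up counts, not the pydeseq2 implementation. The toy matrix contains no zeros; with real data, genes with a zero count in any sample have to be excluded before taking logs.

```python
import numpy as np

# Made-up counts: 3 samples (rows) x 4 genes (columns), no zeros.
toy_counts = np.array(
    [
        [100, 200, 50, 10],
        [200, 400, 100, 20],
        [50, 100, 25, 5],
    ],
    dtype=float,
)

log_counts = np.log(toy_counts)

# Log of each gene's geometric mean across samples.
log_geo_means = log_counts.mean(axis=0)

# Size factor per sample: median over genes of count / geometric mean.
size_factors = np.exp(np.median(log_counts - log_geo_means, axis=1))

# Dividing each sample by its size factor puts samples on a common scale.
toy_normed = toy_counts / size_factors[:, None]

print(size_factors)
print(toy_normed)
```

In this toy example the second sample is simply the first sequenced twice as deeply and the third half as deeply, so after normalization all three rows are identical.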

In [None]:
# View the normed counts using the following


# PCA Plot

Now take a look at how the overall data looks on a Principal Component Analysis plot of PC1 and PC2. Plot the loadings associated with these components.
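To make the terms concrete, here is a minimal NumPy sketch of PCA via singular value decomposition on made-up data. It is an illustration only and not a replacement for the scanpy functions used in this course. Here "loadings" means the unscaled principal axes (the weight of each gene on each component); other tools may scale them differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up log-scale expression: 6 samples x 5 genes.
# The first three samples form a group with much higher expression of gene 0.
X = rng.normal(size=(6, 5))
X[:3, 0] += 10.0

# Center each gene, then decompose.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * S                         # sample coordinates on each PC
loadings = Vt.T                        # genes x PCs
explained = S**2 / np.sum(S**2)        # fraction of variance per PC

# The gene with the largest absolute loading contributes most to PC1.
top_gene_pc1 = int(np.argmax(np.abs(loadings[:, 0])))

print("variance explained:", np.round(explained, 3))
print("PC1 scores:", np.round(scores[:, 0], 2))
print("gene contributing most to PC1:", top_gene_pc1)
```

Because the two groups differ mainly in gene 0, PC1 separates the groups and gene 0 carries the largest PC1 loading. The same reasoning applies when you read the loadings plot for your own dataset.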

In [None]:
import scanpy as sc



<div class="alert alert-block alert-warning">

Questions:

5. Is there a separation between the groups?
6. What is PC1 separating?
7. What is PC2 separating?
8. Which 5 genes contribute most to PC1? Which 2 contribute most to PC2?

<div class="alert alert-block alert-success">
Answers:
    
