## Example data from the wild!! 
Brauer et al. (2008) used microarrays to test the effect of starvation and growth rate on baker’s yeast (S. cerevisiae, a popular model organism for studying molecular genomics because of its simplicity). Basically, if you give yeast plenty of nutrients (a rich medium) but sharply restrict its supply of one nutrient, you can set the growth rate to whatever level you want (we do this with a tool called a chemostat). For example, you could limit the yeast’s supply of glucose (sugar, which the cell metabolizes to get energy and carbon), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).

“Starving” the yeast of these nutrients lets us find genes that:

- Raise or lower their activity in response to growth rate. Growth-rate dependent expression patterns can tell us a lot about cell cycle control, and how the cell responds to stress.
- Respond differently when different nutrients are being limited. These genes may be involved in the transport or metabolism of those nutrients.

Sounds pretty cool, right? So let’s get started!

You can check out the paper here: https://www.molbiolcell.org/doi/full/10.1091/mbc.e07-08-0779

### 1. Start by loading in the data as a pandas dataframe 
- data file = bcmb_bootcamp2020/day2/data/Brauer2008_DataSet1_clean.tds
- Note this is a tab-separated file, so you will need to specify the delimiter as "\t" in your load command
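
A minimal sketch of the load step, assuming pandas is installed. To stay runnable anywhere, it reads a tiny made-up tab-separated sample from a string; the gene names and values in the sample are illustrative and are not taken from the real file. For the exercise itself, pass the data file path to `pd.read_csv` instead, as shown in the comment.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for the real file (illustrative values only)
sample_tsv = (
    "NAME\tG0.05\tG0.1\tN0.05\n"
    "geneA\t-0.24\t-0.13\t0.20\n"
    "geneB\t0.28\t0.13\t-0.40\n"
)

# sep="\t" tells pandas that columns are separated by tabs rather than commas
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# For the real data, replace the StringIO object with the file path:
# df = pd.read_csv("bcmb_bootcamp2020/day2/data/Brauer2008_DataSet1_clean.tds", sep="\t")

print(df.head())   # first few rows
print(df.shape)    # (number of rows, number of columns)
```

Calling `df.head()` and `df.shape` right after loading is a quick way to check that the delimiter was interpreted correctly: if everything lands in a single column, the separator is probably wrong.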

Columns like G0.05, N0.3 and so on each hold the gene expression values for one sample, as measured by the microarray. The column title encodes the condition: G0.05, for instance, means the limiting nutrient was glucose and the growth rate was 0.05. A higher value means the gene was more highly expressed in that sample; a lower value means it was less expressed. In total the yeast was grown with six limiting nutrients at six growth rates each, which makes 36 samples, and therefore 36 columns, of gene expression data.

Now that you have loaded and looked at the data, list 2 reasons why this dataset does NOT follow the rules of tidy data (hint: review section 2.3 of Hadley Wickham's Tidy Data paper, http://vita.had.co.nz/papers/tidy-data.pdf).

ANSWER:
1. 
2. 

### 2. Make a new dataframe called df_clean that follows the tidy data rules, and print it
- (hint: the "NAME" column consists of the gene name, biological process, molecular function, systematic name, and gene number. Split it into 5 separate columns with unique names. This might be helpful: https://pandas.pydata.org/pandas-docs/version/0.23.1/generated/pandas.Series.str.split.html)
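
One possible approach to the tidying step, sketched on a made-up two-gene table. Several things here are assumptions rather than facts about the real file: the `||` separator inside "NAME", the five new column names, and the extra `sample`, `expression`, `nutrient`, and `rate` columns. Check the actual contents of "NAME" before reusing the split pattern.

```python
import pandas as pd

# Made-up wide table; the "||" separator inside NAME is an assumption
df = pd.DataFrame({
    "NAME": [
        "GENE1 || cell cycle || kinase activity || YAL001C || 101",
        "GENE2 || glucose transport || transporter activity || YAL002W || 102",
    ],
    "G0.05": [-0.24, 0.28],
    "N0.3": [0.10, -0.35],
})

# Hypothetical names for the five pieces packed into NAME
name_cols = ["gene_name", "biological_process", "molecular_function",
             "systematic_name", "gene_number"]

# Split on "||" plus any surrounding whitespace; the pipes are escaped
# because multi-character patterns are treated as regular expressions
parts = df["NAME"].str.split(r"\s*\|\|\s*", expand=True)
parts.columns = name_cols
wide = pd.concat([parts, df.drop(columns="NAME")], axis=1)

# Melt the per-sample columns into one row per gene per sample
df_clean = wide.melt(id_vars=name_cols, var_name="sample", value_name="expression")

# The sample label holds two variables: nutrient letter and growth rate
df_clean["nutrient"] = df_clean["sample"].str[0]
df_clean["rate"] = df_clean["sample"].str[1:].astype(float)

print(df_clean)
```

The melt step is what turns column headers that are really values (the sample labels) into an ordinary column, which is the main change needed to get one observation per row.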

### 3. Next, visualize the data. Load matplotlib and seaborn for plotting
- The best part!!! Pretty plots are so much FUN!

First, make a density plot of all the gene expression values. Underneath it, in a markdown cell, explain what the data mean (i.e., what is a gene expression value, what kind of distribution is this, and what does a large negative number mean vs. a large positive number?).

### 4. Plotting some more!

Next, let's dig into the data some more. Using pandas again, subset the dataframe to keep only the genes that have the string "cell cycle" in their biological process (see the note about the "NAME" column in step 2).

Next, subset the dataframe again so it contains only cell cycle genes from the glucose-limited samples (sample names starting with "G").

Hint: Consider looking into str.contains https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
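
A short sketch of both subsetting steps on a made-up tidy table. The column names (`biological_process`, `sample`, `expression`) are assumptions about how your tidy dataframe is laid out, so adjust them to match yours.

```python
import pandas as pd

# Made-up tidy table (illustrative values only)
df_clean = pd.DataFrame({
    "gene_name": ["GENE1", "GENE1", "GENE2", "GENE2", "GENE3", "GENE3"],
    "biological_process": [
        "cell cycle", "cell cycle",
        "glucose transport", "glucose transport",
        "regulation of cell cycle", "regulation of cell cycle",
    ],
    "sample": ["G0.05", "N0.05"] * 3,
    "expression": [0.5, -0.2, 1.1, 0.3, -0.7, 0.0],
})

# str.contains returns a boolean Series; na=False counts missing
# annotations as non-matches instead of leaving NaN in the mask
is_cell_cycle = df_clean["biological_process"].str.contains("cell cycle", na=False)
cell_cycle = df_clean[is_cell_cycle]

# Glucose-limited samples have labels that start with "G"
is_glucose = cell_cycle["sample"].str.startswith("G")
glucose_cell_cycle = cell_cycle[is_glucose]

print(glucose_cell_cycle)
```

Note that `str.contains` matches substrings, so "regulation of cell cycle" is kept as well as plain "cell cycle".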

Now that you have subset the dataframe, let's plot! Use seaborn to draw a boxplot with sample on the x axis and expression on the y axis.

Now overlay a second seaborn plot to add the individual points to your boxplot (hint: swarmplot).
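
A minimal sketch of the overlay, assuming matplotlib, seaborn, and numpy are installed. The data are randomly generated stand-ins and the `sample` and `expression` column names are assumptions. The key idea is that both plotting calls receive the same `ax`, so the points are drawn on top of the boxes.

```python
import matplotlib
matplotlib.use("Agg")  # lets the sketch run without a display; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up expression values for three glucose-limited samples (illustrative only)
rng = np.random.default_rng(1)
samples = ["G0.05", "G0.1", "G0.15"]
toy = pd.DataFrame({
    "sample": np.repeat(samples, 20),
    "expression": rng.normal(loc=0.0, scale=0.5, size=60),
})

fig, ax = plt.subplots()

# Boxes first; outlier markers are hidden because the swarm shows every point
sns.boxplot(data=toy, x="sample", y="expression",
            color="lightgray", showfliers=False, ax=ax)

# Same axes, same x and y, so the points line up with the boxes
sns.swarmplot(data=toy, x="sample", y="expression",
              color="black", size=3, ax=ax)

ax.set_xlabel("sample")
ax.set_ylabel("expression")
```

Hiding the boxplot's own outlier markers is a design choice: otherwise the most extreme points would be drawn twice, once by each plot.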

YAY, you're all finished and are now super extra awesome at tidying and visualizing data!!