## Example data from the wild!! 
Brauer et al. (2008) used microarrays to test the effect of starvation and growth rate on baker’s yeast (S. cerevisiae, a popular model organism for studying molecular genomics because of its simplicity). Basically, if you give yeast plenty of nutrients (a rich medium) but sharply restrict its supply of one nutrient, you can set the growth rate to whatever level you want (this is done with a device called a chemostat). For example, you could limit the yeast’s supply of glucose (sugar, which the cell metabolizes to get energy and carbon), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).

“Starving” the yeast of these nutrients lets us find genes that:

- Raise or lower their activity in response to growth rate. Growth-rate-dependent expression patterns can tell us a lot about cell cycle control and about how the cell responds to stress.
- Respond differently when different nutrients are being limited. These genes may be involved in the transport or metabolism of those nutrients.

Sounds pretty cool, right? So let’s get started!

You can check out the paper here: https://www.molbiolcell.org/doi/full/10.1091/mbc.e07-08-0779

### 1. Start by loading in the data as a pandas dataframe 
- data file = bcmb_bootcamp2020/day3/data/Brauer2008_DataSet1_clean.tds
- Note: this is a tab-separated file, so you will need to specify the delimiter as "\t" in your load command.

In [1]:
import pandas as pd

In [None]:
raw=pd.read_table("../data/Brauer2008_DataSet1_clean.tds", delimiter="\t")

In [4]:
raw

,GID,YORF,NAME,GWEIGHT,G0.05,G0.1,G0.15,G0.2,G0.25,G0.3,...,L0.15,L0.2,L0.25,L0.3,U0.05,U0.1,U0.15,U0.2,U0.25,U0.3
0,GENE1331X,A_06_P5820,SFB2 -- ER to Golgi transport -- molecul...,1,-0.24,-0.13,-0.21,-0.15,-0.05,-0.05,...,0.13,0.20,0.17,0.11,-0.06,-0.26,-0.05,-0.28,-0.19,0.09
1,GENE4924X,A_06_P5866,-- biological process unknown -- mol...,1,0.28,0.13,-0.40,-0.48,-0.11,0.17,...,0.02,0.04,0.03,0.01,-1.02,-0.91,-0.59,-0.61,-0.17,0.18
2,GENE4690X,A_06_P1834,QRI7 -- proteolysis and peptidolysis -- ...,1,-0.02,-0.27,-0.27,-0.02,0.24,0.25,...,-0.07,-0.05,-0.13,-0.04,-0.91,-0.94,-0.42,-0.36,-0.49,-0.47
3,GENE1177X,A_06_P4928,CFT2 -- mRNA polyadenylylation* -- RNA b...,1,-0.33,-0.41,-0.24,-0.03,-0.03,0.00,...,-0.05,0.02,0.00,0.08,-0.53,-0.51,-0.26,0.05,-0.14,-0.01
4,GENE511X,A_06_P5620,SSO2 -- vesicle fusion* -- t-SNARE activ...,1,0.05,0.02,0.40,0.34,-0.13,-0.14,...,0.00,-0.11,0.04,0.01,-0.45,-0.09,-0.13,0.02,-0.09,-0.03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5532,GENE2833X,A_06_P6094,KRE1 -- cell wall organization and bioge...,1,0.41,-0.28,0.30,0.50,-0.05,-0.08,...,0.38,0.23,0.21,0.15,0.32,0.62,0.54,0.01,0.56,0.28
5533,GENE271X,A_06_P3243,MTL1 -- cell wall organization and bioge...,1,0.50,-0.12,0.25,0.24,0.13,0.02,...,0.25,-0.02,-0.06,-0.10,,0.50,0.29,-0.14,0.47,0.27
5534,GENE1691X,A_06_P4196,KRE9 -- cell wall organization and bioge...,1,0.15,0.09,0.21,0.46,0.19,-0.02,...,0.37,0.21,0.16,-0.01,-0.68,0.63,0.41,0.09,0.48,0.43
5535,GENE1755X,A_06_P4680,UTH1 -- mitochondrion organization and b...,1,0.63,0.38,0.05,0.12,0.13,-0.01,...,-0.07,0.02,0.24,0.18,-0.89,0.19,0.03,0.04,0.13,0.19


Each of those columns like G0.05, N0.3 and so on holds the gene expression values for one sample, as measured by the microarray. The column title gives the condition: G0.05, for instance, means the limiting nutrient was glucose and the growth rate was 0.05. A higher value means the gene was more strongly expressed in that sample; a lower value means it was less expressed. In total the yeast was grown with six limiting nutrients at six growth rates, which makes 36 samples and therefore 36 columns of gene expression data.
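To make the 6 x 6 layout concrete, here is a minimal sketch that builds the 36 expected column labels. It assumes the nutrient codes G, N, P, S, L and U and the six growth rates that appear in the file's column names; the variable names are made up for this illustration.

```python
from itertools import product

# Nutrient codes and growth rates as they appear in the column names.
nutrients = ["G", "N", "P", "S", "L", "U"]
growth_rates = ["0.05", "0.1", "0.15", "0.2", "0.25", "0.3"]

# Every nutrient is paired with every growth rate: 6 x 6 = 36 samples.
expected_columns = [n + r for n, r in product(nutrients, growth_rates)]

print(len(expected_columns))  # 36
print(expected_columns[:3])   # ['G0.05', 'G0.1', 'G0.15']
```

With the dataframe loaded as `raw`, something like `set(expected_columns).issubset(raw.columns)` would be one way to check that no sample column is missing.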

Now that you have loaded and looked at the data, list 2 reasons why this dataset does NOT follow the rules of tidy data (hint: review section 2.3 of Hadley Wickham's Tidy Data paper, http://vita.had.co.nz/papers/tidy-data.pdf).

ANSWER:
1. 
2. 

### 2. Make a new dataframe called df_clean that follows the tidy data rules, and display it
- (hint: the "NAME" column consists of the gene name, biological function, molecular function, systematic name, and gene number. Split it into 5 separate columns with unique names. This might be helpful: https://pandas.pydata.org/pandas-docs/version/0.23.1/generated/pandas.Series.str.split.html)

In [10]:
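# split NAME on its "--" separator; expand=True returns one column per field
# (each piece keeps the spaces that surrounded the separator)
splitname=raw['NAME'].str.split("--", expand=True)
# work on a copy so the new columns are not added to raw as well
df_clean=raw.copy()
df_clean['gene_name']=splitname[0]
df_clean['biofun']=splitname[1]
df_clean['molfun']=splitname[2]
df_clean['sysname']=splitname[3]
df_clean['genenum']=splitname[4]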
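Splitting on "--" leaves the spaces around each separator attached to the pieces. As a rough sketch of one way to clean that up, the example below splits a single NAME-style string and strips each resulting column. The string is pieced together from the values displayed in this notebook purely for illustration, and the column names simply mirror the ones used above.

```python
import pandas as pd

# One NAME-style entry, pieced together for illustration only.
names = pd.Series(
    ["SFB2 -- ER to Golgi transport -- molecular function unknown -- YNL049C -- 1082129"]
)

# Split on the "--" separator, one output column per field.
parts = names.str.split("--", expand=True)

# Each piece still carries the spaces that surrounded the separator,
# so strip them column by column.
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ["gene_name", "biofun", "molfun", "sysname", "genenum"]

print(parts.loc[0, "biofun"])  # ER to Golgi transport
```

Stripping is optional for substring searches such as "cell cycle", but it matters as soon as you compare values for exact equality.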


In [None]:
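# drop() returns a new dataframe, so assign the result back.
# Re-running this cell after NAME is already gone raises:
#   KeyError: "['NAME'] not found in axis"
df_clean=df_clean.drop(columns=['NAME'])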

In [17]:
melty=df_clean.melt(id_vars=['GID', 'YORF', 'GWEIGHT', 'gene_name','biofun','molfun','sysname','genenum'], var_name='conditiontime', value_name='expression')
melty

,GID,YORF,GWEIGHT,gene_name,biofun,molfun,sysname,genenum,conditiontime,expression
0,GENE1331X,A_06_P5820,1,SFB2,ER to Golgi transport,molecular function unknown,YNL049C,1082129,G0.05,-0.24
1,GENE4924X,A_06_P5866,1,,biological process unknown,molecular function unknown,YNL095C,1086222,G0.05,0.28
2,GENE4690X,A_06_P1834,1,QRI7,proteolysis and peptidolysis,metalloendopeptidase activity,YDL104C,1085955,G0.05,-0.02
3,GENE1177X,A_06_P4928,1,CFT2,mRNA polyadenylylation*,RNA binding,YLR115W,1081958,G0.05,-0.33
4,GENE511X,A_06_P5620,1,SSO2,vesicle fusion*,t-SNARE activity,YMR183C,1081214,G0.05,0.05
...,...,...,...,...,...,...,...,...,...,...
199327,GENE2833X,A_06_P6094,1,KRE1,cell wall organization and biogenesis,structural constituent of cell wall,YNL322C,1083836,U0.3,0.28
199328,GENE271X,A_06_P3243,1,MTL1,cell wall organization and biogenesis,molecular function unknown,YGR023W,1080930,U0.3,0.27
199329,GENE1691X,A_06_P4196,1,KRE9,cell wall organization and biogenesis*,molecular function unknown,YJL174W,1082539,U0.3,0.43
199330,GENE1755X,A_06_P4680,1,UTH1,mitochondrion organization and biogenesis*,molecular function unknown,YKR042W,1082610,U0.3,0.19
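If it is not obvious what `melt` did to the table, this small sketch may help. It uses a two-gene, two-sample toy table (values copied from the first rows shown earlier) and is only meant to illustrate the wide-to-long reshaping, not to reproduce the full dataset.

```python
import pandas as pd

# A tiny wide table: one row per gene, one column per sample.
wide = pd.DataFrame({
    "gene_name": ["SFB2", "QRI7"],
    "G0.05": [-0.24, -0.02],
    "G0.1": [-0.13, -0.27],
})

# melt keeps id_vars as they are and stacks every other column into
# a (variable, value) pair, giving one row per gene per sample.
long = wide.melt(id_vars=["gene_name"], var_name="conditiontime", value_name="expression")

print(long.shape)  # (4, 3)
```

Two genes times two samples gives four rows; in the real table the same logic multiplies the number of genes by the 36 sample columns.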


In [18]:
melty['conditiontime'].unique()

array(['G0.05', 'G0.1', 'G0.15', 'G0.2', 'G0.25', 'G0.3', 'N0.05', 'N0.1',
       'N0.15', 'N0.2', 'N0.25', 'N0.3', 'P0.05', 'P0.1', 'P0.15', 'P0.2',
       'P0.25', 'P0.3', 'S0.05', 'S0.1', 'S0.15', 'S0.2', 'S0.25', 'S0.3',
       'L0.05', 'L0.1', 'L0.15', 'L0.2', 'L0.25', 'L0.3', 'U0.05', 'U0.1',
       'U0.15', 'U0.2', 'U0.25', 'U0.3'], dtype=object)

In [None]:
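# first character of the label is the limiting nutrient
melty['metab']=melty['conditiontime'].str.slice(0,1)
# everything after it is the growth rate, converted from text to a number
melty['grate']=melty['conditiontime'].str.slice(1).astype(float)
# check that the growth rate is now stored as a float
type(melty['grate'].iloc[0])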


numpy.float64
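As an alternative to slicing by position, the label can be split with a regular expression. This is only a sketch of another option, assuming every label is one capital letter followed by a decimal number, as in the list of unique values above.

```python
import pandas as pd

labels = pd.Series(["G0.05", "N0.3", "U0.25"])

# One named capture group for the nutrient letter, one for the growth rate.
pieces = labels.str.extract(r"^(?P<metab>[A-Z])(?P<grate>\d+\.\d+)$")
pieces["grate"] = pieces["grate"].astype(float)

print(pieces["metab"].tolist())  # ['G', 'N', 'U']
print(pieces["grate"].tolist())  # [0.05, 0.3, 0.25]
```

A label that does not fit the pattern comes back as missing values instead of being sliced incorrectly, which can make malformed column names easier to spot.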

### 3. Subsetting!

Next, let's dig into the data more. Using pandas again, subset the dataframe to keep just the genes that have the string "cell cycle" in their biological process (see the note about the "NAME" column in step 2 above).

In [36]:
for i in melty['biofun']:
    if "cell cycle" in i:
        print(i)

 regulation of progression through cell cycle 
 G1/S transition of mitotic cell cycle* 
 G2/M transition of mitotic cell cycle* 
 regulation of progression through cell cycle* 
 G1/S transition of mitotic cell cycle* 
 G1/S-specific transcription in mitotic cell cycle 
 G1/S transition of mitotic cell cycle* 
 G1/S transition of mitotic cell cycle* 
 cell cycle 
 G1/S transition of mitotic cell cycle 
 G2/M transition of mitotic cell cycle* 
 cell cycle 
 regulation of progression through mitotic cell cycle 
 G2/M transition of mitotic cell cycle* 
 G2/M transition of mitotic cell cycle* 
 G1-specific transcription in mitotic cell cycle 
 G1/S-specific transcription in mitotic cell cycle 
 cell cycle* 
 G2/M transition of mitotic cell cycle* 
 G1/S transition of mitotic cell cycle* 
 G1/S transition of mitotic cell cycle* 
 G2/M-specific transcription in mitotic cell cycle 
 G2/M transition of mitotic cell cycle* 
 G1/S transition of mitotic cell cycle* 
 cell cycle 
 cell cycle 
 G2/M

In [38]:
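# boolean mask: True for rows whose biological process mentions "cell cycle"
cyclegenes=melty['biofun'].str.contains("cell cycle")
# True counts as 1, so the sum is the number of matching rows
sum(cyclegenes)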

2160
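Keep in mind that this count is a number of rows, not a number of genes, because the melted table has one row per gene per sample. If every matching gene has a row for each of the 36 samples, 2160 rows would correspond to 2160 / 36 = 60 genes; that is an inference from the row count, not something the output shows directly. The toy sketch below (made-up table, real-looking labels) shows the difference between the two counts.

```python
import pandas as pd

# Toy long-format table: two genes, each measured in three samples.
toy = pd.DataFrame({
    "sysname": ["YGR211W"] * 3 + ["YNL049C"] * 3,
    "biofun": ["regulation of progression through cell cycle"] * 3
              + ["ER to Golgi transport"] * 3,
    "conditiontime": ["G0.05", "G0.1", "G0.15"] * 2,
})

# na=False treats missing annotations as "no match".
mask = toy["biofun"].str.contains("cell cycle", na=False)

print(mask.sum())                          # 3 matching rows
print(toy.loc[mask, "sysname"].nunique())  # 1 matching gene
```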

In [40]:
melty_cycle=melty[cyclegenes]
melty_cycle

,GID,YORF,GWEIGHT,gene_name,biofun,molfun,sysname,genenum,conditiontime,expression,metab,grate
108,GENE1312X,A_06_P3432,1,ZPR1,regulation of progression through cell cycle,protein binding,YGR211W,1082106,G0.05,0.00,G,0.05
118,GENE3816X,A_06_P2720,1,PTC2,G1/S transition of mitotic cell cycle*,protein phosphatase type 2C activity,YER089C,1084958,G0.05,-1.04,G,0.05
434,GENE3053X,A_06_P1181,1,PIN4,G2/M transition of mitotic cell cycle*,molecular function unknown,YBL051C,1084089,G0.05,-0.71,G,0.05
445,GENE1659X,A_06_P1385,1,HSL7,regulation of progression through cell cycle*,protein-arginine N-methyltransferase activity,YBR133C,1082505,G0.05,-0.56,G,0.05
466,GENE589X,A_06_P4710,1,SIS2,G1/S transition of mitotic cell cycle*,phosphopantothenoylcysteine decarboxylase act...,YKR072C,1081300,G0.05,0.16,G,0.05
...,...,...,...,...,...,...,...,...,...,...,...,...
199116,GENE3589X,A_06_P6416,1,VHS3,G1/S transition of mitotic cell cycle*,phosphopantothenoylcysteine decarboxylase act...,YOR054C,1084698,U0.3,0.10,U,0.30
199120,GENE87X,A_06_P4891,1,SIC1,G1/S transition of mitotic cell cycle*,protein binding*,YLR079W,1080719,U0.3,0.02,U,0.30
199172,GENE317X,A_06_P2832,1,CDC4,G1/S transition of mitotic cell cycle*,protein binding*,YFL009W,1080982,U0.3,0.28,U,0.30
199191,GENE1372X,A_06_P4302,1,PTK2,G1/S transition of mitotic cell cycle*,protein kinase activity,YJR059W,1082178,U0.3,0.15,U,0.30


Next, subset the dataframe again so that it only contains cell cycle genes from the glucose treatments ("G").

Hint: Consider looking into str.contains https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html

In [42]:
gmelty_cycle=melty_cycle[melty_cycle['metab']=="G"]
gmelty_cycle

,GID,YORF,GWEIGHT,gene_name,biofun,molfun,sysname,genenum,conditiontime,expression,metab,grate
108,GENE1312X,A_06_P3432,1,ZPR1,regulation of progression through cell cycle,protein binding,YGR211W,1082106,G0.05,0.00,G,0.05
118,GENE3816X,A_06_P2720,1,PTC2,G1/S transition of mitotic cell cycle*,protein phosphatase type 2C activity,YER089C,1084958,G0.05,-1.04,G,0.05
434,GENE3053X,A_06_P1181,1,PIN4,G2/M transition of mitotic cell cycle*,molecular function unknown,YBL051C,1084089,G0.05,-0.71,G,0.05
445,GENE1659X,A_06_P1385,1,HSL7,regulation of progression through cell cycle*,protein-arginine N-methyltransferase activity,YBR133C,1082505,G0.05,-0.56,G,0.05
466,GENE589X,A_06_P4710,1,SIS2,G1/S transition of mitotic cell cycle*,phosphopantothenoylcysteine decarboxylase act...,YKR072C,1081300,G0.05,0.16,G,0.05
...,...,...,...,...,...,...,...,...,...,...,...,...
33006,GENE3589X,A_06_P6416,1,VHS3,G1/S transition of mitotic cell cycle*,phosphopantothenoylcysteine decarboxylase act...,YOR054C,1084698,G0.3,-0.30,G,0.30
33010,GENE87X,A_06_P4891,1,SIC1,G1/S transition of mitotic cell cycle*,protein binding*,YLR079W,1080719,G0.3,0.04,G,0.30
33062,GENE317X,A_06_P2832,1,CDC4,G1/S transition of mitotic cell cycle*,protein binding*,YFL009W,1080982,G0.3,0.36,G,0.30
33081,GENE1372X,A_06_P4302,1,PTK2,G1/S transition of mitotic cell cycle*,protein kinase activity,YJR059W,1082178,G0.3,-0.11,G,0.30
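The two subsetting steps can also be done in one go by combining boolean masks. The sketch below uses a small made-up table with the same column names; it is one possible way to write the filter, not the only one.

```python
import pandas as pd

toy = pd.DataFrame({
    "gene_name": ["ZPR1", "ZPR1", "SFB2", "SFB2"],
    "biofun": ["regulation of progression through cell cycle"] * 2
              + ["ER to Golgi transport"] * 2,
    "metab": ["G", "U", "G", "U"],
})

# Build each condition as a boolean Series, then combine them with &.
is_cycle = toy["biofun"].str.contains("cell cycle", na=False)
is_glucose = toy["metab"] == "G"

subset = toy[is_cycle & is_glucose]
print(subset["gene_name"].tolist())  # ['ZPR1']
```

Naming the masks separately keeps each condition readable and makes it easy to reuse one of them for a different nutrient.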


Write the subsetted dataframe out to a CSV file, then open it in Excel or Google Sheets to check that you did it right. Screenshot the result.

In [43]:
gmelty_cycle.to_csv("gmelty_cycle.csv")
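Besides opening the file in a spreadsheet, you can read it straight back into pandas as a quick sanity check. The sketch below does this round trip on a two-row toy table; the file name is hypothetical and the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"gene_name": ["ZPR1", "PTC2"], "expression": [0.00, -1.04]})

# Write to a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_subset.csv")  # hypothetical file name
    # index=False leaves out the row labels, which otherwise become
    # an extra unnamed first column in the spreadsheet.
    toy.to_csv(path, index=False)
    check = pd.read_csv(path)

print(check.shape)  # (2, 2)
```

Whether to pass `index=False` is a judgment call: leave it out if you want to keep the original row numbers in the exported file.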

YAY you're all finished and are now super extra awesome at tidying data!! 