# Clean and prepare GSE21784 raw data

Tong Shu Li

We reshape the expression value matrix to long form and attach the relevant metadata for each sample. Probe IDs are not converted to other identifiers here because the probe mappings are incomplete and ambiguous.

In [1]:
import pandas as pd
import sys

In [2]:
sys.path.append("../..")

In [3]:
from src.geo import parse_series_matrix

## Read series matrix

The parser function checks that no values are missing.

In [4]:
series, samples, exp = parse_series_matrix("GSE21784_series_matrix.txt")
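The internals of `parse_series_matrix` are not shown in this notebook. As a rough sketch of the reshaping step only, assuming the data table has one row per probe and one column per sample, a wide matrix can be converted to long form with `melt`. The toy values and the `wide` and `long_form` names are made up for illustration; this is not the parser's actual code.

```python
import pandas as pd

# Toy wide matrix shaped like the data table of a GEO series matrix file:
# one row per probe, one column per sample (all values are made up).
wide = pd.DataFrame({
    "ID_REF": ["probe_a", "probe_b"],
    "GSM1": [8.71, 9.91],
    "GSM2": [8.65, 10.02],
})

# Fail early if any expression value is missing.
assert not wide.isnull().values.any(), "missing values in expression matrix"

# Reshape to long form: one row per (probe, sample) pair.
long_form = (
    wide
    .rename(columns={"ID_REF": "probe_id"})
    .melt(id_vars="probe_id", var_name="geo_id", value_name="log2_exp")
)
```

With 22625 probes and 9 samples, the same reshaping gives 22625 × 9 = 203625 rows.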

In [5]:
samples.shape

(9, 37)

In [6]:
samples.head(2)

| | title | geo_accession | status | submission_date | last_update_date | type | channel_count | source_name_ch1 | organism_ch1 | characteristics_ch1 | ... | contact_laboratory | contact_department | contact_institute | contact_address | contact_city | contact_state | contact_zip/postal_code | contact_country | supplementary_file | data_row_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | L4 larvae, biological rep1 | GSM542652 | Public on May 11 2011 | May 11 2010 | May 11 2011 | RNA | 1 | C. elegans L4 larvae | Caenorhabditis elegans | strain: Bristol N2 | ... | Dennis Kim | Biology | MIT | 77 Massachusetts Ave. (68-440D) | Cambridge | MA | 2139 | USA | ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supple... | 22625 |
| 1 | L4 larvae, biological rep2 | GSM542653 | Public on May 11 2011 | May 11 2010 | May 11 2011 | RNA | 1 | C. elegans L4 larvae | Caenorhabditis elegans | strain: Bristol N2 | ... | Dennis Kim | Biology | MIT | 77 Massachusetts Ave. (68-440D) | Cambridge | MA | 2139 | USA | ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supple... | 22625 |


In [7]:
exp.shape

(203625, 3)

In [8]:
exp.head()

| | probe_id | geo_id | log2_exp |
|---|---|---|---|
| 0 | 171720_x_at | GSM542652 | 8.71 |
| 1 | 171721_x_at | GSM542652 | 9.91 |
| 2 | 171722_x_at | GSM542652 | 11.28 |
| 3 | 171723_x_at | GSM542652 | 12.97 |
| 4 | 171724_x_at | GSM542652 | 8.26 |


The series contains three biological replicates at each of three ages: L4, day 6, and day 15 (nine samples in total). We will extract the age and replicate number from each sample title as metadata.

## Extract age and replicate number

For simplicity, I treat L4 worms as day 0 adults so that age can be used as a numeric variable in linear regression later. This is probably not biologically accurate, but it is a working assumption for now.

In [9]:
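meta = (
    samples[["title", "geo_accession"]]
        .assign(
            # Age label is everything before the first comma, e.g. "day 6 adults"
            age = lambda df: df["title"].str.split(",").str.get(0),

            # Replicate number is the final digit of the title, e.g. "rep1" -> 1
            replicate = lambda df:
                pd.to_numeric(df["title"].str.extract(r'(\d)$', expand = False)),

            # First number in the title: "4" for L4 (recoded to day 0), else the day
            days_old = lambda df:
                pd.to_numeric(
                    df["title"].str.extract(r'(\d+)', expand = False).replace("4", "0")
                )
        )
        .drop("title", axis = 1)
)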

In [10]:
meta

| | geo_accession | age | days_old | replicate |
|---|---|---|---|---|
| 0 | GSM542652 | L4 larvae | 0 | 1 |
| 1 | GSM542653 | L4 larvae | 0 | 2 |
| 2 | GSM542654 | L4 larvae | 0 | 3 |
| 3 | GSM542655 | day 6 adults | 6 | 1 |
| 4 | GSM542656 | day 6 adults | 6 | 2 |
| 5 | GSM542657 | day 6 adults | 6 | 3 |
| 6 | GSM542658 | day 15 adults | 15 | 1 |
| 7 | GSM542659 | day 15 adults | 15 | 2 |
| 8 | GSM542660 | day 15 adults | 15 | 3 |
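The recoding above relies on the first number in an L4 title being "4", so a hypothetical "day 4" sample would also be mapped to day 0. This series has no such sample, so the result is correct here. As a sketch of a more explicit alternative, assuming every title follows the pattern "<age label>, biological rep<N>", the day can be read only from labels that start with "day". The `titles` series and the `parts`, `day`, and `parsed` names are illustrative and not part of the notebook.

```python
import pandas as pd

# Example titles in the same style as the GSE21784 sample titles.
titles = pd.Series([
    "L4 larvae, biological rep1",
    "day 6 adults, biological rep2",
    "day 15 adults, biological rep3",
])

# Named groups: the age label before the comma and the trailing replicate number.
parts = titles.str.extract(r"^(?P<age>[^,]+),.*rep(?P<replicate>\d+)$")

# Day number for adult samples only; L4 labels do not match and give a missing value.
day = parts["age"].str.extract(r"^day (\d+)", expand=False)

parsed = parts.assign(
    replicate=pd.to_numeric(parts["replicate"]),
    # Assumption carried over from the notebook: L4 larvae count as day 0.
    days_old=pd.to_numeric(day).fillna(0).astype(int),
)
```

Here the day 0 assumption is applied in one visible place (`fillna(0)`), which makes it easier to change later if L4 should be handled differently.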


### Save metadata to file

In [11]:
meta.to_csv("sample_metadata.tsv", sep = '\t', index = False)

## Add metadata to expression values

In [12]:
exp.groupby("geo_id").size()

geo_id
GSM542652    22625
GSM542653    22625
GSM542654    22625
GSM542655    22625
GSM542656    22625
GSM542657    22625
GSM542658    22625
GSM542659    22625
GSM542660    22625
dtype: int64

In [13]:
exp = (pd
    .merge(
        exp, meta, how = "left",
        left_on = "geo_id", right_on = "geo_accession"
    )
    .drop("geo_accession", axis = 1)
)

In [14]:
exp.shape

(203625, 6)
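The unchanged row count (203625) suggests the join did not duplicate or drop any rows. As a minimal sketch of how the same guarantees could be enforced during the merge itself, pandas provides the `validate` and `indicator` arguments. The toy frames `exp_toy` and `meta_toy` below are made up for illustration.

```python
import pandas as pd

# Toy long-form expression table and sample metadata (all values are made up).
exp_toy = pd.DataFrame({
    "probe_id": ["p1", "p2", "p1", "p2"],
    "geo_id": ["GSM1", "GSM1", "GSM2", "GSM2"],
    "log2_exp": [8.7, 9.9, 8.6, 10.0],
})
meta_toy = pd.DataFrame({
    "geo_accession": ["GSM1", "GSM2"],
    "days_old": [0, 6],
})

merged = exp_toy.merge(
    meta_toy, how="left",
    left_on="geo_id", right_on="geo_accession",
    validate="many_to_one",  # raises MergeError if a sample ID repeats in meta_toy
    indicator=True,          # adds a "_merge" column recording where each row matched
)

# Every expression row should have found its sample metadata.
assert (merged["_merge"] == "both").all()
# A many-to-one left join must not change the number of rows.
assert len(merged) == len(exp_toy)

merged = merged.drop(columns=["geo_accession", "_merge"])
```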

In [15]:
exp.head()

| | probe_id | geo_id | log2_exp | age | days_old | replicate |
|---|---|---|---|---|---|---|
| 0 | 171720_x_at | GSM542652 | 8.71 | L4 larvae | 0 | 1 |
| 1 | 171721_x_at | GSM542652 | 9.91 | L4 larvae | 0 | 1 |
| 2 | 171722_x_at | GSM542652 | 11.28 | L4 larvae | 0 | 1 |
| 3 | 171723_x_at | GSM542652 | 12.97 | L4 larvae | 0 | 1 |
| 4 | 171724_x_at | GSM542652 | 8.26 | L4 larvae | 0 | 1 |


In [16]:
exp.groupby("geo_id")["probe_id"].nunique().value_counts()

22625    9
Name: probe_id, dtype: int64

We do not annotate probes with mappings to other databases for now, because some genes have no mapping, some have multiple mappings, and some have conflicting mappings. Downstream analysis scripts can decide which genes to keep.
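To illustrate the kind of decision left to the analysis scripts, here is a minimal sketch that classifies probes by how many distinct genes they map to. The `mapping` table, its column names, and the probe and gene IDs are all hypothetical; no real annotation file is used in this notebook.

```python
import pandas as pd

# Hypothetical probe-to-gene table; probe and gene IDs are made up.
mapping = pd.DataFrame({
    "probe_id": ["p1", "p2", "p2", "p3"],
    "gene_id": ["geneA", "geneB", "geneC", None],
})

# Count distinct non-missing genes per probe (nunique ignores missing values).
genes_per_probe = mapping.groupby("probe_id")["gene_id"].nunique()

def mapping_status(n_genes):
    """Label a probe by the number of distinct genes it maps to."""
    if n_genes == 0:
        return "unmapped"
    if n_genes == 1:
        return "unique"
    return "ambiguous"

status = genes_per_probe.map(mapping_status)

# One possible policy: keep only probes that map to exactly one gene.
unique_probes = status[status == "unique"].index.tolist()
```

An analysis script could then filter the long-form table with something like `exp[exp["probe_id"].isin(unique_probes)]`, or apply a different policy to the ambiguous probes.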

## Save to file

In [17]:
exp.to_csv("annot_GSE21784.tsv", sep = '\t', index = False)