# Parse metadata from GSE77110

Tong Shu Li

We need to extract the metadata from GSE77110 prior to expression analysis.

In [1]:
import pandas as pd
import sys

In [2]:
sys.path.append("../../..")

In [3]:
from src.util import read_file

---

## Explore the file format

Lines starting with `!` are comments and metadata. The file seems to begin with a contiguous block of metadata, has the actual expression values in the middle, and ends with a single line denoting the end of the expression matrix.

In [4]:
print([i for i, line in enumerate(read_file("GSE77110_series_matrix.txt")) if line.startswith("!")])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 22690]


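The actual data table begins with `!series_matrix_table_begin` (line index 63) and ends with `!series_matrix_table_end` (line index 22690). The 63 lines before the table (indices 0 to 62) are the metadata that we need to examine. Index 26 is missing from the list above, so that line does not start with `!`; it is presumably a blank separator.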
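As a rough sketch of how the two marker lines could be located programmatically, the snippet below scans a list of lines for them. The helper name `find_table_bounds` and the tiny in-memory example are made up for illustration; they stand in for the real file so that the snippet runs on its own.

```python
def find_table_bounds(lines):
    """Return the indices of the table begin and end marker lines (None if absent)."""
    begin = end = None
    for i, line in enumerate(lines):
        marker = line.strip()
        if marker == "!series_matrix_table_begin":
            begin = i
        elif marker == "!series_matrix_table_end":
            end = i
    return begin, end


# Made-up miniature file with the same layout as a series matrix file.
example_lines = [
    '!Series_title\t"example"',
    '',
    '!Sample_title\t"A"\t"B"',
    '!series_matrix_table_begin',
    '"ID_REF"\t"A"\t"B"',
    '"probe_1"\t1.0\t2.0',
    '!series_matrix_table_end',
]

bounds = find_table_bounds(example_lines)
print(bounds)  # (3, 6)
```

Everything strictly between the two indices would then be the expression table, including its header row.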

Everything seems to be tab separated in the file.

## Number of columns in metadata

How many columns are in the metadata rows?

In [5]:
print([len(line.split('\t')) for line in read_file("GSE77110_series_matrix.txt") if line.startswith("!")])

[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 1, 1]


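The first 26 metadata lines have two tab-separated fields each: a name and a value. The next 36 lines have 16 fields each: a name followed by 15 values, presumably one per sample. The last two entries have a single field and are the two marker lines mentioned above, which sandwich the actual expression matrix.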
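The long list of field counts can be hard to read by eye. A minimal sketch of a more compact summary, using `collections.Counter` to tally how many `!` lines have each field count, is shown here. The helper `count_field_widths` and the example lines (including the placeholder accession `GSE0000`) are invented for illustration.

```python
from collections import Counter


def count_field_widths(lines):
    """Count how many '!' lines have each number of tab-separated fields."""
    return Counter(
        len(line.rstrip("\n").split("\t"))
        for line in lines
        if line.startswith("!")
    )


# Made-up lines in the same shape as the series matrix file.
example_lines = [
    '!Series_title\t"example series"',
    '!Series_geo_accession\t"GSE0000"',
    '!Sample_title\t"s1"\t"s2"\t"s3"',
    '!Sample_geo_accession\t"GSM1"\t"GSM2"\t"GSM3"',
    '!series_matrix_table_begin',
    '"ID_REF"\t"GSM1"\t"GSM2"\t"GSM3"',
    '!series_matrix_table_end',
]

widths = count_field_widths(example_lines)
print(sorted(widths.items()))  # [(1, 2), (2, 2), (4, 2)]
```

Applied to the real file, the same tally should show three groups: two-field lines, 16-field lines, and the two single-field markers.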

## Read the metadata

Load the metadata into a dataframe.

In [6]:
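meta = []  # series-level lines: a name and a single value
info = []  # all other "!" lines: per-sample lines plus the two table markers
for line in read_file("GSE77110_series_matrix.txt"):
    if line.startswith("!"):
        vals = line[1:].split("\t")  # drop the leading "!" and split on tabs
        v = meta if len(vals) == 2 else info
        v.append(vals)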
        
meta = pd.DataFrame(meta, columns = ["var", "val"])
info = pd.DataFrame(info)

In [7]:
meta.shape

(26, 2)

In [8]:
info.shape

(38, 16)
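The 38 rows in `info` are the 36 sixteen-field lines plus the two single-field table markers, because every `!` line that does not have exactly two fields is routed into `info`. One way to keep the markers apart is sketched below. `split_metadata` is a hypothetical helper, not part of `src.util`, and the rule "one field is a marker, two fields is series-level, anything longer is per-sample" is an assumption that fits this file but would misfile the sample lines of a series containing only one sample.

```python
def split_metadata(lines):
    """Split '!' lines into series-level rows, per-sample rows, and bare markers."""
    series, samples, markers = [], [], []
    for line in lines:
        if not line.startswith("!"):
            continue  # blank lines and expression rows are ignored
        vals = line[1:].rstrip("\n").split("\t")
        if len(vals) == 1:
            markers.append(vals[0])
        elif len(vals) == 2:
            series.append(vals)
        else:
            samples.append(vals)
    return series, samples, markers


# Made-up miniature input; the titles and accessions are copied from this dataset.
example_lines = [
    '!Series_geo_accession\t"GSE77110"',
    '',
    '!Sample_title\t"N2_AL_AD2"\t"N2_AL_AD4"',
    '!Sample_geo_accession\t"GSM2044469"\t"GSM2044470"',
    '!series_matrix_table_begin',
    '"ID_REF"\t"GSM2044469"\t"GSM2044470"',
    '!series_matrix_table_end',
]

series, samples, markers = split_metadata(example_lines)
print(len(series), len(samples), markers)
```

With the markers held separately, the per-sample rows all have the same length, so no padding columns are needed when they are loaded into a dataframe.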

In [9]:
meta.head()

Unnamed: 0,var,val
0,Series_title,"""C.elegans time course study on dietary restri..."
1,Series_geo_accession,"""GSE77110"""
2,Series_status,"""Public on Jan 22 2016"""
3,Series_submission_date,"""Jan 21 2016"""
4,Series_last_update_date,"""Apr 22 2016"""


In [10]:
info.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,Sample_title,"""N2_AL_AD2""","""N2_AL_AD4""","""N2_CR_AD4""","""N2_IF_AD4""","""N2_AL_AD6""","""N2_CR_AD6""","""N2_IF_AD6""","""N2_AL_AD8""","""N2_CR_AD8""","""N2_IF_AD8""","""N2_AL_AD10""","""N2_CR_AD10""","""N2_IF_AD10""","""N2_IF_AD12""","""N2_IF_AD14"""
1,Sample_geo_accession,"""GSM2044469""","""GSM2044470""","""GSM2044471""","""GSM2044472""","""GSM2044473""","""GSM2044474""","""GSM2044475""","""GSM2044476""","""GSM2044477""","""GSM2044478""","""GSM2044479""","""GSM2044480""","""GSM2044481""","""GSM2044482""","""GSM2044483"""
2,Sample_status,"""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016""","""Public on Jan 22 2016"""
3,Sample_submission_date,"""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016""","""Jan 21 2016"""
4,Sample_last_update_date,"""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016""","""Jan 22 2016"""


The two-column metadata isn't very helpful: it doesn't contain the per-sample details that we are looking for. That information is instead contained in the 16-column sample metadata.

## Restructure the sample metadata

In [11]:
info = info.T

In [12]:
info.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,28,29,30,31,32,33,34,35,36,37
0,Sample_title,Sample_geo_accession,Sample_status,Sample_submission_date,Sample_last_update_date,Sample_type,Sample_channel_count,Sample_source_name_ch1,Sample_organism_ch1,Sample_characteristics_ch1,...,Sample_contact_department,Sample_contact_institute,Sample_contact_address,Sample_contact_city,Sample_contact_zip/postal_code,Sample_contact_country,Sample_supplementary_file,Sample_data_row_count,series_matrix_table_begin,series_matrix_table_end
1,"""N2_AL_AD2""","""GSM2044469""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under AL condition, on adult day 2""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,
2,"""N2_AL_AD4""","""GSM2044470""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under AL condition, on adult day 4""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,
3,"""N2_CR_AD4""","""GSM2044471""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under CR condition, on adult day 4""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,
4,"""N2_IF_AD4""","""GSM2044472""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under IF condition, on adult day 4""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,


### Create column names

In [13]:
info = info.rename(columns = info.iloc[0])

In [14]:
info.head()

Unnamed: 0,Sample_title,Sample_geo_accession,Sample_status,Sample_submission_date,Sample_last_update_date,Sample_type,Sample_channel_count,Sample_source_name_ch1,Sample_organism_ch1,Sample_characteristics_ch1,...,Sample_contact_department,Sample_contact_institute,Sample_contact_address,Sample_contact_city,Sample_contact_zip/postal_code,Sample_contact_country,Sample_supplementary_file,Sample_data_row_count,series_matrix_table_begin,series_matrix_table_end
0,Sample_title,Sample_geo_accession,Sample_status,Sample_submission_date,Sample_last_update_date,Sample_type,Sample_channel_count,Sample_source_name_ch1,Sample_organism_ch1,Sample_characteristics_ch1,...,Sample_contact_department,Sample_contact_institute,Sample_contact_address,Sample_contact_city,Sample_contact_zip/postal_code,Sample_contact_country,Sample_supplementary_file,Sample_data_row_count,series_matrix_table_begin,series_matrix_table_end
1,"""N2_AL_AD2""","""GSM2044469""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under AL condition, on adult day 2""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,
2,"""N2_AL_AD4""","""GSM2044470""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under AL condition, on adult day 4""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,
3,"""N2_CR_AD4""","""GSM2044471""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under CR condition, on adult day 4""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,
4,"""N2_IF_AD4""","""GSM2044472""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under IF condition, on adult day 4""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,


### Drop first row

In [15]:
info = info.drop(0).reset_index(drop = True)

In [16]:
info.head()

Unnamed: 0,Sample_title,Sample_geo_accession,Sample_status,Sample_submission_date,Sample_last_update_date,Sample_type,Sample_channel_count,Sample_source_name_ch1,Sample_organism_ch1,Sample_characteristics_ch1,...,Sample_contact_department,Sample_contact_institute,Sample_contact_address,Sample_contact_city,Sample_contact_zip/postal_code,Sample_contact_country,Sample_supplementary_file,Sample_data_row_count,series_matrix_table_begin,series_matrix_table_end
0,"""N2_AL_AD2""","""GSM2044469""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under AL condition, on adult day 2""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,
1,"""N2_AL_AD4""","""GSM2044470""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under AL condition, on adult day 4""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,
2,"""N2_CR_AD4""","""GSM2044471""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under CR condition, on adult day 4""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,
3,"""N2_IF_AD4""","""GSM2044472""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under IF condition, on adult day 4""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,
4,"""N2_AL_AD6""","""GSM2044473""","""Public on Jan 22 2016""","""Jan 21 2016""","""Jan 22 2016""","""RNA""","""1""","""N2 worms under AL condition, on adult day 6""","""Caenorhabditis elegans""","""strain: N2""",...,"""CAS-MPG Partner Institute for Computational B...","""Shanghai Institutes for Biological Sciences, ...","""Yueyang Road 320""","""Shanghai""","""200031""","""China""","""ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/suppl...","""22625""",,
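The restructuring so far (transpose, promote the first row to column names, drop that row) can also be pictured in plain Python. The sketch below is only an illustration of the idea, not the pandas code used in this notebook; `rows_to_records` is a made-up helper, and the two input rows are shortened copies of the sample metadata shown earlier.

```python
def rows_to_records(rows):
    """Turn rows of [name, value_1, ..., value_n] into n dicts keyed by name."""
    # zip(*rows) transposes; note that it truncates to the shortest row.
    columns = list(zip(*rows))
    # After transposing, the first entry holds the field names.
    header, *records = columns
    return [dict(zip(header, record)) for record in records]


rows = [
    ["Sample_title", '"N2_AL_AD2"', '"N2_AL_AD4"'],
    ["Sample_geo_accession", '"GSM2044469"', '"GSM2044470"'],
]

records = rows_to_records(rows)
print(records[0])
```

Each resulting dict corresponds to one row of the transposed dataframe, with the former first column acting as the column names.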


At this point, a glance through the dataframe tells us that all of the important sample information is already located in just the first two columns. We can determine the strain, diet, and age directly from the sample title, and map that to the sample GEO accession number.

The remaining columns are either redundant (they repeat what the title already encodes) or irrelevant to the expression analysis, so we will drop them.
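To make the claim about the sample title concrete, here is a minimal sketch that splits a title into its three parts with a regular expression. It assumes every title follows the `<strain>_<diet>_AD<days>` pattern seen in this dataset; `TITLE_PATTERN` and `parse_title` are names invented for this example.

```python
import re

# Assumed layout: <strain>_<diet>_AD<days>, e.g. N2_IF_AD12.
TITLE_PATTERN = re.compile(r"^(?P<strain>[^_]+)_(?P<diet>[^_]+)_AD(?P<days_old>\d+)$")


def parse_title(title):
    """Split a sample title into strain, diet, and age in days."""
    match = TITLE_PATTERN.match(title.strip('"'))  # tolerate surrounding quotes
    if match is None:
        raise ValueError(f"unexpected sample title: {title!r}")
    fields = match.groupdict()
    fields["days_old"] = int(fields["days_old"])
    return fields


print(parse_title('"N2_IF_AD12"'))
# {'strain': 'N2', 'diet': 'IF', 'days_old': 12}
```

Raising on a non-matching title is a deliberate choice here: a silently mis-parsed sample would be harder to notice than an early error.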

### Rename columns

In [17]:
info.columns = info.columns.str.lower().str.replace("sample_", "")

In [18]:
info = info[["title", "geo_accession"]].copy()  # explicit copy avoids SettingWithCopyWarning on the column assignments below

In [19]:
info.shape

(15, 2)

In [20]:
info

Unnamed: 0,title,geo_accession
0,"""N2_AL_AD2""","""GSM2044469"""
1,"""N2_AL_AD4""","""GSM2044470"""
2,"""N2_CR_AD4""","""GSM2044471"""
3,"""N2_IF_AD4""","""GSM2044472"""
4,"""N2_AL_AD6""","""GSM2044473"""
5,"""N2_CR_AD6""","""GSM2044474"""
6,"""N2_IF_AD6""","""GSM2044475"""
7,"""N2_AL_AD8""","""GSM2044476"""
8,"""N2_CR_AD8""","""GSM2044477"""
9,"""N2_IF_AD8""","""GSM2044478"""


### Remove double quotes

In [21]:
for col in ["title", "geo_accession"]:
    info[col] = info[col].str.strip('"')

In [22]:
info

Unnamed: 0,title,geo_accession
0,N2_AL_AD2,GSM2044469
1,N2_AL_AD4,GSM2044470
2,N2_CR_AD4,GSM2044471
3,N2_IF_AD4,GSM2044472
4,N2_AL_AD6,GSM2044473
5,N2_CR_AD6,GSM2044474
6,N2_IF_AD6,GSM2044475
7,N2_AL_AD8,GSM2044476
8,N2_CR_AD8,GSM2044477
9,N2_IF_AD8,GSM2044478
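The double quotes are part of the file's quoting rather than of the values themselves. As an alternative sketch, Python's standard `csv` module removes such quotes while splitting, so no separate stripping step is needed. The two input lines below are shortened copies of lines from this dataset; this is not how the notebook actually reads the file.

```python
import csv

# Two shortened metadata lines, as they appear in the raw file.
raw_lines = [
    '!Sample_title\t"N2_AL_AD2"\t"N2_AL_AD4"',
    '!Sample_geo_accession\t"GSM2044469"\t"GSM2044470"',
]

# csv.reader accepts any iterable of strings; the default quotechar is '"'.
rows = list(csv.reader(raw_lines, delimiter="\t"))
print(rows[0])
# ['!Sample_title', 'N2_AL_AD2', 'N2_AL_AD4']
```

The leading `!` is still attached to the field name and would have to be removed separately.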


## Add diet and age columns for each sample

In [23]:
info["diet"] = info["title"].str[3:5]

In [24]:
info["days_old"] = info["title"].str.extract(r'AD(\d+)', expand = False)

In [25]:
info = info.rename(columns = {"title": "sample"})

In [26]:
info

Unnamed: 0,sample,geo_accession,diet,days_old
0,N2_AL_AD2,GSM2044469,AL,2
1,N2_AL_AD4,GSM2044470,AL,4
2,N2_CR_AD4,GSM2044471,CR,4
3,N2_IF_AD4,GSM2044472,IF,4
4,N2_AL_AD6,GSM2044473,AL,6
5,N2_CR_AD6,GSM2044474,CR,6
6,N2_IF_AD6,GSM2044475,IF,6
7,N2_AL_AD8,GSM2044476,AL,8
8,N2_CR_AD8,GSM2044477,CR,8
9,N2_IF_AD8,GSM2044478,IF,8


## Save metadata to file

In [27]:
info.to_csv("sample_metadata.tsv", sep = '\t', index = False)
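
A quick way to gain confidence in a saved table is to read it back and compare it with what was written. The sketch below does this with the standard `csv` module and an in-memory buffer instead of the real `sample_metadata.tsv`, purely so that it runs on its own; the two records are copied from the table above, with `days_old` kept as a string.

```python
import csv
import io

records = [
    {"sample": "N2_AL_AD2", "geo_accession": "GSM2044469", "diet": "AL", "days_old": "2"},
    {"sample": "N2_CR_AD4", "geo_accession": "GSM2044471", "diet": "CR", "days_old": "4"},
]
fieldnames = ["sample", "geo_accession", "diet", "days_old"]

# Write tab-separated text into an in-memory buffer (stand-in for the file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(records)

# Read it back and check that nothing changed on the way.
buffer.seek(0)
read_back = [dict(row) for row in csv.DictReader(buffer, delimiter="\t")]
assert read_back == records
print(len(read_back), "records survived the round trip")
```

Note that a plain text format carries no type information, so a numeric column such as `days_old` comes back as text here and would need converting by whatever reads the file.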