0. Prepare data for text classification tutorial
---

For ease of access and management, I downloaded the text fields and other metadata for all GEO Series (GSEs) from mammalian studies (human, mouse, rat) and stored them in a MongoDB `collection` named `meta`.

Each document is indexed by its GSE/GSM accession, stored as `id`. The data were downloaded with the `Bio.Entrez` module using the following query:

```python
handle = Entrez.esearch(db="gds", 
	term='("homo sapiens"[Organism] OR "mus musculus"[Organism] OR "rattus norvegicus"[Organism]) AND "gse"[Filter] AND "Expression profiling by array"', 
	retmode="xml", retmax=40000)
```
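
The search term combines an organism filter, the `gse` entry-type filter, and the dataset type. As a minimal sketch, the same string can be assembled from a list of organisms. `build_gds_term` is a hypothetical helper written for this illustration; it is not part of `Bio.Entrez`.

```python
def build_gds_term(organisms, dataset_type="Expression profiling by array"):
    """Assemble a GEO DataSets (gds) search term for GSE records.

    Hypothetical helper, for illustration only.
    """
    # OR together one "[Organism]" clause per species
    organism_clause = " OR ".join('"%s"[Organism]' % name for name in organisms)
    # Restrict to series (GSE) entries and to the requested dataset type
    return '(%s) AND "gse"[Filter] AND "%s"' % (organism_clause, dataset_type)


term = build_gds_term(["homo sapiens", "mus musculus", "rattus norvegicus"])
print(term)
```

The resulting string is identical to the `term` argument of the `Entrez.esearch` call above, so adding or removing a species only requires changing the list.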

In [24]:
import os
import cPickle as pickle
import numpy as np
import pandas as pd
from pymongo import MongoClient
from sklearn.cross_validation import StratifiedKFold
from sklearn.preprocessing import LabelEncoder

In [2]:
# Connect to MongoDB
client = MongoClient('mongodb://127.0.0.1:27017/')
db = client['microtask_signatures']
COLL_META = db['meta']

In [3]:
# Find all GSEs in the collection
cur = COLL_META.find({'id': {'$regex': r'^GSE[0-9]*'}}, 
                     projection={'_id':False, 'id':True, 'Series_title':True, 'Series_summary':True})
print 'unique GSEs found: %d '% cur.count()

unique GSEs found: 31905 
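
The `$regex` filter is what restricts the query to series-level documents. The sketch below uses Python's `re` module to show what the pattern accepts. The ids are made up for illustration, and MongoDB evaluates the pattern with its own regex engine, although a pattern this simple should behave the same way in both.

```python
import re

# Same pattern as in the MongoDB query above
pattern = re.compile(r'^GSE[0-9]*')

# Hypothetical ids, for illustration only
ids = ['GSE1', 'GSE10000', 'GSM12345', 'GPL570', 'GSE']
matches = [i for i in ids if pattern.search(i)]
print(matches)

# A stricter variant: at least one digit, and nothing after the digits
strict = re.compile(r'^GSE[0-9]+$')
strict_matches = [i for i in ids if strict.search(i)]
print(strict_matches)
```

Because `*` allows zero digits and the pattern has no end anchor, the original pattern behaves as a plain `GSE` prefix test. That is sufficient here as long as every `id` starting with `GSE` is a series accession.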


In [4]:
# Fetch the texts associated with them and make a pd.DataFrame
docs = [doc for doc in cur]
df = pd.DataFrame.from_records(docs).set_index('id')
del docs
print df.shape
df.head()

(31905, 2)


| id | Series_summary | Series_title |
| --- | --- | --- |
| GSE1 | [This series represents a group of cutaneous m... | NHGRI_Melanoma_class |
| GSE1000 | [Amino acid conjugated surfaces and controls a... | Osteosarcoma TE85 cell tissue culture study |
| GSE10000 | [We previously observed that formation of aort... | Age-dependent aorta transcriptomes in wild-typ... |
| GSE10001 | [The thyroid hormone receptor (TR) has been pr... | Gene expression profiling in NCoR deficient mo... |
| GSE10002 | [Primitive erythropoiesis in the mouse yolk sa... | Identification of Erythroid-Enriched Gene Expr... |


#### Some cleanup needs to be done on the data:

1. A `Series_summary` field may be a list of paragraphs when the summary has more than one paragraph; such lists need to be concatenated.
2. Some rows have missing values, currently represented as `np.nan`, which need to be converted to empty strings.

In [5]:
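# Concatenate lists of paragraphs in Series_summary
def maybe_concat(x):
    # A multi-paragraph summary is stored as a list; join it into one string
    if isinstance(x, list):
        return ' '.join(x)
    # Anything else (a plain string or NaN) is passed through unchanged
    return x
df['Series_summary'] = df['Series_summary'].map(maybe_concat)
# Replace missing values (NaN) with empty strings
df = df.fillna('')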
df.head()

| id | Series_summary | Series_title |
| --- | --- | --- |
| GSE1 | This series represents a group of cutaneous ma... | NHGRI_Melanoma_class |
| GSE1000 | Amino acid conjugated surfaces and controls at... | Osteosarcoma TE85 cell tissue culture study |
| GSE10000 | We previously observed that formation of aorta... | Age-dependent aorta transcriptomes in wild-typ... |
| GSE10001 | The thyroid hormone receptor (TR) has been pro... | Gene expression profiling in NCoR deficient mo... |
| GSE10002 | Primitive erythropoiesis in the mouse yolk sac... | Identification of Erythroid-Enriched Gene Expr... |
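
To see the two cleaning steps in isolation, here is a minimal sketch on a made-up frame. The ids and texts are hypothetical and only cover the three cases that matter: a list of paragraphs, a plain string, and a missing value.

```python
import numpy as np
import pandas as pd


def maybe_concat(x):
    # Join a list of paragraphs into one string; pass anything else through
    return ' '.join(x) if isinstance(x, list) else x


# Hypothetical rows: list of paragraphs, plain string, missing value
toy = pd.DataFrame({
    'Series_summary': [['First paragraph.', 'Second paragraph.'],
                       'Single paragraph.',
                       np.nan],
    'Series_title': ['Title A', np.nan, 'Title C'],
}, index=pd.Index(['GSE_a', 'GSE_b', 'GSE_c'], name='id'))

toy['Series_summary'] = toy['Series_summary'].map(maybe_concat)
toy = toy.fillna('')
print(toy)
```

After both steps every cell holds a plain string, which is what a text vectorizer would expect as input.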


Next, we retrieve the crowd-annotated labels for these documents from another `collection` named `signatures`.

In [6]:
COLL = db['signatures']
cur = COLL.find({'$and': [
            {'chdir_sva_exp2': {'$exists': True}}, 
            {'version': '1.0'},
            {"incorrect": {"$ne": True}}
        ]}, projection={'_id':False, 'id':True, 'geo_id':True})
print 'unique labeled signatures found: %d' % len(cur.distinct('geo_id'))

unique labeled signatures found: 1934


In [7]:
df_labels = pd.DataFrame.from_records([doc for doc in cur])
df_labels.head()

| | geo_id | id |
| --- | --- | --- |
| 0 | GSE763 | drug:2721 |
| 1 | GSE763 | drug:2722 |
| 2 | GSE763 | drug:2724 |
| 3 | GSE763 | drug:2723 |
| 4 | GSE581 | dz:303 |


In [8]:
# Generate label column from `id`
df_labels['label'] = df_labels['id'].map(lambda x: x.split(':')[0])

# Remove `geo_id`s that have been annotated with multiple labels
# Count unique labels for each geo_id
geo_labels = df_labels.groupby('geo_id')['label'].apply(lambda x: set(x))
geo_label_counts =  geo_labels.map(lambda x: len(x))
# get geo_ids with 1 label and unpack set
geo_labels = geo_labels[geo_label_counts == 1].map(lambda x: list(x)[0])
# convert to DataFrame
geo_labels = geo_labels.to_frame()
print geo_labels.shape
geo_labels.head()

(1785, 1)


| geo_id | label |
| --- | --- |
| GSE1001 | dz |
| GSE10064 | dz |
| GSE10082 | gene |
| GSE1009 | dz |
| GSE1010 | dz |
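
The filtering step above keeps only series whose signatures all share one label type. A minimal sketch of the same idea on made-up annotations follows; it uses `nunique` instead of building sets, which should give the same result. The `geo_id`s and signature ids are hypothetical.

```python
import pandas as pd

# Hypothetical annotations: GSE_x and GSE_z have one label type, GSE_y has two
toy_labels = pd.DataFrame({
    'geo_id': ['GSE_x', 'GSE_x', 'GSE_y', 'GSE_y', 'GSE_z'],
    'id': ['drug:1', 'drug:2', 'drug:3', 'dz:4', 'gene:5'],
})
# The label is the prefix before the colon, as in the cell above
toy_labels['label'] = toy_labels['id'].map(lambda x: x.split(':')[0])

# Number of distinct labels per geo_id
n_labels = toy_labels.groupby('geo_id')['label'].nunique()
# Keep geo_ids with exactly one distinct label, then take that label
single = n_labels[n_labels == 1].index
toy_geo_labels = (toy_labels[toy_labels['geo_id'].isin(single)]
                  .groupby('geo_id')['label'].first().to_frame())
print(toy_geo_labels)
```

`GSE_y` is dropped because it carries both a `drug` and a `dz` signature, which would make it ambiguous as a single-label training example.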


In [9]:
# LEFT JOIN the df with geo_labels ON geo_id
df = df.merge(geo_labels, right_index=True, left_index=True, how='left')
print df.shape
df.head()

(31905, 3)


| id | Series_summary | Series_title | label |
| --- | --- | --- | --- |
| GSE1 | This series represents a group of cutaneous ma... | NHGRI_Melanoma_class | NaN |
| GSE1000 | Amino acid conjugated surfaces and controls at... | Osteosarcoma TE85 cell tissue culture study | NaN |
| GSE10000 | We previously observed that formation of aorta... | Age-dependent aorta transcriptomes in wild-typ... | NaN |
| GSE10001 | The thyroid hormone receptor (TR) has been pro... | Gene expression profiling in NCoR deficient mo... | NaN |
| GSE10002 | Primitive erythropoiesis in the mouse yolk sac... | Identification of Erythroid-Enriched Gene Expr... | NaN |
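
The left join keeps every GSE from the text table and leaves `label` missing where no annotation exists, which is why the row count stays at 31905. A minimal sketch on made-up frames (`texts`, `labels`, and all ids are hypothetical):

```python
import pandas as pd

# Hypothetical text table indexed by id, and label table indexed by geo_id
texts = pd.DataFrame({'Series_title': ['Title A', 'Title B', 'Title C']},
                     index=pd.Index(['GSE_a', 'GSE_b', 'GSE_c'], name='id'))
labels = pd.DataFrame({'label': ['dz']},
                      index=pd.Index(['GSE_b'], name='geo_id'))

# Index-on-index left join: all rows of `texts` survive
merged = texts.merge(labels, left_index=True, right_index=True, how='left')
print(merged)

# Rows without a matching label get a missing value
n_unlabeled = int(merged['label'].isnull().sum())
print(n_unlabeled)
```

The unlabeled rows are kept on purpose, so the full frame can still be used for tasks that do not need labels.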


In [10]:
# We then save this full DataFrame for future use
df.to_csv('data/GSEs_texts_with_labels.csv', encoding='utf-8')

In [18]:
# Create another df storing only the labeled documents
# (.copy() so that columns can be added later without a SettingWithCopyWarning)
df_labeled = df.loc[~df['label'].isnull()].copy()
print df_labeled.shape
df_labeled.head()

(1785, 3)


| id | Series_summary | Series_title | label |
| --- | --- | --- | --- |
| GSE1001 | Sprague-Dawley rat retina post-injury and cont... | retina injury timecourse | dz |
| GSE10064 | This study aims to determine if global gene ex... | Gene expression in immortalized B-lymphocytes ... | dz |
| GSE10082 | Conventional biochemical and molecular techniq... | Aryl Hydrocarbon Receptor Regulates Distinct D... | gene |
| GSE1009 | Gene expression profiling in glomeruli from hu... | Diabetic nephropathy | dz |
| GSE1010 | RNA samples prepared from lymphoblastic cells ... | FCHL study | dz |


In [21]:
# encode the labels 
encoder = LabelEncoder()
df_labeled.loc[:,'label_code'] = encoder.fit_transform(df_labeled['label'])
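
`LabelEncoder` assigns integer codes to the sorted class names, so the mapping can be read from `encoder.classes_`. A minimal sketch on a hand-written list of the three label types seen above:

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
# Classes are sorted before codes are assigned
codes = toy_encoder.fit_transform(['dz', 'gene', 'drug', 'dz'])
print(list(codes))

# classes_[k] is the label that was mapped to code k
print(list(toy_encoder.classes_))

# inverse_transform maps codes back to the original labels
decoded = toy_encoder.inverse_transform([0, 1, 2])
print(list(decoded))
```

Keeping the fitted encoder around (or saving `classes_`) makes it possible to translate predicted codes back into label names later.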

In [30]:
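# Split the labeled df into folds, stratified by label.
# The old sklearn.cross_validation.StratifiedKFold takes the labels directly,
# defaults to 3 folds, and yields (train_index, valid_index) pairs when iterated.
cv = StratifiedKFold(df_labeled['label_code'])
split = np.empty([df_labeled.shape[0]])
for i, (_, valid_index) in enumerate(cv):
    # Record the fold in which each document is held out for validation
    split[valid_index] = i

df_labeled.loc[:, 'split'] = split
df_labeled.sort_values('label_code').head()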

| id | Series_summary | Series_title | label | label_code | split |
| --- | --- | --- | --- | --- | --- |
| GSE35961 | Optimal treatment for nonalcoholic steatohepat... | Expression data from mouse liver treated with ... | drug | 0 | 1 |
| GSE51698 | The gene expression profile of TAMs sorted fro... | The effect of imatinib therapy on tumor associ... | drug | 0 | 2 |
| GSE21946 | Gamma tocotrienol induces apoptosis in breast ... | Expression data of MCF-7 cells treated with ga... | drug | 0 | 0 |
| GSE5230 | Expression profiling the response to inhibitio... | Epigenetics of gene expression in human hepato... | drug | 0 | 2 |
| GSE2195 | Immature (19/20 days of age) Alpk:APfCD-1 mice... | Phenotypic Anchoring of Gene Expression Change... | drug | 0 | 0 |
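
The `sklearn.cross_validation` module was deprecated in scikit-learn 0.18 and removed in 0.20. For readers on a newer release, the sketch below shows one way to reproduce the same kind of fold assignment with `sklearn.model_selection`. The label codes are made up, and `n_splits=3` is chosen to mirror the three folds (0, 1, 2) seen in the output above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical label codes: 6 samples of class 0 and 3 of class 1
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

cv = StratifiedKFold(n_splits=3)
fold = np.empty(len(y), dtype=int)
# split() uses X only for its length here, so a placeholder array is enough
for i, (_, valid_index) in enumerate(cv.split(np.zeros(len(y)), y)):
    # Record the fold in which each sample is held out for validation
    fold[valid_index] = i
print(fold)
```

Each sample lands in exactly one validation fold, and each fold keeps the 2:1 class ratio of the full set. With real data, `y` would be `df_labeled['label_code']`.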


In [32]:
# We then save this DataFrame for future use
df_labeled.to_csv('data/Labeled_GSEs_texts_with_labels.csv', encoding='utf-8')
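
One thing to watch when loading these files again: by default `pd.read_csv` parses empty fields as `NaN`, so the empty strings produced by `fillna('')` come back as missing values. A minimal sketch of the round trip, written for Python 3 and using an in-memory buffer and made-up rows instead of the real files:

```python
import io

import pandas as pd

# Hypothetical frame with one empty summary, as produced by fillna('')
toy = pd.DataFrame({'Series_summary': ['Some text', ''],
                    'label': ['dz', 'drug']},
                   index=pd.Index(['GSE_a', 'GSE_b'], name='id'))

buf = io.StringIO()
toy.to_csv(buf)

# Default parsing: the empty field is read back as NaN
buf.seek(0)
default = pd.read_csv(buf, index_col='id')
print(default['Series_summary'].isnull().sum())

# keep_default_na=False keeps empty fields as empty strings
buf.seek(0)
kept = pd.read_csv(buf, index_col='id', keep_default_na=False)
print(repr(kept.loc['GSE_b', 'Series_summary']))
```

Note that `keep_default_na=False` also turns off the other default missing-value markers. For the full (unlabeled) frame, where `label` really is missing for most rows, it may be simpler to read with the defaults and call `fillna('')` on the text columns again.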