# Preprocessing Glioblastoma Data
In this notebook, I preprocess the original glioblastoma gene expression data for 200 patients into the format I need: a CSV table with the genes as rows and the patient samples as columns.
The gene expression data was delivered as two separate files, so I first have to join the two dataframes and convert the Affymetrix probe IDs to Ensembl IDs.

In [51]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import mygene
%matplotlib inline

## 1. Read the Gene Expression Data

In [33]:
ge1 = pd.read_csv('GSE4271-GPL96_series_matrix.txt',
                  sep='\t', quoting=csv.QUOTE_NONE,
                  comment='!', skiprows=85, encoding='utf-8')
ge1.columns = [x.strip('"') for x in ge1.columns] # remove quotes around column names
ge1['ID_REF'] = ge1['ID_REF'].map(lambda x: x.strip('"'))

In [34]:
ge2 = pd.read_csv('GSE4271-GPL97_series_matrix.txt',
                  sep='\t', quoting=csv.QUOTE_NONE,
                  comment='!', skiprows=84, encoding='utf-8')
ge2.columns = [x.strip('"') for x in ge2.columns] # remove quotes around column names
ge2['ID_REF'] = ge2['ID_REF'].map(lambda x: x.strip('"'))

In [36]:
ge1.shape, ge2.shape

((22283, 101), (22645, 101))
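
The two cells above hard-code `skiprows` per file. As a minimal sketch of the same read step, the helper below relies only on `comment='!'` to drop the metadata lines and is run on a tiny in-memory stand-in for a series matrix file. The helper name `read_series_matrix` and the toy content are assumptions for illustration, and the approach presumes that no data row contains a `!` character.

```python
import csv
import io

import pandas as pd

# Toy stand-in for a GEO series matrix file (NOT the real GSE4271 data).
toy_matrix = '\n'.join([
    '!Series_title\t"toy series"',
    '!series_matrix_table_begin',
    '"ID_REF"\t"GSM1"\t"GSM2"',
    '"1007_s_at"\t7.5\t8.25',
    '"1053_at"\t5.0\t4.5',
    '!series_matrix_table_end',
])

def read_series_matrix(handle):
    """Hypothetical helper: read the expression table, ignoring '!' metadata lines."""
    ge = pd.read_csv(handle, sep='\t', quoting=csv.QUOTE_NONE, comment='!')
    # QUOTE_NONE keeps the literal quotes, so strip them from headers and probe IDs
    ge.columns = [c.strip('"') for c in ge.columns]
    ge['ID_REF'] = ge['ID_REF'].str.strip('"')
    return ge

toy_ge = read_series_matrix(io.StringIO(toy_matrix))
print(toy_ge.shape)
```

Passing a file path instead of the `StringIO` object should work the same way, since `pd.read_csv` accepts both.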

## 2. Convert Affymetrix IDs to Ensembl IDs

In [89]:
# use mygene to get Ensembl IDs and gene symbols for the Affymetrix probe IDs
def add_ids_to_ge(ge):
    mg = mygene.MyGeneInfo()
    res = mg.querymany(pd.unique(ge.ID_REF),
                       scopes='all',
                       fields='ensembl.gene, symbol',
                       species='human', returnall=True
                      )

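    # extract the probe ID, Ensembl ID and gene symbol from a single hit (a dictionary)
    def get_name_and_id(x):
        # 'ensembl' is a list when the hit maps to several Ensembl genes; keep the first one
        ens = x['ensembl']
        ens_id = ens[0]['gene'] if isinstance(ens, list) else ens['gene']
        affy = x['query']
        name = x.get('symbol')  # None if the hit carries no symbol
        return [affy, ens_id, name]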

    ens_ids = [get_name_and_id(x) for x in res['out'] if 'ensembl' in x]
    gene_affy_ens = pd.DataFrame(ens_ids, columns=['affy_ID', 'Ensembl_ID', 'Name'])
    gene_affy_ens.set_index('affy_ID', inplace=True)
    gene_affy_ens = gene_affy_ens[~gene_affy_ens.index.duplicated(keep='first')]

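    # left-join onto the expression table to keep its row order; unmapped probes get NaN
    gene_affy_ens = ge.join(gene_affy_ens, on='ID_REF')
    # note: len(ens_ids) counts hits, so probes with duplicate hits are counted more than once
    print("{} gene names (symbols) mapped successfully".format(len(ens_ids)))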
    return gene_affy_ens
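
To make the hit-parsing step inside `add_ids_to_ge` easier to follow without querying the web service, here is a small sketch that runs the same extraction logic on mocked hits. The hit dictionaries, probe names and IDs are invented for illustration and only mimic the shapes handled above (a single `ensembl` dictionary, a list of them, or no `ensembl` field at all). Using `.get('symbol')` is a defensive assumption for hits that lack a symbol.

```python
# Mocked hits shaped like the entries of res['out']; all values are illustrative.
mock_hits = [
    {'query': 'probe_a', 'symbol': 'GENE1', 'ensembl': {'gene': 'ENSG_A'}},
    {'query': 'probe_b', 'symbol': 'GENE2',
     'ensembl': [{'gene': 'ENSG_B1'}, {'gene': 'ENSG_B2'}]},
    {'query': 'probe_c', 'notfound': True},              # no hit at all
    {'query': 'probe_d', 'ensembl': {'gene': 'ENSG_D'}}, # hit without a symbol
]

def extract_mapping(hit):
    """Return [probe ID, first Ensembl gene ID, symbol or None] for one hit."""
    ens = hit['ensembl']
    first = ens[0] if isinstance(ens, list) else ens
    return [hit['query'], first['gene'], hit.get('symbol')]

# Hits without an 'ensembl' field are skipped, as in add_ids_to_ge
rows = [extract_mapping(h) for h in mock_hits if 'ensembl' in h]
for row in rows:
    print(row)
```

Keeping only the first Ensembl gene of a multi-gene hit is a simplification: the other candidate genes for that probe are discarded.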

In [90]:
ge1 = add_ids_to_ge(ge1)
ge2 = add_ids_to_ge(ge2)

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-20000...done.
querying 20001-21000...done.
querying 21001-22000...done.
querying 22001-22283...done.
Finished.
1125 input query terms found dup hits:
	[('1007_s_at', 2), ('1294_at', 2), ('1773_at', 2), ('200003_s_at', 2), ('200012_x_at', 4), ('200016_
1371 input query terms found no hit:
	['201205_at', '201265_at', '202015_x_at', '202091_at', '202280_at', '202881_x_at', '203326_x_at', '2
21934 gene names (symbols) mapped successfully
querying 1-1000...

In [96]:
ge1.set_index('Ensembl_ID', inplace=True)
ge2.set_index('Ensembl_ID', inplace=True)
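
After this step the index can contain `NaN` (probes without a mapping) and repeated Ensembl IDs (several probes per gene). The sketch below shows one possible way to get a unique gene index on toy data: drop unmapped probes and average the probes that share an ID. Averaging is an assumption chosen for illustration, not necessarily what this notebook does later, and the `GSM` prefix used to pick the sample columns is likewise assumed.

```python
import numpy as np
import pandas as pd

# Toy table in the shape returned by add_ids_to_ge (values are made up)
toy = pd.DataFrame({
    'ID_REF': ['p1', 'p2', 'p3', 'p4'],
    'GSM1': [1.0, 3.0, 5.0, 7.0],
    'GSM2': [2.0, 4.0, 6.0, 8.0],
    'Ensembl_ID': ['ENSG_A', 'ENSG_A', np.nan, 'ENSG_B'],
    'Name': ['GENE1', 'GENE1', np.nan, 'GENE2'],
})

# Assumption: sample columns are the ones whose name starts with 'GSM'
sample_cols = [c for c in toy.columns if c.startswith('GSM')]

per_gene = (toy.dropna(subset=['Ensembl_ID'])    # remove unmapped probes
               .groupby('Ensembl_ID')[sample_cols]
               .mean())                          # average probes of the same gene
print(per_gene)
```

Other aggregations (for example the maximum or the median per gene) would fit into the same `groupby` call.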

In [101]:
ge1.index.isin(ge2.index).sum()

9123
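
Note that `isin(...).sum()` counts rows of `ge1`, so a gene measured by several probes contributes more than once, and missing IDs may be matched against each other as well. As a hedged sketch on toy indices (the IDs are invented), the number of distinct shared genes can be obtained from the unique, non-missing IDs instead:

```python
import numpy as np
import pandas as pd

# Toy indices standing in for ge1.index and ge2.index
idx1 = pd.Index(['ENSG_A', 'ENSG_A', np.nan, 'ENSG_B'])
idx2 = pd.Index(['ENSG_A', np.nan, 'ENSG_C'])

# Row-based count, as in the cell above: duplicates are counted individually
row_count = idx1.isin(idx2).sum()

# Gene-based count: unique, non-missing IDs present in both indices
shared = idx1.dropna().unique().intersection(idx2.dropna().unique())

print(row_count, len(shared))
```

On the real data, the same two lines with `ge1.index` and `ge2.index` would show how far the row count and the number of distinct shared genes differ.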