# Parse Affy C. elegans Genome Array annotations

Tong Shu Li

Parse out the annotations for each probe for the Affymetrix C. elegans Genome Array platform ([GPL200](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL200)).

In [1]:
import numpy as np
import pandas as pd

## Read the annotation file:

1. Replace the placeholder for empty values (`---`) with NaN
2. Drop any completely empty columns
3. Rename the columns to lowercase, with underscores instead of spaces

In [2]:
affy = (
    pd.read_csv("Celegans.na35.annot.csv", sep = ',', comment = '#')
        .replace(to_replace = "---", value = np.nan)
        .dropna(axis = 1, how = "all")
        .rename(columns = lambda col: col.lower().replace(" ", "_"))
)
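
To make the effect of each step concrete, here is a minimal sketch that runs the same chain on a tiny in-memory CSV. The header line, column names, and values in `toy_csv` are invented for illustration and are not taken from the real annotation file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical miniature of an annotation file: one '#' comment line,
# a header row, and '---' marking empty cells.
toy_csv = io.StringIO(
    "#%example header comment\n"
    '"Probe Set ID","Gene Symbol","Unused Column"\n'
    '"1_at","abc-1","---"\n'
    '"2_at","---","---"\n'
)

toy = (
    pd.read_csv(toy_csv, sep=",", comment="#")       # the '#' line is skipped
    .replace(to_replace="---", value=np.nan)          # '---' becomes NaN
    .dropna(axis=1, how="all")                        # all-NaN columns are dropped
    .rename(columns=lambda col: col.lower().replace(" ", "_"))
)

print(toy)
```

Only `probe_set_id` and `gene_symbol` should survive, because `Unused Column` contains nothing but the placeholder.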

In [3]:
affy.shape

(22625, 29)

In [4]:
affy.head(2)

,probe_set_id,genechip_array,species_scientific_name,annotation_date,sequence_type,sequence_source,transcript_id(array_design),target_description,representative_public_id,unigene_id,...,refseq_transcript_id,wormbase,gene_ontology_biological_process,gene_ontology_cellular_component,gene_ontology_molecular_function,interpro,annotation_description,annotation_transcript_cluster,transcript_assignments,annotation_notes
0,171720_x_at,C. elegans Genome Array,Caenorhabditis elegans,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,affy.Ce.20641,AV179929_rc /REP_DB=TREMBL Accession /GB=AV179...,AV179929,Cel.20567,...,NM_064443 /// NM_182150,,0000003 // reproduction // inferred from mutan...,,0005515 // protein binding // inferred from el...,IPR000342 // Regulator of G protein signalling...,This probe set was annotated using the Matchin...,Y48E1B.14a,Y48E1B.14a // cdna:known chromosome:WBcel235:I...,NM_064443 // refseq // 8 // Cross Hyb Matching...
1,171721_x_at,C. elegans Genome Array,Caenorhabditis elegans,"Oct 6, 2014",Exemplar sequence,GenBank,affy.Ce.22856,g6767 /REP_DB=GenBank Identifier /5_PRIME_EXT_...,6767,Cel.121,...,NM_059851 /// NM_171841,,,0005737 // cytoplasm // inferred from direct a...,0005515 // protein binding // inferred from ph...,IPR005373 // Uncharacterised protein family UP...,This probe set was annotated using the Matchin...,"NM_059851(10),NM_171841(10),T01G9.2a,T01G9.2b",NM_059851 // Caenorhabditis elegans Protein T0...,


## How many empty cells per column?

In [5]:
affy.isnull().sum()

probe_set_id                            0
genechip_array                          0
species_scientific_name                 0
annotation_date                         0
sequence_type                           0
sequence_source                         0
transcript_id(array_design)             0
target_description                      0
representative_public_id                0
unigene_id                           1346
genome_version                          0
alignments                            323
gene_title                            684
gene_symbol                           684
unigene_cluster_type                 1351
ensembl                               717
entrez_gene                           684
swissprot                            1703
refseq_protein_id                    1185
refseq_transcript_id                 1164
wormbase                             5996
gene_ontology_biological_process     9976
gene_ontology_cellular_component    14188
gene_ontology_molecular_function  

## Check if any ID mappings are missing

Ideally, we want to map from the proprietary Affymetrix probe set IDs to public identifiers. Of the databases represented in the annotation file, the best known are WormBase, Entrez Gene, and Ensembl.

Let's check if these identifiers are missing for any of the probes:

In [6]:
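(
    affy[["probe_set_id", "entrez_gene", "wormbase", "ensembl"]]
        .isnull()
        # True marks a missing identifier; count the probes in each pattern.
        # The default as_index is used so that size() returns a Series,
        # which can then be renamed and converted to a frame.
        .groupby(["ensembl", "entrez_gene", "wormbase"])
        .size()
        .rename("# Missing")
        .to_frame()
)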

ensembl,entrez_gene,wormbase,# Missing
False,False,False,16580
False,False,True,5238
False,True,True,90
True,False,False,49
True,False,True,74
True,True,True,594


594 probe sets are missing all three identifiers. Manual inspection of these probes reveals that the `representative_public_id` column can be used to identify the probed gene. However, these identifiers are not guaranteed to be unique.
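
One way to see how often the fallback identifier is shared is sketched below. The `toy` frame is a hypothetical stand-in for `affy`, and the accession values in it are made up, so the counts say nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-in for `affy`; the values are illustrative only.
toy = pd.DataFrame({
    "probe_set_id": ["1_at", "2_at", "3_at", "4_at"],
    "entrez_gene": [None, None, None, "174997"],
    "wormbase": [None, None, None, "CE26817"],
    "ensembl": [None, None, None, "WBGene00013011"],
    "representative_public_id": ["AV000001", "AV000001", "AV000002", "AV179929"],
})

id_cols = ["entrez_gene", "wormbase", "ensembl"]

# Probes for which all three public identifiers are missing.
unmapped = toy[toy[id_cols].isnull().all(axis=1)]

# Among those, probes whose fallback identifier is shared with another probe.
dup_mask = unmapped["representative_public_id"].duplicated(keep=False)
shared = unmapped.loc[dup_mask, ["probe_set_id", "representative_public_id"]]

print(len(unmapped), "unmapped probes;", len(shared), "share a fallback id")
```

Passing `keep=False` flags every member of a duplicated group rather than only the later occurrences, which makes the affected probes easier to list.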

## Annotate genes

We will provide as many identifier mappings as we can for every probe, and leave it to later processing to decide whether to discard probes with missing mappings.
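
As a rough sketch of what that later processing might look like, the snippet below splits probes by whether they carry at least one public identifier. The `toy_ids` frame and the `public_cols` list are assumptions for illustration; the column names simply mirror the renamed columns used in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the `ids` table.
toy_ids = pd.DataFrame({
    "probe_id": ["1_at", "2_at", "3_at"],
    "entrez_id": ["174997", None, None],
    "wormbase_id": [None, None, None],
    "ensembl_id": ["WBGene00013011", "WBGene00011344", None],
    "other_id": ["AV179929", "6767", "AV189310"],
})

public_cols = ["entrez_id", "wormbase_id", "ensembl_id"]

# Number of public identifiers available for each probe.
n_public = toy_ids[public_cols].notnull().sum(axis=1)

mapped = toy_ids[n_public > 0]     # at least one public identifier
unmapped = toy_ids[n_public == 0]  # only the fallback `other_id` remains

print(mapped["probe_id"].tolist(), unmapped["probe_id"].tolist())
```

Whether zero is the right threshold is a downstream decision; a stricter analysis might require a specific identifier instead.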

In [7]:
ids = (
    affy[[
        "probe_set_id", "entrez_gene", "wormbase",
        "ensembl", "representative_public_id"
    ]]
    .rename(
        columns = {
            "probe_set_id": "probe_id",
            "entrez_gene": "entrez_id",
            "wormbase": "wormbase_id",
            "ensembl": "ensembl_id",
            "representative_public_id": "other_id"
        }
    )
)

In [8]:
ids.shape

(22625, 5)

In [9]:
ids.head()

,probe_id,entrez_id,wormbase_id,ensembl_id,other_id
0,171720_x_at,174997,,WBGene00013011,AV179929
1,171721_x_at,172609,,WBGene00011344,6767
2,171722_x_at,176907,,WBGene00018934,AV189310
3,171723_x_at,180646,CE26817,WBGene00006928,CEC7564
4,171724_x_at,172353,CE11778,WBGene00000386,AV178012


This identifier mapping is far from satisfactory, so we will use the probe IDs as the main unique IDs and worry about the mappings later.
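
Relying on probe IDs as the key assumes they are in fact unique. A minimal check, shown here on a hypothetical `toy_ids` frame rather than the real table, could look like this:

```python
import pandas as pd

# Hypothetical miniature of the `ids` table: probe ids are unique,
# while two probes map to the same gene identifier.
toy_ids = pd.DataFrame({
    "probe_id": ["1_at", "2_at", "3_at"],
    "entrez_id": ["174997", "174997", None],
})

# Fails loudly if a probe id appears more than once.
assert toy_ids["probe_id"].is_unique

# verify_integrity=True makes set_index raise on duplicate keys as well.
indexed = toy_ids.set_index("probe_id", verify_integrity=True)

print(indexed.loc["2_at", "entrez_id"])
```

The gene-level columns need no such guarantee, since several probes can target the same gene.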

## Save to file

In [10]:
ids.to_csv("GPL200_id_mapping.tsv", sep = '\t', index = False)
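
When the file is read back, pandas infers column types, and a numeric-looking identifier column that contains missing values is parsed as floating point. The sketch below shows this using an in-memory buffer instead of the real file; `toy_ids` is a made-up example, and reading with `dtype=str` is a suggestion, not something the original notebook does.

```python
import io

import pandas as pd

# Hypothetical miniature of the saved table.
toy_ids = pd.DataFrame({
    "probe_id": ["1_at", "2_at"],
    "entrez_id": ["174997", None],
    "other_id": ["AV179929", "6767"],
})

# Write to an in-memory buffer with the same options as above.
buffer = io.StringIO()
toy_ids.to_csv(buffer, sep="\t", index=False)

# Default type inference: the numeric-looking column becomes float.
buffer.seek(0)
default_read = pd.read_csv(buffer, sep="\t")

# Forcing strings keeps the identifiers exactly as written.
buffer.seek(0)
string_read = pd.read_csv(buffer, sep="\t", dtype=str)

print(default_read["entrez_id"].dtype)
print(string_read.loc[0, "entrez_id"])
```

Missing cells are still read as NaN with `dtype=str`, so the missing-value checks used earlier keep working.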