# Parse Affy C. elegans Genome Array annotations

Tong Shu Li

Extract the identifier annotations for each probe set on the Affymetrix C. elegans Genome Array platform (GEO accession GPL200).

In [1]:
import numpy as np
import pandas as pd

---

## Read the annotation file

In [2]:
affy = pd.read_csv("Celegans.na35.annot.csv", sep = ',', comment = '#')

In [3]:
affy.shape

(22625, 41)

In [4]:
affy.head()

Unnamed: 0,Probe Set ID,GeneChip Array,Species Scientific Name,Annotation Date,Sequence Type,Sequence Source,Transcript ID(Array Design),Target Description,Representative Public ID,Archival UniGene Cluster,...,Gene Ontology Cellular Component,Gene Ontology Molecular Function,Pathway,InterPro,Trans Membrane,QTL,Annotation Description,Annotation Transcript Cluster,Transcript Assignments,Annotation Notes
0,171720_x_at,C. elegans Genome Array,Caenorhabditis elegans,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,affy.Ce.20641,AV179929_rc /REP_DB=TREMBL Accession /GB=AV179...,AV179929,---,...,---,0005515 // protein binding // inferred from el...,---,IPR000342 // Regulator of G protein signalling...,---,---,This probe set was annotated using the Matchin...,Y48E1B.14a,Y48E1B.14a // cdna:known chromosome:WBcel235:I...,NM_064443 // refseq // 8 // Cross Hyb Matching...
1,171721_x_at,C. elegans Genome Array,Caenorhabditis elegans,"Oct 6, 2014",Exemplar sequence,GenBank,affy.Ce.22856,g6767 /REP_DB=GenBank Identifier /5_PRIME_EXT_...,6767,---,...,0005737 // cytoplasm // inferred from direct a...,0005515 // protein binding // inferred from ph...,---,IPR005373 // Uncharacterised protein family UP...,---,---,This probe set was annotated using the Matchin...,"NM_059851(10),NM_171841(10),T01G9.2a,T01G9.2b",NM_059851 // Caenorhabditis elegans Protein T0...,---
2,171722_x_at,C. elegans Genome Array,Caenorhabditis elegans,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,affy.Ce.25068,AV189310 /REP_DB=TREMBL Accession /5_PRIME_EXT...,AV189310,---,...,0005744 // mitochondrial inner membrane preseq...,---,---,IPR013261 // Mitochondrial import inner membra...,---,---,This probe set was annotated using the Matchin...,"F56B3.11a,F56B3.11b,NM_067590(10),NM_067591(9)",F56B3.11a // cdna:known chromosome:WBcel235:IV...,---
3,171723_x_at,C. elegans Genome Array,Caenorhabditis elegans,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,affy.Ce.26172,CEC7564 /REP_DB=TREMBL Accession /5_PRIME_EXT_...,CEC7564,---,...,0005576 // extracellular region // inferred fr...,0005319 // lipid transporter activity // infer...,---,"IPR001747 // Lipid transport protein, N-termin...",---,---,This probe set was annotated using the Matchin...,"F59D8.2(10),NM_076211(10)",F59D8.2 // cdna:known chromosome:WBcel235:X:35...,NM_076188 // refseq // 8 // Cross Hyb Matching...
4,171724_x_at,C. elegans Genome Array,Caenorhabditis elegans,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,affy.Ce.20249,AV178012_rc /REP_DB=TREMBL Accession /GB=AV178...,AV178012,---,...,0005622 // intracellular // inferred from elec...,0004721 // phosphoprotein phosphatase activity...,---,IPR001763 // Rhodanese-like domain // 2.8E-12,---,---,This probe set was annotated using the Matchin...,"K06A5.7.1(11),K06A5.7.2(11)",K06A5.7.1 // cdna:known chromosome:WBcel235:I:...,---


### Replace the missing-value placeholder ("---") with NaN

In [5]:
affy = affy.replace(to_replace = "---", value = np.nan)

### Remove empty columns

In [6]:
affy = affy.drop(
    [col for col in affy.columns if all(affy[col].isnull())],
    axis = 1
)
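The two cleaning steps above can also be expressed at read time. The following is a minimal sketch, not the notebook's own method: `sample_csv` is a made-up three-row excerpt standing in for the real annotation file, and `dtype = str` is an added assumption so that identifier columns stay as strings instead of being parsed as floats once missing values appear.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for Celegans.na35.annot.csv (values are illustrative).
sample_csv = """# header comment, skipped because comment = '#'
Probe Set ID,Entrez Gene,QTL
171720_x_at,174997,---
171721_x_at,---,---
171722_x_at,176907,---
"""

# na_values marks "---" as missing while parsing; dtype = str keeps ids as text.
sample = pd.read_csv(
    io.StringIO(sample_csv),
    comment = '#',
    na_values = ["---"],
    dtype = str
)

# Drop every column in which all values are missing.
sample = sample.dropna(axis = 1, how = "all")
```

On this toy input the all-missing `QTL` column is dropped and the single missing `Entrez Gene` cell becomes NaN.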

### Count the remaining empty cells per column

In [7]:
affy.isnull().sum()

Probe Set ID                            0
GeneChip Array                          0
Species Scientific Name                 0
Annotation Date                         0
Sequence Type                           0
Sequence Source                         0
Transcript ID(Array Design)             0
Target Description                      0
Representative Public ID                0
UniGene ID                           1346
Genome Version                          0
Alignments                            323
Gene Title                            684
Gene Symbol                           684
Unigene Cluster Type                 1351
Ensembl                               717
Entrez Gene                           684
SwissProt                            1703
RefSeq Protein ID                    1185
RefSeq Transcript ID                 1164
WormBase                             5996
Gene Ontology Biological Process     9976
Gene Ontology Cellular Component    14188
Gene Ontology Molecular Function  

## Look at which identifiers to use

In [8]:
ids = affy[["Probe Set ID", "Entrez Gene", "WormBase", "Ensembl"]]

In [9]:
ids.shape

(22625, 4)

In [10]:
ids.head()

Unnamed: 0,Probe Set ID,Entrez Gene,WormBase,Ensembl
0,171720_x_at,174997,,WBGene00013011
1,171721_x_at,172609,,WBGene00011344
2,171722_x_at,176907,,WBGene00018934
3,171723_x_at,180646,CE26817,WBGene00006928
4,171724_x_at,172353,CE11778,WBGene00000386


In [11]:
ids.isnull().groupby(["Ensembl", "Entrez Gene", "WormBase"]).size()

Ensembl  Entrez Gene  WormBase
False    False        False       16580
                      True         5238
         True         True           90
True     False        False          49
                      True           74
         True         True          594
dtype: int64

594 probe sets have no Entrez Gene, WormBase, or Ensembl identifier at all. Manual inspection of these probe sets shows that the `Representative Public ID` column can be used to identify the targeted gene. However, these identifiers are not guaranteed to be unique.
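One way to pull out the unmapped probe sets and check how often their fallback identifier repeats is sketched below. The `toy` frame and all of its values are made up for illustration; only the column names come from the annotation file.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; p2_at to p4_at have no mappable ids.
toy = pd.DataFrame({
    "Probe Set ID": ["p1_at", "p2_at", "p3_at", "p4_at"],
    "Entrez Gene": ["111", np.nan, np.nan, np.nan],
    "WormBase": ["CE00001", np.nan, np.nan, np.nan],
    "Ensembl": ["WBGene00000001", np.nan, np.nan, np.nan],
    "Representative Public ID": ["A1", "B2", "B2", "C3"],
})

id_cols = ["Entrez Gene", "WormBase", "Ensembl"]

# Rows where all three identifier columns are missing.
unmapped = toy[toy[id_cols].isnull().all(axis = 1)]

# Fallback identifiers shared by more than one unmapped probe set.
counts = unmapped["Representative Public ID"].value_counts()
shared = counts[counts > 1]
```

Applied to the real table, the same mask with `affy` in place of `toy` should select the 594 rows counted above.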

## All identifiers

In [12]:
ids = affy[["Probe Set ID", "Entrez Gene", "WormBase", "Ensembl", "Representative Public ID"]]

In [13]:
ids.shape

(22625, 5)

In [14]:
ids.head()

Unnamed: 0,Probe Set ID,Entrez Gene,WormBase,Ensembl,Representative Public ID
0,171720_x_at,174997,,WBGene00013011,AV179929
1,171721_x_at,172609,,WBGene00011344,6767
2,171722_x_at,176907,,WBGene00018934,AV189310
3,171723_x_at,180646,CE26817,WBGene00006928,CEC7564
4,171724_x_at,172353,CE11778,WBGene00000386,AV178012


## Rename columns

In [15]:
ids = ids.rename(columns = {"Probe Set ID": "probe_id", "Entrez Gene": "entrez_id",
                           "WormBase": "wormbase_id", "Ensembl": "ensembl_id",
                           "Representative Public ID": "other_id"})

In [16]:
ids.head()

Unnamed: 0,probe_id,entrez_id,wormbase_id,ensembl_id,other_id
0,171720_x_at,174997,,WBGene00013011,AV179929
1,171721_x_at,172609,,WBGene00011344,6767
2,171722_x_at,176907,,WBGene00018934,AV189310
3,171723_x_at,180646,CE26817,WBGene00006928,CEC7564
4,171724_x_at,172353,CE11778,WBGene00000386,AV178012


In [17]:
ids.describe()

Unnamed: 0,probe_id,entrez_id,wormbase_id,ensembl_id,other_id
count,22625,21941,16629,21908,22625
unique,22625,18217,14514,18225,21880
top,194137_at,171952,CE03252,WBGene00023313 /// WBGene00023382,J04423
freq,1,10,12,11,14


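This identifier mapping is far from satisfactory: none of the gene-level identifier columns is unique per probe set (for example, 21,941 probe sets share only 18,217 distinct `entrez_id` values), and some cells hold several identifiers joined by ` /// `. We will therefore use the probe ids as the main unique ids and worry about gene mappings later.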
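For the later mapping step, multi-valued cells such as `WBGene00023313 /// WBGene00023382` could be expanded to one identifier per row. This is a hedged sketch of one possible approach rather than something the notebook does; the `toy` frame and its probe ids are hypothetical, and `DataFrame.explode` requires pandas 0.25 or newer.

```python
import pandas as pd

# Toy rows; the " /// " separator matches the one seen in ensembl_id,
# while the probe ids are made up.
toy = pd.DataFrame({
    "probe_id": ["p1_at", "p2_at", "p3_at"],
    "ensembl_id": ["WBGene00023313 /// WBGene00023382", "WBGene00006928", None],
})

# Split each cell into a list, then emit one row per list element.
# Missing values pass through as a single NaN row.
long_ids = (
    toy.assign(ensembl_id = toy["ensembl_id"].str.split(" /// "))
       .explode("ensembl_id")
       .reset_index(drop = True)
)
```

The result is a long-format table in which a probe set appears once per mapped gene, which makes the one-to-many mappings explicit.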

## Save to file

In [18]:
ids.to_csv("GPL200_id_mapping.tsv", sep = '\t', index = False)