# Resolve Affymetrix C. elegans Genome Array genes

Tong Shu Li

Convert probe ids in the Affymetrix C. elegans Genome Array platform ([GPL200](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL200)) to WormBase identifiers.

## Data Sources

1. WormBase
2. Affymetrix

Both Affymetrix and WormBase provide data downloads of the probe identifier mappings. We will cross-reference these two independent sources to ensure our mappings are accurate.

In [1]:
import numpy as np
import pandas as pd

from collections import defaultdict

## Read the Affymetrix provided file

File format:
- Empty cells are denoted with `"---"`
- Cells can have multiple entries; separated with `" /// "`
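
As a minimal sketch of this cell format, the helper below turns one raw cell into a set of entries. The function name `parse_affy_cell` is hypothetical (it is not part of the Affymetrix file or of pandas); it only illustrates the two conventions listed above.

```python
import math

def parse_affy_cell(cell):
    """Split one raw annotation cell into a set of entries (illustrative helper)."""
    # Missing values appear either as the literal "---" in the raw file
    # or as a float NaN once "---" has been replaced.
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return set()
    if cell == "---":
        return set()
    # Multiple entries in one cell are separated by " /// ".
    return set(cell.split(" /// "))

print(parse_affy_cell("WBGene00000667 /// WBGene00000669"))
print(parse_affy_cell("---"))
```

Returning a set rather than a list makes later comparisons between sources independent of entry order.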

In [2]:
affy = (
    pd.read_csv("Celegans.na35.annot.csv", sep = ',', comment = '#')
        .replace(to_replace = "---", value = np.nan)
        .dropna(axis = 1, how = "all")
        .rename(columns = lambda col: col.lower().replace(" ", "_"))
        .rename(columns = {"probe_set_id": "probe_id"})
)

In [3]:
affy.shape

(22625, 29)

In [4]:
affy.head(2)

Unnamed: 0,probe_id,genechip_array,species_scientific_name,annotation_date,sequence_type,sequence_source,transcript_id(array_design),target_description,representative_public_id,unigene_id,...,refseq_transcript_id,wormbase,gene_ontology_biological_process,gene_ontology_cellular_component,gene_ontology_molecular_function,interpro,annotation_description,annotation_transcript_cluster,transcript_assignments,annotation_notes
0,171720_x_at,C. elegans Genome Array,Caenorhabditis elegans,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,affy.Ce.20641,AV179929_rc /REP_DB=TREMBL Accession /GB=AV179...,AV179929,Cel.20567,...,NM_064443 /// NM_182150,,0000003 // reproduction // inferred from mutan...,,0005515 // protein binding // inferred from el...,IPR000342 // Regulator of G protein signalling...,This probe set was annotated using the Matchin...,Y48E1B.14a,Y48E1B.14a // cdna:known chromosome:WBcel235:I...,NM_064443 // refseq // 8 // Cross Hyb Matching...
1,171721_x_at,C. elegans Genome Array,Caenorhabditis elegans,"Oct 6, 2014",Exemplar sequence,GenBank,affy.Ce.22856,g6767 /REP_DB=GenBank Identifier /5_PRIME_EXT_...,6767,Cel.121,...,NM_059851 /// NM_171841,,,0005737 // cytoplasm // inferred from direct a...,0005515 // protein binding // inferred from ph...,IPR005373 // Uncharacterised protein family UP...,This probe set was annotated using the Matchin...,"NM_059851(10),NM_171841(10),T01G9.2a,T01G9.2b",NM_059851 // Caenorhabditis elegans Protein T0...,


In [5]:
affy["probe_id"].nunique()

22625

There are a total of 22625 probes on this chip.

## Read the WormBase provided file

In [6]:
wb = (
    pd.read_csv("c_elegans.PRJNA13758.WS252.affy_oligo_mapping.txt", sep = '\t')
        .rename(columns = str.lower)
        .rename(columns = {"oligo_set": "probe_id"})
)

In [7]:
wb.shape

(23370, 7)

In [8]:
wb.head()

Unnamed: 0,probe_id,wbgeneid,gene_sequence_name,gene_type,microarray_type,targeted_isoforms,remark
0,171720_x_at,WBGene00013011,Y48E1B.14,CDS,Y48E1B.14a|Y48E1B.14a,,
1,171721_x_at,WBGene00011344,T01G9.2,CDS,T01G9.2a|T01G9.2b|T01G9.2a|T01G9.2b,,
2,171722_x_at,WBGene00018934,F56B3.11,CDS,F56B3.11a|F56B3.11b|F56B3.11a|F56B3.11b,,
3,171723_x_at,WBGene00006927,F59D8.1,CDS,F59D8.1|F59D8.1,,
4,171723_x_at,WBGene00006928,F59D8.2,CDS,F59D8.2|F59D8.2,,


In [9]:
wb["probe_id"].nunique()

22560

The WormBase file covers 22560 probes, 65 fewer than the 22625 in the Affymetrix file.

## Examining file contents

What information does each file contain?

### Affy

The `ensembl` column appears to contain the WormBase gene identifiers (`WBGene...`), while the `wormbase` column holds `CE...`-style identifiers instead.

In [10]:
affy[["probe_id", "entrez_gene", "wormbase", "ensembl"]].head()

Unnamed: 0,probe_id,entrez_gene,wormbase,ensembl
0,171720_x_at,174997,,WBGene00013011
1,171721_x_at,172609,,WBGene00011344
2,171722_x_at,176907,,WBGene00018934
3,171723_x_at,180646,CE26817,WBGene00006928
4,171724_x_at,172353,CE11778,WBGene00000386


In [11]:
affy[["probe_id", "entrez_gene", "wormbase", "ensembl"]].isnull().sum()

probe_id          0
entrez_gene     684
wormbase       5996
ensembl         717
dtype: int64

717 probes have no entry in the `ensembl` column, and therefore no WormBase gene identifier.

### One WormBase identifier per probe?

In [12]:
affy["ensembl"].str.count("WBGene").value_counts()

1.0     21005
2.0       786
3.0        72
4.0        16
5.0        14
6.0         5
7.0         3
11.0        2
8.0         2
19.0        1
17.0        1
13.0        1
Name: ensembl, dtype: int64

Most probes (21005) map to exactly one WormBase id, 903 map to more than one, and the remaining 717 have no mapping.

In [13]:
# examine probes with multiple mappings
(
    affy[["probe_id", "entrez_gene", "wormbase", "ensembl"]]
        .assign(
            num = lambda df:
                df["ensembl"].str.count("WBGene")
        )
        .query("num > 1")
        .head()
)

Unnamed: 0,probe_id,entrez_gene,wormbase,ensembl,num
36,171756_x_at,176528 /// 259336,CE20144 /// CE20148,WBGene00000667 /// WBGene00000669,2.0
47,171767_x_at,13183151,,WBGene00004622 /// WBGene00004677,2.0
48,171768_x_at,179296 /// 179298,CE06699 /// CE30585,WBGene00000599 /// WBGene00000717,2.0
91,171811_s_at,179698 /// 179701,CE06046,WBGene00010466 /// WBGene00010474,2.0
102,171822_x_at,174578 /// 174603,CE01568 /// CE02343,WBGene00001683 /// WBGene00001686,2.0


For the probes which map to multiple genes, we will need to extract the individual identifiers.
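
One way to extract the individual identifiers is to reshape the table into one row per (probe, identifier) pair. The sketch below is an assumption about how this could be done with `str.split` and `DataFrame.explode` (available in pandas 0.25 and later); it runs on a toy frame, and the third probe id is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data in the same shape as the `probe_id` / `ensembl` columns.
toy = pd.DataFrame({
    "probe_id": ["171720_x_at", "171756_x_at", "hypothetical_at"],
    "ensembl": ["WBGene00013011", "WBGene00000667 /// WBGene00000669", np.nan],
})

long_form = (
    toy.dropna(subset=["ensembl"])  # probes without a mapping are dropped here
       .assign(wormbase_id=lambda df: df["ensembl"].str.split(" /// "))
       .explode("wormbase_id")      # one row per identifier
       .drop(columns="ensembl")
       .reset_index(drop=True)
)
print(long_form)
```

The notebook itself takes a different route and builds a dictionary of sets per probe, which is more convenient for comparing the two sources.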

### WormBase

In [14]:
wb.isnull().sum()

probe_id                  0
wbgeneid                  0
gene_sequence_name      759
gene_type               759
microarray_type         723
targeted_isoforms     23130
remark                23094
dtype: int64

Neither the `probe_id` nor the `wbgeneid` column has missing entries.

In [15]:
wb["wbgeneid"].value_counts().head()

no overlapping gene    759
WBGene00006751          10
WBGene00000065           9
WBGene00003467           9
WBGene00014857           9
Name: wbgeneid, dtype: int64

However, 759 rows have the placeholder value `no overlapping gene` instead of an identifier: WormBase found no gene overlapping these probes.

In [16]:
wb["wbgeneid"].str.len().value_counts()

14    22611
19      759
Name: wbgeneid, dtype: int64

Every cell holds exactly one WormBase id (14 characters), except for the 759 placeholder cells (`no overlapping gene`, 19 characters).
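
String length is only an indirect check. A stricter alternative, sketched here on toy values, is to match each cell against the identifier pattern directly. The pattern assumes that WormBase gene ids are `WBGene` followed by eight digits, which is consistent with the 14-character length seen above.

```python
import pandas as pd

ids = pd.Series(["WBGene00013011", "no overlapping gene", "WBGene00006927"])

# True where the whole cell is a single well-formed WormBase gene id.
is_gene_id = ids.str.match(r"^WBGene\d{8}$")
print(is_gene_id.tolist())
```

Cells that fail the match can then be inspected separately to confirm they are all the known placeholder.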

In [17]:
wb["wbgeneid"].nunique()

18279

The value counts above also show that some WormBase genes are targeted by multiple probes on the chip.
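
To quantify this, one could count distinct probes per gene. The following sketch uses toy data with made-up probe ids; on the real table the `no overlapping gene` placeholder would need to be excluded first.

```python
import pandas as pd

# Toy rows: one gene covered by two probes, one probe covering two genes.
toy_wb = pd.DataFrame({
    "probe_id": ["p1_at", "p2_at", "p3_at", "p3_at"],
    "wbgeneid": ["WBGene00000001", "WBGene00000001",
                 "WBGene00000002", "WBGene00000003"],
})

# Number of distinct probes that target each gene.
probes_per_gene = toy_wb.groupby("wbgeneid")["probe_id"].nunique()

# Genes targeted by more than one probe.
multi_probe_genes = probes_per_gene[probes_per_gene > 1]
print(multi_probe_genes)
```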

In [18]:
wb["wbgeneid"] = wb["wbgeneid"].replace("no overlapping gene", np.nan)

---

## Combine mappings

We will generate a set of WormBase identifiers for each probe, and check that the sets match between the Affy and WormBase files.

In [19]:
set(affy["probe_id"]) > set(wb["probe_id"])

True

The Affymetrix probe set is a strict superset of the WormBase probes. We will therefore need to provide a mapping for all 22625 probes in the Affy file.
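
The same set operations can list which probes are missing from the WormBase file. This is a small sketch with toy sets (the third probe id is made up); since the real Affymetrix set is a strict superset, the difference on the real data should contain 22625 - 22560 = 65 probes.

```python
affy_probes = {"171720_x_at", "171721_x_at", "hypothetical_1_at"}
wb_probes = {"171720_x_at", "171721_x_at"}

# `>` on sets tests for a strict superset.
is_strict_superset = affy_probes > wb_probes

# Probes present in the Affymetrix file only.
missing_from_wb = sorted(affy_probes - wb_probes)
print(is_strict_superset, missing_from_wb)
```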

### Generate the set of WormBase ids for each probe

In [20]:
affy_map = defaultdict(set)

for probe, uid in zip(affy["probe_id"], affy["ensembl"]):
    if isinstance(uid, str):
        affy_map[probe] |= set(uid.split(" /// "))
    elif probe not in affy_map:
        affy_map[probe] = set()

In [21]:
wb_map = defaultdict(set)

for probe, uid in zip(wb["probe_id"], wb["wbgeneid"]):
    if isinstance(uid, str):
        wb_map[probe].add(uid)
    elif probe not in wb_map:
        wb_map[probe] = set()

### Combined mapping results:

In [22]:
res = pd.DataFrame(
    [
        (
            # wb_map is a defaultdict, so probes absent from the WormBase file get an empty set
            probe, len(affy_set), len(wb_map[probe]),
            affy_set == wb_map[probe], len(affy_set - wb_map[probe]), len(wb_map[probe] - affy_set)
        )
        for probe, affy_set in affy_map.items()
    ],
    columns = ["probe_id", "num_affy", "num_wb", "vals_eq", "affy_only", "wb_only"]
).sort_values("probe_id").reset_index(drop = True)

In [23]:
res.shape

(22625, 6)

In [24]:
res.head()

Unnamed: 0,probe_id,num_affy,num_wb,vals_eq,affy_only,wb_only
0,171720_x_at,1,1,True,0,0
1,171721_x_at,1,1,True,0,0
2,171722_x_at,1,1,True,0,0
3,171723_x_at,1,2,False,0,1
4,171724_x_at,1,1,True,0,0


In [25]:
res.query("num_affy == 0 and num_wb == 0").shape

(602, 6)

602 probes have no WormBase id mapping in either file.

### How many probes have agreeing mappings?

We exclude the 602 probes that have no WormBase id in either file.

In [26]:
res.query("num_affy > 0 or num_wb > 0")["vals_eq"].value_counts()

True     20575
False     1448
Name: vals_eq, dtype: int64

20575 probes have agreeing mappings in both files. The other 1448 mapped probes disagree between the two sources, and 602 probes have no mapping in either.
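
The disagreeing probes could be broken down further by how the two sets relate. The function below is a hypothetical helper (the name `classify` and the category labels are not from the notebook) sketching one possible breakdown; it is not applied to the real data here.

```python
def classify(affy_set, wb_set):
    """Label the relationship between two sets of WormBase ids (illustrative)."""
    if affy_set == wb_set:
        return "equal"              # includes the case where both are empty
    if not affy_set:
        return "wb_only_mapping"    # only WormBase maps this probe
    if not wb_set:
        return "affy_only_mapping"  # only Affymetrix maps this probe
    if affy_set < wb_set:
        return "affy_subset_of_wb"
    if wb_set < affy_set:
        return "wb_subset_of_affy"
    if affy_set & wb_set:
        return "partial_overlap"
    return "disjoint"

# Probe 171723_x_at from the tables above: Affymetrix lists one gene,
# WormBase lists two.
print(classify({"WBGene00006928"}, {"WBGene00006927", "WBGene00006928"}))
```

Applied to each probe, a tally of these labels would show whether most disagreements come from one source simply listing extra genes.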

## Summarize results

We will only save the probes with existing mappings that agree in both sources. All other probes are ignored.

In [27]:
final = pd.DataFrame(
    [
        (probe, val)
        # keep probes whose id sets agree and are non-empty
        for probe in res.query("vals_eq and (num_affy + num_wb) > 0")["probe_id"]
            for val in affy_map[probe]
    ],
    columns = ["probe_id", "wormbase_id"]
).sort_values(["wormbase_id", "probe_id"]).reset_index(drop = True)

In [28]:
final.shape

(20841, 2)

In [29]:
final.head()

Unnamed: 0,probe_id,wormbase_id
0,187171_s_at,WBGene00000001
1,192458_at,WBGene00000002
2,190092_at,WBGene00000003
3,190053_at,WBGene00000004
4,190396_s_at,WBGene00000005


In [30]:
final["probe_id"].nunique()

20575

In [31]:
final["wormbase_id"].nunique()

17281

## Save to file

In [32]:
final.to_csv("GPL200_wormbase_map.tsv", sep = '\t', index = False)
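
A quick round-trip check can confirm that a saved mapping reads back unchanged. This sketch writes a two-row toy table (rows taken from `final.head()` above) to a temporary directory, so the file name `toy_map.tsv` is an assumption and the real output file is not touched.

```python
import os
import tempfile

import pandas as pd

toy_final = pd.DataFrame({
    "probe_id": ["187171_s_at", "192458_at"],
    "wormbase_id": ["WBGene00000001", "WBGene00000002"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_map.tsv")
    # Same options as the save step: tab-separated, no index column.
    toy_final.to_csv(path, sep="\t", index=False)
    reloaded = pd.read_csv(path, sep="\t")

round_trip_ok = reloaded.equals(toy_final)
has_duplicates = bool(reloaded.duplicated().any())
print(round_trip_ok, has_duplicates)
```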