# Perform Gene ID Mapping

Author: Ashley Schwartz

Date: June 30, 2023

## Purpose and Background

This tutorial shows how to convert a list of zebrafish Gene IDs from one Gene ID type to another. Gene IDs come in very different forms depending on the database or genome build you are using. This can get confusing! The Gene ID options are:

| Gene ID Name | Description | Example | Notes |
|--|--|--|--|
| ZFIN ID | ZFIN gene id: always starts with 'ZDB' for zebrafish database | ZDB-GENE-011219-1 | used as the "master" gene id ([link](https://zfin.org/))|
| NCBI Gene ID | integer gene id managed by NCBI: also known as Entrez Gene ID | 140634 | [link](https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=140634) |
| Symbol | descriptive symbol/name: RefSeq symbol used in RefSeq genome build | cyp1a | nomenclature defined by ZFIN |
| Ensembl Gene ID | Ensembl database gene id: always starts with 'ENSDAR'| ENSDARG00000098315 | [link](http://useast.ensembl.org/Danio_rerio/Location/View?g=ENSDARG00000098315;r=18:5588068-5598958) |
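
If you are ever unsure which kind of ID you are looking at, the prefixes in the table can serve as a rough guide. The sketch below is only an illustration of that idea: `guess_gene_id_type` is a hypothetical helper (it is not part of the tutorial's library), and it only looks at the shape of the ID, not whether the ID exists in any database.

```python
def guess_gene_id_type(gene_id):
    """Hypothetical helper: guess the Gene ID type from its format alone."""
    text = str(gene_id).strip()
    if text.startswith('ZDB'):
        return 'ZFIN ID'
    if text.startswith('ENSDAR'):
        return 'Ensembl Gene ID'
    if text.isdigit():
        return 'NCBI Gene ID'
    # anything else is treated as a symbol (an assumption of this sketch)
    return 'Symbol'

# the four example IDs from the table above
for example in ['ZDB-GENE-011219-1', 140634, 'cyp1a', 'ENSDARG00000098315']:
    print(example, '->', guess_gene_id_type(example))
```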

## Requirements

In this tutorial we will be utilizing two key elements:
- a sample Gene ID list (format: .csv, .tsv, .txt) for reading in the Gene IDs, otherwise typing or copy/pasting Gene IDs is also supported
    - the gene list we will be using is located in the data/test_data subdirectory of this current working directory with relative path `data/test_data/01_TPP.txt`
- the required python package
    - the python package is located in `../src/danRerlib`

In general, while you do not need a large foundation in Python to execute the code listed in this tutorial, a general understanding of absolute and relative paths is useful.
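
As a quick refresher on paths, here is a minimal sketch using the standard `pathlib` module. A relative path is interpreted starting from the current working directory, so joining the two gives the absolute path. The file does not need to exist for this example to run.

```python
from pathlib import Path

# a relative path: interpreted relative to the current working directory
relative_path = Path('data/test_data/01_TPP.txt')

# an absolute path: the working directory joined with the relative path
absolute_path = Path.cwd() / relative_path

print(relative_path.is_absolute())
print(absolute_path.is_absolute())
print(absolute_path)
```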

_note: the Gene IDs are spelling and case sensitive_

## Set up Python environment

In [1]:
# IMPORT PYTHON PACKAGES
# ----------------------

# makes the notebook cell print all outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
# path packages
import sys
from pathlib import Path
# data processing packages
import pandas as pd

In [4]:
# SET UP MY LOCAL PACKAGE
# -----------------------
# this step is only needed because the local package has not been released through pip

cwd = Path().absolute()

package_folder = cwd / Path('../src/danRerlib')
sys.path.append(str(package_folder))
import mapping, utils

# SET UP DATA DIRECTORY
# ---------------------
test_data_dir = cwd / Path('data/test_data/')
out_data_dir = cwd / Path('data/out_data/')

# note: I am using the Path package to take care of any operating
#       system differences for users of this tutorial

## Execute Mappings

There are a variety of scenarios in which you might need to map Gene IDs. In the simplest case, you might have a few IDs you would like to map to NCBI (Entrez) Gene IDs, since that is a common Gene ID type used in pathway databases. Other times, you might want to convert an entire column in an Excel file you have. We will go through a few different options.

### Simple Case: Convert a list of Gene IDs

_Purpose: Given a small list of Gene IDs that are of Gene ID type A, convert to Gene ID type B._

You would most likely use the simple case if you have a small list of Gene IDs that you need to convert. It is especially useful if you just want to copy and paste your IDs and retrieve the converted ones!

__Step 1__: Define your list of ids. In this case, I have NCBI Gene IDs. I named the python list `list_of_gene_ids` and include all the Gene IDs I want to convert. 

In [6]:
list_of_gene_ids = [ 
    100000252, 100000750, 100001198, 100001260, 100002225, 100002263, 
    100002756, 100003223, 100007521, 100149273, 100149794, 100170795,
    100321746, 100329897, 100330617,
]

__Step 2__: Tell the program which ID type you currently have and which ID type you would like to convert to. I currently have NCBI Gene IDs and I want to convert to ZFIN IDs. Note that the Gene ID options are spelling and case sensitive. The options are listed at the beginning of this document. Don't worry: the program will let you know if you have made a mistake when you launch it!

In [7]:
current_gene_id_type = 'NCBI Gene ID'
desired_gene_id_type = 'ZFIN ID'
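
To illustrate what "spelling and case sensitive" means in practice, here is a small sketch. It assumes the accepted option strings are exactly the names in the first column of the table at the top of this document. `check_id_type` is a hypothetical helper written for this example; the library performs its own check, which may behave differently.

```python
# assumed option strings, copied from the table at the top of the tutorial
VALID_ID_TYPES = ['ZFIN ID', 'NCBI Gene ID', 'Symbol', 'Ensembl Gene ID']

def check_id_type(id_type):
    """Hypothetical helper: raise a ValueError unless id_type matches exactly."""
    if id_type not in VALID_ID_TYPES:
        raise ValueError(
            f"'{id_type}' is not a valid Gene ID type; choose from {VALID_ID_TYPES}"
        )
    return id_type

check_id_type('NCBI Gene ID')       # exact match, passes silently

try:
    check_id_type('ncbi gene id')   # wrong case, rejected
except ValueError as err:
    print(err)
```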

__Step 3__: Launch the conversion function to get your converted ids. This means we want to run the `convert_ids` function in the `mapping` module of our library. Once executed, the converted ids will be stored in the `converted_ids` variable.

In [8]:
# do conversion
converted_ids = mapping.convert_ids(list_of_gene_ids, current_gene_id_type, desired_gene_id_type)

__Step 4__: To view your converted ids, you can either print them to the python shell or save them to a file. If you would like to print them, which is a fine idea if you only have a few, you can use the `pretty_print_series` function in the `utils` module of the library.

In [9]:
utils.pretty_print_series(converted_ids)

ZDB-GENE-030131-1904
ZDB-GENE-030131-3404
ZDB-GENE-030325-1
ZDB-GENE-030616-609
ZDB-GENE-040426-743
ZDB-GENE-050309-246
ZDB-GENE-071009-6
ZDB-GENE-080219-34
ZDB-GENE-080723-44
ZDB-GENE-081223-2
ZDB-GENE-090313-141
ZDB-GENE-091117-28
ZDB-GENE-110309-3
ZDB-GENE-120215-92


If you would rather save the data to a file, you can save `converted_ids` to a file named `converted_ids.txt` in the output data directory we defined previously. The `save_series` function in the `utils` module does this with some default options. Feel free to change the output directory to any folder of your choice.

In [10]:
file_name = out_data_dir / 'converted_ids.txt'
utils.save_series(converted_ids, file_name)
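
If you would rather control the output format yourself, plain `pandas` can also write a Series to a text file. This is only a sketch of that alternative, not a description of what `save_series` does internally. The two example IDs are taken from the output above, and the file is written to a temporary folder so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# two converted ids from the output above, stored as a named Series
example_ids = pd.Series(
    ['ZDB-GENE-030131-1904', 'ZDB-GENE-030131-3404'], name='ZFIN ID'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = Path(tmp_dir) / 'converted_ids.txt'
    # one id per line, with the Series name as a header and no index column
    example_ids.to_csv(out_file, sep='\t', index=False, header=True)
    saved_text = out_file.read_text()

print(saved_text)
```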

### Simple Case: Convert a list of Gene IDs From a File

_Purpose: Given a list of Gene IDs from a file that are of Gene ID type A, convert to Gene ID type B._

If you have a file that contains a list of Gene IDs, you can easily repeat these steps by reading in that file first. Check it out:

In [11]:
data_file_path = Path('data/test_data/small_gene_id_list.txt')
list_of_gene_ids = pd.read_csv(data_file_path, sep='\t')
list_of_gene_ids

Unnamed: 0,NCBI Gene ID
0,100000252
1,100000750
2,100001198
3,100001260
4,100002225
5,100002263
6,100002756
7,100003223
8,100007521
9,100149273


When we use the `pandas` package to read in the data, it organizes it into what is called a `Pandas DataFrame`. All other steps can be executed the same way.
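
If you are new to `pandas`, the following minimal sketch shows what reading a one-column file produces. To keep it self-contained, the "file" is a short in-memory string holding the first three IDs from the table above rather than the real test file.

```python
import io

import pandas as pd

# stand-in for a small text file with a header line and one id per line
file_text = "NCBI Gene ID\n100000252\n100000750\n100001198\n"

frame = pd.read_csv(io.StringIO(file_text), sep='\t')

print(type(frame).__name__)             # the kind of object pandas returns
print(list(frame.columns))              # column names come from the header line
print(frame['NCBI Gene ID'].tolist())   # one column, pulled out as a plain list
```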

In [12]:
current_gene_id_type = 'NCBI Gene ID'
desired_gene_id_type = 'ZFIN ID'

# do conversion
converted_ids = mapping.convert_ids(list_of_gene_ids, current_gene_id_type, desired_gene_id_type)

utils.pretty_print_series(converted_ids)

ZDB-GENE-030131-1904
ZDB-GENE-030131-3404
ZDB-GENE-030325-1
ZDB-GENE-030616-609
ZDB-GENE-040426-743
ZDB-GENE-050309-246
ZDB-GENE-071009-6
ZDB-GENE-080219-34
ZDB-GENE-080723-44
ZDB-GENE-081223-2
ZDB-GENE-090313-141
ZDB-GENE-091117-28
ZDB-GENE-110309-3
ZDB-GENE-120215-92


You could also save the data to a file in the same manner. As you can see, we now have our converted ids. The ids will be in order based on the original Gene IDs given to the program. If you would like to keep the mapping, adding a column to the Gene IDs you currently have might be useful (see below).

One limitation of this method is that the mapping between Gene ID types is not always 1:1. For example, it is quite common for more than one Ensembl Gene ID to correspond to a single ID of another type. This function returns all mappings, but because it returns only the list of converted ids, you cannot tell which gene in your original set mapped to two different genes in the new set. This may not be an issue for some use cases, but sometimes it is important to know.
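
To make the one-to-many situation concrete, here is a small sketch with made-up placeholder IDs (they are not real genes). Once the original and converted IDs sit side by side in a table, `pandas` can flag every original ID that appears more than once, which is exactly the information a bare list of converted ids cannot give you.

```python
import pandas as pd

# placeholder ids for illustration only: 'a2' maps to both 'b2' and 'b3'
mapping_table = pd.DataFrame({
    'Gene ID A': ['a1', 'a2', 'a2', 'a3'],
    'Gene ID B': ['b1', 'b2', 'b3', 'b4'],
})

# keep=False marks every row whose 'Gene ID A' value occurs more than once
is_one_to_many = mapping_table.duplicated(subset='Gene ID A', keep=False)
one_to_many_rows = mapping_table[is_one_to_many]

print(one_to_many_rows)
```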

If you would like to keep the old Gene IDs along with the mapping, follow the next set of instructions!

### Simple Case: Convert a list of Gene IDs and Keep Mapping

_Purpose: Given a list of Gene IDs that are of Gene ID type A, convert to Gene ID type B and keep both Gene ID A and Gene ID B in a table._



__Step 1__: Define your list of ids. This is the same list I used above, and remember they are NCBI Gene IDs. I am also defining my current Gene ID type and the Gene ID type I would like to convert to here.

In [13]:
list_of_gene_ids = [ 
    100000252, 100000750, 100001198, 100001260, 100002225, 100002263, 
    100002756, 100003223, 100007521, 100149273, 100149794, 100170795,
    100321746, 100329897, 100330617,
]

current_gene_id_type = 'NCBI Gene ID'
desired_gene_id_type = 'ZFIN ID'

__Step 2:__ Do the conversion. If we would like to keep the mapping, we use the same `convert_ids` function in the `mapping` module of our library but activate the `keep_mapping` parameter. By default, as used earlier, `keep_mapping = False`.

In [14]:
# do conversion
converted_id_table = mapping.convert_ids(list_of_gene_ids, current_gene_id_type, desired_gene_id_type, keep_mapping=True)
converted_id_table

Unnamed: 0,NCBI Gene ID,ZFIN ID
0,100002263,ZDB-GENE-030131-1904
1,100330617,ZDB-GENE-030131-3404
2,100001198,ZDB-GENE-030325-1
3,100003223,ZDB-GENE-030616-609
4,100000252,ZDB-GENE-040426-743
5,100001260,ZDB-GENE-050309-246
6,100321746,ZDB-GENE-071009-6
7,100002225,ZDB-GENE-080219-34
8,100170795,ZDB-GENE-080723-44
9,100000750,ZDB-GENE-081223-2


You can save the data in the same way. In this case, since we have a `Pandas DataFrame` with column headings, the column names will be saved automatically. 

If you would like to read the Gene IDs in from a file, all steps are the same besides the initialization of the Gene IDs:

In [15]:
data_file_path = Path('data/test_data/small_gene_id_list.txt')
list_of_gene_ids = pd.read_csv(data_file_path, sep='\t')

current_gene_id_type = 'NCBI Gene ID'
desired_gene_id_type = 'ZFIN ID'

# do conversion
converted_id_table = mapping.convert_ids(list_of_gene_ids, current_gene_id_type, desired_gene_id_type, keep_mapping=True)
converted_id_table

Unnamed: 0,NCBI Gene ID,ZFIN ID
0,100002263,ZDB-GENE-030131-1904
1,100330617,ZDB-GENE-030131-3404
2,100001198,ZDB-GENE-030325-1
3,100003223,ZDB-GENE-030616-609
4,100000252,ZDB-GENE-040426-743
5,100001260,ZDB-GENE-050309-246
6,100321746,ZDB-GENE-071009-6
7,100002225,ZDB-GENE-080219-34
8,100170795,ZDB-GENE-080723-44
9,100000750,ZDB-GENE-081223-2


The above methods are great if you have a list of Gene IDs you would like to convert. There are cases where you might have a large dataset in which one column holds the Gene IDs that you would like to convert. In this scenario, keeping every row's values matched to the correct gene is extremely important.

### Convert Gene IDs in a Column of a Larger Dataset

_Purpose: you have a dataset with columns x, y, z. Column x has Gene IDs in type A. You would like to convert these Gene IDs to type B while maintaining the information of columns y, z._

In this scenario, you might have some data that looks like:

| NCBI Gene ID | PValue | logFC | 
|-|-|-|
| 100002263 | 2.3 | 0.03 | 
|... | ... | ... |

The information in the logFC and PValue columns must stay 'in order' with the Gene ID column, so that each value remains on the row of the gene it belongs to. In this scenario you will often have an entire gene set and will be dealing with a lot more data. Let's look at a test dataset for this case.
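
Before turning to the test dataset, here is a minimal sketch of what keeping the columns 'in order' means, using plain `pandas` and made-up placeholder values. It is an illustration of the general idea and not necessarily how the library handles this case. A left merge on the Gene ID column attaches each converted ID to its own row, so the PValue and logFC values never get separated from their gene, and genes without a mapping are kept with a missing value.

```python
import pandas as pd

# made-up numeric ids and statistics, for illustration only
data = pd.DataFrame({
    'NCBI Gene ID': [111, 222, 333],
    'PValue': [0.01, 0.20, 0.05],
    'logFC': [1.5, -0.3, 0.8],
})

# a mapping table in a different order, with no entry for gene 222
mapping_table = pd.DataFrame({
    'NCBI Gene ID': [333, 111],
    'ZFIN ID': ['ZDB-PLACEHOLDER-3', 'ZDB-PLACEHOLDER-1'],
})

# how='left' keeps every row of `data` in its original order
merged = data.merge(mapping_table, on='NCBI Gene ID', how='left')
print(merged)
```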

__Step 1:__ Read in the data. The data in the test directory is tab-separated text with a `.txt` extension. The `pandas` package can read this without an issue; we just need to specify the separator, which is `\t` for this type of data. Note that a csv file can be read the same way with `sep=','`, while an Excel workbook needs `pd.read_excel` instead of `pd.read_csv`.

In [16]:
data_file_path = Path('data/test_data/01_TPP.txt')
data = pd.read_csv(data_file_path, sep='\t')

To get a quick look at the data, we can print the first three table entries and some data stats:

In [17]:
# print first three lines
data.head(3)
rows, cols = data.shape
print(f'There are {rows} rows and {cols} columns')

Unnamed: 0,NCBI Gene ID,PValue,logFC
0,100000006,0.792615,0.115009
1,100000009,0.607285,-0.144714
2,100000026,0.021338,0.603871


There are 21854 rows and 3 columns
