# KEGG Feature Engineering Data Preprocessing 

### Nomenclature Conversion:

Merging the HGNC ID and normalized counts dataframes

The KEGG Pathways database uses the HGNC ID naming convention, but the gene expression dataset uses Ensembl gene IDs. To relate the two, we must map each Ensembl gene ID in the normalized gene expression matrix to its HGNC ID.

#### HGNC Data Importation

In [1]:
#Imports pandas and numpy packages
import pandas as pd
import numpy as np

#Imports HGNC dataframe
HGNC = pd.read_csv('HGNC.csv')

In [3]:
#Displays the first row of the HGNC dataframe, which contains all human gene IDs taken from the HGNC database
HGNC.head(1)

Unnamed: 0,HGNC ID,Approved symbol,Approved name,Ensembl gene ID
0,5,A1BG,alpha-1-B glycoprotein,ENSG00000121410


In [4]:
#Renames the columns ('Ensembl gene ID' becomes 'gene_sliced') and displays the first row of the dataframe
HGNC.columns = ['HGNC_ID', 'Approved_Symbol', 'Approved_name','gene_sliced']
HGNC.head(1)

Unnamed: 0,HGNC_ID,Approved_Symbol,Approved_name,gene_sliced
0,5,A1BG,alpha-1-B glycoprotein,ENSG00000121410


In [5]:
#Drops the symbol and name columns, keeping only 'HGNC_ID' and 'gene_sliced'
HGNC_map = HGNC.drop(['Approved_Symbol', 'Approved_name'], axis = 1)
HGNC_map.head(1)

Unnamed: 0,HGNC_ID,gene_sliced
0,5,ENSG00000121410
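
Because a dictionary built from this table can hold only one HGNC ID per Ensembl gene ID, it may be worth checking the table for missing or repeated Ensembl IDs first. The sketch below is one way to do that. It uses a small hypothetical table, `toy_map`, instead of the real `HGNC_map`; only its first row comes from the output above, and the `ENSG_TOY_A` rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for HGNC_map; only the first row is taken from the notebook output
toy_map = pd.DataFrame({
    'HGNC_ID': [5, 101, 102, 103],
    'gene_sliced': ['ENSG00000121410', 'ENSG_TOY_A', 'ENSG_TOY_A', np.nan],
})

# Rows with no Ensembl gene ID cannot be used for mapping
n_missing = toy_map['gene_sliced'].isna().sum()

# Ensembl IDs that appear on more than one row would collide as dictionary keys
has_id = toy_map.dropna(subset=['gene_sliced'])
n_duplicated = has_id['gene_sliced'].duplicated(keep=False).sum()

# Keeps the first HGNC ID for each Ensembl ID so the mapping is one-to-one
clean_map = has_id.drop_duplicates(subset='gene_sliced', keep='first')

print(n_missing, n_duplicated, len(clean_map))
```

Keeping the first match is only one possible policy; which HGNC ID to prefer for a repeated Ensembl ID is a decision for the analyst.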


In [7]:
#Creates a list of HGNC IDs (the first column of HGNC_map)
HGNC_id = list(HGNC_map.iloc[: , 0])

In [47]:
#Strips the version suffix (everything after the '.') from each Ensembl gene ID and collects the results in a new list
#Left commented out; 'ensembl_ids' is a placeholder for the list of versioned Ensembl IDs

#gene_sliced=[]
#for gene in ensembl_ids:
#    gene_new=gene.split('.')[0]
#    gene_sliced.append(gene_new)

#print(gene_sliced)
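
The version-stripping step can also be written without an explicit loop. The following is a minimal sketch using pandas string methods on a two-element example Series (`versioned` is a hypothetical name, not a variable from this notebook).

```python
import pandas as pd

# Example Ensembl IDs: one with a version suffix, one without
versioned = pd.Series(['ENSG00000237973.1', 'ENSG00000121410'])

# Keeps everything before the first '.'; IDs without a suffix are left unchanged
stripped = versioned.str.split('.').str[0]

print(stripped.tolist())
```

The same expression could be applied to whichever column holds the versioned IDs.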

#### Gene Expression Data Importation

The gene expression data were normalized before import.

In [18]:
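#Imports the normalized gene counts dataframe; the versioned Ensembl IDs load as the 'Unnamed: 0' column
counts = pd.read_csv('cpm_renamed_original.csv')

#Adds 'gene_sliced' and 'Ensemble ID' columns, both copied from HGNC['gene_sliced']
#Note: this assignment aligns the two dataframes by row position, not by the gene ID in 'Unnamed: 0'
counts['gene_sliced'] = HGNC['gene_sliced']
counts['Ensemble ID'] = HGNC['gene_sliced']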
counts.head(1)

Unnamed: 0.1,Unnamed: 0,2001/1/1,2002/2/1,2003/1/1,2004/2/1,2006/2/1,2008/1/1,2010/2/1,2012/2/1,2013/2/1,...,2075/1/1,2078/2/1,2080/2/1,2081/2/1,2082/1/1,2083/2/1,2084/1/1,2085/2/1,gene_sliced,Ensemble ID
0,ENSG00000237973.1,41.1537,32.840876,33.472636,68.599342,55.83454,85.471215,82.970549,75.094779,34.152149,...,59.909131,58.834271,44.044076,33.586355,41.367272,81.654563,56.295355,42.16532,ENSG00000121410,ENSG00000121410
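
In the row above, the ID in `Unnamed: 0` (ENSG00000237973.1) differs from the one in `gene_sliced` (ENSG00000121410), because assigning one dataframe's column to another pairs rows by index label rather than by gene. The sketch below shows a key-based alternative on hypothetical miniature tables (`toy_counts`, `toy_hgnc`); the `.11` version suffix and the `sample_1` values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the counts and HGNC tables
toy_counts = pd.DataFrame({
    'Unnamed: 0': ['ENSG00000237973.1', 'ENSG00000121410.11'],
    'sample_1': [41.1537, 7.5],
})
toy_hgnc = pd.DataFrame({
    'HGNC_ID': [5],
    'gene_sliced': ['ENSG00000121410'],
})

# Assigning a column from another dataframe aligns on row labels (0, 1, ...)
by_position = toy_counts.assign(gene_sliced=toy_hgnc['gene_sliced'])

# Deriving the key from the counts table's own ID column keeps each row's gene
by_key = toy_counts.assign(
    gene_sliced=toy_counts['Unnamed: 0'].str.split('.').str[0]
)

# Left merge on the shared key; genes without an HGNC entry get NaN
merged = by_key.merge(toy_hgnc, on='gene_sliced', how='left')

print(by_position['gene_sliced'].tolist())
print(merged[['Unnamed: 0', 'HGNC_ID']])
```

With the key-based version, the first toy row has no HGNC ID and the second row receives HGNC ID 5, whereas the positional version attaches the A1BG Ensembl ID to the first row.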


#### Mapping HGNC and Counts Dataframes

In [19]:
#Creates a dictionary from HGNC_map dataframe to connect 'gene_sliced' to 'HGNC_ID'
mapping = dict(HGNC_map[['gene_sliced', 'HGNC_ID']].values)

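#Replaces each Ensembl gene ID in 'gene_sliced' with its HGNC ID (values with no match are left unchanged)
counts['gene_sliced'] = counts['gene_sliced'].replace(mapping)

#Copies the mapped IDs into an 'HGNC_ID' column
counts['HGNC_ID'] = counts['gene_sliced']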
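
A quick way to see how many Ensembl IDs have no HGNC entry is `Series.map`, which yields NaN for keys absent from the dictionary. The sketch below uses a hypothetical two-gene Series (`genes`) and a one-entry dictionary (`toy_mapping`) built from the A1BG row shown earlier.

```python
import pandas as pd

# One-entry dictionary taken from the A1BG row of the HGNC output
toy_mapping = {'ENSG00000121410': 5}

# Hypothetical gene column: the first ID is in the dictionary, the second is not
genes = pd.Series(['ENSG00000121410', 'ENSG00000237973'])

# map() returns a new Series; IDs missing from the dictionary become NaN
hgnc_ids = genes.map(toy_mapping)

# Counts the genes that could not be mapped
n_unmapped = hgnc_ids.isna().sum()

print(hgnc_ids.tolist(), n_unmapped)
```

Reporting the unmapped count before saving makes it easier to judge how much of the expression matrix will carry a usable HGNC ID.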

In [21]:
#Drops the 'Unnamed: 0' and 'HGNC_ID' columns and sets 'Ensemble ID' as the index
counts = counts.drop(['Unnamed: 0', 'HGNC_ID'], axis = 1).set_index('Ensemble ID')
counts.head(1)

Ensemble ID,2001/1/1,2002/2/1,2003/1/1,2004/2/1,2006/2/1,2008/1/1,2010/2/1,2012/2/1,2013/2/1,2014/1/1,...,2074/1/1,2075/1/1,2078/2/1,2080/2/1,2081/2/1,2082/1/1,2083/2/1,2084/1/1,2085/2/1,gene_sliced
ENSG00000121410,41.1537,32.840876,33.472636,68.599342,55.83454,85.471215,82.970549,75.094779,34.152149,29.143781,...,72.274791,59.909131,58.834271,44.044076,33.586355,41.367272,81.654563,56.295355,42.16532,5


In [40]:
#Saves the counts dataframe, with HGNC IDs stored in 'gene_sliced', to a CSV file
counts.to_csv('counts_by_HGNCid.csv')