### Motivation

In the handout, it was explained that the nucleotide column of the dataset represents the combined nucleotides from the neighboring 1-flanking position. Since this column is in the form of string data, encoding should be carried out to convert these strings into categorical data.

However, if encoding is carried out directly on the nucleotide column, there will be a high dimension of categories as there will be numerous combinations and permutations of nucleotides and this may not be meaningful. Thus, the approach taken is to split the column into three seperate columns where each column represents a single nucleotide. Then, encoding will be carried out on these three columns. This will likely result in far fewer categories while at the same time preserving the ordering of the nucleotides.

## Training the Encoder

### Reading In Training Data

In [1]:
import pandas as pd
import numpy as np
import joblib
from sklearn import preprocessing
from sklearn.preprocessing import OrdinalEncoder

In [2]:
train = pd.read_csv("minmax_dataset.csv")

In [3]:
print("Number of distinct nucleotide combinations: " + str(train["nucleotide"].nunique()))

Number of distinct nucleotide combinations: 288


### Spliting The Nucleotide Column

As mentioned previously, the nucleotide column will be divided into its three seperate columns: <br>
the nucleotide in its previous position -> "nucleotide-1" <br>
the nucleotide in its current position -> "nucleotide" <br>
the nucleotide in its next position -> "nucleotide+1"

In [4]:
# Here, the nucleotides are split by indexing 
train['nucleotide-1'] = train['nucleotide'].str[0:5]
train['nucleotide+1'] = train['nucleotide'].str[2:7]
train['nucleotide'] = train['nucleotide'].str[1:6]

In [5]:
print( "Distinct nucleotides in each column")
print('nucleotide-1: '+ str(train['nucleotide-1'].nunique()))
print('nucleotide: '+ str(train['nucleotide'].nunique()))
print('nucleotide+1: '+ str(train['nucleotide'].nunique()))

Distinct nucleotides in each column
nucleotide-1: 24
nucleotide: 18
nucleotide+1: 18


Evidently, splitting the columns has reduced the number of categories significantly

### Encoding The Columns And Saving The Encoder

First an ordinal encoder needs to be initialised

In [6]:
oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

this ordinal encoder was initialised with two parameters. The "handle_unknown" parameter makes it such that the encoded value of unknown categories will be set to the value given for the parameter "unknown_value" which in this case = -1.

This was done in the event that there are nucleotides in the test data not present in the training data.

In [7]:
train[['nucleotide-1', 'nucleotide','nucleotide+1']] = oe.fit_transform(train[['nucleotide-1', 'nucleotide','nucleotide+1']])
# fit_transform fits the encoder on the data, then transforms the data

In [8]:
joblib.dump(oe, "nucleotide_encoder.joblib")

['nucleotide_encoder.joblib']

Joblib is a set of tools to provide lightweight pipelining in Python. Here it is being used to save the encoder as a file so that it can be loaded later to transform the test data.

## Testing The Encoder

### Reading In The Test Data And Encoder

In [9]:
test = pd.read_csv("dataset1.csv")

In [10]:
oe = joblib.load('nucleotide_encoder.joblib')

### Spliting The Nucleotide Column

Just like the traning data, the test data needs to be transformed into the three nucleotide columns

In [11]:
test['nucleotide-1'] = test['nucleotide'].str[0:5]
test['nucleotide+1'] = test['nucleotide'].str[2:7]
test['nucleotide'] = test['nucleotide'].str[1:6]

### Encoding The Nucleotides

In [12]:
test[['nucleotide-1', 'nucleotide','nucleotide+1']] = oe.transform(test[['nucleotide-1', 'nucleotide','nucleotide+1']])

Here, ".transform" is used since the encoder was already fit on the training data.