# 2.0. Data Prep for kNN

The kNN analysis uses a slightly modified version of the data in which the classes are balanced for each tissue and replicate before the data are split into training and test sets, as is done for the general neural network classifier.

In this notebook, we demonstrate how to load the data and balance all classes. The indices are saved to disk, so this step only needs to be run once. Balancing matters when training a classifier: if some classes or samples contribute far more signals than others, the model can become biased toward them. If you just want to load all data for certain samples and classes, you can use the `load_data` function with `load_all_data=True`.
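To make the idea of balancing concrete, here is a minimal sketch of one common approach: randomly downsampling every group to the size of the smallest one. This is an illustration only, not necessarily what `get_data_indices` does internally; the function name `balanced_indices`, the group names, and the counts are all made up for the example.

```python
import numpy as np

def balanced_indices(counts, seed=0):
    """Pick the same number of signal indices from every group.

    counts: dict mapping a (hypothetical) group name to its number of signals.
    Returns a dict mapping each group name to a sorted array of indices,
    all of the same length (the size of the smallest group).
    """
    rng = np.random.default_rng(seed)
    n_min = min(counts.values())
    return {
        name: np.sort(rng.choice(n, size=n_min, replace=False))
        for name, n in counts.items()
    }

# Made-up signal counts for three groups of one replicate
counts = {"heart_rep1": 120, "adrenal_rep1": 95, "aorta_rep1": 140}
idx = balanced_indices(counts)
print({name: len(v) for name, v in idx.items()})
```

Fixing the seed keeps the selection reproducible, which is the same reason for saving the indices to disk once and reusing them.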

The classifier is trained to classify individual signals as belonging to a heart, adrenal, or aorta sample. We have 4 technical replicates (runs on different days) and use all of them: the model is trained on three replicates and tested on the remaining one.
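With 4 replicates, there are 4 ways to hold one out for testing. The sketch below simply enumerates those train/test assignments; the helper name `leave_one_replicate_out` is hypothetical, and we assume replicates are numbered 1 to 4 as in the file names. Whether the analysis rotates through all of them or uses a single fixed split is not stated here.

```python
def leave_one_replicate_out(num_replicates=4):
    """Yield (train_replicates, test_replicate) for every held-out replicate."""
    replicates = list(range(1, num_replicates + 1))
    for test_rep in replicates:
        train_reps = [r for r in replicates if r != test_rep]
        yield train_reps, test_rep

splits = list(leave_one_replicate_out(4))
for train_reps, test_rep in splits:
    print(f"train on replicates {train_reps}, test on replicate {test_rep}")
```

Splitting by replicate rather than by individual signal keeps signals from the same run out of both sets at once, so the test score reflects performance on a run the model has not seen.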

The nanopore was run at a 10 kHz sampling frequency (10,000 points/second), and the current was reversed every 10 seconds. Hence, the dataset consists of events from peptides that interacted with the nanopore for more than 3 seconds and less than 10 seconds (the maximum time). We keep only signals with a length between 30,000 and 100,000 points.
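The length thresholds follow directly from the sampling rate: 3 s × 10,000 points/s = 30,000 points and 10 s × 10,000 points/s = 100,000 points. A minimal sketch of such a length filter is shown below. The helper `keep_by_length` and the toy signals are made up for illustration, and treating both bounds as inclusive is our assumption.

```python
import numpy as np

SAMPLING_HZ = 10_000          # 10 kHz sampling frequency
MIN_LEN = 3 * SAMPLING_HZ     # 3 seconds  -> 30,000 points
MAX_LEN = 10 * SAMPLING_HZ    # 10 seconds -> 100,000 points

def keep_by_length(signals, min_len=MIN_LEN, max_len=MAX_LEN):
    """Return positions of signals whose length lies in [min_len, max_len].

    Inclusive bounds are an assumption made for this sketch.
    """
    return [i for i, s in enumerate(signals) if min_len <= len(s) <= max_len]

# Toy signals of different lengths (values are irrelevant here)
signals = [np.zeros(25_000), np.zeros(30_000), np.zeros(64_000), np.zeros(100_000)]
kept = keep_by_length(signals)
print(kept)  # → [1, 2, 3]
```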

In [2]:
import sys 
sys.path.append('..')
from src.constants import SAMPLES_DICT_3CLASS
from src.knn_load_data import INDICES_SAVE_PATH_3CLASS, get_data_indices, prefix_data_dir

In [3]:
SAMPLES_DICT_3CLASS = prefix_data_dir(SAMPLES_DICT_3CLASS, "/mmfs1/gscratch/ml4ml/cailinw/pore_data/")

In [4]:
print(SAMPLES_DICT_3CLASS)

{'heart': ['/mmfs1/gscratch/ml4ml/cailinw/pore_data/PorTis/segmented_peptides_raw_data/replicate1/segmented_peptides_raw_data_replicate1_heart.npy', '/mmfs1/gscratch/ml4ml/cailinw/pore_data/PorTis/segmented_peptides_raw_data/replicate2/segmented_peptides_raw_data_replicate2_heart.npy', '/mmfs1/gscratch/ml4ml/cailinw/pore_data/PorTis/segmented_peptides_raw_data/replicate3/segmented_peptides_raw_data_replicate3_heart.npy', '/mmfs1/gscratch/ml4ml/cailinw/pore_data/PorTis/segmented_peptides_raw_data/replicate4/segmented_peptides_raw_data_replicate4_heart.npy'], 'adrenal': ['/mmfs1/gscratch/ml4ml/cailinw/pore_data/PorTis/segmented_peptides_raw_data/replicate1/segmented_peptides_raw_data_replicate1_adrenal.npy', '/mmfs1/gscratch/ml4ml/cailinw/pore_data/PorTis/segmented_peptides_raw_data/replicate2/segmented_peptides_raw_data_replicate2_adrenal.npy', '/mmfs1/gscratch/ml4ml/cailinw/pore_data/PorTis/segmented_peptides_raw_data/replicate3/segmented_peptides_raw_data_replicate3_adrenal.npy', '/mm

Define the `INDICES_SAVE_PATH_3CLASS` constant in `src.knn_load_data` to set where the indices of the signals in the balanced set are saved; these indices are used in the kNN-related analyses.

In [5]:
# Get indices for each sample, such that all samples are balanced
# Use the 3 classes (heart, adrenal, aorta)
# Discard signals with length less than 30,000
indices = get_data_indices(SAMPLES_DICT_3CLASS, 30000, INDICES_SAVE_PATH_3CLASS)

class_names=['heart', 'adrenal', 'aorta']
num_classes=3
num_replicates=[4, 4, 4]
In progress...


Class heart, replicate 4/4

Class adrenal, replicate 4/4

Class aorta, replicate 4/4