In [1]:
# import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import random

#### 1. Import data
**Note:** The raw gene expression counts are originally in the format (genes × samples), where rows represent genes and columns represent samples.  
For machine learning / deep learning models, transpose the data so that each row corresponds to a sample and each column corresponds to a gene, which is the expected input format.
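The effect of the transpose can be seen on a small toy matrix. This is a minimal sketch; the gene and sample names below are hypothetical and do not come from the TCGA files.

```python
import pandas as pd

# Toy genes x samples matrix (hypothetical gene and sample names)
genes_by_samples = pd.DataFrame(
    {"sample_A": [10, 0, 5], "sample_B": [3, 7, 1]},
    index=["gene_1", "gene_2", "gene_3"],
)

# Transpose so that each row is a sample and each column is a gene
samples_by_genes = genes_by_samples.transpose()

print(genes_by_samples.shape)   # rows = genes, columns = samples
print(samples_by_genes.shape)   # rows = samples, columns = genes
```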

In [2]:
raw_counts = pd.read_csv("C:/Users/User/Documents/brca_subtype_ae/data/tcga_brca_original_data/raw_counts_original.csv", 
                         header = 0, index_col = 0).transpose()

In [3]:
sample_info = pd.read_csv("C:/Users/User/Documents/brca_subtype_ae/data/tcga_brca_original_data/sample_info_original.csv", 
                          header = 0, index_col = 0)

In [4]:
print(raw_counts.shape)
print(sample_info.shape)

(1111, 60660)
(1111, 85)


#### 2. Data cleaning

In [5]:
# check if the indexes (sample names) of both files match (output should be True)
print(raw_counts.index.equals(sample_info.index))

True


The check returned `True`, which means that the sample names (indexes) in the raw counts and sample information all match and are in the correct order. This ensures that the datasets are properly aligned and safe to use for downstream analysis or ML models.
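`Index.equals` is order-sensitive, so it returns `False` when two tables hold the same samples in a different order. The sketch below uses hypothetical toy data to show one way to realign the tables in that situation; it was not needed here because the check already returned `True`.

```python
import pandas as pd

# Toy tables with the same sample names in a different order (hypothetical data)
counts = pd.DataFrame({"gene_1": [10, 3, 8]}, index=["s1", "s2", "s3"])
info = pd.DataFrame({"patient": ["p2", "p3", "p1"]}, index=["s2", "s3", "s1"])

same_order = counts.index.equals(info.index)        # order-sensitive comparison
same_labels = set(counts.index) == set(info.index)  # ignores order

# If only the order differs, reorder the sample info to follow the counts
if same_labels and not same_order:
    info_aligned = info.loc[counts.index]
else:
    info_aligned = info

print(same_order, same_labels)
print(info_aligned.index.equals(counts.index))
```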

In [6]:
# find the number of unique patient IDs
sample_info['patient'].nunique()

1095

The number of unique patients is 1095 out of 1111 samples, which indicates that some patients contributed more than one sample.

In [7]:
# subset duplicated patients along with their first occurrences
patient_duplicates_with_first_occ = sample_info.loc[sample_info['patient'].duplicated(keep=False)]

# count of all duplicates including first occurrence
patient_duplicates_with_first_occ.shape[0]

27

To avoid introducing bias in downstream analysis or machine learning models, all 27 rows belonging to patients with more than one sample are removed, including the first occurrences. Keeping one arbitrary sample per patient could still bias the dataset if that patient's samples have inconsistent counts, so complete removal ensures that every remaining patient is represented by exactly one sample.
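The difference between `keep=False` and the default `keep="first"` is easiest to see on a toy table. This is a minimal sketch with hypothetical patient and sample IDs, not the TCGA data.

```python
import pandas as pd

# Toy sample info: P2 has two samples, P3 has three (hypothetical IDs)
toy_info = pd.DataFrame(
    {"patient": ["P1", "P2", "P2", "P3", "P3", "P3", "P4"]},
    index=["s1", "s2", "s3", "s4", "s5", "s6", "s7"],
)

# keep=False flags every row of a duplicated patient, including the first one
n_flagged_all = int(toy_info["patient"].duplicated(keep=False).sum())

# the default (keep="first") flags only the second and later occurrences
n_flagged_later = int(toy_info["patient"].duplicated().sum())

# dropping with keep=False leaves only patients that occur exactly once
unique_only = toy_info.drop_duplicates(subset=["patient"], keep=False)

print(n_flagged_all, n_flagged_later)
print(unique_only.index.tolist())
```

In the notebook's data the same logic gives 1111 - 27 = 1084 remaining samples.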

In [8]:
# remove all duplicates including first occurrences
sample_info_cleaned = sample_info.drop_duplicates(subset = ['patient'], keep = False)
raw_counts_cleaned = raw_counts.loc[sample_info_cleaned.index]

# check the new dimension
print(raw_counts_cleaned.shape)
print(sample_info_cleaned.shape)

(1084, 60660)
(1084, 85)


In [9]:
# check the count of each gender in the sample info without duplicates
sample_info_cleaned['gender'].value_counts()

gender
female    1071
male        12
Name: count, dtype: int64

The male and female samples add up to 1083, while the cleaned data contains 1084 samples, so one sample has missing gender information.  
Only female samples are kept, because 12 male samples are too few to model reliably and could bias the analysis.  
The filter below also drops the sample with missing gender, which gives a more homogeneous dataset for downstream analysis.
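By default `value_counts()` skips missing values, which is why the counts above do not add up to the number of rows. The sketch below, on hypothetical toy data, shows how `dropna=False` and `isna()` make the missing entry visible, and that an equality filter on `'female'` also excludes it.

```python
import numpy as np
import pandas as pd

# Toy sample info with one missing gender value (hypothetical data)
toy_info = pd.DataFrame(
    {"gender": ["female", "female", "male", np.nan]},
    index=["s1", "s2", "s3", "s4"],
)

counts_default = toy_info["gender"].value_counts()               # NaN not counted
counts_with_nan = toy_info["gender"].value_counts(dropna=False)  # NaN counted
n_missing = int(toy_info["gender"].isna().sum())

# NaN == 'female' evaluates to False, so the row with missing gender is dropped too
females_only = toy_info.loc[toy_info["gender"] == "female"]

print(counts_with_nan)
print(n_missing, females_only.shape[0])
```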

In [10]:
# keep only female samples
sample_info_cleaned = sample_info_cleaned.loc[sample_info_cleaned['gender'] == 'female']
raw_counts_cleaned = raw_counts_cleaned.loc[sample_info_cleaned.index]

In [11]:
# check the count of samples with missing subtype information
print(sample_info_cleaned['paper_BRCA_Subtype_PAM50'].isna().sum())

0


There are no samples with missing subtype information.

In [12]:
# check the count of each subtype
sample_info_cleaned['paper_BRCA_Subtype_PAM50'].value_counts()

paper_BRCA_Subtype_PAM50
LumA      555
LumB      209
Basal     185
Her2       82
Normal     40
Name: count, dtype: int64

Among the samples, luminal A is the largest group (555) because it is the most prevalent breast cancer subtype, followed by luminal B, basal-like, and HER2-enriched. The normal-like subtype (40 samples) is generally considered to be a tissue artifact rather than a true biological subtype, as discussed in [Parker et al., 2009](https://doi.org/10.1200/JCO.2008.18.1370). Because it is not considered a true subtype, these samples are removed.

In [13]:
# remove normal-like subtype
sample_info_cleaned = sample_info_cleaned.loc[sample_info_cleaned['paper_BRCA_Subtype_PAM50'] != 'Normal']
raw_counts_cleaned = raw_counts_cleaned.loc[sample_info_cleaned.index]

In [14]:
# check the dimensions of the cleaned dataframes
print(sample_info_cleaned.shape)
print(raw_counts_cleaned.shape)

(1031, 85)
(1031, 60660)


In [15]:
# check if the indexes of the cleaned data match (output should be true)
print(raw_counts_cleaned.index.equals(sample_info_cleaned.index))

True


The indexes of the cleaned data all match in raw counts and sample information.  
This means the datasets are properly aligned.

In [16]:
# check if there are negative values in any of the gene columns (output should be false)
print((raw_counts_cleaned < 0).any().any())

False


- If the result is `False`, it means no negative counts were found (expected for raw gene expression data).  
- If the result is `True`, it means negative counts were detected, which is unexpected and might indicate preprocessing or data issues.

In [17]:
# check if there are decimal gene expression values in any of the gene columns
print((raw_counts_cleaned % 1 != 0).any().sum())

0


This counts the gene columns that contain non-integer (decimal) values. For differential gene expression analysis using DESeq2, raw counts must be integers. The result is `0`, which means every gene has integer expression counts.
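To see what both sanity checks report when something is wrong, the sketch below runs them on a small toy matrix that deliberately contains one negative value and one decimal value. The data and the `offending_genes` helper variable are hypothetical.

```python
import pandas as pd

# Toy counts: gene_2 contains a decimal, gene_3 contains a negative value
toy_counts = pd.DataFrame(
    {"gene_1": [0, 12, 7], "gene_2": [3.0, 4.5, 1.0], "gene_3": [5, -1, 2]},
    index=["s1", "s2", "s3"],
)

# True if any entry in the whole matrix is negative
has_negative = bool((toy_counts < 0).any().any())

# Per-gene flag: does the column contain at least one non-integer value?
non_integer_mask = (toy_counts % 1 != 0).any()
n_non_integer_genes = int(non_integer_mask.sum())

# Names of the genes that fail the integer check
offending_genes = non_integer_mask[non_integer_mask].index.tolist()

print(has_negative, n_non_integer_genes, offending_genes)
```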

In [18]:
# # save cleaned data as csv
# sample_info_cleaned.to_csv('sample_info_cleaned.csv')
# raw_counts_cleaned.to_csv('raw_counts_cleaned.csv')
# tpm_counts_cleaned.to_csv('tpm_counts_cleaned.csv')
# log2_tpm_counts_cleaned.to_csv('log2_tpm_counts_cleaned.csv')

#### 3. Data splitting

**Note:** We split the dataset into 70% training and 30% testing to ensure that there are sufficient samples from the HER2-enriched subtype in the test set.  
Since HER2 is the minority class, a larger test proportion helps maintain adequate representation for evaluation.
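The reasoning can be made concrete with simple arithmetic on the class counts reported above. This is a rough sketch: the 20% column is a hypothetical alternative added only for comparison, and a stratified split yields approximately (not exactly) these numbers because scikit-learn has to allocate whole samples.

```python
import pandas as pd

# Class counts after cleaning, as printed by value_counts() in this notebook
class_counts = pd.Series({"LumA": 555, "LumB": 209, "Basal": 185, "Her2": 82})

# Approximate test-set size per class for two candidate test fractions
# (the 20% fraction is a hypothetical comparison, not used in this notebook)
expected_test = pd.DataFrame(
    {
        "test_20pct": class_counts * 0.20,
        "test_30pct": class_counts * 0.30,
    }
).round(1)

print(expected_test)
```

With a 30% test fraction the minority class contributes roughly 25 test samples instead of roughly 16.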

In [19]:
# define seed for train-test split
seed = 42

# seed Python's and NumPy's global random number generators for reproducibility
# (the split itself is made deterministic by passing random_state to train_test_split)
random.seed(seed)
np.random.seed(seed)

In [20]:
# keep only subtype column in the sample info data prior to splitting
y_subtype = sample_info_cleaned.loc[:,['paper_BRCA_Subtype_PAM50']]
y_subtype.shape

(1031, 1)

In [21]:
# split raw counts and subtype into 70% training and 30% test
# stratify split by y (subtype label) to have consistent class distribution in both training and test sets
X_raw_train, X_raw_test, y_train, y_test = train_test_split(raw_counts_cleaned, y_subtype, test_size=0.30, 
                                                            stratify=y_subtype, random_state=seed)

In [22]:
# check dimensions after splitting
print(X_raw_train.shape)
print(X_raw_test.shape)
print(y_train.shape)
print(y_test.shape)

(721, 60660)
(310, 60660)
(721, 1)
(310, 1)


In [23]:
# check class proportion in training and test sets
print("Class proportion in the training set:")
print(y_train.value_counts())

print("\nClass proportion in the test set:")
print(y_test.value_counts())

Class proportion in the training set:
paper_BRCA_Subtype_PAM50
LumA                        388
LumB                        146
Basal                       130
Her2                         57
Name: count, dtype: int64

Class proportion in the test set:
paper_BRCA_Subtype_PAM50
LumA                        167
LumB                         63
Basal                        55
Her2                         25
Name: count, dtype: int64


In [24]:
# check whether the indices match between the x and y datasets
print(X_raw_train.index.tolist() == y_train.index.tolist())
print(X_raw_test.index.tolist() == y_test.index.tolist())

True
True


We verified that the sample indices match between the X (raw counts) and y (subtype label) datasets for both training and test sets.  
Both checks returned `True`, confirming that the datasets are properly aligned.

In [25]:
# # save training and test files
# X_raw_train.to_csv('X_raw_counts_train.csv')
# X_raw_test.to_csv('X_raw_counts_test.csv')
# y_train.to_csv('y_subtype_train.csv')
# y_test.to_csv('y_subtype_test.csv')