# Preprocessing

- Python version used: `3.13.1`
- Expected folder structure:
  - Unhealthy data: `data/raw/unhealthy/*/*.tsv`
  - Healthy data: `data/raw/healthy.gct`

## Data inputs, processing steps and outputs

Inputs:
- `data/raw/unhealthy/*/*.tsv`: individual unhealthy sample files (columns read: `gene_id`, `gene_name`, `tpm_unstranded`; the last holds the TPM values).
- `data/raw/healthy.gct`: healthy matrix file (TPM values; columns `Name`, `Description`, then one column per sample).

Processing summary:
- Unhealthy .tsv files are read per-sample, a gene key is created as `gene_id|gene_name`, and all samples are combined into `data/processed/unhealthy_matrix.csv`.
- The healthy GCT is parsed and converted to `data/processed/healthy_matrix.csv` (gene_id, gene_name, sample columns).
- Both matrices are aligned on common genes producing `data/processed/unhealthy_aligned.csv` and `data/processed/healthy_aligned.csv` (same gene rows, sorted).
- Aligned matrices are transposed and concatenated into a single per-patient table `data/processed/combined_labeled.csv` with columns: `patient_id`, `healthy` (1 = healthy, 0 = unhealthy), and gene expression columns.
- Gene expression values are standardized separately for the healthy and unhealthy groups using `sklearn.preprocessing.StandardScaler` and saved as `data/processed/combined_labeled_standardized.csv`.

Outputs (written files):
- `data/processed/unhealthy_matrix.csv`
- `data/processed/healthy_matrix.csv`
- `data/processed/unhealthy_aligned.csv`
- `data/processed/healthy_aligned.csv`
- `data/processed/combined_labeled.csv`
- `data/processed/combined_labeled_standardized.csv`

Notes:
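- TPM values are standardized per group (healthy vs. unhealthy), so each gene has zero mean and unit variance within its own group. As a consequence, differences in a gene's average level between the two groups are not retained in the standardized file.
- Patient IDs are derived from file names (unhealthy) and column headers (healthy); check the naming if duplicate IDs occur.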
- Keep raw files untouched; processed CSVs are intended for downstream modeling.
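
The duplicate check mentioned in the notes could look like the following minimal sketch. The helper name `find_duplicate_ids` and the example IDs are hypothetical and not part of the original notebook.

```python
from collections import Counter


def find_duplicate_ids(patient_ids):
    """Return the IDs that occur more than once, sorted (hypothetical helper)."""
    counts = Counter(patient_ids)
    return sorted(pid for pid, n in counts.items() if n > 1)


# Made-up IDs for illustration only.
example_ids = ["sample_a", "sample_b", "sample_a", "sample_c"]
duplicates = find_duplicate_ids(example_ids)
print(duplicates)
```

In the notebook, the same helper could be applied to the `patient_id` column of the combined table before it is written to disk.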

## Prepare all unhealthy data
We read all the files inside the `unhealthy` directory and convert all the `.tsv` data to a unified `.csv` file.

In [8]:
import os
import pandas as pd
from glob import glob

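# Collect every per-sample .tsv file, one directory level below "unhealthy"
all_files = glob(os.path.join("../../data/raw/unhealthy/", "*", "*.tsv"))

# Maps sample name -> Series of TPM values indexed by gene key
sample_dict = {}

for file_path in all_files:
    # Sample name is the file name without its .tsv extension
    sample_name = os.path.splitext(os.path.basename(file_path))[0]
    try:
        # skiprows=1 skips the first line of the file so the next line is used as the header
        df = pd.read_csv(
            file_path,
            sep="\t",
            skiprows=1,
            usecols=["gene_id", "gene_name", "tpm_unstranded"],
        )
        # Rows with a missing gene_id or gene_name end up with a NaN key
        df["gene_key"] = df["gene_id"] + "|" + df["gene_name"]
        df.set_index("gene_key", inplace=True)
        sample_dict[sample_name] = df["tpm_unstranded"]
    except Exception as e:
        print(f"Error processing {file_path}: {e}")

# One column per sample, one row per gene key
combined_df = pd.DataFrame(sample_dict)

# Split the key back into a (gene_id, gene_name) MultiIndex
combined_df.index.name = "gene_key"
combined_df.reset_index(inplace=True)
combined_df[["gene_id", "gene_name"]] = combined_df["gene_key"].str.split("|", expand=True)
combined_df.drop(columns="gene_key", inplace=True)
combined_df.set_index(["gene_id", "gene_name"], inplace=True)

# Sort rows by gene and columns by sample name for a stable layout
combined_df.sort_index(inplace=True)
combined_df = combined_df[sorted(combined_df.columns)]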

combined_df.to_csv("../../data/processed/unhealthy_matrix.csv")


In [24]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 60664 entries, ('ENSG00000000003.15', 'TSPAN6') to (nan, nan)
Columns: 1253 entries, 000f90b3-7383-4887-af84-73f231c03f39.rna_seq.augmented_star_gene_counts to ff86d680-8a35-4f19-a850-eca08dfd3d48.rna_seq.augmented_star_gene_counts
dtypes: float64(1253)
memory usage: 585.3+ MB
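
The `(nan, nan)` entry at the end of the index above indicates that at least one row had a missing `gene_id` or `gene_name`, because concatenating a string with a missing value yields a missing key. The following minimal sketch shows the effect on a made-up table (not the real data) and one possible way to drop such rows before building the key. Whether dropping them is appropriate for the real files is an assumption that should be checked against the raw data.

```python
import pandas as pd

# Made-up rows mimicking one sample file; the last row has no gene_name.
df = pd.DataFrame(
    {
        "gene_id": ["ENSG_A.1", "ENSG_B.2", "summary_row"],
        "gene_name": ["GENE_A", "GENE_B", None],
        "tpm_unstranded": [1.5, 0.0, None],
    }
)

# Without filtering, the key of the last row is missing.
raw_key = df["gene_id"] + "|" + df["gene_name"]

# Drop rows lacking either identifier, then build the key.
clean = df.dropna(subset=["gene_id", "gene_name"]).copy()
clean["gene_key"] = clean["gene_id"] + "|" + clean["gene_name"]
clean = clean.set_index("gene_key")

print(raw_key.isna().sum())  # number of rows whose key would be missing
print(clean.index.tolist())
```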


## Preprocess healthy data (.gct)

In [14]:
import pandas as pd

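# Skip the first two lines of the GCT file; the third line is the column header
gct_df = pd.read_csv("../../data/raw/healthy.gct", sep='\t', skiprows=2)

# "Name" is used as the gene ID and "Description" as the gene name
gene_id = gct_df["Name"]
gene_name = gct_df["Description"]

# Keep only the sample columns, then re-attach the identifiers under the
# same column names used in the unhealthy matrix
expr_df = gct_df.drop(columns=["Name", "Description"])

expr_df.insert(0, "gene_name", gene_name)
expr_df.insert(0, "gene_id", gene_id)

expr_df.to_csv("../../data/processed/healthy_matrix.csv", index=False)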


In [23]:
expr_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56200 entries, 0 to 56199
Columns: 17384 entries, gene_id to GTEX-ZZPU-2726-SM-5NQ8O
dtypes: int64(17382), object(2)
memory usage: 7.3+ GB
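
The `skiprows=2` argument fits the GCT layout, in which a version line and a dimensions line precede the tab-separated table. As a minimal sketch, those two lines can also be read and used to validate the parsed table. The helper name `read_gct` and the tiny in-memory file are made up for illustration.

```python
import io

import pandas as pd

# Tiny made-up GCT content: version line, "<rows>\t<columns>" line, then the table.
gct_text = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "ENSG_A.1\tGENE_A\t0\t5\t2\n"
    "ENSG_B.2\tGENE_B\t7\t1\t0\n"
)


def read_gct(handle):
    """Read a GCT table and check it against its declared dimensions (hypothetical helper)."""
    version = handle.readline().strip()
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    table = pd.read_csv(handle, sep="\t")
    n_samples = table.shape[1] - 2  # exclude Name and Description
    if (table.shape[0], n_samples) != (n_rows, n_cols):
        raise ValueError("GCT dimensions do not match the header line")
    return version, table


version, table = read_gct(io.StringIO(gct_text))
print(version, table.shape)
```

For the real file, the same function could be called on an opened file handle instead of the `io.StringIO` object.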


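## Align the matrices
Only genes present in both matrices are kept. Genes are matched on the exact `(gene_id, gene_name)` pair, including the version suffix of the gene ID.
- Healthy matrix: 56,200 genes
- Unhealthy matrix: 60,664 genes
- Common to both: 18,858 genes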

In [16]:
import pandas as pd

# Load both matrices with (gene_id, gene_name) as a MultiIndex
unhealthy_df = pd.read_csv("../../data/processed/unhealthy_matrix.csv", index_col=["gene_id", "gene_name"])
healthy_df = pd.read_csv("../../data/processed/healthy_matrix.csv", index_col=["gene_id", "gene_name"])

# Keep only genes whose exact (gene_id, gene_name) pair occurs in both matrices
common_genes = unhealthy_df.index.intersection(healthy_df.index)

# Same gene rows in the same sorted order for both groups
unhealthy_common = unhealthy_df.loc[common_genes].sort_index()
healthy_common = healthy_df.loc[common_genes].sort_index()

assert unhealthy_common.shape[0] == healthy_common.shape[0], "Row mismatch after filtering."

unhealthy_common.to_csv("../../data/processed/unhealthy_aligned.csv")
healthy_common.to_csv("../../data/processed/healthy_aligned.csv")


In [20]:
unhealthy_common.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 18858 entries, ('ENSG00000001167.14', 'NFYA') to ('ENSG00000284596.1', 'MIR4467')
Columns: 1253 entries, 000f90b3-7383-4887-af84-73f231c03f39.rna_seq.augmented_star_gene_counts to ff86d680-8a35-4f19-a850-eca08dfd3d48.rna_seq.augmented_star_gene_counts
dtypes: float64(1253)
memory usage: 185.4+ MB


In [22]:
healthy_common.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 18858 entries, ('ENSG00000001167.14', 'NFYA') to ('ENSG00000284596.1', 'MIR4467')
Columns: 17382 entries, GTEX-1117F-0226-SM-5GZZ7 to GTEX-ZZPU-2726-SM-5NQ8O
dtypes: int64(17382)
memory usage: 2.4+ GB
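
Because the intersection requires an exact match on both `gene_id` and `gene_name`, a gene whose ID differs only in its version suffix (the `.14` in `ENSG00000001167.14`) is not counted as common. Whether this accounts for part of the reduction in gene count here is an assumption that the notebook does not check. The sketch below, on made-up IDs, compares the exact match with a match on unversioned IDs; `strip_version` is a hypothetical helper.

```python
import pandas as pd

# Made-up indexes: the second gene differs only in its version suffix.
unhealthy_idx = pd.MultiIndex.from_tuples(
    [("ENSG01.3", "GENE_A"), ("ENSG02.7", "GENE_B"), ("ENSG03.1", "GENE_C")],
    names=["gene_id", "gene_name"],
)
healthy_idx = pd.MultiIndex.from_tuples(
    [("ENSG01.3", "GENE_A"), ("ENSG02.5", "GENE_B")],
    names=["gene_id", "gene_name"],
)

# Exact match on (gene_id, gene_name), as in the notebook.
exact = unhealthy_idx.intersection(healthy_idx)


def strip_version(index):
    """Return gene IDs without a trailing '.<digits>' suffix (hypothetical helper)."""
    ids = pd.Index(index.get_level_values("gene_id"))
    return ids.str.replace(r"\.\d+$", "", regex=True)


# Match on the unversioned gene ID only.
unversioned = strip_version(unhealthy_idx).intersection(strip_version(healthy_idx))

print(len(exact), len(unversioned))
```

Matching on unversioned IDs treats different annotation versions of a gene as the same gene, which may or may not be acceptable for the downstream analysis.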


## Combine into one file (1,253 unhealthy and 17,382 healthy samples)

In [26]:
import pandas as pd

# gene_id becomes the index; gene_name stays a regular column and turns into a row after the transpose
healthy = pd.read_csv("../../data/processed/healthy_aligned.csv", index_col=["gene_id"])
unhealthy = pd.read_csv("../../data/processed/unhealthy_aligned.csv", index_col=["gene_id"])

# Transpose so that each row is one patient and each column is one gene
healthy_patients_T = healthy.T
unhealthy_patients_T = unhealthy.T

# Label the groups: 1 = healthy, 0 = unhealthy
healthy_patients_T["healthy"] = 1
unhealthy_patients_T["healthy"] = 0

# Keep the sample name as an explicit column
healthy_patients_T["patient_id"] = healthy_patients_T.index
unhealthy_patients_T["patient_id"] = unhealthy_patients_T.index

# Stack both groups into one table
combined = pd.concat([healthy_patients_T, unhealthy_patients_T], axis=0)

# Put the ID and label first, followed by the gene columns
gene_cols = [col for col in combined.columns if col not in ["patient_id", "healthy"]]
cols = ["patient_id", "healthy"] + gene_cols
combined = combined[cols]
combined = combined[~combined["patient_id"].str.startswith("gene_name")]  # Remove the transposed gene_name rows

combined.to_csv("../../data/processed/combined_labeled.csv", index=False)


In [27]:
combined.info()

<class 'pandas.core.frame.DataFrame'>
Index: 18635 entries, GTEX-1117F-0226-SM-5GZZ7 to ff86d680-8a35-4f19-a850-eca08dfd3d48.rna_seq.augmented_star_gene_counts
Columns: 18860 entries, patient_id to ENSG00000284596.1
dtypes: int64(1), object(18859)
memory usage: 2.6+ GB
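
The `object(18859)` dtypes above arise because the `gene_name` column is transposed together with the numeric values, so every gene column temporarily holds a string. A minimal sketch of an alternative, on made-up data, drops `gene_name` before transposing so the gene columns stay numeric. The function name `to_patient_table` is hypothetical.

```python
import pandas as pd

# Made-up aligned matrix: genes as rows, samples as columns.
aligned = pd.DataFrame(
    {
        "gene_id": ["ENSG_A.1", "ENSG_B.2"],
        "gene_name": ["GENE_A", "GENE_B"],
        "sample_1": [1.0, 2.0],
        "sample_2": [3.0, 4.0],
    }
)


def to_patient_table(matrix, healthy_label):
    """Transpose a gene-by-sample matrix into a labeled sample-by-gene table."""
    # Drop the string column first so the transposed values stay numeric.
    numeric = matrix.drop(columns="gene_name").set_index("gene_id")
    patients = numeric.T.copy()
    patients.columns.name = None
    patients.insert(0, "healthy", healthy_label)
    patients.insert(0, "patient_id", patients.index)
    return patients.reset_index(drop=True)


table = to_patient_table(aligned, healthy_label=0)
print(table.dtypes)
```

With this approach no `gene_name` row has to be filtered out afterwards, since it never enters the transposed table.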


## Standardize the values
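Both datasets are expected to report gene expression as TPM (transcripts per million). Note that the healthy matrix was parsed as `int64` (see the `expr_df.info()` output above), which is unusual for TPM values, so it is worth confirming that `healthy.gct` contains TPM rather than raw read counts.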

In [28]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("../../data/processed/combined_labeled.csv")

# Identify gene columns
gene_cols = [c for c in df.columns if c not in ("patient_id", "healthy")]

# Apply separate standard scalers for healthy and unhealthy data
healthy_mask = df["healthy"] == 1
unhealthy_mask = df["healthy"] == 0

# Create two separate scalers
healthy_scaler = StandardScaler()
unhealthy_scaler = StandardScaler()

# Apply scaling separately to each group
df.loc[healthy_mask, gene_cols] = healthy_scaler.fit_transform(df.loc[healthy_mask, gene_cols])
df.loc[unhealthy_mask, gene_cols] = unhealthy_scaler.fit_transform(df.loc[unhealthy_mask, gene_cols])

# Save the standardized dataset
df.to_csv("../../data/processed/combined_labeled_standardized.csv", index=False)

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18635 entries, 0 to 18634
Columns: 18860 entries, patient_id to ENSG00000284596.1
dtypes: float64(18858), int64(1), object(1)
memory usage: 2.6+ GB
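
To illustrate what per-group standardization does, the following minimal sketch scales a single made-up gene within each group. It uses plain pandas rather than `StandardScaler`; dividing by the population standard deviation (`ddof=0`) follows the same definition that `StandardScaler` uses. The column names and values are invented for the example.

```python
import pandas as pd

# Made-up values: one gene, two groups with very different expression levels.
df = pd.DataFrame(
    {
        "healthy": [1, 1, 1, 0, 0, 0],
        "gene_x": [10.0, 12.0, 14.0, 100.0, 120.0, 140.0],
    }
)


def zscore(values):
    """Standardize with the population standard deviation (ddof=0)."""
    return (values - values.mean()) / values.std(ddof=0)


# Standardize gene_x within each group separately.
df["gene_x_scaled"] = df.groupby("healthy")["gene_x"].transform(zscore)

group_means = df.groupby("healthy")["gene_x_scaled"].mean()
print(group_means)
```

After scaling, both groups are centered at zero, so the tenfold difference in level between the two groups is no longer visible in the scaled column. This is worth keeping in mind if the standardized file is used to separate healthy from unhealthy samples.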
