# PRESS Geneset Reduction

We would like to reduce PRESS to a smaller set of genes that are most informative for the phenotype of interest.

**Potential Feature Reduction Methods**:
- RFE-LASSO
- RFE-RF
- RFE-Adaboost
- [Sequential Feature Selection](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html#sklearn.feature_selection.SequentialFeatureSelector)
- [Millipede](https://millipede.readthedocs.io/en/latest/getting_started.html)
- [Boruta](https://cran.r-project.org/web/packages/Boruta/vignettes/inahurry.pdf)
- [Ranked MSD](https://github.com/researher/Ranked_MSD/blob/master/Code_Ranked_MSD_and_F_Best.R)
- [Relief Algorithms](https://gitlab.com/moongoal/sklearn-relief)
- [Wx](https://github.com/deargen/DearWXpub)
- [Feature selection methods](https://github.com/MASOUD-AJUMS/Breast-cancer-prediction-/tree/main)
- Pick the genes with the highest explained variance
- Pick hub genes from the gene-gene correlation structure
- Take the overlap of the DUKE and PACE log2FC results
- Keep genes with a high baseMean
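
The last two ideas in the list (cross-cohort log2FC overlap and a baseMean filter) are simple enough to sketch directly. The block below is only an illustration under assumptions: the column names `log2FC` and `baseMean`, the thresholds, the helper name `overlap_by_log2fc`, and the toy tables are all hypothetical and are not taken from the project's data.

```python
import pandas as pd


def overlap_by_log2fc(de_a, de_b, min_abs_lfc=1.0, min_base_mean=0.0):
    """Genes that pass both thresholds in both tables and change in the same direction.

    Hypothetical helper: assumes each table is indexed by gene and has
    'log2FC' and 'baseMean' columns.
    """
    shared = de_a.index.intersection(de_b.index)
    a, b = de_a.loc[shared], de_b.loc[shared]
    same_sign = (a["log2FC"] * b["log2FC"]) > 0
    strong = (a["log2FC"].abs() >= min_abs_lfc) & (b["log2FC"].abs() >= min_abs_lfc)
    expressed = (a["baseMean"] >= min_base_mean) & (b["baseMean"] >= min_base_mean)
    keep = (same_sign & strong & expressed).to_numpy()
    return shared[keep].tolist()


# Toy differential-expression tables (made-up values, for illustration only)
genes = ["G1", "G2", "G3", "G4"]
pace_de = pd.DataFrame({"log2FC": [1.5, -2.0, 0.2, 1.1], "baseMean": [200, 50, 500, 5]}, index=genes)
duke_de = pd.DataFrame({"log2FC": [1.2, 1.8, 0.3, 1.4], "baseMean": [150, 60, 450, 8]}, index=genes)

selected = overlap_by_log2fc(pace_de, duke_de, min_abs_lfc=1.0, min_base_mean=10)
print(selected)
```

In this toy example only `G1` survives: `G2` changes direction between cohorts, `G3` has a weak effect, and `G4` is lowly expressed.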

# 01 Data Preparation

We will start by loading the data and preparing it for the feature reduction methods. Just to summarise:
- We have 3 datasets: PACE, DUKE, and CHORD
- We are starting with 451 features (genes)

The data preparation itself is implemented in the `src` package (`src.data_functions`), which is imported below.
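
The internals of `dfs.load_data` are not shown in this notebook. As a rough idea of what such a loader might do, here is a minimal sketch that reads a features table and a labels table and aligns them on sample ID. The function name `load_data_sketch`, the CSV layout (sample IDs in the first column, a single label column), and the in-memory toy CSVs are assumptions for illustration, not the project's actual implementation.

```python
import io

import pandas as pd


def load_data_sketch(features_src, labels_src):
    """Hypothetical stand-in for dfs.load_data: read both tables and align on sample ID."""
    X = pd.read_csv(features_src, index_col=0)
    y = pd.read_csv(labels_src, index_col=0).iloc[:, 0]
    # Keep only samples present in both tables, in the same order
    common = X.index.intersection(y.index)
    return X.loc[common], y.loc[common]


# Toy CSVs held in memory so the sketch runs without any files on disk
features_csv = io.StringIO("sample,GENE_A,GENE_B\ns1,1.0,2.0\ns2,3.0,4.0\ns3,5.0,6.0\n")
labels_csv = io.StringIO("sample,label\ns2,1\ns1,0\ns3,1\n")

X_demo, y_demo = load_data_sketch(features_csv, labels_csv)
print(X_demo.shape)
```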

In [8]:
# Standard library and third-party imports
import sys
import os
from MattTools import utils
from joblib import dump, load
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Make the project root importable so the local src package resolves
sys.path.append('..')
import src.data_functions as dfs
import src.feature_selection as fs

utils.hide_warnings()
utils.set_random_seed(420)

# Load the PACE features and labels
X, y = dfs.load_data('../data/clean/pace/features.csv', '../data/clean/pace/labels.csv')

# Standardise each gene to zero mean and unit variance, keeping the DataFrame labels
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

print('Original shape:', X.shape)

Setting random seed to 420 for reproducibility.
Original shape: (84, 451)


(451,)

In [15]:
# Score the PACE samples with the fitted press451 model, then standardise the scores
press451 = load('../models/jobs/press451.joblib')
pred_prob = press451.predict_proba(X)[:, 1]  # probability of the positive class
scaler = StandardScaler()  # note: this rebinds the name used for the feature scaler above
scores = pd.DataFrame(
    scaler.fit_transform(pred_prob.reshape(-1, 1)),
    columns=['score'],
    index=X.index
)
# Keep the raw predicted probability alongside the standardised score
scores['pred_prob'] = pred_prob


# 02 Geneset Selection Methods

## Recursive Feature Elimination (RFE)
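
RFE repeatedly fits an estimator, ranks features by the fitted coefficients or importances, and drops the weakest ones until the requested number remains. The project's `fs.RFESelector` wrapper is not shown here, so the sketch below uses scikit-learn's `RFE` directly on synthetic data. The dataset, `n_features_to_select=10`, and `step=5` are illustrative assumptions, not the settings used for PRESS.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the expression matrix (not the PACE data)
X_demo, y_demo = make_classification(
    n_samples=84, n_features=40, n_informative=5, random_state=0
)

# Drop 5 features per round until 10 remain
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    step=5,
)
rfe.fit(X_demo, y_demo)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features
selected_idx = [i for i, keep in enumerate(rfe.support_) if keep]
print(len(selected_idx))
```

Any estimator exposing `coef_` or `feature_importances_` can be swapped in, which is why the same pattern works for the logistic regression, random forest, and AdaBoost cells below.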


In [None]:
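from sklearn.linear_model import LogisticRegression
# Note: LogisticRegression() uses an L2 penalty by default, so this is not a
# LASSO fit; an L1 penalty would have to be requested explicitly.
rfe_selector = fs.RFESelector(LogisticRegression())

rfe_selector.fit(X, y)
rfe_selector.plot_features()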

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfe_selector = fs.RFESelector(RandomForestClassifier())

rfe_selector.fit(X, y)
rfe_selector.plot_features()

In [None]:
from sklearn.ensemble import AdaBoostClassifier
rfe_selector = fs.RFESelector(AdaBoostClassifier())

rfe_selector.fit(X, y)
rfe_selector.plot_features()

## Sequential Feature Selection

In [None]:
# # Try sequential selection using the press451 model itself as the estimator
# press451 = load('../models/jobs/press451.joblib')

# SeqSelector = fs.SFSSelector(press451, 0.05)

# SeqSelector.fit(X, y)
# SeqSelector.plot_results()

# # Too computationally expensive to run in the notebook, so this was moved to .py scripts
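
To make the cost concrete: forward sequential selection refits the estimator once per remaining candidate feature, per cross-validation fold, at every step. The sketch below runs scikit-learn's `SequentialFeatureSelector` on a small synthetic dataset and counts the fits a full-size run would need. The helper `n_model_fits`, the synthetic data, and the example target of 50 features with 5 folds are assumptions for illustration; the behaviour of `fs.SFSSelector` is not shown in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression


def n_model_fits(n_features, n_select, cv):
    """Cross-validated fits needed by forward selection (hypothetical helper)."""
    return cv * sum(n_features - i for i in range(n_select))


# Small synthetic problem so the sketch runs quickly (not the PACE data)
X_demo, y_demo = make_classification(
    n_samples=84, n_features=12, n_informative=3, random_state=0
)
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    cv=3,
)
sfs.fit(X_demo, y_demo)
print(int(sfs.get_support().sum()))

# Fit counts: the toy run above versus 451 genes reduced to 50 with 5 folds
print(n_model_fits(12, 3, 3))
print(n_model_fits(451, 50, 5))
```

With 451 starting genes the fit count reaches six figures even for a modest target size, and each fit would be a full PRESS model rather than a small logistic regression.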

## Millipede