# **Neighborhood Components Analysis (NCA) Feature Importance**

## **What is NCA?**
**Neighborhood Components Analysis (NCA)** is a supervised metric-learning and dimensionality reduction technique that learns a linear transformation of the features to improve nearest neighbor classification. It reshapes the feature space so that samples with the same class label end up closer together than samples with different labels.
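
As a minimal sketch of what "learning a transformation" means in practice, the snippet below fits scikit-learn's `NeighborhoodComponentsAnalysis` on synthetic data. The dataset, the `*_demo` names, and the choice of `n_components=2` are assumptions made for illustration only; they are not part of this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import NeighborhoodComponentsAnalysis

# Synthetic stand-in data (hypothetical): 120 samples, 6 features, 2 classes
X_demo, y_demo = make_classification(
    n_samples=120, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Learn a linear map from 6 input features to 2 output dimensions
nca_demo = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
X_embedded = nca_demo.fit_transform(X_demo, y_demo)

# components_ holds the learned matrix: one row per output dimension,
# one column per input feature
print(nca_demo.components_.shape)  # (2, 6)
print(X_embedded.shape)            # (120, 2)
```

Each column of `components_` corresponds to one input feature, which is what makes a per-feature score possible later on.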

Unlike filter-style feature selection methods that rank features with univariate statistical tests, **NCA optimizes a smooth estimate of leave-one-out nearest neighbor accuracy**. The weights of the learned transformation can then serve as a heuristic for identifying the most relevant features in a dataset.
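
For reference, the objective is usually written as follows. The symbols are notation introduced here, not taken from the notebook: $L$ is the learned linear transformation, $x_i$ is sample $i$, and $C_i$ is the set of samples that share the class of $x_i$.

$$
p_{ij} = \frac{\exp\left(-\lVert L x_i - L x_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert L x_i - L x_k \rVert^2\right)}, \qquad p_{ii} = 0
$$

$$
\max_{L} \; \sum_{i} \sum_{j \in C_i} p_{ij}
$$

Here $p_{ij}$ is the probability that sample $i$ picks sample $j$ as its neighbor, so the sum is the expected number of samples that would be classified correctly by a stochastic nearest neighbor rule.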

[NCA in scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NeighborhoodComponentsAnalysis.html)

[Python Examples](https://github.com/erlendd/Neighborhood-Components-Analysis-in-Python/blob/master/Neighborhood%20Components%20Analysis.ipynb)

---

## **How to Interpret the Results?**
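After applying NCA, each feature (miRNA) receives an **importance score** for each classification task. In this notebook the score is the sum of the absolute weights that the learned transformation assigns to that feature, so a larger value means the feature contributes more to the learned embedding. The three miRNAs with the highest **NCA_General** scores are: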

| **Feature**      | **NCA_General** | **NCA_Stage** | **NCA_Subtype** |
|-----------------|--------------|------------|------------|
| hsa-mir-937    | 5.772509     | 2.329354   | 1.725681   |
| hsa-mir-3617   | 5.235266     | 2.440621   | 2.936686   |
| hsa-mir-6083   | 4.567811     | 1.000007   | 1.789992   |

### **Key Observations**
1. **General Classification (Cancer vs. No Cancer)**
   - **hsa-mir-937** has the highest score (**5.77**), making it the top-ranked miRNA for detecting cancer presence under this metric.
   - Features with high **NCA_General** values are candidate biomarkers for distinguishing cancerous from non-cancerous samples.

2. **Stage Classification (Early vs. Late)**
   - Among the three miRNAs shown, **hsa-mir-3617** scores slightly higher (**2.44**) than hsa-mir-937 (**2.33**) for distinguishing lung cancer stages.
   - This suggests that the miRNAs most informative for detecting cancer are **not necessarily the most informative for staging**.

3. **Subtype Classification (Cancer Type)**
   - Among the three miRNAs shown, **hsa-mir-3617** has the highest **NCA_Subtype** value (**2.94**), indicating its importance in differentiating lung cancer subtypes.
   - Features with high **NCA_Subtype** values may be useful for **precision medicine approaches**, tailoring treatments based on subtype.
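
The comparisons above can be reproduced from the three rows in the table. This is a small sketch using pandas; the `scores` and `top_per_task` names are illustrative, and only the three miRNAs shown in the table are included, so it says nothing about features outside that table.

```python
import pandas as pd

# Scores copied from the table above (three miRNAs only)
scores = pd.DataFrame(
    {
        "NCA_General": [5.772509, 5.235266, 4.567811],
        "NCA_Stage": [2.329354, 2.440621, 1.000007],
        "NCA_Subtype": [1.725681, 2.936686, 1.789992],
    },
    index=["hsa-mir-937", "hsa-mir-3617", "hsa-mir-6083"],
)

# For each task (column), find the miRNA with the largest score
top_per_task = scores.idxmax()
print(top_per_task)
```

`idxmax` returns the row label of the largest value in each column, which here is hsa-mir-937 for `NCA_General` and hsa-mir-3617 for both `NCA_Stage` and `NCA_Subtype`.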


In [8]:
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.preprocessing import LabelEncoder, StandardScaler
import pandas as pd
import numpy as np
import os

# === CONFIGURABLE PARAMETERS ===
SCALE_DATA = True
RANDOM_STATE = 42
PREPROCESSED_FILE = "../processed_data/preprocessed_miRNA_data.csv"

In [9]:
# === STEP 1: DATA PREPROCESSING ===
if os.path.exists(PREPROCESSED_FILE):
    print("Loading preprocessed data...")
    data_scaled = pd.read_csv(PREPROCESSED_FILE)
    
    # Extract features and targets
    X = data_scaled.drop(columns=['stage', 'subtype', 'general'])
    y_general = data_scaled['general']
    y_stage_encoded = data_scaled['stage']
    y_subtype_encoded = data_scaled['subtype']
    
else:
    print("Preprocessing data for the first time...")
    
    # Load raw data
    file_path = "miRNA_stage_subtype.csv"
    data = pd.read_csv(file_path)

    # Create 'General' Classification
    data['general'] = (data['stage'] > 0).astype(int)

    # Separate features and targets
    X = data.drop(columns=['stage', 'subtype', 'general'])
    y_general = data['general']
    y_stage = data['stage']
    y_subtype = data['subtype']

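    # Scale features (if enabled), keeping the original row index so the
    # target columns assigned below stay aligned with the feature rows
    if SCALE_DATA:
        scaler = StandardScaler()
        X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)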

    # Encode categorical targets
    le_stage = LabelEncoder()
    y_stage_encoded = le_stage.fit_transform(y_stage)

    le_subtype = LabelEncoder()
    y_subtype_encoded = le_subtype.fit_transform(y_subtype)

    # Save preprocessed data (create the output directory if it is missing)
    data_scaled = X.copy()
    data_scaled['general'] = y_general
    data_scaled['stage'] = y_stage_encoded
    data_scaled['subtype'] = y_subtype_encoded
    os.makedirs(os.path.dirname(PREPROCESSED_FILE) or ".", exist_ok=True)
    data_scaled.to_csv(PREPROCESSED_FILE, index=False)
    print(f"Preprocessed data saved at: {PREPROCESSED_FILE}")


Loading preprocessed data...
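
To make the preprocessing step concrete, here is a toy sketch of the same two operations, scaling features and encoding a categorical target. The column names and the `type_a` / `type_b` labels are hypothetical placeholders, not values from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data: two feature columns and one categorical target
toy = pd.DataFrame(
    {
        "mir_a": [1.0, 2.0, 3.0, 4.0],
        "mir_b": [10.0, 10.0, 30.0, 30.0],
        "subtype": ["type_b", "type_a", "type_b", "type_a"],
    }
)

features = toy[["mir_a", "mir_b"]]

# StandardScaler centers each column to mean 0 and scales it to unit variance
scaled = pd.DataFrame(
    StandardScaler().fit_transform(features),
    columns=features.columns,
    index=features.index,
)

# LabelEncoder maps the sorted class names to integers 0..n_classes-1
encoder = LabelEncoder()
labels = encoder.fit_transform(toy["subtype"])

print(np.allclose(scaled.mean(), 0.0))          # True
print(list(encoder.classes_), labels.tolist())  # ['type_a', 'type_b'] [1, 0, 1, 0]
```

Scaling matters for NCA because the method is distance-based: a feature measured on a much larger scale would otherwise dominate the distances between samples.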


In [11]:
# === STEP 2: NCA-BASED FEATURE ANALYSIS ===
def calculate_nca_importances(X, y):
    """Score each feature by the total absolute weight NCA assigns to it."""
    nca = NeighborhoodComponentsAnalysis(random_state=RANDOM_STATE)
    nca.fit(X, y)
    # components_ has shape (n_components, n_features); summing absolute
    # values down each column gives one non-negative score per feature
    return np.abs(nca.components_).sum(axis=0)

# Calculate NCA importances for each target
nca_general = calculate_nca_importances(X, y_general)
nca_stage = calculate_nca_importances(X, y_stage_encoded)
nca_subtype = calculate_nca_importances(X, y_subtype_encoded)

# Create results DataFrame
nca_results_df = pd.DataFrame({
    'Feature': X.columns,
    'NCA_General': nca_general,
    'NCA_Stage': nca_stage,
    'NCA_Subtype': nca_subtype
})

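# Sort features by NCA importance: NCA_General is the primary key,
# NCA_Stage and NCA_Subtype only break ties
nca_results_sorted = nca_results_df.sort_values(
    by=['NCA_General', 'NCA_Stage', 'NCA_Subtype'], 
    ascending=False
)

nca_results_sorted.to_csv("neighborhood_components_analysis.csv", index=False)

# Display the first 10 rows of the sorted table for each target.
# Note: all three views share the NCA_General ordering, so they list the same
# ten features; the Stage and Subtype columns are not ranked on their own.
top_features_general = nca_results_sorted[['Feature', 'NCA_General']].head(10)
top_features_stage = nca_results_sorted[['Feature', 'NCA_Stage']].head(10)
top_features_subtype = nca_results_sorted[['Feature', 'NCA_Subtype']].head(10)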

print("\nTop 10 Features (NCA Importance) for 'General' Classification:")
print(top_features_general)

print("\nTop 10 Features (NCA Importance) for 'Stage' Classification:")
print(top_features_stage)

print("\nTop 10 Features (NCA Importance) for 'Subtype' Classification:")
print(top_features_subtype)




Top 10 Features (NCA Importance) for 'General' Classification:
           Feature  NCA_General
1863   hsa-mir-937     5.772509
515   hsa-mir-3617     5.235266
1421  hsa-mir-6083     4.567811
727   hsa-mir-4302     4.567811
1685  hsa-mir-6860     4.467054
13     hsa-mir-100     4.411412
1021  hsa-mir-4739     4.273464
1321  hsa-mir-5590     4.223181
1605  hsa-mir-6783     3.949810
1657  hsa-mir-6835     3.843756

Top 10 Features (NCA Importance) for 'Stage' Classification:
           Feature  NCA_Stage
1863   hsa-mir-937   2.329354
515   hsa-mir-3617   2.440621
1421  hsa-mir-6083   1.000007
727   hsa-mir-4302   1.000007
1685  hsa-mir-6860   2.392775
13     hsa-mir-100   3.183818
1021  hsa-mir-4739   2.410775
1321  hsa-mir-5590   2.872625
1605  hsa-mir-6783   2.812165
1657  hsa-mir-6835   3.143808
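
Note that the rows above follow the `NCA_General` ordering, so the `NCA_Stage` column is not sorted. A per-target ranking could be sketched with `DataFrame.nlargest`, as below. The data are the ten rows printed above, so the result is only the ranking among these ten features and not among all features in the dataset; the variable names are illustrative.

```python
import pandas as pd

# The ten rows printed above (General and Stage scores)
top10_by_general = pd.DataFrame(
    {
        "Feature": [
            "hsa-mir-937", "hsa-mir-3617", "hsa-mir-6083", "hsa-mir-4302",
            "hsa-mir-6860", "hsa-mir-100", "hsa-mir-4739", "hsa-mir-5590",
            "hsa-mir-6783", "hsa-mir-6835",
        ],
        "NCA_General": [
            5.772509, 5.235266, 4.567811, 4.567811, 4.467054,
            4.411412, 4.273464, 4.223181, 3.949810, 3.843756,
        ],
        "NCA_Stage": [
            2.329354, 2.440621, 1.000007, 1.000007, 2.392775,
            3.183818, 2.410775, 2.872625, 2.812165, 3.143808,
        ],
    }
)

# Rank by the Stage score itself instead of reusing the General ordering
by_stage = top10_by_general.nlargest(3, "NCA_Stage")
print(by_stage[["Feature", "NCA_Stage"]])
```

Within these ten rows, hsa-mir-100, hsa-mir-6835, and hsa-mir-5590 have the largest `NCA_Stage` scores. Applying the same call to the full `nca_results_df` would give a true top-10 list per target.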

Top 10 Features (NCA Importance) for 'Subtype' Classification:
           Feature  NCA_Subtype
1863   hsa-mir-937     1.725681
515   hsa-mir-3617     2.936686
1421  hsa-mir-6083     1.789992
