# **Neighborhood Components Analysis (NCA) feature**

## **What is NCA?**
**Neighborhood Components Analysis (NCA)** is a supervised metric-learning technique, often used for dimensionality reduction, that learns a linear transformation of the features to improve nearest-neighbor classification. The transformation is chosen so that samples with the same class label end up closer together than samples with different labels.

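Unlike traditional feature selection methods that rely on statistical significance, **NCA maximizes a smooth (stochastic) estimate of leave-one-out nearest-neighbor accuracy** on the training data. The weights it assigns to each feature can therefore serve as a heuristic measure of how relevant that feature is to the classification task.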
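To make the idea concrete, here is a minimal sketch of NCA feeding a k-nearest-neighbors classifier. It assumes synthetic data from `make_classification` as a stand-in for an expression matrix; the names `X_demo`, `y_demo`, and `nca_knn`, and the choices `n_components=2` and `n_neighbors=3`, are illustrative and not taken from the analysis in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an expression matrix (assumption: not the miRNA data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Scale, learn the NCA transformation, then classify with k-nearest neighbors.
nca_knn = make_pipeline(
    StandardScaler(),
    NeighborhoodComponentsAnalysis(n_components=2, random_state=0),
    KNeighborsClassifier(n_neighbors=3),
)
nca_knn.fit(X_train, y_train)

# The learned linear map has shape (n_components, n_features).
nca_step = nca_knn[1]
print(nca_step.components_.shape)

# Accuracy on held-out samples after projecting them with the learned map.
test_accuracy = nca_knn.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

Evaluating on held-out samples matters here because NCA is fitted to the training labels, so training accuracy alone would overstate how well the learned space generalizes.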

[NCA in scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NeighborhoodComponentsAnalysis.html)

[Python Examples](https://github.com/erlendd/Neighborhood-Components-Analysis-in-Python/blob/master/Neighborhood%20Components%20Analysis.ipynb)

---

## **How to Interpret the Results?**
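After fitting NCA, each feature (miRNA) receives an **importance score**: the sum of the absolute weights assigned to that feature across all learned components. A higher score means the learned transformation relies more heavily on that feature for the given classification task.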

### **Top Features for Each Classification**
| **General (Cancer vs. No Cancer)** | **Stage (Early vs. Late)**       | **Subtype (Cancer Type)**       |
|------------------------------------|----------------------------------|----------------------------------|
| hsa-mir-937 (5.77)                 | hsa-mir-548ax (6.14)            | hsa-mir-184 (6.52)              |
| hsa-mir-3617 (5.24)                | hsa-mir-4703 (5.61)             | hsa-mir-551b (5.89)             |
| hsa-mir-6083 (4.57)                | hsa-mir-184 (5.50)              | hsa-mir-4490 (5.49)             |
| hsa-mir-4302 (4.57)                | hsa-mir-935 (5.34)              | hsa-mir-3681 (5.44)             |
| hsa-mir-6860 (4.47)                | hsa-mir-554 (5.04)              | hsa-mir-5087 (5.30)             |
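The scores in the table are column sums of the absolute values of the learned transformation. The following is a minimal sketch of that calculation, assuming synthetic data in place of the miRNA matrix; `X_toy`, `y_toy`, and the `feature_<n>` labels are placeholders. Because the weights depend on the scale of each input column, the sketch standardizes the features first.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the miRNA matrix (assumption).
X_toy, y_toy = make_classification(
    n_samples=150, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)

nca = NeighborhoodComponentsAnalysis(random_state=0)
nca.fit(X_toy, y_toy)

# components_ has shape (n_components, n_features); each column holds one
# feature's weights across all learned components.
scores = np.abs(nca.components_).sum(axis=0)

# Rank features from highest to lowest score.
ranking = np.argsort(scores)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {scores[idx]:.3f}")
```

This score is a ranking heuristic rather than a significance test, so it is best read as a way to shortlist features for further validation.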

### **Key Observations**
1. **General Classification (Cancer vs. No Cancer)**  
   - **hsa-mir-937** has the highest score (**5.77**), making it the top-ranked miRNA for separating cancer from non-cancer samples.  
   - **hsa-mir-3617** and **hsa-mir-6083** also score highly, which makes them candidate biomarkers for distinguishing cancerous samples.  

2. **Stage Classification (Early vs. Late)**  
   - **hsa-mir-548ax** has the highest score (**6.14**), suggesting it carries the most weight in separating early from late stages.  
   - **hsa-mir-4703** and **hsa-mir-184** follow, which may point to their involvement in tumor progression.  

3. **Subtype Classification (Cancer Type)**  
   - **hsa-mir-184** ranks highest (**6.52**) for identifying cancer subtypes, and it also appears among the top stage features.  
   - **hsa-mir-551b** and **hsa-mir-4490** also rank highly and may be worth investigating as subtype-specific markers for **precision medicine**.  
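Observations like these can be derived programmatically from the per-task score columns. The sketch below assumes a small stand-in DataFrame that reuses the column names `Feature`, `NCA_General`, `NCA_Stage`, and `NCA_Subtype`; the feature names and scores are placeholders rather than values from this analysis, and `top_features` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical stand-in for the results table; feature names and scores are
# placeholders, not values from the miRNA analysis.
results = pd.DataFrame({
    "Feature": ["feat_a", "feat_b", "feat_c", "feat_d"],
    "NCA_General": [0.9, 0.2, 0.5, 0.1],
    "NCA_Stage": [0.1, 0.8, 0.7, 0.3],
    "NCA_Subtype": [0.2, 0.6, 0.9, 0.4],
})


def top_features(df, score_col, k=2):
    """Return the names of the k highest-scoring features for one task."""
    return df.nlargest(k, score_col)["Feature"].tolist()


score_cols = ["NCA_General", "NCA_Stage", "NCA_Subtype"]
top = {col: top_features(results, col) for col in score_cols}
print(top)

# Features ranked in the top k for both stage and subtype.
shared = sorted(set(top["NCA_Stage"]) & set(top["NCA_Subtype"]))
print(shared)  # ['feat_b', 'feat_c']
```

A feature that ranks highly for more than one task, as hsa-mir-184 does in the table above, is a natural candidate for closer inspection.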

In [7]:
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.preprocessing import LabelEncoder, StandardScaler
import pandas as pd
import numpy as np
import os

# === CONFIGURABLE PARAMETERS ===
SCALE_DATA = True
RANDOM_STATE = 42
PREPROCESSED_FILE = "../processed_data/preprocessed_miRNA_data.csv"

In [8]:
# === STEP 1: DATA PREPROCESSING ===
if os.path.exists(PREPROCESSED_FILE):
    print("Loading preprocessed data...")
    data_scaled = pd.read_csv(PREPROCESSED_FILE)
    
    # Extract features and targets
    X = data_scaled.drop(columns=['stage', 'subtype', 'general'])
    y_general = data_scaled['general']
    y_stage_encoded = data_scaled['stage']
    y_subtype_encoded = data_scaled['subtype']
    
else:
    print("Preprocessing data for the first time...")
    
    # Load raw data
    file_path = "../processed_data/miRNA_stage_subtype.csv"
    data = pd.read_csv(file_path)

    # Create 'General' Classification
    data['general'] = (data['stage'] > 0).astype(int)

    # Separate features and targets
    X = data.drop(columns=['stage', 'subtype', 'general'])
    y_general = data['general']
    y_stage = data['stage']
    y_subtype = data['subtype']

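    # Scale features to zero mean and unit variance (if enabled)
    if SCALE_DATA:
        scaler = StandardScaler()
        # Keep the original index so the targets added later stay aligned
        X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)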

    # Encode categorical targets
    le_stage = LabelEncoder()
    y_stage_encoded = le_stage.fit_transform(y_stage)

    le_subtype = LabelEncoder()
    y_subtype_encoded = le_subtype.fit_transform(y_subtype)

    # Save preprocessed data
    data_scaled = X.copy()
    data_scaled['general'] = y_general
    data_scaled['stage'] = y_stage_encoded
    data_scaled['subtype'] = y_subtype_encoded
    data_scaled.to_csv(PREPROCESSED_FILE, index=False)
    print(f"Preprocessed data saved at: {PREPROCESSED_FILE}")


Preprocessing data for the first time...
Preprocessed data saved at: ../processed_data/preprocessed_miRNA_data.csv
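The target construction in the preprocessing cell can be illustrated on a toy table. This sketch assumes that stage 0 denotes a non-cancer sample, as the `stage > 0` rule implies; the DataFrame `toy` and the subtype labels `none`, `typeA`, and `typeB` are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels (assumption: stage 0 denotes a non-cancer sample, as the
# `stage > 0` rule in the preprocessing cell implies).
toy = pd.DataFrame({
    "stage": [0, 1, 3, 0],
    "subtype": ["none", "typeB", "typeA", "none"],
})

# Binary target: 1 for any stage above 0, else 0.
toy["general"] = (toy["stage"] > 0).astype(int)

# LabelEncoder assigns integer codes in sorted order of the class labels.
le = LabelEncoder()
toy["subtype_encoded"] = le.fit_transform(toy["subtype"])

print(toy)
print(list(le.classes_))  # ['none', 'typeA', 'typeB']
```

Keeping the fitted encoder around makes it possible to map integer codes back to the original labels with `le.inverse_transform`.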


In [9]:
# === STEP 2: NCA-BASED FEATURE ANALYSIS ===
def calculate_nca_importances(X, y):
    """Return one importance score per feature from a fitted NCA model.

    The score is the sum of absolute weights in each column of
    `components_`, which has shape (n_components, n_features).
    """
    nca = NeighborhoodComponentsAnalysis(random_state=RANDOM_STATE)
    nca.fit(X, y)
    return np.abs(nca.components_).sum(axis=0)

# Calculate NCA importances for each target
nca_general = calculate_nca_importances(X, y_general)
nca_stage = calculate_nca_importances(X, y_stage_encoded)
nca_subtype = calculate_nca_importances(X, y_subtype_encoded)

# Create results DataFrame
nca_results_df = pd.DataFrame({
    'Feature': X.columns,
    'NCA_General': nca_general,
    'NCA_Stage': nca_stage,
    'NCA_Subtype': nca_subtype
})

# Sort features separately for each target
top_general = nca_results_df[['Feature', 'NCA_General']].sort_values('NCA_General', ascending=False).head(10)
top_stage = nca_results_df[['Feature', 'NCA_Stage']].sort_values('NCA_Stage', ascending=False).head(10)
top_subtype = nca_results_df[['Feature', 'NCA_Subtype']].sort_values('NCA_Subtype', ascending=False).head(10)

nca_results_df.to_csv("../processed_data/neighborhood_components_analysis.csv", index=False)

# Add classification type labels to each top features DataFrame
top_general['Classification_Type'] = 'General'
top_stage['Classification_Type'] = 'Stage'
top_subtype['Classification_Type'] = 'Subtype'

# Rename columns for consistency
top_general = top_general.rename(columns={'NCA_General': 'Importance_Score'})
top_stage = top_stage.rename(columns={'NCA_Stage': 'Importance_Score'})
top_subtype = top_subtype.rename(columns={'NCA_Subtype': 'Importance_Score'})

# Display results
print("\nTop 10 Features (NCA Importance) for 'General' Classification:")
print(top_general)
print("\nTop 10 Features (NCA Importance) for 'Stage' Classification:")
print(top_stage)
print("\nTop 10 Features (NCA Importance) for 'Subtype' Classification:")
print(top_subtype)

# Combine all top features into a single DataFrame
combined_top_features = pd.concat([top_general, top_stage, top_subtype], axis=0)




Top 10 Features (NCA Importance) for 'General' Classification:
           Feature  Importance_Score Classification_Type
1863   hsa-mir-937          5.772509             General
515   hsa-mir-3617          5.235266             General
1421  hsa-mir-6083          4.567811             General
727   hsa-mir-4302          4.567811             General
1685  hsa-mir-6860          4.467054             General
13     hsa-mir-100          4.411412             General
1021  hsa-mir-4739          4.273464             General
1321  hsa-mir-5590          4.223181             General
1605  hsa-mir-6783          3.949810             General
1657  hsa-mir-6835          3.843756             General

Top 10 Features (NCA Importance) for 'Stage' Classification:
             Feature  Importance_Score Classification_Type
1248   hsa-mir-548ax          7.055290               Stage
1861     hsa-mir-935          6.559476               Stage
985     hsa-mir-4703          6.498538               Stage
842     hsa