# Understanding Information Gain Results

## What is Information Gain?
Information Gain measures the reduction in uncertainty about a target variable provided by knowing the value of a feature. It is widely used in decision trees and feature selection to identify the most informative features.

- [Information Gain and Mutual Information for Machine Learning](https://machinelearningmastery.com/information-gain-and-mutual-information/)
- [Feature Selection Techniques in Machine Learning](https://www.geeksforgeeks.org/feature-selection-techniques-in-machine-learning/#:~:text=techniques%20used%20are%3A-,Information%20Gain,-%E2%80%93%20It%20is)
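
For a discrete feature, the definition above can be written as IG(Y, X) = H(Y) - H(Y | X), where H is Shannon entropy. The following is a minimal sketch of that calculation on toy data. The helper names `entropy` and `information_gain` and the toy lists are illustrative assumptions; they are not part of this notebook's pipeline, which relies on scikit-learn's estimator instead.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, target):
    """H(target) minus the weighted entropy of target within each feature value."""
    total = len(target)
    conditional = 0.0
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(target) - conditional


# Hypothetical toy data: one feature determines the target, the other is unrelated.
target = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]
uninformative = ["a", "b", "a", "b"]

gain_informative = information_gain(informative, target)
gain_uninformative = information_gain(uninformative, target)
print(gain_informative, gain_uninformative)
```

The feature that splits the target perfectly removes all of its uncertainty, while the unrelated feature removes none.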

### How to Interpret the Results
- **High Information Gain**: The feature shares substantial information with the target variable. Higher-scoring features are stronger candidates for predictive modeling, although each feature is scored on its own, so redundancy between features is not accounted for.
- **Low Information Gain**: The feature, taken on its own, carries little or no information about the target variable.
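
To make the contrast between high and low scores concrete, here is a small sketch on synthetic data (not the miRNA dataset). The arrays `signal`, `noise`, and `X_demo` are made up for illustration; only `mutual_info_classif` comes from the notebook.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)               # synthetic binary target
signal = y + rng.normal(scale=0.1, size=500)   # closely tracks the target
noise = rng.normal(size=500)                   # independent of the target
X_demo = np.column_stack([signal, noise])

# One score per column; the estimate is non-negative and larger means more informative.
scores = mutual_info_classif(X_demo, y, random_state=0)
print(dict(zip(["signal", "noise"], scores.round(3))))
```

The `signal` column should score far higher than the `noise` column, whose estimate should sit close to zero.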

### Practical Uses of Information Gain
1. **Feature Selection**: 
   - Retain features with high Information Gain to reduce dimensionality and improve model performance.
   - Discard features with very low Information Gain, as they contribute minimal predictive power.

2. **Feature Importance Analysis**: 
   - Understand which features are most relevant for each target variable (`general`, `stage`, and `subtype`).
   - Guide domain experts to focus on significant variables for further analysis.

3. **Improving Model Efficiency**: 
   - Train on only the top-ranked features to reduce the computational cost of model training.
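
One possible way to apply the feature-selection idea above is scikit-learn's `SelectKBest` with `mutual_info_classif` as the scoring function. This is a sketch on synthetic data, not the selection procedure used in this notebook; the value `k=2` and the arrays `X_demo` and `y_demo` are assumptions chosen for illustration.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)
# Columns 0 and 1 depend on the target; columns 2-5 are pure noise.
informative = y_demo[:, None] + rng.normal(scale=0.2, size=(300, 2))
noise = rng.normal(size=(300, 4))
X_demo = np.hstack([informative, noise])

# partial() fixes random_state so the scores are reproducible.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=2)
X_reduced = selector.fit_transform(X_demo, y_demo)

kept = selector.get_support(indices=True)  # indices of the retained columns
print(kept, X_reduced.shape)
```

In the notebook itself, the same pattern would take `X` and one of the targets (for example `y_stage_encoded`) in place of the synthetic arrays, with `k` chosen to suit the downstream model.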

In [22]:
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import LabelEncoder, StandardScaler
import pandas as pd
import os

# === CONFIGURABLE PARAMETERS ===
SCALE_DATA = True
# A fixed random_state makes the mutual_info_classif estimates reproducible;
# set it to None for run-to-run variability.
RANDOM_STATE = 42
PREPROCESSED_FILE = "preprocessed_miRNA_data.csv"

In [23]:
# Load the cached preprocessed data if it exists; otherwise build it from the raw file
if os.path.exists(PREPROCESSED_FILE):
    print("Loading preprocessed data...")
    data_scaled = pd.read_csv(PREPROCESSED_FILE)
    
    # Extract features and targets
    X = data_scaled.drop(columns=['stage', 'subtype', 'general'])
    y_general = data_scaled['general']
    y_stage_encoded = data_scaled['stage']
    y_subtype_encoded = data_scaled['subtype']
    
else:
    print("Preprocessing data for the first time...")
    
    # Load raw data
    file_path = "miRNA_stage_subtype.csv"
    data = pd.read_csv(file_path)

    # Create the binary 'general' label: 1 when stage > 0, otherwise 0
    data['general'] = (data['stage'] > 0).astype(int)

    # Separate features and targets
    X = data.drop(columns=['stage', 'subtype', 'general'])
    y_general = data['general']
    y_stage = data['stage']
    y_subtype = data['subtype']

    # Scale Features (if enabled)
    if SCALE_DATA:
        scaler = StandardScaler()
        # Keep the original index so the target columns align when they are added back
        X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

    # Encode the target labels as integers
    le_stage = LabelEncoder()
    y_stage_encoded = le_stage.fit_transform(y_stage)

    le_subtype = LabelEncoder()
    y_subtype_encoded = le_subtype.fit_transform(y_subtype)

    # Save preprocessed data
    data_scaled = X.copy()
    data_scaled['general'] = y_general
    data_scaled['stage'] = y_stage_encoded
    data_scaled['subtype'] = y_subtype_encoded
    data_scaled.to_csv(PREPROCESSED_FILE, index=False)
    print(f"Preprocessed data saved at: {PREPROCESSED_FILE}")


Preprocessing data for the first time...
Preprocessed data saved at: preprocessed_miRNA_data.csv


In [24]:
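# === STEP 2: Compute Information Gain ===
# mutual_info_classif estimates the mutual information between each feature and
# the target; that estimate is used here as the information gain score.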
info_gain_general = mutual_info_classif(X, y_general, random_state=RANDOM_STATE)
info_gain_stage = mutual_info_classif(X, y_stage_encoded, random_state=RANDOM_STATE)
info_gain_subtype = mutual_info_classif(X, y_subtype_encoded, random_state=RANDOM_STATE)

# Combine results into a DataFrame
info_gain_df = pd.DataFrame({
    'Feature': X.columns,
    'Info_Gain_General': info_gain_general,
    'Info_Gain_Stage': info_gain_stage,
    'Info_Gain_Subtype': info_gain_subtype
})

info_gain_df_sorted = info_gain_df.sort_values(by=['Info_Gain_General', 'Info_Gain_Stage', 'Info_Gain_Subtype'], ascending=False)

# Display the top 10 features for each classification level
top_features_general = info_gain_df_sorted[['Feature', 'Info_Gain_General']].sort_values(
    by='Info_Gain_General', ascending=False).head(10)

top_features_stage = info_gain_df_sorted[['Feature', 'Info_Gain_Stage']].sort_values(
    by='Info_Gain_Stage', ascending=False).head(10)

top_features_subtype = info_gain_df_sorted[['Feature', 'Info_Gain_Subtype']].sort_values(
    by='Info_Gain_Subtype', ascending=False).head(10)

print("Top 10 Features for 'General' Classification:")
print(top_features_general)

print("\nTop 10 Features for 'Stage' Classification:")
print(top_features_stage)

print("\nTop 10 Features for 'Subtype' Classification:")
print(top_features_subtype)

Top 10 Features for 'General' Classification:
             Feature  Info_Gain_General
1013    hsa-mir-4731           0.010480
541     hsa-mir-3661           0.009895
680     hsa-mir-4257           0.009355
1341    hsa-mir-5690           0.009244
225     hsa-mir-181d           0.009049
914     hsa-mir-4637           0.008743
1208  hsa-mir-526a-2           0.008537
1312  hsa-mir-5583-1           0.008408
274     hsa-mir-203b           0.008402
1136   hsa-mir-509-1           0.008129

Top 10 Features for 'Stage' Classification:
           Feature  Info_Gain_Stage
1680  hsa-mir-6858         0.052485
842   hsa-mir-4490         0.049394
1340   hsa-mir-569         0.048567
1628  hsa-mir-6806         0.047535
1417   hsa-mir-608         0.046619
1767  hsa-mir-7704         0.046536
1133  hsa-mir-5087         0.045498
1125   hsa-mir-502         0.045280
1099   hsa-mir-490         0.044939
1769  hsa-mir-7706         0.044834

Top 10 Features for 'Subtype' Classification:
             Feature  Info