# Understanding Information Gain Results

## What is Information Gain?
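Information Gain measures the reduction in uncertainty (entropy) about a target variable provided by knowing the value of a feature. It is equivalent to the mutual information between the feature and the target, which is what the code in this document estimates with scikit-learn's `mutual_info_classif`. It is widely used in decision trees and feature selection to identify the most informative features.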

- [Information Gain and Mutual Information for Machine Learning](https://machinelearningmastery.com/information-gain-and-mutual-information/)
- [Feature Selection Techniques in Machine Learning](https://www.geeksforgeeks.org/feature-selection-techniques-in-machine-learning/#:~:text=techniques%20used%20are%3A-,Information%20Gain,-%E2%80%93%20It%20is)
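
To make the definition concrete, here is a minimal sketch that computes Information Gain by hand as `H(target) - H(target | feature)` for discrete values. The helper names `entropy` and `information_gain` and the toy data are illustrative assumptions, not part of the analysis in this document. scikit-learn's `mutual_info_classif` reports values in nats and uses a nearest-neighbor estimator for continuous features, so its numbers will not match this discrete formula exactly.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, target):
    """H(target) - H(target | feature) for two equal-length discrete sequences."""
    n = len(target)
    # Group the target values by the feature value they co-occur with
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    # Conditional entropy: entropy of each group, weighted by group size
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional


# Toy data (hypothetical): a balanced binary target and two candidate features
target = [0, 0, 1, 1]
perfect_feature = ["a", "a", "b", "b"]  # determines the target completely
useless_feature = ["a", "b", "a", "b"]  # carries no information about the target

ig_perfect = information_gain(perfect_feature, target)  # 1.0 bit
ig_useless = information_gain(useless_feature, target)  # 0.0 bits
print(ig_perfect, ig_useless)
```

The first feature removes all uncertainty about the target, so its gain equals the full target entropy of 1 bit. The second leaves the class balance unchanged within each group, so its gain is zero.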

### How to Interpret the Results
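- **High Information Gain**: Indicates that the feature, taken on its own, is strongly associated with the target variable. Features with higher values are good candidates to prioritize for predictive modeling.
- **Low Information Gain**: Suggests that the feature, taken on its own, has little to no predictive power for the target variable. Scores are computed one feature at a time, so a low-scoring feature may still be useful in combination with others.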
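
Whether a score counts as "high" depends on how much uncertainty the target has to begin with. For a discrete target, mutual information cannot exceed the target's entropy in theory, so one rough way to read a score is as a fraction of that entropy. The sketch below assumes scores in nats (the unit `mutual_info_classif` reports). The labels and the score are made-up placeholders, and `fraction_of_target_entropy` is a hypothetical helper.

```python
import math
from collections import Counter


def entropy_nats(labels):
    """Shannon entropy, in nats, of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def fraction_of_target_entropy(score, labels):
    """Express a mutual-information score (in nats) as a fraction of H(target)."""
    return score / entropy_nats(labels)


# Hypothetical, perfectly balanced two-class target (NOT the real stage labels)
demo_labels = [0] * 50 + [1] * 50
demo_score = 0.05  # illustrative score in nats, not a measured result

frac = fraction_of_target_entropy(demo_score, demo_labels)
print(f"{frac:.1%} of the target entropy")
```

Because the scores are estimates, a fraction computed this way is only a rough guide to scale, not an exact bound.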

### Practical Uses of Information Gain
1. **Feature Selection**: 
   - Retain features with high Information Gain to reduce dimensionality and improve model performance.
   - Discard features with very low Information Gain, as they contribute minimal predictive power.

2. **Feature Importance Analysis**: 
   - Understand which features are most relevant for the target variables (`stage` and `subtype`).
   - Guide domain experts to focus on significant variables for further analysis.

3. **Improving Model Efficiency**:
   - Training on only the top-ranked features reduces the computational cost of model training.
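
The feature-selection use above can be sketched with scikit-learn's `SelectKBest`, which keeps the `k` highest-scoring features for a given scoring function. This is one possible approach rather than the method used in this document. The synthetic data from `make_classification` and the cut-off `k = 5` are assumptions chosen only for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for a feature matrix and one target (not the miRNA data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

# Fix random_state so the nearest-neighbor based estimate is reproducible
score_func = partial(mutual_info_classif, random_state=42)

k = 5  # hypothetical cut-off; tune it for the task at hand
selector = SelectKBest(score_func=score_func, k=k).fit(X_demo, y_demo)

selected_idx = selector.get_support(indices=True)  # column indices that were kept
X_reduced = selector.transform(X_demo)             # matrix with only those columns
print(selected_idx, X_reduced.shape)
```

When the reduced matrix feeds a predictive model, fitting the selector on the training split only avoids leaking information from the test data into the feature ranking.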

### Python Code for Results Interpretation

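The snippet below assumes `info_gain_df_sorted` has already been computed, as in the notebook cells at the end of this document.

```py
# Display the top 10 features by Information Gain for each target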
top_features_stage = info_gain_df_sorted[['Feature', 'Info_Gain_Stage']].sort_values(
    by='Info_Gain_Stage', ascending=False).head(10)

top_features_subtype = info_gain_df_sorted[['Feature', 'Info_Gain_Subtype']].sort_values(
    by='Info_Gain_Subtype', ascending=False).head(10)

print("Top 10 Features for 'Stage':")
print(top_features_stage)

print("\nTop 10 Features for 'Subtype':")
print(top_features_subtype)

# Plotting Information Gain (Optional)
import matplotlib.pyplot as plt

# Top features for stage
plt.figure(figsize=(10, 6))
plt.barh(top_features_stage['Feature'], top_features_stage['Info_Gain_Stage'], color='skyblue')
plt.title("Top 10 Features by Information Gain for 'Stage'")
plt.xlabel("Information Gain")
plt.ylabel("Feature")
plt.gca().invert_yaxis()
plt.show()

# Top features for subtype
plt.figure(figsize=(10, 6))
plt.barh(top_features_subtype['Feature'], top_features_subtype['Info_Gain_Subtype'], color='lightgreen')
plt.title("Top 10 Features by Information Gain for 'Subtype'")
plt.xlabel("Information Gain")
plt.ylabel("Feature")
plt.gca().invert_yaxis()
plt.show()
```

```py
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import LabelEncoder
```


```py
# Load the dataset
file_path = './miRNA_stage_subtype.csv'
data = pd.read_csv(file_path)

# Preview the first few rows
data.head()
```


| | hsa-let-7a-1 | hsa-let-7a-2 | hsa-let-7a-3 | hsa-let-7b | hsa-let-7c | hsa-let-7d | hsa-let-7e | hsa-let-7f-1 | hsa-let-7f-2 | hsa-let-7g | ... | hsa-mir-943 | hsa-mir-944 | hsa-mir-95 | hsa-mir-9500 | hsa-mir-96 | hsa-mir-98 | hsa-mir-99a | hsa-mir-99b | stage | subtype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7314.747386 | 7391.483138 | 7334.393081 | 10994.201497 | 471.496698 | 318.193106 | 1156.241547 | 3272.099771 | 3363.611772 | 442.783758 | ... | 0.0 | 0.0 | 1.847031 | 0 | 40.298863 | 35.429417 | 148.602058 | 12118.707689 | 1 | 2 |
| 1 | 9518.042994 | 9460.443528 | 9574.874468 | 17578.281899 | 785.810318 | 358.652676 | 771.986446 | 3871.452122 | 3917.224498 | 487.829079 | ... | 0.0 | 128.562009 | 4.607957 | 0 | 8.60152 | 38.86044 | 111.512567 | 7471.802757 | 1 | 2 |
| 2 | 4479.97634 | 4387.407628 | 4447.955716 | 12394.31011 | 404.624244 | 855.241747 | 246.267705 | 1353.016896 | 1415.311564 | 416.8503 | ... | 0.0 | 161.267504 | 1.746579 | 0 | 33.767203 | 31.43843 | 168.253822 | 16026.613214 | 1 | 2 |
| 3 | 21277.962603 | 21166.590502 | 21255.800397 | 15161.474118 | 6684.570363 | 503.278464 | 2185.922959 | 15012.229891 | 14987.262342 | 1107.549261 | ... | 0.0 | 1.683206 | 10.660302 | 0 | 5.049617 | 95.101114 | 1416.978551 | 12750.562682 | 1 | 2 |
| 4 | 8002.355461 | 8013.396682 | 8033.638922 | 19358.942067 | 1276.411235 | 765.754731 | 593.005616 | 2630.801098 | 2649.43316 | 367.580673 | ... | 0.0 | 97.990843 | 3.450382 | 0 | 22.77252 | 46.235116 | 455.450396 | 14401.203493 | 1 | 2 |


```py
# Separate features and targets
X = data.drop(columns=['stage', 'subtype'])
y_stage = data['stage']
y_subtype = data['subtype']

# Ensure targets are encoded as integers if not already
le_stage = LabelEncoder()
y_stage_encoded = le_stage.fit_transform(y_stage)

le_subtype = LabelEncoder()
y_subtype_encoded = le_subtype.fit_transform(y_subtype)

# Estimate Information Gain (mutual information) between each feature and each target.
# random_state is fixed so the estimates are reproducible.
info_gain_stage = mutual_info_classif(X, y_stage_encoded, random_state=42)
info_gain_subtype = mutual_info_classif(X, y_subtype_encoded, random_state=42)

# Combine results into a DataFrame
info_gain_df = pd.DataFrame({
    'Feature': X.columns,
    'Info_Gain_Stage': info_gain_stage,
    'Info_Gain_Subtype': info_gain_subtype
})

# Sort by Information Gain for 'stage' (highest first), using 'subtype' to break ties
info_gain_df_sorted = info_gain_df.sort_values(by=['Info_Gain_Stage', 'Info_Gain_Subtype'], ascending=False)
info_gain_df_sorted.head()
```

| | Feature | Info_Gain_Stage | Info_Gain_Subtype |
|---|---|---|---|
| 1680 | hsa-mir-6858 | 0.052485 | 0.020078 |
| 842 | hsa-mir-4490 | 0.049394 | 0.002544 |
| 1340 | hsa-mir-569 | 0.048567 | 0.0 |
| 1628 | hsa-mir-6806 | 0.047535 | 0.012377 |
| 1417 | hsa-mir-608 | 0.046619 | 0.014534 |
