## Logistic Regression Workflow for Predicting MIR100HG Expression in PAAD

This section describes a complete modeling pipeline to classify the expression level of **MIR100HG** (High vs. Low) in **Pancreatic Adenocarcinoma (PAAD)** using **gene expression and DNA methylation data**. The model is built using **logistic regression**, with steps including preprocessing, feature selection, standardization, model training, and performance evaluation.

---

### 1. Data Sources

- **Expression Labels**: `PAAD_Model_MIR100HG_Expression_Levels.csv`
- **Gene Expression Features**: `PAAD_Model_Gene_Expression_Features.csv`
- **Methylation Features**: `PAAD_Model_Methylation_Features.csv`

After merging:
- **Total Samples**: 178  
- **Total Features (post-merge)**: 13,564 (3,316 gene expression + 10,248 methylation)

---

### 2. Preprocessing

- Removed `MIR100HG` gene from expression data to prevent label leakage.
- Added prefixes: `Gene_` for gene expression, `Methylation_` for methylation features.
- Merged gene expression and methylation data based on `Sample_ID`.
- Final feature matrix was split into predictors `X` and binary label `y`.
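The preprocessing steps above can be sketched on a toy example. The sample IDs and values below are hypothetical, and `to_sample_rows` is an illustrative helper rather than a function from the notebook; the column names `HGNC_Symbol`, `Probe_ID`, `Sample_ID` and `Group` follow the input files.

```python
import pandas as pd

# Toy feature-by-sample tables (hypothetical values); the real files have
# one row per gene or probe and one column per sample.
gene_exp = pd.DataFrame({
    "HGNC_Symbol": ["MIR100HG", "PHLDB2", "NEGR1"],
    "S1": [5.0, 1.2, 0.3],
    "S2": [1.0, 2.4, 0.8],
})
methylation = pd.DataFrame({
    "Probe_ID": ["cg26233331", "cg08178168"],
    "S1": [0.10, 0.90],
    "S2": [0.70, 0.20],
})
labels = pd.DataFrame({"Sample_ID": ["S1", "S2"], "Group": ["High", "Low"]})


def to_sample_rows(df, id_col, prefix):
    """Transpose to one row per sample and prefix every feature name."""
    wide = df.set_index(id_col).T.add_prefix(prefix)
    wide.index.name = "Sample_ID"
    return wide.reset_index()


# Drop the target gene before it can leak into the predictors.
gene_exp = gene_exp[gene_exp["HGNC_Symbol"] != "MIR100HG"]

merged = (
    labels
    .merge(to_sample_rows(gene_exp, "HGNC_Symbol", "Gene_"), on="Sample_ID", how="inner")
    .merge(to_sample_rows(methylation, "Probe_ID", "Methylation_"), on="Sample_ID", how="inner")
)

X = merged.drop(columns=["Sample_ID", "Group"])
y = merged["Group"].map({"High": 1, "Low": 0})
print(X.columns.tolist())
```

Prefixing before the merge, as in this sketch, also avoids name collisions if a gene symbol and a probe ID ever coincide.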

---

### 3. Train-Test Split

- 80% for training (142 samples), 20% for testing (36 samples)
- Stratified by label (MIR100HG expression status)
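A minimal sketch of the stratified split, using random stand-in data. The balanced 89/89 label split is an assumption made for illustration; only the sample count (178) and the split parameters come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 178-sample feature matrix; labels are assumed balanced.
rng = np.random.default_rng(0)
X = rng.normal(size=(178, 5))
y = np.array([0] * 89 + [1] * 89)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The test size is rounded up: ceil(0.2 * 178) = 36, leaving 142 for training.
print(X_train.shape[0], X_test.shape[0])
print(np.bincount(y_train), np.bincount(y_test))
```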

---

### 4. Feature Selection

Performed separately for gene and methylation features on the training set:

1. **Variance Filtering**:
   - Removed low-variance features (`threshold=0.005`)

2. **ANOVA F-test (SelectKBest)**:
   - Top 200 gene expression features
   - Top 100 methylation features

3. **Feature Masking**:
   - Selected features were applied consistently to both training and test sets.
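The two-stage selection can be sketched for a single feature block as follows. The data are synthetic and `select_block` is a hypothetical helper, not the notebook's `select_features`; the point is that both selectors are fitted on training rows only and the resulting column list is reused for the test set.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif


def select_block(X_train, y_train, k, var_threshold=0.005):
    """Variance filter followed by an ANOVA F-test; returns kept column names."""
    vt = VarianceThreshold(threshold=var_threshold).fit(X_train)
    kept = X_train.columns[vt.get_support()]
    kbest = SelectKBest(f_classif, k=min(k, len(kept))).fit(X_train[kept], y_train)
    return kept[kbest.get_support()].tolist()


# Synthetic data: 40 training samples, 6 "gene" columns.
rng = np.random.default_rng(0)
y_train = np.array([0, 1] * 20)
X_train = pd.DataFrame(rng.normal(size=(40, 6)),
                       columns=[f"Gene_g{i}" for i in range(6)])
X_train["Gene_g0"] = 1.0  # constant column, removed by the variance filter
X_train["Gene_g1"] = y_train * 3.0 + rng.normal(scale=0.1, size=40)  # informative
X_test = pd.DataFrame(rng.normal(size=(10, 6)), columns=X_train.columns)

selected = select_block(X_train, y_train, k=2)

# The same column list is applied to both sets; the test set is never used for fitting.
X_train_sel, X_test_sel = X_train[selected], X_test[selected]
print(selected)
```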

---

### 5. Standardization

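- Fitted `StandardScaler` on the training data only, then applied the same transformation to the test data
- Transformed features to zero mean and unit variance on the training set, so that the L2 penalty acts on coefficients of comparable scale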
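A short sketch of the scaling step with random stand-in data: the scaler's mean and standard deviation are estimated from the training rows and then reused unchanged for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(142, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(36, 3))

scaler = StandardScaler().fit(X_train)     # statistics come from training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the training mean and std

# Training columns are exactly centered; test columns are only approximately so.
print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```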

---

### 6. Model Training

- **Model**: Logistic Regression
- **Parameters**:
  - Regularization: L2 (`penalty='l2'`)
  - Strength: `C=1.0`
  - Solver: `'liblinear'`
  - Max Iterations: `1000`
- Trained on standardized training features.

---

### 7. Feature Importance Analysis

- Calculated as the absolute value of each logistic regression coefficient (comparable across features because the inputs are standardized).
- Grouped feature importance by:
  - **Gene expression**
  - **Methylation**
- Computed contribution percentages for both types.
- Top 20 most influential features were recorded.
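The contribution percentages can be illustrated with a small worked example. The coefficients and feature names below are hypothetical, and `importance_table` is an illustrative helper that mirrors the idea of the notebook's `feature_importance_analysis` function.

```python
import numpy as np
import pandas as pd


def importance_table(coef, feature_names):
    """Rank features by |coefficient| and express each as a share of the total."""
    imp = pd.DataFrame({"Feature": feature_names, "Importance": np.abs(coef)})
    imp["Percentage"] = imp["Importance"] / imp["Importance"].sum() * 100
    imp["Type"] = imp["Feature"].str.split("_").str[0]  # "Gene" or "Methylation"
    return imp.sort_values("Importance", ascending=False, ignore_index=True)


coef = np.array([0.6, -0.3, 0.1])  # hypothetical fitted coefficients
names = ["Gene_A", "Methylation_cg1", "Gene_B"]

table = importance_table(coef, names)
by_type = table.groupby("Type")["Percentage"].sum()
print(table)
print(by_type)
```

In this toy case the gene features account for 70% of the total absolute coefficient mass and the methylation feature for 30%. Note that the share per type also depends on how many features of each type were selected (200 vs. 100 in this workflow).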

---

### 8. Test Set Evaluation

| Metric     | Description                         |
|------------|-------------------------------------|
| Accuracy   | Overall classification accuracy     |
| Precision  | Positive predictive value           |
| Recall     | Sensitivity (true positive rate)    |
| F1-score   | Harmonic mean of precision and recall |
| AUC        | Area under ROC curve                |

- Confusion matrix was also generated.
- All results were saved to the specified output directory.

---

### 9. Cross-Validation (5-Fold Stratified)

- In each fold (folds are drawn from all 178 samples, including those in the held-out test set):
  - Feature selection was repeated on training data
  - Features were standardized
  - Logistic regression model was re-trained
  - Performance metrics and feature importance were saved

- **Metrics Averaged Across Folds**:
  - Accuracy, Precision, Recall, F1-score, AUC

- **Per-Fold Outputs**:
  - Performance statistics
  - Feature importance rankings
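One way to guarantee that selection and scaling are refitted inside every fold is to wrap them in a scikit-learn `Pipeline`. The sketch below is an alternative formulation on synthetic data, not the notebook's explicit loop; the column counts, the `k` values and the `block` helper are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 samples, 30 "gene" and 20 "methylation" columns.
rng = np.random.default_rng(0)
n = 100
y = np.array([0, 1] * (n // 2))
gene_cols = [f"Gene_g{i}" for i in range(30)]
meth_cols = [f"Methylation_cg{i}" for i in range(20)]
X = pd.DataFrame(rng.normal(size=(n, 50)), columns=gene_cols + meth_cols)
X["Gene_g0"] += 2.0 * y  # one informative feature


def block(k):
    """Variance filter plus ANOVA top-k for one feature type."""
    return make_pipeline(VarianceThreshold(threshold=0.005), SelectKBest(f_classif, k=k))


model = Pipeline([
    ("select", ColumnTransformer([
        ("gene", block(10), gene_cols),
        ("meth", block(5), meth_cols),
    ])),
    ("scale", StandardScaler()),
    # L2 is the default penalty for LogisticRegression.
    ("clf", LogisticRegression(C=1.0, solver="liblinear", max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["accuracy", "precision", "recall", "f1", "roc_auc"])
mean_scores = {k: v.mean() for k, v in scores.items() if k.startswith("test_")}
print(mean_scores)
```

Because every step lives inside the pipeline, `cross_validate` refits the selectors and the scaler on each fold's training portion, which is the same leakage safeguard the per-fold loop implements by hand.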

---

### 10. Output Files

- Saved to:  
  **`D:\project data\M-28\NTU_DATA_MODEL\PAAD_Logistic_Regression`**

- Files include:
  - Feature importance CSVs (gene, methylation, all)
  - Model performance report (test set)
  - Cross-validation performance metrics
  - Per-fold feature importance

---


# PAAD Logistic Regression Results Summary

This report summarizes the results of a **Logistic Regression model** used to classify **MIR100HG expression status (High vs. Low)** in **Pancreatic Adenocarcinoma (PAAD)** using gene expression and DNA methylation data.

---

## 1. Dataset Overview

- **Samples**: 178
- **Initial Features**:
  - Gene Expression: 3317
  - Methylation: 10248
- **After Removing MIR100HG**:  
  - Merged table shape: (178, 13,566), which includes the `Sample_ID` and `Group` columns (13,564 features)

---

## 2. Preprocessing

- `MIR100HG` feature was detected and removed.
- Features were prefixed with `Gene_` and `Methylation_`.
- Final features after prefixing: **13,564** (3316 gene + 10248 methylation)

---

## 3. Train-Test Split

- **Training Set**: 142 samples
- **Test Set**: 36 samples

---

## 4. Feature Selection

Performed on training data only:

| Step                          | Gene Features | Methylation Features | Total |
|-------------------------------|----------------|------------------------|--------|
| After Variance Filtering      | 3322           | 10248                 | —      |
| Selected via ANOVA (F-test)   | 200            | 100                   | 300    |

---

## 5. Model Training: Logistic Regression

- **Penalty**: L2 Regularization
- **Solver**: `liblinear`
- **Max Iterations**: 1000
- **StandardScaler** applied to all selected features.

---

## 6. Feature Importance Analysis

### Feature Type Contribution

| Feature Type       | Contribution |
|--------------------|--------------|
| Gene Expression    | 67.12%       |
| Methylation        | 32.88%       |

### Top 5 Gene Expression Features

| Feature       | Importance | Percentage |
|---------------|------------|------------|
| Gene_PHLDB2   | 0.6096     | 1.40%      |
| Gene_NEGR1    | 0.5439     | 1.25%      |
| Gene_GEM      | 0.5159     | 1.18%      |
| Gene_LEPR     | 0.4277     | 0.98%      |
| Gene_CHST3    | 0.4239     | 0.97%      |

### Top 5 Methylation Features

| Feature                | Importance | Percentage |
|------------------------|------------|------------|
| Methylation_cg26233331 | 0.4749     | 1.09%      |
| Methylation_cg08178168 | 0.3868     | 0.89%      |
| Methylation_cg18537730 | 0.3847     | 0.88%      |
| Methylation_cg08785724 | 0.3751     | 0.86%      |
| Methylation_cg24633312 | 0.3575     | 0.82%      |

---

## 7. Test Set Performance

| Metric     | Value   |
|------------|---------|
| Accuracy   | 0.9167  |
| Precision  | 0.9412  |
| Recall     | 0.8889  |
| F1-score   | 0.9143  |
| AUC        | 0.9691  |

**Confusion Matrix**:

|               | Predicted Low | Predicted High |
|---------------|---------------|----------------|
| Actual Low    | 17            | 1              |
| Actual High   | 2             | 16             |

---

## 8. Cross-Validation (5-Fold)

Performed with independent feature selection and model retraining for each fold.

| Fold | Accuracy | Precision | Recall | F1-score | AUC    |
|------|----------|-----------|--------|----------|--------|
| 1    | 0.8889   | 0.8889    | 0.8889 | 0.8889   | 0.9691 |
| 2    | 0.8889   | 0.8889    | 0.8889 | 0.8889   | 0.9074 |
| 3    | 0.8611   | 0.7826    | 1.0000 | 0.8780   | 0.9753 |
| 4    | 0.7143   | 0.7000    | 0.7778 | 0.7368   | 0.8072 |
| 5    | 0.7429   | 0.7222    | 0.7647 | 0.7429   | 0.8137 |

**Mean Cross-Validation Performance**:

| Metric     | Value   |
|------------|---------|
| Accuracy   | 0.8192  |
| Precision  | 0.7965  |
| Recall     | 0.8641  |
| F1-score   | 0.8271  |
| AUC        | 0.8946  |
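The fold-to-fold spread is as informative as the mean here, since folds 4 and 5 score noticeably lower than folds 1 to 3. A minimal sketch that recomputes the means and adds the standard deviation from the per-fold table above; because it starts from values already rounded to four decimals, the last digit can differ slightly from the reported means.

```python
import pandas as pd

# Per-fold metrics copied from the cross-validation table.
folds = pd.DataFrame({
    "Accuracy":  [0.8889, 0.8889, 0.8611, 0.7143, 0.7429],
    "Precision": [0.8889, 0.8889, 0.7826, 0.7000, 0.7222],
    "Recall":    [0.8889, 0.8889, 1.0000, 0.7778, 0.7647],
    "F1-score":  [0.8889, 0.8889, 0.8780, 0.7368, 0.7429],
    "AUC":       [0.9691, 0.9074, 0.9753, 0.8072, 0.8137],
}, index=pd.Index(range(1, 6), name="Fold"))

# One row per metric, with the mean and sample standard deviation across folds.
summary = folds.agg(["mean", "std"]).T
print(summary.round(4))
```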

---

## 9. Output Directory

All results were saved to:

**`D:\project data\M-28\NTU_DATA_MODEL\PAAD_Logistic_Regression`**

Including:
- Top features (CSV files)
- Model performance on test set
- Per-fold cross-validation metrics
- Confusion matrix and AUC values

---



In [3]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import VarianceThreshold, f_classif, SelectKBest
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
import warnings

# Ignore the warning
warnings.filterwarnings('ignore')

# Set path
input_dir = r'D:\project data\M-28\NTU_DATA_CLEANED'
output_dir = r'D:\project data\M-28\NTU_DATA_MODEL\PAAD_Logistic_Regression'

# Make sure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# load data
print("load data...")
# MIR100HG Expression level data
mir_exp = pd.read_csv(os.path.join(input_dir, 'PAAD_Model_MIR100HG_Expression_Levels.csv'))
# gene expression data
gene_exp = pd.read_csv(os.path.join(input_dir, 'PAAD_Model_Gene_Expression_Features.csv'))
# Methylation data
methylation = pd.read_csv(os.path.join(input_dir, 'PAAD_Model_Methylation_Features.csv'))

print("The expression data shape of MIR100HG:", mir_exp.shape)
print("Gene expression data shape:", gene_exp.shape)
print("Methylated data shape:", methylation.shape)

# Data pre-processing
print("\nData pre-processing...")

# Extract the label and encode it as 1 (High) / 0 (Low)
y_data = mir_exp[['Sample_ID', 'Group']].copy()  # copy so the assignment below does not modify a view
y_data['Group'] = y_data['Group'].map({'High': 1, 'Low': 0})

# Process gene expression data
print("\nProcess gene expression data...")
gene_exp_pivot = gene_exp.copy()

# Check for the presence of the MIR100HG gene and remove it from the features
if 'MIR100HG' in gene_exp_pivot['HGNC_Symbol'].values:
    print("The MIR100HG gene was detected and removed from the features...")
    gene_exp_pivot = gene_exp_pivot[gene_exp_pivot['HGNC_Symbol'] != 'MIR100HG']

gene_exp_pivot = gene_exp_pivot.set_index('HGNC_Symbol')
gene_exp_pivot = gene_exp_pivot.transpose()
gene_exp_pivot = gene_exp_pivot.reset_index()
gene_exp_pivot = gene_exp_pivot.rename(columns={'index': 'Sample_ID'})

# Process methylation data
print("\nProcess methylation data...")
methylation_pivot = methylation.copy()
methylation_pivot = methylation_pivot.set_index('Probe_ID')
methylation_pivot = methylation_pivot.transpose()
methylation_pivot = methylation_pivot.reset_index()
methylation_pivot = methylation_pivot.rename(columns={'index': 'Sample_ID'})

# Combine data
print("\nCombine data...")
# First, combine MIR and gene expression
tmp_merged = y_data.merge(gene_exp_pivot, on='Sample_ID', how='inner')
print(f"The shape after the combination of MIR and gene expression: {tmp_merged.shape}")

# And then merge the methylated data
merged_data = tmp_merged.merge(methylation_pivot, on='Sample_ID', how='inner')
print(f"The shape after adding methylation data: {merged_data.shape}")

# Check the merged data
if merged_data.shape[0] == 0:
    print("Warning: The merged data is empty!")
    import sys
    sys.exit(1)

print(f"The shape of the merged data: {merged_data.shape}")
print(f"sample size: {merged_data.shape[0]}")
print(f"The number of features (including Sample_ID and Group): {merged_data.shape[1]}")

# Split the data into X and y
X = merged_data.drop(['Sample_ID', 'Group'], axis=1)
y = merged_data['Group']
sample_ids = merged_data['Sample_ID']

# Add prefixes to the features
gene_features = gene_exp_pivot.columns.tolist()[1:]  # skip Sample_ID
methylation_features = methylation_pivot.columns.tolist()[1:]  # skip Sample_ID

gene_prefix = {feature: f"Gene_{feature}" for feature in gene_features}
methylation_prefix = {feature: f"Methylation_{feature}" for feature in methylation_features}

# Renaming a Column
X = X.rename(columns={**gene_prefix, **methylation_prefix})

print(f"\nThe number of features after adding the prefix: {X.shape[1]}")
print(f"The quantity of gene expression characteristics: {len(gene_features)}")
print(f"The number of methylation characteristics: {len(methylation_features)}")

# Ensure that any remaining MIR100HG-related features are deleted
mir100hg_cols = [col for col in X.columns if 'MIR100HG' in col]
if len(mir100hg_cols) > 0:
    print(f"Delete the following MIR100HG-related features: {mir100hg_cols}")
    X = X.drop(mir100hg_cols, axis=1)

# Obtain the columns of gene expression and methylation characteristics
gene_columns = [col for col in X.columns if col.startswith('Gene_')]
methylation_columns = [col for col in X.columns if col.startswith('Methylation_')]

# Divide the training set and the test set
print("\nDivide the training set and the test set...")
X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, sample_ids, test_size=0.2, random_state=42, stratify=y
)

print(f"The number of training set samples: {X_train.shape[0]}")
print(f"The number of test set samples: {X_test.shape[0]}")

# Feature selection function - for feature selection of gene expression and methylation data respectively
def select_features(X_train, y_train, var_threshold=0.005, k_gene=200, k_meth=100):
    # Isolate the characteristics of gene expression and methylation
    X_train_gene = X_train[gene_columns]
    X_train_meth = X_train[methylation_columns]
    
    print("Before feature selection:")
    print(f"Total number of features: {X_train.shape[1]}")
    print(f"The quantity of gene expression characteristics: {X_train_gene.shape[1]}")
    print(f"The number of methylation characteristics: {X_train_meth.shape[1]}")
    
    # 1. Apply variance filtering to the gene expression features
    selector_var_gene = VarianceThreshold(threshold=var_threshold)
    X_train_gene_var = selector_var_gene.fit_transform(X_train_gene)
    gene_var_features = X_train_gene.columns[selector_var_gene.get_support()].tolist()
    print(f"The number of gene expression characteristics after variance filtering: {len(gene_var_features)}")
    
    # 2. Perform variance filtering on the methylation characteristics
    selector_var_meth = VarianceThreshold(threshold=var_threshold)
    X_train_meth_var = selector_var_meth.fit_transform(X_train_meth)
    meth_var_features = X_train_meth.columns[selector_var_meth.get_support()].tolist()
    print(f"The number of methylated features after variance filtering: {len(meth_var_features)}")
    
    # 3. The ANOVA F test was conducted on the gene expression characteristics
    X_train_gene_var_df = pd.DataFrame(X_train_gene_var, columns=gene_var_features, index=X_train.index)
    k_gene_actual = min(k_gene, X_train_gene_var_df.shape[1])
    selector_f_gene = SelectKBest(f_classif, k=k_gene_actual)
    _ = selector_f_gene.fit_transform(X_train_gene_var_df, y_train)
    gene_selected_features = X_train_gene_var_df.columns[selector_f_gene.get_support()].tolist()
    print(f"The number of gene expression characteristics selected by ANOVA: {len(gene_selected_features)}")
    
    # 4. ANOVA F test was conducted on the methylation characteristics
    X_train_meth_var_df = pd.DataFrame(X_train_meth_var, columns=meth_var_features, index=X_train.index)
    k_meth_actual = min(k_meth, X_train_meth_var_df.shape[1])
    selector_f_meth = SelectKBest(f_classif, k=k_meth_actual)
    _ = selector_f_meth.fit_transform(X_train_meth_var_df, y_train)
    meth_selected_features = X_train_meth_var_df.columns[selector_f_meth.get_support()].tolist()
    print(f"The number of methylated features selected by ANOVA: {len(meth_selected_features)}")
    
    # Merge the selected features
    all_selected_features = gene_selected_features + meth_selected_features
    print(f"The total number of selected features after merging: {len(all_selected_features)}")
    print(f"Gene expression characteristics: {len(gene_selected_features)}")
    print(f"Methylation characteristics: {len(meth_selected_features)}")
    
    # Build a boolean mask over all columns that marks the selected features
    combined_support = X_train.columns.isin(all_selected_features)
    
    return combined_support, all_selected_features

# Feature selection is performed using the training set
print("\nFeature selection is performed on the training set...")
feature_support, selected_features = select_features(X_train, y_train)

# Apply the feature selection mask to both sets
X_train_selected = X_train.loc[:, feature_support]
X_test_selected = X_test.loc[:, feature_support]

print(f"The number of selected features: {X_train_selected.shape[1]}")
print("The number of selected gene expression characteristics:", len([f for f in X_train_selected.columns if f.startswith('Gene_')]))
print("The number of selected methylation features:", len([f for f in X_train_selected.columns if f.startswith('Methylation_')]))

# Standardize the features - this is very important for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_selected)
X_test_scaled = scaler.transform(X_test_selected)

# Convert back to the DataFrame to retain the column names
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train_selected.columns, index=X_train_selected.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test_selected.columns, index=X_test_selected.index)

# Feature Importance Analysis
def feature_importance_analysis(model, X):
    # For logistic regression, the absolute value of the coefficients is used as the feature importance
    importances = np.abs(model.coef_[0])
    
    # Create a DataFrame to store the importance of features
    feature_importance_df = pd.DataFrame({
        'Feature': X.columns,
        'Importance': importances
    })
    
    # Sort by importance
    feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
    
    # Calculate the percentage
    feature_importance_df['Percentage'] = feature_importance_df['Importance'] / feature_importance_df['Importance'].sum() * 100
    
    return feature_importance_df

# Train the logistic regression model
print("\nTrain the logistic regression model...")
lr_model = LogisticRegression(
    penalty='l2',  # L2 regularization
    C=1.0,         # The reciprocal of the regularization intensity
    solver='liblinear',  # solver
    max_iter=1000,  # Maximum iterations 
    random_state=42
)

lr_model.fit(X_train_scaled_df, y_train)

# Analyze the importance of characteristics
print("\nAnalyze the importance of characteristics...")
feature_importance = feature_importance_analysis(lr_model, X_train_scaled_df)

# Classify the features and calculate the total importance of each type of feature
gene_importance = feature_importance[feature_importance['Feature'].str.startswith('Gene_')]
methylation_importance = feature_importance[feature_importance['Feature'].str.startswith('Methylation_')]

gene_importance_sum = gene_importance['Importance'].sum()
methylation_importance_sum = methylation_importance['Importance'].sum()
total_importance = gene_importance_sum + methylation_importance_sum

# Statistics on the importance of features
print("\nPercentage of feature importance:")
print(f"Gene expression characteristics: {gene_importance_sum / total_importance * 100:.2f}%")
print(f"Methylation characteristics: {methylation_importance_sum / total_importance * 100:.2f}%")

# Print the top 20 features for each category
print("\nTop 20 gene expression features:")
print(gene_importance.head(20).to_string(index=False))

print("\nTop 20 methylation features:")
print(methylation_importance.head(20).to_string(index=False))

print("\nOverall top 20 features:")
print(feature_importance.head(20).to_string(index=False))

# Save the results of feature importance
gene_importance.to_csv(os.path.join(output_dir, 'PAAD_Logistic_Regression_Gene_Importance.csv'), index=False)
methylation_importance.to_csv(os.path.join(output_dir, 'PAAD_Logistic_Regression_Methylation_Importance.csv'), index=False)
feature_importance.to_csv(os.path.join(output_dir, 'PAAD_Logistic_Regression_All_Features_Importance.csv'), index=False)

# Evaluate the model on the test set
print("\nEvaluate the model on the test set...")
y_pred = lr_model.predict(X_test_scaled_df)
y_prob = lr_model.predict_proba(X_test_scaled_df)[:, 1]

# Compute evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
cm = confusion_matrix(y_test, y_pred)

# Print evaluation results
print("\nTest set performance index:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print(f"AUC: {auc:.4f}")
print("Confusion Matrix :")
print(cm)

# Save the performance indicators of the model
with open(os.path.join(output_dir, 'PAAD_Logistic_Regression_Model_Performance.txt'), 'w') as f:
    f.write(f"Test set performance indicators:\n")
    f.write(f"Accuracy: {accuracy:.4f}\n")
    f.write(f"Precision: {precision:.4f}\n")
    f.write(f"Recall: {recall:.4f}\n")
    f.write(f"F1-score: {f1:.4f}\n")
    f.write(f"AUC: {auc:.4f}\n")
    f.write("Confusion Matrix:\n")
    f.write(f"{cm[0][0]} {cm[0][1]}\n")
    f.write(f"{cm[1][0]} {cm[1][1]}\n")

# Cross-validation
print("\nCross-validation...")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {
    'fold': [],
    'accuracy': [],
    'precision': [],
    'recall': [],
    'f1': [],
    'auc': []
}

fold = 1
for train_idx, val_idx in skf.split(X, y):
    print(f"\nRunning cross-validation fold {fold}...")
    
    # Get the training and validation data for the current fold
    X_fold_train, X_fold_val = X.iloc[train_idx], X.iloc[val_idx]
    y_fold_train, y_fold_val = y.iloc[train_idx], y.iloc[val_idx]
    
    # Feature selection is performed within the fold
    print(f"Fold {fold} - feature selection...")
    fold_feature_support, fold_selected_features = select_features(X_fold_train, y_fold_train)
    
    # Application feature selection
    X_fold_train_selected = X_fold_train.loc[:, fold_feature_support]
    X_fold_val_selected = X_fold_val.loc[:, fold_feature_support]
    
    # standardization
    fold_scaler = StandardScaler()
    X_fold_train_scaled = fold_scaler.fit_transform(X_fold_train_selected)
    X_fold_val_scaled = fold_scaler.transform(X_fold_val_selected)
    
    # transform to DataFrame
    X_fold_train_scaled_df = pd.DataFrame(X_fold_train_scaled, columns=X_fold_train_selected.columns, index=X_fold_train_selected.index)
    X_fold_val_scaled_df = pd.DataFrame(X_fold_val_scaled, columns=X_fold_val_selected.columns, index=X_fold_val_selected.index)
    
    # Training model
    print(f"Fold {fold} - Training model...")
    fold_model = LogisticRegression(
        penalty='l2',
        C=1.0,
        solver='liblinear',
        max_iter=1000,
        random_state=42
    )
    fold_model.fit(X_fold_train_scaled_df, y_fold_train)
    
    # prediction
    y_fold_pred = fold_model.predict(X_fold_val_scaled_df)
    y_fold_prob = fold_model.predict_proba(X_fold_val_scaled_df)[:, 1]
    
    # Calculate performance index
    fold_accuracy = accuracy_score(y_fold_val, y_fold_pred)
    fold_precision = precision_score(y_fold_val, y_fold_pred)
    fold_recall = recall_score(y_fold_val, y_fold_pred)
    fold_f1 = f1_score(y_fold_val, y_fold_pred)
    fold_auc = roc_auc_score(y_fold_val, y_fold_prob)
    
    # Print the result of this fold
    print(f"Fold {fold} - performance index:")
    print(f"  accuracy: {fold_accuracy:.4f}")
    print(f"  precision: {fold_precision:.4f}")
    print(f"  recall: {fold_recall:.4f}")
    print(f"  F1-score: {fold_f1:.4f}")
    print(f"  AUC: {fold_auc:.4f}")
    
    # save results
    cv_results['fold'].append(fold)
    cv_results['accuracy'].append(fold_accuracy)
    cv_results['precision'].append(fold_precision)
    cv_results['recall'].append(fold_recall)
    cv_results['f1'].append(fold_f1)
    cv_results['auc'].append(fold_auc)
    
    # Analyze the importance of characteristics
    fold_importance = feature_importance_analysis(fold_model, X_fold_train_scaled_df)
    
    # save the importance of characteristics
    fold_importance.to_csv(os.path.join(output_dir, f'PAAD_LogisticRegression_Fold{fold}_Feature_Importance.csv'), index=False)
    
    fold += 1

# Calculate the average performance of cross-validation
cv_results_df = pd.DataFrame(cv_results)
cv_mean = cv_results_df.drop(columns='fold').mean()  # average the metrics, not the fold index

print("\nMean cross-validation performance:")
print(f"mean accuracy: {cv_mean['accuracy']:.4f}")
print(f"mean precision: {cv_mean['precision']:.4f}")
print(f"mean recall: {cv_mean['recall']:.4f}")
print(f"mean F1-score: {cv_mean['f1']:.4f}")
print(f"mean AUC: {cv_mean['auc']:.4f}")

# Save the cross-validation results
cv_results_df.to_csv(os.path.join(output_dir, 'PAAD_Logistic_Regression_CV_Results.csv'), index=False)

print("\nAnalysis completed! All results have been saved to:", output_dir)

load data...
The expression data shape of MIR100HG: (178, 4)
Gene expression data shape: (3317, 179)
Methylated data shape: (10248, 186)

Data pre-processing...

Process gene expression data...
The MIR100HG gene was detected and removed from the features...

Process methylation data...

Combine data...
The shape after the combination of MIR and gene expression: (178, 3318)
The shape after adding methylation data: (178, 13566)
The shape of the merged data: (178, 13566)
sample size: 178
The number of features (including Sample_ID and Group): 13566

The number of features after adding the prefix: 13564
The quantity of gene expression characteristics: 3316
The number of methylation characteristics: 10248

Divide the training set and the test set...
The number of training set samples: 142
The number of test set samples: 36

Feature selection is performed on the training set...
Before feature selection:
Total number of features: 13564
The quantity of gene expression characteristics: 3322
The

# Sort out the original data before DEG and DMA

In [None]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import VarianceThreshold, f_classif, SelectKBest
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
import warnings

# Ignore the warning
warnings.filterwarnings('ignore')

# Set path
input_dir = r'D:\project data\M-28\NTU_DATA_CLEANED'

# Make sure the output directory exists
# NOTE: output_dir is not defined in this cell; it must already be set by the previous cell
os.makedirs(output_dir, exist_ok=True)

# load data
print("load data...")
# Data of MIR100HG expression level
mir_exp = pd.read_csv(os.path.join(input_dir, 'PAAD_Model_MIR100HG_Expression_Levels.csv'))
# gene expression data
gene_exp = pd.read_csv(os.path.join(input_dir, 'PAAD_Model_Gene_Expression_Features_Test.csv'))
# Methylation data
methylation = pd.read_csv(os.path.join(input_dir, 'PAAD_Model_Methylation_Features_Test.csv'))

print("The expression data shape of MIR100HG:", mir_exp.shape)
print("Gene expression data shape:", gene_exp.shape)
print("Methylated data shape:", methylation.shape)

# data pre-processing
print("\ndata pre-processing...")

# Extract the label and encode it as 1 (High) / 0 (Low)
y_data = mir_exp[['Sample_ID', 'Group']].copy()  # copy so the assignment below does not modify a view
y_data['Group'] = y_data['Group'].map({'High': 1, 'Low': 0})

# Process gene expression data
print("\nProcess gene expression data...")
gene_exp_pivot = gene_exp.copy()

# Check for the presence of the MIR100HG gene and remove it from the features
if 'MIR100HG' in gene_exp_pivot['HGNC_Symbol'].values:
    print("The MIR100HG gene was detected and removed from the features...")
    gene_exp_pivot = gene_exp_pivot[gene_exp_pivot['HGNC_Symbol'] != 'MIR100HG']

gene_exp_pivot = gene_exp_pivot.set_index('HGNC_Symbol')
gene_exp_pivot = gene_exp_pivot.transpose()
gene_exp_pivot = gene_exp_pivot.reset_index()
gene_exp_pivot = gene_exp_pivot.rename(columns={'index': 'Sample_ID'})

# Process methylation data
print("\nProcess methylation data...")
methylation_pivot = methylation.copy()
methylation_pivot = methylation_pivot.set_index('Probe_ID')
methylation_pivot = methylation_pivot.transpose()
methylation_pivot = methylation_pivot.reset_index()
methylation_pivot = methylation_pivot.rename(columns={'index': 'Sample_ID'})

# merging data
print("\nmerging data...")
# First, combine MIR and gene expression
tmp_merged = y_data.merge(gene_exp_pivot, on='Sample_ID', how='inner')
print(f"The shape after the combination of MIR and gene expression: {tmp_merged.shape}")

# Then combine the methylation data
merged_data = tmp_merged.merge(methylation_pivot, on='Sample_ID', how='inner')
print(f"The shape after adding methylation data: {merged_data.shape}")

# Check the merged data
if merged_data.shape[0] == 0:
    print("Warning: The merged data is empty!")
    import sys
    sys.exit(1)

print(f"The shape of the merged data: {merged_data.shape}")
print(f"sample size: {merged_data.shape[0]}")
print(f"The number of features (including Sample_ID and Group): {merged_data.shape[1]}")

# Split the data into X and y
X = merged_data.drop(['Sample_ID', 'Group'], axis=1)
y = merged_data['Group']
sample_ids = merged_data['Sample_ID']

# Add prefixes to the features
gene_features = gene_exp_pivot.columns.tolist()[1:]  # skip Sample_ID
methylation_features = methylation_pivot.columns.tolist()[1:]  # skip Sample_ID

gene_prefix = {feature: f"Gene_{feature}" for feature in gene_features}
methylation_prefix = {feature: f"Methylation_{feature}" for feature in methylation_features}

# Renaming a Column
X = X.rename(columns={**gene_prefix, **methylation_prefix})

print(f"\nThe number of features after adding the prefix: {X.shape[1]}")
print(f"The quantity of gene expression characteristics: {len(gene_features)}")
print(f"The number of methylation characteristics: {len(methylation_features)}")

# Ensure that any remaining MIR100HG-related features are deleted
mir100hg_cols = [col for col in X.columns if 'MIR100HG' in col]
if len(mir100hg_cols) > 0:
    print(f"Delete the following MIR100HG-related features: {mir100hg_cols}")
    X = X.drop(mir100hg_cols, axis=1)

# Obtain the columns of gene expression and methylation characteristics
gene_columns = [col for col in X.columns if col.startswith('Gene_')]
methylation_columns = [col for col in X.columns if col.startswith('Methylation_')]

# Divide the training set and the test set
print("\nDivide the training set and the test set...")
X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, sample_ids, test_size=0.2, random_state=42, stratify=y
)

print(f"The number of training set samples: {X_train.shape[0]}")
print(f"The number of test set samples: {X_test.shape[0]}")

# Feature selection function - for feature selection of gene expression and methylation data respectively
def select_features(X_train, y_train, var_threshold=0.005, k_gene=200, k_meth=100):
    # Isolate the characteristics of gene expression and methylation
    X_train_gene = X_train[gene_columns]
    X_train_meth = X_train[methylation_columns]
    
    print("Before feature selection:")
    print(f"Total number of features: {X_train.shape[1]}")
    print(f"The quantity of gene expression characteristics: {X_train_gene.shape[1]}")
    print(f"The number of methylation characteristics: {X_train_meth.shape[1]}")
    
    # 1. Perform variance filtering on the gene expression characteristics
    selector_var_gene = VarianceThreshold(threshold=var_threshold)
    X_train_gene_var = selector_var_gene.fit_transform(X_train_gene)
    gene_var_features = X_train_gene.columns[selector_var_gene.get_support()].tolist()
    print(f"The number of gene expression characteristics after variance filtering: {len(gene_var_features)}")
    
    # 2. Perform variance filtering on the methylation characteristics
    selector_var_meth = VarianceThreshold(threshold=var_threshold)
    X_train_meth_var = selector_var_meth.fit_transform(X_train_meth)
    meth_var_features = X_train_meth.columns[selector_var_meth.get_support()].tolist()
    print(f"The number of methylated features after variance filtering: {len(meth_var_features)}")
    
    # 3. The ANOVA F test was conducted on the gene expression characteristics
    X_train_gene_var_df = pd.DataFrame(X_train_gene_var, columns=gene_var_features, index=X_train.index)
    k_gene_actual = min(k_gene, X_train_gene_var_df.shape[1])
    selector_f_gene = SelectKBest(f_classif, k=k_gene_actual)
    _ = selector_f_gene.fit_transform(X_train_gene_var_df, y_train)
    gene_selected_features = X_train_gene_var_df.columns[selector_f_gene.get_support()].tolist()
    print(f"The number of gene expression characteristics selected by ANOVA: {len(gene_selected_features)}")
    
    # 4. ANOVA F-test (SelectKBest) on the methylation features
    X_train_meth_var_df = pd.DataFrame(X_train_meth_var, columns=meth_var_features, index=X_train.index)
    k_meth_actual = min(k_meth, X_train_meth_var_df.shape[1])
    selector_f_meth = SelectKBest(f_classif, k=k_meth_actual)
    _ = selector_f_meth.fit_transform(X_train_meth_var_df, y_train)
    meth_selected_features = X_train_meth_var_df.columns[selector_f_meth.get_support()].tolist()
    print(f"The number of methylated features selected by ANOVA: {len(meth_selected_features)}")
    
    # Merge the selected features
    all_selected_features = gene_selected_features + meth_selected_features
    print(f"The total number of selected features after merging: {len(all_selected_features)}")
    print(f"Gene expression characteristics: {len(gene_selected_features)}")
    print(f"Methylation characteristics: {len(meth_selected_features)}")
    
    # Build a boolean mask over the columns of X_train marking the selected features
    combined_support = X_train.columns.isin(all_selected_features)
    
    return combined_support, all_selected_features
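
# --- Illustrative sketch (an addition, not part of the original pipeline) ---
# A minimal, self-contained example of the two-stage selection used above
# (VarianceThreshold followed by SelectKBest with f_classif), run on synthetic
# data for a single feature block. The helper name and all toy values are
# assumptions made for illustration only.
def demo_two_stage_selection(n_samples=60, n_features=30, k=5, seed=0):
    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

    rng = np.random.default_rng(seed)
    y_toy = np.array([0, 1] * (n_samples // 2))
    X_toy = pd.DataFrame(
        rng.normal(size=(len(y_toy), n_features)),
        columns=[f"Gene_{i}" for i in range(n_features)],
    )
    X_toy["Gene_0"] += 3.0 * y_toy   # make one feature clearly informative
    X_toy["Gene_1"] = 1.0            # constant feature: removed by the variance filter

    # Stage 1: drop (near-)constant features
    var_sel = VarianceThreshold(threshold=0.005).fit(X_toy)
    kept = X_toy.columns[var_sel.get_support()]

    # Stage 2: keep the k features with the highest ANOVA F-statistic
    kbest = SelectKBest(f_classif, k=min(k, len(kept))).fit(X_toy[kept], y_toy)
    return kept[kbest.get_support()].tolist()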

# Run feature selection on the training set only
print("\nFeature selection is performed on the training set...")
feature_support, selected_features = select_features(X_train, y_train)

# Apply the selected-feature mask to both the training and test sets
X_train_selected = X_train.loc[:, feature_support]
X_test_selected = X_test.loc[:, feature_support]

print(f"The number of selected features: {X_train_selected.shape[1]}")
print("The number of selected gene expression characteristics:", len([f for f in X_train_selected.columns if f.startswith('Gene_')]))
print("The number of selected methylation features:", len([f for f in X_train_selected.columns if f.startswith('Methylation_')]))

# Standardize the features (fit on the training set only); regularized logistic regression is sensitive to feature scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_selected)
X_test_scaled = scaler.transform(X_test_selected)

# Convert back to DataFrames to retain the column names
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train_selected.columns, index=X_train_selected.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test_selected.columns, index=X_test_selected.index)

# Feature Importance Analysis
def feature_importance_analysis(model, X):
    # For logistic regression, the absolute value of the coefficients is used as the feature importance
    importances = np.abs(model.coef_[0])
    
    # Create a DataFrame to store the importance of features
    feature_importance_df = pd.DataFrame({
        'Feature': X.columns,
        'Importance': importances
    })
    
    # Sort by importance
    feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
    
    # Calculate the percentage
    feature_importance_df['Percentage'] = feature_importance_df['Importance'] / feature_importance_df['Importance'].sum() * 100
    
    return feature_importance_df
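
# --- Illustrative sketch (an addition, not part of the original pipeline) ---
# The function above ranks features by |coefficient|, which discards direction.
# This toy example keeps the signed coefficient next to its absolute value, so a
# reader can see whether a feature pushes the prediction toward class 1 or class 0.
# The helper name, column names, and synthetic data are assumptions for illustration.
def demo_signed_importance(seed=0):
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(seed)
    y_toy = np.array([0, 1] * 50)
    X_toy = pd.DataFrame({
        "Gene_up": rng.normal(size=100) + 2.0 * y_toy,     # higher in class 1
        "Gene_down": rng.normal(size=100) - 2.0 * y_toy,   # lower in class 1
        "Gene_noise": rng.normal(size=100),                # unrelated to the label
    })
    X_scaled = StandardScaler().fit_transform(X_toy)

    # L2 is the default penalty of LogisticRegression
    model = LogisticRegression(C=1.0, solver="liblinear", max_iter=1000, random_state=42)
    model.fit(X_scaled, y_toy)

    coef = model.coef_[0]
    table = pd.DataFrame({
        "Feature": X_toy.columns,
        "Coefficient": coef,
        "Importance": np.abs(coef),
    })
    return table.sort_values("Importance", ascending=False)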

# Train the logistic regression model
print("\nTrain the logistic regression model...")
lr_model = LogisticRegression(
    penalty='l2',        # L2 regularization
    C=1.0,               # Inverse of regularization strength (smaller = stronger)
    solver='liblinear',  # Suitable for small datasets; supports L1 and L2 penalties
    max_iter=1000,       # Maximum number of solver iterations
    random_state=42
)

lr_model.fit(X_train_scaled_df, y_train)

# Analyze feature importance
print("\nAnalyze the importance of characteristics...")
feature_importance = feature_importance_analysis(lr_model, X_train_scaled_df)

# Group the features by type and sum the importance of each type
gene_importance = feature_importance[feature_importance['Feature'].str.startswith('Gene_')]
methylation_importance = feature_importance[feature_importance['Feature'].str.startswith('Methylation_')]

gene_importance_sum = gene_importance['Importance'].sum()
methylation_importance_sum = methylation_importance['Importance'].sum()
total_importance = gene_importance_sum + methylation_importance_sum

# Print the share of total importance for each feature type
print("\nPercentage of characteristic importance:")
print(f"Gene expression characteristics: {gene_importance_sum / total_importance * 100:.2f}%")
print(f"Methylation characteristics: {methylation_importance_sum / total_importance * 100:.2f}%")

# Print the Top20 features grouped by category
print("\nTop20 characteristics of gene expression:")
print(gene_importance.head(20).to_string(index=False))

print("\nMethylated top20 characteristics:")
print(methylation_importance.head(20).to_string(index=False))

print("\nOverall Top20 features:")
print(feature_importance.head(20).to_string(index=False))


# Evaluate the model on the test set
print("\nEvaluate the model on the test set...")
y_pred = lr_model.predict(X_test_scaled_df)
y_prob = lr_model.predict_proba(X_test_scaled_df)[:, 1]

# Compute performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
cm = confusion_matrix(y_test, y_pred)

# Print the evaluation results
print("\nTest set performance index:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print(f"AUC: {auc:.4f}")
print("confusion matrix:")
print(cm)

# Save the model's performance metrics
with open(os.path.join(output_dir, 'PAAD_LogisticRegression_Model_Performance.txt'), 'w') as f:
    f.write("Test set performance index:\n")
    f.write(f"Accuracy: {accuracy:.4f}\n")
    f.write(f"Precision: {precision:.4f}\n")
    f.write(f"Recall: {recall:.4f}\n")
    f.write(f"F1-score: {f1:.4f}\n")
    f.write(f"AUC: {auc:.4f}\n")
    f.write("Confusion Matrix:\n")
    f.write(f"{cm[0][0]} {cm[0][1]}\n")
    f.write(f"{cm[1][0]} {cm[1][1]}\n")
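
# --- Illustrative sketch (an addition, not part of the original pipeline) ---
# How to read the 2x2 confusion matrix written above: scikit-learn orders it as
# [[TN, FP], [FN, TP]] for labels 0/1, so sensitivity (recall) and specificity
# follow directly. The helper name and the toy labels are assumptions.
def demo_confusion_matrix_rates():
    from sklearn.metrics import confusion_matrix

    y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
    y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 1]
    tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
    return {
        "sensitivity": float(tp / (tp + fn)),   # true positive rate (recall)
        "specificity": float(tn / (tn + fp)),   # true negative rate
    }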

# Cross-validation
print("\nCross-validation...")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {
    'fold': [],
    'accuracy': [],
    'precision': [],
    'recall': [],
    'f1': [],
    'auc': []
}

fold = 1
for train_idx, val_idx in skf.split(X, y):
    print(f"\nPerform cross-validation for the {fold} fold...")
    
    # Get the training and validation data for the current fold
    X_fold_train, X_fold_val = X.iloc[train_idx], X.iloc[val_idx]
    y_fold_train, y_fold_val = y.iloc[train_idx], y.iloc[val_idx]
    
    # Feature selection is performed within the fold
    print(f"Fold {fold} - Feature selection...")
    fold_feature_support, fold_selected_features = select_features(X_fold_train, y_fold_train)
    
    # Apply the fold-specific feature mask
    X_fold_train_selected = X_fold_train.loc[:, fold_feature_support]
    X_fold_val_selected = X_fold_val.loc[:, fold_feature_support]
    
    # Standardize (scaler fitted on the fold's training data only)
    fold_scaler = StandardScaler()
    X_fold_train_scaled = fold_scaler.fit_transform(X_fold_train_selected)
    X_fold_val_scaled = fold_scaler.transform(X_fold_val_selected)
    
    # Convert back to DataFrames to retain the column names
    X_fold_train_scaled_df = pd.DataFrame(X_fold_train_scaled, columns=X_fold_train_selected.columns, index=X_fold_train_selected.index)
    X_fold_val_scaled_df = pd.DataFrame(X_fold_val_scaled, columns=X_fold_val_selected.columns, index=X_fold_val_selected.index)
    
    # Train the model
    print(f"Fold {fold} - training model...")
    fold_model = LogisticRegression(
        penalty='l2',
        C=1.0,
        solver='liblinear',
        max_iter=1000,
        random_state=42
    )
    fold_model.fit(X_fold_train_scaled_df, y_fold_train)
    
    # Predict on the validation fold
    y_fold_pred = fold_model.predict(X_fold_val_scaled_df)
    y_fold_prob = fold_model.predict_proba(X_fold_val_scaled_df)[:, 1]
    
    # Compute performance metrics
    fold_accuracy = accuracy_score(y_fold_val, y_fold_pred)
    fold_precision = precision_score(y_fold_val, y_fold_pred)
    fold_recall = recall_score(y_fold_val, y_fold_pred)
    fold_f1 = f1_score(y_fold_val, y_fold_pred)
    fold_auc = roc_auc_score(y_fold_val, y_fold_prob)
    
    # Print the result of this fold
    print(f"Fold {fold} - performance index:")
    print(f"  accuracy: {fold_accuracy:.4f}")
    print(f"  precision: {fold_precision:.4f}")
    print(f"  recall: {fold_recall:.4f}")
    print(f"  F1-score: {fold_f1:.4f}")
    print(f"  AUC: {fold_auc:.4f}")
    
    # Store the results
    cv_results['fold'].append(fold)
    cv_results['accuracy'].append(fold_accuracy)
    cv_results['precision'].append(fold_precision)
    cv_results['recall'].append(fold_recall)
    cv_results['f1'].append(fold_f1)
    cv_results['auc'].append(fold_auc)
    
    # Analyze feature importance for this fold
    fold_importance = feature_importance_analysis(fold_model, X_fold_train_scaled_df)
    
    # Save the feature importance table
    fold_importance.to_csv(os.path.join(output_dir, f'PAAD_LogisticRegression_Fold{fold}_Feature_Importance.csv'), index=False)
    
    fold += 1

# Average the metrics across folds (the fold index itself is excluded)
cv_results_df = pd.DataFrame(cv_results)
cv_mean = cv_results_df.drop(columns='fold').mean()

print("\nmean performance of cross-validation:")
print(f"mean accuracy: {cv_mean['accuracy']:.4f}")
print(f"mean precision: {cv_mean['precision']:.4f}")
print(f"mean recall: {cv_mean['recall']:.4f}")
print(f"mean F1-score: {cv_mean['f1']:.4f}")
print(f"mean AUC: {cv_mean['auc']:.4f}")
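
# --- Illustrative sketch (an addition, not part of the original pipeline) ---
# The loop above refits selection and scaling inside every fold, which keeps
# validation data out of those steps. One way to express the same idea more
# compactly is a scikit-learn Pipeline with a ColumnTransformer that selects
# features per block. This is a sketch on synthetic data; the helper name, toy
# sizes, and k values are assumptions, not the settings used in this workflow.
def demo_pipeline_cv(seed=0):
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_validate
    from sklearn.pipeline import Pipeline, make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(seed)
    n = 100
    y_toy = np.array([0, 1] * (n // 2))
    gene_cols = [f"Gene_{i}" for i in range(20)]
    meth_cols = [f"Methylation_{i}" for i in range(20)]
    X_toy = pd.DataFrame(rng.normal(size=(n, 40)), columns=gene_cols + meth_cols)
    X_toy["Gene_0"] += 2.0 * y_toy          # informative gene feature
    X_toy["Methylation_0"] -= 2.0 * y_toy   # informative methylation feature

    def block_selector(k):
        # Variance filter followed by ANOVA F-test, fitted per feature block
        return make_pipeline(VarianceThreshold(threshold=0.005), SelectKBest(f_classif, k=k))

    selector = ColumnTransformer([
        ("gene", block_selector(5), gene_cols),
        ("meth", block_selector(3), meth_cols),
    ])
    pipe = Pipeline([
        ("select", selector),
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(C=1.0, solver="liblinear", max_iter=1000, random_state=42)),
    ])

    # Every step of the pipeline is refitted on the training part of each fold
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_validate(pipe, X_toy, y_toy, cv=cv, scoring=["accuracy", "roc_auc"])
    return {name: float(scores[f"test_{name}"].mean()) for name in ("accuracy", "roc_auc")}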


load data...
Methylation data: (178, 4)
Gene expression data shape: (40060, 179)
Methylated data shape: (374096, 186)

data pre-processing...

Process gene expression data...
The MIR100HG gene was detected and removed from the features...

Process methylation data...

merging data...
The shape after the combination of MIR and gene expression: (178, 40061)
The shape after adding methylation data: (178, 414157)
The shape of the merged data: (178, 414157)
sample size: 178
The number of features (including Sample_ID and Group): 414157

The number of features after adding the prefix: 414155
The quantity of gene expression characteristics: 40059
The number of methylation characteristics: 374096

Divide the training set and the test set...
The number of training set samples: 142
The number of test set samples: 36

Feature selection is performed on the training set...
Before feature selection:
Total number of features: 414155
The quantity of gene expression characteristics: 40059
The number of