# MAFLD Prediction in T2DM — Reproducible Pipeline (PLOS ONE Revision)

**Prepared for peer review — 2025-10-17**

This notebook consolidates the full analysis pipeline used in the manuscript:
- Data cleaning and validation
- Missingness mechanism assessment and MAR simulation
- Imputation benchmarking across multiple methods
- Outlier detection
- Class balancing
- Feature selection (including GA + Taguchi tuning)
- Model training with stratified K-fold CV and hyperparameter search
- Evaluation with variance estimates (repeated CV)
- Feature importance (XGBoost, GB, LightGBM) and SHAP

**Reproducibility notes**
- Random seeds are centralized below.
- All tuning/selection steps happen **inside** CV loops to avoid leakage.
- Figures and tables generated here align with the revised manuscript and supplementary materials.

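The two reproducibility points above (centralized seeds, and selection steps kept inside the CV loop) can be illustrated with a minimal sketch. It uses synthetic data from `make_classification` as a stand-in for the study data, and the names `X_demo`, `y_demo`, `leak_free`, and `fold_scores` are illustrative assumptions, not objects from the manuscript pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the study data (assumption: numeric features, binary outcome)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=42
)

# Every data-dependent step sits inside the Pipeline, so cross_val_score
# re-fits the scaler and the selector on the training part of each fold only.
leak_free = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=6)),
    ("clf", LogisticRegression(max_iter=200)),
])

cv_seeds = [42, 52, 62]  # first three entries of SEEDS["cv_seeds"]
fold_scores = []
for seed in cv_seeds:
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    fold_scores.extend(cross_val_score(leak_free, X_demo, y_demo, cv=cv, scoring="roc_auc"))

fold_scores = np.asarray(fold_scores)
print(f"AUC over {len(cv_seeds)} x 5 folds: "
      f"{fold_scores.mean():.3f} +/- {fold_scores.std(ddof=1):.3f}")
```

Repeating the split with several seeds gives a spread for each metric rather than a single point estimate, which is one way to obtain the variance estimates listed in the overview.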

## 0) Setup & Configuration


In [None]:

# ---- Global configuration & seeds ----
from pathlib import Path
import numpy as np
import pandas as pd

SEEDS = {
    "global": 42,
    "cv_seeds": [42, 52, 62, 72, 82, 92, 102, 112, 122, 132],
    "impute_seeds": [232, 123, 1, 313, 78, 121],
}

# Project paths (adjust if needed)
DATA_DIR = Path("data")       # place raw CSVs here if sharing
OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Display options
pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 160)
np.random.seed(SEEDS["global"])

print("Config initialized. Seeds set. OUTPUT_DIR =", OUTPUT_DIR.resolve())


## 1) Load and Inspect Data
Load source data and perform basic checks. Ensure that each row corresponds to a **unique patient** (no patient appears more than once).

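The uniqueness requirement can be checked programmatically. The sketch below assumes an identifier column named `patient_id`, which is hypothetical; the helper `assert_unique_patients` is likewise illustrative and not part of the original pipeline.

```python
import pandas as pd

def assert_unique_patients(frame: pd.DataFrame, id_col: str = "patient_id") -> None:
    """Raise ValueError if any identifier in `id_col` appears more than once."""
    dup_mask = frame[id_col].duplicated(keep=False)
    if dup_mask.any():
        repeated = sorted(frame.loc[dup_mask, id_col].unique().tolist())
        raise ValueError(f"{len(repeated)} patient(s) appear more than once: {repeated[:10]}")

# Toy example: patient 2 appears twice
toy = pd.DataFrame({"patient_id": [1, 2, 2, 3], "age": [54, 61, 61, 47]})
try:
    assert_unique_patients(toy)
except ValueError as err:
    print(err)

# Keeping the first record per patient makes the check pass
deduplicated = toy.drop_duplicates(subset="patient_id", keep="first")
assert_unique_patients(deduplicated)
```

If the dataset has no explicit identifier, `frame.duplicated(keep=False)` on all columns is a weaker fallback that only catches exact duplicate rows.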

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\4-imputed_data.csv")

# The last column is the label
label_col = data.columns[-1]
y = data.iloc[:, -1].copy()
X = data.iloc[:, :-1].copy()

# Standardize features before LOF (very important, since LOF is distance-based)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# LOF
lof = LocalOutlierFactor()
pred = lof.fit_predict(X_scaled)   # 1=inlier, -1=outlier

# Separate inliers and outliers while preserving the index
inliers_df = X.loc[pred == 1].copy()
outliers_df = X.loc[pred != 1].copy()

# Re-attach the label to each subset
inliers_df[label_col] = y.loc[inliers_df.index]
outliers_df[label_col] = y.loc[outliers_df.index]

# Save
inliers_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\5-outlier_removed.csv", index=False)
outliers_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\5.5-the_outliers.csv", index=False)

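The cell above runs `LocalOutlierFactor` with its defaults. A short sensitivity check on how many rows are flagged for different neighborhood sizes may help when reporting the outlier step. The following is a sketch on synthetic data; `X_toy` and the tested `n_neighbors` values are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in: 200 ordinary rows plus 5 rows placed far from the bulk
X_toy = np.vstack([
    rng.normal(0, 1, size=(200, 4)),
    rng.normal(8, 1, size=(5, 4)),
])
X_toy_scaled = StandardScaler().fit_transform(X_toy)

for k in (10, 20, 35):  # 20 is the scikit-learn default
    lof_k = LocalOutlierFactor(n_neighbors=k)
    flags = lof_k.fit_predict(X_toy_scaled)      # 1 = inlier, -1 = outlier
    scores = -lof_k.negative_outlier_factor_     # larger value = more anomalous
    n_flagged = int((flags == -1).sum())
    print(f"n_neighbors={k}: flagged {n_flagged} rows, max LOF score {scores.max():.2f}")
```

Reporting the number of removed rows alongside the chosen `n_neighbors` makes the outlier step easier to reproduce.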

In [None]:
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
# Load your dataset
data = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\5-outlier_removed.csv")

# Separate features (X) and target (y)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Initialize the RandomUnderSampler
under_sampler = RandomUnderSampler(random_state=101)

# Apply random undersampling
X_resampled, y_resampled = under_sampler.fit_resample(X, y)

# Combine resampled features and target into a DataFrame
resampled_data = pd.DataFrame(X_resampled, columns=data.columns[:-1])
resampled_data["fattyliver"] = y_resampled

# Save the resampled balanced dataset to a new CSV file
resampled_data.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv", index=False)

print("Random undersampling completed and balanced dataset saved.")

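To make the balancing step easier to audit, the sketch below reproduces the idea of random undersampling with pandas only. It is not the imbalanced-learn implementation; the helper name `undersample_to_minority` and the toy column `bmi` are assumptions made for illustration.

```python
import pandas as pd

def undersample_to_minority(frame: pd.DataFrame, label_col: str,
                            random_state: int = 101) -> pd.DataFrame:
    """Randomly down-sample every class to the size of the smallest class."""
    n_min = int(frame[label_col].value_counts().min())
    balanced = frame.groupby(label_col, group_keys=False).sample(
        n=n_min, random_state=random_state
    )
    return balanced.reset_index(drop=True)

# Toy data: 8 negatives and 3 positives
toy = pd.DataFrame({
    "bmi": [22, 24, 27, 31, 29, 25, 23, 26, 33, 35, 30],
    "fattyliver": [0] * 8 + [1] * 3,
})
balanced_toy = undersample_to_minority(toy, "fattyliver")
print(toy["fattyliver"].value_counts().to_dict(), "->",
      balanced_toy["fattyliver"].value_counts().to_dict())
```

Printing the class counts before and after resampling documents how many majority-class rows were discarded.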

In [None]:
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, precision_score, roc_curve

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
from lightgbm import LGBMClassifier

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)   # <- correct option name
pd.set_option("display.width", None)

plain_save_path = r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\Plain_Summary.csv"
plain_results_rows = []

# reading data
df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")

# X, y
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=101
)

# scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# models
models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(probability=True),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Extra Tree': ExtraTreesClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'XGBoost': xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss'),  # <- fewer warnings
    'LightGBM': LGBMClassifier(verbose=-1)
}

# grids
param_grids = {
    'Logistic Regression': {'C': [0.1, 1, 10]},
    'KNN': {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']},
    'SVM': {'C': [0.1, 1, 10]},
    'Decision Tree': {'max_depth': [None, 5, 10], 'min_samples_split': [2, 5, 10]},
    'Random Forest': {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10], 'min_samples_split': [2, 5, 10]},
    'Extra Tree': {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10], 'min_samples_split': [2, 5, 10]},
    'Gradient Boosting': {'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 5, 7], 'n_estimators': [100, 200, 300]},
    'XGBoost': {'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 5, 7], 'n_estimators': [100, 200, 300]},
    'LightGBM': {'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 5, 7], 'n_estimators': [100, 200, 300], 'num_leaves': [31, 50, 100], 'force_col_wise': [True]}
}

# train/eval
plt.figure()
for model_name, model in models.items():
    print(f"Training {model_name}...")
    grid = GridSearchCV(model, param_grids[model_name], scoring="f1", cv=5, n_jobs=-1)  # n_jobs=-1 for speed
    grid.fit(X_train, y_train)
    best_model = grid.best_estimator_
    params = grid.best_params_

    y_pred = best_model.predict(X_test)
    y_prob = best_model.predict_proba(X_test)[:, 1]

    acc = accuracy_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    pre = precision_score(y_test, y_pred)
    f1  = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_prob)

    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr, label=model_name)

    plain_results_rows.append({
        "Classifier": model_name,
        "Best Parameters": params,
        "Accuracy": acc,
        "Recall": rec,
        "Precision": pre,
        "F1": f1,
        "AUC": auc,
    })

    print(f"Model: {model_name}")
    print(f"Best Parameters: {params}")
    print(f"Accuracy: {acc:.3f} | Recall: {rec:.3f} | Precision: {pre:.3f} | F1: {f1:.3f} | AUC: {auc:.3f}")
    print("---------------------------")

plt.plot([0, 1], [0, 1], linestyle='--', lw=1, color='black')
plt.xlim([0, 1]); plt.ylim([0, 1.05])
plt.xlabel('False Positive Rate'); plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.savefig(r"C:\Users\z_kho\OneDrive\Desktop\ROC.png", dpi=300)
plt.show()

plain_results_df = pd.DataFrame(plain_results_rows)
plain_results_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\plain_results.csv", index=False)

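The feature-selection cells that follow index a mapping named `Classifiers`, which this notebook never defines. One plausible reading is a dictionary of unfitted estimators configured with the tuned hyperparameters from the grid search. The sketch below shows that construction under this assumption; the parameter values in `tuned_params` are hypothetical placeholders, not results from the study.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Base estimators (a subset of the `models` dictionary, for brevity)
base_models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Hypothetical tuned values, e.g. copied from the "Best Parameters" column
tuned_params = {
    "Logistic Regression": {"C": 1},
    "KNN": {"n_neighbors": 5, "weights": "distance"},
    "Random Forest": {"n_estimators": 200, "max_depth": 5},
}

# clone() returns a fresh, unfitted copy, so earlier fits cannot leak into later cells
Classifiers = {
    name: clone(est).set_params(**tuned_params[name])
    for name, est in base_models.items()
}

for name, est in Classifiers.items():
    print(name, "->", est)
```

Because the later loops iterate over `Classifiers.keys()`, the keys should match the classifier names used in the results tables.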

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

# Inputs
save_path = r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\Kbest_Summary.csv"

# Read the data first so that feature_names can be set
df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")
feature_names = df.columns[:-1].tolist()

# # X, y and split
# X = df.iloc[:, :-1].values
# y = df.iloc[:, -1].values
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, stratify=y, random_state=101
# )

results_rows = []

# Outcomes stores the accuracy-versus-k curve for each classifier.
# NOTE: `Classifiers` (name -> estimator) is not defined in this notebook; it must exist before this cell runs.
Outcomes = {name: [] for name in Classifiers.keys()}

# Stratified CV is preferred for classification
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("K-best Feature Selection:")

for n_clf in Classifiers.keys():
    print(f"{n_clf} is going ....")
    num_k = []

    for k in range(1, X_train.shape[1] + 1):
        num_k.append(k)

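        # Scaling is disabled here because X_train was already standardized in the previous cell;
        # re-enable the scaler step when passing raw features, so scaling is fit per fold (no leakage)
        pipeline = Pipeline([
            # ('scaler', StandardScaler()),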
            ('feature_selection', SelectKBest(score_func=f_classif, k=k)),
            ('classifier', Classifiers[n_clf])
        ])

        scores = cross_val_score(pipeline, X_train, y_train, cv=kf, scoring='accuracy', n_jobs=-1)
        Outcomes[n_clf].append(np.mean(scores))

    # Find the best k
    max_accuracy = max(Outcomes[n_clf])
    max_index = Outcomes[n_clf].index(max_accuracy)
    best_k = num_k[max_index]

    # Draw and save the plot
    plt.axvline(x=best_k, color='r', linestyle='--')
    plt.text(best_k + 0.5, max_accuracy, f'({best_k}, {max_accuracy:.2f})')
    plt.plot(num_k, Outcomes[n_clf])
    plt.title(n_clf)
    plt.xlabel("Number of K")
    plt.ylabel("Accuracy")
    plt.savefig(rf"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\Kbest_{n_clf}.png", dpi=300)
    plt.clf()

    # Final fit with the best k on the full training set
    best_pipeline = Pipeline([
        ('feature_selection', SelectKBest(score_func=f_classif, k=best_k)),
        ('classifier', Classifiers[n_clf])
    ])
    best_pipeline.fit(X_train, y_train)

    # Predictions and metrics
    y_pred = best_pipeline.predict(X_test)

    # Compute AUC if the model provides probabilities
    if hasattr(best_pipeline.named_steps['classifier'], "predict_proba"):
        y_score = best_pipeline.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_score)
    else:
        auc = np.nan

    acc = accuracy_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred)

    # Extract the names of the selected features
    kbest = best_pipeline.named_steps['feature_selection']
    mask = kbest.get_support()
    selected_features = [feature_names[i] for i, m in enumerate(mask) if m]

    results_rows.append({
        "Classifier": n_clf,
        "Best_K": best_k,
        "Accuracy": acc,
        "Recall": rec,
        "Precision": prec,
        "F1": f1,
        "AUC": auc,
        "Selected_Features": selected_features
    })

# Save the summary
pd.DataFrame(results_rows).to_csv(save_path, index=False)


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# If the KBest results are available:
Kbest_df = pd.DataFrame(results_rows)
Kbest_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\Kbest_Summary.csv", index=False)

# Output path for the summary table
pca_save_path = r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\PCA_Summary.csv"

# Read the data (can be removed if it is already in memory)
df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")
feature_names = df.columns[:-1].tolist()

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values


# Safe (stratified) CV
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Dictionary of CV scores and collector for the results
Outcomes_PCA = {name: [] for name in Classifiers.keys()}
pca_results_rows = []

print("PCA Feature Extraction:")

for n_clf in Classifiers.keys():
    print(f"{n_clf} is going ...")
    num_k = []
    Outcomes_PCA[n_clf].clear()

    # Number of components: from 1 up to the number of features
    for k in range(1, X_train.shape[1] + 1):
        num_k.append(k)

        pipeline = Pipeline([
            ('scaler', StandardScaler()),  # scale inside the pipeline to avoid leakage (matches the final fit)
            ('pca', PCA(n_components=k, random_state=0)),
            ('clf', Classifiers[n_clf])
        ])

        scores = cross_val_score(
            pipeline, X_train, y_train, cv=kf, scoring='accuracy', n_jobs=-1
        )
        Outcomes_PCA[n_clf].append(np.mean(scores))

    # Best k
    max_accuracy = max(Outcomes_PCA[n_clf])
    max_index = Outcomes_PCA[n_clf].index(max_accuracy)
    best_k = num_k[max_index]

    # Draw and save the plot
    plt.axvline(x=best_k, color='r', linestyle='--')
    plt.text(best_k + 0.5, max_accuracy, f'({best_k}, {max_accuracy:.2f})')
    plt.plot(num_k, Outcomes_PCA[n_clf])
    plt.title(f"PCA - {n_clf}")
    plt.xlabel("Number of Components")
    plt.ylabel("Accuracy")
    plt.savefig(rf"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\PCA_{n_clf}.png", dpi=300)
    plt.clf()

    # Final fit with the best k on the full training set
    best_pca_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=best_k, random_state=0)),
        ('clf', Classifiers[n_clf])
    ])
    best_pca_pipeline.fit(X_train, y_train)

    # Predict on the test set
    y_pred = best_pca_pipeline.predict(X_test)

    # Compute AUC only if predict_proba is available
    if hasattr(best_pca_pipeline.named_steps['clf'], "predict_proba"):
        y_score = best_pca_pipeline.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_score)
    else:
        auc = np.nan

    acc = accuracy_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred)

    # Feature "importance" from PCA (absolute loadings weighted by explained-variance ratio)
    pca_obj = best_pca_pipeline.named_steps['pca']
    comps = pca_obj.components_                 # (n_components, n_features)
    evr = pca_obj.explained_variance_ratio_     # (n_components,)
    abs_comps = np.abs(comps)
    weighted = abs_comps * evr[:, None]
    feat_importance = weighted.sum(axis=0)      # (n_features,)

    # Pick the top-N features, with N equal to the chosen n_components (for reporting only)
    top_idx = np.argsort(-feat_importance)[:best_k]
    selected_features = [feature_names[i] for i in top_idx]

    pca_results_rows.append({
        "Classifier": n_clf,
        "Best_n_components": best_k,
        "Accuracy": acc,
        "Recall": rec,
        "Precision": prec,
        "F1": f1,
        "AUC": auc,
        "Selected_Features": selected_features
    })

# Save the output
pca_results_df = pd.DataFrame(pca_results_rows)
pca_results_df["Selected_Features"] = pca_results_df["Selected_Features"].apply(lambda lst: ", ".join(lst))
pca_results_df.to_csv(pca_save_path, index=False, encoding='utf-8-sig')

print("\n=== PCA Summary per Classifier (Test-set metrics) ===")
print(pca_results_df)
print(f"\nSaved to: {pca_save_path}")

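The loading-based "importance" computed in the cell above is, for feature j, the sum over retained components k of |loading_kj| times the explained-variance ratio of component k. The sketch below applies the same formula to synthetic data in which two columns share a latent factor; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
# Columns 0 and 1 share a strong latent factor; columns 2 and 3 are independent noise
X_toy = np.hstack([
    latent + 0.1 * rng.normal(size=(500, 2)),
    rng.normal(size=(500, 2)),
])

pca = PCA(n_components=2, random_state=0).fit(StandardScaler().fit_transform(X_toy))

# importance_j = sum_k |components_[k, j]| * explained_variance_ratio_[k]
weighted = np.abs(pca.components_) * pca.explained_variance_ratio_[:, None]
importance = weighted.sum(axis=0)
ranking = np.argsort(-importance)

print("importance per feature:", np.round(importance, 3))
print("features ranked by importance:", ranking.tolist())
```

This ranking is a reporting heuristic only: the classifier is trained on the principal components, which are linear mixtures of all features, so the listed names are not a feature subset in the sense of SelectKBest or RFECV.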

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# If df is not defined here, read it:
# df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")

feature_names = df.columns[:-1].tolist()

results_rows = []
save_path = r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\RFECV_Summary.csv"

# X, y
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# train/test split: X_train, X_test, y_train, y_test are reused from the earlier split cell


# Appropriate CV scheme
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("RFECV Feature Selection (with per-fold scaling):")

# RFECV wrapped in a Pipeline; a scaler step can be added here so that scaling is fit per fold (no leakage)
rfecv_pipeline = Pipeline([
    ('rfecv', RFECV(
        estimator=RandomForestClassifier(n_estimators=200, random_state=42),
        step=1,
        cv=kf,
        scoring='accuracy',
        n_jobs=-1
    ))
])

rfecv_pipeline.fit(X_train, y_train)

# Access the fitted RFECV object
rfecv = rfecv_pipeline.named_steps['rfecv']
mask = rfecv.get_support()
selected_features = [feature_names[i] for i, m in enumerate(mask) if m]

print(f"Optimal number of features: {rfecv.n_features_}")
print("Selected features:", selected_features)

# Data restricted to the selected features (scaling is done inside each pipeline)
X_train_sel = X_train[:, mask]
X_test_sel  = X_test[:,  mask]

# Storage for the CV scores
Outcomes = {name: [] for name in Classifiers.keys()}

for n_clf in Classifiers.keys():
    print(f"{n_clf} is going ....")

    # Scaling inside the pipeline for each fold -> no leakage
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', Classifiers[n_clf])
    ])

    # CV on the training set (5 folds)
    scores = cross_val_score(pipeline, X_train_sel, y_train, cv=kf, scoring='accuracy', n_jobs=-1)
    Outcomes[n_clf] = scores.tolist()
    mean_acc = float(np.mean(scores))

    # Plot the score of each fold
    xs = list(range(1, len(scores) + 1))
    plt.plot(xs, scores, marker='o')
    plt.axhline(y=mean_acc, color='r', linestyle='--')
    plt.title(f"RFECV - {n_clf}")
    plt.xlabel("Fold")
    plt.ylabel("Accuracy")
    plt.savefig(rf"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\RFECV_{n_clf}.png", dpi=300)
    plt.clf()

    # Final fit on the training set and test-set metrics
    pipeline.fit(X_train_sel, y_train)
    y_pred = pipeline.predict(X_test_sel)

    # Compute AUC only if predict_proba is available
    if hasattr(pipeline.named_steps['classifier'], "predict_proba"):
        y_score = pipeline.predict_proba(X_test_sel)[:, 1]
        auc = roc_auc_score(y_test, y_score)
    else:
        auc = np.nan

    acc  = accuracy_score(y_test, y_pred)
    rec  = recall_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    f1   = f1_score(y_test, y_pred)

    results_rows.append({
        "Classifier": n_clf,
        "Selected_Count": int(rfecv.n_features_),
        "Accuracy": acc,
        "Recall": rec,
        "Precision": prec,
        "F1": f1,
        "AUC": auc,
        "Selected_Features": selected_features
    })

# Save
results_df = pd.DataFrame(results_rows)
results_df["Selected_Features"] = results_df["Selected_Features"].apply(lambda lst: ", ".join(lst))
results_df.to_csv(save_path, index=False, encoding='utf-8-sig')

print("\n=== RFECV Summary per Classifier (Test-set metrics) ===")
print(results_df)
print(f"\nSaved to: {save_path}")


In [None]:
# Outlier removal with LOF (earlier variant: features are not scaled and the label column is included in the fit)

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor
from sklearn.manifold import TSNE
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\4-imputed_data.csv")

lof = LocalOutlierFactor()
outliers = lof.fit_predict(data)
inliers = data[outliers == 1]
outlier = data[outliers != 1]

data = pd.DataFrame(inliers, columns=data.columns)
out = pd.DataFrame(outlier, columns=data.columns)
data.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\5-outlier_removed.csv" , index=False)
out.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\5.5-the_outliers.csv", index=False)


In [None]:
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", None)
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, precision_score

pca_save_path = r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\Plain_Summary.csv"
plain_results_rows = []
#reading data
df=pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")
# df=pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\7-balanced_data_without_SBP.csv")



#defining x & y
x=df.iloc[:,:-1].values
y=df.iloc[:,-1].values

#train test split
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,stratify=y,random_state=101)


#standard scaler
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)


#classifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
# define models
models = {
    'Logistic Regression': LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(probability=True),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Extra Tree': ExtraTreesClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'XGBoost': XGBClassifier(),
    'LightGBM': LGBMClassifier(verbose=-1)
}

# define parameter grids for hyperparameter tuning
param_grids = {
    'Logistic Regression': {'C': [0.1, 1, 10]},
    'KNN': {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']},
    'SVM': {'C': [0.1, 1, 10]},
    'Decision Tree': {'max_depth': [None, 5, 10], 'min_samples_split': [2, 5, 10]},
    'Random Forest': {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10], 'min_samples_split': [2, 5, 10]},
    'Extra Tree': {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10], 'min_samples_split': [2, 5, 10]},
    'Gradient Boosting': {'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 5, 7], 'n_estimators': [100, 200, 300]},
    'XGBoost': {'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 5, 7], 'n_estimators': [100, 200, 300]},
    'LightGBM': {'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 5, 7], 'n_estimators': [100, 200, 300], 'num_leaves': [31, 50, 100], 'force_col_wise': [True]}
}



# training and evaluating models
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, roc_curve


for model_name, model in models.items():
    print(f"Training {model_name}...")
    param_grid = param_grids[model_name]
    grid_search = GridSearchCV(model, param_grid, scoring="f1", cv=5)
    grid_search.fit(x_train, y_train)
    best_model = grid_search.best_estimator_
    parameters = grid_search.best_params_

    # Make predictions on the test set
    y_pred = best_model.predict(x_test)
    y_pred_prob = best_model.predict_proba(x_test)

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    pres = precision_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_prob[:, 1])

    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob[:, 1])
    plt.plot(fpr, tpr, label=model_name)

    plain_results_rows.append(
        {
        "Classifier": model_name,
        "Best Parameters" : parameters,
        "Accuracy": accuracy,
        "Recall": recall,
        "Precision": pres,
        "F1": f1,
        "AUC": auc,
    })


    # Print the results
    print(f"Model: {model_name}")
    print(f"Best Parameters: {parameters}")
    print(f"Accuracy: {accuracy}")
    print(f"Recall: {recall}")
    print(f"F1 Score: {f1}")
    print(f"AUC: {auc}")
    print("---------------------------")

plt.plot([0, 1], [0, 1], color='black', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.savefig(r"C:\Users\z_kho\OneDrive\Desktop\ROC.png", dpi=300)
plt.show()

plain_results_df = pd.DataFrame(plain_results_rows)
plain_results_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\plain_results.csv")


In [None]:
# === Only if needed: add these imports if they have not been imported already
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the data first so that feature_names can be derived from it
df=pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")

# If not defined already:
feature_names = df.columns[:-1].tolist()

# Somewhere before the loops start (if not there already):
results_rows = []
save_path = r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\Kbest_Summary.csv"

#defining x & y
x=df.iloc[:,:-1].values
y=df.iloc[:,-1].values

#train test split
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,stratify=y,random_state=101)

# Standardize the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(x_train)
X_test_scaled = scaler.transform(x_test)


Outcomes = {
}

for i in Classifiers.keys():
    Outcomes[i] = []


# K-fold cross-validation

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# K-best feature selection
print("K-best Feature Selection:")

for n_clf in models.keys():
    print(f"{n_clf} is going ....")
    num_k = []
    for k in range(1, x_train.shape[1] + 1):
        print(f"#{k}")

        num_k.append(k)

        pipeline = Pipeline([
            ('feature_selection', SelectKBest(score_func=f_classif, k=k)),
            ('classifier', Classifiers[n_clf])
        ])

        scores = cross_val_score(pipeline, X_train_scaled, y_train, cv=kf, scoring='accuracy')
        score = np.mean(scores)
        Outcomes[n_clf].append(score)


    max_accuracy = max(Outcomes[n_clf])
    max_index = Outcomes[n_clf].index(max_accuracy)
    plt.axvline(x=num_k[max_index], color='r', linestyle='--')  # Vertical line at maximum accuracy point
    print(f'({num_k[max_index]}, {max_accuracy:.2f})')
    plt.text(num_k[max_index] + 0.5, max_accuracy, f'({num_k[max_index]}, {max_accuracy:.2f})')

    plt.plot(num_k,Outcomes[n_clf])
    plt.title(n_clf)
    plt.xlabel("Number of K")
    plt.ylabel("Accuracy")
    plt.savefig(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\Kbest_{}.png".format(n_clf),dpi=300)
    plt.clf()

    best_acc = max(Outcomes[n_clf])
    best_idx = Outcomes[n_clf].index(best_acc)
    best_k = num_k[best_idx]

    # Re-define and fit the pipeline with the best k on the (scaled) training data
    best_pipeline = Pipeline([
        ('feature_selection', SelectKBest(score_func=f_classif, k=best_k)),
        ('classifier', Classifiers[n_clf])
    ])
    best_pipeline.fit(X_train_scaled, y_train)

    # Predict on the test set
    y_pred = best_pipeline.predict(X_test_scaled)

    # Compute AUC only if the model exposes predict_proba
    if hasattr(best_pipeline.named_steps['classifier'], "predict_proba"):
        y_score = best_pipeline.predict_proba(X_test_scaled)[:, 1]
        auc = roc_auc_score(y_test, y_score)
    else:
        auc = np.nan  # no probabilities available, so AUC stays NaN
    acc = accuracy_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred)

    # Extract the names of the features selected for the best k
    kbest = best_pipeline.named_steps['feature_selection']
    mask = kbest.get_support()
    selected_features = [feature_names[i] for i, m in enumerate(mask) if m]

    # Append to the output rows
    results_rows.append({
        "Classifier": n_clf,
        "Best_K": best_k,
        "Accuracy": acc,
        "Recall": rec,
        "Precision": prec,
        "F1": f1,
        "AUC": auc,
        "Selected_Features": selected_features  # stored as a list
    })


In [None]:
import os


# 1) If Res is not already in memory, load it from CSV (otherwise remove or comment out this line)
Res = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\ResultsOfGrid.csv")

# 2) Directory containing the R (missForest) outputs
mf_dir = r"C:\Users\z_kho\OneDrive\Desktop\sixth-121"

# 3) Parameter combinations, in the same order as the R loop
#    (maxiter in the outer loop, ntree in the inner loop)
maxiter_values = [5]       # other values (e.g. 10, 20) can be added
ntree_values   = [50]      # other values (e.g. 100, 150) can be added
param_pairs = [(mi, nt) for mi in maxiter_values for nt in ntree_values]
print(param_pairs)

# Build mf_files from the number of parameter combinations
mf_files = []
for idx in range(1, len(param_pairs)+1):
    fname = f"output{idx}.csv"
    path = os.path.join(mf_dir, fname)
    if os.path.exists(path):
        mf_files.append(fname)
    else:
        print(f"Warning: file {fname} does not exist!")

print("mf_files:", mf_files)
# 4) Weights based on each column's share of the missing values (as before)
# NOTE: dff, df_test, continues, missings and evaluate_imputation are not defined in this notebook;
#       they must already exist in memory before this cell runs.
scale_weight = {}
sum_miss = int(dff.isna().sum().sum())
if sum_miss == 0:
    print("Warning: dff contains no NaN (sum_miss == 0). The weighting may not be meaningful.")
for cls in df_test.columns:
    we = int(dff[cls].isna().sum()) if cls in dff.columns else 0
    scale_weight[cls] = (we / sum_miss) if sum_miss > 0 else 0.0

# 5) Evaluate only the columns that actually had NaN in df (dff) and that also exist in df_test
continues_eval = [c for c in continues if (c in missings) ]
binary_eval    = [c for c in ['Retino','CAD','CVA','Smoking'] if (c in missings) ]

labels, labels2, params_col, con_values, bin_values, details = [], [], [], [], [], []

print(mf_files)
for idx, fname in enumerate(mf_files, start=1):
    path = os.path.join(mf_dir, fname)
    print(f"Evaluating missForest file: {fname}")

    # 5-1) Load and align
    df_imputed = pd.read_csv(path)
    # Same column names and order as dff:
    df_imputed = df_imputed[dff.columns]
    df_imputed.index = dff.index

    # 5-2) Evaluate (using the same evaluate_imputation function as before)
    eval_dict = evaluate_imputation(df_imputed, df_test, missings)  # same function as before

    # 5-3) Weighted scores for continuous and binary variables
    # Continuous:
    cont_score, w_sum_c = 0.0, 0.0
    for c in continues_eval:
        if c in eval_dict and "R2" in eval_dict[c]:
            cont_score += eval_dict[c]["R2"] * scale_weight.get(c, 0.0)
            w_sum_c   += scale_weight.get(c, 0.0)
    cont_score = (cont_score / w_sum_c) if w_sum_c > 0 else np.nan

    # Binary:
    bin_score, w_sum_b = 0.0, 0.0
    for b in binary_eval:
        if b in eval_dict and "Accuracy" in eval_dict[b]:
            bin_score += eval_dict[b]["Accuracy"] * scale_weight.get(b, 0.0)
            w_sum_b   += scale_weight.get(b, 0.0)
    bin_score = (bin_score / w_sum_b) if w_sum_b > 0 else np.nan

    # 5-4) Labels and parameters
    lab = f"MissForest_{idx}"
    labels.append(lab)
    labels2.append(lab)  # kept for compatibility with the structure of Res
    if idx <= len(param_pairs):
        params_col.append({"maxiter": param_pairs[idx-1][0], "ntree": param_pairs[idx-1][1]})
    else:
        params_col.append({})  # more files than parameter combinations

    con_values.append(cont_score)
    bin_values.append(bin_score)
    details.append(eval_dict)

# 6) Build the missForest DataFrame and merge it with Res
Res_mf = pd.DataFrame({
    "Labels":   labels,
    "Labels2":  labels2,
    "Parameters": params_col,
    "Continues": con_values,
    "Binary":    bin_values,
    "Details":   details
})

# If Res was built earlier:
try:
    Res_combined = pd.concat([Res, Res_mf], ignore_index=True)
except NameError:
    # Res is not in memory, so only the missForest results are available
    Res_combined = Res_mf.copy()

# Save the combined version
Res_combined.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\ResultsOfGrid_ALL.csv", index=False)
print("Saved:", r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\ResultsOfGrid_ALL.csv")
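
**Sketch: weighted imputation score.** The cell above averages a per-column metric (R² for continuous columns, accuracy for binary ones) using each column's share of the missing values as its weight, then renormalises by the weights actually used. A minimal, self-contained sketch of that calculation follows. The helper name `weighted_metric` and the toy numbers are hypothetical and serve only as illustration; they are not results from the study.

```python
import math

def weighted_metric(eval_dict, columns, metric, weights):
    """Weighted mean of `metric` over `columns`, renormalised by the weights used."""
    total, w_sum = 0.0, 0.0
    for col in columns:
        # Skip columns that were not evaluated or lack this metric
        if col in eval_dict and metric in eval_dict[col]:
            w = weights.get(col, 0.0)
            total += eval_dict[col][metric] * w
            w_sum += w
    return total / w_sum if w_sum > 0 else math.nan

# Toy inputs (illustrative only)
toy_eval = {"PLT": {"R2": 0.50}, "CRP": {"R2": 0.80}}
toy_weights = {"PLT": 0.25, "CRP": 0.75}
toy_score = weighted_metric(toy_eval, ["PLT", "CRP", "UA"], "R2", toy_weights)
print(toy_score)
```

Columns without an evaluation (here `UA`) are ignored rather than counted as zero, which matches the renormalisation by `w_sum_c` and `w_sum_b` in the cell above.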


In [None]:
import matplotlib
matplotlib.use('module://matplotlib_inline.backend_inline')  # render inline in VS Code/Jupyter
import matplotlib.pyplot as plt


# Read all results
res_all = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\ResultsOfGrid_ALL.csv")

# Build the Model column from Labels (everything before the first "_" is the model name)
res_all["Model"] = res_all["Labels"].str.split("_").str[0]

# Select the best row for each model by its Continues score
best_per_model = res_all.loc[res_all.groupby("Model")["Continues"].idxmax()].reset_index(drop=True)

# Save the best-per-model table
best_per_model.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\BestPerModel.csv", index=False)
print(best_per_model[["Model","Labels","Parameters","Continues","Binary"]])

# ---- Comparison plots ----
plt.figure(figsize=(10,5))
plt.bar(best_per_model["Model"], best_per_model["Continues"])
plt.title("Best Continuous Score (Weighted R²) per Model")
plt.ylabel("Weighted R²")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig(r"C:\Users\z_kho\OneDrive\Desktop\best_continues.png", dpi=200)
plt.close()

plt.figure(figsize=(10,5))
plt.bar(best_per_model["Model"], best_per_model["Binary"])
plt.title("Best Binary (Weighted Accuracy) per Model")
plt.ylabel("Weighted Accuracy")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig(r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\best_binary.png", dpi=200)
plt.close()
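
**Sketch: best configuration per model.** The selection above relies on `groupby(...).idxmax()` returning, for each model, the row label of its highest score. The toy table below (made-up labels and scores, for illustration only) shows the pattern in isolation. It assumes, as the cell above does, that the model name contains no underscore of its own.

```python
import pandas as pd

# Toy results table (illustrative values only)
toy = pd.DataFrame({
    "Labels": ["SVR_0", "SVR_1", "MissForest_1", "MissForest_2"],
    "Continues": [0.40, 0.55, 0.62, 0.58],
})

# Model name = everything before the first underscore
toy["Model"] = toy["Labels"].str.split("_").str[0]

# idxmax gives the row label of the best score within each model group
best_idx = toy.groupby("Model")["Continues"].idxmax()
toy_best = toy.loc[best_idx].reset_index(drop=True)
print(toy_best)
```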


In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import AdaBoostClassifier




data = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\3-my_null_data_40_del.csv")


In [None]:
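from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
import xgboost as xgb

classifiers_name = ["LogReg", "KNN", "DT", "SVM", "RF", "ET", "XGB", "AdaBoost"]
classifiers = {
    "LogReg":LogisticRegression(),
    "KNN":KNeighborsClassifier(),
    "DT":DecisionTreeClassifier(),
    "SVM":SVC(probability=True),  # predict_proba is called below, so probability estimates are required
    "RF": RandomForestClassifier(),
    "ET": ExtraTreesClassifier(),
    "XGB": xgb.XGBClassifier(objective='binary:logistic'),
    "AdaBoost": AdaBoostClassifier(),
}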

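import ast

# ---------- Input values
prob_per = 0.90  # share of simulated missing values placed on the highest-probability rows
rand_per = 0.10  # share of simulated missing values placed on randomly chosen rows

def set_classifier(col):
    """Return the classifier with the highest F1 in Gridi for predicting missingness in `col`."""
    The_col = Gridi[Gridi["Column"] == col]
    Maxi = np.argmax(The_col["F1"].to_numpy())
    The_row = The_col.iloc[Maxi]

    clf = classifiers[The_row["Classifier"]]
    # Safer than eval; assumes the stored parameters are a plain dict literal
    param = ast.literal_eval(The_row["Parameters"])
    clf.set_params(**param)
    return clf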



data = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\3-my_null_data_40_del.csv")
data = data.iloc[:,:-1]

# Count missing values and their percentage for each column
missing_dataframe = pd.DataFrame(columns=["Variable", "Missing_number" ,"Percentage"])
data_records = data.shape[0]
for i, col in enumerate(data.columns):

    missing = data[col].isnull().sum()
    missing_dataframe.loc[i, "Variable"] = col
    missing_dataframe.loc[i, "Missing_number"] = missing
    missing_dataframe.loc[i, "Percentage"] = round((missing/data_records)*100,2)

missing_dataframe = missing_dataframe.sort_values(by= "Percentage" , ascending=False)
missing_dataframe.reset_index(inplace=True , drop=True)
missing_columns = missing_dataframe["Variable"][missing_dataframe["Missing_number"]>100]

print(missing_dataframe)
# Preparing test and train df

df = data.copy()

from sklearn.model_selection import train_test_split
df_train , df_test = train_test_split(df , train_size=0.5 , random_state=232)


df_test = df_test.dropna().copy()  # copy so that adding columns below does not act on a view
df_test["row"] = range(1, len(df_test) + 1)
df_test_original = df_test.copy()


new_col = ["Column","Accuracy" , "Precision" , "Recall" , "F1"]
Log_res = pd.DataFrame(columns=new_col)

for colmn in missing_columns:

    clf = set_classifier(colmn)
    # Target: 1 where the column is missing in the training half, 0 otherwise
    y = np.where(df_train[colmn].isna(), 1, 0)
    selected = [i for i in df_train.columns if i != colmn]
    x = df_train[selected].fillna(df_train[selected].median())

    clf.fit(x,y)
    prob = clf.predict_proba(df_test[selected])

    # Per-row probability of being missing (class 1), not a single scalar
    df_test["{}_prob".format(colmn)] = prob[:, 1]
    df_test = df_test.sort_values(by="{}_prob".format(colmn), ascending=False).reset_index(drop=True)
    missing_percent = float(missing_dataframe["Percentage"][missing_dataframe["Variable"] == colmn].iloc[0]) / 100
    num_rows = int((missing_percent * df_test.shape[0])*prob_per) + 1
    num_rows_rand = int((missing_percent * df_test.shape[0]) * rand_per) + 1

    df_test["{}_missing".format(colmn)] = df_test[colmn]
    # .loc slicing is inclusive, so stop at num_rows - 1 to mask exactly num_rows rows
    df_test.loc[:num_rows - 1, "{}_missing".format(colmn)] = np.nan
    # Draw the random share only from rows that are still observed (x != np.nan is always True)
    observed_idx = df_test.index[df_test["{}_missing".format(colmn)].notna()]
    random_indices = np.random.choice(observed_idx, size=num_rows_rand, replace=False)
    df_test.loc[random_indices, "{}_missing".format(colmn) ] = np.nan

    df_test = df_test.sample(frac=1).reset_index(drop=True)


# Delete all '_prob' columns
prob_columns = [col for col in df_test.columns if col.endswith('_prob')]
df_test.drop(columns=prob_columns, inplace=True)

# Replace columns with their corresponding '_missing' columns
for col in df_test.columns:
    if col.endswith('_missing'):
        original_col = col.replace('_missing', '')
        if original_col in df_test.columns:
            df_test[original_col] = df_test[col]

# Drop the '_missing' columns after replacement
missing_columns = [col for col in df_test.columns if col.endswith('_missing')]
df_test.drop(columns=missing_columns, inplace=True)

df_test= df_test.sort_values(by="row")

df_test = df_test.drop("row" , axis=1)
df_test_original = df_test_original.drop("row" , axis=1)

df_test.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\test2.csv", index=False)
df_test_original.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\test_original2.csv", index=False)
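
**Sketch: MAR masking for one column.** The loop above simulates missing-at-random data by masking the rows a classifier considers most likely to be missing (`prob_per` of the target count) and adding a smaller random share (`rand_per`). The function below is a simplified sketch of that idea for a single column. The name `mask_mar`, the rounding rule and the seeded generator are assumptions made for illustration; the notebook itself uses `int(...) + 1` counts and the global NumPy random state.

```python
import numpy as np
import pandas as pd

def mask_mar(values, miss_prob, miss_frac, prob_share=0.9, seed=0):
    """Return a copy of `values` with about `miss_frac` of the entries set to NaN.

    `prob_share` of the masked entries are the rows with the highest predicted
    missingness probability; the rest are drawn at random from observed rows.
    """
    rng = np.random.default_rng(seed)
    out = pd.Series(values, dtype=float).reset_index(drop=True)
    n_total = int(round(miss_frac * len(out)))
    n_prob = int(round(prob_share * n_total))

    # Rank rows by predicted probability of being missing, highest first
    order = np.argsort(-np.asarray(miss_prob, dtype=float), kind="stable")
    out.iloc[order[:n_prob]] = np.nan

    # Random remainder, drawn only from rows that are still observed
    remaining = np.flatnonzero(out.notna().to_numpy())
    n_rand = min(n_total - n_prob, len(remaining))
    out.iloc[rng.choice(remaining, size=n_rand, replace=False)] = np.nan
    return out

toy_values = np.arange(20, dtype=float)
toy_prob = np.linspace(0.0, 1.0, 20)  # hypothetical predicted probabilities
toy_masked = mask_mar(toy_values, toy_prob, miss_frac=0.5)
print(int(toy_masked.isna().sum()), "of", len(toy_masked), "values masked")
```

Passing an explicit seed keeps the simulated pattern reproducible across runs, in line with the centralized seeds declared in the setup cell.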


In [None]:
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", None)
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, precision_score
import shap

#reading data
df=pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")


#defining x & y
x=df.iloc[:,:-1].values
y=df.iloc[:,-1].values

#train test split
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,stratify=y,random_state=101)


# #feature selection
# from sklearn.feature_selection import SelectKBest, f_classif
# selector = SelectKBest(f_classif, k=15)  # Select top 15 features based on F-score
# x_train_selected = selector.fit_transform(x_train, y_train)
# x_test_selected = selector.transform(x_test)

#standard scaler
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test =scaler.transform(x_test)

x_train = pd.DataFrame(x_train , columns= df.columns[:-1])
x_test = pd.DataFrame(x_test , columns= df.columns[:-1])

#classifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

chosen = [
        # [0, 1, 4, 6, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 23, 24, 28, 29, 30],
        #   [1, 4, 7, 11, 12, 22, 24, 28],
        #   [0, 1, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17, 18, 21, 22, 24, 25, 29],
        #   [0, 1, 2, 3, 6, 7, 11, 12, 13, 17, 18, 22, 24, 26, 27, 29, 30],
        #   [0, 1, 2, 5, 6, 7, 11, 12, 13, 14, 17, 18, 20, 21, 24, 27, 29, 30],
          list(range(31))
          # ,[0, 1, 2, 3, 4, 8, 9, 10, 11, 12, 21, 23, 24, 25, 27, 28, 29, 30],
        # list(range(31))
    ]


models = {
    # 'Logistic Regression': LogisticRegression(C=1),
    # 'KNN': KNeighborsClassifier(n_neighbors=5 , weights='distance'),
    # 'SVM': SVC(probability=True , C=0.1 , kernel='linear' ),
    # 'Decision Tree': DecisionTreeClassifier(max_depth=5 , min_samples_split=5),
    # 'Extra Tree': ExtraTreesClassifier(max_depth=None, min_samples_split=5, n_estimators=200),
#    'Gradient Boosting': GradientBoostingClassifier(learning_rate=0.1, max_depth=5, n_estimators=300)
    'XGBoost': XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.1, subsample=1.0, colsample_bytree=1.0, eval_metric='logloss', use_label_encoder=False, tree_method='hist', n_jobs=-1, random_state=0),

    #'LightGBM': LGBMClassifier(force_col_wise=True,learning_rate= 0.1, max_depth= 7, n_estimators=200, num_leaves=100)
}

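f_name = [df.columns[i] for i in chosen[0]]
print(f_name)
# Display labels for the SHAP plot ("feature: percentage"); they override the column names above.
# Note: the list needs one entry per column in chosen[0] for the labels to line up with the data.
f_name = [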
"DDM: 3.5%",
"PLT: 13.3%",
"Retino: 0.4%",
"Sex: 1.2%",
"Height: 2.8%",
"Weight: 4.4%",
"Waist: 2.4%",
"DBP: 1.6%",
"CRP: 13.6%",
"FBS: 2.2%",
"LDL: 2.8%",
"Cr: 1.7%",
"UA: 2.8%",
"ALT: 36.3%",
"ALKP: 3.3%",
"CVA: 0.7%",
"HOMA: 2.9%",
"BMI: 4.2%",
]

model = models.values()
names = list(models.keys())

for i, clf in enumerate(model) :

    x_train_selected = x_train.iloc[:,chosen[i]].values
    x_test_selected = x_test.iloc[:, chosen[i]].values


    clf.fit(x_train_selected, y_train)

    # Make predictions on the test set
    y_pred = clf.predict(x_test_selected)
    y_pred_prob=clf.predict_proba(x_test_selected)


    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    pres = precision_score(y_test, y_pred)
    auc=roc_auc_score(y_test,y_pred_prob[:,1])


    # Print the results
    print(f"Model: {names[i]}")
    print(f"Accuracy: {accuracy}")
    print(f"Recall: {recall}")
    print(f"Precision: {pres}")
    print(f"F1 Score: {f1}")
    print(f"AUC: {auc}")
    print("---------------------------")
    # SHAP values

    plt.rcParams["font.family"] = "Calibri"

    explainer = shap.Explainer(clf, x_train_selected)
    shap_values = explainer(x_train_selected, check_additivity=False)

    # SHAP summary plot
    fig, ax = plt.subplots()
    shap.summary_plot(shap_values, x_train_selected, feature_names=f_name, show=False)

    from matplotlib.colors import LinearSegmentedColormap

    # Define your own color gradient with start and end colors
    start_color = "#0F9ED5"  # Blue
    end_color = "#d70f47"  # Red

    # Create a colormap from the two colors
    custom_cmap = LinearSegmentedColormap.from_list("custom_cmap", [start_color, end_color])

    for fc in plt.gcf().get_children():
        for fcc in fc.get_children():
            if hasattr(fcc, "set_cmap"):
                fcc.set_cmap(custom_cmap)

    plt.savefig(r"C:\Users\z_kho\OneDrive\Desktop\SHAP.png",dpi=600)
    plt.show()
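
**Sketch: the two-colour SHAP colormap.** The cell above recolours the summary plot with a gradient built by `LinearSegmentedColormap.from_list`. The short check below builds the same two-colour map in isolation and confirms that its endpoints are the chosen blue and red. The variable names `low_rgba` and `high_rgba` are introduced here for illustration only.

```python
import numpy as np
from matplotlib.colors import LinearSegmentedColormap, to_rgba

start_color = "#0F9ED5"  # blue, used for low feature values
end_color = "#d70f47"    # red, used for high feature values

# Linear gradient between the two colours
custom_cmap = LinearSegmentedColormap.from_list("custom_cmap", [start_color, end_color])

# Calling the colormap with 0.0 and 1.0 returns the RGBA endpoints
low_rgba = custom_cmap(0.0)
high_rgba = custom_cmap(1.0)
print(np.round(low_rgba, 3), np.round(high_rgba, 3))
```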


In [None]:
# ===== requirements: pandas, numpy, scikit-learn, xgboost, lightgbm =====
import numpy as np
import pandas as pd

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# ----------------------------
# 0) Data
# ----------------------------
# >>> Adjust the lines below to match your own data <<<
# df: all input columns plus the target column
# target_col: name of the target column
# df = ...  # your own DataFrame
target_col = "fattyliver"
df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\Clean Outputs\2-imputed_data.csv")
X = df.drop(columns=[target_col]).copy()
y = df[target_col].values
feature_names = X.columns.tolist()
X_np = X.values
n_features = X_np.shape[1]

from sklearn.preprocessing import StandardScaler

# --- Scaling is used only for feature selection ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_np)

# Preliminary KBest selection (K=22) on the scaled data; it is recomputed in section 2 below
K_GB = 22
sel_gb = SelectKBest(score_func=f_classif, k=K_GB)
X_sel_gb = sel_gb.fit_transform(X_scaled, y)  # uses X_scaled here
mask_gb = sel_gb.get_support(indices=True)



# ----------------------------
# 1) XGBoost (no feature selection), the best final model
# ----------------------------
xgb = XGBClassifier(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.1,
    subsample=1.0,
    colsample_bytree=1.0,
    eval_metric='logloss',
    use_label_encoder=False,
    tree_method='hist',
    n_jobs=-1,
    random_state=0,
    importance_type='gain'  # stated explicitly for transparency
)
xgb.fit(X_np, y)
imp_xgb = xgb.feature_importances_.astype(float)  # length = n_features

# ----------------------------
# 2) Gradient Boosting + KBest (K=22)
#    Note: the model is fitted on the top 22 features only, and its importances
#    are mapped back to a vector of length n_features (all others = 0)
# ----------------------------
K_GB = 22  # as specified
sel_gb = SelectKBest(score_func=f_classif, k=K_GB)
X_sel_gb = sel_gb.fit_transform(X_np, y)
mask_gb = sel_gb.get_support(indices=True)  # indices of the selected features

gb = GradientBoostingClassifier(
    learning_rate=0.1, max_depth=5, n_estimators=300, random_state=0
)
gb.fit(X_sel_gb, y)
imp_gb_selected = gb.feature_importances_.astype(float)  # length = K_GB

# Map back to full length
imp_gb = np.zeros(n_features, dtype=float)
imp_gb[mask_gb] = imp_gb_selected

# ----------------------------
# 3) LightGBM with the 18 pre-selected features
#    (using the indices provided)
# ----------------------------
lgb_indices = [0, 1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 17, 20, 24, 25, 26, 29]
X_sel_lgb = X_np[:, lgb_indices]

lgb = LGBMClassifier(
    force_col_wise=True,
    learning_rate=0.1,
    max_depth=7,
    n_estimators=100,
    verbose=-1,
    random_state=0
)
lgb.fit(X_sel_lgb, y)
imp_lgb_selected = lgb.feature_importances_.astype(float)  # length = 18

# Map back to full length
imp_lgb = np.zeros(n_features, dtype=float)
for idx_local, idx_global in enumerate(lgb_indices):
    imp_lgb[idx_global] = float(imp_lgb_selected[idx_local])

# ----------------------------
# 4) Normalization so that each model's importances sum to 1 (optional; comment out the three calls below to skip)
# ----------------------------
def normalize(v):
    s = v.sum()
    return v / s if s > 0 else v

imp_xgb = normalize(imp_xgb)
imp_gb  = normalize(imp_gb)
imp_lgb = normalize(imp_lgb)

# ----------------------------
# 5) Build the output CSV
#    A feature that a model did not select gets importance 0 (already the case above)
# ----------------------------
out_df = pd.DataFrame({
    "feature": feature_names,
    "XGBoost_without": imp_xgb,
    "GB_KBest": imp_gb,
    "LightGBM_selected": imp_lgb,
})

# Optional: sort by XGBoost importance
out_df = out_df.sort_values("XGBoost_without", ascending=False).reset_index(drop=True)

# Save
out_df.to_csv("feature_importance_comparison.csv", index=False)

print("Saved: feature_importance_comparison.csv")
out_df.head(15)


In [None]:
# ===== requirements: pandas, numpy, scikit-learn, xgboost, lightgbm =====
import numpy as np
import pandas as pd

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# ----------------------------
# 0) Data
# ----------------------------
target_col = "fattyliver"
df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\Clean Outputs\2-imputed_data.csv")

X_df = df.drop(columns=[target_col]).copy()
y = df[target_col].values
feature_names = X_df.columns.tolist()

X_np = X_df.values
n_features = X_np.shape[1]

# ----------------------------
# Scaling is used only for feature selection (KBest)
# ----------------------------
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_np)

# ----------------------------
# 1) XGBoost (no feature selection), the best final model
# ----------------------------
xgb = XGBClassifier(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.1,
    subsample=1.0,
    colsample_bytree=1.0,
    eval_metric='logloss',
    use_label_encoder=False,
    tree_method='hist',
    n_jobs=-1,
    random_state=0,
    importance_type='gain'
)
xgb.fit(X_np, y)
imp_xgb = xgb.feature_importances_.astype(float)  # length = n_features

# ----------------------------
# 2) Gradient Boosting + KBest (K=22)
#    Fit the selector on X_scaled, but train the model on X_np[:, mask_gb]
# ----------------------------
K_GB = 22
sel_gb = SelectKBest(score_func=f_classif, k=K_GB)
sel_gb.fit(X_scaled, y)

mask_gb = sel_gb.get_support(indices=True)  # indices of the selected features
X_sel_gb = X_np[:, mask_gb]                 # raw data (tree models do not need scaling)

gb = GradientBoostingClassifier(
    learning_rate=0.1, max_depth=5, n_estimators=300, random_state=0
)
gb.fit(X_sel_gb, y)
imp_gb_selected = gb.feature_importances_.astype(float)  # length = K_GB

# Map back to full length
imp_gb = np.zeros(n_features, dtype=float)
imp_gb[mask_gb] = imp_gb_selected

# ----------------------------
# 3) LightGBM with the 18 pre-selected features
# ----------------------------
lgb_indices = [0, 1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 17, 20, 24, 25, 26, 29]
X_sel_lgb = X_np[:, lgb_indices]

lgb = LGBMClassifier(
    force_col_wise=True,
    learning_rate=0.1,
    max_depth=7,
    n_estimators=100,
    n_jobs=-1,
    verbose=-1,
    random_state=0
)
lgb.fit(X_sel_lgb, y)
imp_lgb_selected = lgb.feature_importances_.astype(float)  # length = 18

# Map back to full length
imp_lgb = np.zeros(n_features, dtype=float)
for idx_local, idx_global in enumerate(lgb_indices):
    imp_lgb[idx_global] = float(imp_lgb_selected[idx_local])

# ----------------------------
# 4) Optional normalization (each column sums to 1)
# ----------------------------
def normalize(v):
    s = v.sum()
    return v / s if s > 0 else v

imp_xgb = normalize(imp_xgb)
imp_gb  = normalize(imp_gb)
imp_lgb = normalize(imp_lgb)

# ----------------------------
# 5) Build the output CSV
# ----------------------------
out_df = pd.DataFrame({
    "feature": feature_names,
    "XGBoost_without": imp_xgb,
    "GB_KBest": imp_gb,
    "LightGBM_selected": imp_lgb,
})

# Optional: sort by XGBoost importance
out_df = out_df.sort_values("XGBoost_without", ascending=False).reset_index(drop=True)

out_df.to_csv("feature_importance_comparison.csv", index=False)
print("Saved: feature_importance_comparison.csv")
print(out_df.head(15))
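
**Sketch: mapping subset importances to the full feature list.** The cell above trains two of the three models on a subset of the features and then places their importances into a vector of length `n_features`, leaving unselected features at zero, before normalising each model to sum to 1. The toy example below isolates that step. The helper name `expand_importances` and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def expand_importances(selected_importances, selected_indices, n_features):
    """Place the importances of a feature subset into a full-length vector (others = 0)."""
    full = np.zeros(n_features, dtype=float)
    full[np.asarray(selected_indices)] = np.asarray(selected_importances, dtype=float)
    return full

def normalize(v):
    """Scale a non-negative vector to sum to 1; leave an all-zero vector unchanged."""
    s = v.sum()
    return v / s if s > 0 else v

# Toy case: 5 features, model trained on features 1 and 3 only
toy_full = normalize(expand_importances([2.0, 6.0], [1, 3], n_features=5))
print(toy_full)
```

Because each model is normalised separately, the columns of the resulting table are comparable as shares within a model, not as absolute importances across models.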


In [None]:
import numpy as np
import pandas as pd
import random
import heapq
from sklearn.metrics import  f1_score, accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

AbbClassifers = ["SVM"
    # "LG", "SVM", "DT" ,
    #  "ET" ,
    #  "GB" , "XGB" , "LGB"
    ]

CL = [

    # LogisticRegression(C=1, solver="liblinear", max_iter=1000, random_state=0),
    SVC(C=1, probability=True, random_state=0)
    # ,
    # # DecisionTreeClassifier(max_depth=5, min_samples_split=5, random_state=0),
    # # ExtraTreesClassifier(
    # #         n_estimators=100, max_depth=None, min_samples_split=2, n_jobs=-1, random_state=0
    # #     ),
    # GradientBoostingClassifier(
    #         learning_rate=0.1, max_depth=5, n_estimators=100, random_state=0
    #     ),
    # XGBClassifier(
    #         n_estimators=100, max_depth=5, learning_rate=0.1,
    #         subsample=1.0, colsample_bytree=1.0,
    #         eval_metric="logloss", use_label_encoder=False,
    #         tree_method="hist", n_jobs=-1, random_state=0
    #     ),
    # LGBMClassifier(
    #         n_estimators=100, learning_rate=0.1, max_depth=7, num_leaves=31,
    #         force_col_wise=True, random_state=0, n_jobs=-1
    #     )
]



df=pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")


#defining x & y
x=df.iloc[:,:-1].values
y=df.iloc[:,-1].values

#train test split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,stratify=y,random_state=101)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(x_train)
X_test = scaler.transform(x_test)

ro = 0
row = 0

for n_classifier in range(len(AbbClassifers)):

    Results = pd.DataFrame(
        columns=["row", "gen", "train", "validation", "random_state", "classifier", "Best", "Best Res", "test"])
    clf = CL[n_classifier]
    name = AbbClassifers[n_classifier]
    print("========================")
    print("Classifier: ", name)


    # Baseline: accuracy without feature selection
    clf.fit(X_train, y_train)

    scores = cross_val_score(clf, X_train, y_train, scoring='accuracy', cv=3, n_jobs=1)
    acc = scores.mean()
    Results.loc[row, "validation"] = acc
    print("Baseline cross-validation accuracy without feature selection: ", acc)


    first_y_pred = clf.predict(X_test)
    first_acc = accuracy_score(y_test, first_y_pred)
    print("Baseline test accuracy without feature selection: ", first_acc)
    Results.loc[row, "test"] = first_acc



    # Row in Result DataFrame
    row += 1

    # Defining number of features
    num_features = np.shape(x)[1]
    Best_rec = []
    Best_fea = []

    # --- Input area
    # Hyperparameters
    num_population = 200
    num_gens = 50

    mutation_rate = 0.2
    elitism_rate = 0.1

    least_num = 4
    most_length = num_features - 1

    remained = int((1 - elitism_rate) * num_population)
    elitism_n = num_population - remained


    # Roulette selection:
    # -- Takes the fitness values of all chromosomes in the population
    # -- Selects one chromosome at random, with probability proportional to its fitness

    def roulette_selection(fitness_values):
        total_fitness = sum(fitness_values)
        if total_fitness <= 0:
            # All fitness values are zero: fall back to a uniform choice
            return np.random.choice(len(fitness_values))
        probabilities = [fitness / total_fitness for fitness in fitness_values]
        selected_index = np.random.choice(len(fitness_values), p=probabilities)
        return selected_index


    # Translates a binary chromosome such as [0,1,0,1] into the list of selected feature indices [1,3]
    def translate(chromosome):
        return [i for i, gene in enumerate(chromosome) if gene == 1]


    # Creates a random first generation
    def first_pop(num_features, model, X_train, y_train):

        first_pop = []

        # Fill the population with random binary chromosomes
        while len(first_pop)<num_population:
            new_pop=[random.randint(0,1) for _ in range(num_features)]
            first_pop.append(new_pop)
        fitness_values = evaluate(first_pop)

        return first_pop, fitness_values


    # Takes a population and returns the fitness value of each chromosome

    fitness_cache = {}

    def evaluate(population):
        Fitness_values = []
        for chromosome in population:
            feats = tuple(i for i,g in enumerate(chromosome) if g == 1)
            if len(feats) < least_num:   # minimum number of features
                Fitness_values.append(0.0)
                continue
            if feats in fitness_cache:
                Fitness_values.append(fitness_cache[feats])
                continue
            score = cross_val_score(clf, X_train[:, feats], y_train,
                                    scoring="accuracy", cv=3, n_jobs=-1).mean()
            fitness_cache[feats] = score
            Fitness_values.append(score)
        return Fitness_values



    # Crossover
    def crossover(p1, p2):
        cross_point = random.randint(1, num_features - 1)  # randint is inclusive; keep the cut inside the chromosome
        c1 = p1[:cross_point] + p2[cross_point:]
        c2 = p2[:cross_point] + p1[cross_point:]
        return c1, c2


    # Mutation
    def mutate(individual):

        mut_point = random.randint(1, num_features)

        if random.randint(0, 1) == 1:
            pop = [random.choice([0, 1]) for _ in range(num_features - mut_point)]
            mutated_individual = individual[:mut_point] + pop[:]
        else:
            pop = [random.choice([0, 1]) for _ in range(mut_point)]
            mutated_individual = pop[:] + individual[mut_point:]

        return mutated_individual


    # Genetic algorithm
    def gen(population, rec_values):
        new_population = []

        # Generate new individuals
        while len(new_population) < num_population:
            selected_parents = [roulette_selection(rec_values) for _ in range(2)]
            par1, par2 = population[selected_parents[0]], population[selected_parents[1]]

            if random.random() < mutation_rate:
                new1, new2 = mutate(par1), mutate(par2)
                new_population.extend([new1, new2])
            else:
                ch1, ch2 = crossover(par1, par2)
                new_population.extend([ch1, ch2])

        # Evaluate fitness of new individuals
        new_rec_values = evaluate(new_population)

        # Combine populations and fitness values
        population_pool = population + new_population
        rec_values += new_rec_values

        # Select elite individuals from the combined pool (parents and offspring)
        elite_indices = heapq.nlargest(elitism_n, range(len(population_pool)), key=lambda index: rec_values[index])
        elite_population = [population_pool[index] for index in elite_indices]
        elite_rec_values = [rec_values[index] for index in elite_indices]

        # Remove elite individuals from the pool
        for index in sorted(elite_indices, reverse=True):
            del population_pool[index]
            del rec_values[index]

        # Select remaining individuals using roulette selection
        selected_indices = [roulette_selection(rec_values) for _ in range(num_population - elitism_n)]
        selected_population = [population_pool[index] for index in selected_indices]
        selected_rec_values = [rec_values[index] for index in selected_indices]

        # Update population and fitness values
        population = elite_population + selected_population
        rec_values = elite_rec_values + selected_rec_values

        # Record best fitness and corresponding features
        best_index = max(range(len(population)), key=lambda index: rec_values[index])
        Best_rec.append(rec_values[best_index])
        Best_fea.append(translate(population[best_index]))

        return population, rec_values


    def check_last_five_lists(list_of_lists):
        # Early-stopping check: despite its name, this compares the last 15 best-feature lists
        if len(list_of_lists) < 15:
            return False  # Not enough lists to compare

        last_five_lists = list_of_lists[-15:]
        # Stop when the best feature set has not changed over those generations
        return all(lst == last_five_lists[0] for lst in last_five_lists)


    # Now running
    generation = 0

    for g in range(num_gens):
        print(g)
        if g == 0:
            pop_u, rec_values = first_pop(df.shape[1] - 1, LogisticRegression(), X_train, y_train)

        else:
            pop_u, rec_values = gen(pop_u, rec_values)

            clf.fit(X_train[:, Best_fea[-1]], y_train)
            y_predic = clf.predict(X_train[:, Best_fea[-1]])
            acc = accuracy_score(y_train, y_predic)

            Results.loc[row, "train"] = acc
            print("train: ", acc)

            scores = cross_val_score(clf, X_train[:,  Best_fea[-1]], y_train, scoring='accuracy', cv=3)
            acc = scores.mean()
            Results.loc[row, "validation"] = acc
            print("validation:", acc)

            y_predic = clf.predict(X_test[:, Best_fea[-1]])
            acc = accuracy_score(y_test, y_predic)

            Results.loc[row, "test"] = acc
            print("test:", acc)
            print("***************")

            Results.loc[row, "classifier"] = type(clf).__name__
            Results.loc[row, "Best"] = str(Best_fea[-1])
            Results.loc[row, "Best Res"] = Best_rec[-1]
            row += 1
        generation += 1
        if check_last_five_lists(Best_fea):
            break

    Results.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\GA\Results{}.csv".format(name))
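
**Sketch: the GA operators in isolation.** The genetic algorithm above encodes a feature subset as a binary chromosome, picks parents by fitness-proportional (roulette) selection and recombines them with single-point crossover. The stand-alone functions below sketch these three pieces so they can be checked on small inputs. The names `roulette_index`, `single_point_crossover` and `selected_features`, the explicit `rng` argument and the uniform fallback for all-zero fitness are illustrative assumptions, not the notebook's exact implementation.

```python
import numpy as np

def roulette_index(fitness_values, rng):
    """Pick an index with probability proportional to fitness (uniform if all are zero)."""
    fitness = np.asarray(fitness_values, dtype=float)
    total = fitness.sum()
    if total <= 0:
        return int(rng.integers(len(fitness)))
    return int(rng.choice(len(fitness), p=fitness / total))

def single_point_crossover(p1, p2, point):
    """Swap the tails of two equal-length chromosomes after position `point`."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def selected_features(chromosome):
    """Indices of the genes set to 1, i.e. the selected feature columns."""
    return [i for i, gene in enumerate(chromosome) if gene == 1]

rng = np.random.default_rng(0)
toy_pick = roulette_index([0.0, 0.0, 1.0], rng)  # only index 2 has non-zero fitness
toy_c1, toy_c2 = single_point_crossover([1, 1, 1, 1], [0, 0, 0, 0], point=1)
toy_feats = selected_features([0, 1, 0, 1])
print(toy_pick, toy_c1, toy_c2, toy_feats)
```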


In [None]:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, mean_absolute_error
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from itertools import product
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Load data
df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\test.csv")
df_test = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\test_original.csv")


binary = [ 'Retino', 'htn', 'sex',  'CAD', 'CVA', 'Smoking']
for col in binary:  # loop variable renamed so it does not shadow the built-in bin()
    df[col] = np.round(df[col])
    df_test[col] = np.round(df_test[col])

dff = df.copy()
dfff = df.copy()

missings = [i for i in df.columns if df[i].isna().sum() > 0]
continues = ['PLT', 'hip', 'CRP', 'VitD', 'insulin', 'UA', 'ast', 'alt', 'alkp', 'homa']



param_grid_svr = {'C': [0.1, 1, 10], 'epsilon': [0.1, 0.2, 0.3]}
param_grid_rf = {'n_estimators': [50, 100, 200], "max_depth" :[3,5,7]}
param_grid_gbr = {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2]}
param_grid_etr = {'n_estimators': [50, 100, 200], "max_depth" :[3,5,7]}
param_grid_abr = {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 1]}
param_grid_lgbm = {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2]}
param_grid_xgboost = {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2]}

params = [param_grid_svr,param_grid_rf,  param_grid_gbr, param_grid_etr, param_grid_abr, param_grid_lgbm, param_grid_xgboost]


estimators = {
    'SVR': IterativeImputer(estimator=SVR(), random_state=101, max_iter=10, initial_strategy='mean'),
    'RandomForest5': IterativeImputer(estimator=RandomForestRegressor(), max_iter=10, random_state=101, initial_strategy='mean'),
    'GradientBoosting': IterativeImputer(estimator=GradientBoostingRegressor(), max_iter=10, random_state=101, initial_strategy='mean'),
    'ExtraTrees': IterativeImputer(estimator=ExtraTreesRegressor(), max_iter=10, random_state=101, initial_strategy='mean'),
    'AdaBoost': IterativeImputer(estimator=AdaBoostRegressor(), max_iter=10, random_state=101, initial_strategy='mean'),
    'LightGBM': IterativeImputer(estimator=LGBMRegressor(), max_iter=10, random_state=101, initial_strategy='mean'),
    'XGBoost': IterativeImputer(estimator=XGBRegressor(), max_iter=10, random_state=101, initial_strategy='mean'),
}

# Dictionary to store results
results = {}


# Function to evaluate the imputation results
def evaluate_imputation(df_imputed, df_true, missings):
    """Score imputed values against the ground truth, column by column.

    Only cells that are missing in the global `dff` are scored.
    Continuous columns get MSE, R2 and mean absolute error (stored under
    the key 'MABR'); all other columns are rounded and scored by accuracy.
    """
    evaluations = {}
    for col in missings:
        missing_indices = dff[col].isna()
        y_true = df_true.loc[missing_indices, col].values
        y_pred = df_imputed.loc[missing_indices, col].values

        if col in continues:
            mse = mean_squared_error(y_true, y_pred)
            r2 = r2_score(y_true, y_pred)
            MABR = mean_absolute_error(y_true, y_pred)
            evaluations[col] = {'MSE': mse, 'R2': r2, 'MABR': MABR}
        else:
            y_pred = np.round(y_pred).astype(int)
            acc = accuracy_score(y_true, y_pred)
            evaluations[col] = {'Accuracy': acc}
    return evaluations

combos={}

for j, (name, estimator) in enumerate(estimators.items()):
    combinations = product(*params[j].values())

    for i ,comb in enumerate(combinations):
        print(f"Imputing with {name} _ {i}...")
        param_combo = dict(zip(params[j].keys(), comb))
        estimator.estimator.set_params(**param_combo)
        combos[f"{name}_{i}"] = param_combo

        if callable(estimator):
            df_imputed = estimator(dfff.copy())
        else:
            df_imputed = estimator.fit_transform(dfff.copy())
        df_imputed = pd.DataFrame(df_imputed, columns=dfff.columns)
        results[f"{name}_{i}"] = evaluate_imputation(df_imputed, df_test, missings)


print(results)

scale_weight = {}
sum_miss = np.sum(dff.isna().sum(), axis=0)

for cls in df_test.columns:
    we = dfff[cls].isna().sum()
    scale_weight[cls] = we / sum_miss


labels =[]
con_values = []
binary_values = []

continues = ['PLT','hip', 'CRP', 'VitD', 'insulin', 'UA', 'ast', 'alt', 'alkp', 'homa']
binary = ['Retino','CAD', 'CVA', 'Smoking']
for est in results.keys():
    labels.append(est)
    continues_score = 0
    binary_score = 0

    scale_weight_con = 0
    scale_weight_con_list = []
    for con in continues:
        continues_score = continues_score + results[est][con]["R2"] * scale_weight[con]
        scale_weight_con = scale_weight_con + scale_weight[con]
        scale_weight_con_list.append(scale_weight[con])
    con_values.append(continues_score /scale_weight_con )

    scale_weight_bin = 0
    scale_weight_bin_list = []
    for col in binary:  # `col` avoids shadowing the built-in bin()
        binary_score = binary_score + results[est][col]["Accuracy"] * scale_weight[col]
        scale_weight_bin = scale_weight_bin + scale_weight[col]
        scale_weight_bin_list.append(scale_weight[col])
    binary_values.append(binary_score / scale_weight_bin)


#
# plt.bar(labels,con_values)
# plt.title("continues")
# plt.show()
#
# plt.bar(labels,binary_values)
# plt.title("binary")
# plt.show()

Res = pd.DataFrame(columns=["Labels" ,"Labels2", "Parameters" , "Continues" , "Binary","Details"])
Res["Labels"] = labels
Res["Labels2"] = combos.keys()
Res["Parameters"] = combos.values()
Res["Continues"] = con_values
Res["Binary"] = binary_values
Res["Details"] = results.values()


Wi_con = pd.DataFrame(columns=["Con", "Con_weight"])  # only the continuous-column weights belong here
Wi_bin = pd.DataFrame(columns = ["Bin","Bin_weight"])
Wi_con["Con"] = continues
Wi_con["Con_weight"] = scale_weight_con_list

Wi_bin["Bin"] = binary
Wi_bin["Bin_weight"] = scale_weight_bin_list



Res.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\ResultsOfGrid.csv")
Wi_con.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\WeightsCon.csv")
Wi_bin.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\WeightsBin.csv")
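
The `Continues` and `Binary` columns written above are missing-count-weighted means of the per-column metrics: each column's R2 (or accuracy) is weighted by its share of the missing cells and then renormalised within its group, so the global normaliser `sum_miss` cancels out. A minimal sketch of that calculation with plain dictionaries follows. The helper name `weighted_group_score` and the example numbers are illustrative assumptions, not part of the pipeline or its results.

```python
def weighted_group_score(scores, missing_counts):
    """Missing-count-weighted mean of per-column scores.

    scores:         {column: metric value, e.g. R2 or accuracy}
    missing_counts: {column: number of missing cells in that column}
    """
    total = sum(missing_counts[c] for c in scores)
    if total == 0:
        raise ValueError("no missing cells in this column group")
    return sum(scores[c] * missing_counts[c] for c in scores) / total


# Illustrative values only (not results from the study).
r2_by_column = {"PLT": 0.50, "CRP": 0.20, "VitD": 0.80}
n_missing = {"PLT": 10, "CRP": 30, "VitD": 60}
print(weighted_group_score(r2_by_column, n_missing))
```

Columns with more missing cells dominate the aggregate, so a method that imputes a sparsely observed variable poorly is penalised more than one that struggles on a nearly complete variable.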


In [None]:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, mean_absolute_error, f1_score
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from itertools import product
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor
import warnings
warnings.filterwarnings('ignore')


df = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\python codes\1-Missing Data\reivse\60-40\test 60-40.csv")
df_test = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\python codes\1-Missing Data\reivse\60-40\test_original 60-40.csv")

binary = ['Retino', 'htn', 'sex', 'CAD', 'CVA', 'Smoking']
for col in binary:  # `col` avoids shadowing the built-in bin()
    df[col] = np.round(df[col])
    df_test[col] = np.round(df_test[col])

dff = df.copy()
dfff = df.copy()

missings = [i for i in df.columns if df[i].isna().sum() > 0]
continues = ['PLT', 'hip', 'CRP', 'VitD', 'insulin', 'UA', 'ast', 'alt', 'alkp', 'homa']
reg_param_grid = {'fit_intercept': [True, False], 'copy_X': [True, False], 'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
dtree_param_grid = {'max_depth': [3, 5, 8], 'min_samples_split': [2, 5, 10]}
svm_param_grid = {'C': [0.1, 1, 10]}
rf_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [3, 5, 8]}
et_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [3, 5, 8]}
xgb_param_grid = {'max_depth': [3, 5, 8], 'learning_rate': [0.1, 0.05, 0.01]}
ada_param_grid = {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 1]}
knn_param_grid = {'n_neighbors':[5,7,10,15]}



params = [knn_param_grid]

estimators = {
    "KNN" : KNNImputer()
}

# Function to evaluate the imputation results
def evaluate_imputation(df_imputed, df_true, missings):
    evaluations = {}
    for col in missings:
        missing_indices = dff[col].isna()
        y_true = df_true.loc[missing_indices, col].values
        y_pred = df_imputed.loc[missing_indices, col].values

        if col in continues:
            mse = mean_squared_error(y_true, y_pred)
            r2 = r2_score(y_true, y_pred)
            MABR = mean_absolute_error(y_true, y_pred)
            evaluations[col] = {'MSE': mse, 'R2': r2, 'MABR': MABR}
        else:
            y_pred = np.clip(y_pred, 0, 1)  # Clip values to [0, 1]
            y_pred = np.round(y_pred).astype(int)
            acc = accuracy_score(y_true, y_pred)
            F1 = f1_score(y_true, y_pred , zero_division = 0)
            evaluations[col] = {'Accuracy': acc , "F1": F1}
    return evaluations

# Process each estimator and save results to CSV
for j, (name, estimator) in enumerate(estimators.items()):
    combinations = product(*params[j].values())
    results = {}
    combos = {}

    for i, comb in enumerate(combinations):
        print(f"Imputing with {name} _ {i}...")
        param_combo = dict(zip(params[j].keys(), comb))
        estimator.set_params(**param_combo)
        combos[f"{name}_{i}"] = param_combo

        df_imputed = estimator.fit_transform(dfff.copy())
        df_imputed = pd.DataFrame(df_imputed, columns=dfff.columns)
        results[f"{name}_{i}"] = evaluate_imputation(df_imputed, df_test, missings)

    scale_weight = {}
    sum_miss = np.sum(dff.isna().sum(), axis=0)

    for cls in df_test.columns:
        we = dfff[cls].isna().sum()
        scale_weight[cls] = we / sum_miss

    labels = []
    con_values = []
    binary_values = []

    continues = ['PLT', 'hip', 'CRP', 'VitD', 'insulin', 'UA', 'ast', 'alt', 'alkp', 'homa']
    binary = ['Retino', 'CAD', 'CVA', 'Smoking']
    for est in results.keys():
        labels.append(est)
        continues_score = 0
        binary_score = 0

        scale_weight_con = 0
        scale_weight_con_list = []
        for con in continues:
            continues_score += results[est][con]["R2"] * scale_weight[con]
            scale_weight_con += scale_weight[con]
            scale_weight_con_list.append(scale_weight[con])
        con_values.append(continues_score / scale_weight_con)

        scale_weight_bin = 0
        scale_weight_bin_list = []
        for col in binary:  # `col` avoids shadowing the built-in bin()
            binary_score += results[est][col]["Accuracy"] * scale_weight[col]
            scale_weight_bin += scale_weight[col]
            scale_weight_bin_list.append(scale_weight[col])
        binary_values.append(binary_score / scale_weight_bin)

    Res = pd.DataFrame(columns=["Labels", "Labels2", "Parameters", "Continues", "Binary", "Details"])
    Res["Labels"] = labels
    Res["Labels2"] = combos.keys()
    Res["Parameters"] = combos.values()
    Res["Continues"] = con_values
    Res["Binary"] = binary_values
    Res["Details"] = results.values()

    Wi_con = pd.DataFrame(columns=["Con", "Con_weight"])
    Wi_bin = pd.DataFrame(columns=["Bin", "Bin_weight"])
    Wi_con["Con"] = continues
    Wi_con["Con_weight"] = scale_weight_con_list

    Wi_bin["Bin"] = binary
    Wi_bin["Bin_weight"] = scale_weight_bin_list

    Res.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\ResultsOfGrid_KNN.csv")
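
The grid loops above enumerate every hyperparameter combination with `itertools.product` and rebuild a `{name: value}` dictionary for `set_params`. The same idea can be isolated in a small generator; this is a sketch only, and the helper name `expand_grid` is an assumption introduced for illustration.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one {name: value} dict per combination in param_grid."""
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


knn_param_grid = {"n_neighbors": [5, 7, 10, 15]}
for i, combo in enumerate(expand_grid(knn_param_grid)):
    # The label format mirrors the f"{name}_{i}" keys used in the cells above.
    print(f"KNN_{i}", combo)
```

With more than one key, the last key varies fastest, which is the order in which the `_0`, `_1`, ... labels are assigned.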


In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer


data = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\3-my_null_data_40_del.csv")
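imputer = IterativeImputer(
    # Seed the base estimator explicitly so the tree ensembles are built identically on every run
    estimator=ExtraTreesRegressor(n_estimators=300, max_depth=8, random_state=101),
    max_iter=10, random_state=101, initial_strategy='mean'
)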

imputed_data = imputer.fit_transform(data)
imputed_data = pd.DataFrame(imputed_data , columns=data.columns)
binary = ['Retino', 'htn', 'sex', 'CAD', 'CVA', 'Smoking']
for col in binary:
    imputed_data[col] = np.round(imputed_data[col])
imputed_data.to_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\4-imputed_data.csv" , index=False)


In [None]:
import pandas as pd
data =  pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\python codes\1-Missing Data\reivse\90-10\ResultsOfGrid_XGBoost.csv")

estimators = [
    'Ridge' ,
    'DecisionTree',
    'SVR',
    'RandomForest',
    'ExtraTrees',
    # 'XGBoost',  # already loaded above as the initial frame
    'AdaBoost',
    # "missforest",
    # "KNN"
]

for i in estimators:
    new_data = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\python codes\1-Missing Data\reivse\90-10\ResultsOfGrid_{}.csv".format(i))
    data = pd.concat((data , new_data))

data.to_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\python codes\1-Missing Data\reivse\90-10\AllToGether.csv", index = False)
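
The cells above read and write results through absolute Windows paths, which will not resolve on another machine. One way to keep the file names while making the location configurable is to build the paths from a single base directory with `pathlib`. The following is a sketch under stated assumptions: the helper `result_files` and the directory `outputs/90-10` are hypothetical and would need to match wherever the result CSVs are actually stored.

```python
from pathlib import Path


def result_files(results_dir, names, pattern="ResultsOfGrid_{}.csv"):
    """Return one path per estimator name under results_dir."""
    base = Path(results_dir)
    return [base / pattern.format(name) for name in names]


# Hypothetical directory; the estimator names follow the list used above.
paths = result_files("outputs/90-10", ["XGBoost", "Ridge", "KNN"])
print([p.name for p in paths])
```

The resulting list can be passed to `pd.read_csv` one path at a time and concatenated in a single `pd.concat` call.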


In [None]:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, mean_absolute_error, f1_score
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from itertools import product
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor
import warnings
warnings.filterwarnings('ignore')


df = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\python codes\1-Missing Data\reivse\60-40\test 60-40.csv")
df_test = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\python codes\1-Missing Data\reivse\60-40\test_original 60-40.csv")

binary = ['Retino', 'htn', 'sex', 'CAD', 'CVA', 'Smoking']
for col in binary:  # `col` avoids shadowing the built-in bin()
    df[col] = np.round(df[col])
    df_test[col] = np.round(df_test[col])

dff = df.copy()
dfff = df.copy()

missings = [i for i in df.columns if df[i].isna().sum() > 0]
continues = ['PLT', 'hip', 'CRP', 'VitD', 'insulin', 'UA', 'ast', 'alt', 'alkp', 'homa']
reg_param_grid = {'fit_intercept': [True, False], 'copy_X': [True, False], 'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
dtree_param_grid = {'max_depth': [3, 5, 8], 'min_samples_split': [2, 5, 10]}
svm_param_grid = {'C': [0.1, 1, 10]}
rf_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [3, 5, 8]}
et_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [3, 5, 8]}
xgb_param_grid = {'max_depth': [3, 5, 8], 'learning_rate': [0.1, 0.05, 0.01]}
ada_param_grid = {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 1]}
knn_param_grid = {'n_neighbors':[5,7,10,15]}



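params = [knn_param_grid]  # kept for symmetry with the KNN cell; no grid is searched here

estimators = {
    # Placeholder only: the MissForest outputs are read from precomputed CSV
    # files below, so this KNNImputer instance is never fitted.
    "missforest" : KNNImputer()
}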

# Function to evaluate the imputation results
def evaluate_imputation(df_imputed, df_true, missings):
    evaluations = {}
    for col in missings:
        missing_indices = dff[col].isna()
        y_true = df_true.loc[missing_indices, col].values
        y_pred = df_imputed.loc[missing_indices, col].values

        if col in continues:
            mse = mean_squared_error(y_true, y_pred)
            r2 = r2_score(y_true, y_pred)
            MABR = mean_absolute_error(y_true, y_pred)
            evaluations[col] = {'MSE': mse, 'R2': r2, 'MABR': MABR}
        else:
            y_pred = np.clip(y_pred, 0, 1)  # Clip values to [0, 1]
            y_pred = np.round(y_pred).astype(int)
            acc = accuracy_score(y_true, y_pred)
            F1 = f1_score(y_true, y_pred , zero_division = 0)
            evaluations[col] = {'Accuracy': acc , "F1": F1}
    return evaluations

# Score the precomputed MissForest outputs and save results to CSV
for j, (name, estimator) in enumerate(estimators.items()):
    # `estimator` is not used here: the imputed tables are read from disk
    # as output1.csv ... output9.csv.
    results = {}
    combos = {}

    for i in range(1,10):
        print(f"Evaluating {name} _ {i}...")

        df_imputed = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\python codes\1-Missing Data\reivse\60-40\missforest 60-40\output{}.csv".format(i))
        results[f"{name}_{i}"] = evaluate_imputation(df_imputed, df_test, missings)

    scale_weight = {}
    sum_miss = np.sum(dff.isna().sum(), axis=0)

    for cls in df_test.columns:
        we = dfff[cls].isna().sum()
        scale_weight[cls] = we / sum_miss

    labels = []
    con_values = []
    binary_values = []

    continues = ['PLT', 'hip', 'CRP', 'VitD', 'insulin', 'UA', 'ast', 'alt', 'alkp', 'homa']
    binary = ['Retino', 'CAD', 'CVA', 'Smoking']
    for est in results.keys():
        labels.append(est)
        continues_score = 0
        binary_score = 0

        scale_weight_con = 0
        scale_weight_con_list = []
        for con in continues:
            continues_score += results[est][con]["R2"] * scale_weight[con]
            scale_weight_con += scale_weight[con]
            scale_weight_con_list.append(scale_weight[con])
        con_values.append(continues_score / scale_weight_con)

        scale_weight_bin = 0
        scale_weight_bin_list = []
        for col in binary:  # `col` avoids shadowing the built-in bin()
            binary_score += results[est][col]["Accuracy"] * scale_weight[col]
            scale_weight_bin += scale_weight[col]
            scale_weight_bin_list.append(scale_weight[col])
        binary_values.append(binary_score / scale_weight_bin)

    Res = pd.DataFrame(columns=["Labels", "Labels2", "Parameters", "Continues", "Binary", "Details"])
    Res["Labels"] = labels
    # Res["Labels2"] = combos.keys()
    # Res["Parameters"] = combos.values()
    Res["Continues"] = con_values
    Res["Binary"] = binary_values
    Res["Details"] = results.values()

    Wi_con = pd.DataFrame(columns=["Con", "Con_weight"])
    Wi_bin = pd.DataFrame(columns=["Bin", "Bin_weight"])
    Wi_con["Con"] = continues
    Wi_con["Con_weight"] = scale_weight_con_list

    Wi_bin["Bin"] = binary
    Wi_bin["Bin_weight"] = scale_weight_bin_list

    Res.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\ResultsOfGrid_missforest.csv")


In [None]:
# ==== Reproducible CV with CIs (No Feature Selection) ====
# Requirements: scikit-learn, numpy, pandas, xgboost, lightgbm
# If needed: pip install scikit-learn xgboost lightgbm pandas numpy

import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

from typing import Dict, List, Tuple

# ------------------------------
# 1) Basic settings
# ------------------------------
TARGET_COL = "fattyliver"    # <<--- set the name of the (0/1) target column here
N_SPLITS = 5            # K-fold
SEEDS: List[int] = [42, 52, 62, 72, 82, 92, 102, 112, 122, 132]  # 10 repeats with different seeds
SCORING_AVG = "binary"  # precision/recall/F1 are computed on the positive class (label=1)
POS_LABEL = 1

# ------------------------------
# 2) Build the models with the stated hyperparameters
#    (the scaler is applied only to scale-sensitive models, inside the Pipeline)
# ------------------------------
def make_models() -> Dict[str, Pipeline]:
    models = {}

    # Logistic Regression {'C': 1}
    models["Logistic Regression"] = Pipeline([
        ("scaler", StandardScaler(with_mean=True, with_std=True)),
        ("clf", LogisticRegression(C=1, solver="liblinear", max_iter=1000, random_state=0))
    ])

    # KNN {'n_neighbors': 7, 'weights': 'distance'}
    models["KNN"] = Pipeline([
        ("scaler", StandardScaler(with_mean=True, with_std=True)),
        ("clf", KNeighborsClassifier(n_neighbors=7, weights="distance"))
    ])

    # SVM {'C': 1}
    models["SVM"] = Pipeline([
        ("scaler", StandardScaler(with_mean=True, with_std=True)),
        ("clf", SVC(C=1, probability=True, random_state=0))
    ])

    # Decision Tree {'max_depth': 5, 'min_samples_split': 5}
    models["Decision Tree"] = Pipeline([
        ("clf", DecisionTreeClassifier(max_depth=5, min_samples_split=5, random_state=0))
    ])

    # Random Forest {'max_depth': None, 'min_samples_split': 10, 'n_estimators': 300}
    models["Random Forest"] = Pipeline([
        ("clf", RandomForestClassifier(
            n_estimators=300, max_depth=None, min_samples_split=10, n_jobs=-1, random_state=0
        ))
    ])

    # Extra Trees {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 300}
    models["Extra Tree"] = Pipeline([
        ("clf", ExtraTreesClassifier(
            n_estimators=300, max_depth=None, min_samples_split=2, n_jobs=-1, random_state=0
        ))
    ])

    # Gradient Boosting {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}
    models["Gradient Boosting"] = Pipeline([
        ("clf", GradientBoostingClassifier(
            learning_rate=0.1, max_depth=5, n_estimators=300, random_state=0
        ))
    ])

    # XGBoost {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}
    models["XGBoost"] = Pipeline([
        ("clf", XGBClassifier(
            n_estimators=300, max_depth=5, learning_rate=0.1,
            subsample=1.0, colsample_bytree=1.0,
            eval_metric="logloss", use_label_encoder=False,
            tree_method="hist", n_jobs=-1, random_state=0
        ))
    ])

    # LightGBM {'force_col_wise': True, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100, 'num_leaves': 31}
    models["LightGBM"] = Pipeline([
        ("clf", LGBMClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=7, num_leaves=31,
            force_col_wise=True, random_state=0, n_jobs=-1
        ))
    ])

    return models

# ------------------------------
# 3) Helper functions for confidence intervals
# ------------------------------
def mean_sd_ci(x: np.ndarray, alpha: float = 0.05) -> Tuple[float, float, float, float]:
    """
    Return mean, sd, ci_low, ci_high.
    Uses the normal approximation with sd/sqrt(n). Note that `alpha` is
    currently not used: z is fixed at 1.96, i.e. a 95% interval.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.mean()
    sd = x.std(ddof=1) if n > 1 else 0.0
    # For reasonably large samples 1.96 is adequate; use a t-quantile for more precision
    z = 1.96
    se = sd / np.sqrt(max(n, 1))
    return mean, sd, mean - z * se, mean + z * se

# ------------------------------
# 4) Repeated CV loop with different seeds
# ------------------------------
def evaluate_models_repeated_cv(
    df: pd.DataFrame,
    target_col: str = TARGET_COL,
    seeds: List[int] = SEEDS,
    n_splits: int = N_SPLITS
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    X = df.drop(columns=[target_col]).values
    y = df[target_col].values

    models = make_models()

    rows_per_fold = []   # results for every fold of every repeat
    summary_rows = []    # final summary

    for model_name, pipe in models.items():
        print("Now running:", model_name)
        # For each model, build a fresh StratifiedKFold for every seed
        per_fold_scores = {
            "accuracy": [], "precision": [], "recall": [], "f1": [], "roc_auc": []
        }

        for seed in seeds:
            skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

            for fold_idx, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
                X_train, X_test = X[train_idx], X[test_idx]
                y_train, y_test = y[train_idx], y[test_idx]

                pipe.fit(X_train, y_train)

                # Predicted probability for ROC-AUC (fall back to decision_function when predict_proba is unavailable)
                if hasattr(pipe[-1], "predict_proba"):
                    y_proba = pipe.predict_proba(X_test)[:, 1]
                elif hasattr(pipe[-1], "decision_function"):
                    # Map the decision_function scores onto [0, 1]
                    df_raw = pipe.decision_function(X_test)
                    # Min-max scaling
                    df_min, df_max = df_raw.min(), df_raw.max()
                    if df_max > df_min:
                        y_proba = (df_raw - df_min) / (df_max - df_min)
                    else:
                        y_proba = np.zeros_like(df_raw, dtype=float)
                else:
                    # Fallback (rarely reached)
                    y_proba = pipe.predict(X_test).astype(float)

                y_pred = (y_proba >= 0.5).astype(int)

                acc = accuracy_score(y_test, y_pred)
                prec = precision_score(y_test, y_pred, average=SCORING_AVG, zero_division=0, pos_label=POS_LABEL)
                rec = recall_score(y_test, y_pred, average=SCORING_AVG, zero_division=0, pos_label=POS_LABEL)
                f1 = f1_score(y_test, y_pred, average=SCORING_AVG, zero_division=0, pos_label=POS_LABEL)
                try:
                    auc = roc_auc_score(y_test, y_proba)
                except ValueError:
                    # Raised when y_test contains only one class
                    auc = np.nan

                per_fold_scores["accuracy"].append(acc)
                per_fold_scores["precision"].append(prec)
                per_fold_scores["recall"].append(rec)
                per_fold_scores["f1"].append(f1)
                per_fold_scores["roc_auc"].append(auc)

                rows_per_fold.append({
                    "model": model_name,
                    "seed": seed,
                    "fold": fold_idx,
                    "accuracy": acc,
                    "precision": prec,
                    "recall": rec,
                    "f1": f1,
                    "roc_auc": auc
                })

        # Summary statistics with a CI for each metric
        for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
            arr = np.array(per_fold_scores[metric], dtype=float)
            # Drop NaN values (possible for roc_auc)
            arr = arr[~np.isnan(arr)]
            mean, sd, lo, hi = mean_sd_ci(arr) if len(arr) > 0 else (np.nan, np.nan, np.nan, np.nan)
            summary_rows.append({
                "model": model_name,
                "metric": metric,
                "mean": mean,
                "std": sd,
                "ci95_low": lo,
                "ci95_high": hi,
                "n_folds": len(arr)
            })

    per_fold_df = pd.DataFrame(rows_per_fold)
    summary_df = pd.DataFrame(summary_rows)

    return per_fold_df, summary_df

# ------------------------------
# 5) Example run
# ------------------------------
# Assumes df has already been prepared and contains the TARGET_COL column
df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")  # example path

per_fold_df, summary_df = evaluate_models_repeated_cv(df, target_col=TARGET_COL)

# Save for the response to reviewers / supplementary materials
per_fold_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\cv_per_fold_no_fs.csv", index=False)
summary_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\cv_summary_no_fs.csv", index=False)

# Show the summary sorted by AUC
print(
    summary_df[summary_df["metric"] == "roc_auc"]
    .sort_values("mean", ascending=False)
    .assign(mean_sd=lambda d: d["mean"].round(3).astype(str) + " ± " + d["std"].round(3).astype(str),
            ci95=lambda d: "[" + d["ci95_low"].round(3).astype(str) + ", " + d["ci95_high"].round(3).astype(str) + "]")
    [["model","mean_sd","ci95","n_folds"]]
    .to_string(index=False)
)
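
The `mean_sd_ci` helper above accepts `alpha` but always uses z = 1.96. A variant that derives the quantile from `alpha` can be written with the standard library alone, as sketched below. The name `mean_sd_ci_alpha` and the example scores are assumptions made for illustration, not values from the study. Because fold scores from repeated CV share training data, an interval of this kind is only approximate and is likely to be somewhat narrow.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_sd_ci_alpha(scores, alpha=0.05):
    """Return (mean, sd, ci_low, ci_high) using a normal approximation.

    The half-width is z * sd / sqrt(n), where z is the (1 - alpha/2)
    quantile of the standard normal distribution.
    """
    xs = [float(v) for v in scores]
    n = len(xs)
    if n == 0:
        raise ValueError("scores must not be empty")
    m = mean(xs)
    sd = stdev(xs) if n > 1 else 0.0  # sample SD (ddof=1), as in mean_sd_ci
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * sd / sqrt(n)
    return m, sd, m - half, m + half


# Illustrative fold scores only.
m, sd, lo, hi = mean_sd_ci_alpha([0.80, 0.82, 0.78, 0.81, 0.79])
print(f"{m:.3f} ± {sd:.3f}  CI95 [{lo:.3f}, {hi:.3f}]")
```

Passing `alpha=0.01` widens the interval to a 99% level without touching the rest of the summary code.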


In [None]:
# ==== Repeated CV + 95% CI with SelectKBest (per-model K) ====
# Requirements: scikit-learn, numpy, pandas, xgboost, lightgbm

import numpy as np
import pandas as pd
from typing import Dict, List, Tuple

from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# ------------------------------
# Settings
# ------------------------------
TARGET_COL = "fattyliver"   # to avoid relying on the column name, use the iloc variant further down
N_SPLITS = 5
SEEDS = [42, 52, 62, 72, 82, 92, 102, 112, 122, 132]  # 10 repeats
SCORING_AVG = "binary"
POS_LABEL = 1

# Selected K for each model (as provided)
# K_MAP: Dict[str, int] = {
#     "Logistic Regression": 28,
#     "KNN": 6,
#     "SVM": 28,
#     "Decision Tree": 5,
#     "Random Forest": 31,
#     "Extra Tree": 22,
#     "Gradient Boosting": 25,
#     "XGBoost": 22,
#     "LightGBM": 21,
# }

# K values used with SMOTE
K_MAP: Dict[str, int] = {
    "Logistic Regression": 26,
    "KNN": 12,
    "SVM": 18,
    "Decision Tree": 8,
    "Random Forest": 22,
    "Extra Tree": 24,
    "Gradient Boosting": 22,
    "XGBoost": 31,
    "LightGBM": 30,
}


# ------------------------------
# Build the models with SelectKBest inside the Pipeline
# ------------------------------
def make_models_kbest() -> Dict[str, Pipeline]:
    # Score features with mutual information; MI estimation is stochastic, so random_state is fixed for reproducibility
    selector = lambda k: SelectKBest(score_func=lambda X, y: mutual_info_classif(
        X, y, random_state=0, discrete_features="auto"
    ), k=k)

    models = {}

    # 1) Logistic Regression  {'C': 1}
    models["Logistic Regression"] = Pipeline([
        ("kbest", selector(K_MAP["Logistic Regression"])),
        ("scaler", StandardScaler(with_mean=True, with_std=True)),
        ("clf", LogisticRegression(C=1, solver="liblinear", max_iter=1000, random_state=0)),
    ])

    # 2) KNN  {'n_neighbors': 7, 'weights': 'distance'}
    models["KNN"] = Pipeline([
        ("kbest", selector(K_MAP["KNN"])),
        ("scaler", StandardScaler(with_mean=True, with_std=True)),
        ("clf", KNeighborsClassifier(n_neighbors=7, weights="distance")),
    ])

    # 3) SVM  {'C': 1}
    models["SVM"] = Pipeline([
        ("kbest", selector(K_MAP["SVM"])),
        ("scaler", StandardScaler(with_mean=True, with_std=True)),
        ("clf", SVC(C=1, probability=True, random_state=0)),
    ])

    # 4) Decision Tree  {'max_depth': 5, 'min_samples_split': 5}
    models["Decision Tree"] = Pipeline([
        ("kbest", selector(K_MAP["Decision Tree"])),
        ("clf", DecisionTreeClassifier(max_depth=5, min_samples_split=5, random_state=0)),
    ])

    # 5) Random Forest  {'max_depth': None, 'min_samples_split': 10, 'n_estimators': 300}
    models["Random Forest"] = Pipeline([
        ("kbest", selector(K_MAP["Random Forest"])),
        ("clf", RandomForestClassifier(
            n_estimators=300, max_depth=None, min_samples_split=10, n_jobs=-1, random_state=0
        )),
    ])

    # 6) Extra Tree  {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 300}
    models["Extra Tree"] = Pipeline([
        ("kbest", selector(K_MAP["Extra Tree"])),
        ("clf", ExtraTreesClassifier(
            n_estimators=300, max_depth=None, min_samples_split=2, n_jobs=-1, random_state=0
        )),
    ])

    # 7) Gradient Boosting  {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}
    models["Gradient Boosting"] = Pipeline([
        ("kbest", selector(K_MAP["Gradient Boosting"])),
        ("clf", GradientBoostingClassifier(
            learning_rate=0.1, max_depth=5, n_estimators=300, random_state=0
        )),
    ])

    # 8) XGBoost  {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}
    models["XGBoost"] = Pipeline([
        ("kbest", selector(K_MAP["XGBoost"])),
        ("clf", XGBClassifier(
            n_estimators=300, max_depth=5, learning_rate=0.1,
            subsample=1.0, colsample_bytree=1.0,
            eval_metric="logloss", use_label_encoder=False,
            tree_method="hist", n_jobs=-1, random_state=0
        )),
    ])

    # 9) LightGBM  {'force_col_wise': True, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100, 'num_leaves': 31}
    models["LightGBM"] = Pipeline([
        ("kbest", selector(K_MAP["LightGBM"])),
        ("clf", LGBMClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=7, num_leaves=31,
            force_col_wise=True, random_state=0, n_jobs=-1
        )),
    ])

    return models

# ------------------------------
# Compute mean / standard deviation / 95% CI
# ------------------------------
def mean_sd_ci(x: np.ndarray, alpha: float = 0.05) -> Tuple[float, float, float, float]:
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.mean()
    sd = x.std(ddof=1) if n > 1 else 0.0
    z = 1.96
    se = sd / np.sqrt(max(n, 1))
    return mean, sd, mean - z * se, mean + z * se

# ------------------------------
# Evaluation engine (repeated CV with different seeds)
# ------------------------------
def evaluate_models_repeated_cv_kbest(
    df: pd.DataFrame,
    target_col: str = TARGET_COL,
    seeds: List[int] = SEEDS,
    n_splits: int = N_SPLITS
) -> Tuple[pd.DataFrame, pd.DataFrame]:

    # To avoid relying on the column name, use:
    # X = df.iloc[:, :-1].values
    # y = df.iloc[:, -1].values
    X = df.drop(columns=[target_col]).values
    y = df[target_col].values

    models = make_models_kbest()

    rows_per_fold = []
    summary_rows = []

    for model_name, pipe in models.items():
        per_fold_scores = {m: [] for m in ["accuracy", "precision", "recall", "f1", "roc_auc"]}

        for seed in seeds:
            skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

            for fold_idx, (tr, te) in enumerate(skf.split(X, y), start=1):
                X_train, X_test = X[tr], X[te]
                y_train, y_test = y[tr], y[te]

                pipe.fit(X_train, y_train)

                # Predicted probability for AUC
                if hasattr(pipe[-1], "predict_proba"):
                    y_proba = pipe.predict_proba(X_test)[:, 1]
                elif hasattr(pipe[-1], "decision_function"):
                    df_raw = pipe.decision_function(X_test)
                    mn, mx = df_raw.min(), df_raw.max()
                    y_proba = (df_raw - mn) / (mx - mn) if mx > mn else np.zeros_like(df_raw, dtype=float)
                else:
                    y_proba = pipe.predict(X_test).astype(float)

                y_pred = (y_proba >= 0.5).astype(int)

                acc = accuracy_score(y_test, y_pred)
                prec = precision_score(y_test, y_pred, average=SCORING_AVG, zero_division=0, pos_label=POS_LABEL)
                rec = recall_score(y_test, y_pred, average=SCORING_AVG, zero_division=0, pos_label=POS_LABEL)
                f1 = f1_score(y_test, y_pred, average=SCORING_AVG, zero_division=0, pos_label=POS_LABEL)
                try:
                    auc = roc_auc_score(y_test, y_proba)
                except ValueError:
                    auc = np.nan

                per_fold_scores["accuracy"].append(acc)
                per_fold_scores["precision"].append(prec)
                per_fold_scores["recall"].append(rec)
                per_fold_scores["f1"].append(f1)
                per_fold_scores["roc_auc"].append(auc)

                rows_per_fold.append({
                    "model": model_name,
                    "seed": seed,
                    "fold": fold_idx,
                    "accuracy": acc,
                    "precision": prec,
                    "recall": rec,
                    "f1": f1,
                    "roc_auc": auc,
                    "kbest_k": K_MAP[model_name]
                })

        for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
            arr = np.array(per_fold_scores[metric], dtype=float)
            arr = arr[~np.isnan(arr)]
            mean, sd, lo, hi = mean_sd_ci(arr) if len(arr) > 0 else (np.nan, np.nan, np.nan, np.nan)
            summary_rows.append({
                "model": model_name,
                "metric": metric,
                "mean": mean,
                "std": sd,
                "ci95_low": lo,
                "ci95_high": hi,
                "n_folds": len(arr),
                "kbest_k": K_MAP[model_name]
            })

    per_fold_df = pd.DataFrame(rows_per_fold)
    summary_df = pd.DataFrame(summary_rows)
    return per_fold_df, summary_df

# ------------------------------
# Example run
# ------------------------------
df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")


per_fold_df, summary_df = evaluate_models_repeated_cv_kbest(df, target_col=TARGET_COL)
per_fold_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\cv_per_fold_kbest.csv", index=False)
summary_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\cv_summary_kbest.csv", index=False)

print(
    summary_df[summary_df["metric"] == "roc_auc"]
    .sort_values("mean", ascending=False)
    .assign(mean_sd=lambda d: d["mean"].round(3).astype(str) + " ± " + d["std"].round(3).astype(str),
            ci95=lambda d: "[" + d["ci95_low"].round(3).astype(str) + ", " + d["ci95_high"].round(3).astype(str) + "]")
    [["model","kbest_k","mean_sd","ci95","n_folds"]]
    .to_string(index=False)
)
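
`mean_sd_ci` above accepts `alpha` but always uses z = 1.96. A minimal sketch of a variant that derives the critical value from `alpha` follows; the name `mean_sd_ci_alpha` is hypothetical, the scores are made up, and the normal approximation is kept as in the original. Fold scores from repeated CV share training data, so an interval built this way is likely narrower than the true uncertainty.

```python
from statistics import NormalDist

import numpy as np


def mean_sd_ci_alpha(x, alpha=0.05):
    """Mean, sample SD, and a two-sided (1 - alpha) normal-approximation CI."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]  # drop folds where the metric was undefined
    n = len(x)
    if n == 0:
        return np.nan, np.nan, np.nan, np.nan
    mean = x.mean()
    sd = x.std(ddof=1) if n > 1 else 0.0
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    se = sd / np.sqrt(n)
    return mean, sd, mean - z * se, mean + z * se


# Hypothetical fold scores, for illustration only
scores = [0.80, 0.90, 1.00]
m, sd, lo, hi = mean_sd_ci_alpha(scores, alpha=0.05)
print(m, sd, lo, hi)
```

Passing `alpha=0.01` widens the interval, which the fixed `z = 1.96` version cannot do.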


In [None]:
# ==== Repeated CV + 95% CI with PCA (per-model n_components) ====
# Requirements: scikit-learn, numpy, pandas, xgboost, lightgbm

import numpy as np
import pandas as pd
from typing import Dict, List, Tuple

from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# ------------------------------
# Settings
# ------------------------------
TARGET_COL = "fattyliver"   # if you would rather not rely on the name, use iloc below
N_SPLITS = 5
SEEDS: List[int] = [42, 52, 62, 72, 82, 92, 102, 112, 122, 132]  # 10 repeats
SCORING_AVG = "binary"
POS_LABEL = 1

# Best number of PCA components for each model
# PCA_MAP: Dict[str, int] = {
#     "Logistic Regression": 30,
#     "KNN": 15,
#     "SVM": 30,
#     "Decision Tree": 14,
#     "Random Forest": 28,
#     "Extra Tree": 28,
#     "Gradient Boosting": 30,
#     "XGBoost": 25,
#     "LightGBM": 30,
# }

# Best number of PCA components for each model, with SMOTE
PCA_MAP: Dict[str, int] = {
    "Logistic Regression": 29,
    "KNN": 17,
    "SVM": 30,
    "Decision Tree": 20,
    "Random Forest": 17,
    "Extra Tree": 29,
    "Gradient Boosting": 30,
    "XGBoost": 30,
    "LightGBM": 30,
}


# ------------------------------
# Build the models + PCA inside the Pipeline
# ------------------------------
def make_models_pca() -> Dict[str, Pipeline]:
    # PCA follows StandardScaler for all models (features are put on a common scale before decomposition)
    # whiten stays at its default (False): it is usually unnecessary for classification and can add noise
    def block(nc: int):
        return [("scaler", StandardScaler(with_mean=True, with_std=True)),
                ("pca", PCA(n_components=nc, svd_solver="auto", random_state=0))]

    models = {}

    # 1) Logistic Regression  {'C': 1}
    models["Logistic Regression"] = Pipeline(block(PCA_MAP["Logistic Regression"]) + [
        ("clf", LogisticRegression(C=1, solver="liblinear", max_iter=1000, random_state=0))
    ])

    # 2) KNN  {'n_neighbors': 7, 'weights': 'distance'}
    models["KNN"] = Pipeline(block(PCA_MAP["KNN"]) + [
        ("clf", KNeighborsClassifier(n_neighbors=7, weights="distance"))
    ])

    # 3) SVM  {'C': 1}
    models["SVM"] = Pipeline(block(PCA_MAP["SVM"]) + [
        ("clf", SVC(C=1, probability=True, random_state=0))
    ])

    # 4) Decision Tree  {'max_depth': 5, 'min_samples_split': 5}
    models["Decision Tree"] = Pipeline(block(PCA_MAP["Decision Tree"]) + [
        ("clf", DecisionTreeClassifier(max_depth=5, min_samples_split=5, random_state=0))
    ])

    # 5) Random Forest  {'max_depth': None, 'min_samples_split': 10, 'n_estimators': 300}
    models["Random Forest"] = Pipeline(block(PCA_MAP["Random Forest"]) + [
        ("clf", RandomForestClassifier(
            n_estimators=300, max_depth=None, min_samples_split=10, n_jobs=-1, random_state=0
        ))
    ])

    # 6) Extra Tree  {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 300}
    models["Extra Tree"] = Pipeline(block(PCA_MAP["Extra Tree"]) + [
        ("clf", ExtraTreesClassifier(
            n_estimators=300, max_depth=None, min_samples_split=2, n_jobs=-1, random_state=0
        ))
    ])

    # 7) Gradient Boosting  {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}
    models["Gradient Boosting"] = Pipeline(block(PCA_MAP["Gradient Boosting"]) + [
        ("clf", GradientBoostingClassifier(
            learning_rate=0.1, max_depth=5, n_estimators=300, random_state=0
        ))
    ])

    # 8) XGBoost  {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}
    models["XGBoost"] = Pipeline(block(PCA_MAP["XGBoost"]) + [
        ("clf", XGBClassifier(
            n_estimators=300, max_depth=5, learning_rate=0.1,
            subsample=1.0, colsample_bytree=1.0,
            eval_metric="logloss", use_label_encoder=False,
            tree_method="hist", n_jobs=-1, random_state=0
        ))
    ])

    # 9) LightGBM  {'force_col_wise': True, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100, 'num_leaves': 31}
    models["LightGBM"] = Pipeline(block(PCA_MAP["LightGBM"]) + [
        ("clf", LGBMClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=7, num_leaves=31,
            force_col_wise=True, random_state=0, n_jobs=-1
        ))
    ])

    return models

# ------------------------------
# Compute mean / SD / 95% CI
# ------------------------------
def mean_sd_ci(x: np.ndarray, alpha: float = 0.05) -> Tuple[float, float, float, float]:
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.mean()
    sd = x.std(ddof=1) if n > 1 else 0.0
    z = 1.96  # normal critical value for a 95% CI; the alpha argument is not used
    se = sd / np.sqrt(max(n, 1))
    return mean, sd, mean - z * se, mean + z * se

# ------------------------------
# Evaluation engine (repeated CV with different seeds)
# ------------------------------
def evaluate_models_repeated_cv_pca(
    df: pd.DataFrame,
    target_col: str = TARGET_COL,
    seeds: List[int] = SEEDS,
    n_splits: int = N_SPLITS
) -> Tuple[pd.DataFrame, pd.DataFrame]:

    # If you want the last column to always be the target:
    # X = df.iloc[:, :-1].values
    # y = df.iloc[:, -1].values
    X = df.drop(columns=[target_col]).values
    y = df[target_col].values

    models = make_models_pca()

    rows_per_fold = []
    summary_rows = []

    for model_name, pipe in models.items():
        per_fold_scores = {m: [] for m in ["accuracy", "precision", "recall", "f1", "roc_auc"]}

        for seed in seeds:
            skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

            for fold_idx, (tr, te) in enumerate(skf.split(X, y), start=1):
                X_train, X_test = X[tr], X[te]
                y_train, y_test = y[tr], y[te]

                pipe.fit(X_train, y_train)

                # Probability scores for AUC
                if hasattr(pipe[-1], "predict_proba"):
                    y_proba = pipe.predict_proba(X_test)[:, 1]
                elif hasattr(pipe[-1], "decision_function"):
                    df_raw = pipe.decision_function(X_test)
                    mn, mx = df_raw.min(), df_raw.max()
                    y_proba = (df_raw - mn) / (mx - mn) if mx > mn else np.zeros_like(df_raw, dtype=float)
                else:
                    y_proba = pipe.predict(X_test).astype(float)

                y_pred = (y_proba >= 0.5).astype(int)

                acc = accuracy_score(y_test, y_pred)
                prec = precision_score(y_test, y_pred, average=SCORING_AVG, zero_division=0, pos_label=POS_LABEL)
                rec = recall_score(y_test, y_pred, average=SCORING_AVG, zero_division=0, pos_label=POS_LABEL)
                f1 = f1_score(y_test, y_pred, average=SCORING_AVG, zero_division=0, pos_label=POS_LABEL)
                try:
                    auc = roc_auc_score(y_test, y_proba)
                except ValueError:
                    auc = np.nan

                per_fold_scores["accuracy"].append(acc)
                per_fold_scores["precision"].append(prec)
                per_fold_scores["recall"].append(rec)
                per_fold_scores["f1"].append(f1)
                per_fold_scores["roc_auc"].append(auc)

                rows_per_fold.append({
                    "model": model_name,
                    "seed": seed,
                    "fold": fold_idx,
                    "accuracy": acc,
                    "precision": prec,
                    "recall": rec,
                    "f1": f1,
                    "roc_auc": auc,
                    "pca_n_components": PCA_MAP[model_name]
                })

        for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
            arr = np.array(per_fold_scores[metric], dtype=float)
            arr = arr[~np.isnan(arr)]
            mean, sd, lo, hi = mean_sd_ci(arr) if len(arr) > 0 else (np.nan, np.nan, np.nan, np.nan)
            summary_rows.append({
                "model": model_name,
                "metric": metric,
                "mean": mean,
                "std": sd,
                "ci95_low": lo,
                "ci95_high": hi,
                "n_folds": len(arr),
                "pca_n_components": PCA_MAP[model_name]
            })

    per_fold_df = pd.DataFrame(rows_per_fold)
    summary_df = pd.DataFrame(summary_rows)
    return per_fold_df, summary_df

# ------------------------------
# Example run
# ------------------------------
df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")

per_fold_df, summary_df = evaluate_models_repeated_cv_pca(df, target_col=TARGET_COL)
per_fold_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\cv_per_fold_pca.csv", index=False)
summary_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\cv_summary_pca.csv", index=False)

# Quick check of the AUCs:
print(
    summary_df[summary_df["metric"] == "roc_auc"]
    .sort_values("mean", ascending=False)
    .assign(mean_sd=lambda d: d["mean"].round(3).astype(str) + " ± " + d["std"].round(3).astype(str),
            ci95=lambda d: "[" + d["ci95_low"].round(3).astype(str) + ", " + d["ci95_high"].round(3).astype(str) + "]")
    [["model","pca_n_components","mean_sd","ci95","n_folds"]]
    .to_string(index=False)
)
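
`PCA_MAP` requests up to 30 components. With scikit-learn's `PCA`, `n_components` cannot exceed `min(n_samples, n_features)` of the data it is fitted on, so a small guard may help if the feature table changes. The helper `cap_n_components` below is hypothetical and not part of the original pipeline, and the numbers are illustrative.

```python
from typing import Dict


def cap_n_components(pca_map: Dict[str, int], n_train_samples: int, n_features: int) -> Dict[str, int]:
    """Clip each requested component count to what the training fold allows."""
    limit = min(n_train_samples, n_features)
    return {name: min(nc, limit) for name, nc in pca_map.items()}


# Illustrative values only: a 28-feature table with 400 training rows
requested = {"KNN": 17, "SVM": 30, "Random Forest": 17}
capped = cap_n_components(requested, n_train_samples=400, n_features=28)
print(capped)
```

In this notebook the result could be passed to `block(...)` in place of the raw `PCA_MAP` values.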


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score

# ===== Settings =====
N_SPLITS = 5
SEEDS = [0, 42, 101, 202, 303]
TARGET_COL = "fattyliver"  # or the last column via iloc[:, -1]

# ===== Models with the selected hyperparameters =====
models = {
    "Logistic Regression": LogisticRegression(C=1, max_iter=500, solver="liblinear"),
    "KNN": KNeighborsClassifier(n_neighbors=7, weights="distance"),
    "SVM": SVC(C=1, probability=True, random_state=0),  # seeded so the probability estimates are reproducible
    "Decision Tree": DecisionTreeClassifier(max_depth=5, min_samples_split=5, random_state=0),
    "Random Forest": RandomForestClassifier(max_depth=None, min_samples_split=10, n_estimators=300, random_state=0),
    "Extra Tree": ExtraTreesClassifier(max_depth=None, min_samples_split=2, n_estimators=300, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(learning_rate=0.1, max_depth=5, n_estimators=300, random_state=0),
    "XGBoost": XGBClassifier(learning_rate=0.1, max_depth=5, n_estimators=300, use_label_encoder=False, eval_metric="logloss", random_state=0),
    "LightGBM": LGBMClassifier(force_col_wise=True, learning_rate=0.1, max_depth=7, n_estimators=100, num_leaves=31, random_state=0),
}

def evaluate_models_rfecv(df: pd.DataFrame):
    X = df.drop(columns=[TARGET_COL]).values
    y = df[TARGET_COL].values

    results_rows = []
    for model_name, model in models.items():
        all_metrics = []
        for seed in SEEDS:
            skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=seed)
            fold_metrics = []
            for train_idx, test_idx in skf.split(X, y):
                X_train, X_test = X[train_idx], X[test_idx]
                y_train, y_test = y[train_idx], y[test_idx]

                pipe = Pipeline([
                    ("scaler", StandardScaler()),
                    ("clf", model)
                ])

                pipe.fit(X_train, y_train)
                y_pred = pipe.predict(X_test)
                y_prob = pipe.predict_proba(X_test)[:, 1]

                fold_metrics.append([
                    accuracy_score(y_test, y_pred),
                    recall_score(y_test, y_pred),
                    precision_score(y_test, y_pred),
                    f1_score(y_test, y_pred),
                    roc_auc_score(y_test, y_prob),
                ])
            all_metrics.extend(fold_metrics)

        all_metrics = np.array(all_metrics)
        results_rows.append({
            "Classifier": model_name,
            # "Selected_Features_RFECV": 30,  # earlier setting: 30 features were selected for every model
            "Selected_Features_RFECV": 11,  # hard-coded count from the RFECV run with SMOTE (RFECV is not re-run here)
            "Accuracy_mean": np.mean(all_metrics[:, 0]),
            "Accuracy_std": np.std(all_metrics[:, 0]),
            "Recall_mean": np.mean(all_metrics[:, 1]),
            "Recall_std": np.std(all_metrics[:, 1]),
            "Precision_mean": np.mean(all_metrics[:, 2]),
            "Precision_std": np.std(all_metrics[:, 2]),
            "F1_mean": np.mean(all_metrics[:, 3]),
            "F1_std": np.std(all_metrics[:, 3]),
            "AUC_mean": np.mean(all_metrics[:, 4]),
            "AUC_std": np.std(all_metrics[:, 4]),
        })
    return pd.DataFrame(results_rows)

# ===== Run =====
df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")
results_rfecv = evaluate_models_rfecv(df)
results_rfecv.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\RFECV_Summary.csv", index=False)
print(results_rfecv)
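
This cell summarizes variability with `np.std`, which defaults to the population formula (`ddof=0`), whereas the `mean_sd_ci` helpers in the other cells use the sample formula (`ddof=1`). The sketch below, on made-up scores, shows how the two differ for 25 folds (5 seeds x 5 folds), which matters when the resulting tables are compared side by side.

```python
import numpy as np

# Made-up AUC values for 5 seeds x 5 folds, for illustration only
rng = np.random.default_rng(0)
auc = rng.uniform(0.80, 0.90, size=25)

sd_population = np.std(auc)       # ddof=0, as in the cell above
sd_sample = np.std(auc, ddof=1)   # ddof=1, as in mean_sd_ci

# The two differ by the constant factor sqrt(n / (n - 1))
ratio = sd_sample / sd_population
print(round(ratio, 4))
```

With 25 values the sample SD is about 2% larger, so the choice barely moves the numbers but should be stated consistently.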


In [None]:
# ==== Repeated CV + 95% CI with GA-selected features (per-model subsets) ====
# pip install scikit-learn xgboost lightgbm pandas numpy

import numpy as np
import pandas as pd
from typing import Dict, List, Tuple

from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# ------------------------------
# General settings
# ------------------------------
TARGET_COL = "fattyliver"   # if you would rather not rely on the name, a last-column option is given below
N_SPLITS = 5
SEEDS = [42, 52, 62, 72, 82, 92, 102, 112, 122, 132]   # 10 repeats
POS_LABEL = 1
SCORING_AVG = "binary"

# ------------------------------
# GA mapping: indices of the selected features for each model (from the data you provided).
# Indices are relative to "X = df.drop(target)" (0-based).
# Models without an entry (here: Random Forest) fall back to all features.
# ------------------------------
GA_MAP: Dict[str, List[int]] = {
    "LightGBM":            [0, 1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 17, 20, 24, 25, 26, 29],
    "Gradient Boosting":   [0, 1, 4, 7, 8, 11, 12, 16, 17, 18, 20, 22, 24, 25],
    "Logistic Regression": [0, 1, 2, 6, 7, 9, 10, 11, 12, 13, 16, 17, 18, 19, 24, 25, 29],
    "Extra Tree":          [0, 1, 7, 11, 12, 14, 15, 16, 24, 27, 30],
    "Decision Tree":       [0, 1, 9, 12, 16, 17, 21, 22, 23, 24, 25, 26, 27, 28, 30],
    "XGBoost":             [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 16, 17, 19, 20, 21, 24, 25, 26, 27, 28, 29, 30],
    "KNN":                 [4, 7, 9, 11, 12, 24, 25, 28, 30],
    "SVM":                 [0, 1, 2, 3, 4, 7, 8, 10, 11, 16, 17, 20, 23, 24, 28, 29, 30],
}

# Keep these names in sync with the model keys
MODEL_KEYS = [
    "Logistic Regression", "KNN", "SVM", "Decision Tree",
    "Random Forest", "Extra Tree", "Gradient Boosting", "XGBoost", "LightGBM"
]

# ------------------------------
# Model builder (with n_jobs=-1 and a fast SVM)
# ------------------------------
def build_models() -> Dict[str, object]:
    return {
        "Logistic Regression": LogisticRegression(C=1, solver="liblinear", max_iter=1000, random_state=0),
        "KNN": KNeighborsClassifier(n_neighbors=7, weights="distance", n_jobs=-1),
        "SVM": SVC(C=1, probability=False, random_state=0),  # AUC from decision_function
        "Decision Tree": DecisionTreeClassifier(max_depth=5, min_samples_split=5, random_state=0),
        "Random Forest": RandomForestClassifier(
            n_estimators=300, max_depth=None, min_samples_split=10, n_jobs=-1, random_state=0
        ),
        "Extra Tree": ExtraTreesClassifier(
            n_estimators=300, max_depth=None, min_samples_split=2, n_jobs=-1, random_state=0
        ),
        "Gradient Boosting": GradientBoostingClassifier(
            learning_rate=0.1, max_depth=5, n_estimators=300, random_state=0
        ),
        "XGBoost": XGBClassifier(
            n_estimators=300, max_depth=5, learning_rate=0.1,
            subsample=1.0, colsample_bytree=1.0,
            eval_metric="logloss", use_label_encoder=False,
            tree_method="hist", n_jobs=-1, random_state=0
        ),
        "LightGBM": LGBMClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=7, num_leaves=31,
            force_col_wise=True, n_jobs=-1, random_state=0
        ),
    }

# ------------------------------
# CI helper
# ------------------------------
def mean_sd_ci(x: np.ndarray) -> Tuple[float, float, float, float]:
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    n = len(x)
    if n == 0:
        return (np.nan, np.nan, np.nan, np.nan)
    mean = x.mean()
    sd = x.std(ddof=1) if n > 1 else 0.0
    z = 1.96
    se = sd / np.sqrt(max(n, 1))
    return mean, sd, mean - z * se, mean + z * se

# ------------------------------
# Evaluation with GA feature subsets
# ------------------------------
def evaluate_models_repeated_cv_ga(
    df: pd.DataFrame,
    target_col: str = TARGET_COL,
    seeds: List[int] = SEEDS,
    n_splits: int = N_SPLITS
) -> Tuple[pd.DataFrame, pd.DataFrame]:

    # If you prefer the last column to always be the target:
    # feat_df = df.iloc[:, :-1].copy(); y = df.iloc[:, -1].to_numpy()
    feat_df = df.drop(columns=[target_col]).copy()
    y = df[target_col].to_numpy()
    # (Optional) cast to float32 for speed / lower memory:
    for c in feat_df.columns:
        if pd.api.types.is_numeric_dtype(feat_df[c]):
            feat_df[c] = feat_df[c].astype(np.float32)

    models = build_models()
    all_rows, summary_rows = [], []

    n_features_total = feat_df.shape[1]
    feature_indices_all = list(range(n_features_total))

    for model_name in MODEL_KEYS:
        model = models[model_name]
        # GA indices for this model (all features if the model has no entry)
        ga_idx = GA_MAP.get(model_name, feature_indices_all)

        # Column selector based on the indices (applied inside each fold, no leakage)
        selector = ColumnTransformer(
            transformers=[("sel", "passthrough", ga_idx)],
            remainder="drop", verbose_feature_names_out=False
        )

        # Does this model need a scaler?
        needs_scaler = model_name in {"Logistic Regression", "KNN", "SVM"}
        steps = [("select", selector)]
        if needs_scaler:
            steps.append(("scaler", StandardScaler(with_mean=True, with_std=True)))

        steps.append(("clf", model))
        pipe = Pipeline(steps=steps)

        per_fold = {m: [] for m in ["accuracy","precision","recall","f1","roc_auc"]}

        for seed in seeds:
            skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
            for fold_idx, (tr, te) in enumerate(skf.split(feat_df, y), start=1):
                X_train = feat_df.iloc[tr]
                X_test  = feat_df.iloc[te]
                y_train, y_test = y[tr], y[te]

                pipe.fit(X_train, y_train)

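                # Score for AUC, plus hard labels at the matching threshold
                clf = pipe[-1]
                if hasattr(clf, "predict_proba"):
                    y_score = pipe.predict_proba(X_test)[:, 1]
                    y_pred = (y_score >= 0.5).astype(int)
                elif hasattr(clf, "decision_function"):
                    y_score = pipe.decision_function(X_test)
                    y_pred = (y_score >= 0).astype(int)  # decision_function is thresholded at zero, not 0.5
                else:
                    y_score = pipe.predict(X_test).astype(float)
                    y_pred = (y_score >= 0.5).astype(int)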

                acc = accuracy_score(y_test, y_pred)
                prec = precision_score(y_test, y_pred, average=SCORING_AVG, zero_division=0, pos_label=POS_LABEL)
                rec = recall_score(y_test, y_pred, average=SCORING_AVG, zero_division=0, pos_label=POS_LABEL)
                f1 = f1_score(y_test, y_pred, average=SCORING_AVG, zero_division=0, pos_label=POS_LABEL)
                try:
                    auc = roc_auc_score(y_test, y_score)
                except ValueError:
                    auc = np.nan

                per_fold["accuracy"].append(acc)
                per_fold["precision"].append(prec)
                per_fold["recall"].append(rec)
                per_fold["f1"].append(f1)
                per_fold["roc_auc"].append(auc)

                all_rows.append({
                    "model": model_name,
                    "seed": seed,
                    "fold": fold_idx,
                    "accuracy": acc,
                    "precision": prec,
                    "recall": rec,
                    "f1": f1,
                    "roc_auc": auc,
                    "ga_n_features": len(ga_idx),
                    "ga_indices": ga_idx
                })

        # Summary with 95% CI
        for metric in ["accuracy","precision","recall","f1","roc_auc"]:
            mean, sd, lo, hi = mean_sd_ci(np.array(per_fold[metric], dtype=float))
            summary_rows.append({
                "model": model_name,
                "metric": metric,
                "mean": mean,
                "std": sd,
                "ci95_low": lo,
                "ci95_high": hi,
                "n_folds": len(per_fold[metric]),
                "ga_n_features": len(ga_idx),
                "ga_indices": ga_idx
            })

    per_fold_df = pd.DataFrame(all_rows)
    summary_df = pd.DataFrame(summary_rows)
    return per_fold_df, summary_df

# ------------------------------
# Example run
# ------------------------------
# df = pd.read_csv("your_clean_data.csv")
# If your label is a string, map it to 0/1:
# df["fatty liver"] = df["fatty liver"].map({"Non-MAFLD":0, "MAFLD":1}).astype(int)

# per_fold_df, summary_df = evaluate_models_repeated_cv_ga(df, target_col=TARGET_COL)
# per_fold_df.to_csv("cv_per_fold_ga.csv", index=False)
# summary_df.to_csv("cv_summary_ga.csv", index=False)

df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")

per_fold_df, summary_df = evaluate_models_repeated_cv_ga(df, target_col=TARGET_COL)
per_fold_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\cv_per_fold_Ga.csv", index=False)
summary_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\cv_summary_Ga.csv", index=False)

# Quick view of the AUCs
print(
     summary_df[summary_df["metric"]=="roc_auc"]
     .sort_values("mean", ascending=False)
     .assign(mean_sd=lambda d: d["mean"].round(3).astype(str)+" ± "+d["std"].round(3).astype(str),
             ci95=lambda d: "["+d["ci95_low"].round(3).astype(str)+", "+d["ci95_high"].round(3).astype(str)+"]")
     [["model","ga_n_features","mean_sd","ci95","n_folds"]]
     .to_string(index=False)
 )
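
`GA_MAP` stores positional indices into `df.drop(columns=[target_col])`, and several subsets include index 30, so the feature table needs at least 31 columns for the selector to work. A minimal sketch of a pre-flight check is given below; `check_ga_indices` is a hypothetical helper, and the feature counts are illustrative.

```python
from typing import Dict, List


def check_ga_indices(ga_map: Dict[str, List[int]], n_features: int) -> Dict[str, List[int]]:
    """Return, per model, the GA indices that fall outside 0..n_features-1."""
    bad = {}
    for name, idx in ga_map.items():
        out_of_range = [i for i in idx if i < 0 or i >= n_features]
        if out_of_range:
            bad[name] = out_of_range
    return bad


# Two subsets copied from GA_MAP above; the feature counts are illustrative
subset = {
    "KNN": [4, 7, 9, 11, 12, 24, 25, 28, 30],
    "Gradient Boosting": [0, 1, 4, 7, 8, 11, 12, 16, 17, 18, 20, 22, 24, 25],
}
print(check_ga_indices(subset, n_features=31))  # nothing out of range
print(check_ga_indices(subset, n_features=30))  # KNN's index 30 is flagged
```

In the notebook this could be called with `n_features=feat_df.shape[1]` before the CV loop, so that a mismatch is reported once rather than surfacing inside the first `pipe.fit`.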


In [None]:
import numpy as np
import pandas as pd
from typing import List, Tuple, Dict

from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# ---------- Settings ----------
TARGET_COL = "fattyliver"   # or the name of your target column
N_SPLITS = 5
SEEDS: List[int] = [0, 42, 101, 202, 303]  # you can also use 10 seeds, as in the other scripts
POS_LABEL = 1

# ---------- Models (same hyperparameters as before, faster settings) ----------
def make_models() -> Dict[str, object]:
    return {
        "Logistic Regression": LogisticRegression(C=1, solver="liblinear", max_iter=1000, random_state=0),
        "KNN": KNeighborsClassifier(n_neighbors=7, weights="distance", n_jobs=-1),
        "SVM": SVC(C=1, probability=False, random_state=0),  # AUC from decision_function
        "Decision Tree": DecisionTreeClassifier(max_depth=5, min_samples_split=5, random_state=0),
        "Random Forest": RandomForestClassifier(
            n_estimators=300, max_depth=None, min_samples_split=10, n_jobs=-1, random_state=0
        ),
        "Extra Tree": ExtraTreesClassifier(
            n_estimators=300, max_depth=None, min_samples_split=2, n_jobs=-1, random_state=0
        ),
        "Gradient Boosting": GradientBoostingClassifier(
            learning_rate=0.1, max_depth=5, n_estimators=300, random_state=0
        ),
        "XGBoost": XGBClassifier(
            n_estimators=300, max_depth=5, learning_rate=0.1,
            subsample=1.0, colsample_bytree=1.0,
            eval_metric="logloss", use_label_encoder=False,
            tree_method="hist", n_jobs=-1, random_state=0, verbosity=0
        ),
        "LightGBM": LGBMClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=7, num_leaves=31,
            force_col_wise=True, n_jobs=-1, random_state=0, verbose=-1
        ),
    }

# ---------- Helper: mean / standard deviation / 95% CI ----------
def mean_sd_ci(x: np.ndarray) -> Tuple[float, float, float, float]:
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    n = len(x)
    if n == 0:
        return np.nan, np.nan, np.nan, np.nan
    mean = x.mean()
    sd = x.std(ddof=1) if n > 1 else 0.0
    z = 1.96
    se = sd / np.sqrt(n)
    return mean, sd, mean - z*se, mean + z*se

# ---------- RFECV evaluation (fixed: 30 selected features) ----------
def evaluate_models_repeated_cv_rfecv(
    df: pd.DataFrame,
    target_col: str = TARGET_COL,
    seeds: List[int] = SEEDS,
    n_splits: int = N_SPLITS
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    # If you want the last column to always be the target:
    # Xall = df.iloc[:, :-1].copy(); y = df.iloc[:, -1].to_numpy()
    Xall = df.drop(columns=[target_col]).copy()
    y = df[target_col].to_numpy()

    # Light optimization: cast numeric columns to float32
    for c in Xall.columns:
        if pd.api.types.is_numeric_dtype(Xall[c]):
            Xall[c] = Xall[c].astype(np.float32)

    models = make_models()

    per_fold_rows = []
    summary_rows = []
    SELECTED_COUNT = 30  # per your RFECV result

    for model_name, clf in models.items():
        # Scale the inputs for scale-sensitive models
        needs_scaler = model_name in {"Logistic Regression", "KNN", "SVM"}
        steps = []
        if needs_scaler:
            steps.append(("scaler", StandardScaler()))
        steps.append(("clf", clf))
        pipe = Pipeline(steps=steps)

        scores_collect = {m: [] for m in ["accuracy","precision","recall","f1","roc_auc"]}

        for seed in seeds:
            skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
            for fold_idx, (tr, te) in enumerate(skf.split(Xall, y), start=1):
                X_train, X_test = Xall.iloc[tr].values, Xall.iloc[te].values
                y_train, y_test = y[tr], y[te]

                pipe.fit(X_train, y_train)

                # Score for AUC and predicted labels
                clf_final = pipe[-1]
                if hasattr(clf_final, "predict_proba"):
                    y_score = pipe.predict_proba(X_test)[:, 1]
                    y_pred = (y_score >= 0.5).astype(int)
                elif hasattr(clf_final, "decision_function"):
                    y_score = pipe.decision_function(X_test)
                    y_pred = (y_score >= 0).astype(int)  # zero threshold for decision_function
                else:
                    y_score = pipe.predict(X_test).astype(float)
                    y_pred = (y_score >= 0.5).astype(int)

                acc = accuracy_score(y_test, y_pred)
                prec = precision_score(y_test, y_pred, average="binary", zero_division=0, pos_label=POS_LABEL)
                rec = recall_score(y_test, y_pred, average="binary",  zero_division=0, pos_label=POS_LABEL)
                f1 = f1_score(y_test, y_pred, average="binary", zero_division=0, pos_label=POS_LABEL)
                try:
                    auc = roc_auc_score(y_test, y_score)
                except ValueError:
                    auc = np.nan

                scores_collect["accuracy"].append(acc)
                scores_collect["precision"].append(prec)
                scores_collect["recall"].append(rec)
                scores_collect["f1"].append(f1)
                scores_collect["roc_auc"].append(auc)

                per_fold_rows.append({
                    "model": model_name,
                    "seed": seed,
                    "fold": fold_idx,
                    "accuracy": acc,
                    "precision": prec,
                    "recall": rec,
                    "f1": f1,
                    "roc_auc": auc,
                    "kbest_k": SELECTED_COUNT  # column name kept for compatibility with the other result tables
                })

        # Summarize every metric in one uniform (long) format
        for metric in ["accuracy","precision","recall","f1","roc_auc"]:
            mean, sd, lo, hi = mean_sd_ci(np.array(scores_collect[metric], dtype=float))
            summary_rows.append({
                "model": model_name,
                "metric": metric,
                "mean": mean,
                "std": sd,
                "ci95_low": lo,
                "ci95_high": hi,
                "n_folds": len(scores_collect[metric]),
                "kbest_k": SELECTED_COUNT  # column name kept for compatibility with the other result tables
            })

    per_fold_df = pd.DataFrame(per_fold_rows)
    summary_df = pd.DataFrame(summary_rows)
    return per_fold_df, summary_df


df = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\6-balanced_data.csv")
per_fold_df, summary_df = evaluate_models_repeated_cv_rfecv(df, target_col=TARGET_COL)
summary_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\RFECV2_Summary.csv", index=False)
per_fold_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\per fold 2RFECV_Summary.csv", index=False)
print(summary_df.head())


In [None]:
import pandas as pd

# File path
file_path = r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\1-my_null_data.csv"

# Read the CSV file
df = pd.read_csv(file_path)

# Percentage of null values per column, sorted in descending order
null_percent = df.isnull().mean() * 100
null_percent_sorted = null_percent.sort_values(ascending=False)

print(null_percent_sorted)


In [None]:
import pandas as pd

# Example input: load the DataFrame to inspect
df = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\1-my_null_data.csv")
# 1) Percent missing per column (sorted; optionally formatted as a percent string)
def percent_missing_by_column(df, sort_desc=True, format_percent=True):
    s = df.isnull().mean() * 100  # mean() of a boolean mask is the fraction of True values
    if sort_desc:
        s = s.sort_values(ascending=False)
    if format_percent:
        return s.map(lambda x: f"{x:.2f}%")
    return s

# 2) Overall percent missing (across the whole DataFrame)
def percent_missing_overall(df):
    total_cells = df.size
    total_missing = df.isnull().sum().sum()
    return (total_missing / total_cells) * 100

# 3) Percent missing per row (can be combined with a threshold to filter rows)
def percent_missing_by_row(df, format_percent=True):
    s = df.isnull().mean(axis=1) * 100
    if format_percent:
        return s.map(lambda x: f"{x:.2f}%")
    return s

# 4) Full summary (per column + overall)
def missing_summary(df):
    col_pct = df.isnull().mean() * 100
    col_count = df.isnull().sum()
    overall_pct = percent_missing_overall(df)
    summary = pd.DataFrame({
        'missing_count': col_count,
        'missing_pct': col_pct.map(lambda x: round(x, 2))
    }).sort_values('missing_pct', ascending=False)
    return summary, round(overall_pct, 2)

# Usage and display
print("Percent missing per column:")
print(percent_missing_by_column(df))

print("\nOverall percent missing in the DataFrame:")
print(f"{percent_missing_overall(df):.2f}%")

print("\nPercent missing per row:")
print(percent_missing_by_row(df))

print("\nFull summary:")
summary_df, overall = missing_summary(df)
print(summary_df)
print(f"\nOverall percent missing: {overall:.2f}%")


In [None]:
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple

# --------------------------
# Helpers
# --------------------------
def _find_col(df: pd.DataFrame, name: str) -> str:
    """Find a column by name case-insensitively; returns actual column name or raises."""
    lower_map = {c.lower(): c for c in df.columns}
    if name.lower() in lower_map:
        return lower_map[name.lower()]
    # Try common variants
    variants = [name.replace("_",""), name.replace(" ",""), name.upper(), name.lower(), name.capitalize()]
    for v in variants:
        if v.lower() in lower_map:
            return lower_map[v.lower()]
    raise KeyError(f"Column '{name}' not found (case-insensitive). Available: {list(df.columns)[:10]} ...")

def _row_missing_pct(df: pd.DataFrame) -> pd.Series:
    return df.isna().mean(axis=1)

def _col_missing_pct(df: pd.DataFrame) -> pd.Series:
    return df.isna().mean(axis=0)

def _coerce_binary(series: pd.Series) -> pd.Series:
    """Coerce a likely-binary series to {0,1} where possible."""
    s = series.copy()
    # Map common string values
    mapping = {
        'yes':1, 'no':0, 'y':1, 'n':0, 'true':1, 'false':0, 'male':1, 'female':0,
        'm':1, 'f':0
    }
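    if s.dtype == object:
        # Cast to str so mixed-type columns do not break .str; unmapped values (e.g. "0"/"1") are parsed as numbers
        txt = s.astype(str).str.strip().str.lower()
        s = txt.map(mapping).fillna(pd.to_numeric(txt, errors='coerce')).astype('float64')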
    # If still not numeric, try casting
    if not np.issubdtype(s.dtype, np.number):
        s = pd.to_numeric(s, errors='coerce')
    # Clip to 0/1 if takes only two unique numeric values
    uniq = pd.unique(s.dropna())
    if len(uniq) == 2:
        # Map min->0, max->1
        mn, mx = float(np.nanmin(uniq)), float(np.nanmax(uniq))
        s = s.map({mn:0.0, mx:1.0})
    return s

def _summarize_cohort(df: pd.DataFrame,
                      key_cont_cols: List[str],
                      target_col: str,
                      sex_col: str = None) -> Dict:
    """Return basic cohort characteristics."""
    out = {}
    out['n_rows'] = int(df.shape[0])
    out['n_cols'] = int(df.shape[1])
    out['overall_missing_pct'] = float(df.isna().mean().mean()*100)

    # Continuous summaries
    cont_summary = {}
    for col in key_cont_cols:
        if col not in df.columns:  # skip if missing
            continue
        s = pd.to_numeric(df[col], errors='coerce')
        cont_summary[col] = {
            'mean': float(np.nanmean(s)),
            'median': float(np.nanmedian(s)),
            'missing_pct': float(s.isna().mean()*100)
        }
    out['continuous'] = cont_summary

    # Target prevalence
    tgt = _coerce_binary(df[target_col])
    out['mafld_prevalence_pct'] = float(np.nanmean(tgt)*100)

    # Sex proportion (optional)
    if sex_col and sex_col in df.columns:
        sex = _coerce_binary(df[sex_col])
        out['sex_prop_male_pct'] = float(np.nanmean(sex)*100)

    return out

def _smd(x: pd.Series, y: pd.Series) -> float:
    """Standardized mean difference for continuous variables (Hedges not applied)."""
    x = pd.to_numeric(x, errors='coerce')
    y = pd.to_numeric(y, errors='coerce')
    mx, my = np.nanmean(x), np.nanmean(y)
    sx, sy = np.nanstd(x, ddof=1), np.nanstd(y, ddof=1)
    # Pooled SD: square root of the unweighted mean of the two variances
    sp = np.sqrt(((sx**2) + (sy**2))/2.0)
    if sp == 0 or np.isnan(sp):
        return np.nan
    return float((mx - my)/sp)

# --------------------------
# Main cleaning & compare
# --------------------------
def clean_with_thresholds(df: pd.DataFrame,
                          feature_missing_thresh: float,
                          row_missing_thresh: float) -> Tuple[pd.DataFrame, Dict]:
    """
    Remove columns with missing_pct > feature_missing_thresh
    and rows with missing_pct > row_missing_thresh. Returns cleaned df and a small log.
    Thresholds are in fractions (e.g., 0.45 means 45%).
    """
    col_miss = _col_missing_pct(df)
    keep_cols = col_miss[col_miss <= feature_missing_thresh].index.tolist()
    removed_cols = col_miss[col_miss > feature_missing_thresh].sort_values(ascending=False)

    df2 = df[keep_cols].copy()
    row_miss = _row_missing_pct(df2)
    keep_rows = row_miss[row_miss <= row_missing_thresh].index
    removed_rows = row_miss[row_miss > row_missing_thresh].sort_values(ascending=False)

    df_clean = df2.loc[keep_rows].copy()

    log = {
        'removed_cols': removed_cols.to_dict(),
        'removed_rows_count': int(removed_rows.shape[0]),
        'kept_rows_count': int(df_clean.shape[0]),
        'kept_cols_count': int(df_clean.shape[1]),
    }
    return df_clean, log

def compare_thresholds(df: pd.DataFrame,
                       thresholds: List[Tuple[float, float]] = [(0.30, 0.50),
                                                                (0.45, 0.50),
                                                                (0.60, 0.50)],
                       key_cont_candidates: List[str] = ['age','BMI','ALT','AST','PLT','CRP'],
                       target_name: str = 'fattyliver',
                       sex_name: str = 'sex') -> Dict:
    """
    Run cleaning for multiple thresholds and summarize cohorts.
    Returns dict with:
      - 'scenarios': per-threshold summaries
      - 'smd_vs_045': SMDs of key continuous vars vs the 0.45/0.50 scenario
    """
    # Resolve essential columns case-insensitively
    target_col = _find_col(df, target_name)
    sex_col = None
    try:
        sex_col = _find_col(df, sex_name)
    except KeyError:
        pass

    # Resolve the set of key continuous columns that actually exist
    resolved_keys = []
    for nm in key_cont_candidates:
        try:
            resolved_keys.append(_find_col(df, nm))
        except KeyError:
            pass
    # De-duplicate while preserving order
    key_cont_cols = list(dict.fromkeys(resolved_keys))

    scenarios = {}
    cleaned_dfs = {}

    for f_thr, r_thr in thresholds:
        label = f"feat_{int(f_thr*100)}_row_{int(r_thr*100)}"
        df_clean, log = clean_with_thresholds(df, f_thr, r_thr)
        summ = _summarize_cohort(df_clean, key_cont_cols, target_col, sex_col)
        summ['thresholds'] = {'feature_missing': f_thr, 'row_missing': r_thr}
        summ['removed_cols_top5'] = dict(sorted(log['removed_cols'].items(),
                                                key=lambda x: x[1],
                                                reverse=True)[:5])
        summ['removed_rows_count'] = log['removed_rows_count']
        scenarios[label] = summ
        cleaned_dfs[label] = df_clean

    # Compute SMDs for key continuous variables vs 0.45/0.50 as baseline (if available)
    base_key = "feat_45_row_50"
    smd_table = {}
    if base_key in cleaned_dfs:
        base = cleaned_dfs[base_key]
        for label, dfi in cleaned_dfs.items():
            if label == base_key:
                continue
            smds = {}
            for col in key_cont_cols:
                if col in base.columns and col in dfi.columns:
                    smds[col] = _smd(base[col], dfi[col])
            smd_table[label] = smds

    return {'scenarios': scenarios, 'smd_vs_045': smd_table}

# --------------------------
# Example usage
# --------------------------
df = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\0-Original Data.csv")
# Run the comparison across the different thresholds
result = compare_thresholds(df)

# Summarize each scenario: row/column counts, MAFLD prevalence, and means of a few key variables.
# Keys in summ['continuous'] are the actual column names, so look them up case-insensitively.
def _cont_mean(summ, name):
    by_lower = {k.lower(): v for k, v in summ['continuous'].items()}
    return round(by_lower.get(name.lower(), {}).get('mean', np.nan), 2)

rows = []
for label, summ in result['scenarios'].items():
    rows.append({
        'scenario': label,
        'n_rows': summ['n_rows'],
        'n_cols': summ['n_cols'],
        'overall_missing_%': round(summ['overall_missing_pct'], 2),
        'MAFLD_prev_%': round(summ['mafld_prevalence_pct'], 2),
        'Age_mean': _cont_mean(summ, 'age'),
        'BMI_mean': _cont_mean(summ, 'BMI'),
        'ALT_mean': _cont_mean(summ, 'ALT'),
        'AST_mean': _cont_mean(summ, 'AST'),
        'PLT_mean': _cont_mean(summ, 'PLT'),
        'CRP_mean': _cont_mean(summ, 'CRP'),
    })
summary_df = pd.DataFrame(rows)
print(summary_df)

# Differences (SMD) relative to the main scenario (45%/50%)
smd_df = pd.DataFrame(result['smd_vs_045']).T
print("\nSMDs vs 45%/50% baseline:")
print(smd_df.round(3))
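
The `_smd` helper in the cell above compares each alternative cleaning scenario with the 45%/50% baseline. Written out, with $\bar{x}, s_x$ the mean and sample SD (`ddof=1`) of a variable in the baseline cohort and $\bar{y}, s_y$ the same quantities in the comparison cohort, the value it returns is:

$$
\mathrm{SMD} = \frac{\bar{x} - \bar{y}}{\sqrt{\left(s_x^{2} + s_y^{2}\right)/2}}
$$

The helper returns NaN when the denominator is zero or undefined. As a reading aid only (a common convention, not something stated in this notebook), absolute values below roughly 0.1 are usually treated as a negligible difference.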


In [None]:
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple

# --------------------------
# Helpers
# --------------------------
def _find_col(df: pd.DataFrame, name: str) -> str:
    """Find a column by name case-insensitively; returns actual column name or raises."""
    lower_map = {c.lower(): c for c in df.columns}
    if name.lower() in lower_map:
        return lower_map[name.lower()]
    # Try a few variants
    variants = [name.replace("_",""), name.replace(" ",""), name.upper(), name.lower(), name.capitalize()]
    for v in variants:
        if v.lower() in lower_map:
            return lower_map[v.lower()]
    raise KeyError(f"Column '{name}' not found (case-insensitive). Available: {list(df.columns)[:10]} ...")

def _row_missing_pct(df: pd.DataFrame) -> pd.Series:
    return df.isna().mean(axis=1)

def _col_missing_pct(df: pd.DataFrame) -> pd.Series:
    return df.isna().mean(axis=0)

def _coerce_binary(series: pd.Series) -> pd.Series:
    """Coerce a likely-binary series to {0,1} where possible."""
    s = series.copy()
    mapping = {'yes':1, 'no':0, 'y':1, 'n':0, 'true':1, 'false':0, 'male':1, 'female':0, 'm':1, 'f':0}
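    if s.dtype == object:
        # Values outside the mapping (e.g. "0"/"1" stored as text) fall back to numeric parsing
        txt = s.astype(str).str.strip().str.lower()
        s = txt.map(mapping).fillna(pd.to_numeric(txt, errors='coerce')).astype('float64')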
    if not np.issubdtype(s.dtype, np.number):
        s = pd.to_numeric(s, errors='coerce')
    uniq = pd.unique(s.dropna())
    if len(uniq) == 2:
        mn, mx = float(np.nanmin(uniq)), float(np.nanmax(uniq))
        s = s.map({mn:0.0, mx:1.0})
    return s

def _summarize_cohort(df: pd.DataFrame,
                      key_cont_cols: List[str],
                      target_col: str,
                      sex_col: str = None) -> Dict:
    """Return basic cohort characteristics."""
    out = {}
    out['n_rows'] = int(df.shape[0])
    out['n_cols'] = int(df.shape[1])
    out['overall_missing_pct'] = float(df.isna().mean().mean()*100)

    # Continuous summaries
    cont_summary = {}
    for col in key_cont_cols:
        if col not in df.columns:
            continue
        s = pd.to_numeric(df[col], errors='coerce')
        cont_summary[col] = {
            'mean': float(np.nanmean(s)),
            'median': float(np.nanmedian(s)),
            'missing_pct': float(s.isna().mean()*100)
        }
    out['continuous'] = cont_summary

    # Target prevalence
    tgt = _coerce_binary(df[target_col])
    out['mafld_prevalence_pct'] = float(np.nanmean(tgt)*100)

    # Sex proportion (optional)
    if sex_col and sex_col in df.columns:
        sex = _coerce_binary(df[sex_col])
        out['sex_prop_male_pct'] = float(np.nanmean(sex)*100)

    return out

def _smd(x: pd.Series, y: pd.Series) -> float:
    """Standardized mean difference for continuous variables (Hedges not applied)."""
    x = pd.to_numeric(x, errors='coerce')
    y = pd.to_numeric(y, errors='coerce')
    mx, my = np.nanmean(x), np.nanmean(y)
    sx, sy = np.nanstd(x, ddof=1), np.nanstd(y, ddof=1)
    sp = np.sqrt(((sx**2) + (sy**2))/2.0)
    if sp == 0 or np.isnan(sp):
        return np.nan
    return float((mx - my)/sp)

# --------------------------
# Main cleaning & compare (supports both orders)
# --------------------------
def clean_with_thresholds(df: pd.DataFrame,
                          feature_missing_thresh: float,
                          row_missing_thresh: float,
                          order: str = "cols_then_rows") -> Tuple[pd.DataFrame, Dict]:
    """
    Clean by thresholds with controllable order + remove duplicates.
    order: "cols_then_rows" (default) or "rows_then_cols"
    Thresholds are fractions (e.g., 0.45 means 45%).
    """
    if order not in ("cols_then_rows", "rows_then_cols"):
        raise ValueError("order must be 'cols_then_rows' or 'rows_then_cols'")

    log = {}

    if order == "cols_then_rows":
        # 1) drop columns by feature threshold
        col_miss0 = _col_missing_pct(df)
        keep_cols = col_miss0[col_miss0 <= feature_missing_thresh].index.tolist()
        removed_cols = col_miss0[col_miss0 > feature_missing_thresh].sort_values(ascending=False)
        df2 = df[keep_cols].copy()

        # 2) drop rows by row threshold
        row_miss = _row_missing_pct(df2)
        keep_rows = row_miss[row_miss <= row_missing_thresh].index
        removed_rows = row_miss[row_miss > row_missing_thresh].sort_values(ascending=False)
        df_clean = df2.loc[keep_rows].copy()

    else:  # rows_then_cols
        # 1) drop rows by row threshold (on original df)
        row_miss0 = _row_missing_pct(df)
        keep_rows = row_miss0[row_miss0 <= row_missing_thresh].index
        removed_rows = row_miss0[row_miss0 > row_missing_thresh].sort_values(ascending=False)
        df2 = df.loc[keep_rows].copy()

        # 2) drop columns by feature threshold (on filtered rows)
        col_miss = _col_missing_pct(df2)
        keep_cols = col_miss[col_miss <= feature_missing_thresh].index.tolist()
        removed_cols = col_miss[col_miss > feature_missing_thresh].sort_values(ascending=False)
        df_clean = df2[keep_cols].copy()

    # 3) Remove duplicate records
    before_dups = df_clean.shape[0]
    df_clean = df_clean.drop_duplicates()
    after_dups = df_clean.shape[0]
    removed_dups_count = before_dups - after_dups

    log['removed_cols'] = removed_cols.to_dict()
    log['removed_rows_count'] = int(removed_rows.shape[0])
    log['removed_duplicates_count'] = int(removed_dups_count)
    log['kept_rows_count'] = int(df_clean.shape[0])
    log['kept_cols_count'] = int(df_clean.shape[1])

    return df_clean, log


def compare_thresholds(df: pd.DataFrame,
                       thresholds: List[Tuple[float, float]] = [(0.30, 0.50),
                                                                (0.45, 0.50),
                                                                (0.60, 0.50)],
                       key_cont_candidates: List[str] = ['age','BMI','ALT','AST','PLT','CRP'],
                       target_name: str = 'fattyliver',
                       sex_name: str = 'sex',
                       order: str = "cols_then_rows") -> Dict:
    """
    Run cleaning for multiple thresholds and summarize cohorts.
    order: "cols_then_rows" or "rows_then_cols"
    Returns:
      - 'scenarios': dict of per-threshold summaries
      - 'smd_vs_045': SMDs vs the 0.45/0.50 scenario (same order)
    """
    # Resolve essential columns case-insensitively
    target_col = _find_col(df, target_name)
    sex_col = None
    try:
        sex_col = _find_col(df, sex_name)
    except KeyError:
        pass

    # Resolve key continuous columns that exist
    resolved_keys = []
    for nm in key_cont_candidates:
        try:
            resolved_keys.append(_find_col(df, nm))
        except KeyError:
            pass
    key_cont_cols = list(dict.fromkeys(resolved_keys))

    order_tag = "CthenR" if order == "cols_then_rows" else "RthenC"
    scenarios, cleaned_dfs = {}, {}

    for f_thr, r_thr in thresholds:
        label = f"{order_tag}_feat_{int(f_thr*100)}_row_{int(r_thr*100)}"
        df_clean, log = clean_with_thresholds(df, f_thr, r_thr, order=order)
        summ = _summarize_cohort(df_clean, key_cont_cols, target_col, sex_col)
        summ['thresholds'] = {'feature_missing': f_thr, 'row_missing': r_thr, 'order': order}
        summ['removed_cols_top5'] = dict(sorted(log['removed_cols'].items(),
                                                key=lambda x: x[1],
                                                reverse=True)[:5])
        summ['removed_rows_count'] = log['removed_rows_count']
        summ['removed_duplicates_count'] = log['removed_duplicates_count']
        scenarios[label] = summ
        cleaned_dfs[label] = df_clean

    # SMDs vs 0.45/0.50 baseline under the same order
    base_key = f"{order_tag}_feat_45_row_50"
    smd_table = {}
    if base_key in cleaned_dfs:
        base = cleaned_dfs[base_key]
        for label, dfi in cleaned_dfs.items():
            if label == base_key:
                continue
            smds = {}
            for col in key_cont_cols:
                if col in base.columns and col in dfi.columns:
                    smds[col] = _smd(base[col], dfi[col])
            smd_table[label] = smds

    return {'scenarios': scenarios, 'smd_vs_045': smd_table}

# --------------------------
# Example usage (both orders)
# --------------------------
if __name__ == "__main__":
    # 1) Load the DataFrame (replace with your own file path)
    df = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\0-Original Data.csv")

    # 2) Columns first, then rows
    res_CthenR = compare_thresholds(df, order="cols_then_rows")
    rows1 = []
    for label, summ in res_CthenR['scenarios'].items():
        rows1.append({
            'scenario': label,
            'n_rows': summ['n_rows'],
            'n_cols': summ['n_cols'],
            'overall_missing_%': round(summ['overall_missing_pct'], 2),
            'MAFLD_prev_%': round(summ['mafld_prevalence_pct'], 2),
            'Age_mean': round(summ['continuous'].get('age',{}).get('mean', np.nan), 2),
            'BMI_mean': round(summ['continuous'].get('BMI',{}).get('mean', np.nan), 2),
            'ALT_mean': round(summ['continuous'].get('ALT', summ['continuous'].get('alt', {})).get('mean', np.nan), 2),
            'AST_mean': round(summ['continuous'].get('AST', summ['continuous'].get('ast', {})).get('mean', np.nan), 2),
            'PLT_mean': round(summ['continuous'].get('PLT',{}).get('mean', np.nan), 2),
            'CRP_mean': round(summ['continuous'].get('CRP',{}).get('mean', np.nan), 2),
            'Removed_dups': summ['removed_duplicates_count'],
        })
    summary_CthenR = pd.DataFrame(rows1)
    print("=== Columns-then-Rows ===")
    print(summary_CthenR.to_string(index=False))

    print("\nSMDs vs 45%/50% (CthenR):")
    print(pd.DataFrame(res_CthenR['smd_vs_045']).T.round(3).to_string())

    # 3) Rows first, then columns
    res_RthenC = compare_thresholds(df, order="rows_then_cols")
    rows2 = []
    for label, summ in res_RthenC['scenarios'].items():
        rows2.append({
            'scenario': label,
            'n_rows': summ['n_rows'],
            'n_cols': summ['n_cols'],
            'overall_missing_%': round(summ['overall_missing_pct'], 2),
            'MAFLD_prev_%': round(summ['mafld_prevalence_pct'], 2),
            'Age_mean': round(summ['continuous'].get('age',{}).get('mean', np.nan), 2),
            'BMI_mean': round(summ['continuous'].get('BMI',{}).get('mean', np.nan), 2),
            'ALT_mean': round(summ['continuous'].get('ALT', summ['continuous'].get('alt', {})).get('mean', np.nan), 2),
            'AST_mean': round(summ['continuous'].get('AST', summ['continuous'].get('ast', {})).get('mean', np.nan), 2),
            'PLT_mean': round(summ['continuous'].get('PLT',{}).get('mean', np.nan), 2),
            'CRP_mean': round(summ['continuous'].get('CRP',{}).get('mean', np.nan), 2),
            'Removed_dups': summ['removed_duplicates_count'],
        })
    summary_RthenC = pd.DataFrame(rows2)
    print("\n=== Rows-then-Columns ===")
    print(summary_RthenC.to_string(index=False))

    print("\nSMDs vs 45%/50% (RthenC):")
    print(pd.DataFrame(res_RthenC['smd_vs_045']).T.round(3).to_string())

    # (Optional) Save the outputs to attach to the reviewer response
    summary_CthenR.to_csv("summary_columns_then_rows.csv", index=False)
    summary_RthenC.to_csv("summary_rows_then_columns.csv", index=False)


In [None]:
import pandas as pd
import numpy as np

# ---------- User settings ----------
CSV_PATH = r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\0-Original Data.csv"

# Column/flag indicating that the patient has an ultrasound (e.g. 'has_ultrasound' or 'ultra' or ...)
# If there is no such column and the ultrasound restriction was already applied to the file, leave this filter disabled.
ULTRA_FLAG_COL = None  # e.g. "has_ultrasound", or None if not available
ULTRA_POS_VALUES = {1, True, 'yes', 'Yes', 'Y'}  # values treated as positive

# Irrelevant columns removed in the paper (fill in according to your project)
IRRELEVANT_COLS = [
    # e.g. 'prescription_date', 'visit_date', 'some_id', ...
]

# Valid ranges; out-of-range values are converted to NaN (examples only; replace with your real limits)
RANGES = {
    'age': (0, 120),
    'BMI': (10, 80),
    'ALT': (0, 500),
    'AST': (0, 500),
    'PLT': (10, 1000),
    'CRP': (0, 300),
    # ... add any other variables you have
}

# Thresholds
FEATURE_MISSING_THR = 0.45  # 45%
ROW_MISSING_THR = 0.50      # 50%

# Removal order: columns first, then rows (as described in the paper)
ORDER = "cols_then_rows"  # or "rows_then_cols" if that was actually used

# Unique keys for removing duplicate records (if available; otherwise deduplication uses the full row)
DEDUP_KEYS = None  # e.g. ['patient_id'] or None


# ---------- Helper functions ----------
def pct_missing_by_col(df):
    return df.isna().mean().sort_values(ascending=False)

def clamp_ranges_to_nan(df, ranges_dict):
    df2 = df.copy()
    for col, (lo, hi) in ranges_dict.items():
        if col in df2.columns:
            s = pd.to_numeric(df2[col], errors='coerce')
            s = s.mask((s < lo) | (s > hi), np.nan)
            df2[col] = s
    return df2

def apply_ultrasound_filter(df, flag_col, pos_values):
    if flag_col is None or flag_col not in df.columns:
        return df, "no_ultra_filter"
    mask = df[flag_col].isin(pos_values)
    return df[mask].copy(), f"ultra_filtered({mask.sum()} kept)"

def drop_irrelevant(df, cols):
    cols_present = [c for c in cols if c in df.columns]
    return df.drop(columns=cols_present), cols_present

def drop_by_thresholds(df, f_thr, r_thr, order="cols_then_rows"):
    df2 = df.copy()
    if order == "cols_then_rows":
        col_miss = df2.isna().mean()
        keep_cols = col_miss[col_miss <= f_thr].index
        rem_cols = col_miss[col_miss > f_thr].sort_values(ascending=False)
        df2 = df2[keep_cols]

        row_miss = df2.isna().mean(axis=1)
        keep_rows = row_miss[row_miss <= r_thr].index
        rem_rows_count = (row_miss > r_thr).sum()
        df2 = df2.loc[keep_rows].copy()
    else:  # rows_then_cols
        row_miss = df2.isna().mean(axis=1)
        keep_rows = row_miss[row_miss <= r_thr].index
        rem_rows_count = (row_miss > r_thr).sum()
        df2 = df2.loc[keep_rows].copy()

        col_miss = df2.isna().mean()
        keep_cols = col_miss[col_miss <= f_thr].index
        rem_cols = col_miss[col_miss > f_thr].sort_values(ascending=False)
        df2 = df2[keep_cols]

    return df2

def deduplicate(df, keys=None):
    if keys is None:
        before = len(df)
        out = df.drop_duplicates()
        return out, before - len(out), "full-row"
    else:
        before = len(df)
        out = df.sort_index().drop_duplicates(subset=keys, keep='first')
        return out, before - len(out), f"subset({','.join(keys)})"


# ---------- Pipeline with full logging ----------
df0 = pd.read_csv(CSV_PATH)
print(f"[0] loaded: n={len(df0)}, p={df0.shape[1]}")

# (1) Ultrasound filter
df1, ultra_info = apply_ultrasound_filter(df0, ULTRA_FLAG_COL, ULTRA_POS_VALUES)
print(f"[1] ultrasound filter: {ultra_info} -> n={len(df1)}, p={df1.shape[1]}")

# (2) Drop irrelevant columns
df2, removed_list = drop_irrelevant(df1, IRRELEVANT_COLS)
print(f"[2] drop irrelevant cols: removed={len(removed_list)} -> n={len(df2)}, p={df2.shape[1]}")

# (3) Out-of-range values -> NaN
df3 = clamp_ranges_to_nan(df2, RANGES)
print(f"[3] clamp to ranges -> n={len(df3)}, p={df3.shape[1]} (no row change expected)")

# (4) Thresholds (columns then rows, or the reverse)
df4 = drop_by_thresholds(df3, FEATURE_MISSING_THR, ROW_MISSING_THR, order=ORDER)
print(f"[4] thresholds ({ORDER}) -> n={len(df4)}, p={df4.shape[1]}")

# (5) Remove duplicate records
df5, dup_count, dedup_mode = deduplicate(df4, keys=DEDUP_KEYS)
print(f"[5] deduplicate {dedup_mode}: removed {dup_count} -> n={len(df5)}, p={df5.shape[1]}")

# Per-column missingness after step 5 (should be roughly 31 features + target)
col_miss_final = pct_missing_by_col(df5)
print("\nTop missing columns after cleaning:")
print(col_miss_final.head(10).apply(lambda x: round(x*100,2)))

print(f"\n>>> FINAL n (should be ~3769): {len(df5)}")

# (Optional) Save for later steps
# df5.to_csv("cleaned_for_modeling.csv", index=False)


## 2) Data Cleaning
- Remove irrelevant columns (per clinical expert judgement).
- Enforce valid ranges; cast dtypes.
- Drop duplicates.
- Apply missingness thresholds (45% per-column, 50% per-row) as justified in the revision.
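
The steps listed above can be sketched in a few lines of pandas. This is a minimal illustration on toy data, not the notebook's exact pipeline: the function name `clean_sketch`, the toy values, the `visit_id` column (a hypothetical irrelevant column), and the range used for `age` are all assumptions. Only the 45% and 50% thresholds come from the text. The steps are applied in the order the list gives them.

```python
import numpy as np
import pandas as pd


def clean_sketch(df, irrelevant=(), ranges=None, col_thr=0.45, row_thr=0.50):
    """Illustrative cleaning pass; names and defaults are assumptions."""
    # 1) Remove irrelevant columns (only those actually present).
    out = df.drop(columns=[c for c in irrelevant if c in df.columns])
    # 2) Cast to numeric and turn out-of-range values into NaN.
    for col, (lo, hi) in (ranges or {}).items():
        if col in out.columns:
            s = pd.to_numeric(out[col], errors="coerce")
            out[col] = s.where(s.between(lo, hi))
    # 3) Drop exact duplicate rows.
    out = out.drop_duplicates()
    # 4) Drop columns, then rows, whose missing fraction exceeds the threshold.
    out = out.loc[:, out.isna().mean() <= col_thr]
    out = out.loc[out.isna().mean(axis=1) <= row_thr]
    return out.reset_index(drop=True)


# Toy data, invented for this example only.
toy = pd.DataFrame({
    "visit_id": [1, 2, 3, 4, 5, 6],
    "age": [54, 61, 61, 300, np.nan, 47],
    "BMI": [27.1, 31.4, 31.4, 29.0, np.nan, 24.8],
    "CRP": [np.nan, np.nan, np.nan, 4.2, np.nan, 3.1],
    "fattyliver": [1, 0, 0, 1, 0, 1],
})
cleaned = clean_sketch(toy, irrelevant=["visit_id"], ranges={"age": (0, 120)})
print(cleaned)
```

On this toy frame the third record becomes an exact duplicate once `visit_id` is removed, `CRP` is dropped because more than 45% of its values are missing, the record with no measurements is dropped by the row threshold, and the implausible age of 300 is replaced by NaN rather than causing the row to be discarded.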


In [None]:
# Assumes these DataFrames were built earlier:
# Kbest_df, pca_results_df, and the RFECV results (read below from results_df)

# Keep only the columns needed for the comparison
kbest_sel = Kbest_df[["Classifier","Best_K","Accuracy"]].rename(
    columns={"Best_K":"KBest_n","Accuracy":"KBest_Accuracy"}
)
pca_sel = pca_results_df[["Classifier","Best_n_components","Accuracy"]].rename(
    columns={"Best_n_components":"PCA_n","Accuracy":"PCA_Accuracy"}
)
rfecv_sel = results_df[["Classifier","Selected_Count","Accuracy"]].rename(
    columns={"Selected_Count":"RFECV_n","Accuracy":"RFECV_Accuracy"}
)

# Assumes plain_results_rows is also available
plain_df = pd.DataFrame(plain_results_rows)[["Classifier","Accuracy"]].rename(
    columns={"Accuracy":"Accuracy_Without"}
)

# Merge the four tables
final_table = plain_df.merge(kbest_sel, on="Classifier")\
                      .merge(pca_sel, on="Classifier")\
                      .merge(rfecv_sel, on="Classifier")

# Save and display
final_table.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\FeatureSelection_Comparison.csv", index=False, encoding="utf-8-sig")
print(final_table)


In [None]:
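import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error, r2_score

def evaluate_imputation(df_imputed, df_true, missings):
    """Score imputed values against the true values at the positions that were missing.

    Relies on two notebook-level globals: `dff`, whose NaN entries mark the
    positions being evaluated, and `continues`, the list of continuous columns.
    """
    evaluations = {}
    for col in missings:
        # Rows where this column is missing in dff and a true value is available
        miss_idx = dff.index[dff[col].isna()]
        common = df_true.index.intersection(miss_idx)
        y_true = df_true.loc[common, col].values
        y_pred = df_imputed.loc[common, col].values

        if col in continues:
            mse = mean_squared_error(y_true, y_pred)
            r2 = r2_score(y_true, y_pred)
            mae = mean_absolute_error(y_true, y_pred)
            # The 'MABR' key is kept for compatibility; it holds the mean absolute error
            evaluations[col] = {'MSE': mse, 'R2': r2, 'MABR': mae}
        else:
            # Categorical columns: round imputed values to the nearest class label
            y_pred = np.round(y_pred).astype(int)
            acc = accuracy_score(y_true, y_pred)
            evaluations[col] = {'Accuracy': acc}
    return evaluations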


In [None]:
og_data_path = r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\3-my_null_data_40_del.csv"
save_path_first_classification = r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\LogRes.csv"


In [None]:
data = pd.read_csv(og_data_path).iloc[:, :-1]
missing_dataframe = pd.DataFrame(columns=["Variable", "Missing_number", "Percentage"])
total_records = data.shape[0]

for i, col in enumerate(data.columns):
    missing = data[col].isnull().sum()
    missing_dataframe.loc[i] = [col, missing, round((missing / total_records) * 100, 2)]

missing_dataframe = missing_dataframe.sort_values(by="Percentage", ascending=False).reset_index(drop=True)
missing_columns = missing_dataframe["Variable"][missing_dataframe["Missing_number"] > 100]

print(missing_dataframe.head())


In [None]:
# Split the data into two halves
df_train, df_test = train_test_split(data, train_size=0.5)

# Split the training half further into train and validation sets to evaluate the classifiers' predictions
df_train_train, df_train_val = train_test_split(df_train, train_size=0.7)


In [None]:
new_col = ["Classifier", "Parameters", "Column", "Accuracy", "Precision", "Recall", "F1"]
Log_res = pd.DataFrame(columns=new_col)
logreg_param_grid = {'C': [0.1, 1, 10], 'max_iter': [100, 200, 300], 'solver': ['lbfgs', 'liblinear']}
knn_param_grid = {'n_neighbors': [3, 5, 7]}
dtree_param_grid = {'max_depth': [3, 5, 8], 'min_samples_split': [2, 5, 10]}
svm_param_grid = {'C': [0.1, 1, 10]}
rf_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [3, 5, 8]}
et_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [3, 5, 8]}
xgb_param_grid = {'max_depth': [3, 5, 8], 'learning_rate': [0.1, 0.05, 0.01]}
ada_param_grid = {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 1]}
results = []

Grids = [
    logreg_param_grid, knn_param_grid, dtree_param_grid, svm_param_grid,
    rf_param_grid, et_param_grid, xgb_param_grid, ada_param_grid,
]
classifiers_name = ["LogReg", "KNN", "DT", "SVM", "RF", "ET", "XGB", "AdaBoost"]
classifiers = [
    LogisticRegression(),
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    SVC(),
    RandomForestClassifier(),
    ExtraTreesClassifier(),
    xgb.XGBClassifier(objective='binary:logistic'),
    AdaBoostClassifier(),
]


row = 0
for name, clf, param_grid in zip(classifiers_name, classifiers, Grids):
    print(f"-----{name}-----")
    for colmn in missing_columns:
        print(f"-{colmn}--")
        selected = [c for c in df_train.columns if c != colmn]

        # Target: 1 where the column is missing, 0 otherwise
        y_train = np.where(df_train_train[colmn].isna(), 1, 0)
        y_test = np.where(df_train_val[colmn].isna(), 1, 0)

        # Fill both sets with the training medians so that no validation
        # statistics leak into the features
        train_median = df_train_train[selected].median()
        x_train = df_train_train[selected].fillna(train_median)
        x_test = df_train_val[selected].fillna(train_median)

        grid_search = GridSearchCV(clf, param_grid=param_grid, cv=3, scoring="f1")
        grid_search.fit(x_train, y_train)
        y_pred = grid_search.predict(x_test)

        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, zero_division=0)
        recall = recall_score(y_test, y_pred, zero_division=0)
        f1 = f1_score(y_test, y_pred, zero_division=0)

        # Same column order as `new_col`
        Log_res.loc[row] = [name, str(grid_search.best_params_), colmn,
                            accuracy, precision, recall, f1]
        row += 1


In [None]:
Gridi = Log_res
columns = Gridi["Column"].unique()


In [None]:
Log_res


In [None]:
# Load data
df = df_test.copy()
df_test = df_test_original.copy()


df = df.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)


In [None]:
df.head()


In [None]:
print("total_NaN_in_dff:", dff.isna().sum().sum())
print("total_NaN_in_df:", df.isna().sum().sum())


In [None]:
# Dictionary to store results
results = {}

combos = {}

for j, (name, estimator) in enumerate(estimators.items()):
    combinations = product(*params[j].values())

    for i, comb in enumerate(combinations):
        print(f"Imputing with {name} _ {i}...")
        param_combo = dict(zip(params[j].keys(), comb))

        if name == "KNN":
            # The KNN imputer takes the parameters directly
            estimator.set_params(**param_combo)
            combos[f"{name}_{i}"] = param_combo
            df_imputed = estimator.fit_transform(dfff.copy())
        else:
            # The other imputers wrap a model exposed as `.estimator`
            estimator.estimator.set_params(**param_combo)
            combos[f"{name}_{i}"] = param_combo

            if callable(estimator):
                df_imputed = estimator(dfff.copy())
            else:
                df_imputed = estimator.fit_transform(dfff.copy())

        df_imputed = pd.DataFrame(df_imputed, columns=dfff.columns)
        df_imputed.index = dff.index

        results[f"{name}_{i}"] = evaluate_imputation(df_imputed, df_test, missings)

print(results)

scale_weight = {}
sum_miss = np.sum(dff.isna().sum(), axis=0)

if sum_miss == 0:
    print("Warning: no missing cells in dff (sum_miss == 0); the weights computed below will be undefined.")

for cls in df_test.columns:
    we = dfff[cls].isna().sum()
    scale_weight[cls] = we / sum_miss


labels = []
con_values = []
binary_values = []

continues = ['PLT', 'hip', 'CRP', 'VitD', 'insulin', 'UA', 'ast', 'alt', 'alkp', 'homa']
binary = ['Retino', 'CAD', 'CVA', 'Smoking']

# Only columns that actually had missing values were scored in `results`
continues_eval = [c for c in continues if c in missings]
binary_eval = [c for c in binary if c in missings]

for est in results.keys():
    labels.append(est)

    # Weighted mean R2 over the continuous columns
    continues_score = 0
    scale_weight_con = 0
    scale_weight_con_list = []
    for con in continues_eval:
        continues_score += results[est][con]["R2"] * scale_weight[con]
        scale_weight_con += scale_weight[con]
        scale_weight_con_list.append(scale_weight[con])
    con_values.append(continues_score / scale_weight_con)

    # Weighted mean accuracy over the binary columns
    binary_score = 0
    scale_weight_bin = 0
    scale_weight_bin_list = []
    for bin_col in binary_eval:  # avoid shadowing the built-in `bin`
        binary_score += results[est][bin_col]["Accuracy"] * scale_weight[bin_col]
        scale_weight_bin += scale_weight[bin_col]
        scale_weight_bin_list.append(scale_weight[bin_col])
    binary_values.append(binary_score / scale_weight_bin)

# plt.bar(labels, con_values)
# plt.title("continues")
# plt.show()
#
# plt.bar(labels, binary_values)
# plt.title("binary")
# plt.show()

Res = pd.DataFrame(columns=["Labels", "Labels2", "Parameters", "Continues", "Binary", "Details"])
Res["Labels"] = labels
Res["Labels2"] = list(combos.keys())
Res["Parameters"] = list(combos.values())
Res["Continues"] = con_values
Res["Binary"] = binary_values
Res["Details"] = list(results.values())

# Per-column weights (taken from the last estimator; they are identical for all)
Wi_con = pd.DataFrame(columns=["Con", "Con_weight", "Bin", "Bin_weight"])
Wi_bin = pd.DataFrame(columns=["Bin", "Bin_weight"])
Wi_con["Con"] = continues_eval
Wi_con["Con_weight"] = scale_weight_con_list

Wi_bin["Bin"] = binary_eval
Wi_bin["Bin_weight"] = scale_weight_bin_list



Res.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\ResultsOfGrid.csv")
Wi_con.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\WeightsCon.csv")
Wi_bin.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\WeightsBin.csv")


In [None]:
# ===== Split Halves & Complete Subset =====
idx = np.arange(len(df))
first_half  = df.iloc[idx % 2 == 0].reset_index(drop=True)
second_half = df.iloc[idx % 2 == 1].reset_index(drop=True)
print("First half:", first_half.shape, "Second half:", second_half.shape)

# Fully complete subset (no missing values) for the simulation
complete_subset = first_half.dropna(axis=0).reset_index(drop=True)
print("Complete subset:", complete_subset.shape)

# Columns that contain missing values (over the whole cleaned dataset)
cols_with_missing = [c for c in df.columns if df[c].isna().any()]
target_missing_rates = df[cols_with_missing].isna().mean().to_dict()


In [None]:
# ===== Learn Missingness Models =====
CLASSIFIER_GRID = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}),
    "dt":  (DecisionTreeClassifier(), {"max_depth": [None, 5, 10]}),
    # Disabled alternatives:
    # "logreg": (LogisticRegression(max_iter=500), {"C":[0.1,1,10], "solver":["liblinear","lbfgs"]}),
    # "svc":    (SVC(), {"C":[0.5,1,5], "kernel":["rbf","linear"], "gamma":["scale"]}),
    # "rf":     (RandomForestClassifier(), {"n_estimators":[200,400], "max_depth":[None,10]}),
    # "et":     (ExtraTreesClassifier(), {"n_estimators":[200,400], "max_depth":[None,10]}),
    # "ada":    (AdaBoostClassifier(), {"n_estimators":[200,400], "learning_rate":[0.5,1.0]}),
    #
    # Unused tuple-style entries:
    # ("KNN", KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]}),
    # ("DT", DecisionTreeClassifier(), {'max_depth': [3, 5, 8], 'min_samples_split': [2, 5, 10]}),
    # ("SVM", SVC(), {'C': [0.1, 1, 10]}),
    # ("RF", RandomForestClassifier(), {'n_estimators': [100, 200, 300], 'max_depth': [3, 5, 8]}),
    # ("GB", GradientBoostingClassifier(), {'n_estimators': [100, 200, 300], 'learning_rate': [0.1, 0.05, 0.01]}),
    # ("XGB", xgb.XGBClassifier(objective='binary:logistic', eval_metric="logloss"),
    #     {'max_depth': [3, 5, 8], 'learning_rate': [0.1, 0.05, 0.01]})
}

def learn_missingness_models(df_with_missing, cols):
    models = {}
    scorer = make_scorer(f1_score)
    for col in cols:
        y = df_with_missing[col].isna().astype(int)
        X = df_with_missing.drop(columns=[col]).copy()
        X = X.fillna(-777)  # sentinel value standing in for NaN
        best = (-1, None, None)
        for name, (est, grid) in CLASSIFIER_GRID.items():
            cv = GridSearchCV(est, grid, cv=3, scoring=scorer, n_jobs=-1)
            cv.fit(X, y)
            if cv.best_score_ > best[0]:
                best = (cv.best_score_, name, cv.best_estimator_)
        models[col] = {"best_name": best[1], "estimator": best[2], "best_f1": float(best[0])}
    return models

miss_models = learn_missingness_models(second_half, cols_with_missing)
list(miss_models.items())[:3]  # preview


## 3) Missingness Analysis & MAR Simulation
- Apply Little’s MCAR test; where MCAR is rejected, proceed under a MAR assumption.
- Train per-column classifiers to model missingness probability using observed features.
- Simulate missingness on a **complete subset** to benchmark imputers against ground truth.
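
The simulation step above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's own routine: the helper name `simulate_mar`, the toy columns `a` and `b`, and the probability-weighted sampling rule are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Toy data: missingness in "b" depends on the observed column "a" (MAR).
observed = pd.DataFrame({"a": rng.normal(size=400), "b": rng.normal(size=400)})
observed.loc[(observed["a"] > 0.5) & (rng.random(400) < 0.8), "b"] = np.nan

# Per-column missingness classifier trained on the observed features.
y_miss = observed["b"].isna().astype(int)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(observed[["a"]], y_miss)


def simulate_mar(complete, col, model, features, rate, rng):
    """Mask `col` in a complete frame, sampling rows in proportion to the
    predicted missingness probability (hypothetical helper)."""
    proba = model.predict_proba(complete[features])[:, 1]
    n_mask = int(round(rate * len(complete)))
    weights = proba + 1e-9               # keep every row eligible
    weights = weights / weights.sum()
    idx = rng.choice(len(complete), size=n_mask, replace=False, p=weights)
    masked = complete.copy()
    masked.iloc[idx, masked.columns.get_loc(col)] = np.nan
    return masked


complete_subset = observed.dropna().reset_index(drop=True)
target_rate = observed["b"].isna().mean()
masked = simulate_mar(complete_subset, "b", clf, ["a"], target_rate, rng)
```

Because `complete_subset` keeps the true values, any imputer run on `masked` can be scored against them directly.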


In [None]:
Kbest_df = pd.DataFrame(results_rows)
Kbest_df.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\With AdaBoost\Kbest_Summary.csv")


## 4) Imputation Benchmarking
- Evaluate MICE (with multiple estimators), KNN, MissForest, ExtraTrees/AdaBoost-based iterative imputers.
- Metrics: R² for continuous, Accuracy for binary; weighted by each variable’s missing fraction.
- Select the most robust imputer across seeds (per revision: AdaBoost).
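
The weighting described above can be written as a small helper. The function name `weighted_score` and the numbers below are illustrative assumptions, not values from the study; the logic mirrors the normalised weighted mean used in the scoring cell.

```python
def weighted_score(scores, n_missing):
    """Average per-column scores, weighting each column by its share of
    the missing cells among the scored columns."""
    total = sum(n_missing[c] for c in scores)
    if total == 0:
        raise ValueError("no missing cells among the scored columns")
    return sum(scores[c] * n_missing[c] for c in scores) / total


# Illustrative values only (not results from the manuscript).
r2_by_col = {"PLT": 0.60, "CRP": 0.20}
missing_by_col = {"PLT": 30, "CRP": 10}
overall_r2 = weighted_score(r2_by_col, missing_by_col)
```

Columns with more missing cells pull the overall score toward their own value, so an imputer cannot look good by doing well only on rarely missing variables.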


In [None]:
# Running without feature selection


In [None]:
final_table


In [None]:
best_imputer_index = np.argmax(best_per_model["Continues"])
print(best_per_model.iloc[best_imputer_index])


In [None]:
og_data_path = r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\3-my_null_data_40_del.csv"
og_data = pd.read_csv(og_data_path)


imputed_data = best_imputer.fit_transform(og_data)
imputed_data = pd.DataFrame(imputed_data, columns=og_data.columns)

# Round the imputed binary columns back to 0/1
binary = ['Retino', 'htn', 'sex', 'CAD', 'CVA', 'Smoking']
for col in binary:
    imputed_data[col] = np.round(imputed_data[col])
imputed_data.to_csv(r"C:\Users\z_kho\OneDrive\Desktop\4-imputed_data.csv", index=False)


## 5) Outlier Detection
- Apply density-based LOF after imputation to avoid bias from missingness patterns.
- Remove only clearly erroneous records.
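
A minimal sketch of the LOF step on synthetic data follows. The `n_neighbors` and `contamination` values are assumptions chosen for the example, not the settings used in the manuscript.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[:3] += 12.0                      # three deliberately extreme rows

# LOF is distance-based, so features are standardised first.
X_scaled = StandardScaler().fit_transform(X)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X_scaled)     # -1 = outlier, 1 = inlier
X_clean = X[labels == 1]
```

A small `contamination` value keeps the filter conservative, in line with removing only clearly erroneous records.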


## 6) Dataset Balancing
- The classes are mildly imbalanced. Compare random undersampling with SMOTE and report the differences, which were negligible.
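
Random undersampling can be sketched with pandas alone. The helper `random_undersample` and the toy frame are assumptions for illustration; SMOTE is not shown here because it relies on the separate `imbalanced-learn` package.

```python
import numpy as np
import pandas as pd


def random_undersample(df, label_col, seed=42):
    """Downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        grp.sample(n=n_min, random_state=seed)
        for _, grp in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1.0, random_state=seed).reset_index(drop=True)


toy = pd.DataFrame({"x": np.arange(100), "label": [1] * 60 + [0] * 40})
balanced = random_undersample(toy, "label")
```

To stay consistent with the leakage rule stated at the top of the notebook, resampling of this kind would be applied to training folds only, never to the held-out fold.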


## 7) Feature Selection
- KBest, RFECV, PCA, and **Genetic Algorithm** (Taguchi-tuned hyperparameters).
- Perform **within-fold** selection to avoid leakage.
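
One common way to keep selection inside each fold is to put the selector in a scikit-learn `Pipeline`, so it is refit on every training split. The sketch below uses synthetic data, `SelectKBest` with `k=5`, and logistic regression as assumed stand-ins; the GA and Taguchi steps are not reproduced.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)

# Scaling and selection are fitted on the training part of each fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
```

Selecting features on the full dataset before splitting would let information from the test folds influence the chosen features, which is the leakage this arrangement avoids.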


## 8) Modeling & Cross-Validation
- Models: LR, KNN, SVM, DT, ET, GB, XGBoost, LightGBM.
- Stratified 5-fold CV; grid/random search **within each training fold**.
- Primary metric: AUC; tiebreaker: F1.
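
The nested arrangement described above, with a search inside each outer training fold, might look like the following sketch. The data are synthetic, and the decision tree with its `max_depth` grid is only an example model; the inner 3-fold split is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The grid search is itself the estimator: it is refit on each outer
# training fold, so hyperparameters never see the outer test fold.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5, 8]},
    cv=inner,
    scoring="roc_auc",
)
nested = cross_validate(search, X, y, cv=outer,
                        scoring={"auc": "roc_auc", "f1": "f1"})
```

`nested["test_auc"]` and `nested["test_f1"]` then hold one value per outer fold, which matches the primary metric and tiebreaker listed above.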


In [None]:
df_test = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\test2.csv")
df_test_original = pd.read_csv(r"C:\Users\z_kho\OneDrive\Desktop\sixth-121\test_original2.csv")


In [None]:
# Load data
df = df_test.copy()
df_test = df_test_original.copy()


df = df.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
dff = df.copy()
dfff = df.copy()


In [None]:
missings = [c for c in df.columns if (c in df_test.columns) and (df[c].isna().sum() > 0)]

continues = ['PLT','hip', 'CRP', 'VitD', 'insulin', 'UA', 'ast', 'alt', 'alkp', 'homa']
binary = ['Retino','CAD', 'CVA', 'Smoking']

continues_eval = [c for c in ['PLT','hip','CRP','VitD','insulin','UA','ast','alt','alkp','homa'] if c in missings]
binary_eval    = [c for c in ['Retino','CAD','CVA','Smoking']                                    if c in missings]


In [None]:
# ===== Load Data =====
df = pd.read_csv(r"C:\zaza\documents\University\my subjects\arshad\And beyond\Missing Data\datasets\Data\3-my_null_data_40_del.csv")
print("Raw shape:", df.shape)
df.head()


In [None]:
df.columns


## 9) Evaluation & Metrics (with variance)
- Report mean ± SD and 95% CI across repeated CV.
- Include confusion matrices, ROC/PR curves if applicable.
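
A sketch of the summary statistics follows, using a hypothetical `summarize` helper and synthetic data. The interval is a plain t-interval over the fold scores; since CV folds share training data, it should be read as approximate.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score


def summarize(scores, level=0.95):
    """Mean, sample SD, and a t-based confidence interval for CV scores."""
    scores = np.asarray(scores, dtype=float)
    mean, sd = scores.mean(), scores.std(ddof=1)
    t_crit = stats.t.ppf(0.5 + level / 2, df=len(scores) - 1)
    half = t_crit * sd / np.sqrt(len(scores))
    return {"mean": mean, "sd": sd, "ci_low": mean - half, "ci_high": mean + half}


X, y = make_classification(n_samples=300, n_features=10, random_state=42)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
summary = summarize(scores)
```

In the notebook itself the repeats would presumably be driven by `SEEDS["cv_seeds"]` rather than a single `random_state`.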


## 10) Feature Importance & Explainability
- Compare importance across XGBoost (no FS), GB (with KBest), LightGBM (with reduced features).
- Add SHAP summary for the final model.
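
The comparison of importances across models can be laid out as a single table. The sketch below deliberately swaps in two scikit-learn ensembles (gradient boosting and extra trees) on synthetic data as stand-ins for the XGBoost, GB, and LightGBM models named above; the SHAP step is not shown because it depends on the separate `shap` package.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=42)
names = [f"f{i}" for i in range(X.shape[1])]   # placeholder feature names

models = {
    "GB": GradientBoostingClassifier(random_state=42),
    "ET": ExtraTreesClassifier(n_estimators=200, random_state=42),
}

# One column of impurity-based importances per model, one row per feature.
importance = pd.DataFrame(
    {name: model.fit(X, y).feature_importances_ for name, model in models.items()},
    index=names,
)
# Rank 1 = most important; ties share the lowest rank.
ranks = importance.rank(ascending=False, method="min").astype(int)
```

Comparing `ranks` column by column shows which features are consistently near the top, which is usually more informative than the raw importance values because those are scaled differently by each model.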


## 11) Save Artifacts
- Save fitted models, feature lists, and result tables to `outputs/` for supplement.
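
Saving the artifacts could be done along these lines. The helper `save_artifacts` and the file names are assumptions; the example writes to a temporary directory, whereas the notebook would pass `OUTPUT_DIR`.

```python
import json
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def save_artifacts(model, feature_names, results, out_dir):
    """Write the fitted model, feature list, and result table to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, out_dir / "final_model.joblib")
    (out_dir / "features.json").write_text(json.dumps(list(feature_names), indent=2))
    results.to_csv(out_dir / "cv_results.csv", index=False)
    return sorted(p.name for p in out_dir.iterdir())


X, y = make_classification(n_samples=100, n_features=4, random_state=42)
feature_names = ["f0", "f1", "f2", "f3"]          # placeholder names
model = LogisticRegression(max_iter=1000).fit(X, y)
# Training accuracy on toy data, used only to have a table to save.
results = pd.DataFrame({"model": ["LR"], "train_accuracy": [model.score(X, y)]})

out_dir = tempfile.mkdtemp()
saved = save_artifacts(model, feature_names, results, out_dir)
reloaded = joblib.load(Path(out_dir) / "final_model.joblib")
```

Storing the feature list next to the model makes it possible to check, at load time, that new data has the same columns in the same order.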


---

### Notes
- All code cells have been cleaned of transient debugging prints and cleared of outputs.
- Please place any private data files in a `data/` folder before running.
- If certain figures/tables are required for the supplement, run corresponding cells in Sections 9–11.
