# Checks

## Beginning
df.info(), df.head(10)

- for binary columns: filtered_df = df[~df['InsulinResistance'].isin([0, 1])] -> to find out whether there are any values other than 0 and 1
- data.drop_duplicates(inplace=True) for duplicates
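
The two checks above can be tried on a small made-up frame. This is only a sketch: the values are invented for illustration, `BMI` and `InsulinResistance` are simply column names that appear elsewhere in this notebook, and counting duplicates before dropping them is an added assumption about what is useful to see.

```python
import pandas as pd

# Toy data (hypothetical values) with one invalid binary entry and repeated rows
df = pd.DataFrame({
    'InsulinResistance': [0, 1, 1, 2, 1],
    'BMI': [22.5, 31.0, 31.0, 27.3, 31.0],
})

# Rows whose binary column holds something other than 0 or 1
invalid_rows = df[~df['InsulinResistance'].isin([0, 1])]
print(invalid_rows)

# Count exact duplicate rows before dropping them
n_duplicates = df.duplicated().sum()
df_clean = df.drop_duplicates()
print(f'{n_duplicates} duplicate rows removed, {len(df_clean)} rows left')
```

Note that `isin([0, 1])` is `False` for `NaN`, so rows with a missing value in the binary column also show up in `invalid_rows`.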


# Mapping

In [None]:
mapping = {'Female': 0, 'Male': 1}
df['gender'] = df['gender'].map(mapping)


mapping = {'Low': 1, 'Moderate': 2, 'High': 3}
data['PhysicalActivityLevel'] = data['PhysicalActivityLevel'].map(mapping)
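
`Series.map` with a dictionary turns every value that is not a key into `NaN`, so a typo or an unexpected category silently becomes a missing value. A small sketch, with invented raw values, of how this could be checked before overwriting the column:

```python
import pandas as pd

# Hypothetical raw values; 'moderate' (lower case) is not a key of the mapping
activity = pd.Series(['Low', 'High', 'moderate', 'Moderate'])
mapping = {'Low': 1, 'Moderate': 2, 'High': 3}

mapped = activity.map(mapping)

# Values that were present before mapping but are missing afterwards were not covered
unmapped = activity[mapped.isna() & activity.notna()].unique()
print('Unmapped values:', list(unmapped))
```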

## Histogram & Density
Helps get an overview of how the data is spread out.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the style of seaborn plots
sns.set_theme(style="whitegrid")

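# Function to plot distribution of each numerical column
# (histograms, KDE, skewness and kurtosis are only defined for numerical data)
def plot_distributions(df):
    for column in df.select_dtypes(include='number').columns: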
        plt.figure(figsize=(10, 5))
        
        # Histogram
        plt.subplot(1, 2, 1)
        sns.histplot(df[column], kde=False, bins=30)
        plt.title(f'Histogram of {column}')
        plt.xlabel(column)
        plt.ylabel('Frequency')
        
        # Density plot (KDE)
        plt.subplot(1, 2, 2)
        sns.kdeplot(df[column], fill=True)
        plt.title(f'Density Plot of {column}')
        plt.xlabel(column)
        plt.ylabel('Density')
        
        plt.tight_layout()
        plt.show()
        
        # Display basic statistics
        print(f'Statistics for {column}:')
        print(df[column].describe())
        print('\nSkewness:', df[column].skew())
        print('\nKurtosis:', df[column].kurtosis())
        print('\n')

# Plot distributions
plot_distributions(df)


### Density Plot
Components of a Density Plot:
- X-axis (Horizontal axis): Represents the range of values for the variable being analyzed.
- Y-axis (Vertical axis): Represents the probability density. This is not a probability itself (it can exceed 1); the probability of the variable falling in an interval is the area under the curve over that interval.
- Curve: The smooth line represents the estimated density function. The total area under the curve is 1, which is what makes it a probability distribution.

Key Information from a Density Plot:
- Central Tendency: The peaks of the density plot indicate where the data values are concentrated. The highest peak represents the mode (the most common value).
- Spread: The width of the curve provides information about the variability of the data. A wider curve indicates more spread out data, while a narrower curve indicates that the data values are closer together.
- Skewness: The shape of the curve shows the skewness of the data. If the curve is symmetrical, the data is distributed evenly around its center. If it tails off to one side, the data is skewed in that direction:
    - Right-skewed (positive skew): The tail is longer on the right side.
    - Left-skewed (negative skew): The tail is longer on the left side.
- Multimodality: If the density plot has more than one peak, the data might have multiple modes. This could indicate the presence of subgroups within the data.
- Comparison between Groups: When multiple density plots are overlaid, they can be used to compare distributions between different groups.
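
To make the "density, not probability" point concrete, here is a minimal sketch on synthetic data. It uses `scipy.stats.gaussian_kde` as a stand-in for the curve that `sns.kdeplot` draws (an assumption for illustration; seaborn's bandwidth settings may differ) and checks the area with a hand-written trapezoidal rule.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=0.1, size=2000)  # narrow synthetic sample

kde = gaussian_kde(values)
grid = np.linspace(values.min() - 1.0, values.max() + 1.0, 4001)
density = kde(grid)

# Trapezoidal rule for the area under the estimated curve
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))

print(f'Area under the curve: {area:.3f}')
print(f'Highest density value: {density.max():.2f}')
```

Because the sample is tightly concentrated, the curve rises above 1 on the y-axis while the area underneath stays at about 1.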


### Kurtosis
Kurtosis is a statistical measure that describes the shape of a distribution's tails in relation to its overall shape, particularly focusing on the outliers. It indicates whether the data are heavy-tailed or light-tailed compared to a normal distribution.

- Mesokurtic (Kurtosis ≈ 3): This is the kurtosis of a normal distribution. The tails are neither heavy nor light.
- Leptokurtic (Kurtosis > 3): This indicates heavier tails. The distribution has more outliers.
- Platykurtic (Kurtosis < 3): This indicates lighter tails. The distribution has fewer outliers.

In practice, kurtosis is often reported as "excess kurtosis," which is calculated as kurtosis - 3. This is what pandas' `.kurtosis()`, used in the function above, returns. Thus:

- Excess Kurtosis ≈ 0: Mesokurtic (tails like a normal distribution).
- Excess Kurtosis > 0: Leptokurtic.
- Excess Kurtosis < 0: Platykurtic.
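
The two conventions are easy to mix up, so a quick check on synthetic samples can help. The sketch below assumes SciPy is available; the distributions (normal, Laplace, uniform) are chosen purely for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
normal = pd.Series(rng.normal(size=100_000))     # mesokurtic reference
laplace = pd.Series(rng.laplace(size=100_000))   # heavier tails than normal
uniform = pd.Series(rng.uniform(size=100_000))   # lighter tails than normal

excess = normal.kurtosis()              # pandas reports excess kurtosis
plain = kurtosis(normal, fisher=False)  # scipy with fisher=False reports plain kurtosis

print(f'normal:  excess={excess:.2f}, plain={plain:.2f}')
print(f'laplace: excess={laplace.kurtosis():.2f}')
print(f'uniform: excess={uniform.kurtosis():.2f}')
```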

### Skewness
Skewness measures the asymmetry of the probability distribution of a real-valued random variable. It indicates whether the data distribution is skewed to the left or right:

- Positive Skewness (> 0): The right tail is longer; the mass of the distribution is concentrated on the left.
- Negative Skewness (< 0): The left tail is longer; the mass of the distribution is concentrated on the right.
- Zero Skewness (≈ 0): The distribution is symmetric.
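
A short sketch of the sign convention, using a synthetic exponential sample (long right tail) and its mirror image. In samples like these the mean is pulled towards the long tail, which is why it ends up on the tail side of the median here; that relationship is typical rather than guaranteed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
right_skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))  # long right tail
left_skewed = -right_skewed                                         # mirrored: long left tail

for name, s in [('right-skewed', right_skewed), ('left-skewed', left_skewed)]:
    print(f'{name}: skew={s.skew():.2f}, mean={s.mean():.2f}, median={s.median():.2f}')
```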

### Threshold Skewness & Kurtosis
Skewness:
- |Skewness| < 0.5: The data is fairly symmetrical.
- 0.5 ≤ |Skewness| < 1: The data is moderately skewed.
- |Skewness| ≥ 1: The data is highly skewed.

Kurtosis:
- Excess Kurtosis between -1 and 1: This range is often considered acceptable or indicative of a relatively normal distribution.
- Excess Kurtosis outside of -1 to 1: Indicates significant deviations from normality. Values outside this range suggest heavy or light tails.

Practical Considerations:
- Context: The specific thresholds might vary depending on the field of study. For instance, financial data often exhibit high kurtosis due to the presence of outliers.
- Data Transformation: If the data is highly skewed or has extreme kurtosis, transformations (like logarithmic or Box-Cox) can sometimes help in normalizing the distribution.
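
The rule-of-thumb thresholds listed above, and the effect of a log transform, can be sketched as follows. `skewness_label` is a hypothetical helper written for this example, and the lognormal sample is synthetic.

```python
import numpy as np
import pandas as pd

def skewness_label(series):
    """Label a series using the rule-of-thumb skewness thresholds."""
    abs_skew = abs(series.skew())
    if abs_skew < 0.5:
        return 'fairly symmetrical'
    if abs_skew < 1:
        return 'moderately skewed'
    return 'highly skewed'

rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))  # synthetic, strictly positive
logged = np.log(raw)  # log transform; only valid for positive values

print('raw:   ', skewness_label(raw), f'({raw.skew():.2f})')
print('logged:', skewness_label(logged), f'({logged.skew():.2f})')
```

For columns that contain zeros, `np.log1p` is a common alternative to `np.log`.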


# Correlation Matrix

In [None]:
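# Correlation is only defined for numerical columns, so select those first
corr_matrix = data.select_dtypes(include='number').corr()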

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True, square=True, linewidths=.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

# Train-Test 
Detailed Workflow:
1. Load Data: Load your raw dataset into a dataframe.
2. Initial Split: Perform the train-test split to separate the dataset into training and test sets.
3. Imputation on Training Data: Train the imputation model on the training set.
4. Apply Imputation: Apply the trained imputation model to both the training and test sets.
5. Identify and Fix Issues: Identify and fix issues in the training data and apply the same fixes to the test data.
6. Model Training and Evaluation: Train your machine learning model on the training set and evaluate it on the test set.  

Key Points:
- Consistent Processing: Any cleaning, transformation, or preprocessing step applied to the training data must also be applied to the test data.
- Fit on Training, Apply to Both: For steps like scaling or encoding, fit the transformer (e.g., scaler, encoder) on the training data and then apply it to both the training and test data.
- No Leakage: Ensure that no information from the test set leaks into the training process to maintain the validity of your model evaluation.
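
One way to get "fit on training, apply to both" without bookkeeping is a scikit-learn `Pipeline`. The sketch below is an illustration on synthetic data, not the workflow used on the study data: the median imputer, the scaler, and the logistic regression are arbitrary choices for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 200 rows, 3 numeric features, about 10% missing values
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# Split first, so nothing below sees the test rows during fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])

# fit() learns imputation values, scaling parameters and model weights from the training set only
pipe.fit(X_train, y_train)

# score() applies the already-fitted imputer and scaler to the test set before predicting
test_accuracy = pipe.score(X_test, y_test)
print(f'Test accuracy: {test_accuracy:.2f}')
```

The medians stored in `pipe.named_steps['imputer'].statistics_` come from `X_train` alone, which is the "no leakage" property described above.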

## Train-Test Split with Export of test data into csv

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
data = pd.read_csv('your_data.csv')

# Define the features and target
# Replace 'target_column' with your actual target column name
X = data.drop(columns=['target_column']) 
y = data['target_column'] 

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Combine X_test and y_test into a single DataFrame for exporting
test_data = pd.concat([X_test, y_test], axis=1)

# Export test data to CSV
test_data.to_csv('test_data.csv', index=False)

# Optionally, you can also export the train data to CSV
train_data = pd.concat([X_train, y_train], axis=1)
train_data.to_csv('train_data.csv', index=False)

# Optionally, you can delete the test data from your current environment
del X_test, y_test, test_data

# Import, Train-Test Split, Exporting Test, with TPOT

In [None]:
# Step 1: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Step 2: Load the dataset (assuming a CSV file)
# Replace 'your_dataset.csv' with the path to your actual dataset
dataset_path = 'your_dataset.csv'
data = pd.read_csv(dataset_path)

# Step 3: Perform train-test split
# Assume the target variable is named 'target'. Change it to your target column name
X = data.drop(columns=['target'])
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Export the test set to a CSV file
test_set = pd.concat([X_test, y_test], axis=1)
test_set.to_csv('test_set.csv', index=False)

# Remove the test set from the dataframe
data = pd.concat([X_train, y_train], axis=1)

# Step 5: Use TPOT to find the best model
# Initialize TPOTClassifier
tpot = TPOTClassifier(verbosity=2, generations=5, population_size=50, random_state=42)

# Fit TPOT on the training data
tpot.fit(X_train, y_train)

# Export the best model pipeline code
tpot.export('tpot_best_model.py')

# Evaluate the best model on the test set
score = tpot.score(X_test, y_test)
print(f'Test Accuracy: {score:.4f}')

# Confusion Matrix

In [None]:
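# computing the predicted y for the whole df
# (if model was trained on part of df, this includes training rows and looks better than a test-set matrix would)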
y_pred = model.predict(X)

#adding predicted y to the whole df
df_with_predictions = df.copy()
df_with_predictions['predictions'] = y_pred

In [None]:
# Compute confusion matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(y, y_pred)

# Plotting the confusion matrix
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()

## Preprocessing
https://scikit-learn.org/stable/data_transforms.html#data-transforms

In [None]:
# 1. Train-Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# 2. Preprocessing on the training data -> fitting the imputers on the training data only
from sklearn.impute import SimpleImputer

# For categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
X_train_cat = cat_imputer.fit_transform(X_train[cat_columns])

# For numerical columns using a more advanced imputer, e.g., KNNImputer
from sklearn.impute import KNNImputer
num_imputer = KNNImputer(n_neighbors=5)
X_train_num = num_imputer.fit_transform(X_train[num_columns])


In [None]:
# 3. preprocessing on test data with models fit on training data
X_test_cat = cat_imputer.transform(X_test[cat_columns])
X_test_num = num_imputer.transform(X_test[num_columns])


In [None]:
# 4. combining the numerical and categorical columns into X_train and X_test
import numpy as np

X_train_processed = np.hstack((X_train_cat, X_train_num))
X_test_processed = np.hstack((X_test_cat, X_test_num))


In [None]:
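# 5. further processing e.g. scaling
# Note: StandardScaler needs numerical input, so string categories must be encoded
# (e.g. with OneHotEncoder) before this step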
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_processed)
X_test_scaled = scaler.transform(X_test_processed)


In [None]:
# 6. model training 
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train_scaled, y_train)


In [None]:
# 7. model evaluation
y_pred = model.predict(X_test_scaled)


# Imputation

https://scikit-learn.org/stable/modules/impute.html#impute

In [None]:
pip install numpy pandas scikit-learn shap matplotlib


## Reverting & Imputing Mean-Imputed Values

In [None]:
import pandas as pd
import numpy as np

columns_with_mean_imputation = ['BMI', 'BloodSugarLevel']


def replace_mean_imputation_with_nan(df, columns, mean_values):
    """
    Replace mean-imputed values with NaN in the specified columns of a DataFrame.

    Parameters:
    df (pd.DataFrame): The DataFrame to process.
    columns (list of str): The list of column names that had mean imputation applied.
    mean_values (dict): A dictionary with column names as keys and mean values as values.

    Returns:
    pd.DataFrame: The DataFrame with mean-imputed values replaced with NaN.
    """
    tolerance = 1e-6  # values closer than this to the mean are treated as mean-imputed
    for col in columns:
        mean_value = mean_values[col]
        print(f"Using mean value for column {col}: {mean_value}")  # Debug print
        # Vectorized: set entries within the tolerance of the mean to NaN
        df[col] = df[col].mask((df[col] - mean_value).abs() < tolerance)
    return df


# Manually set the observed mean values for the columns
mean_values = {
    'BMI': 28.052368,
    'BloodSugarLevel': 105.053928  # Manually set the observed mean value
}

# Assuming X_train is your DataFrame
X_train = replace_mean_imputation_with_nan(X_train, columns_with_mean_imputation, mean_values)
X_test = replace_mean_imputation_with_nan(X_test, columns_with_mean_imputation, mean_values)

In [None]:
# List of numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Identify mean-imputed values and set them to NaN
# (np.isclose instead of ==, because exact float comparison can miss imputed values)
for col in numerical_cols:
    mean_value = df[col].mean()
    df.loc[np.isclose(df[col], mean_value), col] = np.nan

In [None]:
pip install scikit-learn fancyimpute

In [None]:
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestRegressor

def impute_with_random_forest(df, col):
    """Impute missing values of `col` with a RandomForest trained on the rows where `col` is observed."""
    # Predictors: all other columns; fill their gaps with the median so the forest gets complete inputs
    X = df.drop(columns=[col])
    X = X.fillna(X.median(numeric_only=True))
    y = df[col]

    missing_mask = y.isna()
    if not missing_mask.any():
        return y

    # Fit only on rows where the target column is observed
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X[~missing_mask], y[~missing_mask])

    # Predict the missing values
    imputed = y.copy()
    imputed[missing_mask] = rf.predict(X[missing_mask])
    return imputed

# Make a copy of the DataFrame for each imputation method
df_knn = df.copy()
df_rf = df.copy()

# KNN Imputation
knn_imputer = KNNImputer(n_neighbors=5)
df_knn[numerical_cols] = knn_imputer.fit_transform(df_knn[numerical_cols])

# RandomForest Imputation (the function has to be defined before this loop runs)
for col in numerical_cols:
    df_rf[col] = impute_with_random_forest(df_rf[numerical_cols], col)


In [None]:
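from sklearn.metrics import mean_squared_error

# The true values behind the NaNs are unknown, so the imputed frames cannot be scored against df directly
# (mean_squared_error also rejects NaN). Instead: hide some observed values, impute them, compare with the truth.
eval_col = numerical_cols[0]  # column to evaluate; any numerical column works
hidden_idx = df[eval_col].dropna().sample(frac=0.1, random_state=42).index
truth = df.loc[hidden_idx, eval_col]

df_masked = df[numerical_cols].copy()
df_masked.loc[hidden_idx, eval_col] = np.nan

# Mean Squared Error for KNN
knn_values = KNNImputer(n_neighbors=5).fit_transform(df_masked)
df_knn_eval = pd.DataFrame(knn_values, columns=numerical_cols, index=df_masked.index)
mse_knn = mean_squared_error(truth, df_knn_eval.loc[hidden_idx, eval_col])

# Mean Squared Error for RandomForest
rf_values = impute_with_random_forest(df_masked.copy(), eval_col)
mse_rf = mean_squared_error(truth, rf_values.loc[hidden_idx])

print(f'MSE for KNN Imputation: {mse_knn}')
print(f'MSE for RandomForest Imputation: {mse_rf}')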


# Comparing KNN and RandomForest Imputation

In [None]:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer 
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Assuming X_train and X_test are already defined
# Ensure the columns are in the correct order

# Step 1: Initialize the imputers
knn_imputer = KNNImputer(n_neighbors=5)
rf_imputer = IterativeImputer(estimator=RandomForestRegressor(), max_iter=10, random_state=0)

# Step 2: Fit and transform the training data
X_train_knn_imputed = pd.DataFrame(knn_imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_train_rf_imputed = pd.DataFrame(rf_imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index)

# Step 3: Transform the test data using the fitted imputers
X_test_knn_imputed = pd.DataFrame(knn_imputer.transform(X_test), columns=X_test.columns, index=X_test.index)
X_test_rf_imputed = pd.DataFrame(rf_imputer.transform(X_test), columns=X_test.columns, index=X_test.index)

# Optional: Compare imputations
# Here you can compare the results of the imputations if you have a ground truth or if you want to check the distributions

def compare_imputations(original, knn_imputed, rf_imputed, columns):
    comparison = pd.DataFrame({
        'Original': original[columns].mean(),
        'KNN Imputed': knn_imputed[columns].mean(),
        'RF Imputed': rf_imputed[columns].mean()
    })
    return comparison

# Specify columns with NaN values
columns_with_nan = ['BMI', 'BloodSugarLevel']

comparison_train = compare_imputations(X_train, X_train_knn_imputed, X_train_rf_imputed, columns_with_nan)
comparison_test = compare_imputations(X_test, X_test_knn_imputed, X_test_rf_imputed, columns_with_nan)

print("Comparison of Imputations on Training Data:")
print(comparison_train)
print("\nComparison of Imputations on Test Data:")
print(comparison_test)


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import shap
import matplotlib.pyplot as plt

# Load your dataset
# Make sure to replace 'diabetes_study_final_data.csv' with your actual data file
data = pd.read_csv('diabetes_study_final_data.csv')

data.info()

In [None]:
data.describe()

In [None]:
#randomforest


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import shap
import matplotlib.pyplot as plt

# Load your dataset
# Make sure to replace 'your_data.csv' with your actual data file
data = pd.read_csv('your_data.csv')
X = data.drop('target_column', axis=1)
y = data['target_column']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training a model (using RandomForest as an example)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Feature Permutation Importance
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
sorted_idx = result.importances_mean.argsort()

plt.figure(figsize=(12, 6))
plt.boxplot(result.importances[sorted_idx].T, vert=False, labels=X_test.columns[sorted_idx])
plt.title("Permutation Importance of Features")
plt.tight_layout()
plt.show()

# SHAP Summary Plot
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot for all features
shap.summary_plot(shap_values, X_test, plot_type="bar")
plt.show()

# Detailed SHAP summary plot (useful for insights on feature impact)
shap.summary_plot(shap_values, X_test)
plt.show()

# Compute and plot SHAP values for a single prediction
# Note: indexing with [1] assumes a SHAP version that returns one array per class (1 = positive class)
chosen_instance = X_test.iloc[0]
shap_values_instance = explainer.shap_values(chosen_instance)
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values_instance[1], chosen_instance)


In [None]:
#Logistic Regression
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import shap
import matplotlib.pyplot as plt

# Load your dataset
# Replace 'your_data.csv' and 'target_column' with your actual data file and target variable
data = pd.read_csv('your_data.csv')
X = data.drop('target_column', axis=1)
y = data['target_column']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training a Logistic Regression model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train, y_train)

# Feature Permutation Importance
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
sorted_idx = result.importances_mean.argsort()

# Plotting the feature importances
plt.figure(figsize=(12, 6))
plt.boxplot(result.importances[sorted_idx].T, vert=False, labels=X_test.columns[sorted_idx])
plt.title("Permutation Importance of Features")
plt.tight_layout()
plt.show()

# SHAP Summary Plot
# Logistic Regression requires a linear explainer
# (the feature_dependence="independent" argument was renamed in later SHAP releases; the default is used here)
explainer = shap.LinearExplainer(model, X_train)
shap_values = explainer.shap_values(X_test)

# Summary plot for all features
shap.summary_plot(shap_values, X_test, plot_type="bar")
plt.show()

# Detailed SHAP summary plot (useful for insights on feature impact)
shap.summary_plot(shap_values, X_test)
plt.show()

# Compute and plot SHAP values for a single prediction
chosen_instance = X_test.iloc[0]
shap_values_instance = explainer.shap_values(chosen_instance)
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values_instance, chosen_instance)


## Assess the performance using common classification metrics such as accuracy, precision, recall, F1-score, and ROC-AUC

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Assuming predictions and y_test are already defined from your model testing phase
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')
print(f'ROC AUC: {roc_auc:.2f}')

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()


In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions, target_names=['Absent', 'Present']))

# Cross Validation

In [None]:

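from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Initialize the classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)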

# Perform cross-validation
scores = cross_val_score(rf, X, y, cv=5)  # 5-fold cross-validation

# Print the accuracy for each fold
print("Accuracy for each fold:", scores)

# Print average accuracy
print("Average accuracy:", scores.mean())

# Support Vector Machines

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assuming X contains your features (including BMI) and y contains your target variable (Cardiovascular Disease)
X = df_with_predictions[['BMI']]
y = df_with_predictions['CardiovascularDisease']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the SVM model
svm_model = SVC(kernel='linear', C=1.0)
svm_model.fit(X_train, y_train)

# Making predictions on the test set
y_pred = svm_model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


## Cross-Validation

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Assuming X contains your features (including BMI) and y contains your target variable (Cardiovascular Disease)
X = df_with_predictions[['BMI']]
y = df_with_predictions['CardiovascularDisease']

# Creating the SVM model
svm_model = SVC(kernel='linear', C=1.0)

# Performing 5-fold cross-validation
cv_scores = cross_val_score(svm_model, X, y, cv=5)

# Print cross-validation scores
print("Cross-validation scores:", cv_scores)

# Calculate and print the mean accuracy
print("Mean accuracy:", cv_scores.mean())


In [None]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import xgboost as xgb

# Load example data
data = datasets.load_iris()
X = data.data
y = data.target

# Define the models and parameters
models = {
    'RandomForest': (RandomForestClassifier(), {
        'n_estimators': [10, 50, 100],
        'max_depth': [None, 10, 20, 30]
    }),
    'LogisticRegression': (LogisticRegression(), {
        'C': [0.1, 1, 10]
    }),
    'SVM': (SVC(), {
        'kernel': ['linear', 'rbf'],
        'C': [1, 10, 100]
    }),
    'XGBoost': (xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss'), {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 7]
    })
}

# Set up outer cross-validation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Loop through each type of model
for name, (model, params) in models.items():
    # Set up inner cross-validation
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
    clf = GridSearchCV(estimator=model, param_grid=params, cv=inner_cv, scoring='accuracy')
    
    # Perform outer cross-validation
    outer_scores = cross_val_score(clf, X, y, cv=outer_cv, scoring='accuracy')
    
    # Print the results
    print(f"{name}: Mean accuracy = {np.mean(outer_scores):.3f} (+/- {np.std(outer_scores):.3f})")



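Below is an alternate version, written after scikit-learn reported that the LogisticRegression model did not converge within the default number of iterations. It scales the features inside a pipeline for LogisticRegression and SVM and raises `max_iter` to 1000.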

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

models = {
    'RandomForest': (RandomForestClassifier(), {
        'n_estimators': [10, 50, 100],
        'max_depth': [None, 10, 20, 30]
    }),
    'LogisticRegression': (Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(max_iter=1000))
    ]), {
        'classifier__C': [0.1, 1, 10],
        'classifier__solver': ['lbfgs', 'liblinear', 'sag', 'saga']
    }),
    'SVM': (Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', SVC())
    ]), {
        'classifier__kernel': ['linear', 'rbf'],
        'classifier__C': [1, 10, 100]
    }),
    'XGBoost': (xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss'), {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 7]
    })
}

# Set up outer cross-validation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Loop through each type of model
for name, (model, params) in models.items():
    # Set up inner cross-validation
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
    clf = GridSearchCV(estimator=model, param_grid=params, cv=inner_cv, scoring='accuracy')
    
    # Perform outer cross-validation
    outer_scores = cross_val_score(clf, X_train, y_train, cv=outer_cv, scoring='accuracy')
    
    # Print the results
    print(f"{name}: Mean accuracy = {np.mean(outer_scores):.3f} (+/- {np.std(outer_scores):.3f})")


# Reading, Exploration, Cleaning, Feature Engineering, Data Transformation, Imputation

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# 1. Data Import
data = pd.read_csv('data.csv')

# 2. Data Exploration
print(data.head())
print(data.info())
print(data.describe())

# 3. Data Cleaning
# Drop duplicates (missing values are handled by the imputers in step 5)
data.drop_duplicates(inplace=True)

# 4. Feature Engineering (example: creating a new feature)
data['new_feature'] = data['feature1'] / data['feature2']

# 5. Data Transformation
# Define numerical and categorical features
numerical_features = ['feature1', 'feature2', 'new_feature']
categorical_features = ['category1', 'category2']

# Define transformers
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# 6. Train-Test Split
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 7. Imputation (handled in the transformers above)

# 8. Feature Selection (example using all features for simplicity)
# 9. Resampling (if needed) (example not shown here)

# Fit and transform the training data
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# Your data is now ready for model training
print('Training data shape:', X_train.shape)
print('Testing data shape:', X_test.shape)


# Imputation with RandomForest for Numerical and Categorical Features

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
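from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before IterativeImputer)
from sklearn.impute import IterativeImputer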
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# 1. Data Import
data = pd.read_csv('data.csv')

# 2. Data Exploration
print(data.head())
print(data.info())
print(data.describe())

# 3. Data Cleaning
# Drop duplicates (missing values are handled by the imputers below)
data.drop_duplicates(inplace=True)

# 4. Train-Test Split
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define transformers

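# Custom categorical imputer
# Inherits from BaseEstimator/TransformerMixin so that ColumnTransformer can clone it
from sklearn.base import BaseEstimator, TransformerMixin

class CategoricalImputer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.label_encoders = {}
        self.imputer = IterativeImputer(estimator=RandomForestClassifier(), max_iter=10, random_state=42)

    def _encode(self, X):
        # Encode observed categories as integer codes; keep missing values as NaN so they can be imputed
        # (casting NaN to the string 'nan' would hide the missing values from the imputer)
        X = X.copy()
        for col in X.columns:
            le = self.label_encoders[col]
            observed = X[col].notna()
            codes = pd.Series(np.nan, index=X.index, dtype=float)
            codes[observed] = le.transform(X.loc[observed, col].astype(str))
            X[col] = codes
        return X

    def fit(self, X, y=None):
        for col in X.columns:
            self.label_encoders[col] = LabelEncoder().fit(X[col].dropna().astype(str))
        self.imputer.fit(self._encode(X))
        return self

    def transform(self, X):
        X_imputed = pd.DataFrame(self.imputer.transform(self._encode(X)), columns=X.columns, index=X.index)
        for col in X.columns:
            le = self.label_encoders[col]
            # Round to the nearest valid code before mapping back to the category labels
            codes = X_imputed[col].round().clip(0, len(le.classes_) - 1).astype(int)
            X_imputed[col] = le.inverse_transform(codes)
        return X_imputed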

# Numerical Transformer using IterativeImputer with RandomForestRegressor
# ('new_feature' from the previous section is not created in this cell, so it is left out here)
numerical_features = ['feature1', 'feature2']
categorical_features = ['category1', 'category2']

numerical_transformer = Pipeline(steps=[
    ('imputer', IterativeImputer(estimator=RandomForestRegressor(), max_iter=10, random_state=42)),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', CategoricalImputer()),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Fit the preprocessor on the training data
X_train_transformed = preprocessor.fit_transform(X_train)

# Transform the test data
X_test_transformed = preprocessor.transform(X_test)

# Your data is now ready for model training
print('Training data shape:', X_train_transformed.shape)
print('Testing data shape:', X_test_transformed.shape)


## Training a RandomForest to impute missing values


In [None]:
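from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Step 1: Split the data into train and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)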

# Step 2: Separate predictors and target in the training data
X_train = train_df.drop(columns=['InsulinResistance'])
y_train = train_df['InsulinResistance']

# Step 3: Remove rows where y_train is NaN for training the imputation model
X_train_notna = X_train[y_train.notna()]
y_train_notna = y_train[y_train.notna()]

# Check for NaNs in y_train_notna
print("NaNs in y_train_notna:", y_train_notna.isna().sum())

# Ensure y_train_notna contains only 0 or 1
print("Unique values in y_train_notna:", y_train_notna.unique())

# Step 4: Train the model for imputing the categorical column
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_notna, y_train_notna)

# Step 5: Impute missing values in the training set
missing_mask_train = y_train.isna()
X_train_missing = X_train[missing_mask_train]
if not X_train_missing.empty:
    train_df.loc[missing_mask_train, 'InsulinResistance'] = rf_model.predict(X_train_missing)

# Step 6: Apply the same steps to the test set
X_test = test_df.drop(columns=['InsulinResistance'])
y_test = test_df['InsulinResistance']
missing_mask_test = y_test.isna()
X_test_missing = X_test[missing_mask_test]

if not X_test_missing.empty:
    test_df.loc[missing_mask_test, 'InsulinResistance'] = rf_model.predict(X_test_missing)

# Now the train_df and test_df have imputed values for the 'InsulinResistance' column
print(train_df)
print(test_df)