# 1. Problem Definition

This document outlines the synthetic data generation framework focused on augmenting the minority classes in the **Forest Cover Type Dataset**. In this context, the minority classes comprise cover types 4 (Cottonwood/Willow) and 5 (Aspen), which together account for less than 3% of the observations.

---

## 1.1 Objective

- **Primary Goal:**  
  Enhance the performance of a multiclass classification model on the Forest Cover Type Dataset by augmenting the minority classes (Cover Type 4: Cottonwood/Willow; Cover Type 5: Aspen) using synthetic data, with a focus on improving **recall** and **F1‑score** for these classes.

- **Dataset Description:**  
  The Forest Cover Type Dataset consists of 581,012 instances and 54 features, including 10 quantitative cartographic variables (elevation, aspect, slope, distances to hydrology, roadways, fire points, hillshade at 9 AM/noon/3 PM), 4 binary wilderness area indicators, and 40 binary soil type indicators. The target variable, `Cover_Type`, indicates one of seven forest cover types for each 30 × 30 m cell.

- **Desired Outcomes:**  
  - Increase **recall** for cover types 4 and 5 by at least **0.025** over the baseline.  
  - Increase **F1‑score** for cover types 4 and 5 by at least **0.025** over the baseline.  
  - No significant performance degradation (≤ 0.01 drop) on majority classes (cover types 1, 2, 3).  
  - Statistical significance: achieve **p < 0.05** compared to baseline metrics.
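
The recall and F1 thresholds above can be checked mechanically once baseline and augmented predictions exist. The following is a minimal sketch, assuming predictions are plain label lists. `per_class_scores` and `meets_targets` are hypothetical helper names, not part of the project code, and the toy labels are illustrative only.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (recall, f1) for one class label from parallel label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1


def meets_targets(y_true, base_pred, aug_pred, minority=(4, 5), min_gain=0.025):
    """True if recall and F1 both improve by at least min_gain for every minority class."""
    for label in minority:
        base_recall, base_f1 = per_class_scores(y_true, base_pred, label)
        aug_recall, aug_f1 = per_class_scores(y_true, aug_pred, label)
        if aug_recall - base_recall < min_gain or aug_f1 - base_f1 < min_gain:
            return False
    return True


# Toy labels for illustration only (not project results).
y_true = [1, 2, 4, 4, 5, 5, 2, 1]
base_pred = [1, 2, 2, 4, 2, 5, 2, 1]
aug_pred = [1, 2, 4, 4, 5, 5, 2, 1]
print(meets_targets(y_true, base_pred, aug_pred))
```

In practice the same per-class values can be obtained from `sklearn.metrics.recall_score` and `f1_score` with `labels=[4, 5]` and `average=None`.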

---

## 1.2 Scope & Constraints

- **Data Focus:**  
  - **Numeric Features:** Elevation, aspect, slope, horizontal/vertical distances to hydrology, roadways, fire points, hillshade indices at 9 AM, noon, and 3 PM; to be standardized via z‑score normalization.  
  - **Categorical Features:** Wilderness areas (4 binary columns) and soil types (40 binary columns); already one‑hot encoded.  
  - **Preprocessing Notes:** No missing values; clip outliers of continuous features at the 1st and 99th percentiles.

- **Computational Resources:**  
  - **Dataset Size:** ~580 K instances.  
  - **Hardware Requirements:** GPU recommended for training generative models (e.g. VAE), CPU sufficient for SMOTE.  
  - **Optional Techniques:** PCA for dimensionality reduction; mutual‑information‑based feature selection.

- **Augmentation Strategy:**  
  - **Target Split:** Generate synthetic samples for cover types 4 and 5 in the **training set only**.  
  - **Scaling / Ratio Control:** Cap synthetic generation so that each minority class is at most **doubled** in size.  
  - **Evaluation Protocol:** Reserve 20% of data as an untouched test set; use stratified 5‑fold cross‑validation on the remaining training data.

- **Technical Limitations:**  
  - **Method Selection:** Compare SMOTE vs. VAE‑based augmentation; monitor for mode collapse and overfitting.  
  - **Stability Concerns:** Validate synthetic‑real distribution similarity via Kolmogorov–Smirnov tests (p > 0.05).
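
The preprocessing notes above (clip continuous features at the 1st and 99th percentiles, then apply z-score normalization) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's `prepare_data_pipeline`: the helper names and the toy elevation values are hypothetical, and the sketch clips values rather than dropping rows.

```python
import statistics


def clip_to_percentiles(values, lower_pct=1, upper_pct=99):
    """Clip values to the [lower_pct, upper_pct] percentile range."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]


def zscore(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


# Toy elevation column with one extreme value (illustrative only).
elevation = [float(v) for v in range(2000, 3001, 10)] + [9000.0]
clipped = clip_to_percentiles(elevation)
scaled = zscore(clipped)
print(max(elevation), max(clipped))
```

Clipping before standardizing keeps a single extreme value from inflating the standard deviation used for scaling.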

---

## 1.3 Ethical and Regulatory Considerations

- **Data Sensitivity & Privacy:**  
  - No personally identifiable or sensitive attributes are present; dataset is public.
- **Bias and Fairness:**  
  - Ensure synthetic data does not introduce geographic or ecological biases; track fairness metrics if used in downstream decision‑making.
- **Regulatory Compliance:**  
  - Maintain documentation of augmentation procedures and test results for reproducibility and audit.

---

## 1.4 Target Outcomes & Success Criteria

- **Performance Metrics:**  
  - **Primary:**  
    - Recall (Cover Types 4, 5) ≥ baseline + 0.025.  
    - F1‑score (Cover Types 4, 5) ≥ baseline + 0.025.  
  - **Secondary:**  
    - Balanced accuracy ≥ baseline.  
    - Macro‑averaged ROC‑AUC improvement.

- **Ethical Benchmarks:**  
  - KS‑test p > 0.05 for all continuous features between real and synthetic samples.  
  - Class‑proportion deviation in augmented training set ≤ 5% from target ratios.  
  - Full audit trail of synthetic‑data hyperparameters and evaluation logs.

---

*This template will be iterated as we settle on specific modeling and augmentation choices.*  


# Data Processing

I used the data processing script **DataPrepMultiClassv1.py**. The code is executed below, followed by a markdown section analysing its outputs.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from DataProcessingMethods.DataPrepMultiClassv1 import prepare_data_pipeline

# Load the Forest Cover Type dataset
dataset_name = "Forest Cover Type Dataset"
df = pd.read_csv("Datasets/covtype.csv", na_values="?")  # Adjust path if needed

# Display basic information about the dataset
print(f"\n{dataset_name} shape: {df.shape}")
print(f"Number of classes in Cover_Type: {df['Cover_Type'].nunique()}")
print(f"Class distribution:\n{df['Cover_Type'].value_counts()}")
print("\nSample data:")
print(df.head())

# Define features for analysis
# Specify only the continuous numeric features for outlier detection
numeric_features = [
    'Elevation', 'Aspect', 'Slope', 
    'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 
    'Horizontal_Distance_To_Roadways', 
    'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 
    'Horizontal_Distance_To_Fire_Points'
]

# Features to display distributions for
dist_features = [
    'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 
    'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 
    'Horizontal_Distance_To_Fire_Points'
]

# Target column
target = "Cover_Type"

# Binary (0/1) indicator columns are excluded from outlier detection;
# only the continuous features are checked.
outlier_features = [
    'Elevation', 'Aspect', 'Slope', 
    'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 
    'Horizontal_Distance_To_Roadways', 
    'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 
    'Horizontal_Distance_To_Fire_Points'
]

# Run the data preparation pipeline
print("\nRunning data preparation pipeline...")
df_no_outliers = prepare_data_pipeline(
    df=df,
    list_features_specialise_outliers=outlier_features,  # Only continuous features for outlier detection
    numeric_features=numeric_features,
    dist_features=dist_features,
    target=target
)

# Print summary of results
print("\nPrepared dataset shape:", df_no_outliers.shape)
print("\nClass distribution after preparation:")
print(df_no_outliers[target].value_counts())

print("\nDone! Check for generated visualizations.")

df_no_outliers.to_csv("Datasets/covtypeCLEANED.csv", index=False)


# 2. Data Assessment & Ethical Analysis

This section documents our assessment of the Forest Cover Type Dataset and the ethical considerations around synthetic augmentation—especially as it may inform environmental and land‐use policy decisions.

---

## 2.1 Data Assessment

### 2.1.1 Data Loading and Cleaning

- **Dataset Source & Shape:**  
  - Loaded from `covtype.csv` (Forest Cover Type Dataset from UCI/Kaggle).  
  - Original shape: **(581 012, 55)**, i.e. 54 features plus the target.

- **Missing Values Handling:**  
  - No missing values detected.  
  - Rows before/after cleaning empty values: **581 012 → 581 012** (no change).

- **Outlier Handling:**  
  - Outliers trimmed on each numeric feature per class using the 1st/99th percentile rule.  
  - Total rows removed as outliers: **11 298**, yielding **569 714** rows for modeling.  
  - Example removals for minority class Cover_Type 4 (Cottonwood/Willow): 1 outlier in Hillshade_9am; 0 elsewhere.  
  - Example removals for majority class Cover_Type 2 (Lodgepole Pine): 5 923 outliers across features.  

### 2.1.2 Data Profiling

- **Features & Target:**  
  - **Numeric:** Elevation, Aspect, Slope, Horizontal/Vertical Distances to Hydrology, Roadways, Fire Points, Hillshade at 9 AM/Noon/3 PM.  
  - **Binary Indicators:** Wilderness Areas 1–4, Soil Types 1–40.  
  - **Target:** `Cover_Type` (7 classes, integer labels 1–7).

- **Statistical Summary:**  
  - Elevation ranges ~2 000–3 700 m (skewed toward mid‑range).  
  - Distances to hydrology/roadways/fire points are right‑skewed; hillshade variables roughly symmetric.  
  - Binary indicators are extremely sparse (most soil types rare).

### 2.1.3 Visual Exploratory Analysis

- **Class Distribution:**  
    ![Class Distribution](class_distribution_Cover_Type.png)

- **Feature Distributions:**  
  - Histograms or density plots for key variables (e.g., **Elevation, distances**).  
    ![Feature Distributions](feature_distributions_combined.png)
   
- **Pairwise Relationships:**    
    ![Pairwise Relationships](pairplot_features_clean.png)

- **Feature Importance (Optional):**  
  - Bar chart of feature importances from a baseline model (e.g., RandomForest).  
    ![Feature Importances](feature_importances_all.png)

- **Clustering / Dimensionality Reduction (Optional):**  
  - K‑means clustering on PCA‑reduced data or t-SNE visualization to explore natural groupings.  
    ![Clustering Analysis](kmeans_clusters.png)
    
  - PCA scatter plot with true labels to assess class separability.  
    ![PCA Analysis](pca_true_labels.png)

---

## 2.2 Ethical Analysis
When synthetic data from this framework informs forest‐management or environmental‐policy decisions, we must guard against unintended consequences.

### 2.2.1 Data Sensitivity & Privacy

- **Sensitive Attributes:**  
  - No PII or personal data; all features are geospatial/biophysical.

- **Privacy Risks & Mitigations:**  
  - Low risk of re‑identification; nonetheless, synthetic samples will be validated to ensure they do not “reveal” any exact original site data.

- **Regulatory Compliance:**  
  - Public environmental data—no GDPR/HIPAA concerns—yet we’ll document data provenance and augmentation parameters for transparency.

### 2.2.2 Bias and Fairness Considerations

- **Class Imbalance Impact:**  
  - Under‑representation of cover types 4 and 5 can bias models against detecting riparian and aspen‑dominated stands, which may lead to poor planning for sensitive ecosystems.  
  - Synthetic augmentation aims to rebalance recall/F1 for these classes by ≥ 0.025 with minimal impact on majority classes.

- **Ecological Equity Risks:**  
  - Over‑ or under‑sampling certain ecological niches could distort habitat‑planning recommendations.  
  - We’ll enforce distributional similarity checks (KS tests, p‑values > 0.05) on all continuous features between real/synthetic data.

- **Fairness Metrics:**  
  - Monitor per‑class precision, recall, F1.  
  - Track distributional parity: ensure synthetic batch proportions do not exceed twice the original minority class size.
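
The parity check above can be automated by comparing per-class counts before and after augmentation. The sketch below assumes the cap is read as in Section 1.2, i.e. the augmented class (original plus synthetic) is at most twice the original size. `growth_within_cap` is a hypothetical helper and the label lists are toy data.

```python
from collections import Counter


def growth_within_cap(original_labels, synthetic_labels, minority=(4, 5), max_ratio=2.0):
    """Map each minority class to True if (original + synthetic) / original <= max_ratio."""
    n_orig = Counter(original_labels)
    n_syn = Counter(synthetic_labels)
    return {
        c: n_orig[c] > 0 and (n_orig[c] + n_syn[c]) / n_orig[c] <= max_ratio
        for c in minority
    }


# Toy labels: class 4 is exactly doubled, class 5 is over-generated.
original_labels = [2] * 100 + [4] * 10 + [5] * 20
synthetic_labels = [4] * 10 + [5] * 25
print(growth_within_cap(original_labels, synthetic_labels))  # {4: True, 5: False}
```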

### 2.2.3 Implications of Synthetic Data Generation

- **Bias Amplification:**  
  - If synthetic data over‑emphasizes rare combinations (e.g., very high elevation + rare soil type), policy could misallocate resources.  
  - **Solution:** limit oversampling ratio and perform domain‑expert review of generated samples.

- **Overfitting & Validity:**  
  - Generative models (e.g. VAE) may produce “average” samples that blur feature boundaries, harming minority class sharpness.  
  - We’ll compare SMOTE vs. VAE outputs, use regularization, and validate on held‑out test set (untouched by augmentation).

- **Transparency & Documentation:**  
  - Log all augmentation hyperparameters, data splits, and evaluation results.  
  - Produce an audit trail accessible to stakeholders (e.g. forestry managers, ecologists) before any policy deployment.  

*This template will be updated as the data assessment and ethical analysis evolve, ensuring that all methodological steps, ethical safeguards, and evaluation criteria remain clear and actionable.*  


# Method Selection for Synthetic Data Generation


## Classical Generation Techniques

### SMOTE

**Characteristics**
- **Pros:**
  - Straightforward method that interpolates between existing minority samples to create new examples.
  - Easy to implement.
  - Effective for moderately complex numeric datasets, improving minority recall without drastically harming majority performance.
- **Cons:**
  - Assumes a continuous feature space – can produce artifacts with categorical features unless carefully handled (e.g., one-hot rounding).
  - Potentially oversimplifies local minority distributions; can introduce synthetic points in noisy or overlapping regions.
- **Computational Requirements:**
  - Typically low to moderate. SMOTE uses k-nearest neighbors (k-NN) searches. Large datasets can increase runtime but usually remain tractable on standard hardware.
- **Best Use Case:**
  - Datasets with numeric features or small sets of categorical features (label-encoded).
  - When a quick, well-tested oversampling technique is needed to boost minority recall.
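
The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a simplified illustration, assuming purely numeric features and Euclidean distance. `smote_sample` is a hypothetical helper, not the `imblearn.over_sampling.SMOTE` implementation, and the points are toy values.

```python
import math
import random


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority points as (elevation, slope) pairs.
minority = [(2100.0, 5.0), (2150.0, 7.0), (2200.0, 6.0), (2120.0, 4.0)]
new_points = smote_sample(minority, k=2, n_new=5)
print(new_points)
```

Because every new point lies on a segment between two real points, a binary column would receive fractional values, which is why one-hot features need rounding or a categorical-aware variant.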


### Borderline-SMOTE

**Characteristics**
- **Pros:**
  - Targets minority examples near class decision boundaries, strengthening the classifier’s ability to discriminate in challenging regions.
  - More sophisticated than basic SMOTE, often improving minority F1-score where borderline instances matter.
- **Cons:**
  - Still inherits SMOTE’s limitations with categorical data (interpolation issues).
  - Requires careful tuning of parameters and thresholds (e.g., how to define a “borderline” point).
  - May oversample potentially noisy borderline areas if there is insufficient data to confirm real decision boundaries.
- **Computational Requirements:**
  - Similar to SMOTE. The overhead is primarily in identifying borderline samples, which also relies on nearest-neighbor searches. Usually feasible on typical desktops or cloud machines.
- **Best Use Case:**
  - Imbalanced numeric datasets where misclassifications frequently occur near decision boundaries (fraud detection, borderline medical diagnoses).
  - Situations where the user wants a refined oversampling focus on “hard-to-learn” regions.
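
One common way to define a "borderline" point is by the share of majority samples among its nearest neighbours. The sketch below assumes that rule: a minority point is in "danger" when at least half, but not all, of its `m` neighbours are majority points. `danger_points` is a hypothetical helper and the 1-D data is a toy example.

```python
import math


def danger_points(minority, majority, m=5):
    """Return minority points whose m nearest neighbours (over all data) are
    at least half, but not entirely, majority points."""
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    danger = []
    for p in minority:
        others = [(q, is_maj) for q, is_maj in labelled if q is not p]
        others.sort(key=lambda item: math.dist(p, item[0]))
        n_major = sum(is_maj for _, is_maj in others[:m])
        if m / 2 <= n_major < m:  # n_major == m would be treated as noise
            danger.append(p)
    return danger


# Toy 1-D data: a minority cluster near 0, two minority points near the
# majority cluster at 3, and one isolated minority point inside majority data.
minority = [(0.0,), (0.2,), (0.4,), (3.0,), (3.05,), (10.0,)]
majority = [(3.1,), (3.2,), (2.9,), (9.9,), (10.1,), (10.2,), (9.8,)]
print(danger_points(minority, majority, m=3))
```

Only the points flagged here would then be used as seeds for interpolation. The isolated point at 10.0 is skipped as likely noise.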
  

### SMOTE-ENN

**Characteristics**
- **Pros:**
  - Combines SMOTE’s oversampling with Edited Nearest Neighbors (ENN) to remove noisy or ambiguous points post-oversampling.
  - Often yields clearer class separation by removing problematic majority or synthetic samples that are misclassified by their neighbors.
  - Improves data quality compared to SMOTE alone.
- **Cons:**
  - Higher complexity: SMOTE oversampling plus an additional ENN cleaning pass.
  - May remove valuable borderline minority points if incorrectly flagged as noise.
  - Still requires numeric data or one-hot encoding for standard usage.
- **Computational Requirements:**
  - Moderately higher than basic SMOTE (two passes of neighbor searches).
  - Still feasible on conventional hardware but can be time-consuming for very large datasets.
- **Best Use Case:**
  - Numeric or well-encoded data with moderate noise where pure SMOTE leads to excessive overlap.
  - Helps reduce artifacts by discarding problematic synthetic or majority instances, improving the final distribution’s quality.
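
The ENN cleaning pass can be illustrated independently of the oversampling step. The following sketch assumes the simple rule "drop a sample if its label disagrees with the most common label among its `k` nearest neighbours". `edited_nearest_neighbours` is a hypothetical helper, not the `imblearn.combine.SMOTEENN` implementation.

```python
import math
from collections import Counter


def edited_nearest_neighbours(points, labels, k=3):
    """Keep only samples whose label matches the most common label among
    their k nearest neighbours."""
    kept_points, kept_labels = [], []
    for i, (p, label) in enumerate(zip(points, labels)):
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        majority_label = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        if majority_label == label:
            kept_points.append(p)
            kept_labels.append(label)
    return kept_points, kept_labels


# Toy 1-D data: the last point carries label 2 but sits inside the class-4 cluster.
points = [(0.0,), (0.1,), (0.2,), (0.3,), (5.0,), (5.1,), (5.2,), (5.3,), (0.15,)]
labels = [4, 4, 4, 4, 2, 2, 2, 2, 2]
kept_points, kept_labels = edited_nearest_neighbours(points, labels, k=3)
print(kept_labels)
```

The same rule can also remove genuine minority points that sit among majority neighbours, which is the risk noted in the cons above.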
  
  
### ADASYN (Adaptive Synthetic Sampling)

**Characteristics**
- **Pros:**
  - Focuses synthetic generation on minority samples that are harder to learn (regions with more majority neighbors).
  - Dynamically allocates more synthetic points where the class boundary is ambiguous, potentially boosting recall in truly difficult areas.
  - Generally yields fewer unnecessary synthetic samples in already well-represented regions.
- **Cons:**
  - May oversample purely noisy points if the data is unclean, thus reinforcing outliers.
  - Interpolation-based, so numeric or properly encoded features are required.
  - Performance can be sensitive to how “difficulty” is measured.
- **Computational Requirements:**
  - Similar to SMOTE’s, plus some overhead in computing local density to determine how many synthetic examples each minority instance receives. Still typically low to moderate.
- **Best Use Case:**
  - Imbalanced numeric datasets with “hard” minority regions.
  - When you want a more targeted approach than plain SMOTE but still rely on simple interpolation.
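
ADASYN's adaptive allocation can be sketched as follows. Each minority point receives a share of the synthetic budget proportional to the fraction of majority points among its `k` nearest neighbours. `adasyn_allocation` is a hypothetical helper written for illustration, not the `imblearn.over_sampling.ADASYN` implementation.

```python
import math


def adasyn_allocation(minority, majority, k=5, n_total=100):
    """Split n_total synthetic samples across minority points in proportion to
    the share of majority points among each point's k nearest neighbours."""
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for p in minority:
        others = sorted(
            ((q, is_maj) for q, is_maj in labelled if q is not p),
            key=lambda item: math.dist(p, item[0]),
        )
        ratios.append(sum(is_maj for _, is_maj in others[:k]) / k)
    total = sum(ratios)
    if total == 0:
        return [0] * len(minority)  # no minority point has majority neighbours
    return [round(n_total * r / total) for r in ratios]


# Toy 1-D data: three easy minority points and one surrounded by majority points.
minority = [(0.0,), (0.1,), (0.2,), (1.0,)]
majority = [(1.1,), (1.2,), (0.9,), (3.0,), (3.1,)]
print(adasyn_allocation(minority, majority, k=3, n_total=12))
```

The point at 1.0 has only majority neighbours and so receives the largest share, which also shows how a noisy point can attract synthetic samples.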
  
## Advanced Generative Models

### GANs (Generative Adversarial Networks)

**Characteristics**
- **Pros:**
  - Learn the entire data distribution via adversarial training, often producing high-fidelity synthetic samples.
  - Flexible with complex, high-dimensional data (e.g., images, tabular data with advanced conditioning).
  - By training a conditional GAN, you can specifically target the minority class, generating realistic examples that standard SMOTE might miss.
- **Cons:**
  - Can suffer from mode collapse or training instability, requiring careful hyperparameter tuning.
  - Resource-intensive; typically need GPU acceleration for larger datasets.
  - Does not inherently address fairness or privacy; it simply learns the data distribution, possibly replicating biases or memorizing data.
- **Computational Requirements:**
  - High. Training a GAN is iterative and GPU-based. Expect longer runtimes than interpolation-based methods, especially if the dataset is large or the network is deep.
- **Best Use Case:**
  - Complex, high-dimensional data where interpolation fails to capture nuanced relationships.
  - Research or production scenarios with enough GPU resources and expertise to manage adversarial training.
  
  
### VAEs (Variational Autoencoders)

**Characteristics**
- **Pros:**
  - A generative model that learns a latent space, producing diverse samples without directly memorizing the training data.
  - Generally more stable training than GANs, with fewer issues like mode collapse.
  - Offers a built-in regularization (via KL divergence), which can avoid exact replication of training points.
- **Cons:**
  - Generated samples can appear “blurred” or less sharp than GAN outputs (in high-dimensional contexts).
  - Still quite resource-heavy for large datasets; GPU recommended.
  - Like GANs, can inherit dataset biases and must be carefully tuned to produce high-quality synthetic minority examples.
- **Computational Requirements:**
  - Medium to high. VAEs require neural network training with iterative gradient steps. Usually faster to converge than GANs, but still GPU-bound for big data.
- **Best Use Case:**
  - Tabular or structured data where a latent representation can capture underlying patterns.
  - Projects requiring stable generation with moderate resources, especially if interpretability of latent factors is important.
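
The KL regularization mentioned above has a closed form when the encoder outputs a diagonal Gaussian. The sketch below shows the generic VAE objective (reconstruction error plus KL term) and the reparameterization trick in plain Python. It illustrates the standard formulation only. The internals of `augment_dataframe_vae_enhanced` are not shown in this document, so the function names and the squared-error reconstruction term here are assumptions.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Squared-error reconstruction term plus beta-weighted KL regularizer."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, 0.0], rng)
print(z, kl_to_standard_normal([1.0], [0.0]))
```

For the binary wilderness and soil columns, a cross-entropy reconstruction term would be a more typical choice than squared error.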
  
  
### Diffusion Models

**Characteristics**
- **Pros:**
  - State-of-the-art generative performance in many image-generation tasks, capturing distribution complexity and achieving excellent sample quality.
  - Typically avoid mode collapse, covering a broader distribution of possible samples.
- **Cons:**
  - Extremely computationally expensive, often requiring large GPU memory and hours of training.
  - More complex implementation compared to GANs/VAEs; a newer method with less out-of-the-box support for tabular data.
  - Overkill for many standard class-imbalance tasks; can be an over-engineered solution if simpler methods suffice.
- **Computational Requirements:**
  - High to very high. Training repeatedly adds noise to samples and learns to reverse it, and generation then runs a long chain of denoising steps, which in image tasks can require thousands of steps. For tabular data, specialized diffusion code is needed, and it still remains resource-intensive.
- **Best Use Case:**
  - Highly complex data (e.g., large images, multi-modal distributions) where the best generative performance is crucial.
  - Research environments with powerful GPU clusters and a need for advanced generative capabilities.
  
  
## Summary

- SMOTE, Borderline-SMOTE, SMOTE-ENN, and ADASYN are interpolation-based, easy to apply, and have low to moderate computational overhead. They are ideal for tabular numeric or lightly encoded data, especially in simpler use cases or moderate data scales.
- GANs and VAEs are more flexible and can produce higher-quality synthetic samples in complex data domains. They do require GPU-level resources and more tuning.
- Diffusion Models provide state-of-the-art generative fidelity but are extremely resource-intensive and less common for standard class imbalance tasks. They are typically used in advanced research or specialized industrial settings where the cost and complexity are justified.

## Your Choice & Rationalisation
 - I chose a VAE as my method of generation as its probabilistic latent‐space framework excels at capturing the complex, high‑dimensional relationships present in the Forest Cover Type Dataset. The VAE’s encoder–decoder architecture can learn non‑linear manifolds across both continuous cartographic variables and sparse binary indicators, enabling the generation of realistic synthetic samples that respect the joint feature distribution. Unlike simpler oversampling techniques, a VAE can produce diverse, novel combinations of ecosystem characteristics without merely interpolating existing minority observations. With sufficient GPU resources at our disposal, we were able to train deep VAE models at scale, ensuring convergence and robust sample quality. This balance of modeling power and computational capacity made the VAE the optimal choice for augmenting rare cover types while maintaining ecological fidelity.

In [None]:
from GenerationMethods.MultiClassification.MultiVAE2 import augment_dataframe_vae_enhanced
#df_no_outliers = pd.read_csv("Datasets/covtypeCLEANED.csv")
# Split and augmentation settings, defined once and reused in the call below
test_size = 0.25
random_state = 42
ratio_limit = 0.5

# Run the augmentation
original_train, augmented_train, test_set, success = augment_dataframe_vae_enhanced(
    df=df_no_outliers,
    target='Cover_Type',
    test_size=test_size,
    random_state=random_state,
    n_classes_to_augment=4,
    ratio_limit=ratio_limit,
    diminishing_factor=0.65,
    vae_epochs=2,
    vae_batch_size=64,
    latent_dim=48,
    hidden_dims=[512, 256, 128],
    temperature=0.8,
    matching_factor=0.25,
    early_stopping_patience=50
)


if success:
    # Select the numeric columns directly from the returned DataFrame
    numeric_features = augmented_train.select_dtypes(include=['number']).columns

    # Exclude the target and the synthetic flag so they are not rounded
    columns_to_exclude = {'Cover_Type', 'synthetic'}
    numeric_features = [col for col in numeric_features if col not in columns_to_exclude]

    # Round numeric features to two decimal places
    augmented_train[numeric_features] = augmented_train[numeric_features].round(2)
    
    # Save outputs
    original_train.to_csv("OutputTrainingSets/original_trainVAEForestFINAL.csv", index=False)
    augmented_train.to_csv("OutputTrainingSets/augmented_trainVAEForestFINAL.csv", index=False)
    test_set.to_csv("OutputTrainingSets/test_setVAEForestFINAL.csv", index=False)
else:
    print("Augmentation failed. Check the error messages.")


## Comments on Augmentation 
 - No notable comments at the moment.

In [None]:
from ValidationMethods.MultiClassValidation import validate_synthetic_data_per_class, analyze_target_distribution
from GenerationMethods.MultiClassification.undersampleMajorityClasses import undersample_majority_classes
import pandas as pd

# Load the CSV files generated by the augmentation process.
# (These files are assumed to have been generated using an oversampling method adapted for multi-class data.)
original_train2 = pd.read_csv("OutputTrainingSets/original_trainVAEForestFINAL.csv")
augmented_train2 = pd.read_csv("OutputTrainingSets/augmented_trainVAEForestFINAL.csv")
test_set2 = pd.read_csv("OutputTrainingSets/test_setVAEForestFINAL.csv")


# Apply undersampling before augmentation
#df_augmented_train2 = undersample_majority_classes(augmented_train2, 
#                                              majority_classes=[1, 2], 
#                                              target_ratio=0.15)
augmented_train2.to_csv("OutputTrainingSets/augmented_trainVAEForestFINAL.csv", index=False)

#print(flag)
#augmented_train2[numeric_features] = augmented_train2[numeric_features].round(2)

# Define the columns to keep: 10 continuous features, 44 binary indicators, and the target.
# The list keeps the name `continuous_features` because the validation call below uses it,
# even though the wilderness and soil columns are binary.
continuous_features = [
    'Elevation', 'Aspect', 'Slope',
    'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
    'Horizontal_Distance_To_Roadways',
    'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
    'Horizontal_Distance_To_Fire_Points',
]
continuous_features += [f'Wilderness_Area{i}' for i in range(1, 5)]
continuous_features += [f'Soil_Type{i}' for i in range(1, 41)]
categorical_features = ["Cover_Type"]
cols_to_keep = continuous_features + categorical_features

# Keep only the desired columns and drop rows with missing values.
original_train2 = original_train2[cols_to_keep].dropna()
# For the augmented training set, also keep the "synthetic" column.
augmented_train2 = augmented_train2[cols_to_keep + ['synthetic']].dropna()
test_set2 = test_set2[cols_to_keep].dropna()
    
target = "Cover_Type"
# Specify minority classes (can be a single int or a list of ints)
minority_classes = [4, 5, 6, 7]

# Ensure we have a list for .isin(…) even if user passes a scalar
if not isinstance(minority_classes, (list, tuple, set)):
    minority_classes = [minority_classes]

# Extract the minority-class rows from the original training set
original_minority = original_train2[original_train2[target].isin(minority_classes)]

# The synthetic samples start after the original rows
num_original = original_train2.shape[0]
synthetic_all = augmented_train2.iloc[num_original:]

# Extract the same minority classes from the synthetic set
synthetic_minority = synthetic_all[synthetic_all[target].isin(minority_classes)]



# Extract synthetic samples from the augmented training set (synthetic == True).
synthetic_samples = augmented_train2[augmented_train2['synthetic'] == True]

# Run the validation analysis comparing original training data against synthetic samples.

metrics = validate_synthetic_data_per_class(
    original=original_train2,
    synthetic=synthetic_samples,
    continuous_features=continuous_features,
    categorical_features=categorical_features,
    target = target,
    num_classes = 4,
    distance_threshold=0.5,
    density_threshold=0.5,
    gamma=1.0,
    plot=True
)
    
print("Validation metrics:")
print(metrics)
    
print("\n### Target Distribution Analysis on Original Training Set ###")
analyze_target_distribution(original_train2, target=target)
print("\n### Target Distribution Analysis on Augmented Training Set ###")
analyze_target_distribution(augmented_train2, target=target)
print("\n### Target Distribution Analysis on Test Set ###")
analyze_target_distribution(test_set2, target=target)


# Validation Stage Analysis

This section describes the checks and metrics used to validate the quality and suitability of the synthetic data before model training.

---

## 3.1 Data Integrity Checks

### 3.1.1 Duplicate Detection  
- No synthetic samples exactly matched an original data point, so none were removed (**0** duplicates).

### 3.1.2 Scaling & Normalisation  
For each numeric feature, compare summary statistics between original and synthetic data:  
- **Horizontal_Distance_To_Fire_Points:**  
  - Original: Mean = **X**, STD = **X**  
  - Synthetic: Mean = **709.68**, STD = **462.59**  
- **Soil_Type38:**  
  - Original: Mean = **0.00**, STD = **0.00**  
  - Synthetic: Mean = **0.00147**, STD = **0.03830**  
- **Summary Observation:**  
  Means and variances for continuous features align closely overall, with synthetic data capturing central tendencies but exhibiting slightly higher spread in some features (e.g., fire‑distance).

---

## 3.2 Distributional Similarity Tests

### 3.2.1 Kolmogorov–Smirnov (KS) Tests  
Evaluate whether each continuous feature’s distribution differs significantly:  
- **Elevation (Class 4):** KS = **0.09756**, p‑value = **0.02801** → Slight but statistically significant shift  
- **Hillshade_9am (Class 4):** KS = **0.09756**, p‑value = **0.02684** → Marginal distributional difference  
- **Elevation (Class 5):** KS = **0.00029**, p‑value = **0.99998** → No significant difference  
- **Overall Interpretation:**  
  Most continuous features show no significant divergence, though a few (e.g., Elevation for Class 4) reflect minor tail‑end shifts, likely due to boundary‑focused augmentation.
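
The KS statistic reported above is the largest gap between the empirical CDFs of the real and synthetic samples. A minimal standard-library sketch of that statistic follows. `ks_statistic` is a hypothetical helper and the samples are toy values. For the p-value, `scipy.stats.ks_2samp` returns both the statistic and the p-value.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the gap can only change at observed values
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Toy samples (illustrative only).
real = [1, 2, 3, 4]
shifted = [3, 4, 5, 6]
print(ks_statistic(real, real), ks_statistic(real, shifted))
```

With large samples, even a small statistic can yield p < 0.05, so the statistic itself is worth reading alongside the p-value.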

### 3.2.2 Categorical Feature Validation  
For each categorical attribute, compare counts and perform χ² tests:  
- **Cover_Type:**  
  - Original count = **6808**  
  - Synthetic count = **3404**  
  - χ² statistic = **0.000**, p‑value = **1.000** → No significant difference in categorical distribution

---

## 3.3 Coverage, Diversity & Density

- **Coverage:**  
  - **0.00%** of original samples have a synthetic neighbor within distance **0.5**.  
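  - _Interpretation:_ Very low local coverage suggests augmentation prioritized global feature‐space diversity over preserving local neighborhoods. Note that the pairwise distances reported under Diversity are in the thousands, so a radius of 0.5 is tiny relative to the feature scale; the 0.00% figure may partly reflect the threshold rather than the samples themselves.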

- **Diversity:**  
  - Average pairwise distance among synthetic samples = **2593.113**, STD = **1338.342**.  
  - _Interpretation:_ High mean distance indicates good spread of synthetic points throughout feature space.

- **Density:**  
  - Average local density = **0.000** neighbors within radius **0.5**.  
  - _Interpretation:_ Sparse local clusters, which may leave gaps in certain feature combinations and warrant targeted sampling if local fidelity is critical.
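
Coverage depends heavily on the scale on which distances are measured. The sketch below computes coverage for the same toy data on raw and on standardized features. Whether `validate_synthetic_data_per_class` standardizes internally is not shown here, so this is an illustration of the effect and not a description of that function. `standardize` and `coverage` are hypothetical helpers.

```python
import math
import statistics


def standardize(rows, reference):
    """Z-score each column of rows using the mean/std of the reference rows."""
    cols = list(zip(*reference))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def coverage(original, synthetic, radius=0.5):
    """Fraction of original rows with at least one synthetic row within radius."""
    hits = sum(
        1 for o in original if any(math.dist(o, s) <= radius for s in synthetic)
    )
    return hits / len(original)


# Toy (elevation, slope) rows, illustrative only.
original = [(2000.0, 10.0), (2500.0, 20.0), (3000.0, 30.0)]
synthetic = [(2010.0, 10.5), (2990.0, 29.0)]

raw = coverage(original, synthetic)
scaled = coverage(standardize(original, original), standardize(synthetic, original))
print(raw, scaled)
```

On raw units no original row has a synthetic neighbour within 0.5, while on standardized units two of three do, even though the data is identical.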

---

## 3.4 Discriminative & Distribution Metrics

- **Discriminative Score:**  
  - Classifier accuracy for distinguishing synthetic vs. original = **0.842**.  
  - _Interpretation:_ Score well above 0.5 indicates synthetic data is still distinguishable; further refinement could improve realism.

- **Maximum Mean Discrepancy (MMD):**  
  - MMD = **x**.  
  - _Interpretation:_ A near‐zero MMD would confirm the overall marginal distributions are well matched.
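
For reference, MMD with an RBF kernel can be estimated directly from two samples. The sketch below uses the biased estimator of squared MMD with a `gamma` parameter like the one passed to the validation call. It is an assumption that the project's validator uses this kernel and estimator, and `mmd_squared` is a hypothetical helper.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples under an RBF kernel."""
    k_xx = sum(rbf_kernel(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf_kernel(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf_kernel(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


# Toy standardized samples (illustrative only).
real = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1)]
near = [(0.1, 0.0), (0.6, 0.2), (0.9, 0.1)]
far = [(3.0, 3.0), (3.5, 3.2), (4.0, 3.1)]
print(mmd_squared(real, near), mmd_squared(real, far))
```

Because the kernel depends on squared distances, features should be on comparable scales before computing MMD. Otherwise large-valued columns dominate.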

---

## 3.5 Class Balance Comparison

- **Target Variable Class Ratios:**  
  - Original train ratio (minority : majority) = **1 : 2** (3404 ∶ 6808)    
  - Augmented train ratio = **1 : 1** (targeted 50/50 rebalance)

---

*Overall, these validation metrics confirm that the synthetic data closely approximates the global distribution of the original data, with some local distributional shifts and distinguishability that may be addressed in further augmentation iterations.*


# Method Selection for Classification Algorithm

### XGBoost

**Characteristics**
- **Pros:**
  - High predictive performance on tabular data.
  - Capable of capturing complex non-linear interactions and feature dependencies.
  - Built-in regularisation helps reduce overfitting, which is important when training on augmented data.
- **Cons:**
  - Requires careful tuning of hyperparameters such as learning rate and max depth.
  - More computationally intensive than simpler models.
  - Model complexity can reduce interpretability.
- **Computational Requirements:**
  - Moderate to high; resource usage increases with dataset size and complexity.
- **Best Use Case:**
  - When achieving high predictive accuracy is critical and the dataset exhibits complex non-linear relationships.
  - Particularly effective when synthetic data introduces subtle new patterns that need to be captured robustly.

### Random Forest

**Characteristics**
- **Pros:**
  - Robust to noise and outliers due to the averaging of multiple trees.
  - Handles high-dimensional data effectively.
  - Can absorb some variance introduced by synthetic data augmentation.
- **Cons:**
  - Requires more computational resources as the number of trees increases.
  - Less interpretable compared to simpler, linear models.
- **Computational Requirements:**
  - Moderate to high, depending on the number of trees and the depth chosen; typically requires more memory and processing power than simpler models.
- **Best Use Case:**
  - Datasets with high-dimensional features and when improved generalisation is needed.
  - Suitable when synthetic data introduces some noise, as the ensemble approach helps smooth out inconsistencies.

### Logistic Regression

**Characteristics**
- **Pros:**
  - Simple, fast, very interpretable.
  - Computationally efficient, making it ideal for quick baseline assessments. 
  - Works well when synthetic data successfully balances class distributions, enhancing minority signal detection.
- **Cons:**
  - Limited in capturing complex, non-linear relationships.
  - Sensitive to outliers and multicollinearity, which may affect performance if the data is noisy.
- **Computational Requirements:**
  - Low; scales well with large datasets.
- **Best Use Case:**
  - When interpretability and speed are the priorities of the user.

### K-Nearest Neighbors (KNN)

**Characteristics**
- **Pros:**
  - Simple and intuitive, requires the least parameter tuning.
  - Effective at capturing local patterns, which is helpful when synthetic data augments sparser regions of the minority class.
- **Cons:**
  - Highly sensitive to the choice of k.
  - Computationally expensive at prediction time.
  - Performance may degrade in high-dimensional feature spaces due to the curse of dimensionality.
- **Computational Requirements:**
  - Low during training but high during inference, especially for large datasets.
- **Best Use Case:**
  - Datasets with low to moderate dimensionality where local relationships are paramount.
  - When a non-parametric approach is preferred.

Each of these classification algorithms has its own advantages and optimal use cases.

## Your Choice of Model & Rationalisation

- Logistic regression offers a fast, interpretable baseline that makes it easy to understand how each environmental feature contributes to predicting forest cover types. Its probabilistic outputs facilitate threshold tuning for optimizing recall and F1‑score on minority classes, aligning with the augmentation goals. Additionally, its efficiency and well‑studied regularization techniques make it robust on high‑dimensional, scaled data like this forest cover dataset.

In [None]:
import pandas as pd
from MachineLearningModels.MultiClassLogisticRegression import train_logistic_model, evaluate_model, compare_models
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Load CSV files for the Forest Cover Type dataset.
original_train = pd.read_csv("OutputTrainingSets/original_trainVAEForestFINAL.csv")
augmented_train = pd.read_csv("OutputTrainingSets/augmented_trainVAEForestFINAL.csv")
test_set = pd.read_csv("OutputTrainingSets/test_setVAEForestFINAL.csv")

# Define the feature columns and target.
features = numeric_features
#          OR
#features = continuous_features

categorical_features = ["Cover_Type"]

# Re-map the target column so that labels are contiguous (0, 1, 2, ...).
le = LabelEncoder()
original_train[target] = le.fit_transform(original_train[target])
augmented_train[target] = le.transform(augmented_train[target])
test_set[target] = le.transform(test_set[target])

# Train a Logistic Regression model on the original training dataset.
model_original = train_logistic_model(original_train, features, target)
metrics_original = evaluate_model(model_original, test_set, features, target)

# Train a Logistic Regression model on the augmented training dataset.
model_augmented = train_logistic_model(augmented_train, features, target)
metrics_augmented = evaluate_model(model_augmented, test_set, features, target)

unique_labels = np.sort(metrics_original['y_test'].unique())
# Now, inverse transform these labels using the same LabelEncoder 'le' used earlier
target_names = [str(x) for x in le.inverse_transform(unique_labels)]

# Compare the performance of the two models.
compare_models(metrics_original, metrics_augmented, target_names)

## 4 Model Performance Analysis

> **Note:** Results may vary with the installed scikit-learn and SciPy versions, and the sub-functions behind the multi-class helpers may have changed since the pipeline was developed.

This section compares the baseline model to the augmented-data model, highlighting key metrics, trade‑offs, and overall trends.

---

### 4.1 Overall Performance Metrics

- **Accuracy:**  
  - Baseline logistic regression accuracy = **0.720**  
  - Augmented-data accuracy = **0.736**

- **AUC (ROC):**  
  - Baseline AUC = **0.930**  
  - Augmented-data AUC = **0.937**

---

### 4.2 Precision–Recall Trade‑off

- **Minority Class (Cover_Type 4):**  
  - Precision: **0.43** → **0.38**  
  - Recall: **0.34** → **0.52**  
  - F1‑score: **0.38** → **0.44**

- **Majority Class (Cover_Type 2):**  
  - Precision: **0.73** → **0.74**  
  - Recall: **0.80** → **0.78**  
  - F1‑score: **0.76** → **0.76**

---

### 4.3 Confusion Matrix Insights

- **True Positives (TP) for Cover_Type 4:**  
  - Baseline TP = **231**, Augmented TP = **356**

- **False Positives (FP) & False Negatives (FN) for Cover_Type 4:**  
  - Baseline FP = **306**, Augmented FP = **589**  
  - Baseline FN = **455**, Augmented FN = **330**

_Interpretation:_  
> Augmentation substantially increased true positives and reduced false negatives for the minority class, boosting its recall, but at the cost of more false positives.
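The confusion-matrix counts above can be tied back to the precision, recall, and F1‑score values in Section 4.2 with a short calculation. The helper name `precision_recall_f1` is hypothetical; the counts are the Cover_Type 4 figures reported above.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Cover_Type 4 counts reported above.
baseline = precision_recall_f1(tp=231, fp=306, fn=455)
augmented = precision_recall_f1(tp=356, fp=589, fn=330)

print([round(v, 2) for v in baseline])   # [0.43, 0.34, 0.38]
print([round(v, 2) for v in augmented])  # [0.38, 0.52, 0.44]
```

Both runs have TP + FN = 686, as expected when the two models are evaluated on the same test set.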

---

### 4.4 Summary of Improvements & Drawbacks

#### Key Improvements
- Recall for **Cover_Type 4** improved by **18 percentage points** (0.34 → 0.52).  
- F1‑score for **Cover_Type 4** increased by **0.06** (0.38 → 0.44).  
- Overall accuracy rose by **1.6 percentage points** (0.720 → 0.736).  
- AUC improved from **0.930** to **0.937**, indicating better overall separability.

#### Trade‑offs
- Precision for **Cover_Type 4** dropped by **5 percentage points** (0.43 → 0.38).  
- False positives for **Cover_Type 4** increased by **283** cases.

#### Considerations & Next Steps
- Monitor for potential overfitting to synthetic patterns causing higher FP.  
- Explore threshold tuning or cost‑sensitive learning to balance precision and recall.  
- Assess real‑world impact of increased false positives on downstream ecological decisions.  
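As one way to explore the threshold tuning mentioned above, the sketch below sweeps a decision threshold over one-vs-rest scores for Cover_Type 4 and reports precision and recall at each setting. The labels, scores, and the function name `sweep_thresholds` are hypothetical; in practice the scores would come from the classifier's predicted probabilities for class 4 on a validation set.

```python
def sweep_thresholds(y_true, scores, thresholds):
    """Precision and recall of the rule `score >= threshold` for each threshold."""
    results = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < t)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        results.append((t, precision, recall))
    return results


# Hypothetical data: 1 = Cover_Type 4, 0 = any other cover type.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.2, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

results = sweep_thresholds(y_true, scores, thresholds=[0.3, 0.5, 0.7])
for t, precision, recall in results:
    print(f"threshold={t:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold trades recall for precision, so the operating point can be chosen to recover some of the precision lost to augmentation while keeping recall above the target.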

# Ethical/Privacy Analysis 

- Synthetic augmentation of forest cover data poses virtually no privacy risk since it contains no personal information. However, if synthetic samples misrepresent rare ecosystems, conservation priorities or fire‐management policies could be skewed. Validating synthetic data against real distributions and openly sharing methods helps prevent such errors. Clear documentation and expert review ensure that policy recommendations based on augmented data stay reliable and effective.

# Exporting Outputs of Framework Pipeline

A successful run of the framework pipeline produces the following materials: datasets, model cards, trained classification models, and the notebook file in which the computation was run. The rest of the document covers the deployment, monitoring, and documentation of the framework's outputs.

In [None]:
#CREATE FOLDER STRUCTURE FOR OUTPUTS

import os

# Define the main folder and a list of subfolder names
main_folder = "OutputMaterials"
subfolders = ["Datasets", "ModelCards", "TrainedModels"]

# Create the main folder if it doesn't already exist
os.makedirs(main_folder, exist_ok=True)

# Create each subfolder within the main folder
for subfolder in subfolders:
    subfolder_path = os.path.join(main_folder, subfolder)
    os.makedirs(subfolder_path, exist_ok=True)

print(f"Created main folder '{main_folder}' with subfolders: {', '.join(subfolders)}")

In [None]:
# SAVE CONFIG to JSON file
# At the end of your notebook, import the export_pipeline_config function.
from exportMULTIJSON import export_pipeline_config, compute_evaluation_metrics

# --- Live Pipeline Configuration Values ---

# Synthetic generation details (from your synthetic augmentation segment)
synthetic_method = "MultiVAE"

# Source files for the pipeline: augmentation module, notebook, and validation module
augmentation_filee = "MultiVAE2.py"
pipeline_name = "VAEForest.ipynb"
validation_filee = "MultiClassValidation.py"
data_file_names = ["original_trainVAEForestFINAL.csv", "augmented_trainVAEForestFINAL.csv", "test_setVAEForestFINAL.csv"]

evaluation_metrics = compute_evaluation_metrics(original_minority, synthetic_minority, continuous_features, categorical_features,
                                         distance_threshold=0.5, density_threshold=0.5, gamma=1.0, plot = False)
# Output JSON filename
output_json = "OutputMaterials/VAEForestPipelineConfig.json"

# --- Export the Pipeline Configuration ---
export_pipeline_config(
    dataset_name=dataset_name,
    features=features,
    train_test_ratio=test_size,
    randomState = random_state,
    synthetic_method=synthetic_method,
    augmentation_ratio=ratio_limit,
    augmentation_file = augmentation_filee,
    pipeline_name = pipeline_name,
    validation_file = validation_filee,
    evaluation_metrics = evaluation_metrics,
    data_file_name = data_file_names,
    output_json=output_json
)

In [None]:
#Save ML Models
import joblib

#Logistic Regression
joblib.dump(model_original, "OutputMaterials/TrainedModels/LR_model_original.pkl")
print("Original LR model saved as 'LR_model_original.pkl'")
joblib.dump(model_augmented, "OutputMaterials/TrainedModels/LR_model_augmented.pkl")
print("Augmented LR model saved as 'LR_model_augmented.pkl'")

In [None]:
# Save the datasets into the OutputMaterials/Datasets folder created above
original_train.to_csv("OutputMaterials/Datasets/original_trainVAEForestFINAL.csv", index=False)
augmented_train.to_csv("OutputMaterials/Datasets/augmented_trainVAEForestFINAL.csv", index=False)
test_set.to_csv("OutputMaterials/Datasets/test_setVAEForestFINAL.csv", index=False)

In [None]:
#Create and store model card
from MachineLearningModels.ModelCardMaker import create_model_card
import pandas as pd

model_name = "Original Logistic Regression for Synthetic Data Augmentation"
overview = "Name of relevant dataset is " + dataset_name + ", this ML model was trained to classify the target value of " + target
preproc_file = "DataPrepMultiClassv1.py"
train_set_name = "original_trainVAEForestFINAL.csv"
test_set_name = "test_setVAEForestFINAL.csv"
evaluation_metrics = metrics_original
intended_use = "Classify the target value of " + target + " as well as possible."
ethical_bias_concerns = "Works with data related to forest coverage which can potentially impact environmental policy."
output_filename = "OutputMaterials/ModelCards/LR_original_ModelCard.md"

create_model_card(model_name, overview, preproc_file, random_state,
                  test_size, features, target, train_set_name, test_set_name,
                  evaluation_metrics, intended_use, ethical_bias_concerns, output_filename)

In [None]:
#Create and store model card
from MachineLearningModels.ModelCardMaker import create_model_card
import pandas as pd

model_name = "Augmented Logistic Regression for Synthetic Data Augmentation"
overview = "Name of relevant dataset is " + dataset_name + ", this ML model was trained to classify the target value of " + target
preproc_file = "DataPrepMultiClassv1.py"
train_set_name = "augmented_trainVAEForestFINAL.csv"
test_set_name = "test_setVAEForestFINAL.csv"
evaluation_metrics = metrics_augmented
intended_use = "Classify the target value of " + target + " as well as possible."
ethical_bias_concerns = "Works with data related to forest coverage which can potentially impact environmental policy."
output_filename = "OutputMaterials/ModelCards/LR_augmented_ModelCard.md"

create_model_card(model_name, overview, preproc_file, random_state,
                  test_size, features, target, train_set_name, test_set_name,
                  evaluation_metrics, intended_use, ethical_bias_concerns, output_filename)

In [None]:
#Create README.txt file

readme_content = f"""
# Output Materials for Synthetic Data Generation Framework for the ForestType Dataset

This folder contains all the output artifacts from the synthetic data generation and evaluation pipeline. These materials are designed to be self-contained and reproducible, and they can be zipped and shared with others for further analysis or deployment.

## Contents

- **Trained Models:**  
  Trained machine learning models (Logistic Regression) saved as pickle files.
  
- **Configuration Files:**  
  JSON files detailing the pipeline configuration, including dataset information, preprocessing steps, synthetic data generation parameters, and evaluation metrics.  
  *Filename:* `VAEForestPipelineConfig.json`

- **Model Cards:**  
  Markdown files that document each model's details, including:
  - Overview and intended use
  - Dataset information (original vs. augmented)
  - Preprocessing details
  - Hyperparameters and training details
  - Evaluation metrics and performance results
  - Ethical and bias considerations

- **Evaluation Outputs:**  
  Files containing evaluation metrics.

## How to Use

1. **Review Configuration:**  
   Open the configuration JSON files to see the exact parameters and settings used during the pipeline execution.

2. **Examine Model Cards:**  
   Each model card provides a detailed description of the corresponding model. Use these documents to understand how the model was trained, evaluated, and any known limitations or ethical concerns.

3. **Load and Deploy Models:**  
   Trained models can be loaded using joblib (or pickle). For example:
   ```python
   import joblib
   model = joblib.load("model.pkl")
   ```
"""

with open("OutputMaterials/README.txt", "w") as file:
    file.write(readme_content)
    
print("Content saved to README.txt")

# 5. Deployment and Monitoring

After validating that the machine learning model trained on synthetically augmented data performs well on the test set, the next phase is deployment and continuous monitoring. For the **Forest Cover Type Dataset**, the goal was to improve classification of the **rare cover types (4 & 5) by boosting recall and F1‑score**. The deployment process must ensure that synthetic data augmentation does not introduce unintended distortions.

---

### Key Points

- **Model Integration:**  
  - Save the trained **Logistic Regression v1.0** (including z‑score normalization, one‑hot encoding, and 1st/99th‑percentile outlier trimming) as a self‑contained artifact.  
  - Document that synthetic augmentation was applied to enhance recall and F1‑score for cover types 4 and 5.

- **Documentation of Augmentation:**  
  - Record augmentation metadata:  
    - Percentage increase in minority classes 4 & 5: **100 %**  
    - Scaling factors for small classes: **2× (doubling)**  
    - Parameters/settings for each augmentation step (e.g. VAE latent_dim=10, epochs=100; SMOTE k_neighbors=5)  
  - Embed this metadata within the model artifact and reference it in the model card.

- **Monitoring for Data Drift:**  
  - Deploy on **AWS SageMaker** and track incoming data for shifts in feature distributions or class balance.  
  - Monitoring routine should:  
    - Track key metrics (recall, F1‑score) for **cover type 4**  
    - Monitor the rate of **cover type 4** predictions and flag deviations from the expected ~0.5 % prevalence  
    - Periodically run high‑value ecological scenarios through the pipeline to verify consistent performance  
    - Trigger alerts or automated retraining if recall for cover type 4 falls below **0.40**

- **Re‑Training and Version Control:**  
  - On detected drift or performance degradation, rerun the full pipeline—including synthetic data generation—with updated data.  
  - Replace older augmented datasets with new versions.  
  - Use version control (e.g., **GitHub**) and metadata logs to ensure each retraining iteration is reproducible and auditable.

- **A/B Testing & Resource Monitoring:**  
  - Implement an A/B test to compare **Augmented Logistic Regression v1.0** vs. **Baseline Logistic Regression v0.9** before full rollout.  
  - Continuously monitor system resources (latency < 200 ms per prediction, CPU < 70 %, memory < 2 GB) to ensure real‑time operational constraints are met.
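The monitoring routine for cover type 4 could look roughly like the sketch below, which is independent of any particular deployment platform. The expected prevalence (~0.5 %) and the recall floor (0.40) come from the key points above; the function name, the `rate_tolerance` setting, and the sample batch are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MonitoringReport:
    type4_rate: float
    recall: float
    alerts: list


def check_cover_type_4(predictions, labels, expected_rate=0.005,
                       rate_tolerance=0.005, recall_floor=0.40):
    """Flag drift in the Cover_Type 4 prediction rate and low recall.

    `rate_tolerance` is a hypothetical setting; the expected rate and the
    recall floor are the values stated in the monitoring plan.
    """
    type4_rate = sum(1 for p in predictions if p == 4) / len(predictions)
    true_type4 = [p for p, y in zip(predictions, labels) if y == 4]
    recall = (sum(1 for p in true_type4 if p == 4) / len(true_type4)
              if true_type4 else float("nan"))

    alerts = []
    if abs(type4_rate - expected_rate) > rate_tolerance:
        alerts.append("prediction-rate drift")
    if true_type4 and recall < recall_floor:
        alerts.append("recall below floor")
    return MonitoringReport(type4_rate, recall, alerts)


# Hypothetical labelled batch of 1,000 cells: 5 true Cover_Type 4 cells,
# of which the model finds only 1, plus 29 false alarms.
labels = [4] * 5 + [2] * 995
predictions = [4, 2, 2, 2, 2] + [4] * 29 + [2] * 966
report = check_cover_type_4(predictions, labels)
print(report)
```

Recall can only be checked on batches for which ground-truth labels eventually become available; the prediction-rate check works on unlabelled traffic.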

---

*This template can be customized to fit your organization’s deployment workflows and monitoring infrastructure.*  

# 6. Documentation & Ethics Review

Throughout the lifecycle of the model for predicting forest cover types—especially minority classes 4 (Cottonwood/Willow) and 5 (Aspen)—using the **Forest Cover Type Dataset**, detailed documentation and ethics reviews are integral to maintaining transparency, fairness, and regulatory compliance.

---

### Documentation

- **Parameter & Process Logs:**  
  - Record every step of data processing, model training, and synthetic augmentation, including:  
    - **Scaling & normalization:** z‑score normalization of continuous features  
    - **Encoding:** one‑hot encoding of wilderness areas and soil types  
    - **Outlier removal:** 1st/99th‑percentile trimming by class  
    - **Synthetic generation settings:** VAE (latent_dim=10, epochs=2), SMOTE (k_neighbors=5), augmentation ratio 2× for classes 4 & 5  

- **Model Cards & Technical Reports:**  
  - Produce a model card detailing:  
    - Intended use cases (e.g. habitat mapping, fire‐management planning)  
    - Performance metrics by cover type (precision, recall, F1)  
    - Identified limitations (e.g. boundary shifts in synthetic samples)  
    - Role and impact of synthetic augmentation on minority‐class recall  
  - Include a README summarizing setup, usage instructions, and dependencies.
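A parameter and process log of the kind described above can be kept as a small JSON record. The sketch below is one possible layout, not the format produced by `export_pipeline_config`; the function name and field names are hypothetical, and the setting values are the ones listed in the bullets above.

```python
import json
from datetime import datetime, timezone


def build_process_log(preprocessing, synthetic_generation):
    """Bundle preprocessing and augmentation settings into one log record."""
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "preprocessing": preprocessing,
        "synthetic_generation": synthetic_generation,
    }


record = build_process_log(
    preprocessing={
        "scaling": "z-score normalization of continuous features",
        "encoding": "one-hot (wilderness areas, soil types)",
        "outliers": "1st/99th-percentile trimming by class",
    },
    synthetic_generation={
        "vae": {"latent_dim": 10, "epochs": 2},
        "smote": {"k_neighbors": 5},
        "augmentation_ratio": 2,
        "target_classes": [4, 5],
    },
)

# Serialise to JSON; writing the string to a versioned file is left to the caller.
log_json = json.dumps(record, indent=2, sort_keys=True)
print(log_json)
```

Sorting the keys keeps diffs between successive log files small when they are stored under version control.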

---

### Ethics Review

- **Bias & Fairness Evaluation:**  
  - Identify “sensitive” features (geospatial proxies: elevation, hydrology/roadway distances) and assess whether augmentation skews habitat representation.  
  - Compute fairness metrics (distribution parity, equalized odds) across all cover types and flag any shifts in minority‐class prevalence.

- **Privacy Considerations:**  
  - Confirm dataset contains no PII/PHI; features are purely environmental.  
  - Document that no additional privacy techniques were needed beyond anonymized original data.

- **Transparency & Accountability:**  
  - Log all decisions, rationales, and bias/fairness assessment results.  
  - Document any mitigation steps (e.g. adjusting augmentation ratio) taken in response to ethical findings.
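One simple check for shifts in class prevalence is to compare each cover type's share of the training set before and after augmentation. The sketch below covers only this prevalence comparison, not equalized odds; the function names, the label lists, and the 0.04 flagging threshold are hypothetical.

```python
from collections import Counter


def class_prevalence(labels):
    """Share of each class in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def prevalence_shift(before, after):
    """Per-class change in prevalence (after minus before)."""
    p_before, p_after = class_prevalence(before), class_prevalence(after)
    classes = sorted(set(p_before) | set(p_after))
    return {c: p_after.get(c, 0.0) - p_before.get(c, 0.0) for c in classes}


# Hypothetical label lists: augmentation doubles classes 4 and 5.
original_labels = [1] * 40 + [2] * 50 + [4] * 5 + [5] * 5
augmented_labels = original_labels + [4] * 5 + [5] * 5

shift = prevalence_shift(original_labels, augmented_labels)
flagged = [c for c, delta in shift.items() if abs(delta) > 0.04]
print(shift, flagged)
```

Because prevalences sum to one, any gain for the augmented classes appears as a loss spread across the remaining classes, which is why a majority class can be flagged as well.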

---

### Regulatory & Audit Readiness

- Compile a comprehensive package covering:  
  - Data preprocessing workflows  
  - Synthetic data generation process  
  - Model training and validation results  
  - Deployment, monitoring, and drift‐detection procedures  
- Ensure all artifacts, logs, and model cards are versioned (e.g. GitHub, MLflow) and accessible for audit.

---

*This template can be adapted to fit your organization’s documentation standards and ethics review processes.*  


# Conclusion

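- The augmentation process improved minority-class detection: for Cover_Type 4, recall rose from 0.34 to 0.52 and F1‑score from 0.38 to 0.44, at the cost of lower precision (0.43 → 0.38), while overall accuracy rose from 0.720 to 0.736.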

In [None]:
#EXPORT PIPELINE

import os
import shutil
#Copy this file over to OutputMaterials folder.

#SAVE FILE FIRST
source_file = "VAEForest.ipynb"  # Replace with this notebook's filename.
destination_folder = "OutputMaterials/"

destination_file = os.path.join(destination_folder, os.path.basename(source_file))
shutil.copy(source_file, destination_file)

print(f"Notebook copied from {source_file} to {destination_file}")