# Updated Realistic Synthetic Waveguide Dataset
This repository contains a generator for a realistic synthetic dataset for optical waveguide characterization. It starts from a physics-inspired core model, injects controlled noise, and adjusts the outputs to match real experimental statistics. The final output is:

- **50,000 samples**  
- **15 input features** (geometrical, material, and physical parameters)  
- **14 output targets** (losses, mode properties, effective index, polarization, etc.)  

Use this dataset to train data-driven models that predict waveguide performance from basic design parameters, with outputs that have been corrected toward experimental statistics.

## 📂 Input Parameters (15 Features)

| Name | Description | Units / Format |
|---|---|---|
| `core_index` | Complex refractive index of the core, as string `n_real+n_imagj` (e.g. `1.488000-3.200000e-08j`). | — |
| `clad_index` | Complex refractive index of the cladding, same format. | — |
| `core_radius_m` | Core radius $a$. | meters (0.5–10 µm) |
| `clad_radius_m` | Cladding radius $b$. | meters (20–50 µm) |
| `length_m` | Waveguide length $L$. | meters (1 mm–0.5 m) |
| `wavelength_m` | Operating wavelength $\lambda$. | meters (500 nm–1600 nm) |
| `polarization` | Input polarization (0 = TE, 1 = TM). | unitless (0–1) |
| `alpha_core` | Core intrinsic loss $\alpha_{\rm core}$. | m⁻¹ (1e-4–1e-3) |
| `alpha_clad` | Cladding intrinsic loss $\alpha_{\rm clad}$. | m⁻¹ (1e-4–1e-3) |
| `photoelastic_coeff` | Photoelastic coefficient $p$. | unitless (0.20–0.25) |
| `delta_rho_over_rho` | Density variation ratio $\Delta\rho/\rho$. | unitless (1e-12–1e-11) |
| `sigma_rms_m` | RMS surface roughness $\sigma$. | meters (1–10 nm) |
| `roughness_corr_length_m` | Roughness correlation length $L_{\rm corr}$. | meters (100 nm–1 µm) |
| `w_in_m` | Input beam waist $w_{\rm in}$. | meters (1–5 µm) |
| `input_power` | Input optical power $P_{\rm in}$. | Watts (1–10 mW) |
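
The `core_index` and `clad_index` strings follow Python's complex-literal syntax, so the built-in `complex()` should parse them directly. A minimal sketch follows; the helper name `parse_index` is hypothetical and not part of the generator.

```python
def parse_index(text: str) -> complex:
    """Parse a refractive-index string such as '1.488000-3.200000e-08j'."""
    # Strip stray whitespace that may surround the field in a CSV file.
    return complex(text.strip())


n_core = parse_index("1.488000-3.200000e-08j")
print(n_core.real)  # real part, used for V and n_eff
print(n_core.imag)  # imaginary part (material absorption)
```

Only the real parts enter the normalized-frequency and effective-index equations in this document.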

## 🌟 Output Targets (14 Features)

| Name | Description |
|---|---|
| `propagation_loss_dB` | Propagation loss (dB) |
| `insertion_loss_dB` | Insertion (coupling) loss (dB) |
| `coupling_loss_dB` | Same as insertion loss |
| `mode_field_diameter_m` | Mode field diameter $2w$ |
| `mode_confinement_factor` | Fraction of power confined in the core $\Gamma$ |
| `single_mode` | `Y` if single-mode ($V<2.405$), else `N` |
| `multi_mode` | Complement of `single_mode` |
| `scattering_loss_dB` | Scattering loss (dB) |
| `effective_index` | Effective refractive index $n_{\rm eff}$ |
| `cross_coupling` | Cross-coupling metric |
| `TE_percent`, `TM_percent` | Mode polarization percentages |
| `V_parameter` | Normalized frequency $V$ |
| `output_power` | Output power $P_{\rm out}$ |
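
To illustrate how the two mode flags relate to `V_parameter`, the sketch below derives both from $V$. It assumes `multi_mode` uses the same `Y`/`N` encoding as `single_mode`, which the table does not state. `mode_flags` is a hypothetical helper name.

```python
SINGLE_MODE_CUTOFF = 2.405  # V below this value is treated as single-mode


def mode_flags(v_parameter):
    """Return (single_mode, multi_mode) as 'Y'/'N' strings."""
    single = v_parameter < SINGLE_MODE_CUTOFF
    # Assumption: multi_mode is encoded with the same Y/N convention.
    return ("Y", "N") if single else ("N", "Y")


print(mode_flags(1.95))
print(mode_flags(3.10))
```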

## 🧮 Key Equations

**Normalized Frequency**  
$$V = \frac{2\pi\,a}{\lambda}\sqrt{n_{\mathrm{core,real}}^2 - n_{\mathrm{clad,real}}^2}$$
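
As a quick check of the formula, here is a minimal Python sketch. The function name and the numerical values are illustrative assumptions, not taken from the generator script.

```python
import math


def v_parameter(core_radius_m, wavelength_m, n_core_real, n_clad_real):
    """Normalized frequency V = (2*pi*a / lambda) * sqrt(n_core^2 - n_clad^2)."""
    numerical_aperture = math.sqrt(n_core_real**2 - n_clad_real**2)
    return 2.0 * math.pi * core_radius_m / wavelength_m * numerical_aperture


# Illustrative values (assumed): 4 um core radius at 1550 nm.
V = v_parameter(core_radius_m=4.0e-6, wavelength_m=1.55e-6,
                n_core_real=1.450, n_clad_real=1.445)
print(f"V = {V:.3f}, single-mode: {V < 2.405}")
```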

**Mode Field Diameter**  
$$w = a\Bigl(0.65 + 1.619\,V^{-1.5} + 2.879\,V^{-6}\Bigr)\quad\mathrm{MFD}=2w$$
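
The expression has the form of Marcuse's empirical fit for the Gaussian mode-field radius. A short sketch of the calculation follows; the function name and the sample values are assumptions for illustration.

```python
def mode_field_radius(core_radius_m, V):
    """Mode-field radius w = a * (0.65 + 1.619 * V^-1.5 + 2.879 * V^-6)."""
    return core_radius_m * (0.65 + 1.619 * V**-1.5 + 2.879 * V**-6)


a = 4.0e-6   # core radius in meters (illustrative)
V = 2.0      # normalized frequency (illustrative)
w = mode_field_radius(a, V)
mfd = 2.0 * w  # mode field diameter
print(f"w/a = {w / a:.4f}, MFD = {mfd * 1e6:.2f} um")
```

Because of the $V^{-6}$ term, $w$ grows quickly as $V$ drops toward small values, so weakly guiding samples have much larger mode fields than the core itself.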

**Mode Confinement Factor**  
$$
u = \begin{cases}
0.9\,V, & V < 2.405,\\
V - 0.5, & V \ge 2.405,
\end{cases}
\quad \Gamma = \frac{u^2}{V^2}
$$

**Intrinsic Attenuation**  
$$\alpha_{\rm eff} = \alpha_{\rm core}\,\Gamma + \alpha_{\rm clad}\,(1-\Gamma)$$
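
The two definitions above can be sketched as follows. The function names are hypothetical; the piecewise rule for $u$ is copied from the confinement-factor equation.

```python
def confinement_factor(V):
    """Gamma = u^2 / V^2 with u = 0.9*V below cutoff and V - 0.5 otherwise."""
    u = 0.9 * V if V < 2.405 else V - 0.5
    return (u / V) ** 2


def effective_attenuation(alpha_core, alpha_clad, gamma):
    """Confinement-weighted intrinsic loss in 1/m."""
    return alpha_core * gamma + alpha_clad * (1.0 - gamma)


gamma = confinement_factor(2.0)
alpha_eff = effective_attenuation(alpha_core=1e-4, alpha_clad=1e-3, gamma=gamma)
print(gamma, alpha_eff)
```

One consequence of the piecewise rule: for every single-mode sample ($V<2.405$) the model gives $\Gamma = 0.81$ before noise is applied, so variation in $\Gamma$ among single-mode rows comes from the noise and correction steps rather than from the physics core.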

**Scattering Loss Coefficients**  
$$
\alpha_{\rm scatt,bulk}
= \frac{8\pi^3}{3\,\lambda^4}\,p^2\Bigl(\frac{\Delta\rho}{\rho}\Bigr)^2\,\Gamma,
\quad
\alpha_{\rm scatt,surf}
= \frac{4\pi^3}{\lambda^2}\,\sigma_{\rm rms}^2\,L_{\rm corr}
$$
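
The sketch below transcribes the two expressions exactly as written above, with no additional scaling or unit conversion assumed. Function names and sample values are illustrative.

```python
import math


def bulk_scattering(wavelength_m, photoelastic_coeff, delta_rho_over_rho, gamma):
    """Bulk (Rayleigh-style) term: 8*pi^3 / (3*lambda^4) * p^2 * (drho/rho)^2 * Gamma."""
    prefactor = 8.0 * math.pi**3 / (3.0 * wavelength_m**4)
    return prefactor * photoelastic_coeff**2 * delta_rho_over_rho**2 * gamma


def surface_scattering(wavelength_m, sigma_rms_m, roughness_corr_length_m):
    """Surface term: 4*pi^3 / lambda^2 * sigma_rms^2 * L_corr."""
    return 4.0 * math.pi**3 / wavelength_m**2 * sigma_rms_m**2 * roughness_corr_length_m


# Mid-range inputs from the parameter table (illustrative choice).
alpha_bulk = bulk_scattering(1.55e-6, 0.22, 5e-12, 0.81)
alpha_surf = surface_scattering(1.55e-6, 5e-9, 5e-7)
print(alpha_bulk, alpha_surf)
```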

**Total Attenuation**  
$$\alpha_{\rm total} = \alpha_{\rm eff} + \alpha_{\rm scatt,bulk} + \alpha_{\rm scatt,surf}$$

**Output Power**  
$$P_{\rm out} = P_{\rm in}\,\exp\bigl(-\alpha_{\rm total}L\bigr)$$

**Propagation Loss (dB)**  
$$\mathcal{L}_{\rm prop} = 10\,\log_{10}\!\Bigl(\frac{P_{\rm in}}{P_{\rm out}}\Bigr)$$
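
Combining the last two equations gives $\mathcal{L}_{\rm prop} = 10\log_{10}(e)\,\alpha_{\rm total}L \approx 4.343\,\alpha_{\rm total}L$, which the following sketch checks numerically. The function names and input values are assumptions for illustration.

```python
import math


def output_power(input_power_w, alpha_total, length_m):
    """Beer-Lambert decay: P_out = P_in * exp(-alpha_total * L)."""
    return input_power_w * math.exp(-alpha_total * length_m)


def propagation_loss_db(input_power_w, output_power_w):
    """Propagation loss in dB: 10 * log10(P_in / P_out)."""
    return 10.0 * math.log10(input_power_w / output_power_w)


alpha_total = 2.0  # 1/m (illustrative)
length_m = 0.1     # 10 cm
p_in = 5e-3        # 5 mW
p_out = output_power(p_in, alpha_total, length_m)
loss_db = propagation_loss_db(p_in, p_out)
shortcut_db = 10.0 * math.log10(math.e) * alpha_total * length_m
print(loss_db, shortcut_db)
```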

**Gaussian Overlap**  
$$
T_{\rm nom}
= \frac{2\,w_{\rm in}\,w}{w_{\rm in}^2 + w^2}
  \exp\!\Bigl(-\frac{\Delta x^2}{w_{\rm in}^2 + w^2}\Bigr),
\quad
\Delta x \sim \mathcal{U}(0,2w)
$$

**Insertion/Coupling Loss**  
$$\mathrm{IL}_{\rm dB} = -20\,\log_{10}(T_{\rm nom}),\quad \mathrm{CL}_{\rm dB} = \mathrm{IL}_{\rm dB}$$
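
A minimal sketch of the overlap and loss calculation, using the standard-library `random` module for the misalignment draw. The function names, the seed, and the beam sizes are illustrative assumptions.

```python
import math
import random


def gaussian_overlap(w_in, w, delta_x):
    """Nominal overlap T_nom for two Gaussian beams with lateral offset delta_x."""
    denom = w_in**2 + w**2
    return 2.0 * w_in * w / denom * math.exp(-delta_x**2 / denom)


def insertion_loss_db(t_nom):
    """Insertion loss in dB, IL = -20 * log10(T_nom)."""
    return -20.0 * math.log10(t_nom)


w_in, w = 3.0e-6, 5.0e-6
il_aligned = insertion_loss_db(gaussian_overlap(w_in, w, 0.0))

rng = random.Random(0)                  # fixed seed for reproducibility
delta_x = rng.uniform(0.0, 2.0 * w)     # misalignment drawn from U(0, 2w)
il_offset = insertion_loss_db(gaussian_overlap(w_in, w, delta_x))
print(il_aligned, il_offset)
```

With matched beams and no offset the overlap is exactly 1 and the loss is 0 dB; any mismatch or offset can only increase the loss.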

**Effective Index**  
$$n_{\rm eff} = \sqrt{n_{\rm clad,real}^2 + \frac{u^2}{V^2}(n_{\rm core,real}^2 - n_{\rm clad,real}^2)}$$
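
Since $u^2/V^2 = \Gamma$, the effective index interpolates between the cladding and core indices in the squared domain. A short sketch follows; the function name and sample indices are assumptions.

```python
import math


def effective_index(n_core_real, n_clad_real, V):
    """n_eff = sqrt(n_clad^2 + (u^2 / V^2) * (n_core^2 - n_clad^2))."""
    u = 0.9 * V if V < 2.405 else V - 0.5
    gamma = (u / V) ** 2
    return math.sqrt(n_clad_real**2 + gamma * (n_core_real**2 - n_clad_real**2))


n_eff = effective_index(n_core_real=1.450, n_clad_real=1.445, V=2.0)
print(f"n_eff = {n_eff:.6f}")
```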

**Cross-Coupling**  
$$
\mathrm{CrossCoupling} =
\begin{cases}
0, & V<2.405,\\
\frac{1}{2}\,\frac{V-2.405}{V}, & V\ge2.405
\end{cases}
$$

**Polarization Percentages**  
With $p_{\rm pol}$ denoting the `polarization` input, for $V<2.405$:
$$
\mathrm{TE}\% = (1 - p_{\rm pol})\times100,\quad
\mathrm{TM}\% = p_{\rm pol}\times100
$$
For $V\ge2.405$, with $C$ denoting the cross-coupling value defined above:
$$
\mathrm{TE}\% = \bigl((1-p_{\rm pol})(1-C) + 0.5\,C\bigr)\times100,\quad
\mathrm{TM}\% = \bigl(p_{\rm pol}(1-C) + 0.5\,C\bigr)\times100
$$
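
The cross-coupling and polarization rules can be sketched together. This assumes that $C$ in the multi-mode formula is the cross-coupling metric and that $p_{\rm pol}$ is the `polarization` input; the function names are hypothetical.

```python
def cross_coupling(V):
    """Zero below cutoff, 0.5 * (V - 2.405) / V at or above it."""
    return 0.0 if V < 2.405 else 0.5 * (V - 2.405) / V


def polarization_percentages(p_pol, V):
    """Return (TE%, TM%). With C = 0 this reduces to the single-mode formula."""
    c = cross_coupling(V)  # assumption: C is the cross-coupling metric
    te = ((1.0 - p_pol) * (1.0 - c) + 0.5 * c) * 100.0
    tm = (p_pol * (1.0 - c) + 0.5 * c) * 100.0
    return te, tm


print(polarization_percentages(0.0, 2.0))   # single-mode, pure TE input
print(polarization_percentages(0.0, 4.81))  # multi-mode, pure TE input
```

In both regimes the two percentages sum to 100, and cross-coupling pulls the split toward 50/50 as $V$ grows.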

**Noise Injection**  
$$\eta \sim \mathcal{N}\!\left(0,\left(\frac{\text{noise}\%}{100}\right)^2\right),\quad x' = x\,(1 + \eta)$$
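
A minimal sketch of the multiplicative noise model using only the standard library. The function name and the seed are assumptions; the generator may use a different random source.

```python
import random
import statistics


def inject_noise(value, noise_percent, rng):
    """Return value * (1 + eta) with eta ~ N(0, (noise_percent / 100)^2)."""
    eta = rng.gauss(0.0, noise_percent / 100.0)
    return value * (1.0 + eta)


rng = random.Random(42)  # fixed seed for reproducibility
noisy = [inject_noise(1.0, 5.0, rng) for _ in range(10_000)]
print(statistics.fmean(noisy), statistics.stdev(noisy))
```

Because the noise is relative, a value of exactly zero stays zero, and the absolute spread scales with the magnitude of each output.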

**Clamping / Experimental Correction**  
$$
\mathcal{L}_{\rm prop} \ge 0.1,\;
\mathrm{IL}_{\rm dB},\;\mathrm{CL}_{\rm dB},\;\mathrm{scattering\_loss\_dB}\ge0,\;
P_{\rm out} \ge 10^{-20}\,\mathrm{W}
$$
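
The lower bounds above could be applied per row as in the following sketch. The dictionary-based row layout and the name `clamp_outputs` are assumptions; the column names are those of the output table.

```python
def clamp_outputs(row):
    """Return a copy of `row` with the documented lower bounds applied."""
    floors = {
        "propagation_loss_dB": 0.1,
        "insertion_loss_dB": 0.0,
        "coupling_loss_dB": 0.0,
        "scattering_loss_dB": 0.0,
        "output_power": 1e-20,
    }
    clamped = dict(row)
    for name, floor in floors.items():
        if name in clamped:
            clamped[name] = max(clamped[name], floor)
    return clamped


sample = {"propagation_loss_dB": 0.02, "insertion_loss_dB": -0.3,
          "output_power": 0.0, "effective_index": 1.449}
print(clamp_outputs(sample))
```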

## ⚙️ Data Generation Procedure

1. **Load experimental data**  
   Read propagation losses and MFDs from the literature; compute mean & std.  
2. **Sample inputs & compute physics**  
   Uniformly sample inputs; compute parameters $V$, $w$, $\Gamma$, loss coefficients, $n_{\rm eff}$, polarization, etc.  
3. **Inject noise**  
   Apply 5% multiplicative Gaussian noise (standard deviation of 0.05 relative to each value) to each computed output.  
4. **Experimental correction**  
   Adjust loss & MFD distributions to match experimental statistics.  
5. **Persist to CSV**  
   Batch-write 1,000 samples at a time into `final_realistic_synthetic_dataset.csv`.
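
Steps 4 and 5 can be sketched as follows. The exact correction used by the generator is not described here, so `match_moments` shows one simple possibility (shifting and rescaling to a target mean and standard deviation) and should be read as an assumption. `write_in_batches` is likewise a hypothetical helper built on the standard `csv` module.

```python
import csv
import statistics


def match_moments(values, target_mean, target_std):
    """Shift and rescale `values` so their sample mean/std equal the targets."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [target_mean + (v - mean) * target_std / std for v in values]


def write_in_batches(rows, path, fieldnames, batch_size=1000):
    """Write dict rows to CSV, handing `batch_size` rows to the writer at a time."""
    written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) == batch_size:
                writer.writerows(batch)
                written += len(batch)
                batch = []
        if batch:  # final partial batch
            writer.writerows(batch)
            written += len(batch)
    return written


adjusted = match_moments([1.0, 2.0, 3.0, 4.0], target_mean=10.0, target_std=2.0)
print(adjusted)
```

Batching keeps memory use bounded when `rows` is a generator, since at most `batch_size` rows are held at once.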

```python
# Usage example (run in a Jupyter/IPython notebook cell)
!git clone https://github.com/yourusername/waveguide-dataset.git
%cd waveguide-dataset
!python generate_waveguide_dataset.py
# Output: final_realistic_synthetic_dataset.csv
```

## 🔧 Parameter Definitions in Equations

| Parameter                | Meaning                                                   | Units          | Relevance                                                                 |
|--------------------------|-----------------------------------------------------------|----------------|---------------------------------------------------------------------------|
| **a**                    | Waveguide core radius                                     | m              | Affects normalized frequency (V) and mode field size                     |
| **λ**                    | Operating wavelength                                      | m              | Fundamental to frequency parameter V and dispersion                       |
| **n<sub>core,real</sub>**   | Real part of core refractive index                        | unitless       | Determines index contrast, guiding properties                             |
| **n<sub>clad,real</sub>**   | Real part of cladding refractive index                    | unitless       | Along with n<sub>core</sub>, sets normalized frequency and confinement    |
| **V**                    | Normalized frequency                                      | unitless       | Governs single- vs. multi-mode operation                                  |
| **w**                    | Gaussian mode-field radius                                | m              | Intermediate for calculating MFD                                           |
| **MFD**                  | Mode field diameter                                       | m              | Defines beam size for coupling and loss calculations                      |
| **u**                    | Eigenvalue parameter for mode                                | unitless       | Used in confinement and effective index computations                      |
| **Γ**                    | Mode confinement factor                                   | unitless       | Fraction of power confined in core                                        |
| **α<sub>core</sub>**        | Core intrinsic loss coefficient                            | m<sup>-1</sup>   | Base loss contribution from core material                                  |
| **α<sub>clad</sub>**        | Cladding intrinsic loss coefficient                        | m<sup>-1</sup>   | Base loss contribution from cladding material                              |
| **α<sub>scatt,bulk</sub>**  | Bulk scattering coefficient                                | m<sup>-1</sup>   | Loss from volume scattering (Rayleigh-style)                               |
| **α<sub>scatt,surf</sub>**  | Surface scattering coefficient                             | m<sup>-1</sup>   | Loss from roughness at core-cladding interface                             |
| **α<sub>eff</sub>**         | Effective attenuation                                     | m<sup>-1</sup>   | Combines core/clad intrinsic losses weighted by confinement               |
| **α<sub>total</sub>**       | Total attenuation                                         | m<sup>-1</sup>   | Sum of effective and scattering losses                                     |
| **P<sub>in</sub>**          | Input optical power                                       | W              | Starting power for loss calculations                                       |
| **P<sub>out</sub>**         | Output optical power                                      | W              | Power after propagation for loss calculation                               |
| **L**                    | Waveguide length                                          | m              | Distance over which losses accumulate                                       |
| **L<sub>corr</sub>**        | Roughness correlation length                              | m              | Characteristic scale of surface roughness for scattering                   |
| **p**                    | Photoelastic coefficient                                   | unitless       | Relates density fluctuations to scattering                                 |
| **Δρ/ρ**                 | Density variation ratio                                    | unitless       | Input for bulk scattering through density fluctuations                     |
| **σ<sub>rms</sub>**         | RMS surface roughness                                     | m              | Magnitude of surface roughness contributing to surface scattering          |
| **w<sub>in</sub>**         | Input beam waist                                          | m              | Coupling overlap between input beam and guided mode                        |
| **Δx**                   | Lateral misalignment                                       | m              | Offset in coupling overlap calculation                                      |
| **T<sub>nom</sub>**         | Nominal overlap integral                                   | unitless       | Gaussian overlap for coupling/insertion loss                                |
| **η**                    | Noise factor (Gaussian)                                    | unitless       | Random variation applied for experimental realism                          |
| **x′**                   | Noisy output value                                         | same as x      | Final value after noise injection                                           |


## 📊 Experimental Data Reference
Below is the experimental data reference table loaded directly from the source Excel file:

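```python
import pandas as pd

# Load the literature-review spreadsheet (reading .xlsx files requires the openpyxl package).
df = pd.read_excel('Lit_review_Experimental_Data_Details.xlsx')
df  # display the table in the notebook
```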
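
Once the table is loaded, the mean and standard deviation used in step 1 of the generation procedure could be computed as sketched below. The spreadsheet's column names are not listed in this document, so the demo frame and its columns are placeholders; `summarize_numeric` is a hypothetical helper.

```python
import pandas as pd


def summarize_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the mean and sample std of every numeric column."""
    return frame.select_dtypes(include="number").agg(["mean", "std"])


# Placeholder data standing in for the experimental table (column names assumed).
demo = pd.DataFrame({
    "loss_value": [0.5, 1.0, 1.5],
    "source": ["ref A", "ref B", "ref C"],
})
stats = summarize_numeric(demo)
print(stats)
```

Selecting columns by dtype avoids hard-coding column names, which is convenient when the spreadsheet layout is not fixed.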