# Objective

In this lab session, students will learn to generate time series data using DoppelGANger and TimeGAN, understand how GANs are applied to time series generation, and compare the performance of the two models. Students will implement the code in Python within a Google Colab environment.

**Task 1: Environment Setup and Data Preparation**

*Step 1: Install Dependencies*

Install the required libraries in Colab, including `tensorflow`, `numpy`, `pandas`, `matplotlib`, and the dependencies for DoppelGANger.

In [None]:
# Please enter your code here
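
Before installing anything, it can help to see which of the listed packages the Colab runtime already provides. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and any package it reports could then be installed in Colab with `!pip install <name>`.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be found in this runtime."""
    # find_spec looks a module up without importing it, so this stays fast
    # even for heavy packages such as tensorflow.
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ["tensorflow", "numpy", "pandas", "matplotlib"]
print("Missing packages:", missing_packages(required))
```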

*Step 2: Prepare Dataset*

Download the UCI Air Quality Dataset (https://archive.ics.uci.edu/ml/datasets/Air+Quality), which contains multivariate time series data (e.g., CO, NOx readings).

In [None]:
# Please enter your code here

*Step 3: Data Preprocessing*

*   Clean Data: Replace missing values (marked as -200) with NaN, remove invalid records (e.g., rows in which every selected feature is NaN), and fill the remaining gaps using linear interpolation.
*   Select Key Features: Choose key features for the multivariate time series, e.g., CO(GT), NOx(GT), and temperature (T), to reduce complexity and focus on meaningful series.
*   Normalize: Scale the selected features to [-1, 1].
*   Split Sequences: Segment the data into fixed-length time series windows (e.g., 24 steps) for model input.

In [None]:
# Please enter your code here
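
The preprocessing steps can be tried on a tiny hand-made table before running them on the full dataset. The following is a minimal sketch under stated assumptions: the `toy` frame and the `preprocess` helper are hypothetical, scaling is done by hand with NumPy rather than with a library scaler, and windows overlap with a stride of 1, which is one possible choice.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the air quality table; -200 marks a missing reading.
toy = pd.DataFrame({
    "CO(GT)":  [2.0, -200.0, 4.0, -200.0, 3.0, 5.0],
    "NOx(GT)": [100.0, 120.0, -200.0, -200.0, 160.0, 180.0],
    "T":       [10.0, 11.0, 12.0, -200.0, 14.0, 15.0],
})

def preprocess(frame, features, seq_len):
    """Clean, scale to [-1, 1], and window a multivariate series."""
    data = frame[features].replace(-200, np.nan)   # mark missing values
    data = data.dropna(how="all")                  # drop rows with no valid reading
    data = data.interpolate(method="linear", limit_direction="both")
    values = data.to_numpy(dtype=float)

    # Per-feature min-max scaling to [-1, 1].
    # Assumes no feature is constant (otherwise hi - lo would be zero).
    lo, hi = values.min(axis=0), values.max(axis=0)
    scaled = 2.0 * (values - lo) / (hi - lo) - 1.0

    # Overlapping windows with stride 1: (num_windows, seq_len, num_features).
    n_windows = len(scaled) - seq_len + 1
    return np.stack([scaled[i:i + seq_len] for i in range(n_windows)])

windows = preprocess(toy, ["CO(GT)", "NOx(GT)", "T"], seq_len=3)
print(windows.shape)  # (3, 3, 3)
```

Dropping the all-NaN row before interpolating keeps a fully invalid record from being replaced with invented values. Whether overlapping or non-overlapping windows are preferable depends on how much training data is needed.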

Some hints for reference:

In [None]:
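# Task 1: Data Preparation
import urllib.request
import zipfile

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
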
def download_and_prepare_air_quality_data(seq_len=24, selected_features=['CO(GT)', 'NOx(GT)', 'T']):
    # Step 1: Download UCI Air Quality Dataset
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00360/AirQualityUCI.zip'
    urllib.request.urlretrieve(url, 'AirQualityUCI.zip')

    # Extract and load data
    with zipfile.ZipFile('AirQualityUCI.zip', 'r') as zip_ref:
        zip_ref.extractall('air_quality')

    data = pd.read_csv('air_quality/AirQualityUCI.csv', sep=';', decimal=',')

    # Step 2: Clean Data
    # Replace missing values (-200) with NaN

    # Remove rows where all selected features are NaN

    # Fill remaining NaN with linear interpolation


    # Step 3: Select Key Features


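    # Step 4: Normalize Data
    # (assumes `data` now contains only the selected numeric columns from Step 3)
    scaler = MinMaxScaler(feature_range=(-1, 1))
    data_normalized = scaler.fit_transform(data)

    # Step 5: Split Sequences
    # Build windows of shape (num_windows, seq_len, num_features) and return them
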

**Task 2: Implement and Train DoppelGANger**

Load DoppelGANger Model:

*   Clone the official DoppelGANger GitHub repository (https://github.com/fjxmlzn/DoppelGANger) in your environment.
*   Install required dependencies (e.g., torch, ganite); check the repository's README for the exact packages and versions.
*   Import DoppelGANger’s core modules and understand its architecture (feature generator, time series generator, and discriminator).

1. Set Hyperparameters: Configure hyperparameters, e.g., time series length (24), feature dimensions (number of selected features), batch size, and epochs.

2. Train Model: Train DoppelGANger using the preprocessed air quality dataset, monitoring the training losses.

3. Generate Data: Generate synthetic time series data with the trained DoppelGANger and save the results.

In [None]:
# Please enter your code here
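
One way to keep the hyperparameters from step 1 together is a small configuration object. This is a sketch only: `TrainConfig` and its field names are hypothetical and do not come from the DoppelGANger repository, and the batch size and epoch count are placeholder values rather than recommendations.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Hypothetical container for the hyperparameters named in step 1."""
    seq_len: int = 24        # time steps per window
    num_features: int = 3    # e.g. CO(GT), NOx(GT), T
    batch_size: int = 64     # placeholder value
    epochs: int = 100        # placeholder value

    def steps_per_epoch(self, num_windows: int) -> int:
        """Batches needed to see every window once (the last batch may be partial)."""
        return -(-num_windows // self.batch_size)  # ceiling division

config = TrainConfig()
print(config)
print("Steps per epoch for 9000 windows:", config.steps_per_epoch(9000))
```

Keeping these values in one place makes it easier to reuse identical settings when a second model is trained for comparison.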

**Task 3: Evaluate DoppelGANger Generated Data**

1) Visualize Generated Data: Plot original and DoppelGANger-generated data, comparing feature trends.

2) Quantitative Evaluation:

    1. Autocorrelation Consistency: Compute the autocorrelation function (ACF) for real and synthetic data, comparing similarity across lags using mean squared error (MSE).
    2. Dynamic Time Warping (DTW) Distance: Calculate the DTW distance between real and synthetic sequences to assess shape similarity.
    3. Periodicity Consistency: Use the Fast Fourier Transform (FFT) to compute the power spectral density (PSD) of real and synthetic data, comparing periodic patterns via KL divergence or cosine similarity.
    4. Statistical Metrics: Compute the mean and variance of generated vs. original data to evaluate basic statistical properties.

3) Summarize Evaluation: Discuss the quality of DoppelGANger’s generated data, focusing on time-series characteristics (autocorrelation, periodicity, shape similarity).
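
To make the autocorrelation and periodicity metrics concrete, here is a NumPy-only sketch on toy data. The helpers `mean_acf` and `psd_cosine` are hypothetical names, and the "real" and "synthetic" arrays are made up for illustration (a noisy 12-step cycle versus pure noise), so the numbers say nothing about any trained model.

```python
import numpy as np

def mean_acf(windows, n_lags):
    """Average sample autocorrelation (lags 0..n_lags) over windows of shape (N, T)."""
    centered = windows - windows.mean(axis=1, keepdims=True)
    denom = (centered ** 2).sum(axis=1)
    length = centered.shape[1]
    acfs = [
        (centered[:, : length - lag] * centered[:, lag:]).sum(axis=1) / denom
        for lag in range(n_lags + 1)
    ]
    return np.mean(np.stack(acfs, axis=1), axis=0)

def psd_cosine(real, synth):
    """Cosine similarity between the average power spectra of two (N, T) window sets."""
    real_psd = (np.abs(np.fft.rfft(real, axis=1)) ** 2).mean(axis=0)
    synth_psd = (np.abs(np.fft.rfft(synth, axis=1)) ** 2).mean(axis=0)
    norm = np.linalg.norm(real_psd) * np.linalg.norm(synth_psd)
    return float(real_psd @ synth_psd / norm)

rng = np.random.default_rng(0)
t = np.arange(24)
# Toy "real" data: a 12-step cycle plus noise. Toy "synthetic" data: pure noise.
real = np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal((50, 24))
synth = rng.standard_normal((50, 24))

acf_mse = np.mean((mean_acf(real, 10) - mean_acf(synth, 10)) ** 2)
print(f"ACF MSE: {acf_mse:.4f}, PSD cosine: {psd_cosine(real, synth):.4f}")
```

A lower ACF MSE and a PSD cosine similarity closer to 1 both indicate that the synthetic windows reproduce the temporal structure of the real ones more closely.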

*Note: the evaluation code below is for reference only; you do not have to follow it.*

In [None]:
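# Task 3: Evaluation
import matplotlib.pyplot as plt
import numpy as np
from fastdtw import fastdtw  # third-party package: pip install fastdtw
from scipy import signal
from scipy.spatial.distance import cosine
from scipy.stats import entropy
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import acf
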
def evaluate_generated_data(real_data, synthetic_data, features, n_lags=10):
    plt.figure(figsize=(15, 5 * len(features)))

    # Visualization
    for i, feature in enumerate(features):
        plt.subplot(len(features), 2, i * 2 + 1)
        plt.plot(real_data[0, :, i], label='Real')
        plt.title(f'Real Data - {feature}')
        plt.legend()

        plt.subplot(len(features), 2, i * 2 + 2)
        plt.plot(synthetic_data[0, :, i], label='DoppelGANger')
        plt.title(f'DoppelGANger - {feature}')
        plt.legend()

    plt.tight_layout()
    plt.savefig('dopplganger_data_comparison.png')
    plt.show()

    # Quantitative Metrics
    results = {}

    for i, feature in enumerate(features):
        real_seqs = real_data[:, :, i]
        synth_seqs = synthetic_data[:, :, i]

        # Autocorrelation Consistency (ACF with MSE)
        real_acf = np.mean([acf(seq, nlags=n_lags, fft=True) for seq in real_seqs], axis=0)
        synth_acf = np.mean([acf(seq, nlags=n_lags, fft=True) for seq in synth_seqs], axis=0)
        acf_mse = mean_squared_error(real_acf, synth_acf)
        results[f'{feature}_ACF_MSE'] = acf_mse

        # Dynamic Time Warping (DTW) Distance
        dtw_distances = []
        for real_seq, synth_seq in zip(real_seqs[:10], synth_seqs[:10]):  # Limit for speed
            distance, _ = fastdtw(real_seq, synth_seq)
            dtw_distances.append(distance)
        results[f'{feature}_DTW_Distance'] = np.mean(dtw_distances)

        # Periodicity Consistency (PSD with Cosine Similarity and KL Divergence)
        freqs, real_psd = signal.periodogram(real_seqs.flatten())
        _, synth_psd = signal.periodogram(synth_seqs.flatten())
        min_len = min(len(real_psd), len(synth_psd))
        real_psd = real_psd[:min_len]
        synth_psd = synth_psd[:min_len]
        real_psd = real_psd / (np.sum(real_psd) + 1e-10)
        synth_psd = synth_psd / (np.sum(synth_psd) + 1e-10)
        psd_cosine = 1 - cosine(real_psd, synth_psd)
        psd_kl = entropy(real_psd + 1e-10, synth_psd + 1e-10)
        results[f'{feature}_PSD_Cosine'] = psd_cosine
        results[f'{feature}_PSD_KL'] = psd_kl

        # Statistical Metrics
        results[f'{feature}_Mean_Real'] = np.mean(real_seqs)
        results[f'{feature}_Mean_Synth'] = np.mean(synth_seqs)
        results[f'{feature}_Variance_Real'] = np.var(real_seqs)
        results[f'{feature}_Variance_Synth'] = np.var(synth_seqs)

    for key, value in results.items():
        print(f"{key}: {value:.4f}")
    return results
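
The reference code above calls `fastdtw`, which computes an approximation of DTW. For short windows such as 24 steps, the exact dynamic program is cheap and makes the metric explicit. The following is a minimal sketch, assuming 1-D sequences and absolute difference as the local cost; `dtw_distance` is a hypothetical helper, not part of any library.

```python
import numpy as np

def dtw_distance(a, b):
    """Exact DTW distance between two 1-D sequences using |a_i - b_j| as local cost."""
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of the first i points of a with the first j of b.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

t = np.linspace(0, 2 * np.pi, 24)
print(dtw_distance(np.sin(t), np.sin(t)))        # identical sequences
print(dtw_distance(np.sin(t), np.sin(t - 0.5)))  # phase-shifted copy
```

Because DTW allows points to be stretched in time, a phase-shifted copy of a sequence scores much closer than a point-by-point comparison would suggest, which is why it is used here as a shape-similarity measure.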
