# Let's Create Synthetic Data Using an Autoencoder in Python
Explanation:
1) Prepare the Data: Normalize the small real-world dataset.
2) Define the Autoencoder Model: Build a simple autoencoder with an encoder and a decoder.
3) Train the Autoencoder: Train the autoencoder to learn the compressed representation and reconstruction of the data.
4) Generate Synthetic Data: Sample from the latent space of the autoencoder and use the decoder to generate synthetic data.

Further Refinements:
1) Data Augmentation: Apply small perturbations to the existing data to create more training samples.
2) Bootstrapping: Resample the existing data with replacement to create multiple training datasets.
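
The two refinements above can be sketched with NumPy alone. This is a minimal illustration, not part of the original notebook: the helper names `augment_with_noise` and `bootstrap_resample`, the `noise_scale` default, and the choice of Gaussian noise scaled by each column's standard deviation are all assumptions made for the example.

```python
import numpy as np

def augment_with_noise(data, n_copies=5, noise_scale=0.05, seed=0):
    """Return the original rows followed by n_copies jittered copies.

    Assumption: Gaussian noise, scaled per column by noise_scale * column std.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    col_std = data.std(axis=0)
    copies = [data]
    for _ in range(n_copies):
        noise = rng.normal(0.0, 1.0, size=data.shape) * noise_scale * col_std
        copies.append(data + noise)
    return np.vstack(copies)

def bootstrap_resample(data, n_datasets=3, seed=0):
    """Return n_datasets arrays, each drawn from the rows of data with replacement."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_rows = data.shape[0]
    return [data[rng.integers(0, n_rows, size=n_rows)] for _ in range(n_datasets)]

# Same toy values as the demonstration dataset in this notebook
toy_data = np.array(
    [[25, 50000, 1], [30, 60000, 2], [22, 40000, 1], [35, 80000, 3], [28, 55000, 2]],
    dtype=float,
)
augmented = augment_with_noise(toy_data, n_copies=4)
resampled = bootstrap_resample(toy_data, n_datasets=3)
print(augmented.shape)                     # (25, 3)
print(len(resampled), resampled[0].shape)  # 3 (5, 3)
```

Either output could be normalized and passed to the autoencoder in place of the five original rows. Neither technique adds new information, so the synthetic data will still reflect only what those five rows contain.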


In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

In [None]:
# Sample limited real-world data
# Let's create a small dataset for demonstration purposes
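real_data = np.array([[25, 50000, 1], [30, 60000, 2], [22, 40000, 1], [35, 80000, 3], [28, 55000, 2]], dtype=float)

# Keep the column statistics so the synthetic data can be de-normalized later
data_mean = real_data.mean(axis=0)
data_std = real_data.std(axis=0)
real_data = (real_data - data_mean) / data_std  # Normalize the data (z-score)


In [None]:
# Define the autoencoder model
input_dim = real_data.shape[1]
encoding_dim = 2  # Dimensionality of the latent space

input_layer = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation="relu")(input_layer)
# Linear output: z-scored data contains negative values, which a sigmoid cannot produce
decoder_layer = Dense(input_dim, activation="linear")
decoded = decoder_layer(encoded)

autoencoder = Model(inputs=input_layer, outputs=decoded)
autoencoder.compile(optimizer='adam', loss='mse')

# Standalone encoder and decoder that share the autoencoder's weights
encoder_model = Model(inputs=input_layer, outputs=encoded)
latent_input = Input(shape=(encoding_dim,))
decoder_model = Model(inputs=latent_input, outputs=decoder_layer(latent_input))

In [None]:
# Train the autoencoder
autoencoder.fit(real_data, real_data, epochs=1000, batch_size=2, shuffle=True, verbose=0)

In [None]:
# Generate synthetic data
# A plain autoencoder does not constrain its latent codes to N(0, 1),
# so sample from a Gaussian fitted to the encoded real data instead
latent_codes = encoder_model.predict(real_data, verbose=0)
latent_samples = np.random.normal(loc=latent_codes.mean(axis=0), scale=latent_codes.std(axis=0), size=(1000, encoding_dim))
synthetic_data = decoder_model.predict(latent_samples, verbose=0)  # Decode the latent samples
synthetic_data = synthetic_data * data_std + data_mean  # De-normalize the data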

In [None]:
# Create a DataFrame
synthetic_data_df = pd.DataFrame(synthetic_data, columns=['Age', 'Income', 'Education Level'])

In [None]:
# Display the synthetic data
print(synthetic_data_df.head())
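
Printing the first rows shows the format but not whether the synthetic values resemble the real ones. One simple check is to compare per-column means and standard deviations. The sketch below is an assumption-laden illustration: `compare_summary` is a hypothetical helper, and `synthetic_df` is a randomly generated stand-in for the notebook's `synthetic_data_df` so that the block runs on its own. The real data must be in its original units, not the normalized array.

```python
import numpy as np
import pandas as pd

def compare_summary(real_df, synthetic_df):
    """Return per-column mean and std for real and synthetic data, side by side."""
    return pd.DataFrame({
        "real_mean": real_df.mean(),
        "synthetic_mean": synthetic_df.mean(),
        "real_std": real_df.std(),
        "synthetic_std": synthetic_df.std(),
    })

columns = ["Age", "Income", "Education Level"]
real_df = pd.DataFrame(
    [[25, 50000, 1], [30, 60000, 2], [22, 40000, 1], [35, 80000, 3], [28, 55000, 2]],
    columns=columns,
)

# Hypothetical stand-in for synthetic_data_df, drawn from independent Gaussians
rng = np.random.default_rng(0)
synthetic_df = pd.DataFrame(
    rng.normal(real_df.mean().to_numpy(), real_df.std().to_numpy(), size=(1000, 3)),
    columns=columns,
)

summary = compare_summary(real_df, synthetic_df)
print(summary)
```

Large gaps between the real and synthetic columns would suggest the autoencoder or the latent sampling needs adjusting. Matching means and standard deviations is only a first check, since it says nothing about correlations between columns.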

In [None]:
# Optionally save to a CSV file
synthetic_data_df.to_csv('synthetic_data.csv', index=False)