# Synthetic Data Generation using Generative AI

Synthetic data is artificially generated data that mimics real-world data. It is created by algorithms, models, or simulations rather than being collected from actual events or real-world scenarios.

The dataset contains daily records of app usage: minutes of use, notifications received, and the number of times each app was opened. The goal of this project is to generate synthetic data that mimics the original dataset, preserving its statistical properties while protecting the privacy of users' actual usage behaviour.

In [1]:
# Importing the necessary libraries

import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU, BatchNormalization
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import MinMaxScaler

In [2]:
# loading the dataset

data = pd.read_csv('screentime_analysis.csv')

data.head()

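|   | Date | App | Usage (minutes) | Notifications | Times Opened |
|---|------|-----|-----------------|---------------|--------------|
| 0 | 2024-08-07 | Instagram | 81 | 24 | 57 |
| 1 | 2024-08-08 | Instagram | 90 | 30 | 53 |
| 2 | 2024-08-26 | Instagram | 112 | 33 | 17 |
| 3 | 2024-08-22 | Instagram | 82 | 11 | 38 |
| 4 | 2024-08-12 | Instagram | 59 | 47 | 16 |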


In [3]:
data.shape

(200, 5)

## Data Preprocessing

In [4]:
# Dropping Date and App columns since they are specific identifiers and we cannot generate them

data_gan = data.drop(columns = ['Date','App'])

# Normalizing the data
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data_gan)

# Converting back to dataframe
normalized_df = pd.DataFrame(normalized_data, columns = data_gan.columns)

normalized_df.head()

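|   | Usage (minutes) | Notifications | Times Opened |
|---|-----------------|---------------|--------------|
| 0 | 0.677966 | 0.163265 | 0.571429 |
| 1 | 0.754237 | 0.204082 | 0.530612 |
| 2 | 0.940678 | 0.22449 | 0.163265 |
| 3 | 0.686441 | 0.07483 | 0.377551 |
| 4 | 0.491525 | 0.319728 | 0.153061 |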
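With its default settings, `MinMaxScaler` maps each column onto [0, 1] using `(x - min) / (max - min)`, and `inverse_transform` reverses that mapping. The sketch below reproduces the arithmetic with plain NumPy on a small toy array (the values are illustrative, not taken from the dataset), so the round trip is easy to follow.

```python
import numpy as np

# Toy stand-in for three numeric columns (values are illustrative only).
toy = np.array([
    [10.0, 0.0, 5.0],
    [20.0, 50.0, 15.0],
    [40.0, 100.0, 25.0],
])

col_min = toy.min(axis=0)
col_max = toy.max(axis=0)

# Forward transform: each column is mapped onto [0, 1].
scaled = (toy - col_min) / (col_max - col_min)

# Inverse transform: undo the scaling to recover the original units.
restored = scaled * (col_max - col_min) + col_min

print(scaled)
print(np.allclose(restored, toy))
```

Because the generator later emits values in [0, 1], the same inverse mapping is what brings its output back to minutes and counts.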


## Using GANs to build a Generative AI model for Synthetic Data Generation

Process:

1. The generator will be trained to produce data similar to the normalized Usage (minutes), Notifications, and Times Opened columns.
2. The discriminator will be trained to distinguish between the real and generated data.
3. Next, we will alternate between training the discriminator and generator. The discriminator will be trained to classify real vs fake data, and the generator will be trained to fool the discriminator.
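The two objectives in the steps above can be made concrete with binary cross-entropy. The following is a minimal NumPy sketch, assuming hypothetical discriminator outputs (`d_on_real`, `d_on_fake`) rather than a trained model: the discriminator is scored against labels 1 for real and 0 for fake, while the generator is scored on the same fake samples against label 1.

```python
import numpy as np

def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs)))

# Hypothetical discriminator outputs: probability that each sample is real.
d_on_real = np.array([0.9, 0.8, 0.7])
d_on_fake = np.array([0.2, 0.1, 0.3])

# Discriminator objective: real rows should score 1, generated rows 0.
d_loss = 0.5 * (
    binary_cross_entropy(np.ones(3), d_on_real)
    + binary_cross_entropy(np.zeros(3), d_on_fake)
)

# Generator objective: the same generated rows, but labelled 1.
g_loss = binary_cross_entropy(np.ones(3), d_on_fake)

print(f"d_loss={d_loss:.3f}, g_loss={g_loss:.3f}")
```

In this toy case the discriminator is doing well, so its loss is low and the generator's loss is high; training alternates updates so that each side pushes against the other.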

In [5]:
# The generator will take a latent noise vector as input and generate a synthetic sample similar to the data.

latent_dim = 100

def build_generator(latent_dim):
  model = Sequential([
      Dense(128, input_dim = latent_dim),
      LeakyReLU(alpha = 0.01), # non-linearity that keeps a small gradient for negative inputs
      BatchNormalization(momentum = 0.8), # normalizes layer outputs to stabilize and speed up training
      Dense(512), # wider second layer increases the model's capacity to learn features
      LeakyReLU(alpha = 0.01),
      BatchNormalization(momentum = 0.8),
      Dense(3, activation = 'sigmoid') # outputs 3 features in [0, 1], matching the MinMax-scaled columns
  ])
  return model

# create the generator
generator = build_generator(latent_dim)
generator.summary()
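The two activations used in the generator can be written out in a few lines of NumPy. This is only an illustrative sketch of the element-wise functions, not the Keras layers themselves: LeakyReLU scales negative inputs by a small slope instead of zeroing them, and the sigmoid on the output layer keeps every generated feature between 0 and 1, the same range as the MinMax-scaled training data.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Pass positive values through; scale negative values by alpha."""
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))
print(sigmoid(x))
```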



In [6]:
# Example of generating data using the generator network

# generate random noise for 1000 samples
noise = np.random.normal(0, 1, (1000, latent_dim))

# generate synthetic data using the generator
generated_data = generator.predict(noise)

generated_data[:5]

32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step


array([[0.69741046, 0.71925294, 0.38959202],
       [0.6142535 , 0.683277  , 0.3320918 ],
       [0.5523561 , 0.61253643, 0.36208808],
       [0.54635954, 0.6536044 , 0.4081475 ],
       [0.66404605, 0.604072  , 0.43178597]], dtype=float32)

In [7]:
# Building discriminator which will take a real or synthetic data sample and classify it as real or fake

def build_discriminator():
  model = Sequential([
      Dense(512, input_dim = 3), # matches the output dimension of the generator
      LeakyReLU(alpha = 0.01),
      Dense(256), # further reducing the feature size while retaining rich representations
      LeakyReLU(alpha = 0.01),
      Dense(128),
      LeakyReLU(alpha = 0.01),
      Dense(1, activation = 'sigmoid')
  ])
  model.compile(loss = 'binary_crossentropy', optimizer = Adam(), metrics = ['accuracy'])
  # binary_crossentropy -> loss for binary classification (real vs fake)
  # Adam -> uses adaptive learning rates for each parameter and uses past gradients to smooth updates
  return model

# Creating the discriminator
discriminator = build_discriminator()
discriminator.summary()
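To show what "adaptive learning rates" and "past gradients to smooth updates" mean in practice, here is a minimal NumPy sketch of one textbook Adam update. The function name `adam_step` is hypothetical, and the hyperparameter defaults follow the Keras `Adam` defaults; the library's internal implementation may differ in small details such as where epsilon is applied.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for parameters `theta` given gradient `grad` at step t (1-based)."""
    m = beta1 * m + (1.0 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for the early steps
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = np.array([2.0, -0.001])  # very different gradient magnitudes

theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)
```

Both parameters move by roughly the learning rate even though their gradients differ by three orders of magnitude, which is the per-parameter scaling the comment refers to.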

In [11]:
# We will freeze the discriminator's weights when training the generator to ensure only the generator is updated during those training steps

def build_gan(generator, discriminator):
    # freeze the discriminator’s weights while training the generator
    discriminator.trainable = False

    model = Sequential([generator, discriminator])
    model.compile(loss='binary_crossentropy', optimizer=Adam())
    return model

# create the GAN
gan = build_gan(generator, discriminator)
gan.summary()

## Training the GAN

Process:

1. Generate random noise.
2. Use the generator to create fake data.
3. Train the discriminator on both real and fake data.
4. Train the generator via the GAN to fool the discriminator.
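The inputs needed for steps 1 to 3 can be assembled with NumPy alone. The sketch below uses a hypothetical random array (`real_rows`) in place of the normalized dataset and, as a variation, draws batch indices without replacement so that no real row is repeated within a batch.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in for the normalized dataset: 200 rows, 3 features in [0, 1].
real_rows = rng.random((200, 3))
batch_size, latent_dim = 128, 100

# Draw distinct row indices so no real row appears twice in the batch.
idx = rng.choice(real_rows.shape[0], size=batch_size, replace=False)
real_batch = real_rows[idx]

# Latent noise that the generator would map to synthetic rows.
noise = rng.normal(0.0, 1.0, size=(batch_size, latent_dim))

# Targets for the discriminator: 1 for real rows, 0 for generated rows.
real_labels = np.ones((batch_size, 1))
fake_labels = np.zeros((batch_size, 1))

print(real_batch.shape, noise.shape, real_labels.shape, fake_labels.shape)
```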

In [20]:
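# Re-compile the discriminator with the same loss and metric as in build_discriminator.
# Note: build_gan set discriminator.trainable = False; whether this re-compile picks up
# that flag depends on the Keras version, so verify that the discriminator's weights still update.
discriminator.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])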

In [25]:
def train_gan(gan, generator, discriminator, data, epochs=5000, batch_size=128, latent_dim=100):
    for epoch in range(epochs):
        # select a random batch of real data
        idx = np.random.randint(0, data.shape[0], batch_size)
        real_data = data[idx]

        # generate a batch of fake data
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        fake_data = generator.predict(noise)

        # labels for real and fake data
        real_labels = np.ones((batch_size, 1))  # real data has label 1
        fake_labels = np.zeros((batch_size, 1))  # fake data has label 0

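        # train the discriminator; each call returns [loss, accuracy]
        # because the discriminator is compiled with metrics=['accuracy']
        d_loss_real = discriminator.train_on_batch(real_data, real_labels)
        d_loss_fake = discriminator.train_on_batch(fake_data, fake_labels)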

        # train the generator via the GAN
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        valid_labels = np.ones((batch_size, 1)) 
        g_loss = gan.train_on_batch(noise, valid_labels)

        # print the progress every 1000 epochs
        if epoch % 1000 == 0:
            print(f"Epoch {epoch}: D Loss: {0.5 * np.add(d_loss_real, d_loss_fake)}, G Loss: {g_loss}")

train_gan(gan, generator, discriminator, normalized_data, epochs=2000, batch_size=128, latent_dim=latent_dim)

Epoch 0: D Loss: [0.7149801 0.3625803], G Loss: 0.6453141570091248

In [26]:
# generate new data
noise = np.random.normal(0, 1, (1000, latent_dim))  # generate 1000 synthetic samples
generated_data = generator.predict(noise)

# convert the generated data back to the original scale
generated_data_rescaled = scaler.inverse_transform(generated_data)

# convert to DataFrame
generated_df = pd.DataFrame(generated_data_rescaled, columns=data_gan.columns)

generated_df.head()

32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step


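|   | Usage (minutes) | Notifications | Times Opened |
|---|-----------------|---------------|--------------|
| 0 | 87.278854 | 146.999985 | 99.0 |
| 1 | 87.257317 | 146.999985 | 99.0 |
| 2 | 87.42453 | 146.999985 | 99.0 |
| 3 | 87.356461 | 146.999985 | 99.0 |
| 4 | 87.234169 | 146.999985 | 98.999985 |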
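The five rows displayed are nearly identical to one another, so it is worth checking whether the synthetic columns actually reproduce the spread of the real ones. One possible check, sketched below with a hypothetical helper `compare_columns` and toy arrays, compares per-column means and standard deviations and flags any synthetic column whose spread is tiny relative to the real column. In the notebook, the inputs would be `data_gan.to_numpy()` and `generated_df.to_numpy()`; the 5% threshold is an arbitrary assumption.

```python
import numpy as np

def compare_columns(real, synthetic, collapse_ratio=0.05):
    """Return per-column gaps in mean and std, plus a flag for near-constant columns."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0))
    std_gap = np.abs(real.std(axis=0) - synthetic.std(axis=0))
    # A synthetic column whose spread is tiny relative to the real one
    # suggests the generator is emitting almost the same value every time.
    collapsed = synthetic.std(axis=0) < collapse_ratio * real.std(axis=0)
    return mean_gap, std_gap, collapsed

# Toy arrays for illustration only; they are not the notebook's data.
real = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [40.0, 4.0]])
synthetic = np.array([[24.0, 3.9], [26.0, 4.0], [25.0, 4.0], [27.0, 4.0]])

mean_gap, std_gap, collapsed = compare_columns(real, synthetic)
print(mean_gap, std_gap, collapsed)
```

A flagged column would be a reason to revisit training (for example, the number of iterations or the learning rates) before treating the output as a usable substitute for the real data.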
