## Using Faker library to generate synthetic data for simulation

The cells below build the synthetic data that we will use to simulate streaming data.

In [14]:
from faker import Faker
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

In [15]:
banking_df = pd.read_csv("../data/processed/banking_behaviour_preference.csv")
start_date = pd.to_datetime("2024-01-01")  # Starting date: Monday, the first day of 2024
banking_df.insert(1, 'Time', start_date)

# `CustomerDataGenerator` Class Documentation

## Overview

The `CustomerDataGenerator` class is designed to create synthetic customer data with configurable parameters for `churn_rate`, `campaign_effectiveness`, and `customer_satisfaction`. This enables the generation of datasets with varying levels of customer engagement and satisfaction based on a controlled set of inputs. The class also allows for a consistent timestamp to be applied to each generated dataset, facilitating time-based comparisons.

## Class Parameters

When initializing the `CustomerDataGenerator`, the following parameters are available:

- **churn_rate** (`float`): Controls the fraction of sampled existing customers that are marked as churned. This value should be between `0` and `1`, where `1` means all sampled existing customers churn and `0` means none do. New customers are never marked as churned.

- **campaign_effectiveness** (`float`): Represents the level of campaign effectiveness on customer behavior. This value ranges from `0` to `1`, where higher values indicate greater effectiveness, resulting in increased values for features such as `No_of_product`, `Total_Trans_Amt`, and `Total_Trans_Count`.

- **customer_satisfaction** (`float`): Represents the level of customer satisfaction. Similar to `campaign_effectiveness`, this value ranges from `0` to `1`. Higher values influence customer behavior positively, particularly for features related to product usage and transaction frequency.
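As a minimal sketch of how the two scale parameters combine, the standalone function below mirrors the class's `_adjust_based_on_campaign` helper: the value is multiplied by `1 + campaign_effectiveness * customer_satisfaction`, truncated to an integer, and capped. The free-function form and the example inputs are illustrative assumptions, not part of the class.

```python
def adjust_based_on_campaign(value, max_increase, campaign_effectiveness, customer_satisfaction):
    """Scale `value` by 1 + effectiveness * satisfaction, truncate to int, cap at `max_increase`."""
    increase_factor = 1 + campaign_effectiveness * customer_satisfaction  # between 1 and 2
    return min(int(value * increase_factor), max_increase)

# High-engagement settings: the factor is 1 + 0.9 * 0.9 = 1.81
high = adjust_based_on_campaign(3, 6, 0.9, 0.9)        # int(5.43) -> 5
# Low-engagement settings: the factor is 1 + 0.1 * 0.1 = 1.01
low = adjust_based_on_campaign(3, 6, 0.1, 0.1)         # int(3.03) -> 3
# The cap applies when the scaled value exceeds it
capped = adjust_based_on_campaign(100, 150, 1.0, 1.0)  # min(200, 150) -> 150
print(high, low, capped)
```

Because the two parameters are multiplied together, a low value for either one keeps the factor close to 1 regardless of the other.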

## Methods

### `generate_data(existing_df, num_records, timestamp=None)`

The main method to generate synthetic customer data.

#### Parameters:
- **existing_df** (`pd.DataFrame`): The original customer data from which existing clients are sampled. This dataset should contain a unique identifier column named `CLIENTNUM`.
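- **num_records** (`int`): The total number of synthetic records to generate. The method assigns a random share of this number (between 0% and 80%, capped at the number of clients in `existing_df`) to existing customers and the remainder to new customers.
- **timestamp** (`datetime` or `str`, optional): A fixed timestamp applied to all generated records, which allows time-based differentiation between datasets. Defaults to the current time (`datetime.now()`) when omitted.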
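The division of `num_records` between existing and new customers can be sketched as follows. In the class, the share of existing customers is drawn uniformly between 0 and 0.8 and then capped at the number of clients available in `existing_df`. This sketch uses the standard-library `random` module in place of `np.random`, and `split_counts` is a hypothetical helper name introduced only for illustration.

```python
import random

def split_counts(num_records, num_available, rng):
    """Return (num_existing, num_new) for one generated batch."""
    # Draw the share of existing customers uniformly from [0, 0.8]
    share = rng.uniform(0.0, 0.8)
    num_existing = int(share * num_records)
    # Existing clients are sampled without replacement, so cap at what is available
    num_existing = min(num_existing, num_available)
    return num_existing, num_records - num_existing

rng = random.Random(0)  # seeded only so the sketch is repeatable
num_existing, num_new = split_counts(10000, 500, rng)
print(num_existing, num_new)
```

When `existing_df` is small relative to `num_records`, the cap dominates and most of the batch consists of new customers.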

#### Returns:
- **all_data** (`pd.DataFrame`): A DataFrame containing the generated data, with columns:
  - `CLIENTNUM`: Unique identifier for each customer.
  - `Income_Category`: Encoded as integers from 0 to 4.
  - `No_of_product`: Number of products used by the customer, adjusted based on effectiveness and satisfaction levels.
  - `Total_Trans_Amt`: Total transaction amount, scaled by effectiveness and satisfaction.
  - `Total_Trans_Count`: Total count of transactions, also influenced by effectiveness and satisfaction.
  - Additional columns representing customer attributes and behaviors.
  - `Churned`: Binary indicator where `1` represents a churned customer and `0` represents a retained customer.
  - `Time`: Timestamp indicating when the data was generated, set to the provided `timestamp` argument.
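The documented column constraints lend themselves to a quick sanity check. The sketch below validates a single row expressed as a plain dictionary, such as one element of `all_data.to_dict("records")`. The `check_record` helper, the `REQUIRED_COLUMNS` set, and the sample row are assumptions made for illustration and are not part of the class.

```python
from datetime import datetime

# Columns named explicitly in the Returns list above
REQUIRED_COLUMNS = {
    "CLIENTNUM", "Income_Category", "No_of_product",
    "Total_Trans_Amt", "Total_Trans_Count", "Churned", "Time",
}

def check_record(record):
    """Return a list of problems found in one generated row (empty if none)."""
    missing = REQUIRED_COLUMNS - set(record)
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    if not 0 <= record["Income_Category"] <= 4:
        problems.append("Income_Category outside 0-4")
    if record["Churned"] not in (0, 1):
        problems.append("Churned is not 0 or 1")
    return problems

# A made-up row for demonstration purposes
sample = {
    "CLIENTNUM": 1001, "Income_Category": 2, "No_of_product": 3,
    "Total_Trans_Amt": 1500, "Total_Trans_Count": 40,
    "Churned": 0, "Time": datetime(2024, 1, 8),
}
print(check_record(sample))
```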

In [16]:
class CustomerDataGenerator:
    def __init__(self, churn_rate=0.1, campaign_effectiveness=0.5, customer_satisfaction=0.5):
        """
        Initialize the generator with churn rate, campaign effectiveness, and customer satisfaction.
        
        :param churn_rate: Probability of churn for existing clients
        :param campaign_effectiveness: Scale factor (0-1) for campaign's effectiveness on features
        :param customer_satisfaction: Scale factor (0-1) for customer satisfaction effect on features
        """
        self.churn_rate = churn_rate
        self.campaign_effectiveness = campaign_effectiveness
        self.customer_satisfaction = customer_satisfaction
        self.fake = Faker()
    
    def _adjust_based_on_campaign(self, value, max_increase):
        """Adjust values based on campaign effectiveness and customer satisfaction."""
        increase_factor = 1 + self.campaign_effectiveness * self.customer_satisfaction  # Value between 1 and 2
        return min(int(value * increase_factor), max_increase)

    def generate_data(self, existing_df, num_records, timestamp=None):
        """
        Generate synthetic customer data with a fixed timestamp.
        
        :param existing_df: DataFrame containing existing customer data
        :param num_records: Total number of records to generate
        :param timestamp: Fixed datetime to apply to all records in the generated data
        :return: DataFrame with generated customer data
        """
        # Set the timestamp to the current datetime if none is provided
        if timestamp is None:
            timestamp = datetime.now()

        # Get existing client numbers
        existing_clients = existing_df['CLIENTNUM'].values

        # Determine number of existing and new records
        num_existing = int(np.random.uniform(0.0, 0.8) * num_records)
        num_existing = min(num_existing, len(existing_clients))  # Ensure we do not exceed the actual number
        num_new = num_records - num_existing

        # Select random existing clients for updates
        updated_client_nums = np.random.choice(existing_clients, size=num_existing, replace=False)
        new_client_nums = np.arange(existing_clients.max() + 1, existing_clients.max() + 1 + num_new)

        def make_record(client):
            """Build one synthetic record for the given client number."""
            return {
                'CLIENTNUM': client,
                'Income_Category': self.fake.random_int(min=0, max=4),  # Income category as integer
                'No_of_product': self._adjust_based_on_campaign(self.fake.random_int(min=1, max=3), 6),
                'Total_Trans_Amt': self._adjust_based_on_campaign(self.fake.random_int(min=500, max=2000), 10000),
                'Total_Trans_Count': self._adjust_based_on_campaign(self.fake.random_int(min=10, max=50), 150),
                'Credit Score': self.fake.random_int(min=300, max=850),
                'Outstanding Loans': self.fake.random_int(min=0, max=50000),
                'Balance': self.fake.random_int(min=0, max=300000),
                'PhoneService': self.fake.random_int(min=0, max=1),
                'InternetService': self.fake.random_int(min=0, max=2),
                'TechSupport': self.fake.random_int(min=0, max=2),
                'PaperlessBilling': self.fake.random_int(min=0, max=1),
                'PaymentMethod': self.fake.random_int(min=0, max=3),
                'Churned': 0,  # Every record starts as not churned
                'Time': timestamp  # Fixed timestamp for each record
            }

        # Create updated records for existing clients
        updated_data = [make_record(client) for client in updated_client_nums]

        # Apply churn rate only to the updated (existing) customers
        num_churned = int(self.churn_rate * num_existing)
        churned_clients = np.random.choice(range(num_existing), size=num_churned, replace=False)

        for i in churned_clients:
            updated_data[i]['Churned'] = 1  # Mark these clients as churned

        # Create new customer records (with no churn)
        new_data = [make_record(client) for client in new_client_nums]
        
        # Combine updated and new data into one DataFrame
        all_data = pd.DataFrame(updated_data + new_data)

        return all_data

# Synthetic Data Generation with Varying Effectiveness and Satisfaction

The following code demonstrates how to use the `CustomerDataGenerator` class to create two distinct datasets with different levels of `campaign_effectiveness` and `customer_satisfaction`. The purpose is to simulate customer behavior under scenarios with high and low engagement, represented by high and low effectiveness and satisfaction parameters.

## Code Breakdown

### 1. Initialize the `CustomerDataGenerator` Instances
We create two instances of `CustomerDataGenerator` with different levels of `campaign_effectiveness` and `customer_satisfaction` to simulate varying customer engagement:
- `high_generator`: Initialized with `campaign_effectiveness=0.9` and `customer_satisfaction=0.9`, representing a high-engagement scenario where the campaign is highly effective and customers are very satisfied.
- `low_generator`: Initialized with `campaign_effectiveness=0.1` and `customer_satisfaction=0.1`, representing a low-engagement scenario with minimal effectiveness and satisfaction.

Both instances are set with a `churn_rate` of 0.1, meaning that 10% of existing customers are marked as churned in both scenarios.
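The churn step can be sketched in isolation. The class computes the number of churned customers as `int(churn_rate * num_existing)`, so the count is rounded down, and then picks that many existing records at random without replacement. The sketch below uses the standard-library `random.sample` instead of `np.random.choice`. The `mark_churned` helper and the 25 made-up clients are illustrative assumptions.

```python
import random

def mark_churned(records, churn_rate, rng):
    """Set Churned=1 on int(churn_rate * len(records)) randomly chosen records."""
    num_churned = int(churn_rate * len(records))  # truncates toward zero
    for i in rng.sample(range(len(records)), num_churned):
        records[i]["Churned"] = 1
    return num_churned

rng = random.Random(42)  # seeded only so the sketch is repeatable
existing = [{"CLIENTNUM": n, "Churned": 0} for n in range(1, 26)]  # 25 made-up clients
num_churned = mark_churned(existing, churn_rate=0.1, rng=rng)
print(num_churned, sum(r["Churned"] for r in existing))  # prints "2 2"
```

With 25 existing records and a rate of 0.1, the product is 2.5, which truncates to 2 churned customers.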

In [17]:
# Initialize the CustomerDataGenerator with high and low effectiveness and satisfaction
high_generator = CustomerDataGenerator(churn_rate=0.1, campaign_effectiveness=0.9, customer_satisfaction=0.9)
low_generator = CustomerDataGenerator(churn_rate=0.1, campaign_effectiveness=0.1, customer_satisfaction=0.1)

# Define the timestamps for each dataset
initial_date = datetime(2024, 1, 1)
high_timestamp = initial_date + timedelta(weeks=1)
low_timestamp = initial_date + timedelta(weeks=2)

# Generate synthetic data with high effectiveness and satisfaction
high_data = high_generator.generate_data(banking_df, num_records=10000, timestamp=high_timestamp)

# Generate synthetic data with low effectiveness and satisfaction
low_data = low_generator.generate_data(banking_df, num_records=8000, timestamp=low_timestamp)

# Display the first few rows of each dataset
print("High Effectiveness and Satisfaction Data:")
print(high_data.head())

print("\nLow Effectiveness and Satisfaction Data:")
print(low_data.head())

High Effectiveness and Satisfaction Data:
   CLIENTNUM  Income_Category  No_of_product  Total_Trans_Amt  \
0  733607565                0              1              973   
1  856598266                1              3             1634   
2  909727049                1              5             1723   
3  161982512                4              5             1116   
4  270863839                1              1             2146   

   Total_Trans_Count  Credit Score  Outstanding Loans  Balance  PhoneService  \
0                 28           628              38257   218679             0   
1                 88           653              39473    10978             0   
2                 74           589              31258   275257             0   
3                 32           347               7590    41391             0   
4                 50           746              18786   281665             0   

   InternetService  TechSupport  PaperlessBilling  PaymentMethod  Churned  \
0        

## Saving data

In [18]:
high_data.to_csv("../data/processed/Simulation_data_high.csv", index=False)
low_data.to_csv("../data/processed/Simulation_data_low.csv", index=False)