<a href="https://colab.research.google.com/github/tejjusbhat/SaaS-Customer-Churn-Prediction/blob/main/Dataset_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Data Generation for SaaS Churn Prediction
This notebook simulates the kind of customer dataset a SaaS provider would hold, so that it can be used to train a model that predicts customer churn.

In [None]:
import numpy as np
import pandas as pd

The `d0r1h/customer_churn` dataset from Hugging Face is used as the starting point because it already contains many fields that are typical for a SaaS product.

In [None]:
from datasets import load_dataset

ds = load_dataset("d0r1h/customer_churn")



In [None]:
df = pd.DataFrame(ds["train"])
display(df.head())

Unnamed: 0,age,gender,security_no,region_category,membership_category,joining_date,joined_through_referral,referral_id,preferred_offer_types,medium_of_operation,...,avg_time_spent,avg_transaction_value,avg_frequency_login_days,points_in_wallet,used_special_discount,offer_application_preference,past_complaint,complaint_status,feedback,churn_risk_score
0,18,F,XW0DQ7H,Village,Platinum Membership,17-08-2017,No,xxxxxxxx,Gift Vouchers/Coupons,?,...,300.63,53005.25,17,781.75,Yes,Yes,No,Not Applicable,Products always in Stock,0
1,32,F,5K0N3X1,City,Premium Membership,28-08-2017,?,CID21329,Gift Vouchers/Coupons,Desktop,...,306.34,12838.38,10,,Yes,No,Yes,Solved,Quality Customer Care,0
2,44,F,1F2TCL3,Town,No Membership,11-11-2016,Yes,CID12313,Gift Vouchers/Coupons,Desktop,...,516.16,21027.0,22,500.69,No,Yes,Yes,Solved in Follow-up,Poor Website,1
3,37,M,VJGJ33N,City,No Membership,29-10-2016,Yes,CID3793,Gift Vouchers/Coupons,Desktop,...,53.27,25239.56,6,567.66,No,Yes,Yes,Unsolved,Poor Website,1
4,31,F,SVZXCWB,City,No Membership,12-09-2017,No,xxxxxxxx,Credit/Debit Card Offers,Smartphone,...,113.13,24483.66,16,663.06,No,Yes,Yes,Solved,Poor Website,1


Rename some columns for clarity.

In [None]:
df.rename(columns={"churn_risk_score": "churn", "avg_time_spent": "avg_session_duration"}, inplace=True)

Rename the membership tiers so they resemble the plans of a typical SaaS product.

In [None]:
tier_map = {
    "Basic Membership": "Basic",
    "Silver Membership": "Pro",
    "Gold Membership": "Pro",
    "Platinum Membership": "Enterprise",
    "Premium Membership": "Enterprise",
    "No Membership": "Basic"
}
df["plan_tier"] = df["membership_category"].map(tier_map)
df["plan_tier"] = df["plan_tier"].fillna("Basic")
df.drop(columns=["membership_category"], inplace=True)
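
Because `Series.map` returns `NaN` for any label that is missing from the dictionary, the `fillna("Basic")` step silently assigns unmapped labels to the Basic tier. A minimal sketch on made-up labels (the `raw` series and the "Trial Membership" label are hypothetical and not taken from the dataset) shows one way to list such labels before they are filled:

```python
import pandas as pd

# Hypothetical membership labels; "Trial Membership" is deliberately absent from the map
raw = pd.Series(["Gold Membership", "No Membership", "Trial Membership", None])

tier_map = {
    "Basic Membership": "Basic",
    "Silver Membership": "Pro",
    "Gold Membership": "Pro",
    "Platinum Membership": "Enterprise",
    "Premium Membership": "Enterprise",
    "No Membership": "Basic",
}

mapped = raw.map(tier_map)

# Labels that are present in the data but missing from tier_map
unmapped = sorted(raw[mapped.isna() & raw.notna()].unique())
print(unmapped)

# Same fallback as the notebook: anything unmapped or missing becomes "Basic"
plan_tier = mapped.fillna("Basic")
print(plan_tier.tolist())
```

Checking `unmapped` before the fallback makes it easier to decide whether a new label really belongs in the Basic tier.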

##Generating logs
A SaaS product typically records several rows of time-series data per customer, which can later be aggregated into a single row per customer.

A random number generator produces the following daily values for each customer:
- **logins:** drawn from a Poisson distribution whose rate is a per-tier base rate scaled by a weekday/weekend multiplier, so that traffic varies with the day of the week
- **api_calls:** API usage, generated with the same logic as logins but with a different base rate
- **session_minutes:** the total time spent in sessions on that day
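
As a worked example of the login rate described in the list above, the sketch below computes the expected logins per day for one customer. The constants are copied from the simulation cell in this notebook; the function name `daily_login_rate` and the stickiness value of 0.5 are illustrative assumptions.

```python
# Base expected logins/day per tier, as used in the simulation cell
BASE_LOGIN_LAMBDA = {"Basic": 0.15, "Pro": 0.30, "Enterprise": 0.45}


def daily_login_rate(tier: str, weekday: int, stickiness: float) -> float:
    """Poisson rate for logins on one day (weekday: 0=Mon .. 6=Sun)."""
    weekday_mult = 1.2 if weekday < 5 else 0.8
    return BASE_LOGIN_LAMBDA[tier] * weekday_mult * (0.5 + 0.8 * stickiness)


# Hypothetical Pro customer with stickiness 0.5
print(round(daily_login_rate("Pro", 2, 0.5), 3))  # Wednesday -> 0.324
print(round(daily_login_rate("Pro", 6, 0.5), 3))  # Sunday    -> 0.216
```

The same structure applies to API calls, with the API base rate and the term `0.4 + 1.0 * adoption` in place of the stickiness term.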

In [None]:
rng = np.random.default_rng(123)

customers = df["security_no"].values
n_days = 90
plan = df["plan_tier"].map({"Basic":0, "Pro":1, "Enterprise":2}).values
dates = pd.date_range(end=pd.Timestamp.today().normalize(), periods=n_days, freq="D")

rows = []
for i, cid in enumerate(customers):
    tier = plan[i]
    # Base intensities by tier
    base_login_lambda = [0.15, 0.30, 0.45][tier]   # expected logins/day
    base_api_lambda   = [5,    30,    150][tier]   # expected API calls/day

    # Personal modifiers from aggregate features if present
    stickiness = df.loc[df["security_no"]==cid, "stickiness_score"].iloc[0] if "stickiness_score" in df else rng.uniform(0.1, 0.9)
    adoption   = df.loc[df["security_no"]==cid, "feature_adoption_rate"].iloc[0] if "feature_adoption_rate" in df else rng.beta(2,5)

    for d in dates:
        # Weekday/seasonality bump
        weekday = d.weekday()  # 0=Mon..6=Sun
        weekday_mult = 1.2 if weekday < 5 else 0.8

        # Draw events using poisson to simulate reality
        logins = rng.poisson(lam=base_login_lambda * weekday_mult * (0.5 + 0.8*stickiness))
        api    = rng.poisson(lam=base_api_lambda   * weekday_mult * (0.4 + 1.0*adoption))

        # Total session minutes for the day: one gamma draw (positive skew) scaled by the number of logins
        session_min = 0.0
        if logins > 0:
            # No size argument, so gamma returns a scalar and float() does not trigger NumPy's array-to-scalar deprecation warning
            session_min = float(rng.gamma(shape=2 + 3*stickiness, scale=6) * logins)

        rows.append([cid, d.date(), int(logins), int(api), round(session_min,2)])

usage = pd.DataFrame(rows, columns=["security_no","date","logins","api_calls","session_minutes"])
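
The nested Python loop above makes one pass per customer per day, which can be slow for a large customer table. One possible alternative, sketched here on three made-up customers, draws all customer-day values at once with NumPy broadcasting. It uses the same distributions as the loop but will not reproduce the same random numbers, and the toy `customers`, `plan`, and fixed end date are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)

# Toy inputs standing in for df["security_no"] and the encoded plan tiers (assumed values)
customers = np.array(["A", "B", "C"])
plan = np.array([0, 1, 2])  # 0=Basic, 1=Pro, 2=Enterprise
n_days = 14
dates = pd.date_range(end="2024-01-14", periods=n_days, freq="D")

# Per-customer base rates and personal modifiers, shape (n_customers,)
base_login = np.array([0.15, 0.30, 0.45])[plan]
base_api = np.array([5, 30, 150])[plan]
stickiness = rng.uniform(0.1, 0.9, size=len(customers))
adoption = rng.beta(2, 5, size=len(customers))

# Per-day multiplier, shape (n_days,): weekdays 1.2, weekends 0.8
weekday_mult = np.where(dates.weekday < 5, 1.2, 0.8)

# Outer products give one Poisson rate per (customer, day) pair
login_lam = np.outer(base_login * (0.5 + 0.8 * stickiness), weekday_mult)
api_lam = np.outer(base_api * (0.4 + 1.0 * adoption), weekday_mult)
logins = rng.poisson(login_lam)
api_calls = rng.poisson(api_lam)

# One gamma draw per (customer, day), scaled by logins; zero logins gives zero minutes
per_login = rng.gamma(shape=(2 + 3 * stickiness)[:, None], scale=6, size=logins.shape)
session_minutes = np.round(per_login * logins, 2)

# Flatten in row-major order: all days of customer A, then B, then C
usage = pd.DataFrame({
    "security_no": np.repeat(customers, n_days),
    "date": np.tile(dates.date, len(customers)),
    "logins": logins.ravel(),
    "api_calls": api_calls.ravel(),
    "session_minutes": session_minutes.ravel(),
})
print(usage.head())
```

The resulting frame has the same columns as the `usage` frame built by the loop, so the aggregation step would work on it unchanged.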



##Aggregating the dataset
The simulated daily values are now aggregated per customer and merged back into the main dataframe.

In [None]:
# Aggregate the daily logs into per-customer features over the 90-day window
today = pd.Timestamp.today().normalize().date()

def days_since_last_login(s):
    # Recency: days since the last day with at least one login in the window.
    # Customers with no logins at all get the sentinel value 999
    # (the original one-liner would fail on this branch because it called .days on a date).
    active = usage.loc[s.index, "logins"] > 0
    if not active.any():
        return 999
    return (today - max(s.loc[active])).days

agg = usage.groupby("security_no").agg(
    logins_90d=("logins","sum"),
    active_days_90d=("logins", lambda x: (x>0).sum()),
    api_calls_90d=("api_calls","sum"),
    session_minutes_90d=("session_minutes","sum"),
    days_since_active=("date", days_since_last_login),
).reset_index()

# Merge into your master DF
df = df.merge(agg, on="security_no", how="left").fillna({
    "logins_90d":0, "active_days_90d":0, "api_calls_90d":0, "session_minutes_90d":0, "days_since_active":90
})
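
The merge assumes that `security_no` identifies exactly one row on each side. A small sketch on made-up frames (the `left` and `right` names and their values are hypothetical) shows how the `validate` argument of `DataFrame.merge` can make that assumption explicit and raise `MergeError` when it is violated:

```python
import pandas as pd

# Hypothetical master table and aggregated usage table
left = pd.DataFrame({"security_no": ["A", "B", "C"], "plan_tier": ["Basic", "Pro", "Basic"]})
right = pd.DataFrame({"security_no": ["A", "B"], "logins_90d": [12, 30]})

# validate="one_to_one" checks that the key is unique in both frames
merged = left.merge(right, on="security_no", how="left", validate="one_to_one")
merged = merged.fillna({"logins_90d": 0})  # customer C has no usage rows
print(merged)

# A duplicated key on the left side makes the same merge raise MergeError
dup = pd.concat([left, left.iloc[[0]]], ignore_index=True)
try:
    dup.merge(right, on="security_no", how="left", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```

Without the check, duplicated keys would multiply rows in the merged frame without any error.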

##Final Dataset simulated for SaaS

In [None]:
df.head()

Unnamed: 0,age,gender,security_no,region_category,joining_date,joined_through_referral,referral_id,preferred_offer_types,medium_of_operation,internet_option,...,past_complaint,complaint_status,feedback,churn,plan_tier,logins_90d,active_days_90d,api_calls_90d,session_minutes_90d,days_since_active
0,18,F,XW0DQ7H,Village,17-08-2017,No,xxxxxxxx,Gift Vouchers/Coupons,?,Wi-Fi,...,No,Not Applicable,Products always in Stock,0,Enterprise,46,33,8793,1027.11,2
1,32,F,5K0N3X1,City,28-08-2017,?,CID21329,Gift Vouchers/Coupons,Desktop,Mobile_Data,...,Yes,Solved,Quality Customer Care,0,Enterprise,37,30,8605,862.65,1
2,44,F,1F2TCL3,Town,11-11-2016,Yes,CID12313,Gift Vouchers/Coupons,Desktop,Wi-Fi,...,Yes,Solved in Follow-up,Poor Website,1,Basic,17,17,267,411.37,16
3,37,M,VJGJ33N,City,29-10-2016,Yes,CID3793,Gift Vouchers/Coupons,Desktop,Mobile_Data,...,Yes,Unsolved,Poor Website,1,Basic,8,8,227,215.77,20
4,31,F,SVZXCWB,City,12-09-2017,No,xxxxxxxx,Credit/Debit Card Offers,Smartphone,Mobile_Data,...,Yes,Solved,Poor Website,1,Basic,13,11,294,231.94,2


In [None]:
df.to_csv("churn_data.csv", index=False)