### How is it done?
### Before GenAI emerged (these techniques are still used in many companies, but **GenAI is what everyone is talking about these days**)
Some of the modelling techniques used:
- Classic statistical methods
- Deep learning models (GANs and VAEs behind the scenes)
- A mix of classic statistical models and deep learning

Once synthetic data is generated, it must be evaluated to make sure it is fit for use in downstream tasks. Many libraries and websites offer this kind of solution, but here we focus on the GenAI part.
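
As a rough illustration of what such an evaluation can look like, the sketch below compares one numeric column and one categorical column between a real and a synthetic sample using only the standard library. The toy data and the helper names (`numeric_gap`, `category_gap`) are made up for illustration; dedicated evaluation libraries report far richer metrics.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical toy samples; in practice these would be columns of real and synthetic tables.
real_amounts = [25.0, 30.0, 27.5, 500.0, 26.0]
synthetic_amounts = [24.0, 31.0, 28.0, 480.0, 27.0]
real_types = ["POS", "POS", "online", "online", "POS"]
synthetic_types = ["POS", "online", "online", "online", "POS"]


def numeric_gap(real, synthetic):
    """Absolute differences in mean and standard deviation between two samples."""
    return {
        "mean_gap": abs(mean(real) - mean(synthetic)),
        "std_gap": abs(pstdev(real) - pstdev(synthetic)),
    }


def category_gap(real, synthetic):
    """Total variation distance between two categorical distributions (0 means identical)."""
    real_freq, synth_freq = Counter(real), Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )


amount_report = numeric_gap(real_amounts, synthetic_amounts)
type_distance = category_gap(real_types, synthetic_types)
print(amount_report, type_distance)
```

Small gaps suggest that the synthetic sample tracks the real one on these two columns; they say nothing about relationships between columns, which need separate checks.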

## Use case

Synthetic data refers to artificially generated data that imitates the characteristics of real data without containing any information from actual individuals or entities. It is typically created through mathematical models, algorithms, or other data generation techniques. Synthetic data can be used for a variety of purposes, including testing, research, and training machine learning models, while preserving privacy and security.

Benefits of Synthetic Data:

1. **Privacy and Security**: No real personal data at risk of breaches.
2. **Data Augmentation**: Expands datasets for machine learning.
3. **Flexibility**: Create specific or rare scenarios.
4. **Cost-effective**: Often cheaper than real-world data collection.
5. **Regulatory Compliance**: Helps navigate strict data protection laws.
6. **Model Robustness**: Can lead to better generalizing AI models.
7. **Rapid Prototyping**: Enables quick testing without real data.
8. **Controlled Experimentation**: Simulate specific conditions.
9. **Access to Data**: Alternative when real data isn't available.

**Note: Despite the benefits, synthetic data should be used carefully, as it may not always capture real-world complexities.**

## Quickstart

In this notebook, we'll dive deep into generating synthetic credit card transaction records for fraud detection using the langchain library. This approach is particularly useful when you want to develop or test algorithms but don't want to use real customer data due to privacy concerns or data availability issues.

## Setup
- First, install the langchain library and its dependencies. Because we use the OpenAI generator chain, we also install `openai`. The synthetic data generator lives in an experimental package, so `langchain_experimental` is included in the installs as well.
- [Pydantic](https://docs.pydantic.dev/latest/): Data validation library for Python, used here to define the data schema.

In [None]:
%%capture
!pip install -U langchain langchain_experimental openai

In [None]:
# set environment variables
# https://platform.openai.com/account/api-keys
import os
os.environ["OPENAI_API_KEY"] = ""

In [None]:
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.base import SyntheticDataGenerator
from langchain_experimental.tabular_synthetic_data.openai import create_openai_data_generator, OPENAI_TEMPLATE
from langchain_experimental.tabular_synthetic_data.prompts import SYNTHETIC_FEW_SHOT_SUFFIX, SYNTHETIC_FEW_SHOT_PREFIX


Note: importing from `langchain.pydantic_v1` triggers a deprecation warning in recent LangChain versions. The warning recommends replacing imports like `from langchain.pydantic_v1 import BaseModel` with `from pydantic import BaseModel`, or with the v1 compatibility namespace, `from pydantic.v1 import BaseModel`, if you are working in a code base that has not been fully upgraded to pydantic 2 yet.


## 1. Define Your Data Model
- Every dataset has a structure or a "schema".
- The `FraudAttributes` class below serves as our schema for the synthetic data.
- By defining this, we're informing our synthetic data generator about the shape and nature of data we expect.

In [None]:
from pydantic import BaseModel
from datetime import date
from typing import Optional

class FraudAttributes(BaseModel):
    # Identifiers
    transaction_id: str
    customer_name: str
    customer_address: str
    customer_ip_address: str
    merchant_name: str
    merchant_address: str
    merchant_ip_address: str

    # Transaction details
    item_purchased: str
    activity_type: str  # e.g., purchase, balance transfer
    type_of_transaction: str  # e.g., online, POS
    transaction_amount: float
    average_transaction_amount: float
    credit_limit: float
    date_of_last_balance_transfer: date

    # Device and network
    device_used: str
    is_device_anomaly: bool
    is_geo_anomaly: bool

    # Account changes and lifecycle
    date_of_email_change: date
    date_of_phone_change: date
    date_of_address_change: date
    date_of_password_change: date
    date_of_account_creation: date
    date_of_account_locked: date
    date_of_credit_limit_changed: date
    account_age_days: Optional[int] = 0  # Derived from account creation

    # Authentication / Challenge history
    date_of_last_otp_email_challenge: date
    date_of_last_otp_sms_challenge: date
    date_of_last_cvv_challenge: date
    date_of_last_ssn_cvv_challenge: date
    date_of_last_verid_challenge: date
    number_of_email_otp_challenges_failed_last_7d: int
    number_of_sms_otp_challenges_failed_last_7d: int

    # Behavioral signals
    number_of_transactions_last_24h: int
    number_of_failed_logins_last_24h: int
    is_transaction_spike: bool # Flag if unusually high txn

    # Risky transaction types
    is_external_reward_redemption: bool
    is_balance_transfer: bool

    is_fraud: bool
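
The schema comments `account_age_days` as derived from the account creation date. A minimal sketch of that derivation, using only `datetime`, is shown below. The reference date and the helper name are assumptions, since the notebook does not state which date the age is measured against.

```python
from datetime import date


def account_age_days(created: date, as_of: date) -> int:
    """Number of whole days between account creation and a reference date."""
    if as_of < created:
        raise ValueError("reference date precedes account creation")
    return (as_of - created).days


# Hypothetical reference date chosen for illustration.
as_of = date(2025, 5, 6)
age = account_age_days(date(2025, 3, 1), as_of)
print(age)
```

A check like this can be run over generated records to confirm that `account_age_days` and `date_of_account_creation` agree with each other.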









## 2. Sample Data
To guide the synthetic data generator, it's useful to provide it with a few real-world-like examples. These examples serve as a "seed" - they're representative of the kind of data you want, and the generator will use them to create more data that looks similar.

Here are some fictional credit card transaction records:

In [None]:
examples = [
    {"example": """\
  "transaction_id": "TXN000003",
  "customer_name": "Charlie Lee",
  "customer_address": "99 Dark Alley, Unknownville",
  "customer_ip_address": "185.21.34.109",
  "merchant_name": "LuxuryWatches",
  "merchant_address": "123 Offshore Plaza",
  "merchant_ip_address": "185.21.34.110",

  "item_purchased": "Luxury Watch",
  "activity_type": "purchase",
  "type_of_transaction": "online",
  "transaction_amount": 6500.00,
  "average_transaction_amount": 200.00,
  "credit_limit": 7000.00,
  "date_of_last_balance_transfer": "2025-04-10",

  "device_used": "Unknown Device",
  "is_device_anomaly": True,
  "is_geo_anomaly": True,

  "date_of_email_change": "2025-05-04",
  "date_of_phone_change": "2025-05-04",
  "date_of_address_change": "2025-05-04",
  "date_of_password_change": "2025-05-04",
  "date_of_account_creation": "2025-03-01",
  "date_of_account_locked": "2025-05-05",
  "date_of_credit_limit_changed": "2025-04-15",
  "account_age_days": 66,

  "date_of_last_otp_email_challenge": "2025-05-05",
  "date_of_last_otp_sms_challenge": "2025-05-05",
  "date_of_last_cvv_challenge": "2025-05-05",
  "date_of_last_ssn_cvv_challenge": "2025-05-05",
  "date_of_last_verid_challenge": "2025-05-05",
  "number_of_email_otp_challenges_failed_last_7d": 3,
  "number_of_sms_otp_challenges_failed_last_7d": 2,

  "number_of_transactions_last_24h": 10,
  "number_of_failed_logins_last_24h": 5,
  "is_transaction_spike": True,

  "is_external_reward_redemption": False,
  "is_balance_transfer": False,

  "is_fraud": True"""},
    {"example": """\
  "transaction_id": "TXN000001",
  "customer_name": "Alice Johnson",
  "customer_address": "123 Elm Street, Springfield",
  "customer_ip_address": "192.168.1.10",
  "merchant_name": "Bookstore Inc",
  "merchant_address": "456 Main Street, Springfield",
  "merchant_ip_address": "192.168.1.20",

  "item_purchased": "Book",
  "activity_type": "purchase",
  "type_of_transaction": "POS",
  "transaction_amount": 25.99,
  "average_transaction_amount": 27.00,
  "credit_limit": 5000.00,
  "date_of_last_balance_transfer": "2024-12-01",

  "device_used": "Alice's iPhone",
  "is_device_anomaly": False,
  "is_geo_anomaly": False,

  "date_of_email_change": "2024-10-10",
  "date_of_phone_change": "2024-09-15",
  "date_of_address_change": "2024-08-01",
  "date_of_password_change": "2024-11-20",
  "date_of_account_creation": "2020-01-15",
  "date_of_account_locked": "2025-05-05",
  "date_of_credit_limit_changed": "2023-12-01",
  "account_age_days": 1938,

  "date_of_last_otp_email_challenge": "2025-03-01",
  "date_of_last_otp_sms_challenge": "2025-03-02",
  "date_of_last_cvv_challenge": "2025-03-01",
  "date_of_last_ssn_cvv_challenge": "2025-01-01",
  "date_of_last_verid_challenge": "2025-02-15",
  "number_of_email_otp_challenges_failed_last_7d": 0,
  "number_of_sms_otp_challenges_failed_last_7d": 0,

  "number_of_transactions_last_24h": 2,
  "number_of_failed_logins_last_24h": 0,
  "is_transaction_spike": False,

  "is_external_reward_redemption": False,
  "is_balance_transfer": False,

  "is_fraud": False"""},
    {"example": """\
  "transaction_id": "TXN000002",
  "customer_name": "Bob Smith",
  "customer_address": "78 Lakeview Blvd, Lakeside",
  "customer_ip_address": "172.16.0.22",
  "merchant_name": "CreditPay",
  "merchant_address": "789 Bank Street, Lakeside",
  "merchant_ip_address": "172.16.0.30",

  "item_purchased": "N/A",
  "activity_type": "balance transfer",
  "type_of_transaction": "online",
  "transaction_amount": 500.00,
  "average_transaction_amount": 520.00,
  "credit_limit": 10000.00,
  "date_of_last_balance_transfer": "2025-04-01",

  "device_used": "Bob's MacBook Pro",
  "is_device_anomaly": False,
  "is_geo_anomaly": False,

  "date_of_email_change": "2023-06-10",
  "date_of_phone_change": "2023-06-10",
  "date_of_address_change": "2023-06-10",
  "date_of_password_change": "2024-12-20",
  "date_of_account_creation": "2019-05-20",
  "date_of_account_locked": "2025-05-05",
  "date_of_credit_limit_changed": "2022-10-10",
  "account_age_days": 2179,

  "date_of_last_otp_email_challenge": "2025-04-30",
  "date_of_last_otp_sms_challenge": "2025-04-30",
  "date_of_last_cvv_challenge": "2025-04-30",
  "date_of_last_ssn_cvv_challenge": "2024-12-15",
  "date_of_last_verid_challenge": "2025-01-05",
  "number_of_email_otp_challenges_failed_last_7d": 0,
  "number_of_sms_otp_challenges_failed_last_7d": 0,

  "number_of_transactions_last_24h": 1,
  "number_of_failed_logins_last_24h": 0,
  "is_transaction_spike": False,

  "is_external_reward_redemption": False,
  "is_balance_transfer": True,

  "is_fraud": False"""},
]
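
Hand-typing long example strings is error-prone. One possible alternative, sketched below, keeps each seed record as a plain dictionary and serializes it into the `{"example": ...}` shape used above. The helper name `to_example` and the shortened record are illustrative assumptions. Note that `json.dumps` emits JSON literals (`true`/`false`) rather than Python's `True`/`False`.

```python
import json


def to_example(record: dict) -> dict:
    """Serialize one seed record into the {"example": "<text>"} shape used for few-shot prompts."""
    # Strip the outer braces so the text matches the brace-free style of the examples above.
    body = json.dumps(record, indent=2)[1:-1].strip()
    return {"example": body}


# Shortened, hypothetical seed record (only a few of the schema's fields).
seed = {
    "transaction_id": "TXN000001",
    "customer_name": "Alice Johnson",
    "transaction_amount": 25.99,
    "is_fraud": False,
}

example = to_example(seed)
print(example["example"])
```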

## 3. Craft a Prompt Template
The generator doesn't magically know how to create our data; we need to guide it. We do this by creating a prompt template. This template helps instruct the underlying language model on how to produce synthetic data in the desired format.

In [None]:
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

The `FewShotPromptTemplate` includes:

- `prefix` and `suffix`: Guiding context and instructions placed before and after the examples.
- `examples`: The sample data we defined earlier.
- `input_variables`: The variables ("subject", "extra") are placeholders that are filled in later. For instance, "subject" might be filled with "CreditCardFraudData" to guide the model further.
- `example_prompt`: The template that defines the format each example takes in the prompt.
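
To make the roles of these pieces concrete, here is a simplified, library-free sketch of how a few-shot prompt can be assembled from a prefix, formatted examples, and a suffix. It is not LangChain's implementation, and the prefix and suffix strings are placeholders invented for illustration rather than the actual contents of `SYNTHETIC_FEW_SHOT_PREFIX` and `SYNTHETIC_FEW_SHOT_SUFFIX`.

```python
# Placeholder texts; the real prefix and suffix shipped with the library differ.
PREFIX = "Here are examples of synthetic data:"
SUFFIX = "Now generate one new record about {subject}. {extra}"
EXAMPLE_TEMPLATE = "{example}"


def build_few_shot_prompt(examples, subject, extra, separator="\n\n"):
    """Join prefix, formatted examples, and the filled-in suffix into one prompt string."""
    formatted_examples = [EXAMPLE_TEMPLATE.format(**ex) for ex in examples]
    suffix = SUFFIX.format(subject=subject, extra=extra)
    return separator.join([PREFIX, *formatted_examples, suffix])


demo_examples = [
    {"example": '"transaction_id": "TXN000001", "is_fraud": False'},
    {"example": '"transaction_id": "TXN000003", "is_fraud": True'},
]
prompt_text = build_few_shot_prompt(
    demo_examples,
    subject="CreditCardFraudData",
    extra="Vary the merchants and amounts.",
)
print(prompt_text)
```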

## 4. Creating the Data Generator
With the schema and the prompt ready, the next step is to create the data generator. This object knows how to communicate with the underlying language model to get synthetic data.

In [None]:
synthetic_data_generator = create_openai_data_generator(
    output_schema=FraudAttributes,
    llm=ChatOpenAI(temperature=1),
    prompt=prompt_template,
)

## 5. Generate Synthetic Data
Finally, let's get our synthetic data!

In [None]:
synthetic_results = synthetic_data_generator.generate(
    subject="CreditCardFraudData",
    extra="lets have 2 fraud transaction and 8 non fraud transaction",
    runs=10,
)

This command asks the generator to produce 10 synthetic credit card transaction records, one per run. The results are stored in `synthetic_results`. The output will be a list of `FraudAttributes` pydantic models.
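
Because the instruction passed in `extra` is only a request to the model, it is worth checking the label balance of what actually comes back. The sketch below uses stand-in objects: the `SimpleNamespace` records are hypothetical substitutes for the generated `FraudAttributes` instances, and `label_balance` is an illustrative helper name.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-ins for generated records; real results would be FraudAttributes objects.
fake_results = [SimpleNamespace(is_fraud=flag) for flag in [True, False, False, True, False]]


def label_balance(records):
    """Count fraud and non-fraud records and return the fraud share."""
    counts = Counter(bool(r.is_fraud) for r in records)
    total = sum(counts.values())
    fraud_share = counts[True] / total if total else 0.0
    return counts, fraud_share


counts, fraud_share = label_balance(fake_results)
print(counts, fraud_share)
```

The same function can be called on `synthetic_results` directly, since it only reads the `is_fraud` attribute.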

In [None]:
type(synthetic_results)

list

## 6. Visualize the Generated Synthetic Data

In [None]:
len(synthetic_results)

10

In [None]:
synthetic_results

[FraudAttributes(transaction_id='TXN000003', customer_name='Charlie Lee', customer_address='99 Dark Alley, Unknownville', customer_ip_address='185.21.34.109', merchant_name='LuxuryWatches', merchant_address='123 Offshore Plaza', merchant_ip_address='185.21.34.110', item_purchased='Luxury Watch', activity_type='purchase', type_of_transaction='online', transaction_amount=6500.0, average_transaction_amount=200.0, credit_limit=7000.0, date_of_last_balance_transfer=datetime.date(2025, 4, 10), device_used='Unknown Device', is_device_anomaly=True, is_geo_anomaly=True, date_of_email_change=datetime.date(2025, 5, 4), date_of_phone_change=datetime.date(2025, 5, 4), date_of_address_change=datetime.date(2025, 5, 4), date_of_password_change=datetime.date(2025, 5, 4), date_of_account_creation=datetime.date(2025, 3, 1), date_of_account_locked=datetime.date(2025, 5, 5), date_of_credit_limit_changed=datetime.date(2025, 4, 15), account_age_days=66, date_of_last_otp_email_challenge=datetime.date(2025, 5,

## 7. Converting the Synthetic Data into a Pandas DataFrame

In [None]:
import pandas as pd

# Create a list of dictionaries from the objects
synthetic_data = []
for item in synthetic_results:
    synthetic_data.append({
        'transaction_id': item.transaction_id,
        'customer_name': item.customer_name,
        'customer_address': item.customer_address,
        'customer_ip_address': item.customer_ip_address,
        'merchant_name': item.merchant_name,
        'merchant_address': item.merchant_address,
        'merchant_ip_address': item.merchant_ip_address,

        # Transaction details
        'item_purchased': item.item_purchased,
        'activity_type': item.activity_type,
        'type_of_transaction': item.type_of_transaction,
        'transaction_amount': item.transaction_amount,
        'average_transaction_amount': item.average_transaction_amount,
        'credit_limit': item.credit_limit,
        'date_of_last_balance_transfer': item.date_of_last_balance_transfer,

        # Device and network
        'device_used': item.device_used,
        'is_device_anomaly': item.is_device_anomaly,
        'is_geo_anomaly': item.is_geo_anomaly,

        # Account changes and lifecycle
        'date_of_email_change': item.date_of_email_change,
        'date_of_phone_change': item.date_of_phone_change,
        'date_of_address_change': item.date_of_address_change,
        'date_of_password_change': item.date_of_password_change,
        'date_of_account_creation': item.date_of_account_creation,
        'date_of_account_locked': item.date_of_account_locked,
        'date_of_credit_limit_changed': item.date_of_credit_limit_changed,
        'account_age_days': item.account_age_days,

        # Authentication / Challenge history
        'date_of_last_otp_email_challenge': item.date_of_last_otp_email_challenge,
        'date_of_last_otp_sms_challenge': item.date_of_last_otp_sms_challenge,
        'date_of_last_cvv_challenge': item.date_of_last_cvv_challenge,
        'date_of_last_ssn_cvv_challenge': item.date_of_last_ssn_cvv_challenge,
        'date_of_last_verid_challenge': item.date_of_last_verid_challenge,
        'number_of_email_otp_challenges_failed_last_7d': item.number_of_email_otp_challenges_failed_last_7d,
        'number_of_sms_otp_challenges_failed_last_7d': item.number_of_sms_otp_challenges_failed_last_7d,

        # Behavioral signals
        'number_of_transactions_last_24h': item.number_of_transactions_last_24h,
        'number_of_failed_logins_last_24h': item.number_of_failed_logins_last_24h,
        'is_transaction_spike': item.is_transaction_spike,

        # Risky transaction types
        'is_external_reward_redemption': item.is_external_reward_redemption,
        'is_balance_transfer': item.is_balance_transfer,

        # Additional field
        'is_fraud': item.is_fraud  # Add the fraud flag
    })

# Create a Pandas DataFrame from the list of dictionaries
synthetic_df = pd.DataFrame(synthetic_data)

# Display the DataFrame
print(type(synthetic_df))
synthetic_df

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,transaction_id,customer_name,customer_address,customer_ip_address,merchant_name,merchant_address,merchant_ip_address,item_purchased,activity_type,type_of_transaction,...,date_of_last_ssn_cvv_challenge,date_of_last_verid_challenge,number_of_email_otp_challenges_failed_last_7d,number_of_sms_otp_challenges_failed_last_7d,number_of_transactions_last_24h,number_of_failed_logins_last_24h,is_transaction_spike,is_external_reward_redemption,is_balance_transfer,is_fraud
0,TXN000003,Charlie Lee,"99 Dark Alley, Unknownville",185.21.34.109,LuxuryWatches,123 Offshore Plaza,185.21.34.110,Luxury Watch,purchase,online,...,2025-05-05,2025-05-05,3,2,10,5,True,False,False,True
1,TXN000001,Alice Johnson,"123 Elm Street, Springfield",192.168.1.10,Bookstore Inc,"456 Main Street, Springfield",192.168.1.20,Book,purchase,POS,...,2025-01-01,2025-02-15,0,0,2,0,False,False,False,False
2,TXN000002,Bob Smith,"78 Lakeview Blvd, Lakeside",172.16.0.22,CreditPay,"789 Bank Street, Lakeside",172.16.0.30,,balance transfer,online,...,2024-12-15,2025-01-05,0,0,1,0,False,False,True,False
3,TXN000003,Charlie Lee,"99 Dark Alley, Unknownville",185.21.34.109,LuxuryWatches,123 Offshore Plaza,185.21.34.110,Luxury Watch,purchase,online,...,2025-05-05,2025-05-05,3,2,10,5,True,False,False,True
4,TXN000001,Alice Johnson,"123 Elm Street, Springfield",192.168.1.10,Bookstore Inc,"456 Main Street, Springfield",192.168.1.20,Book,purchase,POS,...,2025-03-01,2025-03-01,0,0,2,0,False,False,False,False
5,TXN000002,Bob Smith,"78 Lakeview Blvd, Lakeside",172.16.0.22,CreditPay,"789 Bank Street, Lakeside",172.16.0.30,,balance transfer,online,...,2024-12-15,2025-01-05,0,0,1,0,False,False,True,False
6,TXN000003,Charlie Lee,"99 Dark Alley, Unknownville",185.21.34.109,LuxuryWatches,123 Offshore Plaza,185.21.34.110,Luxury Watch,purchase,online,...,2025-05-05,2025-05-05,3,2,10,5,True,False,False,True
7,TXN000001,Alice Johnson,"123 Elm Street, Springfield",192.168.1.10,Bookstore Inc,"456 Main Street, Springfield",192.168.1.20,Book,purchase,POS,...,2025-03-01,2025-03-01,0,0,2,0,False,False,False,False
8,TXN000002,Bob Smith,"78 Lakeview Blvd, Lakeside",172.16.0.22,CreditPay,"789 Bank Street, Lakeside",172.16.0.30,,balance transfer,online,...,2024-12-15,2025-01-05,0,0,1,0,False,False,True,False
9,TXN000003,Charlie Lee,"99 Dark Alley, Unknownville",185.21.34.109,LuxuryWatches,123 Offshore Plaza,185.21.34.110,Luxury Watch,purchase,online,...,2025-05-05,2025-05-05,3,2,10,5,True,False,False,True
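
The preview above repeats the same transaction IDs several times, so it is worth checking how many generated rows are actually distinct before using them. A minimal sketch on plain dictionaries follows; the rows are shortened stand-ins and the helper name is an assumption. With pandas, `synthetic_df.duplicated()` offers a comparable check.

```python
from collections import Counter

# Shortened stand-in rows; real rows would carry every schema field.
rows = [
    {"transaction_id": "TXN000003", "customer_name": "Charlie Lee"},
    {"transaction_id": "TXN000001", "customer_name": "Alice Johnson"},
    {"transaction_id": "TXN000003", "customer_name": "Charlie Lee"},
]


def duplicate_report(records, key="transaction_id"):
    """Return the key values that occur more than once, with their counts."""
    counts = Counter(r[key] for r in records)
    return {k: n for k, n in counts.items() if n > 1}


dupes = duplicate_report(rows)
print(dupes)
```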


In [None]:
synthetic_df.shape
!pip install openpyxl




### Start exploring based on your use case, and use the same approach for real sensitive data. Be careful, though: synthetic data might not capture real-world complexities.

In [None]:
synthetic_df.to_excel('output_file.xlsx', index=False)

In [None]:
!pip install sdv

from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata



In [None]:
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=synthetic_df)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(synthetic_df)

synthetic_data = synthesizer.sample(num_rows=100000)

synthetic_data.to_excel('synthetic_output.xlsx', index=False)




In [None]:
synthetic_data

Unnamed: 0,transaction_id,customer_name,customer_address,customer_ip_address,merchant_name,merchant_address,merchant_ip_address,item_purchased,activity_type,type_of_transaction,...,date_of_last_ssn_cvv_challenge,date_of_last_verid_challenge,number_of_email_otp_challenges_failed_last_7d,number_of_sms_otp_challenges_failed_last_7d,number_of_transactions_last_24h,number_of_failed_logins_last_24h,is_transaction_spike,is_external_reward_redemption,is_balance_transfer,is_fraud
0,TXN000002,Bob Smith,"123 Elm Street, Springfield",38e8:1b4:c581:4cd8:50aa:60fd:11e8:3604,CreditPay,"789 Bank Street, Lakeside",c6ed:1396:a698:8d3e:aa08:99c6:20da:51f1,,balance transfer,POS,...,866-03-9576,2025-01-06,dallen@example.net,0,1,0,False,False,True,False
1,TXN000002,Alice Johnson,"123 Elm Street, Springfield",a75c:8282:8f6d:1fe9:bf51:21a9:d4e8:fea1,CreditPay,"789 Bank Street, Lakeside",cf2e:b98b:5b5b:8b1b:60fb:8bba:2834:c561,,purchase,POS,...,057-18-4825,2025-01-07,morgan40@example.org,0,1,0,False,False,False,False
2,TXN000001,Charlie Lee,"99 Dark Alley, Unknownville",ed0b:e17:8c3c:30e6:e7c3:4921:e980:5393,LuxuryWatches,123 Offshore Plaza,b656:d6de:ecfc:d013:fe55:6ea9:df29:adb5,Luxury Watch,purchase,online,...,605-84-5762,2025-05-04,osutton@example.org,2,10,5,False,False,False,False
3,TXN000002,Alice Johnson,"123 Elm Street, Springfield",f45:d835:3492:c658:297b:c4ad:b53b:ab70,Bookstore Inc,123 Offshore Plaza,8c29:fe6:d4ff:6052:39ea:cfdf:a630:d946,Luxury Watch,purchase,online,...,072-32-9328,2025-03-30,bradleytaylor@example.com,0,10,0,False,False,False,False
4,TXN000002,Bob Smith,"78 Lakeview Blvd, Lakeside",3411:8b28:846a:68d4:fa11:db5d:8db4:3bf1,CreditPay,"789 Bank Street, Lakeside",40b7:5b2:9f21:855c:dbb3:3017:e008:496f,,balance transfer,online,...,172-95-7508,2025-02-07,youngerica@example.org,0,1,0,True,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,TXN000001,Bob Smith,"123 Elm Street, Springfield",27d8:64ac:784f:185c:cb18:9e83:3866:1d15,Bookstore Inc,"789 Bank Street, Lakeside",8c1a:dce4:5606:39f3:2f5:54f8:328:8fc,,balance transfer,online,...,699-71-3176,2025-01-24,ebarnes@example.com,0,1,0,False,False,True,False
99996,TXN000001,Alice Johnson,"99 Dark Alley, Unknownville",f02c:9e5b:51f1:c511:3bdc:9654:ae13:582,LuxuryWatches,"456 Main Street, Springfield",b130:4f9e:530f:7a15:6811:c6bc:2ca1:4cb9,Book,balance transfer,online,...,038-82-3375,2025-05-04,travismerritt@example.org,2,10,5,True,False,False,False
99997,TXN000002,Bob Smith,"78 Lakeview Blvd, Lakeside",5213:5a4c:42e4:39a:d40a:24d:3c2b:68a1,CreditPay,"456 Main Street, Springfield",d896:85bf:404a:3a01:32a:cbc8:5f60:86ab,,purchase,POS,...,255-22-6450,2025-01-05,lopezerika@example.org,0,1,0,False,False,False,False
99998,TXN000002,Bob Smith,"123 Elm Street, Springfield",f44:2d2a:c80:8b67:a95b:3ee3:536:2b03,CreditPay,"456 Main Street, Springfield",5007:e0ba:9989:5f13:725e:5b5e:905d:abd9,Book,purchase,online,...,882-22-0964,2025-02-03,joseph08@example.org,0,1,0,False,False,False,False
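
In the sample above, `date_of_last_ssn_cvv_challenge` holds SSN-like strings and `number_of_email_otp_challenges_failed_last_7d` holds email-like strings, neither of which matches the schema. One way to catch this kind of drift before feature engineering is to compare each value against the type the schema expects. The sketch below does this on plain dictionaries; `EXPECTED_TYPES` covers only a few columns and the helper name is hypothetical. Values read from a pandas DataFrame are NumPy types, so the comparison would need adapting there.

```python
# Expected Python types for a few columns (a hypothetical subset of the schema).
EXPECTED_TYPES = {
    "number_of_email_otp_challenges_failed_last_7d": int,
    "number_of_sms_otp_challenges_failed_last_7d": int,
    "is_fraud": bool,
}

# Stand-in rows; the email-like values mirror the sampled output shown above.
sampled_rows = [
    {"number_of_email_otp_challenges_failed_last_7d": "dallen@example.net",
     "number_of_sms_otp_challenges_failed_last_7d": 0, "is_fraud": False},
    {"number_of_email_otp_challenges_failed_last_7d": "morgan40@example.org",
     "number_of_sms_otp_challenges_failed_last_7d": 0, "is_fraud": False},
]


def find_type_mismatches(rows, expected):
    """Collect, per column, the values whose type differs from the expected one."""
    mismatches = {}
    for row in rows:
        for column, expected_type in expected.items():
            value = row.get(column)
            # Compare exact types so that a bool is not accepted where an int is expected.
            if type(value) is not expected_type:
                mismatches.setdefault(column, []).append(value)
    return mismatches


mismatches = find_type_mismatches(sampled_rows, EXPECTED_TYPES)
print(mismatches)
```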


In [None]:
# 📦 Step 0: Imports
import pandas as pd
import numpy as np
from datetime import datetime

# Load your DataFrame
df = synthetic_data

# Ensure datetime columns are parsed
date_cols = [
    "date_of_last_balance_transfer", "date_of_email_change", "date_of_phone_change",
    "date_of_address_change", "date_of_password_change", "date_of_account_creation",
    "date_of_account_locked", "date_of_credit_limit_changed", "date_of_last_otp_email_challenge",
    "date_of_last_otp_sms_challenge", "date_of_last_cvv_challenge", "date_of_last_ssn_cvv_challenge",
    "date_of_last_verid_challenge"
]
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors='coerce')

# Add a transaction date column if not already present
df["transaction_date"] = datetime.now()  # Replace with real timestamp if available

# 🧮 Step 1: Feature Engineering – Time Gaps
df["days_since_email_change"] = (df["transaction_date"] - df["date_of_email_change"]).dt.days
df["days_since_last_otp_sms"] = (df["transaction_date"] - df["date_of_last_otp_sms_challenge"]).dt.days
df["days_since_password_change"] = (df["transaction_date"] - df["date_of_password_change"]).dt.days
df["days_since_account_creation"] = (df["transaction_date"] - df["date_of_account_creation"]).dt.days

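# 🧮 Step 2: Behavioral Feature Engineering
# The sampled count columns may contain non-numeric strings (see the table above),
# so coerce them to numbers first; values that cannot be parsed become NaN.
otp_cols = ["number_of_email_otp_challenges_failed_last_7d", "number_of_sms_otp_challenges_failed_last_7d"]
for col in otp_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')
df["transaction_velocity"] = df["number_of_transactions_last_24h"] / 24
df["total_failed_challenges_7d"] = df[otp_cols[0]] + df[otp_cols[1]]
df["amount_vs_average_ratio"] = df["transaction_amount"] / df["average_transaction_amount"]
df["account_age_vs_credit_limit"] = df["account_age_days"] / df["credit_limit"]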

# 🎯 Step 3: Association Rule Mining on Fraud Cases
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Simplify to binary/categorical
df_mining = df[df["is_fraud"] == True].copy()

# Select categorical and bin numeric data
features_for_rules = [
    "device_used", "is_device_anomaly", "is_geo_anomaly", "is_transaction_spike",
    "is_external_reward_redemption", "is_balance_transfer"
]
df_mining["amount_bin"] = pd.qcut(df_mining["transaction_amount"], q=4, duplicates='drop')
df_mining["velocity_bin"] = pd.qcut(df_mining["transaction_velocity"], q=4, duplicates='drop')

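# Prefix each value with its column name so items stay distinguishable;
# otherwise "True" from different columns collapses into a single item.
rule_cols = features_for_rules + ["amount_bin", "velocity_bin"]
df_rule_rows = [
    [f"{col}={value}" for col, value in zip(rule_cols, row)]
    for row in df_mining[rule_cols].astype(str).values.tolist()
]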

# Run association mining
te = TransactionEncoder()
te_ary = te.fit(df_rule_rows).transform(df_rule_rows)
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

frequent_items = apriori(df_encoded, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_items, metric="confidence", min_threshold=0.8)
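# Show both sides of each rule; without the consequents the rules cannot be read
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].sort_values(by="lift", ascending=False))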

# 🌳 Step 4: Shallow Decision Tree for Rule Discovery
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split

features = [
    "transaction_amount", "amount_vs_average_ratio", "days_since_email_change",
    "days_since_last_otp_sms", "total_failed_challenges_7d", "transaction_velocity",
    "account_age_vs_credit_limit", "is_device_anomaly", "is_geo_anomaly", "is_transaction_spike"
]

X = df[features].copy()
y = df["is_fraud"]

# Encode booleans
for col in X.select_dtypes(include='bool').columns:
    X[col] = X[col].astype(int)

# Fill missing values
X.fillna(-1, inplace=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Train a shallow tree so the printed rules stay readable
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)

print(export_text(clf, feature_names=list(X.columns)))

# 🌀 Optional Step 5: Clustering for Visual Pattern Discovery
# from sklearn.preprocessing import StandardScaler
# import umap
# import hdbscan
# import matplotlib.pyplot as plt

# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X)
# embedding = umap.UMAP(n_neighbors=15, min_dist=0.3).fit_transform(X_scaled)
# clusterer = hdbscan.HDBSCAN(min_cluster_size=20).fit(embedding)

# plt.scatter(embedding[:, 0], embedding[:, 1], c=clusterer.labels_, cmap='Spectral', s=10)
# plt.title("Fraud Pattern Clusters")
# plt.show()

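TypeError: can only concatenate str (not "int") to str

This error is raised by the `total_failed_challenges_7d` sum whenever `number_of_email_otp_challenges_failed_last_7d` holds strings. In the sampled table above, that column contains email-like values and `date_of_last_ssn_cvv_challenge` contains SSN-like values, which suggests that SDV's metadata auto-detection inferred PII types from the column names. Overriding those column types in the metadata before fitting (for example with `metadata.update_column`), or coercing the columns with `pd.to_numeric(..., errors='coerce')` before summing, avoids it.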
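
Step 3 of the cell above ranks association rules by confidence and lift. To show what those numbers mean, the sketch below computes support, confidence, and lift by hand for a single rule on a few toy transactions. The data and function names are hypothetical, and this is a plain illustration of the definitions rather than a substitute for `apriori`.

```python
# Toy transactions: each set lists the items observed in one fraud case (hypothetical data).
transactions = [
    {"is_geo_anomaly=True", "is_device_anomaly=True", "is_transaction_spike=True"},
    {"is_geo_anomaly=True", "is_device_anomaly=True"},
    {"is_geo_anomaly=True", "is_transaction_spike=True"},
    {"is_device_anomaly=True"},
]


def support(itemset, rows):
    """Fraction of rows that contain every item in the itemset."""
    return sum(itemset <= row for row in rows) / len(rows)


def rule_metrics(antecedent, consequent, rows):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    joint = support(antecedent | consequent, rows)
    confidence = joint / support(antecedent, rows)
    lift = confidence / support(consequent, rows)
    return {"support": joint, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"is_geo_anomaly=True"}, {"is_device_anomaly=True"}, transactions)
print(metrics)
```

A lift below 1, as in this toy case, means the two items co-occur less often than they would if they were independent; rules worth investigating typically have a lift well above 1.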