# 01 - Generate Synthetic Login Data

This notebook generates synthetic login events for training the identity risk scoring model, writes them to `../data/logins.parquet`, and validates the result.

## Dataset Characteristics
- **10,000 login events** over 30 days
- **500 unique users** across 5 tenants
- **10% fraud rate** with realistic attack patterns
- Features: user_id, tenant_id, timestamp, ip, device_id, location (used below as `location_country`), success, mfa_used, vpn_detected
- Label: is_fraudulent

In [None]:
import sys
# Add the parent directory to the import path so the `src` package resolves
sys.path.insert(0, '..')

import pandas as pd
from src.core.data_generator import generate_logins

## Generate Dataset

In [None]:
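# Generate 10k login events for 500 users over the past 30 days with a 10% fraud rate;
# the result is also saved to output_path (it is read back in "Verify Output")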
df = generate_logins(
    n_events=10000,
    n_users=500,
    fraud_rate=0.10,
    days_back=30,
    output_path='../data/logins.parquet'
)
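
The internals of `generate_logins` are not shown in this notebook. As a rough illustration of how a generator with these parameters could work, here is a minimal standard-library sketch. Everything in it is an assumption made for illustration: the function name `sketch_generate_logins`, the `n_tenants` and `seed` parameters, the country list, the fixed end date, and all probabilities and attack patterns are hypothetical and are not taken from `src.core.data_generator`.

```python
import random
from datetime import datetime, timedelta

# Hypothetical stand-in for illustration; NOT the project's generate_logins.
COUNTRIES = ["US", "GB", "DE", "IN", "BR"]  # assumed country list


def sketch_generate_logins(n_events, n_users, n_tenants, fraud_rate, days_back, seed=0):
    """Return a list of synthetic login events as dicts (illustrative only)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    end = datetime(2024, 1, 31)  # arbitrary fixed end date, for reproducibility
    # Each user belongs to one tenant and has a usual device and country.
    users = [
        {
            "user_id": f"user_{i:04d}",
            "tenant_id": f"tenant_{i % n_tenants}",
            "device_id": f"dev_{i:04d}",
            "location_country": rng.choice(COUNTRIES),
        }
        for i in range(n_users)
    ]
    # Label an exact number of events as fraudulent, then shuffle the labels.
    n_fraud = round(n_events * fraud_rate)
    labels = [True] * n_fraud + [False] * (n_events - n_fraud)
    rng.shuffle(labels)

    events = []
    for is_fraud in labels:
        user = rng.choice(users)
        event = {
            "user_id": user["user_id"],
            "tenant_id": user["tenant_id"],
            "timestamp": end - timedelta(seconds=rng.randrange(days_back * 86400)),
            "ip": ".".join(str(rng.randrange(1, 255)) for _ in range(4)),
            "device_id": user["device_id"],
            "location_country": user["location_country"],
            "success": rng.random() < 0.95,
            "mfa_used": rng.random() < 0.60,
            "vpn_detected": rng.random() < 0.05,
            "is_fraudulent": is_fraud,
        }
        if is_fraud:
            # Assumed attack pattern: unfamiliar device and country, more VPN
            # use, less MFA, and a lower success rate.
            event["device_id"] = f"dev_unknown_{rng.randrange(10**6):06d}"
            event["location_country"] = rng.choice(
                [c for c in COUNTRIES if c != user["location_country"]]
            )
            event["success"] = rng.random() < 0.40
            event["mfa_used"] = rng.random() < 0.10
            event["vpn_detected"] = rng.random() < 0.50
        events.append(event)
    events.sort(key=lambda e: e["timestamp"])  # chronological order
    return events


events = sketch_generate_logins(
    n_events=10000, n_users=500, n_tenants=5, fraud_rate=0.10, days_back=30
)
```

Because the sketch fixes the number of fraudulent events up front instead of sampling each label independently, the fraud rate comes out at exactly the requested value. Passing the list to `pd.DataFrame(events)` would give a frame shaped like the one explored below.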

## Explore the Data

In [None]:
# Schema
print("Schema:")
print(df.dtypes)
print(f"\nShape: {df.shape}")

In [None]:
# Sample rows
df.head(10)

In [None]:
# Distribution by tenant
print("Events by Tenant:")
print(df['tenant_id'].value_counts())

In [None]:
# Fraud patterns
print("\nFraud vs Normal Comparison:")
# Named aggregation sets the output column names directly
comparison = df.groupby('is_fraudulent').agg(
    success_rate=('success', 'mean'),
    mfa_rate=('mfa_used', 'mean'),
    vpn_rate=('vpn_detected', 'mean'),
).round(3)
print(comparison)

In [None]:
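# Top 10 countries for fraudulent logins
# (astype(bool) keeps the mask valid even if is_fraudulent is stored as 0/1)
print("\nFraudulent Login Locations:")
fraud_mask = df['is_fraudulent'].astype(bool)
print(df.loc[fraud_mask, 'location_country'].value_counts().head(10))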

## Verify Output

In [None]:
# Verify parquet file
df_loaded = pd.read_parquet('../data/logins.parquet')
print(f"Loaded {len(df_loaded)} rows from parquet")
assert len(df_loaded) == 10000, f"Expected 10,000 rows, got {len(df_loaded)}"
fraud_share = df_loaded['is_fraudulent'].mean()
assert abs(fraud_share - 0.10) < 0.01, f"Expected ~10% fraud rate, got {fraud_share:.3f}"
print("All validations passed!")
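
One way to sanity-check the 0.01 tolerance in the fraud-rate assertion is to compare it with the standard error of a sample proportion. The sketch below assumes each event is labeled fraudulent independently with probability `fraud_rate`. The notebook does not confirm how `generate_logins` assigns labels, so treat that as an assumption; if the generator instead fixes the exact number of fraudulent events, the observed rate would match the target and the check would pass trivially.

```python
import math

n_events = 10000
fraud_rate = 0.10
tolerance = 0.01

# Standard error of a sample proportion, assuming independent per-event labels.
std_error = math.sqrt(fraud_rate * (1 - fraud_rate) / n_events)
print(f"standard error: {std_error:.4f}")  # prints "standard error: 0.0030"
print(f"tolerance in standard errors: {tolerance / std_error:.1f}")  # prints 3.3
```

Under that assumption the tolerance sits a little over three standard errors from the target, so a correctly configured generator should rarely trip the assertion by chance.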