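A collection of paper-reading notes for the anti-fraud domain, plus applied examples of fraud detection based on traditional machine learning and graph algorithms. In case of any copyright infringement, here is my contact information: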
- Linkedin: Zixi Liu
Summary of classic papers
- Walkthroughs of classic anti-fraud algorithm papers
- Awesome Collection of Fraud Detection Papers
- Trustworthy Machine Learning
- Social Networks and Web Security and Privacy
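- Attention and Transformer quick links
Summary of common anti-fraud models and strategies:
- [Worldline: survey of credit card transaction fraud models] Context-aware credit card fraud detection
- [Survey of credit card fraud research] Credit card fraud detection using machine learning: A survey
- [eBay] Exploring transaction anti-fraud based on user behavior
- Fraud detection based on graph neural networks
- TitAnt: Online real-time transaction fraud detection at Ant Financial
- How to design an effective model deployment strategy
- [Survey of underground-economy anti-fraud] 风控要略-互联网业务反欺诈之路 (Essentials of Risk Control: The Road to Anti-Fraud for Internet Businesses)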
- WhatsApp Spam Detection
- Account Takeover (ATO)
- Stolen Financials
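Common anti-fraud client-side components:
- Device fingerprinting (Device Fingerprint)
- Biometric probe (Biometrics-Based Authentication)
- Smart CAPTCHA (Verification Code)
- Modeling based on graphs (short-time cross-sections) and user behavior sequences (longer time windows)
- Data preprocessing, feature selection, and dimensionality reduction
- [Survey] Common features and response options in adversarial risk control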
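1. Preprocessing
Feature cleaning
- Remove anomalous samples
- Sampling: handle positive/negative class imbalance, for example with sample weights
Numerical features
- Normalization (Normalizer): handles features on different scales (e.g. min-max, scaling to [-1, 1], z-score, log-based, L2, Gauss Rank, Robust Scaling)
- Binning: handles unevenly distributed feature values (supervised binning such as chi-square binning and decision-tree binning; unsupervised binning such as fixed-width binning and quantile binning)
- Discretizing continuous features introduces a non-linear transformation, reduces sensitivity to outliers, and helps prevent overfitting.
- In tree-based models, high-cardinality features tend to dominate low-cardinality features.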
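The scaling and binning options listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the `amounts` sample are hypothetical, and a real pipeline would more likely rely on a library implementation. It shows why quantile binning copes better than fixed-width binning when a feature has an extreme outlier.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center to mean 0 and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def equal_width_bins(values, n_bins):
    """Fixed-width binning: split [min, max] into n_bins equal intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def quantile_bins(values, n_bins):
    """Quantile binning by rank: each bin gets roughly the same number of rows.

    Ties are broken by position, which is a simplification of this sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical transaction amounts with one extreme outlier.
amounts = [5.0, 12.0, 20.0, 35.0, 50.0, 10000.0]
width_bins = equal_width_bins(amounts, 3)   # outlier pushes everything else into bin 0
rank_bins = quantile_bins(amounts, 3)       # rows are spread evenly across bins
```

With the sample above, fixed-width binning leaves the middle bin empty, while rank-based binning keeps two rows per bin.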
Categorical features
- Feature crosses (FM) / Binning / Count Encoding / Target Encoding / Odds Ratio / correlation coefficient / WOE
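Two of the encodings named above, Count Encoding and WOE (Weight of Evidence), are easy to illustrate. The sketch below is a minimal version, not a reference implementation: the sign convention (log of good share over bad share), the additive `smoothing` term, and the `device` / `is_fraud` sample data are all assumptions made for the example.

```python
import math
from collections import Counter

def count_encode(categories):
    """Replace each category with how often it occurs in the column."""
    counts = Counter(categories)
    return [counts[c] for c in categories]

def woe_table(categories, labels, smoothing=0.5):
    """Weight of Evidence per category, with label 1 = bad (fraud) and 0 = good.

    Convention assumed here: WOE = ln(share of goods / share of bads), so a
    negative value marks a category that is riskier than average.
    `smoothing` is added to every count to avoid log(0) on pure categories.
    """
    good = Counter(c for c, y in zip(categories, labels) if y == 0)
    bad = Counter(c for c, y in zip(categories, labels) if y == 1)
    levels = sorted(set(categories))
    good_total = sum(good[c] + smoothing for c in levels)
    bad_total = sum(bad[c] + smoothing for c in levels)
    return {
        c: math.log(((good[c] + smoothing) / good_total)
                    / ((bad[c] + smoothing) / bad_total))
        for c in levels
    }

# Hypothetical device types and fraud labels.
device = ["ios", "ios", "android", "android", "android", "emulator", "emulator"]
is_fraud = [0, 0, 0, 0, 1, 1, 1]
encoded = count_encode(device)
woe = woe_table(device, is_fraud)
```

In this toy sample the `emulator` level gets a negative WOE and `ios` a positive one. Because WOE uses the label, it should be fitted on training data only to avoid target leakage.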
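Time-series features
- Statistics of historical events aggregated by time window
Some feature engineering ideas from search and ranking: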
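2. Feature Selection
High-quality features are informative, discriminative, mutually independent, and easy to interpret.
Approaches to sparse variables in feature engineering:
- Sparse Representation and Dimension Reduction
- Dimension reduction expresses data from the original space in a subspace (the space spanned by the selected major eigenvectors); sparse representation expresses it in a union of subspaces.
- Sparse representation acts as a regularizer for natural signals.
Common feature selection methods:
- Filter methods (based on the association between the target variable and each independent variable)
- Univariate feature filtering: chi-square test, ANOVA, information gain, etc.
- Multivariate feature filtering
- [Multivariate feature filtering] Relief-Based Feature Selection: Introduction and Review
- [Spectral graph] Spectral Feature Selection for Supervised and Unsupervised Learning
- Wrapper methods
- Add a feature only if it improves the objective (e.g. AUC or MSE).
- Embedded methods
- The learner selects features automatically: L1 regularization, L2 (Ridge, which shrinks coefficients but does not zero them out), decision trees, information gain, deep learning, etc.
- Graph-based feature selection
- Kernel density estimation (KDE): compare feature distributions across train, valid, and test under each label, and keep the features that are stable and effective.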
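As an illustration of the univariate filter approach listed above, the following sketch ranks discrete features by information gain and keeps the top ones. The helper names (`information_gain`, `filter_features`) and the toy columns are hypothetical; continuous features would need to be binned first.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus their entropy after splitting on a discrete feature."""
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def filter_features(columns, labels, top_k):
    """Rank feature names by information gain and keep the top_k."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical toy data: "new_device" separates the labels, "weekday" does not.
labels = [1, 1, 0, 0]
columns = {
    "weekday": ["mon", "tue", "mon", "tue"],
    "new_device": ["yes", "yes", "no", "no"],
}
selected = filter_features(columns, labels, top_k=1)
```

A filter like this scores each feature in isolation, so it cannot detect redundancy between features; that is the gap the multivariate and wrapper methods above are meant to cover.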
3. Data Augmentation
- [Snap Research] Data Augmentation for Graph Neural Networks
- [Knowledge Graph] Injecting Semantic Background Knowledge into Neural Networks using Graph Embeddings
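Behavior sequences, node vectorization for relationship networks, and text features can all borrow techniques from text processing.
Introduction to text mining
Embeddings (low-dimensional vectors obtained by mapping features extracted from raw data through a neural network)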
- [理解Word Representation] Linguistic Regularities in Continuous Space Word Representations
- [Word2Vec] Efficient Estimation of Word Representations in Vector Space
- [Word2Vec] word2vec Parameter Learning Explained
- Embeddings of Categorical Variables for Sequential Data in Fraud Context
- [Airbnb] Hotel2vec: Learning Attribute-Aware Hotel Embeddings with Self-Supervision
- [Doordash] Store2vec: Personalized Store Feed with Vector Embeddings
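Address text parsing and address relevance tasks
- Address element parsing splits address text into semantically independent elements and identifies the type of each element.
Text clustering / Topic Models
Classic Tree-Based Models
Decision trees and boosting models built on trees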
- Random Forest Overview
- The Elements of Statistical Learning, <Chapter 9, 10>
- Comparative Study of RF and SVM on Credit Card Fraud Detection
The Boosting family
- [1999] Greedy function approximation: A gradient boosting machine
- [2000] Additive logistic regression: A statistical view of boosting
- XGBoost: A Scalable Tree Boosting System
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree
Recurrent Neural Networks
RNNs achieve memory by reusing the same cell structure at every time step.
- [JD Finance] Session-Based Fraud Detection in Online E-Commerce Transactions Using Recurrent Neural Networks
- [Alibaba] Online Credit Payment Fraud Detection via Structure-Aware Hierarchical Recurrent Neural Network
- E.T.-RNN: Applying Deep Learning to Credit Loan Applications
Long Short-Term Memory (LSTM)
An LSTM learns gate parameters that decide which information is kept and which is forgotten.
- [Sepp Hochreiter, Jürgen Schmidhuber] Long Short-Term Memory
- A Tutorial into LSTM
- [Linkedin] Using deep learning to detect abusive sequences of member activity
Ads / RecSys / user behavior sequence models
- [Wide&Deep] Wide & Deep Learning for Recommender Systems (Google 2016)
- [DIN] Deep Interest Network for Click-Through Rate Prediction (Alibaba 2018)
- [SIM] Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction (Alibaba 2020)
- [MIMN] Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction (Alibaba 2019)
- Sequential Modeling with Multiple Attributes for Watchlist Recommendation in E-Commerce (Ebay 2021)
- [Enrich Deep Model] Sequential Behavioral Data Processing Using Deep Learning and the Markov Transition Field in Online Fraud Detection
- [Attention mechanism] Attention Is All You Need
Network Representation Learning
Graph embedding models: DeepWalk, node2vec, LINE, SDNE, Struc2Vec, GraRep.
- [DeepWalk]: Online Learning of Social Representations
- [node2vec]: Scalable Feature Learning for Networks
- [LINE]: Large-scale Information Network Embedding
- [SDNE]: Structural Deep Network Embedding
- [Struc2Vec]: Learning Node Representations from Structural Identity
- [GraRep]: Learning Graph Representations with Global Structural Information
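To make the embedding models above more concrete, here is a minimal sketch of the first stage of a DeepWalk-style pipeline: generating truncated uniform random walks over a graph. The walks would then be treated as sentences and passed to a skip-gram (Word2Vec) model, which is not shown. The `random_walks` function, its parameters, and the sample `graph` are hypothetical; node2vec differs by biasing the walk transitions, which this sketch does not do.

```python
import random

def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate truncated uniform random walks starting from every node."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Hypothetical undirected graph: accounts a, b, c are mutually linked; d is linked to c only.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
walks = random_walks(graph, walk_length=5, walks_per_node=2)
```

Nodes that co-occur often within walks end up with similar embeddings, which is why tightly connected groups of accounts tend to cluster together in the embedding space.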
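Applications of graph learning in anti-fraud:
Some ideas for addressing over-smoothing in anomaly detection (where the representations of anomalous and normal nodes become hard to tell apart):
- Aggregate neighborhood information from multiple views using attention
- Aggregate neighborhood information from different classes using resampling
- Design auxiliary loss functions to guide training
Community Detection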
- [Graph Laplacian] Eigenvalues of the Laplacian and Their Relationship To The Connectedness of a Graph
- EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs
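- Community detection algorithm libraries and visualization
- Summary of community detection algorithms
Graphical Models
Use relationship networks to identify anomalous structures and communities in the network.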
- [JD Finance] Graph mining assisted semi-supervised learning for fraudulent cash-out detection
- A novel approach for automated credit card transaction fraud detection using network-based extensions
- [Capital One (2019)] DeepTrax: Embedding Graphs of Financial Transactions
- A Comprehensive Survey on Graph Anomaly Detection with Deep Learning
Introduction to graph neural networks
- Graph neural networks: A review of methods and applications
- [GCN] Semi-Supervised Classification with Graph Convolutional Networks
- [GraphSage] Inductive Representation Learning on Large Graphs
- [GAT] Graph Attention Networks
Anti-fraud graph neural network frameworks
- [KDD 2016] FRAUDAR: Bounding Graph Fraud in the Face of Camouflage
- [Ant Group - AGL] AGL: A Scalable System for Industrial-purpose Graph Machine Learning
- [Ant Group - GeniePath] GeniePath: Graph Neural Networks with Adaptive Receptive Paths
- [Ant Group - SemiGNN] A Semi-supervised Graph Attentive Network for Financial Fraud Detection
- [Alibaba] AliGraph: A Comprehensive Graph Neural Network Platform
- [PyG] Pytorch-Geometric implementation of a series of Graph Neural Network (GNN) based fake news detection models
- PyTorch-BigGraph
Online transaction payments
- [Online Payment Services] Representing Fine-Grained Co-Occurrences for Behavior-Based Fraud Detection in Online Payment Services
- [Live-Streaming] Live-Streaming Fraud Detection: A Heterogeneous Graph Neural Network Approach
Account security
- [Facebook - SynchroTrap] Uncovering Large Groups of Active Malicious Accounts in Online Social Networks
- SocialWatch: Detection of Online Service Abuse via Large-Scale Social Graphs
- [Wechat] Unveiling Fake Accounts at the Time of Registration: An Unsupervised Approach, 2021
- [Ant Group] Heterogeneous Graph Neural Networks for Malicious Account Detection
- [Ant Group] Adversarial Attack on Graph Structured Data
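Fake news detection
Shipping-insurance fraud detection
Loan default prediction
Money laundering detection
Optimization methods
ML model evaluation strategies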
- Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
- Slice Finder: Automated Data Slicing for Model Validation
- Subgroup Discovery Algorithms: A Survey and Empirical Evaluation
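A/B Testing
Risk-control model strategies in financial scenarios
- Sample extraction
- Out-of-time (OOT) validation set during model development: mainly used to measure how much the new model improves on the old model over the same period, and to estimate performance when setting the decision point (cut-off).
- Recent sample set (BackScore): all recent samples that reached the model scoring stage; mainly used to set the new model's threshold at a given approval rate.
- Designing the model strategy: usually a trade-off between conversion rate and bad-debt rate.
Single-model strategy: used in the early and middle stages of a business, or when the models are strongly correlated with each other.
- Decision point based on approval rate and bad-debt rate: the ideal outcome is a higher approval rate together with a lower bad-debt rate.
- Decision point based on lift: lift is how many times more bad customers the risk model identifies than random selection would. Lift binning splits all customers into 10 to 20 bins by model score; lift is Cumulatively Bad (%) by Model divided by Cumulatively Bad (%) Randomly. The decision threshold is then set according to the lift values.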
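The lift table described above can be sketched as follows. This is a minimal illustration, assuming that a higher score means higher risk, that label 1 marks a bad customer, and that there are at least `n_bins` rows and at least one bad customer. The function name and the sample scores are hypothetical.

```python
def cumulative_lift(scores, labels, n_bins=10):
    """Cumulative lift per bin, assuming higher score = riskier and label 1 = bad.

    Returns rows of (bin number, cumulative bad % by model,
    cumulative bad % randomly, lift).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total, total_bad = len(ranked), sum(labels)
    table, seen_bad = [], 0
    for b in range(n_bins):
        # Integer bin edges keep bin sizes within one row of each other.
        start, end = b * total // n_bins, (b + 1) * total // n_bins
        seen_bad += sum(y for _, y in ranked[start:end])
        bad_by_model = seen_bad / total_bad   # Cumulatively Bad (%) by Model
        bad_randomly = end / total            # Cumulatively Bad (%) Randomly
        table.append((b + 1, bad_by_model, bad_randomly, bad_by_model / bad_randomly))
    return table

# Hypothetical risk scores and outcomes for ten customers.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
lift_table = cumulative_lift(scores, labels, n_bins=5)
```

Cumulative lift always ends at 1.0 in the last bin, so the useful signal is how quickly it decays over the first few bins; a cut-off is typically placed where lift drops below a level the business accepts.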
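Multi-model strategy: combine the scores of several models into a single score, for example by weighting.
- Serial multi-model admission: several models make the admission decision one after another.
- Cross multi-model admission: when generating risk levels, take the sample size in each cell of the cross table into account so that the result stays statistically meaningful.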
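The three multi-model patterns above can be illustrated with small helpers. Everything in this sketch is an assumption made for the example: the model names, weights, cut-offs, the convention that a higher score means higher risk, and the `min_samples` rule used to flag thin cells in the cross table.

```python
from collections import Counter

def fuse_scores(model_scores, weights):
    """Weighted average of several model scores for one applicant."""
    total_weight = sum(weights[name] for name in model_scores)
    return sum(model_scores[name] * weights[name] for name in model_scores) / total_weight

def serial_admission(model_scores, cutoffs):
    """Apply models in order; reject at the first model whose risk score exceeds its cut-off.

    Returns (admitted, name of the rejecting model or None).
    """
    for name, cutoff in cutoffs:
        if model_scores[name] > cutoff:
            return False, name
    return True, None

def sparse_cells(bucket_pairs, min_samples):
    """Cross-table cells whose sample count is too small to trust."""
    counts = Counter(bucket_pairs)
    return sorted(cell for cell, n in counts.items() if n < min_samples)

# Hypothetical applicant scored by two models (higher = riskier).
applicant = {"fraud_model": 0.2, "credit_model": 0.6}
fused = fuse_scores(applicant, {"fraud_model": 0.7, "credit_model": 0.3})
decision = serial_admission(applicant, [("fraud_model", 0.5), ("credit_model", 0.5)])

# Hypothetical (fraud bucket, credit bucket) assignments for six applicants.
pairs = [("low", "low")] * 3 + [("low", "high")] + [("high", "high")] * 2
thin = sparse_cells(pairs, min_samples=2)
```

Here the applicant passes the first model but is rejected by the second, and the `("low", "high")` cell is flagged because it holds a single sample.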
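- Evaluating the model strategy
When a new model is developed, its technical metrics (AUC, KS, Gini, etc.) on the same OOT sample set are expected to beat those of the old model. Deploying the new model still carries potential risks:
- The features in the new model's training set are obtained by backfilling, so online and offline feature values may be inconsistent.
- The new model is evaluated on samples that already passed the old model's decisions, which differs from the population it will face in production.
- KS: commonly used to describe how well the model separates positive and negative samples.
- Number of negative samples: used to describe the model's ranking ability.
- Capture rate: the ability to capture fraud below the threshold.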
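The KS statistic and capture rate mentioned above can be computed directly from scores and labels. The sketch below assumes a credit-score convention in which a lower score means higher risk, which matches the "below the threshold" wording; the function names and the sample scores are hypothetical.

```python
def ks_statistic(scores, labels):
    """Maximum gap between the cumulative score distributions of bad (1) and good (0) samples."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    ranked = sorted(zip(scores, labels))
    cum_bad = cum_good = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        # Consume all rows tied at this score before comparing the two CDFs.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            cum_bad += ranked[j][1]
            cum_good += 1 - ranked[j][1]
            j += 1
        best = max(best, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return best

def capture_rate(scores, labels, threshold):
    """Share of all bad samples whose score falls at or below the threshold."""
    n_bad = sum(labels)
    caught = sum(y for s, y in zip(scores, labels) if s <= threshold)
    return caught / n_bad

# Hypothetical credit scores (lower = riskier) and fraud labels.
credit_scores = [300, 420, 450, 510, 580, 640, 700, 760]
fraud = [1, 1, 0, 1, 0, 0, 0, 0]
ks = ks_statistic(credit_scores, fraud)
captured = capture_rate(credit_scores, fraud, threshold=450)
```

KS summarizes separation at the single best threshold, while capture rate answers the operational question of how much fraud a chosen cut-off actually stops, so the two are usually reported together.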
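- Deploying and validating the model strategy
- Adversarial Validation
- If the model performs poorly on both the test set and OOT, the likely cause is overfitting: reduce model complexity, prune features, reduce tree depth, etc.
- If the model does well on the test set but poorly on OOT, it is not stable enough over time; check how each feature's value distribution drifts over time.
- If the model does well on both the test set and OOT but poorly online, first rule out overfitting by drawing a fresh dataset, then verify that the online and offline feature logic match.
- If model scores decline month over month, refit the model frequently on new samples.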
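One simple way to check how a single feature's distribution drifts over time, as suggested above, is the Population Stability Index (PSI). PSI is not named in the notes above; it is swapped in here as a lighter alternative to training an adversarial-validation classifier. The bin `edges`, the `floor` value, and the sample data are assumptions of this sketch.

```python
import math
from bisect import bisect_right

def bin_shares(values, edges, floor=1e-6):
    """Share of values in each bin defined by sorted inner edges.

    Shares are floored at a small constant to avoid log(0) on empty bins.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), floor) for c in counts]

def psi(baseline, recent, edges):
    """Population Stability Index between a baseline and a recent sample of one feature."""
    expected = bin_shares(baseline, edges)
    actual = bin_shares(recent, edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Hypothetical feature values: training month, a stable month, and a shifted month.
train_month = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
stable_month = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
shifted_month = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
edges = [3.5, 7.5]

psi_stable = psi(train_month, stable_month, edges)
psi_shifted = psi(train_month, shifted_month, edges)
```

A PSI near zero means the binned distributions match. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but the right alert level depends on the feature and the business.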
Knowledge Base
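- A very comprehensive ML resource wiki
- Graph Code Samples
- 2021 summary of financial risk-control models [graph algorithms + time-series models]
- Applications of the Kunpeng distributed platform at Ant Financial
- GNN link prediction at WeChat Pay
- A survey of deep learning surveys
- Understanding the graph Laplacian matrix
- In signal processing, a filter amplifies or attenuates relevant frequencies and removes irrelevant ones. Matrix multiplication in a linear space amounts to rescaling (along the eigenvector directions), which is the same operation as filtering in the frequency domain.
- The Laplacian operator at a node describes the difference between the signal at that node and the signal at its neighbors.
- Graph Learning survey
- Graph algorithm study: dense subgraph mining
- Open-source distributed graph databases
- What is an IP address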
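The graph Laplacian note in the list above can be checked numerically. The sketch below builds the unnormalized Laplacian L = D - A for a small hypothetical path graph and confirms that applying L to a node signal gives, for each node, the sum of differences between its value and its neighbors' values. The function names and the sample signal are assumptions of this example.

```python
def laplacian(adjacency):
    """Unnormalized graph Laplacian L = D - A for a symmetric 0/1 adjacency matrix."""
    n = len(adjacency)
    return [
        [(sum(adjacency[i]) if i == j else 0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]

def matvec(matrix, signal):
    """Matrix-vector product."""
    return [sum(m * s for m, s in zip(row, signal)) for row in matrix]

# Hypothetical path graph 0 - 1 - 2 with one signal value per node.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
f = [1.0, 4.0, 2.0]

L = laplacian(A)
Lf = matvec(L, f)

# Direct definition: (Lf)_i = sum over neighbors j of (f_i - f_j).
neighbor_diffs = [sum(A[i][j] * (f[i] - f[j]) for j in range(3)) for i in range(3)]
```

Both computations agree, and node 1, whose value stands out from both neighbors, gets the largest response. That local-difference view is what makes the Laplacian useful for spotting nodes that disagree with their neighborhood.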