You have been hired by Alpha Insurance to develop predictive models to determine which automobile claims are fraudulent. You have been given data on approximately 5,000 auto claims, including a variable that indicates whether the company believes each claim is fraudulent.
- Robert Shea
Bryant University ~ Fall 2018
These variables appear to be the best for detecting fraudulent claims:
- Claim Amount - Uncommonly high claim amounts are more likely to be fraudulent.
- Claim Cause - The more severe claim causes (fire and collision) will be less likely to be fraudulent.
- Claim Report Type - Fraudulent claims are likely to be reported through channels that involve as little human interaction as possible.
- Employment Status - Claimants who are not currently employed are more likely to report fraudulent claims.
- Education - The higher the level of education, the less likely claims are to be fraudulent. (This may also be linked with income.)
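
Each hypothesis above can be checked by comparing the share of claims flagged as fraudulent across the values of a variable. The following is a minimal sketch of that comparison using only the Python standard library. The field names `employment_status` and `fraudulent`, the helper `fraud_rate_by_group`, and the sample rows are all hypothetical and are not taken from the Alpha Insurance data.

```python
from collections import defaultdict


def fraud_rate_by_group(claims, group_key, label_key="fraudulent"):
    """Return {group value: share of claims in that group flagged as fraudulent}."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for claim in claims:
        group = claim[group_key]
        totals[group] += 1
        if claim[label_key]:
            flagged[group] += 1
    return {group: flagged[group] / totals[group] for group in totals}


# Made-up rows for illustration only (assumed column names).
sample_claims = [
    {"employment_status": "Employed", "fraudulent": False},
    {"employment_status": "Employed", "fraudulent": False},
    {"employment_status": "Employed", "fraudulent": True},
    {"employment_status": "Unemployed", "fraudulent": True},
    {"employment_status": "Unemployed", "fraudulent": False},
]

rates = fraud_rate_by_group(sample_claims, "employment_status")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.2f}")
```

A noticeably higher rate in one group would be consistent with the hypothesis for that variable, though on a sample this small the difference would not mean much.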
- Univariate exploration
- Bivariate exploration
- Impute missing values
- Handle outliers
- Transform variables with functions
- Transform variables with binning
- Balancing Sample
- Decision Tree
- Neural Network
- Model Selection
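
For the sample-balancing step in the outline above, one common option is random undersampling: keep every row of the rarer class and draw an equally sized random subset of the more common class. The sketch below assumes this is the balancing technique intended, which the outline does not state explicitly. It uses only the Python standard library; the helper name `random_undersample`, the `fraudulent` field, and the toy rows are hypothetical.

```python
import random


def random_undersample(rows, label_key="fraudulent", seed=0):
    """Return a class-balanced list by randomly dropping majority-class rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    positives = [row for row in rows if row[label_key]]
    negatives = [row for row in rows if not row[label_key]]
    # Order the two classes by size: the smaller one is kept in full.
    minority, majority = sorted([positives, negatives], key=len)
    kept_majority = rng.sample(majority, len(minority))
    balanced = minority + kept_majority
    rng.shuffle(balanced)  # avoid all minority rows appearing first
    return balanced


# Toy data for illustration only: 95 legitimate claims and 5 fraudulent ones.
toy_rows = [{"claim_id": i, "fraudulent": i < 5} for i in range(100)]

balanced_rows = random_undersample(toy_rows)
n_fraud = sum(row["fraudulent"] for row in balanced_rows)
print(len(balanced_rows), n_fraud)
```

Undersampling discards majority-class rows, so it is usually applied to the training split only, leaving the evaluation data at its original class ratio.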
- How to encode categorical data: https://www.datacamp.com/community/tutorials/categorical-data
- Random undersampling:
- Credit card fraud example: https://github.com/IBM/xgboost-smote-detect-fraud