Insurance-Fraud-Prediction

Problem Statement

Build a classification methodology to determine whether a customer is placing a fraudulent insurance claim.

Detailed EDA

see detailed EDA here

Architecture

Data Description

The client will send data in multiple sets of files in batches at a given location. The data has been extracted from the census bureau. The data contains the following attributes:

Features:

months_as_customer: It denotes the number of months for which the customer is associated with the insurance company.
age: The age of the person (continuous).
policy_number: The policy number.
policy_bind_date: Start date of the policy.
policy_state: The state where the policy is registered.
policy_csl: Combined single limits. How much of the bodily injury will be covered from the total damage. See https://www.berkshireinsuranceservices.com/arecombinedsinglelimitsbetter
policy_deductable: The amount paid out of pocket by the policy-holder before an insurance provider will pay any expenses.
policy_annual_premium: The yearly premium for the policy.
umbrella_limit: An umbrella insurance policy is extra liability insurance coverage that goes beyond the limits of the insured's homeowners, auto or watercraft insurance. It provides an additional layer of security to those who are at risk of being sued for damages to other people's property or injuries caused to others in an accident.
insured_zip: The zip code where the policy is registered.
insured_sex: It denotes the person's gender.
insured_education_level: The highest educational qualification of the policy-holder.
insured_occupation: The occupation of the policy-holder.
insured_hobbies: The hobbies of the policy-holder.
insured_relationship: Dependents on the policy-holder.
capital-gain: It denotes the monetary gains by the person.
capital-loss: It denotes the monetary loss by the person.
incident_date: The date when the incident happened.
incident_type: The type of the incident.
collision_type: The type of collision that took place.
incident_severity: The severity of the incident.
authorities_contacted: Which authority was contacted.
incident_state: The state in which the incident took place.
incident_city: The city in which the incident took place.
incident_location: The street in which the incident took place.
incident_hour_of_the_day: The time of the day when the incident took place.
property_damage: Whether any property damage was done.
bodily_injuries: Number of bodily injuries.
Witnesses: Number of witnesses present.
police_report_available: Whether a police report is available.
total_claim_amount: Total amount claimed by the customer.
injury_claim: Amount claimed for injury.
property_claim: Amount claimed for property damage.
vehicle_claim: Amount claimed for vehicle damage.
auto_make: The manufacturer of the vehicle.
auto_model: The model of the vehicle.
auto_year: The year of manufacture of the vehicle.

Target Label:

Whether the claim is fraudulent or not.
fraud_reported: Y or N

Apart from training files, we also require a "schema" file from the client, which contains all the relevant information about the training files, such as: Name of the files, Length of Date value in FileName, Length of Time value in FileName, Number of Columns, Name of the Columns, and their datatype.
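To make the role of the schema file concrete, here is a minimal sketch of reading one. The JSON layout, every key name, and the sample values are assumptions made for illustration; the real schema file supplied by the client may be organised differently.

```python
import json

# Hypothetical schema contents: key names and values are assumptions,
# chosen only to mirror the items listed in the text above.
SCHEMA_JSON = """
{
  "SampleFileName": "fraudDetection_021119920_010222.csv",
  "LengthOfDateStampInFile": 9,
  "LengthOfTimeStampInFile": 6,
  "NumberofColumns": 3,
  "ColName": {
    "months_as_customer": "Integer",
    "policy_state": "varchar",
    "fraud_reported": "varchar"
  }
}
"""


def load_schema(text):
    """Parse the schema and return the values the validation steps need."""
    schema = json.loads(text)
    columns = schema["ColName"]
    # Guard against a schema that contradicts itself.
    if len(columns) != schema["NumberofColumns"]:
        raise ValueError("NumberofColumns does not match the ColName entries")
    return {
        "date_len": schema["LengthOfDateStampInFile"],
        "time_len": schema["LengthOfTimeStampInFile"],
        "n_columns": schema["NumberofColumns"],
        "columns": columns,  # column name -> datatype, in file order
    }


schema = load_schema(SCHEMA_JSON)
print(schema["n_columns"], list(schema["columns"]))
```

Keeping the expected lengths, column count, and column datatypes in one parsed object means each later validation step can read its expectations from the same place.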

Data Validation

In this step, we perform different sets of validation on the given set of training files.

Name Validation - We validate the name of the files based on the given name in the schema file. We have created a regex pattern as per the name given in the schema file to use for validation. After validating the pattern in the name, we check for the length of date in the file name as well as the length of time in the file name. If all the values are as per requirement, we move such files to "Good_Data_Folder", else we move such files to "Bad_Data_Folder".
Number of Columns - We validate the number of columns present in the files, and if it doesn't match with the value given in the schema file, then the file is moved to "Bad_Data_Folder".
Name of Columns - The name of the columns is validated and should be the same as given in the schema file. If not, then the file is moved to "Bad_Data_Folder".
Datatype of columns - The datatype of columns is given in the schema file. It is validated when we insert the files into the database. If the datatype is wrong, then the file is moved to "Bad_Data_Folder".
Null values in columns - If any of the columns in a file have all the values as NULL or missing, we discard such a file and move it to "Bad_Data_Folder".

Data Insertion in Database

Database Creation and connection - Create a database with the given name. If the database has already been created, open a connection to it.
Table creation in the database - A table named "Good_Data" is created in the database for inserting the files from the "Good_Data_Folder", based on the column names and datatypes given in the schema file. If the table is already present, no new table is created, and new files are inserted into the existing table, as we want training to be done on new as well as old training files.
Insertion of files in the table - All the files in the "Good_Data_Folder" are inserted in the above-created table. If any file has invalid data type in any of the columns, the file is not loaded in the table and is moved to "Bad_Data_Folder".
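The three steps above can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumptions, not the project's code: the database engine, the three-column `SCHEMA`, the `CASTS` mapping, and both function names are hypothetical. SQLite does not reject mistyped values by default, so the sketch converts every value explicitly and refuses the whole file if any conversion fails.

```python
import sqlite3

# Hypothetical schema: column name -> SQL datatype.
SCHEMA = {"policy_number": "INTEGER", "policy_state": "TEXT",
          "total_claim_amount": "REAL"}
# Python converters used to check each value against its declared datatype.
CASTS = {"INTEGER": int, "REAL": float, "TEXT": str}


def create_good_data_table(conn, schema):
    """Create Good_Data once; later calls leave the existing table untouched."""
    cols = ", ".join(f'"{name}" {sql_type}' for name, sql_type in schema.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS Good_Data ({cols})")


def insert_file(conn, schema, rows):
    """Insert every row of one file, or none of them.

    Returns False when any value cannot be converted to its declared
    datatype, which is the caller's cue to move the file to Bad_Data_Folder.
    """
    try:
        typed = [tuple(CASTS[sql_type](row[name])
                       for name, sql_type in schema.items())
                 for row in rows]
    except (KeyError, TypeError, ValueError):
        return False
    placeholders = ", ".join("?" for _ in schema)
    with conn:  # commits on success, rolls back on error
        conn.executemany(f"INSERT INTO Good_Data VALUES ({placeholders})", typed)
    return True


conn = sqlite3.connect(":memory:")  # a file path would persist the data
create_good_data_table(conn, SCHEMA)
good = [{"policy_number": "101", "policy_state": "OH",
         "total_claim_amount": "5000.5"}]
bad = [{"policy_number": "abc", "policy_state": "IN",
        "total_claim_amount": "120"}]
print(insert_file(conn, SCHEMA, good), insert_file(conn, SCHEMA, bad))
```

Because the table is created with `IF NOT EXISTS`, rows from new batches accumulate alongside rows from earlier batches, which matches the intent of training on both new and old files.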

Model Training

Data Export from Db - The data stored in the database is exported as a CSV file to be used for model training.
Data Preprocessing
a) Drop the columns not required for prediction.
b) For this dataset, the null values were replaced with ‘?’ in the client data. Those ‘?’ have been replaced with NaN values.
c) Check for null values in the columns. If present, impute the null values using the categorical imputer.
d) Replace and encode the categorical values with numeric values.
e) Scale the numeric values using the standard scaler.
Clustering - The KMeans algorithm is used to create clusters in the preprocessed data. The optimum number of clusters is selected from the elbow plot, and for the dynamic selection of the number of clusters, we use the "KneeLocator" function. The idea behind clustering is to train a different algorithm on each cluster. The KMeans model is trained over the preprocessed data, and the model is saved for further use in prediction.
Model Selection - After the clusters have been created, we find the best model for each cluster. We are using two algorithms, "SVM" and "XGBoost". For each cluster, both algorithms are passed with the best parameters derived from GridSearch. We calculate the AUC scores for both models and select the model with the best score. In the same way, a model is selected for each cluster. All the models for every cluster are saved for use in prediction.
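The per-cluster comparison in the Model Selection step can be illustrated in isolation. The sketch below assumes each candidate has already been tuned and has produced a score per validation row; the hand-written `auc` helper, the `select_models` function, and the toy labels and scores are all hypothetical. In practice a library routine such as scikit-learn's `roc_auc_score` would normally replace `auc`.

```python
def auc(y_true, scores):
    """Area under the ROC curve, computed as the share of
    (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_models(y_true_by_cluster, scores_by_cluster):
    """For each cluster, keep the candidate whose scores give the highest AUC."""
    best = {}
    for cluster, y_true in y_true_by_cluster.items():
        aucs = {name: auc(y_true, scores)
                for name, scores in scores_by_cluster[cluster].items()}
        best[cluster] = max(aucs, key=aucs.get)
    return best


# Toy validation data: 1 = fraudulent, 0 = not fraudulent.
y_true_by_cluster = {0: [0, 0, 1, 1], 1: [1, 0, 1, 0]}
scores_by_cluster = {
    0: {"SVM": [0.1, 0.4, 0.35, 0.8], "XGBoost": [0.2, 0.3, 0.6, 0.9]},
    1: {"SVM": [0.9, 0.2, 0.7, 0.4], "XGBoost": [0.6, 0.5, 0.4, 0.7]},
}
best = select_models(y_true_by_cluster, scores_by_cluster)
print(best)
```

AUC is undefined for a cluster whose validation rows all share one label, so the helper raises an error in that case instead of returning a misleading number.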

Prediction Data Description

The client will send the data in multiple sets of files in batches at a given location. The data will contain the insurance claim attributes described above. Apart from prediction files, we also require a "schema" file from the client, which contains all the relevant information about the prediction files, such as: Name of the files, Length of Date value in FileName, Length of Time value in FileName, Number of Columns, Name of the Columns, and their datatype.

Data Validation

In this step, we perform different sets of validation on the given set of prediction files.

Name Validation - We validate the name of the files based on the given name in the schema file. We have created a regex pattern as per the name given in the schema file to use for validation. After validating the pattern in the name, we check for the length of date in the file name as well as the length of the timestamp in the file name. If all the values are as per requirement, we move such files to "Good_Data_Folder", else we move such files to "Bad_Data_Folder".
Number of Columns - We validate the number of columns present in the files, and if it doesn't match with the value given in the schema file, then the file is moved to "Bad_Data_Folder".
Name of Columns - The name of the columns is validated and should be the same as given in the schema file. If not, then the file is moved to "Bad_Data_Folder".
Datatype of columns - The datatype of columns is given in the schema file. This is validated when we insert the files into the database. If the datatype is incorrect, then the file is moved to "Bad_Data_Folder".
Null values in columns - If any of the columns in a file have all the values as NULL or missing, we discard such a file and move it to "Bad_Data_Folder".
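The file-level checks listed above (name, number of columns, column names, fully empty columns) can be sketched as one function that returns the destination folder. The prefix `fraudDetection`, the date and time lengths, the helper names, and the set of values treated as missing are assumptions for illustration. The datatype check is left out because the text places it at database insertion time.

```python
import re


def build_name_pattern(prefix, date_len, time_len):
    """Regex for names of the form <prefix>_<date digits>_<time digits>.csv."""
    return re.compile(
        rf"{re.escape(prefix)}_\d{{{date_len}}}_\d{{{time_len}}}\.csv")


def validate_file(file_name, header, rows, pattern, expected_columns):
    """Return the folder a file belongs in, applying the checks in order.

    Assumes every row has as many values as the header.
    """
    if not pattern.fullmatch(file_name):
        return "Bad_Data_Folder"   # wrong name, date length, or time length
    if len(header) != len(expected_columns):
        return "Bad_Data_Folder"   # wrong number of columns
    if list(header) != list(expected_columns):
        return "Bad_Data_Folder"   # wrong column names
    for i in range(len(header)):
        # A column whose values are all missing disqualifies the file;
        # a file with no data rows is rejected by the same rule.
        if all(row[i] in ("", None, "NULL") for row in rows):
            return "Bad_Data_Folder"
    return "Good_Data_Folder"


pattern = build_name_pattern("fraudDetection", 9, 6)
columns = ["policy_number", "policy_state"]
rows = [["101", "OH"], ["102", ""]]
print(validate_file("fraudDetection_021119920_010222.csv",
                    columns, rows, pattern, columns))
```

Using `fullmatch` keeps the date and time length checks inside the pattern itself, so a name with too few or too many digits fails without a separate length comparison.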

Data Insertion in Database

Database Creation and connection - Create a database with the given name. If the database has already been created, open a connection to it.
Table creation in the database - A table named "Good_Data" is created in the database for inserting the files from the "Good_Data_Folder", based on the column names and datatypes given in the schema file. If the table is already present, no new table is created, and new files are inserted into the existing table, as we want training to be done on new as well as old training files.
Insertion of files in the table - All the files in the "Good_Data_Folder" are inserted in the above-created table. If any file has invalid data type in any of the columns, the file is not loaded in the table and is moved to "Bad_Data_Folder".

Prediction

Data Export from Db - The data stored in the database is exported as a CSV file to be used for prediction.
Data Preprocessing
a) Drop the columns not required for prediction.
b) For this dataset, the null values were replaced with ‘?’ in the client data. Those ‘?’ have been replaced with NaN values.
c) Check for null values in the columns. If present, impute the null values using the categorical imputer.
d) Replace and encode the categorical values with numeric values.
e) Scale the numeric values using the standard scaler.
Clustering - The KMeans model created during training is loaded, and clusters for the preprocessed prediction data are predicted.
Prediction - Based on the cluster number, the respective model is loaded and is used to predict the data for that cluster.
Once the predictions are made for all the clusters, they are saved in a CSV file at a given location, and the location is returned to the client.
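The routing described in the last steps, where each row is handled by the model of its own cluster, can be sketched as follows. The cluster assigner and the per-cluster models are stand-in callables here, and the function names and the `Predictions` column header are assumptions; in the pipeline they would be the saved KMeans model and the saved per-cluster classifiers. The sketch writes only the prediction column.

```python
import csv
import io


def predict_by_cluster(rows, assign_cluster, models):
    """Route each row to the model of its cluster; keep the input order."""
    predictions = [None] * len(rows)
    by_cluster = {}
    for index, row in enumerate(rows):
        by_cluster.setdefault(assign_cluster(row), []).append(index)
    for cluster, indexes in by_cluster.items():
        model = models[cluster]  # one saved model per cluster
        for index in indexes:
            predictions[index] = model(rows[index])
    return predictions


def write_predictions(predictions, handle):
    """Write one prediction per line under a single header."""
    writer = csv.writer(handle)
    writer.writerow(["Predictions"])
    writer.writerows([p] for p in predictions)


# Stand-ins for the loaded KMeans model and the per-cluster classifiers.
def assign_cluster(row):
    return 0 if row[0] < 10 else 1


models = {0: lambda row: "N", 1: lambda row: "Y"}

rows = [[5], [20], [7]]
predictions = predict_by_cluster(rows, assign_cluster, models)
buffer = io.StringIO()  # a real run would open a file at the output location
write_predictions(predictions, buffer)
print(buffer.getvalue())
```

Grouping row indexes by cluster before predicting means each model is looked up once per cluster, while the output still lines up with the order of the input rows.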


richakbee/Insurance-Fraud-Prediction
