<!-- ![image](https://i.ytimg.com/vi/TbEZIZo4PZ8/maxresdefault.jpg) -->

### Introduction to Classification - INSURANCE CLAIM ANALYSIS
Introduction to Logistic Regression

Sigmoid Function

Confusion Matrix

Classification Evaluation Metrics

https://www.kaggle.com/code/mohamedbakrey/make-a-prediction-for-insurance-claim-report

https://youtu.be/ntBa7YKc9XM?si=1V1RL0wCmxM_gjor  Regression price

https://youtu.be/OOLhKLXCJiU?si=dWxvFk82mVO8WxEP DBSCAN

# Predicting auto and insurance fraud in general:
Insurance is a contract, represented by a policy, in which an individual or entity receives financial protection or reimbursement against losses from an insurance company. The company pools clients' risks to make payments more affordable for the insured.
# Insurance Policy Components
When choosing a policy, it is important to understand how insurance works.

A firm understanding of these concepts goes a long way in helping you choose the policy that best suits your needs. For instance, whole life insurance may or may not be the right type of life insurance for you. There are three components of any type of insurance (premium, policy limit, and deductible) that are crucial.

> The goal of this note, Kuho Hwa, is to give a simplified, structured explanation of insurance fraud (and its absence) through data analysis and machine learning.


## 🔹 **Business Understanding**

Insurance companies process thousands of claims annually. A large percentage may be **exaggerated or fraudulent**, leading to financial loss. This dataset represents information about customers, their policies, accidents, and whether the claim was fraudulent.

---

## 🔹 **Business Problem**

> **The insurance company is experiencing increasing financial loss due to fraudulent and unnecessary claims.**
> Detecting fraud manually is slow, expensive, and inaccurate.

---

## 🔹 **Business Objective**

| Primary Objective                                               | Secondary Objective                                  |
| --------------------------------------------------------------- | ---------------------------------------------------- |
| Build a **Machine Learning Model to Predict Fraudulent Claims** | Predict expected **claim cost** and detect anomalies |
| Reduce false insurance payouts & financial losses               | Prioritize claims for manual investigation           |
| Improve claim approval time for genuine customers               | Enhance underwriting & pricing decisions             |

---

### Expected Outcomes

📌 Reduce fraudulent claims by 30–60%
📌 Save millions in wrongful payments
📌 Faster claim settlements for genuine customers
📌 Improve risk-based pricing for future policies

---


# Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read data 

In [None]:
db = pd.read_csv('insurance_claims_report.csv')
# preview the first five rows
db.head()


---

## 🔹 **Dataset Column Explanation with Business Use**

| Column Name                                            | Meaning / Description                                               | Business Use in Insurance Analytics / ML                                                                  |
| ------------------------------------------------------ | ------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| **months_as_customer**                                 | Number of months insured with the company                           | Helps evaluate customer loyalty & churn probability; long-term customers are less likely to commit fraud. |
| **age**                                                | Age of insured person                                               | Younger drivers may have higher risk profiles → used for risk pricing.                                    |
| **policy_number**                                      | Unique ID for policy                                                | Identification – not useful for modeling, usually dropped.                                                |
| **policy_bind_date**                                   | Date when the policy was issued                                     | Helps assess policy age; new policies with quick claims may signal fraud.                                 |
| **policy_state (OH, IL, IN etc.)**                     | State where policy was registered                                   | Geographic-based risk pricing; accident patterns vary by region.                                          |
| **policy_csl (250/500, 100/300 etc.)**                 | Combined single limit → Maximum coverage limits                     | High coverage may attract fraudulent claims for bigger payout.                                            |
| **policy_deductable**                                  | Amount customer pays before insurance covers                        | Low deductible may increase claim frequency.                                                              |
| **policy_annual_premium**                              | Yearly insurance premium paid                                       | Higher premium = higher risk customer profile.                                                            |
| **umbrella_limit**                                     | Additional liability coverage above policy                          | Fraudsters often target high umbrella limits for large payouts.                                           |
| **insured_zip**                                        | ZIP code of insured customer                                        | Geography affects accident risk & fraud trends.                                                           |
| **insured_sex**                                        | Gender of policy holder                                             | Risk segmentation (male drivers historically slightly riskier).                                           |
| **insured_education_level**                            | Educational qualification                                           | Low education may correlate with higher claim probability statistically.                                  |
| **insured_occupation**                                 | Job category of customer                                            | High-risk professions (drivers, mechanics) → more exposure → more claims.                                 |
| **insured_hobbies**                                    | Hobbies listed (bungie-jumping, board-games)                        | Risk indicator: high-risk hobbies may lead to more accident exposure.                                     |
| **insured_relationship**                               | Relationship of policyholder to household (husband, own-child etc.) | Helps profile dependents, household risk level.                                                           |
| **capital-gains / capital-loss**                       | Financial gain/loss for the customer                                | Unstable finances may correlate with fraudulent intent.                                                   |
| **incident_date**                                      | Date of accident/claim                                              | Helps detect seasonal or suspicious claim timing.                                                         |
| **incident_type**                                      | Type of accident (Vehicle Theft, Collision etc.)                    | Key input to claim approval logic.                                                                        |
| **collision_type**                                     | Side/Front/Rear/None                                                | Helps determine accident legitimacy & repair cost.                                                        |
| **incident_severity**                                  | Minor/Major/Total Loss                                              | Severity indicates claim value & fraud likelihood.                                                        |
| **authorities_contacted**                              | Police / Fire / None                                                | Genuine claims usually have a police report → fraudulent claims often lack one.                           |
| **incident_state / incident_city / incident_location** | Location of accident                                                | Cross-check for mismatches from policy location → fraud flag.                                             |
| **incident_hour_of_the_day**                           | Time accident occurred                                              | Late-night or odd hour accidents often have higher fraud probability.                                     |
| **number_of_vehicles_involved**                        | Count of vehicles in accident                                       | Helps validate incident seriousness.                                                                      |
| **property_damage**                                    | YES/NO/Unknown                                                      | Claims with missing damage info can be suspicious.                                                        |
| **bodily_injuries**                                    | Number of injured persons                                           | Higher injuries → larger payout → fraud cases often exaggerate.                                           |
| **witnesses**                                          | Number of eyewitnesses                                              | Low/no witnesses increases fraud probability.                                                             |
| **police_report_available**                            | YES/NO/?                                                            | No report = possible fraud scenario.                                                                      |
| **total_claim_amount**                                 | Final total compensation                                            | Target variable in claim cost prediction models.                                                          |
| **injury_claim / property_claim / vehicle_claim**      | Breakdown of claim amounts                                          | Helps build prediction models for claim estimation.                                                       |
| **auto_make / auto_model / auto_year**                 | Car details                                                         | Newer/luxury cars → higher claim payouts; targeted more in fraud.                                         |
| **fraud_reported (Y/N)**                               | Fraud label                                                         | Target variable for fraud detection ML models.                                                            |

---



In [None]:
# number of rows and columns
db.shape

In [None]:
# Basic details of data
db.info()

In [None]:
# basic statistical summary
db.describe()

In [None]:
db.describe(include="O")

In [None]:
# show up to 500 columns when displaying a DataFrame
pd.set_option('display.max_columns', 500)
db

In [None]:
# check all columns name 
db.columns

### 2. Data cleaning

In [None]:
# check null values
db.isnull().sum()

In [None]:
# check the data types before replacing any values
db.dtypes

In [None]:
# check columns
db["authorities_contacted"].head()

In [None]:
# the column has object dtype, so use the mode (most frequent value)
db["authorities_contacted"].mode()[0]

In [None]:
# or check all distinct values
db["authorities_contacted"].value_counts()

In [None]:
# fill the missing values with the mode
db["authorities_contacted"] = db["authorities_contacted"].fillna(db["authorities_contacted"].mode()[0])

In [None]:
# Again check the null values
db.isnull().sum()
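
As a small illustration of the mode imputation used above, the sketch below fills missing values in a toy column. The DataFrame `toy` and its values are made up for illustration and are not taken from the insurance dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the insurance dataset)
toy = pd.DataFrame(
    {"authorities_contacted": ["Police", np.nan, "Fire", "Police", np.nan]}
)

# The mode is the most frequent non-missing value
most_common = toy["authorities_contacted"].mode()[0]

# Replace every missing entry with the mode
toy["authorities_contacted"] = toy["authorities_contacted"].fillna(most_common)

missing_after = int(toy["authorities_contacted"].isnull().sum())
print(most_common, missing_after)
```

Mode imputation is simple, but it can inflate the most common category, so it is worth checking `value_counts()` before and after.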

- Before applying any model, we need to check data quality and preprocess the data.
    - Check the data for missing values, outliers, and incorrect data types.
    - Handle missing values by either removing them or imputing them with a suitable method.
    - Transform the data into a format suitable for modeling, for example by scaling numeric columns or encoding categorical variables.
    - Split the data into training and testing sets to evaluate the model's performance.
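
The preprocessing steps listed above can be sketched as a single scikit-learn pipeline. This is only one possible arrangement, not the approach used in the rest of this notebook: the toy values, the choice of median and most-frequent imputation, and the names `numeric_steps`, `categorical_steps`, and `preprocess` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one numeric and one categorical column (hypothetical values)
toy = pd.DataFrame({
    "age": [25.0, 40.0, np.nan, 55.0],
    "incident_severity": pd.Series(["Minor", "Major", "Minor", np.nan], dtype=object),
})

# Numeric columns: fill gaps with the median, then scale to zero mean / unit variance
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age"]),
    ("cat", categorical_steps, ["incident_severity"]),
])

features = preprocess.fit_transform(toy)
print(features.shape)
```

In a real workflow the transformer would be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into the preprocessing.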

In [None]:
db.head()

In [None]:

# convert the label column (fraud_reported) to numbers
# map "Y" (fraud) to 1 and "N" (no fraud) to 0
db["fraud_reported"] = db["fraud_reported"].map({"Y": 1, "N": 0})

In [None]:
db.head()

### Convert all useful columns to numbers
- For example, policy_number is not important for modeling, so remove it.
- Check all columns, then encode the categorical ones.

In [None]:
# check the data types to find the numerical columns
db.dtypes

In [None]:
db["authorities_contacted"].dtypes !="O"

In [None]:
# collect the names of the numerical (non-object) columns
numerical_col = []
for col in db.columns:
    if db[col].dtype != "O":
        print(col)
        numerical_col.append(col)
numerical_col

In [None]:
# keep only the numerical columns in a new DataFrame
df = db[numerical_col].copy()
df.head()
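
As an alternative to looping over the columns, pandas can select numeric columns directly. The sketch below shows this on a toy frame; the frame `toy` is a made-up stand-in for `db`, and `select_dtypes` is offered as an option rather than as the method this notebook uses.

```python
import pandas as pd

# Toy frame mixing numeric and text columns (hypothetical stand-in for db)
toy = pd.DataFrame({
    "age": [48, 42, 29],
    "policy_state": ["OH", "IN", "IL"],
    "policy_annual_premium": [1406.91, 1197.22, 1413.14],
})

# Keep only the numeric columns in a single call
numeric_only = toy.select_dtypes(include="number")
print(numeric_only.columns.tolist())
```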

In [None]:
df.columns

In [None]:
# Remove unwanted columns
df = df.drop(['policy_number','insured_zip','capital-gains','capital-loss','auto_year'],axis=1)

In [None]:
df.head()

#### Split the data
- Before creating the model, split the data into X (features) and Y (label).

In [None]:
df.columns

In [None]:
# use only numerical columns for training
# x-->features ,y-->Label
X = df.drop('fraud_reported',axis=1)
Y = df['fraud_reported']

In [None]:
X
Y

In [None]:
# split the data into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=.2, random_state=42)

In [None]:
x_train.shape

In [None]:
y_train.shape

#### Let's apply our classification models one by one:
1) Logistic Regression:

In [None]:
# model training 
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train,y_train)
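
Logistic regression turns a linear score z into a probability with the sigmoid function, 1 / (1 + e^(-z)). The sketch below is a minimal pure-Python version for intuition; the helper name `sigmoid` and the 0.5 threshold are illustrative assumptions, not code taken from this notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# A score of 0 sits exactly on the decision boundary
p_zero = sigmoid(0.0)

# Positive scores move toward 1, negative scores toward 0
p_pos = sigmoid(4.0)
p_neg = sigmoid(-4.0)

# A common (but adjustable) rule: predict class 1 when the probability is >= 0.5
predicted_class = int(p_pos >= 0.5)
print(p_zero, round(p_pos, 3), round(p_neg, 3), predicted_class)
```

For fraud detection, lowering the threshold below 0.5 would flag more claims for review at the cost of more false alarms.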

In [None]:
# training accuracy
# accuracy on seen data (the training set)
model.score(x_train,y_train)

In [None]:
# testing accuracy
# accuracy on unseen data (the test set)
model.score(x_test,y_test)

### Predictions

In [None]:
y_pred = model.predict(x_test)
y_test

In [None]:
# calculate the error; for 0/1 labels the mean squared error equals the misclassification rate (1 - accuracy)
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test,y_pred)

#### Evaluation metrics used
Model evaluation assesses a model's performance, reliability, and ability to generalize to unseen data. Metrics such as accuracy, precision, recall, and F1-score show whether the model is effective rather than just memorizing the training data, and they help catch issues like overfitting, underfitting, or bias before real-world deployment.

1. Accuracy
2. Confusion matrix
3. Classification report
4. Precision, recall, F1-score
5. ROC-AUC
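
ROC-AUC is listed above but not computed elsewhere in this notebook, so here is a hedged sketch of how it could be calculated. It uses synthetic data from `make_classification` as a stand-in for the claims table; the variable names and the 75/25 class weights are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for the claims table
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC-AUC is computed from scores, so use the probability of class 1,
# not the hard 0/1 predictions
fraud_probability = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, fraud_probability)
print(f"ROC-AUC: {auc:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of fraud above non-fraud cases.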

In [None]:
from sklearn.metrics import accuracy_score,confusion_matrix, classification_report

# print the accuracy rounded to 4 decimal places
print('Accuracy: {:.4f}'.format(accuracy_score(y_test, y_pred)))

A confusion matrix is a table used to evaluate the performance of a classification model by comparing the predicted outcomes to the actual outcomes.

In [None]:
print(confusion_matrix(y_test, y_pred))
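
To make the four cells of the matrix concrete, the sketch below counts them by hand for a tiny made-up set of labels (1 = fraud, 0 = genuine) and derives accuracy, precision, recall, and F1 from them. The two label lists are illustrative only.

```python
# Hypothetical actual and predicted labels (1 = fraud, 0 = genuine)
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # fraud correctly flagged
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # genuine correctly passed
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # genuine wrongly flagged
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # fraud missed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the claims flagged as fraud, how many were fraud
recall = tp / (tp + fn)     # of the real fraud cases, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Same layout as sklearn's confusion_matrix: rows = actual, columns = predicted
print([[tn, fp], [fn, tp]])
print(accuracy, precision, recall, f1)
```

In a fraud setting, a false negative (missed fraud) and a false positive (a genuine customer investigated) carry different costs, which is why precision and recall are read separately.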

In [None]:
# suppress warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
print(classification_report(y_test, y_pred))

![](https://images.prismic.io/encord/edfa849b-03fb-43d2-aba5-1f53a8884e6f_image5.png?auto=compress,format)

![](https://www.researchgate.net/publication/346062755/figure/fig5/AS:960496597483542@1606011642491/Confusion-matrix-and-performance-evaluation-metrics.png)

In [None]:
accuracy_score(y_test, y_pred)

## Model Improvements in Machine Learning


In [None]:
# 72
# Different ways to improve model accuracy
# 1. Data-centric improvements
    # Data cleaning
    # Feature engineering
    # Increase the amount of data
    # Add context (more relevant features)

# 2. Model-centric improvements
    # 1. Choose a robust algorithm
    # 2. Hyperparameter tuning
    # 3. Cross-validation
    # 4. Ensemble learning
    # 5. Regularization (helps only with overfitting)

# 3. Other strategies
    # 1. Reframe the problem: Sometimes, improving a model isn't about the data or algorithm, but about how the problem itself is defined.
    # 2. Model monitoring: After deployment, continuously monitor the model's performance in production to identify and address issues that arise over time.

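## 2. Hyperparameter tuning
Hyperparameter tuning (or optimization) is the process of finding the best configuration settings (hyperparameters) for a machine learning model. Unlike model weights, hyperparameters are set before training begins and are not learned from the data, so tuning means training the model several times with different settings and comparing the results.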

In [None]:
# Common Tuning Methods

In [None]:
# model 
from sklearn.linear_model import LogisticRegression
lr1 = LogisticRegression()
lr1.fit(x_train,y_train)

lr1.score(x_train,y_train)

In [None]:
# 1. change the parameters manually (lbfgs solver, C=0.5)
lr2 = LogisticRegression(penalty="l2",solver='lbfgs', C=0.5)
lr2.fit(x_train,y_train)

lr2.score(x_train,y_train)

In [None]:
# same settings, but with the saga solver
lr2 = LogisticRegression(penalty="l2",solver='saga', C=0.5)
lr2.fit(x_train,y_train)

lr2.score(x_train,y_train)

In [None]:
list1=[0.5,1,10,15,20]
for i in list1:
    print(i)

In [None]:
for i in list1:
    lr2 = LogisticRegression(penalty="l2",solver='saga', C=i)
    lr2.fit(x_train,y_train)

    print(i, lr2.score(x_train,y_train))

In [None]:
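# try every combination of C and penalty with the liblinear solver
# (liblinear supports both 'l1' and 'l2'; lbfgs supports only 'l2' or no penalty)
C_values = [0.1, 0.3, 0.5, 1, 10, 100, 100000]
penalties = ['l1', 'l2']
for c in C_values:
    for penalty in penalties:
        lr2 = LogisticRegression(penalty=penalty, solver='liblinear', C=c)
        lr2.fit(x_train, y_train)

        # training accuracy only; compare on held-out data before choosing a setting
        print(lr2.score(x_train, y_train), c, penalty)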

### Cross-validation
Cross-validation is a technique used to check how well a machine learning model performs on unseen data and to help detect overfitting. It works by:

- Splitting the dataset into several parts.
- Training the model on some parts and testing it on the remaining part.
- Repeating this resampling process multiple times by choosing different parts of the dataset.
- Averaging the results from each validation step to get the final performance.
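
The four steps above can be sketched with scikit-learn's `cross_val_score`. The data here is synthetic (generated with `make_classification`) and only stands in for the claims features; the choice of 5 folds is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the claims features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# cv=5 splits the data into 5 parts; each part is used once as the test fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # averaged result
```

A large spread between the fold scores is a hint that the estimate from a single train/test split would be unreliable.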

1. Holdout Validation
- `from sklearn.model_selection import train_test_split`

![](https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/cross-validation-in-machine-learning-how-to-do-it-right-1-1.jpg?resize=1020%2C534&ssl=1)

2. LOOCV (Leave One Out Cross Validation)
- `from sklearn.model_selection import LeaveOneOut`

![](https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/cross-validation-in-machine-learning-how-to-do-it-right-3.jpg?w=1200&ssl=1)

3. Stratified Cross-Validation
- `from sklearn.model_selection import StratifiedKFold`

![](https://dataaspirant.com/wp-content/uploads/2020/12/8-Stratified-K-Fold-Cross-Validation-768x516.png)

4. K-Fold Cross Validation
- `from sklearn.model_selection import KFold`

![](http://media.geeksforgeeks.org/wp-content/uploads/20250927122541290704/222.webp)

https://dataaspirant.com/cross-validation/
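
The difference between plain K-fold and stratified K-fold matters most when one class is rare, as fraud usually is. The sketch below uses a made-up label vector with 25% positives, sorted so that plain `KFold` splits them unevenly; the helper `positives_per_fold` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 20 samples, 25% positives, sorted so the positives sit at the end
y_demo = np.array([0] * 15 + [1] * 5)
X_demo = np.arange(20).reshape(-1, 1)

def positives_per_fold(splitter):
    """Count positive labels in each test fold produced by the splitter."""
    return [
        int(y_demo[test_idx].sum())
        for _, test_idx in splitter.split(X_demo, y_demo)
    ]

plain = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With plain K-fold, some test folds contain no positive cases at all, so recall cannot be measured on them; stratification keeps the class ratio the same in every fold.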

In [None]:
# Methods 
# 1. Random search CV
# 2. GridSearch CV
# 3. Bayesian search

In [None]:

# Grid search CV
from sklearn.model_selection import GridSearchCV

param = {
    'C': [0.1, 1.0, 10.0, 100.0],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'] # 'liblinear' supports both 'l1' and 'l2' penalties
}
lr3 = LogisticRegression()

grid = GridSearchCV(lr3, param_grid=param, verbose=3, cv=2)
# cv is the number of cross-validation folds used to score each parameter combination

In [None]:
grid.fit(x_train,y_train)

In [None]:
# show the best params
grid.best_params_

In [None]:
# return the best model
grid.best_estimator_
model = grid.best_estimator_

In [None]:
model.predict(x_test)

#### 2. Randomized search CV


In [196]:
from sklearn.model_selection import RandomizedSearchCV
# grid = GridSearchCV(lr3,param_grid=param,verbose=3 ,cv=2)
random = RandomizedSearchCV(lr3, param_distributions=param, verbose=3, cv=10, n_iter=8)
# n_iter is the number of parameter combinations that are sampled and fitted
# (param contains only 4 x 2 x 1 = 8 combinations, so n_iter=8 tries all of them)

In [197]:
random.fit(x_train,y_train)

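ValueError: Found input variables with inconsistent numbers of samples: [800, 700]

This error most likely means `x_train` and `y_train` came from different splits: `x_train` has 800 rows (the 80/20 split), while `y_train` has 700 rows, which matches the 70/30 split in the "overview of all models" cell that reuses the name `y_train`. Re-running the train/test split cell before calling `fit` should restore matching shapes.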

In [None]:
random.best_params_

In [198]:
random.score(x_train,y_train)

NotFittedError: This RandomizedSearchCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

The search is not fitted here because the `fit` call above failed.

In [199]:
y_train.shape

(700,)
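
Randomized search becomes more useful when it samples from a continuous range rather than a short list. The sketch below is one way to do that, under assumptions: it uses synthetic data, searches only `C`, and draws values from `scipy.stats.loguniform`; none of these choices come from this notebook.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the claims features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Sample C from a continuous log-uniform range instead of a fixed list
param_distributions = {"C": loguniform(1e-2, 1e2)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=10,        # number of random settings to try
    cv=5,             # 5-fold cross-validation for each setting
    random_state=42,  # makes the sampled settings repeatable
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Note that `best_score_` is the mean cross-validated score of the best setting, not a score on a separate test set.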

In [None]:
#  save the model 
import pickle
filename = 'logistic_model1.pkl'
pickle.dump(lr2, open(filename, 'wb'))
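
A quick way to check that a saved model is usable is to load it back and compare predictions. The sketch below does this on synthetic data and writes to a temporary folder; the file name `demo_model.pkl` and the variable names are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Write to a temporary folder so the sketch does not touch real model files
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")

# The with-blocks close the file handles even if an error occurs
with open(path, "wb") as f:
    pickle.dump(clf, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should give exactly the same predictions
same_predictions = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.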

## Streamlit code

In [None]:
import streamlit as st
import pandas as pd
import pickle

# Load the trained model (the file name must match the one used when saving)
filename = 'logistic_model1.pkl'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

st.title("Insurance Claim Fraud Prediction")

st.write("""
This app predicts whether an insurance claim is fraudulent based on several factors.
""")

# Define input fields for numerical features
# The model expects months, so ask for years and convert (the bounds here are illustrative)
years_as_customer       = st.number_input('Years as a customer', min_value=0, max_value=50, value=10)
months_as_customer = years_as_customer * 12

age                     = st.number_input('Age', min_value=18, max_value=100, value=30)
policy_annual_premium   = st.number_input('Policy Annual Premium', min_value=0.0, max_value=2500.0, value=1000.0)
auto_year               = st.number_input('Auto Year', min_value=1980, max_value=2023, value=2010)
umbrella_limit          = st.number_input('Umbrella Limit', min_value=-10000000, max_value=10000000, value=0)


# Create a button to make a prediction
if st.button('Predict Fraud'):
    # Create a DataFrame from the input values
    input_data = pd.DataFrame({
        'months_as_customer': [months_as_customer],
        'age': [age],
        'policy_annual_premium': [policy_annual_premium],
        'auto_year': [auto_year],
        'umbrella_limit': [umbrella_limit]
      })

    # Select the columns used for prediction; the loaded model must have been trained on exactly these columns
    numerical_cols_for_prediction = ['months_as_customer', 'age', 'policy_annual_premium', 'auto_year', 'umbrella_limit']
    input_data_for_prediction = input_data[numerical_cols_for_prediction]

    # Make the prediction
    prediction = loaded_model.predict(input_data_for_prediction)

    # Display the prediction (labels were encoded as 1 = fraud, 0 = no fraud during training)
    if prediction[0] == 1:
        st.write("Prediction: Fraud Reported (Y)")
    else:
        st.write("Prediction: No Fraud Reported (N)")



#### Overview of all models

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Load the data
df = pd.read_csv("insurance_claims_report.csv")

# Sample data creation - REPLACE THIS with your full CSV file loading
# data = {
#     "months_as_customer": [328, 228, 134, 256, 228, 256],
#     "age": [48, 42, 18, 41, 44, 39],
#     "policy_annual_premium": [1406.91, 1197.22, 1413.14, 1415.74, 1583.91, 1351.10],
#     "auto_year": [2004, 2007, 2007, 2014, 2009, 2003],
#     "umbrella_limit": [0, 5000000, 5000000, 6000000, 6000000, 0],
#     "fraud_reported": ['Y', 'Y', 'N', 'Y', 'N', 'Y']
# }
# df = pd.DataFrame(data)

# Features and target
X = df[["months_as_customer", "age", "policy_annual_premium", "auto_year", "umbrella_limit"]]
y = df["fraud_reported"]

# Encode target variable
le = LabelEncoder()
y_encoded = le.fit_transform(y)  # Y -> 1, N -> 0

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

# Models dictionary
models = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Support Vector Machine": SVC(),
    "Naive Bayes": GaussianNB(),
    "Neural Network (MLP)": MLPClassifier(max_iter=500)
}
# Train & evaluate all models
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    print(f"{name}:\nAccuracy = {acc:.2f}")
    print("=======================")

Logistic Regression:
Accuracy = 0.73
K-Nearest Neighbors:
Accuracy = 0.68
Decision Tree:
Accuracy = 0.61
Random Forest:
Accuracy = 0.70
Gradient Boosting:
Accuracy = 0.71
Support Vector Machine:
Accuracy = 0.73
Naive Bayes:
Accuracy = 0.67
Neural Network (MLP):
Accuracy = 0.73
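
Accuracy figures like these are easier to judge against a trivial baseline. The sketch below compares logistic regression with a classifier that always predicts the majority class. It runs on synthetic imbalanced data; the 75/25 class weights are an assumption, since the class balance of the insurance dataset is not shown in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumed 75% / 25% class split)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Always predicts the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model_demo.score(X_te, y_te)
print(f"baseline = {baseline_acc:.2f}, logistic regression = {model_acc:.2f}")
```

If a model's accuracy is close to the baseline, it may be predicting the majority class almost every time, so recall and precision for the fraud class are worth checking as well.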
