# **INSURANCE CLAIM ANALYSIS**

## Objectives

* Fetch a dataset from Kaggle and save the raw data as a CSV file
* Perform the Extract, Transform and Load (ETL) process on the raw CSV file
* Engineer features to support the analysis
* Explore a copy of the clean dataset and build visualisations to test the hypotheses
* Use the cleaned data in a dashboard app

## Inputs

* The only input for this analysis is the raw data in the insurance_data.csv file.

## Outputs

* A cleaned version of the insurance_data.csv dataset
* Engineered features added to the dataset
* A variety of charts produced during the exploratory phase
* Findings presented through an interactive dashboard

---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [7]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\funmi\\OneDrive\\Documents\\Code Institute\\vscode-projects\\insurance-claim-analysis\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [8]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [9]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\funmi\\OneDrive\\Documents\\Code Institute\\vscode-projects\\insurance-claim-analysis'
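Re-running the directory change moves one level further up each time. As a minimal sketch, assuming the notebooks always live in a folder named `jupyter_notebooks` (as in the path above), a small helper can make the step safe to repeat. The function name `resolve_project_root` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def resolve_project_root(start: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of `start` if `start` is the notebooks folder, else `start` itself."""
    # Only step up one level when we are actually inside the notebooks folder
    return start.parent if start.name == notebooks_dir else start


# Illustrative paths only; they do not need to exist on disk
inside_notebooks = Path("/project/jupyter_notebooks")
already_at_root = Path("/project")

print(resolve_project_root(inside_notebooks))
print(resolve_project_root(already_at_root))
```

In the notebook this could be used as `os.chdir(resolve_project_root(Path.cwd()))`, which leaves the working directory unchanged if the cell is run a second time.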

# Section 1 - Extracting the data

Section 1 content:

* Import required Python libraries for analysis and visualisation
* Load and extract the dataset from the data folder
* Show the first few rows of the dataset to understand the structure
* Extract statistical summary and gain basic information on the data


In [10]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

print('All required python libraries have been imported')

All required python libraries have been imported


---

# Section 2 - Transform the data (cleaning and preparation)

Section 2 content:

* Handle missing values and duplications
* Transform the data by dropping unnecessary rows and/or columns
* Feature engineer columns relevant for analysis
* Export cleaned data for Power BI and Python visualisation


In [11]:
# Loading original dataset
df_claim = pd.read_csv('data/insurance_data.csv')

# Display number of rows and columns within the raw data
print(f"Data loaded. Initial shape: {df_claim.shape}")

Data loaded. Initial shape: (1340, 11)


In [None]:
# create a copy of the original dataset to start ETL and EDA process
df = df_claim.copy()
display(df.head())

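| | index | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 39.0 | male | 23.2 | 91 | Yes | 0 | No | southeast | 1121.87 |
| 1 | 1 | 2 | 24.0 | male | 30.1 | 87 | No | 0 | No | southeast | 1131.51 |
| 2 | 2 | 3 | NaN | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 |
| 3 | 3 | 4 | NaN | male | 33.7 | 80 | No | 0 | No | northwest | 1136.4 |
| 4 | 4 | 5 | NaN | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 |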


In [None]:
# check to see data types
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 1340 entries, 0 to 1339
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   index          1340 non-null   int64  
 1   PatientID      1340 non-null   int64  
 2   age            1335 non-null   float64
 3   gender         1340 non-null   str    
 4   bmi            1340 non-null   float64
 5   bloodpressure  1340 non-null   int64  
 6   diabetic       1340 non-null   str    
 7   children       1340 non-null   int64  
 8   smoker         1340 non-null   str    
 9   region         1337 non-null   str    
 10  claim          1340 non-null   float64
dtypes: float64(3), int64(4), str(4)
memory usage: 115.3 KB


In [14]:
# check to see any missing values
df.isnull().sum()

index            0
PatientID        0
age              5
gender           0
bmi              0
bloodpressure    0
diabetic         0
children         0
smoker           0
region           3
claim            0
dtype: int64

The report above shows missing values in two columns: age (5) and region (3). Together they affect 8 of the 1,340 rows (about 0.6%), so these rows can be dropped from the dataframe without materially changing the dataset.
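To judge whether dropping rows is reasonable, it helps to count the rows affected rather than the missing cells, since one row can have gaps in both columns. The sketch below does this on a small made-up frame; the values and the variable names `toy`, `rows_with_gaps` and `share_affected` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the claims data (values are made up)
toy = pd.DataFrame({
    "age": [39.0, 24.0, None, 31.0, None],
    "region": ["southeast", None, "northwest", "southwest", "northeast"],
    "claim": [1121.87, 1131.51, 1135.94, 1136.40, 1137.01],
})

# True for every row with a missing value in at least one of the two columns
rows_with_gaps = toy[["age", "region"]].isnull().any(axis=1)

n_affected = int(rows_with_gaps.sum())
share_affected = n_affected / len(toy)

print(f"{n_affected} of {len(toy)} rows affected ({share_affected:.1%})")
```

On the real data the same expression can be applied to `df` before calling `dropna`.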

In [None]:
# drop rows that have missing values in columns age and region
df = df.dropna(subset=["age", "region"])
# confirming the missing values have been dropped by the change in the shape of the dataframe
print("After dropping rows with missing values in columns: Age and Region", df.shape)
# confirm no missing values
print(df.isnull().sum())

After dropping rows with missing values in columns: Age and Region (1332, 11)
index            0
PatientID        0
age              0
gender           0
bmi              0
bloodpressure    0
diabetic         0
children         0
smoker           0
region           0
claim            0
dtype: int64


In [None]:
# drop the 'index' column: it duplicates the DataFrame's own index and is not needed
df = df.drop(columns=['index'])
display(df.head())

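| | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39.0 | male | 23.2 | 91 | Yes | 0 | No | southeast | 1121.87 |
| 1 | 2 | 24.0 | male | 30.1 | 87 | No | 0 | No | southeast | 1131.51 |
| 7 | 8 | 19.0 | male | 41.1 | 100 | No | 0 | No | northwest | 1146.8 |
| 8 | 9 | 20.0 | male | 43.0 | 86 | No | 0 | No | northwest | 1149.4 |
| 9 | 10 | 30.0 | male | 53.1 | 97 | No | 0 | No | northwest | 1163.46 |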
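The Section 2 outline also lists handling duplications. A minimal sketch of one way to check for and remove fully duplicated rows is shown below on made-up data. Whether the real dataset contains any duplicates is not stated here, and the names `toy`, `n_duplicates` and `deduped` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with one fully duplicated row (values are made up)
toy = pd.DataFrame({
    "PatientID": [1, 2, 2, 3],
    "age": [39.0, 24.0, 24.0, 19.0],
    "claim": [1121.87, 1131.51, 1131.51, 1146.80],
})

# duplicated() flags every repeat of an earlier identical row
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicate rows found: {n_duplicates}")
print(f"Shape after removing duplicates: {deduped.shape}")
```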


### Encoding BMI Column

Categorising the BMI values into a new column, using the ranges published on the NHS website, makes the analysis easier:

* Underweight = <18.5
* Normal weight = 18.5-24.9
* Overweight = 25.0-29.9
* Obese = 30.0+
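The boundaries of these ranges matter: a BMI of exactly 18.5 should count as Normal weight and exactly 30.0 as Obese. As a hedged sketch on made-up values, left-closed bins (`right=False` in `pd.cut`) with edges at 18.5, 25 and 30 reproduce the ranges listed above. The edge list and variable names here are assumptions for illustration.

```python
import pandas as pd

# Made-up BMI values chosen to sit on or next to each boundary
bmi_values = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0])

# With right=False each bin is [lower, upper): it includes its lower edge only
edges = [0, 18.5, 25, 30, float("inf")]
band_labels = ["Underweight", "Normal weight", "Overweight", "Obese"]

categories = pd.cut(bmi_values, bins=edges, labels=band_labels, right=False)

print(pd.DataFrame({"bmi": bmi_values, "category": categories}))
```

With the default `right=True`, each bin would instead include its upper edge, so a value of exactly 18.5 would fall into the Underweight band.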

In [None]:
# Bin edges for the 4 BMI ranges; with right=False each bin includes its lower edge
bins = [0, 18.5, 25, 30, float('inf')]

# Labels associated with the bins created above
labels = ['Underweight', 'Normal weight', 'Overweight', 'Obese']

# Create the BMI category column from the lowercase 'bmi' column using the bins and labels above
df['BMI_category'] = pd.cut(df['bmi'], bins=bins, labels=labels, right=False)

display(df.head())
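The Section 2 outline ends with exporting the cleaned data for Power BI and Python visualisation. The sketch below shows one way this could look, writing a small made-up frame to CSV and reading it back. The output folder and file name are hypothetical, and a temporary directory is used so the example does not touch the project files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned dataframe
cleaned = pd.DataFrame({
    "PatientID": [1, 2, 8],
    "bmi": [23.2, 30.1, 41.1],
    "BMI_category": ["Normal weight", "Obese", "Obese"],
})

# Hypothetical output location inside a temporary folder
out_dir = Path(tempfile.mkdtemp()) / "data" / "cleaned"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "insurance_data_cleaned.csv"

# index=False keeps the DataFrame index out of the exported file
cleaned.to_csv(out_path, index=False)

# Read the file back to confirm the export round-trips
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```

Writing with `index=False` avoids an extra unnamed index column appearing when the file is loaded again.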

---

# Section 3 - Data Visualisation

Section 3 content:

* Hypothesis 1 - Use of a scatterplot to display how smoking status impacts insurance claims
* Hypothesis 2 - Use of a Heatmap to 


---

# Conclusions and Next Steps

