#  Medical Insurance Claim Exploratory Data Analysis

## 1. Defining the problem

This dataset contains the medical history of a small group of people: whether they have certain medical conditions, whether they have previously undergone any surgery, and their age, height and weight.

Based on these conditions, each person pays a different Medical Insurance Premium, which is also present in the dataset.

From this data, we have to analyze which factors and medical conditions affect the premium amount. By exploring further, we can also analyze the impact of a person's age and weight on the conditions that occur.


## 2. Asking the questions

* How does a person's age affect the premium?
* Does having previously undergone a major surgery or a transplant affect the premium?
* Do other medical conditions such as Diabetes, Allergies, Blood Pressure and Chronic Disease affect the premium amount?
* Does a family history of cancer affect the premium?
* How do a person's height and weight affect the premium?



**Note:** If any new question arises while proceeding through the analysis, it will be noted down here and then answered.

## 3. Importing the data and checking for consistency

In [None]:
#Importing numpy and pandas
import numpy as np
import pandas as pd

In [None]:
#Importing dataset
df=pd.read_csv('../input/medical-insurance-premium-prediction/Medicalpremium.csv')

In [None]:
#Viewing the dataset
df.head()

In [None]:
df.shape

In [None]:
#Checking columns name, null values and data types of each column
df.info()

All the columns are of integer type, which matches what we see in the dataset.

There are no null values in any column.

We can also check the 'Age', 'Height', 'Weight' and 'PremiumPrice' columns for consistency. This is done to identify any outliers, since outlier values would bias our analysis.

Finally, we will remove extra spaces from the column names, if any exist.
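The null check above is read off the `df.info()` output. An explicit count per column is a common alternative. The following is only a sketch on a made-up frame (the name `sample` and its values are assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from Medicalpremium.csv
sample = pd.DataFrame({
    "Age": [25, 40, np.nan],
    "Weight": [60, 82, 75],
})

# Number of missing values in each column
null_counts = sample.isnull().sum()
print(null_counts)

# True only if no column contains a missing value
has_no_nulls = not sample.isnull().values.any()
print(has_no_nulls)
```

On the real data the same two calls would be applied to `df` instead of `sample`.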

## 4. Data Cleaning

#### Checking for consistency in 'Age', 'Height', 'Weight' and 'PremiumPrice' columns.

In [None]:
Unique_Age = df['Age'].unique()
print('Age_range', sorted(Unique_Age))
print('\n')

Unique_Height = df['Height'].unique()
print('Height_range', sorted(Unique_Height))
print('\n')

Unique_Weight = df['Weight'].unique()
print('Weight_range', sorted(Unique_Weight))
print('\n')

Unique_PremiumPrice = df['PremiumPrice'].unique()
print('PremiumPrice_range', sorted(Unique_PremiumPrice))

The Age of the people ranges from 18 years to 66 years. This is a plausible range and thus has no outliers.

The Height of the people ranges from 145 cm to 188 cm. This has no outliers.

The Weight of the people ranges from 51 kg to 132 kg. This has no outliers.

The Premium Price ranges from Rs. 15000 to Rs. 40000. This is a plausible yearly premium range (the currency may not actually be Rupees), and thus this column also has no outliers.
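The check above relies on reading the sorted values by eye. A more systematic option is the common 1.5 x IQR rule, which this notebook does not use. It is sketched here under that assumption on a made-up sample (the helper `iqr_outliers` and the `sample` values are hypothetical):

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up ages with one implausible entry; not taken from Medicalpremium.csv
sample = pd.DataFrame({"Age": [18, 25, 33, 41, 52, 66, 150]})

outliers = iqr_outliers(sample["Age"])
print(outliers.tolist())
```

An empty result for a column would support the "no outliers" reading, although the rule is only a heuristic and the choice of `k` is a judgment call.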

#### Removing extra spaces from the column names, if any exist

In [None]:
df.columns = df.columns.str.strip()

In [None]:
#Viewing our cleaned data
df.tail()

#### This is the clean data on which we will perform our further analysis

## 5. Performing Analysis

#### First we will plot a heatmap to find the correlation between the different features or columns

In [None]:
#Importing the plotting libraries

import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
#plotting heatmap

fig, ax = plt.subplots(1, 1, figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap="YlGnBu", ax=ax)

From the above heatmap, we see that there is a high correlation between the Age of a person and the Premium they pay.

We can also see a good correlation between the Number of Major Surgeries and Age, and between the Number of Major Surgeries and Blood Pressure problems.

There is also a good correlation between the Premium Price and Any Transplant, Number of Major Surgeries, and Any Chronic Disease.

We will explore all these relations.
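Reading individual cells off a heatmap can be error-prone, so it may help to list the correlations with the target column in sorted order. This is a minimal sketch on invented values (the `sample` frame is an assumption; only the column names come from the dataset):

```python
import pandas as pd

# Invented values for illustration only; not taken from Medicalpremium.csv
sample = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Diabetes": [0, 1, 0, 1, 0],
    "PremiumPrice": [15000, 20000, 24000, 28000, 31000],
})

# Pearson correlation of every numeric column with the target column
corr_with_premium = (
    sample.select_dtypes("number")
    .corr()["PremiumPrice"]
    .drop("PremiumPrice")
)

# Strongest relationships first, regardless of sign
ranked = corr_with_premium.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top instead of pushing them to the bottom of the list.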

### Q1. How does a person's age affect the premium?

To answer this question we will first find the correlation between the 'Age' and 'PremiumPrice' columns. If there is a good correlation, we will plot and explore further.

In [None]:
# Finding the correlation between the Age and PremiumPrice columns

df['Age'].corr(df['PremiumPrice'])

There is a good correlation between the Age of a person and the Premium they pay. We will plot the two against each other to see the details.

In [None]:
# Plotting between Age and Premium column

sns.scatterplot(x='Age',y='PremiumPrice',data=df)

From this plot we can see that, except for a few points, most of the points are aligned at a particular price for a certain range of Age.

Most of the people between 18 and 30 years pay a premium of 15000. A few people between 25 and 30 years pay a premium of around 19000. The same pattern holds for the other Age ranges.

To understand this more clearly, we will add a new column whose value is based on the age group. We will group age into the ranges 0-20 years, 21-30 years, 31-40 years, 41-50 years, 51-60 years, and 61 years and above.
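Before applying the grouping to the whole dataset, it can be worth checking how boundary ages are assigned. `pd.cut` uses right-closed intervals by default, so with edges at 20 and 30 an age of exactly 20 lands in '0-20' and 21 lands in '21-30'. A small sketch on made-up ages (the `ages` series is an assumption):

```python
import pandas as pd

# Made-up ages chosen to sit on and next to the bin edges
ages = pd.Series([18, 20, 21, 30, 31, 60, 61, 66])

bins = [0, 20, 30, 40, 50, 60, 100]
labels = ['0-20', '21-30', '31-40', '41-50', '51-60', '61 and above']

# Intervals are (0, 20], (20, 30], ... because right=True is the default
age_range = pd.cut(ages, bins, labels=labels)
print(pd.DataFrame({'Age': ages, 'age_range': age_range}))
```

Since the ages in this dataset are whole numbers, the labels and the underlying intervals describe the same groups.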

In [None]:
# Adding a new column as 'age_range'

bins = [0,20,30,40,50,60,100]
df['age_range'] = pd.cut(df['Age'], bins,labels=('0-20','21-30','31-40','41-50','51-60','61 and above'))
df.head()

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='age_range',y='PremiumPrice',data = df).set(title='Age Vs Premium Price')

Thus for the first question the answer can be given from the plot itself.

### Ans1. As the age of a person increases, the price of the premium also increases.


### Q2. Does having previously undergone a major surgery or a transplant affect the premium?

In [None]:
# Finding the correlation between the NumberOfMajorSurgeries and PremiumPrice columns

df['NumberOfMajorSurgeries'].corr(df['PremiumPrice'])

However, the correlation is not very strong, so we will not explore this much further and will only draw a single plot.

In [None]:
# Plotting between the NumberOfMajorSurgeries and PremiumPrice columns

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(x='NumberOfMajorSurgeries',y='PremiumPrice',data = df).set(title='Number Of Major Surgeries Vs Premium Price')

No major relation or change can be seen between Number Of Major Surgeries and Premium Price. But we can conclude that people who have undergone more surgeries pay a slightly higher premium.

However, we have seen a good correlation between Number Of Major Surgeries and Age. We will explore this further.

In [None]:
# Plotting between the NumberOfMajorSurgeries and Age columns

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(x='NumberOfMajorSurgeries',y='Age',data = df).set(title='Number Of Major Surgeries Vs Age')

From this plot it is clear that people who have undergone 2 or more surgeries are, on average, more than 50 years old.

Thus we can conclude that people who are more than 50 years old and who have undergone more than 2 surgeries have to pay a high premium.
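A bar plot shows only the mean of each group, so a conclusion that combines age, surgeries and premium is easier to check in a table that also reports group sizes. This is a sketch on invented rows (the `sample` frame and its values are assumptions; the column names come from the dataset):

```python
import pandas as pd

# Invented rows for illustration only; not taken from Medicalpremium.csv
sample = pd.DataFrame({
    "Age": [25, 35, 45, 55, 60, 65],
    "NumberOfMajorSurgeries": [0, 0, 1, 2, 2, 3],
    "PremiumPrice": [15000, 23000, 23000, 28000, 28000, 28000],
})

# Mean age, mean premium and group size for each surgery count
summary = sample.groupby("NumberOfMajorSurgeries").agg(
    mean_age=("Age", "mean"),
    mean_premium=("PremiumPrice", "mean"),
    n=("Age", "size"),
)
print(summary)
```

A group with very few rows would make its mean unreliable, which the `n` column makes visible.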


Next we will find the relation between AnyTransplants and the Premium paid.

In [None]:
# Finding the correlation between the AnyTransplants and PremiumPrice columns

df['AnyTransplants'].corr(df['PremiumPrice'])

In [None]:
# Plotting between AnyTransplants and Premium column

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(x='AnyTransplants',y='PremiumPrice',data = df).set(title='Transplants Vs Premium Price')

We can see that people who have had a transplant pay a higher premium than those who have not.

### Ans2. People who are more than 50 years old and who have undergone more than 2 surgeries have to pay a high premium. Also, those who have had a transplant pay a higher premium than those who have not.

### Q3. How do other medical conditions such as Diabetes, Allergies, Blood Pressure and Chronic Disease affect the premium amount?

In [None]:
# Finding the correlation between the Diabetes and PremiumPrice columns

df['Diabetes'].corr(df['PremiumPrice'])

In [None]:
# Plotting between Diabetes and PremiumPrice column

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(x='Diabetes',y='PremiumPrice',data = df).set(title='Diabetes Vs PremiumPrice')

People with diabetes pay a slightly higher premium.

In [None]:
# Finding the correlation between the KnownAllergies and PremiumPrice columns

df['KnownAllergies'].corr(df['PremiumPrice'])

Neither Diabetes nor Allergies shows a strong relation to the Premium paid.

However, from the heatmap we can see some relation between Diabetes and Age.

In [None]:
# Finding the correlation between the Diabetes and Age columns

df['Diabetes'].corr(df['Age'])

In [None]:
# Plotting between Diabetes and Age column

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(x='Diabetes',y='Age',data = df).set(title='Diabetes Vs Age')

Most of the people who have diabetes are more than 40 years of age.

Next we will find the correlation between Blood Pressure and the Premium paid.

In [None]:
# Finding the correlation between the BloodPressureProblems and PremiumPrice columns

df['BloodPressureProblems'].corr(df['PremiumPrice'])

In [None]:
# Plotting between BloodPressureProblems and PremiumPrice column

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(x='BloodPressureProblems',y='PremiumPrice',data = df).set(title='Blood Pressure Vs PremiumPrice')

There is not much relation between Blood Pressure and the Premium paid, but people with Blood Pressure problems pay a slightly higher premium.

From the heatmap we have seen some relation between Blood Pressure and Age.

In [None]:
# Plotting between BloodPressureProblems and Age column

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(x='BloodPressureProblems',y='Age',data = df).set(title='Blood Pressure Vs Age')

Most of the people who have Blood Pressure problems are also more than 40 years of age.
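The bar plots of Age show the mean age of each group, which is not quite the same as the share of people above a given age. One way to check a "most people are over 40" statement directly is to compute that share. A minimal sketch on made-up rows (the `sample` frame is an assumption):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from Medicalpremium.csv
sample = pd.DataFrame({
    "Age": [22, 35, 45, 52, 58, 63, 30, 48],
    "BloodPressureProblems": [0, 0, 1, 1, 1, 1, 1, 0],
})

# Keep only the people who report the condition
with_bp = sample[sample["BloodPressureProblems"] == 1]

# The mean of a boolean column is the fraction of True values
share_over_40 = (with_bp["Age"] > 40).mean()
print(share_over_40)
```

The same pattern would apply to the Diabetes column by changing the filter.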

Next we will find the relation between Chronic Disease and the Premium paid.

In [None]:
# Plotting between the AnyChronicDiseases and PremiumPrice columns

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(x='AnyChronicDiseases',y='PremiumPrice',data = df).set(title='Chronic disease Vs Premium Paid')

Most of the people who have a Chronic Disease pay a higher premium.
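The condition-by-condition comparisons in this section can also be collected into a single table of mean premiums with and without each condition. This is a sketch on invented values (the `sample` frame and the `comparison` table are assumptions; the column names come from the dataset):

```python
import pandas as pd

# Invented values for illustration only; not taken from Medicalpremium.csv
sample = pd.DataFrame({
    "Diabetes": [0, 1, 0, 1, 0, 1],
    "AnyChronicDiseases": [0, 0, 0, 1, 1, 1],
    "PremiumPrice": [15000, 23000, 21000, 30000, 25000, 29000],
})

conditions = ["Diabetes", "AnyChronicDiseases"]

rows = {}
for col in conditions:
    # Mean premium for the 0 (without) and 1 (with) groups
    means = sample.groupby(col)["PremiumPrice"].mean()
    rows[col] = {
        "without": means.loc[0],
        "with": means.loc[1],
        "difference": means.loc[1] - means.loc[0],
    }

comparison = pd.DataFrame.from_dict(rows, orient="index")
print(comparison.round(0))
```

Putting the differences side by side makes it easier to see which conditions move the premium the most.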

### Ans3. Most of the people having Blood Pressure problems and Diabetes are over 40 years of age. People who have any one of these health conditions (Blood Pressure, Diabetes, Chronic Disease) pay a higher premium.

### Q4. Does a family history of cancer affect the premium?

In [None]:
# Plotting between HistoryOfCancerInFamily and PremiumPrice column

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(x='HistoryOfCancerInFamily',y='PremiumPrice',data = df).set(title='HistoryOfCancerInFamily Vs Premium Paid')

A family history of cancer does not affect the Premium Price much.

But from the heatmap it is seen that there is a good correlation between a family history of cancer and the Number Of Major Surgeries.

In [None]:
# Plotting between HistoryOfCancerInFamily and NumberOfMajorSurgeries column

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(x='HistoryOfCancerInFamily',y='NumberOfMajorSurgeries',data = df).set(title='HistoryOfCancerInFamily Vs NumberOfMajorSurgeries')

People who have a family history of cancer are more likely to undergo a major surgery. Thus people with cancer in their family history may have a higher chance of health-related problems that lead to surgery.

### Ans4. We can conclude that people who have a family history of cancer are more likely to have had at least one major surgery. And previously we have seen that people who have undergone more major surgeries pay a higher premium.

### Q5. How do a person's height and weight affect the premium?

We will first group the weight into ranges.

In [None]:
# Adding a new column as 'Weight_range'

bins = [50,70,90,110,130,150]
df['Weight_range'] = pd.cut(df['Weight'], bins,labels=('50-70','71-90','91-110','111-130','131-150'))
df.head()

In [None]:
# Plotting between Weight_range and PremiumPrice column

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='Weight_range',y='PremiumPrice',data = df).set(title='Weight Vs Premium Price')

No major relation can be seen between Weight and Premium Price. However, we can calculate the Body Mass Index (BMI) and then look for a correlation between BMI and the Premium Price and other diseases. BMI = weight (kg) / height (m)^2.
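As a quick worked example of the formula, here is a small helper that takes the height in centimetres, the unit used in this dataset (the function name `bmi` and the example person are assumptions for illustration):

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body Mass Index: weight in kg divided by squared height in metres."""
    height_m = height_cm / 100  # convert cm to m
    return weight_kg / height_m ** 2

# A hypothetical person weighing 70 kg with a height of 175 cm
example = bmi(70, 175)
print(round(example, 2))  # 22.86
```

Forgetting the centimetre-to-metre conversion would shrink the result by a factor of 10,000, so the unit handling is the part worth double-checking.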

In [None]:
# Adding a new column as 'BMI'

# Height is in cm, so convert it to metres before squaring
df['BMI'] = df['Weight'] / (df['Height'] / 100) ** 2

In [None]:
df.head(10)

In [None]:
# Checking the range of BMI values
print('minimum_value: ', df['BMI'].min())
print('\n')
print('maximum_value: ', df['BMI'].max())

Now we will group the BMI into ranges.
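One thing to watch when binning a continuous value such as BMI is that any value outside the outermost edges is silently assigned NaN by `pd.cut`. The sketch below uses made-up BMI values with edges at 0, 20, 30, 40 and 50 to show how that can be detected (the `bmi` series is an assumption, not data from the notebook):

```python
import pandas as pd

# Made-up BMI values chosen to sit on and beyond the bin edges
bmi = pd.Series([18.5, 20.0, 20.1, 30.0, 50.0, 52.0])

bins = [0, 20, 30, 40, 50]
labels = ['0-20', '20-30', '30-40', '40-50']

# Right-closed intervals: 20.0 falls in '0-20', 20.1 in '20-30'
bmi_range = pd.cut(bmi, bins, labels=labels)
print(pd.DataFrame({'BMI': bmi, 'BMI_range': bmi_range}))

# Count of values that fell outside every bin
n_unbinned = int(bmi_range.isna().sum())
print(n_unbinned)
```

A non-zero count on the real data would mean some rows are dropped from any plot grouped by the binned column.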

In [None]:
# Adding a new column as 'BMI_range'

bins = [0,20,30,40,50]
df['BMI_range'] = pd.cut(df['BMI'], bins,labels=('0-20','20-30','30-40','40-50'))
df.head()

In [None]:
# Plotting between BMI_range and PremiumPrice column

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='BMI_range',y='PremiumPrice',data = df).set(title='BMI Vs Premium Price')

Thus no major correlation can be seen between BMI and Premium Price.

### Ans5. The premium price is not affected by the Weight and Height

## 6. Outcome

### Key Points

#### As the age of a person increases, the price of the premium also increases.

#### People whose age is more than 50 years and who have undergone more than 2 surgeries in the past have to pay a higher premium.

#### Also, people who have had a transplant pay a higher premium than those who have not.

#### Most of the people having Blood Pressure problems and Diabetes are over 40 years of age. People who have any one of these health conditions (Blood Pressure, Diabetes, Chronic Disease) pay a higher premium compared to those who have no health problem.

#### People who have a family history of cancer have a higher chance of undergoing a major surgery. And people who have undergone more major surgeries pay a higher premium.

##### It is seen that as a person's age increases, different types of health conditions occur, and thus their premium increases.

### Recommendation

##### More people should take out health insurance at as early an age as possible. This will keep their premium low from the beginning, since at an early age they are unlikely to have any major health-related issues.

##### As they grow older, people should take more care of their health. Everyone should engage in some kind of physical activity. This will reduce the chances of health-related problems occurring.
