##  Predicting Insurance Policy Values with Linear Regression
#### By *Aleah Lydeatte-Hepburn*

Medical insurance protects users financially by making care affordable: users pay monthly fees, which accumulate and go on to cover portions of sudden or exorbitant medical bills. A policy typically involves:

* *Monthly Premiums*, the fees users pay towards their policies.
* *Co-Insurance*, the percentage of a covered medical bill that users pay after meeting their deductible.
* *Copayments*, which are small fixed fees users pay to see insurance-provided specialists.
* *Deductibles*, the amount users pay out of pocket before the insurance company begins to pay.

***Linear Regression*** models the relationship between an input and an output as a line. The input is multiplied by the slope, and the product is added to an intercept (*the predicted output when the input is zero*) to give our output, $y_{predicted} = slope \cdot x + intercept$.

*Ordinary Least Squares* (OLS) is the method used to fit that line: it chooses the slope and intercept that minimize the sum of the squared vertical differences between the data points and the line.
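
As a minimal sketch of what OLS computes for a single feature, the slope and intercept have a closed form. The arrays below are made-up toy values (not rows from *insurance.csv*) and the variable names are only illustrative.

```python
import numpy as np

# Toy inputs and outputs (hypothetical, not taken from insurance.csv)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # built to follow y = 2x + 1 exactly

# Closed-form OLS estimates for one feature
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Predictions, and the quantity OLS minimizes:
# the sum of squared vertical differences between points and line
y_predicted = slope * x + intercept
sse = np.sum((y - y_predicted) ** 2)

print(slope, intercept, sse)
```

Because the toy points sit exactly on a line, the squared differences sum to zero here. On real data the sum is positive, and OLS picks the line that makes it as small as possible.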

Using regression, we can estimate a slope for each feature in our model from the feature's observed values and the values we want to predict.

Habits that affect a person's health can also affect the cost of their insurance policy. For example, smoking can lead to lung issues, gum disease and multiple forms of cancer if sustained for many years. Policies exist as additional coverage, and certain habits can affect how often holders get sick.

**Goal**: Implement *OLS Linear Regression* on two features in *insurance.csv* to find the function that determines the value of users' insurance. Then, use the function to predict medical insurance costs with estimated feature values. 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
insurance = pd.read_csv('/kaggle/input/insurance/insurance.csv')
insurance.head()

In [None]:
len(insurance)

Looking at *insurance.csv*'s data and checking the number of dataframe entries, we can see the dataset is fairly small, with only 1338 rows. This small size may affect how reliable the slope values of our line are.

In [None]:
insurance.isnull().sum() #No missing data, whatsoever!

Since I need numeric features to create my regression line, I will plot the correlations between the numeric features as a heatmap and observe whether any positive correlation exists between them.

In [None]:
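# Keep only numeric columns: recent pandas versions raise an error if corr() sees text columns
insur_rel = insurance.select_dtypes('number').corr()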
f, ax = plt.subplots(figsize=(11,7))

sns.heatmap(insur_rel, square=True, linewidths=3, annot=True, cmap="YlGnBu")

### What Do We See?

We see the feature **Age** has the largest positive correlation with **Charges**. **BMI** has the next largest correlation with **Charges**, compared to the rest of the features.

From this investigation, we can also see that the feature **Children** has little to no linear relationship with any of the other features, which suggests *the number of children has little influence on insurance policy costs*.
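
The heatmap only covers numeric columns, so a yes/no column such as *smoker* is left out of it. A rough sketch of one way to include it is to encode it as 1/0 first. The rows below are made-up toy values and `smoker_flag` is a hypothetical column name, not part of the dataset.

```python
import pandas as pd

# Toy rows (hypothetical values, not from insurance.csv)
toy = pd.DataFrame({
    'age':     [19, 33, 45, 52, 61, 27],
    'smoker':  ['yes', 'no', 'no', 'yes', 'no', 'no'],
    'charges': [17000.0, 4500.0, 8000.0, 40000.0, 13000.0, 3500.0],
})

# Encode the yes/no column as 1/0 so it can join the correlation matrix
toy['smoker_flag'] = (toy['smoker'] == 'yes').astype(int)

# Correlation of each numeric column with charges, strongest first
corr_with_charges = (
    toy.select_dtypes('number')
       .corr()['charges']
       .drop('charges')
       .sort_values(ascending=False)
)
print(corr_with_charges)
```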

In [None]:
# The feature 'Smoker' contains yes/no values
# It may help us gain more info about our dataset

smoker = insurance['smoker']== 'yes'
non_smoker = insurance['smoker']=='no'

in_smoke = insurance[smoker]
in_non_smoke = insurance[non_smoker]
#Two new dataframes -- other than insurance

prob_smoke = len(in_smoke) / len(insurance)
prob_non_smoke = len(in_non_smoke) / len(insurance)
print("The percentage of policy holders smoking is {:.2f}%".format(prob_smoke * 100))
print("The percentage of policy holders not smoking is {:.2f}%".format(prob_non_smoke* 100))

We can see from the probabilities that only a little over 20% of policy holders smoke. 
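
The same percentages can also be read off in one step with `value_counts(normalize=True)`. This is a small sketch on a made-up column, assuming the column holds the strings `'yes'` and `'no'` as in the cell above.

```python
import pandas as pd

# Toy column (hypothetical): 1 smoker out of 5 policy holders
toy = pd.DataFrame({'smoker': ['no', 'yes', 'no', 'no', 'no']})

# Share of each category, expressed as a percentage
smoker_pct = toy['smoker'].value_counts(normalize=True) * 100
print(smoker_pct.round(2))
```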



###  Which of these groups would have higher valued insurance?

In [None]:
in_smoke.head()

In [None]:
in_non_smoke.head()

The value difference in the **charges** column suggests *'Smokers'*, **who make up about 20% of policy holders**, could be charged more than *'Non-Smokers'*.

In [None]:
# Policy Value Conditions 
high_charge = insurance['charges'] > 30000
low_charge = insurance['charges'] <= 30000

In [None]:
smoke_high = insurance[ smoker & high_charge]
smoke_low =  insurance[ smoker &  low_charge]
non_smoke_high = insurance[ non_smoker & high_charge]
non_smoke_low = insurance[ non_smoker & low_charge]

In [None]:
print("High Insurance policies for Smokers range from ${:.2f}".format(min(smoke_high['charges'])), " to ${:.2f}".format(max(smoke_high['charges'])))
print("Low Insurance policies for Smokers range from ${:.2f}".format(min(smoke_low['charges'])), " to ${:.2f}".format(max(smoke_low['charges'])))

In [None]:
print("High Insurance policies for Non-Smokers range from ${:.2f}".format(min(non_smoke_high['charges'])), " to ${:.2f}".format(max(non_smoke_high['charges'])))
print("Low Insurance policies for Non-Smokers range from ${:.2f}".format(min(non_smoke_low['charges'])), " to ${:.2f}".format(max(non_smoke_low['charges'])))

There is a noticeable difference in insurance policy pricing when we compare amounts below and above \$30,000.

* Low valued policies for *Smokers* start at more expensive prices than for *Non-Smokers*. Here, the starting difference is ${\$11,000}$.
* High valued policies for *Smokers* have a greater range (${\$33,000}$) than *Non-Smokers* (${\$6,800}$).

## What do these amounts tell us?
1. *Smokers* have more high-valued policies, likely because insurance providers expect to cover more necessary medical attention.
1. *Non-Smokers* are more likely to have lower valued policies, likely because insurance providers expect less need for coverage, since they are assumed to be healthier.
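
A compact way to check these two points is to summarize **charges** per group and compute the share of each group above the \$30,000 threshold. The sketch below uses made-up toy rows, so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy rows (hypothetical values, not from insurance.csv)
toy = pd.DataFrame({
    'smoker':  ['yes', 'yes', 'no', 'no', 'no'],
    'charges': [42000.0, 18000.0, 3000.0, 9000.0, 31000.0],
})

# Minimum, median and maximum charges for each group
summary = toy.groupby('smoker')['charges'].agg(['min', 'median', 'max'])

# Fraction of each group whose charges exceed 30000
share_high = (toy['charges'] > 30000).groupby(toy['smoker']).mean()

print(summary)
print(share_high)
```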

In [None]:
smoke_high.age.hist(bins=10)
plt.show()

In [None]:
smoke_low.age.hist(bins=10)
plt.show()

The distributions are concentrated towards the left region of both histograms.

This shows that *the age associated with the highest number of policy holders who smoke is around 20 years old*.

In [None]:
non_smoke_high.age.hist(bins=10)
plt.show()

In [None]:
non_smoke_low.age.hist(bins=10)
plt.show()

The two histograms show that **fewer than 10** non-smoker policy holders of any age have high-valued policies --- the rest are all lower valued.

 ### What are these Histograms telling us? 
 
1. *Non-Smokers* with high policies are expected to be older. 

1.  20 year olds, under the conditions above, have the most policy holders of any age group, possibly due to insurance coverage from their first full-time job. 

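1. The histograms for *Smokers* went down with *Age*, which may suggest that **as smoking policy holders grow older, they are likely to outgrow the habit.** Since the data is a single snapshot rather than the same people followed over time, this is only a possible explanation.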
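
Raw histogram counts cannot separate "fewer older smokers" from "fewer older policy holders overall". A hedged way to look at it is the share of smokers within each age band. The rows, band edges and labels below are made up for illustration.

```python
import pandas as pd

# Toy rows (hypothetical values, not from insurance.csv)
toy = pd.DataFrame({
    'age':    [18, 22, 29, 35, 41, 48, 55, 63],
    'smoker': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'no', 'no'],
})

# Assumed bands: (17, 30], (30, 45], (45, 64]
toy['age_band'] = pd.cut(
    toy['age'], bins=[17, 30, 45, 64], labels=['18-30', '31-45', '46-64']
)

# Fraction of policy holders in each band who smoke
smoker_share = (toy['smoker'] == 'yes').groupby(toy['age_band'], observed=True).mean()
print(smoker_share)
```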

In [None]:
less_kid = insurance['children'] < 3
many_kid = insurance['children'] >= 3
#Children Conditions
small_fam = insurance[less_kid]
many_fam =  insurance[many_kid]

In [None]:
sns.catplot(x = 'smoker', y='children', data=insurance, height=7, kind='boxen', linewidth=2.0)
plt.title("Number of Children for Smokers/Non-Smokers")
#It looks like boxplot shapes change at 3 children, so I'll investigate

In [None]:
#Around 0 - 2 kids, it's pretty even until the groups reach 3 children
sns.catplot(x = 'smoker', y='children', data=many_fam, height=7, kind='boxen', linewidth=2.0)
plt.title("Smokers/Non-Smokers with 3 or more Children")

Non-Smokers have a thicker, solid boxplot and Smokers have a narrower boxplot, with the 5-children maximum counted as an outlier.
So, this suggests:

* *Policy holders **who are Smokers** will usually have up to 3 children*
* *Policy holders **who are Non-Smokers** are likely to have more children* 
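
The plots can be backed up with a small table of shares. The sketch below, on made-up rows, uses `pd.crosstab` with `normalize='index'` to show what fraction of each group has 3 or more children. The `family_size` column and its labels are hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical values, not from insurance.csv)
toy = pd.DataFrame({
    'smoker':   ['yes', 'no', 'no', 'yes', 'no', 'no'],
    'children': [0, 3, 1, 2, 4, 0],
})

# Label each row with the same 3-children split used above
toy['family_size'] = toy['children'].ge(3).map({True: '3+', False: '0-2'})

# Each row of the table sums to 1: shares within each smoker group
table = pd.crosstab(toy['smoker'], toy['family_size'], normalize='index')
print(table)
```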

## The Ordinary Least Squares Model

This dataset only has 1338 entries, so we will keep our models simple and test the accuracy of two different regression models. In Machine Learning, adding more features to a model fitted on a small dataset raises the risk of overfitting the input data. For that reason, each model here relies on a single feature.

Here, we can focus on an *Age* based model and a *BMI* based model. Then, we'll measure the accuracy of each model when predicting *Charges*. 

We have an idea of how features are related from the heatmap, but now we can actually create a list of estimated insurance values from our models. 
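
One common measure for comparing how well two single-feature lines fit is $R^2$, the fraction of the variation in the output that the line explains (statsmodels exposes it as the `rsquared` attribute of a fitted result). The sketch below computes it by hand with NumPy on made-up values; `r_squared` is a hypothetical helper, not something defined in this notebook.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a one-feature least-squares line fitted to (x, y)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Toy values (hypothetical, not taken from insurance.csv)
age = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
bmi = np.array([31.0, 24.0, 35.0, 27.0, 29.0])
charges = np.array([2100.0, 3900.0, 6200.0, 7800.0, 10100.0])

r2_age = r_squared(age, charges)
r2_bmi = r_squared(bmi, charges)
print(r2_age, r2_bmi)
```

A value closer to 1 means the line explains more of the variation. In this toy example the age line fits better only because the values were chosen that way.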

In [None]:
#Reg Line using 'Age' to predict 'Charges'
ac = smf.ols(formula = 'charges ~ age', data=insurance).fit()
ac.params

In [None]:
sns.lmplot(x = 'age', y = 'charges', data = insurance, hue="smoker", height=11)
plt.title("Age vs. Price")

## What can we see?
When the model uses *Age* to predict *Charges*, our regression line passes between two other band-like regions that travel in the same direction. Each of the three bands captured in our plot shows how insurance increases, according to age.

The lower band is completely made up of Non-Smokers, the middle band is a mix of mostly Smokers and some Non-Smokers and the top band is of Smokers. 

The scatterplot shows that as bands get higher, they have a looser cluster of datapoints.

We can see the increase in value over time for all three bands is about the same, but **the starting price of a person's policy has a larger effect on final policy values**. It makes sense because the start price will affect how much policy holders contribute every month and that contribution affects the policy's value gain over time.

In [None]:
ac.summary()

The regression line doesn't differentiate between smokers and non-smokers: it is fitted to the total data, and going into that much detail could lead to overfitting. The *slope* is still useful, and we can always adjust the *y-intercept* for a particular group.
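
One way to picture adjusting the intercept per group is to add a 0/1 smoker flag as a second column, so both groups share one age slope but each gets its own starting value. This is a NumPy sketch on made-up values built to follow two exactly parallel lines. It is not the model fitted in this notebook, and all names are illustrative.

```python
import numpy as np

# Toy values (hypothetical): two parallel bands, smokers shifted up by 20000
age = np.array([20.0, 30.0, 40.0, 50.0, 20.0, 30.0, 40.0, 50.0])
smoker = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
charges = 1000.0 + 250.0 * age + 20000.0 * smoker

# Design matrix columns: constant (intercept), age, smoker flag
X = np.column_stack([np.ones_like(age), age, smoker])

# Least-squares solution: one coefficient per column
coef, *_ = np.linalg.lstsq(X, charges, rcond=None)
intercept, age_slope, smoker_shift = coef

print(intercept, age_slope, smoker_shift)
```

Here `smoker_shift` plays the role of the intercept adjustment: non-smokers start at `intercept`, smokers at `intercept + smoker_shift`, and both rise with the same `age_slope`.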

In [None]:
# Predict insurance values using list of ages, as a dataframe
work_age = pd.DataFrame({'age':[18, 30, 45, 61]})
print(ac.predict(work_age))


 ### Conclusion from *Age* vs. *Charges*
 "Age has an effect on insurance value, but starting price will have a larger effect on the final price."

In [None]:
#Reg Line using 'BMI' to predict 'Charges' 
bc = smf.ols(formula = 'charges ~ bmi', data = insurance).fit()
bc.params

In [None]:
sns.lmplot(x = 'bmi', y = 'charges', data = insurance, hue="smoker", height=11)
plt.title("BMI vs. Price")

## What can we see?

In this scatterplot, we see the value increases for Smokers and Non-Smokers more clearly, along with how the two groups are distributed (*Smokers* are **20.48%** and *Non-Smokers* are **79.52%**). The value increase for Non-Smokers is *nearly constant, with little increase* and they have a thicker cluster of points between $25$ and $35$ BMI.

The increase for Smokers is *more clearly positive and linear* and we can see that Smokers do start off with higher policy charges than Non-Smokers.


At $30$ BMI, policy holders would be considered obese and a wider branch forms there between the two groups. Here, weight doesn't have the same effect on charges that a smoking habit would. However, it doesn't give us much info because most policy holders in the data have less than $35$ BMI.

Also, BMI is a value that, unlike Age, can decrease. However, lower BMI doesn't easily translate to lower insurance charges. 

In [None]:
bc.summary()

In [None]:
bmi_cge = pd.DataFrame({'bmi':[20, 30, 40, 50]})
print(bc.predict(bmi_cge))

## Conclusion from "BMI vs. Charges"

"There is a slight increase in charges with BMI, but starting price also affects the value of insurance policies."