# Predicting Medical Expenses Using Linear Regression
1. A typical problem statement for machine learning
2. Downloading and exploring a dataset for machine learning
3. Linear regression with one variable using Scikit-learn
4. Linear regression with multiple variables
5. Using categorical features for machine learning
6. Regression coefficients and feature importance
7. Other models and techniques for regression using Scikit-learn
8. Applying linear regression to other datasets

## Problem Statement
QUESTION: ACME Insurance Inc. offers affordable health insurance to thousands of customers all over the US. As the lead data scientist at ACME, you are tasked with creating an automated system to estimate the annual medical expenditure for new customers, using information such as their age, sex, BMI, number of children, smoking habits and region of residence.

**Estimates from your system will be used to determine the annual insurance premium (amount paid every month) offered to the customer**. Due to regulatory requirements, you must be able to explain why **your system outputs a certain prediction**.

## Downloading the Data

In [63]:
# restart the kernel after installing
#%pip install pandas-profiling --quiet

In [64]:
medical_charges_url = 'https://raw.githubusercontent.com/JovianML/opendatasets/master/data/medical-charges.csv'

In [65]:
from urllib.request import urlretrieve
urlretrieve(medical_charges_url, 'medical.csv')

('medical.csv', <http.client.HTTPMessage at 0x209b10c5cc0>)

In [66]:
import pandas as pd
medical_df = pd.read_csv('medical.csv')
medical_df.head()

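|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |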


In [67]:
medical_df.info()
# No data is missing

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
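
In the `info()` output, the `Non-Null Count` equals the number of rows for every column, which is why no data is missing. A more direct check is to count missing values per column. The sketch below uses a small made-up frame (`sample_df`, with one value blanked out on purpose) as a stand-in for `medical_df`, so the names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for medical_df, with one BMI value missing on purpose
sample_df = pd.DataFrame({
    'age': [19, 18, 28, 33],
    'bmi': [27.9, 33.77, np.nan, 22.705],
    'smoker': ['yes', 'no', 'no', 'no'],
})

# Count missing values per column; all zeros would mean nothing is missing
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(sample_df.isna().any().any())
print(has_missing)
```

Running the same two calls on `medical_df` should report zero for every column if the `info()` output above is accurate.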


In [68]:
medical_df.describe()

|       | age | bmi | children | charges |
|-------|-----|-----|----------|---------|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |



## Splitting the Data into Training and Test Sets


In [69]:
X = medical_df.drop('charges', axis=1) # features: every column except the target, charges
y = medical_df.charges

In [70]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    stratify=X.age, # keep the age distribution similar in both splits
                                                    random_state=42)
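
Whether stratification worked can be checked by comparing the share of each stratum in the two splits. The sketch below is a variation on the split above, not the notebook's own method: it runs on synthetic data (`demo_X`, `demo_y`) and stratifies on three assumed age bands instead of the raw age, so that each stratum holds many rows. The band edges are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table (not the real medical data)
rng = np.random.default_rng(42)
demo_X = pd.DataFrame({
    'age': rng.integers(18, 65, size=1000),   # ages 18..64 inclusive
    'bmi': rng.normal(30, 6, size=1000),
})
demo_y = pd.Series(rng.normal(13000, 12000, size=1000), name='charges')

# Assumed age bands; converted to plain strings to use as stratification labels
age_band = pd.cut(
    demo_X['age'], bins=[17, 30, 45, 64], labels=['18-30', '31-45', '46-64']
).astype(str)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=age_band, random_state=42
)

# Share of each age band in the two splits; the columns should be nearly equal
comparison = pd.DataFrame({
    'train': age_band.loc[X_tr.index].value_counts(normalize=True),
    'test': age_band.loc[X_te.index].value_counts(normalize=True),
})
print(comparison)
```

Stratifying on the raw `age` column, as the cell above does, needs at least two rows for every distinct age. Binning is one way to relax that requirement.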

## Exploratory Analysis and Visualization

Explore the data by visualizing the distribution of values in some columns of the dataset
and the relationships between charges and the other columns.

In [71]:
#%pip install plotly

In [72]:
# Import required libs
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [73]:
# improve the default style and font size of the charts
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10,6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

### Age

In [74]:
X_test.age.describe()

count    268.000000
mean      38.914179
std       14.062864
min       18.000000
25%       26.000000
50%       39.000000
75%       51.000000
max       64.000000
Name: age, dtype: float64

Observations
1. The minimum age is 18.
2. The maximum age is 64.

In [75]:
fig = px.histogram(X_train,
                   x='age',
                   marginal='box',
                   nbins=47,
                   title='Distribution of Age')
fig.update_layout(bargap=0.1)


Customers aged 18 and 19 appear to be the largest age groups in the dataset.
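
The histogram uses 47 bins, which is one bin per year of age from 18 to 64, so each bar height is a count for a single age. The same counts can be read off without a plot. The sketch below uses a short made-up series (`ages`) as a stand-in for `X_train.age`. The values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Hypothetical ages standing in for X_train.age
ages = pd.Series([18, 18, 18, 19, 19, 19, 25, 31, 31, 47, 52, 64], name='age')

# Number of customers at each age, most common first
age_counts = ages.value_counts()
print(age_counts.head(3))

# Share of customers aged 18 or 19
young_share = ages.isin([18, 19]).mean()
print(f'{young_share:.0%} of customers are 18 or 19')
```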

### BMI

In [76]:
fig = px.histogram(X_train,
                   x='bmi',
                   marginal='box',
                   nbins=47,
                   title='Distribution of BMI')
fig.update_layout(bargap=0.1)

Observations
1. BMI values roughly follow a Gaussian distribution.
2. Most customers have a BMI between 25 and 38.
3. There are a few outliers at the ends of the distribution.
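
The outliers that a box plot marks can also be listed numerically. The sketch below applies the common 1.5 × IQR rule to synthetic BMI values (`bmi`, drawn from an assumed normal distribution with two extreme values appended by hand). The fences it computes may differ slightly from the whiskers Plotly draws, since quartile conventions vary between libraries.

```python
import numpy as np
import pandas as pd

# Synthetic BMI values (assumed roughly normal) plus two extreme values
rng = np.random.default_rng(0)
bmi = pd.Series(np.append(rng.normal(30, 6, size=500), [60.0, 62.5]), name='bmi')

# 1.5 * IQR rule: points beyond the fences are treated as outliers
q1, q3 = bmi.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = bmi[(bmi < lower_fence) | (bmi > upper_fence)]
print(f'fences: {lower_fence:.2f} to {upper_fence:.2f}')
print(f'{len(outliers)} outliers')
```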

### Charges

In [77]:
charges_df= pd.DataFrame(y_train, columns=['charges'])
charges_df['smoker'] = X_train.smoker

In [78]:
fig = px.histogram(
    charges_df,
    x='charges',
    marginal='box',
    color='smoker',
    color_discrete_sequence=['green', 'grey'],
    title='Annual Medical Charges'
)
fig.update_layout(bargap=0.1)


Observations
1. Most customers, the majority of whom are non-smokers, have medical charges under 10k.
2. Customers who smoke tend to have charges above 10k, although some of the high charges may instead be due to illness.
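
The gap between smokers and non-smokers can be summarized with a grouped aggregate instead of being read off the histogram. The sketch below uses a few made-up rows (`demo_charges`) in place of `charges_df`. The 10,000 threshold mirrors the observation above and is not a value defined by the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for charges_df
demo_charges = pd.DataFrame({
    'smoker': ['no', 'no', 'no', 'no', 'yes', 'yes'],
    'charges': [1725.55, 4449.46, 3866.86, 8240.59, 16884.92, 27808.73],
})

# Count, median and mean charges per smoker group
summary = demo_charges.groupby('smoker')['charges'].agg(['count', 'median', 'mean'])
print(summary)

# Fraction of each group with charges under 10,000
under_10k = (demo_charges['charges'] < 10_000).groupby(demo_charges['smoker']).mean()
print(under_10k)
```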
   

EXERCISE: Visualize the distribution of medical charges in connection with other factors like "sex" and "region". What do you observe?

### Sex

In [79]:
charges_df= pd.DataFrame(y_train, columns=['charges'])
charges_df['sex'] = X_train.sex

fig = px.histogram(
    charges_df,
    x='charges',
    marginal='box',
    color='sex',
    color_discrete_sequence=[ 'pink','blue'],
    title='Annual Medical Charges'
)
fig.update_layout(bargap=0.1)

Observations
1. Charges for males reach higher values than charges for females.
2. Even so, the median charges for males and females are close to each other.

### Region

In [80]:
charges_df['region'] = X_train.region

fig = px.histogram(
    charges_df,
    x='charges',
    marginal='box',
    color='region',
    color_discrete_sequence=['green', 'grey', 'pink', 'orange'],
    title='Annual Medical Charges'
)
fig.update_layout(bargap=0.1)

Observations
1. Charges are highest in the southeast and lowest in the southwest.
2. The northwest and southwest have similar charges.

### Smoker

In [81]:
# Visualize the smoker column (yes or no), split by sex
charges_df['smoker'] = X_train.smoker
px.histogram(charges_df, x='smoker', color='sex', title='Smoker')

Observation
1. About 20% of customers are smokers.
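
The proportion can be computed directly instead of being estimated from bar heights. The sketch below uses a small made-up frame (`demo`) in place of `charges_df`, with values chosen so that 20% of rows are smokers. The split by sex is illustrative only.

```python
import pandas as pd

# Hypothetical rows standing in for charges_df
demo = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female', 'male',
            'female', 'male', 'female', 'male', 'female'],
    'smoker': ['yes', 'no', 'no', 'no', 'no',
               'yes', 'no', 'no', 'no', 'no'],
})

# Fraction of customers in each smoker category
smoker_share = demo['smoker'].value_counts(normalize=True)
print(smoker_share)

# Smoker share within each sex (each row sums to 1)
by_sex = pd.crosstab(demo['sex'], demo['smoker'], normalize='index')
print(by_sex)
```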


EXERCISE: Visualize the distributions of the "sex", "region" and "children" columns and report your observations.

### Sex

In [82]:
px.histogram(charges_df, x='sex', title='Sex')

In [83]:
px.histogram(charges_df, x='region', title='Region')

Observation
1. The southeast has the most customers; the other regions each have roughly the same number.

In [84]:
charges_df['children'] = X_train.children
px.histogram(charges_df, x='children', title='Children')

Observations
1. The largest group of customers has no children, which may be partly explained by 18 and 19 being the most common ages.
2. Very few customers have 5 children.

### Ages and Charges

In [92]:
fig = px.scatter(
    X_train,
    x='age',
    y=charges_df.charges,
    opacity=0.8,
    hover_data=['sex'],
    title='Age vs. Charges'   
)
fig.update_layout(
    yaxis_title='charges'
)
fig.update_traces(marker_size=7)