<table align="center" width=100%>
    <tr>
        <td>
            <div align="center">
                <font color="#21618C" size=24px>
                    <b> Health Insurance Case Study </b>
                </font>
            </div>
        </td>
    </tr>
</table>

## Data Definition

**age:** Age of the policyholder (Numeric)

**sex:** Gender of the policyholder (Categoric)

**bmi:** Body mass index, an objective index of body weight relative to height, computed as weight divided by the square of height (kg / m ^ 2); it indicates whether weight is relatively high or low for a given height (Numeric)

**children:** Number of children covered by health insurance / number of dependents (Numeric)

**smoker:** Indicates whether the policyholder is a smoker or a non-smoker (non-smoker=0; smoker=1) (Categoric)

**region:** The beneficiary's residential area in the US: northeast, southeast, southwest or northwest (Categoric)

**charges:** Individual medical costs billed by health insurance (Numeric)
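
As a quick illustration of the bmi definition, the sketch below computes the index from a weight in kilograms and a height in metres. The function name `body_mass_index` and the sample values are hypothetical and are not taken from the dataset.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return weight divided by the square of height (kg / m^2)."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall
example_bmi = body_mass_index(70.0, 1.75)
print(round(example_bmi, 2))
```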

## 1. Import Libraries

In [None]:
# suppress warnings
from warnings import filterwarnings
filterwarnings('ignore')

# 'os' module provides functions for interacting with the operating system
import os

# 'Pandas' is used for data manipulation and analysis
import pandas as pd 

# 'Numpy' is used for mathematical operations on large, multi-dimensional arrays and matrices
import numpy as np

# 'Matplotlib' is a data visualization library for 2D and 3D plots, built on numpy
import matplotlib.pyplot as plt
%matplotlib inline

# 'Seaborn' is based on matplotlib; used for plotting statistical graphics.
# pyplot is still imported because seaborn draws on matplotlib figures and axes.
import seaborn as sns

# 'Scikit-learn' (sklearn) provides various regression, classification and clustering algorithms
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

# import classes for linear, ridge, lasso and elastic net regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# 'Statsmodels' is used to build and analyze various statistical models
import statsmodels
import statsmodels.api as sm  # main statsmodels API (models such as OLS and helper functions)
import statsmodels.stats.api as sms  # statistical tests and diagnostics
from statsmodels.tools.eval_measures import rmse  # root mean squared error, used to evaluate model performance
from statsmodels.compat import lzip  # list(zip(...)) helper, handy for pairing labels with test results
from statsmodels.graphics.gofplots import ProbPlot  # probability plots such as Q-Q plots

# 'SciPy' is used to perform scientific computations
from scipy.stats import f_oneway
from scipy.stats import jarque_bera
from scipy import stats

## 2. Set Options

In [None]:
pd.options.display.max_columns = None  # display all the columns of the dataframe
#pd.options.display.max_rows = None  # display all the rows of the dataframe if necessary
pd.options.display.float_format = '{:.6f}'.format  # display floats with 6 decimal places
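
The float format option only changes how numbers are printed, not the values stored in the dataframe. A minimal sketch of that behaviour, using `pd.option_context` to change the format temporarily (the series `values` and the 2-decimal format are illustrative choices, not part of the notebook):

```python
import pandas as pd

# Illustrative values
values = pd.Series([16884.924, 1725.5523])

# Temporarily display 2 decimals; the previous setting is restored when the block exits
with pd.option_context("display.float_format", "{:.2f}".format):
    short_text = values.to_string()

print(short_text)

# The stored numbers are untouched by the display setting
print(values.iloc[0] == 16884.924)
```

Using a context manager keeps the global setting above intact for the rest of the notebook.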

## 3. Read Data

In [7]:
df = pd.read_csv('insurance.csv')
df.head()  # shows the first 5 rows; pass a number, e.g. head(10), to see more
# df.tail() shows the last 5 rows of the dataframe

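| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |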
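
To try the same read-and-preview steps without the file on disk, one option is to parse CSV text from memory. This is only a sketch: `csv_text` and `df_demo` are hypothetical names, and the five rows are copied from the preview of `insurance.csv`.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for 'insurance.csv' (first five rows only)
csv_text = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33.0,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
"""

# read_csv accepts any file-like object, not only a path
df_demo = pd.read_csv(io.StringIO(csv_text))

first_two = df_demo.head(2)  # first 2 rows
last_two = df_demo.tail(2)   # last 2 rows

print(first_two)
print(last_two)
```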


## 4. Data Analysis and Preparation

### 4.1 Understand The Dataset

#### 4.1.1 Data Dimension

In [10]:
df.shape

(1338, 7)

The dataset has 1338 rows and 7 columns: 6 columns are independent variables, and the 7th column, 'charges', is the dependent variable.
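
The split into independent and dependent variables can be sketched as follows. The frame `df_demo` is a small stand-in built from the first rows of the data, and the names `X` and `y` are assumptions for illustration, not names used by the notebook.

```python
import pandas as pd

# Small stand-in frame with the notebook's column names
df_demo = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = df_demo.shape

X = df_demo.drop(columns="charges")  # the 6 independent variables
y = df_demo["charges"]               # the dependent variable

print(n_rows, n_cols)
print(X.shape, y.shape)
```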

#### 4.1.2 Data Types

**1. Check Datatypes**

In [11]:
df.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

Three columns have the object datatype, which stores every value as a separate Python object (here, a string) and therefore uses more memory than necessary. We convert them to categorical variables, which store each distinct value once and represent the rows with integer codes, reducing the storage size. Machine learning models also cannot work with text directly, so these columns must eventually be represented numerically. Columns such as name or email ID, which have a unique value for each row, are usually dropped: their main purpose is to identify individuals and they are not useful for ML models.
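
The storage difference can be checked with `memory_usage(deep=True)`. The sketch below uses a toy column that mimics 'region' (a few distinct values repeated many times); the data and variable names are illustrative, not taken from the insurance dataset.

```python
import pandas as pd

# Toy column: 4 distinct strings repeated 250 times each
regions = ["northeast", "southeast", "southwest", "northwest"] * 250
as_object = pd.Series(regions, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects themselves, not just the pointers
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# Each row is stored as a small integer code that points into the category list
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.head().tolist())
```

The exact byte counts depend on the pandas version and platform, but the categorical version should be noticeably smaller when a column has few distinct values.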

**2. Change Datatypes**

In [12]:
df.sex = df.sex.astype('category')
df.smoker = df.smoker.astype('category')
df.region = df.region.astype('category')

**3. Rechecking the datatypes after conversion**

In [13]:
df.dtypes

age            int64
sex         category
bmi          float64
children       int64
smoker      category
region      category
charges      float64
dtype: object
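
As an alternative to converting the columns one at a time, `DataFrame.astype` also accepts a dictionary mapping column names to datatypes. A minimal sketch on a stand-in frame (`df_demo`, `categorical_cols` and `is_categorical` are hypothetical names):

```python
import pandas as pd

# Stand-in frame with a numeric column and the three text columns
df_demo = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
})

# Convert all listed columns in a single call
categorical_cols = ["sex", "smoker", "region"]
df_demo = df_demo.astype({col: "category" for col in categorical_cols})

# Programmatic version of the dtypes check
is_categorical = {
    col: isinstance(df_demo[col].dtype, pd.CategoricalDtype)
    for col in df_demo.columns
}
print(is_categorical)
print(df_demo["region"].cat.categories.tolist())
```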