# 1. UNDERSTANDING THE BUSINESS PROBLEM


The aim of the problem is to predict the insurance premium charge for an employee.

- Features available are:
 
 **Categorical**
    - smoker: yes/no
    - region: residential area
    - sex: male/female
  
  **Numeric**
    - age: age of the employee
    - bmi: body mass index (the healthy range is 18.5 to 24.9)
    - children: number of children
    - charges: insurance premium charges (the target to predict)


Data Source: https://www.kaggle.com/mirichoi0218/insurance

# 2. IMPORT LIBRARIES AND DATASETS

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# read the csv file 
insurance_df = pd.read_csv('insurance.csv')

In [None]:
insurance_df.head()

# 3. Data Pre-Processing

## Data Cleaning

In [None]:
insurance_df.info()

In [None]:
insurance_df.isnull().sum()

In [None]:
# check if there are any Null values
sns.heatmap(insurance_df.isnull(), yticklabels = False, cmap=sns.diverging_palette(50, 500, n=500))


## Data Relation

In [None]:
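# Group by region to look for a relationship between region and charges
# The southeast region appears to have the highest mean charges and body mass index
# numeric_only=True skips the text columns (sex, smoker), which newer pandas refuses to average
df_region = insurance_df.groupby(by='region').mean(numeric_only=True)
df_region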

In [None]:
# Group by 'age' to see the relationship between 'age' and 'charges'
# Age 64 appears to have the highest mean charges
df_age = insurance_df.groupby(by='age').mean(numeric_only=True)
df_age = df_age.sort_values(by='charges')
df_age.tail()

## Encoding
Machine learning models work only with numbers, so the categorical columns must be converted to numeric values before training.

For columns with more than two categories we use one-hot encoding (`pd.get_dummies`), which avoids implying an order among the categories.
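
As a rough illustration of the two encodings discussed here, the sketch below applies them to a tiny made-up frame. The rows are invented for the example and only the column names mirror the insurance data.

```python
import pandas as pd

# Toy frame with made-up rows (assumption: not taken from insurance.csv)
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southeast"],
})

# Binary column: a 0/1 mapping is enough and implies no false ordering
toy["smoker"] = toy["smoker"].map({"no": 0, "yes": 1})

# Multi-category column: one indicator column per category
encoded = pd.get_dummies(toy, columns=["region"], dtype=int)
print(encoded)
```

Mapping `region` to 0, 1, 2, 3 instead would suggest that one region is "larger" than another, which a linear model would treat as a real numeric relationship.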

### Label Encoding

In [None]:
# Mappings from category label to numeric code (keys match the lowercase values in the data)
sex = {
  "male": 1,
  "female": 0
}
smoker = {
  "yes": 1,
  "no": 0
}

In [None]:
# Check unique values in the 'sex' column
print(insurance_df['sex'].unique())
# Check the unique values in the 'smoker' column
print(insurance_df['smoker'].unique())

In [None]:
# convert categorical variable to numerical
insurance_df['sex'] = insurance_df['sex'].apply(lambda x: 0 if x == 'female' else 1)
insurance_df['smoker'] = insurance_df['smoker'].apply(lambda x: 0 if x == 'no' else 1)

In [None]:
insurance_df.head()

In [None]:
# Check unique values in 'region' column
insurance_df['region'].unique()

### Dummies

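Dummy Variable Trap: 
One-hot encoding a column with k categories produces k indicator columns that always sum to 1, so any one of them can be predicted exactly from the others (perfect multicollinearity). Dropping the first column removes this redundancy.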
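
The redundancy can be checked numerically. The following is a minimal sketch on a made-up list of regions: it compares the rank of the design matrix (an intercept column plus the dummies) with and without `drop_first=True`. The helper `rank_with_intercept` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up region values (assumption: for illustration only)
regions = pd.Series(["northeast", "northwest", "southeast", "southwest", "southeast"])

full = pd.get_dummies(regions, dtype=int)                      # 4 indicator columns
reduced = pd.get_dummies(regions, drop_first=True, dtype=int)  # 3 indicator columns

# Every row of the full encoding sums to 1, which duplicates the intercept column
print(full.sum(axis=1).tolist())

def rank_with_intercept(dummies):
    """Return (rank, number of columns) of [intercept | dummies]."""
    design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return int(np.linalg.matrix_rank(design)), design.shape[1]

print(rank_with_intercept(full))     # rank is lower than the column count
print(rank_with_intercept(reduced))  # rank equals the column count
```

When the rank is lower than the number of columns, the regression coefficients are not uniquely determined, which is the problem that dropping one dummy column avoids.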

In [None]:
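def dummy_df(dataset, columnname):
    # One-hot encode the column, dropping the first category to avoid the dummy variable trap
    dummies = pd.get_dummies(dataset[columnname], drop_first=True, dtype=int)
    # Replace the original column with its indicator columns
    dataset = dataset.drop(columns=columnname)
    dataset = pd.concat([dataset, dummies], axis=1)
    return dataset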

In [None]:
insurance_df = dummy_df(insurance_df,'region')

In [None]:
insurance_df.head()

### Statistical Analysis

In [None]:
insurance_df.describe()

In [None]:
print(insurance_df['age'].mean())
print(insurance_df['bmi'].mean())
print(insurance_df['charges'].mean())

# 4. Data Visualization

In [None]:
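print(smoker)  # mapping behind the encoded values
# Pass the column by keyword; newer seaborn versions no longer accept it positionally
sns.countplot(x='smoker', data=insurance_df)
plt.show()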

In [None]:
sns.countplot(x='children', data=insurance_df)
plt.show()

In [None]:
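# distplot is deprecated in seaborn; histplot with a KDE overlay gives a similar view
sns.histplot(insurance_df['age'], kde=True, stat='density')
plt.show()

In [None]:
sns.histplot(insurance_df['bmi'], kde=True, stat='density')
plt.show()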

In [None]:
# plot pairplot
sns.pairplot(insurance_df,hue='sex')
plt.show()

In [None]:
insurance_df[['age', 'sex', 'bmi', 'children', 'smoker', 'charges']].hist(bins = 30, figsize = (20,20), color = 'r')
plt.show()

In [None]:
# Fit a regression line of charges against age
sns.regplot(x = 'age', y = 'charges', data = insurance_df)
plt.show()

In [None]:
# Fit a regression line of charges against bmi
sns.regplot(x = 'bmi', y = 'charges', data = insurance_df)
plt.show()

**Correlation Matrix**

In [None]:
corr = insurance_df.corr()
corr

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(corr,cmap='coolwarm',annot=True,linecolor='black', linewidths=2)

# 5. Data Splitting

## Splitting Independent & Dependent Variables 

In [None]:
insurance_df.columns

In [None]:
# X contains the independent variables (features)
X = insurance_df.drop(columns=['charges'])
# y contains the dependent variable (target)
y = insurance_df['charges']

In [None]:
print(X.shape)
print(y.shape)

In [None]:
X = np.array(X).astype('float32')
y = np.array(y).astype('float32')

In [None]:
y = y.reshape(-1,1)

In [None]:
# Inspect the feature matrix (scaling is applied after the train/test split)
X

## Splitting Data Into Training & Testing Set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=0)

In [None]:
# Scale the data before feeding it to the model; the scalers are fit on the training set only
from sklearn.preprocessing import StandardScaler

scaler_x = StandardScaler()
X_train = scaler_x.fit_transform(X_train)
X_test = scaler_x.transform(X_test)

scaler_y = StandardScaler()
y_train = scaler_y.fit_transform(y_train)
y_test = scaler_y.transform(y_test)
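
The pattern used above (fit on the training split, then only transform the test split) can be illustrated on synthetic numbers. This is a sketch under stated assumptions: the values are randomly generated and are not the insurance data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic values (assumption: stand-ins for a numeric feature such as bmi)
rng = np.random.default_rng(0)
train = rng.normal(loc=30.0, scale=6.0, size=(80, 1))
test = rng.normal(loc=30.0, scale=6.0, size=(20, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training split
test_scaled = scaler.transform(test)        # reuses the training statistics

# The training split ends up with mean 0 and standard deviation 1
print(train_scaled.mean(), train_scaled.std())

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(test_scaled)
print(np.allclose(restored, test))
```

Fitting the scaler on the test split as well would leak information about the test data into preprocessing, so the test split is deliberately left with a mean and standard deviation that are only close to 0 and 1.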


In [None]:
X_train

# 6. Training & Testing using SK-LEARN

In [None]:
# Fit a linear regression model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Note: despite the variable name, this is a regressor, not a classifier
classifier = LinearRegression()
classifier.fit(X_train, y_train)


In [None]:
# For a regressor, score() returns R^2 (coefficient of determination), not classification accuracy
r2 = classifier.score(X_test, y_test)
print(r2*100)

In [None]:
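# Predict on the scaled test set, then map predictions and targets back to the original charge units
y_predict = classifier.predict(X_test)
y_predict_orig = scaler_y.inverse_transform(y_predict)
y_test_orig = scaler_y.inverse_transform(y_test)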
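
Once predictions are back in the original units, error metrics become interpretable as currency amounts. The sketch below shows one possible way to compute them on synthetic data. The feature matrix, coefficients, and noise level are made up for the example, and the names `X_demo`, `y_demo`, `sx`, `sy`, and `model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: not the insurance dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
noise = rng.normal(scale=800.0, size=200)
y_demo = (X_demo @ np.array([3000.0, 1500.0, -500.0]) + 13000.0 + noise).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Scale features and target, fitting both scalers on the training split only
sx, sy = StandardScaler(), StandardScaler()
model = LinearRegression().fit(sx.fit_transform(X_tr), sy.fit_transform(y_tr))

# Predict in scaled units, then convert back to the original units
pred = sy.inverse_transform(model.predict(sx.transform(X_te)))

rmse = float(np.sqrt(mean_squared_error(y_te, pred)))  # same units as the target
mae = float(mean_absolute_error(y_te, pred))           # same units as the target
r2 = float(r2_score(y_te, pred))                       # unit-free
print(rmse, mae, r2)
```

R² is unchanged by the scaling, so it matches `model.score` on the scaled test data, whereas RMSE and MAE are only meaningful as amounts after the inverse transform.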


# 7. Training & Testing using SAGEMAKER