# Insurance Data Analysis

This is my first data analysis project. It works through the steps in the book Hands-On Machine Learning.
I am a novice, so feedback is welcome.

In [None]:
import numpy as np  # Linear Algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # visualization
%matplotlib inline

#### Load the data

In [None]:
pd.set_option('display.max_rows', 10)  # full option name; the short 'max_rows' can be ambiguous in recent pandas

file_path = '../input/insurance/insurance.csv'

insurance = pd.read_csv(file_path)
insurance.head()

In [None]:
insurance.describe()

In [None]:
insurance.info()

In [None]:
insurance.hist(bins = 20)

In [None]:
sns.pairplot(insurance)

In [None]:
# plotting bmi vs charges to find correlation

plt.figure( figsize = (5,5))
plt.title('Plot of BMI vs Charges')
sns.scatterplot(x ='bmi', y ='charges', data  = insurance, hue ='sex')

The graph above suggests a positive relationship between <b>bmi</b> and insurance price.

In [None]:
plt.figure(figsize = (5,5))
plt.title('Relation between Sex and Insurance Price')
sns.barplot( x ='sex', y = 'charges', data = insurance, hue = 'smoker')

In [None]:
sns.jointplot( x = 'children', y = 'charges', data = insurance)


In [None]:
sns.kdeplot(insurance['charges'])

### Split the dataset into Test_Set and Train_set

In [None]:
# import the train_test_split function from scikit-learn
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(insurance, test_size = 0.2, random_state = 42)



In [None]:
# distribution of age in the data set
# (histplot replaces distplot, which is deprecated in recent seaborn releases)
sns.histplot(insurance['age'])

Based on the above distribution of age, I am going to stratify the train/test split by age category.

In [None]:
insurance['age_cat'] = pd.cut(insurance['age'],
                              bins =[0,20,40,60, np.inf],
                              labels = [1,2,3,4])

                        
                    

In [None]:
# age_cat is categorical, so a count plot per category is the natural view
sns.countplot(x='age_cat', data=insurance)


Now I am ready to do stratified sampling based on the age category. For this task, I am going to use Scikit-Learn's <b>StratifiedShuffleSplit</b> class:

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits =1, test_size = 0.2,random_state =42)

    

In [None]:
for train_index, test_index in split.split(insurance, insurance["age_cat"]):
    strat_train_set = insurance.loc[train_index]
    strat_test_set = insurance.loc[test_index]


Let’s see if this worked as expected. You can start by looking at the age category
proportions in the test set:


In [None]:
strat_test_set["age_cat"].value_counts() / len(strat_test_set)

In [None]:
strat_test_set.head()
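
To see what stratification buys you, it can help to compare the category proportions of the full data, a stratified test set, and a purely random test set. The following is only a sketch on synthetic ages (the `demo` frame and the `age_cat_proportions` helper are made up for illustration, not part of the insurance data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Hypothetical stand-in data: 1,000 synthetic ages, not the insurance file.
rng = np.random.default_rng(42)
demo = pd.DataFrame({"age": rng.integers(18, 65, size=1000)})
demo["age_cat"] = pd.cut(demo["age"], bins=[0, 20, 40, 60, np.inf],
                         labels=[1, 2, 3, 4])

# Stratified split on the age category.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(demo, demo["age_cat"]))
strat_test = demo.iloc[test_idx]

# Purely random split for comparison.
_, rand_test = train_test_split(demo, test_size=0.2, random_state=42)


def age_cat_proportions(data):
    """Share of rows falling in each age category."""
    return data["age_cat"].value_counts(normalize=True).sort_index()


compare_props = pd.DataFrame({
    "overall": age_cat_proportions(demo),
    "stratified": age_cat_proportions(strat_test),
    "random": age_cat_proportions(rand_test),
})
compare_props["strat_abs_error"] = (compare_props["stratified"] - compare_props["overall"]).abs()
compare_props["rand_abs_error"] = (compare_props["random"] - compare_props["overall"]).abs()
print(compare_props)
```

The stratified column should track the overall proportions closely, while the random column is free to drift.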

Now you should remove the age_cat attribute so the data is back to its original
state:


In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop('age_cat', axis = 1, inplace = True)

In [None]:
strat_train_set.head()

### Looking for Correlation
Since the dataset is not too large, you can easily compute the standard correlation
coefficient (also called Pearson’s r) between every pair of attributes using the corr()
method

In [None]:
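# numeric_only=True skips the text columns (sex, smoker, region), which recent pandas refuses to correlate
corr_matrix = insurance.corr(numeric_only=True)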

corr_matrix['charges'].sort_values(ascending = False)

The values above show how much each numerical attribute correlates with charges. Another way to check for correlation is a scatter matrix:


In [None]:
# use pandas scatter_matrix to plot every numerical attribute against every other one

from pandas.plotting import scatter_matrix

attributes = ['charges','age','bmi','children']
scatter_matrix(insurance[attributes], figsize =(10,10))

In [None]:
sns.scatterplot(x ='age', y = 'charges', data = insurance)

### Experimenting with Attribute Combinations
One last thing you may want to do before actually preparing the data for Machine
Learning algorithms is to try out various attribute combinations. 


In [None]:
insurance['age_per_bmi'] = insurance['age']/ insurance['bmi']

In [None]:
insurance.head()

In [None]:
strat_train_set.head()

In [None]:
corr_matrix = insurance.corr(numeric_only=True)  # skip the text columns
corr_matrix["charges"].sort_values(ascending=False)

### Prepare the data for machine learning
But first let’s revert to a clean training set (by copying strat_train_set once again),
and let’s separate the predictors and the labels since we don’t necessarily want to apply
the same transformations to the predictors and the target values (note that drop()
creates a copy of the data and does not affect strat_train_set):


In [None]:
insurance = strat_train_set.drop('charges', axis = 1) # axis = 'columns'
insurance_labels = strat_train_set['charges'].copy()
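
The note that drop() returns a copy is easy to check on a toy frame. This sketch uses a made-up three-row DataFrame (the values are placeholders, not rows from the dataset):

```python
import pandas as pd

# Hypothetical miniature frame reusing some of the dataset's column names.
train = pd.DataFrame({
    "age": [20, 30, 40],
    "bmi": [21.0, 25.0, 30.0],
    "charges": [100.0, 200.0, 300.0],
})

features = train.drop("charges", axis=1)  # new DataFrame without the target
labels = train["charges"].copy()          # independent copy of the target

labels.iloc[0] = 0.0                      # changing the copy...
print(list(train.columns))                # ...leaves the original frame intact
print(list(features.columns))
print(train["charges"].iloc[0])
```

Because both objects are copies, later transformations of the predictors or labels cannot silently alter the training set they came from.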

### Handling Text and Categorical attributes
The dataset has three text attributes: sex, smoker, and region. They cannot be used
in numerical computations as they are, so let's look at region first:


In [None]:
insurance_cat = insurance[['region']]
insurance_cat.head(10)

Most Machine Learning algorithms prefer to work with numbers anyway, so let’s convert
these categories from text to numbers. For this, we can use <b>Scikit-Learn’s
OrdinalEncoder class:</b>

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder =OrdinalEncoder()

insurance_cat_encoded = ordinal_encoder.fit_transform(insurance_cat)
insurance_cat_encoded[:10]

In [None]:
ordinal_encoder.categories_

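One issue with this representation is that ML algorithms will assume that two nearby
values are more similar than two distant values. This may be fine in some cases (e.g.,
for ordered categories such as “bad”, “average”, “good”, “excellent”), but it is obviously
not the case for the region column, whose categories have no natural order. A common
solution is one-hot encoding, which creates one binary attribute per category.
Scikit-Learn provides the OneHotEncoder class for this: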

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
insurance_cat_1hot = cat_encoder.fit_transform(insurance_cat)
insurance_cat_1hot

In [None]:
insurance_cat_1hot.toarray()

In [None]:
cat_encoder.categories_

### Sex data transformation

In [None]:
insurance_sex_cat = insurance[['sex']]

insurance_sex_cat.head()

In [None]:
sex_encoder =OrdinalEncoder()

insurance_sex_encoded = sex_encoder.fit_transform(insurance_sex_cat)
insurance_sex_encoded[:10]


In [None]:
sex_encoder.categories_

In [None]:
# smoker categorical data transformation
smoker_cat = insurance[['smoker']]
smoker_encoder = OrdinalEncoder()
smoker_encoded = smoker_encoder.fit_transform(smoker_cat)
smoker_encoded


In [None]:
smoker_encoder.categories_

In [None]:
# insurance_num contains all the numerical attributes of the training set
insurance_num = insurance.drop(["sex",'region','smoker'], axis=1)

insurance_num.head()


#### Feature Scaling
One of the most important transformations you need to apply to your data is feature
scaling. With few exceptions, Machine Learning algorithms don’t perform well when
the input numerical attributes have very different scales.
There are two common ways to get all attributes to have the same scale: min-max
scaling and standardization.
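
As a rough illustration of the difference between the two, here is a sketch on a small made-up array (the numbers are placeholders, not taken from the dataset). Min-max scaling maps each column to the range 0 to 1, while standardization gives each column zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up values for two columns on different scales.
X = np.array([[18.0, 20.0],
              [30.0, 25.0],
              [45.0, 30.0],
              [64.0, 40.0]])

X_minmax = MinMaxScaler().fit_transform(X)   # (x - min) / (max - min), per column
X_std = StandardScaler().fit_transform(X)    # (x - mean) / std, per column

print("min-max range:", X_minmax.min(axis=0), X_minmax.max(axis=0))
print("standardized mean/std:", X_std.mean(axis=0), X_std.std(axis=0))
```

Standardization does not bound values to a fixed range, but it is much less affected by outliers, which is one reason this notebook uses StandardScaler.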


### Transformation Pipelines
As you can see, there are many data transformation steps that need to be executed in
the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with
such sequences of transformations. Here is a small pipeline for the numerical
attributes:


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('std_scaler', StandardScaler()),
])

insurance_num_tr = num_pipeline.fit_transform(insurance_num)
    

The Pipeline constructor takes a list of name/estimator pairs defining a sequence of
steps. All but the last estimator must be transformers (i.e., they must have a
fit_transform() method). The names can be anything you like (as long as they are
unique and don’t contain double underscores “__”): they will come in handy later for
hyperparameter tuning.
When you call the pipeline’s fit() method, it calls fit_transform() sequentially on
all transformers, passing the output of each call as the parameter to the next call, until
it reaches the final estimator, for which it just calls the fit() method.
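
The naming rule matters because step names become part of parameter names. A minimal sketch, assuming a toy two-step pipeline on synthetic data (the `demo_pipeline` and `lin_reg` names are chosen here for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with an exact linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

demo_pipeline = Pipeline([
    ("std_scaler", StandardScaler()),   # transformer: fit_transform() is called on it
    ("lin_reg", LinearRegression()),    # final estimator: only fit() is called on it
])
demo_pipeline.fit(X, y)
predictions = demo_pipeline.predict(X)

# Each step can be retrieved by its name...
scaler = demo_pipeline.named_steps["std_scaler"]
print(scaler.mean_)

# ...and "<step name>__<parameter>" addresses a parameter of that step,
# which is why step names themselves must not contain a double underscore.
demo_pipeline.set_params(lin_reg__fit_intercept=False)
print(demo_pipeline.get_params()["lin_reg__fit_intercept"])
```

The same `step__parameter` syntax is what a hyperparameter search uses when the estimator being tuned is a pipeline.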

So far, we have handled the categorical columns and the numerical columns separately.
It would be more convenient to have a single transformer able to handle all columns,
applying the appropriate transformations to each column. In version 0.20,
Scikit-Learn introduced the ColumnTransformer for this purpose, and the good news
is that it works great with Pandas DataFrames. Let’s use it to apply all the
transformations to the insurance data:

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = list(insurance_num)
cat_attribs = ['sex', 'smoker','region']


full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
insurance_prepared = full_pipeline.fit_transform(insurance)

### Select and train a model
At last! You framed the problem, you got the data and explored it, you sampled a
training set and a test set, and you wrote transformation pipelines to clean up and
prepare your data for Machine Learning algorithms automatically. You are now ready
to select and train a Machine Learning model.

#### Training and evaluating on the training set
Let’s first train a Linear Regression model

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(insurance_prepared, insurance_labels)

Done! You now have a working Linear Regression model. Let’s try it out on a few
instances from the training set:

In [None]:
some_data = insurance.iloc[:5]
some_labels = insurance_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels", list(some_labels))

It works, although the predictions are not exactly accurate. Let’s measure this regression model’s RMSE on the whole
training set using Scikit-Learn’s mean_squared_error function:

In [None]:
from sklearn.metrics import mean_squared_error

insurance_predictions = lin_reg.predict(insurance_prepared)
lin_mse = mean_squared_error(insurance_labels, insurance_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

This is an example of a model underfitting
the training data. When this happens it can mean that the features do not provide
enough information to make good predictions, or that the model is not powerful
enough. 

Let’s train a <b>DecisionTreeRegressor</b>. This is a powerful model, capable of finding
complex nonlinear relationships in the data 

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(insurance_prepared,insurance_labels)

Now that the model is trained, let’s evaluate it on the training set:

In [None]:
insurance_predictions = tree_reg.predict(insurance_prepared)
tree_mse = mean_squared_error(insurance_labels, insurance_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

Wait, what!? The error is too small. Could this model really be absolutely perfect? Of course,
it is much more likely that the model has badly overfit the data.
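
A near-zero training error from an unconstrained tree is expected even on noisy data, which is why it says little about real performance. The following sketch uses synthetic data with deliberate noise (not the insurance data) and compares the training error with the error on a held-out split:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one informative feature plus noise the tree cannot predict.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# With no depth limit, the tree keeps splitting until it memorizes the training rows.
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))
print("train RMSE:", train_rmse)
print("validation RMSE:", val_rmse)
```

The gap between the two numbers is the signature of overfitting: the tree has learned the noise in the training rows rather than the underlying relationship.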

### Better Evaluation Using Cross-Validation
One way to evaluate the Decision Tree model would be to use the train_test_split
function to split the training set into a smaller training set and a validation set,
then train your models against the smaller training set and evaluate them against the
validation set. It’s a bit of work, but nothing too difficult and it would work fairly well.
A great alternative is to use Scikit-Learn’s K-fold cross-validation feature. The
following code splits the training set into 10 distinct subsets called folds, then it
trains and evaluates the Decision Tree model 10 times, picking a different fold for
evaluation every time and training on the other 9 folds. The result is an array
containing the 10 evaluation scores:


In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, insurance_prepared, insurance_labels,
                        scoring = 'neg_mean_squared_error', cv =10)

tree_rmse_scores = np.sqrt(-scores)
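
To make the mechanics concrete, here is a sketch that reproduces cross_val_score by hand on synthetic data (5 folds and a LinearRegression are chosen only to keep it short and deterministic). One detail worth knowing: when `cv` is an integer and the estimator is a regressor, Scikit-Learn uses KFold without shuffling, so the folds are consecutive blocks of rows rather than random subsets.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, not the insurance data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression()

manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    fold_model = clone(model)                        # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])       # train on the other 4 folds
    mse = mean_squared_error(y[val_idx], fold_model.predict(X[val_idx]))
    manual_scores.append(-mse)                       # scorers are "greater is better"

library_scores = cross_val_score(model, X, y,
                                 scoring="neg_mean_squared_error", cv=5)
print(np.allclose(manual_scores, library_scores))
```

The negated MSE explains the `np.sqrt(-scores)` step in the notebook: the scoring function returns negative values, so the sign is flipped before taking the square root.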


Let's write a function to display the results 



In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard Deviation: ", scores.std())

In [None]:
display_scores(tree_rmse_scores)

Now the Decision Tree doesn’t look as good as it did earlier. In fact, it seems to
perform worse than the Linear Regression model! Notice that cross-validation allows
you to get not only an estimate of the performance of your model, but also a measure
of how precise this estimate is (i.e., its standard deviation). The Decision Tree has a
score of approximately 6723.76, generally ±596.628.

Let’s compute the same scores for the Linear Regression model just to be sure:

In [None]:
lin_scores = cross_val_score(lin_reg, insurance_prepared,
                             insurance_labels,
                             scoring = 'neg_mean_squared_error', cv =10)

lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

This model is the worst of the models so far; it performs poorly on the training set.

Let’s try one last model now: the <b>RandomForestRegressor</b>.
Random Forests work by training many Decision Trees on random subsets of
the features, then averaging out their predictions. Building a model on top of many
other models is called Ensemble Learning, and it is often a great way to push ML
algorithms even further. We will skip most of the code since it is essentially the same as
for the other models:
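
The "averaging out their predictions" part can be checked directly: a fitted RandomForestRegressor exposes its individual trees in `estimators_`. This sketch uses synthetic data and a small forest of 10 trees, both chosen only for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, not the insurance data.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=120)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X, y)

# Predict with every individual tree, then average by hand.
tree_predictions = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X[:5])))
```

Averaging many trees that each overfit in a different way tends to cancel part of their individual errors, which is why the forest usually generalizes better than any single tree.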

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(insurance_prepared, insurance_labels)



insurance_predictions = forest_reg.predict(insurance_prepared)
forest_mse = mean_squared_error(insurance_labels, insurance_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
forest_scores = cross_val_score(forest_reg, insurance_prepared,
                               insurance_labels,
                               scoring = "neg_mean_squared_error",
                               cv = 10)

forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

### Fine-Tune Your Model
Let’s assume that you now have a shortlist of promising models. You now need to
fine-tune them. Let’s look at a few ways you can do that.


#### Grid Search
One way to do that would be to fiddle with the hyperparameters manually, until you
find a great combination of hyperparameter values. This would be very tedious work,
and you may not have time to explore many combinations.
Instead you should get Scikit-Learn’s <b>GridSearchCV</b> to search for you. All you need to
do is tell it which hyperparameters you want it to experiment with, and what values to
try out, and it will evaluate all the possible combinations of hyperparameter values,
using cross-validation. For example, the following code searches for the best
combination of hyperparameter values for the RandomForestRegressor:


In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = [
 {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
 {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
 ]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

grid_search.fit(insurance_prepared, insurance_labels)



In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_


And of course the evaluation scores are also available:

In [None]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)


#### Randomized Search
The grid search approach is fine when you are exploring relatively few combinations,
like in the previous example, but when the hyperparameter search space is large, it is
often preferable to use RandomizedSearchCV instead. This class can be used in much
the same way as the GridSearchCV class, but instead of trying out all possible
combinations, it evaluates a given number of random combinations by selecting a random
value for each hyperparameter at every iteration. This approach has two main benefits:
1. If you let the randomized search run for, say, 1,000 iterations, this approach will
explore 1,000 different values for each hyperparameter (instead of just a few values
per hyperparameter with the grid search approach).
2. You have more control over the computing budget you want to allocate to
hyperparameter search, simply by setting the number of iterations.
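
The random values come from distribution objects such as scipy.stats.randint. One detail that is easy to miss is that its `high` bound is exclusive. A small sketch that samples the same kind of distribution used for n_estimators in this notebook:

```python
from scipy.stats import randint

# Same bounds as the n_estimators distribution used in this notebook.
n_estimators_dist = randint(low=1, high=200)

# Draw 1,000 values, the way the search draws one per iteration.
samples = n_estimators_dist.rvs(size=1000, random_state=42)

# Values are integers from low up to high - 1, so 200 itself is never drawn.
print(samples.min(), samples.max())
```

So `randint(low=1, high=8)` for max_features only ever proposes values from 1 to 7.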


In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(insurance_prepared, insurance_labels)

In [None]:
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

### Analyze the Best Models and Their Errors
You will often gain good insights on the problem by inspecting the best models. For
example, the RandomForestRegressor can indicate the relative importance of each
attribute for making accurate predictions:


In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances    


In [None]:
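cat_encoder = full_pipeline.named_transformers_["cat"]
# categories_ holds one array per categorical column (sex, smoker, region);
# flatten them all so every one-hot column gets a name
cat_one_hot_attribs = [cat for cats in cat_encoder.categories_ for cat in cats]
attributes = num_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)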





### Evaluate Your System on the Test Set
After tweaking your models for a while, you eventually have a system that performs
sufficiently well. Now is the time to evaluate the final model on the test set. There is
nothing special about this process; just get the predictors and the labels from your
test set, run your full_pipeline to transform the data (call transform(), not
fit_transform(), you do not want to fit the test set!), and evaluate the final model
on the test set:


In [None]:
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("charges", axis=1)
y_test = strat_test_set["charges"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse) 
final_rmse


In some cases, such a point estimate of the generalization error will not be quite
enough to convince you to launch: what if it is just 0.1% better than the model
currently in production? You might want to have an idea of how precise this estimate is.
For this, you can compute a 95% confidence interval for the generalization error using
scipy.stats.t.interval():


In [None]:
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))
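
To see what the interval call does, the same bounds can be computed by hand as mean ± t × standard error. This sketch uses made-up squared errors rather than the model's actual errors:

```python
import numpy as np
from scipy import stats

# Made-up squared errors standing in for (final_predictions - y_test) ** 2.
rng = np.random.default_rng(0)
squared_errors = rng.normal(size=100) ** 2

confidence = 0.95
n = len(squared_errors)
mean = squared_errors.mean()
sem = stats.sem(squared_errors)                        # sample std / sqrt(n)
t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)   # two-sided critical value

manual_interval = (mean - t_crit * sem, mean + t_crit * sem)
library_interval = stats.t.interval(confidence, n - 1, loc=mean, scale=sem)

print(np.allclose(manual_interval, library_interval))
```

Taking the square root of both bounds turns the interval on the MSE into an interval on the RMSE. If the lower bound on the MSE came out negative, which can happen with few or very spread-out errors, its square root would be NaN.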


