<a href="https://colab.research.google.com/github/shstreuber/AI/blob/main/Week4_TotalInsurance_HealthDivision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**The Job**

<div>
<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/totalinsurance.jpg" width="300">
</center>
</div>

You work for TotalInsurance, an insurance carrier for home, health, and vehicles. Their Health Division has decided to automate their predictive processes in order to gain faster (ideally, real-time) insight into their customer data, so that health insurance claims can be approved or denied automatically and only go to an analyst for review if the customer requests manual review. In addition, TotalInsurance is automating its claim forecasting and its regional office staffing. You have received a small excerpt of the data in order to build a Deep Learning proof-of-concept.

If you succeed, TotalInsurance will give you a 1,000 bonus.

#**The Process**
We will be following the basic classification steps:

0. Preparation, loading libraries and data
1. Exploratory Data Analysis (EDA) to see how the data is distributed and to determine what the label should be. This will be the label you'll predict later on
2. Preprocess the data (remove n/a, transform data types as needed, deal with missing data) --> here is where we will need to take a few additional steps to configure our data for the Neural Network
3. Split the data into a training set and a test set
4. Build the model based on the training set
5. Test the model on the test set
6. Determine the quality of the model with the help of a Confusion Matrix and a Classification Report.

To understand what each step does, please look at the code comments and explanations



#**0. Preparation**
We will build our Deep Learning architecture on TensorFlow. Why TensorFlow? Because it is easier to build on Colab than PyTorch (for more about the battle of the giants, i.e. TensorFlow vs PyTorch, [read here](https://)).

In [None]:
import tensorflow as tf # The core TensorFlow library

from tensorflow import keras # Keras is TensorFlow's high-level API for building neural networks
from tensorflow.keras.models import Sequential # We are building a model that runs its layers in sequential order
from tensorflow.keras import layers # We are building a Neural Network with several hidden layers
from tensorflow.keras.layers import Dense # This will help us build a fully connected architecture
from tensorflow.keras.layers.experimental import preprocessing # Preprocessing layers such as Normalization (used in step 2C)

print("Current TensorFlow version is", tf.__version__)

import numpy as np # Your basic mathematical library for big datasets
import pandas as pd # The library you need in order to clean and manipulate big datasets
import matplotlib.pyplot as plt # Makes pretty pictures
import seaborn as sns # for visualization aka more pretty pictures
from sklearn.model_selection import train_test_split # Scikit-Learn is the default data science library
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")
np.random.seed(42) # Setting a seed value for NumPy's randomizer; TensorFlow's own random operations (e.g. weight initialization) are not covered by this seed

Current TensorFlow version is 2.15.0


In [None]:
#Reading in the data as the insurance dataframe
insurance = pd.read_csv("https://raw.githubusercontent.com/shstreuber/Data-Mining/master/data/insurance_full2.csv")
insurance.head(10) # Let's look at the first 10 rows

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,insuranceclaim
0,19,female,27.9,0,yes,southwest,16884.924,approved
1,18,male,33.77,1,no,southeast,1725.5523,approved
2,28,male,33.0,3,no,southeast,4449.462,denied
3,33,male,22.705,0,no,northwest,21984.47061,denied
4,32,male,28.88,0,no,northwest,3866.8552,approved
5,31,female,25.74,0,no,southeast,3756.6216,denied
6,46,female,33.44,1,no,southeast,8240.5896,approved
7,37,female,27.74,3,no,northwest,7281.5056,denied
8,37,male,29.83,2,no,northeast,6406.4107,denied
9,60,female,25.84,0,no,northwest,28923.13692,denied


#**1. Exploratory Data Analysis (EDA)**
The goal of exploratory data analysis (EDA) is to visually and statistically summarize the main characteristics of a dataset, understand its structure, identify patterns, and detect anomalies. Through EDA, we can gain insights into the data, generate hypotheses, and determine the next steps of analysis or modeling.

Exploratory data analysis involves techniques such as summary statistics, data visualization, and graphical representations to reveal hidden patterns, relationships, or trends within the data. EDA is an essential preliminary step in the data analysis pipeline, helping to uncover meaningful information and guide further exploration or hypothesis testing.







In [None]:
# Any missing values?
insurance.isna().sum()

age               0
sex               0
bmi               0
children          0
smoker            0
region            0
charges           0
insuranceclaim    0
dtype: int64

In [None]:
# What data types and input features do we have? What could be our output label? Do we already have a label that contains the information, or do we need to create one?
insurance.dtypes

age                 int64
sex                object
bmi               float64
children            int64
smoker             object
region             object
charges           float64
insuranceclaim     object
dtype: object

In [None]:
#Let's look more closely at all data.
insurance.describe(include = 'all'), print("***DATA OVERVIEW***") # Build a data summary for ALL data in the set (not just numeric!)

***DATA OVERVIEW***


(                age   sex          bmi     children smoker     region  \
 count   1338.000000  1338  1338.000000  1338.000000   1338       1338   
 unique          NaN     2          NaN          NaN      2          4   
 top             NaN  male          NaN          NaN     no  southeast   
 freq            NaN   676          NaN          NaN   1064        364   
 mean      39.207025   NaN    30.663397     1.094918    NaN        NaN   
 std       14.049960   NaN     6.098187     1.205493    NaN        NaN   
 min       18.000000   NaN    15.960000     0.000000    NaN        NaN   
 25%       27.000000   NaN    26.296250     0.000000    NaN        NaN   
 50%       39.000000   NaN    30.400000     1.000000    NaN        NaN   
 75%       51.000000   NaN    34.693750     2.000000    NaN        NaN   
 max       64.000000   NaN    53.130000     5.000000    NaN        NaN   
 
              charges insuranceclaim  
 count    1338.000000           1338  
 unique           NaN           

##**Findings**
0. **MISSING DATA?** No. All fields have 1338 data points.
1. **POSSIBLE TARGET VARIABLES**:
  
    **- BINARY** (Classification): insuranceclaim--whether approved or denied
    
    **- CATEGORICAL** (Classification): region, which has 4 levels. We could predict in which region a client lives, which could inform our decisions on how high to set insurance rates

    **- NUMERIC** (Regression): charges. We can predict the amount of the claim a client will submit.

2. **QUESTIONS TO ASK**: Which input attributes are relevant to our target variable? Which input attributes are not relevant?
3. **ANY PROBLEMATIC DATA?**: No
4. **ANY INCONSISTENT DATA?**: No
5. **ANY OPPORTUNITIES FOR SIMPLIFYING?**: No
6. **ANY OPPORTUNITIES FOR REDUCING THE DATAFRAME FOR EASE OF PROCESSING IN A NEURAL NETWORK?**: No

#**2A. Data Cleanup**
This is a clean dataset with no missing or incongruent values. No substitution for missing values or unusual values is needed.

#**2B. Preprocessing**
Preprocessing means formatting the data such that it can work with the selected algorithm(s). This can involve
* Substituting values to create a uniform data type
* Transforming data from string to categorical or numeric or vice versa
* Binning and bucketing data as needed (i.e. creating roll-up attributes in which we trade accuracy of values for ease of processing)

Other steps may be needed, depending on the algorithm.
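
As a small illustration of binning, the sketch below rolls BMI values up into coarse categories with `pd.cut`. The cut points and category names are illustrative assumptions (they follow the commonly used BMI ranges) and are not used elsewhere in this notebook; the sample values are the first four `bmi` entries from the table above.

```python
import numpy as np
import pandas as pd

# First four bmi values from insurance.head(); a stand-in for the full column
bmi = pd.Series([27.9, 33.77, 33.0, 22.705], name="bmi")

# Assumed cut points: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bins = [0, 18.5, 25, 30, np.inf]
labels = ["underweight", "normal", "overweight", "obese"]

# right=False makes each bin include its left edge and exclude its right edge
bmi_group = pd.cut(bmi, bins=bins, labels=labels, right=False)
print(bmi_group.tolist())
```

The trade-off is the one named above: four categories are easier to process than a continuous value, but the difference between a BMI of 30.1 and 45 is lost.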

In [None]:
insurance.dtypes

age                 int64
sex                object
bmi               float64
children            int64
smoker             object
region             object
charges           float64
insuranceclaim     object
dtype: object

#**2C: Preparing the Data for Use with TensorFlow**
There are 4 steps we need to take to prepare the data to run with TensorFlow (before we even consider the architecture of the network):

1. Setting up training and test set
2. Splitting features from labels (to build the input and output layers)
3. Encoding categorical variables
4. Normalizing all numeric features

####**1. Setting up training and test set**
There are different ways to split the data into a training and test set. You can specify a split by line indexes, by percentages, or by number of rows. In our example, we will use percentages to split.

In [None]:
train_dataset = insurance.sample(frac=0.8, random_state=0) # training dataset is 80%, test dataset is 20%. Rows are picked by random sampling
test_dataset = insurance.drop(train_dataset.index) # Dropping every row that was sampled for training, so the test set is the remaining 20% and shares no rows with the training set
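
The three ways of splitting mentioned above can be sketched on a small stand-in frame. The frame `demo` and its values are hypothetical; `iloc`, `sample`, and `drop` are the regular pandas calls.

```python
import pandas as pd

# Hypothetical 10-row frame standing in for the insurance data
demo = pd.DataFrame({"age": range(20, 30), "charges": range(1000, 11000, 1000)})

# 1. By line indexes: the first 8 rows train, the rest test (no shuffling)
train_by_index = demo.iloc[:8]
test_by_index = demo.iloc[8:]

# 2. By percentage: 80% of the rows, picked at random
train_by_frac = demo.sample(frac=0.8, random_state=0)
test_by_frac = demo.drop(train_by_frac.index)  # every row NOT sampled for training

# 3. By number of rows: exactly 7 random rows
train_by_n = demo.sample(n=7, random_state=0)
test_by_n = demo.drop(train_by_n.index)

print(len(train_by_frac), len(test_by_frac))
print(set(train_by_frac.index) & set(test_by_frac.index))  # shared rows, if any
```

Slicing by index keeps the original row order, so it is only safe when the file is not sorted by anything related to the label.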

In [None]:
train_dataset.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,insuranceclaim
578,52,male,30.2,1,no,southwest,9724.53,approved
610,47,female,29.37,1,no,southeast,8547.6913,denied
569,48,male,40.565,2,yes,northwest,45702.02235,approved
1034,61,male,38.38,0,no,northwest,12950.0712,approved
198,51,female,18.05,0,no,northwest,9644.2525,denied


In [None]:
test_dataset.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,insuranceclaim
11,62,female,26.29,0,yes,southeast,27808.7251,approved
23,34,female,31.92,1,yes,northeast,37701.8768,approved
24,37,male,28.025,2,no,northwest,6203.90175,denied
25,59,female,27.72,3,no,southeast,14001.1338,approved
28,23,male,17.385,1,no,northwest,2775.19215,approved


####**2. Splitting Features from Labels**
Separate the target value, the "label", from the features. This label is the value that you will train the model to predict--in our case, we want to predict insuranceclaim.

In [None]:
train_features = train_dataset.copy()
test_features = test_dataset.copy()
train_labels = train_features.pop('insuranceclaim')
test_labels = test_features.pop('insuranceclaim')

####**3. One Hot Encoding**
Remember that the input layer for a Neural Network requires numeric information only.

So, to help our computer understand, for example, the smoker variable, we use one-hot encoding. We create two slots: one for yes and one for no. When we see a smoker, we put a 1 in the yes slot and 0 in the no slot. If the client is a nonsmoker, we put a 1 in the no slot and 0 in the yes slot.

So, in a nutshell, we use One Hot Encoding when:

* The categorical features in the data are not ordinal
* The number of categories per feature is small, so that one-hot encoding does not create too many new columns

We should not use One Hot Encoding when:

* The categorical features in the dataset are ordinal, e.g. job levels such as Junior, Senior, Executive, Owner
* The number of categories in the dataset is quite large. Working with large categories can lead to high memory consumption and to processor issues (how does Professor Streuber know? Ask her for her story!)
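
A minimal sketch of the two cases, using a hypothetical `staff` frame (the `level` column and its ranking are invented for illustration): `smoker` has no order, so it gets one-hot columns, while the job level is ordinal and is mapped to a single ranked integer column.

```python
import pandas as pd

# Hypothetical data; only "smoker" resembles a column from the insurance set
staff = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "level": ["Junior", "Owner", "Senior", "Executive"],
})

# Not ordinal: one column per category, exactly one 1 per row
smoker_onehot = pd.get_dummies(staff["smoker"], prefix="smoker", dtype=int)

# Ordinal: keep the order in a single integer column (0 = lowest level)
level_order = ["Junior", "Senior", "Executive", "Owner"]
level_rank = pd.Categorical(staff["level"], categories=level_order, ordered=True).codes

print(smoker_onehot)
print(list(level_rank))
```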

**NOTE:** This is an older dataset which recognizes sex as only male and female; we now recognize a broader spectrum of gender identities.

In [None]:
# Using One-Hot Encoding with pd.get_dummies
train_features = pd.get_dummies(train_features, columns=['sex','smoker','region'], prefix='', prefix_sep='')
train_features.head()

Unnamed: 0,age,bmi,children,charges,female,male,no,yes,northeast,northwest,southeast,southwest
578,52,30.2,1,9724.53,0,1,1,0,0,0,0,1
610,47,29.37,1,8547.6913,1,0,1,0,0,0,1,0
569,48,40.565,2,45702.02235,0,1,0,1,0,1,0,0
1034,61,38.38,0,12950.0712,0,1,1,0,0,1,0,0
198,51,18.05,0,9644.2525,1,0,1,0,0,1,0,0


In [None]:
test_features = pd.get_dummies(test_features, columns=['sex','smoker','region'], prefix='', prefix_sep='')
test_features.head()

Unnamed: 0,age,bmi,children,charges,female,male,no,yes,northeast,northwest,southeast,southwest
11,62,26.29,0,27808.7251,1,0,0,1,0,0,1,0
23,34,31.92,1,37701.8768,1,0,0,1,1,0,0,0
24,37,28.025,2,6203.90175,0,1,1,0,0,1,0,0
25,59,27.72,3,14001.1338,1,0,1,0,0,0,1,0
28,23,17.385,1,2775.19215,0,1,1,0,0,1,0,0
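
Encoding the training and test sets separately works here because both contain every category. If a category were missing from one of them, the two frames would end up with different columns. Below is a hedged sketch of one way to guard against that, using hypothetical mini-frames and `DataFrame.reindex`; it is not part of the original workflow.

```python
import pandas as pd

# Hypothetical mini-frames: the test set happens to contain only one region
train_demo = pd.DataFrame({"age": [52, 47, 48],
                           "region": ["southwest", "southeast", "northwest"]})
test_demo = pd.DataFrame({"age": [62, 34],
                          "region": ["southeast", "southeast"]})

train_enc = pd.get_dummies(train_demo, columns=["region"], prefix="", prefix_sep="", dtype=int)
test_enc = pd.get_dummies(test_demo, columns=["region"], prefix="", prefix_sep="", dtype=int)

# Give the test set exactly the training columns, in the same order;
# categories the test set never saw become all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc)
```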


In [None]:
train_labels = pd.get_dummies(train_labels, columns=['insuranceclaim'], prefix='', prefix_sep='')
train_labels.head()

Unnamed: 0,approved,denied
578,1,0
610,0,1
569,1,0
1034,1,0
198,0,1


In [None]:
test_labels = pd.get_dummies(test_labels, columns=['insuranceclaim'], prefix='', prefix_sep='')
test_labels.head()

Unnamed: 0,approved,denied
11,1,0
23,1,0
24,0,1
25,1,0
28,1,0


####**4. Normalize all NUMERIC features**
**Why should we normalize?**
Well, not only do Neural Networks not like string-type labels in the output layer; they also don't like non-standardized input attributes (aka features). That's because the Summation and Activation functions treat the values from each input attribute the same, so an attribute with large values (such as charges) would drown out an attribute with small values (such as children). If all values fall into the same scale, the outcome of our classification will be better. That is why we want to normalize our feature values.

There are different ways of normalizing data. One way is to do the math manually as you have seen in the previous week's file.



Another way is to use the [**preprocessing.Normalization layer**](https://keras.io/api/layers/preprocessing_layers/numerical/normalization/). This layer is a clean way to build that preprocessing into your model. It does, however, not like anything that doesn't fall into a numpy array, so be sure to one hot encode all your categorical variables or transform them into numerics.
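
What the normalization does can be sketched by hand in NumPy: subtract each column's mean and divide by its standard deviation, with both computed from the training data only and then reused for the test data. The arrays below are a small hypothetical stand-in with two columns (age and charges), and the sketch assumes the population standard deviation (the square root of the variance).

```python
import numpy as np

# Hypothetical stand-in data: columns are age and charges
train = np.array([[52.0, 9724.53],
                  [47.0, 8547.69],
                  [48.0, 45702.02],
                  [61.0, 12950.07]])
test = np.array([[62.0, 27808.73]])

mean = train.mean(axis=0)   # one mean per column
std = train.std(axis=0)     # population standard deviation per column

train_norm = (train - mean) / std
test_norm = (test - mean) / std   # reuse the TRAINING statistics

with np.printoptions(precision=2, suppress=True):
    print(train_norm.mean(axis=0), train_norm.std(axis=0))
    print(test_norm)
```

After this step every training column has mean 0 and standard deviation 1, so age and charges contribute on the same scale.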

In [None]:
normalizer = preprocessing.Normalization(axis=-1) # Here we set the normalizer up
normalizer.adapt(np.array(train_features)) # adapt() calculates the mean and variance of each feature and stores them in the layer; the data itself is only normalized when the layer is called
print(normalizer.mean.numpy())

# When the layer is called it returns the input data, with each feature independently normalized:

first = np.array(train_features[:1])

with np.printoptions(precision=2, suppress=True):
  print('Original data:', first)
  print()
  print('Normalized data:', normalizer(first).numpy())

[[3.9036446e+01 3.0735168e+01 1.0934581e+00 1.3056554e+04 5.0186914e-01
  4.9813083e-01 8.0093467e-01 1.9906542e-01 2.4859811e-01 2.3551399e-01
  2.8130844e-01 2.3457943e-01]]
Original data: [[  52.     30.2     1.   9724.53    0.      1.      1.      0.      0.
     0.      0.      1.  ]]

Normalized data: [[ 0.92 -0.09 -0.08 -0.28 -1.    1.    0.5  -0.5  -0.58 -0.56 -0.63  1.81]]


In [None]:
# The Normalization layer becomes part of the model, so the test set is normalized automatically with the TRAINING statistics.
# Just in case you want to try this out, here is how the layer behaves when adapted to the test set.
# We use a separate variable so that the training normalizer (which goes into the model) is not overwritten.

test_normalizer = preprocessing.Normalization(axis=-1) # Here we set the normalizer up
test_normalizer.adapt(np.array(test_features)) # adapt() calculates the mean and variance of each feature and stores them in the layer
print(test_normalizer.mean.numpy())

# When the layer is called it returns the input data, with each feature independently normalized:

first = np.array(test_features[:1])

with np.printoptions(precision=2, suppress=True):
  print('Original data:', first)
  print()
  print('Normalized data:', test_normalizer(first).numpy())

[[3.9888058e+01 3.0376863e+01 1.1007463e+00 1.4124310e+04 4.6641791e-01
  5.3358209e-01 7.7238804e-01 2.2761194e-01 2.1641791e-01 2.7238804e-01
  2.3507462e-01 2.7611938e-01]]
Original data: [[   62.      26.29     0.   27808.73     1.       0.       0.       1.
      0.       0.       1.       0.  ]]

Normalized data: [[ 1.62 -0.66 -0.93  1.09  1.07 -1.07 -1.84  1.84 -0.53 -0.61  1.8  -0.62]]


# **3. Building the Model**
There is always a specific process with which to build a TensorFlow model:
<div>
<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/TF_Process2.png" width="600">
</center>
</div>

1. First, we set up the **keras SEQUENTIAL MODEL**. This is the framework inside of which we are going to define the layers. Sequential = layers are sequentially next to each other (either “stacked” or left-to-right, depending on how you draw them).
---
2. Inside the Sequential model, we define the **LAYERS**. To do this, we need to know the following:
* **Normalization Layer**: If used, this is usually the first layer when input data needs to be normalized.
* **Shape**: This is the number of attributes we use as input for the model.
We need to ensure that the input layer has the correct number of input features. This can be specified when creating the first layer with the input_shape argument.
---
3. In the next step, we define HOW we want the model to run, that is to **COMPILE**, with model.compile(). To do this, we need to know the following:
* **Optimizer** = gradient descent function (i.e. which function we use to optimize the step-down of the weights); adam = adaptive learning rate optimization algorithm
* **Loss Function**= evaluation of the ŷ vs the ground truth
* **Metrics** = evaluation criterion, typically accuracy.
---
4. Then, we **FIT** the model to the training set with model.fit(). To do this, we need to know the following:
* **Epoch**: One Epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE. If one epoch is too big to feed to the computer at once we can divide it in several smaller batches
* **Batch size**: Depending on the number of needed features in your dataset (you should reduce these to NO MORE THAN 6), the computing effort can be too intense. Just like you would not eat a whole sandwich in one bite, the machine does better when processing the data in smaller bites called batches. The default batch size in Keras is 32.
---
5. Lastly, we use our model to **PREDICT** the values for the test set with model.predict()
---
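
To make epochs and batches concrete, here is the arithmetic for this notebook's setup, assuming the 1338 rows reported by `describe()`, the 80% training sample, a 20% validation split inside `model.fit()`, and a batch size of 32. The helper name `steps_per_epoch` is made up for this sketch, and integer arithmetic is used to sidestep floating-point rounding.

```python
import math

def steps_per_epoch(n_rows, batch_size=32):
    """Number of batches needed to pass n_rows through the network once."""
    return math.ceil(n_rows / batch_size)

total_rows = 1338
train_rows = total_rows * 80 // 100   # 80% of the rows go into the training set
fit_rows = train_rows * 80 // 100     # 80% of those remain after the validation split

print(train_rows, fit_rows, steps_per_epoch(fit_rows))
```

Under these assumptions, one epoch consists of 27 batches, so ten epochs mean roughly 270 weight updates.
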
**How we choose the LOSS FUNCTION** for step 3 depends on the type of calculation we need our Neural Network to perform:
* If the output variable is **continuous**, we are performing a regression, so the loss function is **mean squared error or MSE**
* If the output variable is **binary**, we are performing a classification, so the loss function is **binary_crossentropy**
* If the output variable is **categorical** with more than two labels, we are still performing a classification, but now the loss function is **categorical_crossentropy**
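
The three loss functions can be written out in a few lines of NumPy. This is a sketch of the underlying formulas on hypothetical values, not the Keras implementation (a production version would also guard against taking the logarithm of 0).

```python
import numpy as np

def mse(y_true, y_pred):
    # Regression: average squared distance between truth and prediction
    return np.mean((y_true - y_pred) ** 2)

def binary_crossentropy(y_true, p):
    # Binary classification: p is the predicted probability of class 1
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_crossentropy(y_onehot, p):
    # Multi-class: one row per sample, one column per class
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

print(mse(np.array([100.0, 200.0]), np.array([110.0, 190.0])))
print(binary_crossentropy(np.array([1.0, 0.0]), np.array([0.9, 0.2])))
print(categorical_crossentropy(np.array([[0, 0, 1, 0]]),
                               np.array([[0.1, 0.1, 0.7, 0.1]])))
```

In all three cases a perfect prediction gives a loss of 0, and the loss grows as the prediction moves away from the ground truth.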

**How we choose the ACTIVATION FUNCTION** when defining the layers: It used to be the case that Sigmoid and Tanh activation functions were preferred for all layers. These days, better performance is usually achieved using the [relu function](https://www.kaggle.com/code/dansbecker/rectified-linear-units-relu-in-deep-learning) in the hidden layers. Using a sigmoid on the output layer ensures your network output is between 0 and 1 and is easy to map to either a probability of class 1 or snap to a hard classification of either class with a default threshold of 0.5.
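
Both activation functions are short enough to sketch in NumPy. The input values are arbitrary, and `threshold` is the default 0.5 cut-off mentioned above.

```python
import numpy as np

def relu(z):
    # Positive values pass through unchanged, negative values become 0
    return np.maximum(0, z)

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])   # arbitrary example inputs
print(relu(z))

probs = sigmoid(z)
threshold = 0.5
hard = (probs >= threshold).astype(int)   # snap probabilities to class 0 or 1
print(probs, hard)
```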

**How we choose the NUMBER AND SIZE OF LAYERS**:

The short answer is: We experiment until we get the best output the fastest. The longer answer is: We can use various optimization strategies that can help us out somewhat. So, let's assume that trial and error has shown us that three layers is optimal. Furthermore, let's assume that we are going to build a Dense Network, aka a fully connected network structure, in which every node is connected with every node in the next layer.

To define this architecture, we will specify the number of neurons or nodes in the layer as the first argument, and set up the activation function with the activation argument.

As activation function, we will use the rectified linear unit or ReLU activation function on the first two layers and [the sigmoid function](https://towardsdatascience.com/sigmoid-and-softmax-functions-in-5-minutes-f516c80ea1f9) in the output layer since our output is binary.



##**3.1 Defining the keras model**
We will build our model as follows:
1. Use keras.Sequential
2. If we have any numeric data, add our normalizer layer
3. Add two hidden layers with 24 nodes each; we will use the [relu function](https://www.kaggle.com/code/dansbecker/rectified-linear-units-relu-in-deep-learning) so that all positive values will remain positive but all negative values will become 0.
4. For the output layer, we will use [the sigmoid function ](https://towardsdatascience.com/sigmoid-and-softmax-functions-in-5-minutes-f516c80ea1f9) since our output is binary.



In [None]:
# define the keras model
model = Sequential()
model.add(normalizer)
model.add(Dense(24, input_shape=(12,), activation='relu')) # We have 12 columns in the encoded training dataset
model.add(Dense(24, activation='relu'))
model.add(Dense(2, activation='sigmoid')) # Two output nodes, one per one-hot label column (approved, denied)
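
To get a feel for the size of this network, the number of weights can be counted by hand: a Dense layer has one weight per input-output pair plus one bias per node. The sketch assumes 12 input features (the column count of the encoded `train_features` output shown earlier); `dense_params` is a helper made up for this example.

```python
def dense_params(n_inputs, n_nodes):
    """Weights (n_inputs * n_nodes) plus one bias per node."""
    return n_inputs * n_nodes + n_nodes

# Assumed layer widths: 12 inputs -> 24 -> 24 -> 2 outputs
layer_sizes = [12, 24, 24, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer, sum(per_layer))
```

Calling `model.summary()` on the built model should list the same per-layer counts for the three Dense layers; the normalization layer stores its statistics as additional non-trainable values.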

##**3.2 Compiling the model**
Now we can configure the training procedure using the Model.compile() method. The most important arguments to compile are the loss and the optimizer since these define what will be optimized (binary_crossentropy) and how (using the [optimizers.Adam](https://colab.research.google.com/corgiredirector?site=https%3A%2F%2Fwww.tensorflow.org%2Fapi_docs%2Fpython%2Ftf%2Fkeras%2Foptimizers%2FAdam)). Note that we can adjust the learning_rate, which helps us tune the gradient.

In [None]:
model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.1), loss='binary_crossentropy', metrics=['accuracy'])

##**3.3 Training the model**
Once the model is configured, we use Model.fit() to train it:

In [None]:
%%time
history = model.fit(
    train_features, train_labels,
    epochs=10,
    # verbose=1 prints a progress bar for every epoch; set verbose=0 to suppress logging
    verbose=1,
    # Calculate validation results on 20% of the training data. Validation means that we test as we go, on a 20% subset of the training data
    validation_split = 0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 2.47 s, sys: 50.2 ms, total: 2.52 s
Wall time: 3.56 s


#**4. Evaluating the model**
We have trained our neural network and we can now evaluate the performance of the network on the test dataset. To evaluate the model on the test dataset, we can either use the predict() function to see the individual predictions for our test set, or we can use the evaluate() function and pass it the test data.

The evaluate() function will generate a prediction for each input and output pair and collect scores, including the average loss and any metrics you have configured, such as accuracy. The function will return a list with two values. The first will be the loss of the model on the dataset and the second will be the accuracy of the model on the dataset.


In [None]:
model.evaluate(test_features, test_labels)



[0.4709729552268982, 0.9067164063453674]
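
Step 6 of the process calls for a Confusion Matrix and a Classification Report. A minimal sketch with scikit-learn follows. The `probs` array is a hypothetical stand-in for the output of `model.predict(test_features)`, and `y_true_onehot` for `test_labels`; with the real objects, the same `argmax` step would turn the two output columns back into a single class index (0 = approved, 1 = denied).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical model output: one row per claim, columns = (approved, denied)
probs = np.array([[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.6, 0.5], [0.4, 0.9]])
# Hypothetical one-hot ground truth in the same column order
y_true_onehot = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])

# Pick the column with the highest score as the predicted class
y_pred = probs.argmax(axis=1)
y_true = y_true_onehot.argmax(axis=1)

cm = confusion_matrix(y_true, y_pred)   # rows = actual class, columns = predicted class
print(cm)
print(classification_report(y_true, y_pred, target_names=["approved", "denied"]))
```

The diagonal of the matrix holds the correct predictions; everything off the diagonal is a claim the model would have approved or denied wrongly.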

#**EXERCISES**





##**1. Try the model out as a regression**
Use "charges" as your target variable. The activation function for the output layer will be "linear"

##**2. Try the model out as a classification**
Use "region" as your target variable. The activation function for the output layer will be "softmax"

##**3. Optimize the gradient descent**
As you can see, the binary model overfits when run with a learning rate of 0.1. Edit the model and set the learning rate as follows:
1. 0.01
2. 0.001
3. 0.5

What do you observe?