<a href="https://colab.research.google.com/github/renardelyon/Latihan_ML/blob/main/Regression%20Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Regression Project

We have learned about regression and how to build regression models using both scikit-learn and TensorFlow. Now we'll build a regression model from start to finish. We will acquire data and perform exploratory data analysis and data preprocessing. We'll build and tune our model and measure how well our model generalizes.

## Framing the Problem

### Overview

*Friendly Insurance, Inc.* has asked us to do a study to help predict the costs that their policyholders generate. They have provided us with sample [anonymous data](https://www.kaggle.com/mirichoi0218/insurance) about some of their policyholders for the previous year. The dataset includes the following information:

Column   | Description
---------|-------------
age      | age of primary beneficiary
sex      | gender of the primary beneficiary (male or female)
bmi      | body mass index of the primary beneficiary
children | number of children covered by the plan
smoker   | is the primary beneficiary a smoker (yes or no)
region   | geographic region of the beneficiaries (northeast, southeast, southwest, or northwest)
charges  | costs to the insurance company

We have been asked to create a model that, given the first six columns, can predict the charges the insurance company might incur.
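
As a minimal sketch of that framing, the snippet below splits a table with these seven columns into model inputs and the prediction target. The two rows are made-up placeholders for illustration, not records from the dataset, and the names `FEATURE_COLUMNS`, `TARGET_COLUMN`, and `sample_df` are our own.

```python
import pandas as pd

# Column names come from the table above; the rows are invented placeholders.
FEATURE_COLUMNS = ['age', 'sex', 'bmi', 'children', 'smoker', 'region']
TARGET_COLUMN = 'charges'

sample_df = pd.DataFrame({
    'age': [30, 45],
    'sex': ['female', 'male'],
    'bmi': [25.0, 31.5],
    'children': [0, 2],
    'smoker': ['no', 'yes'],
    'region': ['northeast', 'southwest'],
    'charges': [4000.0, 21000.0],
})

features = sample_df[FEATURE_COLUMNS]  # what the model is given
target = sample_df[TARGET_COLUMN]      # what the model should predict

print(features.shape, target.shape)
```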

The company wants to see how accurate we can get with our predictions. If we can make a case for our model, they will provide us with the full dataset of all of their customers for the last ten years to see if we can improve on our model and possibly even predict cost per client year over year.

### Exercise 1: Thinking About the Data

Before we look closely at the data, let's think about the problem space and the dataset. Consider the questions below.

#### Question 1

Is this problem actually a good fit for machine learning? Why or why not?

##### **Student Solution**

> *Please Put Your Answer Here*

---

#### Question 2

If we do build the machine learning model, what biases might exist in the data? Is there anything that might cause the model to have trouble generalizing to other data? If so, how might we make the model more resilient?

##### **Student Solution**

> *Please Put Your Answer Here*

---

#### Question 3

We have been asked to take input features about people who are insured and predict costs, but we haven't been given much information about how these predictions will be used. What effect might our predictions have on decisions made by the insurance company? How might this affect the insured?

##### **Student Solution**

> *Please Put Your Answer Here*

---

## Exploratory Data Analysis

Now that we have considered the societal implications of our model, we can start looking at the data to get a better understanding of what we are working with.

The data we'll be using for this project can be [found on Kaggle](https://www.kaggle.com/mirichoi0218/insurance). Upload your `kaggle.json` file and run the code block below.

In [None]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'

### Exercise 2: EDA and Data Preprocessing

Using as many code and text blocks as you need, download the dataset, explore it, and do any model-independent preprocessing that you think is necessary. Feel free to use any of the tools for data analysis and visualization that we have covered in this course so far. Be sure to do individual column analysis and cross-column analysis. Explain your findings.
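
To make the difference between individual column analysis and cross-column analysis concrete, here is a small sketch. It runs on a handful of invented rows that only borrow the dataset's column names (the values and the name `toy_df` are assumptions for illustration), so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Invented placeholder rows with the same column names as the dataset.
toy_df = pd.DataFrame({
    'age': [20, 30, 40, 50, 25, 60],
    'bmi': [28.0, 23.0, 33.0, 31.0, 24.0, 29.0],
    'smoker': ['yes', 'no', 'no', 'yes', 'no', 'yes'],
    'charges': [17000.0, 4500.0, 8000.0, 40000.0, 4000.0, 47000.0],
})

# Individual column analysis: summary statistics and value counts.
print(toy_df['charges'].describe())
print(toy_df['smoker'].value_counts())

# Cross-column analysis: compare a numeric column across a categorical one...
mean_charges_by_smoker = toy_df.groupby('smoker')['charges'].mean()
print(mean_charges_by_smoker)

# ...and measure the linear association between numeric columns.
numeric_corr = toy_df[['age', 'bmi', 'charges']].corr()
print(numeric_corr)
```

The same calls can be pointed at the real DataFrame once it is loaded.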

#### **Student Solution**

In [None]:
#download data
!kaggle datasets download mirichoi0218/insurance


In [None]:
import pandas as pd 

insurance_df = pd.read_csv('insurance.zip')
insurance_df.describe(include='all')

In [None]:
#check whether any column has missing values
insurance_df.isna().any()

In [None]:
import matplotlib.pyplot as plt

#create bar chart for 'age' column
count = insurance_df['age'].groupby(insurance_df['age']).count()
plt.bar(count.index,count)
plt.show()

In [None]:
#create bar chart for 'sex' column
sex_count = insurance_df['sex'].groupby(insurance_df['sex']).count()
plt.bar(sex_count.index,sex_count)
plt.show()

In [None]:
#change male to 0 and female to 1
insurance_df['sex'] = insurance_df['sex'].map({'male': 0, 'female': 1})

insurance_df.head()


In [None]:
#create histogram for 'bmi' column (bmi is continuous, so bin the values instead of counting each one)
plt.hist(insurance_df['bmi'], bins=30)
plt.show()

In [None]:
#create bar chart for 'children' column
child_count = insurance_df['children'].groupby(insurance_df['children']).count()
plt.bar(child_count.index,child_count)
plt.show()

In [None]:
#create bar chart for 'smoker' column
smoker_count = insurance_df['smoker'].groupby(insurance_df['smoker']).count()
plt.bar(smoker_count.index,smoker_count)
plt.show()

In [None]:
#change values in 'smoker' column from yes to 1 and no to 0

insurance_df['smoker'] = insurance_df['smoker'].map({'no': 0, 'yes': 1})

insurance_df.head()

In [None]:
#create bar chart for 'region' column
region_count = insurance_df['region'].groupby(insurance_df['region']).count()
plt.bar(region_count.index,region_count)
plt.show()

#### Data Visualization



In [None]:
#scale charges to 1/1000
scale_factor = 1000
insurance_df['charges'] = insurance_df['charges']/scale_factor

In [None]:
import seaborn as sns

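#'region' is still a string column, so restrict the correlation to numeric columns
df_corr = insurance_df.corr(numeric_only=True)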
sns.heatmap(df_corr,annot=True,cmap='coolwarm')
plt.show()

In [None]:
insurance_df

In [None]:
#plot age and charges

plt.plot(insurance_df['age'],insurance_df['charges'],'b.')
plt.show()

In [None]:
#box plot of age within ten equal-width bins of charges
import numpy as np

edges = np.histogram_bin_edges(insurance_df['charges'], bins=10)
bins = np.digitize(insurance_df['charges'], edges[:-1])
plt.figure(figsize=[10, 10])
ax = sns.boxplot(
    y=insurance_df['age'],
    x=bins
)
labels = [f'{edges[i]:.1f} - {edges[i+1]:.1f}'
            for i in range(len(edges) - 1)]
_ = ax.set_title('Age by Charges')
_ = plt.xticks(list(range(10)), labels)

In [None]:
#plot bmi and charges (scatter plot, since both columns are continuous)
plt.plot(insurance_df['bmi'],insurance_df['charges'],'b.')
plt.show()

---

## Modeling

Now that we understand our data a little better, we can build a model. We are trying to predict 'charges', which is a continuous variable. We'll use a regression model to predict 'charges'.

### Exercise 3: Modeling

Using as many code and text blocks as you need, build a model that can predict 'charges' given the features that we have available. To do this, feel free to use any of the toolkits and models that we have explored so far.

You'll be expected to:
1. Prepare the data for the model (or models) that you choose. Remember that some of the data is categorical. In order for your model to use it, you'll need to convert the data to some numeric representation.
1. Build a model or models and adjust parameters.
1. Validate your model with holdout data. Hold out some percentage of your data (10-20%), and use it as a final validation of your model. Print the root mean squared error. We were able to get an RMSE between `3500` and `4000`, but your final RMSE will likely be different.
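
The three steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: it uses a scikit-learn `LinearRegression` baseline rather than any particular model, and it runs on synthetic stand-in data with an invented relationship between the columns (not the insurance dataset), so the printed RMSE is not comparable to the range quoted above. The names `toy_df` and `baseline` are our own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the insurance dataset), seeded for repeatability.
rng = np.random.default_rng(0)
n_rows = 200
toy_df = pd.DataFrame({
    'age': rng.integers(18, 65, size=n_rows),
    'bmi': rng.normal(30.0, 5.0, size=n_rows),
    'smoker': rng.choice(['yes', 'no'], size=n_rows),
    'region': rng.choice(
        ['northeast', 'southeast', 'southwest', 'northwest'], size=n_rows),
})
# Invented relationship so the model has something to learn.
toy_df['charges'] = (
    250.0 * toy_df['age']
    + 20000.0 * (toy_df['smoker'] == 'yes')
    + rng.normal(0.0, 1000.0, size=n_rows)
)

# 1. Convert categorical columns to a numeric (one-hot) representation.
features = pd.get_dummies(
    toy_df.drop(columns='charges'), columns=['smoker', 'region'], dtype=float)
target = toy_df['charges']

# 2. Hold out 20% of the rows, then fit a model on the rest.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0)
baseline = LinearRegression().fit(x_train, y_train)

# 3. Report the root mean squared error on the holdout rows only.
rmse = float(np.sqrt(mean_squared_error(y_test, baseline.predict(x_test))))
print(f'Holdout RMSE: {rmse:.1f}')
```

A simple baseline like this gives a reference point: a more complex model is only worth keeping if its holdout RMSE is lower.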

#### **Student Solution**

In [None]:
# split the columns into target, feature, and numeric feature columns

target_column = 'charges'
feature_column = [name for name in insurance_df.columns if name not in [target_column]]
numeric_feature_columns=[name for name in feature_column if name != 'region']


In [None]:
#standardization
#standardize the numeric feature columns, except the binary 'sex' and 'smoker' features
standardize_features = [c for c in numeric_feature_columns if c not in ['sex','smoker']]
insurance_df[standardize_features] = (
    insurance_df[standardize_features] - 
      insurance_df[standardize_features].mean()) /\
       insurance_df[standardize_features].std(ddof=0)
insurance_df[numeric_feature_columns].describe()

In [None]:
#one hot encoding the region column
#(done before the split so that training_df and test_df contain the encoded columns)
for op in sorted(insurance_df['region'].unique()):
  op_col = op.lower()
  insurance_df[op_col] = (insurance_df['region'] == op).astype(int)
  feature_column.append(op_col)

feature_column.remove('region')

insurance_df

In [None]:
import numpy as np

#shuffle the data
insurance_df = insurance_df.reindex(np.random.permutation(insurance_df.index))

#split the data
split_size = 0.8
training_size = int(insurance_df.shape[0]*split_size)
training_df = insurance_df.head(training_size)
#the remaining rows form the test set, so no row is lost to rounding
test_df = insurance_df.iloc[training_size:]

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.regularizers import L2


feature_count = len(feature_column)

model = keras.Sequential([
  layers.Dense(128, activation='relu', input_shape=[feature_count]),
  layers.Dropout(0.2),
  layers.Dense(128,activation='relu',kernel_regularizer=L2(l2=0.3)),
  layers.Dense(1)
])

model.summary()

In [None]:
model.compile(
  loss='mse',
  optimizer='Adam',
  metrics=['mae', 'mse'],
)

model.summary()

In [None]:

EPOCHS = 50

history = model.fit(
  training_df[feature_column],
  training_df[target_column],
  epochs=EPOCHS,
  validation_split=0.2,
)

In [None]:
import matplotlib.pyplot as plt
 
mae = history.history['mae']
val_mae = history.history['val_mae']

epoch = range(len(mae))

plt.plot(epoch,mae,'r-')
plt.plot(epoch,val_mae,'b-')
plt.legend(labels=['mae','val_mae'])
plt.xlabel('epoch')
plt.ylabel('value')
plt.show()

In [None]:
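import math

loss,mae,mse=model.evaluate(test_df[feature_column],test_df[target_column])
#charges were divided by scale_factor earlier, so convert the RMSE back to the original units
rmse=math.sqrt(mse)*scale_factor
print(rmse)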


---