## Linear Regression 
This is my ATM project: Amati (observe), Tiru (imitate), Modifikasi (modify).\
Learning ML, DL, and AI, from beginner to hero!

In this case, we want to predict how much people should pay for insurance based on their age and number of children.

### Import Libraries

In [16]:
import pandas as pd # data manipulation
import numpy as np # array manipulation
%matplotlib inline 
import matplotlib.pyplot as plt # visualization
import sklearn
from sklearn.compose import ColumnTransformer # transform the column 
from sklearn.preprocessing import OneHotEncoder # encode datatype
from sklearn.linear_model import LinearRegression # modelling
from sklearn.model_selection import train_test_split # splitting train and test
from sklearn.metrics import mean_squared_error # evaluation

### Import Dataset

In [17]:
dt = pd.read_csv("data_input/c1/insurance.csv")

In [18]:
dt.head()

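|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |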


In [19]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In this step, we import the dataset from the data_input folder. The data has 1338 rows and 7 columns. Some data types are not yet appropriate (sex, smoker, and region are stored as object), so we will convert them to the right types. Below is a description of the data.

- age: Age of primary beneficiary
- sex: Gender of the insurance contractor, female / male
- bmi: Body mass index, an objective index of body weight relative to height (kg / m^2) that indicates whether weight is relatively high or low for a given height; ideally 18.5 to 24.9
- children: Number of children covered by health insurance / Number of dependents
- smoker: Smoker / non-smoker
- region: The beneficiary's residential area in the US, northeast, southeast, southwest, northwest.
- charges: Individual medical costs billed by health insurance.

### EDA

In this step, we check for missing values, inappropriate data types, duplicates, etc.

In [20]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


##### Change data type

In [21]:
dt[['sex','smoker','region']] = dt[['sex','smoker','region']].astype('category')
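
Converting to `category` only changes the dtype; a linear model would still need numbers if these columns were ever used as predictors. The imports above include `ColumnTransformer` and `OneHotEncoder`, which the rest of this notebook does not use. As a minimal sketch of how they could encode a categorical column, here is an example on the five rows shown by `dt.head()`, reduced to three columns. The names `sample`, `encoder`, and `encoded` are made up for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The five rows from dt.head(), keeping only three columns for illustration
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
})

# One-hot encode "smoker" and pass the numeric columns through unchanged
encoder = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), ["smoker"])],
    remainder="passthrough",
)
encoded = encoder.fit_transform(sample)

# Two one-hot columns for smoker (no / yes) plus age and children
print(encoded.shape)
```

The encoded columns come first in the result, followed by the passthrough columns.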

##### Missing value

In [22]:
dt.isnull().any()

age         False
sex         False
bmi         False
children    False
smoker      False
region      False
charges     False
dtype: bool

##### Check duplicates

In [23]:
dt.duplicated().any()

True

In [24]:
dt.drop_duplicates(inplace = True)

In [25]:
dt.duplicated().any()

False

Our data has no missing values, and the duplicated rows have been dropped, so we can continue to the next step.

### Splitting Data

In this case, we want to predict the insurance charges from our data. We set the charges column as the target and, as stated above, use age and children as the predictors.

In [26]:
# get the target
y = dt['charges']

# select the predictors (age and children)
X = dt[['age','children']]

In [27]:
# split 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=57
)
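
To see what `test_size=0.2` and a fixed `random_state` do, here is a small sketch on made-up data (10 rows, not the insurance dataset). The names `X_demo` and `y_demo` are hypothetical and only serve this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, only to illustrate the split sizes
X_demo = pd.DataFrame({
    "age": range(20, 30),
    "children": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1],
})
y_demo = pd.Series(range(1000, 11000, 1000), name="charges")

# Same arguments as in the notebook: 20% test, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=57
)

# Splitting again with the same seed selects the same rows
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=57
)

print(len(X_tr), len(X_te))            # 8 train rows, 2 test rows
print(X_te.index.equals(X_te2.index))  # True: the split is reproducible
```

Fixing `random_state` makes the evaluation result repeatable between runs.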

### Training

We fit the model on the training data from the previous split, using the LinearRegression class from sklearn.

In [28]:
mdl = LinearRegression()
mdl.fit(X_train, y_train)

LinearRegression()
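
With two predictors, the fitted model has the form charges ≈ intercept + b1 * age + b2 * children. The sketch below uses made-up, noise-free data (not the insurance dataset) to show where scikit-learn stores these values after fitting, namely `intercept_` and `coef_`. The names `X_demo`, `y_demo`, and `demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data generated from:
# charges = 1000 + 100 * age + 500 * children
X_demo = np.array([[20, 0], [30, 2], [40, 1], [50, 3], [60, 0]])
y_demo = 1000 + 100 * X_demo[:, 0] + 500 * X_demo[:, 1]

demo = LinearRegression().fit(X_demo, y_demo)

# The fitted intercept and one coefficient per predictor column
print(demo.intercept_, demo.coef_)

# predict() is the same as applying the linear formula by hand
manual = demo.intercept_ + X_demo @ demo.coef_
print(np.allclose(manual, demo.predict(X_demo)))
```

On the notebook's model, the same attributes are available as `mdl.intercept_` and `mdl.coef_`.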

### Testing

Predict on the test data with the model that has been built.

In [29]:
y_pred = mdl.predict(X_test)
y_pred[:5]

array([ 7967.24247001, 10929.13593094,  6831.26845995,  8197.22428361,
       15659.80443365])

### Model Evaluation

In [30]:
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error : ", mse)

Mean Squared Error :  129117385.18176281
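
The MSE is measured in squared units of the target, which makes it hard to read directly. Taking its square root gives the root mean squared error (RMSE), which is in the same units as charges. A minimal sketch, using the MSE value printed above:

```python
import math

# MSE value printed by the evaluation cell above
mse = 129117385.18176281

# RMSE is the square root of the MSE, in the same units as charges
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:,.2f}")  # roughly 11,363
```

In the notebook itself, the same number would come from `np.sqrt(mse)`.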


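The MSE is not a percentage: it is expressed in squared units of charges. Its square root (RMSE) is about 11,363, which is large compared with the charges in the sample rows above (roughly 1,700 to 22,000), so a model that uses only age and children gives rough estimates at best. Next, we predict new (dummy) data.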

In [31]:
new_dt = {'age' : [50,30,20],
         'children' : [7, 0, 4]}

new_dt = pd.DataFrame(new_dt)

In [32]:
y_hat = mdl.predict(new_dt)
y_hat

array([19700.52749555, 10109.56243674,  9736.01751179])

Our model successfully predicts new data. If you have any advice on this work or suggestions for what I should learn next, you can [send me an email](mailto:rahfairuzran@gmail.com).