
### Example for Regression
In this example, a linear regression model is built to predict individual insurance charges from the attributes of each beneficiary.
The dataset consists of the following columns:
- age - age of the primary beneficiary
- sex - gender of the insurance contractor (female or male)
- bmi - body mass index, an indicator of whether body weight is high or low relative to height
- children - number of children (dependents)
- smoker - whether the beneficiary smokes
- region - the beneficiary's residential area in the US: northeast, southeast, southwest, or northwest
- charges - individual medical costs billed by health insurance (the prediction target)

In [None]:
# Import the necessary modules and packages
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split as split
from sklearn.linear_model import LinearRegression

In [None]:
# Load the dataset from CSV file
df = pd.read_csv('https://raw.githubusercontent.com/wooihaw/ERA2036_T2230/main/Chapter_4/insurance.csv')

In [None]:
# Check the number of rows and columns, along with each column's data type
df.info()

In [None]:
# Randomly view 5 data samples from the dataset
df.sample(5)

In [None]:
# Check for missing data
# Any missing values must be handled before training the model
df.isna().sum()
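
If the check above reports missing values, two common ways to handle them are dropping the affected rows or filling the gaps. The following is a minimal sketch on a hypothetical toy DataFrame (`toy` is not the insurance data, and the choice of median and mode as fill values is an assumption, not a rule):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the insurance data) with two missing values
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "smoker": ["no", "yes", None, "no"],
})

# Count missing values per column, as in the cell above
missing_counts = toy.isna().sum()
print(missing_counts)

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: fill numeric gaps with the median and categorical gaps with the mode
filled = toy.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["smoker"] = filled["smoker"].fillna(filled["smoker"].mode()[0])
print(filled)
```

Dropping rows is simplest but discards data, while filling keeps every row at the cost of introducing estimated values.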

In [None]:
# Calculate descriptive statistics
df.describe()

In [None]:
# Apply one-hot encoding to convert nominal categorical data to numerical data
# dtype=int yields 0/1 indicator columns instead of booleans on recent pandas versions
df2 = pd.get_dummies(df, dtype=int)
df2.sample(5)
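
To see what one-hot encoding does, here is a small sketch on a hypothetical three-row frame (`toy` and its values are made up for illustration). It also shows the optional `drop_first=True` argument of `pd.get_dummies`, which drops one indicator column per category to avoid redundant columns; the notebook itself does not use it.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    "age": [19, 33, 52],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southwest"],
})

# Each category value becomes its own 0/1 indicator column
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())

# drop_first=True removes the first indicator of every categorical column
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)
print(reduced.columns.tolist())
```

Numeric columns such as `age` pass through unchanged; only the categorical columns are expanded.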

In [None]:
# Extract the "charges" column (targets) into y
y = df2['charges'].values

# Extract the remaining columns (features) into X
# drop() returns a new DataFrame, so df2 stays intact and this cell can be re-run
X = df2.drop(columns='charges').values

# Print the dimensions of X and y
print(f"Dimension of X: {X.shape}")
print(f"Dimension of y: {y.shape}")

In [None]:
# Use 80% of the dataset for training and the remaining 20% for testing
# (a fixed random_state makes the split reproducible)
X_train, X_test, y_train, y_test = split(X, y, test_size=0.2, random_state=42)

# Print the number of data samples for training and testing
print(f"Number of data samples for training: {X_train.shape[0]}")
print(f"Number of data samples for testing: {X_test.shape[0]}")
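
The effect of `test_size` and `random_state` can be checked on a small example. This is a sketch on hypothetical toy arrays (`X_toy` and `y_toy` are made up, ten samples only), showing that the same `random_state` produces the same split on every call:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with the same random_state
first = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
second = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = first
print(f"Training samples: {X_tr.shape[0]}, testing samples: {X_te.shape[0]}")

# Compare the two splits element by element
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(f"Identical splits: {same_split}")
```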

In [None]:
# Train a linear regression model with the training data to predict the insurance charges
lnr = LinearRegression().fit(X_train, y_train)

# Evaluate the linear regression model with the testing data and print the R2 score
# (coefficient of determination; 1.0 is a perfect fit)
print(f"lnr R2 score: {lnr.score(X_test, y_test):.4f}")
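
The R2 score returned by `score()` is the coefficient of determination, defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes it by hand on hypothetical synthetic data (`X_toy`, `y_toy`, and the slope and noise values are made up for illustration) and compares the result with scikit-learn's own values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical synthetic data: a noisy straight line
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(50, 1))
y_toy = 3.0 * X_toy[:, 0] + 2.0 + rng.normal(0, 1, size=50)

model = LinearRegression().fit(X_toy, y_toy)
y_pred = model.predict(X_toy)

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_toy - y_pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R2:      {r2_manual:.4f}")
print(f"model.score():  {model.score(X_toy, y_toy):.4f}")
print(f"r2_score():     {r2_score(y_toy, y_pred):.4f}")
```

A score close to 1 means the model explains most of the variance in the target, while a score near 0 means it does little better than always predicting the mean.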