## Overview
This notebook bins the "Age" column into age ranges commonly used in health-related analysis. It uses fixed-interval binning, a simple method for creating age groups. The goal is to analyze how age group and plan type together affect the dependent variable (Coinsurance).

### Objective
1. Loading and inspecting dataset
2. Binning "Age" column
3. Observations
4. Handle unnecessary columns

### 1. Loading and inspecting dataset
#### Import libraries
1. pandas: lets you create, manipulate, and analyze datasets efficiently.
2. numpy: provides support for arrays, matrices, and various mathematical functions.
3. seaborn: provides high-level functions to create attractive and informative plots.
4. matplotlib: a foundational library for creating static, interactive, and animated visualizations.

In [178]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [179]:
# Load the cleaned dataset (merged values for analysis, produced by 4.Final_Merge_Benefit_Left-Use.ipynb)
df = pd.read_csv("cleaned.csv", low_memory=False)

# display all columns, Non-Null Count and Dtype
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Unnamed: 0           6000 non-null   int64 
 1   StandardComponentId  6000 non-null   object
 2   Age                  6000 non-null   int64 
 3   PlanType             6000 non-null   object
 4   Coinsurance          6000 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 234.5+ KB
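
The `Unnamed: 0` column in the output above is what pandas typically produces when a CSV was written with its index and then read back without `index_col`. How `cleaned.csv` was actually written is not shown here, so the following is only a sketch of the likely mechanism, using toy data and an in-memory buffer instead of a real file:

```python
import io

import pandas as pd

# Toy data; the values are illustrative, not taken from cleaned.csv
toy = pd.DataFrame({"Age": [20, 21], "Coinsurance": [40, 0]})

# Default to_csv() writes the index as an unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())

# Option 1: tell read_csv that the first column is the index
buffer.seek(0)
with_index_col = pd.read_csv(buffer, index_col=0)

# Option 2: do not write the index in the first place
buffer_no_index = io.StringIO()
toy.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
no_index = pd.read_csv(buffer_no_index)
print(with_index_col.columns.tolist(), no_index.columns.tolist())
```

Either option keeps the extra column out of later steps; which one fits depends on whether the saved index carries meaning.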


In [180]:
# Sort Age
sorted_ages = np.sort(df['Age'].unique())
print(sorted_ages)

[20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65]


In [181]:
# display first 5 rows
df.head()

Unnamed: 0.1,Unnamed: 0,StandardComponentId,Age,PlanType,Coinsurance
0,0,40540TX0080003,20,PPO,40
1,1,18973IA0210004,20,POS,0
2,2,88380VA0720012,20,HMO,0
3,3,58255OH0200001,20,PPO,28
4,4,52664OH1510013,20,PPO,13


### 2. Binning "Age" column
1. Use age ranges that are common in health-related analysis:
    - 0-20: Typically, those who are minors and covered under parents' or guardians' health insurance plans.
    - 21-25: Young adults, often covered under parental insurance until 26.
    - 26-40: Typically, adults in early to mid-career stages, where private insurance plans or employer-sponsored plans become more common.
    - 41-60: Pre-retirement age group, often with an increased focus on preventative care and management of chronic conditions.
    - 61-64: Near-retirement age, typically covered under employer-sponsored plans or private health insurance, just before eligibility for Medicare.
    - 65+: Seniors eligible for Medicare.
2. Apply fixed-interval binning, a simple method to create age groups.
3. Analyze how age groups and plan types together affect the dependent variable (Coinsurance).
4. Create a new column for age bins.

In [182]:
# Common age bins for health-related analysis include the following categories 
# by Fixed Interval Binning, a simple approach to create age bins 
# and to analyze how different age groups and plan types together affect the dependent variable
# 1. 0-20: Typically, those who are minors and covered under parents' or guardians' health insurance plans.
# 2. 21-25: Young adults, often covered under parental insurance until 26.
# 3. 26-40: Typically, adults in early to mid-career stages, where private insurance plans or employer-sponsored plans become more common.
# 4. 41-60: Pre-retirement age group, often with an increased focus on preventative care and management of chronic conditions.
# 5. 61-64: Near-retirement age, typically covered under employer-sponsored plans or private health insurance, just before eligibility for Medicare.
# 6. 65+: Seniors eligible for Medicare

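# Bin edges; with right=False each interval is closed on the left: [0, 21), [21, 26), ..., [65, inf)
bins = [0, 21, 26, 41, 61, 65, float('inf')]
labels = ['0-20', '21-25', '26-40', '41-60', '61-64', '65+']  # Corresponding labels

# Create a new column for age bins
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Keep the columns needed for analysis; reset_index() stores the old row labels in an 'index' column
df_final = df[['AgeGroup', 'Age', 'PlanType', 'Coinsurance']].reset_index().sort_values(by='AgeGroup')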
#df_sorted = df_ra_clean[['AgeGroup', 'PlanId']].value_counts().reset_index(name='Count').sort_values(by='AgeGroup')
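
As a sanity check on the bin edges, it can help to run `pd.cut` on ages that sit on either side of every boundary. The sketch below reuses the same edges and labels as the cell above. The ages are toy values chosen for illustration, and `boundary_ages` and `check` are hypothetical names:

```python
import pandas as pd

# Same edges and labels as in the binning cell
bins = [0, 21, 26, 41, 61, 65, float("inf")]
labels = ["0-20", "21-25", "26-40", "41-60", "61-64", "65+"]

# Toy ages on each side of every boundary (illustrative values)
boundary_ages = pd.Series([20, 21, 25, 26, 40, 41, 60, 61, 64, 65])

# right=False makes each interval closed on the left and open on the right
groups = pd.cut(boundary_ages, bins=bins, labels=labels, right=False)

check = pd.DataFrame({"Age": boundary_ages, "AgeGroup": groups})
print(check)

# pd.cut returns an ordered categorical, so sorting follows bin order
print(groups.cat.ordered)
```

With `right=False`, age 21 falls in `21-25` and age 65 falls in `65+`, which matches the labels. With the default `right=True`, the same edges would shift every boundary age into the lower group.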

In [183]:
# Preview with a fresh 0..n-1 index (the result is not assigned, so df_final keeps its original row labels)
df_final.reset_index(drop=True)

Unnamed: 0,index,AgeGroup,Age,PlanType,Coinsurance
0,0,0-20,20,PPO,40
1,658,0-20,20,PPO,39
2,659,0-20,20,Indemnity,45
3,660,0-20,20,PPO,23
4,661,0-20,20,EPO,14
...,...,...,...,...,...
5995,5338,65+,65,PPO,28
5996,5339,65+,65,PPO,0
5997,5340,65+,65,HMO,0
5998,5327,65+,65,PPO,33
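
Note that `reset_index` returns a new DataFrame rather than modifying the one it is called on, so a call whose result is not assigned only affects what is displayed. A minimal sketch with toy data (`toy` and `preview` are hypothetical names):

```python
import pandas as pd

# Toy frame, sorted so that the row labels are no longer in order
toy = pd.DataFrame({"Age": [30, 20, 25]}).sort_values(by="Age")

# reset_index returns a new frame; toy itself is left untouched
preview = toy.reset_index(drop=True)

print(toy.index.tolist())      # original labels, in sorted-row order
print(preview.index.tolist())  # fresh 0..n-1 labels
```

To keep the new index, assign the result back, for example `toy = toy.reset_index(drop=True)`.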


In [184]:
# display all columns, Non-Null Count and Dtype
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6000 entries, 0 to 5999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   index        6000 non-null   int64   
 1   AgeGroup     6000 non-null   category
 2   Age          6000 non-null   int64   
 3   PlanType     6000 non-null   object  
 4   Coinsurance  6000 non-null   int64   
dtypes: category(1), int64(3), object(1)
memory usage: 240.4+ KB


### 3. Observations
1. **Total rows and columns:** The dataset contains 6,000 rows (entries) and 5 columns.
2. **Data types and possible issues**
    - 3 columns are int64: **index, Age, Coinsurance**
    - 1 column is category: **AgeGroup**
    - 1 column is object: **PlanType**

In [185]:
# display first 5 rows
df_final.head()

Unnamed: 0,index,AgeGroup,Age,PlanType,Coinsurance
0,0,0-20,20,PPO,40
658,658,0-20,20,PPO,39
659,659,0-20,20,Indemnity,45
660,660,0-20,20,PPO,23
661,661,0-20,20,EPO,14


In [186]:
# Check for unique values in the AgeGroup column
df_final['AgeGroup'].unique()

['0-20', '21-25', '26-40', '41-60', '61-64', '65+']
Categories (6, object): ['0-20' < '21-25' < '26-40' < '41-60' < '61-64' < '65+']

In [187]:
# Check for unique values in the PlanType column
df_final['PlanType'].unique()

array(['PPO', 'Indemnity', 'EPO', 'HMO', 'POS'], dtype=object)

In [188]:
# Check for unique values in the Age column
df_final['Age'].unique()

array([20, 24, 23, 21, 25, 22, 39, 38, 32, 40, 29, 28, 31, 34, 36, 33, 37,
       26, 30, 35, 27, 44, 57, 50, 51, 47, 54, 59, 48, 45, 52, 55, 56, 42,
       41, 49, 53, 60, 46, 58, 43, 63, 62, 61, 64, 65])

### 4. Handle unnecessary columns
1. Drop the 'index' column
2. Create CSV file

In [189]:
# Step 1: Drop the 'index' column
df_final = df_final.drop(columns=['index'])
# display first 5 rows
df_final.head()

Unnamed: 0,AgeGroup,Age,PlanType,Coinsurance
0,0-20,20,PPO,40
658,0-20,20,PPO,39
659,0-20,20,Indemnity,45
660,0-20,20,PPO,23
661,0-20,20,EPO,14


In [190]:
# Rename the original column (Coinsurance) to Coinsurance Percentage
df_final.rename(columns={'Coinsurance': 'Coinsurance Percentage'}, inplace=True)

# Convert the percentage to a decimal fraction, derived from the same DataFrame
df_final['Coinsurance'] = df_final['Coinsurance Percentage'] / 100
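
One detail worth keeping in mind for derived columns: when a Series is assigned into a DataFrame, pandas matches rows by index label rather than by position. A minimal sketch with toy data (the frame and column names here are illustrative, not from the dataset):

```python
import pandas as pd

# Toy source frame with percentages
source = pd.DataFrame({"pct": [40, 0, 28]})

# A reordered copy: the row order changes, the index labels do not
reordered = source.sort_values(by="pct")

# Assignment aligns on index labels, so each row still gets its own value
reordered["fraction"] = source["pct"] / 100

print(reordered)
```

Alignment is what keeps the values matched after sorting. If the two frames had different labels, for example after `reset_index(drop=True)` on only one of them, the same assignment would silently pair the wrong rows, so computing the new column from the frame's own column is the safer habit.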

In [193]:
# Drop rows with a missing 'Coinsurance Percentage' value
df_final = df_final.dropna(subset=['Coinsurance Percentage'])

In [194]:
# Sort 'Age' column
df_final = df_final.sort_values(by='Age')

# Reset index and drop index columns
df_final = df_final.reset_index(drop=True)

# Display
df_final

Unnamed: 0,AgeGroup,Age,PlanType,Coinsurance Percentage,Coinsurance
0,0-20,20,PPO,40,0.40
1,0-20,20,EPO,0,0.00
2,0-20,20,POS,0,0.00
3,0-20,20,HMO,0,0.00
4,0-20,20,PPO,28,0.28
...,...,...,...,...,...
5995,65+,65,PPO,29,0.29
5996,65+,65,HMO,0,0.00
5997,65+,65,HMO,15,0.15
5998,65+,65,PPO,40,0.40


In [195]:
# display all columns, Non-Null Count and Dtype
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   AgeGroup                6000 non-null   category
 1   Age                     6000 non-null   int64   
 2   PlanType                6000 non-null   object  
 3   Coinsurance Percentage  6000 non-null   int64   
 4   Coinsurance             6000 non-null   float64 
dtypes: category(1), float64(1), int64(2), object(1)
memory usage: 193.7+ KB


In [196]:
# Retrieve all unique values from the Coinsurance column in df_final
df_final['Coinsurance'].unique()

array([0.4 , 0.  , 0.28, 0.13, 0.45, 0.42, 0.33, 0.11, 0.18, 0.09, 0.3 ,
       0.26, 0.08, 0.04, 0.36, 0.16, 0.25, 0.31, 0.07, 0.24, 0.27, 0.15,
       0.14, 0.03, 0.38, 0.66, 0.29, 0.01, 0.06, 0.43, 0.68, 0.12, 0.48,
       0.22, 0.05, 0.2 , 0.21, 0.35, 0.23, 0.02, 0.19, 0.17, 0.62, 0.63,
       0.55, 0.64, 0.1 , 0.7 , 0.32, 0.46, 0.34, 0.77, 0.37, 0.67, 0.39,
       0.44, 0.65, 0.47, 0.72, 0.52, 0.5 , 0.75, 0.6 , 0.61, 0.49, 0.76])

In [197]:
# Create CSV file (index=False avoids an extra 'Unnamed: 0' column when the file is read back)
df_final.to_csv('final_binning.csv', index=False)
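
With the binned data saved, the stated goal of comparing Coinsurance across age groups and plan types could start from a grouped summary. The following is only a sketch under assumptions: it uses a small toy frame with the same column names rather than the real data, and `summary` is a hypothetical name.

```python
import pandas as pd

# Toy frame mirroring the final columns (values are illustrative)
toy = pd.DataFrame({
    "AgeGroup": pd.Categorical(
        ["0-20", "0-20", "21-25", "21-25"],
        categories=["0-20", "21-25"],
        ordered=True,
    ),
    "PlanType": ["PPO", "PPO", "HMO", "PPO"],
    "Coinsurance": [0.40, 0.20, 0.00, 0.28],
})

# Mean coinsurance and row count per (AgeGroup, PlanType) pair;
# observed=True keeps only the combinations that occur in the data
summary = (
    toy.groupby(["AgeGroup", "PlanType"], observed=True)["Coinsurance"]
    .agg(["mean", "count"])
)
print(summary)
```

Reporting the count next to the mean makes it easier to spot combinations with too few rows to compare reliably.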