## Overview
Merge the 6,000-record sample population (already joined with PlanType and MetalLevel from PlanAttributes) with BenefitName and Copayment from BenefitCostSharing. The result is a single dataset for analysis with the columns [StandardComponentId, Age, PlanType, Copayment].

### Objectives
1. Load and inspect the datasets
2. Merge the 6,000-record sample population (with PlanType) with Copayment
3. Select only the columns used for analysis

### 1. Loading and inspecting the datasets
#### Import libraries
1. pandas: creates, manipulates, and analyzes datasets efficiently.
2. numpy: provides support for arrays, matrices, and various mathematical functions.
3. seaborn: provides high-level functions for creating attractive and informative plots.
4. matplotlib: a foundational library for creating static, interactive, and animated visualizations.

In [34]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [35]:
# Load the cleaned BenefitCostSharing dataset produced by 3.BenefitCostSharing_clean_mean-Use.ipynb
df_bm = pd.read_csv("clean_benefitcostsharing_mean_no_duplicates.csv",low_memory=False)

# display all columns, Non-Null Count and Dtype
df_bm.info()

# Check NaN value sum
df_bm.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6549 entries, 0 to 6548
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           6549 non-null   int64  
 1   StandardComponentId  6549 non-null   object 
 2   BenefitName          6549 non-null   object 
 3   CopayInnTier1        6549 non-null   object 
 4   Copayment            6549 non-null   int64  
 5   Description          6549 non-null   object 
 6   Mean_Copayment       6549 non-null   float64
dtypes: float64(1), int64(2), object(4)
memory usage: 358.3+ KB


Unnamed: 0             0
StandardComponentId    0
BenefitName            0
CopayInnTier1          0
Copayment              0
Description            0
Mean_Copayment         0
dtype: int64

In [36]:
# Load the sample dataset already merged with PlanType, produced by 2.rateBinning_merged_PlanType_Metal-Use
df_sm = pd.read_csv("sample_merged_plantype.csv",low_memory=False)

# display all columns, Non-Null Count and Dtype
df_sm.info()

# Check NaN value sum
df_sm.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Unnamed: 0.1         6000 non-null   int64 
 1   Unnamed: 0           6000 non-null   int64 
 2   index_x              6000 non-null   int64 
 3   Age                  6000 non-null   int64 
 4   PlanId               6000 non-null   object
 5   index_y              6000 non-null   int64 
 6   StandardComponentId  6000 non-null   object
 7   PlanType             6000 non-null   object
 8   MetalLevel           6000 non-null   object
dtypes: int64(5), object(4)
memory usage: 422.0+ KB


Unnamed: 0.1           0
Unnamed: 0             0
index_x                0
Age                    0
PlanId                 0
index_y                0
StandardComponentId    0
PlanType               0
MetalLevel             0
dtype: int64

### 2. Merge the 6,000-record sample population (with PlanType) with Copayment
1. Merge df_sm (sample_merged_plantype.csv) with df_bm (clean_benefitcostsharing_mean_no_duplicates.csv) on "StandardComponentId"
2. Check for NaN values
3. Replace NaN values with 0
4. Drop unnecessary columns

In [37]:
# Step 1: Merge
# Left merge on StandardComponentId so that every sample row is kept
merged_df = pd.merge(df_sm, df_bm, on='StandardComponentId', how='left')

# Check the number of rows
print(merged_df.shape)

(6000, 15)
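
The row count staying at 6000 suggests that each `StandardComponentId` appears at most once in the benefit table. A minimal sketch of how that assumption could be checked explicitly, using small invented frames in place of `df_sm` and `df_bm` (the names `sample`, `benefits`, and `checked` are hypothetical):

```python
import pandas as pd

# Toy stand-ins for df_sm and df_bm; all values are invented for illustration.
sample = pd.DataFrame({
    "StandardComponentId": ["A1", "B2", "C3"],
    "Age": [20, 35, 50],
    "PlanType": ["PPO", "HMO", "POS"],
})
benefits = pd.DataFrame({
    "StandardComponentId": ["A1", "B2"],
    "Copayment": [0, 25],
})

# validate="many_to_one" raises MergeError if a key is duplicated on the
# right-hand side, which would otherwise silently add rows.
# indicator=True adds a "_merge" column telling where each row came from.
checked = pd.merge(
    sample,
    benefits,
    on="StandardComponentId",
    how="left",
    validate="many_to_one",
    indicator=True,
)

row_count_preserved = len(checked) == len(sample)
unmatched_ids = checked.loc[
    checked["_merge"] == "left_only", "StandardComponentId"
].tolist()
print(row_count_preserved, unmatched_ids)
```

Rows marked `left_only` are the ones that will show NaN in the benefit columns after the merge.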


In [38]:
# Step2: Check NaN value sum
merged_df.isnull().sum()

Unnamed: 0.1            0
Unnamed: 0_x            0
index_x                 0
Age                     0
PlanId                  0
index_y                 0
StandardComponentId     0
PlanType                0
MetalLevel              0
Unnamed: 0_y           45
BenefitName            45
CopayInnTier1          45
Copayment              45
Description            45
Mean_Copayment         45
dtype: int64

In [39]:
# Step 3: Replace NaN values with 0
# Note: this also writes 0 into the text columns (BenefitName, CopayInnTier1, Description)
df_filled = merged_df.fillna(0)

# Check NaN value sum
df_filled.isnull().sum()

Unnamed: 0.1           0
Unnamed: 0_x           0
index_x                0
Age                    0
PlanId                 0
index_y                0
StandardComponentId    0
PlanType               0
MetalLevel             0
Unnamed: 0_y           0
BenefitName            0
CopayInnTier1          0
Copayment              0
Description            0
Mean_Copayment         0
dtype: int64
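
Filling every NaN with 0 also puts 0 into the text columns, and it makes a plan with no matching benefit record look the same as a plan with a genuine 0 copayment. One possible alternative, sketched on invented rows, is to fill each column with its own value and keep a flag for unmatched rows. The frame `merged`, the column `HasBenefitData`, and the placeholder `"Unknown"` are assumptions for illustration, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for merged_df; the second row had no match in the benefit table.
merged = pd.DataFrame({
    "StandardComponentId": ["A1", "C3"],
    "BenefitName": ["Specialist Visit", np.nan],
    "Copayment": [25.0, np.nan],
})

# Record which rows had benefit data before the gaps are overwritten.
merged["HasBenefitData"] = merged["Copayment"].notna()

# Passing a dict to fillna applies a separate fill value per column.
filled = merged.fillna({"BenefitName": "Unknown", "Copayment": 0})
print(filled)
```

With the flag in place, later analysis can include or exclude the unmatched rows deliberately.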

In [40]:
# display all columns, Non-Null Count and Dtype
df_filled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0.1         6000 non-null   int64  
 1   Unnamed: 0_x         6000 non-null   int64  
 2   index_x              6000 non-null   int64  
 3   Age                  6000 non-null   int64  
 4   PlanId               6000 non-null   object 
 5   index_y              6000 non-null   int64  
 6   StandardComponentId  6000 non-null   object 
 7   PlanType             6000 non-null   object 
 8   MetalLevel           6000 non-null   object 
 9   Unnamed: 0_y         6000 non-null   float64
 10  BenefitName          6000 non-null   object 
 11  CopayInnTier1        6000 non-null   object 
 12  Copayment            6000 non-null   float64
 13  Description          6000 non-null   object 
 14  Mean_Copayment       6000 non-null   float64
dtypes: float64(3), int64(5), object(7)
mem

In [41]:
# display first 5 rows
df_filled.head()

Unnamed: 0,Unnamed: 0.1,Unnamed: 0_x,index_x,Age,PlanId,index_y,StandardComponentId,PlanType,MetalLevel,Unnamed: 0_y,BenefitName,CopayInnTier1,Copayment,Description,Mean_Copayment
0,0,0,2277385,20,40540TX0080003,8849,40540TX0080003,PPO,High,265794.0,Routine Dental Services (Adult),No Charge,0.0,No Charge,0.0
1,1,1,2686872,20,18973IA0210004,12083,18973IA0210004,POS,Catastrophic,370605.0,Specialist Visit,No Charge,0.0,No Charge,0.0
2,2,2,3723250,20,88380VA0720012,18393,88380VA0720012,HMO,Catastrophic,580829.0,Urgent Care Centers or Facilities,No Charge,0.0,No Charge,0.0
3,3,3,3555416,20,58255OH0200001,16701,58255OH0200001,PPO,High,523636.0,Dental Check-Up for Children,No Charge,0.0,No Charge,0.0
4,4,4,3527172,20,52664OH1510013,16425,52664OH1510013,PPO,Silver,522863.0,Home Health Care Services,No Charge,0.0,No Charge,7.094017


In [42]:
# Step4: Drop unnecessary columns
df_merge = df_filled.drop(columns=['Unnamed: 0_x', 'Unnamed: 0.1', 'Unnamed: 0_y', 'index_x', 'index_y'])
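
Listing each leftover column by name works, but the list has to be updated whenever another `Unnamed:` or `index_` column appears. A hedged sketch of selecting them by pattern instead, on an invented frame (the names `frame`, `leftover`, and `trimmed` are hypothetical):

```python
import pandas as pd

# Toy frame with the kind of leftover columns seen in df_filled.
frame = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0_x": [0, 1],
    "index_x": [2277385, 2686872],
    "Age": [20, 20],
    "PlanType": ["PPO", "POS"],
    "index_y": [8849, 12083],
})

# str.match anchors at the start of each column name, so only names that
# begin with "Unnamed: " or "index_" are selected.
leftover = frame.columns[frame.columns.str.match(r"(Unnamed: |index_)")]
trimmed = frame.drop(columns=leftover)
print(trimmed.columns.tolist())
```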

In [43]:
# display first 5 rows
df_merge.head()

Unnamed: 0,Age,PlanId,StandardComponentId,PlanType,MetalLevel,BenefitName,CopayInnTier1,Copayment,Description,Mean_Copayment
0,20,40540TX0080003,40540TX0080003,PPO,High,Routine Dental Services (Adult),No Charge,0.0,No Charge,0.0
1,20,18973IA0210004,18973IA0210004,POS,Catastrophic,Specialist Visit,No Charge,0.0,No Charge,0.0
2,20,88380VA0720012,88380VA0720012,HMO,Catastrophic,Urgent Care Centers or Facilities,No Charge,0.0,No Charge,0.0
3,20,58255OH0200001,58255OH0200001,PPO,High,Dental Check-Up for Children,No Charge,0.0,No Charge,0.0
4,20,52664OH1510013,52664OH1510013,PPO,Silver,Home Health Care Services,No Charge,0.0,No Charge,7.094017


In [44]:
# display all columns, Non-Null Count and Dtype
df_merge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  6000 non-null   int64  
 1   PlanId               6000 non-null   object 
 2   StandardComponentId  6000 non-null   object 
 3   PlanType             6000 non-null   object 
 4   MetalLevel           6000 non-null   object 
 5   BenefitName          6000 non-null   object 
 6   CopayInnTier1        6000 non-null   object 
 7   Copayment            6000 non-null   float64
 8   Description          6000 non-null   object 
 9   Mean_Copayment       6000 non-null   float64
dtypes: float64(2), int64(1), object(7)
memory usage: 468.9+ KB


### 3. Select only the columns used for analysis
1. Select only the 'StandardComponentId', 'Age', 'PlanType', and 'Copayment' columns
2. Create the CSV file

In [45]:
# Select only the StandardComponentId, Age, PlanType, and Copayment columns
df_final_clened = df_merge[['StandardComponentId', 'Age', 'PlanType','Copayment']].reset_index(drop=True)
df_final_clened.head()

Unnamed: 0,StandardComponentId,Age,PlanType,Copayment
0,40540TX0080003,20,PPO,0.0
1,18973IA0210004,20,POS,0.0
2,88380VA0720012,20,HMO,0.0
3,58255OH0200001,20,PPO,0.0
4,52664OH1510013,20,PPO,0.0


In [46]:
# display all columns, Non-Null Count and Dtype
df_final_clened.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   StandardComponentId  6000 non-null   object 
 1   Age                  6000 non-null   int64  
 2   PlanType             6000 non-null   object 
 3   Copayment            6000 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 187.6+ KB


In [47]:
# Write the final merged dataset to a CSV file for analysis
# index=False avoids saving the row index as an extra "Unnamed: 0" column
df_final_clened.to_csv('cleaned.csv', index=False)
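
The `Unnamed: 0` columns in the files loaded earlier are consistent with CSVs that were written together with their row index and then read back. A small sketch of that effect, using an invented two-row frame and in-memory buffers instead of real files (all names here are hypothetical):

```python
import io

import pandas as pd

frame = pd.DataFrame({"Age": [20, 35], "PlanType": ["PPO", "HMO"]})

# Default behaviour: the index is written as a first column with no header,
# and read_csv names that column "Unnamed: 0" when loading it again.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```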