## Overview
This notebook merges the 6,000-record sample population (which already carries PlanType and MetalLevel from PlanAttributes) with BenefitName and Copayment from BenefitCostSharing. The result is a single dataset for easy analysis of [AgeGroup, PlanType, Copayment].

### Objectives
1. Load and inspect the datasets
2. Merge the 6,000-record sample population (with PlanType) with Copayment
3. Select only the columns used for analysis

### 1. Load and inspect the datasets
#### Import libraries
1. pandas: creates, manipulates, and analyzes datasets efficiently.
2. numpy: provides support for arrays, matrices, and various mathematical functions.
3. seaborn: provides high-level functions for creating attractive and informative plots.
4. matplotlib: a foundational library for creating static, interactive, and animated visualizations.

In [28]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [29]:
# Load the cleaned BenefitCostSharing dataset produced by 3.BenefitCostSharing_clean_mean-Use.ipynb
df_bm = pd.read_csv("clean_benefitcostsharing_mean_no_duplicates.csv", low_memory=False)

# Display all columns with their non-null counts and dtypes
df_bm.info()

# Count NaN values per column
df_bm.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6632 entries, 0 to 6631
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Unnamed: 0           6632 non-null   int64 
 1   StandardComponentId  6632 non-null   object
 2   BenefitName          6632 non-null   object
 3   CopayInnTier1        6632 non-null   object
 4   Copay                6632 non-null   int64 
 5   Description          6632 non-null   object
 6   Copayment            6632 non-null   int64 
dtypes: int64(3), object(4)
memory usage: 362.8+ KB


Unnamed: 0             0
StandardComponentId    0
BenefitName            0
CopayInnTier1          0
Copay                  0
Description            0
Copayment              0
dtype: int64

In [30]:
# Load the sample dataset already merged with PlanType in 2.rateBinning_merged_PlanType_Metal-Use
df_sm = pd.read_csv("sample_merged_plantype.csv", low_memory=False)

# Display all columns with their non-null counts and dtypes
df_sm.info()

# Count NaN values per column
df_sm.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Unnamed: 0.1         6000 non-null   int64 
 1   Unnamed: 0           6000 non-null   int64 
 2   Age                  6000 non-null   int64 
 3   PlanId               6000 non-null   object
 4   index                6000 non-null   int64 
 5   StandardComponentId  6000 non-null   object
 6   PlanType             6000 non-null   object
 7   MetalLevel           6000 non-null   object
dtypes: int64(4), object(4)
memory usage: 375.1+ KB


Unnamed: 0.1           0
Unnamed: 0             0
Age                    0
PlanId                 0
index                  0
StandardComponentId    0
PlanType               0
MetalLevel             0
dtype: int64
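
Both frames share the `StandardComponentId` column, which is the merge key in the next section. Before merging, it can help to confirm that the key is unique on the benefit side and that every sample row has a match. The following is a minimal sketch on toy data; the frames `sample` and `benefits` and their values are hypothetical stand-ins for `df_sm` and `df_bm`, not the notebook's files.

```python
import pandas as pd

# Hypothetical stand-ins for df_sm (sample) and df_bm (benefit cost sharing)
sample = pd.DataFrame({
    "StandardComponentId": ["A1", "B2", "A1", "C3"],
    "Age": [20, 35, 50, 64],
})
benefits = pd.DataFrame({
    "StandardComponentId": ["A1", "B2", "C3", "D4"],
    "Copayment": [0, 2, 11, 5],
})

key = "StandardComponentId"

# Duplicate keys on the right-hand side would multiply rows in a left merge
duplicate_keys = int(benefits[key].duplicated().sum())

# Sample rows without a matching key would get NaN in the benefit columns
unmatched_rows = int((~sample[key].isin(benefits[key])).sum())

print(f"duplicate keys in benefits: {duplicate_keys}")
print(f"sample rows without a match: {unmatched_rows}")
```

If both counts are zero, a left merge on this key should keep the sample's row count and introduce no NaN values.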

### 2. Merge the 6,000-record sample population (with PlanType) with Copayment
1. Merge df_sm (sample_merged_plantype.csv) with df_bm (clean_benefitcostsharing_mean_no_duplicates.csv) on "StandardComponentId"
2. Check for NaN values
3. Replace NaN values with 0, if any are found
4. Drop unnecessary columns

In [31]:
# Step 1: Left-merge the sample with the benefit data on StandardComponentId
merged_df = pd.merge(df_sm, df_bm, on='StandardComponentId', how='left')

# Check the number of rows
print(merged_df.shape)

(6000, 14)
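
The shape `(6000, 14)` shows that the left merge kept the sample's 6000 rows. That check can also be made explicit with the `validate` and `indicator` options of `pd.merge`. The sketch below uses toy data; the frames `left` and `right` and their values are hypothetical, and one key (`Z9`) is deliberately left unmatched to show what the indicator reports.

```python
import pandas as pd

# Hypothetical stand-ins for the sample (left) and benefit data (right)
left = pd.DataFrame({
    "StandardComponentId": ["A1", "B2", "A1", "Z9"],
    "Age": [20, 35, 50, 64],
})
right = pd.DataFrame({
    "StandardComponentId": ["A1", "B2", "C3"],
    "Copayment": [0, 2, 11],
})

# validate="many_to_one" raises MergeError if the key is duplicated in `right`;
# indicator=True adds a "_merge" column recording where each row came from
merged = pd.merge(
    left,
    right,
    on="StandardComponentId",
    how="left",
    validate="many_to_one",
    indicator=True,
)

row_count_preserved = len(merged) == len(left)
matched = int((merged["_merge"] == "both").sum())
unmatched = int((merged["_merge"] == "left_only").sum())

print(row_count_preserved, matched, unmatched)
```

Rows marked `left_only` are the ones that would show up as NaN in the benefit columns.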


In [32]:
# Step 2: Count NaN values per column
merged_df.isnull().sum()

Unnamed: 0.1           0
Unnamed: 0_x           0
Age                    0
PlanId                 0
index                  0
StandardComponentId    0
PlanType               0
MetalLevel             0
Unnamed: 0_y           0
BenefitName            0
CopayInnTier1          0
Copay                  0
Description            0
Copayment              0
dtype: int64
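
No column contains NaN here, so step 3 has nothing to fill. For a merge that does leave gaps, the following is a minimal sketch of that step, assuming it means replacing NaN in `Copayment` with 0. The toy frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical merge result in which the last row found no match
merged = pd.DataFrame({
    "StandardComponentId": ["A1", "B2", "Z9"],
    "BenefitName": ["Urgent Care Centers or Facilities", "Dental Check-Up for Children", np.nan],
    "Copayment": [11.0, 0.0, np.nan],
})

# Fill only the numeric column; text columns stay NaN so that
# unmatched rows remain identifiable
filled = merged.fillna({"Copayment": 0})

# NaN forces a float dtype, so restore integers once the gaps are filled
filled["Copayment"] = filled["Copayment"].astype("int64")

print(filled)
```

Since 0 also appears as a regular `Copayment` value in this dataset, filling with 0 may blur the difference between a missing value and a real zero. Whether that is acceptable depends on the analysis.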

In [33]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Unnamed: 0.1         6000 non-null   int64 
 1   Unnamed: 0_x         6000 non-null   int64 
 2   Age                  6000 non-null   int64 
 3   PlanId               6000 non-null   object
 4   index                6000 non-null   int64 
 5   StandardComponentId  6000 non-null   object
 6   PlanType             6000 non-null   object
 7   MetalLevel           6000 non-null   object
 8   Unnamed: 0_y         6000 non-null   int64 
 9   BenefitName          6000 non-null   object
 10  CopayInnTier1        6000 non-null   object
 11  Copay                6000 non-null   int64 
 12  Description          6000 non-null   object
 13  Copayment            6000 non-null   int64 
dtypes: int64(7), object(7)
memory usage: 656.4+ KB


In [34]:
# display first 5 rows
merged_df.head()

,Unnamed: 0.1,Unnamed: 0_x,Age,PlanId,index,StandardComponentId,PlanType,MetalLevel,Unnamed: 0_y,BenefitName,CopayInnTier1,Copay,Description,Copayment
0,0,0,20,40540TX0080003,8849,40540TX0080003,PPO,High,417137,Routine Dental Services (Adult),No Charge,0,No Charge,0
1,1,1,20,18973IA0210004,12083,18973IA0210004,POS,Catastrophic,585322,Primary Care Visit to Treat an Injury or Illness,$20,20,$,2
2,2,2,20,88380VA0720012,18393,88380VA0720012,HMO,Catastrophic,885182,Urgent Care Centers or Facilities,No Charge,0,No Charge,2
3,3,3,20,58255OH0200001,16701,58255OH0200001,PPO,High,804014,Dental Check-Up for Children,No Charge,0,No Charge,0
4,4,4,20,52664OH1510013,16425,52664OH1510013,PPO,Silver,802884,Urgent Care Centers or Facilities,$50,50,$,11


In [35]:
# Step 3 (replace NaN with 0) is not needed: the check above found no NaN values
# Step 4: Drop unnecessary columns
df_merge = merged_df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0_x', 'index', 'Unnamed: 0_y'])
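
Listing each column by name works, but the list has to be updated whenever another `Unnamed:` column appears. A possible alternative is to select the leftover columns by pattern. This is a sketch on a hypothetical toy frame, assuming every column starting with `Unnamed:` plus the `index` column is unwanted.

```python
import pandas as pd

# Hypothetical frame with the same kind of leftover columns as merged_df
merged = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0_x": [0, 1],
    "Age": [20, 20],
    "index": [8849, 12083],
    "PlanType": ["PPO", "POS"],
    "Unnamed: 0_y": [417137, 585322],
    "Copayment": [0, 2],
})

# Collect every "Unnamed:" column plus the stray "index" column
leftover = [c for c in merged.columns if c.startswith("Unnamed:") or c == "index"]

trimmed = merged.drop(columns=leftover)
print(trimmed.columns.tolist())
```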

In [36]:
# display first 5 rows
df_merge.head()

,Age,PlanId,StandardComponentId,PlanType,MetalLevel,BenefitName,CopayInnTier1,Copay,Description,Copayment
0,20,40540TX0080003,40540TX0080003,PPO,High,Routine Dental Services (Adult),No Charge,0,No Charge,0
1,20,18973IA0210004,18973IA0210004,POS,Catastrophic,Primary Care Visit to Treat an Injury or Illness,$20,20,$,2
2,20,88380VA0720012,88380VA0720012,HMO,Catastrophic,Urgent Care Centers or Facilities,No Charge,0,No Charge,2
3,20,58255OH0200001,58255OH0200001,PPO,High,Dental Check-Up for Children,No Charge,0,No Charge,0
4,20,52664OH1510013,52664OH1510013,PPO,Silver,Urgent Care Centers or Facilities,$50,50,$,11


In [37]:
# display all columns, Non-Null Count and Dtype
df_merge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Age                  6000 non-null   int64 
 1   PlanId               6000 non-null   object
 2   StandardComponentId  6000 non-null   object
 3   PlanType             6000 non-null   object
 4   MetalLevel           6000 non-null   object
 5   BenefitName          6000 non-null   object
 6   CopayInnTier1        6000 non-null   object
 7   Copay                6000 non-null   int64 
 8   Description          6000 non-null   object
 9   Copayment            6000 non-null   int64 
dtypes: int64(3), object(7)
memory usage: 468.9+ KB


### 3. Select only the columns used for analysis
1. Select only the 'StandardComponentId', 'Age', 'PlanType', and 'Copayment' columns
2. Create the CSV file

In [38]:
# Select only the StandardComponentId, Age, PlanType, and Copayment columns
df_final_clened = df_merge[['StandardComponentId', 'Age', 'PlanType', 'Copayment']].reset_index(drop=True)
df_final_clened.head()

,StandardComponentId,Age,PlanType,Copayment
0,40540TX0080003,20,PPO,0
1,18973IA0210004,20,POS,2
2,88380VA0720012,20,HMO,2
3,58255OH0200001,20,PPO,0
4,52664OH1510013,20,PPO,11


In [39]:
# display all columns, Non-Null Count and Dtype
df_final_clened.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   StandardComponentId  6000 non-null   object
 1   Age                  6000 non-null   int64 
 2   PlanType             6000 non-null   object
 3   Copayment            6000 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 187.6+ KB
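
The overview refers to AgeGroup, while the selected columns contain only the raw `Age`. One way such a column could be derived is with `pd.cut`. This is only a sketch: the bin edges and labels below are hypothetical and are not taken from this notebook or the notebooks it builds on.

```python
import pandas as pd

# Hypothetical rows shaped like the final frame
df = pd.DataFrame({
    "Age": [20, 29, 30, 45, 64],
    "PlanType": ["PPO", "POS", "HMO", "PPO", "PPO"],
    "Copayment": [0, 2, 2, 0, 11],
})

# Hypothetical bin edges; with the default right=True the intervals are
# (0, 29], (29, 44], (44, 64]
bins = [0, 29, 44, 64]
labels = ["<=29", "30-44", "45-64"]

df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)
print(df[["Age", "AgeGroup"]])
```

Ages outside the outermost edges would become NaN, so the edges should cover the full age range of the data.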


In [40]:
# Write the final merged data used for analysis to a CSV file
df_final_clened.to_csv('cleaned.csv')
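
By default, `to_csv` also writes the row index, and reading such a file back with `pd.read_csv` yields an extra `Unnamed: 0` column like the ones seen in the input files of this notebook. The sketch below compares the default with `index=False` on a hypothetical toy frame, using in-memory buffers instead of files.

```python
import io

import pandas as pd

# Hypothetical rows shaped like the final frame
df = pd.DataFrame({
    "Age": [20, 20],
    "PlanType": ["PPO", "POS"],
    "Copayment": [0, 2],
})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_without_index.columns.tolist())
```

Writing with `index=False` may save a column-dropping step in any notebook that loads `cleaned.csv` later.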