This notebook cleans the Peoplesoft data that OCTO gave us and merges it with the baseline DCHR data. Note that the number of employees in the OCTO data differs from the number of employees in the DCHR data; this discrepancy is explained in the report.

In [1]:
import sys
sys.path.append('..')
import os.path

import copy
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
%matplotlib inline


DATA_DIR = os.path.join('..', 'data')

In [2]:
## Read in files: These are Peoplesoft data coming from OCTO
d107 = pd.read_csv(os.path.join(DATA_DIR,'457_enrollment_0107.csv'), dtype = {'EMPLID': str})
d122 = pd.read_csv(os.path.join(DATA_DIR,'457_enrollment_0122.csv'), dtype = {'EMPLID': str})
d218 = pd.read_csv(os.path.join(DATA_DIR,'457_enrollment_0218.csv'), dtype = {'EMPLID': str})
d304 = pd.read_csv(os.path.join(DATA_DIR,'457_enrollment_0304.csv'), dtype = {'EMPLID': str})

#Pull in email outcomes as generated from the notebook: 1-data-clean-and-merge.ipynb
#This is our baseline data from DCHR
outcomes = pd.read_csv(os.path.join(DATA_DIR,'outcomes_id_emails.csv'),
                       dtype = {'EmplID': str}) ## Email outcomes



In [3]:
# Generate variables for whether an employee has ever clicked on a link 

# We consider "success" to be whether an employee has clicked on either of these ess links
ess1 = 'https://ess.dc.gov/psp/essprod/ESS/HRMS/c/W3EB_MENU.W3EB_SELECT_EVNT.GBL?' +\
       'Page=W3EB_SELECT_EVNT&Action=U&XFER_SOURCE_TILE=Benefits'
ess2 = 'https://ess.dc.gov/psc/essprod/ESS/HRMS/c/EF_BENEFITS_FL.W3EB_GRID_FLU.GBL'

# Initialize the 'Click_ess' column, setting everyone to False
outcomes['Click_ess'] = False

# Going through all of the links as of March (supposed to be cumulative)
# There are 5 distinct links in the March set of emails (that's why arange(1,6))
for i in np.arange(1,6):
    #Click_ess = True if employee has ever clicked on either the ess1 or ess2 links
    #Because if df['Click_ess'] was ever true, true trumps false
    outcomes['Click_ess'] = outcomes['Click_ess'] | \
    (outcomes['Click0301_'+str(i)] == ess1) | \
    (outcomes['Click0301_'+str(i)] == ess2)


print('By treatment arm:')
print(pd.crosstab(outcomes.treatment_real, outcomes.Click_ess))

# Open_ess = True if the employee opened any of the three emails
outcomes['Open_ess'] = (outcomes['Opens0123'] > 0) | (outcomes['Opens0212'] > 0) | (outcomes['Opens0301'] > 0)
pd.crosstab(outcomes.treatment_real, outcomes.Open_ess, margins=True)

By treatment arm:
Click_ess       False  True 
treatment_real              
0.0             11256      0
1.0             10900    315
2.0             10949    266


Open_ess        False  True    All
treatment_real
0.0             11256     0  11256
1.0              8496  2719  11215
2.0              8481  2734  11215
All             28233  5453  33686
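
The click flag built in the loop above can also be computed without iterating over the columns one at a time. A minimal sketch on toy data, assuming the same `Click0301_<i>` column naming; the shortened URLs and the `toy` frame are invented for illustration and are not part of the real data:

```python
import pandas as pd

# Hypothetical stand-ins for the two ESS links (shortened for the example)
ess_links = ['https://example.org/ess1', 'https://example.org/ess2']

# Toy frame that mimics the click columns of the outcomes data
toy = pd.DataFrame({
    'Click0301_1': ['https://example.org/ess1', None, 'https://example.org/other'],
    'Click0301_2': [None, None, 'https://example.org/ess2'],
})

click_cols = [c for c in toy.columns if c.startswith('Click0301_')]

# True if any of the click columns holds one of the ESS links
toy['Click_ess'] = toy[click_cols].isin(ess_links).any(axis=1)
print(toy['Click_ess'].tolist())
```

Selecting the columns by prefix avoids hard-coding the number of links, at the cost of relying on the naming convention.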


# Data Cleaning

In [4]:
## Clean the data by upper-casing emails and dropping exact duplicate rows

#list of the dataframes we're using 
dfs = [d107, d122, d218, d304]

#give our dataframes names for the purposes of providing useful info while looping
d107.name = 'Jan 07th' 
d122.name = 'Jan 22nd'
d218.name = 'Feb 18th'
d304.name = 'Mar 04th'

#For every dataframe, check how many rows and columns we have. 
#Change emails to a consistent casing, then drop duplicates.
#See how many rows and columns we have now.
for df in dfs:
    print(df.name, ': rows, columns before dropping:', df.shape)
    df['EMAILID'] = df['EMAILID'].str.upper()
    print('Number of duplicates:', len(df[df.duplicated(keep = 'first')]))
    df.drop_duplicates(keep = 'first', inplace = True)
    print(df.name, ': rows, columns after dropping:', df.shape)
    print('\n')
    
print('Outcomes Dataset:', outcomes.shape)

Jan 07th : rows, columns before dropping: (34969, 7)
Number of duplicates: 142
Jan 07th : rows, columns after dropping: (34827, 7)


Jan 22nd : rows, columns before dropping: (35159, 7)
Number of duplicates: 145
Jan 22nd : rows, columns after dropping: (35014, 7)


Feb 18th : rows, columns before dropping: (35257, 7)
Number of duplicates: 144
Feb 18th : rows, columns after dropping: (35113, 7)


Mar 04th : rows, columns before dropping: (35306, 7)
Number of duplicates: 145
Mar 04th : rows, columns after dropping: (35161, 7)


Outcomes Dataset: (33686, 58)
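
The order of operations in the cleaning loop matters: rows that differ only in email casing are not seen as duplicates until the emails are normalised. A small sketch on invented rows (the IDs and addresses below are made up) illustrates this:

```python
import pandas as pd

# Invented rows: employee 001 appears twice, differing only in email casing
toy = pd.DataFrame({
    'EMPLID': ['001', '001', '002'],
    'EMAILID': ['a.b@example.org', 'A.B@EXAMPLE.ORG', 'c.d@example.org'],
})

# Before normalising, no row is an exact duplicate of another
n_before = int(toy.duplicated(keep='first').sum())

# After upper-casing, the second row for employee 001 becomes a duplicate
toy['EMAILID'] = toy['EMAILID'].str.upper()
n_after = int(toy.duplicated(keep='first').sum())

cleaned = toy.drop_duplicates(keep='first')
print(n_before, n_after, cleaned.shape)
```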


### Reminder: DCHR Data - Enrollment by Treatment Arm

In [5]:
pd.crosstab(outcomes.treatment_real, outcomes.Enroll, margins=True)

Enroll          False   True    All
treatment_real
0.0              7031   4225  11256
1.0              7021   4194  11215
2.0              7021   4194  11215
All             21073  12613  33686


## Merge Datasets

Peoplesoft data is not consistent in the emails associated with employees: some email addresses are missing or differ from those in the baseline data. However, we decided that an email being absent from Peoplesoft does not mean that the employee did not receive the email. As a result, we fall back to merging on employee ID alone (rather than requiring both EmplID and email ID to match) so that more of our population is included.

Due to other inconsistencies in the data, the process of merging the baseline data to the Peoplesoft data is a bit complicated:
- We first merge on both EmplID and email ID, to avoid duplicates from employees who appear multiple times with either a missing email or a different email address associated with the EmplID.
- For those who do not match on both EmplID and email ID, we merge on EmplID alone.
- EmplID anon_id_1 is consistently both enrolled and not enrolled in 457b at the same time, so this employee is dropped from our dataset.
- EmplID anon_id_2 is also consistently both enrolled and not enrolled at the same time. However, their monetary elections are all 0 or 0%, so we consider them to be not enrolled and drop their "enrolled" row.
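
The first two steps above can be sketched on toy data as follows. This is only an illustration of the two-stage idea under simplified assumptions (a single enrollment column, invented IDs and emails); it is not the notebook's `mergeOutcomes` function, which also handles the special-case IDs and duplicate collapsing.

```python
import pandas as pd

# Toy baseline (left) and Peoplesoft-style (right) frames; all values are invented
left = pd.DataFrame({
    'EMPLID': ['001', '002', '003'],
    'EMAILID': ['A@X.ORG', 'B@X.ORG', 'C@X.ORG'],
})
right = pd.DataFrame({
    'EMPLID': ['001', '002'],
    'EMAILID': ['A@X.ORG', None],  # employee 002 has no email on file
    'ENROLLMENT': ['Elected', 'Not Elected'],
})

# Stage 1: strict match on employee ID and email
stage1 = left.merge(right, on=['EMPLID', 'EMAILID'], how='left', indicator=True)
matched = stage1[stage1['_merge'] == 'both'].drop(columns='_merge')
unmatched_ids = stage1.loc[stage1['_merge'] == 'left_only', 'EMPLID']

# Stage 2: retry the unmatched employees on employee ID alone
stage2 = left[left['EMPLID'].isin(unmatched_ids)].merge(
    right.drop(columns='EMAILID'), on='EMPLID', how='inner')

combined = pd.concat([matched, stage2], ignore_index=True)
print(combined)
```

Employee 002 is recovered in the second stage despite the missing email, while employee 003, who is absent from the right-hand frame, drops out of the merged data.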

In [6]:
def mergeOutcomes(df1, df2, left_id, right_id, left_email_id, right_email_id, #for merge
                  left_suffix, right_suffix, #to indicate any merge suffixes
                  special_col, special_str): #these are the special case ones
    '''
    Left-join df1 to df2, first on both employee ID and email, then on
    employee ID alone for the rows that did not match.

    Inputs:
    df1: left dataframe to merge
    df2: right dataframe to merge
    left_id: name of the employee ID column (primary key) in the left df
    right_id: name of the employee ID column (primary key) in the right df
    left_email_id: name of the email column (second key) in the left df
    right_email_id: name of the email column (second key) in the right df
    left_suffix: suffix applied to overlapping left df columns on merge
    right_suffix: suffix applied to overlapping right df columns on merge

    Special-case inputs (used to drop the contradictory "enrolled" row for anon_id_2):
    special_col: name of the enrollment column to check
    special_str: value in special_col that marks the employee as enrolled

    Returns the merged dataframe.
    '''
    
    #--------------------------------- Merge on EMPLID and EMAIL ID ----------------------------#
    
    df = pd.merge(df1, df2, 
                  left_on = [left_id, left_email_id],
                  right_on = [right_id, right_email_id],
                  suffixes = (left_suffix, right_suffix),
                  how = 'left', 
                  indicator = True)
    
    print("Breakout of merged employees:")
    print("'both' refers to employees showing up in both DCHR and OCTO data")
    print("'left_only' refers to employees who were only present in DCHR data")
    print('')
    print(df._merge.value_counts())
    
    
    #IDs to drop:
    #Get rid of anon_id_1
    ##### NOTE: 'anon_id_1' and 'anon_id_2' stand in for real emplids, which are PII and so are not included here
    
    df = df[df[left_id] != 'anon_id_1']
    
    # Specific to this dataset: anon_id_2 is consistently both enrolled and not enrolled at the same time.
    # However, monetary elections are all 0 or 0%, so we consider this employee to be not enrolled, and
    # drop the "enrolled" status.
    df = df[~((df[left_id] == 'anon_id_2') & (df[special_col]==special_str))]

    #Check for duplicates again
    print('')
    print('Length of duplicates', len(df[df.duplicated(subset=left_id, keep = False)]))
    
    
    #--------------------------------- Merge on EMPLID only ----------------------------#
    #For those who were not mergeable: 
    # Create a dataframe who only showed up in the left dataframe, 
    # indicator = "left_only", to join back to df on Empl ID only.
    left_only = df[df._merge == 'left_only'].copy()
    
    #drop from our original dataset
    df = df[df._merge != 'left_only']

    df.drop(labels = '_merge', axis = 1, inplace = True)
    
    #Get the ids from the left-only folks and use those to merge on EMPLID only
    df_remainder = pd.merge(df1[df1[left_id].isin(list(left_only[left_id]))],
                                df2, 
                                left_on = left_id,
                                right_on = right_id,
                                indicator=True)
    
    #Take the max values to get rid of the duplicates that resulted from having blank values
    df_remainder = df_remainder.groupby(left_id).max().reset_index()
    
    #reorder based on the columns in the original set
    df_remainder = df_remainder.reindex(df.columns, axis = 1)
    
    #add data back to original
    df = pd.concat([df, df_remainder])

    #Check for duplicates
    len(df[df.duplicated(subset=left_id, keep = False)])
    
    return df

### Merge January 07 Peoplesoft Data

In [7]:
df_Jan107_merge = mergeOutcomes(df1 = outcomes, df2 = d107, 
                                left_id ='EmplID', right_id = 'EMPLID', 
                                left_email_id = 'email_upper', right_email_id = 'EMAILID', 
                                left_suffix ='',  right_suffix ='', 
                                special_col = 'JAN07THENROLLMENT',
                                special_str = 'ElectedJan07th')

print()
print('January 07 enrollments broken out by treatment group:')

pd.crosstab(df_Jan107_merge.treatment_real, 
            df_Jan107_merge.JAN07THENROLLMENT,         
            dropna=False, 
            margins=True, 
            margins_name='Total')

Breakout of merged employees:
'both' refers to employees showing up in both DCHR and OCTO data
'left_only' refers to employees who were only present in DCHR data

both          26254
left_only      7433
right_only        0
Name: _merge, dtype: int64

Length of duplicates 0

January 07 enrollments broken out by treatment group:


JAN07THENROLLMENT  ElectedJan07th  Not Elected  Total
treatment_real
0.0                          4670         5965  10635
1.0                          4632         5938  10570
2.0                          4667         5930  10597
Total                       13969        17833  31802


### Merge Jan 22nd Data (TRUE BASELINE). This will be used as the baseline for analysis.

In [8]:
df_Jan122_merge = mergeOutcomes(df1=df_Jan107_merge, df2=d122, 
                                left_id='EMPLID', right_id= 'EMPLID',
                                left_email_id='EMAILID', right_email_id='EMAILID',
                                left_suffix = '_0107', right_suffix='_0122',
                                special_col = 'JAN22NDENROLLMENT', special_str= 'ElectedJan22nd')

pd.crosstab(df_Jan122_merge.treatment_real, 
            df_Jan122_merge.JAN22NDENROLLMENT,         
            dropna=False, 
            margins=True, 
            margins_name='Total')

Breakout of merged employees:
'both' refers to employees showing up in both DCHR and OCTO data
'left_only' refers to employees who were only present in DCHR data

both          31679
left_only       124
right_only        0
Name: _merge, dtype: int64

Length of duplicates 0


JAN22NDENROLLMENT  ElectedJan22nd  Not Elected  Total
treatment_real
0.0                          4656         5943  10599
1.0                          4618         5903  10521
2.0                          4657         5901  10558
Total                       13931        17747  31678


### Merge in February Data    

In [9]:
# For later merges, pandas will not add suffixes during the merge, because the previous merge already renamed
# the overlapping columns. So now we add the suffix ourselves before the merge.
cols_for_suffix = ['FLATAMOUNTBEFORETAX','FLATAMOUNTAFTERTAX', 'PERCENTAGEBEFORETAX', 'PERCENTAGEAFTERTAX']

# Bind the original columns and the columns with suffixes added together
d218 = pd.concat([d218[['EMAILID', 'EMPLID', 'FEB18THENROLLMENT']], 
                  d218[cols_for_suffix].add_suffix('_0218')], 
                  axis = 1)

df_Feb_merge = mergeOutcomes(df1=df_Jan122_merge, 
                             df2=d218, 
                             left_id='EMPLID', right_id= 'EMPLID',
                             left_email_id='EMAILID', right_email_id='EMAILID',
                             left_suffix = '', right_suffix='',
                             special_col = 'FEB18THENROLLMENT', special_str= 'ElectedFeb18th')

#Breakout of elections by treatment groups
pd.crosstab(df_Feb_merge.treatment_real, 
            df_Feb_merge.FEB18THENROLLMENT,         
            dropna=False, 
            margins=True, 
            margins_name='Total')

Breakout of merged employees:
'both' refers to employees showing up in both DCHR and OCTO data
'left_only' refers to employees who were only present in DCHR data

both          31463
left_only       216
right_only        0
Name: _merge, dtype: int64

Length of duplicates 0


FEB18THENROLLMENT  ElectedFeb18th  Not Elected  Total
treatment_real
0.0                          4627         5893  10520
1.0                          4600         5861  10461
2.0                          4627         5854  10481
Total                       13854        17608  31462


### Merge in March Data

In [10]:
# Bind the original columns and the columns with suffixes added together
d304 = pd.concat([d304[['EMAILID', 'EMPLID', 'MAR04THENROLLMENT']], 
                  d304[cols_for_suffix].add_suffix('_0304')], 
                  axis = 1)

# March will be our exploratory data, so we are analyzing outcomes using Jan 22 as the baseline. 
df_March_merge = mergeOutcomes(df1=df_Jan122_merge, 
                               df2=d304, 
                               left_id='EMPLID', right_id= 'EMPLID',
                               left_email_id='EMAILID', right_email_id='EMAILID',
                               left_suffix = '', right_suffix='',
                               special_col = 'MAR04THENROLLMENT', special_str= 'ElectedMar04th')

#Breakout of elections by treatment groups
pd.crosstab(df_March_merge.treatment_real, 
            df_March_merge.MAR04THENROLLMENT,         
            dropna=False, 
            margins=True, 
            margins_name='Total')

Breakout of merged employees:
'both' refers to employees showing up in both DCHR and OCTO data
'left_only' refers to employees who were only present in DCHR data

both          31361
left_only       318
right_only        0
Name: _merge, dtype: int64

Length of duplicates 0


MAR04THENROLLMENT  ElectedMar04th  Not Elected  Total
treatment_real
0.0                          4610         5876  10486
1.0                          4584         5850  10434
2.0                          4611         5829  10440
Total                       13805        17555  31360


## DCHR vs OCTO Data Inconsistency

#### There is an inconsistency between the employees recorded as enrolled at the baseline (DCHR) and the actual outcomes in the Jan 7th Peoplesoft data: roughly 1.5k employees are not enrolled at the baseline but are enrolled as of Jan 7th.

In [11]:
data_check = df_Jan107_merge.copy()

#Which DCHR-data employees did not make it in?
#Get data for the ones that did not make it in the merge 
other_ids = outcomes[~outcomes.EmplID.isin(list(data_check.EMPLID))]
print('Number of customers who are not in OCTO-Peoplesoft data:', len(other_ids))

#Line up the columns with the Jan 07 merge columns, and append them to the merged data
other_ids = other_ids.reindex(labels = list(data_check.columns), axis = 1)
data_check = pd.concat([data_check, other_ids])

##
#Clean it up
data_check['JAN07THENROLLMENT'] = data_check['JAN07THENROLLMENT'].fillna('Missing')
data_check['Enrollment'] = data_check['Enrollment'].replace({'457BEN':'ElectedJan07th'})
data_check['Enrollment'] = data_check.Enrollment.fillna('Not Elected')
data_check['Enrollment'].value_counts()

Number of customers who are not in OCTO-Peoplesoft data: 1884


Not Elected       21073
ElectedJan07th    12613
Name: Enrollment, dtype: int64

In [12]:
pd.crosstab(data_check.Enrollment, 
            data_check.JAN07THENROLLMENT)

JAN07THENROLLMENT  ElectedJan07th  Missing  Not Elected
Enrollment
ElectedJan07th              12525       87            1
Not Elected                  1444     1797        17832
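
To put the disagreement in the crosstab above in proportion, the off-diagonal cells can be compared with the number of employees present in both sources. A short worked sketch, with the counts copied by hand from the crosstab (the `table` frame is rebuilt here only for illustration):

```python
import pandas as pd

# Counts copied from the crosstab above
# (rows: DCHR baseline status, columns: Peoplesoft Jan 07 status)
table = pd.DataFrame(
    {'ElectedJan07th': [12525, 1444],
     'Missing': [87, 1797],
     'Not Elected': [1, 17832]},
    index=['ElectedJan07th', 'Not Elected'])

# Restrict to employees who appear in both sources
in_both = table.drop(columns='Missing')
total = int(in_both.to_numpy().sum())

# Off-diagonal cells: the two sources disagree on enrollment status
disagree = int(in_both.loc['Not Elected', 'ElectedJan07th']
               + in_both.loc['ElectedJan07th', 'Not Elected'])

print(total, disagree, round(disagree / total, 4))
```

Under this reading, about 4.5% of the employees found in both sources have conflicting enrollment status, almost all of them in the "not enrolled at baseline, enrolled in Peoplesoft" cell.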


In [13]:
for i in range(0,3):
    print('Treatment Group:', i)
    print(pd.crosstab(data_check[data_check.treatment_real==i].Enrollment, 
                      data_check[data_check.treatment_real==i].JAN07THENROLLMENT))
    print('-------------------------------------------------------')

Treatment Group: 0
JAN07THENROLLMENT  ElectedJan07th  Missing  Not Elected
Enrollment                                             
ElectedJan07th               4194       31            0
Not Elected                   476      590         5965
-------------------------------------------------------
Treatment Group: 1
JAN07THENROLLMENT  ElectedJan07th  Missing  Not Elected
Enrollment                                             
ElectedJan07th               4165       28            1
Not Elected                   467      617         5937
-------------------------------------------------------
Treatment Group: 2
JAN07THENROLLMENT  ElectedJan07th  Missing  Not Elected
Enrollment                                             
ElectedJan07th               4166       28            0
Not Elected                   501      590         5930
-------------------------------------------------------


## Push files for analysis. February = confirmatory, March = exploratory

In [14]:
df_Feb_merge.to_csv(os.path.join(DATA_DIR, 'confirmatory_outcomes.csv'), index = False)
df_March_merge.to_csv(os.path.join(DATA_DIR, 'exploratory_outcomes.csv'), index = False)

# Jan 7 Baseline Info for Report

In [15]:
df_Jan107_merge.JAN07THENROLLMENT.value_counts(normalize = True)

Not Elected       0.560751
ElectedJan07th    0.439249
Name: JAN07THENROLLMENT, dtype: float64

In [16]:
d122.JAN22NDENROLLMENT.value_counts(normalize=True)

Not Elected       0.586651
ElectedJan22nd    0.413349
Name: JAN22NDENROLLMENT, dtype: float64

In [17]:
df_Jan107_merge['Annual Rt'].describe()

count     31802.000000
mean      76200.745368
std       29811.800857
min       15171.000000
25%       55462.000000
50%       73295.000000
75%       93279.000000
max      312965.501000
Name: Annual Rt, dtype: float64

In [18]:
#Calculate the maximum allowable contributions
def calculateMaxContribution(df: pd.DataFrame):
    df['max_contribution'] = np.where(df['Age'] < 50, 18500, 24500)

## Flags for people whose contributions are at or above a given threshold
def flagContributions(df: pd.DataFrame,
                     amount_threshold: int,
                     percentage_threshold: int,
                     baseline_date: str):

    ## baselines
    df["assume_amounterror_" + baseline_date] = \
        np.where(df['FLATAMOUNTBEFORETAX'] >= amount_threshold, 1, 0)
    df["assume_percerror_" + baseline_date] = \
        np.where(df['PERCENTAGEBEFORETAX'] >= percentage_threshold, 1, 0)

In [19]:
# There were a few people who contributed over $18,000 per paycheck
# Making the assumption here that they meant to contribute that amount (e.g. $18,500) in a year, not in a paycheck

# People only do this for the flat amounts BEFORE tax, not AFTER
# For flat amounts, if the amount contributed is greater than or equal to $18,000,
# take that amount and divide by 26 paychecks
# Keeping these as the 'b' columns, because we want to drop them later as a sensitivity analysis

def fixContributions(df: pd.DataFrame,
                     baseline_date: str):

    ## Fix flat amounts based on either type of error 
    df['FLATAMOUNTBEFORETAX_' + baseline_date + "b"] = \
        np.where(df["assume_amounterror_" + baseline_date] == 1, 
                 round(df['FLATAMOUNTBEFORETAX']/26,2), 
                 np.where(df["assume_percerror_" + baseline_date] == 1, 
                 round(df.max_contribution/26,2),
                 df['FLATAMOUNTBEFORETAX']))
    
    ## Fix percentages for those who are coded as percentage error
    ## code them to 0 percent contribution (after fixing their flat amount)
    df['PERCENTAGEBEFORETAX_' + baseline_date + "b"] = \
        np.where(df["assume_percerror_" + baseline_date] == 1,  0, df['PERCENTAGEBEFORETAX'])
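
The flag-and-fix logic above can be traced on a few invented rows. This sketch reuses the thresholds passed later in the notebook (18000 and 100); the `toy` frame and the `flat_fixed` and `pct_fixed` column names are hypothetical and chosen only for the example:

```python
import numpy as np
import pandas as pd

# Invented rows: a normal contributor, a likely "annual amount" entry,
# and a 100% election for someone with the higher (age 50+) cap
toy = pd.DataFrame({
    'FLATAMOUNTBEFORETAX': [100.0, 18500.0, 0.0],
    'PERCENTAGEBEFORETAX': [0.0, 0.0, 100.0],
    'max_contribution': [18500, 18500, 24500],
})

amount_error = toy['FLATAMOUNTBEFORETAX'] >= 18000
perc_error = toy['PERCENTAGEBEFORETAX'] >= 100

# Amount errors are spread over 26 paychecks; percentage errors are
# replaced by the annual cap spread over 26 paychecks
toy['flat_fixed'] = np.where(
    amount_error, (toy['FLATAMOUNTBEFORETAX'] / 26).round(2),
    np.where(perc_error, (toy['max_contribution'] / 26).round(2),
             toy['FLATAMOUNTBEFORETAX']))

# Percentage errors are zeroed out once the flat amount carries the value
toy['pct_fixed'] = np.where(perc_error, 0, toy['PERCENTAGEBEFORETAX'])
print(toy[['flat_fixed', 'pct_fixed']])
```

For example, $18,500 entered as a per-paycheck amount becomes 18500 / 26, or about $711.54 per paycheck.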

In [20]:
def calculateContributions(df: pd.DataFrame,
                           baseline_date: str):
    '''
    Set up for checking whether employees increased their contributions.
    We ignore the difference between before-tax and after-tax contributions for percentage contributions,
    because we would have to make assumptions about the filing status and exemptions that the employee is claiming,
    which we do not have any information about.
    '''
    
    #Percentage of Salary
    df['PERCENTAGE_' + baseline_date] = \
        df['PERCENTAGEBEFORETAX_' + baseline_date + 'b'] + df['PERCENTAGEAFTERTAX']
  
    
    #Do the same for Flat amounts
    df['FLATAMOUNT_' + baseline_date] = \
        df['FLATAMOUNTBEFORETAX_' + baseline_date +'b'] + df['FLATAMOUNTAFTERTAX']
    df['FLATAMOUNT_' + baseline_date] = df['FLATAMOUNT_' + baseline_date].fillna(0)


    #Calculate the number of paychecks needed to reach the maximum contribution
    #If the number of paychecks is less than 26, the employee would hit their maximum contribution early,
    #which means that even if they increased their contribution, they would still contribute the same annual amount
    df['num_paychecks_' + baseline_date] = np.where(df['FLATAMOUNT_' + baseline_date] > 0, 
                                           np.round((df['max_contribution']/df['FLATAMOUNT_' + baseline_date]),2),
                                           0)

In [21]:
#Not strictly necessary, but I wanted to make sure I ended up with the same numbers.

def calculateAnnualContribution(df: pd.DataFrame, 
                                date: str) -> pd.Series:
    """
    Calculate the total annualized contribution from the data frame.
    Some people contribute a percentage of salary per paycheck. Some
    people contribute a flat amount per paycheck, so we have to do some
    switching:
        * If percent contributed per paycheck is positive,
          return percent * salary
        * If the number of paychecks needed to reach the maximum is <= 26,
          we assume the person hit the maximum contribution before
          the end of the year, and so return the maximum contribution.
        * If the number of paychecks is more than 26, the flat amounts
          do not reach the maximum within the year, and so return
          26 * the flat amount paid per paycheck.
    """
    answer_pct = df['PERCENTAGE_' + date] / 100 * df['Annual Rt']
    answer_max = df['max_contribution']
    answer_flat = df['FLATAMOUNT_' + date] * 26
    
    answer = answer_pct
    done_filter = df['PERCENTAGE_' + date] > 0
    answer *= done_filter
    
    this_filter = (df['num_paychecks_' + date] == 0) & ~done_filter
    answer[this_filter] = 0
    done_filter |= this_filter
    
    this_filter = (df['num_paychecks_' + date] <= 26) & ~done_filter
    answer += answer_max * this_filter
    done_filter |= this_filter
    
    this_filter = ~done_filter
    answer += answer_flat * this_filter
    return answer
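
The branching described in the docstring above can be restated compactly with `np.select`, which picks the first matching condition per row. This is a sketch on invented rows with simplified, hypothetical column names (`pct`, `flat`, `salary`), not a replacement for the function above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'pct': [5.0, 0.0, 0.0, 0.0],          # percent of salary per paycheck
    'flat': [0.0, 0.0, 1000.0, 100.0],    # flat dollars per paycheck
    'salary': [60000.0, 60000.0, 60000.0, 60000.0],
    'max_contribution': [18500, 18500, 18500, 18500],
})

# Paychecks needed to reach the cap; 0 when there is no flat contribution
# (dividing by NaN instead of 0 avoids a divide-by-zero)
n_checks = (toy['max_contribution'] / toy['flat'].where(toy['flat'] > 0)).fillna(0)

annual = np.select(
    [toy['pct'] > 0,      # percentage contributors
     n_checks == 0,       # no contribution at all
     n_checks <= 26],     # flat amount reaches the cap within the year
    [toy['pct'] / 100 * toy['salary'], 0.0, toy['max_contribution']],
    default=toy['flat'] * 26,  # flat amount stays under the cap
)
print(annual)
```

The four rows exercise each branch in turn: 5% of $60,000, no contribution, a flat amount capped at the maximum, and a flat amount annualised over 26 paychecks.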

In [22]:
# Apply to the Jan 07 baseline data
calculateMaxContribution(df_Jan107_merge)

flagContributions(df_Jan107_merge,
                  amount_threshold = 18000,
                  percentage_threshold = 100,
                  baseline_date = "0107")

fixContributions(df_Jan107_merge,
                 baseline_date="0107")

calculateContributions(df_Jan107_merge,
                 baseline_date="0107")

df_Jan107_merge['Annual Contribution'] = calculateAnnualContribution(df_Jan107_merge, "0107")

In [23]:
enrolled = df_Jan107_merge[df_Jan107_merge.JAN07THENROLLMENT == 'ElectedJan07th'].copy()
enrolled['Annual Contribution'].describe()

count    13969.000000
mean      4670.444292
std       5666.377680
min          0.000000
25%       1300.000000
50%       2600.000000
75%       5200.000000
max      24500.000000
Name: Annual Contribution, dtype: float64

In [24]:
df_Jan107_merge.groupby('treatment_real')['Annual Contribution'].describe()

                  count         mean          std  min  25%  50%     75%      max
treatment_real
0.0             10635.0  2050.337489   4372.49289  0.0  0.0  0.0  2080.0  24500.0
1.0             10570.0  2007.040755  4322.272701  0.0  0.0  0.0  1950.0  24500.0
2.0             10597.0  2096.978043  4541.503389  0.0  0.0  0.0  2080.0  24500.0


In [25]:
enrolled['pct_of_salary'] = enrolled['Annual Contribution']/enrolled['Annual Rt']*100
enrolled['pct_of_salary'].describe()

count    13969.000000
mean         5.130531
std          5.308053
min          0.000000
25%          1.608791
50%          3.314466
75%          6.677625
max         69.552005
Name: pct_of_salary, dtype: float64