## Data Preprocessing - Michigan System Salary
### This file assigns a unique person ID (`uid`) to each individual in the University of Michigan salary data, based on university and name, and checks department, title series, salary and year for possible mismatches.

**Input: Michigan Salary Data (michigan906.dta)**

**Output: michigan_salary.csv, with an extra column "uid" identifying each unique individual and a column "suspicious" indicating a possible mismatch**

### Mark the following UIDs as suspicious:
1. **FTE**: the same uid has more than 1 full-time equivalent (FTE) within a year
2. **Rank**: the same name has two ranks within a year

Total: 127 out of 12,607 IDs and 3,022 out of 115,120 rows are marked as suspicious.
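
The totals above can be recomputed from the `uid` and `suspicious` columns of the output. The snippet below is a minimal sketch on a made-up toy table (the values and the names `toy`, `n_ids`, `n_susp_ids` are illustrative assumptions, not part of the notebook).

```python
import pandas as pd

# Toy stand-in for the output table; all values are invented for illustration.
toy = pd.DataFrame({
    "uid":        [0, 0, 1, 2, 2, 2],
    "yr":         [2010, 2011, 2010, 2010, 2010, 2011],
    "suspicious": [False, False, False, True, True, True],
})

n_ids = toy["uid"].nunique()                                  # distinct individuals
n_rows = len(toy)                                             # total rows
n_susp_ids = toy.loc[toy["suspicious"], "uid"].nunique()      # flagged individuals
n_susp_rows = int(toy["suspicious"].sum())                    # flagged rows

print(f"{n_susp_ids} out of {n_ids:,} IDs and "
      f"{n_susp_rows:,} out of {n_rows:,} rows are marked as suspicious")
```

Because the flag is assigned per `uid`, every row of a flagged individual is counted, including years in which no conflict occurred.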

In [54]:
import os
import warnings
warnings.simplefilter('ignore')

import pandas as pd
#pd.options.display.float_format = "{:,.2f}".format

import numpy as np
from matplotlib import pyplot as plt
#import seaborn as sns
%matplotlib inline

os.chdir('/Users/apple/Dropbox/web_scrapping_UC/temp/')

#uc_salary = pd.read_csv('/Users/apple/Desktop/research_fellow_documents/uc_salary_new.csv')
df = pd.read_stata('michigan906.dta')
df.rename(columns = {'salary':'regular_pay','salary_total':'gross_pay'}, inplace = True)
df = df.sort_values(['university','first_name','last_name','yr', 'title'])

df = df.reset_index().rename(columns = {'index':'rid'})
df['uid'] = df.groupby(['university','first_name','last_name']).ngroup()
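
To make the ID assignment concrete, here is a small sketch of `groupby(...).ngroup()` on invented rows (the campus labels and names are hypothetical). Each distinct (university, first_name, last_name) combination receives one integer, numbered in sorted key order.

```python
import pandas as pd

# Invented rows; the campus labels and names are placeholders.
toy = pd.DataFrame({
    "university": ["UM_ANN-ARBOR", "UM_ANN-ARBOR", "UM_DEARBORN", "UM_ANN-ARBOR"],
    "first_name": ["Ann", "Ann", "Ann", "Bo"],
    "last_name":  ["Lee", "Lee", "Lee", "Kim"],
    "yr":         [2010, 2011, 2010, 2010],
})

# One integer per distinct key combination, in sorted key order.
toy["uid"] = toy.groupby(["university", "first_name", "last_name"]).ngroup()
print(toy)
```

Two consequences follow from keying on name alone: the same person at two universities receives two IDs, and two different people who share a name at one university receive the same ID. The second case is what the "suspicious" checks later in the notebook try to catch.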

In [55]:
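# rows that share uid, year and department with at least one other row
m2 = df[df.duplicated(['uid','yr','department'],keep=False)]
# drop rows with a blank department (the result must be assigned to take effect)
m2 = m2[m2['department']!='']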
m21 = m2.groupby(['uid','yr'])['gross_pay'].nunique().reset_index()
s0 = set(m21[m21['gross_pay']>1]['uid'])
#print('obs that have multiple records in the same department and whose salaries are different: {} IDs'.format(len(s0)))


In [56]:
#obs that have more than 1 department in the same year and whose salaries are different
m3 = df.groupby(['uid','yr'])['department'].nunique().reset_index()
m3 = m3[m3['department'] > 1]
df2 = df[df['uid'].isin(m3.uid)]
da = df2.groupby(['uid','yr'])['gross_pay'].nunique().reset_index()
s1 = set(da[da['gross_pay']>1]['uid'])
#print('obs that have more than 1 department in the same year but salaries are different: {} IDs'.format(len(s1)))


In [57]:
#whether the gross_pay column is zero or null
df['gross_notnull'] = (df['gross_pay'] != 0) & (df['gross_pay'].notnull())

#whether the title contains "prof"
df['title_prof'] = df['title'].str.contains('prof', case=False)

ad = df.groupby(['uid','yr']).agg({'fte':'sum', 'regular_pay':'sum','gross_pay':'nunique', 'gross_notnull':'sum', 'title_prof':'sum', 'rank':'nunique'}).reset_index()
# keep uid-years whose total FTE is not approximately 1
md = ad[(ad['fte'] > 1.01) | (ad['fte'] < 0.99)]
# the following individuals are likely to be duplicates:
# within the same year, the same name has:
# 1. more than one professor-type title; 2. more than one distinct gross_pay; 3. more than one gross_pay that is neither zero nor null.
s2 = set(md[(md['title_prof'] > 1) & (md['gross_pay'] > 1) & (md['gross_notnull'] > 1)]['uid'])
len(s2)


98
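
The FTE rule in the cell above can be traced on a toy table. This is a sketch with invented values. It uses pandas named aggregation for readability, so the column names `fte_total`, `n_pay`, `n_paid` and `n_prof` are assumptions rather than the notebook's own. As in the cell, a uid-year counts as "off" when its total FTE is above 1.01 or below 0.99.

```python
import pandas as pd

# Invented rows: uid 0 holds two full-time professor records in one year.
toy = pd.DataFrame({
    "uid":       [0, 0, 1, 1, 2],
    "yr":        [2010, 2010, 2010, 2010, 2010],
    "title":     ["Professor", "Assoc Professor", "Professor", "Lecturer", "Professor"],
    "fte":       [1.0, 1.0, 0.5, 0.5, 1.0],
    "gross_pay": [120000.0, 90000.0, 60000.0, 30000.0, 100000.0],
})
toy["gross_notnull"] = toy["gross_pay"].notna() & (toy["gross_pay"] != 0)
toy["title_prof"] = toy["title"].str.contains("prof", case=False)

# One row per (uid, yr) with the quantities the rule needs.
per_year = toy.groupby(["uid", "yr"]).agg(
    fte_total=("fte", "sum"),
    n_pay=("gross_pay", "nunique"),
    n_paid=("gross_notnull", "sum"),
    n_prof=("title_prof", "sum"),
).reset_index()

off_fte = (per_year["fte_total"] > 1.01) | (per_year["fte_total"] < 0.99)
flagged = per_year[off_fte
                   & (per_year["n_prof"] > 1)
                   & (per_year["n_pay"] > 1)
                   & (per_year["n_paid"] > 1)]
suspicious_ids = set(flagged["uid"])
print(suspicious_ids)
```

Here uid 1 is not flagged even though it has two titles and two salaries, because its two half-time records add up to exactly one FTE, which is consistent with a single person holding a split appointment.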

In [58]:
# cases where the same uid has multiple ranks in a year
m4 = df.groupby(['yr','uid']).agg({'rank':'nunique', 'fte':'sum'}).reset_index()

md = m4[(m4['fte'] > 1.01) | (m4['fte'] < 0.99)]

s3 = set(md[md['rank'] > 1]['uid'])
print('cases where same uid have multiple rank in a year: {} IDs.'.format(len(s3)))

cases where same uid have multiple rank in a year: 36 IDs.


In [59]:
# if gross_pay values differ within a year, sum them
# if gross_pay values are identical within a year, keep one

df = df.merge(df.groupby(['uid','yr'])['gross_pay'].nunique().reset_index(), on=['uid','yr'])
u1 = df[df['gross_pay_y']>1]
u2 = df[df['gross_pay_y']==1]
# select several columns with a list; tuple-style selection fails in recent pandas
m1 = u1.groupby(['uid','yr'])[['gross_pay_x','regular_pay','gen_fund']].sum().reset_index()
m2 = u2.groupby(['uid','yr'])[['gross_pay_x','regular_pay','gen_fund']].mean().reset_index()
m3 = pd.concat([m1, m2], ignore_index = True)
m3.columns = ['uid','yr','gross_pay_sum','regular_pay_sum', 'gen_fund_sum']
df = df.merge(m3, on = ['uid','yr'])
df = df.drop(columns = ['gross_pay_y']).rename(columns = {'gross_pay_x':'gross_pay'})

#mark suspicious IDs
df['suspicious'] = df['uid'].isin(s2 | s3)
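
The "sum if different, keep one if identical" rule can also be written without the split-and-merge steps. The following is an alternative sketch using `groupby(...).transform`, shown on invented values for `gross_pay` only. It is not the notebook's method, and `n_distinct` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows: uid 0 has two different salaries in 2010, uid 1 has a repeated one.
toy = pd.DataFrame({
    "uid":       [0, 0, 1, 1],
    "yr":        [2010, 2010, 2010, 2010],
    "gross_pay": [50000.0, 20000.0, 80000.0, 80000.0],
})

g = toy.groupby(["uid", "yr"])["gross_pay"]
n_distinct = g.transform("nunique")   # distinct salaries per uid-year, aligned to rows

# Sum when the salaries differ; otherwise the mean equals the single repeated value.
toy["gross_pay_sum"] = np.where(n_distinct > 1, g.transform("sum"), g.transform("mean"))
print(toy)
```

One difference is worth checking against the real data: `nunique` ignores missing values, so a uid-year whose `gross_pay` is entirely missing has a count of 0. In the cell above such a group belongs to neither `u1` nor `u2`, and the inner merge with `m3` would then appear to drop its rows. In this sketch those rows are kept with a missing `gross_pay_sum`.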

In [67]:
df.to_csv('/Users/apple/Desktop/research_fellow_documents/data_clean/michigan_salary.csv', index = False)