## Data Preprocessing - Illinois System Salary
### This file assigns a unique person ID (`uid`) to each university and name pair in the University of Illinois salary data, then uses department, rank, FTE, salary, and year to flag IDs that may combine different people.

**Input: Illinois Salary Data ('illinois.dta')**

**Output: illinois_salary.csv, with an extra column "uid" identifying each unique individual and a column "suspicious" indicating a possible mismatch**

### Mark the following UIDs as suspicious:
1. **Within Department**: The same name has multiple records in the same department and year, but the salaries differ.
2. **FTE**: The same uid has a total full-time equivalent (FTE) greater than 1 in a year.
3. **Rank**: The same name has two or more ranks within a year.
4. **Across Department**: The same name appears in more than one department in the same year, and the salaries differ.

Total: 352 of 23,359 IDs and 5,557 of 144,792 rows are marked as suspicious.
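
To make the four rules concrete, here is a minimal sketch on a made-up toy table. The values and the per-(uid, yr) formulation are assumptions for illustration; the cells further down apply the rules with slightly different intermediate steps, so this is not a drop-in replacement.

```python
import pandas as pd

# Toy records (hypothetical values); column names mirror those used in this notebook.
toy = pd.DataFrame({
    'uid':        [0, 0, 1, 1, 2, 2, 3],
    'yr':         [2013] * 7,
    'department': ['Math', 'Math', 'Physics', 'Physics', 'Math', 'History', 'Law'],
    'rank':       ['Prof', 'Prof', 'Prof', 'Asst Prof', 'Prof', 'Prof', 'Prof'],
    'fte':        [0.5, 0.5, 1.0, 0.5, 0.5, 0.5, 1.0],
    'gross_pay':  [100.0, 100.0, 90.0, 40.0, 50.0, 60.0, 80.0],
})

# Rule 1: differing salaries within one department in one year.
per_dept = toy.groupby(['uid', 'yr', 'department'])['gross_pay'].nunique()
rule_within = set(per_dept[per_dept > 1].index.get_level_values('uid'))

# Per person-year summary used by rules 2 to 4.
per_year = toy.groupby(['uid', 'yr']).agg(
    n_dept=('department', 'nunique'),
    n_rank=('rank', 'nunique'),
    n_pay=('gross_pay', 'nunique'),
    fte_total=('fte', 'sum'),
)
rule_fte = per_year['fte_total'] > 1.01                          # Rule 2 (small tolerance)
rule_rank = per_year['n_rank'] > 1                               # Rule 3
rule_across = (per_year['n_dept'] > 1) & (per_year['n_pay'] > 1)  # Rule 4

flagged = per_year[rule_fte | rule_rank | rule_across]
toy_suspicious = rule_within | set(flagged.index.get_level_values('uid'))
print(toy_suspicious)
```

In this toy table, uid 1 trips rules 1 to 3 and uid 2 trips rule 4, while uid 0 (two identical half-time records) and uid 3 stay unflagged.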

In [50]:
import os
import warnings
warnings.simplefilter('ignore')

import pandas as pd
#pd.options.display.float_format = "{:,.2f}".format

import numpy as np
from matplotlib import pyplot as plt
#import seaborn as sns
%matplotlib inline

os.chdir('/Users/apple/Dropbox/web_scrapping_UC/temp/')
df = pd.read_stata('illinois.dta')
df.rename(columns = {'salary':'regular_pay','salary_total':'gross_pay'}, inplace = True)
df = df.reset_index().rename(columns = {'index':'rid'})
df['uid'] = df.groupby(['university','name']).ngroup()
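
As a small illustration of what `groupby(...).ngroup()` does here, the sketch below uses invented names and campuses (all values are hypothetical). Each distinct (university, name) pair receives one integer, so the same name at two campuses gets two IDs, while two different people who share a name on one campus would share an ID. That second case is what the suspicious checks are meant to catch.

```python
import pandas as pd

# Hypothetical records: one name appears at two campuses and in two years.
toy = pd.DataFrame({
    'university': ['Chicago', 'Urbana', 'Chicago', 'Chicago'],
    'name':       ['Doe, Jane', 'Doe, Jane', 'Doe, Jane', 'Roe, Richard'],
    'yr':         [2013, 2013, 2014, 2013],
})

# ngroup numbers the groups 0..n-1 in sorted key order (groupby sorts keys by default).
toy['uid'] = toy.groupby(['university', 'name']).ngroup()
print(toy)
```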

In [40]:
umi_salary = pd.read_csv('/Users/apple/Desktop/research_fellow_documents/data_clean/michigan_salary.csv')

In [51]:
# Rule 1: same name with multiple records in the same department and year, but different salaries
m2 = df[df.duplicated(['uid','yr','department'],keep=False)]
m2[m2['department']!='']  # note: the result is not assigned, so blank departments are not actually dropped
m21 = m2.groupby(['uid','yr'])['gross_pay'].nunique().reset_index()
s0 = set(m21[m21['gross_pay']>1]['uid'])
print('Same name that has multiple records in the same department same year but salaries are different: {} IDs'.format(len(s0)))

#cases when same uid have larger than 1 full time equivalent salary
mm2 = df.groupby(['uid','yr'])['fte'].sum().reset_index()

s1 = set(mm2[mm2['fte']>1.01]['uid'])
print('cases when same uid have larger than 1 full time equivalent salary: {} IDs'.format(len(s1)))

#cases where same uid have multiple rank in a year
m4 = df.groupby(['yr','uid'])['rank'].nunique().reset_index()
s2 = set(m4[m4['rank'] > 1]['uid'])
print('cases where same uid have multiple rank in a year: {} IDs'.format(len(s2)))

#Same name that have more than 1 department in the same year and whose salaries are different
m3 = df.groupby(['uid','yr'])['department'].nunique().reset_index()
m3 = m3[m3['department'] > 1]
df2 = df[df['uid'].isin(m3.uid)]
da = df2.groupby(['uid','yr'])['gross_pay'].nunique().reset_index()
s3 = set(da[da['gross_pay']>1]['uid'])
print('Same name that have more than 1 department in the same year and whose salaries are different: {} IDs'.format(len(s3)))


suspicious = s0 | s1 | s2 | s3
print('total number of suspicious uids:', len(suspicious))

# If a uid's gross_pay values differ within a year, sum them.
# If a uid's gross_pay values are identical within a year, keep one.


Same name that has multiple records in the same department same year but salaries are different: 3 IDs
cases when same uid have larger than 1 full time equivalent salary: 33 IDs
cases where same uid have multiple rank in a year: 326 IDs
Same name that have more than 1 department in the same year and whose salaries are different: 19 IDs
total number of suspicious uids: 352


In [52]:
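# Count distinct gross_pay values per uid and year (merged in as gross_pay_y)
df = df.merge(df.groupby(['uid','yr'])['gross_pay'].nunique().reset_index(), on=['uid','yr'])

u1 = df[df['gross_pay_y']>1]   # pay differs within the year: sum the records
u2 = df[df['gross_pay_y']==1]  # pay is identical within the year: the mean keeps the single value
# Select several columns with a list; tuple-style selection after groupby is rejected by recent pandas
m1 = u1.groupby(['uid','yr'])[['gross_pay_x','regular_pay']].sum().reset_index()
m2 = u2.groupby(['uid','yr'])[['gross_pay_x','regular_pay']].mean().reset_index()
m3 = pd.concat([m1, m2], ignore_index = True)
m3.columns = ['uid','yr','gross_pay_sum','regular_pay_sum']
df = df.merge(m3, on = ['uid','yr'])
df = df.drop(columns = ['gross_pay_y']).rename(columns = {'gross_pay_x':'gross_pay'})
# Flag every row whose uid was caught by any of the four rules
df['suspicious'] = df['uid'].isin(suspicious)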
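
The "sum if different, keep one if identical" rule can also be written as a single grouped transform. The following is only a sketch on toy values, not the method used above; `collapse_pay` is a hypothetical helper name introduced for illustration.

```python
import pandas as pd

# Toy data (hypothetical): uid 0 has a duplicated record, uid 1 has two distinct payments.
toy = pd.DataFrame({
    'uid':       [0, 0, 1, 1],
    'yr':        [2013, 2013, 2013, 2013],
    'gross_pay': [100.0, 100.0, 60.0, 45.0],
})

def collapse_pay(pay):
    # Distinct values within a person-year are treated as separate payments and summed;
    # identical values are treated as duplicates, so only one is kept.
    return pay.sum() if pay.nunique() > 1 else pay.iloc[0]

# transform broadcasts the per-group scalar back onto every row of the group.
toy['gross_pay_sum'] = toy.groupby(['uid', 'yr'])['gross_pay'].transform(collapse_pay)
print(toy)
```

One design caveat applies to both versions: if a person-year mixes duplicated and distinct amounts, summing counts the duplicated amount more than once.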

In [53]:
df[['name','yr','university','gross_pay','gross_pay_sum','uid', 'department']].sort_values(['uid','yr','name']).head(10)

| | name | yr | university | gross_pay | gross_pay_sum | uid | department |
|---|---|---|---|---|---|---|---|
| 2239 | Aakalu, Vinay Kumar | 2013 | Chicago | 190950.0 | 190950.0 | 0 | Ophthalmology & Visual Sci |
| 2240 | Aakalu, Vinay Kumar | 2013 | Chicago | 190950.0 | 190950.0 | 0 | Ophthalmology & Visual Sci |
| 19675 | Aakalu, Vinay Kumar | 2014 | Chicago | 195723.75 | 195723.75 | 0 | Ophthalmology & Visual Sci |
| 19676 | Aakalu, Vinay Kumar | 2014 | Chicago | 195723.75 | 195723.75 | 0 | Ophthalmology & Visual Sci |
| 37169 | Aakalu, Vinay Kumar | 2015 | Chicago | 195723.75 | 195723.75 | 0 | Ophthalmology & Visual Sci |
| 37170 | Aakalu, Vinay Kumar | 2015 | Chicago | 195723.75 | 195723.75 | 0 | Ophthalmology & Visual Sci |
| 54888 | Aakalu, Vinay Kumar | 2016 | Chicago | 195723.75 | 195723.75 | 0 | Ophthalmology & Visual Sci |
| 54889 | Aakalu, Vinay Kumar | 2016 | Chicago | 195723.75 | 195723.75 | 0 | Ophthalmology & Visual Sci |
| 72676 | Aakalu, Vinay Kumar | 2017 | Chicago | 201634.6 | 201634.6 | 0 | Ophthalmology & Visual Sci |
| 72677 | Aakalu, Vinay Kumar | 2017 | Chicago | 201634.6 | 201634.6 | 0 | Ophthalmology & Visual Sci |


In [57]:
df.to_csv('/Users/apple/Desktop/research_fellow_documents/data_clean/illinois_salary.csv', index = False)