## Data Preprocessing - Texas System Salary
### This file assigns a unique person ID to each individual in the Texas system salary data by university and name, then uses rank, department, salary, and year to flag possible mismatches.

**Input: Texas Salary Data ('texas.dta')**

**Output: texas_salary.csv, with an extra column "uid" identifying each unique individual and a column "suspicious" flagging possible mismatches**

### Mark the following UIDs as suspicious
1. **Rank** The same name has two or more ranks within a year.
2. **Within Department** The same name has multiple records in the same department in the same year, and the salaries differ.
3. **Across Department** The same name appears in more than one department in the same year, and the salaries differ.

Total: 255 out of 101,505 IDs and 1,215 out of 230,092 rows are marked as suspicious.
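
As a rough illustration of the three rules as stated above, the sketch below applies them to a small invented table. The column names mirror the ones used in this notebook, but the values, the helper `flagged_uids`, and the exact grouping keys are assumptions made for the example; the notebook's own cells group slightly differently.

```python
import pandas as pd

# Toy records; all values are invented for illustration.
toy = pd.DataFrame({
    "uid":        [0, 0, 1, 1, 2, 2, 3],
    "yr":         [2015] * 7,
    "rank":       ["Assistant", "Associate", "Full", "Full", "Full", "Full", "Full"],
    "department": ["Math", "Math", "Physics", "Physics", "Math", "History", "Math"],
    "gross_pay":  [50000, 50000, 90000, 95000, 80000, 20000, 70000],
})

def flagged_uids(frame, keys, column):
    """Hypothetical helper: uids with more than one distinct `column` value in any `keys` group."""
    counts = frame.groupby(keys)[column].nunique()
    return set(counts[counts > 1].index.get_level_values("uid"))

# Rule 1: more than one rank for the same uid in the same year.
rule_rank = flagged_uids(toy, ["uid", "yr"], "rank")

# Rule 2: several records in one department and year with differing salaries.
rule_within = flagged_uids(toy, ["uid", "yr", "department"], "gross_pay")

# Rule 3: more than one department in a year, with differing salaries in that year.
per_year = toy.groupby(["uid", "yr"]).agg(
    n_dept=("department", "nunique"),
    n_pay=("gross_pay", "nunique"),
)
hit = per_year[(per_year["n_dept"] > 1) & (per_year["n_pay"] > 1)]
rule_across = set(hit.index.get_level_values("uid"))

# A uid is suspicious if any rule fires.
suspicious = rule_rank | rule_within | rule_across
toy["suspicious"] = toy["uid"].isin(suspicious)
```

In this toy table, uid 0 trips the rank rule, uid 1 the within-department rule, and uid 2 the across-department rule, while uid 3 stays unflagged.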

In [82]:
import os
import warnings
warnings.simplefilter('ignore')

import pandas as pd

import numpy as np
from matplotlib import pyplot as plt
#import seaborn as sns
%matplotlib inline

os.chdir('/Users/apple/Dropbox/web_scrapping_UC/temp/')
df = pd.read_stata('texas.dta')
df.rename(columns = {'salary':'gross_pay'}, inplace = True)
df = df.reset_index().rename(columns = {'index':'rid'})

#assign unique IDs if in the same university and share the same name
df['uid'] = df.groupby(['university','name']).ngroup()
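
To make the ID assignment concrete, here is a minimal sketch of `groupby(...).ngroup()` on an invented table. The university and name values are made up for illustration. With the default `sort=True`, groups are numbered in sorted key order, and every row that shares a university and a name receives the same integer.

```python
import pandas as pd

# Invented rows: the same name at two universities, plus a second person.
people = pd.DataFrame({
    "university": ["UT Austin", "UT Austin", "UT Dallas", "UT Austin"],
    "name":       ["Lee, A", "Lee, A", "Lee, A", "Cruz, B"],
    "yr":         [2014, 2015, 2015, 2015],
})

# One integer per (university, name) pair, shared by all rows of that pair.
people["uid"] = people.groupby(["university", "name"]).ngroup()

print(people["uid"].tolist())  # [1, 1, 2, 0]
```

Note that two different people with the same name at the same university would share a `uid` under this scheme, which is one reason the later cells flag suspicious cases.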

In [83]:
#flag uids with more than one rank; note the grouping is by uid only, so ranks are compared across all years rather than within a single year
m1 = df.groupby(['uid'])['rank'].nunique().reset_index()
s0 = set(m1[m1['rank'] > 1]['uid'])
print('cases where same uid have multiple rank in a year: {} IDs'.format(len(s0)))

#within each uid-year: if the gross_pay values differ, sum them into gross_pay_sum
#within each uid-year: if the gross_pay values are all identical, keep a single value
df = df.merge(df.groupby(['uid','yr'])['gross_pay'].nunique().reset_index(), on=['uid','yr'])
u1 = df[df['gross_pay_y']>1]
u2 = df[df['gross_pay_y']==1]
m1 = u1.groupby(['uid','yr'])['gross_pay_x'].sum().reset_index()
m2 = u2.groupby(['uid','yr'])['gross_pay_x'].mean().reset_index()
m3 = pd.concat([m1, m2], ignore_index = True)
m3.columns = ['uid','yr','gross_pay_sum']
df = df.merge(m3, on = ['uid','yr'])
df = df.drop(columns = ['gross_pay_y']).rename(columns = {'gross_pay_x':'gross_pay'})

#obs that have multiple records in the same department and whose salaries are different
m2 = df[df.duplicated(['uid','yr','department'],keep=False)]
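#note: the result of this filter is not assigned, so rows with an empty department are still included in the counts below
m2[m2['department']!='']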
m21 = m2.groupby(['uid','yr'])['gross_pay'].nunique().reset_index()
s1 = set(m21[m21['gross_pay']>1]['uid'])
print('obs that have multiple records in the same department and whose salaries are different: {} IDs'.format(len(s1)))

#obs that have more than 1 department in the same year and whose salaries are different
m3 = df.groupby(['uid','yr'])['department'].nunique().reset_index()
m3 = m3[m3['department'] > 1]
df2 = df[df['uid'].isin(m3.uid)]
da = df2.groupby(['uid','yr'])['gross_pay'].nunique().reset_index()
s2 = set(da[da['gross_pay']>1]['uid'])
print('obs that have more than 1 department in the same year but salaries are different: {} IDs'.format(len(s2)))

susp_id = s0 | s1 | s2
print('total number of suspicious uids:', len(susp_id))

df['suspicious'] = df['uid'].isin(susp_id)

cases where same uid have multiple rank in a year: 17 IDs
obs that have multiple records in the same department and whose salaries are different: 238 IDs
obs that have more than 1 department in the same year but salaries are different: 1 IDs
total number of suspicious uids: 255
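
The yearly pay rule used in the cell above (sum when a uid has differing `gross_pay` values within a year, keep one value when they are identical) can also be expressed in a single pass. The sketch below is an alternative formulation on invented data, not the notebook's own code; the helper name `yearly_pay` is hypothetical.

```python
import pandas as pd

# Invented pay records for three uids in one year.
pay = pd.DataFrame({
    "uid":       [0, 0, 1, 1, 2],
    "yr":        [2015, 2015, 2015, 2015, 2015],
    "gross_pay": [30000.0, 20000.0, 60000.0, 60000.0, 45000.0],
})

def yearly_pay(values):
    # Differing amounts are treated as separate payments and summed;
    # identical amounts are treated as duplicate rows and counted once.
    return values.sum() if values.nunique() > 1 else values.iloc[0]

# transform broadcasts the per-group scalar back onto every row of the group.
pay["gross_pay_sum"] = pay.groupby(["uid", "yr"])["gross_pay"].transform(yearly_pay)
```

Here uid 0 gets 50000.0 (two different amounts summed), uid 1 gets 60000.0 (a duplicated amount counted once), and uid 2 keeps its single value.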


In [84]:
df.to_csv('/Users/apple/Desktop/research_fellow_documents/data_clean/texas_salary.csv', index = False)
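
Summary totals of the kind quoted at the top of this notebook (suspicious IDs out of all IDs, suspicious rows out of all rows) can be recomputed from the `uid` and `suspicious` columns. The sketch below shows one way to do so on an invented table; it is not run against the real data.

```python
import pandas as pd

# Invented flags: uid 0 is suspicious, uids 1 and 2 are not.
flags = pd.DataFrame({
    "uid":        [0, 0, 1, 2, 2, 2],
    "suspicious": [True, True, False, False, False, False],
})

n_ids = flags["uid"].nunique()                                  # distinct individuals
n_susp_ids = flags.loc[flags["suspicious"], "uid"].nunique()    # distinct flagged individuals
n_rows = len(flags)                                             # all records
n_susp_rows = int(flags["suspicious"].sum())                    # flagged records

print('{} IDs out of {} IDs and {} out of {} rows marked as suspicious'.format(
    n_susp_ids, n_ids, n_susp_rows, n_rows))
```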