# Processing California's public payrolls

This notebook processes annual government payroll [data](https://publicpay.ca.gov/Reports/RawExport.aspx) compiled and released by the California state controller's office. The data include anonymized salary information for all employees at cities, counties, special districts and state government.

---

### Load Python tools

In [1]:
import pandas as pd
import zipfile
from urllib.request import urlopen 
import pyarrow
import os
import glob
import io
import requests
import matplotlib
import json
import numpy as np
from altair import datum
import altair as alt
alt.renderers.enable('notebook')
import altair_latimes as lat
alt.themes.register('latimes', lat.theme)
alt.themes.enable('latimes')

ThemeRegistry.enable('latimes')

### Download zipped salary tables by year and agency type

In [2]:
os.chdir('/Users/mhustiles/data/data/controller/input/')

In [3]:
formaturl = lambda x: 'https://publicpay.ca.gov/RawExport/' + f'{x[1]}_' + f'{x[0]}' + '.zip'

In [83]:
metadata = []
for y in range(2012,2019):
#     for e in ['City', 'County', 'SpecialDistrict', 'StateDepartment']:
    for e in ['City', 'County']:
        metadata.append(dict(entity = e, year = y, url = formaturl((e, y))))
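As a quick check on the loop above, the sketch below rebuilds the same list with a named function in place of the lambda and prints the first URL. The names `BASE_URL` and `build_url` are assumptions made for illustration; the URL pattern itself is the one the notebook uses.

```python
BASE_URL = 'https://publicpay.ca.gov/RawExport/'

def build_url(entity, year):
    """Return the export URL for one entity type and year (hypothetical helper)."""
    return f'{BASE_URL}{year}_{entity}.zip'

# Same loop as above: 2012 through 2018, cities and counties only
metadata = [
    dict(entity=e, year=y, url=build_url(e, y))
    for y in range(2012, 2019)
    for e in ['City', 'County']
]

print(len(metadata))
print(metadata[0]['url'])
```

Seven years times two entity types gives 14 entries, one per zip file to download.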

### Extract CSVs from .zip files, and then discard the .zip files

In [None]:
for m in metadata:
    !wget '{m['url']}'
    !unzip \*.zip
    !rm -f *.zip

--2019-11-07 15:06:08--  https://publicpay.ca.gov/RawExport/2012_City.zip
Resolving publicpay.ca.gov (publicpay.ca.gov)... 67.157.158.198
Connecting to publicpay.ca.gov (publicpay.ca.gov)|67.157.158.198|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6312863 (6.0M) [application/x-zip-compressed]
Saving to: ‘2012_City.zip’


2019-11-07 15:06:10 (3.27 MB/s) - ‘2012_City.zip’ saved [6312863/6312863]

Archive:  2012_City.zip
  inflating: 2012_City.csv           
--2019-11-07 15:06:11--  https://publicpay.ca.gov/RawExport/2012_County.zip
Resolving publicpay.ca.gov (publicpay.ca.gov)... 67.157.158.198
Connecting to publicpay.ca.gov (publicpay.ca.gov)|67.157.158.198|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7975474 (7.6M) [application/x-zip-compressed]
Saving to: ‘2012_County.zip’


2019-11-07 15:06:13 (3.97 MB/s) - ‘2012_County.zip’ saved [7975474/7975474]

Archive:  2012_County.zip
  inflating: 2012_County.csv         
--2019-11-07 1
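
The shell commands above depend on `wget` and `unzip` being installed. A minimal pure-Python sketch of the same step, using only the standard library, might look like the following. The function names `extract_csvs` and `download_and_extract` are hypothetical, and the sketch assumes each archive is small enough to hold in memory.

```python
import io
import zipfile
from urllib.request import urlopen

def extract_csvs(zip_bytes, dest_dir):
    """Write every .csv member of an in-memory zip archive to dest_dir."""
    extracted = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith('.csv'):
                # extract() returns the path of the file it wrote
                extracted.append(archive.extract(name, dest_dir))
    return extracted

def download_and_extract(url, dest_dir):
    """Fetch one zip file and unpack its CSVs; the zip itself never touches disk."""
    with urlopen(url) as response:
        return extract_csvs(response.read(), dest_dir)
```

Calling `download_and_extract(m['url'], '.')` for each `m` in `metadata` would stand in for the three shell commands.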

---

### Read all the CSV files and concatenate them into one dataframe

In [None]:
path = '/Users/mhustiles/data/data/controller/input/'
all_files = glob.glob(os.path.join(path, "*.csv"))

df_from_each_file = (pd.read_csv(f, encoding = "ISO-8859-1", low_memory=False, dtype = {'DepartmentOrSubdivision': 'object', 'Year': 'object'}) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
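
To see what the generator-plus-`pd.concat` pattern does without the full download, here is a small sketch that uses two in-memory CSVs in place of files on disk. The sample rows are made up for illustration and are not taken from the controller's data.

```python
import io
import pandas as pd

# Two tiny stand-ins for the yearly CSV files (made-up rows)
fake_files = [
    io.StringIO("Year,EmployerName,TotalWages\n2012,Example City,100\n"),
    io.StringIO("Year,EmployerName,TotalWages\n2013,Example County,200\n"),
]

# The generator reads one file at a time; concat stacks the results into one frame
frames = (pd.read_csv(f, dtype={'Year': 'object'}) for f in fake_files)
combined = pd.concat(frames, ignore_index=True)

print(combined.shape)
print(list(combined.index))
```

With `ignore_index=True` the combined frame gets a fresh 0-to-N index instead of repeating each file's own row numbers.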

In [None]:
concatenated_df.head()

### Trim the dataframe to the columns we need

In [None]:
# Take an explicit copy so later edits do not touch (or warn about) concatenated_df
payroll = concatenated_df[['Year','EmployerType','EmployerPopulation','EmployerName','DepartmentOrSubdivision',
                 'Position','OvertimePay','TotalWages', 'TotalRetirementAndHealthContribution']].copy()

In [None]:
payroll.head()

### Not everyone reports (or pays) overtime

In [None]:
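# Assign the result back: fillna(inplace=True) on a selected column is unreliable in newer pandas
payroll['OvertimePay'] = payroll['OvertimePay'].fillna(0)
payroll['DepartmentOrSubdivision'] = payroll['DepartmentOrSubdivision'].fillna('NOT LISTED')
payroll['EmployerPopulation'] = payroll['EmployerPopulation'].fillna(0)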
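
The same three fills can also be written as a single call by passing `fillna` a dictionary that maps column names to fill values. This is a sketch on a made-up three-row frame, not the notebook's data.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'OvertimePay': [1500.0, np.nan, 0.0],
    'DepartmentOrSubdivision': ['Fire', None, 'Police'],
    'EmployerPopulation': [np.nan, 50000.0, 12000.0],
})

# One fill value per column; columns not named here are left alone
sample = sample.fillna({
    'OvertimePay': 0,
    'DepartmentOrSubdivision': 'NOT LISTED',
    'EmployerPopulation': 0,
})

print(sample)
```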

In [None]:
payroll.head()

### Clean up column headers

In [None]:
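# regex=False makes '(' and ')' literal characters regardless of pandas version
payroll.columns = payroll.columns.str.strip().str.lower().str.replace(' ', '_', regex=False)\
                    .str.replace('(', '', regex=False).str.replace(')', '', regex=False).str.replace('-', '_', regex=False)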
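
As a rough illustration of what that chain does, the sketch below applies it to a few made-up header names. The columns kept in this notebook contain no spaces or punctuation, so here the chain effectively only lowercases them.

```python
import pandas as pd

# Made-up headers chosen to exercise each step of the chain
headers = pd.Index([' EmployerType ', 'Total Wages (USD)', 'Part-Time'])

cleaned = (
    headers.str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('(', '', regex=False)
    .str.replace(')', '', regex=False)
    .str.replace('-', '_', regex=False)
)

print(list(cleaned))
```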

In [None]:
payroll.head()

In [None]:
payroll.rename(columns = {
'employertype':'type',
'employerpopulation':'population',
'employername':'employer',
'departmentorsubdivision':'department',
'overtimepay':'overtime',
'totalretirementandhealthcontribution':'benefits',
'totalwages':'wages',
 }, inplace = True)

### Uppercase everything because their title casing across hundreds of agencies is janky

In [None]:
payroll = payroll.apply(lambda x: x.astype(str).str.upper())
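
Note that `astype(str)` also turns numeric columns such as wages into text. If the numbers are needed for arithmetic later, one alternative is to uppercase only the non-numeric columns. The following is a sketch of that variant on made-up rows, not what the notebook itself does.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

sample = pd.DataFrame({
    'employer': ['Example City', 'example county'],
    'position': ['Fire Captain', 'clerk'],
    'wages': [90000.0, 45000.0],
})

# Uppercase the text columns; leave numeric columns as numbers
for col in sample.columns:
    if not is_numeric_dtype(sample[col]):
        sample[col] = sample[col].astype(str).str.upper()

print(sample.dtypes)
```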

### Crudely filter for fire jobs

In [None]:
# .copy() gives fire_payroll its own data, independent of payroll
fire_payroll = payroll.loc[payroll['department'].str.contains('FIRE')].copy()
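
The filter is crude because `str.contains` matches the substring anywhere in the department name. A short sketch with made-up department names shows the kind of false positive this can admit, along with one possible way to tighten it using a word-boundary regular expression.

```python
import pandas as pd

departments = pd.Series([
    'FIRE DEPARTMENT',
    'POLICE',
    'FIREARMS TRAINING',   # made-up name that contains 'FIRE' but is not a fire agency
    'NOT LISTED',
])

# Plain substring match, as in the notebook
loose = departments.str.contains('FIRE')

# Word-boundary match: 'FIRE' must stand alone as a word
strict = departments.str.contains(r'\bFIRE\b', regex=True)

print(loose.tolist())
print(strict.tolist())
```

The stricter pattern would also drop names such as 'FIREFIGHTERS', so which version is appropriate depends on how agencies actually label their departments.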

### How do the dataframes look? 

In [None]:
payroll.head()

In [None]:
fire_payroll.head()

### How many records do we have here?

In [None]:
# All payroll records
len(payroll)

In [None]:
# Fire department records only
len(fire_payroll)

---

### Export to a lightweight format

In [None]:
payroll.reset_index().to_feather('/Users/mhustiles/data/data/controller/output/payroll.feather')

---

Data source: https://publicpay.ca.gov/Reports/RawExport.aspx