# Bay Area Public Salaries

### Introduction
The motivation of this notebook is to compare graduate student assistant salaries across several Bay Area counties and against the state's salaries. For the purpose of this study, only data from public government records were used. These records can be found on the web under many names, e.g. "Civil Service Pay Scale", "Job Salary Schedule", "Pay Schedule", etc.

**Note**: These salaries are *not* to be confused with "New Graduate Entry Level" salaries; they are "Graduate Student Internship" level salaries.  
<br />

<center><img src="https://upload.wikimedia.org/wikipedia/commons/d/d8/California_Bay_Area_county_map_%28zoom%26color%29.svg" alt="bay area counties" width="250" height="250"></center>


**Hypothesis**: Salaries in the counties of interest near Silicon Valley meet or exceed state salaries.
* San Francisco, San Mateo, Santa Clara, Alameda
* Santa Cruz, which is south of San Mateo (grayed out on the map), was added for interest

### Import Libraries

In [28]:
import pandas as pd
import numpy as np
import matplotlib as mpl
from IPython.display import display  # for displaying pandas dataframe

### Read in Data
**Alameda County**  
This data requires substantial preprocessing before it can be used. The original file was a PDF that was converted to a "csv" via an [online tool](https://www.zamzar.com/). There are some issues with the original data and the conversion that need to be addressed:
* <u>Original PDF Format</u>
  * The tabular data is split across pages and follows a repeating pattern.
    * 8 rows of irrelevant data, 28 rows of relevant job data, and repeat ...
  * Since the data was too long to fit onto a single row, the data for a single job overflows to the next row.
    * Ex: Rows 8 and 9 (as indexed in the preview of the raw data) contain job info for ACCESS Program Clinical Mgr.
    * Row 8 has two columns' worth of data: "JobCode" and "JobDescription".
    * Row 9 has nineteen columns' worth of data including pay steps, min and max monthly salary, etc.
    * *Hint: After removing the irrelevant data, separate the dataframe into even and odd rows, then concat the rows.* 
* <u>CSV Conversion</u>
  * The last 3 columns were scrambled in the conversion.
  * Column 17 marks if the job is "FLSA": X = yes, N = no. For jobs marked "N", the flag was merged into Column 16 "AnnualMax" salary, e.g. row 13 with "72,306.00 N".
    * *FLSA stands for the Fair Labor Standards Act, a federal law that sets minimum wage and overtime standards.*
  * Column 18 marks the standard hours for the job e.g. 80, 75, etc.
    * If the job is not FLSA then the standard hours ended up in column 17 instead of 18.
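
The hint about even and odd rows can be illustrated on a toy frame. This is a minimal sketch rather than the notebook's final code: the frame `toy` is a hypothetical four-column stand-in for the cleaned table, and the column names `MgmtGrp`, `Class`, and `EffectiveDate` are illustrative labels loosely based on the report header.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned table: each job spans two rows.
# Even rows hold the job code and title; odd rows hold the pay details.
toy = pd.DataFrame(
    [
        ["6517", "ACCESS Program Clinical Mgr", None, None],
        ["21", "SM", "05/12/2024", 157352.0],
        ["5142", "ALL IN Physician", None, None],
        ["29", "SM", "12/25/2022", 321734.4],
    ]
)

# Even-positioned rows: keep only the two populated columns.
titles = toy.iloc[0::2, :2].reset_index(drop=True)
titles.columns = ["JobCode", "JobDescription"]

# Odd-positioned rows: the pay details (column names are illustrative).
details = toy.iloc[1::2].reset_index(drop=True)
details.columns = ["MgmtGrp", "Class", "EffectiveDate", "AnnualMax"]

# Concatenate side by side so each job occupies a single row.
jobs = pd.concat([titles, details], axis=1)
print(jobs)
```

Resetting the index on both halves matters here: `pd.concat(..., axis=1)` aligns on index labels, so without it the even and odd rows would not line up.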


In [29]:
alameda_raw = pd.read_csv('data/Alameda_County_Pay_Schedule_2024-11-15.csv', header=None)
display(alameda_raw.head(14))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,,,,,,,,,,PeopleSoft,,,,,,,,,
1,Report ID:,HXHRI003,,,,,,,,JOBCODE SALARY REPORT,,,,,,Page No.,1,,
2,,,,,,,,,,,,,,,,Run Date 11/16/2024,,,
3,As Of Date: 11/15/2024,,,,,,,,,,,,,,,Run Time 00:29:05,,,
4,Sorted By:,Job Description,,,,,,,,,,,,,,,,,
5,,,,,,,,,,,,,,Approx,Approx,Approx,,,
6,Jobcode/ Mgmt,,Effective,Union Job,,,,,,,,,,Comp,Monthly,Monthly,Annual,Std,
7,Job Grp,Class,Date,Code,Family,Grd,Step 01,Step 02,Step 03,Step 04,Step 05,Step 06,Step 07,Freq,Min,Max,Max FLSA,Hrs,
8,6517,ACCESS Program Clinical Mgr,,,,,,,,,,,,,,,,,
9,21,SM,05/12/2024,U15,120,R02,4930.4,,,,6052,,,B,10682.53,13112.67,157352,X,80.0


**Extract Relevant Rows**

In [30]:
# Get row indices of 1st Col that has string 'Job Grp' which denotes start of table
indices = alameda_raw[alameda_raw[0].str.contains('Job Grp', na=False)].index
indices += 1  # shift 1 to go from table headers to table data

extract = [row for start in indices for row in range(start, start + 28)]  # 28 job rows follow each table header

alameda = alameda_raw[alameda_raw.index.isin(extract)]  # keep only rows that contain relevant info

alameda = alameda.drop(alameda.tail(1).index)  # drop last row because it says 'END OF REPORT'
alameda = alameda.reset_index(drop=True)
display(alameda.head(8))
print(len(alameda))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,6517,ACCESS Program Clinical Mgr,,,,,,,,,,,,,,,,,
1,21,SM,05/12/2024,U15,120.0,R02,4930.4,,,,6052.0,,,B,10682.53,13112.67,157352.0,X,80.0
2,5142,ALL IN Physician,,,,,,,,,,,,,,,,,
3,29,SM,12/25/2022,R45,485.0,T64,10184.8,,,,12374.4,,,B,22067.07,26811.2,321734.4,X,80.0
4,1281,Absentee Voting Technician,,,,,,,,,,,,,,,,,


2987
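
Since each job is expected to occupy two rows, an odd row count suggests that at least one row does not fit the title/detail pattern. The sketch below shows one way such rows might be located. The helper `pattern_breaks` is hypothetical, and it assumes that a title row has values only in its first two columns.

```python
import numpy as np
import pandas as pd

def pattern_breaks(df: pd.DataFrame) -> list:
    """Return row positions that do not fit the title/detail alternation."""
    # Assumption: a title row has values only in its first two columns.
    is_title = df.iloc[:, 2:].isna().all(axis=1).to_numpy()
    # Even positions (0, 2, 4, ...) are expected to be title rows.
    expected = np.arange(len(df)) % 2 == 0
    return np.flatnonzero(is_title != expected).tolist()

# Toy frame with one hypothetical stray title row at position 3.
toy = pd.DataFrame(
    [
        ["6517", "ACCESS Program Clinical Mgr", None, None],
        ["21", "SM", "05/12/2024", "U15"],
        ["5142", "ALL IN Physician", None, None],
        ["9999", "Hypothetical stray title row", None, None],
        ["29", "SM", "12/25/2022", "R45"],
    ]
)
print(pattern_breaks(toy))
```

The first position returned marks where the alternation first breaks; every later row is shifted by one, so it is flagged as well.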


**Fix Column 18 "Standard Hours"**

In [35]:
# Find rows where column 17 holds a digit string (the number of hours) and copy those values to column 18
mask = alameda[17].astype(str).str.isdigit()  # convert col to type str since it's a mixed type, otherwise isdigit() fails
temp = alameda.loc[mask, 17].rename(18)  # apply mask and rename the Series to 18 so update() targets column 18
display(temp)
alameda.update(temp)
display(alameda.head(8))

5       75
7       75
9       75
11      75
13      75
        ..
2978    75
2980    80
2982    75
2984    75
2986    75
Name: 18, Length: 1028, dtype: object

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,6517,ACCESS Program Clinical Mgr,,,,,,,,,,,,,,,,,
1,21,SM,05/12/2024,U15,120.0,R02,4930.4,,,,6052.0,,,B,10682.53,13112.67,157352,X,80.0
2,5142,ALL IN Physician,,,,,,,,,,,,,,,,,
3,29,SM,12/25/2022,R45,485.0,T64,10184.8,,,,12374.4,,,B,22067.07,26811.2,321734.4,X,80.0
4,1281,Absentee Voting Technician,,,,,,,,,,,,,,,,,
5,62,NM,07/07/2024,10,556.0,C66,2331.75,2439.0,2538.0,2661.75,2781.0,,,B,5052.13,6025.5,"72,306.00 N",75,75.0
6,0205N,Account Clerk Assist SAN TAP,,,,,,,,,,,,,,,,,
7,64,NM,07/07/2024,39,,O84,13.76,,,,18.92,,,H,,,N,75,75.0
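
The fix above relies on how `DataFrame.update` aligns data: a Series is matched to the column whose label equals the Series name, and only rows present in the Series index are changed. The toy example below sketches that behavior. The frame `toy` is hypothetical, and unlike the cell above it casts the hours to `float` before updating so that column 18 keeps a numeric dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: in row 1 the hours landed in column 17.
toy = pd.DataFrame({17: ["X", "75", "N"], 18: [80.0, np.nan, np.nan]})

# Rows where column 17 holds a digit string, i.e. a misplaced hour count.
mask = toy[17].astype(str).str.isdigit()

# Name the Series 18 so update() writes into column 18.
hours = toy.loc[mask, 17].astype(float).rename(18)

toy.update(hours)  # in place; only row 1 of column 18 changes
print(toy)
```

Column 17 is left untouched by the update, which is why the misplaced values there still need to be replaced in a later step.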


**Fix Column 16 "AnnualMax" Salary and Column 17 "FLSA"**  
Split the merged values in Column 16, then write the salary back to Column 16 and the FLSA flag to Column 17.

In [36]:
mask2 = alameda[16].astype(str).str.contains(r'\s')  # look for rows that have whitespace i.e. needs splitting
temp = alameda.loc[mask2, 16].str.split(' ', n=1, expand=True)  # two col df with new column names 0 and 1
temp = temp.rename(columns={0: 16, 1: 17})  # match the target columns for the update
display(temp)
alameda.update(temp)  # update alameda df with temp df where rows and cols match
display(alameda.head(8))

Unnamed: 0,16,17
5,72306.00,N
9,65656.50,N
13,70102.50,N
17,106496.00,N
19,94341.00,X
...,...,...
2978,83908.50,N
2980,80163.20,N
2982,87067.50,N
2984,119866.50,X


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,6517,ACCESS Program Clinical Mgr,,,,,,,,,,,,,,,,,
1,21,SM,05/12/2024,U15,120.0,R02,4930.4,,,,6052.0,,,B,10682.53,13112.67,157352,X,80.0
2,5142,ALL IN Physician,,,,,,,,,,,,,,,,,
3,29,SM,12/25/2022,R45,485.0,T64,10184.8,,,,12374.4,,,B,22067.07,26811.2,321734.4,X,80.0
4,1281,Absentee Voting Technician,,,,,,,,,,,,,,,,,
5,62,NM,07/07/2024,10,556.0,C66,2331.75,2439.0,2538.0,2661.75,2781.0,,,B,5052.13,6025.5,72306.00,N,75.0
6,0205N,Account Clerk Assist SAN TAP,,,,,,,,,,,,,,,,,
7,64,NM,07/07/2024,39,,O84,13.76,,,,18.92,,,H,,,N,75,75.0


In [40]:
mask3 = alameda[16].isin(['N', 'X'])  # look for rows where column 16 holds only the FLSA flag
temp = alameda.loc[mask3, 16].rename(17)
display(temp)
alameda.update(temp)  # update alameda df with temp df where rows and cols match
alameda.loc[mask3, 16] = np.nan  # clear the misplaced flag; .loc avoids chained assignment
display(alameda.head(8))

7       N
11      N
15      N
39      N
41      N
       ..
2670    N
2848    N
2850    N
2858    N
2884    N
Name: 17, Length: 129, dtype: object

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,6517,ACCESS Program Clinical Mgr,,,,,,,,,,,,,,,,,
1,21,SM,05/12/2024,U15,120.0,R02,4930.4,,,,6052.0,,,B,10682.53,13112.67,157352.0,X,80.0
2,5142,ALL IN Physician,,,,,,,,,,,,,,,,,
3,29,SM,12/25/2022,R45,485.0,T64,10184.8,,,,12374.4,,,B,22067.07,26811.2,321734.4,X,80.0
4,1281,Absentee Voting Technician,,,,,,,,,,,,,,,,,
5,62,NM,07/07/2024,10,556.0,C66,2331.75,2439.0,2538.0,2661.75,2781.0,,,B,5052.13,6025.5,72306.0,N,75.0
6,0205N,Account Clerk Assist SAN TAP,,,,,,,,,,,,,,,,,
7,64,NM,07/07/2024,39,,O84,13.76,,,,18.92,,,H,,,,N,75.0
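
If the cleaned "AnnualMax" values are still stored as strings with thousands separators (such as "72,306.00"), they need to be converted to numbers before salaries can be compared. The sketch below shows one possible approach on a hypothetical Series `raw`; it is not taken from the notebook.

```python
import pandas as pd

# Hypothetical sample of cleaned AnnualMax values stored as strings.
raw = pd.Series(["72,306.00", "157352", None, "65,656.50"])

# Strip thousands separators, then convert; unparseable entries become NaN.
annual_max = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(annual_max)
```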


### References
**Websites.** The following websites are the sources of the salary information.
* Alameda County https://salaryordinance.alamedacountyca.gov/article-1/
* San Francisco County https://www.sf.gov/resource/2023/classification-and-compensation-data
* San Mateo County https://www.smcgov.org/hr/job-classification-table
* Santa Cruz County https://www2.santacruzcountyca.gov/personnel/salsched/salsched.asp
* Santa Clara County https://esa.santaclaracounty.gov/outside-organizations/human-resources/master-salary-ordinance-executive-leadership-salary-ordinance
* California State https://eservices.calhr.ca.gov/EnterpriseHRPublic/payscales/payscalesearch