# Ontario's Sunshine List (1996-2020)

Project Group 2:

- William Conley - 100782574
- Reese Dominguez - 100775764
- Alex Sawatzky - 100790274
- Joshua Trower - 100791683

## Introduction

### The Ontario Public Sector Salary Disclosure/Sunshine List

The [Ontario Public Sector Salary Disclosure](https://www.ontario.ca/page/public-sector-salary-disclosure#section-1), henceforth referred to as the **Sunshine List** in this notebook, has been collected since 1996 as a result of the **Public Sector Salary Disclosure Act** passed by the Government of Ontario. The act requires public service organizations to submit staff and salary information for all staff who earn \$100,000 or more during a calendar year by March of the following year. The aim is to make this data available to the public and to hold public service organizations accountable for their use of the province's funding.

Although this is a positive step for transparency at the government level, the data does not include basic demographics such as gender and race, an omission that is all the more notable since the passing of the [Anti-Racism Act in 2017](https://news.ontario.ca/en/release/44976/ontario-passes-anti-racism-legislation). We want to explore any gender and racial inequities among the public sector's highest-paid employees.

We start by answering these basic questions, and then address any further questions that arise as a result:

1. Which sector has the most people/largest budget?
2. What proportion of Sunshine Listers are women?
3. What proportion of Sunshine Listers are racialized?
4. How much of a change has there been since 1996?
5. Are any sectors improving in their racial/gender representation?
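
As a rough illustration of how question 1 could be approached, the sketch below groups a dataframe by sector and reports headcount and total disclosed salary. It assumes the cleaned column names `Sector` and `Salary Paid` used later in this notebook. The function name `summarize_sectors` and the toy rows are hypothetical, and "budget" is approximated here by the sum of disclosed salaries only.

```python
import pandas as pd

# Hypothetical toy rows standing in for the cleaned Sunshine List
toy_sunshine = pd.DataFrame({
    "Sector": ["Hospitals", "Hospitals", "Universities"],
    "Salary Paid": [110309.00, 120000.00, 193168.37],
})

def summarize_sectors(df):
    """Headcount and total disclosed salary per sector, largest headcount first."""
    return (
        df.groupby("Sector")["Salary Paid"]
          .agg(headcount="count", total_salary="sum")
          .sort_values("headcount", ascending=False)
    )

sector_summary = summarize_sectors(toy_sunshine)
print(sector_summary)
```

Adding `"Calendar Year"` to the `groupby` keys would give the same summary per year, which is one way to approach question 4.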

## Description of Data

As the Sunshine List has already been described in the introduction, we take this opportunity to introduce our auxiliary/helper datasets. Note that this has been taken almost verbatim from the descriptions in our proposal:

#### World Gender Name Dictionary
Link: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/MSEGSJ 

Citation: Raffo, Julio, 2021, "WGND 2.0", https://doi.org/10.7910/DVN/MSEGSJ, Harvard Dataverse, V1, UNF:6:5rI3h1mXzd6zkVhHurelLw== [fileUNF]

This data set is available on Harvard's Dataverse, an archive of data that has been used in research and is made available to other researchers. The data was compiled by Martinez, Lax, Raffo and Saito in 2016 as part of their research paper “Identifying the Gender of PCT Inventors.” The authors compiled more than 26 million records, linking given names to gender across more than 195 countries and territories. Similar to the work of Raffo, this dataset will be used to explore the representation of women on the Sunshine List. It is important to note that using a database of names to assign genders is an imprecise approach and may misattribute gender in many cases, since many names (like Chris or Sam) are used by more than one gender. However, since the Sunshine List does not include gender, we hope that this imprecise approach may still offer some value when considering gender representation on the Sunshine List. This data has been made available under CC0, the “Public Domain Dedication”.
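
A minimal sketch of how such a name lookup might work is shown below. It assumes the `name` and `gender` columns shown for the WGND later in this notebook, and it matches only the first token of `First Name` so that entries like "Ken W" still resolve. The function name `attach_gender` and the toy rows are hypothetical.

```python
import pandas as pd

def attach_gender(people, names):
    """Add a 'gender' column by looking up the first token of 'First Name'."""
    # One row per name so the lookup index is unique
    lookup = names.drop_duplicates("name").set_index("name")["gender"]
    first_token = people["First Name"].str.split().str[0]
    out = people.copy()
    out["gender"] = first_token.map(lookup)  # unmatched names become NaN
    return out

# Hypothetical toy data
toy_people = pd.DataFrame({"First Name": ["Ken W", "Nicole", "Xyzzy"]})
toy_names = pd.DataFrame({"name": ["Ken", "Nicole"], "gender": ["M", "F"]})

tagged = attach_gender(toy_people, toy_names)
# Share of women among the names that could be matched
share_women = tagged["gender"].eq("F").sum() / tagged["gender"].notna().sum()
```

Reporting the share only over matched names keeps unmatched rows from being silently counted as men, but the match rate itself should be reported alongside it.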

#### Most Common Names Database
Link: https://raw.githubusercontent.com/fivethirtyeight/data/master/most-common-name/surnames.csv 

This data set is available on FiveThirtyEight's [GitHub (direct link)](https://github.com/fivethirtyeight/data/blob/master/most-common-name/). [FiveThirtyEight](https://fivethirtyeight.com/) is a website devoted to using statistics to understand political, social and sporting trends. The data is maintained by Andrew Flowers as part of the article “Dear Mona, What’s The Most Common Name in America?” It is a compilation of surnames from the US Census Bureau, and includes the percentage of people with each surname who identify as White, Black, Asian, Hispanic and Multiple Races. It is important to note that determining racial identity solely by surname is an imprecise approach, given that a surname often has representation across all racial identities. However, since the Sunshine List does not include employee racial identity, we hope that this imprecise approach may still offer some value when considering the representative diversity of the Sunshine List. This data has been made available under a Creative Commons Attribution 4.0 International License.
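
One possible way to use these percentages is sketched below. It averages the surname-level percentages over everyone whose surname is found, rather than assigning a single race to any individual. It assumes the `pct*` column names shown for this dataset later in the notebook, and that suppressed cells appear as the string `(S)`. The function name `expected_composition` and the sample of people are hypothetical.

```python
import pandas as pd

PCT_COLS = ["pctwhite", "pctblack", "pctapi", "pctaian", "pct2prace", "pcthispanic"]

def expected_composition(people, surnames):
    """Mean of the surname-level percentages over all matched people."""
    table = surnames.drop_duplicates("name").set_index("name")[PCT_COLS]
    # Values arrive as strings; suppressed cells such as "(S)" become NaN
    table = table.apply(pd.to_numeric, errors="coerce")
    merged = people.merge(table, left_on="Last Name", right_index=True, how="inner")
    return merged[PCT_COLS].mean()  # NaN cells are skipped by default

# Two surname rows as they appear in the dataset, kept as strings
toy_surnames = pd.DataFrame({
    "name": ["Smith", "Zalla"],
    "pctwhite": ["73.35", "99"],
    "pctblack": ["22.22", "(S)"],
    "pctapi": ["0.4", "0"],
    "pctaian": ["0.85", "0"],
    "pct2prace": ["1.63", "0"],
    "pcthispanic": ["1.56", "(S)"],
})
toy_people = pd.DataFrame({"Last Name": ["Smith", "Zalla", "Nobody"]})

composition = expected_composition(toy_people, toy_surnames)
```

Because the percentages describe the US population, any estimate produced this way for Ontario carries that additional assumption.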

#### Ontario Demographics
Link: https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/Page.cfm?Lang=E&Geo1=PR&Code1=35&Geo2=&Code2=&Data=Count&SearchText=Ontario&Sear

This dataset is maintained by Statistics Canada and provides provincial and national statistics on population demographics. It will serve as a baseline against which to compare proportions and see whether a group is over- or under-represented. The latest publicly available census is that of 2016, as the 2021 results will be available in February 2022.
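
The baseline comparison can be expressed as a simple ratio of shares, sketched below. The function name `representation_ratio` is hypothetical, and the numbers in the usage line are placeholders rather than census or Sunshine List figures.

```python
def representation_ratio(observed_share, baseline_share):
    """Ratio of a group's share on the list to its share in the population.

    A value below 1 suggests under-representation, above 1 over-representation.
    """
    if baseline_share <= 0:
        raise ValueError("baseline_share must be positive")
    return observed_share / baseline_share

# Placeholder shares only, not real data
example_ratio = representation_ratio(0.25, 0.5)
```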

## Analysis of Data

We first import all the datasets we intend to use. For the Sunshine List, since we're working with 25 years' worth of data (1996 to 2020), we concatenate the yearly files into one dataframe.

In [2]:
#importing used libraries
import csv
import re
import pandas as pd
from functools import reduce
import numpy as np
import calendar
import string
import matplotlib.pyplot as plt

# seaborn
import seaborn as sns

In [15]:
# dataset import preparation
data_path = 'datasets/'
# reads a CSV file and returns its rows as a list of dicts
# thanks Mariana ^^;

# to anyone else working on this: modify as needed.
def get_data_csv(path):
    # newline='' is what the csv module recommends for files it reads
    with open(path, 'r', newline='') as f:
        return list(csv.DictReader(f))

# because the column headers (e.g. "Salary Paid") are spelled/capitalized
# differently from one year's list to the next...
def format_sunshine(path):
    return (pd.DataFrame(get_data_csv(path))).rename(columns = {"Salary paid" : "Salary Paid",
                                                                "Salary Paid " : "Salary Paid",
                                                                "Surname": "Last Name",
                                                                "Last name": "Last Name",
                                                                "Position": "Job Title",
                                                                "Job title": "Job Title",
                                                                "Calendar year": "Calendar Year",
                                                                "First name": "First Name",
                                                                "Taxable benefits": "Taxable Benefits"})

In [33]:
# dataset importing - sunshine lists

# concatenating all into one df
# reading every year into a list and concatenating once avoids
# re-copying the growing dataframe on each loop iteration
frames = [format_sunshine(data_path + 'sunshine' + str(yr) + '.csv')
          for yr in range(1996, 2021)]

df_sunshine = pd.concat(frames, ignore_index=True)

In [34]:
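# dropping the column with an empty header; errors='ignore' lets this
# cell be re-run after the column is already gone
df_sunshine.drop(columns=[''], inplace=True, errors='ignore')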
# More fixing up formatting: 

# putting salary paid/taxable benefits into usable format
# taxable benefits commented as i don't know if we'll use it
# df_sunshine["Taxable Benefits"] = df_sunshine["Taxable Benefits"].replace(r'[\$,]', '', regex=True).astype(float)
# raw string so that \$ is passed to the regex engine unchanged
df_sunshine["Salary Paid"] = df_sunshine["Salary Paid"].replace(r'[\$,]', '', regex=True).astype(float)

# changing all names/sectors to title case
df_sunshine["Sector"] = df_sunshine["Sector"].str.title()
df_sunshine["Last Name"] = df_sunshine["Last Name"].str.title()
df_sunshine["First Name"] = df_sunshine["First Name"].str.title()

# there may be more fixing to be done, but this is okay for now
df_sunshine

Unnamed: 0,Sector,Last Name,First Name,Salary Paid,Taxable Benefits,Employer,Job Title,Calendar Year
0,Other Public Sector Employers,Kendall,Perry,194890.40,$711.24,Addiction Research Foundation,President & CEO,1996
1,Other Public Sector Employers,Rehm,Juergen,115603.62,$403.41,Addiction Research Foundation,"Dir., Soc. Eval. Research & Act. Dir., Clin. R...",1996
2,Other Public Sector Employers,Room,Robin,149434.48,$512.58,Addiction Research Foundation,"V.P., Research & Coordinator, Intern. Programs",1996
3,Ontario Public Service,Knox,Ken W,109382.92,"$4,921.68","Agriculture,Food and Rural Affairs",Deputy Minister,1996
4,Hospitals,Cliff,Bruce,110309.00,"$3,157.00",Ajax and Pickering General Hospital,President & CEO,1996
...,...,...,...,...,...,...,...,...
1676428,Universities,Zylberberg,Joel,141478.88,$727.20,York University,Assistant Professor / Canada Research Chair,2020
1676429,Universities,Zylla,Phil,127898.47,$231.93,McMaster Divinity College,Vice President Academic,2020
1676430,Universities,Zytaruk,Nicole,113582.77,$231.93,McMaster University,Research Associate,2020
1676431,Universities,Zytner,Richard,193168.37,"$1,906.08",University Of Guelph,Professor,2020


In [36]:
# importing the WGND
df_wgnd = pd.DataFrame(get_data_csv(data_path + 'wgnd.csv'))

In [37]:
df_wgnd

Unnamed: 0,name,gender,probability
0,Aaban,M,1
1,Aabha,F,1
2,Aabid,M,1
3,Aabriella,F,1
4,Aada,F,1
...,...,...,...
95021,Zyvon,M,1
95022,Zyyanna,F,1
95023,Zyyon,M,1
95024,Zzyzx,M,1


In [38]:
# importing the MCND
df_mcnd = pd.DataFrame(get_data_csv(data_path + 'surnames.csv'))

In [40]:
# bracket indexing makes it explicit that "name" is a column
df_mcnd["name"] = df_mcnd["name"].str.title()

df_mcnd

Unnamed: 0,name,rank,count,prop100k,cum_prop100k,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
0,Smith,1,2376206,880.85,880.85,73.35,22.22,0.4,0.85,1.63,1.56
1,Johnson,2,1857160,688.44,1569.3,61.55,33.8,0.42,0.91,1.82,1.5
2,Williams,3,1534042,568.66,2137.96,48.52,46.72,0.37,0.78,2.01,1.6
3,Brown,4,1380145,511.62,2649.58,60.71,34.54,0.41,0.83,1.86,1.64
4,Jones,5,1362755,505.17,3154.75,57.69,37.73,0.35,0.94,1.85,1.44
...,...,...,...,...,...,...,...,...,...,...,...
151666,Yousko,150436,100,0.04,89752.93,99,(S),0,0,0,(S)
151667,Zaitsev,150436,100,0.04,89753.04,92,(S),0,0,7,(S)
151668,Zalla,150436,100,0.04,89753.11,99,(S),0,0,0,(S)
151669,Zerbey,150436,100,0.04,89753.3,99,(S),0,0,0,(S)


In [46]:
# importing the Ontario census
# TODO: that

## Exploratory Data Analysis

## Potential Data Science

## Conclusion