# Using the Gender Pay Gap Data

This is an example of potential uses of the Gender Pay Gap datasets. There are lots of other ways this data could be used, and using it doesn't mean you have to write code, but I thought it'd be useful to try to manipulate the data a bit and comment on the pain points I find along the way.

## Why's there loads of scary code here?
The code in this notebook is the actual code doing all the things you can see in the output. It's not really that important - it's just how I've made the outputs show what they show! For the most part, I've tried to explain what I'm doing in each bit of code, but it's not at all necessary to read through it - only the outputs are important in the long run!

# Step 1: Boring setup
To do things with the data, I need to import a bunch of functionality and tell the system that I want to use the GPG data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import datetime as dt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Step 2: Getting the data
Before I can do anything, I need to import the data into this environment so I can start working with it! 

The .csv file format of the data downloads from the GPG service website makes it really easy to do this. I've created a dataset from all the available data (across every reporting year), and pulled it into this environment so I can start working with it. The output below displays the first 5 lines of data once it's been imported, so we can take a look at its structure and think about how best to use it.

In [None]:
#import the dataset
pay_gap_data_2017_18 = pd.read_csv('../input/uk-gender-pay-gap-data-2019-to-2020/UK Gender Pay Gap Data - 2017 to 2018.csv', parse_dates=True)
pay_gap_data_2018_19 = pd.read_csv('../input/uk-gender-pay-gap-data-2019-to-2020/UK Gender Pay Gap Data - 2018 to 2019.csv', parse_dates=True)
pay_gap_data_2019_20 = pd.read_csv('../input/uk-gender-pay-gap-data-2019-to-2020/UK Gender Pay Gap Data - 2019 to 2020.csv', parse_dates=True)
pay_gap_data_2020_21 = pd.read_csv('../input/uk-gender-pay-gap-data-2019-to-2020/UK Gender Pay Gap Data - 2020 to 2021.csv', parse_dates=True)

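# Stack the four reporting years into one table; ignore_index=True avoids duplicate row labels
pay_gap_data = pd.concat([pay_gap_data_2017_18, pay_gap_data_2018_19, pay_gap_data_2019_20, pay_gap_data_2020_21], ignore_index=True)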

pay_gap_data.head()

# Step 3: Thinking about what to do with the data
Before I actually _do_ anything with the data, I need to think about what I'm trying to achieve. I want to come up with a couple of clear goals, so that I can try using the data in vaguely real-world contexts and see what pain points come out of that. 

I've decided on a relatively simple goal:
* Visualise how the Gender Pay Gap for a specific business sector (e.g. technology, education, etc.) has changed over time

This is very possible, and will allow me to test out the ease of use of the data.

# Step 4: Do the work

### 4A: Visualising how the Gender Pay Gap for a specific business sector has changed over time
The GPG data is for all sectors across all reporting years, so I'll need to do some work to get the chunks of the data that I actually need.

I need to:
* Split up the data so I only have the data for a certain business sector
* Find the chunks of data for each year, and average them somehow
* Present the result on a graph

### Problem 1: Business sectors aren't available in the data
Damn, already a problem! I can't find an obvious way to divide the data up by business sector (e.g. technology, education, etc.) - the only hint at a sector in the data is the SIC code, and multiple SIC codes form a sector. Probably the best approximation that won't take ages is to just pick a SIC code and see how that changed over time - the only alternative is to manually search around for all the SIC codes that make up a sector (if that's even how they relate), and that's definitely too time-consuming to do just to test out the data.

For now, I'll use data from the SIC code 62020: Information Technology Consultancy Activities

**Solution: Add BusinessSector column to the data downloads**
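If I did want a rough sector without hunting down individual codes, one possible shortcut is to group by the first two digits of each SIC code (the division). This is only a sketch: it assumes the UK SIC 2007 structure, in which divisions 58 to 63 make up Section J (Information and communication), and the sample rows and the `sic_divisions` helper are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as the SicCodes column
sample = pd.DataFrame({
    "EmployerName": ["A Ltd", "B Ltd", "C Council", "D Ltd"],
    "SicCodes": ["62020,\n70100", "58290", "1,\n84110", None],
})

def sic_divisions(sic_codes):
    """Return the set of two-digit SIC divisions found in one SicCodes cell."""
    if pd.isna(sic_codes):
        return set()
    codes = [code.strip() for code in sic_codes.split(",") if code.strip()]
    # Zero-pad to five digits so that e.g. "1110" is read as "01110"
    return {code.zfill(5)[:2] for code in codes}

# Assumption: Section J (Information and communication) covers divisions 58 to 63
section_j = {str(division) for division in range(58, 64)}

# Keep rows where at least one of the organisation's codes falls in Section J
in_section_j = sample["SicCodes"].apply(lambda cell: bool(sic_divisions(cell) & section_j))
tech_sample = sample[in_section_j]
```

This would still miss organisations with no SIC codes at all, so it's an approximation rather than a replacement for a proper BusinessSector column.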

### Problem 2: Some organisations don't have any SIC codes
Adding SIC codes to an organisation on the GPG service is kind of optional, which means that there are a bunch of rows in the data without SIC codes. This is a pretty big problem, because it means that we could miss important information if an organisation hasn't added a SIC code. Equally inconveniently, it also causes the code to fail unless I go through and manually replace empty cells with a number or remove the rows entirely - I'm going to remove them, since they won't be useful here at all.

**Solution: Make at least one SIC code mandatory? Or at least give a warning on the site that some of these might be missing**

The code below removes the rows without SIC codes from the data, and then creates a new, smaller dataset containing only rows with the SIC code 62020. The table output shows the first 5 rows of this data, and you can see that they don't necessarily _only_ have the SIC code 62020 - the first row has SicCodes 62020 and 70100. It's a bit weird that there are these \n separators within the data, and that might be something we can investigate and fix if we do some work on the data downloads in future.

In [None]:
# Remove rows with no SIC codes
cleaned_sic_codes_data = pay_gap_data.dropna(subset=['SicCodes'])

# Create dataset of all companies with SIC code 62020 (Information technology consultancy activities)
it_consultancy_data = cleaned_sic_codes_data[cleaned_sic_codes_data['SicCodes'].str.contains('62020')]

it_consultancy_data.head()
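As an aside, the \n separators can be tidied up by splitting the SicCodes text into individual codes. The sketch below uses made-up rows rather than the real data, and the `SicCodeList` and `SicCode` column names are my own invention. It splits on commas (plus any surrounding whitespace) and then gives each code its own row, so a code can be matched exactly rather than as a piece of text.

```python
import pandas as pd

# Hypothetical rows mimicking the ",\n" separated SicCodes column
sample = pd.DataFrame({
    "EmployerName": ["A Ltd", "B Ltd"],
    "SicCodes": ["62020,\n70100", "62020"],
})

# Split on commas, swallowing any whitespace (including \n) around them
sample["SicCodeList"] = sample["SicCodes"].str.split(r"\s*,\s*")

# One row per (employer, SIC code) pair
one_code_per_row = sample.explode("SicCodeList").rename(columns={"SicCodeList": "SicCode"})

# Exact match on a single code
exact_matches = one_code_per_row[one_code_per_row["SicCode"] == "62020"]
```

The trade-off is that an organisation with several codes now appears on several rows, so any averages would need to be taken with that in mind.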

### Problem 3: There's no way to link an organisation over time

In this data, we've got no unique identifier for each organisation. Organisations don't necessarily have Companies House numbers (none in the public sector will have them), and they could change their names or addresses between years, so it's pretty much impossible to find data for the same organisation year-on-year. An organisation ID would be _incredibly_ useful here, because it'd let me filter the data down to just a single organisation if I wanted and not have to guess that they were the same from their name and address!

Edit: I've looked at this data multiple times, and yet it's only just now that I've noticed the ID column. I have _no idea_ whether this is equivalent to an OrganisationId, or if it's just a completely separate ID. I also have no idea if an organisation will keep the same ID over multiple submissions, or if, like organisation name, this can change.

**Solution: Add an OrganisationId to the data downloads, or make what the ID (and other columns) actually are much clearer**
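One way to test whether an ID column behaves like a stable organisation identifier would be to count how many reporting years and how many distinct names each ID appears with. This is a sketch on made-up rows: `EmployerId` is my assumed name for the ID column, and `ReportingYear` is a column I'd have to derive first.

```python
import pandas as pd

# Hypothetical rows; "EmployerId" is an assumed name for the ID column
sample = pd.DataFrame({
    "EmployerId": [1, 1, 2, 2, 3],
    "EmployerName": ["A Ltd", "A Limited", "B Ltd", "B Ltd", "C Ltd"],
    "ReportingYear": [2018, 2019, 2018, 2019, 2019],
})

# For each ID, count the distinct years and distinct names it appears with
per_id = sample.groupby("EmployerId").agg(
    years_reported=("ReportingYear", "nunique"),
    distinct_names=("EmployerName", "nunique"),
)

# IDs seen in more than one year, and IDs whose name changed between rows
repeat_ids = per_id[per_id["years_reported"] > 1]
renamed_ids = per_id[per_id["distinct_names"] > 1]
```

If most IDs show up in several years with near-identical names, that would suggest (but not prove) that the ID follows the organisation over time.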

### Problem 4: DueDate is the only thing that tells me if an organisation is public or private
There's no Sector column in the data, so at first glance it isn't obvious that the data is for _both_ public and private sectors. When you examine the data a bit, it's clear that there's something weird going on, because the DueDate column matches year on year...but for two different dates each year. From my contextual knowledge I know that this is because of the two different reporting deadlines for public and private sectors, but this is _really_ confusing and a huge pain to untangle. I'm going to have to manually filter through these dates, and separate the data into public and private sectors by checking that the DueDate matches the day and month that I'd expect - if these change at any point in the future, I'd have to completely update the code.

Also, helpfully, actually _being_ public or private sector doesn't necessarily dictate when you submit your GPG data. A whole bunch of public sector organisations submit on the private sector deadline, which makes these DueDates practically useless in determining whether an organisation is _actually_ public or private sector...

**Solution: Add a ReportingSector column to the data**

### Problem 5: It's a huge effort to split the data by reporting date
I've had to give up on trying to do this, because I can't find a way to split the data by reporting date so that I get two datasets - one with all data with a submission deadline (DueDate) of the 30th March (across all years), and the other with all the data with a submission deadline (DueDate) of the 4th April (across all years). I'm sure it's _technically_ possible, but it'll take me a huge amount of time to figure out how to do it, because it's actually really complicated trying to select data from any year, but a specific day and month.

**Solution: Having two columns for DueDate - ReportingDate (day and month, but no year) and ReportingYear (just 2017/2018/etc) would make this a lot easier**

Since I can't easily split the two reporting deadlines, I'm going to try and plot the average Gender Pay Gap across both sectors year-by-year.
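For anyone who does want to attempt the split, one possible route is the pandas `.dt` accessor, which exposes the day and month of a datetime column separately from the year. The sketch below uses made-up rows and the two deadlines described above (30th March and 4th April). It assumes DueDate has already been parsed as a datetime, and that the real data uses exactly those dates, which I haven't verified.

```python
import pandas as pd

# Hypothetical rows with DueDate already parsed as datetimes
sample = pd.DataFrame({
    "EmployerName": ["A Council", "B Ltd", "C Council", "D Ltd"],
    "DueDate": pd.to_datetime(["2018-03-30", "2018-04-04", "2019-03-30", "2019-04-04"]),
})

# Pull the day and month out of every date, ignoring the year
day = sample["DueDate"].dt.day
month = sample["DueDate"].dt.month

# One dataset per deadline, across all years
march_30_deadline = sample[(month == 3) & (day == 30)]
april_4_deadline = sample[(month == 4) & (day == 4)]
```

This still relies on hard-coded dates, so it would need updating if the deadlines ever changed, and it doesn't help with public sector organisations that report on the private sector deadline.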

### Problem 6: What on earth do all these columns mean?
Obviously I need to use some of the data to plot the average Gender Pay Gap, but I have _no idea_ which column is best for this. Is DiffMeanHourlyPercent more useful than DiffMedianHourlyPercent? How are these calculated? What do the quartiles actually tell me about the Gender Pay Gap? 

I'm also _baffled_ by the negative values in these columns - I have absolutely no idea what a negative percentage would mean! I tried looking on the GPG service website for some clarification, but I couldn't find anything in the reports or elsewhere that helped me figure this out.

Whilst a lot of people will have studied the GPG enough to know the answers to these questions, it's hard to imagine that anyone without this context would be able to do a lot with this data. There's a lot of it - and the column names are confusing - which makes it really hard to tell what's useful and what's not, and also to infer any extra meaning from it. 

**Solution: There should definitely be somewhere on the site - or a .txt file downloaded with the data - that explains the column names and what they actually are, and also what each piece of data might be useful for**
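For reference, my understanding of the published reporting guidance is that the hourly pay gap is the difference between men's and women's hourly pay, expressed as a percentage of men's hourly pay. If that's right, a negative value would mean women are paid more than men on that measure. The worked example below is only a sketch of that formula with made-up pay figures, and `pay_gap_percent` is a hypothetical helper rather than anything from the GPG service.

```python
def pay_gap_percent(men_hourly, women_hourly):
    """Gap as a percentage of men's hourly pay (assumed formula)."""
    return (men_hourly - women_hourly) / men_hourly * 100

# Men earn 20.00 an hour and women 15.00: a positive gap
gap_men_higher = pay_gap_percent(20.0, 15.0)

# Men earn 20.00 an hour and women 25.00: a negative gap
gap_women_higher = pay_gap_percent(20.0, 25.0)
```

The same calculation would apply to either the mean or the median hourly pay, which is presumably the difference between DiffMeanHourlyPercent and DiffMedianHourlyPercent.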

Since I'm not sure which data is best for this, I'm going to take an average of the DiffMeanHourlyPercent for all organisations each year, and plot that.

In [None]:
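# Parse DueDate as a datetime, then pull out the year to plot against.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
it_consultancy_data = it_consultancy_data.assign(DueDate=pd.to_datetime(it_consultancy_data['DueDate']))
it_consultancy_data = it_consultancy_data.assign(ReportingYear=it_consultancy_data['DueDate'].dt.year)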

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(16,6))

sns.lineplot(data=it_consultancy_data, x="ReportingYear", y="DiffMeanHourlyPercent", hue="EmployerSize", ci=None)

### Ta da! A result!

The graph above shows the average DiffMeanHourlyPercent for IT consultancy organisations over time, split by employer size.
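The line plot does the averaging for me, but the same numbers can be produced as a table by grouping on year and employer size and taking the mean. This sketch uses made-up rows with the same column names as the real data.

```python
import pandas as pd

# Hypothetical rows using the same column names as the plot above
sample = pd.DataFrame({
    "ReportingYear": [2018, 2018, 2019, 2019],
    "EmployerSize": ["250 to 499", "250 to 499", "250 to 499", "500 to 999"],
    "DiffMeanHourlyPercent": [10.0, 20.0, 12.0, 8.0],
})

# Average DiffMeanHourlyPercent for each year and employer size
yearly_average = (
    sample.groupby(["ReportingYear", "EmployerSize"])["DiffMeanHourlyPercent"]
    .mean()
    .reset_index()
)
```

Having the averages in a table makes it easier to check exactly what each point on the graph represents.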


# Step 5: Conclusions

Despite a few irritating pitfalls, I managed to plot a graph from some of the data. There'd always be a few things that would take some time, but I think there's a good set of steps that can be taken to improve the data downloads and make relatively simple tasks like this - and other more complicated ones - a lot easier and quicker to achieve!

To summarise the previously proposed solutions:
* Add BusinessSector column
* Make at least one SIC code on an organisation mandatory - or at least have a disclaimer with the data downloads that explains that this column can be empty
* Add an OrganisationId column, or improve the naming of the ID column (and others) so that its purpose is clear
* Add a ReportingSector column to the data
* Have two columns for DueDate - ReportingDate (day and month, but no year) and ReportingYear (just 2017/2018/etc)
* There should definitely be somewhere on the site - or a .txt file downloaded with the data - that explains the column names and what they actually are, and also what each piece of data might be useful for

There's also a fantastic Data Science platform called Kaggle, which makes varied and interesting datasets available. Anyone can create and share datasets, and tasks can be assigned to a dataset to provide clear goals for Data Scientists wanting to get involved with the data and help a specific cause. Tasks can be clear and concrete, or questions that a Data Scientist or team could solve in a variety of ways. In investigating these data downloads, I first created a private dataset on Kaggle from the .csv files I downloaded from the GPG site. Right now, only I can see and use the dataset, but making a dataset like this available to the Data Science community with clear tasks could spark some interest and involvement with the data.

Kaggle rate datasets available on the platform on a 1-10 scale of Usability. They outline a number of steps to improve the usability of a dataset which I thought it would be useful to mention here as well:
* Easy to understand and includes essential metadata
    * Add a subtitle
    * Add a description
* Rich, machine readable file formats and metadata
    * Add file information: help others navigate your dataset with a description of each file
    * Include column descriptors: empowers others to understand your data by describing its features
    * Specify a licence: help other users understand how they can work with and share the data
    * Use preferred file formats (.csv already satisfies this)
* Assurances the dataset is maintained
    * Specify provenance: let others know how the data was collected and organised
    * Specify update frequency: let other users know if the dataset will be regularly updated
    * Provide an example of the data in use so other users can get started quickly
    * Suggest an analysis users can do with this dataset
