# Analysis of the Gender Pay Gap by Flying Geckos
by Maryann Foley, Tiffany Moi, and Helen Ye

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
# import sklearn

### Our Goal
We will analyze a dataset that lists, for each occupation, the number of male and female workers and their weekly pay. We will look for evidence of a pay gap and examine how it relates to specific occupations and to women's representation in each occupation.

In [6]:
data = pd.read_csv('data.csv')
for col in data.columns:
    if col == 'Occupation':
        continue
    data[col] = pd.to_numeric(data[col], errors='coerce')
data.head()

Unnamed: 0,Occupation,All_workers,All_weekly,M_workers,M_weekly,F_workers,F_weekly
0,ALL OCCUPATIONS,109080,809.0,60746,895.0,48334,726.0
1,MANAGEMENT,12480,1351.0,7332,1486.0,5147,1139.0
2,Chief executives,1046,2041.0,763,2251.0,283,1836.0
3,General and operations managers,823,1260.0,621,1347.0,202,1002.0
4,Legislators,8,,5,,4,
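The loading cell above converts every column except `Occupation` with `pd.to_numeric(..., errors='coerce')`, which is why the Legislators row shows empty wage values. The sketch below illustrates that behavior on a toy series. The placeholder string `"Na"` is an assumption about how the raw file marks suppressed figures; the actual file may use a different marker.

```python
import pandas as pd

# Hypothetical raw values: "Na" is an assumed placeholder for a suppressed
# figure, not something confirmed from data.csv.
raw = pd.Series(["2041", "1260", "Na"])

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
weekly = pd.to_numeric(raw, errors="coerce")

# Show which entries ended up missing after coercion.
print(weekly.isna().tolist())
```

Any entry that fails to parse becomes `NaN`, so it can be filtered out later with `notnull()` or `dropna()`.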


### Cleaning the Data
After loading the data, we filter out every row in which any of the three weekly wage values is missing.

In [7]:
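# .copy() makes all_wages an independent frame, so adding columns later does not trigger SettingWithCopyWarning
all_wages = data[data.All_weekly.notnull() & data.M_weekly.notnull() & data.F_weekly.notnull()].copy()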
print(all_wages.shape)
all_wages.head()

(142, 7)


Unnamed: 0,Occupation,All_workers,All_weekly,M_workers,M_weekly,F_workers,F_weekly
0,ALL OCCUPATIONS,109080,809.0,60746,895.0,48334,726.0
1,MANAGEMENT,12480,1351.0,7332,1486.0,5147,1139.0
2,Chief executives,1046,2041.0,763,2251.0,283,1836.0
3,General and operations managers,823,1260.0,621,1347.0,202,1002.0
6,Marketing and sales managers,948,1462.0,570,1603.0,378,1258.0
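The same row filter can also be written with `DataFrame.dropna`, which some readers find easier to scan than a chain of `notnull()` masks. This is a minimal sketch on a toy frame that reuses the notebook's column names, not the notebook's own cell. Calling `.copy()` on the result is one way to avoid the `SettingWithCopyWarning` that pandas can emit when columns are later added to a filtered frame.

```python
import numpy as np
import pandas as pd

# Toy frame with the notebook's column names; the values are copied from the
# preview rows above, with NaN standing in for the suppressed figures.
data = pd.DataFrame({
    "Occupation": ["Chief executives", "Legislators"],
    "All_weekly": [2041.0, np.nan],
    "M_weekly": [2251.0, np.nan],
    "F_weekly": [1836.0, np.nan],
})

wage_cols = ["All_weekly", "M_weekly", "F_weekly"]

# Keep only rows where all three wage columns are present, and take an
# independent copy so new columns can be added safely afterwards.
all_wages = data.dropna(subset=wage_cols).copy()

print(all_wages["Occupation"].tolist())
```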


### Initial Analysis
Now, we will investigate representation in certain occupations by calculating, for each profession, what percentage of the workers are female.

We will also add columns that express women's weekly wages as a percentage of the weekly wages of all workers and as a percentage of men's weekly wages.

In [8]:
def percent_female_workers(row):
    """Return the share of workers in this occupation who are female, in percent."""
    return row['F_workers'] * 100.0 / row['All_workers']

In [9]:
all_wages['percent_f_workers'] = all_wages.apply(percent_female_workers, axis=1)
sorted_worker = all_wages.sort_values(by=['percent_f_workers'])
sorted_worker.head()

Unnamed: 0,Occupation,All_workers,All_weekly,M_workers,M_weekly,F_workers,F_weekly,percent_f_workers
360,CONSTRUCTION,5722,749.0,5586,751.0,137,704.0,2.394268
401,MAINTENANCE,4301,839.0,4159,842.0,143,761.0,3.324808
528,Driver/sales workers and truck drivers,2687,747.0,2582,751.0,105,632.0,3.907704
228,Police and sheriff's patrol officers,655,1002.0,569,1001.0,86,1009.0,13.129771
521,TRANSPORTATION,6953,646.0,5998,679.0,955,494.0,13.735078
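The percentage above is plain column arithmetic, so it can also be computed without `apply`. The sketch below is an alternative formulation, not the notebook's method. It uses two rows copied from the tables above, and for large frames the vectorized form is usually faster than a row-wise `apply`.

```python
import pandas as pd

# Two rows copied from the tables above.
df = pd.DataFrame({
    "Occupation": ["Chief executives", "CONSTRUCTION"],
    "All_workers": [1046, 5722],
    "F_workers": [283, 137],
})

# Column arithmetic is applied element-wise, so no apply() call is needed.
df["percent_f_workers"] = df["F_workers"] * 100.0 / df["All_workers"]

# Lowest female representation first, as in the notebook's sorted view.
print(df.sort_values("percent_f_workers").round(2))
```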


In [10]:
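def percent_female_wage(row):
    """Return women's weekly wage as a percentage of the all-worker weekly wage."""
    return row['F_weekly'] * 100.0 / row['All_weekly']

def percent_ftom(row):
    """Return women's weekly wage as a percentage of men's weekly wage."""
    return row['F_weekly'] * 100.0 / row['M_weekly']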

In [11]:
all_wages['percent_f_wages'] = all_wages.apply(percent_female_wage, axis=1)
sorted_wage = all_wages.sort_values(by=['percent_f_wages'])
sorted_wage.head()

Unnamed: 0,Occupation,All_workers,All_weekly,M_workers,M_weekly,F_workers,F_weekly,percent_f_workers,percent_f_wages
287,"Securities, commodities, and financial service...",211,1155.0,146,1461.0,65,767.0,30.805687,66.406926
440,First-line supervisors of production and opera...,783,875.0,650,924.0,133,623.0,16.985951,71.2
289,"Sales representatives, services, all other",406,966.0,268,1147.0,139,699.0,34.236453,72.360248
54,Personal financial advisors,407,1419.0,248,1738.0,159,1033.0,39.066339,72.797745
110,"Physical scientists, all other",189,1553.0,121,1770.0,68,1170.0,35.978836,75.338055


In [13]:
sorted_wage['percent_ftom'] = sorted_wage.apply(percent_ftom, axis=1)
sorted_wage_ftom = sorted_wage.sort_values(by=['percent_f_wages', 'percent_ftom'])
sorted_wage_ftom.head()

Unnamed: 0,Occupation,All_workers,All_weekly,M_workers,M_weekly,F_workers,F_weekly,percent_f_workers,percent_f_wages,percent_ftom
287,"Securities, commodities, and financial service...",211,1155.0,146,1461.0,65,767.0,30.805687,66.406926,52.498289
440,First-line supervisors of production and opera...,783,875.0,650,924.0,133,623.0,16.985951,71.2,67.424242
289,"Sales representatives, services, all other",406,966.0,268,1147.0,139,699.0,34.236453,72.360248,60.941587
54,Personal financial advisors,407,1419.0,248,1738.0,159,1033.0,39.066339,72.797745,59.436133
110,"Physical scientists, all other",189,1553.0,121,1770.0,68,1170.0,35.978836,75.338055,66.101695
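To read the female-to-male ratio as a gap, one option is to subtract it from 100. The sketch below does this for three rows copied from the table above. The `gap_pct` column is a hypothetical name introduced here for illustration and does not appear in the notebook. Note that sorting on `percent_ftom` alone can order rows differently from the cell above, which sorts on `percent_f_wages` first.

```python
import pandas as pd

# Rows copied from the table above. Occupation names are truncated there,
# so the original index labels are used to identify the rows.
df = pd.DataFrame(
    {"M_weekly": [1461.0, 924.0, 1147.0], "F_weekly": [767.0, 623.0, 699.0]},
    index=[287, 440, 289],
)

# Women's weekly wage as a percentage of men's, computed element-wise.
df["percent_ftom"] = df["F_weekly"] * 100.0 / df["M_weekly"]

# Hypothetical extra column: how far women's weekly pay falls below men's,
# in percentage points.
df["gap_pct"] = 100.0 - df["percent_ftom"]

# Widest gap first.
print(df.sort_values("percent_ftom").round(2))
```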
