In my [previous script][1], I explored the wage gap in the IBM HR data set. However, I was not meticulous enough, and only after publishing the script did I realize that the data is fictional.

The Bureau of Labor Statistics [data set][2], while not as detailed (and not based on individual employee data points), is based on real world data. 

Let's explore the **real** wage gap in America. 

  [1]: https://www.kaggle.com/drgilermo/d/pavansubhasht/ibm-hr-analytics-attrition-dataset/where-is-the-wage-gap
  [2]: https://www.kaggle.com/jonavery/incomes-by-career-and-gender

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
from subprocess import check_output
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
print(check_output(["ls", "../input"]).decode("utf8"))

First, let's read the data and add some relevant features, such as the wage ratio, the wage gap, and the share of each gender in the field:

In [None]:
df = pd.read_csv('../input/inc_occ_gender.csv')

df = df[~(df.M_weekly == 'Na')]
df = df[~(df.F_weekly == 'Na')]
       
df['M_weekly'] = df.M_weekly.apply(lambda x: int(x))
df['F_weekly'] = df.F_weekly.apply(lambda x: int(x))
df['M_workers'] = df.M_workers.apply(lambda x: int(x))
df['F_workers'] = df.F_workers.apply(lambda x: int(x))
df['All_weekly'] = df.All_weekly.apply(lambda x: int(x))
df['All_workers'] = df.All_workers.apply(lambda x: int(x))
df['M_share'] = df.M_workers/df.All_workers 
df['F_share'] = df.F_workers/df.All_workers 
df['non_weighted_all_weekly'] = (df.M_weekly + df.F_weekly)/2
df['Gap'] = df.M_weekly - df.F_weekly
df['Ratio'] = df.F_weekly/df.M_weekly
df['Ratio_of_workers'] = df.F_workers/df.M_workers

df = df.reset_index(drop = True)
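
As an aside, the cleaning step above can also be written with `pd.to_numeric`. The sketch below uses a made-up three-row frame (the values are hypothetical, not taken from the BLS file) and assumes that missing medians appear as the string `'Na'`, as in the filter above:

```python
import pandas as pd

# Hypothetical toy rows that mimic the raw CSV, where a missing value is the string 'Na'.
raw = pd.DataFrame({
    'Occupation': ['A', 'B', 'C'],
    'M_weekly': ['1000', 'Na', '800'],
    'F_weekly': ['900', '700', 'Na'],
})

numeric_cols = ['M_weekly', 'F_weekly']
clean = raw.copy()
# errors='coerce' turns anything that cannot be parsed (such as 'Na') into NaN.
clean[numeric_cols] = clean[numeric_cols].apply(pd.to_numeric, errors='coerce')
# Keep only the occupations where both medians are present.
clean = clean.dropna(subset=numeric_cols).reset_index(drop=True)
clean['Ratio'] = clean.F_weekly / clean.M_weekly
print(clean)
```

Only occupation `A` survives here, because it is the only row with both medians reported. Note that the coerced columns are floats rather than ints.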

Let's now look at the most unequal and most gender-equal fields (the "most equal" fields are where women make the most, since fields with an opposite **and significant** gap do not really exist):

In [None]:
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
sorted_df = df.sort_values(['Ratio'], ascending = [True])

plt.figure(figsize = (10,10))

plt.subplot(2,1,1)
plt.barh(range(10),sorted_df.tail(10).Ratio)
plt.yticks(range(10),sorted_df.tail(10).Occupation, fontsize = 10)
plt.plot([1,1],[0,10], '--',color = 'r')
plt.title('Most Equal Fields')
plt.xlim([0,1.2])

plt.subplot(2,1,2)
plt.style.use('fivethirtyeight')
plt.barh(range(10),sorted_df.head(10).Ratio)
plt.yticks(range(10),sorted_df.head(10).Occupation, fontsize = 10)
plt.plot([1,1],[0,10], '--',color = 'r')
plt.title('Most unequal Fields')
plt.xlim([0,1.2])
plt.xlabel('Female/Male wage ratio')



Let's look now at the fields with the largest and smallest share of women:

In [None]:
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
sorted_df = df.sort_values(['F_share'], ascending = [True])

plt.figure(figsize = (10,10))

plt.subplot(2,1,1)
plt.barh(range(10),sorted_df.tail(10).F_share)
plt.yticks(range(10),sorted_df.tail(10).Occupation, fontsize = 10)
plt.plot([0.5,0.5],[0,10], '--',color = 'r')
plt.xlim([0,1])
plt.title('Fields with largest share of women')

plt.subplot(2,1,2)
plt.style.use('fivethirtyeight')
plt.barh(range(10),sorted_df.head(10).F_share)
plt.yticks(range(10),sorted_df.head(10).Occupation, fontsize = 10)
plt.plot([0.5,0.5],[0,10], '--',color = 'r')
plt.title('Fields with smallest share of women')
plt.xlim([0,1])



The results are not very surprising. It is worth mentioning that the fields with the smallest share of women are not necessarily those with the highest wages (construction, maintenance, etc.), which implies that looking **only** at gender when analyzing wage differences would be too simplistic.

So, with that in mind, let's check out the highest- and lowest-paying fields:

In [None]:
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
sorted_df = df.sort_values(['non_weighted_all_weekly'], ascending = [True])

plt.figure(figsize = (10,10))

plt.subplot(2,1,1)
plt.barh(range(10),sorted_df.tail(10).non_weighted_all_weekly)
plt.yticks(range(10),sorted_df.tail(10).Occupation, fontsize = 10)
plt.xlim([0,2000])
plt.xlabel('Weekly Income[$]')
plt.title('Most paying fields')

plt.subplot(2,1,2)
plt.style.use('fivethirtyeight')
plt.barh(range(10),sorted_df.head(10).non_weighted_all_weekly)
plt.yticks(range(10),sorted_df.head(10).Occupation, fontsize = 10)
plt.xlim([0,2000])
plt.title('Least paying fields')

As mentioned above, higher-paying fields tend to be more male-dominated, but this trend has a lot of outliers. Let's take the highest- and lowest-paying fields and check the share of women in each one:

In [None]:
plt.figure(figsize = (10,10))

plt.subplot(2,1,1)
plt.barh(range(10),sorted_df.tail(10).F_share)
plt.yticks(range(10),sorted_df.tail(10).Occupation, fontsize = 10)
plt.xlim([0,1])
plt.title('Share of women in the most paying fields')

plt.subplot(2,1,2)
plt.barh(range(10),sorted_df.head(10).F_share)
plt.yticks(range(10),sorted_df.head(10).Occupation, fontsize = 10)
plt.xlim([0,1])
plt.title('Share of women in the least paying fields')
plt.xlabel('Share of Women')

We do see that the share of women is, generally speaking, larger in low-paying fields. However, agricultural workers and cooks, for instance, are primarily male, even though these fields are among the lowest-paying ones.
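
One way to put a number on "generally speaking" is a rank correlation between field pay and the share of women. The sketch below uses six made-up fields (the values are illustrative assumptions, not BLS figures); with the real data, the same two columns of `df` could be used instead:

```python
import pandas as pd

# Made-up values for illustration only, not taken from the BLS data.
toy = pd.DataFrame({
    'non_weighted_all_weekly': [400, 550, 700, 900, 1200, 1600],
    'F_share': [0.70, 0.35, 0.60, 0.50, 0.40, 0.30],
})

# Spearman's rank correlation is the Pearson correlation of the ranks.
ranks = toy.rank()
rho = ranks['non_weighted_all_weekly'].corr(ranks['F_share'])
print('Spearman rho:', round(rho, 3))
```

A negative value means that higher-paying fields tend to have a smaller share of women. A value well short of -1 reflects exceptions such as the male-dominated, low-paying second row of the toy data.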



Let's see the distribution of the wage gap:

In [None]:
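# distplot is deprecated in recent seaborn; histplot with a KDE overlay draws a comparable figure
sns.histplot(df.Ratio, bins = np.linspace(0.4,1.2,28), kde = True, stat = 'density')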
plt.title('Median Wage Ratio Distribution')

np.mean(df.Ratio)

The average of the median wage ratios across the different fields is 0.82. We need to be careful and remember that:

 - People usually quote the gap between average wages, which is not necessarily a wiser choice (it is, actually, a worse one), but we need to compare apples with apples.
 - The number 0.82 is not only an average of medians; it also ignores the fact that the number of employees differs enormously between fields.
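
To illustrate the second caveat, here is a minimal sketch of how a worker-weighted average can differ from the simple one. The three rows are made up for the example (they are assumptions, not BLS values); on the real data the same call would use `df.Ratio` and `df.All_workers`:

```python
import numpy as np
import pandas as pd

# Made-up fields: one very large field with a low ratio, two small fields with high ratios.
toy = pd.DataFrame({
    'All_workers': [1000, 100, 50],
    'Ratio': [0.80, 0.90, 1.00],
})

simple_mean = toy.Ratio.mean()                                   # every field counts equally
weighted_mean = np.average(toy.Ratio, weights=toy.All_workers)   # every worker counts equally
print('simple mean:  ', round(simple_mean, 3))
print('weighted mean:', round(weighted_mean, 3))
```

The weighted figure is pulled toward the largest field. Even so, a worker-weighted average of per-field median ratios is still not the same thing as the ratio of the overall medians.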

It is interesting, however, that this number is higher (that is, the wage gap is smaller) than what people usually talk about. In addition to the caveats above, this might be because the data we have automatically controls for the field, which is the best predictor of salary; as we have seen earlier, the gender distribution is not homogeneous, and men tend to work in higher-paying fields.

However, there is clearly a large wage gap across the board (and the market). To get the full picture, we need more features, especially working hours. Typically, when controlling for these too, the wage gap shrinks to about 10% (with some noticeable variance around this number).

Let's explore the gap:

In [None]:
plt.plot(df.non_weighted_all_weekly, df.Ratio,'o',markersize = 10, alpha = 0.8)
plt.xlabel('Non Weighted Weekly Salary [$]')
plt.ylabel('Female/Male Wage Ratio')
plt.title('The gap is larger at higher salaries')

x = df.non_weighted_all_weekly
y = df.Ratio
fn = np.polyfit(x,y,1)
fit_fn = np.poly1d(fn) 
plt.plot(x,fit_fn(x))


In [None]:
plt.plot(df['F_share'], df.Ratio,'o', markersize = 10, alpha = 0.8)

x = df['F_share']
y = df.Ratio
fn = np.polyfit(x,y,1)
fit_fn = np.poly1d(fn) 
plt.plot(x,fit_fn(x))
plt.title('The Gap slightly decreases with the share of females')


## Regression Model
Let's build a very simple regression model in order to understand the importance of the different features.

First, let's look at the salaries of males and females as a function of the non-weighted field salary (that is, the average of the male and female median salaries, without weighting by the number of men and women in the field, which usually differs):

In [None]:
plt.plot(df.non_weighted_all_weekly, df.M_weekly,'o')
plt.plot(df.non_weighted_all_weekly, df.F_weekly,'o')
plt.legend(['Males','Females'])
plt.xlabel('Field Median Salary')
plt.ylabel('Salary')
plt.show()


In [None]:
females_df = pd.DataFrame()
males_df = pd.DataFrame()

females_df['Gender'] = np.ones(len(df))
males_df['Gender'] = np.zeros(len(df))

females_df['Salary'] = df.F_weekly
males_df['Salary'] = df.M_weekly

females_df['F_share'] = df.F_share
males_df['F_share'] = df.F_share

females_df['non_weighted_all_weekly'] = df['non_weighted_all_weekly']
males_df['non_weighted_all_weekly'] = df['non_weighted_all_weekly']

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
regression_df = pd.concat([males_df, females_df])

model = LinearRegression()
columns = ['F_share','Gender','non_weighted_all_weekly']
X = regression_df[columns]

X_std = StandardScaler().fit_transform(X)
y = regression_df['Salary']

X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.2, random_state=42)

model.fit(X_train,y_train)

plt.barh([0,1,2],model.coef_)
plt.yticks(range(3),['Share of women','Gender','Salary in Field'], fontsize = 10)
plt.title('Regression Coefficients')

print('R^2 on training...',model.score(X_train,y_train))
print('R^2 on test...',model.score(X_test,y_test))

Similar to what we have already found:

 1. The field is the most important feature. This is obvious; in fact, if it were not the case, the situation would be so dire that such an analysis would be entirely redundant.
 2. The gender feature is very important: being a woman has a strong negative effect on income.
 3. The share of women in the field has a very small but positive effect on equal pay: the more women there are in a field, the more equal the income is. However, this effect is still very small compared to the gender effect. That is, even in a field where the vast majority of workers are female, they are still very likely to earn less than their few male colleagues.
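
Because the features were standardized before fitting, each coefficient is measured per one standard deviation of its feature rather than per original unit. The sketch below shows the conversion back to original units on synthetic data. It uses plain NumPy least squares as a stand-in for `StandardScaler` plus `LinearRegression`, and the 150-dollar penalty is an assumption built into the fake data, not a result from the BLS data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
gender = np.repeat([0.0, 1.0], n // 2)         # 0 = male, 1 = female, as in regression_df
field_pay = rng.uniform(400, 1600, size=n)     # synthetic stand-in for the field salary
# Synthetic salaries with a known 150-dollar penalty when gender == 1.
salary = field_pay - 150.0 * gender + rng.normal(0, 20, size=n)

X = np.column_stack([gender, field_pay])
mu = X.mean(axis=0)
sigma = X.std(axis=0)                          # population std (ddof=0), which StandardScaler also uses
X_std = (X - mu) / sigma

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(n), X_std])
beta, *_ = np.linalg.lstsq(A, salary, rcond=None)

coef_std = beta[1:]          # change in salary per one standard deviation of each feature
coef_raw = coef_std / sigma  # change in salary per one original unit of each feature
print('standardized coefficients: ', coef_std)
print('original-unit coefficients:', coef_raw)
```

With a balanced 0/1 gender column the standard deviation is 0.5, so the standardized gender coefficient is half the dollar difference between the two groups. In the notebook, the same conversion would divide `model.coef_` by the fitted scaler's `scale_` attribute, which requires keeping a reference to the `StandardScaler` object.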



So how strong is gender as a predictor?

In [None]:
model = LinearRegression()
columns = ['Gender']
X = regression_df[columns]

X_std = StandardScaler().fit_transform(X)
y = regression_df['Salary']

X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.2, random_state=42)

model.fit(X_train,y_train)

print('R^2 on training...',model.score(X_train,y_train))
print('R^2 on test...',model.score(X_test,y_test))


Gender alone explains about 5% of the variance in salary.
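
For a regression with a single predictor, the in-sample R^2 equals the squared Pearson correlation between the predictor and the target, and for a 0/1 predictor the fitted slope is the difference between the two group means. The sketch below checks both facts on synthetic data (the salary values are assumptions chosen for illustration, not BLS figures):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
gender = np.repeat([0.0, 1.0], n // 2)                        # 0 = male, 1 = female
salary = 900.0 - 150.0 * gender + rng.normal(0, 300, size=n)  # synthetic salaries

# Fit salary = slope * gender + intercept by least squares.
slope, intercept = np.polyfit(gender, salary, 1)
pred = slope * gender + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((salary - pred) ** 2)
ss_tot = np.sum((salary - salary.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

r = np.corrcoef(gender, salary)[0, 1]
mean_diff = salary[gender == 1].mean() - salary[gender == 0].mean()
print('R^2:', r_squared, ' r^2:', r ** 2)
print('slope:', slope, ' difference of group means:', mean_diff)
```

The identity holds for the data the line was fitted on. The held-out test score printed in the cell above is computed on different rows, so it can differ from the training value.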