# Are there wage differences?
____
____

In [1]:
#   Processing
import pandas as pd
import numpy as np
import re
#   Visuals
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize']=(20,10)

First, we need to read the dataset:

In [2]:
df = pd.read_csv("../input/inc_occ_gender.csv", na_values="Na")
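
The `na_values="Na"` argument asks pandas to treat the literal string `Na` as a missing value. A minimal sketch with a tiny in-memory CSV (the rows are invented for illustration and are not taken from the dataset) shows the effect on the parsed columns:

```python
import io
import pandas as pd

# Hypothetical sample; these rows are invented for illustration only.
sample_csv = io.StringIO(
    "Occupation,M_weekly,F_weekly\n"
    "SECTOR A,1000,800\n"
    "Job one,Na,700\n"
)

sample = pd.read_csv(sample_csv, na_values="Na")

# "Na" is parsed as NaN, so M_weekly stays numeric instead of becoming text.
missing_per_column = sample.isnull().sum()
print(sample.dtypes)
print(missing_per_column)
```

Without `na_values`, a column containing `Na` would most likely be read as text, and the later subtraction of the two wage columns would not work as intended.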

In [3]:
df.head()

# Data cleaning

### Dealing with missing values, formatting...

In [4]:
df.count()

In [5]:
df.isnull().sum()

In [6]:
df = df.dropna().reset_index(drop=True)
df.isnull().sum()
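
Resetting the index after `dropna()` matters here because the loop further down addresses rows as `df['Occupation'][i]` with `i` running from 0 to the number of rows. A small sketch on a made-up frame (all values are hypothetical) shows what happens to the labels:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one incomplete row (values invented).
toy = pd.DataFrame({
    "Occupation": ["Job one", "Job two", "Job three"],
    "M_weekly": [900.0, np.nan, 1100.0],
    "F_weekly": [850.0, 700.0, 1000.0],
})

cleaned = toy.dropna()                       # drops the row with NaN, keeps the old labels
renumbered = cleaned.reset_index(drop=True)  # relabels the remaining rows from 0

print(list(cleaned.index))
print(list(renumbered.index))
```

Without the reset, label-based access with consecutive integers would hit labels that no longer exist.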

In [7]:
df.head()

# Plotting

Next, I build a list of the occupation groups (sectors), which the dataset writes in capital letters:

In [8]:
caps = re.compile(r'[A-Z][A-Z]+')  #  Groups are in capital letters
titles = []
indexes = []
for i, occupation in enumerate(df['Occupation']):
    words = caps.findall(occupation)  #  Runs of two or more capitals
    if words:
        titles.append(' '.join(words))
        indexes.append(i)

In [9]:
titles
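
To see what the capital-letter pattern captures, here is a small sketch on invented labels (the strings below are hypothetical examples, not rows quoted from the dataset). It also compares the regex with `str.isupper()`, a stricter test that could be used instead:

```python
import re

# Hypothetical occupation labels, invented for illustration.
examples = ["LEGAL", "HEALTHCARE SUPPORT", "Lawyers", "Computer and IT managers"]

pattern = re.compile(r"[A-Z][A-Z]+")  # two or more consecutive capital letters

for label in examples:
    words = pattern.findall(label)
    print(f"{label!r}: regex -> {' '.join(words)!r}, isupper -> {label.isupper()}")

# Labels that produce at least one regex match vs. labels that are fully upper case.
regex_hits = [label for label in examples if pattern.findall(label)]
upper_hits = [label for label in examples if label.isupper()]
```

A mixed-case label containing an acronym also produces a match, so it would add an entry such as `IT` to `titles`. That entry is harmless for the `isin(titles)` filter used below, because `isin` compares whole strings and no row is exactly `IT`.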

In [10]:
df.head()

In [11]:
dfTitles = df.loc[df['Occupation'].isin(titles)].reset_index(drop=True)
dfTitles.head()

# Bar Plots

In [12]:
ax = sns.barplot(x="M_weekly", y="Occupation", data=dfTitles.sort_values('M_weekly',ascending=False))
_ = ax.set_xlim(0, 2000)

In [13]:
ax = sns.barplot(x="F_weekly", y="Occupation", data=dfTitles.sort_values('F_weekly',ascending=False))
_ = ax.set_xlim(0, 2000)

Women seem to have lower salaries than men in every sector.

In [14]:
dfTitles['diff'] = dfTitles['M_weekly']-dfTitles['F_weekly']
dfTitles.head()

The wage gap (`M_weekly` minus `F_weekly`) is positive in every sector, and it is especially large in the legal sector.

In [15]:
ax = sns.barplot(x="diff", y="Occupation", data=dfTitles.sort_values('diff',ascending=False))

# Let's plot individual jobs 

At the sector level there seems to be a big difference between the salaries of women and men, but what about specific jobs?

In [16]:
dfsubTitles = df.loc[~df['Occupation'].isin(titles)].reset_index(drop=True)
dfsubTitles.head()

In [17]:
df.loc[1]

In [18]:
dfsubTitles['diff'] = dfsubTitles['M_weekly'] - dfsubTitles['F_weekly']
dfsubTitles.head()

### In which jobs do men earn much more?

In [19]:
ax = sns.barplot(x="diff", y="Occupation", data=dfsubTitles.sort_values('diff',ascending=False)[:15])

The financial sector seems to have the largest salary differences.

### Are there jobs where women have a better salary?

In [20]:
ax = sns.barplot(x="diff", y="Occupation", data=dfsubTitles.sort_values('diff',ascending=False)[-10:])

Yes, there are jobs where women have higher wages, but the differences are not very big.
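
The two plots above slice a frame sorted by `diff` in descending order: `[:15]` takes the largest gaps and `[-10:]` the smallest ones. A minimal sketch on invented numbers (job names and gaps are hypothetical) shows the slicing, the equivalent `nlargest` / `nsmallest` calls, and an explicit filter for negative gaps:

```python
import pandas as pd

# Hypothetical jobs and gaps (values invented for illustration).
toy = pd.DataFrame({
    "Occupation": ["Job A", "Job B", "Job C", "Job D"],
    "diff": [300, -40, 120, -10],
})

ordered = toy.sort_values("diff", ascending=False)
top_two = ordered[:2]       # largest gaps in favour of men
bottom_two = ordered[-2:]   # smallest gaps; negative means women earn more

# The same rows can be selected without sorting the whole frame first.
largest = toy.nlargest(2, "diff")
smallest = toy.nsmallest(2, "diff")

# The tail of the sorted frame is not guaranteed to be negative, so filter explicitly.
women_higher = toy[toy["diff"] < 0]
print(women_higher)
```

Filtering on `diff < 0` is a quick way to confirm that every job shown in the last plot really has a higher female wage.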

# Can we talk in general terms about salary differences?

In [21]:
ax = sns.distplot(dfsubTitles['M_weekly'],bins=20, color='blue',label='Male')
ax = sns.distplot(dfsubTitles['F_weekly'], bins=20, color='pink',label='Female')
ax.set(ylim=(0, 0.002))
plt.plot([dfsubTitles['M_weekly'].median(),dfsubTitles['M_weekly'].median()],[0, 0.0004], linewidth=2, color='blue',label='Male median')
plt.plot([dfsubTitles['F_weekly'].median(),dfsubTitles['F_weekly'].median()],[0, 0.0004], linewidth=2, color='red',label='Female median')
_=ax.legend()

In this case, I plot the median because, for skewed distributions, the median is often more representative than the mean.
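
A quick illustration of that point, using only the standard library and made-up wages (the numbers are hypothetical): a single very high value pulls the mean upwards, while the median stays close to the bulk of the data.

```python
import statistics

# Hypothetical weekly wages with one very high earner (values invented).
wages = [500, 550, 600, 650, 700, 3000]

mean_wage = statistics.mean(wages)      # pulled upwards by the single high value
median_wage = statistics.median(wages)  # middle of the sorted values

print("mean:", mean_wage)
print("median:", median_wage)
```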

In [22]:
sns.boxplot(data=df[['M_weekly','F_weekly']])

In [23]:
print("**************MALE**************")
print(df['M_weekly'].describe())
print("*************FEMALE*************")
print(df['F_weekly'].describe())

In general terms, women have lower wages than men, particularly in the financial, legal and business sectors.

Bye and keep kaggling!