# Gender bias in data science?

Since we've got the nice Kaggle 2017 survey data, I thought it would be interesting to explore (even just a little bit!) whether there is any evidence of gender bias in data science hidden in the responses.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sb

cp = sb.color_palette()

In [None]:
df_qus = pd.read_csv('../input/schema.csv')
df_multi = pd.read_csv('../input/multipleChoiceResponses.csv',encoding="ISO-8859-1", low_memory=False,thousands=',')
df_free = pd.read_csv('../input/freeformResponses.csv',low_memory=False)
conversion = pd.read_csv('../input/conversionRates.csv')

Now that the data is imported, let's start by seeing how the gender of the respondents is distributed.
Whilst there are ~4x more male participants, there are still ~2500 female respondents, so it isn't too much of a stretch to split the data by gender. I have not included a discussion of those responding with 'other' or 'prefer not to say', as the numbers are much smaller.

In [None]:
gender = df_multi['GenderSelect'].value_counts()
ax = gender.plot(kind="bar", figsize=(5,5))
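
To put a number on the imbalance rather than reading it off the bar chart, the counts can be turned into shares and a male-to-female ratio. This is only a minimal sketch: `gender_ratio` is a hypothetical helper, it assumes the labels are exactly 'Male' and 'Female' as used in the later cells, and the toy Series is made-up data standing in for `df_multi['GenderSelect']`.

```python
import pandas as pd

def gender_ratio(gender_col: pd.Series) -> float:
    """Return the number of 'Male' responses per 'Female' response."""
    counts = gender_col.value_counts()
    return counts['Male'] / counts['Female']

# Toy stand-in for df_multi['GenderSelect']; illustrative values only
toy_gender = pd.Series(['Male'] * 8 + ['Female'] * 2 + [None])

# Share of each answer; missing answers are ignored by value_counts
toy_gender_shares = toy_gender.value_counts(normalize=True)
toy_ratio = gender_ratio(toy_gender)
```

On the real data, `gender_ratio(df_multi['GenderSelect'])` would give the ratio directly.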

Good to see that the age distributions for the male and female groups are really pretty similar, so the two groups should hopefully make a fair comparison.

In [None]:
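# split the responses into male and female groups
man = df_multi[df_multi['GenderSelect']=='Male'].copy()
woman = df_multi[df_multi['GenderSelect']=='Female'].copy()

# age distributions side by side: female on the left, male on the right
# (distplot is deprecated since seaborn 0.11; sb.histplot(..., kde=True, stat='density') is the newer equivalent)
fig,axs = plt.subplots(1,2,figsize=(10,4))
sb.distplot(man['Age'].dropna(),ax=axs[1],color=cp[0])
plt.setp(axs[1],title='Male')
sb.distplot(woman['Age'].dropna(),ax=axs[0],color=cp[1])
plt.setp(axs[0],title='Female')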
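
The "pretty similar" reading of the two histograms can be backed up with summary statistics per group. The following is a sketch under the assumption that the columns are named `GenderSelect` and `Age` as above; `age_summary` is a hypothetical helper and the toy frame contains invented values.

```python
import pandas as pd

def age_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count, mean, median and standard deviation of Age per gender."""
    subset = df[df['GenderSelect'].isin(['Male', 'Female'])]
    return subset.groupby('GenderSelect')['Age'].agg(['count', 'mean', 'median', 'std'])

# Toy stand-in for df_multi; illustrative values only
toy_ages = pd.DataFrame({
    'GenderSelect': ['Male', 'Male', 'Male', 'Female', 'Female', 'Other'],
    'Age': [30, 40, None, 28, 32, 50],
})
toy_age_summary = age_summary(toy_ages)
```

Calling `age_summary(df_multi)` would give one row per gender; close means and medians would support treating the groups as comparable in age.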

One of the most well-known gender differences in any field is the pay gap, so it seemed pretty logical to dive in there.
Let's get everything into USD and see whether (as with age) the distribution of salary is pretty similar between men and women.

In [None]:
# convert salary information to USD
df_salaries = pd.merge(df_multi,conversion,left_on='CompensationCurrency',right_on='originCountry',how='left')
# strip currency symbols and thousands separators, then parse; unparseable entries become NaN
df_salaries['CompensationAmount'] = df_salaries['CompensationAmount'].replace({r'\$': '', ',': ''}, regex=True)
df_salaries['CompensationAmount'] = pd.to_numeric(df_salaries['CompensationAmount'], errors='coerce')
df_salaries['salaryUSD'] = df_salaries['CompensationAmount']*df_salaries['exchangeRate']
# keep positive salaries below 400,000 USD, for male and female respondents only
df_salaries2 = df_salaries[df_salaries['salaryUSD']<400000]
df_salaries2 = df_salaries2[df_salaries2['salaryUSD']>0]
df_salaries2 = df_salaries2[df_salaries2['GenderSelect'].isin(['Male','Female'])]
df_salaries2.hist(by='GenderSelect',column='salaryUSD',sharex=True,bins=50,figsize=(10,4))
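
Histogram shapes are hard to compare by eye when the groups differ so much in size, so a per-gender median is a useful companion to the plot. This is a rough sketch: `salary_by_gender` is a hypothetical helper that reuses the same 0 to 400,000 USD filter as the cell above, and the toy frame holds invented salaries, not survey results.

```python
import pandas as pd

def salary_by_gender(df: pd.DataFrame, cap: float = 400000) -> pd.DataFrame:
    """Number of usable salaries and median salaryUSD per gender."""
    ok = df[(df['salaryUSD'] > 0) & (df['salaryUSD'] < cap)]
    ok = ok[ok['GenderSelect'].isin(['Male', 'Female'])]
    return ok.groupby('GenderSelect')['salaryUSD'].agg(['count', 'median'])

# Toy stand-in for df_salaries; illustrative values only
toy_salaries = pd.DataFrame({
    'GenderSelect': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
    'salaryUSD': [50000, 70000, 90000, 500000, 40000, 60000, 0],
})
toy_salary_summary = salary_by_gender(toy_salaries)
```

On the real data this would be `salary_by_gender(df_salaries)`. A difference in medians alone would not show bias, since it ignores country, job title and experience.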

In [None]:
# satisfaction levels in display order ("prefer not to share" first, then 1 to 10)
sorter = ['I prefer not to share','1 - Highly Dissatisfied', '2', '3', '4', '5', '6', '7', '8', '9', '10 - Highly Satisfied']

# count the responses per level; reindex puts the counts in display order
male_satisfaction = man['JobSatisfaction'].value_counts().reindex(sorter).to_frame()
female_satisfaction = woman['JobSatisfaction'].value_counts().reindex(sorter).to_frame()


fig,axs = plt.subplots(1,2,figsize=(10,5))
# one colour per group, repeated for each of the satisfaction levels
my_cp1 = [cp[0]] * len(sorter)
my_cp2 = [cp[1]] * len(sorter)
axs[1] = male_satisfaction.plot(kind='bar',ax=axs[1],color=my_cp1)
plt.setp(axs[1],title='Male')
axs[0] = female_satisfaction.plot(kind='bar',ax=axs[0],color=my_cp2)
plt.setp(axs[0],title='Female')
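
Because the two groups differ a lot in size, raw counts on separate axes are awkward to compare. One option is to normalise each group to shares that sum to 1. A minimal sketch, assuming the `GenderSelect` and `JobSatisfaction` columns and the answer labels used above; `satisfaction_shares` is a hypothetical helper and the toy responses are made up.

```python
import pandas as pd

LEVELS = ['I prefer not to share', '1 - Highly Dissatisfied', '2', '3', '4', '5',
          '6', '7', '8', '9', '10 - Highly Satisfied']

def satisfaction_shares(df: pd.DataFrame) -> pd.DataFrame:
    """Share of each satisfaction level within each gender (columns sum to 1)."""
    subset = df[df['GenderSelect'].isin(['Male', 'Female'])]
    shares = pd.crosstab(subset['JobSatisfaction'], subset['GenderSelect'],
                         normalize='columns')
    # Put the levels in display order; levels nobody chose get a share of 0
    return shares.reindex(LEVELS, fill_value=0)

# Toy stand-in for df_multi; illustrative values only
toy_responses = pd.DataFrame({
    'GenderSelect': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female'],
    'JobSatisfaction': ['7', '7', '8', '10 - Highly Satisfied', '7', '3'],
})
toy_satisfaction_shares = satisfaction_shares(toy_responses)
```

With the real data, `satisfaction_shares(df_multi).plot(kind='bar')` would draw the two groups as side-by-side bars on a common scale.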

to be continued...