# Attempting to Quantify Gender Differences in Kaggle Dev Survey

This notebook is meant to accompany this video: https://youtu.be/GO420aMtHfk

To see part 1 & 2 of this series refer to these resources:
- Part 1 video: https://www.youtube.com/watch?v=r-DR9HBaipU&ab_channel=KenJee
- Part 2 video: https://www.youtube.com/watch?v=KQ80oD_boBM&ab_channel=KenJee
- Kaggle Kernel (Part 2): https://www.kaggle.com/kenjee/kaggle-project-from-scratch

One of the first questions that came up in part 1 of my analysis was how much gender inequality there currently is in data science. Assuming there is some (80% of the sample being male makes me think there probably is...), how does it impact earning potential?

In this notebook I:
1. Visualize and normalize gender differences in the sample
2. Run a multiple linear regression to understand which factors contribute most to earning potential
3. Run a lasso regression to narrow the variable set and try to quantify the extent to which gender impacts earning potential
4. Run a random forest on the same data to evaluate feature importance (a nonlinear model like this is a good check)
5. Compare models fit on just the subsets of women and men to hopefully normalize for more variables

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt #likely won't be used much as I'm experimenting with plotly 
import plotly.graph_objects as go #you will be learning how go and px work with me! 
import plotly.express as px 

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
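#load data 
df = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
print(df.shape) #print so the shape still shows even though it is not the last line of the cell
#remove the top row, which holds the question text rather than a response
df_fin = df.iloc[1:,:]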

In [None]:
#inspect the data and questions 
df.head()

In [None]:
#create a dictionary for questions 
Questions = {}

#create list of questions 
#not very efficient, but keeps things ordered
qnums = list(dict.fromkeys([i.split('_')[0] for i in df_fin.columns]))

#add data for each question to key value pairs in dictionary
for i in qnums:
    if i in ['Q1','Q2','Q3']: #since we are using .startswith() below this prevents all questions that start with 
        Questions[i] = df_fin[i] #[1,2,3] from going in the key value pair (Example in vid)
    else:
        Questions[i] = df_fin[[q for q in df_fin.columns if q.startswith(i)]]

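The exact-match branch for Q1, Q2 and Q3 matters because a plain prefix test would also catch Q10 through Q39. Here is a minimal sketch with made-up column names (hypothetical, only loosely modeled on the survey headers) that shows the collision, along with one alternative that compares the text before the first underscore:

```python
# Hypothetical column names, loosely modeled on the survey headers
columns = ["Q1", "Q2", "Q7_Part_1", "Q7_Part_2", "Q10", "Q26_A_Part_1"]

# Ordered, de-duplicated question ids (dicts preserve insertion order)
question_ids = list(dict.fromkeys(c.split("_")[0] for c in columns))

# A naive prefix match pulls Q10 into the Q1 group
naive_q1 = [c for c in columns if c.startswith("Q1")]

# Comparing the id before the first underscore avoids the collision
exact_q1 = [c for c in columns if c.split("_")[0] == "Q1"]
q7_cols = [c for c in columns if c.split("_")[0] == "Q7"]

print(question_ids)
print(naive_q1, exact_q1, q7_cols)
```

Matching on the split id would remove the need for the special case, at the cost of one extra split per column.
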
In [None]:
# create dictionary for different gender selections 
Genders = {}
for i in df_fin.Q2.unique():
    Genders[i] = df_fin[df_fin.Q2 == i]

In [None]:
# Notes I created as I was going

#Brief EDA / Pivot Tables 
## Heatmap 
## Percentage distribution
## Income by Gender By Country 
## Income by Gender 
## Income by Skills
## Income by Role
## Income by Education 

# Convert Salary to Continuous?
# Which variables to consider? 
# Regression for Income --> Paying particular attention to Gender
# Regression for Income --> Men and Women (2 different Regressions)

#Normalize for female population of sample 

In [None]:
#look at gender distribution
df_fin.Q2.value_counts()/ df_fin.Q2.value_counts().sum()

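Dividing the counts by their sum gives each group's share of the sample. pandas can return the same shares directly through the `normalize` argument of `value_counts`. A quick check on a toy series (the values are made up, not survey data):

```python
import pandas as pd

# Toy stand-in for the Q2 column (values are made up)
gender = pd.Series(["Man", "Man", "Man", "Woman"], name="Q2")

# Manual version, as in the cell above
manual = gender.value_counts() / gender.value_counts().sum()
# Built-in version
builtin = gender.value_counts(normalize=True)

print(builtin)
```
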
In [None]:
#filter dataframe for male & female for simplicity (not that prefer not & nonbinary aren't important!)
df_mf = df_fin[df_fin.Q2.isin(['Man','Woman'])] 

In [None]:
#DS is clearly already a male dominated field (or at least this sample of kaggle users is)
df_mf.Q2.value_counts()/ df_mf.Q2.value_counts().sum() 

In [None]:
#Female Distribution by Education (Q4 is the education question, Q5 is role)
fig= px.histogram(df_mf,x='Q4',color ='Q2')
fig.show()

In [None]:
#Female Distribution by Education Normalized by sample of respective population 
fig= px.histogram(df_mf,x='Q4',color ='Q2', histnorm='probability density')
fig.show()

In [None]:
#Share of women at each education level minus the share of women in the whole sample (19.7%)
male_degrees = df_mf[df_mf.Q2 == 'Man'].Q4.value_counts()
female_degrees = df_mf[df_mf.Q2 == 'Woman'].Q4.value_counts()
total_degrees = df_mf.Q4.value_counts()
more_women = (female_degrees/total_degrees)-.197 #positive = greater proportion of women than sample
bar_colors = np.where(more_women.values <0, 'blue','red') #kept separate so the Series only holds numbers
fig = go.Figure(go.Bar(x=more_women.index, y=more_women.values, marker_color=bar_colors))
fig.update_layout(title= "Level of Female Education Relative to AVG of Sample (19.7%)")
fig.show()


In [None]:
#Female Distribution by Country
fig= px.histogram(df_mf,x='Q3',color ='Q2')
fig.update_xaxes(categoryorder= "total descending")
fig.show()


In [None]:
#Share of women in each country minus the share of women in the whole sample (19.7%)
male_country = df_mf[df_mf.Q2 == 'Man'].Q3.value_counts()
female_country = df_mf[df_mf.Q2 == 'Woman'].Q3.value_counts()
total_country = df_mf.Q3.value_counts()
more_women = (female_country/total_country)-.197 #positive = greater proportion of women than sample
bar_colors = np.where(more_women.values <0, 'blue','red') #kept separate so the Series only holds numbers
fig = go.Figure(go.Bar(x=more_women.index, y=more_women.values, marker_color=bar_colors))
fig.update_layout(title= "Amount of Women By Country Relative to AVG of Sample (19.7%)")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
#flip colors 
total_country
female_country

In [None]:
#function for creating new graphs 
def create_norm_graph(qnum, data, title, baseline):
    male = data[data.Q2 == 'Man'][qnum].value_counts()
    female = data[data.Q2 == 'Woman'][qnum].value_counts()
    total = data[qnum].value_counts()
    more_women = (female/total)-baseline #positive = greater proportion of women than sample
    bar_colors = np.where(more_women.values <0, 'blue','red') #kept separate so the Series only holds numbers
    fig = go.Figure(go.Bar(x=more_women.index, y=more_women.values, marker_color=bar_colors))
    fig.update_layout(title= title)
    fig.update_layout(xaxis={'categoryorder':'total descending'})
    fig.show()
    return 

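The quantity these charts plot is the share of women within each category minus the share of women in the whole sample, so a positive bar means women are over-represented in that category. A small worked example on made-up data (the roles and counts are illustrative, not taken from the survey):

```python
import numpy as np
import pandas as pd

# Toy data: 10 respondents, 3 of them women (values are made up)
toy = pd.DataFrame({
    "Q2": ["Man"] * 7 + ["Woman"] * 3,
    "Q5": ["Analyst"] * 4 + ["Engineer"] * 3 + ["Analyst"] * 2 + ["Engineer"],
})

baseline = (toy.Q2 == "Woman").mean()          # overall share of women in the toy sample
female = toy[toy.Q2 == "Woman"].Q5.value_counts()
total = toy.Q5.value_counts()
relative = female / total - baseline           # positive = over-represented

# Same color rule as the charts: blue below the baseline, red otherwise
colors = np.where(relative < 0, "blue", "red")
print(relative.round(3).to_dict(), list(colors))
```

Here two of six analysts and one of four engineers are women, so analysts sit slightly above the 30% baseline and engineers below it.
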
In [None]:
# which countries have the most female representatives in the survey relative to the baseline?
create_norm_graph('Q3',df_mf,"Amount of Women By Country Relative to AVG of Sample (19.7%)",.197)

In [None]:
#Which roles have the most women relative to the baseline?
create_norm_graph('Q5',df_mf,"Amount of Women By Role Relative to AVG of Sample (19.7%)",.197)

In [None]:
#create new baseline for only employed people
df_workers_mf = df_mf[~df_mf['Q5'].isin(['Student','Currently not employed'])]
df_workers_mf.Q2.value_counts()/df_workers_mf.Q2.value_counts().sum()

In [None]:
# Women's experience 
create_norm_graph('Q6',df_workers_mf,"Amount of Women By Experience Relative to AVG of Sample (17.4%)",.174)

#absolute number is a lot lower 
df_workers_mf.Q6.value_counts()

In [None]:
#by income level 
create_norm_graph('Q24',df_workers_mf,"Amount of Women By Income Level Relative to AVG of Sample (17.4%)",.174)
df_workers_mf.Q24.value_counts()

In [None]:
#graph for just data scientists 
df_mf_ds= df_mf[df_mf['Q5'] =='Data Scientist']
create_norm_graph('Q24',df_mf_ds,"Amount of Women Data Scientists By Income Level Relative to AVG of Sample (17.4%)", .174)

In [None]:
#count for perspective, some sample size issues here
df_mf_ds.Q24.value_counts()

In [None]:
#graph for US 
df_mf_US= df_mf[df_mf['Q3'] =='United States of America']
create_norm_graph('Q24',df_mf_US,"Amount of Women in the US By Income Level Relative to AVG of Sample (17.4%)",.174)

In [None]:
df_mf_US.Q24.value_counts()

In [None]:
#Income by role (awful graph I know)
fig= px.histogram(df_fin.dropna(subset=['Q24','Q5']),x='Q24',color ='Q5')
fig.update_xaxes(categoryorder= "total descending")
fig.show()

In [None]:
#Income by experience 
fig= px.histogram(df_fin.dropna(subset=['Q24','Q6']),x='Q24',color ='Q6')
fig.update_xaxes(categoryorder= "total descending")
fig.show()

In [None]:
#Income by education
fig= px.histogram(df_fin.dropna(subset=['Q24','Q4']),x='Q24',color ='Q4')
fig.update_xaxes(categoryorder= "total descending")
fig.show()

# Building a Model 
I thought it made more sense to use a regression here to try to predict salary. Although it will be very rough around the edges, I think converting the salaries from categorical to numeric will let us interpret the data more easily.

In [None]:
#convert dollar ranges to numeric 
#explore converting other continuous variables 
#build model with just gender 

In [None]:
#replace '$',',','>' in data 
df_model = df_fin.dropna(subset=['Q24']).copy() #copy so the new columns go on an independent dataframe
df_model['salary_cleaned'] = df_model.Q24.apply(lambda x: str(x).replace('$','').replace(',','').replace('>','').strip())
df_model.salary_cleaned.value_counts()

In [None]:
#create min range and max range for salary 
df_model['salary_min'] = df_model.salary_cleaned.apply(lambda x: 500000 if '-' not in x else int(x.split('-')[0]))
df_model['salary_max'] = df_model.salary_cleaned.apply(lambda x: 500000 if '-' not in x else int(x.split('-')[1]))

df_model.salary_max.value_counts()

In [None]:
#Convert to rough continuous variable 
df_model['aprox_salary'] = (df_model.salary_min+df_model.salary_max)/2
df_model.aprox_salary.value_counts()

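The three cells above turn each salary bucket into the midpoint of its range, with the open-ended top bucket pinned to 500,000. The same logic can be sketched as one small function. The name `salary_midpoint` and the sample labels are illustrative assumptions, not part of the notebook:

```python
def salary_midpoint(label, top_value=500000):
    """Turn a salary bucket such as '$1,000-1,999' into a rough number."""
    cleaned = label.replace("$", "").replace(",", "").replace(">", "").strip()
    if "-" not in cleaned:
        # Open-ended top bucket: fall back to its lower bound
        return float(top_value)
    low, high = (int(part) for part in cleaned.split("-"))
    return (low + high) / 2

for label in ["$0-999", "1,000-1,999", "100,000-124,999", "> $500,000"]:
    print(label, "->", salary_midpoint(label))
```

Pinning the top bucket to its lower bound understates the highest earners, which is one reason the resulting variable is only roughly continuous.
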
In [None]:
#simple linear regression just gender 
import statsmodels.api as sm 

In [None]:
#filter for men & women 
df_model_fin = df_model[df_model.Q2.isin(['Man','Woman'])] 
#filter for workers 
df_model_fin = df_model_fin[~df_model_fin['Q5'].isin(['Student','Currently not employed'])]
df_model_fin = df_model_fin.drop('Time from Start to Finish (seconds)', axis =1) #reassign instead of inplace on a filtered slice

In [None]:
df_model_fin.isnull().any()

In [None]:
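# create dummy variables, this is needed because essentially all our data is categorical
# dtype=float keeps the indicator columns numeric; newer pandas versions return booleans by default
model_dummies = pd.get_dummies(df_model_fin, dtype=float)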
model_dummies

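`pd.get_dummies` leaves numeric columns untouched and expands each categorical column into one 0/1 indicator column per distinct value, named `<column>_<value>`. A tiny illustration on a made-up frame (the education labels and salaries are placeholders):

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (made up)
toy = pd.DataFrame({
    "Q2": ["Man", "Woman", "Man"],
    "Q4": ["Master", "Bachelor", "Master"],
    "aprox_salary": [1499.5, 499.5, 2499.5],
})

dummies = pd.get_dummies(toy, dtype=float)
print(dummies.columns.tolist())
print(dummies)
```

This naming pattern is what lets the later cells pick columns by the question id before the first underscore.
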
In [None]:
# We only need one gender in this case because we trimmed it to only have Men & Women
Y = model_dummies.aprox_salary
X = model_dummies.Q2_Man

In [None]:
#for statsmodels, we need to add a constant to create intercept 
X = sm.add_constant(X)

In [None]:
#fit model with data 
model = sm.OLS(Y,X)
results= model.fit()

In [None]:
#create summary report (watch video to see interpretation)
results.summary()

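With a single 0/1 predictor, ordinary least squares reduces to a comparison of group means: the intercept is the mean salary of the group coded 0, and the coefficient on `Q2_Man` is the gap between the two group means. A small check with made-up numbers, using NumPy's least-squares solver rather than statsmodels to keep the sketch light:

```python
import numpy as np

# Made-up salaries for illustration only
salary = np.array([50.0, 70.0, 60.0, 40.0, 50.0])
is_man = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

# Design matrix with an intercept column, like sm.add_constant
X = np.column_stack([np.ones_like(is_man), is_man])
(intercept, gender_coef), *_ = np.linalg.lstsq(X, salary, rcond=None)

mean_women = salary[is_man == 0].mean()
mean_men = salary[is_man == 1].mean()
print(intercept, gender_coef, mean_women, mean_men - mean_women)
```

So the gender-only model measures the raw gap and controls for nothing, which is why the later cells add education, role, and country.
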
In [None]:
# helper that selects every dummy column belonging to the listed questions
# (note: this reuses the name qnums, replacing the list of question ids created earlier)
def qnums(question_list, dataframe):
    q_out = [] 
    for i in question_list:
        for j in dataframe.columns:
            if i == j.split('_')[0]:
                q_out.append(j)
    return dataframe.loc[:,q_out]
        
#create data for questions 2,4,5
q245 =  qnums(['Q2','Q4','Q5'], model_dummies)
q245

In [None]:
#drop one of the gender columns, it is redundant 
X = q245.drop('Q2_Man', axis=1)
X = sm.add_constant(X)

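The gender column is redundant because, once the data is limited to men and women, `Q2_Man + Q2_Woman` equals 1 on every row, which is exactly the constant column. Keeping both makes the design matrix rank-deficient. A short demonstration on toy data:

```python
import numpy as np
import pandas as pd

gender = pd.Series(["Man", "Woman", "Man", "Woman", "Man"], name="Q2")
dummies = pd.get_dummies(gender, prefix="Q2", dtype=float)

# Intercept plus both indicators: const == Q2_Man + Q2_Woman on every row
full = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
# Intercept plus one indicator: the columns are linearly independent
reduced = np.column_stack([np.ones(len(dummies)), dummies["Q2_Woman"].to_numpy()])

print(np.linalg.matrix_rank(full), full.shape[1])        # rank is below the column count
print(np.linalg.matrix_rank(reduced), reduced.shape[1])  # full column rank
```

The same redundancy exists inside every other single-answer question, which is worth keeping in mind when reading individual coefficients.
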
In [None]:
#build model with additional features education, gender, and role 
model = sm.OLS(Y,X)
results= model.fit()
results.summary()

In [None]:
#questions 2,4,5,7 add in programming languages 
        
q2457 =  qnums(['Q2','Q4','Q5','Q7'], model_dummies).drop('Q2_Man', axis=1)
q2457


In [None]:
X = q2457
X = sm.add_constant(X)

In [None]:
model = sm.OLS(Y,X)
results= model.fit()
results.summary()

In [None]:
#questions 2,3,4,5,7 add in country (huge boost in model performance)
        
q24573 =  qnums(['Q2','Q4','Q5','Q7','Q3'], model_dummies).drop('Q2_Man', axis=1)
q24573


In [None]:
X = q24573
X = sm.add_constant(X)

In [None]:
model = sm.OLS(Y,X)
results= model.fit()
results.summary()

In [None]:
# lasso regression 
# random forest 
#remove some features 

In [None]:
#questions 2,3,4,5,6,7,20
        
q245736 =  qnums(['Q2','Q4','Q5','Q7','Q3','Q6','Q20'], model_dummies).drop('Q2_Man', axis=1)
X = q245736
X = sm.add_constant(X)

In [None]:
model2 = sm.OLS(Y,X)
results= model2.fit()
results.summary()

In [None]:
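#fit model with lasso parameters. Set alpha high enough to eliminate some variables 
#the OLSResults wrapper reuses the OLS covariance, so read the coefficients, not the standard errors or p-values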
results_reg = model2.fit_regularized(L1_wt=1, alpha= 5)
final = sm.regression.linear_model.OLSResults(model2,results_reg.params,model2.normalized_cov_params)
print(final.summary())

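To see why an L1 penalty can eliminate variables, here is a minimal coordinate-descent sketch of lasso on synthetic data. It is not the solver statsmodels uses internally. The function name, the data, and the alpha value are all assumptions for illustration, and the objective is written as half the mean squared error plus alpha times the sum of absolute coefficients.

```python
import numpy as np

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize 0.5 * mean((y - X @ b) ** 2) + alpha * sum(|b|)."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back
            partial = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ partial / n
            scale = X[:, j] @ X[:, j] / n
            # Soft-thresholding: weak relationships are set exactly to zero
            b[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / scale
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + rng.normal(size=500)   # second feature is pure noise

coefs = lasso_coordinate_descent(X, y, alpha=0.5)
print(coefs)
```

The noise feature is dropped entirely, while the real feature survives with a coefficient shrunk toward zero. For the same reason, lasso coefficients are better read as a ranking aid than as unbiased effect sizes.
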
In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
#compare random forest feature importance (allows us to rank)
clf_rf = RandomForestRegressor()
clf_rf.fit(X,Y)

In [None]:
feat_importances = pd.Series(clf_rf.feature_importances_, index=X.columns)
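top_feats = feat_importances.nlargest(25).sort_values()
ax  = top_feats.plot(kind='barh', figsize=(6,12))
if 'Q2_Woman' in top_feats.index: #highlight gender at its actual position rather than a hard-coded one
    ax.barh([top_feats.index.get_loc('Q2_Woman')],top_feats.loc['Q2_Woman'],color='red')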

In [None]:
#build models for men and women independently. See how they estimate salary on the same data 
#I think this is a decent way to isolate individual effects of education, country, etc.
Women_Model = model_dummies[model_dummies.Q2_Man == 0]
Men_Model = model_dummies[model_dummies.Q2_Man == 1]

In [None]:
# create and train women's model 
women_fin =  qnums(['Q4','Q5','Q7','Q3','Q6','Q20'], Women_Model)
Y_W = Women_Model.aprox_salary
X_W = women_fin
X_W = sm.add_constant(X_W)

Women_Model

In [None]:
model_W = sm.OLS(Y_W,X_W)
results_W= model_W.fit()
results_W.summary()

In [None]:
results_reg_W = model_W.fit_regularized(L1_wt=1, alpha= 5)
final_W = sm.regression.linear_model.OLSResults(model_W,results_reg_W.params,model_W.normalized_cov_params)
print(final_W.summary())

In [None]:
#create and train men's model 
men_fin =  qnums(['Q4','Q5','Q7','Q3','Q6','Q20'], Men_Model)
Y_M = Men_Model.aprox_salary
X_M = men_fin
X_M = sm.add_constant(X_M)

model_M = sm.OLS(Y_M,X_M)
results_M= model_M.fit()
results_M.summary()

In [None]:
results_reg_M = model_M.fit_regularized(L1_wt=1, alpha= 5)
final_M = sm.regression.linear_model.OLSResults(model_M,results_reg_M.params,model_M.normalized_cov_params)
print(final_M.summary())

In [None]:
#run model on all data & compare 
combined_data = qnums(['Q4','Q5','Q7','Q3','Q6','Q20'], model_dummies)
male_preds = final_M.predict(np.array(sm.add_constant(combined_data)))
female_preds = final_W.predict(np.array(sm.add_constant(combined_data)))

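The idea in this step is to fit one model per group and then score everyone with both models, so the gap between the two predictions reflects how each model prices the same profile. A stripped-down sketch with one made-up feature and NumPy least squares (the variable names and numbers are illustrative, not survey results):

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for a design matrix that includes a constant."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

# Made-up data: one feature (years of experience) and exactly linear salaries
experience = np.array([1.0, 3.0, 5.0, 2.0, 4.0, 6.0])
is_man = np.array([True, True, True, False, False, False])
salary = np.where(is_man, 10.0 + 2.0 * experience, 8.0 + 2.0 * experience)

X = np.column_stack([np.ones_like(experience), experience])
coefs_men = fit_ols(X[is_man], salary[is_man])
coefs_women = fit_ols(X[~is_man], salary[~is_man])

# Score every respondent with both models, then compare the predictions
projected_diff = X @ coefs_men - X @ coefs_women
print(projected_diff.mean(), (projected_diff < 0).sum())
```

In this toy setup the two lines differ only in their intercept, so the projected difference is the same for every row. With real data the gap varies by profile, which is why the notebook looks at both its mean and its spread.
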
In [None]:
combined_data['male_preds'] = male_preds
combined_data['female_preds'] = female_preds

In [None]:
combined_data['aprox_salary'] = model_dummies.aprox_salary
combined_data

In [None]:
px.scatter(combined_data.sort_values('aprox_salary'), x = 'aprox_salary', y = ['male_preds','female_preds'])

In [None]:
combined_data['projected_diff'] = combined_data.male_preds - combined_data.female_preds

In [None]:
combined_data.projected_diff.mean()

In [None]:
combined_data.projected_diff.std()

In [None]:
combined_data['women_prj_higher'] = combined_data.projected_diff.apply(lambda x: 1 if x < 0 else 0)

In [None]:
combined_data.women_prj_higher.value_counts()

In [None]:
## Next Steps
#roles
#countries
#sample size
#t test men & women
#would love people to expand on this