In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

multiple_choice_responses = pd.read_csv("../input/kaggle-survey-2019/multiple_choice_responses.csv", low_memory=False)
other_text_responses = pd.read_csv("../input/kaggle-survey-2019/other_text_responses.csv")
questions_only = pd.read_csv("../input/kaggle-survey-2019/questions_only.csv")
survey_schema = pd.read_csv("../input/kaggle-survey-2019/survey_schema.csv")

<h1>The Gender Gap in Tech</h1>

<p>As a woman in tech, I've always been acutely aware of the gender gap in tech, be it as the only female participant in computer science lectures at university or as a member of software engineering teams where my gender and its effect on statistics were sometimes the main selling point when switching to a different team.</p>

<p>Based on these experiences, I would love to gain some insights into women in tech through the Kaggle ML & DS Survey data. Beyond the gender distribution within the survey, both in total and across different ages or countries, the focus will be on career, compensation, and coding experience and competency.</p>

In [None]:
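# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which is unreliable in recent pandas versions.
multiple_choice_responses['Q3'] = multiple_choice_responses['Q3'].replace(
    'United Kingdom of Great Britain and Northern Ireland', 'United Kingdom')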

<h2>Gender Distribution</h2>
<p>Unfortunately, the number of women in STEM, especially in technology, is still very low, so we would expect to see a similar distribution in the Kaggle survey. A quick look at the question regarding the respondents' gender confirms this suspicion. Roughly 3,000 surveys were completed by women, in contrast to about 16,000 completed by men. Women therefore make up only about 16.5% of respondents, if those who prefer not to reveal their gender or who self-identify are disregarded.</p>

<p>Considering this gap, I was curious to see whether the low ratio of women to men would be similar across different countries as well as age groups, with the expectation being that certain locations or younger age groups equate to a higher ratio of women in tech.</p>
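<p>As a rough illustration of how such a share can be computed, the sketch below filters a gender column to the two groups compared here and normalises the counts. The values in <code>toy_gender</code> are hypothetical and only stand in for the survey's <code>Q2</code> column.</p>

```python
import pandas as pd

# Hypothetical toy answers standing in for the 'Q2' column (not survey data).
toy_gender = pd.Series(['Male', 'Female', 'Male', 'Male',
                        'Prefer not to say', 'Male', 'Female', 'Male'])

# Keep only the two groups compared in this notebook, then turn counts into shares.
binary = toy_gender[toy_gender.isin(['Female', 'Male'])]
shares = binary.value_counts(normalize=True)
female_share = shares['Female']

print(round(female_share * 100, 1))
```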

In [None]:
genders = multiple_choice_responses[['Q2']].drop(multiple_choice_responses.index[0])
gender_distr = genders[(genders['Q2'] == 'Female') | (genders['Q2'] == 'Male')].groupby('Q2').agg({'Q2':'count'})

fig, ax = plt.subplots(1,1,figsize=(10,5))
ax.set_title('Gender Distribution')
ax.set(xlabel='Gender', ylabel='Respondents')
sns.set(style='whitegrid')
# Recent seaborn versions require x and y to be passed as keyword arguments.
ax = sns.barplot(x=['Female', 'Male'], y=list(gender_distr['Q2']), ax=ax)
plt.show()

In [None]:
demographics = multiple_choice_responses[['Q1','Q2','Q3']].drop(multiple_choice_responses.index[0])
demographics = demographics[(demographics['Q2'] == 'Female') | (demographics['Q2'] == 'Male')]
demographics = demographics.rename(columns={'Q1':'Age_Group', 'Q2':'Gender', 'Q3':'Country'})

<h3>Gender Distribution by Country</h3>

<p>An initial look at the number of female survey respondents using just the counts per country reveals high numbers of up to 800 for India and the United States, followed by Canada, Germany and the UK. However, it stands to reason that highly populated and tech-heavy countries such as these would have a high number of overall respondents, an assumption confirmed by looking at the percentage of female respondents per country instead of the count.</p>
<p>The second graph, comparing the percentage of female respondents per country, reveals that India sits at about the average, whereas the US, UK and Germany are only slightly above average. Surprisingly, smaller countries claim the top spots for female participation, though these numbers are, in some cases, skewed by a very low overall number of respondents. For example, there were only about 70 overall respondents from Tunisia in a total of approximately 4.5 million. As such, a link between location and the ratio of female to male respondents can only reasonably be drawn for countries that are above average in both graphs, that is, countries with both a high number of overall responses and a high percentage of women.</p>
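<p>One way to guard against percentages skewed by small samples is to rank only countries above a minimum number of respondents. The sketch below is a minimal example on made-up counts; the country names, the numbers and the <code>min_respondents</code> cut-off are all assumptions chosen for illustration, not values from the survey.</p>

```python
import pandas as pd

# Hypothetical counts per country; not taken from the survey.
toy = pd.DataFrame(
    {'Female': [40, 900, 20], 'Male': [30, 4100, 480]},
    index=['Country A', 'Country B', 'Country C'],
)
toy['Total'] = toy['Female'] + toy['Male']
toy['Female_Share'] = toy['Female'] / toy['Total']

# Assumed cut-off, chosen only for illustration.
min_respondents = 100

# Country A has the highest share but too few respondents to be ranked.
reliable = toy[toy['Total'] >= min_respondents].sort_values('Female_Share', ascending=False)
print(reliable[['Total', 'Female_Share']])
```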

In [None]:
loc = pd.crosstab(demographics['Country'], demographics['Gender'])
loc['Female'] = loc['Female'].astype(float)
loc['Male'] = loc['Male'].astype(float)

loc = loc.sort_values(by=['Female'], ascending=False)
fig, ax = plt.subplots(1,1,figsize=(15,30))
ax.set_title('Female Respondents by Location')
ax.set(xlabel='Counts')
sns.set(style='whitegrid')
ax = sns.barplot(x=loc['Female'], y=loc.index, ax=ax, palette="vlag")
plt.show()

# Convert the female count into the female share of (female + male) respondents.
loc['Female'] = (loc['Female'] / (loc['Female'] + loc['Male'])).round(4)

loc = loc.drop(['Male'], axis = 1)
loc = loc.sort_values(by=['Female'], ascending=False)

fig, ax = plt.subplots(1,1,figsize=(15,30))
ax.set_title('Female Respondents by Location (Percentage)')
ax.set(xlabel='Percentage')
sns.set(style='whitegrid')
ax = sns.barplot(x=list(loc['Female']), y=loc.index, ax=ax, palette="vlag")
plt.show()

<h3>Gender Distribution by Age Group</h3>

<p>Looking at the distribution of age groups across both male and female respondents results in a fairly standard graph of the subset of the population still working or in education. There is a moderate bias towards respondents below the age of 40, but this may be explained by the generally quite low median age of employees in tech, especially in the US, one of the countries with the most responses.</p>

In [None]:
age = pd.crosstab(demographics['Age_Group'], demographics['Gender'])
age['Female'] = age['Female'].astype(float)
age['Male'] = age['Male'].astype(float)

totals = demographics.groupby('Age_Group').agg({'Gender':'count'})
fig, ax = plt.subplots(1,1,figsize=(15,5))
ax.set_title('Age Group Distribution')
ax.set(xlabel='Age Group', ylabel='Respondents')
sns.set(style='whitegrid')
ax = sns.barplot(x=totals.index, y=list(totals['Gender']), ax=ax, palette="rocket")
plt.show()


# Convert the female count into the female share of each age group.
age['Female'] = (age['Female'] / (age['Female'] + age['Male'])).round(4)

age = age.drop(['Male'], axis=1)

<p>The percentage of female respondents within each of the eleven age groups follows a similar pattern with some interesting differences. In general, it seems that there are fewer women in higher age groups, just as there are fewer older women in tech. However, the highest percentages are not found among 18-25 year olds, but among respondents in their twenties. It would be interesting to find out whether this is a general shift in the number of women in tech among what are most likely current university students, or whether there is a trend for women to become interested in or switch to tech after having finished formal education.</p>

<p>Surprisingly, the percentage of women responding per age group also fluctuates very little up until the age of 45. I would have expected a larger change between people up to their early thirties and older generations due to the availability of technology when growing up as well as attitudes towards women's "roles".</p>

<p>Having filtered out these datapoints regarding women on Kaggle and, more broadly speaking, women in tech, the question remains whether it is possible to find "trends" within the Kaggle survey which suggest a higher or lower likelihood of women going into tech. Additionally, it would be interesting to find out the job titles and education levels of female respondents. As such, the following analysis filters the dataset for female respondents only.</p>

In [None]:
fig, ax = plt.subplots(1,1,figsize=(15,5))
ax.set_title('Female Respondents by Age Group (Percentage)')
ax.set(ylabel='Percentage')
sns.set(style='whitegrid')
ax = sns.barplot(x=age.index, y=list(age['Female']), ax=ax, palette="rocket")
plt.show()

In [None]:
women = multiple_choice_responses[multiple_choice_responses['Q2']=='Female'].copy()
men = multiple_choice_responses[multiple_choice_responses['Q2']=='Male'].copy()

<h2>Career and Education</h2>

<p>What do the women who make up roughly 16.5% of survey respondents do for a living? How many of them are in tech or still students, how large are the companies they work for, and are there differences in compensation compared to men or between different job titles?</p>

<h3>Job Titles and Categories</h3>
<p>A closer look at the distribution of job titles reveals that the largest single category is students, which is most likely due to the higher number of respondents in younger age groups. Students are followed by three tech positions: Data Scientists, Data Analysts and Software Engineers. These three groups, while not the majority, do make up the largest subset of job titles and confirm that many female respondents to the Kaggle survey are in fact in STEM careers. However, within the top five most frequent job titles, there is also a large number of female respondents in other fields.</p>

<p>Based on this initial analysis, a grouping into STEM careers, respondents in full-time education, other field or between jobs as well as respondents whose technological expertise is difficult to determine (e.g. Business Analysts or Project Managers) may yield interesting results.</p>

In [None]:
jobs = women.groupby('Q5', as_index=False).agg({'Q1':'count'})
jobs = jobs.rename(columns={'Q1':'Count', 'Q5':'Job_Title'})
jobs = jobs.sort_values(by=['Count'], ascending = False)

fig, ax = plt.subplots(1,1,figsize=(10,10))
ax.set_title('Job Titles (Women only)')
sns.set(style='whitegrid')
ax = sns.barplot(x='Count', y='Job_Title', ax=ax, palette="deep", data=jobs)
plt.show()

<p>As the graph shows, when considering all STEM-related job titles, respondents in this field do make up the majority of responses at about 55%. If one assumes that students are most likely working towards a career in STEM as well, this number rises to slightly less than 80%.</p>

<p>Close to 10% of female respondents are in other careers, likely non-STEM-related fields. It would be interesting to see what motivates this subgroup to explore Kaggle. It may be worthwhile to investigate this in combination with coding expertise, both for this subgroup and for the other non-STEM subgroups, including unemployed respondents.</p>

In [None]:
stem = ['Data Scientist', 'Data Analyst', 'Software Engineer', 'Research Scientist', 
        'Data Engineer', 'DBA/Database Engineer', 'Statistician']
unclear = ['Business Analyst', 'Product/Project Manager']

jobs['Job_Field'] = [ 'STEM' if x in stem else 'Unclear' if x in unclear else x for x in jobs['Job_Title'] ]
field = jobs.groupby('Job_Field', as_index=False).agg({'Count':'sum'})
field = field.sort_values(by=['Count'], ascending=False)
field['Count'] = field['Count'].astype(float)


total = field['Count'].sum()

# Convert counts into each field's share of all female respondents.
field['Count'] = (field['Count'] / total).round(4)

field = field.rename(columns={'Count':'Percentage'})

fig, ax = plt.subplots(1,1,figsize=(10,8))
ax.set_title('Job Fields (Women only)')
sns.set(style='whitegrid')
ax = sns.barplot(x='Percentage', y='Job_Field', ax=ax, palette="deep", data=field)
plt.show()

<h3>Education</h3>

<p>Moving on to the distribution of education among female respondents, it quickly becomes apparent that the majority of survey participants have completed or are working on at least an undergraduate degree, with a high number of respondents having gone on to postgraduate studies. Respondents most frequently selected a Master's degree as their level of education.</p>

In [None]:
education = women.groupby('Q4', as_index = False).agg({'Q1':'count'})
education = education.rename(columns={'Q1':'Count','Q4':'Education'}).sort_values(by=['Count'], ascending = False)
education.replace("Some college/university study without earning a bachelor’s degree", "Some college", inplace = True)
education.replace("No formal education past high school", "High School", inplace = True)
education.replace("I prefer not to answer", "No answer", inplace = True)



postgrad = ["Master’s degree", "Doctoral degree", "Professional degree"]
no_edu = ["Some college", "High School"]
education['Category'] = [ 'Postgrad' if x in postgrad else 'No Degree' if x in no_edu 
                         else 'Undergrad' if x == "Bachelor’s degree" else x for x in education['Education'] ]

edu_cat = education.groupby('Category', as_index = False).agg({'Count':'sum'})
edu_cat['Count'] = edu_cat['Count'].astype(float)
total = edu_cat['Count'].sum()

# Convert counts into each category's share of all female respondents.
edu_cat['Count'] = (edu_cat['Count'] / total).round(4)
    
edu_cat = edu_cat.rename(columns={'Count':'Percentage'}).sort_values(by=['Percentage'], ascending = False)


fig, ax = plt.subplots(1,1,figsize=(10,8))
ax.set_title('Degrees')
sns.set(style='whitegrid')
ax = sns.barplot(x='Count', y='Education', ax=ax, palette="deep", data=education)
plt.show()

<p>Categorising these degrees into undergraduate, postgraduate, as well as no university degree or no answer and calculating percentages reveals even more clearly that, globally, the vast majority of women responding to the survey are either students or have completed a degree. With regard to the small number of respondents who did not complete a degree, it might be interesting to look at compensation and coding experience to determine whether this category contains so-called "self-starters" who did not finish university and instead built on an already successful career, or whether there are other reasons for the lack of a university degree.</p>

In [None]:
fig, ax = plt.subplots(1,1,figsize=(10,5))
ax.set_title('Degree Categories')
sns.set(style='whitegrid')
ax = sns.barplot(x='Percentage', y='Category', ax=ax, palette="deep", data=edu_cat)
plt.show()

<h3>Salaries</h3>

<p>Compensation is the final datapoint clearly defining a respondent's career. Unfortunately, when analysing the data, there was an abnormally high number of responses in the $0-999 range. As shown in the graph below, these cannot be explained by the respondents' job titles either, as neither "unemployed" nor "student" is listed. As such, possible reasons for the bias towards abnormally low salaries are:
    <ul>
        <li>Use as a substitute for "Prefer not to answer"</li>
        <li>Currency conversion</li>
        <li>Previous job title (e.g. retired)</li>
    </ul>

Because the low salaries are likely the result of corrupt data, a misunderstanding or repurposing, the $0-999 range has been excluded from the overall salary distribution. However, the next higher salary range, $1,000-1,999, also seems overrepresented.</p>

In [None]:
low_salary = women[women['Q10']=='$0-999']
lows = low_salary.groupby('Q5', as_index = False).agg({'Q1':'count'})
lows = lows.rename(columns={'Q1':'Count', 'Q5':'Occupation'})
lows = lows.sort_values(by=['Count'], ascending = False)

fig, ax = plt.subplots(1,1,figsize=(10,8))
ax.set_title('Low Salaries: Occupation')
sns.set(style='whitegrid')
ax = sns.barplot(x='Count', y='Occupation', ax=ax, palette="deep", data=lows)
plt.show()

<p>Unfortunately, due to the survey using salary ranges in US$ for a global set of respondents, the salary range graph most likely contains a few inaccuracies due to differing salary ranges when converting local currencies into dollars. Overall, however, it seems like there are few very high earners among women. The median salary is lower in comparison to men.</p>
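<p>Because salaries are only available as ranges, a median can only be located at the level of a bracket. The sketch below shows one possible way to find the bracket containing the median respondent from ordered bracket counts. The helper <code>median_bracket</code> and the toy counts are hypothetical and are not taken from the survey results.</p>

```python
import pandas as pd

def median_bracket(counts: pd.Series) -> str:
    """Return the label of the bracket that contains the median respondent.

    `counts` must be ordered from the lowest to the highest bracket.
    """
    cumulative = counts.cumsum()
    half = counts.sum() / 2
    # First bracket whose cumulative count reaches half of all respondents.
    return cumulative[cumulative >= half].index[0]

# Hypothetical bracket counts, ordered from lowest to highest range.
brackets = ['1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999']
toy_women = pd.Series([30, 40, 20, 10], index=brackets)
toy_men = pd.Series([20, 25, 35, 20], index=brackets)

print(median_bracket(toy_women), median_bracket(toy_men))
```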

In [None]:
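def find_number(x):
    """Return the lower bound of a salary range label as an integer."""
    if x[0] == '>':
        # Open-ended top bracket: sort it after all other ranges.
        return 500001
    elif x[0] == '$':
        # Lowest bracket ('$0-999').
        return 0
    else:
        # Strip thousands separators and take the number before the dash.
        x = x.strip().replace(',','')
        return int(x[0 : x.find('-')])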

salaries = women.groupby('Q10', as_index = False).agg({'Q1':'count'})
salaries = salaries.rename(columns={'Q10':'Range', 'Q1':'Count'})

salaries['Range_Start'] = salaries.Range.apply(lambda x: find_number(x))

salaries = salaries.sort_values(by=['Range_Start'])
salaries = salaries[salaries['Range_Start'] != 0]

fig, ax = plt.subplots(1,1,figsize=(20,8))
ax.set_title('Salaries')
sns.set(style='whitegrid')
ax = sns.barplot(x='Range', y='Count', ax=ax, palette="deep", data=salaries)
ax.set_xticklabels(labels=salaries['Range'], rotation=30)
plt.show()

<p>Revisiting the group of respondents who did not finish a degree with regard to salary ranges shows that the assumption that some of these would be so-called "self-starters" is wrong, as no salary exceeds 100k. Instead, the number of salaries below $5,000 suggests that this group mostly consists of low-income respondents such as students or unemployed respondents.</p>
<p>Investigating these numbers further shows that while a number of respondents in this subset work in tech, a large number identified as students. Considering the wording of the degree question (the highest level attained or planned within the next two years), these respondents have probably just started university.</p>

In [None]:
# Assign the result back rather than using inplace=True on a column selection.
women['Q4'] = women['Q4'].replace({
    "No formal education past high school": "High School",
    "Some college/university study without earning a bachelor’s degree": "Some college",
})
no_degree = women[(women['Q4']=='High School') | (women['Q4']=='Some college')]

no_degree_sal = no_degree.groupby('Q10', as_index=False).agg({'Q1':'count'})
no_degree_sal = no_degree_sal.rename(columns={'Q10':'Range', 'Q1':'Count'})

no_degree_sal['Range_Start'] = no_degree_sal.Range.apply(lambda x: find_number(x))
no_degree_sal = no_degree_sal.sort_values(by=['Range_Start'])

fig, ax = plt.subplots(1,1,figsize=(20,8))
ax.set_title('Salaries - No Degree')
sns.set(style='whitegrid')
ax = sns.barplot(x='Range', y='Count', ax=ax, palette="deep", data=no_degree_sal)
# Label the ticks with the ranges of this subset, not those of the overall salary plot.
ax.set_xticklabels(labels=no_degree_sal['Range'], rotation=30)
plt.show()

no_degree_job = no_degree.groupby('Q5', as_index=False).agg({'Q1':'count'})
# Q5 holds the job title, so name the column accordingly.
no_degree_job = no_degree_job.rename(columns={'Q5':'Job_Title', 'Q1':'Count'})
no_degree_job = no_degree_job.sort_values(by=['Count'], ascending = False)

fig, ax = plt.subplots(1,1,figsize=(10,10))
ax.set_title('Job Title - No Degree')
sns.set(style='whitegrid')
ax = sns.barplot(x='Count', y='Job_Title', ax=ax, palette="deep", data=no_degree_job)
plt.show()

<h2>Coding Experience</h2>

<p>While women's careers, salaries and education give an idea of what life is like for women in STEM, coding experience appears to be the most reliable indicator of ability and coding interest. </p>

<p>Besides a large number of missing answers to this question, the majority of female respondents appear to have been coding for up to five years, with only a few having started coding before 2014. This, especially in combination with students being the largest single job title group, suggests that most female respondents are either currently learning how to code at university or have recently acquired tech jobs.</p>

In [None]:
# Assign the results back rather than using inplace=True on a column selection.
women['Q15'] = women['Q15'].replace('I have never written code', 'None')
women['Q15'] = women['Q15'].fillna('No answer')

code = women.groupby('Q15', as_index = False).agg({'Q1':'count'})
code = code.rename(columns={'Q1':'Count', 'Q15':'Coding Experience'})
code = code.sort_values(by=['Count'], ascending = False)

fig, ax = plt.subplots(1,1,figsize=(10,5))
ax.set_title('Coding Experience')
sns.set(style='whitegrid')
ax = sns.barplot(x='Count', y='Coding Experience', ax=ax, palette="deep", data=code)
plt.show()

<h3>Coding Levels</h3>

<p>Categorising coding experience into beginner, intermediate and expert further confirms that for most women on Kaggle, coding is a recently acquired skill. Beginners, who are respondents with up to 2 years experience, account for over 40% of respondents, more than intermediate (3-10 years) and expert (10+ years) coders combined. But is this actually specific to women or are men at a similar level?</p>

In [None]:
beginner = ['< 1 years', '1-2 years']
mid = ['3-5 years', '5-10 years']
expert = ['10-20 years', '20+ years']

code['Level'] = [ 'Beginner' if x in beginner else 'Intermediate' if x in mid 
                 else 'Expert' if x in expert else x for x in code['Coding Experience']]

code_lvl = code.groupby('Level').agg({'Count':'sum'})
order = ['None','Beginner','Intermediate','Expert','No answer']
code_lvl = code_lvl.reindex(order).reset_index()

total = code_lvl['Count'].sum()


fig, ax = plt.subplots(1,1,figsize=(10,5))
ax.set_title('Coding Level - Women')
sns.set(style='whitegrid')
ax = sns.barplot(x='Level', y='Count', ax=ax, palette="deep", data=code_lvl, estimator=lambda x: sum(x)/total*100.0)
ax.set(ylabel='Percentage')
plt.show()

<p>The share of beginner coders is similar for female and male respondents. However, for men, there is no bias towards less than one year of coding experience; the two beginner levels have approximately the same number of respondents. There are differences in the other categories: male respondents are slightly more likely to fall into the intermediate and expert categories (28% versus 24% and 9% versus 6%, respectively). Additionally, male respondents were more likely overall to answer this question, and respondents with no coding experience are rarer among them.</p>
<p>This supports the assumption that more women are currently taking up coding and that there were fewer women in tech a few years ago.</p>
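<p>A compact way to compare the two groups side by side is a cross-tabulation normalised per gender. The sketch below uses a small made-up table, so the column names <code>Gender</code> and <code>Level</code> and all values are assumptions for illustration and do not reproduce the survey percentages.</p>

```python
import pandas as pd

# Hypothetical respondents; each row is one person with a gender and a coding level.
toy_levels = pd.DataFrame({
    'Gender': ['Female'] * 4 + ['Male'] * 5,
    'Level': ['Beginner', 'Beginner', 'Intermediate', 'Expert',
              'Beginner', 'Intermediate', 'Intermediate', 'Expert', 'Expert'],
})

# normalize='columns' makes every gender column sum to 1; scale to percent.
level_share = pd.crosstab(toy_levels['Level'], toy_levels['Gender'], normalize='columns') * 100
level_share = level_share.reindex(['Beginner', 'Intermediate', 'Expert'])

# Positive values mean the level is more common among women than among men.
level_share['Difference'] = level_share['Female'] - level_share['Male']
print(level_share.round(1))
```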

In [None]:
# Assign the results back rather than using inplace=True on a column selection.
men['Q15'] = men['Q15'].replace('I have never written code', 'None')
men['Q15'] = men['Q15'].fillna('No answer')

code_m = men.groupby('Q15', as_index = False).agg({'Q1':'count'})
code_m = code_m.rename(columns={'Q1':'Count', 'Q15':'Coding Experience'})
code_m = code_m.sort_values(by=['Count'], ascending = False)

code_m['Level'] = [ 'Beginner' if x in beginner else 'Intermediate' if x in mid 
                 else 'Expert' if x in expert else x for x in code_m['Coding Experience']]

code_lvl_m = code_m.groupby('Level').agg({'Count':'sum'})
code_lvl_m = code_lvl_m.reindex(order).reset_index()

total_m = code_lvl_m['Count'].sum()

fig, ax = plt.subplots(1,1,figsize=(10,5))
ax.set_title('Coding Level - Men')
sns.set(style='whitegrid')
ax = sns.barplot(x='Level', y='Count', ax=ax, palette="deep", data=code_lvl_m, estimator=lambda x: sum(x)/total_m*100.0)
ax.set(ylabel='Percentage')
plt.show()

<h2>Machine Learning and Programming Languages</h2>

<p>Finally, an analysis of machine learning and languages used by survey respondents might shed some further light on how the experience for women and men in tech compare. Due to the nature of the survey, a high usage of machine learning as well as preference for an ML-friendly language such as Python is expected.</p>

<h3>Machine Learning Usage</h3>

<p>As this question asked whether respondents' employers use machine learning, it does not necessarily confirm whether the respondents themselves use machine learning in their job. Categorising the responses into yes / no / don't know and exploring shows a fairly similar distribution for male and female respondents. The only differences are that women are slightly more likely to be unaware of their employer's usage of machine learning, while men are more likely to classify their employer as exploring their options.</p>

In [None]:
exploring = "We are exploring ML methods (and may one day put a model into production)"
no = "No (we do not use ML methods)"
# Every remaining answer is mapped to "Yes". Note that unanswered rows (NaN)
# also fall through to "Yes" with this mapping.
women['Q8'] = ["Don't know" if x == "I do not know" else "No" if x == no 
              else "Exploring" if x == exploring else "Yes" for x in women['Q8']]
men['Q8'] = ["Don't know" if x == "I do not know" else "No" if x == no 
              else "Exploring" if x == exploring else "Yes" for x in men['Q8']]

In [None]:
ml = women.groupby('Q8', as_index = False).agg({'Q1':'count'})
ml_m = men.groupby('Q8', as_index = False).agg({'Q1':'count'})
ml = ml.rename(columns={'Q8':'Machine_Learning', 'Q1':'Count'})
ml_m = ml_m.rename(columns={'Q8':'Machine_Learning', 'Q1':'Count'})

total = ml['Count'].sum()
total_m = ml_m['Count'].sum()

fig, ax = plt.subplots(1,2,figsize=(18,5))
sns.set(style='whitegrid')
sns.barplot(x='Machine_Learning', y='Count', ax=ax[0], palette="deep", data=ml, estimator=lambda x: sum(x)/total*100.0)
sns.barplot(x='Machine_Learning', y='Count', ax=ax[1], palette="deep", data=ml_m, estimator=lambda x: sum(x)/total_m*100.0)
for a in ax:
    a.set(xlabel='Machine Learning', ylabel='Percentage')
ax[0].set_title('ML Usage - Women')
ax[1].set_title('ML Usage - Men')
plt.show()

<h3>Programming Languages</h3>

<p>When asked about programming languages they use, both female and male respondents were likely to name Python as one of their options. SQL and R being the second and third most frequently named languages in both demographics suggests that both male and female Kaggle survey respondents have at least some experience with data analysis and machine learning. Interestingly, women name Python slightly less and R slightly more frequently than men.</p>

<p>There are a few other notable differences among less frequently named languages:
<ul>
    <li>Women select C++ less often</li>
    <li>The top 5 includes MATLAB for women, but it is only in 9th place for men</li>
    <li>More women than men select no language</li>
    <li>The Bash percentage for women is half of what it is for male respondents</li>
</ul>
Unfortunately, none of the language selections directly include the respondent's level of expertise. As such, the percentages might include users who have only started learning a language recently and are less likely to continue being regular users of it.</p>
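<p>For multi-select questions, where each option has its own column and an empty cell means "not selected", the share of each group choosing an option can be sketched as below. The column names <code>Lang_Python</code> and <code>Lang_Bash</code> and the rows are hypothetical stand-ins for the survey's <code>Q18_Part_*</code> columns.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical multi-select answers: a value means "selected", NaN means "not selected".
toy_lang = pd.DataFrame({
    'Gender': ['Female', 'Female', 'Male', 'Male'],
    'Lang_Python': ['Python', 'Python', 'Python', np.nan],
    'Lang_Bash': [np.nan, np.nan, 'Bash', 'Bash'],
})
lang_cols = ['Lang_Python', 'Lang_Bash']

# True where a language was selected; the mean of booleans per group is the share.
selected = toy_lang[lang_cols].notna()
share = selected.groupby(toy_lang['Gender']).mean() * 100
print(share)
```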

    

In [None]:
responses = women['Q1'].count()
responses_m = men['Q1'].count()
base = 'Q18_Part_'
counts = dict()
counts_m = dict()

palette = dict()
colours = sns.color_palette('deep',11)

for i in range(1,12):
    col = base + str(i)
    count = women[col].count()
    count_m = men[col].count()
    lang = women[col].dropna().values[0]
    counts[lang] = count
    counts_m[lang] = count_m
    palette[lang] = colours[i-1]


result = pd.DataFrame.from_dict(counts, orient='index').reset_index().rename(columns={'index':'Language', 0:'Count'})
result = result.sort_values(by=['Count'], ascending = False)
result_m = pd.DataFrame.from_dict(counts_m, orient='index').reset_index().rename(columns={'index':'Language', 0:'Count'})
result_m = result_m.sort_values(by=['Count'], ascending = False)

fig, ax = plt.subplots(1,2,figsize=(18,10))
sns.set(style='whitegrid')
sns.barplot(x='Count', y='Language', ax=ax[0], palette=palette, data=result, 
            estimator=lambda x: sum(x)/responses*100.0)
sns.barplot(x='Count', y='Language', ax=ax[1], palette=palette, data=result_m, 
            estimator=lambda x: sum(x)/responses_m*100.0)
for a in ax:
    a.set(ylabel='', xlabel='Percentage')
ax[0].set_title('Languages - Women')
ax[1].set_title('Languages - Men')
plt.show()

<p>As an example, I decided to analyse the level of expertise of female users of Python, the most frequently selected language. Categorising them into new users with less than a year of experience, beginners with up to two years of experience and proficient users shows that new users are the minority, but are still well-represented at 30%. As such, no conclusive statements can be made about whether a company's machine learning usage and a person's language usage correspond to machine learning experience.</p>
<p>Additionally, the level of experience in this graph might not correspond to the respondent's Python expertise if this language was picked up later. For example, I personally would fall into the proficient group, but would fit into the new users category regarding Python.</p>

In [None]:
python = women[women['Q18_Part_1']=='Python'].copy()

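# Everything other than the two lowest ranges is labelled 'Proficient'; this also
# covers the 'None' and 'No answer' values assigned to Q15 earlier.
python['Q15'] = [ 'New' if x == '< 1 years' else 'Beginner' if x == '1-2 years' else 'Proficient' for x in python['Q15']]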
py_exp = python.groupby('Q15').agg({'Q1':'count'})

py_order = ['New', 'Beginner', 'Proficient']
py_exp = py_exp.reindex(py_order).reset_index()
py_exp = py_exp.rename(columns={'Q15':'Experience', 'Q1':'Count'})

total_py = py_exp['Count'].sum()

fig, ax = plt.subplots(1,1,figsize=(10,5))
ax.set_title('Coding Level - Female Python Users')
sns.set(style='whitegrid')
ax = sns.barplot(x='Experience', y='Count', ax=ax, palette="deep", data=py_exp, estimator=lambda x: sum(x)/total_py*100.0)
ax.set(ylabel='Percentage')
plt.show()

<p>As a next step, I analysed the relationship between coding experience of female Python users and the number of additional coding languages they have learned. As the graph below shows, for female respondents who selected only Python as a coding language they use regularly, the majority has less than two years of experience, with almost 50% having just started learning Python.</p>

In [None]:
# Count how many languages besides Python (Q18 parts 2-10) each respondent selected.
other_lang_cols = [base + str(j) for j in range(2, 11)]
python['Q18_Count'] = python[other_lang_cols].notna().sum(axis=1)

# Bucket the counts; the labels now match the bucket boundaries (3-4 and 5 or more).
python['Multi_Language'] = [ '5+' if x >= 5 else '3-4' if x >= 3 
                            else str(int(x)) for x in python['Q18_Count']]
others = python.groupby(['Multi_Language', 'Q15']).agg({'Q1':'count'}).rename(columns={'Q1':'Count'})
others = others.reset_index()

# Respondents who selected no language other than Python.
py_only = others[others['Multi_Language']=='0'].drop(['Multi_Language'],axis=1).rename(columns={'Q15':'Experience'})
total_py_only = py_only['Count'].sum()

fig, ax = plt.subplots(1,1,figsize=(10,5))
ax.set_title('Coding Level - Python only (Female)')
sns.set(style='whitegrid')
ax = sns.barplot(x='Experience', y='Count', ax=ax, palette="deep", data=py_only, estimator=lambda x: sum(x)/total_py_only*100.0,
                order=py_order)
ax.set(ylabel='Percentage')
plt.show()

<p>Continuing this analysis to include different categories of multiple language users shows that the majority of proficient Python users code in more than one language regularly. However, both beginners and new Python users also selected multiple languages in the majority of cases. Although that majority is less pronounced than among proficient users, regular use of more than one programming language does not seem to correspond directly to coding expertise.</p>

<p>From this data, it's clear that due to the varying levels of proficiency which cannot be filtered down further, selecting Python as a regularly used language does not correspond to machine learning expertise. However, it is interesting to see that the majority of women who code use more than one language regularly. Additionally, there seems to be a large rise in Python usage within the past two years.</p>

In [None]:
cat_order = ['New', 'Beginner', 'Proficient']

sns.set(style='whitegrid')
g = sns.catplot(x='Multi_Language', y='Count', col='Q15', data=others, kind='bar', col_order=cat_order)
titles = ["New", "Beginner", "Proficient"]
for ax, title in zip(g.axes.flat, titles):
    ax.set_title(title)
    ax.set(xlabel = 'Languages in Addition to Python')
plt.show()

<h2>Conclusion</h2>

<p>Overall, women are definitely still underrepresented on Kaggle, though the size of this gender gap differs between countries. However, it seems as though this gap might be narrowing, as a large number of female respondents have only recently begun to develop their coding skills, suggesting a wider trend towards tech jobs and skills.</p>

<p>Looking at women who are already in tech, I found only slight differences in the usage of machine learning, programming languages or job titles when compared to male respondents. Unfortunately, however, on a global scale, salaries still differ.</p>

<p>Based on the survey data, there is reason to hope that women will close the gender gap in tech.</p>