# Women in Stack Overflow

Stack Overflow (SO) is a well-known website for all things related to code. Whether you're a newbie or a seasoned developer, chances are you've spent some quality time digging through people's Q&As for insight on how to get unstuck with your code.

In [their words](https://stackoverflow.com/company):
>Founded in 2008, Stack Overflow is the largest, most trusted online community for anyone that codes to learn, share their knowledge, and build their careers. **More than 50 million unique visitors** come to Stack Overflow **each month** to help solve coding problems, develop new skills, and find job opportunities.

Another great thing about SO is the [Stack Overflow Annual Developer Survey](https://insights.stackoverflow.com/survey). Every year since 2011, they open the survey to anyone willing to commit their time to it.
In addition to SO's own insights, the collected data makes it possible for anyone else to learn about the community, its members, and the people who contribute to the website.

After a while of constantly visiting SO as I was learning new languages, features, and ways to visualize data, I started wondering about the women involved in tech. The data collected by the 2019 Annual Developer Survey is a good place to start looking for insights.  
This notebook comprises some analysis about
 - [Gender](#Gender)  
 - [Employment](#Employment)  
 - [Work as Developers](#Work-as-Developers)  
 - [Formal Education](#Formal-Education)  
 - [Country](#Country)  
 - [Ethnicity](#Ethnicity)  
 - [Age vs. Social Media](#Age-vs.-Social-Media)
 - [Women vs. Stack Overflow](#Women-vs.-Stack-Overflow)
 
The survey and questions asked are available in this [pdf](data/so_survey_2019.pdf).

### Import libraries and helper functions

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import seaborn as sns
from helpers import search_question, get_probs, plot_stats
%matplotlib inline

In [None]:
print("matplotlib version:", matplotlib.__version__)
print("numpy version:", np.__version__)
print("pandas version:", pd.__version__)
print("re version:", re.__version__)
print("seaborn version:", sns.__version__)

### Import and show the data

Read the data and schema of the survey, store them in variables and show the first five rows.

In [None]:
df = pd.read_csv('./data/survey_results_public.csv')
schema = pd.read_csv('./data/survey_results_schema.csv')
df.head()

Every row represents a participant's answers.
Every column represents a question asked.

In [None]:
participants, questions = df.shape
notes = []
notes.append('The survey had {} participants.'.format(participants))

As I was analyzing the data, I found it useful to store my notes in a variable and retrieve them at the end.
This way I avoid cluttering the notebook and having to revisit every note scattered across it.

My notes had the following structure:
```python
notes = []
notes.append('string to format'.format(
    value_1,
    value_2,
    value_3,
))
```   

*String to format* was the context in which to place the data.  
*value_\** came from the analysis of the data.  
The result was the note: instead of hardcoded results, I added context and let the data speak for itself.

## Gender

I wrote the function **search_question**, which searches the schema for a given term and retrieves all the questions that match it.
This makes it easier to find the columns to analyze.
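The helper itself lives in `helpers.py` and is not shown in this notebook. As a rough idea of what such a search could look like, here is a minimal sketch. The name `search_question_sketch`, the schema column names `Column` and `QuestionText`, and the toy schema rows are assumptions made for illustration, not the actual helper or survey data.

```python
import pandas as pd

def search_question_sketch(schema, term):
    """Return the schema rows whose question text contains `term` (case-insensitive)."""
    # Assumed column names: 'Column' (variable name) and 'QuestionText' (wording)
    mask = schema['QuestionText'].str.contains(term, case=False, regex=False, na=False)
    return [
        {'column': row['Column'], 'question': row['QuestionText']}
        for _, row in schema[mask].iterrows()
    ]

# Toy schema for illustration only (not the real survey schema)
toy_schema = pd.DataFrame({
    'Column': ['Gender', 'Country'],
    'QuestionText': ['Which of the following do you currently identify as?',
                     'In which country do you currently reside?'],
})
matches = search_question_sketch(toy_schema, 'identify')
print(matches)
```

Returning a list of dictionaries keeps both the column name and the question wording at hand, which is convenient when the wording is reused later in a note.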

In [None]:
print(search_question.__doc__)

In [None]:
search_question(schema, "identify")

What are the unique values for the question about *Gender* ?

In [None]:
df['Gender'].unique()

Taking a look at the values, we can easily identify:  
 - Man
 - Woman
 - Non-binary, genderqueer, or gender non-conforming
 
Those values appear alone and in combinations, and the survey format makes it obvious that the first value in a combination was the first one marked by the respondent.  
I imputed and transformed the values as follows:
 - transform *Woman;Man* values to *Non-binary*
 - keep the first value in combinations
 - transform *Non-binary, genderqueer, or gender non-conforming* to *Non-binary*
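These three rules can be tried out on a handful of made-up answers before touching the real column. The sketch below assumes the pattern `;.+` for "keep the first value" (it removes everything from the first semicolon onward); the toy Series is illustrative only.

```python
import pandas as pd

# Toy answers in the survey's semicolon-separated format (illustrative only)
toy = pd.Series(['Man',
                 'Woman;Man',
                 'Woman;Non-binary, genderqueer, or gender non-conforming',
                 'Non-binary, genderqueer, or gender non-conforming'])

cleaned = toy.replace(
    # Rule 1: Woman;Man -> Non-binary
    # Rule 2: drop everything after the first answer
    # Rule 3: shorten the long non-binary label
    to_replace=['Woman;Man', ';.+', 'Non-binary, genderqueer, or gender non-conforming'],
    value=['Non-binary', '', 'Non-binary'],
    regex=True,
)
print(cleaned.tolist())
```

The order of the patterns matters: the *Woman;Man* rule has to come before the rule that truncates combinations, otherwise that answer would be reduced to *Woman*.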

In [None]:
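gender_dict = {
    # ';.+' drops everything after the first answer, so the first value is kept
    'replace': ['Woman;Man', ';.+', 'Non-binary, genderqueer, or gender non-conforming'],
    'value': ['Non-binary', '', 'Non-binary']
}

# Assign the result back instead of using inplace=True on a column selection
df['Gender'] = df['Gender'].replace(to_replace=gender_dict['replace'], value=gender_dict['value'], regex=True)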

I wrote the helper function **get_probs**, which computes the percentage of each value in a pandas Series.  
Again, this avoids cluttering the notebook.
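The real implementation is in `helpers.py`. A minimal sketch of the idea, assuming missing values are ignored and results are sorted from most to least frequent, could rely on `value_counts(normalize=True)`. The name `get_probs_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def get_probs_sketch(series):
    """Percentage of each non-null value in `series`, most frequent first."""
    # value_counts skips NaN by default and sorts in descending order
    return series.value_counts(normalize=True) * 100

toy = pd.Series(['Man', 'Man', 'Man', 'Woman', None])
probs = get_probs_sketch(toy)
print(probs)
```

Because nulls are left out of the denominator in this sketch, the percentages describe only the respondents who answered the question.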

In [None]:
print(get_probs.__doc__)

In [None]:
stats_gender = get_probs(df['Gender'])
stats_gender

The **plot_stats** helper function takes in a pandas Series and plots it.

In [None]:
print(plot_stats.__doc__)

In [None]:
plot_stats(stats_gender.index, stats_gender.values, title='Gender distribution of developers\nwho participated in the survey', 
           xlabel_='Gender', ylabel_='% of developers', filename='img/Gender.png')

In [None]:
notes.append('Approximately {}% of the participants were men, {}% were women, and {}% were non-binary, genderqueer, or gender non-conforming.'.format(
    round(stats_gender.Man, 1),
    round(stats_gender.Woman, 1),
    round(stats_gender['Non-binary'], 1),
))

The subjects of this analysis are the survey participants whose gender identity (or one of them) was woman.  
I reflected this as a subset of only women participants from the imputed data and stored it in the variable ***df_woman***.

In [None]:
df['Gender'].isnull().sum()

In [None]:
gender_woman = ['Woman']
df_woman = df[df['Gender'].isin(gender_woman)]

In [None]:
notes.append('The survey had {} of {} participants \
that identified themselves as women ({}%), and {} ({}%) that didn\'t provide information.'.format(df_woman.shape[0], 
                                                   df.shape[0], 
                                                   round(df_woman.shape[0] / df.shape[0] * 100, 1),
                                                   df['Gender'].isnull().sum(), 
                                                   round(df['Gender'].isnull().sum() / df.shape[0] * 100, 1)))

In [None]:
df_woman.to_csv('data/woman_survey.csv', sep=',')

## Employment

What questions were asked about employment?  
What kind of employment was predominant among the women participants?

In [None]:
search_question(schema, "employment")

In [None]:
df_woman['Employment'].unique()

In [None]:
stats_work = get_probs(df_woman['Employment'])
stats_work

In [None]:
plot_stats(stats_work.index, stats_work.values, title='Employment Status', 
           xlabel_='Employment', ylabel_='Percentage', filename='img/Employment.png', 
           xticks_labels=['Full-time', 'Unemployed\nlooking for job', 'Part-time', 'Contractor', 'Unemployed\nnot looking for job', 'Retired'])

In [None]:
df_woman['Employment'].isnull().sum()

In [None]:
work_vals = [df_woman.shape[0], df_woman['Employment'].isnull().sum()]

# Take the three most frequent answers by position
for i in range(3):
    work_vals.append(round(stats_work.iloc[i], 1))
    work_vals.append(stats_work.index[i].lower())

notes.append('From the {} women developers, {} didn\'t \
provide information on this question, and \
approximately {}% were {}, {}% were {}, and {}% were {}.'.format(*work_vals))

I filtered the ***df_woman*** data to take some notes about women employed at the time of the survey, and stored the result in the ***df_woman_working*** variable.  

Then I stored it back in ***df_woman*** to use it in the rest of the notebook.

In [None]:
working_targets = ['Employed full-time',
                   'Independent contractor, freelancer, or self-employed',
                   'Employed part-time']

df_woman_working = df_woman[df_woman['Employment'].isin(working_targets)]

In [None]:
women_work_ = '{} out of {} women developers ({}% of them) were actively working at the moment of participating in the survey.'
notes.append(women_work_.format(df_woman_working.shape[0], 
                   df_woman.shape[0], 
                   round(df_woman_working.shape[0] / df_woman.shape[0] * 100, 1)))

In [None]:
df_woman = df_woman_working

## Work as Developers

I also wanted to know if the women participants were developers.  
The first step to find out was to get all questions related to code:

In [None]:
search_question(schema, 'code')

*MainBranch* looks promising. According to the survey, you could only provide one answer for this, so no need to impute or transform the answers.  
Let's see the unique values for this question:

In [None]:
df_woman['MainBranch'].unique()

In [None]:
stats_dev = get_probs(df_woman['MainBranch'])
stats_dev

In [None]:
plot_stats(stats_dev.index, stats_dev.values, title='Distribution of developer statuses for women developers', 
           xlabel_='Developer status', ylabel_='Percentage of women developers', filename='img/Devs.png', 
           xticks_labels=['Profession', 'Student', 'As part of work', 'Hobby', 'Not anymore'])

In [None]:
df_woman['MainBranch'].isnull().sum()

In [None]:
# Count missing answers within the women subset, not the full survey
branch_missing = df_woman['MainBranch'].isnull().sum()
branch_vals = [df_woman.shape[0],
               branch_missing,
               df_woman.shape[0] - branch_missing]

for i in range(3):
    branch_vals.append(round(stats_dev.iloc[i], 1))
    branch_vals.append(stats_dev.index[i])

main_branch_ = 'From the {} women developers participating in the survey, {} didn\'t provide information on this question. From the {} who did answer, approximately {}% answered "{}", {}% said "{}", and {}% answered "{}".'
notes.append(main_branch_.format(*branch_vals))

In the previous section, I filtered all the data from women to keep only those employed at the time of the survey.   
Now, I filter those who work as developers.

In [None]:
branch_targets = ['I am a developer by profession', 
                  'I am not primarily a developer, but I write code sometimes as part of my work']

df_woman = df_woman[df_woman['MainBranch'].isin(branch_targets)]

## Formal Education

Here I look only at the question about formal education and its unique values.  
As in previous sections, I use the helper functions to:
 - get questions related to the education level and unique values,
 - get percentages for this data,
 - plot the percentages,
 - take a note about this information

In [None]:
search_question(schema, "education")

In [None]:
df_woman['EdLevel'].unique()

In [None]:
stats_ed = get_probs(df_woman['EdLevel'])
stats_ed

In [None]:
plot_stats(stats_ed.index, stats_ed.values, title='Distribution of education levels for women developers', 
           xlabel_='Higher education achieved', ylabel_='Percentage of women developers', filename='img/EdLevel.png', 
           xticks_labels=['Bachelor', 'Master', 'Some college', 'Other doctoral', 'Associate', 
                          'Secondary school', 'Professional degree', 'No formal ed.', 'Primary school'])

In [None]:
df_woman['EdLevel'].isnull().sum()

In [None]:
ed_vals = [df_woman.shape[0], df_woman['EdLevel'].isnull().sum()]

# Take the three most frequent education levels by position
for i in range(3):
    ed_vals.append(round(stats_ed.iloc[i], 1))
    ed_vals.append(stats_ed.index[i])

notes.append('From the {} women developers, {} didn\'t provide information on this question, \
approximately {}% had a {}, {}% a {}, and {}% had {}.'.format(*ed_vals))

## Country

In [None]:
search_question(schema, "country")

In [None]:
df_woman['Country'].unique()[:10]

In [None]:
stats_country = get_probs(df_woman['Country'])
stats_country

In [None]:
plot_stats(stats_country.index[:10], stats_country.values[:10], title='Distribution of Top 10 countries\n where women developers currently reside', 
           xlabel_='Countries', ylabel_='Percentage of women developers', filename='img/Country.png')

In [None]:
df_woman['Country'].isnull().sum()

In [None]:
country_vals = [df_woman.shape[0], 
                df_woman['Country'].isnull().sum()]

# Take the five most frequent countries by position
for i in range(5):
    country_vals.append(stats_country.index[i])
    country_vals.append(round(stats_country.iloc[i], 1))

about_countries = 'The {} actively working women developers that participated in the survey \
({} didn\'t provide information about their residence) \
were predominantly from {} ({}%), \
{} ({}%), {} ({}%), {} ({}%), and {} ({}%).'
notes.append(about_countries.format(*country_vals))

## Ethnicity

In [None]:
search_question(schema, "identify")

For this question, it was possible to mark more than one answer, even though there were options like *Biracial* and *Multiracial*.  
We can see some of the unique answers in the next cell:

In [None]:
df_woman['Ethnicity'].unique()[:10]

In [None]:
df_woman['Ethnicity'].isnull().sum()

In [None]:
len(df_woman['Ethnicity'].unique())

There were 67 unique combinations for the question about *Ethnicity*.  
My approach was to simplify the information for the sole purpose of getting a better idea about ethnicities.  
I imputed the answers and kept only the first answer of each respondent.  
Then I got the percentages of distinct ethnicities for women developers employed at the time of the survey.

In [None]:
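ethnics = df_woman[['Ethnicity', 'Country']].dropna().copy()
# Keep only the first answer; assign back instead of using inplace=True on a column selection
ethnics['Ethnicity'] = ethnics['Ethnicity'].replace(to_replace=';.+', value='', regex=True)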

In [None]:
stats_ethnic = get_probs(ethnics['Ethnicity'])
stats_ethnic

In [None]:
plot_stats(stats_ethnic.index, stats_ethnic.values, 
           title='Distribution of ethnicities\nfor women developers\nemployed at the time of the survey',
           ylabel_='% of women developers\nemployed at the time of the survey', 
           xlabel_='Ethnicities', 
           xticks_labels=['White', 'South Asian', 'East Asian', 'Latinx', 'Black', 'Multiracial', 'Biracial',
                          'Middle Eastern', 
                          'Native or Indigenous'],
           filename='img/Distribution_Ethnics.png')

The plot gives us an idea of the ethnicities of the participants.  
But ethnicity makes more sense in the context of where the participant is located.  
To follow up on this idea, I:
 - took the countries where the women developers were currently located, 
 - created dummy variables out of the *Ethnicity* column,
 - filtered the top 10 countries where the participants were located,
 - grouped by *Country*,
 - computed the percentages for Country-Ethnicity values  
 
After all this, I plotted the matrix as a heatmap and provided the percentages to assist the comparison between ethnicities in these countries.
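The core of these steps (dummy columns, then a grouped mean) can be checked on a toy table. The countries `A` and `B` and ethnicities `X` and `Y` below are placeholders, not survey values.

```python
import pandas as pd

# Toy data: four respondents in country A, two in country B (illustrative only)
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Ethnicity': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
})

# One indicator column per ethnicity
dummies = pd.get_dummies(toy['Ethnicity'])

# The mean of an indicator within a group is the share of that group, here as a percentage
shares = pd.concat([toy['Country'], dummies], axis=1).groupby('Country').mean() * 100
print(shares)
```

When every category keeps its own indicator column, each row of the result adds up to 100, which is a handy sanity check before plotting.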

In [None]:
# Keep every ethnicity column (no drop_first) so no category is missing from the percentages
ethnics_top_countries = pd.concat([ethnics['Country'], pd.get_dummies(ethnics['Ethnicity'])], axis=1)
ethnics_top_countries = ethnics_top_countries[ethnics.Country.isin(stats_country.index[:10])].groupby('Country').mean() * 100
ethnics_top_countries

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
ax = sns.heatmap(ethnics_top_countries, annot=True, fmt=".2f")
for t in ax.texts: 
    t.set_text(t.get_text() + "%")
plt.autoscale()
plt.savefig('img/Ethnicity.png', bbox_inches = "tight");

As we can see in the heatmap, each country was paired with the % of women developers currently located there and their ethnicity.  
I kept each *Ethnicity* label as it was.

In [None]:
ethnic_vals = [df_woman.shape[0] - df_woman['Ethnicity'].isnull().sum()]

# Take the five most frequent ethnicities by position
for i in range(5):
    ethnic_vals.append(stats_ethnic.index[i])
    ethnic_vals.append(round(stats_ethnic.iloc[i], 1))

ethnic_vals.append(df_woman['Ethnicity'].isnull().sum())

about_ethnics = 'The {} actively working women developers that participated in the survey \
and gave information about their ethnicity were predominantly {} ({}%), \
{} ({}%), {} ({}%), {} ({}%), and {} ({}%). {} didn\'t provide information.'
notes.append(about_ethnics.format(*ethnic_vals))

## Age vs. Social Media

Out of curiosity, I wondered where I could find women developers to connect with. I've found many on Twitter (say [hi](https://twitter.com/miss_sizigia)!).  
But what are the other choices, and does this have something to do with their age?

In [None]:
search_question(schema, "social")

In [None]:
df_woman[['Age', 'SocialMedia']].isnull().sum().sum()

I selected the *Age* and *SocialMedia* columns, dropped NaN values, and got the unique values for *SocialMedia*.

In [None]:
woman_age_socialmedia = df_woman[['Age', 'SocialMedia']].dropna()

woman_age_socialmedia['SocialMedia'].unique()

As we can see, *VK ВКонта́кте*, *WeChat 微信* and *Weibo 新浪微博* have characters from other alphabets, so I cleaned the names for readability.

In [None]:
socialmedia_dict = {
    'replace': ["I don't use social media", '(?![a-zA-Z]).+'], 
    'value': ['None', '']
}

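# Assign the result back instead of using inplace=True on a column selection
woman_age_socialmedia['SocialMedia'] = woman_age_socialmedia['SocialMedia'].replace(
    to_replace=socialmedia_dict['replace'], 
    value=socialmedia_dict['value'], 
    regex=True)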

Next, I:
 - created dummy variables out of *SocialMedia*,
 - cut the *Age* column into age ranges instead of keeping the individual age of each participant and stored them as *AgeRange*,
 - dropped the *Age* column,
 - grouped by *AgeRange*, 
 - computed the mean percentages, and
 - stored it all in **age_social_media**:

In [None]:
# Keep every site column (no drop_first) so each site can show up as the most used one
woman_age_socialmedia = pd.concat([woman_age_socialmedia['Age'], pd.get_dummies(woman_age_socialmedia['SocialMedia'])], axis=1)

bins = [0, 18, 25, 32, 39, 46, 53, np.inf]
names = ['<18', '18-25', '25-32', '32-39', '39-46', '46-53', '53+']

woman_age_socialmedia['AgeRange'] = pd.cut(woman_age_socialmedia.Age, bins, labels=names)

woman_age_socialmedia.drop(columns='Age', inplace=True)

age_social_media = woman_age_socialmedia.groupby('AgeRange').mean() * 100
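It is worth checking where the boundary ages land, since `pd.cut` closes each interval on the right by default. The sketch below reuses the same `bins` and `names` on a few made-up ages.

```python
import numpy as np
import pandas as pd

bins = [0, 18, 25, 32, 39, 46, 53, np.inf]
names = ['<18', '18-25', '25-32', '32-39', '39-46', '46-53', '53+']

# Made-up ages sitting on and around the bin edges
ages = pd.Series([17, 18, 19, 25, 26, 53, 54])
ranges = pd.cut(ages, bins, labels=names)
print(ranges.astype(str).tolist())
```

An age equal to a bin edge falls into the lower range, so 18 is labeled `<18` and 53 is labeled `46-53`. Passing `right=False` to `pd.cut` would flip that behavior if left-closed ranges were preferred.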

Next, I create a dictionary to map the categorical ranges with strings to give them meaning later on when I take a note about this analysis.

In [None]:
age_social_dict = {}
strs = ['younger than 18', 
'between 18 and 25',
'between 25 and 32',
'between 32 and 39',
'between 39 and 46',
'between 46 and 53',
'older than 53']

for idx, name in enumerate(names):
    age_social_dict[name] = strs[idx]

age_social_dict

The following DataFrame shows the percentages of women developers employed at the time of the survey, grouped by age range. `pd.cut` closes each interval on the right by default, so a range such as (18, 25] excludes 18 and includes 25:
 - 18 or younger, labeled <18: (0, 18],
 - between 18 and 25: (18, 25],
 - between 25 and 32: (25, 32],
 - between 32 and 39: (32, 39],
 - between 39 and 46: (39, 46],
 - between 46 and 53: (46, 53],
 - older than 53, labeled 53+: (53, inf)

In [None]:
age_social_media

Even though it is informative on its own, it doesn't say much since I have to search back and forth to get the maximum values.  
I can quickly solve this by creating a new DataFrame with the maximum values per index and the associated percentage.  
I transposed both columns to have all values of *AgeRange* as columns and both *Most Used Site* and *% of users* as indexes.

In [None]:
most_used_media = pd.DataFrame([age_social_media.T.idxmax(), round(age_social_media.T.max(), 2)], 
                               index=['Most Used Site', '% of users'])
most_used_media
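How `idxmax` and `max` pair up is easier to see on a tiny table. The site names and percentages below are invented for illustration.

```python
import pandas as pd

# Toy percentages: rows are age ranges, columns are sites (illustrative only)
toy = pd.DataFrame(
    {'SiteA': [60.0, 20.0], 'SiteB': [40.0, 80.0]},
    index=['18-25', '25-32'],
)

# After transposing, idxmax/max work column by column, i.e. per age range
summary = pd.DataFrame([toy.T.idxmax(), toy.T.max().round(2)],
                       index=['Most Used Site', '% of users'])
print(summary)
```

The same information can be obtained without transposing through `toy.idxmax(axis=1)` and `toy.max(axis=1)`; the transposed form simply matches the layout used in this notebook.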

I extract another note from this data wrangling using the dictionary I previously created and another list to store significant data.

In [None]:
age_social_vals = [search_question(schema, "social")[0]['question'].lower(), 
                   df_woman.shape[0], 
                   # Count respondents (rows) with a missing value, not missing cells
                   df_woman[['Age', 'SocialMedia']].isnull().any(axis=1).sum(),
                   woman_age_socialmedia.shape[0]]

# Number of respondents in each age range, used to turn percentages into counts
group_sizes = woman_age_socialmedia['AgeRange'].value_counts()

for col in most_used_media.columns:
    # The percentage is relative to the age range, so scale by the size of that range
    age_social_vals.append(int(round(group_sizes[col] * most_used_media[col]['% of users'] / 100)))
    age_social_vals.append(age_social_dict[col])
    age_social_vals.append(most_used_media[col]['Most Used Site'])
    age_social_vals.append(most_used_media[col]['% of users'])

notes.append('When asked "{}" to {} women developers, {} didn\'t provide their age, \
their social media site of preference, or both. From the {} who provided the information, \
\n - {} women, {} years old, said {} ({}%),\
\n - {} women, {} years old, said {} ({}%), \
\n - {} women, {} years old, said {} ({}%), \
\n - {} women, {} years old, said {} ({}%), \
\n - {} women, {} years old, said {} ({}%), \
\n - {} women, {} years old, said {} ({}%), \
\n - {} women, {} years old, said {} ({}%).'.format(*age_social_vals))

Next, I found it useful to visualize the matrix of values with a heatmap, as done with *Ethnicity*.

In [None]:
fig, ax = plt.subplots(figsize=(15, 8))
ax = sns.heatmap(age_social_media, annot=True, fmt=".2f")
for t in ax.texts: 
    t.set_text(t.get_text() + "%")
plt.autoscale()
plt.savefig('img/SocialMedia.png', bbox_inches = "tight");

## Women vs. Stack Overflow

Last but not least, how do these actively working women developers interact with Stack Overflow?

First I need to know which questions, and therefore which columns, are related to Stack Overflow, so:

In [None]:
search_question(schema, "Stack Overflow")

Next, I'm going to copy a subset of questions from ***df_woman***.  
This prevents me from messing with the data I've been wrangling and keeps this new DataFrame separate from ***df_woman***.

In [None]:
stackoverflow = df_woman[['Age', 'MainBranch', 'Employment', 'YearsCodePro',
          'SOVisit1st', 'SOVisitFreq', 'SOVisitTo',
          'SOFindAnswer', 'SOTimeSaved', 'SOHowMuchTime',
          'SOAccount', 'SOPartFreq', 'SOJobs', 'EntTeams',
          'SOComm', 'WelcomeChange', 'SONewContent']].copy()

stackoverflow.head()

How many women are in this subset?

In [None]:
stackoverflow.shape[0]

How many answers are missing across these questions?

In [None]:
stackoverflow.isnull().sum().sum()

Let's keep information from only those who answered all of the questions about Stack Overflow.

In [None]:
stackoverflow.dropna(how='any', inplace=True)
stackoverflow.shape[0]
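The difference between counting missing cells and counting respondents with at least one missing answer is easy to overlook. A small sketch with made-up answers for two of the columns shows both counts next to `dropna(how='any')`.

```python
import numpy as np
import pandas as pd

# Made-up answers: one complete row, one with a gap, one fully empty
toy = pd.DataFrame({'SOAccount': ['Yes', np.nan, np.nan],
                    'SOJobs': ['No', 'Yes', np.nan]})

missing_cells = int(toy.isnull().sum().sum())           # every empty cell
incomplete_rows = int(toy.isnull().any(axis=1).sum())   # respondents with any gap
complete = toy.dropna(how='any')                        # rows that survive the filter

print(missing_cells, incomplete_rows, len(complete))
```

Only the second count matches the number of rows that `dropna(how='any')` removes.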

Now, I'd like to see if there's some relation between the first time they visited Stack Overflow (*SOVisit1st*), the years of coding professionally (*YearsCodePro*), and the frequency of participation in the community (*SOPartFreq*).

First, let's take a look at the unique values for *YearsCodePro* and *SOVisit1st*.

In [None]:
stackoverflow['YearsCodePro'].unique()

As we see in *YearsCodePro*, not all values can be converted to numbers.  
To represent ```"Less than 1 year"```, I'm going to impute it with a ```0```, since less than 12 months doesn't make a full year.

In [None]:
stackoverflow['SOVisit1st'].unique()

Again in *SOVisit1st*, not all values can be converted to numbers.  
To represent ```"I don't remember"```, I'm going to impute it with ```2019```, when the survey was launched, based on the assumption that at least they visited SO to complete the survey.

I'm copying and imputing *YearsCodePro* in ***years_SO***. I also cast the values to ```integer``` to be able to perform calculations with them in the next lines.  
  
Then I create the column *YearsSince1stVisitSO*, which stores how many years have passed between their first visit to SO and the year of the survey (2019).  
  
The next line adds yet another column, named *YearsCodePro+SO*, that stores the difference between the years since the first visit to SO and the years as a professional developer.  
  
Values for *YearsCodePro+SO* would be interpreted as follows:
 - negative: the developer had been working professionally for *x* years before first visiting SO
 - positive: the developer first visited SO *x* years before starting to work professionally
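A worked example with three made-up respondents makes the sign convention concrete. The values below are illustrative and only mimic the string format of the two survey columns.

```python
import pandas as pd

# Made-up respondents in the survey's string format (illustrative only)
toy = pd.DataFrame({
    'YearsCodePro': ['10', 'Less than 1 year', '3'],
    'SOVisit1st': ['2015', "I don't remember", '2010'],
})

years_pro = toy['YearsCodePro'].replace('Less than 1 year', '0').astype(int)
first_visit = toy['SOVisit1st'].replace("I don't remember", '2019').astype(int)

# Years since the first visit minus years of professional coding
diff = (2019 - first_visit) - years_pro
print(diff.tolist())
```

The first respondent had already been coding professionally for 6 years when she first visited SO (negative value), the second started both at about the same time (zero), and the third had been visiting SO for 6 years before her first developer job (positive value).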

In [None]:
# Copy YearsCodePro into years_SO, imputing "Less than 1 year" as 0 and casting to integer
years_SO = stackoverflow[['YearsCodePro']].replace("Less than 1 year", 0, regex=True).astype(int)

# Impute SOVisit1st, compute the difference between the year of the survey, cast it as integer,
#and store it in years_SO as YearsSince1stVisitSO
years_SO['YearsSince1stVisitSO'] = 2019 - stackoverflow['SOVisit1st']\
.replace("I don't remember", 2019, regex=True).astype(int)

# Compute the difference between YearsSince1stVisitSO and YearsCodePro
years_SO['YearsCodePro+SO'] = years_SO['YearsSince1stVisitSO'] - years_SO['YearsCodePro']

years_SO.head()

We obtained the values as expected, but a better way to make sense of continuous data is to bin it, as I did before for [Age vs. Social Media](#Age-vs.-Social-Media).

Let's take a better look at the bins we're going to generate next:

 - Devs with 12 or more years of professional experience before their first visit: **(-inf, -12.0]**
 - Devs with 1 to almost 12 years of professional experience before their first visit: **(-12.0, -1.0]**
 - Devs who visited SO for the first time around the same time they started to work as developers, **(-1.0, 0.0]**
 - Devs who visited SO for the first time around the time they started to work as developers and as far as 3 years before that moment, **(0.0, 3.0]**
 - Devs who visited SO for the first time around 3 years before they started to work as developers and as far as 5 years before that moment, **(3.0, 5.0]**
 - Devs who visited SO for the first time more than 5 years before they started to work as developers, **(5.0, inf)**
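To see which label each difference receives, the same bin edges can be applied to a few sample values, including the edges themselves. The sample differences are made up for illustration.

```python
import numpy as np
import pandas as pd

bins = [-np.inf, -12, -1, 0, 3, 5, np.inf]
names = ['12+ before SO', '1-12 before SO', 'Work+SO same time',
         '0 to 3 after SO', '3-5 after SO', '5+ after SO']

# Sample differences, including values that sit exactly on a bin edge
diffs = pd.Series([-15, -12, -6, -1, 0, 2, 3, 4, 6])
labels = pd.cut(diffs, bins, labels=names)
print(labels.astype(str).tolist())
```

Because the intervals are closed on the right, with whole-year differences the *Work+SO same time* bin contains only the value 0.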
 


In [None]:
bins = [-np.inf, -12, -1, 0, 3, 5, np.inf]
names = ['12+ before SO', 
         '1-12 before SO', 
         'Work+SO same time', 
         '0 to 3 after SO', 
         '3-5 after SO',
         '5+ after SO']

years_SO['YearsCodePro+SO'] = pd.cut(years_SO['YearsCodePro+SO'], bins, labels=names)

years_SO.head()

I'm going to create another dictionary to keep track of the meaning of each bin. The purpose is using them later to get notes on this analysis.

In [None]:
years_SO_dict = {}

strs = ['devs with 12 or more years of professional experience before their first visit',
        'devs with 1 to almost 12 years of professional experience before their first visit', 
        'devs who visited SO for the first time around the same time they started to work as developers', 
        'devs who visited SO for the first time around the time they started to work as developers and as far as 3 years before that moment', 
        'devs who visited SO for the first time around 3 years before they started to work as developers and as far as 5 years before that moment', 
        'devs who visited SO for the first time more than 5 years before they started to work as developers']

for idx, name in enumerate(names):
    years_SO_dict[name] = strs[idx]

years_SO_dict

In the next cell, I take the following steps:
 - create a dummy DataFrame for *SOPartFreq*, which assigns ```1``` if the category is present in the row and ```0``` if not,
 - concatenate it with the *YearsCodePro+SO* column,
 - group by *YearsCodePro+SO*, to have the values for each bin I previously created, and last
 - take the mean percentages of the values and store the result in ***dummies_years***.

In [None]:
dummies_years = pd.concat([years_SO['YearsCodePro+SO'], 
                           pd.get_dummies(stackoverflow['SOPartFreq'])], 
                          axis=1)\
                .groupby('YearsCodePro+SO').mean() * 100

There's another thing that can help the visualization: the columns of ***dummies_years*** are actually ordinal variables.  
Let's represent this by re-indexing the DataFrame with a list in which I reordered the columns from most to least frequent.

In [None]:
dummies_years = dummies_years.reindex(['Multiple times per day',
                       'Daily or almost daily',
                       'A few times per week',
                       'A few times per month or weekly',
                       'Less than once per month or monthly',
                       'I have never participated in Q&A on Stack Overflow'], axis=1)
dummies_years
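One caveat with `reindex` is that labels are matched exactly. The toy sketch below, with shortened made-up column names, shows both the reordering and what happens when a label is not found.

```python
import pandas as pd

# Toy percentages with columns in alphabetical order (illustrative only)
toy = pd.DataFrame({'Daily': [10.0], 'Monthly': [60.0], 'Weekly': [30.0]},
                   index=['group'])

# Reorder from most to least frequent participation
ordered = toy.reindex(['Daily', 'Weekly', 'Monthly'], axis=1)

# A label that does not exist yields a column of NaN instead of an error
with_typo = toy.reindex(['Daily', 'Weekly', 'Monthy'], axis=1)
print(ordered)
print(with_typo)
```

A quick look for all-NaN columns after re-indexing is a cheap way to catch a misspelled category.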

Another way to visualize these values:

In [None]:
fig, ax = plt.subplots(figsize=(15, 8))
ax = sns.heatmap(dummies_years, annot=True, fmt=".2f")
for t in ax.texts: 
    t.set_text(t.get_text() + "%")
plt.autoscale()
plt.savefig('img/YearsSOandPro.png', bbox_inches = "tight");

Great! Let's get the max frequencies for each group of developers:

In [None]:
years_SO_codePro = pd.DataFrame([dummies_years.T.idxmax(), round(dummies_years.T.max(), 2)], 
                               index=['Frequency of Participation', '% of users'])
years_SO_codePro

The percentages strongly suggest that actively working women developers, no matter how long they've been working as developers and have had SO as a resource, participate in the site less than once per month or monthly.

In [None]:
years_SO_vals = [stackoverflow.shape[0]]

for col in years_SO_codePro.columns:
    years_SO_vals.append(years_SO_codePro[col]['% of users'])
    years_SO_vals.append(years_SO_dict[col])
    years_SO_vals.append(years_SO_codePro[col]['Frequency of Participation'].lower())
    
notes.append('When asked about their participation in Stack Overflow, {} actively working \
women developers provided the following insights:\n\
 - {}% of {}, said to participate in SO {},\n\
 - {}% of {}, said to participate in SO {},\n\
 - {}% of {}, said to participate in SO {},\n\
 - {}% of {}, said to participate in SO {},\n\
 - {}% of {}, said to participate in SO {},\n\
 - {}% of {}, said to participate in SO {}.'.format(*years_SO_vals))

## Notes

Finally, let's print all my notes, save them in a file, and use them to write an article or share the insights with my team, friends, family, or you!

In [None]:
print(*notes, sep='\n')

In [None]:
with open('women_SO_notes.txt', 'w') as file:
    for line in notes:
        file.write("%s \n" % line)