## Introduction

2020 is over, which calls for a retrospective. The 2020 Kaggle survey includes users from different backgrounds and parts of society. The global pandemic of 2020 affected female workers in every sector across the world, with evidence-based reports from the [United States](https://www.americanprogress.org/issues/women/reports/2020/10/30/492582/covid-19-sent-womens-workforce-progress-backward/) and [India](https://www.business-standard.com/article/current-affairs/covid-impact-women-workforce-disappearing-most-affected-in-urban-india-120121500259_1.html).

There has always been a massive disparity between women and men in the IT sector, with men dominating the industry from its inception. The global pandemic only made it worse. Given the currently unstable job market, this report specifically studies the responses of female and non-binary users in an attempt to learn which services and tools can be provided to this population to help bridge the gender gap in the industry. I have included the non-binary respondents as well, since it is encouraging to analyse the diversity in the user base.

In [None]:
import numpy as np # linear algebra
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from matplotlib import cm
from math import log10

warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False)
# The first row holds the question text rather than a response, so drop it
data.drop([0], inplace=True)

## 1. Demographic

The genders of Kaggle users are far from evenly represented. The number of female users is alarmingly low, as is the number of non-binary users. No assumptions can be made about those who preferred not to say or to self-describe. Because of this disparity, the analysis focuses only on female and non-binary users.

In [None]:
# Take a copy so that later column edits don't act on a view of `data`
surveydf = data[(data.Q2=='Woman') | (data.Q2=='Nonbinary')].copy()
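
As a rough check of the disparity described above, the share of each gender can be read off `Q2` before filtering. The sketch below uses a small made-up frame in place of the survey data (the `toy` values are assumptions, not survey results) and shows an equivalent `isin` filter. Taking a `.copy()` of the subset is one way to avoid pandas' chained-assignment warning when its columns are modified later.

```python
import pandas as pd

# Toy stand-in for the survey responses; the values are illustrative only.
toy = pd.DataFrame({"Q2": ["Man", "Man", "Woman", "Man", "Nonbinary",
                           "Prefer not to say", "Man", "Woman"]})

# Percentage of respondents per gender option.
gender_share = toy["Q2"].value_counts(normalize=True) * 100

# Same kind of selection as the Woman/Nonbinary filter, written with isin().
subset = toy[toy["Q2"].isin(["Woman", "Nonbinary"])].copy()

print(gender_share.round(1))
print(len(subset), "of", len(toy), "toy rows selected")
```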

### Age

In [None]:
# 1. Age
age = surveydf['Q1'].value_counts(normalize=True)*100

style = dict(size=10, color='black')
sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=age.index, y=age, palette="husl")
ax.set(xlabel="Age Groups", ylabel = "% of Respondents", title = 'Respondents of Varying Age Groups')
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

# Center the plot outputs in the notebook
from IPython.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

Most female and non-binary users are between the ages of 18 and 29, with the largest group in their early 20s. It's interesting to note the users aged about 50 and older, especially the 2 respondents over the age of 70!
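
`value_counts` orders its result by frequency, so the age groups on the x-axis are not necessarily in age order. If a chronological axis is preferred, the result can be sorted by its index. A minimal sketch on made-up data (the bracket labels follow the style of `Q1`, the counts are invented):

```python
import pandas as pd

# Invented age brackets; not survey data.
ages = pd.Series(["22-24", "18-21", "22-24", "25-29", "70+", "22-24", "18-21"])

by_frequency = ages.value_counts(normalize=True) * 100
# Lexicographic order happens to match age order for labels of this shape.
by_age = by_frequency.sort_index()

print(list(by_frequency.index))
print(list(by_age.index))
```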

In [None]:
# Horizontal Barplot Y axis labels
def show_values_on_bars(axs, h_v="h", space=0.4):
    # Code from https://stackoverflow.com/questions/43214978/seaborn-barplot-displaying-values
    def _show_on_single_plot(ax):
        if h_v == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height()
                value = round(p.get_height(), 2)
                ax.text(_x, _y, value, ha="center", fontsize = 8) 
        elif h_v == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height()
                value = round(p.get_width(), 2)
                ax.text(_x, _y, value, ha="left", fontsize = 8)

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)

### Country of Residence

In [None]:
#3. Country
surveydf.loc[surveydf['Q3']=='United Kingdom of Great Britain and Northern Ireland','Q3']='UK and Northern Ireland'
country = surveydf['Q3'].value_counts(normalize=True)*100

sns.set(rc={'figure.figsize':(15.7,15.27)})
ax = sns.barplot(y=country.index, x=country, palette="husl")
ax.set(ylabel="Countries", xlabel = "% of Respondents", title = "Respondents' Country of Residence")
show_values_on_bars(ax)
None

Respondents from at least 55 countries participated in the survey. The largest share of these respondents live in India, followed by the United States. The remaining 54.4% of the selected respondents are spread across other parts of the world.
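
The "remaining" figure quoted above is 100 minus the combined share of the two largest countries. A small sketch of that arithmetic, using invented shares rather than the survey's numbers:

```python
import pandas as pd

# Made-up country shares in percent (not survey results).
country_share = pd.Series(
    {"India": 30.0, "United States of America": 15.0, "Other": 20.0,
     "Brazil": 10.0, "Japan": 25.0}
).sort_values(ascending=False)

top_two = country_share.head(2)
remaining = 100 - top_two.sum()

print(top_two.index.tolist())
print(f"{remaining:.1f}% of toy respondents are outside the top two countries")
```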

### Qualifications

In [None]:
# 4. Qualifications
qualification = surveydf['Q4'].value_counts(normalize=True)*100
n = len(qualification)
colors = [cm.Set2(i / n) for i in range(n)]

pie, ax = plt.subplots(figsize=[13,10])
labels = qualification.index
plt.pie(x=qualification, autopct="%.1f%%", explode=[0.01]*qualification.shape[0], labels=labels, colors=colors, pctdistance=0.5, textprops={'fontsize': 9})
plt.title("Qualification of the Respondents", fontsize=14);

A total of 89.63% of the selected respondents have an official university degree, with the largest group holding a master's degree.
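
A total like the one above comes from adding up the shares of the degree categories. The sketch below shows the idea on invented percentages. The labels are simplified versions of the `Q4` options, and which categories count as an "official university degree" is an assumption made here for illustration.

```python
import pandas as pd

# Invented shares in percent; labels are simplified, not the exact survey strings.
qualification_share = pd.Series({
    "Master's degree": 40.0,
    "Bachelor's degree": 35.0,
    "Doctoral degree": 10.0,
    "Professional degree": 5.0,
    "Some college/university study without earning a degree": 6.0,
    "No formal education past high school": 2.0,
    "I prefer not to answer": 2.0,
})

# Assumed definition of "holds a university degree".
degree_labels = ["Bachelor's degree", "Master's degree",
                 "Doctoral degree", "Professional degree"]
with_degree = qualification_share.loc[degree_labels].sum()

print(f"{with_degree:.2f}% of toy respondents hold a university degree")
```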

### Current Role

In [None]:
# 5. Current Role
role = surveydf['Q5'].value_counts(normalize=True)*100

sns.set(rc={'figure.figsize':(18.7,6.27)})
ax = sns.barplot(x=role.index, y=role, palette="husl")
ax.set(xlabel="Current Role", ylabel = "% of Respondents", title = "Current/Recent Job Titles")
ax.set_xticklabels(ax.get_xticklabels(), rotation=55)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

The largest group of female and non-binary respondents were students. While 10% of these respondents were unemployed, the top job titles were Data Scientist and Data Analyst.

## 2. Technical Information

### Experience with Writing Code

In [None]:
# 6. Coding experience
exp = surveydf['Q6'].value_counts(normalize=True)*100
n = len(exp)
colors = [cm.Set2(i / n) for i in range(n)]

pie, ax = plt.subplots(figsize=[13,10])
labels = exp.index
plt.pie(x=exp, autopct="%.1f%%", explode=[0.02]*exp.shape[0], labels=labels, colors=colors, pctdistance=0.5, textprops={'fontsize': 10})
plt.title("Years of Coding Experience", fontsize=14);

51.53% of the selected respondents have between 1 and 5 years of experience writing code. Respondents with more programming experience are fewer in number, and a small portion (8.8%) of respondents have never written code.

In [None]:
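# Function for questions with multiple answers
# Each answer option is stored in its own column, so the non-null count of a column
# is the number of respondents who selected that option
import operator
def multiple_ans(quest):
    ans_dict = {}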
    for col in surveydf.columns:
        if col.startswith(quest):
            try:
                surveydf[col] = surveydf[col].str.strip()
                key = surveydf[col].value_counts().index[0]
                ans_dict[key] = surveydf[col].count()
            except IndexError:
                continue

    # Get percentages of the values
    for keys in ans_dict:
        ans_dict[keys] = round(ans_dict[keys] * 100/surveydf.shape[0], 2)

    # Sort Dictionary values
    sorted_items = {k: v for k, v in sorted(ans_dict.items(), key=lambda item: item[1], reverse=True)}
    return sorted_items

In [None]:
# Function for multiple answers of alternate questions
def multiple_ans_alt(quest, df):
    ans_dict = {}
    i = 0
    for col in df.columns:
        if col.startswith(quest):
            try:
                df[col] = df[col].str.strip()
                key = df[col].value_counts().index[0]
                ans_dict[key] = df[col].count()
            except IndexError:
                continue

    # Get percentages of the values
    for keys in ans_dict:
        ans_dict[keys] = round(ans_dict[keys] * 100/df.shape[0], 2)

    # Sort Dictionary values
    sorted_items = {k: v for k, v in sorted(ans_dict.items(), key=lambda item: item[1], reverse=True)}
    return sorted_items
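
Both helpers rely on how the survey stores multiple-choice questions: each option gets its own column (for example `Q7_Part_1`), which holds the option text for respondents who selected it and is empty otherwise. Counting the non-null cells per column therefore gives the number of selections. A toy illustration of that idea with made-up rows (it mirrors the helpers' logic but is not a drop-in replacement):

```python
import pandas as pd

# Invented rows laid out like a multi-select survey question.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None, "Python"],
    "Q7_Part_2": [None, "R", "R", None],
    "Q7_Part_3": [None, None, None, None],   # option nobody selected
    "Q8": ["Python", "R", "R", "Python"],    # single-choice column, ignored below
})

shares = {}
for col in toy.columns:
    if not col.startswith("Q7_"):
        continue
    selected = toy[col].dropna()
    if selected.empty:
        continue  # skip options with no selections
    label = selected.iloc[0]  # the option text stored in this column
    shares[label] = round(len(selected) * 100 / len(toy), 2)

print(shares)
```

Because a respondent can pick several options, these percentages are not expected to add up to 100.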

### Programming language used regularly

In [None]:
# 7. Regular programming language
values = multiple_ans('Q7')

sns.set(rc={'figure.figsize':(14.7,6.27)})
ax = sns.barplot(x=list(values.keys()), y=list(values.values()), palette="husl")
ax.set(xlabel="Programming Language", ylabel = "% of Respondents", title = "Regular Programming Language")
ax.set_xticklabels(ax.get_xticklabels(), rotation=55)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

About 70% of the respondents use Python, making it the most used language.

### Languages used along with Python

In [None]:
# 7.1 Respondents who used Python
python_users = surveydf[surveydf['Q7_Part_1'].notna()]
language_py = {}
i = 0
for col in python_users.columns:
    if col.startswith('Q7'):
        try:
            key = python_users[col].value_counts().index[0]
            language_py[key] = python_users[col].count()
        except IndexError:
            continue

In [None]:
# Get percentages of the values
for keys in language_py:
    language_py[keys] = round(language_py[keys] * 100/python_users.shape[0], 2)

# Sort Dictionary values
values = {k: v for k, v in sorted(language_py.items(), key=lambda item: item[1], reverse = True)}
del values['Python']

sns.set(rc={'figure.figsize':(14.7,6.27)})
ax = sns.barplot(x=list(values.keys()), y=list(values.values()), palette="husl")
ax.set(xlabel="Programming Language", ylabel = "% of Respondents", title = "Programming Language Used With Python")
ax.set_xticklabels(ax.get_xticklabels(), rotation=55)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Respondents who used Python most often also used SQL, and used languages like R and C less.

### Languages used other than Python

In [None]:
# 7.2 Respondents who didn't use Python
non_python_users = surveydf[surveydf['Q7_Part_1'].isna()]
language_non_py = {}
i = 0
for col in non_python_users.columns:
    if col.startswith('Q7'):
        try:
            key = non_python_users[col].value_counts().index[0]
            language_non_py[key] = non_python_users[col].count()
        except IndexError:
            continue

In [None]:
# Get percentages of the values
for keys in language_non_py:
    language_non_py[keys] = round(language_non_py[keys] * 100/non_python_users.shape[0], 2)

# Sort Dictionary values
values = {k: v for k, v in sorted(language_non_py.items(), key=lambda item: item[1], reverse = True)}


sns.set(rc={'figure.figsize':(14.7,6.27)})
ax = sns.barplot(x=list(values.keys()), y=list(values.values()), palette="husl")
ax.set(xlabel="Programming Language", ylabel = "% of Respondents", title = "Programming Languages Used by Non-Python Users")
ax.set_xticklabels(ax.get_xticklabels(), rotation=55)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Among those who didn't use Python, R was used more, while SQL still remained the most used programming language.
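
The two charts above compare the same language columns within two groups, split on whether `Q7_Part_1` (Python) is filled. As a sketch, the same comparison can be made in one step by grouping on that flag. The rows below are invented for illustration, and `group` is a hypothetical helper series:

```python
import pandas as pd

# Invented rows; columns follow the one-column-per-option layout of Q7.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", "Python", None, None],
    "Q7_Part_2": ["R", None, None, "R", "R"],
    "Q7_Part_3": ["SQL", "SQL", None, "SQL", None],
})

# Label each respondent by whether the Python column is filled.
group = toy["Q7_Part_1"].notna().map(
    {True: "Python users", False: "Non-Python users"}
).rename("group")

# notna() turns each option column into True/False; the group mean is the
# fraction of that group who selected the option.
share = toy[["Q7_Part_2", "Q7_Part_3"]].notna().groupby(group).mean() * 100
share.columns = ["R", "SQL"]

print(share.round(1))
```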

### Programming language recommendation

In [None]:
# 8. Language recommendation
lang = surveydf['Q8'].value_counts(normalize=True)*100

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=lang.index, y=lang, palette="husl")
ax.set(ylabel="% of Respondents", xlabel = "Programming Language", title = "Most Recommended Language")
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Python is the most recommended language, given its ease of use and compatibility. Although R handles data in a similar manner to Python, it is recommended far less often.

### Integrated Development Environments (IDEs) used regularly

In [None]:
# 9. Regular IDE 
ide = multiple_ans('Q9')
ide['Jupyter'] = ide.pop('Jupyter (JupyterLab, Jupyter Notebooks, etc)')
ide['Visual Studio Code'] = ide.pop('Visual Studio Code (VSCode)')
ide = {k: v for k, v in sorted(ide.items(), key=lambda item: item[1], reverse=True)}

sns.set(rc={'figure.figsize':(14.7,6.27)})
ax = sns.barplot(x=list(ide.keys()), y=list(ide.values()), palette="husl")
ax.set(xlabel="IDE", ylabel = "% of Respondents", title = "Integrated Development Environments (IDE) Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=55)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Jupyter notebooks for the win! Support for scripting, visualizations and narrative text, along with languages like R and Python, makes them easy to use. With different products supporting Jupyter notebooks, it is not surprising that this is the most used IDE.

### Regular hosted notebook products used

In [None]:
# 10. Hosted Notebook Products
nb = multiple_ans('Q10')

sns.set(rc={'figure.figsize':(23.7,6.27)})
ax = sns.barplot(x=list(nb.keys()), y=list(nb.values()), palette="husl")
ax.set(xlabel="Hosted Notebook Products", ylabel = "% of Respondents", title = "Hosted Notebook Products Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)

None

A little over a quarter of the selected respondents do not use hosted notebooks, suggesting that a sizeable group prefers locally hosted notebooks.

### Most preferred computing platform

In [None]:
# 11. Platform preference
platform = surveydf['Q11'].value_counts(normalize=True)*100
n = len(platform)
colors = [cm.Set2(i / n) for i in range(n)]

pie, ax = plt.subplots(figsize=[13,10])
labels = platform.index
plt.pie(x=platform, autopct="%.1f%%", explode=[0.02]*platform.shape[0], labels=labels, colors=colors, pctdistance=0.7, textprops={'fontsize': 11})
plt.title("Platform Preference", fontsize=14);
None

Most of the selected respondents prefer using personal systems over online hosted systems, thus supporting the previous statement that respondents prefer local systems.

### Specialized hardware used regularly

In [None]:
# 12. Specialised Hardware
hw = multiple_ans('Q12')

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(hw.keys()), y=list(hw.values()), palette="husl")
ax.set(xlabel="Hardware", ylabel = "% of Respondents", title = "Specialized Hardware Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)

None

While most selected respondents do not use additional hardware, GPUs are the most used among those who do.

### Use of Tensor Processing Unit (TPU)

In [None]:
# 13. TPU frequency
tpu = surveydf['Q13'].value_counts(normalize=True)*100
n = len(tpu)
colors = [cm.Set2(i / n) for i in range(n)]

pie, ax = plt.subplots(figsize=[13,10])
labels = tpu.index
plt.pie(x=tpu, autopct="%.1f%%", explode=[0.02]*tpu.shape[0], labels=labels, colors = colors, pctdistance=0.7, textprops={'fontsize': 10})
plt.title("Number of Times TPU (Tensor Processing Unit) Was Used", fontsize=14);

Very few respondents have used a TPU, and only a marginal number use one often.

## 3. Machine Learning

### Visualization libraries used regularly

In [None]:
# 14. Visualization Libraries
lib = multiple_ans('Q14')

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(lib.keys()), y=list(lib.values()), palette="husl")
ax.set(xlabel="Visualization Library", ylabel = "% of Respondents", title = "Visualization Libraries Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)

None

Known for its simple and easy depiction of data, Matplotlib was the most used data visualization library, followed by Seaborn, which is built on top of it. Ggplot took the lead for R users.

### Years of experience with Machine Learning

In [None]:
# 15. Experience in ML
ml = surveydf['Q15'].value_counts(normalize=True)*100
n = len(ml)
colors = [cm.tab20(i / n) for i in range(n)]

pie, ax = plt.subplots(figsize=[13,10])
labels = ml.index
plt.pie(x=ml, autopct="%.1f%%", explode=[0.01]*ml.shape[0], labels=labels, colors = colors, pctdistance=0.7, textprops={'fontsize': 9})
plt.title("Years Spent Using Machine Learning", fontsize=14);

Inspiringly, a little less than half of the selected respondents have been using machine learning methods for less than a year. Those with more experience are comparatively fewer.

### Machine Learning frameworks used regularly

In [None]:
# 16. ML frameworks
fw = multiple_ans('Q16')

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(fw.keys()), y=list(fw.values()), palette="husl")
ax.set(xlabel="ML Framework", ylabel = "% of Respondents", title = "Machine Learning Framework Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)

None

The most widely used frameworks are those that support more complicated algorithms and data processing techniques.

### Machine Learning algorithms used regularly

In [None]:
# 17. ML algorithm
algo = multiple_ans('Q17')
algo['Gradient Boosting Machines'] = algo.pop('Gradient Boosting Machines (xgboost, lightgbm, etc)')
algo['Transformer Networks'] = algo.pop('Transformer Networks (BERT, gpt-3, etc)')
algo['Dense Neural Networks'] = algo.pop('Dense Neural Networks (MLPs, etc)')
algo = {k: v for k, v in sorted(algo.items(), key=lambda item: item[1], reverse=True)}

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(algo.keys()), y=list(algo.values()), palette="husl")
ax.set(xlabel="ML Algorithms", ylabel = "% of Respondents", title = "Machine Learning Algorithms Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Logistic and linear regression have always been the most commonly used algorithms. What these algorithms can't capture can be analysed with decision tree algorithms and, for better accuracy at the cost of increased complexity, convolutional neural networks.

### Computer Vision methods used regularly

In [None]:
# Alternate Question based on answers in Q17
computer_vison_df = surveydf[['Q18_Part_1', 'Q18_Part_2', 'Q18_Part_3', 'Q18_Part_4', 'Q18_Part_5', 'Q18_Part_6', 'Q18_OTHER']]
computer_vison_df = computer_vison_df.dropna(how='all')

In [None]:
# 18. Computer vision
cn = multiple_ans_alt('Q18',computer_vison_df)
cn['Image classification, general purpose'] = cn.pop('Image classification and other general purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc)')
cn['Image segmentation methods'] = cn.pop('Image segmentation methods (U-Net, Mask R-CNN, etc)')
cn['General purpose image/video tools'] = cn.pop('General purpose image/video tools (PIL, cv2, skimage, etc)')
cn['Object detection methods'] = cn.pop('Object detection methods (YOLOv3, RetinaNet, etc)')
cn['Generative Networks'] = cn.pop('Generative Networks (GAN, VAE, etc)')

cn = {k: v for k, v in sorted(cn.items(), key=lambda item: item[1], reverse=True)}

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(cn.keys()), y=list(cn.values()), palette="husl")
ax.set(xlabel="Computer Vision Methods", ylabel = "% of Respondents", title = "Computer Vision Methods Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Among those who answered this question, image classification and general purpose computer vision models stood out, with generative networks being used the least.

In [None]:
# Alternate Question based on answers in Q17
nlp_df = surveydf[['Q19_Part_1', 'Q19_Part_2', 'Q19_Part_3', 'Q19_Part_4', 'Q19_Part_5', 'Q19_OTHER']]
nlp_df = nlp_df.dropna(how='all')

### Natural Language Processing (NLP) methods used regularly

In [None]:
# 19. NLP
nlp =  multiple_ans_alt('Q19',nlp_df)
nlp['Word embeddings/vectors'] = nlp.pop('Word embeddings/vectors (GLoVe, fastText, word2vec)')
# The lookup key keeps the spelling used in the survey data; only the display label is corrected
nlp['Encoder-decoder models'] = nlp.pop('Encoder-decorder models (seq2seq, vanilla transformers)')
nlp['Transformer language models'] = nlp.pop('Transformer language models (GPT-3, BERT, XLnet, etc)')
nlp['Contextualized embeddings'] = nlp.pop('Contextualized embeddings (ELMo, CoVe)')

nlp = {k: v for k, v in sorted(nlp.items(), key=lambda item: item[1], reverse=True)}

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(nlp.keys()), y=list(nlp.values()), palette="husl")
ax.set(xlabel="NLP Methods", ylabel = "% of Respondents", title = "Natural Language Processing Methods Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Among those who were asked this question, word embeddings were used more than any other method.

## 4. Organization

### Size of the company respondents are employed at

In [None]:
# 20. Company size
size = surveydf['Q20'].value_counts(normalize=True)*100
n = len(size)
colors = [cm.Set2(i / n) for i in range(n)]

pie, ax = plt.subplots(figsize=[13,10])
labels = size.index
plt.pie(x=size, autopct="%.1f%%", explode=[0.02]*size.shape[0], labels=labels, colors = colors, pctdistance=0.7, textprops={'fontsize': 10})
plt.title("Size of Company Respondents Are Employed At", fontsize=14);

While most respondents indicated that they belong to small-sized companies, it is worth noting that this includes respondents who are unemployed. Respondents from medium-sized and larger companies are distributed fairly equally.
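
Whether unemployed respondents actually appear in the company-size answers can be checked by counting the non-empty `Q20` answers per job title in `Q5`. The sketch below does this on invented rows. The role and size labels are assumptions modelled on the survey options, and the pattern shown is not a statement about the real data.

```python
import pandas as pd

# Invented rows; None marks a respondent with no company-size answer.
toy = pd.DataFrame({
    "Q5": ["Student", "Data Scientist", "Currently not employed",
           "Data Analyst", "Data Scientist"],
    "Q20": [None, "0-49 employees", None,
            "0-49 employees", "10,000 or more employees"],
})

# count() ignores missing values, so this is the number of Q20 answers per role.
answered_by_role = toy.groupby("Q5")["Q20"].count()

print(answered_by_role)
```

If a role shows a count of zero, respondents with that role did not contribute to the company-size chart.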

### Individuals responsible for Data Science at workplace

In [None]:
# 21. Data Science employees
emp = surveydf['Q21'].value_counts(normalize=True)*100
n = len(emp)
colors = [cm.Set2(i / n) for i in range(n)]

#Using matplotlib
pie, ax = plt.subplots(figsize=[13,10])
labels = emp.index
plt.pie(x=emp, autopct="%.1f%%", explode=[0.02]*emp.shape[0], labels=labels, colors = colors, pctdistance=0.7, textprops={'fontsize': 10})
plt.title("Number of Individuals Responsible for Data Science at Workplace", fontsize=14);

These companies tend to have either very few or quite a lot of individuals responsible for data science workloads. 37% of the companies have at most 4 employees who work on data using data science tools and techniques.

### Machine Learning methods used at work

In [None]:
# 22. ML methods at work
methods = surveydf['Q22'].value_counts(normalize=True)*100
n = len(methods)
colors = [cm.Set2(i / n) for i in range(n)]

#Using matplotlib
pie, ax = plt.subplots(figsize=[13,10])
labels = methods.index
plt.pie(x=methods, autopct="%.1f%%", explode=[0.02]*methods.shape[0], colors = colors, labels=labels, pctdistance=0.7, textprops={'fontsize': 10})
plt.title("Extent of Machine Learning Methods Used at Work", fontsize=14);

While 35.8% of the companies use streamlined Machine Learning processes, 25% of the companies are only beginning to venture into Machine Learning methods. This suggests that companies are expanding the data science side of their businesses.

### Job Role Duties

In [None]:
# 23. Job role duties
role = multiple_ans('Q23')

sns.set(rc={'figure.figsize':(20.7,9.27)})
ax = sns.barplot(y=list(role.keys()), x=list(role.values()), palette="husl")
ax.set(ylabel="Job Role Duties", xlabel = "% of Respondents")
ax.set_title("Job Role Duties",fontsize=15)
ax.set_yticklabels(ax.get_yticklabels())
show_values_on_bars(ax, "h", 0.3)
None

Analysing data that affects business decisions and products is the most common job duty of the selected respondents. Data storage duties come in second, with experimentation with machine learning methods following closely behind.

### Yearly Compensation in USD

In [None]:
# 24. Salary
# surveydf['Q24'].value_counts(normalize=True)*100
salary_dict = {
    '$0-999': 28.596594,
    '1,000-1,999': 7.105109,
    '2,000-2,999': 3.875514,
    '3,000-3,999': 2.701116,
    '4,000-4,999': 2.701116,
    '5,000-7,499': 4.932472,
    '7,500-9,999': 3.053435,
    '10,000-14,999': 5.226072,
    '15,000-19,999': 4.051674,
    '20,000-24,999': 3.523194,
    '25,000-29,999': 2.055197,
    '30,000-39,999': 4.286553,
    '40,000-49,999': 4.286553,
    '50,000-59,999': 3.405755,
    '60,000-69,999': 2.877275,
    '70,000-79,999': 2.348796,
    '80,000-89,999': 2.172637,
    '90,000-99,999': 2.172637,
    '100,000-124,999': 4.521433,
    '125,000-149,999': 2.290076,
    '150,000-199,999': 2.290076,
    '200,000-249,999': 0.645919,
    '250,000-299,999': 0.293600,
    '300,000-500,000': 0.293600,
    '> $500,000': 0.293600
}

In [None]:
sns.set(rc={'figure.figsize':(20.7,8.27)})
my_range=range(1,len(salary_dict)+1)

plt.hlines(y=my_range,  xmin=0, xmax=list(salary_dict.values()), color='red')
plt.plot(list(salary_dict.values()), my_range, "o")

plt.yticks(my_range, list(salary_dict.keys()))
plt.title("Level of Compensation ($USD)", loc='center', fontsize=14)
plt.xlabel('% of Respondents')
plt.ylabel('Compensation')
None

The graph shows that the smallest salary bracket has the most respondents. This indicates that a large share of respondents are either paid very little or are unemployed. A very small portion of respondents are paid above $100,000 USD a year. These could be assumed to be respondents in higher positions with multiple years of experience.
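
The salary shares above are hard-coded in bracket order. If they were instead computed from `value_counts`, the brackets would need to be ordered by amount rather than by frequency. One possible approach, sketched here on a handful of invented shares, is to parse the lower bound out of each label. The helper `lower_bound` is a hypothetical name, not part of the notebook.

```python
import re

# A few brackets in the style of the Q24 labels, with invented shares (percent).
toy_salary = {
    "100,000-124,999": 5.0,
    "$0-999": 40.0,
    "> $500,000": 1.0,
    "10,000-14,999": 30.0,
    "150,000-199,999": 4.0,
    "1,000-1,999": 20.0,
}

def lower_bound(label):
    """Return the first number in a bracket label as an int, ignoring '$', '>' and commas."""
    match = re.search(r"\d[\d,]*", label)
    return int(match.group().replace(",", ""))

# Brackets sorted from lowest to highest pay.
ordered = sorted(toy_salary, key=lower_bound)

# Combined share of the brackets starting at $100,000 or more.
share_100k_plus = sum(v for k, v in toy_salary.items() if lower_bound(k) >= 100_000)

print(ordered)
print(f"{share_100k_plus:.1f}% of toy respondents earn $100,000 or more")
```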

### Countries with respondents who are unpaid/paid less

The choropleth map below shows all the countries where selected respondents are either paid the lowest yearly compensation or are unemployed.

In [None]:
# Low salary Countries in dataframe
low_sal_country = surveydf[surveydf['Q24']=='$0-999']['Q3'].unique()
low_sal_country_df = pd.DataFrame(low_sal_country, columns=['Country'])

In [None]:
# Get country codes
import pycountry 
def alpha3code(column):
    CODE=[]
    for country in column:
        try:
            # .alpha_3 is the 3-letter country code (.alpha_2 would be the 2-letter code)
            code=pycountry.countries.get(name=country).alpha_3
            CODE.append(code)
        except (AttributeError, LookupError):
            # No exact name match in pycountry, so no code is available
            CODE.append('None')
    return CODE
# create a column for code 
low_sal_country_df['CODE']=alpha3code(low_sal_country_df.Country)

In [None]:
import plotly.express as px
fig = px.choropleth(data_frame = low_sal_country_df,
                    locations= "CODE",
                    color= "Country",  
                    hover_name= "Country",
                    title = 'Countries of Unemployed/Low Paid Respondents')  

fig.show()

## 5. Tools

### Money spent on Machine Learning/Cloud Computing services

In [None]:
# 25. Money spent on tools
money = surveydf['Q25'].value_counts(normalize=True)*100
n = len(money)
colors = [cm.Set2(i / n) for i in range(n)]

pie, ax = plt.subplots(figsize=[13,10])
labels = money.index
plt.pie(x=money, autopct="%.1f%%", explode=[0.02]*money.shape[0], labels=labels, colors=colors, pctdistance=0.7, textprops={'fontsize': 10})
plt.title("Money (USD) Spent on Machine Learning and/or Cloud Computing Services in the Past 5 Years", fontsize=14);

More than 50% of the selected respondents have spent money on Machine Learning or cloud services at home or at work. Based on this, the survey asked specific questions to those who spent money on the tools and services versus those who didn't. Those who spent money were defined as professionals (group A), whereas those who didn't were defined as non-professionals (group B).

### 5A. Questions asked to the professionals

In [None]:
# Function for questions with multiple answers with Part A and Part B
def multiple_ans_ab(quest, df):
    ans_dict = {}
    i = 0
    for col in df.columns:
        if col.startswith(quest):
            try:
                df[col] = df[col].str.strip()
                key = df[col].value_counts().index[0]
                ans_dict[key] = df[col].count()
            except IndexError:
                continue

    # Get percentages of the values
    for keys in ans_dict:
        ans_dict[keys] = round(ans_dict[keys] * 100/df.shape[0], 2)

    # Sort Dictionary values
    sorted_items = {k: v for k, v in sorted(ans_dict.items(), key=lambda item: item[1], reverse = True)}
    return sorted_items

In [None]:
# Filter data for non-spenders and spenders
spender1 = surveydf[surveydf['Q25']!='$0 ($USD)']
nonspender1 = surveydf[surveydf['Q25']=='$0 ($USD)']
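
One caveat with a `!=` filter like the one above: a missing `Q25` answer is also "not equal" to `'$0 ($USD)'`, so respondents who skipped the question land in the spender group. Whether that is intended depends on the analysis. The sketch below, on invented rows, shows the difference when missing answers are excluded explicitly:

```python
import pandas as pd

# Invented answers; None marks a respondent who skipped Q25.
toy = pd.DataFrame({"Q25": ["$0 ($USD)", "$1-$99", None, "$100-$999", "$0 ($USD)"]})

# Plain inequality: the missing answer is kept as a "spender".
naive_spenders = toy[toy["Q25"] != "$0 ($USD)"]

# Requiring an answer first leaves only respondents who reported spending.
answered_spenders = toy[toy["Q25"].notna() & (toy["Q25"] != "$0 ($USD)")]

print(len(naive_spenders), len(answered_spenders))
```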

In [None]:
spender = spender1[['Q25', 'Q26_A_Part_1',
 'Q26_A_Part_2',
 'Q26_A_Part_3',
 'Q26_A_Part_4',
 'Q26_A_Part_5',
 'Q26_A_Part_6',
 'Q26_A_Part_7',
 'Q26_A_Part_8',
 'Q26_A_Part_9',
 'Q26_A_Part_10',
 'Q26_A_Part_11',
 'Q26_A_OTHER',
 'Q27_A_Part_1',
 'Q27_A_Part_2',
 'Q27_A_Part_3',
 'Q27_A_Part_4',
 'Q27_A_Part_5',
 'Q27_A_Part_6',
 'Q27_A_Part_7',
 'Q27_A_Part_8',
 'Q27_A_Part_9',
 'Q27_A_Part_10',
 'Q27_A_Part_11',
 'Q27_A_OTHER',
 'Q28_A_Part_1',
 'Q28_A_Part_2',
 'Q28_A_Part_3',
 'Q28_A_Part_4',
 'Q28_A_Part_5',
 'Q28_A_Part_6',
 'Q28_A_Part_7',
 'Q28_A_Part_8',
 'Q28_A_Part_9',
 'Q28_A_Part_10',
 'Q28_A_OTHER',
 'Q29_A_Part_1',
 'Q29_A_Part_2',
 'Q29_A_Part_3',
 'Q29_A_Part_4',
 'Q29_A_Part_5',
 'Q29_A_Part_6',
 'Q29_A_Part_7',
 'Q29_A_Part_8',
 'Q29_A_Part_9',
 'Q29_A_Part_10',
 'Q29_A_Part_11',
 'Q29_A_Part_12',
 'Q29_A_Part_13',
 'Q29_A_Part_14',
 'Q29_A_Part_15',
 'Q29_A_Part_16',
 'Q29_A_Part_17',
 'Q29_A_OTHER',
 'Q30',
 'Q31_A_Part_1',
 'Q31_A_Part_2',
 'Q31_A_Part_3',
 'Q31_A_Part_4',
 'Q31_A_Part_5',
 'Q31_A_Part_6',
 'Q31_A_Part_7',
 'Q31_A_Part_8',
 'Q31_A_Part_9',
 'Q31_A_Part_10',
 'Q31_A_Part_11',
 'Q31_A_Part_12',
 'Q31_A_Part_13',
 'Q31_A_Part_14',
 'Q31_A_OTHER',
 'Q32',
 'Q33_A_Part_1',
 'Q33_A_Part_2',
 'Q33_A_Part_3',
 'Q33_A_Part_4',
 'Q33_A_Part_5',
 'Q33_A_Part_6',
 'Q33_A_Part_7',
 'Q33_A_OTHER',
 'Q34_A_Part_1',
 'Q34_A_Part_2',
 'Q34_A_Part_3',
 'Q34_A_Part_4',
 'Q34_A_Part_5',
 'Q34_A_Part_6',
 'Q34_A_Part_7',
 'Q34_A_Part_8',
 'Q34_A_Part_9',
 'Q34_A_Part_10',
 'Q34_A_Part_11',
 'Q34_A_OTHER',
 'Q35_A_Part_1',
 'Q35_A_Part_2',
 'Q35_A_Part_3',
 'Q35_A_Part_4',
 'Q35_A_Part_5',
 'Q35_A_Part_6',
 'Q35_A_Part_7',
 'Q35_A_Part_8',
 'Q35_A_Part_9',
 'Q35_A_Part_10',
 'Q35_A_OTHER']]

In [None]:
nonspender = nonspender1[['Q25', 'Q26_B_Part_1',
 'Q26_B_Part_2',
 'Q26_B_Part_3',
 'Q26_B_Part_4',
 'Q26_B_Part_5',
 'Q26_B_Part_6',
 'Q26_B_Part_7',
 'Q26_B_Part_8',
 'Q26_B_Part_9',
 'Q26_B_Part_10',
 'Q26_B_Part_11',
 'Q26_B_OTHER',
 'Q27_B_Part_1',
 'Q27_B_Part_2',
 'Q27_B_Part_3',
 'Q27_B_Part_4',
 'Q27_B_Part_5',
 'Q27_B_Part_6',
 'Q27_B_Part_7',
 'Q27_B_Part_8',
 'Q27_B_Part_9',
 'Q27_B_Part_10',
 'Q27_B_Part_11',
 'Q27_B_OTHER',
 'Q28_B_Part_1',
 'Q28_B_Part_2',
 'Q28_B_Part_3',
 'Q28_B_Part_4',
 'Q28_B_Part_5',
 'Q28_B_Part_6',
 'Q28_B_Part_7',
 'Q28_B_Part_8',
 'Q28_B_Part_9',
 'Q28_B_Part_10',
 'Q28_B_OTHER',
 'Q29_B_Part_1',
 'Q29_B_Part_2',
 'Q29_B_Part_3',
 'Q29_B_Part_4',
 'Q29_B_Part_5',
 'Q29_B_Part_6',
 'Q29_B_Part_7',
 'Q29_B_Part_8',
 'Q29_B_Part_9',
 'Q29_B_Part_10',
 'Q29_B_Part_11',
 'Q29_B_Part_12',
 'Q29_B_Part_13',
 'Q29_B_Part_14',
 'Q29_B_Part_15',
 'Q29_B_Part_16',
 'Q29_B_Part_17',
 'Q29_B_OTHER',
 'Q31_B_Part_1',
 'Q31_B_Part_2',
 'Q31_B_Part_3',
 'Q31_B_Part_4',
 'Q31_B_Part_5',
 'Q31_B_Part_6',
 'Q31_B_Part_7',
 'Q31_B_Part_8',
 'Q31_B_Part_9',
 'Q31_B_Part_10',
 'Q31_B_Part_11',
 'Q31_B_Part_12',
 'Q31_B_Part_13',
 'Q31_B_Part_14',
 'Q31_B_OTHER',
 'Q33_B_Part_1',
 'Q33_B_Part_2',
 'Q33_B_Part_3',
 'Q33_B_Part_4',
 'Q33_B_Part_5',
 'Q33_B_Part_6',
 'Q33_B_Part_7',
 'Q33_B_OTHER',
 'Q34_B_Part_1',
 'Q34_B_Part_2',
 'Q34_B_Part_3',
 'Q34_B_Part_4',
 'Q34_B_Part_5',
 'Q34_B_Part_6',
 'Q34_B_Part_7',
 'Q34_B_Part_8',
 'Q34_B_Part_9',
 'Q34_B_Part_10',
 'Q34_B_Part_11',
 'Q34_B_OTHER',
 'Q35_B_Part_1',
 'Q35_B_Part_2',
 'Q35_B_Part_3',
 'Q35_B_Part_4',
 'Q35_B_Part_5',
 'Q35_B_Part_6',
 'Q35_B_Part_7',
 'Q35_B_Part_8',
 'Q35_B_Part_9',
 'Q35_B_Part_10',
 'Q35_B_OTHER']]
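
The hand-typed column list above follows a regular `<question>_Part_<n>` / `<question>_OTHER` naming pattern, so it could also be generated. The following is only a sketch: `question_columns`, `B_PARTS` and `nonspender_cols` are hypothetical names that do not appear elsewhere in this notebook, and the part counts are read off the list above.

```python
def question_columns(question, n_parts, other=True):
    """Return the column names for one multi-select question, e.g. 'Q26_B'."""
    cols = [f"{question}_Part_{i}" for i in range(1, n_parts + 1)]
    if other:
        cols.append(f"{question}_OTHER")
    return cols

# Number of answer parts per question, as they appear in the list above.
B_PARTS = {"Q26_B": 11, "Q27_B": 11, "Q28_B": 10, "Q29_B": 17,
           "Q31_B": 14, "Q33_B": 7, "Q34_B": 11, "Q35_B": 10}

# 'Q25' is a single-answer column, so it is added as-is.
nonspender_cols = ["Q25"]
for question, n_parts in B_PARTS.items():
    nonspender_cols.extend(question_columns(question, n_parts))
```

Under these assumptions, `nonspender1[nonspender_cols]` would select the same columns as the explicit list.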

In [None]:
# Remove unused rows
nonspender = nonspender.dropna(how='all')
spender = spender.dropna(how='all')
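
As a small illustration of what `dropna(how='all')` does in the cell above, the sketch below uses a made-up toy frame (the values are invented, not survey data). Only rows in which every column is missing are removed.

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 answered nothing, rows 0 and 2 answered one part each.
toy = pd.DataFrame({
    "Q26_A_Part_1": ["Amazon Web Services (AWS)", np.nan, np.nan],
    "Q26_A_Part_2": [np.nan, np.nan, "Microsoft Azure"],
})

# how='all' drops a row only when all of its columns are NaN.
kept = toy.dropna(how="all")
print(kept.index.tolist())
```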

### Cloud Computing platforms used regularly

In [None]:
# 26_A. Cloud Computing Platforms
cc1 = multiple_ans_ab('Q26_A',spender)

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(cc1.keys()), y=list(cc1.values()), palette="husl")
ax.set(xlabel="Cloud Computing Platforms", ylabel = "% of Respondents", title = "Cloud Computing Platforms Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Amazon Web Services and Google Cloud Platform were the most regularly used platforms.
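
The percentages in the chart come from the notebook's `multiple_ans_ab` helper, whose definition lies outside this section. As a rough sketch of how such a helper might work (an assumption, not the notebook's actual implementation), the hypothetical `multi_select_percentages` below counts the non-missing answers in each `<question>_...` column and divides by the number of rows. The toy data is invented.

```python
import numpy as np
import pandas as pd

def multi_select_percentages(question, df):
    """Percentage of rows in df that selected each option of a multi-select question."""
    cols = [c for c in df.columns if c.startswith(question + "_")]
    result = {}
    for col in cols:
        answers = df[col].dropna()
        if answers.empty:
            continue  # nobody selected this option
        # Each column holds a single option label wherever it was selected.
        label = str(answers.iloc[0]).strip()
        result[label] = 100 * len(answers) / len(df)
    return result

# Invented toy data: four respondents, two options for Q26_A.
toy = pd.DataFrame({
    "Q26_A_Part_1": ["AWS", "AWS", np.nan, np.nan],
    "Q26_A_Part_2": [np.nan, "GCP", "GCP", "GCP"],
    "Q27_A_Part_1": ["EC2", np.nan, np.nan, np.nan],
})
shares = multi_select_percentages("Q26_A", toy)
```

Because respondents can select several options, percentages computed this way can add up to more than 100%.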

### Cloud Computing products used regularly

In [None]:
# Alternate Question based on answers in Q26A
cc_prd_df = spender[['Q27_A_Part_1','Q27_A_Part_2','Q27_A_Part_3','Q27_A_Part_4','Q27_A_Part_5','Q27_A_Part_6','Q27_A_Part_7','Q27_A_Part_8','Q27_A_Part_9','Q27_A_Part_10','Q27_A_Part_11','Q27_A_OTHER']]
cc_prd_df = cc_prd_df.dropna(how='all')

In [None]:
# 27_A. Cloud Computing Products
cc2 = multiple_ans_ab('Q27_A',cc_prd_df)

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(cc2.keys()), y=list(cc2.values()), palette="husl")
ax.set(xlabel="Cloud Computing Products", ylabel = "% of Respondents", title = "Cloud Computing Products Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Among those asked about the Cloud Computing products they use regularly, Amazon EC2 ranked the highest, followed by Azure Cloud Services, Google Cloud Compute Engine and AWS Lambda.

### Machine Learning products used regularly

In [None]:
# Alternate Question based on answers in Q26A
ml_prd_df = spender[['Q28_A_Part_1','Q28_A_Part_2','Q28_A_Part_3','Q28_A_Part_4','Q28_A_Part_5','Q28_A_Part_6','Q28_A_Part_7','Q28_A_Part_8','Q28_A_Part_9','Q28_A_Part_10','Q28_A_OTHER']]
ml_prd_df = ml_prd_df.dropna(how='all')

In [None]:
# 28_A. Machine Learning Products
mlp= multiple_ans_ab('Q28_A',ml_prd_df)
mlp['GC AI Platform/ML Engine'] = mlp.pop('Google Cloud AI Platform / Google Cloud ML Engine')
mlp = {k: v for k, v in sorted(mlp.items(), key=lambda item: item[1], reverse=True)}

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(mlp.keys()), y=list(mlp.values()), palette="husl")
ax.set(xlabel="Machine Learning Products", ylabel = "% of Respondents", title = "Machine Learning Products Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

More than 55% of those who were asked this question did not use any specific Machine Learning products.

### Big Data products used regularly

In [None]:
# 29_A. Big Data Products
bdp = multiple_ans_ab('Q29_A',spender)
bdp['Microsoft Azure Data Lake'] = bdp.pop('Microsoft Azure Data Lake Storage')
bdp = {k: v for k, v in sorted(bdp.items(), key=lambda item: item[1], reverse=True)}

sns.set(rc={'figure.figsize':(20.7,6.27)})
ax = sns.barplot(x=list(bdp.keys()), y=list(bdp.values()), palette="husl")
ax.set(xlabel="Big Data Products", ylabel = "% of Respondents", title = "Big Data Products Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Flavors of SQL were the most preferred Big Data products used by professionals.

### Big Data products used most often

In [None]:
# 30. Big Data Products used most often
bdp2 = spender['Q30'].value_counts(normalize=True)*100
n = len(bdp2)
colors = [cm.tab20(i / n) for i in range(n)]

pie, ax = plt.subplots(figsize=[13,10])
labels = bdp2.index
plt.pie(x=bdp2, autopct="%.1f%%", explode=[0.02]*bdp2.shape[0], labels=labels, colors=colors, pctdistance=0.7, textprops={'fontsize': 10})
plt.title("Big Data Products Used Most Often", fontsize=14);

Of those who used more than one Big Data product, one quarter preferred MySQL, while almost another quarter preferred other flavors of SQL rather than cloud-based products.

### Business Intelligence tools used regularly

In [None]:
# 31_A. Business Intelligence Tools
bit = multiple_ans_ab('Q31_A',spender)

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(bit.keys()), y=list(bit.values()), palette="husl")
ax.set(xlabel="Business Intelligence Tools", ylabel = "% of Respondents", title = "Business Intelligence Tools Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

While 37% of the selected population did not use Business Intelligence tools, the most frequently used were Tableau and PowerBI.

### Business Intelligence tools used most often

In [None]:
# 32. Business Intelligence tools used most often
bdp2 = spender['Q32'].value_counts(normalize=True)*100
n = len(bdp2)
colors = [cm.tab20(i / n) for i in range(n)]

#Using matplotlib
pie, ax = plt.subplots(figsize=[15,13])
labels = bdp2.index
plt.pie(x=bdp2, autopct="%.1f%%", explode=[0.02]*bdp2.shape[0], labels=labels, colors=colors, pctdistance=0.7, textprops={'fontsize': 10})
plt.title("Business Intelligence Tools Used Most Often", fontsize=14);

Among those who used multiple BI tools, Tableau was the most preferred, closely followed by PowerBI. With very similar functionalities, the choice between these two tools boils down to business requirements.

### Automated Machine Learning tools used regularly

In [None]:
# 33_A. Automated Machine Learning Tools
auto_ml = multiple_ans_ab('Q33_A',spender)
sns.set(rc={'figure.figsize':(20.7,8.27)})
ax = sns.barplot(y=list(auto_ml.keys()), x=list(auto_ml.values()), palette="husl")
ax.set(ylabel="Automated Machine Learning Tools", xlabel = "% of Respondents")
ax.set_title("Automated Machine Learning Tools Used Regularly",fontsize=14)
ax.set_yticklabels(ax.get_yticklabels())
show_values_on_bars(ax, "h", 0.3)
None

62% of the selected population did not use automated Machine Learning tools, suggesting that these respondents preferred designing models manually.

### Specific automated Machine Learning tools used regularly

In [None]:
auto_ml_df = spender[spender['Q33_A_Part_7']!='No / None']
auto_ml_df.shape
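
One detail of the filter above is worth keeping in mind: in pandas, comparing a missing value with `!=` evaluates to `True`, so rows where `Q33_A_Part_7` is NaN are kept alongside respondents who actively chose something else. A minimal sketch on invented toy data:

```python
import numpy as np
import pandas as pd

# Invented toy column: one explicit "No / None" and two missing values.
toy = pd.DataFrame({"Q33_A_Part_7": ["No / None", np.nan, np.nan]})

# NaN != "No / None" evaluates to True, so the NaN rows pass the filter.
mask = toy["Q33_A_Part_7"] != "No / None"
print(mask.tolist())
```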

In [None]:
# 34_A. Automated Machine Learning Tools on regular basis
auto_ml2 = multiple_ans_ab('Q34_A',auto_ml_df)

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(auto_ml2.keys()), y=list(auto_ml2.values()), palette="husl")
ax.set(xlabel="Automated Machine Learning Tools", ylabel = "% of Respondents", title = "Automated Machine Learning Tools Used Regularly")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Among the professional respondents who did use automated Machine Learning tools, Auto-Sklearn was the most used.

### Tools used to manage Machine Learning Experiments

In [None]:
# 35_A. Tools to help manage machine learning experiments
ml_tool = multiple_ans_ab('Q35_A',spender)

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(ml_tool.keys()), y=list(ml_tool.values()), palette="husl")
ax.set(xlabel="ML Managing Tools", ylabel = "% of Respondents", title = "Tools to Manage Machine Learning Experiments")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Not many respondents used tools to manage their machine learning experiments and applications. This could be used as an opportunity to promote such tools and encourage the exploration of machine learning.

### 5B. Questions asked to the non-professionals

Non-professionals were asked which tools they hope to become familiar with in the next 2 years, rather than which tools they use on a regular basis.

### Cloud Computing platforms

In [None]:
# 26_B. Cloud computing platforms you hope to become more familiar with in the next 2 years
cc_ptfm = multiple_ans_ab('Q26_B',nonspender)

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(cc_ptfm.keys()), y=list(cc_ptfm.values()), palette="husl")
ax.set(xlabel="Cloud Computing Platforms", ylabel = "% of Respondents", title = "Desired Cloud Computing Platforms to Become Familiar With")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Like the professionals, non-professionals showed great interest in becoming familiar with the AWS and GCP platforms.

### Cloud Computing products

In [None]:
# 27_B. Cloud computing products you hope to become more familiar with in the next 2 years
cc_prd = multiple_ans_ab('Q27_B',nonspender)

sns.set(rc={'figure.figsize':(18.7,6.27)})
ax = sns.barplot(x=list(cc_prd.keys()), y=list(cc_prd.values()), palette="husl")
ax.set(xlabel="Cloud Computing Products", ylabel = "% of Respondents", title = "Desired Cloud Computing Products to Become Familiar With")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Google Cloud products took precedence over Azure and Amazon products for non-professionals to learn about.

### Machine Learning products

In [None]:
# 28_B. Machine Learning products you hope to become more familiar with in the next 2 years
ml_prd = multiple_ans_ab('Q28_B',nonspender)
ml_prd['GC AI Platform/ ML Engine'] = ml_prd.pop('Google Cloud AI Platform / Google Cloud ML Engine')
ml_prd = {k: v for k, v in sorted(ml_prd.items(), key=lambda item: item[1], reverse=True)}

sns.set(rc={'figure.figsize':(18.7,6.27)})
ax = sns.barplot(x=list(ml_prd.keys()), y=list(ml_prd.values()), palette="husl")
ax.set(xlabel="Machine Learning Products", ylabel = "% of Respondents", title = "Machine Learning Products to Become Familiar With")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

As with Cloud Computing products, Google Cloud machine learning products attracted more interest than the Azure or Amazon services.

### Big Data products

In [None]:
# 29_B. Big Data products you hope to become more familiar with in the next 2 years
bd_prd = multiple_ans_ab('Q29_B',nonspender)
bd_prd['Microsoft Azure Data Lake'] = bd_prd.pop('Microsoft Azure Data Lake Storage')
bd_prd = {k: v for k, v in sorted(bd_prd.items(), key=lambda item: item[1], reverse=True)}

sns.set(rc={'figure.figsize':(20.7,6.27)})
ax = sns.barplot(x=list(bd_prd.keys()), y=list(bd_prd.values()), palette="husl")
ax.set(xlabel="Big Data Products", ylabel = "% of Respondents", title = "Big Data Products to Become Familiar With")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

MySQL ranked the highest in Big Data products desired to be learnt, with the NoSQL database software MongoDB being the second most sought after.

### Business Intelligence products

In [None]:
# 31_B. Business Intelligence products you hope to become more familiar with in the next 2 years
bi_prd = multiple_ans_ab('Q31_B',nonspender)

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(bi_prd.keys()), y=list(bi_prd.values()), palette="husl")
ax.set(xlabel="Business Intelligence Products", ylabel = "% of Respondents", title = "Business Intelligence Products to Become Familiar With")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Like the professionals, non-professionals showed great interest in learning Tableau and PowerBI, along with Google Data Studio.

### Categories of Automated Machine Learning tools

In [None]:
# 33_B. Categories of automated machine learning tools you hope to become more familiar with in the next 2 years
auto_ml_prd = multiple_ans_ab('Q33_B',nonspender)
sns.set(rc={'figure.figsize':(20.7,8.27)})
ax = sns.barplot(y=list(auto_ml_prd.keys()), x=list(auto_ml_prd.values()), palette="husl")
ax.set(ylabel="Automated Machine Learning Tools", xlabel = "% of Respondents")
ax.set_title("Categories of Automated Machine Learning Tools To Become Familiar With",fontsize=14)
ax.set_yticklabels(ax.get_yticklabels())
show_values_on_bars(ax, "h", 0.3)
None

Although most professionals did not use automated Machine Learning tools, the categories non-professionals most want to learn are automated model selection (e.g. auto-sklearn) and ML pipeline automation (e.g. Google Cloud AutoML). Automated feature engineering/selection, data augmentation and hyperparameter tuning drew a similar level of interest.

### Specific automated Machine Learning tools

In [None]:
# 34_B. Specific automated machine learning tools you hope to become more familiar with in the next 2 years
auto_ml_prd = multiple_ans_ab('Q34_B',nonspender)

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(auto_ml_prd.keys()), y=list(auto_ml_prd.values()), palette="husl")
ax.set(xlabel="Automated Machine Learning Tools", ylabel = "% of Respondents", title = "Specific Automated Machine Learning Tools to Become Familiar With")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

Within the automated ML tool categories above, Auto-Sklearn and Google Cloud AutoML are the most sought-after tools to learn.

### Tools for managing Machine Learning experiments 

In [None]:
# 35_B. Tools for managing ML experiments you hope to become more familiar with in the next 2 years
auto_ml_prd = multiple_ans_ab('Q35_B',nonspender)

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(auto_ml_prd.keys()), y=list(auto_ml_prd.values()), palette="husl")
ax.set(xlabel="Tools for Managing Machine Learning", ylabel = "% of Respondents", title = "Tools for Managing Machine Learning Experiments to Become Familiar With")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

While not many have shown interest in learning the tools that manage Machine Learning tasks and experiments, TensorBoard was selected as the most desired tool to learn.

## 6. Online Platform and Resources

### Public sites used to share analysis or applications

In [None]:
# select only answered rows
online = surveydf[['Q36_Part_1','Q36_Part_2','Q36_Part_3','Q36_Part_4','Q36_Part_5','Q36_Part_6','Q36_Part_7','Q36_Part_8','Q36_Part_9','Q36_OTHER']]
online = online.dropna(how="all")

In [None]:
# 36. Public sharing sites used
share = multiple_ans_alt('Q36', online)

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(share.keys()), y=list(share.values()), palette="husl")
ax.set(xlabel="Public Sharing Sites", ylabel = "% of Respondents", title = "Public Sharing Sites Used")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

With various online platforms providing the resources to share data analysis or data science projects, GitHub was the most preferred site followed by Kaggle.

### Data Science course platforms

In [None]:
# 37. Begun or completed data science courses
course = multiple_ans('Q37')
course['University Courses'] = course.pop('University Courses (resulting in a university degree)')
course['Cloud-certification programs'] = course.pop('Cloud-certification programs (direct from AWS, Azure, GCP, or similar)')
course = {k: v for k, v in sorted(course.items(), key=lambda item: item[1], reverse=True)}

sns.set(rc={'figure.figsize':(15.7,6.27)})
ax = sns.barplot(x=list(course.keys()), y=list(course.values()), palette="husl")
ax.set(xlabel="Data Science Course Platforms", ylabel = "% of Respondents", title = "Platforms Data Science Courses Taken/Completed On")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),"%.1f%%"% rect.get_height(), ha='center', **style)
None

With courses from distinguished universities around the world, Coursera was used the most for data science courses. Udemy, Kaggle Learn and university courses had a similar level of audience. While Fast.ai was the least used, it should gain more recognition for its free courses and applications, which are best known for solving computer vision and NLP problems using different flavors of convolutional neural networks.

### Primary tool to analyze data

In [None]:
# 38. Primary tool to analyse data
analyse = surveydf['Q38'].value_counts(normalize=True)*100
n = len(analyse)  # one color per category in this chart
colors = [cm.tab20(i / n) for i in range(n)]

#Using matplotlib
pie, ax = plt.subplots(figsize=[15,10])
labels = analyse.index
plt.pie(x=analyse, autopct="%.1f%%", explode=[0.02]*analyse.shape[0], labels=labels, colors=colors, pctdistance=0.7, textprops={'fontsize': 10})
plt.title("Primary Tool to Analyse Data", fontsize=14);

It is encouraging to note that a higher percentage of respondents preferred using R and Python in RStudio and JupyterLab to analyse data than those who used conventional software like Microsoft Excel and Google Sheets. While it can be argued that some datasets are too small and simple to warrant a local development environment, business intelligence software like Salesforce, PowerBI and Tableau can be used for these datasets, although the monthly/yearly subscription fees for these tools could hinder the desire to use them.

### Favorite media sources 

In [None]:
# 39. Favorite media sources
media = multiple_ans('Q39')
sns.set(rc={'figure.figsize':(20.7,8.27)})
ax = sns.barplot(y=list(media.keys()), x=list(media.values()), palette="husl")
ax.set(ylabel="Data Science Media Sources", xlabel = "% of Respondents")
ax.set_title("Favorite Data Science Media Sources",fontsize=14)
ax.set_yticklabels(ax.get_yticklabels())
show_values_on_bars(ax, "h", 0.3)
None

Kaggle was the selected respondents' favorite media source for data science topics. Useful notebooks by different users are a great source for learning how to work with data and for picking up new skills and techniques. YouTube videos offer audio-visual learning that mimics the structure of a teacher-student relationship. Blogs are equally helpful for understanding concepts. These 3 channels are probably the most favored because they are relatable to students and professionals across the globe: they are written by and for data enthusiasts, who provide easy, guided solutions to everything from the simplest problems to the most complicated ones. This does not belittle the information from other sources like journals, newsletters, podcasts etc.

## Conclusion

Of the selected female and non-binary respondents, most have been exploring the various tools, methods and products available for working with data. Some respondents have invested in tools to manage data and its applications, while the non-professionals have shown great interest in becoming familiar with advanced tools and technology to expand their machine learning and cloud computing skills. However, automated tools that help coders manage data and applications should be promoted further to encourage the exploration of machine learning techniques and cloud computing capabilities.

Of all the survey respondents, only 20.6% identified as female or non-binary. This shows that a great deal needs to be done to encourage females and gender non-conforming individuals to explore the avenues of data science and subsequently use the resources available. It is promising to note that most of these selected respondents were in their early 20s, suggesting that recent generations are eager to venture into information technology and, more specifically, into data science. This community is growing far and wide, and more data enthusiasts, young and old, are discovering the joys of exploring data with tools both known and unknown. Promoting these tools is essential to help data enthusiasts build their skills and grow in number and in diversity.