# 1. Introduction
Self-directed learning has always been a cornerstone of the data science and machine learning industry (as anyone who has spent hours crawling through documentation on APIs, packages and SDKs can testify!). 

This has become even more critical since the advent of COVID-19 as employees, job hunters and students scramble to develop skills to increase adaptability and flexibility in the workplace and job market. Between Q1 2020 and Q1 2021, online self-directed learning has seen a 3x increase globally, with certain regions like Asia seeing a 5x increase (Cornerstone on Demand 2021).

So we can agree that self-directed learning is key, but for any would-be learner this now raises the question of "What do I learn?".

A 2014 white paper by CEB revealed that in the average organisation, 45% of all learning delivered ends up not being applied, with a 2004 study putting that number closer to 80-85% (Ken Phillips 2016). That's astronomical!

There is no doubt that, in an age where information is so abundant, "**What** do I learn?" is the most important question of all.

# 2. Our Goal
This notebook will serve as a practical guide for self-directed learners, helping them identify the areas of study relevant to them by utilising a data-driven approach.

Section 4 "Information on Job Titles" has been designed for individuals looking to select or switch career paths. By examining the various attributes (such as salary and job activities) of different careers within the data science and machine learning industry, individuals can identify which may best suit them to begin their self-learning journey.

Section 5 "Breakdown of Skills/Tools/Methods by Job Title" has been designed to help individuals select areas of study that will be relevant to them based on the career path they are pursuing (or, if they have not yet chosen a career, to see which areas of study will be useful across the different careers).

# 3. The Dataset

## 3.1 Description
The dataset used comes from an industry wide survey for data science and machine learning held from 09/01/2021 to 10/04/2021 with 25,973 valid responses from 171 countries and territories.

Questions were multiple choice, with many allowing multiple selections.
Go to https://www.kaggle.com/c/kaggle-survey-2021/data for a full breakdown of the survey and the raw data.

## 3.2 Managing Bias
As the data is collected via survey rather than random sample, we must be aware of the inherent bias with this method. 

One major concern was the bias in the countries in which respondents were based. As such we have avoided making sum-based comparisons between countries such as "number of respondents from India with a master’s degree compared to number of respondents from Australia with a master’s degree" and instead have looked at these as a proportion relative to the country, for example "Of the respondents from India, what proportion have a master’s degree". 

Another concern was the difference in the number of respondents for each job title; once again we have used proportions rather than sum comparisons. For example, rather than asking "How many data scientists use Python" we ask "What proportion of data scientists use Python".
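
To make the difference between the two kinds of comparison concrete, here is a minimal sketch on made-up data (the `toy` table and its "Uses Python" column are assumptions for illustration, not survey columns). It shows how a raw count favours the larger group while a proportion puts both groups on the same scale.

```python
import pandas as pd

# Toy data (made up for illustration, not taken from the survey)
toy = pd.DataFrame({
    "Job Title": ["Data Scientist"] * 4 + ["Statistician"] * 2,
    "Uses Python": [True, True, True, False, True, False],
})

# Raw counts favour the job title with more respondents
python_counts = toy.groupby("Job Title")["Uses Python"].sum()

# Proportions put every job title on the same 0-1 scale
python_share = toy.groupby("Job Title")["Uses Python"].mean()

print(python_counts)
print(python_share)
```

Here the count for data scientists is three times that of statisticians, yet the proportions (0.75 and 0.5) are much closer, which is the comparison we care about.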

The exception to this is section 4.4 "Industry Breakdown". Here, job titles with fewer respondents would skew the visualisation when comparing across industries if proportions were used; additionally, it is useful to see the number of respondents broken down by industry and job title.

## 3.3 Processing The Data Set
As the data was already pre-processed (see "kaggle_survey_2021_methodology.pdf" in the "supplementary data" folder of the dataset), we only needed to pre-process the data relative to our objective. Given that the objective of the notebook was to look at areas of study based on what is used by industry professionals, respondents who answered "Student" or "Currently Not Employed" to Question 5 "Select the title most similar to your current role (or most recent title if retired)", were removed from the dataset, leaving us with 17,182 valid respondents.

An additional note on data quality is the presence of null values for some multiple-select answers. One option was to impute "None" where a question was not answered, but this runs the risk of skewing the data, as we cannot guarantee that "None" was the intended answer. As such, the null values have not been changed. On the bubble charts in section 5, the size of the bubble and the scale on the right of the graph give a rough indication of what proportion of survey respondents used any specific item.
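
As a rough sketch of what leaving the nulls unchanged means in practice, the toy example below (made-up rows; only the column naming style follows the survey) counts the ticked options per job title and divides by the number of respondents with that title. A respondent who skipped the question stays in the denominator, so the resulting proportions are best read as a lower bound.

```python
import numpy as np
import pandas as pd

# Toy multi-select question: one column per option, NaN = option not ticked
toy = pd.DataFrame({
    "Q5": ["Data Analyst", "Data Analyst", "Data Engineer", "Data Engineer"],
    "Q7_Part_1": ["Python", np.nan, "Python", "Python"],
    "Q7_Part_3": ["SQL", np.nan, np.nan, "SQL"],
})

grouped = toy.groupby("Q5")
selected = grouped.count()    # count() ignores NaN, so only ticked options are counted
respondents = grouped.size()  # size() counts every row, answered or not

# Proportion of each job title's respondents who ticked each option
proportions = selected.div(respondents, axis=0)
print(proportions)
```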


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
rawdf = pd.read_csv('/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', low_memory=False)

# Clear Question description row
df = rawdf.iloc[1:]

# drop students and unemployed respondents
df = df[~df['Q5'].isin(['Student','Currently not employed'])]

 
# 4. Information on Job Titles: Helping you Select a Career Path
The purpose of this section is to help familiarise you with some of the available career paths in the data science and machine learning space and help find which will best suit you. To achieve this goal, for each job title provided we will look at a brief description of the job, the expected salary in each country, a breakdown of jobs in different industries and the day-to-day activities of each job title.

## 4.1 Job Descriptions
The descriptions below are modified and extended from [Seek Australia](https://www.seek.com.au/career-advice/).
### Business Analyst
Business Analysts are responsible for reviewing and analysing business decisions and processes, as well as communicating technical information back to the business and stakeholders. Analysts work across a large range of departments and industries.

### DBA / Database Engineer
Database engineers are responsible for an organisation’s database management. They assess business and user requirements and plan, configure and maintain custom databases and their accompanying documentation and protocols for the business.

### Data Analyst
Data Analysts help organisations identify ways to reduce costs and find opportunities for improving business revenue. Data Analysts collect and analyse data from within the business. Analysts are responsible for taking data and their findings and transferring this to actionable plans for the business.

### Data Engineer
Data Engineers develop ways to store and access large amounts of data. Data engineers design and maintain data architectures that enable easier access and interpretation of data for the business and stakeholders.

### Data Scientist
Data Scientists look for trends and patterns in raw business data and use them to generate insights into real-world problems and to find possible solutions for these problems.

### Developer Relations/Advocacy
Developer Relations involves working with business stakeholders and developers to assist in planning and setting expectations for projects. Advocates are typically experts in their field/company and promote their product.

### Machine Learning Engineer
Machine Learning Engineers are experts in AI and machine learning models. ML engineers design, build and test models. Verifying data quality is also part of their job, as it is important that ML models are trained on accurate, high-quality data.

### Product Manager
Product Managers identify the needs of the business or stakeholders and ensure the products designed will meet these needs and requirements. The manager also ensures their team is working in line with the business’s goals.

### Program/Project Manager
Like product managers, program/project managers ensure that the output of the team they lead matches the needs and expectations of the business or customer.

### Research Scientist
Research Scientists design and undertake experiments and trials. This is a broad position and the options are endless as to what you can do.

### Software Engineer
Software Engineers design, develop, maintain and test software. Engineers work with businesses and stakeholders to determine the requirements for the program.

### Statistician
Statisticians utilise statistical theories and knowledge to analyse and interpret findings related to data. Statisticians validate data, ensure its quality and report their findings to their businesses/customers.

### Other
"Other" appears as a job title within this notebook and refers to any survey respondent who did not feel that their current job title was reflected in any of the options above.

## 4.2 Education by Job Title
Education level varies according to job title. The purpose of this section is to give you an idea about what level of education you should aim for, given a career decision. 


In [None]:
# create info df with desired info
education_job = df[["Q4", "Q5"]]
education_job = education_job.rename(columns={"Q4":"Education", "Q5":"Job Title"})

# count each job title/education pairing, then divide each row by its total
# so that the education levels for every job title sum to 1 (normalize="index")
proportion_education_data = pd.crosstab(education_job["Job Title"], education_job["Education"], normalize="index")

# horizontal stacked bar chart - y-axis is job title, x-axis is the proportion and the stacked segments make up the education levels (modified from https://towardsdatascience.com/stacked-bar-charts-with-pythons-matplotlib-f4020e4eb4a7 tutorial)
# build chart data, filter out Total column for graphing
fields = proportion_education_data.columns
# 7 colours required
colors = ['#1D2F6F', '#8390FA', '#6EAF46', '#FAC748', '#FF5733', '#FFA533', '#33DDFF']

# figure and axis
fig, ax = plt.subplots(1, figsize=(12, 10))
# plot bars
left = len(proportion_education_data) * [0]
for idx, name in enumerate(fields):
    plt.barh(proportion_education_data.index, proportion_education_data[name], left = left, color=colors[idx])
    left = left + proportion_education_data[name]
# title, legend, labels
plt.title('Education for each job title\n', loc='left', fontsize=20, pad=20)
plt.xlabel('Proportion of education for job title', fontsize=16, labelpad=20)
plt.ylabel('Job Title', fontsize=16, labelpad=20)
# remove spines
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)
# adjust limits and draw grid lines
plt.ylim(-0.5, ax.get_yticks()[-1] + 0.5)
plt.legend(fields, ncol=2, frameon=False, bbox_to_anchor=(1.0, 1.0))
ax.set_axisbelow(True)
ax.xaxis.grid(color='gray', linestyle='dashed')
plt.gca().invert_yaxis()
plt.show()

From the above graph, it can clearly be seen that most respondents in each position hold a bachelor’s degree or a master's degree, suggesting that one of these is typically expected.
It is interesting to note that research scientists tend to hold or be undertaking doctoral degrees, suggesting that this position requires a higher level of study.

## 4.3 Salaries
Salaries were included in the Kaggle survey and respondents were able to select ranges for their responses. As numerical data is easier to visualise, the middle value of each range was taken as the response, and averages were then calculated from these values. This is not an accurate representation of the compensation you may receive, but rather a tool for comparing which positions and countries may be associated with higher compensation. As there were many different countries represented, as well as job titles, first let's look at the average salaries based on all available data from the survey.
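
As a small sketch of the midpoint idea, the helper below converts a range label into its middle value. The function name `range_midpoint` is our own and is not part of the notebook, which uses an explicit lookup table instead; for the open-ended top band we assume the lower bound is used, matching the value in that table.

```python
def range_midpoint(label):
    """Return the midpoint of a salary band such as '1,000-1,999'."""
    cleaned = label.replace("$", "").replace(",", "").strip()
    if cleaned.startswith(">"):
        # Open-ended top band: no upper bound, so fall back to the lower bound
        return float(cleaned[1:])
    low, high = cleaned.split("-")
    return (float(low) + float(high)) / 2


print(range_midpoint("1,000-1,999"))   # 1499.5
print(range_midpoint("$0-999"))        # 499.5
print(range_midpoint(">$1,000,000"))   # 1000000.0
```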

In [None]:
salary_data = df[["Q3", "Q5", "Q25"]]
salary_data = salary_data.rename(columns={"Q3":"Country", "Q5":"Job Title", "Q25":"Yearly Comp USD"})
salary_data = salary_data.replace(to_replace={"$0-999":499.5, "1,000-1,999":1499.5, "10,000-14,999":12499.5, "30,000-39,999":34999.5, "100,000-124,999":112499.5, "5,000-7,499":6249.5, "50,000-59,999":54999.5, "40,000-49,999":44999.5, "20,000-24,999":22499.5, "2,000-2,999":2499.5, "15,000-19,999":17499.5, "7,500-9,999":8749.5, "60,000-69,999":64999.5, "25,000-29,999":27499.5, "70,000-79,999":74999.5, "4,000-4,999":4499.5, "150,000-199,999":174999.5, "80,000-89,999":84999.5, "3,000-3,999":3499.5, "125,000-149,999":137499.5, "90,000-99,999":94999.5, "200,000-249,999":224999.5, "300,000-499,999":399999.5, "250,000-299,999":274999.5, ">$1,000,000":1000000,"$500,000-999,999":749999.5})

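# make sure the mapped midpoints are numeric before averaging (anything unmapped becomes NaN)
salary_data["Yearly Comp USD"] = pd.to_numeric(salary_data["Yearly Comp USD"], errors="coerce")
total_results = salary_data.groupby(["Country", "Job Title"]).mean()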
# heatmap of the total results with all data included
pivot_total = pd.pivot_table(total_results, values=["Yearly Comp USD"], index=["Country"], columns=["Job Title"], fill_value=0)
plt.figure(figsize=(12,12))
ax = sns.heatmap(data=pivot_total, robust=True, linewidth=.015, xticklabels=True, yticklabels=True, cmap="Greens")
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, rotation_mode="anchor", ha="right")
plt.title("Salary by Country vs Job Title", fontsize=20, pad=20)
plt.xlabel("Job Title", fontsize=16, labelpad=20)
plt.ylabel("Country", fontsize=16, labelpad=20)
plt.show()

In the heatmap, darker colours correspond to greater average compensation for the job title. Each cell in the heatmap is made up of the corresponding country/job title pair. 

Not all country/job titles had respondents and the white cells represent these missing data points.

A point to keep in mind is that these compensations are all in USD and may not be a useful comparison tool between countries, as compensation in a lower band in one country may in fact offer the same standard of living as a higher band in a highly compensated country. Instead, compare cells within a row: darker colours indicate the better compensated positions within that country.

We can see that not many cells contain data. After examining the number of respondents, we also found quite a few entries with data from a low number of individuals.

As stated [in this cyberlecture](http://web.pdx.edu/~newsomj/pa551/lecture4.htm), larger samples better represent the population. For another analysis, we recreated the above heatmap using only country/job title pairings with at least 30 data points. This follows a common rule of thumb associated with the [central limit theorem](https://www.investopedia.com/terms/c/central_limit_theorem.asp), under which a sample of around 30 is often treated as large enough to be fairly representative of the population.
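
The filtering step can be sketched on toy data as follows. The country names and salary values are made up, and the constant `MIN_RESPONSES` is our own name for the threshold. Note that this sketch counts non-missing salary values per group, which is slightly stricter than counting every row in the group.

```python
import pandas as pd

MIN_RESPONSES = 30  # minimum number of data points per country/job title pairing

# Toy data: 35 responses for one pairing, 5 for another (values are made up)
toy = pd.DataFrame({
    "Country": ["A"] * 35 + ["B"] * 5,
    "Job Title": ["Data Analyst"] * 40,
    "Yearly Comp USD": [50_000.0] * 35 + [90_000.0] * 5,
})

# Mean salary and number of non-missing salaries for each pairing
summary = toy.groupby(["Country", "Job Title"])["Yearly Comp USD"].agg(["mean", "count"])

# Keep only the pairings with enough data points
reliable = summary[summary["count"] >= MIN_RESPONSES]
print(reliable)
```

Country B has the higher average, but with only 5 responses it is dropped rather than being allowed to dominate the heatmap.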

In [None]:
## remove country/job title combinations with fewer than 30 responses
large_sample_salary = salary_data.groupby(["Country", "Job Title"]).mean()
large_sample_salary['Total'] = salary_data.groupby(["Country", "Job Title"]).size()
large_sample_salary = large_sample_salary[large_sample_salary['Total'] > 29]

# heatmap of the large sample results with >= 30 data point pairs included
pivot_total = pd.pivot_table(large_sample_salary, values=["Yearly Comp USD"], index=["Country"], columns=["Job Title"], fill_value=0)
plt.figure(figsize=(12,12))
ax = sns.heatmap(data=pivot_total, robust=True, linewidth=.015, xticklabels=True, yticklabels=True, cmap="Greens")
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, rotation_mode="anchor", ha="right")
plt.title("Large Sample Country/Job Pairings", fontsize=20, pad=20)
plt.xlabel("Job Title", fontsize=16, labelpad=20)
plt.ylabel("Country", fontsize=16, labelpad=20)
plt.show()

As before, the white cells represent an average of 0, which here means missing data. The heatmap includes a cell for every country/job title pairing, and these combinations did not have sufficient data.

Because it is restricted to larger samples, this heatmap contains the data that is most likely to be representative of its population.

After examining these heatmaps, Israel, the United States of America and Australia seem to have the highest compensation in USD. It should be noted that this does not necessarily mean that compensation in other countries is worse, as countries with a lower cost of living/median wage may have "low" compensation in USD that is actually "high" compensation locally.

Data Scientist and Other seem to be the job titles associated with the highest compensation levels.

## 4.4 Industry Breakdown
Within the survey, respondents were able to enter their industry. We have put together a heatmap to demonstrate the spread of Kagglers' job titles across various industries.

In [None]:
# df for job titles and industry
industry_data = df[["Q5", "Q20"]]
industry_data = industry_data.rename(columns={"Q5":"Job Title", "Q20":"Industry"})
industry_grouped = industry_data.groupby(["Job Title", "Industry"]).size().reset_index()

# build the heatmap of industry to job titles
pivot_total = pd.pivot_table(industry_grouped, index=["Job Title"], columns=["Industry"], fill_value=0)
plt.figure(figsize=(12,12))
ax = sns.heatmap(data=pivot_total, robust=True, linewidth=.015, xticklabels=True, yticklabels=True, cmap="Greens")
ax.set(xticklabels=['Academics/Education','Accounting/Finance','Broadcasting/Communications','Computers/Technology','Energy/Mining',
               'Government/Public Service', 'Hospitality/Entertainment/Sports','Insurance/Risk Assessment','Manufacturing/Fabrication',
              'Marketing/CRM','Medical/Pharmaceutical','Military/Security/Defense','Non-profit/Service','Online Business/Internet-based Sales',
              'Online Service/Internet-based Services','Other','Retail/Sales','Shipping/Transportation'])
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, rotation_mode="anchor", ha="right")
plt.title("Industry Breakdown for Job Titles", fontsize=20, pad=20)
plt.xlabel("Industry", fontsize=16, labelpad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.show()

In the above heatmap, darker green represents a higher concentration of survey respondents reporting that job title/industry pairing.

As expected, the industries that are most popular among respondents include computers/technology and academics/education. 
Accounting/finance also seems to be popular. 

Data Scientist and Research Scientist positions are popular in the academics/education industry. Data Scientist, Machine Learning Engineer and Software Engineer are popular responses for respondents from the computers/technology industry.
Accounting/finance seems to employ mainly Data Analysts, Data Scientists and Software Engineers. 

## 4.5 Job Activities
When selecting a career, it is important to consider the day-to-day activities you will be performing along with your personal preference. The following plot shows the proportions of different job activities by job title.
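
Before plotting, a table of proportions (one row per job title, one column per activity) has to be reshaped so that every job title/activity pair becomes its own row. The sketch below illustrates that reshaping on a made-up two-by-two table; the activity names and values are placeholders, while the column names `x_axis`, `y_axis` and `proportion` follow the ones used in this notebook.

```python
import pandas as pd

# Toy proportions table: rows are job titles, columns are activities (values made up)
wide = pd.DataFrame(
    {"Analyse data": [0.8, 0.4], "Build ML prototypes": [0.3, 0.6]},
    index=pd.Index(["Data Analyst", "ML Engineer"], name="Job_Title"),
)

# unstack() turns the grid into one row per (activity, job title) pair
long_df = wide.unstack().reset_index()
long_df.columns = ["x_axis", "y_axis", "proportion"]

# Marker area passed to plt.scatter; 1000 is the scaling factor used in this notebook
long_df["bubble_size"] = long_df["proportion"] * 1000
print(long_df)
```

Each row of `long_df` corresponds to one bubble: its position comes from `x_axis` and `y_axis`, and both its size and its colour come from `proportion`.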

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,119:127]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,119:127]].groupby(['Q5']).size()
tdf

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title',
'Analyze and understand data to \n influence product or business decisions',
'Build and/or run the data infrastructure \n that my business uses for storing, \n analyzing, and operationalizing data',
'Build prototypes to explore\n applying machine learning to new areas',
'Build and/or run a machine \n learning service that operationally \n improves my product or workflows',
'Experimentation and iteration \n to improve existing ML models',
'Do research that advances the \n state of the art of machine learning',
'None of these activities are an \n important part of my role at work',
'Other'
]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Job Activities by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Activity", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

As we can see above, the most common activity across all job titles is "Analyze and understand data to influence product or business decisions". However, in some roles, such as Software Engineer, it is not as prevalent.

## 4.6 Concluding Notes for Section 4
Hopefully the above section has provided you with sufficient information to help decide on a career path or at least serve as a starting point. Keep in mind your choice for section 5 as you can tailor your self-directed learning towards the career you wish to pursue.

# 5. Breakdown of Skills/Tools/Methods by Job Title
The purpose of this section is to help identify where to focus your self-directed learning based on your career path.

Most of the information for this section is presented as bubble charts. Both the colour and the size of each bubble show the proportion of respondents with a given job title who selected a specific item, and the colour scale on the right of each chart shows the range of proportions. If that's a little confusing, this example of how to use the charts might help.

Say you want to be a Machine Learning Engineer; we can see in terms of programming languages (section 5.2.1) that Python is the most used, as it is coloured yellow and the circle is comparatively large. Note that the size of the circle and the scale on the right of the chart give us an indication of how many respondents answered the question. Now let's look at computer vision methods (section 5.3.3). We can see that, for the Machine Learning Engineer, most respondents who answered the question said they use "Image classification and other general-purpose networks", but the bubble is smaller and the scale on the right doesn't go as high. So, what does this mean? It means that fewer of the respondents who said they were Machine Learning Engineers selected the image classification option (based on the scale we can assume less than 50%) than selected the Python option.

So how do we use this information? Remember that the goal is to identify what to learn and in what order. If you already have experience with tools, you'll need to factor that in (as well as current demand in your workplace). To make it simple we recommend:
1. Create a list of 1 tool/method/skill from each section/chart, based on your career path.
2. Order the list by the size of the bubble/scale, with Programming Language first (5.2.1) and Integrated Development Environment (IDE) second (5.2.2). 

For example, if your career path is data scientist you might start with Python --> Jupyter Notebook --> Matplotlib --> Scikit-Learn --> Linear/Logistic Regression --> etc.

Essentially, you're making a list of what's relevant to you and learning the most used first (larger bubbles).
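
The two steps above can be sketched in a few lines of plain Python. The proportions below are placeholders that you would read off the charts for your own career path (they are not survey results), and `build_learning_list` is a hypothetical helper written for this illustration.

```python
# Hypothetical proportions read off the charts for one job title
# (numbers are placeholders, not survey results)
chart_proportions = {
    "Programming Language": {"Python": 0.9, "SQL": 0.5, "R": 0.3},
    "IDE": {"Jupyter Notebook": 0.7, "VSCode": 0.4},
    "Visualisation": {"Matplotlib": 0.6, "Seaborn": 0.45},
    "ML Framework": {"Scikit-learn": 0.65, "TensorFlow": 0.35},
}

# Categories to learn first, in this order (sections 5.2.1 and 5.2.2)
fixed_first = ["Programming Language", "IDE"]


def build_learning_list(charts, first):
    # Step 1: take the most used item from each chart as a (name, proportion) pair
    top = {cat: max(items.items(), key=lambda kv: kv[1]) for cat, items in charts.items()}
    # Step 2: pinned categories first, then the rest by descending proportion
    pinned = [top[cat][0] for cat in first if cat in top]
    rest = sorted((cat for cat in top if cat not in first),
                  key=lambda cat: top[cat][1], reverse=True)
    return pinned + [top[cat][0] for cat in rest]


learning_list = build_learning_list(chart_proportions, fixed_first)
print(" --> ".join(learning_list))
```

If you already know some of the tools on the resulting list, simply skip them and move on to the next item.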

This section categorises items into Learning Tools, Basic Tools, Machine Learning, Cloud Computing, Data Storage and Big Data, and Business Intelligence.



## 5.1 Learning Tools
There are many tools available to you online that provide you with the opportunity to learn/refine your data science skills. Kagglers listed the platforms that they have used, and we aim to investigate which platforms are the most popular.


In [None]:
# columns for answers to question 40, all parts
tdf = df.iloc[:,np.r_[243:255]]

# transpose data so that rows become learning tools
tdf = tdf.transpose()

# rename index to learning tool
tdf.rename(index={"Q40_Part_1":"Coursera", "Q40_Part_2":"edX", "Q40_Part_3":"Kaggle Learn Courses", "Q40_Part_4":"DataCamp", "Q40_Part_5":"Fast.ai", "Q40_Part_6":"Udacity", "Q40_Part_7":"Udemy", 
                  "Q40_Part_8":"LinkedIn Learning", "Q40_Part_9":"Cloud-certification programs \n (direct from AWS, \n Azure, GCP, or similar)", "Q40_Part_10":"University Courses \n (resulting in a university degree)", 
                  "Q40_Part_11":"None", "Q40_OTHER":"Other"}, inplace=True)

# convert responses to a binary representation: 1 means the option was selected, 0 means no response
# (notna() treats every kind of missing value as "not selected", unlike an identity check against np.nan)
binary_tdf = tdf.notna().astype(int)

# sum the total of all rows and get a total 
binary_tdf["Total"] = binary_tdf.apply(np.sum, axis = 1)

# convert to percentages
# get total count for responses of tools used - multiple education tools maybe used by one respondent
total_responses = binary_tdf["Total"].sum()
# divide each tool's count by the total number of selections across all tools
binary_tdf["Percentage"] = binary_tdf["Total"].div(total_responses, axis=0)
binary_tdf["Percentage"] = pd.Series([val*100 for val in binary_tdf["Percentage"]], index = binary_tdf.index)

# plot in descending from most used to least used 
binary_tdf.sort_values("Percentage", ascending=False).plot.bar(y="Percentage", color=['C2', 'C2', 'C2', 'C7', 'C7', 'C7', 'C7', 'C7', 'C7', 'C7', 'C7', 'C7'], figsize=(15,10))

plt.title("Learning Tool Usage", fontsize=20, pad=20)
plt.ylabel("Percentage of Responses that used each Learning Tool", fontsize=16, labelpad=20)
plt.xlabel("Learning Tool", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.show()

As we can see from the above bar graph, there is a wide variety of tools to choose from. We have coloured the graph so that you can clearly see the 3 most popular tools, which are Coursera, Kaggle Learn Courses and Udemy. 
[Coursera](https://www.coursera.org/) was the most popular, accounting for just under 20% of all learning tool selections.
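
Note that the percentages in the chart are calculated by dividing each tool's count by the total number of selections across all tools (`total_responses` in the code), not by the number of respondents. As respondents could tick several tools, the two denominators give different figures. The sketch below shows the difference on made-up answers from four respondents.

```python
import numpy as np
import pandas as pd

# Toy multi-select answers: 4 respondents, NaN = platform not ticked (made-up data)
toy = pd.DataFrame({
    "Coursera": ["Coursera", "Coursera", np.nan, "Coursera"],
    "Udemy": ["Udemy", np.nan, np.nan, "Udemy"],
    "edX": [np.nan, np.nan, "edX", np.nan],
})

ticks = toy.notna().sum()  # number of ticks per platform

share_of_selections = ticks / ticks.sum() * 100  # denominator used for the bar chart
share_of_respondents = ticks / len(toy) * 100    # alternative: divide by respondents

print(share_of_selections.round(1))
print(share_of_respondents.round(1))
```

In this toy example Coursera makes up 50% of all selections but was used by 75% of respondents, so the share of selections understates how many people used a tool.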

## 5.2 Basic Tools
Basic Tools encompasses the fundamental tools for a practitioner: a Programming Language, an IDE, a Hosted Notebook and a Data Visualisation Tool.

### 5.2.1 Programming Languages
A programming language is essentially a written language that tells computers what to do.


In [None]:
# columns for Q7_Part_1 to Q7_Other
tdf = df.iloc[:,np.r_[5,7:20]].groupby(['Q5']).count()
tdf["Total"] = df.iloc[:,np.r_[5,7:20]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title','Python', 'R', 'SQL', 'C','C++','Java','Javascript','Julia','Swift','Bash','MATLAB','None','Other']
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Programming Languages by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Programming Language", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

# plt.legend(loc='best', bbox_to_anchor=(1.7,1), labelspacing=2, fontsize=14, frameon=False, markerscale=1)

plt.tight_layout()
plt.show()

As we can see, Python is the most used programming language across the relevant careers, with SQL being the second most common.

### 5.2.2 Integrated Development Environments (IDEs)
An IDE is a software tool that combines different developer tools. Think of it as the platform on which you will be writing your code.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,21:34]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,21:34]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title',"JupyterLab", "RStudio", "Visual Studio","Visual Studio Code\n (VSCode)", "PyCharm","Spyder","Notepad++", "Sublime Text", "Vim / Emacs", "MATLAB", "Jupyter Notebook","None","Other" ]
tdf.set_index('Job_Title',inplace=True)


# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("IDEs by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("IDE", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

As we can see, Jupyter Notebook is the most used IDE across the relevant professions followed by Visual Studio Code. It is worth noting that the statistician career path stands apart with RStudio being the most used IDE.

### 5.2.3 Hosted Notebooks
A Notebook is a type of IDE that also supports text elements. A Hosted Notebook is simply a notebook hosted online, often used for working collaboratively with others.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,34:51]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,34:51]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', "Kaggle Notebooks","Colab Notebooks", "Azure Notebooks", "Paperspace / Gradient", "Binder / JupyterHub", "Code Ocean",
               "IBM Watson Studio", "Amazon Sagemaker \n Studio Notebooks", "Amazon EMR Notebooks", "Google Cloud Notebooks \n (AI Platform / Vertex AI)",
               "Google Cloud Datalab", "Databricks Collaborative \n Notebooks", "Zeppelin / Zepl Notebooks", "Deepnote Notebooks", "Observable Notebooks",
               "None", "Other"]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Hosted Notebooks by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Hosted Notebook", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

Here we can see that Colab Notebooks is the most used hosted notebook, with Kaggle Notebooks a close second.

### 5.2.4 Data Visualisation Libraries / Tools
Data visualisation libraries are the libraries used to create charts, graphs and other visualisations. Think of them as a toolbox for visualising data: rather than programming everything from scratch, you can build a visualisation with a few lines of code. For example, the bubble charts you see here were created with Matplotlib.
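As an illustration of the "few lines of code" point, here is a minimal sketch of a bubble chart in Matplotlib. The tool names, job names and proportions are hypothetical placeholders rather than survey values.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical long-format data: one row per (tool, job) pair
long_df = pd.DataFrame({
    "tool": ["Tool A", "Tool A", "Tool B", "Tool B"],
    "job": ["Job X", "Job Y", "Job X", "Job Y"],
    "proportion": [0.8, 0.3, 0.2, 0.6],
})

fig, ax = plt.subplots(figsize=(6, 4))
bubbles = ax.scatter(
    long_df["tool"], long_df["job"],
    s=long_df["proportion"] * 1000,   # bubble area scales with the proportion
    c=long_df["proportion"],          # colour encodes the same value
    cmap="viridis", vmin=0, vmax=1,
    edgecolors="black", zorder=2,
)
ax.grid(ls="--", zorder=1)            # dashed grid drawn behind the bubbles
fig.colorbar(bubbles, ax=ax, label="Proportion of respondents")
```

In a notebook the figure renders at the end of the cell; in a script, a call to `plt.show()` would display it.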

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,59:71]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,59:71]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'Matplotlib', 'Seaborn', "Plotly / Plotly Express", "Ggplot / ggplot2", 'Shiny', 'D3.js', 'Altair', 'Bokeh', 'Geoplotlib', 'Leaflet / Folium', 'None', 'Other']
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Data Visualisation Libraries / Tools by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Library / Tool", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

Here we can see that Matplotlib is the most widely used visualisation tool across most professions. The exception is statisticians, for whom it is a close second to ggplot / ggplot2.

### 5.2.5 Primary Tools
This section looks at which category of software respondents use as their primary tool, which can be factored into prioritising your list of items to learn.

In [None]:
# create info df with desired info
primary_tool = df[["Q5", "Q41"]]
primary_tool = primary_tool.rename(columns={"Q5":"Job Title", "Q41":"Primary Tool"})

grouped_tool = pd.DataFrame(primary_tool.groupby(["Job Title", "Primary Tool"], as_index=True).size())

# build an empty job title x primary tool table to hold the counts
job_options = primary_tool["Job Title"].to_list()
unique_job_options = set(job_options)
tool_options = primary_tool["Primary Tool"].to_list()
unique_tool_options = set(tool_options)
# pass lists rather than sets, since newer pandas versions raise on a set index/columns
tool_data = pd.DataFrame(index=list(unique_job_options), columns=list(unique_tool_options))

# fill tool_data using the counts from grouped_tool
for i in range(0, (grouped_tool.shape[0])):
  # get Series for each row in grouped_tool
  series = grouped_tool.iloc[i]
  # series.name has data @ index 0 = job title and 1 = primary tool
  # series[0] is the count of that job/tool pairing
  tool_data.at[series.name[0], series.name[1]] = series[0]

# drop the column for respondents with no primary tool answer (np.NaN was removed in NumPy 2.0)
tool_data.drop(np.nan, axis=1, inplace=True)

# fill in total count for each job title
tool_data["Total"] = tool_data.sum(axis=1)
tool_data.fillna(value=0, inplace=True)
tool_data
# get list of columns and remove total
cols = tool_data.columns.tolist()
cols.remove('Total')

# divide all values by row total to get proportion
proportion_tool_data = tool_data[cols].div(tool_data.Total, axis=0)

# horizontal stacked bar chart - y-axis is job title, x-axis is the proportion and the stacked segments make up the primary tools (modified from https://towardsdatascience.com/stacked-bar-charts-with-pythons-matplotlib-f4020e4eb4a7 tutorial)
# build chart data, filter out Total column for graphing
fields = proportion_tool_data.columns
# one colour per primary tool category (6 listed)
colors = ['#1D2F6F', '#8390FA', '#6EAF46', '#FAC748', '#FF5733', '#FFA533']

# figure and axis
fig, ax = plt.subplots(1, figsize=(12, 10))
# plot bars
left = len(proportion_tool_data) * [0]
for idx, name in enumerate(fields):
    plt.barh(proportion_tool_data.index, proportion_tool_data[name], left = left, color=colors[idx])
    left = left + proportion_tool_data[name]
# title, legend, labels
plt.xlabel('Proportion of primary tool by job title', fontsize=16, labelpad=20)
plt.title('Primary tool usage by job title\n', loc='left', fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
# remove spines
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)
# adjust limits and draw grid lines
plt.ylim(-0.5, ax.get_yticks()[-1] + 0.5)
plt.legend(fields, ncol=2, frameon=False, bbox_to_anchor=(1.0, 1.0))
ax.set_axisbelow(True)
ax.xaxis.grid(color='gray', linestyle='dashed')
plt.gca().invert_yaxis()
plt.show()

As visible above, the majority of primary tools used across professions are local development environments and basic statistical software. Which of those two you prioritise in your self-directed learning will depend on your selected career path.
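The row-wise proportions behind a stacked bar chart like this can also be computed in a single step with `pd.crosstab`. This is a sketch of an alternative to the loop-based approach, not the method used in this notebook, and the small DataFrame below is a hypothetical stand-in for the two survey columns.

```python
import pandas as pd

# Hypothetical stand-in for the job title and primary tool columns
primary_tool = pd.DataFrame({
    "Job Title": ["Data Analyst", "Data Analyst", "Data Analyst",
                  "Data Scientist", "Data Scientist"],
    "Primary Tool": ["Basic statistical software", "Basic statistical software",
                     "Local development environments",
                     "Local development environments", None],
})

# Rows are job titles, columns are tools, values are each tool's share within a job title.
# Respondents with a missing answer are excluded, so every row sums to 1.
proportion_tool_data = pd.crosstab(
    primary_tool["Job Title"], primary_tool["Primary Tool"], normalize="index"
)
print(proportion_tool_data)
```

The resulting table has the same layout as `proportion_tool_data` above, with job titles as the index and one column per tool category.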

## 5.3 Machine Learning (ML)
Machine Learning refers to algorithms that improve an analytical model automatically as they look over more data.

### 5.3.1 Machine Learning Frameworks
Machine learning frameworks are the frameworks and libraries used in machine learning. Think of them as the toolbox for machine learning: as with our data visualisation libraries, rather than programming everything from scratch, you can use these frameworks to run ML algorithms with only a few lines of code.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,72:90]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,72:90]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'Scikit-learn', 'TensorFlow', 'Keras', 'PyTorch', 'Fast.ai', 'MXNet', 'Xgboost', 'LightGBM', 'CatBoost', 'Prophet', 'H2O 3', 'Caret', 'Tidymodels', 'JAX', 'PyTorch Lightning', 'Huggingface', 'None', 'Other']
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Machine Learning Frameworks by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Framework", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

As we can see, the most used machine learning framework across professions is Scikit-learn. It is worth noting that for Developer Relations/Advocacy, Scikit-learn appears to be closely tied with TensorFlow as the most used framework for the profession.

### 5.3.2 Machine Learning Algorithms
Within the ML Frameworks there are different algorithm types. If ML Frameworks are the toolboxes, then ML Algorithms are the tools in the box.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,90:102]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,90:102]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'Linear or Logistic \n Regression', 'Decision Trees or \n Random Forests', 'Gradient Boosting Machines \n (xgboost, lightgbm, etc)', 'Bayesian Approaches', 'Evolutionary Approaches',
               'Dense Neural Networks \n (MLPs, etc)', 'Convolutional \n Neural Networks', 'Generative Adversarial \n Networks', 'Recurrent Neural \n Networks', 'Transformer Networks \n (BERT, GPT-3, etc)', 'None', 'Other'
]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Machine Learning Algorithms by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Algorithm", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

As we can see, the most used machine learning algorithm across professions is Linear or Logistic Regression. It is worth noting that for Machine Learning Engineers, Convolutional Neural Networks come a very close second.
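To make the toolbox analogy concrete, the sketch below fits the simplest of these algorithms, a linear regression, using only NumPy's least-squares solver. The data are synthetic and the variable names are illustrative. A framework such as Scikit-learn wraps this kind of computation behind a fit/predict interface.

```python
import numpy as np

# Synthetic data that follows y = 2x + 1 exactly (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones so the model can learn an intercept
X = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: find the coefficients that minimise the squared error
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
slope, intercept = coef
print(slope, intercept)
```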

### 5.3.3 Computer Vision Methods
Computer vision methods refer to methods for computers to gain meaningful information from digital images or videos. Similar to ML algorithms, computer vision methods are another type of tool in the box, so if ML algorithms are the different types of screwdrivers, then computer vision methods would be the hammers.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,102:109]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,102:109]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title','General purpose image / video \n tools (PIL, cv2, skimage, etc)', 'Image segmentation methods \n (U-Net, Mask R-CNN, etc)', 'Object detection methods \n (YOLOv3, RetinaNet, etc)', 
               'Image classification and \n other general purpose networks \n (VGG, Inception, ResNet, ResNeXt,\n NASNet, EfficientNet, etc)', 'Generative Networks \n (GAN, VAE, etc)', 'None', 'Other'
]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Computer Vision Methods by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Method", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

As we can see, image classification and other general-purpose networks are the most used computer vision methods. There are two points of interest here. First, Machine Learning Engineers appear to use comparatively more computer vision methods than other professions. Second, only approximately 40% of Machine Learning Engineers answered that they use image classification, which makes it less of a priority for self-directed learning than, say, Python, a programming language that over 90% of Machine Learning Engineers answered that they use.
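This kind of comparison amounts to ranking skills by the share of a profession that reports using them. A minimal sketch of that ranking for a single job title follows. The values, the skill list and the 0.5 cut-off are illustrative assumptions, not figures computed from the survey.

```python
import pandas as pd

# Hypothetical usage proportions for one job title (illustrative numbers only)
usage = pd.Series({
    "Image classification networks": 0.40,
    "Python": 0.90,
    "Word embeddings": 0.25,
    "Scikit-learn": 0.70,
})

threshold = 0.5  # hypothetical cut-off between "learn first" and "learn later"

# rank skills from most to least used, then split on the cut-off
ranked = usage.sort_values(ascending=False)
priority = ranked[ranked >= threshold].index.tolist()
later = ranked[ranked < threshold].index.tolist()

print("Learn first:", priority)
print("Learn later:", later)
```

The same pattern could be applied to any row of the proportion tables built in this section, for example by selecting one job title and sorting its values.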

### 5.3.4 Natural Language Processing (NLP) Methods
NLP methods refer to the methods by which computers gain meaningful information from natural language such as speech and text. Once again, as with ML algorithms and computer vision methods, these are a completely different type of tool in the toolbox.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,109:115]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,109:115]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'Word embeddings/vectors \n (GLoVe, fastText, word2vec)', 'Encoder-decoder models \n (seq2seq, vanilla transformers)', 
               'Contextualized embeddings \n (ELMo, CoVe)', 'Transformer language models \n (GPT-3, BERT, XLnet, etc)', 'None', 'Other'

]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("NLP Methods by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Method", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

As we can see, the most used NLP method is word embeddings/vectors, which is used most by Machine Learning Engineers. However, only about 20-30% answered that they use these methods.

### 5.3.5 Automated Machine Learning
Automated Machine Learning (AutoML) refers to automating the steps involved in applying machine learning solutions. If ML algorithms are the screwdrivers, then AutoML is the power drill.

#### 5.3.5.1 AutoML Techniques
The machine learning process involves several different steps, and AutoML techniques refer to the steps that can be automated. If delivering a presentation to your manager is the machine learning process, then having a computer make the presentation slides for you would be one of the AutoML techniques.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,205:213]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,205:213]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'Automated data augmentation \n (e.g. imgaug, albumentations)', 'Automated feature \n engineering / selection \n (e.g. tpot, boruta_py)', 
               'Automated model selection \n (e.g. auto-sklearn, xcessiv)', 'Automated model architecture \n searches (e.g. darts, enas)', 'Automated hyperparameter tuning \n (e.g. hyperopt, ray.tune, Vizier)', 
               'Automation of full ML pipelines \n (e.g. Google AutoML, H2O Driverless AI)', 'No / None', 'Other'
]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Automated Machine Learning Techniques by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Technique", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

As we can see, most respondents do not use automated machine learning techniques at this time. However, there are still a small number of respondents in each profession that do. Whilst this means that it may not be an initial point of self-directed learning, it could also be a skill that, once developed, would provide an individual with a competitive edge in the job market (provided there is demand for the skill). The most viable selection in this case would be based on chosen career path.

#### 5.3.5.2 AutoML Tools
AutoML tools refer to the tools used to carry out the techniques in the previous section. If automatically building presentation slides is an AutoML technique, then Microsoft PowerPoint would be an AutoML tool.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,213:221]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,213:221]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'Google Cloud AutoML', 'H2O Driverless AI', 'Databricks AutoML', 'DataRobot AutoML', 'Amazon Sagemaker Autopilot', 'Azure Automated \n Machine Learning', 'No / None', 'Other'

]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Automated Machine Learning Tools by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Tool", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

As with automated machine learning techniques, we see that most respondents use none of the listed options. Following this, Google Cloud AutoML appears to be the next most common answer, with a few respondents using it.

### 5.3.6 Managed Machine Learning Products
Managed machine learning products refer to the products available to support building and deploying your own machine learning model.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,155:165]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,155:165]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'Amazon SageMaker', 'Azure Machine \n Learning Studio', 'Google Cloud \n Vertex AI', 'DataRobot', 'Databricks', 'Dataiku', 'Alteryx', 'Rapidminer', 'No / None', 'Other'
]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Managed Machine Learning Products by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Product", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

As we can see, most respondents do not use any of the managed machine learning products listed. Of the remaining options, the most viable to study will depend on your chosen career path.

### 5.3.7 Machine Learning Management Tools
Machine learning management tools refer to the tools/products for managing your machine learning experiments.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,221:233]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,221:233]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'Neptune.ai', 'Weights & Biases', 'Comet.ml', 'Sacred + Omniboard', 'TensorBoard', 'Guild.ai', 'Polyaxon', 'Trains', 'Domino Model Monitor', 'MLflow', 'No / None', 'Other'

]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Machine Learning Management Tools by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Tool", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

As we can see, most respondents do not use a machine learning management tool. The exception is Machine Learning Engineers, who use TensorBoard more than any other ML management tool.

## 5.4 Cloud Computing
Cloud computing is essentially performing computing tasks in the cloud rather than on your local device.

### 5.4.1 Cloud Computing Platforms
Cloud computing platforms simply refer to the platforms/providers that can facilitate cloud computing. If we consider cloud computing to be a car, cloud computing platforms are simply the different makes.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,129:141]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,129:141]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'Amazon Web Services (AWS)', 'Microsoft Azure', 'Google Cloud Platform (GCP)', 'IBM Cloud / Red Hat', 
               'Oracle Cloud', 'SAP Cloud', 'Salesforce Cloud', 'VMware Cloud', 'Alibaba Cloud', 'Tencent Cloud', 'None', 'Other'

]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Cloud Computing Platforms by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Platform", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

As we can see, the most used cloud computing platform was Amazon Web Services, followed by Google Cloud Platform, although only a little over 30% of Data Engineers and Data Scientists from the pool of respondents answered that they use AWS.

### 5.4.2 Cloud Computing Products
Cloud computing products refer to the products available on each platform. To stick with our car example, if the platform is the make, the product is the model.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,142:147]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,142:147]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'Amazon Elastic \n Compute Cloud (EC2)', 'Microsoft Azure \n Virtual Machines', 'Google Cloud \n Compute Engine', 'No / None', 'Other'

]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Cloud Computing Products by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Product", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

Amazon Elastic Compute Cloud (EC2) turned out to be the most used Cloud Computing Product.

## 5.5 Data Storage and Big Data

### 5.5.1 Data Storage Products
Data storage products are the different tools available for storing data in the cloud.


In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,147:155]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,147:155]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'Microsoft Azure \n Data Lake Storage', 'Microsoft Azure \n Disk Storage', 'Amazon Simple \n Storage Service (S3)',
               'Amazon Elastic \n File System (EFS)', 'Google Cloud \n Storage (GCS)', 'Google Cloud Filestore', 'No / None', 'Other'
]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Data Storage Products by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Product", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

#Legend with colour bar and circle sizes
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

Out of the Data Storage Products, Amazon Simple Storage Service appeared to be the most used.

### 5.5.2 Big Data Products
Big data products refer to the products available for handling datasets that are too large for conventional applications to process comfortably.

In [None]:
# columns for answers to question
tdf = df.iloc[:,np.r_[5,165:186]].groupby(['Q5']).count()

# total number of entries in relevant job titles
tdf["Total"] = df.iloc[:,np.r_[5,165:186]].groupby(['Q5']).size()

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'MySQL', 'PostgreSQL', 'SQLite', 'Oracle Database', 'MongoDB', 'Snowflake', 'IBM Db2', 'Microsoft SQL Server', 'Microsoft Azure SQL Database', 'Microsoft Azure Cosmos DB', 
               'Amazon Redshift', 'Amazon Aurora', 'Amazon RDS', 'Amazon DynamoDB', 'Google Cloud BigQuery', 'Google Cloud SQL', 'Google Cloud Firestore', 'Google Cloud BigTable', 'Google Cloud Spanner', 'None', 'Other'
]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Big Data Products by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Product", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

# Colour bar legend (bubble colour and size both encode the proportion)
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

Usage of big data products was widely dispersed, indicating that professionals aren’t gravitating entirely to one or two platforms. MySQL nevertheless appeared to be the most used, followed by Microsoft SQL Server and PostgreSQL.
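
To make the calculation behind the chart concrete, here is a minimal sketch on a made-up miniature survey. The toy DataFrame, its values and the helper name `usage_proportions` are assumptions for illustration only. The pattern mirrors the cell above: count the non-null answers per job title, then divide by the number of respondents in that job title.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature survey (not real data): one row per respondent;
# a product column holds the product name if selected and NaN otherwise
toy = pd.DataFrame({
    "Q5":         ["Data Analyst", "Data Analyst", "Data Scientist", "Data Scientist"],
    "MySQL":      ["MySQL", "MySQL", "MySQL", np.nan],
    "PostgreSQL": [np.nan, "PostgreSQL", np.nan, np.nan],
})

def usage_proportions(frame, group_col="Q5"):
    """Share of respondents in each group who selected each product."""
    grouped = frame.groupby(group_col)
    counts = grouped.count()   # non-null answers per product
    totals = grouped.size()    # respondents per group
    return counts.div(totals, axis=0)

proportions = usage_proportions(toy)

# rank products by their average share across job titles
ranking = proportions.mean().sort_values(ascending=False)

print(proportions)
print(ranking)
```

Note that averaging across job titles weights every job title equally, regardless of how many respondents it has, so this ranking is only one possible way to summarise "most used".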

## 5.6 Business Intelligence Tools
Business intelligence tools are the products used to turn raw data into useful insights.

In [None]:
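# select the job title column (Q5) plus the answer columns for this question,
# then count the non-null answers per tool for each job title
tdf = df.iloc[:,np.r_[5,187:204]].groupby(['Q5']).count()

# total number of respondents in each job title
tdf["Total"] = df.iloc[:,np.r_[5,187:204]].groupby(['Q5']).size()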

#get list of columns and remove total
cols = tdf.columns.tolist()
cols.remove('Total')

# divide all values by row total
tdf = tdf[cols].div(tdf.Total, axis=0)

# resets index and assigns column names which will serve as x axis labels
tdf.reset_index(inplace=True)
tdf.columns = ['Job_Title', 'Amazon QuickSight', 'Microsoft Power BI', 'Google Data Studio', 'Looker', 'Tableau', 'Salesforce', 'Einstein Analytics', 
               'Qlik', 'Domo', 'TIBCO Spotfire', 'Alteryx', 'Sisense', 'SAP Analytics Cloud', 'Microsoft Azure Synapse', 'Thoughtspot', 'None', 'Other'
]
tdf.set_index('Job_Title',inplace=True)

# unstacks temporary dataframe and builds a cartesian product for our bubble chart
dfu = tdf.unstack().reset_index()
dfu.columns = ['x_axis','y_axis','proportion']

# Create bubble plot with grid
fig = plt.figure()
plt.scatter(dfu.x_axis, dfu.y_axis, s = dfu.proportion*1000, edgecolors="black", c = dfu['proportion'], zorder = 2, cmap="viridis")
plt.grid(ls = "--", zorder = 1)
fig.set_size_inches(18.5, 10.5)

# Set Titles and Labels
plt.title("Business Intelligence Tools by Job Title", fontsize=20, pad=20)
plt.ylabel("Job Title", fontsize=16, labelpad=20)
plt.xlabel("Tool", fontsize=16, labelpad=20)
plt.xticks(rotation=45, rotation_mode='anchor', ha='right')
plt.gca().invert_yaxis()

# Colour bar legend (bubble colour and size both encode the proportion)
plt.colorbar(ticks=np.linspace(0,1,11))

plt.tight_layout()
plt.show()

For business intelligence tools, most respondents said they do not use any. Among those who do use a tool, Microsoft Power BI and Tableau were the most popular.
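
Because "None" dominates the chart, it can help to recompute the shares among only those respondents who selected at least one tool. The sketch below does this on a made-up miniature survey. The toy data and the variable names are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature survey (not real data): one row per respondent
toy = pd.DataFrame({
    "Q5":                 ["Data Analyst", "Data Analyst", "Data Analyst",
                           "Data Scientist", "Data Scientist"],
    "Microsoft Power BI": ["Microsoft Power BI", np.nan, np.nan, np.nan, np.nan],
    "Tableau":            [np.nan, "Tableau", np.nan, "Tableau", np.nan],
    "None":               [np.nan, np.nan, "None", np.nan, "None"],
})

tool_cols = ["Microsoft Power BI", "Tableau"]

# keep only respondents who selected at least one actual tool
uses_a_tool = toy[tool_cols].notna().any(axis=1)
users = toy.loc[uses_a_tool, ["Q5"] + tool_cols]

# share of tool users in each job title who selected each tool
grouped = users.groupby("Q5")
share_among_users = grouped.count().div(grouped.size(), axis=0)

print(share_among_users)
```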

# 6. Closing Remarks

Returning to our original question of "What do I learn?", this notebook has hopefully given you the information you need to begin your self-directed learning journey and make the best use of your study time.

What you choose to study will vary with your career path. However, if the goal is to stay employable by remaining flexible, it may be wise to focus on areas that are used across multiple careers (such as the Python programming language).
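
One way to identify areas that are used across multiple careers is to look at the lowest share a tool reaches in any job title. The sketch below assumes a proportions table shaped like `tdf` (job titles as rows, tools as columns). The numbers and the 0.4 threshold are invented for illustration.

```python
import pandas as pd

# Hypothetical proportions table (not real survey results)
proportions = pd.DataFrame(
    {"Python": [0.80, 0.90, 0.70], "R": [0.30, 0.25, 0.05], "SQL": [0.60, 0.45, 0.40]},
    index=["Data Analyst", "Data Scientist", "Software Engineer"],
)

threshold = 0.4  # assumed cut-off; adjust to taste

# lowest share of each tool across all job titles
floor = proportions.min(axis=0)

# tools that clear the threshold in every job title, most widespread first
widely_used = floor[floor >= threshold].sort_values(ascending=False)

print(widely_used)
```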

Otherwise, if you are starting from scratch and want to break into the industry, let the data guide you by focusing on the tools and skills most used in your chosen career. As you continue to learn, you can return to this notebook to decide which area of study to tackle next.
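
As a minimal sketch of that lookup, the helper below returns the most used tools for a single job title from a proportions table shaped like `tdf`. The function name `top_tools` and the values in the table are hypothetical.

```python
import pandas as pd

# Hypothetical proportions table (not real survey results)
proportions = pd.DataFrame(
    {"MySQL": [0.45, 0.30], "PostgreSQL": [0.25, 0.35], "MongoDB": [0.10, 0.20]},
    index=["Data Analyst", "Data Engineer"],
)

def top_tools(table, job_title, n=2):
    """Return the n tools with the highest share for one job title."""
    return table.loc[job_title].nlargest(n)

top = top_tools(proportions, "Data Engineer")
print(top)
```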

# References
Cornerstone on Demand 2021, *What today's self-directed learning trends tell us about the value of modern learning content*, published 12 May 2021, <https://www.cornerstoneondemand.com/resources/article/what-todays-self-directed-learning-trends-tell-us-about-the-value-of-modern-learning-content/>

Ken Phillips 2016, 'How Much is Scrap Learning Costing Your Organisation', *Association for Talent Development*, blog, published 10 August 2016, <https://www.td.org/insights/how-much-is-scrap-learning-costing-your-organization>