# Machine Learning Experience & ... (other responses) | Plotly

### Overview

* The analysis is based on respondents who reported using machine learning methods in Question 15 of the 2021 Kaggle Machine Learning & Data Science Survey (“How many years have you been using machine learning methods?”). 
* For each survey question, a 100% bar chart was built and accompanied by brief comments, mainly on points that are not immediately visible in the chart.
* For single-choice questions whose answer choices can be arranged in a certain order, for example, the question about age, stacked charts were built so that the values of these answers increase when moving from the lower-left corner to the upper-right one.
* At the end of the document, a general respondent profile is given, showing how it changes with increasing experience in machine learning (based on the most common responses). *Note: Multiple selection questions may have more than one answer if the difference is less than 2 percentage points, in which case order is important.*
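The 2-percentage-point rule from the note can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper `top_answers` and the shares are hypothetical, and the rule is read as "keep every answer whose share is less than 2 points below the leader".

```python
import pandas as pd

def top_answers(percentages, tolerance=2.0):
    """Return the answers whose share is less than `tolerance` percentage
    points below the leading answer, in descending order of share."""
    ranked = percentages.sort_values(ascending=False)
    return ranked[ranked.iloc[0] - ranked < tolerance].index.tolist()

# Hypothetical shares (in %) for one ML-experience group
shares = pd.Series({'Python': 85.0, 'SQL': 41.5, 'R': 40.2, 'Bash': 12.0})

leaders = top_answers(shares)                    # only the clear leader
runner_ups = top_answers(shares.drop('Python'))  # two answers within 2 points
print(leaders, runner_ups)
```

Because the result is sorted by share, the order of tied answers still carries information, which matches the remark that order is important.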

In [None]:
# Python libraries to load
import numpy as np 
import pandas as pd 

import plotly.io as pio
import plotly.express as px

pio.templates.default = 'simple_white'


# 1. Load and prepare the data

kaggle_survey_full = pd.read_csv('/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv',  
                            dtype={195: 'object', 201: 'object'}, header=[0, 1])

# Drop the second rows of columns containing questions
kaggle_survey_remidx = kaggle_survey_full.droplevel(1, axis=1)

# Select respondents with machine learning experience
kaggle_survey = kaggle_survey_remidx[kaggle_survey_remidx.Q15.notna() 
    & (kaggle_survey_remidx.Q15 != 'I do not use machine learning methods')]


# 2. Define helper constants

# Match columns with multi-column questions to answer choices
ks_qa = {col: answer.split('-', 2)[-1].strip(' ') 
                       for col, answer in kaggle_survey_full.columns 
                       if len(col) > 3}
ks_qa.update(
    Q36_A_Part_6='Automation of full ML pipelines (e.g. Google Cloud AutoML, H2O Driverless AI)')

KS_MULTICOL_ANSWERS = ks_qa.copy()

Q15_NAME = 'ML Experience'

ML_EXPERIENCE_ORDER = ['Under 1 year', '1-2 years', '2-3 years', '3-4 years',
                       '4-5 years', '5-10 years', '10-20 years', '20 or more years']

In [None]:
# 3. Define helper function for calculating percentages

def calculate_ml_experience_distribution(df_sample):
    """Calculate the distribution of ML experience in a specified sample."""
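    sample_num = df_sample.shape[0]
    # Name the index and the values explicitly so that the resulting column
    # names ('index', '%') do not depend on the pandas version
    ks_mle = (df_sample.Q15.value_counts(normalize=True).mul(100)
                       .loc[ML_EXPERIENCE_ORDER]
                       .rename_axis('index').rename('%').reset_index())
    return ks_mle.assign(count=(ks_mle['%'] / 100) * sample_num)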

def calculate_category_percentage(df, column, unstack_level=-1):
    """Return a wide-format DataFrame containing percentages calculated 
    for each column category for each level of ML experience.
    """
    df_wide = df.value_counts(subset=['Q15', column]).unstack(level=unstack_level)
    df_percent = df_wide.div(df_wide.sum(axis=1), axis=0) * 100 
    if unstack_level:
        return df_percent.loc[ML_EXPERIENCE_ORDER]
    else:
        return df_percent.loc[:, ML_EXPERIENCE_ORDER]
    
def select_question_columns(df, question):
    """Select all columns for multi-column question."""
    return [q for q in df.columns if q.startswith(question)]

def calculate_group_percentage(df, question, replace_column_names_with_answer_choices=True):
    """Return a DataFrame containing percentages calculated for each group, 
    and with a column containing column names or answer choices.
    """
    columns = select_question_columns(df, question)
    df_selected = df.loc[df[columns].notna().any(axis=1), ['Q15', *columns]]
    if df_selected.empty:
        df_selected['Q15'] = ML_EXPERIENCE_ORDER
    df_group = df_selected.groupby('Q15')
    df_percent = (df_group.count().div(df_group.size(), axis=0).mul(100)
                          .loc[ML_EXPERIENCE_ORDER].stack().reset_index()
                          .rename(columns={'level_1': 'answer', 0: '%'}))
    if replace_column_names_with_answer_choices:
        return df_percent.replace(KS_MULTICOL_ANSWERS)
    else:
        return df_percent     

def calculate_ab_group_percentage(df_a, df_b, a_question, b_question):
    """Return a DataFrame containing percentages calculated for each group 
    on main (A) and supplementary (B) questions.
    """
    df_percent_a = calculate_group_percentage(df_a, a_question, False)
    df_percent_b = calculate_group_percentage(df_b, b_question, False)
         
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    df_percent_ab = pd.concat([df_percent_a, df_percent_b])
    df_percent_ab['ab'] = ['Use on a regular basis' if ab[4] == 'A' 
                           else 'Hope to become more familiar with in the next 2 years'
                           for ab in df_percent_ab.answer]           
                     
    return df_percent_ab.replace(KS_MULTICOL_ANSWERS).replace({'No / None': 'None'})


# 4. Define plotting functions

def plotly_barplot(data, y, *, range_y=[0, 101], title='', color=None, 
                   animation_frame=None, facet_col=None, barmode='relative', 
                   hover_data=None, labels=None):
    """Make and return a plotly bar chart figure."""
    fig_bar = px.bar(data, x=data.columns[0], y=y, title=title, range_y=range_y,
                     animation_frame=animation_frame, facet_col=facet_col,
                     color=color, hover_data=hover_data, labels=labels, 
                     barmode=barmode)            
    fig_bar.update_layout(xaxis_title=None, yaxis_title=None, 
                          hoverlabel_font_size=14, height=500,
                          title_font_size=14)
    fig_bar.update_yaxes(showgrid=True, gridcolor='darkgray')
    return fig_bar

def plotly_ml_experience_distribution_barplot(df, title, facet_col=None, range_y=[0, 51], 
                                              calculate_distribution=True):
    """Make and return a bar chart of the distribution of ML experience."""
    df_distribution = df
    
    if calculate_distribution:
        df_distribution = calculate_ml_experience_distribution(df)

    fig_bar = plotly_barplot(df_distribution, '%', title=title, 
                             facet_col=facet_col, range_y=range_y,
                             hover_data={'%': ':.1f', 'index': False, 'count': ':.0f'})
    fig_bar.update_layout(hoverlabel_bgcolor='white')
    return fig_bar

def plotly_stacked_barplot(df, column, *, title, legend_name, category_order=None):
    """Make and show a 100% stacked bar chart."""
    df_w = calculate_category_percentage(df, column).reset_index()
    cols = df_w.columns[1:]
    
    if category_order:
        cols = category_order 
 
    fig_bar = plotly_barplot(df_w, cols, title=title, hover_data={'value': ':.1f'}, 
                             labels={'variable': legend_name, 'value': '%', 'Q15': Q15_NAME})
    fig_bar.show()
        
def plotly_group_barplot(df, question, *, title, legend_name, legend_bottom=False):
    """Make and show a 100% bar chart with barmode='group'."""
    df_g = calculate_group_percentage(df, question)
     
    fig_bar = plotly_barplot(df_g, '%', title=title, color='answer', 
                             barmode='group', hover_data={'%': ':.1f'}, 
                             labels={'answer': legend_name, 'Q15': Q15_NAME})
    
    if legend_bottom:
        fig_bar.update_layout(legend=dict(x=0, y=-2))
    fig_bar.show() 
    
def plotly_group_slider_barplot(df_a, df_b, a_question, b_question, *, title, legend_name):
    """Make and show a 100% bar chart with barmode='group' and slider."""
    df_g = calculate_ab_group_percentage(df_a, df_b, a_question, b_question)
        
    fig_bar = plotly_barplot(df_g, '%', title=title, animation_frame='ab',
                             color='answer', barmode='group', 
                             hover_data={'ab': False, '%': ':.1f'}, 
                             labels={'answer': legend_name, 'Q15': Q15_NAME})
    fig_bar['layout'].pop('updatemenus')
    fig_bar.update_layout(sliders=[dict(currentvalue={'prefix': None, 
                                                      'font': {'size': 16, 
                                                               'color': 'rgb(236, 112, 0)'}}, 
                                        borderwidth=0.2, bgcolor='rgb(18, 72, 110)', len=1, x=0)])
    fig_bar.show()

The respondents in our case are divided into 8 groups according to their machine learning experience (the number of years of using ML methods): under 1 year, 1-2 years, 2-3 years, 3-4 years, 4-5 years, 5-10 years, 10-20 years, 20 or more years. *Note: This order will be preserved in all bar charts regardless of other values.*

In [None]:
total_num = kaggle_survey_remidx.shape[0]
ml_experience_num = kaggle_survey.shape[0]
print('Number of respondents with machine learning experience: \
{0} ({1:.1%} of the total number of respondents).'.format(ml_experience_num, (ml_experience_num / total_num)))

plotly_ml_experience_distribution_barplot(kaggle_survey, 
                                          title='Machine Learning Experience Distribution, %')

Machine learning (ML) has been gaining popularity in recent years: almost 70% of the respondents have less than 2 years of relevant experience, and only about 2.5% have more than 10 years. Interestingly, adjacent groups among the first four differ in size by a factor of almost 2, although each covers a range of only one year.
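Figures such as "almost 70% have less than 2 years" are cumulative shares over the ordered experience groups. A minimal sketch on made-up answers (the shortened `ORDER` list and the counts are hypothetical, not survey results):

```python
import pandas as pd

# Shortened, hypothetical ordering of experience groups
ORDER = ['Under 1 year', '1-2 years', '2-3 years']

# Made-up answers: 5, 3 and 2 respondents per group
answers = pd.Series(['Under 1 year'] * 5 + ['1-2 years'] * 3 + ['2-3 years'] * 2)

# Share of each group in %, arranged in the fixed order
share = answers.value_counts(normalize=True).mul(100).reindex(ORDER)

# Running total: share of respondents at or below each experience level
cumulative = share.cumsum()
print(cumulative.round(1))
```

Using `reindex` with a fixed list keeps the groups in their logical order rather than in order of frequency.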

# 1. Demographic Profile

### 1.1. Age (Q1)

In [None]:
ks_q1 = 'Q1'

plotly_stacked_barplot(kaggle_survey, ks_q1, title='Machine Learning Experience & Age, %', 
                       legend_name='Age (years)')

Obviously, there is a direct relationship between the age of the respondents and the number of years of ML experience. It should be noted that a small percentage of respondents (about 2.8%) indicated their age as 18-24 years old while reporting 20 or more years of experience using ML methods.
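The 100% stacked charts rely on row-wise normalization: within each ML-experience group, the shares of all answer choices add up to 100. A minimal sketch on toy data, using `pd.crosstab` as a stand-in for the notebook's `value_counts`/`unstack` helper (the rows below are made up):

```python
import pandas as pd

# Made-up respondents: ML experience (Q15) and age bracket (Q1)
toy = pd.DataFrame({
    'Q15': ['Under 1 year', 'Under 1 year', 'Under 1 year', '1-2 years', '1-2 years'],
    'Q1':  ['18-21', '18-21', '22-24', '22-24', '25-29'],
})

# normalize='index' divides each row by its own total,
# so every experience group sums to 100%
percent = pd.crosstab(toy['Q15'], toy['Q1'], normalize='index').mul(100)
print(percent.round(1))
```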

### 1.2. Gender (Q2)

In [None]:
ks_q2 = 'Q2'
ks_q2_order = ['Man', 'Woman','Nonbinary', 'Prefer not to say', 'Prefer to self-describe']

plotly_stacked_barplot(kaggle_survey, ks_q2, title='Machine Learning Experience & Gender, %', 
                       legend_name='Gender', category_order=ks_q2_order)

Although the respondents with ML experience are mostly men (about 77-86%), in the group under 1 year the percentage of women is higher than in all the others, which suggests that interest in machine learning among women has also increased in recent years.

### 1.3. Country (Q3)

In [None]:
ks_q3 = 'Q3'
ks_q3_name = 'Country'

plotly_stacked_barplot(kaggle_survey, ks_q3, title='Machine Learning Experience & Country, %',
                       legend_name=ks_q3_name)

Among respondents with little ML experience (under 4 years), a larger percentage currently resides in India. As experience grows, India's share shrinks and the share of respondents residing in the USA grows (especially above 4 years of ML experience). Some other countries follow a similar pattern: Spain, Australia, and the UK are better represented among more experienced respondents, while Nigeria, Russia, and China are better represented among less experienced ones. For some countries, such as Japan, the share is practically the same in every group.

In [None]:
ml_experience_by_country = (calculate_category_percentage(kaggle_survey, ks_q3, unstack_level=0)
                                .idxmax(axis=1).reset_index()
                                .rename(columns={ks_q3: ks_q3_name, 0: Q15_NAME}))

fig_map = px.choropleth(ml_experience_by_country, locations=ks_q3_name,
                        color=Q15_NAME, 
                        locationmode = 'country names',
                        hover_name=ks_q3_name, 
                        hover_data={ks_q3_name: None, Q15_NAME: None},
                        title='Machine Learning Experience of most respondents by Country')
fig_map.show()

ml_experience_by_country[ml_experience_by_country[Q15_NAME] == '1-2 years']

In almost all the countries represented, the most common level of ML experience is under 1 year, with the exception of Belgium and Sweden, where it is 1-2 years.
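The map colors each country by its most common experience level, which is not necessarily a majority. A minimal sketch on made-up data (countries `A`, `B`, `C` and their answers are hypothetical; `pd.crosstab` is used here instead of the notebook's helper):

```python
import pandas as pd

# Made-up respondents from three hypothetical countries
toy = pd.DataFrame({
    'Q3':  ['A'] * 3 + ['B'] * 3 + ['C'] * 5,
    'Q15': ['Under 1 year', 'Under 1 year', '1-2 years',
            '1-2 years', '1-2 years', 'Under 1 year',
            'Under 1 year', 'Under 1 year', '1-2 years', '2-3 years', '3-4 years'],
})

# Share of each experience level within a country, in %
by_country = pd.crosstab(toy['Q3'], toy['Q15'], normalize='index').mul(100)

most_common = by_country.idxmax(axis=1)     # modal experience level per country
is_majority = by_country.max(axis=1) > 50   # does the mode exceed half?
print(pd.DataFrame({'most_common': most_common, 'is_majority': is_majority}))
```

In country `C` the modal level covers only 40% of respondents, so "most common" and "more than half" are worth keeping apart when reading the map.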

### 1.4. Education (Q4)

In [None]:
ks_q4 = 'Q4'
ks_q4_order = ['No formal education past high school', 
               'Some college/university study without earning a bachelor’s degree', 
               'Bachelor’s degree', 'Master’s degree', 'Doctoral degree',
               'Professional doctorate', 'I prefer not to answer']

plotly_stacked_barplot(kaggle_survey, ks_q4, title='Machine Learning Experience & Education, %', 
                       legend_name='Education', category_order=ks_q4_order)

A large percentage of users with under 1 year of ML experience have (or plan to attain within the next 2 years) a Bachelor's degree, those with 1-10 years of ML experience - a Master's degree, and those with 10 or more years of ML experience - a Doctoral degree.

### 1.5. Current Role Title (or most recent title if retired) (Q5)

In [None]:
ks_q5 = 'Q5'

plotly_stacked_barplot(kaggle_survey, ks_q5, 
                       title='Machine Learning Experience & Current Role Title, %', 
                       legend_name='Current Role Title')

Respondents with less than 2 years of ML experience are mainly Students, and those with more than 2 years are mainly Data Scientists. Less experienced respondents are also more often Software Engineers, Data Analysts, or Currently not employed, whereas the share of Research Scientists and Statisticians grows with experience.

# 2. Technologies Used

### 2.1. Programming Experience (Q6)

In [None]:
ks_q6 = 'Q6'
ks_q6_order = ['< 1 years', '1-3 years', '3-5 years', '5-10 years',  '10-20 years', '20+ years']

plotly_stacked_barplot(
    kaggle_survey, ks_q6, title='Machine Learning Experience & Programming Experience, %', 
    legend_name='Programming Experience', category_order=ks_q6_order)

Among the respondents with ML experience, there is none who has never written code. At the same time, there is a direct relationship between the number of years of programming (writing code) and the number of years of ML experience. 90% or more of respondents have been programming (writing code) for at least as many years as they have been using machine learning methods.

### 2.2. Programming Languages (Q7)

In [None]:
ks_q7 = 'Q7'

plotly_group_barplot(kaggle_survey, ks_q7, 
                     title='Machine Learning Experience & Programming Languages, %', 
                     legend_name='Programming Languages')

Python is by far the most popular programming language, regardless of the respondents' ML experience. The second most popular language is SQL in the groups with less than 20 years of experience and R in the group with 20 or more years. Interestingly, the popularity of R increases with the ML experience of the users (among respondents with less than 2 years of ML experience, only about 20% use it). More experienced respondents also use Bash more.
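For multiple-selection questions such as Q7, each answer choice has its own column, and the percentage is the share of respondents in a group who selected that choice, among those who selected at least one. The shares therefore do not add up to 100. A minimal sketch on made-up rows that mirrors the idea of `calculate_group_percentage`:

```python
import pandas as pd

# Made-up respondents; each Q7 column holds its own label or is empty
toy = pd.DataFrame({
    'Q15':       ['Under 1 year', 'Under 1 year', '1-2 years', '1-2 years', '1-2 years'],
    'Q7_Part_1': ['Python', 'Python', 'Python', None, None],
    'Q7_Part_2': [None, 'R', 'R', 'R', None],
})
parts = ['Q7_Part_1', 'Q7_Part_2']

# Keep only respondents who selected at least one answer choice
answered = toy[toy[parts].notna().any(axis=1)]

# Non-empty cells per column, divided by the group size
grouped = answered.groupby('Q15')
percent = grouped[parts].count().div(grouped.size(), axis=0).mul(100)
print(percent)
```

The last toy respondent selected nothing and is excluded from the denominator of the `1-2 years` group.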

### 2.3. Programming Language to learn first (Q8)

In [None]:
ks_q8 = 'Q8'

plotly_stacked_barplot(kaggle_survey, ks_q8, 
                       title='Machine Learning Experience & Programming Language to learn first, %', 
                       legend_name='Programming Language')

Most of the respondents recommend that an aspiring data scientist learn Python first. And this is not surprising, since most of them use it (see the chart for Q7). But, interestingly, the second most popular recommended language is R, not SQL, regardless of ML experience.

### 2.4. IDEs (Q9)

In [None]:
ks_q9 = 'Q9'

plotly_group_barplot(kaggle_survey, ks_q9, title='Machine Learning Experience & IDEs, %', 
                     legend_name='IDEs')

*Note: There is a discrepancy between the dataset and the "2021 Kaggle DS & ML Survey: Questions and answer choices": in the latter, "Jupyter (JupyterLab, Jupyter Notebooks, etc)" is written as "JupyterLab".* 

By far the most popular integrated development environment (IDE) is Jupyter Notebook, although the separate "Jupyter" option (see the note above) is not nearly as popular. The use of some IDEs (e.g., Jupyter Notebook, VSCode, Jupyter, RStudio, Vim/Emacs) depends on the respondent's experience in ML or on the popularity of the programming language they use. For example, the popularity of RStudio follows that of the R language itself (see the chart for Q7). 

### 2.5. Hosted Notebook Products (Q10)

In [None]:
ks_q10 = 'Q10'

plotly_group_barplot(kaggle_survey, ks_q10, 
                     title='Machine Learning Experience & Hosted Notebook Products, %', 
                     legend_name='Hosted Notebook Products')

The majority of respondents with ML experience use Colab Notebooks, Kaggle Notebooks, or no hosted notebook products at all. Moreover, the less experience respondents have, the more popular these two products are and the lower the percentage of those who use no hosted notebook products. The remaining products are each used by less than 11% of the respondents, regardless of ML experience.

### 2.6. Type of Computing Platform (Q11)

In [None]:
ks_q11 = 'Q11'
ks_q11_order =  ['A laptop', 'A personal computer / desktop', 
                 'A cloud computing platform (AWS, Azure, GCP, hosted notebooks, etc)',
                 'A deep learning workstation (NVIDIA GTX, LambdaLabs, etc)', 
                 'Other', 'None']

plotly_stacked_barplot(kaggle_survey, ks_q11, 
                       title='Machine Learning Experience & Type of Computing Platform, %', 
                       legend_name='Type of Computing Platform', category_order=ks_q11_order)

A laptop is the most popular type of computing platform for data science projects. Interestingly, unlike other users, those with 5-10 years of ML experience prefer a cloud computing platform to a personal computer/desktop. This group also has a higher share of deep learning workstation users.

### 2.7. Types of Specialized Hardware (Q12)

In [None]:
ks_q12 = 'Q12'

plotly_group_barplot(kaggle_survey, ks_q12, 
                     title='Machine Learning Experience & Types of Specialized Hardware, %', 
                     legend_name='Types of Specialized Hardware')

Specialized hardware is used mainly by respondents with more than 2 years of ML experience, with preference given to NVIDIA GPUs. Google Cloud TPUs are used by less than 20% of each group and are most popular among users with 1-3 years of ML experience.

### 2.8. TPU Experience (Q13)

In [None]:
ks_q13 = 'Q13'
ks_q13_order = ['Never', 'Once', '2-5 times', '6-25 times', 'More than 25 times']

plotly_stacked_barplot(kaggle_survey, ks_q13, title='Machine Learning Experience & TPU Experience, %', 
                       legend_name='TPU Experience', category_order=ks_q13_order)

More than half of respondents with ML experience have never used a tensor processing unit (TPU). Those who have used one mostly did so 2-5 times, except for users with under 1 year of ML experience, who mostly used a TPU only once. In addition, the more experienced the respondents, the larger the share of them who have used a TPU more than 25 times.

### 2.9. Data Visualization Libraries or Tools (Q14)

In [None]:
ks_q14 = 'Q14'

plotly_group_barplot(kaggle_survey, ks_q14, 
                     title='Machine Learning Experience & Data Visualization Libraries or Tools, %', 
                     legend_name='Data Visualization Libraries or Tools')

The most popular data visualization libraries/tools among respondents with ML experience are Matplotlib, Seaborn, Plotly/Plotly Express, and Ggplot/ggplot2. The rest, except Shiny, are each used on a regular basis by less than 13%. The popularity of Shiny varies with ML experience and is highest among respondents with 5-10 years of experience.

### 2.10. Machine Learning Frameworks (Q16)

In [None]:
ks_q16 = 'Q16'

plotly_group_barplot(kaggle_survey, ks_q16, 
                     title='Machine Learning Experience & Machine Learning Frameworks, %', 
                     legend_name='ML Frameworks')

The popularity of Xgboost and LightGBM increases with the number of years of using ML methods, peaks in the group with 5-10 years of ML experience, and then decreases again. Scikit-learn, TensorFlow, Keras, and PyTorch are popular with all groups, in that order. At the same time, about 16% of users with less than a year of ML experience do not use any ML frameworks on a regular basis. CatBoost, Caret, and Huggingface stand out among the less popular frameworks.

### 2.11. Machine Learning Algorithms (Q17)

In [None]:
ks_q17 = 'Q17'

plotly_group_barplot(kaggle_survey, ks_q17, 
                     title='Machine Learning Experience & Machine Learning Algorithms, %', 
                     legend_name='ML Algorithms')

Users regardless of experience in ML prefer simple, easy-to-interpret algorithms. Among complex algorithms, Gradient Boosting Machines and Convolutional Neural Networks (CNNs) are popular (respondents with 1-3 years of ML experience use CNNs more often). Bayesian Approaches and Evolutionary Approaches are used more than others by respondents with experience of 10-20 years and 20 or more years, respectively, and the increase in their popularity between groups is possibly associated with an increase in the percentage of Research Scientists and Statisticians in them (see the chart for Q5).

Regardless of ML experience, computer vision methods are generally more popular (due to the popularity of CNNs) than natural language processing (NLP) methods, among which Recurrent Neural Networks are in the lead.
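One way to compare the two families is to count respondents who selected at least one algorithm from each. The sketch below runs on made-up rows and assumes that each `Q17_Part_*` column contains either its own label or a missing value, so `notna()` can stand in for an equality test against the label:

```python
import pandas as pd

# Made-up respondents; column names follow the survey's Q17 parts
toy = pd.DataFrame({
    'Q17_Part_7':  ['Convolutional Neural Networks', None, None,
                    'Convolutional Neural Networks'],
    'Q17_Part_8':  [None, 'Generative Adversarial Networks', None, None],
    'Q17_Part_9':  [None, None, 'Recurrent Neural Networks',
                    'Recurrent Neural Networks'],
    'Q17_Part_10': [None, None, None, None],
})

# A respondent counts toward a family if any of its columns is filled in
uses_cv = toy[['Q17_Part_7', 'Q17_Part_8']].notna().any(axis=1)
uses_nlp = toy[['Q17_Part_9', 'Q17_Part_10']].notna().any(axis=1)

# The mean of a boolean mask is the share of True values
cv_share, nlp_share = uses_cv.mean() * 100, uses_nlp.mean() * 100
print(f'CV: {cv_share:.1f}%, NLP: {nlp_share:.1f}%')
```

A respondent can belong to both subsets, as the last toy row does, so the two shares may overlap.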

### 2.12. Computer Vision Methods (Q18)

In [None]:
kaggle_survey_cv = kaggle_survey[(kaggle_survey.Q17_Part_7 == 'Convolutional Neural Networks') 
                                 | (kaggle_survey.Q17_Part_8 == 'Generative Adversarial Networks')]
ks_cv_num = kaggle_survey_cv.shape[0]

print('Number of respondents using computer vision methods: {0} ({1:.1%} of respondents with \
ML experience).'.format(ks_cv_num, (ks_cv_num / ml_experience_num)))

In [None]:
ks_q18 = 'Q18'

plotly_group_barplot(kaggle_survey_cv, ks_q18, 
                     title='Machine Learning Experience & Computer Vision Methods, %', 
                     legend_name='CV Methods', legend_bottom=True)

Image classification and other general-purpose networks are in first place in popularity among all groups. Of the remaining methods, General purpose image/video tools and Image segmentation methods are used most by respondents with 5-10 years of ML experience, and Object detection methods by those with 20 or more years.

### 2.13. Natural Language Processing Methods (Q19)

In [None]:
kaggle_survey_nlp = kaggle_survey[
    (kaggle_survey.Q17_Part_9 == 'Recurrent Neural Networks') 
    | (kaggle_survey.Q17_Part_10 == 'Transformer Networks (BERT, gpt-3, etc)')]
ks_nlp_num = kaggle_survey_nlp.shape[0]

print('Number of respondents using natural language processing methods: {0} ({1:.1%} \
of respondents with ML experience).'.format(ks_nlp_num, (ks_nlp_num / ml_experience_num)))

In [None]:
ks_q19 = 'Q19'

plotly_group_barplot(kaggle_survey_nlp, ks_q19, 
                     title='Machine Learning Experience & Natural Language Processing Methods, %', 
                     legend_name='NLP Methods')

In all groups, the indicated NLP methods keep the same order of popularity: word embeddings/vectors, transformer language models, encoder-decoder models, contextualized embeddings. At the same time, users with less than a year of ML experience have the largest share of those who do not use any of these methods on a regular basis, and this share decreases with increasing experience in ML.

# 3. Employment

In [None]:
# Select only employed respondents for Questions 20-26 
employed_condition = ~kaggle_survey.Q5.isin(['Student', 'Currently not employed'])
kaggle_survey_employed = kaggle_survey[employed_condition]
ks_employed_num = kaggle_survey_employed.shape[0]

print('Number of employed respondents with ML experience: {0} ({1:.1%} of the total number \
of respondents with ML experience).'.format(ks_employed_num, (ks_employed_num / ml_experience_num)))

plotly_ml_experience_distribution_barplot(
    kaggle_survey_employed, 
    title='Machine Learning Experience Distribution (among employed respondents), %')

### 3.1. Industry (Q20)

In [None]:
ks_q20 = 'Q20'

plotly_stacked_barplot(kaggle_survey_employed, ks_q20, 
                       title='Machine Learning Experience & Industry, %', 
                       legend_name='Industry')

For respondents who have used ML methods for less than 10 years, the most common industry of the current employer/contract is Computers/Technology; for those with more than 10 years, it is Academics/Education. The percentage of those employed in the other specified industries is practically the same in all groups, although the Accounting/Finance industry stands out among them (for ML experience of less than 20 years).

### 3.2. The Employer's Company Size (Q21)

In [None]:
ks_q21 = 'Q21'
ks_q21_order = ['0-49 employees', '50-249 employees', '250-999 employees', '1000-9,999 employees', 
                '10,000 or more employees']
        
plotly_stacked_barplot(kaggle_survey_employed, ks_q21, 
                       title="Machine Learning Experience & The Employer's Company Size, %", 
                       legend_name="The Employer's Company Size", category_order=ks_q21_order)

Respondents with under 4 years of ML experience mostly work in small companies or startups or are self-employed (0-49 employees), and those with 4-20 years of experience mostly work in large companies (10,000 or more employees). Recall that the same split in ML experience is observed between respondents residing in India and the USA (see the chart for Q3). Respondents with more than 20 years of ML experience work in smaller companies (1,000-9,999 employees), possibly because their dominant industry is Academics/Education.

### 3.3. Data Science Team Size (Q22)

In [None]:
ks_q22 = 'Q22'
ks_q22_order = ['0', '1-2', '3-4', '5-9', '10-14', '15-19', '20+']

plotly_stacked_barplot(kaggle_survey_employed, ks_q22, 
                       title='Machine Learning Experience & Data Science Team Size, %', 
                       legend_name='DS Team Size', category_order=ks_q22_order)

In all groups with more than 2 years of ML experience, preference is given to teams with 20+ individuals responsible for data science workloads. Slightly more than half of the users with less than 1 year of ML experience work alone or in a data science team of 1-2 individuals. It is interesting that few respondents work in data science teams of 15-19 individuals.

### 3.4. ML Methods in employer's business (Q23)

In [None]:
ks_q23 = 'Q23'
ks_q23_order = [
   'I do not know', 
   'No (we do not use ML methods)',
   'We are exploring ML methods (and may one day put a model into production)',
   'We recently started using ML methods (i.e., models in production for less than 2 years)',
   'We have well established ML methods (i.e., models in production for more than 2 years)',
   'We use ML methods for generating insights (but do not put working models into production)'
]

plotly_stacked_barplot(
   kaggle_survey_employed, ks_q23, 
   title="Machine Learning Experience & ML Methods in employer's business, %", 
   legend_name="ML Methods in employer's business", category_order=ks_q23_order)

It is clearly seen that the more ML experience the respondents have (especially 2-3 years or more), the more often they work in companies with well established ML methods, whereas those with less relevant experience tend to work in companies that are exploring ML methods or recently started using them. Interestingly, every group includes users who do not know whether their employer uses ML methods, which suggests that those users apply these methods in their own projects rather than at their main job.

### 3.5. Important Activities at Work (Q24)

In [None]:
ks_q24 = 'Q24'

plotly_group_barplot(kaggle_survey_employed, ks_q24, 
                     title='Machine Learning Experience & Important Activities at Work, %', 
                     legend_name='Important Activities at Work', legend_bottom=True)

Respondents who have just started using machine learning methods, or have used them for three years or less, are mostly tasked with analyzing and understanding data to influence product or business decisions at work. With increasing ML experience, they take on additional responsibilities, mainly building prototypes to explore applying ML to new areas and/or experimenting and iterating to improve existing ML models, although data analysis remains one of the most important parts of the work. Very few users perform any other type of activity.

### 3.6. Yearly Compensation (Q25)

In [None]:
ks_q25 = 'Q25'
ks_q25_order = ['$0-999',  '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', 
                '5,000-7,499', '7,500-9,999', '10,000-14,999', '15,000-19,999', '20,000-24,999', 
                '25,000-29,999', '30,000-39,999', '40,000-49,999', '50,000-59,999', '60,000-69,999', 
                '70,000-79,999', '80,000-89,999', '90,000-99,999', '100,000-124,999', 
                '125,000-149,999', '150,000-199,999', '200,000-249,999', '250,000-299,999', 
                '300,000-499,999', '$500,000-999,999', '>$1,000,000']

plotly_stacked_barplot(kaggle_survey_employed, ks_q25, 
                       title='Machine Learning Experience & Yearly Compensation, %', 
                       legend_name='Yearly Compensation, $USD', category_order=ks_q25_order)

There is a noticeable shift in the most common level of current yearly compensation with increasing ML experience, from 0-999 USD to 100,000-124,999 USD and even more. The threshold value, as in the case of the countries of residence of the respondents (see the chart for Q3), is 4 years of experience in ML. Note that practically all levels of yearly compensation occur in every group.

# 4. ML & Cloud Computing Services

### 4.1. Money spent on ML & Cloud Computing Services (Q26)

In [None]:
ks_q26 = 'Q26'
ks_q26_order = ['0', '1-99', '100-999',  '1000-9,999', '10,000-99,999', '100,000 or more']
kaggle_survey_re = kaggle_survey_employed.replace(
    {'$0 ($USD)': '0', '$1-$99': '1-99', '$100-$999': '100-999',  
    '$1000-$9,999': '1000-9,999', '$10,000-$99,999': '10,000-99,999', 
    '$100,000 or more ($USD)': '100,000 or more'})

plotly_stacked_barplot(
    kaggle_survey_re, ks_q26, 
    title='Machine Learning Experience & Money spent on ML & Cloud Computing Services, %', 
    legend_name='$USD', category_order=ks_q26_order)

As ML experience increases, so does the percentage of those who spent more money at home (or at work) on machine learning and/or cloud computing services in the past 5 years, while the percentage of those who spent nothing or 1-999 USD decreases. Recall that groups with more experience include more employees of large companies (see the chart for Q21). In any case, regardless of ML experience, about 20% of respondents did not spend money on those services.

In [None]:
# Select two groups of respondents for groups A and B questions
ab_condition = employed_condition & (kaggle_survey.Q26 != '$0 ($USD)')
kaggle_survey_a = kaggle_survey[ab_condition]
kaggle_survey_b = kaggle_survey[~ab_condition]

ks_ab_dist = []
print('Groups of respondents, depending on whether they are employed and \
have spent money in the cloud:\n')

for ks, group, match_condition in zip((kaggle_survey_a, kaggle_survey_b), 
                                      ('A', 'B'), ('Yes', 'No')):
    ks_dist = calculate_ml_experience_distribution(ks)
    ks_dist['Group'] = group
    ks_ab_dist.append(ks_dist)
    ks_num = ks.shape[0]
    ks_ratio = ks_num / ml_experience_num
    print('{0} -> Group {1}: {2} ({3:.1%} of respondents with ML experience)'.format(
        match_condition, group, ks_num, ks_ratio))

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
fig_bar_group = plotly_ml_experience_distribution_barplot(
    pd.concat([ks_ab_dist[0], ks_ab_dist[1]]), facet_col='Group', 
    range_y=[0, 61], calculate_distribution=False,
    title='Machine Learning Experience Distribution in Group A and Group B, %')
fig_bar_group.update_layout(xaxis2_title=None)

*Note: Group A will be listed in the charts below as "Use on a regular basis" and Group B as "Hope to become more familiar with in the next 2 years".*
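
One detail worth keeping in mind when reading the split into groups A and B: in pandas, `!=` evaluates to `True` for missing values, so a row with an empty Q26 passes the `!= '$0 ($USD)'` test. A toy sketch of that behavior; the data and the `employed` column are hypothetical stand-ins for the notebook's `employed_condition`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'employed': [True, True, True, False],
    'Q26': ['$0 ($USD)', '$1-$99', np.nan, '$100-$999'],
})

# Same shape of condition as the notebook's ab_condition
ab_condition = toy['employed'] & (toy['Q26'] != '$0 ($USD)')
group_a = toy[ab_condition]
group_b = toy[~ab_condition]

# The mask and its complement partition the frame
print(len(group_a), len(group_b))
# The employed row with a missing Q26 lands in group A
print(group_a.index.tolist())
```

Whether employed respondents with an unanswered Q26 should count towards group A is a judgment call; the sketch only makes the default behavior visible.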

### 4.2. Cloud Computing Platforms (Q27-A & Q27-B)

In [None]:
ks_q27a, ks_q27b = 'Q27_A', 'Q27_B'

plotly_group_slider_barplot(kaggle_survey_a, kaggle_survey_b, ks_q27a, ks_q27b,
                            title='Machine Learning Experience & Cloud Computing Platforms, %',
                            legend_name='Cloud Computing Platforms')

Fewer than 30% of group A respondents do not use cloud computing platforms. Those who do mostly prefer three of them: Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. Among users with 20 or more years of ML experience, GCP is more popular than AWS.

In group B, the more ML experience respondents have, the larger the share of those who do not want to become familiar with cloud computing platforms in the near future. Still, many choose AWS, GCP, and Microsoft Azure here as well. Respondents with less ML experience are also more interested in IBM Cloud/Red Hat and Oracle Cloud.

### 4.3. Cloud Platforms (most enjoyable to use) (Q28)

In [None]:
ks_q28 = 'Q28'
kaggle_survey_more_than_one_choice_q27a = (
    kaggle_survey_a[kaggle_survey_a.loc[:, select_question_columns(kaggle_survey_a, ks_q27a)]
                                   .notna().sum(axis=1) > 1])

plotly_stacked_barplot(
    kaggle_survey_more_than_one_choice_q27a, ks_q28, 
    title='Machine Learning Experience & Cloud Platforms (most enjoyable to use), %', 
    legend_name='Cloud Platforms')

Since the most popular cloud platforms are AWS, GCP, and Microsoft Azure (see the chart for Q27-A), it is expected that users combine them with one another and with other cloud platforms. Among such users, AWS is no longer the clear leader: it competes with GCP, or respondents find all platforms equally enjoyable to use. For example, respondents with 1-2, 3-5, and 20 or more years of ML experience prefer GCP slightly more, while all other groups prefer AWS.
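
The chart for Q28 only covers respondents who selected more than one platform in Q27-A. In the survey file, a multiple-selection question is spread over one column per answer choice, with NaN meaning "not selected", so the filter reduces to counting non-missing cells per row. A minimal sketch on made-up data; the list comprehension is an assumed stand-in for the notebook's `select_question_columns` helper:

```python
import numpy as np
import pandas as pd

# Toy multi-column question (hypothetical): one column per answer choice
toy = pd.DataFrame({
    'Q27_A_Part_1': ['AWS', 'AWS', np.nan],
    'Q27_A_Part_2': ['GCP', np.nan, np.nan],
    'Q27_A_Part_3': [np.nan, np.nan, 'Azure'],
})

# Columns belonging to the question (stand-in for select_question_columns)
part_cols = [col for col in toy.columns if col.startswith('Q27_A')]

# Number of selected choices per respondent
n_selected = toy[part_cols].notna().sum(axis=1)

# Keep only respondents who use more than one platform
multi_platform = toy[n_selected > 1]
print(multi_platform.index.tolist())
```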

### 4.4. Cloud Computing Products (Q29-A & Q29-B)

In [None]:
ks_q29a, ks_q29b = 'Q29_A', 'Q29_B'
kaggle_survey_relevant_answer_q27a = kaggle_survey_a[kaggle_survey_a.Q27_A_Part_1.notna() 
                                                     | kaggle_survey_a.Q27_A_Part_2.notna() 
                                                     | kaggle_survey_a.Q27_A_Part_3.notna()]

plotly_group_slider_barplot(kaggle_survey_relevant_answer_q27a, kaggle_survey_b, ks_q29a, ks_q29b,  
                            title='Machine Learning Experience & Cloud Computing Products, %',
                            legend_name='Cloud Computing Products')

Although group A respondents mainly use Amazon EC2, group B respondents are more interested in Google Cloud Compute Engine and even Microsoft Azure Virtual Machines (in second place for users with under a year of ML experience).

### 4.5. Data Storage Products (Q30-A & Q30-B)

In [None]:
ks_q30a, ks_q30b = 'Q30_A', 'Q30_B'

plotly_group_slider_barplot(kaggle_survey_relevant_answer_q27a, kaggle_survey_b, ks_q30a, ks_q30b, 
                            title='Machine Learning Experience & Data Storage Products, %',
                            legend_name='Data Storage Products')

For users with under a year of ML experience (group A chart), Google Cloud Storage (GCS) is the most popular product and Amazon S3 only the second, whereas Amazon S3 comes first in all other groups. Amazon EFS and Microsoft Azure Disk Storage are used more by respondents with 10-20 and 20 or more years of ML experience, respectively.

Interestingly, there are no responses from users in group B.

### 4.6. Managed ML Products (Q31-A & Q31-B)

In [None]:
ks_q31a, ks_q31b = 'Q31_A', 'Q31_B'

plotly_group_slider_barplot(kaggle_survey_a, kaggle_survey_b, ks_q31a, ks_q31b, 
                            title='Machine Learning Experience & Managed ML Products, %',
                            legend_name='Managed ML Products')

Managed ML products are not popular with respondents, regardless of machine learning experience (group A chart). A small percentage still use them, most notably Amazon SageMaker, Google Cloud Vertex AI, Azure Machine Learning Studio, and Databricks.

Also, a larger percentage of respondents with more than 3 years of ML experience do not want to become familiar with managed ML products in the near future (group B chart), whereas the less ML experience respondents have, the more interested they are. The products mentioned above are popular here too, but Google Cloud Vertex AI and even Azure Machine Learning Studio attract more interest than Amazon SageMaker.

### 4.7. Big Data Products (Q32-A & Q32-B)

In [None]:
ks_q32a, ks_q32b = 'Q32_A', 'Q32_B'

plotly_group_slider_barplot(kaggle_survey_a, kaggle_survey_b, ks_q32a, ks_q32b, 
                            title='Machine Learning Experience & Big Data Products, %',
                            legend_name='Big Data Products')

Users (group A chart) mainly use MySQL or, at a slightly smaller percentage (except for respondents with 5-10 years of ML experience), PostgreSQL. Among the remaining products, MongoDB, Microsoft SQL Server, and SQLite stand out, and their popularity varies slightly with ML experience: respondents with under 3 years of experience prefer Microsoft SQL Server, those with 3-5 and 20 or more years prefer MongoDB, and those with 5-20 years prefer SQLite (together with Microsoft SQL Server for users with 10-20 years of ML experience). Also, about 14-20% do not use big data products (relational databases, data warehouses, data lakes, or similar) at all.

In group B, the situation is different: respondents with less than 5 years of ML experience mostly want to become familiar with big data products, but those with more experience do not. Interest is also directed not only at MySQL but at MongoDB as well.

### 4.8. Big Data Products (use most often) (Q33)

In [None]:
ks_q33 = 'Q33'
kaggle_survey_more_than_one_choice_q32a = (
    kaggle_survey_a[kaggle_survey_a.loc[:, select_question_columns(kaggle_survey_a, ks_q32a)]
                                   .notna().sum(axis=1) > 1])

plotly_stacked_barplot(kaggle_survey_more_than_one_choice_q32a, ks_q33, 
                       title='Machine Learning Experience & Big Data Products (use most often), %', 
                       legend_name='Big Data Products')

Since the most popular big data product is MySQL (or PostgreSQL for users with 5-10 years of ML experience; see the chart for Q32-A), it is also the one used most often (or on par with PostgreSQL, in the second case). Among respondents with 20 or more years of ML experience, MySQL and PostgreSQL are likewise chosen by the same percentage. Overall, however, no preferences that depend on ML experience were observed.

### 4.9. Business Intelligence Tools (Q34-A & Q34-B)

In [None]:
ks_q34a, ks_q34b = 'Q34_A', 'Q34_B'

plotly_group_slider_barplot(kaggle_survey_a, kaggle_survey_b, ks_q34a, ks_q34b, 
                            title='Machine Learning Experience & Business Intelligence Tools, %',
                            legend_name='BI Tools')

Regardless of machine learning experience (group A chart), Tableau and Microsoft Power BI are the leaders, while overall business intelligence tools are not that popular.

At the same time, there is an inverse relationship between respondents' ML experience and their desire to become familiar with BI tools, with a threshold at 3 years (group B chart). Google Data Studio also joins the two leaders mentioned above.

### 4.10. Business Intelligence Tools (use most often) (Q35)

In [None]:
ks_q35 = 'Q35'
kaggle_survey_more_than_one_choice_q34a = (
    kaggle_survey_a[kaggle_survey_a.loc[:, select_question_columns(kaggle_survey_a, ks_q34a)]
                                   .notna().sum(axis=1) > 1])

plotly_stacked_barplot(
    kaggle_survey_more_than_one_choice_q34a, ks_q35, 
    title='Machine Learning Experience & Business Intelligence Tools (use most often), %', 
    legend_name='BI Tools')

Although Tableau is slightly ahead of Microsoft Power BI in popularity among respondents with 1-2 and 3-20 years of ML experience (see the chart for Q34-A), those same respondents choose Microsoft Power BI more often when they use multiple BI tools.

### 4.11. Categories of (partial) Automated ML Tools (Q36-A & Q36-B)

In [None]:
ks_q36a, ks_q36b = 'Q36_A', 'Q36_B'

plotly_group_slider_barplot(
    kaggle_survey_a, kaggle_survey_b, ks_q36a, ks_q36b, 
    title='Machine Learning Experience & Categories of (partial) Automated ML Tools, %',
    legend_name='Categories of (partial) AutoML Tools')

Among all respondents (group A chart), more than 48% do not use any automated ML tools. This share is higher for those with less ML experience and, conversely, usage grows with experience. The preferred categories also depend on ML experience: users with less experience choose automated data augmentation and model selection, while more experienced users rely more on automated hyperparameter tuning.

In group B, automated model selection and hyperparameter tuning are the most popular, along with automation of full ML pipelines, although the level of interest in each varies with the respondents' ML experience. Overall interest is also higher.

### 4.12. (Partial) Automated ML Tools (Q37-A & Q37-B)

In [None]:
ks_q37a, ks_q37b = 'Q37_A', 'Q37_B'
kaggle_survey_answered_affirmatively_q36a = kaggle_survey_a[kaggle_survey_a.Q36_A_Part_7.isna()]

plotly_group_slider_barplot(
    kaggle_survey_answered_affirmatively_q36a, kaggle_survey_b, ks_q37a, ks_q37b, 
    title='Machine Learning Experience & (Partial) Automated ML Tools, %',
    legend_name='(Partial) AutoML Tools')

Google Cloud AutoML is the most popular tool, especially among users with under 3 years of ML experience, although respondents with 5-10 years of experience prefer H2O Driverless AI (group A chart). However, over 32% of respondents at every level of experience do not use automated ML (or partial AutoML) products at all.

In group B, Google Cloud AutoML is still in first place, but there is also greater interest in Amazon Sagemaker Autopilot and Azure Automated Machine Learning, especially among respondents with under 3 years of ML experience. Those with more experience either do not want to become familiar with any tool in the near future or choose Amazon Sagemaker Autopilot.
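
The group A chart for Q37 is built from respondents who did not select the "No / None" option in Q36-A. A minimal sketch of that filter on made-up data, assuming (as the variable name `kaggle_survey_answered_affirmatively_q36a` suggests) that `Q36_A_Part_7` holds the "No / None" choice. Note that `isna()` on that column also keeps rows where the whole question was left blank; the stricter variant shown here is an illustrative alternative, not what the notebook does:

```python
import numpy as np
import pandas as pd

# Toy Q36-A block (hypothetical): Part_7 is assumed to be "No / None"
toy = pd.DataFrame({
    'Q36_A_Part_1': ['Automated data augmentation', np.nan, np.nan],
    'Q36_A_Part_7': [np.nan, 'No / None', np.nan],
})

# The notebook's filter: everyone who did not tick "No / None"
not_none = toy[toy['Q36_A_Part_7'].isna()]

# Stricter variant: additionally require at least one selected category
category_cols = ['Q36_A_Part_1']
uses_automl = not_none[not_none[category_cols].notna().any(axis=1)]

print(not_none.index.tolist(), uses_automl.index.tolist())
```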

### 4.13. Tools to help Manage ML Experiments (Q38-A & Q38-B)

In [None]:
ks_q38a, ks_q38b = 'Q38_A', 'Q38_B'

plotly_group_slider_barplot(
    kaggle_survey_a, kaggle_survey_b, ks_q38a, ks_q38b, 
    title='Machine Learning Experience & Tools to help Manage ML Experiments, %',
    legend_name='Tools to help Manage ML Experiments')

More than half of all users do not use tools to help manage ML experiments, especially those with less than 2 years of ML experience (group A chart). About 28% of those with more experience use TensorBoard. MLflow also stands out from other tools, especially for users with 2-20 years of experience in ML.

Respondents with less than 3 years of ML experience are more interested in tools to help manage ML experiments (group B chart), although the leaders remain the same. Among the rest, Weights & Biases and Neptune.ai stand out.

# 5. Publications, Courses, Data Analysis Tools, Media Sources

### 5.1. Places to publicly share/deploy Data Analysis or ML apps (Q39)

In [None]:
ks_q39 = 'Q39'

plotly_group_barplot(
    kaggle_survey, ks_q39, 
    title='Machine Learning Experience & Places to publicly share/deploy Data Analysis or ML apps, %', 
    legend_name='Places')

Respondents with 1-2 years of ML experience share their work more than other groups, and they mostly choose GitHub for this. Kaggle is popular with less experienced users, and Colab with those who have 1-3 years of experience. It should also be noted that respondents with 20 or more years of ML experience post work on a personal blog more often than others.

### 5.2. Data Science Course Platforms (Q40)

In [None]:
ks_q40 = 'Q40'

plotly_group_barplot(
    kaggle_survey, ks_q40, 
    title='Machine Learning Experience & Data Science Course Platforms, %', 
    legend_name='Data Science Course Platforms')

The popularity of some platforms clearly depends on the respondents' ML experience. This is especially true of Kaggle Learn Courses: the less ML experience users have, the more popular the courses are, and in the group with under a year of experience they are even ahead of Coursera. The maximum percentage of Udemy, DataCamp, and edX users is observed in the groups with 1-2, 3-4, and 5-10 years of experience, respectively. The popularity of University Courses also varies with ML experience, with the groups with 3-4 and 10-20 years of experience standing out.

### 5.3. The Primary Data Analysis Tool (Q41)

In [None]:
ks_q41 = 'Q41'

plotly_stacked_barplot(kaggle_survey, ks_q41, 
                       title='Machine Learning Experience & The Primary Data Analysis Tool, %', 
                       legend_name='Data Analysis Tools')

For data analysis, respondents with more than 2 years of ML experience in most cases use Local development environments as the primary tool, and their share grows with experience (up to 20 years). Basic statistical software is used less and less, although it is the most popular tool in the group with less than a year of ML experience. Respondents with 5 or more years of experience also use Advanced statistical software more than others.

### 5.4. Media Sources on Data Science (Q42)

In [None]:
ks_q42 = 'Q42'

plotly_group_barplot(
    kaggle_survey, ks_q42, 
    title='Machine Learning Experience & Media Sources on Data Science, %', 
    legend_name='Media Sources')

The popularity of some media sources on data science depends on machine learning experience. For example, Kaggle and YouTube are more popular with less experienced groups, while Blogs and Journal Publications are more popular with more experienced ones; the change is most pronounced for Journal Publications. Twitter and Reddit are used more by respondents with 5-10 years of ML experience, and Email newsletters by those with more than 20 years.

# Summary - General Profile

**of users of ML methods and how it changes with the number of years of relevant experience (based on the most common responses):**

A man who is a Python programmer (and recommends that aspiring data scientists learn it first).
He creates machine learning projects on a laptop in Jupyter Notebook and, while his ML experience is less than 20 years, in a hosted notebook product, never using a TPU.
He visualizes data with Matplotlib and builds ML models with Scikit-learn.
While his ML experience is less than 20 years, he builds models using Linear or Logistic Regression algorithms.
He then publishes all his work on GitHub.  
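
The profile is assembled from the most common response in each experience group, and for multiple-selection questions an answer within 2 percentage points of the leader is listed as well, in order. A minimal sketch of that selection rule; `top_answers` and the share values are hypothetical and only illustrate the rule stated in the overview:

```python
import pandas as pd

def top_answers(shares: pd.Series, tolerance: float = 2.0) -> list:
    """Return the most common answer plus any answer less than `tolerance`
    percentage points behind it, ordered from most to least common."""
    ranked = shares.sort_values(ascending=False)
    close = ranked[ranked.iloc[0] - ranked < tolerance]
    return close.index.tolist()

# Hypothetical shares (in %) of one question within one experience group
shares = pd.Series({'PostgreSQL': 29.5, 'MySQL': 31.0, 'MongoDB': 18.0})
print(top_answers(shares))
```

For a single-choice question the same function can be called with `tolerance=0.0` plus a small epsilon, or simply replaced by `shares.idxmax()`.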

**--------------- Under 1 year ---------------**

He is an 18-21 year old student with a bachelor's degree, residing in India.
Has been programming for 1-3 years.
He uses Kaggle Notebooks as hosted notebook products for his ML projects but does not use any types of specialized hardware.
Data analysis is carried out in basic statistical software (Microsoft Excel, Google Sheets, etc.).
He gets new knowledge and information on data science at Kaggle (Learn Courses, notebooks, forums, etc.).
While his experience in machine learning is less than 3 years, he hopes to become more familiar with ML & cloud computing services in the next 2 years, such as MySQL and Tableau for working with data, Google Cloud Compute Engine for cloud computing, and Google Cloud AutoML for automating machine learning.
Right now, he is interested in Amazon Web Services (AWS) as a cloud computing platform, Google Cloud Vertex AI (more so) and Azure ML Studio for machine learning management, and automated model selection along with automation of full ML pipelines (slightly less).
As for tools to help manage ML experiments, he is not interested in them, although TensorBoard has caught his attention.

**--------------- 1-2 years ---------------**

He is 22-24 years old. He is still a student, but with a master's degree (or plans to attain it within the next 2 years).
From that point on, he uses Colab Notebooks instead of Kaggle Notebooks, local development environments instead of basic statistical software for data analysis, and Coursera instead of Kaggle Learn Courses for new knowledge.
He also hopes to become more familiar in the next 2 years not only with AWS but, to a lesser extent, with Google Cloud Platform (GCP). He now selects only Google Cloud Vertex AI for machine learning management, and TensorBoard to help manage ML experiments.
Everything else is the same.

**---------------- 2-3 years ---------------**

He is 25-29 years old and now works as a Data Scientist with a yearly compensation of 0-999 USD in a small company (0-49 employees) in the computers/technology industry.
The company has well established ML methods (i.e., models in production for more than 2 years), and also has 20+ individuals who are responsible for data science workloads.
He now has 3-5 years of programming experience and has already started using specialized hardware in his work, namely NVIDIA GPUs.
At the same time, an important part of his role at work is analyzing and understanding data to influence product or business decisions.
As for ML and/or cloud computing services, he has spent no money on them in the past 5 years, either at home or at work.
But now he hopes to become more familiar with GCP than with AWS in the next 2 years, and he is also interested in tools for automated model selection.
Otherwise, nothing has changed.

**---------------- 3-4 years ---------------**

He has started spending 1,000-9,999 USD on cloud computing services.
From now on, he uses AWS and its products, in particular Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3), on a regular basis.
He also starts using MySQL to work with data.
At the same time, he does not use business intelligence tools, managed ML products, tools to help manage ML experiments, or automated ML tools at all.
Everything else is the same.

**---------------- 4-5 years ---------------**

He is 30-34 years old. His programming experience has also increased to 5-10 years.
From that moment, he resides in the USA, works in a large company (10,000 or more employees), and receives a yearly compensation of 100,000-124,999 USD. Otherwise, the company, his work, and everything else are the same as before.

**---------------- 5-10 years ---------------**

His programming experience has grown further, to 10-20 years.
Building prototypes to explore applying ML to new areas becomes an important part of his role at work, while analyzing and understanding data to influence product or business decisions matters slightly less.
He also uses PostgreSQL in his work, not just MySQL.
He gets less new information on data science from Kaggle and reads more blogs (Towards Data Science, Analytics Vidhya, etc.).
Otherwise, everything is the same.

**---------------- 10-20 years ---------------**

He is 40-44 years old. He has attained (or plans to attain within the next 2 years) a doctoral degree.
From that moment on, he no longer works in the computers/technology industry but in academics/education, and his yearly compensation has increased to 150,000-199,999 USD.
The important part of his role at work is now building prototypes to explore applying ML to new areas.
He spends even more money on cloud computing services (10,000-99,999 USD).
He no longer uses PostgreSQL to work with data, only MySQL, as before.
He still gets most new information on data science from blogs or Kaggle, but he also reads journal publications (peer-reviewed journals, conference proceedings, etc.).
Everything else remained unchanged.

**---------------- 20 or more years ---------------**

He is already 55-59 years old, and his programming experience is 20+ years.
He has moved to a slightly smaller company (1000-9,999 employees), and his yearly compensation has dropped back to 100,000-124,999 USD, but the important part of his role at work is again analyzing and understanding data to influence product or business decisions.
He no longer uses any hosted notebook products for his machine learning work.
He uses not only Linear or Logistic Regression but also Decision Trees or Random Forests, and the latter more often.
He continues to spend money on cloud computing services, with amounts of both 1,000-9,999 USD and 100,000 USD or more.
He uses GCP as his cloud computing platform rather than AWS, although he still uses Amazon Elastic Compute Cloud (EC2) as before and, additionally, Google Cloud Compute Engine.
His favorite media sources on data science topics are now journal publications, although he continues to read blogs.

# References:

The following dataset and libraries were used for the analysis:
1. Data, Questions and answer choices, and Methodology and survey flow logic from [2021 Kaggle Machine Learning & Data Science Survey](https://www.kaggle.com/c/kaggle-survey-2021/overview/description).
2. Python libraries: [pandas](https://pandas.pydata.org/pandas-docs/stable/) and [plotly](https://plotly.com/python/).