In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# This line makes our plot visible.

%matplotlib inline

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

                                              Introduction

   When I opened the 2019 survey for the first time, I was intrigued by a question that had been absent from the previous Kaggle surveys: “What is the size of your company?”. This is an interesting question and a new challenge. What do we know about the companies whose representatives submitted the survey? What challenges does each group experience? Is there any difference in how each group processes data? What sets the groups of companies apart, and what do they have in common? How are machine learning techniques changing today’s small, medium, and large businesses? Can we get answers to these questions from the survey? Or maybe we will have even more questions after analyzing the survey data? Unfortunately, there was no question about company size in the previous surveys, so we cannot compare years, see trends, or make predictions, but it is a good place to start.

   >Another interesting question is why business needs machine learning and data analysis techniques at all. In this era of data, the amount of information available in our world grows day by day. The speed of change and the volume of accumulated information are astonishing. Many routine tasks that were not an issue 20 years ago now seem overwhelming without external help. Our brain has a limited operational capacity, and in the era of information we have to rely on external help. We need stock market analysis, weather forecasts, earthquake probability predictions, etc. For example, 100 years ago we had few sources of information for weather prediction: a small number of historical measurements, a thermometer, a hygrometer. It was easy to process the information manually, but the cost was accuracy. Today, to get a precise prediction, we have many different sources of information: a hundred years of historical records, many sensors, satellite data, etc. We can still process data from a few of them manually and get the same accuracy as 100 years ago, but we can get much better accuracy using all the data sources. The more information we have, the more complex the task of processing it becomes. This is the evolutionary way. Once we have a new useful tool, its spread is inevitable. Those who do not adopt it will have fewer chances to survive.

   >The ancient Greeks were able to fit all scientific concepts and their picture of the world into a single profession - philosopher. In today’s world we have a huge number of scientific disciplines, divisions and sub-divisions. The more information we gather in any field of knowledge, the more complex it becomes, and we have to divide the discipline, because otherwise a person would spend a significant part of their life trying to fit all the information about it into their brain. But what if we have another option besides dividing? What if we can use our brain for more complex tasks and leave everything else to machines?

   >In a global world full of information, I think it is a good choice! Machine learning techniques are becoming more popular. Hardware performance keeps getting higher. Machine learning is turning from a vague idea of science fiction filmmakers into our daily reality.

   So who are they, today’s engines of the data world? How do companies of different sizes adopt machine learning in their business models? How does each of them use it? Are there specific obstacles for any of them? Let’s dive into the survey data to take a glimpse.


In [None]:
# These lines load all 3 available Kaggle surveys.

survey_2017 = pd.read_csv("/kaggle/input/2017-kaggle-survey/2017_responses.csv", encoding='latin-1', low_memory=False)
survey_2018 = pd.read_csv("/kaggle/input/2018-kaggle-survey/2018_responses.csv", header = 1, low_memory=False)
survey_2019 = pd.read_csv("/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv", header = 1)

   So let's take a closer look at our groups. We have 5 of them, ranging from small to large businesses. Groups '**0-49 employees**' and '**50-249 employees**' can be considered small businesses, '**1000-9,999 employees**' and '**>10,000 employees**' big companies, and '**250-999 employees**' sits in the middle. From the two charts below, we can see that our respondents are not spread evenly between these groups. The largest group of respondents represents the smallest businesses. The second largest represents the biggest companies. So more than half of our respondents represent the biggest and the smallest business groups. The question is why? The smallest group of respondents represents medium-sized businesses, and it differs significantly in size from both the biggest and the smallest companies' groups. This may suggest that the smallest and the biggest companies are more interested in machine learning techniques. Or maybe they face fewer obstacles? Or have more motivation?
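
   As a small illustration of how such shares are computed, here is a minimal sketch on made-up answers (the `sizes` series is hypothetical, not survey data). It shows that `value_counts` skips missing answers by default, so the percentages are relative to the people who answered the question, not to everyone who took the survey.

```python
import pandas as pd

# Hypothetical toy answers; the real column comes from survey_2019.
sizes = pd.Series([
    "0-49 employees", "0-49 employees", "0-49 employees",
    "> 10,000 employees", "> 10,000 employees",
    "250-999 employees",
    None,  # a respondent who skipped the question
])

# Absolute counts per company size (missing answers are excluded by default).
raw_counts = sizes.value_counts()

# Shares in % of the respondents who answered the question.
share_pct = sizes.value_counts(normalize=True) * 100

print(raw_counts)
print(share_pct.round(1))
```

   If shares relative to all survey participants are needed instead, `value_counts(normalize=True, dropna=False)` keeps the missing answers as their own category.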

In [None]:
# This cell examines the raw number of responses per company size. We will use it later for data rescaling.
company_size_raw = survey_2019['What is the size of the company where you are employed?'].value_counts().to_frame('Total amount')

# We will determine and plot the share of respondents per company size in % to make it human-readable.
company_size_perc = survey_2019["What is the size of the company where you are employed?"].value_counts(normalize=True)*100

# Set plotting variables and plot the output.
plt.rcParams.update({'figure.figsize': (7, 5)})
ax = company_size_perc.plot(kind = 'bar')
ax.set_title('Company size')

#company_size_perc.set_xlabel('Amount ($)')
ax.set_ylabel('% of respondents')

In [None]:
# Helper for grouping two columns and counting the resulting pairs.
# This function takes a dataframe; the name of the column we are interested in;
# a new column name (used for shorter references and readability);
# a column for indexing; and an optional parameter col_param, which is a value in orig_col_name.

def group_by_func(df, orig_col_name, new_col_name, index_col, col_param = False):
    if col_param is not False:
        # If col_param is set, select only the rows of the original dataframe with this value.
        # We need it to group a column that holds multiple distinct values.
        temp_df = df[df[orig_col_name] == col_param]
    else:
        temp_df = df
    # Group the 2 columns and count the number of rows for each pair.
    grouped_df = temp_df.groupby([orig_col_name, index_col]).size().to_frame(new_col_name).reset_index()
    # Drop the now-unnecessary column and set the index for easier further data manipulation.
    # (columns= is passed by keyword; pandas 2.x rejects a positional axis argument.)
    grouped_df = grouped_df.drop(columns=orig_col_name).set_index(index_col)
    return grouped_df

In [None]:
# Function for re-arranging data from two corresponding columns into a table.
# It creates a new dataframe with one column per answer from two given columns.
# It takes the name of the column we're interested in; the column we use for grouping (size of companies)
# and an optional scale flag for re-scaling.

def column_to_table(orig_col_name, index_col_name, scale = False):
    new_dataframe = pd.DataFrame()
    # Make a list of all possible values of the column of interest, leaving out missing values (NaN).
    new_col_name = survey_2019[orig_col_name].dropna().unique()

    # Iterate over each value to make a column in our new dataframe.
    for i in new_col_name:
        temp_df = group_by_func(survey_2019, orig_col_name, i, index_col_name, i)

        # The scale parameter is set to True if we need relative data instead of absolute counts.
        if scale is True:
            # Divide the count of each pair by the number of respondents in the corresponding group.
            temp_df = ((temp_df[i]/company_size_raw['Total amount'])*100).to_frame(i)

        # Add the newly created column to our new dataframe.
        new_dataframe = pd.concat([new_dataframe, temp_df], axis = 1, sort = True)
    return new_dataframe

In [None]:
# Create a new dataframe from multiple columns of the original dataframe.
# It takes:
#   - a list of the column names we are interested in;
#   - a list of new (shorter) column names for easier plotting;
#   - the index column, which represents company size;
#   - an optional scale flag, used for re-scaling results according to the number of respondents
#     in each group.

def multiple_columns_to_table(orig_column_name, new_column_name, index_col, scale = False):
    output_df = pd.DataFrame()

    # Walk through the original column names and their short names in parallel.
    for orig_name, new_name in zip(orig_column_name, new_column_name):
        # Create a column of the new dataframe.
        temp_df = group_by_func(survey_2019, orig_name, new_name, index_col)

        # Re-scale the column to the number of respondents in each group if scale is set to True.
        if scale is True:
            temp_df = ((temp_df[new_name]/company_size_raw['Total amount'])*100).to_frame(new_name)

        # Add the column to our new dataframe.
        output_df = pd.concat([output_df, temp_df], axis = 1, sort = True)

    # Return the output dataframe styled with a colour gradient.
    return output_df.style.background_gradient(cmap='Blues')

   Let's look more closely at our groups of companies! What is the size of the machine learning team in each group? The smallest companies, '**0-49 employees**', have the smallest machine learning teams, and many of them have no one responsible for machine learning at all. The largest companies, with '**>10,000 employees**', expectedly tend to have more than 20 people with machine learning responsibilities. Bigger companies tend to have more people responsible for machine learning, and smaller companies fewer.

In [None]:
# How many employees are responsible for ML at your business in 2019.

column_to_table('Approximately how many individuals are responsible for data science workloads at your place of business?',
                'What is the size of the company where you are employed?'
               ).style.background_gradient(cmap='Blues')

   Look at the table below. It shows the same distribution of the number of individuals responsible for machine learning, relative to the total number of respondents of each company size. Let's look at our middle group, '**250-999 employees**'. Interestingly, the number of people responsible for machine learning in this group is distributed almost evenly (except for the 15-19 group) and is not concentrated in the middle, as we might assume. It has big groups of ML specialists as well as small ones. If we compare it with the two adjacent groups, '**50-249 employees**' and '**1000-9,999 employees**', we see that it tends to be closer to '**1000-9,999 employees**', even though the companies themselves have far fewer workers. This may suggest that medium-sized companies have a business structure more similar to big companies than to small ones. They are interested in data science, but why do they have so few representatives in this field?
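
   To make the rescaling step concrete, here is a minimal sketch on a hypothetical toy frame (the column names `company_size` and `ml_team` are invented for the example). Each count is divided by the total number of respondents of that company size, which is the same idea as dividing by `company_size_raw['Total amount']`. Note that a row's percentages add up to less than 100 when some respondents of that size skipped the team-size question.

```python
import pandas as pd

# Hypothetical toy data; None marks a skipped team-size question.
toy = pd.DataFrame({
    "company_size": ["small", "small", "small", "small", "small", "large", "large"],
    "ml_team":      ["0",     "0",     "1-2",   "20+",   None,    "20+",   "20+"],
})

# Raw counts: one row per company size, one column per team size.
# groupby skips rows where either key is missing.
raw = toy.groupby(["company_size", "ml_team"]).size().unstack(fill_value=0)

# Number of respondents per company size, including those who skipped ml_team.
group_totals = toy["company_size"].value_counts()

# Divide every row by its group total (aligned on the index) to get percentages.
relative = raw.div(group_totals, axis=0) * 100

print(raw)
print(relative.round(1))
```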

In [None]:
# How many employees are responsible for ML at your business in 2019, re-scaled data for a deeper comparison.

column_to_table(
    'Approximately how many individuals are responsible for data science workloads at your place of business?',
    'What is the size of the company where you are employed?',
    True
).style.background_gradient(cmap='Blues')

   Let's get back to another interesting group from the table above - the respondents from '**0-49 employees**' who replied that they don't have anyone responsible for ML. Why does such a significant group of respondents have no one responsible for machine learning at their workplace while still being interested in machine learning? Are they simply studying this field as a good opportunity to switch jobs, or are they exploring a new way of solving their tasks? From the chart below we can see that these individuals work mostly with raw data. They analyze and understand the data and build infrastructure for it. So this significant group is likely examining the possibility of processing large amounts of data with machine learning techniques at their workplace.
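
   The selection of such a subgroup can be sketched on a hypothetical toy frame (all column names here are invented for the example). Two conditions are combined with `&`, and `count()` then counts the non-empty cells, which works because a "select all that apply" column holds the answer text when the option was chosen and is empty otherwise.

```python
import pandas as pd

# Hypothetical toy data: each activity column holds the answer text if selected, else None.
toy = pd.DataFrame({
    "company_size": ["0-49 employees", "0-49 employees", "0-49 employees", "> 10,000 employees"],
    "ml_team": ["0", "0", "1-2", "0"],
    "activity - Analyze data": ["Analyze data", "Analyze data", None, "Analyze data"],
    "activity - Build prototypes": [None, "Build prototypes", "Build prototypes", None],
})

# Keep only small-company respondents with nobody responsible for ML.
mask = (toy["company_size"] == "0-49 employees") & (toy["ml_team"] == "0")
subgroup = toy[mask]

# Count the non-empty answers per activity column in one call.
activity_cols = [c for c in toy.columns if c.startswith("activity - ")]
counts = subgroup[activity_cols].count()

print(counts)
```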

In [None]:
# Distribution of activities for the 0-49 employees group, limited to respondents who replied
# that they don't have anyone responsible for machine learning in their company.
# Let's get the list of activities from the survey and make a more compact output list for easier plotting.

activities_2019_survey = [
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Analyze and understand data to influence product or business decisions',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build prototypes to explore applying machine learning to new areas',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run a machine learning service that operationally improves my product or workflows',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Experimentation and iteration to improve existing ML models',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Do research that advances the state of the art of machine learning',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - None of these activities are an important part of my role at work',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Other'
]
activities_0_49_employees_output = [
    'Analyze and understand data',
    'Build/run the data infrastructure for storing/analyzing/operationalizing data',
    'Build prototypes to explore applying ML to new areas',
    'Build/run a ML service that operationally improves my product/workflows',
    'Experimentation and iteration to improve existing ML models',
    'Do research that advances the state of the art of ML',
    'None of these',
    'Other'
]
# Select the part of the dataframe related to our group of 0-49 employees who replied
# that they don't have anyone responsible for machine learning.
temp_df = survey_2019[
    (survey_2019['What is the size of the company where you are employed?'] == '0-49 employees')
    & (survey_2019['Approximately how many individuals are responsible for data science workloads at your place of business?'] == '0')
]
activities_0_49_employees_list = []

for i in activities_2019_survey:
    # Count the non-empty answers for each activity and add the result to the list.
    activities_0_49_employees_list.append(temp_df[i].count())

# Create a dataframe from the list.
activities_0_49_employees = pd.DataFrame(activities_0_49_employees_list, index = activities_0_49_employees_output, columns =['Total amount of respondents'])

# Set plotting parameters and plot the frame.
plt.rcParams.update({'figure.figsize': (10, 7)})
plt.rcParams.update({'font.size': 18})
ax = activities_0_49_employees.plot(kind = 'barh')
ax.set_title("Activities of '0-49 employees' group")

   What if we take a closer look at the distribution of work activities across companies of different sizes? Is there any difference between large companies and small ones? How does each group process data? What are the most important data-related work activities for each company size? Our first table shows raw data: the total number of respondents in each group. These data are hard to analyze because the number of respondents differs significantly between groups.

In [None]:
# Size of company vs. activities of the employees 2019.
# List of activities, represented in the survey.

activities_list = [
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Analyze and understand data to influence product or business decisions',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build prototypes to explore applying machine learning to new areas',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run a machine learning service that operationally improves my product or workflows',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Experimentation and iteration to improve existing ML models',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Do research that advances the state of the art of machine learning',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - None of these activities are an important part of my role at work',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Other'
]

# Desired output column names for more readable format.
activities_list_output = [
    'Analyze and understand data to influence product or business decisions',
    'Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data',
    'Build prototypes to explore applying machine learning to new areas',
    'Build and/or run a machine learning service that operationally improves my product or workflows',
    'Experimentation and iteration to improve existing ML models',
    'Do research that advances the state of the art of machine learning',
    'None of these activities are an important part of my role at work',
    'Other'
]

multiple_columns_to_table(
    activities_list,
    activities_list_output,
    'What is the size of the company where you are employed?'
)

   It is much easier to interpret a table with the distribution relative to the number of respondents in each company size group. The most evident fact is that the companies with '**>10,000 employees**' are leading in all categories of data processing. They are analyzing, researching, building and experimenting. Their massive teams of data scientists are leading the way towards data science in every aspect of our lives. It isn't a surprise that we see huge companies at the top. Look at the first column of the table. You can see that nearly half of all respondents consider analyzing and understanding the data an important part of their work. The most significant task and the starting point for machine learning is the data itself. Big companies have the human resources for data gathering and analysis, and moreover they have the data itself. As business giants they deal with a huge amount of data daily, and they need tools to make data processing more accurate and quick. These companies are building machine learning platforms and algorithms. They are the middle layer between the scientific part of data science and production. They combine both!

   But what about another interesting group, '**250-999 employees**'? Is there anything specific about it here?
Even when we look at the table with the relative number of responses, we can see that this group has a rather weak implementation of machine learning techniques in its business model. What are the obstacles? Is it a well-established technological process without data technologies? Or is it something else?

   And of course, we should take a look at the smallest value in the '**Analyze and understand data to influence product or business decisions**' column - it belongs to '**0-49 employees**'. The fact that this group uses the first step of the data process less than the others can give us a hint about the way they use machine learning methods. They may rely more on pre-trained models, adapt them to their needs and use publicly available data sets.

   Small companies presumably generate less data, but they are still willing to process information with machine learning tools, since it can help them protect their revenue.
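
   As a rough sketch of how such a relative table can be built for "select all that apply" questions, the example below uses a hypothetical toy frame with invented column names. It counts the non-empty answers per company size for several columns at once, shortens the column names, and divides by the size of each group. This is an alternative formulation for illustration, not the helper functions used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: an activity cell holds text if the option was selected, else None.
toy = pd.DataFrame({
    "company_size": ["small", "small", "small", "small", "large", "large"],
    "Q - Selected Choice - Analyze data": ["Analyze data", None, None, "Analyze data", "Analyze data", "Analyze data"],
    "Q - Selected Choice - Do research": [None, "Do research", None, None, "Do research", None],
})

# Map every long question column to a short, readable name.
long_cols = [c for c in toy.columns if c.startswith("Q - ")]
short_names = {long: long.split(" - ")[-1] for long in long_cols}

# Count selections per company size for all activity columns at once.
counts = toy.groupby("company_size")[long_cols].count().rename(columns=short_names)

# Re-scale to % of respondents in each company size group.
group_sizes = toy["company_size"].value_counts()
share_pct = counts.div(group_sizes, axis=0) * 100

print(share_pct)
```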

In [None]:
# Size of company vs. activities of the employees, normalized.

activities_list = [
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Analyze and understand data to influence product or business decisions',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build prototypes to explore applying machine learning to new areas',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run a machine learning service that operationally improves my product or workflows',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Experimentation and iteration to improve existing ML models',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Do research that advances the state of the art of machine learning',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - None of these activities are an important part of my role at work',
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Other'
]

activities_list_output = [
    'Analyze and understand data to influence product or business decisions',
    'Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data',
    'Build prototypes to explore applying machine learning to new areas',
    'Build and/or run a machine learning service that operationally improves my product or workflows',
    'Experimentation and iteration to improve existing ML models',
    'Do research that advances the state of the art of machine learning',
    'None of these activities are an important part of my role at work',
    'Other'
]

multiple_columns_to_table(
    activities_list,
    activities_list_output,
    'What is the size of the company where you are employed?',
    True
)

   Let's dive deeper and compare the distribution of work activities from the previous year with the current one. We rescaled both groups according to the number of respondents in each survey for a fairer comparison. We can see that the '**Other**' and '**None**' categories became significantly smaller. The other categories became smaller too, but at the same time we've got a new category - '**Experimentation and iteration to improve existing ML models**' - and it represents a sizable part of our respondents. This suggests that our audience became more specialized, with fewer people who are not involved in the data science process at their work.
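
   The year-over-year comparison can be sketched on two hypothetical toy frames (invented data and a helper name, `pct_selected`, that is not part of this notebook). Each year's counts are divided by that year's number of rows, and the two results are joined side by side. A category that exists in only one year shows up as `NaN` for the other year, which is why a new category has no bar for the previous survey.

```python
import pandas as pd

# Hypothetical toy surveys; a cell is non-empty when the activity was selected.
toy_2018 = pd.DataFrame({
    "Analyze": ["x", "x", None, "x"],
    "Other":   ["x", None, None, None],
})
toy_2019 = pd.DataFrame({
    "Analyze":         ["x", None],
    "Experimentation": ["x", "x"],   # category present only in the later year
    "Other":           [None, None],
})

def pct_selected(df):
    """Share of rows (in %) with a non-empty answer, per column."""
    return df.count() / len(df) * 100

# Outer-join the two yearly series on the activity name.
comparison = pd.concat(
    [pct_selected(toy_2018).rename("2018"), pct_selected(toy_2019).rename("2019")],
    axis=1,
    sort=True,
)

print(comparison)
```

   Here `len(df)` assumes that every row is one respondent; the notebook instead divides by fixed totals for each year.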

In [None]:
# Activities in total 2019 vs. 2018 years scaled according to the amount of respondents in each year.
# List of activities.

# Common prefix shared by every activity column in both surveys.
activity_prefix = 'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - '

survey_2018_activities = [activity_prefix + activity for activity in [
    'Analyze and understand data to influence product or business decisions',
    'Build and/or run a machine learning service that operationally improves my product or workflows',
    'Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data',
    'Build prototypes to explore applying machine learning to new areas',
    'Do research that advances the state of the art of machine learning',
    'None of these activities are an important part of my role at work',
    'Other'
]]
survey_2019_activities = [activity_prefix + activity for activity in [
    'Analyze and understand data to influence product or business decisions',
    'Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data',
    'Build prototypes to explore applying machine learning to new areas',
    'Build and/or run a machine learning service that operationally improves my product or workflows',
    'Experimentation and iteration to improve existing ML models',
    'Do research that advances the state of the art of machine learning',
    'None of these activities are an important part of my role at work',
    'Other'
]]

# Desired activities name for more readable format.
survey_2018_activities_output = [
    'Analyze and understand data',
    'Build/run a ML service that operationally improves my product/workflows',
    'Build/run the data infrastructure for storing/analyzing/operationalizing data',
    'Build prototypes to explore applying ML to new areas',
    'Do research that advances the state of the art of ML',
    'None of these',
    'Other'
]

survey_2019_activities_output = [
    'Analyze and understand data',
    'Build/run the data infrastructure for storing/analyzing/operationalizing data',
    'Build prototypes to explore applying ML to new areas',
    'Build/run a ML service that operationally improves my product/workflows',
    'Experimentation and iteration to improve existing ML models',
    'Do research that advances the state of the art of ML',
    'None of these',
    'Other'
]

survey_2018_activities_list = []

for i in survey_2018_activities:
    # Count, re-scale and add each activity to the list.
    # 23859 is the total number of respondents in the 2018 survey.
    survey_2018_activities_list.append((survey_2018[i].count()/23859)*100)

# Create a dataframe from the list.
survey_2018_activities_df = pd.DataFrame(
    survey_2018_activities_list,
    index = survey_2018_activities_output,
    columns =['2018']
)

survey_2019_activities_list = []

for i in survey_2019_activities:
    # Count, re-scale and add each activity to the list.
    # 19717 is the total number of respondents in the 2019 survey.
    survey_2019_activities_list.append((survey_2019[i].count()/19717)*100)

# Create a dataframe from the list.
survey_2019_activities_df = pd.DataFrame(survey_2019_activities_list, index = survey_2019_activities_output, columns =['2019'])

# Combine dataframes for both years.
survey_activities_2019_vs_2018 = pd.concat([survey_2018_activities_df, survey_2019_activities_df], axis = 1, sort = True)

# Set plotting parameters and plot the frame.
plt.rcParams.update({'figure.figsize': (10, 7)})
plt.rcParams.update({'font.size': 18})
ax = survey_activities_2019_vs_2018.plot(kind = 'barh')
ax.set_title('The most important activities 2018 vs. 2019')
ax.set_xlabel('% of respondents')

   I want to take a closer look at the category '**Experimentation and iteration to improve existing ML models**' and its distribution across the company sizes. Once again we have a lot of representatives from the smallest and the biggest groups and a huge gap in the middle. Let's check this chart again with data relative to the number of respondents in each group.

In [None]:
# Distribution of the 'Experimentation and iteration to improve existing ML models' activity across the company types, raw data.

data_science_degree = group_by_func(
    survey_2019,
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Experimentation and iteration to improve existing ML models',
    'Experimentation and iteration to improve existing ML models',
    'What is the size of the company where you are employed?'
)

# Set plotting parameters and plot the dataframe.
plt.rcParams.update({'figure.figsize': (10, 5)})
plt.rcParams.update({'font.size': 13})
data_science_degree.plot(kind = 'bar')

   We can see that the first two groups are the two largest relative to the number of respondents in each group. They have much more data, and much more room and resources to experiment and iterate on the data. Strangely, at this point the '**250-999 employees**' group is the smallest. It is becoming clearer that medium-sized business is mostly still at the stage of data analysis and of exploring the possibility of implementing machine learning in its business model. Maybe we can find out more if we look at the machine learning techniques used by the companies?

In [None]:
# Distribution of the 'Experimentation and iteration to improve existing ML models' activity across the company types, re-scaled to the amount of respondents in each group.

data_science_degree = group_by_func(
    survey_2019,
    'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Experimentation and iteration to improve existing ML models',
    'Experimentation and iteration to improve existing ML models',
    'What is the size of the company where you are employed?'
)

data_science_degree_percent = (data_science_degree['Experimentation and iteration to improve existing ML models']/company_size_raw['Total amount'])*100

# Set plotting parameters and plot the dataframe.
plt.rcParams.update({'figure.figsize': (7, 5)})
plt.rcParams.update({'font.size': 13})
ax = data_science_degree_percent.plot(kind = 'bar')
ax.set_title('Experimentation and iteration to improve existing ML models')
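
   To make the re-scaling step explicit, here is a minimal sketch on made-up numbers. The group labels, the counts and the names `selected`, `total` and `share_percent` are assumptions for illustration only, not survey results. It divides the per-group counts by the group totals, the same way the cell above divides by `company_size_raw['Total amount']`.

```python
import pandas as pd

# Hypothetical numbers, not survey results: how many respondents in each
# group selected the activity, and how many respondents each group has.
groups = ['small', 'medium', 'large']
selected = pd.Series([30, 10, 24], index=groups, name='Selected')
total = pd.Series([150, 40, 60], index=groups, name='Total amount')

# Division aligns both Series on their index labels, so every count is
# divided by the total of the same group.
share_percent = (selected / total * 100).round(1)
print(share_percent)
```

   In this toy example the group with the largest raw count has the smallest share, which is why the raw chart and the re-scaled chart can tell different stories.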

   Look at the table below. Our two opposite groups, '**0-49 employees**' and '**>10,000 employees**', are at the top again. Huge companies mostly have well-established machine learning methods, and some of them have only recently started to explore these methods. Small companies either do not use ML methods, are exploring them, or have used them only for a short time without putting them into production. Some small companies do use ML methods, but not for long; I assume these are startups built around ML as their core idea. So at this point we can say that big companies actively use ML methods for their business purposes, while small companies are exploring those methods and figuring out how to fit them into their business model. And again we can see a considerable gap for the medium-sized companies.

In [None]:
# Distribution of machine learning methods across the company types, raw amount.

column_to_table(
    'Does your current employer incorporate machine learning methods into their business?',
    'What is the size of the company where you are employed?'
).style.background_gradient(cmap='Blues')

   Let's take a look at the same table with the data relative to the total number of respondents in each group. Our opinion about the smallest companies is getting stronger. The '**50-249 employees**' and '**250-999 employees**' groups look almost identical, and '**1,000-9,999 employees**' has a slightly larger share of well-established ML methods.

In [None]:
# Distribution of machine learning methods across the company types, re-scaled amount for comparison.

column_to_table(
    'Does your current employer incorporate machine learning methods into their business?',
    'What is the size of the company where you are employed?',
    True
).style.background_gradient(cmap='Blues')
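
   The helper `column_to_table` is defined earlier in this notebook and is not repeated here. As a rough illustration of the raw versus re-scaled tables, the same idea can be sketched with `pd.crosstab`. This is not the helper's implementation; the column names `company_size` and `ml_usage` and all values are made up.

```python
import pandas as pd

# Hypothetical answers, not survey data.
toy = pd.DataFrame({
    'company_size': ['small', 'small', 'small', 'small', 'large', 'large'],
    'ml_usage': ['Exploring', 'Exploring', 'No', 'Established',
                 'Established', 'Exploring'],
})

# Raw counts: one row per answer, one column per company size.
raw = pd.crosstab(toy['ml_usage'], toy['company_size'])

# Column-wise shares in percent: every column sums to 100, so groups with
# very different numbers of respondents become comparable.
relative = (pd.crosstab(toy['ml_usage'], toy['company_size'],
                        normalize='columns') * 100).round(1)
print(relative)
```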

   We also need to explore the job titles of the respondents. We need to know not only how the data is processed in the companies of each group, but also who processes it. Look at the distribution of job titles over company sizes below. The most significant groups of titles are '**Data Scientists**' and '**Software Engineers**'. The most represented companies are once again the smallest and the largest. There is nothing specific here, so let's go further.

In [None]:
# Distribution of job titles across the company sizes.

column_to_table(
    'What is the size of the company where you are employed?',
    'Select the title most similar to your current role (or most recent title if retired): - Selected Choice'
).T.style.background_gradient(cmap='Blues')

   Let me show you a comparison chart of the job titles of our respondents in the last 3 surveys. What has changed during this period of time?

In [None]:
# Comparison of respondents' job titles in the past 3 surveys.
# Select the job title column from each survey, count each title and re-scale the counts
# to the number of respondents in that survey.

# The denominators are the hard-coded numbers of respondents used for each survey year.
job_title_2017 = ((survey_2017['CurrentJobTitleSelect'].value_counts()/16000)*100).to_frame('2017')
job_title_2018 = ((survey_2018['Select the title most similar to your current role (or most recent title if retired): - Selected Choice'].value_counts()/23859)*100).to_frame('2018')
job_title_2019 = ((survey_2019['Select the title most similar to your current role (or most recent title if retired): - Selected Choice'].value_counts()/19717)*100).to_frame('2019')

# Create a new dataframe by combining 3 dataframes from each year.
new_df = pd.concat([job_title_2017, job_title_2018, job_title_2019], axis = 1, sort = True)

# Set plotting parameters and plot the dataframe.
plt.rcParams.update({'figure.figsize': (10, 20)})
plt.rcParams.update({'font.size': 13})
ax = new_df.plot(kind = 'barh', width = 1.0)
ax.grid()
ax.set_title("Comparison of respondents' job titles in 2017, 2018 and 2019")
ax.set_xlabel('% of respondents in each year')
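
   One detail of the cell above is worth a small sketch: `pd.concat` with `axis = 1` aligns the yearly columns on job title, so a title missing from one survey simply has no value for that year. The shares below are made up for illustration and are not survey results.

```python
import pandas as pd

# Hypothetical title shares in percent, not survey results.
y2017 = pd.Series({'Data Scientist': 20.0, 'Consultant': 5.0}, name='2017')
y2019 = pd.Series({'Data Scientist': 25.0, 'Student': 21.0}, name='2019')

# Columns are aligned on the union of the index labels. A title that a
# survey did not offer is filled with NaN and gets no bar in a bar chart.
combined = pd.concat([y2017, y2019], axis=1, sort=True)
print(combined)
```

   This is why titles that exist in only one year show a single bar in the comparison chart.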

   This chart is re-scaled according to the number of respondents in each year to make the comparison fairer. We have far fewer job titles in the 2019 survey, so I assume that most of the small 2017 groups have merged into other groups or into the '**Other**' group. We've got a '**Student**' title (which has a significant number of respondents) and the '**Data Engineer**' title. The '**Data Scientist**' title grew a lot. If we take a look at the chart as a whole, we can conclude that our audience has become more data science specific. We don't have the '**Consultant**', the '**Developer Advocate**' and the '**CTO**' titles any longer; those respondents either went to the '**Other**' group (which didn't grow significantly) or now attribute themselves to other groups. The data science field is changing from a closed scientific research area, to a fancy something-everyone-heard-of ('I want to take a look at this for sure'), to we-definitely-need-to-explore-this-and-put-into-production. It is turning from something out of a computer science magazine or a science fiction movie into our daily reality, and we cannot imagine our world without it anymore. Now I want to come back from outer space and the Mars mission landing to our data; we have more to explore here!

   I have noticed one more interesting thing: the share of the '**Product/Project Manager**' job title has almost doubled since the last survey. Is this a statistical error or something deeper? Maybe it's a new trend, where companies are exploring the machine learning field for their new projects? And if so, which companies are exploring the field most? On the chart below you can find the distribution of the '**Product/Project Manager**' job title across company sizes. The chart is re-scaled to the number of respondents in each group.

In [None]:
# Distribution of the 'Product/Project Manager' job title across types of companies, re-scaled according to
# the amount of respondents in each group. 
# Group job title and company size columns using the 'Product/Project Manager' as a key.

product_manager_vs_company_size = group_by_func(
    survey_2019,
    'Select the title most similar to your current role (or most recent title if retired): - Selected Choice',
    'Product/Project Manager',
    'What is the size of the company where you are employed?',
    'Product/Project Manager'
)

# Re-scale the amount of the specific job title holder according to the amount of respondents in each group.
product_manager_vs_company_size_percent = (product_manager_vs_company_size['Product/Project Manager']/company_size_raw['Total amount'])*100

# Set plotting parameters and plot the dataframe.
plt.rcParams.update({'figure.figsize': (7, 5)})
plt.rcParams.update({'font.size': 13})
# The re-scaled result is a Series: its index supplies the x axis, and the y label is set below.
ax = product_manager_vs_company_size_percent.plot(kind = 'bar')
ax.set_title('Product/Project Manager job title distribution')
ax.set_ylabel('% of respondents')

   There is nothing specific. Product/Project managers are represented almost evenly in each group, with a tiny gap for the medium-sized companies. I want to explore this further and show you the distribution of activities of the '**Product/Project Managers**'.

In [None]:
# Distribution of the activities for respondents whose job title is 'Product/Project Manager'.
# Get the list of activity columns from the survey and a more compact list of labels for easier plotting.

activities_2019_survey = ['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Analyze and understand data to influence product or business decisions', 
                          'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data',
                          'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build prototypes to explore applying machine learning to new areas', 
                          'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run a machine learning service that operationally improves my product or workflows',
                          'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Experimentation and iteration to improve existing ML models', 
                          'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Do research that advances the state of the art of machine learning', 
                          'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - None of these activities are an important part of my role at work', 
                          'Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Other']

activities_2019_survey_output = ['Analyze and understand data', 
                                 'Build/run the data infrastructure for storing/analyzing/operationalizing data', 
                                 'Build prototypes to explore applying ML to new areas', 
                                 'Build/run a ML service that operationally improves my product/workflows', 
                                 'Experimentation and iteration to improve existing ML models', 
                                 'Do research that advances the state of the art of ML', 
                                 'None of these', 
                                 'Other']

# Select necessary part of the dataframe related to 'Product/Project Manager' job title
temp_df = survey_2019[(survey_2019['Select the title most similar to your current role (or most recent title if retired): - Selected Choice'] == 'Product/Project Manager')]
project_manager_list = []

for i in activities_2019_survey:
    # Count the non-null answers for each activity and add the count to the list.
    project_manager_list.append(temp_df[i].count())

# Create a dataframe from the list.
project_manager_activities = pd.DataFrame(
    project_manager_list,
    index = activities_2019_survey_output,
    columns =['Count']
)

# Set plotting parameters and plot the frame.
plt.rcParams.update({'figure.figsize': (10, 7)})
plt.rcParams.update({'font.size': 18})
ax = project_manager_activities.plot(kind = 'barh')
ax.set_title('Product/Project Manager activities distribution')
ax.set_xlabel('Amount of respondents')
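
   The counting in the cell above relies on how "select all that apply" questions are stored. As far as I can tell from the column names, each option has its own column, which holds the option text when it was selected and is empty otherwise. Under that assumption, a minimal sketch with made-up rows shows why `count()` gives the number of selections:

```python
import pandas as pd

# Hypothetical rows, not survey data. Each option has its own column:
# the cell holds the option text if it was selected, and is missing otherwise.
toy = pd.DataFrame({
    'Analyze data': ['Analyze data', None, 'Analyze data'],
    'Build prototypes': [None, None, 'Build prototypes'],
})

# count() ignores missing values, so it returns the number of respondents
# who selected each option.
selections = toy.count()
print(selections)
```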

   As expected, they analyze data, build prototypes and explore applying machine learning to their new projects. It is a good sign for the data science field.

   Besides methods, titles and roles, there is another factor that can describe our companies and the challenges they may face when integrating data science into their business. Let's look at the information about the data scientists themselves. What resources are widely available? Who uses them and how?

   Below is a table with values relative to the total number of respondents in each company size group. This table is much more useful for comparison, since it removes the difference in the number of respondents between groups.


In [None]:
# Learning platform usage across company sizes, re-scaled according to amount of respondents in
# the appropriate group.
# List of the platforms.

learning_platform = ['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Udacity',
                     'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera',
                     'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX',
                     'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp', 
                     'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataQuest',
                     'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Courses (i.e. Kaggle Learn)', 
                     'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Fast.ai', 
                     'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Udemy', 
                     'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - LinkedIn Learning', 
                     'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - University Courses (resulting in a university degree)', 
                     'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - None', 
                     'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Other']

# List of the output platforms for more readable format.
learning_platform_output = ['Udacity', 
                            'Coursera', 
                            'edX', 
                            'DataCamp', 
                            'DataQuest', 
                            'Kaggle Courses', 
                            'Fast.ai', 
                            'Udemy', 
                            'LinkedIn Learning', 
                            'University Courses', 
                            'None', 
                            'Other']

multiple_columns_to_table(
    learning_platform,
    learning_platform_output,
    'What is the size of the company where you are employed?',
    True
)
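
   The helper `multiple_columns_to_table` is defined earlier in the notebook. As a hedged sketch of what a re-scaled multi-select table means (not the helper's actual code), the shares can be computed with `groupby` on made-up data. The column name `company_size` and all values are assumptions.

```python
import pandas as pd

# Hypothetical respondents, not survey data. A platform cell is missing
# when the respondent did not select that platform.
toy = pd.DataFrame({
    'company_size': ['small', 'small', 'large', 'large', 'large', 'large'],
    'Coursera': ['Coursera', None, 'Coursera', 'Coursera', None, 'Coursera'],
    'Udemy': [None, 'Udemy', None, 'Udemy', None, None],
})

grouped = toy.groupby('company_size')

# count() gives selections per platform in each group,
# size() gives the number of respondents in each group.
counts = grouped[['Coursera', 'Udemy']].count()
share_percent = counts.div(grouped.size(), axis=0).mul(100).round(1)
print(share_percent)
```

   Because respondents can select several platforms, the shares in one row do not have to add up to 100%.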

   We can see that Coursera, Kaggle and Udemy are at the top of the learning platforms, with university courses in fourth place. Why is a university degree not so valuable? Is it because this is a new field for the universities themselves, and most of our data scientists came from computer science, science or an entirely different field? Is university education very expensive while not providing anything specific that you cannot find on Kaggle, Udemy or Coursera? Or are Coursera, Udemy and Kaggle only the first steps towards a data scientist job? Or is the data science field becoming easier to understand and implement without any specific university knowledge? As frequently happens in life, one answer can bring many more questions!

   All 5 groups are very interested in self-education and use multiple sources. The top group is '**>10,000 employees**', but a much more interesting group is another one: '**250-999 employees**'. This group is second from the top and, despite having the smallest number of respondents, it has highly educated professionals. I expect this group to grow next year, because it looks like they are here not just for fun, but for the data science!

   Now let's go to the information resources in the data science field. The data is relative to the number of respondents in each company size group. And again we see a highly educated, almost evenly spread audience, with Kaggle, Blogs and YouTube in the top three. Good job! Our respondents are more aware than anyone of the value of information and the need to keep studying throughout life.


In [None]:
# Data source usage across company_size, re-scaled according to the amount of respondents in the appropriate group.
# List of the datasources.

data_source = ['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Twitter (data science influencers)',
               'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Hacker News (https://news.ycombinator.com/)', 
               'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Reddit (r/machinelearning, r/datascience, etc)', 
               'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Kaggle (forums, blog, social media, etc)', 
               'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Course Forums (forums.fast.ai, etc)', 
               'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - YouTube (Cloud AI Adventures, Siraj Raval, etc)', 
               'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Podcasts (Chai Time Data Science, Linear Digressions, etc)', 
               'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Blogs (Towards Data Science, Medium, Analytics Vidhya, KDnuggets etc)', 
               'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Journal Publications (traditional publications, preprint journals, etc)', 
               'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Slack Communities (ods.ai, kagglenoobs, etc)', 
               'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - None']

# List of the output datasources for more readable format.
data_source_output = ['Twitter', 
                      'Hacker News', 
                      'Reddit', 
                      'Kaggle', 
                      'Course Forums', 
                      'YouTube', 
                      'Podcasts', 
                      'Blogs', 
                      'Journal Publications', 
                      'Slack Communities', 
                      'None']

multiple_columns_to_table(
    data_source,
    data_source_output,
    'What is the size of the company where you are employed?',
    True
)

   The last, but not least important, thing I want to explore is a cross-section of the coding experience of our data scientists by company size. How much coding expertise do the companies accumulate? The data is relative to the number of respondents in each group.

In [None]:
# Distribution of the coding experience across the company types, re-scaled data for comparison.

column_to_table(
    'How long have you been writing code to analyze data (at work or at school)?',
    'What is the size of the company where you are employed?',
    True
).style.background_gradient(cmap='Blues')

   What can I say: it is nearly impossible not to code in this field, yet 10+ years of coding experience is not required to be a data scientist. Once again, the '**>10,000 employees**' and '**1,000-9,999 employees**' groups are at the top in expertise, with nearly half of their respondents having more than 3 years of coding experience. From here we can conclude that only the smallest companies have significantly less coding expertise than anyone else.
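
   A figure such as "nearly half with more than 3 years" comes from adding up the experience buckets of the re-scaled table. Here is a minimal sketch of that sum on made-up shares. The bucket labels and numbers are assumptions for illustration, not the survey's exact categories or results.

```python
import pandas as pd

# Hypothetical shares in percent per company-size group, not survey results.
experience = pd.DataFrame(
    {'small': [30.0, 30.0, 20.0, 15.0, 5.0],
     'large': [15.0, 25.0, 25.0, 20.0, 15.0]},
    index=['< 1 years', '1-2 years', '3-5 years', '5-10 years', '10+ years'],
)

# Add up the buckets that correspond to at least 3 years of coding.
three_plus = experience.loc[['3-5 years', '5-10 years', '10+ years']].sum()
print(three_plus)
```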




                                             Conclusion

   We did a deep and extensive exploration of the companies and their workers represented in the Kaggle 2019 survey. It was very interesting to meet these companies and try to understand their needs and challenges. Different company size groups have different motivations and attitudes towards data science, and they also improve at a different pace.

   We can say that smaller companies are very interested in data science as a way to improve their workflow, to save revenue and maybe to explore new fields. ML is at the cutting edge of current technology, and it is no wonder that some small companies (probably startups) build on this idea. So in a time of constant change and process automation, ML is a survival tool for small companies. They have their own challenges, such as the inability to create their own datasets and small ML departments, but thanks to their small size they are very flexible. They can clearly feel in which direction the wind blows, and the source of this wind is the huge companies.

   Big companies have the resources to do deep research, gather data and create datasets, improve algorithms and explore new math functions. They have made the data science field less science specific and more approachable to others. They have big data science departments and the ability to explore the data field in all directions. They are able to release development platforms and tools for the community free of charge. And eventually they put data science into production. Big companies have their own challenges, such as a lack of human resources.

   What about the medium-sized companies? This is the most interesting part of our study. While big companies implemented ML techniques a long time ago and small companies are exploring and implementing them, medium-sized companies are mostly hesitating. They have more resources than the small companies, but they already have a settled business process and do not see the need for change. So machine learning is not being adopted top-down or bottom-up. Instead, we see a top, then bottom, then middle order. And I think medium-sized businesses will eventually realize that ML is not a faraway future, but our current reality. They should not resist or hesitate; they should try to find their own way of implementing this technology into their business model.

   P.S. As a side note, I would like to ask the survey authors to include the following questions in next year's survey. They would make our dataset richer. Thank you!

   What is the main focus of your company's business?
   Who are your company's customers?
   Do you implement ML for your internal organizational processes, put it into the products you manufacture, or both?
   Do you use ML to replace outdated data processing technologies, or do you build new technologies with ML from scratch?