In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', None)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# Data Scientist: Should you work for a small or a big company?

Data Science is here to stay.

According to a [KDnuggets article](https://www.kdnuggets.com/2017/10/how-choose-data-science-job.html):
> Companies are struggling to fill roles. Data Science jobs stay open days longer than the more traditional job postings.

Companies compete for the best candidates, yet working out what the key tasks of a specific role will be is usually not straightforward.

Monzo, in an [article about its Data team](https://monzo.com/blog/2019/11/04/how-we-scaled-our-data-team-from-1-to-30-people-part-1), says:
> Adding an ML component to our data roles also meant we changed our titles to "Data Scientist". And based on some anecdata, this helps us attract more people to our roles. A few years back we A/B tested different titles for open data roles. And by changing the titles from Data Analyst to Data Scientist (without changing the job description), we could attract close to twice as many applications, of significantly higher quality too!

Data Scientist roles can vary significantly from company to company, and data professionals often find it hard to work out which type of company suits them best. 
A common dilemma is whether to apply to small or to larger companies. The question is a sensible one: a Data Scientist's profile, role, and responsibilities can end up being quite different in a start-up and in a big corporation.

So, what should you expect when aiming for a role in a small company? Is working for a bigger company a promise of better pay or a more diverse team?

We look at what differentiates a Data Scientist role at a big or a small company, based on characteristics at a personal level, on the working environment, and on the business's choices & activities.

Let's dive in!

<span style="background-color:#f7a385;padding:10px;">SPOILER ALERT!</span>

<div class="alert alert-info">
<br>
KEY TAKEAWAYS
<hr>
<ul>
<li> Roughly 1 in 5 Data Scientists is female, regardless of the company's size</li> 
<li> Small-size companies hire younger professionals and people with more diverse educational backgrounds</li> 
<li> 1 in 5 small companies are in India - 1 in 4 medium & large companies are in the US </li>
<li>Salary patterns are the same for all company types in the US, while in India there is a salary gap between small and large ones.</li>
    <li>The larger the company the more experienced (in years) its Data Scientists are in coding</li>
    <li>Half of the largest companies have well-established Machine Learning methods</li>
    <li>Programming Languages: Python in top place for every company - R and SQL not so much preferred by small ones</li>
    <li>ML frameworks: Scikit-learn in top place for every company type - LightGBM and Xgboost in large companies, Tensorflow in small ones</li>
</ul>   </div>

## Overview

In [None]:
schema = pd.read_csv('/kaggle/input/kaggle-survey-2019/survey_schema.csv')
schema.head(2)

In [None]:
responses = pd.read_csv('/kaggle/input/kaggle-survey-2019/other_text_responses.csv')
responses

In [None]:
m_choice_df = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')
m_choice_df.head()

### Focusing on the Data Scientist role
The distribution of the role titles shows that 20% of the respondents are Data Scientists.

We'll focus on the subset of Data Scientists in this analysis, excluding the rest of the data roles.

In [None]:
def custom_format(ax):
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["left"].set_visible(False)
    ax.tick_params(axis='both', labelsize=12, colors='#565656', bottom=False, left=False)
    ax.set_ylabel('');
    ax.set_xlabel('');
    return ax

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
sns.countplot(y='Q5', 
              data=m_choice_df.iloc[1:], 
              order=m_choice_df['Q5'].value_counts().index[:-1],
              color='#66C2A5');
ax = custom_format(ax)
sns.set_style("whitegrid")

In [None]:
# Row 0 holds the question text, so it is excluded from the denominator.
responses_only = m_choice_df.iloc[1:]
print('{0:.1f}% of participants are Data Scientists.'.format(
    (responses_only['Q5'] == 'Data Scientist').mean() * 100
))

In [None]:
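# .copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning.
data_scientists = m_choice_df.loc[m_choice_df['Q5']=='Data Scientist'].copy()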
data_scientists.head()

### Company size
What's the size of the companies Data Scientists work for? Company size is the variable the rest of this analysis is built on. It comes from the survey question asking how many employees work at the respondent's current company.
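
A side note on ordering: the company-size labels are plain strings, so sorting them alphabetically does not follow company size. Below is a minimal sketch of one way around this, assuming an ordered categorical is acceptable here. `SIZE_ORDER` and `toy` are illustrative names that are not part of this notebook, and the toy rows are made up.

```python
import pandas as pd

# Company-size labels in their natural order (same labels as the Q6 answers).
SIZE_ORDER = ['0-49 employees', '50-249 employees', '250-999 employees',
              '1000-9,999 employees', '> 10,000 employees']

# Toy stand-in for the Q6 column (made-up rows, not survey data).
toy = pd.DataFrame({'Q6': ['> 10,000 employees', '0-49 employees',
                           '250-999 employees', '0-49 employees',
                           '50-249 employees', '1000-9,999 employees']})

# An ordered categorical sorts by SIZE_ORDER instead of alphabetically.
toy['Q6'] = pd.Categorical(toy['Q6'], categories=SIZE_ORDER, ordered=True)
size_counts = toy['Q6'].value_counts().sort_index()
print(size_counts)
```

With the column stored this way, sorting follows company size directly, which could stand in for the explicit `order=` and `reindex` lists.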

In [None]:
fig, ax = plt.subplots(figsize=(8, 4))
sns.countplot(y=data_scientists['Q6'], 
              color='#66C2A5',
              order=['0-49 employees', '50-249 employees', '250-999 employees', '1000-9,999 employees', '> 10,000 employees']);
ax = custom_format(ax)
ax.set_title('Number of Data Scientist respondents per company size');

> Most of them are either in small businesses, with 0-49 employees, or in large ones, with over 10,000.

We're curious to see who the Data Scientists in all those companies really are!

# Person-related characteristics
### **Gender**: Male is the dominating gender independent of the company's size
Grouping the Data Scientists per company size allows us to create a gender profile for each company category and explore their differences. 
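
The same "profile per company size" idea comes back for every characteristic below, so here is a minimal sketch of it on made-up data. It uses `pd.crosstab` with `normalize='index'` rather than the groupby / value_counts / pivot chain used in this notebook. The helper name `percentage_by_company_size` and the `toy` frame are hypothetical.

```python
import pandas as pd

def percentage_by_company_size(df, column, size_col='Q6'):
    """Percentage of each `column` answer within each company size."""
    # normalize='index' makes every company-size row sum to 1 before scaling.
    return pd.crosstab(df[size_col], df[column], normalize='index').mul(100)

# Made-up rows standing in for the Data Scientist subset.
toy = pd.DataFrame({
    'Q6': ['0-49 employees'] * 4 + ['> 10,000 employees'] * 2,
    'Q2': ['Male', 'Male', 'Male', 'Female', 'Male', 'Female'],
})

profile = percentage_by_company_size(toy, 'Q2')
print(profile.round(1))
```

Each row of `profile` is one company size and sums to 100, which is the shape the heatmaps below expect after transposing.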

In [None]:
fig, ax = plt.subplots(figsize=(6, 4))
sns.countplot(y='Q2', 
              data=m_choice_df.iloc[1:], 
              order=m_choice_df['Q2'].value_counts().index[:-1],
              color='#66C2A5');
ax = custom_format(ax)
ax.set_title('Number of respondents per gender (all roles)');

> Overall, there are many more male than female respondents. Roughly 1 in 5 is a woman.

In [None]:
gender_counts = (data_scientists.groupby(['Q6'])['Q2']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
gender_counts.head()

In [None]:
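# Keyword arguments: recent pandas versions no longer accept positional arguments in pivot().
gender_pivoted_q6 = gender_counts.pivot(index='Q6', columns='Q2', values='percentage')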
gender_pivoted_q6 = gender_pivoted_q6.reindex(index=["0-49 employees", 
                                                        "50-249 employees", 
                                                        "250-999 employees", 
                                                        "1000-9,999 employees", 
                                                        "> 10,000 employees"])

gender_pivoted_q6.head()

In [None]:
fig, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(gender_pivoted_q6.T, 
            cmap="Greens",
            annot=True);
ax.set_title('Percentage of Data Scientists per gender on each company size');
ax.set_xlabel('');
ax.set_ylabel('');

> We encounter exactly the same picture in all types of companies: male-dominated with little representation from the rest of the gender groups.

How do you feel about this picture? It's discouraging to see the gender gap repeat almost identically across every company size, and unfortunately nothing in this data suggests otherwise.

### **Age**: Small-size companies hire younger professionals


In [None]:
fig, ax = plt.subplots(figsize=(10, 4))
sns.countplot(y='Q1', 
              data=m_choice_df.iloc[1:], 
              order=['18-21', '22-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-69', '70+'],
              color='#66C2A5')
ax = custom_format(ax)

> * Respondents skew towards the younger age groups.
* It's quite encouraging that we have some representation from older ages too.

Here we build a Data Scientist's age profile for each company category.

In [None]:
age_counts = (data_scientists.groupby(['Q6'])['Q1']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
age_counts.head()

In [None]:
age_pivoted_q6 = age_counts.pivot(index='Q6', columns='Q1', values='percentage')
age_pivoted_q6 = age_pivoted_q6.reindex(index=["0-49 employees", 
                                                        "50-249 employees", 
                                                        "250-999 employees", 
                                                        "1000-9,999 employees", 
                                                        "> 10,000 employees"])

age_pivoted_q6.head()

In [None]:
fig, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(age_pivoted_q6.T, annot=True, cmap = "Greens");
ax.set_title('Percentage of Data Scientists per age on each company size');
ax.set_xlabel('');
ax.set_ylabel('');

> * Companies with fewer employees (especially the ones with 0-49 employees) tend to hire younger professionals.
* Medium and large-size companies appear to have greater diversity of people in terms of age.
* In all types of companies, the majority of Data Scientists are people between 22 and 39 years old.

Does this mean that younger people are attracted early in their career by small companies, potentially start-ups? Or that small companies tend to hire younger people aiming probably for folk with less experience but higher motivation?

### **Education**: Small companies hire people with more diversity in education

In [None]:
fig, ax = plt.subplots(figsize=(10, 4))
sns.countplot(y='Q4', 
              data=m_choice_df.iloc[1:], 
              order=['No formal education past high school',
                                                      'Some college/university study without earning a bachelor’s degree',
                                                      'Professional degree',
                                                      'Bachelor’s degree',
                                                      'Master’s degree',
                                                      'Doctoral degree',
                                                      'I prefer not to answer'
                                                     ],
              color='#66C2A5')
ax = custom_format(ax)

>* Master's and Bachelor's are the most common qualifications in the Data Science world.
* Doctoral degree follows next.
* There are people with no formal education past high school who made it to Data Science, showing that there are several paths one can follow to become a Data Scientist.

A Data Scientist's education profile for each company size reveals more about the practices of each one when hiring.

In [None]:
degree_counts = (data_scientists.groupby(['Q6'])['Q4']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
degree_counts.head(5)

In [None]:
degree_pivoted_q6 = degree_counts.pivot(index='Q6', columns='Q4', values='percentage')
degree_pivoted_q6 = degree_pivoted_q6.reindex(index=["0-49 employees", 
                                                        "50-249 employees", 
                                                        "250-999 employees", 
                                                        "1000-9,999 employees", 
                                                        "> 10,000 employees"],
                                             columns=['No formal education past high school',
                                                      'Some college/university study without earning a bachelor’s degree',
                                                      'Professional degree',
                                                      'Bachelor’s degree',
                                                      'Master’s degree',
                                                      'Doctoral degree',
                                                      'I prefer not to answer'
                                                     ])

degree_pivoted_q6.head()

In [None]:
fig, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(degree_pivoted_q6.T, annot=True, cmap = "Greens");
ax.set_title('Percentage of Data Scientists per education type on each company size');
ax.set_xlabel('');
ax.set_ylabel('');

> Small companies (0-49 employees) appear to hire professionals with greater diversity in their educational background.

As a data person, would you pursue an MSc or a Doctoral degree to enhance your chances of getting a position in a larger company?

### **Country**: 1 in 5 small companies are in India - 1 in 4 med & large companies are in the US

In [None]:
fig, ax = plt.subplots(figsize=(10, 12))
sns.countplot(y='Q3', 
              data=m_choice_df.iloc[1:], 
              order=m_choice_df['Q3'].value_counts().index[:-1],
              color='#66C2A5')
ax = custom_format(ax)

>* No surprises here, India and the USA are in the first places.
* The first 10 places are dominated by countries in Europe, North and South America, and Asia.
* Africa only appears in 12th place, with Nigeria.

How does the country people reside in relate to the company size?

In [None]:
country_counts = (data_scientists.groupby(['Q6'])['Q3']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
country_counts.head()

In [None]:
country_pivoted_q6 = country_counts.pivot(index='Q6', columns='Q3', values='percentage')
country_pivoted_q6 = country_pivoted_q6.reindex(index=["0-49 employees", 
                                                        "50-249 employees", 
                                                        "250-999 employees", 
                                                        "1000-9,999 employees", 
                                                        "> 10,000 employees"])

country_pivoted_q6.head()

In [None]:
fig, ax = plt.subplots(figsize=(20, 16))
sns.heatmap(country_pivoted_q6.T.sort_values('0-49 employees', ascending=False), 
            cmap="Greens",
            annot=True);
ax.set_title('Percentage of Data Scientists per country on each company size');
ax.set_xlabel('');
ax.set_ylabel('');

> * 1 in 5 small companies are in India 
* 1 in 4 med & large companies are in the US

Long story short, you should move to India if you want to work for a smaller company or go to the USA for a big corporation!


### **Salary**: The picture changes based on the country

A very hot topic in the Data Science world: Does the salary vary with the company size?

In [None]:
salary_counts = (data_scientists.groupby(['Q6'])['Q10']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
salary_counts.head()

In [None]:
salary_pivoted_q6 = salary_counts.pivot(index='Q6', columns='Q10', values='percentage')
salary_pivoted_q6 = salary_pivoted_q6.reindex(index=["0-49 employees", 
                                                        "50-249 employees", 
                                                        "250-999 employees", 
                                                        "1000-9,999 employees", 
                                                        "> 10,000 employees"], 
                                             columns=['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999',
                                                      '4,000-4,999', '5,000-7,499', '7,500-9,999', '10,000-14,999',
                                                      '15,000-19,999', '20,000-24,999', '25,000-29,999', '30,000-39,999',
                                                      '40,000-49,999', '50,000-59,999', '60,000-69,999', '70,000-79,999',
                                                      '80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999',
                                                      '150,000-199,999', '200,000-249,999', '250,000-299,999', '300,000-500,000',
                                                      '> $500,000'])

salary_pivoted_q6.head()

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(salary_pivoted_q6.T,
            cmap="Greens",
            annot=True);
ax.set_title('Worldwide: Percentage of Data Scientists per salary band on each company size');
ax.set_xlabel('');
ax.set_ylabel('');

We suspect the worldwide picture is dominated by the absolute differences in compensation between countries, so no useful conclusions can be drawn from this chart.

Motivated by this, let's explore how salary relates to company size individually for India and the US, the countries with the highest concentration of companies. We also check the UK out of curiosity and personal interest.
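
The salary bands are text labels, so any numeric comparison between countries first needs a mapping from label to number. Below is a minimal sketch, assuming the labels are formatted like the ones listed in the pivot cells of this section. `band_lower_bound` is a hypothetical helper, not something this notebook defines.

```python
import re

def band_lower_bound(band):
    """Return the lower bound of a salary band label as an int."""
    # Strip '$', ',', '>' and spaces so '> $500,000' becomes '500000'.
    cleaned = re.sub(r'[$,>\s]', '', band)
    # For a range such as '1000-1999', keep the part before the dash.
    return int(cleaned.split('-')[0])

bands = ['$0-999', '1,000-1,999', '100,000-124,999', '> $500,000']
print([band_lower_bound(b) for b in bands])
```

Lower bounds are a coarse summary, since the bands have very different widths, but they are enough to order the labels or to compute a rough median per group.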

#### India: Small companies don't pay well

In [None]:
salary_counts_india = (data_scientists.loc[data_scientists['Q3']=='India'].groupby(['Q6'])['Q10']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
salary_counts_india.head()

In [None]:
salary_pivoted_q6_india = salary_counts_india.pivot(index='Q6', columns='Q10', values='percentage')
salary_pivoted_q6_india = salary_pivoted_q6_india.reindex(index=["0-49 employees", 
                                                        "50-249 employees", 
                                                        "250-999 employees", 
                                                        "1000-9,999 employees", 
                                                        "> 10,000 employees"], 
                                             columns=['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999',
                                                      '4,000-4,999', '5,000-7,499', '7,500-9,999', '10,000-14,999',
                                                      '15,000-19,999', '20,000-24,999', '25,000-29,999', '30,000-39,999',
                                                      '40,000-49,999', '50,000-59,999', '60,000-69,999', '70,000-79,999',
                                                      '80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999',
                                                      '150,000-199,999', '200,000-249,999', '250,000-299,999', '300,000-500,000',
                                                      '> $500,000'])

salary_pivoted_q6_india.head()

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(salary_pivoted_q6_india.T,
            cmap="Greens",
            annot=True);
ax.set_title('India: Percentage of Data Scientists per salary band on each company size');
ax.set_xlabel('');
ax.set_ylabel('');

>* We notice a significantly high concentration of low-paid workers in small companies.
* Med-size companies also have the highest amount of low-paid people.
* The pay seems to be a bit fairer as the company size gets bigger.

Oops, change of plans: India has lots of small companies but they don't pay as much!

#### United States of America: Same pay patterns in all types of companies

In [None]:
salary_counts_us = (data_scientists.loc[data_scientists['Q3']=='United States of America'].groupby(['Q6'])['Q10']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
salary_counts_us.head()

In [None]:
salary_pivoted_q6_us = salary_counts_us.pivot(index='Q6', columns='Q10', values='percentage')
salary_pivoted_q6_us = salary_pivoted_q6_us.reindex(index=["0-49 employees", 
                                                        "50-249 employees", 
                                                        "250-999 employees", 
                                                        "1000-9,999 employees", 
                                                        "> 10,000 employees"], 
                                             columns=['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999',
                                                      '4,000-4,999', '5,000-7,499', '7,500-9,999', '10,000-14,999',
                                                      '15,000-19,999', '20,000-24,999', '25,000-29,999', '30,000-39,999',
                                                      '40,000-49,999', '50,000-59,999', '60,000-69,999', '70,000-79,999',
                                                      '80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999',
                                                      '150,000-199,999', '200,000-249,999', '250,000-299,999', '300,000-500,000',
                                                      '> $500,000'])

salary_pivoted_q6_us.head()

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(salary_pivoted_q6_us.T,
            cmap="Greens",
            annot=True);
ax.set_title('USA: Percentage of Data Scientists per salary band on each company size');
ax.set_xlabel('');
ax.set_ylabel('');

>* The situation appears to be pretty much the same here in terms of pay across companies of different sizes.

#### United Kingdom: No evidence of pay difference

In [None]:
salary_counts_uk = (data_scientists.loc[data_scientists['Q3']=='United Kingdom of Great Britain and Northern Ireland'].groupby(['Q6'])['Q10']
                 .value_counts(normalize=True)
                 .rename('percentage')
                 .mul(100)
                 .reset_index()
                 .sort_values('Q6'))
salary_counts_uk.head()

In [None]:
salary_pivoted_q6_uk = salary_counts_uk.pivot(index='Q6', columns='Q10', values='percentage')
salary_pivoted_q6_uk = salary_pivoted_q6_uk.reindex(index=["0-49 employees", 
                                                        "50-249 employees", 
                                                        "250-999 employees", 
                                                        "1000-9,999 employees", 
                                                        "> 10,000 employees"], 
                                             columns=['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999',
                                                      '4,000-4,999', '5,000-7,499', '7,500-9,999', '10,000-14,999',
                                                      '15,000-19,999', '20,000-24,999', '25,000-29,999', '30,000-39,999',
                                                      '40,000-49,999', '50,000-59,999', '60,000-69,999', '70,000-79,999',
                                                      '80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999',
                                                      '150,000-199,999', '200,000-249,999', '250,000-299,999', '300,000-500,000',
                                                      '> $500,000'])

salary_pivoted_q6_uk.head()

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(salary_pivoted_q6_uk.T,
            cmap="Greens",
            annot=True);
ax.set_title('UK: Percentage of Data Scientists per salary band on each company size');
ax.set_xlabel('');
ax.set_ylabel('');

>* The absence of data in some cases here doesn't help to draw safe conclusions.
* There is no evidence to claim differences in pay.

## **Code experience**: The larger the company the more experienced Data Scientists are in coding

Respondents were asked how long they have been writing code.


In [None]:
fig, ax = plt.subplots(figsize=(8,4))
sns.countplot(y='Q15', color='#66C2A5', data=data_scientists, order=['I have never written code', '< 1 years', '1-2 years',
                                                         '3-5 years', '5-10 years', '10-20 years', '20+ years',
                                                        ])
custom_format(ax);

In [None]:
code_counts_uk = (data_scientists.groupby(['Q6'])['Q15']
                 .value_counts(normalize=True)
                 .rename('percentage')
                 .mul(100)
                 .reset_index()
                 .sort_values('Q6'))
code_counts_uk.head()

In [None]:
code_pivoted_q6_uk = code_counts_uk.pivot(index='Q6', columns='Q15', values='percentage')
code_pivoted_q6_uk = code_pivoted_q6_uk.reindex(index=["0-49 employees", 
                                                        "50-249 employees", 
                                                        "250-999 employees", 
                                                        "1000-9,999 employees", 
                                                        "> 10,000 employees"], 
                                                columns=['I have never written code', '< 1 years', '1-2 years',
                                                         '3-5 years', '5-10 years', '10-20 years', '20+ years',
                                                        ]
                                             )

code_pivoted_q6_uk.head()

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(code_pivoted_q6_uk.T,
            cmap="Greens",
            annot=True);
ax.set_title('Percentage of Data Scientists per code experience on each company size');
ax.set_xlabel('');
ax.set_ylabel('');

> * In very small companies 15% of Data Scientists have less than a year of code experience.
* In all cases most of the workers have 3-5 years of coding experience.
* The most experienced people are mostly in large companies.

As an experienced Data Scientist, would you be reluctant to join a new company?

# Company-related characteristics
### **Machine Learning**: Half of the largest companies have well-established ML methods

In [None]:
fig, ax = plt.subplots(figsize=(8, 4))
sns.countplot(y=data_scientists['Q8'], 
              color='#66C2A5')
ax = custom_format(ax)

In [None]:
ml_counts = (data_scientists.groupby(['Q6'])['Q8']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
ml_counts.head()

In [None]:
ml_pivoted_q6 = ml_counts.pivot(index='Q6', columns='Q8', values='percentage')
ml_pivoted_q6 = ml_pivoted_q6.reindex(index=["0-49 employees", 
                                                        "50-249 employees", 
                                                        "250-999 employees", 
                                                        "1000-9,999 employees", 
                                                        "> 10,000 employees"],
                                     columns=['We have well established ML methods (i.e., models in production for more than 2 years)',
                                              'We recently started using ML methods (i.e., models in production for less than 2 years)',
                                              'We are exploring ML methods (and may one day put a model into production)',
                                              'We use ML methods for generating insights (but do not put working models into production)',
                                              'No (we do not use ML methods)',
                                              'I do not know',
                                             ])

ml_pivoted_q6.head()

In [None]:
fig, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(ml_pivoted_q6.T, annot=True, cmap = "Greens");
ax.set_title('Percentage of Data Scientists per ML presence on each company size');
ax.set_xlabel('');
ax.set_ylabel('');

> * The smaller the size of the company the earlier the stage of their work with the ML methods.
* Almost half of the companies with >10k employees have well-established ML methods.


Machine Learning for the win: at least you know that whatever the size of the company you will be doing **some** ML. Would you join a big company and sacrifice your colleagues' diversity in education (as found above) to do **more** Machine Learning? 

### **Activities**: _Exploring applying machine learning to new areas_ is done mostly in large companies

What are the main activities that make up an important part of a Data Scientist's role at work?

Here, we're looking at the share of people who undertake specific activities at work, per company size.
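
Each activity is a separate survey column that holds the activity text when ticked and is empty otherwise. Below is a minimal sketch of turning such columns into a share per company size in one step, on made-up rows. The `toy` frame and its shortened answer texts are illustrative only.

```python
import pandas as pd

# Made-up rows: a ticked activity is stored as text, an unticked one as None.
toy = pd.DataFrame({
    'Q6': ['0-49 employees', '0-49 employees',
           '> 10,000 employees', '> 10,000 employees'],
    'Q9_Part_1': ['Analyze data', None, 'Analyze data', 'Analyze data'],
    'Q9_Part_3': [None, None, 'Build prototypes', None],
})
activity_cols = ['Q9_Part_1', 'Q9_Part_3']

# notna() flags ticked activities; the group mean of those flags is the share.
share = toy[activity_cols].notna().groupby(toy['Q6']).mean().mul(100)
print(share)
```

The result has one row per company size and one column per activity, so it can be plotted without merging one table per activity.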

In [None]:
fields = ['Q9_Part_1', 'Q9_Part_2', 'Q9_Part_3', 'Q9_Part_4', 
          'Q9_Part_5', 'Q9_Part_6', 'Q9_Part_7', 'Q9_Part_8'
         ]

In [None]:
for field in fields:
    # A ticked activity is stored as its text and an unticked one as NaN,
    # so notna() gives the same 1/0 flag without a per-row lambda.
    data_scientists[field+'_new'] = data_scientists[field].notna().astype(int)

In [None]:
q9_part1_counts = (data_scientists.groupby(['Q6'])['Q9_Part_1_new']
                     .value_counts(normalize=True)
                     .rename('Analyze and understand data to influence product or business decisions')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q9_part1_counts = q9_part1_counts.loc[q9_part1_counts['Q9_Part_1_new']==1].drop(columns=['Q9_Part_1_new'])

In [None]:
q9_part2_counts = (data_scientists.groupby(['Q6'])['Q9_Part_2_new']
                     .value_counts(normalize=True)
                     .rename('Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q9_part2_counts = q9_part2_counts.loc[q9_part2_counts['Q9_Part_2_new']==1].drop(columns=['Q9_Part_2_new'])


In [None]:
q9_part3_counts = (data_scientists.groupby(['Q6'])['Q9_Part_3_new']
                     .value_counts(normalize=True)
                     .rename('Build prototypes to explore applying machine learning to new areas')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q9_part3_counts = q9_part3_counts.loc[q9_part3_counts['Q9_Part_3_new']==1].drop(columns=['Q9_Part_3_new'])


In [None]:
q9_part4_counts = (data_scientists.groupby(['Q6'])['Q9_Part_4_new']
                     .value_counts(normalize=True)
                     .rename('Build and/or run a machine learning service that operationally improves my product or workflows')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q9_part4_counts = q9_part4_counts.loc[q9_part4_counts['Q9_Part_4_new']==1].drop(columns=['Q9_Part_4_new'])


In [None]:
q9_part5_counts = (data_scientists.groupby(['Q6'])['Q9_Part_5_new']
                     .value_counts(normalize=True)
                     .rename('Experimentation and iteration to improve existing ML models')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q9_part5_counts = q9_part5_counts.loc[q9_part5_counts['Q9_Part_5_new']==1].drop(columns=['Q9_Part_5_new'])


In [None]:
q9_part6_counts = (data_scientists.groupby(['Q6'])['Q9_Part_6_new']
                     .value_counts(normalize=True)
                     .rename('Do research that advances the state of the art of machine learning')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q9_part6_counts = q9_part6_counts.loc[q9_part6_counts['Q9_Part_6_new']==1].drop(columns=['Q9_Part_6_new'])


In [None]:
q9_part7_counts = (data_scientists.groupby(['Q6'])['Q9_Part_7_new']
                     .value_counts(normalize=True)
                     .rename('None of these activities are an important part of my role at work')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q9_part7_counts = q9_part7_counts.loc[q9_part7_counts['Q9_Part_7_new']==1].drop(columns=['Q9_Part_7_new'])


In [None]:
q9_part8_counts = (data_scientists.groupby(['Q6'])['Q9_Part_8_new']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q9_part8_counts = q9_part8_counts.loc[q9_part8_counts['Q9_Part_8_new']==1]


In [None]:
subtables = [q9_part3_counts,
            q9_part4_counts, q9_part5_counts, q9_part6_counts,
            q9_part7_counts]

In [None]:
merged = q9_part1_counts.merge(q9_part2_counts, on='Q6')
for subtable in subtables:
    merged = merged.merge(subtable, on='Q6')
merged.head()

In [None]:
merged = merged.set_index('Q6').unstack().to_frame()
merged.head()

In [None]:
merged = merged.reset_index()
merged = merged.rename(columns={'level_0': 'Task', 'Q6': 'Company size', 0: 'percentage'})
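
The per-option cells above all follow the same pattern, so they could be collapsed into one helper. The following is only a minimal sketch: the function name `multiselect_share`, its parameters and the toy data are illustrative assumptions, not part of the original analysis. It relies on the fact that the mean of a 0/1 flag within a group equals the share of that group that selected the option.

```python
import pandas as pd

def multiselect_share(df, columns, labels, group_col="Q6", var_name="Task"):
    """Percentage of respondents per group who selected each multi-select option.

    `columns` are raw survey columns where a value means "selected" and NaN
    means "not selected"; `labels` are display names in the same order.
    """
    # 1 where the option was selected, 0 otherwise
    flags = df[columns].notna().astype(int)
    flags.columns = labels
    # the mean of a 0/1 flag within a group is the share of that group
    wide = flags.groupby(df[group_col]).mean().mul(100)
    # long format: one row per (group, option) pair, ready for sns.FacetGrid
    return (wide.reset_index()
                .melt(id_vars=group_col, var_name=var_name, value_name="percentage")
                .rename(columns={group_col: "Company size"}))

# toy data standing in for the survey (hypothetical values)
toy = pd.DataFrame({
    "Q6": ["0-49 employees", "0-49 employees",
           "> 10,000 employees", "> 10,000 employees"],
    "Q9_Part_1": ["Analyze data", None, "Analyze data", "Analyze data"],
    "Q9_Part_2": [None, None, "Build infrastructure", None],
})
shares = multiselect_share(toy, ["Q9_Part_1", "Q9_Part_2"],
                           ["Analyze", "Infrastructure"])
print(shares)
```

One difference from the merge-based cells: a company size where nobody selected an option shows up here with 0%, whereas the filter on `== 1` followed by an inner merge drops that company size entirely.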

In [None]:
import warnings
warnings.filterwarnings("ignore")

grid = sns.FacetGrid(merged, col="Company size", hue="Company size",
                     palette="Set2", col_wrap=1, height=3, sharey=True, sharex=True,
                    col_order=['0-49 employees', '50-249 employees', '250-999 employees',
                               '1000-9,999 employees', '> 10,000 employees']);
grid.map(sns.barplot, 'percentage', 'Task');
sns.despine();
plt.subplots_adjust(hspace=0.2, wspace=1.2);

>* "Build prototypes to explore applying machine learning to new areas" appears to be a much more common activity for people in medium and large companies.

No worries, Data Scientist: the number of employees doesn't say much about the work you'll be doing in the company!

### **DS workloads**: The larger the company, the more people work on data workloads

How many individuals are responsible for data science workloads at the business?

In [None]:
# Approximately how many individuals are responsible for data science workloads at your place of business?
fig, ax = plt.subplots(figsize=(9, 5))
sns.countplot(y=data_scientists['Q7'],
             order=['0', '1-2', '3-4', '5-9', '10-14', '15-19', '20+'], 
             color='#66C2A5');
custom_format(ax);

> Most of the companies that employ Data Scientists have 20 or more people responsible for data science workloads.

In [None]:
workload_counts = (data_scientists.groupby(['Q6'])['Q7']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
workload_counts.head()

In [None]:
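# keyword arguments: recent pandas versions no longer accept positional arguments in pivot
workload_pivoted_q6 = workload_counts.pivot(index='Q6', columns='Q7', values='percentage')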
workload_pivoted_q6 = workload_pivoted_q6.reindex(index=["0-49 employees", 
                                                        "50-249 employees", 
                                                        "250-999 employees", 
                                                        "1000-9,999 employees", 
                                                        "> 10,000 employees"],
                                                 columns=['0', '1-2', '3-4', '5-9', '10-14', '15-19', '20+'])

workload_pivoted_q6.head()

In [None]:
fig, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(workload_pivoted_q6.T, 
            cmap="Greens",
            annot=True);
ax.set_title('Share of respondents (%) by number of people responsible for DS workloads, per company size')
ax.set_xlabel('');
ax.set_ylabel('');

> It makes sense that the larger the company, the more people are responsible for the data science workload.

It would be more informative to compute the proportion of people working on data science relative to the total number of employees, but the survey only reports both numbers as ranges.
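
A rough approximation is still possible from the ranges. The sketch below assumes that every respondent sits at the rough midpoint of their range, and that the open-ended ranges ("20+" and "> 10,000 employees") sit at their lower bound. Both are strong assumptions introduced here only for illustration, and the toy rows are hypothetical, so the result should be read as an order of magnitude at best.

```python
import pandas as pd

# ASSUMPTION: rough range midpoints; open-ended ranges use their lower bound
ds_team_midpoint = {"0": 0, "1-2": 1.5, "3-4": 3.5, "5-9": 7,
                    "10-14": 12, "15-19": 17, "20+": 20}
company_midpoint = {
    "0-49 employees": 25,
    "50-249 employees": 150,
    "250-999 employees": 625,
    "1000-9,999 employees": 5500,
    "> 10,000 employees": 10000,
}

# toy rows standing in for survey answers (hypothetical values)
toy = pd.DataFrame({
    "Q6": ["0-49 employees", "50-249 employees", "> 10,000 employees"],
    "Q7": ["1-2", "5-9", "20+"],
})

# approximate share of employees working on data science, in percent
toy["ds_share_pct"] = (toy["Q7"].map(ds_team_midpoint)
                       / toy["Q6"].map(company_midpoint) * 100)
print(toy)
```

Because "20+" has no upper bound, the approximation underestimates the share for large companies, which is where most of the uncertainty lies.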

#### Data Science workload and ML presence
Surprisingly, some Data Scientists report that nobody is working on data science workloads at their company.

Let's look at why this happens: is it related to the presence of ML in the business (Q8)?

In [None]:
# create an auxiliary column that shows zero or more people in the DS workload
data_scientists['people_workload'] = data_scientists['Q7'].apply(lambda x: '0' if x=='0' else '1+')
data_scientists['people_workload'].value_counts()

In [None]:
people_workload_counts = (data_scientists.groupby(['people_workload'])['Q8']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('people_workload'))

In [None]:
fig, ax = plt.subplots(figsize=(8, 6));
sns.barplot(y="Q8", x="percentage", hue="people_workload", data=people_workload_counts, palette="Set2");
custom_format(ax);

> The absence of people working on data science workloads might be driven by the absence of ML in production.
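
The same comparison can be read off as a table rather than a chart. As a small sketch, `pd.crosstab` with `normalize="index"` gives the distribution of Q8 answers within each workload group. The toy answers below are hypothetical and the Q8 labels are shortened for readability.

```python
import pandas as pd

# toy data: hypothetical answers, shortened Q8 labels
toy = pd.DataFrame({
    "people_workload": ["0", "0", "1+", "1+", "1+", "1+"],
    "Q8": ["No ML", "No ML", "No ML",
           "Models in production", "Models in production", "Exploring ML"],
})

# each row sums to 100: distribution of Q8 answers within a workload group
table = pd.crosstab(toy["people_workload"], toy["Q8"], normalize="index").mul(100)
print(table.round(1))
```

Row-wise normalisation matters here because the "0" group is much smaller than the "1+" group, so raw counts would not be comparable.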

### **Programming Languages**: R and SQL are less popular in small companies
What programming languages do Data Scientists use on a regular basis?

We investigate whether the size of the working environment affects the choice of the main programming languages.

In [None]:
fields = ['Q18_Part_1', 'Q18_Part_2', 'Q18_Part_3', 'Q18_Part_4', 
          'Q18_Part_5', 'Q18_Part_6', 'Q18_Part_7', 'Q18_Part_8', 'Q18_Part_9', 'Q18_Part_10'
         ]

In [None]:
for field in fields:
    # flag an option as selected (1) when the answer is a string; NaN means not selected (0)
    data_scientists[field+'_new'] = data_scientists[field].apply(
        lambda x: 1 if isinstance(x, str) else 0
    )

In [None]:
q18_part1_counts = (data_scientists.groupby(['Q6'])['Q18_Part_1_new']
                     .value_counts(normalize=True)
                     .rename('Python')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q18_part1_counts = q18_part1_counts.loc[q18_part1_counts['Q18_Part_1_new']==1].drop(columns=['Q18_Part_1_new'])

In [None]:
q18_part2_counts = (data_scientists.groupby(['Q6'])['Q18_Part_2_new']
                     .value_counts(normalize=True)
                     .rename('R')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q18_part2_counts = q18_part2_counts.loc[q18_part2_counts['Q18_Part_2_new']==1].drop(columns=['Q18_Part_2_new'])

In [None]:
q18_part3_counts = (data_scientists.groupby(['Q6'])['Q18_Part_3_new']
                     .value_counts(normalize=True)
                     .rename('SQL')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q18_part3_counts = q18_part3_counts.loc[q18_part3_counts['Q18_Part_3_new']==1].drop(columns=['Q18_Part_3_new'])

In [None]:
q18_part4_counts = (data_scientists.groupby(['Q6'])['Q18_Part_4_new']
                     .value_counts(normalize=True)
                     .rename('C')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q18_part4_counts = q18_part4_counts.loc[q18_part4_counts['Q18_Part_4_new']==1].drop(columns=['Q18_Part_4_new'])

In [None]:
q18_part5_counts = (data_scientists.groupby(['Q6'])['Q18_Part_5_new']
                     .value_counts(normalize=True)
                     .rename('C++')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q18_part5_counts = q18_part5_counts.loc[q18_part5_counts['Q18_Part_5_new']==1].drop(columns=['Q18_Part_5_new'])

In [None]:
q18_part6_counts = (data_scientists.groupby(['Q6'])['Q18_Part_6_new']
                     .value_counts(normalize=True)
                     .rename('Java')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q18_part6_counts = q18_part6_counts.loc[q18_part6_counts['Q18_Part_6_new']==1].drop(columns=['Q18_Part_6_new'])

In [None]:
q18_part7_counts = (data_scientists.groupby(['Q6'])['Q18_Part_7_new']
                     .value_counts(normalize=True)
                     .rename('JavaScript')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q18_part7_counts = q18_part7_counts.loc[q18_part7_counts['Q18_Part_7_new']==1].drop(columns=['Q18_Part_7_new'])

In [None]:
q18_part8_counts = (data_scientists.groupby(['Q6'])['Q18_Part_8_new']
                     .value_counts(normalize=True)
                     .rename('TypeScript')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q18_part8_counts = q18_part8_counts.loc[q18_part8_counts['Q18_Part_8_new']==1].drop(columns=['Q18_Part_8_new'])

In [None]:
q18_part9_counts = (data_scientists.groupby(['Q6'])['Q18_Part_9_new']
                     .value_counts(normalize=True)
                     .rename('Bash')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q18_part9_counts = q18_part9_counts.loc[q18_part9_counts['Q18_Part_9_new']==1].drop(columns=['Q18_Part_9_new'])

In [None]:
q18_part10_counts = (data_scientists.groupby(['Q6'])['Q18_Part_10_new']
                     .value_counts(normalize=True)
                     .rename('MATLAB')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q18_part10_counts = q18_part10_counts.loc[q18_part10_counts['Q18_Part_10_new']==1].drop(columns=['Q18_Part_10_new'])

In [None]:
subtables = [q18_part3_counts,
            q18_part4_counts, q18_part5_counts, q18_part6_counts,
             q18_part7_counts, q18_part8_counts, q18_part9_counts, q18_part10_counts
            ]

In [None]:
merged = q18_part1_counts.merge(q18_part2_counts, on='Q6')
for subtable in subtables:
    merged = merged.merge(subtable, on='Q6')
merged.head()

In [None]:
merged = merged.set_index('Q6').unstack().to_frame()
merged.head()

In [None]:
merged = merged.reset_index()
merged = merged.rename(columns={'level_0': 'Language', 'Q6': 'Company size', 0: 'percentage'})

In [None]:
grid = sns.FacetGrid(merged, col="Company size", hue="Company size",
                     palette="Set2", col_wrap=3, height=3, sharey=True, sharex=True,
                    col_order=['0-49 employees', '50-249 employees', '250-999 employees',
                               '1000-9,999 employees', '> 10,000 employees']);
grid.map(sns.barplot, 'percentage', 'Language');
sns.despine();
plt.subplots_adjust(hspace=0.2, wspace=1.2);

>* Python dominates the Data Science community: no surprises here.
* SQL seems to be used less by DS professionals in small companies than in medium and large ones. The same seems to be happening with R.
* Overall, smaller companies have slightly more diversity in their tools.
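
One way to put a number on "diversity" is the average number of languages each respondent selects, per company size. This is only one possible measure, chosen here as an assumption, and the toy flags below are hypothetical rather than survey results. The sketch reuses the `Q18_Part_*_new` naming of the 0/1 columns created earlier.

```python
import pandas as pd

# toy 0/1 flags in the same shape as the Q18_Part_*_new columns (hypothetical values)
toy = pd.DataFrame({
    "Q6": ["0-49 employees", "0-49 employees",
           "> 10,000 employees", "> 10,000 employees"],
    "Q18_Part_1_new": [1, 1, 1, 1],
    "Q18_Part_2_new": [1, 0, 0, 0],
    "Q18_Part_3_new": [1, 1, 1, 0],
})

flag_cols = [c for c in toy.columns
             if c.startswith("Q18_Part_") and c.endswith("_new")]

# number of languages each respondent uses regularly
toy["n_languages"] = toy[flag_cols].sum(axis=1)

# average per company size
languages_per_person = toy.groupby("Q6")["n_languages"].mean()
print(languages_per_person)
```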

Learn Python and SQL to crack the big company's coding interview! Keep a slightly broader toolkit to make it into a smaller one!

### **Machine learning frameworks**: Small companies prefer TensorFlow, large ones prefer LightGBM and XGBoost

In [None]:
fields = ['Q28_Part_1', 'Q28_Part_2', 'Q28_Part_3', 'Q28_Part_4', 
          'Q28_Part_5', 'Q28_Part_6', 'Q28_Part_7', 'Q28_Part_8', 'Q28_Part_9', 'Q28_Part_10', 'Q28_Part_11'
         ]

In [None]:
for field in fields:
    # flag a framework as selected (1) when the answer is a string; NaN means not selected (0)
    data_scientists[field+'_new'] = data_scientists[field].apply(
        lambda x: 1 if isinstance(x, str) else 0
    )

In [None]:
q28_part1_counts = (data_scientists.groupby(['Q6'])['Q28_Part_1_new']
                     .value_counts(normalize=True)
                     .rename('Scikit-learn')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q28_part1_counts = q28_part1_counts.loc[q28_part1_counts['Q28_Part_1_new']==1].drop(columns=['Q28_Part_1_new'])

In [None]:
q28_part2_counts = (data_scientists.groupby(['Q6'])['Q28_Part_2_new']
                     .value_counts(normalize=True)
                     .rename('TensorFlow')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q28_part2_counts = q28_part2_counts.loc[q28_part2_counts['Q28_Part_2_new']==1].drop(columns=['Q28_Part_2_new'])

In [None]:
q28_part3_counts = (data_scientists.groupby(['Q6'])['Q28_Part_3_new']
                     .value_counts(normalize=True)
                     .rename('Keras')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q28_part3_counts = q28_part3_counts.loc[q28_part3_counts['Q28_Part_3_new']==1].drop(columns=['Q28_Part_3_new'])

In [None]:
q28_part4_counts = (data_scientists.groupby(['Q6'])['Q28_Part_4_new']
                     .value_counts(normalize=True)
                     .rename('RandomForest')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q28_part4_counts = q28_part4_counts.loc[q28_part4_counts['Q28_Part_4_new']==1].drop(columns=['Q28_Part_4_new'])

In [None]:
q28_part5_counts = (data_scientists.groupby(['Q6'])['Q28_Part_5_new']
                     .value_counts(normalize=True)
                     .rename('Xgboost')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q28_part5_counts = q28_part5_counts.loc[q28_part5_counts['Q28_Part_5_new']==1].drop(columns=['Q28_Part_5_new'])

In [None]:
q28_part6_counts = (data_scientists.groupby(['Q6'])['Q28_Part_6_new']
                     .value_counts(normalize=True)
                     .rename('Pytorch')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q28_part6_counts = q28_part6_counts.loc[q28_part6_counts['Q28_Part_6_new']==1].drop(columns=['Q28_Part_6_new'])

In [None]:
q28_part7_counts = (data_scientists.groupby(['Q6'])['Q28_Part_7_new']
                     .value_counts(normalize=True)
                     .rename('Caret')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q28_part7_counts = q28_part7_counts.loc[q28_part7_counts['Q28_Part_7_new']==1].drop(columns=['Q28_Part_7_new'])

In [None]:
q28_part8_counts = (data_scientists.groupby(['Q6'])['Q28_Part_8_new']
                     .value_counts(normalize=True)
                     .rename('LightGBM')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q28_part8_counts = q28_part8_counts.loc[q28_part8_counts['Q28_Part_8_new']==1].drop(columns=['Q28_Part_8_new'])

In [None]:
q28_part9_counts = (data_scientists.groupby(['Q6'])['Q28_Part_9_new']
                     .value_counts(normalize=True)
                     .rename('Spark MLlib')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q28_part9_counts = q28_part9_counts.loc[q28_part9_counts['Q28_Part_9_new']==1].drop(columns=['Q28_Part_9_new'])

In [None]:
q28_part10_counts = (data_scientists.groupby(['Q6'])['Q28_Part_10_new']
                     .value_counts(normalize=True)
                     .rename('Fast.ai')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q28_part10_counts = q28_part10_counts.loc[q28_part10_counts['Q28_Part_10_new']==1].drop(columns=['Q28_Part_10_new'])

In [None]:
q28_part11_counts = (data_scientists.groupby(['Q6'])['Q28_Part_11_new']
                     .value_counts(normalize=True)
                     .rename('None')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q6'))
q28_part11_counts = q28_part11_counts.loc[q28_part11_counts['Q28_Part_11_new']==1].drop(columns=['Q28_Part_11_new'])

In [None]:
subtables = [q28_part3_counts, q28_part4_counts, q28_part5_counts, q28_part6_counts, q28_part7_counts, q28_part8_counts,
            q28_part9_counts, q28_part10_counts, q28_part11_counts]
merged = q28_part1_counts.merge(q28_part2_counts, on='Q6')
for subtable in subtables:
    merged = merged.merge(subtable, on='Q6')
merged.head()

In [None]:
merged = merged.set_index('Q6').unstack().to_frame()
merged.head()

In [None]:
merged = merged.reset_index()
merged = merged.rename(columns={'level_0': 'ML Framework', 'Q6': 'Company size', 0: 'percentage'})
merged.head()

In [None]:
grid = sns.FacetGrid(merged, col="Company size", hue="Company size",
                     palette="Set2", col_wrap=3, height=3, sharey=True, sharex=True,
                    col_order=['0-49 employees', '50-249 employees', '250-999 employees',
                               '1000-9,999 employees', '> 10,000 employees']);
grid.map(sns.barplot, 'percentage', 'ML Framework');
sns.despine();
plt.subplots_adjust(hspace=0.2, wspace=1.2);

> * Scikit-learn is the most used ML framework at every company size.
* Usage of LightGBM and XGBoost increases with company size.
* TensorFlow seems a bit more widely used in smaller companies.
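
A trend across company sizes can be checked with a rank correlation instead of by eye. The sketch below computes a Spearman correlation as the Pearson correlation of ranks, between the company-size order and each framework's usage share. The percentages are hypothetical placeholders, not the survey results, and with only five size buckets the coefficient is indicative at most.

```python
import pandas as pd

size_order = ['0-49 employees', '50-249 employees', '250-999 employees',
              '1000-9,999 employees', '> 10,000 employees']

# hypothetical usage percentages per company size, NOT survey results
toy = pd.DataFrame({
    "Company size": size_order,
    "LightGBM": [10.0, 12.0, 15.0, 18.0, 21.0],
    "TensorFlow": [40.0, 38.0, 39.0, 35.0, 33.0],
}).set_index("Company size")

# position of each bucket in the size ordering (0 = smallest)
size_rank = pd.Series(range(len(size_order)), index=size_order, dtype=float)

# Spearman correlation = Pearson correlation computed on ranks
trend = toy.rank().apply(lambda col: col.corr(size_rank.rank()))
print(trend)
```

A value close to 1 means usage rises steadily with company size, and a value close to -1 means it falls.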

However, do the tools matter that much when choosing a company? Tools are just there to solve problems, and might not be the driving force behind a decision about your working environment.

How do you go about deciding which company you want to work at?

Comments and ideas are more than welcome!