# Amazon Web Services share among data scientists

I show that AWS usage among data scientists is closely contested by GCP. In 2021, AWS usage was 25% and GCP usage was 21%. AWS usage is 8 percentage points below its 33% share of the cloud services market, whereas GCP usage (21%) is about twice its 10% market share. In Japan in particular, GCP usage is 28%, which is almost three times its market share and nearly equal to the share of AWS.

The crucial difference lies in hosted notebooks, where AWS usage is 2.2% and GCP usage is 38%. The typical hosted notebook service is [Colaboratory](https://colab.research.google.com/). Usage of hosted notebook services is about the same in Japan and globally (about 35%), so it does not by itself explain the difference in cloud platform usage between Japan and the rest of the world. According to Google Trends, however, interest in Colaboratory (GCP) vs SageMaker (AWS) is [60 vs 23](https://trends.google.co.jp/trends/explore?date=2021-01-01%202022-04-25&geo=JP&q=SageMaker,Colaboratory,Azure%20Machine%20Learning) in Japan but [34 vs 70](https://trends.google.co.jp/trends/explore?date=2021-01-01%202022-04-25&q=SageMaker,Colaboratory,Azure%20Machine%20Learning) globally. This suggests that although Colaboratory usage is almost the same in Japan and globally, the amount of content about it (e.g. textbooks, blog posts) is larger in Japan. The amount of content may therefore explain the usage difference between Japan and the rest of the world.
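
The Google Trends comparison above can be made concrete with a little arithmetic. The following is a minimal sketch that only reuses the four interest values quoted in the text; the dictionary layout and the helper name `interest_ratio` are illustrative assumptions, and Google Trends values are relative scores, so the ratio is a rough indicator rather than a measurement of content volume.

```python
# Google Trends interest values quoted in the text (2021-01-01 to 2022-04-25).
trends = {
    "Japan": {"Colaboratory": 60, "SageMaker": 23},
    "Global": {"Colaboratory": 34, "SageMaker": 70},
}


def interest_ratio(region):
    """Return Colaboratory interest divided by SageMaker interest for a region."""
    values = trends[region]
    return values["Colaboratory"] / values["SageMaker"]


for region in trends:
    print(f"{region}: Colaboratory / SageMaker = {interest_ratio(region):.2f}")
```

A ratio above 1 means Colaboratory attracts more search interest than SageMaker in that region, and a ratio below 1 means the opposite.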

From this analysis, increasing the amount of content appears to be the most important activity for expanding cloud platform usage.

The table of contents of this notebook is as follows.

1. About the survey data
2. The cloud platform usage among data scientists
3. The developer experience of cloud platform services in the data scientist's workflow
4. References

This notebook is based on [2021 EDA](https://www.kaggle.com/paultimothymooney/2021-kaggle-data-science-machine-learning-survey), which deals with the 2017~2020 survey data. I exclude 2017 because it lacks many of the questions asked in the other years.

In [None]:
COUNTRY = 'Japan'
COUNTRY_TEXT = {
    'global': 'data scientists (global)',
    'local': f'data scientists ({COUNTRY})'
}

In [None]:
import os
import numpy as np 
import pandas as pd 
import altair as alt
import warnings
warnings.filterwarnings("ignore")

# First, load the survey data

CHART_DIR = 'chat_data_dir'
# Create the output directory if it does not exist yet
os.makedirs(CHART_DIR, exist_ok=True)

# Load the data
def load_csv(base_dir,file_name):
    """Loads a CSV file into a Pandas DataFrame"""
    file_path = os.path.join(base_dir,file_name)
    df = pd.read_csv(file_path,low_memory=False,encoding='ISO-8859-1')
    return df    

def save_csv(dataframe, name):
    dataframe.to_csv(f'{CHART_DIR}/{name}', index=True)

base_dir_2018 = '/kaggle/input/kaggle-survey-2018/'
file_name_2018 = 'multipleChoiceResponses.csv'
survey_df_2018 = load_csv(base_dir_2018,file_name_2018)
responses_df_2018 = survey_df_2018[1:].copy()  # drop the question-text row; copy before adding columns
responses_df_2018['year'] = ['2018'] * len(responses_df_2018)
survey_df_2018.to_csv('2018_kaggle_ds_and_ml_survey_responses_only.csv',index=False)   

base_dir_2019 = '/kaggle/input/kaggle-survey-2019/'
file_name_2019 = 'multiple_choice_responses.csv'
survey_df_2019 = load_csv(base_dir_2019,file_name_2019)
responses_df_2019 = survey_df_2019[1:].copy()  # drop the question-text row; copy before adding columns
responses_df_2019['year'] = ['2019'] * len(responses_df_2019)
survey_df_2019.to_csv('2019_kaggle_ds_and_ml_survey_responses_only.csv',index=False)

base_dir_2020 = '/kaggle/input/kaggle-survey-2020'
file_name_2020 = 'kaggle_survey_2020_responses.csv'
survey_df_2020 = load_csv(base_dir_2020,file_name_2020)
responses_df_2020 = survey_df_2020[1:].copy()  # drop the question-text row; copy before adding columns
responses_df_2020['year'] = ['2020'] * len(responses_df_2020)
survey_df_2020.to_csv('2020_kaggle_ds_and_ml_survey_responses_only.csv',index=False)

base_dir_2021 = '/kaggle/input/kaggle-survey-2021/'
file_name_2021 = 'kaggle_survey_2021_responses.csv'
survey_df_2021 = load_csv(base_dir_2021,file_name_2021)
responses_df_2021 = survey_df_2021[1:].copy()  # drop the question-text row; copy before adding columns
responses_df_2021['year'] = ['2021'] * len(responses_df_2021)
survey_df_2021.to_csv('2021_kaggle_ds_and_ml_survey_responses_only.csv',index=False)

print('Total Number of Responses 2018: ',responses_df_2018.shape[0])
print('Total Number of Responses 2019: ',responses_df_2019.shape[0])
print('Total Number of Responses 2020: ',responses_df_2020.shape[0])
print('Total Number of Responses 2021: ',responses_df_2021.shape[0])

In [None]:
def filter_country(dataframe):
    """
    Filter the dataframe by COUNTRY
    """
    return dataframe[dataframe['Q3'] == COUNTRY]

def multiple_choice_to_df(dataframe, question_prefix, kind='', filter_func=None):
    """
    Make the dataframe of each choice count from the multiple choice question.
    """
    
    if filter_func:
        dataframe = filter_func(dataframe)
    
    multiple_choices = []
    year = dataframe['year'].max()
    for column in dataframe.columns:
        if column.startswith(question_prefix) and not column.endswith('OTHER_TEXT'):
            values = dataframe[column].dropna().reset_index(drop=True)
            if len(values) > 0:
                multiple_choices.append({
                    'year': year,
                    'choice': values.iat[0].strip(),
                    'count': len(values)
                })
    
    if len(multiple_choices) == 0:
        return None
    else:
        df = pd.DataFrame(multiple_choices)
        df['percentage'] = df['count'] * 100 / df['count'].sum()
        if kind:
            df['kind'] = [kind] * len(df)

        return df


def answer_to_df(dataframe, question, kind='', filter_func=None):
    """
    Make the dataframe of each answer count from the single choice question.
    """

    if filter_func:
        dataframe = filter_func(dataframe)
    
    year = dataframe['year'].max()
    # Name the index 'choice' up front so the result does not depend on how
    # pandas labels the value_counts() index after reset_index().
    counts = dataframe[question].value_counts().rename_axis('choice').reset_index(name='count')
    counts['year'] = [year] * len(counts)
    counts['choice'] = counts['choice'].apply(str.strip)
    counts['percentage'] = counts['count'] * 100 / counts['count'].sum()
    counts['kind'] = [kind] * len(counts)
    return counts


def question_to_df(dataframe, question, multiple_choice=True, kind='', filter_func=None):
    """
    Make the dataframe of each choice count from single and multiple choice question.
    """

    if multiple_choice:
        return multiple_choice_to_df(dataframe, question, kind, filter_func)
    else:
        return answer_to_df(dataframe, question, kind, filter_func)


def questions_to_df(dataframe_and_questions, multiple_choice=True, kind='', filter_func=None):
    """
    Make the dataframe of each choice count from the dataframe and question pairs.
    """

    dataframes = []
    for df, q in dataframe_and_questions:
        _df = question_to_df(df, q, multiple_choice, kind, filter_func)
        if _df is not None and len(_df) > 0:
            dataframes.append(_df)
    
    return pd.concat(dataframes)


def question_to_country_and_global_df(dataframe, question_prefix, multiple_choice=True):
    """
    Make the dataframe of each choice count segmented by COUNTRY
    from single and multiple choice question.
    """
    global_df = question_to_df(
        dataframe, question_prefix, multiple_choice, kind=COUNTRY_TEXT['global']
    )
    country_df = question_to_df(
        dataframe, question_prefix, multiple_choice, kind=COUNTRY_TEXT['local'], filter_func=filter_country
    )
    country_and_global_df = pd.concat([global_df, country_df])
    country_and_global_df.rename(columns={'kind': 'country'}, inplace=True)
    return country_and_global_df


def questions_to_country_and_global_df(dataframe_and_questions, multiple_choice=True):
    """
    Make the dataframe of each choice count segmented by COUNTRY
    from the dataframe and question pairs.
    """

    global_df = questions_to_df(
        dataframe_and_questions, multiple_choice, kind=COUNTRY_TEXT['global']
    )
    country_df = questions_to_df(
        dataframe_and_questions, multiple_choice, kind=COUNTRY_TEXT['local'], filter_func=filter_country
    )
    country_and_global_df = pd.concat([global_df, country_df])
    country_and_global_df.rename(columns={'kind': 'country'}, inplace=True)
    return country_and_global_df


def add_choice_series(dataframe, question_prefix, name, multiple_choice=True):
    """
    Add a column `name` holding the selected choice,
    with one row per (respondent, selected choice) pair.
    """
    dfs = []
    for column in dataframe.columns:
        if column.startswith(question_prefix) and not column.endswith('OTHER_TEXT'):
            # Copy the slice so that adding a column does not touch the source frame
            filtered_df = dataframe[dataframe[column].notna()].copy()
            if len(filtered_df) > 0:
                value = filtered_df[column].iat[0].strip()
                filtered_df[name] = value
                dfs.append(filtered_df)
    
    return pd.concat(dfs)

# 1. About the survey data

This section shows the number of respondents and their roles. In 2021 there were 921 respondents from Japan, mainly Students (26%), Data Scientists (13%), Software Engineers (9%), and Data Analysts (8%).

In [None]:
country_distribution_df = questions_to_df(
    [(responses_df_2018, 'Q3'),
     (responses_df_2019, 'Q3'),
     (responses_df_2020, 'Q3'),
     (responses_df_2021, 'Q3'),
    ], multiple_choice=False)

country_distribution_df['country'] = country_distribution_df['choice'].apply(
    lambda c: COUNTRY if c == COUNTRY else 'Others'
)

alt.Chart(country_distribution_df, title='The number of responses').mark_bar().encode(
    x='year',
    y=alt.Y('count', sort='y'),
    color='country',
    tooltip=['choice', 'count']
).properties(
    width=200,
    height=200
)

In [None]:
role_distribution_df = questions_to_country_and_global_df(
    [(responses_df_2018, 'Q5'),
     (responses_df_2019, 'Q5'),
     (responses_df_2020, 'Q5'),
     (responses_df_2021, 'Q5'),
    ], multiple_choice=False)


alt.Chart(
    role_distribution_df,
    title='The role of respondents').mark_bar().encode(
    x=alt.X('year', title=None),
    y='percentage',
    color=alt.Color('choice', title=None, sort='-y'),
    column=alt.Column('country', title=None),
    tooltip=['choice', 'percentage']
).properties(
    width=200,
    height=200
)

In [None]:
stage_distribution_df = questions_to_country_and_global_df(
    [(responses_df_2019, 'Q8'),
     (responses_df_2020, 'Q22'),
     (responses_df_2021, 'Q23'),
    ], multiple_choice=False)

alt.Chart(
    stage_distribution_df,
    title='The machine learning stage of respondents').mark_bar().encode(
    x=alt.X('year', title=None),
    y='percentage',
    color=alt.Color('choice', title=None, sort='-y'),
    column=alt.Column('country', title=None),
    tooltip=['choice', 'percentage']
).properties(
    width=200,
    height=200
)

In [None]:
learning_distribution_df = questions_to_country_and_global_df(
    [(responses_df_2019, 'Q13'),
     (responses_df_2020, 'Q37'),
     (responses_df_2021, 'Q40'),
    ])

alt.Chart(
    learning_distribution_df,
    title='How respondents learn data science').mark_bar().encode(
    x=alt.X('year', title=None),
    y='sum(percentage)',
    color=alt.Color('choice', title=None, sort='-y'),
    column=alt.Column('country', title=None),
    tooltip=['choice', 'percentage']
).properties(
    width=200,
    height=200
)

# 2. The cloud platform usage among data scientists

Let's analyze the share of each cloud platform among data scientists.

* Questions list: Q15(2018), Q29(2019), Q26-A(2020), Q27-A(2021).
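
These are multiple choice questions, and the helper `multiple_choice_to_df` counts the non-null answers in every column whose name starts with the question prefix. The following is a minimal sketch of that counting on toy data. The column names (`Q27_A_Part_1` and so on), the toy answers, and the function name `selection_share` are illustrative assumptions. Note that the denominator is the total number of selections, as in the helper, not the number of respondents.

```python
import pandas as pd

# Toy responses: one column per option; a cell holds the option label when
# the respondent selected it, otherwise it is missing.
toy = pd.DataFrame({
    "Q27_A_Part_1": ["Amazon Web Services (AWS)", None, "Amazon Web Services (AWS)", None],
    "Q27_A_Part_2": [None, "Microsoft Azure", None, None],
    "Q27_A_Part_3": ["Google Cloud Platform (GCP)", "Google Cloud Platform (GCP)", None, None],
})


def selection_share(df, prefix):
    """Count selections per option column and convert them to a share of all selections."""
    rows = []
    for column in df.columns:
        if column.startswith(prefix):
            values = df[column].dropna()
            if len(values) > 0:
                # Every non-null cell in a column carries the same option label.
                rows.append({"choice": values.iloc[0].strip(), "count": len(values)})
    out = pd.DataFrame(rows)
    out["percentage"] = out["count"] * 100 / out["count"].sum()
    return out


shares = selection_share(toy, "Q27_A")
print(shares)
```

Because a respondent may select several platforms, the percentages sum to 100 across options rather than describing the fraction of respondents who use each platform.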

In [None]:
# Market share figures are taken from Synergy Research Group
# https://www.srgresearch.com/articles/amazon-microsoft-google-grab-the-big-numbers-but-rest-of-cloud-market-still-grows-by-27

MAJOR_CLOUDS = ['Amazon Web Services (AWS)', 'Microsoft Azure', 'Google Cloud Platform (GCP)']
market_share_2021 = pd.DataFrame([
    {'year': 2021, 'country': 'market', 'choice': 'Amazon Web Services (AWS)', 'count': 33, 'percentage': 33.0},
    {'year': 2021, 'country': 'market', 'choice': 'Microsoft Azure', 'count': 20, 'percentage': 20.0},
    {'year': 2021, 'country': 'market', 'choice': 'Google Cloud Platform (GCP)', 'count': 10, 'percentage': 10.0},
    {'year': 2021, 'country': 'market', 'choice': 'Other', 'count': 37.0, 'percentage': 37.0}
])

In [None]:
# Display the charts
share_2021 = pd.concat([market_share_2021, question_to_country_and_global_df(responses_df_2021, 'Q27_A')])
share_2021 = share_2021[share_2021['choice'].isin(MAJOR_CLOUDS)]

alt.Chart(share_2021, title='Most Commonly Used Cloud Computing Platforms')\
.mark_bar().encode(
    x=alt.X('country', title=None, sort=['market', COUNTRY_TEXT['global'], COUNTRY_TEXT['local']]),
    y=alt.Y('percentage', title='% of respondents'),
    column=alt.Column('choice', sort=['-y'], title=None),
    tooltip=['choice', 'country', 'percentage']
).properties(
    width=100,
    height=180
)

**AWS usage (25.2%) is about 8 percentage points lower than its 33% share of the cloud services market.** GCP usage (21%) is 11 points higher than its 10% market share, **and in Japan GCP usage reaches 28.6%, about 18 points higher than 10%.**
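
The gaps quoted above are simple differences between survey usage and market share. The following is a small sketch of that arithmetic using only the figures from the text; the function name `compare` and the tuple layout are illustrative assumptions.

```python
# Figures quoted in the text: (survey usage %, market share %).
figures = {
    "AWS (global)": (25.2, 33.0),
    "GCP (global)": (21.0, 10.0),
    "GCP (Japan)": (28.6, 10.0),
}


def compare(usage, share):
    """Return the gap in percentage points and the usage-to-share ratio."""
    gap = round(usage - share, 1)
    ratio = round(usage / share, 2)
    return gap, ratio


for label, (usage, share) in figures.items():
    gap, ratio = compare(usage, share)
    print(f"{label}: {gap:+.1f} points, {ratio:.2f}x market share")
```

Reporting both the gap in points and the ratio keeps "8% lower" (a difference of shares) distinct from "about twice" (a ratio of shares).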

**Do data science team attributes explain the difference in usage?**

Data science teams differ in 1) their role, 2) their number of members, and 3) the stage of their ML adoption. By analyzing usage along these three team attributes, we can understand what kind of data science team prefers each cloud platform.

* Questions list for 1) role: Select any activities that make up an important part of your role at work
  * Q9(2019), Q23(2020), Q24(2021)
* Questions list for 2) team size: Approximately how many individuals are responsible for data science workloads at your place of business?
  * Q7(2019), Q21(2020), Q22(2021)
* Questions list for 3) ML stage: Does your current employer incorporate machine learning methods into their business?
  * Q8(2019), Q22(2020), Q23(2021)

None of these attributes shows an obvious difference, but this is itself an important finding. The machine learning services in GCP are still developing and many of them are still in beta. Nevertheless, large teams and teams in the production phase also use GCP, especially in Japan. Among teams that have had machine learning models in production for more than 2 years, usage is AWS 28% and GCP 23% globally (AWS +5 points), but AWS 19% and GCP 23% in Japan (AWS -4 points).

In [None]:
cloud_platform_questions = [
    (responses_df_2018, 'Q15'),
    (responses_df_2019, 'Q29'),
    (responses_df_2020, 'Q26_A'),
    (responses_df_2021, 'Q27_A')
]

def make_cloud_df(cloud_dataframe_questions, cloud_name, year_questions, multiple_choice=True):
    cloud_dfs = []
    # Iterate over the argument rather than the global cloud_platform_questions
    for cloud_df, cloud_question in cloud_dataframe_questions:
        cloud_df = add_choice_series(cloud_df, cloud_question, cloud_name)
        year = cloud_df['year'].max()
        if year in year_questions:
            question = year_questions[year]
            for value in cloud_df[cloud_name].unique().tolist():
                cloud = value.strip()
                _cloud_df = cloud_df[cloud_df[cloud_name] == cloud]
                if len(_cloud_df) > 0:
                    _cloud_df_counted = question_to_country_and_global_df(_cloud_df, question, multiple_choice)
                    _cloud_df_counted[cloud_name] = cloud
                    cloud_dfs.append(_cloud_df_counted)
            
    return pd.concat(cloud_dfs)

def plot_comparison_between_clouds(dataframe, title, color, width=200, height=180):
    # Keep the three major clouds and aggregate only the numeric columns
    _df = dataframe[dataframe.cloud.isin(MAJOR_CLOUDS)].groupby(
        ['year', color, 'cloud', 'country'])[['count', 'percentage']].sum().reset_index()
    return alt.Chart(_df, title=title).mark_line().encode(
            x=alt.X('year', title=None),
            y=alt.Y('percentage', title='% of respondents'),
            color=color,
            column=alt.Column('cloud', title=None, sort=MAJOR_CLOUDS),
            row=alt.Row('country', title=None),
            tooltip=['cloud', 'year', color, 'percentage']
    ).properties(
        width=width,
        height=height
    )

In [None]:
missions = {
    '2019': 'Q9',
    '2020': 'Q23',
    '2021': 'Q24'
}

mission_dict = {
    'Analyze and understand data to influence product or business decisions': 'Analyze data',
    'Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data': 'Build DWH',
    'Build prototypes to explore applying machine learning to new areas': 'Build ML model',
    'Build and/or run a machine learning service that operationally improves my product or workflows': 'Build MLOps',
    'Experimentation and iteration to improve existing ML models': 'Improve ML model',
    'Do research that advances the state of the art of machine learning': 'Research ML',
    'None of these activities are an important part of my role at work': 'None',
    'Other': 'Other'
}

mission_df = make_cloud_df(cloud_platform_questions, 'cloud', missions)
mission_df['mission'] = mission_df['choice'].apply(lambda c: mission_dict[c])

plot_comparison_between_clouds(
    mission_df, 'Comparison from the perspective of data science team role', 'mission', width=200)

In [None]:
members = {
    '2019': 'Q7',
    '2020': 'Q21',
    '2021': 'Q22'
}

members_dict = {
    '0': '1.small(<5)',
    '1-2': '1.small(<5)',
    '3-4': '1.small(<5)',
    '5-9': '2.medium(5<15)',
    '10-14': '2.medium(5<15)',
    '15-19': '3.large(15<)',
    '20+': '3.large(15<)',
    'Other': 'Other'
}

member_df = make_cloud_df(cloud_platform_questions, 'cloud', members, multiple_choice=False)
member_df['team size'] = member_df['choice'].apply(lambda c: members_dict[c])
plot_comparison_between_clouds(
    member_df, 'Comparison from the perspective of data science team size', 'team size', width=200)

In [None]:
stages = {
    '2019': 'Q8',
    '2020': 'Q22',
    '2021': 'Q23'
}

stages_dict = {
    'We are exploring ML methods (and may one day put a model into production)': '1.developing',
    'We use ML methods for generating insights (but do not put working models into production)': '1.developing',
    'We recently started using ML methods (i.e., models in production for less than 2 years)': '2.production (less 2years)',
    'We have well established ML methods (i.e., models in production for more than 2 years)': '3.production (over 2years)',
    'No (we do not use ML methods)': 'None',
    'I do not know': 'Other',
}

stage_df = make_cloud_df(cloud_platform_questions, 'cloud', stages, multiple_choice=False)
stage_df['stage'] = stage_df['choice'].apply(lambda c: stages_dict[c])
plot_comparison_between_clouds(
    stage_df, 'Comparison from the perspective of data science stage', 'stage')

In [None]:
def stage_base_usage(stage_df, year):
    counts = []
    for country in COUNTRY_TEXT:
        country_count = stage_df[(stage_df['year'] == year) &\
                                 (stage_df['country'] == COUNTRY_TEXT[country])]['count'].sum()
        for cloud in MAJOR_CLOUDS:
            for stage in stages_dict:
                count = stage_df[(stage_df['year'] == year) &\
                                 (stage_df['country'] == COUNTRY_TEXT[country]) &\
                                 (stage_df['cloud'] == cloud) &\
                                 (stage_df['choice'] == stage)
                                ]['count'].sum()
                percentage = count / country_count * 100
                counts.append({
                    'year': int(year),
                    'country': COUNTRY_TEXT[country],
                    'choice': cloud,
                    'stage': stages_dict[stage],
                    'count': count,
                    'percentage': percentage

                })
    
    # Copy the slice before adding a column so market_share_2021 is left untouched
    _market_share_2021 = market_share_2021[market_share_2021['choice'].isin(MAJOR_CLOUDS)].copy()
    _market_share_2021['stage'] = '3.production (over 2years)'
    df = pd.concat([_market_share_2021, pd.DataFrame(counts)])
    return df


alt.Chart(stage_base_usage(stage_df, '2021'), title='Most Commonly Used Cloud Computing Platforms for each stage')\
.mark_bar().encode(
    x=alt.X('country', title=None, sort=['market', COUNTRY_TEXT['global'], COUNTRY_TEXT['local']]),
    y=alt.Y('percentage', title='% of respondents'),
    column=alt.Column('choice', sort=['-y'], title=None),
    color='stage',
    tooltip=[ 'country', 'choice', 'stage', 'percentage']
).properties(
    width=100,
    height=180
)

# 3. The developer experience of cloud platform services in the data scientist's workflow

First, the developer experience of AWS and GCP is about the same globally (27% vs 23%), but GCP is clearly preferred in Japan (37%).

In [None]:
experience_2021 = question_to_country_and_global_df(responses_df_2021, 'Q28', multiple_choice=False)
experience_2021 = experience_2021[experience_2021['choice'].isin(MAJOR_CLOUDS + ['Databricks'])]

alt.Chart(experience_2021, title='Most Familiar Cloud Computing Platforms')\
.mark_bar().encode(
    x=alt.X('country', title=None, sort=['data scientists (global)', f'data scientists ({COUNTRY})']),
    y=alt.Y('percentage', title='% of respondents'),
    column=alt.Column('choice', sort=['-y'], title=None),
    tooltip=['choice', 'country', 'percentage']
).properties(
    width=150,
    height=180
)

To understand this difference, I analyze the share of each service used in the data science workflow: 1) hosted notebook services, 2) data infrastructure services, 3) machine learning services, and 4) computing services. The following list shows each service and how it is used in the workflow.

1. **Hosted notebook services**: used to experiment with new models and ideas (e.g. SageMaker Studio Notebook, Google Colaboratory, Azure Notebooks)
   * Question: Which of the following hosted notebook products do you use on a regular basis?
   * Question list: Q17(2019), Q10(2020), Q10(2021)
2. **Data infrastructure services**: used to organize and query data (e.g. Amazon Redshift, Google BigQuery, MySQL, PostgreSQL)
   * Question: Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis?
   * Question list: Q31(2019), Q29-A(2020), Q32-A(2021)
3. **Machine learning services**: used to prototype machine learning models quickly (e.g. Amazon SageMaker, Google Vertex AI, Azure Machine Learning Studio, Databricks)
   * Question: Do you use any of the following managed machine learning products on a regular basis? (Select all that apply)
   * Question list: Q32(2019), Q28-A(2020), Q31-A(2021)
4. **Computing services**: used to train or deploy machine learning models (e.g. Amazon Elastic Compute Cloud, Google Compute Engine, Microsoft Azure Virtual Machines)
   * Question: Do you use any of the following cloud computing products on a regular basis? (Select all that apply)
   * Question list: Q30(2019), Q27-A(2020), Q29-A(2021)

When service usage in steps 1~4 increases, the share of the platform is expected to increase as well.

5. **AI/ML platform**: used for the overall machine learning process (e.g. Amazon Web Services, Google Cloud Platform, Microsoft Azure)
   * Question: Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply)
   * Question list: Q29(2019), Q26-A(2020), Q27-A(2021).
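
To compare platforms at each step, individual services have to be rolled up to the platform that provides them. The following is a minimal sketch of that roll-up for a single step. The percentages are made-up toy numbers, and `service_to_cloud` is a small illustrative subset of the kind of mapping the notebook defines in its service dictionaries.

```python
import pandas as pd

# Hypothetical per-service shares for one workflow step (toy numbers).
toy_step = pd.DataFrame({
    "choice": [
        "Colab Notebooks",
        "Amazon Sagemaker Studio Notebooks",
        "Amazon EMR Notebooks",
        "Azure Notebooks",
    ],
    "percentage": [30.0, 2.0, 1.0, 3.0],
})

# Illustrative subset of a service-to-platform mapping.
service_to_cloud = {
    "Colab Notebooks": "Google Cloud Platform (GCP)",
    "Amazon Sagemaker Studio Notebooks": "Amazon Web Services (AWS)",
    "Amazon EMR Notebooks": "Amazon Web Services (AWS)",
    "Azure Notebooks": "Microsoft Azure",
}

# Label each service with its platform, then sum the shares per platform.
toy_step["cloud"] = toy_step["choice"].map(service_to_cloud)
platform_share = toy_step.groupby("cloud")["percentage"].sum()
print(platform_share)
```

Repeating this per step and per year gives one share per platform at each point in the workflow, which is what the line charts in this section plot.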

The crucial difference is in hosted notebook services. This suggests that GCP's hosted notebook experience is better than that of other services and that this contributes to the overall developer experience. However, hosted notebook usage does not translate straightforwardly into cloud platform share. There are two hypotheses. One is that notebook service usage has no effect on cloud platform usage. The other is that hosted notebook usage influences some other aspect of the developer experience, which in turn causes the difference in cloud platform usage. Colaboratory's ease of use likely affects the "developer experience", and in Japan this is observed quantitatively through Google Trends. Increasing the amount of content related to hosted notebook services, and thereby improving the Google Trends score, is one experiment that could confirm this factor and improve cloud platform usage.

In [None]:
experimenting_service_dict = { 
    'Code Ocean': 'Other',
    'None': 'None',
    'Paperspace / Gradient': 'Other',
    'Binder / JupyterHub': 'Other',
    'IBM Watson Studio': 'Other',
    'Other': 'Other',
    'Azure Notebooks': 'Microsoft Azure',
    'Databricks Collaborative Notebooks': 'Databricks',
    'Kaggle Notebooks': 'Kaggle',
    'Colab Notebooks': 'Google Cloud Platform (GCP)',
    'Amazon EMR Notebooks': 'Amazon Web Services (AWS)',
    'Amazon Sagemaker Studio Notebooks': 'Amazon Web Services (AWS)',
    'Google Cloud AI Platform Notebooks': 'Google Cloud Platform (GCP)',
    'Google Cloud Notebooks (AI Platform / Vertex AI)': 'Google Cloud Platform (GCP)',
    'Google Cloud Datalab': 'Google Cloud Platform (GCP)',
    'Deepnote Notebooks': 'Other',
    'Google Cloud Datalab Notebooks': 'Google Cloud Platform (GCP)',
    'Kaggle Notebooks (Kernels)': 'Kaggle',
    'Amazon Sagemaker Studio': 'Amazon Web Services (AWS)',
    'Google Colab': 'Google Cloud Platform (GCP)',
    'AWS Notebook Products (EMR Notebooks, Sagemaker Notebooks, etc)': 'Amazon Web Services (AWS)',
    'FloydHub': 'Other',
    'Google Cloud Notebook Products (AI Platform, Datalab, etc)': 'Google Cloud Platform (GCP)',
    'Microsoft Azure Notebooks': 'Microsoft Azure',
    'Zeppelin / Zepl Notebooks': 'Other',
    'Observable Notebooks': 'Other',
}

In [None]:
prototyping_service_data_dict = {
    'Google BigQuery': 'Google Cloud Platform (GCP)',
    'AWS Redshift': 'Amazon Web Services (AWS)',
    'Databricks': 'Databricks',
    'AWS Elastic MapReduce': 'Amazon Web Services (AWS)',
    'Teradata': 'Other',
    'Microsoft Analysis Services': 'Microsoft Azure',
    'Google Cloud Dataflow': 'Google Cloud Platform (GCP)/Other',
    'AWS Athena': 'Amazon Web Services (AWS)',
    'AWS Kinesis': 'Amazon Web Services (AWS)/Other',
    'Google Cloud Pub/Sub': 'Google Cloud Platform (GCP)/Other',
    'None': 'None',
    'Other': 'Other',
    'MySQL': 'Database Software',
    'PostgresSQL': 'Database Software',
    'SQLite': 'Database Software',
    'Oracle Database': 'Database Software',
    'MongoDB': 'Database Software',
    'Snowflake': 'Database Software',
    'IBM Db2': 'Database Software',
    'Microsoft SQL Server': 'Database Software',
    'Microsoft Access': 'Database Software',
    'Microsoft Azure Data Lake Storage': 'Microsoft Azure',
    'Amazon Redshift': 'Amazon Web Services (AWS)',
    'Amazon Athena': 'Amazon Web Services (AWS)',
    'Amazon DynamoDB': 'Amazon Web Services (AWS)',
    'Google Cloud BigQuery': 'Google Cloud Platform (GCP)',
    'Google Cloud SQL': 'Google Cloud Platform (GCP)',
    'Google Cloud Firestore': 'Google Cloud Platform (GCP)',
    'PostgreSQL': 'Database Software',
    'Microsoft Azure SQL Database': 'Microsoft Azure',
    'Microsoft Azure Cosmos DB': 'Microsoft Azure',
    'Amazon Aurora': 'Amazon Web Services (AWS)',
    'Amazon RDS': 'Amazon Web Services (AWS)',
    'Google Cloud BigTable': 'Google Cloud Platform (GCP)',
    'Google Cloud Spanner': 'Google Cloud Platform (GCP)',
}

In [None]:
prototyping_service_model_dict = {
    'Azure Machine Learning Studio': 'Microsoft Azure',
    'Amazon SageMaker': 'Amazon Web Services (AWS)',
    'Other': 'Other',
    'Google Cloud Natural Language': 'Google Cloud Platform (GCP)',
    'No / None': 'None',
    'SAS': 'Other',
    'Google Cloud AI Platform / Google Cloud ML Engine': 'Google Cloud Platform (GCP)',
    'Alteryx': 'Other',
    'Dataiku': 'Other',
    'Databricks': 'Databricks',
    'DataRobot': 'Other',
    'Google Cloud Vertex AI': 'Google Cloud Platform (GCP)',
    'Google Cloud Vision AI': 'Google Cloud Platform (GCP)/Other',
    'Google Cloud Video AI': 'Google Cloud Platform (GCP)/Other',
    'Amazon Forecast': 'Amazon Web Services (AWS)',
    'Azure Cognitive Services': 'Microsoft Azure',
    'Amazon Rekognition': 'Amazon Web Services (AWS)',
    'Cloudera': 'Other',
    'None': 'None',
    'Google Cloud Translation': 'Google Cloud Platform (GCP)/Other',
    'RapidMiner': 'Other',
    'Google Cloud Speech-to-Text': 'Google Cloud Platform (GCP)/Other',
    'Google Cloud Vision': 'Google Cloud Platform (GCP)/Other',
    'Google Cloud Machine Learning Engine': 'Google Cloud Platform (GCP)',
    'Rapidminer': 'Other',
}

In [None]:
building_service_dict = {
    'Other': 'Other',
    'AWS Lambda': 'Amazon Web Services (AWS)',
    'No / None': 'None',
    'Google Cloud Functions': 'Google Cloud Platform (GCP)',
    'Google Cloud Compute Engine': 'Google Cloud Platform (GCP)',
    'Amazon Elastic Container Service': 'Amazon Web Services (AWS)',
    'Amazon Elastic Compute Cloud (EC2)': 'Amazon Web Services (AWS)',
    'Google Cloud App Engine': 'Google Cloud Platform (GCP)',
    'Google Cloud Run': 'Google Cloud Platform (GCP)',
    'Azure Functions': 'Microsoft Azure',
    'Microsoft Azure Container Instances': 'Microsoft Azure',
    'Azure Cloud Services': 'Microsoft Azure',
    'AWS Elastic Compute Cloud (EC2)': 'Amazon Web Services (AWS)',
    'Amazon EC2': 'Amazon Web Services (AWS)',
    'Google Compute Engine (GCE)': 'Google Cloud Platform (GCP)',
    'None': 'None',
    'Azure Container Service': 'Microsoft Azure',
    'AWS Batch': 'Amazon Web Services (AWS)',
    'Google Kubernetes Engine': 'Google Cloud Platform (GCP)',
    'AWS Elastic Beanstalk': 'Amazon Web Services (AWS)/Other',
    'Google App Engine': 'Google Cloud Platform (GCP)',
    'Azure Virtual Machines': 'Microsoft Azure',
    'Microsoft Azure Virtual Machines': 'Microsoft Azure',
}

In [None]:
workflow = {
    'hosted notebook': [
        (responses_df_2019, 'Q17'),
        (responses_df_2020, 'Q10'),
        (responses_df_2021, 'Q10')
    ],
    'data infrastructure': [
        (responses_df_2019, 'Q31'),
        (responses_df_2020, 'Q29_A'),
        (responses_df_2021, 'Q32_A')
    ],
    'machine learning': [
        (responses_df_2019, 'Q32'),
        (responses_df_2020, 'Q28_A'),
        (responses_df_2021, 'Q31_A')
    ],
    'ai/ml computing': [
        (responses_df_2019, 'Q30'),
        (responses_df_2020, 'Q27_A'),
        (responses_df_2021, 'Q29_A')
    ],
    'ai/ml platform': [
        (responses_df_2019, 'Q29'),
        (responses_df_2020, 'Q26_A'),
        (responses_df_2021, 'Q27_A')
    ]
}

workflow_dfs = []
for step in workflow:
    dataframe_and_questions = workflow[step]
    _step_df = questions_to_country_and_global_df(dataframe_and_questions)
    _step_df['step'] = [step] * len(_step_df)
    if step == 'hosted notebook':
        _step_df['cloud'] = _step_df['choice'].apply(lambda c: experimenting_service_dict[c])
    elif step == 'data infrastructure':
        _step_df['cloud'] = _step_df['choice'].apply(lambda c: prototyping_service_data_dict[c])
    elif step == 'machine learning':
        _step_df['cloud'] = _step_df['choice'].apply(lambda c: prototyping_service_model_dict[c])
    elif step == 'ai/ml computing':
        _step_df['cloud'] = _step_df['choice'].apply(lambda c: building_service_dict[c])
    elif step == 'ai/ml platform':
        _step_df['cloud'] = _step_df['choice'].tolist()

    workflow_dfs.append(_step_df)

workflow_df = pd.concat(workflow_dfs)

In [None]:
workflow_df = workflow_df[workflow_df['cloud'].isin(MAJOR_CLOUDS + ['Databricks'])]
workflow_df.head(5)

In [None]:
def plot_the_workflow(dataframe):
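    # Aggregate only the numeric columns for each (year, step, country, cloud) group
    _df = dataframe.groupby(['year', 'step', 'country', 'cloud'])[['count', 'percentage']].sum().reset_index()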
    return alt.Chart(_df,
              title='The share of services in the workflow of data scientist').mark_line().encode(
        x=alt.X('step', title=None, sort=['hosted notebook', 'data infrastructure', 'machine learning', 'ai/ml computing', 'ai/ml platform']),
        y=alt.Y('percentage', title='% of respondents'),
        color='year',
        column=alt.Column('cloud', title=None),
        row=alt.Row('country'),
        tooltip=['year', 'step', 'country', 'cloud', 'percentage']
    ).properties(
        width=150,
        height=140
    )

In [None]:
plot_the_workflow(workflow_df)

In [None]:
alt.Chart(workflow_df[(workflow_df['step'] == 'hosted notebook') & (workflow_df['year'] == '2021')].groupby(
    ['cloud', 'country', 'choice']
).sum().reset_index(), title="Hosted notebook usage in 2021").mark_bar().encode(
    x='percentage',
    y=alt.Y('choice', sort='-x', title=None),
    column=alt.Column('cloud', title=None),
    row=alt.Row('country', title=None),
    tooltip=['choice', 'percentage'],
).properties(
    width=100,
    height=200
)

In [None]:
alt.Chart(workflow_df[(workflow_df['step'] == 'ai/ml platform')].groupby(
    ['year', 'cloud', 'country', 'choice']
).sum().reset_index(), title="AI/ML platform usage trend").mark_bar().encode(
    x='year',
    y=alt.Y('percentage', title='% of respondents'),
    column=alt.Column('cloud', title=None),
    row=alt.Row('country', title=None),
    tooltip=['choice', 'percentage'],
).properties(
    width=100,
    height=200
)

In [None]:
alt.Chart(workflow_df[(workflow_df['step'] == 'data infrastructure') & (workflow_df['year'] == '2021')].groupby(
    ['cloud', 'country', 'choice']
).sum().reset_index(), title="Data infrastructure usage in 2021").mark_bar().encode(
    x='percentage',
    y=alt.Y('choice', sort='-x', title=None),
    column=alt.Column('cloud', title=None),
    row=alt.Row('country', title=None),
    tooltip=['choice', 'percentage'],
).properties(
    width=150,
    height=200
)

# 4. References

### Questions list

Here is the list of survey questions used in this notebook.

* [2019 kaggle machine learning data science survey](https://www.kaggle.com/c/kaggle-survey-2019)
   * Q3 : In which country do you currently reside?
   * Q17 : Which of the following hosted notebook products do you use on a regular basis?
   * Q29 : Which of the following cloud computing platforms do you use on a regular basis?
   * Q30 : Which specific cloud computing products do you use on a regular basis?
   * Q31 : Which specific big data / analytics products do you use on a regular basis?
   * Q32 : Which of the following machine learning products do you use on a regular basis?
   * Q34 : Which of the following relational database products do you use on a regular basis?
* [2020 kaggle machine learning data science survey](https://www.kaggle.com/c/kaggle-survey-2020)
   * Q3 : In which country do you currently reside?
   * Q10 : Which of the following hosted notebook products do you use on a regular basis?
   * Q26-A : Which of the following cloud computing platforms do you use on a regular basis?
   * Q27-A : Do you use any of the following cloud computing products on a regular basis? 
   * Q28-A : Do you use any of the following machine learning products on a regular basis?
   * Q29-A : Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis?
   * Q30-A : Which of the following business intelligence tools do you use on a regular basis?
* [2021 kaggle machine learning data science survey](https://www.kaggle.com/c/kaggle-survey-2021)
   * Q3 : In which country do you currently reside?
   * Q10 : Which of the following hosted notebook products do you use on a regular basis?
   * Q27-A : Which of the following cloud computing platforms do you use on a regular basis?
   * Q29-A : Do you use any of the following cloud computing products on a regular basis?
   * Q30-A : Do you use any of the following data storage products on a regular basis?
   * Q31-A : Do you use any of the following managed machine learning products on a regular basis?
   * Q32-A : Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis?
   * Q34-A : Which of the following business intelligence tools do you use on a regular basis?
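
Most of the questions listed above are multiple-selection questions. The sketch below shows one way such a question could be turned into a "% of respondents" figure, assuming each option is stored in its own column named like `Q27_A_Part_1` that holds the option text when selected and is empty otherwise. The function `multi_select_percentages` and the toy responses are hypothetical; this is not the notebook's own `questions_to_country_and_global_df` helper.

```python
import pandas as pd

def multi_select_percentages(responses, question):
    """Return the % of respondents who selected each option of one question.

    Assumes one column per option, named '<question>_Part_<n>', holding the
    option text when selected and NaN/None otherwise.
    """
    prefix = question + '_Part_'
    part_columns = [c for c in responses.columns if c.startswith(prefix)]
    rows = []
    for column in part_columns:
        selected = responses[column].dropna()
        if selected.empty:
            continue  # nobody picked this option
        rows.append({
            'choice': selected.iloc[0].strip(),
            'percentage': 100 * len(selected) / len(responses),
        })
    return pd.DataFrame(rows, columns=['choice', 'percentage'])

# Toy responses, made up for illustration only.
toy_responses = pd.DataFrame({
    'Q3': ['Japan', 'India', 'Japan', 'Brazil'],
    'Q27_A_Part_1': ['Amazon Web Services (AWS)', None,
                     'Amazon Web Services (AWS)', None],
    'Q27_A_Part_2': [None, 'Google Cloud Platform (GCP)',
                     'Google Cloud Platform (GCP)', 'Google Cloud Platform (GCP)'],
})

toy_usage = multi_select_percentages(toy_responses, 'Q27_A')
print(toy_usage)
```

Because respondents can select several options, the percentages of one question can add up to more than 100.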

### Report list

* Market share
   * Synergy Research Group. [Amazon, Microsoft & Google Grab the Big Numbers – But Rest of Cloud Market Still Grows by 27%](https://www.srgresearch.com/articles/amazon-microsoft-google-grab-the-big-numbers-but-rest-of-cloud-market-still-grows-by-27)
   * Canalys. [Global cloud services market Q2 2021](https://www.canalys.com/newsroom/global-cloud-services-q2-2021)
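
The market share reports above are the baseline against which the survey usage is compared. As a small worked example, the sketch below recomputes the gap and the ratio between usage and market share from the 2021 figures quoted in the introduction of this notebook. The variable names are illustrative only.

```python
# 2021 figures quoted in the introduction of this notebook (percent).
survey_usage = {'AWS': 25, 'GCP': 21}   # share of surveyed data scientists
market_share = {'AWS': 33, 'GCP': 10}   # share of the cloud services market

for cloud in survey_usage:
    gap = survey_usage[cloud] - market_share[cloud]
    ratio = survey_usage[cloud] / market_share[cloud]
    print(f'{cloud}: usage {survey_usage[cloud]}%, '
          f'market share {market_share[cloud]}%, '
          f'gap {gap:+d} points, ratio {ratio:.2f}')

# Ratio above 1 means the cloud is over-represented among data scientists.
usage_to_share = {cloud: survey_usage[cloud] / market_share[cloud]
                  for cloud in survey_usage}
```

A ratio below 1 (AWS) means the platform is used less among surveyed data scientists than its market share would suggest, and a ratio above 1 (GCP) means the opposite.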

In [None]:
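# Record the exact package versions used by this notebook for reproducibility.
# -p avoids an error when the directory already exists (for example on a re-run).
!mkdir -p /kaggle/working/docker/
!pip freeze > '/kaggle/working/docker/requirements.txt'
print('This notebook makes use of \nthe following Python libraries:\n')
print('numpy:', np.__version__)
print('pandas:', pd.__version__)
print('altair:', alt.__version__)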
