# Machine Learning on the cloud! (In progress)

My day-to-day job involves working with numerous companies that are moving their infrastructure across several public cloud platforms (and on-prem infrastructure). Something I've observed is that companies haven't embraced the cloud for ML as readily as they embraced moving infrastructure from on-premises to the cloud. Although this dataset is not a very accurate representation of the problem, it could be a good starting point for understanding how companies are adopting ML and using the cloud to run experiments.

![Cloud?](https://www.explainxkcd.com/wiki/images/5/5f/cloud.png)

# Table of contents and visualizations built:
- Percentage of users using cloud 2019 vs 2020
    - Grouped by day job
    - Grouped by the number of employees
- Familiarity with number of cloud platforms 2019 vs 2020
- Growth rate of public cloud platforms 2019 vs 2020
- Popularity of popular cloud ML tools 2019 vs 2020
- Proportion of users using the following tools 2019 vs 2020:
    - Big Data
    - Business Intelligence
    - AutoML


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import PathPatch
from matplotlib.patches import Patch
import matplotlib.patches as patches
import collections
import seaborn as sns
import plotly

color = sns.color_palette()

In [None]:
df20 = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', keep_default_na=False)
df19 = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv', keep_default_na=False)

In [None]:
df20.head()

In [None]:
cols = list(df20.columns)
unique = list(set(['_'.join(x.split('_')[0:2]) for x in cols if 'A' in x or 'B' in x]))+list(set([x.split('_')[0] for x in cols if 'A' not in x and 'B' not in x and 'Part' in x]))
col_breakdown = {}
for col in unique:
    tmp = [c for c in cols if c.startswith(col)]
    col_breakdown[col] = tmp
col_frozen = frozenset(col_breakdown.keys())
def clean(row):
    cleaned = []
    for col in col_frozen:
        tmp = [x for x in list(row[col_breakdown[col]]) if x!='']
        cleaned.append(tmp)
#     print(len(cleaned))
    return pd.Series(cleaned)
df20[list(col_frozen)] = df20.apply(clean, axis=1)

In [None]:
cols = list(df19.columns)
unique = list(set(['_'.join(x.split('_')[0:2]) for x in cols if 'A' in x or 'B' in x]))+list(set([x.split('_')[0] for x in cols if 'A' not in x and 'B' not in x and 'Part' in x]))
col_breakdown = {}
for col in unique:
    tmp = [c for c in cols if c.startswith(col)]
    col_breakdown[col] = tmp
col_frozen = frozenset(col_breakdown.keys())
def clean(row):
    cleaned = []
    for col in col_frozen:
        tmp = [x for x in list(row[col_breakdown[col]]) if x!='']
        cleaned.append(tmp)
#     print(len(cleaned))
    return pd.Series(cleaned)
df19[list(col_frozen)] = df19.apply(clean, axis=1)

In [None]:
qs_20 = ['Q1','Q2','Q3', 'Q4','Q5','Q6','Q10','Q11','Q12','Q13','Q20','Q21','Q23','Q25','Q26_A','Q26_B', 'Q27_A','Q27_B','Q28_A','Q28_B','Q29_A','Q29_B','Q30','Q31_A','Q31_B','Q32','Q34_A','Q34_B','Q38']
qs_19 = ['Q1','Q2','Q3','Q4','Q5','Q6','Q7','Q11','Q14','Q17','Q22','Q29','Q30','Q31','Q32','Q33','Q34']

In [None]:
d20 = df20[qs_20][1:]
d19 = df19[qs_19][1:]

Percentage comparison between 2019 and 2020 in terms of cloud adoption broken down by different platforms
- Percentage of entries with no cloud platform (2019 vs 2020)
- Same as above, for every role
- Same as above, broken down by gender
- Same as above, broken down by years of experience
- Bar plot of number of cloud platforms per user
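
The first item in the list above boils down to one small computation. Here is a minimal sketch, assuming each cell of the answer column is already a list of selected options (as the `clean` step produces). The function name `pct_cloud_users`, the `empty_markers` parameter and the toy data are hypothetical and are not part of the survey code.

```python
import pandas as pd

def pct_cloud_users(answers, empty_markers=('None', '-1', -1)):
    """Return the percentage of rows naming at least one real cloud platform.

    `answers` is a Series whose cells are lists of selected options.
    A row counts as "no cloud" when its list is empty, contains 'None',
    or holds only placeholder markers.
    """
    def uses_cloud(selected):
        real = [s for s in selected if s not in empty_markers]
        return len(real) > 0 and 'None' not in selected

    flags = answers.apply(uses_cloud)
    return round(100 * flags.mean(), 2)

# Toy data (hypothetical), not survey results
toy = pd.Series([['AWS', 'GCP'], [], ['None'], ['Azure'], ['-1']])
print(pct_cloud_users(toy))
```

Keeping the "no cloud" rule in one place makes it easier to apply the same definition to both survey years.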

In [None]:
tmp19 = d19[d19['Q5']!='Student']
print('No entry: {}'.format(len([x for x in list((tmp19['Q29'])) if x==[-1] or x==['-1'] or 'None' in x])))
print('Total: {}'.format(len(tmp19)))
print('Cloud answered: {}'.format(len([x for x in list((tmp19['Q29'])) if x!=[-1] and x!=['-1'] and 'None' not in x])))
total_cloud_19 = len([x for x in list((tmp19['Q29'])) if x!=[-1] and x!=['-1'] and 'None' not in x])
num_cloud_user_19 = collections.Counter([len(x)-1 for x in list((tmp19['Q29'])) if x!=[-1] and x!=['-1'] and 'None' not in x])
p_cloud_19 = round(float(total_cloud_19*100/len(tmp19)),2)
print('%respondents using Cloud: {}'.format(p_cloud_19))

In [None]:
tmp20 = d20[d20['Q5']!='Student']
print('No entry: {}'.format(len([x for x in list((tmp20['Q26_A'])) if x==[] or x==['None']])))
print('Total: {}'.format(len(tmp20)))
print('Cloud answered: {}'.format(len([x for x in list((tmp20['Q26_A'])) if x!=[] and x!=['None']])))
total_cloud_20 = len([x for x in list((tmp20['Q26_A'])) if x!=[] and x!=['None']])
num_cloud_user_20 = collections.Counter([len(x) for x in list((tmp20['Q26_A'])) if x!=[] and x!=['None']])
p_cloud_20 = round(float(total_cloud_20*100/len(tmp20)),2)
print('%respondents using Cloud: {}'.format(p_cloud_20))

# Percentage of users using cloud 2019 vs 2020

In [None]:
x = ['2019', '2020']
y = [p_cloud_19, p_cloud_20]
fig = plt.figure(figsize = (6, 6))
ax = fig.add_subplot()

# ----------------------------------------------------------------------------------------------------
# plot the data
for x_, y_ in zip(x, y):
    # draw one bar per year
    ax.bar(x_, y_, color = "red" if y_ < 32 else "green", alpha = 0.3)
    
    ax.text(x_, y_ + 0.5, round(y_, 1), horizontalalignment='center')
    
# ----------------------------------------------------------------------------------------------------
# prettify the plot
# change the ylim
# ax.set_ylim(0, 70)

# rotate the x tick labels 90 degrees
ax.tick_params(axis='x', labelrotation=90)

# add an y label
ax.set_ylabel("% respondents who use the Cloud")

# set a title
ax.set_title("Percentage of cloud users over the years");

I honestly expected this to be higher, but there seems to be only a marginal increase in the share of people who responded that they work with or know at least one cloud platform. Note that this covers all non-student roles and is not specific to Data Scientists or MLEs (role-wise breakdown coming up).

# Role-wise 2019 vs 2020

This is the same 2019 vs. 2020 comparison, grouped by role. It is interesting to see how strongly a respondent's role is tied to cloud adoption.

The biggest jump is seen in **Database Engineers/DBAs** and **Statisticians**. This suggests that a more diverse set of people is embracing the cloud and bringing more workloads onto it. It's not surprising that DBAs have started using public cloud platforms more, given that the leading cloud platforms are adding support for almost any database out there. It would also be interesting to see these growth numbers broken down by cloud platform, which might help answer the question 'Are DBAs embracing Google Cloud more, or Azure?'
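
A role-by-platform breakdown like the one suggested above could be sketched as follows. This is only an illustration on toy data: the column names `role` and `platforms`, the helper `platform_share_by_role` and all values are assumptions, not survey results.

```python
import pandas as pd

# Toy frame (hypothetical values): one row per respondent,
# `platforms` holds the list of cloud platforms they selected.
toy = pd.DataFrame({
    'role': ['DBA/Database Engineer', 'DBA/Database Engineer',
             'Statistician', 'Statistician'],
    'platforms': [['GCP', 'Azure'], ['Azure'], ['AWS'], []],
})

def platform_share_by_role(df, role_col='role', platform_col='platforms'):
    """Percentage of respondents in each role who selected each platform."""
    role_sizes = df.groupby(role_col).size()
    # explode() gives one row per (respondent, platform); empty lists become NaN
    long = df.explode(platform_col).dropna(subset=[platform_col])
    counts = pd.crosstab(long[role_col], long[platform_col])
    return counts.div(role_sizes, axis=0).mul(100).round(1)

shares = platform_share_by_role(toy)
print(shares)
```

Running the same helper on each year's frame and subtracting the two tables would give the per-role change for every platform.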

In [None]:
roles19 = set([x.strip() for x in list(d19['Q5'][1:])])
roles20 = set([x.strip() for x in list(d20['Q5'][1:])])
roles = roles19.intersection(roles20)

fig = plt.figure(figsize=(20,20))
i = 1
for role in roles:
#     print(role)
    # d19/d20 already exclude the question-text row, so no extra [1:] slice is needed here;
    # slicing again would drop one real respondent from the numerator but not the denominator
    total_cloud_19 = len([x for x in list((d19[d19['Q5']==role]['Q29'])) if x!=[-1] and x!=['-1'] and 'None' not in x])
    p_cloud_19 = round(float(total_cloud_19*100/len(d19[d19['Q5']==role])),2)
    total_cloud_20 = len([x for x in list((d20[d20['Q5']==role]['Q26_A'])) if x!=[] and x!=['None']])
    p_cloud_20 = round(float(total_cloud_20*100/len(d20[d20['Q5']==role])),2)
#     print('%respondents using Cloud: {}'.format(p_cloud_20))
    x = ['2019', '2020']
    y = [p_cloud_19, p_cloud_20]
    totals = [total_cloud_19, total_cloud_20]
    denoms = [len(d19[d19['Q5']==role]), len(d20[d20['Q5']==role])]
    if y[0]!=0:
        ax = fig.add_subplot(4,4,i)
        for x_, y_, t_, d_ in zip(x, y, totals, denoms):
            # draw one bar per year
            ax.bar(x_, y_, color = "red" if x_=='2019' else "green", alpha = 0.3)

            ax.text(x_, y_/2, '{}'.format(round(y_, 1)), horizontalalignment='center')
            ax.text(x_, y_/3, '{}/{}'.format(t_, d_), horizontalalignment='center')

        # ----------------------------------------------------------------------------------------------------
        # prettify the plot
        # change the ylim
    #     ax.set_ylim(0, 30)

        # rotate the x ticks 90 degrees
    #     ax.set_xticklabels(x, rotation = 90)

        # add an y label
        ax.set_ylabel("% respondents who use the Cloud")

        # set a title
        ax.set_title("{}".format(role))
        i+=1
fig.suptitle('Percentage of cloud users by role, 2019 vs 2020')

# Grouped by number of employees

I did not expect this visualization to be so interesting when I was building it. Since I had a template established to plot any column, I simply ran it. Looking at the graph, it seems that large organizations have adopted the cloud more than relatively smaller companies, and that tells an interesting story. Intuitively, we might have expected smaller companies to increase adoption faster over the years than big companies, but here we are!
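
Comparing small and large companies is easier when the subplots appear in order of company size, whereas iterating over a set gives an arbitrary order. Below is a small sketch of a sort key for the size buckets. The label format is an assumption and should be checked against the actual survey values; `size_sort_key` is a hypothetical helper.

```python
import re

def size_sort_key(label):
    """Sort key: the first number found in a company-size label."""
    digits = re.search(r'\d[\d,]*', label)
    return int(digits.group().replace(',', '')) if digits else float('inf')

# Assumed label format; verify against the real survey answers
labels = ['1000-9,999 employees', '0-49 employees', '> 10,000 employees',
          '250-999 employees', '50-249 employees']
ordered = sorted(labels, key=size_sort_key)
print(ordered)
```

Passing the sorted labels to the plotting loop would lay the panels out from smallest to largest company.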

In [None]:
q_19 = 'Q6'
q_20 = 'Q20'
roles19 = set([x.strip() for x in list(d19[q_19][1:])])
roles20 = set([x.strip() for x in list(d20[q_20][1:])])
roles = [x for x in list(roles19.intersection(roles20)) if x!='']

fig = plt.figure(figsize=(20,20))
i = 1
for role in roles:
#     print(role)
    # d19/d20 already exclude the question-text row, so no extra [1:] slice is needed here;
    # slicing again would drop one real respondent from the numerator but not the denominator
    total_cloud_19 = len([x for x in list((d19[d19[q_19]==role]['Q29'])) if x!=[-1] and x!=['-1'] and 'None' not in x])
    p_cloud_19 = round(float(total_cloud_19*100/len(d19[d19[q_19]==role])),2)
    total_cloud_20 = len([x for x in list((d20[d20[q_20]==role]['Q26_A'])) if x!=[] and x!=['None']])
    p_cloud_20 = round(float(total_cloud_20*100/len(d20[d20[q_20]==role])),2)
#     print('%respondents using Cloud: {}'.format(p_cloud_20))
    x = ['2019', '2020']
    y = [p_cloud_19, p_cloud_20]
    totals = [total_cloud_19, total_cloud_20]
    denoms = [len(d19[d19[q_19]==role]), len(d20[d20[q_20]==role])]
    if y[0]!=0:
        ax = fig.add_subplot(4,4,i)
        for x_, y_, t_, d_ in zip(x, y, totals, denoms):
            # draw one bar per year
            ax.bar(x_, y_, color = "red" if x_=='2019' else "green", alpha = 0.3)

            ax.text(x_, y_/2, '{}'.format(round(y_, 1)), horizontalalignment='center')
            ax.text(x_, y_/3, '{}/{}'.format(t_, d_), horizontalalignment='center')

        # ----------------------------------------------------------------------------------------------------
        # prettify the plot
        # change the ylim
    #     ax.set_ylim(0, 30)

        # rotate the x ticks 90 degrees
    #     ax.set_xticklabels(x, rotation = 90)

        # add an y label
        ax.set_ylabel("% respondents who use the Cloud")

        # set a title
        ax.set_title("{}".format(role))
        i+=1
fig.suptitle('Percentage of cloud users by number of employees, 2019 vs 2020')

# Number of cloud products used

I personally spent a lot of 2020 working with Google Cloud and started familiarizing myself with some of the other cloud platforms as well. That got me wondering how many people out there have started to study and play around with more than one cloud platform (I knew I wasn't the only one, haha). The comparison suggests that more and more people are exploring multiple cloud platforms.

> I keep wondering: is this similar to how developers started learning more than one language, or is this comparison not valid?
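
Since the two surveys have different numbers of respondents, raw counts per number of platforms are hard to compare across years. One option is to convert each year's counts into shares first. This is a minimal sketch with made-up numbers; `to_percentages` is a hypothetical helper.

```python
import collections

def to_percentages(counter):
    """Convert raw counts into percentages of the counter's total."""
    total = sum(counter.values())
    return {k: round(100 * v / total, 1) for k, v in sorted(counter.items())}

# Toy counts (hypothetical): number of platforms -> number of respondents
toy_19 = collections.Counter({1: 60, 2: 30, 3: 10})
toy_20 = collections.Counter({1: 100, 2: 70, 3: 30})

share_19 = to_percentages(toy_19)
share_20 = to_percentages(toy_20)
print(share_19)
print(share_20)
```

In this toy example the raw count of single-platform users grows, yet their share shrinks, which is the kind of shift raw counts can hide.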

In [None]:
x19 = list(num_cloud_user_19.keys())
y19 = list(num_cloud_user_19.values())
# instantiate the figure
fig = plt.figure(figsize = (16,6))
ax19 = fig.add_subplot(1,3,1)

# ----------------------------------------------------------------------------------------------------
# plot the data
for x_, y_ in zip(x19, y19):
    # draw one bar per platform count, all in the same palette color
    ax19.bar(x_, y_, color = color[9], alpha = 0.5)
#     ax19.bar(x_, y_, color = "red" if y_ < 100 else "green", alpha = 0.3)
    
     # add some text
    ax19.text(x_, y_ + 0.3, round(y_, 1), horizontalalignment = 'center')

# ----------------------------------------------------------------------------------------------------
# prettify the plot

# rotate the x ticks 90 degrees
# ax19.set_xticklabels(x, rotation=90)

# add an y label
ax19.set_ylabel("Number of respondents")

# add an x label
ax19.set_xlabel("Number of Cloud platforms")

# set a title
ax19.set_title("Number of cloud platforms per respondent (2019)");

x20 = list(num_cloud_user_20.keys())
y20 = list(num_cloud_user_20.values())

ax20 = fig.add_subplot(1,3,3)

# ----------------------------------------------------------------------------------------------------
# plot the data
for x_, y_ in zip(x20, y20):
    # draw one bar per platform count, all in the same palette color
    ax20.bar(x_, y_, color = color[9], alpha = 0.5)
#     ax20.bar(x_, y_, color = "red" if y_ < 100 else "green", alpha = 0.3)
    
     # add some text
    ax20.text(x_, y_ + 0.3, round(y_, 1), horizontalalignment = 'center')

# ----------------------------------------------------------------------------------------------------
# prettify the plot

# rotate the x ticks 90 degrees
# ax20.set_xticklabels(x, rotation=90)

# add an y label
ax20.set_ylabel("Number of respondents")

# add an x label
ax20.set_xlabel("Number of Cloud platforms")

# set a title
ax20.set_title("Number of cloud platforms per respondent (2020)");


In [None]:
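def flag_cloud_19(row):
    # True for non-student respondents who named at least one cloud platform.
    # The role is in Q5 (the column used for the Student filter earlier), not Q6.
    x = row['Q29']
    if x!=[-1] and x!=['-1'] and 'None' not in x and row['Q5']!='Student':
        return pd.Series([True])
    else:
        return pd.Series([False])

def flag_cloud_20(row):
    # Same flag for 2020, where the cloud platform question is Q26_A
    x = row['Q26_A']
    if x!=[] and x!=['None'] and row['Q5']!='Student':
        return pd.Series([True])
    else:
        return pd.Series([False])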
    
d19[['cloud_true']] = d19.apply(flag_cloud_19, axis=1)
d20[['cloud_true']] = d20.apply(flag_cloud_20, axis=1)

In [None]:
cloudwise_20 = collections.Counter(sum(list(d20[d20.cloud_true==True]['Q26_A']), []))
total_cloud_20 = len(d20[d20.cloud_true==True])
for key in cloudwise_20:
    cloudwise_20[key]/=total_cloud_20/100

In [None]:
cloudwise_19 = collections.Counter(sum(list(d19[d19.cloud_true==True]['Q29']), []))
total_cloud_19 = len(d19[d19.cloud_true==True])
for key in cloudwise_19:
    cloudwise_19[key]/=total_cloud_19/100

In [None]:
clouds = set(cloudwise_19.keys()).intersection(cloudwise_20.keys())
clouds_diff = {cloud:cloudwise_20[cloud]-cloudwise_19[cloud] for cloud in clouds}
clouds_diff = dict(sorted(clouds_diff.items(), key=lambda x: x[1]) )

# Growth of public cloud platforms 2019 vs 2020

This seems to be one of the most important visualizations of this notebook. Here's why:
> We all know Amazon is the most used cloud platform out there, but as a Cloud Consultant I'm seeing a lot of companies shopping around for platforms that offer more than just 'cheap cloud' or 'high discounts'. Companies now know that cloud is everywhere, and they have started looking for the right things in cloud platforms (for example, Google supports a lot of open source tools and libraries compared to AWS; just my observation, not trying to sell GCP to you).

It does not come as a surprise that Azure is leading the growth numbers here; I personally know a lot of developers and companies moving to Azure. What is surprising, however, is 'Oracle Cloud'. Although Oracle is nowhere near the other giants in terms of absolute market share, it is interesting to see how it's growing over the years. It could be worth looking at the roles and company sizes for which Oracle is growing a lot (there could be a pattern here :D).
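
A platform with a small share can show a large relative jump while moving only slightly in absolute terms, so it may help to look at both measures side by side. The following is a rough sketch on made-up shares. `growth_table` and the platform names are hypothetical.

```python
def growth_table(share_19, share_20):
    """Compare two {platform: % of cloud users} mappings.

    Returns {platform: (point_change, relative_change_pct)} for platforms
    present in both years, sorted by point change.
    """
    common = set(share_19) & set(share_20)
    rows = {}
    for name in common:
        before, after = share_19[name], share_20[name]
        point_change = after - before
        # relative change is undefined when the starting share is zero
        relative = 100 * point_change / before if before else float('nan')
        rows[name] = (round(point_change, 2), round(relative, 2))
    return dict(sorted(rows.items(), key=lambda kv: kv[1][0]))

# Toy shares (hypothetical), not survey results
toy_19 = {'Big Cloud': 50.0, 'Small Cloud': 2.0}
toy_20 = {'Big Cloud': 52.0, 'Small Cloud': 3.0}
table = growth_table(toy_19, toy_20)
print(table)
```

Here 'Small Cloud' gains fewer percentage points than 'Big Cloud' but grows far more in relative terms, which is one way a small platform can stand out in a growth chart.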

In [None]:
vals = list(clouds_diff.values())
keys = list(clouds_diff.keys())

fig = plt.figure(figsize = (8, 8))
ax = fig.add_subplot()
colors = ['Red' if v<0 else 'Green' for v in vals]

# ----------------------------------------------------------------------------------------------------
# plot the data
# plot horizontal lines from the origin to each data point
ax.hlines(y = keys, 
          xmin = 0,
          xmax = vals,
          color = colors,
          alpha = 0.6)

# # plot the dots
ax.scatter(x = vals,
          y = keys,
          s = 100,
          color = colors,
          alpha = 0.6)
y = [round(x,2) for x in vals]
x = keys
for i in range(len(clouds_diff)):
    if y[i]>0:
        ax.annotate('{}%'.format(y[i]), (y[i]+0.3, x[i]), alpha=0.8)
    else:
        ax.annotate('{}%'.format(y[i]), (y[i]-0.8, x[i]), alpha=0.8)
ax.set_title("Diverging Lollipop of growth of Cloud platforms from 2019 to 2020")

# autoscale
ax.autoscale_view()

# change x lim
ax.set_xlim(-2, 7)

# set labels
ax.set_xlabel("Change in share of cloud users (percentage points)")
ax.set_ylabel("Cloud platform")

# change the spines to make it nicer
ax.spines["right"].set_color("None")
ax.spines["top"].set_color("None")

# add a grid
ax.grid(linestyle='--', alpha=0.5);

In [None]:
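# Count ML products among cloud users only, matching the denominator below and the 2020 cell
cloud_ml_products_19 = collections.Counter(sum(d19[d19.cloud_true==True]['Q32'], []))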
total_cloud_19 = len(d19[d19.cloud_true==True])
for key in cloud_ml_products_19:
    cloud_ml_products_19[key]/=total_cloud_19/100

cloud_ml_products_mapping = {
    'Google Cloud Machine Learning Engine': 'Google Cloud AI Platform / Google Cloud ML Engine',
    'Google Cloud Vision': 'Google Cloud Vision AI',
    
}

for old in cloud_ml_products_mapping:
    cloud_ml_products_19[cloud_ml_products_mapping[old]] = cloud_ml_products_19.pop(old)

In [None]:
cloud_ml_products_20 = collections.Counter(sum(d20[d20.cloud_true==True]['Q28_A'], []))
total_cloud_20 = len(d20[d20.cloud_true==True])
for key in cloud_ml_products_20:
    cloud_ml_products_20[key]/=total_cloud_20/100
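# Build a new Counter with stripped keys instead of mutating the dict while iterating over it
cloud_ml_products_20 = collections.Counter({key.strip(): value for key, value in cloud_ml_products_20.items()})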

In [None]:
cloud_ml_products = set(cloud_ml_products_19.keys()).intersection(cloud_ml_products_20.keys())
cloud_ml_products

cloud_ml_products_diff = {cloud:cloud_ml_products_20[cloud]-cloud_ml_products_19[cloud] for cloud in cloud_ml_products}
cloud_ml_products_diff = dict(sorted(cloud_ml_products_diff.items(), key=lambda x: x[1]) )
_ = cloud_ml_products_diff.pop('Other')

# Popularity of popular cloud ML tools 2019 vs 2020

All I really wanted to compare was the growth of SageMaker vs. Google AI Platform. But something interesting I came across here is that although the adoption rate for Azure seems to be going up, the proportion of users using Azure Machine Learning Studio seems to have gone down. That might mean that Azure is not the platform data folks are turning to at the moment, but that it is more sought after by developers with other profiles (Statisticians maybe :P ?).
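
Product names differ slightly between the two surveys (stray whitespace, renamed products), so the keys need to line up before the years can be compared. Here is a minimal sketch of that normalization. `normalize_counts` and the toy counts are hypothetical, and the rename shown is just an example in the spirit of the 2019 mapping.

```python
import collections

def normalize_counts(counts, renames=None):
    """Return a new Counter with stripped keys and optional renames applied.

    Building a new Counter avoids mutating the one being iterated over.
    """
    renames = renames or {}
    out = collections.Counter()
    for key, value in counts.items():
        name = key.strip() if isinstance(key, str) else key
        out[renames.get(name, name)] += value
    return out

# Toy counts (hypothetical), not survey results
toy_19 = collections.Counter({' Google Cloud Vision ': 5, 'Amazon SageMaker': 7})
renames = {'Google Cloud Vision': 'Google Cloud Vision AI'}
clean_19 = normalize_counts(toy_19, renames)
print(clean_19)
```

Keys that collapse to the same name after stripping or renaming have their counts added together rather than overwritten.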

In [None]:
vals = list(cloud_ml_products_diff.values())
keys = list(cloud_ml_products_diff.keys())

fig = plt.figure(figsize = (6, 6))
ax = fig.add_subplot()
colors = ['Red' if v<0 else 'Green' for v in vals]

# ----------------------------------------------------------------------------------------------------
# plot the data
# plot horizontal lines from the origin to each data point
ax.hlines(y = keys, 
          xmin = 0,
          xmax = vals,
          color = colors,
          alpha = 0.6)

# print(colors)
# # plot the dots
ax.scatter(x = vals,
          y = keys,
          s = 100,
          color = colors,
          alpha = 0.6)
y = [round(x,2) for x in vals]
x = keys
for i in range(len(cloud_ml_products_diff)):
    if y[i]>0:
        ax.annotate('{}%'.format(y[i]), (y[i]+0.3, x[i]), alpha=0.8)
    else:
        ax.annotate('{}%'.format(y[i]), (y[i]-0.8, x[i]), alpha=0.8)
ax.set_title("Diverging Lollipop of growth of Cloud ML products from 2019 to 2020")

# autoscale
ax.autoscale_view()

# change x lim
ax.set_xlim(-3, 3)

# set labels
ax.set_xlabel("Change in share of cloud users (percentage points)")
ax.set_ylabel("Cloud ML product")

# change the spines to make it nicer
ax.spines["right"].set_color("None")
ax.spines["top"].set_color("None")

# add a grid
ax.grid(linestyle='--', alpha=0.5);

# Under construction - Proceed with caution
![Typing furiously](https://pics.me.me/the-best-of-me-is-yet-to-come-under-construction-25159623.png)
Yes, the code down here is really bad and might hurt your eyes. Come visit us later?

In [None]:
cloud_bi_products_20 = collections.Counter(sum(d20[d20.cloud_true==True]['Q31_A'], []))
total_cloud_20 = len(d20[d20.cloud_true==True])
for key in cloud_bi_products_20:
    cloud_bi_products_20[key]/=total_cloud_20/100
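# Build a new Counter with stripped keys instead of mutating the dict while iterating over it
cloud_bi_products_20 = collections.Counter({key.strip(): value for key, value in cloud_bi_products_20.items()})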
cloud_bi_products_20

In [None]:
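# NOTE: cloud_bi_products_19 is never built in this notebook, so this cell raises a NameError as written
cloud_bi_products = set(cloud_bi_products_19.keys()).intersection(cloud_bi_products_20.keys())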
cloud_bi_products

cloud_bi_products_diff = {cloud:cloud_bi_products_20[cloud]-cloud_bi_products_19[cloud] for cloud in cloud_bi_products}
cloud_bi_products_diff = dict(sorted(cloud_bi_products_diff.items(), key=lambda x: x[1]) )
cloud_bi_products_diff

In [None]:
cloud_automl_products_19 = collections.Counter(sum(d19[d19.cloud_true==True]['Q33'], []))
total_cloud_19 = len(d19[d19.cloud_true==True])
for key in cloud_automl_products_19:
    cloud_automl_products_19[key]/=total_cloud_19/100
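# Build a new Counter with stripped keys instead of mutating the dict while iterating over it
cloud_automl_products_19 = collections.Counter({(key.strip() if isinstance(key, str) else key): value
                                                for key, value in cloud_automl_products_19.items()})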

cloud_automl_products_mapping = {
#     'AWS Redshift': 'Amazon Redshift',
#     'Google automlgQuery': 'Google Cloud automlgQuery',
#     'AWS Athena': 'Amazon Athena',
}

for old in cloud_automl_products_mapping:
    cloud_automl_products_19[cloud_automl_products_mapping[old]] = cloud_automl_products_19.pop(old)
cloud_automl_products_19

In [None]:
cloud_automl_products_20 = collections.Counter(sum(d20[d20.cloud_true==True]['Q34_A'], []))
total_cloud_20 = len(d20[d20.cloud_true==True])
for key in cloud_automl_products_20:
    cloud_automl_products_20[key]/=total_cloud_20/100
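# Build a new Counter with stripped keys instead of mutating the dict while iterating over it
cloud_automl_products_20 = collections.Counter({key.strip(): value for key, value in cloud_automl_products_20.items()})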
cloud_automl_products_20

In [None]:
cloud_automl_products = set(cloud_automl_products_19.keys()).intersection(cloud_automl_products_20.keys())
cloud_automl_products

cloud_automl_products_diff = {cloud:cloud_automl_products_20[cloud]-cloud_automl_products_19[cloud] for cloud in cloud_automl_products}
cloud_automl_products_diff = dict(sorted(cloud_automl_products_diff.items(), key=lambda x: x[1]) )
cloud_automl_products_diff