## What do Kagglers wish for?

The 2021 Kaggle survey data has two sections: the main survey questions and the supplementary questions. In the latter section, Kaggle did a sort of *marketing research* by asking Kagglers which DS/ML/AI/big data platforms and tools they wish/hope to become familiar with in the coming two years.

These questions are about the **FUTURE** and are placed at the **BACK** of the questionnaire, hence the title of the notebook :)

In this *short* notebook I visualize what the Kagglers of today wish/hope to explore and familiarize themselves with over the next two years.

First, I analyse the global trend, i.e. the choices/preferences of all respondents. Then I go one layer deeper and check whether individual job titles follow the global trend in their platform/tool preferences for the next two years.

## Table of Contents
* [1. Cloud computing platform/products](#1)
* [2. Data Storage Products](#2)
* [3. Managed ML products](#3)
* [4. Big Data Products](#4)
* [5. Business Intelligence Tools (BI)](#5)
* [6. AutoML Tools](#6)
* [7. ML Experiments](#7)
* [8. Summary](#8)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
pio.templates.default = "none"

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Row 0 of the CSV holds the question text, so it is dropped for the analysis frame
df21 = pd.read_csv("/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
df = df21.iloc[1:]

In [None]:
# Column names of the "_B" (future) question parts

q27b_ = ['Q27_B_Part_1',
    'Q27_B_Part_2',
    'Q27_B_Part_3',
    'Q27_B_Part_4',
    'Q27_B_Part_5',
    'Q27_B_Part_6',
    'Q27_B_Part_7',
    'Q27_B_Part_8',
    'Q27_B_Part_9',
    'Q27_B_Part_10',
    'Q27_B_Part_11',
    'Q27_B_OTHER']


q29b_ = ['Q29_B_Part_1',
     'Q29_B_Part_2',
     'Q29_B_Part_3',
     'Q29_B_Part_4',
     'Q29_B_OTHER']


q30b_ = ['Q30_B_Part_1',
    'Q30_B_Part_2',
    'Q30_B_Part_3',
    'Q30_B_Part_4',
    'Q30_B_Part_5',
    'Q30_B_Part_6',
    'Q30_B_Part_7',
    'Q30_B_OTHER']


q31b_ = ['Q31_B_Part_1',
    'Q31_B_Part_2',
    'Q31_B_Part_3',
    'Q31_B_Part_4',
    'Q31_B_Part_5',
    'Q31_B_Part_6',
    'Q31_B_Part_7',
    'Q31_B_Part_8',
    'Q31_B_Part_9',
    'Q31_B_OTHER']


q32b_ = ['Q32_B_Part_1',
    'Q32_B_Part_2',
    'Q32_B_Part_3',
    'Q32_B_Part_4',
    'Q32_B_Part_5',
    'Q32_B_Part_6',
    'Q32_B_Part_7',
    'Q32_B_Part_8',
    'Q32_B_Part_9',
    'Q32_B_Part_10',
    'Q32_B_Part_11',
    'Q32_B_Part_12',
    'Q32_B_Part_13',
    'Q32_B_Part_14',
    'Q32_B_Part_15',
    'Q32_B_Part_16',
    'Q32_B_Part_17',
    'Q32_B_Part_18',
    'Q32_B_Part_19',
    'Q32_B_Part_20',
    'Q32_B_OTHER']


q34b_ = ['Q34_B_Part_1',
    'Q34_B_Part_2',
    'Q34_B_Part_3',
    'Q34_B_Part_4',
    'Q34_B_Part_5',
    'Q34_B_Part_6',
    'Q34_B_Part_7',
    'Q34_B_Part_8',
    'Q34_B_Part_9',
    'Q34_B_Part_10',
    'Q34_B_Part_11',
    'Q34_B_Part_12',
    'Q34_B_Part_13',
    'Q34_B_Part_14',
    'Q34_B_Part_15',
    'Q34_B_Part_16',
    'Q34_B_OTHER']


q36b_ = ['Q36_B_Part_1',
    'Q36_B_Part_2',
    'Q36_B_Part_3',
    'Q36_B_Part_4',
    'Q36_B_Part_5',
    'Q36_B_Part_6',
    'Q36_B_Part_7',
    'Q36_B_OTHER']


q37b_ = ['Q37_B_Part_1',
    'Q37_B_Part_2',
    'Q37_B_Part_3',
    'Q37_B_Part_4',
    'Q37_B_Part_5',
    'Q37_B_Part_6',
    'Q37_B_Part_7',
    'Q37_B_OTHER']


q38b_ = ['Q38_B_Part_1',
   'Q38_B_Part_2',
   'Q38_B_Part_3',
   'Q38_B_Part_4',
   'Q38_B_Part_5',
   'Q38_B_Part_6',
   'Q38_B_Part_7',
   'Q38_B_Part_8',
   'Q38_B_Part_9',
   'Q38_B_Part_10',
   'Q38_B_Part_11',
   'Q38_B_OTHER']


##############################
# Answer labels corresponding to the parts of each question

q27b = ['Amazon Web Services (AWS)',
    'Microsoft Azure',
    'Google Cloud Platform (GCP)' ,
    'IBM Cloud / Red Hat',
    'Oracle Cloud',
    'SAP Cloud',
    'Salesforce Cloud',
    'VMware Cloud',
    'Alibaba Cloud' ,
    'Tencent Cloud',
    'None',
    'Other']


q29b = ['Amazon Elastic Compute Cloud (EC2)',
    'Microsoft Azure Virtual Machines',
    'Google Cloud Compute Engine',
    'No / None',
    'Other']

q30b = ['Microsoft Azure Data Lake Storage' ,
    'Microsoft Azure Disk Storage',
    'Amazon Simple Storage Service (S3) ',
    'Amazon Elastic File System (EFS) ' ,
    'Google Cloud Storage (GCS) ',
    'Google Cloud Filestore' ,
    'No / None',
    'Other']


q31b = ['Amazon SageMaker' ,
    'Azure Machine Learning Studio',
    'Google Cloud Vertex AI',
    'DataRobot' ,
    'Databricks' ,
    'Dataiku',
    'Alteryx',
    'Rapidminer' ,
    'No / None',
    'Other']


q32b = ['MySQL',
    'PostgreSQL',
    'SQLite',
    'Oracle Database',
    'MongoDB',
    'Snowflake',
    'IBM Db2',
    'Microsoft SQL Server',
    'Microsoft Azure SQL Database', 
    'Microsoft Azure Cosmos DB',
    'Amazon Redshift', 
    'Amazon Aurora',
    'Amazon RDS',
    'Amazon DynamoDB',
    'Google Cloud BigQuery',
    'Google Cloud SQL',
    'Google Cloud Firestore',
    'Google Cloud BigTable',
    'Google Cloud Spanner',
    'None' ,
    'Other' ]

q34b = ['Amazon QuickSight',
    'Microsoft Power BI',
    'Google Data Studio',
    'Looker',
    'Tableau',
    'Salesforce',
    'Tableau CRM',
    'Qlik' ,
    'Domo',
    'TIBCO Spotfire',
    'Alteryx',
    'Sisense',
    'SAP Analytics Cloud',
    'Microsoft Azure Synapse',
    'Thoughtspot',
    'None',
    'Other']


q36b  = ['Automated data augmentation (e.g. imgaug, albumentations)' ,
    'Automated feature engineering/selection (e.g. tpot, boruta_py)',
    'Automated model selection (e.g. auto-sklearn, xcessiv)',
    'Automated model architecture searches (e.g. darts, enas)' ,
    'Automated hyperparameter tuning (e.g. hyperopt, ray.tune, Vizier)' ,
    'Automation of full ML pipelines (e.g. Google AutoML, H20 Driverless AI)',
    'No / None' ,
    'Other']

q37b = ['Google Cloud AutoML',
    'H20 Driverless AI',
    'Databricks AutoML',
    'DataRobot AutoML',
    'Amazon Sagemaker Autopilot',
    'Azure Automated Machine Learning',
    'No / None',
    'Other']

q38b = ['Neptune.ai',
    'Weights & Biases',
    'Comet.ml',
    'Sacred + Omniboard',
    'TensorBoard',
    'Guild.ai' ,
    'Polyaxon',
    'ClearML' ,
    'Domino Model Monitor' ,
    'MLflow',
    'No / None',
    'Other']


In [None]:
# helper functions
# the head count function was copied and slightly modified from https://www.kaggle.com/ruchi798/kaggle-ml-ds-survey-analysis
def head_count_suf(df, question_num, part, n):
    # Column names Q<num><part>_Part_1 .. Q<num><part>_Part_<n-1>, plus the OTHER column
    questions = ['Q'+ str(question_num)+str(part) +'_Part_'+ str(j) for j in range(1, n)]
    questions.append('Q'+ str(question_num) +str(part)+ '_OTHER')
    
    categories = []
    values = []
    for i in questions:
        # Each part column holds one answer label; its most frequent value is that label
        counts = df[i].value_counts()
        category = counts.index[0]
        val = counts.iloc[0]  # positional access; counts[0] relies on deprecated fallback
        
        categories.append(category)
        values.append(val)
       
    combined_df = pd.DataFrame()
    combined_df['Category'] = categories
    combined_df['Value'] = values
   
    return combined_df

def df_with_percentages_suf(df, q, part, n):    
    dff = head_count_suf(df, q, part, n)   
    
    perc = []

    for item  in dff.Value:          
        
        perc.append(np.round(item/len(df) *100, 2))

    data = pd.DataFrame()
    data['Percentage'] = perc
    df_perc = pd.concat([dff, data], axis=1)
    #df_perc['Region'] = region
    
    return df_perc


# def df_with_percentages_relative(df, q, part, n):
    
#     dff = head_count_suf(df, q, part, n)
#     total = dff['Value'].sum()
#     perc = []

#     for item  in dff.Value:      
        
#         perc.append(np.round(item/total*100, 2))

#     data = pd.DataFrame()
#     data['Percentage'] = perc
#     df_perc = pd.concat([dff, data], axis=1)
    
    
#     return df_perc
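
The counting done by the helpers above can also be sketched without an explicit loop. This is only an illustration on a made-up toy frame: `count_parts` and the toy values are hypothetical names and data, and it assumes the question-text row has already been dropped, so the first non-null value in each column is the answer label.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey frame (made-up values, not real survey data).
# Each "part" column holds either the option label or NaN.
toy = pd.DataFrame({
    "Q27_B_Part_1": ["AWS", np.nan, "AWS", "AWS"],
    "Q27_B_Part_2": [np.nan, "Azure", "Azure", np.nan],
    "Q27_B_OTHER": [np.nan, np.nan, np.nan, "Other"],
})

def count_parts(frame, prefix):
    """Count non-null answers per part column and label each by its option text."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    counts = frame[cols].count()  # non-null answers per column
    # First non-null value is the label; fall back to the column name if empty
    labels = frame[cols].apply(
        lambda s: s.dropna().iloc[0] if s.notna().any() else s.name
    )
    out = pd.DataFrame({"Category": labels.values, "Value": counts.values})
    out["Percentage"] = (out["Value"] / len(frame) * 100).round(2)
    return out

summary = count_parts(toy, "Q27_B")
print(summary)
```

Selecting columns by prefix avoids having to pass the number of parts (`n`) by hand, at the cost of relying on the column naming scheme.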

## 1. Cloud computing platform/products  <a class="anchor" id="1"></a>

#### Q27-A
#### Present: Which of the following cloud computing platforms do you use on a regular basis? 
#### Q27-B
#### Future: Which of the following cloud computing platforms do you hope to become more familiar with in the next 2 years?


> * Most respondents use either **AWS, GCP** or **Microsoft Azure**.
> * At present AWS is leading the pack, with GCP and Microsoft Azure following in 2nd and 3rd place.
> * However, GCP is gaining fans and is almost tied with AWS as the future preference.
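
A present-versus-future comparison like the one above can be made explicit by ranking both sets of counts and looking at the difference. The sketch below uses made-up counts and placeholder platform names, and `rank_shift` is a hypothetical helper rather than part of this notebook.

```python
import pandas as pd

# Made-up counts purely for illustration; they are not survey results.
present = pd.Series({"Platform A": 120, "Platform B": 90, "Platform C": 60})
future = pd.Series({"Platform A": 150, "Platform B": 155, "Platform C": 80})

def rank_shift(present_counts, future_counts):
    """Return present rank, future rank and the change (positive = moved up)."""
    ranks = pd.DataFrame({
        "present_rank": present_counts.rank(ascending=False, method="min").astype(int),
        "future_rank": future_counts.rank(ascending=False, method="min").astype(int),
    })
    ranks["shift"] = ranks["present_rank"] - ranks["future_rank"]
    return ranks.sort_values("future_rank")

shifts = rank_shift(present, future)
print(shifts)
```

With real data, the two inputs would be the `Value` columns of the present and future frames, indexed by `Category`.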

In [None]:
title = 'Cloud computing platforms: present use vs future wish'

df_27a = df_with_percentages_suf(df21, 27, '_A', 12).sort_values(['Value'],ascending=False)
df_27b = df_with_percentages_suf(df21, 27, '_B', 12).sort_values(['Value'],ascending=False)


fig = make_subplots(rows=1, cols=2, 
                    horizontal_spacing = 0.3,
                    shared_yaxes=False,
                   )

fig.add_trace(go.Bar(y=df_27a['Category'],
                     x=df_27a['Value'],
                     orientation= 'h',
                     name="present",
                     marker_color='lightseagreen',
                     opacity =0.6
                    ),row=1, col=1,
             )

fig.add_trace(go.Bar(y=df_27b['Category'],
                     x=df_27b['Value'],
                     orientation= 'h',
                     name="future",
                     marker_color='gold',
                     opacity =0.6
                    ),row=1, col=2,
             )

fig.update_traces(marker_line_color='black',
              marker_line_width=1.5)
fig.update_layout(title=title, 
              font_family="San Serif",
              bargap=0.2,
                  height=400, width=1000,
              barmode='group',
              titlefont={'size': 28},
              template='simple_white',
              paper_bgcolor='#F5F5F5',
              plot_bgcolor='#F5F5F5',
              legend=dict(
              orientation="v", 
                  y=1, 
                  yanchor="top", 
                  x=1.250, 
                  xanchor="right",)                 
              ).update_yaxes(categoryorder='total ascending')

fig.update_layout(xaxis_title='Count',yaxis_linewidth=2.5,
autosize=False,
margin=dict(
    l=225, r=150, b=100, t=70,
),
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()


#### Q29-A
#### Present: Do you use any of the following cloud computing products on a regular basis? 

#### Q29-B
#### Future: In the next 2 years, do you hope to become more familiar with any of these specific cloud computing products? 


> * Of the respondents who do use cloud computing platforms, **Amazon Elastic Compute Cloud (EC2)** users edge out **Google Cloud Compute Engine** users.
> * However, the future looks brighter for **Google Cloud Compute Engine**!
> * **Microsoft Azure Virtual Machines** moves up to second choice in the future.

In [None]:
title = 'Cloud computing products: present use vs future wish'

df_29a = df_with_percentages_suf(df21, 29, '_A', 5).sort_values(['Value'],ascending=False)
df_29b = df_with_percentages_suf(df21, 29, '_B', 5).sort_values(['Value'],ascending=False)


fig = make_subplots(rows=1, cols=2, 
                    horizontal_spacing = 0.3,
                    shared_yaxes=False,
                   )

fig.add_trace(go.Bar(y=df_29a['Category'],
                     x=df_29a['Value'],
                     orientation= 'h',
                     name="present",
                     marker_color='lightseagreen',
                     opacity =0.6
                    ),row=1, col=1,
             )

fig.add_trace(go.Bar(y=df_29b['Category'],
                     x=df_29b['Value'],
                     orientation= 'h',
                     name="future",
                     marker_color='gold',
                     opacity =0.6
                    ),row=1, col=2,
             )

fig.update_traces(marker_line_color='black',
              marker_line_width=1.5)
fig.update_layout(title=title, 
              font_family="San Serif",
              bargap=0.2,
                  height=400, width=1000,
              barmode='group',
              titlefont={'size': 28},
              paper_bgcolor='#F5F5F5',
              plot_bgcolor='#F5F5F5',
              legend=dict(
              orientation="v", 
                  y=1, 
                  yanchor="top", 
                  x=1.250, 
                  xanchor="right",)                 
              ).update_yaxes(categoryorder='total ascending')

fig.update_layout(xaxis_title='Count',
                  yaxis_linewidth=2.5,
autosize=False,
margin=dict(
    l=225, r=150, b=100, t=70,
),
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

In [None]:
title="<b><span style='font-size:28px; color: #04a999'>Cloud Computing Platforms: Future </span> <br>"

cols = ['Q5'] + q27b_
df = df21[cols][1:]
df = df.groupby(['Q5']).count()

x = list(df.columns)
y = list(df.index) 
z = np.transpose(df.values)


fig = ff.create_annotated_heatmap(z=z, y=q27b, x=y, colorscale='greens', xgap=3, ygap=3)

fig.update_layout(title_text=title,
                  title_x=0.5,
                  titlefont={'size': 24, 'family':'San Serif'},
                  width=900, height=550,
                  xaxis_showgrid=False,
                  xaxis={'side': 'bottom'},
                  yaxis_showgrid=False,
                  yaxis_autorange='reversed',
                  #template='ggplot2',
                  #paper_bgcolor='#f6f6f6',
                  annotations=None,
                  )

fig.update_layout(autosize=False,
margin=dict(
    l=225, r=150, b=150, t=100,)
)
fig.show()

title="<b><span style='font-size:28px; color: #04a999'>Cloud Computing Products: Future </span> <br>"


cols = ['Q5'] + q29b_
df = df21[cols][1:]
df = df.groupby(['Q5']).count()

x = list(df.columns)
y = list(df.index) 
z = np.transpose(df.values)


fig = ff.create_annotated_heatmap(z=z, y=q29b, x=y, colorscale='greens', xgap=3, ygap=3)

fig.update_layout(title_text=title,
                  title_x=0.5,
                  titlefont={'size': 24, 'family':'San Serif'},
                  width=900, height=550,
                  xaxis_showgrid=False,
                  xaxis={'side': 'bottom'},
                  yaxis_showgrid=False,
                  yaxis_autorange='reversed',
                  #paper_bgcolor='#f6f6f6',
                  )

fig.update_layout(autosize=False,
margin=dict(
    l=225, r=150, b=150, t=100,)
                 )
              
fig.show()
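
The heatmaps above show raw counts per job title, so larger groups naturally produce darker cells. One possible complement, sketched here on made-up toy data, is to divide each row by the number of respondents with that job title. The toy values and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two job titles of different sizes, two option columns.
toy = pd.DataFrame({
    "Q5": ["Student", "Student", "Student", "Student",
           "Data Scientist", "Data Scientist"],
    "Q27_B_Part_1": ["AWS", "AWS", np.nan, np.nan, "AWS", "AWS"],
    "Q27_B_Part_2": [np.nan, "Azure", np.nan, np.nan, "Azure", np.nan],
})

counts = toy.groupby("Q5").count()       # raw counts, as in the heatmaps
group_sizes = toy.groupby("Q5").size()   # respondents per job title
# Share of each job title that picked the option, in percent
shares = counts.div(group_sizes, axis=0).mul(100).round(1)

print(shares)
```

In this toy example both job titles have the same raw count for the first option, yet the shares differ because the groups differ in size.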

## 2. Data Storage Products  <a class="anchor" id="2"></a>

#### Q30-A
#### Present: Do you use any of the following data storage products on a regular basis?

#### Q30-B
#### Future: In the next 2 years, do you hope to become more familiar with any of these specific data storage products?

> * Amazon Simple Storage Service (S3), Google Cloud Storage (GCS) and Microsoft Azure Data Lake Storage are the top storage products.
> * Due to the absence of Q30-B from the questionnaire, we do not know what the respondents wish for the future.
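
Whether a given question part is available can be checked directly against the frame's columns before plotting. The following is a minimal sketch with a hypothetical helper `part_columns` and a toy frame whose column names are chosen only for illustration.

```python
import pandas as pd

def part_columns(frame, question, part):
    """List the columns of one question part, e.g. question=30, part='B'."""
    prefix = f"Q{question}_{part}_"
    return [c for c in frame.columns if c.startswith(prefix)]

# Toy frame with hypothetical columns; the real frame would be the survey CSV.
toy = pd.DataFrame(columns=["Q30_A_Part_1", "Q30_A_OTHER", "Q31_B_Part_1"])

present_cols = part_columns(toy, 30, "A")
future_cols = part_columns(toy, 30, "B")

print(present_cols)
print(future_cols)  # an empty list means the part is not in the frame
```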

In [None]:
title = 'Data Storage Products: present use'

df_30a = df_with_percentages_suf(df21, 30, '_A', 8).sort_values(['Value'],ascending=False)
#df_30b = df_with_percentages_suf(df21, 30, '_B', 8).sort_values(['Value'],ascending=False)


fig = go.Figure()

fig.add_trace(go.Bar(y=df_30a['Category'],
                     x=df_30a['Value'],
                     orientation= 'h',
                     name="present",
                     marker_color='lightseagreen',
                     opacity =0.6
                    )
             )
fig.update_traces(marker_line_color='black',
              marker_line_width=1.5)
fig.update_layout(title=title, 
              font_family="San Serif",
              bargap=0.2,
                  height=500, width=800,
              barmode='group',
              titlefont={'size': 28},
              paper_bgcolor='#F5F5F5',
              plot_bgcolor='#F5F5F5',
              legend=dict(
              orientation="v", 
                  y=1, 
                  yanchor="top", 
                  x=1.250, 
                  xanchor="right",)                 
              ).update_yaxes(categoryorder='total ascending')

fig.update_layout(xaxis_title='Count',yaxis_linewidth=2.5,
autosize=False,
margin=dict(
    l=225, r=150, b=100, t=70,
),
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

## 3. Managed ML products  <a class="anchor" id="3"></a>

#### Q31-A
#### Present: Do you use any of the following managed machine learning products on a regular basis?

#### Q31-B
#### Future: In the next 2 years, do you hope to become more familiar with any of these managed machine learning products?

> * The large majority do not use any managed machine learning (MML) product. Of those who do, **Amazon SageMaker**, **Azure Machine Learning Studio** and **Databricks** are the top 3, with Google Cloud Vertex AI in fourth place.
> * The future seems promising for these products as well. The same **four** are also the top 4 future choices, and **Google Cloud Vertex AI** is in high demand this time.

In [None]:
title = 'Managed Machine Learning Products: present use vs future wish'

df_30a = df_with_percentages_suf(df21, 31, '_A', 10).sort_values(['Value'],ascending=False)
df_30b = df_with_percentages_suf(df21, 31, '_B', 10).sort_values(['Value'],ascending=False)


fig = make_subplots(rows=1, cols=2, 
                    horizontal_spacing = 0.25,
                    shared_yaxes=False,
                   )

fig.add_trace(go.Bar(y=df_30a['Category'],
                     x=df_30a['Value'],
                     orientation= 'h',
                     name="present",
                     marker_color='lightseagreen',
                     opacity =0.6
                    ),row=1, col=1,
             )

fig.add_trace(go.Bar(y=df_30b['Category'],
                     x=df_30b['Value'],
                     orientation= 'h',
                     name="future",
                     marker_color='gold',
                     opacity =0.6
                    ),row=1, col=2,
             )

fig.update_traces(marker_line_color='black',
              marker_line_width=1.5)
fig.update_layout(title=title, 
              font_family="San Serif",
              bargap=0.2,
                  height=400, width=1000,
              barmode='group',
              titlefont={'size': 28},
              paper_bgcolor='#F5F5F5',
              plot_bgcolor='#F5F5F5',
              legend=dict(
              orientation="v", 
                  y=1, 
                  yanchor="top", 
                  x=1.250, 
                  xanchor="right",)                 
              ).update_yaxes(categoryorder='total ascending')

fig.update_layout(xaxis_title='Count',yaxis_linewidth=2.5,
autosize=False,
margin=dict(
    l=200, r=150, b=100, t=70,
),
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

In [None]:
title="<b><span style='font-size:28px; color: #04a999'>Managed Machine Learning Products: Future </span> <br>"


cols = ['Q5'] + q31b_
df = df21[cols][1:]
df = df.groupby(['Q5']).count()

x = list(df.columns)
y = list(df.index) 
z = np.transpose(df.values)


fig = ff.create_annotated_heatmap(z=z, y=q31b, x=y, colorscale='greens', xgap=3, ygap=3)

fig.update_layout(title_text=title,
                  title_x=0.5,
                  titlefont={'size': 24, 'family':'San Serif'},
                  width=900, height=750,
                  xaxis_showgrid=False,
                  xaxis={'side': 'bottom'},
                  yaxis_showgrid=False,
                  yaxis_autorange='reversed',
                  #template='ggplot2',
                  #paper_bgcolor='#f6f6f6',
                  annotations=None,
                  )

fig.update_layout(autosize=False,
margin=dict(
    l=225, r=150, b=150, t=100,)
)
fig.show()

## 4. Big Data Products  <a class="anchor" id="4"></a>

#### Q32-A
#### Present: Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis?

#### Q32-B
#### Future: Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you hope to become more familiar with in the next 2 years? 

> * MySQL is the most widely favoured at present, followed by PostgreSQL, Microsoft SQL Server and MongoDB. MySQL is going to hold on to its lead in the future as well.
> * MongoDB jumps to second place in the future, leaving PostgreSQL and Google Cloud SQL in third and fourth place.
> * It seems that Google is gaining traction!

   

In [None]:
title = 'Big Data Products: present use vs future wish'

df_30a = df_with_percentages_suf(df21, 32, '_A', 21).sort_values(['Value'],ascending=False)
df_30b = df_with_percentages_suf(df21, 32, '_B', 21).sort_values(['Value'],ascending=False)


fig = make_subplots(rows=1, cols=2, 
                    horizontal_spacing = 0.25,
                    shared_yaxes=False,
                   )

fig.add_trace(go.Bar(y=df_30a['Category'],
                     x=df_30a['Value'],
                     orientation= 'h',
                     name="present",
                     marker_color='lightseagreen',
                     opacity =0.6
                    ),row=1, col=1,
             )

fig.add_trace(go.Bar(y=df_30b['Category'],
                     x=df_30b['Value'],
                     orientation= 'h',
                     name="future",
                     marker_color='gold',
                     opacity =0.6
                    ),row=1, col=2,
             )

fig.update_traces(marker_line_color='black',
              marker_line_width=1.5)
fig.update_layout(title=title, 
              font_family="San Serif",
              bargap=0.2,
                  height=600, width=1000,
              barmode='group',
              titlefont={'size': 28},
              paper_bgcolor='#F5F5F5',
              plot_bgcolor='#F5F5F5',
              legend=dict(
              orientation="v", 
                  y=1, 
                  yanchor="top", 
                  x=1.250, 
                  xanchor="right",)                 
              ).update_yaxes(categoryorder='total ascending')

fig.update_layout(xaxis_title='Count',yaxis_linewidth=2.5,
autosize=False,
margin=dict(
    l=200, r=150, b=100, t=70,
),
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

In [None]:
title="<b><span style='font-size:28px; color: #04a999'>Big Data Products: Future </span> <br>"

cols = ['Q5'] + q32b_
df = df21[cols][1:]
df = df.groupby(['Q5']).count()

x = list(df.columns)
y = list(df.index) 
z = np.transpose(df.values)


fig = ff.create_annotated_heatmap(z=z, y=q32b, x=y, colorscale='greens', xgap=3, ygap=3)

fig.update_layout(title_text=title,
                  title_x=0.5,
                  titlefont={'size': 24, 'family':'San Serif'},
                  width=900, height=750,
                  xaxis_showgrid=False,
                  xaxis={'side': 'bottom'},
                  yaxis_showgrid=False,
                  yaxis_autorange='reversed',
                  #template='ggplot2',
                  #paper_bgcolor='#f6f6f6',
                  annotations=None,
                  )

fig.update_layout(autosize=False,
margin=dict(
    l=225, r=150, b=150, t=100,)
)
fig.show()

## 5. Business Intelligence Tools (BI)  <a class="anchor" id="5"></a>

#### Q34-A
#### Present: Which of the following business intelligence tools do you use on a regular basis?

#### Q34-B
#### Future: Which of the following business intelligence tools do you hope to become more familiar with in the next 2 years? 

> * Tableau, Microsoft Power BI and Google Data Studio are the top three BI tools used at the moment and continue to be the leading choices for the next 2 years as well.
> * The non-users are high in percentage here as well.
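
How high the non-user percentage looks depends on the denominator: all respondents, or only those who selected at least one option for the question. The sketch below contrasts the two on made-up toy data. The column names and the choice of which column holds the "None" answer are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy answers (made up). A respondent counts as having answered the question
# if at least one part column is filled in.
toy = pd.DataFrame({
    "Q34_A_Part_1": ["Tableau", np.nan, np.nan, np.nan, "Tableau"],
    "Q34_A_Part_16": [np.nan, "None", "None", np.nan, np.nan],  # assumed "None" column
})

answered = toy.notna().any(axis=1)  # rows with at least one selection
picked_none = toy["Q34_A_Part_16"].notna()

# Non-user share relative to everyone vs. relative to those who answered
none_share_all = round(float(picked_none.mean() * 100), 1)
none_share_answered = round(float(picked_none[answered].mean() * 100), 1)

print(none_share_all, none_share_answered)
```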

In [None]:
title = 'Business Intelligence Tools: present use vs future wish'

df_30a = df_with_percentages_suf(df21, 34, '_A', 17).sort_values(['Value'],ascending=False)
df_30b = df_with_percentages_suf(df21, 34, '_B', 17).sort_values(['Value'],ascending=False)


fig = make_subplots(rows=1, cols=2, 
                    horizontal_spacing = 0.25,
                    shared_yaxes=False,
                   )

fig.add_trace(go.Bar(y=df_30a['Category'],
                     x=df_30a['Value'], 
                     orientation= 'h',
                     name="present",
                     marker_color='lightseagreen',
                     opacity =0.6
                    ),row=1, col=1,
             )

fig.add_trace(go.Bar(y=df_30b['Category'],
                     x=df_30b['Value'],
                     orientation= 'h',
                     name="future",
                     marker_color='gold',
                     opacity =0.6
                    ),row=1, col=2,
             )

fig.update_traces(marker_line_color='black',
              marker_line_width=1.5, )
fig.update_layout(title=title, 
              font_family="San Serif",
              bargap=0.2,
                  height=500, width=1000,
              barmode='group',
              titlefont={'size': 28},
              paper_bgcolor='#F5F5F5',
              plot_bgcolor='#F5F5F5',
              legend=dict(
              orientation="v", 
                  y=1, 
                  yanchor="top", 
                  x=1.250, 
                  xanchor="right",)                 
              ).update_yaxes(categoryorder='total ascending')

fig.update_layout(xaxis_title='Count',yaxis_linewidth=2.5,
autosize=False,
margin=dict(
    l=200, r=150, b=100, t=70,
),
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

In [None]:

title="<b><span style='font-size:28px; color: #04a999'>Business Intelligence Tools: Future </span> <br>"

cols = ['Q5'] + q34b_
df = df21[cols][1:]
df = df.groupby(['Q5']).count()

x = list(df.columns)
y = list(df.index) 
z = np.transpose(df.values)


fig = ff.create_annotated_heatmap(z=z, y=q34b, x=y, colorscale='greens', xgap=3, ygap=3)

fig.update_layout(title_text=title,
                  title_x=0.5,
                  titlefont={'size': 24, 'family':'San Serif'},
                  width=900, height=750,
                  xaxis_showgrid=False,
                  xaxis={'side': 'bottom'},
                  yaxis_showgrid=False,
                  yaxis_autorange='reversed',
                  #template='simple_white',
                  #paper_bgcolor='#f6f6f6',
                  annotations=None,
                  )

fig.update_layout(autosize=False,
margin=dict(
    l=225, r=150, b=150, t=100,)
)
fig.show()

## 6. AutoML Tools  <a class="anchor" id="6"></a>

#### Q36-A

#### Present: Do you use any automated machine learning tools (or partial AutoML tools) on a regular basis? 


#### Q36-B
#### Future: Which categories of automated machine learning tools (or partial  AutoML tools) do you hope to become more familiar with in the next 2 years?  

> * A very large portion of the respondents do not use AutoML at present. Among those who use AutoML tools, **Automated model selection (auto-sklearn, xcessiv, ...)** is the most used.
> * In 2 years' time, in addition to **Automated model selection**, users hope to experience **Automation of full ML pipelines** more than the other categories.
> * **Google Cloud AutoML, Azure Automated Machine Learning** and **Amazon Sagemaker Autopilot** are the top three AutoML products in use and are the future choice of most respondents, in that same order. **H20 Driverless AI** drops one place, from 5th to 6th.


In [None]:
title = 'Automated Machine Learning Tools: present use'

df_36a = df_with_percentages_suf(df21, 36, '_A', 8).sort_values(['Value'],ascending=False)
df_36b = df_with_percentages_suf(df21, 36, '_B', 8).sort_values(['Value'],ascending=False)


fig = go.Figure()

fig.add_trace(go.Bar(y=df_36a['Category'],
                     x=df_36a['Value'],
                     orientation= 'h',
                     name="present",
                     marker_color='lightseagreen',
                     opacity =0.6
                    )
             )
fig.update_traces(marker_line_color='black',
              marker_line_width=1.5)
fig.update_layout(title=title, 
              font_family="San Serif",
              bargap=0.2,
                  height=500, width=800,
              barmode='group',
              titlefont={'size': 28},
              paper_bgcolor='#F5F5F5',
              plot_bgcolor='#F5F5F5',
              legend=dict(
              orientation="v", 
                  y=1, 
                  yanchor="top", 
                  x=1.250, 
                  xanchor="right",)                 
              ).update_yaxes(categoryorder='total ascending')

fig.update_layout(xaxis_title='Count',yaxis_linewidth=2.5,
autosize=False,
margin=dict(
    l=425, r=150, b=100, t=70,
),
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

In [None]:
title = 'Automated Machine Learning Tools: future wish'

df_36a = df_with_percentages_suf(df21, 36, '_A', 8).sort_values(['Value'],ascending=False)
df_36b = df_with_percentages_suf(df21, 36, '_B', 8).sort_values(['Value'],ascending=False)


fig = go.Figure()

fig.add_trace(go.Bar(y=df_36b['Category'],
                     x=df_36b['Value'],
                     orientation= 'h',
                     name="future",
                     marker_color='gold',
                     opacity =0.6
                    )
              
             )

fig.update_traces(marker_line_color='black',
              marker_line_width=1.5)
fig.update_layout(title=title, 
              font_family="San Serif",
              bargap=0.2,
                  height=500, width=800,
              barmode='group',
              titlefont={'size': 28},
              paper_bgcolor='#F5F5F5',
              plot_bgcolor='#F5F5F5',
              legend=dict(
              orientation="v", 
                  y=1, 
                  yanchor="top", 
                  x=1.250, 
                  xanchor="right",)                 
              ).update_yaxes(categoryorder='total ascending')

fig.update_layout(xaxis_title='Count',yaxis_linewidth=2.5,
autosize=False,
margin=dict(
    l=425, r=150, b=100, t=70,
),
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

#### Q37-A
#### Present: Which of the following automated machine learning tools (or partial  AutoML tools) do you use on a regular basis? 
#### Q37-B
#### Future: Which specific automated machine learning tools (or partial  AutoML tools) do you hope to become more familiar with in the next 2 years?

In [None]:
title = 'Automated Machine Learning Tools: present use vs future wish'

df_37a = df_with_percentages_suf(df21, 37, '_A', 8)#.sort_values(['Value'],ascending=False)
df_37b = df_with_percentages_suf(df21, 37, '_B', 8)#.sort_values(['Value'],ascending=False)


fig = make_subplots(rows=1, cols=2, 
                    horizontal_spacing = 0.3,
                    #shared_yaxes=True,
                   )

fig.add_trace(go.Bar(y=df_37a['Category'],
                     x=df_37a['Value'],
                     orientation= 'h',
                     name="present",
                     marker_color='lightseagreen',
                     opacity =0.6
                    ),row=1, col=1,
             )

fig.add_trace(go.Bar(y=df_37b['Category'],
                     x=df_37b['Value'],
                     orientation= 'h',
                     name="future",
                     marker_color='gold',
                     opacity =0.6
                    ),row=1, col=2,
             )

fig.update_traces(marker_line_color='black',
              marker_line_width=1.5)
fig.update_layout(title=title, 
              font_family="San Serif",
              bargap=0.2,
                  height=500, width=1000,
              barmode='group',
              titlefont={'size': 28},
              template="simple_white",
              paper_bgcolor='#F5F5F5',
              plot_bgcolor='#F5F5F5',
              legend=dict(
              orientation="v", 
                  y=1, 
                  yanchor="top", 
                  x=1.250, 
                  xanchor="right",)                 
              ).update_yaxes(categoryorder='total ascending')

fig.update_layout(xaxis_title='Count',yaxis_linewidth=2.5,
autosize=False,
margin=dict(
    l=225, r=150, b=100, t=150,
),
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

In [None]:
title="<b><span style='font-size:28px; color: #04a999'>Automated Machine Learning Tools: Future </span> <br>"




cols = ['Q5'] + q37b_
df = df21[cols][1:]
df = df.groupby(['Q5']).count()

x = list(df.columns)
y = list(df.index) 
z = np.transpose(df.values)


fig = ff.create_annotated_heatmap(z=z, y=q37b, x=y, colorscale='greens', xgap=3, ygap=3)

fig.update_layout(title_text=title,
                  title_x=0.5,
                  titlefont={'size': 24, 'family':'San Serif'},
                  width=900, height=750,
                  xaxis_showgrid=False,
                  xaxis={'side': 'bottom'},
                  yaxis_showgrid=False,
                  yaxis_autorange='reversed',
                  #template='ggplot2',
                  #paper_bgcolor='#f6f6f6',
                  annotations=None,
                  )

fig.update_layout(autosize=False,
margin=dict(
    l=225, r=150, b=150, t=100,)
)
fig.show()
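
The heatmap above is built from a job-title by option count matrix. The following is a minimal sketch of that step on made-up data, assuming each option column holds the option label when ticked and NaN otherwise; the toy frame and its values are hypothetical and only imitate the survey's column naming.

```python
import numpy as np
import pandas as pd

# Made-up responses: Q5 is the job title, the other columns are tool options.
toy = pd.DataFrame({
    "Q5": ["Data Scientist", "Data Scientist", "Student", "Student", "Student"],
    "Q37_B_Part_1": ["Tool A", np.nan, "Tool A", "Tool A", np.nan],
    "Q37_B_Part_2": [np.nan, "Tool B", np.nan, "Tool B", np.nan],
})

# count() ignores NaN, so each cell is the number of respondents with that
# job title who ticked that option.
counts = toy.groupby("Q5").count()

job_titles = list(counts.index)   # becomes the heatmap x-axis
z = np.transpose(counts.values)   # rows = options, columns = job titles

print(job_titles)
print(z)
```

Because the cells are raw counts, job titles with many respondents tend to look darker across the board, which is worth keeping in mind when reading the heatmaps.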

## 7. ML Experiments  <a class="anchor" id="7"></a>

#### Q38-A
#### Present: Do you use any tools to help manage machine learning experiments? 

#### Q38-B
#### Future: In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments?

> Today's kagglers use **TensorBoard**, **MLflow** and **Weights & Biases** as their top three tools for ML experimentation, and they hope to stay with the same tools over the next 2 years.
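
A ranking like this comes down to counting the non-null answers in each option column. Here is a minimal sketch on a made-up frame: the column names mimic the survey's `Q38_A_Part_n` / `Q38_B_Part_n` layout, but the data and the helper `top_options` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy responses: each multi-select option is its own column, holding the
# option label when ticked and NaN otherwise (made-up data).
toy = pd.DataFrame({
    "Q38_A_Part_1": ["TensorBoard", "TensorBoard", "TensorBoard", np.nan],
    "Q38_A_Part_2": ["MLflow", np.nan, "MLflow", np.nan],
    "Q38_A_Part_3": [np.nan, np.nan, "Weights & Biases", np.nan],
    "Q38_B_Part_1": ["TensorBoard", np.nan, "TensorBoard", "TensorBoard"],
    "Q38_B_Part_2": ["MLflow", "MLflow", np.nan, np.nan],
    "Q38_B_Part_3": [np.nan, np.nan, np.nan, "Weights & Biases"],
})


def top_options(frame, prefix, n=3):
    """Return the n most selected option labels among columns starting with prefix."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    counts = frame[cols].count()  # non-null answers per option column
    # Each column carries a single label; take it from the first non-null cell.
    labels = {c: frame[c].dropna().iloc[0] for c in cols if counts[c] > 0}
    ranked = counts[counts > 0].sort_values(ascending=False)
    return [labels[c] for c in ranked.index[:n]]


present_top = top_options(toy, "Q38_A_Part_")
future_top = top_options(toy, "Q38_B_Part_")
print(present_top)
print(future_top)
```

Comparing `present_top` with `future_top` is one simple way to check whether the same tools lead both lists.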


In [None]:
title = 'Machine Learning Experimentation Tools: present use vs future wish'

df_38a = df_with_percentages_suf(df21, 38, '_A', 12)#.sort_values(['Value'],ascending=False)
df_38b = df_with_percentages_suf(df21, 38, '_B', 12)#.sort_values(['Value'],ascending=False)


fig = make_subplots(rows=1, cols=2, 
                    horizontal_spacing = 0.15,
                    #shared_yaxes=True,
                   )

fig.add_trace(go.Bar(y=df_38a['Category'],
                     x=df_38a['Value'],
                     orientation= 'h',
                     name="present",
                     marker_color='lightseagreen',
                     opacity =0.6
                    ),row=1, col=1,
             )

fig.add_trace(go.Bar(y=df_38b['Category'],
                     x=df_38b['Value'] ,
                     orientation= 'h',
                     name="future",
                     marker_color='gold',
                     opacity =0.6
                    ),row=1, col=2,
             )

fig.update_traces(marker_line_color='black',
              marker_line_width=1.5)
fig.update_layout(title=title, 
              font_family="San Serif",
              bargap=0.2,
                  height=500, width=1000,
              barmode='group',
              titlefont={'size': 28},
              template="simple_white",
              paper_bgcolor='#F5F5F5',
              plot_bgcolor='#F5F5F5',
              legend=dict(
              orientation="v", 
                  y=1, 
                  yanchor="top", 
                  x=1.250, 
                  xanchor="right",)                 
              ).update_yaxes(categoryorder='total ascending')

fig.update_layout(xaxis_title='Count',yaxis_linewidth=2.5,
autosize=False,
margin=dict(
    l=225, r=150, b=100, t=150,
),
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

In [None]:
title="<b><span style='font-size:28px; color: #04a999'>Machine Learning Experimentation Tools: Future </span> <br>"


cols = ['Q5'] + q38b_
df = df21[cols][1:]
df = df.groupby(['Q5']).count()

x = list(df.columns)
y = list(df.index) 
z = np.transpose(df.values)


fig = ff.create_annotated_heatmap(z=z, y=q38b, x=y, colorscale='greens', xgap=3, ygap=3)

fig.update_layout(title_text=title,
                  title_x=0.5,
                  titlefont={'size': 24, 'family':'San Serif'},
                  width=900, height=750,
                  xaxis_showgrid=False,
                  xaxis={'side': 'bottom'},
                  yaxis_showgrid=False,
                  yaxis_autorange='reversed',
                  #template='ggplot2',
                  #paper_bgcolor='#f6f6f6',
                  )

fig.update_layout(autosize=False,
margin=dict(
    l=225, r=150, b=150, t=100,)
)
fig.show()

## 8. Summary  <a class="anchor" id="8"></a>

* **Amazon Web Services**, **Google Cloud Platform** and **Microsoft Azure** are the most used cloud computing platforms and remain at the top of the future wish list. Among cloud computing products, **Amazon Elastic Compute Cloud** is the most used at present, while **Google Cloud Compute Engine** leads the future wish list according to the survey.

* **Amazon Simple Storage Service (S3)**, **Google Cloud Storage (GCS)** and **Microsoft Azure Data Lake Storage** are the top data storage products kagglers are using. Due to an unfortunate error in the data collection phase, there is no data on future wishes for this category, i.e. Q30_B is empty in the responses file.

* **Amazon SageMaker**, **Azure Machine Learning Studio** and **Databricks** are the top 3 managed ML products at present. The future looks to be all Google, however, as **Google Cloud Vertex AI** tops the wish list.

* Among the big data products, **MySQL** is widely used at present and is the top choice for the next 2 years. **MongoDB**, currently in 4th place, moves up to become the best of the rest.

* Most respondents seem to be happy with the **business intelligence (BI) tools** they have at their disposal. **Tableau**, **Microsoft Power BI** and **Google Data Studio** top the list, and kagglers hope to keep using them in the future as well.

* **Google Cloud AutoML**, **Azure Automated Machine Learning** and **Amazon SageMaker Autopilot** are the top-choice AutoML products, both present and future.

* **TensorBoard**, **MLflow** and **Weights & Biases** are the top three ML experimentation tools, and they remain at the top of the future wish list.

* Finally, a large number of respondents (at or near the top of each list) do not use any of the specified platforms/tools at the moment. A good number also seem happy with what they have: they answered "no/none" when asked about their future wishes. Moreover, **Google** seems to have a lot of users. It is in the top three for most categories and ranks even higher in future wishes.
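
The note about Q30_B being empty can be checked directly by testing whether every column of a question contains only missing values. A minimal sketch, assuming a frame with the survey's column naming; the toy data and the helper `question_is_empty` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame; the column names only imitate the survey's naming scheme.
toy = pd.DataFrame({
    "Q30_A_Part_1": ["Product X", np.nan, "Product X"],
    "Q30_A_Part_2": [np.nan, "Product Y", np.nan],
    "Q30_B_Part_1": [np.nan, np.nan, np.nan],
    "Q30_B_Part_2": [np.nan, np.nan, np.nan],
})


def question_is_empty(frame, prefix):
    """True if every column starting with prefix contains only missing values."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    # No matching columns counts as "not empty" so typos in prefix are not hidden.
    return bool(cols) and bool(frame[cols].isna().all().all())


a_empty = question_is_empty(toy, "Q30_A")
b_empty = question_is_empty(toy, "Q30_B")
print(a_empty, b_empty)
```

On the real responses file, the frame to pass in would presumably be the one with the first row dropped, as the notebook does with `df21.iloc[1:]`.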

...TBC


### Thank you for visiting this notebook!
#### If you are interested, my other notebook on the same dataset is here ([How popular is Kaggle in Africa?](https://www.kaggle.com/desalegngeb/how-popular-is-kaggle-in-africa))