<a id="competition-overview"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:1; color:white' role="tab" aria-controls="home"><center>Kaggle ML and DS Survey</center></h2>
           
 ![Drag Racing](https://images.unsplash.com/photo-1504868584819-f8e8b4b6d7e3?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=1476&q=80)

#### **<span style="color:orange;">Description</span>**
    
The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

#### **<span style="color:orange;"> Data</span>**
**Main data:** `kaggle_survey_2021_responses.csv` (42+ questions and 25,973 responses)

* Responses to multiple choice questions (only a single choice can be selected) were recorded in individual columns.
* Responses to multiple selection questions (multiple choices can be selected) were split into multiple columns, with one column per answer choice.

**Supplementary data:**

* `kaggle_survey_2021_answer_choices.pdf`: list of answer choices for every question, with footnotes describing which questions were asked to which respondents.
* `kaggle_survey_2021_methodology.pdf`: a description of how the survey was conducted.

You can ask additional questions by posting in the pinned Q&A thread.
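
Because multiple selection questions are spread over one column per answer choice, counting answers means reshaping first. The sketch below is a minimal illustration on made-up data (the `Q7_Part_*` column names and values are assumptions, not read from the real file): it melts the columns to long form and counts the non-empty cells.

```python
import pandas as pd

# Toy stand-in for a multiple-selection question: one column per answer
# choice, holding the choice text when selected and a missing value otherwise.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None, "Python"],
    "Q7_Part_2": ["R", None, None, "R"],
    "Q7_Part_3": [None, "SQL", "SQL", "SQL"],
})

# Melt to long form, drop unselected cells, and count each choice.
selection_counts = (
    toy.melt(value_name="choice")
       .dropna(subset=["choice"])
       .groupby("choice")
       .size()
       .sort_values(ascending=False)
)
print(selection_counts)
```

The same melt-then-count pattern is what the pie-chart helper later in this notebook relies on.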



## Let's get started and see what the survey insights are

We want to look at the survey through three lenses:
* What the current survey results show
* Historical trend analysis
* Expected future trends


### General Functions

In [None]:
import os
import numpy as np 
import pandas as pd 
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import math
import squarify

import folium
from geopy.geocoders import Nominatim
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster

from wordcloud import WordCloud,STOPWORDS
from matplotlib.ticker import FuncFormatter
from palettable.scientific.sequential import Acton_14,Bamako_12,Hawaii_5
from palettable.tableau import Tableau_20

import gc

import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
df.columns=df.columns +'___'+ df.iloc[0]


#### A quick glance at the data

In [None]:

df = df.iloc[1: , :]
df.head(10)

The first row of the file holds the full question text. Merging it into the column headers, as done above, makes it easier to group related questions and derive insights.
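
To make the header merge concrete, here is a minimal sketch on a made-up frame (the question texts and answers are placeholders). It builds the combined names with a list comprehension, which should be equivalent to the string addition used in the notebook, and then drops the question-text row.

```python
import pandas as pd

# Toy frame mimicking the raw CSV: short codes as the header and the
# full question text as the first data row.
raw = pd.DataFrame(
    [["What is your age?", "In which country do you reside?"],
     ["25-29", "India"],
     ["30-34", "Brazil"]],
    columns=["Q1", "Q3"],
)

# Join each code with its question text, then drop the question-text row.
raw.columns = [f"{code}___{text}" for code, text in zip(raw.columns, raw.iloc[0])]
responses = raw.iloc[1:].reset_index(drop=True)
print(list(responses.columns))
```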

In [None]:
df.rename(columns={'Q3___In which country do you currently reside?':'Q3_Country'}, inplace=True)

In [None]:
## helper functions
def plot_geograh(df, column, legend_label, title):
    df.plot(column=column, 
           figsize=(15, 10),
           legend=True,cmap='OrRd'
          ,legend_kwds={'label': legend_label,
                        'orientation': "horizontal"})

    plt.title(title,fontsize=25)

    

## General Trends for Kaggle Survey 2021

### Identify Survey Participation by Country

In [None]:
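import geopandas
# Note: geopandas.datasets was deprecated in GeoPandas 0.14 and removed in 1.0;
# on newer versions, point read_file at a downloaded Natural Earth countries file instead.
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))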
country_count=df.groupby('Q3_Country')['Q1___What is your age (# years)?'].count().reset_index(name='total')

merge=pd.merge(world,country_count,how='left',left_on='name', right_on=['Q3_Country'])
merge=merge[~(merge.Q3_Country.isna())]


In [None]:
plot_geograh(merge, 'total', "Survey Participation by Country",
             'Kaggle ML Survey Participation Status')

* The map shows that India had the highest participation in the 2021 Kaggle survey, followed by the US.
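
One caveat with the map: the merge keeps only countries whose survey name exactly matches the map's `name` column, so any country spelled differently silently drops out. A minimal way to check for this, sketched on hypothetical data (the `shapes` and `counts` frames and their values are made up for illustration), is an outer merge with `indicator=True`:

```python
import pandas as pd

# Hypothetical stand-ins: `shapes` plays the role of the world map table,
# `counts` the per-country survey totals.
shapes = pd.DataFrame({"name": ["India", "Brazil", "Vietnam"]})
counts = pd.DataFrame({
    "Q3_Country": ["India", "Brazil", "Viet Nam"],
    "total": [100, 40, 10],
})

# An outer merge with indicator=True labels each row by where it matched.
check = shapes.merge(counts, how="outer", left_on="name",
                     right_on="Q3_Country", indicator=True)

# Survey countries with no matching map name would be missing from the plot.
unmatched = check.loc[check["_merge"] == "right_only", "Q3_Country"].tolist()
print(unmatched)
```

Any names listed in `unmatched` could then be mapped to the map's spelling before merging.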

### Common Global Stats for the 2021 Survey

#### General Stats around Usage

In [None]:
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py
from plotly.subplots import make_subplots
from IPython.display import HTML, display

# Dashboard page that collects every generated chart
fichier_html_graphs=open("DASHBOARD.html",'w')
fichier_html_graphs.write("<html><head></head><body>"+"\n")

def display_graph(filename):
    """Embed a saved chart HTML file in the notebook output."""
    display(HTML('<iframe class="center" width="1000" height="600" src="'+filename+'"  allowfullscreen></iframe>'))

def create_custom_metrics(val):
    """Build a pie chart for the column range in `val`, save it, and add it to the dashboard."""
    # Melt the selected columns to long form and count each non-null answer
    total_programming=df.iloc[:,val['column_start']:val['column_end']].melt(var_name='columns', value_name=val['name']).groupby(val['name']).size().reset_index(
    name='total')
    fig = px.pie(total_programming, values=total_programming['total']
                 , names=val['name'], title=val['title'])
    # Save the chart as Chart_<title>.html so display_graph can embed it later
    plotly.offline.plot(fig, filename='Chart_'+val['title']+'.html',auto_open=False)
    fichier_html_graphs.write("  <object data=\""+'Chart_'+val['title']+'.html'+"\" width=\"1800\" height=\"1650\"></object>"+"\n")


    
    

#### Add an entry to the config list to generate the required graph from the data

In [None]:
stats_to_know=[
{'title':'Most used programming languages',
'column_start': 7,
'column_end': 20,
'name':'total_programming'},
{'title':'Gender participation in the survey',
'column_start': 2,
'column_end': 3,
'name':'gender'},
{'title':'Age Group participation in the survey',
'column_start': 1,
'column_end': 2,
'name':'age_group'},
{'title':'Education qualification',
'column_start': 4,
'column_end': 5,
'name':'education'},
{'title':'Experience',
'column_start': 6,
'column_end': 7,
'name':'experience'},
{'title':'salary',
'column_start': 127,
'column_end': 128,
'name':'salary'},

]

In [None]:
for val in stats_to_know:
    print(val)
    create_custom_metrics(val)

gc.collect()


## Global Stats -  General Trends Summary

#### 1. Most used programming languages among survey participants

The most popular programming languages globally are Python, followed by SQL, C++ and R.







In [None]:
display_graph('./Chart_Most used programming languages.html')



#### 2. Gender of survey participants

About 79% of survey participants are male and only 18% are female. We need to look at country-level data to understand how the gender distribution varies across countries.



In [None]:
display_graph('./Chart_Gender participation in the survey.html')

#### 3. Age group of survey participants

More than 50% of participants are under 40 years of age, and a large share of them are probably students. It's also worth noting that about 9% of participants are above the age of 50, which shows that age is just a number: people stay interested in new technologies and never stop learning.


In [None]:
display_graph('./Chart_Age Group participation in the survey.html')


#### 4. Education of survey participants 

As this is an advanced field of study, 89% of participants have at least a Bachelor's degree.


In [None]:
display_graph('./Chart_Education qualification.html')


#### 5. Experience of survey participants 

This is interesting as well. Because the platform helps people learn, it has a large number of freshers/beginners who look forward to learning from experts or through peer help.


In [None]:
display_graph('./Chart_Experience.html')


#### 5. Salary of survey participants 

This is interesting as well. It might require further drill-down to understand whether there is any salary disparity between female and male participants, and also across countries, since currency conversion matters when standardizing salaries across countries.


In [None]:
display_graph('./Chart_salary.html')

##### 5a. Salary of survey participants by Country
The chart below explains why 22% of participants fall in the salary range of 0-999 dollars: most of that cohort comes from India. Also see the plot of salaries by the current role that participants reported.


In [None]:

cols=["Q25___What is your current yearly compensation (approximate $USD)?","Q3_Country",
      "Q5___Select the title most similar to your current role (or most recent title if retired): - Selected Choice",
      "Q2___What is your gender? - Selected Choice"]
cols

salary_details = df[cols]
salary_details.columns = ["Salary", "Country","Current_role","gender"]

salary_details1 = salary_details.groupby(["Country","Salary"]).size().reset_index(name='total_count')
salary_details_pivot=pd.melt(salary_details1,id_vars=['Country',"Salary"],var_name='metrics', value_name='values')
# fig = px.bar(salary_details_pivot, x="Country", y="values", color="Salary", 
#              title="Salary Split by Country")
# fig.show()
salary_details1['country_total']=salary_details1.groupby('Country')['total_count'].transform('sum')
salary_details1['salary_percent']=salary_details1['total_count']/salary_details1['country_total']
salary_details1['salary_percent']=salary_details1['salary_percent']*100

In [None]:
# Imports
import plotly.graph_objs as go
import pandas as pd
import numpy as np



# split df by Salary
names=salary_details1['Country'].unique().tolist()
gender = salary_details1['Salary'].unique().tolist()

dfs = {}

# dataframe collection grouped by Salary
for name in names:
    #print(name)
    dfs[name]=pd.pivot_table(salary_details1[salary_details1['Country']==name],
                             values='salary_percent',
                             index=['Salary'],
                             #columns=['gender'],
                             aggfunc=np.sum)


# plotly start 
fig = go.Figure()

# get column names from first dataframe in the dict
colNames = list(dfs[list(dfs.keys())[0]].columns)
#xValues=

# one bar trace, initially showing India; x comes from the same pivot as y
# so that salary labels and percentages stay aligned
for col in colNames:
    fig.add_trace(go.Bar(x=dfs['India'].index.tolist(),
                         visible=True,
                         y=dfs['India']['salary_percent'].values
                  )
             )

# menu setup    
updatemenu= []

buttons=[]

# one dropdown button per country; each button swaps in that country's x and y
for df12 in dfs.keys():
    buttons.append(dict(method='update',
                        label=df12,
                        visible=True,
                        args=[
                              {'x':[dfs[df12].index.tolist()],
                               'y':[dfs[df12]['salary_percent'].values]}])
                  )

# buttons for menu 2
b2_labels = colNames

# some adjustments to the updatemenus
updatemenu=[]
your_menu=dict()
updatemenu.append(your_menu)
your_menu2=dict()
updatemenu.append(your_menu2)
#updatemenu[1]
updatemenu[0]['buttons']=buttons
updatemenu[0]['direction']='down'
updatemenu[0]['name']='test'
updatemenu[0]['showactive']=True

fig.update_layout(showlegend=False, updatemenus=updatemenu,title="Salary % per Participant Country")
fig.show()

As the chart above shows, in countries whose currencies are worth much less than the US dollar, salaries are concentrated in the lower buckets, whereas in countries whose currencies are comparable to the US dollar, salaries are spread across the buckets.
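
The within-country percentages behind this chart can also be computed in one step. The sketch below is an alternative to the `transform('sum')` approach used in the notebook, shown on made-up rows (the countries and buckets are illustrative only): `value_counts(normalize=True)` on a grouped column returns each bucket's share within its group.

```python
import pandas as pd

# Toy responses: one row per participant.
toy = pd.DataFrame({
    "Country": ["India", "India", "India", "India", "Japan", "Japan"],
    "Salary": ["$0-999", "$0-999", "$0-999", "1,000-1,999",
               "$0-999", "1,000-1,999"],
})

# Share of each salary bucket within its country, as a percentage.
salary_share = (
    toy.groupby("Country")["Salary"]
       .value_counts(normalize=True)
       .mul(100)
       .rename("salary_percent")
       .reset_index()
)
print(salary_share)
```

Each country's percentages add up to 100, which makes countries with very different respondent counts comparable.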

In [None]:
current_role_salary=salary_details.groupby(["Current_role","Salary"]).size().reset_index(name='total_count')
current_role_salary_pivot=pd.melt(current_role_salary,id_vars=["Salary","Current_role"],var_name='metrics', value_name='values')
# fig = px.bar(current_role_salary_pivot, x="Current_role", y="values", color="Salary", 
#              title="Salary Split by Current Role")
# fig.show()
current_role_salary['title_total']=current_role_salary.groupby('Current_role')['total_count'].transform('sum')
current_role_salary['salary_percent']=current_role_salary['total_count']/current_role_salary['title_total']
current_role_salary['salary_percent']=current_role_salary['salary_percent']*100

In [None]:
# Imports
import plotly.graph_objs as go
import pandas as pd
import numpy as np



# split df by Salary
names=current_role_salary['Current_role'].unique().tolist()
gender = current_role_salary['Salary'].unique().tolist()

dfs = {}

# dataframe collection grouped by Salary
for name in names:
    #print(name)
    dfs[name]=pd.pivot_table(current_role_salary[current_role_salary['Current_role']==name],
                             values='salary_percent',
                             index=['Salary'],
                             #columns=['gender'],
                             aggfunc=np.sum)


# plotly start 
fig = go.Figure()

# get column names from first dataframe in the dict
colNames = list(dfs[list(dfs.keys())[0]].columns)
#xValues=

# one bar trace, initially showing Business Analyst; x comes from the same
# pivot as y so that salary labels and percentages stay aligned
for col in colNames:
    fig.add_trace(go.Bar(x=dfs['Business Analyst'].index.tolist(),
                         visible=True,
                         y=dfs['Business Analyst']['salary_percent'].values
                         
                  )
             )

# menu setup    
updatemenu= []

# buttons for menu 1, names
buttons=[]

# one dropdown button per job title; each button swaps in that title's x and y
for df12 in dfs.keys():
    buttons.append(dict(method='update',
                        label=df12,
                        visible=True,
                        args=[
                              {'x':[dfs[df12].index.tolist()],
                               'y':[dfs[df12]['salary_percent'].values]}])
                  )

# buttons for menu 2
b2_labels = colNames



# some adjustments to the updatemenus
updatemenu=[]
your_menu=dict()
updatemenu.append(your_menu)
your_menu2=dict()
updatemenu.append(your_menu2)
#updatemenu[1]
updatemenu[0]['buttons']=buttons
updatemenu[0]['direction']='down'
updatemenu[0]['name']='test'
updatemenu[0]['showactive']=True



fig.update_layout(showlegend=False, updatemenus=updatemenu,title="Salary % per Participant Job Title")
fig.show()

This is a good insight as well: the platform supports not only core data scientists but people across a wide range of job profiles. We would like to see more details on the Other group to understand the diverse impact data is making in the industry.
As the saying goes,
<b>"Without data you're just another person with an opinion."</b> This makes it critical in every field to analyse the relevant data before making decisions.

##### 5b. Salary of survey participants by Gender


In [None]:
gender_salary=salary_details.groupby(["gender","Salary"]).size().reset_index(name='total_count')
# gender_salary_pivot=pd.melt(gender_salary,id_vars=["Salary","gender"],var_name='metrics', value_name='values')
# fig = px.bar(gender_salary_pivot, x="gender", y="values", color="Salary", 
#              title="Salary Split by Gender")
# fig.show()
gender_salary['gender_total']=gender_salary.groupby('gender')['total_count'].transform('sum')
gender_salary['salary_percent']=gender_salary['total_count']/gender_salary['gender_total']
gender_salary['salary_percent']=gender_salary['salary_percent']*100
gender_salary=gender_salary.sort_values('Salary', ascending=True)

#### How to interpret the graph below
Once you select a salary range, the value for each gender is the percentage of survey participants of that gender who fall in the selected range.

As the chart shows, in the highest salary ranges a larger percentage of male participants than female participants falls within the same bucket.
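
To make the interpretation concrete, the same within-gender percentages can be read from a row-normalized cross-tabulation. This is a minimal sketch on made-up rows (not survey results): each row of `share` sums to 100, and a cell is the percentage of that gender in that salary range.

```python
import pandas as pd

# Toy responses: one row per participant.
toy = pd.DataFrame({
    "gender": ["Man", "Man", "Man", "Man", "Woman", "Woman"],
    "Salary": ["$0-999", "$0-999", "$0-999", "1,000-1,999",
               "$0-999", "1,000-1,999"],
})

# normalize="index" divides each row by its row total.
share = pd.crosstab(toy["gender"], toy["Salary"], normalize="index") * 100
print(share)
```

Selecting a salary range in the dropdown corresponds to reading one column of this table.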

In [None]:
# Imports
import plotly.graph_objs as go
import pandas as pd
import numpy as np



# order salary buckets by the first number in each label
f = lambda x: x.str.extract(r'(\d+)', expand=False).astype(int)
gender_salary = gender_salary.sort_values('Salary', key= f, ignore_index=True)
names=gender_salary['Salary'].unique().tolist()
names = [value for value in names if value != '>$1,000,000']
names.append('>$1,000,000')
gender = gender_salary['gender'].unique().tolist()

dfs = {}

# dataframe collection grouped by Salary
for name in names:
    #print(name)
    dfs[name]=pd.pivot_table(gender_salary[gender_salary['Salary']==name],
                             values='salary_percent',
                             index=['gender'],
                             #columns=['gender'],
                             aggfunc=np.sum)


# plotly start 
fig = go.Figure()

# get column names from first dataframe in the dict
colNames = list(dfs[list(dfs.keys())[0]].columns)
#xValues=

# one bar trace, initially showing the $0-999 range; x comes from the same
# pivot as y so that gender labels and percentages stay aligned
for col in colNames:
    fig.add_trace(go.Bar(x=dfs['$0-999'].index.tolist(),
                         visible=True,
                         y=dfs['$0-999']['salary_percent'].values
                         
                  )
             )

# menu setup    
updatemenu= []

# buttons for menu 1, names
buttons=[]

# one dropdown button per salary range; each button swaps in that range's x and y
for df12 in dfs.keys():
    buttons.append(dict(method='update',
                        label=df12,
                        visible=True,
                        args=[
                              {'x':[dfs[df12].index.tolist()],
                               'y':[dfs[df12]['salary_percent'].values]}])
                  )

# buttons for menu 2
b2_labels = colNames


updatemenu=[]
your_menu=dict()
updatemenu.append(your_menu)
your_menu2=dict()
updatemenu.append(your_menu2)
#updatemenu[1]
updatemenu[0]['buttons']=buttons
updatemenu[0]['direction']='down'
updatemenu[0]['name']='test'
updatemenu[0]['showactive']=True
# updatemenu[1]['buttons']=buttons2
# updatemenu[1]['name']='test'
# updatemenu[1]['y']=0.6


fig.update_layout(showlegend=False, updatemenus=updatemenu,title="Salary % per Participant Gender")
fig.show()

## Global Stats - ML Infra Specific Usage

Add the required column ranges to the config list to visualize insights around those features

In [None]:
stats_to_know=[
{'title':'Most popular IDES',
'column_start': 21,
'column_end': 34,
'name':'popular_ide'},
{'title':'Most popular Hosted Products',
'column_start': 34,
'column_end': 51,
'name':'popular_hosted_notebook'},
{'title':'Most popular hardware options',
'column_start': 52,
'column_end': 58,
'name':'popular_hardware'},
{'title':'Most popular Cloud Platforms',
'column_start': 129,
'column_end': 141,
'name':'cloud_platforms'},
{'title':'Most popular Cloud products',
'column_start': 142,
'column_end': 147,
'name':'cloud_products'}]

In [None]:
for val in stats_to_know:
    print(val)
    create_custom_metrics(val)

gc.collect()

#### 1. Most popular IDEs among survey participants

The most popular IDEs among participants include Jupyter Notebook and Visual Studio Code.










In [None]:
display_graph('./Chart_Most popular IDES.html')

#### 2. Most popular hosted products among survey participants

This shows which products are available free of cost and are great for collaboration: a large share is accounted for by Colab and Kaggle Notebooks.


In [None]:
display_graph('./Chart_Most popular Hosted Products.html')


#### 3. Most popular hardware options among survey participants

About 30% of participants voted for NVIDIA GPUs. However, people mostly use the hosted products above for basic ML and insight identification, which is probably why about 50% answered None, indicating that they never had to worry about the hardware.

In [None]:
display_graph('./Chart_Most popular hardware options.html')

#### 4. Most popular cloud platforms among survey participants

As expected, AWS and GCP take about 46% of the share. None is interesting as well: it might indicate that people develop on their local machines (either their tasks are simple or they have set up proper workstations to support the ML workload).


In [None]:
display_graph('./Chart_Most popular Cloud Platforms.html')


#### 5. Most popular cloud products among survey participants
EC2 and Compute Engine get the largest share, likely because they are basic virtual servers that can be used to build and deploy other applications and services.

In [None]:
display_graph('./Chart_Most popular Cloud products.html')

## Historical Trends Analysis
#### A quick look into the yearly stats. Thanks to @Andrada Olteanu for the 2017-2021 dataset

In [None]:
yearly_stats=pd.read_csv('../input/kaggle-data-science-survey-20172021/kaggle_survey_2017_2021.csv')
yearly_stats.columns=yearly_stats.iloc[0]
yearly_stats= yearly_stats[1:]

### 1. Survey Participation across the years with Gender Analysis

In [None]:

cols=["Year","What is your gender? - Selected Choice"]

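participation = yearly_stats[cols].copy()
participation.columns = ["Year", "Gender"]

# Cast Year to int before filtering: the column may be read as text because
# the first CSV row held the question labels, and text never equals 2017
participation['Year']=participation['Year'].astype(int)
participation=participation[participation['Year'] != 2017]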
participation = participation.groupby(["Year",'Gender']).size().reset_index(name='total')
participation['Year']=participation['Year'].astype(str)
participation['Gender'] = participation['Gender'].replace(['Man'],'Male')
participation['Gender'] = participation['Gender'].replace(['Woman'],'Female')


In [None]:
fig = px.bar(participation, x="Year", y="total", color="Gender", title="Participation by Gender")
fig.show() 

Survey participation increased quite significantly in 2021, possibly owing to lockdowns and travel restrictions giving people more time to try things out, and to the platform's wider reach across the globe.

### 2. Adoption of programming languages over time

In [None]:
cols = list(yearly_stats.columns[8:21])
cols.extend(["Year"])

programming = yearly_stats[cols]
programming.columns = ["Python", "R", "SQL", "C", "C++", "Java",
                 "Javascript", "Julia", "Swift", "Bash", "MATLAB","None","Other",
                 "Year"]


# Cast Year to int before filtering so the comparison with 2017 is numeric
programming['Year']=programming['Year'].astype(int)
programming=programming[programming['Year'] != 2017]
programming = programming.groupby(["Year"]).count().reset_index()


pro_pivot=pd.melt(programming,id_vars=['Year'],var_name='Languages', value_name='Usage Count')



In [None]:
import plotly.express as px
pro_pivot['Year']=pro_pivot['Year'].astype(str)

fig = px.area(pro_pivot, x="Year", y="Usage Count", color="Languages",
      line_group="Languages")
fig.show()

* As seen in the graph above, Python has been the most popular language among survey participants for years, whereas newer languages like Julia and Swift are gaining popularity, possibly owing to the use of ML in iOS programming and a wider scope of applications. It is interesting to note that the Other bucket is also growing, which suggests that future surveys might need to list more languages to capture any that are missing.
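
Raw usage counts also grow or shrink with the number of respondents each year, so a share per year may be easier to compare across years. The sketch below is one possible way to do that, on made-up rows (the values are illustrative, and this normalization is an addition, not part of the chart above): `count()` gives selections per language and `size()` gives respondents per year.

```python
import pandas as pd

# Toy multi-select data: a language column holds its name when selected.
toy = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021, 2021, 2021],
    "Python": ["Python", None, "Python", "Python", "Python", None],
    "Julia": [None, None, None, "Julia", None, None],
})

by_year = toy.groupby("Year")
# count() ignores missing cells; size() counts all respondents in the year.
usage_share = by_year[["Python", "Julia"]].count().div(by_year.size(), axis=0) * 100
print(usage_share)
```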


### 3. Participation of Countries for the Survey across time period

We look at the top 10 countries by participation in 2018 and follow them over time to see whether there are any emerging trends in participation.

In [None]:
cols=['In which country do you currently reside?','Year',"What is your gender? - Selected Choice"]

country_participation = yearly_stats[cols]

country_participation.columns = ["Country", "Year", "Gender"]


# Cast Year to int before filtering so the comparison with 2017 is numeric
country_participation['Year']=country_participation['Year'].astype(int)
country_participation=country_participation[country_participation['Year'] != 2017]
country_participation['Gender'] = country_participation['Gender'].replace(['Man'],'Male')
country_participation['Gender'] = country_participation['Gender'].replace(['Woman'],'Female')
country_participation = country_participation.groupby(["Year",'Country']).size().reset_index(name='total')


top_countries=country_participation[country_participation['Year']==2018].sort_values('total', ascending=False).head(10)['Country']
country_participation_filter=country_participation[country_participation['Country'].isin(top_countries)]

In [None]:
fig = px.area(country_participation_filter, x="Year", y="total", color="Country",
      line_group="Country")
fig.show()

The plot above shows that participation from countries like India has increased over the period, while participation from the United States decreased in 2019 and 2020 before rising again in 2021. One possible reason is that, as the data field and its possibilities grow, it is spreading across more countries, and the platform provides an excellent opportunity to learn new technologies.

### Let's also take a look at new countries added every year and countries with higher year-over-year participation

In [None]:
country_participation_new= country_participation.set_index(['Year', 'Country'])
d2 = country_participation_new.groupby(level='Country').pct_change()
d2=d2.reset_index()
d2['total']=d2['total']*100

#### New Countries added per year
* 2019 - 'Algeria', 'Saudi Arabia', 'Taiwan'
* 2020 - 'Ghana', 'Nepal', 'Sri Lanka', 'United Arab Emirates'
* 2021 - 'Ecuador', 'Ethiopia', 'Iraq', 'Kazakhstan', 'Uganda'
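
A list like the one above can be derived rather than typed by hand. The sketch below is a minimal illustration on made-up rows (not the survey data): it finds the first year each country appears and groups the countries that first appear after the base year.

```python
import pandas as pd

# Toy participation table: one row per (year, country) present in the survey.
toy = pd.DataFrame({
    "Year": [2018, 2018, 2019, 2019, 2019, 2020, 2020],
    "Country": ["India", "Japan", "India", "Japan", "Taiwan", "India", "Nepal"],
})

# First year in which each country appears.
first_year = toy.groupby("Country")["Year"].min()
base_year = toy["Year"].min()

# Countries whose first appearance is after the base year, grouped by year.
newcomers = first_year[first_year > base_year].reset_index()
new_by_year = {
    int(year): sorted(group)
    for year, group in newcomers.groupby("Year")["Country"]
}
print(new_by_year)
```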

In [None]:
d2.fillna(0, inplace=True)
d2['Year']=d2['Year'].astype(str)
d2['total']=d2['total'].astype(int)
d2=d2[~(d2['Country'].isin(['Ecuador', 'Ethiopia', 'Iraq', 'Kazakhstan', 'Uganda']))]
fig = px.area(d2, x="Year", y="total", color="Country",
      line_group="Country")
fig.show()

Most countries had a drop in 2019 followed by a recovery. We can filter at the country level to see the changes specific to a country. The countries with a drop in 2021 compared to previous years are Algeria, Denmark and Norway.
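
For reference, this is how a year-over-year percentage change per country behaves on a small made-up table (the totals are illustrative only). The first year of each country has no previous value, so its change is missing, and a negative value marks a decline.

```python
import pandas as pd

# Toy yearly totals for two countries.
toy = pd.DataFrame({
    "Year": [2018, 2019, 2020, 2018, 2019, 2020],
    "Country": ["India", "India", "India", "Norway", "Norway", "Norway"],
    "total": [100, 150, 300, 50, 40, 20],
})

# Sort so that pct_change compares each year with the previous one.
toy = toy.sort_values(["Country", "Year"])
toy["yoy_percent"] = toy.groupby("Country")["total"].pct_change() * 100

# Countries with at least one year-over-year decline.
declined = toy.loc[toy["yoy_percent"] < 0, "Country"].unique().tolist()
print(declined)
```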

In [None]:
# Imports
import plotly.graph_objs as go
import pandas as pd
import numpy as np



# split df by Salary
names=d2['Country'].unique().tolist()
d2['Year']=d2['Year'].astype(str)
Year = d2['Year'].unique().tolist()

dfs = {}

# dataframe collection grouped by Salary
for name in names:
    #print(name)
    dfs[name]=pd.pivot_table(d2[d2['Country']==name],
                             values='total',
                             index=['Year'],
                             #columns=['gender'],
                             aggfunc=np.sum)


# plotly start 
fig = go.Figure()

# get column names from first dataframe in the dict
colNames = list(dfs[list(dfs.keys())[0]].columns)
#xValues=

# one bar trace, initially showing Argentina; x comes from the same pivot as y
# so that years and percentage changes stay aligned
for col in colNames:
    fig.add_trace(go.Bar(x=dfs['Argentina'].index.tolist(),
                         visible=True,
                         y=dfs['Argentina']['total'].values
                         
                  )
             )

# menu setup    
updatemenu= []

# buttons for menu 1, names
buttons=[]

# one dropdown button per country; each button swaps in that country's x and y
for df12 in dfs.keys():
    buttons.append(dict(method='update',
                        label=df12,
                        visible=True,
                        args=[
                              {'x':[dfs[df12].index.tolist()],
                               'y':[dfs[df12]['total'].values]}])
                  )

# buttons for menu 2
b2_labels = colNames


updatemenu=[]
your_menu=dict()
updatemenu.append(your_menu)
your_menu2=dict()
updatemenu.append(your_menu2)
#updatemenu[1]
updatemenu[0]['buttons']=buttons
updatemenu[0]['direction']='down'
updatemenu[0]['name']='test'
updatemenu[0]['showactive']=True
# updatemenu[1]['buttons']=buttons2
# updatemenu[1]['name']='test'
# updatemenu[1]['y']=0.6


fig.update_layout(showlegend=False, updatemenus=updatemenu,title="Survey Participation By Country % change")
fig.show()

## Looking into the possible future trends

In [None]:
stats_to_know=[
{'title':'Most Desirable cloud platform',
'column_start': 268,
'column_end': 280,
'name':'popular_platform_future'},
{'title':'Most Desirable hosted Products',
'column_start': 280,
'column_end': 285,
'name':'popular_hosted_products_future'},
{'title':'Experiment Tools for Future',
'column_start': 357,
'column_end': 369,
'name':'popular_experiment_products_future'},
{'title':'Auto ML Tools for Future',
'column_start': 349,
'column_end': 357,
'name':'popular_automl_products_future'},
{'title':'BI Tools for Future',
'column_start': 324,
'column_end': 341,
'name':'popular_bi_products_future'},
]

In [None]:
for val in stats_to_know:
    print(val)
    create_custom_metrics(val)

gc.collect()

### 1. Most Desired Cloud Platform to learn about in the next 2 years

In [None]:
display_graph('./Chart_Most Desirable cloud platform.html')

As expected, people want to learn more about AWS and GCP in the coming years, followed by Azure; all three are currently market leaders in the cloud space.

### 2. Most Desired Cloud Platform Product to learn about in the next 2 years

In [None]:
display_graph('./Chart_Most popular Hosted Products.html')

It is a bit unusual that <b>None</b> gets a relatively high 19% among the survey participants. Google Colab and Kaggle are among the most effective and collaborative ways to start learning data wrangling; both are easy to use and offer a free usage quota for general-purpose needs.

### 3. How to perform Experiments in the future?

In [None]:
display_graph('./Chart_Experiment Tools for Future.html')

This is quite intriguing. I think people are still learning the benefits of experiment tracking tools for following the model lifecycle and iterative development. But as expected, TensorBoard, MLflow and W&B are among the most popular choices. The increased use of W&B, especially across recent Kaggle notebooks, should change this trend in the future and help people understand its importance.

### 4. Auto ML Tools for Future

In [None]:
display_graph('./Chart_Auto ML Tools for Future.html')

I think one of the main reasons why GCP ranks higher is the capabilities of BigQuery and the ML features that come with it. If you haven't tried it out yet, you just need SQL to create a basic set of ML models, and training and prediction can be done with BigQuery ML. It is also worth noting that although AWS was a preferred platform to learn, Azure takes second place when it comes to AutoML.

### 5. BI Tools for Future

In [None]:
display_graph('./Chart_BI Tools for Future.html')

Tableau and Power BI lead the visualization tools landscape for the future, followed by Google Data Studio.


#### Thank you

Thanks to the Kaggle team for hosting this competition and to all the data enthusiasts who participated in the survey. It is important that we take learnings from the survey and help make the product even better, so that it serves the global community in getting better at data practices and implementations.