## What does the 2021 survey data tell us about the top five participating countries?
In 2021, a total of *25974* Kagglers participated in the survey. More than half of the participants are from the top five countries, which in order of participant count are *India, United States of America, Japan, China and Brazil*. In this notebook, our goal is to explore the data of these five countries from different perspectives, to get an idea of what is going on there in the field of data science.

This review will help people who are interested in working in these countries as data science or machine learning professionals. It shows which industries in these countries deal most with data science or ML, in which country there is more scope to work as a data science professional, which tools and technologies people use to accomplish their tasks, and so on.

## Table of Contents
1. [Considerations](#Considerations)
2. [Top five participating countries](#Top-Five-Participating-Countries)
3. [Comparison of top five countries' data](#Comparision-of-Top-Five-countries'-Data)
 1. [Gender](#Gender)
 2. [Age](#Age)
 3. [Education Level](#Education-Level)
 4. [Industry](#Industry)
 5. [Computing Platform](#Computing-Platform)
 6. [Company Size](#Company-Size)
 7. [Programming Language Used](#Programming-Language-Used)
 8. [IDE](#IDE)
 9. [Hosted Notebook Product](#Hosted-Notebook-Product)
 10. [Data Visualization Libraries or Tools](#Data-Visualization-Libraries-or-Tools)
 11. [Machine Learning Framework](#Machine-Learning-Framework)
 12. [ML Algorithm](#ML-Algorithm)
 13. [Cloud Computing Platform](#Cloud-Computing-Platform)
 14. [Cloud Computing Products](#Cloud-Computing-Products)
 15. [Data Storage Products](#Data-Storage-Products)
 16. [Managed Machine Learning Products](#Managed-Machine-Learning-Products)
 17. [Big Data Products](#Big-Data-Products)
 18. [Business Intelligence Tools](#Business-Intelligence-Tools)
 19. [Automated Machine Learning Tools](#Automated-Machine-Learning-Tools)
 20. [Tools to Manage Machine Learning Experiments](#Tools-to-Manage-Machine-Learning-Experiments)
4. [Conclusions can we draw from this examination](#Conclusions-can-we-draw-from-this-examination)


## Considerations

1. In the ranking of countries by participant count, 'Other' stands in 3rd position. Since 'Other' is not an actual country, we exclude it for simplicity.
2. Likewise, we do not consider 'Other' as an industry and skip the responses for this option.
3. Each figure shows the data as percentages, except 'Participating countries in Kaggle 2021 survey', which shows counts.
4. For questions that allow multiple answers, overlap among the selected options is ignored.
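
As a hedged illustration of consideration 1, the ranking could be derived from the country column rather than read off a chart. The sketch below uses a tiny made-up DataFrame (the counts are illustrative, not survey results) and a hypothetical helper `top_countries`; only the column name `Q3` comes from this notebook.

```python
import pandas as pd

def top_countries(frame, column="Q3", n=5, exclude=("Other",)):
    """Return the n most frequent values in `column`, skipping excluded labels."""
    counts = frame[column].value_counts()          # sorted from most to least frequent
    counts = counts[~counts.index.isin(exclude)]   # drop placeholders such as 'Other'
    return list(counts.head(n).index)

# Toy data: made-up counts, used only to demonstrate the helper
toy = pd.DataFrame(
    {"Q3": ["India"] * 5 + ["Other"] * 4 + ["Japan"] * 3 + ["China"] * 2 + ["Brazil"]}
)
print(top_countries(toy, n=3))
```

In this toy data 'Other' is the second most frequent label, yet it never appears in the returned list.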

In [None]:
import re
import operator
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

import plotly.graph_objects as go
from plotly.subplots import make_subplots
 
from matplotlib_venn import venn3
from bokeh.plotting import figure, show

responses=pd.read_csv('/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
# The first row holds the question text rather than a response, so skip it
survey_data=responses.iloc[1:,:]

mycolor=['#aeccdb', '#efa6a5', '#ebbd81', '#c8b6d2', '#b3d495','#fff3aa','#88cccc','#cdd422',"#ff8b94",'#6d929b','#fe6695',
        '#ff7000','#3aafa9','#f49d00','#007fb1','#34cd63','#95a5a5']
mycolor2=['#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', 
          '#fb9a99','#e31a1c','#fdbf6f','#ff7f00',"#cab2d6",'#ffff99','#b15928',
        '#e78ac3','#b3b3b3']

## Top Five Participating Countries

In [None]:
df=survey_data.rename(columns={'Q3':'Country'})
fig = px.histogram(df, x="Country",title='Participating countries in Kaggle 2021 survey ').update_xaxes(categoryorder='total descending')
fig.show()

# From the graph we see that the top five participating countries are India,United States of America,Japan,China,Brazil respectively
top_5_country=['India','United States of America','Japan','China','Brazil']

# We are considering responses only from these five countries 
survey_data_top_5_country = survey_data[survey_data['Q3'].isin(top_5_country)]
  

The graph above shows that in the 2021 survey, participation of Indian Kagglers is the highest, followed by the USA, Japan, China and Brazil. The gap between India and the USA is large, more than 50%. The response counts from the other three countries, Japan, China and Brazil, are almost equal.

## Comparision of Top Five countries' Data

## Gender

In [None]:
#Gender distribution for top five countries
df=survey_data_top_5_country.groupby(['Q3'])['Q2'].value_counts(normalize=True).rename('Percentage').mul(100).reset_index().sort_values('Q3').rename(columns={'Q3':'Country','Q2':'Gender'})
df['Percentage']=round(df['Percentage'],2)
df.reset_index(inplace=True)

gender_order=['Man','Woman','Prefer not to say','Prefer to self-describe','Nonbinary']
df['Country'] = pd.Categorical(df['Country'], top_5_country)
df['Gender'] = pd.Categorical(df['Gender'],gender_order )
df=df.sort_values(['Country','Gender'])

fig = px.bar(df,x='Country', y='Percentage',color='Gender', barmode='group',text=[f'{i}%' for i in df['Percentage']],
              title='Gender wise participation for top five countries',color_discrete_sequence=mycolor)
fig.show()

Women's participation is consistently much lower than men's, and this survey is no exception. Japan stands out even among these five countries: there, women's participation is less than 10%.
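
The group-then-normalise pattern used for the gender chart recurs in several later cells. A minimal sketch of how it could be wrapped in one function follows; `percentage_by_country` is a hypothetical name, and the toy answers are made up rather than taken from the survey.

```python
import pandas as pd

def percentage_by_country(frame, country_col, answer_col):
    """Percentage of each answer within each country, rounded to 2 decimals."""
    return (
        frame.groupby(country_col)[answer_col]
        .value_counts(normalize=True)   # share of each answer inside its country
        .mul(100)
        .round(2)
        .rename("Percentage")
        .reset_index()
    )

# Toy data: 10 made-up respondents for one country and 4 for another
toy = pd.DataFrame({
    "Q3": ["Japan"] * 10 + ["Brazil"] * 4,
    "Q2": ["Man"] * 9 + ["Woman"] + ["Man"] * 3 + ["Woman"],
})
result = percentage_by_country(toy, "Q3", "Q2")
print(result)
```

Because the percentages are normalised within each country, countries with very different respondent counts can be compared on the same axis.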


## Age

In [None]:
#Age distribution for top five countries
df=survey_data_top_5_country.groupby(['Q3'])['Q1'].value_counts(normalize=True).rename('Percentage').mul(100).reset_index().sort_values('Q3').rename(columns={'Q3':'Country','Q1':'Age'})
df['Percentage']=round(df['Percentage'],2)
df.reset_index(inplace=True)
age_order = [ '18-21', '22-24', '25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-69','70+' ]
df['Country'] = pd.Categorical(df['Country'], top_5_country)
df['Age'] = pd.Categorical(df['Age'], age_order)
df=df.sort_values(['Country','Age'])


fig = px.bar(df,x='Country', y='Percentage',color='Age', barmode='group',text=[f'{i}%' for i in df['Percentage']],
              title='Age distribution for top five countries',color_discrete_sequence=mycolor,width=1000,height=500)
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-.3,
    xanchor="right",
    x=1
))
fig.show()

The graph above shows the following:
1. In India and China, the majority of participants are between the ages of 18 and 24.
2. In the USA, Japan and Brazil, the majority of participants are between the ages of 25 and 34.
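
Statements such as "between the ages of 18 and 24" combine two survey buckets ('18-21' and '22-24'). A small sketch of how such a combined share could be checked is shown here; the percentages and the country names 'A' and 'B' are made up, and `combined_share` is a hypothetical helper.

```python
import pandas as pd

# Made-up percentages for two hypothetical countries; not survey results
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Age": ["18-21", "22-24", "25-29", "18-21", "22-24", "25-29"],
    "Percentage": [30.0, 25.0, 45.0, 10.0, 15.0, 75.0],
})

def combined_share(frame, buckets):
    """Sum the percentages of the given age buckets for each country."""
    selected = frame[frame["Age"].isin(buckets)]
    return selected.groupby("Country")["Percentage"].sum()

young = combined_share(toy, ["18-21", "22-24"])
print(young)
```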

## Education Level

In [None]:
#Comparison of education level for top five countries

df=survey_data_top_5_country.groupby(['Q3'])['Q4'].value_counts(normalize=True).rename('Percentage').mul(100).reset_index().sort_values('Q3').rename(columns={'Q3':'Country','Q4':'Education'})
df['Percentage']=round(df['Percentage'],2)
df.reset_index(inplace=True)
Education_sorted=['No formal education past high school','Some college/university study without earning a bachelor’s degree','Bachelor’s degree',
                                                   'Master’s degree','Doctoral degree','Professional doctorate','I prefer not to answer']
df['Country'] = pd.Categorical(df['Country'], top_5_country)
df['Education'] = pd.Categorical(df['Education'], Education_sorted)
df=df.sort_values(['Country','Education'])

# Copy the list so the line break added below does not alter Education_sorted itself
labels = Education_sorted.copy()
labels[1] = 'Some college/university study <br>without earning a bachelor’s degree'

fig = make_subplots(rows=3, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}] for _ in range(3)])
# (country, short name, subplot row, subplot column)
pie_positions = [('India','India',1,1), ('United States of America','USA',1,2), ('Japan','Japan',2,1), ('China','China',2,2), ('Brazil','Brazil',3,1)]
for country, short_name, row, col in pie_positions:
    country_df = df[df['Country']==country]
    # Look up each education level explicitly so values stay aligned with labels even if a level has no responses
    values = [country_df.loc[country_df['Education']==level,'Percentage'].sum() for level in Education_sorted]
    fig.add_trace(go.Pie(labels=labels, values=values, marker_colors=mycolor, rotation=270, name=short_name, direction='clockwise'), row, col)


fig.update_traces( hoverinfo="label+percent+name")

fig.update_layout(height=800, width=1000,
    title_text="Comparison of highest education level for top five countries",
    annotations=[dict(text='India', x=0.18, y=0.72, font_size=15, showarrow=False),
                 dict(text='USA', x=0.80, y=0.72, font_size=15, showarrow=False),
                 dict(text='Japan', x=0.18, y=0.3, font_size=15, showarrow=False),
                 dict(text='China', x=0.80, y=0.3, font_size=15, showarrow=False),
                 dict(text='Brazil',x=0.18, y=-0.05, font_size=15, showarrow=False)
                ]
)
fig.show()

The pie charts above show the following:
1. In India and Brazil, the majority of participants have a Bachelor's degree, followed by participants with a Master's degree.
2. The trend in the United States, Japan, and Brazil is different: more than 40% of participants in these countries hold a Master's degree, and a Bachelor's degree is the second most common level of education.

## Industry

In [None]:
df=survey_data_top_5_country.groupby(['Q20'])['Q20'].count().sort_values(ascending=False).reset_index(name='Participants').rename(columns={'Q20':'Industry'})
df['Percentage']=round((df['Participants']/df['Participants'].sum())*100, 2) 
fig = px.bar(df, x='Industry', y='Percentage',text=[f'{i}%' for i in df['Percentage']],title='Percentage of participating industries of these top five countries',color_discrete_sequence=mycolor)
fig.show()

# From the graph we see that the top five participating industries are Computers/Technology,Academics/Education,Accounting/Finance,Manufacturing/Fabrication,Medical/Pharmaceutical respectively
top_5_industry=['Computers/Technology','Academics/Education','Accounting/Finance','Manufacturing/Fabrication','Medical/Pharmaceutical']

From the graph above we can see that:
1. Almost one third of Kagglers are from the 'Computers/Technology' industry, and 17.10% are from 'Academics/Education'. Together these two industries account for 46.73% of responses, so Kagglers in these countries mostly come from 'Computers/Technology', followed by 'Academics/Education'.
2. Next come 'Accounting/Finance', 'Manufacturing/Fabrication' and 'Medical/Pharmaceutical'.

In [None]:
#Comparison of percentage of participating industries of these top five countries
df=survey_data_top_5_country.groupby(['Q3'])['Q20'].value_counts(normalize=True).rename('Percentage').mul(100).reset_index().sort_values('Q3').rename(columns={'Q3':'Country','Q20':'Industry'})
df['Percentage']=round(df['Percentage'],2)
df=df[df['Industry'].isin(top_5_industry)]
df.reset_index(inplace=True)

df['Country'] = pd.Categorical(df['Country'], top_5_country)
df['Industry'] = pd.Categorical(df['Industry'], top_5_industry)
df=df.sort_values(['Country','Industry'])

fig = px.bar(df,x='Country', y='Percentage',color='Industry', barmode='group',text=[f'{i}%' for i in df['Percentage']],
              title='Comparison of percentage of participating industries of these top five countries',color_discrete_sequence=mycolor)
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-.3,
    xanchor="right",
    x=1
))
fig.show()

The graph above shows participation by industry (top five industries) for the five countries: India, USA, Japan, China and Brazil. It shows the following:
1. In every country except Brazil, most Kagglers are from the 'Computers/Technology' industry. In Brazil, most are from 'Academics/Education'.
2. For India, the USA and China, the second most common industry is 'Academics/Education', whereas for Japan it is 'Manufacturing/Fabrication' and for Brazil it is 'Computers/Technology'.
3. In all countries except Japan, the third most common industry is 'Accounting/Finance'. In Japan, it is 'Academics/Education'.

Finally, Japan has adopted ML/DS in the 'Manufacturing/Fabrication' industry to a much greater extent (23.31%) than the other four countries, where the share for this industry is less than 6%.

## Computing Platform

In [None]:
#Comparison of Computing Platform for top five countries
df=survey_data_top_5_country.groupby(['Q3'])['Q11'].value_counts(normalize=True).rename('Percentage').mul(100).reset_index().sort_values('Q3').rename(columns={'Q3':'Country','Q11':'Computing Platform'})
df['Percentage']=round(df['Percentage'],2)
df.reset_index(inplace=True)
df['Country'] = pd.Categorical(df['Country'], top_5_country)
df=df.sort_values(['Country'])

fig = px.bar(df,x='Country', y='Percentage',color='Computing Platform', barmode='group',text=[f'{i}%' for i in df['Percentage']],
              title='Comparison of Computing Platform used for top five countries',
             color_discrete_sequence=mycolor,width=1200)
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-.3,
    xanchor="right",
    x=1
))
fig.show()

This graph supports the following observations:

1. The laptop is the most common computing platform in all of these countries, followed by the personal computer/desktop in second place.

2. Among the less popular computing platforms, cloud computing platforms (AWS, Azure, GCP, hosted notebooks, etc.) have slightly more users in the United States, Japan, and Brazil, whereas deep learning workstations (NVIDIA GTX, LambdaLabs, etc.) have somewhat more users in China.

## Company Size

In [None]:
df=survey_data_top_5_country.groupby(['Q3'])['Q21'].value_counts(normalize=True).rename('Percentage').mul(100).reset_index().sort_values('Q3').rename(columns={'Q3':'Country','Q21':'Size'})
df['Percentage']=round(df['Percentage'],2)
df.reset_index(inplace=True)

size_order=['0-49 employees','50-249 employees','250-999 employees','1000-9,999 employees','10,000 or more employees']
df['Country'] = pd.Categorical(df['Country'], top_5_country)
df['Size'] = pd.Categorical(df['Size'], size_order)
df=df.sort_values(['Country','Size'])

fig = px.bar(df,x='Country', y='Percentage',color='Size', barmode='group',text=[f'{i}%' for i in df['Percentage']],
              title='Comparison of company size for top five countries',color_discrete_sequence=mycolor,width=1000)
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-.3,
    xanchor="right",
    x=1
))
fig.show()

This figure presents the following findings:

For ease of description, let the letters A, B, C, D and E stand for companies with '0-49 employees', '50-249 employees', '250-999 employees', '1000-9,999 employees' and '10,000 or more employees' respectively.

1. In India, participants mostly come from either type A or type E companies. Participation from the other three sizes, B, C and D, is almost equal and no more than 16% each.

2. In the USA, participants mostly come from type A, D and E companies. Participation from B and C is at most 13% each.

3. Japan and Brazil are exceptions: participants in these two countries are spread approximately evenly across the five company sizes.

4. China differs from the other four countries. It has a very large share of participants from type A companies, 37.24%. Participation from the other four company sizes is spread fairly evenly.

The share of participants from type A companies is much higher in China than in the other four countries: about 37% in China, compared with 27% in India and 25.22% in Brazil. For India and Brazil, this much data science and ML activity in small companies can be explained to some extent, because these two countries are among the top ten freelancing countries. China, however, is not as popular as a freelancing country. So it can be assumed that this 37% of data science and ML professionals serve only their own country. What does this widespread use of data science and ML in small companies tell us about China?

Let's dig a little deeper and find out, for companies with '0-49 employees' in China, what percentage belongs to which industry.


In [None]:
df=survey_data_top_5_country
df=df[df['Q3']=='China']
df=df.groupby(['Q21'])['Q20'].value_counts(normalize=True).mul(100).round(2).rename('Percentage').reset_index().rename(columns={'Q20':'Industry','Q21':'Size'})
fig = px.sunburst(df, path=['Size','Industry'], values='Percentage',height=600,width=600,title='China\'s company size wise industry percentage.' )
fig.show()

The figure above shows that around 70% of these small companies belong to either the 'Academics/Education' or the 'Computers/Technology' industry. That means they are extensively involved in both research and implementation in the DS and ML field.

What does that indicate? Is a big digital transformation about to begin in China, or has it already begun?

In [None]:
df_column_list=list(survey_data_top_5_country.columns)
df_column_list.remove('Q3')

survey_data_top_5_country_summary=survey_data_top_5_country.groupby('Q3')[df_column_list].count().reset_index().rename(columns={'Q3':'Country'})
survey_data_top_5_country_summary['Country'] = pd.Categorical(survey_data_top_5_country_summary['Country'], top_5_country)
survey_data_top_5_country_summary=survey_data_top_5_country_summary.sort_values(['Country'])

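def rename_column_name(df_1,df_2,col_list):
    """Rename each multi-select column of df_2 (in place) to the option text it holds in df_1."""
    new_col_list=[]
    for column in col_list:
        # Each multi-select column contains a single option text (or NaN), so its most frequent value is the label
        new_name=df_1[column].value_counts().index[0].strip()
        df_2.rename(columns={column:new_name},inplace=True)
        new_col_list.append(new_name)
    return new_col_list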
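
The summary table and the renaming helper above rely on how multiple-answer questions appear to be stored: one column per option (for example `Q7_Part_1`), holding the option text where it was selected and NaN otherwise. The sketch below illustrates that layout on made-up rows; the answers and the two-option question are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

# Toy multi-select data: each option column holds its own text or NaN
toy = pd.DataFrame({
    "Q3": ["India", "India", "Japan", "Japan"],
    "Q7_Part_1": ["Python", "Python", "Python", np.nan],
    "Q7_Part_2": [" R ", np.nan, np.nan, " R "],
})
option_cols = ["Q7_Part_1", "Q7_Part_2"]

# count() ignores NaN, so each cell is the number of respondents who ticked that option
counts = toy.groupby("Q3")[option_cols].count()

# Recover a readable label per column from its first non-null value
labels = {col: toy[col].dropna().iloc[0].strip() for col in option_cols}
counts = counts.rename(columns=labels)
print(counts)
```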

## Programming Language Used

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q7' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q7_Part_1":"Q7_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q7' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of programming language used for top five countries',color_discrete_sequence=mycolor2,height=600)
fig.show()

This graph shows the following:

1. Python is extremely popular in all five countries, with more than 30% of total usage.
2. SQL, C and C++ come next and are almost equally popular in most of these countries.
3. R is slightly more popular in the USA and Brazil than in the other three countries.
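
The percentages in this chart are shares of all selections made in a country, so they sum to 100 even though one respondent can pick several languages. An alternative reading divides by the number of respondents instead. The sketch below contrasts the two on made-up data; it is an illustration of the arithmetic, not a recomputation of the survey figures.

```python
import numpy as np
import pandas as pd

# Toy data: four made-up respondents from one country, two option columns
toy = pd.DataFrame({
    "Q3": ["USA", "USA", "USA", "USA"],
    "Python": ["Python", "Python", "Python", np.nan],
    "SQL": ["SQL", "SQL", np.nan, np.nan],
})

selections = toy.groupby("Q3")[["Python", "SQL"]].count()   # ticks per option
respondents = toy.groupby("Q3").size()                       # people per country

# Share of all ticks: each row sums to 100
share_of_selections = selections.div(selections.sum(axis=1), axis=0).mul(100)
# Share of people: each option is judged on its own, rows may exceed 100
share_of_respondents = selections.div(respondents, axis=0).mul(100)

print(share_of_selections.round(2))
print(share_of_respondents.round(2))
```

In this toy data, Python accounts for 3 of 5 selections (60%) but was chosen by 3 of 4 respondents (75%), which shows how the choice of denominator changes the reading.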

## IDE

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q9' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q9_Part_1":"Q9_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q9' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of IDE used in top five countries',color_discrete_sequence=mycolor,height=600)
fig.show()

This graph shows the following:

1. Jupyter Notebook is the most popular IDE in all of these countries except China, where PyCharm is first and Jupyter Notebook second.
2. After Jupyter Notebook, VSCode is the second most popular IDE in most of these countries.

## Hosted Notebook Product

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q10' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q10_Part_1":"Q10_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q10' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of Hosted Notebook Product used in top five countries',
             color_discrete_sequence=mycolor,height=600)
fig.show()

This graph shows that:

1. Kaggle Notebook and Google Colab are almost equally popular in these countries.
2. A considerable number of participants do not use any hosted notebook.
3. Other hosted notebooks are far less popular.

## Data Visualization Libraries or Tools

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q14' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q14_Part_1":"Q14_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q14' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of data visualization libraries or tools usage of top five countries',color_discrete_sequence=mycolor,height=600)
fig.show()

From the chart above we see that:

1. Matplotlib and Seaborn are nearly equally popular in these countries.
2. Plotly/Plotly Express and Ggplot/ggplot2 come next, at similar levels.
3. None of the others is above 5% in terms of popularity.



## Machine Learning Framework

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q16' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q16_Part_1":"Q16_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q16' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of Machine Learning Framework usage of top five countries',
             color_discrete_sequence=mycolor,height=600)
fig.show()

From these statistics we can see that:

1. Scikit-learn and TensorFlow are the most used frameworks in almost all five countries.
2. Keras, PyTorch and Xgboost are also used in these countries. In China, PyTorch is used more than in the other countries.
3. In Japan, LightGBM is more popular than in the other countries.

## ML Algorithm

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q17' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q17_Part_1":"Q17_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q17' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of ML Algorithm used in top five countries',color_discrete_sequence=mycolor,height=600)
fig.show()

1. In terms of popularity, 'Linear or Logistic Regression', 'Decision Trees or Random Forests' and 'Gradient Boosting (xgboost, lightgbm, etc.)' can be put in the same group.
2. Then come Convolutional Neural Networks.
3. After that, Bayesian Approaches and Dense Neural Networks come together.

## Cloud Computing Platform

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q27_A' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q27_A_Part_1":"Q27_A_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q27_A' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of Cloud Computing Platform used in top five countries',color_discrete_sequence=mycolor,height=600)
fig.show()

Here we can see that:

1. In India, the USA and Brazil, AWS, Azure and Google Cloud are side by side in terms of popularity.
2. In Japan, AWS and Google Cloud are almost equally popular, but Microsoft Azure is much less popular (8.33%).
3. In China, participants mostly use Alibaba Cloud (28%), followed by Tencent Cloud and Google Cloud at almost the same level.



## Cloud Computing Products

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q29_A' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q29_A_Part_1":"Q29_A_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q29_A' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of Cloud Computing Products used in top five countries',color_discrete_sequence=mycolor,height=600)
fig.show()

From the chart above we can see that:

1. In almost every country, a significant number of participants do not use any cloud computing product.
2. In Brazil, India and the USA, there is no significant difference in the popularity of Amazon Elastic Compute Cloud (EC2), Microsoft Azure Virtual Machines and Google Cloud Compute Engine.
3. In Japan and China, Amazon Elastic Compute Cloud (EC2) and Google Cloud Compute Engine are quite popular, but Microsoft Azure Virtual Machines is much less so.

## Data Storage Products

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q30_A' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q30_A_Part_1":"Q30_A_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q30_A' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of Data Storage Products used in top five countries',color_discrete_sequence=mycolor,height=600)
fig.show()

From the chart above we can see that:

1. Amazon Simple Storage Service (S3) is hugely popular in four of these countries. The exception is China, where it is the second most popular product.
2. In those four countries, Google Cloud Storage holds second position in terms of popularity, whereas in China it is in first position.
3. In India, the USA and Brazil, Microsoft Azure Data Lake Storage has a modest share (at least 10.5%), but in Japan and China its share is no more than 2.68%.


## Managed Machine Learning Products

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q31_A' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q31_A_Part_1":"Q31_A_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q31_A' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of Managed Machine Learning Products used in top five countries',color_discrete_sequence=mycolor,height=600)
fig.show()

The graph above shows that:

1. In all five countries, most participants do not use any managed machine learning product. In Japan, this share reaches almost 75%.
2. Those who do use managed machine learning products mostly use Amazon SageMaker, Azure Machine Learning Studio, Google Cloud Vertex AI and Databricks, but none of these is widely used.



## Big Data Products

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q32_A' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q32_A_Part_1":"Q32_A_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q32_A' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of Big Data Products used in top five countries',color_discrete_sequence=mycolor,height=600)
fig.show()

The graph above shows that:

1. In all of these countries, MySQL and PostgreSQL are the most used products. MySQL usage is highest in China (30.67%).
2. SQLite comes next.
3. In Japan, almost 30% of participants do not use any big data product.


## Business Intelligence Tools

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q34_A' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q34_A_Part_1":"Q34_A_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q34_A' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of Business Intelligence Tools used in top five countries',color_discrete_sequence=mycolor,height=600)
fig.show()

This figure shows that:

1. In these five countries, a significant number of participants do not yet use any BI tool.
2. Tableau is slightly more popular than Microsoft Power BI in every country except Brazil, where Microsoft Power BI is more popular than Tableau.
3. In Brazil, Google Data Studio is used by 10.55% of participants, which is exceptional compared with the other four countries, where its usage is no more than 6.58%.


## Automated Machine Learning Tools

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q36_A' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q36_A_Part_1":"Q36_A_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q36_A' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,
             title='Comparison of Automated Machine Learning tools used in top five countries',
             color_discrete_sequence=mycolor,height=600,width=1100)
fig.show()

These statistics show that:

1. Most participants are not yet accustomed to using automated machine learning tools.
2. All the tools are used to more or less the same extent, except 'Automated Model Architecture Searches', which is the least popular in these five countries.


## Tools to Manage Machine Learning Experiments

In [None]:
column_list=['Country']+[col for col in df_column_list if 'Q38_A' in col]
df=survey_data_top_5_country_summary[column_list].copy()
df["sum"] = df.sum(axis=1, numeric_only=True)  # skip the categorical 'Country' column
df_new = df.loc[:,"Q38_A_Part_1":"Q38_A_OTHER"].div(df["sum"], axis=0).mul(100).round(2)
df_new['Country']=df['Country']
df=df_new

q_column_list=[col for col in df_column_list if 'Q38_A' in col]
y_labels=rename_column_name(survey_data_top_5_country,df,q_column_list)

fig = px.bar(df,x='Country', y=y_labels,title='Comparison of Tools to manage Machine Learning Experiments used in top five countries',
             color_discrete_sequence=mycolor,height=600)
fig.show()

This figure tells us that:

1. In these five countries, most of the participants don't use any tools to manage machine learning experiments.
2. Those who do use such tools mostly use TensorBoard, followed by MLflow.
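
Because the "no tools" answer dominates, it can help to look at the shares among tool users only. The sketch below drops that column, rescales each row to 100%, and picks the leading tool per country. The numbers, the country labels and the `No / None` column name are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical percentages per country; these are NOT survey figures.
shares = pd.DataFrame(
    {
        "TensorBoard": [12.0, 9.0],
        "MLflow": [6.0, 7.0],
        "Other": [2.0, 4.0],
        "No / None": [80.0, 80.0],
    },
    index=["Country A", "Country B"],
)

# Keep only the tool selections and rescale each row so it sums to 100%.
users = shares.drop(columns="No / None")
users = users.div(users.sum(axis=1), axis=0).mul(100).round(2)

# Most-selected tool per country among tool users.
top_tool = users.idxmax(axis=1)
print(users)
print(top_tool)
```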

## Conclusions can we draw from this examination

The goal of this notebook was to examine the state of data science and machine learning in the five countries with the most participants. I wanted to see which industries the data science professionals in these countries are most involved in, which tools and techniques they use to get their work done, and so on. After reviewing the statistics for these five countries, we found the following:

1. The top five participating countries in the '2021 Kaggle Machine Learning & Data Science Survey' are India, USA, Japan, China and Brazil. India has the highest number of participating Kagglers.
2. Women's participation is far lower than men's in this survey. In Japan, women make up less than 10% of participants.
3. Data Science and Machine Learning are more popular among younger participants than among older ones.
4. The majority of the participants hold a bachelor's or master's degree.
5. In each country, most participants are involved in the 'Computer/Technology' or 'Academics/Education' industry, which is fairly predictable. Japan is a little different here: in the 'Manufacturing/Fabrication' sector, Japan has adopted ML/DS considerably more (23.31%) than the other four nations.
6. 'Laptop' and 'Personal computer/ desktop' are still at the top of the list of user favorites.
7. In these five countries, participants mostly come from companies with '0-49 employees', '1000-9,999 employees' and '10,000 or more employees'. In China, 37.24% of participants are from '0-49 employees' companies, which is a notably large share. India and China both have a sizable share of participants from '0-49 employees' companies, which is unsurprising given that these two countries are among the top five freelancing countries. In China, however, this large group of Data Science and ML professionals serves only the home country, as freelancing is not very popular there. This distinguishes China from the others and raises the question: is China undergoing a digital transformation, or has one already begun?
8. Python is extremely popular as a programming language in all five of these countries. SQL, C and C++ are also well liked.
9. Jupyter Notebook tops the popularity ranking in these five countries.
10. Kaggle Notebooks and Google Colab are Kagglers' first choices for hosted notebooks.
11. Matplotlib and Seaborn come first in users' preferences for data visualization.
12. As machine learning frameworks, Scikit-learn, TensorFlow, Keras and PyTorch are all used by Kagglers.
13. PyTorch is used more frequently in China than in other countries.
14. Among ML algorithms, Linear or Logistic Regression, Decision Trees or Random Forests, and Gradient Boosting (xgboost, lightgbm, etc.) are popular among Kagglers. Convolutional Neural Networks, Bayesian Approaches and Dense Neural Networks are also used by participants.
15. In India, USA and Brazil, AWS, Azure and Google Cloud are side by side in popularity as cloud computing platforms. In China, however, Alibaba Cloud and Tencent Cloud are Kagglers' favourites.
16. Kagglers are not yet accustomed to using cloud computing products. Among those who do, Amazon Elastic Compute Cloud (EC2) and Google Cloud Compute Engine are used the most. Microsoft Azure Virtual Machines is used less than either of them.
17. Amazon Simple Storage Service (S3) is hugely popular in four of these countries. The exception is China, where Google Cloud Storage takes first place.
18. The majority of participants do not utilize any managed machine learning product. Amazon SageMaker, Azure Machine Learning Studio, Google Cloud Vertex AI, and Databricks are the most popular managed machine learning tools. However, none of these are widely utilized.
19. Among Big Data Products, MySQL, PostgreSQL and SQLite are the most used by participants. In Japan, about 30% of participants do not use any Big Data Product.
20. In these five countries, a significant number of participants do not use any BI tools yet. In Japan and China this percentage is very high, at 54.84% and 45.08% respectively. Among participants who do, Tableau and Microsoft Power BI are used the most.
21. The majority of participants have not yet become used to utilizing automated machine learning techniques.
22. Most participants do not use any tools to manage machine learning experiments. Among those who do, TensorBoard and MLflow are the most familiar.
23. On almost all of these parameters, India, the United States and Brazil follow nearly the same path. One possible explanation is that each of these countries is among the top 10 freelancing countries.
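
One way to put a number on how closely countries "follow the same path" is to correlate their percentage profiles for a given question. This check is not part of the analysis above. It is only a sketch under stated assumptions: the values are invented and the `option_*` and `Country *` labels are hypothetical.

```python
import pandas as pd

# Hypothetical percentage profiles (rows: countries, columns: answer options).
# These values are illustrative only and are NOT survey figures.
profiles = pd.DataFrame(
    {
        "option_1": [40.0, 38.0, 15.0],
        "option_2": [35.0, 36.0, 25.0],
        "option_3": [25.0, 26.0, 60.0],
    },
    index=["Country A", "Country B", "Country C"],
)

# Transpose so each country becomes a column, then correlate the columns.
similarity = profiles.T.corr(method="pearson").round(2)
print(similarity)
```

Values close to 1 indicate countries whose answer distributions have a similar shape. With only a handful of options per question the coefficient is a rough guide, not a statistical test.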

This is all I have been able to gather from the 2021 survey.

Thank you for reading!
