## Introduction
This report compares emerging trends in machine learning and data science across different regions of the world: India, the United States of America, Europe, South East Asia and one country of your choice. I picked Iran as the country of interest, since I grew up there and I'm curious about it. You can change it to any other country and see how it deviates from the rest of the world or from a specific region. <br>
- First, I show the differences and similarities of Kaggle respondents in age, gender, education, years of experience and current role at work. <br>
- Then we'll see how their choices of programming languages, development environments, hosted notebooks, and computer vision and NLP techniques differ. <br>
- Finally, we assess how team and company size differ across regions and countries, and how companies approach machine learning and use cloud computing platforms together with data storage products. <br>
<br>
I hope you find this analysis interesting. If you have any comments or feedback, I would truly appreciate it! :)

In [None]:
import pandas as pd
import numpy as np

import IPython.display
import matplotlib.pyplot as plt

from plotly.offline import plot, iplot, init_notebook_mode
from plotly import subplots
import plotly as py
import plotly.graph_objects as go
import plotly.express as px

init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

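df = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
# The first row holds the full question text of each column; keep it for labels
questions = df.iloc[0,:]
# Drop that row so that only actual responses remain
df = df.drop(df.index[0])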

Below is the list of country names in the Kaggle survey. You can choose any country from this list to run this comparative analysis. I should admit that I'd be happier if Iran were listed simply as Iran, rather than Islamic Republic of... ;)
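
Because the region masks compare country names by exact string equality, a misspelled name matches no rows and raises no error. A minimal sketch of a guard, assuming a hypothetical helper `check_country` and a small sample of names standing in for the output of `df["Q3"].unique()`:

```python
from difflib import get_close_matches

# Sample names standing in for df["Q3"].unique() (assumed spellings, for illustration only)
available = [
    "India",
    "United States of America",
    "United Kingdom of Great Britain and Northern Ireland",
    "Iran, Islamic Republic of...",
]

def check_country(name, names):
    """Return `name` if it is an exact match; otherwise raise with close suggestions."""
    if name in names:
        return name
    suggestions = get_close_matches(name, names, n=3, cutoff=0.6)
    raise ValueError(f"{name!r} not found; did you mean one of {suggestions}?")

print(check_country("India", available))

try:
    # Two small typos are enough to make an exact comparison fail
    check_country("United Kingdom of Great Britain and Nothern Irland", available)
except ValueError as err:
    print(err)
```

Running such a check on `COUNTRY` and on every name used in a region mask would turn a silent empty selection into an explicit error.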

In [None]:
df["Q3"].unique()

In [None]:
COUNTRY = "Iran, Islamic Republic of..."

Due to the high number of respondents from India and the USA, I keep the analysis at country level for these two. To make the comparison fair, I combine the survey results of the top 5 European countries by number of participants: the UK (Great Britain and Northern Ireland), Germany, Spain, France and Italy. I know we have gone through the painful process of Brexit, but let's keep it in Europe for this report :D For South East Asia, I likewise combine the results from Japan, China, South Korea and Taiwan.

In [None]:
# Non-professionals: less than two years of ML experience, or no ML use at all
non_profs = df['Q15'].isin(["Under 1 year", "1-2 years", "I do not use machine learning methods"])
profs = ~non_profs # professionals: everyone else, including respondents with no answer to Q15

# Country names must match the survey's spelling exactly; a misspelled name silently matches nobody
SouthEastAsia = df['Q3'].isin(['Japan', 'China', 'South Korea', 'Taiwan'])
Europe = df['Q3'].isin(['United Kingdom of Great Britain and Northern Ireland', 'Germany', 'Spain', 'France', 'Italy'])
India = (df['Q3'] == 'India')
USA = (df['Q3'] == 'United States of America')
alt = (df['Q3'] == COUNTRY) # alternative country of interest
all_resp = pd.Series(True, index = df.index) # every respondent
Rest =  ~(India | SouthEastAsia | Europe | USA | alt) # rest of the world

countries = {'India': India,'SE_Asia':SouthEastAsia, 'USA':USA,'Europe': Europe, COUNTRY: alt, 'All': all_resp }
colors = {'India': '#58ABAE','SE_Asia':'#A5CBC6', 'USA':'#FAF4F0','Europe': '#E5E5E5', COUNTRY: '#FFD6D4', 'All': '#FEB8C5', 'Rest of the World': '#AAB8C5'}

In [None]:
num_rest = len(df) - len(df[India]) - len(df[SouthEastAsia]) - len(df[USA]) - len(df[Europe]) - len(df[alt])
num_respondants = [len(df[India]), len(df[SouthEastAsia]), len(df[USA]), len(df[Europe]), len(df[alt]), num_rest]
respondant_groups = np.append(list(countries.keys())[:-1],['Rest of the World'])

## 1. Respondents Distributions
Here you can see the distribution of respondents across regions. <br> 28.6% of respondents are from India, 10.2% from the USA, 9.35% from South East Asia, 6.3% from Europe, 0.75% from Iran and 42.6% from the rest of the world. With 195 respondents in total, Iran accounts for a rather small share of participants, but I'd like to keep it as is. <br>
For some parts of the analysis I preferred to look at responses from more experienced people. The selection criterion for this group is more than 2 years of experience using machine learning techniques. Only 31.7% of all respondents fall into this category. <br> 
The pie chart on the right shows how this group of respondents (professionals) is distributed across regions. The relative share of responses from the USA and Europe increases to 14.1% and 8.91%, respectively. For India it decreases to 21.6%, and it remains almost the same for the other regions. 
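
The shares quoted above come from boolean masks over the response table. The following is a minimal sketch on made-up data (the toy rows are an assumption; only the column names `Q3` and `Q15` and the answer labels follow the notebook):

```python
import pandas as pd

# Toy stand-in for the survey: Q3 is the country, Q15 the years of ML experience
toy = pd.DataFrame({
    "Q3": ["India", "India", "Japan", "Germany", "Brazil", "India", "Japan", "Brazil"],
    "Q15": ["Under 1 year", "2-3 years", "1-2 years", "5-10 years",
            "Under 1 year", "1-2 years", "3-4 years", None],
})

regions = {
    "India": toy["Q3"] == "India",
    "SE_Asia": toy["Q3"].isin(["Japan", "China", "South Korea", "Taiwan"]),
    "Europe": toy["Q3"].isin(["Germany", "Spain", "France", "Italy"]),
}

# The mean of a boolean mask is the fraction of rows it selects
shares = {name: round(float(mask.mean()) * 100, 1) for name, mask in regions.items()}
shares["Rest of the World"] = round(100 - sum(shares.values()), 1)

# Here a missing Q15 answer is counted in neither group
answered = toy["Q15"].notna()
non_profs = toy["Q15"].isin(["Under 1 year", "1-2 years", "I do not use machine learning methods"])
profs = answered & ~non_profs
prof_share = round(float(profs.sum() / answered.sum()) * 100, 1)

print(shares)
print(prof_share)
```

Note the design choice in the sketch: a plain negation such as `~non_profs` would count respondents who skipped the question as professionals, so the sketch restricts both groups to rows with an answer.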

In [None]:
fig = subplots.make_subplots(rows = 1, cols = 3,  shared_yaxes  = True, specs=[[{"type": "pie"}, {"type": "pie"}, {"type": "pie"}]],
                             subplot_titles = ['All Participants','Professionals vs. Non-professionals',
                                               'Professional Participants'])

fig.update_layout(title = go.layout.Title(text ="<b> Regions and Countries Participation</b>", x = 0.4))

trace1 = go.Pie(labels = respondant_groups, values = num_respondants, 
                name = '', domain = dict(x = [0, 0.3]))

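# Look up colors by label so that 'Rest of the World' does not receive the color of 'All'
trace1.marker = dict(colors = [colors[group] for group in respondant_groups])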

trace2 = go.Pie(labels = ['profs', 'non_profs'], values = [len(df[profs]), len(df[non_profs])], name = '')
trace2.marker = dict(colors = ['#FFB347', '#78A2CC'])

num_others_profs = len(df[profs]) - len(df[India & profs]) - len(df[SouthEastAsia & profs]) - len(df[USA & profs]) - len(df[Europe & profs]) - len(df[alt & profs])
num_respondants_profs = [len(df[India & profs]), len(df[SouthEastAsia & profs]), len(df[USA & profs]), len(df[Europe & profs]), len(df[alt & profs]), num_others_profs]
trace3 = go.Pie(labels = respondant_groups, values = num_respondants_profs, name = '',
                marker = dict(colors = [colors[group] for group in respondant_groups]))

fig.append_trace(trace1,1,1)
fig.append_trace(trace2,1,2)
fig.append_trace(trace3,1,3)

fig.show()


In [None]:
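# Current roles (Q5): drop the first unique value and re-append 'Other' at the end
# (this assumes the first unique value is 'Other')
fields = df["Q5"].unique()
fields = fields[1:]
fields = np.append(fields,['Other'])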

## 2. Age

In [None]:
title_format = "<span style='font-size:20px; font-family:Times New Roman'>%s</span>"
fig = subplots.make_subplots(rows = 1, cols = 1, 
                             subplot_titles = [title_format % "Age Among Different Regions and Countries "],
                             vertical_spacing = 0.07)

for country in countries.keys():
    count = df[countries[country]]['Q1'].value_counts()
    trace = go.Bar(
                    y = np.round(df[countries[country]]['Q1'].value_counts()/len(df[countries[country]])*100,1),
                    x = df[countries[country]]['Q1'].value_counts().index,
                    name = country,
                    hovertext = count,
                    marker = dict(color = colors[country]),
                    legendgroup = country
                    )

    fig.append_trace(trace,1,1)
    fig.update_traces(hovertemplate = '<b>Percent</b>: %{y}%<br>'+
                                      '<b>Count</b>: %{hovertext}')


fig['layout'].update(height = 400, width = 1000)

iplot(fig)

The largest age group among respondents from India is 18-21, from Iran 22-29, from SE Asia and Europe 25-29, and from the US 25-34. <br>
36% of respondents from India are 18-21 years old, the largest share of any age group in any region. <br>
Most of the respondents in the age groups 40-44, 45-49, 50-54, 55-59, 60-69 and 70+ are from the US. <br>
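
Per-region percentages like these can also be computed in one step with `value_counts(normalize=True)`. This is a small sketch on made-up data; the `region` column is a hypothetical stand-in for the notebook's boolean masks:

```python
import pandas as pd

# Made-up answers to the age question (Q1) for two regions
toy = pd.DataFrame({
    "region": ["India", "India", "India", "USA", "USA", "USA"],
    "Q1": ["18-21", "18-21", "22-24", "25-29", "30-34", "25-29"],
})

# Percentage of each age group within its region
pct = toy.groupby("region")["Q1"].value_counts(normalize=True).mul(100).round(1)

# Most common age group per region
largest = toy.groupby("region")["Q1"].agg(lambda s: s.value_counts().idxmax()).to_dict()

print(pct)
print(largest)
```

One difference to keep in mind: `normalize=True` divides by the number of non-missing answers, whereas dividing the counts by `len(...)` of the region, as the plots here do, uses all respondents of that region as the denominator.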

## 3. Gender

In [None]:
title_format = "<span style='font-size:20px; font-family:Times New Roman'>%s</span>"
fig = subplots.make_subplots(rows = 1, cols = 1, 
                             subplot_titles = [title_format % "Gender Among Different Regions and Countries"],
                             vertical_spacing = 0.07)

for country in countries.keys():
    trace2 = go.Bar(
                    y = np.round(df[countries[country]]['Q2'].value_counts()/len(df[countries[country]])*100,1),
                    x = df[countries[country]]['Q2'].value_counts().index,
                    name = country,
                    hovertext = df[countries[country]]['Q2'].value_counts(),
                    marker = dict(color = colors[country]),
                    legendgroup = country,
                    showlegend = True
                    )
    

    fig.append_trace(trace2,1,1)
    fig.update_traces(hovertemplate = '<b>Percent</b>: %{y}%<br>'+
                                      '<b>Count</b>: %{hovertext}')


fig['layout'].update(height = 400, width = 1000)

iplot(fig)

79% and 19% of all respondents are male and female, respectively. <br> 
The share of female respondents is highest among Iranians, at 26%. (woohoo!) <br>
Female participation is lowest among SE Asians and Europeans, at 13% and 14%. <br>

## 4. Education

In [None]:
title_format = "<span style='font-size:20px; font-family:Times New Roman'>%s</span>"
fig = subplots.make_subplots(rows = 1, cols = 1, 
                             subplot_titles = [title_format % "Education Among Different Regions and Countries"],
                             vertical_spacing = 0.07)

for country in countries.keys():
    
    trace3 = go.Bar(
                    y = np.round(df[countries[country]]['Q4'].value_counts()/len(df[countries[country]])*100,1),
                    x = df[countries[country]]['Q4'].value_counts().index,
                    name = country,
                    hovertext = df[countries[country]]['Q4'].value_counts(),
                    marker = dict(color = colors[country]),
                    legendgroup = country,
                    showlegend = True
                    )

    fig.append_trace(trace3,1,1)
    fig.update_traces(hovertemplate = '<b>Percent</b>: %{y}%<br>'+
                                      '<b>Count</b>: %{hovertext}')

fig['layout'].update(height = 600, width = 1000)

iplot(fig)

Most respondents have either a Bachelor's or a Master's degree: 38% and 39% of all respondents, respectively. <br> 
Most people from India have a Bachelor's degree, while most people from SE Asia, Europe, the US and Iran have a Master's degree. This is probably because participants from India are younger than those from other regions. <br>

Relatively more respondents from Europe have a PhD than in any other group: 20% of participants from Europe, compared with 4% of respondents from India and 10% of all respondents. <br> 
Education is not limited to university degrees. Among respondents with some university study but no degree, SE Asia has the highest relative share: 10.5% of respondents from SE Asia belong to this group. <br>

## 5. Work and Programming Experience

In [None]:
experience_labels = ['I have never written code', '< 1 years', '1-3 years', '3-5 years','5-10 years', '10-20 years', '20+ years']

experience_country = pd.DataFrame()
field_country = pd.DataFrame()

for country in countries.keys():
    experience_country[country] = df[countries[country]]["Q6"].value_counts().reindex(experience_labels)
    experience_country[country] = np.round(experience_country[country]/(countries[country].sum()) * 100,1)

for country in countries.keys():
    field_country[country] = df[countries[country]]["Q5"].value_counts().reindex(fields)
    field_country[country] = np.round(field_country[country]/(countries[country].sum()) * 100,1)

fig = subplots.make_subplots(rows = 1, cols = 2, horizontal_spacing = 0.3, subplot_titles = ['Years of Programming Experience','Current Role at Work'] )
fig.update_layout(title = go.layout.Title(text ="<b> Programming Experience and Current Role for All Participants</b>", x = 0.5))
fig.add_trace(go.Heatmap(x = list(countries.keys()), y = experience_labels, z = experience_country.values, type = 'heatmap', colorscale = 'tealrose', name = ''),1,1)
fig.add_trace(go.Heatmap(x = list(countries.keys()), y = fields, z = field_country.values, type = 'heatmap', colorscale = 'tealrose', name = ''),1,2)
fig.update_traces(hovertemplate = '<b>Region</b>: %{x}<br>'+
                                    '<b>Experience</b>: %{y}<br>' +
                                    '<b>Percent</b>: %{z}%')
fig.update_layout(autosize = False, width = 1000, height = 500)
fig.update_traces(showscale = False)
fig.show()

In all regions, the largest group of respondents has 1-3 years of programming experience. So it seems that Kaggle is popular among people who are new to the field, know the basics and fundamentals, and have practiced programming for at least a year. <br>
In all regions, only a very small percentage of respondents has no coding experience at all. <br>
For the USA and Europe, respondents are fairly evenly distributed across the experience levels, whereas in India, South East Asia and Iran most respondents have less than 5 years of experience. This is in line with the general trend among all respondents. <br>
Students are the largest group of respondents in all regions. Data scientists are the second largest group and software engineers the third. This is more or less the trend in every region. There are relatively more data scientists from the US and Europe, and relatively more students from India and South East Asia. Surprisingly, in every region there are more data analysts, research scientists, or even people currently not employed, than machine learning engineers in this survey! 

In [None]:
fig = subplots.make_subplots(rows = 5, cols = 3,  shared_yaxes  = True,
                             subplot_titles = fields)
fig.update_layout(title = go.layout.Title(text ="<b> Programming Experience by Current Role for All Participants</b>", x = 0.5))

i = 0

for field in fields:
    for country in countries.keys():
        experience_country[country] = df[(countries[country]) & (df["Q5"] == field)]["Q6"].value_counts()
        experience_country[country] = np.round(experience_country[country]/(countries[country].sum()) * 100,1)
    trace = go.Heatmap(x = list(countries.keys()), y = experience_labels, z = experience_country.values, type = 'heatmap', colorscale = 'tealrose', name = '')
    fig.append_trace(trace,i//3+1, i%3+1)
    i += 1
fig.update_traces(hovertemplate = '<b>Region</b>: %{x}<br>'+
                                    '<b>Experience</b>: %{y}<br>' +
                                    '<b>Percent</b>: %{z}%')
fig.update_layout(autosize = False, width = 1000, height = 1500)
fig.update_traces(showscale = False)
fig.show()

Now let's take a closer look at how programming experience varies across working roles.<br>
In these plots the data is normalized per country or region. In other words, the number in each square is a percentage of all respondents from that region. For example, 17.56% of Indian respondents are students with 1-3 years of programming experience, and 3.17% of respondents from SE Asia are data analysts with 1-3 years of programming experience.  <br>
The color scale is adjusted per plot: dark red in one plot corresponds to a different percentage than dark red in another plot. It only marks the maximum value within that specific plot. <br>
There are relatively more software engineers, research scientists, data engineers and database engineers with 10+ years of programming experience from Europe and the US. In SE Asia there are more data analysts, business analysts, data engineers and product managers with 1-3 years of programming experience. <br>
Most Indian respondents have 1-3 years of programming experience and work as software engineers, data scientists, data analysts or business analysts. The biggest subgroup among all respondents is Indian students with 1-3 years or less than a year of programming experience. <br>
The Iranian data is sparse, probably due to the low number of participants (195 people). They are mostly data scientists and machine learning engineers with 3-5 years of programming experience, or software engineers or people between jobs (currently not employed) with 1-3 years of programming experience. <br>
Across all places, most respondents are students with 1-3 years of programming experience, followed by data scientists and data analysts with 1-3 or 3-5 years of programming experience.
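
The per-region normalization described above can be sketched as follows. The data is made up and the column names `region`, `role` and `experience` are hypothetical; the point is only that each (role, experience) count is divided by the total number of respondents in its region:

```python
import pandas as pd

# Made-up respondents: region, current role and years of programming experience
toy = pd.DataFrame({
    "region": ["India", "India", "India", "India", "USA", "USA"],
    "role": ["Student", "Student", "Data Analyst", "Student", "Data Scientist", "Student"],
    "experience": ["1-3 years", "< 1 years", "1-3 years", "1-3 years", "5-10 years", "1-3 years"],
})

# Count every (region, role, experience) combination that occurs
counts = toy.groupby(["region", "role", "experience"]).size()

# Divide each count by the number of respondents in its region
region_totals = toy.groupby("region").size()
percent = counts.div(region_totals, level="region").mul(100).round(1)

print(percent)
```

With this normalization the values of one region add up to 100 across all role plots together, which is why a single role plot can look pale for a region even when that role is common elsewhere.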

## 6. Programming Language

In [None]:
def usage_table(columns, prefix_len, group):
    """Percentage of each region's respondents in `group` who selected each option."""
    table = pd.DataFrame()
    for col in columns:
        # The question text starts with a fixed-length prefix; the option name follows it
        option = questions[col][prefix_len:]
        for country in countries.keys():
            selected = (countries[country] & df.loc[:,col].notnull() & group).sum()
            table.loc[option,country] = np.round(selected/((countries[country] & group).sum()) * 100,1)
    return table

# Programming languages
program = usage_table(df.columns[7:20], 102, non_profs)
program_profs = usage_table(df.columns[7:20], 102, profs)

# Development environments
dev_env = usage_table(df.columns[21:34], 142, non_profs)
dev_env_profs = usage_table(df.columns[21:34], 142, profs)

# Hosted notebooks
host_not = usage_table(df.columns[34:51], 124, non_profs)
host_not_profs = usage_table(df.columns[34:51], 124, profs)

host_not

In [None]:
fig = subplots.make_subplots(rows = 1, cols = 3, horizontal_spacing = 0.04, shared_yaxes  = True, subplot_titles = ['Non-professionals','Professionals', 'Difference'] )
fig.update_layout(title = go.layout.Title(text ="<b> Frequently Used Programming Language </b>", x = 0.5))

fig.add_trace(go.Heatmap(x = program.columns, y = program.index, z = program.values, type = 'heatmap', colorscale = 'tealrose', name = ''),1,1)
fig.add_trace(go.Heatmap(x = program_profs.columns, y = program_profs.index, z = program_profs.values, type = 'heatmap', colorscale = 'tealrose', name = ''),1,2)
fig.add_trace(go.Heatmap(x = program_profs.columns, y = program_profs.index, z = program_profs.values - program.values, type = 'heatmap', colorscale = 'tealrose', name = ''),1,3)

fig.update_traces(hovertemplate = '<b>Region</b>: %{x}<br>'+
                                    '<b>Language</b>: %{y}<br>' +
                                    '<b>Percent</b>: %{z}%')

fig.update_layout(width = 900, height = 600)
fig.update_traces(showscale = False)
fig.show()

The first plot shows the frequently used programming languages for non-professionals, meaning people with less than two years of experience using machine learning techniques (or none at all). The second plot shows the same for professionals, and the third plot highlights the differences between the two groups. <br>
As you can see, Python is popular among both professionals and non-professionals in all countries and regions, with 77-95% of respondents in each group reporting frequent use. Python is least used among Indian professionals and American non-professionals. <br>
SQL is the second most frequently used programming language, with usage of 24-49% across groups. It is used most among US professionals and least among SE Asian non-professionals. <br>
R's popularity ranges between 12% and 36%, and C++ between 13% and 30%. <br>
R is most popular among US professionals, and C++ is most popular among SE Asians and Iranians. <br>
The difference plot shows that R is more widely used among professionals in all regions. <br>
Python is more popular among non-professionals than professionals in India, Iran and all countries combined. <br>
Bash is more popular among professionals, especially in the US and Europe. <br>
MATLAB is relatively popular among Iranian professionals.
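
The three heatmaps rest on one idea: in a multiple-choice question every option has its own column, and a non-empty cell means the respondent selected it. A minimal sketch on made-up data, where the column names `Q7_Part_1`, `Q7_Part_2` and the flag `is_prof` are assumptions for illustration:

```python
import pandas as pd

# One column per answer option; a non-null cell means the option was selected
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None, "Python"],
    "Q7_Part_2": [None, "R", "R", None],
    "is_prof": [False, False, True, True],
})

selected = toy[["Q7_Part_1", "Q7_Part_2"]].notna()
selected.columns = ["Python", "R"]

# Usage rate per group, in percent
rate_non_profs = selected[~toy["is_prof"]].mean() * 100
rate_profs = selected[toy["is_prof"]].mean() * 100

# Positive values: more common among professionals (in percentage points)
difference = rate_profs - rate_non_profs

print(difference)
```

Because respondents can tick several options, the rates of one group do not need to add up to 100.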

## 7. Development Environment

In [None]:
fig = subplots.make_subplots(rows = 1, cols = 3, horizontal_spacing = 0.05, shared_yaxes  = True, subplot_titles = ['Non-professionals','Professionals', 'Difference'] )
fig.update_layout(title = go.layout.Title(text ="<b> Frequently Used Development Environment </b>", x = 0.6))

fig.add_trace(go.Heatmap(x = dev_env.columns, y = dev_env.index, z = dev_env.values, type = 'heatmap', colorscale = 'tealrose', name = ''),1,1)
fig.add_trace(go.Heatmap(x = dev_env_profs.columns, y = dev_env_profs.index, z = dev_env_profs.values, type = 'heatmap', colorscale = 'tealrose', name = ''),1,2)
fig.add_trace(go.Heatmap(x = dev_env.columns, y = dev_env.index, z = dev_env_profs.values - dev_env.values, type = 'heatmap', colorscale = 'tealrose', name = ''),1,3)

fig.update_traces(hovertemplate = '<b>Region</b>: %{x}<br>'+
                                  '<b>Environment</b>: %{y}<br>'+
                                  '<b>Percent</b>: %{z}%')

fig.update_layout(width = 900, height = 600)
fig.update_traces(showscale = False)
fig.show()

Jupyter Notebook is the absolute winner among development environments, with 58-71% of professionals and non-professionals from different regions using it. <br> 
Visual Studio Code is the second most popular, with 32-43%. It is especially popular among SE Asian and European professionals. <br> 
PyCharm is the third most popular environment, with a usage rate of 22-42% among different groups. It is appreciated most among Iranian professionals. <br>
RStudio is used relatively more among professionals from the USA and Europe. <br>
US professionals use Vim/Emacs more than non-professionals. <br>
MATLAB is more popular among Iranian professionals than non-professionals. <br>
Professionals in India use Spyder more than non-professionals; however, the trend is reversed for Jupyter Notebook. <br>
Just like Python among programming languages, Jupyter Notebook is used more by non-professionals than professionals in Iran, as in India. <br>


## 8. Hosted Notebooks

In [None]:
fig = subplots.make_subplots(rows = 1, cols = 3, horizontal_spacing = 0.05, shared_yaxes  = True, subplot_titles = ['Non-professionals','Professionals', 'Difference'] )
fig.update_layout(title = go.layout.Title(text ="<b> Frequently Used Hosted Notebook </b>", x = 0.6))

fig.add_trace(go.Heatmap(x = host_not.columns, y = host_not.index, z = host_not.values, type = 'heatmap', colorscale = 'tealrose', name = ''),1,1)
fig.add_trace(go.Heatmap(x = host_not_profs.columns, y = host_not_profs.index, z = host_not_profs.values, type = 'heatmap', colorscale = 'tealrose', name = ''),1,2)
fig.add_trace(go.Heatmap(x = host_not.columns, y = host_not.index, z = host_not_profs.values - host_not.values, type = 'heatmap', colorscale = 'tealrose', name = ''),1,3)

fig.update_traces(hovertemplate = '<b>Region</b>: %{x}<br>'+
                                  '<b>Notebook</b>: %{y}<br>'+
                                  '<b>Percent</b>: %{z}%')

fig.update_layout(width = 900, height = 600)
fig.update_traces(showscale = False)
fig.show()

Colab notebooks and Kaggle notebooks are equally popular among all respondents. The usage rate is 23-50% across groups. <br>
Both professionals and non-professionals in the US use these notebooks less than other groups. <br>
In all regions except India, professionals use Colab notebooks more than non-professionals. <br>
In all regions except SE Asia, non-professionals use Kaggle notebooks more than professionals. <br>
Colab notebooks are especially popular among Indians and Iranians, both professionals and non-professionals, compared to other groups. <br>
Google Cloud notebooks are the third most popular after Colab and Kaggle notebooks, with a low usage rate of 7% among all respondents. <br>
The other notebooks don't show significant results. <br>

## 9. Computer Vision and NLP

In [None]:
computer_vision = ~all_resp #an array of False
NLP = ~all_resp

for col in range(102, 107):
    computer_vision = computer_vision | (~pd.isna(df.iloc[:,col]))
    
for col in range(109, 113):
    NLP = NLP| (~pd.isna(df.iloc[:,col]))
    

In [None]:
modes = {'Computer Vision': '#58ABAE','NLP':'#FAF4F0', 'Both':'#FEB8C5'}
title_format = "<span style='font-size:20px; font-family:Times New Roman'>%s</span>"
fig = subplots.make_subplots(rows = 1, cols = 1, 
                             subplot_titles = [title_format % "Computer Vision or NLP"],
                             vertical_spacing = 0.07)
for mode in modes.keys():
    y_value = []
    for country in countries.keys():
        if mode == 'Computer Vision':
            y_value.append((computer_vision & countries[country]).sum() /len(df[countries[country]])*100)
        elif mode == 'NLP':                                                         
            y_value.append((NLP & countries[country]).sum()/len(df[countries[country]])*100)
        else:
            y_value.append( (computer_vision & NLP & countries[country]).sum()/len(df[countries[country]])*100)
    y_value = np.round(y_value, 1)
    trace = go.Bar(
                    y = y_value,
                    x = list(countries.keys()),
                    name = mode,
                    marker = dict(color = modes[mode])
                    )
    fig.append_trace(trace,1,1)
    # Set the hover template after adding the trace so that the last trace gets it too
    fig.update_traces(hovertemplate = '<b>Region</b>: %{x}<br>'+
                                      '<b>Percent</b>: %{y}%<br>')

fig['layout'].update(height = 400, width = 900)
iplot(fig)

This plot shows what percentage of participants from a specific country or region regularly work with computer vision, NLP or both. <br>
About 23% of all respondents to the Kaggle survey work with computer vision, 14% with NLP and 10% with both. <br>
Both computer vision and NLP are used relatively more among SE Asians and Europeans than in the US, India and the overall community. <br>
It also seems that more than half of the people who work with NLP work with computer vision as well. This is interesting, since some techniques, such as transformers, that were initially developed for NLP are used in computer vision nowadays. <br>
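
The "more than half" statement follows directly from the overall shares quoted above. A short worked calculation, using the rounded percentages from the text (so the results are approximate):

```python
# Rounded overall shares quoted above (percent of all respondents)
cv, nlp, both = 23.0, 14.0, 10.0

# Share of NLP users who also work with computer vision
nlp_also_cv = round(both / nlp * 100, 1)

# Share of computer vision users who also work with NLP
cv_also_nlp = round(both / cv * 100, 1)

# Inclusion-exclusion: respondents working with at least one of the two
either = cv + nlp - both

print(nlp_also_cv, cv_also_nlp, either)
```

So roughly seven in ten NLP users also work with computer vision, while fewer than half of the computer vision users also work with NLP.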

## 10. Data Science Team Size

In [None]:
team_size = pd.DataFrame()
for country in countries.keys():
    team_size[country] = df[countries[country]]['Q22'].value_counts()#/len(df[countries[country]])*100
team_size = team_size.reindex(['0', '1-2', '3-4', '5-9','10-14', '15-19','20+'])
#team_size.loc['nan',:] = len(df) - team_size.sum(axis = 0)
team_size

In [None]:
fig = subplots.make_subplots(
    rows = 1, 
    cols = 1, 
    shared_yaxes = True, 
    shared_xaxes = False,
    subplot_titles = [title_format % "Data Science Team Size for Different Countries and Regions"],
    horizontal_spacing = 0.02, 
    vertical_spacing = 0.01
)

for country in list(countries.keys())[:-1]:
    team_size_prct_country = np.round(team_size[country]/len(df[countries[country]])*100,1)
    txt = list(zip(team_size[country], team_size_prct_country))
    txt2 = team_size_prct_country
    trace = go.Bar(
        y = team_size.index,
        x = team_size[country],
        name = country,
        marker = dict(color= colors[country]),
        orientation = "h",
        hovertext = txt2
        #hoverinfo = 'text',
    )
    fig.append_trace(trace,1,1)
    # Set the hover template after adding the trace so that the last trace gets it too
    fig.update_traces(hovertemplate = '<b>Count</b>: %{x}<br>'+
                                      '<b>Percent</b>: %{hovertext}%<br>')


layout = dict(barmode = 'stack')
fig.update_layout(layout)
fig['layout'].update(height = 400, width = 1000)
iplot(fig)

When you hover the mouse over the bars, the first number shows the total count of respondents in a specific country or region with that team size. The second number shows the percentage of respondents from that country or region. <br>
The largest group of respondents works in data science teams of 20 or more people. This is especially true for respondents from the US, who have the highest relative percentage. <br>
For teams with fewer than 20 members, the number of people working in a team decreases as team size increases. This pattern holds in all countries and regions in this report. <br>
It appears that data science teams mostly have either 20 or more people or 1-2 people. A team size of 1-2 people is more common among European respondents. <br>

## 11. Important Working Activity

In [None]:
y_values = []
for col in range(119, 127):
    temp = df[~pd.isna(df.iloc[:,col])].iloc[:,col]
    y_values.append(temp.unique()[0])
break_line = (lambda x: (x[:49]+'<br>'+x[49:100]+'<br>'+x[100:]) if (len(x) > 50) else (x[:50]+'<br>'+x[50:]))
y_values = [break_line(item) for item in y_values]

In [None]:
fig = subplots.make_subplots(
    rows = 1, 
    cols = 1, 
    subplot_titles = [title_format % "Major Working Activity Among Professionals"],
    horizontal_spacing = 0.02, 
    vertical_spacing = 0.01
)
for country in countries.keys():
    x_values = []
    counts = []
    for col in range(119, 127):
        count = df[countries[country] & profs].iloc[:,col].notnull().sum()
        x = count/len(df[countries[country] & profs])*100
        x = np.round(x,1)
        x_values.append(x)
        counts.append(count)
        
    trace = go.Bar(
                    y = y_values,
                    x = x_values,
                    name = country,
                    marker = dict(color= colors[country]),#"#a2885e"
                    orientation = "h",
                    hovertext = counts
                    )
    fig.append_trace(trace,1,1)
    # Set the hover template after adding the trace so that the last trace gets it too
    fig.update_traces(hovertemplate = '<b>Count</b>: %{hovertext}<br>'+
                                      '<b>Percent</b>: %{x}%<br>')

fig['layout'].update(height = 1000, width = 900)
iplot(fig)

The number on each bar shows the percentage of respondents in a given region who picked that activity as an important part of their role at work. <br>
Most respondents from all regions and countries state that analyzing and understanding data to influence product or business decisions is an important part of their role. <br>
Europe and the US follow similar patterns for major activities at work, while SE Asia and India resemble each other more closely. <br>
A higher percentage of respondents from the US and Europe than from SE Asia and India engage in machine learning related activities. For example, 58% and 51% of participants from the US and Europe, compared to 36% and 34% from SE Asia and India, report that analyzing and understanding data to influence product or business decisions is an important activity at work. <br>
Building prototypes and applying machine learning to new areas is the second most important activity in all countries and regions except Iran. <br>
For Iranian respondents, the activities are more research oriented than practical in industry. <br>
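
The percentages on the bars come from counting non-null cells in each multi-select column and dividing by the size of the group. Below is a minimal sketch of that calculation on toy data. The column names, the `region` labels and the `selection_percentages` helper are hypothetical and only illustrate the idea; they are not part of the survey file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey: each multi-select option is its own column,
# holding the option text when selected and NaN otherwise (assumed layout).
toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B"],
    "activity_analyze": ["Analyze", np.nan, "Analyze", "Analyze", np.nan],
    "activity_prototype": [np.nan, np.nan, "Prototype", "Prototype", "Prototype"],
})

def selection_percentages(frame, option_cols):
    """Percent of rows in `frame` with a non-null value in each option column."""
    counts = frame[option_cols].notnull().sum()
    return (counts / len(frame) * 100).round(1)

option_cols = ["activity_analyze", "activity_prototype"]

# One Series of percentages per region, indexed by option column.
pct_by_region = {
    region: selection_percentages(group, option_cols)
    for region, group in toy.groupby("region")
}
```

Because respondents can select several activities, the percentages within one region do not have to add up to 100.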

## 12. Machine Learning Usage at Work and Size of Company

In [None]:
ML_Usage_Company_size = pd.crosstab(df.Q21, df.Q23, margins = True, margins_name = "Total")
ML_Usage_Company_size = ML_Usage_Company_size.reindex(index = ['0-49 employees','50-249 employees', '250-999 employees', '1000-9,999 employees', '10,000 or more employees', 'Total'])
ML_Usage_Company_size

In [None]:
fig = go.Figure()
fig.update_layout(title = go.layout.Title(
                                    text = '<b>The Extent of Machine Learning Usage by Current Employer</b>', xref = "paper", x = 0.5))
txt = []
for country in countries.keys():
    
    ML_Usage_Company_size = pd.crosstab(df[countries[country]].Q21, df[countries[country]].Q23, margins = False, margins_name = "Total")
    ML_Usage_Company_size = ML_Usage_Company_size.reindex(index = ['0-49 employees','50-249 employees', '250-999 employees', '1000-9,999 employees', '10,000 or more employees'])
    marker_size_abs = (ML_Usage_Company_size.values).flatten() # absolute counts per cell
    marker_size_prc = np.round((ML_Usage_Company_size.values).flatten()/len(df[countries[country]])*100,1) # percent of all respondents in this country/region
    for count, perc in zip(marker_size_abs,marker_size_prc):
        txt.append('<b>Count:</b>{}<br> <b>Percent:</b> {}%'.format(count,perc))
    
    x,y = np.meshgrid(ML_Usage_Company_size.columns, ML_Usage_Company_size.index)
    trace = go.Scatter(
                        x = x.flatten(), 
                        y = y.flatten(),
                        marker_size = (ML_Usage_Company_size.values).flatten()/len(df[countries[country]])*50000,
                        text = txt,
                        hoverinfo = 'text',
                        name = country,
                        )
    txt = []
    fig.add_trace(trace)
    # Tune marker appearance and layout
    fig.update_traces(mode = 'markers', marker = dict(sizemode='area', line_width = 2))

fig['layout'].update(height = 800, width = 1000)
fig.show()


**How to read the plot** <br>
This plot compares machine learning usage across companies of different sizes. The size of each bubble corresponds to the percentage of respondents from that country or region who fall in that group. For example, for India, the bubble at the bottom left of the plot shows that 230 participants from India (3.1%) work for a company with 0-49 employees and do not know whether their employer uses ML. <br>
To get the most out of this plot, select the two countries/regions you would like to compare and deselect the others. Below I summarize some of the patterns I find interesting in this comparison. <br> <br>
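
As a rough illustration of how the bubble values are derived, the sketch below builds a cross-tabulation on toy data and divides each cell by the total number of respondents in the group. The column names and the shortened answer labels are hypothetical, not the survey's actual wording.

```python
import numpy as np
import pandas as pd

# Toy group of five respondents; the last one skipped both questions.
toy = pd.DataFrame({
    "company_size": ["0-49 employees", "0-49 employees",
                     "10,000 or more employees", "10,000 or more employees",
                     np.nan],
    "ml_usage": ["I do not know", "No ML",
                 "In production", "In production",
                 np.nan],
})

# Absolute counts per (company size, ML usage) cell; rows with NaN are dropped.
counts = pd.crosstab(toy["company_size"], toy["ml_usage"])

# Percent of ALL respondents in the group, including those who skipped.
percent = (counts / len(toy) * 100).round(1)
```

Since the denominator is every respondent in the group, the cells add up to less than 100% whenever some respondents skipped the questions (80% in this toy case).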

**Insights** <br>
Most respondents work for companies with fewer than 50 people. Respondents working for big corporations with more than 10,000 employees form the second largest group. <br>
Most respondents at companies with 0-49 employees report that their employer either does not use ML methods or only explores them without having models in production. Employees of big corporations, however, report well established ML methods (i.e. models in production for more than 2 years). <br>
In the US, a relatively larger share of respondents work for big corporations with 10k+ employees compared to the rest of the world. The plot indicates that more companies in the US, regardless of their number of employees, have well established ML models compared to the rest of the world. <br> 
In SE Asia, a relatively larger share of respondents work for companies with 1k-10k employees. Relatively more respondents also state that they use ML methods for generating insights but do not put working models in production. <br>
In Europe too, relatively more companies have well established ML methods with models in production compared to the rest of the world. <br>
In India, respondents deviate less from the rest of the world. They tend slightly more to work in bigger corporations and less in smaller companies with fewer than 50 employees. A lower percentage of respondents from India answered this part of the survey compared to the rest of the world. <br>
Most respondents from Iran work for companies with 0-49 employees. <br>



## 13. Cloud Computing Platforms and Storage Products

In [None]:
cloud = dict()
storage = dict()
for col in range(147,155):
    storage[questions.iloc[col][118:]] = ~pd.isna(df.iloc[:,col])

for col in range(129,141):
    cloud[questions.iloc[col][124:]] = ~pd.isna(df.iloc[:,col])
    

Cloud_Storage = pd.DataFrame()
for cloud_name, cloud_cond in list(cloud.items())[:6]:
    for storage_name, storage_cond in storage.items():
        Cloud_Storage.loc[cloud_name, storage_name] = len(df[cloud_cond & storage_cond])
        
Cloud_Storage    

In [None]:
fig = go.Figure()
fig.update_layout(title = go.layout.Title(text = '<b>Regularly Used Cloud Computing Platforms and Storage Products</b>', xref = "paper", x = 0.5))
Cloud_Storage = pd.DataFrame()
txt = []
for country in countries.keys():
    
    for cloud_name, cloud_cond in list(cloud.items())[:4]:
        for storage_name, storage_cond in storage.items():
            Cloud_Storage.loc[cloud_name, storage_name] = len(df[cloud_cond & storage_cond & countries[country] ])
    
    x,y = np.meshgrid(Cloud_Storage.columns, Cloud_Storage.index)
    marker_size_abs = (Cloud_Storage.values).flatten()
    marker_size_prc = np.round((Cloud_Storage.values).flatten()/len(df[countries[country]])*100,1)
    for count, perc in zip(marker_size_abs,marker_size_prc):
        txt.append('<b>Count: </b>{}<br> <b>Percent: </b>{}%'.format(count,perc))
    
    trace = go.Scatter(
                        x = x.flatten(), 
                        y = y.flatten(),
                        marker_size = (Cloud_Storage.values).flatten()/len(df[countries[country]])*50000,
                        text = txt,
                        hoverinfo = 'text',
                        name = country,
                        )
    txt = []
    fig.add_trace(trace)
    # Tune marker appearance and layout
    fig.update_traces(mode = 'markers', marker = dict(sizemode='area', line_width = 2))

fig['layout'].update(height = 600, width = 1000)
fig.show()

**How to read the plot** <br>
This plot shows the regularly used cloud computing platforms and storage products and how their usage differs across countries and regions. To keep the plot simple, I kept only the most popular products and removed the others from the matrix. This means that Oracle, SAP, Salesforce, VMware, Alibaba and Tencent Cloud were excluded. <br>
To learn how to read the plot, please see the notes in part 12. <br><br>
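
The co-usage matrix behind this plot counts respondents who selected both a given cloud platform and a given storage product. A minimal sketch on toy data follows, assuming (as in the survey file) that each option has its own column which is non-null when selected. The column names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy respondents: one column per option, NaN when the option was not selected.
toy = pd.DataFrame({
    "cloud_aws":   ["AWS", "AWS", np.nan, "AWS"],
    "cloud_gcp":   [np.nan, "GCP", "GCP", np.nan],
    "storage_s3":  ["S3", "S3", np.nan, np.nan],
    "storage_gcs": [np.nan, "GCS", "GCS", np.nan],
})

# Boolean masks: True where the respondent selected the product.
cloud = {"AWS": toy["cloud_aws"].notna(), "GCP": toy["cloud_gcp"].notna()}
storage = {"S3": toy["storage_s3"].notna(), "GCS": toy["storage_gcs"].notna()}

# Rows are cloud platforms, columns are storage products;
# each cell counts respondents who selected both.
co_usage = pd.DataFrame({
    s_name: {c_name: int((c_mask & s_mask).sum()) for c_name, c_mask in cloud.items()}
    for s_name, s_mask in storage.items()
})

# Share of all respondents in the group, in percent.
co_usage_pct = (co_usage / len(toy) * 100).round(1)
```

A respondent who selected several platforms and storage products is counted in every matching cell, so the cells overlap and their percentages are not expected to sum to 100.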

**Insights** <br>
Overall, the most common combination among participants is Amazon Web Services (AWS) together with Amazon Simple Storage Service (S3) (8.4%). <br>
The second most popular combination is Google Cloud Platform (GCP) together with Google Cloud Storage (GCS) (6.5%). <br>
The combinations (AWS + GCS) and (GCP + S3) are more popular than Microsoft products (3.6% vs 3% respectively). IBM Cloud is the least popular of the clouds (~2%). <br>
US participants use cloud services and storage products more than the average participant. In particular, they use the (AWS + S3) combo more than any other combination, compared to other countries and regions. <br>
Indian participants use cloud services and storage products a bit less than the average participant. However, their choice of products follows the common pattern for participants overall. <br> 
SE Asian respondents use GCP more than average and AWS and Microsoft Azure less than average. The (GCP + GCS) combination is more popular than the others in SE Asia.<br>
In Europe, Microsoft Azure is used more than in other places. However, the most popular combination of products is still (AWS + S3), used by 11.8% of respondents. <br>
Respondents from Iran use Google products significantly more than Amazon or Microsoft products. <br>

## Conclusion

In this notebook we learned how Kagglers from different regions and countries differ from each other in several respects. These include their demographics (age and gender), education, programming experience, job roles and machine learning experience. <br>
After characterizing these groups, we examined how their practices differ. This analysis gives a perspective on which programming languages, development environments and hosted notebooks are popular globally, in a specific region or in a specific country of interest. A comparison of the popularity of NLP versus computer vision is presented as well. <br>
We then tried to understand the current ecosystem of data science jobs in industry. This analysis gives insights into data science team size, company size and companies' approach to machine learning, as well as the most used cloud computing platforms and data storage products. <br>


Hope you find this analysis interesting! For me it's been very exciting to get to know the community I learn the most from and to which I try to contribute!
Continuously trying to improve, I truly appreciate your feedback on my analysis. :-)