In [None]:
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from statsmodels.stats.proportion import proportions_ztest
from plotly.subplots import make_subplots
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
from sklearn.ensemble import GradientBoostingClassifier
import matplotlib.pyplot as plt
import warnings

# Introduction

Indians form the biggest community on Kaggle. Approximately **29% of survey participants this year are from India, up from 18.5% in 2018**. 
Despite being such an important part of the community and winning multiple competitions on Kaggle, Indian Kagglers are unfortunately not doing well on one front: income.

As a part of the Indian data science community, I am keen to analyze and extract the key factors causing this income disparity and to come up with insights that can help reduce the gap.

So with that, let's talk Money!

![](https://imgur.com/5mIKFCz.png)

## Analysis approach

This analysis is divided into 4 main parts:

* **EDA around income:** How factors such as Age, Gender, Company size etc. affect Income among Kagglers
* **Driver analysis:** Using a supervised machine learning model, gauge the importance of various factors affecting Income across countries
* **Products of importance:** Based on the Driver analysis, a deep dive into specific products/components of the key drivers
* **Into the Future:** What Kagglers will do in the next 2 years

Results are presented as the proportion of people giving a particular answer to each question. I have used stacked column charts for single-choice questions and horizontal bar charts for multiple-choice questions.
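
As a rough illustration of how such proportions are derived, the sketch below turns respondent counts for one year into percentages and into the running totals that position each segment of a stacked column. The counts and the helper names `to_percentages` and `stack_bases` are made up for illustration; the notebook itself does this with pandas pivot tables.

```python
def to_percentages(counts):
    """Convert raw respondent counts per answer into percentages (1 decimal)."""
    total = sum(counts.values())
    return {answer: round(100 * n / total, 1) for answer, n in counts.items()}


def stack_bases(percentages):
    """Return the starting height of each segment in a stacked column."""
    bases, running = {}, 0.0
    for answer, pct in percentages.items():
        bases[answer] = round(running, 1)
        running += pct
    return bases


# Hypothetical counts for a single country and year (not survey data)
counts = {"0-10,000": 120, "10,000-20,000": 50, "20,000-30,000": 30}
percentages = to_percentages(counts)
bases = stack_bases(percentages)
print(percentages)
print(bases)
```

Each segment starts where the previous one ended, which is why the segments of a column add up to 100%.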

Data from the 3 previous surveys (2018-2020) is used alongside the 2021 survey. I have compiled this data manually since there were differences in question mappings over the years.

The good thing about having data for multiple years is that we can see trends in what has changed. To capture these trends, two-tailed z-tests are performed for each proportion to check the statistical significance of the changes observed.

### What is Z-test?

For those who are unfamiliar with this concept, I have provided a quick summary below:


Statistical significance testing is commonly used in market research to check whether a change observed between two samples in the proportion of successful events (successes over total events) reflects an underlying change in the population or is just sampling noise. For proportions, a Z-test is used, where the null hypothesis is that the two proportions are the same:

null hypothesis: p1-p2 = 0

where p1 and p2 are the proportions of successful events in two different samples (for this analysis, p1 and p2 come from the survey data of two different years)

To be able to reject the null hypothesis, the p-value of the z-test should be less than a threshold value. For this analysis, the threshold is set at 5%.

Z-score for two proportions z-test is computed as : 

<img src="https://imgur.com/doSqCZf.png" align="center"/>
<br><br>
So all we need are the proportions (p1,p2) and the total sample size (n1,n2) for each sample!
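
To make the computation concrete, here is a minimal sketch of a two-tailed, two-proportion z-test written with the standard library only. It assumes the pooled form of the standard error, and the counts are invented for illustration; the function name `two_proportion_ztest` is hypothetical and is not part of statsmodels.

```python
import math


def two_proportion_ztest(count1, n1, count2, n2):
    """Return (z, two-sided p-value) for H0: p1 - p2 = 0, using a pooled SE."""
    p1 = count1 / n1
    p2 = count2 / n2
    # Pooled proportion under the null hypothesis that both samples share one p
    p_pool = (count1 + count2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value of a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical example: 185 of 1000 respondents in one year vs 290 of 1000 in another
z, p_value = two_proportion_ztest(185, 1000, 290, 1000)
print(f"z = {z:.2f}, significant at 5%: {p_value < 0.05}")
```

When the p-value falls below the 5% threshold, the null hypothesis of equal proportions is rejected and the change is treated as significant.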

### How to read the charts?

Two types of charts are used: **Stacked column chart** for questions with single select option and **Horizontal bar charts** for questions with multiple select options. <mark>All charts are interactive plots and reveal additional information on hovering</mark>

**Stacked column chart:**

<div>
<img src="https://imgur.com/7heTnkH.png" align="left"/>
</div>
<br><br>
<div>
    


**Horizontal bar charts:**


<img src="https://imgur.com/1pKsupg.png" align="left"/>

Note: For this analysis, the Z-test is performed only between the 2021 data and each previous year's data, in all charts where year-on-year trends are shown
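
The comparison rule can be sketched as follows: each earlier year is tested against 2021, and a year is flagged only if the p-value is below 5% and the answer had more than a 1% share in that year (the same two conditions used in the `z_test` function of this notebook). The counts below are invented, the helper names are hypothetical, and returning a p-value of 1 when the standard error is zero is an assumption of this sketch.

```python
import math


def p_value_two_sided(count_a, n_a, count_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_pool = (count_a + count_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # assumption: no variation in either sample means no evidence of change
    z = (count_a / n_a - count_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def significant_years(counts, totals, latest="2021", alpha=0.05, min_share=0.01):
    """List the earlier years whose proportion differs significantly from the latest year."""
    flagged = []
    for year in counts:
        if year == latest:
            continue
        share = counts[year] / totals[year]
        p = p_value_two_sided(counts[year], totals[year], counts[latest], totals[latest])
        if p < alpha and share > min_share:
            flagged.append(year)
    return flagged


# Hypothetical counts of one answer option per survey year (not survey data)
counts = {"2018": 100, "2019": 118, "2020": 150, "2021": 160}
totals = {"2018": 1000, "2019": 1000, "2020": 1000, "2021": 1000}
flagged = significant_years(counts, totals)
print("Significant change with year(s):", " ".join(flagged))
```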

In [None]:
df = pd.read_csv('../input/kaggle-ml-survey-data-2018-to-2021/Kaggle ML Survey data 2018 to 2021.csv',low_memory=False)

# Assign the result back rather than calling replace(inplace=True) on a selected column,
# a pattern that recent pandas versions warn about
df["Q1"] = df["Q1"].replace(dict.fromkeys(['70-79','70+'], '>70'))

df["Q6"] = df["Q6"].replace(dict.fromkeys(['Less than  1 years','< 1 years'], '< 1 year'))
df["Q6"] = df["Q6"].replace(dict.fromkeys(['30-40 years','40+ years','20-30 years'], '20+ years'))
df["Q6"] = df["Q6"].replace({"1-2 years": "1-3 years"})
df["Q6"] = df["Q6"].replace(dict.fromkeys(['I have never written code but I want to learn','I have never written code and I do not want to learn'], 'I have never written code'))

df["Q2"] = df["Q2"].replace({"Male": "Man", "Female": "Woman"})

df['Income'] = df["Q25"].replace(dict.fromkeys(['$0-999','2,000-2,999','1,000-1,999','4,000-4,999','3,000-3,999','5,000-7,499','7,500-9,999'], '0-10,000'))
df['Income'] = df["Income"].replace(dict.fromkeys(['10,000-14,999','15,000-19,999','10-20,000'], '10,000-20,000'))
df['Income'] = df["Income"].replace(dict.fromkeys(['20,000-24,999','25,000-29,999','20-30,000'], '20,000-30,000'))
df['Income'] = df["Income"].replace(dict.fromkeys(['30,000-39,999','30-40,000'], '30,000-40,000'))
df['Income'] = df["Income"].replace(dict.fromkeys(['40,000-49,999','40-50,000'], '40,000-50,000'))
df['Income'] = df["Income"].replace(dict.fromkeys(['50-60,000','50,000-59,999','60-70,000','60,000-69,999'], '50,000-70,000'))
df['Income'] = df["Income"].replace(dict.fromkeys(['90-100,000', '70-80,000', '80-90,000','70,000-79,999','80,000-89,999','90,000-99,999'], '70,000-100,000'))
df['Income'] = df["Income"].replace(dict.fromkeys(['100-125,000', '125-150,000', '100,000-124,999','125,000-149,999'], '100,000-150,000'))
df['Income'] = df["Income"].replace(dict.fromkeys(['150-200,000', '200-250,000', '250-300,000', '150,000-199,999','200,000-249,999','250,000-299,999'], '150,000-300,000'))
df['Income'] = df["Income"].replace(dict.fromkeys(['300-400,000', '400-500,000', '300,000-499,999'], '300,000-500,000'))
df['Income'] = df["Income"].replace(dict.fromkeys(['> $500,000', '500,000+', '$500,000-999,999','More than $1,000,000'], '500,000+'))

Q7_cols = ['Q7_Part_1','Q7_Part_2','Q7_Part_3','Q7_Part_4','Q7_Part_5','Q7_Part_6','Q7_Part_7','Q7_Part_8','Q7_Part_9','Q7_Part_10','Q7_Part_11','Q7_Part_12','Q7_OTHER']
Q7_cols_desc = ['Python','R','SQL','C','C++','Java','JavaScript','Julia','Swift','Bash','MATLAB','None','Other']

Q18_cols = ['Q18_Part_1','Q18_Part_2','Q18_Part_3','Q18_Part_4','Q18_Part_5','Q18_Part_6','Q18_OTHER']
Q18_cols_desc = ['General purpose image/video','Image segmentation',
'Object detection',
'Image classification',
'Generative Networks',
'None',
'Other']

Q1_cols = ['18-21','22-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-69','>70']

colors = ['blueviolet','coral','darkgreen','mediumblue','orange','purple','yellow','springgreen','lightpink','lightgray', 'red', 'rosybrown', 'royalblue','magenta','maroon','mediumaquamarine','mediumblue','plum']
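
Because the answer labels differ from year to year, it can help to check that no raw label slips through the harmonization unmapped. The sketch below shows the idea in plain Python on a small, hypothetical subset of the income labels; `harmonize` and `unmapped` are illustrative helpers, not functions used elsewhere in this notebook. Like `Series.replace`, the lookup leaves labels without a mapping unchanged.

```python
# Hypothetical subset of the raw-to-band mapping, built with dict.fromkeys as above
band_map = {}
band_map.update(dict.fromkeys(["$0-999", "1,000-1,999", "7,500-9,999"], "0-10,000"))
band_map.update(dict.fromkeys(["10,000-14,999", "10-20,000"], "10,000-20,000"))

expected_bands = {"0-10,000", "10,000-20,000"}


def harmonize(labels, mapping):
    """Map raw labels to bands; labels without a mapping are left unchanged."""
    return [mapping.get(label, label) for label in labels]


def unmapped(labels, expected):
    """Return the harmonized labels that are not one of the expected bands."""
    return sorted(set(labels) - set(expected))


raw = ["$0-999", "10-20,000", "7,500-9,999", "15,000-19,999"]
clean = harmonize(raw, band_map)
leftover = unmapped(clean, expected_bands)
print(clean)
print(leftover)  # labels that still need a mapping
```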

In [None]:
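# Functions to return Z-test results:

def z_test(table1, columns):
    """Test every year's proportion against 2021 and build the hover text.

    table1 holds counts per column plus 'Year' and 'Total', with a default
    integer index and the 2021 row last.
    """
    table3 = table1.copy()
    # Object dtype lets the p-values be overwritten with text further down
    table3[columns] = table3[columns].astype(object)
    is_2021 = table1['Year'] == '2021'
    total_2021 = table1.loc[is_2021, 'Total'].iloc[0]
    for i in columns:
        # Plain scalars keep NumPy from building a ragged (scalar, Series) array
        count_2021 = table1.loc[is_2021, i].iloc[0]
        for j in range(len(table1)):
            stat, pval = proportions_ztest(np.array([table1.loc[j, i], count_2021]),
                                           np.array([table1.loc[j, 'Total'], total_2021]),
                                           value=None, alternative='two-sided', prop_var=False)
            table3.loc[j, i] = pval * 100  # p-value in percent
    for i in columns:
        text = 'Significant change with year(s): '
        year_count = 0
        for j in range(len(table3) - 1):  # the last row (2021) is the reference year
            # Flag a year if p < 5% and the answer had more than a 1% share in that year
            if (table3.loc[j, i] < 5) and (table1.loc[j, i] / table1.loc[j, 'Total'] > 0.01):
                text = text + " " + table3.loc[j, 'Year']
                year_count = year_count + 1
        if year_count > 0:
            table3.loc[table3['Year'] == '2021', i] = text
        else:
            table3.loc[table3['Year'] == '2021', i] = 'No significant change over years'
    # Only the 2021 bar carries hover text; blank out the other years
    table3.loc[table3['Year'] != '2021', columns] = ''
    return table3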

# Function for stacked column chart:

def stacked_column(table2,i,x_col,table3,table_tot,j,legend):
    return go.Bar(x = table2[x_col],
                     y = table2.loc[:,i],
                     orientation='v',
                     name = i,
                     text= table2.loc[:,i].where(table2.loc[:,i]>3,''),
                     textangle = 0,
                     customdata = table3.loc[:,i],
                     texttemplate='<b>%{text}</b>', 
                     hovertemplate = 'Year: <b>%{x}</b><br>Cumulative percentage users: <b>%{y}</b><b>%</b><br><b>%{customdata}</b>',
                     textposition='inside',
                     textfont_size=10,
                     cliponaxis = False,
                     marker = dict(color = colors[j]),
                     offsetgroup=0
                     ,base = table_tot,
                     showlegend=legend
                     )

# Function for horizontal bar chart:

def bar_chart(table2,i,x_col,table3,table_tot,j,legend):
    return go.Bar(x =table2.loc[:,i] ,
                     y =table2[x_col],
                     orientation='h',
                     name = i,
                     text= table2.loc[:,i].where(table2.loc[:,i]>3,'').astype(str),
                     textangle = 0,
                     customdata = table3.loc[:,i],
                     texttemplate='<b>%{text}</b>', 
                     hovertemplate = 'Year: <b>%{y}</b><br>Percentage users: <b>%{x}</b><b>%</b><br><b>%{customdata}</b>',
                     textposition='auto',
                     textfont_size=10,
                     cliponaxis = False,
                     marker = dict(color = colors[j]),
                     offsetgroup=j,
                     showlegend=legend
                     )

# Section-1: EDA around Income

## Overall Income distribution

Here we see the distribution of people across income bands over the years for India, USA & Brazil

In [None]:
# Distribution of Income across countries

warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 
    
Income_cols = ['0-10,000','10,000-20,000','20,000-30,000','30,000-40,000','40,000-50,000','50,000-70,000','70,000-100,000','100,000-150,000','150,000-300,000','300,000-500,000','500,000+']

table1 = df.pivot_table(values='User', index=['Year','Q3'], columns=['Income'], aggfunc=lambda x: len(x.unique()))

table1 = table1[Income_cols]

table1 = table1.fillna(0)

table2 = round(table1.div(table1.sum(axis=1), axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table1['Total'] = table1[Income_cols].sum(axis=1)


categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Income_cols)

    for i in Income_cols:
        fig.add_trace(stacked_column(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Income distribution across Countries</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=400,
                 width = 900,legend_traceorder = "reversed")

fig.show()

<div class="alert alert-success">There is a significant increase in the proportion of people earning under 10k USD per year in India after 2019</div>
<br>


This raises the question: which countries pay their data scientists the most, and which pay the least?

For the highest-paying countries, I have taken the four countries where the proportion of people with an income greater than 100k is highest in 2021.

For the lowest-paying countries, I have taken the four countries where the proportion of people with an income less than 10k is highest in 2021.

Note: Only countries where more than 50 respondents reported an income in 2021 are considered
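
The selection rule described above can be sketched in plain Python: drop countries with too few respondents, sort the rest by the share of interest, and keep the first few. The country names, shares and respondent counts are placeholders, and `top_countries` is a hypothetical helper; the defaults of four countries and a threshold of more than 50 respondents mirror the code in the next cell.

```python
def top_countries(shares, respondents, k=4, min_respondents=50):
    """Return the k countries with the highest share among those with enough respondents."""
    eligible = [c for c in shares if respondents[c] > min_respondents]
    return sorted(eligible, key=lambda c: shares[c], reverse=True)[:k]


# Placeholder data: share (%) of respondents earning above 100k, and respondents per country
shares = {"A": 40.0, "B": 35.5, "C": 52.0, "D": 10.0, "E": 28.0, "F": 31.0}
respondents = {"A": 300, "B": 120, "C": 20, "D": 900, "E": 75, "F": 51}

highest_paying = top_countries(shares, respondents)
print(highest_paying)
```

Country "C" has the highest share but is excluded because only 20 people responded, which is the reason for the minimum sample filter: shares based on very small samples are unreliable.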

In [None]:
# Income distribution for the highest and lowest paying countries in 2021

warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 
    
Income_cols = ['0-10,000','10,000-20,000','20,000-30,000','30,000-40,000','40,000-50,000','50,000-70,000','70,000-100,000','100,000-150,000','150,000-300,000','300,000-500,000','500,000+']

Above_100 = ['100,000-150,000','150,000-300,000','300,000-500,000','500,000+']

table1 = df.pivot_table(values='User', index=['Year','Q3'], columns=['Income'], aggfunc=lambda x: len(x.unique()))

table1 = table1[Income_cols]


table1 = table1.fillna(0)

table2 = round(table1.div(table1.sum(axis=1), axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table1['Total'] = table1[Income_cols].sum(axis=1)

table2['Above_100'] = table2[Above_100].sum(axis=1)
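# Keep 2021 rows with more than 50 respondents (table1 and table2 share the same row index)
Top_5 = table2[(table2.Year=='2021')&(table1.Total>50)].sort_values(by = ['Above_100'],ascending=False)
Top_5 = Top_5.iloc[0:4,:].Q3  # the first four countries, despite the variable name

Bottom_5 = table2[(table2.Year=='2021')&(table1.Total>50)].sort_values(by = ['0-10,000'],ascending=False)
Bottom_5 = Bottom_5.iloc[0:4,:].Q3  # the first four countries, despite the variable name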

categories = list(Top_5)

fig = go.Figure()

fig = make_subplots(rows=2, cols=len(categories),shared_yaxes=True,subplot_titles=categories + list(Bottom_5))



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Income_cols)

    for i in Income_cols:
        fig.add_trace(stacked_column(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    
categories = list(Bottom_5)


col=1
legend=False
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Income_cols)

    for i in Income_cols:
        fig.add_trace(stacked_column(table2_new,i,'Year',table3,table_tot,j,legend),row=2,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    
    
fig.update_layout(title ="<b>Highest and lowest paying countries in 2021 for Data Science</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=700,
                 width = 1000,legend_traceorder = "reversed")

fig.show()

**Switzerland** has come out on top, while **Bangladesh** has the highest proportion of people getting paid less than 10k per year

Let's look at some of the factors affecting Income distribution in India and abroad

## Age

In [None]:
# Distribution of Age across countries

warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 
    
    
table1 = df.pivot_table(values='User', index=['Year','Q3'], columns=['Q1'], aggfunc=lambda x: len(x.unique()))

table1 = table1[Q1_cols]

table1 = table1.fillna(0)

table2 = round(table1.div(table1.sum(axis=1), axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table1['Total'] = table1[Q1_cols].sum(axis=1)


categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q1_cols)

    for i in Q1_cols:
        fig.add_trace(stacked_column(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Age distribution across Countries</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=400,
                 width = 900,legend_traceorder = "reversed")

fig.show()

<div class="alert alert-success">Significant increase is seen in age group 18-21 in India after 2019</div>
<br>

People <mark>under 25 years of age form 60% of Kaggle users in India</mark>, while the figure is only 15-17% in the USA and Brazil

## Gender Distribution

In [None]:
# Distribution of Gender across countries

warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 

Q2_cols = ['Man', 'Woman', 'Nonbinary', 'Prefer not to say','Prefer to self-describe']
    
table1 = df.pivot_table(values='User', index=['Year','Q3'], columns=['Q2'], aggfunc=lambda x: len(x.unique()))

table1 = table1[Q2_cols]

table1 = table1.fillna(0)

table2 = round(table1.div(table1.sum(axis=1), axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table1['Total'] = table1[Q2_cols].sum(axis=1)


categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q2_cols)

    for i in Q2_cols:
        fig.add_trace(stacked_column(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Gender distribution</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=400,
                 width = 1000,legend_traceorder="reversed")

fig.show()

<div class="alert alert-success">Women's participation in data science in India has increased from 16% in 2019 to 22% in 2021, and the change is significant! There is no significant change from last year</div>
<br>

Not much has changed in USA & Brazil in last 3 years in terms of Gender distribution

## Income distribution across Age and Gender in India

In [None]:
# Distribution of income across demographics (age and gender) in India

warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 
    
Income_cols = ['0-10,000','10,000-20,000','20,000-30,000','30,000-40,000','40,000-50,000','50,000-70,000','70,000-100,000','100,000-150,000','150,000-300,000','300,000-500,000','500,000+']

df['Age_buckets'] = df['Q1'].replace(dict.fromkeys(['18-21','22-24'], '18-24'))
df['Age_buckets'] = df["Age_buckets"].replace(dict.fromkeys(['25-29','30-34'], '25-34'))
df['Age_buckets'] = df["Age_buckets"].replace(dict.fromkeys(['35-39','40-44','45-49'], '35-49'))
df['Age_buckets'] = df["Age_buckets"].replace(dict.fromkeys(['50-54','55-59','60-69','>70'], '>50'))

table1 = df[df.Q3=='India'].pivot_table(values='User', index=['Year','Age_buckets'], columns=['Income'], aggfunc=lambda x: len(x.unique()))

table1 = table1[Income_cols]

table1 = table1.fillna(0)

table2 = round(table1.div(table1.sum(axis=1), axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table1['Total'] = table1[Income_cols].sum(axis=1)


categories = ['18-24','25-34','35-49','>50']

fig = go.Figure()

fig = make_subplots(rows=2, cols=len(categories),shared_yaxes=True,subplot_titles=categories + ['','Woman','Man'])


col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Age_buckets==cat].reset_index(drop=True)
    table1_new = table1[table1.Age_buckets==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Income_cols)

    for i in Income_cols:
        fig.add_trace(stacked_column(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
        
# Plot for Gender
        
table1 = df[df.Q3=='India'].pivot_table(values='User', index=['Year','Q2'], columns=['Income'], aggfunc=lambda x: len(x.unique()))

table1 = table1[Income_cols]

table1 = table1.fillna(0)

table2 = round(table1.div(table1.sum(axis=1), axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table1['Total'] = table1[Income_cols].sum(axis=1)


categories = ['Woman','Man']    

col=1
legend=False
for cat in categories:
    table2_new = table2[table2.Q2==cat].reset_index(drop=True)
    table1_new = table1[table1.Q2==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Income_cols)

    for i in Income_cols:
        fig.add_trace(stacked_column(table2_new,i,'Year',table3,table_tot,j,legend),row=2,col=col+1)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False


fig.update_layout(title ="<b>India's income distribution across Age and Gender</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=800,
                 width = 1000,legend_traceorder="reversed",legend=dict(
    orientation="v",
    yanchor="bottom",
    y=0,
    xanchor="right",
    x=1
))

fig.show()

<div class="alert alert-success">The rise in proportion of people earning under 10k in India is more prominent among people above 35 years of age where the proportion has doubled in last 3 years</div>
<br>

People below 25 years of age are earning the same as in 2018

<div class="alert alert-success">A larger portion of Women earn less than 10k USD per year than Men. There is no significant change in this proportion over the last 2 years</div>
<br>

## Coding or Programming experience

In [None]:
# Distribution of Coding experience across countries

warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 
    
Q6_cols = ['< 1 year','1-3 years','3-5 years','5-10 years','10-20 years','20+ years','I have never written code']

table1 = df.pivot_table(values='User', index=['Year','Q3'], columns=['Q6'], aggfunc=lambda x: len(x.unique()))

table1 = table1[Q6_cols]

table1 = table1.fillna(0)

table2 = round(table1.div(table1.sum(axis=1), axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table1['Total'] = table1[Q6_cols].sum(axis=1)


categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q6_cols)

    for i in Q6_cols:
        fig.add_trace(stacked_column(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Coding or Programming Experience distribution</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=400,
                 width = 1000,legend_traceorder="reversed")

fig.show()

<div class="alert alert-success">The proportion of Kagglers with 1-3 years of programming experience is increasing every year. The change in 2021 is significant relative to 2018, 2019 and 2020</div>
<br>

People with more than 3 years of experience form a bigger part of the Kaggle community in Brazil in 2021 than in 2018-19

## India's income distribution by Coding experience

In [None]:
# India's income distribution by Coding experience

warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 
    
Income_cols = ['0-10,000','10,000-20,000','20,000-30,000','30,000-40,000','40,000-50,000','50,000-70,000','70,000-100,000','100,000-150,000','150,000-300,000','300,000-500,000','500,000+']

Q6_cols = ['< 1 year','1-3 years','3-5 years','5-10 years','10-20 years','20+ years','I have never written code']

table1 = df[df.Q3=='India'].pivot_table(values='User', index=['Year','Q6'], columns=['Income'], aggfunc=lambda x: len(x.unique()))

table1 = table1[Income_cols]

table1 = table1.fillna(0)

table2 = round(table1.div(table1.sum(axis=1), axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table1['Total'] = table1[Income_cols].sum(axis=1)


categories = Q6_cols

fig = go.Figure()

fig = make_subplots(rows=2, cols=4,shared_yaxes=True,subplot_titles=categories)

row=1
col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q6==cat].reset_index(drop=True)
    table1_new = table1[table1.Q6==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Income_cols)

    for i in Income_cols:
        fig.add_trace(stacked_column(table2_new,i,'Year',table3,table_tot,j,legend),row=row,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    if col>4:
        row=2
        col=1
    


fig.update_layout(title ="<b>India's income distribution by Coding experience</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=800,
                 width = 1000,legend_traceorder="reversed",legend=dict(
    orientation="v",
    yanchor="bottom",
    y=0,
    xanchor="right",
    x=1
))

fig.show()

<div class="alert alert-success">Income increases with coding experience as expected. Over the years there has not been a significant change in income for people with more than 10 years of coding experience</div>
<br>

## India's income distribution by Machine learning experience

Let's also see the distribution by ML experience

In [None]:
# India's income distribution by ML experience

warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 
    
Income_cols = ['0-10,000','10,000-20,000','20,000-30,000','30,000-40,000','40,000-50,000','50,000-70,000','70,000-100,000','100,000-150,000','150,000-300,000','300,000-500,000','500,000+']

# Assign the result back rather than calling replace(inplace=True) on a selected column
df["Q15"] = df["Q15"].replace(dict.fromkeys(['Under 1 year','< 1 years'], '< 1 year'))
df["Q15"] = df["Q15"].replace(dict.fromkeys(['3-4 years','4-5 years'], '3-5 years'))
df["Q15"] = df["Q15"].replace(dict.fromkeys(['10-15 years','10-20 years','20 or more years'], '10+ years'))
df["Q15"] = df["Q15"].replace(dict.fromkeys(['I do not use machine learning methods','I have never studied machine learning but plan to learn in the future','I have never studied machine learning and I do not plan to'], 'I do not use ML methods'))

Q15_cols = ['< 1 year','2-3 years','3-5 years','5-10 years','10+ years','I do not use ML methods']

table1 = df[df.Q3=='India'].pivot_table(values='User', index=['Year','Q15'], columns=['Income'], aggfunc=lambda x: len(x.unique()))

table1 = table1[Income_cols]

table1 = table1.fillna(0)

table2 = round(table1.div(table1.sum(axis=1), axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table1['Total'] = table1[Income_cols].sum(axis=1)


categories = Q15_cols

fig = go.Figure()

fig = make_subplots(rows=2, cols=4,shared_yaxes=True,subplot_titles=categories)

row=1
col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q15==cat].reset_index(drop=True)
    table1_new = table1[table1.Q15==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Income_cols)

    for i in Income_cols:
        fig.add_trace(stacked_column(table2_new,i,'Year',table3,table_tot,j,legend),row=row,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    if col>4:
        row=2
        col=1
    


fig.update_layout(title ="<b>India's income distribution by Machine learning experience</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=800,
                 width = 1000,legend_traceorder="reversed",legend=dict(
    orientation="v",
    yanchor="bottom",
    y=0,
    xanchor="right",
    x=1
))

fig.show()

## Size of the company

In [None]:
# Distribution of company size across countries

warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 
    
df["Q21"] = df["Q21"].replace(dict.fromkeys(['10,000 or more employees'], '> 10,000 employees'))

Q21_cols = ['0-49 employees','50-249 employees','250-999 employees','1000-9,999 employees','> 10,000 employees']
    
table1 = df.pivot_table(values='User', index=['Year','Q3'], columns=['Q21'], aggfunc=lambda x: len(x.unique()))

table1 = table1[Q21_cols]

table1 = table1.fillna(0)

table2 = round(table1.div(table1.sum(axis=1), axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table1['Total'] = table1[Q21_cols].sum(axis=1)


categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q21_cols)

    for i in Q21_cols:
        fig.add_trace(stacked_column(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Size of the company</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=400,
                 width = 900,legend_traceorder="reversed")

fig.show()

<div class="alert alert-success">An increase in the proportion of people working in smaller companies (under 250 employees) was seen during the COVID-19 period (2020).
<br>
The proportion is back to the pre-COVID level, with no significant change relative to 2019 for all 3 countries</div>
<br>

## Income distribution by Company size in India

In [None]:
# India's income distribution by Company size

warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 
    
Income_cols = ['0-10,000','10,000-20,000','20,000-30,000','30,000-40,000','40,000-50,000','50,000-70,000','70,000-100,000','100,000-150,000','150,000-300,000','300,000-500,000','500,000+']

Q21_cols = ['0-49 employees','50-249 employees','250-999 employees','1000-9,999 employees','> 10,000 employees']

table1 = df[df.Q3=='India'].pivot_table(values='User', index=['Year','Q21'], columns=['Income'], aggfunc=lambda x: len(x.unique()))

table1 = table1[Income_cols]

table1 = table1.fillna(0)

table2 = round(table1.div(table1.sum(axis=1), axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table1['Total'] = table1[Income_cols].sum(axis=1)


categories = Q21_cols

fig = go.Figure()

fig = make_subplots(rows=2, cols=3,shared_yaxes=True,subplot_titles=categories)

row=1
col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q21==cat].reset_index(drop=True)
    table1_new = table1[table1.Q21==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Income_cols)

    for i in Income_cols:
        fig.add_trace(stacked_column(table2_new,i,'Year',table3,table_tot,j,legend),row=row,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    if col>3:
        row=2
        col=1
    


fig.update_layout(title ="<b>India's income distribution by Company size</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=800,
                 width = 900,legend_traceorder="reversed",legend=dict(
    orientation="v",
    yanchor="bottom",
    y=0,
    xanchor="right",
    x=1
))

fig.show()

<div class="alert alert-success">While people working at companies with fewer than 50 employees are clearly paid less, the trend ceases to exist for companies with more than 1,000 employees</div>
<br>

# Section-2: Driver Analysis

Now, let's start the exciting part!

### What is Driver analysis?

Driver analysis is another tool used in market research to <mark>come up with the set of features that *drive* a particular behavior</mark> or outcome among the population. In other words, it measures how important each independent variable is in predicting the dependent variable. Each importance can be read, loosely, as the share of the variance in the data explained by that variable, and the importances of all variables sum to 1.

Here, the <mark>dependent variable taken is Income</mark>, and the independent variables include, but are not limited to, **Age, Gender, Highest Formal education, Programming language knowledge, Experience** etc.
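
To make the "importances sum to 1" property concrete, here is a minimal sketch on synthetic data. The data set, its size and the model settings are assumptions chosen only for illustration; they are not the survey data or the settings used later in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the survey: 6 features, 3 target classes
# (think of the classes as three hypothetical income brackets).
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_classes=3, random_state=0)

clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

# One non-negative importance per feature; together they sum to 1.
importances = clf.feature_importances_
print(importances.round(3), importances.sum())
```

Because the values are shares of a whole, they can be added up across related features, which is what the category grouping later in this section relies on.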


### Methodology used

Income is recorded in brackets, i.e. as a categorical variable, so this can be treated as a supervised classification problem with Income as the dependent variable.

A Gradient Boosting Classifier is fitted **for each Year and Country** separately, using 142 variables, and the features are grouped into **8 broad categories**.

After fitting the model on 9 different data sets (For 3 countries x 3 years), we get feature importance as output for each of the 142 variables used.

These variables are grouped in 8 categories:
* **Tools and Technology Knowledge:** Knowledge or regular use of various ML algorithms, Programming languages, Cloud computing platforms etc.
* **Industry/Company size:** The industry a person works in and the size of the company
* **Demographics:** Age, Gender
* **Job Profile:** Work activities
* **Online learning:** Online courses completed on various platforms
* **ML/Coding experience:** Experience in writing code or using ML methods
* **Formal Education:** Highest formal education
* **Other**
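
The grouping step can be sketched as follows. The importance values below are made up for illustration; only the column names and category labels follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical per-feature importances for one country and year.
# They sum to 1, as the output of a fitted model would.
importance = pd.DataFrame({
    "Features":   ["Q1", "Q2", "Q7_Part_1", "Q7_Part_2", "Q21"],
    "Categories": ["Demographics", "Demographics",
                   "Tools and technology knowledge",
                   "Tools and technology knowledge",
                   "Industry/Company size"],
    "2021":       [0.10, 0.05, 0.40, 0.25, 0.20],
})

# Category importance = sum of the importances of its member features.
by_category = importance.groupby("Categories")["2021"].sum()
print(by_category)
```

Since the feature importances sum to 1, the category totals do too, so they can be plotted directly as shares of a pie chart.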



In [None]:
# Driver analysis

result = {}
result1 = {}

train_columns = ['Q1','Q2','Q4','Q5','Q6','Q7_Part_1','Q7_Part_2','Q7_Part_3','Q7_Part_4','Q7_Part_5','Q7_Part_6','Q7_Part_7','Q7_Part_8','Q7_Part_9','Q7_Part_10','Q7_Part_11','Q7_Part_12','Q7_OTHER','Q9_Part_1','Q9_Part_2','Q9_Part_3','Q9_Part_4','Q9_Part_5','Q9_Part_6','Q9_Part_7','Q9_Part_8','Q9_Part_9','Q9_Part_10','Q9_Part_11','Q9_Part_12','Q9_OTHER','Q10_Part_1','Q10_Part_2','Q10_Part_3','Q10_Part_4','Q10_Part_5','Q10_Part_6','Q10_Part_7','Q10_Part_8','Q10_Part_9','Q10_Part_10','Q10_Part_11','Q10_Part_12','Q10_Part_13','Q10_Part_14','Q10_Part_15','Q10_Part_16','Q10_OTHER','Q11','Q14_Part_1','Q14_Part_2','Q14_Part_3','Q14_Part_4','Q14_Part_5','Q14_Part_6','Q14_Part_7','Q14_Part_8','Q14_Part_9','Q14_Part_10','Q14_Part_11','Q14_OTHER','Q15','Q16_Part_1','Q16_Part_2','Q16_Part_3','Q16_Part_4','Q16_Part_5','Q16_Part_6','Q16_Part_7','Q16_Part_8','Q16_Part_9','Q16_Part_10','Q16_Part_11','Q16_Part_12','Q16_Part_13','Q16_Part_14','Q16_Part_15','Q16_Part_16','Q16_Part_17','Q16_OTHER','Q17_Part_1','Q17_Part_2','Q17_Part_3','Q17_Part_4','Q17_Part_5','Q17_Part_6','Q17_Part_7','Q17_Part_8','Q17_Part_9','Q17_Part_10','Q17_Part_11','Q17_OTHER','Q18_Part_1','Q18_Part_2','Q18_Part_3','Q18_Part_4','Q18_Part_5','Q18_Part_6','Q18_OTHER','Q19_Part_1','Q19_Part_2','Q19_Part_3','Q19_Part_4','Q19_Part_5','Q19_OTHER','Q20','Q21','Q22','Q23','Q24_Part_1','Q24_Part_2','Q24_Part_3','Q24_Part_4','Q24_Part_5','Q24_Part_6','Q24_Part_7','Q24_OTHER','Q40_Part_1','Q40_Part_2','Q40_Part_3','Q40_Part_4','Q40_Part_5','Q40_Part_6','Q40_Part_7','Q40_Part_8','Q40_Part_9','Q40_Part_10','Q40_Part_11','Q40_OTHER','Q41','Q27_A_Part_1','Q27_A_Part_2','Q27_A_Part_3','Q27_A_Part_4','Q27_A_Part_5','Q27_A_Part_6','Q27_A_Part_7','Q27_A_Part_8','Q27_A_Part_9','Q27_A_Part_10','Q27_A_Part_11','Q27_A_OTHER']
# One category per entry of train_columns (142 entries, same order)
train_columns_cat = (
    ['Demographics'] * 2                                            # Q1, Q2
    + ['Formal Education', 'Job profile', 'ML/Coding experience']   # Q4, Q5, Q6
    + ['Tools and technology knowledge'] * 43                       # Q7, Q9, Q10
    + ['Other']                                                     # Q11
    + ['Tools and technology knowledge'] * 12                       # Q14
    + ['ML/Coding experience']                                      # Q15
    + ['Tools and technology knowledge'] * 43                       # Q16, Q17, Q18, Q19
    + ['Industry/Company size'] * 4                                 # Q20, Q21, Q22, Q23
    + ['Job profile'] * 8                                           # Q24
    + ['Online learning'] * 12                                      # Q40
    + ['Job profile']                                               # Q41
    + ['Tools and technology knowledge'] * 12                       # Q27_A
)
# One sub-category per entry of train_columns (142 entries, same order)
train_columns_subcat = (
    ['Age', 'Gender', 'Formal Education', 'Job Title', 'Coding Experience']  # Q1, Q2, Q4, Q5, Q6
    + ['Programming Language'] * 13                                 # Q7
    + ['IDEs'] * 13                                                 # Q9
    + ['Hosted Notebook'] * 17                                      # Q10
    + ['Other']                                                     # Q11
    + ['Visualization libraries'] * 12                              # Q14
    + ['ML Experience']                                             # Q15
    + ['ML Frameworks'] * 18                                        # Q16
    + ['ML Algorithms'] * 12                                        # Q17
    + ['Computer vision methods'] * 7                               # Q18
    + ['NLP methods'] * 6                                           # Q19
    + ['Industry', 'Company Size', 'Company Size', 'Industry']      # Q20, Q21, Q22, Q23
    + ['Job Activities'] * 8                                        # Q24
    + ['Online learning'] * 12                                      # Q40
    + ['Job Activities']                                            # Q41
    + ['Cloud computing Platforms'] * 12                            # Q27_A
)

countries = ['India','United States of America','Brazil']
Years = [2019,2020,2021]
Importance_features = ['Industry/Company size','Demographics','Job profile','ML/Coding experience','Formal Education','Tools and technology knowledge','Online learning','Other']


model = GradientBoostingClassifier(n_estimators=200,learning_rate=0.05,max_features=12,subsample=0.7)

for x in countries:
    
    
    df_importance = pd.DataFrame(columns = ['Features','Categories','Sub_categories','2019','2020','2021'])
    df_importance["Features"] = train_columns
    df_importance["Categories"] = train_columns_cat
    df_importance["Sub_categories"] = train_columns_subcat
    for y in Years:
        train = df[(df.Q3==x) & (df.Year ==y)][train_columns+['Q25']]
        train['Q25'] = train.Q25.replace('0',np.nan)
        train = train[train.Q25.notna()]

        for cols in train.columns:
            train[cols] = train[cols].replace(0.0,'0')
            train[cols] = label_encoder.fit_transform(train[cols])

        X = train[train_columns]
        Y = train.Q25
        model.fit(X, Y)

        importances = model.feature_importances_
        df_importance[str(y)] = importances
    result[x] = df_importance
    result1[x] = result[x].groupby(by='Categories').sum().reset_index()

## Key Income Drivers by Country

In [None]:
categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories,specs=[[{"type": "pie"}, {"type": "pie"},{"type": "pie"}]])

col=1
Plot_year = '2021'
for x in categories:   
    
    
    fig.add_trace(go.Pie(labels = result1[x]['Categories'],
                   values = result1[x][Plot_year],
                   hole = .5,
                 hoverinfo='label+percent', textinfo='percent'),row=1,col=col)
    
    col = col+1

fig.update_layout(title ="<b>Key Drivers in determining Income across countries</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.5,
    xanchor="right",
    x=1
))

fig.show()

<div class="alert alert-success">Industry and company size play a bigger role in determining income in India than in the other countries. Knowledge of various tools and technologies is the most important driver</div>
<br>

<mark>Level of formal Education has very little effect on compensation in all 3 countries</mark>

**Online learning plays a significant part in what data scientists are paid**

## Key income drivers for India YoY

In [None]:
categories = ['2019','2020','2021']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories,specs=[[{"type": "pie"}, {"type": "pie"},{"type": "pie"}]])

col=1
Country = 'India'
for x in categories:   
    
    
    fig.add_trace(go.Pie(labels = result1[Country]['Categories'],
                   values = result1[Country][x],
                   hole = .5,
                 hoverinfo='label+percent', textinfo='percent'),row=1,col=col)
    
    col = col+1

fig.update_layout(title ="<b>Income drivers in India by year</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.5,
    xanchor="right",
    x=1
))

fig.show()

Even though Tools and Technology knowledge is becoming less dominant, it is still the major factor that determines Income

## Tools and Technology Deep dive

In [None]:
categories = ['India','United States of America','Brazil']

result2={}
for x in categories:
    result2[x] = result[x][result[x].Categories == 'Tools and technology knowledge'].groupby(by='Sub_categories').sum().reset_index()

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories,specs=[[{"type": "pie"}, {"type": "pie"},{"type": "pie"}]])

col=1
Plot_year = '2021'
for x in categories:   
    
    
    fig.add_trace(go.Pie(labels = result2[x]['Sub_categories'],
                   values = result2[x][Plot_year],
                   hole = .5,
                 hoverinfo='label+percent', textinfo='percent'),row=1,col=col)
    
    col = col+1

fig.update_layout(title ="<b>Feature importance among Tools and Technologies</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.5,
    xanchor="right",
    x=1
))

fig.show()

<div class="alert alert-success">Importance of different Technology areas is evenly distributed in all 3 countries</div>
<br>

## Importance of Tools and Technology in India YoY

In [None]:
countries = ['India','United States of America','Brazil']
categories = ['2019','2020','2021']

result2={}
for x in countries:
    result2[x] = result[x][result[x].Categories == 'Tools and technology knowledge'].groupby(by='Sub_categories').sum().reset_index()

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories,specs=[[{"type": "pie"}, {"type": "pie"},{"type": "pie"}]])

col=1
Country = 'India'
for x in categories:   
    
    
    fig.add_trace(go.Pie(labels = result2[Country]['Sub_categories'],
                   values = result2[Country][x],
                   hole = .5,
                 hoverinfo='label+percent', textinfo='percent'),row=1,col=col)
    
    col = col+1

fig.update_layout(title ="<b>Important Drivers among Tools and Technologies in India</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.5,
    xanchor="right",
    x=1
))

fig.show()

<div class="alert alert-success">Importance of Computer vision methods, NLP methods, Visualization libraries and Cloud computing platforms is increasing every year</div>
<br>

Let's also check this against the USA to validate the claim

In [None]:
countries = ['India','United States of America','Brazil']
categories = ['2019','2020','2021']

result2={}
for x in countries:
    result2[x] = result[x][result[x].Categories == 'Tools and technology knowledge'].groupby(by='Sub_categories').sum().reset_index()

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories,specs=[[{"type": "pie"}, {"type": "pie"},{"type": "pie"}]])

col=1
Country = 'United States of America'
for x in categories:   
    
    
    fig.add_trace(go.Pie(labels = result2[Country]['Sub_categories'],
                   values = result2[Country][x],
                   hole = .5,
                 hoverinfo='label+percent', textinfo='percent'),row=1,col=col)
    
    col = col+1

fig.update_layout(title ="<b>Important Drivers among Tools and Technologies in USA</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.5,
    xanchor="right",
    x=1
))

fig.show()

**We see a similar trend here**

# Section-3:  Products of importance

In the previous section we saw that Tools and Technology (such as Computer vision methods, NLP methods, ML Algorithms), Job profile, Online learning etc. are key drivers of Income

In this section, we **take a deep dive at the product or component level for these categories** and look at their trends over the years

**Note: Similar to Section-1, feel free to use hovering feature on 2021's data to check statistical significance of change**
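
For reference, the significance shown on hover comes from two-tailed z-tests on proportions. The sketch below shows one common form of that test, a pooled two-proportion z-test. It is an assumption about the general technique, not a copy of this notebook's own `z_test` helper, and the counts are made up.

```python
import math

def two_proportion_ztest(count_a, n_a, count_b, n_b):
    """Pooled two-tailed z-test for the difference between two proportions."""
    p_a, p_b = count_a / n_a, count_b / n_b
    pooled = (count_a + count_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-tailed p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 300 of 1000 respondents chose an option in one year,
# 240 of 1000 in the comparison year.
z_stat, p_val = two_proportion_ztest(300, 1000, 240, 1000)
print(f"z = {z_stat:.2f}, p = {p_val:.4f}")
```

A change is usually called significant when the p-value falls below a chosen threshold such as 0.05.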

## Job description

In [None]:
#Job profile
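# SettingWithCopyWarning moved to pandas.errors in newer pandas releases
try:
    from pandas.errors import SettingWithCopyWarning
except ImportError:
    from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)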

Q24_cols = ['Q24_Part_1','Q24_Part_2','Q24_Part_3','Q24_Part_4','Q24_Part_5','Q24_Part_6','Q24_Part_7','Q24_OTHER']
Q24_cols_desc = ['Analyze and understand data to influence product or business decisions','Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data','Build prototypes to explore applying machine learning to new areas','Build and/or run a machine learning service that operationally improves my product or workflows','Experimentation and iteration to improve existing ML models','Do research that advances the state of the art of machine learning','None of these activities are an important part of my role at work','Other']
df_n = df[['User','Year','Q3']+Q24_cols]
df_n.rename(columns=dict(zip(Q24_cols, Q24_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q24_cols_desc]!='0').sum(axis=1)
df_n[Q24_cols_desc +['Total']] = df_n[Q24_cols_desc + ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year','Q3']).count()

table1 = table1.fillna(0)

table2 = round(table1[Q24_cols_desc].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)


categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q24_cols_desc)

    for i in Q24_cols_desc:
        fig.add_trace(bar_chart(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Job description</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=1000,
                 width = 1000,legend_traceorder="reversed",legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.3,
    xanchor="right",
    x=1
))

fig.show()

To simplify a little, let's look at the percentage of people who do just 1 task at their jobs and those who do more than 1
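
The bucketing idea can be sketched on a toy table first. The column names and answers here are hypothetical; as in the survey data used in this notebook, `'0'` marks an option that was not selected.

```python
import pandas as pd

# Hypothetical multi-select answers, one row per respondent.
answers = pd.DataFrame({
    "Task A": ["Analyze", "0", "Analyze", "Analyze"],
    "Task B": ["0", "Build", "Build", "Build"],
    "Task C": ["0", "0", "0", "Research"],
    "Task D": ["0", "0", "0", "Experiment"],
})

# Number of options each respondent selected.
n_selected = (answers != "0").sum(axis=1)

# Buckets: exactly 1, 2 to 3, and more than 3 selections.
bucket = pd.cut(n_selected, bins=[0, 1, 3, float("inf")],
                labels=["Single task", "2-3", ">3"])

# Percentage of respondents in each bucket.
share = (bucket.value_counts(normalize=True) * 100).round(1)
print(share)
```

The cell below builds the same three buckets with indicator columns instead of `pd.cut`, so that the counts can also be passed to the z-test.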

In [None]:
Q24_cols = ['Q24_Part_1','Q24_Part_2','Q24_Part_3','Q24_Part_4','Q24_Part_5','Q24_Part_6','Q24_Part_7','Q24_OTHER']
Q24_cols_desc = ['Analyze and understand data to influence product or business decisions','Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data','Build prototypes to explore applying machine learning to new areas','Build and/or run a machine learning service that operationally improves my product or workflows','Experimentation and iteration to improve existing ML models','Do research that advances the state of the art of machine learning','None of these activities are an important part of my role at work','Other']
Multi_cols = ['Single task','2-3','>3']

df_n = df[df.Year!=2018][['User','Year','Q3']+Q24_cols]
df_n.rename(columns=dict(zip(Q24_cols, Q24_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q24_cols_desc]!='0').sum(axis=1)
df_n['Single task'] = np.int64(df_n.Total==1)
df_n['2-3'] = np.int64((df_n.Total>1) & (df_n.Total<4))
df_n['>3'] = np.int64(df_n.Total>3)
df_n[Q24_cols_desc + Multi_cols + ['Total']] = df_n[Q24_cols_desc+Multi_cols + ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year','Q3']).count()

table1 = table1.fillna(0)
table2 = round(table1[Q24_cols_desc + Multi_cols].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table2['More_than_1'] = table2[['2-3','>3']].sum(axis=1)
Top = table2[(table2.Year=='2021')&(table1.Total>500)].sort_values(by = ['More_than_1'],ascending=False)
Top = Top.iloc[0,:].Q3

categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q24_cols_desc + Multi_cols)

    for i in Multi_cols:
        fig.add_trace(bar_chart(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Multiple tasks at Work</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=500,
                 width = 900,legend_traceorder="reversed")

fig.show()

<div class="alert alert-success">More than 50% of people in the USA and Brazil wear multiple hats at work, while the majority of Indians stick to a single task</div>
<br>


## Online Learning

In [None]:
#Online courses

Q40_cols = ['Q40_Part_1','Q40_Part_2','Q40_Part_3','Q40_Part_4','Q40_Part_5','Q40_Part_6','Q40_Part_7','Q40_Part_8','Q40_Part_9','Q40_Part_10','Q40_Part_11','Q40_OTHER']
Q40_cols_desc = ['Coursera','edX','Kaggle Learn Courses','DataCamp','Fast.ai','Udacity','Udemy','LinkedIn Learning','Cloud-certification programs','University Courses','None','Other']

df_n = df[df.Year!=2018][['User','Year','Q3']+Q40_cols]
df_n.rename(columns=dict(zip(Q40_cols, Q40_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q40_cols_desc]!='0').sum(axis=1)
df_n[Q40_cols_desc + ['Total']] = df_n[Q40_cols_desc + ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year','Q3']).count()

table1 = table1.fillna(0)

table2 = round(table1[Q40_cols_desc].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)


categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q40_cols_desc)

    for i in Q40_cols_desc:
        fig.add_trace(bar_chart(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Online learning platforms</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=1000,
                 width = 1000,legend_traceorder="reversed",legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.3,
    xanchor="right",
    x=1
))

fig.show()

There is a <mark>significant increase in the use of Kaggle Learn courses among online learning platforms</mark> in all 3 countries from 2019 to 2021

Udemy usage has increased significantly in India and the USA but is unchanged in Brazil

**Let's also see how many people are learning from more than 1 platform**

In [None]:
Q40_cols = ['Q40_Part_1','Q40_Part_2','Q40_Part_3','Q40_Part_4','Q40_Part_5','Q40_Part_6','Q40_Part_7','Q40_Part_8','Q40_Part_9','Q40_Part_10','Q40_Part_11','Q40_OTHER']
Q40_cols_desc = ['Coursera','edX','Kaggle Learn Courses','DataCamp','Fast.ai','Udacity','Udemy','LinkedIn Learning','Cloud-certification programs','University Courses','None','Other']
Multi_cols = ['Single platform','2-3','>3']

df_n = df[df.Year!=2018][['User','Year','Q3']+Q40_cols]
df_n.rename(columns=dict(zip(Q40_cols, Q40_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q40_cols_desc]!='0').sum(axis=1)
df_n['Single platform'] = np.int64(df_n.Total==1)
df_n['2-3'] = np.int64((df_n.Total>1) & (df_n.Total<4))
df_n['>3'] = np.int64(df_n.Total>3)
df_n[Q40_cols_desc + Multi_cols + ['Total']] = df_n[Q40_cols_desc+Multi_cols + ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year','Q3']).count()

table1 = table1.fillna(0)
table2 = round(table1[Q40_cols_desc + Multi_cols].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)

table2['More_than_1'] = table2[['2-3','>3']].sum(axis=1)
Top = table2[(table2.Year=='2021')&(table1.Total>500)].sort_values(by = ['More_than_1'],ascending=False)
Top = Top.iloc[0,:].Q3

categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q40_cols_desc + Multi_cols)

    for i in Multi_cols:
        fig.add_trace(bar_chart(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Multiple Online learning Platforms</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=500,
                 width = 900,legend_traceorder="reversed")

fig.show()

<div class="alert alert-success">Two-thirds of people use more than one platform for online learning in all 3 countries. This share has dropped w.r.t. 2019 but is consistent with last year</div>
<br>



## Tools and Technology - ML Algorithms

In [None]:
#Machine learning algorithms

Q17_cols = ['Q17_Part_1','Q17_Part_2','Q17_Part_3','Q17_Part_4','Q17_Part_5','Q17_Part_6','Q17_Part_7','Q17_Part_8','Q17_Part_9','Q17_Part_10','Q17_Part_11','Q17_OTHER']
Q17_cols_desc = ['Linear or Logistic Regression','Decision Trees or Random Forests','Gradient Boosting Machines (xgboost, lightgbm, etc)','Bayesian Approaches','Evolutionary Approaches','Dense Neural Networks (MLPs, etc)','Convolutional Neural Networks','Generative Adversarial Networks','Recurrent Neural Networks','Transformer Networks (BERT, gpt-3, etc)','None','Other']

df_n = df[df.Year!=2018][['User','Year','Q3']+Q17_cols]
df_n.rename(columns=dict(zip(Q17_cols, Q17_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q17_cols_desc]!='0').sum(axis=1)
df_n[Q17_cols_desc + ['Total']] = df_n[Q17_cols_desc+ ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year','Q3']).count()

table1 = table1.fillna(0)

table2 = round(table1[Q17_cols_desc].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)


categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q17_cols_desc)

    for i in Q17_cols_desc:
        fig.add_trace(bar_chart(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>ML Algorithm usage across countries</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=1000,
                 width = 1000)

fig.show()

## Tools and Technology - Computer vision methods

In [None]:

Q18_cols = ['Q18_Part_1','Q18_Part_2','Q18_Part_3','Q18_Part_4','Q18_Part_5','Q18_Part_6','Q18_OTHER']
Q18_cols_desc = ['General purpose image/video tools (PIL, cv2, skimage, etc)','Image segmentation methods (U-Net, Mask R-CNN, etc)','Object detection methods (YOLOv3, RetinaNet, etc)','Image classification and other general purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc)','Generative Networks (GAN, VAE, etc)','None','Other']
    
df_n = df[df.Year!=2018][['User','Year','Q3']+Q18_cols]
df_n.rename(columns=dict(zip(Q18_cols, Q18_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q18_cols_desc]!='0').sum(axis=1)
df_n[Q18_cols_desc + ['Total']] = df_n[Q18_cols_desc+ ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year','Q3']).count()

table1 = table1.fillna(0)

table2 = round(table1[Q18_cols_desc].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)


categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q18_cols_desc)

    for i in Q18_cols_desc:
        fig.add_trace(bar_chart(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Computer vision methods usage across countries</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=1000,
                 width = 1000,legend_traceorder="reversed",legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.3,
    xanchor="right",
    x=1
))

fig.show()

## Tools and Technology - NLP Methods

In [None]:
Q19_cols = ['Q19_Part_1','Q19_Part_2','Q19_Part_3','Q19_Part_4','Q19_Part_5','Q19_OTHER']
Q19_cols_desc = ['Word embeddings/vectors (GLoVe, fastText, word2vec)','Encoder-decoder models (seq2seq, vanilla transformers)','Contextualized embeddings (ELMo, CoVe)','Transformer language models (GPT-3, BERT, XLnet, etc)','None','Other']

df_n = df[df.Year!=2018][['User','Year','Q3']+Q19_cols]
df_n.rename(columns=dict(zip(Q19_cols, Q19_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q19_cols_desc]!='0').sum(axis=1)
df_n[Q19_cols_desc + ['Total']] = df_n[Q19_cols_desc+ ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year','Q3']).count()

table1 = table1.fillna(0)

table2 = round(table1[Q19_cols_desc].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)


categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q19_cols_desc)

    for i in Q19_cols_desc:
        fig.add_trace(bar_chart(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>NLP methods usage across countries</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=1000,
                 width = 1000,legend_traceorder="reversed",legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.3,
    xanchor="right",
    x=1
))

fig.show()

## Tools and Technology - Cloud computing Platforms

In [None]:
Q27_cols = ['Q27_A_Part_1','Q27_A_Part_2','Q27_A_Part_3','Q27_A_Part_4','Q27_A_Part_5','Q27_A_Part_6','Q27_A_Part_7','Q27_A_Part_8','Q27_A_Part_9','Q27_A_Part_10','Q27_A_Part_11','Q27_A_OTHER']
Q27_cols_desc = [' Amazon Web Services (AWS)',' Microsoft Azure ',' Google Cloud Platform (GCP) ',' IBM Cloud / Red Hat ',' Oracle Cloud ',' SAP Cloud ',' Salesforce Cloud ',' VMware Cloud ',' Alibaba Cloud ',' Tencent Cloud ','None','Other']
    
df_n = df[(df.Year!=2018) & (df.Year!=2019) ][['User','Year','Q3']+Q27_cols]
df_n.rename(columns=dict(zip(Q27_cols, Q27_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q27_cols_desc]!='0').sum(axis=1)
df_n[Q27_cols_desc + ['Total']] = df_n[Q27_cols_desc+ ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year','Q3']).count()

table1 = table1.fillna(0)

table2 = round(table1[Q27_cols_desc].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)


categories = ['India','United States of America','Brazil']

fig = go.Figure()

fig = make_subplots(rows=1, cols=len(categories),shared_yaxes=True,subplot_titles=categories)



col=1
legend=True
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q27_cols_desc)

    for i in Q27_cols_desc:
        fig.add_trace(bar_chart(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Cloud computing Platforms usage across countries (2021 vs 2020)</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=800,
                 width = 1000,legend_traceorder="reversed")

fig.show()

# Section-4: Into the Future

In this section, we look at which products Kagglers expect to use over the next 2 years, at both the India and Global level, based on responses to the 2021 survey

## Cloud computing Platforms

In [None]:
Q27B_cols = ['Q27_B_Part_1','Q27_B_Part_2','Q27_B_Part_3','Q27_B_Part_4','Q27_B_Part_5','Q27_B_Part_6','Q27_B_Part_7','Q27_B_Part_8','Q27_B_Part_9','Q27_B_Part_10','Q27_B_Part_11','Q27_B_OTHER']
# Option order kept consistent with Q27_cols_desc (Salesforce Cloud before VMware Cloud)
Q27B_cols_desc = [' Amazon Web Services (AWS)',' Microsoft Azure ',' Google Cloud Platform (GCP) ',' IBM Cloud / Red Hat ',' Oracle Cloud ',' SAP Cloud ',' Salesforce Cloud ',' VMware Cloud ',' Alibaba Cloud ',' Tencent Cloud ','None','Other']


df_n = df[df.Year==2021][['User','Year']+Q27B_cols]
df_n.rename(columns=dict(zip(Q27B_cols, Q27B_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q27B_cols_desc]!='0').sum(axis=1)
df_n[Q27B_cols_desc + ['Total']] = df_n[Q27B_cols_desc+ ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year']).count()

table1 = table1.fillna(0)

table2 = round(table1[Q27B_cols_desc].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)


categories = ['India']

fig = go.Figure()

fig = make_subplots(rows=1, cols=2,shared_yaxes=True,subplot_titles=['Global','India'])



table_tot = np.zeros(table2.iloc[:,1].shape)
j=0
table3= z_test(table1,Q27B_cols_desc)
for i in Q27B_cols_desc:
    fig.add_trace(bar_chart(table2,i,'Year',table3,table_tot,j,True),row=1,col=1)
    table_tot =table_tot+ table2.loc[:,i]
    j=j+1




df_n = df[df.Year==2021][['User','Year','Q3']+Q27B_cols]
df_n.rename(columns=dict(zip(Q27B_cols, Q27B_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q27B_cols_desc]!='0').sum(axis=1)
df_n[Q27B_cols_desc + ['Total']] = df_n[Q27B_cols_desc+ ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year','Q3']).count()

table1 = table1.fillna(0)

table2 = round(table1[Q27B_cols_desc].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)


categories = ['India']

col=2
legend=False
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q27B_cols_desc)

    for i in Q27B_cols_desc:
        fig.add_trace(bar_chart(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Future Cloud computing Platforms usage</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=500,
                 width = 1000,legend_traceorder="reversed",hovermode=False)

fig.show()

<div class="alert alert-success">Most Indian Kagglers plan to use AWS as their cloud computing platform in the next 2 years. This percentage is also higher than the overall Global percentage</div>

Also, <mark>fewer Indians plan to stay away from cloud computing platforms</mark> than their Global peers
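
One way to check whether a gap like this between two groups is statistically meaningful is a two-sided two-proportion z-test. The sketch below implements the pooled version using only the standard library. The function name and the counts are hypothetical and are not taken from the survey. The test also assumes independent samples, so in practice India would need to be compared against the rest of the world rather than against a Global total that already includes India.

```python
from math import erfc, sqrt

def two_proportion_ztest(count_a, n_a, count_b, n_b):
    """Two-sided z-test for equality of two independent proportions (pooled variance)."""
    p_a, p_b = count_a / n_a, count_b / n_b
    pooled = (count_a + count_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical counts, NOT survey results: 450 of 1000 vs 400 of 1000 respondents
z, p = two_proportion_ztest(450, 1000, 400, 1000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

The `proportions_ztest` function from `statsmodels`, imported at the top of this notebook, performs the same kind of pooled test by default when given two counts and two sample sizes.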

## Cloud computing products

In [None]:
Q31B_cols = ['Q31_B_Part_1','Q31_B_Part_2','Q31_B_Part_3','Q31_B_Part_4','Q31_B_Part_5','Q31_B_Part_6','Q31_B_Part_7','Q31_B_Part_8','Q31_B_Part_9','Q31_B_OTHER']
Q31B_cols_desc = ['Amazon SageMaker','Azure Machine Learning Studio','Google Cloud Vertex AI','DataRobot','Databricks','Dataiku','Alteryx','Rapidminer','None','Other']

df_n = df[df.Year==2021][['User','Year']+Q31B_cols]
df_n.rename(columns=dict(zip(Q31B_cols, Q31B_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q31B_cols_desc]!='0').sum(axis=1)
df_n[Q31B_cols_desc + ['Total']] = df_n[Q31B_cols_desc+ ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year']).count()

table1 = table1.fillna(0)

table2 = round(table1[Q31B_cols_desc].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)


categories = ['India']

fig = go.Figure()

fig = make_subplots(rows=1, cols=2,shared_yaxes=True,subplot_titles=['Global','India'])



table_tot = np.zeros(table2.iloc[:,1].shape)
j=0
table3= z_test(table1,Q31B_cols_desc)
for i in Q31B_cols_desc:
    fig.add_trace(bar_chart(table2,i,'Year',table3,table_tot,j,True),row=1,col=1)
    table_tot =table_tot+ table2.loc[:,i]
    j=j+1




df_n = df[df.Year==2021][['User','Year','Q3']+Q31B_cols]
df_n.rename(columns=dict(zip(Q31B_cols, Q31B_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q31B_cols_desc]!='0').sum(axis=1)
df_n[Q31B_cols_desc + ['Total']] = df_n[Q31B_cols_desc+ ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year','Q3']).count()

table1 = table1.fillna(0)

table2 = round(table1[Q31B_cols_desc].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)


categories = ['India']

col=2
legend=False
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q31B_cols_desc)

    for i in Q31B_cols_desc:
        fig.add_trace(bar_chart(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Future Cloud computing Products</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=500,
                 width = 1000,legend_traceorder="reversed",hovermode=False)

fig.show()

## Auto ML Products

In [None]:
Q37B_cols = ['Q37_B_Part_1','Q37_B_Part_2','Q37_B_Part_3','Q37_B_Part_4','Q37_B_Part_5','Q37_B_Part_6','Q37_B_Part_7','Q37_B_OTHER']
Q37B_cols_desc = ['Google Cloud AutoML','H2O Driverless AI','Databricks AutoML','DataRobot AutoML','Amazon Sagemaker Autopilot','  Azure Automated Machine Learning ','None','Other']

df_n = df[df.Year==2021][['User','Year']+Q37B_cols]
df_n.rename(columns=dict(zip(Q37B_cols, Q37B_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q37B_cols_desc]!='0').sum(axis=1)
df_n[Q37B_cols_desc + ['Total']] = df_n[Q37B_cols_desc+ ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year']).count()

table1 = table1.fillna(0)

table2 = round(table1[Q37B_cols_desc].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)


categories = ['India']

fig = go.Figure()

fig = make_subplots(rows=1, cols=2,shared_yaxes=True,subplot_titles=['Global','India'])



table_tot = np.zeros(table2.iloc[:,1].shape)
j=0
table3= z_test(table1,Q37B_cols_desc)
for i in Q37B_cols_desc:
    fig.add_trace(bar_chart(table2,i,'Year',table3,table_tot,j,True),row=1,col=1)
    table_tot =table_tot+ table2.loc[:,i]
    j=j+1




df_n = df[df.Year==2021][['User','Year','Q3']+Q37B_cols]
df_n.rename(columns=dict(zip(Q37B_cols, Q37B_cols_desc)), inplace=True)
df_n['Total'] = np.int64(df_n[Q37B_cols_desc]!='0').sum(axis=1)
df_n[Q37B_cols_desc + ['Total']] = df_n[Q37B_cols_desc+ ['Total']].replace(['0',0],np.nan)
table1 = df_n.groupby(by=['Year','Q3']).count()

table1 = table1.fillna(0)

table2 = round(table1[Q37B_cols_desc].div(table1.Total, axis=0)*100,1)

table2 = table2.reset_index()
table2['Year']=table2.Year.astype(str)

table1 = table1.reset_index()
table1['Year']=table1.Year.astype(str)


categories = ['India']

col=2
legend=False
for cat in categories:
    table2_new = table2[table2.Q3==cat].reset_index(drop=True)
    table1_new = table1[table1.Q3==cat].reset_index(drop=True)
    table_tot = np.zeros(table2_new.iloc[:,1].shape)
    j=0

    table3= z_test(table1_new,Q37B_cols_desc)

    for i in Q37B_cols_desc:
        fig.add_trace(bar_chart(table2_new,i,'Year',table3,table_tot,j,legend),row=1,col=col)
        table_tot =table_tot+ table2_new.loc[:,i]
        j=j+1
    col = col+1
    if col>1:
        legend=False
    

fig.update_layout(title ="<b>Future Auto ML Products</b>",
                  title_x = 0.5,
                  title_font = dict(size = 20, color = 'MidnightBlue'),
                 height=500,
                 width = 1000,legend_traceorder="reversed",hovermode=False)

fig.show()

# Conclusion

In this analysis, I have mainly focused on Indian data scientists and their pay gap with global peers. We have looked at how demographics, industry, company size, ongoing learning and data science knowledge affect what data scientists get paid in India. We have also looked, at the product and component level, at which preferences have persisted and what has changed over the years. The analysis was divided into 4 sections: EDA around Income, Driver analysis, Products of importance and Into the Future.

Key insights from each section are provided below:

1. EDA around Income: 
    * India is taking the lead in closing the gender gap in the data science community, but there's still a long way to go. **The pay gap between men and women continues to be a worry**
    * Income is positively correlated with age and with coding or ML experience. However, there seems to be an emerging trend of low-paying jobs (less than 10k USD) among employees above 35 years of age in the Indian data science industry
    * Employees working in very small companies (fewer than 50 employees) get paid the least. **Company size doesn't matter above 1000 employees**
    
2. Driver analysis:
    * India is on the right track in adopting the latest technologies, which seem to matter the most in the driver analysis
    * There is an emerging trend of cloud computing technologies, computer vision methods and NLP methods playing a more important role in deciding compensation than traditional notebook, IDE and programming language knowledge
    * **Level of formal education has little effect on compensation**, while online learning continues to play an important role
    
3. Products of importance:
    * **More people do ML-related research in India than in the USA & Brazil**. However, the majority of Indians are limited to a single area of work, while data scientists in the USA & Brazil usually wear multiple hats
    * Two-thirds of people have completed online courses from more than 1 platform in all 3 countries. **Participation in Kaggle Learn courses continues to rise significantly YoY**
    * While the use of traditional ML algorithms such as Linear or Logistic Regression is still dominant, Indians continue to lead in the use of CNNs at work
    
4. Into the Future:
    * AWS is the preferred cloud computing platform for Indian Kagglers for the next 2 years
    * Future adoption of cloud technologies is higher among Indians than among Global peers. This is in line with the rising importance of cloud computing in compensation
