# Introduction:

Welcome to my debut attempt at a Kaggle survey analysis.

Data Science (DS) is a fascinating field.  In this analysis, we look at whether interest in this field exists across the world regardless of a country's Human Development Index (HDI).  Let's take a deep dive and find out.

The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and having a decent standard of living. The HDI is the geometric mean of normalized indices for each of the three dimensions. (Reference: http://hdr.undp.org/en/content/human-development-index-hdi)

Human Development Index data is available at http://hdr.undp.org/en/data#.
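
As a minimal sketch of the geometric-mean definition above: the function name `hdi_from_indices` and the three input values are made up for illustration and are not taken from the UNDP data.

```python
def hdi_from_indices(health_index, education_index, income_index):
    """Geometric mean of the three normalized dimension indices (each in [0, 1])."""
    return (health_index * education_index * income_index) ** (1 / 3)

# Illustrative (made-up) index values, not real country data
example_hdi = hdi_from_indices(0.9, 0.8, 0.7)
print(round(example_hdi, 3))
```

Because it is a geometric mean, a low value in any one dimension pulls the overall index down more than it would in a simple average.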


HDI classifications are based on HDI fixed cutoff points, which are derived from the quartiles of distributions of the component indicators. The cutoff points are HDI of less than 0.550 for low human development, 0.550–0.699 for medium human development, 0.700–0.799 for high human development and 0.800 or greater for very high human development. (Reference: http://hdr.undp.org/en/content/human-development-indicators-and-indices-2018-statistical-update-readers-guide)

Using the above cutoff points, we can categorize the countries in this survey data as **Very High Human Developed (VHHD), High Human Developed (HHD), Medium Human Developed (MHD) and Low Human Developed (LHD).**  

I assume that the people who reside in a country in this survey are residents of that country, or at least represent the characteristics of that country's people in the DS field.

For this analysis, we will not consider the records that have the country name as Others.  As the actual country name is masked, we can't determine the HDI for those countries, so let us exclude those records. 
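
The cutoff points can be expressed as a small helper. This is only a sketch: the name `hdi_category` is hypothetical, and the add-on HDI file used later in this notebook appears to ship with its own `HDI_Category` column, so this function is not part of the original pipeline.

```python
def hdi_category(hdi_value):
    """Map an HDI value to the category abbreviation used in this analysis."""
    if hdi_value >= 0.800:
        return 'VHHD'   # very high human development
    if hdi_value >= 0.700:
        return 'HHD'    # high human development
    if hdi_value >= 0.550:
        return 'MHD'    # medium human development
    return 'LHD'        # low human development

# Made-up HDI values, chosen only to hit each band once
for value in (0.85, 0.75, 0.60, 0.40):
    print(value, hdi_category(value))
```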
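
One way the exclusion can work in practice is sketched below on toy data (the frames `responses` and `hdi_lookup` are made up). A left merge leaves the HDI category empty for any country name missing from the HDI table, and pandas `groupby` drops rows with a missing key by default, so those records fall out of every grouped percentage.

```python
import pandas as pd

# Toy survey responses; 'Other' stands in for a masked country name
responses = pd.DataFrame({'Q3': ['India', 'Other', 'Nigeria', 'Other']})

# Toy HDI lookup containing only the countries that can be categorized
hdi_lookup = pd.DataFrame({'Country': ['India', 'Nigeria'],
                           'HDI_Category': ['MHD', 'LHD']})

# Left merge keeps every response; unmatched countries get NaN in HDI_Category
merged = pd.merge(responses, hdi_lookup, how='left', left_on='Q3', right_on='Country')

# Explicit exclusion of the uncategorized records
categorized = merged.dropna(subset=['HDI_Category'])

# Implicit exclusion: groupby ignores rows whose key is NaN
counts = merged.groupby('HDI_Category')['Q3'].count()
print(counts)
```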


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #For plotting charts
import warnings 

warnings.filterwarnings("ignore", category=DeprecationWarning) #To ignore warning message
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


hdi=pd.read_csv("../input/kaggle-2019-survey-add-on-data/HDI.csv")
hdi.describe()

mcr=pd.read_csv("../input/kaggle-survey-2019/multiple_choice_responses.csv",low_memory=False)
mcr=mcr.drop([0])

questions=pd.read_csv("../input/kaggle-2019-survey-add-on-data/QuestionDetails.csv")

mcr['Q3'].unique()

#Align survey country names with the names used in the HDI data.
#Assign the result back instead of using inplace=True on a column selection,
#which may not update mcr in newer pandas versions.
mcr['Q3']=mcr['Q3'].replace({'United States of America':'United States',
                   'Russia':'Russian Federation', 
                   'South Korea':'Korea (Republic of)',
                   'United Kingdom of Great Britain and Northern Ireland':'United Kingdom',
                   'Czech Republic':'Czechia',
                   'Hong Kong (S.A.R.)':'Hong Kong, China (SAR)',
                   'Republic of Korea':'Korea (Republic of)',
                   'Iran, Islamic Republic of...':'Iran (Islamic Republic of)'})

In [None]:
mcr_hdi=pd.merge(mcr, hdi, how='left', left_on='Q3', right_on='Country')

This analysis presumes that readers are already familiar with the survey questions.  If you want to see the questions, kindly view them below.

Note: Readers who want to view the detailed data along with the graphs need to set display_output_df to True.

In [None]:
questions

In [None]:
#Set this to true to display the output of different methods that we would invoke for analysis
display_output_df=False
#Figure size
figsize=(10,5)

In [None]:
#This function plots a chart for the single-choice questions,
#which in turn helps to analyze the data.  It also groups the data
def SimpleChoicePlot(qn_col,qn_desc=None,inp_df=mcr_hdi,figsize=figsize):
    if qn_desc is None:
        qn_desc=questions[questions['QN_Number']==qn_col]['QN_Short_Desc'].values[0]
    temp_df=inp_df.groupby(['HDI_Category',qn_col])[qn_col].count().rename('Count')
    temp_df=((temp_df/temp_df.groupby(level=0).sum())*100).to_frame().unstack()
    #Flatten the Multi Index created
    temp_df.columns=temp_df.columns.to_flat_index()
    #Make column names without the 'Count' Name. Need to display only various choice values
    temp_df.columns=[item[1] for item in temp_df.columns]
    temp_df.plot(kind='bar',stacked=True,figsize=figsize).legend(bbox_to_anchor=(1,1))
    #plt.legend(loc='center right')
    plt.title(qn_desc)
    plt.show()
    
    #Change the index name as follows:
    temp_df.index.name='Values Given'
    if display_output_df:
        print("Answer values distribution is given below across all HDI Categories:")
        print(temp_df.T)
    return temp_df

In [None]:
#This function plots a chart for the multiple-choice questions,
#which in turn helps to analyze the data.  It also groups the data
def MultiChoicePlot(qn_col_abr,figsize=figsize):
    #Extract question column names and descriptions based on question column abbreviation
    qn_name_list=questions[questions['QN_Number'].str.contains(qn_col_abr)]['QN_Number'].values
    qn_desc_list=questions[questions['QN_Number'].str.contains(qn_col_abr)]['QN_Short_Desc'].values
    qn_choices=questions[questions['QN_Number'].str.contains(qn_col_abr)]['ShortenChoices'].values
    #Collect one frame per answer column and concatenate once
    #(DataFrame.append was removed in pandas 2.0)
    choice_frames=[]
    for itn in range(0,len(qn_name_list)):
        qn_col=qn_name_list[itn]
        qn_cur_choice=qn_choices[itn]
        #Work on a copy so that the assignment below does not write into mcr_hdi
        temp_df=mcr_hdi[['HDI_Category',qn_col]].copy()
        #Populate the shortened description
        temp_df.loc[~temp_df[qn_col].isnull(),qn_col]=qn_cur_choice
        temp_df.columns=['HDI_Category','Choice']
        choice_frames.append(temp_df)
    cur_qn_df=pd.concat(choice_frames)

    #Remove all nulls (that is, wherever people have not answered)
    cur_qn_df=cur_qn_df[~cur_qn_df['Choice'].isnull()]
    temp_df=cur_qn_df.groupby(['HDI_Category','Choice'])['Choice'].count().rename('Count')
    temp_df=((temp_df/temp_df.groupby(level=0).sum())*100).to_frame().unstack()
    #Flatten the Multi Index created
    temp_df.columns=temp_df.columns.to_flat_index()
    #Make column names without the 'Count' Name. Need to display only various choice values
    temp_df.columns=[item[1] for item in temp_df.columns]
    temp_df.plot(kind='bar',stacked=True,figsize=figsize).legend(bbox_to_anchor=(1,1))
    plt.title(qn_desc_list[0])
    plt.show()
    
    #Change the index name as follows:
    temp_df.index.name='Choices Given'
    if display_output_df:
        print("Answer choices distribution is given below across all HDI Categories:")
        print(temp_df.T)
    
    #List down how many choices are given by people. As its multi choice people might have chosen more than one
    qn_summary_cols=questions[(questions['QN_Number'].str.contains(qn_col_abr)) &
                           (questions['ShortenChoices']!='None')]['QN_Number'].values
    
    QSummaryColName=qn_col_abr+'_Summary'
    mcr_hdi[QSummaryColName]=len(qn_summary_cols)-mcr_hdi[qn_summary_cols].isnull().sum(axis=1)
    
    #Combined Question plot
    SimpleChoicePlot(QSummaryColName,qn_desc_list[0]+' - # Of Choices Chosen')
    return temp_df

In [None]:
#This function plots a chart for a single-choice question against another question,
#which in turn helps to analyze the data.  It also groups the data
def GroupSimpleChoicePlot(qn_col,qn_grp,qn_desc,grp_lvl=[0,1],inp_df=mcr_hdi):
    #If multiple qn_cols are passed then pass grp_lvl as [0,1,2] etc    
    group_cols=['HDI_Category']
    for gc in qn_col:
        group_cols.append(gc)
    group_cols.append(qn_grp)
       
    temp_df=inp_df.groupby(group_cols)[qn_grp].count().rename('Count')
    temp_df=((temp_df/temp_df.groupby(level=grp_lvl).sum())*100).to_frame().unstack()
    #Flatten the Multi Index created
    temp_df.columns=temp_df.columns.to_flat_index()
    #Make column names without the 'Count' Name. Need to display only various choice values
    temp_df.columns=[item[1] for item in temp_df.columns]
    temp_df.plot(kind='bar',stacked=True,figsize=figsize).legend(bbox_to_anchor=(1,1))
    #plt.legend(loc='center right')
    plt.title(qn_desc)
    plt.show()
    
    #Change the index name as follows:
    temp_df.index.name='Values Given'
    
    if display_output_df:
        print("Answer values distribution is given below across all HDI Categories:")
        print(temp_df)
    return temp_df

In [None]:
#This function plots a chart for a multiple-choice question against another question,
#which in turn helps to analyze the data.  It also groups the data
def GroupMultiChoicePlot(qn_col_abr,qn_grp,qn_addnl_desc,grp_lvl=[0,1],inp_df=mcr_hdi):
    #Extract question column names and descriptions based on question column abbreviation
    qn_name_list=questions[questions['QN_Number'].str.contains(qn_col_abr)]['QN_Number'].values
    qn_desc_list=questions[questions['QN_Number'].str.contains(qn_col_abr)]['QN_Short_Desc'].values
    qn_choices=questions[questions['QN_Number'].str.contains(qn_col_abr)]['ShortenChoices'].values

    all_grp_cols=['HDI_Category','Choice']+qn_grp
    #Collect one frame per answer column and concatenate once
    #(DataFrame.append was removed in pandas 2.0)
    choice_frames=[]
    for itn in range(0,len(qn_name_list)):
        qn_col=qn_name_list[itn]
        qn_cur_choice=qn_choices[itn]
        #Work on a copy so that the assignment below does not write into inp_df
        temp_df=inp_df[['HDI_Category',qn_col]+qn_grp].copy()
        #Populate the shortened description
        temp_df.loc[~temp_df[qn_col].isnull(),qn_col]=qn_cur_choice
        temp_df.columns=all_grp_cols
        choice_frames.append(temp_df)
    cur_qn_df=pd.concat(choice_frames)

    #Remove all nulls (that is, wherever people have not answered)
    cur_qn_df=cur_qn_df[~cur_qn_df['Choice'].isnull()]
    temp_df=cur_qn_df.groupby(all_grp_cols)['Choice'].count().rename('Count')
    temp_df_grp=((temp_df/temp_df.groupby(level=grp_lvl).sum())*100).to_frame().unstack()
    #Flatten the Multi Index created
    temp_df_grp.columns=temp_df_grp.columns.to_flat_index()
    #Make column names without the 'Count' Name. Need to display only various choice values
    temp_df_grp.columns=[item[1] for item in temp_df_grp.columns]
    temp_df_grp.plot(kind='bar',stacked=True,figsize=figsize).legend(bbox_to_anchor=(1,1))
    plt.title(qn_desc_list[0] + ' cum ' + qn_addnl_desc)
    plt.show()
    
    #Change the index name as follows:
    temp_df_grp.index.name='Choices Given'
    
    if display_output_df:
        print("Answer choices distribution is given below across all HDI Categories:")
        print(temp_df_grp)
    return temp_df,temp_df_grp

In [None]:
#This function summarizes the grouped output
#Eg:- Get ML Tools usage pattern (None vs ML Tools Usage) by Gender 
#choiceval - MLToolsUsed
#grp_col - grouping column; when omitted, the notebook-level variable col_name is used
def SummarizeResult(temp_grp_df,choiceval,agg_lvl=[0,1],grp_col=None):
    if grp_col is None:
        grp_col=col_name
    temp_grp_df1=temp_grp_df.reset_index()
    temp_grp_df1['ModifiedChoice']='No'+choiceval
    temp_grp_df1.loc[temp_grp_df1['Choice']!='None','ModifiedChoice']=choiceval
    temp_grp_df1=temp_grp_df1.groupby(['HDI_Category',grp_col,'ModifiedChoice'])['Count'].sum().rename('Aggregated')
    temp_grp_df1=((temp_grp_df1/temp_grp_df1.groupby(level=agg_lvl).sum())*100).unstack()
    print(temp_grp_df1)
    return temp_grp_df1

## Grouping the Kagglers who participated in the survey by Human Development Index

Let us group the Kagglers based on their countries' HDI and then take a deep dive into the data.

Note: People who participated in this survey are referred to as Kagglers in this analysis.

In [None]:
CategorizedCountriesCount=len(mcr_hdi)-mcr_hdi['HDI_Category'].isnull().sum()
((mcr_hdi['HDI_Category'].value_counts())/CategorizedCountriesCount).plot(kind='pie',figsize=figsize,autopct='%1.1f%%',title='% of respondents across all HDI categories').legend(bbox_to_anchor=(1,1))

It is nice to see that the Data Science (DS) career is attracting people across all Human Development Index categories. **Though the share of responses from the LHD category is low, its presence itself is heartening!!!  *Also, participation from MHD countries is higher than from HHD countries. It is nice to see this active participation from MHD countries as well.  Cheers to both MHD and LHD.***

Let us look at the number of countries falling under each of these categories.

In [None]:
temp=mcr_hdi[['HDI_Category','Country']].drop_duplicates().groupby(['HDI_Category'])['Country'].count().sort_values(ascending=False)
temp.plot(kind='barh',figsize=figsize).legend(bbox_to_anchor=(1,1))
temp

People from 35 VHHD countries, 11 HHD countries, 10 MHD countries and 1 LHD country have participated in this survey.  **DS is a niche area, so it is no surprise that most of the participating countries are VHHD, followed by the other categories.**

## Age of Kagglers across HDIs

In [None]:

temp_df=SimpleChoicePlot('Q1')


54% and 43% of participants from MHD and LHD countries are in the 18-24 age category, against 31% and 17% from HHD and VHHD. 

**It's astonishing to see that 54% and 43% of the Kagglers who participated from MHD and LHD are young.  It could be that the DS career has started booming in these countries and is attracting youngsters.  In both HHD and VHHD, participation by young Kagglers is comparatively low.  On the contrary, in the VHHD and HHD countries seniors (above 50 years) are active in DS activities.**

As these countries are developed, people there could already have been working in DS or similar types of jobs for quite some time.

Let us check the roles of young Kagglers across all HDIs.

In [None]:
#Lets take only 18-21 age range people and check their roles. 
mcr_hdi_filtered=mcr_hdi[mcr_hdi['Q1'].isin(['18-21'])]
temp_df=GroupSimpleChoicePlot(['Q1'],'Q5','18-21 Age cum Current Role  Vs HDI',inp_df=mcr_hdi_filtered)

As expected, more than 60% of this age group across all categories are students.

**Surprisingly, between 8% and 12% of them are working as Data Scientists.  Wow, smart guys!!!  Working as a Data Scientist at this young age.**

**Another spectacular finding is that 12% of these young Kagglers from LHD are working as Data Scientists, the highest share of Data Scientists in this age group across all categories.  Bravo!!!!**

**This shows the interest that people from this country have in this exciting DS career.**

Let us check what our senior Kagglers are doing.

In [None]:
#Let us take only 50+ age range people and check their roles
mcr_hdi_filtered=mcr_hdi[mcr_hdi['Q1'].isin(['50-54','55-59','60-69','70+'])]
temp_df=GroupSimpleChoicePlot(['Q5'],'Q1','50+ Age cum Current Role  Vs HDI',inp_df=mcr_hdi_filtered)

**Wow, it's nice to see that in VHHD, 70+ super seniors are working across all the different roles.  I believe these seniors are guiding the juniors and showing the path relentlessly even at this age.**

From HHD and MHD there are also a few people working at the age of 70+, but in LHD there are none above 60.  This country is slowly developing, and we can wish that its people reach this stage in the future.

**It's surprising to see that some of these super seniors have given their role as Student.  Presumably they have enrolled in some new courses and hence tagged themselves as students.  It's worth noting that their interest in continuous learning is not stopping.  Salute them!!!!**

## Let's explore the gender distribution across these HDIs:

In [None]:
#Take only Male and Female records alone
mcr_hdi_m_f_alone=mcr_hdi[mcr_hdi['Q2'].isin(['Male','Female'])]
mcr_hdi_m_f_alone['Q2'].value_counts()/len(mcr_hdi_m_f_alone)
temp_df=SimpleChoicePlot('Q2',inp_df=mcr_hdi_m_f_alone)

Overall, only 16.59% of participants are female (considering only the Male and Female choices). In both MHD and VHHD countries a slightly higher percentage of females have participated, while in both HHD and LHD countries it is slightly lower than 16.59%. 
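
The comparison against the overall female share can be sketched as follows. The data frame `toy` is made up and only shows the mechanics: the overall share is computed from the data rather than typed in as a constant, and each category is then compared against it.

```python
import pandas as pd

# Made-up responses; real values would come from the Q2 column of the survey
toy = pd.DataFrame({
    'HDI_Category': ['VHHD', 'VHHD', 'VHHD', 'VHHD', 'HHD', 'HHD', 'HHD', 'HHD'],
    'Q2': ['Male', 'Male', 'Female', 'Prefer not to say', 'Male', 'Male', 'Male', 'Male'],
})

# Keep only the Male and Female answers, as in the analysis above
toy = toy[toy['Q2'].isin(['Male', 'Female'])]

# Overall female share in percent
overall_female_pct = (toy['Q2'] == 'Female').mean() * 100

# Female share per HDI category in percent
female_pct_by_category = toy.groupby('HDI_Category')['Q2'].apply(
    lambda answers: (answers == 'Female').mean() * 100)

# Categories whose female share is above the overall share
above_overall = female_pct_by_category[female_pct_by_category > overall_female_pct]
print(above_overall)
```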

## Let us check country distribution across HDI

Let us also check the Male and Female participation from these countries.

In [None]:
temp_df=SimpleChoicePlot('Q3')
temp_df=GroupSimpleChoicePlot(['Country'],'Q2','Gender vs Country',inp_df=mcr_hdi_m_f_alone)
temp_df[temp_df['Female']>=16.59].sort_values(by='Female')

In [None]:
temp_df[temp_df['Female']>=16.59].groupby(level=0)['Female'].count()

- The United States accounts for 31.5% of participation from VHHD countries, topping the list in that category.
- Brazil accounts for 29% of participation from HHD countries, topping the list in that category.
- India accounts for 80.2% of participation from MHD countries, topping the list in that category.
- Obviously, Nigeria accounts for 100% of the LHD category.

- **Another refreshing fact is that female Kagglers from 26 countries (14 from VHHD and 6 each from HHD and MHD) are more active in the DS field than the rest of the female Kagglers (above-average female participation).**  
- **Tunisia tops this list, with 49% of its participants being female Kagglers.  Kudos to them!!!! Let them boost similar interest among female Kagglers across the world!!!!**


## Let us check the education level of the participants

In [None]:
temp_df=SimpleChoicePlot('Q4')

Kagglers from VHHD and HHD are highly educated.
- 50% and 46% from VHHD and HHD hold a Master's degree.
- 20% and 24% from VHHD and HHD hold a Doctoral degree.
- 48% and 51% from MHD and LHD hold a Bachelor's degree.

Let us look at the roles and salaries of the people who hold a Doctoral degree.

In [None]:
mcr_hdi.Q4.unique()
mcr_hdi_filtered=mcr_hdi[mcr_hdi['Q4'].isin(['Doctoral degree'])]
temp_df=GroupSimpleChoicePlot(['Q4'],'Q5','Doctoral degree cum Current Role  Vs HDI',inp_df=mcr_hdi_filtered)

temp_df=GroupSimpleChoicePlot(['Q4'],'Q10','Doctoral degree cum Salary Vs HDI',inp_df=mcr_hdi_filtered)

- As expected, the primary roles among Doctoral degree holders across countries are: 
    * Research Scientist (28 to 35%)
    * Data Scientist (18 to 31%)
    * Student (10 to 22%); they are probably doing their PhD now.

Let us look at their salary.

0.9% of Doctoral degree holders in VHHD earn >500K, against 0.4% and 0.8% in MHD and HHD and none in LHD.   

**It's stunning to see that even with such a high educational qualification, some Kagglers earn only between 0 and 999 USD across all HDIs (3%, 15%, 25%, 36%). That could make sense for Kagglers from MHD and LHD, but low-earning Kagglers with high education are present in VHHD and HHD as well.**

Note: Whenever percentages are listed in this analysis, they are always in the order VHHD, HHD, MHD and LHD unless otherwise specified.

Let's check their roles and org size.  Could they be students?
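
The percentages quoted throughout are shares within each HDI category. A minimal sketch of that computation on made-up data follows; it uses `transform('sum')` for the per-category totals (a swapped-in but equivalent way of writing the division used in the plotting helpers) and reorders the rows into the VHHD, HHD, MHD, LHD order used in the text.

```python
import pandas as pd

# Made-up answers; real values would come from a survey column such as Q10
toy = pd.DataFrame({
    'HDI_Category': ['LHD', 'VHHD', 'VHHD', 'MHD', 'HHD', 'HHD', 'MHD', 'VHHD'],
    'Q10': ['$0-999', '$0-999', '100,000-124,999', '$0-999',
            '$0-999', '100,000-124,999', '$0-999', '100,000-124,999'],
})

# Count answers per (category, answer) pair
counts = toy.groupby(['HDI_Category', 'Q10'])['Q10'].count().rename('Count')

# Divide each count by the total of its HDI category and convert to percent
pct = counts / counts.groupby(level=0).transform('sum') * 100

# One row per category, one column per answer; missing combinations become 0
pct = pct.unstack(fill_value=0)

# Put the rows in the order used when percentages are listed in the text
pct = pct.reindex(['VHHD', 'HHD', 'MHD', 'LHD'])
print(pct)
```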

In [None]:
mcr_hdi_filtered=mcr_hdi_filtered[mcr_hdi_filtered['Q10']=='$0-999']
temp_df=GroupSimpleChoicePlot(['Q4'],'Q5','Doctoral degree with 0-999 USD salary cum Role Vs HDI',inp_df=mcr_hdi_filtered)
temp_df=GroupSimpleChoicePlot(['Q4'],'Q6','Doctoral degree with 0-999 USD salary cum Org Size Vs HDI',inp_df=mcr_hdi_filtered)

**Oh no!!! None of them are students.  Unbelievable!!!**

Let us look at their org size.

**Another shock. They work across organizations of all sizes, and in the VHHD and HHD categories primarily (>50%) in start-ups / small orgs (0-49 employees).**

There is a general opinion that people from VHHD are always high earners and that people from MHD and LHD earn less, but we can observe that this is not always the case.

The primary education level of VHHD and HHD Kagglers is a Master's degree.  Now let us look at their salary and org size.

In [None]:
#Let us look at the roles and salaries of the Master’s degree holders.
mcr_hdi_filtered=mcr_hdi[mcr_hdi['Q4'].isin(['Master’s degree'])]
temp_df=GroupSimpleChoicePlot(['Q4'],'Q5','Master’s degree cum Current Role  Vs HDI',inp_df=mcr_hdi_filtered)
#This plot shows salary (Q10) for all Master's degree holders, so title it accordingly
temp_df=GroupSimpleChoicePlot(['Q4'],'Q10','Master\'s degree cum Salary Vs HDI',inp_df=mcr_hdi_filtered)

The primary role of Master's degree holders is Data Scientist (18 to 28%).
14 to 23% across all HDI categories are students.

**For 4.6 to 30.8% across all HDI categories, compensation is 0-999 USD. Let's check whether they are students and also check their org size.**

In [None]:
mcr_hdi_filtered=mcr_hdi_filtered[mcr_hdi_filtered['Q10']=='$0-999']
temp_df=GroupSimpleChoicePlot(['Q4'],'Q5','Master’s degree with 0-999 USD salary cum Role Vs HDI',inp_df=mcr_hdi_filtered)
temp_df=GroupSimpleChoicePlot(['Q4'],'Q6','Master\'s  degree with 0-999 USD salary cum Org Size Vs HDI',inp_df=mcr_hdi_filtered)

#Let us compare education of Female vs Male
temp_df=GroupSimpleChoicePlot(['Q2'],'Q4','Gender cum Education Vs HDI',inp_df=mcr_hdi_m_f_alone)


**Stunning!!!! Here too, none of them are students, and they perform different roles.**


Let us look at their org size.

**Another surprise. They work across organizations of all sizes, primarily (>50%) in start-ups / small orgs (0-49 employees) across all categories.**

Let us compare education between both genders:
- **It is very interesting to note that, across all categories, the share of Doctoral degree holders is higher among females than among males.**
- **The difference is almost double in LHD (8% of female vs 4% of male Kagglers hold PhDs).  Fabulous performance!!!!**

## Roles of Kagglers across HDIs:

In [None]:
temp_df=SimpleChoicePlot('Q5')

**Top 3 roles across all HDIs:**
- VHHD: Data Scientist, Student and S/W Engineer
- HHD & MHD: Student, Data Scientist and S/W Engineer
- LHD: Student, Data Scientist and Not Employed

**It is sad to see that unemployment is a bit higher in LHD (12.8%) than in all other categories.  But it's inspiring to see that these Kagglers are upskilling themselves in this niche DS area.  All the best to all of them in making their careers flourish in this blooming space.**

**As expected, VHHD is leading in this niche area, as a quarter of its participants are already working as Data Scientists.  Both HHD and MHD are following.**

## Let us check Kagglers' organization size:

In [None]:
temp_df=SimpleChoicePlot('Q6')

**Across all categories, people working in start-ups or small organizations have actively participated in this survey.  It's thought-provoking to see that start-ups are leading the way in this niche area!!!!  They could also be giving heavy competition to established organizations.**

**It is fascinating to see that 55.7% of participants from LHD are from small organizations (0-49 employees) and are interested in the DS field.**

In VHHD and MHD, the second-highest participation is from organizations with >10K employees. That shows that the DS field is also attracting people who work in established organizations.


## Data science workload level & ML usage at Kagglers' place of business

In [None]:
temp_df=SimpleChoicePlot('Q7')
temp_df=SimpleChoicePlot('Q8')

It's surprising to see that in MHD and VHHD countries almost 25% of respondents report data science workloads at the higher end (i.e. 20+ people responsible for DS workloads).  **It shows that even in MHD lots of organizations have started working in the ML area, and that MHD outperforms HHD here. Good news for DS aspirants from MHD.**  

In HHD and LHD, data science workloads are yet to pick up compared to the other two categories, because primarily only 1 or 2 members work in the DS space in their organizations.

The DS maturity level and actual usage of DS in VHHD organizations is impressive: 35% of these participants mentioned that their organizations have well-established ML methods and use ML for generating insights.  This is to be expected, because 25% of VHHD participants are Data Scientists.  

**16% of orgs from MHD have well-established ML methods, which is higher than in HHD (14%).  This is also expected, because in MHD 25% of respondents mentioned that their org has a 20+ data science workload.**

**It is thought-provoking to see that close to 23.2%, 29.2%, 31.8% and 51.9% of participants from VHHD, MHD, HHD and LHD countries are not using ML models or don't know whether they are used.  Even though their organizations are not using ML, these participants have upskilled themselves and actively participated in the Kaggle survey. Bravo!!!**


## Let us do the Compensation Analysis across HDI categories:

Let us do compensation analysis across HDIs, and also look at it by gender.

In [None]:
temp_df=SimpleChoicePlot('Q10')
#Let us compare salary of Female vs Male
temp_df=GroupSimpleChoicePlot(['Q2'],'Q10','Gender cum Salary Vs HDI',inp_df=mcr_hdi_m_f_alone)
temp_df1=temp_df.T.copy().reset_index()
temp_df1['SalRange']='<100K'
temp_df1.loc[temp_df1['index'].isin(['> $500,000','100,000-124,999',
             '125,000-149,999','150,000-199,999','200,000-249,999',
             '250,000-299,999','300,000-500,000']),'SalRange']='>=100K'

temp_df1.groupby('SalRange').sum()

The largest share of participants from HHD, MHD and LHD (14.4%, 22.2% and 41.4%) earn only 0 to 999 USD annually.  It is surprising to see this in HHD as well. The salaries of VHHD participants seem to be more or less uniformly distributed across all salary ranges, except for a few ranges.

In general, females on average earn less than males. **But it is amazing to note that 3% of females from LHD earn >=100K USD, while only 1% of males from this HDI category do.**
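
The <100K versus >=100K split can also be sketched directly on respondent rows. This is an alternative illustration on made-up data, not the exact steps of the cell above: the helper `salary_bucket` and the frame `toy` are hypothetical, while the salary labels are the ones used in the `isin` list in this notebook.

```python
import pandas as pd

# Salary labels treated as >=100K, copied from the list used in this notebook
HIGH_SALARY_RANGES = {'100,000-124,999', '125,000-149,999', '150,000-199,999',
                      '200,000-249,999', '250,000-299,999', '300,000-500,000',
                      '> $500,000'}

def salary_bucket(q10_value):
    """Collapse a Q10 salary label into one of two buckets."""
    return '>=100K' if q10_value in HIGH_SALARY_RANGES else '<100K'

# Made-up respondents
toy = pd.DataFrame({
    'Q2': ['Female', 'Female', 'Male', 'Male', 'Male', 'Male'],
    'Q10': ['$0-999', '100,000-124,999', '$0-999', '1,000-1,999',
            '> $500,000', '50,000-59,999'],
})
toy['SalRange'] = toy['Q10'].map(salary_bucket)

# Share of each bucket within each gender, in percent
bucket_pct = pd.crosstab(toy['Q2'], toy['SalRange'], normalize='index') * 100
print(bucket_pct)
```

Bucketing the rows before computing percentages avoids summing percentage columns afterwards, which keeps each gender's shares adding up to 100.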

## Let us check the money spent by Kagglers on ML at work in the past 5 years as well as the primary tool used to analyze data

In [None]:
temp_df=SimpleChoicePlot('Q11')
temp_df=SimpleChoicePlot('Q14')

**It's stunning to see that 30%+ of participants across all categories have spent zero USD on ML / cloud computing products at their work in the past 5 years.  Still, they have upskilled in this space (probably out of personal interest) and are actively participating in Kaggle surveys.  This shows that interest in ML is increasing regardless of the HDI factor. This is being shown again and again.**

Close to 11% from VHHD are spending >100K USD on ML / cloud computing products at their work.  It again shows their maturity in this space.

**It is marvelous to see that even LHD has 2% of participants spending >100K USD on ML / cloud computing products at their work.  Here too, MHD has slightly outperformed HHD.**

43+% of participants across all categories are using local development environments like JupyterLab and RStudio. That shows the popularity of these tools. 

**It's interesting to see that MS Excel and Google Sheets are still used by around 15% to 30% of participants across all categories. That shows the power of these simple and regularly used tools. It could be mainly because of their good analytical capabilities.**

BI software, advanced statistical software and cloud-based software like AWS are yet to pick up in this space, as their usage is between 3% and 9% across all HDI categories.

Only VHHD reaches 9% usage of cloud-based data software.  As it is a mature market, we can expect that.

Let us take only the 'Basic statistical software - MS Excel' usage data and check the roles of the people who are using it.

In [None]:
mcr_hdi.Q14.unique()
mcr_hdi_filtered=mcr_hdi[mcr_hdi['Q14'].isin(['Basic statistical software (Microsoft Excel, Google Sheets, etc.)'])]

temp_df=GroupSimpleChoicePlot(['Q14'],'Q5','Basic statistical software  cum Current Role Vs HDI',inp_df=mcr_hdi_filtered)
#Filter only statisticians
mcr_hdi.Q5.unique()
mcr_hdi_filtered=mcr_hdi[mcr_hdi['Q5'].isin(['Statistician'])]
temp_df=GroupSimpleChoicePlot(['Q5'],'Q14','Basic statistical software  cum Statistician Vs HDI',inp_df=mcr_hdi_filtered)

MS Excel and Google Sheets are used primarily by students.  Indeed, why would they spend money on other tools when these easily available tools provide data analysis capabilities?

**An interesting fact is that 6% to 9% of the people who use Excel are Data Scientists and 7% to 11% are Data Analysts.  So it shows that for data analysis we don't always need to depend on advanced tools alone; choose the right tool for the right purpose.**

Statisticians seem to use Excel less for data analysis.  Obviously, statisticians would be using more advanced software.  Let's see what software is popular among them.  Local development environments are popular among statisticians.  Nice, they rely on their own coding skills.  

Advanced statistical software seems to be popular among statisticians from MHD and LHD (37% and 47%).


## Let us check how long Kagglers have been writing code to analyze data

In [None]:
temp_df=SimpleChoicePlot('Q15')

#Let's check the age of people who have written code for <=2 years
mcr_hdi.Q15.unique()
#Copy the filtered rows so the relabelling below does not write into a view of mcr_hdi
mcr_hdi_filtered=mcr_hdi[mcr_hdi['Q15'].isin(['1-2 years','< 1 years'])].copy()
mcr_hdi_filtered['Q15']='<=2 years'
temp_df=GroupSimpleChoicePlot(['Q15'],'Q1','Age cum Code Writing years for analysis Vs HDI',inp_df=mcr_hdi_filtered)

#Let's check the roles of people who have never written code for data analysis
mcr_hdi_filtered=mcr_hdi[mcr_hdi['Q15'].isin(['I have never written code'])]
temp_df=GroupSimpleChoicePlot(['Q15'],'Q5','Role cum Code Writing years for analysis Vs HDI',inp_df=mcr_hdi_filtered)

#Let's look at the tools used by them
mcr_hdi_filtered=mcr_hdi_filtered[mcr_hdi_filtered['Q5'].isin(['Student'])]
temp_df=GroupSimpleChoicePlot(['Q15'],'Q14','Tools cum Code Writing years for analysis by students Vs HDI',inp_df=mcr_hdi_filtered)

66+% of people from MHD and LHD have started analyzing data with code only in the recent past.  It shows that interest in the DS career has been picking up very fast in these countries in recent years. Among them, 48+% are students. **Again it shows that DS is popular among the young generation from MHD and LHD. They are gearing up in this fascinating DS world!!!! Cheers, guys!!!!**

**6.3% and 1.8% of people from VHHD and HHD have been writing code to analyze data for 20+ years, and 7% from VHHD are seniors (50+). Salute them!!!! This shows the maturity level of these developed countries in this field.  The DS career did not start just now; it has been there for a long time and could have been used in pockets.  Let them guide all youngsters to succeed in this field!!!!**

8+% from MHD and LHD have never written code for data analysis. Let's look at their roles.  35% in both MHD and LHD are students.  That makes sense, as they have not yet needed to write code for data analysis.  Let's look at the tools used by these students.

55% and 63% from MHD and LHD are using Excel.  It makes sense.  **But 28% and 13% from MHD and LHD are using local environments.  This seems contradictory: if they have not written code, what do they do in a local environment??? I don't know :-(.  If anyone knows, please tell me.**


## What language Kagglers are using and recommending to aspiring DS

In [None]:
temp_df=SimpleChoicePlot('Q19')

temp_df=MultiChoicePlot('Q18_Part')

#Python and SQL vs their roles
col_name='Q18_Part_1' #Python
title='Python Lang cum Current Role Vs HDI'
mcr_hdi[col_name].unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi[col_name].isnull()]
temp_df=GroupSimpleChoicePlot([col_name],'Q5',title,inp_df=mcr_hdi_filtered)
temp_df.T

col_name='Q18_Part_3' #SQL
title='SQL cum Current Role Vs HDI'
mcr_hdi[col_name].unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi[col_name].isnull()]
temp_df=GroupSimpleChoicePlot([col_name],'Q5',title,inp_df=mcr_hdi_filtered)
temp_df.T

#Python and SQL vs their salaries
col_name='Q18_Part_1' #Python
title='Python Lang cum Salary Vs HDI'
mcr_hdi[col_name].unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi[col_name].isnull()]
temp_df=GroupSimpleChoicePlot([col_name],'Q10',title,inp_df=mcr_hdi_filtered)
temp_df.T
temp_df1=temp_df.T.copy().reset_index()
temp_df1['SalRange']='<100K'
temp_df1.loc[temp_df1['index'].isin(['> $500,000','100,000-124,999',
             '125,000-149,999','150,000-199,999','200,000-249,999',
             '250,000-299,999','300,000-500,000']),'SalRange']='>=100K'

temp_df1.groupby('SalRange').sum()


col_name='Q18_Part_3' #SQL
title='SQL cum Salary Vs HDI'
mcr_hdi[col_name].unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi[col_name].isnull()]
temp_df=GroupSimpleChoicePlot([col_name],'Q10',title,inp_df=mcr_hdi_filtered)
temp_df.T
temp_df1=temp_df.T.copy().reset_index()
temp_df1['SalRange']='<100K'
temp_df1.loc[temp_df1['index'].isin(['> $500,000','100,000-124,999',
             '125,000-149,999','150,000-199,999','200,000-249,999',
             '250,000-299,999','300,000-500,000']),'SalRange']='>=100K'

temp_df1.groupby('SalRange').sum()

Python dominates: 77+% recommend that aspiring data scientists learn it first. R is the second preference (7+%).

Python is the language most used by Kagglers across all categories (33+%), followed by SQL (15+%) and then R (10+%).

Although at most 47% of people use Python on a regular basis, 77+% recommend that aspiring data scientists learn Python first.

4.3 to 9.4% of people across HDI categories use 5 programming languages on a regular basis.

As expected, the top 3 roles (Data Scientist, Student and Software Engineer) use both Python and SQL.

Let's check their salaries.

Around 27.9%, 2.9%, 2.4% and 0.7% of people using Python earn 100K USD or more in VHHD, HHD, MHD and LHD, respectively.  
Around 32.1%, 2.4% and 2.5% of people using SQL earn 100K USD or more in VHHD, HHD and MHD, respectively, and none in LHD.  

**Wow, people using SQL seem to be earning more than people using Python, especially in VHHD and MHD. SQL seems to keep helping people earn more. As SQL has been around for a long time, that might be why it dominates like this.**
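
The `>=100K` split that is repeated in the code cells above could be factored into one helper. Below is a minimal sketch, not the notebook's actual code: the band list is copied from the cells above, while `share_100k_plus`, the toy DataFrame and its values are hypothetical. It assumes one row per respondent, with the salary band in `Q10` and an `HDI_Category` column, and it may normalize differently from `GroupSimpleChoicePlot`.

```python
import pandas as pd

# Salary bands treated as ">=100K", copied from the list used in the cells above.
HIGH_SALARY_BANDS = [
    '100,000-124,999', '125,000-149,999', '150,000-199,999',
    '200,000-249,999', '250,000-299,999', '300,000-500,000', '> $500,000',
]

def share_100k_plus(df, salary_col='Q10', group_col='HDI_Category'):
    """Return the % of respondents per group whose salary band is >=100K.

    Hypothetical helper: rows with a missing salary are dropped first.
    """
    answered = df.dropna(subset=[salary_col])
    is_high = answered[salary_col].isin(HIGH_SALARY_BANDS)
    return (is_high.groupby(answered[group_col]).mean() * 100).round(1)

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'HDI_Category': ['VHHD', 'VHHD', 'VHHD', 'VHHD', 'MHD', 'MHD', 'MHD'],
    'Q10': ['100,000-124,999', '$0-999', '> $500,000', None,
            '1,000-1,999', '150,000-199,999', '$0-999'],
})
result = share_100k_plus(toy)
print(result)
```

To compare two tools, the same helper could be called on two filtered frames, for example the rows where `Q18_Part_1` is not null and the rows where `Q18_Part_3` is not null.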

## TPU vs GPU vs CPU:

In [None]:
temp_df=SimpleChoicePlot('Q22')

#Let's check the roles of those who have used a TPU <=5 times
mcr_hdi_filtered=mcr_hdi[mcr_hdi['Q22'].isin(['Once','2-5 times'])].copy() #copy so the assignment below does not trigger SettingWithCopyWarning
mcr_hdi_filtered['Q22']='<=5 times'
temp_df=GroupSimpleChoicePlot(['Q22'],'Q5','TPU <=5 times cum Role Vs HDI',inp_df=mcr_hdi_filtered)

#Let's check TPU usage (all respondents, as the filter is disabled below) against the years of using ML methods
temp_df=GroupSimpleChoicePlot(['Q22'],'Q23','TPU <=5 times cum ML Methods Usage Years Vs HDI')#,inp_df=mcr_hdi_filtered)

temp_df=MultiChoicePlot('Q21_Part')
#Lets check the Salary, ML Methods, ML Algorithms used with CPUs
#Lets check the Salary, ML Methods and ML Algorithms used with GPUs

#CPU vs their salaries
col_name='Q21_Part_1' #CPU
title='CPU cum Salary Vs HDI'
mcr_hdi[col_name].unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi[col_name].isnull()]
temp_df=GroupSimpleChoicePlot([col_name],'Q10',title,inp_df=mcr_hdi_filtered)
temp_df.T
temp_df1=temp_df.T.copy().reset_index()
temp_df1['SalRange']='<100K'
temp_df1.loc[temp_df1['index'].isin(['> $500,000','100,000-124,999',
             '125,000-149,999','150,000-199,999','200,000-249,999',
             '250,000-299,999','300,000-500,000']),'SalRange']='>=100K'

temp_df1.groupby('SalRange').sum()

#GPU vs their salaries
col_name='Q21_Part_2' #GPU
title='GPU cum Salary Vs HDI'
mcr_hdi[col_name].unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi[col_name].isnull()]
temp_df=GroupSimpleChoicePlot([col_name],'Q10',title,inp_df=mcr_hdi_filtered)
temp_df.T
temp_df1=temp_df.T.copy().reset_index()
temp_df1['SalRange']='<100K'
temp_df1.loc[temp_df1['index'].isin(['> $500,000','100,000-124,999',
             '125,000-149,999','150,000-199,999','200,000-249,999',
             '250,000-299,999','300,000-500,000']),'SalRange']='>=100K'

temp_df1.groupby('SalRange').sum()


#Check usage across HDI Categories for CPU
df_1=mcr_hdi[~mcr_hdi['Q21_Part_1'].isnull()]['HDI_Category'].value_counts()
print("CPU Usage:\n",df_1/df_1.sum())

#Check usage across HDI Categories for GPU
df_1=mcr_hdi[~mcr_hdi['Q21_Part_2'].isnull()]['HDI_Category'].value_counts()
print("GPU Usage:\n",df_1/df_1.sum())


##Check against ML Algorithms, ML Frameworks
#Check against ML Algorithms used by people for the above
col_name='Q21_Part_2' #GPU
title='GPU Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q24_Part',[col_name],title,grp_lvl=[0,2])#,inp_df=mcr_hdi_filtered)
temp_df.T

#Check against ML Algorithms used by people for the above
col_name='Q21_Part_1' #CPU
title='CPU Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q24_Part',[col_name],title,grp_lvl=[0,2])#,inp_df=mcr_hdi_filtered)
temp_df.T

#Check against ML Frameworks used by people for the above
col_name='Q21_Part_2' #GPU
title='GPU Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q28_Part',[col_name],title,grp_lvl=[0,2])#,inp_df=mcr_hdi_filtered)
temp_df.T

#Check against ML Frameworks used by people for the above
col_name='Q21_Part_1' #CPU
title='CPU Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q28_Part',[col_name],title,grp_lvl=[0,2])#,inp_df=mcr_hdi_filtered)
temp_df.T

TPUs don't seem to be popular, even in VHHD countries (83%). **But it's interesting to note that close to 20% of people from MHD have used a TPU 1 to 5 times, vs 15% of people from VHHD and LHD. Let's look at their roles.**

**In MHD and LHD, students are using TPUs more than Data Scientists. Great, students!!! You people are rocking!!!! Hats off to you.**

Interestingly, across all categories, those who have used TPUs more than 25 times have been using ML methods for <=2 years. **This shows TPUs seem to be slowly becoming popular among people who started their journeys recently in both MHD and LHD countries.**

Let us look at CPUs and GPUs. CPUs dominate (51+%), followed by GPUs (19+%).

Around 9 to 14% of people across all HDI categories are not using any specialized hardware. **It is staggering to notice that the highest share (14%) of people not using any specialized hardware comes from VHHD.**


Let's check their salaries:
- Around 28.5%, 3.0%, 2.7% and 0.8% of people using CPUs earn 100K USD or more in VHHD, HHD, MHD and LHD, respectively.
- Around 29.2%, 4.3% and 3.7% of people using GPUs earn 100K USD or more in VHHD, HHD and MHD, respectively, and none in LHD. **They earn more than people using CPUs across all HDI categories except LHD.**

The top 3 ML algorithms / ML frameworks used by Kagglers using GPUs are:
- Algorithms: Linear or Logistic Regression (LLR), CNN, and Decision Trees (DT) or RF
- Frameworks: Scikit-learn, TensorFlow and Keras (except in LHD). In LHD it is Scikit-learn, RF and TensorFlow.

## Let us look at the important parts of our Kagglers' roles at work.

In [None]:
temp_df=MultiChoicePlot('Q9_Part')

Although MHD has a larger share of people with little coding experience, around 11% of them work on state-of-the-art ML, followed by around 9% in the other categories. **Spectacular performance by MHD people.**

ML maturity looks good across all HDI categories. 34+% of people in every HDI category are building ML prototypes, improving ML models or workflows, or doing state-of-the-art ML work.

Around 4.8+% of people across all HDI categories perform at least 4 of the listed activities as part of their role (the "None" choice is not counted).
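
Counts such as "at least 4 activities" can be derived from the multi-choice columns, where a selected option is non-null. The following is a rough sketch under that assumption; `count_selected`, the toy data and the use of `Q9_Part_5` as the "None" column are hypothetical and only illustrate the idea.

```python
import pandas as pd

def count_selected(df, prefix, exclude=()):
    """Count, per respondent, how many '<prefix>...' columns are non-null.

    Hypothetical helper; `exclude` lists columns to skip (e.g. the "None" option).
    """
    part_cols = [c for c in df.columns
                 if c.startswith(prefix) and c not in exclude]
    return df[part_cols].notnull().sum(axis=1)

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'HDI_Category': ['VHHD', 'VHHD', 'LHD', 'LHD'],
    'Q9_Part_1': ['a', None, 'a', 'a'],
    'Q9_Part_2': ['b', None, 'b', 'b'],
    'Q9_Part_3': ['c', None, 'c', 'c'],
    'Q9_Part_4': ['d', None, 'd', 'd'],
    'Q9_Part_5': [None, 'None', None, None],  # assumed "None" option column
})
toy['n_selected'] = count_selected(toy, 'Q9_Part', exclude=['Q9_Part_5'])

# % of respondents per HDI category with at least 4 selected activities.
share_4_plus = (toy['n_selected'] >= 4).groupby(toy['HDI_Category']).mean() * 100
print(share_4_plus)
```

The same counting works for any other multi-choice question, such as the number of languages, media sources or cloud platforms selected.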

## Let us explore Kagglers' Favorite Media Sources, Courses joined & Notebooks used:

In [None]:
temp_df=MultiChoicePlot('Q12_Part')

#Take only Favourite Media Source -Kaggle and check their roles
mcr_hdi.Q12_Part_4.unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi['Q12_Part_4'].isnull()]
temp_df=GroupSimpleChoicePlot(['Q12_Part_4'],'Q5','Favourite Media Source -Kaggle cum Current Role Vs HDI',inp_df=mcr_hdi_filtered)
if display_output_df:
    display(temp_df.T) #a bare expression inside an if block is not rendered, so display it explicitly

    
temp_df=MultiChoicePlot('Q13_Part')

#Coursera and Kaggle vs their roles
col_name='Q13_Part_2' #Coursera
mcr_hdi[col_name].unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi[col_name].isnull()]
temp_df=GroupSimpleChoicePlot([col_name],'Q5','Course Platform -Coursera cum Current Role Vs HDI',inp_df=mcr_hdi_filtered)
temp_df.T


col_name='Q13_Part_6' #Kaggle
mcr_hdi[col_name].unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi[col_name].isnull()]
temp_df=GroupSimpleChoicePlot([col_name],'Q5','Course Platform -Kaggle cum Current Role Vs HDI',inp_df=mcr_hdi_filtered)
temp_df.T

#Lets check their salaries.
#Coursera and Kaggle vs their salaries
col_name='Q13_Part_2' #Coursera
title='Coursera course cum Salary Vs HDI'
mcr_hdi[col_name].unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi[col_name].isnull()]
temp_df=GroupSimpleChoicePlot([col_name],'Q10',title,inp_df=mcr_hdi_filtered)
temp_df.T
temp_df1=temp_df.T.copy().reset_index()
temp_df1['SalRange']='<100K'
temp_df1.loc[temp_df1['index'].isin(['> $500,000','100,000-124,999',
             '125,000-149,999','150,000-199,999','200,000-249,999',
             '250,000-299,999','300,000-500,000']),'SalRange']='>=100K'

temp_df1.groupby('SalRange').sum()


col_name='Q13_Part_6' #Kaggle
title='Kaggle course cum Salary Vs HDI'
mcr_hdi[col_name].unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi[col_name].isnull()]
temp_df=GroupSimpleChoicePlot([col_name],'Q10',title,inp_df=mcr_hdi_filtered)
temp_df.T
temp_df1=temp_df.T.copy().reset_index()
temp_df1['SalRange']='<100K'
temp_df1.loc[temp_df1['index'].isin(['> $500,000','100,000-124,999',
             '125,000-149,999','150,000-199,999','200,000-249,999',
             '250,000-299,999','300,000-500,000']),'SalRange']='>=100K'

temp_df1.groupby('SalRange').sum()

#Notebooks used
temp_df=MultiChoicePlot('Q17_Part')

#Kaggle Notebook vs their roles
col_name='Q17_Part_1' #Kaggle Notebook
title='Kaggle Notebook cum Current Role Vs HDI'
mcr_hdi[col_name].unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi[col_name].isnull()]
temp_df=GroupSimpleChoicePlot([col_name],'Q5',title,inp_df=mcr_hdi_filtered)
temp_df.T


#Kaggle Notebook vs their salaries
col_name='Q17_Part_1' #Kaggle Notebook
title='Kaggle Notebook cum Salary Vs HDI'
mcr_hdi[col_name].unique()
mcr_hdi_filtered=mcr_hdi[~mcr_hdi[col_name].isnull()]
temp_df=GroupSimpleChoicePlot([col_name],'Q10',title,inp_df=mcr_hdi_filtered)
temp_df.T
temp_df1=temp_df.T.copy().reset_index()
temp_df1['SalRange']='<100K'
temp_df1.loc[temp_df1['index'].isin(['> $500,000','100,000-124,999',
             '125,000-149,999','150,000-199,999','200,000-249,999',
             '250,000-299,999','300,000-500,000']),'SalRange']='>=100K'

temp_df1.groupby('SalRange').sum()

**Top 3 sources used for DS topics, in order, across all HDI categories:**
- Kaggle (19+%)
- Blogs (17+%)
- YouTube (12+%)
- 8+% refer to at least 6 sources

**An interesting fact is that 11% from LHD refer to six sources, higher than in any other category. This shows their zeal for the DS field.**

Let us check the roles of people who refer to Kaggle as a source:
- 25+% are students in HHD, MHD and LHD
- 26% are Data Scientists in VHHD. **This again shows the maturity of the DS field in VHHD.**

**It's dazzling to see that 10.3% of people referring to Kaggle from LHD are not employed. Interest in this space is soaring in LHD as well. Big thanks to Kaggle for helping both students and people who are not employed.**

**Top 3 courses:**
- Coursera (14+%)
- Kaggle (12+%)
- Udemy (10+%)
- Around 12% of people from HHD and VHHD countries used university courses for DS learning.
- *Around 9 to 12% of participants across HDI categories have used five sources for DS learning.* **The 12% is from LHD. Again, it shows their interest in a DS career.**

Let's check the roles and salaries of the people using Coursera and Kaggle.  

19+% of people learning through Coursera are Data Scientists and 13+% are students. The pattern is similar for people learning through Kaggle.

Around 29.3%, 3.1% and 3.3% of people studying through Coursera courses earn 100K USD or more in VHHD, HHD and MHD, respectively. None in LHD.  
Around 22.6%, 3.1% and 3.4% of people studying through Kaggle courses earn 100K USD or more in VHHD, HHD and MHD, respectively. None in LHD.  

**It could be that these courses helped them earn a higher salary, or they might already have been earning a high package, in which case their zeal for continuous learning is exemplary. Either way, it shows the power and popularity of these courses.**

**Thanks to the Kaggle team for running such beneficial courses for the DS community and helping its members flourish in their careers.**

**Top 2 Notebooks:**
- Kaggle Notebooks (21+%)
- Google Colab (18+%)

In VHHD, close to 32% of people are not using any of the notebooks. Interesting!!!

Let's check the roles and salaries of the people who are using Kaggle Notebooks.
- Students are the primary users (max 33%).
- Data Scientists account for 19+%.

Around 18.9%, 2.9%, 3.1% and 1.6% of people using Kaggle Notebooks earn 100K USD or more in VHHD, HHD, MHD and LHD, respectively.  

**It's breathtaking news that Kaggle Notebooks users (1.6%) from LHD are earning 100K USD or more. Among the comparisons above, this is the highest share of LHD users crossing this threshold (compared with 0.7% for Python users and 0.8% for CPU users).**

## Let us check the interest in Automated ML and Cloud among our Kagglers:

In [None]:
#Automated MLs
temp_df=MultiChoicePlot('Q25_Part')

#Cloud computing platforms
temp_df=MultiChoicePlot('Q29_Part')

#Cloud computing products
temp_df=MultiChoicePlot('Q30_Part')


## Automated ML Tools:

- 36+% of Kagglers across all categories are not using automated ML tools.
- In terms of preference, automated model selection tools (e.g. auto-sklearn, xcessiv) are used by 13+%, followed by automated data augmentation tools (e.g. imgaug, albumentations) at 6.8+%.
- Close to 49% from VHHD still do not rely on these ML tools. **So they are not popular even in the mature market.**
- Between 7 and 9% of people across HDI categories use 3 ML tools on a regular basis.

## Interest in Cloud:
**Top 3 Cloud Computing Platforms:**
- AWS (17+%)
- Google Cloud Platform (21+%)
- MS Azure (13+%)

**It is surprising to see that 16+% of people across all HDI categories are not using any of these cloud computing platforms. Almost a quarter of people from VHHD are not using any cloud computing platform.**

Another interesting fact is that 4.5 to 8.1% of people across HDI categories use 3 cloud platforms on a regular basis.

**Top 2 Cloud Computing Products:**
- AWS Elastic Compute Cloud (EC2) (8+%)
- Google Compute Engine (GCE) (8+%)

It is surprising to see that 25+% of people across all HDI categories are not using any of these cloud computing products. **More than a quarter of people (30%) from VHHD are not using any cloud computing products.**

Another interesting fact is that 4.5 to 6.6% of people across HDI categories use 3 cloud products on a regular basis.

**Based on the above, it is evident that cloud is not yet very popular in the DS field. In the current cloud market in the DS field, AWS seems to be dominating.**
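
The "not using any" shares quoted above depend on the denominator. Here is a small sketch of one way to compute such a share, restricted to respondents who answered the question at all. `none_share`, the toy data and the choice of `Q29_Part_3` as the "None" column are hypothetical; the real part number of the "None" option should be checked in the survey schema.

```python
import pandas as pd

def none_share(df, prefix, none_col, group_col='HDI_Category'):
    """% per group that picked the "None" option, among those who answered.

    Hypothetical helper: a respondent counts as having answered when at
    least one '<prefix>...' column is non-null.
    """
    part_cols = [c for c in df.columns if c.startswith(prefix)]
    answered = df[df[part_cols].notnull().any(axis=1)]
    picked_none = answered[none_col].notnull()
    return (picked_none.groupby(answered[group_col]).mean() * 100).round(1)

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'HDI_Category': ['VHHD', 'VHHD', 'VHHD', 'VHHD', 'HHD', 'HHD'],
    'Q29_Part_1': ['AWS', None, None, None, None, 'AWS'],
    'Q29_Part_2': [None, None, None, 'GCP', None, 'GCP'],
    'Q29_Part_3': [None, 'None', None, None, 'None', None],  # assumed "None" column
})
shares = none_share(toy, 'Q29_Part', 'Q29_Part_3')
print(shares)
```

In the toy data the third VHHD respondent skipped the question, so that row is left out of the denominator instead of being counted as a non-user.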

## Let's now explore Kagglers' interest in ML-related products and tools.

In [None]:
##ML Products
temp_df=MultiChoicePlot('Q32_Part')

##ML Tools
temp_df=MultiChoicePlot('Q33_Part')

##RDB
temp_df=MultiChoicePlot('Q34_Part')

##Big data / Analytics Products
temp_df=MultiChoicePlot('Q31_Part')

**Top 3 ML Products:**
- Azure ML Studio (5.4+%)
- Google ML Engine (5.1+%)
- Google Speech-to-Text (3+%)

**Google seems to be dominating in this segment.**

It is bewildering to see that 31+% of people across all HDI categories are not using any of these ML products. More than half of the people from VHHD (56%) are not using any ML products.

Despite the above, 7.8 to 9.7% of people across HDI categories use 2 ML products on a regular basis.


**Top 3 ML Tools:**
- Auto-sklearn (6.6+%)
- Google AutoML (4.7+%)
- Auto-Keras (3.9+%)

**Again Google seems to be dominating in this segment.**

In line with the trend seen earlier, 48+% of people across all HDI categories are not using any AutoML tools. More than two thirds (70.1%) from VHHD are not using any AutoML tools.

Again, despite the low interest, 5.5 to 6.8% of people across HDI categories still use 2 AutoML tools on a regular basis.


**Top RDBs:**
- MySQL (19.6+%)
- PostgreSQL (9.3+%)
- MS SQL Server (12.8+%)
- SQLite (10.7+%)
- Oracle DB (6.2+%)

**Oracle seems to be dominating in this segment as it has 2 products (MySQL and Oracle DB).**

7.5 to 9.4% of people across HDI categories use 3 RDBs on a regular basis.

**Top Big data / Analytics Products:**
- Google BigQuery (9.9+%)
- Google Cloud Dataflow (4.1+%)
- MS Analysis Services (3.8+%)

**Google seems to be dominating in this segment. Across HDI categories, around 17.5 to 31% are using Google's big data or analytics products.**

The trend continues here as well: 39+% of people across all HDI categories are not using any of these big data or analytics products. Almost half (47%) of VHHD people are not using any big data or analytics products.

7.1 to 9.8% of people across HDI categories use 2 big data or analytics products on a regular basis.


**Based on all of the above analysis, it is evident that ML products, ML tools, RDBs and big data / analytics products have not yet become popular in the Data Science field. This is especially evident in the VHHD category.**

In the current market, **Google seems to have a big piece of the pie in this business segment.**

# Does HDI affect people's taste or interest?

One interesting analysis was done to check whether people's taste or interest in technologies, platforms, tools etc. differs based on their country's HDI. What do you think?

In this deep-dive analysis, taste or interest seems to be uniform among female participants across all HDI categories, and the same holds true for male participants. For instance, among ML algorithms, Linear or Logistic Regression and Decision Trees or RF seem to be more popular among female participants than among male participants in their respective categories.

**Sounds interesting!!!! Let's look at them below.**

**Let us look at what is popular among female and male participants, for various activities / factors, across HDI categories:**

In [None]:
#Q24_Part_2	Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Decision Trees or Random Forests
#Check against Gender
col_name='Q2' #Gender
title='Gender Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q24_Part',[col_name],title,grp_lvl=[0,2],inp_df=mcr_hdi_m_f_alone)
temp_df.T

#Q25_Part_2	Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automated feature engineering/selection (e.g. tpot, boruta_py)
col_name='Q2' #Gender
title='Gender Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q25_Part',[col_name],title,grp_lvl=[0,2],inp_df=mcr_hdi_m_f_alone)
temp_df.T

#Q26_Part_2	Which categories of computer vision methods do you use on a regular basis?  (Select all that apply) - Selected Choice - Image segmentation methods (U-Net, Mask R-CNN, etc)
col_name='Q2' #Gender
title='Gender Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q26_Part',[col_name],title,grp_lvl=[0,2],inp_df=mcr_hdi_m_f_alone)
temp_df.T

#Q27_Part_2	Which of the following natural language processing (NLP) methods do you use on a regular basis?  (Select all that apply) - Selected Choice - Encoder-decorder models (seq2seq, vanilla transformers)
col_name='Q2' #Gender
title='Gender Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q27_Part',[col_name],title,grp_lvl=[0,2],inp_df=mcr_hdi_m_f_alone)
temp_df.T

#Q28_Part_2	Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -   TensorFlow 
col_name='Q2' #Gender
title='Gender Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q28_Part',[col_name],title,grp_lvl=[0,2],inp_df=mcr_hdi_m_f_alone)
temp_df.T

#Q29_Part_2	Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice -  Amazon Web Services (AWS) 
col_name='Q2' #Gender
title='Gender Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q29_Part',[col_name],title,grp_lvl=[0,2],inp_df=mcr_hdi_m_f_alone)
temp_df.T

#Q30_Part_2	Which specific cloud computing products do you use on a regular basis? (Select all that apply) - Selected Choice - Google Compute Engine (GCE)
col_name='Q2' #Gender
title='Gender Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q30_Part',[col_name],title,grp_lvl=[0,2],inp_df=mcr_hdi_m_f_alone)
temp_df.T

#Q31_Part_2	Which specific big data / analytics products do you use on a regular basis? (Select all that apply) - Selected Choice - AWS Redshift
col_name='Q2' #Gender
title='Gender Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q31_Part',[col_name],title,grp_lvl=[0,2],inp_df=mcr_hdi_m_f_alone)
temp_df.T

#Q32_Part_2	Which of the following machine learning products do you use on a regular basis? (Select all that apply) - Selected Choice - Cloudera
col_name='Q2' #Gender
title='Gender Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q32_Part',[col_name],title,grp_lvl=[0,2],inp_df=mcr_hdi_m_f_alone)
temp_df.T

#Q33_Part_2	Which automated machine learning tools (or partial AutoML tools) do you use on a regular basis?  (Select all that apply) - Selected Choice -  H20 Driverless AI  
col_name='Q2' #Gender
title='Gender Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q33_Part',[col_name],title,grp_lvl=[0,2],inp_df=mcr_hdi_m_f_alone)
temp_df.T

#Q34_Part_2	Which of the following relational database products do you use on a regular basis? (Select all that apply) - Selected Choice - PostgresSQL
col_name='Q2' #Gender
title='Gender Vs HDI'
temp_grp_df,temp_df=GroupMultiChoicePlot('Q34_Part',[col_name],title,grp_lvl=[0,2],inp_df=mcr_hdi_m_f_alone)
temp_df.T


In [None]:
from IPython.display import HTML, display
import tabulate
table = [["Activity / Factor","Female","Male"],
         ["ML Algorithms","Linear Logistic Regression, Decision Trees or RF","CNN and GB Machines"],
         ["ML Tools","Auto Model Selection","Auto Data Augmentation"],
         ["Computer Vision Methods","Image Segment Methods","General Purpose Image"],
         ["Natural Language Processing","Encoder Decoder Models (Except in MHD) & Word Embeddings (in MHD)","Transformer Language Models (Except in LHD) & Word Embeddings (in LHD)"],
         ["Machine Learning Frameworks","Scikit-learn","Keras and PyTorch"],
         ["Cloud Computing Platforms","IBMCloud","AmazonWebServices"],
         ["Cloud Computing Products","AzureVirtualMachines (Except in MHD) & GoogleCloudFunctions (in MHD)","AWS_EC_Cloud"],
         ["Big Data / Analytics Products","MS Analysis Service","AWSAthena, AWSElasticMapReduce, AWSKinesis and AWSRedshift"],
         ["Machine Learning Products","Azure_ML_Studio and RapidMiner","GoogleCloudVision and AmazonSageMaker"],
         ["Automated Machine Learning Tools","MLbox (Except in LHD) and H20DriverlessAI (in LHD)","GoogleAutoML"],
         ["Relational Database Products","MicrosoftAccess","PostgresSQL (Except in LHD) & MicrosoftSQLServer (in LHD)"]]
display(HTML(tabulate.tabulate(table, headers='firstrow', tablefmt='html'))) #render the first row as the table header

**Looks like AWS products are popular among Male Participants and Microsoft products are popular among Female Participants across all HDI categories.**
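
One possible way to surface such differences is to compare, for each option, the share of female and male respondents who selected it. The sketch below is only an illustration and may differ from how the table above was derived; `gender_gap` and the toy data are hypothetical, and it assumes `Q2` holds `'Female'` / `'Male'` and that a selected option is non-null.

```python
import pandas as pd

def gender_gap(df, prefix, gender_col='Q2'):
    """Female-minus-male gap (percentage points) in selection share per option.

    Hypothetical helper: positive values mean the option is selected by a
    larger share of female than male respondents.
    """
    part_cols = [c for c in df.columns if c.startswith(prefix)]
    selected = df[part_cols].notnull().astype(int)
    share = selected.groupby(df[gender_col]).mean() * 100
    gap = share.loc['Female'] - share.loc['Male']
    return gap.sort_values(ascending=False)

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'Q2': ['Female', 'Female', 'Male', 'Male', 'Male', 'Male'],
    'Q24_Part_1': ['LR', 'LR', 'LR', None, None, None],
    'Q24_Part_2': [None, 'DT', 'DT', 'DT', 'DT', None],
})
gap = gender_gap(toy, 'Q24_Part')
print(gap)
```

Running it separately on the rows of each `HDI_Category` would give a per-category view of which options lean female or male.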

# Conclusion:

**VHHDs are Marching:**
- These veterans are skyrocketing in the DS career space.
- VHHD countries are mature markets, so it is natural to expect them to outperform others in all activities in the Data Science niche.
- They are highly educated, primarily holding Master's and Doctoral degrees.
- Seniors (50+ years) and Super Seniors (70+ years) are active in this space.
- Data Scientist is the top role.
- ML maturity is high in their organizations.
- Even with high packages, they are taking DS courses. Either these DS courses are helping them earn a higher salary, or their learning appetite is high.
- The female-to-male participation ratio is higher than average in almost 14 countries in this category. This is interesting.

**HHDs are Running:**
- These adults are already running in the DS career space.
- The female-to-male participation ratio is higher than average in almost 6 countries.
- This group includes experienced programmers who have been writing code for data analysis for 20+ years.
- They are highly educated, primarily holding Master's and Doctoral degrees.
- MHD countries are giving them heavy competition; if HHD countries don't run fast, MHD could overtake them in the future.

**MHDs are Walking:**
- These kids have started walking, and some are running, in the DS career.
- Almost half of the Kagglers from this group are youngsters (18-24 years), primarily students.
- As youngsters are getting into this space, we can expect MHD to perform well in the DS field in the future.
- There are some mature organizations that are currently using ML.
- The female-to-male participation ratio is higher than average in almost 6 countries.
- Students from this group seem to use advanced processors like TPUs more than the Data Scientists from the same group. This shows their interest and growth in the DS field.

 
**LHDs are Crawling:**
- These babies are crawling, and some active babies have started walking, in the DS career.
- Close to half of the Kagglers from this group are youngsters (18-24 years), primarily students.
- 12% of the youngest (18-21 years) Kagglers are in a Data Scientist role. This shows people's zeal for this space, and some have already started performing the role.
- Only one country from this category participated. When more countries join the DS space, they could perform well in DS careers.
- Students from this group seem to use advanced processors like TPUs more than the Data Scientists from the same group. This shows their interest and growth in the DS field.
- The share of female PhD holders is higher in this group (almost double the male PhD share).
- This is the only category where a higher share of females (3%) than males (1%) earn >=100K in compensation.

Cloud computing and automation seem to be at the walking stage in the DS field.

**Let me conclude my analysis with a closing note: *Let Seniors and Super Seniors from VHHD and HHD show the path to youngsters entering this fascinating field of Data Science across the world, especially from MHD and LHD. Let all of us grow together and use the DS field effectively to make this world a better place to live.***