## Kaggle ML & DS Survey

Every year Kaggle runs a Machine Learning and Data Science Survey. This is the fourth edition.

In this kernel I look at the country where most Kagglers reside. You guessed it right: India!

I will show how our characteristics have changed (or not) over time and what is interesting about Indian Kagglers.

Let's assume that the survey is representative and allows us to draw conclusions.
For some features it is possible to compare data across years, but for others either there is no such data or the data is in very different formats. Let's start.

![](https://storage.googleapis.com/kaggle-competitions/kaggle/16394/logos/header.png?t=2019-10-17-20-37-58)

In [None]:
# importing libraries
import numpy as np 
import pandas as pd 
import os
import math
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import tools
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings("ignore")
init_notebook_mode(connected=True)
from ipywidgets import interact, interactive, interact_manual
import ipywidgets as widgets
import colorlover as cl

In [None]:
# loading data from different years

DIR = '/kaggle/input/kaggle-survey-2020/'
df_choice_20 = pd.read_csv(DIR + 'kaggle_survey_2020_responses.csv', low_memory=False)
# the first data row holds the full question text; use it as the column names
df_choice_20.columns = df_choice_20.iloc[0]
df_choice_20 = df_choice_20.drop([0])

DIR = '/kaggle/input/kagglesurvey2019/'
df_free_19 = pd.read_csv(DIR + 'other_text_responses.csv', low_memory=False)
df_choice_19 = pd.read_csv(DIR + 'multiple_choice_responses.csv', low_memory=False, encoding='latin-1', header=[0,1])
df_choice_19.columns = ['_'.join(col) for col in df_choice_19.columns]



DIR = '/kaggle/input/kaggle-survey-2018/'
df_free_18 = pd.read_csv(DIR + 'freeFormResponses.csv', low_memory=False, header=[0,1])
df_choice_18 = pd.read_csv(DIR + 'multipleChoiceResponses.csv', low_memory=False, header=[0,1])
# Format Dataframes
df_free_18.columns = ['_'.join(col) for col in df_free_18.columns]
df_choice_18.columns = ['_'.join(col) for col in df_choice_18.columns]


DIR = '/kaggle/input/kaggle-survey-2017/'
df_free_17 = pd.read_csv(DIR + 'freeformResponses.csv', low_memory=False)
df_choice_17 = pd.read_csv(DIR + 'multipleChoiceResponses.csv', low_memory=False, encoding='latin-1')
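
The 2018 and 2019 files are read with `header=[0,1]`, which gives each column a (question code, question text) pair that is then joined into one string. Here is a minimal sketch of that step on a tiny made-up CSV; the two questions and the rows are illustrative, not the real survey file:

```python
import io
import pandas as pd

# Hypothetical two-row header mimicking the 2018/2019 layout:
# row 0 holds the question code, row 1 the question text.
raw = io.StringIO(
    "Q1,Q2\n"
    "What is your age?,In which country do you currently reside?\n"
    "22-24,India\n"
    "30-34,USA\n"
)
demo = pd.read_csv(raw, header=[0, 1])

# Each column label is a (code, text) tuple; join it into one flat string.
demo.columns = ['_'.join(col) for col in demo.columns]
print(demo.columns.tolist())
```

This is why the 2018 and 2019 column names used later carry a `Q3_`-style prefix, while the 2020 columns (read with a single header row) use the bare question text.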

In [None]:
# Replacing country names with less verbose alternatives

df_choice_20.loc[df_choice_20['In which country do you currently reside?'] == 'United States of America', 'In which country do you currently reside?'] = 'USA'
df_choice_20.loc[df_choice_20['In which country do you currently reside?'] == 'United Kingdom of Great Britain and Northern Ireland', 'In which country do you currently reside?'] = 'UK'

In [None]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
# processing data for visualizations

# rename_axis/reset_index(name=...) gives the same 'Country'/'count' columns regardless of pandas version
top_count = df_choice_20['In which country do you currently reside?'].value_counts().head(8).rename_axis('Country').reset_index(name='count')

# taking only Indian responders
df_choice_17 = df_choice_17.loc[df_choice_17['Country'] == 'India']
df_choice_18 = df_choice_18.loc[df_choice_18['Q3_In which country do you currently reside?'] == 'India']
df_choice_19 = df_choice_19.loc[df_choice_19['Q3_In which country do you currently reside?'] == 'India']
df_choice_20 = df_choice_20.loc[df_choice_20['In which country do you currently reside?'] == 'India']

def get_age(x: int):
    """
    Convert numerical age to categories.
    """
    if 18 <= x <= 21:
        return '18-21'
    elif 22 <= x <= 24:
        return '22-24'
    elif 25 <= x <= 29:
        return '25-29'
    elif 30 <= x <= 34:
        return '30-34'
    elif 35 <= x <= 39:
        return '35-39'
    elif 40 <= x <= 44:
        return '40-44'
    elif 45 <= x <= 49:
        return '45-49'
    elif 50 <= x <= 54:
        return '50-54'
    elif 55 <= x <= 59:
        return '55-59'
    elif 60 <= x <= 69:
        return '60-69'
    elif x >= 70:
        return '70+'
    
# create a new age column with the same name and unique values in all datasets
df_choice_17['Age_'] = df_choice_17['Age'].apply(lambda x: get_age(x))
df_choice_18['Age_'] = df_choice_18['Q2_What is your age (# years)?']
df_choice_18.loc[df_choice_18['Age_'].isin(['70-79', '80+']), 'Age_'] = '70+'
df_choice_19['Age_'] = df_choice_19['Q1_What is your age (# years)?']
df_choice_20['Age_'] = df_choice_20['What is your age (# years)?']

# renaming columns so that it would be easier to work with them
df_choice_17 = df_choice_17.rename(columns={'GenderSelect': 'Gender', 'FormalEducation': 'Degree'})
df_choice_18 = df_choice_18.rename(columns={'Q1_What is your gender? - Selected Choice': 'Gender', 'Q9_What is your current yearly compensation (approximate $USD)?': 'Salary',
                                            'Q4_What is the highest level of formal education that you have attained or plan to attain within the next 2 years?': 'Degree'})
df_choice_19 = df_choice_19.rename(columns={'Q2_What is your gender? - Selected Choice': 'Gender', 'Q10_What is your current yearly compensation (approximate $USD)?': 'Salary',
                                            'Q4_What is the highest level of formal education that you have attained or plan to attain within the next 2 years?': 'Degree'})
df_choice_20 = df_choice_20.rename(columns={'What is your gender? - Selected Choice': 'Gender', 'What is your current yearly compensation (approximate $USD)?': 'Salary',
                                            'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?': 'Degree'})

df_choice_19['Degree'] = df_choice_19['Degree'].replace({'Masterâs degree': 'Master’s degree', 'Bachelorâs degree': 'Bachelor’s degree',
                                                         'Some college/university study without earning a bachelorâs degree': 'Some college/university study without earning a bachelor’s degree'})
df_choice_17['Degree'] = df_choice_17['Degree'].replace({"Master's degree": 'Master’s degree', "Bachelor's degree": 'Bachelor’s degree',
                                                         "Some college/university study without earning a bachelor's degree": 'Some college/university study without earning a bachelor’s degree',
                                                         "I did not complete any formal education past high school": "No formal education past high school"})



In [None]:
# changing salary values to the same categories for 2018
df_choice_18.loc[df_choice_18['Salary'].isin(['400-500,000', '300-400,000']), 'Salary'] = '300,000-500,000'
df_choice_18.loc[df_choice_18['Salary'].isin(['500,000+']), 'Salary'] = '> $500,000'



# changing salary values to the same categories for 2019
df_choice_19.loc[df_choice_19['Salary'].isin(['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', '5,000-7,499', '7,500-9,999']), 'Salary'] = '0-10,000'
df_choice_19.loc[df_choice_19['Salary'].isin(['10,000-14,999', '15,000-19,999']), 'Salary'] = '10-20,000'
df_choice_19.loc[df_choice_19['Salary'].isin(['20,000-24,999', '25,000-29,999']), 'Salary'] = '20-30,000'
df_choice_19.loc[df_choice_19['Salary'] == '30,000-39,999', 'Salary'] = '30-40,000'
df_choice_19.loc[df_choice_19['Salary'] == '40,000-49,999', 'Salary'] = '40-50,000'
df_choice_19.loc[df_choice_19['Salary'] == '50,000-59,999', 'Salary'] = '50-60,000'
df_choice_19.loc[df_choice_19['Salary'] == '60,000-69,999', 'Salary'] = '60-70,000'
df_choice_19.loc[df_choice_19['Salary'] == '70,000-79,999', 'Salary'] = '70-80,000'
df_choice_19.loc[df_choice_19['Salary'] == '80,000-89,999', 'Salary'] = '80-90,000'
df_choice_19.loc[df_choice_19['Salary'] == '90,000-99,999', 'Salary'] = '90-100,000'
df_choice_19.loc[df_choice_19['Salary'] == '100,000-124,999', 'Salary'] = '100-125,000'
df_choice_19.loc[df_choice_19['Salary'] == '125,000-149,999', 'Salary'] = '125-150,000'
df_choice_19.loc[df_choice_19['Salary'] == '150,000-199,999', 'Salary'] = '150-200,000'
df_choice_19.loc[df_choice_19['Salary'] == '200,000-249,999', 'Salary'] = '200-250,000'
df_choice_19.loc[df_choice_19['Salary'] == '250,000-299,999', 'Salary'] = '250-300,000'



# changing salary values to the same categories for 2020
df_choice_20.loc[df_choice_20['Salary'].isin(['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', '5,000-7,499', '7,500-9,999']), 'Salary'] = '0-10,000'
df_choice_20.loc[df_choice_20['Salary'].isin(['10,000-14,999', '15,000-19,999']), 'Salary'] = '10-20,000'
df_choice_20.loc[df_choice_20['Salary'].isin(['20,000-24,999', '25,000-29,999']), 'Salary'] = '20-30,000'
df_choice_20.loc[df_choice_20['Salary'] == '30,000-39,999', 'Salary'] = '30-40,000'
df_choice_20.loc[df_choice_20['Salary'] == '40,000-49,999', 'Salary'] = '40-50,000'
df_choice_20.loc[df_choice_20['Salary'] == '50,000-59,999', 'Salary'] = '50-60,000'
df_choice_20.loc[df_choice_20['Salary'] == '60,000-69,999', 'Salary'] = '60-70,000'
df_choice_20.loc[df_choice_20['Salary'] == '70,000-79,999', 'Salary'] = '70-80,000'
df_choice_20.loc[df_choice_20['Salary'] == '80,000-89,999', 'Salary'] = '80-90,000'
df_choice_20.loc[df_choice_20['Salary'] == '90,000-99,999', 'Salary'] = '90-100,000'
df_choice_20.loc[df_choice_20['Salary'] == '100,000-124,999', 'Salary'] = '100-125,000'
df_choice_20.loc[df_choice_20['Salary'] == '125,000-149,999', 'Salary'] = '125-150,000'
df_choice_20.loc[df_choice_20['Salary'] == '150,000-199,999', 'Salary'] = '150-200,000'
df_choice_20.loc[df_choice_20['Salary'] == '200,000-249,999', 'Salary'] = '200-250,000'
df_choice_20.loc[df_choice_20['Salary'] == '250,000-299,999', 'Salary'] = '250-300,000'
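
The salary harmonisation above repeats one `.loc` assignment per band. As a rough alternative sketch (not the approach used in this kernel), the same idea can be expressed as a lookup dictionary passed to `Series.replace`. Only the first few bands are shown, and the `groups`, `lookup` and `salary` names are made up for the example:

```python
import pandas as pd

# Fine-grained bands (as in the 2019/2020 answers) grouped under coarse labels.
groups = {
    '0-10,000': ['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999',
                 '4,000-4,999', '5,000-7,499', '7,500-9,999'],
    '10-20,000': ['10,000-14,999', '15,000-19,999'],
    '20-30,000': ['20,000-24,999', '25,000-29,999'],
    '30-40,000': ['30,000-39,999'],
}
# Invert to a flat lookup: fine label -> coarse label.
lookup = {fine: coarse for coarse, fines in groups.items() for fine in fines}

# Made-up answers; the last one is not in the lookup.
salary = pd.Series(['$0-999', '15,000-19,999', '30,000-39,999', '> $500,000'])

# replace() leaves labels that are not in the lookup untouched.
harmonised = salary.replace(lookup)
print(harmonised.tolist())
```

Keeping the mapping in one dictionary makes it easier to reuse the same rules for both the 2019 and 2020 frames.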

In [None]:
#Rename Gender values in 2020
df_choice_20.loc[df_choice_20['Gender']=='Man','Gender'] = 'Male'
df_choice_20.loc[df_choice_20['Gender']=='Woman','Gender'] = 'Female'

In [None]:
# Functions

def plot_gender_vars(var1: str = '', title_name: str = '', avg: bool = False):
    """
    Make separate count plots for genders over years.
    """
    colors = cl.scales['4']['div']['PuOr']
    names = {0: '2017', 1: '2018', 2: '2019', 3:'2020'}
    msg = "Rate" if avg else "Total Count"
    fig = make_subplots(rows=1, cols=2, subplot_titles=('Male', 'Female'), print_grid=False)
    # too few responders identify as anything other than Male/Female, so only the two most common genders are plotted
    for j, c in enumerate(['Male', 'Female']):
        for i, df in enumerate([df_choice_17, df_choice_18, df_choice_19, df_choice_20]):
            # with avg=True, value_counts returns shares instead of raw counts
            grouped = df.loc[(df['Gender'] == c), var1].value_counts(normalize=avg).sort_index()
            trace = go.Bar(
                x=grouped.index,
                y=grouped.values,
                name=names[i],
                marker=dict(color=colors[i]),
                showlegend=(j == 0),
                legendgroup=names[i]
            )
            fig.add_trace(trace, row=1, col=j + 1)

    fig['layout'].update(height=400, width=1000, paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)', title=f'{msg} of Indian kagglers by {title_name} and gender');
    return fig


def plot_var(var1: str = '', title_name: str = ''):
    """
    Plot one variable over years.
    """
    colors = cl.scales['4']['div']['PuOr']
    names = {0: '2017', 1: '2018', 2: '2019', 3:'2020'}
    
    data = []
    for i, df in enumerate([df_choice_17, df_choice_18, df_choice_19, df_choice_20]):
        # share of each category within the year (missing answers are excluded)
        grouped = df[var1].value_counts(normalize=True).sort_index()
        trace = go.Bar(
            x=grouped.index,
            y=grouped.values,
            name=names[i],
            marker=dict(color=colors[i]),
            legendgroup=i
        )
        data.append(trace)
    layout = dict(height=400, width=1000, paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)', title=f'Indian kagglers by {title_name}', xaxis={'categoryorder': 'total descending'});  
    fig = dict(data=data, layout=layout)
    return fig


def plot_var_salary(var1: str = '', title_name: str = '', normalize: bool = False):
    """
    Plot salary over years. This is a separate function, because
    it is necessary to add code for sorting.
    """
    colors = cl.scales['4']['div']['PuOr']
    names = {0: '2018', 1: '2019', 2:'2020'}
    
    data = []
    for i, df in enumerate([df_choice_18, df_choice_19, df_choice_20]):
        # with normalize=True, counts become shares of all answers for that year
        grouped = df[var1].value_counts(normalize=normalize)
        map_dict = {'0-10,000': 0,
                    '10-20,000': 1,
                    '100-125,000': 10,
                    '125-150,000' : 11,
                    '150-200,000': 12,
                    '20-30,000': 2,
                    '200-250,000': 13,
                    '250-300,000': 14,
                    '30-40,000': 3,
                    '300,000-500,000': 15,
                    '40-50,000': 4,
                    '50-60,000': 5,
                    '60-70,000': 6,
                    '70-80,000': 7,
                    '80-90,000': 8,
                    '90-100,000': 9,
                    '> $500,000': 16,
                    'I do not wish to disclose my approximate yearly compensation': 17}
        # drop the non-disclosure answer, then order the remaining bands numerically
        grouped = grouped.drop('I do not wish to disclose my approximate yearly compensation', errors='ignore')
        grouped = grouped.loc[sorted(grouped.index, key=lambda x: map_dict[x])]
        trace = go.Bar(
            x=grouped.index,
            y=grouped.values,
            name=names[i],
            marker=dict(color=colors[i]),
            legendgroup=i
        )
        data.append(trace)
    layout = dict(height=500, width=1000, paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)', title=f'Indian kagglers by {title_name}');  
    fig = dict(data=data, layout=layout)
    return fig


def plot_choice_var(var: str = '', title_name: str = ''):
    """
    Plot a variable, in which responders could select several answers.
    """
    col_names = [col for col in df_choice_20.columns if f'{var}' in col]
    data = []
    small_df = df_choice_20[col_names]
    text_values = [col.rsplit('- ')[2] for col in col_names]
    counts = []
    for m, n in zip(col_names, text_values):
        if small_df[m].nunique() == 0:
            counts.append(0)
        else:
            counts.append(sum(small_df[m] == n))
            
    trace = go.Bar(
        x=text_values,
        y=counts,
        name='c',
        marker=dict(color='rgb(56,118,176)'),
        showlegend=False
    )
    data.append(trace)    
    fig = go.Figure(data=data)
    fig['layout'].update(height=600, width=1000, paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)', title=f'Popular {title_name}', xaxis={'categoryorder': 'total descending'});
    return fig

## Count of Respondents - India vs World

In [None]:
df_count = pd.DataFrame({'Year': [2017, 2018, 2019, 2020], 'Count': [df_choice_17.shape[0], df_choice_18.shape[0], df_choice_19.shape[0],  df_choice_20.shape[0]]})
top_count = top_count.sort_values('count')

fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Bar(y=top_count['Country'], x=top_count['count'], orientation='h', name='Number of respondents by country in 2020',marker=dict(color = 'rgb(33,187,30)', colorscale='viridis')), row=1, col=1)
fig.add_trace(go.Bar(x=df_count['Year'], y=df_count['Count'], name='Number of Indian responders by year', marker=dict(color = 'rgb(30,187,171)', colorscale='viridis')), row=1, col=2)

fig['layout'].update(height=400, width=1000,paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)', title= 'Count of Respondents');
iplot(fig);


The Indian community on Kaggle is quite large and grows every year.

Compared with other countries, India has the highest number of respondents this year.

Almost 6000 Indians took part in the survey this year, and that number has increased every year.

Fun fact: the world's first 4x Kaggle Grandmaster is an Indian, [@Abhishek Thakur](https://www.kaggle.com/abhishek).

In [None]:
fig = plot_gender_vars(var1='Age_', title_name='age', avg=False)
iplot(fig);

* The `18-21` age group seems to make up the largest share of Indian Kagglers, and it grows every year. This is most likely because Kaggle is getting more popular among Indian undergraduates.
* Male Kagglers outnumber female Kagglers in every age group, but the number of women increases every year, which is a good sign.
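
The age groups on the x-axis come from the survey's categorical answers, except in 2017, where numeric ages are binned by `get_age`. As a hedged alternative sketch, `pd.cut` can do the same binning for integer ages in one call. The `edges`, `labels` and sample `ages` below are assumptions for the example:

```python
import numpy as np
import pandas as pd

# Bin edges are right-inclusive: (17, 21] -> '18-21', (21, 24] -> '22-24', ...
edges = [17, 21, 24, 29, 34, 39, 44, 49, 54, 59, 69, np.inf]
labels = ['18-21', '22-24', '25-29', '30-34', '35-39', '40-44',
          '45-49', '50-54', '55-59', '60-69', '70+']

# Made-up ages; 16 falls below the first bin and NaN stays missing.
ages = pd.Series([18, 21, 22, 29, 45, 70, 16, np.nan])
age_groups = pd.cut(ages, bins=edges, labels=labels)
print(age_groups.tolist())
```

Unlike `get_age`, this also assigns a bin to fractional ages such as 21.5, so the two only agree on whole-number ages.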

In [None]:
fig = plot_gender_vars(var1='Age_', title_name='age', avg=True)
iplot(fig);



* Surprisingly, the share of every age group except `18-21` is decreasing over time. That could be because of the influx of young Indians to Kaggle every year.

* It seems that in 2020, proportionally, there are more new female Kagglers than male ones.
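
The two charts above differ only in whether counts are normalised within each year. Here is a small sketch with made-up numbers (not survey data) showing how a group can grow in absolute terms while its share shrinks:

```python
import pandas as pd

# Made-up age groups for two hypothetical survey years.
ages_a = pd.Series(['18-21'] * 30 + ['22-24'] * 50 + ['25-29'] * 20)
ages_b = pd.Series(['18-21'] * 120 + ['22-24'] * 100 + ['25-29'] * 30)

# Raw counts per group and year.
counts = pd.DataFrame({
    'year_a': ages_a.value_counts(),
    'year_b': ages_b.value_counts(),
}).sort_index()

# Shares per group within each year (each column sums to 1).
shares = pd.DataFrame({
    'year_a': ages_a.value_counts(normalize=True),
    'year_b': ages_b.value_counts(normalize=True),
}).sort_index()

print(counts)
print(shares.round(2))
```

In this toy data the `22-24` group doubles in size but drops from half of the respondents to 40%, which is the kind of effect the rate chart is meant to expose.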

In [None]:
# Shorten the longest degree label so it fits on the plot axis
long_label = 'Some college/university study without earning a bachelor’s degree'
for df in [df_choice_17, df_choice_18, df_choice_19, df_choice_20]:
    df.loc[df['Degree'] == long_label, 'Degree'] = "College w/o bachelor's degree"

In [None]:
fig = plot_var(var1='Degree', title_name='degree')
iplot(fig);

* Almost 50% of Indian Kagglers have a bachelor's degree.
* Next come master's degree holders, at almost 30%.
* There are few Indian PhD holders on Kaggle, but I think their number will grow over the years because some industries now seek PhD-level expertise in data science.

In [None]:
fig = plot_var_salary(var1='Salary', title_name='salary')
iplot(fig);

When we talk about salaries, it is important to keep several things in mind:
- some Indians work in the USA and other countries and, as a result, have quite high salaries;
- compared to other countries, data scientists and ML professionals in India are paid less;
- only a few data scientists from India work in more developed nations and are paid in USD or euros, so very few Indians earn more than 30k USD per annum.

* More than 50k is possible but difficult to achieve; this is usually the level of a head of DS.
    * 30k-40k is the level of a senior DS.
    * ~20k is the level of a mid-level DS.
* Not surprisingly, most Indians earn less than 10k USD because they are freshers or have less than 2 years of experience.
* By the way, did you notice some rich people with $500k+? I wonder who those people are :)

## Professional skills

### Resources for learning Data Science

In [None]:
media_source_q = 'Who/what are your favorite media sources that report on data science topics? (Select all that apply)'
fig = plot_choice_var(var=media_source_q, title_name='resources')
iplot(fig);

* The majority of Indians use Kaggle, YouTube, and blogs to learn data science.
* It isn't surprising that Kaggle is the main source of information :)
* After Kaggle, most Indian Kagglers consider YouTube and blogs their favourite media resources for learning data science.
* MOOCs, Twitter influencers, and newsletters are also used by a few Indians to gain information.

### Popular programming languages among Indian Kagglers

In [None]:
language_q = 'What programming languages do you use on a regular basis? (Select all that apply)'
fig = plot_choice_var(var=language_q, title_name='languages')
iplot(fig);

There are several interesting things about popular programming languages in India:

- Python and SQL are, of course, the most used languages;
- Python is undoubtedly the most used programming language among Kagglers because it is the go-to language for data scientists in industry;
- R isn't popular. It is probably used by a small section of researchers or students;
- a lot of Kagglers have a background in software programming; as a result, many know C++, C, Java, and other languages.
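
Charts like the one above are built from multi-select questions, where each answer option has its own column and a cell is empty when the option was not ticked. This is a minimal sketch of that counting on a made-up frame. It counts non-missing cells, which is a simplification of what `plot_choice_var` does (it matches cell values against the option text); the column names and rows are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical multi-select block: one column per option, NaN when not selected.
q = 'What programming languages do you use on a regular basis?'
demo = pd.DataFrame({
    f'{q} - Selected Choice - Python': ['Python', 'Python', np.nan, 'Python'],
    f'{q} - Selected Choice - SQL': [np.nan, 'SQL', np.nan, 'SQL'],
    f'{q} - Selected Choice - R': [np.nan, np.nan, 'R', np.nan],
})

# Pick the columns that belong to the question, then count ticked cells.
cols = [c for c in demo.columns if q in c]
counts = demo[cols].notna().sum()

# Use the option name (text after the last ' - ') as the label.
counts.index = [c.split(' - ')[-1] for c in cols]
counts = counts.sort_values(ascending=False)
print(counts.to_dict())
```

Because respondents can tick several options, these counts can add up to more than the number of respondents.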

### Popular ML libraries

In [None]:
pop_lib = 'Which of the following machine learning frameworks do you use on a regular basis?'
fig = plot_choice_var(var=pop_lib, title_name='libraries')
iplot(fig);

While scikit-learn is the most popular library for obvious reasons, there are other interesting libraries.
- TensorFlow and Keras are the go-to libraries for deep learning enthusiasts.
- LightGBM and XGBoost are very popular as they provide great results and are easy to use. CatBoost is also gaining popularity. I am sure it will be more popular next year.
- PyTorch is a bit behind. It looks like Indians prefer TensorFlow to PyTorch.

### Additional Activities that are part of work

In [None]:
additional_activity_q = 'Select any activities that make up an important part of your role at work'
fig = plot_choice_var(var= additional_activity_q, title_name='additional activities')
iplot(fig);


* It is interesting that most people spend time analysing and understanding data that is important for the business.

* Many build prototypes as `Proof of Concepts` for stakeholders.
* People also experiment and iteratively try to improve existing models.
* Interestingly, some people do research on the state of the art in machine learning. I think this is a major driving force for innovation by Indians.

### NLP Methods

In [None]:
nlp_tools = 'Which of the following natural language processing (NLP) methods do you use on a regular basis'
fig = plot_choice_var(var=nlp_tools, title_name='nlp tools')
iplot(fig);

* Not surprisingly, word embeddings continue to be popular. They often don't require a GPU at all or can be used in small neural nets (not requiring huge resources).
* Encoder-decoder methods are gaining popularity.
* On the other hand, BERT and other transformers make it rather easy to get great results, provided of course you have enough resources.

### Popular ML tools

In [None]:
pop_ml = 'Which categories of automated machine learning tools (or partial AutoML tools) do you hope to become more familiar with in the next 2 years'
fig = plot_choice_var(var= pop_ml, title_name='ml tools')
iplot(fig);

* I work in a startup, and I completely understand that the industry is now looking for automation of full ML pipelines; hence it is the most popular ML tool category.

* Automated model selection and automated feature engineering are major parts of automating full ML pipelines; hence these are the second and third most popular tools.

* Hyperparameter optimization tools like Optuna and HyperOpt are also gaining popularity.

### To be Continued..