**About the Notebook:**

This Kaggle Python notebook explores the results of the 2019 Kaggle Data Science and Machine Learning Survey, one of the most comprehensive surveys of its kind, and uses visualizations to make the findings easier to understand.


**About this survey:**

* This survey received 19,717 usable responses from 171 countries and territories. If a country or territory had fewer than 50 respondents, we grouped them into a group named “Other” for anonymity.
* Most of our respondents were found primarily through Kaggle channels, like our email list, discussion forums and social media channels.
* The survey was live from October 8th to October 28th. We allowed respondents to complete the survey at any time during that window. The median response time for those who participated in the survey was approximately 10 minutes.
* Not every question was shown to every respondent. You can learn more about the different segments we used in the survey_schema.csv file. In general, respondents with more experience were asked more questions and respondents with less experience were asked fewer questions.
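
The grouping rule from the first bullet can be sketched with pandas. This is only an illustration on made-up data: the helper name `group_rare`, its parameters, and the sample counts are assumptions, not the code Kaggle used to prepare the survey.

```python
import pandas as pd

def group_rare(values: pd.Series, min_count: int = 50, other_label: str = "Other") -> pd.Series:
    """Replace categories that occur fewer than `min_count` times with `other_label`."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; everything else becomes `other_label`
    return values.where(~values.isin(rare), other_label)

# Made-up example: 60 responses from "A", 3 from "B", 2 from "C"
sample = pd.Series(["A"] * 60 + ["B"] * 3 + ["C"] * 2)
grouped = group_rare(sample)
print(grouped.value_counts())
```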

## Let us now get into the findings!

## List of dataset files available

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # For plotting
import seaborn as sns # For plotting

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
sns.set(rc={'figure.figsize':(12,8)})
sns.set(style="darkgrid")

In [None]:
multiple_choice_df = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')
questions_df = pd.read_csv('/kaggle/input/kaggle-survey-2019/questions_only.csv')
survey_schema_df = pd.read_csv('/kaggle/input/kaggle-survey-2019/survey_schema.csv')
text_res_df = pd.read_csv('/kaggle/input/kaggle-survey-2019/other_text_responses.csv')

### Glimpse of Kaggle Survey dataset
Let's get an overview of the given dataset.

In [None]:
print(f'Shape of multiple choice responses: {multiple_choice_df.shape}')
print(f'Shape of questions only: {questions_df.shape}')
print(f'Shape of survey schema: {survey_schema_df.shape}')
print(f'Shape of text responses: {text_res_df.shape}')

### Description of each file given above:

1. **multiple_choice_responses.csv**: Multiple choice single response questions fit into individual columns whereas multiple choice multiple response questions were split into multiple columns. Text responses were encoded to protect user privacy and countries with fewer than 50 respondents were grouped into the category "other".

2. **questions_only.csv**: The list of questions from the 2019 Kaggle Data Science and Machine Learning Survey 

3. **survey_schema.csv**: Survey schema describing which questions were presented to which respondents. In general, respondents with more experience were asked more questions than respondents with less experience.

4. **other_text_responses.csv**: If "Other" is selected in any question, there is also an option to provide a free-form text response. These text responses were separated and shuffled to protect user privacy.

In [None]:
multiple_choice_df.head()

In [None]:
questions_df.head()

In [None]:
survey_schema_df.head()

In [None]:
text_res_df.head()

### List of asked questions in the survey

In [None]:
from IPython.display import Markdown, display
for index, value in questions_df.iloc[0].items():
    display(Markdown(f'**{index}:** {value}'))

## Exploratory Data Analysis
Let's explore the Kaggle survey data across different features using some visualizations.

### # of respondents for all the questions

In [None]:
sns.set(rc={'figure.figsize':(20,8)})
sns.set(style="darkgrid")
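# Row 1 of the schema holds the respondent count per question; the first and last columns are excluded
ax = sns.barplot(x=survey_schema_df.columns.values[1:-1], y=survey_schema_df.iloc[1][1:-1].astype('int64').values)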
ax.set(xlabel='Questions', ylabel='# of responses', title='Questions vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
plt.show()

### Age group wise response distribution

#### Question 1: What is your age (# of years)?

In [None]:
total = int(survey_schema_df.iloc[1]['Q1'])
# Drop the last row, which holds the question text rather than a response
age_counts = multiple_choice_df.groupby(['Q1']).size().reset_index(name='counts')[:-1]
ax = sns.barplot(x='Q1', y='counts', data=age_counts)
ax.set(xlabel='Age groups', ylabel='# of responses', title='Age groups vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")

for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 

plt.show()

The 25-29 age group has the largest number of respondents.
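
The loop that writes a percentage above each bar is repeated in every chart of this notebook. One way to avoid the repetition is a small helper, sketched here on made-up numbers. The name `annotate_percentages` and its parameters are hypothetical and not part of the original notebook.

```python
import matplotlib.pyplot as plt

def annotate_percentages(ax, total, offset=5, fontsize=15):
    """Write each bar's share of `total` (in %) above the bar and return the labels."""
    labels = []
    for patch in ax.patches:
        height = patch.get_height()
        label = f"{height / total * 100:1.2f}%"
        ax.text(patch.get_x() + patch.get_width() / 2.0, height + offset,
                label, ha="center", fontsize=fontsize)
        labels.append(label)
    return labels

# Made-up counts for three age groups, 100 respondents in total
fig, ax = plt.subplots()
ax.bar(["18-21", "22-24", "25-29"], [20, 30, 50])
labels = annotate_percentages(ax, total=100)
print(labels)
```

With such a helper, each chart cell would only need to pass its own `ax` and the respondent total for that question.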

### Gender wise response distribution 

#### Question 2: What is your gender?

In [None]:
sns.set(rc={'figure.figsize':(9,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q2'])
# Drop the last row, which holds the question text rather than a response
gender_counts = multiple_choice_df.groupby(['Q2']).size().reset_index(name='counts')[:-1]
ax = sns.barplot(x='Q2', y='counts', data=gender_counts)
ax.set(xlabel='Gender', ylabel='# of responses', title='Gender group vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
plt.show()

Most respondents are male.

### Country wise response distribution 

#### Question 3: In which country do you currently reside?

In [None]:
sns.set(rc={'figure.figsize':(25,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q3'])
# Skip row 0 (the question text) so it is not plotted as a country
country_counts = multiple_choice_df.iloc[1:].groupby(['Q3']).size().reset_index(name='counts')
ax = sns.barplot(x='Q3', y='counts', data=country_counts, color='b')
ax.set(xlabel='Country', ylabel='# of responses', title='Country vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")

for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    if height/total*100 > 2: # Only plot the % if it's more than 2% in this case
        ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
plt.show()

The largest shares of responses come from India (24.32%) and the USA (15.68%).
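
As a cross-check on percentages like these, the share of each category can also be computed directly with `value_counts(normalize=True)`. The sketch below uses made-up answers. Note that this normalizes by the number of non-null answers in the column, which may differ slightly from the respondent total taken from `survey_schema.csv` in the cell above.

```python
import pandas as pd

# Made-up answers; None stands for a respondent who skipped the question
answers = pd.Series(["India", "India", "USA", "Brazil", None])

# value_counts ignores missing values by default
shares = (answers.value_counts(normalize=True) * 100).round(2)
print(shares)
```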

### Distribution of Educational background

#### Question 4: What is the highest level of formal education that you have attained or plan to attain within the next 2 years?

In [None]:
sns.set(rc={'figure.figsize':(12,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q4'])
# Drop the last row, which holds the question text rather than a response
education_counts = multiple_choice_df.groupby(['Q4']).size().reset_index(name='counts')[:-1]
ax = sns.barplot(x='Q4', y='counts', data=education_counts)
ax.set(xlabel='Educational Qualification', ylabel='# of responses', title='Educational Qualification vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
plt.show()

About 44% of respondents have attained or plan to attain a master's degree.

### Responsibility/Current Role wise distribution

#### Question 5: Select the title most similar to your current role (or most recent title if retired).

In [None]:
sns.set(rc={'figure.figsize':(12,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q5'])
# Drop row 9, which holds the question text rather than a response
role_counts = multiple_choice_df.groupby(['Q5']).size().reset_index(name='counts').drop(9, axis=0)
ax = sns.barplot(x='Q5', y='counts', data=role_counts)
ax.set(xlabel='Job Title', ylabel='# of responses', title='Job Title vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
plt.show()

Students and data scientists are the most common roles among respondents.

### Distribution of company size of respondents

#### Question 6: What is the size of the company where you are employed?

In [None]:
sns.set(rc={'figure.figsize':(12,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q6'])
# Drop the last row, which holds the question text rather than a response
company_counts = multiple_choice_df.groupby(['Q6']).size().reset_index(name='counts')[:-1]
ax = sns.barplot(x='Q6', y='counts', data=company_counts)
ax.set(xlabel='Company size', ylabel='# of responses', title='Company size vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
plt.show()

The largest group of respondents works at companies with 0-49 employees, which suggests that many of them may be at startups.

### Distribution of number of individuals working on ML/DS

#### Question 7: Approximately how many individuals are responsible for data science workloads at your place of business?

In [None]:
sns.set(rc={'figure.figsize':(12,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q7'])
# Drop the last row, which holds the question text rather than a response
team_counts = multiple_choice_df.groupby(['Q7']).size().reset_index(name='counts')[:-1]
ax = sns.barplot(x='Q7', y='counts', data=team_counts)
ax.set(xlabel='Data science team size', ylabel='# of responses', title='Data science team size vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
plt.show()

### Question 8: Does your current employer incorporate machine learning methods into their business?

In [None]:
sns.set(rc={'figure.figsize':(12,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q8'])
# Drop the first row, which holds the question text rather than a response
ml_usage_counts = multiple_choice_df.groupby(['Q8']).size().reset_index(name='counts')[1:]
ax = sns.barplot(x='Q8', y='counts', data=ml_usage_counts)
ax.set(xlabel='Is ML used?', ylabel='# of responses', title='ML usage vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
plt.show()

### Distribution of compensation of respondents

#### Question 10: What is your current yearly compensation (approximate $USD)?

In [None]:
sns.set(rc={'figure.figsize':(20,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q10'])
# Drop the last row, which holds the question text rather than a response
compensation_counts = multiple_choice_df.groupby(['Q10']).size().reset_index(name='counts')[:-1]
ax = sns.barplot(x='Q10', y='counts', data=compensation_counts)
ax.set(xlabel='Compensation group', ylabel='# of responses', title='Compensation group vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="right")
for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=11) 
plt.show()

### Distribution of amount spent ($ USD) on ML/DS

#### Question 11: Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?

In [None]:
sns.set(rc={'figure.figsize':(15,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q11'])
# Drop the last row, which holds the question text rather than a response
spend_counts = multiple_choice_df.groupby(['Q11']).size().reset_index(name='counts')[:-1]
ax = sns.barplot(x='Q11', y='counts', data=spend_counts)
ax.set(xlabel='Amount spent', ylabel='# of responses', title='Amount spent vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
plt.show()

The largest share of respondents (~33.05%) has not spent any money on ML or cloud computing products at work.

### Question 15: How long have you been writing code to analyze data (at work or at school)?

In [None]:
sns.set(rc={'figure.figsize':(15,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q15'])
# Drop row 6, which holds the question text rather than a response
coding_counts = multiple_choice_df.groupby(['Q15']).size().reset_index(name='counts').drop(6, axis=0)
ax = sns.barplot(x='Q15', y='counts', data=coding_counts)
ax.set(xlabel='# of years of writing code', ylabel='# of responses', title='# of years of writing code vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
plt.show()

The largest group of respondents has been writing code for 1-2 years.

### Question 19: What programming language would you recommend an aspiring data scientist to learn first?

In [None]:
sns.set(rc={'figure.figsize':(15,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q19'])
# Drop the last row, which holds the question text rather than a response
language_counts = multiple_choice_df.groupby(['Q19']).size().reset_index(name='counts')[:-1]
ax = sns.barplot(x='Q19', y='counts', data=language_counts)
ax.set(xlabel='Recommended Programming Language', ylabel='# of responses', title='Recommended Programming Language vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
plt.show()

Python is the most recommended first programming language for aspiring data scientists.

### Question 22: Have you ever used a TPU (tensor processing unit)?

In [None]:
sns.set(rc={'figure.figsize':(15,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q22'])
# Drop row 3, which holds the question text rather than a response
tpu_counts = multiple_choice_df.groupby(['Q22']).size().reset_index(name='counts').drop(3, axis=0)
ax = sns.barplot(x='Q22', y='counts', data=tpu_counts)
ax.set(xlabel='TPU usage', ylabel='# of responses', title='TPU usage vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
plt.show()

More than 81% of respondents have never used a TPU.

### Question 23: For how many years have you used machine learning methods?

In [None]:
sns.set(rc={'figure.figsize':(15,8)})
sns.set(style="darkgrid")

total = int(survey_schema_df.iloc[1]['Q23'])
# Drop the last row, which holds the question text rather than a response
ml_years_counts = multiple_choice_df.groupby(['Q23']).size().reset_index(name='counts')[:-1]
ax = sns.barplot(x='Q23', y='counts', data=ml_years_counts)
ax.set(xlabel='ML methods usage', ylabel='# of responses', title='ML methods vs. # of responses')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches: # loop to all objects and plot group wise % distribution
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
plt.show()

The largest group of respondents has been using ML methods for less than 1 year.

## Exploring questions with multiple choices and their response behavior

Many questions in this survey allowed respondents to select one or more options. Responses to such questions are recorded as separate columns in the dataset. Let's first look at the columns of `multiple_choice_responses.csv`.

In [None]:
multiple_choice_df.columns.values
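
Because a multi-select question is spread over columns named `<question>_Part_<n>`, the number of parts can be read from the column names instead of being hard-coded. Below is a minimal sketch on a made-up, empty frame. The helper name `count_parts` is hypothetical, and the toy column list (including the free-text column) is an assumption for illustration only.

```python
import re
import pandas as pd

def count_parts(df: pd.DataFrame, question: str) -> int:
    """Count the columns of `df` named '<question>_Part_<n>'."""
    pattern = re.compile(rf"^{re.escape(question)}_Part_\d+$")
    return sum(1 for col in df.columns if pattern.match(col))

# Made-up columns: one single-choice question, one multi-select question with
# three parts plus a free-text column, and another single-choice question
toy = pd.DataFrame(columns=["Q1", "Q9_Part_1", "Q9_Part_2", "Q9_Part_3", "Q9_OTHER_TEXT", "Q19"])
print(count_parts(toy, "Q9"))  # 3
print(count_parts(toy, "Q1"))  # 0
```

Columns that do not match the pattern, such as the free-text column in the toy frame, are skipped on purpose.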

In [None]:
def plot_multi_choice_dist(question, number_of_parts, xlabel, ylabel):
    sns.set(rc={'figure.figsize':(15,8)})
    sns.set(style="darkgrid")

    cats, counts = [], []
    total = int(survey_schema_df.iloc[1][question])

    for i in range(number_of_parts):
        # Row 0 holds the question text, so skip it and keep only actual selections
        selections = multiple_choice_df[f'{question}_Part_{i+1}'].iloc[1:].dropna()
        cats.append(selections.unique()[0])  # option label taken from the column's values
        counts.append(selections.shape[0])   # respondents who picked it (question row excluded)
    ax = sns.barplot(x=cats, y=counts)
    ax.set(xlabel=xlabel, ylabel=ylabel, title=f'{xlabel} vs. {ylabel}')
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
    for p in ax.patches: # loop to all objects and plot group wise % distribution
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 5,
                '{:1.2f}%'.format(height/total*100),
                ha="center", fontsize=15) 
    return plt

### Responsibility distribution at workplace

### Question 9: Select any activities that make up an important part of your role at work: (Select all that apply)

In [None]:
plot_multi_choice_dist('Q9', 8, 'Responsibilities', '# of responses').show()

### Distribution of learning sources

#### Question 12: Who/what are your favorite media sources that report on data science topics? (Select all that apply)

In [None]:
plot_multi_choice_dist('Q12', 12, 'ML/DS Learning Source', '# of responses').show()

### Distribution of learning platforms

### Question 13: On which platforms have you begun or completed data science courses? (Select all that apply).

In [None]:
plot_multi_choice_dist('Q13', 12, 'ML/DS Learning Platform', '# of responses').show()

Coursera appears to be the most used platform for data science courses among respondents.

### Distribution of IDEs used on a regular basis

### Question 16: Which of the following integrated development environments (IDE's) do you use on a regular basis? (Select all that apply).

In [None]:
plot_multi_choice_dist('Q16', 12, 'IDEs used', '# of responses').show()

Most respondents use Jupyter as an IDE.

### Distribution of usage of hosted notebook products

### Question 17: Which of the following hosted notebook products do you use on a regular basis? (Select all that apply)

In [None]:
plot_multi_choice_dist('Q17', 12, 'Hosted notebook platform', '# of responses').show()

### Distribution of used programming languages

### Question 18: What programming languages do you use on a regular basis? (Select all that apply).

In [None]:
plot_multi_choice_dist('Q18', 12, 'Programming languages', '# of responses').show()

Python is the most used language among data science aspirants.

### Distribution of usage of visualization tools/libraries

### Question 20: What data visualization libraries or tools do you use on a regular basis? (Select all that apply).

In [None]:
plot_multi_choice_dist('Q20', 12, 'Visualization tools/libraries', '# of responses').show()

Matplotlib is the most used visualization library among data science aspirants.

### Distribution of usage of specialized hardware for ML/DS

### Question 21: Which types of specialized hardware do you use on a regular basis? (Select all that apply)

In [None]:
plot_multi_choice_dist('Q21', 5, 'Specialized hardware', '# of responses').show()

### Distribution of used ML algorithms

### Question 24: Which of the following ML algorithms do you use on a regular basis? (Select all that apply).

In [None]:
plot_multi_choice_dist('Q24', 12, 'ML algorithms', '# of responses').show()

### Distribution of used ML tools

### Question 25: Which categories of ML tools do you use on a regular basis? (Select all that apply)

In [None]:
plot_multi_choice_dist('Q25', 8, 'ML tools', '# of responses').show()

### Distribution of computer vision methods used

### Question 26: Which categories of computer vision methods do you use on a regular basis? (Select all that apply).

In [None]:
plot_multi_choice_dist('Q26', 7, 'Computer Vision methods', '# of responses').show()

### Distribution of NLP (Natural Language Processing) methods used

### Question 27: Which of the following natural language processing (NLP) methods do you use on a regular basis? (Select all that apply)

In [None]:
plot_multi_choice_dist('Q27', 6, 'NLP methods', '# of responses').show()

### Distribution of Machine Learning frameworks used

### Question 28: Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply).

In [None]:
plot_multi_choice_dist('Q28', 12, 'ML Frameworks', '# of responses').show()

### Distribution of used Cloud Computing platforms

### Question 29: Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply)

In [None]:
plot_multi_choice_dist('Q29', 12, 'Cloud computing platforms', '# of responses').show()

### Distribution of cloud computing products

### Question 30: Which specific cloud computing products do you use on a regular basis? (Select all that apply).

In [None]:
plot_multi_choice_dist('Q30', 12, 'Cloud computing products', '# of responses').show()

### Distribution of big data / analytics products used

### Question 31: Which specific big data / analytics products do you use on a regular basis? (Select all that apply)

In [None]:
plot_multi_choice_dist('Q31', 12, 'Big data / analytics products', '# of responses').show()

### Distribution of used Machine Learning products

### Question 32: Which of the following machine learning products do you use on a regular basis? (Select all that apply).

In [None]:
plot_multi_choice_dist('Q32', 12, 'Machine Learning products', '# of responses').show()

### Distribution of automated ML Tools used

### Question 33: Which automated machine learning tools (or partial AutoML tools) do you use on a regular basis? (Select all that apply) 

In [None]:
plot_multi_choice_dist('Q33', 12, 'Automated ML Tools', '# of responses').show()

### Distribution of relational database products used

### Question 34: Which of the following relational database products do you use on a regular basis? (Select all that apply).

In [None]:
plot_multi_choice_dist('Q34', 12, 'Relational Database products', '# of responses').show()

## Your suggestions are appreciated. Let me know if I've missed anything. Thank you!