---
# Kaggle Data Science Survey 2021

---

**Problem Statement:**

* Kaggle set out to conduct an industry-wide survey in 2021 that presents a truly comprehensive view of the state of data science and machine learning. The survey was live from 09/01/2021 to 10/04/2021, and after cleaning the data, 25,973 responses were captured.
 * The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners.
 * That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). 
 
---
 
**Main Dataset:**

* **kaggle_survey_2021_responses.csv** file includes 42+ questions and 25,973 responses.
 * Responses to multiple choice questions (only a single choice can be selected) were recorded in individual columns. 
 * Responses to multiple selection questions (multiple choices can be selected) were split into multiple columns (with one column per answer choice).
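
To make the column layout described above concrete, here is a minimal sketch on a toy frame. The column names (`Q8`, `Q7_Part_1`, ...) and the answers are illustrative assumptions, not values copied from the real file.

```python
import pandas as pd

# Toy frame mimicking the layout described above (names and values are assumptions).
toy = pd.DataFrame({
    "Q8": ["Python", "R", "Python"],           # single choice: one column
    "Q7_Part_1": ["Python", None, "Python"],   # multiple selection:
    "Q7_Part_2": [None, "R", "R"],             # one column per answer choice
    "Q7_Part_3": ["SQL", None, None],
})

# A multi-select column holds the choice label when selected and is empty otherwise,
# so counting non-null cells gives the number of respondents per answer choice.
q7_cols = [c for c in toy.columns if c.startswith("Q7_Part_")]
selection_counts = toy[q7_cols].notna().sum()
print(selection_counts)
```

Counting per column like this keeps every answer choice separate, which is handy before any columns are merged.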

**Supplementary Data:**
* **kaggle_survey_2021_answer_choices.pdf:** list of answer choices for every question
 * With footnotes describing which questions were asked to which respondents.
* **kaggle_survey_2021_methodology.pdf:** a description of how the survey was conducted
 * You can ask additional questions by posting in the pinned Q&A thread.
---

---
**Importing Libraries:**

* We use Python for data pre-processing and data analysis.

* Import the Python libraries needed to load and explore the data; other libraries can be imported later as needed.

---

In [None]:
# NumPy for numerical computing
import numpy as np
# pandas for data loading and analysis
import pandas as pd
# glob finds all the pathnames matching a specified pattern
import glob
# os provides a portable way of using operating system dependent functionality
import os
# pyplot interface of matplotlib for plotting
import matplotlib.pyplot as plt
# seaborn for statistical data visualization
import seaborn as sns
# WordCloud for text data visualization
from wordcloud import WordCloud
# top-level matplotlib package for plots
import matplotlib
# datetime for working with dates and times
from datetime import datetime
# Plotly Express for interactive visualization
import plotly.express as px

---
# Data Definition/Description of Kaggle Survey 2021 responses

---

In [None]:
# Loading dataset kaggle_survey_2021_responses.csv
kaggle_survey_2021_data = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv',low_memory=False)

---
**Data Definition:**

Kaggle Survey 2021 Responses:
* The 2021 Kaggle DS & ML Survey received 25,973 usable responses from participants in 171 different countries and territories.
* Responses to multiple choice questions (only a single choice can be selected) were recorded in individual columns. 
* Responses to multiple selection questions (multiple choices can be selected) were split into multiple columns (with one column per answer choice).
* The survey data was released under a CC BY 2.0 license:
https://creativecommons.org/licenses/by/2.0/

---

---
**The questions of the Kaggle Survey 2021, as listed in the main dataset and supplementary data:**

---
| No. | Question | Description of the Question |
| :-- | :--| :--| 
|01| **Q1**   | What is your age (# years)?|
|02| **Q2** | What is your gender?|
|03| **Q3**   | In which country do you currently reside?|
|04| **Q4** | What is the highest level of formal education that you have attained or plan to attain within the next 2 years?|
|05| **Q5**   | Select the title most similar to your current role (or most recent title if retired)|
|06| **Q6**   | For how many years have you been writing code and/or programming?|
|07| **Q7**   | What programming languages do you use on a regular basis? (Multiple Choice)|
|08| **Q8**   | What programming language would you recommend an aspiring data scientist to learn first?|
|09| **Q9**   | Which of the following integrated development environments (IDE's) do you use on a regular basis? (Multiple Choice)|
|10| **Q10**   | Which of the following hosted notebook products do you use on a regular basis? (Multiple Choice)|
|11| **Q11**   | What type of computing platform do you use most often for your data science projects?|
|12| **Q12**   | Which types of specialized hardware do you use on a regular basis? (Multiple Choice)|
|13| **Q13**   | Approximately how many times have you used a TPU (tensor processing unit)?|
|14| **Q14**   | What data visualization libraries or tools do you use on a regular basis? (Multiple Choice)|
|15| **Q15**   | For how many years have you used machine learning methods?|
|16| **Q16**   | Which of the following machine learning frameworks do you use on a regular basis? (Multiple Choice)|
|17| **Q17**   | Which of the following ML algorithms do you use on a regular basis? (Multiple Choice)|
|18| **Q18**   | Which categories of computer vision methods do you use on a regular basis? (Multiple Choice)|
|19| **Q19**   | Which of the following natural language processing (NLP) methods do you use on a regular basis? (Multiple Choice)|
|20| **Q20**   | In what industry is your current employer/contract (or your most recent employer if retired)?|
|21| **Q21**   | What is the size of the company where you are employed?|
|22| **Q22**   | Approximately how many individuals are responsible for data science workloads at your place of business?|
|23| **Q23**   | Does your current employer incorporate machine learning methods into their business?|
|24| **Q24**   | Select any activities that make up an important part of your role at work: (Multiple Choice)|
|25| **Q25**   | What is your current yearly compensation (approximate USD) ?|
|26| **Q26**   | Approximately how much money have you (or your team) spent on machine learning and/or cloud computing services at home (or at work) in the past 5 years (approximate USD)?|
|27| **Q27-A**   | Which of the following cloud computing platforms do you use on a regular basis? (Multiple Choice)|
|28| **Q28**   | Of the cloud platforms that you are familiar with, which has the best developer experience (most enjoyable to use)?|
|29| **Q29-A**   | Do you use any of the following cloud computing products on a regular basis? (Multiple Choice)|
|30| **Q30-A**   | Do you use any of the following data storage products on a regular basis? (Multiple Choice)|
|31| **Q31-A**   | Do you use any of the following managed machine learning products on a regular basis? (Multiple Choice)|
|32| **Q32-A**   | Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis? (Multiple Choice)|
|33| **Q33-A**   | Which of the following big data products (relational database, data warehouse, data lake, or similar) do you use most often?|
|34| **Q34-A**   | Which of the following business intelligence tools do you use on a regular basis? (Multiple Choice)|
|35| **Q35**   | Which of the following business intelligence tools do you use most often?|
|36| **Q36-A**   | Do you use any automated machine learning tools (or partial AutoML tools) on a regular basis? (Multiple Choice)|
|37| **Q37-A**   | Which of the following automated machine learning tools (or partial AutoML tools) do you use on a regular basis? (Multiple Choice)|
|38| **Q38-A**   | Do you use any tools to help manage machine learning experiments? (Multiple Choice)|
|39| **Q39**   | Where do you publicly share or deploy your data analysis or machine learning applications? (Multiple Choice)|
|40| **Q40**   | On which platforms have you begun or completed data science courses? (Multiple Choice)|
|41| **Q41**   | What is the primary tool that you use at work or school to analyze data? |
|42| **Q42**   | Who/what are your favorite media sources that report on data science topics? (Multiple Choice) |

---
**Supplementary Questions:**

---

| No. | Question | Description of the Question |
| :-- | :--| :--| 
|01| **Q27-B**   | Which of the following cloud computing platforms do you hope to become more familiar with in the next 2 years? (Multiple Choice)|
|02| **Q29-B**   | In the next 2 years, do you hope to become more familiar with any of these specific cloud computing products? (Multiple Choice)|
|03| **Q30-B**   | In the next 2 years, do you hope to become more familiar with any of these specific data storage products? (Multiple Choice) |
|04| **Q31-B**   | In the next 2 years, do you hope to become more familiar with any of these managed machine learning products? (Multiple Choice)|
|05| **Q32-B**   | Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you hope to become more familiar with in the next 2 years? (Multiple Choice)|
|06| **Q34-B**   | Which of the following business intelligence tools do you hope to become more familiar with in the next 2 years? (Multiple Choice)|
|07| **Q36-B**   | Which categories of automated machine learning tools (or partial AutoML tools) do you hope to become more familiar with in the next 2 years? (Multiple Choice)|
|08| **Q37-B**   | Which specific automated machine learning tools (or partial AutoML tools) do you hope to become more familiar with in the next 2 years? (Multiple Choice)|
|09| **Q38-B**   | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Multiple Choice)|

In [None]:
# get shape of dataframe
print('Shape of kaggle survey 2021 responses dataset is:', kaggle_survey_2021_data.shape)

# print summary of dataframe
kaggle_survey_2021_data.info()

---
**Q: What does the data look like in the Kaggle Survey 2021 responses dataset?**

---

In [None]:
# print first 10 rows of dataframe
kaggle_survey_2021_data.head(10)
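
Later cells slice the frame with `[1:]`, which suggests that the first row is not a response. A minimal sketch of separating such a row from the responses, assuming it holds the question text (the toy values below are illustrative):

```python
import pandas as pd

# Toy frame: the first row carries the question text, the rest are responses (assumption).
raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "25-29", "18-21"],
    "Q2": ["What is your gender?", "Man", "Woman"],
})

# Keep the question text for reference and work with the responses only.
questions = raw.iloc[0]
responses = raw.iloc[1:].reset_index(drop=True)

print(questions["Q2"])
print(responses)
```

Splitting once up front would avoid repeating the `[1:]` slice in every plotting cell.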

---
**Q: What are the descriptive statistics for the Kaggle Survey 2021 responses dataset?**

---

In [None]:
# print descriptive statistics for all columns, including object (text) columns
kaggle_survey_2021_data.describe(include='all')

In [None]:
# replace missing values with an empty string
kaggle_survey_2021_data.fillna("", inplace=True)

Combine the answer columns of each multiple selection question into one single column for EDA.

In [None]:
# questions whose answers are spread over several columns (one per answer choice)
survey_question = [7,9,10,12,14,16,17,18,19,24,33,39,40,42]
for i in survey_question:
    question = 'Q'+ str(i)
    col = kaggle_survey_2021_data.filter(like=question, axis=1).columns
    # combine the values of all answer-choice columns into one string
    survey_Q = kaggle_survey_2021_data[col].apply(''.join, axis=1)
    # create a new column to represent that question
    kaggle_survey_2021_data[question] = survey_Q
    # remove leading/trailing whitespace from the values in the new column
    kaggle_survey_2021_data[question] = kaggle_survey_2021_data[question].str.strip()
    # drop the old answer-choice columns; the combined column is kept even if
    # an original column already had the same name
    kaggle_survey_2021_data.drop(labels=col.difference([question]), axis=1, inplace=True)

In [None]:
# get all column names which has multiple choice
survey_question = [27,29,30,31,32,34,36,37,38]
for i in survey_question:
    question = 'Q'+ str(i)+'_A'
    col= kaggle_survey_2021_data.filter(like=question, axis=1).columns
    # combine the values of all answer-choice columns into one string
    survey_Q = kaggle_survey_2021_data[col].apply(''.join,axis=1)
    # create a new column to represent that question
    kaggle_survey_2021_data[question] = survey_Q
    # remove whitespace character for all the values in new column
    kaggle_survey_2021_data[question]= kaggle_survey_2021_data[question].str.strip()
    # drop old columns which are created for multiple choice
    kaggle_survey_2021_data.drop(labels=col, axis=1, inplace=True)

In [None]:
# get all column names which has multiple choice
survey_question = [27,29,30,31,32,34,36,37,38]
for i in survey_question:
    question = 'Q'+ str(i)+'_B'
    col= kaggle_survey_2021_data.filter(like=question, axis=1).columns
    # combine the values of all answer-choice columns into one string
    survey_Q = kaggle_survey_2021_data[col].apply(''.join,axis=1)
    # create a new column to represent that question
    kaggle_survey_2021_data[question] = survey_Q
    # remove whitespace character for all the values in new column
    kaggle_survey_2021_data[question]= kaggle_survey_2021_data[question].str.strip()
    # drop old columns which are created for multiple choice
    kaggle_survey_2021_data.drop(labels=col, axis=1, inplace=True)
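
The three loops above repeat the same steps. As a hedged sketch, they could be factored into one helper. The function name `combine_choice_columns`, its `sep` parameter and the toy column names are assumptions made for illustration. Unlike the loops above, which join with an empty string, this version puts a separator between choices so they stay distinguishable.

```python
import pandas as pd

def combine_choice_columns(df, prefix, sep="; "):
    """Join the non-empty values of all columns starting with `prefix` (hypothetical helper)."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return df[cols].apply(
        lambda row: sep.join(v.strip() for v in row if isinstance(v, str) and v.strip()),
        axis=1,
    )

# Toy data: empty strings stand for unselected choices, as after fillna("").
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "", "Python"],
    "Q7_Part_2": ["", "R ", "R"],
    "Q27_A_Part_1": [" AWS", "", ""],
})

# The trailing underscore in the prefix keeps "Q7_" from matching other questions.
combined = combine_choice_columns(toy, "Q7_")
print(combined.tolist())
```

A separator makes it possible to split the combined value back into individual choices later, at the cost of slightly longer labels in the charts.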

New set of columns after combining multiple choice columns.

In [None]:
#summary of new columns
kaggle_survey_2021_data.columns

In [None]:
# get shape of dataframe
print('Shape of kaggle survey 2021 responses dataset is:', kaggle_survey_2021_data.shape)

# print summary of dataframe
kaggle_survey_2021_data.info()

A quick look at the descriptive statistics of the new columns:

In [None]:
kaggle_survey_2021_data[1:].describe()

---
# Data Analysis/EDA of Kaggle Survey 2021 responses

---

---
**Data Scientist Profile: Gender**

---

As per Kaggle Executive Summary on State of Machine Learning and Data Science 2021:

* Data Science is still suffering from a large gender gap in the workplace, as 82% of users identify as men.
---

---
**Q: What is the distribution of Gender in survey response?**

---

In [None]:
# plot percentage distribution count of Gender
kaggle_survey_2021_data[1:]['Q2'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.3,0.5], fontsize=14, autopct='%3.1f%%', 
                                               figsize=(10,5), shadow=True, startangle=135, legend=False, cmap='winter')
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = kaggle_survey_2021_data[1:]['Q2'].value_counts().index, loc ='lower left', frameon = True)
plt.show()

Let's use interactive Plotly Express charts to explore what the Data Scientist Profile looks like from the **Gender** (Man and Woman) perspective.

---
**Q: What is the distribution of Gender with respect to Age Group?**

---

In [None]:
gender =["Man", "Woman"]
#distribution of Gender with respect to Age group
fig = px.sunburst(kaggle_survey_2021_data[kaggle_survey_2021_data.Q2.isin(gender)], path=['Q2', 'Q1'], color='Q2', hover_data=['Q1'])
fig.show()

Clicking/touching the "Man" label shows the Age Group distribution for Man only, and clicking/touching the "Woman" label shows it for Woman only. The two distributions are very similar: for both, the majority of the Data Scientist Profile falls between ages 18 and 39.

Note: For Man, the largest Age Group appears to be 25-29; for Woman, it appears to be 18-21.
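
The same comparison can be checked numerically. Below is a minimal sketch using `pd.crosstab` with row normalization on toy data; the values are made up for illustration and do not reproduce the survey figures.

```python
import pandas as pd

# Toy responses (illustrative only): Q2 = gender, Q1 = age group.
toy = pd.DataFrame({
    "Q2": ["Man", "Man", "Man", "Woman", "Woman", "Man"],
    "Q1": ["25-29", "18-21", "25-29", "18-21", "18-21", "30-34"],
})

# Share of each age group within each gender (every row sums to 1).
age_share = pd.crosstab(toy["Q2"], toy["Q1"], normalize="index")

# Largest age group per gender.
top_age_group = age_share.idxmax(axis=1)
print(age_share)
print(top_age_group)
```

Normalizing by row matters here because the two groups differ a lot in size, so raw counts would not be comparable.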

---
**Q: What is the distribution of Gender with respect to Education?**

---

In [None]:
#distribution of Gender with respect to Education
fig = px.sunburst(kaggle_survey_2021_data[kaggle_survey_2021_data.Q2.isin(gender)], path=['Q2','Q4'], color='Q2', hover_data=['Q4'])
fig.show()

Clicking/touching the "Man" label shows the Education distribution for Man only, and clicking/touching the "Woman" label shows it for Woman only. The two distributions are very similar: for both, the majority of the Data Scientist Profile falls under Bachelor's, Master's and Doctoral degrees.

Note: For Man, the shares of Bachelor's and Master's degrees are almost the same; for Woman, a similar pattern appears, with Bachelor's and Master's degrees making up the majority.

---
**Q: What is the distribution of Gender with respect to Job Title?**

---

In [None]:
#distribution of Gender with respect to job title
fig = px.sunburst(kaggle_survey_2021_data[kaggle_survey_2021_data.Q2.isin(gender)], path=['Q2','Q5'], color='Q2', hover_data=['Q5'])
fig.show()

Clicking/touching the "Man" label shows the Job Title distribution for Man only, and clicking/touching the "Woman" label shows it for Woman only. The two distributions are very similar: "Student" is the most common Job Title for both.

Note: For Man, the top Job Titles are Student, followed by Data Scientist and Software Engineer; for Woman, Student and Data Scientist are also on top, but Data Analyst comes next.

---
**Q: What is the distribution of Gender with respect to Programming Experience?**

---

In [None]:
#distribution of Gender with respect to Programming Experience
fig = px.sunburst(kaggle_survey_2021_data[kaggle_survey_2021_data.Q2.isin(gender)], path=['Q2','Q6'], color='Q2', hover_data=['Q6'])
fig.show()

Clicking/touching the "Man" label shows the Programming Experience distribution for Man only, and clicking/touching the "Woman" label shows it for Woman only. The two distributions are very similar: for both, the majority of the Data Scientist Profile falls in the 1-3 years, < 1 years and 3-5 years ranges.

Note: For both Man and Woman, the most common Programming Experience ranges are 1-3 years and < 1 years.

---
**Q: What is the distribution of Gender with respect to Industry?**

---

In [None]:
#distribution of Gender with respect to Industry
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q2.isin(gender)], path=[px.Constant('Gender->Job Title->Industry'), 'Q2','Q5','Q20'], color='Q20', hover_data=['Q2'])
fig.show()

* Click/Touch on Label "Man" will show Job Title distribution for Man.
 * Click/Touch on respective Label for "Job Title" will show Industry distribution for that.
* Click/Touch on Label "Woman" will show Job Title distribution for Woman.
 * Click/Touch on respective Label for "Job Title" will show Industry distribution for that.

The plot above reveals that the Industry distribution is very similar for Man and Woman: Computers/Technology and Academics/Education are the most common industries for both.

---
**Q: What is the distribution of Gender with respect to Compensation?**

---

In [None]:
#distribution of Gender with respect to Compensation
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q2.isin(gender)], path=[px.Constant('Gender->Job Title->Compensation'), 'Q2','Q5','Q25'], color='Q25', hover_data=['Q2'])
fig.show()

* Click/Touch on Label "Man" will show Job Title distribution for Man.
 * Click/Touch on respective Label for "Job Title" will show Compensation distribution for that.
* Click/Touch on Label "Woman" will show Job Title distribution for Woman.
 * Click/Touch on respective Label for "Job Title" will show Compensation distribution for that.

The plot above reveals that the Compensation distribution is very similar for Man and Woman: the 0-999 USD range is the most common for both.

---
**Q: What is the distribution of Gender with respect to Country?**

---

In [None]:
#distribution of Gender with respect to Country
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q2.isin(gender)], path=[px.Constant('Gender->Job Title->Country'), 'Q2', 'Q5','Q3'], color='Q3', hover_data=['Q2'])
fig.show()

* Click/Touch on Label "Man" will show Job Title distribution for Man.
 * Click/Touch on respective Label for "Job Title" will show Country distribution for that.
* Click/Touch on Label "Woman" will show Job Title distribution for Woman.
 * Click/Touch on respective Label for "Job Title" will show Country distribution for that.

The plot above reveals that the Country distribution is very similar for Man and Woman: India and the United States of America are the most common countries for both.

---
**Data Scientist Profile: Education**

---

As per Kaggle Executive Summary on State of Machine Learning and Data Science 2021:

* Graduate degrees continue to be the norm for data scientists, with over 62% having obtained either a Master’s or doctoral degree. Fewer than 5% of data scientists have no degree beyond a high school diploma.
---

---
**Q: What is the distribution of Education in survey response?**

---

In [None]:
# plot percentage distribution count of Education
kaggle_survey_2021_data[1:]['Q4'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1,0.1,0.1,0.1], fontsize=14, autopct='%3.1f%%', 
                                               figsize=(20,10), shadow=True, startangle=135, legend=False, cmap='summer')
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = kaggle_survey_2021_data[1:]['Q4'].value_counts().index, loc ='lower left', frameon = True)
plt.show()
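
The pie charts in this notebook pass a hard-coded `explode` list, and matplotlib expects one entry per wedge. A small sketch of deriving the list from the data instead, shown on toy values (the labels are illustrative):

```python
import pandas as pd

# Toy education answers (illustrative only).
toy = pd.Series(
    ["Master's", "Bachelor's", "Master's", "Doctoral", "Bachelor's", "Master's"],
    name="Q4",
)

# One wedge per distinct answer, so build one explode offset per wedge.
counts = toy.value_counts()
explode = [0.1] * len(counts)
print(counts)
print(explode)
```

The resulting list can be passed as `explode=explode` to the `plot(kind='pie', ...)` call, so the chart keeps working if the number of categories changes.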

Let's use interactive Plotly Express charts to explore what the Data Scientist Profile looks like from the **Education** (Bachelor's, Master's and Doctoral Degree) perspective.

---
**Q: What is the distribution of Education with respect to Job Title?**

---

In [None]:
education=["Bachelor’s degree","Master’s degree","Doctoral degree"]
#distribution of Education with respect to Job Title
fig = px.sunburst(kaggle_survey_2021_data[kaggle_survey_2021_data.Q4.isin(education)], path=['Q4','Q5'], color='Q4', hover_data=['Q5'])
fig.show()

* Click/Touch on Label with "degree" will show Job Title distribution for that Degree.

The plot above reveals that, for a Master's degree, the most common Job Titles are Student, Data Scientist, Other, Data Analyst and Software Engineer. For a Bachelor's degree, they are Student, Data Scientist, Software Engineer and Data Analyst. For a Doctoral degree, they are Research Scientist, Data Scientist, Other and Student.
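
A ranking like this can also be read off a table rather than a chart. The sketch below lists the top job titles per degree with `groupby` and `value_counts`; the toy rows are invented for illustration and are not survey results.

```python
import pandas as pd

# Toy responses (illustrative only): Q4 = education, Q5 = job title.
toy = pd.DataFrame({
    "Q4": ["Master’s degree"] * 6 + ["Doctoral degree"] * 3,
    "Q5": ["Student", "Student", "Student", "Data Scientist", "Data Scientist", "Other",
           "Research Scientist", "Research Scientist", "Data Scientist"],
})

# Count job titles within each degree, then keep the two most frequent per degree.
title_counts = toy.groupby("Q4")["Q5"].value_counts()
top_titles = title_counts.groupby(level=0).head(2)
print(top_titles)
```

Changing the argument of `head` controls how many titles are kept per degree.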

---
**Q: What is the distribution of Education with respect to Programming Experience?**

---

In [None]:
#distribution of Education with respect to Programming Experience
fig = px.sunburst(kaggle_survey_2021_data[kaggle_survey_2021_data.Q4.isin(education)], path=['Q4','Q6'], color='Q4', hover_data=['Q6'])
fig.show()

* Click/Touch on Label with "degree" will show Programming Experience distribution for that Degree.

The plot above reveals that Programming Experience is concentrated around 0 to 5 years for a Bachelor's degree, around 0 to 10 years for a Master's degree, and around 5 to 20 years for a Doctoral degree. These ranges are where the majority of the Data Scientist Profile in this survey falls.

---
**Q: What is the distribution of Education with respect to Machine Learning Experience?**

---

In [None]:
#distribution of Education with respect to Machine Learning Experience
fig = px.sunburst(kaggle_survey_2021_data[kaggle_survey_2021_data.Q4.isin(education)], path=['Q4','Q15'], color='Q4', hover_data=['Q15'])
fig.show()

* Click/Touch on Label with "degree" will show Machine Learning Experience distribution for that Degree.

The plot above reveals that Machine Learning Experience for Bachelor's and Master's degrees is concentrated in the "Under 1 year", "1-2 years" and "I do not use machine learning methods" ranges, which is expected, whereas the Doctoral degree shows a good mix across all ranges of Machine Learning Experience.

---
**Q: What is the distribution of Education with respect to Industry?**

---

In [None]:
#distribution of Education with respect to Industry
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q4.isin(education)], path=[px.Constant('Education->Job Title->Industry'), 'Q4','Q5','Q20'], color='Q20', hover_data=['Q4'])
fig.show()

* Click/Touch on Label with "degree" will show Job Title distribution for that Degree.
 * Click/Touch on respective Label for "Job Title" will show Industry distribution for that.

The plot above reveals that the Industry distribution centers on Computers/Technology and Academics/Education for Master's, Bachelor's and Doctoral degrees, where the majority of the Data Scientist Profile appears to belong.

---
**Q: What is the distribution of Education with respect to Compensation?**

---

In [None]:
#distribution of Education with respect to Compensation
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q4.isin(education)], path=[px.Constant('Education->Job Title->Compensation'), 'Q4','Q5','Q25'], color='Q25', hover_data=['Q4'])
fig.show()

* Click/Touch on Label with "degree" will show Job Title distribution for that Degree.
 * Click/Touch on respective Label for "Job Title" will show Compensation distribution for that.

The plot above reveals that, for Master's and Bachelor's degrees, the Compensation distribution is concentrated under the "Student" Job Title, with "Currently Not Employed" also prominent for both degrees, as expected. For the Doctoral degree, Compensation is concentrated under "Research Scientist" and "Data Scientist", but "Student" and "Currently Not Employed" also account for a large share.

---
**Q: What is the distribution of Education with respect to Country?**

---

In [None]:
#distribution of Education with respect to Country
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q4.isin(education)], path=[px.Constant('Education->Job Title->Country'), 'Q4','Q5','Q3'], color='Q3', hover_data=['Q4'])
fig.show()

* Click/Touch on Label with "degree" will show Job Title distribution for that Degree.
 * Click/Touch on respective Label for "Job Title" will show Country distribution for that.
 
The plot above reveals that, for Bachelor's and Master's degrees, the "Student" category mostly belongs to India, while for the Doctoral degree the United States of America appears to have the majority under the "Research Scientist", "Data Scientist" and "Student" categories.

---
**Data Scientist Profile: Experience**

---

As per Kaggle Executive Summary on State of Machine Learning and Data Science 2021:

* While most Kaggle data scientists have at least a few years of experience under their belt, a growing share have taken up programming within the last year (14.6% vs 9% in 2020).

* Most Kaggle data scientists are newer to machine learning than programming. Slightly more than 55% of data scientists have less than three years experience. Less than 6% of professional data scientists have been using machine learning for a decade or more. 

---

---
**Q: What is the distribution of Experience (Programming) in survey response?**

---

In [None]:
# plot percentage distribution count of Experience (Programming)
kaggle_survey_2021_data[1:]['Q6'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1,0.1,0.1,0.1], fontsize=14, autopct='%3.1f%%', 
                                               figsize=(15,5), shadow=True, startangle=135, legend=False, cmap='autumn')
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = kaggle_survey_2021_data[1:]['Q6'].value_counts().index, loc ='lower left', frameon = True)
plt.show()

Let's use interactive Plotly Express charts to explore what the Data Scientist Profile looks like from the **Experience (Programming)** (< 1 years, 1-3 years, 3-5 years and 5-10 years) perspective.

---
**Q: What is the distribution of Experience (Machine Learning) in survey response?**

---

In [None]:
# plot percentage distribution count of Experience (Machine Learning)
kaggle_survey_2021_data[1:]['Q15'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1], fontsize=14, autopct='%3.1f%%', 
                                               figsize=(15,10), shadow=True, startangle=135, legend=False, cmap='rainbow')
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = kaggle_survey_2021_data[1:]['Q15'].value_counts().index, loc ='lower left', frameon = True)
plt.show()
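
To put a number on the share of respondents in the lower experience bands, a boolean mask and its mean are enough. A minimal sketch on toy data follows; the answers are illustrative, and which bands count as "early" is an assumption.

```python
import pandas as pd

# Toy machine learning experience answers (illustrative only).
toy = pd.Series([
    "Under 1 year", "1-2 years", "2-3 years", "5-10 years",
    "Under 1 year", "I do not use machine learning methods",
    "3-4 years", "1-2 years",
], name="Q15")

# Bands treated as "less than three years" (an assumption for this sketch).
early_bands = ["Under 1 year", "1-2 years", "2-3 years"]

# The mean of a boolean mask is the fraction of True values.
share_early = toy.isin(early_bands).mean()
print(f"{share_early:.1%}")
```

Whether respondents who do not use machine learning belong in the denominator is a judgment call that changes the result.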

---
**Q: What is the distribution of Experience (Programming) with respect to Job Title?**

---

In [None]:
experience=["< 1 years","1-3 years","3-5 years","5-10 years"]
#distribution of Experience (Programming) with respect to Job Title
fig = px.sunburst(kaggle_survey_2021_data[kaggle_survey_2021_data.Q6.isin(experience)], path=['Q6','Q5'], color='Q6', hover_data=['Q5'])
fig.show()

* Click/Touch on Label with "experience range" will show Job Title distribution for that Experience range.

The plot above reveals that "Student" is the most common Job Title for the 3-5 years, 1-3 years and < 1 years Experience (Programming) ranges, while the 5-10 years range is evenly distributed across "Data Scientist", "Software Engineer" and "Student".

---
**Q: What is the distribution of Experience (Machine Learning) with respect to Job Title?**

---

In [None]:
ml_experience=["Under 1 year","1-2 years","I do not use machine learning methods","2-3 years"]
#distribution of Experience (Machine Learning) with respect to Job Title
fig = px.sunburst(kaggle_survey_2021_data[kaggle_survey_2021_data.Q15.isin(ml_experience)], path=['Q15','Q5'], color='Q15', hover_data=['Q5'])
fig.show()

* Click/Touch on Label with "experience range" will show Job Title distribution for that Experience range.

The plot above reveals that "Student" is the most common Job Title for the "1-2 years" and "Under 1 year" Experience (Machine Learning) ranges, while "Data Scientist" is the most common Job Title for the "2-3 years" range.

---
**Q: What is the distribution of Experience (Programming) with respect to Programming Language?**

---

In [None]:
#distribution of Experience (Programming) with respect to Programming Language
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q6.isin(experience)], path=[px.Constant('Experience->Programming Language'), 'Q6','Q7'], color='Q7', hover_data=['Q6'])
fig.show()

* Click/Touch on Label with "experience range" will show Programming Language distribution for that Experience range.
  
The plot above reveals that "Python" appears to dominate the Programming Language distribution across all experience ranges (1-3 years, < 1 years, 3-5 years and 5-10 years).
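
Because `Q7` now holds all selected languages joined into one string, the share of respondents using a given language can be estimated with a substring test. A hedged sketch on toy data (values are illustrative):

```python
import pandas as pd

# Toy frame after the combine step: Q7 is a concatenation of selected languages.
toy = pd.DataFrame({
    "Q6": ["< 1 years", "< 1 years", "1-3 years", "1-3 years", "5-10 years"],
    "Q7": ["Python", "PythonSQL", "R", "PythonR", "PythonSQLC++"],
})

# Literal substring match (regex=False), then the fraction of matches per experience range.
uses_python = toy["Q7"].str.contains("Python", regex=False)
python_share = uses_python.groupby(toy["Q6"]).mean()
print(python_share)
```

Substring matching is only safe for labels that are not contained in other labels; for example, "Java" would also match "Javascript", and "C" would match "C++".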

---
**Q: What is the distribution of Experience (Programming) with respect to IDE?**

---

In [None]:
#distribution of Experience (Programming) with respect to IDE
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q6.isin(experience)], path=[px.Constant('Experience->IDE'), 'Q6','Q9'], color='Q9', hover_data=['Q6'])
fig.show()

* Click/Touch on Label with "experience range" will show IDE distribution for that Experience range.
  
The plot above reveals that "Jupyter Notebook" appears to dominate the IDE distribution across experience ranges (1-3 years, < 1 years, 3-5 years and 5-10 years).

---
**Q: What is the distribution of Experience (Programming) with respect to Computing Platform?**

---

In [None]:
#distribution of Experience (Programming) with respect to Computing Platform
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q6.isin(experience)], path=[px.Constant('Experience->Job Title->Computing Platform'), 'Q6','Q5','Q11'], color='Q11', hover_data=['Q6'])
fig.show()

* Click/Touch on Label with "experience range" will show Job Title distribution for that Experience range.
 * Click/Touch on respective Label for "Job Title" will show Computing Platform distribution for that.
 
The plot above reveals that "Laptop" appears to dominate the Computing Platform distribution across all experience ranges (1-3 years, < 1 years, 3-5 years and 5-10 years).

---
**Q: What is the distribution of Experience (Programming) with respect to Industry?**

---

In [None]:
#distribution of Experience (Programming) with respect to Industry
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q6.isin(experience)], path=[px.Constant('Experience->Job Title->Industry'), 'Q6','Q5','Q20'], color='Q20', hover_data=['Q6'])
fig.show()

* Click/Touch on Label with "experience range" will show Job Title distribution for that Experience Range.
 * Click/Touch on respective Label for "Job Title" will show Industry distribution for that.

The plot above reveals that the 1-3 years, < 1 years and 3-5 years experience ranges are dominated by the "Student" Job Title, while the 5-10 years range centers on the "Data Scientist", "Software Engineer" and "Research Scientist" Job Titles, whose Industry distribution leans towards Computers/Technology and Academics/Education.

---
**Q: What is the distribution of Experience (Programming) with respect to Compensation?**

---

In [None]:
#distribution of Experience (Programming) with respect to Compensation
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q6.isin(experience)], path=[px.Constant('Experience->Job Title->Compensation'), 'Q6','Q5','Q25'], color='Q25', hover_data=['Q6'])
fig.show()

* Click or tap an "experience range" label to show the Job Title distribution for that experience range.
 * Click or tap a "Job Title" label to show the Compensation distribution for that Job Title.

The plot above shows that "Student" is the largest Job Title in the <1 years, 1-3 years and 3-5 years experience ranges. In the 5-10 years range the largest Job Title is "Data Scientist", where the most common compensation range is 100,000-124,999 USD.
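
Because the Q25 compensation bins are stored as text, they do not sort by value on their own. The sketch below shows one way to order them by lower bound. It assumes labels formatted like `$0-999`, `100,000-124,999` and `>$1,000,000`; the helper name `compensation_lower_bound` and the toy values are illustrative and not part of this notebook.

```python
import pandas as pd


def compensation_lower_bound(label: str) -> int:
    """Return the lower bound (USD) of a compensation bin label.

    Assumes labels such as "$0-999", "100,000-124,999" or ">$1,000,000".
    """
    cleaned = label.replace("$", "").replace(",", "").replace(">", "").strip()
    return int(cleaned.split("-")[0])


# Toy data standing in for the Q25 column (not real survey responses).
toy = pd.Series(["100,000-124,999", "$0-999", ">$1,000,000", "1,000-1,999", "$0-999"])

# Sort the distinct bins by their numeric lower bound.
ordered_bins = sorted(toy.unique(), key=compensation_lower_bound)

# An ordered categorical keeps the bins in value order for sorting and plotting.
ordered = pd.Categorical(toy, categories=ordered_bins, ordered=True)
print(ordered_bins)
```

If the real labels follow a different pattern, only the parsing helper should need to change.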

---
**Q: What is the distribution of Experience (Programming) with respect to Country?**

---

In [None]:
#distribution of Experience (Programming) with respect to Country
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q6.isin(experience)], path=[px.Constant('Experience->Job Title->Country'), 'Q6','Q5','Q3'], color='Q3', hover_data=['Q6'])
fig.show()

* Click or tap an "experience range" label to show the Job Title distribution for that experience range.
 * Click or tap a "Job Title" label to show the Country distribution for that Job Title.

The plot above shows that for the <1 years, 1-3 years and 3-5 years experience ranges the largest Job Title is "Student", with India as the most common country. For the 5-10 years range the largest Job Title is "Data Scientist", with the United States of America as the most common country.

---
**Data Scientist Profile: Country**

---

According to the Kaggle Executive Summary on the State of Machine Learning and Data Science 2021:

* Country demographics are nearly the same as last year with two countries having far more representation in the Kaggle community. India makes up 24.4% of Kaggle data scientists, while 12.2% reside in the United States. Brazil is a distant third, at under 4.3%.
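
To compare these published figures with the survey file, country shares can be computed directly. This is a minimal sketch on toy data: `category_share` is a hypothetical helper and the values are made up for illustration.

```python
import pandas as pd


def category_share(responses: pd.Series) -> pd.Series:
    """Percentage of non-missing responses per category, largest first."""
    return (responses.value_counts(normalize=True) * 100).round(1)


# Toy data standing in for the Q3 (country) column; not real survey responses.
toy_q3 = pd.Series(["India"] * 5 + ["United States of America"] * 3 + ["Brazil"] * 2)

shares = category_share(toy_q3)
print(shares)
```

In this notebook the helper would be applied to `kaggle_survey_2021_data['Q3'][1:]`, mirroring the `[1:]` slice used in the plots. The summary's figures describe Kaggle data scientists, so shares computed over all respondents need not match them exactly.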

---

---
**Q: What is the distribution of Country in survey response?**

---

In [None]:
#distribution of Country
fig = px.treemap(kaggle_survey_2021_data[1:], path=[px.Constant('Country'), 'Q3'], color='Q3', hover_data=['Q3'])
fig.show()

Let's explore what the Data Scientist Profile looks like from the **Country** perspective (India and United States of America), using interactive Plotly Express charts.

---
**Q: What is the distribution of Country with respect to Education?**

---

In [None]:
country=["India","United States of America"]
#distribution of Country with respect to Education
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q3.isin(country)], path=[px.Constant('Country->Gender->Education'), 'Q3','Q2','Q4'], color='Q4', hover_data=['Q3'])
fig.show()

* Click or tap a "Country" label to show the Gender distribution for that Country.
 * Click or tap a "Gender" label to show the Education distribution for that Gender.

The plot above shows a gender imbalance in both countries. Within each country, however, the Education distribution (Bachelor's, Master's and Doctoral degrees) is similar across genders among respondents to the Kaggle 2021 Survey.
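
A treemap shows absolute sizes, so similarity between groups of different size is easier to judge from within-group proportions. The following is a minimal sketch using `pd.crosstab` on toy data (the values are invented for illustration):

```python
import pandas as pd

# Toy data standing in for Q3 (country), Q2 (gender) and Q4 (education).
toy = pd.DataFrame({
    "Q3": ["India"] * 6,
    "Q2": ["Man", "Man", "Man", "Man", "Woman", "Woman"],
    "Q4": ["Bachelor's degree", "Bachelor's degree", "Master's degree",
           "Master's degree", "Bachelor's degree", "Master's degree"],
})

# Rows: (country, gender); columns: education level.
# normalize="index" converts each row to proportions that sum to 1,
# so groups of different sizes can be compared directly.
education_share = pd.crosstab([toy["Q3"], toy["Q2"]], toy["Q4"], normalize="index")
print(education_share.round(2))
```

Rows with similar proportions support the reading that education is distributed similarly across genders, even when one group is much larger than the other.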

---
**Q: What is the distribution of Country with respect to Job Title?**

---

In [None]:
#distribution of Country with respect to Job Title
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q3.isin(country)], path=[px.Constant('Country->Gender->Job Title'), 'Q3','Q2','Q5'], color='Q5', hover_data=['Q3'])
fig.show()

* Click or tap a "Country" label to show the Gender distribution for that Country.
 * Click or tap a "Gender" label to show the Job Title distribution for that Gender.

The plot above shows a gender imbalance in both countries. The most common Job Titles ("Student", "Data Scientist", "Data Analyst" and "Software Engineer") are nonetheless the same for both genders, with a similar distribution in both countries.

---
**Q: What is the distribution of Country with respect to Industry?**

---

In [None]:
#distribution of Country with respect to Industry
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q3.isin(country)], path=[px.Constant('Country->Job Title->Industry'), 'Q3','Q5','Q20'], color='Q20', hover_data=['Q3'])
fig.show()

* Click or tap a "Country" label to show the Job Title distribution for that Country.
 * Click or tap a "Job Title" label to show the Industry distribution for that Job Title.

The plot above shows that "Student" is the most common Job Title in both countries. Respondents with other Job Titles work mostly in "Computers/Technology", "Academics/Education", "Accounting/Finance" and "Medical/Pharmaceutical".

---
**Data Scientist Profile: Job Title**

According to the Kaggle Executive Summary on the State of Machine Learning and Data Science 2021:

* There are many other job titles that support data science and ML workflows and also many students and data enthusiasts who aren’t full-time, employed data scientists.

---

---
**Q: What is the distribution of Job Title in survey response?**

---

In [None]:
# distribution count for Job Title
fig = px.pie(kaggle_survey_2021_data[1:], names='Q5')
fig.show()

Let's explore what the Data Scientist Profile looks like from the **Job Title** perspective (Student, Data Scientist, Data Analyst and Software Engineer), using interactive Plotly Express charts.

---
**Q: What is the distribution of Job Title with respect to Compensation?**

---

In [None]:
jobtitle=["Student","Data Scientist","Data Analyst","Software Engineer"]
#distribution of Job Title with respect to Compensation
fig = px.sunburst(kaggle_survey_2021_data[kaggle_survey_2021_data.Q5.isin(jobtitle)], path=['Q5','Q25'], color='Q5', hover_data=['Q5'])
fig.show()

* Click or tap a "Job Title" label to show the Compensation distribution for that Job Title.

The plot above shows that the 0-999 USD compensation range is the largest segment for the "Data Scientist", "Software Engineer" and "Data Analyst" Job Titles among respondents to the Kaggle 2021 Survey.
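
Whether the largest segment is also an outright majority (more than half of a group) can be checked numerically. The sketch below uses a hypothetical helper, `top_category`, on toy data. The compensation labels are assumed to resemble the survey's and the counts are invented.

```python
import pandas as pd


def top_category(df: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """For each group, report the most frequent value and its share of the group."""
    rows = []
    for group, values in df.groupby(group_col)[value_col]:
        shares = values.value_counts(normalize=True)  # sorted, largest first
        rows.append({
            group_col: group,
            value_col: shares.index[0],
            "share": float(shares.iloc[0]),
            # A majority needs more than half; otherwise it is only the largest segment.
            "is_majority": bool(shares.iloc[0] > 0.5),
        })
    return pd.DataFrame(rows).set_index(group_col)


# Toy data standing in for Q5 (job title) and Q25 (compensation).
toy = pd.DataFrame({
    "Q5": ["Data Scientist"] * 4 + ["Data Analyst"] * 5,
    "Q25": ["$0-999", "$0-999", "$0-999", "100,000-124,999",
            "$0-999", "$0-999", "1,000-1,999", "10,000-14,999", "100,000-124,999"],
})

result = top_category(toy, "Q5", "Q25")
print(result)
```

In the toy data the same bin is the top category for both groups, but only one of them clears the 50% threshold, which is the distinction a sunburst chart makes hard to read off by eye.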

---
**Q: What is the distribution of Job Title with respect to Company Size?**

---

In [None]:
#distribution of Job Title with respect to Company Size
fig = px.sunburst(kaggle_survey_2021_data[kaggle_survey_2021_data.Q5.isin(jobtitle)], path=['Q5','Q21'], color='Q5', hover_data=['Q5'])
fig.show()

* Click or tap a "Job Title" label to show the Company Size distribution for that Job Title.

The plot above shows that companies with 0-49 employees form the largest segment for the "Data Scientist", "Software Engineer" and "Data Analyst" Job Titles among respondents to the Kaggle 2021 Survey.

---
**Q: What is the distribution of Job Title with respect to Experience (Programming and Machine Learning)?**

---

In [None]:
#distribution of Job Title with respect to Experience
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q5.isin(jobtitle)], path=[px.Constant('Job Title->Experience (Programming)->Experience (Machine Learning)'), 'Q5','Q6','Q15'], color='Q15', hover_data=['Q5'])
fig.show()

* Click or tap a "Job Title" label to show the Experience Range (Programming) distribution for that Job Title.
 * Click or tap an "Experience Range (Programming)" label to show the Experience Range (Machine Learning) distribution for that range.

The plot above shows that, for the "Student", "Data Scientist", "Data Analyst" and "Software Engineer" Job Titles, most respondents to the Kaggle 2021 Survey have 0 to 3 years of programming experience and 0 to 3 years of machine learning experience.
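
The most frequent combination of programming and machine learning experience per Job Title can also be tabulated. This is a minimal sketch on toy data: the experience labels are assumed to resemble the survey's and the counts are invented.

```python
import pandas as pd

# Toy data standing in for Q5 (job title), Q6 (programming experience)
# and Q15 (machine learning experience).
toy = pd.DataFrame({
    "Q5": ["Student"] * 4 + ["Data Scientist"] * 3,
    "Q6": ["< 1 years", "< 1 years", "1-3 years", "< 1 years",
           "5-10 years", "5-10 years", "1-3 years"],
    "Q15": ["Under 1 year", "Under 1 year", "1-2 years", "Under 1 year",
            "2-3 years", "2-3 years", "1-2 years"],
})

# Count how often each (job title, programming, ML) combination occurs.
pair_counts = toy.groupby(["Q5", "Q6", "Q15"]).size().rename("count").reset_index()

# Keep the most frequent (Q6, Q15) pair for each job title.
top_pairs = (
    pair_counts.sort_values("count", ascending=False)
    .drop_duplicates("Q5")
    .set_index("Q5")
)
print(top_pairs)
```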

---
**Q: What is the distribution of Job Title with respect to Industry?**

---

In [None]:
#distribution of Job Title with respect to Industry
fig = px.treemap(kaggle_survey_2021_data[kaggle_survey_2021_data.Q5.isin(jobtitle)], path=[px.Constant('Job Title->Industry'), 'Q5','Q20'], color='Q20', hover_data=['Q5'])
fig.show()

* Click or tap a "Job Title" label to show the Industry distribution for that Job Title.

The plot above shows that most respondents with these Job Titles work in the "Computers/Technology", "Academics/Education" and "Accounting/Finance" industries.

---
# Summary
---

**Data Scientist Profile: Gender**

Although the gender ratio among survey respondents is imbalanced (82% Male vs 16% Female), the distributions of Age Group, Education Background, Job Title, Experience in Programming, Industry, Compensation and Country are similar across genders.

**Data Scientist Profile: Education**

Although 60% of survey respondents hold a Bachelor's, Master's or Doctoral degree, there is variation across education levels in Job Title, Programming Experience, Machine Learning Experience, Industry, Compensation and Country distribution.

**Data Scientist Profile: Experience**

Around 50% of survey respondents have 0-3 years of programming experience, and the distributions of Job Title, Programming Language, IDE, Computing Platform and Country follow a similar pattern across experience ranges.

**Data Scientist Profile: Country**

Almost 36% of survey respondents are from India or the United States of America, and the trends in Education Background, Job Title (e.g. Student) and Industry are very similar across most countries covered in this survey.

**Data Scientist Profile: Job Title**

Around 26% of survey respondents are Students and 9.2% fall under the "Other" Job Title (that is, not Data Scientist, Data Analyst or Software Engineer, which together make up 32%). Compensation, Company Size, Experience (Programming and Machine Learning) and Industry nonetheless show similar trends across the Job Titles covered in this survey.

---
**Thank you and Happy Learning.**

---

In [None]:
thank_you_str="Thanks,Happy Learning,Collaboration,Thankyou,Keep Learning"
# create WordCloud with converted string
wordcloud = WordCloud(width = 1000, height = 500, random_state=1, background_color='white', collocations=True).generate(thank_you_str)
plt.figure(figsize=(20, 20))
plt.imshow(wordcloud) 
plt.axis("off")
plt.show()