# Objective

This year Kaggle is launching the second annual Data Science Survey Challenge, where we will be awarding a prize pool of $30,000 to notebook authors who tell a rich story about a subset of the data science and machine learning community.

The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from plotly.offline import plot

# display all the columns in the dataset
pd.set_option('display.max_columns', None)

# Setting color palette.
orange_black = ['#fdc029', '#df861d', '#FF6347', '#aa3d01', '#a30e15', '#800000', '#171820']

# Setting plot styling.
plt.style.use('fivethirtyeight')

In [None]:
# let's load the dataset first
response = pd.read_csv("/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")

In [None]:
# check the dataset
response.head()

In [None]:
response.shape

Dataset has 20037 rows and 355 columns

In [None]:
# drop the first row, which holds the question text rather than a response
response.drop(0,inplace=True)
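
In this survey file the first data row appears to hold the question text rather than an answer, which is why it is dropped above. A minimal sketch, on a made-up two-column frame, of keeping that text as a lookup before discarding the row. The names `toy`, `questions`, and `answers` are hypothetical and not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the survey file: row 0 carries the question text.
toy = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "25-29", "30-34"],
    "Q2": ["What is your gender?", "Man", "Woman"],
})

# Keep the question text as a column -> question lookup.
questions = toy.iloc[0].to_dict()

# Drop the question row and renumber the remaining responses from 0.
answers = toy.drop(index=0).reset_index(drop=True)

print(questions["Q1"])
print(answers.shape)
```

With such a lookup, plot titles could be taken from the question text instead of being typed by hand.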

In [None]:
# re-check the dataset
response.head()

In [None]:
# dataset information
response.info()

All the features are categorical

In [None]:
# check null values
null_counts = response.isnull().sum()
null_counts[null_counts != 0].sort_values(ascending=False)

In [None]:
# visualize the 15 columns with the highest share of null values
plt.figure(constrained_layout=True, figsize=(12, 8))
# isnull().mean() is the fraction of null rows per column
percent = response.isnull().mean().sort_values(ascending=False) * 100
percent = percent[percent != 0]

missing = pd.DataFrame({"missing%": percent[:15]})

sns.barplot(x=missing.index,
            y='missing%',
            data=missing,
            palette=orange_black)
plt.xticks(rotation=30)
plt.title('Survey Data Missing Values')

In [None]:
# Replace null values with "Missing"
response = response.fillna("Missing")

In [None]:
# helper function to plot a pie chart of a feature's value distribution
def univariate_plots(feat,name):
    count = response[feat].value_counts().sort_values(ascending = False)

    # let's visualize the feature distribution in the dataset
    fig = px.pie(count,
                 values=count.values,
                 names=count.index,
                 color_discrete_sequence=orange_black,
                 hole=.1,title= name + " Distribution")
    fig.update_traces(textinfo='percent', pull=0.01)    
    fig.show()

#### Q1 - What is your age (# years)?

In [None]:
# Check Age Distribution
univariate_plots("Q1","Age Wise")

We have quite good representation from all age groups, ranging from 18 to 70+ years old

#### Q2 - What is your gender?

In [None]:
# Gender Distribution
univariate_plots("Q2","Gender Wise")

Out of the total participants, 80% were Male!
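
The shares quoted in these observations can also be read off numerically rather than from the pie chart. A minimal sketch on made-up data, assuming the same `value_counts` approach that `univariate_plots` uses. The `toy` frame and its values are illustrative only and are not survey results:

```python
import pandas as pd

# Illustrative answers only; the real column is response["Q2"].
toy = pd.DataFrame({"Q2": ["Man", "Man", "Man", "Man", "Woman"]})

# normalize=True returns fractions that sum to 1; scale them to percent.
share = (toy["Q2"].value_counts(normalize=True) * 100).round(1)
print(share)
```

Note that once nulls have been replaced with "Missing", that placeholder is counted as a category too, so it may need to be filtered out before quoting a percentage.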

#### Q3 - In which country do you currently reside?

In [None]:
# Country wise Distribution
univariate_plots("Q3","Country Wise")

Almost 30% of Kagglers are from India; the next highest share is from the USA (11%)

#### Q4 - What is the highest level of formal education that you have attained or plan to attain within the next 2 years?

In [None]:
# Let's check Educational Qualification of the Kagglers
univariate_plots("Q4","Educational Qualification")

Almost 75% of the Kagglers are either Graduate or Post Graduate!

#### Q5 - Select the title most similar to your current role (or most recent title if retired)

In [None]:
# Check the current role
univariate_plots("Q5","Title Most Similar to your current Role")

* 26% of the Kagglers are students
* 13% are Professional Data Scientists
* 10% are Software Professionals but not Data Scientists

#### Q6 - For how many years have you been writing code and/or programming?

In [None]:
univariate_plots("Q6","Coding Experience")

* Around 60% of the Kagglers have less than 5 years of Programming Experience
* 15% of the Kagglers have more than 10 years of Programming Experience

#### Q7 - What programming languages do you use on a regular basis? (Select all that apply)

In [None]:
Q7 = ["Q7_Part_1","Q7_Part_2","Q7_Part_3","Q7_Part_4","Q7_Part_5","Q7_Part_6","Q7_Part_7","Q7_Part_8","Q7_Part_9","Q7_Part_10",
      "Q7_Part_11","Q7_Part_12","Q7_OTHER"]

plt.figure(figsize = (16,32))

for idx,col in enumerate(Q7):
    plt.subplot(7,2,idx+1)
    sns.countplot(x=response[col], palette=orange_black)

plt.suptitle("Programming Languages used on a Regular Basis")
plt.show()

In [None]:
values = []
index  = ["Python","R","SQL","C","C++","Java","JavaScript","Julia","Swift","Bash","MATLAB","None","Other"]

values.append((len(response[response['Q7_Part_1']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q7_Part_2']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q7_Part_3']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q7_Part_4']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q7_Part_5']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q7_Part_6']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q7_Part_7']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q7_Part_8']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q7_Part_9']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q7_Part_10'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q7_Part_11'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q7_Part_12'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q7_OTHER']   != "Missing"])/response.shape[0])*100)

users = pd.DataFrame({"index":index,"value":values})

plt.figure(figsize = (12,8))
sns.barplot(x=index,
            y=values,
            palette=orange_black)
plt.xticks(rotation=30)
plt.title('Programming Languages used on a Regular Basis')

Python is by far the most widely used programming language among the respondents
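
The same "share of respondents who picked each option" calculation recurs for every multi-select question in this notebook. A minimal sketch of a reusable helper, assuming unselected options have already been filled with the "Missing" placeholder as above. The function name `selection_share` and the toy frame are hypothetical:

```python
import pandas as pd

def selection_share(df, columns, labels, missing="Missing"):
    """Percent of respondents who selected each option of a multi-select question."""
    # A cell that differs from the placeholder means the option was selected.
    shares = [(df[col] != missing).mean() * 100 for col in columns]
    return pd.Series(shares, index=labels, name="Users%")

# Toy data: four respondents, two option columns.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", "Python", "Missing"],
    "Q7_Part_2": ["Missing", "R", "Missing", "Missing"],
})

share = selection_share(toy, ["Q7_Part_1", "Q7_Part_2"], ["Python", "R"])
print(share)
```

On the real data this could be called as `selection_share(response, Q7, index)`, replacing the block of repeated `values.append(...)` lines.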

#### Q8 - What programming language would you recommend an aspiring data scientist to learn first?

In [None]:
univariate_plots("Q8","Programming Language Recommended to an Aspirant Data Scientist")

Again, Python!

#### Q9 - Which of the following integrated development environments (IDE's) do you use on a regular basis? (Select all that apply)

In [None]:
Q9 = ["Q9_Part_1","Q9_Part_2","Q9_Part_3","Q9_Part_4","Q9_Part_5","Q9_Part_6","Q9_Part_7","Q9_Part_8","Q9_Part_9","Q9_Part_10",
      "Q9_Part_11","Q9_OTHER"]

plt.figure(figsize = (16,32))

for idx,col in enumerate(Q9):
    plt.subplot(6,2,idx+1)
    sns.countplot(x=response[col], palette=orange_black)

plt.suptitle("Integrated Development Environment")
plt.show()

In [None]:
values = []
index  = ["Jupyter","RStudio","Visual Studio","VSCode","PyCharm","Spyder","Notepad++","Sublime Text","Vim/Emacs","MATLAB","None","Other"]

values.append((len(response[response['Q9_Part_1']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q9_Part_2']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q9_Part_3']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q9_Part_4']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q9_Part_5']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q9_Part_6']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q9_Part_7']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q9_Part_8']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q9_Part_9']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q9_Part_10'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q9_Part_11'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q9_OTHER']   != "Missing"])/response.shape[0])*100)

users = pd.DataFrame({"index":index,"value":values})

plt.figure(figsize = (12,8))
sns.barplot(x=index,
            y=values,            
            palette=orange_black)
plt.xticks(rotation=30)
plt.title('Integrated Development Environment')

Jupyter is the most popular IDE for Data Scientists

#### Q10 - Which of the following hosted notebook products do you use on a regular basis? (Select all that apply)

In [None]:
Q10 = ["Q10_Part_1","Q10_Part_2","Q10_Part_3","Q10_Part_4","Q10_Part_5","Q10_Part_6","Q10_Part_7","Q10_Part_8","Q10_Part_9","Q10_Part_10","Q10_Part_11","Q10_Part_12","Q10_Part_13","Q10_OTHER"]

plt.figure(figsize = (16,32))

for idx,col in enumerate(Q10):
    plt.subplot(7,2,idx+1)
    sns.countplot(x=response[col], palette=orange_black)

plt.suptitle("Hosted Notebook Products used")
plt.show()

In [None]:
values = []
index  = ["Kaggle Notebooks","Colab Notebooks","Azure Notebooks","PaperSpace/Gradient","Binder/JupyterHub","Code Ocean",
          "IBM Watson Studio","Amazon Sagemaker Studio","Amazon EMR Notebooks","Google Cloud AI Platform Notebook",
          "Google Cloud Datalab Notebooks","Databricks Collaborative Notebooks","None","Other"]

values.append((len(response[response['Q10_Part_1']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_Part_2']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_Part_3']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_Part_4']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_Part_5']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_Part_6']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_Part_7']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_Part_8']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_Part_9']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_Part_10'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_Part_11'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_Part_12'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_Part_13'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q10_OTHER']   != "Missing"])/response.shape[0])*100)

#values = sorted(values,reverse=True)

users = pd.DataFrame({"Users%":values})

plt.figure(figsize = (12,8))
sns.barplot(y=index,
            x="Users%",
            data=users,
            palette=orange_black)
plt.title("Hosted Notebook Products used")

Google Colab and Kaggle are the top hosted notebook environments
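
The commented-out `sorted(values, reverse=True)` line in the cell above would reorder the bars without reordering their labels. A minimal sketch of sorting both together by placing the labels in the index. The numbers here are made up and are not survey results:

```python
import pandas as pd

index = ["Kaggle Notebooks", "Colab Notebooks", "Azure Notebooks"]
values = [10.0, 30.0, 20.0]  # illustrative percentages only

# Tie each label to its value, then sort; labels move with their values.
users = pd.Series(values, index=index, name="Users%").sort_values(ascending=False)
print(users)
```

The sorted series can then be plotted with `sns.barplot(x=users.values, y=users.index)`, so the largest bar appears first with its own label.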

#### Q11 - What type of computing platform do you use most often for your data science projects?

In [None]:
univariate_plots("Q11","Computing Platform")

Personal Laptops are the most popular computing platforms

#### Q12 - Which types of specialized hardware do you use on a regular basis? (Select all that apply)

In [None]:
Q12= ['Q12_Part_1','Q12_Part_2','Q12_Part_3','Q12_OTHER']

plt.figure(figsize = (12,8))

for idx,col in enumerate(Q12):
    plt.subplot(2,2,idx+1)
    sns.countplot(x=response[col], palette=orange_black)

plt.suptitle("Specialized Hardware Used")
plt.show()

In [None]:
values = []
index  = ['GPUs','None','TPUs','Others']


values.append((len(response[response['Q12_Part_1'] == "GPUs"])/response.shape[0])*100)
values.append((len(response[response['Q12_Part_3'] == "None"])/response.shape[0])*100)
values.append((len(response[response['Q12_Part_2'] == "TPUs"])/response.shape[0])*100)
values.append((len(response[response['Q12_OTHER'] == "Other"])/response.shape[0])*100)

users = pd.DataFrame({"Users%":values})

plt.figure(figsize = (12,8))

sns.barplot(x=index,
            y="Users%",
            data=users,
            palette=orange_black)
plt.title('Specialized Hardware Used')

GPUs are the most widely used specialized hardware among data scientists

#### Q13 - Approximately how many times have you used a TPU (tensor processing unit)?

In [None]:
univariate_plots("Q13","No. of Times TPUs used")

TPUs are yet to gain popularity, maybe because using them is not as straightforward as using CPUs/GPUs!

#### Q14 - What data visualization libraries or tools do you use on a regular basis? (Select all that apply)

In [None]:
Q14 = ["Q14_Part_1","Q14_Part_2","Q14_Part_3","Q14_Part_4","Q14_Part_5","Q14_Part_6","Q14_Part_7","Q14_Part_8","Q14_Part_9","Q14_Part_10","Q14_Part_11","Q14_OTHER"]

plt.figure(figsize = (16,32))

for idx,col in enumerate(Q14):
    plt.subplot(6,2,idx+1)
    sns.countplot(x=response[col], palette=orange_black)

plt.suptitle("Data Visualization Libraries or Tools")
plt.show()

In [None]:
values = []
index  = ["Matplotlib","Seaborn","Plotly","ggplot","Shiny","D3 js","Altair","Bokeh","Geoplotlib","Leaflet/Folium","None","Other"]

values.append((len(response[response['Q14_Part_1']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q14_Part_2']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q14_Part_3']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q14_Part_4']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q14_Part_5']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q14_Part_6']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q14_Part_7']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q14_Part_8']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q14_Part_9']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q14_Part_10'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q14_Part_11'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q14_OTHER']   != "Missing"])/response.shape[0])*100)

#values = sorted(values,reverse=True)

users = pd.DataFrame({"Users%":values})

plt.figure(figsize = (12,8))
sns.barplot(y=index,
            x="Users%",
            data=users,
            palette=orange_black)
plt.title('Data Visualization Libraries or Tools')

* Matplotlib and Seaborn are the most popular libraries for visualization
* Plotly (extensively used in this notebook :-) ) is also becoming popular
* ggplot is in third place

#### Q15 - For how many years have you used machine learning methods?

In [None]:
univariate_plots("Q15","Years using ML Methods")

Almost 50% of the Kagglers have less than 2 years of experience in using Machine Learning methods

#### Q16 - Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply)

In [None]:
Q16 = ["Q16_Part_1","Q16_Part_2","Q16_Part_3","Q16_Part_4","Q16_Part_5","Q16_Part_6","Q16_Part_7","Q16_Part_8","Q16_Part_9",
       "Q16_Part_10","Q16_Part_11","Q16_Part_12","Q16_Part_13","Q16_Part_14","Q16_Part_15","Q16_OTHER"]

plt.figure(figsize = (16,32))

for idx,col in enumerate(Q16):
    plt.subplot(8,2,idx+1)
    sns.countplot(x=response[col], palette=orange_black)

plt.suptitle("Machine Learning Frameworks used")
plt.show()

In [None]:
values = []
index  = ["Scikit-Learn","TensorFlow","Keras","PyTorch","Fast.ai","MXNet","XGBoost","LGBM","CatBoost","Prophet","H2O 3",
          "Caret","Tidymodels","JAX","None","Other"]

values.append((len(response[response['Q16_Part_1']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_2']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_3']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_4']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_5']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_6']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_7']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_8']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_9']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_10'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_11'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_12'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_13'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_14'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_Part_15'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q16_OTHER']   != "Missing"])/response.shape[0])*100)

users = pd.DataFrame({"Users%":values})

plt.figure(figsize = (12,8))
sns.barplot(y=index,
            x="Users%",
            data=users,
            palette=orange_black)
plt.title('Machine Learning Frameworks')

No surprise looking at the top 4 frameworks here!

#### Q17 - Which of the following ML algorithms do you use on a regular basis? (Select all that apply)

In [None]:
Q17 = ["Q17_Part_1","Q17_Part_2","Q17_Part_3","Q17_Part_4","Q17_Part_5","Q17_Part_6","Q17_Part_7","Q17_Part_8","Q17_Part_9"
       ,"Q17_Part_10","Q17_Part_11","Q17_OTHER"]

plt.figure(figsize = (16,32))

for idx,col in enumerate(Q17):
    plt.subplot(6,2,idx+1)
    sns.countplot(x=response[col], palette=orange_black)

plt.suptitle("ML Algorithm used on a Regular Basis")
plt.show()

In [None]:
values = []
index  = ["Linear or Logistic Regression","Decision Trees or Random Forests","Gradient Boosting Machines (xgboost, lightgbm, etc)"
          ,"Bayesian Approaches","Evolutionary Approaches","Dense Neural Networks (MLPs, etc)","Convolutional Neural Networks",
          "Generative Adversarial Networks","Recurrent Neural Networks","Transformer Networks (BERT, gpt-3, etc)","None","Other"]

values.append((len(response[response['Q17_Part_1']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q17_Part_2']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q17_Part_3']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q17_Part_4']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q17_Part_5']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q17_Part_6']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q17_Part_7']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q17_Part_8']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q17_Part_9']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q17_Part_10'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q17_Part_11'] != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q17_OTHER']   != "Missing"])/response.shape[0])*100)

users = pd.DataFrame({"index":index,"value":values})

plt.figure(figsize = (12,8))
sns.barplot(y=index,
            x=values,
            palette=orange_black)

plt.title('ML Algorithm used on a Regular Basis')

No surprise again: Linear and Logistic Regression, Decision Trees, and Random Forests are the most popular algorithms

#### Q18 - Which categories of computer vision methods do you use on a regular basis? (Select all that apply)

In [None]:
Q18 = ["Q18_Part_1","Q18_Part_2","Q18_Part_3","Q18_Part_4","Q18_Part_5","Q18_Part_6","Q18_OTHER"]

plt.figure(figsize = (32,32))

for idx,col in enumerate(Q18):
    plt.subplot(4,2,idx+1)
    sns.countplot(x=response[col], palette=orange_black)
    #plt.xticks(rotation=30)

plt.suptitle("Computer Vision Method used on a Regular Basis")
plt.show()

In [None]:
values = []
index  = ["PIL/cv2/skimage etc.","U-Net/Mask R-CNN etc.","YOLOv3/RetinaNet/etc.",
          "VGG/Inception/ResNet etc.","GAN/VAE etc","None","Other"]

values.append((len(response[response['Q18_Part_1']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q18_Part_2']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q18_Part_3']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q18_Part_4']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q18_Part_5']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q18_Part_6']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q18_OTHER']   != "Missing"])/response.shape[0])*100)

users = pd.DataFrame({"index":index,"value":values})

plt.figure(figsize = (12,8))
sns.barplot(y=index,
            x=values,
            palette=orange_black)

plt.title('Computer Vision Methods used on a Regular Basis')

Transfer Learning rules as far as computer vision is concerned

#### Q19 - Which of the following natural language processing (NLP) methods do you use on a regular basis? (Select all that apply)

In [None]:
Q19 = ["Q19_Part_1","Q19_Part_2","Q19_Part_3","Q19_Part_4","Q19_Part_5","Q19_OTHER"]

plt.figure(figsize = (32,32))

for idx,col in enumerate(Q19):
    plt.subplot(4,2,idx+1)
    sns.countplot(x=response[col], palette=orange_black)
    #plt.xticks(rotation=30)

plt.suptitle("NLP Methods used on a Regular Basis")
plt.show()

In [None]:
values = []
index  = ["Word embeddings/vectors (GLoVe, fastText, word2vec)",
          "Encoder-decoder models (seq2seq, vanilla transformers)",
          "Contextualized embeddings (ELMo, CoVe)",
          "Transformer language models (GPT-3, BERT, XLnet, etc)",
          "None","Other"]

values.append((len(response[response['Q19_Part_1']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q19_Part_2']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q19_Part_3']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q19_Part_4']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q19_Part_5']  != "Missing"])/response.shape[0])*100)
values.append((len(response[response['Q19_OTHER']   != "Missing"])/response.shape[0])*100)

users = pd.DataFrame({"index":index,"value":values})

plt.figure(figsize = (12,8))
sns.barplot(y=index,
            x=values,
            palette=orange_black)

plt.title('NLP Methods used on a Regular Basis')

Again, no surprise: word embeddings and encoder-decoder models are at the top here!

#### Q20 - What is the size of the company where you are employed?

In [None]:
univariate_plots("Q20","Size of the Company")