# <font color="#566573">2020 Kaggle Machine Learning & Data Science Survey</font>

This notebook compares the roles of student, data scientist and software engineer, based on the questions in the 2020 Kaggle Machine Learning & Data Science Survey.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import warnings
import squarify

warnings.filterwarnings("ignore")
plt.style.use("fivethirtyeight")
palette=np.array(["#17202A", "#2C3E50", "#566573", "#ABB2B9", "#DADADA", "#F8F5F1", "#EDD39C", "#E8AD52", "#FF9200", "#DC7633", "#A04000", "#78281F", "#633C00"])

In [None]:
data=pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")
questions=data.iloc[0]
data.drop(index=0, inplace=True)

In [None]:
data.head()

### <a id="age_gender"><font color="#566573">Roles of Respondents</font></a>
* Around a quarter of the respondents are students, the largest group among all roles.
* Half of the respondents are students, data scientists or software engineers.
* Only a small number of respondents are database engineers.

In [None]:
# rename_axis + reset_index(name=...) yields the "Role"/"Count" columns on both old and new pandas versions
data_role=data["Q5"].value_counts().rename_axis("Role").reset_index(name="Count")
data_role_pct=data["Q5"].value_counts().to_frame(name="Count")
data_role_pct.T
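
The proportions quoted in the bullets above ("a quarter", "half") can be checked by normalising the counts. The following is a minimal sketch on a made-up Series; `toy_roles` is a hypothetical stand-in for `data["Q5"]` and its values are not survey results.

```python
import pandas as pd

# Hypothetical toy answers standing in for data["Q5"]; these are not survey results.
toy_roles = pd.Series([
    "Student", "Student", "Student", "Data Scientist", "Data Scientist",
    "Software Engineer", "Data Analyst", "Other",
])

# normalize=True returns each role's share of all answers instead of a raw count.
role_share = toy_roles.value_counts(normalize=True)

# Combined share of the three roles this notebook focuses on.
top3_share = role_share[["Student", "Data Scientist", "Software Engineer"]].sum()

print(role_share.round(3))
print("Share of the three roles:", round(top3_share, 3))
```

On the real data the same two calls would be applied to `data["Q5"]`.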

In [None]:
plt.figure(figsize=(10, 5))
bar=sns.barplot(x="Count", y="Role", palette=palette, data=data_role)
plt.yticks(size=12)
plt.xticks(size=12)
plt.xlabel("Number of Respondents", size=12)
plt.ylabel("Role", size=12);

for index, b in enumerate(bar.patches):
    # .iloc picks the count by position; plain [index] on a label-indexed Series is deprecated
    bar.text(b.get_width()+100, b.get_y()+0.5, data_role_pct["Count"].iloc[index], color=palette[1], size=12)
# for index, row in data_role.iterrows():
#     print(row)
#     bar.text(index, row.Count+50, row.Count, color="#2C3E50", ha="center", size=12)

As a student, I am interested in how the roles of **student, data scientist and software engineer** differ. In the following sections, I look deeper into these three roles across different survey questions and see what I can find.

### <a id="age_gender"><font color="#566573">Student | Data Scientist | Software Engineer</font></a>

In [None]:
role_lis=["Student", "Data Scientist", "Software Engineer"]
data_sds=data[data["Q5"].isin(role_lis)].reset_index(drop=True)
data_sds.head()

#### <a id="age_gender"><font color="#566573">Age</font></a>
* Unsurprisingly, students are the youngest group, and the majority of them are below 30.
* Most data scientists are between 22 and 40, and the same holds for software engineers.

In [None]:
data_sds_age=data_sds.groupby(["Q5", "Q1"]).size().reset_index().rename(columns={0:"Count"})

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(x="Q1", y="Q5", s="Count", c=palette[[1]], marker=".", linewidths=0.5, data=data_sds_age)
plt.xticks(size=12)
plt.xlabel("Age", size=12);

It is not surprising that students are the youngest respondents, and the plot suggests a transition into the other roles with age.

#### <a id="age_gender"><font color="#566573">Education Level</font></a>
* Regardless of role, the majority of respondents hold a bachelor's or master's degree.
* Half of the data scientists hold a master's degree, and around one fifth hold a doctoral degree, the highest share among the three roles.
* Among software engineers, the bachelor's degree is the most common level, slightly ahead of the master's degree.
* Overall, data scientists tend to hold higher degrees than software engineers.


In [None]:
data_sds_edu=data_sds.groupby(["Q5", "Q4"]).size().reset_index().rename(columns={0:"Count"})
data_sds_edu.sort_values(by=["Q5", "Count"], ascending=False, inplace=True)
data_sds_edu["Q4"]=data_sds_edu["Q4"].replace(["Some college/university study without earning a bachelor’s degree", "No formal education past high school"], ["college/university study without bachelor", "High school or less"])

In [None]:
fig, axes=plt.subplots(1, 3, figsize=(30, 10))

for index, role in enumerate(role_lis):
    sds_edu=data_sds_edu.query("Q5=='{}'".format(role)).sort_values(by=["Count"], ascending=False).reset_index(drop=True)
    axes[index].pie(sds_edu["Count"], wedgeprops=dict(width=0.5), labels=sds_edu["Q4"], textprops={"fontsize": 18}, colors=palette[np.arange(11, 0, -2)]);
    axes[index].set_title(role, size=18)
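
Statements such as "half of the data scientists hold a master's degree" are shares within a role, which can be read from a row-normalised cross table. The following is a minimal sketch on made-up rows; the `toy` frame and its shortened degree labels are hypothetical and only mirror the column names Q5 (role) and Q4 (education).

```python
import pandas as pd

# Hypothetical toy responses; column names mirror the survey (Q5 = role, Q4 = education).
toy = pd.DataFrame({
    "Q5": ["Data Scientist"] * 4 + ["Software Engineer"] * 4,
    "Q4": ["Master", "Master", "Doctoral", "Bachelor",
           "Bachelor", "Bachelor", "Master", "Doctoral"],
})

# normalize="index" divides every row by its own total, giving shares within each role.
edu_share = pd.crosstab(toy["Q5"], toy["Q4"], normalize="index")
print(edu_share.round(2))
```

Each row of `edu_share` sums to 1, so roles of different sizes can be compared directly.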

I don't know exactly which degree is needed to become a data scientist or a software engineer, and I expected the two roles to look very similar, but it turns out there is a big difference between them.

#### <a id="age_gender"><font color="#566573">Programming Experience</font></a>
* The majority of data scientists and software engineers have more than three years of programming experience.
* As expected, students have programmed for fewer years than the other roles. The three groups differ in size, which may bias a comparison of raw counts, but the number of students with 3-5 years of programming experience is still quite high. This suggests that students are starting to program at a younger age.

In [None]:
data_sds_codeyear=data_sds.groupby(["Q5", "Q6"]).size().reset_index().rename(columns={0:"Count"})

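# groupby returns the Q6 answers alphabetically; codeyear_sort holds their chronological rank in that order.
# Repeating the list three times assumes every role has all seven answers.
codeyear_sort=[2, 5, 6, 3, 4, 1, 0]
data_sds_codeyear.loc[:, "Sort"]=codeyear_sort*3
data_sds_codeyear=data_sds_codeyear.sort_values(by=["Sort"]).reset_index(drop=True)
data_sds_codeyear["Q6"]=data_sds_codeyear["Q6"].replace("I have never written code", "Never written")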
# data_sds_codeyear.head()
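
The hand-written rank list above depends on the alphabetical order in which the answers come out of `groupby`. An alternative sketch is an ordered categorical, which carries the intended order with the column itself. The answer wording in `q6_order` is assumed to match the Q6 options, and the counts in `toy` are made up.

```python
import pandas as pd

# Answer wording assumed to match the Q6 options; adjust if the strings differ.
q6_order = ["I have never written code", "< 1 years", "1-2 years", "3-5 years",
            "5-10 years", "10-20 years", "20+ years"]

# Hypothetical toy counts for one role, listed alphabetically as groupby would return them.
toy = pd.DataFrame({
    "Q6": ["1-2 years", "10-20 years", "20+ years", "3-5 years",
           "5-10 years", "< 1 years", "I have never written code"],
    "Count": [30, 5, 2, 25, 10, 40, 8],
})

# An ordered categorical makes sort_values follow q6_order instead of the alphabet.
toy["Q6"] = pd.Categorical(toy["Q6"], categories=q6_order, ordered=True)
toy_sorted = toy.sort_values("Q6").reset_index(drop=True)
print(toy_sorted)
```

This avoids the assumption that every role contains every answer, since the order is attached to the values rather than to row positions.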

In [None]:
plt.figure(figsize=(10, 4))
sns.pointplot(x="Q6", y="Count", hue="Q5", hue_order=role_lis, scale=0.5, palette=palette[[1, 9, 3]], data=data_sds_codeyear)
plt.yticks(size=12)
plt.xticks(size=12)
plt.xlabel("Years of Programming", size=12)
plt.ylabel("Number of Respondents", size=12)
plt.legend(title="Role", prop={"size":12});
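
Because the three roles have different numbers of respondents, raw counts can mislead, as noted in the bullets above. One way to compare them is to convert counts into shares within each role. The following is a minimal sketch on made-up counts; `toy` is a hypothetical frame in the same long format as `data_sds_codeyear`.

```python
import pandas as pd

# Hypothetical toy counts in the same long format as data_sds_codeyear.
toy = pd.DataFrame({
    "Q5": ["Student", "Student", "Data Scientist", "Data Scientist"],
    "Q6": ["< 1 years", "3-5 years", "< 1 years", "3-5 years"],
    "Count": [60, 40, 5, 20],
})

# transform("sum") broadcasts each role's total back onto its own rows.
toy["Share"] = toy["Count"] / toy.groupby("Q5")["Count"].transform("sum")
print(toy)
```

Plotting `Share` instead of `Count` would put the three lines on a common scale.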

#### <a id="age_gender"><font color="#566573">Programming Languages</font></a>
* Overall, Python is the most popular language, followed by SQL and C++.
* Among students, Python is clearly the most used language, C++ is second, and C and SQL follow very closely together. Students use a wider variety of languages on a regular basis than data scientists and software engineers, which is understandable, since a specific role usually relies on only a couple of languages. Another interesting finding is that students use MATLAB more than the other roles.
* Data scientists use Python, SQL and R much more than other languages.
* Software engineers are slightly different: besides Python and SQL, they also use Java and Javascript regularly.

In [None]:
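def combine_selectquestion(num, questnum, data):
    # Stack the columns Q{questnum}_Part_1 .. Q{questnum}_Part_{num-1} into one long table of counts per role
    frames=[]
    for i in range(1, num):
        part="Q{}_Part_{}".format(questnum, i)
        df=data.groupby(["Q5", part]).size().reset_index().rename(columns={0:"Count", part:"Q{}".format(questnum)})
        frames.append(df)
    # DataFrame.append was removed in pandas 2.0, so concatenate once and renumber the rows
    df_comb=pd.concat(frames).reset_index(drop=True)
    
    return df_comb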
    

In [None]:
data_sds_codelang=data_sds.loc[:, :"Q7_OTHER"]
data_sds_codelang.columns

In [None]:
data_sds_codelang.rename(columns={"Q7_OTHER":"Q7_Part_13"}, inplace=True)
data_sds_codelang=combine_selectquestion(14, 7, data_sds_codelang)

data_sds_codelang.groupby(["Q7"])["Count"].sum().reset_index().sort_values(by=["Count"], ascending=False).set_index(["Q7"]).T
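
The multiple-choice questions are stored as one column per option, which is why the helper loops over the `Q7_Part_*` columns. As an alternative sketch, `melt` can reshape all option columns in one step before counting. The `toy` frame below is made up and only mimics that wide layout.

```python
import pandas as pd

# Hypothetical wide layout: one column per answer option, None where the option was not ticked.
toy = pd.DataFrame({
    "Q5": ["Student", "Student", "Data Scientist", "Software Engineer"],
    "Q7_Part_1": ["Python", "Python", "Python", None],
    "Q7_Part_2": [None, "R", "R", None],
    "Q7_Part_3": ["SQL", None, "SQL", "SQL"],
})

# Collect the option columns, stack them into one "Q7" column and drop unticked options.
part_cols = [c for c in toy.columns if c.startswith("Q7_Part_")]
long_form = toy.melt(id_vars="Q5", value_vars=part_cols, value_name="Q7").dropna(subset=["Q7"])

# Count how many respondents of each role ticked each option.
counts = long_form.groupby(["Q5", "Q7"]).size().reset_index(name="Count")
print(counts)
```

The result has the same Q5 / Q7 / Count columns that `combine_selectquestion` produces.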

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(x="Q7", y="Count", hue="Q5", hue_order=role_lis, palette=palette[[1, 9, 3]], data=data_sds_codelang)
plt.xticks(size=12)
plt.yticks(size=12)
plt.legend(title="Role", prop={"size":12})
plt.xlabel("Programming Languages", size=12)
plt.ylabel("Number of Respondents", size=12);
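
Since respondents can tick several languages, one way to compare roles of different sizes is to divide each count by the number of respondents in that role. The following is a minimal sketch with made-up numbers; `respondents` and `toy_counts` are hypothetical, and in the notebook the denominators could come from `data_sds["Q5"].value_counts()`.

```python
import pandas as pd

# Hypothetical number of respondents per role; not survey results.
respondents = pd.Series({"Student": 100, "Data Scientist": 50, "Software Engineer": 40})

# Hypothetical counts in the long format of data_sds_codelang.
toy_counts = pd.DataFrame({
    "Q5": ["Student", "Data Scientist", "Software Engineer"] * 2,
    "Q7": ["Python"] * 3 + ["MATLAB"] * 3,
    "Count": [80, 45, 30, 20, 5, 2],
})

# map() looks up each row's role total, so Share is the fraction of that role using the language.
toy_counts["Share"] = toy_counts["Count"] / toy_counts["Q5"].map(respondents)
print(toy_counts)
```

Unlike shares of all selections, these fractions do not sum to 1 within a role, because one respondent can count towards several languages.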

#### <a id="age_gender"><font color="#566573">Recommended Programming Languages</font></a>
* The most recommended programming languages are Python, R and SQL, with Python recommended most of all.

In [None]:
data_sds_recomcodelang=data_sds.groupby(["Q5", "Q8"]).size().reset_index().rename(columns={0:"Count"})
data_sds_recomcodelang_T=data_sds_recomcodelang.groupby(["Q8"])["Count"].sum().reset_index().sort_values(by=["Count"], ascending=False).set_index(["Q8"]).T
data_sds_recomcodelang_T

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(x="Q8", y="Count", hue="Q5", order=data_sds_recomcodelang_T.columns, hue_order=role_lis, palette=palette[[1, 9, 3]], data=data_sds_recomcodelang)
plt.xticks(size=12)
plt.yticks(size=12)
plt.legend(title="Role", prop={"size":12})
plt.xlabel("Programming Languages", size=12)
plt.ylabel("Number of Respondents", size=12);

#### <a id="age_gender"><font color="#566573">Integrated Development Environments</font></a>
* Overall, Jupyter is the most popular IDE, followed by VSCode and PyCharm.
* Relatively few data scientists use Visual Studio.
* Software engineers use Jupyter and VSCode about equally, whereas for students and data scientists the gap between Jupyter and VSCode is quite large. Another interesting point is that relatively few software engineers use RStudio.

In [None]:
data_sds_ide=data_sds.loc[:, :"Q9_OTHER"]
data_sds_ide.columns

In [None]:
data_sds_ide.rename(columns={"Q9_OTHER":"Q9_Part_12"}, inplace=True)  
data_sds_ide=combine_selectquestion(13, 9, data_sds_ide)

data_sds_ide_T=data_sds_ide.groupby(["Q9"])["Count"].sum().reset_index().sort_values(by=["Count"], ascending=False).set_index(["Q9"]).T
data_sds_ide_T

In [None]:
data_sds_ide=data_sds_ide.sort_values(by=["Count"]).reset_index(drop=True)

plt.figure(figsize=(15, 8))
sns.barplot(x="Count", y="Q9", hue="Q5", order=data_sds_ide_T.columns, hue_order=role_lis, palette=palette[[1, 9, 3]], data=data_sds_ide)
plt.xticks(size=12)
plt.yticks(size=12)
plt.legend(title="Role", prop={"size":12})
plt.ylabel("Integrated Development Environments", size=12)
plt.xlabel("Number of Respondents", size=12);

#### <a id="age_gender"><font color="#566573">Hosted Notebook Products</font></a>
* Colab Notebooks and Kaggle Notebooks are the two most used notebook products.
* The number of data scientists using Amazon Sagemaker Studio is relatively high compared to the other roles.

In [None]:
data_sds_hnp=data_sds.loc[:, :"Q10_Part_10"]
data_sds_hnp.columns

In [None]:
data_sds_hnp=combine_selectquestion(11, 10, data_sds_hnp)
data_sds_hnp_T=data_sds_hnp.groupby(["Q10"])["Count"].sum().reset_index().sort_values(by=["Count"], ascending=False).set_index(["Q10"]).T
data_sds_hnp_T

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(x="Count", y="Q10", hue="Q5", order=data_sds_hnp_T.columns, hue_order=role_lis, palette=palette[[1, 9, 3]], data=data_sds_hnp)
plt.xticks(size=12)
plt.yticks(size=12)
plt.legend(title="Role", prop={"size":12})
plt.ylabel("Hosted Notebook Products", size=12)
plt.xlabel("Number of Respondents", size=12);

#### <a id="age_gender"><font color="#566573">Computing Platforms</font></a>
* A personal computer or laptop is the most often used platform in all roles.
* Data scientists use cloud computing platforms more than the other two roles.

In [None]:
data_sds_cp=data_sds.groupby(["Q5", "Q11"]).size().reset_index().rename(columns={0:"Count"})
data_sds_cp["Q11"]=data_sds_cp["Q11"].replace(["A deep learning workstation (NVIDIA GTX, LambdaLabs, etc)", "A cloud computing platform (AWS, Azure, GCP, hosted notebooks, etc)"], ["A deep learning workstation", "A cloud computing platform"])

In [None]:
fig, axes=plt.subplots(1, 3, figsize=(30, 10))

for index, role in enumerate(role_lis):
    sds_cp=data_sds_cp.query("Q5=='{}'".format(role)).sort_values(by=["Count"], ascending=False).reset_index(drop=True)
    axes[index].pie(sds_cp["Count"], wedgeprops=dict(width=0.5), labels=sds_cp["Q11"], textprops={"fontsize": 18}, colors=palette[np.arange(11, 0, -2)]);
    axes[index].set_title(role, size=18)

#### <a id="age_gender"><font color="#566573">Hardware | TPU (used times)</font></a>
* GPUs are used more regularly than other hardware.
* Only a small number of respondents use TPUs.
***
* The majority of respondents have never used a TPU.

In [None]:
data_sds_hd=data_sds.loc[:, :"Q12_OTHER"]
data_sds_hd.rename(columns={"Q12_OTHER":"Q12_Part_4"}, inplace=True)
data_sds_hd.columns

data_sds_hd=combine_selectquestion(5, 12, data_sds_hd)

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(x="Q12", y="Q5", s="Count", c=palette[[1]], marker=".", data=data_sds_hd)
plt.xlabel("Hardware", size=12);
plt.xticks(size=12)
plt.yticks(size=12);

In [None]:
print("Hardware Use On A Regular Basis Among All Groups"+"\n") 
print(data_sds_hd.groupby(["Q12"])["Count"].sum().reset_index().sort_values(by=["Count"], ascending=False).set_index(["Q12"]).T)

In [None]:
data_sds_tpu=data_sds.groupby(["Q5", "Q13"]).size().reset_index().rename(columns={0:"Count"})

tpuyear_time=[2, 3, 4, 0, 1]
data_sds_tpu.loc[:, "Rank"]=tpuyear_time*3
data_sds_tpu=data_sds_tpu.sort_values(by=["Rank"]).reset_index(drop=True)

In [None]:
plt.figure(figsize=(10, 4))
sns.pointplot(x="Q13", y="Count", hue="Q5", hue_order=role_lis, scale=0.5, palette=palette[[1, 9, 3]], data=data_sds_tpu)
plt.yticks(size=12)
plt.xticks(size=12)
plt.xlabel("TPU Used", size=12)
plt.ylabel("Number of Respondents", size=12)
plt.legend(title="Role", prop={"size":12});

#### <a id="age_gender"><font color="#566573">Visualization Libraries</font></a>
* Matplotlib is the most popular visualization library, followed by Seaborn, Plotly and Ggplot.
* Software engineers might not use visualization libraries that much, since the None option ranks fourth for them.
* Data scientists, on the other hand, might use visualization libraries more often than the others, since only a few of them chose the None option.

In [None]:
data_sds_vl=data_sds.loc[:, :"Q14_OTHER"]
data_sds_vl.columns

In [None]:
data_sds_vl.rename(columns={"Q14_OTHER":"Q14_Part_12"}, inplace=True)  
data_sds_vl=combine_selectquestion(13, 14, data_sds_vl).reset_index(drop=True)

data_sds_vl_T=data_sds_vl.groupby(["Q14"])["Count"].sum().reset_index().sort_values(by=["Count"], ascending=False).set_index(["Q14"]).T
data_sds_vl_T

data_sds_vl["Q14"]=data_sds_vl["Q14"].replace(" Leaflet / Folium ", "Leaflet / \n Folium")

In [None]:
fig, axes=plt.subplots(1, 3, figsize=(50, 15))

for index, role in enumerate(role_lis):
    sds_vl=data_sds_vl.query("Q5=='{}'".format(role)).sort_values(by=["Count"], ascending=False).reset_index(drop=True)
    squarify.plot(label=sds_vl.Q14, sizes=sds_vl.Count, color=palette, text_kwargs={"size":25, "color":"#5499C7", "weight":"bold"}, ax=axes[index])
    axes[index].set_title(role, size=18)
    axes[index].axis("off")
fig.tight_layout()

#### <a id="age_gender"><font color="#566573">Machine Learning Experience</font></a>
* The majority of students have used machine learning methods for less than three years.
* The share of data scientists with more than three years of machine learning experience is relatively high compared to the other roles, which suggests they use machine learning methods regularly. It might be one of the skills needed to become a data scientist.

In [None]:
data_sds_mle=data_sds.groupby(["Q5", "Q15"]).size().reset_index().rename(columns={0:"Count"})
data_sds_mle["Q15"]=data_sds_mle["Q15"].replace(["I do not use machine learning methods"], ["Do not use"])

mleyear_time=[2, 7, 3, 8, 4, 5, 6, 0, 1]
data_sds_mle.loc[:, "Rank"]=mleyear_time*3
data_sds_mle=data_sds_mle.sort_values(by=["Rank"]).reset_index(drop=True)

In [None]:
plt.figure(figsize=(10, 4))
sns.pointplot(x="Q15", y="Count", hue="Q5", hue_order=role_lis, scale=0.5, palette=palette[[1, 9, 3]], data=data_sds_mle)
plt.yticks(size=12)
plt.xticks(rotation=90, size=12)
plt.xlabel("Years of using Machine Learning", size=12)
plt.ylabel("Number of Respondents", size=12)
plt.legend(title="Role", prop={"size":12});

#### <a id="age_gender"><font color="#566573">Machine Learning Frameworks</font></a>
* Scikit-learn is the most popular machine learning framework among all roles.
* Students and software engineers rank the first seven frameworks in the same order. They use TensorFlow slightly more than Keras, PyTorch and Xgboost, whereas data scientists use Keras more than TensorFlow, Xgboost and PyTorch.

In [None]:
data_sds_mlf=data_sds.loc[:, :"Q16_OTHER"]
data_sds_mlf.columns

In [None]:
data_sds_mlf.rename(columns={"Q16_OTHER":"Q16_Part_16"}, inplace=True)  
data_sds_mlf=combine_selectquestion(17, 16, data_sds_mlf)

data_sds_mlf_T=data_sds_mlf.groupby(["Q16"])["Count"].sum().reset_index().sort_values(by=["Count"], ascending=False).set_index(["Q16"]).T
data_sds_mlf_T

In [None]:
fig, axes=plt.subplots(1, 3, figsize=(50, 15))

for index, role in enumerate(role_lis):
    sds_mlf=data_sds_mlf.query("Q5=='{}'".format(role)).sort_values(by=["Count"], ascending=False).reset_index(drop=True)
    squarify.plot(label=sds_mlf.Q16[:8], sizes=sds_mlf.Count[:8], color=palette, text_kwargs={"size":25, "color":"#5499C7", "weight":"bold"}, ax=axes[index])
    axes[index].set_title(role, size=40)
    axes[index].axis("off")
fig.tight_layout()

#### <a id="age_gender"><font color="#566573">Machine Learning Algorithms</font></a>
* Linear/Logistic Regression and Decision Trees/Random Forests are the two most regularly used algorithms regardless of role.
* Students and software engineers rank the first four algorithms in the same order: Convolutional Neural Networks third and Gradient Boosting Machines (xgboost, lightgbm, etc) fourth. For data scientists these two are reversed.

In [None]:
data_sds_mla=data_sds.loc[:, :"Q17_OTHER"]
data_sds_mla.columns

data_sds_mla.rename(columns={"Q17_OTHER":"Q17_Part_12"}, inplace=True)  
data_sds_mla=combine_selectquestion(13, 17, data_sds_mla)

data_sds_mla_T=data_sds_mla.groupby(["Q17"])["Count"].sum().reset_index().sort_values(by=["Count"], ascending=False).set_index(["Q17"]).T
data_sds_mla_T

In [None]:
split=["Linear or \nLogistic \nRegression", "Decision Trees\n or Random \nForests", "Convolutional \nNeural \nNetworks", "Gradient Boosting \nMachines (xgboost, \nlightgbm, etc)", "Bayesian \nApproaches", "Dense Neural \nNetworks (MLPs, etc)", "Recurrent \nNeural \nNetworks", "Transformer \nNetworks \n(BERT, gpt-3, etc)", "Generative \nAdversarial \nNetworks", "None", "Evolutionary \nApproaches", "Other"]
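# The labels in split are matched by position, so this assumes data_sds_mla_T.columns
# (answers sorted by total count) comes out in exactly the order listed in split.
columns=data_sds_mla_T.columns
data_sds_mla.replace(list(columns), split, inplace=True)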

In [None]:
fig, axes=plt.subplots(1, 3, figsize=(60, 20))

for index, role in enumerate(role_lis):
    sds_mla=data_sds_mla.query("Q5=='{}'".format(role)).sort_values(by=["Count"], ascending=False).reset_index(drop=True)
    squarify.plot(label=sds_mla.Q17, sizes=sds_mla.Count, color=palette, text_kwargs={"size":25, "color":"#5499C7", "weight":"bold"}, ax=axes[index])
    axes[index].set_title(role, size=40)
    axes[index].axis("off")
fig.tight_layout()
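
The ranking comparison in the bullets above can also be read from a table instead of the treemaps. The following is a minimal sketch that ranks algorithms within each role and places the ranks side by side; the `toy` counts and the shortened algorithm names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy counts in the long format of data_sds_mla; not survey results.
toy = pd.DataFrame({
    "Q5": ["Student"] * 4 + ["Data Scientist"] * 4,
    "Q17": ["Regression", "Trees", "CNN", "GBM"] * 2,
    "Count": [90, 70, 50, 40, 80, 75, 30, 60],
})

# Rank 1 = most used within each role; method="first" breaks ties by row order.
toy["Rank"] = toy.groupby("Q5")["Count"].rank(ascending=False, method="first").astype(int)

# One column per role makes the rankings easy to compare side by side.
rank_table = toy.pivot(index="Q17", columns="Q5", values="Rank")

# True where both roles give an algorithm the same rank.
same_rank = rank_table["Student"] == rank_table["Data Scientist"]
print(rank_table)
print(same_rank)
```

In this toy data the two roles agree on the first two places and swap the third and fourth.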

#### <a id="age_gender"><font color="#566573">Computer Vision Methods | Natural Language Processing (NLP)</font></a>
* All three roles rank the computer vision methods similarly.
* Image classification and other general purpose networks are the most popular.
***
* Word embeddings/vectors are used most often on a regular basis, followed by encoder-decoder models.



In [None]:
data_sds_cvm=data_sds.loc[:, :"Q18_OTHER"]
data_sds_cvm.columns

data_sds_cvm.rename(columns={"Q18_OTHER":"Q18_Part_7"}, inplace=True)  
data_sds_cvm=combine_selectquestion(8, 18, data_sds_cvm).reset_index(drop=True)

data_sds_cvm_T=data_sds_cvm.groupby(["Q18"])["Count"].sum().reset_index().sort_values(by=["Count"], ascending=False).set_index(["Q18"]).T
data_sds_cvm_T

In [None]:
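frames=[]
for role in role_lis:
    # .copy() avoids writing the new column into a view of data_sds_cvm
    df=data_sds_cvm[data_sds_cvm["Q5"]==role].copy()
    df["Pct"]=df["Count"]/df["Count"].sum()
    df=df.set_index(["Q18"]).drop(columns=["Q5", "Count"]).rename(columns={"Pct":role}).T
    frames.append(df)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the per-role rows instead
df_add=pd.concat(frames)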

    
df_add.plot.barh(stacked=True, color=palette[2:], figsize=(10, 5))
plt.legend(loc="upper center", prop={"size":10},  ncol=3, bbox_to_anchor=(0.5, 1.2))
plt.yticks(size=12)
plt.xticks(size=12);
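
The per-role percentages behind the stacked bars can also be computed without a loop. The following is a minimal sketch that pivots the counts into a role-by-method table and divides each row by its total; the `toy` counts are made up and only follow the long format of `data_sds_cvm`.

```python
import pandas as pd

# Hypothetical toy counts in the long format of data_sds_cvm; not survey results.
toy = pd.DataFrame({
    "Q5": ["Student", "Student", "Data Scientist", "Data Scientist"],
    "Q18": ["Image classification", "Object detection"] * 2,
    "Count": [30, 10, 40, 60],
})

# Roles become rows and methods become columns.
wide = toy.pivot(index="Q5", columns="Q18", values="Count")

# Divide each row by its total so every role's bar sums to 1.
pct = wide.div(wide.sum(axis=1), axis=0)
print(pct)
```

Calling `pct.plot.barh(stacked=True)` on such a table would draw the same kind of chart as the cell above.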

In [None]:
data_sds_nlp=data_sds.loc[:, :"Q19_OTHER"]
data_sds_nlp.columns

data_sds_nlp.rename(columns={"Q19_OTHER":"Q19_Part_6"}, inplace=True)  
data_sds_nlp=combine_selectquestion(7, 19, data_sds_nlp).reset_index(drop=True)

data_sds_nlp_T=data_sds_nlp.groupby(["Q19"])["Count"].sum().reset_index().sort_values(by=["Count"], ascending=False).set_index(["Q19"]).T
data_sds_nlp_T

In [None]:
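frames=[]
for role in role_lis:
    # .copy() avoids writing the new column into a view of data_sds_nlp
    df=data_sds_nlp[data_sds_nlp["Q5"]==role].copy()
    df["Pct"]=df["Count"]/df["Count"].sum()
    df=df.set_index(["Q19"]).drop(columns=["Q5", "Count"]).rename(columns={"Pct":role}).T
    frames.append(df)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the per-role rows instead
df_add=pd.concat(frames)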

    
df_add.plot.barh(stacked=True, color=palette[2:], figsize=(10, 5))
plt.legend(loc="upper center", prop={"size":12},  ncol=3, bbox_to_anchor=(0.5, 1.2))
plt.yticks(size=12)
plt.xticks(size=12);