# **Introduction** 

We are entering the Industrial Revolution 4.0 era, which is marked by work automation that relies on robots, machines, and Artificial Intelligence (AI). People are required to learn new skills in order to compete in this industrial environment, and one of the most important of these skills is programming. People who want to start learning programming often get confused because there are so many languages, notebooks, and hardware options to choose from.

In this notebook, we use the results of the 2021 Kaggle survey to discuss the languages, notebooks, and hardware used and recommended by people who have experience in writing programs. We assume that non-experts are people with fewer than 5 years of programming experience and that experts are people with 5 or more years of programming experience.
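
The grouping described above can be sketched as a small mapping from an experience answer to a level. This is a minimal sketch on toy data, not the notebook's own cell: the names `EXPERT_LABELS`, `experience_level`, and `toy_q6` are hypothetical, the bracket labels follow the ones the notebook uses later for its expert filter, and leaving missing answers unlabeled is an assumption made here for illustration.

```python
import pandas as pd

# Answer brackets treated as "expert" (hypothetical constant name).
EXPERT_LABELS = {"5-10 years", "10-20 years", "20+ years"}


def experience_level(answer):
    """Map one experience answer to a level; missing answers stay missing."""
    if pd.isna(answer):
        return pd.NA
    return "Experts" if answer in EXPERT_LABELS else "< 5 years experience"


# Toy answers standing in for the experience column (not real survey data).
toy_q6 = pd.Series(["< 1 years", "5-10 years", None, "20+ years", "1-3 years"])
toy_levels = toy_q6.map(experience_level)
print(toy_levels.tolist())
```

Keeping missing answers separate avoids counting respondents who skipped the question as non-experts, which would otherwise inflate that group.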



**Import Data**

In [None]:
# Import the libraries used in this notebook

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
#load dataframe

responses_df_2021 = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
responses_df_2021.head()

In [None]:
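# Remove the question row from the dataframe.
# .copy() gives an independent dataframe, so adding columns later does not trigger SettingWithCopyWarning.

responses_2021 = responses_df_2021.iloc[1:, :].copy()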
responses_2021.head()

Define a new dataframe, df_experts, which contains the respondents who have 5 or more years of programming experience.

In [None]:
experts_survey = ['5-10 years','10-20 years','20+ years']
df_experts = responses_2021[responses_2021['Q6'].isin(experts_survey)]
responses_2021['Level'] = ["Experts" if x in experts_survey else "< 5 years experience" for x in responses_2021['Q6']]

In [None]:
#check the number of rows and columns of the new dataframe

print('Experts =',df_experts.shape)

# **Programming Experience**

The charts below show how respondents are distributed across years of programming experience.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 8))

# Count respondents per experience bracket once, sorted from most to least common
counts = responses_2021['Q6'].value_counts()
x = counts.index
y = counts.values.tolist()

sns.set(style="darkgrid", color_codes=True)
# Reverse the palette so the most common bracket gets the darkest color
pal = sns.color_palette("YlOrRd", len(counts))[::-1]

sns.barplot(x=x, y=y, palette=pal, ax=ax[0])
for p in ax[0].patches:
    # Write the respondent count above each bar
    ax[0].annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
                   ha='center', va='bottom',
                   color='black')
ax[0].set_xlabel('Years', weight='semibold', fontname='monospace')
ax[0].set_ylabel('Number of Respondents', weight='semibold', fontname='monospace')

# Reuse the same colors in the pie chart so both panels agree
ax[1].pie(y, labels=x, colors=pal, autopct='%1.1f%%',
          explode=[0.03] * len(counts))
plt.legend(bbox_to_anchor=(1, 1))
plt.suptitle('Programming Experience', weight='bold')
plt.setp(ax[0].get_xticklabels(), rotation=90)
plt.show()

The charts show that 68.5% of respondents have fewer than 5 years of programming experience, while only 27.4% have 5 years or more. This is probably because demand for programming skills has grown recently, prompting many people to start learning to write programs.

# **Experts and Non-Experts Group of Age**

This graphic below describes the age ratio of experts and non-experts in programming.


In [None]:
age = (responses_2021.groupby(['Level'])['Q1']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q1'))

plt.figure(figsize = (15,7))
p = sns.barplot(x = "Q1", y = "percentage", hue="Level", data = age[:-1])
_ = plt.setp(p.get_xticklabels(), rotation=45)  # Rotate labels

Most of the non-expert respondents are between 18 and 21 years old, while most of the expert respondents are between 30 and 34 years old. The graph also shows an interesting pattern: as the age range increases, the ratio of experts to non-experts gets higher. This is expected, since older respondents have had more time to accumulate experience.
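
The expert-to-non-expert pattern can also be checked numerically by normalizing within each age group rather than within each level. The following is a minimal sketch on made-up rows, not survey data. The column names mirror the notebook's `Q1` and `Level`, while `toy`, `share_by_age`, and `expert_share` are hypothetical names.

```python
import pandas as pd

# Made-up respondents: age bracket and experience level (illustrative only).
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "18-21", "30-34", "30-34", "30-34", "50-54", "50-54"],
    "Level": ["< 5 years experience", "< 5 years experience", "Experts",
              "Experts", "< 5 years experience", "Experts",
              "Experts", "Experts"],
})

# Share of each level within every age group (each row sums to 1).
share_by_age = pd.crosstab(toy["Q1"], toy["Level"], normalize="index")

# Fraction of experts per age group.
expert_share = share_by_age["Experts"]
print(expert_share.round(2))
```

With `normalize="index"` the values answer "what fraction of this age group are experts?", which is the quantity the observation above is about.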

# **Highest Education**

The graph below compares the highest level of education of expert and non-expert respondents.

In [None]:
education = (responses_2021.groupby(['Level'])['Q4']
             .value_counts(normalize=True)
             .rename('Ratio')
             .mul(100)
             .reset_index()
             .sort_values('Ratio', ascending=False))

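# Shorten the longest category label so the x-axis stays readable.
# Assigning the result back is more reliable than calling replace(inplace=True) on a column.
education['Q4'] = education['Q4'].replace(
    {"Some college/university study without earning a bachelor’s degree": "University"})
education = education.rename(columns={"Q4": "Highest Education"})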

plt.figure(figsize=(10,7))

p = sns.barplot(x="Highest Education", y = "Ratio", hue = "Level" , data = education[:-1])
_ = plt.setp(p.get_xticklabels(), rotation=90)

We can see that most of the non-expert respondents have a Bachelor’s degree, while most of the expert respondents have a Master’s degree. This may be because expert respondents tend to be older than non-expert respondents and have therefore had more time and opportunities to continue their studies.

# **Experts Profession**

In [None]:
fig = plt.figure(figsize=(10, 6))
x = df_experts['Q5'].value_counts().index
y = df_experts['Q5'].value_counts().values.tolist()
data = df_experts.groupby("Q5").size()
sns.set(style="dark", color_codes=True)
pal = sns.color_palette("YlOrRd", len(data))
ax = sns.barplot(x=x,y=y,palette=pal)

for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
                    ha='center', va='bottom',
                    color= 'black')
        
ax.set_xlabel('Profession', weight='semibold', fontname = 'monospace')
ax.set_ylabel('Number of Respondents', weight='semibold', fontname = 'monospace')

plt.suptitle ('Profession of Expert Respondents',weight = 'bold')
plt.xticks(rotation=90,weight = 'bold')
plt.show()

Most of the expert respondents work as Data Scientists, a job that has been growing rapidly in recent years. One possible reason why people with programming experience choose this role is that it is perceived as offering a good work environment with relatively low stress. The high salary may also attract many people to become data scientists.

# **Programming Languages Experts Use on a Daily Basis**

In [None]:
column = ['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3','Q7_Part_4',
                       'Q7_Part_5', 'Q7_Part_6', 'Q7_Part_7', 'Q7_Part_8', 'Q7_Part_9',
                       'Q7_Part_10', 'Q7_Part_11', 'Q7_Part_12', 'Q7_OTHER']
# For each option column, record the option label (a) and how many experts selected it (b)
a = []
b = []
for i in column:
    counts = df_experts[i].value_counts()
    a.append(counts.index[0])
    b.append(counts.iloc[0])

In [None]:
fig = plt.figure(figsize=(10, 6))
x = a
y = b
sns.set(style="dark", color_codes=True)
pal = sns.color_palette("YlOrRd", len(a))
ax = sns.barplot(x=x,y=y,palette=pal)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
                    ha='center', va='bottom',
                    color= 'black')
        
ax.set_xlabel('Programming Languages', weight='semibold', fontname = 'monospace')
ax.set_ylabel('Number of Respondents', weight='semibold', fontname = 'monospace')

plt.suptitle ('Programming Language Used on Regular basis',weight = 'bold')
plt.xticks(rotation=90,weight = 'bold')
plt.show()

Expert respondents most often use Python and SQL on a daily basis. According to the TIOBE Index, Python is one of the top 3 most popular programming languages, so it is not surprising that many experts use it. This may also be because many of the respondents work as Data Scientists, a job that requires strong Python and SQL skills.
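
Because each multi-select question is spread across several columns, with one column per answer option, the counting step can be wrapped in a small helper. The sketch below is illustrative only: `count_multiselect` and the toy dataframe are hypothetical, and it assumes each option column holds either the option text or a missing value.

```python
import pandas as pd


def count_multiselect(df, columns):
    """Count non-missing answers per option column, labelled by the option text."""
    counts = {}
    for col in columns:
        answers = df[col].dropna()
        if answers.empty:
            continue  # nobody selected this option, so there is no label to read
        label = answers.iloc[0].strip()
        counts[label] = len(answers)
    return pd.Series(counts).sort_values(ascending=False)


# Toy data shaped like the survey's option columns (not real survey data).
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None, "Python"],
    "Q7_Part_3": ["SQL", None, "SQL", None],
    "Q7_Part_5": [None, None, None, None],
})
toy_counts = count_multiselect(toy, ["Q7_Part_1", "Q7_Part_3", "Q7_Part_5"])
print(toy_counts)
```

Skipping empty columns avoids the `IndexError` that reading the first label of an empty `value_counts()` result would raise, and sorting the result puts the most common option first in a bar chart.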

# **Most Recommended Programming Languages by Experts**

In [None]:
fig = plt.figure(figsize=(10, 6))
x = df_experts['Q8'].value_counts().index
y = df_experts['Q8'].value_counts().values.tolist()
data = df_experts.groupby("Q8").size()

sns.set(style="dark", color_codes=True)
pal = sns.color_palette("YlOrRd", len(data))
ax = sns.barplot(x=x,y=y,palette=pal)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
                    ha='center', va='bottom',
                    color= 'black')
        
ax.set_xlabel('Programming Languages', weight='semibold', fontname = 'monospace')
ax.set_ylabel('Number of Respondents', weight='semibold', fontname = 'monospace')
plt.suptitle ('Most Recommended Programming Language',weight = 'bold')
plt.xticks(rotation=90,weight = 'bold')
plt.show()

Most of the expert respondents suggest learning Python as a first programming language. This further suggests that Python is a language that anyone who wants to enter the world of data should learn. Possible reasons include:
1. Python is one of the most popular programming languages in the world, so people who know Python can work in a wider range of fields and places.
2. Python is easy to learn, so it is suitable for people who are just starting to learn programming.

As the graph in the previous section shows, Python is also the language most often used by expert respondents.
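
To put a number on "most", the recommendation counts can be expressed as percentages of the experts who answered. This is a minimal sketch on made-up answers. `toy_q8` and `shares` are hypothetical names, and the values are not survey results.

```python
import pandas as pd

# Made-up recommendations; None stands for a skipped question.
toy_q8 = pd.Series(["Python", "Python", "R", "SQL", "Python", None])

# value_counts ignores missing answers, so shares are relative to those who answered.
shares = toy_q8.value_counts(normalize=True).mul(100).round(1)
print(shares)
```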


# **IDE Experts use on regular basis**

In [None]:
column2 = ['Q9_Part_1', 'Q9_Part_2', 'Q9_Part_3', 'Q9_Part_4', 'Q9_Part_5',
       'Q9_Part_6', 'Q9_Part_7', 'Q9_Part_8', 'Q9_Part_9', 'Q9_Part_10',
       'Q9_Part_11', 'Q9_Part_12', 'Q9_OTHER']
# For each option column, record the option label (a2) and how many experts selected it (b2)
a2 = []
b2 = []
for i in column2:
    counts = df_experts[i].value_counts()
    a2.append(counts.index[0])
    b2.append(counts.iloc[0])

In [None]:
fig = plt.figure(figsize=(10, 6))
x=a2
y=b2
sns.set(style="dark", color_codes=True)
pal = sns.color_palette("YlOrRd", len(a2))
ax = sns.barplot(x=x,y=y,palette=pal)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
                    ha='center', va='bottom',
                    color= 'black')
        
ax.set_xlabel('IDEs', weight='semibold', fontname = 'monospace')
ax.set_ylabel('Number of Respondents', weight='semibold', fontname = 'monospace')
plt.suptitle ('IDEs Used on Regular Basis',weight = 'bold')
plt.xticks(rotation=90,weight = 'bold')
plt.show()

We can see that expert respondents use Jupyter Notebook the most as their daily IDE. This may be because Jupyter Notebook combines code with explanatory text, images, and mathematical formulas in a single document, which makes it well suited to exploring and visualizing data. The Jupyter Notebook format can also be easily used with Google Colab.

# **Hosted Notebook Products Experts use on Regular Basis**

In [None]:
column3 = ['Q10_Part_1', 'Q10_Part_2',
       'Q10_Part_3', 'Q10_Part_4', 'Q10_Part_5', 'Q10_Part_6',
       'Q10_Part_7', 'Q10_Part_8', 'Q10_Part_9', 'Q10_Part_10',
       'Q10_Part_11', 'Q10_Part_12', 'Q10_Part_13', 'Q10_Part_14',
       'Q10_Part_15', 'Q10_Part_16', 'Q10_OTHER']
# For each option column, record the option label (a3) and how many experts selected it (b3)
a3 = []
b3 = []
for i in column3:
    counts = df_experts[i].value_counts()
    a3.append(counts.index[0])
    b3.append(counts.iloc[0])

In [None]:
fig = plt.figure(figsize=(10, 6))
x=a3
y=b3
sns.set(style="dark", color_codes=True)
pal = sns.color_palette("YlOrRd", len(a3))
ax = sns.barplot(x=x,y=y,palette=pal)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
                    ha='center', va='bottom',
                    color= 'black')
        
ax.set_xlabel('Hosted Notebook Products', weight='semibold', fontname = 'monospace')
ax.set_ylabel('Number of Respondents', weight='semibold', fontname = 'monospace')
plt.suptitle ('Most Used Hosted Notebook Products on Regular Basis',weight = 'bold')
plt.xticks(rotation=90,weight = 'bold')
plt.show()

Most of the expert respondents use Colab Notebook and Kaggle Notebook as their hosted notebooks on a daily basis. This may be because both products make it easy to share notebooks and collaborate with others, which simplifies discussion when working in a team.

# **Type of computing platform Experts use Most Often for Data Science Projects**

In [None]:
fig = plt.figure(figsize=(10, 6))
x = df_experts['Q11'].value_counts().index
y = df_experts['Q11'].value_counts().values.tolist()
data = df_experts.groupby("Q11").size()

sns.set(style="dark", color_codes=True)
pal = sns.color_palette("YlOrRd", len(data))
ax = sns.barplot(x=x,y=y,palette=pal)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
                    ha='center', va='bottom',
                    color= 'black')
        
ax.set_xlabel('Computing Platform', weight='semibold', fontname = 'monospace')
ax.set_ylabel('Number of Respondents', weight='semibold', fontname = 'monospace')
plt.suptitle ('Computing Platform use most often for Data Science Projects',weight = 'bold')
plt.xticks(rotation=90,weight = 'bold')
plt.show()

Most of the expert respondents prefer to use a laptop as their platform for working on data science projects. This may be because laptops are portable and convenient for everyday work.

# **Types of Specialized Hardware Experts use on a Regular Basis**

In [None]:
column4 = ['Q12_Part_1',
       'Q12_Part_2', 'Q12_Part_3', 'Q12_Part_4', 'Q12_Part_5',
       'Q12_OTHER']
# For each option column, record the option label (a4) and how many experts selected it (b4)
a4 = []
b4 = []
for i in column4:
    counts = df_experts[i].value_counts()
    a4.append(counts.index[0])
    b4.append(counts.iloc[0])

In [None]:
fig = plt.figure(figsize=(10, 6))
x=a4
y=b4
sns.set(style="dark", color_codes=True)
pal = sns.color_palette("YlOrRd", len(a4))
ax = sns.barplot(x=x,y=y,palette=pal)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
                    ha='center', va='bottom',
                    color= 'black')
        
ax.set_xlabel('Specialized Hardware', weight='semibold', fontname = 'monospace')
ax.set_ylabel('Number of Respondents', weight='semibold', fontname = 'monospace')
plt.suptitle ('Most Used Specialized Hardware on Regular Basis',weight = 'bold')
plt.xticks(rotation=90,weight = 'bold')
plt.show()

The graph shows that most expert respondents do not use specialized hardware on a regular basis, which suggests that they are not overly concerned about the hardware they use. Among those who do use specialized hardware, NVIDIA GPUs are the most common choice.
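
The two observations above can be separated in code by setting the "None" answer aside before looking for the most common hardware. The sketch below uses invented counts purely for illustration. `toy_hardware`, `used`, `top_hardware`, and `share_none` are hypothetical names, and the option labels are assumptions rather than the survey's exact wording.

```python
import pandas as pd

# Invented selection counts per hardware option (not survey results).
toy_hardware = pd.Series({
    "None": 50,
    "NVIDIA GPUs": 30,
    "Google Cloud TPUs": 5,
    "Other": 2,
})

# Share of all selections that were "None".
share_none = toy_hardware.get("None", 0) / toy_hardware.sum()

# Drop "None" to rank only the hardware that is actually used.
used = toy_hardware.drop(labels=["None"], errors="ignore")
top_hardware = used.idxmax()

print(f"No specialized hardware: {share_none:.1%}")
print(f"Most common specialized hardware: {top_hardware}")
```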

# **Experts Favorite Media Sources that Report on Data Science Topics**

In [None]:
column5 = ['Q42_Part_1', 'Q42_Part_2', 'Q42_Part_3', 'Q42_Part_4',
       'Q42_Part_5', 'Q42_Part_6', 'Q42_Part_7', 'Q42_Part_8',
       'Q42_Part_9', 'Q42_Part_10', 'Q42_Part_11', 'Q42_OTHER']
# For each option column, record the option label (a5) and how many experts selected it (b5)
a5 = []
b5 = []
for i in column5:
    counts = df_experts[i].value_counts()
    a5.append(counts.index[0])
    b5.append(counts.iloc[0])

In [None]:
fig = plt.figure(figsize=(10, 6))
x=a5
y=b5

sns.set(style="dark", color_codes=True)
pal = sns.color_palette("YlOrRd", len(a5))
ax = sns.barplot(x=x,y=y,palette=pal)
for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
                    ha='center', va='bottom',
                    color= 'black')
        
ax.set_xlabel('Media Sources', weight='semibold', fontname = 'monospace')
ax.set_ylabel('Number of Respondents', weight='semibold', fontname = 'monospace')
plt.suptitle ('Favorite Media Source',weight = 'bold')
plt.xticks(rotation=90,weight = 'bold')
plt.show()

Expert respondents use various sources to find information, but most of them choose Kaggle, YouTube, and blogs. Kaggle is the most popular source, which may be because most of the survey respondents are Kaggle users. In addition, Kaggle offers conveniences and advantages such as a wide variety of datasets, code shared by other users, a community of data analysts and data scientists who are ready to share knowledge, and competitions with attractive prizes.

# **Conclusion**

Many people are interested in learning programming. There are many programming languages to choose from, such as Python, Java, and C, so you might be unsure which language to learn first and which ones experts use to write their programs. The survey data we have analyzed shows that the programming languages experts use most often are Python and SQL. In addition, experts recommend Python as a first language for beginners, which strengthens the case for Python as a language that beginners should learn.

Experts also prefer Jupyter Notebook as their IDE, and they choose Colab Notebook and Kaggle Notebook as hosted notebooks for daily use. For writing programs, experts most often choose a laptop as their computing platform. When looking for information, experts choose YouTube, Kaggle, and blogs as their main sources.
