Getting started in data science can be daunting. The difficulty lies less in the things you have to learn than in the lack of clear guidance. I am an aspiring data scientist studying Business Analytics in Singapore. On my journey to become a data scientist, I often find myself confused about what I need to know to be a good one. One reason for this confusion, in my opinion, is the overused umbrella term "data scientist".

To provide clarity to those who are in the same boat as I am, I will answer the following questions in this notebook:
* 1) How do the different professions differ in the tasks they perform regularly?
* 2) What are the programming languages used by the professionals?
* 3) Do I need to get a Master's degree? If so, should I get it right after my Bachelor's degree?


*Side note: This is my first attempt at using Python for data analytics as well as at publishing a notebook online. Any feedback is more than welcome!*



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


df= pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv",skiprows=[0], low_memory=False)


First, let us understand the demographics of the respondents. The age distribution is right-skewed, with more men than women in every age group.

In [None]:
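# Columns 1 and 2 hold the age bracket and the gender of each respondent.
ageAndGender = df.iloc[:,1:3].copy()
ageAndGender.columns = ['Age','Gender']
# Count respondents per (age bracket, gender) pair.
ageAndGender = ageAndGender.groupby(['Age','Gender']).size().to_frame(name='size').reset_index()

m = ageAndGender[ageAndGender['Gender'] == 'Man']
f = ageAndGender[ageAndGender['Gender'] == 'Woman']
fig = plt.figure()
ax = fig.add_subplot()
# Both series share the same x positions, so the 'Woman' bars are drawn on top of the 'Man' bars.
male = ax.bar(m['Age'],m['size'],0.35,color='royalblue',label='Man')
female = ax.bar(f['Age'],f['size'],0.35, color='pink',label='Woman')
ax.legend()
plt.show()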


Students make up the largest slice of the survey respondents, followed by Data Scientists.

In [None]:
current_title = df.iloc[:,5].to_frame().reset_index()
current_title.columns = ['','Current Title']
current_title = current_title.groupby('Current Title').size().to_frame(name='size').reset_index()
# One offset per slice, so the tuple length always matches the number of titles.
explode = [0.1] * len(current_title)


fig2, ax2 = plt.subplots()
ax2.pie(current_title['size'],labels=current_title['Current Title'],
                        labeldistance=1.2,
                        explode=explode,
                        wedgeprops=dict(width=0.5),
                        autopct='%1.0f%%',
                        pctdistance = 1.1,
                        startangle=-40
                       )

ax2.axis('equal')
plt.show()


Now let's answer the first question: "How do the different professions differ in the tasks they perform regularly?" To do so, we will dive deeper into Q23, where respondents were asked to select all the activities that make up an important part of their role at work. By splitting the answers by current profession, we get a clearer idea of what each group actually does at work.

It turns out that, although the percentages vary across professions, the activities are very similar. Except for research scientists, "Analyze and understand data to influence product or business decisions" takes the largest share in every profession.

In [None]:
titleVsActivities =  df.iloc[:,[5,110,111,112,113,114,115,116,117]]
titleVsActivities.columns = ['title','Analyze and understand data to influence product or business decisions',
                            'Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data',
                            'Build prototypes to explore applying machine learning to new areas',
                            'Build and/or run a machine learning service that operationally improves my product or workflows',
                            'Experimentation and iteration to improve existing ML models',
                            'Do research that advances the state of the art of machine learning',
                            "None of these activities are an important part of my role at work",
                            'Other']
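# count() ignores empty cells, so each value is the number of respondents who ticked that activity.
titleVsActivities = titleVsActivities.groupby(['title']).count().reset_index()
# Drop rows 1 and 12; with the titles sorted alphabetically these should be
# 'Currently not employed' and 'Student', who have no work activities to report.
titleVsActivities = titleVsActivities.drop([1,12]).reset_index(drop=True)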

fig8, axs8= plt.subplots(nrows = 4,ncols = 3,figsize=(20,25))
axs8[3][-1].set_visible(False)

axs8 = axs8.flatten()

for index,rows in titleVsActivities.iterrows():
    # Position 0 is the title; the remaining entries are the per-activity counts.
    value = rows.iloc[1:]

    axs8[index].pie(value,wedgeprops=dict(width=0.5),
                    autopct='%1.0f%%',
                    pctdistance = 1.1,
                    startangle=-40)
    axs8[index].set(title=rows.iloc[0])

fig8.subplots_adjust(top=0.9,bottom=0.3 )
fig8.legend(titleVsActivities.columns[1:],loc='upper center')

fig8.show()
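
To make the counting step above concrete, here is a minimal sketch on made-up data (the column names and answers are hypothetical, not taken from the survey). Each multi-select option lives in its own column and is empty when the respondent did not tick it, so `groupby(...).count()` gives the number of ticks per title. One consequence worth noting is that the pie percentages are shares of all ticks within a title, not shares of respondents.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column per answer option, NaN where the option was not ticked.
toy = pd.DataFrame({
    "title": ["Data Scientist", "Data Scientist", "Data Analyst", "Data Analyst"],
    "analyze_data": ["Analyze data", "Analyze data", "Analyze data", np.nan],
    "build_prototypes": ["Build prototypes", np.nan, np.nan, np.nan],
})

# count() skips NaN, so each cell is the number of respondents who ticked that option.
counts = toy.groupby("title").count()

# Divide each row by its own total to get the share of ticks per title.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(2))
```

In this toy example the Data Scientist row splits into two thirds and one third, even though both Data Scientists ticked the first option.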


Next, we will answer the second question: "What are the programming languages used by the professionals?" The aim is to give students a clearer idea of which programming languages to learn for the profession they are aiming for. Python seems to be in the top 3 programming languages for every profession.

In [None]:
df2 = df.iloc[:,4:21]
df2.columns = ["education",'title','prog_exp','python','R','SQL','C','C++','Java','Javascript','Julia','Swift','Bash','MATLAB','None','Other','learn_first']
title_lang = df2[['title','python','R','SQL','C','C++','Java','Javascript','Julia','Swift','Bash','MATLAB','None','Other']]
title_lang = title_lang.groupby(['title']).count().reset_index()


fig3, axs = plt.subplots(5,3,figsize=(10,20))
axs[4][1].set_visible(False)
axs[4][2].set_visible(False)
fig3.subplots_adjust(left= -1)
labels= ('python','R','SQL','C','C++','Java','Javascript','Julia','Swift','Bash','MATLAB','None','Other')
axs = axs.ravel()

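for index,row in title_lang.iterrows():
    # Position 0 is the title; the remaining 13 entries are the per-language counts.
    counts = row.iloc[1:]
    # The three largest of the 13 counts are highlighted in orange.
    top3 = sorted(list(counts))[10:]
    axs[index].barh(labels,counts,color=['orange' if (x in top3) else 'tomato' for x in counts])
    axs[index].set(title=row.iloc[0])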
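
As a side note on the highlighting logic: picking the top 3 by slicing a sorted list assumes exactly 13 columns, and comparing by value also colours any bar that ties with a top-3 count. A possible alternative, sketched here on hypothetical counts, is to use `Series.nlargest` and compare by label instead.

```python
import pandas as pd

# Hypothetical counts for one job title (not taken from the survey).
usage = pd.Series({"python": 120, "R": 40, "SQL": 95, "C": 10, "Java": 40})

# nlargest returns the three highest counts with their labels, whatever the number of columns.
top3 = usage.nlargest(3)

# Highlight by label rather than by value, so a tie outside the top 3 is not coloured too.
colors = ["orange" if lang in top3.index else "tomato" for lang in usage.index]
print(list(top3.index))
print(colors)
```

Here R and Java are tied at 40, but only R (the first one encountered) is highlighted, so exactly three bars end up orange.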
        

Now that we know which programming languages each profession uses, we will look at the language each profession recommends students learn first. Unsurprisingly, Python once again comes out on top.

In [None]:
titleVsLearn_first = df2[['title','learn_first']]
titleVsLearn_first = titleVsLearn_first.groupby(['title','learn_first']).size().to_frame(name='size').reset_index()


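# pivot() takes keyword-only arguments in pandas 2.x; languages nobody recommended get a count of 0.
titleVsLearn_first = titleVsLearn_first.pivot(index='title',columns='learn_first').reset_index().fillna(0)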

fig6, axs6 = plt.subplots(5,3,figsize=(10,20))
axs6[4][1].set_visible(False)
axs6[4][2].set_visible(False)
fig6.subplots_adjust(left= -1)
labels = ('Bash','C', 'C++','Java', 'Javascript', 'Julia', 'MATLAB', 'None', 'Other', 'Python', 'R', 'SQL', 'Swift' )
axs6 = axs6.ravel()

for index,row in titleVsLearn_first.iterrows():
    # Position 0 is the title; the remaining 13 entries are the per-language counts.
    counts = row.iloc[1:]
    top3 = sorted(list(counts))[10:]
    axs6[index].barh(labels, counts,color=['orange' if (x in top3) else 'tomato' for x in counts])
    axs6[index].set(title=row.iloc[0])


As an undergraduate, I often wonder whether I need a Master's degree and, if so, when I should get it. To answer this question, we will first look at the distribution of Doctoral, Master's and Bachelor's degrees across the professions. It turns out that the Master's degree is the most common qualification across the industry. If you are thinking of becoming a Research Scientist, then you will probably need a Doctoral degree.

In [None]:
eduVstitle = df2[['education','title']]

eduVstitle = eduVstitle[(eduVstitle['education'] == 'Doctoral degree') | 
                        (eduVstitle['education'] == 'Master’s degree') | 
                        (eduVstitle['education'] == 'Bachelor’s degree') ]

eduVstitle = eduVstitle.groupby(['title','education']).size().to_frame(name='size').reset_index()

# pivot() takes keyword-only arguments in pandas 2.x.
eduVstitle = eduVstitle.pivot(index='title',columns='education')
# Convert each title's counts into percentages of that title's total.
eduVstitle = eduVstitle.div(eduVstitle.sum(axis=1), axis=0) * 100
eduVstitle = eduVstitle.reset_index()
        


bachelor =eduVstitle[('size','Bachelor’s degree')]
master =eduVstitle[('size','Master’s degree')]
doctoral = eduVstitle[('size','Doctoral degree')]

fig4, ax4 = plt.subplots(figsize=(5,10))
labels = eduVstitle['title']
ax4.barh(y = labels, width=bachelor,height=0.8,color='darkcyan' )
ax4.barh(labels,master,0.8,left=list(bachelor), color='cadetblue')
ax4.barh(labels,doctoral ,
         left = [i+j for i,j in zip(bachelor,master)],
          color='darkturquoise')
for i,(b,m,d) in enumerate(zip(bachelor,master,doctoral)):
    plt.text((b/2), i, str(round(b))+"%", horizontalalignment='center',verticalalignment='center', color="white")
    plt.text((b+(m/2)), i, str(round(m))+"%", horizontalalignment='center',verticalalignment='center', color="white")
    plt.text(((b+m)+(d/2)), i, str(round(d))+"%", horizontalalignment='center',verticalalignment='center', color="white")
ax4.legend(['Bachelor',"Master","Doctoral"],ncol=3, bbox_to_anchor=(0, 1),loc='lower left', fontsize='medium')
plt.show()

Now that we have a rough idea of whether there is a need to further your education beyond a Bachelor's degree, we have to answer the "when". To do so, I decided to plot the distribution of the different degrees against programming experience for each profession. I am plotting against programming experience instead of age range to mitigate the impact of the different education systems in different countries.

From the chart below, we can observe that for some professions, the Bachelor's degree is the dominant group but as the programming experience increases, Master's degree becomes the dominant group. 

In [None]:
import matplotlib.ticker as mticker
titleVsProgExp = df2[['title','prog_exp','education']]
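# .copy() so that the column assignment below works on an independent frame, not a view of df2.
titleVsProgExp =  titleVsProgExp[(titleVsProgExp['education'] == 'Bachelor’s degree') | 
                                   (titleVsProgExp['education'] == 'Master’s degree') |
                                   (titleVsProgExp['education'] == "Doctoral degree")].copy()

# An ordered categorical keeps the experience brackets in their natural order.
titleVsProgExp['prog_exp'] = pd.Categorical(titleVsProgExp['prog_exp'],
                                           categories=['I have never written code','< 1 years','1-2 years',
                                                       '3-5 years','5-10 years','10-20 years','20+ years'],
                                           ordered=True)

# observed=False keeps every combination (with size 0 if empty), so the three degree series stay aligned.
titleVsProgExp  = titleVsProgExp.groupby(['title','prog_exp','education'], observed=False).size().to_frame(name='size').reset_index()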



titles = titleVsProgExp['title'].unique()
width= 0.2

exp = titleVsProgExp['prog_exp'].unique()

fig5,axs2 = plt.subplots(13,1,figsize=(15,25))
fig5.subplots_adjust(hspace=0.5)


        
for index in range(len(titles)):
    y= titleVsProgExp[titleVsProgExp['title'] == titles[index]]
    x=  np.arange(len(titleVsProgExp['prog_exp'].unique()))

    z = y[y['education'] == 'Bachelor’s degree']['size']
    t = y[y['education'] == 'Master’s degree']['size']
    s = y[y['education'] == 'Doctoral degree']['size']
    
    bach = []
    mast = []
    doc = []
    
    for b,m,d in zip(z,t,s):
        total = d+b+m
        bach.append((b/total)*100)
        mast.append((m/total)*100)
        doc.append((d/total)*100)
        
    # Grouped bars: one bar per degree, shifted by `width` within each experience bracket.
    axs2[index].bar(x, bach, width, align="center")
    axs2[index].bar(x+width, mast, width, color="salmon", align="center")
    axs2[index].bar(x+width+width, doc, width, color="slategray", align="center")

    # Place each tick label under the middle (Master) bar of its group.
    ticks_loc = [0.20,1.20,2.20,3.20,4.20,5.20,6.20]
    axs2[index].xaxis.set_major_locator(mticker.FixedLocator(ticks_loc))
    axs2[index].set_xticklabels(['I have never written code' ,'< 1 years' , '1-2 years' , '3-5 years' , '5-10 years' , '10-20 years' , '20+ years'])
    axs2[index].set(title=titles[index])

fig5.suptitle('Proportion of Education Level Vs Programming Experiences',fontsize=20)
fig5.legend(('Bachelor','Master','Doctoral'),loc='center right', fontsize='large')
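
The "dominant group changes with experience" reading can also be checked numerically. The sketch below uses made-up respondents for a single title (the labels and counts are hypothetical) and assumes `pd.crosstab` with `normalize="index"` as a shortcut for the manual percentage loop. It then looks for the first experience bracket in which the Master's share overtakes the Bachelor's share.

```python
import pandas as pd

# Hypothetical respondents for a single job title (not survey data).
toy = pd.DataFrame({
    "prog_exp": ["< 1 years"] * 4 + ["3-5 years"] * 4 + ["10-20 years"] * 4,
    "education": ["Bachelor", "Bachelor", "Bachelor", "Master",
                  "Bachelor", "Bachelor", "Master", "Master",
                  "Bachelor", "Master", "Master", "Doctoral"],
})
order = ["< 1 years", "3-5 years", "10-20 years"]
toy["prog_exp"] = pd.Categorical(toy["prog_exp"], categories=order, ordered=True)

# normalize="index" turns each experience bracket's counts into shares; * 100 gives percentages.
shares = pd.crosstab(toy["prog_exp"], toy["education"], normalize="index") * 100

# First bracket in which the Master's share strictly exceeds the Bachelor's share.
ahead = shares["Master"] > shares["Bachelor"]
crossover = ahead.idxmax() if ahead.any() else None
print(shares.round(1))
print(crossover)
```

With these toy numbers the two degrees are level at "3-5 years" and the Master's share only pulls ahead in the "10-20 years" bracket.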








In conclusion, even with the different titles, the tasks these professionals perform are mostly similar. For students trying to break into any of these professions, the best language to start with is Python. In most professions, you will probably need to get a Master's degree at some point.