# *Analysis of the Kaggle 2020 Survey: Key Points for Aspiring Data Scientists to Focus On*

# Introduction

A survey answered by data scientists and machine learning practitioners from all over the world gives aspiring data scientists a clear view of where the industry is heading. What are the most popular tools currently used in the industry? Which tools should one learn in order to kick-start a career as a data scientist?

I have tried my best to help upcoming data scientists get the most out of this survey and find answers to the questions above. The analysis below looks at the dataset from several angles.

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objs as gobj
from plotly.offline import iplot

# Cleaning the dataset

The survey was taken by 20036 people from all over the world and consisted of 39 questions. However, many respondents took 3 minutes or less to submit it, which makes their answers unlikely to be reliable. To get meaningful results from the data, we need to remove these spam responses.

In [None]:
df = pd.read_csv("/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", skiprows = 1) # dropping the rows containing Q.Nos. I will be working with exact question name
total_respondents = df.shape[0]
spam_df = df[df["Duration (in seconds)"]<=180]
spam_respondents = spam_df.shape[0]
df = df[df["Duration (in seconds)"]>180]
print("Total Respondents: ", total_respondents)
print("Spam Respondents: ", spam_respondents)
print("Percent of spam respondents: ", '%.2f'%(spam_respondents*100/total_respondents),"%")

Out of 20036 responses, 1732 were spam submissions, which accounts for 8.64 percent of the original dataset.
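
The duration filter can be tried on a small made-up frame before running it on the full file. The sketch below assumes only the `Duration (in seconds)` column from the survey; the helper name `split_by_duration` and the toy values are hypothetical.

```python
import pandas as pd

def split_by_duration(frame, threshold=180, col="Duration (in seconds)"):
    """Return (kept, spam): rows above the threshold and rows at or below it."""
    kept = frame[frame[col] > threshold]
    spam = frame[frame[col] <= threshold]
    return kept, spam

# Toy durations, not taken from the survey
toy = pd.DataFrame({"Duration (in seconds)": [45, 180, 181, 900, 1200]})
kept, spam = split_by_duration(toy)
spam_pct = len(spam) * 100 / len(toy)
print(f"Kept {len(kept)}, spam {len(spam)} ({spam_pct:.2f} %)")
```

Keeping the threshold as a parameter makes it easy to check how sensitive the spam share is to the 3-minute cut-off.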

Since I will be focusing on the tips that industry experts have for upcoming data scientists, I will create a separate dataframe for professionals.

In [None]:
df["For how many years have you used machine learning methods?"].value_counts()

In [None]:
ml=df.loc[df["For how many years have you used machine learning methods?"] !="I do not use machine learning methods",:]
ml=ml.loc[ml["For how many years have you used machine learning methods?"] !="Under 1 year",:]
ml=ml.dropna(subset= ["For how many years have you used machine learning methods?"])
print("Machine Learning Experienced Respondents: ", ml.shape[0])

After cleaning the dataset, I am left with 7952 respondents who are experienced in machine learning.
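
The three filtering steps above can also be written as a single mask. This is a minimal sketch on made-up answers: the two excluded labels are the ones used in the cell above, while `toy_exp` and the remaining labels are only illustrative.

```python
import pandas as pd

EXP_COL = "For how many years have you used machine learning methods?"
EXCLUDED = ["I do not use machine learning methods", "Under 1 year"]

# Toy answers, including one missing value
toy_exp = pd.DataFrame({EXP_COL: ["Under 1 year", "1-2 years", None,
                                  "I do not use machine learning methods",
                                  "5-10 years"]})

# Keep rows that answered the question and did not pick an excluded label
mask = toy_exp[EXP_COL].notna() & ~toy_exp[EXP_COL].isin(EXCLUDED)
toy_ml = toy_exp[mask]
print(toy_ml[EXP_COL].tolist())
```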

# Millennials are breaking the gender barrier

Society has historically given men an edge over women. Here we look at the number of responses from each gender and compare the ML professionals currently working in the industry with all respondents of the survey.

In [None]:
# ml_profs
y = ml["What is your gender? - Selected Choice"].value_counts()
# value_counts() already pairs each gender label (index) with its count (values)
y_values = y.values.tolist()
y_gender = y.index.tolist()
fig_p_2 = px.pie(values=y_values, names=y_gender, color_discrete_sequence=px.colors.sequential.RdBu, hole=0.4, title='Percentage of Responses from each Gender - ML Professionals')
fig_p_2.show()

In [None]:
# all respondents
x = df["What is your gender? - Selected Choice"].value_counts()
# value_counts() already pairs each gender label (index) with its count (values)
x_values = x.values.tolist()
x_gender = x.index.tolist()
fig_p_1 = px.pie(values=x_values, names=x_gender, color_discrete_sequence=px.colors.sequential.RdBu, hole=0.4, title='Percentage of Responses from each Gender - Total Respondents')
fig_p_1.show()

The share of male respondents is almost 4 percentage points lower among all respondents than among ML professionals, which is a notable difference considering the size of the dataset. This suggests that the upcoming generation is starting to break the gender barrier.
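
The difference between the two pie charts can also be computed directly instead of being read off the figures. This is a sketch on made-up data that uses `value_counts(normalize=True)` to get shares; the helper `share_table` and the toy groups are hypothetical and do not reproduce the survey numbers.

```python
import pandas as pd

def share_table(group_a, group_b, names=("ML professionals", "All respondents")):
    """Percentage share of each category in two Series, side by side."""
    shares = pd.DataFrame({
        names[0]: group_a.value_counts(normalize=True) * 100,
        names[1]: group_b.value_counts(normalize=True) * 100,
    }).fillna(0)
    # Positive values mean the category is more common in the second group
    shares["difference"] = shares[names[1]] - shares[names[0]]
    return shares.round(2)

# Toy data, not the survey results
profs = pd.Series(["Man"] * 8 + ["Woman"] * 2)
everyone = pd.Series(["Man"] * 15 + ["Woman"] * 5)
table = share_table(profs, everyone)
print(table)
```

With the survey data, the two arguments would be the gender column of `ml` and of `df`.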

# Respondents from all over the world

The first map shows the number of machine learning professionals from different parts of the world who submitted a response to the survey. The second map shows all respondents, including students and inexperienced data scientists.

In [None]:
#ml_profs
#creating a country dataset with name of country and its count
country_ml = ml.groupby("In which country do you currently reside?").agg({"In which country do you currently reside?":['count']})
country_ml.columns = ['count']
country_ml['country'] = country_ml.index
#plotting the country dataset on world map using plotly
data = dict(type = 'choropleth',
            locations = country_ml['country'],
            locationmode = 'country names',
            autocolorscale = False,
            colorscale = 'Inferno_r',
            text= country_ml['country'],
            z=country_ml['count'],
            marker = dict(line = dict(color = 'rgb(255,0,0)',width = 1)),
            colorbar = {'title':'No. of Respondents','len':0.25,'lenmode':'fraction'}           )
layout = dict(geo = dict(scope='world'))
worldmap = gobj.Figure(data = [data],layout = layout)
iplot(worldmap)

In [None]:
# all respondents
#creating a country dataset with name of country and its count
country = df.groupby("In which country do you currently reside?").agg({"In which country do you currently reside?":['count']})
country.columns = ['count']
country['country'] = country.index
#plotting the country dataset on world map using plotly
data = dict(type = 'choropleth',
            locations = country['country'],
            locationmode = 'country names',
            autocolorscale = False,
            colorscale = 'Inferno_r',
            text= country['country'],
            z=country['count'],
            marker = dict(line = dict(color = 'rgb(255,0,0)',width = 1)),
            colorbar = {'title':'No. of Respondents','len':0.25,'lenmode':'fraction'})
layout = dict(geo = dict(scope='world'))
worldmap = gobj.Figure(data = [data],layout = layout)
iplot(worldmap)

A remarkable difference can be seen in the number of responses from India alone, which suggests that students from India are more interested in becoming data scientists or machine learning engineers than students in the rest of the world.
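
The country table fed to the choropleth can also be built with `value_counts`, which yields the same two columns without the `groupby`/`agg` step. This is a sketch on toy data; `toy_country` and its values are made up.

```python
import pandas as pd

COUNTRY_COL = "In which country do you currently reside?"

# Toy responses, not the survey data
toy_country = pd.DataFrame(
    {COUNTRY_COL: ["India", "Brazil", "India", "Japan", "India", "Brazil"]}
)

# One row per country with its number of respondents, largest first
country_counts = (toy_country[COUNTRY_COL]
                  .value_counts()
                  .rename_axis("country")
                  .reset_index(name="count"))
print(country_counts)
```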

# Age as a driving factor

When talking about being a professional in a particular field, age plays an important role in understanding the trends. Below are some comparisons of age against highest qualification, coding experience and profession.

In [None]:
# take both columns from ml so the table is explicitly restricted to ML professionals
ml_data_qual = pd.crosstab(ml['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'],ml['What is your age (# years)?'])
fig_cross_2 = px.imshow(ml_data_qual, labels=dict(x="Age groups", y="Highest Qualification"), title = "Age v/s Highest Qualification - ML Professionals")
fig_cross_2.show()

In [None]:
all_data_qual = pd.crosstab(df['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'],df['What is your age (# years)?'])
fig_cross_1 = px.imshow(all_data_qual, labels=dict(x="Age groups", y="Highest Qualification"), title = "Age v/s Highest Qualification - All Respondents")
fig_cross_1.show()

The above heatmaps show that the largest group of ML professionals holds a Master's degree as their highest qualification and belongs to the 25-29 and 30-34 age groups. Among all respondents, however, the most responses come from the 18-21 age group, who have either completed or are completing a Bachelor's degree. This points to strong enthusiasm for data science in the 18-21 age group.

It is a common belief among those trying to become data scientists that prior programming experience is needed to succeed.
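
Reading the brightest cell off a heatmap can be error-prone, so it may help to pull it out of the crosstab programmatically. This is a sketch on made-up data; the short column names `education` and `age` stand in for the long survey questions.

```python
import pandas as pd

# Toy data, not the survey results
toy = pd.DataFrame({
    "education": ["Master's", "Master's", "Master's", "Bachelor's",
                  "Master's", "Bachelor's", "Doctoral"],
    "age":       ["25-29", "25-29", "25-29", "18-21",
                  "30-34", "18-21", "30-34"],
})

# Both columns come from the same frame, so every row is counted exactly once
table = pd.crosstab(toy["education"], toy["age"])

# unstack() turns the table into one long Series indexed by (age, education);
# idxmax() returns the label pair of the largest count (the first one on ties)
age_group, education = table.unstack().idxmax()
top_count = int(table.to_numpy().max())
print(age_group, education, top_count)
```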

In [None]:
# take both columns from ml so the table is explicitly restricted to ML professionals
ml_data_exp = pd.crosstab(ml['For how many years have you been writing code and/or programming?'],ml['What is your age (# years)?'])
fig_cross_4 = px.imshow(ml_data_exp, labels=dict(x="Age groups", y="Experience"), title = "Age v/s Experience in coding - ML Professionals")
fig_cross_4.show()

In [None]:
all_data_exp = pd.crosstab(df['For how many years have you been writing code and/or programming?'],df['What is your age (# years)?'])
fig_cross_3 = px.imshow(all_data_exp, labels=dict(x="Age groups", y="Experience"), title = "Age v/s Experience in coding - All Respondents")
fig_cross_3.show()

These heatmaps show that the largest group of ML professionals has 3-5 years of coding experience and belongs to the 25-29 age group. Among all respondents, most have 1-2 years of experience, so they need more practice to become good at data science. That is not a problem, as they are mostly from the 18-21 age group and have enough time to build expert coding skills.

Age v/s Profession throws some light on the profession chosen by people in each age group.

In [None]:
# take both columns from ml so the table is explicitly restricted to ML professionals
ml_data_prof = pd.crosstab(ml['Select the title most similar to your current role (or most recent title if retired): - Selected Choice'],ml['What is your age (# years)?'])
fig_cross_6 = px.imshow(ml_data_prof, labels=dict(x="Age groups", y="Profession"), title = "Age v/s Profession - ML Professionals")
fig_cross_6.show()

In [None]:
all_data_prof = pd.crosstab(df['Select the title most similar to your current role (or most recent title if retired): - Selected Choice'],df['What is your age (# years)?'])
fig_cross_5 = px.imshow(all_data_prof, labels=dict(x="Age groups", y="Profession"), title = "Age v/s Profession - All Respondents")
fig_cross_5.show()

The dataframe with only ML professionals shows that people in the 25-29 and 30-34 age groups are mostly Data Scientists, whereas the whole dataset suggests that people in the 18-21 and 22-24 age groups are mostly students.

# Current Trends in the Industry

The Kaggle Survey 2020 contains a lot of questions about the tools and algorithms that are used on a regular basis in the industry. Analysing the demand for these skills is crucial from a fresher's point of view, as it gives them a clear understanding of what to prepare for when entering the workplace.

In [None]:
def optn(df, column_name):
    """Count the selections for every option of a multi-select question."""
    opts = []
    values = []
    for column in df.columns:
        # each option has its own column whose label starts with the question text
        if column.startswith(column_name):
            opts.append(column[len(column_name):].strip())
            # count() is the number of non-null cells, so it also works for
            # options that nobody selected (value_counts() would be empty there)
            values.append(df[column].count())
    dt = pd.DataFrame({'options': opts, 'values': values})
    dt = dt.sort_values(by='values', ascending=False)
    return dt

In [None]:
def bar_chart(d, title):
    figure = px.bar(d, x ='values', y='options', orientation='h',color='values', title = title)
    return figure.show()

In [None]:
# Q7
prog_lang = optn(ml,"What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - ")
bar_chart(prog_lang, "Regularly Used Programming Languages")

***Python*** was the most common answer when respondents were asked which programming languages they use on a regular basis.
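
The `optn` helper relies on each option of a "select all that apply" question having its own column, which is filled where the option was ticked and empty otherwise. The counting idea can be sketched on a made-up frame; the question text `Q - ` and the options below are hypothetical.

```python
import pandas as pd

# Toy multi-select question with one column per option (None = not ticked)
toy = pd.DataFrame({
    "Q - Python": ["Python", "Python", None, "Python"],
    "Q - R":      [None, "R", None, None],
    "Q - SQL":    ["SQL", None, None, "SQL"],
    "Age":        ["18-21", "25-29", "30-34", "25-29"],
})

prefix = "Q - "
option_cols = [c for c in toy.columns if c.startswith(prefix)]

# count() ignores missing cells, so it gives the number of respondents per option
counts = (toy[option_cols].count()
          .rename(lambda c: c[len(prefix):])
          .sort_values(ascending=False))
print(counts)
```

Because respondents can tick several options, these counts can add up to more than the number of respondents.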

In [None]:
# Q14
visual_tools = optn(ml, "What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  ")
bar_chart(visual_tools, "Regularly Used Data Visualization Libraries")

***Matplotlib*** tops the chart of regularly used data visualization libraries. ***Seaborn*** is also popular among data science experts, as it is the second most regularly used library.

In [None]:
# Q16
ml_frameworks = optn(ml, "Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -   ")
bar_chart(ml_frameworks, "Regularly Used ML Frameworks")

Although there are a lot of ML frameworks, only two choices appear in this chart: Scikit-learn and TensorFlow. Scikit-learn leads this two-way race, with TensorFlow not far behind. The numbers also suggest that most of the respondents use both frameworks on a regular basis.
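
When a multi-select chart shows fewer options than the questionnaire offered, one thing worth checking is whether the prefix passed to the helper matches the column labels character for character, including the number of spaces after the final hyphen. The sketch below uses hypothetical column labels and collapses runs of whitespace before matching. Whether the survey file actually has such differences is an assumption to verify against `df.columns`.

```python
import re
import pandas as pd

def normalize_ws(label):
    """Collapse every run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", label).strip()

# Hypothetical labels that differ only in the spaces around the option name
toy = pd.DataFrame({
    "Frameworks - Selected Choice -   Scikit-learn ": ["Scikit-learn", None],
    "Frameworks - Selected Choice -  Keras ":         ["Keras", "Keras"],
})

raw_prefix = "Frameworks - Selected Choice -   "

# Exact matching only finds labels with the same number of spaces
exact = [c for c in toy.columns if c.startswith(raw_prefix)]

# Matching on normalized text finds both labels
prefix = normalize_ws(raw_prefix)
matches = [c for c in toy.columns if normalize_ws(c).startswith(prefix)]
print(len(exact), len(matches))
```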

In [None]:
# Q17
ml_algo = optn(ml, "Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - ")
bar_chart(ml_algo, "Regularly Used ML Algorithms")

One question every budding data scientist has in mind is which ML algorithms are used most often. The bar chart above answers it: ***Linear or Logistic Regression*** and ***Decision Trees or Random Forests*** are the two most often used machine learning algorithms, with no significant difference between them.

In [None]:
#Q27
data_mlprod=optn(ml,"Do you use any of the following machine learning products on a regular basis? (Select all that apply) - Selected Choice -  ")
bar_chart(data_mlprod, "Regularly used ML Products")

Google Cloud ML Engine and Amazon SageMaker are the ML products most regularly used by machine learning professionals all over the world.

In [None]:
#q26
data_ml_cloud = optn(ml,"Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice -  ")
bar_chart(data_ml_cloud, "Regularly Used Cloud Computing Platforms")

Those familiar with machine learning and data science should also know how to use cloud computing platforms, as these are widely used in the industry to store and process datasets and to run various other tasks. AWS, GCP and Microsoft Azure are the three leaders in this field.

In [None]:
def step_bar(col, title):
    exp_col = "For how many years have you used machine learning methods?"
    # number of respondents for every (option, ML experience) pair, as flat columns
    dx = ml.groupby([col, exp_col]).size().reset_index(name='count')
    dx = dx.rename(columns={col: 'options', exp_col: 'experience'})
    dx = dx.sort_values(by='count', ascending=False)
    fig1 = px.bar(dx, x='options', y='count', color='experience', title=title)
    return fig1.show()

In [None]:
# Q38
step_bar("What is the primary tool that you use at work or school to analyze data? (Include text response) - Selected Choice", "Most Used EDA Tools")

EDA is one of the most important steps in analyzing a dataset. JupyterLab, RStudio and other local development environments are preferred over tools such as Tableau and Microsoft Excel, as they give the user a lot of flexibility.
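
The table behind a stacked bar chart like the one above is a count of respondents for every pair of answer and ML experience. This is a sketch of that aggregation on toy data; the short column names `tool` and `experience` stand in for the long survey questions.

```python
import pandas as pd

# Toy data, not the survey results
toy = pd.DataFrame({
    "tool":       ["Jupyter", "Jupyter", "Excel", "Jupyter", "Excel"],
    "experience": ["1-2 years", "3-4 years", "1-2 years", "1-2 years", "1-2 years"],
})

# size() counts rows per (tool, experience) pair; reset_index flattens the result
pairs = (toy.groupby(["tool", "experience"])
            .size()
            .reset_index(name="count")
            .sort_values("count", ascending=False))
print(pairs)
```

Each row of `pairs` becomes one coloured segment of a bar when the frame is plotted with `color='experience'`.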

In [None]:
# Q31
bi_tools = optn(ml, "Which of the following business intelligence tools do you use on a regular basis? (Select all that apply) - Selected Choice - ")
bar_chart(bi_tools, "Regularly Used BI tools")

After EDA, the next step is to visualize the data, and there are several BI tools for this. Many respondents selected None, suggesting that their role in the company is not related to this field. The most preferred BI tool is Tableau, with Microsoft Power BI second on the list and giving Tableau tough competition.

In [None]:
# Q32
step_bar("Which of the following business intelligence tools do you use most often? - Selected Choice", "Most Used BI Tools")

This chart shows the most used BI tools in the industry: Tableau leads the group, with Microsoft Power BI not far behind.

# Machine Learning from the companies' point of view

As individuals, we would love to work as data scientists or machine learning engineers. But let's see whether companies offer enough options and vacancies.

In [None]:
step_bar("What is the size of the company where you are employed?", "Company Size")

This chart shows the size of the companies our respondents work in. 0-49 employees is the most selected answer, which suggests that many professionals work at a startup or are trying to build their own startup using data science or machine learning. There are also significant numbers of responses for 10,000 or more employees and for 1000-9,999 employees, suggesting that there are enough opportunities for a new data scientist.

In [None]:
data_role = optn(ml,"Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - ")
bar_chart(data_role, "Important Roles in the industry")

This bar chart shows the activities that ML professionals are assigned in their respective companies. Analyzing and understanding data is the most selected one.

In [None]:
step_bar("Does your current employer incorporate machine learning methods into their business?", "Scope of ML in different companies")

It is also important to know whether employers use machine learning methods in their business. This analysis shows that most companies are still in the early years of exploring the field and have not yet established a machine learning department.

In [None]:
# Q8
# suggestions for budding data scientists from experts
step_bar("What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice", "Most Recommended Programming Languages")

When machine learning professionals were asked which language they would recommend beginners to learn first, the most common and standout choice was Python. This makes it clear that a strong hold on Python is a good skill to have on your resume.

# Conclusion

After an in-depth analysis of the survey responses, I have drawn several conclusions that should help aspiring data scientists move in the right direction. They should focus on learning Python first and then familiarize themselves with EDA in Python, since analyzing and understanding data is the most frequently assigned activity in companies. Next, they should start working with Matplotlib and Seaborn so that they can build charts in Python itself. After that, they should focus on ML algorithms such as linear regression and decision trees, as these are regularly used in companies. Learners should also get some hands-on practice with popular BI tools such as Tableau or Power BI.