# 1. Introduction to a free, no-code tool

In this deep exploration, we will try to answer some really interesting questions, namely:

1. **Why is the young generation not so much inclined towards Machine Learning? Is the math hard, or is it something else?**

2. **Why do so many Data Scientists in the Finance and Manufacturing industries not use Machine Learning methods, be it a small data science team of 1-2 people or a large team of 20+ Data Scientists?**

3. **Most of the youngsters (Age group 25-29) are exploring ML methods and may put them in production one day. But is this statement really true? Or is it an example of false feedback due to some really subtle psychological effect? (SPOILER: Something similar to Simpson's Paradox will happen!)**

4. **Why does the majority of the Finance, Government and Medical industries not put their models into production?**

5. **Does Age have an impact on people's ability to learn a new technology like MLOps?**

6. **Does using Microsoft Excel or Google Sheets create an imposter syndrome in some of the teams when they see other teams using fancy ML algorithms in an IDE? Does this make them testify falsely?**

7. **Is money the root of all evil? Can money inspire pipe dreams?**

8. **Do you need higher education to get an amazing job offer?**

9. **Which professions in which industries contribute the most towards the GDP and the charges for the use of intellectual property?**


Now of course, in order to answer these challenging questions, we first need some initial exploration of the data, on the basis of which we will try to validate these hypotheses. So for this purpose, we will kill two birds with one stone.

* We will perform the initial basic exploration of the data
* In turn, we will introduce CyberDeck, a free, no-code, end-to-end Data Science platform, and familiarize the users with its interface.



In this notebook, we will show how we can use [CyberDeck](https://cyberdeck.in/), a free, no-code, end-to-end Data Science Platform to perform a thorough analysis on the **Machine Learning and Data Science Survey** data.

Here, we will mainly stick to the Dashboard and EDA section of CyberDeck. But if you want to see its full power, then you can go through these additional resources.

1. End to End AutoML demo with Titanic Data: https://cyberdeck.in/cyberdeck-titanic-demo-no-code-data-science-tool/
2. EDA - Clustering - AutoML with COVID Data: https://www.kaggle.com/sagarnildass/eda-ml-clustering-covid-with-the-click-of-a-mouse
3. Auto Time Series forecasting - https://cyberdeck.in/time-series-forecasting-without-coding/
4. Website: https://cyberdeck.in/

With this platform, you can also save hours of your coding time and get the results as fast as I did in this demo. We are releasing this platform in Jan 2022, but we are accepting early sign-ups and have already got over 50 sign-ups in the last week. If you think this platform can benefit you, then you can visit us at https://cyberdeck.in/ and **Sign Up for Free**.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv("/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
df.head()
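
One thing worth checking in the output of `df.head()` before counting anything: in this survey file the first data row appears to hold the question text rather than a response. That is an assumption to confirm on your own copy. A minimal sketch of separating that row, using a small made-up frame in place of the real CSV:

```python
import pandas as pd

# Toy stand-in for the survey CSV (made-up values, not survey results).
# Assumption: row 0 holds the question text, the remaining rows are responses.
raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "25-29", "18-21", "25-29"],
    "Q2": ["What is your gender?", "Man", "Woman", "Man"],
})

questions = raw.iloc[0]                          # question text, keyed by column name
responses = raw.iloc[1:].reset_index(drop=True)  # actual answers only

print(questions["Q1"])
print(responses)
```

If the row is left in, it shows up as one extra category in every count.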

# 2. CyberDeck Home Page

In the following diagram, we can see the homepage of the **[CyberDeck](https://cyberdeck.in/)** app and all the details it provides about the user. 



<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/05/homepage.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


# 3. Initial Data Exploration

Next, we go to the Dashboard section, and load the data in. Here, we have a plethora of visualizations at our disposal. We will mainly use this section to answer our questions on this dataset.


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/2_select_bar_chart.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 3.a) Distribution of Age groups

To answer this question, we simply select **Histogram** from the list of available charts and select the required fields.


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/1_dashboard_hist.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We immediately see that the most dominant age groups are 25-29, 18-21 and 22-24. That's how easy it is. 

**Note**: All these plots are resizable and interactive.
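
For comparison, the histogram above corresponds to a plain frequency count in pandas. This is only a sketch on made-up answers standing in for the **Q1** column, not the survey numbers:

```python
import pandas as pd

# Made-up answers standing in for the Q1 (age group) column.
ages = pd.Series(["25-29", "18-21", "25-29", "22-24", "18-21", "25-29"], name="Q1")

# Respondents per age group, largest first (what the histogram bars show).
age_counts = ages.value_counts()
print(age_counts)
```

On the notebook's own frame the equivalent would be `df["Q1"].value_counts()`, skipping the first row if it holds the question text.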


## 3.b) Distribution of Gender

To do this, we add another histogram, select **Q2** for the x-axis, and we're done!

We see that most of the survey takers are Men followed by Women.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/3_dash_hist_2.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 3.c) Distribution of Country

For this, let's select a pie chart as the number of countries is pretty large. We see that the largest group of respondents is from my own country: **India**

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/4_dash_pie.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>
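
With this many categories, a pie chart reads better when the small slices are folded into one. A hedged sketch of that idea in pandas, on made-up answers; the column name `Q3` for country and the cut-off `top_n` are assumptions for illustration:

```python
import pandas as pd

# Made-up answers standing in for the country column (assumed to be Q3).
countries = pd.Series(
    ["India", "India", "India", "USA", "USA", "Japan", "Brazil", "Kenya"],
    name="Q3",
)

top_n = 2  # hypothetical cut-off; pick whatever keeps the chart readable
shares = countries.value_counts(normalize=True)  # fraction of respondents per country
top = shares.head(top_n)
other = pd.Series({"Other": shares.iloc[top_n:].sum()})
summary = pd.concat([top, other])
print(summary.round(3))
```

The shares still sum to 1, so the folded series can go straight into any pie chart.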


# 4. Some interesting multivariate explorations

## 4.a) Is there a correlation between age and number of years of coding experience?

### Method 1

At this point, most likely you have got the gist of this platform and how easy it is to use. So now, let's mix and match things a little bit. What if we wanted a more in-depth understanding that takes two or more variables into account? Let's try that now.

So let's ask this question: **Do older people have more coding experience, or did they start coding at the same time as the young generation, going by the 4th Industrial Revolution that we are witnessing right now?**

To answer this question, we go back to our first plot and select **Q6** as the color variable like this.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/9_dash_multi_value_select.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


And here's the plot without the popup.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/8_dash_multi_1.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So we get the following observations:

1. For different age groups, the dominant colors are still green (<1 years) and light pink (1-3 years).
2. In the age group 25-34, there is a significant number of people with 5-10 years of coding experience and very few people with 10-20 years of coding experience.
3. In the age group 35-44, there is a significant number of people with 10-20 years of coding experience.

So there is a slight association: with age, the number of years of coding experience tends to be higher. But this eyeballed pattern says nothing about causation, and even confirming the association would need a proper hypothesis test, which remains the subject of another kernel.
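
For readers who prefer code, the colored histogram is essentially a cross-tabulation of **Q1** against **Q6**. The sketch below runs on made-up rows (not survey data) and also computes the chi-square statistic by hand, just to show what a test of independence compares. It is an illustration, not the analysis used here:

```python
import pandas as pd

# Made-up rows standing in for Q1 (age group) and Q6 (coding experience).
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "18-21", "35-39", "35-39", "35-39"],
    "Q6": ["< 1 years", "< 1 years", "1-3 years",
           "10-20 years", "10-20 years", "1-3 years"],
})

# Observed counts: one row per age group, one column per experience band.
observed = pd.crosstab(toy["Q1"], toy["Q6"])

# Share of each experience band within an age group (each row sums to 1).
row_share = pd.crosstab(toy["Q1"], toy["Q6"], normalize="index")

# Chi-square statistic: observed counts versus the counts expected
# if age group and experience were independent.
obs = observed.to_numpy()
n = obs.sum()
row_tot = obs.sum(axis=1)
col_tot = obs.sum(axis=0)
expected = row_tot[:, None] * col_tot[None, :] / n
chi2 = ((obs - expected) ** 2 / expected).sum()

print(row_share.round(2))
print(round(chi2, 3))
```

On a sample this tiny the statistic means nothing. On the real data, a library routine such as `scipy.stats.chi2_contingency` would also return the p-value.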

### Method 2

Note that we could have done the same thing in a slightly different way. For this, we need to plot a **Sunburst** plot. We provide Q1 (Age group) and Q6 (Coding experience) in the **Field** column.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/10_dash_sunburst.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


And here's the plot without the pop-up.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/11_dash_sunburst_2.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


If we look at the chart carefully, we get the same information that we got from the bar plot. 

**Note:** We can add as many variables as we want to in the sunburst plot and it will go ahead and show us the divisions. Here's a standalone Sunburst plot extracted from the CyberDeck app with multiple columns visualized.

![12_dash_sunburst_multi.png](attachment:efc1b634-cddf-4690-ba5c-fb7993ac36ce.png)

So we see that this plot indeed gives us a rich and detailed visualization with multiple columns taken together.
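
The numbers behind a sunburst are nested group counts: one ring per field, each ring splitting the one inside it. A minimal sketch of those counts, assuming made-up rows for **Q1** and **Q6**:

```python
import pandas as pd

# Made-up rows standing in for Q1 (age group) and Q6 (coding experience).
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "25-29", "25-29", "25-29"],
    "Q6": ["< 1 years", "1-3 years", "1-3 years", "5-10 years", "5-10 years"],
})

# First ring: respondents per age group.
ring1 = toy.groupby("Q1").size()
# Second ring: each age group split by coding experience.
ring2 = toy.groupby(["Q1", "Q6"]).size()

print(ring1)
print(ring2)
```

Adding a third field to the `groupby` list gives the next ring, which is why the plot scales to as many columns as you add.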

# 5. CyberDeck EDA Section

The CyberDeck app also has a dedicated EDA section which is slightly more ready-made and less customizable. This is perfect for people who want a quick and dirty exploration of the data. Let's see that in action now!

We choose EDA from the sidebar and select the dataset. Then, we hit the **Run EDA** button and that's all there is to it!

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/13_eda_select_data.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Immediately, we see that the app shows us 5 sections.

1. Overview
2. Pivot Chart
3. Variables
4. Interactions
5. Correlations

Now, as this data is mostly categorical in nature, there's no question of variable interaction or correlation. So we will cover the first 3 tabs for this dataset. But if you want to see the other tabs in action as well, check these kernels out!

* https://www.kaggle.com/sagarnildass/eda-ml-clustering-covid-with-the-click-of-a-mouse
* https://www.kaggle.com/sagarnildass/convert-your-data-science-hours-to-minutes

## 5.a) EDA Overview Section

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/14_eda_overview.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


In the **Overview** tab, we see the descriptive statistics for all the variables. Note that as all the variables are categorical in nature, we only see the **count, number of unique values, most frequent value and its frequency**. If there had been numerical columns in this dataset, the summary statistics would have changed and shown us the **mean, standard deviation and different percentiles** as well.
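
These four statistics are what pandas reports for text columns too. A small sketch on a made-up frame (the values are illustrative only):

```python
import pandas as pd

# Made-up categorical answers standing in for two survey columns.
toy = pd.DataFrame({
    "Q1": ["25-29", "18-21", "25-29", "22-24"],
    "Q2": ["Man", "Woman", "Man", "Man"],
})

# For text columns, describe() reports count, unique, top and freq.
overview = toy.describe(include="all")
print(overview)
```

Here `top` is the most frequent value and `freq` is how often it occurs.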

Next we go to the **Pivot Chart** section. This is the place which will become super useful for us in this case.

## 5.b) EDA Pivot Chart section (Single Variable)

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/15_eda_pivot_chart.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Here, we immediately notice a few things. 

1. On the left-hand side, we see all the columns in this dataset listed.
2. Just on top of that, we see a dropdown, which says **Table**. We can change this to multiple things as we will see later.
3. Beside the dropdown which says **Table** currently, we see another dropdown, which currently says **Count**. We will also explore what can be changed in this dropdown.
4. On the right hand side of the **Count** dropdown and also below it, we see two empty spaces for dragging different columns in.
5. We also see the total row count of this dataset titled **Total** in the lower right corner.

With this knowledge, let's start. Let us drag Q1 into one of the empty spaces.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/16_eda_pc_q1.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We immediately see that the row counts have been split by age group. How handy is that!

Now, I will show some of the ways in which we can visualize this information. Remember the **Table** dropdown at the top of the listed columns? Let's see what else we can choose here.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/17_eda_table_dropdown.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Wow! Do you see that? A whole arsenal of options waiting to be clicked! Let's click a few of them, shall we?

### Table Heatmap

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/18_eda_table_heatmap.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We see that the table cells immediately get a color gradient based on the cell value. The higher it is, the redder it is. Pretty neat, right?

### Exportable TSV

Next, we select the **Exportable TSV** option. This presents your aggregated data in TSV (tab-separated values) format, which you can simply copy and paste into an Excel file for your reporting purposes. How cool is that! Forget about all those hours spent creating reports!

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/19_eda_export_tsv.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>
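
If you ever need the same tab-separated output from code, pandas can produce it with `to_csv` and a tab separator. A minimal sketch on made-up age-group answers:

```python
import pandas as pd

# Made-up answers standing in for the Q1 (age group) column.
ages = pd.Series(["25-29", "18-21", "25-29"], name="Q1")

# Aggregate to one row per age group with its respondent count.
counts = ages.value_counts().rename_axis("Q1").reset_index(name="count")

# sep="\t" writes tab-separated text; with no path, to_csv returns a string.
tsv = counts.to_csv(sep="\t", index=False)
print(tsv)
```

Passing a file path as the first argument writes the file instead of returning the text.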


### Grouped Column Chart

Let's select the **Grouped Column Chart** now. This will immediately show us the visualization of the table that we created in a grouped column chart format.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/20_eda_gcc.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


### Pie chart

Let's select the **Pie Chart** now and see it in action. You might be wondering, what happened to the other plots like stacked column chart, grouped bar chart etc etc. Don't worry. I am gonna show them to you in a better fashion! Just read on for now.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/21_eda_pie.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So we see that in this section, our job of a thorough Exploratory Data Analysis becomes so much easier. You can also understand why we kept this section separate from the Dashboard section.

## 5.c) EDA Pivot Chart Section (multiple variables)

Now let's step on the gas pedal a little bit more, shall we? Up to this point, we only chose **Q1** and showed it in different forms. But did you know you can add as many variables as you want in this section? Yes! **Read that line one more time.**

Let's do this! We go back to showing the **Table** format and now we add **Q2** in the other empty space.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/22_eda_two_var_table.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Man! How much I wanted to show you how fast that was! But alas! I cannot show the speed of this platform in this static article. 

But do you see what happened? On top of aggregating the rows by age, now we have an additional layer of subdivision by gender and the table cell shows us the data aggregated and pivoted to these two columns. And we can keep on adding more columns to this. Here's a view after adding **Q4 (Education degree)** to the row and **Q5 (Job title)** to the column.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/23_eda_multiple_cols.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>
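
A pivot like this one corresponds to a multi-level cross-tabulation. The sketch below assumes made-up rows and illustrative answer labels for **Q1**, **Q2** and **Q4**; the `Total` margin plays the role of the totals shown in the table:

```python
import pandas as pd

# Made-up rows standing in for age group, gender and education degree.
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "25-29", "25-29"],
    "Q2": ["Man", "Woman", "Man", "Man"],
    "Q4": ["Bachelor", "Bachelor", "Master", "Master"],
})

# Rows: age group, then education; columns: gender; cells: respondent counts.
pivot = pd.crosstab(
    [toy["Q1"], toy["Q4"]],
    toy["Q2"],
    margins=True,
    margins_name="Total",
)
print(pivot)
```

More fields can be added to either list to deepen the rows or the columns.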


And like you saw before, not only can you see all this information in a standard table format, but also in all the other formats I mentioned before (like a heatmapped table, a stacked or grouped bar chart, a pie chart etc etc). Here's the same view, Rows HeatMapped!

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/24_eda_multiple_cols_heatmapped.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Now let's see the other visualizations. To decrease the complexity, let's keep only two columns, **Q1** in the row and **Q2** in the column.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/22_eda_two_var_table.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Let's select the **Grouped Column Chart**

### Grouped Column Chart

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/25_eda_gcc_mult.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


### Stacked Column Chart

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/26_eda_scc_mult.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So you see, in the above two viz, we are getting **Gender** in the x-axis, the count in the y-axis and **Age group** becomes the color variable. Now what if we wanted to flip these two around?

In order to do that, we go to the next two sets of plots, **Grouped Bar Chart** and **Stacked Bar Chart**

#### Grouped Bar Chart

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/27_eda_gbc_mult.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


#### Stacked Bar Chart

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/28_eda_sbc_mult.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So now we see that the axes have reversed. We have the **Age Group** on one of the axes and **Gender** as the color variable. That's how easy things are inside this app!

### Area chart

In this chart, the area under the curve represents the numerical count.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/29_eda_area_chart.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


### Multiple Pie charts

In this plot, we can see that there is one pie chart for each gender and then the pie values are the age groups.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/30_eda_mult_pie_charts.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


# 6. Starting the real story via the EDA section

Now that we have a basic but thorough understanding of how much work you can do inside this app in so little time, let's go ahead and start the real exploration, where we are going to answer some really, really interesting questions.

## 6.a) What type of device do they mostly use

We see that most of them use a laptop, followed by a desktop.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/31_eda_device.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 6.b) How many times have they used a TPU (Tensor Processing Unit)

We see that the majority has never used a TPU.



<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/32_eda_tpu.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 6.c) For how many years have they used Machine Learning methods

The majority of them have used ML methods for under 1 year, followed by 1-2 years. We also see that a large number of people do not use Machine Learning methods. An interesting question is who they are. Let's try to find that out next.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/33_eda_ml_methods.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 6.c.i) Age group of people who do not use Machine Learning Methods

For doing this, we drag **Q1** in the column section.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/34_eda_ml_method_age_group.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


# Why is the young generation not so much inclined towards Machine Learning? Is the math hard, or is it something else?

Very interestingly, we see that the largest age group which does not use ML methods is 18-21, followed by 22-24, followed by 25-29. So is this a signal that young people don't like Machine Learning at all? Of course, this data is skewed and may or may not be a proper representation of the whole population, but this still leads to an interesting question.

Is the Young generation not interested in Machine Learning because:

1. The math is hard?
2. They use statistical methods more?
3. Or they simply haven't encountered a lot of Machine Learning use cases in the real world given such a young age?

I leave the answer up to you.

Now let's see what occupations they are in.
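
One caveat while you think about it: the youngest groups are also the largest in this dataset, so a big head-count of non-users can simply reflect group size. A hedged sketch of the difference between the raw count and the within-group share, on made-up rows; the exact answer label is assumed from the chart legend:

```python
import pandas as pd

# Answer label assumed from the chart legend; check the exact spelling in your data.
NO_ML = "I do not use machine learning methods"

# Made-up rows: a large young group and a small older group.
toy = pd.DataFrame({
    "Q1": ["18-21"] * 6 + ["40-44"] * 2,
    "Q15": [NO_ML, NO_ML, "Under 1 year", "Under 1 year", "1-2 years", "1-2 years",
            NO_ML, "5-10 years"],
})

is_no_ml = toy["Q15"].eq(NO_ML)

# Raw head-count of non-users per age group (what the chart shows).
count_by_age = toy.loc[is_no_ml, "Q1"].value_counts()
# Share of non-users within each age group.
rate_by_age = is_no_ml.groupby(toy["Q1"]).mean()

print(count_by_age)
print(rate_by_age.round(2))
```

In this toy frame the younger group has more non-users in absolute terms, yet the older group has the higher share. Which pattern holds in the survey is something only the real data can tell.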

## 6.c.ii) Occupation of people who do not use Machine Learning Methods

For this, we drag **Q5** in the columns section.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/35_eda_ml_occupation.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We see that of all the people who do not use Machine Learning methods, the majority are

1. Students
2. Software Engineers
3. Data analysts

Now this makes sense, because the top 3 occupations here are ones in which Machine Learning is seldom necessary.

Now what if we mix both of these variables to understand the occupation and also the age group who don't use Machine Learning?

## 6.c.iii) Occupation and Age Group Combined of people who do not use Machine Learning Methods

For doing this, we drag both **Q1 (Age Group)** and **Q5 (Occupation)** in the column section

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/36_eda_age_occ_ml.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Man! This is too much for my eyes! So many labels, so many categories! But CyberDeck has me covered in this aspect too. All I have to do is click the appropriate color legend (In this case: **I do not use Machine Learning Methods**) and the plot will filter out only those rows. Isn't this a breeze!

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/37_eda_age_occ_ml_2.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So now the information is much clearer. We get the following observations.

1. By far the highest number of people who do not use Machine Learning methods are students of age group 18-21, followed by students of age group 22-24 and students of age group 25-29. So most likely they are undergrad and grad students who need not use Machine Learning as a part of their curriculum.
2. A significant number of Data Analysts of age group 25-29 do not use Machine Learning methods. This makes sense because Data Analysts seldom have to use Machine Learning as part of their job description. They are also probably not yet experienced enough to be trusted with a complicated task like machine learning, for which Data Scientists are mainly responsible.

## 6.c.iv) How many years of coding experience do the people have who do not use Machine Learning?

For getting this answer, we drag **Q6 (Years of coding experience)** in the column section and then filter the legend down to only those people who do not use Machine Learning methods.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/38_eda_ml_coding_years.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So we see that of all the people who do not use Machine Learning methods, the majority have <1 year or 1-3 years of coding experience. This makes sense again, as Machine Learning is a complicated subject and people with less experience most likely will not attempt it.

So you see, we can form so many interesting hypotheses and then partially validate them with our exploratory data analysis. There really is an endless number of hypotheses we can make just for those people who do not use Machine Learning methods. I demonstrated just a few of them, and I will leave the rest to you. This should be a good enough starting point for learning how to ask questions that the data can answer. This is where the real beauty of Data Science lies IMHO.

But for now, let's move onto the next parts.

## 6.d) What industry does the majority work in

For this, we remove all the features from the rows and columns and only drag **Q20** in the column section.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/39_eda_occupation.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We see that the majority of the people did not answer this question. But for those who did answer, the prevailing industries are **Computers/Technology** and **Academics**.

## 6.e) How many people are responsible for Data Science workloads

For this, we drag **Q22** in the columns section

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/40_eda_ds_workload.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We see again that most of the data is null here. But apart from the null values, one interesting fact is that most respondents reported either 1-2 Data Scientists (3642 of them) or 20+ Data Scientists (3595 of them) in their organization. This split between the two extremes is really interesting.

Now I know I said that I would not delve any further into those people who do not use machine learning methods. But this really, really calls for it, wouldn't you say? So forgive me for breaking my promise, but I really want to know what percentage of them don't use Machine Learning methods. For this, I drag **Q15** in the rows section and then again filter the legend down to the people who do not use Machine Learning methods.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/41_eda_not_ml_ds.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So here we see that for people who have 1-2 data scientists in their org, 380 (10%) of them do not use ML methods. And for people who have 20+ Data Scientists in their org, 409 (11%) of them do not use ML methods.

It is really interesting that, across the two most commonly reported team sizes, around 11% of respondents (which is pretty large in my opinion) do not use Machine Learning methods.
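
The percentages follow directly from the counts quoted above. A quick check of the arithmetic, using only those four numbers:

```python
# Counts quoted in the text: respondents per team size and non-ML users among them.
team_sizes = {
    "1-2": {"total": 3642, "no_ml": 380},
    "20+": {"total": 3595, "no_ml": 409},
}

# Percentage of non-ML users within each team size.
pct = {k: round(100 * v["no_ml"] / v["total"], 1) for k, v in team_sizes.items()}

# Percentage across both team sizes taken together.
total = sum(v["total"] for v in team_sizes.values())
no_ml = sum(v["no_ml"] for v in team_sizes.values())
combined = round(100 * no_ml / total, 1)

print(pct, combined)
```

That gives roughly 10.4% and 11.4%, or about 10.9% for the two groups combined.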

## 6.f) Which industries do the people who do not use Machine Learning methods belong to

So we saw that

1. The majority of the people who do not use machine learning are students, software engineers and Data Analysts.
2. The majority of the people who do not use machine learning have fewer years of coding experience.
3. The most commonly reported numbers of Data Scientists in an organization are 1-2 and 20+, but almost 11% of those respondents do not use Machine Learning.

Now it's only logical to ask which industries these people belong to, for each of the two team sizes (1-2 and 20+). Let's try to get that information next.

For this, we keep **Q15 (Number of years of ML experience)** in the row and **Q22 (Number of Data Scientists) and Q20 (Industry they work in)** in the column. We again filter down to only those people who do not use Machine Learning methods. We do this just by double clicking on the legend.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/42_eda_ds_ml_industry.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


# Why do so many Data Scientists in the Finance and Manufacturing industries not use Machine Learning methods, be it a small data science team of 1-2 people or a large team of 20+ Data Scientists?

1. We see that for both types of company, those with 1-2 Data Scientists and those with 20+ Data Scientists, the largest number of people who do not use ML methods belong to the **Computers/Technology** or **Education/Academics** fields.
2. Apart from these two fields, the fields where Data Scientists do not use Machine Learning the most are **Accounting/Finance** and **Manufacturing/Fabrication**.

I get the first point. The data is skewed: most respondents are from the education or computer industry, so by sheer numbers many of them will not use ML methods. That is exactly what the skew would predict. What intrigues me most is the 2nd point: Data Scientists in the Finance and Manufacturing industries also often do not use Machine Learning, be it in a team of 1-2 Data Scientists or 20+ Data Scientists.

Now is this an indicator of the following points?

* These two industries simply do not require Machine Learning a lot? Now I don't know a lot about the manufacturing industry, but if we look at the current trend of AI in finance, we see an article on Medium about **Time Series Forecasting** in the finance industry almost daily. Then why is there such a shortage of ML methods in finance, at least in this dataset? Is it because:
    * For the finance industry, the articles we see on Medium are not good enough for the real world and are likely to fail there?
* And for the manufacturing industry, it seems to me that there can be a lot of really good use cases, especially regarding inventory. But why is that not reflected in this dataset? Is it because the Data Scientists this industry is hiring lack the experience to solve this problem, or is it something else?

Now most likely you have understood my pattern. I ask a question, try to answer that and then bring in more questions! More questions and few answers! Life is so hard! But let's adhere to that pattern and ask ourselves this next logical question:

## 6.g) For the Finance and the Manufacturing industry, what is the average experience (in terms of years) of Machine Learning that the employees there have?

To get this answer, we drag **Q15 (Experience of ML (years))** in the rows and **Q20 (industry)** in the column. This time we filter down to only the **Finance and Manufacturing industries** and plot a **Grouped Bar Chart**

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/43_eda_ml_exp_industry.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


And voila! We clearly see that for both the industries, the most commmon experience that the employess have for Machine Learning methods is 1-2 years which is really not enough! **So this is a really clear indicator that the Finance and the Manufacturing industries are not lacking ML use cases, but the people there are simply not experienced enough to solve such complicated problems.**

## 6.h) How comfortable are people with putting their ML models into production?

First let's get the simple answer and then I will go more complicated as I have been throughout this whole article. But one thing, I am telling you Data Scientist to Data Scientist, if I had to code all of these, I would have given up a long time ago. Because I am lazy, lazy like hell! That's why I made CyberDeck, so that I can remain lazy, ha ha! But enough about me, let's get down to the next order of business!

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/44_ml_models_into_prod.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>

Let's take a little bit of our time and put these values in descending order for our convenience (trust me, this will make our lives easier later).

1. **Most of them are exploring ML methods and may put them in production one day** - This certainly sounds biased, doesn't it? A "may the force be with you" kind of thing. They are waiting for a miracle to happen or something like that. Most likely this group is still using Jupyter notebooks for doing machine learning. And I don't blame them. Putting ML models in production is hard, really hard. Take a look at the rough steps it takes to productionize a model.
    * First, you have to create an API that serves the model's pickle file.
    * Then you most likely have to create a Dockerfile for your DevOps team so that they can run this API on some open port and deploy it.
    * Then the front facing web app consumes it.
    * But your task is far from over. What if your test data doesn't match the distribution of your training data? What if you suddenly realize that the training data was an anomaly? Or what if it initially matches your training data but slowly drifts away to a different distribution because of a change in perception, market or behaviour?
    * Taking all the above into account, you have to create some kind of model monitoring system so that you can be alerted to any kind of [concept drift](https://machinelearningmastery.com/gentle-introduction-concept-drift-machine-learning/) if that happens.
    * And if that concept drift happens (actually it will happen, it's just a matter of when), then you need to have some kind of re-training framework in place so that your business doesn't lose millions of dollars just by trusting your model.
    * So you see, this whole process is not easy. It is actually even more difficult than the model training itself. If you don't believe me, read this [paper](https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf) by Google, "Hidden Technical Debt in Machine Learning Systems". In this paper, you will see why Machine Learning is only a very small part of any organization's AI workflow. But again, CyberDeck has a one-click MLOps section, which will do all of the above with just one click of the mouse. If you want to see that in action, check this video out: https://youtu.be/bj9nCRkT8nM and skip to **8:46**.
2. **Next are the people who have admitted that they don't use ML models in production.**
3. **After that, we have people, who have well established ML models in production for more than 2 years (good for them!)**
4. **Finally, we have people, whose models have been in production for less than 2 years and people who use ML models for generating insights (but do not put them in production).**

So you see, this whole MLOps thingy is still very new to everyone, and very few have nailed it.
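To make the monitoring step from the list above a bit more concrete, here is a minimal sketch of one common check, the Population Stability Index (PSI), which compares the distribution of a single feature at training time with what the model sees later. Strictly speaking this flags a shift in the input data rather than concept drift itself, and it is only one of many possible checks. The function name `psi`, the bin count and the toy numbers are my own assumptions, and this is not how CyberDeck does it.

```python
import math


def psi(expected, actual, n_bins=10):
    """Population Stability Index between two samples of one numeric feature.

    Bins are equal-width and derived from the `expected` (training) sample;
    values outside that range are clipped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Toy feature values, invented for illustration.
train = list(range(100))             # what the model was trained on
live_same = list(range(100))         # production data that looks the same
live_shifted = [v + 50 for v in train]  # production data that has moved

print(psi(train, live_same))     # no shift
print(psi(train, live_shifted))  # clearly larger
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a shift worth investigating, but that is a convention rather than a standard, so pick a threshold that fits your own data and wire it to whatever alerting and re-training process you have.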

Now let's try to segregate these people again by their different characteristics so that we can form our next set of hypotheses.


# Most of the youngsters (Age group 25-29) are exploring ML methods and may put them in production one day. But is this statement really true? Or is it an example of false feedback due to some really subtle psychological effect? (SPOILER: Something similar to Simpson's Paradox will happen!)

# 6.i) Is there a relation between Age group and how comfortable they are in putting ML models into production?

We saw previously that younger people do not use Machine Learning that much, either because of their profession or because of their lack of experience. Let's validate that by checking whether they are also uncomfortable putting ML models into production.

We will put **Q23 (Putting ML models in production)** in the rows and **Q1 (Age group)** in the columns section. Next we plot a **Grouped Column chart**. 

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/45_eda_ml_models_prod_age.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>

This plot looks very confusing. So let's isolate the legends one by one. We will go by the descending order of count as we have listed above.

# 6.i.a) Exploring ML Models and may put them in production one day

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/46_eda_ml_prod_2.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We see that this is mostly dominated by the younger age groups, namely 25-29, 30-34 and 22-24. This makes sense, as they most likely lack experience in putting an ML model into production.

Now naturally the next question becomes: what are the current job titles of these dominating age groups? For convenience, let's take the 25-29 age group, as that is the most frequent one.

## 6.i.a.1) What is the current job title of people who are exploring ML models and may put them in production one day for Age group 25-29?

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/47_ml_prod_age_occu.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We see that when we filter down to the 25-29 age group, the dominant job titles among those who are exploring ML models and may put them in production one day are:

1. Data Scientists
2. Data Analysts
3. Software Engineers
4. Research scientists
5. Machine Learning Engineers

Now almost all of these job titles make sense except two. Didn't we previously see that Data Analysts and Software Engineers are the ones who use Machine Learning methods the least? But here, we see that they are among the dominant classes within age group 25-29 who are exploring ML models and may put them in production one day. These two insights are quite contradictory. So let's try to delve deeper into this.

For this, we bring **Q15 (Years of ML experience)** into the rows alongside the other variable. Now this is becoming really complex! We have kept only those rows where the **Q23** value is **We are exploring ML models (and may one day put a model into production)**.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/48_eda_ml_prod_occ_age.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>



So now we see what caused the anomaly. This is so beautiful!

1. If we look closely at the 25-29 age range in the above chart, we see that the highest bar belongs to the Data Scientists. For them, the **I do not use Machine Learning Methods** response is very, very low. So almost all of them have some experience with ML and thus they plan to put a model into production in the future.

2. The bar just next to it belongs to the Data Analysts. The majority of them have under 1 year or 1-2 years of experience. But a little experience always brings out the dreamer in you, and you tend to reach for higher, almost impossible achievements. So these are the people who said that **they are exploring ML models and plan to put them in production one day**. But we also see that a large number of Data Analysts do not have any experience with Machine Learning, and they made no such claim. It is because of these low-experience Data Analysts that we were getting this strange anomaly. It almost reminds me of [Simpson's Paradox](https://towardsdatascience.com/simpsons-paradox-how-to-prove-two-opposite-arguments-using-one-dataset-1c9c917f5ff9)!

3. The exact same logic applies to Software Engineers.
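The trick that resolved the anomaly is stratification: keep one Q23 answer and one age group fixed, then break every job title down by ML experience instead of looking at the totals. Here is a minimal sketch of that idea in plain Python, with toy rows I invented (not survey data) and a hypothetical helper `experience_shares`.

```python
from collections import Counter, defaultdict

# Invented toy rows: (job_title, ml_experience) for respondents aged 25-29
# who gave the "exploring ML" answer to Q23. Not real survey data.
rows = [
    ("Data Scientist", "1-2 years"),
    ("Data Scientist", "2-3 years"),
    ("Data Scientist", "Under 1 year"),
    ("Data Scientist", "I do not use machine learning methods"),
    ("Data Analyst", "Under 1 year"),
    ("Data Analyst", "1-2 years"),
    ("Data Analyst", "I do not use machine learning methods"),
    ("Data Analyst", "I do not use machine learning methods"),
]


def experience_shares(pairs):
    """For each job title, return the share of each ML-experience answer."""
    by_title = defaultdict(Counter)
    for title, experience in pairs:
        by_title[title][experience] += 1
    return {
        title: {exp: n / sum(counts.values()) for exp, n in counts.items()}
        for title, counts in by_title.items()
    }


shares = experience_shares(rows)
no_ml = "I do not use machine learning methods"
for title, dist in shares.items():
    print(f"{title}: share with no ML experience = {dist.get(no_ml, 0.0):.0%}")
```

Looking at shares within each title, rather than raw counts across all titles, is what lets a sub-group with a very different profile stand out.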

# 6.i.b) They do not use ML models in production

Let's again see which age groups these people fall in.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/49_eda_ml_no_prod.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>



So we see that the people who do not use ML in production mainly belong to the age groups **25-29, 30-34 and 35-39**. Now we can again try to see what their job titles are. Let's do that.

## 6.i.b.1) What is the current job title of people who do not use ML models in production?

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/51_eda_ml_no_prod_job_title.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Here we see that the most frequent job title among those who do not use Machine Learning in production is **OTHER**, followed by **Data Analysts** and **Software Engineers**. Most of them fall in the 25-39 age range.

Let's scrutinize this segment a little more. For example, which industries do these people fall in? Maybe some industries do not require a model to be deployed in production.

# Why do the majority of the Finance, Government and Medical industries not put their models into production?

## 6.i.b.2) Which industries do the people who do not use ML models in production mostly belong to?

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/52_eda_ml_no_prod_industry.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>

We see that the dominating industries which do not use ML models in production are

1. Education
2. Computers/Technology (But 1 and 2 might be there again due to data skew)
3. Accounting/Finance (This is interesting, given that we saw that people in the finance industry have little experience in Machine Learning methods. So is that the reason why they don't use ML models in production?)
4. Government/Public Services (This also kind of makes sense, as in the government sector data privacy is of utmost importance, and regulations such as GDPR may restrict how models that process personal data can be deployed).
5. Medical/Pharmaceutical (The logic behind this one might be that this industry mostly uses statistical methods for insights).


# 6.i.c) They have well established ML methods (models in production for more than 2 years)

Let's see which age groups these people fall in.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/53_eda_ml_models_in_prod.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>

We see here also the dominating age groups are 25-29 and 30-34. Let's see their job titles.

## 6.i.c.1) What is the job title of the people who have well established ML models?


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/54_eda_ml_prod_job_title.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Here we see that the top 2 job titles with well established ML models in production are **Data Scientists** and **Machine Learning Engineers**, which makes perfect logical sense given their job descriptions.

# Does Age have an impact on people's ability to learn a new technology like MLOps?

Now here is an interesting hypothesis: as this whole MLOps domain is very new, does that mean that older people might have more difficulty putting ML models into production? It might be that they are not able to keep up with the pace of technological change. Let's validate that hypothesis.

## 6.i.c.2) Of all the people who have well established ML models, does age play a role in their ability to apply MLOps in their organization?

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/55_ml_models_in_prod_age.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So here we clearly see that for the Data Scientists, who are the major segment among those comfortable with putting ML models in production, the number of people slowly decreases as age increases. So the hypothesis might just be true: even among the people who have well established ML models, age plays a role in their capability to deploy an ML model into production. Keep in mind, though, that the 25-29 and 30-34 groups dominate every segment we have looked at so far, so part of this decline may simply reflect how few older respondents there are.

# 6.i.d) They use ML models for generating insights (but do not put them in production)

This segment looks pretty interesting. Let's see the age group first.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/56_eda_ml_no_prod.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>

We see that the dominating age groups here are 25-29, 30-34 and 22-24.

## 6.i.d.1) What is the job title of these people?

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/57_eda_ml_no_prod_job_title.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>

We see that the dominant job titles of these people are

1. Data Scientist
2. Data Analyst
3. Research Scientist
4. Software Engineer



# Does using Microsoft Excel or Google Sheets create an imposter syndrome in some of the teams when they see other teams using fancy ML algorithms in an IDE? Does this make them testify falsely?

## 6.i.d.2) What is the primary tool these people use to analyze data?

Now here, we can form an interesting hypothesis. As these people use Machine Learning solely to generate insights and do not put models in production, most likely they also depend more heavily on statistical methods.

An even more interesting hypothesis: this group contains a large number of Data Analysts (whom we previously saw yearning to put ML models into production, especially if they have 1-2 years of work experience, like a distant dream), so it is possible that they are actually not using ML that much and are giving biased and slightly untrue feedback. After all, we always look for the shine. They see ML models as something really cool, and it might be that they are not even using them but are coveting them. As a result, we might be getting some false feedback. Maybe we can partially validate this hypothesis from the tools that they use.

Let's try to see what we get from the data. We will isolate the segment of people who use ML models only for generating insights but do not put them in production.



<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/58_eda_ml_no_prod_tool.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>



Wow! Did you see that? Most of them use JupyterLab and RStudio, which is pretty normal, but look at the second highest one: **Microsoft Excel and Google Sheets**. Now we can surely tell that they really cannot use Machine Learning inside Excel or Google Sheets, and most likely they do only some basic analysis inside them. Surprise, surprise! Looks like the hypothesis we are forming isn't so far-fetched after all!

Now the next logical thing would be to split this view into two parts: one for Data Scientists and the other for Data Analysts. Based on our previous observations and insights, I am betting my bucks on this hypothesis: **Data Analysts rely on Excel and Google Sheets, rather than a coding environment, far more than Data Scientists do.**

## 6.i.d.3) Of these people who use ML models only for generating insights, can it be true that the majority of the Data Analysts use more Microsoft Excel and Google Sheets (Where Machine Learning is not possible at all) than some IDE?

### A) Data Analysts

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/59_eda_ml_no_prod_data_analysts.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>

### B) Data Scientists

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/60_eda_ml_no_prod_data_scientist.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>



# Now my mind is blown!

Do you see how important it is to explore the data step by step, in a slow and deliberate manner? As we explored this data and formed hypotheses along the way, starting small and then going big, we were really able to hit some important points. We clearly see from the above two plots that Data Analysts mostly use Microsoft Excel, where Machine Learning is simply not possible, while Data Scientists use coding environments much, much more than Excel or Google Sheets. So this is where we see some clear bias in the dataset.
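If you want a quick sanity check that a gap like this is not just noise, a two-proportion z-test is one simple option. This is my own addition and not something the plots above compute. The sketch below uses only the standard library, the counts are invented for illustration, and `two_proportion_z` is a hypothetical helper.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value) using the pooled standard error and the normal
    approximation, so it is only reasonable for fairly large groups.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: respondents whose primary tool is a spreadsheet.
analysts_with_sheets, analysts_total = 60, 150
scientists_with_sheets, scientists_total = 30, 200

z, p = two_proportion_z(
    analysts_with_sheets, analysts_total,
    scientists_with_sheets, scientists_total,
)
print(f"z = {z:.2f}, p = {p:.2g}")
```

With the real counts from the two plots plugged in, a small p-value would support the claim that the spreadsheet share really differs between the two job titles. It says nothing about why it differs.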

Now a few big questions arise from all these observations.

1. Does money have something to do with it too? In other words, apart from the lustre of the job that Data Scientists have, is their average salary higher than that of Data Analysts?
2. If we look at companies that have a steady ML pipeline and models deployed in production, are the salaries of Data Scientists higher there than in companies that do not have a well established Machine Learning pipeline?

Let's see what we get!

# Is money the root of all evil?

# 6.j) Do companies with a well established ML pipeline pay higher salaries than companies that do not have well established ML pipelines?

For this, we make only two comparisons:

* **Companies with well established ML pipelines vs Companies who do not use ML methods**
* **Companies with well established ML pipelines vs Companies who use ML only for generating insights, but have not productionized them**

# 6.j.a) Companies with well established ML pipelines vs Companies who do not use ML methods

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/61_ml_vs_non_ml_salary.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So here we clearly see that companies that have well established ML methods also pay better than companies that do not.
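A practical note if you try this comparison in code: survey salaries usually come as range labels rather than numbers, so they need converting before you can compare groups. Here is a minimal sketch that maps each bucket to its midpoint and takes the median per group. The bucket labels, the group names and the helper `bucket_midpoint` are assumptions for illustration, and the values are toy data.

```python
import re
from statistics import median


def bucket_midpoint(label):
    """Turn a salary bucket such as '1,000-1,999' into its numeric midpoint.

    Open-ended buckets with a single number (e.g. '>$1,000,000') fall back
    to that number, which understates the true value.
    """
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", label)]
    if len(numbers) == 2:
        return sum(numbers) / 2
    return float(numbers[0])


# Toy salary buckets per company type, invented for illustration.
salaries = {
    "ML in production": ["50,000-59,999", "100,000-124,999", "70,000-79,999"],
    "No ML methods": ["10,000-14,999", "30,000-39,999", "$0-999"],
}

medians = {
    group: median(bucket_midpoint(label) for label in labels)
    for group, labels in salaries.items()
}

for group, value in medians.items():
    print(f"{group}: median salary is about {value:,.0f}")
```

The median is used here because a handful of very high salaries would drag a simple average upwards.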

# 6.j.b) Companies with well established ML pipelines vs Companies who use ML only for generating insights, but have not productionized them

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/62_ml_vs_ml_insight.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Here also, we see the same thing. Companies with well established ML methods pay much more than companies who use ML only for generating insights, but have not productionized them.

Now, you know what I am going to ask next. How do the salaries of Data Scientists compare with the salaries of Data Analysts in both of the above situations?

# 6.k) Salaries of Data Scientists vs Data Analysts in companies who have well established ML method

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/63_ml_prod_ds_vs_da_salary.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>



Wow! This is a real revelation! Do you see how much more the Data Scientists get paid than the Data Analysts? This can also be a strong indicator of why Data Analysts dream of migrating towards Machine Learning, as we saw earlier.

# 6.l) Salaries of Data Scientists vs Data Analysts in companies who do not have well established ML methods

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/64_non_ml_ds_vs_da_salary.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## Surprise! Surprise!

In the above plot, we see that in this case, the game completely changes! Now, it's the Data Analysts who are paid more on average! So no wonder they are much happier! That's why we got so many anomalies from Data Analysts who work in companies with well established ML methods, but not so many from those in companies that do not use ML methods! This is an excellent case of hankering for something when it is within your reach: remove it completely from the equation, and you stop coveting!





# 6.m) How does overall coding experience fare with ML experience?

This question is actually a really important one. Based on this answer, many other important questions might unfold. For this, I am going to take a leaf out of the book of [Teresa Kubacka](https://www.kaggle.com/tkubacka) and her excellent notebook: [A story told through a heatmap](https://www.kaggle.com/tkubacka/a-story-told-through-a-heatmap). In this notebook, she created a new feature out of two questions:

* How long have you been writing code to analyze data (at work or at school)?
* For how many years have you used machine learning methods?

![image.png](attachment:c2a25ab7-a290-447a-953e-e08ad9baf052.png)

* *Professional subgroups based on the answers for the two questions.* 
* *Author: [Teresa Kubacka](https://www.kaggle.com/tkubacka)*
* *Source: [A story told through a heatmap](https://www.kaggle.com/tkubacka/a-story-told-through-a-heatmap)*

Based on this matrix, she segmented the users into 4 groups:

1. Beginners: People who have just started coding and also have very little ML experience.
2. Modern Data Scientists: People who have moderate coding experience and started ML when the hype was high.
3. Coders in transition: People who have a lot of coding experience but started ML only recently.
4. ML Veterans: People who have been doing both ML and coding for a very long time.

Let's try to recreate this matrix inside CyberDeck with this year's data.

For this, we select **Table Row Heatmap** instead of **Table** and **Count as a fraction of Total (To give the results in percentages)** instead of **Count**

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/65_coding_vs_ml_exp.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So we see that each of the segments indeed holds some percentage of the people, with the majority having little coding and ML experience. We will engineer a new feature called **experience** based on the same logic that Teresa used and then try to understand these segments further.
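In code, engineering such a feature boils down to a small rule that maps the two experience answers to a segment name. The sketch below is only an illustration: the exact cut-offs come from the matrix in Teresa's notebook, which I am not reproducing here, so the thresholds in `experience_segment` are placeholders I picked, and the input pairs are toy values.

```python
from collections import Counter


def experience_segment(coding_years, ml_years):
    """Map years of coding and ML experience to one of the four segments.

    The cut-offs below are placeholders chosen for illustration only.
    """
    if coding_years >= 5 and ml_years >= 5:
        return "ML Veterans"
    if coding_years >= 5 and ml_years < 2:
        return "Coders in transition"
    if coding_years < 3 and ml_years < 2:
        return "Beginners"
    return "Modern Data Scientists"


# Toy (coding_years, ml_years) pairs, invented for illustration.
people = [(0.5, 0), (1, 0.5), (2, 1), (4, 3), (12, 1), (15, 8)]

segments = Counter(experience_segment(c, m) for c, m in people)

# Count as a fraction of the total, like the percentages in the heatmap.
segment_shares = {name: count / len(people) for name, count in segments.items()}

for name, share in segment_shares.items():
    print(f"{name}: {share:.0%}")
```

Once every respondent carries a segment label, it can be dragged into rows or columns like any other survey question.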

# Do you really need higher education to get an amazing job offer?

# 6.m.a) What are the job titles of these segments?

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/66_job_title_segments.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So we see

* Data Scientists, Machine Learning Engineers and Research Scientists have the highest percentage of **ML veterans** among all the job titles.
* Software Engineers, Project Managers and Data Analysts have the highest number of **Coders in transition**. This again corroborates our previous theory regarding Data Analysts and Software Engineers.

# 6.m.b) Do well established companies hire mostly ML veterans?

We always see that large companies have very stringent job descriptions, and often they want experienced people only. Let's see if that is reflected in the data.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/67_segment_company_size.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So we indeed see that the larger the company, the more it leans towards ML veterans.

Also, we often see that large organizations often prefer a candidate with a Doctorate or Masters degree. Let's see if we get any signal like that from this data.

# 6.m.c) Do large organizations prefer ML veterans with a Doctorate or Masters degree?

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/66_job_title_segments.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We see that the ML veterans (Red Bars) are very prominent in the Doctorate and the Masters degree. But this plot is a little bit too big. Let's zoom in to those individual segments.

## A) Doctorates

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/69_doctorates_company_size.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>

## B) Masters

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/70_masters_company_size.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>



So we really see the effect of a proper education and experience here. **All the biggest companies really hanker for an ML veteran with an advanced degree.**

# 7. Bringing in External Data

Now that we have extracted a whole lot of information from this dataset, why not try to get even more insights? For this purpose, we will take the world economy/demographics dataset by the World Bank (https://www.kaggle.com/sagarnildass/worldbank-economicsdemographics-data). We have only taken the data for the year 2020. Now of course, the analyses below will be highly skewed and most likely will not be an accurate representation of reality. But still, it will be really fun to get some signals from this merged dataset.

# 7.a) Which countries have highest Median GDP per capita?

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/71_gdp_per_capita.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So we see that the countries with the highest GDP per capita are mostly European Countries like Norway, Ireland, Switzerland, Denmark etc.

# 7.b) Which job titles among which industries among the survey takers contribute the highest towards the Median GDP per capita?

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/72_gdp_per_capita_title_region_v2.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Here we see some absolutely novel information. The above snippet does not cover the whole table, and I had to scroll up and down to make sense of it. I used a table here because the charts were becoming simply too cluttered. The job titles within the industries which contribute the highest towards the Median GDP per capita are listed below, divided by continent. We are taking the median in these cases because the survey dataset does not have an even distribution of countries, and taking a simple average would greatly skew the results.
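To see why the median is the safer choice here, consider a minimal sketch of the merge: each survey row picks up its country's GDP per capita, and the values are then summarised per job title and industry. Everything in it is a toy assumption: the country names, the GDP figures (these are not World Bank numbers) and the helper `gdp_by_group`.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical GDP-per-capita lookup in USD. The numbers are invented.
gdp_per_capita = {"Country A": 60000, "Country B": 2000}

# Toy survey rows: (country, job_title, industry). Invented for illustration.
survey = [
    ("Country B", "Data Analyst", "Accounting/Finance"),
    ("Country B", "Data Analyst", "Accounting/Finance"),
    ("Country B", "Data Analyst", "Accounting/Finance"),
    ("Country B", "Data Analyst", "Accounting/Finance"),
    ("Country A", "Data Analyst", "Accounting/Finance"),
    ("Country A", "Data Scientist", "Medical/Pharmaceutical"),
    ("Country C", "Data Scientist", "Medical/Pharmaceutical"),  # no GDP entry
]


def gdp_by_group(rows, lookup):
    """Join survey rows to country GDP and summarise per (title, industry)."""
    groups = defaultdict(list)
    for country, title, industry in rows:
        if country in lookup:  # rows without a GDP entry are dropped
            groups[(title, industry)].append(lookup[country])
    return {
        key: {"median": median(values), "mean": mean(values)}
        for key, values in groups.items()
    }


summary = gdp_by_group(survey, gdp_per_capita)
for (title, industry), stats in summary.items():
    print(f"{title} in {industry}: median={stats['median']}, mean={stats['mean']}")
```

In the toy Data Analyst group, a single respondent from the richer country pulls the mean far above what is typical for the group, while the median stays with the majority. That is the kind of skew an uneven country distribution produces.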

## A) America

1. Data Analysts in Military/Security/Defense
2. Data Scientist in Military/Security/Defense
3. DBA/Database Engineer in Online Service/Internet-based Services
4. Product Manager in Medical/Pharmaceutical
5. Product Manager in Marketing/CRM

It seems that Data Scientists and Analysts in Military contribute the highest towards its GDP per capita.

## B) Europe

1. Product Manager in Academics/Education
2. Business Analyst in Online Business/Internet-based Sales
3. Machine Learning Engineer in Retail/Sales
4. Data Engineer in Shipping/Transportation
5. Program/Project Manager in Marketing/CRM

In contrast to America, in Europe the highest contributors towards the GDP per capita are Product Managers in Academics/Education. Kind of ironic, isn't it? One being the symbol of war, the other being the symbol of peace. This fact is somewhat reflected in this article: https://skillspanorama.cedefop.europa.eu/en/dashboard/employed-population-occupation-and-sector?occupation=&year=2019&country=EU#1

## C) Africa

1. Research Scientist in Manufacturing/Fabrication
2. Developer Relations/Advocacy in Government/Public Service
3. Software Engineer in Insurance/Risk Assessment
4. Data Engineer in Insurance/Risk Assessment
5. Product Manager in Energy/Mining

For Africa, the result obtained here is consistent with this report: https://www.jobnetafrica.com/blog/article/34-top5-most-popular-sectors-for-jobs-in-africa

## D) Asia

1. Research Scientist in Insurance/Risk Assessment
2. Program/Project Manager in Military/Security/Defense
3. Research Scientist in Hospitality/Entertainment/Sports
4. Machine Learning Engineer in Military/Security/Defense
5. DBA/Database Engineer in Military/Security/Defense


# 7.c) Which job titles among which industries among the survey takers contribute the highest towards the Median Charges for the use of intellectual property?





<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/10/73_charges_for_int_property.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Again, the table is partially visible here. But here are the top 5 job titles and the industry:

## A) America

1. Data Analysts in Military/Security/Defense
2. Data Scientist in Medical/Pharmaceutical
3. DBA/Database Engineer in Hospitality/Entertainment/Sports
4. Business Analyst in Military/Security/Defense
5. Product Manager in Hospitality/Entertainment/Sports

## B) Europe

1. Product Manager in Broadcasting/Communications
2. Research Scientist in Insurance/Risk Assessment
3. Research Scientist in Non-profit/Service
4. DBA/Database Engineer in Government/Public Service
5. Machine Learning Engineer in Retail/Sales

## C) Africa

1. Data Scientist in Insurance/Risk Assessment
2. Data Engineer in Online Service/Internet-based Services
3. Product Manager in Energy/Mining
4. Machine Learning Engineer in Government/Public Service
5. Research Scientist in Manufacturing/Fabrication

## D) Asia

1. Research Scientist in Manufacturing/Fabrication
2. Research Scientist in Medical/Pharmaceutical
3. Statistician in Marketing/CRM
4. Product Manager in Military/Security/Defense
5. Developer Relations/Advocacy in Shipping/Transportation

# Conclusion

I hope this notebook helps you get started with this amazing dataset and also helps you ask the right questions, which can bring out tremendously surprising answers from the data. Everything is right under our nose, we just have to find it. For that, the only way is to ask the right questions! I only scratched the surface, but I hope you can find starting points in this notebook for many interesting hypotheses that are worth testing.

Also, in this endeavour, I hope that you have found the **[CyberDeck](https://cyberdeck.in/)** platform useful. If you think that this platform can also save you hours just like it did for me, then don't forget to sign up **[here](https://cyberdeck.in/)**!

It was really a privilege to explore this beautiful and rich dataset. We uncovered some really interesting insights, and it even gave us a partial glimpse into the human psyche.

I would like to end this journey with a famous quote by Napoleon:  ***War Is Ninety Percent Information.***