# How You, an Aspiring Indonesian Data Scientist, Should Learn Data Science

<span style="color:gray">Findings from the 2020 Kaggle Data Science & Machine Learning Survey for Aspiring Indonesian Data Scientists</span>


## Motivation

The year is 2020 and we are in a pandemic. A lot of bad things have happened, from forced career shifts for professionals to online learning for students. People are struggling. One thing we have in common is that we **learn** to adapt in a flash.

The government has tried to reassure us using data. In Indonesia, IMO, some of it makes sense, but most of it doesn't. Still, at least the decision-making processes are **based on data**.

Unfortunately, people who can extract insights and contribute to those decision-making processes are still in short supply. We already have some of them, but we still need more.

So...

Does Indonesia lack data scientists or, more generally, data talents?

Beyond that, what exactly is the state of Indonesian data scientists and machine learning engineers?

And how should aspiring data scientists learn data science under (post-)pandemic circumstances?

In [None]:
import textwrap

try:
    import countrygroups
except ModuleNotFoundError:
    !pip install countrygroups
    import countrygroups
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

plt.style.use("ggplot")

## Points At Issue

Motivated by the "whys" above, I explored the 2020 Kaggle Data Science & Machine Learning Survey to at least try to uncover answers to the questions below.

* What is the state of data talents in Indonesia?
* What are the most used data science tools and platforms?
* How should someone, especially an Indonesian, learn to break into the field of data science?

Let's gather the dataset and inspect the first 5 rows.

In [None]:
df_kaggle = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")

# create mappings from column names to questions
dict_kaggle_questions = df_kaggle.loc[0].to_dict()

# drop questions in row 0
df_kaggle.drop(index=0, inplace=True)

df_kaggle.head()

We've already created a dictionary that maps each column name to its corresponding question. With it, we can easily look up which columns hold the answers to which questions.

Now, for multiple-selection questions, the corresponding columns have the word `Part` in their names. Let's define two groups of questions: those with a single answer (multiple choice) and those with multiple selections.

In [None]:
list_single_answer_cols = df_kaggle.loc[:, ~df_kaggle.columns.str.contains("(?i)part|other")].columns.tolist()
list_multiple_answers_cols = df_kaggle.drop(columns=list_single_answer_cols).columns.tolist()
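
As a quick sanity check of the name-based split, here is a minimal sketch on a toy frame. The column names are made up to mimic the survey's naming scheme (they are an assumption, not the real header), and `case=False` is used as an alternative to the inline `(?i)` flag.

```python
import pandas as pd

# Hypothetical columns that mimic the survey's naming scheme.
toy = pd.DataFrame(columns=["Q1", "Q4", "Q7_Part_1", "Q7_Part_2", "Q7_OTHER"])

# Columns whose names contain "part" or "other" (case-insensitive) each hold
# one option of a multiple-selection question.
is_multi = toy.columns.str.contains("part|other", case=False, regex=True)

single_cols = toy.columns[~is_multi].tolist()
multi_cols = toy.columns[is_multi].tolist()

print(single_cols)
print(multi_cols)
```

On the real data it is worth eyeballing both lists once, since a pattern match on names will also catch any unrelated column that happens to contain "part" or "other".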

Columns representing single-answer (multiple-choice) questions:

In [None]:
df_kaggle[list_single_answer_cols].head()

The dataset must have a lot of missing values since, as Kaggle has stated in its survey methodology document, more experienced respondents get more questions than less experienced ones, especially for the multiple-selection questions.

So, we inspect the missing values in the single-answer (multiple-choice) questions below.

In [None]:
df_kaggle[list_single_answer_cols].info()

It looks like there are no missing values in the first four columns. Missing values start at Q4 and continue through the last single-answer question. Multiple-selection questions certainly contain missing values, since each selectable option is stored in its own column. Hence, we are not interested in those.

In [None]:
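# rows that have no answer for Q4 (highest level of formal education)
df_missing_q4 = df_kaggle[df_kaggle.Q4.isna()]

# distinct missing-value counts per column (axis=0) and per row (axis=1) within those rows
num_missing_per_col_q4 = df_missing_q4.isna().sum(axis=0).unique().tolist()
num_missing_per_row_q4 = df_missing_q4.isna().sum(axis=1).unique().tolist()
print("Missing values per column among rows with missing Question 4 (highest level of formal education):", num_missing_per_col_q4)
print("Missing values per row among rows with missing Question 4 (highest level of formal education):", num_missing_per_row_q4)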

This one is interesting. There are 467 rows with a missing Q4, and every column that has missing values in those rows is missing in all 467 of them. Also, each of those rows has 351 missing values; since we have 355 columns in total and the first 4 columns are complete, every remaining value in those rows is missing. Therefore, we can **drop those rows**.
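
The same reasoning can be replayed on a toy frame. This is a minimal sketch with made-up data, in which two respondents stop answering after the country question; the column contents are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (made up): rows 1 and 3 stopped answering after column "Q3".
toy = pd.DataFrame({
    "Q3": ["Indonesia", "India", "Japan", "Brazil"],
    "Q4": ["Bachelor", np.nan, "Master", np.nan],
    "Q5": ["Student", np.nan, "Data Scientist", np.nan],
    "Q6": ["1-2 years", np.nan, np.nan, np.nan],
})

abandoned = toy[toy["Q4"].isna()]

# axis=0: missing values per column, counted over the abandoned rows only.
per_column = abandoned.isna().sum(axis=0)
# axis=1: missing values per abandoned row.
per_row = abandoned.isna().sum(axis=1)

print(per_column.unique().tolist())  # [0, 2]: a column is either complete or fully missing
print(per_row.unique().tolist())     # [3]: everything after "Q3" is missing

# Dropping on Q4 therefore loses no answers beyond the leading column.
cleaned = toy.dropna(subset=["Q4"]).reset_index(drop=True)
print(len(cleaned))  # 2
```

If the per-row list contained more than one value, some of those respondents would have answered later questions, and dropping them would discard real answers.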

In [None]:
df_kaggle = df_kaggle.dropna(subset=["Q4"]).reset_index(drop=True)

## The Survey At A Glance

Now that the dataset is clean, let's dive deeper to understand the data. Since we are not interested in building a model, we will only analyze the data and make some visualizations to answer the questions stated earlier.

In [None]:
df_kaggle.Q3.value_counts().sort_values(ascending=False)[14::-1].plot(kind="barh")
plt.title(dict_kaggle_questions["Q3"])
plt.show()

In [None]:
df_kaggle.Q4.value_counts().sort_values(ascending=False)[14::-1].plot(kind="barh")
plt.title(textwrap.fill(dict_kaggle_questions["Q4"], 60))
plt.show()

In [None]:
df_kaggle.Q8.value_counts().sort_values(ascending=False)[14::-1].plot(kind="barh")
plt.title(textwrap.fill(dict_kaggle_questions["Q8"].split(" - ")[0], 50))
plt.show()
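
The `[14::-1]` slice used in the three plots keeps the 15 most frequent answers and reverses them, so that the largest bar ends up at the top of a horizontal bar chart. A small sketch with made-up answers, using `n = 3` instead of 15 and `.iloc` to make the positional intent explicit, shows an equivalent and arguably more readable spelling:

```python
import pandas as pd

# Made-up answers; value_counts() already sorts from most to least frequent.
answers = pd.Series(["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"])
counts = answers.value_counts()

n = 3
# Positional slice in the style of the notebook: start at n-1, walk back to 0.
top_n_slice = counts.iloc[n - 1::-1]
# Equivalent: take the first n rows, then reverse them.
top_n_head = counts.head(n).iloc[::-1]

print(top_n_head.index.tolist())  # ['c', 'b', 'a']
```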

From the figures above, the respondents are dominated by those who have a bachelor's or master's degree, and most of them reside in India. **Indonesia**, our main focus here, takes the 15th position with fewer than 500 respondents. For aspiring data scientists, according to this survey, you're encouraged to learn Python as your first weapon.

Some of these characteristics will serve as a baseline for comparison with those of the Indonesian respondents. We may add more characteristics as we look at more questions.

First, we split the data into 3 groups to suit our motivation: respondents who reside in **Indonesia**, in a **SEA country** other than Indonesia, and in the rest of the world (**global**).

In [None]:
sea_country = countrygroups.UNSTATS_GEOGRAPHICAL_REGIONS.ASIA.SOUTH_EASTERN_ASIA.names
sea_country.remove("Indonesia")
print("Other Southeast Asian countries except Indonesia:\n", ", ".join(sea_country))

In [None]:
df_indonesia = df_kaggle[df_kaggle.Q3 == "Indonesia"]
print("Number of respondents residing in Indonesia:", len(df_indonesia))

df_sea = df_kaggle[df_kaggle.Q3.isin(sea_country)]
print("Number of respondents residing in other Southeast Asian countries:", len(df_sea))

df_global = df_kaggle[~df_kaggle.Q3.isin(sea_country + ["Indonesia"])]
print("Number of remaining respondents:", len(df_global))

# for visualization, we will add one column to df_kaggle to indicate those subgroups
def subgroups(country):
    if country == "Indonesia":
        return "indonesia"
    if country in sea_country:
        return "sea"
    return "global"

df_kaggle["subgroups"] = df_kaggle.Q3.apply(subgroups)
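
A cheap way to gain confidence in such a labelling function is to check that the subgroups partition the data. The sketch below does this on made-up respondents with a shortened, hypothetical list of other SEA countries; `label_subgroup` and `other_sea` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Made-up respondents and a shortened, hypothetical list of other SEA countries.
toy = pd.DataFrame(
    {"Q3": ["Indonesia", "Singapore", "India", "Malaysia", "Indonesia", "Brazil"]}
)
other_sea = ["Singapore", "Malaysia", "Thailand"]


def label_subgroup(country):
    """Return the subgroup label for a country of residence."""
    if country == "Indonesia":
        return "indonesia"
    if country in other_sea:
        return "sea"
    return "global"


toy["subgroups"] = toy["Q3"].apply(label_subgroup)
sizes = toy["subgroups"].value_counts()

# Every respondent lands in exactly one subgroup, so the sizes add up.
assert sizes.sum() == len(toy)
print(sizes.to_dict())
```

One thing to watch on the real data is spelling: a country only ends up in the SEA group if its name in the survey matches the name in the country list exactly.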

## Data Talents in Indonesia

We refer to data talents as anyone who deals with data in their day-to-day work, or anyone who is learning and wants a job that relies heavily on data, e.g. data engineer, data analyst, or any other title that exists in the survey data.

To be able to understand the state of data talents in Indonesia, we inspect their job titles and educational backgrounds.

In [None]:
indonesia_titles = df_indonesia.Q5.value_counts(normalize=True) * 100
sea_titles = df_sea.Q5.value_counts(normalize=True) * 100
global_titles = df_global.Q5.value_counts(normalize=True) * 100

all_groups_titles = [indonesia_titles, sea_titles, global_titles]
list_subgroups = ["indonesia", "sea", "global"]
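
Because the three subgroups differ greatly in size, raw counts are not comparable, which is why shares are used here. A minimal sketch with made-up job titles:

```python
import pandas as pd

# Made-up job titles for a small and a large group.
small = pd.Series(["Student", "Student", "Data Scientist", "Data Analyst"])
large = pd.Series(["Student"] * 50 + ["Data Scientist"] * 30 + ["Data Analyst"] * 20)

# normalize=True returns fractions that sum to 1; multiply by 100 for percent.
small_pct = small.value_counts(normalize=True) * 100
large_pct = large.value_counts(normalize=True) * 100

# Both groups have the same share of students despite very different sizes.
print(small_pct["Student"], large_pct["Student"])
```

Note that `value_counts` drops missing values by default, so the percentages are relative to the respondents who actually answered the question.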

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, group_title, group in zip(axes, all_groups_titles, list_subgroups):
    ax.bar(
        x=group_title.index,
        height=group_title.values
    )
    ax.tick_params(axis="x", labelrotation=90)  # rotate the job-title labels
    ax.set_ylabel("Percentage (%)")
    ax.set_title(group.upper())
plt.suptitle("Job Titles", fontsize=15)
plt.show()

My first expectation was that Data Scientist would be the most popular role selected in the survey (well, it’s a survey from Kaggle). But, as we can see, **Student is the most selected role**. In fact, it’s the most selected role in all subgroups.

> This suggests that a lot of respondents are in the middle of their learning path.

Another similarity between the 3 subgroups is that the bottom 3 roles selected are **Data Engineer, Statistician, and DBA/Database Engineer**.

It seems that respondents from other SEA countries are more often non-professional aspiring data scientists compared to those from Indonesia and the rest of the world. Data Scientist is among the top 5 roles in both the Indonesian and global subgroups. It’s even the second most selected job title in Indonesia.

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(15, 10))
list_titles = df_kaggle.Q5.dropna().unique()
list_education = df_kaggle.Q4.dropna().unique()
for ax, df, group in zip(axes, [df_indonesia, df_sea, df_global], list_subgroups):
    plot = sns.countplot(data=df, x="Q5", order=list_titles,
                         palette="tab20", hue="Q4", hue_order=list_education, ax=ax)
    ax.set_title(group.upper())
    ax.set_xticklabels([textwrap.fill(title, 15) for title in list_titles])
    ax.set_xlabel(None)
    ax.legend(loc="upper right", fontsize="medium", framealpha=.3)
fig.tight_layout()
plt.suptitle("Educational Background", x=.52, y=1.03, fontweight="bold", fontsize="x-large", )
plt.show()

Consider the figure above, which shows the highest level of formal education that respondents, professional or not, have attained or plan to attain within the next 2 years.

Although Master's Degree is the most common educational background among respondents overall, the case is different in Indonesia, where most respondents have a Bachelor's Degree. One exception is Machine Learning Engineer, for which the most common background is a Master's Degree.

Globally, Master's Degree is the most common educational background. Research Scientists in particular mostly hold a higher degree, Doctoral or Master’s, while in Indonesia there is only a slight difference between the numbers holding bachelor's, master's, and doctoral degrees. The Indonesian counts are small compared to the global ones, though.

## Most Used Data Science Tools and Platforms



In [None]:
list_ds_tools = ["Q7", "Q9", "Q10", "Q11", "Q14", "Q16", "Q36", "Q38"]
list_ds_question_title = [
    "What programming languages do you use on a regular basis?",
    "Which of the following integrated development environments (IDE's) do you use on a regular basis?",
    "Which of the following hosted notebook products do you use on a regular basis?",
    "What type of computing platform do you use most often for your data science projects?",
    "What data visualization libraries or tools do you use on a regular basis?",
    "Which of the following machine learning frameworks do you use on a regular basis?",
    "Where do you publicly share or deploy your data analysis or machine learning applications?",
    "What is the primary tool that you use at work or school to analyze data?",
]
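list_col_tools = [col for col in df_kaggle.columns if col.split("_")[0] in list_ds_tools]
# match on the exact question prefix instead of a substring ("Q1" is a substring of "Q10")
dict_ds_question_part = {
    question: [col for col in list_col_tools if col.split("_")[0] == question]
    for question in list_ds_tools
}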


def plot_most_used_tools(df, title="", figsize=(22, 15), height_space=1):
    """Create bar plots of the most used tools and platforms.
    
    Args:
        df: dataframe used to plot.
        title: centered title of the figure. No title by default.
        figsize: size of the figure as a tuple (width, height). Default size is (22, 15).
        height_space: vertical space between subplot rows.
    """
    plt.figure(figsize=figsize)
    for idx, question in enumerate(list_ds_tools):
        plt.subplot(2, 4, idx+1)
        # one bar per option column; single-answer questions have a single column
        for col in dict_ds_question_part[question]:
            counts = df[col].value_counts()
            plt.bar(
                x=[textwrap.fill(label, 30) for label in counts.index],
                height=counts.values
            )
        # set the tick rotation and title once per subplot
        plt.xticks(rotation=90)
        plt.title(textwrap.fill(list_ds_question_title[idx], 35))
    plt.subplots_adjust(hspace=height_space)
    plt.suptitle(title, fontweight="bold", fontsize="x-large")
    plt.show()

In [None]:
plot_most_used_tools(df_indonesia, "Most Used Data Science Tools & Platforms in Indonesia")
plot_most_used_tools(df_sea, "Most Used Data Science Tools & Platforms in Other SEA Countries")
plot_most_used_tools(df_global, "Most Used Data Science Tools & Platforms in Other Countries")
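
The plots above rely on how multiple-selection questions are stored: one column per option, holding the option's label when it was selected and a missing value otherwise. The sketch below, on made-up answers, shows how such columns collapse into one count per option; the column names and labels are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up answers laid out like a multiple-selection question:
# one column per option, holding the option's label or NaN.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", np.nan, "Python"],
    "Q7_Part_2": [np.nan, "R", "R", np.nan],
    "Q7_Part_3": ["SQL", np.nan, np.nan, np.nan],
})

# Count the non-missing cells per column and index the result by option label.
option_counts = pd.Series({
    toy[col].dropna().iloc[0]: toy[col].count()
    for col in toy.columns
    if toy[col].notna().any()  # skip options nobody selected
}).sort_values(ascending=False)

print(option_counts.to_dict())
```

A single Series like this can be drawn with one `plot(kind="bar")` call, which is an alternative to adding one bar per column.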

There are many tools, libraries, and platforms that can help with data science projects. Most of them are open source; some are paid.

Unsurprisingly, Python is the main programming language used regularly in Indonesia and in the other 2 subgroups, along with SQL and R. In line with this, the most used IDEs still belong to the Jupyter family (Lab, Notebook, etc.).

One interesting thing is that no respondent in Indonesia reports using Julia, either for learning or in a data science project, whereas some respondents in other SEA countries and globally have tried it.

For **hosted notebooks**, Colab and Kaggle notebooks take first and second place, respectively. In my case, I have started to use Deepnote, and some of my notebook projects were turned into interactive notebooks using Binder. But it seems other respondents in Indonesia don’t use them, unlike respondents in the other subgroups.

In terms of **libraries**, Matplotlib and Seaborn are the most used visualization libraries. The same story applies to machine learning libraries, where Scikit-learn, TensorFlow, Keras, and PyTorch are the most used, in decreasing order.

What surprised me is that a **local development environment**, like JupyterLab or RStudio, is the primary tool in the other 2 subgroups, while most respondents in Indonesia still use basic software such as Excel or Google Sheets as their primary tool to analyze data.

In [None]:
list_howto = ["Q8", "Q17", "Q18", "Q19", "Q23", "Q37", "Q39"]
list_howto_question_title = [
    "What programming language would you recommend an aspiring data scientist to learn first?",
    "Which of the following ML algorithms do you use on a regular basis?",
    "Which categories of computer vision methods do you use on a regular basis?",
    "Which of the following natural language processing (NLP) methods do you use on a regular basis?",
    "Important activities in your role at work",
    "On which platforms have you begun or completed data science courses?",
    "Who/what are your favorite media sources that report on data science topics?",
]
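list_col_howto = [col for col in df_kaggle.columns if col.split("_")[0] in list_howto]
# match on the exact question prefix instead of a substring ("Q3" is a substring of "Q37")
dict_howto_question_part = {
    howto: [col for col in list_col_howto if col.split("_")[0] == howto]
    for howto in list_howto
}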


# utility function to order the columns of a multiple-selection question by their counts
def sort_multiple_selection(df, columns):
    """Sort the columns of a multiple-selection question.
    
    Args:
        df: dataframe.
        columns: list of column names corresponding to the question.
    
    Returns:
        list: columns sorted from the highest number of answers to the lowest.
    """
    # count() ignores missing values and returns 0 for options nobody selected,
    # whereas value_counts().values[0] raises an IndexError for such columns
    d = {
        col: df[col].count()
        for col in columns
    }
    return sorted(
        d,
        key=d.get,
        reverse=True
    )


# function to visualize recommendation for aspiring data scientist
def plot_aspiring_recommendation(df, title="", figsize=(22, 15), height_space=1.2):
    """Create bar plots for the questions related to recommendations for
    aspiring data scientists.
    
    Args:
        df: dataframe.
        title: centered title of the figure. No title by default.
        figsize: size of the figure as a tuple (width, height). Default size is (22, 15).
        height_space: vertical space between subplot rows.
    """
    plt.figure(figsize=figsize)
    for idx, question in enumerate(list_howto):
        columns = sort_multiple_selection(df, dict_howto_question_part[question])
        plt.subplot(2, 4, idx+1)
        # one bar per option column, drawn from most to least selected
        for col in columns:
            counts = df[col].value_counts()
            plt.bar(
                x=[textwrap.fill(label, 40) for label in counts.index],
                height=counts.values
            )
        # set the tick rotation and title once per subplot
        plt.xticks(rotation=90)
        plt.title(textwrap.fill(list_howto_question_title[idx], 35))
    plt.subplots_adjust(hspace=height_space)
    plt.suptitle(title, fontweight="bold", fontsize="x-large")
    plt.show()

In [None]:
plot_aspiring_recommendation(df_indonesia, "How to Break into Data Science for Aspiring Indonesian Data Scientists")
plot_aspiring_recommendation(df_sea, "How to Break into Data Science for Aspiring Southeast Asian Data Scientists")
plot_aspiring_recommendation(df_global, "How to Break into Data Science for Aspiring Data Scientists in Other Countries")
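
When nobody in a subgroup selected an option, its column is entirely missing there and `value_counts()` comes back empty. Below is a minimal sketch, on made-up data and with a hypothetical helper name `sort_by_selection_count`, of ordering option columns by the number of selections in a way that tolerates such columns.

```python
import numpy as np
import pandas as pd

# Made-up answers: one column per option, label or NaN.
toy = pd.DataFrame({
    "Q17_Part_1": ["Linear or Logistic Regression"] * 3 + [np.nan],
    "Q17_Part_2": [np.nan, "Decision Trees or Random Forests", np.nan, np.nan],
    "Q17_Part_3": [np.nan] * 4,  # nobody selected this option
})


def sort_by_selection_count(df, columns):
    """Return `columns` ordered from most to least selected."""
    # Series.count() ignores NaN and returns 0 for an all-missing column.
    return sorted(columns, key=lambda col: df[col].count(), reverse=True)


ordered = sort_by_selection_count(toy, ["Q17_Part_3", "Q17_Part_1", "Q17_Part_2"])
print(ordered)
```

Small subgroups are the most likely to hit this case, since rarely used options may not be selected by anyone.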

For an aspiring data scientist, especially one who resides in Indonesia, it’s highly recommended to learn Python as your first programming language. If it isn't your first, you should at least focus on Python first.

Basic methods such as linear or logistic regression are the most frequently used algorithms. Since they form a vital foundation for more complex models, you may want to start by digesting those basics first.

> There are still a few respondents who use evolutionary approaches on a regular basis, in all subgroups. Huge respect for evolutionary approaches!

In the computer vision world, downstream tasks like image classification and object detection may be routine for professionals and research scientists. Hence, methods that deal with those kinds of tasks are among the 3 most used, along with general-purpose vision tools like PIL, cv2, and skimage.

At the same time, in the NLP world, word embeddings like word2vec or fastText are among the most used methods. This is probably because word embeddings/vectors serve as building blocks for more advanced methods, like encoder-decoder models or Transformer language models.

Next, if we inspect the visualizations above on core activities at work, the other SEA countries and the global subgroup rank their activities in much the same way. As we can see, **analyzing and understanding data to influence products and business decisions** is both the most routine and the most important activity.

In Indonesia, analyzing data to influence products is still the most selected core activity, but building data infrastructure comes second. Only then do respondents focus on prototyping, deploying models, and research. Iterating to improve existing models even ranks lowest, in contrast to the other 2 subgroups.


## Key Takeaways

* Most data professionals in Indonesia hold a bachelor's degree, whereas data professionals globally mostly hold a master's degree. We can’t say whether this is a shortcoming, though.
* Python is still, by far, the go-to programming language for data science.
* In Indonesia, basic statistical software like Microsoft Excel or Google Sheets is the primary tool for analyzing and exploring data. Globally, a local development environment like JupyterLab or RStudio is the primary tool, as it is in other SEA countries (where it is almost tied with basic software).
* Foundational methods like linear or logistic regression are still worth learning first. Besides being foundational, they are still used either as a first model iteration or simply because they are the right model, with the lowest cost and good performance.
* Although Coursera is the most popular MOOC platform in the other 2 subgroups, respondents from Indonesia mostly use Kaggle Learn Courses to learn data science.
* Data pipelines may still be an issue in Indonesia, since most respondents name them the second most important activity after analyzing and understanding data.
* Data talents in Indonesia may still be in a development phase in which they try to connect all the important parts of an end-to-end data pipeline while also learning to apply fancy machine learning models.