# Indonesian Survey Takers Exploratory Data Analysis (Summary: Indonesians Love Kaggle)
Welcome to this notebook. I aim to explore the 2020 Kaggle ML & DS survey responses from Indonesia.

Read on for the full exploration. In short, Indonesian survey takers rank Kaggle as:
1. No. 1 DS & ML Learning Platform
2. No. 2 DS & ML Notebook Hosting Platform
3. No. 2 DS & ML Application Sharing or Deployment Platform

The notebook explores a few other things as well, but the key highlight is that Indonesians love Kaggle.

## Indonesian Survey Takers among World Population
According to [worldometers](https://www.worldometers.info/world-population/indonesia-population/), Indonesia ranks number 4 in the list of countries (and dependencies) by population. Among Kaggle survey takers, however, that's not the case.

In [None]:
import os
from IPython.display import Markdown as md
os.listdir('../input')

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
kaggle_survey_2020_all = df.iloc[1:]
kaggle_survey_2020_indonesia = df[df['Q3'] == 'Indonesia']

In [None]:
# Divide by the response rows only; the first row of df holds the question text.
kaggle_survey_2020_indonesia_ratio = round(kaggle_survey_2020_indonesia.shape[0] / kaggle_survey_2020_all.shape[0] * 100, 2)
# Count respondents per country, then rank the counts (rank 1 = most respondents).
country_counts = kaggle_survey_2020_all['Q3'].value_counts()
kaggle_survey_2020_rank_by_country = pd.DataFrame({'number of survey takers': country_counts, 'rank': country_counts.rank(ascending=False)})
# Label-based lookup instead of positional [0] on a string-indexed Series.
kaggle_survey_2020_indo_rank = kaggle_survey_2020_rank_by_country.loc['Indonesia', 'rank']

In [None]:
md("Only {} out of {} survey takers ({}%) are from Indonesia. <br /> That places Indonesia at rank {} among the {} countries represented in the 2020 Kaggle ML & DS Survey.".
   format(kaggle_survey_2020_indonesia.shape[0], kaggle_survey_2020_all.shape[0], kaggle_survey_2020_indonesia_ratio, int(kaggle_survey_2020_indo_rank), kaggle_survey_2020_rank_by_country.shape[0]))

The share is small, but that doesn't lessen my interest in exploring the Indonesian data, simply because I am Indonesian.
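One detail worth knowing about the rank: pandas `rank()` averages tied positions by default, so two countries with the same number of respondents can both get a fractional rank such as 2.5. The sketch below shows one way to compute the share and a tie-aware rank. The helper name `country_share_and_rank` and the toy data are made up for illustration, and `method="min"` is an assumption about how ties should be reported, not what the notebook itself uses.

```python
import pandas as pd

def country_share_and_rank(countries, country):
    """Return (respondents, share in %, rank, number of countries) for one country."""
    counts = countries.value_counts()
    # method="min": tied countries all receive the best rank of their group.
    ranks = counts.rank(ascending=False, method="min").astype(int)
    share = round(counts[country] / counts.sum() * 100, 2)
    return int(counts[country]), float(share), int(ranks[country]), len(counts)

# Toy data, invented for this example only.
toy = pd.Series(["A", "A", "A", "B", "B", "C", "C", "D"])
n, share, rank, n_countries = country_share_and_rank(toy, "C")
print(n, share, rank, n_countries)
```

With the default `method="average"`, countries B and C in the toy data would each get rank 2.5 instead of 2.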

In [None]:
def showPieChart(column, y, chartTitle, figsize=(5,5), autopct='%1.2f%%', shadow=True, startangle=0):
    kaggle_survey_2020_indonesia[column].value_counts(dropna=False).plot.pie(y=y,figsize=figsize, autopct=autopct, shadow=shadow, startangle=startangle)
    plt.axis('equal')
    plt.ylabel('')
    plt.title(chartTitle)
    plt.show()

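def showBarChart(target_column, df, title):
    # Multi-select questions are spread over columns such as Q37_Part_1, Q37_Part_2, ...
    spread_cols = [col for col in df.columns if col.startswith(target_column + '_')]
    # Drop options nobody selected; they have no label and would raise an IndexError below.
    df_with_spread_cols = df[spread_cols].dropna(axis=1, how='all')
    # Each column holds a single answer label (or NaN), so the first non-null value is the label.
    spread_cols_val = [df_with_spread_cols[col].dropna().unique()[0].strip() for col in df_with_spread_cols]
    # count() ignores NaN, giving the number of respondents who picked each option.
    df_with_spread_cols_usage = df_with_spread_cols.count()
    df_with_spread_cols_usage.index = spread_cols_val
    df_with_spread_cols_usage.plot.barh(rot=0)
    plt.title(title)
    plt.show()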

## Indonesian Survey Takers By Gender

First, let's explore the Indonesian survey takers by gender.

In [None]:
#kaggle_survey_2020_indonesia['Q2'].value_counts(dropna=False).plot.pie(y='Gender',figsize=(5, 5), autopct='%1.2f%%', shadow = True)
#plt.axis('equal')
#plt.ylabel('')
#plt.title("Indonesian Survey Takers by Gender")
#plt.show()

showPieChart('Q2', 'Gender', 'Indonesian Survey Takers by Gender')

The pie chart shows that, based on this survey alone, ML & DS in Indonesia is a male-dominated field (**70%**).

## Indonesian Survey Takers By Age

Next, let's explore Indonesian Survey Takers by age.

In [None]:
#kaggle_survey_2020_indonesia['Q1'].value_counts(dropna=False).plot.pie(y='Age Range',figsize=(5, 5), autopct='%1.2f%%', shadow = True)
#plt.axis('equal')
#plt.ylabel('')
#plt.title("Indonesian Survey Takers by Age")
#plt.show()

showPieChart('Q1', 'Age Range', 'Indonesian Survey Takers by Age')

People under 35 make up **83.45%** of the survey takers. Even without the 30-34 age range, those under 30 still account for **77.24%**. The field is clearly dominated by the younger age groups, probably because DS & ML, although not very recent worldwide, is still considered a new field in Indonesia.
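The cumulative shares quoted above can be reproduced with a running total over the age ranges. This is a minimal sketch, assuming the age labels sort correctly as strings (which holds for labels like `18-21` or `30-34`). The helper name `cumulative_share` and the toy answers are made up for illustration and are not survey data.

```python
import pandas as pd

def cumulative_share(series):
    """Percentage per category, sorted by label, plus a running total."""
    # value_counts() drops NaN by default, unlike the pie chart (dropna=False).
    share = series.value_counts(normalize=True).sort_index() * 100
    return pd.DataFrame({
        "share_pct": share.round(2),
        "cumulative_pct": share.cumsum().round(2),
    })

# Toy answers, invented for this example only.
toy_ages = pd.Series(["18-21", "18-21", "22-24", "25-29",
                      "30-34", "35-39", "22-24", "40-44"])
table = cumulative_share(toy_ages)
print(table)
```

On the real data, something like `cumulative_share(kaggle_survey_2020_indonesia['Q1'])` would give the "under 30" and "under 35" figures as the `cumulative_pct` values at the `25-29` and `30-34` rows.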

## Indonesian Survey Takers By Education Level

Next, let's explore by education level.

In [None]:
#kaggle_survey_2020_indonesia['Q4'].value_counts(dropna=False).plot.pie(y='Education Level',figsize=(5, 5), autopct='%1.2f%%', shadow = True, startangle=45)
#plt.axis('equal')
#plt.ylabel('')
#plt.title("Indonesian Survey Takers by Education Level")
#plt.show()

showPieChart('Q4', 'Education Level', 'Indonesian Survey Takers by Education Level', startangle=45)

More than **80%** of survey takers have a Bachelor's, Master's, or Doctoral degree.  
This raises a question: did they learn DS & ML at university or not? Let's explore that.

In [None]:
kaggle_survey_2020_indo_with_education = kaggle_survey_2020_indonesia[kaggle_survey_2020_indonesia['Q4'].notna()]
kaggle_survey_2020_indo_with_degree = kaggle_survey_2020_indo_with_education[kaggle_survey_2020_indo_with_education['Q4'].str.contains('Bachelor|Master|Doctoral', regex=True)]
# A non-null Q37_Part_10 means the option was selected; map explicitly instead of relying on value_counts() order.
kaggle_survey_2020_indo_with_degree_learn_mlds_from_univ = kaggle_survey_2020_indo_with_degree['Q37_Part_10'].notna().map({True: 'Yes', False: 'No'}).value_counts()

kaggle_survey_2020_indo_with_degree_learn_mlds_from_univ.plot.pie(y='Learn DS & ML from University',figsize=(5, 5), autopct='%1.2f%%', shadow = True)
plt.axis('equal')
plt.ylabel('')
plt.title("Do Indonesian Survey Takers with a University Degree Learn DS & ML from University?")
plt.show()

Only **20.09%** of survey takers with a university degree actually learned DS & ML at university. This could mean universities **have a lot of catching up to do**, since most degree holders turn to other platforms to learn DS & ML. We **can't conclude that with certainty**, though, because **the survey has no question about which major the survey takers pursued at university**.  

This raises another question: where do the majority of Indonesian survey takers learn DS & ML? Let's explore it, this time including all Indonesian survey takers and not only those with a university degree.

In [None]:
showBarChart("Q37", kaggle_survey_2020_indonesia, "Indonesian Survey Takers DS & ML Learning Platforms")

So, the top three for Indonesian Survey Takers are:
1. Kaggle Learn
2. Coursera
3. DataCamp

Congratulations **Kaggle**, it seems you are the top DS & ML learning platform in Indonesia.
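The bar chart is not sorted, so reading a top three off it takes some care. The sketch below counts the non-null answers per option and sorts them, which gives the ranking directly. It is a rough illustration only: the helper name `rank_multi_select` is hypothetical, and the toy frame contains made-up answers that merely mimic the `Q37_Part_N` column layout.

```python
import pandas as pd

def rank_multi_select(df, prefix):
    """Count selections per option of a multi-select question, highest first."""
    cols = [c for c in df.columns if c.startswith(prefix + "_")]
    counts = {}
    for c in cols:
        answers = df[c].dropna()
        if answers.empty:
            continue  # nobody picked this option, so there is no label to read
        # Every non-null cell in the column holds the same option label.
        counts[str(answers.iloc[0]).strip()] = len(answers)
    return pd.Series(counts, dtype="int64").sort_values(ascending=False)

# Toy frame, invented for this example only.
toy = pd.DataFrame({
    "Q37_Part_1": ["Coursera", None, "Coursera", None],
    "Q37_Part_2": [None, None, "edX", None],
    "Q37_Part_3": ["Kaggle Learn Courses", "Kaggle Learn Courses",
                   "Kaggle Learn Courses", None],
})
top = rank_multi_select(toy, "Q37")
print(top.head(3))
```

On the real data, `rank_multi_select(kaggle_survey_2020_indonesia, "Q37").head(3)` should list the same three platforms as the chart.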

## Indonesian Survey Takers' Hosting
Kaggle Learn is the top DS & ML learning platform in Indonesia. Let's investigate whether Kaggle is also no. 1 for where Indonesian survey takers host their notebooks.

In [None]:
showBarChart("Q10", kaggle_survey_2020_indonesia, "Indonesian Survey Takers DS & ML Notebook Hosting Platforms")

It's not: **Kaggle** comes in at no. 2, behind **Google Colab** Notebooks, the favorite among Indonesian survey takers. The top two are far ahead of the rest of the pack, although there is also quite a big gap between Colab and Kaggle. Interestingly, the number of people learning on Kaggle is similar to the number using Kaggle Notebooks, so let's check whether it is mostly the same people using both Kaggle Learn and Kaggle Notebooks.

In [None]:
kaggle_survey_2020_indo_kaggle_notebooks = kaggle_survey_2020_indonesia[kaggle_survey_2020_indonesia['Q10_Part_1'].notna()]
kaggle_survey_2020_indo_kaggle_learn = kaggle_survey_2020_indonesia[kaggle_survey_2020_indonesia['Q37_Part_3'].notna()]
# Respondents who use exactly one of the two (symmetric difference of the row indexes).
notebooks_learn_diff = len(kaggle_survey_2020_indo_kaggle_notebooks.index.symmetric_difference(kaggle_survey_2020_indo_kaggle_learn.index))

In [None]:
md("Not quite: there are {} Kaggle Notebooks users and {} Kaggle Learn users, but {} people use only one of the two (Kaggle Notebooks or Kaggle Learn, not both).".
   format(kaggle_survey_2020_indo_kaggle_notebooks.shape[0], kaggle_survey_2020_indo_kaggle_learn.shape[0], notebooks_learn_diff))
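The single "only one of the two" number hides how the overlap splits. A small breakdown into "both", "only Notebooks", and "only Learn" can make that clearer. This is just a sketch: the helper name `overlap_breakdown` is hypothetical, and the toy frame uses made-up answers in columns named like the ones above.

```python
import pandas as pd

def overlap_breakdown(df, col_a, col_b):
    """Split respondents by whether they selected option A, option B, both, or neither."""
    a = df[col_a].notna()  # True where option A was selected
    b = df[col_b].notna()  # True where option B was selected
    return {
        "both": int((a & b).sum()),
        "only_a": int((a & ~b).sum()),
        "only_b": int((~a & b).sum()),
        "neither": int((~a & ~b).sum()),
    }

# Toy frame, invented for this example only.
toy = pd.DataFrame({
    "Q10_Part_1": ["Kaggle Notebooks", "Kaggle Notebooks", None, None],
    "Q37_Part_3": ["Kaggle Learn Courses", None, "Kaggle Learn Courses", None],
})
result = overlap_breakdown(toy, "Q10_Part_1", "Q37_Part_3")
print(result)
```

Here `only_a + only_b` corresponds to the symmetric difference computed in the cell above.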

Now, let's check where Indonesian survey takers share or deploy their DS or ML applications. Do the numbers for sharing or deploying applications look similar to those for notebook hosting?

In [None]:
showBarChart("Q36", kaggle_survey_2020_indonesia, "Indonesian Survey Takers DS & ML Applications Deployment Hosting")

Indonesian survey takers use **GitHub** more than any other platform for hosting or sharing their DS / ML applications. Still, **Kaggle** takes the second spot, tied with **Google Colab**. 

## Goodbye

That's it for now. I hope you enjoyed this notebook.