# Data Scientist in Indonesia

Data scientists are big data wranglers, gathering and analyzing large sets of structured and unstructured data. A data scientist’s role combines computer science, statistics, and mathematics. They analyze, process, and model data, then interpret the results to create actionable plans for companies and other organizations.

Data scientists are analytical experts who use their skills in both technology and social science to find trends and manage data. They combine industry knowledge, contextual understanding, and skepticism of existing assumptions to uncover solutions to business challenges.

Hal Varian, the chief economist at Google, is known to have said, “The sexy job in the next 10 years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?”

Source: https://www.mastersindatascience.org/careers/data-scientist/
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df1 = pd.read_csv('/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
df1.head()

In [None]:
df1.info()

In [None]:
df_Indonesia=df1[df1["Q3"]=="Indonesia"]

## Age

In [None]:
df_Indonesia['Q1'].value_counts().sort_index().plot(kind='bar')

In [None]:
df1['Q1'].value_counts().sort_index().plot(kind='bar')

The most common age group among respondents in Indonesia is 18-21, which suggests that the largest group of Indonesian respondents consists of students and young professionals in the early stages of their careers.
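Because the Indonesian subset is much smaller than the full survey, raw counts in the two bar charts are hard to compare directly. The sketch below shows one way to put both groups on a percentage scale. The helper name `share_by_group` and the `toy` frame are made up for illustration (the values are not survey results); in the notebook, `toy` would be replaced by `df1`.

```python
import pandas as pd

def share_by_group(df, column):
    """Return the percentage of rows in each category of `column`, sorted by label."""
    return (df[column].value_counts(normalize=True).sort_index() * 100).round(1)

# Hypothetical toy data standing in for df1 (not real survey values)
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "22-24", "25-29", "18-21", "22-24"],
    "Q3": ["Indonesia", "Indonesia", "Indonesia", "India", "India", "India"],
})

# One column per group; age groups missing from a group are shown as 0 percent
comparison = pd.DataFrame({
    "Indonesia": share_by_group(toy[toy["Q3"] == "Indonesia"], "Q1"),
    "Global": share_by_group(toy, "Q1"),
}).fillna(0)

print(comparison)
```

Calling `comparison.plot(kind='bar')` would then draw both groups side by side on the same percentage axis.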



## Gender

In [None]:
sns.countplot(y="Q2", data=df_Indonesia)

In [None]:
sns.countplot(y="Q2", data=df1)

There is a large gender gap in this field: the majority of respondents are men. This suggests a need to raise awareness of big data careers among women.

## Education

In [None]:
sns.countplot(y="Q4", data=df_Indonesia)

In [None]:
sns.countplot(y="Q4", data=df1)

From the charts above, most respondents hold a Bachelor's or Master's degree. This suggests that formal education matters for becoming a data scientist.

## Current Role

In [None]:
sns.countplot(y="Q5", data=df_Indonesia)

In [None]:
sns.countplot(y="Q5", data=df1)

The majority of respondents are students. This result matches the age distribution above.

## Experience

In [None]:
df_Indonesia.Q6.value_counts().reindex(['I have never written code', '< 1 years', '1-3 years', '3-5 years','5-10 years','10-20 years','20+ years']).plot(kind="bar")

In [None]:
df1.Q6.value_counts().reindex(['I have never written code', '< 1 years', '1-3 years', '3-5 years','5-10 years','10-20 years','20+ years']).plot(kind="bar")

Most respondents have 1-3 years of general programming experience, both in Indonesia and globally.
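The `reindex` calls above exist only to put the bars in a meaningful order. As an alternative sketch, the same ordering can be attached to the data itself with an ordered categorical, so that any later count or plot respects it. The `toy` series is made up for illustration; in the notebook it would be `df1['Q6']` or `df_Indonesia['Q6']`.

```python
import pandas as pd

# Same category order as in the reindex calls above
experience_order = ['I have never written code', '< 1 years', '1-3 years',
                    '3-5 years', '5-10 years', '10-20 years', '20+ years']

# Hypothetical toy answers standing in for the Q6 column
toy = pd.Series(['1-3 years', '< 1 years', '1-3 years', '5-10 years'])

# An ordered categorical remembers the intended order of the labels
ordered = pd.Series(pd.Categorical(toy, categories=experience_order, ordered=True))

# sort_index() on a categorical index follows the category order, not the alphabet
counts = ordered.value_counts().sort_index()
print(counts)
```

Unlike `reindex`, which leaves `NaN` for labels that never occur, this version reports unused categories with a count of 0.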

## Programming Language

In [None]:
# Q7 is a multi-select question: each Q7_Part_N column holds the language name
# where a respondent selected it and NaN otherwise.
languages = ['Python','R','SQL','C','C++','Java',"Javascript","Julia","Bash","Matlab"]
# Column suffixes matching the labels above (Q7_Part_9, Swift, is not plotted here)
parts = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11]
# .iloc[0] takes the count of the most frequent value, i.e. the language itself
values_1 = [df_Indonesia[f"Q7_Part_{p}"].value_counts().iloc[0] for p in parts]

In [None]:
plt.bar(languages,values_1)
plt.xticks(rotation=45)
plt.title("Regularly Used Programming Languages in Indonesia")

In [None]:
# Same approach as above, now for all respondents and including Swift
languages = ['Python','R','SQL','C','C++','Java',"Javascript","Julia","Swift","Bash","Matlab"]
# All eleven language columns, Q7_Part_1 through Q7_Part_11
values_2 = [df1[f"Q7_Part_{p}"].value_counts().iloc[0] for p in range(1, 12)]

plt.bar(languages, values_2)
plt.xticks(rotation=45)
plt.title("Regularly Used Programming Languages Globally")

As expected, Python is the most popular programming language both in Indonesia and globally.

After some quick research, we found that Python is widely used even though it is somewhat slower than other languages, because it is more productive, enables faster innovation (which improves competitiveness), and has a rich set of libraries and frameworks as well as a large community.
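The language counts above are collected one column at a time. As a minimal sketch, the same logic can be wrapped in a small helper that scans every column with a given prefix. It assumes, as the cells above do, that each multi-select column holds the option label where selected and NaN otherwise. The name `count_selections` and the `toy` frame are hypothetical and not part of the survey data.

```python
import numpy as np
import pandas as pd

def count_selections(df, prefix):
    """Count the most frequent answer in every column whose name starts with `prefix`."""
    counts = {}
    for col in df.columns:
        if not col.startswith(prefix):
            continue
        top = df[col].value_counts()
        if top.empty:
            # Nobody in this subset selected the option, so skip the column
            continue
        # Most frequent value is the option label; its count is the number of selections
        counts[str(top.index[0]).strip()] = int(top.iloc[0])
    return pd.Series(counts, dtype="int64").sort_values(ascending=False)

# Hypothetical toy data standing in for the Q7_Part_N columns
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", np.nan, "Python"],
    "Q7_Part_2": [np.nan, "R", np.nan, np.nan],
    "Q7_Part_3": [np.nan, np.nan, np.nan, np.nan],
})

language_counts = count_selections(toy, "Q7_Part_")
print(language_counts)
```

Because empty columns are skipped rather than indexed, the helper does not fail when no respondent in a subset picked a given option, and the labels come from the data instead of a hand-written list.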

## Industry

In [None]:
sns.countplot(y="Q20",palette = 'muted',data=df_Indonesia)

In [None]:
sns.countplot(y="Q20",palette = 'muted',data=df1)

Globally, most Kaggle users currently work in the computer/technology industry, whereas in Indonesia most currently work in academia/education. We may see more Indonesians enter technology-based industries in the future.

## Salary

In [None]:
df_Indonesia.Q25.value_counts().reindex(['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999','4,000-4,999','5,000-7,499',
                               '7,500-9,999','10,000-14,999','15,000-19,999','20,000-24,999','25,000-29,999','30,000-39,999',
                               '40,000-49,999','50,000-59,999','60,000-69,999','70,000-79,999','80,000-89,999',
                               '90,000-99,999','100,000-124,999','125,000-149,999','150,000-199,999','200,000-249,999','250,000-299,999',
                               '300,000-499,999','$500,000-999,999','>$1,000,000']).plot(kind="bar")

In [None]:
df1.Q25.value_counts().reindex(['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999','4,000-4,999','5,000-7,499',
                               '7,500-9,999','10,000-14,999','15,000-19,999','20,000-24,999','25,000-29,999','30,000-39,999',
                               '40,000-49,999','50,000-59,999','60,000-69,999','70,000-79,999','80,000-89,999',
                               '90,000-99,999','100,000-124,999','125,000-149,999','150,000-199,999','200,000-249,999','250,000-299,999',
                               '300,000-499,999','$500,000-999,999','>$1,000,000']).plot(kind="bar")

From the charts above, the most common salary bracket among respondents, both in Indonesia and globally, is $0-999. Since the most common bracket is the same, salaries reported in Indonesia appear broadly comparable to the global pattern.

## Course


In [None]:
# Q40 is multi-select: one column per learning platform, Q40_Part_1 through Q40_Part_11
platforms = ['Coursera','edX',"Kaggle Learn Courses",'DataCamp','Fast.ai','Udacity',"Udemy","LinkedIn Learning","Cloud","University Courses","None"]
# .iloc[0] takes the count of the most frequent value, i.e. the platform itself
values_3 = [df_Indonesia[f"Q40_Part_{p}"].value_counts().iloc[0] for p in range(1, 12)]

plt.bar(platforms, values_3)
plt.xticks(rotation=90)
plt.title("Regularly Used Course Platforms in Indonesia")

In [None]:
# Same platform columns as above, counted over all respondents
platforms = ['Coursera','edX',"Kaggle Learn Courses",'DataCamp','Fast.ai','Udacity',"Udemy","LinkedIn Learning","Cloud","University Courses","None"]
values_4 = [df1[f"Q40_Part_{p}"].value_counts().iloc[0] for p in range(1, 12)]

plt.bar(platforms, values_4)
plt.xticks(rotation=90)
plt.title("Regularly Used Course Platforms Globally")

From the charts above, Kaggle Learn Courses is the most popular platform for data science courses among learners in Indonesia, while Coursera is the most popular platform globally.

## Tools

In [None]:
sns.countplot(y="Q41",palette = 'muted',data=df_Indonesia)

In [None]:
sns.countplot(y="Q41",palette = 'muted',data=df1)

From the charts above, respondents in Indonesia seem to prefer basic statistical software for their primary data analysis, while respondents globally also use local development environments (RStudio, JupyterLab, etc.) as their primary tools.

This suggests that data scientists in Indonesia may need to broaden their skills in order to compete globally.