# Summary
This notebook analyses the 2020 Kaggle ML and Data Science survey with a focus on the ASEAN region. ASEAN consists of 10 countries (Singapore, Viet Nam, Thailand, Malaysia, Indonesia, Brunei, Cambodia, Laos, Myanmar and the Philippines). Observations on Malaysia are called out separately to show where it aligns with, or differs from, its ASEAN neighbours.

**Analysis Limitation**

The survey responses come from the Kaggle user community, so the observations on ML/Data Science usage in this analysis reflect only the views of that community. How well they represent the wider population of ML/Data Science practitioners is unknown.

**Data Availability**

In the survey dataset, only 6 of the 10 ASEAN countries are explicitly labelled with their country name: Indonesia, Singapore, Viet Nam, Thailand, the Philippines and Malaysia.

**Findings**

1. There are more beginner/intermediate Kaggle programmers than experienced programmers. This is consistent across the ASEAN countries, including Malaysia.
2. Male Kaggle users far outnumber female users across ASEAN. Malaysia has the smallest gender (male vs. female) gap, something to be proud of as a Malaysian.
3. Python is the clear winner as the most widely used language in ASEAN, with a significant gap over the other languages. Python is also the most used language in Malaysia. However, Malaysia's second and third most used languages are C++ and SQL, whereas for ASEAN as a whole they are SQL and R.
4. Python is recommended as the language to learn by 75.4% of respondents, much higher than its current usage share of around 30 to 35%. Malaysia follows the same trend: 72.7% recommend Python, followed by R.
5. Linear or logistic regression, decision trees or random forests, and convolutional neural networks are the top 3 preferred machine learning algorithms across ASEAN, and Malaysia follows the same trend.

**Conclusion**

Malaysia is largely in line with its ASEAN neighbours on findings 1 to 5 above.

When I started with Python in 2016, R and Python were promoted about equally in Malaysia. This survey clearly shows that Python now leads both as the language currently used and as the language recommended for ML/Data Science.

In [None]:
# Import libraries

import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# 1. Data Preparation

In [None]:
# Read data from csv file
survey = pd.read_csv("/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")

# Rename columns for ease of reading and referencing
survey.rename(columns={'Q1': 'AGE',
                       'Q2': 'GENDER',
                       'Q3': 'COUNTRY',
                       'Q6': 'YR_EXPERIENCE',
                       'Q8': 'LANG_RECOMMENDED',
                       # Q7 parts: one column per programming language used
                       'Q7_Part_1': 'USE_PYTHON',
                       'Q7_Part_2': 'USE_R',
                       'Q7_Part_3': 'USE_SQL',
                       'Q7_Part_4': 'USE_C',
                       'Q7_Part_5': 'USE_C++',
                       'Q7_Part_6': 'USE_JAVA',
                       'Q7_Part_7': 'USE_JAVASCRIPT',
                       'Q7_Part_8': 'USE_JULIA',
                       'Q7_Part_9': 'USE_SWIFT',
                       'Q7_Part_10': 'USE_BASH',
                       'Q7_Part_11': 'USE_MATLAB',
                       'Q7_Part_12': 'USE_NONE',
                       'Q7_OTHER': 'USE_OTHER'},
              inplace=True)

# Get question short form and the corresponding question text
question = survey.iloc[0,:]

In [None]:
# Subset ASEAN countries data
ASEAN = ["Singapore", "Viet Nam", "Thailand", "Malaysia", "Indonesia", "Brunei", "Cambodia", "Laos", "Myanmar", "Philippines"]
ASEAN_df = survey[survey["COUNTRY"].isin(ASEAN)]
ASEAN_df["COUNTRY"].unique() # Only 6 countries are in subset data

# 2. Analysis

**Analysis I: Year of Experience**

In [None]:
# Group country data by year of experience. Value calculated is number of responses 
# for that particular year of experience bucket over total responses, for a particular country.
# For every country, percentage responses for all applicable year of experience bucket should 
# sum up to 100%.
ASEAN_df_experience = (
    ASEAN_df.groupby("COUNTRY")["YR_EXPERIENCE"]
    .value_counts(dropna=False, normalize=True)
    .mul(100)
    .rename("CTRY_RESPONSE_PCT")  # name the result explicitly; the default name depends on the pandas version
    .to_frame()
)
# Copy the index levels into regular columns for plotting
ASEAN_df_experience["COUNTRY"] = ASEAN_df_experience.index.get_level_values("COUNTRY")
ASEAN_df_experience["YR_EXPERIENCE"] = ASEAN_df_experience.index.get_level_values("YR_EXPERIENCE")

# Plot categorical plot
experience_order = ['I have never written code', '< 1 years', '1-2 years', '3-5 years', 
                    '5-10 years', '10-20 years', '20+ years']

sns.set()
plt.clf() 
sns.catplot(x="YR_EXPERIENCE", y="CTRY_RESPONSE_PCT", hue="COUNTRY",
            kind="point", height=5, aspect=2, order = experience_order, data = ASEAN_df_experience)
plt.xlabel('YR_EXPERIENCE', fontsize=10)
plt.ylabel('CTRY_RESPONSE_PCT', fontsize=10)
plt.show()

**Observation 1:**

In terms of years of programming experience, the Kaggle user community can be separated into two groups.
* Beginner to Intermediate: Less than 5 years of programming experience 
* Experienced: More than 5 years of programming experience

There are more beginner/intermediate Kaggle programmers than experienced programmers: in the plot, responses are concentrated in the upper left (Beginner/Intermediate) rather than the lower right (Experienced). This is consistent across the ASEAN countries, including Malaysia.
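
The two-group split described above can also be checked numerically rather than read off the plot. The following is a minimal sketch on made-up toy rows (not survey data); the names `toy`, `experienced` and `EXPERIENCE_GROUP` are hypothetical and only for illustration.

```python
import pandas as pd

# Toy responses (made-up values, NOT survey data), using the survey's experience labels.
toy = pd.DataFrame({
    "COUNTRY": ["Malaysia"] * 4 + ["Singapore"] * 4,
    "YR_EXPERIENCE": ["< 1 years", "1-2 years", "3-5 years", "5-10 years",
                      "1-2 years", "3-5 years", "10-20 years", "20+ years"],
})

# Hypothetical mapping of experience buckets to the two groups.
# Anything not listed here (including missing answers) falls into Beginner/Intermediate,
# so missing values should be dropped or handled first on real data.
experienced = {"5-10 years", "10-20 years", "20+ years"}
toy["EXPERIENCE_GROUP"] = toy["YR_EXPERIENCE"].map(
    lambda bucket: "Experienced" if bucket in experienced else "Beginner/Intermediate"
)

# Row-normalised crosstab: each country's percentages sum to 100.
group_pct = pd.crosstab(toy["COUNTRY"], toy["EXPERIENCE_GROUP"], normalize="index") * 100
print(group_pct)
```

The same mapping could be applied to `ASEAN_df["YR_EXPERIENCE"]` to get one Beginner/Intermediate vs. Experienced percentage per country.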

**Analysis II: Gender Disparity**

In [None]:
# Group country data by gender. Value calculated is number of responses 
# for a particular gender over total responses, for a particular country.
# For every country, percentage responses for all applicable gender bucket should 
# sum up to 100%. 
ASEAN_df_gender = (
    ASEAN_df.groupby("COUNTRY")["GENDER"]
    .value_counts(dropna=False, normalize=True)
    .mul(100)
    .rename("CTRY_RESPONSE_PCT")  # name the result explicitly; the default name depends on the pandas version
    .to_frame()
)
# Copy the index levels into regular columns for plotting
ASEAN_df_gender["COUNTRY"] = ASEAN_df_gender.index.get_level_values("COUNTRY")
ASEAN_df_gender["GENDER"] = ASEAN_df_gender.index.get_level_values("GENDER")

# Plot categorical plot
gender_order = ['Man', 'Woman', 'Prefer not to say', 'Nonbinary','Prefer to self-describe']
sns.set()
plt.clf() 
sns.catplot(x="GENDER", y="CTRY_RESPONSE_PCT", hue="COUNTRY", aspect=2,
            kind="bar", order = gender_order , data = ASEAN_df_gender)
plt.xlabel('GENDER', fontsize=10)
plt.ylabel('CTRY_RESPONSE_PCT', fontsize=10)
plt.show()

**Observation 2a:**

Across all ASEAN countries in the data, male Kaggle users outnumber female users.

Given this, the next step is to find out how large the male vs. female gap is in each country.

In [None]:
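# Subset gender data. The index is dropped so that "COUNTRY" exists only as a column;
# otherwise merging on "COUNTRY" is ambiguous (it is both an index level and a column).
woman = ASEAN_df_gender[ASEAN_df_gender["GENDER"] == "Woman"].reset_index(drop=True)
man = ASEAN_df_gender[ASEAN_df_gender["GENDER"] == "Man"].reset_index(drop=True)
gender_comb = pd.merge(woman, man, how="outer", on="COUNTRY", suffixes=("_WOMAN", "_MAN"))
# Gender gap in percentage points: share of men minus share of women
gender_comb["GENDER_GAP_PCT"] = gender_comb["CTRY_RESPONSE_PCT_MAN"] - gender_comb["CTRY_RESPONSE_PCT_WOMAN"]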

# Plot categorical graph
plt.clf() 
sns.catplot(x="COUNTRY", y="GENDER_GAP_PCT", aspect=2,
            kind="bar", data = gender_comb)
plt.xlabel('COUNTRY', fontsize=10)
plt.ylabel('GENDER_GAP_PCT', fontsize=10)
plt.show()

**Observation 2b:**

The analysis shows that Malaysia has the smallest gender (male vs. female) gap, something to be proud of as a Malaysian.
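
As a cross-check on the gap calculation, the same quantity (share of men minus share of women, in percentage points) can be computed without a merge. This is a minimal sketch on made-up toy rows, not survey data; the variable names are hypothetical.

```python
import pandas as pd

# Toy responses (made-up values, NOT survey data).
toy = pd.DataFrame({
    "COUNTRY": ["Malaysia"] * 4 + ["Thailand"] * 5,
    "GENDER": ["Man", "Man", "Woman", "Woman",
               "Man", "Man", "Man", "Man", "Woman"],
})

# Percentage of each gender within a country: rows are countries, columns are genders.
gender_pct = pd.crosstab(toy["COUNTRY"], toy["GENDER"], normalize="index") * 100

# Gap in percentage points: share of men minus share of women.
gender_gap = (gender_pct["Man"] - gender_pct["Woman"]).rename("GENDER_GAP_PCT")
print(gender_gap.sort_values())
```

Sorting the result puts the country with the smallest gap first, which makes the comparison easy to read off.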

**Analysis III: Programming Language Used**

In [None]:
# Replace all cells with value with 1 and no value with 0 
lang_used_1= ASEAN_df.iloc[:,7:20].notnull().astype('int')

# Merge above with corresponding country, gender and year of experience data
lang_used = pd.merge(ASEAN_df.loc[:, ["COUNTRY", "GENDER", "YR_EXPERIENCE"]], lang_used_1, 
                     left_index = True, right_index= True, how = "left")

lang_list = ["USE_PYTHON", "USE_R", "USE_SQL", "USE_C", "USE_C++", 
         "USE_JAVA", "USE_JAVASCRIPT","USE_JULIA","USE_SWIFT","USE_BASH", 
         "USE_MATLAB", "USE_NONE", "USE_OTHER"]

# Pivot number of responses with country as index, language as column.
# Then create column CTRY_TOT_COUNT, as sum of responses in a country.
lang_pivot = lang_used.pivot_table(index=["COUNTRY"], values=lang_list, aggfunc='sum')
lang_pivot["CTRY_TOT_COUNT"]=lang_pivot.sum(axis=1)

# Copy lang_pivot and calculate percentage usage for each language.
# Each percentage is a language's share of all language selections in that country.
lang_summary = lang_pivot.copy(deep=True)
for lang in lang_list:
    lang_summary[lang + "_PCT"] = round(lang_summary[lang] / lang_summary["CTRY_TOT_COUNT"] * 100, 5)

# Subset data for final summary. 
# Final summary of language usage percentage is averaged across ASEAN and ordered from highest to lowest
lang_plot = lang_summary.iloc[:,14:27].copy() # copy, because an "ASEAN" row is added to lang_plot later
lang_rank_ASEAN = lang_plot.mean(axis=0).sort_values(ascending = False)
print(lang_rank_ASEAN)

**Observation 3a:**

In ASEAN, Python is the most used language, followed by SQL, R, JavaScript, C and Java. Python is the clear winner: its gap over the other languages is significant.

In [None]:
# Rank most widely used programming language
lang_plot.loc["ASEAN"] = lang_rank_ASEAN
lang_compare_pct =lang_plot.transpose()
lang_compare_rank =lang_plot.transpose().rank(ascending = False, method = "min")
print(lang_compare_rank)

**Observation 3b:**

In Malaysia, Python is the most used language, in line with ASEAN. However, Malaysia's second and third most used languages are C++ and SQL, whereas for ASEAN as a whole they are SQL and R. In the ASEAN and individual country rankings, R sits in third or fourth position.
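
The ranking above relies on `DataFrame.rank(ascending=False, method="min")`. A small sketch of how that call treats ties, using made-up percentages (not survey results) and hypothetical country names:

```python
import pandas as pd

# Toy usage percentages (made-up values, NOT survey results).
toy_pct = pd.DataFrame(
    {"Country A": [40.0, 15.0, 15.0, 10.0],
     "Country B": [35.0, 20.0, 10.0, 12.0]},
    index=["USE_PYTHON_PCT", "USE_SQL_PCT", "USE_R_PCT", "USE_C++_PCT"],
)

# Rank within each column: 1 = most used. With method="min", tied values share
# the lowest rank of the tie and the following rank is skipped.
toy_rank = toy_pct.rank(ascending=False, method="min")
print(toy_rank)
```

In "Country A" the two tied languages both get rank 2 and the next language gets rank 4, so close or tied percentages are worth checking in `lang_compare_pct` before reading much into a one-place difference in rank.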

**Analysis IV: Recommended Programming Language**

In [None]:
# Group recommended language and calculate percentage of response over total response
lang_recommended = ASEAN_df["LANG_RECOMMENDED"].value_counts(normalize = True)*100
# Name the values explicitly; the default name depends on the pandas version
lang_recommended_df = lang_recommended.rename("LANG_RECOMMENDED").to_frame()
lang_recommended_df["LANGUAGE"] = lang_recommended_df.index

# Plot categorical plot
plt.clf() 
sns.catplot(x="LANGUAGE", y="LANG_RECOMMENDED", aspect=2,
            kind="bar", data = lang_recommended_df)
plt.xlabel('LANGUAGE', fontsize=10)
plt.ylabel('RECOMMENDED_PERCENTAGE', fontsize=10)
plt.show()

# Recommended language, Malaysia only
lang_recommended_MYS = ASEAN_df.loc[ASEAN_df["COUNTRY"]=="Malaysia", "LANG_RECOMMENDED"].value_counts(normalize = True)*100
print(lang_recommended_MYS)

**Observation 4:**

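Python is recommended as the language to learn by 75.4% of respondents, much higher than its current usage share of around 30 to 35%. Malaysia follows the same trend: 72.7% recommend Python, followed by R. Note that the usage share is calculated over all language selections (a respondent can select several languages), while the recommendation share is calculated over respondents, so the two figures are not on the same basis.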
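
In the code above, `CTRY_TOT_COUNT` is the sum of all language selections, so a usage percentage is a share of selections, whereas the recommendation percentage is a share of respondents. One way to put usage on a per-respondent basis is sketched below on made-up toy rows (not survey data); the names `share_of_selections` and `share_of_respondents` are hypothetical.

```python
import pandas as pd

# Toy multi-select answers (made-up values, NOT survey data): 1 = language selected.
toy = pd.DataFrame({
    "COUNTRY": ["Malaysia", "Malaysia", "Malaysia", "Malaysia"],
    "USE_PYTHON": [1, 1, 1, 0],
    "USE_R": [1, 0, 0, 0],
    "USE_SQL": [1, 1, 0, 1],
})
lang_cols = ["USE_PYTHON", "USE_R", "USE_SQL"]

# Number of times each language was selected, per country.
counts = toy.groupby("COUNTRY")[lang_cols].sum()

# Basis 1: share of all language selections in the country.
share_of_selections = counts.div(counts.sum(axis=1), axis=0) * 100

# Basis 2: share of respondents in the country who selected the language.
respondents = toy.groupby("COUNTRY").size()
share_of_respondents = counts.div(respondents, axis=0) * 100

print(share_of_selections.round(1))
print(share_of_respondents.round(1))
```

In this toy example the same three Python selections give about 43% of selections but 75% of respondents, which illustrates how much the choice of denominator can move the number.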

**Analysis V: Preferred Machine Learning Algorithm**

In [None]:
# Data preprocessing: stack all machine learning response columns into one long table
# with two columns, COUNTRY and MACHINE_LEARNING
ml_parts = []

for i in range(82,94):
    a = ASEAN_df.iloc[:,[3,i]].copy()
    a.columns = ["COUNTRY", "MACHINE_LEARNING"]
    ml_parts.append(a)

ml_usage = pd.concat(ml_parts)


In [None]:
# Data preprocessing: calculate percentage per country and copy index as columns
ml_summary = (
    ml_usage.groupby("COUNTRY")["MACHINE_LEARNING"]
    .value_counts(normalize = True)
    .mul(100)
    .rename("CTRY_RESPONSE_PCT")  # name the result explicitly; the default name depends on the pandas version
    .to_frame()
)
ml_summary["COUNTRY"] = ml_summary.index.get_level_values("COUNTRY")
ml_summary["MACHINE_LEARNING"] = ml_summary.index.get_level_values("MACHINE_LEARNING")

In [None]:
# Plot graph to compare preferred machine learning method across countries
sns.set()
plt.clf()
plt.figure(figsize=(15,12))
graph = sns.barplot(x="MACHINE_LEARNING", y="CTRY_RESPONSE_PCT", hue="COUNTRY", data = ml_summary)
graph.set_xticklabels(graph.get_xticklabels(), rotation = 30, ha = "right", fontsize = 10)
plt.show()

**Observation 5:**

Linear or logistic regression, decision trees or random forests, and convolutional neural networks are the top 3 preferred machine learning methods across ASEAN, and Malaysia follows the same trend. In Singapore, gradient boosting machines rank third, ahead of convolutional neural networks.
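
The top 3 per country can also be listed in a table instead of being read off the bar chart. A minimal sketch follows, using made-up percentages (not survey results) and shortened placeholder labels for the methods:

```python
import pandas as pd

# Toy percentages (made-up values, NOT survey results); method labels are placeholders.
toy_ml = pd.DataFrame({
    "COUNTRY": ["Malaysia"] * 4 + ["Singapore"] * 4,
    "MACHINE_LEARNING": ["Linear or Logistic Regression", "Decision Trees or Random Forests",
                         "Convolutional Neural Networks", "Gradient Boosting Machines"] * 2,
    "CTRY_RESPONSE_PCT": [30.0, 25.0, 20.0, 10.0, 28.0, 24.0, 14.0, 18.0],
})

# Sort by percentage within each country, then keep the first three rows per country.
top3 = (
    toy_ml.sort_values(["COUNTRY", "CTRY_RESPONSE_PCT"], ascending=[True, False])
    .groupby("COUNTRY")
    .head(3)
)
print(top3)
```

To apply this to `ml_summary`, the index would likely need to be dropped first with `reset_index(drop=True)`, because its column names duplicate the index level names.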

In [None]:
# Save result to pickle

ASEAN_df_experience.to_pickle("ASEAN_df_experience.pkl")
ASEAN_df_gender.to_pickle("ASEAN_df_gender.pkl")
lang_compare_rank.to_pickle("lang_compare_rank.pkl")
lang_recommended_df.to_pickle("lang_recommended_df.pkl")

# Save ASEAN data into csv
ASEAN_df.to_csv("ASEAN_df.csv", index=False)