In [None]:
# importing packages
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import textwrap
import warnings
warnings.filterwarnings('ignore')

# grouping the survey's compensation bins into fewer, wider bins so the charts are easier to read
def new_compensation_bin(comp_str):
    if comp_str in ['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999']:
        return '$0-4,999'
    elif comp_str in ['5,000-7,499', '7,500-9,999', '10,000-14,999', '15,000-19,999', '20,000-24,999']:
        return '$5,000-24,999'
    elif comp_str in ['25,000-29,999', '30,000-39,999', '40,000-49,999', '50,000-59,999', '60,000-69,999']:
        return '$25,000-69,999'
    elif comp_str in ['70,000-79,999', '80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999']:
        return '$70,000-149,999'
    elif comp_str in ['150,000-199,999', '200,000-249,999', '250,000-299,999', '300,000-500,000', '> $500,000']:
        return '> $150,000'
    else:
        return "Null"

# identifying a tag for the southeast asian countries
def tag_sea_countries(country):
    if country in ["Brunei", "Cambodia", "East Timor", "Indonesia", "Laos", "Malaysia", "Myanmar", "Philippines", "Singapore", "Thailand", "Viet Nam"]:
        return "SEA"
    else:
        return "RoW"

df = pd.read_csv("/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv", skiprows=[0])    
# Series.map applies each helper to one column directly, avoiding a row-wise apply
df["sea_tag"] = df["In which country do you currently reside?"].map(tag_sea_countries)
df["new_compensation_bin"] = df["What is your current yearly compensation (approximate $USD)?"].map(new_compensation_bin)

sea_countries = df[df["sea_tag"]=="SEA"]
nonsea_countries = df[df["sea_tag"]=="RoW"]
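
The binning helper above is a chain of membership tests. As a minimal sketch, the same grouping can be written as a lookup table, which keeps the bin definitions in one place. The names `BIN_GROUPS`, `BIN_LOOKUP` and `rebin` are hypothetical and not part of the notebook; the bin labels are copied from the helper above.

```python
# Hypothetical lookup-table version of new_compensation_bin.
# Keys are the wider bins, values are the original survey bins they absorb.
BIN_GROUPS = {
    "$0-4,999": ["$0-999", "1,000-1,999", "2,000-2,999", "3,000-3,999", "4,000-4,999"],
    "$5,000-24,999": ["5,000-7,499", "7,500-9,999", "10,000-14,999", "15,000-19,999", "20,000-24,999"],
    "$25,000-69,999": ["25,000-29,999", "30,000-39,999", "40,000-49,999", "50,000-59,999", "60,000-69,999"],
    "$70,000-149,999": ["70,000-79,999", "80,000-89,999", "90,000-99,999", "100,000-124,999", "125,000-149,999"],
    "> $150,000": ["150,000-199,999", "200,000-249,999", "250,000-299,999", "300,000-500,000", "> $500,000"],
}

# Invert the groups so every original bin points to its wider bin.
BIN_LOOKUP = {old: new for new, olds in BIN_GROUPS.items() for old in olds}


def rebin(comp_str):
    """Return the wider bin for a survey answer, or "Null" if it is missing or unknown."""
    return BIN_LOOKUP.get(comp_str, "Null")


print(rebin("7,500-9,999"))
print(rebin(None))
```

A function like this could be passed to `Series.map` in the same way as the original helper.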

<img src="https://www.amstat.org/images/asaimages/news/dataWave.png" width="600"/>
<p style="text-decoration:none;font-size:50%;text-align:center;color:gray;">Source: American Statistical Association (www.amstat.org/images/asaimages/news/dataWave.png)</p>
<div style="text-align: center;" markdown="1"><font size="5"><b>Data Scientists in SEA: Are there enough and are they enough? </b></font> </div>

<hr style="background-color:#E3692A;height:1px">
<br>
"What would you do with your degree? Teach?"

When I entered the University of the Philippines in 2006, these were the questions I would get from aunts and uncles after telling them that I was taking a degree in Physics. Early in my university days, my response would usually be "Teaching would be nice." In a developing country with a relatively small science and technology industry, pursuing a Physics degree is uncommon. My batch of Physics undergraduates numbered in the tens (compared with other "more professional" courses, which numbered in the hundreds).

Motivated by love for the field, I finished my degree, uncertain what opportunities would open to me after I graduated. Fast forward to 2019: I am now a Physicist by training, a Data Scientist by profession and a Teacher at heart (which means I teach whenever the opportunity presents itself).

Recently, I have taught some beginner programming classes to aspiring data scientists. The large number of students is not surprising, as demand for the role has increased. However, from my experience trying to find people to add to our team, and from recruiters letting me know of potential new roles, I know that the number of practitioners is not enough. This raises the question: how do Filipino data scientists, and more broadly Southeast Asian data scientists, match up with those in other countries?
<br>
<br>
<hr style="background-color:#E3692A;height:1px">
<br>
In this notebook, we will explore the opportunities, challenges and current solutions for Southeast Asia in the field of data science. We will tackle each of these while overlaying the results of the 2019 Kaggle ML & DS Survey.
<ol>      
  <li><b>Opportunities: Emerging Markets are Fertile for DS/AI Adoption</b>
    <ol>
      <dd>1.1. &nbsp;&nbsp;Digital Shift</dd>
      <dd>1.2. &nbsp;Private Companies are Adopting</dd>
    </ol>
      
  <li><b>Challenges: Untapped Potential</b>
    <ol>
      <dd>2.1. &nbsp;&nbsp;DS Practitioners</dd>
      <dd>2.2. &nbsp;Gender Disparity</dd>
      <dd>2.3. &nbsp;Age, Education and Salary</dd>
      <dd>2.4. &nbsp;Workplace Tools</dd>
    </ol>
  
  <li><b>Solutions: Coming Together</b>
    <ol>
      <dd>3.1. &nbsp;&nbsp;Individual Development: Self-Improvement</dd>
      <dd>3.2. &nbsp;Private Companies: Enticing Professionals</dd>
      <dd>3.3. &nbsp;Government Initiatives: Driving Adoption</dd>
    </ol>

  <li><b>Conclusion</b>
  
  <li><b>References</b>
<br>
    <br>
<hr style="background-color:#E3692A;height:1px">

## **Opportunities: Emerging Markets are Fertile for DS/AI Adoption**

Southeast Asia (SEA) is located east of the Indian subcontinent and south of China. The region is composed of 11 countries: Brunei, Cambodia, East Timor, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore, Thailand and Vietnam.

<img src="https://www.arnaudbonzom.com/wp-content/uploads/2019/10/Southeast-Asia.jpg" width="600"/>
<p style="text-decoration:none;font-size:50%;text-align:center;color:gray;">Source: Arnaud Bonzom (https://www.arnaudbonzom.com/wp-content/uploads/2019/10/Southeast-Asia.jpg)</p>

The region accounts for almost 9% (664,732,024) of the world's population, with 49.5% of its population living in urban areas. The median age in Southeast Asia is 29.1 years.<sup>1</sup>

<div style="text-align: left;" markdown="1"><font size="4"><b>1.1. Digital Shift</b></font> </div>
<br>
More than half of Southeast Asia's population is under 30 years old, and 90% of this group has internet access; they are classified as tech-savvy, heavy internet users.<sup>2</sup> The 2019 e-Conomy SEA report, which covers the largest markets in the region (Indonesia, Malaysia, Philippines, Singapore, Thailand and Vietnam), found that there are 360 million internet users in Southeast Asia and that they are the most engaged mobile internet users in the world.<sup>3</sup>

The region's move towards digital gave rise to the financial technology and digital banking sectors. A survey of enterprise-level companies in the region by Analytics India Magazine estimated the analytics and data science industry at $6.8 billion in market size. The finance and banking sector was the biggest adopter of analytics and AI solutions at 35%.<sup>4</sup>

<img src="https://cdn.shortpixel.ai/client/to_webp,q_lossless,ret_img,w_810/https://analyticsindiamag.com/wp-content/uploads/2019/02/G2.png" width="600"/>
<p style="text-decoration:none;font-size:50%;text-align:center;color:gray;">Source: Analytics India Magazine (https://cdn.shortpixel.ai/client/to_webp,q_lossless,ret_img,w_810/https://analyticsindiamag.com/wp-content/uploads/2019/02/G2.png)</p>

<div style="text-align: left;" markdown="1"><font size="4"><b>1.2. Private Companies are Adopting</b></font> </div>
<br>
With the acceleration of digitization comes the start of AI adoption. Reports have identified an increase in AI adoption by companies in Southeast Asia in the last decade. In a 2018 McKinsey Global Institute study, the share of companies that mentioned “big data”, “advanced analytics”, “AI”, “machine learning”, and the “internet of things” in their annual reports increased from 6% in 2011 to around 33% in 2016, an increase attributed to the positive outcomes of adoption in companies.<sup>5</sup>

The increase in AI adoption in the region is also reflected in the number of SEA survey participants. From 2017 to 2019, the share grew by 0.5 percentage points.

In [None]:
### COMPARING THE SHARE OF SEA SURVEY PARTICIPANTS FROM 2017 TO 2019 ###

# dataset for 2017
df_2017 = pd.read_csv("/kaggle/input/kaggle-survey-2017/multipleChoiceResponses.csv", encoding = "ISO-8859-1")
df_2017["sea_tag"] = df_2017.apply(lambda x: tag_sea_countries(x["Country"]), axis=1)
df_2017["survey_year"] = 2017
sea_2017 = df_2017[df_2017["sea_tag"]=="SEA"]
sea_2017_ = sea_2017[["survey_year", "Country"]]
sea_2017_.columns = ["survey_year", "country"]
sea_2017_grouped = (sea_2017_.groupby("survey_year")["country"].count()).reset_index(name="count")
sea_2017_grouped["total"] = len(df_2017)
sea_2017_grouped["percent_sea"] = round((sea_2017_grouped["count"]/sea_2017_grouped["total"])*100,2)

# dataset for 2018
df_2018 = pd.read_csv("/kaggle/input/kaggle-survey-2018/multipleChoiceResponses.csv", skiprows=[0])    
df_2018["sea_tag"] = df_2018.apply(lambda x: tag_sea_countries(x["In which country do you currently reside?"]), axis=1)
df_2018["survey_year"] = 2018
sea_2018 = df_2018[df_2018["sea_tag"]=="SEA"]
sea_2018_ = sea_2018[["survey_year", "In which country do you currently reside?"]]
sea_2018_.columns = ["survey_year", "country"]
sea_2018_grouped = (sea_2018_.groupby("survey_year")["country"].count()).reset_index(name="count")
sea_2018_grouped["total"] = len(df_2018)
sea_2018_grouped["percent_sea"] = round((sea_2018_grouped["count"]/sea_2018_grouped["total"])*100,2)

#dataset for 2019
sea_2019 = sea_countries.copy()  # copy so the new column is not written onto a slice of df
sea_2019["survey_year"] = 2019
sea_2019_ = sea_2019[["survey_year", "In which country do you currently reside?"]]
sea_2019_.columns = ["survey_year", "country"]
sea_2019_grouped = (sea_2019_.groupby("survey_year")["country"].count()).reset_index(name="count")
sea_2019_grouped["total"] = len(df)
sea_2019_grouped["percent_sea"] = round((sea_2019_grouped["count"]/sea_2019_grouped["total"])*100,2)

# 2017-2019
sea_2017to2019 = (pd.concat([sea_2017_grouped, sea_2018_grouped, sea_2019_grouped])).reset_index(drop=True)

# plotting
gridsize = (10, 3)
fig = plt.figure(figsize=(40, 10))

ax1 = plt.subplot2grid(gridsize, (3, 1), rowspan=6)
ax1 = sns.barplot(x = "survey_year", 
                  y= "percent_sea",
                  data=sea_2017to2019,
                  color = "#529FCD"
                   )
ax1_title = ax1.set_title('Percentage of Respondents of SEA Countries per Year')
ax1_yticks = ax1.set_yticks([])
ax1_yticklabels = ax1.set_yticklabels([])
ax1_ylabel = ax1.set_ylabel("")
ax1_xlabel = ax1.set_xlabel("")
ax1.grid(False)
ax1.set_ylim(0,4)

for p in ax1.patches:
    patch = ax1.annotate(str(p.get_height())+"%", 
                        (p.get_x() + p.get_width() / 2.0, 
                         p.get_height()), 
                        ha = 'center', 
                        va = 'center', 
                        xytext = (0, 5),
                        textcoords = 'offset points')
sns.despine(ax=ax1, left=True, top=True, right=True,bottom=False)
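
The per-year share computed in the cell above boils down to "rows tagged SEA divided by all rows". A minimal sketch of that calculation, and of the change in percentage points between two years, is shown below. The helper `sea_share` and the toy data frames are hypothetical and are not survey results.

```python
import pandas as pd


def sea_share(responses, country_col, sea_names):
    """Percentage of all rows whose country is in sea_names (missing countries count as non-SEA)."""
    is_sea = responses[country_col].isin(sea_names)
    return round(is_sea.mean() * 100, 2)


# Toy data, not survey results: 1 of 4 and then 2 of 4 respondents are from SEA.
sea_names = ["Indonesia", "Singapore"]
year_a = pd.DataFrame({"Country": ["Indonesia", "India", "Brazil", "Japan"]})
year_b = pd.DataFrame({"Country": ["Indonesia", "Singapore", "Brazil", "Japan"]})

share_a = sea_share(year_a, "Country", sea_names)
share_b = sea_share(year_b, "Country", sea_names)

# A difference between two percentages is expressed in percentage points.
change_pp = round(share_b - share_a, 2)
print(share_a, share_b, change_pp)
```

Keeping the denominator as the full number of rows matches the cell above, which divides by `len(df)` rather than by the number of respondents who answered the country question.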

However, AI adoption in the SEA region still lags behind powerhouses like the US and China. Interestingly, the level of adoption in Southeast Asia is not consistent throughout the region. Singapore leads the pack, with Vietnam and Malaysia making significant progress in recent years. In a 2017 report produced by Asgard<sup>6</sup> ranking countries by the number of AI startups, Singapore made the top 20, ranking 14th worldwide. For the rest of the bloc, adoption has been slower. Although these countries realise the potential benefits that AI has to offer, they still lack the necessary infrastructure for adoption.<sup>7</sup>
<br>
<br>
<hr style="background-color:#E3692A;height:1px">

## **Challenges: Untapped Potential**

One of the companies I worked for gave me the opportunity to work with European data scientists. This gave me the chance to learn different approaches and industry best practices, things which would be difficult to learn on your own and from local business practices.

Unsurprisingly (and unfortunately), a lot of the data scientists I know have decided to pursue their studies and work overseas for better training and opportunities. This was a common occurrence in academia, where my colleagues would take their chances overseas to pursue postgraduate degrees and eventually settle there to continue their careers. I thought it would be different in the workplace. But it is not. Filipinos who want to develop their passion will seek out every opportunity to do so. And sometimes, these opportunities cannot be found in their own country.

<div style="text-align: left;" markdown="1"><font size="4"><b>2.1. DS Practitioners</b></font> </div>
<br>
Compared to other countries, data science practitioners in the region make up a small share of the total DS talent. In the Global AI Talent Report for 2019, just five countries (the United States, China, the United Kingdom, Germany and Canada) accounted for 71.62% of machine learning paper authors in 2018. The SEA region, represented by Singapore, Malaysia, Thailand and Vietnam, accounted for only 1.45%.<sup>8</sup>

A 2019 Asia Partners report shared by Tech in Asia<sup>9</sup> shows the number of tech talents in 6 countries in Southeast Asia. Singapore and Vietnam were identified as having the most tech talent, with the Philippines a tier lower.

In [None]:
from IPython.display import Image
Image("/kaggle/input/dsinsea/sea_chessboard.png", width=600)


<p style="text-decoration:none;font-size:50%;text-align:left;color:gray;">Source: Asia Partners (https://www.techinasia.com/southeast-asias-golden-age)</p>

In the survey dataset, only 6 of the 11 SEA countries are represented, out of a total of 59 countries. Respondents from SEA make up 3% (653 out of 19,717) of all Kaggle respondents. This is well below the region's roughly 9% share of the world's population. Consistent with the report shared by Tech in Asia, Singapore and Vietnam have a larger share of SEA respondents. Indonesia topped the region in number of respondents, which may be due to Kaggle's good penetration among the country's data community.
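
As a quick check of the arithmetic in the paragraph above, the sketch below recomputes the respondent share and compares it with the region's population share. The variable names are hypothetical, and the 9% population figure is the approximate value quoted earlier, so the resulting ratio is only indicative.

```python
sea_respondents = 653      # SEA respondents, from the paragraph above
all_respondents = 19_717   # all respondents, from the paragraph above
population_share = 9.0     # approximate SEA share of world population, in percent

# Share of survey respondents who reside in SEA, in percent.
respondent_share = sea_respondents / all_respondents * 100

# Ratio below 1 means the region is under-represented relative to its population.
representation_ratio = respondent_share / population_share

print(f"{respondent_share:.2f}% of respondents")
print(f"{representation_ratio:.2f} times the population share")
```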

In [None]:
### COMPARING THE 2019 SURVEY PARTICIPANTS PER SEA COUNTRY ###

gridsize = (5, 5)
fig = plt.figure(figsize=(20, 15))

ax1 = plt.subplot2grid(gridsize, (0, 2), colspan=5, rowspan=2)
ax1 = sns.countplot(x = "In which country do you currently reside?", 
                    data = sea_countries,
                    order = sea_countries["In which country do you currently reside?"].value_counts().index,
                    color = "#529FCD"
                   )
ax1_title = ax1.set_title('Number of Respondents per SEA Country')
ax1_yticklabels = ax1.set_yticklabels([])
ax1_ylabel = ax1.set_ylabel("")
ax1_xlabel = ax1.set_xlabel("")
ax1.grid(False)
sns.despine(ax=ax1, left=True, top=True, right=True,bottom=True)

for p in ax1.patches:
    patch = ax1.annotate(p.get_height(), 
                        (p.get_x() + p.get_width() / 2.0, 
                         p.get_height()), 
                        ha = 'center', 
                        va = 'center', 
                        xytext = (0, 5),
                        textcoords = 'offset points')

### COMPARING THE COUNTRIES OF 2019 SURVEY PARTICIPANTS - SEA, ROW ###    
    
ax2 = plt.subplot2grid(gridsize, (0, 1))
ax2_set_title = ax2.set_title('Countries')
grouped_countries = (df.groupby('sea_tag')["In which country do you currently reside?"].nunique()).reset_index(name="count_country")
ax2 = plt.pie(grouped_countries['count_country'],
              labels=grouped_countries['sea_tag'],
              shadow=False,
              startangle=0,
              autopct='%1.2f%%',
              colors=["#A8A495", "#E3692A"]
             )

### COMPARING THE PARTICIPANTS OF 2019 SURVEY - SEA, ROW ###    

ax3 = plt.subplot2grid(gridsize, (1, 1))
ax3_set_title = ax3.set_title('Respondents')
grouped_respondents = (df.groupby('sea_tag')["In which country do you currently reside?"].count()).reset_index(name="count_respondents")
ax3 = plt.pie(grouped_respondents['count_respondents'],
              labels=grouped_respondents['sea_tag'],
              shadow=False,
              startangle=0,
              autopct='%1.2f%%',
              colors=["#A8A495", "#E3692A"],
             )

<div style="text-align: left;" markdown="1"><font size="4"><b>2.2. Gender Disparity</b></font> </div>
<br>
In 2018, Southeast Asia was reported to have a 42% female workforce participation rate. This is higher than the global average of 39% and is attributed to the cultural shifts that Southeast Asia has experienced in the past decades.<sup>10</sup>

However, considerable variation in gender parity persists throughout the region. In the Global Gender Gap Report of 2019, the Philippines ranked 8th (out of 149 countries) in closing the gender gap on four pillars: economic participation and opportunity, educational attainment, health and survival, and political empowerment. Laos ranked 26th, while the remaining countries ranked from 67th to 101st.<sup>10</sup>

This insight should not discount the fact that, specifically for data science, women are still underrepresented. In my team at work, only 2 out of 7 data scientists are women.

Similar trends can be seen region-wide and at a country level in the survey. Only 16% of the respondents from non-SEA countries are female. The SEA region slightly outperforms them, with 20% of its respondents being female. The Philippines does considerably better at 32%, but that share of female Kagglers still falls below both the female share of the Southeast Asian workforce and the global average.

In [None]:
### COMPARING THE GENDERS OF 2019 SURVEY PARTICIPANTS - SEA, ROW ###    

gridsize = (10, 3)
fig = plt.figure(figsize=(40, 10))


# RoW
ax1 = plt.subplot2grid(gridsize, (0, 1))
ax1_title = ax1.set_title('Gender Breakdown')
ax1_start = 0
ax1_never = round((len(nonsea_countries[nonsea_countries["What is your gender? - Selected Choice"]=="Female"])/len(nonsea_countries[nonsea_countries["What is your gender? - Selected Choice"].isin(["Female", "Male"])]))*100)
ax1_seldom = round((len(nonsea_countries[nonsea_countries["What is your gender? - Selected Choice"]=="Male"])/len(nonsea_countries[nonsea_countries["What is your gender? - Selected Choice"].isin(["Female", "Male"])]))*100)
# broken_barh takes (start, width) pairs, so the second segment's width is the male share alone
ax1.broken_barh([(ax1_start, ax1_never), (ax1_never, ax1_seldom)], [10, 9], facecolors=('#E3692A', '#529FCD'))
ax1.set_xlim(0, 100)
ax1.spines['left'].set_visible(False)
ax1.spines['bottom'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax1.set_yticks([15])  # one tick to match the single 'RoW' label set below
ax1.set_xticks([0, 25, 50, 75, 100])
ax1.set_axisbelow(True) 
ax1.set_xticklabels("")
ax1.set_yticklabels(['RoW'])
ax1.grid(axis='x')
ax1.text(ax1_never-6, 14.5, str(ax1_never)+"%", fontsize=8)
ax1.text((ax1_never+ax1_seldom)-6, 14.5, str(ax1_seldom)+"%", fontsize=8)

# SEA
ax2 = plt.subplot2grid(gridsize, (1, 1))
ax2_start = 0
ax2_never = round((len(sea_countries[sea_countries["What is your gender? - Selected Choice"]=="Female"])/len(sea_countries[sea_countries["What is your gender? - Selected Choice"].isin(["Female", "Male"])]))*100)
ax2_seldom = round((len(sea_countries[sea_countries["What is your gender? - Selected Choice"]=="Male"])/len(sea_countries[sea_countries["What is your gender? - Selected Choice"].isin(["Female", "Male"])]))*100)
# broken_barh takes (start, width) pairs, so the second segment's width is the male share alone
ax2.broken_barh([(ax2_start, ax2_never), (ax2_never, ax2_seldom)], [10, 9], facecolors=('#E3692A', '#529FCD'))
ax2.set_xlim(0, 100)
ax2.spines['left'].set_visible(False)
ax2.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.set_yticks([15])  # one tick to match the single 'SEA' label set below
ax2.set_xticks([0, 25, 50, 75, 100])
ax2.set_axisbelow(True) 
ax2.set_yticklabels(['SEA'])
ax2.grid(axis='x')
ax2.text(ax2_never-6, 14.5, str(ax2_never)+"%", fontsize=8)
ax2.text((ax2_never+ax2_seldom)-6, 14.5, str(ax2_seldom)+"%", fontsize=8)

### COMPARING THE GENDERS OF 2019 SURVEY PARTICIPANTS PER SEA COUNTRY ###    

ax3 = plt.subplot2grid(gridsize, (3, 1), rowspan=6)
names = ('Indonesia', 'Malaysia', 'Philippines', 'Singapore', 'Thailand', 'Viet Nam')
r = list(range(len(names)))
gender_col = "What is your gender? - Selected Choice"
country_col = "In which country do you currently reside?"
# count female and male respondents per SEA country, in the order given by `names`
raw_data = {
    gender: [len(sea_countries[(sea_countries[gender_col]==gender)&
                               (sea_countries[country_col]==country)])
             for country in names]
    for gender in ('Female', 'Male')
}
raw_data_df = pd.DataFrame(raw_data)
totals = [i+j for i,j in zip(raw_data_df['Female'], raw_data_df['Male'])]
# percentage of female and male respondents per country (among those two answers only)
female_pct = [round(i / j * 100) for i,j in zip(raw_data_df['Female'], totals)]
male_pct = [round(i / j * 100) for i,j in zip(raw_data_df['Male'], totals)]
barWidth = 0.85
ax3.bar(r, female_pct, color='#E3692A', edgecolor='white', width=barWidth, label="Female")
ax3.bar(r, male_pct, bottom=female_pct, color='#529FCD', edgecolor='white', width=barWidth, label="Male")
ax3_title = ax3.set_title('Gender Breakdown per SEA Country')
ax3_xticks = plt.xticks(r, names)
ax3_legend = plt.legend(loc='upper left', bbox_to_anchor=(1,1), ncol=1)

for r in range(0,6,1):
    p = ax3.patches[r]
    patch = ax3.annotate(str(p.get_height())+"%", 
                        (p.get_x() + p.get_width()/2.0, 
                         p.get_height()/2.0), 
                        ha = 'center', 
                        va = 'center')
    q = ax3.patches[r+6]
    patch = ax3.annotate(str(q.get_height())+"%", 
                        (q.get_x() + q.get_width()/2.0, 
                         p.get_height() + (q.get_height()/2.0)), 
                        ha = 'center', 
                        va = 'center')
ax3.grid(False)
ax3_yticklabels = ax3.set_yticklabels([])

sns.despine(ax=ax3, left=True, top=True, right=True,bottom=True)
sns.despine(ax=ax2, left=True, top=True, right=True,bottom=True)
sns.despine(ax=ax1, left=True, top=True, right=True,bottom=True)
ax1.grid(False)
ax1_xticklabels = ax1.set_xticklabels([])
ax2.grid(False)
ax2_xticklabels = ax2.set_xticklabels([])

<div style="text-align: left;" markdown="1"><font size="4"><b>2.3. Age, Education and Salary</b></font> </div>
<br>


Kagglers in the SEA region are relatively young, with around 60% aged 18 to 29 compared with only 50% of the rest of the respondents. This is not surprising given how new the data science field is in the region.

In [None]:
### COMPARING THE AGE OF 2019 SURVEY PARTICIPANTS - SEA, ROW ###    

column_tosee = "What is your age (# years)?"
short_column_tosee = "age"

# doing all of this to get a pivot for a barh subplot!!!
sea_nonsea_agedata = (df.groupby(['sea_tag', column_tosee])["sea_tag"].count()).reset_index(name="count")
sea_nonsea_agedata.columns = ["sea_tag", short_column_tosee, "count"]
pivoted_sea_nonsea_agedata = (sea_nonsea_agedata.pivot(index='sea_tag', columns=short_column_tosee, values='count')).reset_index()
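# numeric_only skips the string sea_tag column, which newer pandas versions refuse to sum
pivoted_sea_nonsea_agedata.loc[:,'Total'] = pivoted_sea_nonsea_agedata.sum(axis=1, numeric_only=True)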
answers = df[column_tosee].dropna().unique().tolist()
for a in answers:
    pivoted_sea_nonsea_agedata[a] = round((pivoted_sea_nonsea_agedata[a]/pivoted_sea_nonsea_agedata["Total"])*100, 2)
pivoted_sea_nonsea_agedata.reset_index(drop=True, inplace=True)
pivoted_sea_nonsea_agedata.set_index("sea_tag", inplace=True)
pivoted_sea_nonsea_agedata = pivoted_sea_nonsea_agedata.rename_axis(None, axis=1)
pivoted_sea_nonsea_agedata = pivoted_sea_nonsea_agedata.rename_axis(None, axis=0)
pivoted_sea_nonsea_agedata = pivoted_sea_nonsea_agedata.drop(['Total'], axis=1)
pivoted_sea_nonsea_agedata = pivoted_sea_nonsea_agedata.loc[['SEA', 'RoW'], :]

gridsize = (2, len(answers))
fig = plt.figure(figsize=(20, 7))
fig.suptitle('Age: SEA-RoW', fontsize=14)

sns.set_style("whitegrid")

ax_frames = []
counter = 0
for ans in answers:
    ax = plt.subplot2grid(gridsize, (0, counter))
    counter += 1
    ax_frames.append(ax)

# barh barh barh
sns.set_style("whitegrid")
pivoted_sea_nonsea_agedata.plot(kind='barh', subplots=True, sharey=True, layout=(1,len(ax_frames)), legend=False, 
                                 xticks=[], yticks=[], ax=ax_frames,
                                 grid=False, xlim=(0, 35), edgecolor='none', fontsize=14,
                                 color = sns.light_palette(sns.color_palette("Blues_r")[0], len(ax_frames))[::-1]
                                )

sns.despine(left=False, top=True, right=True, bottom=True)

# labels!!!
for a in ax_frames:
    for p in a.patches:
        patch = a.annotate(str(p.get_width())+"%", 
                            (p.get_width(), 
                             p.get_y() + p.get_height()), 
                            ha = 'center', 
                            va = 'center', 
                            xytext = (20, -12),
                            textcoords = 'offset points'
                            )
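
The group, pivot and divide steps in the cell above compute, for each of SEA and RoW, the percentage of respondents in every age band. As a minimal sketch, the same kind of table can be produced in one step with `pd.crosstab` and `normalize="index"`. The toy data frame below is hypothetical and is not survey data.

```python
import pandas as pd

# Toy responses, not survey data: four respondents per group.
toy = pd.DataFrame({
    "sea_tag": ["SEA", "SEA", "SEA", "SEA", "RoW", "RoW", "RoW", "RoW"],
    "age": ["18-21", "22-24", "22-24", "25-29", "22-24", "25-29", "30-34", "30-34"],
})

# normalize="index" divides each count by its row total, so every row sums to 1.
age_share = (pd.crosstab(toy["sea_tag"], toy["age"], normalize="index") * 100).round(2)

# Put SEA first, matching the row order used in the charts.
age_share = age_share.loc[["SEA", "RoW"]]
print(age_share)
```

Age bands with no respondents in a group show up as 0 here, whereas the pivot in the cell above leaves them as missing values.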

Almost half of the Kagglers in the SEA region have only a bachelor's degree, while the rest of the world has a larger share of master's and doctoral degrees. In the Philippines, taking a postgraduate degree is uncommon. Children who have finished their degrees are expected to contribute to the finances of the entire family once they start working.

In [None]:
### COMPARING THE EDUCATION OF 2019 SURVEY PARTICIPANTS - SEA, ROW ###    

column_tosee = "What is the highest level of formal education that you have attained or plan to attain within the next 2 years?"
short_column_tosee = "formal_education"

# doing all of this to get a pivot for a barh subplot!!!
sea_nonsea_educdata = (df.groupby(['sea_tag', column_tosee])["sea_tag"].count()).reset_index(name="count")
sea_nonsea_educdata.columns = ["sea_tag", short_column_tosee, "count"]
pivoted_sea_nonsea_educdata = (sea_nonsea_educdata.pivot(index='sea_tag', columns=short_column_tosee, values='count')).reset_index()
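# numeric_only skips the string sea_tag column, which newer pandas versions refuse to sum
pivoted_sea_nonsea_educdata.loc[:,'Total'] = pivoted_sea_nonsea_educdata.sum(axis=1, numeric_only=True)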
answers = df[column_tosee].dropna().unique().tolist()
for a in answers:
    pivoted_sea_nonsea_educdata[a] = round((pivoted_sea_nonsea_educdata[a]/pivoted_sea_nonsea_educdata["Total"])*100, 2)
pivoted_sea_nonsea_educdata.reset_index(drop=True, inplace=True)
pivoted_sea_nonsea_educdata.set_index("sea_tag", inplace=True)
pivoted_sea_nonsea_educdata = pivoted_sea_nonsea_educdata.rename_axis(None, axis=1)
pivoted_sea_nonsea_educdata = pivoted_sea_nonsea_educdata.rename_axis(None, axis=0)
pivoted_sea_nonsea_educdata = pivoted_sea_nonsea_educdata.drop(['Total'], axis=1)
pivoted_sea_nonsea_educdata = pivoted_sea_nonsea_educdata[[
    "No formal education past high school", 
    "Professional degree",
    "Some college/university study without earning a bachelor’s degree", 
    "Bachelor’s degree", 
    "Master’s degree", 
    "Doctoral degree", 
#     "I prefer not to answer"
]]
pivoted_sea_nonsea_educdata.columns = [
    "No formal education \npast high school", 
    "Professional degree",
    "Some college/university \nstudy without earning \na bachelor’s degree", 
    "Bachelor’s degree", 
    "Master’s degree", 
    "Doctoral degree", 
#     "I prefer not to answer"
]
pivoted_sea_nonsea_educdata = pivoted_sea_nonsea_educdata.loc[['SEA', 'RoW'], :]

gridsize = (2, len(pivoted_sea_nonsea_educdata.columns.tolist()))
fig = plt.figure(figsize=(20, 10))
sns.set_style("whitegrid")
fig.suptitle('Formal Education: SEA-RoW', fontsize=14)


ax_frames = []
counter = 0
for ans in pivoted_sea_nonsea_educdata.columns.tolist():
    ax = plt.subplot2grid(gridsize, (0, counter))
    counter += 1
    ax_frames.append(ax)

# barh barh barh
sns.set_style("whitegrid")
pivoted_sea_nonsea_educdata.plot(kind='barh', subplots=True, sharey=True, layout=(1,len(ax_frames)), legend=False, 
                                 xticks=[], yticks=[], 
                                 ax=ax_frames,
                                 grid=False, xlim=(0, 55), edgecolor='none', fontsize=14,
                                 color = sns.light_palette(sns.color_palette("Blues_r")[0], len(ax_frames))[::-1]
                                )

sns.despine(left=False, top=True, right=True, bottom=True)

# labels!!!
for a in ax_frames:
    for p in a.patches:
        patch = a.annotate(str(p.get_width())+"%", 
                            (p.get_width(), 
                             p.get_y() + p.get_height()), 
                            ha = 'center', 
                            va = 'center', 
                            xytext = (20, -12),
                            textcoords = 'offset points'
                            )

Unsurprisingly, the annual compensation of SEA Kagglers is lower than that of the rest of the respondents. The largest share of SEA Kagglers (25%) reported annual compensation of less than USD 5,000. In contrast, the largest share of non-SEA Kagglers (16%) reported annual compensation of USD 25,000 to 69,999. This may be attributed to the lower average cost of living in SEA.

An article from WorldData.info benchmarks the cost of living against the United States (index of 100). Singapore has the highest cost of living index in SEA at 95.8, while the Philippines has the lowest, at 44.3, among the SEA countries in the report. The other SEA countries covered (Indonesia, Vietnam, Malaysia, Cambodia and Thailand) have indices ranging from 44.7 to 61.8.<sup>11</sup>
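
To illustrate how a cost of living index can put compensation figures in context, the sketch below scales a USD amount to US price levels using the two index values quoted above. The helper `us_equivalent` is hypothetical, and the adjustment is deliberately crude: the index describes general prices, not data science salaries, so the output is an illustration rather than a finding.

```python
US_INDEX = 100.0  # the United States is the benchmark

# Index values quoted in the paragraph above.
cost_index = {"Singapore": 95.8, "Philippines": 44.3}


def us_equivalent(amount_usd, country):
    """Scale a USD amount earned in `country` to US price levels."""
    return amount_usd * US_INDEX / cost_index[country]


# How far USD 5,000 stretches in each country, expressed at US prices.
for country in cost_index:
    print(country, round(us_equivalent(5_000, country)))
```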

In [None]:
### COMPARING THE COMPENSATION OF 2019 SURVEY PARTICIPANTS - SEA, ROW ###    

column_tosee = "new_compensation_bin"
short_column_tosee = "compensation"

# doing all of this to get a pivot for a barh subplot!!!
sea_nonsea_compensationdata = (df.groupby(['sea_tag', column_tosee])["sea_tag"].count()).reset_index(name="count")
sea_nonsea_compensationdata.columns = ["sea_tag", short_column_tosee, "count"]
pivoted_sea_nonsea_compensationdata = (sea_nonsea_compensationdata.pivot(index='sea_tag', columns=short_column_tosee, values='count')).reset_index()
# sum only the numeric count columns; the 'sea_tag' label column is excluded
pivoted_sea_nonsea_compensationdata.loc[:,'Total'] = pivoted_sea_nonsea_compensationdata.sum(axis=1, numeric_only=True)
answers = df[column_tosee].dropna().unique().tolist()
for a in answers:
    pivoted_sea_nonsea_compensationdata[a] = round((pivoted_sea_nonsea_compensationdata[a]/pivoted_sea_nonsea_compensationdata["Total"])*100, 2)
pivoted_sea_nonsea_compensationdata.reset_index(drop=True, inplace=True)
pivoted_sea_nonsea_compensationdata.set_index("sea_tag", inplace=True)
pivoted_sea_nonsea_compensationdata = pivoted_sea_nonsea_compensationdata.rename_axis(None, axis=1)
pivoted_sea_nonsea_compensationdata = pivoted_sea_nonsea_compensationdata.rename_axis(None, axis=0)
pivoted_sea_nonsea_compensationdata = pivoted_sea_nonsea_compensationdata.drop(['Total'], axis=1)
pivoted_sea_nonsea_compensationdata = pivoted_sea_nonsea_compensationdata[["$0-4,999", "$5,000-24,999", "$25,000-69,999", "$70,000-149,999", "> $150,000", "Null"]]
pivoted_sea_nonsea_compensationdata = pivoted_sea_nonsea_compensationdata.loc[['SEA', 'RoW'], :]

gridsize = (2, len(answers))
fig = plt.figure(figsize=(20, 7))
sns.set_style("whitegrid")
fig.suptitle('Annual Compensation: SEA-RoW', fontsize=14)

ax_frames = []
counter = 0
for ans in answers:
    ax = plt.subplot2grid(gridsize, (0, counter))
    counter += 1
    ax_frames.append(ax)

# barh barh barh
sns.set_style("whitegrid")
pivoted_sea_nonsea_compensationdata.plot(kind='barh', subplots=True, sharey=True, layout=(1,len(ax_frames)), legend=False, 
                                 xticks=[], yticks=[], ax=ax_frames,
                                 grid=False, xlim=(0, 50), edgecolor='none', fontsize=14,
                                 color = sns.light_palette(sns.color_palette("Blues_r")[0], len(ax_frames))[::-1]
                                )

sns.despine(left=False, top=True, right=True, bottom=True)

# labels!!!
for a in ax_frames:
    for p in a.patches:
        patch = a.annotate(str(p.get_width())+"%", 
                            (p.get_width(), 
                             p.get_y() + p.get_height()), 
                            ha = 'center', 
                            va = 'center', 
                            xytext = (20, -12),
                            textcoords = 'offset points'
                            )

A closer look at annual compensation by country shows that the distributions differ across countries, with Singapore having a larger share of kagglers in the higher compensation bins.

In [None]:
### COMPARING THE COMPENSATION OF 2019 SURVEY PARTICIPANTS - PER SEA COUNTRY ###    

column_tosee = "new_compensation_bin"
short_column_tosee = "compensation"

# doing all of this to get a pivot for a barh subplot!!!
sea_nonsea_compensationdata = (sea_countries.groupby(['In which country do you currently reside?', column_tosee])["In which country do you currently reside?"].count()).reset_index(name="count")
sea_nonsea_compensationdata.columns = ["In which country do you currently reside?", short_column_tosee, "count"]
pivoted_sea_nonsea_compensationdata = (sea_nonsea_compensationdata.pivot(index='In which country do you currently reside?', columns=short_column_tosee, values='count')).reset_index()
# sum only the numeric count columns; the country label column is excluded
pivoted_sea_nonsea_compensationdata.loc[:,'Total'] = pivoted_sea_nonsea_compensationdata.sum(axis=1, numeric_only=True)
answers = sea_countries[column_tosee].dropna().unique().tolist()
for a in answers:
    pivoted_sea_nonsea_compensationdata[a] = round((pivoted_sea_nonsea_compensationdata[a]/pivoted_sea_nonsea_compensationdata["Total"])*100, 2)
pivoted_sea_nonsea_compensationdata.reset_index(drop=True, inplace=True)
pivoted_sea_nonsea_compensationdata.set_index("In which country do you currently reside?", inplace=True)
pivoted_sea_nonsea_compensationdata = pivoted_sea_nonsea_compensationdata.rename_axis(None, axis=1)
pivoted_sea_nonsea_compensationdata = pivoted_sea_nonsea_compensationdata.rename_axis(None, axis=0)
pivoted_sea_nonsea_compensationdata = pivoted_sea_nonsea_compensationdata.drop(['Total'], axis=1)
pivoted_sea_nonsea_compensationdata = pivoted_sea_nonsea_compensationdata[["$0-4,999", "$5,000-24,999", "$25,000-69,999", "$70,000-149,999", "> $150,000", "Null"]]
pivoted_sea_nonsea_compensationdata = pivoted_sea_nonsea_compensationdata.loc[['Viet Nam', 'Indonesia', 'Philippines', 'Malaysia', 'Thailand', 'Singapore'], :]

gridsize = (2, len(answers))
fig = plt.figure(figsize=(20, 10))
sns.set_style("whitegrid")
fig.suptitle('Annual Compensation per SEA Countries', fontsize=14)


ax_frames = []
counter = 0
for ans in answers:
    ax = plt.subplot2grid(gridsize, (0, counter))
    counter += 1
    ax_frames.append(ax)

# barh barh barh
sns.set_style("whitegrid")
pivoted_sea_nonsea_compensationdata.plot(kind='barh', subplots=True, sharey=True, layout=(1,len(ax_frames)), legend=False, 
                                 xticks=[], yticks=[], ax=ax_frames,
                                 grid=False, xlim=(0, 50), edgecolor='none', fontsize=14,
                                 color = '#E3692A'
                                )

sns.despine(left=False, top=True, right=True, bottom=True)

# labels!!!
for a in ax_frames:
    for p in a.patches:
        patch = a.annotate(str(p.get_width())+"%", 
                            (p.get_width(), 
                             p.get_y() + p.get_height()), 
                            ha = 'center', 
                            va = 'center', 
                            xytext = (20, -12),
                            textcoords = 'offset points'
                            )

If we remove Singapore from the comparison, the share of SEA kagglers with annual compensation of less than USD 5,000 rises from 25% to 28%, and the share with annual compensation of USD 5,000 - 24,999 rises from 18% to 21%.
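
The effect of dropping one country on the group shares can be sketched with a few lines of pandas. The rows below are made up purely for illustration (they are not survey responses), and `bin_shares` is a hypothetical helper, not part of the notebook's pipeline.

```python
import pandas as pd

# Toy responses, made up for illustration only (NOT survey data).
toy = pd.DataFrame({
    "country": ["Singapore", "Singapore", "Philippines",
                "Philippines", "Indonesia", "Viet Nam"],
    "compensation": ["> $150,000", "$70,000-149,999", "$0-4,999",
                     "$5,000-24,999", "$0-4,999", "$0-4,999"],
})


def bin_shares(frame):
    """Hypothetical helper: percentage of respondents in each compensation bin."""
    return (frame["compensation"].value_counts(normalize=True) * 100).round(2)


with_sg = bin_shares(toy)
without_sg = bin_shares(toy[toy["country"] != "Singapore"])

# Removing a high-earning country shrinks the denominator, so the share of the
# lowest bin rises even though the number of low earners is unchanged.
print(with_sg["$0-4,999"], without_sg["$0-4,999"])
```

The same mechanism is at work in the survey figures: the counts in the lower bins barely change, but they are divided by a smaller total.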

In [None]:
### COMPARING THE COMPENSATION OF 2019 SURVEY PARTICIPANTS - SEA (without singapore), ROW ###    

new_df = df[df['In which country do you currently reside?']!="Singapore"]

column_tosee = "new_compensation_bin"
short_column_tosee = "compensation"

# doing all of this to get a pivot for a barh subplot!!!
sea_nonsea_compensationWOsingdata = (new_df.groupby(['sea_tag', column_tosee])["sea_tag"].count()).reset_index(name="count")
sea_nonsea_compensationWOsingdata.columns = ["sea_tag", short_column_tosee, "count"]
pivoted_sea_nonsea_compensationWOsingdata = (sea_nonsea_compensationWOsingdata.pivot(index='sea_tag', columns=short_column_tosee, values='count')).reset_index()
# sum only the numeric count columns; the 'sea_tag' label column is excluded
pivoted_sea_nonsea_compensationWOsingdata.loc[:,'Total'] = pivoted_sea_nonsea_compensationWOsingdata.sum(axis=1, numeric_only=True)
answers = new_df[column_tosee].dropna().unique().tolist()
for a in answers:
    pivoted_sea_nonsea_compensationWOsingdata[a] = round((pivoted_sea_nonsea_compensationWOsingdata[a]/pivoted_sea_nonsea_compensationWOsingdata["Total"])*100, 2)
pivoted_sea_nonsea_compensationWOsingdata.reset_index(drop=True, inplace=True)
pivoted_sea_nonsea_compensationWOsingdata.set_index("sea_tag", inplace=True)
pivoted_sea_nonsea_compensationWOsingdata = pivoted_sea_nonsea_compensationWOsingdata.rename_axis(None, axis=1)
pivoted_sea_nonsea_compensationWOsingdata = pivoted_sea_nonsea_compensationWOsingdata.rename_axis(None, axis=0)
pivoted_sea_nonsea_compensationWOsingdata = pivoted_sea_nonsea_compensationWOsingdata.drop(['Total'], axis=1)
pivoted_sea_nonsea_compensationWOsingdata = pivoted_sea_nonsea_compensationWOsingdata[["$0-4,999", "$5,000-24,999", "$25,000-69,999", "$70,000-149,999", "> $150,000", "Null"]]
pivoted_sea_nonsea_compensationWOsingdata = pivoted_sea_nonsea_compensationWOsingdata.loc[['SEA', 'RoW'], :]
pivoted_sea_nonsea_compensationWOsingdata.index = ['SEA - \nw/o Singapore', 'RoW']

gridsize = (2, len(answers))
fig = plt.figure(figsize=(20, 7))
sns.set_style("whitegrid")
fig.suptitle('Annual Compensation: SEA(w/o Singapore)-RoW', fontsize=14)

ax_frames = []
counter = 0
for ans in answers:
    ax = plt.subplot2grid(gridsize, (0, counter))
    counter += 1
    ax_frames.append(ax)

# barh barh barh
sns.set_style("whitegrid")
pivoted_sea_nonsea_compensationWOsingdata.plot(kind='barh', subplots=True, sharey=True, layout=(1,len(ax_frames)), legend=False, 
                                 xticks=[], yticks=[], ax=ax_frames,
                                 grid=False, xlim=(0, 50), edgecolor='none', fontsize=14,
                                 color = sns.light_palette(sns.color_palette("Blues_r")[0], len(ax_frames))[::-1]
                                )

sns.despine(left=False, top=True, right=True, bottom=True)

# labels!!!
for a in ax_frames:
    for p in a.patches:
        patch = a.annotate(str(p.get_width())+"%", 
                            (p.get_width(), 
                             p.get_y() + p.get_height()), 
                            ha = 'center', 
                            va = 'center', 
                            xytext = (20, -12),
                            textcoords = 'offset points'
                            )

<div style="text-align: left;" markdown="1"><font size="4"><b>2.4. Tools and Experience</b></font> </div>
<br>
Unsurprisingly, kagglers in the SEA region have less experience when it comes to implementing ML methods compared to the rest of the respondents.

In [None]:
### COMPARING THE NUMBER OF 2019 SURVEY PARTICIPANTS - per ML YEARS ###    

column_tosee = 'For how many years have you used machine learning methods?'
shorter_column_tosee = 'years_used_ml_methods'

sea_nonsea_mlyearsdata = (sea_countries.groupby([column_tosee])["sea_tag"].count()).reset_index(name="count")
sea_nonsea_mlyearsdata.columns = [shorter_column_tosee, "count"]
sea_nonsea_mlyearsdata.sort_values(['count'], ascending=False, inplace=True)
sea_nonsea_mlyearsdata.reset_index(drop=True, inplace=True)
length_answer = len(df[column_tosee].dropna().unique().tolist())

gridsize = (length_answer, 9)
fig = plt.figure(figsize=(30, 8))
sns.set_style("whitegrid")
fig.suptitle('Years using Machine Learning Methods: SEA-RoW', fontsize=14)

ax2 = plt.subplot2grid(gridsize, (0, 4), colspan=3, rowspan=length_answer)
ax2 = sns.barplot(x='count', y=shorter_column_tosee, data=sea_nonsea_mlyearsdata, 
                  color = "#529FCD"
#                   palette=sns.light_palette(sns.color_palette("Blues_r")[0], length_answer)[::-1]
                 )
ax2_title = ax2.set_title('S.E.A. Countries')
new_yticks = [ax.get_text().replace("(", "\n(") for ax in ax2.get_yticklabels()]
ax2_new_yticks = ax2.set_yticklabels(new_yticks, {"horizontalalignment":"center", "x":"-0.2"})
ax2_ylabel = ax2.set_ylabel("")
ax2_xlabel = ax2.set_xlabel("")
ax2.grid(False)
for p in ax2.patches:
    patch = ax2.annotate(str(int(p.get_width())), 
                        (p.get_width(), 
                         p.get_y() + p.get_height()), 
                        ha = 'center', 
                        va = 'center', 
                        xytext = (15, 25),
                        textcoords = 'offset points'
                        )
sns.despine(ax=ax2, left=True, top=True, right=True,bottom=False)

### COMPARING THE ML METHODS YEARS OF 2019 SURVEY PARTICIPANTS - SEA, RoW ###    

grouped_all = df.groupby(["sea_tag", column_tosee])["sea_tag"].count().reset_index(name="count")
grouped_all.columns = ["sea_tag", shorter_column_tosee, "count"]
len_sea = grouped_all.groupby("sea_tag")["count"].sum()["SEA"]
len_nonsea = grouped_all.groupby("sea_tag")["count"].sum()["RoW"]
grouped_all["count_percent"] = grouped_all.apply(lambda row: round((row["count"]/len_sea)*100,2) if row["sea_tag"]=="SEA" else round((row["count"]/len_nonsea)*100,2), axis=1)

yticks_list = [ax.get_text() for ax in ax2.get_yticklabels()]
position = list(range(0,len(yticks_list)))

colors = ["#A8A495", "#E3692A"]

for po, b in zip(position, yticks_list):
    a = plt.subplot2grid(gridsize, (po, 1), colspan=2)
    a = sns.barplot(x = "sea_tag", 
                    y= "count_percent",
                    data=grouped_all[grouped_all[shorter_column_tosee]==b],
                    palette=colors  # set_palette() returns None, so pass the colors directly
                       )

    a.get_yaxis().set_visible(False)
    a.get_xaxis().set_visible(False)
    if po==len(yticks_list)-1:
        a.get_xaxis().set_visible(True)
        a_xlabel = a.set_xlabel("")
    for p in a.patches:
        patch = a.annotate(str(p.get_height())+"%", 
                            (p.get_x() + p.get_width() / 2.0, 
                             p.get_height()), 
                            ha = 'center', 
                            va = 'center', 
                            xytext = (0, 5),
                            textcoords = 'offset points')
    a_ylim = a.set_ylim(0, 60)
    sns.despine(ax=a, left=True, top=True, right=True,bottom=False)

Although the majority of SEA kagglers already use local development environments, there are notable differences from the rest of the respondents in the other tools they use. Twenty-four percent (24%) of SEA kagglers use basic statistical software such as Microsoft Excel and Google Sheets, compared with 19% of the rest of the respondents. However, use of cloud-based data software and APIs is lower among SEA kagglers (4%) than among the rest of the respondents (7%).

In [None]:
### COMPARING THE NUMBER OF 2019 SURVEY PARTICIPANTS - per PRIMARY TOOLS ###    

column_tosee = 'What is the primary tool that you use at work or school to analyze data? (Include text response) - Selected Choice'
shorter_column_tosee = 'tool_analyze_data'

sea_nonsea_ptooldata = (sea_countries.groupby([column_tosee])["sea_tag"].count()).reset_index(name="count")
sea_nonsea_ptooldata.columns = [shorter_column_tosee, "count"]
sea_nonsea_ptooldata.sort_values(['count'], ascending=False, inplace=True)
sea_nonsea_ptooldata.reset_index(drop=True, inplace=True)
length_answer = len(df[column_tosee].dropna().unique().tolist())

gridsize = (length_answer, 9)
fig = plt.figure(figsize=(30, 8))
sns.set_style("whitegrid")
fig.suptitle('Primary Tool Used at Work/School to Analyze Data: SEA-RoW', fontsize=14)

ax2 = plt.subplot2grid(gridsize, (0, 4), colspan=3, rowspan=length_answer)
ax2 = sns.barplot(x='count', y=shorter_column_tosee, data=sea_nonsea_ptooldata, 
                  color="#529FCD"
#                   palette=sns.color_palette("Blues_r")
                 )
ax2_title = ax2.set_title('S.E.A. Countries')
new_yticks = [ax.get_text().replace("(", "\n(") for ax in ax2.get_yticklabels()]
ax2_new_yticks = ax2.set_yticklabels(new_yticks, {"horizontalalignment":"center", "x":"-0.2"})
ax2_ylabel = ax2.set_ylabel("")
ax2_xlabel = ax2.set_xlabel("")
ax2.grid(False)
for p in ax2.patches:
    patch = ax2.annotate(str(int(p.get_width())), 
                        (p.get_width(), 
                         p.get_y() + p.get_height()), 
                        ha = 'center', 
                        va = 'center', 
                        xytext = (15, 25),
                        textcoords = 'offset points'
                        )
sns.despine(ax=ax2, left=True, top=True, right=True,bottom=False)

### COMPARING THE PRIMARY TOOLS OF 2019 SURVEY PARTICIPANTS - SEA, RoW ###    

grouped_all = df.groupby(["sea_tag", column_tosee])["sea_tag"].count().reset_index(name="count")
grouped_all.columns = ["sea_tag", shorter_column_tosee, "count"]
len_sea = grouped_all.groupby("sea_tag")["count"].sum()["SEA"]
len_nonsea = grouped_all.groupby("sea_tag")["count"].sum()["RoW"]
grouped_all["count_percent"] = grouped_all.apply(lambda row: round((row["count"]/len_sea)*100,2) if row["sea_tag"]=="SEA" else round((row["count"]/len_nonsea)*100,2), axis=1)

yticks_list = [ax.get_text().replace("\n(", "(") for ax in ax2.get_yticklabels()]
position = list(range(0,len(yticks_list)))

colors = ["#A8A495", "#E3692A"]

for po, b in zip(position, yticks_list):
    a = plt.subplot2grid(gridsize, (po, 1), colspan=2)
    a = sns.barplot(x = "sea_tag", 
                    y= "count_percent",
                    data=grouped_all[grouped_all[shorter_column_tosee]==b],
                    palette=colors  # set_palette() returns None, so pass the colors directly
                       )

    a.get_yaxis().set_visible(False)
    a.get_xaxis().set_visible(False)
    if po==len(yticks_list)-1:  # show the x axis only on the last subplot
        a.get_xaxis().set_visible(True)
        a_xlabel = a.set_xlabel("")
    for p in a.patches:
        patch = a.annotate(str(p.get_height())+"%", 
                            (p.get_x() + p.get_width() / 2.0, 
                             p.get_height()), 
                            ha = 'center', 
                            va = 'center', 
                            xytext = (0, 5),
                            textcoords = 'offset points')
    a_ylim = a.set_ylim(0, 60)
    sns.despine(ax=a, left=True, top=True, right=True,bottom=False)

<hr style="background-color:#E3692A;height:1px">

## **Solution: Coming Together**

<div style="text-align: left;" markdown="1"><font size="4"><b>3.1. Individual Development: Self-Improvement</b></font> </div>
<br>
In the past few years, schools in the Philippines have started offering analytics and data science courses to keep up with demand. Fortunately for me, I graduated with a degree in Physics, which trained me in mathematical methods and computational programming. This enabled me to transition to the DS industry relatively easily.

That doesn't mean I can rely only on my university courses to do well at work. We all know that data science is a field that requires knowledge of a vast range of topics. I make it a point to keep exploring approaches and fields that I have not yet tried (of which there are a lot!). One of my most memorable courses is Andrew Ng's Machine Learning course on Coursera, where I took pleasure in pausing the videos to derive the equations.

Similar behavior can be seen among our kagglers: most seek out Coursera as their learning platform, followed by Kaggle Learn and DataCamp.
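
Because this is a "select all that apply" question, each option lives in its own column that holds the option label when ticked and is empty otherwise. A minimal sketch of how such columns can be turned into shares is shown below; the rows and the shortened column names are made up for illustration and are not the survey's actual headers.

```python
import numpy as np
import pandas as pd

# Toy multi-select answers, made up for illustration (NOT survey data):
# one column per option, with the label when ticked and NaN otherwise.
toy = pd.DataFrame({
    "platform - Coursera": ["Coursera", "Coursera", np.nan, "Coursera"],
    "platform - DataCamp": [np.nan, "DataCamp", np.nan, np.nan],
    "platform - Kaggle Courses": ["Kaggle Courses", np.nan, np.nan, "Kaggle Courses"],
})

# notna() marks ticked options; the column mean is then the share of ALL
# respondents (including those who ticked nothing) who chose that option.
share = (toy.notna().mean() * 100).round(2).sort_values(ascending=False)
print(share)
```

Since respondents can tick several platforms, these shares are not expected to sum to 100%.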

In [None]:
### COMPARING THE NUMBER OF 2019 SURVEY PARTICIPANTS - per DS SOURCES for SEA ###    

ds_sources_columns = [
    'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Udacity',
    'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera',
    'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX',
    'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp',
    'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataQuest',
    'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Courses (i.e. Kaggle Learn)',
    'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Fast.ai',
    'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Udemy',
    'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - LinkedIn Learning',
    'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - University Courses (resulting in a university degree)',
    'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - None',
    'On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Other',
]

sea_dssources = sea_countries[ds_sources_columns]
count_allsea = len(sea_dssources)
sea_dssources_all = pd.DataFrame()
for msc in ds_sources_columns:
    dssources_ = ((sea_dssources[[msc]]).dropna().groupby(msc)[msc].count()).reset_index(name="count")
    dssources_.columns = ["ds_sources", "count"]
    dssources_["count_percent"] = round((dssources_["count"]/count_allsea)*100, 2)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    sea_dssources_all = pd.concat([sea_dssources_all, dssources_])
sea_dssources_all["sea_tag"] = "SEA"
sea_dssources_all

nonsea_dssources = nonsea_countries[ds_sources_columns]
count_allnonsea = len(nonsea_dssources)
nonsea_dssources_all = pd.DataFrame()
for msc in ds_sources_columns:
    dssources_ = ((nonsea_dssources[[msc]]).dropna().groupby(msc)[msc].count()).reset_index(name="count")
    dssources_.columns = ["ds_sources", "count"]
    dssources_["count_percent"] = round((dssources_["count"]/count_allnonsea)*100, 2)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    nonsea_dssources_all = pd.concat([nonsea_dssources_all, dssources_])
nonsea_dssources_all["sea_tag"] = "RoW"

allcountries_dssources_all = pd.concat([sea_dssources_all, nonsea_dssources_all])

column_tosee = 'ds_sources'
shorter_column_tosee = 'ds_sources'

sea_nonsea_tooldata = (sea_dssources_all.groupby([column_tosee])["count"].sum()).reset_index(name="count")
sea_nonsea_tooldata.columns = [shorter_column_tosee, "count"]
sea_nonsea_tooldata.sort_values(['count'], ascending=False, inplace=True)
sea_nonsea_tooldata.reset_index(drop=True, inplace=True)
length_answer = len(sea_dssources_all[column_tosee].dropna().unique().tolist())

gridsize = (length_answer, 9)
fig = plt.figure(figsize=(30, 10))
sns.set_style("whitegrid")
fig.suptitle('Data Science Courses: SEA-RoW', fontsize=14)

ax2 = plt.subplot2grid(gridsize, (0, 4), colspan=3, rowspan=length_answer)
ax2 = sns.barplot(x='count', y=shorter_column_tosee, data=sea_nonsea_tooldata, 
                  color = "#529FCD"
#                   palette=sns.light_palette(sns.color_palette("Blues_r")[0], length_answer)[::-1]
                 )
ax2_title = ax2.set_title('S.E.A. Countries')
new_yticks = [ax.get_text().replace("(", "\n(") for ax in ax2.get_yticklabels()]
ax2_new_yticks = ax2.set_yticklabels(new_yticks, {"horizontalalignment":"center", "x":"-0.2"})
ax2_ylabel = ax2.set_ylabel("")
ax2_xlabel = ax2.set_xlabel("")
ax2.grid(False)
for p in ax2.patches:
    patch = ax2.annotate(str(int(p.get_width())), 
                        (p.get_width(), 
                         p.get_y() + p.get_height()), 
                        ha = 'center', 
                        va = 'center', 
                        xytext = (15, 18),
                        textcoords = 'offset points'
                        )
sns.despine(ax=ax2, left=True, top=True, right=True,bottom=False)

### COMPARING THE NUMBER OF 2019 SURVEY PARTICIPANTS - per DS SOURCES for SEA, RoW ###    

grouped_all = allcountries_dssources_all

yticks_list = [ax.get_text().replace("\n(", "(") for ax in ax2.get_yticklabels()]
position = list(range(0,len(yticks_list)))

# define the colors before using them (previously relied on a value left over from an earlier cell)
colors = ["#A8A495", "#E3692A"]
sns.set_palette(sns.color_palette(colors))

for po, b in zip(position, yticks_list):
    a = plt.subplot2grid(gridsize, (po, 1), colspan=2)
    a = sns.barplot(x = "sea_tag", 
                    y= "count_percent",
                    data=grouped_all[grouped_all[shorter_column_tosee]==b],
                    palette=colors  # set_palette() returns None, so pass the colors directly
                       )

    a.get_yaxis().set_visible(False)
    a.get_xaxis().set_visible(False)
    if po==len(yticks_list)-1:
        a.get_xaxis().set_visible(True)
        a_xlabel = a.set_xlabel("")
    for p in a.patches:
        patch = a.annotate(str(p.get_height())+"%", 
                            (p.get_x() + p.get_width() / 2.0, 
                             p.get_height()), 
                            ha = 'center', 
                            va = 'center', 
                            xytext = (0, 5),
                            textcoords = 'offset points')
    a_ylim = a.set_ylim(0, 60)
    sns.despine(ax=a, left=True, top=True, right=True,bottom=False)

For news and reports, kagglers in our survey rely most on Kaggle and blogs to keep up with the latest in the field.

In [None]:
### COMPARING THE NUMBER OF 2019 SURVEY PARTICIPANTS - per MEDIA SOURCES for SEA, RoW ###    

media_sources_columns = [
    'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Twitter (data science influencers)',
    'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Hacker News (https://news.ycombinator.com/)',
    'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Reddit (r/machinelearning, r/datascience, etc)',
    'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Kaggle (forums, blog, social media, etc)',
    'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Course Forums (forums.fast.ai, etc)',
    'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - YouTube (Cloud AI Adventures, Siraj Raval, etc)',
    'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Podcasts (Chai Time Data Science, Linear Digressions, etc)',
    'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Blogs (Towards Data Science, Medium, Analytics Vidhya, KDnuggets etc)',
    'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Journal Publications (traditional publications, preprint journals, etc)',
    'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Slack Communities (ods.ai, kagglenoobs, etc)',
    'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - None',
    'Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Other',
]

sea_mediasources = sea_countries[media_sources_columns]
count_allsea = len(sea_mediasources)
sea_mediasources_all = pd.DataFrame()
for msc in media_sources_columns:
    mediasources_ = ((sea_mediasources[[msc]]).dropna().groupby(msc)[msc].count()).reset_index(name="count")
    mediasources_.columns = ["media_sources", "count"]
    mediasources_["count_percent"] = round((mediasources_["count"]/count_allsea)*100, 2)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    sea_mediasources_all = pd.concat([sea_mediasources_all, mediasources_])
sea_mediasources_all["sea_tag"] = "SEA"
sea_mediasources_all

nonsea_mediasources = nonsea_countries[media_sources_columns]
count_allnonsea = len(nonsea_mediasources)
nonsea_mediasources_all = pd.DataFrame()
for msc in media_sources_columns:
    mediasources_ = ((nonsea_mediasources[[msc]]).dropna().groupby(msc)[msc].count()).reset_index(name="count")
    mediasources_.columns = ["media_sources", "count"]
    mediasources_["count_percent"] = round((mediasources_["count"]/count_allnonsea)*100, 2)
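    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    nonsea_mediasources_all = pd.concat([nonsea_mediasources_all, mediasources_])
# use the same "RoW" label as the other charts and the figure title
nonsea_mediasources_all["sea_tag"] = "RoW"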

allcountries_mediasources_all = pd.concat([sea_mediasources_all, nonsea_mediasources_all])

column_tosee = 'media_sources'
shorter_column_tosee = 'media_sources'

sea_nonsea_tooldata = (sea_mediasources_all.groupby([column_tosee])["count"].sum()).reset_index(name="count")
sea_nonsea_tooldata.columns = [shorter_column_tosee, "count"]
sea_nonsea_tooldata.sort_values(['count'], ascending=False, inplace=True)
sea_nonsea_tooldata.reset_index(drop=True, inplace=True)

length_answer = len(sea_mediasources_all[column_tosee].dropna().unique().tolist())

gridsize = (length_answer, 9)
fig = plt.figure(figsize=(30, 10))
sns.set_style("whitegrid")
fig.suptitle('Media Sources: SEA-RoW', fontsize=14)

ax2 = plt.subplot2grid(gridsize, (0, 4), colspan=3, rowspan=length_answer)
ax2 = sns.barplot(x='count', y=shorter_column_tosee, data=sea_nonsea_tooldata,
                  color = "#529FCD"
#                   palette=sns.light_palette(sns.color_palette("Blues_r")[0], length_answer)[::-1]
                 )
ax2_title = ax2.set_title('S.E.A. Countries')
new_yticks = [ax.get_text().replace("(", "\n(").replace("Time Data Science,", "Time Data Science,\n").replace("Medium,", "Medium,\n").replace("publications,", "publications,\n") for ax in ax2.get_yticklabels()]
ax2_new_yticks = ax2.set_yticklabels(new_yticks, {"horizontalalignment":"center", "x":"-0.2"})
ax2_ylabel = ax2.set_ylabel("")
ax2_xlabel = ax2.set_xlabel("")
ax2.grid(False)
for p in ax2.patches:
    patch = ax2.annotate(str(int(p.get_width())), 
                        (p.get_width(), 
                         p.get_y() + p.get_height()), 
                        ha = 'center', 
                        va = 'center', 
                        xytext = (15, 17),
                        textcoords = 'offset points'
                        )
sns.despine(ax=ax2, left=True, top=True, right=True,bottom=False)

### COMPARING THE NUMBER OF 2019 SURVEY PARTICIPANTS - per MEDIA SOURCES for SEA, RoW ###    

grouped_all = allcountries_mediasources_all

yticks_list = [ax.get_text().replace("\n", "") for ax in ax2.get_yticklabels()]
position = list(range(0,len(yticks_list)))

colors = ["#A8A495", "#E3692A"]

for po, b in zip(position, yticks_list):
    a = plt.subplot2grid(gridsize, (po, 1), colspan=2)
    a = sns.barplot(x = "sea_tag", 
                    y= "count_percent",
                    data=grouped_all[grouped_all[shorter_column_tosee]==b],
                    palette=colors  # set_palette() returns None, so pass the colors directly
                       )

    a.get_yaxis().set_visible(False)
    a.get_xaxis().set_visible(False)
    if po==len(yticks_list)-1:
        a.get_xaxis().set_visible(True)
        a_xlabel = a.set_xlabel("")
    for p in a.patches:
        patch = a.annotate(str(p.get_height())+"%", 
                            (p.get_x() + p.get_width() / 2.0, 
                             p.get_height()), 
                            ha = 'center', 
                            va = 'center', 
                            xytext = (0, 5),
                            textcoords = 'offset points')
    a_ylim = a.set_ylim(0, 70)
    sns.despine(ax=a, left=True, top=True, right=True,bottom=False)

SEA kagglers are seeking out different learning platforms and media sources in order to continuously learn and be up to date with topics in the data science field.

<div style="text-align: left;" markdown="1"><font size="4"><b>3.2. Private Companies: Enticing Professionals</b></font> </div>
<br>

Private companies also organize talks, meetups and DS challenges to help develop talent in the region and, of course, to recruit data scientists. Whatever the motive, these events give the community good avenues to learn and develop their skills.

Southeast Asia's leading ride-hailing app, Grab, launched "AI for SEA". The challenge aimed to drive Southeast Asia forward by using AI to solve the region's biggest transportation-related problems. Cash prizes were offered to the top 5 individuals, while full-time positions at the company were offered to the top 50 participants.<sup>12</sup>

<img src="https://i.ytimg.com/vi/7BL8EeAkNDw/maxresdefault.jpg" width="600"/>
<p style="text-decoration:none;font-size:50%;text-align:center;color:gray;">Source: Grab (https://i.ytimg.com/vi/7BL8EeAkNDw/maxresdefault.jpg)</p>

<div style="text-align: left;" markdown="1"><font size="4"><b>3.3. Government Initiatives: Driving Adoption</b></font> </div>
<br>

Governments play a crucial role in furthering DS adoption. Take for example, the Singaporean government supported and pushed Singapore into a flourishing AI adoption. This enabled Singapore to implement, test and improve on several emerging technology solutions.<sup>13</sup>

Just this year in the Philippines, the country's Department of Science and Technology launched a program called "Smarter Philippines Through Data Analytics R&D, Training and Adoption" (SPARTA). The program aims to train 30,000 individuals in data science and analytics over a period of three years. The opportunity will be offered to people from the government, the academe, and the business process outsourcing (BPO) industry.<sup>14</sup>

<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Fdostpcieerd%2Fposts%2F1653473791456364&width=500" width="500" height="616" align="middle" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true" allow="encrypted-media"></iframe>
<p style="text-decoration:none;font-size:50%;text-align:center;color:gray;">Source: DOST PCIEERD (https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Fdostpcieerd%2Fposts%2F1653473791456364&width=500)</p>

<hr style="background-color:#E3692A;height:1px">

## **Learnings**

This notebook combines published articles about the region, some of my experiences as a data scientist in the Philippines, and the 2019 ML and DS Survey results. Here are some key realizations:

* <b>Ripe for Technological Advancements.</b> The region is undergoing a digital shift, which makes this a prime time to implement DS/AI. Companies are starting to embrace more data-driven and technologically advanced solutions. This presents plenty of opportunities.
* <b>Different Pace, Same Direction.</b> Southeast Asian countries are moving toward the same growth and improvement. However, there is still a lot of disparity in technological advancement and (surprisingly) even in the gender gap. Singapore is the hub for the region while the other countries are still catching up. These differences across the region should be treated as an opportunity for countries to learn from one another.
* <b>Young Talent Should Be Honed.</b> DS/AI talent, albeit young, is present in the region. The number of individuals seeking out training courses suggests that the younger generation is interested in the field. Private companies are enticing these individuals, but government agencies should also push their countries as a whole towards DS/AI adoption.

I admit that my sources are not enough to paint a complete picture of the region. However, they give us some pieces of the bigger puzzle. Hopefully, other SEA Kagglers (especially those in underrepresented markets) can read this and share their own experiences.

Thank you so much for reading my notebook! And from the teacher in me, I hope you learned a thing or two about our wonderful region! 

P.S. Our food is really good!
<br>
<br>
<hr style="background-color:#E3692A;height:1px">

## **References**

1. https://www.worldometers.info/world-population/south-eastern-asia-population/
2. https://www.thinkwithgoogle.com/_qs/documents/8600/e-Conomy_SEA_2019_Report.pdf
3. https://www.forbes.com/sites/jonathanmoed/2018/07/12/a-guide-to-southeast-asias-thriving-startup-ecosystem-heres-what-you-need-to-know/#3d9465e6e181
4. https://analyticsindiamag.com/study-state-of-analytics-in-south-east-asia-2019/
5. https://www.mckinsey.com/~/media/McKinsey/Featured%20Insights/Artificial%20Intelligence/AI%20and%20SE%20ASIA%20future/Artificial-intelligence-and-Southeast-Asias-future.ashx
6. https://asgard.vc/global-ai/
7. https://www.cio.com/article/3311756/how-is-artificial-intelligence-benefiting-industries-throughout-southeast-asia.html
8. https://jfgagne.ai/talent-2019/
9. https://www.techinasia.com/southeast-asias-golden-age
10. https://investinginwomen.asia/posts/mixed-gender-gap-data-south-east-asia/
11. https://www.worlddata.info/cost-of-living.php
12. https://www.hrinasia.com/press-release/grab-launches-first-ever-ai-for-southeast-asia-challenge/
13. https://www.cio.com/article/3311756/how-is-artificial-intelligence-benefiting-industries-throughout-southeast-asia.html
14. https://www.dap.edu.ph/moa-signing-for-the-project-sparta/