# Kaggle Survey 2020: Vietnamese & ML career
Hello Kagglers! I come from Vietnam. Today I will analyze the Kaggle Survey 2020 responses from Vietnam.  
NOTE: every assumption I make here about Kagglers/people applies only to Vietnamese respondents. I do not mean the general case at all.

Some personal notes about ML in Vietnam before we start:
- ML/DL and related fields are quite new in Vietnam. Only in the past 3 years have schools/universities started to develop a Data Science learning path.
- As a Software Engineering student, I am also taught basic ML algorithms, but only up to the Perceptron (a few nodes of a Neural Net).
- I find very few to no ML textbooks in bookstores.
- Companies are growing fast and have noticed the impact of data, so many ML jobs are highly compensated, but in short demand.
- A proud mention: about 2 (or 3) years ago, Vietnam's first AI research lab, VinAI, was founded, and within 1 year they had a paper accepted at ICML! https://www.vinai.io/an-overview-of-icml-2020s-publications/

These reasons are part of (but not all of) my motivation for doing this analysis. Kaggle is one of the most well-known ML platforms, so I think the survey will reveal interesting things about the ML careers of Vietnamese people.

# Update and prepare libraries

In [None]:
%%capture
!pip install --upgrade seaborn
!pip install phik

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import phik

sns.set_style("darkgrid")
%matplotlib inline 
plt.rcParams['figure.figsize'] = (12, 12.0)

# Read data & filter

In [None]:
df = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")
df = df.iloc[1:, :]
df = df[df["Q3"] == "Viet Nam"]
df

We only see 147 responses, which is a pretty low number.
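
The filtering step above can be sketched on a toy frame. In the raw CSV the first data row holds the question text rather than an answer, which is why it is dropped before filtering by country. The rows below are made up for illustration and are not real responses.

```python
import pandas as pd

# Toy stand-in for the survey CSV (made-up rows, not real responses).
# Row 0 repeats the question text, as in the real file.
raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "18-21", "25-29", "22-24"],
    "Q3": ["In which country do you currently reside?", "Viet Nam", "India", "Viet Nam"],
})

responses = raw.iloc[1:, :]                      # drop the question-text row
vn = responses[responses["Q3"] == "Viet Nam"]    # keep residents of Viet Nam only

print(len(vn), "of", len(responses), "toy responses are from Viet Nam")
```

On the real data, `len(df)` after the same two steps gives the response count quoted above.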

# Questions to answer:
* Q1: What is your age (# years)?
* Q2: What is your gender?
* Q3: In which country do you currently reside?
* Q4: What is the highest level of formal education that you have attained or plan to attain within the next 2 years?
* Q5: Select the title most similar to your current role (or most recent title if retired)
* Q6: For how many years have you been writing code and/or programming?
* Q7: What programming languages do you use on a regular basis?
* Q8: What programming language would you recommend an aspiring data scientist to learn first?
* Q9: Which of the following integrated development environments (IDE's) do you use on a regular basis?
* Q10: Which of the following hosted notebook products do you use on a regular basis?
* Q11: What type of computing platform do you use most often for your data science projects?
* Q12: Which types of specialized hardware do you use on a regular basis?
* Q13: Approximately how many times have you used a TPU (tensor processing unit)?
* Q14: What data visualization libraries or tools do you use on a regular basis?
* Q15: For how many years have you used machine learning methods?
* Q16: Which of the following machine learning frameworks do you use on a regular basis?
* Q17: Which of the following ML algorithms do you use on a regular basis?
* Q18: Which categories of computer vision methods do you use on a regular basis?
* Q19: Which of the following natural language processing (NLP) methods do you use on a regular basis?
* Q20: What is the size of the company where you are employed?
* Q21: Approximately how many individuals are responsible for data science workloads at your place of business?
* Q22: Does your current employer incorporate machine learning methods into their business?
* Q23: Select any activities that make up an important part of your role at work:
* Q24: What is your current yearly compensation ( approximate USD)?
* Q25: Approximately how much money have you (or your team) spent on machine learning and/or cloud computing services at home (or at work) in the past 5 years ( approximate USD )?
* Q26: Which of the following cloud computing platforms do you use on a regular basis?
* Q27: Do you use any of the following cloud computing products on a regular basis?
* Q28: Do you use any of the following machine learning products on a regular basis?
* Q29: Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis?
* Q30: Which of the following big data products (relational database, data warehouse, data lake, or similar) do you use most often?
* Q31: Which of the following business intelligence tools do you use on a regular basis?
* Q32: Which of the following business intelligence tools do you use most often?
* Q33: Do you use any automated machine learning tools (or partial AutoML tools) on a regular basis?
* Q34: Which of the following automated machine learning tools (or partial AutoML tools) do you use on a regular basis?
* Q35: Do you use any tools to help manage machine learning experiments?
* Q36: Where do you publicly share or deploy your data analysis or machine learning applications?
* Q37: On which platforms have you begun or completed data science courses?
* Q38: What is the primary tool that you use at work or school to analyze data?
* Q39: Who/what are your favorite media sources that report on data science topics?

# Career-related questions

## Q1: What is your age (# years)?

In [None]:
plt.figure(figsize=(9, 9))
sns.countplot(y = "Q1", data=df)

We see that:
- Most Kagglers here are very young: about 85 responses come from people aged 24 or younger
- There is nobody aged 50 or older
- The 45-49 group has fewer than 5 people, which is not many, but is also an encouragement: it is never too late to learn!
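
To read exact numbers instead of estimating them from the bars, the counts can be tabulated directly. This is a minimal sketch on made-up ages; on the real data the series would be `df["Q1"]`. The bracket labels sort correctly as plain strings, so sorting the index is enough to get them in age order.

```python
import pandas as pd

# Made-up ages for illustration; the real values come from df["Q1"].
ages = pd.Series(
    ["22-24", "18-21", "18-21", "25-29", "45-49", "22-24", "18-21"], name="Q1"
)

# Count each bracket and put the brackets in age order.
age_counts = ages.value_counts().sort_index()

# Share of respondents in the two youngest brackets.
young_share = ages.isin(["18-21", "22-24"]).mean()

print(age_counts)
print(f"Share aged 18-24: {young_share:.0%}")
```

Passing `order=age_counts.index` to `sns.countplot` would draw the bars in the same order.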

## Q2: What is your gender?

In [None]:
plt.figure(figsize=(9, 9))
sns.countplot(y = "Q2", data=df)

We see that there are only about 20 women among the respondents. How interesting! What about the relationship between age and gender?

In [None]:
sns.countplot(y = "Q1", hue="Q2", data=df)

We see that:
- Most of the women are also very young, undergraduate students or recent graduates.
- There is one woman in the 35-39 age group! What a fascinating finding!

## Q4: What is the highest level of formal education that you have attained or plan to attain within the next 2 years?

In [None]:
sns.countplot(y = "Q4", data=df)

Since this question also covers degrees people plan to attain within the next 2 years, we see that:
- Most people have, or plan to get, a Bachelor's or Master's degree
- About 5 people have no formal education past high school
- About 5 people each have a Doctoral or a Professional degree
- About 50 people have or are still studying for a Bachelor's degree (including me!)

In [None]:
plt.figure(figsize=(15, 15))
sns.countplot(y = "Q4", hue="Q2", data=df)

We see interesting findings: 
- More women have a Master's degree than a Bachelor's
- There is 1 woman with a doctoral degree!
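
Small groups like these are hard to read from a grouped bar chart, so a cross-tabulation can back up the plot with exact counts. The sketch below uses made-up rows and simplified degree labels; on the real data it would be `pd.crosstab(df["Q4"], df["Q2"])`.

```python
import pandas as pd

# Made-up rows with simplified labels, for illustration only.
toy = pd.DataFrame({
    "Q2": ["Man", "Woman", "Man", "Woman", "Man", "Woman"],
    "Q4": ["Bachelor", "Master", "Bachelor", "Master", "Master", "Doctoral"],
})

# Rows are education levels, columns are genders, cells are respondent counts.
table = pd.crosstab(toy["Q4"], toy["Q2"])
print(table)
```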

## Q5: Select the title most similar to your current role (or most recent title if retired)

In [None]:
plt.figure(figsize=(15, 15))
sns.countplot(y = "Q5", data=df)

In [None]:
df["Q5"].unique()

We see that:
- Wow, there are a lot of students (including me!)
- A lot of people are in data-related fields: ML Engineer, Data Scientist, Analyst, Research Scientist, etc.

Now I know that if I meet these people in competitions, I am competing with giants of the field.

In [None]:
plt.figure(figsize=(15, 15))
sns.countplot(y = "Q5", hue="Q2", data=df)

We see that:
- Most women Kagglers are experts in the field (Analyst, Research Scientist). WOW!
- A large proportion of men are ML Engineers, Students and SWEs


## Q6: For how many years have you been writing code and/or programming?

In [None]:
sns.countplot(y = "Q6", data=df)

We see that:
- Most people have been writing code for 1-5 years, which fits students and/or recent graduates
- There is a person who has been writing code for 20+ years! WOW!

In [None]:
df[df["Q6"] == "20+ years"]

In [None]:
# PairGrid alone draws empty axes, so show the Q4 x Q6 counts as a heatmap instead
sns.heatmap(pd.crosstab(df["Q4"], df["Q6"]), annot=True, fmt="d")

## Q7: What programming languages do you use on a regular basis?

In [None]:
def count_multiplechoices(df, question, name):
    """Helper function to count multiple-choice answers
    
    Arguments:
        df: pandas DataFrame. The main DF
        question: string. The question prefix, for example: "Q7", "Q8"
        name: string. The name for the answer column of the returned DF.
    Returns:
        return_df: pandas DataFrame. Count of each choice.
        Q_choices: list. List of columns that belong to the question.
    """
    # Match on the "Q7_" prefix: a plain `"Q1" in col` would also match "Q14_Part_1"
    Q_choices = [Q for Q in df.columns if Q.startswith(question + "_")]
    answers = []
    answers_count = []
    for col in Q_choices:
        cur_col = df[col].dropna()
        # Skip choices nobody selected (`if cur_col.unique():` is ambiguous for arrays)
        if len(cur_col) > 0:
            answers.append(cur_col.unique()[0])
            answers_count.append(len(cur_col))
    return_df = pd.DataFrame({name: answers, "count": answers_count})
    return return_df, Q_choices

In [None]:
Q7_df, Q7_cols = count_multiplechoices(df, "Q7", "language")
Q7_df
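
The helper relies on how the survey stores multiple-choice answers: each choice gets its own column (`Q7_Part_1`, `Q7_Part_2`, ...), which holds the choice label if the respondent ticked it and NaN otherwise. A minimal sketch of the same counting idea on made-up rows:

```python
import pandas as pd

# Made-up answers; column names follow the survey's "Q7_Part_N" pattern.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None, "Python"],
    "Q7_Part_2": [None, "R", None, None],
    "Q7_Part_3": ["SQL", None, "SQL", None],
})

# Each column holds a single label or NaN, so the first non-null value is
# the label and count() is the number of respondents who ticked it.
labels = toy.apply(lambda col: col.dropna().iloc[0])
counts = toy.count()

summary = pd.DataFrame({"language": labels.values, "count": counts.values})
print(summary)
```

The sketch assumes every column has at least one non-null value; the helper skips columns that nobody selected.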

In [None]:
sns.barplot(y = "language", x ="count", data=Q7_df)

We see that:
- Python, R and SQL are the most popular languages
- C/C++ and Java are still very popular
- MATLAB is still in use (in Vietnam most people I know believe that MATLAB is very expensive)
- JavaScript is also a common one. Maybe because of TFJS?

To find out how many languages each person knows, I have a hypothesis:
- A lot of people will know 1-3 languages, because Python, C/C++, Java and SQL are commonly taught

In [None]:
def count_choices(df, cols):
    return df[cols].count(axis=1).reset_index(name="count")

In [None]:
sns.countplot(y = "count", data=count_choices(df, Q7_cols))

We see that:
- Most people use 1-4 languages
- A few people use 5+ languages! I wonder what projects they are building that require so many languages?
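
To put a number on "most people", the per-respondent counts can be summarized directly. This is a sketch on made-up rows; on the real data `toy` would be `df[Q7_cols]`.

```python
import pandas as pd

# Made-up answers; the last respondent ticked nothing.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None, "Python", None],
    "Q7_Part_2": [None, "R", None, None, None],
    "Q7_Part_3": ["SQL", "SQL", "SQL", None, None],
})

per_person = toy.count(axis=1)                      # languages ticked by each respondent
distribution = per_person.value_counts().sort_index()
share_one_to_four = per_person.between(1, 4).mean()  # bounds are inclusive

print(distribution)
print(f"Share using 1-4 languages: {share_one_to_four:.0%}")
```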

## Q8: What programming language would you recommend an aspiring data scientist to learn first?

In [None]:
sns.countplot(y = "Q8", data=df)

We see that most people choose R and Python. I personally agree, because they are the main ML/DL languages.

# In-depth ML-related questions

## Q14: What data visualization libraries or tools do you use on a regular basis?


In [None]:
Q14_df, Q14_cols = count_multiplechoices(df, "Q14", "vis_lib")
sns.barplot(y = "vis_lib", x ="count", data=Q14_df)

We see that:
- Matplotlib is still the most popular, then comes seaborn (which is built on it)
- Plotly and ggplot are very nice tools too (Plotly is interactive), I guess that's why they are also common.

## Q15: For how many years have you used machine learning methods?

In [None]:
sns.countplot(y = "Q15", data=df)

We see that:
- Since most of the Kagglers are students, I guess the "Under 1 year" choice is reasonable
- "I do not use ML methods": this one I don't understand
- Some people have been using ML methods for 5-10 years, well before universities here started offering Data Science paths!


## Q16: Which of the following machine learning frameworks do you use on a regular basis?

In [None]:
Q16_df, Q16_cols = count_multiplechoices(df, "Q16", "framework")
sns.barplot(y = "framework", x ="count", data=Q16_df)

We see:
- Scikit-learn is still very popular. It covers the basic ML algorithms, which I think people try first. It can help with Feature Engineering too.
- TF, Keras, then PyTorch are next in the list. Deep learning is very popular
- Gradient-boosted tree libraries like LightGBM, CatBoost, XGBoost are not that popular. Maybe a known drawback of tree-based models plays a part (they cannot extrapolate predictions beyond the range of the training targets)
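
The extrapolation point can be illustrated with a toy example. The code below is a hand-written 1-D regression tree in pure Python, written only for this illustration; it is not how LightGBM, CatBoost or XGBoost are implemented, and the names `fit_tree` and `predict` are my own. It is trained on y = 2x for x from 0 to 9 and then asked about x = 100.

```python
def fit_tree(points, depth):
    """Fit a tiny 1-D regression tree on (x, y) pairs; leaves store the mean y."""
    ys = [y for _, y in points]
    xs = sorted({x for x, _ in points})
    if depth == 0 or len(xs) < 2:
        return {"value": sum(ys) / len(ys)}

    def sse(group):
        # Sum of squared errors around the group's mean
        mean = sum(y for _, y in group) / len(group)
        return sum((y - mean) ** 2 for _, y in group)

    # Try every split between distinct x values; keep the lowest total error.
    best_threshold = min(
        xs[1:],
        key=lambda t: sse([p for p in points if p[0] < t])
        + sse([p for p in points if p[0] >= t]),
    )
    left = [p for p in points if p[0] < best_threshold]
    right = [p for p in points if p[0] >= best_threshold]
    return {
        "threshold": best_threshold,
        "left": fit_tree(left, depth - 1),
        "right": fit_tree(right, depth - 1),
    }


def predict(tree, x):
    # Walk down the tree until a leaf is reached.
    while "value" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["value"]


train = [(x, 2 * x) for x in range(10)]  # y = 2x, seen only for x in 0..9
tree = fit_tree(train, depth=4)

print(predict(tree, 9))    # inside the training range
print(predict(tree, 100))  # far outside: lands in the same leaf as x = 9
```

A leaf can only return an average of training targets, so the prediction for x = 100 can never exceed the largest y seen in training, while a linear model would continue the trend.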

In [None]:
sns.countplot(y = "count", data=count_choices(df, Q16_cols))

We see:
- Most people use 1-5 frameworks. I wonder what the reason behind `0` frameworks is.
- At least one person uses 14 frameworks!
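
One possible explanation for the `0` group, which I have not verified against the survey's skip logic: anyone who skipped Q16, or was never shown it, has NaN in every Q16 column and therefore counts as 0. A sketch of how to check this on made-up rows (the Q15 labels are simplified), by looking at the Q15 answers of the zero-framework respondents:

```python
import pandas as pd

# Made-up rows. Q15 is ML experience; Q16_Part_* are framework choices.
toy = pd.DataFrame({
    "Q15": ["1-2 years", None, "I do not use machine learning methods"],
    "Q16_Part_1": ["Scikit-learn", None, None],
    "Q16_Part_2": ["TensorFlow", None, None],
})

q16_cols = [c for c in toy.columns if c.startswith("Q16_")]
n_frameworks = toy[q16_cols].count(axis=1)

# Respondents with no framework ticked, and what they answered for Q15.
zero = toy[n_frameworks == 0]
print(zero["Q15"].value_counts(dropna=False))
```

If most of the zero-framework rows have a missing or "do not use" answer for Q15 in the real data, the `0` bar is mostly people who never answered Q16 rather than ML users with no framework.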

## Q17: Which of the following ML algorithms do you use on a regular basis?

In [None]:
Q17_df, Q17_cols = count_multiplechoices(df, "Q17", "algorithm")
sns.barplot(y = "algorithm", x ="count", data=Q17_df)

We see that:
- Logistic and Linear Regression are the most popular. Since they are the first thing we try on any problem, I understand their popularity.
- Then come Convolutional Neural Nets. Image-related tasks are very common nowadays.
- Intuitively, if Linear/Logistic Regression fails I will use tree-based methods and/or a Bayesian approach before DL algorithms. I think that explains why Decision Tree/RF comes next.
- Very deep models like BERT and GPT-3 are not popular. Computational cost is a problem, I guess.


## Q18: Which categories of computer vision methods do you use on a regular basis?

In [None]:
Q18_df, Q18_cols = count_multiplechoices(df, "Q18", "cv_method")
sns.barplot(y = "cv_method", x ="count", data=Q18_df)

We see that:
- Image classification and Object detection are the most common. 
- Image segmentation and GANs come next. I wonder if GANs are used to augment the data?

# Work in progress