# Project Overview

**Path to Becoming a Data Analyst in Vietnam: Skills, Salaries, and Career Opportunities for Career Changers**

This project explores the Vietnamese job market for Data Analyst and related roles using real-world job posting data. The analysis is designed for professionals considering a career transition into data analytics, with the goal of answering two big questions:
- What does the current job market for Data Analysts in Vietnam look like?
- How can career changers strategically prepare themselves to succeed in this field?

### Load the Dataset

In [62]:
# Importing Libraries
import ast
import pandas as pd
from datasets import load_dataset
import matplotlib.pyplot as plt
import seaborn as sns

# Loading Dataset
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Data Cleanup
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])
df['job_skills'] = df['job_skills'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else x)
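
The cleanup step above assumes `job_skills` arrives as a stringified Python list. As a minimal sketch of what that conversion does, the snippet below uses made-up sample values and a hypothetical helper named `parse_skills`; neither comes from the dataset or the original notebook.

```python
import ast

def parse_skills(raw):
    """Turn a stringified list such as "['sql', 'excel']" into a real list.

    Missing values (None here; NaN in the notebook) are passed through unchanged.
    """
    if raw is None:
        return raw
    return ast.literal_eval(raw)

# Made-up sample values, not rows from the dataset
samples = ["['sql', 'power bi', 'excel']", None]
parsed = [parse_skills(s) for s in samples]
print(parsed)  # [['sql', 'power bi', 'excel'], None]
```

Once the strings are real lists, `explode('job_skills')` can place each skill on its own row, which the later cells rely on.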

### Filter for Vietnam Data Analyst Roles

In [63]:
df_da_vn = df[(df['job_country'] == 'Vietnam') & (df['job_title_short'] == 'Data Analyst')].copy()
df_da_vn.sample(5)

|   | job_title_short | job_title | job_location | job_via | job_schedule_type | job_work_from_home | search_location | job_posted_date | job_no_degree_mention | job_health_insurance | job_country | salary_rate | salary_year_avg | salary_hour_avg | company_name | job_skills | job_type_skills |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50702 | Data Analyst | Data Analyst | Ho Chi Minh City, Vietnam | via Trabajo.org | Full-time | False | Vietnam | 2023-11-30 06:30:42 | False | False | Vietnam |  |  |  | Deliveree On-Demand Logistics (Southeast Asia) | [python, postgresql, bigquery, looker, power bi] | {'analyst_tools': ['looker', 'power bi'], 'clo... |
| 295472 | Data Analyst | [DI4] Junior/ Senior Analytics Engineer (SQL) ... | Hanoi, Hoàn Kiếm, Hanoi, Vietnam | via Vietnamnet | Full-time | False | Vietnam | 2023-05-15 17:55:28 | True | False | Vietnam |  |  |  | Ngân hàng Thương mại Cổ phần Hàng Hải Việt Nam... |  |  |
| 360235 | Data Analyst | Data Analyst SQL | Hanoi, Vietnam | via ITviec | Full-time | False | Vietnam | 2023-02-25 23:55:22 | True | False | Vietnam |  |  |  | K&G Việt Nam | [sql, power bi, excel] | {'analyst_tools': ['power bi', 'excel'], 'prog... |
| 527103 | Data Analyst | Data Analysis Intern | Vietnam | via LinkedIn Vietnam | Full-time | False | Vietnam | 2023-06-21 11:21:56 | False | False | Vietnam |  |  |  | dentsu | [excel] | {'analyst_tools': ['excel']} |
| 65607 | Data Analyst | Merchandise Data Analyst/ Junior Data Analysis | Vietnam | via LinkedIn Vietnam | Full-time | False | Vietnam | 2023-06-16 06:53:12 | False | False | Vietnam |  |  |  | Central Retail in Vietnam | [go, outlook, power bi] | {'analyst_tools': ['outlook', 'power bi'], 'pr... |


In [64]:
df_da_vn.info()

<class 'pandas.core.frame.DataFrame'>
Index: 334 entries, 3056 to 785382
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   job_title_short        334 non-null    object        
 1   job_title              334 non-null    object        
 2   job_location           334 non-null    object        
 3   job_via                334 non-null    object        
 4   job_schedule_type      334 non-null    object        
 5   job_work_from_home     334 non-null    bool          
 6   search_location        334 non-null    object        
 7   job_posted_date        334 non-null    datetime64[ns]
 8   job_no_degree_mention  334 non-null    bool          
 9   job_health_insurance   334 non-null    bool          
 10  job_country            334 non-null    object        
 11  salary_rate            12 non-null     object        
 12  salary_year_avg        11 non-null     float64       
 13  sala

# Exploratory Data Analysis  
## EDA Research Questions  

The exploratory data analysis (EDA) focuses on answering the following key questions:

### 1. Skills Gap & Demand (What to Learn)
- What are the most frequently mentioned skills in DA job postings in Vietnam?  
- How do skill requirements differ between job titles (DA vs BA vs DE vs DS)?  
- Which skills are more common in entry-level vs senior roles?

What are the most frequently mentioned skills in DA job postings in Vietnam?

In [65]:
df_da_vn.explode('job_skills')['job_skills'].value_counts().sort_values(ascending=False).head(10)

job_skills
sql         157
excel        99
python       97
power bi     83
tableau      74
r            45
sas          30
oracle       25
azure        22
aws          20
Name: count, dtype: int64
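
Raw counts are easier to interpret as a share of postings. With 334 Data Analyst postings in `df_da_vn`, the 157 SQL mentions correspond to roughly 47% of postings, assuming each posting lists a skill at most once. The sketch below shows the calculation on a small made-up frame (`toy` is hypothetical, not data from the notebook).

```python
import pandas as pd

# Toy postings, made up for illustration; None marks a posting with no parsed skills
toy = pd.DataFrame({
    "job_skills": [["sql", "excel"], ["sql", "python"], None, ["sql"]],
})

# Denominator: all postings, including those without any listed skill
n_postings = len(toy)

# One row per (posting, skill), then count how often each skill appears
skill_counts = toy.explode("job_skills")["job_skills"].value_counts()

# Share of postings that mention each skill, in percent
skill_share = (skill_counts / n_postings * 100).round(1)
print(skill_share)
```

Dividing by the number of postings rather than the number of skill mentions keeps the percentages comparable across groups of different sizes.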

How do skill requirements differ between job titles (DA vs BA vs DE vs DS)? 

In [66]:
df_vn = df[(df['job_country'] == 'Vietnam')].copy()
              
title_map = {
    **dict.fromkeys(['Data Analyst', 'Senior Data Analyst'], 'DA'),
    **dict.fromkeys(['Business Analyst'], 'BA'),
    **dict.fromkeys(['Data Engineer', 'Senior Data Engineer'], 'DE'),
    **dict.fromkeys(['Data Scientist', 'Senior Data Scientist', 'Machine Learning Engineer'], 'DS'),
}

pivot_table = (
    df_vn[df_vn['job_title_short'].isin(title_map)]
    .assign(job_group=lambda d: d['job_title_short'].map(title_map))
    .explode('job_skills')
    .groupby(['job_group', 'job_skills']).size().reset_index(name='count')
    .pivot(index='job_skills', columns='job_group', values='count')
    .fillna(0).astype(int)
    .sort_values(by=['DA','BA','DE','DS'], ascending=False)
)

pivot_table.head(10)

| job_skills | BA | DA | DE | DS |
|---|---|---|---|---|
| sql | 21 | 206 | 610 | 210 |
| python | 14 | 131 | 610 | 365 |
| excel | 26 | 113 | 24 | 39 |
| power bi | 18 | 91 | 78 | 31 |
| tableau | 5 | 91 | 98 | 60 |
| r | 3 | 66 | 63 | 122 |
| sas | 2 | 38 | 22 | 28 |
| azure | 4 | 29 | 181 | 48 |
| oracle | 6 | 28 | 116 | 18 |
| spark | 4 | 27 | 366 | 112 |
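
Because the four groups contain very different numbers of postings, raw counts mostly reflect group size. One possible refinement, sketched here on made-up data, divides each column by the number of postings in that group so the table reads as "percent of postings in the group that mention the skill". The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up postings; each row is one posting with its group and skill list
toy = pd.DataFrame({
    "job_group": ["DA", "DA", "DE", "DE", "DE", "DE"],
    "job_skills": [["sql", "excel"], ["sql"], ["sql", "spark"],
                   ["spark"], ["python"], ["sql"]],
})

# Number of postings per group, used as the denominator
postings_per_group = toy["job_group"].value_counts()

# Skill counts per group, with skills as rows and groups as columns
counts = (
    toy.explode("job_skills")
    .groupby(["job_group", "job_skills"]).size()
    .unstack("job_group", fill_value=0)
)

# Percent of postings in each group that mention each skill
share = counts.div(postings_per_group, axis="columns").mul(100).round(1)
print(share)
```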


Which skills are more common in entry-level vs senior roles?

In [67]:
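# Label postings by seniority using job_title_short (df_vn covers all Vietnam roles, not only DA).
# Titles containing 'Senior' are 'Senior'; everything else is grouped as 'Entry',
# which here means "non-senior" rather than strictly entry-level.
df_vn['job_level'] = df_vn['job_title_short'].apply(
    lambda x: 'Senior' if 'Senior' in x else 'Entry'
)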

df_vn_exploded = df_vn.explode('job_skills')

skill_by_level = (
    df_vn_exploded
    .groupby(['job_level', 'job_skills'])
    .size()
    .reset_index(name='count')
)

pivot_level = (
    skill_by_level
    .pivot(index='job_skills', columns='job_level', values='count')
    .fillna(0)
    .astype(int)
)

pivot_level['Entry_%'] = pivot_level['Entry'] / pivot_level['Entry'].sum() * 100
pivot_level['Senior_%'] = pivot_level['Senior'] / pivot_level['Senior'].sum() * 100

pivot_level['diff_%'] = pivot_level['Entry_%'] - pivot_level['Senior_%']

entry_top = (
    pivot_level
    .sort_values('diff_%', ascending=False)
    .head(10)
    .reset_index()
    .loc[:, ['job_skills', 'Entry', 'Entry_%', 'Senior', 'Senior_%', 'diff_%']]
)

senior_top = (
    pivot_level
    .sort_values('diff_%', ascending=True)
    .head(10)
    .reset_index()
    .loc[:, ['job_skills', 'Entry', 'Entry_%', 'Senior', 'Senior_%', 'diff_%']]
)

print("=== Top 10 Skills More Common in Entry-level Roles ===")
display(entry_top.sort_values('Entry', ascending=False))

print("\n=== Top 10 Skills More Common in Senior-level Roles ===")
display(senior_top.sort_values('Senior', ascending=False))

=== Top 10 Skills More Common in Entry-level Roles ===


|   | job_skills | Entry | Entry_% | Senior | Senior_% | diff_% |
|---|---|---|---|---|---|---|
| 0 | java | 413 | 4.161209 | 69 | 3.123585 | 1.037624 |
| 6 | aws | 388 | 3.90932 | 79 | 3.576279 | 0.333041 |
| 5 | azure | 257 | 2.589421 | 49 | 2.218198 | 0.371222 |
| 3 | power bi | 188 | 1.894207 | 32 | 1.448619 | 0.445587 |
| 4 | excel | 180 | 1.813602 | 31 | 1.40335 | 0.410252 |
| 9 | oracle | 158 | 1.59194 | 30 | 1.358081 | 0.233859 |
| 2 | javascript | 73 | 0.735516 | 6 | 0.271616 | 0.4639 |
| 1 | sap | 69 | 0.695214 | 1 | 0.045269 | 0.649945 |
| 8 | spring | 45 | 0.453401 | 4 | 0.181077 | 0.272323 |
| 7 | word | 41 | 0.413098 | 2 | 0.090539 | 0.32256 |



=== Top 10 Skills More Common in Senior-level Roles ===


|   | job_skills | Entry | Entry_% | Senior | Senior_% | diff_% |
|---|---|---|---|---|---|---|
| 2 | spark | 419 | 4.221662 | 113 | 5.115437 | -0.893774 |
| 0 | mongodb | 226 | 2.277078 | 88 | 3.983703 | -1.706625 |
| 4 | hadoop | 267 | 2.690176 | 71 | 3.214124 | -0.523948 |
| 1 | airflow | 195 | 1.964736 | 66 | 2.987777 | -1.023042 |
| 3 | scala | 212 | 2.13602 | 60 | 2.716161 | -0.580141 |
| 6 | go | 86 | 0.866499 | 29 | 1.312811 | -0.446312 |
| 5 | bigquery | 62 | 0.624685 | 25 | 1.131734 | -0.507049 |
| 8 | databricks | 75 | 0.755668 | 24 | 1.086464 | -0.330797 |
| 7 | c | 46 | 0.463476 | 20 | 0.905387 | -0.441911 |
| 9 | dax | 7 | 0.070529 | 8 | 0.362155 | -0.291626 |
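
The split above keys off `job_title_short`, so any title without "Senior" in that column lands in the "Entry" bucket. A hedged alternative is to scan the full `job_title` for seniority keywords. The keyword lists and the function `classify_level` below are assumptions for illustration, not part of the original analysis; the example titles are taken from the sample rows shown earlier.

```python
import re

# Hypothetical keyword lists; adjust to the titles actually present in the data
ENTRY_PATTERN = re.compile(r"\b(intern|junior|fresher|entry)\b", re.IGNORECASE)
SENIOR_PATTERN = re.compile(r"\b(senior|lead|principal|manager)\b", re.IGNORECASE)

def classify_level(title):
    """Classify a job title as Entry, Senior, Mixed, or Unspecified."""
    is_entry = bool(ENTRY_PATTERN.search(title))
    is_senior = bool(SENIOR_PATTERN.search(title))
    if is_entry and is_senior:
        return "Mixed"  # e.g. "Junior/ Senior ..." covers both levels
    if is_entry:
        return "Entry"
    if is_senior:
        return "Senior"
    return "Unspecified"

titles = [
    "Data Analysis Intern",
    "Merchandise Data Analyst/ Junior Data Analysis",
    "[DI4] Junior/ Senior Analytics Engineer (SQL) ...",
    "Data Analyst SQL",
]
levels = {t: classify_level(t) for t in titles}
print(levels)
```

Keeping an explicit "Unspecified" bucket avoids treating every unlabeled posting as entry-level.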


### 2. Salary & Career Growth (What It Pays)
- What is the average salary for DA roles in Vietnam?  
- How do salaries for DA roles differ across job titles and locations? 
- Does the presence of certain skills (e.g., Python, Power BI) correlate with higher salaries?

What is the average salary for DA roles in Vietnam?  

In [68]:
print(f"Average salary for Data Analyst roles in Vietnam: {df_da_vn['salary_year_avg'].mean():,.0f}")

Average salary for Data Analyst roles in Vietnam: 90,842
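
The `info()` output earlier lists only 11 non-null `salary_year_avg` values out of 334 postings, so a single mean can be pulled around by one or two postings. A minimal sketch, on made-up numbers, of reporting the sample size and median next to the mean:

```python
import pandas as pd

# Made-up salaries for illustration; None stands for a posting without salary data
salaries = pd.Series([50_000.0, 60_000.0, 70_000.0, 200_000.0, None])

# count ignores missing values, so it doubles as the effective sample size
summary = salaries.agg(["count", "mean", "median"])
print(summary)
```

In this toy example the mean (95,000) sits well above the median (65,000) because of one high value, which is the kind of gap worth checking in a small sample.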


How do salaries for DA roles differ across job titles and locations? (Only the location breakdown is computed here.)

In [69]:
df_salary_by_loc = (
    df_da_vn[
        (df_da_vn['job_location'] != 'Vietnam')
    ]
    .dropna(subset=['salary_year_avg'])
    .groupby('job_location')['salary_year_avg']
    .mean()
    .reset_index(name='mean')
    .sort_values('mean', ascending=False)
).copy()

df_salary_by_loc

|   | job_location | mean |
|---|---|---|
| 0 | Hanoi, Hoàn Kiếm, Hanoi, Vietnam | 111175.0 |
| 1 | Ho Chi Minh City, Vietnam | 101805.6 |


Does the presence of certain skills (e.g., Python, Power BI) correlate with higher salaries?

In [70]:
df_clean = df_da_vn.dropna(subset=['salary_year_avg']).copy()

df_exploded = df_clean.explode('job_skills')

skill_salary = (
    df_exploded
    .groupby('job_skills')['salary_year_avg']
    .mean()
    .reset_index(name='avg_salary')
    .sort_values('avg_salary', ascending=False)
)

overall_avg = df_clean['salary_year_avg'].mean()

print(f"Overall average salary: {overall_avg:,.0f}")
skill_salary.head(20)

Overall average salary: 90,842


|   | job_skills | avg_salary |
|---|---|---|
| 1 | looker | 100500.0 |
| 15 | word | 100500.0 |
| 11 | sql | 86533.375 |
| 6 | power bi | 73540.6 |
| 7 | python | 68766.0 |
| 12 | sql server | 63282.0 |
| 4 | oracle | 63282.0 |
| 0 | java | 63282.0 |
| 9 | sap | 63282.0 |
| 14 | windows | 53014.0 |
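
Several skills above share an identical average, which suggests they may come from the same one or two postings. A possible safeguard, sketched on made-up data, is to carry the number of postings behind each average and only compare skills that clear a minimum. The threshold `MIN_POSTINGS` and the frame `toy` are assumptions for illustration.

```python
import pandas as pd

# Made-up postings with salary data
toy = pd.DataFrame({
    "job_skills": [["sql", "looker"], ["sql", "python"], ["sql"]],
    "salary_year_avg": [100_000.0, 60_000.0, 80_000.0],
})

# Average salary and number of postings per skill
skill_stats = (
    toy.explode("job_skills")
    .groupby("job_skills")["salary_year_avg"]
    .agg(avg_salary="mean", n_postings="count")
    .sort_values("avg_salary", ascending=False)
)

# Hypothetical cutoff: ignore skills backed by fewer than 2 postings
MIN_POSTINGS = 2
reliable = skill_stats[skill_stats["n_postings"] >= MIN_POSTINGS]
print(skill_stats)
print(reliable)
```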


### 3. Job Market & Opportunities (Where to Apply)
- How many DA job postings appear monthly in Vietnam?
- Which cities (Hanoi, Ho Chi Minh City) dominate the hiring landscape?
- Which companies are hiring the most Data Analysts?
- Which channels list the most Data Analyst postings?

How many DA job postings appear monthly in Vietnam?

In [81]:
df_da_vn['job_posted_month'] = df_da_vn['job_posted_date'].dt.month

df_da_vn['job_posted_month'].value_counts().sort_index()

job_posted_month
1     43
2     25
3     26
4     22
5     17
6     24
7     24
8     12
9     21
10    40
11    54
12    26
Name: count, dtype: int64
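
Grouping by `dt.month` alone would merge the same month from different years if the data ever spans more than one calendar year. A minimal sketch of grouping by year and month instead, using made-up timestamps:

```python
import pandas as pd

# Made-up posting dates that straddle two calendar years
dates = pd.Series(pd.to_datetime([
    "2022-12-31 23:10:00",
    "2023-01-05 08:00:00",
    "2023-01-20 12:30:00",
    "2023-12-02 09:15:00",
]))

# Month number only: December 2022 and December 2023 are counted together
by_month_only = dates.dt.month.value_counts().sort_index()

# Year-month periods keep the two Decembers separate
by_year_month = dates.dt.to_period("M").value_counts().sort_index()

print(by_month_only)
print(by_year_month)
```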

Which cities (Hanoi, Ho Chi Minh City, etc.) dominate the hiring landscape?

In [83]:
df_loc = df_da_vn.copy()

df_loc = df_loc[~df_loc['job_location'].isin(["Vietnam", "Anywhere"])]
df_loc = df_loc[~df_loc['job_location'].str.contains("other", na=False, case=False)]

def clean_location(x):
    if pd.isna(x):
        return "Other"
    if "Ho Chi Minh" in x:
        return "Ho Chi Minh City"
    elif "Hanoi" in x:
        return "Hanoi"
    elif "Da Nang" in x:
        return "Da Nang"
    elif "Binh Duong" in x:
        return "Binh Duong"
    elif "Dong Nai" in x:
        return "Dong Nai"
    elif "Bình Định" in x:
        return "Binh Dinh"
    elif "Quảng Nam" in x:
        return "Quang Nam"
    elif "Bắc Giang" in x or "Bac Giang" in x:
        return "Bac Giang"
    elif "Thua Thien Hue" in x:
        return "Thua Thien Hue"
    else:
        return "Other"

df_loc['city'] = df_loc['job_location'].apply(clean_location)

df_city_posted = (
    df_loc['city']
    .value_counts()
    .reset_index(name='no_job_posted')
    .rename(columns={'index': 'city'})
    .sort_values('no_job_posted', ascending=False)
)

df_city_posted

|   | city | no_job_posted |
|---|---|---|
| 0 | Ho Chi Minh City | 121 |
| 1 | Hanoi | 70 |
| 2 | Da Nang | 18 |
| 3 | Binh Duong | 4 |
| 4 | Dong Nai | 2 |
| 5 | Quang Nam | 1 |
| 6 | Binh Dinh | 1 |
| 7 | Bac Giang | 1 |
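
The `if`/`elif` chain in `clean_location` can also be written as a lookup table, which makes adding a city or a spelling variant a one-line change. This is a sketch of an equivalent approach, not the notebook's own code; the name `CITY_KEYWORDS` is an assumption, and the keywords mirror those used above.

```python
# Ordered (city label, keywords) pairs; the first match wins
CITY_KEYWORDS = [
    ("Ho Chi Minh City", ("Ho Chi Minh",)),
    ("Hanoi", ("Hanoi",)),
    ("Da Nang", ("Da Nang",)),
    ("Binh Duong", ("Binh Duong",)),
    ("Dong Nai", ("Dong Nai",)),
    ("Binh Dinh", ("Bình Định",)),
    ("Quang Nam", ("Quảng Nam",)),
    ("Bac Giang", ("Bắc Giang", "Bac Giang")),
    ("Thua Thien Hue", ("Thua Thien Hue",)),
]

def clean_location(location):
    """Map a raw job_location string to a city label, or 'Other'."""
    if not isinstance(location, str):  # covers None and NaN
        return "Other"
    for city, keywords in CITY_KEYWORDS:
        if any(keyword in location for keyword in keywords):
            return city
    return "Other"

print(clean_location("Hanoi, Hoàn Kiếm, Hanoi, Vietnam"))  # Hanoi
print(clean_location("Ho Chi Minh City, Vietnam"))         # Ho Chi Minh City
```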


Which companies are hiring the most Data Analysts?

In [86]:
df_da_vn['company_name'].value_counts().sort_values(ascending=False).head(10)

company_name
Bosch Group                                        5
Ninja Van                                          5
Công Ty TNHH Bosch Global Software Technologies    4
Zalo                                               4
Gear Inc.                                          4
TELUS International                                4
Viettel Big Data Analytics Center                  4
Công ty TNHH Onpoint                               4
TELUS International AI Data Solutions              4
Ngân Hàng TMCP Quân Đội                            4
Name: count, dtype: int64

Which channels list the most Data Analyst postings?

In [89]:
df_da_vn['job_via'].value_counts().sort_values(ascending=False).head(10)

job_via
via Trabajo.org         62
via LinkedIn Vietnam    61
via CareerBuilder       31
via LinkedIn            21
via JobsGO              17
via BeBee               15
via Vn.linkedin.com     14
via Ai-Jobs.net         11
via Joboko               8
via WhatJobs             8
Name: count, dtype: int64
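
LinkedIn appears under three labels in this top-10 list ("via LinkedIn Vietnam", "via LinkedIn", "via Vn.linkedin.com"). Merging them changes the ranking. The sketch below reuses the counts printed above; the helper `normalize_channel` is hypothetical, and labels outside the top 10 are not included, so the true LinkedIn total could be higher.

```python
import re
from collections import Counter

# Counts copied from the top-10 output above
via_counts = {
    "via Trabajo.org": 62,
    "via LinkedIn Vietnam": 61,
    "via CareerBuilder": 31,
    "via LinkedIn": 21,
    "via JobsGO": 17,
    "via BeBee": 15,
    "via Vn.linkedin.com": 14,
    "via Ai-Jobs.net": 11,
    "via Joboko": 8,
    "via WhatJobs": 8,
}

def normalize_channel(label):
    """Strip the leading 'via ' and fold all LinkedIn variants into one name."""
    name = re.sub(r"^via\s+", "", label).strip()
    if "linkedin" in name.lower():
        return "LinkedIn"
    return name

merged = Counter()
for label, count in via_counts.items():
    merged[normalize_channel(label)] += count

print(merged.most_common(3))
# [('LinkedIn', 96), ('Trabajo.org', 62), ('CareerBuilder', 31)]
```

After merging, LinkedIn accounts for 96 of these postings, ahead of Trabajo.org at 62.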

### 4. Career Accessibility & Entry Barriers (Who Can Get In)
- What percentage of postings do not require a bachelor’s degree?
- What percentage of postings offer remote opportunities?
- Do jobs without degree requirements still offer competitive salaries?

What percentage of postings do not require a bachelor’s degree?

In [96]:
counts = df_da_vn['job_no_degree_mention'].value_counts(dropna=True).copy()

percent = counts / counts.sum() * 100

df_degree_req = pd.DataFrame({
    'count': counts,
    'percentage': percent.round(2)
}).reset_index()

df_degree_req

|   | job_no_degree_mention | count | percentage |
|---|---|---|---|
| 0 | False | 173 | 51.8 |
| 1 | True | 161 | 48.2 |
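
The count-and-percentage pattern is used for more than one boolean column in this section, so it could be wrapped in a small helper. The function name `share_table` and the toy input are assumptions for illustration; `value_counts(normalize=True)` returns the shares directly.

```python
import pandas as pd

def share_table(flags):
    """Return counts and percentage shares for each distinct value in a Series."""
    counts = flags.value_counts()
    percent = flags.value_counts(normalize=True).mul(100).round(2)
    return pd.DataFrame({"count": counts, "percentage": percent})

# Made-up boolean flags, not values from the dataset
toy_flags = pd.Series([True, False, False, True, False])
table = share_table(toy_flags)
print(table)
```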


What percentage of postings offer remote opportunities?

In [97]:
counts = df_da_vn['job_work_from_home'].value_counts(dropna=True).copy()

percent = counts / counts.sum() * 100

df_wfh = pd.DataFrame({
    'count': counts,
    'percentage': percent.round(2)
}).reset_index()

df_wfh

|   | job_work_from_home | count | percentage |
|---|---|---|---|
| 0 | False | 315 | 94.31 |
| 1 | True | 19 | 5.69 |


Do jobs without degree requirements still offer competitive salaries?

In [98]:
df_salary = df_da_vn.dropna(subset=['salary_year_avg'])

df_salary_comp = (
    df_salary
    .groupby('job_no_degree_mention')['salary_year_avg']
    .mean()
    .reset_index(name='avg_salary')
)

df_salary_comp['degree_requirement'] = df_salary_comp['job_no_degree_mention'].map(
    {True: 'No degree required', False: 'Degree required'}
)

df_salary_comp = df_salary_comp[['degree_requirement', 'avg_salary']]

df_salary_comp

|   | degree_requirement | avg_salary |
|---|---|---|
| 0 | Degree required | 86011.142857 |
| 1 | No degree required | 99297.25 |
