
# MVP: Data Analysis and Best Practices
## Global AI Job Market & Salary Trends 2025 EDA
### A Comprehensive Analysis of Global Employment Trends in Artificial Intelligence

**Name:** Ricardo Cortez Xavier

**Student ID:** 4052025000912

#### Dataset Overview
This dataset describes the artificial intelligence job market through 15,000 real job postings collected from globally accessible employment platforms. It records job roles, salaries, and market trends, segmented by country, experience level, and company size, which supports a comparative view of global employment trends in artificial intelligence.

[AI Job Market Analysis Dataset 2025](https://www.kaggle.com/datasets/bismasajjad/global-ai-job-market-and-salary-trends-2025/data). Retrieved from Kaggle.com

##### Data Collection Methodology
The data was collected and curated by the dataset author through ethical web scraping from major job platforms, including:
- AngelList
- Company career pages
- Glassdoor
- Indeed
- LinkedIn Jobs
- Stack Overflow Jobs

##### Columns Description
| Column | Description | Type |
| ------ | ----------- | ---- |
| job_id | Unique identifier for each job posting | String |
| job_title | Standardized job title | String |
| salary_usd | Annual salary in USD | Integer |
| salary_currency | Original salary currency | String |
| salary_local | Salary in local currency | Float |
| experience_level | EN (Entry), MI (Mid), SE (Senior), EX (Executive) | String |
| employment_type | FT (Full-time), PT (Part-time), CT (Contract), FL (Freelance) | String |
| company_location | Country where company is located | String |
| company_size | S (Small <50), M (Medium 50-250), L (Large >250) | String |
| employee_residence | Country where employee resides | String |
| remote_ratio | 0 (No remote), 50 (Hybrid), 100 (Fully remote) | Integer |
| required_skills | Top 5 required skills (comma-separated) | String |
| education_required | Minimum education requirement | String |
| years_experience | Required years of experience | Integer |
| industry | Industry sector of the company | String |
| posting_date | Date when job was posted | Date |
| application_deadline | Application deadline | Date |
| job_description_length | Character count of job description | Integer |
| benefits_score | Numerical score of benefits package (1-10) | Float |

#### Importing Libraries
This section lists all the library imports required for analysis and visualization.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

#### Loading the Data
This section loads the data and displays the first five rows of the resulting dataframe.

In [2]:
df_ai_job = pd.read_csv('https://raw.githubusercontent.com/ricardocx/PUC-RIO/refs/heads/main/ai_job_dataset.csv')
df_ai_job.head()

| | job_id | job_title | salary_usd | salary_currency | experience_level | employment_type | company_location | company_size | employee_residence | remote_ratio | required_skills | education_required | years_experience | industry | posting_date | application_deadline | job_description_length | benefits_score | company_name |
| - | ------ | --------- | ---------- | --------------- | ---------------- | --------------- | ---------------- | ------------ | ------------------ | ------------ | --------------- | ------------------ | ---------------- | -------- | ------------ | -------------------- | ---------------------- | -------------- | ------------ |
| 0 | AI00001 | AI Research Scientist | 90376 | USD | SE | CT | China | M | China | 50 | Tableau, PyTorch, Kubernetes, Linux, NLP | Bachelor | 9 | Automotive | 2024-10-18 | 2024-11-07 | 1076 | 5.9 | Smart Analytics |
| 1 | AI00002 | AI Software Engineer | 61895 | USD | EN | CT | Canada | M | Ireland | 100 | Deep Learning, AWS, Mathematics, Python, Docker | Master | 1 | Media | 2024-11-20 | 2025-01-11 | 1268 | 5.2 | TechCorp Inc |
| 2 | AI00003 | AI Specialist | 152626 | USD | MI | FL | Switzerland | L | South Korea | 0 | Kubernetes, Deep Learning, Java, Hadoop, NLP | Associate | 2 | Education | 2025-03-18 | 2025-04-07 | 1974 | 9.4 | Autonomous Tech |
| 3 | AI00004 | NLP Engineer | 80215 | USD | SE | FL | India | M | India | 50 | Scala, SQL, Linux, Python | PhD | 7 | Consulting | 2024-12-23 | 2025-02-24 | 1345 | 8.6 | Future Systems |
| 4 | AI00005 | AI Consultant | 54624 | EUR | EN | PT | France | S | Singapore | 100 | MLOps, Java, Tableau, Python | Master | 0 | Media | 2025-04-15 | 2025-06-23 | 1989 | 6.6 | Advanced Robotics |


#### Data Analysis
This section analyzes the dataset column by column using charts.\
Hover over a chart to see more details.

In [3]:
print("--- Dataset Info ---")
# info() prints its report directly and returns None, so it is not wrapped in print()
df_ai_job.info()

print("--- Note: There are no Missing Values in the data ---")

--- Dataset Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   job_id                  15000 non-null  object 
 1   job_title               15000 non-null  object 
 2   salary_usd              15000 non-null  int64  
 3   salary_currency         15000 non-null  object 
 4   experience_level        15000 non-null  object 
 5   employment_type         15000 non-null  object 
 6   company_location        15000 non-null  object 
 7   company_size            15000 non-null  object 
 8   employee_residence      15000 non-null  object 
 9   remote_ratio            15000 non-null  int64  
 10  required_skills         15000 non-null  object 
 11  education_required      15000 non-null  object 
 12  years_experience        15000 non-null  int64  
 13  industry                15000 non-null  object 
 14  posting_date     
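
The note in the cell above is printed as a fixed string. A minimal sketch of how the same claim could be checked programmatically is shown here; it uses a small made-up frame (`toy`, with one gap added on purpose) rather than `df_ai_job`, so the values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df_ai_job (hypothetical values, one gap added on purpose)
toy = pd.DataFrame({
    "job_id": ["AI00001", "AI00002", "AI00003"],
    "salary_usd": [90376, None, 152626],
})

# Count missing values per column, then total them across the frame
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(f"Total missing values: {total_missing}")
```

Running the same two lines against `df_ai_job` would report a total of zero if the dataset really has no missing values.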

##### Correlation

In [4]:
# Plotting correlation graph of numerical variables

numerical_columns = df_ai_job.select_dtypes(include=['number']).columns

correlation_matrix = df_ai_job[numerical_columns].corr(method='pearson').round(2)

# Keep the diagonal and lower triangle; blank out the mirrored upper triangle
lower_triangle = np.tril(np.ones(correlation_matrix.shape, dtype=bool))
correlation_matrix = correlation_matrix.where(lower_triangle)

labels = ['Salary (USD)', 'Remote Ratio', 'Years Experience', 'Job Description Length', 'Benefits Score']

fig = px.imshow(correlation_matrix, zmin=0, zmax=1, origin="upper",
                labels=dict(x="Variable", y="Variable", color="Correlation"),
                x=labels,
                y=labels,
                text_auto=True,
                color_continuous_scale='greens',
                color_continuous_midpoint=0,
                title='Correlation Matrix of Numerical Variables')

fig.update_xaxes(tickangle=-45, tickfont=dict(size=12))

fig.update_yaxes(tickangle=0, tickfont=dict(size=12))

fig.update_layout(xaxis_title='', yaxis_title='', width=600, plot_bgcolor="rgba(0,0,0,0)")
fig.update_traces(xgap=1, ygap=1, hoverongaps=False,
                  hovertemplate="X: %{x}<br>Y: %{y}<br>Correlation: %{z}<extra></extra>")

fig.show()

print('--- The chart shows that Years Experience has a good correlation with Salary (USD). ---')

--- The chart shows that Years Experience has a good correlation with Salary (USD). ---
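
The heatmap conveys the correlation by color, which can be hard to read precisely. The sketch below, built on made-up numbers rather than the dataset, shows one way to mask the duplicated upper triangle and to read a single Pearson coefficient by column name. The frame `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy numeric data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "years_experience": [0, 2, 5, 9, 15],
    "salary_usd": [50000, 65000, 90000, 120000, 170000],
    "remote_ratio": [100, 0, 50, 100, 0],
})

corr = toy.corr(method="pearson").round(2)

# Keep the diagonal and lower triangle; the upper triangle only mirrors it
lower_mask = np.tril(np.ones(corr.shape, dtype=bool))
corr_lower = corr.where(lower_mask)

# Read one coefficient by name instead of estimating it from the colors
r = corr.loc["years_experience", "salary_usd"]

print(corr_lower)
print(f"Pearson r (years_experience vs salary_usd): {r}")
```

Printing the coefficient alongside the chart also makes negative values visible, which a color scale starting at 0 would otherwise flatten into its lowest color.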


##### Defining Functions to generate graphs

In [5]:
# Defining Functions

# Function to generate a bar graph
def plot_bar(df, x, y, title, labels, color, orientation, hovertemplate,
             color_discrete_sequence=px.colors.qualitative.Dark24, height=600,
             legend_traceorder='normal'):
    fig = px.bar(df, x=x, y=y, title=title,
                 labels=labels, color=color, orientation=orientation,
                 color_discrete_sequence=color_discrete_sequence)
    fig.update_layout(height=height)
    fig.update_traces(marker_line_color='black', marker_line_width=1.5, opacity=0.8,
                      hovertemplate=hovertemplate)
    fig.update_layout(legend_traceorder=legend_traceorder)
    fig.show()

# Function to generate a histogram graph with categorical order
def plot_histogram_category_orders(df, x, title, color, category_orders,
              labels, xaxis_title, yaxis_title,
              ticktext, tickvals, legend_labels,
              hovertemplate='<b>%{x}</b><br>Count=%{y}<extra></extra>',
              color_discrete_sequence=px.colors.qualitative.Pastel2,
              legend_traceorder='reversed'):
    fig = px.histogram(df, x=x, title=title,
                       color=color, color_discrete_sequence=color_discrete_sequence,
                       category_orders=category_orders,
                       labels=labels)
    fig.update_layout(xaxis_title=xaxis_title, yaxis_title=yaxis_title)
    fig.update_traces(marker_line_color='black', marker_line_width=1.5, opacity=0.7,
                      hovertemplate=hovertemplate)
    fig.update_xaxes(ticktext=ticktext, tickvals=tickvals)
    for trace in fig.data:
        if trace.name in legend_labels:
            trace.name = legend_labels[trace.name]

    fig.update_layout(legend_traceorder=legend_traceorder)
    fig.show()

# Function to generate a pie chart that shows label and percentage at the same time
def plot_pie_special(labels, values, title):
    fig = go.Figure(data=[go.Pie(
        labels=labels,
        values=values,
        textinfo='label', textposition='auto',
        texttemplate='%{label}', showlegend=False,
        marker=dict(
            colors=px.colors.qualitative.Set2,
            line=dict(color='rgba(0,0,0,0)', width=0)
        )
    )])

    fig.update_traces(
        textposition='outside', textfont_size=14
    )

    fig.add_trace(go.Pie(
        labels=labels,
        values=values,
        textinfo='percent',
        hovertemplate='<b>%{label}</b><br>Count: %{value}<br><extra></extra>',
        textposition='inside', textfont_size=14,
        showlegend=False,
        marker=dict(
            colors=px.colors.qualitative.Set2,
            line=dict(color='black', width=0)
        )
    ))

    fig.update_traces(
        textfont_size=14,
        pull=[0.01, 0.01, 0.01, 0.01]
    )

    fig.update_layout(title=title, height=400, width=600)
    fig.show()

# Function to generate a box plot, optionally relabeling the y-axis ticks and legend entries
def plot_box_(df, x, y, title, labels, color, category_orders,
              ticktext=None, tickvals=None, legend_labels=None,
              with_tick=False, legend_traceorder="normal",
              color_discrete_sequence=px.colors.qualitative.Dark24, height=400):
    fig = px.box(df, x=x, y=y, title=title,
                 labels=labels, color=color, category_orders=category_orders,
                 color_discrete_sequence=color_discrete_sequence)
    if with_tick:
        # Replace the raw category codes with readable labels on the axis and legend
        fig.update_yaxes(ticktext=ticktext, tickvals=tickvals)
        for trace in fig.data:
            if trace.name in (legend_labels or {}):
                trace.name = legend_labels[trace.name]
        fig.update_layout(legend_traceorder=legend_traceorder)

    fig.update_layout(height=height)
    fig.show()


##### Job Title Frequency (job_title)

In [6]:
# Plotting the frequency of AI job titles

job_title_counts = df_ai_job['job_title'].value_counts().reset_index()
job_title_counts.columns = ['job_title', 'count']

plot_bar(df=job_title_counts, x='count', y='job_title',
         title='Job Title Frequency in AI Jobs',
         labels={'job_title': 'Job Title', 'count': 'Count'},
         color='job_title', orientation='h',
         hovertemplate='<b>%{y}</b><br>Count=%{x}<extra></extra>')

##### Distribution of salaries in AI jobs (salary_usd)

In [7]:
# Plotting the distribution of salaries in AI jobs

fig = px.histogram(df_ai_job, x='salary_usd', nbins=50, title='Salary Distribution in AI Jobs', marginal='box')
fig.update_layout(xaxis_title='Salary (USD)', yaxis_title='Count')
fig.update_traces(marker_color='green', marker_line_color='black', marker_line_width=1.5, opacity=0.6,
                  hovertemplate='<b>Salary (USD)=%{x}</b><br>Count=%{y}<extra></extra>')
fig.show()

print("--- Note: The salary distribution is highly skewed to the right, indicating that most salaries are below $100,000")
print("          with a few high-paying outliers. ---")

--- Note: The salary distribution is highly skewed to the right, indicating that most salaries are below $100,000
          with a few high-paying outliers. ---
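
A right skew can also be quantified instead of being judged from the histogram alone. The following sketch uses a short made-up salary list (not the dataset) to show three simple indicators: the sample skewness, the gap between mean and median, and the share of values below a threshold. The variable names and the $100,000 threshold used here are illustrative assumptions.

```python
import pandas as pd

# Hypothetical salaries with a long right tail (not taken from the dataset)
salaries = pd.Series([48000, 55000, 61000, 70000, 82000,
                      95000, 110000, 150000, 240000, 390000])

skewness = salaries.skew()                       # positive values indicate a longer right tail
mean_salary = salaries.mean()
median_salary = salaries.median()
share_below_100k = (salaries < 100_000).mean()   # fraction of postings under the threshold

print(f"Skewness: {skewness:.2f}")
print(f"Mean: {mean_salary:,.0f} | Median: {median_salary:,.0f}")
print(f"Share below $100,000: {share_below_100k:.0%}")
```

Applied to `df_ai_job['salary_usd']`, the same three numbers would back the note above with measured values.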


##### Experience Level Frequency in AI jobs (experience_level)

In [8]:
# Plotting the experience level distribution of AI jobs

# Assuming 'EN'=Entry, 'MI'=Mid, 'SE'=Senior, 'EX'=Executive
experience_levels_sorted = ['EN', 'MI', 'SE', 'EX']

plot_histogram_category_orders(df_ai_job, x='experience_level', title='Experience Level Frequency in AI Jobs',
                               color='experience_level', category_orders={'experience_level': experience_levels_sorted},
                               labels={'experience_level': 'Experience Level'},
                               xaxis_title='Experience Level', yaxis_title='Count',
                               ticktext=['Entry', 'Mid', 'Senior', 'Executive'],
                               tickvals=['EN', 'MI', 'SE', 'EX'],
                               legend_labels={'EN': 'Entry', 'MI': 'Mid', 'SE': 'Senior', 'EX': 'Executive'})

##### Employment Type Distribution in AI jobs (employment_type)

In [9]:
# Plotting the pie chart of employment type distribution in AI jobs

employment_type_counts = df_ai_job['employment_type'].value_counts()
employment_type_counts.index = employment_type_counts.index.map({
    'FL': 'Freelance',
    'CT': 'Contract',
    'PT': 'Part-time',
    'FT': 'Full-time'
})

plot_pie_special(labels=employment_type_counts.index,
                 values=employment_type_counts.values,
                 title='Employment Type Distribution in AI Jobs')

##### Company Location Frequency in AI jobs (company_location)

In [10]:
# Plotting the frequency of AI company locations

company_location_counts = df_ai_job['company_location'].value_counts().reset_index()
company_location_counts.columns = ['company_location', 'count']

plot_bar(df=company_location_counts, x='count', y='company_location',
         title='Company Location Frequency in AI jobs',
         labels={'company_location': 'Company Location', 'count': 'Count'},
         color='company_location', orientation='h',
         hovertemplate='<b>%{y}</b><br>Count=%{x}<extra></extra>')

print("--- Note: Germany and Denmark are the countries with the largest number of vacancies in AI, ahead of China and the United States. ---")

--- Note: Germany and Denmark are the countries with the largest number of vacancies in AI, ahead of China and the United States. ---


##### Company Size Frequency in AI jobs (company_size)

In [11]:
# Plotting the frequency of AI company sizes

# Assuming 'S'=Small, 'M'=Medium, 'L'=Large
company_size_sorted = ['S', 'M', 'L']

plot_histogram_category_orders(df_ai_job, x='company_size', title='Company Size Frequency in AI jobs',
                               color='company_size', category_orders={'company_size': company_size_sorted},
                               labels={'company_size': 'Company Size'},
                               xaxis_title='Company Size', yaxis_title='Count',
                               ticktext=['Small', 'Medium', 'Large'],
                               tickvals=['S', 'M', 'L'],
                               legend_labels={'S': 'Small', 'M': 'Medium', 'L': 'Large'})


##### Employee Residence Frequency in AI jobs (employee_residence)

In [12]:
# Plotting the frequency of employee residence in AI jobs

employee_residence_counts = df_ai_job['employee_residence'].value_counts().reset_index()
employee_residence_counts.columns = ['employee_residence', 'count']

plot_bar(df=employee_residence_counts, x='count', y='employee_residence',
         title='Employee Residence Frequency in AI jobs',
         labels={'employee_residence': 'Employee Residence', 'count': 'Count'},
         color='employee_residence', orientation='h',
         hovertemplate='<b>%{y}</b><br>Count=%{x}<extra></extra>')

##### Remote Ratio Distribution in AI jobs (remote_ratio)

In [13]:
# Plotting the pie chart of remote ratio distribution in AI jobs

remote_ratio_counts = df_ai_job['remote_ratio'].value_counts()

remote_ratio_counts = remote_ratio_counts.reset_index()
remote_ratio_counts.columns = ['remote_ratio', 'count']
remote_ratio_counts['remote_ratio'] = remote_ratio_counts['remote_ratio'].astype(str)
remote_ratio_counts['remote_label'] = remote_ratio_counts['remote_ratio'].map({
    '0': 'No Remote',
    '50': 'Hybrid',
    '100': 'Fully Remote'
})

plot_pie_special(labels=remote_ratio_counts['remote_label'],
                 values=remote_ratio_counts['count'],
                 title='Remote Ratio Distribution in AI Jobs')

##### Required Skills Frequency in AI jobs (required_skills)

In [14]:
# Plotting the Frequency of AI required skills

required_skills_counts = df_ai_job['required_skills'].str.split(', ').explode().value_counts().reset_index()
required_skills_counts.columns = ['skill', 'count']

plot_bar(df=required_skills_counts, x='count', y='skill',
         title='Required Skills Frequency in AI jobs',
         labels={'skill': 'Skill', 'count': 'Count'},
         color='skill', orientation='h',
         height=700, hovertemplate='<b>%{y}</b><br>Count=%{x}<extra></extra>')

print("--- Note 1: Python and SQL are the most requested languages; surprisingly, Scala is required more often than Java and R. ---")
print("--- Note 2: The TensorFlow and PyTorch packages are frequently requested. ---")
print("--- Note 3: Knowing Kubernetes, Linux and Git opens up many possibilities. ---")

--- Note 1: Python and SQL are the most requested languages; surprisingly, Scala is required more often than Java and R. ---
--- Note 2: The TensorFlow and PyTorch packages are frequently requested. ---
--- Note 3: Knowing Kubernetes, Linux and Git opens up many possibilities. ---
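
The skill counts rely on turning a comma-separated string column into one row per skill. The sketch below illustrates that split-and-explode step on three made-up postings; the `.str.strip()` call is an extra precaution added here and is not part of the notebook cell.

```python
import pandas as pd

# Three hypothetical postings (not taken from the dataset)
toy = pd.DataFrame({
    "required_skills": [
        "Python, SQL, Docker",
        "Python, Scala",
        "SQL, Python",
    ]
})

skill_counts = (
    toy["required_skills"]
    .str.split(", ")   # each cell becomes a list of skills
    .explode()         # one row per (posting, skill) pair
    .str.strip()       # guard against stray whitespace (extra precaution)
    .value_counts()
)

print(skill_counts)
```

Because each posting contributes one row per listed skill, the counts measure how many postings mention a skill, not how many distinct skill lists exist.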


##### Required Education Frequency in AI jobs (education_required)

In [15]:
# Plotting the distribution of required education in AI jobs

education_required_sorted = ['Associate', 'Bachelor', 'Master', 'PhD']

plot_histogram_category_orders(df_ai_job, x='education_required', title='Required Education Frequency in AI jobs',
                               color='education_required', category_orders={'education_required': education_required_sorted},
                               labels={'education_required': 'Education Required'},
                               xaxis_title='Education Required', yaxis_title='Count',
                               ticktext=['Associate', 'Bachelor', 'Master', 'PhD'],
                               tickvals=['Associate', 'Bachelor', 'Master', 'PhD'],
                               legend_labels={'Associate': 'Associate', 'Bachelor': 'Bachelor', 'Master': 'Master', 'PhD': 'PhD'})

##### Years Experience Frequency in AI jobs (years_experience)

In [16]:
# Plotting the distribution of years of experience in AI jobs

bins = [0, 2, 5, 10, 20]
labels = ['Up to 1', '2-4', '5-9', '10-19']

years_experience_grouped = pd.cut(df_ai_job['years_experience'], bins=bins, labels=labels, right=False)
years_experience_counts = years_experience_grouped.value_counts().sort_index()

plot_bar(df=years_experience_counts, x=years_experience_counts.index, y=years_experience_counts.values,
         title='Years Experience Frequency in AI jobs',
         labels={'x': 'Years Experience', 'y': 'Count', 'years_experience': 'Years Experience'},
         color=years_experience_counts.index, orientation='v',
         color_discrete_sequence=px.colors.qualitative.Pastel1,
         hovertemplate='<b>%{x}</b><br>Count=%{y}<extra></extra>',
         legend_traceorder="reversed")

print("--- Note: Postings asking for up to 1 year of experience are plentiful compared with the other ranges,")
print("          especially the range from 10 years onwards. ---")

--- Note: Postings asking for up to 1 year of experience are plentiful compared with the other ranges,
          especially the range from 10 years onwards. ---
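
The bin edges `[0, 2, 5, 10, 20]` combined with `right=False` produce left-closed intervals, which is why the labels read "Up to 1", "2-4", and so on. The sketch below reuses those bins and labels on a few boundary values (chosen here for illustration) to show where each one lands.

```python
import pandas as pd

bins = [0, 2, 5, 10, 20]
labels = ['Up to 1', '2-4', '5-9', '10-19']

# Boundary values chosen to probe each interval edge
sample_years = pd.Series([0, 1, 2, 4, 5, 9, 10, 19])

# right=False makes each interval include its left edge and exclude its right edge
grouped = pd.cut(sample_years, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"years": sample_years, "group": grouped}))

# A value equal to the last edge (20) falls outside every interval and becomes NaN
outside = pd.cut(pd.Series([20]), bins=bins, labels=labels, right=False)
print(outside.isna().all())
```

If the column ever contained a value of 20 or more, it would be silently dropped from the chart, so it is worth checking the column maximum before fixing the last edge.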


##### Industry Frequency in AI jobs (industry)

In [17]:
# Plotting the distribution of industry in AI jobs

industry_counts = df_ai_job['industry'].value_counts().reset_index()
industry_counts.columns = ['industry', 'count']

plot_bar(df=industry_counts, x='count', y='industry',
         title='Industry Frequency in AI jobs',
         labels={'industry': 'Industry', 'count': 'Count'},
         color='industry', orientation='h',
         hovertemplate='<b>%{y}</b><br>Count=%{x}<extra></extra>')

##### Posting Date Frequency over time in AI jobs (posting_date)

In [18]:
# Plotting the frequency of job posting in AI jobs

df_ai_job['posting_date'] = pd.to_datetime(df_ai_job['posting_date'])
posting_date_counts = df_ai_job['posting_date'].dt.to_period('M').value_counts().sort_index()

fig = px.line(posting_date_counts, x=posting_date_counts.index.astype(str), y=posting_date_counts.values, markers=True,
             title='Posting Date frequency over time in AI Job',
             labels={'x': 'Posting Date', 'y': 'Count'}, color_discrete_sequence=px.colors.qualitative.Plotly)
fig.update_layout(xaxis_title='Posting Date', yaxis_title='Count')
fig.update_yaxes(range=[800, None])
fig.update_traces(marker_line_color='black', marker_line_width=1.5,
                  hovertemplate='<b>%{x}</b><br>Count=%{y}<extra></extra>')

fig.add_hline(y=920, line_dash="dash", line_color="blue", opacity=0.5,
              annotation_text="920", annotation_position="left")

fig.show()
print("--- Note: The graph shows that the number of AI job postings stayed above 920 per month,")
print("          except in Sep 2024 and Feb 2025, when it fell below that level. ---")

--- Note: The graph shows that the number of AI job postings stayed above 920 per month,
          except in Sep 2024 and Feb 2025, when it fell below that level. ---
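
The months that fall under the reference line can be listed directly rather than read off the chart. This sketch applies the same `to_period('M')` counting to a handful of made-up dates; the `threshold` of 2 is a hypothetical stand-in for the 920 line.

```python
import pandas as pd

# Hypothetical posting dates (not taken from the dataset)
dates = pd.to_datetime(pd.Series([
    "2024-09-03", "2024-09-21",
    "2024-10-02", "2024-10-15", "2024-10-30",
    "2024-11-11",
]))

# Collapse each date to its month, then count postings per month in calendar order
monthly_counts = dates.dt.to_period("M").value_counts().sort_index()

threshold = 2  # hypothetical cut-off, playing the role of the 920 reference line
months_below = monthly_counts[monthly_counts < threshold].index.astype(str).tolist()

print(monthly_counts)
print(f"Months below {threshold}: {months_below}")
```

Note that the first and last months in a dataset may be only partially covered, which can make them look low for reasons unrelated to the job market.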


##### Application Deadline Frequency over time in AI jobs (application_deadline)

In [19]:
# Plotting the frequency of application deadline in AI jobs

df_ai_job['application_deadline'] = pd.to_datetime(df_ai_job['application_deadline'])

application_deadline_counts = df_ai_job['application_deadline'].dt.to_period('M').value_counts().sort_index()

fig = px.line(application_deadline_counts, x=application_deadline_counts.index.astype(str), y=application_deadline_counts.values, markers=True,
             title='Application Deadline Frequency over time in AI jobs',
             labels={'x': 'Application Deadline', 'y': 'Count'}, color_discrete_sequence=px.colors.qualitative.Plotly)
fig.update_layout(xaxis_title='Application Deadline', yaxis_title='Count')
fig.update_yaxes(range=[0, None])
fig.update_traces(marker_line_color='black', marker_line_width=1.5, opacity=0.7,
                  hovertemplate='<b>%{x}</b><br>Count=%{y}<extra></extra>')
fig.show()

##### Job Description Length Frequency in AI jobs (job_description_length)

In [20]:
# Plotting the distribution of the character count of job description in AI jobs

bins = [500, 1001, 1501, 2001, 2501]
labels = ['500-1000', '1001-1500', '1501-2000', '2001-2500']

job_description_length_grouped = pd.cut(df_ai_job['job_description_length'], bins=bins, labels=labels, right=False)
job_description_length_counts = job_description_length_grouped.value_counts().sort_index()

plot_bar(df=job_description_length_counts, x=job_description_length_counts.index, y=job_description_length_counts.values,
         title='Job Description Length Frequency in AI jobs',
         labels={'x': 'Job Description Length (characters)', 'y': 'Count', 'job_description_length': 'Job Description Length'},
         color=job_description_length_counts.index, orientation='v',
         color_discrete_sequence=px.colors.qualitative.Pastel1,
         hovertemplate='<b>%{x}</b><br>Count=%{y}<extra></extra>',
         legend_traceorder="reversed")

##### Benefits Score Frequency in AI jobs (benefits_score)

In [21]:
# Plotting the frequency of benefits package score in AI jobs

bins = [5, 6, 7, 8, 9, 10, 11]
labels=['5-5.9','6-6.9', '7-7.9', '8-8.9', '9-9.9', '10']

benefits_score_grouped = pd.cut(df_ai_job['benefits_score'], bins=bins, labels=labels, right=False)
benefits_score_counts = benefits_score_grouped.value_counts().sort_index()

plot_bar(df=benefits_score_counts, x=benefits_score_counts.index, y=benefits_score_counts.values,
         title='Benefits Score Frequency in AI jobs',
         labels={'x': 'Benefits Score', 'y': 'Count', 'benefits_score': 'Benefits Score'},
         color=benefits_score_counts.index, orientation='v',
         color_discrete_sequence=px.colors.qualitative.Pastel1,
         hovertemplate='<b>%{x}</b><br>Count=%{y}<extra></extra>',
         legend_traceorder="reversed")
print("--- Note: The graph shows the frequency of the score of benefits package in AI jobs, grouped by ranges. ---")

--- Note: The graph shows the frequency of the score of benefits package in AI jobs, grouped by ranges. ---


#### Questions
This section explores some questions about the Global AI Job Market & Salary Trends 2025 dataset.

Note: The charts are ordered by median salary because the median is robust to outliers and is preferred when a distribution is asymmetric with a long tail to one side, as seen in the salary distribution graph.
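
To illustrate why the choice between mean and median matters for ordering, the sketch below ranks two made-up groups both ways. A single very high salary in one group is enough to flip the order when the mean is used. The role names and salaries are hypothetical.

```python
import pandas as pd

# Two hypothetical roles; "Role A" contains one extreme salary
toy = pd.DataFrame({
    "job_title": ["Role A", "Role A", "Role A", "Role B", "Role B", "Role B"],
    "salary_usd": [60000, 65000, 400000, 90000, 95000, 100000],
})

grouped = toy.groupby("job_title")["salary_usd"]

# Same ordering expression as the notebook uses, once with median and once with mean
by_median = grouped.median().sort_values(ascending=False).index.to_list()
by_mean = grouped.mean().sort_values(ascending=False).index.to_list()

print(f"Ordered by median: {by_median}")
print(f"Ordered by mean:   {by_mean}")
```

Here the median ranks "Role B" first because most of its salaries are higher, while the mean ranks "Role A" first because of one outlier.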

##### What is the salary distribution vs job title? (ordered by median salary)

In [22]:
# Plotting the salary distribution by job title (Ordered by median salary)

job_titles_sorted = df_ai_job.groupby("job_title")["salary_usd"].median().sort_values(ascending=False).index.to_list()

plot_box_(df_ai_job, x='salary_usd', y='job_title', color='job_title',
          title='Salary Distribution vs Job Title<br><sup>Ordered by median salary</sup>',
          labels={'salary_usd': 'Salary (USD)', 'job_title': 'Job Title'},
          category_orders={'job_title': job_titles_sorted}, height=600)

print("--- Note 1: The highest median salaries are for Data Engineer, Machine Learning Engineer and AI Specialist. ---")
print("--- Note 2: The median salary for the listed job titles from Data Engineer to Data Scientist is higher than")
print("            the median salary ($100,000) for the 15,000 jobs analyzed. ---")

--- Note 1: The highest median salaries are for Data Engineer, Machine Learning Engineer and AI Specialist. ---
--- Note 2: The median salary for the listed job titles from Data Engineer to Data Scientist is higher than
            the median salary ($100,000) for the 15,000 jobs analyzed. ---


##### What is the salary distribution vs experience level? (ordered by median salary)

In [23]:
# Plotting the salary distribution vs experience level (Ordered by median salary)

experience_levels_sorted = df_ai_job.groupby("experience_level")["salary_usd"].median().sort_values(ascending=False).index.to_list()

plot_box_(df_ai_job, x='salary_usd', y='experience_level', color='experience_level',
          title='Salary Distribution vs Experience Level<br><sup>Ordered by median salary</sup>',
          labels={'salary_usd': 'Salary (USD)', 'experience_level': 'Experience Level'},
          category_orders={'experience_level': experience_levels_sorted},
          ticktext=['Entry', 'Mid', 'Senior', 'Executive'],
          tickvals=['EN', 'MI', 'SE', 'EX'],
          legend_labels={'EN': 'Entry', 'MI': 'Mid', 'SE': 'Senior', 'EX': 'Executive'},
          with_tick=True)

print("--- Note 1: The graph shows that salary in AI jobs increases with experience level. ---")
print("--- Note 2: The highest median salaries are for Executive and Senior levels, followed by Mid and Entry levels. ---")

--- Note 1: The graph shows that salary in AI jobs increases with experience level. ---
--- Note 2: The highest median salaries are for Executive and Senior levels, followed by Mid and Entry levels. ---


##### What is the salary distribution vs employment type? (ordered by median salary)

In [24]:
# Plotting the salary distribution vs employment type (Ordered by median salary)

employment_type_sorted = df_ai_job.groupby("employment_type")["salary_usd"].median().sort_values(ascending=False).index.to_list()

plot_box_(df_ai_job, x='salary_usd', y='employment_type', color='employment_type',
          title='Salary Distribution vs Employment Type<br><sup>Ordered by median salary</sup>',
          labels={'salary_usd': 'Salary (USD)', 'employment_type': 'Employment Type'},
          category_orders={'employment_type': employment_type_sorted},
          ticktext=['Freelance', 'Contract', 'Part-time', 'Full-time'],
          tickvals=['FL', 'CT', 'PT', 'FT'],
          legend_labels={'FL': 'Freelance', 'CT': 'Contract', 'PT': 'Part-time', 'FT': 'Full-time'},
          with_tick=True)

print("--- Note 1: The graph shows that employment type does not have a significant influence on the salary distribution in AI jobs. ---")
print("--- Note 2: The highest median salaries are for Contract and Full-time employment types, followed by Freelance and Part-time. ---")

--- Note 1: The graph shows that employment type does not have a significant influence on the salary distribution in AI jobs. ---
--- Note 2: The highest median salaries are for Contract and Full-time employment types, followed by Freelance and Part-time. ---


##### What is the salary distribution vs company size? (ordered by median salary)

In [25]:
# Plotting the salary distribution vs company size (Ordered by median salary)

company_size_sorted = df_ai_job.groupby("company_size")["salary_usd"].median().sort_values(ascending=False).index.to_list()

plot_box_(df_ai_job, x='salary_usd', y='company_size', color='company_size',
          title='Salary Distribution vs Company Size<br><sup>Ordered by median salary</sup>',
          labels={'salary_usd': 'Salary (USD)', 'company_size': 'Company Size'},
          category_orders={'company_size': company_size_sorted},
          ticktext=['Small', 'Medium', 'Large'],
          tickvals=['S', 'M', 'L'],
          legend_labels={'S': 'Small', 'M': 'Medium', 'L': 'Large'},
          with_tick=True)

print("--- Note 1: The graph shows that company size has a positive influence on the wage distribution in AI jobs. ---")
print("--- Note 2: The highest median salaries are for Large companies, followed by Medium and Small companies. ---")

--- Note 1: The graph shows that company size has a positive influence on the wage distribution in AI jobs. ---
--- Note 2: The highest median salaries are for Large companies, followed by Medium and Small companies. ---


##### What is the salary distribution vs remote ratio? (ordered by median salary)

In [26]:
# Plotting the salary distribution vs remote ratio (Ordered by median salary)

# Work on an explicit copy so the new columns are not written to a view of df_ai_job
remote_ratio_sorted = df_ai_job[["remote_ratio", "salary_usd"]].copy()
remote_ratio_sorted["remote_ratio"] = remote_ratio_sorted["remote_ratio"].astype(str)
remote_ratio_sorted['remote_label'] = remote_ratio_sorted['remote_ratio'].map({
    '0': 'No Remote',
    '50': 'Hybrid',
    '100': 'Fully Remote'
})

remote_ratio_sorted_index = remote_ratio_sorted.groupby("remote_label")["salary_usd"].median().sort_values(ascending=False).reset_index()

plot_box_(remote_ratio_sorted, x='salary_usd', y='remote_label', color='remote_label',
          title='Salary Distribution vs Remote Ratio<br><sup>Ordered by median salary</sup>',
          labels={'salary_usd': 'Salary (USD)', 'remote_label': 'Remote Ratio'},
          category_orders={'remote_label': remote_ratio_sorted_index['remote_label'].unique()},
          ticktext=['No Remote', 'Hybrid', 'Fully Remote'],
          tickvals=['No Remote', 'Hybrid', 'Fully Remote'],
          legend_labels={'0': 'No Remote', '50': 'Hybrid', '100': 'Fully Remote'},
          with_tick=True)

print("--- Note: The graph shows that remote ratio does not have a significant influence on the salary distribution in AI jobs. ---")

--- Note: The graph shows that remote ratio does not have a significant influence on the salary distribution in AI jobs. ---
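
One way to put a number on "no significant influence" is to compare the group medians directly. The sketch below assumes a small made-up frame (not the Kaggle data) that reuses the `remote_label` and `salary_usd` column names; the helper `median_spread` is hypothetical and not part of the notebook.

```python
import pandas as pd

def median_spread(df, group_col, value_col):
    """Return per-group medians and their range relative to the overall median."""
    medians = df.groupby(group_col)[value_col].median()
    spread = (medians.max() - medians.min()) / df[value_col].median()
    return medians, spread

# Synthetic illustration data, not taken from the dataset
demo = pd.DataFrame({
    "remote_label": ["No Remote", "No Remote", "Hybrid", "Hybrid", "Fully Remote", "Fully Remote"],
    "salary_usd": [100_000, 110_000, 102_000, 108_000, 101_000, 111_000],
})

medians, spread = median_spread(demo, "remote_label", "salary_usd")
print(medians)
print(f"Relative spread of medians: {spread:.1%}")
```

A small relative spread supports the reading of the box plot, although it is a descriptive check rather than a statistical test.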


##### What is the salary distribution vs education required? (ordered by median salary)

In [27]:
# Plotting the education required vs salary distribution (Ordered by median salary)

education_required_sorted = df_ai_job.groupby("education_required")["salary_usd"].median().sort_values(ascending=False).index.to_list()

plot_box_(df_ai_job, x='salary_usd', y='education_required', color='education_required',
          title='Salary Distribution vs Education Required<br><sup>Ordered by median salary</sup>',
          labels={'salary_usd': 'Salary (USD)', 'education_required': 'Education Required'},
          category_orders={'education_required': education_required_sorted},
          ticktext=['Associate', 'Bachelor', 'Master', 'PhD'],
          tickvals=['Associate', 'Bachelor', 'Master', 'PhD'],
          legend_labels={'Associate': 'Associate', 'Bachelor': 'Bachelor', 'Master': 'Master', 'PhD': 'PhD'},
          with_tick=True)

print("--- Note: Interestingly, the graph shows that the highest-paying education requirement is a master's degree. ---")

--- Note: Interestingly, the graph shows that the highest-paying education requirement is a master's degree. ---


##### What is the salary distribution vs years experience? (ordered by median salary)

In [28]:
# Plotting the years experience vs salary distribution (Ordered by median salary)

# Work on a copy so adding the grouped column below does not trigger SettingWithCopyWarning
years_experience_salary_usd_grouped = df_ai_job[['years_experience', 'salary_usd']].copy()
years_experience_salary_usd_grouped['years_experience_grouped'] = years_experience_salary_usd_grouped['years_experience'].apply(
    lambda x: 'Up to 1' if x < 2 else ('2-4' if x < 5 else ('5-9' if x < 10 else '10-19')))

years_experience_salary_usd_sorted = years_experience_salary_usd_grouped.groupby("years_experience_grouped")["salary_usd"].median().sort_values(ascending=False).index.to_list()

plot_box_(years_experience_salary_usd_grouped, x='salary_usd', y='years_experience_grouped', color='years_experience_grouped',
          title='Salary Distribution vs Years Experience<br><sup>Ordered by median salary</sup>',
          labels={'years_experience_grouped': 'Years Experience', 'salary_usd': 'Salary (USD)'},
          category_orders={'years_experience_grouped': years_experience_salary_usd_sorted},
          ticktext=['Up to 1', '2-4', '5-9', '10-19'],
          tickvals=['Up to 1', '2-4', '5-9', '10-19'],
          legend_labels={'Up to 1': 'Up to 1', '2-4': '2-4', '5-9': '5-9', '10-19': '10-19'},
          with_tick=True)

print("--- Note 1: The graph shows that years of experience have a positive influence on the salary distribution in AI jobs. ---")
print("--- Note 2: The highest median salaries are for the 10-19 years of experience range, followed by the 5-9, 2-4 and Up to 1 ranges. ---")

--- Note 1: The graph shows that years of experience have a positive influence on the salary distribution in AI jobs. ---
--- Note 2: The highest median salaries are for the 10-19 years of experience range, followed by the 5-9, 2-4 and Up to 1 ranges. ---
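
The nested `lambda` used for binning can also be written with `pd.cut`, which tends to be easier to read and extend. This is a minimal sketch on synthetic values, assuming the same bin edges and labels as the cell above; the open-ended last bin mirrors the `else` branch, so values above 19 also land in `'10-19'`.

```python
import pandas as pd

# Synthetic years of experience, not taken from the dataset
years = pd.Series([0, 1, 2, 4, 5, 9, 10, 19], name="years_experience")

# Left-closed bins: [0, 2), [2, 5), [5, 10), [10, inf)
bins = [0, 2, 5, 10, float("inf")]
labels = ["Up to 1", "2-4", "5-9", "10-19"]

years_grouped = pd.cut(years, bins=bins, labels=labels, right=False)
print(years_grouped.tolist())
```

Unlike the `lambda`, `pd.cut` returns an ordered categorical and leaves negative inputs as `NaN`, which may or may not be desirable depending on how clean the column is.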


##### What is the salary distribution vs industry? (ordered by max salary)

In [29]:
# Plotting the salary distribution vs industry (Ordered by max salary)

industry_sorted = df_ai_job.groupby("industry")["salary_usd"].max().sort_values(ascending=False).index.to_list()

plot_box_(df_ai_job, x='salary_usd', y='industry', color='industry',
          title='Salary Distribution vs Industry<br><sup>Ordered by max salary</sup>',
          labels={'salary_usd': 'Salary (USD)', 'industry': 'Industry'},
          category_orders={'industry': industry_sorted}, height=600)

print("--- Note: Interestingly, when industries are ordered by maximum salary, retail ranks above the tech sector. ---")

--- Note: Interestingly, when industries are ordered by maximum salary, retail ranks above the tech sector. ---
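
Ordering by maximum salary and ordering by median salary can rank industries differently, because a single very high posting is enough to move an industry to the top of a max-based ordering. The sketch below illustrates this with synthetic numbers (not the Kaggle data), reusing the `industry` and `salary_usd` column names.

```python
import pandas as pd

# Synthetic salaries, not taken from the dataset
demo = pd.DataFrame({
    "industry": ["Retail", "Retail", "Retail", "Technology", "Technology", "Technology"],
    "salary_usd": [60_000, 70_000, 300_000, 120_000, 130_000, 140_000],
})

by_industry = demo.groupby("industry")["salary_usd"]
order_by_max = by_industry.max().sort_values(ascending=False).index.to_list()
order_by_median = by_industry.median().sort_values(ascending=False).index.to_list()

print("Ordered by max:   ", order_by_max)
print("Ordered by median:", order_by_median)
```

Running both orderings on `df_ai_job` would show whether the industry ranking holds for typical salaries or only for the top postings.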


##### What are the top 5 skills valued in the top 5 paying jobs sorted by median?

In [30]:
# Plotting the top 5 skills valued in the top 5 paying jobs (sorted by median salary)

job_titles_sorted_5 = df_ai_job.groupby("job_title")["salary_usd"].median().sort_values(ascending=False).index.to_list()[:5]

df_ai_job_top5 = df_ai_job[df_ai_job['job_title'].isin(job_titles_sorted_5)][['job_title', 'salary_usd', 'required_skills']]

df_ai_job_top5['required_skills'] = df_ai_job_top5['required_skills'].str.split(', ')
df_ai_job_top5 = df_ai_job_top5.explode('required_skills')

for job_title in job_titles_sorted_5:
    skills = df_ai_job_top5[df_ai_job_top5['job_title'] == job_title]
    skills_counts = skills['required_skills'].value_counts().reset_index()
    skills_counts.columns = ['skill', 'count']
    skills_counts = skills_counts[:5]

    plot_bar(df=skills_counts, x='count', y='skill',
         title=f'[{job_title}] Required Skills Frequency in AI Jobs',
         labels={'skill': 'Skill', 'count': 'Count'},
         color='skill', orientation='h', height=300,
         hovertemplate='<b>%{y}</b><br>Count=%{x}<extra></extra>')

    print("---")

print("--- Note: The top 5 AI job titles and their top 5 required skills are shown in the graphs above. ---")

---


---


---


---


---
--- Note: The top 5 AI job titles and their top 5 required skills are shown in the graphs above. ---
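
If a table is more convenient than one chart per job title, the same split, explode and count steps can be done without a loop. This is a sketch on synthetic postings, assuming the skills are separated by `', '` as in the cell above; `top_n` is a hypothetical cutoff set to 2 only to keep the example small.

```python
import pandas as pd

# Synthetic postings, not taken from the dataset
demo = pd.DataFrame({
    "job_title": ["AI Architect", "AI Architect", "AI Architect", "ML Engineer", "ML Engineer"],
    "required_skills": ["Python, SQL", "Python, Docker", "Python, SQL, AWS", "Python, PyTorch", "PyTorch"],
})

top_n = 2  # hypothetical cutoff; the notebook uses 5

# One row per (job posting, skill)
skills = demo.assign(required_skills=demo["required_skills"].str.split(", ")).explode("required_skills")

top_skills = (
    skills.groupby("job_title")["required_skills"]
    .value_counts()                # counts sorted in descending order within each job title
    .groupby(level="job_title")
    .head(top_n)                   # keep the first top_n rows of each job title
)
print(top_skills)
```

The result is a Series indexed by job title and skill, which can be passed to `reset_index()` before plotting.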


#### Conclusion
These graphs are meant to help you plan a career in AI by showing how salaries and job counts are distributed across the dataset's columns.
Python and SQL stand out as the skills to learn first and, as you gain experience, keep in mind that larger companies tend to pay higher median salaries.

#### Suggestion
You can explore further questions to deepen the analysis, such as:\
What is the distribution of benefits scores vs company size?\
What is the distribution of employment type vs experience level?\
What is the distribution of remote ratio vs company location?
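
As a possible starting point for the employment type vs experience level question, `pd.crosstab` gives the share of each employment type within every experience level. The sketch below uses synthetic rows (not the Kaggle data) with the `experience_level` and `employment_type` codes from the columns description.

```python
import pandas as pd

# Synthetic postings, not taken from the dataset
demo = pd.DataFrame({
    "experience_level": ["EN", "EN", "MI", "SE", "SE", "SE", "EX"],
    "employment_type": ["FT", "PT", "FT", "FT", "CT", "FT", "FL"],
})

# Share of each employment type within every experience level (rows sum to 1)
share = pd.crosstab(demo["experience_level"], demo["employment_type"], normalize="index")
print(share.round(2))
```

Swapping `demo` for `df_ai_job` should give the real shares, assuming both columns are present under these names.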