
<label>To-Do Tasks</label>
<ul>
    <input type="checkbox"> Review chart titles <br>
    <input type="checkbox"> Translate Portuguese comments into English <br>
    <input type="checkbox"> Add analysis text <br>
    <input type="checkbox"> Justify all text <br>
</ul>

***

# About Notebook
<p style='text-align: justify;'>Exploratory Data Analysis (EDA) is essential in any data analysis study because the data must be understood in depth before any modeling or inference is performed. EDA allows the researcher to get to know the characteristics of the data, such as its distribution, the correlation between variables, and the presence of outliers, among other relevant aspects. With this information in hand, it is possible to make better-informed choices about the type of model to be used, the necessary pre-processing, variable selection, and so on. </p>

<p style='text-align: justify;'>In addition, EDA can help identify data quality issues, such as inconsistencies, missing values, or measurement errors. Once detected, these issues can be corrected, improving the quality of the analysis as a whole. </p>

<p style='text-align: justify;'>The main objective of this notebook is to apply the EDA process to a dataset that records the salaries of professionals in the data career, interpreting the results and presenting them with supporting numbers.</p>

## EDA Steps
Example of steps in Exploratory Data Analysis

![EDA Steps](https://www.researchgate.net/publication/329930775/figure/fig3/AS:873046667710469@1585161954284/The-fundamental-steps-of-the-exploratory-data-analysis-process.png)

## About [ai-jobs.net](https://ai-jobs.net/)

<p style='text-align: justify;'> This site collects salary information anonymously from professionals all over the world in the AI/ML/Data Science space and makes it publicly available for anyone to use, share and play around with. </p>

<p style='text-align: justify;'> The primary goal is to have data that can provide better guidance in regards to what's being paid globally. So newbies, experienced pros, hiring managers, recruiters and also startup founders or people wanting to make a career switch can make better informed decisions. </p>

***

# Installing Packages

In [1]:
!pip install numpy
!pip install pandas
!pip install scipy
!pip install statsmodels
!pip install scikit-learn

!pip install plotly
!pip install matplotlib
!pip install nbformat

!pip install pycountry
!pip install pycountry-convert
#!pip install wordcloud



# Libraries

In [2]:
# Libraries for data manipulation
import pandas as pd
import numpy as np

# Libraries for data analysis
from scipy.stats import gaussian_kde
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from scipy.stats import ttest_ind
from scipy.stats import chi2_contingency
import statsmodels.api as sm

# Libraries for machine learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


# Libraries for data visualization
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.colors as colors

# from wordcloud import WordCloud
import matplotlib.pyplot as plt
import plotly.express as px


# External libraries
import pycountry as pyc
import pycountry_convert as pc

# Default Libraries
import itertools
import re

# Load data
The dataset is downloaded via a web request every time this notebook is executed, so the data is always up to date and the outputs shown below may change between runs.

In [3]:
dataset_link = "https://ai-jobs.net/salaries/download/salaries.csv"
df = pd.read_csv(dataset_link)
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,EN,FT,Data Quality Analyst,100000,USD,100000,NG,100,NG,L
1,2023,EN,FT,Compliance Data Analyst,30000,USD,30000,NG,100,NG,L
2,2022,MI,FT,Machine Learning Engineer,1650000,INR,20984,IN,50,IN,L
3,2023,EN,FT,Applied Scientist,204620,USD,204620,US,0,US,L
4,2023,EN,FT,Applied Scientist,110680,USD,110680,US,0,US,L


# Data Information

In [4]:
# attributes
df.columns

Index(['work_year', 'experience_level', 'employment_type', 'job_title',
       'salary', 'salary_currency', 'salary_in_usd', 'employee_residence',
       'remote_ratio', 'company_location', 'company_size'],
      dtype='object')

The attributes will be explained individually in the chapter on univariate analysis.

In [5]:
# concise summary of a dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3716 entries, 0 to 3715
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3716 non-null   int64 
 1   experience_level    3716 non-null   object
 2   employment_type     3716 non-null   object
 3   job_title           3716 non-null   object
 4   salary              3716 non-null   int64 
 5   salary_currency     3716 non-null   object
 6   salary_in_usd       3716 non-null   int64 
 7   employee_residence  3716 non-null   object
 8   remote_ratio        3716 non-null   int64 
 9   company_location    3716 non-null   object
 10  company_size        3716 non-null   object
dtypes: int64(4), object(7)
memory usage: 319.5+ KB


The dataframe contains both categorical (object) and numerical (int64) variables, so the two groups will be analyzed separately.

<div style="display: flex; justify-content: center;">
    <table>
        <thead>
            <tr>
                <th>Categorical Attributes</th>
                <th>Numerical Attributes</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>experience_level</td>
                <td>work_year</td>
            </tr>
            <tr>
                <td>employment_type</td>
                <td>salary</td>
            </tr>
            <tr>
                <td>job_title</td>
                <td>salary_in_usd</td>
            </tr>
            <tr>
                <td>employee_residence</td>
                <td>remote_ratio</td>
            </tr>
            <tr>
                <td>company_location</td>
                <td> </td>
            </tr>
            <tr>
                <td>company_size</td>
                <td> </td>
            </tr>
            <tr>
                <td>salary_currency</td>
                <td> </td>
            </tr>
        </tbody>
    </table>
</div>


In [6]:
# dimensionality of the DataFrame
df.shape

(3716, 11)

In [7]:
# checking for null values
df.isnull().sum()

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

In [8]:
# Categorical attributes [object]
categorical_df = df.loc[
    :,
    [
        "experience_level",
        "employment_type",
        "job_title",
        "salary_currency",
        "employee_residence",
        "company_location",
        "company_size",
    ],
]

# Numerical Attributes [int64]
numerical_df = df.loc[:, ["work_year", "salary", "salary_in_usd", "remote_ratio"]]

In [9]:
# descriptive statistics
numerical_df.describe()

Unnamed: 0,work_year,salary,salary_in_usd,remote_ratio
count,3716.0,3716.0,3716.0,3716.0
mean,2022.367061,191134.0,137449.668192,46.461249
std,0.692068,675146.1,63022.374313,48.589708
min,2020.0,6000.0,5132.0,0.0
25%,2022.0,100000.0,95000.0,0.0
50%,2022.0,137500.0,135000.0,0.0
75%,2023.0,180000.0,175000.0,100.0
max,2023.0,30400000.0,450000.0,100.0


Insights:
 - The database contains salary records for professionals in the data career;
 - Attributes work_year and remote_ratio are discrete: each takes only a few distinct, ordered values (2020 to 2023, and 0 to 100, respectively);
 - The database contains records dating back to 2020;
 - 75% of the recorded salaries are at most $175,000 per year;
 - Employers offer different work arrangements, from fully on-site (remote_ratio 0) to fully remote (remote_ratio 100).
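
The quartile reading used in the insights above (75% of the records lie at or below the 75th percentile) can be illustrated with a minimal sketch. The salary values are made up for illustration and are not taken from the dataset; `statistics.quantiles` with `method="inclusive"` is used on the assumption that it interpolates linearly in the same way as the default of `describe()`.

```python
from statistics import quantiles

# Hypothetical salaries in USD (illustrative only, not from the dataset)
salaries = [60_000, 90_000, 120_000, 150_000, 300_000]

# Quartile cut points: 25%, 50% (median) and 75%
q1, q2, q3 = quantiles(salaries, n=4, method="inclusive")

# Share of records at or below the 75th percentile
share_below_q3 = sum(s <= q3 for s in salaries) / len(salaries)

print(q1, q2, q3)
print(share_below_q3)
```

With such a small sample the share at or below the third quartile is only approximately 75%; on a few thousand records the approximation becomes much tighter.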

# Univariate Analysis

Describe and analyze a single variable to obtain important information about it.

## Categorical Attributes

### experience_level: The experience level in the job during the year

In [10]:
# experience levels
df["experience_level"].unique()

array(['EN', 'MI', 'SE', 'EX'], dtype=object)

 - EN: Entry-level / Junior
 - MI: Mid-level / Intermediate
 - SE: Senior-level / Expert
 - EX: Executive-level / Director

In [11]:
df["experience_level"].value_counts(normalize=True)

experience_level
SE    0.669268
MI    0.214478
EN    0.085576
EX    0.030678
Name: proportion, dtype: float64

In [12]:
data = df["experience_level"].replace(
    {"SE": "Senior", "MI": "Mid-level", "EN": "Junior", "EX": "Executive"}
)

# counts of unique values
unique_values = data.value_counts()

# frequency of values and the labels
freq = unique_values.values
labels = unique_values.index.values

# total number of records
total = sum(freq)

# Create the annotation with the total value
annotation = dict(font=dict(size=20), showarrow=False, text=str(total), x=0.5, y=0.5)

# Create the donut chart
data = [
    go.Pie(
        labels=labels,
        values=freq,
        hole=0.5,
        marker=dict(colors=list(reversed(colors.sequential.Greys))),
        textinfo="percent+label",
        insidetextorientation="auto",
        hoverinfo="label+value",
    )
]

# Create the chart layout
layout = go.Layout(
    title={
        "text": "Level of experience of professionals in the data career",
        "x": 0.5,
        # "xanchor": "center",
    },
    # width=700,
    # height=700,
    annotations=[annotation],
)

# Build the figure
fig = go.Figure(data=data, layout=layout)

# Show the chart
fig.show()

### employment_type: The type of employment for the role

In [13]:
# employment type
df["employment_type"].unique()

array(['FT', 'FL', 'PT', 'CT'], dtype=object)

 - PT: Part-time
 - FT: Full-time
 - CT: Contract
 - FL: Freelance

In [14]:
df["employment_type"].value_counts(normalize=True)

employment_type
FT    0.990581
PT    0.004575
FL    0.002691
CT    0.002153
Name: proportion, dtype: float64

Most professionals in the data field work full-time (about 99% of the records). Because the distribution is so concentrated in this modality, we will group the other hiring modalities into a single "Others" category to improve visualization.
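
The grouping idea described above, keeping the most frequent categories and pooling the rest, is reused later for job titles and currencies. A minimal standalone sketch follows; the helper name `top_n_with_others` and the sample counts are hypothetical and only illustrate the pattern.

```python
from collections import Counter


def top_n_with_others(values, n=1, others_label="Others"):
    """Keep the n most frequent categories and pool the rest into one bucket."""
    counts = Counter(values).most_common()
    grouped = dict(counts[:n])
    rest = sum(count for _, count in counts[n:])
    if rest:
        grouped[others_label] = rest
    return grouped


# Hypothetical employment types (illustrative only, not from the dataset)
sample = ["FT"] * 95 + ["PT"] * 2 + ["FL"] * 2 + ["CT"]
grouped = top_n_with_others(sample, n=1)
print(grouped)
```

Pooling the tail keeps the total unchanged, so the value shown in the middle of the donut chart still equals the number of records.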

In [15]:
data = df["employment_type"].replace(
    {"PT": "Part-time", "FT": "Full-time", "CT": "Contract", "FL": "Freelance"}
)

# counts of unique values
unique_values = data.value_counts()

top_labels = unique_values.index.values[:1].tolist()
top_values = unique_values.values[:1].tolist()

# sum of total jobs registered
total = sum(unique_values.values)

# sum of other jobs
others = sum(unique_values.values[1:])

values = top_values + [others]
labels = top_labels + ["Others"]

data_dict = dict(zip(labels, values))

# Sort the dictionary by value in descending order
data_dict_ordered = dict(
    sorted(data_dict.items(), key=lambda item: item[1], reverse=True)
)


# Create the annotation with the total value
annotation = dict(font=dict(size=20), showarrow=False, text=str(total), x=0.5, y=0.5)

# Create the donut chart
data = [
    go.Pie(
        labels=list(data_dict_ordered.keys()),
        values=list(data_dict_ordered.values()),
        hole=0.5,
        marker=dict(colors=list(reversed(colors.sequential.Greys))),
        textinfo="percent+label",
        insidetextorientation="auto",
        hoverinfo="label+value",
    )
]

# Create the chart layout
layout = go.Layout(
    title={
        "text": "Type of employment for data professionals",
        "x": 0.5,
        # "xanchor": "right",
    },
    # width=700,
    # height=900,
    annotations=[annotation],
)

# Build the figure
fig = go.Figure(data=data, layout=layout)

# Show the chart
fig.show()

### job_title: The role worked in during the year.

In [16]:
df["job_title"].value_counts()

job_title
Data Engineer                1036
Data Scientist                832
Data Analyst                  606
Machine Learning Engineer     287
Data Architect                101
                             ... 
Compliance Data Analyst         1
BI Data Engineer                1
Data DevOps Engineer            1
Staff Data Analyst              1
Finance Data Analyst            1
Name: count, Length: 91, dtype: int64

There are numerous job titles for data professionals. At this point, we could adopt a technique to categorize these titles into groups, but that will be for the feature engineering chapter. For simplicity, we will analyze only the titles with the highest occurrence.
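
As a rough idea of what such a categorization could look like, the sketch below groups titles with ordered keyword rules. The rules, the group names, and the helper `categorize_title` are assumptions made for illustration; they are not the method used later in this notebook.

```python
# Hypothetical keyword-to-group rules, checked in order (illustrative assumption)
TITLE_RULES = [
    ("machine learning", "Machine Learning"),
    ("engineer", "Data Engineering"),
    ("scientist", "Data Science"),
    ("analyst", "Data Analysis"),
    ("architect", "Data Architecture"),
]


def categorize_title(title, rules=TITLE_RULES, default="Other"):
    """Return the group of the first rule whose keyword appears in the title."""
    lowered = title.lower()
    for keyword, group in rules:
        if keyword in lowered:
            return group
    return default


titles = [
    "Data Engineer",
    "Machine Learning Engineer",
    "Staff Data Analyst",
    "Applied Scientist",
    "Head of Data",
]
groups = [categorize_title(t) for t in titles]
print(groups)
```

Rule order matters here: "Machine Learning Engineer" also contains "engineer", so the more specific keyword is listed first.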

In [17]:
"""import matplotlib.colors as mcolors

# Create the color palette
cmap = mcolors.LinearSegmentedColormap.from_list('black_to_white', ['#000000', '#ffffff'])

palavras = df["job_title"].values

# Create a WordCloud object from the job titles using the custom color palette
nuvem_palavras = WordCloud(background_color='white', width=1800, height=720, colormap=cmap, max_words=50).generate(' '.join(palavras))

# Create the plot
fig, ax = plt.subplots(figsize=(22, 8))
ax.imshow(nuvem_palavras, interpolation='bilinear')
ax.axis("off")
plt.show()"""


In [18]:
jobs_title = df["job_title"].value_counts().sort_values(ascending=False)

# top jobs
top_jobs_labels = jobs_title.index.values[:5].tolist()
top_jobs_values = jobs_title.values[:5].tolist()

# sum of total jobs registered
total = sum(jobs_title.values)

# sum of other jobs
others = sum(jobs_title.values[5:])

jobs_values = top_jobs_values + [others]
jobs_labels = top_jobs_labels + ["Others"]

jobs = dict(zip(jobs_labels, jobs_values))

# Sort the dictionary by value in descending order
jobs_ordered = dict(sorted(jobs.items(), key=lambda item: item[1], reverse=True))

# Create a Pie trace for the donut chart
data = [
    go.Pie(
        labels=list(jobs_ordered.keys()),
        values=list(jobs_ordered.values()),
        hole=0.5,
        marker=dict(colors=list(reversed(colors.sequential.Greys))),
        textinfo="percent+label",
        insidetextorientation="auto",
        hoverinfo="label+value",
    )
]

# Create the annotation with the total value
annotation = dict(font=dict(size=20), showarrow=False, text=str(total), x=0.5, y=0.5)

# Create the chart layout
layout = go.Layout(
    title={
        "text": "Top jobs of professionals in the data career",
        "x": 0.5,
        #'xanchor': 'center'
    },
    # width=700,
    # height=700,
    annotations=[annotation],
)

# Create the figure
fig = go.Figure(data=data, layout=layout)

fig.show()

### employee_residence: The employee's primary country of residence during the work year (ISO 3166 country code)

In [19]:
df["employee_residence"].value_counts(normalize=True).sort_values(ascending=False)[0:10]

employee_residence
US    0.800861
GB    0.044403
ES    0.021259
CA    0.021259
IN    0.019107
DE    0.012379
FR    0.010226
PT    0.004844
BR    0.004844
GR    0.004306
Name: proportion, dtype: float64

In [20]:
def country_to_continent(country_code):
    """Map an ISO 3166-1 alpha-2 country code to its continent code."""
    if not isinstance(country_code, str) or len(country_code) != 2:
        raise ValueError(
            "Invalid input. The country code must be a two-character string."
        )
    return pc.country_alpha2_to_continent_code(country_code)


def categorize_country(country_code):
    """Map an ISO 3166-1 alpha-2 country code to its alpha-3 code, or "Unknown"."""
    try:
        return pyc.countries.get(alpha_2=country_code).alpha_3
    except (AttributeError, LookupError):
        # Unrecognized codes yield None or a lookup error, depending on the pycountry version
        return "Unknown"

In [21]:
# Map continent codes to their full names
continentes = {
    "NA": "North America",
    "EU": "Europe",
    "AS": "Asia",
    "SA": "South America",
    "AF": "Africa",
    "OC": "Oceania",
}

df["employee_residence"].apply(country_to_continent).value_counts(
    normalize=True
).sort_values(ascending=False).rename(index=continentes)

employee_residence
North America    0.826964
Europe           0.123520
Asia             0.032293
South America    0.008881
Africa           0.004575
Oceania          0.003767
Name: proportion, dtype: float64

In [22]:
data = (
    df["employee_residence"]
    .apply(country_to_continent)
    .value_counts()
    .sort_values(ascending=False)
)


data.rename(continentes, inplace=True)

# frequency of values and the labels
freq = data.values
labels = data.index.values

# total number of records
total = sum(freq)

# Create the annotation with the total value
annotation = dict(font=dict(size=20), showarrow=False, text=str(total), x=0.5, y=0.5)

# Create the donut chart
data = [
    go.Pie(
        labels=labels,
        values=freq,
        hole=0.5,
        marker=dict(colors=list(reversed(colors.sequential.Greys))),
        textinfo="percent+label",
        insidetextorientation="auto",
        hoverinfo="label+value",
    )
]

# Create the chart layout
layout = go.Layout(
    title={
        "text": "Residence of professionals in the data career",
        "x": 0.5,
        # "xanchor": "center",
    },
    # width=700,
    # height=700,
    annotations=[annotation],
)

# Build the figure
fig = go.Figure(data=data, layout=layout)

# Show the chart
fig.show()

In [23]:
df["employee_residence"].apply(categorize_country).value_counts(
    normalize=True
).sort_values(ascending=False)

employee_residence
USA    0.800861
GBR    0.044403
ESP    0.021259
CAN    0.021259
IND    0.019107
         ...   
MYS    0.000269
JEY    0.000269
NZL    0.000269
DZA    0.000269
MLT    0.000269
Name: proportion, Length: 78, dtype: float64

In [24]:
residence = (
    df["employee_residence"]
    .apply(categorize_country)
    .value_counts()
    .sort_values(ascending=False)
)[0:10]


# Create the bar chart
data = [
    go.Bar(
        x=residence.index,
        y=residence.values,
        marker=dict(color=list(reversed(colors.sequential.Greys))),
        text=residence.values,
        textposition="auto",
    )
]

# Create the layout
layout = go.Layout(
    title={"text": "Top 10 employee locations", "x": 0.5},
    plot_bgcolor="white",
    paper_bgcolor="white",
    xaxis=dict(title="Country"),
    yaxis=dict(title="Quantity of Employees"),
)

# Create the figure and plot the bar chart
fig = go.Figure(data=data, layout=layout)
fig.show()

### company_location: The country of the employer's main office or contracting branch

In [25]:
df["company_location"].value_counts(normalize=True)

company_location
US    0.810549
GB    0.045748
CA    0.021798
ES    0.020452
IN    0.015608
        ...   
MK    0.000269
BS    0.000269
IR    0.000269
CR    0.000269
MT    0.000269
Name: proportion, Length: 72, dtype: float64

In [26]:
company_location = (
    df["company_location"]
    .apply(categorize_country)
    .value_counts()
    .sort_values(ascending=False)
)[0:10]


# Create the bar chart
data = [
    go.Bar(
        x=company_location.index,
        y=company_location.values,
        marker=dict(color=list(reversed(colors.sequential.Greys))),
        text=company_location.values,
        textposition="auto",
    )
]

# Create the layout
layout = go.Layout(
    title={"text": "Top 10 company locations", "x": 0.5},
    plot_bgcolor="white",
    paper_bgcolor="white",
    xaxis=dict(title="Country"),
    yaxis=dict(title="Quantity of Companies"),
)

# Create the figure and plot the bar chart
fig = go.Figure(data=data, layout=layout)
fig.show()

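Companies and employees are extremely concentrated in the United States: about 81% of the records come from companies located there and about 80% from employees residing there.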

### company_size: The average number of people that worked for the company during the year

In [27]:
df["company_size"].unique()

array(['L', 'M', 'S'], dtype=object)

- S: fewer than 50 employees (small)
- M: 50 to 250 employees (medium)
- L: more than 250 employees (large)

In [28]:
df["company_size"].value_counts(normalize=True)

company_size
M    0.839882
L    0.120829
S    0.039290
Name: proportion, dtype: float64

In [29]:
data = df["company_size"].replace(
    {"S": "Small company", "M": "Medium company", "L": "Large company"}
)

# counts of unique values
unique_values = data.value_counts()

# frequency of values and the labels
freq = unique_values.values
labels = unique_values.index.values

# total number of records
total = sum(freq)

# Create the annotation with the total value
annotation = dict(font=dict(size=20), showarrow=False, text=str(total), x=0.5, y=0.5)

# Create the donut chart
data = [
    go.Pie(
        labels=labels,
        values=freq,
        hole=0.5,
        marker=dict(colors=list(reversed(colors.sequential.Greys))),
        textinfo="percent+label",
        insidetextorientation="auto",
        hoverinfo="label+value",
    )
]

layout = go.Layout(
    title={
        "text": "Size of companies with professionals in the data career",
        "x": 0.5,
    },
    annotations=[
        {
            "text": "Small company: fewer than 50 employees<br>"
            "Medium company: between 50 and 250 employees<br>"
            "Large company: more than 250 employees",
            "showarrow": False,
            "x": 1.15,
            "y": 0.03,
            "xref": "paper",
            "yref": "paper",
            "align": "right",
        },
        dict(font=dict(size=20), showarrow=False, text=str(total), x=0.5, y=0.5),
    ],
)

# Build the figure
fig = go.Figure(data=data, layout=layout)

# Show the chart
fig.show()

### salary_currency: The currency of the salary paid

In [30]:
df["salary_currency"].value_counts(normalize=True)

salary_currency
USD    0.857374
EUR    0.063240
GBP    0.043326
INR    0.016146
CAD    0.006728
AUD    0.002422
SGD    0.001615
BRL    0.001615
PLN    0.001346
CHF    0.001076
HUF    0.000807
DKK    0.000807
JPY    0.000807
TRY    0.000807
THB    0.000538
ILS    0.000269
HKD    0.000269
CZK    0.000269
MXN    0.000269
CLP    0.000269
Name: proportion, dtype: float64

The predominant payment currencies are the US dollar (about 86% of the records) and the euro (about 6%).

In [31]:
salary_currency = df["salary_currency"].value_counts()

# top currency
top_currencies_labels = salary_currency.index.values[:2].tolist()
top_currencies_values = salary_currency.values[:2].tolist()

# total number of records
total = sum(salary_currency.values)

# number of records paid in the remaining currencies
others = sum(salary_currency.values[2:])

currency_values = top_currencies_values + [others]
currency_labels = top_currencies_labels + ["Others"]

currencies = dict(zip(currency_labels, currency_values))

# Sort the dictionary by value in descending order
currencies_ordered = dict(
    sorted(currencies.items(), key=lambda item: item[1], reverse=True)
)

# Create a Pie trace for the donut chart
data = [
    go.Pie(
        labels=list(currencies_ordered.keys()),
        values=list(currencies_ordered.values()),
        hole=0.5,
        marker=dict(colors=list(reversed(colors.sequential.Greys))),
        textinfo="percent+label",
        insidetextorientation="auto",
        hoverinfo="label+value",
    )
]

# Create the annotation with the total value
annotation = dict(font=dict(size=20), showarrow=False, text=str(total), x=0.5, y=0.5)

# Create the chart layout
layout = go.Layout(
    title={
        "text": "Top currencies of professionals in the data career",
        "x": 0.5,
        "xanchor": "center",
    },
    # width=600,
    # height=600,
    annotations=[annotation],
)

# Create the figure
fig = go.Figure(data=data, layout=layout)

fig.show()

## Numerical Attributes

### work_year: The year the salary was paid

In [32]:
# frequency of each work_year value, ordered by year
work_year = df["work_year"].value_counts().sort_index()

# Years in which the salaries were paid
years = work_year.index.values

# Number of salary records per year
quantity_of_employeements = work_year.values

# Growth percentage array
growth_percentage = (
    100 * (np.diff(quantity_of_employeements)) / quantity_of_employeements[:-1]
)

growth_percentage = np.concatenate(([0], growth_percentage))


# Create a figure with layout configuration
fig = make_subplots(rows=1, cols=2)

# Quantity bar graph
fig.add_trace(
    go.Bar(
        x=years,
        y=quantity_of_employeements,
        name="Quantity",
        text=quantity_of_employeements,
        textposition="auto",
    ),
    row=1,
    col=1,
)

# Growth scatter graph
fig.add_trace(
    go.Scatter(
        x=years,
        y=np.round(growth_percentage, 0),
        name="Growth",
        mode="lines+markers+text",
        text=np.char.mod("%.0f", growth_percentage),
        textposition="top center",
        textfont=dict(size=12),
        hoverinfo="y+text",
    ),
    row=1,
    col=2,
)

# Layout configuration
fig.update_layout(
    title="Salaries registered in <a href='https://ai-jobs.net'>ai-jobs.net</a> per Year",
    yaxis1_title="Quantity",
    yaxis2_title="Growth (%)",
    xaxis1=dict(
        title="Years",
        tickmode="linear",
        tickformat="%Y",
        dtick="M12",
    ),
    xaxis2=dict(
        title="Years",
        tickmode="linear",
        tickformat="%Y",
        dtick="M12",
    ),
)
fig.update_layout(template=None)

# Change color to black
fig.update_traces(marker=dict(color="black"))
# height=400, width=800)

fig.show()

<p style='text-align: justify;'> So far, 2022 has the highest number of salary records, but the gap between the current year and the previous one keeps narrowing, so the current year should end with more registered salaries than last year. The number of records on the platform has grown rapidly from year to year, which suggests that the data community is increasingly adopting the platform and registering salaries on it. </p>

<p style='text-align: justify;'> Over time, ai-jobs.net can serve as a sample to analyze the demand curve for data professionals. At this moment, we cannot assume that the demand for data professionals is increasing, as the increase in 2022 may have been due to the platform's promotion in the community and not necessarily due to an increase in demand for professionals. </p>
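
The growth figures discussed above are year-over-year percentage changes, with the first year set to 0 because it has no predecessor. The sketch below reproduces that calculation in plain Python; the helper name `yearly_growth` and the record counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def yearly_growth(counts):
    """Percentage change of each value relative to the previous one; the first entry is 0."""
    growth = [0.0]
    for previous, current in zip(counts, counts[1:]):
        growth.append(100 * (current - previous) / previous)
    return growth


# Hypothetical record counts per year (illustrative only, not from the dataset)
records_per_year = [100, 250, 1000, 900]
growth = yearly_growth(records_per_year)
print(growth)
```

A partially recorded current year can show negative growth simply because the year is not over, which is one reason the figures should be read with care.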

### salary_in_usd: The salary in USD (FX rate divided by avg. USD rate of respective year via data from BIS).
While the attribute salary represents the value in local currency, the attribute salary_in_usd represents the converted value in US dollars. To facilitate analysis, we will adopt only the salary in US dollars.

In [33]:
# array of the recorded salary values
salary = df["salary_in_usd"].values

# Compute the density curve
kde = gaussian_kde(salary)
x_kde = np.linspace(0, salary.max(), 100)
y_kde = kde.evaluate(x_kde)

# Create the figure with subplots
fig = make_subplots(rows=1, cols=2)

# Add the box plot to the first subplot
fig.add_trace(go.Box(x=salary, name="Salary", marker_color="black"), row=1, col=1)

# Add the density plot to the second subplot
fig.add_trace(go.Scatter(x=x_kde, y=y_kde, line=dict(color="black")), row=1, col=2)

# Customize the figure layout
fig.update_layout(
    title="Salary of professionals in the data career",
    xaxis1_title="Salary (USD)",
    yaxis2_title="Density",
    xaxis2_title="Salary (USD)",
    showlegend=False,
    template=None,
)


# Show the figure
fig.show()

A distribution with positive skewness is a probability distribution where the right tail is longer than the left tail. This means that most of the values are concentrated on the left side of the graph, while a smaller number of values extend to the right. The mean is typically higher than the median, because extreme values in the right tail pull it upward. In the descriptive statistics above, the mean of salary_in_usd (about $137,450) is slightly above the median ($135,000).

A common example of a distribution with positive skewness is the salary distribution, where a large number of people earn low salaries, while a smaller number of people earn very high salaries, pulling the mean upwards.
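
The relationship between the tail and the mean can be checked numerically. The sketch below compares the mean and median and computes the moment-based skewness coefficient (third central moment divided by the cubed standard deviation) on made-up salaries; the helper `skewness` and the values are illustrative assumptions, not results from the dataset.

```python
from statistics import mean, median, pstdev


def skewness(values):
    """Moment-based skewness (population form): third central moment / sigma^3."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sigma ** 3)


# Hypothetical salaries in USD with one large value in the right tail
salaries = [50_000, 60_000, 70_000, 80_000, 240_000]

print(mean(salaries), median(salaries))
print(skewness(salaries))
```

A positive coefficient indicates a longer right tail, while a symmetric sample such as `[1, 2, 3, 4, 5]` gives zero.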

### remote_ratio: The overall amount of work done remotely. Possible values are as follows:

In [34]:
df["remote_ratio"].unique()

array([100,  50,   0])

- 0: No remote work (less than 20%)
- 50: Partially remote
- 100: Fully remote (more than 80%)

In [35]:
df["remote_ratio"].value_counts(normalize=True)

remote_ratio
0      0.509957
100    0.439182
50     0.050861
Name: proportion, dtype: float64

In [36]:
data = df["remote_ratio"].replace(
    {0: "No remote work", 50: "Partially remote", 100: "Fully remote"}
)

# counts of unique values
unique_values = data.value_counts()

# frequency of values and the labels
freq = unique_values.values
labels = unique_values.index.values

# total number of records
total = sum(freq)

# Create the annotation with the total value
annotation = dict(font=dict(size=20), showarrow=False, text=str(total), x=0.5, y=0.5)

# Create the donut chart
data = [
    go.Pie(
        labels=labels,
        values=freq,
        hole=0.5,
        marker=dict(colors=list(reversed(colors.sequential.Greys))),
        textinfo="percent+label",
        insidetextorientation="auto",
        hoverinfo="label+value",
    )
]

# Create the chart layout
layout = go.Layout(
    title={
        "text": "Remote work ratio of professionals in the data career",
        "x": 0.5,
        # "xanchor": "center",
    },
    # width=700,
    # height=700,
    annotations=[annotation],
)

# Build the figure
fig = go.Figure(data=data, layout=layout)

# Show the chart
fig.show()

# Bivariate analysis

## Hypothesis class
Used to assess the plausibility of a hypothesis using sample data.

In [37]:
class HypothesisTest:
    def __init__(self, df) -> None:
        self._df = df

    def t_test(self, var1: str, var2: str, h0: str, alpha=0.05) -> dict:
        """
        Performs a t-test for two numerical variables in a DataFrame.

        Parameters:
        var1 (str): The name of the first numerical variable.
        var2 (str): The name of the second numerical variable.
        h0 (str): The null hypothesis as a string, e.g. "The two means are equal".
        alpha (float, optional): The desired significance level for the test. Defaults to 0.05.

        Returns:
        dict: A dictionary containing the following fields:
            - "t": The t-statistic calculated by the test.
            - "p_value": The p-value calculated by the test.
            - "conclusion": the result of the test, indicating whether the null hypothesis was rejected or not.
        """

        df = self._df

        # Calculates the t-statistic and p-value for the t-test
        t_stat, p_val = ttest_ind(df[var1], df[var2])

        # Determine conclusion
        if p_val < alpha:
            conclusion = f"Reject null hypothesis: {h0}"
        else:
            conclusion = f"Fail to reject null hypothesis: {h0}"

        # Returns a dictionary with the results of the test
        return {"t": t_stat, "p_value": p_val, "conclusion": conclusion}

    def chi_square_test(self, var1: str, var2: str, h0: str, alpha=0.05) -> dict:
        """
        Performs the chi-square test for two categorical variables of a DataFrame.

        Parameters:
            var1 (str): name of the first categorical variable.
            var2 (str): name of the second categorical variable.
            h0 (str): the null hypothesis as a string, e.g. "The two variables are independent". It is echoed in the returned conclusion.
            alpha (float): desired significance level for the test (usually 0.05 or 0.01).

        Returns:
            A dictionary with the following fields:
            - "chi²": the calculated chi-square statistic.
            - "p_value": the p-value calculated by the test.
            - "conclusion": the result of the test, indicating whether the null hypothesis was rejected or not.
        """

        df = self._df

        # Creates a contingency table
        contingency_table = pd.crosstab(df[var1], df[var2])

        # Performs the chi-squared test
        chi2, p_val, _, _ = chi2_contingency(contingency_table)

        # Determine conclusion
        if p_val < alpha:
            conclusion = f"Reject null hypothesis: {h0}"
        else:
            conclusion = f"Fail to reject null hypothesis: {h0}"

        # Returns a dictionary with the results of the test
        return {"chi²": chi2, "p_value": p_val, "conclusion": conclusion}

## Categorical & Categorical

In [38]:
categorical_columns = categorical_df.columns.values

bivariate_categorical_combinations = list(
    itertools.combinations(categorical_columns, 2)
)

bivariate_categorical_combinations

[('experience_level', 'employment_type'),
 ('experience_level', 'job_title'),
 ('experience_level', 'salary_currency'),
 ('experience_level', 'employee_residence'),
 ('experience_level', 'company_location'),
 ('experience_level', 'company_size'),
 ('employment_type', 'job_title'),
 ('employment_type', 'salary_currency'),
 ('employment_type', 'employee_residence'),
 ('employment_type', 'company_location'),
 ('employment_type', 'company_size'),
 ('job_title', 'salary_currency'),
 ('job_title', 'employee_residence'),
 ('job_title', 'company_location'),
 ('job_title', 'company_size'),
 ('salary_currency', 'employee_residence'),
 ('salary_currency', 'company_location'),
 ('salary_currency', 'company_size'),
 ('employee_residence', 'company_location'),
 ('employee_residence', 'company_size'),
 ('company_location', 'company_size')]

Attributes such as "salary_currency", "employment_type", "employee_residence" and "company_location" may be disregarded in the pairwise analysis of categorical variables, because a single value dominates each of them, as seen in the univariate analysis. For example, the dominant currency is the US dollar, and the majority of data professionals work full-time.

When one value of an attribute dominates all the others, the attribute shows too little variation to be informative in a pairwise comparison. This is the case for salary_currency and employment_type, so they are unlikely to add much to the pairwise analysis of categorical variables.

It is important to remember that the decision to disregard an attribute or not depends on the context and objective of the analysis. In some cases, even if the distribution of an attribute is dominant, it may still bring relevant information for the analysis.
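As a rough illustration of what "dominant" means here, the sketch below measures the share of the most frequent value in each column and keeps only the columns below a cutoff. It uses plain Python and made-up toy data; the helper names `dominant_share` and `columns_to_keep` and the 0.8 threshold are assumptions for this example, not part of the notebook.

```python
from collections import Counter


def dominant_share(values):
    """Return the share (0 to 1) of the most frequent value in `values`."""
    counts = Counter(values)
    return max(counts.values()) / len(values)


def columns_to_keep(columns, threshold=0.8):
    """Keep only the columns whose most frequent value is below `threshold`."""
    return [
        name for name, values in columns.items() if dominant_share(values) < threshold
    ]


# Made-up toy data, NOT the notebook's dataset.
toy_columns = {
    "salary_currency": ["USD"] * 9 + ["EUR"],
    "company_size": ["S", "M", "M", "L", "L", "S", "M", "L", "S", "M"],
}

for name, values in toy_columns.items():
    print(f"{name}: dominant share = {dominant_share(values):.2f}")

print(columns_to_keep(toy_columns))
```

The cutoff is a judgment call: a lower threshold discards more columns, so it should be chosen with the objective of the analysis in mind.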

In [39]:
# List of all categorical columns
categorical_columns = categorical_df.columns.values

# Remove specific columns
columns_to_exclude = [
    "salary_currency",
    "employment_type",
    "employee_residence",
    "company_location",
]
categorical_columns = [
    col for col in categorical_columns if col not in columns_to_exclude
]

# Create combinations
bivariate_categorical_combinations = list(
    itertools.combinations(categorical_columns, 2)
)

bivariate_categorical_combinations

[('experience_level', 'job_title'),
 ('experience_level', 'company_size'),
 ('job_title', 'company_size')]

Since job_title has a large number of distinct values, I will keep only three categories: Data Engineer, Data Scientist, and Data Analyst.

## Experience level x Job Title

In [40]:
df_copy = categorical_df.copy()

analysis = categorical_df.loc[
    categorical_df["job_title"].isin(
        ["Data Engineer", "Data Scientist", "Data Analyst"]
    )
]

grouped_data = (
    analysis.replace(
        {
            "SE": "Senior",
            "MI": "Intermediate",
            "EN": "Junior",
            "EX": "Director",
        }
    )
    .groupby(["job_title", "experience_level"])["experience_level"]
    .count()
    .reset_index(name="count")
)

fig = go.Figure()

traces = [
    go.Bar(
        x=grouped_data[grouped_data["experience_level"] == experience_level][
            "job_title"
        ],
        y=grouped_data[grouped_data["experience_level"] == experience_level]["count"],
        name=experience_level,
    )
    for experience_level in grouped_data["experience_level"].unique()
]

# Create the chart layout
layout = go.Layout(
    title="Experience Level by Job Title",
    xaxis=dict(title="Job Title"),
    yaxis=dict(title="Count"),
)

# Add the bars to the Figure object
fig = go.Figure(data=traces, layout=layout)

fig.show()

## Company Size x Job Title

In [41]:
df_copy = categorical_df.copy()

analysis = categorical_df.loc[
    categorical_df["job_title"].isin(
        ["Data Engineer", "Data Scientist", "Data Analyst"]
    )
]


grouped_data = (
    analysis.groupby(["job_title", "company_size"])["company_size"]
    .count()
    .reset_index(name="count")
)


fig = go.Figure()

# Create the bars for small companies
small_company = go.Bar(
    x=grouped_data[grouped_data["company_size"] == "S"]["job_title"],
    y=grouped_data[grouped_data["company_size"] == "S"]["count"],
    name="Small Company (less than 50 employees)",
)

# Create the bars for medium companies
medium_company = go.Bar(
    x=grouped_data[grouped_data["company_size"] == "M"]["job_title"],
    y=grouped_data[grouped_data["company_size"] == "M"]["count"],
    name="Medium Company (50 to 250 employees)",
)

# Create the bars for large companies
large_company = go.Bar(
    x=grouped_data[grouped_data["company_size"] == "L"]["job_title"],
    y=grouped_data[grouped_data["company_size"] == "L"]["count"],
    name="Large Company (more than 250)",
)

# Create the chart layout
layout = go.Layout(
    title="Company Size by Job Title",
    xaxis=dict(title="Job Title"),
    yaxis=dict(title="Count"),
)

# Add the bars to the Figure object
fig = go.Figure(data=[small_company, medium_company, large_company], layout=layout)

fig.show()

## Experience level x Company Size

In [42]:
grouped_data = (
    df.groupby(["company_size", "experience_level"])["experience_level"]
    .count()
    .reset_index(name="count")
)

grouped_data["proportion"] = grouped_data.groupby(["company_size"])[
    "count"
].transform(lambda x: x / x.sum())

# Assign the result back instead of using inplace=True on a selected column
grouped_data["company_size"] = grouped_data["company_size"].replace(
    {"L": "Large company", "M": "Medium company", "S": "Small company"}
)
grouped_data["experience_level"] = grouped_data["experience_level"].replace(
    {"EN": "Junior", "MI": "Medium", "SE": "Senior", "EX": "Executive"}
)

# Define the desired order
cat_order = pd.CategoricalDtype(
    categories=["Small company", "Medium company", "Large company"], ordered=True
)

# Convert "company_size" to an ordered categorical column
grouped_data["company_size"] = grouped_data["company_size"].astype(cat_order)

# Sort the rows by "company_size"
grouped_data = grouped_data.sort_values("company_size")

grouped_data

Unnamed: 0,company_size,experience_level,count,proportion
8,Small company,Junior,49,0.335616
9,Small company,Executive,6,0.041096
10,Small company,Medium,48,0.328767
11,Small company,Senior,43,0.294521
4,Medium company,Junior,171,0.05479
5,Medium company,Executive,95,0.030439
6,Medium company,Medium,615,0.197052
7,Medium company,Senior,2240,0.717719
0,Large company,Junior,98,0.218263
1,Large company,Executive,13,0.028953


In [43]:
fig = go.Figure()

company_size = ["Small company", "Medium company", "Large company"]

traces = [
    go.Bar(
        x=company_size,
        y=grouped_data[(grouped_data["experience_level"] == experience_level)][
            "proportion"
        ],
        name=experience_level,
    )
    for i, experience_level in enumerate(["Junior", "Medium", "Senior", "Executive"])
]

# Create the chart layout
layout = go.Layout(
    title="Proportional Experience Level by Company Size",
    xaxis=dict(title="Company Size"),
    yaxis=dict(title="Proportional Percentage of Experience Level"),
)

# Add the bars to the Figure object
fig = go.Figure(data=traces, layout=layout)

fig.show()

### Hypothesis test

Null hypothesis (H0): the distribution of seniority levels among professionals does not differ significantly between companies of different sizes.

In [44]:
hypothesis = HypothesisTest(categorical_df)

hypothesis.chi_square_test(
    "experience_level",
    "company_size",
    h0="The distribution of seniority levels among professionals does not differ significantly between companies of different sizes",
)

{'chi²': 334.8121342444997,
 'p_value': 2.806529402003438e-69,
 'conclusion': 'Reject null hypothesis: The distribution of seniority levels among professionals does not differ significantly between companies of different sizes'}

The p-value (about 2.8e-69) is far below the 0.05 significance level. This is strong evidence against the null hypothesis and suggests that the distribution of seniority levels among professionals differs between companies of different sizes.
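To make the statistic above less of a black box, here is a minimal from-scratch sketch of how a chi-square statistic and its degrees of freedom are obtained from a contingency table. It is written in plain Python on a made-up table, so the numbers are not the notebook's; the function name `chi_square_statistic` is an assumption for this example. The p-value is then read from the chi-square distribution with `dof` degrees of freedom, which is what `chi2_contingency` reports.

```python
def chi_square_statistic(table):
    """Return (chi2, dof) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


# Made-up 2 x 3 table (rows: two company sizes, columns: three seniority levels)
toy_table = [
    [10, 20, 30],
    [30, 20, 10],
]

chi2, dof = chi_square_statistic(toy_table)
print(chi2, dof)
```

A table larger than 2 x 2 is used on purpose: for 2 x 2 tables SciPy applies a continuity correction by default, so a hand calculation like this one would not match its output there.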

## Numerical & Numerical

In [45]:
numerical_columns = numerical_df.columns.values

bivariate_numerical_combinations = list(
    itertools.combinations(numerical_columns, 2)
)

bivariate_numerical_combinations

[('work_year', 'salary'),
 ('work_year', 'salary_in_usd'),
 ('work_year', 'remote_ratio'),
 ('salary', 'salary_in_usd'),
 ('salary', 'remote_ratio'),
 ('salary_in_usd', 'remote_ratio')]

In [46]:
# List of all numerical columns
numerical_columns = numerical_df.columns.values

# Remove specific columns
columns_to_exclude = ["salary"]
numerical_columns = [
    col for col in numerical_columns if col not in columns_to_exclude
]

# Create combinations
bivariate_numerical_combinations = list(
    itertools.combinations(numerical_columns, 2)
)

bivariate_numerical_combinations

[('work_year', 'salary_in_usd'),
 ('work_year', 'remote_ratio'),
 ('salary_in_usd', 'remote_ratio')]

## Work Year x Remote Ratio

In [86]:
grouped_data = (
    df.groupby(["work_year", "remote_ratio"])["remote_ratio"]
    .count()
    .reset_index(name="count")
)

grouped_data["proportion"] = (
    grouped_data.groupby(["work_year"])["count"].transform(lambda x: x / x.sum()) * 100
)

grouped_data["proportion"] = grouped_data["proportion"].round(1)

# Assign the result back instead of using inplace=True on a selected column
grouped_data["remote_ratio"] = grouped_data["remote_ratio"].replace(
    {0: "In-Office", 50: "Hybrid", 100: "Remote"}
)

grouped_data


Unnamed: 0,work_year,remote_ratio,count,proportion
0,2020,In-Office,16,21.1
1,2020,Hybrid,21,27.6
2,2020,Remote,39,51.3
3,2021,In-Office,34,14.8
4,2021,Hybrid,76,33.0
5,2021,Remote,120,52.2
6,2022,In-Office,711,42.7
7,2022,Hybrid,62,3.7
8,2022,Remote,891,53.5
9,2023,In-Office,1134,64.9


In [92]:


# Map each work model to a shade of gray
gray_colors = {
    "In-Office": colors.sequential.Greys[8],
    "Hybrid": colors.sequential.Greys[5],
    "Remote": colors.sequential.Greys[2],
}

# Create the bars
traces = [
    go.Bar(
        x=grouped_data[grouped_data["remote_ratio"] == employee_type]["work_year"],
        y=grouped_data[grouped_data["remote_ratio"] == employee_type]["proportion"],
        name=employee_type,
        marker=dict(color=gray_colors[employee_type]),
        text=[
            f"{val:.1f}%"
            for val in grouped_data[grouped_data["remote_ratio"] == employee_type][
                "proportion"
            ]
        ],
        textposition="auto",
    )
    for i, employee_type in enumerate(["In-Office", "Hybrid", "Remote"])
]

# Create the chart layout
layout = go.Layout(
    title="Proportional Work Models per Year",
    xaxis=dict(title="Year"),
    yaxis=dict(title="Proportional Percentage of Work Models"),
    plot_bgcolor="white",
)

# Add the bars to the Figure object
fig = go.Figure(data=traces, layout=layout)

fig.show()


## Numerical & Categorical

## Hypothesis test

Do more experienced professionals work at larger companies?

### Trend Salary [USD] per top job_title

In [50]:
trend_salary = df.loc[
    df["job_title"].isin(["Data Engineer", "Data Scientist", "Data Analyst"])
]

trend_salary = trend_salary.loc[:, ["work_year", "job_title", "salary_in_usd"]]

# Group by year and job title and take the 95th percentile of the salary
grouped = trend_salary.groupby(["work_year", "job_title"]).quantile(0.95)

grouped.reset_index(inplace=True)

# new figure
fig = go.Figure()

for job_title in grouped["job_title"].unique():
    data = grouped[grouped["job_title"] == job_title]
    fig.add_trace(
        go.Scatter(
            x=data["work_year"], y=data["salary_in_usd"], mode="lines", name=job_title
        )
    )

# Set the range and tick interval of the x-axis
fig.update_layout(xaxis=dict(range=[2019.5, 2023.5], dtick=1), template=None)

# Show the chart
fig.show()

In [51]:
pd.get_dummies(df)

Unnamed: 0,work_year,salary,salary_in_usd,remote_ratio,experience_level_EN,experience_level_EX,experience_level_MI,experience_level_SE,employment_type_CT,employment_type_FL,...,company_location_SI,company_location_SK,company_location_TH,company_location_TR,company_location_UA,company_location_US,company_location_VN,company_size_L,company_size_M,company_size_S
0,2023,100000,100000,100,True,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,2023,30000,30000,100,True,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2,2022,1650000,20984,50,False,False,True,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,2023,204620,204620,0,True,False,False,False,False,False,...,False,False,False,False,False,True,False,True,False,False
4,2023,110680,110680,0,True,False,False,False,False,False,...,False,False,False,False,False,True,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3711,2020,412000,412000,100,False,False,False,True,False,False,...,False,False,False,False,False,True,False,True,False,False
3712,2021,151000,151000,100,False,False,True,False,False,False,...,False,False,False,False,False,True,False,True,False,False
3713,2020,105000,105000,100,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
3714,2020,100000,100000,100,True,False,False,False,True,False,...,False,False,False,False,False,True,False,True,False,False


# Multivariate analysis
Multivariate analysis is used to understand how various independent variables are related to the dependent variable, allowing us to predict or explain the value of the dependent variable based on the independent variables. The choice of the response variable depends on the analysis objectives and the nature of the data. Generally, the response variable is the variable of greatest interest and the one we want to explain or predict.

In our case, the dependent variable will be the professional's salary in dollars in the data career. Do our independent attributes explain these records? Let's find out.

## Correlation analysis

In [52]:
df_copy = df.copy()

# Select the categorical columns
colunas_categoricas = df_copy.select_dtypes(include=["object"]).columns

# Apply factorize to each categorical column
for coluna in colunas_categoricas:
    df_copy[coluna], _ = pd.factorize(df_copy[coluna])

# Compute the Spearman correlation of every attribute with "salary_in_usd"
corr_matrix = df_copy.corr(method="spearman")["salary_in_usd"].sort_values(
    ascending=True
)
corr_matrix

salary_currency      -0.481475
company_location     -0.368022
employee_residence   -0.361293
job_title            -0.156190
employment_type      -0.116628
remote_ratio         -0.060190
company_size          0.018920
work_year             0.211624
experience_level      0.459580
salary                0.882722
salary_in_usd         1.000000
Name: salary_in_usd, dtype: float64

In [53]:
# Create bar plot
fig = go.Figure(
    go.Bar(
        x=corr_matrix.values,
        y=corr_matrix.index,
        orientation="h",
        marker=dict(
            color=corr_matrix.values,
            colorscale="Greys",
            colorbar=dict(title="Correlation"),
            cmin=-1,
            cmax=1,
        ),
    )
)

# Configure layout
fig.update_layout(
    title="Correlation of Salary (USD) with other features",
    xaxis_title="Correlation",
    # yaxis_title="Features",
    template=None,
    margin=dict(l=130),
)
# Show figure
fig.show()

The variables "work_year" (0.21), "experience_level" (0.46), and "company_size" (0.02) are positively correlated with salary_in_usd, but only weakly to moderately. It is interesting that experience level and company size move in the same direction as salary without being strongly correlated with it, which may seem counterintuitive. One hypothesis for this result is the high demand for these professionals in the job market.

On the negative side, "salary_currency" (-0.48), "company_location" (-0.37), and "employee_residence" (-0.36) stand out. One reading is that the location of the professional and of the company does not matter that much, which would be consistent with the roughly even split between remote and on-site jobs. These signs should be read with caution, however: the categorical columns were encoded with pd.factorize, whose integer codes follow the order of appearance in the data rather than any meaningful ranking.
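Because the correlations above are computed on integer codes produced by `pd.factorize`, it helps to see how much the coding of a nominal column can matter. The sketch below implements Spearman's correlation from scratch (Pearson's correlation on average ranks) and applies it to the same made-up column under two different codings. The data and the names `average_ranks`, `pearson`, and `spearman` are assumptions for this example only.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up toy data, NOT the notebook's dataset.
salaries = [50, 60, 120, 130, 125, 55]
countries = ["BR", "BR", "US", "US", "US", "BR"]

# Two equally valid integer codings of the same nominal column
codes_a = [{"BR": 0, "US": 1}[c] for c in countries]
codes_b = [{"US": 0, "BR": 1}[c] for c in countries]

rho_a = spearman(codes_a, salaries)
rho_b = spearman(codes_b, salaries)
print(round(rho_a, 3), round(rho_b, 3))
```

The two codings give correlations of equal size and opposite sign, so for nominal columns the sign carries no information by itself. With more than two categories, the magnitude also depends on the order of the codes.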

## Regression Analysis

In [54]:
import pandas as pd


data = df.copy()

# Select the categorical columns
colunas_categoricas = data.select_dtypes(include=["object"]).columns

# Apply factorize to each categorical column
for coluna in colunas_categoricas:
    data[coluna], _ = pd.factorize(data[coluna])


# Define the dependent and independent variables
y = data["salary_in_usd"]
X = data.drop("salary_in_usd", axis=1)

# Add the constant (intercept) term
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()

# Show the results
model.summary()

0,1,2,3
Dep. Variable:,salary_in_usd,R-squared:,0.291
Model:,OLS,Adj. R-squared:,0.289
Method:,Least Squares,F-statistic:,152.0
Date:,"Wed, 12 Apr 2023",Prob (F-statistic):,9.84e-268
Time:,14:16:39,Log-Likelihood:,-45700.0
No. Observations:,3716,AIC:,91420.0
Df Residuals:,3705,BIC:,91490.0
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.533e+07,2.81e+06,-5.445,0.000,-2.08e+07,-9.81e+06
work_year,7628.6638,1391.964,5.481,0.000,4899.574,1.04e+04
experience_level,3.234e+04,1358.553,23.805,0.000,2.97e+04,3.5e+04
employment_type,423.4098,4508.270,0.094,0.925,-8415.524,9262.344
job_title,-72.6713,62.102,-1.170,0.242,-194.429,49.086
salary,0.0057,0.001,4.241,0.000,0.003,0.008
salary_currency,-7643.6169,562.323,-13.593,0.000,-8746.110,-6541.124
employee_residence,-331.8822,163.547,-2.029,0.043,-652.533,-11.231
remote_ratio,-13.0176,18.512,-0.703,0.482,-49.312,23.277

0,1,2,3
Omnibus:,580.874,Durbin-Watson:,1.93
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1411.855
Skew:,0.879,Prob(JB):,2.63e-307
Kurtosis:,5.455,Cond. No.,2270000000.0


 - R-squared: The coefficient of determination is a measure of how well the regression model fits the data. In this case, the R-squared value of 0.291 indicates that 29.1% of the variation in the dependent variable (salary_in_usd) is explained by the independent variables.

 - Adjusted R-squared: This is the same as R-squared, but adjusted for the number of independent variables in the model. The adjusted R-squared value of 0.289 indicates that the model still explains 28.9% of the variation in the dependent variable, even after accounting for the number of independent variables.

 - F-statistic: This tests whether the independent variables, taken together, are related to the dependent variable. In this case, the F-statistic of 152.0 with a very low probability (9.84e-268) indicates that the model as a whole is statistically significant, that is, it fits better than a model with only an intercept.

 - P-values: These indicate the statistical significance of the coefficient of each independent variable. Among the coefficients shown above, employment_type (0.925), job_title (0.242), and remote_ratio (0.482) have p-values above 0.05, so they are not statistically significant at the 5% level. The other coefficients shown have p-values below 0.05.

 - Coefficients: These represent the estimated change in the dependent variable for a one-unit change in the independent variable, holding all other independent variables constant. For example, the coefficient for work_year is 7628.66, which means that for every additional year, the predicted salary_in_usd increases by approximately $7,629, holding all other variables constant.

 - Standard errors: These indicate the precision of the coefficient estimates. In general, smaller standard errors indicate more precise estimates.

 - Omnibus test: This is a test of whether the residuals (the difference between the predicted and actual values of the dependent variable) are normally distributed. In this case, the probability value is very low (p < 0.001), indicating that the residuals are not normally distributed.

 - Durbin-Watson test: This is a test for autocorrelation in the residuals. The statistic ranges from 0 to 4, and values near 2 indicate no autocorrelation. In this case, the value of 1.93 is close to 2, so there is little evidence of autocorrelation in the residuals.

 - Jarque-Bera test: This is a test of whether the residuals are normally distributed, based on measures of skewness and kurtosis. In this case, the probability value is very low (p < 0.001), indicating that the residuals are not normally distributed.

 - Condition number: This is a measure of the amount of multicollinearity (correlation between independent variables) in the model. In this case, the large value of 2.27e+09 indicates that there may be strong multicollinearity or other numerical problems, such as independent variables on very different scales.
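Two of the quantities in the list above are simple enough to check by hand. The sketch below recomputes the adjusted R-squared from the R-squared, the number of observations, and the number of model terms in the summary table, and shows the Durbin-Watson formula on made-up residuals. The function names are assumptions for this example; statsmodels computes these values internally.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Values read from the summary table: R-squared, No. Observations, Df Model
print(round(adjusted_r_squared(0.291, 3716, 10), 3))

# Made-up residuals that alternate in sign (strong negative autocorrelation)
print(durbin_watson([1, -1, 1, -1]))
```

The first call reproduces the 0.289 shown in the table. The second gives 3.0, well above 2, as expected for residuals that flip sign at every step.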

# Feature Engineering

The goal of this chapter is to demonstrate a feature engineering technique that can be applied to the data model. In this case, I built an NLP model that groups the reported job titles into categories with a reasonable degree of accuracy, which gives us, for example, a new input variable.

## NLP data job classification
The goal is to classify the recorded job titles into four broad areas: Data Scientist, Data Engineer, Data Analyst, and Artificial Intelligence Engineer.

In [55]:
def text_to_lower(text: str) -> str:
    """Lowercase the text and remove punctuation."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return text

In [56]:
jobs = df["job_title"].values
jobs = [text_to_lower(job) for job in jobs]

# Vectorize the titles with TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(jobs)
vectorizer.get_feature_names_out()

array(['3d', 'ai', 'analyst', 'analytics', 'applied', 'architect',
       'autonomous', 'azure', 'bi', 'big', 'business', 'cloud',
       'compliance', 'computer', 'consultant', 'data', 'database', 'deep',
       'developer', 'devops', 'director', 'engineer', 'etl', 'finance',
       'financial', 'head', 'infrastructure', 'insight', 'intelligence',
       'lead', 'learning', 'machine', 'management', 'manager',
       'marketing', 'ml', 'mlops', 'nlp', 'operations', 'power',
       'principal', 'product', 'programmer', 'quality', 'research',
       'researcher', 'science', 'scientist', 'software', 'specialist',
       'staff', 'tech', 'technician', 'vehicle', 'vision'], dtype=object)
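For intuition about what the vectorizer produces, here is a simplified from-scratch sketch of TF-IDF weighting on three made-up titles. It assumes smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row, which is my understanding of scikit-learn's default weighting. It leaves out stop-word removal and scikit-learn's own tokenizer, so it is not a drop-in replacement. The names `clean` and `tfidf` are hypothetical.

```python
import math
import re
from collections import Counter


def clean(text):
    """Lowercase and strip punctuation, like text_to_lower in the notebook."""
    return re.sub(r"[^\w\s]", "", text.lower())


def tfidf(titles):
    """Return (idf, vectors) using smoothed IDF and L2-normalised rows."""
    tokenized = [clean(title).split() for title in titles]
    n_docs = len(tokenized)

    # Document frequency: in how many titles each token appears
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return idf, vectors


# Made-up toy titles, NOT the notebook's dataset.
toy_titles = ["Data Scientist", "Data Engineer", "Lead Data Engineer"]
idf, vectors = tfidf(toy_titles)

for title, vector in zip(toy_titles, vectors):
    print(title, {t: round(w, 3) for t, w in vector.items()})
```

A token such as "data", which appears in every title, gets the lowest IDF, so rarer tokens like "scientist" or "lead" carry more weight when titles are compared.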

In [57]:
# Map the numeric cluster ids to label names.
# Note: k-means cluster ids are arbitrary, so this mapping should be checked
# against the contents of each cluster.
job_roles = {
    0: "data scientist",
    1: "data analyst",
    2: "machine learning engineer",
    3: "data engineer",
}

# Run k-means with k=4
kmeans = KMeans(n_clusters=len(job_roles), random_state=0, n_init=10).fit(X)

# Build a DataFrame with the titles and the clusters assigned by KMeans
job_title_classificator_df = pd.DataFrame(
    {"job_title": jobs, "job_role": kmeans.labels_}
)


# Replace the numeric ids with the label names in the DataFrame
job_title_classificator_df = job_title_classificator_df.replace({"job_role": job_roles})

# Show the DataFrame with the titles and their labels
job_title_classificator_df

Unnamed: 0,job_title,job_role
0,data quality analyst,data engineer
1,compliance data analyst,data engineer
2,machine learning engineer,data analyst
3,applied scientist,data scientist
4,applied scientist,data scientist
...,...,...
3711,data scientist,data scientist
3712,principal data scientist,data scientist
3713,data scientist,data scientist
3714,business data analyst,data engineer


In [58]:
for job_role in list(job_roles.values()):
    print(job_role)
    print(
        job_title_classificator_df.loc[
            job_title_classificator_df["job_role"] == job_role
        ]["job_title"].unique()
    )
    print("\n")

data scientist
['applied scientist' 'data scientist' 'research scientist'
 'applied data scientist' 'ai scientist' 'lead data scientist'
 'data scientist lead' 'product data scientist' 'principal data scientist'
 'staff data scientist']


data analyst
['machine learning engineer' 'applied machine learning engineer'
 'machine learning researcher' 'machine learning scientist'
 'applied machine learning scientist' 'deep learning researcher'
 'machine learning infrastructure engineer' 'deep learning engineer'
 'machine learning software engineer' 'machine learning research engineer'
 'machine learning developer' 'principal machine learning engineer'
 'machine learning manager' 'lead machine learning engineer'
 'head of machine learning']


machine learning engineer
['data engineer' 'research engineer' 'computer vision engineer'
 'data architect' 'ai developer' 'business intelligence engineer'
 'analytics engineer' 'data analytics manager' 'etl engineer'
 'data devops engineer' 'head of dat

In [59]:
# Reduce the dimensionality of the data to 2 principal components using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X.toarray())

X_pca

array([[ 0.09621789,  0.3808771 ],
       [ 0.0906006 ,  0.3398964 ],
       [-0.29810578, -0.00925404],
       ...,
       [ 0.63058696, -0.36947838],
       [ 0.10068258,  0.41370375],
       [ 0.06072973,  0.11672312]])

In [60]:
# Create a DataFrame with job titles, groups, and principal component coordinates
job_title_classificator_df["pc1"] = X_pca[:, 0]
job_title_classificator_df["pc2"] = X_pca[:, 1]

job_title_classificator_df

Unnamed: 0,job_title,job_role,pc1,pc2
0,data quality analyst,data engineer,0.096218,0.380877
1,compliance data analyst,data engineer,0.090601,0.339896
2,machine learning engineer,data analyst,-0.298106,-0.009254
3,applied scientist,data scientist,0.373257,-0.104208
4,applied scientist,data scientist,0.373257,-0.104208
...,...,...,...,...
3711,data scientist,data scientist,0.630587,-0.369478
3712,principal data scientist,data scientist,0.270867,-0.060649
3713,data scientist,data scientist,0.630587,-0.369478
3714,business data analyst,data engineer,0.100683,0.413704


In [61]:
# Define the grayscale palette
colorscale = [(i / len(job_roles), f"rgb({i}, {i}, {i})") for i in range(256)]


traces = [
    go.Scatter(
        # pc1 goes on the x-axis and pc2 on the y-axis, matching the axis titles
        x=job_title_classificator_df.loc[
            job_title_classificator_df["job_role"] == role
        ]["pc1"].values,
        y=job_title_classificator_df.loc[
            job_title_classificator_df["job_role"] == role
        ]["pc2"].values,
        mode="markers",
        marker=dict(size=11, color=colorscale[int(i / len(job_roles) * 255)][1]),
        name=role,
        text=job_title_classificator_df.loc[
            job_title_classificator_df["job_role"] == role
        ]["job_title"],
        hoverinfo="text",
    )
    for i, role in job_roles.items()
]

# Define the layout
layout = go.Layout(
    title="Job Titles Clustered by Job Role",
    xaxis=dict(title="Principal Component 1"),
    yaxis=dict(title="Principal Component 2"),
    showlegend=True,
    plot_bgcolor="rgba(0,0,0,0)",
    legend=dict(title="Job Role"),
)

# Create the figure
fig = go.Figure(data=traces, layout=layout)

# Show the figure
fig.show()

The extreme points are records whose title matches the name of their cluster. Note that the artificial intelligence engineer and data engineer clusters have centers that lie close together, because their titles share the word "engineer". Likewise, job titles that fall near the data scientist cluster do so because they share the word "scientist".
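One caveat about the labels: k-means numbers its clusters arbitrarily, and in the printed cluster contents above the machine learning titles end up under the "data analyst" label. A possible way to avoid a hand-written id-to-name mapping is to name each cluster after its most frequent title. The sketch below shows the idea in plain Python on made-up labels; `name_clusters` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict


def name_clusters(labels, titles):
    """Map each cluster id to the most frequent title assigned to it."""
    titles_by_cluster = defaultdict(list)
    for label, title in zip(labels, titles):
        titles_by_cluster[label].append(title)
    return {
        label: Counter(cluster_titles).most_common(1)[0][0]
        for label, cluster_titles in titles_by_cluster.items()
    }


# Made-up cluster ids and titles, NOT the notebook's k-means output.
toy_labels = [1, 1, 1, 0, 0, 2]
toy_titles = [
    "machine learning engineer",
    "machine learning engineer",
    "deep learning engineer",
    "data scientist",
    "data scientist",
    "data engineer",
]

cluster_names = name_clusters(toy_labels, toy_titles)
print(cluster_names)
```

In the notebook, the same idea could be applied to `kmeans.labels_` and `jobs`, although the resulting names would still need a manual check.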

# Conclusions
The main goal of this work is to apply the knowledge learned in class during the Data Science course.