## In-depth exploration of the rise of jobs in the data domain

#### 1. Introduction to the Data Landscape:

In recent years, the global business landscape has undergone a seismic shift, with data emerging as a critical factor driving decision-making across sectors. According to an article from the World Economic Forum, the world is projected to produce 463 exabytes of data each day by 2025, which illustrates the sheer volume and potential of data [1]. Consequently, there has been a surge in demand for professionals skilled in harnessing and making sense of this vast reservoir of information, leading to the rise of jobs such as data scientists, machine learning engineers, and big data engineers.

#### 2. The Data Scientist Phenomenon:

Often touted as the "sexiest job of the 21st century" by the Harvard Business Review [2], the role of the data scientist has seen unparalleled growth. Businesses, recognizing the value of data-driven decision-making, have been clamoring to hire experts who can sift through massive datasets, extract insights, and convert them into actionable strategies. As per LinkedIn, data science positions have seen a growth rate of 650% since 2012 [3], illustrating their escalating importance in the contemporary job market.

#### 3. Machine Learning Engineers and Scientists:

Parallel to the rise of data scientists has been the growth in demand for machine learning engineers and scientists. As industries increasingly turn to automation and artificial intelligence, individuals skilled in creating, testing, and applying machine learning models have become indispensable [4]. These professionals bridge the gap between theoretical advances in machine learning and their practical application in business settings.

#### 4. Big Data Engineers and the Infrastructure Challenge:

With the exponential growth in data comes the challenge of storing, processing, and accessing this data efficiently. Enter the big data engineer, responsible for building robust and scalable data infrastructure. Their expertise ensures that the vast streams of data flowing into organizations are structured and accessible, paving the way for meaningful analysis.

#### 5. Crafting the Blueprint: Data Architects:

Data architects play the pivotal role of designing data systems and structures. Their vision helps lay down the roadmap for how data will be stored, utilized, and integrated across platforms and departments [6]. As businesses transition to being more data-centric, ensuring a sound architecture becomes fundamental, and as a result, data architects have found themselves in increasing demand.

#### 6. The Principal Data Scientist and Leadership in Data:

While entry and mid-level data roles are burgeoning, there has also been a rise in leadership positions like the principal data scientist. These individuals don't just possess technical know-how; they also carry a strategic mindset that helps steer data-driven initiatives at an organizational level. They play a role in both hands-on data analysis and guiding a team to align with larger business objectives.

#### 7. Conclusion and Future Prospects:

The rise of these roles underlines the larger trend towards a world increasingly driven by data. As businesses continue to recognize the power of data in providing a competitive edge, the demand for these professions is only set to grow. Individuals looking to future-proof their careers would do well to consider the expansive and promising field of data science and its associated roles.

1. [How much data is generated each day?](https://www.weforum.org/agenda/2019/04/how-much-data-is-generated-each-day-cf4bddf29f/)
2. [Data Scientist: The Sexiest Job of the 21st Century](https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century)

In [1]:
# data
import pandas as pd
import numpy as np

# visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from wordcloud import WordCloud

## Importing the dataset from the file

In [2]:
df = pd.read_csv('./data/ds_salaries.csv')
df.head(10)

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L
5,5,2020,EN,FT,Data Analyst,72000,USD,72000,US,100,US,L
6,6,2020,SE,FT,Lead Data Scientist,190000,USD,190000,US,100,US,S
7,7,2020,MI,FT,Data Scientist,11000000,HUF,35735,HU,50,HU,L
8,8,2020,MI,FT,Business Data Analyst,135000,USD,135000,US,100,US,L
9,9,2020,SE,FT,Lead Data Engineer,125000,USD,125000,NZ,50,NZ,S


#### Detailed info about the columns and their respective values is given below:

1. work_year: The year the salary was paid

2. experience_level: The experience level in the job during the year with the following possible values:

- EN = Entry-level / Junior;
- MI = Mid-level / Intermediate;
- SE = Senior-level / Expert;
- EX = Executive-level / Director

3. employment_type: The type of employment for the role:

- PT = Part-time;
- FT = Full-time;
- CT = Contract;
- FL = Freelance;

4. job_title: The role worked in during the year.

5. salary: The total gross salary amount paid.

6. salary_currency: The currency of the salary paid as an ISO 4217 currency code.

7. salary_in_usd: The salary in USD (FX rate divided by avg. USD rate for the respective year via fxdata.foorilla.com).

8. employee_residence: Employee's primary country of residence during the work year as an ISO 3166 country code (Alpha-2 code).

9. remote_ratio: The overall amount of work done remotely, possible values are as follows:

- 0 = No remote work (less than 20%);
- 50 = Partially remote;
- 100 = Fully remote (more than 80%)

10. company_location: The country of the employer's main office or contracting branch as an ISO 3166 country code(Alpha-2 code).

11. company_size: The average number of people that worked for the company during the year:

- S = less than 50 employees (small);
- M = 50 to 250 employees (medium);
- L = more than 250 employees (large)
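To make the relationship between `salary`, `salary_currency` and `salary_in_usd` concrete, the sketch below recomputes the conversion rate implied by three non-USD rows from the preview above. The `sample` frame and the `implied_rate` column are illustrative names that are not part of the dataset, and the rates are only what these three rows imply, not official FX figures.

```python
import pandas as pd

# Three non-USD rows copied from the df.head(10) preview above
sample = pd.DataFrame({
    "salary": [70000, 85000, 11000000],
    "salary_currency": ["EUR", "GBP", "HUF"],
    "salary_in_usd": [79833, 109024, 35735],
})

# USD value of one unit of the local currency, as implied by each row
# (hypothetical helper column, added only for illustration)
sample["implied_rate"] = sample["salary_in_usd"] / sample["salary"]
print(sample)
```

Because the conversion uses a yearly average rate, rows from different years in the same currency would be expected to imply slightly different rates.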



## 1. Data Pre-Processing

We begin by finding the null values within the dataframe.

#### 1.1 Finding the NULL and N/A values within the dataframe


In [3]:
# Count the missing values in each column
null_counts = df.isnull().sum()
print("The NULL counts are:\n",null_counts)

# Same check through isna()
na_counts = df.isna().sum()
print("\nThe N/A counts are:\n",na_counts)

The NULL counts are:
 Unnamed: 0            0
work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

The N/A counts are:
 Unnamed: 0            0
work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64
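The two outputs above are identical because, in pandas, `isnull` is an alias of `isna`, so the second check cannot find anything the first one missed. A minimal sketch on a toy frame (the `toy` data is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative data only)
toy = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})

# Both methods flag exactly the same cells
same = toy.isnull().equals(toy.isna())
print(same)
print(toy.isna().sum())
```

In practice a single call to either method is enough for the missing-value check.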


In [4]:
# Drop the first column ('Unnamed: 0'), a leftover row index from the CSV file
df = df.iloc[:,1:]
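An alternative to slicing the column away afterwards is to tell `read_csv` that the first column is the row index. The sketch below uses a small in-memory CSV (`csv_text` is a made-up stand-in for the real file) to show the difference; whether this fits depends on how the CSV was written.

```python
import io
import pandas as pd

# A tiny stand-in for a CSV that was saved together with its row index
csv_text = ",work_year,salary_in_usd\n0,2020,79833\n1,2020,260000\n"

# Default read: the unlabeled first column shows up as 'Unnamed: 0'
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the index instead of a data column
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```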

#### 1.3 Replacing values within dataframe

There are no NULL or N/A values, so we move on to replacing some values within the dataframe. The abbreviations in the 'experience_level', 'employment_type', 'employee_residence', 'company_location' and 'company_size' columns are replaced with their full names, and the numeric 'remote_ratio' values are replaced with descriptive labels.

In [5]:
# Replacing experience level (assigning the result back avoids chained inplace=True,
# which newer pandas versions warn about and which may leave df unchanged)
experience_level = {'EN':'Entry-level/Junior',
                    'MI':'Mid-level/Intermediate',
                    'SE':'Senior-level/Expert',
                    'EX':'Executive-level/Director'}
df['experience_level'] = df['experience_level'].map(experience_level)

# Replacing employment type
employment_type = {'FT':'Full Time',
                    'PT':'Part Time',
                    'CT':'Contract',
                    'FL':'Freelance'}

df['employment_type'] = df['employment_type'].map(employment_type)

# Replacing remote ratio
remote_ratio = {100:'Fully Remote', 50:'Partially Remote', 0:'On-site'}
df['remote_ratio'] = df['remote_ratio'].map(remote_ratio)

# ISO code
ISO_CODE = {
    'AD': 'Andorra',
	'AE': 'United Arab Emirates',
	'AF': 'Afghanistan',
	'AG': 'Antigua & Barbuda',
	'AI': 'Anguilla',
	'AL': 'Albania',
	'AM': 'Armenia',
	'AN': 'Netherlands Antilles',
	'AO': 'Angola',
	'AQ': 'Antarctica',
	'AR': 'Argentina',
	'AS': 'American Samoa',
	'AT': 'Austria',
	'AU': 'Australia',
	'AW': 'Aruba',
	'AZ': 'Azerbaijan',
	'BA': 'Bosnia and Herzegovina',
	'BB': 'Barbados',
	'BD': 'Bangladesh',
	'BE': 'Belgium',
	'BF': 'Burkina Faso',
	'BG': 'Bulgaria',
	'BH': 'Bahrain',
	'BI': 'Burundi',
	'BJ': 'Benin',
	'BM': 'Bermuda',
	'BN': 'Brunei Darussalam',
	'BO': 'Bolivia',
	'BR': 'Brazil',
	'BS': 'Bahamas',
	'BT': 'Bhutan',
	'BU': 'Burma (no longer exists)',
	'BV': 'Bouvet Island',
	'BW': 'Botswana',
	'BY': 'Belarus',
	'BZ': 'Belize',
	'CA': 'Canada',
	'CC': 'Cocos (Keeling) Islands',
	'CF': 'Central African Republic',
	'CG': 'Congo',
	'CH': 'Switzerland',
	'CI': 'Côte D\'ivoire (Ivory Coast)',
	'CK': 'Cook Islands',
	'CL': 'Chile',
	'CM': 'Cameroon',
	'CN': 'China',
	'CO': 'Colombia',
	'CR': 'Costa Rica',
	'CS': 'Czechoslovakia (no longer exists)',
	'CU': 'Cuba',
	'CV': 'Cape Verde',
	'CX': 'Christmas Island',
	'CY': 'Cyprus',
	'CZ': 'Czech Republic',
	'DD': 'German Democratic Republic (no longer exists)',
	'DE': 'Germany',
	'DJ': 'Djibouti',
	'DK': 'Denmark',
	'DM': 'Dominica',
	'DO': 'Dominican Republic',
	'DZ': 'Algeria',
	'EC': 'Ecuador',
	'EE': 'Estonia',
	'EG': 'Egypt',
	'EH': 'Western Sahara',
	'ER': 'Eritrea',
	'ES': 'Spain',
	'ET': 'Ethiopia',
	'FI': 'Finland',
	'FJ': 'Fiji',
	'FK': 'Falkland Islands (Malvinas)',
	'FM': 'Micronesia',
	'FO': 'Faroe Islands',
	'FR': 'France',
	'FX': 'France, Metropolitan',
	'GA': 'Gabon',
	'GB': 'United Kingdom (Great Britain)',
	'GD': 'Grenada',
	'GE': 'Georgia',
	'GF': 'French Guiana',
	'GH': 'Ghana',
	'GI': 'Gibraltar',
	'GL': 'Greenland',
	'GM': 'Gambia',
	'GN': 'Guinea',
	'GP': 'Guadeloupe',
	'GQ': 'Equatorial Guinea',
	'GR': 'Greece',
	'GS': 'South Georgia and the South Sandwich Islands',
	'GT': 'Guatemala',
	'GU': 'Guam',
	'GW': 'Guinea-Bissau',
	'GY': 'Guyana',
	'HK': 'Hong Kong',
	'HM': 'Heard & McDonald Islands',
	'HN': 'Honduras',
	'HR': 'Croatia',
	'HT': 'Haiti',
	'HU': 'Hungary',
	'ID': 'Indonesia',
	'IE': 'Ireland',
	'IL': 'Israel',
	'IN': 'India',
	'IO': 'British Indian Ocean Territory',
	'IQ': 'Iraq',
	'IR': 'Islamic Republic of Iran',
	'IS': 'Iceland',
	'IT': 'Italy',
	'JM': 'Jamaica',
	'JO': 'Jordan',
	'JP': 'Japan',
	'KE': 'Kenya',
	'KG': 'Kyrgyzstan',
	'KH': 'Cambodia',
	'KI': 'Kiribati',
	'KM': 'Comoros',
	'KN': 'St. Kitts and Nevis',
	'KP': 'Korea, Democratic People\'s Republic of',
	'KR': 'Korea, Republic of',
	'KW': 'Kuwait',
	'KY': 'Cayman Islands',
	'KZ': 'Kazakhstan',
	'LA': 'Lao People\'s Democratic Republic',
	'LB': 'Lebanon',
	'LC': 'Saint Lucia',
	'LI': 'Liechtenstein',
	'LK': 'Sri Lanka',
	'LR': 'Liberia',
	'LS': 'Lesotho',
	'LT': 'Lithuania',
	'LU': 'Luxembourg',
	'LV': 'Latvia',
	'LY': 'Libyan Arab Jamahiriya',
	'MA': 'Morocco',
	'MC': 'Monaco',
	'MD': 'Moldova, Republic of',
	'MG': 'Madagascar',
	'MH': 'Marshall Islands',
	'ML': 'Mali',
	'MN': 'Mongolia',
	'MM': 'Myanmar',
	'MO': 'Macau',
	'MP': 'Northern Mariana Islands',
	'MQ': 'Martinique',
	'MR': 'Mauritania',
	'MS': 'Montserrat',
	'MT': 'Malta',
	'MU': 'Mauritius',
	'MV': 'Maldives',
	'MW': 'Malawi',
	'MX': 'Mexico',
	'MY': 'Malaysia',
	'MZ': 'Mozambique',
	'NA': 'Namibia',
	'NC': 'New Caledonia',
	'NE': 'Niger',
	'NF': 'Norfolk Island',
	'NG': 'Nigeria',
	'NI': 'Nicaragua',
	'NL': 'Netherlands',
	'NO': 'Norway',
	'NP': 'Nepal',
	'NR': 'Nauru',
	'NT': 'Neutral Zone (no longer exists)',
	'NU': 'Niue',
	'NZ': 'New Zealand',
	'OM': 'Oman',
	'PA': 'Panama',
	'PE': 'Peru',
	'PF': 'French Polynesia',
	'PG': 'Papua New Guinea',
	'PH': 'Philippines',
	'PK': 'Pakistan',
	'PL': 'Poland',
	'PM': 'St. Pierre & Miquelon',
	'PN': 'Pitcairn',
	'PR': 'Puerto Rico',
	'PT': 'Portugal',
	'PW': 'Palau',
	'PY': 'Paraguay',
	'QA': 'Qatar',
	'RE': 'Réunion',
	'RO': 'Romania',
	'RU': 'Russian Federation',
	'RW': 'Rwanda',
	'SA': 'Saudi Arabia',
	'SB': 'Solomon Islands',
	'SC': 'Seychelles',
	'SD': 'Sudan',
	'SE': 'Sweden',
	'SG': 'Singapore',
	'SH': 'St. Helena',
	'SI': 'Slovenia',
	'SJ': 'Svalbard & Jan Mayen Islands',
	'SK': 'Slovakia',
	'SL': 'Sierra Leone',
	'SM': 'San Marino',
	'SN': 'Senegal',
	'SO': 'Somalia',
	'SR': 'Suriname',
	'ST': 'Sao Tome & Principe',
	'SU': 'Union of Soviet Socialist Republics (no longer exists)',
	'SV': 'El Salvador',
	'SY': 'Syrian Arab Republic',
	'SZ': 'Swaziland',
	'TC': 'Turks & Caicos Islands',
	'TD': 'Chad',
	'TF': 'French Southern Territories',
	'TG': 'Togo',
	'TH': 'Thailand',
	'TJ': 'Tajikistan',
	'TK': 'Tokelau',
	'TM': 'Turkmenistan',
	'TN': 'Tunisia',
	'TO': 'Tonga',
	'TP': 'East Timor',
	'TR': 'Turkey',
	'TT': 'Trinidad & Tobago',
	'TV': 'Tuvalu',
	'TW': 'Taiwan, Province of China',
	'TZ': 'Tanzania, United Republic of',
	'UA': 'Ukraine',
	'UG': 'Uganda',
	'UM': 'United States Minor Outlying Islands',
	'US': 'United States of America',
	'UY': 'Uruguay',
	'UZ': 'Uzbekistan',
	'VA': 'Vatican City State (Holy See)',
	'VC': 'St. Vincent & the Grenadines',
	'VE': 'Venezuela',
	'VG': 'British Virgin Islands',
	'VI': 'United States Virgin Islands',
	'VN': 'Viet Nam',
	'VU': 'Vanuatu',
	'WF': 'Wallis & Futuna Islands',
	'WS': 'Samoa',
	'YD': 'Democratic Yemen (no longer exists)',
	'YE': 'Yemen',
	'YT': 'Mayotte',
	'YU': 'Yugoslavia',
	'ZA': 'South Africa',
	'ZM': 'Zambia',
	'ZR': 'Zaire',
	'ZW': 'Zimbabwe',
	'ZZ': 'Unknown or unspecified country',
}

df['employee_residence'] = df['employee_residence'].map(ISO_CODE)
df['company_location'] = df['company_location'].map(ISO_CODE)

# Replacing company size
company_size = {'S': 'Small',
                'M': 'Medium',
                'L': 'Large'}
df['company_size'] = df['company_size'].map(company_size)

In [6]:
df.head(10)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,Mid-level/Intermediate,Full Time,Data Scientist,70000,EUR,79833,Germany,On-site,Germany,Large
1,2020,Senior-level/Expert,Full Time,Machine Learning Scientist,260000,USD,260000,Japan,On-site,Japan,Small
2,2020,Senior-level/Expert,Full Time,Big Data Engineer,85000,GBP,109024,United Kingdom (Great Britain),Partially Remote,United Kingdom (Great Britain),Medium
3,2020,Mid-level/Intermediate,Full Time,Product Data Analyst,20000,USD,20000,Honduras,On-site,Honduras,Small
4,2020,Senior-level/Expert,Full Time,Machine Learning Engineer,150000,USD,150000,United States of America,Partially Remote,United States of America,Large
5,2020,Entry-level/Junior,Full Time,Data Analyst,72000,USD,72000,United States of America,Fully Remote,United States of America,Large
6,2020,Senior-level/Expert,Full Time,Lead Data Scientist,190000,USD,190000,United States of America,Fully Remote,United States of America,Small
7,2020,Mid-level/Intermediate,Full Time,Data Scientist,11000000,HUF,35735,Hungary,Partially Remote,Hungary,Large
8,2020,Mid-level/Intermediate,Full Time,Business Data Analyst,135000,USD,135000,United States of America,Fully Remote,United States of America,Large
9,2020,Senior-level/Expert,Full Time,Lead Data Engineer,125000,USD,125000,New Zealand,Partially Remote,New Zealand,Small


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           607 non-null    int64 
 1   experience_level    607 non-null    object
 2   employment_type     607 non-null    object
 3   job_title           607 non-null    object
 4   salary              607 non-null    int64 
 5   salary_currency     607 non-null    object
 6   salary_in_usd       607 non-null    int64 
 7   employee_residence  605 non-null    object
 8   remote_ratio        607 non-null    object
 9   company_location    607 non-null    object
 10  company_size        607 non-null    object
dtypes: int64(3), object(8)
memory usage: 52.3+ KB
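The `df.info()` output above reports 605 non-null values for `employee_residence` out of 607 rows, which suggests that two rows carried a country code with no entry in the mapping dictionary, since `Series.map` returns NaN for keys it cannot find. The sketch below shows one way to list such codes and keep the original code as a fallback. It has to run on the raw codes, before the column is overwritten; `iso_subset` and the code `'XX'` are hypothetical stand-ins, not values taken from the dataset.

```python
import pandas as pd

# Stand-in for the full ISO_CODE dictionary (only two entries for brevity)
iso_subset = {"DE": "Germany", "JP": "Japan"}

# Raw country codes; 'XX' is a hypothetical code that the dictionary lacks
codes = pd.Series(["DE", "JP", "XX", "DE"], name="employee_residence")

mapped = codes.map(iso_subset)

# Codes that the dictionary could not translate
unmapped = sorted(codes[mapped.isna()].unique())
print(unmapped)

# Optional fallback: keep the raw code where no full name is known
mapped_with_fallback = mapped.fillna(codes)
print(mapped_with_fallback.tolist())
```

Once the missing codes are known, adding them to the dictionary is the cleaner fix; the fallback only keeps the rows from turning into missing values.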


In [8]:
df.describe()

Unnamed: 0,work_year,salary,salary_in_usd
count,607.0,607.0,607.0
mean,2021.405272,324000.1,112297.869852
std,0.692133,1544357.0,70957.259411
min,2020.0,4000.0,2859.0
25%,2021.0,70000.0,62726.0
50%,2022.0,115000.0,101570.0
75%,2022.0,165000.0,150000.0
max,2022.0,30400000.0,600000.0


In [9]:
df.nunique()

work_year               3
experience_level        4
employment_type         4
job_title              50
salary                272
salary_currency        17
salary_in_usd         369
employee_residence     55
remote_ratio            3
company_location       50
company_size            3
dtype: int64

In [10]:
# Finding duplicates
df.duplicated().sum()

42
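The count above only says that 42 rows repeat an earlier row exactly. A short sketch of how such rows could be inspected and, if desired, dropped is given below on a toy frame (made-up data). Since the dataset has no respondent identifier, identical rows may also be distinct people with the same job and salary, so dropping them is a judgment call rather than a required step.

```python
import pandas as pd

# Toy frame in which the second row repeats the first (illustrative data only)
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Data Engineer"],
    "salary_in_usd": [100000, 100000, 120000],
})

# Rows that repeat an earlier row
n_dupes = int(toy.duplicated().sum())

# keep=False marks every copy, which makes the repeated groups easy to inspect
dupe_rows = toy[toy.duplicated(keep=False)]

# Drop the repeats and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(dupe_rows), len(deduped))
```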

## 2. Exploratory data analysis

This part will delve deeper into how the data is distributed and perform further analysis of how the columns relate to one another.

#### 2.1 Job distribution in accordance with experience level

This gives a better picture of job availability at each experience level, which indicates the vitality of the job market and helps in understanding the current job market scenario.

In [11]:
exp_level = df['experience_level'].value_counts()
fig = px.pie(
    values = exp_level.values,
    names = exp_level.index,
    color_discrete_sequence=px.colors.sequential.Mint,
    title='Pie distribution in accordance with the level of experience',
    template='plotly_dark'
)
fig.update_traces(
    textinfo='label+percent+value',
    textfont_size=10,
    marker=dict(line=dict(color='#100000', width=0.2)))
fig.show()

#### 2.2 Job demand as per the job title

We will plot a bar plot of the number of entries per job title to understand the job demand for each title.

In [12]:
job_title_list = df['job_title'].value_counts()
fig = px.bar(
    y = job_title_list.values,
    x = job_title_list.index,
    color = job_title_list.index,
    color_discrete_sequence = px.colors.sequential.Mint,
    text = job_title_list.values,
    title = 'Bar plot distribution in accordance with the job title',
    template = 'plotly_dark'
)
fig.update_layout(
    xaxis_title="Job Titles",
    yaxis_title="count",
    font = dict(size=10,family="Franklin Gothic"))
fig.show()

#### 2.3 Job demand as per the type of employment

We will plot a bar plot of the number of entries per employment type to understand the job demand for each type of employment.

In [13]:
employment_type_list = df['employment_type'].value_counts()
fig = px.bar(
    y = employment_type_list.values,
    x = employment_type_list.index,
    color = employment_type_list.index,
    color_discrete_sequence = px.colors.sequential.Mint,
    text = employment_type_list.values,
    title = 'Bar plot distribution in accordance with the type of employment',
    template = 'plotly_dark'
)
fig.update_layout(
    xaxis_title="Type of employment",
    yaxis_title="count",
    font = dict(size=10,family="Franklin Gothic"))
fig.show()

#### 2.4 Employee residence and company location

We will plot a grouped bar plot with the employee residence and the company location side by side to compare the two distributions.

In [14]:
comp_loc = df['company_location'].value_counts()
emp_loc = df['employee_residence'].value_counts()
fig = go.Figure(data=[
    go.Bar(name='Company Location', x=comp_loc.index, y=comp_loc.values, text=comp_loc.values, marker_color='Red'),
    go.Bar(name='Employee Location', x=emp_loc.index, y=emp_loc.values, text=emp_loc.values, marker_color='Blue')
])
fig.update_layout(barmode='group', xaxis_tickangle=-45,
                  title='Employee Location and Company Location',template='plotly_dark',
                  font = dict(size=10,family="Franklin Gothic"))
fig.show()

#### 2.5 Company size distribution

We will plot the distribution of company sizes to get a better idea of which kinds of companies drive the job market demand.

In [15]:
group_dist = df['company_size'].value_counts()
fig = px.bar(x = group_dist.index, y = group_dist.values,
             text = group_dist.values,
             color = group_dist.index,
             color_discrete_sequence=px.colors.sequential.Inferno,
             title = 'Company Size distribution',
             template = 'plotly_dark'
             )
fig.update_layout(
    xaxis_title="Company Size",
    yaxis_title="count",
    font = dict(size=10,family="Franklin Gothic"))
fig.show()

#### 2.6 Remote work distribution

We will also plot a bar plot of the remote ratio to see how the jobs are split between on-site, partially remote and fully remote work.

In [16]:
rem_rat = df['remote_ratio'].value_counts()
fig = px.bar(x = rem_rat.index, y = rem_rat.values,
             text = rem_rat.values,
             color = rem_rat.index,
             color_discrete_sequence=px.colors.sequential.deep,
             title = 'Remote ratio work distribution',
             template = 'plotly_dark'
             )
fig.update_layout(
    xaxis_title="Remote ratio",
    yaxis_title="count",
    font = dict(size=10,family="Franklin Gothic"))
fig.show()

#### 2.7 Salary based on company location

Next, we plot the average salary for data jobs across the world, grouped by company location.

In [19]:
sal_dis = df.groupby('company_location')['salary_in_usd'].mean()
fig = px.bar(x = sal_dis.index, y = sal_dis.values,
             text = sal_dis.values,
             color = sal_dis.index,
             color_discrete_sequence=px.colors.sequential.Agsunset,
             title = 'Avg salary distribution based on its location',
             template = 'plotly_dark'
             )
fig.update_layout(
    xaxis_title="Company location",
    yaxis_title="Average salary (in USD)",
    font = dict(size=10,family="Franklin Gothic"))
fig.show()
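A per-country mean can be driven by a single row when a location has very few entries. The sketch below, built on five rows taken from the preview earlier in the notebook, shows how the row count and the median could be reported next to the mean; the `toy` and `summary` names are illustrative only.

```python
import pandas as pd

# Five rows from the df.head(10) preview (company_location, salary_in_usd)
toy = pd.DataFrame({
    "company_location": ["United States of America"] * 4 + ["Japan"],
    "salary_in_usd": [150000, 72000, 190000, 135000, 260000],
})

# Count, mean and median per location, highest mean first
summary = (
    toy.groupby("company_location")["salary_in_usd"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Here Japan ranks first on the mean, but the `count` column shows that this rests on a single row, which is the kind of caveat the bar chart alone does not reveal.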

#### 2.8 Job percentage as per the year

This pie chart will show how the entries are distributed across the years.

In [20]:
year_dis = df['work_year'].value_counts()
fig = px.pie(
    values = year_dis.values,
    names = year_dis.index,
    color_discrete_sequence=px.colors.sequential.Plasma,
    title='Pie distribution in accordance with the year',
    template='plotly_dark'
)
fig.update_traces(
    textinfo='label+percent+value',
    textfont_size=10,
    marker=dict(line=dict(color='#100000', width=0.2)))
fig.show()

#### 2.9 Job title w.r.t experience level

This stacked bar plot shows, for each job title, the number of entries at each experience level, so the height of a bar gives the total count for that title.

In [40]:
job_level = df.groupby(['experience_level','job_title']).size()

# One Series of job-title counts per experience level
# ('executive' is used instead of 'exec', which shadows a Python builtin)
entry = job_level['Entry-level/Junior'].sort_values(ascending=False)
executive = job_level['Executive-level/Director'].sort_values(ascending=False)
mid = job_level['Mid-level/Intermediate'].sort_values(ascending=False)
senior = job_level['Senior-level/Expert'].sort_values(ascending=False)

# Each trace plots the counts of its own experience level
fig = go.Figure(
    data = [
        go.Bar(name='Entry-level/Junior', x=entry.index, y=entry.values, text=entry.values, marker_color='Red'),
        go.Bar(name='Executive-level/Director', x=executive.index, y=executive.values, text=executive.values, marker_color='Green'),
        go.Bar(name='Mid-level/Intermediate', x=mid.index, y=mid.values, text=mid.values, marker_color='Blue'),
        go.Bar(name='Senior-level/Expert', x=senior.index, y=senior.values, text=senior.values, marker_color='Yellow'),
    ]
)
fig.update_layout(
    barmode = 'stack', xaxis_tickangle = -45,
    title = 'Job title for each type of experience level',
    font = dict(family="Franklin Gothic", size=10),
    template='plotly_dark'
)
fig.show()
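The per-level counts used for the stacked bars can also be built in one step with `pd.crosstab`, which returns one row per job title and one column per experience level and fills missing combinations with 0. This is a sketch on four made-up rows, not the notebook's own method.

```python
import pandas as pd

# Four illustrative rows using the labels assigned during pre-processing
toy = pd.DataFrame({
    "experience_level": [
        "Mid-level/Intermediate", "Senior-level/Expert",
        "Mid-level/Intermediate", "Entry-level/Junior",
    ],
    "job_title": [
        "Data Scientist", "Lead Data Scientist",
        "Data Scientist", "Data Analyst",
    ],
})

# Rows: job titles, columns: experience levels, cells: number of entries
counts = pd.crosstab(toy["job_title"], toy["experience_level"])
print(counts)
```

Each column of `counts` could then feed one trace, for example `go.Bar(name=level, x=counts.index, y=counts[level])`, so that every trace shares the same x-axis categories.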

#### 2.10 Company size w.r.t experience level

This stacked bar plot shows, for each company size, the number of entries at each experience level, so the height of a bar gives the total count for that company size.

In [41]:
comp_level = df.groupby(['experience_level','company_size']).size()

# One Series of company-size counts per experience level
# ('executive' is used instead of 'exec', which shadows a Python builtin)
entry = comp_level['Entry-level/Junior'].sort_values(ascending=False)
executive = comp_level['Executive-level/Director'].sort_values(ascending=False)
mid = comp_level['Mid-level/Intermediate'].sort_values(ascending=False)
senior = comp_level['Senior-level/Expert'].sort_values(ascending=False)

# Each trace plots the counts of its own experience level
fig = go.Figure(
    data = [
        go.Bar(name='Entry-level/Junior', x=entry.index, y=entry.values, text=entry.values, marker_color='Red'),
        go.Bar(name='Executive-level/Director', x=executive.index, y=executive.values, text=executive.values, marker_color='Green'),
        go.Bar(name='Mid-level/Intermediate', x=mid.index, y=mid.values, text=mid.values, marker_color='Blue'),
        go.Bar(name='Senior-level/Expert', x=senior.index, y=senior.values, text=senior.values, marker_color='Yellow'),
    ]
)
fig.update_layout(
    barmode = 'stack', xaxis_tickangle = -45,
    title = 'Company size for each type of experience level',
    font = dict(family="Franklin Gothic", size=10),
    template='plotly_dark'
)
fig.show()

## 3 Multi-variable analysis

In this part of the project we will try to establish how different columns relate to one another by comparing several of them at once.

#### 3.1 Finding correlation between experience level, salary and company size

We will plot box plots of the salary for each experience level, split by company size, and compare the groups in the figure.

In [32]:
exp_list = ['Entry-level/Junior','Mid-level/Intermediate','Senior-level/Expert','Executive-level/Director']
fig = px.box(df, 
             x='experience_level', 
             y='salary_in_usd', 
             category_orders={"experience_level": exp_list},
             color='company_size',
             title='Correlation between experience level, salary, and company size',
             labels={'experience_level': 'Experience level', 'salary_in_usd': 'Salary'},
             template='plotly_dark'
            )
fig.show()
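The box plot gives a visual impression only. To put a number on the relationship between experience level and salary, one option is a Spearman rank correlation, computed here as the Pearson correlation of the ranks after encoding the experience levels as ordered integers. This is a sketch under assumptions: the `level_rank` encoding is a choice made for illustration, and the six rows are taken from the preview earlier in the notebook, so the resulting value says nothing about the full dataset.

```python
import pandas as pd

# Ordinal encoding of the experience levels (an assumption for this sketch)
level_rank = {
    "Entry-level/Junior": 0,
    "Mid-level/Intermediate": 1,
    "Senior-level/Expert": 2,
    "Executive-level/Director": 3,
}

# First six rows of the preview: experience level and salary in USD
toy = pd.DataFrame({
    "experience_level": [
        "Mid-level/Intermediate", "Senior-level/Expert", "Senior-level/Expert",
        "Mid-level/Intermediate", "Senior-level/Expert", "Entry-level/Junior",
    ],
    "salary_in_usd": [79833, 260000, 109024, 20000, 150000, 72000],
})

level = toy["experience_level"].map(level_rank)

# Spearman correlation = Pearson correlation of the (average) ranks
spearman = level.rank().corr(toy["salary_in_usd"].rank())
print(round(spearman, 3))
```

A value close to 1 would indicate that higher experience levels tend to go with higher salaries; repeating the calculation per company size would mirror the split used in the box plot.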

#### 3.2 Finding correlation between remote ratio, salary and employment type

We will plot box plots of the salary for each remote ratio, split by employment type, and compare the groups in the figure.

In [33]:
fig = px.box(df, 
             x='remote_ratio', 
             y='salary_in_usd', 
             color='employment_type',
             title='Correlation between remote ratio, salary, and employment type',
             labels={'remote_ratio': 'Remote ratio', 'salary_in_usd': 'Salary'},
             template='plotly_dark'
            )
fig.show()

#### 3.3 Finding correlation between experience level, salary and job title on a year basis

We will make an animated scatter plot with one frame per year and analyze the figure to find the correlation.

In [47]:
px.scatter(df, x = 'salary_in_usd', y = 'experience_level',
           size = 'salary_in_usd',
           hover_name = 'job_title',
           color = 'job_title', 
           color_discrete_sequence=px.colors.sequential.Agsunset, template = 'plotly_dark',
           animation_frame = 'work_year',
           title = 'Correlation between experience level, salary and job title on a year basis'
           # Order the y-axis with the full labels assigned during pre-processing
           ).update_yaxes(categoryorder = 'array',
                          categoryarray = ['Entry-level/Junior', 'Mid-level/Intermediate', 'Senior-level/Expert', 'Executive-level/Director'])

# Conclusion

The prominence of data science roles is on a marked rise. 

For professionals eyeing the zenith of pay scales, the United States emerges as the front-runner. However, it's essential to contextualize this perspective. While the U.S. does offer higher salaries, a holistic understanding would entail examining factors like the cost of living, healthcare provisions, and more.

Larger and medium-sized companies tend to be more generous with compensation compared to their smaller counterparts.

As we stride into 2022, contract-based and full-time positions appear to be the most lucrative employment types.

Roles like Data Engineers, Data Scientists, and Machine Learning Engineers stand out, commanding noteworthy average salaries, showcasing their market value.

The ascendancy of remote positions is evident, both in terms of popularity and pay scale. This shift could arguably be attributed to the recent pandemic.

A significant uptick in salary can be observed as one transitions into senior-level roles.

It's imperative to note that a substantial portion of our dataset originates from the U.S., where wages are notably higher than in other nations. Consequently, the presented average salaries might not be truly representative of global trends. Using them as a yardstick for global salary expectations might be misleading.

Concluding from the analysis, venturing into a data science-centric career path seems promising, especially considering the salary prospects and flexibility of remote work.
