<div style = "display: fill;
              border-radius: 5px;
              background-color: #68BB59;
              border-color:red">
    <h1 style = "padding: 15px; 
                 color: black;
                 text-align:center;
                 font-family: Trebuchet MS;"> Data Science Job Salaries Exploratory Data Analysis (EDA)
    </h1>
</div>

![](https://www.northeastern.edu/graduate/blog/wp-content/uploads/2020/06/iStock-1221293664-1.jpg)


<div style = "display: fill;
              border-radius: 5px;
              background-color: #68BB59;
              border-color:red">
    <h1 style = "padding: 15px; 
                 color: black;
                 text-align:center;
                 font-family: Trebuchet MS;"> Why data science?
    </h1>
</div>

Data science is a field of study that deals with extracting, processing, and analyzing data to find patterns and build predictive models that drive decision making. Careers in data science range from data analyst to data engineer and data scientist; each has a different scope and uses a different skill set. Data science is one of the fastest-growing occupations, offering competitive careers and a flexible work-life balance.

**The objective of this project is to explore the top careers in data science, their job outlook, and their growth from 2020 until now.**

<div style = "display: fill;
              border-radius: 5px;
              background-color: #68BB59;
              border-color:red">
    <h1 style = "padding: 15px; 
                 color: black;
                 text-align:center;
                 font-family: Trebuchet MS;"> Important Skills and Tools in Data Science
    </h1>
</div>

**Important analytical skills:**
* Data Visualization (Tableau, Power BI, etc.)
* Data Cleaning
* Matlab
* R
* Python
* SQL/MySQL
* Microsoft Excel

**In addition to these, data scientists should also be fluent with:**

* Multivariable calculus and linear algebra
* Machine Learning

**Must-have soft skills:**
* Critical Thinking
* Communication and Presentation Skills

<div style = "display: fill;
              border-radius: 5px;
              background-color: #68BB59;
              border-color:red">
    <h1 style = "padding: 15px; 
                 color: black;
                 text-align:center;
                 font-family: Trebuchet MS;"> Data Cleaning and Preprocessing
    </h1>
</div>

<font size="5">Importing required libraries</font>

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
df = pd.read_csv('../input/data-science-job-salaries/ds_salaries.csv')
#load the file using pandas

**Data preparation and cleaning: what needs to be done**
* Get an overview of the data set
* Clean the data (if any cleaning is needed)
* Adjust and rename columns and variables where needed

In [None]:
df.head()
#quick look at what data table looks like

In [None]:
df.columns
#overview of all columns

In [None]:
df.info()
#overview of df

<div style="color:black;
            display:fill;
            border-radius:1px;
            background-color:LIGHTGREEN;
            font-size:80%;
            font-family:Verdana;
            letter-spacing:1px">
    
<font size="5">**Dataset Description**</font>
1. **work_year**: The year the salary was paid.
2. **experience_level**: The experience level in the job during the year with the following possible values:

* EN = Entry level or Junior
* MI = Mid level or Intermediate
* SE = Senior level or Expert
* EX = Executive level or Director

3. **employment_type**: The type of employment for the role:

* PT = Part time
* FT = Full time
* CT = Contract
* FL = Freelance

4. **job_title**: The role worked in during the year.
5. **salary**: The total gross salary amount paid.
6. **salary_currency**: The currency of the salary paid as an ISO 4217 currency code.
7. **salary_in_usd**: The salary in USD.
8. **employee_residence**: Employee's primary country of residence during the work year as an ISO 3166 country code.
9. **remote_ratio**: The overall amount of work done remotely, possible values are as follows:
* 0 = No remote work (less than 20%)
* 50 = Partially remote
* 100 = Fully remote (more than 80%)
10. **company_location**: The country of the employer's main office or contracting branch as an ISO 3166 country code.

11. **company_size**: The average number of people that worked for the company during the year:

* S = less than 50 employees 
* M = 50 to 250 employees 
* L = more than 250 employees 
    
</div>

In [None]:
df.describe()

In [None]:
df.isnull().sum()
#see if there's any missing data
#no missing data

In [None]:
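# Replace the coded values with readable labels.
# The result is assigned back to the column instead of using inplace=True on a
# column selection, which newer pandas versions warn about and may not apply to df.
df['experience_level'] = df['experience_level'].replace({'EN':'Entry-Level','MI':'Mid-Level','EX':'Executive Level','SE':'Senior'})
df['employment_type'] = df['employment_type'].replace({'PT':'Part-Time','FT':'Full-Time','CT':'Contract','FL':'Freelance'})
df['company_size'] = df['company_size'].replace({'S':'Small','M':'Medium','L':'Large'})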

In [None]:
df.drop(['Unnamed: 0', 'salary', 'salary_currency', 'employee_residence'], axis=1, inplace=True)

#drop 4 columns that are not needed



<div style = "display: fill;
              border-radius: 5px;
              background-color: #68BB59;
              border-color:red">
    <h1 style = "padding: 15px; 
                 color: black;
                 text-align:center;
                 font-family: Trebuchet MS;">Exploratory Analysis and Visualization
    </h1>
</div>


<font size="5">Univariate Analysis</font>
1. Popular data science jobs (top jobs)
2. Jobs with the highest average salary
3. Experience level
    *   Top jobs at entry level
4. Employment type
5. Remote ratio
6. Company size
7. Company location (where most jobs are located)


<font size="5">What are the top data science jobs right now?</font>

In [None]:
a=df['job_title'].value_counts().head(5)
plt.figure(figsize=(10,6), tight_layout=True)
ax = sns.barplot(x=a.index, y=a.values, palette='Set2', ci=None)
ax.set(title='Top 5 Most Popular Data Science Jobs', xlabel='Job title', ylabel='Total Count')
plt.show()


<div style = "display: fill;
              border-radius: 5px;
              background-color: #E2E5DE;">
    <h4 style = "padding: 15px; 
                 color: black;
                 text-align: left;
                 font-family: Trebuchet MS;"> Data Scientist, Data Engineer, and Data Analyst are the top 3 job titles.
            </h4>
</div>

<font size="5">The top 5 data science jobs with the highest average salary</font>

In [None]:
a=df.groupby('job_title',as_index=False)["salary_in_usd"].mean().sort_values(by='salary_in_usd', ascending=False)
a['salary_in_usd'] = a['salary_in_usd'].round()
ax=a.head(5)
plt.figure(figsize=(10,6), tight_layout=True)
an = sns.barplot(x=ax['job_title'], y=ax['salary_in_usd'], palette='Set2', ci=None)
an.set(title='Top 5 Data Science Jobs with Highest Average Salary', xlabel='Job Title', ylabel='Average Salary in USD')
an.bar_label(an.containers[0])
plt.show()

<div style = "display: fill;
              border-radius: 5px;
              background-color: #E2E5DE;">
    <h4 style = "padding: 15px; 
                 color: black;
                 text-align: left;
                 font-family: Trebuchet MS;"> The top 3 data science jobs with the highest average salary are: data analytics lead, principal data engineer, financial data analyst. 
            </h4>
</div>
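
A mean salary per job title can be driven by a single row when a title is rare, so it may help to look at the row count next to each average. The cell below is a minimal sketch on made-up rows (not taken from `ds_salaries.csv`); the `min_rows` threshold is a hypothetical choice for illustration, not something the dataset prescribes.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from ds_salaries.csv.
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Data Scientist",
                  "Data Analytics Lead", "Data Engineer", "Data Engineer"],
    "salary_in_usd": [100_000, 120_000, 140_000, 400_000, 110_000, 120_000],
})

# Report the mean salary together with the number of rows behind it.
salary_summary = (
    toy.groupby("job_title")["salary_in_usd"]
       .agg(mean_salary="mean", n_rows="count")
       .sort_values("mean_salary", ascending=False)
)
print(salary_summary)

# Hypothetical threshold: keep only titles backed by at least 2 rows.
min_rows = 2
reliable = salary_summary[salary_summary["n_rows"] >= min_rows]
print(reliable)
```

In this toy data the highest average belongs to a title with a single row, which is the kind of case the row count makes visible.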

<font size="5">What is the distribution of jobs based on experience level? </font>

In [None]:
a=df.groupby('experience_level',as_index=False)["salary_in_usd"].count().sort_values(by='salary_in_usd', ascending=False)
colors = sns.color_palette('Set2')
plt.figure(figsize=(10, 6), tight_layout=True)
explode_list = [0, 0, .2,0]
plt.pie(a['salary_in_usd'], labels=a['experience_level'], autopct='%.0f %%', explode=explode_list, pctdistance=.7,
          colors=colors, shadow=True)
plt.title('Total Jobs Based on Experience Level', weight='bold')
plt.show()

<div style = "display: fill;
              border-radius: 5px;
              background-color: #E2E5DE;">
    <h4 style = "padding: 15px; 
                 color: black;
                 text-align: left;
                 font-family: Trebuchet MS;"> Nearly 50% of data science jobs are at the senior experience level. Entry-level jobs represent only 14% of the workforce.
            </h4>
</div>
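
The percentages read off the pie chart can also be computed as numbers. A minimal sketch, assuming made-up rows rather than the real dataset, using `value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up rows for illustration only; the real shares come from ds_salaries.csv.
toy = pd.DataFrame({
    "experience_level": ["Senior"] * 5 + ["Mid-Level"] * 3 + ["Entry-Level"] * 2
})

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
shares = toy["experience_level"].value_counts(normalize=True).mul(100).round(1)
print(shares)
```

The same call on `df['experience_level']` would give the shares behind the chart above.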

<font size="5">What are the top 5 entry level data science jobs?   </font>

In [None]:
a =df[df['experience_level'] == 'Entry-Level']
ax =a.groupby('job_title',as_index=False)["salary_in_usd"].count().sort_values(by='salary_in_usd', ascending=False).head(5)
colors = sns.color_palette('Set2')
plt.figure(figsize=(10, 6), tight_layout=True)
plt.pie(ax['salary_in_usd'], labels=ax['job_title'], autopct='%.0f %%', pctdistance=.7,
          colors=colors, shadow=True)
plt.title('Top 5 Entry Level Data Science Jobs', weight='bold')
plt.show()

<div style = "display: fill;
              border-radius: 5px;
              background-color: #E2E5DE;">
    <h4 style = "padding: 15px; 
                 color: black;
                 text-align: left;
                 font-family: Trebuchet MS;"> The top 3 entry level jobs are data scientist, data analyst, and data engineer
            </h4>
</div>

<font size="5"> What is the distribution of jobs based on employment type?</font>

In [None]:
a=df.groupby('employment_type',as_index=False)["salary_in_usd"].count().sort_values(by='salary_in_usd', ascending=False)
colors = sns.color_palette('Set2')
plt.figure(figsize=(10, 6), tight_layout=True)
explode_list = [.4, 0, 0 ,0]
plt.pie(a['salary_in_usd'], labels=a['employment_type'], autopct='%.0f %%', explode=explode_list, pctdistance=.7,
          colors=colors, shadow=True)
plt.title('Total Jobs Based on Employment Type', weight='bold')
plt.show()

<div style = "display: fill;
              border-radius: 5px;
              background-color: #E2E5DE;">
    <h4 style = "padding: 15px; 
                 color: black;
                 text-align: left;
                 font-family: Trebuchet MS;"> Almost all data science jobs are full time positions
            </h4>
</div>

<font size="5"> How many of these jobs are remote? </font>

In [None]:
a=df.groupby('remote_ratio',as_index=False)["salary_in_usd"].count().sort_values(by='salary_in_usd', ascending=False)
colors = sns.color_palette('Set2')
plt.figure(figsize=(10, 6), tight_layout=True)
explode_list = [0, 0 ,0]
plt.pie(a['salary_in_usd'], labels=a['remote_ratio'], autopct='%.0f %%', explode=explode_list, pctdistance=.7,
          colors=colors, shadow=True)
plt.title('Remote Ratio', weight='bold')
plt.legend(a['remote_ratio'])
plt.show()

<div style = "display: fill;
              border-radius: 5px;
              background-color: #E2E5DE;">
    <h4 style = "padding: 15px; 
                 color: black;
                 text-align: left;
                 font-family: Trebuchet MS;">More than half of data science positions are fully remote. If we include hybrid positions, then more than 80% of jobs are fully or partially remote.
            </h4>
</div>

<font size="5"> Overview of company size in data science</font>

In [None]:
a=df.groupby('company_size',as_index=False)["salary_in_usd"].count().sort_values(by='salary_in_usd', ascending=False)
colors = sns.color_palette('Set2')
plt.figure(figsize=(10, 6), tight_layout=True)
explode_list = [0, 0 ,0]
plt.pie(a['salary_in_usd'], labels=a['company_size'], autopct='%.0f %%', explode=explode_list, pctdistance=.7,
          colors=colors, shadow=True)
plt.title('Company size', weight='bold')
plt.show()

<div style = "display: fill;
              border-radius: 5px;
              background-color: #E2E5DE;">
    <h4 style = "padding: 15px; 
                 color: black;
                 text-align: left;
                 font-family: Trebuchet MS;"> Most data science employees work for medium and large companies
            </h4>
</div>

<font size="5"> Where are data science jobs located?</font>

In [None]:
a=df.groupby('company_location',as_index=False)["salary_in_usd"].count().sort_values(by='salary_in_usd', ascending=False).head(10)
fig = px.bar(a, x='salary_in_usd', y="company_location", orientation='h',labels={
                     "salary_in_usd": "Count",
                     "company_location": "Location"},
                title="Top 10 countries where data science jobs are located")
fig.update_layout(yaxis=dict(autorange="reversed"))
fig.show()

<div style = "display: fill;
              border-radius: 5px;
              background-color: #E2E5DE;">
    <h4 style = "padding: 15px; 
                 color: black;
                 text-align: left;
                 font-family: Trebuchet MS;"> More than 50% of the jobs are located in the United States. The top 2 behind the United States are Canada and Germany. 
            </h4>
</div>

<font size="5">Bivariate Analysis</font>
1. Salary by year
2. Remote ratio by year
3. Salary by experience level
4. Salary by company size

<font size="5"> Have data science salaries been increasing over the years? </font>

In [None]:
fig = px.box(df, x="work_year", y="salary_in_usd", template='seaborn', labels={
                     "work_year": "Year",
                     "salary_in_usd": "Salary in USD"},
                title="Data science salary in 2020-2022")
fig.show()

<div style = "display: fill;
              border-radius: 5px;
              background-color: #E2E5DE;">
    <h4 style = "padding: 15px; 
                 color: black;
                 text-align: left;
                 font-family: Trebuchet MS;"> Median data science salaries have increased by approximately 60% since 2020.
            </h4>
</div>
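
A change in the median is easier to check as arithmetic than by reading it off a box plot. The sketch below uses made-up salaries (so the resulting percentage is not the dataset's) to show the calculation: take the median per year, then compute the relative change between the first and last year.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from ds_salaries.csv.
toy = pd.DataFrame({
    "work_year": [2020, 2020, 2020, 2022, 2022, 2022],
    "salary_in_usd": [60_000, 80_000, 100_000, 90_000, 120_000, 150_000],
})

# Median salary for each year.
median_by_year = toy.groupby("work_year")["salary_in_usd"].median()

# Relative change of the median between 2020 and 2022, in percent.
first, last = median_by_year.loc[2020], median_by_year.loc[2022]
pct_change = (last - first) / first * 100

print(median_by_year)
print(f"Median change 2020 -> 2022: {pct_change:.0f}%")
```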

<font size="5"> Are more companies offering data science jobs remotely?</font>

In [None]:
a=pd.crosstab(df.work_year, df.remote_ratio, margins=True, values=df.salary_in_usd, aggfunc=pd.Series.count)

sns.set_style("dark")
a.plot(kind="bar",figsize=(12,7), xlabel = 'Work Year', ylabel = 'Count', title = "Remote work trend over the years 2020-2022")

<div style = "display: fill;
              border-radius: 5px;
              background-color: #E2E5DE;">
    <h4 style = "padding: 15px; 
                 color: black;
                 text-align: left;
                 font-family: Trebuchet MS;"> In 2020, there were more remote and hybrid data science jobs than in-office ones. The year 2021 followed a similar trend; however, the trend broke in 2022, when most jobs were either in-person or remote.
            </h4>
</div>
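
Raw counts per year are hard to compare when the number of rows differs between years. One option, sketched here on made-up rows, is to normalize each year to 100% with `pd.crosstab(..., normalize="index")`, so the remote, hybrid, and in-office shares can be compared directly.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from ds_salaries.csv.
toy = pd.DataFrame({
    "work_year":    [2020, 2020, 2020, 2020, 2022, 2022, 2022, 2022],
    "remote_ratio": [0,    50,   50,   100,  0,    100,  100,  100],
})

# normalize="index" divides each row by its total, so every year sums to 1.
remote_share = pd.crosstab(
    toy["work_year"], toy["remote_ratio"], normalize="index"
).mul(100)
print(remote_share.round(1))
```

Each row of `remote_share` sums to 100, so a shrinking hybrid share shows up even if the total number of rows per year grows.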

<font size="5"> Do salaries increase as experience level increases?</font>

In [None]:
fig = px.violin(df, x="experience_level", y="salary_in_usd", template = "seaborn", box=True, labels={
                     "experience_level": "Experience Level",
                     "salary_in_usd": "Salary in USD"},
                title="Data Science Salaries By Experience Level")
fig.show()

<div style = "display: fill;
              border-radius: 5px;
              background-color: #E2E5DE;">
    <h4 style = "padding: 15px; 
                 color: black;
                 text-align: left;
                 font-family: Trebuchet MS;"> Data science salaries do increase as experience level increases.
            </h4>
</div>

<font size="5"> Does company size affect salary?</font>

In [None]:
fig = px.violin(df, x="company_size", y="salary_in_usd", template = "seaborn", box=True, labels={
                     "company_size": "Company Size",
                     "salary_in_usd": "Salary in USD"},
                title="Data Science Salary By Company Size")
fig.show()

<div style = "display: fill;
              border-radius: 5px;
              background-color: #E2E5DE;">
    <h4 style = "padding: 15px; 
                 color: black;
                 text-align: left;
                 font-family: Trebuchet MS;"> Data science salaries increase as company size increases.
            </h4>
</div>
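
The violin plot gives a visual impression; a median per company size puts a number on it. This is a minimal sketch on made-up rows, assuming the labels `Small`, `Medium`, and `Large` produced by the renaming step earlier. The ordered categorical is only there so the groups sort by size rather than alphabetically.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from ds_salaries.csv.
toy = pd.DataFrame({
    "company_size": ["Small", "Small", "Medium", "Medium", "Large", "Large"],
    "salary_in_usd": [50_000, 70_000, 90_000, 110_000, 100_000, 140_000],
})

# Order the categories by size so the output is not sorted alphabetically.
size_order = ["Small", "Medium", "Large"]
toy["company_size"] = pd.Categorical(
    toy["company_size"], categories=size_order, ordered=True
)

# Median salary per company size, in size order.
median_by_size = toy.groupby("company_size", observed=True)["salary_in_usd"].median()
print(median_by_size)

# True when each larger size has a median at least as high as the previous one.
print(median_by_size.is_monotonic_increasing)
```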

<div style = "display: fill;
              border-radius: 5px;
              background-color: #68BB59;
              border-color:red">
    <h1 style = "padding: 15px; 
                 color: black;
                 text-align:center;
                 font-family: Trebuchet MS;"> Summary and Conclusion
    </h1>
</div>

The 5 most popular jobs in data science are: data scientist, data analyst, data researcher, machine learning scientist, and data engineer. These also make up most of the entry-level positions. In terms of average salary, the top positions are all senior and executive level. Most data science companies are large or medium sized and are located in the United States, Germany, and Canada.

Data science salaries have also been growing since 2020. The analysis also shows that salaries increase with experience level as well as company size. Another important factor is the remote ratio. Most data science jobs are now offered either fully remotely or fully in person, with very few hybrid offers. This is a big change compared to 2020 and 2021, when hybrid was more common.