# Salary Prediction Project

## Problem Statement

InnovaTech, a leading company in the robotics industry, is now hiring! As the lead data scientist at the company, you have been tasked with creating an automated system that estimates annual salaries for company employees based on information such as their age, gender, years of experience, education level, and job title.

Estimates from your system will be used by potential employees to help them decide whether InnovaTech is right for them. Due to regulatory requirements, you must be able to explain why your system outputs a certain prediction.

You will be given a CSV file consisting of the aforementioned information and the actual salaries of over 300 employees.

## Getting the Data

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data/Salary Data.csv')
data

| | Age | Gender | Education Level | Job Title | Years of Experience | Salary |
|---|---|---|---|---|---|---|
| 0 | 32.0 | Male | Bachelor's | Software Engineer | 5.0 | 90000.0 |
| 1 | 28.0 | Female | Master's | Data Analyst | 3.0 | 65000.0 |
| 2 | 45.0 | Male | PhD | Senior Manager | 15.0 | 150000.0 |
| 3 | 36.0 | Female | Bachelor's | Sales Associate | 7.0 | 60000.0 |
| 4 | 52.0 | Male | Master's | Director | 20.0 | 200000.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 370 | 35.0 | Female | Bachelor's | Senior Marketing Analyst | 8.0 | 85000.0 |
| 371 | 43.0 | Male | Master's | Director of Operations | 19.0 | 170000.0 |
| 372 | 29.0 | Female | Bachelor's | Junior Project Manager | 2.0 | 40000.0 |
| 373 | 34.0 | Male | Bachelor's | Senior Operations Coordinator | 7.0 | 90000.0 |


The dataset has 375 rows and 6 columns, with each row holding information about a specific employee. Let's take a closer look at the employee information.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 375 entries, 0 to 374
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  373 non-null    float64
 1   Gender               373 non-null    object 
 2   Education Level      373 non-null    object 
 3   Job Title            373 non-null    object 
 4   Years of Experience  373 non-null    float64
 5   Salary               373 non-null    float64
dtypes: float64(3), object(3)
memory usage: 17.7+ KB


The age, years of experience, and salary columns are numerical, and the gender, education level, and job title columns are strings (possibly categories). Each column has 373 non-null values out of 375 entries, so two values are missing in every column.
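
Missing values can also be counted directly instead of being read off the `info()` output. The following is a minimal sketch on a small, made-up frame (the `toy` rows are hypothetical and not taken from the dataset); the same calls should work on `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the salary data; the values are invented.
toy = pd.DataFrame({
    'Age': [32.0, 28.0, np.nan],
    'Gender': ['Male', 'Female', None],
    'Salary': [90000.0, 65000.0, np.nan],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows in which every column is missing (candidates for dropping).
fully_empty_rows = toy[toy.isna().all(axis=1)]
print(len(fully_empty_rows))
```

If the missing entries all sit in the same rows, dropping those rows with `dropna(how='all')` is one option; whether that is appropriate here is a judgment call.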

Let's now look at some statistics for the numerical columns.

In [4]:
data.describe()

| | Age | Years of Experience | Salary |
|---|---|---|---|
| count | 373.0 | 373.0 | 373.0 |
| mean | 37.431635 | 10.030831 | 100577.345845 |
| std | 7.069073 | 6.557007 | 48240.013482 |
| min | 23.0 | 0.0 | 350.0 |
| 25% | 31.0 | 4.0 | 55000.0 |
| 50% | 36.0 | 9.0 | 95000.0 |
| 75% | 44.0 | 15.0 | 140000.0 |
| max | 53.0 | 25.0 | 250000.0 |


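Most of the values seem reasonable, although the minimum salary of 350 is suspiciously low for an annual salary and may be a data entry error. The salary and years of experience columns appear somewhat right-skewed: in both columns the mean is above the median, and the maximum lies far above the median.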
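
Skew can be quantified rather than judged by eye. A minimal sketch, assuming a made-up salary sample (the numbers below are hypothetical): pandas' `Series.skew()` gives a sample skewness, and a mean above the median is a rough sign of a right tail.

```python
import pandas as pd

# Hypothetical salaries with one large value pulling the right tail.
salaries = pd.Series([40000, 50000, 55000, 60000, 250000], name='Salary')

# Positive skewness indicates a longer right tail.
skewness = salaries.skew()

# A mean above the median points in the same direction.
mean_minus_median = salaries.mean() - salaries.median()

print(skewness, mean_minus_median)
```

On the real data, the same two checks could be run on `data['Salary']` and `data['Years of Experience']`.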

## Exploratory Analysis and Visualization

In [5]:
%pip install plotly --quiet

Note: you may need to restart the kernel to use updated packages.


In [6]:
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

We will now add some settings that improve the default style and font sizes for our charts.

In [7]:
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

### Age

Age is a numeric column, ranging from 23 to 53 in this dataset. We can use plotly to visualize its distribution with a histogram of up to 30 bins (roughly one per year) and a box plot.
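
The bin count need not be hard-coded. A minimal sketch, assuming integer ages and using the minimum (23) and maximum (53) reported by `describe()`; the variable names are illustrative. Note that the inclusive range 23 to 53 covers 31 distinct ages.

```python
# Age range taken from the describe() output.
min_age, max_age = 23, 53

# Inclusive count of whole-year ages between the two extremes.
n_year_bins = max_age - min_age + 1

# Bin edges centred on whole years: 22.5, 23.5, ..., 53.5.
bin_edges = [min_age - 0.5 + i for i in range(n_year_bins + 1)]

print(n_year_bins, bin_edges[0], bin_edges[-1])
```

Edges like these could be handed to plotly through the histogram trace's `xbins` setting (`start`, `end`, `size`), which is one way to force exactly one bar per year.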

In [8]:
data.Age.describe()

count    373.000000
mean      37.431635
std        7.069073
min       23.000000
25%       31.000000
50%       36.000000
75%       44.000000
max       53.000000
Name: Age, dtype: float64

In [9]:
fig = px.histogram(data, 
                   x='Age', 
                   marginal='box', 
                   nbins=30, 
                   title='Age Distribution')
fig.update_layout(bargap=0.1)
fig.show()


The age distribution is roughly uniform across the middle of the range and thins out at both ends. This could be because the company prefers experienced hires (and so employs fewer young people) or because older employees are retiring (which is why there are fewer older people).

### Gender

Let's now visualize the distribution of gender using a bar graph.

In [11]:
data.Gender.value_counts()

Gender
Male      194
Female    179
Name: count, dtype: int64
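
Raw counts are easier to compare as shares. A minimal sketch that rebuilds the counts shown above by hand (on the real data, `data.Gender.value_counts(normalize=True)` should give the same proportions directly):

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({'Male': 194, 'Female': 179}, name='count')

# Share of each gender among the rows with a recorded gender.
shares = (counts / counts.sum()).round(3)
print(shares)
```

The two groups are close to balanced, so gender imbalance is unlikely to distort later comparisons much.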

In [16]:
px.histogram(data, x='Gender', title='Gender Distribution')