# Data Exploration on Salary by College Major Data

This notebook explores salary trends across different undergraduate majors.

---

In [1]:
import pandas as pd
# display() renders an object as rich output even when it is
# not the last expression in the cell
from IPython.display import display

In [2]:
# Load in csv
df = pd.read_csv("salaries_by_college_major.csv")
clean_df = df.dropna()  # For quickly verifying solutions provided

# Limit max rows to 10 to avoid clutter
pd.set_option("display.max_rows", 10)

In [3]:
df.head()

|  | Undergraduate Major | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 90th Percentile Salary | Group |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Accounting | 46000.0 | 77100.0 | 42200.0 | 152000.0 | Business |
| 1 | Aerospace Engineering | 57700.0 | 101000.0 | 64300.0 | 161000.0 | STEM |
| 2 | Agriculture | 42600.0 | 71900.0 | 36300.0 | 150000.0 | Business |
| 3 | Anthropology | 36800.0 | 61500.0 | 33800.0 | 138000.0 | HASS |
| 4 | Architecture | 41600.0 | 76800.0 | 50600.0 | 136000.0 | Business |


In [4]:
display(df.shape)
df.dtypes

(51, 6)

Undergraduate Major                   object
Starting Median Salary               float64
Mid-Career Median Salary             float64
Mid-Career 10th Percentile Salary    float64
Mid-Career 90th Percentile Salary    float64
Group                                 object
dtype: object

In [5]:
# Drop rows with NaN values
df = df.dropna()
df

|  | Undergraduate Major | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 90th Percentile Salary | Group |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Accounting | 46000.0 | 77100.0 | 42200.0 | 152000.0 | Business |
| 1 | Aerospace Engineering | 57700.0 | 101000.0 | 64300.0 | 161000.0 | STEM |
| 2 | Agriculture | 42600.0 | 71900.0 | 36300.0 | 150000.0 | Business |
| 3 | Anthropology | 36800.0 | 61500.0 | 33800.0 | 138000.0 | HASS |
| 4 | Architecture | 41600.0 | 76800.0 | 50600.0 | 136000.0 | Business |
| ... | ... | ... | ... | ... | ... | ... |
| 45 | Political Science | 40800.0 | 78200.0 | 41200.0 | 168000.0 | HASS |
| 46 | Psychology | 35900.0 | 60400.0 | 31600.0 | 127000.0 | HASS |
| 47 | Religion | 34100.0 | 52000.0 | 29700.0 | 96400.0 | HASS |
| 48 | Sociology | 36500.0 | 58200.0 | 30700.0 | 118000.0 | HASS |
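
Before dropping rows, it can help to see which ones actually contain missing values. The following is a minimal sketch, assuming a small made-up frame (`toy`, whose rows and the `"Source note"` entry are hypothetical) in place of the real CSV:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV
toy = pd.DataFrame(
    {
        "Undergraduate Major": ["Accounting", "Anthropology", "Source note"],
        "Starting Median Salary": [46000.0, 36800.0, np.nan],
    }
)

# Boolean mask: True for rows with at least one missing value
has_nan = toy.isna().any(axis=1)

# Rows that dropna() would remove
rows_to_drop = toy[has_nan]

# Rows that survive, written out explicitly instead of calling dropna()
kept = toy[~has_nan]
```

Inspecting `rows_to_drop` first makes it easier to confirm that only non-data rows are being discarded.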


In [6]:
max_index = df["Starting Median Salary"].idxmax()
df.loc[max_index]

Undergraduate Major                  Physician Assistant
Starting Median Salary                           74300.0
Mid-Career Median Salary                         91700.0
Mid-Career 10th Percentile Salary                66400.0
Mid-Career 90th Percentile Salary               124000.0
Group                                               STEM
Name: 43, dtype: object
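
Note that `idxmax()` returns an index label rather than a row position, which is why it pairs with `loc`. Since `dropna()` does not reset the index, labels and positions can differ in general. A small sketch of the distinction, assuming a hypothetical frame `toy` with a gap in its index:

```python
import pandas as pd

# Hypothetical toy data; the index labels 0, 2, 3 mimic a frame
# after dropna() removed the row labelled 1
toy = pd.DataFrame(
    {"Starting Median Salary": [46000.0, 57700.0, 36800.0]},
    index=[0, 2, 3],
)

label = toy["Starting Median Salary"].idxmax()  # index label of the maximum
row = toy.loc[label]                            # label-based lookup

# Convert the label to a position before using iloc
position = toy.index.get_loc(label)
same_row = toy.iloc[position]
```

Here the label and the position differ, so `toy.iloc[label]` would point at the wrong row.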

## Practice on Minimum and Maximum Values

1. What college major has the highest mid-career salary? How much do graduates with this major earn? (Mid-career is defined as having 10+ years of experience).

2. Which college major has the lowest starting salary and how much do graduates earn after university?

3. Which college major has the lowest mid-career salary and how much can people expect to earn with this degree? 

In [7]:
# 1. What college major has the highest mid-career salary?
# How much do graduates with this major earn? 
# (Mid-career is defined as having 10+ years of experience).

filt = df["Mid-Career Median Salary"] == df["Mid-Career Median Salary"].max()
df.loc[filt]
# Conclusion: Chemical engineering majors have the 
# highest mid-career median salary of 107,000 USD

|  | Undergraduate Major | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 90th Percentile Salary | Group |
| --- | --- | --- | --- | --- | --- | --- |
| 8 | Chemical Engineering | 63200.0 | 107000.0 | 71900.0 | 194000.0 | STEM |


In [8]:
# 2. Which college major has the lowest starting salary
# and how much do graduates earn after university?

filt = df["Starting Median Salary"] == df["Starting Median Salary"].min()
df.loc[filt]
# Conclusion: Spanish majors have the 
# lowest starting median salary of 34,000 USD

|  | Undergraduate Major | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 90th Percentile Salary | Group |
| --- | --- | --- | --- | --- | --- | --- |
| 49 | Spanish | 34000.0 | 53100.0 | 31000.0 | 96400.0 | HASS |


In [9]:
# 3. Which college major has the lowest mid-career salary 
# and how much can people expect to earn with this degree? 

filt = df["Mid-Career Median Salary"] == df["Mid-Career Median Salary"].min()
df.loc[filt]
# Conclusion: Education and religion majors share the lowest
# mid-career median salary of 52,000 USD

|  | Undergraduate Major | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 90th Percentile Salary | Group |
| --- | --- | --- | --- | --- | --- | --- |
| 18 | Education | 34900.0 | 52000.0 | 29300.0 | 102000.0 | HASS |
| 47 | Religion | 34100.0 | 52000.0 | 29700.0 | 96400.0 | HASS |
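
The three answers above repeat the same boolean-mask pattern, so it could be wrapped in a small helper. This is only a sketch: `rows_at_extreme` is a hypothetical name, not part of pandas or of the course material, and `toy` reuses three rows from the outputs above.

```python
import pandas as pd


def rows_at_extreme(frame, column, kind="max"):
    """Return every row whose value in `column` equals the column's min or max."""
    if kind not in ("min", "max"):
        raise ValueError("kind must be 'min' or 'max'")
    extreme = frame[column].max() if kind == "max" else frame[column].min()
    # Boolean mask keeps all rows tied at the extreme value
    return frame.loc[frame[column] == extreme]


# Three rows taken from the outputs above
toy = pd.DataFrame(
    {
        "Undergraduate Major": ["Chemical Engineering", "Education", "Religion"],
        "Mid-Career Median Salary": [107000.0, 52000.0, 52000.0],
    }
)

highest = rows_at_extreme(toy, "Mid-Career Median Salary", "max")
lowest = rows_at_extreme(toy, "Mid-Career Median Salary", "min")
```

With the real data, a call such as `rows_at_extreme(df, "Starting Median Salary", "min")` would play the role of the `filt` lines.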


In [10]:
# Note
# The solutions provided by 100 Days of Code use idxmin() and idxmax(),
# which return only the first occurrence of the minimum or maximum
# value, so they yield a single row. This hides ties, as the 3rd
# problem shows: idxmin() returns only the "Education" row even though
# Education and Religion are tied for the lowest mid-career salary.
df.loc[df['Mid-Career Median Salary'].idxmin()]
# Only Education is returned even though two majors share the
# same value.


Undergraduate Major                  Education
Starting Median Salary                 34900.0
Mid-Career Median Salary               52000.0
Mid-Career 10th Percentile Salary      29300.0
Mid-Career 90th Percentile Salary     102000.0
Group                                     HASS
Name: 18, dtype: object
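
The difference can be reproduced on a tiny frame. The sketch below copies three rows from the outputs above into a stand-in frame (`toy` is an assumed name) and compares `idxmin()` with two approaches that keep ties: the boolean mask used earlier and `nsmallest` with `keep="all"`.

```python
import pandas as pd

# Rows copied from the outputs above; original index labels kept
toy = pd.DataFrame(
    {
        "Undergraduate Major": ["Education", "Religion", "Spanish"],
        "Mid-Career Median Salary": [52000.0, 52000.0, 53100.0],
    },
    index=[18, 47, 49],
)

col = toy["Mid-Career Median Salary"]

# idxmin() gives one label, so loc returns a single row (a Series)
first_only = toy.loc[col.idxmin()]

# The boolean mask returns a DataFrame with every tied row
all_tied = toy.loc[col == col.min()]

# Alternative that also keeps ties
via_nsmallest = toy.nsmallest(1, "Mid-Career Median Salary", keep="all")

# Quick check for how many rows share the minimum
n_tied = int((col == col.min()).sum())
```

Checking `n_tied` before relying on `idxmin()` or `idxmax()` is one way to notice when a single-row answer would be incomplete.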

---