
# SALARIES BY COLLEGE MAJOR — EARNINGS, RISK & CAREER POTENTIAL

This notebook explores a dataset of **starting and mid-career salaries by U.S. college major**.  
We aim to answer:

- Which majors have the highest starting and mid-career salaries?
- Which majors are the lowest paid?
- How big is the salary spread (90th − 10th percentile), i.e., risk vs reward?
- How do different broad fields (STEM, Business, Humanities) compare?

---




In [None]:
import pandas as pd
import matplotlib.pyplot as plt

pd.options.display.float_format = '{:,.2f}'.format

## Load the dataset

In [None]:
df = pd.read_csv('salaries_by_college_major.csv')
print('Columns:', list(df.columns))
df.head()

## Quick data check

In [None]:
print('Shape:', df.shape)
print('\nMissing values per column:\n', df.isna().sum())

## Clean and ensure numeric types
Coerce the money columns to numeric and drop rows with missing values in key fields.


In [None]:
cols_money = [
    'Starting Median Salary',
    'Mid-Career Median Salary',
    'Mid-Career 10th Percentile Salary',
    'Mid-Career 90th Percentile Salary',
]

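# Convert each salary column to numbers; entries that cannot be parsed become NaN.
for c in cols_money:
    df[c] = pd.to_numeric(df[c], errors='coerce')

# Keep only rows that have a major name and all four salary figures.
clean_df = df.dropna(subset=['Undergraduate Major'] + cols_money).copy()
clean_df.tail()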
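
One thing to watch for: if the CSV stores salaries as text such as `$46,000.00`, `pd.to_numeric(..., errors='coerce')` turns every such value into NaN and the `dropna` step then silently removes those rows. We have not confirmed how this particular file is formatted, so the following is only a sketch on made-up rows. The helper name `to_money` and the sample values are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real file may already hold plain numbers.
raw = pd.DataFrame({
    'Undergraduate Major': ['Major A', 'Major B', 'Major C'],
    'Starting Median Salary': ['$46,000.00', '57700', None],
})


def to_money(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from the text form of each value, then coerce to float.

    Values that still cannot be parsed (including missing ones) become NaN.
    """
    as_text = series.astype(str).str.replace(r'[$,]', '', regex=True)
    return pd.to_numeric(as_text, errors='coerce')


raw['Starting Median Salary'] = to_money(raw['Starting Median Salary'])
print(raw)
```

Comparing `df.shape` with `clean_df.shape` after cleaning is a quick way to see whether more rows were dropped than expected.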

## Key questions

In [None]:
# Highest starting salary
idx_start_max = clean_df['Starting Median Salary'].idxmax()
clean_df.loc[idx_start_max, ['Undergraduate Major', 'Starting Median Salary']]

In [None]:
# Highest mid-career salary
idx_mid_max = clean_df['Mid-Career Median Salary'].idxmax()
clean_df.loc[idx_mid_max, ['Undergraduate Major', 'Mid-Career Median Salary']]

In [None]:
# Lowest starting salary
idx_start_min = clean_df['Starting Median Salary'].idxmin()
clean_df.loc[idx_start_min, ['Undergraduate Major', 'Starting Median Salary']]

In [None]:
# Lowest mid-career salary
idx_mid_min = clean_df['Mid-Career Median Salary'].idxmin()
clean_df.loc[idx_mid_min, ['Undergraduate Major', 'Mid-Career Median Salary']]
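
Note that `idxmax` and `idxmin` return only the first row that reaches the extreme value, so a tie between two majors would go unnoticed. A minimal sketch of one way to surface ties, using `nlargest` with `keep='all'` on made-up data (the majors and salaries below are hypothetical):

```python
import pandas as pd

# Hypothetical data with a tie for the highest starting salary.
toy = pd.DataFrame({
    'Undergraduate Major': ['Major A', 'Major B', 'Major C'],
    'Starting Median Salary': [60000.0, 72000.0, 72000.0],
})

# idxmax reports only the first row that reaches the maximum.
first_only = toy.loc[toy['Starting Median Salary'].idxmax(), 'Undergraduate Major']

# nlargest with keep='all' returns every row tied for the top value.
all_top = toy.nlargest(1, 'Starting Median Salary', keep='all')

print(first_only)
print(all_top)
```

The same idea applies to the lowest salaries via `nsmallest(1, ..., keep='all')`.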

## Salary spread (risk vs reward)
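Spread = mid-career 90th percentile salary − mid-career 10th percentile salary. A smaller spread means outcomes within a major are more tightly clustered, which we treat here as lower risk.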


In [None]:
clean_df = clean_df.assign(
    Spread = clean_df['Mid-Career 90th Percentile Salary'] - clean_df['Mid-Career 10th Percentile Salary']
)
clean_df[['Undergraduate Major', 'Spread']].head()
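
To make the arithmetic concrete, here is a small worked example on two made-up majors. It also computes a relative spread (spread divided by the mid-career median), which is not part of the analysis in this notebook; it is included only as a possible cross-check, since an absolute dollar spread tends to grow with the overall pay level. The column name `Relative Spread` and all figures are assumptions for illustration.

```python
import pandas as pd

# Hypothetical figures, chosen so the two measures rank the majors differently.
toy = pd.DataFrame({
    'Undergraduate Major': ['Major A', 'Major B'],
    'Mid-Career Median Salary': [50000.0, 100000.0],
    'Mid-Career 10th Percentile Salary': [30000.0, 65000.0],
    'Mid-Career 90th Percentile Salary': [90000.0, 135000.0],
})

# Absolute spread, as defined above: P90 minus P10.
toy['Spread'] = (toy['Mid-Career 90th Percentile Salary']
                 - toy['Mid-Career 10th Percentile Salary'])

# Relative spread (assumption, not in the original analysis): spread per dollar of median pay.
toy['Relative Spread'] = toy['Spread'] / toy['Mid-Career Median Salary']

print(toy[['Undergraduate Major', 'Spread', 'Relative Spread']])
```

In this toy case Major B has the larger dollar spread, while Major A has the larger spread relative to its median.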

## Lowest spread majors (lower variance)

In [None]:
low_risk = clean_df.sort_values('Spread', ascending=True)
low_risk[['Undergraduate Major', 'Spread']].head(10)

## Highest potential majors (by 90th percentile)

In [None]:
highest_potential = clean_df.sort_values('Mid-Career 90th Percentile Salary', ascending=False)
highest_potential[['Undergraduate Major', 'Mid-Career 90th Percentile Salary']].head(10)

## Highest spread majors (higher variance)

In [None]:
high_risk = clean_df.sort_values('Spread', ascending=False)
high_risk[['Undergraduate Major', 'Spread']].head(10)

## Group averages
Mean of each salary column and of the spread, computed for each value of the `Group` column.


In [None]:
group_summary = (clean_df
                 .groupby('Group', as_index=False)[cols_money + ['Spread']]
                 .mean(numeric_only=True)
                 .sort_values('Mid-Career Median Salary', ascending=False))
group_summary
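
A group mean is easier to interpret alongside the number of majors behind it. The sketch below shows one way to get both at once with pandas named aggregation. The group labels, salaries, and the output column names `n_majors` and `mean_mid_career` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical rows; the group labels are placeholders for whatever the file contains.
toy = pd.DataFrame({
    'Group': ['STEM', 'STEM', 'Business', 'Humanities'],
    'Mid-Career Median Salary': [100000.0, 90000.0, 80000.0, 60000.0],
})

summary = (toy
           .groupby('Group', as_index=False)
           .agg(n_majors=('Mid-Career Median Salary', 'size'),
                mean_mid_career=('Mid-Career Median Salary', 'mean'))
           .sort_values('mean_mid_career', ascending=False))

print(summary)
```

A group represented by only a handful of majors can have its mean shifted noticeably by a single outlier, so the count is worth reading next to the average.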

## Visualizations

### Top 10 Majors by Starting Median Salary

In [None]:
top_start = clean_df.sort_values('Starting Median Salary', ascending=False).head(10)

plt.figure(figsize=(10, 6))
plt.barh(top_start['Undergraduate Major'], top_start['Starting Median Salary'])
plt.xlabel('Starting Median Salary ($)')
plt.title('Top 10 Majors by Starting Median Salary')
plt.gca().invert_yaxis()  # show the highest-paid major at the top
plt.show()

### Distribution of Salary Spread (P90 − P10)

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(clean_df['Spread'].dropna(), bins=20, edgecolor='black')
plt.xlabel('Salary Spread ($)')
plt.ylabel('Number of Majors')
plt.title('Distribution of Salary Spread')
plt.show()

## (Optional) Save cleaned dataset

In [None]:
# clean_df.to_csv('salaries_by_college_major_clean.csv', index=False)


---

## 📌 Conclusion & Insights

From the analysis we observed:

- **STEM majors** dominate both starting and mid-career salaries, especially Engineering.  
- **Humanities and Education majors** tend to earn less across the board.  
- Some majors show a **high salary spread** (90th minus 10th percentile), which signals both higher risk and higher reward: top earners do very well, while the lowest earners make far less.  
- On average, the **STEM group** (which includes the Engineering majors) leads in median salaries, while Humanities lags behind.  

This simple analysis highlights how strongly the choice of college major is associated with both early-career pay and long-term earning potential.
