# College Major and Salary Outcome Analysis - Kaggle

In this Jupyter Notebook, we will explore how college major is linked to short-term and long-term earnings. The dataset comes from Kaggle and contains salary outcomes for each college major.

First, we will import some libraries:

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
print('All libraries have been imported.')

Next, we will import the data and preview it to see what's inside:

In [None]:
salaries_major_filepath = '../input/college-salaries/degrees-that-pay-back.csv'
salaries_major = pd.read_csv(salaries_major_filepath)

salaries_major.head()

Now we will alter some aspects of the DataFrame so it's easier to work with. First we will rename the columns:

In [None]:
# Rename cols.
salaries_major.columns = ['major', 'start', 'mid_career', 'delta_start_mid', 'mid_10p',
            'mid_25p', 'mid_75p', 'mid_90p']

salaries_major.head()
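
As a side note, assigning a list to `.columns` relies on the columns arriving in a fixed order. A minimal sketch of an alternative, assuming we know the original header names, is to rename through an explicit mapping. The headers and values below are hypothetical placeholders, not the actual headers of the Kaggle file:

```python
import pandas as pd

# Hypothetical raw headers and values, used only for illustration.
raw = pd.DataFrame({
    'Undergraduate Major': ['Major A', 'Major B'],
    'Starting Salary': ['$40,000', '$50,000'],
})

# Map each old name to its new name explicitly, so column order does not matter.
column_map = {'Undergraduate Major': 'major', 'Starting Salary': 'start'}
renamed = raw.rename(columns=column_map)

print(list(renamed.columns))
```

With a mapping, a column that is missing or reordered in the source file does not silently receive the wrong name.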

In [None]:
# Check dtypes for salary figures.
type(salaries_major.start[0])

Since the salary figures are strings, we also need to convert them to their numerical equivalents:

In [None]:
salary_cols = ['start', 'mid_career', 'mid_10p', 'mid_25p', 'mid_75p', 'mid_90p']

for col in salary_cols:
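    # regex = False treats '$' as a literal character; as a regex, '$' is the end-of-string anchor.
    salaries_major[col] = salaries_major[col].str.replace('$', '', regex = False) # remove dollar signs.
    salaries_major[col] = salaries_major[col].str.replace(',', '', regex = False).astype(float) # remove commas and convert to floats.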
    
salaries_major.head()

In [None]:
# Verify salary figures have been converted to floats.
type(salaries_major.start[0])
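
To see the conversion in isolation, here is a small sketch on a hypothetical Series of salary strings (the values are made up for illustration). It also checks the dtype of the whole column, rather than the type of a single element:

```python
import pandas as pd

# Hypothetical salary strings written with a dollar sign and thousands separators.
raw = pd.Series(['$46,000.00', '$57,700.00', '$34,000.00'])

# regex=False makes '$' a literal character instead of the regex end-of-string anchor.
cleaned = (raw.str.replace('$', '', regex=False)
              .str.replace(',', '', regex=False)
              .astype(float))

print(cleaned.dtype)     # dtype of the whole column
print(cleaned.tolist())
```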

Now that our data is in the format that we like, let's do some data exploration and try to understand what's going on. First we will look at some descriptive statistics:

In [None]:
salaries_major.describe()

Here are some insights we can draw:

    a. Averaged across all majors (each major counted once), the starting salary is 44,310 USD.

    b. The highest-paying major starts at 74,300 USD, while the lowest-paying major starts at 34,000 USD.

    c. Averaged across all majors, the mid-career salary is 74,786 USD.

    d. The highest-paying major reaches 107,000 USD at mid-career, while the lowest-paying major reaches 52,000 USD.
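
Figures like these can also be pulled out directly instead of being read off the `describe()` table. The sketch below uses a hypothetical three-row DataFrame with the same `major` and `start` column names; the numbers are invented for illustration:

```python
import pandas as pd

# Hypothetical figures for three made-up majors.
toy = pd.DataFrame({
    'major': ['Major A', 'Major B', 'Major C'],
    'start': [40000.0, 55000.0, 35000.0],
})

mean_start = toy['start'].mean()                    # unweighted average across majors
top = toy.loc[toy['start'].idxmax(), 'major']       # major with the highest starting salary
bottom = toy.loc[toy['start'].idxmin(), 'major']    # major with the lowest starting salary

print(round(mean_start, 2), top, bottom)
```

Note that the average here weights every major equally, since the data has one row per major rather than one row per graduate.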

Let's dive deeper into the starting salaries data and see if we can create an effective visualization.

In [None]:
# Top 5 starting salaries.
salaries_major.sort_values('start', ascending = False).head()

In [None]:
# Bottom 5 starting salaries.
salaries_major.sort_values('start').head()

In [None]:
# Bar plot for starting salaries by major.
plt.figure(figsize = (10, 10))
sns.barplot(x = salaries_major.sort_values('start', ascending = False).start, # Put data in descending order by start salary.
            y = salaries_major.sort_values('start', ascending = False).major) # Makes the graph easier to interpret.
plt.title('Mean Starting Salaries\nby College Major', fontsize = 18)
plt.xlabel('Salary (USD)', fontsize = 14)
plt.ylabel('Major', fontsize = 14)
plt.grid(axis = 'x')

Looking at the chart above, we now have a better understanding of the starting salary for each major. We observe that somebody who studies to become a Physician Assistant earns the highest mean starting salary, at 74,300 USD, while somebody studying Spanish earns the lowest mean starting salary, at 34,000 USD.

Now let's look into mid-career salaries for each major and create another visualization:

In [None]:
# Top 5 mid-career salaries.
salaries_major.sort_values('mid_career', ascending = False).head()

In [None]:
# Bottom 5 mid-career salaries.
salaries_major.sort_values('mid_career').head()

In [None]:
# Bar plot for mid-career salaries by major.
plt.figure(figsize = (10, 10))
sns.barplot(x = salaries_major.sort_values('mid_career', ascending = False).mid_career, # Put data in descending order by mid-career salary.
            y = salaries_major.sort_values('mid_career', ascending = False).major) # Makes graph easier to interpret.
plt.title('Mean Mid-Career Salaries\nby College Major', fontsize = 18)
plt.xlabel('Salary (USD)', fontsize = 14)
plt.ylabel('Major', fontsize = 14)
plt.grid(axis = 'x')

Looking at the chart above, we now have a better understanding of the mid-career salary for each major. We observe that somebody who studies Chemical Engineering earns the highest mean mid-career salary, at 107,000 USD, while somebody who studies either Education or Religion earns the lowest mean mid-career salary, at 52,000 USD.

After observing and comparing the charts for starting and mid-career salaries, we notice that the rankings have changed: some majors have moved up or down between the two charts. Let's plot both the starting and mid-career salaries on the same chart to better visualize the differences. This will require creating a grouped bar plot.

However, the main DataFrame is not well-suited for creating a grouped-bar plot, so we will have to create a subset DataFrame. This subset DataFrame will combine both starting and mid-career salaries, but will have a classifier column denoting each figure as "Starting" or "Mid-Career".

In [None]:
# First create a new df for starting salaries.
start = salaries_major.loc[:, ['major', 'start']].sort_values('start', ascending = False)
start.rename(columns = {'start': 'salary'}, inplace = True)

# Add salary classifier col. (a scalar is broadcast to every row).
start['salary_type'] = 'Starting'

start.head()

In [None]:
# Then create a df for mid-career salaries.
mid = salaries_major.loc[:, ['major', 'mid_career']]
mid.rename(columns = {'mid_career': 'salary'}, inplace = True)

# Add salary classifier col. (a scalar is broadcast to every row).
mid['salary_type'] = 'Mid-Career'

mid.head()

In [None]:
# Combine the two dfs to form one big df with salaries classified as either starting or mid-career.
combined = pd.concat([start, mid]).reset_index()
combined
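
The same long-format layout can also be produced in one step with `DataFrame.melt`. This is an alternative sketch, not the approach used in the cells above, and it runs on a hypothetical two-row DataFrame with invented numbers:

```python
import pandas as pd

# Hypothetical figures for two made-up majors.
toy = pd.DataFrame({
    'major': ['Major A', 'Major B'],
    'start': [40000.0, 55000.0],
    'mid_career': [80000.0, 70000.0],
})

# Stack the two salary columns into one 'salary' column plus a label column.
long_form = toy.melt(id_vars='major', value_vars=['start', 'mid_career'],
                     var_name='salary_type', value_name='salary')

# Replace the raw column names with readable labels for the plot legend.
long_form['salary_type'] = long_form['salary_type'].map(
    {'start': 'Starting', 'mid_career': 'Mid-Career'})

print(long_form)
```

The result has one row per (major, salary type) pair, which is the shape that the `hue` argument of `sns.barplot` expects.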

In [None]:
# Grouped bar plot showing mean starting salary and mean mid-career salary for each major.
plt.figure(figsize = (10, 10))
sns.barplot(x = 'salary', y = 'major', hue = 'salary_type', data = combined.sort_values(['salary_type', 'salary'], ascending = [True, False]))
plt.title('Mean Starting vs Mid-Career\nSalaries by College Major', fontsize = 18)
plt.xlabel('Salary (USD)', fontsize = 14)
plt.ylabel('Major', fontsize = 14)
plt.grid(axis = 'x')

Looking at the chart above, we are now able to see the differences between various majors' mean starting and mid-career salaries. It is evident that while some majors started out with high mean salaries, they may have later been overtaken by other majors by the mid-career point.
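
One way to put numbers on this overtaking is to rank the majors by each salary and compare the two ranks. The sketch below is an illustration on a hypothetical three-row DataFrame; the `start_rank`, `mid_rank`, and `rank_change` columns are names introduced here, not part of the dataset:

```python
import pandas as pd

# Hypothetical figures for three made-up majors.
toy = pd.DataFrame({
    'major': ['Major A', 'Major B', 'Major C'],
    'start': [60000.0, 45000.0, 40000.0],
    'mid_career': [75000.0, 90000.0, 70000.0],
})

# Rank 1 is the highest salary; method='min' gives tied majors the same whole-number rank.
toy['start_rank'] = toy['start'].rank(ascending=False, method='min').astype(int)
toy['mid_rank'] = toy['mid_career'].rank(ascending=False, method='min').astype(int)

# Positive values mean the major climbed the ranking by mid-career.
toy['rank_change'] = toy['start_rank'] - toy['mid_rank']

print(toy[['major', 'start_rank', 'mid_rank', 'rank_change']])
```

In this toy data, Major B starts second but ranks first at mid-career, so its rank change is positive, while Major A drops one place.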

Let's make another chart to visualize percent salary growth from starting to mid-career for each major:

In [None]:
# Top 5 percent salary growth.
salaries_major.sort_values('delta_start_mid', ascending = False).head()

In [None]:
# Bottom 5 percent salary growth.
salaries_major.sort_values('delta_start_mid').head()

Let's go ahead and create a bar plot showing percent salary growth by major:

In [None]:
# Bar plot showing percent salary growth for each major.
plt.figure(figsize = (10, 10))
sns.barplot(x = salaries_major.sort_values('delta_start_mid', ascending = False).delta_start_mid,
           y = salaries_major.sort_values('delta_start_mid', ascending = False).major)
plt.title('Mean % Salary Growth\nby College Major', fontsize = 18)
plt.xlabel('Salary Growth Rate (%)', fontsize = 14)
plt.ylabel('Major', fontsize = 14)
plt.grid(axis = 'x')

As we can see from the chart above, people who study Math experience the highest mean salary growth rate at 103.5%, while people who study to become a Physician Assistant experience the lowest mean salary growth rate, at 23.4%.
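
For reference, a percent growth figure of this kind is usually the change from starting to mid-career salary divided by the starting salary. The helper below is a minimal sketch under that assumption; `percent_growth` is a name introduced here, and the salaries are invented:

```python
def percent_growth(start: float, mid_career: float) -> float:
    """Return the percent change from starting to mid-career salary."""
    return (mid_career - start) / start * 100


# Hypothetical example: a 40,000 USD starting salary growing to 60,000 USD.
growth = percent_growth(40_000, 60_000)
print(f'{growth:.1f}%')  # 50.0%
```

Because the starting salary is the denominator, a major with a high starting salary needs a much larger absolute raise to show the same percent growth as a major that starts low.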

Looking at the mean mid-career salaries earned by each major was useful, but the dataset also included mid-career salaries for various percentiles. Let's make a scatter plot to understand this distribution.

Like before, the main DataFrame is not quite in a format suitable for plotting multiple data series on one chart, so we will need to make a new subset DataFrame for our usage. This new subset DataFrame will contain major, salary, and a classifier column indicating what percentile each salary figure belongs to.

In [None]:
# Preview main df again.
salaries_major.head()

In [None]:
# Create subset dfs, then concatenate them together into a combined df.
df_10 = salaries_major[['major', 'mid_10p']].rename(columns = {'mid_10p': 'salary'})
df_25 = salaries_major[['major', 'mid_25p']].rename(columns = {'mid_25p': 'salary'})
df_50 = salaries_major[['major', 'mid_career']].rename(columns = {'mid_career': 'salary'})
df_75 = salaries_major[['major', 'mid_75p']].rename(columns = {'mid_75p': 'salary'})
df_90 = salaries_major[['major', 'mid_90p']].rename(columns = {'mid_90p': 'salary'})

combined = pd.concat([df_10, df_25, df_50, df_75, df_90]).reset_index()
combined

In [None]:
# Create salary percentile classifier col and add it to the combined df.
classifiers = ['10th Percentile', '25th Percentile', '50th Percentile',
              '75th Percentile', '90th Percentile']

# Each subset df contributes len(salaries_major) rows, stacked in the same order as the labels above,
# so repeating each label that many times avoids hard-coding the number of majors.
combined['percentile'] = np.repeat(classifiers, len(salaries_major))
combined

In [None]:
# Scatter plot using subset df.
plt.figure(figsize = (10, 10))
sns.scatterplot(x = 'salary', y = 'major', hue = 'percentile', data = combined.sort_values(['percentile', 'salary'], ascending = [False, True]))
plt.title('Mid-Career Salary Percentiles\nby College Major', fontsize = 18)
plt.xlabel('Salary (USD)', fontsize = 14)
plt.ylabel('Major', fontsize = 14)
plt.grid()

Looking at the chart above, we can see the distribution of mid-career salaries for each major. Some majors have relatively wide distributions, with top earners making well into six figures, while others have narrower distributions concentrated in the five-figure range.
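
The width of each distribution can also be expressed as a single number, for example the gap between the 90th and 10th percentile salaries. The sketch below uses a hypothetical two-row DataFrame with the `mid_10p` and `mid_90p` column names from earlier; the `spread` column and the figures are introduced only for illustration:

```python
import pandas as pd

# Hypothetical percentile figures for two made-up majors.
toy = pd.DataFrame({
    'major': ['Major A', 'Major B'],
    'mid_10p': [50000.0, 40000.0],
    'mid_90p': [150000.0, 80000.0],
})

# Gap between the 90th and 10th percentile mid-career salaries.
toy['spread'] = toy['mid_90p'] - toy['mid_10p']

# Major with the widest spread comes first after sorting.
widest = toy.sort_values('spread', ascending=False).iloc[0]['major']

print(toy[['major', 'spread']])
print(widest)
```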

In summary, while some majors have high starting salaries, those salaries may not grow much over time. On the flip side, some majors have relatively low starting salaries but experience high earnings growth and, in some cases, surpass the earnings of other majors by mid-career.

These observations are worth considering when deciding on a major. Some people may prefer to earn more money earlier on but experience less growth, while others may be willing to take a lower starting salary if it means large salary growth potential over the span of their careers.