# University Salaries: Getting started
In this notebook we'll take a look at the "University Salaries" dataset. We'll work with `salaries_final.csv`, which includes each faculty member's department and college. We will not use `salaries_without_dept.csv`.

The purpose of this notebook is to show how to get started with this dataset; it is not meant to be an extensive, insightful, or visually polished report.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Loading the data...

In [None]:
df = pd.read_csv("../input/university-salaries/university-salaries/salaries_final.csv")
df.shape

Let's take a look at a few rows.

In [None]:
df.head(15)

## Primary job title
What are the value counts for primary job title?

In [None]:
df["Primary Job Title"].value_counts(normalize=True)
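
To see how concentrated the titles are, one option is to look at the cumulative share covered by the most common titles. The sketch below runs on a small made-up `toy` frame (the titles and counts are illustrative, not taken from the dataset); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real data; titles and counts are illustrative only.
toy = pd.DataFrame({
    "Primary Job Title": ["Professor"] * 4 + ["Associate Professor"] * 3
    + ["Assistant Professor"] * 2 + ["Research Associate"]
})

n_titles = toy["Primary Job Title"].nunique()                    # number of distinct titles
shares = toy["Primary Job Title"].value_counts(normalize=True)   # sorted from most to least common
cum_share = shares.cumsum()                                      # share covered by the top-k titles

print(n_titles)
print(cum_share)
```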

There are 147 unique values but most titles fall under Lecturer, Assistant Professor, Associate Professor, and Professor. There are several lecturer titles.

In [None]:
[x for x in df["Primary Job Title"].unique() if "lecturer" in x.lower()]

Let's filter our data to only contain the four most popular titles we mentioned above, but let's first recode Senior Lecturer to Lecturer. It's unclear whether the other lecturer titles are full-time positions or not, so we'll omit these.

In [None]:
df["Primary Job Title"] = df["Primary Job Title"].replace({"Senior Lecturer": "Lecturer"})

In [None]:
df = df.loc[df["Primary Job Title"].isin(["Lecturer", "Assistant Professor", "Associate Professor", "Professor"])]
df.shape

In [None]:
df["Primary Job Title"].value_counts(normalize=True).plot(kind="barh")
plt.xlabel("Fraction of faculty with role")
plt.show()

Note that this includes all years, so each faculty member may be counted several times. Let's look specifically at 2010 and 2020.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12,5), sharex=True)
yrs = [2010, 2020]
for i in range(2):
    plt.sca(ax[i])
    df.loc[df["Year"]==yrs[i], "Primary Job Title"].value_counts(normalize=True).plot(kind="barh")
    plt.title(yrs[i], fontsize=16)
    plt.xlabel("Fraction of faculty with role")
    
plt.tight_layout()
plt.show()

The share of lecturers and professors is mostly consistent between the two years. However, assistant professors and associate professors "reversed" in prevalence from 2010 to 2020. Approximately 40% of faculty (within these four roles) were associate professors in 2010, while in 2020 less than 30% are.
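
The two bar charts can also be read off as a single table of shares. A minimal sketch, using a made-up `toy` frame with the same `Year` and `Primary Job Title` columns (the rows are illustrative, not real records):

```python
import pandas as pd

# Illustrative rows only; not real records from the dataset.
toy = pd.DataFrame({
    "Year": [2010] * 5 + [2020] * 5,
    "Primary Job Title": [
        "Associate Professor", "Associate Professor", "Assistant Professor",
        "Professor", "Lecturer",
        "Assistant Professor", "Assistant Professor", "Associate Professor",
        "Professor", "Lecturer",
    ],
})

# Share of each title within each year: one row per year, one column per title.
title_share = (
    toy.groupby("Year")["Primary Job Title"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(title_share.loc[[2010, 2020]])
```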

## Colleges
We'll refer to the data dictionary (`data_dictionary.csv`) to get just the undergraduate colleges. We'll omit the College of Medicine and other miscellaneous "Colleges" (e.g. UVM Libraries isn't really a college).

First we load the data dictionary...

In [None]:
data_dict = pd.read_csv("../input/university-salaries/university-salaries/data_dictionary.csv")
data_dict

We extract the relevant colleges from the data dictionary.

In [None]:
colleges = data_dict.loc[data_dict["Undergraduate College"]=="Yes", "College Abbreviation"]
colleges

Now we filter our dataset to only contain colleges within our selection.

In [None]:
df = df.loc[df["College"].isin(colleges)].copy() # copy so later column assignments don't raise SettingWithCopyWarning
df.shape

How are faculty distributed across colleges in 2010 and 2020?

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12,5), sharex=True)
yrs = [2010, 2020]
for i in range(2):
    plt.sca(ax[i])
    df.loc[df["Year"]==yrs[i], "College"].value_counts(normalize=True).plot(kind="barh")
    plt.title(yrs[i], fontsize=16)
    plt.xlabel("Fraction of faculty belonging to college")

plt.tight_layout()
plt.show()

For those unfamiliar with UVM, we can make this plot more interpretable by using the data dictionary's "Meaning" column.

In [None]:
# create a lookup where each key is the college abbreviation and each value is the full college name ("Meaning")
colleges = data_dict.loc[data_dict["Undergraduate College"]=="Yes", ["College Abbreviation", "Meaning"]]
colleges_map = dict(zip(colleges["College Abbreviation"], colleges["Meaning"]))

# map college data using our lookup
df["College"] = df["College"].map(colleges_map)
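
`Series.map` turns any value missing from the lookup into `NaN` without raising an error, so it can be worth checking for unmapped abbreviations right after the recode. A minimal sketch on made-up values (the abbreviations and names in `toy_map` are placeholders, not the data dictionary's actual contents):

```python
import pandas as pd

# Placeholder lookup and data; not the real data dictionary.
toy_map = {"AAA": "College A", "BBB": "College B"}
toy_college = pd.Series(["AAA", "BBB", "CCC", "AAA"])

mapped = toy_college.map(toy_map)

# Values absent from the lookup come back as NaN; list the offending originals.
unmapped = toy_college[mapped.isna()].unique()
print(unmapped)
```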

Now we'll recreate the previous plot.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(14,5), sharex=True)
yrs = [2010, 2020]
for i in range(2):
    plt.sca(ax[i])
    df.loc[df["Year"]==yrs[i], "College"].value_counts(normalize=True).plot(kind="barh")
    plt.xlabel("Fraction of faculty belonging to college")
    plt.title(yrs[i], fontsize=16)
    
plt.tight_layout()
plt.show()

Although the College of Arts and Sciences (CAS) has the largest share of faculty in both years, that share declined between 2010 and 2020 (roughly 60% vs 50%). Has this trend developed consistently over time?

In [None]:
cas_counts = df.loc[df["College"]=="College of Arts and Sciences"].groupby("Year").size() # number of CAS faculty each year
all_counts = df.groupby("Year").size() # number of all faculty each year
cas_freqs = cas_counts/all_counts # fraction of CAS faculty each year

cas_freqs.plot(kind="line")
plt.ylabel("Fraction of faculty belonging to CAS")
plt.show()

The fraction of faculty in the College of Arts and Sciences has indeed steadily declined since 2009.
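
The same fraction can be computed in one step by averaging a boolean indicator within each year, which avoids aligning two separate count series. A sketch on a made-up `toy` frame (the rows and the "Other College" label are illustrative):

```python
import pandas as pd

# Illustrative rows only; "Other College" is a placeholder label.
toy = pd.DataFrame({
    "Year": [2009, 2009, 2009, 2009, 2010, 2010, 2010, 2010],
    "College": [
        "College of Arts and Sciences", "College of Arts and Sciences",
        "College of Arts and Sciences", "Other College",
        "College of Arts and Sciences", "College of Arts and Sciences",
        "Other College", "Other College",
    ],
})

# True where the row belongs to CAS; the mean of a boolean is the fraction of Trues.
is_cas = toy["College"].eq("College of Arts and Sciences")
cas_frac = is_cas.groupby(toy["Year"]).mean()
print(cas_frac)
```

With this form, a year with no CAS rows yields 0 rather than a missing value.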

## Salaries
We'll start by examining the distribution of salaries among all faculty in 2020.

In [None]:
# subset data 
df_pay = df.loc[df["Year"]==2020, "Base Pay"]

# display quantiles in a dataframe
q = [0, .25, .50, .75, 1]
display(pd.DataFrame({ "Quantile":q, "Value":np.quantile(df_pay.values, q) }))

# print mean
print("Mean:", df_pay.mean())

# plot distribution
df_pay.plot(kind="hist", bins=30, edgecolor="black")
plt.show()

The median pay among all faculty in 2020 is 88,930. The distribution is right-skewed, with some faculty making over 300k. Let's re-plot the distribution on a log scale.

In [None]:
plt.hist(df_pay.values, bins=np.logspace(0,6,100), edgecolor="black")
plt.xscale("log")
plt.xlim([10**4,10**6]) # narrow our range
plt.show()

Salaries seem to be approximately log-normally distributed.
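
A rough numeric check of that impression: if salaries are log-normal, their logarithm should be roughly symmetric, so its skewness should sit near zero while the raw values stay right-skewed. The sketch below uses synthetic draws from NumPy's log-normal generator as a stand-in for `df_pay` (the `mean` and `sigma` values are arbitrary assumptions, not fitted to the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for df_pay; mean/sigma are arbitrary, not estimated from the dataset.
pay = pd.Series(rng.lognormal(mean=11.4, sigma=0.35, size=5000))

raw_skew = pay.skew()           # positive for right-skewed data
log_skew = np.log(pay).skew()   # close to zero if the data are log-normal

print(raw_skew, log_skew)
```

On the real data, `pay` would be replaced by `df_pay`; a log-skewness far from zero would argue against the log-normal reading.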

Now we'll compare salaries among the four different roles over time.

In [None]:
plt.figure(figsize=(16,5))
ax = plt.gca()
df.groupby(["Primary Job Title", "Year"])["Base Pay"].median().unstack(level=0).plot(kind="line", ax=ax)
plt.ylabel("Median base pay among all faculty")
plt.show()

Pay has increased for all roles over time, presumably at least in part to keep up with inflation. Relative pay between job titles is as expected, with [full] professors making the most and lecturers making the least.

What about pay between colleges?

In [None]:
plt.figure(figsize=(16,5))
ax = plt.gca()
df.groupby(["College", "Year"])["Base Pay"].median().unstack(level=0).plot(kind="line", ax=ax)
plt.ylabel("Median base pay among all faculty")
plt.legend(loc="center right")
plt.show()

The main takeaway is that business faculty have a much higher base pay (130k starting in 2009, now above 160k). Let's focus on 2020.

In [None]:
df.loc[df["Year"]==2020].groupby("College")["Base Pay"].median().sort_values().plot(kind="barh")
plt.xlabel("Median base pay among all faculty")
plt.show()

The difference between the other colleges is relatively small (10k difference between the College of Engineering and Mathematical Sciences and the College of Education and Social Services).

Let's continue on with the investigation of CAS and now ask what the **fraction of salary funding** has looked like over time.

In [None]:
cas_pay = df.loc[df["College"]=="College of Arts and Sciences"].groupby("Year")["Base Pay"].sum() # total salary funding for CAS each year
all_pay = df.groupby("Year")["Base Pay"].sum() # total salary funding for all colleges, each year
cas_pay_frac = cas_pay/all_pay # fraction of salary funding going to CAS, each year

cas_pay_frac.plot(kind="line")
plt.ylabel("Fraction of salary funding to CAS")
plt.show()

CAS's share of salary funding has gone down, but this roughly tracks the declining share of faculty we saw before, so it's not entirely surprising.
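
To put a number on how closely the two trends move together, one option is the Pearson correlation between the two yearly series. A sketch with made-up values standing in for `cas_freqs` and `cas_pay_frac` (the numbers are illustrative only):

```python
import pandas as pd

years = [2009, 2010, 2011, 2012]
# Illustrative values only; on the real data use cas_freqs and cas_pay_frac.
toy_freqs = pd.Series([0.60, 0.58, 0.55, 0.53], index=years)
toy_pay_frac = pd.Series([0.57, 0.56, 0.53, 0.50], index=years)

# Series.corr aligns the two series on their index (Year) and defaults to Pearson correlation.
r = toy_freqs.corr(toy_pay_frac)

# Ratio of pay share to headcount share; below 1 means less pay than the headcount share alone would imply.
ratio = toy_pay_frac / toy_freqs

print(r)
print(ratio)
```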

## Departments

As a final, unrelated investigation, let's take a look at the top departments (not colleges).

In [None]:
df.loc[df["Year"]==2020, "Department"].value_counts(normalize=True).head(5)

There are apparently a lot of faculty in the Education, English, and Math departments. What is the breakdown of job titles within these three departments?

In [None]:
top_deps = df.loc[df["Year"]==2020, "Department"].value_counts().head(3).index # top departments
df_dep_filter = df.loc[(df["Year"]==2020) & (df["Department"].isin(top_deps))] # filter data
df_dep_title = df_dep_filter.groupby(["Primary Job Title", "Department"]).size().unstack(level=0) # compute group sizes
df_dep_title = df_dep_title.div(df_dep_title.sum(axis=1), axis=0) # normalize group sizes

# plot
plt.figure(figsize=(14,5))
ax = plt.gca()
df_dep_title.plot(kind="bar", ax=ax)
plt.xticks(rotation=0)
plt.ylabel("Fraction of faculty")
plt.show()

Lecturers are much more prevalent in the Education and Mathematics & Statistics departments (nearly 50% of all roles) than in the English department. There are a lot of full professors in the English department but very few assistant professors. This is consistent with the "decline of humanities" theme we've seen; it suggests that UVM is not hiring new faces in the English department and that the department is growing old.
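
The group-size and normalization steps above can also be written with `pd.crosstab`, which normalizes each row when `normalize="index"` is passed. A sketch on a made-up `toy` frame (department names and rows are placeholders):

```python
import pandas as pd

# Placeholder departments and rows; not real records.
toy = pd.DataFrame({
    "Department": ["Dept A"] * 4 + ["Dept B"] * 4,
    "Primary Job Title": [
        "Lecturer", "Lecturer", "Professor", "Assistant Professor",
        "Professor", "Professor", "Professor", "Lecturer",
    ],
})

# Rows are departments, columns are titles; each row sums to 1.
dep_title = pd.crosstab(toy["Department"], toy["Primary Job Title"], normalize="index")
print(dep_title)
```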

**This has just been an introductory notebook. There are many more questions you could ask, and many ways to improve my visualizations (adding interactivity with Plotly, etc.). The main purpose of this notebook was to show how to use the data and what kinds of questions you might investigate.**