Exploring the college salary data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting, https://matplotlib.org/api/pyplot_summary.html
import seaborn as sns # more data vis, https://seaborn.pydata.org/

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

Let's start by checking out the salaries by degree.

In [None]:
all_degrees = pd.read_csv('../input/degrees-that-pay-back.csv')
all_degrees.head()

Now let's rename the columns and clean up the dollar amounts.

In [None]:
all_degrees.columns = ['major','starting','midcareer','delta', 'mid_10', 'mid_25', 'mid_75', 'mid_90']
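# Strip "$" and "," from every salary column; 'major' and 'delta' are left as-is
for x in all_degrees.columns:
    if x != 'major' and x != 'delta':
        salary = all_degrees[x].str.replace("$", "", regex=False)  # literal "$", not the regex anchor
        salary = salary.str.replace(",", "", regex=False)
        all_degrees[x] = pd.to_numeric(salary)
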
all_degrees.head()
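
The same cleanup could be wrapped in a small helper so it can be reused on the other CSV files later on. This is only a sketch: the function name `to_dollars` and the sample values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

def to_dollars(series):
    """Convert strings like '$46,000.00' to floats (hypothetical helper)."""
    cleaned = (
        series.str.replace("$", "", regex=False)   # literal "$", not the regex anchor
              .str.replace(",", "", regex=False)   # drop thousands separators
    )
    return pd.to_numeric(cleaned)

# Made-up sample values in the same format as the CSV columns
sample = pd.Series(["$46,000.00", "$101,000.00"])
converted = to_dollars(sample)
print(converted.tolist())
```

Passing `regex=False` makes the intent explicit, since `$` means "end of string" when the pattern is treated as a regular expression.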

For now I only care about the first 4 columns and will ignore the percentile columns, so let's lop them off.

In [None]:
degrees = all_degrees.drop(all_degrees.columns[[4,5,6,7]],axis=1,inplace=False)
degrees.head()

Ok, now we have just the pieces of data we care about in a format we can use.  Let's get some basic info about the dataset.

In [None]:
degrees.describe()

Ok, so the mean starting salary across all degrees is $44K and the mean mid-career salary is $75K. The median percent increase from starting to mid-career is 69%.
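
Those figures can also be read straight out of the `describe()` table instead of by eye. A minimal sketch, using a made-up three-row frame (`toy`) as a stand-in for the cleaned `degrees` data:

```python
import pandas as pd

# Made-up numbers standing in for the cleaned `degrees` frame
toy = pd.DataFrame({
    "starting": [40000.0, 50000.0, 60000.0],
    "midcareer": [70000.0, 80000.0, 120000.0],
    "delta": [75.0, 60.0, 100.0],
})

summary = toy.describe()
mean_start = summary.loc["mean", "starting"]   # average starting salary
median_delta = summary.loc["50%", "delta"]     # the "50%" row is the median
print(mean_start, median_delta)
```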

Let's grab the 10 majors with the highest mid-career median salary and plot them.

In [None]:
top_degrees = degrees.nlargest(10, 'midcareer').reset_index()
top_degrees.head(10)

In [None]:
x = top_degrees['midcareer']
y = len(top_degrees.index) - top_degrees.index #swap high and low
labels = top_degrees['major']

plt.scatter(x, y, color='g', label = 'Mid Career Median Salary')
plt.yticks(y, labels)
plt.show()

Interesting enough, and actually it looks fairly linear. For fun and to learn seaborn, let's try to make a bar chart.

In [None]:
# Pass x and y by keyword; newer seaborn versions no longer accept them positionally
sns.barplot(x='midcareer', y='major', data=top_degrees)

Ok, now let's get an idea of the spread between starting salaries and midcareer salaries by overlaying two bar charts, which gives the look of a stacked bar chart.

In [None]:
# background will be midcareer
sns.barplot(x = "midcareer", y = "major", data=top_degrees, color = "red")

#Plot 2 - overlay - "bottom" series
bottom_plot = sns.barplot(x = "starting", y = "major", data=top_degrees, color = "#0000A3")
bottom_plot.set_xlabel("Salaries (starting->midcareer)")

Looks like among the top midcareer salaries, the engineering jobs have higher initial salaries, while math has a low starting salary but a lot of room for improvement.

Now I'm going to try to re-introduce the percentile data from the original dataset and see if I can build a plot which shows the distribution of salaries.

In [None]:
mid_degrees = all_degrees.drop(['starting','delta'],axis=1,inplace=False)
mid_degrees.head()


In [None]:
plt.figure(figsize=(20,12))
df = mid_degrees.sort_values('mid_90', ascending=False).head(10)
pl_90 = sns.barplot(x = "mid_90", y = "major", data=df, color = "red", label = '90%')
pl_75 = sns.barplot(x = "mid_75", y = "major", data=df, color = "blue", label = '75%')
pl_50 = sns.barplot(x = "midcareer", y = "major", data=df, color = "green", label = '50%')
pl_25 = sns.barplot(x = "mid_25", y = "major", data=df, color = "orange", label = '25%')
pl_10 = sns.barplot(x = "mid_10", y = "major", data=df, color = "teal", label = '10%')
pl_10.set_xlabel("Salaries")
pl_10.legend(loc=4) #move the legend
plt.show()

And there are a few of the degrees with the highest 90th percentile salaries, laid out by the different percentiles.

# Now we are going to take a different tack

Let's explore the other data sets. These show salaries by school type and by region.

In [None]:
college_type_degrees = pd.read_csv('../input/salaries-by-college-type.csv')
college_type_degrees.columns = ['school','type','starting','midcareer', 'mid_10', 'mid_25', 'mid_75', 'mid_90']
college_type_degrees.head()


In [None]:
college_region_degrees = pd.read_csv('../input/salaries-by-region.csv')
college_region_degrees.columns = ['school','region','starting','midcareer', 'mid_10', 'mid_25', 'mid_75', 'mid_90']
college_region_degrees.head()


Seems like we have some NaN values in some columns. Let's make sure we don't have any in the starting and midcareer columns we care the most about.

In [None]:
print(len(college_type_degrees.index)-college_type_degrees.count())
len(college_region_degrees.index)-college_region_degrees.count()

Yup, we have missing values for the 10th and 90th percentiles in both sets, but starting, midcareer, and the 25th and 75th percentiles are all accounted for.
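
A more direct way to get the same per-column counts is `isna().sum()`. The sketch below uses a small made-up frame (`toy`) and confirms that it matches the `len(index) - count()` approach used above:

```python
import numpy as np
import pandas as pd

# Made-up rows: one complete column and one with gaps
toy = pd.DataFrame({
    "starting": ["$50,000.00", "$60,000.00", "$45,000.00"],
    "mid_10": [np.nan, "$40,000.00", np.nan],
})

missing = toy.isna().sum()                       # NaN count per column
same_as_before = len(toy.index) - toy.count()    # the approach from the cell above
print(missing)
```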

Since many of these schools overlap, let's join the two datasets. For example, we have Caltech (CIT) in both datasets, so we want a single row that records both that it is in California and that it is an Engineering school.

In [None]:
# first drop everything but the school & region from the region dataset
# since we are just going to merge those values and use salary info from the type dataset

truncated_college_regions = college_region_degrees.drop(['starting','midcareer', 'mid_10', 'mid_25', 'mid_75', 'mid_90'], axis=1, inplace=False)
college_salaries = pd.merge(college_type_degrees, truncated_college_regions, on='school')
college_salaries.head()
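
One thing worth knowing about the merge above: `pd.merge` defaults to an inner join, so any school that appears in only one of the two files is silently dropped. A rough way to see how much overlap there is would be an outer join with `indicator=True`. The frames and school names below are made up for illustration:

```python
import pandas as pd

# Made-up stand-ins for the type and region tables
types = pd.DataFrame({
    "school": ["School A", "School B", "School C"],
    "type": ["Engineering", "State", "Party"],
})
regions = pd.DataFrame({
    "school": ["School A", "School B", "School D"],
    "region": ["California", "Southern", "Western"],
})

inner = pd.merge(types, regions, on="school")  # default how="inner"
outer = pd.merge(types, regions, on="school", how="outer", indicator=True)

# "_merge" records whether each row came from both tables or only one side
counts = outer["_merge"].value_counts()
print(len(inner))
print(counts)
```

Rows marked `left_only` or `right_only` are the schools an inner join would leave out.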

Very cool! Now let's move the region over, dump the columns we don't care about, and clean up the salary columns to make them workable numbers.

In [None]:
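# .copy() guards against pandas' SettingWithCopyWarning when assigning to the columns below
college_salaries = college_salaries[['school', 'type', 'region', 'starting', 'midcareer']].copy()
salary_cols = ['starting', 'midcareer']
for x in salary_cols:
    salary = college_salaries[x].str.replace("$", "", regex=False)  # literal "$", not the regex anchor
    salary = salary.str.replace(",", "", regex=False)
    college_salaries[x] = pd.to_numeric(salary)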
college_salaries.head()

Now we have a joined dataset with school, type, region, and some salary info. Let's take a quick look to see how many school types and regions are in this dataset.

In [None]:
print(college_salaries.groupby('type')['school'].nunique())

print(college_salaries.groupby('region')['school'].nunique())


Alright, so we have 5 school types and 5 regions. There are tons of State schools and, unsurprisingly, not a lot of Ivies. The regions are more evenly distributed, except that California is split out from the "West" region, which leaves the West a little underrepresented.

Let's make a pair of charts which use this data to get a basic overview of how salaries differ across regions and school types.

In [None]:
sns.barplot(x = "region", y = "midcareer", data=college_salaries)


In [None]:
sns.barplot(x = "type", y = "midcareer", data=college_salaries)


So the general trend is that Ivy League schools (which are all in the Northeast and so pull that region's average up) produce the highest mid-career salaries. Otherwise California is a nice place to be, and Engineering schools are a good bet for higher salaries. Interestingly enough, despite my love for State schools (go Cal Aggies!), even Party schools do better on average than State schools.

That data is interesting on its own, but since we have all the data in one dataset let's see if we can graph out the relationships between school types and regions. For example, do party schools from some regions beat engineering schools from other regions?

In [None]:
sns.barplot(x = "type", y = "midcareer", hue = "region", data=college_salaries)


The Ivy League appears in only one region in this data, so it doesn't tell us much here. Let's remove the Ivies from the dataset.

In [None]:
college_salaries = college_salaries.query('type != "Ivy League"')

plt.figure(figsize=(12,8))
plot = sns.barplot(x = "type", y = "midcareer", hue = "region", data=college_salaries, palette="muted")
plot.legend(loc=1) #move the legend

Much more interesting data here: California and the Northeast provide a higher midcareer salary on average for Engineering, Party, and State schools, but not for Liberal Arts. Though if you look at the size of the error bars on top of the bars (seaborn draws a 95% confidence interval by default), you can see that some of this comes from the small size of the dataset: with so few Liberal Arts schools in California or Engineering schools in the South, it's hard to generalize about those categories.

Well, I think that's it for now. I had fun learning some new tech (this was my first time with pandas, seaborn, and the Kaggle platform) and hope to explore some more datasets in the future.
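
To check the small-sample point directly, one option is to count how many schools fall into each type/region cell with `pd.crosstab`. The rows below are made up for illustration; on the real data the same call would take `college_salaries['type']` and `college_salaries['region']`.

```python
import pandas as pd

# Made-up rows standing in for `college_salaries`
toy = pd.DataFrame({
    "school": ["A", "B", "C", "D", "E"],
    "type": ["State", "State", "State", "Liberal Arts", "Engineering"],
    "region": ["California", "California", "Southern", "California", "Southern"],
})

# Rows are school types, columns are regions, values are school counts
cell_counts = pd.crosstab(toy["type"], toy["region"])
print(cell_counts)
```

Cells with only one or two schools are the ones where the bar heights in the chart should be read with the most caution.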