# **Intro**

In this exploratory analysis of the college salaries datasets, I wanted to practice my data analysis skills in Python and see whether the data would show any interesting trends. I didn't have a specific goal in mind beyond practicing my coding skills, so I pursued whichever analyses I thought would be interesting.

To begin, I loaded the datasets needed for this analysis, as well as the necessary Python libraries.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt 

In [None]:
college_df = pd.read_csv('../input/college-salaries/salaries-by-college-type.csv')
region_df = pd.read_csv('../input/college-salaries/salaries-by-region.csv')
degrees_df = pd.read_csv('../input/college-salaries/degrees-that-pay-back.csv')

Now, to see what each dataset looks like.

In [None]:
college_df.head()

In [None]:
region_df.head()

In [None]:
degrees_df.head()

For degrees_df, I wanted to make sure that all the salary columns held floats rather than strings, so I converted each value to a float.

# **Data Analysis**

In [None]:
# making sure all the salary columns have float data and not string data
salary_cols = ['Starting Median Salary', 'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary', 'Mid-Career 25th Percentile Salary', 'Mid-Career 75th Percentile Salary', 'Mid-Career 90th Percentile Salary']
for col in salary_cols:
    # regex=False makes '$' a literal character; as a regex it would only match the end of the string
    degrees_df[col] = degrees_df[col].str.replace('$', '', regex=False) # remove dollar signs.
    degrees_df[col] = degrees_df[col].str.replace(',', '', regex=False).astype(float) # remove commas and convert to floats.
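
The same cleanup can be wrapped in a small helper so that it can be reused for the other datasets and re-run without errors. This is only a sketch and not part of the original notebook: the name `currency_to_float` and the sample values are made up for illustration, and it assumes the strings contain nothing but digits, `$`, `,` and `.`.

```python
import pandas as pd

def currency_to_float(series: pd.Series) -> pd.Series:
    """Convert strings with '$' and ',' to floats (hypothetical helper)."""
    # Columns that are already numeric are returned as floats, so re-running is safe.
    if pd.api.types.is_numeric_dtype(series):
        return series.astype(float)
    # regex=False treats '$' as a literal character, not the end-of-string anchor.
    cleaned = series.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
    return cleaned.astype(float)

# Made-up sample values, only to show the behaviour.
sample = pd.Series(['$46,000.00', '$75,500.00'])
converted = currency_to_float(sample)
print(converted.tolist())
```

Calling the helper a second time on `converted` leaves the values unchanged, which avoids the error that `.str` raises on a column that has already been converted.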

After converting all the numerical data to floats, I wanted to organize the data by Starting Median Salary in ascending order.

In [None]:
degrees_df = degrees_df.sort_values(by='Starting Median Salary',ascending=True)
degrees_df.head()

In [None]:
# checking to make sure that the change worked
type(degrees_df['Mid-Career Median Salary'][0])

In [None]:
# interactive bar graph for starting salary vs degree
fig = px.bar(degrees_df,x='Undergraduate Major',y='Starting Median Salary',title='Starting Median Salary vs Undergraduate Major')
fig.show()

I then wanted to plot a similar graph for degrees_df but organizing it by Mid-Career Median Salary. I first organized the data by Mid-Career Median Salary so that it would graph properly in ascending order.

In [None]:
degrees_df = degrees_df.sort_values(by='Mid-Career Median Salary',ascending=True)
degrees_df.head()

In [None]:
fig = px.bar(degrees_df,x='Undergraduate Major',y='Mid-Career Median Salary',title='Mid-Career Median Salary vs Undergraduate Major')
fig.show()

I was next interested in plotting the median starting and mid-career salaries for each major on the same graph so that I could see the change in salaries for each major. I also wanted to organize this graph by mid-career salary.

In [None]:
start = degrees_df.loc[:, ['Undergraduate Major', 'Starting Median Salary']].sort_values('Starting Median Salary', ascending = False)
start.rename(columns = {'Starting Median Salary': 'salary'}, inplace = True)

# Add salary classifier col. (a scalar is broadcast to every row)
start['salary_type'] = 'Starting'

start.head()

In [None]:
mid = degrees_df.loc[:, ['Undergraduate Major', 'Mid-Career Median Salary']]
mid.rename(columns = {'Mid-Career Median Salary': 'salary'}, inplace = True)

# Add salary classifier col. (a scalar is broadcast to every row)
mid['salary_type'] = 'Mid-Career'

mid.head()

In [None]:
# ignore_index=True gives a fresh 0..n-1 index without adding a leftover 'index' column
combined = pd.concat([start, mid], ignore_index=True)
combined.head()

In [None]:
plt.figure(figsize = (10, 10))
sns.barplot(x = 'salary', y = 'Undergraduate Major', hue = 'salary_type', data = combined.sort_values(['salary_type', 'salary'], ascending = [True, False]))
plt.title('Median Starting vs Mid-Career\nSalaries by College Major', fontsize = 18)
plt.xlabel('Salary (USD)', fontsize = 14)
plt.ylabel('Undergraduate Major', fontsize = 14)
plt.grid(axis = 'x')
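
The long-format table built above from `start` and `mid` could also be produced in one step with `pd.melt`. The sketch below uses made-up rows (`Major A`, `Major B`) rather than the real data, and the name `long_df` is only for illustration.

```python
import pandas as pd

# Made-up rows that mimic the two salary columns in degrees_df.
toy_df = pd.DataFrame({
    'Undergraduate Major': ['Major A', 'Major B'],
    'Starting Median Salary': [40000.0, 50000.0],
    'Mid-Career Median Salary': [70000.0, 90000.0],
})

# melt stacks the two salary columns into one 'salary' column and
# records which column each value came from in 'salary_type'.
long_df = toy_df.melt(
    id_vars='Undergraduate Major',
    value_vars=['Starting Median Salary', 'Mid-Career Median Salary'],
    var_name='salary_type',
    value_name='salary',
)

# Shorten the labels to the ones used in the plot legend.
long_df['salary_type'] = long_df['salary_type'].map({
    'Starting Median Salary': 'Starting',
    'Mid-Career Median Salary': 'Mid-Career',
})
print(long_df)
```

The result has the same three columns as `combined` (major, salary, salary_type), so it could be passed to `sns.barplot` in the same way.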

After analyzing the dataset on degrees, I wanted to do a similar analysis on the dataset for colleges. I first wanted to sort the data by Starting Median Salary.

In [None]:
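# the salary columns are still strings with '$' and ',' here, so sort on their numeric value
college_df = college_df.sort_values(by='Starting Median Salary', key=lambda s: s.str.replace('[$,]', '', regex=True).astype(float), ascending=True)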
college_df.head()

In [None]:
# In this line of code I am dropping any duplicates that there may be in the dataset.
college_df = college_df.drop_duplicates(subset='School Name')
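
One thing worth keeping in mind: `drop_duplicates` keeps the first occurrence by default, so if a school appears more than once, which row survives depends on the current sort order. The sketch below shows this on made-up rows (`School X`, `School Y` and their salaries are not from the dataset).

```python
import pandas as pd

# Made-up rows: 'School X' appears under two school types.
toy_df = pd.DataFrame({
    'School Name': ['School X', 'School X', 'School Y'],
    'School Type': ['Engineering', 'Liberal Arts', 'State'],
    'Starting Median Salary': [60000.0, 55000.0, 45000.0],
})

# Sort first, then drop: the first row per school in the sorted order is kept.
deduped = (
    toy_df.sort_values(by='Starting Median Salary', ascending=True)
          .drop_duplicates(subset='School Name', keep='first')
)
print(deduped)
```

With an ascending sort, the row with the lower starting salary is the one that is kept for `School X`; passing `keep='last'` would keep the higher one instead.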

In [None]:
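# bar graph for salary vs school
# the salary column is still stored as strings here, so plot a copy with numeric values
plot_df = college_df.copy()
plot_df['Starting Median Salary'] = plot_df['Starting Median Salary'].str.replace('[$,]', '', regex=True).astype(float)
fig = px.bar(plot_df,x='School Name',y='Starting Median Salary',title='Starting Median Salary vs School')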
fig.show()

I could have done more analysis on college_df, but it would have been roughly the same kind of analysis as for degrees_df, so I chose to move on to the next dataset, which I called region_df.

In [None]:
region_df

In [None]:
# checking to see what types of values the data is in
type(region_df['Starting Median Salary'][0])

In [None]:
region_df.columns

In [None]:
# converting the salary strings to floats
# note: re-running this cell raises an error, because after the first run the columns are already floats and .str no longer applies
salary_cols = ['Starting Median Salary', 'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary', 'Mid-Career 25th Percentile Salary', 'Mid-Career 75th Percentile Salary', 'Mid-Career 90th Percentile Salary']
for col in salary_cols:
    region_df[col] = region_df[col].str.replace('$', '', regex=False) # remove dollar signs ('$' as a literal, not a regex anchor).
    region_df[col] = region_df[col].str.replace(',', '', regex=False).astype(float) # remove commas and convert to floats.
    
region_df.head()

In [None]:
# checking to make sure that my change to floats worked
type(region_df['Starting Median Salary'][0])

After converting all the values to floats, I then wanted to organize the data into starting median salary for each region. I did this by creating grouped lists.

In [None]:
region_grouped_df = region_df.groupby('Region')
grouped_lists = region_grouped_df['Starting Median Salary'].apply(list)
grouped_lists = grouped_lists.reset_index()
grouped_lists

In [None]:
# just making sure the indexing is correct
print(grouped_lists['Starting Median Salary'][0][0])

I then wanted to obtain the average starting median salary for each region so I could then compare regions.

In [None]:
# creating a function to get an average from a list of numbers
# (the parameter is named values so it does not shadow the built-in list)
def list_average(values):
    return sum(values)/len(values)

In [None]:
california_sal_avg = list_average(grouped_lists['Starting Median Salary'][0])
california_sal_avg

In [None]:
midwestern_sal_avg = list_average(grouped_lists['Starting Median Salary'][1])
midwestern_sal_avg

In [None]:
northeastern_sal_avg = list_average(grouped_lists['Starting Median Salary'][2])
northeastern_sal_avg

In [None]:
southern_sal_avg = list_average(grouped_lists['Starting Median Salary'][3])
southern_sal_avg

In [None]:
western_sal_avg = list_average(grouped_lists['Starting Median Salary'][4])
western_sal_avg

I then put all of the regional average starting salaries into a dictionary to be accessed more easily.

In [None]:
avg_sals = {'California':california_sal_avg,'Midwestern':midwestern_sal_avg,'Northeastern':northeastern_sal_avg,'Southern':southern_sal_avg,'Western':western_sal_avg}
avg_sals
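
The grouped lists, the averaging function and the dictionary above can also be collapsed into a single `groupby(...).mean()` call. The sketch below runs on made-up rows, and `avg_sals_sketch` is a hypothetical name so that it does not overwrite `avg_sals`.

```python
import pandas as pd

# Made-up rows that mimic the two columns used from region_df.
toy_region_df = pd.DataFrame({
    'Region': ['California', 'California', 'Southern', 'Southern'],
    'Starting Median Salary': [50000.0, 60000.0, 40000.0, 44000.0],
})

# Mean starting salary per region, then converted to a plain dictionary.
avg_by_region = toy_region_df.groupby('Region')['Starting Median Salary'].mean()
avg_sals_sketch = avg_by_region.to_dict()
print(avg_sals_sketch)
```

Because the region names come from the data, this version does not depend on knowing that index 0 is California, index 1 is Midwestern, and so on.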

I then plotted the information in the dictionary to compare the regional averages visually.

In [None]:
plt.bar(range(len(avg_sals)), list(avg_sals.values()), align='center', color = ['red','blue','orange','purple','green'])
plt.xticks(range(len(avg_sals)), list(avg_sals.keys()))
plt.title('Starting Median Salary by Region')
plt.xlabel('Region')
plt.ylabel('Starting Median Salary')
plt.show()

# **Analysis of financial value of each school**

After completing my regional analysis, I was essentially finished analyzing all three datasets; however, I wanted to see if it was really worth it to go to a top-ranked US college versus going to a state school, which is much cheaper. For this analysis, I wanted to see how long it would take a graduate from a top-ranked school to earn as much money as a graduate of a state school would have if they invested during their four years of university. In this scenario, the grad from a top-ranked school would not be investing any money during their time at university, whereas the state-school grad would be.

The amount of money the state-school grad invests per year of university is the difference between each respective private school's tuition and the average state-school tuition of [$9,687](https://www.topuniversities.com/student-info/student-finance/how-much-does-it-cost-study-us).

I first organized the college_df by Starting Median Salary to get the top ten highest-paid grads.

In [None]:
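# the salary columns are still strings with '$' and ',' here, so sort on their numeric value
college_df = college_df.sort_values(by='Starting Median Salary', key=lambda s: s.str.replace('[$,]', '', regex=True).astype(float), ascending=False)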
college_df[0:10]

In [None]:
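# .copy() makes an independent dataframe, so the later column assignments do not trigger SettingWithCopyWarning
top_10_colleges_df = college_df[0:10].copy()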
top_10_colleges_df

In [None]:
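# checking to see what format the data is in
# (.iloc[0] takes the first row by position; the index has not been reset yet, so label 0 may not be in the top ten)
type(top_10_colleges_df['Starting Median Salary'].iloc[0])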

In [None]:
# converting the strings to floats
top_10_cols = ['Starting Median Salary', 'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary', 'Mid-Career 25th Percentile Salary', 'Mid-Career 75th Percentile Salary', 'Mid-Career 90th Percentile Salary']
for col in top_10_cols:
    top_10_colleges_df[col] = top_10_colleges_df[col].str.replace('$', '', regex=False) # remove dollar signs ('$' as a literal, not a regex anchor).
    top_10_colleges_df[col] = top_10_colleges_df[col].str.replace(',', '', regex=False).astype(float) # remove commas and convert to floats.
    
top_10_colleges_df

In [None]:
#re-indexing the data
top_10_colleges_df.reset_index(drop=True,inplace=True)
top_10_colleges_df

In [None]:
# checking to make sure the number change was done correctly
type(top_10_colleges_df['Starting Median Salary'][1])

In [None]:
# checking to make sure the indexing was done correctly
top_10_colleges_df['Starting Median Salary'][0]

After getting the top-ten schools with the highest-paid grads, and after confirming that all the data was changed to float values and that the indexing was properly reset, I wanted to add a column for each school's tuition. This tuition is the most recent tuition and was obtained from each respective university's school website. Also, this is just the annual tuition per school. I did not include any other costs such as residence or food.

In [None]:
# annual tuition for each of the ten schools, in the same row order as top_10_colleges_df
# (assigning the whole list at once avoids chained indexing such as df['col'][0] = ...)
top_10_colleges_df['School Tuition'] = [54570, 53790, 58359, 53890, 51904, 40060, 22275, 57560, 55600, 54640]

In [None]:
top_10_colleges_df

I then found the average public, in-state tuition, the average public, out-of-state tuition, and the average private tuition for colleges in the US. This information was obtained from [Top Universities](https://www.topuniversities.com/student-info/student-finance/how-much-does-it-cost-study-us).

In [None]:
top_10_colleges_df['Public, in-state tuition'] = 9687
top_10_colleges_df['Public, out-of-state tuition'] = 21184
top_10_colleges_df['Private tuition'] = 35087
top_10_colleges_df

In [None]:
top_10_colleges_df['Diff Tuition vs Public in-state'] = top_10_colleges_df['School Tuition'] - top_10_colleges_df['Public, in-state tuition']
top_10_colleges_df['Diff Tuition vs Public out-state'] = top_10_colleges_df['School Tuition'] - top_10_colleges_df['Public, out-of-state tuition']
top_10_colleges_df['Diff Tuition vs Private'] = top_10_colleges_df['School Tuition'] - top_10_colleges_df['Private tuition']
top_10_colleges_df

I next wanted to create a new dataframe from my existing top_10_colleges_df to only take the information that I actually needed for my analysis.

In [None]:
top_10_colleges_cost_df = top_10_colleges_df[['School Name','School Tuition','Public, in-state tuition','Public, out-of-state tuition','Private tuition','Diff Tuition vs Public in-state','Diff Tuition vs Public out-state','Diff Tuition vs Private']]
top_10_colleges_cost_df

To see whether each school was really worth it, I first wrote a large function, but I couldn't get it to output the correct values, so I decided to use the Solver tool in Excel to expedite the process.

To determine the purely financial value of each school, as I said before, I wanted to see how long it would take for a graduate from a top-ranked school to earn as much money as a graduate from a state school would have if they invested during their four years of university. I used the future value of annuity due formula to calculate the money a state-school grad would have after university. I then applied a compound interest formula to this future value, took the sum of the earnings a grad from a top-ranked school would have after t years, set the two expressions equal to each other, and solved for t. If you want a better explanation of what I did, I made a post about it which you can find [here](https://www.forecaster.site/is-college-worth-it/). If you look at the post, just skip to the part labeled "Determining the financial value of top-ranked universities (ranked by starting median salary of grads)."
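
The calculation described above can be roughed out in Python. This is a simplified sketch and not the exact model from the Excel workbook or the blog post: the 7% annual return, the flat $72,000 salary, and the function names are all assumptions made for illustration, and the top-ranked grad's earnings are modeled as a flat salary multiplied by t with no raises, taxes, or living costs.

```python
def future_value_annuity_due(payment, rate, periods):
    """Future value of equal payments made at the START of each period (rate must be non-zero)."""
    return payment * ((1 + rate) ** periods - 1) / rate * (1 + rate)

def years_to_match(invested_at_graduation, rate, annual_salary, step=0.01, max_years=60):
    """First t (in years) at which flat cumulative earnings reach the compounded investment.

    Simplified model; returns None if no crossover happens within max_years.
    """
    for k in range(int(max_years / step) + 1):
        t = k * step
        portfolio = invested_at_graduation * (1 + rate) ** t  # compound interest on the invested sum
        earnings = annual_salary * t                          # flat salary, no growth (assumption)
        if earnings >= portfolio:
            return t
    return None

# Tuition figures are taken from the notebook; the return rate and salary are hypothetical.
tuition_gap = 54570 - 9687
fv = future_value_annuity_due(tuition_gap, rate=0.07, periods=4)
t = years_to_match(fv, rate=0.07, annual_salary=72000)
print(round(fv), t)
```

Stepping through t in small increments stands in for what Excel's Solver does here; a root-finding routine would be more precise, but the loop keeps the idea easy to follow.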

# **Results**

In [None]:
results_df = pd.read_csv('../input/uni-values-csv3/Uni Values CSV.csv')
results_df

t in the dataframe above represents the number of years needed for a graduate from a top-ranked school to match the net worth of a grad from a state school if they invested each year of university, as mentioned before. The smaller the t, the better the value of the university.

From the data, you can see that Cooper Union represents the best value for money and Carnegie Mellon the worst, both in purely financial terms. To simplify the analysis, I assumed each grad would have to pay full tuition. I listed my assumptions in my blog post, which you can find [here](https://www.forecaster.site/is-college-worth-it/), under "Assumptions made in this analysis" (I didn't want to list them all here because it would make this post too messy).