# Highest-earning majors in San Antonio

In this notebook we're going to use the U.S. Department of Education's [College Scorecard dataset](https://collegescorecard.ed.gov/data/) to find the highest-earning majors from San Antonio-area colleges and universities.

## Import libraries and load data

First we'll import the necessary libraries and load the data.

In [22]:
# We import pandas for data manipulation
import pandas as pd

import os

Here are the datasets we're loading in:
- `college_coordinates.csv`: I'm pulling this from [a repo I created that contains the coordinates for all of the colleges in the dataset](https://github.com/ryan-serpico/us-college-coordinates). The data itself comes from the DOE's [College Scorecard institution-level data](https://collegescorecard.ed.gov/data/). We're importing it so that we can find the colleges in San Antonio.
- `Most-Recent-Cohorts-Field-of-Study.parquet`: These data are also from the DOE's College Scorecard, specifically the most recent data by field of study. This is the dataset we'll use to find the highest-earning majors: we'll merge it with the coordinates file to find all the majors tied to colleges in San Antonio. Note that in another project I converted this file from a CSV to a Parquet file. I did this so that the file wouldn't run up against GitHub's repository file-size limit, and to speed up my analysis.

Please note that the data were pulled in March 2023.
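The CSV-to-Parquet conversion mentioned above happened in a separate project and isn't shown here. As a rough sketch of what such a step could look like (the function name `csv_to_parquet` and the file paths are assumptions, not the code actually used):

```python
import pandas as pd


def csv_to_parquet(csv_path, parquet_path):
    """Read a CSV file and write it back out as a Parquet file."""
    # low_memory=False makes pandas infer each column's dtype from the whole
    # file, which helps avoid mixed-type columns that Parquet writers can reject.
    df = pd.read_csv(csv_path, low_memory=False)
    # index=False keeps the row index out of the Parquet file.
    df.to_parquet(parquet_path, index=False)
    return df
```

Writing Parquet from pandas needs a Parquet engine such as `pyarrow` or `fastparquet` installed. A call might look like `csv_to_parquet('../data/Most-Recent-Cohorts-Field-of-Study.csv', '../data/Most-Recent-Cohorts-Field-of-Study.parquet')`, where the CSV path is a guess at the original file's location.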

In [23]:
# First we'll import the coordinates file
full_coordinates_df = pd.read_csv('../data/college_coordinates.csv')

# Let's filter the data to only include colleges and universities in San Antonio, Texas, based on the "CITY" and "STABBR" columns
san_antonio_df = full_coordinates_df[(full_coordinates_df['CITY'] == 'San Antonio') & (full_coordinates_df['STABBR'] == 'TX')]

san_antonio_df.info()

san_antonio_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46 entries, 108 to 6358
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   INSTNM     46 non-null     object 
 1   CITY       46 non-null     object 
 2   STATE      46 non-null     object 
 3   STABBR     46 non-null     object 
 4   ZIP        46 non-null     object 
 5   LATITUDE   42 non-null     float64
 6   LONGITUDE  42 non-null     float64
 7   UNITID     46 non-null     int64  
 8   OPEID      46 non-null     float64
 9   INSTURL    46 non-null     object 
dtypes: float64(3), int64(1), object(6)
memory usage: 4.0+ KB


Unnamed: 0,INSTNM,CITY,STATE,STABBR,ZIP,LATITUDE,LONGITUDE,UNITID,OPEID,INSTURL
108,Alamo City Barber College,San Antonio,Texas,TX,78250-3227,29.52717,-98.639532,482981,4231900.0,www.alamocitybarbercollege.com/
439,Aveda Arts & Sciences Institute-San Antonio,San Antonio,Texas,TX,78259-2792,29.637354,-98.452126,455354,4142300.0,https://avedaarts.edu/
527,Baptist Health System School of Health Profess...,San Antonio,Texas,TX,78229,29.517063,-98.567486,223083,660600.0,www.bshp.edu/
529,Baptist University of the Americas,San Antonio,Texas,TX,78224-1364,29.34998,-98.546081,444398,3733300.0,www.bua.edu/
1215,Christ Mission College,San Antonio,Texas,TX,78254-1000,29.543412,-98.70699,494630,4279400.0,cmctx.edu/
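The filter above relies on an exact match against `'San Antonio'` and `'TX'`. If the city names ever arrive with inconsistent capitalization or stray spaces, a more forgiving match could look something like this sketch. The helper `filter_by_city` and the toy rows are hypothetical and not part of the original analysis.

```python
import pandas as pd


def filter_by_city(df, city, state_abbr):
    """Return rows whose CITY and STABBR match, ignoring case and padding."""
    city_match = df["CITY"].str.strip().str.casefold() == city.strip().casefold()
    state_match = df["STABBR"].str.strip().str.upper() == state_abbr.strip().upper()
    # .copy() returns an independent frame, so later column assignments
    # are not made on a view of the original data.
    return df[city_match & state_match].copy()


# Hypothetical rows, not taken from the real coordinates file.
toy = pd.DataFrame({
    "INSTNM": ["School A", "School B", "School C"],
    "CITY": ["San Antonio", "san antonio ", "San Antonio"],
    "STABBR": ["TX", "TX", "FL"],
})
matches = filter_by_city(toy, "San Antonio", "TX")
print(matches["INSTNM"].tolist())
```

Here School B is matched despite the lowercase name and trailing space, while School C is excluded because its state abbreviation differs.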


In [24]:
# Now let's import the Most-Recent-Cohorts-Field-of-Study.parquet
full_field_of_study_df = pd.read_parquet('../data/Most-Recent-Cohorts-Field-of-Study.parquet')

# There are only a few fields we're interested in, so let's create a list of fields we want to keep. This includes: UNITID, OPEID6, INSTNM, CONTROL, CIPDESC, CREDLEV, CREDDESC, EARN_NE_MDN_3YR, IPEDSCOUNT2
fields_of_interest = ['UNITID', 'OPEID6', 'INSTNM', 'CONTROL', 'CIPDESC', 'CREDLEV', 'CREDDESC', 'EARN_NE_MDN_3YR', 'IPEDSCOUNT2']

# Now let's filter the data to only include the fields we're interested in
major_earnings_df = full_field_of_study_df[fields_of_interest]

major_earnings_df.info()

major_earnings_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224817 entries, 0 to 224816
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   UNITID           215889 non-null  float64
 1   OPEID6           224817 non-null  int64  
 2   INSTNM           224817 non-null  object 
 3   CONTROL          224817 non-null  object 
 4   CIPDESC          224817 non-null  object 
 5   CREDLEV          224817 non-null  int64  
 6   CREDDESC         224817 non-null  object 
 7   EARN_NE_MDN_3YR  224817 non-null  object 
 8   IPEDSCOUNT2      190135 non-null  float64
dtypes: float64(2), int64(2), object(5)
memory usage: 15.4+ MB


Unnamed: 0,UNITID,OPEID6,INSTNM,CONTROL,CIPDESC,CREDLEV,CREDDESC,EARN_NE_MDN_3YR,IPEDSCOUNT2
0,100654.0,1002,Alabama A & M University,Public,"Agriculture, General.",3,Bachelor’s Degree,PrivacySuppressed,
1,100654.0,1002,Alabama A & M University,Public,Animal Sciences.,3,Bachelor’s Degree,PrivacySuppressed,6.0
2,100654.0,1002,Alabama A & M University,Public,Food Science and Technology.,3,Bachelor’s Degree,PrivacySuppressed,7.0
3,100654.0,1002,Alabama A & M University,Public,Food Science and Technology.,5,Master's Degree,PrivacySuppressed,8.0
4,100654.0,1002,Alabama A & M University,Public,Food Science and Technology.,6,Doctoral Degree,PrivacySuppressed,2.0


In [25]:
# Now let's merge the two dataframes together using the UNITID column
san_antonio_earnings_df = pd.merge(san_antonio_df, major_earnings_df, on='UNITID')

san_antonio_earnings_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1019 entries, 0 to 1018
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   INSTNM_x         1019 non-null   object 
 1   CITY             1019 non-null   object 
 2   STATE            1019 non-null   object 
 3   STABBR           1019 non-null   object 
 4   ZIP              1019 non-null   object 
 5   LATITUDE         1019 non-null   float64
 6   LONGITUDE        1019 non-null   float64
 7   UNITID           1019 non-null   int64  
 8   OPEID            1019 non-null   float64
 9   INSTURL          1019 non-null   object 
 10  OPEID6           1019 non-null   int64  
 11  INSTNM_y         1019 non-null   object 
 12  CONTROL          1019 non-null   object 
 13  CIPDESC          1019 non-null   object 
 14  CREDLEV          1019 non-null   int64  
 15  CREDDESC         1019 non-null   object 
 16  EARN_NE_MDN_3YR  1019 non-null   object 
 17  IPEDSCOUNT2   
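
`pd.merge` performs an inner join by default, so any San Antonio school with no matching field-of-study rows drops out without a message. One way to check which schools that affects is a left join with `indicator=True`. The sketch below uses small made-up frames in place of `san_antonio_df` and `major_earnings_df`.

```python
import pandas as pd

# Hypothetical stand-ins for san_antonio_df and major_earnings_df.
schools = pd.DataFrame({
    "UNITID": [1, 2, 3],
    "INSTNM": ["School A", "School B", "School C"],
})
programs = pd.DataFrame({
    "UNITID": [1.0, 1.0, 3.0, None],
    "CIPDESC": ["Nursing", "History", "Finance", "Biology"],
})

# Drop programs with no UNITID and align the key dtype with the schools frame.
programs_keyed = programs.dropna(subset=["UNITID"]).astype({"UNITID": "int64"})

# A left join keeps every school; the _merge column records where each row matched.
checked = schools.merge(programs_keyed, on="UNITID", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "INSTNM"].tolist()
print(unmatched)  # → ['School B']
```

Running the same check on the real frames would list the schools that have coordinates but no field-of-study data.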

## Cleaning

Now let's do some cleaning.

Something to note: We're filtering out "PrivacySuppressed" values.

"Any debt or earnings data points suppressed for privacy are indicated by the “PrivacySuppressed” data code." — [College Scorecard Data Documentation](https://collegescorecard.ed.gov/assets/FieldOfStudyDataDocumentation.pdf)

In [26]:
# If EARN_NE_MDN_3YR is "PrivacySuppressed" then we'll drop the row
san_antonio_earnings_df = san_antonio_earnings_df[san_antonio_earnings_df['EARN_NE_MDN_3YR'] != 'PrivacySuppressed']

# We're only looking at bachelor's degrees, so let's filter the data to only include those.
san_antonio_earnings_df = san_antonio_earnings_df[san_antonio_earnings_df['CREDLEV'] == 3]

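# Convert the median earnings from strings to numbers so that sorting and max() compare values, not text
san_antonio_earnings_df['EARN_NE_MDN_3YR'] = pd.to_numeric(san_antonio_earnings_df['EARN_NE_MDN_3YR'])

# Sort the data by the median earnings
san_antonio_earnings_df.sort_values(by='EARN_NE_MDN_3YR', ascending=False, inplace=True)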

# Let's drop the columns we don't need. This includes: STABBR, INSTNM_y
san_antonio_earnings_df.drop(columns=['STABBR', 'INSTNM_y'], inplace=True)

# Rename the INSTNM_x column to INSTNM
san_antonio_earnings_df.rename(columns={'INSTNM_x': 'INSTNM'}, inplace=True)

# Let's reorder the columns so that the most important columns are first: UNITID, OPEID6, INSTNM, CONTROL, CITY, STATE, ZIP, LATITUDE, LONGITUDE, CIPDESC, CREDLEV, CREDDESC, EARN_NE_MDN_3YR, IPEDSCOUNT2
san_antonio_earnings_df = san_antonio_earnings_df[['UNITID', 'OPEID6', 'INSTNM', 'CONTROL', 'CITY', 'STATE', 'ZIP', 'LATITUDE', 'LONGITUDE', 'CIPDESC', 'CREDLEV', 'CREDDESC', 'EARN_NE_MDN_3YR', 'IPEDSCOUNT2']]

# Remove every "." from the CIPDESC column. regex=False treats "." as a literal period rather than a regex wildcard.
san_antonio_earnings_df['CIPDESC'] = san_antonio_earnings_df['CIPDESC'].str.replace('.', '', regex=False)

san_antonio_earnings_df.head()


Unnamed: 0,UNITID,OPEID6,INSTNM,CONTROL,CITY,STATE,ZIP,LATITUDE,LONGITUDE,CIPDESC,CREDLEV,CREDDESC,EARN_NE_MDN_3YR,IPEDSCOUNT2
868,229267,3647,Trinity University,"Private, nonprofit",San Antonio,Texas,78212-7200,29.462682,-98.48186,"Computer and Information Sciences, General",3,Bachelor’s Degree,98114,30.0
815,229027,10115,The University of Texas at San Antonio,Public,San Antonio,Texas,78249-1644,29.583709,-98.620657,"Building/Construction Finishing, Management, a...",3,Bachelor’s Degree,78030,66.0
913,229267,3647,Trinity University,"Private, nonprofit",San Antonio,Texas,78212-7200,29.462682,-98.48186,Finance and Financial Management Services,3,Bachelor’s Degree,74778,17.0
692,229027,10115,The University of Texas at San Antonio,Public,San Antonio,Texas,78249-1644,29.583709,-98.620657,"Computer and Information Sciences, General",3,Bachelor’s Degree,73023,186.0
31,406033,30837,Galen College of Nursing-San Antonio,"Private, for-profit",San Antonio,Texas,78229,29.507685,-98.585653,"Registered Nursing, Nursing Administration, Nu...",3,Bachelor’s Degree,72065,0.0
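
Because `EARN_NE_MDN_3YR` is loaded as text (its dtype is `object`), it is worth seeing how text ordering differs from numeric ordering. The following is a small illustration with made-up values, using `pd.to_numeric` with `errors="coerce"` as one possible way to handle the `PrivacySuppressed` code:

```python
import pandas as pd

# Hypothetical values in the same shape as EARN_NE_MDN_3YR.
raw = pd.Series(["98114", "PrivacySuppressed", "100250", "78030"])

# Sorting the raw strings compares text character by character.
text_order = raw.sort_values(ascending=False).tolist()

# errors="coerce" turns anything non-numeric, such as "PrivacySuppressed", into NaN.
earnings = pd.to_numeric(raw, errors="coerce")
numeric_order = earnings.dropna().sort_values(ascending=False).tolist()

print(text_order)     # → ['PrivacySuppressed', '98114', '78030', '100250']
print(numeric_order)  # → [100250.0, 98114.0, 78030.0]
```

In text order a six-figure salary lands at the bottom, which is why converting to numbers before sorting or taking a maximum is the safer habit.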


## Analysis no. 1: Highest-earning majors in San Antonio

The first graphic in this story is going to highlight the highest-paying majors in San Antonio.

In [27]:
# Let's create a pivot table that finds the max median earnings for each INSTNM.
pivot_table = pd.pivot_table(san_antonio_earnings_df, values='EARN_NE_MDN_3YR', index=['INSTNM'], aggfunc='max')

# Let's sort the data by the median earnings
pivot_table.sort_values(by='EARN_NE_MDN_3YR', ascending=False, inplace=True)

# Merge the pivot table with the san_antonio_earnings_df dataframe so that we can get the CIPDESC. Merge on both INSTNM and EARN_NE_MDN_3YR so that a major at one school can't match another school's maximum.
san_antonio_top_earning_majors_df = pd.merge(san_antonio_earnings_df, pivot_table.reset_index(), on=['INSTNM', 'EARN_NE_MDN_3YR'])

# Drop any rows where IPEDSCOUNT2 is 0. A value of 0 means either the program is no longer offered or there is a mismatch in the data.
san_antonio_top_earning_majors_df = san_antonio_top_earning_majors_df[san_antonio_top_earning_majors_df['IPEDSCOUNT2'] != 0]

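# Export san_antonio_top_earning_majors_df to a CSV file, creating the output directory first if it doesn't exist
os.makedirs('../output/San_Antonio', exist_ok=True)
san_antonio_top_earning_majors_df.to_csv('../output/San_Antonio/san_antonio_top_earning_majors.csv', index=False)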

san_antonio_top_earning_majors_df

Unnamed: 0,UNITID,OPEID6,INSTNM,CONTROL,CITY,STATE,ZIP,LATITUDE,LONGITUDE,CIPDESC,CREDLEV,CREDDESC,EARN_NE_MDN_3YR,IPEDSCOUNT2
0,229267,3647,Trinity University,"Private, nonprofit",San Antonio,Texas,78212-7200,29.462682,-98.48186,"Computer and Information Sciences, General",3,Bachelor’s Degree,98114,30.0
1,229027,10115,The University of Texas at San Antonio,Public,San Antonio,Texas,78249-1644,29.583709,-98.620657,"Building/Construction Finishing, Management, a...",3,Bachelor’s Degree,78030,66.0
3,223083,6606,Baptist Health System School of Health Profess...,"Private, for-profit",San Antonio,Texas,78229,29.517063,-98.567486,"Registered Nursing, Nursing Administration, Nu...",3,Bachelor’s Degree,68196,34.0
4,228644,3659,The University of Texas Health Science Center ...,Public,San Antonio,Texas,78229-3900,29.508107,-98.575655,Dental Support Services and Allied Professions,3,Bachelor’s Degree,64236,26.0
5,228149,3623,St. Mary's University,"Private, nonprofit",San Antonio,Texas,78228,29.453244,-98.562778,Accounting and Related Services,3,Bachelor’s Degree,62421,12.0
6,225627,3578,University of the Incarnate Word,"Private, nonprofit",San Antonio,Texas,78209,29.46708,-98.46583,"Registered Nursing, Nursing Administration, Nu...",3,Bachelor’s Degree,60420,105.0
7,225201,10509,Hallmark University,"Private, nonprofit",San Antonio,Texas,78230-1736,29.538442,-98.568618,Computer/Information Technology Administration...,3,Bachelor’s Degree,51119,62.0
8,227331,3598,Our Lady of the Lake University,"Private, nonprofit",San Antonio,Texas,78207-4689,29.426194,-98.54365,Communication Disorders Sciences and Services,3,Bachelor’s Degree,43192,22.0
9,458982,21171,The Art Institute of San Antonio,"Private, for-profit",San Antonio,Texas,78230,29.532925,-98.56556,Culinary Arts and Related Services,3,Bachelor’s Degree,32275,27.0


**Takeaways:**
- The highest-paying major in San Antonio is Computer and Information Sciences, General, from Trinity University, a private, nonprofit institution. The median pay three full years after graduation is $98,114 a year.
- The highest-paying major at the University of Texas at San Antonio is Building/Construction Finishing, Management, a... The median pay three full years after graduation is $78,030 a year.
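
A different way to pick each school's top-earning major, instead of a pivot table followed by a merge, is `groupby(...).idxmax()`. The sketch below uses made-up rows and assumes the earnings column is already numeric. It also shows why matching on the earnings value alone can pull in an extra row when one school's lower-paying major happens to equal another school's maximum.

```python
import pandas as pd

# Hypothetical rows; School A's History figure equals School B's maximum.
earnings = pd.DataFrame({
    "INSTNM": ["School A", "School A", "School B", "School B"],
    "CIPDESC": ["Nursing", "History", "Accounting", "Biology"],
    "EARN_NE_MDN_3YR": [60000, 45000, 45000, 40000],
})

# idxmax returns the row label of the highest earnings within each school.
top = earnings.loc[earnings.groupby("INSTNM")["EARN_NE_MDN_3YR"].idxmax()]
top = top.sort_values("EARN_NE_MDN_3YR", ascending=False).reset_index(drop=True)

# For comparison: matching on the earnings value alone also keeps History.
school_max = earnings.pivot_table(values="EARN_NE_MDN_3YR", index="INSTNM", aggfunc="max")
loose = earnings.merge(school_max, on="EARN_NE_MDN_3YR")

print(top["CIPDESC"].tolist())  # → ['Nursing', 'Accounting']
print(len(loose))               # → 3
```

Note that `idxmax` returns a single row per school, so if two majors tie for a school's maximum only the first one is kept.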

## Analysis no. 2: The five highest-earning majors at each San Antonio college and university

The second graphic in this story is going to highlight the five highest-earning majors at each San Antonio college and university. We'll split each school's top five majors into a separate CSV file.
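
To make the "top N per school" idea concrete, here is a minimal sketch on made-up data that sorts once and then takes the first rows of each group with `groupby(...).head(n)`. It is an alternative formulation, not the approach this notebook uses, and the helper name `top_n_per_school` is hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned earnings data.
toy = pd.DataFrame({
    "INSTNM": ["School A"] * 3 + ["School B"] * 2,
    "CIPDESC": ["Nursing", "History", "Finance", "Biology", "Accounting"],
    "EARN_NE_MDN_3YR": [60000, 45000, 70000, 40000, 52000],
})


def top_n_per_school(df, n):
    """Return the n highest-earning rows for each INSTNM."""
    # Sort schools alphabetically and earnings from highest to lowest.
    ranked = df.sort_values(["INSTNM", "EARN_NE_MDN_3YR"], ascending=[True, False])
    # head(n) keeps the first n rows of each group in the sorted order.
    return ranked.groupby("INSTNM").head(n).reset_index(drop=True)


top_two = top_n_per_school(toy, 2)
print(top_two["CIPDESC"].tolist())  # → ['Finance', 'Nursing', 'Accounting', 'Biology']
```

Schools with fewer than `n` programs simply contribute all the rows they have.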

In [28]:
# We first drop any rows that have a null value in the EARN_NE_MDN_3YR column.
san_antonio_earnings_df = san_antonio_earnings_df.dropna(subset=['EARN_NE_MDN_3YR'])

# Drop any rows where IPEDSCOUNT2 is 0. A value of 0 means either the program is no longer offered or there is a mismatch in the data.
san_antonio_earnings_df = san_antonio_earnings_df[san_antonio_earnings_df['IPEDSCOUNT2'] != 0]

# Convert the EARN_NE_MDN_3YR column to an int.
san_antonio_earnings_df['EARN_NE_MDN_3YR'] = san_antonio_earnings_df['EARN_NE_MDN_3YR'].astype(int)

# For each INSTNM, find the five CIPDESC with the highest EARN_NE_MDN_3YR values. We'll use the nlargest() function.
five_highest_paying_majors_per_san_antonio_college_df = san_antonio_earnings_df.groupby('INSTNM').apply(lambda x: x.nlargest(5, 'EARN_NE_MDN_3YR')).reset_index(drop=True)

# Sort first by INSTNM and then by EARN_NE_MDN_3YR
five_highest_paying_majors_per_san_antonio_college_df.sort_values(by=['INSTNM', 'EARN_NE_MDN_3YR'], ascending=False, inplace=True)

# For each INSTNM, create a new CSV file that contains the five highest paying majors.
for college in five_highest_paying_majors_per_san_antonio_college_df['INSTNM'].unique():
    college_df = five_highest_paying_majors_per_san_antonio_college_df[five_highest_paying_majors_per_san_antonio_college_df['INSTNM'] == college]
    
    # Build a file-friendly name by replacing spaces with underscores
    college_name = college.replace(' ', '_')

    # Use the first row of this college's data to find its city name, again replacing spaces with underscores
    city = college_df['CITY'].iloc[0].replace(' ', '_')

    # Create the city directory if it doesn't already exist
    os.makedirs(f'../output/{city}', exist_ok=True)

    college_df.to_csv(f'../output/{city}/{college_name}.csv', index=False)

five_highest_paying_majors_per_san_antonio_college_df.head()

Unnamed: 0,UNITID,OPEID6,INSTNM,CONTROL,CITY,STATE,ZIP,LATITUDE,LONGITUDE,CIPDESC,CREDLEV,CREDDESC,EARN_NE_MDN_3YR,IPEDSCOUNT2
32,225627,3578,University of the Incarnate Word,"Private, nonprofit",San Antonio,Texas,78209,29.46708,-98.46583,"Registered Nursing, Nursing Administration, Nu...",3,Bachelor’s Degree,60420,105.0
33,225627,3578,University of the Incarnate Word,"Private, nonprofit",San Antonio,Texas,78209,29.46708,-98.46583,"Engineering, General",3,Bachelor’s Degree,52434,5.0
34,225627,3578,University of the Incarnate Word,"Private, nonprofit",San Antonio,Texas,78209,29.46708,-98.46583,Human Resources Management and Services,3,Bachelor’s Degree,51476,89.0
35,225627,3578,University of the Incarnate Word,"Private, nonprofit",San Antonio,Texas,78209,29.46708,-98.46583,"Business Administration, Management and Operat...",3,Bachelor’s Degree,50325,230.0
36,225627,3578,University of the Incarnate Word,"Private, nonprofit",San Antonio,Texas,78209,29.46708,-98.46583,"Computer and Information Sciences, General",3,Bachelor’s Degree,45952,28.0
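
School names such as "St. Mary's University" contain periods and apostrophes, which the export loop above leaves in the file names. If stricter file names are wanted, one option is sketched below. `safe_name` and `write_school_csvs` are hypothetical helpers, and the resulting names differ from the ones the loop above produces.

```python
import re
from pathlib import Path

import pandas as pd


def safe_name(text):
    """Turn a label into a filename-friendly string (hypothetical helper)."""
    # Collapse each run of characters other than letters, digits, or hyphens into "_".
    return re.sub(r"[^A-Za-z0-9-]+", "_", text).strip("_")


def write_school_csvs(df, output_root):
    """Write one CSV per INSTNM under output_root/<city>/ and return the paths."""
    written = []
    for (city, school), school_df in df.groupby(["CITY", "INSTNM"]):
        city_dir = Path(output_root) / safe_name(city)
        # parents=True and exist_ok=True make this safe to call repeatedly.
        city_dir.mkdir(parents=True, exist_ok=True)
        path = city_dir / f"{safe_name(school)}.csv"
        school_df.to_csv(path, index=False)
        written.append(path)
    return written
```

With these helpers, a call such as `write_school_csvs(five_highest_paying_majors_per_san_antonio_college_df, '../output')` would write files like `San_Antonio/St_Mary_s_University.csv`.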
