# Research Question
What combination of college characteristics results in the greatest post-undergrad salary for computer science majors? 

The characteristics we analyze are: college prestige ranking, student-to-faculty ratio, undergraduate student population, average professor rating, geographic location, and the salary of computer scientists in the area where the college is located. 

In [1]:
# Import necessary packages
import pandas as pd

# Data Cleaning

### Step 1: Loading CSVs
We load each CSV file into its own dataframe. In later steps, we will combine all of these dataframes into one dataframe. 

In [2]:
# Loading all CSV files 
salarydf = pd.read_csv("salaries.csv")
rankingdf = pd.read_csv("ranking.csv")
stud_fac_ratio = pd.read_csv("stud_fac_ratio.csv")
enrollment = pd.read_csv("enrollment.csv")
location = pd.read_csv("geographic_characteristics.csv")

### Step 2: Creating Main Dataframe
We want one large dataframe, hereafter called the 'main dataframe', that has one entry per college. **The dataframe will have these columns: school, rank, tsr (stands for teacher-student ratio; it holds the student-to-faculty ratio), pop (undergraduate student population), early_pay (early career pay), unitID (the ID that the National Center for Education Statistics assigns to the college), county...TODO** Since we already have a dataframe `rankingdf` that contains college names and rankings, we use a copy of it as the base for our main dataframe. 

In [3]:
main = rankingdf.copy() 

In [4]:
# Rearrange columns so school name comes first 
main = main[['school', 'rank']]
# Add empty columns 
main = pd.concat([main, pd.DataFrame(columns=['tsr', 'pop', 'early_pay', 'unitID', 'county'])])
main.head()

Unnamed: 0,school,rank,tsr,pop,early_pay,unitID,county
0,Massachusetts Institute of Technology (MIT),1,,,,,
1,Stanford University,2,,,,,
2,Carnegie Mellon University,3,,,,,
3,"University of California, Berkeley (UCB)",4,,,,,
4,Harvard University,7,,,,,
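
A minimal sketch of an alternative way to add the empty columns, assuming a toy stand-in for `rankingdf` (the name `toy_ranking` and its two rows are illustrative, not the contents of `ranking.csv`). `DataFrame.reindex` reorders the existing columns and creates the missing ones, filled with `NaN`, in a single step.

```python
import pandas as pd

# Toy stand-in for rankingdf (illustrative rows only).
toy_ranking = pd.DataFrame({
    "rank": [1, 2],
    "school": ["Massachusetts Institute of Technology (MIT)", "Stanford University"],
})

columns = ["school", "rank", "tsr", "pop", "early_pay", "unitID", "county"]

# reindex keeps the columns that exist, in the requested order,
# and adds the ones that do not exist yet as all-NaN columns.
toy_main = toy_ranking.reindex(columns=columns)
print(toy_main)
```

This avoids concatenating with an empty dataframe, though either approach should give the same column layout.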


### Step 3: Clean School Names
We will use the college names in the `main` dataframe to look up college characteristics in the other dataframes (i.e. we treat the other dataframes as lookup tables). Because the other dataframes may format their college names differently, we remove punctuation and abbreviations from the `main` school names and strip whitespace for consistency. We also remove 'The' from the beginning of college names, and 'SUNY' because it is an unnecessary designation. 

In [5]:
def clean_string(s): 
    no_abbrvs = s.split("(")[0]
    no_punc = no_abbrvs.replace(",", "").replace(" - ", " ").replace("-", " ").replace(".", " ").replace("&", " ")
    no_suny = no_punc.replace("SUNY", "")
    stripped = no_suny.strip()
    
    # Remove a leading "The", but only when it is a separate word
    if stripped.lower().startswith("the "): 
        stripped = stripped[4:]
    # Have to restrip because the original stripped string was replaced 
    return stripped.strip()

In [6]:
# Clean school names as described earlier
main['school'] = main['school'].apply(lambda s: clean_string(s))
main.head()

Unnamed: 0,school,rank,tsr,pop,early_pay,unitID,county
0,Massachusetts Institute of Technology,1,,,,,
1,Stanford University,2,,,,,
2,Carnegie Mellon University,3,,,,,
3,University of California Berkeley,4,,,,,
4,Harvard University,7,,,,,
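
To make the cleaning rules concrete, here is a small self-contained sketch that restates `clean_string` and runs it on a few names. The function is repeated only so the snippet runs on its own; it drops a leading "The" only when it is followed by a space, which is an assumption about the intended behaviour. The last input name is hypothetical.

```python
def clean_string(s):
    # Drop anything from the first "(" onward, e.g. "(UCB)".
    no_abbrvs = s.split("(")[0]
    no_punc = (no_abbrvs.replace(",", "").replace(" - ", " ")
               .replace("-", " ").replace(".", " ").replace("&", " "))
    stripped = no_punc.replace("SUNY", "").strip()
    # Assumption: only remove "The" when it is a separate leading word.
    if stripped.lower().startswith("the "):
        stripped = stripped[4:]
    return stripped.strip()

examples = [
    "University of California, Berkeley (UCB)",
    "Texas A&M University",
    "Washington University in St. Louis",
    "The Ohio State University",
    "Theological Seminary",  # hypothetical name, should stay intact
]
for raw in examples:
    print(repr(raw), "->", repr(clean_string(raw)))
```

Note that replacing "." with a space leaves a double space in "Washington University in St  Louis", which is why that spelling appears in later lookups.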


In [7]:
# We observed that in row 38, the school name was too wordy, so we shortened it for easier future lookup
main.loc[38, "school"] = "Stony Brook University"

### Step 4: Importing Salaries
We look up each school from `main` in the `salarydf` dataframe and add its early career pay to `main`. 

In [8]:
def contains(school, string): 
    '''
    returns: True if all words of the school name are found in string in the correct order. False otherwise. 
    example: if school is "Columbia University" and string is "Columbia University at Main Campus", returns True.
    note: words are matched as substrings, so a different school whose name contains every word can also match.
    '''
    parts = school.split(" ")
    for part in parts: 
        idx = string.find(part)
        if idx == -1: 
            return False
        # Continue searching after the matched word so each word is matched only once
        string = string[idx + len(part):]
    return True 
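
The following sketch restates `contains` (with the search resuming after each matched word) and shows both a correct match and a false positive: "University of Chicago" matches "University of Illinois at Chicago" because every word appears in order. The helper `starts_with_name` is hypothetical, shown only as one possible stricter check; it has not been validated against the salary data.

```python
def contains(school, string):
    # True if every word of `school` occurs in `string`, in order.
    for part in school.split(" "):
        idx = string.find(part)
        if idx == -1:
            return False
        string = string[idx + len(part):]
    return True

def starts_with_name(school, string):
    # Hypothetical stricter check: exact match, or the school name
    # followed by a space and a suffix such as "at Main Campus".
    return string == school or string.startswith(school + " ")

print(contains("Columbia University", "Columbia University at Main Campus"))
print(contains("University of Chicago", "University of Illinois at Chicago"))
print(starts_with_name("University of Chicago", "University of Illinois at Chicago"))
```

A prefix check like this would reject the Chicago mismatch, but it could also miss legitimate entries whose names are worded differently on the salary site, so any results should still be checked by hand.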

In [9]:
def lookup_sal(cleaned_sal, school):
    # Boolean mask: True for rows whose school name contains every word of `school`
    matches = cleaned_sal['school'].apply(lambda s: contains(school, s))
    subset = cleaned_sal.loc[matches]
    
    # Subset will contain the rows' original index unless reset 
    subset = subset.reset_index()
    try: 
        # Retrieve the first matching entry 
        return subset['early_pay'][0]
    except (KeyError, IndexError): 
        # No row matched this school
        print("Not found: " + school)
        return None

In [10]:
school_series = main.copy()['school']

# Apply cleaning to salarydf for consistency
cleaned_sal = salarydf.copy()
cleaned_sal['school'] = cleaned_sal['school'].apply(lambda s: clean_string(s))

In [11]:
earlypay_series = school_series.apply(lambda school: lookup_sal(cleaned_sal, school))
earlypay_series.head()

Not found: California Institute of Technology
Not found: University of Rochester
Not found: City University of New York
Not found: Georgetown University
Not found: Emory University


0     99,800
1    107,400
2     99,000
3    105,700
4     96,100
Name: school, dtype: object

In [12]:
# Update main with early pay data 
main['early_pay'] = earlypay_series
main.head()

Unnamed: 0,school,rank,tsr,pop,early_pay,unitID,county
0,Massachusetts Institute of Technology,1,,,99800,,
1,Stanford University,2,,,107400,,
2,Carnegie Mellon University,3,,,99000,,
3,University of California Berkeley,4,,,105700,,
4,Harvard University,7,,,96100,,
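
The early pay values printed by `earlypay_series.head()` are comma-formatted strings (dtype `object`), so they would need converting before any numeric analysis. A minimal sketch of one way to do that, using a toy series (`toy_pay` is illustrative; the `None` entry stands for a school with no salary match):

```python
import pandas as pd

# Toy stand-in for the early_pay column.
toy_pay = pd.Series(["99,800", "107,400", None], name="early_pay")

# Remove thousands separators, then convert; missing values become NaN.
numeric_pay = pd.to_numeric(
    toy_pay.str.replace(",", "", regex=False), errors="coerce"
)
print(numeric_pay)
```

With `errors="coerce"`, any value that still cannot be parsed becomes `NaN` rather than raising an error, so it is worth checking for unexpected `NaN`s afterwards.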


The website we scraped did not list salaries for the universities reported as "Not found" above, so we removed them from consideration. Manually searching for and entering these salaries would introduce inconsistency with our existing salary data, because different online sources use different data collection methods. 

In [13]:
# Remove universities for which there is no salary data
# After double-checking the lookup results, we noticed that University of Chicago had been
# matched to the salary entry for University of Illinois at Chicago
# We drop University of Chicago since it actually has no salary data
to_drop = ['California Institute of Technology', 'University of Rochester', 'Georgetown University', 'Emory University', 'City University of New York', 'University of Chicago']
main = main[~main['school'].isin(to_drop)]

In [14]:
main = main.reset_index()

### Step 5: Importing UnitID 
The National Center for Education Statistics (NCES) lists universities by unitID in its datasets. We first grab the unitIDs for our schools of interest from `stud_fac_ratio.csv` and load them into `main`. We will use the unitID to look up colleges in NCES datasets in future steps. 

In [15]:
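def extra_clean_string(s): 
    s = clean_string(s)
    # Note: this removes every occurrence of "at" and "of", including inside words
    # (e.g. "State"). It is applied to both sides of the comparison, so names still line up.
    s = s.replace("at", " ").replace("of", " ")
    s = s.strip()
    return s 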
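
To illustrate what the substring replacement in `extra_clean_string` does, the sketch below applies just that step to already-cleaned names and compares it with a word-level variant. Both helper names are hypothetical, and the regex version is only one possible alternative; switching to it would change which schools match and would require re-checking the lookup results.

```python
import re

def strip_at_of_substrings(s):
    # Same replacement as in extra_clean_string: hits "at"/"of" anywhere.
    return s.replace("at", " ").replace("of", " ").strip()

def strip_at_of_words(s):
    # Hypothetical variant: remove "at"/"of" only as whole words,
    # then collapse the leftover runs of whitespace.
    without_words = re.sub(r"\b(at|of)\b", " ", s)
    return re.sub(r"\s+", " ", without_words).strip()

print(strip_at_of_substrings("Ohio State University"))
print(strip_at_of_words("Ohio State University"))
print(strip_at_of_words("University of Texas at Austin"))
```

The substring version turns "State" into "St e", which is harmless for exact matching as long as both sides are cleaned the same way, but it makes the cleaned names hard to read when debugging.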

In [16]:
def lookup_from_stud_fac(cleaned_ratio, school, return_col):
    cleaned_copy = cleaned_ratio.copy() 
    subset = cleaned_copy[cleaned_copy['Institution Name'] == extra_clean_string(school)]
                          
    # Subset will contain the rows' original index unless reset 
    subset = subset.reset_index()
    
    try: 
        # Retrieve data from the first matching row
        return subset[return_col][0]
    except (KeyError, IndexError): 
        # No institution name matched this school exactly
        print("Not found: " + school)
        return None

In [17]:
# Apply school name cleaning to stud_fac_ratio dataframe for consistency
cleaned_ratio = stud_fac_ratio.copy()
cleaned_ratio['Institution Name'] = cleaned_ratio['Institution Name'].apply(lambda s: extra_clean_string(s))

In [18]:
# Column in stud_fac_ratio dataframe that we want to grab data for 
column_of_interest = 'UnitID'
school_series = main["school"].copy()
unitid_series = school_series.apply(lambda school: lookup_from_stud_fac(cleaned_ratio, school, column_of_interest))

Not found: University of Washington
Not found: Columbia University
Not found: Georgia Institute of Technology
Not found: Purdue University
Not found: Pennsylvania State University
Not found: University of North Carolina Chapel Hill
Not found: Ohio State University
Not found: Texas A M University
Not found: University of Pittsburgh
Not found: University of Virginia
Not found: Arizona State University
Not found: North Carolina State University
Not found: University of Texas Dallas
Not found: Washington University in St  Louis
Not found: University of South Florida
Not found: University of South Carolina
Not found: Colorado State University
Not found: University of Texas Arlington


In [19]:
main["unitID"] = unitid_series

In [20]:
# Set main index from numbers to school names for easier manual updating 
main = main.set_index("school")

In [21]:
# Manually adding unitIDs
main.loc["University of Washington", "unitID"] = 236948
main.loc["Columbia University", "unitID"] = 190150
main.loc["Georgia Institute of Technology", "unitID"] = 139755
main.loc["University of Texas at Austin", "unitID"] = 228778
main.loc["Purdue University", "unitID"] = 243780
main.loc["Pennsylvania State University", "unitID"] = 214777
main.loc["University of North Carolina Chapel Hill", "unitID"] = 199120
main.loc["Ohio State University", "unitID"] = 204796
main.loc["Texas A M University", "unitID"] = 228723
main.loc["University of Pittsburgh", "unitID"] = 215293
main.loc["University of Virginia", "unitID"] = 234076
main.loc["Arizona State University", "unitID"] = 448886
main.loc["North Carolina State University", "unitID"] = 199193
main.loc["University of Arizona", "unitID"] = 104179
main.loc["University of Texas Dallas", "unitID"] = 228787
main.loc["Washington University in St  Louis", "unitID"] = 179867
main.loc["University of South Florida", "unitID"] = 137351
main.loc["University of Georgia", "unitID"] = 139959
main.loc["University of South Carolina", "unitID"] = 218663
main.loc["Colorado State University", "unitID"] = 126818
main.loc["University of Texas at San Antonio", "unitID"] = 229027
main.loc["University of Texas Arlington", "unitID"] = 228769
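
A sketch of how the manual overrides could be kept in a dictionary and checked before they are applied. This matters because assigning with `.loc` to an index label that does not exist adds a new row instead of raising an error, so a misspelled school name would go unnoticed. The dataframe `toy_main` and the name `manual_ids` are illustrative; the IDs are taken from the cell above.

```python
import pandas as pd

# Toy stand-in for main, indexed by school name; NaN marks a failed lookup.
toy_main = pd.DataFrame(
    {"unitID": [float("nan"), float("nan"), 166683]},
    index=pd.Index(
        ["University of Washington", "Columbia University",
         "Massachusetts Institute of Technology"],
        name="school",
    ),
)

manual_ids = {
    "University of Washington": 236948,
    "Columbia University": 190150,
}

# Fail loudly on names that are not in the index, since .loc would
# otherwise append them as brand-new rows.
unknown = [school for school in manual_ids if school not in toy_main.index]
if unknown:
    raise KeyError(f"Schools not present in the dataframe: {unknown}")

for school, unit_id in manual_ids.items():
    toy_main.loc[school, "unitID"] = unit_id

toy_main["unitID"] = toy_main["unitID"].astype(int)
print(toy_main)
```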

In [22]:
# Turn unitIDs from floats to ints
# (assigning the whole column replaces its dtype; .loc[:, ...] may keep the old dtype)
main["unitID"] = main["unitID"].astype(int)

### Step 6: Importing Student-Faculty Ratio 
We look up the student-faculty ratio for each college in `main` in the dataframe `stud_fac_ratio` and import it into `main`. 

In [23]:
# Redefines the Step 5 helper so that rows are looked up by unitID instead of by school name
def lookup_from_stud_fac(cleaned_ratio, unitID, return_col):
    cleaned_copy = cleaned_ratio.copy() 
    subset = cleaned_copy.loc[cleaned_copy['UnitID'] == unitID]
    
    # Subset will contain the rows' original index unless reset 
    subset = subset.reset_index()
    try: 
        # Retrieve student faculty ratio for the first and only entry in subset 
        return subset[return_col][0]
    except (KeyError, IndexError): 
        # `school` is not defined in this function, so report the unitID instead
        print("Not found: " + str(unitID))
        return None

In [24]:
# Apply cleaning to stud_fac_ratio dataframe for consistency
cleaned_ratio = stud_fac_ratio.copy()
cleaned_ratio['Institution Name'] = cleaned_ratio['Institution Name'].apply(lambda s: clean_string(s))

In [25]:
# Column in stud_fac_ratio dataframe that we want to grab data for 
column_of_interest = 'Student-to-faculty ratio (EF2018D)'
id_series = main["unitID"].copy()
ratio_series = id_series.apply(lambda unitID: lookup_from_stud_fac(cleaned_ratio, unitID, column_of_interest))

In [26]:
# Update main dataframe with student faculty ratios
main['tsr'] = ratio_series
main.head()

Unnamed: 0_level_0,index,rank,tsr,pop,early_pay,unitID,county
school,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Massachusetts Institute of Technology,0,1,3.0,,99800,166683,
Stanford University,1,2,5.0,,107400,243744,
Carnegie Mellon University,2,3,10.0,,99000,211440,
University of California Berkeley,3,4,20.0,,105700,110635,
Harvard University,4,7,7.0,,96100,166027,
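
Once every school has a unitID, the per-row lookup can also be written as a single vectorised mapping. The sketch below assumes toy stand-ins for `stud_fac_ratio` and `main` (the names `toy_ratio` and `toy_main`, and the "Hypothetical College" row, are illustrative); the ratio values for the two real unitIDs are the ones shown in the output above.

```python
import pandas as pd

ratio_col = "Student-to-faculty ratio (EF2018D)"

# Toy stand-in for stud_fac_ratio.
toy_ratio = pd.DataFrame({
    "UnitID": [166683, 243744, 211440],
    ratio_col: [3, 5, 10],
})

# Toy stand-in for main, indexed by school name.
toy_main = pd.DataFrame(
    {"unitID": [166683, 243744, 999999]},
    index=pd.Index(
        ["Massachusetts Institute of Technology", "Stanford University",
         "Hypothetical College"],
        name="school",
    ),
)

# Build a unitID -> ratio lookup, then map each school's unitID through it.
lookup = toy_ratio.set_index("UnitID")[ratio_col]
toy_main["tsr"] = toy_main["unitID"].map(lookup)
print(toy_main)
```

Schools whose unitID is missing from the lookup get `NaN` rather than raising an error. `map` requires the lookup index to be unique, so duplicate unitIDs in the source table would need to be resolved first.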


### Step 7: Importing Enrollment
For each college in `main`, we look up that college by its `unitID` in the `enrollment` dataframe and grab the corresponding total undergraduate enrollment number. 

In [27]:
def lookup_enroll(enrolldf, unitID):
    enroll_copy = enrolldf.copy() 
    subset = enroll_copy.loc[(enroll_copy['Unit Id'] == unitID) & (enroll_copy['Student level'] == 'Undergraduate total')]
    
    # Subset will contain the rows' original index unless reset 
    subset = subset.reset_index()

    # Retrieve student enrollment for the first and only entry in subset 
    return subset['Grand Total'][0]

In [28]:
unitID_series = main.copy()['unitID']

In [29]:
enroll_series = unitID_series.apply(lambda unitID: lookup_enroll(enrollment, unitID))

In [30]:
# Update main with the enrollment numbers
main['pop'] = enroll_series
main.head()

Unnamed: 0_level_0,index,rank,tsr,pop,early_pay,unitID,county
school,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Massachusetts Institute of Technology,0,1,3.0,4602,99800,166683,
Stanford University,1,2,5.0,7087,107400,243744,
Carnegie Mellon University,2,3,10.0,6589,99000,211440,
University of California Berkeley,3,4,20.0,30853,105700,110635,
Harvard University,4,7,7.0,9950,96100,166027,
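
`lookup_enroll` assumes that each unitID has exactly one 'Undergraduate total' row. A hedged sketch of how that assumption could be checked while doing the lookup as a merge: the toy tables below are illustrative, and the 'Graduate' rows and their totals are made up purely to show the filtering.

```python
import pandas as pd

# Toy stand-in for the enrollment dataframe.
toy_enroll = pd.DataFrame({
    "Unit Id": [166683, 166683, 243744, 243744],
    "Student level": ["Undergraduate total", "Graduate",
                      "Undergraduate total", "Graduate"],
    "Grand Total": [4602, 7000, 7087, 9000],  # graduate totals are made up
})

toy_main = pd.DataFrame({
    "school": ["Massachusetts Institute of Technology", "Stanford University"],
    "unitID": [166683, 243744],
})

# Keep only the undergraduate totals.
undergrad = toy_enroll[toy_enroll["Student level"] == "Undergraduate total"]

# validate="many_to_one" raises MergeError if a unitID appears more than
# once on the right-hand side, instead of silently duplicating rows.
merged = toy_main.merge(
    undergrad[["Unit Id", "Grand Total"]],
    left_on="unitID", right_on="Unit Id",
    how="left", validate="many_to_one",
)
merged = merged.rename(columns={"Grand Total": "pop"}).drop(columns="Unit Id")
print(merged)
```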


### Step 8: Importing County
We will use the school's county in a later step to help find the salary of computer scientists in the area where the school is located. For now, we retrieve the county from the dataframe `location` and import it into `main`. 

In [31]:
def lookup_county(locationdf, unitID):
    location_copy = locationdf.copy() 
    subset = location_copy.loc[(location_copy['UnitID'] == unitID)]
    
    # Subset will contain the rows' original index unless reset 
    subset = subset.reset_index()

    return subset['County name (HD2018)'][0]

In [32]:
county_series = unitID_series.apply(lambda unitID: lookup_county(location, unitID))

In [33]:
# Update main dataframe with county info
main['county'] = county_series 
main.head()

Unnamed: 0_level_0,index,rank,tsr,pop,early_pay,unitID,county
school,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Massachusetts Institute of Technology,0,1,3.0,4602,99800,166683,Middlesex County
Stanford University,1,2,5.0,7087,107400,243744,Santa Clara County
Carnegie Mellon University,2,3,10.0,6589,99000,211440,Allegheny County
University of California Berkeley,3,4,20.0,30853,105700,110635,Alameda County
Harvard University,4,7,7.0,9950,96100,166027,Middlesex County


# Margia's code below

In [34]:
map_area = pd.read_csv("area_definitions_m2019.csv")
# Drop the columns we do not need, keeping only the MSA code, MSA name and county
map_area = map_area.drop(columns=['FIPS code', 'State', 'State abbreviation', 'County code', 'Township code'])
map_area.columns = ['MSA code', 'MSA name', 'County']
map_area.head()

FileNotFoundError: [Errno 2] No such file or directory: 'area_definitions_m2019.csv'

In [None]:
# Create a metropolitan dataframe that takes the county in the main dataframe and maps it to
# the corresponding metropolitan area from map_area (column 'MSA name')

metropolitan = main.merge(map_area, left_on='county', right_on='County', how='left')
metropolitan.head(50)
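
One thing to watch for when merging on county name alone is that county names are not unique across states (for example, there is a Middlesex County in both Massachusetts and New Jersey). The sketch below uses toy tables to show the effect; the `state` column on the school side and the MSA names are hypothetical, since the notebook's `main` has no state column and `map_area` drops its 'State' column.

```python
import pandas as pd

toy_main = pd.DataFrame({
    "school": ["Massachusetts Institute of Technology"],
    "county": ["Middlesex County"],
    "state": ["Massachusetts"],  # hypothetical column
})

toy_map = pd.DataFrame({
    "MSA name": ["Hypothetical MSA A", "Hypothetical MSA B"],
    "County": ["Middlesex County", "Middlesex County"],
    "State": ["Massachusetts", "New Jersey"],
})

# Merging on county alone matches both Middlesex Counties: 1 school -> 2 rows.
by_county = toy_main.merge(toy_map, left_on="county", right_on="County", how="left")

# Merging on county and state keeps a single row per school.
by_county_state = toy_main.merge(
    toy_map, left_on=["county", "state"], right_on=["County", "State"], how="left"
)

print(len(by_county), len(by_county_state))
```

Comparing the row count before and after the merge is a quick way to detect this kind of duplication in the real data.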

In [None]:
# TODO:
# rankings into bins 
# average professor rating scrape 
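
For the "rankings into bins" item, one possible approach is `pd.cut`. This is only a sketch: the bin edges and labels below are placeholder assumptions, not choices the analysis has settled on, and `toy_ranks` is illustrative data.

```python
import pandas as pd

toy_ranks = pd.Series([1, 2, 3, 4, 7, 38, 120], name="rank")

# Placeholder edges; intervals are right-inclusive: (0, 10], (10, 50], ...
bin_edges = [0, 10, 50, 100, 200]
labels = ["1-10", "11-50", "51-100", "101-200"]

rank_bin = pd.cut(toy_ranks, bins=bin_edges, labels=labels)
print(pd.DataFrame({"rank": toy_ranks, "rank_bin": rank_bin}))
```

Ranks outside the outermost edges would become `NaN`, so the last edge should be at least as large as the lowest-ranked school in the list.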

# Data Description

**Choosing a College Ranking List**

# Data Limitations

# Exploratory Data Analysis

# Questions for Reviewers

# Appendix
For our final report, we will include the notebook containing our web scraping code here. We scraped computer science post-undergrad salaries from `payscale.com` and the list of top computer science universities from `topuniversities.com`. 