# Research Question
What combination of college characteristics results in the greatest post-undergrad salary for computer science majors? 

The characteristics we are analyzing are: college prestige ranking, teacher-to-student ratio, student population, average professor rating, geographic location, and the salary of computer scientists in the area where the college is located. 

In [1]:
# Import necessary packages
import pandas as pd

# Data Cleaning

### Step 1: Loading CSVs
We load each CSV file into its own dataframe. In later steps, we will combine all of these dataframes into one dataframe. 

In [2]:
# Loading all CSV files 
salarydf = pd.read_csv("salaries.csv")
rankingdf = pd.read_csv("ranking.csv")
stud_fac_ratio = pd.read_csv("stud_fac_ratio.csv")
enrollment = pd.read_csv("enrollment.csv")
location = pd.read_csv("geographic_characteristics.csv")
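Before combining the files, it can help to confirm that each dataframe exposes the columns the later steps read. The following is a minimal sketch of such a check, not part of the original pipeline: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, the column lists are taken from the code in this notebook, and the small in-memory CSV stands in for a real file.

```python
import io
import pandas as pd

# Columns read by the later steps of this notebook, keyed by dataframe
REQUIRED_COLUMNS = {
    "salaries": ["school", "early_pay"],
    "ranking": ["school", "rank"],
    "stud_fac_ratio": ["Institution Name", "UnitID", "Student-to-faculty ratio (EF2018D)"],
    "enrollment": ["Unit Id", "Student level", "Grand Total"],
    "location": ["UnitID", "County name (HD2018)"],
}

def missing_columns(df, required):
    """Hypothetical helper: list the required columns that df lacks."""
    return [col for col in required if col not in df.columns]

# Toy stand-in for ranking.csv; the real file is not reproduced here
toy_ranking = pd.read_csv(io.StringIO("school,rank\nStanford University,2\n"))

print(missing_columns(toy_ranking, REQUIRED_COLUMNS["ranking"]))   # nothing missing
print(missing_columns(toy_ranking, REQUIRED_COLUMNS["salaries"]))  # early_pay is missing
```

An empty list means the dataframe has every column that the later lookups expect.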

### Step 2: Creating Main Dataframe
We want one big dataframe, called the 'main dataframe' from here on, that has one entry per college. **The dataframe will have these columns: school, rank (the college's ranking), tsr (stands for teacher-student ratio; the values stored are student-to-faculty ratios), pop (for student undergrad population), early_pay (early career pay), unitID (the ID that the National Center for Education Statistics assigns to the college), county...TODO** Since we already have a dataframe `rankingdf` that contains college names and rankings, we will use a copy of it as the base for our main dataframe. 

In [3]:
main = rankingdf.copy() 

In [4]:
# Rearrange columns so school name comes first 
main = main[['school', 'rank']]
# Add empty columns (reindex creates any column that does not exist yet and fills it with NaN)
main = main.reindex(columns=['school', 'rank', 'tsr', 'pop', 'early_pay', 'unitID', 'county'])
main.head()

Unnamed: 0,school,rank,tsr,pop,early_pay,unitID,county
0,Massachusetts Institute of Technology (MIT),1,,,,,
1,Stanford University,2,,,,,
2,Carnegie Mellon University,3,,,,,
3,"University of California, Berkeley (UCB)",4,,,,,
4,Harvard University,7,,,,,


### Step 3: Clean School Names
We will use the college names in the `main` dataframe to look up college characteristics in the other dataframes (i.e., we treat the other dataframes as lookup tables). Because the other dataframes may format their college names differently, we remove punctuation and parenthesized abbreviations from the `main` school names and strip surrounding whitespace for consistency. We also remove 'The' from the beginning of college names, and 'SUNY' because it is an unnecessary designation. 

In [5]:
def clean_string(s): 
    no_abbrvs = s.split("(")[0]
    no_punc = no_abbrvs.replace(",", "").replace(" - ", " ").replace("-", " ").replace(".", " ").replace("&", " ")
    no_suny = no_punc.replace("SUNY", "")
    stripped = no_suny.strip()
    
    # Remove a leading "The" (whole word only) together with the space that follows it
    if stripped[:4].lower() == "the ": 
        stripped = stripped[4:]
    return stripped

In [6]:
# Clean school names as described earlier
main['school'] = main['school'].apply(clean_string)
main.head()

Unnamed: 0,school,rank,tsr,pop,early_pay,unitID,county
0,Massachusetts Institute of Technology,1,,,,,
1,Stanford University,2,,,,,
2,Carnegie Mellon University,3,,,,,
3,University of California Berkeley,4,,,,,
4,Harvard University,7,,,,,
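The cleaning rules can also be spot-checked on individual names. The sketch below is a regex-based variant written for illustration, not the function used in this notebook: `clean_name` is a hypothetical name, it additionally collapses repeated whitespace, and it removes a leading "The" only when it is a whole word. "Theological Seminary" is an invented test string.

```python
import re

def clean_name(name):
    """Sketch of the cleaning rules from this step using regular expressions."""
    name = name.split("(")[0]                 # drop parenthesized abbreviations such as "(MIT)"
    name = re.sub(r"[,\-.&]", " ", name)      # punctuation becomes a space
    name = name.replace("SUNY", " ")          # drop the SUNY designation
    name = re.sub(r"\s+", " ", name).strip()  # collapse runs of whitespace, trim the ends
    # Remove a leading "The" only when it is followed by whitespace
    name = re.sub(r"^the\s+", "", name, flags=re.IGNORECASE)
    return name

for raw in [
    "Massachusetts Institute of Technology (MIT)",
    "University of California, Berkeley (UCB)",
    "The Ohio State University",
    "Theological Seminary",
]:
    print(repr(clean_name(raw)))
```

Printing with `repr` makes stray leading or trailing spaces visible, which is easy to miss in plain output.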


In [7]:
# We observed that in row 38, the school name was too wordy, so we shortened it for easier future lookup
main.loc[38, "school"] = "Stony Brook University"

In [8]:
for m in main['school']: 
    print(m)

Massachusetts Institute of Technology
Stanford University
Carnegie Mellon University
University of California Berkeley
Harvard University
Princeton University
University of California Los Angeles
University of Washington
Columbia University
Cornell University
New York University
Georgia Institute of Technology
California Institute of Technology
University of Texas at Austin
University of Illinois at Urbana Champaign
University of Pennsylvania
University of Southern California
Yale University
University of Chicago
University of Michigan Ann Arbor
University of Maryland College Park
Boston University
Duke University
Johns Hopkins University
Purdue University
University of California San Diego
University of Wisconsin Madison
Michigan State University
Pennsylvania State University
University of California Irvine
University of Massachusetts Amherst
University of North Carolina Chapel Hill
Brown University
Northeastern University
Northwestern University
 Ohio State University
Rice University

### Step 4: Importing Salaries
We look up each school from `main` in the `salarydf` dataframe and add its corresponding early career pay to `main`. 

In [9]:
def contains(school, string): 
    '''
    Returns True if every space-separated part of the school name occurs in string
    (as a substring), and False otherwise. 
    Example: if school is "Columbia University" and string is
    "Columbia University at Main Campus", returns True.
    '''
    return all(part in string for part in school.split())
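Because `contains` tests each part of the school name as a substring, a short part such as "of" also matches inside longer words. A stricter whole-word comparison could look like the sketch below. `contains_words` is a hypothetical alternative, not the function used in this notebook, it ignores case (which `contains` does not), and the candidate names are invented for illustration.

```python
def contains_words(school, candidate):
    """Return True if every word of `school` appears as a whole word in `candidate`."""
    candidate_words = set(candidate.lower().split())
    return all(word in candidate_words for word in school.lower().split())

# Substring matching finds "of" inside "Professional"
print("of" in "Professional")  # True

# Whole-word matching rejects that candidate because "of" is not one of its words
print(contains_words("University of Chicago", "Chicago Professional University"))  # False

# A genuine match still succeeds
print(contains_words("Columbia University", "Columbia University in the City of New York"))  # True
```

Whether the stricter rule is preferable depends on how the lookup tables spell their names, so it is worth comparing the "Not found" lists produced by both versions.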

In [10]:
def lookup_sal(cleaned_sal, school):
    # Boolean mask marking rows whose school name contains every part of `school`
    mask = cleaned_sal['school'].apply(lambda s: contains(school, s))
    subset = cleaned_sal.loc[mask]

    if subset.empty: 
        print("Not found: " + school)
        return None
    # Return early career pay for the first matching row
    return subset['early_pay'].iloc[0]

In [11]:
school_series = main['school'].copy()

# Apply cleaning to salarydf for consistency
cleaned_sal = salarydf.copy()
cleaned_sal['school'] = cleaned_sal['school'].apply(clean_string)

In [12]:
earlypay_series = school_series.apply(lambda school: lookup_sal(cleaned_sal, school))
earlypay_series.head()

Not found: California Institute of Technology
Not found: University of Rochester
Not found: Georgetown University
Not found: Emory University


0     99,800
1    107,400
2     99,000
3    105,700
4     96,100
Name: school, dtype: object
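The series printed above has dtype `object`, and its values are strings with thousands separators. If the salaries are still stored as text at analysis time, they would need converting before any arithmetic. The following is a minimal sketch on a toy series that mirrors that format; `early_pay` and `early_pay_numeric` are illustrative names.

```python
import pandas as pd

# Toy series in the same string format as the output above; None marks a missing salary
early_pay = pd.Series(["99,800", "107,400", None])

# Strip the thousands separators, then convert; missing entries become NaN
early_pay_numeric = pd.to_numeric(early_pay.str.replace(",", "", regex=False))

print(early_pay_numeric)
```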

In [13]:
# Update main with early pay data 
main['early_pay'] = earlypay_series
main.head()

Unnamed: 0,school,rank,tsr,pop,early_pay,unitID,county
0,Massachusetts Institute of Technology,1,,,99800,,
1,Stanford University,2,,,107400,,
2,Carnegie Mellon University,3,,,99000,,
3,University of California Berkeley,4,,,105700,,
4,Harvard University,7,,,96100,,


The website we scraped did not have salary information for the universities reported as not found above, so we removed them from consideration. Manually searching for these salaries and entering them would make our salary data inconsistent, because different online sources use different data collection methods. 

In [14]:
# Remove universities for which there is no salary data
to_drop = ['California Institute of Technology', 'University of Rochester', 'Georgetown University', 'Emory University']
main = main[~main['school'].isin(to_drop)]
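An alternative to listing the schools by hand is to drop every row whose salary is missing, which keeps working if the set of unmatched schools changes. The sketch below shows this on a toy dataframe and assumes that unmatched schools have a missing `early_pay` value, as they do when the lookup returns `None`.

```python
import pandas as pd

# Toy frame: one school with a salary, one without
toy = pd.DataFrame({
    "school": ["Stanford University", "Emory University"],
    "early_pay": ["107,400", None],
})

# Keep only the rows that have a salary value
toy = toy.dropna(subset=["early_pay"])

print(toy)
```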

### Step 5: Importing Student-Faculty Ratio 
We look up the student-faculty ratio for each college in `main` from the dataframe `stud_fac_ratio` and import it into `main`. 

In [15]:
def lookup_from_stud_fac(cleaned_ratio, school, return_col):
    # Boolean mask marking institutions whose name contains every part of `school`
    mask = cleaned_ratio['Institution Name'].apply(lambda s: contains(school, s))
    subset = cleaned_ratio.loc[mask]

    if subset.empty: 
        print("Not found: " + school)
        return None
    # Return the requested column for the first matching institution
    return subset[return_col].iloc[0]

In [16]:
# Apply cleaning to stud_fac_ratio dataframe for consistency
cleaned_ratio = stud_fac_ratio.copy()
cleaned_ratio['Institution Name'] = cleaned_ratio['Institution Name'].apply(lambda s: clean_string(s))

In [17]:
# Column in stud_fac_ratio dataframe that we want to grab data for 
column_of_interest = 'Student-to-faculty ratio (EF2018D)'
ratio_series = school_series.apply(lambda school: lookup_from_stud_fac(cleaned_ratio, school, column_of_interest))

In [18]:
# Update main dataframe with student faculty ratios
main['tsr'] = ratio_series
main.head()

Unnamed: 0,school,rank,tsr,pop,early_pay,unitID,county
0,Massachusetts Institute of Technology,1,3.0,,99800,,
1,Stanford University,2,5.0,,107400,,
2,Carnegie Mellon University,3,10.0,,99000,,
3,University of California Berkeley,4,20.0,,105700,,
4,Harvard University,7,7.0,,96100,,
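`lookup_from_stud_fac` returns the first matching row, so a school name that matches several institutions is resolved silently. One way to surface such cases is to list every match before picking one. The sketch below assumes the same substring rule as `contains`. `matching_names` is a hypothetical helper, and the institution names are invented for illustration.

```python
import pandas as pd

def contains(school, string):
    """Same rule as in Step 4: every part of `school` occurs in `string`."""
    return all(part in string for part in school.split())

def matching_names(names, school):
    """Hypothetical helper: every entry of `names` that matches `school`."""
    return [name for name in names if contains(school, name)]

# Invented institution names, for illustration only
toy_names = pd.Series([
    "Example University Main Campus",
    "Example University Downtown Campus",
    "Another College",
])

matches = matching_names(toy_names, "Example University")
if len(matches) > 1:
    print("Ambiguous match:", matches)
```

Schools flagged this way can then be checked by hand to confirm that the first match is the intended campus.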


### Step 6: Importing UnitID 
We also grab the unitID for each college from the dataframe `stud_fac_ratio`. In future steps, we will use the unitID to look up colleges in dataframes originating from National Center for Education Statistics datasets. Using the unitID will be easier than trying to match up variations of college names.  

In [19]:
# Column in stud_fac_ratio dataframe that we want to grab data for 
column_of_interest = 'UnitID'
unitid_series = school_series.apply(lambda school: lookup_from_stud_fac(cleaned_ratio, school, column_of_interest))

In [20]:
# Update main dataframe with unitIDs
main['unitID'] = unitid_series
main.head()

Unnamed: 0,school,rank,tsr,pop,early_pay,unitID,county
0,Massachusetts Institute of Technology,1,3.0,,99800,166683,
1,Stanford University,2,5.0,,107400,243744,
2,Carnegie Mellon University,3,10.0,,99000,211440,
3,University of California Berkeley,4,20.0,,105700,110635,
4,Harvard University,7,7.0,,96100,166027,


### Step 7: Importing Enrollment
For each college in `main`, we look up that college by its `unitID` in the `enrollment` dataframe and grab the corresponding total undergraduate enrollment number. 

In [21]:
def lookup_enroll(enrolldf, unitID):
    # Keep only the undergraduate-total row for this unitID
    subset = enrolldf.loc[(enrolldf['Unit Id'] == unitID) & (enrolldf['Student level'] == 'Undergraduate total')]

    # Retrieve student enrollment for the first matching row 
    return subset['Grand Total'].iloc[0]

In [22]:
unitID_series = main.copy()['unitID']

In [23]:
enroll_series = unitID_series.apply(lambda unitID: lookup_enroll(enrollment, unitID))

In [24]:
# Update main with the enrollment numbers
main['pop'] = enroll_series
main.head()

Unnamed: 0,school,rank,tsr,pop,early_pay,unitID,county
0,Massachusetts Institute of Technology,1,3.0,4602,99800,166683,
1,Stanford University,2,5.0,7087,107400,243744,
2,Carnegie Mellon University,3,10.0,6589,99000,211440,
3,University of California Berkeley,4,20.0,30853,105700,110635,
4,Harvard University,7,7.0,9950,96100,166027,
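Since both dataframes share the unit ID, the same result can be obtained with a single join instead of one lookup per row. The following is a sketch on toy data, not the notebook's method: the unit IDs and undergraduate totals are taken from the table above, while the "Graduate total" row and its value are invented to show the filter at work.

```python
import pandas as pd

main_toy = pd.DataFrame({
    "school": ["Massachusetts Institute of Technology", "Stanford University"],
    "unitID": [166683, 243744],
})

enrollment_toy = pd.DataFrame({
    "Unit Id": [166683, 166683, 243744],
    "Student level": ["Undergraduate total", "Graduate total", "Undergraduate total"],
    "Grand Total": [4602, 1000, 7087],  # the graduate figure is a placeholder
})

# Keep one row per institution: its undergraduate total
undergrad = enrollment_toy.loc[
    enrollment_toy["Student level"] == "Undergraduate total",
    ["Unit Id", "Grand Total"],
]

# Left join keeps every school; validate raises if a unit ID appears twice in `undergrad`
merged = main_toy.merge(
    undergrad, how="left", left_on="unitID", right_on="Unit Id", validate="many_to_one"
)
merged = merged.drop(columns="Unit Id").rename(columns={"Grand Total": "pop"})

print(merged)
```

With a left join, a school whose unit ID is absent from the enrollment data gets a missing value rather than raising an error.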


### Step 8: Importing County
We will use the school's county in a later step to help find the salary of computer scientists in the area where the school is located. For now, we retrieve the county from the dataframe `location` and import it into `main`. 

In [25]:
def lookup_county(locationdf, unitID):
    # Rows of the location table that belong to this unitID
    subset = locationdf.loc[locationdf['UnitID'] == unitID]

    # Return the county name of the first matching row
    return subset['County name (HD2018)'].iloc[0]

In [26]:
county_series = unitID_series.apply(lambda unitID: lookup_county(location, unitID))

In [27]:
# Update main dataframe with county info
main['county'] = county_series 
main.head()

Unnamed: 0,school,rank,tsr,pop,early_pay,unitID,county
0,Massachusetts Institute of Technology,1,3.0,4602,99800,166683,Middlesex County
1,Stanford University,2,5.0,7087,107400,243744,Santa Clara County
2,Carnegie Mellon University,3,10.0,6589,99000,211440,Allegheny County
3,University of California Berkeley,4,20.0,30853,105700,110635,Alameda County
4,Harvard University,7,7.0,9950,96100,166027,Middlesex County
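When the lookup table has exactly one row per unit ID, the county can also be attached by mapping the IDs through a series indexed by `UnitID`. The sketch below uses toy data and assumes that one-row-per-ID property. The first two unit IDs and counties come from the table above, and 999999 is a made-up ID used to show what happens when there is no match.

```python
import pandas as pd

location_toy = pd.DataFrame({
    "UnitID": [166683, 243744],
    "County name (HD2018)": ["Middlesex County", "Santa Clara County"],
})

main_toy = pd.DataFrame({"unitID": [243744, 166683, 999999]})  # 999999 has no match

# Series that maps each UnitID to its county name
county_by_id = location_toy.set_index("UnitID")["County name (HD2018)"]

# IDs without an entry in the lookup series map to NaN
main_toy["county"] = main_toy["unitID"].map(county_by_id)

print(main_toy)
```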


In [28]:
# TODO:
# rankings into bins 
# average professor rating scrape 
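For the first TODO item, one possible way to group rankings into bins is `pd.cut`. This is only a sketch: the bin edges and labels below are arbitrary placeholders, not choices made in this project, and the rank values are a toy sample.

```python
import pandas as pd

ranks = pd.Series([1, 2, 3, 4, 7, 25, 48], name="rank")

# Each bin includes its right edge: (0, 10], (10, 25], (25, 50]
rank_bin = pd.cut(ranks, bins=[0, 10, 25, 50], labels=["1-10", "11-25", "26-50"])

print(pd.DataFrame({"rank": ranks, "rank_bin": rank_bin}))
```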

# Data Description

**Choosing a College Ranking List**

# Data Limitations

# Exploratory Data Analysis

# Questions for Reviewers

# Appendix
For our final report, we would include the notebook containing our web scraping code here. We web-scraped computer science post-undergrad salaries from `payscale.com` and scraped the top computer science universities from `topuniversities.com`. 