# Research Question
What combination of college characteristics results in the greatest post-undergrad salary for computer science majors? 

The factors/characteristics we are analyzing are: college prestige ranking, teacher-to-student ratio, student population, average professor rating, geographic location, and the salary of computer scientists in the area where the college is located.

# Data Collection

We include this data collection section to explain how data that did not come from a CSV file was imported into a pandas dataframe. This section has to come first because otherwise the dataframes generated here would not be in scope for the data cleaning section.

In [1]:
# Import necessary packages
import requests 
import bs4
import pandas as pd

---
**Code for Webscraping College Prestige Rankings** 

We used a modified form of "web scraping": we highlighted all of the text on the website and copied and pasted it into a txt file, which is easier to parse automatically. Truly scraping the text off of the website would have required complex and time-consuming parsing of its HTML structure. Below is the code for parsing the txt file:

In [159]:
rankingdf = pd.DataFrame(data = None, columns = ["rank", "school"])

In [160]:
def parse_school_name(line): 
    '''
    param line is a line from the txt file of the form 'school_name Logo school_name More United States'
    return a string of the school name parsed from the file line
    '''
    text = line.split("Logo")[1]
    return text.split("More")[0]

In [161]:
def parse_lines(lines, rankingdf): 
    '''
    param lines are the lines read from the txt file containing college prestige rankings from topuniversities.com
    returns a dataframe that has parsed school names and international prestige rankings appended to it 
    '''
    idx = 1
    while idx <= len(lines) - 1: 
        rank = lines[idx - 1].replace("\n", "")
        school = parse_school_name(lines[idx])
        entry = {"rank": rank, "school": school}
        # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
        rankingdf = pd.concat([rankingdf, pd.DataFrame([entry])], ignore_index=True)
        idx = idx + 2
    return rankingdf
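
The txt file alternates between a rank line and a school line, which is why the loop above steps two lines at a time. As a minimal sketch of the same pairing, assuming that layout, the hypothetical helper `parse_ranking_lines` below collects plain records from a few made-up sample lines (the sample lines and the helper name are illustrative and not part of our data files):

```python
def parse_ranking_lines(lines):
    """Hypothetical helper: pair each rank line with the school line after it."""
    records = []
    # Even positions hold ranks, odd positions hold the matching school line
    for rank_line, school_line in zip(lines[0::2], lines[1::2]):
        # Same split logic as parse_school_name, plus stripping of whitespace
        school = school_line.split("Logo")[1].split("More")[0].strip()
        records.append({"rank": rank_line.strip(), "school": school})
    return records


# Illustrative sample in the 'school_name Logo school_name More United States' format
sample_lines = [
    "1\n",
    "Massachusetts Institute of Technology (MIT) Logo Massachusetts Institute of Technology (MIT) More United States\n",
    "2\n",
    "Stanford University Logo Stanford University More United States\n",
]
records = parse_ranking_lines(sample_lines)
```

Passing `records` to `pd.DataFrame` would give the same `rank` and `school` columns while building the dataframe once instead of growing it row by row.
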

In [162]:
with open("topuniversities_rankings.txt", 'r') as f: 
    lines = f.readlines()
    rankingdf = parse_lines(lines, rankingdf)

In [163]:
rankingdf.head()

Unnamed: 0,rank,school
0,1,Massachusetts Institute of Technology (MIT)
1,2,Stanford University
2,3,Carnegie Mellon University
3,4,"University of California, Berkeley (UCB)"
4,7,Harvard University


---
**Code for Webscraping Computer Science Graduates' Salaries (by college)**

In [7]:
def parse_school_name(href): 
    '''
    param href is the link component of an <a> tag of the form '/research/US/School=school_name/Salary'
    returns school_name parsed from href
    '''
    school_name = href.split("/research/US/School=")[1]
    school_name = school_name.split("/Salary")[0]
    return school_name 

In [8]:
def parse_career_pay(spans): 
    '''
    param spans is all of the spans for a given HTML table row in payscale.com's table element
    returns (early career pay, mid career pay) for the college associated with the table row
    '''
    early_pay = None
    mid_pay = None
    for span in spans: 
        txt = span.text
        if txt.find("$") != -1: 
            if early_pay is None: 
                early_pay = txt.split("$")[1]
            else: 
                mid_pay = txt.split("$")[1]
                break 
    return (early_pay, mid_pay)
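
To illustrate what `parse_career_pay` extracts, here is a small sketch that runs the same "first two dollar amounts" idea on stand-in objects. `Span` is a hypothetical substitute for a BeautifulSoup tag (only its `.text` attribute is used), and the sample row values are made up for illustration:

```python
from collections import namedtuple

# Hypothetical stand-in for a bs4 tag: only the .text attribute is needed here
Span = namedtuple("Span", ["text"])


def first_two_dollar_values(spans):
    """Return the first two '$' amounts found in spans as (early, mid)."""
    amounts = [s.text.split("$")[1] for s in spans if "$" in s.text]
    early = amounts[0] if len(amounts) > 0 else None
    mid = amounts[1] if len(amounts) > 1 else None
    return early, mid


# Illustrative row: a rank, two pay values, and an unrelated percentage
row = [Span("1"), Span("$107,400"), Span("$174,800"), Span("57%")]
early, mid = first_two_dollar_values(row)
no_pay = first_two_dollar_values([Span("n/a")])
```

The approach relies on early career pay appearing before mid career pay in each table row, which is an assumption about the page layout.
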

In [9]:
def add_rows(salarydf, rows): 
    '''
    param salarydf is the dataframe to add rows to 
    param rows are the HTML table row elements from payscale.com webpage 
    returns a dataframe that has college name and corresponding early and mid career pay appended to it for all 
    HTML table rows in the HTML table element
    '''
    for row in rows: 
        spans = row.findAll("span")
        title_idx = 0
        for idx, span in enumerate(spans): 
            if span.text == "School Name:": 
                title_idx = idx + 1
                break 
        school_name = spans[title_idx].text     
        #school_name = parse_school_name(link)
        spans = row.findAll("span", class_="data-table__value")
        early_pay, mid_pay = parse_career_pay(spans)
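        # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
        new_row = pd.DataFrame([{"school": school_name, "early_pay": early_pay, "mid_pay": mid_pay}])
        salarydf = pd.concat([salarydf, new_row], ignore_index=True)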
    return salarydf

In [10]:
def get_salaries(page, salarydf, base_url): 
    '''
    param page is the first page to webscrape 
    param salarydf is the dataframe to append scraped data to 
    param base_url is the name of the webpage to scrape data from, excluding URL components for page number 
    returns a dataframe that has school names and corresponding mid and early career pay appended to it for all schools
    in payscale.com's salary table (which spans multiple web pages)
    '''
    while page < 26: 
        salary_url = base_url + "page/" + str(page)
        salary_resp = requests.get(salary_url)
        salary_soup = bs4.BeautifulSoup(salary_resp.text, 'html.parser')
        table = salary_soup.body.findAll("table", class_ = "data-table")
        rows = table[0].findAll("tr", class_ = "data-table__row")
        salarydf = add_rows(salarydf, rows)
        page = page + 1 
    return salarydf
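
The loop in `get_salaries` visits pages 1 through 25 by appending `page/<n>` to the base URL. A minimal sketch of just that URL construction, with no network access, might look like the following (`page_urls` is a hypothetical helper, and the default page range mirrors the `page < 26` condition above):

```python
def page_urls(base_url, first_page=1, last_page=25):
    """Hypothetical helper: list the paginated URLs that get_salaries would request."""
    return [f"{base_url}page/{n}" for n in range(first_page, last_page + 1)]


base_url = "https://www.payscale.com/college-salary-report/best-schools-by-majors/computer-science/"
urls = page_urls(base_url)
```

Listing the URLs up front makes it easy to check the page range before sending any requests.
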

In [11]:
base_url = "https://www.payscale.com/college-salary-report/best-schools-by-majors/computer-science/"
salarydf = pd.DataFrame(data = None, columns = ["school", "early_pay", "mid_pay"])
salarydf = get_salaries(1, salarydf, base_url)

In [12]:
salarydf.head()

Unnamed: 0,school,early_pay,mid_pay
0,Stanford University,107400,174800
1,Harvey Mudd College,100500,172800
2,Carnegie Mellon University,99000,167200
3,University of California-Berkeley,105700,166500
4,Princeton University,101300,162500


In [173]:
for m in salarydf['school']: 
    print(m)

Stanford University
Harvey Mudd College
Carnegie Mellon University
University of California-Berkeley
Princeton University
Dartmouth College
Massachusetts Institute of Technology
Harvard University
Yale University
Columbia University in the City of New York
Duke University
Brown University
University of Pennsylvania
University of Washington-Seattle Campus
United States Naval Academy
Cornell University
Rice University
University of Virginia-Main Campus
University of California-Santa Barbara
University of California-Los Angeles
Worcester Polytechnic Institute
CUNY Hunter College
Johns Hopkins University
University of California-Santa Cruz
Lehigh University
University of California-San Diego
California Polytechnic State University-San Luis Obispo
Santa Clara University
University of Notre Dame
Brandeis University
Case Western Reserve University
University of California-Irvine
United States Military Academy
Virginia Polytechnic Institute and State University
Stony Brook University
Boston Co

# Data Cleaning

### Step 1: Loading CSVs
We have dataframes created from web-scraped data above. Now, we want to get dataframes for data stored in CSV files. In later steps, we will combine all of these dataframes into one dataframe. 

In [13]:
# Loading all CSV files 
stud_fac_ratio = pd.read_csv("stud_fac_ratio.csv")
enrollment = pd.read_csv("enrollment.csv")
location = pd.read_csv("geographic_characteristics.csv")

### Step 2: Creating Master Dataframe
We want one big dataframe, hereafter called the 'master dataframe', that has one entry per college. **The dataframe will have these columns: school, ranking, tsr (stands for teacher-student ratio), pop (for student undergrad population)...TODO** Since we already have a dataframe `rankingdf` that contains college names and rankings, we will use a copy of it as the base for our master dataframe. 

In [270]:
master = rankingdf.copy() 

In [271]:
# Rearrange columns so school name comes first 
master = master[['school', 'rank']]
# Add empty columns 
master = pd.concat([master, pd.DataFrame(columns=['tsr', 'pop', 'early_pay'])])
master.head()

Unnamed: 0,school,rank,tsr,pop,early_pay
0,Massachusetts Institute of Technology (MIT),1,,,
1,Stanford University,2,,,
2,Carnegie Mellon University,3,,,
3,"University of California, Berkeley (UCB)",4,,,
4,Harvard University,7,,,
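
As an aside, the same "add empty columns" step can also be sketched with `DataFrame.reindex`, which keeps the existing columns and fills any new ones with NaN. The one-row dataframe below is illustrative only:

```python
import pandas as pd

# Illustrative one-row stand-in for the master dataframe
base = pd.DataFrame({"school": ["Stanford University"], "rank": ["2"]})

# reindex keeps 'school' and 'rank' and creates the missing columns filled with NaN
wide = base.reindex(columns=["school", "rank", "tsr", "pop", "early_pay"])
```
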


### Step 3: Clean School Names
We will be using college names in the `master` dataframe to look up college characteristics in other dataframes (i.e. we will treat the other dataframes as lookup tables). Because other dataframes may format their college names differently, we remove punctuation and abbreviations from `master` school names and strip whitespace for consistency. We also remove 'The' from the beginning of college names, and we remove 'SUNY' because it is an unnecessary designation. 

In [272]:
def clean_string(s): 
    no_abbrvs = s.split("(")[0]
    no_punc = no_abbrvs.replace(",", "").replace(" - ", " ").replace("-", " ").replace(".", " ").replace("&", " ")
    no_suny = no_punc.replace("SUNY", "")
    stripped = no_suny.strip()
    
    # Remove a leading "The" (and the space after it) from the name
    if stripped.lower().startswith("the "): 
        stripped = stripped[4:].strip()
    return stripped
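
To make the effect of the cleaning concrete, here is a minimal sketch that applies the same rules to the two spellings of Berkeley that appear in our ranking and salary data. `normalize_name` is a hypothetical variant of `clean_string`, not a replacement for it; unlike `clean_string`, it also collapses repeated spaces:

```python
def normalize_name(name):
    """Hypothetical variant of clean_string that also collapses repeated spaces."""
    # Drop parenthesized abbreviations such as "(UCB)"
    base = name.split("(")[0]
    # Replace punctuation with spaces so hyphenated and comma forms agree
    for ch in [",", "-", ".", "&"]:
        base = base.replace(ch, " ")
    base = base.replace("SUNY", "")
    words = base.split()
    # Drop a leading "The"
    if words and words[0].lower() == "the":
        words = words[1:]
    return " ".join(words)


ranking_style = normalize_name("University of California, Berkeley (UCB)")
salary_style = normalize_name("University of California-Berkeley")
no_article = normalize_name("The Ohio State University")
```

Both spellings reduce to the same string, which is what makes the later lookup by school name possible.
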

In [273]:
# Clean school names as described earlier
master['school'] = master['school'].apply(lambda s: clean_string(s))
master.head()

Unnamed: 0,school,rank,tsr,pop,early_pay
0,Massachusetts Institute of Technology,1,,,
1,Stanford University,2,,,
2,Carnegie Mellon University,3,,,
3,University of California Berkeley,4,,,
4,Harvard University,7,,,


In [274]:
# We observed that in row 38, the school name was too wordy, so we shortened it for easier future lookup
master.loc[38, "school"] = "Stony Brook University"

In [275]:
for m in master['school']: 
    print(m)

Massachusetts Institute of Technology
Stanford University
Carnegie Mellon University
University of California Berkeley
Harvard University
Princeton University
University of California Los Angeles
University of Washington
Columbia University
Cornell University
New York University
Georgia Institute of Technology
California Institute of Technology
University of Texas at Austin
University of Illinois at Urbana Champaign
University of Pennsylvania
University of Southern California
Yale University
University of Chicago
University of Michigan Ann Arbor
University of Maryland College Park
Boston University
Duke University
Johns Hopkins University
Purdue University
University of California San Diego
University of Wisconsin Madison
Michigan State University
Pennsylvania State University
University of California Irvine
University of Massachusetts Amherst
University of North Carolina Chapel Hill
Brown University
Northeastern University
Northwestern University
 Ohio State University
Rice University

### Step 4: Importing Salaries
We will look up each school in `master` in the `salarydf` dataframe and copy its corresponding early career pay into `master`. 

In [276]:
def contains(school, string): 
    '''
    returns: True if all components of the school name are found in string. False otherwise. 
    example: if school is "Columbia University" and string is "Columbia University at Main Campus", returns True.
    '''
    parts = school.split(" ")
    for part in parts: 
        if string.find(part) == -1: 
            return False
    return True 
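
One caveat of matching on name components is that word order and position are ignored, so a school can match a longer, unrelated name that happens to contain all of its words. The sketch below repeats `contains` so it runs on its own and compares it with a stricter prefix check. `prefix_match` is a hypothetical alternative and not what this notebook uses:

```python
def contains(school, string):
    """Same logic as above: True if every word of school occurs somewhere in string."""
    for part in school.split(" "):
        if string.find(part) == -1:
            return False
    return True


def prefix_match(school, string):
    """Hypothetical stricter check: string must begin with the full school name."""
    return string.startswith(school)


columbia = "Columbia University in the City of New York"

# All three words of "New York University" occur in Columbia's full name
loose_hit = contains("New York University", columbia)
strict_hit = prefix_match("New York University", columbia)
intended_hit = prefix_match("Columbia University", columbia)
```

Here `loose_hit` is `True` even though the two schools differ, while the prefix check accepts only the intended match. A prefix check has its own blind spots (for example, campuses that share a name prefix), so the matches would still need a manual review.
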

In [277]:
def lookup(cleaned_sal, school):
    '''
    param cleaned_sal is the salary dataframe with cleaned school names
    param school is the cleaned school name to search for
    returns the early career pay of the first matching row, or None if no row matches
    '''
    # Boolean mask of rows whose school name contains every component of school
    matches = cleaned_sal['school'].apply(lambda s: contains(school, s))
    subset = cleaned_sal.loc[matches]
    if subset.empty: 
        print("Not found: " + school)
        return None
    # Take the first match; this does not check that it is the only one
    return subset['early_pay'].iloc[0]

In [278]:
school_series = master.copy()['school']

# Apply cleaning to salarydf for consistency
cleaned_sal = salarydf.copy()
cleaned_sal['school'] = cleaned_sal['school'].apply(lambda s: clean_string(s))

In [279]:
earlypay_series = school_series.apply(lambda school: lookup(cleaned_sal, school))
earlypay_series.head()

Not found: California Institute of Technology
Not found: University of Rochester
Not found: Georgetown University
Not found: Emory University


0     99,800
1    107,400
2     99,000
3    105,700
4     96,100
Name: school, dtype: object

In [282]:
# Update master with early pay data 
master['early_pay'] = earlypay_series
master.head()

Unnamed: 0,school,rank,tsr,pop,early_pay
0,Massachusetts Institute of Technology,1,,,99800
1,Stanford University,2,,,107400
2,Carnegie Mellon University,3,,,99000
3,University of California Berkeley,4,,,105700
4,Harvard University,7,,,96100
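
The early career pay values printed above are strings with thousands separators (the series has dtype `object`), so they would need to be converted before any numeric analysis. A minimal sketch of that conversion, using a hypothetical helper `parse_pay` and a few values from the output above:

```python
def parse_pay(value):
    """Convert a pay string such as '99,800' to an int; pass None through."""
    if value is None:
        return None
    return int(value.replace(",", ""))


# Two values from the output above, plus None for a school that was not found
early_pay = ["99,800", "107,400", None]
numeric = [parse_pay(v) for v in early_pay]
```

The same function could be applied to the `early_pay` column with `Series.apply`.
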


Since the website we scraped did not contain salary information for the four universities reported as "Not found" above, we removed these universities from consideration. Manually searching for and entering these salaries would make our salary data inconsistent, because different online sources use different data collection methods. 

In [284]:
# Remove universities for which there is no salary data
to_drop = ['California Institute of Technology', 'University of Rochester', 'Georgetown University', 'Emory University']
master = master[~master['school'].isin(to_drop)]
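
Instead of listing the schools by hand, the same rows could be dropped by filtering on the missing salary itself. This is a sketch on an illustrative two-row dataframe, assuming (as in `lookup`) that a missing salary is stored as `None`:

```python
import pandas as pd

# Illustrative stand-in for master: one school with pay data, one without
demo = pd.DataFrame({
    "school": ["Stanford University", "California Institute of Technology"],
    "early_pay": ["107,400", None],
})

# dropna removes every row whose early_pay is missing
kept = demo.dropna(subset=["early_pay"]).reset_index(drop=True)
```

This keeps the filter in sync with the lookup results if the list of schools changes.
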

In [17]:
# TODO:
# rankings into bins 
# put salaries into rankingdf
# average professor rating 

# Data Description

**Choosing a College Ranking List**

# Data Limitations

# Exploratory Data Analysis

# Questions for Reviewers

# Appendix