# Research Question
What combination of college characteristics results in the greatest post-undergrad salary for computer science majors? 

# Data Collection

This section explains how data that did not come from a CSV file was loaded into pandas dataframes. It comes first so that the dataframes created here are already defined when the data cleaning section uses them.

In [2]:
# Import necessary packages
import requests 
import bs4
import pandas as pd

---
**Code for Webscraping College Prestige Rankings** 

Rather than scraping the site's HTML directly, we selected all of the text on the web page and pasted it into a txt file. Parsing the page's HTML structure would have been complex and time-consuming, whereas the plain text is easy to parse automatically. The code below parses that txt file:

In [13]:
rankingdf = pd.DataFrame(data = None, columns = ["rank", "school"])

In [14]:
def parse_school_name(line): 
    '''
    param line is a line from the txt file of the form 'school_name Logo school_name More United States'
    return a string of the school name parsed from the file line
    '''
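    text = line.split("Logo")[1]
    # Split on the last "More" so a name that itself contains "More" is kept whole,
    # then strip the spaces left around the name
    return text.rsplit("More", 1)[0].strip()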

In [15]:
def parse_lines(lines, rankingdf): 
    '''
    param lines are the lines read from the txt file containing college prestige rankings from topuniversities.com
    returns a dataframe that has parsed school names and international prestige rankings appended to it 
    '''
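    entries = []
    # Lines alternate: a rank line, then the matching school line
    for idx in range(1, len(lines), 2): 
        rank = lines[idx - 1].strip()
        school = parse_school_name(lines[idx])
        entries.append({"rank": rank, "school": school})
    # DataFrame.append was removed in pandas 2.0, so add all rows with one concat
    new_rows = pd.DataFrame(entries, columns=rankingdf.columns)
    return pd.concat([rankingdf, new_rows], ignore_index=True)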

In [16]:
with open("topuniversities_rankings.txt", 'r') as f: 
    lines = f.readlines()
    rankingdf = parse_lines(lines, rankingdf)

In [17]:
rankingdf.head()

| | rank | school |
|---|---|---|
| 0 | 1 | Massachusetts Institute of Technology (MIT) |
| 1 | 2 | Stanford University |
| 2 | 3 | Carnegie Mellon University |
| 3 | 4 | University of California, Berkeley (UCB) |
| 4 | 7 | Harvard University |
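
The alternating rank/name layout of the txt file can also be parsed in a single pass. The following is a minimal sketch, assuming the file alternates a rank line with a line of the form `school_name Logo school_name More United States`; the helper name `parse_ranking_lines` and the sample lines are illustrative and not part of the original notebook.

```python
def parse_ranking_lines(lines):
    """Return a list of {"rank", "school"} records from alternating rank/name lines."""
    records = []
    # Walk the lines two at a time: a rank line followed by a school line.
    for rank_line, school_line in zip(lines[0::2], lines[1::2]):
        rank = rank_line.strip()
        # Keep the text between "Logo" and the last "More", without padding spaces.
        school = school_line.split("Logo")[1].rsplit("More", 1)[0].strip()
        records.append({"rank": rank, "school": school})
    return records


# Hypothetical sample in the assumed file layout.
sample = [
    "1\n",
    "Stanford University Logo Stanford University More United States\n",
    "2\n",
    "Harvard University Logo Harvard University More United States\n",
]
records = parse_ranking_lines(sample)
print(records)
```

Passing `records` to `pd.DataFrame(records)` would build the table in one step, which avoids growing a dataframe row by row.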


---
**Code for Webscraping Computer Science Graduates' Salaries (by college)**

In [19]:
def parse_school_name_from_href(href): 
    '''
    param href is the link component of an <a> tag of the form '/research/US/School=school_name/Salary'
    returns school_name parsed from href
    (named differently from the txt-file parser above so that this definition does not overwrite it)
    '''
    school_name = href.split("/research/US/School=")[1]
    school_name = school_name.split("/Salary")[0]
    return school_name 

In [20]:
def parse_career_pay(spans): 
    '''
    param spans is all of the spans for a given HTML table row in payscale.com's table element
    returns (early career pay, mid career pay) for the college associated with the table row
    '''
    early_pay = None
    mid_pay = None
    for span in spans: 
        txt = span.text
        # The first dollar amount in the row is early career pay, the second is mid career pay
        if "$" in txt: 
            if early_pay is None: 
                early_pay = txt.split("$")[1]
            else: 
                mid_pay = txt.split("$")[1]
                break 
    return (early_pay, mid_pay)

In [21]:
def add_rows(salarydf, rows): 
    '''
    param salarydf is the dataframe to add rows to 
    param rows are the HTML table row elements from payscale.com webpage 
    returns a dataframe that has college name and corresponding early and mid career pay appended to it for all 
    HTML table rows in the HTML table element
    '''
    new_rows = []
    for row in rows: 
        spans = row.find_all("span")
        title_idx = 0
        for idx, span in enumerate(spans): 
            if span.text == "School Name:": 
                title_idx = idx + 1
                break 
        # The span right after the "School Name:" label holds the school name
        school_name = spans[title_idx].text     
        spans = row.find_all("span", class_="data-table__value")
        early_pay, mid_pay = parse_career_pay(spans)
        new_rows.append({"school": school_name, "early_pay": early_pay, "mid_pay": mid_pay})
    # DataFrame.append was removed in pandas 2.0, so add all rows with one concat
    return pd.concat([salarydf, pd.DataFrame(new_rows, columns=salarydf.columns)], ignore_index=True)

In [22]:
def get_salaries(page, salarydf, base_url): 
    '''
    param page is the first page to webscrape 
    param salarydf is the dataframe to append scraped data to 
    param base_url is the name of the webpage to scrape data from, excluding URL components for page number 
    returns a dataframe that has school names and corresponding mid and early career pay appended to it for all schools
    in payscale.com's salary table (which spans multiple web pages)
    '''
    while page < 26: 
        salary_url = base_url + "page/" + str(page)
        salary_resp = requests.get(salary_url)
        # Stop on HTTP errors instead of trying to parse an error page
        salary_resp.raise_for_status()
        salary_soup = bs4.BeautifulSoup(salary_resp.text, 'html.parser')
        table = salary_soup.body.find_all("table", class_ = "data-table")
        rows = table[0].find_all("tr", class_ = "data-table__row")
        salarydf = add_rows(salarydf, rows)
        page = page + 1 
    return salarydf
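
The page loop in `get_salaries` can be separated from the network call, which makes it possible to test the pagination without contacting the site. This is a minimal sketch, assuming page URLs follow the `base_url + "page/" + str(page)` pattern used above; `iter_page_urls`, `collect_rows`, and the injected `fetch_rows` callable are hypothetical names, and the stand-in fetcher returns canned data rather than requesting anything from payscale.com.

```python
def iter_page_urls(base_url, first_page, last_page):
    """Yield one URL per page, from first_page to last_page inclusive."""
    for page in range(first_page, last_page + 1):
        yield base_url + "page/" + str(page)


def collect_rows(urls, fetch_rows):
    """Call fetch_rows(url) for each URL and gather all returned rows."""
    all_rows = []
    for url in urls:
        rows = fetch_rows(url)
        if not rows:
            # Assumption: an empty page means we have run past the end of the table.
            break
        all_rows.extend(rows)
    return all_rows


# Stand-in for the real request-and-parse step (hypothetical canned data).
fake_pages = {
    "https://example.com/page/1": [{"school": "A", "early_pay": "1", "mid_pay": "2"}],
    "https://example.com/page/2": [{"school": "B", "early_pay": "3", "mid_pay": "4"}],
}
urls = list(iter_page_urls("https://example.com/", 1, 3))
rows = collect_rows(urls, lambda url: fake_pages.get(url, []))
print(len(rows))
```

In the notebook, `fetch_rows` would wrap the `requests.get` call and the BeautifulSoup row extraction.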

In [23]:
base_url = "https://www.payscale.com/college-salary-report/best-schools-by-majors/computer-science/"
salarydf = pd.DataFrame(data = None, columns = ["school", "early_pay", "mid_pay"])
salarydf = get_salaries(1, salarydf, base_url)

In [24]:
salarydf.head()

| | school | early_pay | mid_pay |
|---|---|---|---|
| 0 | Stanford University | 107400 | 174800 |
| 1 | Harvey Mudd College | 100500 | 172800 |
| 2 | Carnegie Mellon University | 99000 | 167200 |
| 3 | University of California-Berkeley | 105700 | 166500 |
| 4 | Princeton University | 101300 | 162500 |
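
`parse_career_pay` returns the pay values as strings, so they need converting before any arithmetic or sorting. A minimal sketch, assuming the scraped text may carry a dollar sign, thousands separators, or surrounding whitespace; `pay_to_int` is a hypothetical helper that is not part of the original notebook.

```python
def pay_to_int(pay):
    """Convert a pay string such as '$107,400' to an int, or None if missing or malformed."""
    if pay is None:
        return None
    # Drop the currency symbol, thousands separators, and padding.
    cleaned = pay.replace("$", "").replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


print(pay_to_int("$107,400"))
print(pay_to_int("174800"))
print(pay_to_int(None))
```

With pandas, the same helper could be applied to a whole column, for example `salarydf["early_pay"].map(pay_to_int)`.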


# Data Cleaning

In [None]:
# Add stuff to rankingdf 
# TODO:
# rankings into bins 
# put salaries into rankingdf
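
Both TODO items can be prototyped on plain Python values before touching the dataframes. This is a minimal sketch, assuming ranks are plain integers bucketed into fixed-width bins and that school names are matched across the two sources after light normalization (the two tables spell Berkeley differently); `rank_bin`, `normalize_school`, and the bin width of 10 are illustrative assumptions, not choices made in the notebook.

```python
import re


def rank_bin(rank, width=10):
    """Map a rank to a label such as '1-10' or '11-20' (the width is an assumption)."""
    rank = int(rank)
    low = ((rank - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def normalize_school(name):
    """Build a matching key: drop '(ABBR)' suffixes, unify punctuation, lowercase."""
    name = re.sub(r"\s*\([^)]*\)", "", name)  # remove parenthetical abbreviations
    name = re.sub(r"[,\-]", " ", name)        # treat commas and hyphens as spaces
    return " ".join(name.lower().split())     # collapse repeated whitespace


# Names as they appear in the ranking and salary tables.
ranking_name = "University of California, Berkeley (UCB)"
salary_name = "University of California-Berkeley"
print(normalize_school(ranking_name) == normalize_school(salary_name))
print(rank_bin("7"))
```

A normalized key column on both dataframes would give the salary data something consistent to join on. Names that differ by more than punctuation and abbreviations would still need manual review.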

# Data Description

**Choosing a College Ranking List**

# Data Limitations

# Exploratory Data Analysis

# Questions for Reviewers

# Appendix