# Research Question
What combination of college characteristics results in the greatest post-undergrad salary for computer science majors? 

The characteristics we analyze are: college prestige ranking, teacher-to-student ratio, student population, average professor rating, geographic location, and the salary of computer scientists in the area where the college is located.

# Data Cleaning 
(Summarized, without code)

#### Overall Goal
We want one big dataframe, hereafter called the `main` dataframe, that has one entry per college. The final dataframe will have these columns: school name, ranking, tsr (teacher-student ratio), pop (undergraduate student population), early_pay (post-graduation salary), a_mean (average salary for computer scientists in the area of the college), rat (average professor rating), and county (for location).

#### Data Sources for each factor 
- School names and prestige rankings
> We web-scraped a list of the top computer science schools in 2020 (ranked by prestige) from `topuniversities.com`. We chose this site because, unlike other ranking lists such as `USNews`, it incorporates employer surveys on school prestige into its rankings, which matters to us because we are analyzing relationships with job salary.
- Post-undergrad salary (called early_pay in our main dataframe) 
> We web-scraped early-career computer science salaries by school from `payscale.com`.
- Teacher-student ratio 
> The National Center for Education Statistics (NCES) collects data on teacher-student ratio for all US institutions. We downloaded their teacher-student ratio dataset of 6000+ schools for cleaning and processing. 
- Student undergrad population
> NCES has a dataset for enrollment (graduate and undergraduate) per school. 
- County
> NCES has a dataset for geographic characteristics per school (i.e. state, city, county, zip code, etc.) 
- Average salary for computer scientists in the area where the college is located 
> The Bureau of Labor Statistics (BLS) has a dataset for each occupation's salary in all metropolitan and nonmetropolitan areas. A (non)metropolitan area is defined by the counties that are within it, and the BLS includes a separate dataset that shows which counties map to which metropolitan area. Thus it is important that we find the county that each school is in, since we can map counties to (non)metropolitan areas, and then (non)metropolitan area to the computer science salary of that region.
- Average professor rating 
> We attempted to web-scrape average professor ratings from `ratemyprofessor.com`, but due to technical difficulties, we collected the data from the site manually.

#### Data Cleaning Steps
1. **Cleaning School Names**: We used the dataframe of school names and rankings (created from web scraping) as the basis of our `main` dataframe. Because we needed the school names in `main` to look up information about the same school in other datasets, we first cleaned the school names by removing extra spaces, punctuation, and abbreviations, since different datasets punctuate their school names differently.
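
As a rough illustration of this kind of normalization (not the project's actual cleaning code), a minimal sketch might look like the following. The function name, the lowercasing step, and the abbreviation table are all assumptions, and the sketch expands abbreviations rather than dropping them.

```python
import re

# Hypothetical abbreviation table; the expansions actually used are not
# listed in this report.
ABBREVIATIONS = {"univ": "university", "inst": "institute", "tech": "technology"}

def clean_school_name(name: str) -> str:
    """Lowercase, strip punctuation, expand abbreviations, collapse spaces."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"['’]", "", name)        # drop apostrophes entirely
    name = re.sub(r"[^\w\s]", " ", name)    # other punctuation becomes a space
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in name.split()]
    return " ".join(tokens)                 # split/join collapses extra spaces

print(clean_school_name("  Univ. of  California, Berkeley "))
```

Applying the same function to every dataset before matching means that differences in punctuation or spacing no longer block a lookup.
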

2. **Importing Post-Undergrad Salary**: The school names in our dataset of post-undergrad salaries differed from those in `main` (ex. `Columbia University` was listed as `Columbia University in New York City`), so we used the following algorithm to match school names: if a school name in the lookup dataset contains all of the words of the school name in `main`, in the same order, it is a match. For example, `Columbia University in New York City` contains `Columbia` and `University` in the same order as `Columbia University`, so the two match. In this way, we could import salary data from the lookup dataset into `main`.
> If a school wasn't found in the salary dataset, we checked that the algorithm wasn't failing. If the school really wasn't on `payscale.com`, we dropped it from the analysis: manually searching for and entering these salaries would make our salary data inconsistent, since different online sources use different data collection methods.
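
The matching rule in step 2 amounts to an ordered-subsequence test on words. The sketch below is one way to express it under that reading; the function names are hypothetical, and the comparison is made case-insensitive as an assumption.

```python
def is_ordered_match(main_name: str, candidate: str) -> bool:
    """True if every word of main_name appears in candidate, in the same order."""
    remaining = iter(candidate.lower().split())
    # `word in remaining` advances the iterator, so each later word must be
    # found after the position where the previous word was found.
    return all(word in remaining for word in main_name.lower().split())

def find_matches(main_name: str, candidates: list[str]) -> list[str]:
    """Return every candidate name that matches main_name."""
    return [c for c in candidates if is_ordered_match(main_name, c)]

lookup_names = ["Columbia University in New York City", "University of Columbia"]
print(find_matches("Columbia University", lookup_names))
```

A rule like this can return more than one candidate for a short name, so collecting all matches rather than only the first makes ambiguous cases easy to spot and check by hand.
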

3. **Importing UnitID**: We used an algorithm similar to the one in step 2 to look up and import the unitID for each school from an NCES dataset of schools and their unitIDs. The unitID is a unique number that NCES assigns to each college and uses to index its other datasets. We use the unitID to look up data from the NCES datasets in the following steps.

4. **Importing Student-Faculty Ratio, County, and Enrollment**: We used the unitID of each school to look up its enrollment, county, and student-faculty ratio from three different CSV files and imported them into the `main` dataframe.
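
A keyed lookup of this kind can be sketched with the standard library alone. The CSV text, column names, and unitID values below are made up for illustration and do not reflect the real NCES files; the helper names are hypothetical as well.

```python
import csv
import io

# Made-up stand-in for one NCES CSV file; real column names and values differ.
ENROLLMENT_CSV = "UNITID,ENROLLMENT\n1001,5200\n1002,31000\n"

def load_lookup(csv_text: str, key_col: str, value_col: str) -> dict:
    """Build a {unitID: value} dict from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_col]: row[value_col] for row in reader}

def import_column(main_rows: list, lookup: dict, new_col: str) -> list:
    """Add new_col to each row of main, keyed on unitID (None if absent)."""
    for row in main_rows:
        row[new_col] = lookup.get(row["unitid"])
    return main_rows

main = [
    {"school": "example college a", "unitid": "1001"},
    {"school": "example college b", "unitid": "9999"},  # not in the CSV
]
enrollment = load_lookup(ENROLLMENT_CSV, "UNITID", "ENROLLMENT")
import_column(main, enrollment, "pop")
print(main)
```

Values read this way arrive as strings, so numeric columns still need to be converted before analysis.
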

5. **Finding average computer scientist salary by (non)metropolitan area**: We took the BLS dataset of all occupations' salaries per (non)metropolitan area and filtered out all jobs unrelated to computer science. We defined a computer science-related job as one that someone with a bachelor's in computer science could hold and that would let them apply computer science knowledge (ex. Computer Systems Analysts). We then filtered out salaries listed as `*` (not found), grouped the rows by (non)metropolitan area, and calculated the mean salary per area.
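
The filter, group, and average sequence in step 5 can be sketched as below. The occupation set, the row layout, and the salary figures are illustrative assumptions (only Computer Systems Analysts is named in this report), and the sketch assumes salaries arrive as strings that may contain thousands separators.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical occupation list; only "Computer Systems Analysts" is named
# in the report.
CS_TITLES = {"Computer Systems Analysts", "Software Developers"}

def mean_cs_salary_by_area(rows, cs_titles):
    """rows: iterable of (area, occupation_title, salary_string) tuples."""
    by_area = defaultdict(list)
    for area, title, salary in rows:
        if title not in cs_titles or salary == "*":
            continue  # skip non-CS jobs and missing salaries
        by_area[area].append(float(salary.replace(",", "")))
    return {area: mean(values) for area, values in by_area.items()}

# Made-up rows, for illustration only.
sample_rows = [
    ("Area A", "Computer Systems Analysts", "90,000"),
    ("Area A", "Software Developers", "110,000"),
    ("Area A", "Chefs and Head Cooks", "50,000"),
    ("Area B", "Computer Systems Analysts", "*"),
]
print(mean_cs_salary_by_area(sample_rows, CS_TITLES))
```

An area whose only computer science rows are `*` drops out of the result entirely, so schools in such an area end up without an `a_mean` value.
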

6. **Placing colleges in their respective (non)metropolitan area**: We matched each school with the (non)metropolitan area it is in by county. Essentially, we used the BLS dataset mapping (non)metropolitan areas to counties and looked up each school's county to get the area name.

7. **Matching colleges with the average salary in their (non)metropolitan area**: Using the (non)metropolitan area of each college from step 6, we looked up the area's average salary in the cleaned dataset from step 5 and imported that salary into our `main` dataframe.
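
Steps 6 and 7 chain two lookups: county to area, then area to mean salary. A minimal sketch of that chain follows; the function name, dictionary layout, and all sample values are assumptions made for illustration.

```python
def attach_area_salary(schools, county_to_area, area_to_salary):
    """Fill in 'area' and 'a_mean' for each school via its county."""
    for school in schools:
        area = county_to_area.get(school["county"])   # step 6
        school["area"] = area
        school["a_mean"] = area_to_salary.get(area)   # step 7; None if unknown
    return schools

# Made-up mappings and schools, for illustration only.
county_to_area = {"County X": "Area A", "County Y": "Area B"}
area_to_salary = {"Area A": 100000.0}
schools = [
    {"school": "example college a", "county": "County X"},
    {"school": "example college b", "county": "County Y"},
    {"school": "example college c", "county": "County Z"},
]
attach_area_salary(schools, county_to_area, area_to_salary)
print(schools)
```

County names repeat across states, so a real join would likely need to key on state plus county (or a county code) rather than the county name alone.
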

8. **Importing Average Professor Rating**: We used an algorithm similar to the one in step 2 to look up each school's average professor rating in the dataset we manually collected from `ratemyprofessor.com`.

9. **Converting Prestige Rankings to Ranks 1-4**: Because the exact ranking of a college varies between ranking lists, and because we don't care about the exact ranking (rather, we care about whether a college is generally high-ranking, low-ranking, etc.), we converted the numeric rankings scraped from `topuniversities.com` into a rank from 1 to 4, like so:

>- Rank 1 (very good)
>> The college is in the top 20 internationally for computer science
>- Rank 2 (great)
>> The college is ranked between 21 and 100
>- Rank 3 (ok)
>> The college is ranked between 101 and 300
>- Rank 4 (at this point, nobody really cares about the ranking)
>> The college is ranked between 301 and 600 (the lowest ranking `topuniversities.com` has)

While the rank 'bins' won't contain the same number of schools (i.e. fewer schools with rank 1), this system draws clearer boundaries between prestige levels and more accurately represents which schools have the top prestige level. In contrast, if we had the same number of schools per rank, super prestigious schools might be lumped with somewhat prestigious schools, which is not ideal.
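
The binning rule above can be written as a lookup against the upper bound of each bin. This is a sketch of the rule as stated, not the project's code; the function name and the choice to raise an error for out-of-range rankings are assumptions.

```python
from bisect import bisect_left

# Upper bound (inclusive) of each bin, in order: ranks 1, 2, 3, 4.
BIN_UPPER_BOUNDS = [20, 100, 300, 600]

def rank_bin(ranking: int) -> int:
    """Map a numeric ranking (1-600) to a prestige rank from 1 to 4."""
    if not 1 <= ranking <= BIN_UPPER_BOUNDS[-1]:
        raise ValueError(f"ranking out of range: {ranking}")
    # bisect_left finds the first bin whose upper bound is >= ranking.
    return bisect_left(BIN_UPPER_BOUNDS, ranking) + 1

print([rank_bin(r) for r in (1, 20, 21, 100, 101, 300, 301, 600)])
```

Keeping the bounds in one list makes it simple to try different cut-offs later without touching the mapping logic.
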

10. **Miscellaneous Typecasting Columns**: We typecast any columns that weren't the correct datatype and did a final cleaning of school names for consistency.

# Data Description

# Data Limitations

# Exploratory Data Analysis

# Questions for Reviewers

# Appendix
For our final report, we would include the notebook containing our web scraping code here (called `Phase II Scraping.ipynb`). We web-scraped computer science post-undergrad salaries from `payscale.com` and scraped the top computer science universities from `topuniversities.com`. 

We would also include the notebook containing our data cleaning code (called `Phase II Data Cleaning.ipynb`) here.