# Intro 

College degrees are very expensive, but do they pay you back? Choosing Philosophy or International Relations as a major may have worried your parents, but does the data back up their fears?

We will extract and use updated information from PayScale's website:
https://www.payscale.com/college-salary-report/majors-that-pay-you-back/bachelors 

We'll dig into this data to answer these questions:

* Which majors have the highest and lowest starting salaries?
* Which majors have the highest and lowest mid-career salaries?

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
# get number of pages
endpoint = "https://www.payscale.com/college-salary-report/majors-that-pay-you-back/bachelors"
response = requests.get(endpoint)
soup = BeautifulSoup(response.text, "html.parser")
inner_btns = soup.find_all("div", {"class": "pagination__btn--inner"})
page_numbers = [inner_btn.getText() for inner_btn in inner_btns if inner_btn.getText().isnumeric()]
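# convert to int before taking the max: as strings, "9" would compare greater than "10"
total_pages = max(int(page_number) for page_number in page_numbers)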
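
Because `getText()` returns strings, it matters whether the button labels are converted to integers before or after taking the maximum. The sketch below illustrates the difference on a hypothetical list of labels (the `labels` values are made up for illustration, not scraped from PayScale):

```python
# Hypothetical pagination labels, as they might come back from getText().
labels = ["1", "2", "3", "9", "10", "Next"]

# Keep only the labels that are purely numeric.
page_numbers = [label for label in labels if label.isnumeric()]

# Comparing strings is lexicographic, so "9" sorts after "10".
string_max = max(page_numbers)

# Converting each label to int first compares the actual values.
numeric_max = max(int(label) for label in page_numbers)

print(string_max, numeric_max)
```

With fewer than ten pages both approaches agree, which is why the difference is easy to miss.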

In [None]:
records = []
for current_page in range(1, total_pages + 1):
    endpoint = f"https://www.payscale.com/college-salary-report/majors-that-pay-you-back/bachelors/page/{current_page}"
    response = requests.get(endpoint)
    soup = BeautifulSoup(response.text, "html.parser")
 
    rows = soup.select("table.data-table tbody tr")
    for row in rows:
        cells = row.select("span.data-table__value")
        record = {
            "Major": cells[1].getText(),
            "Early Career Pay": float(cells[3].getText().strip("$").replace(",", "")),
            "Mid-Career Pay": float(cells[4].getText().strip("$").replace(",", "")),
        }
        records.append(record)

df = pd.DataFrame(records)
df.to_csv("salaries_by_college_major_updated.csv", index=False)
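
The salary cells are converted with `float(...)` directly, which raises a `ValueError` if a cell ever holds something other than a dollar amount. A minimal sketch of a more forgiving parser follows; `parse_salary` is a hypothetical helper, and the `"-"` placeholder is an assumption about what a missing value might look like, not something confirmed on the page:

```python
def parse_salary(text):
    """Convert a string such as '$93,200' to a float.

    Returns None when the text cannot be read as a number.
    """
    cleaned = text.strip().lstrip("$").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        # Not a number (for example an assumed "-" placeholder or an empty cell).
        return None


print(parse_salary("$93,200"))
print(parse_salary("-"))
```

Rows where the helper returns `None` count as missing values in pandas, so a later `df.dropna()` would remove them.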

# Preliminary Data Exploration and Data Cleaning

In [None]:
df.shape

(827, 3)

In [None]:
from google.colab import data_table
data_table.enable_dataframe_formatter()
df.head()

|  | Major | Early Career Pay | Mid-Career Pay |
|---|---|---|---|
| 0 | Petroleum Engineering | 93200.0 | 187300.0 |
| 1 | Operations Research & Industrial Engineering | 84800.0 | 170400.0 |
| 2 | Electrical Engineering & Computer Science (EECS) | 108500.0 | 159300.0 |
| 3 | Interaction Design | 68300.0 | 155800.0 |
| 4 | Public Accounting | 59800.0 | 147700.0 |


In [None]:
df.columns

Index(['Major', 'Early Career Pay', 'Mid-Career Pay'], dtype='object')

In [None]:
# drop NaN (Not A Number) values (if any)
df = df.dropna()

In [None]:
df.shape

(827, 3)

In [None]:
df.nlargest(5, "Early Career Pay")

|  | Major | Early Career Pay | Mid-Career Pay |
|---|---|---|---|
| 2 | Electrical Engineering & Computer Science (EECS) | 108500.0 | 159300.0 |
| 75 | Physician Assistant Studies | 95900.0 | 118500.0 |
| 0 | Petroleum Engineering | 93200.0 | 187300.0 |
| 1 | Operations Research & Industrial Engineering | 84800.0 | 170400.0 |
| 5 | Operations Research | 83500.0 | 147400.0 |


# Major with the Highest Starting Salary

In [None]:
idx = df['Early Career Pay'].idxmax()
df.loc[idx, 'Major']

'Electrical Engineering & Computer Science (EECS)'

In [None]:
df.loc[idx, 'Early Career Pay']

108500.0

In [None]:
df.loc[idx]

Major               Electrical Engineering & Computer Science (EECS)
Early Career Pay                                            108500.0
Mid-Career Pay                                              159300.0
Name: 2, dtype: object

# Major with the Highest Mid-Career Salary

**What college major has the highest mid-career salary?**
How much do graduates with this major earn? (Mid-career is defined as having 10+ years of experience.)

In [None]:
idx = df['Mid-Career Pay'].idxmax()
df.loc[idx, 'Major']

'Petroleum Engineering'

In [None]:
df.loc[idx, 'Mid-Career Pay']

187300.0

# Major with the Lowest Starting Salary

**Which college major has the lowest starting salary and how much do graduates earn after university?**

In [None]:
idx = df['Early Career Pay'].idxmin()
df.loc[idx, 'Major']

'Voice & Opera'

In [None]:
df.loc[idx, 'Early Career Pay']

34500.0

# Major with the Lowest Mid-Career Salary

**Which college major has the lowest mid-career salary and how much can people expect to earn with this degree?**

In [None]:
idx = df['Mid-Career Pay'].idxmin()
df.loc[idx, 'Major']

'Metalsmithing'

In [None]:
df.loc[idx, 'Mid-Career Pay']

40300.0
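
The four lookups above share one pattern: find the index of the extreme value with `idxmax()` or `idxmin()`, then read the major and pay at that index. One possible way to wrap this is sketched below. `extreme_major` is a hypothetical helper, and `sample` contains only six rows copied from the outputs shown earlier, so its minimums differ from those of the full 827-row dataset:

```python
import pandas as pd

# Six rows copied from the outputs above; NOT the full dataset.
sample = pd.DataFrame(
    [
        ("Petroleum Engineering", 93200.0, 187300.0),
        ("Operations Research & Industrial Engineering", 84800.0, 170400.0),
        ("Electrical Engineering & Computer Science (EECS)", 108500.0, 159300.0),
        ("Interaction Design", 68300.0, 155800.0),
        ("Public Accounting", 59800.0, 147700.0),
        ("Physician Assistant Studies", 95900.0, 118500.0),
    ],
    columns=["Major", "Early Career Pay", "Mid-Career Pay"],
)


def extreme_major(frame, column, highest=True):
    """Return (major, pay) for the row with the highest or lowest value in `column`."""
    idx = frame[column].idxmax() if highest else frame[column].idxmin()
    return frame.loc[idx, "Major"], frame.loc[idx, column]


# Answer all four questions for the sample in one pass.
for column in ["Early Career Pay", "Mid-Career Pay"]:
    for highest in (True, False):
        label = "highest" if highest else "lowest"
        major, pay = extreme_major(sample, column, highest)
        print(f"{label} {column}: {major} ({pay:,.0f})")
```

Note that `idxmax()` and `idxmin()` return only the first matching row, so any majors tied for an extreme value would not be reported by this helper.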