<h1 align="center">A Machine Learning Approach to Estimating Career Salaries</h1> 
<h2 align="center">Yianni Mercer | DSC 478 Programming ML Apps</h2> 
<h2 align="center">Final Project Report</h2> 

U.S. News published a list of the [Top 100 Careers in 2021](https://money.usnews.com/careers/best-jobs/rankings/the-100-best-jobs), which featured relatively new and exciting options like Data Scientist and Software Engineer alongside old reliables like Accountant and Physician. This project aimed to find a relationship between the average salary of the *Top 50* careers from that list and the variables found on a typical job posting. Once these relationships were identified, we exploited the underlying patterns that drive salary to develop a regression model that can accurately predict the average salary of various career paths. After arriving at an optimized, well-performing model, we developed a web application that ingests a user's input and returns our model's average salary prediction.

# Data Collection

U.S. News appears to use a rather ambiguous method to rank these careers. However, the true ranking was outside the scope of our project, especially since career choice is such a subjective matter. We used the list only to drive our data collection process: we ingested the first 50 *unique* careers (*unique* in the case of ties) and did not analyze the ranking method.

## Web Scraping

[Glassdoor.com](https://www.glassdoor.com/index.htm) is a leading worldwide platform where individuals review companies and share data about their experience and role, and where companies post job solicitations in the hope of hiring quality candidates. Glassdoor offers unprecedented insights into the employee experience, powered by millions of company ratings and reviews, CEO approval ratings, salary reports, interview reviews and questions, benefits reviews, and much more. Using [Selenium](https://www.selenium.dev/) and a nearly three-year-old [web scraper designed for Glassdoor](https://github.com/arapfaik/scraping-glassdoor-selenium), we were able to ingest nearly 50,000 records. Specifically, we scraped 1,000 job postings for each of the careers in our list. Below, the web scraping function is imported and called to scrape Glassdoor for five job postings related to the search term 'Data Scientist'. The function returns a pandas DataFrame that holds the five scraped job postings.

In [28]:
import os
# Assuming your cwd is the career_salary_estimator root folder
cwd = os.getcwd()
os.chdir(cwd + '/data_collection')
from glassdoor_web_scraper import get_jobs

In [7]:
path = os.getcwd() + '/chromedriver'
df = get_jobs(keyword="Data Scientist", num_jobs=5, verbose=False, path=path, slp_time=10)

Progress: 0/5
Progress: 1/5
Progress: 2/5
Progress: 3/5
Progress: 4/5
Progress: 5/5


In [8]:
df.head()

| | Job Title | Salary Estimate | Job Description | Rating | Company Name | Location | Size | Founded | Type of ownership | Industry | Sector | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Scientist - U.S. Electricity Markets | $65K - $133K (Glassdoor est.) | Company Overview:\nEdison Energy is the expert... | 3.7 | Edison Energy\n3.7 | Boston, MA | 51 to 200 Employees | 2013 | Company - Private | Energy | Oil, Gas, Energy & Utilities | Unknown / Non-Applicable |
| 1 | Data Scientist | $64K - $109K (Glassdoor est.) | Would you like to be part of an organization t... | 4.3 | Johns Hopkins Applied Physics Laboratory (APL)... | Offutt A F B, NE | 5001 to 10000 Employees | 1942 | Nonprofit Organization | Aerospace & Defense | Aerospace & Defense | $1 to $2 billion (USD) |
| 2 | Data Scientist | $62K - $117K (Glassdoor est.) | What you’ll do:\nDrive product decisions using... | 3.3 | RELX\n3.3 | Raleigh, NC | 1001 to 5000 Employees | 1968 | Subsidiary or Business Segment | Advertising & Marketing | Business Services | $1 to $2 billion (USD) |
| 3 | Data Scientist | -1 | At Density, we build one of the most advanced ... | 4.8 | Density Inc.\n4.8 | Remote | 1 to 50 Employees | 2014 | Company - Private | Internet | Information Technology | Unknown / Non-Applicable |
| 4 | Data Scientist | Employer Provided Salary:$35 - $75 Per Hour | MetLife Legal Plans is the leading consumer le... | -1.0 | MetLife Legal Plans | Remote | -1 | -1 | -1 | -1 | -1 | -1 |


Above are the five scraped job postings related to the search term 'Data Scientist'. To collect the full dataset, we iterated over our list of careers, passing each one as the *keyword*. On every iteration, the resulting data frame was written to its own csv file, named after the respective career. We did this intentionally, rather than appending each new data frame to the last to build one large data frame at the end of the scrape, because writing per-career files is efficient and robust to the code breaking at any point. There were instances where the for-loop broke due to unforeseen events. Because each career's data frame was written to its own file, we were able to resume the loop where it broke rather than starting over completely.
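The resumable loop described above might look roughly like the following. This is a minimal sketch, not the code we ran: `scrape_all`, its parameters, and the underscore-based file naming are hypothetical names chosen for illustration, and `scrape_fn` stands in for a call to `get_jobs`.

```python
import os


def scrape_all(careers, scrape_fn, out_dir):
    """Scrape each career once, writing one CSV file per career.

    Careers whose CSV already exists are skipped, so rerunning after a
    crash resumes where the previous run stopped.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for career in careers:
        # Assumed naming scheme: "Data Scientist" -> "Data_Scientist.csv"
        out_path = os.path.join(out_dir, career.replace(" ", "_") + ".csv")
        if os.path.exists(out_path):
            continue  # already scraped in an earlier run
        df = scrape_fn(career)  # expected to return a pandas DataFrame
        df.to_csv(out_path)
        written.append(out_path)
    return written


# Hypothetical usage with the scraper imported earlier:
# scrape_all(careers,
#            lambda c: get_jobs(keyword=c, num_jobs=1000, verbose=False,
#                               path=path, slp_time=10),
#            "data_files")
```

Checking for an existing file before scraping is what makes a rerun cheap: a career that finished before the crash costs only one file-system lookup.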

After the entire scrape concluded, we concatenated the per-career data frames into one large data frame. In total, it contained 48,928 records with 13 features:

* Job Title - Title of the Job Posting on Glassdoor (E.g., Senior Data Scientist, Junior Dental Hygienist)
* Salary Estimate - The salary range shown on the posting: either Glassdoor's estimate, based on earnings reported to the platform by previous employees in the same company and/or position, or a range provided by the employer. (E.g., Employer Provided Salary: $80K - $100K)
* Job Description - A brief description of the job, its responsibilities, and other need-to-know details that the company has chosen to share.
* Company Rating - A float representing the company's average rating, on a scale from 1.0 to 5.0 (E.g., 3.7)
* Company Name - The name of the company offering the job (E.g., Amazon.com Services LLC)
* Location - The location where the job is offered (E.g., Sandy, TX)
* Size - The size of the company as a whole. (E.g., 10000+ Employees)
* Founded - The year the company was founded (E.g., 1994)
* Type of Ownership - A string variable indicating if the company is public, private, school/university, government, etc. (E.g., Company - Public)
* Industry - The industry the company is in (E.g., Internet)
* Sector - The sector the company is in (E.g., Information Technology)
* Revenue - The revenue the company earns each fiscal year (E.g., $10+ Billion)
* Simplified Job Title - The career name that was used to scrape for that specific job posting

*Below is the code from the data_cleaning.py script that we used to concatenate the per-career files as described above.*

In [29]:
import glob  # wildcard matching for filenames
import os
import pandas as pd

# Move up one directory, from data_collection back to the project root
os.chdir(os.path.dirname(os.getcwd()))

path = os.getcwd()
all_files = glob.glob(os.path.join(path, "data_collection", "data_files", "*.csv"))  # all csv file paths

dfs = []  # one data frame per career

for filename in all_files:  # iterate through each csv file path
    df = pd.read_csv(filename, index_col=0)  # read into a data frame
    # Record the search term (career name) used for this file, taken from the file name
    df['simplified_job_title'] = os.path.basename(filename).replace('_', ' ').replace('.csv', '')
    dfs.append(df)

df_orig = pd.concat(dfs, axis=0, ignore_index=True)  # concatenate all the data frames
df_orig.head()

| | Job Title | Salary Estimate | Job Description | Rating | Company Name | Location | Size | Founded | Type of ownership | Industry | Sector | Revenue | simplified_job_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IT Manager for Logistics Company | Employer Provided Salary:$80K - $100K | Job Summary:\nSupports all aspects of Company’... | -1.0 | TradePort Logistics, LLC | Georgia | Unknown | -1.0 | Company - Public | -1 | -1 | Unknown / Non-Applicable | IT Manager |
| 1 | IT Director | Employer Provided Salary:$80K - $100K | Seeking an experienced IT Director to oversee ... | -1.0 | Confidential | Cranbury, NJ | -1 | -1.0 | -1 | -1 | -1 | -1 | IT Manager |
| 2 | Corporate IT Manager | Employer Provided Salary:$85K - $95K | Noregon is looking for a Corporate IT Manager ... | 3.6 | Noregon Systems\n3.6 | Greensboro, NC | 51 to 200 Employees | 1993.0 | Company - Private | Computer Hardware & Software | Information Technology | $5 to $10 million (USD) | IT Manager |
| 3 | IT Manager | Employer Provided Salary:$75K - $85K | Company: A small Entertainment Company located... | -1.0 | Confidential | Los Angeles, CA | -1 | -1.0 | -1 | -1 | -1 | -1 | IT Manager |
| 4 | IT Project Manager (Agile delivery experience ... | Employer Provided Salary:$120K - $150K | *** Prefer candidates local to Washington DMV ... | -1.0 | Radiant Infotech | Catonsville, MD | -1 | -1.0 | -1 | -1 | -1 | -1 | IT Manager |


In [31]:
df_orig.shape

(48928, 13)
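
As a quick sanity check on the concatenated data, the number of postings attributed to each career can be tallied from the `simplified_job_title` column. The sketch below uses a small made-up frame (`toy`) rather than the scraped data, so its rows and counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_orig; these rows are made up for illustration
toy = pd.DataFrame({
    "Job Title": ["Data Scientist", "Senior Data Scientist", "IT Manager"],
    "simplified_job_title": ["Data Scientist", "Data Scientist", "IT Manager"],
})

# Number of postings attributed to each search term (career name)
counts = toy["simplified_job_title"].value_counts()
print(counts)
```

Running the same call on `df_orig` would show how the 48,928 records are distributed across the careers.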