# **STAT-201 Project Proposal**
## Analysis of data science salaries

Prabh Sehgal, Rainie Fu, Rishabh Mathur, Dennis Truong

## Introduction
A key factor in employee happiness is financial compensation, and past work during the COVID-19 pandemic has shown that working remotely can also have a positive impact on happiness thanks to increased employee autonomy (Card et al., 2012; Galanti et al., 2021). Our analysis aims to explore the relationships between different data science jobs, their salaries, and the proportion of employees in each job who work remotely. The results can inform prospective employees about each position's remote work flexibility and typical salary, and employees may use these inferences to discover data science occupations that better align with their work-life goals.

### Research Question

Comparing full-time Data Scientists and Data Analysts, is there a *difference* in mean salary and a *difference* in the proportion working remotely?


### Dataset

Our research uses the dataset "Data Science Job Salaries" by Ruchi Bhatia (2022) from [Kaggle](https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries). Variables include the year the salary was paid, experience level, employment type, job title, salary, remote ratio, company size, and company location. The main variables we aim to use are ```job_title``` to tell us the job type, ```salary_in_usd``` to inform us about each role's salary, and ```remote_ratio``` to give us the proportion of people working remotely in that role.


## Preliminary Results

### Exploring our Data

We downloaded the data to our repository and read it from there, since Kaggle does not allow datasets to be read directly from its website.

In [1]:
library(tidyverse)

# URL of the data set in our project repository
url <- "https://raw.githubusercontent.com/vindennt/stat-201-project-group-10/main/ds_salaries.csv"

# Use suppressMessages() to hide the message readr prints when it automatically
# names the first column (`...1`), which contains only row indices; then drop it.
salaries <- suppressMessages(
    read_delim(url, delim = ",") %>%
        select(!...1))

head(salaries)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` |
| 2020 | MI | FT | Data Scientist | 70000 | EUR | 79833 | DE | 0 | DE | L |
| 2020 | SE | FT | Machine Learning Scientist | 260000 | USD | 260000 | JP | 0 | JP | S |
| 2020 | SE | FT | Big Data Engineer | 85000 | GBP | 109024 | GB | 50 | GB | M |
| 2020 | MI | FT | Product Data Analyst | 20000 | USD | 20000 | HN | 0 | HN | S |
| 2020 | SE | FT | Machine Learning Engineer | 150000 | USD | 150000 | US | 50 | US | L |
| 2020 | EN | FT | Data Analyst | 72000 | USD | 72000 | US | 100 | US | L |


In [2]:
# We are only interested in salaries for full-time positions, since part-time
# salaries can vary for many other reasons, so we keep only the "FT" rows.
salaries <- salaries %>%
    filter(employment_type == "FT")

# We may use the CLT for our hypothesis tests later, so let's see which job titles
# have a large enough sample size (we use n > 30 as "large", as discussed in class).
large_sample_size_salaries <- salaries %>%
    group_by(job_title) %>%
    summarise(n = n()) %>%
    filter(n > 30)

# Now that we have potential job titles to work with, let's prepare the information
# we are interested in about these four titles:
# 1. salaries in USD (so that the salaries are comparable)
# 2. remote ratio, recoded from numeric values into labels:
#    "in person" (if 0), "hybrid" (if 50), "remote" (if 100)
inperson <- salaries %>%
    filter(remote_ratio == 0) %>%
    mutate(remote_ratio = "in person")
hybrid <- salaries %>%
    filter(remote_ratio == 50) %>%
    mutate(remote_ratio = "hybrid")
remote <- salaries %>%
    filter(remote_ratio == 100) %>%
    mutate(remote_ratio = "remote")
salaries_new <- rbind(inperson, hybrid, remote)

head(salaries_new)

| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 2020 | MI | FT | Data Scientist | 70000 | EUR | 79833 | DE | in person | DE | L |
| 2020 | SE | FT | Machine Learning Scientist | 260000 | USD | 260000 | JP | in person | JP | S |
| 2020 | MI | FT | Product Data Analyst | 20000 | USD | 20000 | HN | in person | HN | S |
| 2020 | EN | FT | Data Scientist | 45000 | EUR | 51321 | FR | in person | FR | S |
| 2020 | MI | FT | Data Scientist | 3000000 | INR | 40481 | IN | in person | IN | L |
| 2020 | EN | FT | Data Scientist | 35000 | EUR | 39916 | FR | in person | FR | M |
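
Data recoded in this way can be summarised into the point estimates that the research question compares. The following is a minimal sketch in base R, not the code used for our results. It assumes a small made-up data frame `toy` (hypothetical values, not taken from the data set) with the same three columns of interest as `salaries_new`, and it assumes that only the "remote" label counts as working remotely, so that "hybrid" is treated as not remote.

```r
# Hypothetical toy data standing in for `salaries_new`; the values are made up
# for illustration and are NOT taken from the Kaggle data set.
toy <- data.frame(
  job_title     = c("Data Scientist", "Data Scientist", "Data Scientist",
                    "Data Analyst", "Data Analyst", "Data Analyst",
                    "Data Engineer"),
  salary_in_usd = c(100000, 120000, 80000, 60000, 90000, 75000, 110000),
  remote_ratio  = c("remote", "in person", "remote",
                    "hybrid", "remote", "in person", "remote"),
  stringsAsFactors = FALSE
)

# Keep only the two job titles named in the research question.
two_titles <- toy[toy$job_title %in% c("Data Scientist", "Data Analyst"), ]

# Split the rows by job title (groups come out in alphabetical order).
groups <- split(two_titles, two_titles$job_title)

# Point estimates per job title: sample size, mean salary in USD, and the
# proportion of employees labelled "remote" (assumption: hybrid = not remote).
estimates <- data.frame(
  job_title   = names(groups),
  n           = unname(vapply(groups, nrow, integer(1))),
  mean_salary = unname(vapply(groups, function(g) mean(g$salary_in_usd),
                              numeric(1))),
  prop_remote = unname(vapply(groups, function(g) mean(g$remote_ratio == "remote"),
                              numeric(1))),
  stringsAsFactors = FALSE
)

print(estimates)
```

Running the same summary on `salaries_new` instead of `toy` would give the observed mean salaries and remote proportions for the two roles.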


## Methods

So far, our preliminary results suggest differences in mean salaries and proportions of remote work between Data Scientists and Data Analysts. We restricted the analysis to full-time employees and to job titles with more than 30 observations, and we treat these observations as representative random samples of each role.
However, relying on point estimates and plots alone is not enough to establish these differences. To address this, we propose hypothesis tests based on the Central Limit Theorem, as the samples are large and roughly normally distributed and this approach is less computationally intensive than resampling methods.
Letting Data Scientists be population 1 and Data Analysts be population 2, we aim to test the following hypotheses for the mean salary:
$$H_0 : \mu_2 - \mu_1 = 0$$
$$H_1 : \mu_2 - \mu_1 \neq 0$$

and for the proportion of remote work:
$$H_0 : p_2 - p_1 = 0$$
$$H_1 : p_2 - p_1 \neq 0$$

where $\mu_i$ is the mean salary of population $i$ and $p_i$ is its proportion of employees working remotely.
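
For reference, one standard choice of CLT-based test statistics for these hypotheses is sketched below. The proposal does not spell out the statistics, so this is an assumption: it uses the unpooled standard error for the difference in means and the pooled proportion for the difference in proportions. The symbols $\bar{x}_i$, $s_i$, $n_i$, and $\hat{p}_i$ (sample mean, sample standard deviation, sample size, and sample proportion for population $i$) and the pooled proportion $\hat{p}$ are introduced here for illustration.

$$Z_\mu = \frac{\bar{x}_2 - \bar{x}_1}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
$$Z_p = \frac{\hat{p}_2 - \hat{p}_1}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad \hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$$

Under each null hypothesis and with large samples, both statistics are approximately standard normal, so a two-sided p-value is $2\,P(Z \geq |z_{\text{obs}}|)$ for the observed value $z_{\text{obs}}$.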

We expect to reject both null hypotheses, and will include a 90% confidence interval in our final report.
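
The tests and intervals described in this section could be computed along the following lines. This is a minimal sketch in base R under stated assumptions, not our final analysis: the samples are simulated with made-up sizes, means, and remote proportions (hypothetical values, not estimates from the data set), remote work is coded as 1 for fully remote and 0 otherwise, and a 90% interval is computed for each of the two differences.

```r
set.seed(201)

# Hypothetical simulated samples standing in for the real data; the sizes,
# means, and remote proportions below are made up for illustration only.
n1 <- 140
n2 <- 95
salary_1 <- rnorm(n1, mean = 105000, sd = 40000)  # population 1: Data Scientists
salary_2 <- rnorm(n2, mean = 90000, sd = 35000)   # population 2: Data Analysts
remote_1 <- rbinom(n1, size = 1, prob = 0.55)     # 1 = fully remote, 0 = otherwise
remote_2 <- rbinom(n2, size = 1, prob = 0.70)

alpha <- 0.10                    # corresponds to a 90% confidence level
z_crit <- qnorm(1 - alpha / 2)   # standard normal critical value

# Difference in means (mu_2 - mu_1) with the unpooled standard error.
diff_mean <- mean(salary_2) - mean(salary_1)
se_mean <- sqrt(var(salary_1) / n1 + var(salary_2) / n2)
z_mean <- diff_mean / se_mean
p_value_mean <- 2 * pnorm(-abs(z_mean))            # two-sided p-value
ci_mean <- diff_mean + c(-1, 1) * z_crit * se_mean

# Difference in proportions (p_2 - p_1).
p_hat_1 <- mean(remote_1)
p_hat_2 <- mean(remote_2)
diff_prop <- p_hat_2 - p_hat_1

# The test uses the pooled proportion, because H0 states that p_1 = p_2.
p_pooled <- (sum(remote_1) + sum(remote_2)) / (n1 + n2)
se_prop_test <- sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z_prop <- diff_prop / se_prop_test
p_value_prop <- 2 * pnorm(-abs(z_prop))

# The confidence interval uses the unpooled standard error.
se_prop_ci <- sqrt(p_hat_1 * (1 - p_hat_1) / n1 + p_hat_2 * (1 - p_hat_2) / n2)
ci_prop <- diff_prop + c(-1, 1) * z_crit * se_prop_ci

print(c(z = z_mean, p_value = p_value_mean, lower = ci_mean[1], upper = ci_mean[2]))
print(c(z = z_prop, p_value = p_value_prop, lower = ci_prop[1], upper = ci_prop[2]))
```

The built-in `t.test(salary_2, salary_1, conf.level = 0.90)` and `prop.test(c(sum(remote_2), sum(remote_1)), c(n2, n1), conf.level = 0.90, correct = FALSE)` give a cross-check. `t.test()` uses a t reference distribution rather than the normal, so its p-value and interval differ slightly from the values above.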




### Impact

These findings could inform which positions newly graduated data scientists seek and which career paths students choose to focus on during their undergraduate studies.

### Future

This analysis prompts further questions about how differences in experience, company location, and company size may affect the comparison of job titles in the field of data science. Additionally, with more data, these questions could be explored for the other job titles that we could not examine thoroughly because of limitations in sample size and distribution.

## References

Bhatia, R. (2022, June 15). *Data Science Job Salaries* [Data set]. Kaggle. https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries

Card, D., Mas, A., Moretti, E., & Saez, E. (2012). Inequality at work: The effect of peer salaries on job satisfaction. *American Economic Review*, 102(6), 2981–3003. https://doi.org/10.1257/aer.102.6.2981

Galanti, T., Guidetti, G., Mazzei, E., Zappalà, S., & Toscano, F. (2021). Work from home during the COVID-19 outbreak: the impact on employees' remote work productivity, engagement, and stress. *Journal of Occupational and Environmental Medicine*, 63(7), e426–e432. https://doi.org/10.1097/JOM.0000000000002236