# **STAT-201 Project Proposal**
## Do data analysts and data scientists differ in salaries and work flexibility? 

Prabh Sehgal, Rainie Fu, Rishabh Mathur, Dennis Truong

## Introduction

A key factor in employee happiness is financial compensation. During the COVID-19 pandemic, employees also showed that working remotely can have a positive impact on happiness (Card et al., 2012; Galanti et al., 2021). Our analysis of data science salaries explores the relationships between different data science jobs, their salaries, and the proportion of employees in each job who work remotely. This would help prospective employees find jobs that match their own hopes and aspirations, and help create a work environment that supports productivity in and out of the workplace through better work-life balance.


### Guiding Question
> Which profession pays more, data scientists or analysts, and who works more remotely?

### Dataset

In this project, we use the "Data Science Job Salaries" dataset by Ruchi Bhatia (2022) from Kaggle. It contains both qualitative and quantitative variables we can use, such as the year the salary was paid, experience level, employment type, title, salary, remote ratio, company size, and company location.
From this dataset, our variables of interest are as follows:

* ```salary_in_usd``` is our response variable; it uses a single currency (USD) to make comparisons easier regardless of country.
* ```job_title``` is one of our explanatory variables; a ```chr``` column describing each employee's profession.
* ```remote_ratio``` is another explanatory variable; a ```fct``` column describing each employee's ratio of remote work (0 = less than 20%, 50 = in between, 100 = greater than 80%). For this project, we will treat a value of 100 as working remotely.

Although ```experience_level``` is a factor that affects salary in the data science field, it is not something job seekers can control when searching for a job, so we do not treat it as a variable of interest. This project aims to help prospective employees choose a suitable job by comparing salary and remote work across job titles, giving them a clearer picture of the data science job market.


## Preliminary Results

### Reading the Data

We downloaded the data to our repository and read it from there, since Kaggle does not allow datasets to be read directly from the website.

In [1]:
# just add all the libraries, data reading and preview of data here

*table 1*

### Cleaning and Wrangling Data

Let's look at which of the many ```job_title``` categories in this dataset we can use.

In [2]:
# calculating the number of employees in each job title category

*table 2*

Here we can see that the first three categories each have more than 30 observations, so they are large enough and similar in size. Of those three, Data Scientist is a very broad job definition, so we will compare Data Scientists and Data Analysts.
Let's keep only those two categories and store them in a data frame named ```top_jobs```.

In [3]:
# filtering for full-time employees from both categories

*table 3*
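
The counting and filtering steps above can be sketched as follows. This is a minimal Python illustration using only the standard library, not the notebook's own code: the records are made up, and the `"FT"` code for full-time employment is an assumption about how the dataset labels employment type.

```python
from collections import Counter

# Toy records standing in for rows of the dataset (values are made up).
# The "FT" code for full-time employment is an assumption about the data.
rows = [
    {"job_title": "Data Scientist", "employment_type": "FT"},
    {"job_title": "Data Analyst", "employment_type": "FT"},
    {"job_title": "Data Analyst", "employment_type": "PT"},
    {"job_title": "Data Engineer", "employment_type": "FT"},
    {"job_title": "Data Scientist", "employment_type": "FT"},
]

# Number of employees in each job title category, largest first.
title_counts = Counter(row["job_title"] for row in rows).most_common()

# Keep only full-time employees from the two categories of interest.
KEEP_TITLES = {"Data Scientist", "Data Analyst"}
top_jobs = [
    row for row in rows
    if row["job_title"] in KEEP_TITLES and row["employment_type"] == "FT"
]
```

Counting before filtering mirrors the order used here: the counts justify which categories are large enough, and the filter then narrows the data to those categories.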

### Plots

#### Salary distributions

Let's start by plotting the salary distributions for each of the categories side by side.
Also, we'll include red vertical lines to display their mean incomes.

In [4]:
# just show the salary distribution of data scientist and analyst 

*figures 1, 2*

Also, let's examine the percentages of employees who work remotely using a stacked bar chart.

In [5]:
# visualise the remote work data

*figure 3*

### Estimates

Before testing our hypotheses, we need the appropriate point estimates.

In [6]:
# find all the point estimates 

*table 4*
> Although the median may give a more robust salary estimate, the methods in this project rely on the Central Limit Theorem, which applies to sample means rather than medians. Thus, we'll use means instead.
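
As a rough illustration of the point estimates involved, the sketch below computes a sample mean, sample standard deviation, and sample proportion of remote workers for one group. It is a Python sketch with made-up numbers, not the values in table 4; the only rule taken from this proposal is that a ```remote_ratio``` of 100 counts as working remotely.

```python
from statistics import mean, stdev

# Made-up (salary_in_usd, remote_ratio) pairs for one job title.
sample = [(60000, 100), (85000, 0), (72000, 100), (95000, 50), (120000, 100)]

salaries = [salary for salary, _ in sample]
# Only remote_ratio == 100 counts as working remotely.
is_remote = [ratio == 100 for _, ratio in sample]

mean_salary = mean(salaries)                    # sample mean
sd_salary = stdev(salaries)                     # sample standard deviation (n - 1 denominator)
prop_remote = sum(is_remote) / len(is_remote)   # sample proportion of remote workers
```

The same three quantities would be computed separately for each job title before comparing them.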

## Methods

### Plan

So far, this analysis suggests that data scientists and data analysts differ in their average salaries and in the percentage of employees who work from home. Moreover, the sample standard deviation of salary is larger for data scientists, while the standard deviations for the proportion of remote work are similar in both groups. To limit the influence of factors other than job title on the response variables, we focused on full-time employees and used samples with n > 30.

However, our plots and point estimates alone are insufficient to support these conclusions: they come from a single sample, which makes them more error-prone than a *sampling* distribution. We will therefore use resampling and asymptotic approaches to approximate the relevant sampling distributions.

To compare the spread of salaries and of remote work between the two groups, we will bootstrap both samples to approximate the sampling distributions of the *salary standard deviation* and the *remote work standard deviation* for each group.
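
A minimal sketch of the bootstrap for a standard deviation is shown below, in Python with the standard library only. The function name `bootstrap_sd`, the number of replicates, the seed, and the salaries are all hypothetical choices for illustration; the 90% percentile interval at the end is one common way to summarise a bootstrap distribution, not necessarily the one this project will report.

```python
import random
from statistics import stdev

def bootstrap_sd(values, reps=1000, seed=201):
    """Return `reps` bootstrap replicates of the sample standard deviation."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    n = len(values)
    replicates = []
    for _ in range(reps):
        # Resample n observations with replacement from the original sample.
        resample = rng.choices(values, k=n)
        replicates.append(stdev(resample))
    return replicates

# Made-up salaries in USD, for illustration only.
salaries = [52000, 61000, 70000, 75000, 82000, 90000, 98000, 110000, 135000, 160000]
sd_replicates = sorted(bootstrap_sd(salaries))

# Percentile interval: the middle 90% of the bootstrap distribution.
n_reps = len(sd_replicates)
lower = sd_replicates[n_reps * 5 // 100]
upper = sd_replicates[n_reps * 95 // 100 - 1]
```

Running the same function on a list of 0/1 remote-work indicators would give the corresponding bootstrap distribution for the remote work standard deviation.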

We will use an asymptotic approach to test for differences in mean salaries and in the proportion of remote work. Because we do not know the populations' means $\mu$, proportions $p$, or standard deviations $\sigma$, we will use the sample means $\bar{x}$, proportions $\hat{p}$, and standard deviations $s$ in their place. To account for the additional uncertainty this substitution introduces, we will use $t$-distributions, which have heavier tails and a wider spread than the standard normal distribution.

Taking data scientists as population 1 and data analysts as population 2, we want to test the following hypotheses.

For the mean salary:
$$H_0 : \mu_1 - \mu_2 = 0$$
$$H_1 : \mu_1 - \mu_2 \neq 0$$

For the proportion of remote work:
$$H_0 : p_1 - p_2 = 0$$
$$H_1 : p_1 - p_2 \neq 0$$


where $\mu$ is the population's mean annual salary and $p$ is the proportion of employees who work remotely. Hence, our $t$-distributions will be $t_{128}$ and $t_{95}$, respectively.
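
For reference, the usual large-sample statistics for these two comparisons are sketched below, writing $\bar{x}_i$, $s_i$, $\hat{p}_i$, and $n_i$ for the sample mean, sample standard deviation, sample proportion, and sample size of group $i$. The specific forms are assumptions rather than something fixed by this plan: the unpooled standard error for the means and the pooled proportion $\hat{p}$ under $H_0$ are common textbook choices.

$$T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \qquad \hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$$

Each statistic is the observed difference divided by its estimated standard error, so a value far from 0 is evidence against the corresponding $H_0$.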

With a 90% confidence level, we anticipate rejecting both null hypotheses at a significance level of 5%.


## Results