# Data Science Salaries Hypothesis Testing

[Data & Description Source](https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries?resource=download)

**Column Description**
- work_year: The year the salary was paid. <br>
- experience_level: The experience level in the job during the year with the following possible values: <br>
    1. EN = Entry-level / Junior <br>
    2. MI = Mid-level / Intermediate <br>
    3. SE = Senior-level / Expert <br>
    4. EX = Executive-level / Director <br>
- employment_type: The type of employment for the role: <br>
    1. PT = Part-time <br> 
    2. FT = Full-time <br> 
    3. CT = Contract <br>
    4. FL = Freelance <br>
- job_title: The role worked in during the year.<br>
- salary: The total gross salary amount paid.<br>
- salary_currency: The currency of the salary paid as an ISO 4217 currency code.<br>
- salary_in_usd: The salary in USD (FX rate divided by avg. USD rate for the respective year via fxdata.foorilla.com).<br>
- employee_residence: Employee's primary country of residence during the work year as an ISO 3166 country code. <br>
- remote_ratio: The overall amount of work done remotely, possible values are as follows: <br>
    1. 0 = No remote work (less than 20%) <br>
    2. 50 = Partially remote <br>
    3. 100 = Fully remote (more than 80%) <br>
- company_location: The country of the employer's main office or contracting branch as an ISO 3166 country code. <br>
- company_size: The average number of people that worked for the company during the year: <br>
    1. S = less than 50 employees (small) <br>
    2. M = 50 to 250 employees (medium) <br>
    3. L = more than 250 employees (large) <br>

## Importing packages

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import re

# For t-test, f-test
import scipy.stats as stats

# For z-test
import statsmodels.api as sm

## Importing data

In [3]:
raw_salaries = pd.read_csv('ds_salaries.csv')

In [4]:
raw_salaries.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


### Dataset Cleaning

In [5]:
# Model data starts off as a pristine dataset from the raw data and then we do pre-processing on it
hypothesis_data = raw_salaries.copy(deep=True)

# Creating the job category column
# Function to find the pattern and assign corresponding value
def find_pattern(text, patterns):
    text = text.lower()
    for pattern, value in patterns.items():
        match = re.findall(pattern, text)
        if match:
            return value
    return text

job_categories = {
    r'data engineer': 'data engineer',
    r'data analy': 'data analyst',
    r'data scien': 'data scientist',
    r'machine learning': 'machine learning',
    r'ml engineer': 'machine learning'
    
}

# Apply the function to the 'job_title' column and assign the results to a new column 'job_category'
hypothesis_data['job_category'] = hypothesis_data['job_title'].apply(find_pattern, patterns=job_categories)

# Removing all columns that aren't needed for the hypothesis tests
hypothesis_data.drop(columns=['Unnamed: 0', 'salary', 'job_title'], inplace = True)

## Hypothesis Testing

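Hypothesis testing uses sample data to decide whether there is enough evidence to reject an assumption (the null hypothesis) about a population. The commonly used hypothesis tests are described below.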

#### 1. T-test:<br>
**Definition:**<br> 
A t-test is an inferential statistic used to determine whether there is a significant difference between the means of two groups. It is also used to check whether the mean of a sample differs significantly from a known or hypothesised population mean, typically when the sample is small (size<30 records) and the population standard deviation is unknown.
1. Assumptions: <br>
    a. The data follow an approximately normal distribution.<br>
    b. Only the mean of the population is known (or hypothesised); the standard deviation is estimated from the sample.<br>
    c. Sample size < 30. For larger samples the t-test remains valid and gives nearly the same result as the z-test.<br>
    d. Independent samples: The observations in each group are independent of each other. <br>
    e. Homogeneity of variances: The variances of the two groups are equal. If this assumption is violated, alternative versions of the t-test, such as Welch's t-test, can be used.<br>
2. The variable being compared needs to be continuous in nature.<br>

[Youtube Link](https://www.youtube.com/watch?v=7UKKVSWp1Lg&list=PLOLWGEXpOrBx4ivtMfPvDyCjlok2OBk_w&index=6)<br>
[T-test and types](https://www.investopedia.com/terms/t/t-test.asp)<br> 

#### 2. Z-Test:<br>
**Definition**<br>
The z-test calculates a z-statistic, which represents the number of standard errors the sample mean is away from the population mean.<br>
1. Assumptions: <br>
    a. Both the mean and the standard deviation of the population are known.<br>
    b. Sample size > 30. For samples this large, the sampling distribution of the mean is approximately normal (central limit theorem).<br>
    c. For smaller samples, the data themselves should be normally distributed.<br>
    d. The sample is drawn at random.<br>



### T-test

**Hypothesis:** Data Science salaries differ from the mean of the population salaries.

Step 1: Create the datasets to be compared. We take a small sample (n<30) because the z-test is used for larger samples.

In [7]:
# Datasets being compared
all_salaries = hypothesis_data['salary_in_usd'] #population
data_scientist_sal_sample = hypothesis_data.loc[hypothesis_data.job_category == 'data scientist','salary_in_usd']\
                            .sample(n=25, random_state=1) #sample

In [9]:
print(f'Average salary: {all_salaries.mean()}')
print(f'''Sample's avg. salary: {data_scientist_sal_sample.mean()}''')

Average salary: 112297.86985172982
Sample's avg. salary: 107091.44


**Defining Ho & Ha**

Ho: Mean Data Science Salary = 112300

Ha: Mean Data Science Salary != 112300
    
It is going to be a two-tailed t-test.

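Step 2: Perform the t-test. Note that `stats.ttest_ind` is a two-sample test, so it treats `all_salaries` as a second sample rather than as a fixed population mean. A one-sample test against the value stated in Ho would use `stats.ttest_1samp`.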

In [8]:
stats.ttest_ind(all_salaries, data_scientist_sal_sample)

TtestResult(statistic=0.3608643574264332, pvalue=0.7183218184421127, df=630.0)

In [7]:
# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(all_salaries, data_scientist_sal_sample)

# Compare p-value with significance level
alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis - There is a significant difference between the means.")
else:
    print("Failed to reject null hypothesis - There is no significant difference between the means.")


Failed to reject null hypothesis - There is no significant difference between the means.
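
Because Ho compares a sample mean with a single fixed value, a one-sample t-test is a natural fit. The sketch below is only an illustration: it uses synthetic salaries generated with NumPy (not the rows of `ds_salaries.csv`), and the names `sample`, `hypothesised_mean` and `t_manual` are made up for this example. It also recomputes the statistic by hand to show what the test measures.

```python
import numpy as np
from scipy import stats

# Synthetic salaries for illustration only (assumed values, not from the dataset)
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=107_000, scale=60_000, size=25)

hypothesised_mean = 112_300  # the value stated in Ho

# One-sample, two-tailed t-test of the sample mean against a fixed value
result = stats.ttest_1samp(sample, popmean=hypothesised_mean)

# The same statistic by hand: (sample mean - mu0) / (s / sqrt(n))
n = sample.size
t_manual = (sample.mean() - hypothesised_mean) / (sample.std(ddof=1) / np.sqrt(n))

print(f"t = {result.statistic:.4f}, p = {result.pvalue:.4f}, df = {n - 1}")
print(f"manual t = {t_manual:.4f}")
```

If the two groups being compared are both samples and their variances differ, `stats.ttest_ind(a, b, equal_var=False)` runs Welch's t-test instead.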


### z-test

**Hypothesis:** Data Science salaries differ from the mean of the population salaries.

In [8]:
# Datasets being compared
all_salaries = hypothesis_data['salary_in_usd'] #population
data_scientist_sal_sample = hypothesis_data.loc[hypothesis_data.job_category == 'data scientist','salary_in_usd']\
                            .sample(n=50, random_state=1) #sample
# Population mean and standard deviation
population_mean = all_salaries.mean()
population_std = all_salaries.std()

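# Perform one-sample z-test
# Note: sm.stats.ztest estimates the standard deviation from the sample itself,
# so population_std computed above is not used by this call
z_score, p_value = sm.stats.ztest(data_scientist_sal_sample, value=population_mean, alternative='two-sided')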

# Compare p-value with significance level
alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis - There is a significant difference from the population mean.")
else:
    print("Failed to reject null hypothesis - There is no significant difference from the population mean.")


Failed to reject null hypothesis - There is no significant difference from the population mean.


**Note:** The alternative parameter in sm.stats.ztest can take values: two-sided (for two tailed), larger (for right tailed z-test) and smaller (for left tailed z-test)
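
The z-test definition assumes the population standard deviation is known. As a rough sketch of that textbook form, the helper below (a hypothetical function `z_test_known_sigma`, not part of statsmodels or SciPy) divides the distance between the sample mean and the population mean by the standard error built from the known standard deviation. The demo data are synthetic.

```python
import numpy as np
from scipy import stats

def z_test_known_sigma(sample, population_mean, population_std):
    """Two-sided one-sample z-test with a known population standard deviation."""
    sample = np.asarray(sample, dtype=float)
    standard_error = population_std / np.sqrt(sample.size)
    z = (sample.mean() - population_mean) / standard_error
    p = 2 * stats.norm.sf(abs(z))  # two-sided tail area of the standard normal
    return z, p

# Synthetic demo sample (assumed values, not from the dataset)
rng = np.random.default_rng(seed=1)
demo_sample = rng.normal(loc=105_000, scale=70_000, size=50)

z, p = z_test_known_sigma(demo_sample, population_mean=112_300, population_std=70_000)
print(f"z = {z:.4f}, p = {p:.4f}")
```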

### Matched or Paired T-Test

This test is used when the data contains before and after values for the same set of subjects. <br>
Null Hypothesis: The process change has had no effect on the subjects/end metric. <br>
Alternative Hypothesis: The process change has had an effect on the subjects/end metric. <br>
Some examples:
    
1. Whether a weight loss program has been effective. The data will contain before and after weights for the same people. <br>
Value we will use for the test: Xd (Difference) = After value - Before value <br>
H0: Mean of Xd >= 0 (Weight loss program has failed) <br>
Ha: Mean of Xd < 0 (Weight loss program helped subjects reduce weight)

2. Has the change in prices of items affected their sales.
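
The weight loss example can be sketched with `stats.ttest_rel`. The weights below are invented for illustration. Because Ha states that the mean difference is below zero, the test is left-tailed (`alternative="less"`, available in recent SciPy versions). The second call shows that a paired test is the same as a one-sample test on the differences.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after weights (kg) for the same eight people
before = np.array([82.0, 95.5, 77.3, 101.2, 88.4, 91.0, 73.8, 85.6])
after = np.array([80.1, 93.0, 77.9, 97.5, 86.2, 90.4, 72.1, 83.0])

# Ha: mean(after - before) < 0, so use a left-tailed paired t-test
result = stats.ttest_rel(after, before, alternative="less")

# Equivalent formulation: one-sample t-test on the differences Xd
diff = after - before
check = stats.ttest_1samp(diff, popmean=0.0, alternative="less")

print(f"mean difference = {diff.mean():.3f}")
print(f"t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```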


### F-Test

**Definition:**<br>
An F-test is a hypothesis test that compares the variances of two populations. <br>


**Hypothesis 1:** 

H0: Variance of Data Scientist and Data Analyst salaries are similar. <br>
Ha: Variance of Data Scientist and Data Analyst salaries are different. <br>

Level of significance = 0.05

In [36]:
# Populations being compared
data_scientist_salaries = hypothesis_data.loc[hypothesis_data.job_category == 'data scientist','salary_in_usd']
data_analyst_salaries = hypothesis_data.loc[hypothesis_data.job_category == 'data analyst','salary_in_usd']

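# Perform F-test
# Caution: stats.f_oneway is a one-way ANOVA. Its F-statistic compares the group means,
# so it does not directly test the equal-variance hypothesis stated above.
f_statistic, p_value = stats.f_oneway(data_scientist_salaries, data_analyst_salaries)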

print(f"f_statistic: {f_statistic}; p_value: {p_value} \n")

# Compare p-value with significance level (e.g., 0.05)
alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis - There are significant differences in the variances of the groups.")
else:
    print("Failed to reject null hypothesis - There are no significant differences in the variances of the groups.")

f_statistic: 8.966934661369082; p_value: 0.002962789160360177 

Reject null hypothesis - There are significant differences in the variances of the groups.


**Hypothesis 2:** 

H0: Variance of Entry level and Executive level salaries are similar. <br>
Ha: Variance of Entry level and Executive level salaries are different. <br>

Level of significance = 0.05

In [35]:
# Populations being compared
entry_level_salaries = hypothesis_data.loc[hypothesis_data.experience_level == 'EN','salary_in_usd']
exec_level_salaries = hypothesis_data.loc[hypothesis_data.experience_level == 'EX','salary_in_usd']

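# Perform F-test
# Caution: stats.f_oneway is a one-way ANOVA. Its F-statistic compares the group means,
# so it does not directly test the equal-variance hypothesis stated above.
f_statistic, p_value = stats.f_oneway(entry_level_salaries, exec_level_salaries)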

print(f"f_statistic: {f_statistic}; p_value: {p_value} \n")

# Compare p-value with significance level
alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis - There are significant differences in the variances of the groups.")
else:
    print("Failed to reject null hypothesis - There are no significant differences in the variances of the groups.")

f_statistic: 82.9627895604364; p_value: 3.765360745681678e-15 

Reject null hypothesis - There are significant differences in the variances of the groups.
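
The hypotheses in this section are about variances, whereas `stats.f_oneway` compares group means. One way to test equal variances directly is the classic variance-ratio F-test, sketched below with a hypothetical helper `variance_f_test` and synthetic data. It assumes both samples are roughly normal. `stats.levene` is shown as a commonly used alternative that is less sensitive to non-normality.

```python
import numpy as np
from scipy import stats

def variance_f_test(x, y):
    """Two-sided F-test for equal variances of two independent samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    f = x.var(ddof=1) / y.var(ddof=1)      # ratio of sample variances
    dfn, dfd = x.size - 1, y.size - 1      # numerator / denominator degrees of freedom
    # Two-sided p-value: double the smaller tail, capped at 1
    tail = min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))
    return f, min(1.0, 2 * tail)

# Synthetic groups with the same mean but different spread (assumed values)
rng = np.random.default_rng(seed=2)
group_a = rng.normal(100_000, 30_000, size=40)
group_b = rng.normal(100_000, 60_000, size=40)

f_stat, p_val = variance_f_test(group_a, group_b)
levene_stat, levene_p = stats.levene(group_a, group_b)

print(f"variance-ratio F = {f_stat:.4f}, p = {p_val:.4g}")
print(f"Levene W = {levene_stat:.4f}, p = {levene_p:.4g}")
```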


### ANOVA test

ANOVA is a statistical method used to compare the means of three or more groups to determine if there are significant differences among them. It is a broader framework that allows for the examination of variability among multiple groups simultaneously. ANOVA partitions the total variance in the data into two components: the variance between groups and the variance within groups.

ANOVA uses an F-test to assess the statistical significance of the observed differences between group means. The F-statistic is the ratio of the mean square between groups to the mean square within groups.


Ex 1: Comparing the mean salaries of DS, DE and DA job categories.

H0: The mean salaries of the DS, DE and DA job categories are the same.<br>
Ha: At least one of the DS, DE and DA job categories has a different mean salary.

In [11]:
# Populations being compared
data_scientist_salaries = hypothesis_data.loc[hypothesis_data.job_category == 'data scientist','salary_in_usd']
data_engineer_salaries = hypothesis_data.loc[hypothesis_data.job_category == 'data engineer','salary_in_usd']
data_analyst_salaries = hypothesis_data.loc[hypothesis_data.job_category == 'data analyst','salary_in_usd']

# Perform one-way ANOVA (F-test on the group means)
f_statistic, p_value = stats.f_oneway(data_scientist_salaries, data_engineer_salaries, data_analyst_salaries)

# Compare p-value with significance level
alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis - There are significant differences in the means of the groups.")
else:
    print("Failed to reject null hypothesis - There are no significant differences in the means of the groups.")

Reject null hypothesis - There are significant differences in the means of the groups.
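
To make the variance partition concrete, the sketch below computes the F-statistic from the between-group and within-group mean squares and compares it with `stats.f_oneway`. The helper `one_way_anova` and the three small groups are hypothetical and serve only as an illustration.

```python
import numpy as np
from scipy import stats

def one_way_anova(*groups):
    """One-way ANOVA from first principles: F = MS_between / MS_within."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), all_values.size

    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    f = ms_between / ms_within
    p = stats.f.sf(f, k - 1, n - k)  # right-tail area of the F distribution
    return f, p

# Hypothetical groups
a, b, c = [1, 2, 3], [2, 3, 4], [5, 6, 7]
f_manual, p_manual = one_way_anova(a, b, c)
f_scipy, p_scipy = stats.f_oneway(a, b, c)

print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4g}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4g}")
```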


### Chi-squared test

The chi-squared test is used to determine whether there is a relationship between two categorical variables. Some of its use cases are listed below:

1. Goodness-of-Fit Test:
Determine if observed data fits a specified distribution or expected proportions.
<br/>Example: Testing if the distribution of blood types in a population follows the expected proportions of A, B, AB, and O blood types.

2. Test of Independence:
Assess if there is an association or relationship between two categorical variables.
<br/>Test Statistic: Pearson's chi-squared statistic.
<br/>Example: Investigating if there is a significant relationship between smoking habits (smoker vs. non-smoker) and lung cancer occurrence (present vs. absent).

3. Homogeneity Test:
Compare the distributions or proportions of categorical variables across different populations or groups.
<br/>Test Statistic: Pearson's chi-squared statistic.
<br/>Example: Analyzing if the proportions of political party affiliation differ significantly across different age groups (e.g., young adults, middle-aged, elderly).

4. Test of Association in Contingency Table:
Determine if there is a significant relationship between two categorical variables in a contingency table.
<br/>Example: Examining if there is an association between gender and preference for a particular brand (Brand A, Brand B, Brand C) based on survey data.

5. Fisher's Exact Test:
Determine the association between two categorical variables in a 2x2 contingency table when the sample sizes are small, and the Chi-square test may not be appropriate due to its reliance on asymptotic properties.  It is often applied in medical research, genetics, and other fields where small sample sizes are encountered. 
<br/>It is a non-parametric test that calculates the exact probability of observing the data, given the marginal totals, under the assumption of independence. It provides a p-value: the probability of observing a table at least as extreme as the one observed if the two variables were independent.
<br/>Assumptions: <br/>Fisher's Exact test does not rely on any specific assumptions about the underlying distribution of the data. However, it assumes that the data are collected independently and that the sampling is random.
<br/>Example: it can be used to assess if a particular treatment significantly affects the proportion of patients with a certain disease outcome.


Steps for Analysis:
<br/>a. Setup Hypotheses: Formulate the null and alternative hypotheses based on the research question.
<br/>b. Create Contingency Table: Organize the observed data into a contingency table.
<br/>c. Calculate Expected Frequencies: Compute the expected frequencies for each cell under the assumption of independence.
<br/>d. Calculate Chi-Squared Statistic: Use the observed and expected frequencies to calculate the chi-squared statistic.
<br/>e. Determine Significance: Compare the chi-squared statistic to a critical value from the chi-squared distribution or use a p-value to determine statistical significance.
<br/>f. Interpret Results: Make conclusions based on the significance level and the context of the analysis.

Source for the above information: Internet (ChatGPT)
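
Steps b to e can be sketched by hand on a small table. The 2x3 counts below are invented for illustration. The expected frequencies come from the row and column totals, and the result is cross-checked against `stats.chi2_contingency`.

```python
import numpy as np
from scipy import stats

# b. Hypothetical 2x3 contingency table of observed counts
observed = np.array([[30, 20, 10],
                     [20, 30, 40]])

# c. Expected frequency of each cell under independence:
#    (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# d. Pearson's chi-squared statistic
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# e. p-value from the chi-squared distribution with (rows-1)*(cols-1) dof
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2_stat, dof)

# Cross-check with SciPy
ref_chi2, ref_p, ref_dof, ref_expected = stats.chi2_contingency(observed, correction=False)

print(f"chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_value:.4g}")
```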


Q. Use cases "Test of Independence" and "Test of Association in Contingency Table" both sound similar. How do they differ?
- Yes, both are used to determine association between categorical variables and use contingency tables. They differ on the basis of focus and context of the analysis.

Test of Independence: The user is interested in whether two categorical variables are associated on the whole, i.e. in showing "dependency" between the variables. The observed values are usually represented by a 2 x 2 contingency table recording the presence (True/1) or absence (False/0) of each of the two categories. For example:
<br/>Ex 1: Does a smoker (Smoker = 1) have a higher chance of having Lung Cancer (Cancer = 1)?
<br/>Ex 2: Does a Female buyer (Female = 1) have a higher chance of buying a handbag (Buying Handbag = 1)?


Test of Association in Contingency Table: The contingency tables associated with this test are of varying sizes. It is used to investigate "association" between two categorical variables. For example:
<br/>Ex 1: Association between genders (M/F/Other) and which brand they prefer.
<br/>Ex 2: Age groups and what kind of candy they like.

### Test of association in contingency table

**Defining Null hypothesis and Alternative hypothesis**

Ho: Employee residence is not associated with company location.

Ha: Employee residence is associated with company location.

Level of confidence assumed = 95%
Thus level of significance = 1 - 0.95 = 0.05

In [11]:
# First we create a contingency table containing observed values 
# for Company location and Employee Location from the dataset
observed = pd.crosstab(hypothesis_data['company_location'], hypothesis_data['employee_residence'])
observed.head()

company_location \ employee_residence,AE,AR,AT,AU,BE,BG,BO,BR,CA,CH,...,RO,RS,RU,SG,SI,TN,TR,UA,US,VN
AE,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AT,0,0,3,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AU,0,0,0,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
BE,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
from scipy.stats import chi2_contingency

# Perform chi-squared test
chi2, p, dof, expected = chi2_contingency(observed)
print(f"Chi-squared value: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of freedom (dof): {dof}")
print(f"Expected values: {expected}")

Chi-squared value: 23187.380792483484
P-value: 0.0
Degrees of freedom (dof): 2744
Expected values: [[1.48270181e-02 4.94233937e-03 1.48270181e-02 ... 4.94233937e-03
  1.64085667e+00 1.48270181e-02]
 [4.94233937e-03 1.64744646e-03 4.94233937e-03 ... 1.64744646e-03
  5.46952224e-01 4.94233937e-03]
 [1.97693575e-02 6.58978583e-03 1.97693575e-02 ... 6.58978583e-03
  2.18780890e+00 1.97693575e-02]
 ...
 [4.94233937e-03 1.64744646e-03 4.94233937e-03 ... 1.64744646e-03
  5.46952224e-01 4.94233937e-03]
 [1.75453048e+00 5.84843493e-01 1.75453048e+00 ... 5.84843493e-01
  1.94168040e+02 1.75453048e+00]
 [4.94233937e-03 1.64744646e-03 4.94233937e-03 ... 1.64744646e-03
  5.46952224e-01 4.94233937e-03]]


In [15]:
# Compare p-value with significance level (e.g., 0.05)
alpha = 0.05
if p < alpha:
    print("Reject null hypothesis - There is a significant association between the variables.")
else:
    print("Failed to reject null hypothesis - There is no significant association between the variables.")

Reject null hypothesis - There is a significant association between the variables.


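Conclusion: At the 0.05 significance level we reject the null hypothesis: an employee's residence is significantly associated with the company's location. Note that most of the expected values printed above are far below 5, so the chi-squared approximation is unreliable for this table and the result should be read with caution.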

### Fisher's Exact Test

Used to test how likely it is that values as extreme as those in the given data would occur by chance. It is commonly used when the sample is so small that applying the Chi-squared test would be inappropriate.


Assumptions:
It does not rely on asymptotic approximations, making it suitable for small samples.
Appropriate when the expected cell counts in a contingency table are less than 5.


How does it differ from Chi-squared test?

Fisher's Exact Test differs from the Chi-squared test in terms of sample size and assumptions.
1. Sample Size Sensitivity:
- Fisher's Exact Test is more suitable for small sample sizes, especially when dealing with 2x2 tables.
- The Chi-squared test is more commonly used with larger sample sizes.
2. Exact vs. Asymptotic:
- Fisher's Exact Test calculates the exact probability of observing the data given the marginal totals.
- The Chi-squared test relies on asymptotic approximations, making it a good choice for larger datasets.
3. Expected Frequencies:
- Fisher's Exact Test is preferred when expected frequencies in any cell are very low.
- The Chi-squared test can become less reliable with small expected frequencies.

As a rule of thumb, Fisher's Exact Test is the safer choice when the sample size is small (<20) and the expected frequencies are low (<5 in at least 20% of the cells). If the sample size is large, the Chi-squared test is usually appropriate.
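
The rule of thumb can be written as a small check. The helper `prefer_fisher` below is hypothetical, and its thresholds simply mirror the numbers quoted in the rule. They are parameters and can be adjusted. It relies on `scipy.stats.contingency.expected_freq` to compute the expected counts under independence.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def prefer_fisher(table, max_n=20, min_expected=5.0, low_cell_share=0.2):
    """Return True when the rule of thumb points to Fisher's Exact Test."""
    table = np.asarray(table)
    expected = expected_freq(table)                    # expected counts under independence
    share_low = np.mean(expected < min_expected)       # fraction of cells with low expected count
    return bool(table.sum() < max_n and share_low >= low_cell_share)

# Hypothetical tables
small_table = [[2, 5], [3, 4]]      # 14 observations, all expected counts below 5
large_table = [[50, 60], [70, 80]]  # 260 observations

print(prefer_fisher(small_table))
print(prefer_fisher(large_table))
```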


#### Implementation
Let's try to figure out if company sizes (S & L) are associated with experience level (Mid level and Entry level) of employees.
<br/>H0: There is no association between the size of the company and the experience level of employees.
<br/>Ha: There is an association between the size of the company and the experience level of employees.

Level of significance: 0.05 (or 95% Confidence)

In [29]:
exact_test_df = hypothesis_data\
                .loc[((hypothesis_data.experience_level.isin(['EN', 'MI']))  \
                & (hypothesis_data.company_size.isin(['S', 'L'])))]\
                .sample(n=30)  # no random_state is set, so the sampled table changes on every run
observed_contingeny_tb = pd.crosstab(exact_test_df['experience_level'], exact_test_df['company_size'])
observed_contingeny_tb.head()

experience_level \ company_size,L,S
EN,2,5
MI,13,10


In [31]:
# Perform Fisher's Exact test
odds_ratio, p_value = stats.fisher_exact(observed_contingeny_tb)

print(f"Odds-ratio: {odds_ratio}; p-value: {p_value}")

# Compare p-value with significance level (e.g., 0.05)
alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis - There is a significant association between the variables.")
else:
    print("Failed to reject null hypothesis - There is no significant association between the variables.")

Failed to reject null hypothesis - There is no significant association between the variables.
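
Since the sample above is drawn without a fixed seed, the result can be reproduced by entering the counts from the printed contingency table directly. This is a minimal sketch. The variable names are chosen for this example, and the hand-computed ratio is the sample (cross-product) odds ratio that `stats.fisher_exact` reports by default.

```python
import numpy as np
from scipy import stats

# Counts copied from the contingency table printed above
table = np.array([[2, 5],     # EN: company_size L, S
                  [13, 10]])  # MI: company_size L, S

odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")

# Sample odds ratio by hand: (a * d) / (b * c)
manual_odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])

print(f"Odds-ratio: {odds_ratio:.4f}; manual: {manual_odds_ratio:.4f}")
print(f"p-value: {p_value:.4f}")
```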
