In [27]:
import numpy as np
import pandas as pd
import os
path = os.environ.get("store_data")
store = pd.HDFStore(path=path)

In [29]:
df = store["company_salary_dataset"]
df.head()

,surname,name,age,gender,country,ethnicity,start_date,department,position,salary
0,Bold,Caroline,63,Female,United States,White,2012-07-02,Executive Office,President & CEO,166400.0
1,Zamora,Jennifer,38,Female,United States,White,2010-04-10,IT/IS,CIO,135200.0
2,Houlihan,Debra,51,Female,United States,White,2014-05-05,Sales,Director of Sales,124800.0
3,Bramante,Elisa,34,Female,United States,Black or African American,2009-01-05,Production,Director of Operations,124800.0
4,Del Bosque,Keyla,38,Female,United States,Black or African American,2012-01-09,Software Engineering,Software Engineer,118809.6


# Hypothesis Tests

The dataset consists of 174 observations randomly selected from a company with 5,000 employees.

Hypothesis test 1 determines whether there is a significant difference between the salaries paid to male and female employees.

Hypothesis test 2 determines whether there is a significant difference between the salaries paid to white and non-white employees.

Both tests use a two-sample, two-tailed t-test that assumes equal population variances, so the two sample variances are pooled.
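For reference, the standard textbook form of the equal-variance (pooled) t-test is written out below. The symbols $n_1, n_2$ (group sizes), $\bar{x}_1, \bar{x}_2$ (group means) and $s_1^2, s_2^2$ (sample variances) are notation introduced here, not taken from the notebook; they correspond to the `n`, `mean` and `sample_variance` columns computed in the cells that follow.

$$
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
SE = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{SE}
$$

Under the null hypothesis, $t$ follows a t-distribution with $n_1 + n_2 - 2$ degrees of freedom.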


### Hypothesis Test 1 (Gender)

<p>H<sub>0</sub>:&nbsp;&nbsp;&#956;<sub>m</sub> - &#956;<sub>f</sub>&nbsp;=&nbsp;0 &nbsp; The average salary for male employees is equal to the average salary for female employees.</p> 
<p>H<sub>1</sub>:&nbsp;&nbsp;&#956;<sub>m</sub> - &#956;<sub>f</sub>&nbsp;&#8800;&nbsp;0 &nbsp; The average salary for male employees is not equal to the average salary for female employees.</p>

In [7]:
# Per-group sample size, mean salary and unbiased (ddof=1) sample variance
table = df.groupby("gender")["salary"].agg(n="count", mean="mean", sample_variance="var")

# Pooled variance: average of the two sample variances, weighted by degrees of freedom
table["pooled_variance"] = ((table["n"] - 1) * table["sample_variance"]).sum() / (table["n"].sum() - 2)
# Standard error of the difference in means: sqrt(s_p^2 * (1/n_1 + 1/n_2))
table["standart_error"] = np.sqrt((table["pooled_variance"] / table["n"]).sum())

In [9]:
table.style.format(thousands = ".", decimal = ",", precision=2)

gender,n,mean,sample_variance,pooled_variance,standart_error
Female,98,"65.736,91","1.097.618.027,68","1.160.327.458,01","5.206,49"
Male,76,"72.300,53","1.241.431.654,56","1.160.327.458,01","5.206,49"


In [11]:
T_score = (table.loc["Male", "mean"] - table.loc["Female", "mean"]) / table.loc["Male","standart_error"]

In [13]:
# The critical value and the p-value are entered by hand here, not computed from the data
print(f"T_score: {T_score:.2f}\nCritical_value(t_table_value): {2.58}\np-value: {0.209}")

T_score: 1.26
Critical_value(t_table_value): 2.58
p-value: 0.209
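The critical value and p-value above are typed in by hand. As a cross-check, they can also be computed from the summary statistics in the gender table. The sketch below assumes SciPy is installed (the notebook does not import it) and uses `scipy.stats.ttest_ind_from_stats`. The significance level `alpha = 0.01` is an assumption; the notebook does not state one.

```python
from scipy import stats

# Summary statistics copied from the gender table above
n_m, mean_m, var_m = 76, 72300.53, 1241431654.56
n_f, mean_f, var_f = 98, 65736.91, 1097618027.68

# ttest_ind_from_stats expects standard deviations, not variances
result = stats.ttest_ind_from_stats(
    mean1=mean_m, std1=var_m ** 0.5, nobs1=n_m,
    mean2=mean_f, std2=var_f ** 0.5, nobs2=n_f,
    equal_var=True,  # pooled-variance t-test, as in the notebook
)

dof = n_m + n_f - 2
alpha = 0.01  # assumed significance level, not stated in the notebook
t_crit = stats.t.ppf(1 - alpha / 2, dof)  # two-tailed critical value

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}, t_crit = {t_crit:.2f}")
```

Computing the critical value from the t-distribution at the actual degrees of freedom avoids relying on a table lookup, and it makes the chosen significance level explicit.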


### Hypothesis Test 1 Interpretation

Since the t-score (1.26) is smaller than the critical value (2.58), and the p-value (0.209) is larger than the common significance levels (0.01, 0.05, 0.10), there is not enough evidence to reject the null hypothesis. In other words, the sample does not show a statistically significant difference between the average salaries of male and female employees in the company.

<p>H<sub>0</sub>:&nbsp;&nbsp;&#956;<sub>m</sub> - &#956;<sub>f</sub>&nbsp;=&nbsp;0 &nbsp; cannot be rejected.</p>

### Hypothesis Test 2 (Ethnicity)

The employees are divided into two groups: 'White' (only employees recorded as White) and 'Non-White' (Asian, Black or African American, Hispanic, Two or more races).

<p>H<sub>0</sub>:&nbsp;&nbsp;&#956;<sub>w</sub> - &#956;<sub>non_w</sub>&nbsp;=&nbsp;0 &nbsp;  The average salary for White employees is equal to the average salary for non-White employees.</p> 
<p>H<sub>1</sub>:&nbsp;&nbsp;&#956;<sub>w</sub> - &#956;<sub>non_w</sub>&nbsp;&#8800;&nbsp;0 &nbsp; The average salary for White employees is not equal to the average salary for non-White employees.</p>

In [16]:
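# Work on a copy so the original ethnicity labels in df stay intact
df2 = df.copy()
# Keep "White" and relabel every other ethnicity as "Non-White"
df2["ethnicity"] = df2["ethnicity"].where(df2["ethnicity"] == "White", other="Non-White")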
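The recoding step relies on `Series.where`, which keeps the values where the condition is true and replaces all others. A minimal illustration on a toy Series (the values below are made up for the example and are not rows from the dataset):

```python
import pandas as pd

# Hypothetical toy values, not taken from the salary dataset
ethnicity = pd.Series(["White", "Asian", "Hispanic", "White", "Two or more races"])

# Entries equal to "White" are kept; everything else becomes "Non-White"
binary = ethnicity.where(ethnicity == "White", other="Non-White")

print(binary.tolist())
# ['White', 'Non-White', 'Non-White', 'White', 'Non-White']
```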

In [19]:
# Per-group sample size, mean salary and unbiased (ddof=1) sample variance
table2 = df2.groupby("ethnicity")["salary"].agg(n="count", mean="mean", sample_variance="var")

# Pooled variance: average of the two sample variances, weighted by degrees of freedom
table2["pooled_variance"] = ((table2["n"] - 1) * table2["sample_variance"]).sum() / (table2["n"].sum() - 2)
# Standard error of the difference in means: sqrt(s_p^2 * (1/n_1 + 1/n_2))
table2["standart_error"] = np.sqrt((table2["pooled_variance"] / table2["n"]).sum())

In [21]:
table2.style.format(thousands = ".", decimal = ",", precision=2)

ethnicity,n,mean,sample_variance,pooled_variance,standart_error
Non-White,62,"70.917,26","1.225.049.916,30","1.168.051.481,95","5.410,04"
White,112,"67.323,10","1.136.728.018,03","1.168.051.481,95","5.410,04"


In [23]:
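# Absolute value is used because the test is two-tailed; only the size of the gap matters
T_score2 = abs((table2.loc["White", "mean"] - table2.loc["Non-White", "mean"])) / table2.loc["White","standart_error"]
# The critical value and the p-value are entered by hand here, not computed from the data
print(f"T_score: {T_score2:.2f}\nCritical_value(t_table_value): {2.58}\np-value: {0.510}")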

T_score: 0.66
Critical_value(t_table_value): 2.58
p-value: 0.51
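The arithmetic behind this t-score can be reproduced from the ethnicity table using only the standard library. The helper `pooled_t` below is a hypothetical name introduced for illustration; it is a sketch of the pooled-variance calculation, fed with the rounded values shown in the table.

```python
import math

def pooled_t(n1, mean1, var1, n2, mean2, var2):
    """Return (t, standard_error, dof) for a pooled two-sample t-test."""
    dof = n1 + n2 - 2
    # Average of the two sample variances, weighted by degrees of freedom
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / dof
    # Standard error of the difference between the two means
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, se, dof

# Non-White first, White second, values copied from the ethnicity table
t_stat, se, dof = pooled_t(62, 70917.26, 1225049916.30,
                           112, 67323.10, 1136728018.03)

print(f"t = {t_stat:.2f}, SE = {se:.2f}, dof = {dof}")
```

Because the inputs are rounded to two decimals, the result may differ from the notebook's value in the last digits.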


### Hypothesis Test 2 Interpretation

Since the t-score (0.66) is smaller than the critical value (2.58), and the p-value (0.51) is larger than the common significance levels (0.01, 0.05, 0.10), there is not enough evidence to reject the null hypothesis. In other words, the sample does not show a statistically significant difference between the average salaries of white and non-white employees in the company.

<p>H<sub>0</sub>:&nbsp;&nbsp;&#956;<sub>w</sub> - &#956;<sub>non_w</sub>&nbsp;=&nbsp;0 &nbsp; cannot be rejected.</p>