# HYPOTHESIS TESTING

In this notebook, we explore hypothesis testing using employee data from the IBM HR Analytics dataset. The goal is to test whether mean monthly income differs significantly between employees who left the company (attrition) and those who stayed. The null hypothesis is that the two groups have the same mean monthly income; the alternative is that the means differ.


## Step 1: Import Libraries

First, we import the necessary libraries: `pandas` for data manipulation, `numpy` for numerical operations, and `scipy.stats` for statistical functions.


In [17]:
import pandas as pd
import numpy as np
import scipy.stats as stats

## Step 2: Load and Preview the Data

Next, we load the dataset, which contains information on employees' demographics, employment details, and whether they left the company or not (Attrition).


In [18]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'D:\HypothesisTesting\pythonProject\WA_Fn-UseC_-HR-Employee-Attrition.csv')

df.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


## Step 3: Data Preprocessing

Here we prepare the data for the hypothesis test. We keep only the `Attrition` and `MonthlyIncome` columns, then split the monthly incomes into two groups: employees who left (Attrition = 'Yes') and employees who stayed (Attrition = 'No').


In [19]:
df = df[['Attrition','MonthlyIncome']]

In [20]:
df.head()

| | Attrition | MonthlyIncome |
|---|---|---|
| 0 | Yes | 5993 |
| 1 | No | 5130 |
| 2 | Yes | 2090 |
| 3 | No | 2909 |
| 4 | No | 3468 |


In [21]:
# Split MonthlyIncome into two groups by attrition status
attrition_income = df.loc[df['Attrition'] == 'Yes', 'MonthlyIncome']
non_attrition_income = df.loc[df['Attrition'] == 'No', 'MonthlyIncome']
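
Before running the test, it can help to confirm that both groups are non-empty and that no incomes are missing. The following is a minimal sketch on a made-up frame: `toy_df` is a hypothetical stand-in built from the five preview rows shown above, not the full dataset, so it is far too small for a real test.

```python
import pandas as pd

# Hypothetical toy data: the five preview rows, same column names as df
toy_df = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468],
})

# Group size, mean, and sample standard deviation (ddof=1) per attrition status
summary = toy_df.groupby("Attrition")["MonthlyIncome"].agg(["count", "mean", "std"])

# Number of missing incomes; ideally 0 before running the test
n_missing = int(toy_df["MonthlyIncome"].isna().sum())

print(summary)
print(f"Missing MonthlyIncome values: {n_missing}")
```

In the notebook itself, the same two statements could be applied to `df` in place of `toy_df`.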

## Step 4: Conduct a Z-Test

We use a Z-test to compare the means of two independent groups: employees who left and those who stayed. The test will help us determine whether the observed difference in average monthly income is statistically significant.
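
For reference, the hypotheses and the test statistic can be written out as below. The notation is introduced here for illustration: subscript 1 denotes the group that left and subscript 2 the group that stayed, with sample mean $\bar{x}$, sample standard deviation $s$, and group size $n$.

```latex
H_0:\ \mu_1 = \mu_2 \qquad H_1:\ \mu_1 \neq \mu_2

z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

The denominator is the standard error of the difference in means, computed without assuming equal variances in the two groups.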


In [22]:
# Calculate descriptive statistics
mean_attrition = np.mean(attrition_income)
mean_non_attrition = np.mean(non_attrition_income)
std_attrition = np.std(attrition_income, ddof=1)  # sample standard deviation
std_non_attrition = np.std(non_attrition_income, ddof=1)
n_attrition = len(attrition_income)
n_non_attrition = len(non_attrition_income)

In [23]:
# Z-Score calculation
z_score = (mean_attrition - mean_non_attrition) / np.sqrt((std_attrition**2 / n_attrition) + (std_non_attrition**2 / n_non_attrition))

# Critical Z-value for a two-tailed test at alpha = 0.05 (95% confidence)
alpha = 0.05
z_critical = stats.norm.ppf(1 - alpha / 2)



## Conclusion

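We reject the null hypothesis when the absolute value of the Z-score exceeds the critical value, which for a two-tailed test is equivalent to the p-value being less than the significance level (alpha = 0.05). Here the Z-score is about -7.48, well beyond the critical value of about 1.96, and the p-value is about 7.3e-14, so we reject the null hypothesis: there is a statistically significant difference in mean monthly income between employees who left and those who stayed. The negative sign indicates that the employees who left had the lower average income.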
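
For a two-tailed test, the p-value rule and the critical-value rule always lead to the same decision. As a minimal sketch, the p-value can be computed from the Z-score with `scipy.stats.norm.sf`; the Z-score below is hard-coded from the notebook's output so the snippet runs on its own.

```python
from scipy import stats

# Z-score reported by the notebook's manual calculation
z_score = -7.482621586644742
alpha = 0.05

# Two-sided p-value: probability of a value at least this extreme in either tail
p_value = 2 * stats.norm.sf(abs(z_score))

# Critical value for the same alpha, to compare the two decision rules
z_critical = stats.norm.ppf(1 - alpha / 2)
reject_by_p = p_value < alpha
reject_by_z = abs(z_score) > z_critical

print(f"p-value: {p_value:.3e}")
print(f"Reject H0 (p-value rule): {reject_by_p}")
print(f"Reject H0 (critical-value rule): {reject_by_z}")
```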


In [24]:

# Decision
if np.abs(z_score) > z_critical:
    print("Reject the null hypothesis: There is a significant difference in monthly income between employees who left and those who stayed.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in monthly income between employees who left and those who stayed.")

print(f"Z-Score: {z_score}, Z-Critical: {z_critical}")

Reject the null hypothesis: There is a significant difference in monthly income between employees who left and those who stayed.
Z-Score: -7.482621586644742, Z-Critical: 1.959963984540054


In [25]:
from statsmodels.stats.weightstats import ztest

# Cross-check the manual result with statsmodels' two-sample z-test.
# usevar='unequal' uses a separate variance estimate for each group,
# matching the manual Z-score formula above.
z_score, p_value = ztest(attrition_income, non_attrition_income, usevar='unequal')

# Significance level (alpha) for a 95% confidence level
alpha = 0.05

# Print Z-score and p-value
print(f"Z-Score: {z_score}")
print(f"P-Value: {p_value}")

# Decision rule
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in monthly income between employees who left and those who stayed.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in monthly income between employees who left and those who stayed.")

Z-Score: -7.482621586644742
P-Value: 7.28544370143024e-14
Reject the null hypothesis: There is a significant difference in monthly income between employees who left and those who stayed.
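
The decision alone does not show how large the income gap is. One possible follow-up is a confidence interval for the difference in means, built from the same standard error as the Z-score. The sketch below is an illustration, not part of the original analysis: `diff_mean_ci` is a hypothetical helper, and the synthetic incomes (their means, spreads, and sizes) are made up purely so the code runs on its own.

```python
import numpy as np
from scipy import stats


def diff_mean_ci(a, b, confidence=0.95):
    """Normal-approximation CI for mean(a) - mean(b), unequal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a.mean() - b.mean()
    # Same standard error as the denominator of the Z-score
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    return diff - z_crit * se, diff + z_crit * se


# Hypothetical synthetic incomes, used only to make the sketch runnable
rng = np.random.default_rng(0)
left = rng.normal(loc=4800, scale=3600, size=240)
stayed = rng.normal(loc=6800, scale=4800, size=1230)

low, high = diff_mean_ci(left, stayed)
print(f"95% CI for the difference in means: ({low:.1f}, {high:.1f})")
```

In the notebook, the call would be `diff_mean_ci(attrition_income, non_attrition_income)`. An interval that excludes zero corresponds to rejecting the null hypothesis at the matching significance level.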
