# Data Analysis Mathematics, Algorithms and Modeling

## Team Information - **TODO: Fill the type of assignment**

**Team Members**

Name: Ayush Patel  
Student Number: 9033358

Name: Nikhil Shankar  
Student Number: 9026254

Name: Sreehari Prathap  
Student Number: 8903199


## Step 1: Install and Configure the IDE (e.g., Jupyter Notebook and VS Code)
- Install Anaconda (for Jupyter Notebook) and Visual Studio Code (VS Code).
  - Anaconda: Visit [anaconda.com](https://www.anaconda.com/products/individual) and download the appropriate installer for your operating system.
  - VS Code: Download and install from [Visual Studio Code](https://code.visualstudio.com/).
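- Install the required Python libraries (pandas, SciPy, and Matplotlib, all of which are imported in the cells below)
  - Open the terminal and run the following command: `pip install pandas scipy matplotlib`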
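
As a quick sanity check after installation, the sketch below reports which of the libraries used in this notebook cannot be found in the current environment. The helper name `missing_packages` is hypothetical and only uses the standard library.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Packages imported by the notebook cells in this document.
required = ["pandas", "scipy", "matplotlib"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```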

## Step 2: Downloading the Dataset
We are using the Utrecht Fairness Recruitment dataset from Kaggle, which can be downloaded directly via the link:
- URL: <https://www.kaggle.com/datasets/ictinstitute/utrecht-fairness-recruitment-dataset>

## Step 3: Data Cleansing

### Data Cleansing Process for User Data (Talent Acquisition) from a CSV File

In [36]:
import pandas as pd
import matplotlib.pyplot as plt

file = "recruitment_dataset.csv"
df_unfiltered = pd.read_csv(file)
row_count = len(df_unfiltered)

print("Total number of datapoints(rows)", row_count)

Total number of datapoints(rows) 4000


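##### Removing rows with empty values from the dataset
Passing `axis=0` with `how='any'` to `dropna` drops every row that contains at least one empty value. We then keep only the rows whose `gender` is `male` or `female`.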

In [37]:
df = df_unfiltered.dropna(axis=0, how='any')
df = df[(df['gender'] == 'male') | (df['gender'] == 'female')]
print("Total number of datapoints(rows)", len(df))

Total number of datapoints(rows) 3917
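
To see how many rows each of the two cleaning steps removes, the steps can be applied one at a time. The sketch below uses a small hypothetical DataFrame (not the recruitment dataset) purely to illustrate the pattern.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has many more columns and rows.
toy = pd.DataFrame({
    "gender": ["male", "female", "other", "female", None],
    "age": [25, 31, 28, None, 40],
    "decision": [True, False, True, False, True],
})

# Step 1: drop rows that contain at least one missing value.
no_missing = toy.dropna(axis=0, how="any")
# Step 2: keep only the two gender categories used in the analysis.
cleaned = no_missing[no_missing["gender"].isin(["male", "female"])]

print("Rows removed by dropna:", len(toy) - len(no_missing))
print("Rows removed by gender filter:", len(no_missing) - len(cleaned))
print("Rows remaining:", len(cleaned))
```

Running the same two counts on the real data would show which step accounts for the difference between the 4000 original rows and the 3917 remaining ones.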


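After this step, the dataset contains no rows with empty values and is restricted to the `male` and `female` gender categories, leaving 3917 of the original 4000 rows. The dataset is now filtered for analysis, so we can proceed to the next step.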

#### **Aim of the analysis**
We want to identify the factors that have the greatest impact on the hiring decision.


#### **Selecting the right test**
Our dataset is largely categorical; only the age and grade columns are non-categorical. Using the Shapiro-Wilk test, we rejected the null hypothesis that these columns are normally distributed.
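
The normality check itself is not shown in this notebook. A minimal sketch of how it could be run with `scipy.stats.shapiro` follows; the sample here is synthetic, right-skewed data standing in for a numeric column, and the 0.05 significance level is an assumption.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic, right-skewed stand-in for a numeric column (not the real data).
rng = np.random.default_rng(0)
sample = rng.exponential(scale=5.0, size=500)

# shapiro returns the W statistic and the p-value for the null hypothesis
# that the sample was drawn from a normal distribution.
statistic, p_value = shapiro(sample)
alpha = 0.05  # assumed significance level

print(f"W = {statistic:.4f}, p = {p_value:.3g}")
if p_value < alpha:
    print("Reject the null hypothesis of normality.")
else:
    print("No evidence against normality at this level.")
```

On the real data, `sample` would be replaced by the numeric column of interest from `df`.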

> Since the data is not normally distributed, it would not be wise to choose an F-test, a t-test, or a one-way ANOVA.

> Since our data is largely categorical and we want to test the relationship between two categorical variables, the chi-squared test is the best option.

> Why not Wilcoxon? We are not comparing two related samples. We could use it if we were comparing, say, total applicants with hired people. But first we are more interested in finding out whether there is any direct association between two parameters.

#### **Chi-Squared Test**

**Step 1:**

Create a contingency table with hiring decision and gender for the whole population

In [38]:
contingency_table = pd.crosstab(df['gender'], df['decision'], rownames=['Gender'], colnames=['Hired'])

**Step 2:**

Perform the chi-squared test on the contingency table

In [39]:
import pandas as pd
from scipy.stats import chi2_contingency

# Perform Chi-Squared Test
def calculate_chi_squared(contingency_table):
    chi2_stat, p_val, dof, ex = chi2_contingency(contingency_table)
    print(f"Chi-Squared Statistic: {chi2_stat}")
    print(f"P-Value: {p_val}")
    print(f"Degrees of Freedom: {dof}")
    print(f"Expected Frequencies:\n{ex}")

calculate_chi_squared(contingency_table)

Chi-Squared Statistic: 26.77640016379292
P-Value: 2.2840980419948216e-07
Degrees of Freedom: 1
Expected Frequencies:
[[1222.42787848  567.57212152]
 [1452.57212152  674.42787848]]
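
To make the output above less of a black box, the expected frequencies and the test statistic can be reproduced by hand. The sketch below assumes the whole-population counts are the sums of the four per-company tables shown later in this document, and it applies Yates' continuity correction, which `chi2_contingency` uses by default when there is one degree of freedom.

```python
# Whole-population counts, obtained by summing the four per-company tables
# (rows: female, male; columns: Hired False, Hired True).
observed = [[1298, 492],
            [1377, 750]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Yates' continuity correction subtracts 0.5 from each |O - E| in a 2x2 table.
chi2_stat = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print("Expected frequencies:", expected)
print("Chi-squared statistic:", round(chi2_stat, 4))
```

The values agree with the expected frequencies and the statistic reported by `calculate_chi_squared` for the whole population.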


In [40]:
def create_contingency_table_company(company_name, df):
    df_new = df[df['company'] == company_name]
    contingency_table_new = pd.crosstab(df_new['gender'], df_new['decision'], rownames=['Gender'], colnames=['Hired'])
    return contingency_table_new
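
As an alternative to calling the helper once per company, the tables could be built in a single pass with `groupby`. This is only a sketch: the function name `contingency_tables_by_company` is hypothetical and the toy rows are made up to show the shape of the result.

```python
import pandas as pd


def contingency_tables_by_company(df):
    """Return {company: gender-by-decision crosstab} for every company in df."""
    return {
        company: pd.crosstab(group["gender"], group["decision"],
                             rownames=["Gender"], colnames=["Hired"])
        for company, group in df.groupby("company")
    }


# Hypothetical toy rows, only to show the shape of the result.
toy = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B", "B"],
    "gender": ["male", "female", "female", "male", "male", "female"],
    "decision": [True, False, True, False, True, False],
})

tables = contingency_tables_by_company(toy)
for company, table in tables.items():
    print(f"Company {company}")
    print(table, end="\n\n")
```

Each table in the dictionary could then be passed to `calculate_chi_squared` in a loop.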

#### **Company A**

In [41]:
ct_a = create_contingency_table_company('A', df)
display(ct_a)
calculate_chi_squared(ct_a)

| Gender | Hired: False | Hired: True |
| --- | --- | --- |
| female | 272 | 171 |
| male | 294 | 247 |


Chi-Squared Statistic: 4.677802663139956
P-Value: 0.03055480773483987
Degrees of Freedom: 1
Expected Frequencies:
[[254.81504065 188.18495935]
 [311.18495935 229.81504065]]


#### **Company B**

In [42]:
ct_b = create_contingency_table_company('B', df)
display(ct_b)
calculate_chi_squared(ct_b)

| Gender | Hired: False | Hired: True |
| --- | --- | --- |
| female | 388 | 63 |
| male | 289 | 241 |


Chi-Squared Statistic: 111.60392237769867
P-Value: 4.363235585451077e-26
Degrees of Freedom: 1
Expected Frequencies:
[[311.24057085 139.75942915]
 [365.75942915 164.24057085]]


#### **Company C**

In [43]:
ct_c = create_contingency_table_company('C', df)
display(ct_c)
calculate_chi_squared(ct_c)

| Gender | Hired: False | Hired: True |
| --- | --- | --- |
| female | 347 | 106 |
| male | 379 | 145 |


Chi-Squared Statistic: 2.1044293643383085
P-Value: 0.1468731254251134
Degrees of Freedom: 1
Expected Frequencies:
[[336.62026612 116.37973388]
 [389.37973388 134.62026612]]


#### **Company D**

In [44]:
ct_d = create_contingency_table_company('D', df)
display(ct_d)
calculate_chi_squared(ct_d)

| Gender | Hired: False | Hired: True |
| --- | --- | --- |
| female | 291 | 152 |
| male | 415 | 117 |


Chi-Squared Statistic: 17.750377146944285
P-Value: 2.5186755742790345e-05
Degrees of Freedom: 1
Expected Frequencies:
[[320.7774359 122.2225641]
 [385.2225641 146.7774359]]
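
The four results can be read side by side by comparing each p-value with a significance level and looking at the hiring rate per gender. The sketch below copies the counts and p-values from the outputs above; the 0.05 threshold is an assumption, since the notebook does not state one.

```python
# Observed counts copied from the per-company tables above:
# company -> {gender: (not hired, hired)}
counts = {
    "A": {"female": (272, 171), "male": (294, 247)},
    "B": {"female": (388, 63), "male": (289, 241)},
    "C": {"female": (347, 106), "male": (379, 145)},
    "D": {"female": (291, 152), "male": (415, 117)},
}
# P-values copied from the chi-squared outputs above.
p_values = {
    "A": 0.03055480773483987,
    "B": 4.363235585451077e-26,
    "C": 0.1468731254251134,
    "D": 2.5186755742790345e-05,
}
ALPHA = 0.05  # assumed significance level, not stated in the original


def hire_rate(not_hired, hired):
    """Fraction of applicants in a group who were hired."""
    return hired / (not_hired + hired)


for company, groups in counts.items():
    female = hire_rate(*groups["female"])
    male = hire_rate(*groups["male"])
    verdict = "significant" if p_values[company] < ALPHA else "not significant"
    print(f"Company {company}: female {female:.1%}, male {male:.1%}, {verdict}")

# Companies whose gender/decision association is significant at ALPHA.
significant = [c for c, p in p_values.items() if p < ALPHA]
print("Significant at 0.05:", significant)
```

Under this threshold, companies A, B, and D show a statistically significant association between gender and hiring decision, while company C does not. The hiring rates also show that the direction of the difference is not the same everywhere: in company D the female hiring rate is higher than the male one. A significant chi-squared result indicates association only, not its cause.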
