# Data Analysis Mathematics, Algorithms and Modeling

## Team Information - Problem Analysis Workshop 1

**Team Members**

Name: Ayush Patel  
Student Number: 9033358

Name: Nikhil Shankar  
Student Number: 9026254

Name: Sreehari Prathap  
Student Number: 8903199


### Challenge 

### Field Of Inquiry - Talent Acquisition
#### Amazon Recruitment Automation And Gender Bias Case Study

In 2015, Amazon discovered a serious problem with an automated tool it used in recruitment. The tool was designed to filter and rank resumes against job requirements. To train the model, Amazon relied on data from the previous decade: the resumes that had been submitted, the corresponding job descriptions, and the subset of applicants who were ultimately hired. Because this historical data was itself biased toward men, especially for technical positions, the model gave male applicants an undue advantage and ranked resumes that read as male above those that read as female.

#### How can we track it?
In this context, tracking the bias in the automated recruitment tool is crucial. Monitoring its performance and ensuring that it doesn't favor male applicants over female applicants or any other demographic group is essential. Tracking would involve evaluating the model’s output, analyzing the impact of changes made to the model, and ensuring fairness across different gender, racial, and socioeconomic groups.
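
One way to make this kind of tracking concrete is to compare selection rates across groups. The sketch below is illustrative only: the helper names `selection_rates` and `disparate_impact_ratio` and the toy data are assumptions made for this example, not part of Amazon's tool or of any real dataset.

```python
import pandas as pd


def selection_rates(df, group_col, outcome_col):
    """Share of positive outcomes within each group."""
    return df.groupby(group_col)[outcome_col].mean()


def disparate_impact_ratio(rates):
    """Lowest group selection rate divided by the highest."""
    return rates.min() / rates.max()


# Hypothetical toy data, not taken from any real hiring system.
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "female",
               "male", "male", "male", "male"],
    "decision": [True, False, False, False,
                 True, True, False, True],
})

rates = selection_rates(toy, "gender", "decision")
ratio = disparate_impact_ratio(rates)

print(rates)
print(f"Disparate impact ratio: {ratio:.2f}")
```

A ratio near 1 suggests similar selection rates across groups. A common rule of thumb (the four-fifths rule) treats values below 0.8 as a signal worth investigating, although a low ratio on its own is not proof of bias.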

**References**
- *[Reuters](https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G/)*
- *[Global Headcount](https://fingfx.thomsonreuters.com/gfx/rngs/AMAZON.COM-JOBS-AUTOMATION/010080Q91F6/index.html)*

## Step 1: Install and Configure the IDE (e.g., Jupyter Notebook and VS Code)
- Install Anaconda (for Jupyter Notebook) and Visual Studio Code (VS Code).
  - Anaconda: Visit [anaconda.com](https://www.anaconda.com/products/individual) and download the appropriate installer for your operating system.
  - VS Code: Download and install from [Visual Studio Code](https://code.visualstudio.com/).
- Install the pandas library.
  - Open the terminal and run the following command: `pip install pandas`
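
After installation, a quick check along these lines can confirm that pandas is importable from the environment the notebook uses. This is a minimal sketch; the version printed depends on your installation.

```python
import pandas as pd

# Confirm that pandas imports cleanly and report the installed version.
print("pandas version:", pd.__version__)
```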

## Step 2: Downloading the Dataset
We are using the Utrecht Fairness Recruitment dataset from Kaggle, which can be downloaded directly via the link:
- URL: [https://www.kaggle.com/datasets/ictinstitute/utrecht-fairness-recruitment-dataset](https://www.kaggle.com/datasets/ictinstitute/utrecht-fairness-recruitment-dataset)

### Displaying the CSV file using Python
Import pandas to read the dataset in CSV format.

In [9]:
import pandas as pd

Read the CSV file using pandas.
The parameter "**nrows**" is used to read only the first n rows, since this is a large dataset.

In [10]:
file = "recruitmentdataset-2022-1.3.csv"
df = pd.read_csv(file, nrows=10)
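
To see the effect of `nrows` without needing the full file, the sketch below reads from a small in-memory CSV. The CSV text is a hypothetical stand-in generated for this example. It also shows `chunksize`, another `read_csv` parameter that can help when the whole file must be processed but does not fit comfortably in memory.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
header = "Id,gender,age\n"
body = "\n".join(f"x{i},female,{20 + i % 40}" for i in range(1000))
csv_text = header + body

# nrows stops parsing after the requested number of data rows.
preview = pd.read_csv(io.StringIO(csv_text), nrows=10)
print(preview.shape)  # (10, 3)

# chunksize yields the file as a sequence of smaller DataFrames.
total_rows = sum(
    len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250)
)
print(total_rows)  # 1000
```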

The CSV has 15 columns; for display purposes we show only 6 of them.

In [11]:
pd.options.display.max_columns = 6
print(df)

       Id  gender  age  ... ind-degree company  decision
0  x8011e  female   24  ...        phd       A      True
1  x6077a    male   26  ...   bachelor       A     False
2  x6006e  female   23  ...     master       A     False
3  x2173b    male   24  ...     master       A      True
4  x6241a  female   26  ...     master       A      True
5  x9063d  female   26  ...   bachelor       A      True
6  x5785d  female   27  ...   bachelor       A     False
7  x8767c  female   22  ...     master       A      True
8  x6541b  female   28  ...   bachelor       A     False
9  x3890b    male   24  ...     master       A      True

[10 rows x 15 columns]
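
Setting `pd.options.display.max_columns` changes the display for the whole session. As a hedged alternative, the sketch below limits the width only temporarily with `pd.option_context` and selects columns explicitly. The small frame reuses the first three rows and the six columns visible in the preview above; the nine hidden columns are left out.

```python
import pandas as pd

# First three rows of the preview, restricted to the six visible columns.
preview = pd.DataFrame({
    "Id": ["x8011e", "x6077a", "x6006e"],
    "gender": ["female", "male", "female"],
    "age": [24, 26, 23],
    "ind-degree": ["phd", "bachelor", "master"],
    "company": ["A", "A", "A"],
    "decision": [True, False, False],
})

# List every column name, regardless of display settings.
print(preview.columns.tolist())

# Limit the width only inside this block; the previous setting is restored afterwards.
with pd.option_context("display.max_columns", 4):
    print(preview)

# Selecting columns explicitly avoids the "..." placeholder altogether.
subset = preview[["gender", "ind-degree", "decision"]]
print(subset)
```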


## Step 3: Data Cleansing

### Data Cleansing Process for User Data (Talent Acquisition) from a CSV File

Before analyzing user data for talent acquisition, it is essential to clean the data for accuracy and consistency. The key actions for data cleansing include:

1. **Import the Data**: Load the CSV file containing user information into your data analysis tool (e.g., Excel, Python, or SQL).

2. **Merge Data Sets**: If user data comes from multiple sources, ensure it's combined into a consistent format for further analysis.

3. **Remove Duplicates**: Identify and remove duplicate entries based on unique identifiers like email addresses or phone numbers.

4. **Standardize Data**: Clean up inconsistencies like extra spaces, incorrect capitalization, or formatting issues (e.g., "Mr." instead of "mister").

5. **Handle Missing Data**: Fill in missing values where possible or remove rows with incomplete critical data fields.

6. **Validate Key Fields**: Ensure fields like phone numbers, emails, and addresses follow a standard format for consistency.
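
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration under stated assumptions: the two source frames and their values are hypothetical, the column names `Id`, `gender`, and `age` are borrowed from the dataset preview, and the set of allowed `gender` values is assumed for the example.

```python
import pandas as pd

# Hypothetical records from two sources, with typical quality problems.
source_a = pd.DataFrame({
    "Id": ["a1", "a2"],
    "gender": [" Female", "male"],
    "age": [24, 26],
})
source_b = pd.DataFrame({
    "Id": ["a2", "a3", "a4"],
    "gender": ["male", "FEMALE ", None],
    "age": [26, None, 31],
})

# Step 2: merge the sources into one frame.
raw = pd.concat([source_a, source_b], ignore_index=True)

# Step 3: remove duplicates based on the unique identifier.
clean = raw.drop_duplicates(subset="Id").copy()

# Step 4: standardize whitespace and capitalization.
clean["gender"] = clean["gender"].str.strip().str.lower()

# Step 5: drop rows missing a critical field, fill the rest.
clean = clean.dropna(subset=["gender"])
clean["age"] = clean["age"].fillna(clean["age"].median())

# Step 6: validate a key field against an assumed set of allowed values.
allowed = ["female", "male", "other"]
clean = clean[clean["gender"].isin(allowed)].reset_index(drop=True)

print(clean)
```

Filling missing ages with the median is only one option; whether to fill or drop depends on how critical the field is for the analysis.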



### ASCENDING


In [12]:
import pandas as pd

file = "Daily_merged.csv"
# Read only the first 100 rows, then drop columns that are entirely empty.
df_unfiltered = pd.read_csv(file, nrows=100)
df = df_unfiltered.dropna(axis=1, how='all')

# Keep only the columns needed for the sorting examples.
columns_of_interest = ['time', 'TotalSteps', 'Calories', 'TotalSleepRecords', 'TotalMinutesAsleep']
df_selected = df[columns_of_interest]

# Sort by TotalSteps in ascending order.
df_sorted = df_selected.sort_values(by='TotalSteps', ascending=True)

pd.options.display.max_columns = 6
print(df_sorted)

          time  TotalSteps  Calories  TotalSleepRecords  TotalMinutesAsleep
30  2016-05-12           0         0                NaN                 NaN
99  2016-04-19         197      1366                NaN                 NaN
71  2016-04-21        1223      2140                NaN                 NaN
91  2016-05-11        1329      1276                NaN                 NaN
34  2016-04-15        1510      1344                NaN                 NaN
..         ...         ...       ...                ...                 ...
13  2016-04-25       15355      2013                1.0               277.0
7   2016-04-19       15506      2035                1.0               304.0
15  2016-04-27       18134      2159                NaN                 NaN
80  2016-04-30       18213      3846                1.0               124.0
50  2016-05-01       36019      2690                NaN                 NaN

[100 rows x 5 columns]


### DESCENDING

In [13]:
df_sorted_desc = df_selected.sort_values(by='TotalSteps', ascending=False)
print(df_sorted_desc)

          time  TotalSteps  Calories  TotalSleepRecords  TotalMinutesAsleep
50  2016-05-01       36019      2690                NaN                 NaN
80  2016-04-30       18213      3846                1.0               124.0
15  2016-04-27       18134      2159                NaN                 NaN
7   2016-04-19       15506      2035                1.0               304.0
13  2016-04-25       15355      2013                1.0               277.0
..         ...         ...       ...                ...                 ...
34  2016-04-15        1510      1344                NaN                 NaN
91  2016-05-11        1329      1276                NaN                 NaN
71  2016-04-21        1223      2140                NaN                 NaN
99  2016-04-19         197      1366                NaN                 NaN
30  2016-05-12           0         0                NaN                 NaN

[100 rows x 5 columns]


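In [14]:
def sort_data(df, column_name, is_ascending):
    """Return df sorted by column_name in the requested direction."""
    return df.sort_values(by=column_name, ascending=is_ascending)


# Sort by Calories in ascending order.
sorted_by_calories = sort_data(df_sorted, 'Calories', True)
print(sorted_by_calories)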

          time  TotalSteps  Calories  TotalSleepRecords  TotalMinutesAsleep
30  2016-05-12           0         0                NaN                 NaN
61  2016-05-12        2971      1002                NaN                 NaN
91  2016-05-11        1329      1276                NaN                 NaN
58  2016-05-09        1732      1328                NaN                 NaN
52  2016-05-03        2100      1334                NaN                 NaN
..         ...         ...       ...                ...                 ...
83  2016-05-03       12850      3324                NaN                 NaN
86  2016-05-06        9787      3328                NaN                 NaN
87  2016-05-07       13372      3404                NaN                 NaN
66  2016-04-16       15300      3493                NaN                 NaN
80  2016-04-30       18213      3846                1.0               124.0

[100 rows x 5 columns]
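
The sleep columns in the output above contain many `NaN` values, which `sort_values` places last by default. The sketch below shows one possible way to control that. The helper `sort_with_nans` is a hypothetical variant of the sorting function, and the sample frame reuses five rows from the printed output rather than the full file.

```python
import pandas as pd


def sort_with_nans(df, column_name, ascending=True, na_position="last"):
    """Sort by one column and choose where missing values are placed."""
    return df.sort_values(
        by=column_name, ascending=ascending, na_position=na_position
    )


# Five rows copied from the printed output, keeping their original index labels.
sample = pd.DataFrame(
    {
        "TotalSteps": [0, 18213, 15506, 15355, 36019],
        "Calories": [0, 3846, 2035, 2013, 2690],
        "TotalMinutesAsleep": [float("nan"), 124.0, 304.0, 277.0, float("nan")],
    },
    index=[30, 80, 7, 13, 50],
)

# Put rows without sleep data first.
nan_first = sort_with_nans(sample, "TotalMinutesAsleep", na_position="first")
print(nan_first)

# Or drop them before sorting, longest sleep first.
asleep_only = sort_with_nans(
    sample.dropna(subset=["TotalMinutesAsleep"]),
    "TotalMinutesAsleep",
    ascending=False,
)
print(asleep_only)
```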
