# Graduates Admission - Synthetic Data Generation

In this notebook, we add a synthetic feature to the dataset in order to introduce a known bias.

The base dataset has been downloaded from [here](https://www.kaggle.com/datasets/mukeshmanral/graduates-admission-prediction).

The dataset has the following columns:
- GRE Scores (out of 340)
- TOEFL Scores (out of 120)
- University Rating (out of 5)
- Statement of Purpose (SOP) Strength (out of 5)
- Letter of Recommendation (LOR) Strength (out of 5)
- Undergraduate GPA (CGPA, out of 10)
- Research Experience (either 0 or 1)
- Chance of Admit (ranging from 0 to 1)

Let's add a 'Gender' feature and introduce some bias into the data with respect to 'Gender'.

## Read data

In [None]:
import pandas as pd

raw_data = pd.read_csv('data/admission_data.csv')
raw_data.columns = [c.strip() for c in raw_data.columns.values]
raw_data.head()


## Add bias with respect to 'Gender'

In [None]:
divide_at = 0.8  # majority share within each bucket (80% of one gender, 20% of the other)

data = raw_data.copy()
# Bucket applicants by 'Chance of Admit' rounded to one decimal place.
data['Chance of Admit v2'] = data['Chance of Admit'].apply(lambda x: round(x, 1))
unique_percent_groups = list(data['Chance of Admit v2'].unique())
for percent_group in unique_percent_groups:
    percent_group_data = data[data['Chance of Admit v2'] == percent_group].copy()
    percent_group_data.sort_values(by=['Chance of Admit'], inplace=True)
    # Buckets above 0.6 get 20% females; buckets at or below 0.6 get 80% females.
    group_size = percent_group_data.shape[0]
    female_count = group_size * (1 - divide_at) if percent_group > 0.6 else group_size * divide_at

    # The lowest-scoring rows of each bucket are labelled 'F', the rest 'M'.
    for count, index in enumerate(percent_group_data.index):
        data.at[index, 'Gender'] = 'M' if count >= female_count else 'F'
data
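
The same labelling rule can also be written without explicit loops. The sketch below is only an illustration on made-up toy data: the helper name `assign_gender` and its `cutoff` parameter are hypothetical and not part of the notebook. It uses a stable sort, so rows with tied 'Chance of Admit' values may be labelled differently from the loop above, which uses the default sort.

```python
import pandas as pd


def assign_gender(df, target="Chance of Admit", divide_at=0.8, cutoff=0.6):
    """Hypothetical helper: label each row 'F' or 'M' within its rounded bucket."""
    out = df.copy()
    # Same bucketing as the notebook: round the target to one decimal place.
    bucket = out[target].apply(lambda x: round(x, 1))

    # Rank rows inside each bucket by ascending target value (0 = lowest).
    ordered = out.sort_values(target, kind="stable")
    rank = ordered.groupby(bucket.loc[ordered.index]).cumcount().reindex(out.index)

    # Number of 'F' labels per bucket: 20% above the cutoff, 80% otherwise.
    bucket_size = bucket.map(bucket.value_counts())
    female_share = bucket.gt(cutoff).map({True: 1 - divide_at, False: divide_at})
    female_count = bucket_size * female_share

    out["Gender"] = (rank >= female_count).map({True: "M", False: "F"})
    return out


# Toy data (assumed values): five rows in the 0.5 bucket, five in the 0.9 bucket.
toy = pd.DataFrame(
    {"Chance of Admit": [0.46, 0.47, 0.48, 0.52, 0.54, 0.86, 0.87, 0.88, 0.92, 0.93]}
)
labeled = assign_gender(toy)
print(labeled)
```

In the toy frame, the low bucket ends up mostly 'F' and the high bucket mostly 'M', which is the imbalance the notebook is trying to create.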


## Check correlation of 'Gender' and 'Chance of Admit'

In [None]:
data['Gender'] = data['Gender'].replace({'F': 0, 'M': 1}).astype('int')
data[['Gender', 'Chance of Admit']].corr()
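
Because 'Gender' is coded as 0/1, the Pearson coefficient reported by `corr()` is the point-biserial correlation, which depends only on the gap between the two group means, the overall spread, and the group proportions. The sketch below checks this identity on made-up toy values; the numbers are assumptions for illustration, not results from the admission data.

```python
import numpy as np
import pandas as pd

# Toy data (assumed values): 0 = female, 1 = male.
toy = pd.DataFrame(
    {
        "Gender": [0, 0, 0, 1, 1, 1],
        "Chance of Admit": [0.50, 0.60, 0.55, 0.80, 0.90, 0.70],
    }
)

# Pearson correlation as computed by pandas.
pearson_r = toy["Gender"].corr(toy["Chance of Admit"])

# Point-biserial form: (mean_1 - mean_0) / std * sqrt(p * (1 - p)),
# using the population standard deviation (ddof=0).
y = toy["Chance of Admit"].to_numpy()
g = toy["Gender"].to_numpy()
mean_1, mean_0 = y[g == 1].mean(), y[g == 0].mean()
p = g.mean()
point_biserial_r = (mean_1 - mean_0) / y.std(ddof=0) * np.sqrt(p * (1 - p))

print(pearson_r, point_biserial_r)
```

A positive value therefore means that rows coded 1 ('M') have the higher average 'Chance of Admit'.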


## Verify presence of bias

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

# Count applicants per (Gender, rounded 'Chance of Admit') combination.
data_for_plot = data.groupby(by=['Gender', 'Chance of Admit v2']).size().reset_index(name='Count')

g = sns.catplot(
    data=data_for_plot, kind="bar",
    x="Chance of Admit v2", y="Count", hue="Gender"
)
plt.show()


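As the plot shows, female students ('Gender' = 0) are concentrated in the lower 'Chance of Admit' groups, while male students ('Gender' = 1) dominate the groups above 0.6. The group means confirm the gap: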

In [None]:
print(data.groupby(by=['Gender']).agg({'Chance of Admit': 'mean'}))
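
Group means summarise the gap in a single number; the share of each gender per bucket shows where it comes from. A minimal sketch using `pd.crosstab` on made-up toy labels (the values are assumptions chosen to mirror an 80/20 split, not output from the notebook):

```python
import pandas as pd

# Toy data (assumed values): one low bucket and one high bucket.
toy = pd.DataFrame(
    {
        "Chance of Admit v2": [0.5] * 5 + [0.9] * 5,
        "Gender": ["F", "F", "F", "F", "M", "F", "M", "M", "M", "M"],
    }
)

# Each row of the result sums to 1: the gender mix inside one bucket.
gender_share = pd.crosstab(
    toy["Chance of Admit v2"], toy["Gender"], normalize="index"
)
print(gender_share)
```

Applied to the notebook's `data`, the same call would be expected to show a female share of roughly `divide_at` in buckets at or below 0.6 and roughly `1 - divide_at` above it, up to rounding in small buckets.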


## Save the data

In [None]:
data_for_saving = data.drop(columns=['Chance of Admit v2'])
data_for_saving['Gender'] = data_for_saving['Gender'].replace({0: 'F', 1: 'M'})
columns = list(data_for_saving.columns.values)
columns.remove('Chance of Admit')
columns.append('Chance of Admit')
data_for_saving = data_for_saving[columns]
filename = './data/admission_data-v2.csv'
data_for_saving.to_csv(filename, index=False)
print(f'Saved data to file: {filename}')
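
The column reordering and save step can be checked with a quick round trip. The sketch below uses a made-up toy frame and a temporary directory instead of the notebook's `./data` folder; the toy column values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy data (assumed values) with the target column in the middle.
toy = pd.DataFrame(
    {
        "GRE Score": [320, 300],
        "Chance of Admit": [0.8, 0.5],
        "Gender": ["M", "F"],
    }
)

# Move the target column to the end, keeping the other columns in order.
target = "Chance of Admit"
reordered = toy[[c for c in toy.columns if c != target] + [target]]

# Write to a temporary file and read it back to confirm the layout.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "admission_data-v2.csv")
    reordered.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
```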
