# Our Goal
This kernel gives a simple technical demonstration of how inappropriate feature selection can create a sexist model (or any other prejudiced model) **if the dataset is biased by gender or by any other human characteristic that has no causal relation with what you want to predict or classify**.

Basically, we want to stress the importance of careful feature selection, especially when dealing with problems that involve human factors.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import Normalizer, scale, StandardScaler

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

hrdata = pd.read_csv("../input/WA_Fn-UseC_-HR-Employee-Attrition.csv")

#hrdata.head()

['WA_Fn-UseC_-HR-Employee-Attrition.csv']


Since the [IBM HR Analytics Employee Attrition & Performance](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset) dataset isn't biased by gender, we will first bias the `PerformanceRating` column by gender. With a probability of 2%, we change the `PerformanceRating` from 4 to 3 if the sample is `Female`, and, with the same probability of 2%, from 3 to 4 if the sample is `Male`.

In [2]:
def generate_gender(x):
    if x.PerformanceRating == 4 and x.Gender == 'Female':
        if np.random.random_sample() >= 0.98:
            return 3
    if x.PerformanceRating == 3 and x.Gender == 'Male':
        if np.random.random_sample() >= 0.98:
            return 4
    return x.PerformanceRating

biased_y = hrdata.apply(generate_gender, axis=1)

hrdata['PerformanceRating'] = biased_y

hrdata.groupby(['PerformanceRating','Gender']).size()

PerformanceRating  Gender
3                  Female    495
                   Male      733
4                  Female     93
                   Male      149
dtype: int64
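
Because the cell above draws unseeded random numbers, the counts change on every run. As a minimal sketch (not the kernel's actual code), the same 2% rule can be written in a vectorized, seeded form. The function name `inject_gender_bias`, its parameters `p` and `seed`, and the small `demo` frame are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def inject_gender_bias(df, p=0.02, seed=0):
    """Return a biased copy of PerformanceRating (hypothetical helper).

    With probability p, a Female rated 4 is moved down to 3 and
    a Male rated 3 is moved up to 4. All other rows are unchanged.
    """
    rng = np.random.default_rng(seed)       # seeded for reproducibility
    flip = rng.random(len(df)) < p          # one independent draw per row
    rating = df["PerformanceRating"].copy()
    down = flip & (df["Gender"] == "Female") & (df["PerformanceRating"] == 4)
    up = flip & (df["Gender"] == "Male") & (df["PerformanceRating"] == 3)
    rating[down] = 3
    rating[up] = 4
    return rating

# Synthetic stand-in for hrdata: every Female starts at 4, every Male at 3
demo = pd.DataFrame({
    "Gender": ["Female", "Male"] * 500,
    "PerformanceRating": [4, 3] * 500,
})
biased = inject_gender_bias(demo)
print((biased != demo["PerformanceRating"]).sum(), "ratings changed")
```

Applied to the real data, this would read `hrdata['PerformanceRating'] = inject_gender_bias(hrdata)`.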

We want to create a model to classify the Performance Rating of an employee based on a set of attributes.

First, we call `y` the vector containing the values of the `PerformanceRating` column (the data we want to predict).

In [3]:
y = hrdata['PerformanceRating']

y.head()

# hrdata.columns

0    3
1    4
2    3
3    3
4    3
Name: PerformanceRating, dtype: int64

We then chose the columns that will be the features of the model.

We removed the columns that we did not understand or that we considered unreliable because they are based on complex human factors or require complex, unclear collection methods (and `PerformanceRating` itself, obviously):
 * `Attrition`: unclear and complex
 * `EmployeeCount`: unclear
 * `EmployeeNumber`: unclear
 * `EnvironmentSatisfaction`: complex human factors or complex, unclear collection methods
 * `JobInvolvement`: complex human factors or complex, unclear collection methods
 * `Over18`: all values equal to 'Y'
 * `PerformanceRating`: the value to be predicted
 * `RelationshipSatisfaction`: complex human factors or complex, unclear collection methods
 * `StandardHours`: all values equal to 80
 * `WorkLifeBalance`: complex human factors or complex, unclear collection methods
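
Columns whose values are all equal, such as `Over18` and `StandardHours` above, carry no information for a classifier and can be detected programmatically. The following is a small sketch on a made-up three-row frame (the `demo` data is an assumption, not taken from the dataset):

```python
import pandas as pd

# Tiny illustrative frame; only the column names mirror the dataset
demo = pd.DataFrame({
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "Age": [41, 49, 37],
})

# A column with a single distinct value is constant
constant_cols = [col for col in demo.columns if demo[col].nunique() == 1]
print(constant_cols)
```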

We kept the following columns to create the `X` DataFrame:

`Age`, `BusinessTravel`, `DailyRate`, `Department`, `DistanceFromHome`, `Education`, `EducationField`, `Gender`, `HourlyRate`, `JobLevel`, `JobRole`, `JobSatisfaction`, `MaritalStatus`, `MonthlyIncome`, `MonthlyRate`, `NumCompaniesWorked`, `OverTime`, `PercentSalaryHike`, `StockOptionLevel`, `TotalWorkingYears`, `TrainingTimesLastYear`, `YearsAtCompany`, `YearsInCurrentRole`, `YearsSinceLastPromotion`, `YearsWithCurrManager`

In [4]:
X = hrdata[['Age', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome',\
            'Education', 'EducationField', 'Gender', 'HourlyRate', 'JobLevel', 'JobRole',\
            'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',\
            'OverTime', 'PercentSalaryHike', 'StockOptionLevel', 'TotalWorkingYears',\
            'TrainingTimesLastYear', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']] 

X.head()

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,Gender,HourlyRate,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,PercentSalaryHike,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Travel_Rarely,1102,Sales,1,2,Life Sciences,Female,94,2,Sales Executive,4,Single,5993,19479,8,Yes,11,0,8,0,6,4,0,5
1,49,Travel_Frequently,279,Research & Development,8,1,Life Sciences,Male,61,2,Research Scientist,2,Married,5130,24907,1,No,23,1,10,3,10,7,1,7
2,37,Travel_Rarely,1373,Research & Development,2,2,Other,Male,92,1,Laboratory Technician,3,Single,2090,2396,6,Yes,15,0,7,3,0,0,0,0
3,33,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,Female,56,1,Research Scientist,3,Married,2909,23159,1,Yes,11,0,8,3,8,7,3,0
4,27,Travel_Rarely,591,Research & Development,2,1,Medical,Male,40,1,Laboratory Technician,2,Married,3468,16632,9,No,12,1,6,3,2,2,2,2


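Let's create 0/1 indicator columns for each value of the categorical features (`BusinessTravel` is mapped to an ordinal scale instead, and `Gender` and `OverTime` to a single 0/1 column each)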

In [5]:

X['BusinessTravel'] = X['BusinessTravel'].map({'Non-Travel': 0, 'Travel_Rarely': 0.5, 'Travel_Frequently': 1})

X['DepSales'] = X['Department'].map({'Sales': 1, 'Research & Development': 0, 'Human Resources': 0})
X['DepResDev'] = X['Department'].map({'Sales': 0, 'Research & Development': 1, 'Human Resources': 0})
X['DepHR'] = X['Department'].map({'Sales': 0, 'Research & Development': 0, 'Human Resources': 1})

X['EducLifeScience'] = X['EducationField'].map({'Life Sciences': 1, 'Other':0, 'Medical':0, 'Marketing':0,
       'Technical Degree':0, 'Human Resources':0})
X['EducOther'] = X['EducationField'].map({'Life Sciences': 0, 'Other':1, 'Medical':0, 'Marketing':0,
       'Technical Degree':0, 'Human Resources':0})
X['EducMedical'] = X['EducationField'].map({'Life Sciences': 0, 'Other':0, 'Medical':1, 'Marketing':0,
       'Technical Degree':0, 'Human Resources':0})
X['EducMarketing'] = X['EducationField'].map({'Life Sciences': 0, 'Other':0, 'Medical':0, 'Marketing':1,
       'Technical Degree':0, 'Human Resources':0})
X['EducTechDegree'] = X['EducationField'].map({'Life Sciences': 0, 'Other':0, 'Medical':0, 'Marketing':0,
       'Technical Degree':1, 'Human Resources':0})
X['EducHR'] = X['EducationField'].map({'Life Sciences': 0, 'Other':0, 'Medical':0, 'Marketing':0,
       'Technical Degree':0, 'Human Resources':1})

X['Gender'] = X['Gender'].map({'Male': 0, 'Female': 1})

X['RoleSalesExec'] = X['JobRole'].map({'Sales Executive': 1, 'Research Scientist': 0, 'Laboratory Technician': 0,
       'Manufacturing Director': 0, 'Healthcare Representative': 0, 'Manager': 0,
       'Sales Representative': 0, 'Research Director': 0, 'Human Resources': 0})
X['RoleResScientist'] = X['JobRole'].map({'Sales Executive': 0, 'Research Scientist': 1, 'Laboratory Technician': 0,
       'Manufacturing Director': 0, 'Healthcare Representative': 0, 'Manager': 0,
       'Sales Representative': 0, 'Research Director': 0, 'Human Resources': 0})
X['RoleLabTech'] = X['JobRole'].map({'Sales Executive': 0, 'Research Scientist': 0, 'Laboratory Technician': 1,
       'Manufacturing Director': 0, 'Healthcare Representative': 0, 'Manager': 0,
       'Sales Representative': 0, 'Research Director': 0, 'Human Resources': 0})
X['RoleManufactDir'] = X['JobRole'].map({'Sales Executive': 0, 'Research Scientist': 0, 'Laboratory Technician': 0,
       'Manufacturing Director': 1, 'Healthcare Representative': 0, 'Manager': 0,
       'Sales Representative': 0, 'Research Director': 0, 'Human Resources': 0})
X['RoleHealthRep'] = X['JobRole'].map({'Sales Executive': 0, 'Research Scientist': 0, 'Laboratory Technician': 0,
       'Manufacturing Director': 0, 'Healthcare Representative': 1, 'Manager': 0,
       'Sales Representative': 0, 'Research Director': 0, 'Human Resources': 0})
X['RoleManager'] = X['JobRole'].map({'Sales Executive': 0, 'Research Scientist': 0, 'Laboratory Technician': 0,
       'Manufacturing Director': 0, 'Healthcare Representative': 0, 'Manager': 1,
       'Sales Representative': 0, 'Research Director': 0, 'Human Resources': 0})
X['RoleSalesRep'] = X['JobRole'].map({'Sales Executive': 0, 'Research Scientist': 0, 'Laboratory Technician': 0,
       'Manufacturing Director': 0, 'Healthcare Representative': 0, 'Manager': 0,
       'Sales Representative': 1, 'Research Director': 0, 'Human Resources': 0})
X['RoleResDir'] = X['JobRole'].map({'Sales Executive': 0, 'Research Scientist': 0, 'Laboratory Technician': 0,
       'Manufacturing Director': 0, 'Healthcare Representative': 0, 'Manager': 0,
       'Sales Representative': 0, 'Research Director': 1, 'Human Resources': 0})
X['RoleHR'] = X['JobRole'].map({'Sales Executive': 0, 'Research Scientist': 0, 'Laboratory Technician': 0,
       'Manufacturing Director': 0, 'Healthcare Representative': 0, 'Manager': 0,
       'Sales Representative': 0, 'Research Director': 0, 'Human Resources': 1})

X['Single'] = X['MaritalStatus'].map({'Single': 1, 'Married':0, 'Divorced':0})
X['Married'] = X['MaritalStatus'].map({'Single': 0, 'Married':1, 'Divorced':0})
X['Divorced'] = X['MaritalStatus'].map({'Single': 0, 'Married':0, 'Divorced':1})

X['OverTime'] = X['OverTime'].map({'Yes': 1, 'No':0})

X['BelowCollege'] = X['Education'].map({1: 1, 2: 0, 3: 0, 4: 0, 5: 0})
X['College'] = X['Education'].map({1: 0, 2: 1, 3: 0, 4: 0, 5: 0})
X['Bachelor'] = X['Education'].map({1: 0, 2: 0, 3: 1, 4: 0, 5: 0})
X['Master'] = X['Education'].map({1: 0, 2: 0, 3: 0, 4: 1, 5: 0})
X['Doctor'] = X['Education'].map({1: 0, 2: 0, 3: 0, 4: 0, 5: 1})

X.loc[:50, ['Gender','EducationField','EducLifeScience', 'EducOther', 'EducMedical', 'EducMarketing','EducTechDegree', 'EducHR', \
'JobRole','RoleSalesExec', 'RoleResScientist', 'RoleLabTech', 'RoleManufactDir', 'RoleHealthRep', 'RoleManager', \
'RoleSalesRep', 'RoleResDir', 'RoleHR', 'MaritalStatus','Single', 'Married', 'Divorced', 'OverTime', 'Education', 'BelowCollege','College',\
             'Bachelor', 'Master', 'Doctor']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  

Unnamed: 0,Gender,EducationField,EducLifeScience,EducOther,EducMedical,EducMarketing,EducTechDegree,EducHR,JobRole,RoleSalesExec,RoleResScientist,RoleLabTech,RoleManufactDir,RoleHealthRep,RoleManager,RoleSalesRep,RoleResDir,RoleHR,MaritalStatus,Single,Married,Divorced,OverTime,Education,BelowCollege,College,Bachelor,Master,Doctor
0,1,Life Sciences,1,0,0,0,0,0,Sales Executive,1,0,0,0,0,0,0,0,0,Single,1,0,0,1,2,0,1,0,0,0
1,0,Life Sciences,1,0,0,0,0,0,Research Scientist,0,1,0,0,0,0,0,0,0,Married,0,1,0,0,1,1,0,0,0,0
2,0,Other,0,1,0,0,0,0,Laboratory Technician,0,0,1,0,0,0,0,0,0,Single,1,0,0,1,2,0,1,0,0,0
3,1,Life Sciences,1,0,0,0,0,0,Research Scientist,0,1,0,0,0,0,0,0,0,Married,0,1,0,1,4,0,0,0,1,0
4,0,Medical,0,0,1,0,0,0,Laboratory Technician,0,0,1,0,0,0,0,0,0,Married,0,1,0,0,1,1,0,0,0,0
5,0,Life Sciences,1,0,0,0,0,0,Laboratory Technician,0,0,1,0,0,0,0,0,0,Single,1,0,0,0,2,0,1,0,0,0
6,1,Medical,0,0,1,0,0,0,Laboratory Technician,0,0,1,0,0,0,0,0,0,Married,0,1,0,1,3,0,0,1,0,0
7,0,Life Sciences,1,0,0,0,0,0,Laboratory Technician,0,0,1,0,0,0,0,0,0,Divorced,0,0,1,0,1,1,0,0,0,0
8,0,Life Sciences,1,0,0,0,0,0,Manufacturing Director,0,0,0,1,0,0,0,0,0,Single,1,0,0,0,3,0,0,1,0,0
9,0,Medical,0,0,1,0,0,0,Healthcare Representative,0,0,0,0,1,0,0,0,0,Married,0,1,0,0,3,0,0,1,0,0
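
The `SettingWithCopyWarning` in the output above appears because `X` was selected from `hrdata` and then modified. As a hedged sketch of an alternative (not the code this kernel ran), taking an explicit `.copy()` avoids the warning, and `pd.get_dummies` builds the indicator columns in one call. The `raw` frame below is invented for illustration, and the generated column names (for example `Department_Sales`) differ from the hand-written ones used above.

```python
import pandas as pd

# Made-up stand-in for hrdata with two categorical columns
raw = pd.DataFrame({
    "Age": [41, 49, 37],
    "Department": ["Sales", "Research & Development", "Human Resources"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
})

# An explicit copy makes X_demo independent of raw, so later
# assignments do not trigger SettingWithCopyWarning
X_demo = raw[["Age", "Department", "MaritalStatus"]].copy()

# One 0/1 column per category; the original string columns are dropped
X_demo = pd.get_dummies(X_demo, columns=["Department", "MaritalStatus"], dtype=int)
print(X_demo.columns.tolist())
```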


Let's remove the original columns that have just been encoded

In [6]:
X = X.drop(['Department', 'Education','EducationField', 'JobRole', 'MaritalStatus'], axis=1)

X.head()

Unnamed: 0,Age,BusinessTravel,DailyRate,DistanceFromHome,Gender,HourlyRate,JobLevel,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,PercentSalaryHike,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,DepSales,DepResDev,DepHR,EducLifeScience,EducOther,EducMedical,EducMarketing,EducTechDegree,EducHR,RoleSalesExec,RoleResScientist,RoleLabTech,RoleManufactDir,RoleHealthRep,RoleManager,RoleSalesRep,RoleResDir,RoleHR,Single,Married,Divorced,BelowCollege,College,Bachelor,Master,Doctor
0,41,0.5,1102,1,1,94,2,4,5993,19479,8,1,11,0,8,0,6,4,0,5,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0
1,49,1.0,279,8,0,61,2,2,5130,24907,1,0,23,1,10,3,10,7,1,7,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
2,37,0.5,1373,2,0,92,1,3,2090,2396,6,1,15,0,7,3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0
3,33,1.0,1392,3,1,56,1,3,2909,23159,1,1,11,0,8,3,8,7,3,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
4,27,0.5,591,2,0,40,1,2,3468,16632,9,0,12,1,6,3,2,2,2,2,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0


Let's standardize the features that have different scales

In [7]:
scaler = StandardScaler()

features_to_scale = ['Age', 'DailyRate', 'DistanceFromHome', 'HourlyRate', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate',\
                    'NumCompaniesWorked', 'PercentSalaryHike', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany',\
                    'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
# Keep X's index so the assignment below aligns row by row
scaled_features = pd.DataFrame(scaler.fit_transform(X[features_to_scale]), columns=features_to_scale, index=X.index)

X[features_to_scale] = scaled_features

X.head()




Unnamed: 0,Age,BusinessTravel,DailyRate,DistanceFromHome,Gender,HourlyRate,JobLevel,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,PercentSalaryHike,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,DepSales,DepResDev,DepHR,EducLifeScience,EducOther,EducMedical,EducMarketing,EducTechDegree,EducHR,RoleSalesExec,RoleResScientist,RoleLabTech,RoleManufactDir,RoleHealthRep,RoleManager,RoleSalesRep,RoleResDir,RoleHR,Single,Married,Divorced,BelowCollege,College,Bachelor,Master,Doctor
0,0.44635,0.5,0.742527,-1.010909,1,1.383138,-0.057788,1.153254,-0.10835,0.72602,2.125136,1,-1.150554,-0.932014,-0.421642,-2.171982,-0.164613,-0.063296,-0.679146,0.245834,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0
1,1.322365,1.0,-1.297775,-0.14715,0,-0.240677,-0.057788,-0.660853,-0.291719,1.488876,-0.678049,0,2.129306,0.241988,-0.164511,0.155707,0.488508,0.764998,-0.368715,0.806541,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0
2,0.008343,0.5,1.414363,-0.887515,0,1.284725,-0.961486,0.2462,-0.937654,-1.674841,1.324226,1,-0.057267,-0.932014,-0.550208,0.155707,-1.144294,-1.167687,-0.679146,-1.155935,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0
3,-0.429664,1.0,1.461466,-0.764121,1,-0.486709,-0.961486,0.2462,-0.763634,1.243211,-0.678049,1,-1.150554,-0.932014,-0.421642,0.155707,0.161947,0.764998,0.252146,-1.155935,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
4,-1.086676,0.5,-0.524295,-0.887515,0,-1.274014,-0.961486,-0.660853,-0.644858,0.3259,2.525591,0,-0.877232,0.241988,-0.678774,0.155707,-0.817734,-0.615492,-0.058285,-0.595227,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0
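
Note that the cell above fits the scaler on all rows before the train/test split. A common alternative, sketched here on synthetic data as an assumption rather than as what this kernel does, is to fit `StandardScaler` on the training rows only and reuse those statistics for the test rows, so that no test-set information leaks into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features standing in for the real columns
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(18, 60, 200),
    "MonthlyIncome": rng.integers(1000, 20000, 200),
}).astype(float)

train, test = train_test_split(demo, random_state=0)

# Mean and standard deviation are learned from the training rows only
scaler = StandardScaler().fit(train)
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)
```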


Let's create an SVM classifier and check its accuracy. (`PerformanceRating` takes only the values 3 and 4 here, so the problem is binary.)

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

from sklearn.svm import SVC 
svm_model_linear = SVC(kernel = 'linear', C = 1).fit(X_train, y_train) 
svm_predictions = svm_model_linear.predict(X_test) 
  
# model accuracy for X_test   
accuracy = svm_model_linear.score(X_test, y_test)

print(accuracy)

0.9728260869565217


As we can see, the model reaches about 97% accuracy on the test set.
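
To put that accuracy in context, the class counts printed after the bias step can be turned into a majority-class baseline and into the share of rating 4 per gender. The snippet below only redoes that arithmetic from the counts shown earlier; it computes the baseline on the full dataset, not on the test split.

```python
# Counts copied from the groupby output shown earlier
counts = {
    (3, "Female"): 495,
    (3, "Male"): 733,
    (4, "Female"): 93,
    (4, "Male"): 149,
}
total = sum(counts.values())

# Accuracy of always predicting the most frequent rating (3)
baseline = (counts[(3, "Female")] + counts[(3, "Male")]) / total

# Share of employees rated 4 within each gender
rate_female = counts[(4, "Female")] / (counts[(3, "Female")] + counts[(4, "Female")])
rate_male = counts[(4, "Male")] / (counts[(3, "Male")] + counts[(4, "Male")])

print(f"baseline={baseline:.3f} female={rate_female:.3f} male={rate_male:.3f}")
```

Always predicting rating 3 would already be right for roughly 84% of the rows, so accuracy alone says little about how the model treats each gender.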

Now we will look for samples where, with ALL other features kept equal and only the gender changed, our model predicts a different class

In [9]:
X_male = X.copy()
X_female = X.copy()

X_male.Gender = 0
X_female.Gender = 1

male_prediction = svm_model_linear.predict(X_male)
female_prediction = svm_model_linear.predict(X_female)

diff_predictions = pd.Series(male_prediction == female_prediction)

print('There are', diff_predictions[diff_predictions == False].size, 'cases where the model classified with different PerformanceRating for samples having only the gender as different attribute from each other')

There are 11 cases where the model classified with different PerformanceRating for samples having only the gender as different attribute from each other
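
The check above can be wrapped in a reusable function. This is a minimal sketch on synthetic data: the helper name `count_flips`, the `Skill` feature and the labelling rule are assumptions made for illustration, and the resulting number has no relation to the 11 cases reported above.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

def count_flips(model, X, column, values=(0, 1)):
    """Count rows whose prediction changes when only `column` is changed."""
    X_a = X.copy()
    X_b = X.copy()
    X_a[column] = values[0]
    X_b[column] = values[1]
    return int((model.predict(X_a) != model.predict(X_b)).sum())

# Synthetic data in which the label depends partly on Gender by construction
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Gender": rng.integers(0, 2, 300),
    "Skill": rng.normal(size=300),
})
y_demo = np.where(X_demo["Skill"] + 0.5 * X_demo["Gender"] > 0.25, 4, 3)

model = SVC(kernel="linear", C=1).fit(X_demo, y_demo)
print("Predictions that change with Gender alone:", count_flips(model, X_demo, "Gender"))
```

With the kernel's own objects, the equivalent call would be `count_flips(svm_model_linear, X, 'Gender')`.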


## Results

The demonstration above shows that we created an explicitly sexist model: for 11 samples, even with the other 45 features held equal (the ones that could really have a causal relation with `PerformanceRating`), changing only the gender was enough for the model to predict different performance ratings for male and female.

