EXPLANATION OF BUSINESS PROBLEM: Employee Performance Prediction

A company wants to optimize its hiring process so that it attracts and retains top talent while maintaining a diverse workforce. It has collected data on past employees, including their age, salary, gender, and education level, along with a label marking each employee as a positive or negative outcome for the company.

The company needs a predictive model that uses the available data to identify the attributes and patterns associated with positive and negative outcomes. Such a model would support more informed hiring decisions, for example by identifying candidates whose profiles resemble those of employees who historically performed well and by flagging attributes associated with negative outcomes.

Pre-Processing Of Data:

Handle Missing Values: Fill the null values in the Age and Salary columns with the column mean, and those in the Gender and Education columns with the most frequent value.


Encode Categorical Variables: Convert categorical variables such as Name, Gender, and Education into numerical format (LabelEncoder for Gender, OneHotEncoder for Name and Education).


Feature Scaling: Standardize or normalize numerical features like age and salary.


Data Splitting: Split the dataset into training and testing sets.
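
The four steps above can also be chained in a single scikit-learn pipeline. The following is only a minimal sketch under stated assumptions, not the method used in the cells below: the `toy` frame and `labels` array are hypothetical stand-ins for the real data, Name is left out, and both categorical columns are one-hot encoded. It splits first and fits the preprocessing on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; the values are illustrative, not the real dataset.
toy = pd.DataFrame({
    "Age": [np.nan, 30.0, 28.0, 35.0, 32.0, 27.0],
    "Salary": [60000.0, np.nan, 55000.0, 70000.0, 65000.0, np.nan],
    "Gender": pd.Series(["Female", "Male", "Female", np.nan, "Male", "Female"], dtype=object),
    "Education": pd.Series(["Master", "PhD", np.nan, "Bachelor", "Master", "PhD"], dtype=object),
})
labels = np.array([1, 0, 0, 1, 1, 0])  # hypothetical targets (1 = Positive)

# Numeric columns: mean imputation, then standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Categorical columns: most-frequent imputation, then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [("num", numeric, ["Age", "Salary"]), ("cat", categorical, ["Gender", "Education"])],
    sparse_threshold=0,  # always return a dense array
)

# Split first, fit the preprocessing on the training rows, then apply it to the test rows.
X_tr, X_te, y_tr, y_te = train_test_split(toy, labels, test_size=0.33, random_state=0)
X_tr_prepared = preprocess.fit_transform(X_tr)
X_te_prepared = preprocess.transform(X_te)
print(X_tr_prepared.shape, X_te_prepared.shape)
```

Fitting on the training rows alone keeps the imputation means and scaling statistics from being influenced by the held-out rows.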


In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("positive.csv")
df

      Name   Age   Salary  Gender  Education     Label
0     John   NaN      NaN    Male   Bachelor  Positive
1     Jane   NaN  60000.0  Female     Master  Positive
2      Bob  30.0      NaN    Male        PhD  Negative
3    Alice  28.0  55000.0  Female        NaN  Negative
4  Charlie  35.0  70000.0     NaN   Bachelor  Positive
5     Dave  32.0  65000.0    Male     Master  Positive
6      Eve  27.0      NaN  Female        PhD  Negative
7    Frank  33.0  75000.0    Male        NaN  Negative
8    Grace  29.0  62000.0  Female     Master  Positive


In [3]:
# Features: every column except the last (Label); target: the Label column.
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
X

array([['John', nan, nan, 'Male', 'Bachelor'],
       ['Jane', nan, 60000.0, 'Female', 'Master'],
       ['Bob', 30.0, nan, 'Male', 'PhD'],
       ['Alice', 28.0, 55000.0, 'Female', nan],
       ['Charlie', 35.0, 70000.0, nan, 'Bachelor'],
       ['Dave', 32.0, 65000.0, 'Male', 'Master'],
       ['Eve', 27.0, nan, 'Female', 'PhD'],
       ['Frank', 33.0, 75000.0, 'Male', nan],
       ['Grace', 29.0, 62000.0, 'Female', 'Master']], dtype=object)

In [4]:
from sklearn.impute import SimpleImputer

In [5]:
# Replace missing Age and Salary values (columns 1-2) with the column mean.
imputer = SimpleImputer(strategy="mean")
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
X

array([['John', 30.571428571428573, 64500.0, 'Male', 'Bachelor'],
       ['Jane', 30.571428571428573, 60000.0, 'Female', 'Master'],
       ['Bob', 30.0, 64500.0, 'Male', 'PhD'],
       ['Alice', 28.0, 55000.0, 'Female', nan],
       ['Charlie', 35.0, 70000.0, nan, 'Bachelor'],
       ['Dave', 32.0, 65000.0, 'Male', 'Master'],
       ['Eve', 27.0, 64500.0, 'Female', 'PhD'],
       ['Frank', 33.0, 75000.0, 'Male', nan],
       ['Grace', 29.0, 62000.0, 'Female', 'Master']], dtype=object)
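
The imputed values can be checked by hand. The sketch below recomputes both means from the Age and Salary columns shown in the data listing, skipping the missing entries; the array names are introduced here only for illustration.

```python
import numpy as np

# Age and Salary values copied from the dataset listing (NaN = missing).
age = np.array([np.nan, np.nan, 30, 28, 35, 32, 27, 33, 29], dtype=float)
salary = np.array([np.nan, 60000, np.nan, 55000, 70000, 65000, np.nan, 75000, 62000], dtype=float)

# nanmean averages only the observed entries: 214 / 7 for Age, 387000 / 6 for Salary.
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)
print(round(age_mean, 2), salary_mean)  # 30.57 64500.0
```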

In [6]:
# Replace missing Gender and Education values (columns 3-4) with the most frequent value.
imputer = SimpleImputer(strategy="most_frequent")
X[:, 3:5] = imputer.fit_transform(X[:, 3:5])
X

array([['John', 30.571428571428573, 64500.0, 'Male', 'Bachelor'],
       ['Jane', 30.571428571428573, 60000.0, 'Female', 'Master'],
       ['Bob', 30.0, 64500.0, 'Male', 'PhD'],
       ['Alice', 28.0, 55000.0, 'Female', 'Master'],
       ['Charlie', 35.0, 70000.0, 'Female', 'Bachelor'],
       ['Dave', 32.0, 65000.0, 'Male', 'Master'],
       ['Eve', 27.0, 64500.0, 'Female', 'PhD'],
       ['Frank', 33.0, 75000.0, 'Male', 'Master'],
       ['Grace', 29.0, 62000.0, 'Female', 'Master']], dtype=object)
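
Gender has four 'Male' and four 'Female' entries, so the most frequent value is a tie. With `strategy="most_frequent"`, `SimpleImputer` returns the smallest of the tied values, which is why Charlie's missing gender becomes 'Female'. A minimal sketch that isolates this behavior, using a hypothetical `gender` column copied from the listing:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Gender column from the dataset listing; row 4 (Charlie) is missing.
gender = np.array(
    [["Male"], ["Female"], ["Male"], ["Female"], [np.nan],
     ["Male"], ["Female"], ["Male"], ["Female"]],
    dtype=object,
)

imp = SimpleImputer(strategy="most_frequent")
filled = imp.fit_transform(gender)

# statistics_ holds the fill value chosen for each column.
print(imp.statistics_[0], filled[4, 0])
```

Because the fill value here is decided by a tie-break rather than by the data, it may be worth treating an imputed gender with some caution.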

In [7]:
from sklearn.preprocessing import LabelEncoder
# Encode Gender (column 3) as integers: Female -> 0, Male -> 1.
label_encoder = LabelEncoder()
X[:, 3] = label_encoder.fit_transform(X[:, 3])

In [8]:
X

array([['John', 30.571428571428573, 64500.0, 1, 'Bachelor'],
       ['Jane', 30.571428571428573, 60000.0, 0, 'Master'],
       ['Bob', 30.0, 64500.0, 1, 'PhD'],
       ['Alice', 28.0, 55000.0, 0, 'Master'],
       ['Charlie', 35.0, 70000.0, 0, 'Bachelor'],
       ['Dave', 32.0, 65000.0, 1, 'Master'],
       ['Eve', 27.0, 64500.0, 0, 'PhD'],
       ['Frank', 33.0, 75000.0, 1, 'Master'],
       ['Grace', 29.0, 62000.0, 0, 'Master']], dtype=object)
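
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why 'Female' maps to 0 and 'Male' to 1 in the output above. A short sketch of how the mapping can be read back from a fitted encoder; the variable names are illustrative:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["Male", "Female", "Male", "Female"])

# classes_ lists the original labels; a label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(enc.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
restored = enc.inverse_transform(codes)
```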

In [9]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# One-hot encode Name (column 0) and Education (column 4); the other columns pass through unchanged.
ct = ColumnTransformer([("Name",OneHotEncoder(),[0]),("Education",OneHotEncoder(),[4])],remainder="passthrough")
X = ct.fit_transform(X)

In [10]:
X

array([[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0,
        30.571428571428573, 64500.0, 1],
       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0,
        30.571428571428573, 60000.0, 0],
       [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 30.0,
        64500.0, 1],
       [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 28.0,
        55000.0, 0],
       [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 35.0,
        70000.0, 0],
       [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 32.0,
        65000.0, 1],
       [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 27.0,
        64500.0, 0],
       [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 33.0,
        75000.0, 1],
       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 29.0,
        62000.0, 0]], dtype=object)

In [11]:
from sklearn.preprocessing import LabelEncoder
# Encode the target: Negative -> 0, Positive -> 1.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

In [12]:
y

array([1, 1, 0, 0, 1, 1, 0, 0, 1])

In [13]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
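
With only 9 rows, `test_size=0.2` rounds the test set up to 2 rows and leaves 7 for training. A small sketch with placeholder features (`X_demo` is hypothetical; `y_demo` repeats the encoded labels) confirms the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(18).reshape(9, 2)  # placeholder features, 9 rows
y_demo = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (7, 2) (2, 2)
```

A two-row test set gives a very coarse estimate of model quality, so any score computed on it should be read with that in mind.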

In [14]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# Fit the scaler on the training set only, then reuse its statistics for the test set.
X_train = sc_X.fit_transform(X_train)
print(X_train)
X_test = sc_X.transform(X_test)
print(X_test)

[[-0.40824829  0.         -0.40824829 -0.40824829 -0.40824829  0.
  -0.40824829  2.44948974 -0.40824829 -0.63245553  0.8660254  -0.40824829
   0.10704756 -0.69216144 -0.63245553]
 [-0.40824829  0.          2.44948974 -0.40824829 -0.40824829  0.
  -0.40824829 -0.40824829 -0.40824829  1.58113883 -1.15470054 -0.40824829
   1.89391844  1.61504335 -0.63245553]
 [-0.40824829  0.         -0.40824829 -0.40824829 -0.40824829  0.
   2.44948974 -0.40824829 -0.40824829 -0.63245553  0.8660254  -0.40824829
  -0.52700339 -0.23072048 -0.63245553]
 [-0.40824829  0.         -0.40824829 -0.40824829  2.44948974  0.
  -0.40824829 -0.40824829 -0.40824829 -0.63245553 -1.15470054  2.44948974
  -1.33397733  0.34608072 -0.63245553]
 [ 2.44948974  0.         -0.40824829 -0.40824829 -0.40824829  0.
  -0.40824829 -0.40824829 -0.40824829 -0.63245553  0.8660254  -0.40824829
  -0.93049036 -1.84576383 -0.63245553]
 [-0.40824829  0.         -0.40824829 -0.40824829 -0.40824829  0.
  -0.40824829 -0.40824829  2.44948974  