<h2 align= 'center'><b>DATA PREPROCESSING</b></h2>

#### **IMPORTING THE LIBRARIES**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

#### **IMPORTING THE DATASET**

In [2]:
!pip install "xlrd>=2.0.1"

In [2]:
data= pd.read_excel(r"E:\Employees_performance_analysis\src\data preprocessing\INX_Future_Inc_Employee_Performance_CDS_Project2_Data_V1.8.xls")
data

Unnamed: 0,EmpNumber,Age,Gender,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,...,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition,PerformanceRating
0,E1001000,32,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,10,3,...,4,10,2,2,10,7,0,8,No,3
1,E1001006,47,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,14,4,...,4,20,2,3,7,7,1,7,No,3
2,E1001007,40,Male,Life Sciences,Married,Sales,Sales Executive,Travel_Frequently,5,4,...,3,20,2,3,18,13,1,12,No,4
3,E1001009,41,Male,Human Resources,Divorced,Human Resources,Manager,Travel_Rarely,10,4,...,2,23,2,2,21,6,12,6,No,3
4,E1001010,60,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,16,4,...,4,10,1,3,2,2,2,2,No,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1195,E100992,27,Female,Medical,Divorced,Sales,Sales Executive,Travel_Frequently,3,1,...,2,6,3,3,6,5,0,4,No,4
1196,E100993,37,Male,Life Sciences,Single,Development,Senior Developer,Travel_Rarely,10,2,...,1,4,2,3,1,0,0,0,No,3
1197,E100994,50,Male,Medical,Married,Development,Senior Developer,Travel_Rarely,28,1,...,3,20,3,3,20,8,3,8,No,3
1198,E100995,34,Female,Medical,Single,Data Science,Data Scientist,Travel_Rarely,9,3,...,2,9,3,4,8,7,7,7,No,3


#### **DOMAIN ANALYSIS**

1. ***EmpNumber:*** Unique identifier for each employee in the dataset.

2. ***Age:*** Age of the employee, providing insight into workforce demographics and potential correlations with attrition.

3. ***Gender:*** Gender of the employee, which may impact workplace dynamics and attrition patterns.

4. ***EducationBackground:*** The educational background of the employee, influencing skillset and career trajectory.

5. ***MaritalStatus:*** Marital status of the employee, potentially affecting work-life balance and job satisfaction.

6. ***EmpDepartment:*** Department in which the employee works, indicating job role and organizational structure.

7. ***EmpJobRole:*** Specific job role of the employee within their department, reflecting responsibilities and career path.

8. ***BusinessTravelFrequency:*** Frequency of business travel for the employee, impacting lifestyle and job satisfaction.

9. ***DistanceFromHome:*** Distance of employee's residence from the workplace, influencing commuting stress and retention.

10. ***EmpEducationLevel:*** Level of education attained by the employee, reflecting qualifications and potential for advancement.

11. ***EmpEnvironmentSatisfaction:*** Employee satisfaction with the work environment, affecting morale and turnover.

12. ***EmpHourlyRate:*** Hourly wage of the employee, a factor in compensation satisfaction and retention.

13. ***EmpJobInvolvement:*** Level of involvement and engagement in the job role, affecting performance and attrition risk.

14. ***EmpJobLevel:*** Level of hierarchy within the organization, indicating seniority and career progression.

15. ***EmpJobSatisfaction:*** Satisfaction level with the job role, impacting employee morale and retention.

16. ***NumCompaniesWorked:*** Number of companies the employee has previously worked for, indicating job stability and turnover risk.

17. ***OverTime:*** Whether the employee works overtime, influencing work-life balance and burnout.

18. ***EmpLastSalaryHikePercent:*** Percentage of the employee's last salary hike, affecting compensation satisfaction and retention.

19. ***EmpRelationshipSatisfaction:*** Satisfaction with relationships at work, influencing job satisfaction and likelihood of turnover.

20. ***TotalWorkExperienceInYears:*** Total work experience of the employee, influencing skill level and career trajectory.

21. ***TrainingTimesLastYear:*** Number of training sessions attended by the employee last year, indicating investment in skill development and career growth.

22. ***EmpWorkLifeBalance:*** Employee's perceived balance between work and personal life, affecting job satisfaction and retention.

23. ***ExperienceYearsAtThisCompany:*** Years of experience at the current company, indicating loyalty and potential for promotion.

24. ***ExperienceYearsInCurrentRole:*** Years of experience in the current job role, influencing expertise and potential for advancement.

25. ***YearsSinceLastPromotion:*** Time since the employee's last promotion, impacting career progression and job satisfaction.

26. ***YearsWithCurrManager:*** Years of tenure with the current manager, affecting job satisfaction and retention.

27. ***Attrition:*** Indicates whether the employee has left the company.

28. ***PerformanceRating:*** Target variable for the given problem. This is the performance rating assigned to the employee, influencing career development and potential for retention.

#### **DATA PREPROCESSING**

##### **CHECKING NULL VALUES:**

In [3]:
data.isnull().sum()

EmpNumber                       0
Age                             0
Gender                          0
EducationBackground             0
MaritalStatus                   0
EmpDepartment                   0
EmpJobRole                      0
BusinessTravelFrequency         0
DistanceFromHome                0
EmpEducationLevel               0
EmpEnvironmentSatisfaction      0
EmpHourlyRate                   0
EmpJobInvolvement               0
EmpJobLevel                     0
EmpJobSatisfaction              0
NumCompaniesWorked              0
OverTime                        0
EmpLastSalaryHikePercent        0
EmpRelationshipSatisfaction     0
TotalWorkExperienceInYears      0
TrainingTimesLastYear           0
EmpWorkLifeBalance              0
ExperienceYearsAtThisCompany    0
ExperienceYearsInCurrentRole    0
YearsSinceLastPromotion         0
YearsWithCurrManager            0
Attrition                       0
PerformanceRating               0
dtype: int64

In [4]:
# hence there are no null values
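
The null check above can also be reduced to a single yes/no answer, which is handy as a guard before later steps. A minimal sketch on a hypothetical toy frame (the names `toy`, `null_counts` and `has_nulls` are illustrative and not part of the notebook):

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({"Age": [32, 47, 40], "OverTime": ["No", "Yes", "No"]})

# Per-column count of missing values, as in the cell above
null_counts = toy.isnull().sum()

# One boolean for the whole frame: True if any cell is missing
has_nulls = bool(toy.isnull().any().any())

print(null_counts)
print("any nulls:", has_nulls)
```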

#### **CHECKING FOR DUPLICATE RECORDS:**

In [5]:
data.duplicated().sum()

0
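
Note that `duplicated()` compares whole rows, and every row here carries a unique `EmpNumber`, so two otherwise identical records would not be flagged. A minimal sketch of checking duplicates on the feature columns only, using a hypothetical toy frame (whether such rows should count as duplicates is a judgment call, since two employees can legitimately share the same attributes):

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 differ only in the identifier
toy = pd.DataFrame({
    "EmpNumber": ["E1", "E2", "E3"],
    "Age": [32, 32, 40],
    "Gender": ["Male", "Male", "Female"],
})

# Whole-row comparison: the unique identifier hides the repeat
full_row_dupes = int(toy.duplicated().sum())

# Comparison on every column except the identifier
feature_cols = [c for c in toy.columns if c != "EmpNumber"]
feature_dupes = int(toy.duplicated(subset=feature_cols).sum())

print(full_row_dupes, feature_dupes)
```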

#### **ENCODING**

In [6]:
for column in data.drop('PerformanceRating', axis=1):
    if data[column].dtype =='object':
        print(column)
        print("==============================")

EmpNumber
Gender
EducationBackground
MaritalStatus
EmpDepartment
EmpJobRole
BusinessTravelFrequency
OverTime
Attrition
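
The same list of object-typed columns can be obtained without an explicit loop. A minimal sketch using `select_dtypes` on a hypothetical toy frame (the columns are built with `dtype=object` on purpose, because newer pandas versions may infer a dedicated string dtype instead):

```python
import pandas as pd

# Hypothetical toy frame with explicit object columns
toy = pd.DataFrame({
    "EmpNumber": pd.Series(["E1", "E2"], dtype=object),
    "Age": [32, 47],
    "Gender": pd.Series(["Male", "Female"], dtype=object),
    "PerformanceRating": [3, 4],
})

# Drop the target, then keep only the object-typed columns
categorical_cols = (
    toy.drop(columns="PerformanceRating")
       .select_dtypes(include="object")
       .columns.tolist()
)

print(categorical_cols)
```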


In [7]:
# From the basic checks in the data exploration, the categorical features split into binary and nominal groups

binary_features= ['OverTime', 'Attrition']
nominal_features= ['Gender', 'EducationBackground', 'MaritalStatus', 'EmpDepartment', 'EmpJobRole', 'BusinessTravelFrequency']

# taking a copy of original data for encoding
encoded_data= data.copy()
encoded_data.shape

(1200, 28)

#### ***OverTime***

In [8]:
# binary features
# OverTime
encoded_data.OverTime.value_counts()

OverTime
No     847
Yes    353
Name: count, dtype: int64

In [9]:
# yes=1, No= 0
# mapping is done
encoded_data['OverTime']= encoded_data['OverTime'].map({"No": 0, "Yes": 1})

In [10]:
encoded_data.OverTime.value_counts()

OverTime
0    847
1    353
Name: count, dtype: int64

#### ***Attrition***

In [11]:
encoded_data.Attrition.value_counts()

Attrition
No     1022
Yes     178
Name: count, dtype: int64

In [12]:
# yes=1, No= 0
# mapping is done
encoded_data['Attrition']= encoded_data['Attrition'].map({"No": 0, "Yes": 1})

In [13]:
encoded_data.Attrition.value_counts()

Attrition
0    1022
1     178
Name: count, dtype: int64
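
One caveat with `map` and a dictionary: any value missing from the dictionary silently becomes `NaN`. The counts above show that this did not happen here, but a quick check makes it explicit. A minimal sketch on a hypothetical series with one deliberately unexpected value:

```python
import pandas as pd

yes_no = {"No": 0, "Yes": 1}

# Hypothetical series; the lowercase "yes" is deliberately not in the mapping
overtime = pd.Series(["No", "Yes", "No", "yes"], name="OverTime")

mapped = overtime.map(yes_no)

# Original values that the mapping did not cover (they are NaN after map)
unmapped = sorted(set(overtime[mapped.isna()]))

print(unmapped)
```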

#### ***Gender***

In [14]:
# nominal features--> one-hot encoding is done
# Gender
encoded_data.Gender.value_counts()

Gender
Male      725
Female    475
Name: count, dtype: int64

In [15]:

# drop_first=True drops 'Female', so the remaining 'Male' column encodes Male as True
encoded_data['Gender']= pd.get_dummies(encoded_data['Gender'], drop_first=True)['Male']

In [16]:
encoded_data.Gender.value_counts()

Gender
True     725
False    475
Name: count, dtype: int64
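
For a two-category feature, `get_dummies` with `drop_first=True` returns a one-column DataFrame named after the category that is kept (here `Male`). A minimal sketch on a hypothetical series, alongside an equivalent explicit comparison that some readers may find easier to follow:

```python
import pandas as pd

# Hypothetical toy series standing in for encoded_data['Gender']
gender = pd.Series(["Male", "Female", "Male"], name="Gender")

# One-hot encode; 'Female' sorts first and is dropped, leaving a 'Male' column
dummies = pd.get_dummies(gender, drop_first=True)

# Equivalent explicit encoding: 1 for Male, 0 otherwise
is_male = (gender == "Male").astype(int)

print(list(dummies.columns))
print(is_male.tolist())
```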

#### ***EducationBackground***

In [17]:
# EducationBackground
encoded_data.EducationBackground.value_counts()

EducationBackground
Life Sciences       492
Medical             384
Marketing           137
Technical Degree    100
Other                66
Human Resources      21
Name: count, dtype: int64

In [18]:

EducationBackground= pd.get_dummies(encoded_data['EducationBackground'], drop_first= True)

In [19]:
encoded_data= pd.concat([encoded_data, EducationBackground], axis=1)

In [20]:
encoded_data.drop('EducationBackground', axis=1, inplace= True)
encoded_data.head()

Unnamed: 0,EmpNumber,Age,Gender,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,...,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition,PerformanceRating,Life Sciences,Marketing,Medical,Other,Technical Degree
0,E1001000,32,True,Single,Sales,Sales Executive,Travel_Rarely,10,3,4,...,7,0,8,0,3,False,True,False,False,False
1,E1001006,47,True,Single,Sales,Sales Executive,Travel_Rarely,14,4,4,...,7,1,7,0,3,False,True,False,False,False
2,E1001007,40,True,Married,Sales,Sales Executive,Travel_Frequently,5,4,4,...,13,1,12,0,4,True,False,False,False,False
3,E1001009,41,True,Divorced,Human Resources,Manager,Travel_Rarely,10,4,2,...,6,12,6,0,3,False,False,False,False,False
4,E1001010,60,True,Single,Sales,Sales Executive,Travel_Rarely,16,4,1,...,2,2,2,0,3,False,True,False,False,False


#### ***MaritalStatus***

In [21]:
# MaritalStatus
encoded_data.MaritalStatus.value_counts()

MaritalStatus
Married     548
Single      384
Divorced    268
Name: count, dtype: int64

In [22]:
# one hot encoding
MaritalStatus= pd.get_dummies(encoded_data['MaritalStatus'], drop_first= True)
encoded_data= pd.concat([encoded_data, MaritalStatus], axis= 1)

In [23]:
encoded_data.drop('MaritalStatus', axis=1, inplace= True)
encoded_data.head()

Unnamed: 0,EmpNumber,Age,Gender,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,...,YearsWithCurrManager,Attrition,PerformanceRating,Life Sciences,Marketing,Medical,Other,Technical Degree,Married,Single
0,E1001000,32,True,Sales,Sales Executive,Travel_Rarely,10,3,4,55,...,8,0,3,False,True,False,False,False,False,True
1,E1001006,47,True,Sales,Sales Executive,Travel_Rarely,14,4,4,42,...,7,0,3,False,True,False,False,False,False,True
2,E1001007,40,True,Sales,Sales Executive,Travel_Frequently,5,4,4,48,...,12,0,4,True,False,False,False,False,True,False
3,E1001009,41,True,Human Resources,Manager,Travel_Rarely,10,4,2,73,...,6,0,3,False,False,False,False,False,False,False
4,E1001010,60,True,Sales,Sales Executive,Travel_Rarely,16,4,1,84,...,2,0,3,False,True,False,False,False,False,True


#### ***EmpDepartment***

In [24]:
encoded_data.EmpDepartment.value_counts()

EmpDepartment
Sales                     373
Development               361
Research & Development    343
Human Resources            54
Finance                    49
Data Science               20
Name: count, dtype: int64

In [25]:
EmpDepartment= pd.get_dummies(encoded_data['EmpDepartment'], drop_first= True)
encoded_data= pd.concat([encoded_data, EmpDepartment], axis=1)

In [26]:
encoded_data.drop('EmpDepartment', axis=1, inplace= True)
encoded_data.head()

Unnamed: 0,EmpNumber,Age,Gender,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,...,Medical,Other,Technical Degree,Married,Single,Development,Finance,Human Resources,Research & Development,Sales
0,E1001000,32,True,Sales Executive,Travel_Rarely,10,3,4,55,3,...,False,False,False,False,True,False,False,False,False,True
1,E1001006,47,True,Sales Executive,Travel_Rarely,14,4,4,42,3,...,False,False,False,False,True,False,False,False,False,True
2,E1001007,40,True,Sales Executive,Travel_Frequently,5,4,4,48,2,...,False,False,False,True,False,False,False,False,False,True
3,E1001009,41,True,Manager,Travel_Rarely,10,4,2,73,2,...,False,False,False,False,False,False,False,True,False,False
4,E1001010,60,True,Sales Executive,Travel_Rarely,16,4,1,84,3,...,False,False,False,False,True,False,False,False,False,True


#### ***EmpJobRole***

In [27]:
encoded_data.EmpJobRole.value_counts()

EmpJobRole
Sales Executive              270
Developer                    236
Manager R&D                   94
Research Scientist            77
Sales Representative          69
Laboratory Technician         64
Senior Developer              52
Manager                       51
Finance Manager               49
Human Resources               45
Technical Lead                38
Manufacturing Director        33
Healthcare Representative     33
Data Scientist                20
Research Director             19
Business Analyst              16
Senior Manager R&D            15
Delivery Manager              12
Technical Architect            7
Name: count, dtype: int64

In [28]:
EmpJobRole= pd.get_dummies(encoded_data['EmpJobRole'], drop_first= True)
encoded_data= pd.concat([encoded_data, EmpJobRole], axis=1)

In [29]:

encoded_data.drop('EmpJobRole', axis=1, inplace= True)
encoded_data.head()

Unnamed: 0,EmpNumber,Age,Gender,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,...,Manager R&D,Manufacturing Director,Research Director,Research Scientist,Sales Executive,Sales Representative,Senior Developer,Senior Manager R&D,Technical Architect,Technical Lead
0,E1001000,32,True,Travel_Rarely,10,3,4,55,3,2,...,False,False,False,False,True,False,False,False,False,False
1,E1001006,47,True,Travel_Rarely,14,4,4,42,3,2,...,False,False,False,False,True,False,False,False,False,False
2,E1001007,40,True,Travel_Frequently,5,4,4,48,2,3,...,False,False,False,False,True,False,False,False,False,False
3,E1001009,41,True,Travel_Rarely,10,4,2,73,2,5,...,False,False,False,False,False,False,False,False,False,False
4,E1001010,60,True,Travel_Rarely,16,4,1,84,3,2,...,False,False,False,False,True,False,False,False,False,False


#### ***BusinessTravelFrequency***

In [30]:
encoded_data.BusinessTravelFrequency.value_counts()

BusinessTravelFrequency
Travel_Rarely        846
Travel_Frequently    222
Non-Travel           132
Name: count, dtype: int64

In [31]:
BusinessTravelFrequency= pd.get_dummies(encoded_data['BusinessTravelFrequency'], drop_first= True)
encoded_data= pd.concat([encoded_data, BusinessTravelFrequency], axis=1)

In [32]:
encoded_data.drop('BusinessTravelFrequency', axis=1, inplace= True)
encoded_data.head()

Unnamed: 0,EmpNumber,Age,Gender,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,EmpJobSatisfaction,...,Research Director,Research Scientist,Sales Executive,Sales Representative,Senior Developer,Senior Manager R&D,Technical Architect,Technical Lead,Travel_Frequently,Travel_Rarely
0,E1001000,32,True,10,3,4,55,3,2,4,...,False,False,True,False,False,False,False,False,False,True
1,E1001006,47,True,14,4,4,42,3,2,1,...,False,False,True,False,False,False,False,False,False,True
2,E1001007,40,True,5,4,4,48,2,3,1,...,False,False,True,False,False,False,False,False,True,False
3,E1001009,41,True,10,4,2,73,2,5,4,...,False,False,False,False,False,False,False,False,False,True
4,E1001010,60,True,16,4,1,84,3,2,1,...,False,False,True,False,False,False,False,False,False,True


In [33]:
# checking the datatypes after encoding

encoded_data.dtypes

EmpNumber                       object
Age                              int64
Gender                            bool
DistanceFromHome                 int64
EmpEducationLevel                int64
EmpEnvironmentSatisfaction       int64
EmpHourlyRate                    int64
EmpJobInvolvement                int64
EmpJobLevel                      int64
EmpJobSatisfaction               int64
NumCompaniesWorked               int64
OverTime                         int64
EmpLastSalaryHikePercent         int64
EmpRelationshipSatisfaction      int64
TotalWorkExperienceInYears       int64
TrainingTimesLastYear            int64
EmpWorkLifeBalance               int64
ExperienceYearsAtThisCompany     int64
ExperienceYearsInCurrentRole     int64
YearsSinceLastPromotion          int64
YearsWithCurrManager             int64
Attrition                        int64
PerformanceRating                int64
Life Sciences                     bool
Marketing                         bool
Medical                  

In [34]:
# EmpNumber is ignored since it is a unique feature.
encoded_data.drop('EmpNumber', axis=1, inplace= True)
encoded_data.columns

Index(['Age', 'Gender', 'DistanceFromHome', 'EmpEducationLevel',
       'EmpEnvironmentSatisfaction', 'EmpHourlyRate', 'EmpJobInvolvement',
       'EmpJobLevel', 'EmpJobSatisfaction', 'NumCompaniesWorked', 'OverTime',
       'EmpLastSalaryHikePercent', 'EmpRelationshipSatisfaction',
       'TotalWorkExperienceInYears', 'TrainingTimesLastYear',
       'EmpWorkLifeBalance', 'ExperienceYearsAtThisCompany',
       'ExperienceYearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager', 'Attrition', 'PerformanceRating',
       'Life Sciences', 'Marketing', 'Medical', 'Other', 'Technical Degree',
       'Married', 'Single', 'Development', 'Finance', 'Human Resources',
       'Research & Development', 'Sales', 'Data Scientist', 'Delivery Manager',
       'Developer', 'Finance Manager', 'Healthcare Representative',
       'Human Resources', 'Laboratory Technician', 'Manager', 'Manager R&D',
       'Manufacturing Director', 'Research Director', 'Research Scientist',
       'Sal
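
The column list above contains `'Human Resources'` twice: once from `EmpDepartment` and once from `EmpJobRole`. Duplicate column names can cause trouble later, for example `encoded_data['Human Resources']` returns two columns. One possible way to avoid this, sketched below on a hypothetical toy frame, is to let `get_dummies` prefix each dummy with its source column name. This is an alternative to the cells above, not what the notebook actually ran:

```python
import pandas as pd

# Hypothetical toy frame: both columns contain the value 'Human Resources'
toy = pd.DataFrame({
    "EmpDepartment": ["Sales", "Human Resources", "Finance"],
    "EmpJobRole": ["Sales Executive", "Human Resources", "Finance Manager"],
})

# Unprefixed dummies, concatenated as in the cells above: names collide
unprefixed = pd.concat(
    [pd.get_dummies(toy[c], drop_first=True) for c in toy.columns], axis=1
)

# Passing the frame with columns=... prefixes every dummy with its source column
prefixed = pd.get_dummies(
    toy, columns=["EmpDepartment", "EmpJobRole"], drop_first=True
)

print(unprefixed.columns.tolist())
print(prefixed.columns.tolist())
```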

#### **CHECKING FOR OUTLIERS:**

In [35]:
# fetching only continuous columns

continuous_col= encoded_data[['Age','DistanceFromHome', 'EmpHourlyRate', 'TotalWorkExperienceInYears', 'ExperienceYearsAtThisCompany']] 
continuous_col

Unnamed: 0,Age,DistanceFromHome,EmpHourlyRate,TotalWorkExperienceInYears,ExperienceYearsAtThisCompany
0,32,10,55,10,10
1,47,14,42,20,7
2,40,5,48,20,18
3,41,10,73,23,21
4,60,16,84,10,2
...,...,...,...,...,...
1195,27,3,71,6,6
1196,37,10,80,4,1
1197,50,28,74,20,20
1198,34,9,46,9,8


In [36]:
# using interquartile method to find out the outliers

q1= continuous_col.quantile(0.25)
q3= continuous_col.quantile(0.75)
iqr= q3 - q1
lower_lim= q1 - 1.5*iqr  #lower_lim-->lower limit
upper_lim= q3 + 1.5*iqr  # upper_lim--> upper limit

outliers= (continuous_col < lower_lim)|(continuous_col > upper_lim)
total_outliers= outliers.sum()
total_outliers.to_frame().T.style.background_gradient(cmap= 'Pastel1')

Unnamed: 0,Age,DistanceFromHome,EmpHourlyRate,TotalWorkExperienceInYears,ExperienceYearsAtThisCompany
0,0,0,0,51,56
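
To make the interquartile rule concrete, here is a small worked example on a hypothetical series of nine values (the numbers are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical values; 40 is far from the rest
years = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 40], name="TotalWorkExperienceInYears")

q1 = years.quantile(0.25)   # 3.0
q3 = years.quantile(0.75)   # 7.0
iqr = q3 - q1               # 4.0

lower_lim = q1 - 1.5 * iqr  # 3 - 6 = -3.0
upper_lim = q3 + 1.5 * iqr  # 7 + 6 = 13.0

# Flag values outside the two fences
outliers = (years < lower_lim) | (years > upper_lim)

print(lower_lim, upper_lim, int(outliers.sum()))
```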


#### ***TotalWorkExperienceInYears***

In [37]:
# handling outliers
# TotalWorkExperienceInYears

q1= encoded_data['TotalWorkExperienceInYears'].quantile(0.25)
q3= encoded_data['TotalWorkExperienceInYears'].quantile(0.75)
iqr= q3 - q1

lower_lim= q1 - 1.5*iqr
upper_lim= q3 + 1.5*iqr

outliers= (encoded_data['TotalWorkExperienceInYears']< lower_lim) | (encoded_data['TotalWorkExperienceInYears']> upper_lim)
outliers_percent= (outliers.sum()/ (len(encoded_data))) *100
outliers_percent

4.25

In [38]:
# Outliers make up less than 5% of the records for "TotalWorkExperienceInYears", so they are treated instead of being left as they are.

# checking the records below lower limit
encoded_data.loc[encoded_data['TotalWorkExperienceInYears']< lower_lim]

Unnamed: 0,Age,Gender,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,EmpJobSatisfaction,NumCompaniesWorked,...,Research Director,Research Scientist,Sales Executive,Sales Representative,Senior Developer,Senior Manager R&D,Technical Architect,Technical Lead,Travel_Frequently,Travel_Rarely


In [39]:
# checking the records above the upper limit

len(encoded_data.loc[encoded_data['TotalWorkExperienceInYears'] > upper_lim])

51

In [40]:
# Replacing the outliers with the median

encoded_data.loc[encoded_data['TotalWorkExperienceInYears']> upper_lim, 'TotalWorkExperienceInYears']= np.median(encoded_data['TotalWorkExperienceInYears'])


In [41]:
# checking the data after handling the outliers

encoded_data.loc[encoded_data['TotalWorkExperienceInYears'] > upper_lim]

Unnamed: 0,Age,Gender,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,EmpJobSatisfaction,NumCompaniesWorked,...,Research Director,Research Scientist,Sales Executive,Sales Representative,Senior Developer,Senior Manager R&D,Technical Architect,Technical Lead,Travel_Frequently,Travel_Rarely


#### ***ExperienceYearsAtThisCompany***

In [42]:
# ExperienceYearsAtThisCompany

q1= encoded_data['ExperienceYearsAtThisCompany'].quantile(0.25)
q3= encoded_data['ExperienceYearsAtThisCompany'].quantile(0.75)
iqr= q3 - q1

lower_lim= q1 - 1.5*iqr
upper_lim= q3 + 1.5*iqr

outliers= (encoded_data['ExperienceYearsAtThisCompany']< lower_lim) | (encoded_data['ExperienceYearsAtThisCompany']> upper_lim)
outliers_percent= (outliers.sum()/ (len(encoded_data))) *100
outliers_percent

4.666666666666667

In [43]:
# Outliers make up less than 5% of the records for "ExperienceYearsAtThisCompany", so they are treated instead of being left as they are.

# checking the records below lower limit
encoded_data.loc[encoded_data['ExperienceYearsAtThisCompany']< lower_lim]

Unnamed: 0,Age,Gender,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,EmpJobSatisfaction,NumCompaniesWorked,...,Research Director,Research Scientist,Sales Executive,Sales Representative,Senior Developer,Senior Manager R&D,Technical Architect,Technical Lead,Travel_Frequently,Travel_Rarely


In [44]:
# checking the records above the upper limit

len(encoded_data.loc[encoded_data['ExperienceYearsAtThisCompany'] > upper_lim])

56

In [45]:
# Replacing the outliers with the median

encoded_data.loc[encoded_data['ExperienceYearsAtThisCompany']> upper_lim, 'ExperienceYearsAtThisCompany']= np.median(encoded_data['ExperienceYearsAtThisCompany'])

In [46]:
# checking the data after handling the outliers

encoded_data.loc[encoded_data['ExperienceYearsAtThisCompany'] > upper_lim]

Unnamed: 0,Age,Gender,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,EmpJobSatisfaction,NumCompaniesWorked,...,Research Director,Research Scientist,Sales Executive,Sales Representative,Senior Developer,Senior Manager R&D,Technical Architect,Technical Lead,Travel_Frequently,Travel_Rarely
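
The two outlier treatments above follow the same steps, so they could be wrapped in a small helper. A minimal sketch, assuming only values above the upper fence need replacing (as was the case for both features); the function name `replace_upper_outliers_with_median` is hypothetical:

```python
import pandas as pd

def replace_upper_outliers_with_median(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace values above q3 + k*IQR with the median of the original series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    upper_lim = q3 + k * (q3 - q1)
    median = series.median()  # computed before any replacement
    # Keep values at or below the fence, substitute the median elsewhere
    return series.where(series <= upper_lim, median)

# Hypothetical values; 40 lies above the upper fence of 13
years = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 40])
cleaned = replace_upper_outliers_with_median(years)

print(cleaned.tolist())
```

Replacing with the median keeps every row but pulls long-tenured employees toward the middle of the distribution; capping at the fence with `clip(upper=upper_lim)` is another common option that preserves their ordering.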


In [47]:
# saving the encoded data to a different csv file
encoded_data.to_csv('data_encoded.csv', index=False)
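
The one-hot columns are stored as booleans, so the CSV will contain `True`/`False` text for them. If 0/1 integers are preferred downstream (an assumption, since the modelling step is not shown here), the boolean columns could be cast before saving. A minimal sketch on a hypothetical toy frame, written to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Hypothetical toy frame with one boolean dummy column
toy = pd.DataFrame({"Age": [32, 47], "Married": [False, True]})

# Cast every boolean column to 0/1 integers
bool_cols = toy.select_dtypes(include="bool").columns
toy = toy.astype({c: "int64" for c in bool_cols})

# Round trip through CSV using an in-memory buffer
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```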

### **For more, open the Reference directory**