**_The key to success in any organization is attracting and retaining top talent. This analysis is useful to an HR analyst, whose task is to determine which factors keep employees at the company and which drive others to leave. Knowing these factors, the HR analyst can take steps to prevent the loss of qualified staff._**

<div style="text-align: center; background-color: #856ff8; padding: 10px;">
    <h2 style="font-weight: bold;">DESCRIPTION</h2>
</div>

- Importing modules
- Loading the dataset
- Data processing
- Computing the size of the DataFrame
- Listing the column labels
- Generating basic information about attributes
- Listing numerical features
- Listing categorical features
- Checking for missing values
- Descriptive analysis of numerical attributes
- Dropping unnecessary columns
- Descriptive analysis of categorical attributes
- Checking unique values in categorical attributes
- Saving the DataFrame to a CSV file

<div style="text-align: center; background-color: yellow; padding: 10px;">
    <h2 style="font-weight: bold;">IMPORTING MODULES</h2>
</div>

In [30]:
# Libraries for data manipulation
import numpy as np
import pandas as pd

# Library for preprocessing (label encoding)
from sklearn.preprocessing import LabelEncoder

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

<div style="text-align: center; background-color: yellow; padding: 10px;">
    <h2 style="font-weight: bold;">LOADING THE DATASET</h2>
</div>

In [31]:
employee_data = pd.read_csv(r'IBM-HR-Analytics-Employee-Attrition-and-Performance.csv')

In [33]:
# Print the first 5 rows of the DataFrame.
employee_data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [34]:
# Print the last 5 rows of the DataFrame.
employee_data.tail()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
1465,36,No,Travel_Frequently,884,Research & Development,23,2,Medical,1,2061,...,3,80,1,17,3,3,5,2,0,3
1466,39,No,Travel_Rarely,613,Research & Development,6,1,Medical,1,2062,...,1,80,1,9,5,3,7,7,1,7
1467,27,No,Travel_Rarely,155,Research & Development,4,3,Life Sciences,1,2064,...,2,80,1,6,0,3,6,2,0,3
1468,49,No,Travel_Frequently,1023,Sales,2,3,Medical,1,2065,...,4,80,0,17,3,2,9,6,0,8
1469,34,No,Travel_Rarely,628,Research & Development,8,3,Medical,1,2068,...,1,80,0,6,3,4,4,3,1,2


<div style="text-align: center; background-color: yellow; padding: 10px;">
    <h2 style="font-weight: bold;">DATA PROCESSING</h2>
</div>

## <span style='color:blue'> 1] COMPUTING THE SIZE OF THE DATASET </span>

In [6]:
# Print the shape of the DataFrame
print("The shape of data frame:", employee_data.shape)
# Print the length (number of rows) of the DataFrame
print("Number of Rows in the dataframe:", len(employee_data))
# Print the number of columns in the DataFrame
print("Number of Columns in the dataframe:", len(employee_data.columns))

The shape of data frame: (1470, 35)
Number of Rows in the dataframe: 1470
Number of Columns in the dataframe: 35


## <span style='color:blue'> 2] LISTING THE DATAFRAME COLUMNS </span>

In [35]:
for column in employee_data.columns:
    print(column)

Age
Attrition
BusinessTravel
DailyRate
Department
DistanceFromHome
Education
EducationField
EmployeeCount
EmployeeNumber
EnvironmentSatisfaction
Gender
HourlyRate
JobInvolvement
JobLevel
JobRole
JobSatisfaction
MaritalStatus
MonthlyIncome
MonthlyRate
NumCompaniesWorked
Over18
OverTime
PercentSalaryHike
PerformanceRating
RelationshipSatisfaction
StandardHours
StockOptionLevel
TotalWorkingYears
TrainingTimesLastYear
WorkLifeBalance
YearsAtCompany
YearsInCurrentRole
YearsSinceLastPromotion
YearsWithCurrManager


## <span style='color:blue'> 3] GENERATING BASIC INFORMATION OF ATTRIBUTES </span>

In [9]:
# Print the long summary of the DataFrame by setting verbose=True.
# Check the non-null counts for null or NaN values in the dataset.
# info() prints its report directly and returns None, so it is not wrapped in print().
employee_data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

### <font color=red>Inference:</font>

1. There are 26 numerical attributes in the dataset.
2. The remaining 9 attributes are categorical.
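
The counts in this inference can also be derived directly from the dtypes rather than read off the `info()` report. The following is a minimal sketch on a small toy frame; the `toy` DataFrame and its values are hypothetical stand-ins for `employee_data`, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for employee_data (hypothetical values, real column names).
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "DailyRate": [1102, 279, 1373],
    "Department": ["Sales", "Research & Development", "Research & Development"],
})

# Split the columns by dtype: numeric versus everything else.
numeric_cols = toy.select_dtypes(include=np.number).columns.tolist()
categorical_cols = toy.select_dtypes(exclude=np.number).columns.tolist()

print("Numerical attributes:", len(numeric_cols), numeric_cols)
print("Categorical attributes:", len(categorical_cols), categorical_cols)
```

Run against the full `employee_data`, the same two calls should reproduce the 26 / 9 split noted above.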

## <span style='color:blue'> 4] LISTING NUMERICAL FEATURES </span>

In [10]:
employee_data.select_dtypes(np.number).sample(5).style.set_properties(**{'background-color': '#E9F6E2',
                                                              'color': 'black','border-color': '#8b8c8c'})

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
709,31,335,9,2,1,991,3,46,2,1,1,2321,10322,0,22,4,1,80,0,4,0,3,3,2,1,2
575,54,376,19,4,1,799,4,95,3,2,1,5485,22670,9,11,3,2,80,2,9,4,3,5,3,1,4
127,19,528,22,1,1,167,4,50,3,1,3,1675,26820,1,19,3,4,80,0,0,2,2,0,0,0,0
482,31,1365,13,4,1,650,2,46,3,2,1,4233,11512,2,17,3,3,80,0,9,2,1,3,1,1,2
1236,36,1456,13,5,1,1733,2,96,2,2,1,6134,8658,5,13,3,2,80,3,16,3,3,2,2,2,2


### <font color=red>Inference:</font>

1. Some of the numerical features store categories encoded as numbers.
2. For better analysis, we will replace those numeric codes with the corresponding category labels.

### 4.1] Labelling Categories in Numerical Features

In [11]:
employee_data["Education"] = employee_data["Education"].replace({1:"Below College",2:"College",3:"Bachelor",4:"Master",5:"Doctor"})

In [12]:
employee_data["EnvironmentSatisfaction"] = employee_data["EnvironmentSatisfaction"].replace({1:"Low",2:"Medium",3:"High",4:"Very High"})

In [13]:
employee_data["JobInvolvement"] = employee_data["JobInvolvement"].replace({1:"Low",2:"Medium",3:"High",4:"Very High"})

In [14]:
employee_data["JobLevel"] = employee_data["JobLevel"].replace({1:"Entry Level",2:"Junior Level",3:"Mid Level",4:"Senior Level",
                                         5:"Executive Level"})

In [15]:
employee_data["JobSatisfaction"] = employee_data["JobSatisfaction"].replace({1:"Low",2:"Medium",3:"High",4:"Very High"})

In [16]:
employee_data["PerformanceRating"] = employee_data["PerformanceRating"].replace({1:"Low",2:"Good",3:"Excellent",4:"Outstanding"})

In [17]:
employee_data["RelationshipSatisfaction"] = employee_data["RelationshipSatisfaction"].replace({1:"Low",2:"Medium",3:"High",4:"Very High"})

In [18]:
employee_data["WorkLifeBalance"] = employee_data["WorkLifeBalance"].replace({1:"Bad",2:"Good",3:"Better",4:"Best"})
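
The eight cells above repeat the same pattern, so they could also be driven from a single dictionary of mappings. The sketch below is one possible arrangement, not the notebook's own method: the `toy` frame, the `label_maps` name, and the check for unmapped codes are assumptions added for illustration, and only three of the eight columns are shown.

```python
import pandas as pd

# Toy stand-in for employee_data (hypothetical values).
toy = pd.DataFrame({
    "Education": [2, 1, 4],
    "JobSatisfaction": [4, 2, 3],
    "WorkLifeBalance": [1, 3, 3],
})

# One shared mapping for the four-level satisfaction scales.
satisfaction_labels = {1: "Low", 2: "Medium", 3: "High", 4: "Very High"}

# Hypothetical helper dictionary: column name -> code-to-label mapping.
label_maps = {
    "Education": {1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"},
    "JobSatisfaction": satisfaction_labels,
    "WorkLifeBalance": {1: "Bad", 2: "Good", 3: "Better", 4: "Best"},
}

for column, mapping in label_maps.items():
    # Fail loudly if a code has no label, instead of silently producing NaN.
    unexpected = toy.loc[~toy[column].isin(list(mapping)), column].unique()
    if len(unexpected) > 0:
        raise ValueError(f"{column} has unmapped codes: {unexpected.tolist()}")
    toy[column] = toy[column].map(mapping)

print(toy)
```

Unlike `replace`, `map` turns any value missing from the dictionary into `NaN`, which is why the sketch validates the codes first.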

## <span style='color:blue'> 5] LISTING CATEGORICAL FEATURES </span>

In [19]:
employee_data.select_dtypes(include="O").sample(5).style.set_properties(**{'background-color': '#E9F6E2',
                                                                'color': 'black','border-color': '#8b8c8c'})

Unnamed: 0,Attrition,BusinessTravel,Department,Education,EducationField,EnvironmentSatisfaction,Gender,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,Over18,OverTime,PerformanceRating,RelationshipSatisfaction,WorkLifeBalance
1186,Yes,Travel_Frequently,Sales,Master,Other,Very High,Male,High,Junior Level,Sales Executive,Very High,Single,Y,Yes,Outstanding,Low,Best
1097,No,Travel_Rarely,Research & Development,College,Technical Degree,High,Male,Medium,Entry Level,Laboratory Technician,Low,Divorced,Y,No,Excellent,Medium,Better
610,No,Travel_Rarely,Research & Development,Below College,Technical Degree,High,Male,Medium,Mid Level,Research Director,Very High,Divorced,Y,Yes,Excellent,Medium,Better
550,No,Travel_Rarely,Research & Development,Below College,Medical,Medium,Male,High,Entry Level,Laboratory Technician,Low,Married,Y,No,Excellent,Very High,Best
535,No,Travel_Rarely,Human Resources,Master,Human Resources,Medium,Male,Medium,Executive Level,Manager,Very High,Divorced,Y,No,Excellent,Medium,Good


## <span style='color:blue'> 6] CHECK FOR MISSING VALUES </span>

In [20]:
# Calculate the number of missing values in each column
    
missing_df = employee_data.isnull().sum().to_frame().rename(columns={0:"Total No. of Missing Values"})
missing_df["% of Missing Values"] = round((missing_df["Total No. of Missing Values"]/len(employee_data))*100,2)
missing_df

| Attribute | Total No. of Missing Values | % of Missing Values |
|---|---|---|
| Age | 0 | 0.0 |
| Attrition | 0 | 0.0 |
| BusinessTravel | 0 | 0.0 |
| DailyRate | 0 | 0.0 |
| Department | 0 | 0.0 |
| DistanceFromHome | 0 | 0.0 |
| Education | 0 | 0.0 |
| EducationField | 0 | 0.0 |
| EmployeeCount | 0 | 0.0 |
| EmployeeNumber | 0 | 0.0 |


### <font color=red>Inference:</font>

1. None of the attributes has missing values.
2. Since there are no missing values, our further analysis will be consistent and will not be biased by missing data.
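
The "no missing values" conclusion can also be confirmed with a single number instead of scanning the table. This is a minimal sketch; the `toy` frame is hypothetical and deliberately contains two gaps so that the check has something to report.

```python
import numpy as np
import pandas as pd

# Toy frame with two deliberately missing entries (hypothetical values).
toy = pd.DataFrame({
    "Age": [41, np.nan, 37],
    "Attrition": ["Yes", "No", None],
})

# Total number of missing cells across the whole frame.
total_missing = int(toy.isna().sum().sum())

# Names of the columns that contain at least one missing value.
cols_with_missing = toy.columns[toy.isna().any()].tolist()

print("Total missing cells:", total_missing)
print("Columns with missing values:", cols_with_missing)
```

For `employee_data`, a total of zero and an empty column list would match the inference above.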

## <span style='color:blue'> 7] DESCRIPTIVE ANALYSIS ON NUMERICAL ATTRIBUTES </span>

In [21]:
employee_data.describe().T

| Attribute | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 1470.0 | 36.92381 | 9.135373 | 18.0 | 30.0 | 36.0 | 43.0 | 60.0 |
| DailyRate | 1470.0 | 802.485714 | 403.5091 | 102.0 | 465.0 | 802.0 | 1157.0 | 1499.0 |
| DistanceFromHome | 1470.0 | 9.192517 | 8.106864 | 1.0 | 2.0 | 7.0 | 14.0 | 29.0 |
| EmployeeCount | 1470.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| EmployeeNumber | 1470.0 | 1024.865306 | 602.024335 | 1.0 | 491.25 | 1020.5 | 1555.75 | 2068.0 |
| HourlyRate | 1470.0 | 65.891156 | 20.329428 | 30.0 | 48.0 | 66.0 | 83.75 | 100.0 |
| MonthlyIncome | 1470.0 | 6502.931293 | 4707.956783 | 1009.0 | 2911.0 | 4919.0 | 8379.0 | 19999.0 |
| MonthlyRate | 1470.0 | 14313.103401 | 7117.786044 | 2094.0 | 8047.0 | 14235.5 | 20461.5 | 26999.0 |
| NumCompaniesWorked | 1470.0 | 2.693197 | 2.498009 | 0.0 | 1.0 | 2.0 | 4.0 | 9.0 |
| PercentSalaryHike | 1470.0 | 15.209524 | 3.659938 | 11.0 | 12.0 | 14.0 | 18.0 | 25.0 |


### <font color=red>Inference:</font>

1. The minimum Age is 18, which means all employees are adults, so the Over18 attribute is not needed for our analysis.
2. The standard deviation of EmployeeCount and StandardHours is 0.00, which means every value in each of these attributes is the same.
3. The EmployeeNumber attribute holds a unique value for each employee, which will not provide any meaningful insights.
4. Since these attributes will not provide any meaningful insights for our analysis, we can simply drop them.
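
The same candidates can be found programmatically from the number of unique values per column. The sketch below assumes a small hypothetical `toy` frame in place of `employee_data`; the variable names `constant_cols` and `id_like_cols` are illustrative.

```python
import pandas as pd

# Toy stand-in for employee_data (hypothetical values, real column names).
toy = pd.DataFrame({
    "Age": [41, 49, 37, 41],
    "EmployeeCount": [1, 1, 1, 1],
    "EmployeeNumber": [1, 2, 4, 5],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [80, 80, 80, 80],
})

n_unique = toy.nunique()

# Columns with a single value carry no information.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with one distinct value per row behave like identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("Constant columns:", constant_cols)
print("Identifier-like columns:", id_like_cols)
```

The identifier heuristic is only a hint: a genuinely continuous measurement could also be unique in every row, so each flagged column still deserves a manual look before it is dropped.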

## <span style='color:blue'> 8] DROP UNNECESSARY COLUMNS </span>

**Observation Report:** _'EmployeeCount', 'Over18', and 'StandardHours' each have only one unique value, and 'EmployeeNumber' has 1470 unique values. These features aren't useful to us, so we are going to drop those columns._

In [22]:
employee_data.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'], axis="columns", inplace=True)

In [23]:
# Print top 5 rows in the dataframe.
employee_data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,College,Life Sciences,Medium,Female,...,Excellent,Low,0,8,0,Bad,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,Below College,Life Sciences,High,Male,...,Outstanding,Very High,1,10,3,Better,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,College,Other,Very High,Male,...,Excellent,Medium,0,7,3,Better,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,Master,Life Sciences,Very High,Female,...,Excellent,High,0,8,3,Better,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,Below College,Medical,Low,Male,...,Excellent,Very High,1,6,3,Better,2,2,2,2


In [24]:
# Print the shape of the DataFrame
print("The shape of data frame:", employee_data.shape)
# Print the length (number of rows) of the DataFrame
print("Number of Rows in the dataframe:", len(employee_data))
# Print the number of columns in the DataFrame
print("Number of Columns in the dataframe:", len(employee_data.columns))

The shape of data frame: (1470, 31)
Number of Rows in the dataframe: 1470
Number of Columns in the dataframe: 31


In [25]:
print("Column labels in the dataset in column order:")
for column in employee_data.columns:
    print(column)

Column labels in the dataset in column order:
Age
Attrition
BusinessTravel
DailyRate
Department
DistanceFromHome
Education
EducationField
EnvironmentSatisfaction
Gender
HourlyRate
JobInvolvement
JobLevel
JobRole
JobSatisfaction
MaritalStatus
MonthlyIncome
MonthlyRate
NumCompaniesWorked
OverTime
PercentSalaryHike
PerformanceRating
RelationshipSatisfaction
StockOptionLevel
TotalWorkingYears
TrainingTimesLastYear
WorkLifeBalance
YearsAtCompany
YearsInCurrentRole
YearsSinceLastPromotion
YearsWithCurrManager


## <span style='color:blue'> 9] DESCRIPTIVE ANALYSIS ON CATEGORICAL ATTRIBUTES </span>

In [26]:
employee_data.describe(include="O").T

| Attribute | count | unique | top | freq |
|---|---|---|---|---|
| Attrition | 1470 | 2 | No | 1233 |
| BusinessTravel | 1470 | 3 | Travel_Rarely | 1043 |
| Department | 1470 | 3 | Research & Development | 961 |
| Education | 1470 | 5 | Bachelor | 572 |
| EducationField | 1470 | 6 | Life Sciences | 606 |
| EnvironmentSatisfaction | 1470 | 4 | High | 453 |
| Gender | 1470 | 2 | Male | 882 |
| JobInvolvement | 1470 | 4 | High | 868 |
| JobLevel | 1470 | 5 | Entry Level | 543 |
| JobRole | 1470 | 9 | Sales Executive | 326 |


### <font color=red>Inference:</font>

1. All the categorical attributes have low cardinality.
2. The Attrition and OverTime columns are heavily skewed toward the No category.
3. The BusinessTravel attribute is heavily skewed toward the Travel_Rarely category.
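
The degree of skew is easier to judge as a percentage than as a raw frequency. The sketch below rebuilds two columns from the class counts reported in this notebook (Attrition: 1233 / 237; BusinessTravel: 1043 / 277 / 150); the `class_shares` helper is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Columns rebuilt from the class counts reported in the notebook output.
attrition = pd.Series(["No"] * 1233 + ["Yes"] * 237, name="Attrition")
business_travel = pd.Series(
    ["Travel_Rarely"] * 1043 + ["Travel_Frequently"] * 277 + ["Non-Travel"] * 150,
    name="BusinessTravel",
)


def class_shares(series: pd.Series) -> pd.Series:
    """Return each category's share of the rows, in percent, largest first."""
    return series.value_counts(normalize=True).mul(100).round(2)


for series in (attrition, business_travel):
    shares = class_shares(series)
    print(f"{series.name}: majority class '{shares.index[0]}' holds {shares.iloc[0]}% of rows")
```

On the real data, `class_shares(employee_data["OverTime"])` would give the corresponding figure for the OverTime column.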

## <span style='color:blue'> 10] CHECKING UNIQUE VALUES OF CATEGORICAL ATTRIBUTES </span>

In [27]:
# Calculate the number of unique values in each column
for column in employee_data.columns:
    print(f"{column} - Number of unique values : {employee_data[column].nunique()}")
    print("=============================================================")

Age - Number of unique values : 43
Attrition - Number of unique values : 2
BusinessTravel - Number of unique values : 3
DailyRate - Number of unique values : 886
Department - Number of unique values : 3
DistanceFromHome - Number of unique values : 29
Education - Number of unique values : 5
EducationField - Number of unique values : 6
EnvironmentSatisfaction - Number of unique values : 4
Gender - Number of unique values : 2
HourlyRate - Number of unique values : 71
JobInvolvement - Number of unique values : 4
JobLevel - Number of unique values : 5
JobRole - Number of unique values : 9
JobSatisfaction - Number of unique values : 4
MaritalStatus - Number of unique values : 3
MonthlyIncome - Number of unique values : 1349
MonthlyRate - Number of unique values : 1427
NumCompaniesWorked - Number of unique values : 10
OverTime - Number of unique values : 2
PercentSalaryHike - Number of unique values : 15
PerformanceRating - Number of unique values : 2
RelationshipSatisfa

In [28]:
categorical_features = []
for column in employee_data.columns:
    if employee_data[column].dtype == object and len(employee_data[column].unique()) <= 30:
        categorical_features.append(column)
        print(f"{column} : {employee_data[column].unique()}")
        print(employee_data[column].value_counts())
        print("====================================================================================")
categorical_features.remove('Attrition')

Attrition : ['Yes' 'No']
No     1233
Yes     237
Name: Attrition, dtype: int64
BusinessTravel : ['Travel_Rarely' 'Travel_Frequently' 'Non-Travel']
Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: BusinessTravel, dtype: int64
Department : ['Sales' 'Research & Development' 'Human Resources']
Research & Development    961
Sales                     446
Human Resources            63
Name: Department, dtype: int64
Education : ['College' 'Below College' 'Master' 'Bachelor' 'Doctor']
Bachelor         572
Master           398
College          282
Below College    170
Doctor            48
Name: Education, dtype: int64
EducationField : ['Life Sciences' 'Other' 'Medical' 'Marketing' 'Technical Degree'
 'Human Resources']
Life Sciences       606
Medical             464
Marketing           159
Technical Degree    132
Other                82
Human Resources      27
Name: EducationField, dtype: int64
EnvironmentSatisfaction : ['Medium' 'High' 'Very High' 'Low']
High 

### <font color=red>Inference:</font>

1. The value set of the categorical attributes is complete and easy to understand.
2. So we do not need to perform preprocessing steps for these attributes.

<div style="text-align: center; background-color: yellow; padding: 10px;">
    <h2 style="font-weight: bold;">SAVING DATAFRAME TO CSV FILE</h2>
</div>

In [29]:
# Save DataFrame to CSV file
employee_data.to_csv('IBM-HR-Analytics-Employee-Attrition-and-Performance-Revised.csv', index=False)
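
After saving, it can be worth reading the file back to confirm that the shape and column labels survived the round trip. The following is a minimal sketch on a hypothetical `toy` frame written to a temporary directory; the file name `toy-revised.csv` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the revised employee_data (hypothetical values).
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "Education": ["College", "Below College", "College"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy-revised.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    toy.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print("Shape preserved:", same_shape)
print("Columns preserved:", same_columns)
```

Writing with `index=False` matters here: without it, the row index is stored as an extra unnamed column and reappears when the file is loaded again.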