# 'Quiet' vs. 'Loud quitting': An in-depth analysis of the factors that influence both phenomena

**Motivation**: Absenteeism and employee attrition are costly for companies, which spend resources training workers who may perform below their potential or even leave, taking valuable experience and training with them. There is also an opportunity cost in hiring, because the company could have brought in other employees who would have made better use of the resources invested. Understanding the factors that drive absenteeism and attrition is therefore vital for a company to optimize its labor costs and hiring decisions.

**Objective**: The objective of this EDA is to explore the factors that influence absenteeism and employee attrition.

To explore these factors, **I will assume a series of relationships** between the target variables (absenteeism and attrition) and the explanatory variables of both phenomena:

- **Hypothesis 1**: "Satisfaction with the work environment, job performance and the employee's perceived work-life balance have a negative influence on attrition (inverse relationship)"
- **Hypothesis 2**: "The number of children an employee has, transportation expense and the day of the week influence absenteeism"
- **Hypothesis 3**: "Age has a negative influence on attrition and absenteeism (the older the employee, the lower the attrition and absenteeism)"
- **Hypothesis 4**: "Education level has a positive influence on attrition and absenteeism (the higher the education level, the higher the attrition and absenteeism)"
- **Hypothesis 5**: "The distance between home and work is positively correlated with attrition and absenteeism (the greater the distance, the higher the attrition and absenteeism)"

At the end of the EDA I will check whether these relationships hold.
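As a rough illustration of how one of these directional hypotheses could be checked, the sketch below computes the attrition rate per score level and a rank (Spearman-style) correlation. The toy data and the names `toy`, `rate_by_score` and `rho` are hypothetical and are not taken from the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: a 1-4 work-life balance score and a 0/1 attrition flag.
toy = pd.DataFrame({
    "WorkLifeBalance": [1, 1, 2, 2, 3, 3, 4, 4],
    "Attrition":       [1, 1, 1, 0, 0, 1, 0, 0],
})

# Share of leavers at each score level.
rate_by_score = toy.groupby("WorkLifeBalance")["Attrition"].mean()

# Spearman correlation computed as the Pearson correlation of the ranks.
# A negative value is consistent with an inverse relationship.
rho = toy["WorkLifeBalance"].rank().corr(toy["Attrition"].rank())

print(rate_by_score)
print(round(rho, 3))
```

A negative `rho` together with falling rates across score levels would be consistent with an inverse relationship, although on its own it does not establish that one variable causes the other.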

### DATASETS AND ALTERNATIVE DATA SOURCES

#### 1. EMPLOYEE ATTRITION DATASET

- 1470 rows and 35 columns
- After dropping the redundant columns, 30 columns remain
- There are no missing values
- 22 variables are integers and 8 are objects.
- Among the numeric variables there are ordinal categorical variables stored as integer codes, with the following values:
    - Education: 1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor'
    - EnvironmentSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
    - JobInvolvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
    - JobSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
    - PerformanceRating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding'
    - RelationshipSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
    - WorkLifeBalance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best'

- It is useful to convert the target column (`Attrition`) to a numeric column, which makes the analysis easier
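For readable tables and plots, the integer codes listed above can be mapped back to their labels. The sketch below does this for `Education` using an ordered categorical, so the ranking of the levels is preserved. The labels come from the list above; the toy rows and the `EducationLabel` column name are assumptions for illustration.

```python
import pandas as pd

# Labels copied from the code list above; the toy rows are hypothetical.
education_labels = {
    1: "Below College",
    2: "College",
    3: "Bachelor",
    4: "Master",
    5: "Doctor",
}

toy = pd.DataFrame({"Education": [2, 1, 4, 3, 5, 3]})

# An ordered categorical keeps the ranking while showing readable labels.
toy["EducationLabel"] = pd.Categorical(
    toy["Education"].map(education_labels),
    categories=list(education_labels.values()),
    ordered=True,
)

# sort=False lists the levels in their defined order rather than by frequency.
print(toy["EducationLabel"].value_counts(sort=False))
```

Keeping the original integer column alongside the labeled one means numeric summaries such as `describe()` remain available.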

In [497]:
import pandas as pd
import numpy as np

df = pd.read_csv(r'C:\Users\rafam\OneDrive\Documentos\GitHub\mi_copia_dsftmayo24semana1.1\semana 2 y 3\1_Data_Analysis\Entregas\EDA\WA_Fn-UseC_-HR-Employee-Attrition.tsv.txt', delimiter='\t')
# Wrap the loaded data in a new DataFrame object, df1, used for the rest of the analysis
df1 = pd.DataFrame(df)
fuente_1 = "https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset"

In [498]:
df1.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


I inspect the value counts of the candidate columns and drop the redundant ones

In [499]:
df1['EmployeeCount'].value_counts()

EmployeeCount
1    1470
Name: count, dtype: int64

In [500]:
df1['StandardHours'].value_counts()

StandardHours
80    1470
Name: count, dtype: int64

In [501]:
df1['DailyRate'].value_counts()

DailyRate
691     6
408     5
530     5
1329    5
1082    5
       ..
650     1
279     1
316     1
314     1
628     1
Name: count, Length: 886, dtype: int64

In [502]:
df1['Over18'].value_counts()

Over18
Y    1470
Name: count, dtype: int64

In [503]:
df1['BusinessTravel'].value_counts()

BusinessTravel
Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: count, dtype: int64

In [504]:
df1['Education'].value_counts()

Education
3    572
4    398
2    282
1    170
5     48
Name: count, dtype: int64

In [505]:
df1['EducationField'].value_counts()

EducationField
Life Sciences       606
Medical             464
Marketing           159
Technical Degree    132
Other                82
Human Resources      27
Name: count, dtype: int64

In [506]:
df1['EmployeeNumber'].value_counts()

EmployeeNumber
1       1
1391    1
1389    1
1387    1
1383    1
       ..
659     1
657     1
656     1
655     1
2068    1
Name: count, Length: 1470, dtype: int64

In [507]:
columnas_redundantes= ['EmployeeCount', 'Over18', 'DailyRate', 'StandardHours', 'EmployeeNumber']

In [508]:
df1= df1.drop(columnas_redundantes, axis= 1)
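The manual `value_counts()` inspection above could also be complemented by a programmatic check. The sketch below flags columns with a single distinct value (constants) and columns with one distinct value per row (identifier-like). The toy frame is hypothetical and only reuses a few column names from the dataset.

```python
import pandas as pd

# Hypothetical toy frame reusing a few column names from the dataset.
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "EmployeeNumber": [1, 2, 4, 5],
    "Age": [41, 49, 37, 41],
})

n_unique = toy.nunique()

# Columns with one distinct value carry no information.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with as many distinct values as rows behave like identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print(constant_cols)
print(id_like_cols)
```

A column that is neither constant nor identifier-like, such as `DailyRate` here, still needs a judgment call; this check does not decide that case.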

In [509]:
df1

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,HourlyRate,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,Female,94,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,Male,61,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,Male,92,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,Female,56,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,Male,40,...,3,4,1,6,3,3,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,Research & Development,23,2,Medical,3,Male,41,...,3,3,1,17,3,3,5,2,0,3
1466,39,No,Travel_Rarely,Research & Development,6,1,Medical,4,Male,42,...,3,1,1,9,5,3,7,7,1,7
1467,27,No,Travel_Rarely,Research & Development,4,3,Life Sciences,2,Male,87,...,4,2,1,6,0,3,6,2,0,3
1468,49,No,Travel_Frequently,Sales,2,3,Medical,4,Male,63,...,3,4,0,17,3,2,9,6,0,8


In [510]:
df1.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
       'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
       'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [511]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 30 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   Department                1470 non-null   object
 4   DistanceFromHome          1470 non-null   int64 
 5   Education                 1470 non-null   int64 
 6   EducationField            1470 non-null   object
 7   EnvironmentSatisfaction   1470 non-null   int64 
 8   Gender                    1470 non-null   object
 9   HourlyRate                1470 non-null   int64 
 10  JobInvolvement            1470 non-null   int64 
 11  JobLevel                  1470 non-null   int64 
 12  JobRole                   1470 non-null   object
 13  JobSatisfaction           1470 non-null   int64 
 14  MaritalStatus           

In [512]:
df1.dtypes.value_counts()

int64     22
object     8
Name: count, dtype: int64

In [513]:
df1.shape

(1470, 30)

In [514]:
df1.isnull().sum().sum()

0

In [515]:
df1.duplicated().sum()

0

In [516]:
df1.describe()

Unnamed: 0,Age,DistanceFromHome,Education,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,MonthlyRate,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,...,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.92381,9.192517,2.912925,2.721769,65.891156,2.729932,2.063946,2.728571,6502.931293,14313.103401,...,3.153741,2.712245,0.793878,11.279592,2.79932,2.761224,7.008163,4.229252,2.187755,4.123129
std,9.135373,8.106864,1.024165,1.093082,20.329428,0.711561,1.10694,1.102846,4707.956783,7117.786044,...,0.360824,1.081209,0.852077,7.780782,1.289271,0.706476,6.126525,3.623137,3.22243,3.568136
min,18.0,1.0,1.0,1.0,30.0,1.0,1.0,1.0,1009.0,2094.0,...,3.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,30.0,2.0,2.0,2.0,48.0,2.0,1.0,2.0,2911.0,8047.0,...,3.0,2.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,36.0,7.0,3.0,3.0,66.0,3.0,2.0,3.0,4919.0,14235.5,...,3.0,3.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,43.0,14.0,4.0,4.0,83.75,3.0,3.0,4.0,8379.0,20461.5,...,3.0,4.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,60.0,29.0,5.0,4.0,100.0,4.0,5.0,4.0,19999.0,26999.0,...,4.0,4.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0


In [517]:
df1.iloc[:, :21].describe()

Unnamed: 0,Age,DistanceFromHome,Education,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,PerformanceRating
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.92381,9.192517,2.912925,2.721769,65.891156,2.729932,2.063946,2.728571,6502.931293,14313.103401,2.693197,15.209524,3.153741
std,9.135373,8.106864,1.024165,1.093082,20.329428,0.711561,1.10694,1.102846,4707.956783,7117.786044,2.498009,3.659938,0.360824
min,18.0,1.0,1.0,1.0,30.0,1.0,1.0,1.0,1009.0,2094.0,0.0,11.0,3.0
25%,30.0,2.0,2.0,2.0,48.0,2.0,1.0,2.0,2911.0,8047.0,1.0,12.0,3.0
50%,36.0,7.0,3.0,3.0,66.0,3.0,2.0,3.0,4919.0,14235.5,2.0,14.0,3.0
75%,43.0,14.0,4.0,4.0,83.75,3.0,3.0,4.0,8379.0,20461.5,4.0,18.0,3.0
max,60.0,29.0,5.0,4.0,100.0,4.0,5.0,4.0,19999.0,26999.0,9.0,25.0,4.0


In [518]:
df1.iloc[:, 21:].describe()

Unnamed: 0,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,2.712245,0.793878,11.279592,2.79932,2.761224,7.008163,4.229252,2.187755,4.123129
std,1.081209,0.852077,7.780782,1.289271,0.706476,6.126525,3.623137,3.22243,3.568136
min,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,2.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,3.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,4.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,4.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0


In [519]:
df1['Attrition']= df1['Attrition'].str.replace("No", "0")

In [520]:
df1['Attrition']= df1['Attrition'].str.replace("Yes", "1")

In [521]:
df1['Attrition'] = df1['Attrition'].astype(int)

In [522]:
df1['Attrition'].value_counts()

Attrition
0    1233
1     237
Name: count, dtype: int64

In [523]:
df1['Attrition'].values

array([1, 0, 1, ..., 0, 0, 0])
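An alternative to the two `str.replace` calls followed by `astype(int)` is a single dictionary `map`. This is a sketch on a hypothetical toy Series, not a change to the cells above. One advantage is that an unexpected label becomes `NaN` and can be caught explicitly rather than failing inside the integer cast.

```python
import pandas as pd

# Hypothetical toy target column with the same labels as 'Attrition'.
toy = pd.Series(["Yes", "No", "Yes", "No", "No"], name="Attrition")

# Map each label to its numeric code in one step.
encoded = toy.map({"No": 0, "Yes": 1})

# Labels missing from the dictionary become NaN, so check before casting.
if encoded.isna().any():
    raise ValueError("Unexpected label in the target column")

encoded = encoded.astype(int)
print(encoded.tolist())
```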

#### 2. ABSENTEEISM DATASET

- Initially it had 740 rows and 21 columns
- After handling missing values and duplicate rows, the dataset has 614 rows and 21 columns (20 once `Body mass index` is dropped)
- The categorical variables were entered as numeric variables, namely:
    - Month of absence
    - Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
    - Seasons (summer (1), autumn (2), winter (3), spring (4))
    - Disciplinary failure (yes=1; no=0)
    - Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
    - Social drinker (yes=1; no=0)
    - Social smoker (yes=1; no=0) 

In [524]:
dfx = pd.read_excel(r"C:\Users\rafam\OneDrive\Documentos\GitHub\mi_copia_dsftmayo24semana1.1\semana 2 y 3\1_Data_Analysis\Entregas\EDA\Absenteeism_at_work_Project.xls")

In [525]:
df2=pd.DataFrame(dfx)
df2.head()

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,26.0,7.0,3,1,289.0,36.0,13.0,33.0,239554.0,...,0.0,1.0,2.0,1.0,0.0,1.0,90.0,172.0,30.0,4.0
1,36,0.0,7.0,3,1,118.0,13.0,18.0,50.0,239554.0,...,1.0,1.0,1.0,1.0,0.0,0.0,98.0,178.0,31.0,0.0
2,3,23.0,7.0,4,1,179.0,51.0,18.0,38.0,239554.0,...,0.0,1.0,0.0,1.0,0.0,0.0,89.0,170.0,31.0,2.0
3,7,7.0,7.0,5,1,279.0,5.0,14.0,39.0,239554.0,...,0.0,1.0,2.0,1.0,1.0,0.0,68.0,168.0,24.0,4.0
4,11,23.0,7.0,5,1,289.0,36.0,13.0,33.0,239554.0,...,0.0,1.0,2.0,1.0,0.0,1.0,90.0,172.0,30.0,2.0


In [526]:
df2.columns

Index(['ID', 'Reason for absence', 'Month of absence', 'Day of the week',
       'Seasons', 'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours'],
      dtype='object')

In [527]:
df2.shape

(740, 21)

In [528]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 740 entries, 0 to 739
Data columns (total 21 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ID                               740 non-null    int64  
 1   Reason for absence               737 non-null    float64
 2   Month of absence                 739 non-null    float64
 3   Day of the week                  740 non-null    int64  
 4   Seasons                          740 non-null    int64  
 5   Transportation expense           733 non-null    float64
 6   Distance from Residence to Work  737 non-null    float64
 7   Service time                     737 non-null    float64
 8   Age                              737 non-null    float64
 9   Work load Average/day            730 non-null    float64
 10  Hit target                       734 non-null    float64
 11  Disciplinary failure             734 non-null    float64
 12  Education             

In [529]:
df2.isnull().sum()

ID                                  0
Reason for absence                  3
Month of absence                    1
Day of the week                     0
Seasons                             0
Transportation expense              7
Distance from Residence to Work     3
Service time                        3
Age                                 3
Work load Average/day              10
Hit target                          6
Disciplinary failure                6
Education                          10
Son                                 6
Social drinker                      3
Social smoker                       4
Pet                                 2
Weight                              1
Height                             14
Body mass index                    31
Absenteeism time in hours          22
dtype: int64

In [530]:
df2.isnull().sum().sum()

135

In [531]:
df2.dropna(inplace=True)

I drop the rows with missing values.

In [532]:
df2.duplicated().sum()

25

In [533]:
df2.drop_duplicates(inplace=True)

I drop the duplicate rows.
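To keep track of how many rows each cleaning step removes, the two steps can be chained without `inplace=True` and the row counts compared. The sketch below uses a hypothetical toy frame; the names `no_missing` and `clean` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value and one duplicated row.
toy = pd.DataFrame({
    "ID": [11, 36, 3, 3, 7],
    "Age": [33.0, 50.0, 38.0, 38.0, np.nan],
    "Absenteeism time in hours": [4.0, 0.0, 2.0, 2.0, 4.0],
})

n_start = len(toy)

# Step 1: drop rows containing any missing value.
no_missing = toy.dropna()

# Step 2: drop exact duplicate rows and rebuild a clean 0..n-1 index.
clean = no_missing.drop_duplicates().reset_index(drop=True)

print(f"rows at start: {n_start}")
print(f"dropped for missing values: {n_start - len(no_missing)}")
print(f"dropped as duplicates: {len(no_missing) - len(clean)}")
```

Returning new objects at each step leaves the raw data untouched, which makes it easier to revisit the cleaning decisions later.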

In [534]:
df2

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,26.0,7.0,3,1,289.0,36.0,13.0,33.0,239554.0,...,0.0,1.0,2.0,1.0,0.0,1.0,90.0,172.0,30.0,4.0
1,36,0.0,7.0,3,1,118.0,13.0,18.0,50.0,239554.0,...,1.0,1.0,1.0,1.0,0.0,0.0,98.0,178.0,31.0,0.0
2,3,23.0,7.0,4,1,179.0,51.0,18.0,38.0,239554.0,...,0.0,1.0,0.0,1.0,0.0,0.0,89.0,170.0,31.0,2.0
3,7,7.0,7.0,5,1,279.0,5.0,14.0,39.0,239554.0,...,0.0,1.0,2.0,1.0,1.0,0.0,68.0,168.0,24.0,4.0
4,11,23.0,7.0,5,1,289.0,36.0,13.0,33.0,239554.0,...,0.0,1.0,2.0,1.0,0.0,1.0,90.0,172.0,30.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
735,11,14.0,7.0,3,1,289.0,36.0,13.0,33.0,264604.0,...,0.0,1.0,2.0,1.0,0.0,1.0,90.0,172.0,30.0,8.0
736,1,11.0,7.0,3,1,235.0,11.0,14.0,37.0,264604.0,...,0.0,3.0,1.0,0.0,0.0,1.0,88.0,172.0,29.0,4.0
737,4,0.0,0.0,3,1,118.0,14.0,13.0,40.0,271219.0,...,0.0,1.0,1.0,1.0,0.0,8.0,98.0,170.0,34.0,0.0
738,8,0.0,0.0,4,2,231.0,35.0,14.0,39.0,271219.0,...,0.0,1.0,2.0,1.0,0.0,2.0,100.0,170.0,35.0,0.0


I drop the redundant columns: `Body mass index` is highly correlated with `Weight`, so I keep only one of them

In [535]:
df2['Weight'].corr(df2['Body mass index'])

0.8998027730911196

In [536]:
df2= df2.drop('Body mass index', axis=1)
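The pairwise check above can be generalized by scanning the upper triangle of the correlation matrix for strongly correlated pairs. The sketch below is one way to do it on hypothetical toy data. The `threshold` of 0.85 is an arbitrary assumption, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; BMI is derived from weight (kg) and height (cm).
toy = pd.DataFrame({
    "Weight": [90.0, 98.0, 89.0, 68.0, 75.0, 83.0],
    "Height": [172.0, 178.0, 170.0, 168.0, 171.0, 169.0],
})
toy["Body mass index"] = toy["Weight"] / (toy["Height"] / 100) ** 2

# Absolute correlations, keeping only the upper triangle (each pair once).
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.85  # assumed cut-off for "redundant"
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]

print(high_pairs)
```

Each pair in `high_pairs` is only a candidate for removal; which column of a pair to keep remains an analytical decision.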

In [537]:
df2

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Absenteeism time in hours
0,11,26.0,7.0,3,1,289.0,36.0,13.0,33.0,239554.0,97.0,0.0,1.0,2.0,1.0,0.0,1.0,90.0,172.0,4.0
1,36,0.0,7.0,3,1,118.0,13.0,18.0,50.0,239554.0,97.0,1.0,1.0,1.0,1.0,0.0,0.0,98.0,178.0,0.0
2,3,23.0,7.0,4,1,179.0,51.0,18.0,38.0,239554.0,97.0,0.0,1.0,0.0,1.0,0.0,0.0,89.0,170.0,2.0
3,7,7.0,7.0,5,1,279.0,5.0,14.0,39.0,239554.0,97.0,0.0,1.0,2.0,1.0,1.0,0.0,68.0,168.0,4.0
4,11,23.0,7.0,5,1,289.0,36.0,13.0,33.0,239554.0,97.0,0.0,1.0,2.0,1.0,0.0,1.0,90.0,172.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
735,11,14.0,7.0,3,1,289.0,36.0,13.0,33.0,264604.0,93.0,0.0,1.0,2.0,1.0,0.0,1.0,90.0,172.0,8.0
736,1,11.0,7.0,3,1,235.0,11.0,14.0,37.0,264604.0,93.0,0.0,3.0,1.0,0.0,0.0,1.0,88.0,172.0,4.0
737,4,0.0,0.0,3,1,118.0,14.0,13.0,40.0,271219.0,95.0,0.0,1.0,1.0,1.0,0.0,8.0,98.0,170.0,0.0
738,8,0.0,0.0,4,2,231.0,35.0,14.0,39.0,271219.0,95.0,0.0,1.0,2.0,1.0,0.0,2.0,100.0,170.0,0.0


In [538]:
df2.columns

Index(['ID', 'Reason for absence', 'Month of absence', 'Day of the week',
       'Seasons', 'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height',
       'Absenteeism time in hours'],
      dtype='object')

The cleaned dataset has 614 observations

In [539]:
df2.iloc[:, :10].describe()

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day
count,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0
mean,18.043974,18.897394,6.232899,3.859935,2.522801,222.239414,29.208469,12.630293,36.700326,271703.856678
std,10.982398,8.522357,3.328579,1.433663,1.096622,65.758913,14.599119,4.349922,6.656187,39434.300116
min,1.0,0.0,0.0,2.0,1.0,118.0,5.0,1.0,27.0,205917.0
25%,10.0,13.0,3.0,3.0,2.0,179.0,16.0,9.0,31.0,244387.0
50%,18.0,23.0,6.0,4.0,3.0,225.0,26.0,13.0,37.0,264604.0
75%,28.0,26.0,9.0,5.0,3.0,260.0,48.75,16.0,40.0,284853.0
max,36.0,28.0,12.0,6.0,4.0,388.0,52.0,29.0,58.0,378884.0


In [540]:
df2.iloc[:, 10:].describe()

Unnamed: 0,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Absenteeism time in hours
count,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0
mean,94.705212,0.050489,1.302932,1.053746,0.566775,0.076547,0.767101,79.285016,172.223127,7.210098
std,3.784928,0.219129,0.682638,1.080169,0.495925,0.266088,1.338168,12.88417,6.153612,14.054073
min,81.0,0.0,1.0,0.0,0.0,0.0,0.0,56.0,163.0,0.0
25%,93.0,0.0,1.0,0.0,0.0,0.0,0.0,69.0,169.0,2.0
50%,95.0,0.0,1.0,1.0,1.0,0.0,0.0,83.0,171.0,3.0
75%,98.0,0.0,1.0,2.0,1.0,0.0,1.0,89.0,172.0,8.0
max,100.0,1.0,4.0,4.0,1.0,1.0,8.0,108.0,196.0,120.0


### UNIVARIATE ANALYSIS

#### 1. EMPLOYEE ATTRITION

### MULTIVARIATE ANALYSIS

In [541]:
counted = pd.pivot_table(df1, index="Education", columns="EducationField", values="Attrition", aggfunc="count")
counted

Education \ EducationField,Human Resources,Life Sciences,Marketing,Medical,Other,Technical Degree
1,2,67,14,63,5,19
2,2,116,24,99,19,22
3,16,233,59,183,24,57
4,5,173,52,104,33,31
5,2,17,10,15,1,3


In [542]:
counted2 = pd.pivot_table(df1, index="Education", columns="EducationField", values="Attrition", aggfunc="sum")
counted2

Education \ EducationField,Human Resources,Life Sciences,Marketing,Medical,Other,Technical Degree
1,1,8,4,10,2,6
2,0,18,6,15,1,4
3,4,37,15,25,2,16
4,1,25,9,13,6,4
5,1,1,1,0,0,2
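Because `Attrition` is coded 0/1, dividing the table of sums by the table of counts gives the attrition rate in each cell. The sketch below rebuilds both tables from the outputs above so that it runs on its own; the name `attrition_rate` is an assumption. On the original data, `pd.pivot_table(..., aggfunc="mean")` should produce the same rates directly.

```python
import pandas as pd

fields = ["Human Resources", "Life Sciences", "Marketing",
          "Medical", "Other", "Technical Degree"]
levels = pd.Index([1, 2, 3, 4, 5], name="Education")

# Employees per cell, copied from the 'count' pivot table above.
counted = pd.DataFrame(
    [[2, 67, 14, 63, 5, 19],
     [2, 116, 24, 99, 19, 22],
     [16, 233, 59, 183, 24, 57],
     [5, 173, 52, 104, 33, 31],
     [2, 17, 10, 15, 1, 3]],
    index=levels, columns=fields,
)

# Leavers per cell, copied from the 'sum' pivot table above.
counted2 = pd.DataFrame(
    [[1, 8, 4, 10, 2, 6],
     [0, 18, 6, 15, 1, 4],
     [4, 37, 15, 25, 2, 16],
     [1, 25, 9, 13, 6, 4],
     [1, 1, 1, 0, 0, 2]],
    index=levels, columns=fields,
)

# Share of leavers in each Education x EducationField cell.
attrition_rate = (counted2 / counted).round(3)
print(attrition_rate)
```

Rates in sparsely populated cells should be read with caution: several cells contain only one to five employees, so a single leaver moves the rate considerably.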
