# Survival Analysis Lab

Complete the following exercises to solidify your knowledge of survival analysis.

In [26]:
import pandas as pd
from chart_studio import plotly
import cufflinks as cf
from lifelines import KaplanMeierFitter

cf.go_offline()

In [11]:
data = pd.read_csv('../data/attrition.csv')

## 1. Generate and plot a survival function that shows how employee retention rates vary by gender and employee age.

*Tip: If your lines have gaps in them, you can fill them in by applying `fillna(method='ffill')` and `fillna(method='bfill')` and then averaging the two results. We have provided a revised survival function below that you can use for the exercises in this lab.*
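To see what the gap-filling trick in the tip does, here is a minimal sketch on a made-up two-column table (the column names `A` and `B` and all values are hypothetical, not taken from the attrition data). It uses the `ffill()` and `bfill()` shorthands, which are equivalent to the `fillna` calls named in the tip.

```python
import pandas as pd

# Toy survival table: two groups observed at different points on the timeline,
# so each column has gaps (NaN) where only the other group has a value.
curves = pd.DataFrame(
    {"A": [1.0, None, 0.8, None], "B": [1.0, 0.9, None, 0.6]},
    index=[0, 1, 2, 3],
)

front_fill = curves.ffill()  # carry the last known value forward
back_fill = curves.bfill()   # carry the next known value backward
smoothed = (front_fill + back_fill) / 2  # midpoint of the two neighbours

print(smoothed)
```

Interior gaps become the midpoint of the neighbouring values. A gap at the very end of a column (the last row of `A` here) stays `NaN`, because the backward fill has no later value to copy.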

In [27]:
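def survival(data, group_field, time_field, event_field):
    """Fit one Kaplan-Meier curve per group and return them side by side."""
    kmf = KaplanMeierFitter()
    results = []

    for i in data[group_field].unique():
        group = data[data[group_field] == i]
        T = group[time_field]   # durations
        E = group[event_field]  # event indicator (truthy = attrition observed)
        kmf.fit(T, E, label=str(i))
        results.append(kmf.survival_function_)

    # Each group has its own timeline, so the concatenated frame has gaps.
    curves = pd.concat(results, axis=1)
    # Average the forward- and backward-filled frames to fill interior gaps.
    front_fill = curves.ffill()
    back_fill = curves.bfill()
    smoothed = (front_fill + back_fill) / 2
    return smoothed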
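For intuition about what each fitted curve contains, the following is a rough pure-Python sketch of the Kaplan-Meier product-limit estimate. It is an illustration under simplifying assumptions, not the lifelines implementation: the helper name `kaplan_meier` and the toy durations are hypothetical, and it records values only at times where an event occurred.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] for each distinct time with at least one event."""
    at_risk = len(durations)  # subjects still under observation
    surviving = 1.0           # running product-limit estimate
    curve = []
    for t in sorted(set(durations)):
        # Events (e.g. attrition) observed exactly at time t.
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        # Everyone whose observation ends at t, with or without an event.
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            surviving *= 1 - deaths / at_risk
            curve.append((t, surviving))
        at_risk -= leaving
    return curve


# Toy data: 1 = event observed, 0 = censored (still employed when last seen).
durations = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects never lower the curve directly. They only shrink the at-risk count, which makes later events weigh more.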

In [29]:
rates = survival(data, 'Gender', 'Age', 'Attrition')
type(rates)

rates.iplot(kind = 'line', xTitle = 'Age', yTitle = "Retention probability", title = 'Probability of remaining employed')

## 2. Compare the plot above with one that plots employee retention rates by gender over the number of years the employee has been working for the company.

In [30]:
rates = survival(data, 'Gender', 'YearsAtCompany', 'Attrition')
type(rates)

rates.iplot(kind = 'line', xTitle = 'Years at company', yTitle = "Retention probability", title = 'Probability of retention at the company')


In [None]:

# We can say this is a company with low employee turnover, although among those who do leave,
# women tend to be younger than men.

## 3. Let's look at retention rate by gender from a third perspective - the number of years since the employee's last promotion. Generate and plot a survival curve showing this.

In [31]:
rates = survival(data, 'Gender', 'YearsSinceLastPromotion', 'Attrition')
type(rates)

rates.iplot(kind = 'line', xTitle = 'Years since last promotion', yTitle = "Retention probability", title = 'Probability of retention at the company')


Measured from the last promotion, men are more likely than women to leave as more years pass without a new promotion. For women, the curve flattens out after about 10 years, meaning they stop leaving. In other words, the longer it has been since the last promotion, the higher the attrition among men.

## 4. Let's switch to looking at retention rates from another demographic perspective: marital status. Generate and plot survival curves for the different marital statuses by number of years at the company.

In [22]:
data.columns


Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [34]:
rates = survival(data, 'MaritalStatus', 'YearsAtCompany', 'Attrition')
type(rates)

rates.iplot(kind = 'line', xTitle = 'Years at company', yTitle = "Retention probability", title = 'Probability of retention at the company by marital status')

Married employees are the most likely to leave, while divorced employees almost never leave.

## 5. Let's also look at the marital status curves by employee age. Generate and plot the survival curves showing retention rates by marital status and age.

In [24]:
# Marital status by employee age.
group_field = 'MaritalStatus'
time_field = 'Age'
event_field = 'Attrition'
result5 = survival(data, group_field, time_field, event_field)
result5.iplot(kind = 'line', xTitle ='Employee age' , yTitle = 'Retention Probability', title = 'Employee Retention')

## 6. Now that we have looked at the retention rates by gender and marital status individually, let's look at them together. 

Create a new field in the data set that concatenates marital status and gender, and then generate and plot a survival curve that shows the retention by this new field over the age of the employee.
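One way to build such a combined field is plain string concatenation, sketched below on made-up rows. The frame `toy` and the column name `status_gender` are hypothetical. Compared with storing tuples, string values give cleaner legend labels, because the provided `survival` function labels each curve with `str(i)`.

```python
import pandas as pd

# Made-up rows standing in for the attrition data.
toy = pd.DataFrame({
    'MaritalStatus': ['Single', 'Married', 'Divorced'],
    'Gender': ['Male', 'Female', 'Male'],
})

# Element-wise string concatenation of the two columns.
toy['status_gender'] = toy['MaritalStatus'] + ' ' + toy['Gender']

print(toy['status_gender'].tolist())
```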

In [35]:
data['genero_estado'] = list(zip(data['MaritalStatus'],data['Gender']))

rates = survival(data, 'genero_estado', 'Age', 'Attrition')

rates.iplot(kind = 'line', xTitle = 'Age (years)', yTitle = "Retention probability", title = 'Probability of retention at the company by marital status, gender, and age')


In general, single employees have a lower retention rate.

## 6. Let's find out how job satisfaction affects retention rates. Generate and plot survival curves for each level of job satisfaction by number of years at the company.

In [36]:
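# Note: this groups by EnvironmentSatisfaction. The exercise asks about job satisfaction,
# which is stored in the separate JobSatisfaction column.
rates = survival(data, 'EnvironmentSatisfaction', 'YearsAtCompany', 'Attrition')
rates.iplot(kind = 'line', xTitle = 'Years at company', yTitle = "Retention probability", title = 'Probability of retention at the company by environment satisfaction level')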


Only employees with the lowest environment satisfaction rating show noticeably higher attrition.

## 7. Let's investigate whether the department the employee works in has an impact on how long they stay with the company. Generate and plot survival curves showing retention by department and years the employee has worked at the company.

In [38]:
rates = survival(data, 'Department', 'YearsAtCompany', 'Attrition')
rates.iplot(kind = 'line', xTitle = 'Years at company', yTitle = "Retention probability", title = 'Probability of retention at the company by department')


Employees in the Sales department are the most likely to leave.

## 8. From the previous example, it looks like the sales department has the highest attrition. Let's drill down on this and look at what the survival curves for specific job roles within that department look like.

Filter the data set for just the sales department and then generate and plot survival curves by job role and the number of years at the company.

In [40]:
ventas = data[data['Department'] == 'Sales']

rates = survival(ventas, 'JobRole', 'YearsAtCompany', 'Attrition')

rates.iplot(kind = 'line', xTitle = 'Years at company', yTitle = "Retention probability", title = 'Probability of retention by job role within the Sales department')


## 9. Let's examine how compensation affects attrition.

- Use the `pd.qcut` method to bin the HourlyRate field into 5 different pay grade categories (Very Low, Low, Moderate, High, and Very High).
- Generate and plot survival curves showing employee retention by pay grade and age.
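As a quick illustration of the binning step, the sketch below applies `pd.qcut` to ten made-up hourly rates (the values are hypothetical, chosen only so the bins are easy to check by eye). With five quantile bins and ten distinct values, each pay grade should receive two observations.

```python
import pandas as pd

# Made-up hourly rates, already sorted for readability.
hourly_rate = pd.Series([30, 35, 42, 48, 55, 61, 70, 78, 88, 99])
labels = ['Very Low', 'Low', 'Moderate', 'High', 'Very High']

# qcut splits on quantiles, so bins hold (roughly) equal numbers of rows,
# unlike pd.cut, which splits the value range into equal-width intervals.
grades = pd.qcut(hourly_rate, 5, labels=labels)
counts = grades.value_counts()

print(grades.tolist())
print(counts)
```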

In [42]:
# qcut bins the data into equal-sized buckets (by count); here we ask for 5 buckets
# and then assign the labels passed in as a list.

pay_grade = ['Very Low', 'Low', 'Moderate', 'High','Very High']

data['Pay_Grade'] = pd.qcut(data['HourlyRate'],5,labels=pay_grade)

rates = survival(data, 'Pay_Grade', 'Age', 'Attrition')
rates.iplot(kind = 'line', xTitle = 'Age (years)', yTitle = "Retention probability", title = 'Probability of retention at the company by pay grade and age')


In [43]:
data[['Pay_Grade','HourlyRate']].head()


| | Pay_Grade | HourlyRate |
|---|---|---|
| 0 | Very High | 94 |
| 1 | Moderate | 61 |
| 2 | Very High | 92 |
| 3 | Low | 56 |
| 4 | Very Low | 40 |


## 10. Finally, let's take a look at how the demands of the job impact employee attrition.

- Create a new field whose values are 'Overtime' or 'Regular Hours' depending on whether there is a Yes or a No in the OverTime field.
- Create a new field that concatenates that field with the BusinessTravel field.
- Generate and plot survival curves showing employee retention based on these conditions and employee age.
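The first two steps can be sketched on a few made-up rows as follows. The frame `toy`, the travel values, and the column names `hours` and `hours_travel` are hypothetical. `np.where` picks one of two labels per row, and string concatenation then joins that label with the travel category.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the attrition data.
toy = pd.DataFrame({
    'OverTime': ['Yes', 'No', 'Yes'],
    'BusinessTravel': ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'],
})

# Step 1: map Yes/No to descriptive labels.
toy['hours'] = np.where(toy['OverTime'] == 'Yes', 'Overtime', 'Regular Hours')

# Step 2: combine the label with the travel category into one grouping field.
toy['hours_travel'] = toy['hours'] + ' / ' + toy['BusinessTravel']

print(toy['hours_travel'].tolist())
```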

In [46]:
import numpy as np

In [44]:
data.columns


Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager', 'genero_estado', 'Pay_Grade'],
      dtype='object')

In [48]:
data['horas_ex_re'] = np.where(data['OverTime'] == 'Yes', 'Overtime', 'Regular Hours')

data['concatenado'] = list(zip(data['horas_ex_re'],data['BusinessTravel']))

rates = survival(data, 'concatenado', 'Age', 'Attrition')

rates.iplot(kind = 'line', xTitle = 'Age (years)', yTitle = "Retention probability", title = 'Probability of retention at the company by age and working hours / business travel')


The less overburdened the employee, the lower the attrition.

In [None]:
# The LESS overburdened the employee, the lower the attrition.
# The more overburdened the employee, the higher the attrition.

In [49]:
data['OverTime'].head()


0    Yes
1     No
2    Yes
3    Yes
4     No
Name: OverTime, dtype: object