# Employee Resignation and Dissatisfaction
As data analysts, we have been asked to answer the following questions and report the findings to the relevant stakeholders:
* Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
* Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

The following datasets are available for analysis:
* dete_survey.csv  

| Data | Description |
| --- | --- |
| ID | An id used to identify the participant of the survey |
| SeparationType | The reason why the person's employment ended |
| Cease Date | The year or month the person's employment ended |
| DETE Start Date | The year the person began employment with the DETE | 
  
* tafe_survey.csv  

| Data | Description |
| --- | --- |
| Record ID | An id used to identify the participant of the survey |
| Reason for ceasing employment | The reason why the person's employment ended |
| LengthofServiceOverall | Overall Length of Service at Institute (in years) |

## Importing libraries and data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
dete_survey = pd.read_csv('/kaggle/input/employee-exit-survey/dete_survey.csv')
tafe_survey = pd.read_csv('/kaggle/input/employee-exit-survey/tafe_survey.csv')

## Preliminary Analysis
* **dete_survey**

In [None]:
dete_survey.head()

In [None]:
dete_survey.info()

In [None]:
dete_survey.isnull().sum().sort_values(ascending = False)

* **tafe_survey**

In [None]:
tafe_survey.head()

In [None]:
tafe_survey.info()

In [None]:
tafe_survey.isnull().sum().sort_values(ascending = False)

Re-reading the data so that 'Not Stated' values are parsed as NaN

In [None]:
dete_survey = pd.read_csv('/kaggle/input/employee-exit-survey/dete_survey.csv', na_values='Not Stated')
tafe_survey = pd.read_csv('/kaggle/input/employee-exit-survey/tafe_survey.csv', na_values='Not Stated')
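To illustrate what `na_values` changes, here is a minimal sketch on a tiny in-memory CSV. The two rows are made up for illustration and are not taken from the survey files.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the second cease date is recorded as 'Not Stated'.
csv_text = "ID,Cease Date\n1,08/2012\n2,Not Stated\n"

# Default parsing keeps 'Not Stated' as an ordinary string.
raw = pd.read_csv(io.StringIO(csv_text))
# With na_values, the same entry is read as NaN.
parsed = pd.read_csv(io.StringIO(csv_text), na_values='Not Stated')

raw_missing = int(raw['Cease Date'].isnull().sum())
parsed_missing = int(parsed['Cease Date'].isnull().sum())
print(raw_missing, parsed_missing)
```

Only the second read counts the 'Not Stated' entry as missing, which is why the `info()` output is checked again after re-reading.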

In [None]:
dete_survey.info()

In [None]:
tafe_survey.info()

## Removing unnecessary columns 
The columns dropped below are not needed for our analysis

In [None]:
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49],axis = 1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66],axis = 1)
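Because the columns are dropped by position, it can help to preview which names a slice covers before dropping it. A minimal sketch on a toy dataframe follows; the column names are hypothetical and the slice `1:3` is chosen only for the example.

```python
import pandas as pd

# Toy dataframe with hypothetical column names.
toy = pd.DataFrame({
    'ID': [1, 2],
    'Q1': ['A', 'SA'],
    'Q2': ['N', 'D'],
    'Age': ['21-25', '56-60'],
})

# Inspect the names covered by the positional slice before dropping them.
to_drop = toy.columns[1:3]
print(list(to_drop))

# Drop those columns; the original dataframe is left unchanged.
trimmed = toy.drop(to_drop, axis=1)
print(list(trimmed.columns))
```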

## Updating column names

In [None]:
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ', '_')
dete_survey_updated.columns

In [None]:
mapping = {
    'Record ID': 'id',
    'CESSATION YEAR': 'cease_date',
    'Reason for ceasing employment': 'separationtype',
    'Gender. What is your Gender?': 'gender',
    'CurrentAge. Current Age': 'age',
    'Employment Type. Employment Type': 'employment_status',
    'Classification. Classification': 'position',
    'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
    'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service',
}
tafe_survey_updated = tafe_survey_updated.rename(mapping, axis = 1)
tafe_survey_updated.columns

## Filtering the data
If we look at the unique values in the separationtype column of each dataframe, we'll see that each contains several different separation types. For this project, we'll only analyze survey respondents who resigned, that is, those whose separation type contains the string 'Resignation'.

In [None]:
dete_survey_updated['separationtype'].value_counts()

In [None]:
dete_survey_updated['separationtype'] = dete_survey_updated['separationtype'].str.split('-').str[0]
dete_survey_updated['separationtype'].value_counts()

In [None]:
tafe_survey_updated['separationtype'].value_counts()

In [None]:
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()

In [None]:
dete_resignations.head()

In [None]:
tafe_resignations.head()
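The cells above normalise the DETE values by splitting on '-' and then compare for equality. An alternative sketch that matches the "contains the string 'Resignation'" wording directly uses `str.contains`. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical separation types, including a missing entry.
toy = pd.Series([
    'Resignation-Other reasons',
    'Age Retirement',
    'Resignation',
    np.nan,
    'Contract Expired',
])

# na=False makes missing values count as "not a resignation".
is_resignation = toy.str.contains('Resignation', na=False)
resignations = toy[is_resignation]
print(resignations.tolist())
```

Either approach selects the same rows as long as no other separation type happens to contain the word 'Resignation'.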

## Correcting date field values

In [None]:
dete_resignations['cease_date'].value_counts()

In [None]:
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.split('/').str[-1].astype("float")

In [None]:
dete_resignations['cease_date'].value_counts()
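As a quick check of the year extraction used above, this sketch applies the same split-and-cast chain to a few made-up cease dates in both 'MM/YYYY' and 'YYYY' form.

```python
import numpy as np
import pandas as pd

# Hypothetical cease dates: month/year, year only, and a missing value.
dates = pd.Series(['05/2012', '2013', np.nan])

# Keep the part after the last '/', then cast to float so NaN can be kept.
years = dates.str.split('/').str[-1].astype('float')
print(years.tolist())
```

Casting to float rather than int is what allows the missing entry to stay as NaN.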

In [None]:
dete_resignations['dete_start_date'].value_counts().sort_values()

In [None]:
tafe_resignations['cease_date'].value_counts().sort_values()

There appear to be no outliers in the date fields

## Creating new field
A new field, 'institute_service', can be added to the dete_resignations dataset by subtracting the start date from the cease date. It gives the years of service.

In [None]:
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations['institute_service'].head(10)

## Categorizing employee dissatisfaction
The following columns can be used to gauge employee dissatisfaction in each dataframe.  

tafe_survey_updated:
* Contributing Factors. Dissatisfaction
* Contributing Factors. Job Dissatisfaction  

dete_survey_updated:
* job_dissatisfaction
* dissatisfaction_with_the_department
* physical_work_environment
* lack_of_recognition
* lack_of_job_security
* work_location
* employment_conditions
* work_life_balance
* workload

In [None]:
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()

In [None]:
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()

Function to convert the TAFE responses to True, False, or NaN

In [None]:
def label(x):
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True
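A small sketch of how `label` treats the three kinds of input it can receive. The function is repeated so the example runs on its own, and the sample responses are made up for illustration.

```python
import numpy as np
import pandas as pd

def label(x):
    # '-' means the factor was not selected by the respondent.
    if x == '-':
        return False
    # Missing answers stay missing.
    elif pd.isnull(x):
        return np.nan
    # Any other text means the factor was selected.
    else:
        return True

samples = ['-', 'Job Dissatisfaction', np.nan]
results = [label(s) for s in samples]
print(results)
```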

In [None]:
tafe_factors = ['Contributing Factors. Dissatisfaction',
                'Contributing Factors. Job Dissatisfaction']
dete_factors = ['job_dissatisfaction',
        'dissatisfaction_with_the_department', 'physical_work_environment',
       'lack_of_recognition', 'lack_of_job_security', 'work_location',
       'employment_conditions', 'work_life_balance',
       'workload']

In [None]:
# Pass axis by keyword: recent pandas versions no longer accept it positionally in any()
tafe_resignations['dissatisfied'] = tafe_resignations[tafe_factors].applymap(label).any(axis=1, skipna=False)
tafe_resignations_up = tafe_resignations.copy()

tafe_resignations_up['dissatisfied'].value_counts(dropna=False)

In [None]:
# Pass axis by keyword: recent pandas versions no longer accept it positionally in any()
dete_resignations['dissatisfied'] = dete_resignations[dete_factors].any(axis=1, skipna=False)
dete_resignations_up = dete_resignations.copy()

dete_resignations_up['dissatisfied'].value_counts(dropna=False)
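The row-wise `any` is what turns several factor columns into a single flag: a respondent counts as dissatisfied if at least one factor is True. A minimal sketch on a toy dataframe with no missing values (the rows are made up, and the handling of NaN with `skipna=False` is deliberately not exercised here):

```python
import pandas as pd

# Three hypothetical respondents and three of the DETE factor columns.
toy = pd.DataFrame({
    'job_dissatisfaction': [False, True, False],
    'workload': [False, False, True],
    'work_location': [False, False, False],
})
factors = ['job_dissatisfaction', 'workload', 'work_location']

# True when at least one factor in the row is True.
toy['dissatisfied'] = toy[factors].any(axis=1)
print(toy['dissatisfied'].tolist())
```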

## Merging the dataframes
A new column, 'institute', is added to record which survey each row came from

In [None]:
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'

In [None]:
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True)
combined.notnull().sum().sort_values()

Dropping columns that have fewer than 500 non-null values

In [None]:
combined_updated = combined.dropna(thresh = 500, axis =1).copy()
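To show how `thresh` works with `axis=1`, here is a minimal sketch on a toy dataframe. The data and the threshold of 3 are made up for the example; the notebook itself uses 500.

```python
import numpy as np
import pandas as pd

# 'id' has 4 non-null values, 'age' has 3, 'rare' has only 1.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'age': ['21-25', np.nan, '41-45', '56-60'],
    'rare': [np.nan, np.nan, np.nan, 'x'],
})

# Keep only the columns with at least 3 non-null values.
kept = toy.dropna(thresh=3, axis=1)
print(list(kept.columns))
```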

## Cleaning the institute_service column

In [None]:
combined_updated['institute_service'].value_counts(dropna=False)

In [None]:
expression = r'(\d+)'
combined_updated['institute_service_up'] = combined_updated['institute_service'].astype('str').str.extract(expression, expand = True)
combined_updated['institute_service_up'] = combined_updated['institute_service_up'].astype('float')

combined_updated['institute_service_up'].value_counts()
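The pattern `(\d+)` keeps only the first run of digits in each value. The sketch below applies it to a few made-up entries in the style of the combined column (text ranges alongside plain numbers) to show what that implies. It uses `expand=False` so that the result is a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical mix of range labels, a float, and a missing value.
raw = pd.Series(['Less than 1 year', '1-2', '11-20', 'More than 20 years', 5.0, np.nan])

# Cast to str, take the first run of digits, then convert to float.
years = raw.astype('str').str.extract(r'(\d+)', expand=False).astype('float')
print(years.tolist())
```

Note that a range such as '11-20' is reduced to its lower bound, and the string 'nan' has no digits, so missing values stay missing.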

Function to group the years of service into categories

In [None]:
def transform_service(val):
    if val >= 11:
        return "Veteran"
    elif 7 <= val < 11:
        return "Established"
    elif 3 <= val < 7:
        return "Experienced"
    elif pd.isnull(val):
        return np.nan
    else:
        return "New"

In [None]:
combined_updated['service_cat'] = combined_updated['institute_service_up'].apply(transform_service)
combined_updated['service_cat'].value_counts()
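As a cross-check of the category boundaries, the sketch below compares `transform_service` with an equivalent binning built with `pd.cut`. The `pd.cut` version is an alternative shown for illustration, not the method used in this notebook, and the sample values are made up.

```python
import numpy as np
import pandas as pd

def transform_service(val):
    # Same function as above, repeated so the example runs on its own.
    if val >= 11:
        return "Veteran"
    elif 7 <= val < 11:
        return "Established"
    elif 3 <= val < 7:
        return "Experienced"
    elif pd.isnull(val):
        return np.nan
    else:
        return "New"

years = pd.Series([0, 2, 3, 6, 7, 10, 11, 25, np.nan])
by_function = years.apply(transform_service)

# Left-closed bins: [-inf, 3), [3, 7), [7, 11), [11, inf)
by_cut = pd.cut(
    years,
    bins=[-np.inf, 3, 7, 11, np.inf],
    right=False,
    labels=["New", "Experienced", "Established", "Veteran"],
)

function_labels = by_function.iloc[:8].tolist()
cut_labels = [str(c) for c in by_cut.iloc[:8]]
print(function_labels)
```

Both approaches leave the missing value as NaN.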

## Imputing missing values

In [None]:
combined_updated['dissatisfied'].value_counts(dropna=False)

In [None]:
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)

In [None]:
dis_pct = combined_updated.pivot_table(index='service_cat', values='dissatisfied')

In [None]:
dis_pct.plot(kind='bar', rot=30)
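`pivot_table` aggregates with the mean by default, and the mean of a boolean column is the share of True values, so each bar shows the proportion of dissatisfied resignations in that service category. A minimal sketch with made-up rows:

```python
import pandas as pd

# Hypothetical respondents: 1 of 2 'New' and 3 of 4 'Veteran' are dissatisfied.
toy = pd.DataFrame({
    'service_cat': ['New', 'New', 'Veteran', 'Veteran', 'Veteran', 'Veteran'],
    'dissatisfied': [True, False, True, True, False, True],
})

# Default aggregation is the mean, i.e. the fraction of True values per group.
share = toy.pivot_table(index='service_cat', values='dissatisfied')
print(share)
```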