# Clean and Analyze Employee Exit Surveys

## Introduction

dataset
- exit surveys of employees from Queensland, Australia
    - Department of Education, Training and Employment (DETE)
    - Technical and Further Education (TAFE)
    - encoded to UTF-8

project goal
- Are employees who worked for the institutes for only a short period of time resigning due to some kind of dissatisfaction?
- What about the employees who have been there longer?
- Are younger employees resigning due to some kind of dissatisfaction?
- What about older employees?

- combine the results of both surveys to answer these questions
- both surveys used the same template, but one of them customized some of the answers
- no data dictionary available

skills:
- apply(), map()
- fillna(), dropna(), drop()
- melt()
- concat(), merge()

In [5]:
import numpy as np
import pandas as pd

## 1. The DETE and TAFE Survey Datasets

`dete_survey.csv`
* `ID` participant ID
* `SeparationType` reason why employment ended
* `Cease Date` year or month employment ended
* `DETE Start Date` year employment started

`tafe_survey.csv`
* `Record ID` participant ID
* `Reason for ceasing employment`
* `LengthofServiceOverall. Overall Length of Service at Institute (in years)` employment in years

In [6]:
# read in and preview datasets
dete_raw = pd.read_csv('/Users/slp22/code/dataquest projects/dete_survey.csv')
tafe_raw = pd.read_csv('/Users/slp22/code/dataquest projects/tafe_survey.csv')

### DETE


In [7]:
dete_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   ID                                   822 non-null    int64 
 1   SeparationType                       822 non-null    object
 2   Cease Date                           822 non-null    object
 3   DETE Start Date                      822 non-null    object
 4   Role Start Date                      822 non-null    object
 5   Position                             817 non-null    object
 6   Classification                       455 non-null    object
 7   Region                               822 non-null    object
 8   Business Unit                        126 non-null    object
 9   Employment Status                    817 non-null    object
 10  Career move to public sector         822 non-null    bool  
 11  Career move to private sector        822 non-

In [8]:
# dete_raw.info()

In [9]:
# dete_raw.columns

In [10]:
# dete_raw.isnull()

In [11]:
# dete_raw['SeparationType'].value_counts()

In [12]:
# dete_raw['Position'].value_counts()

In [13]:
# dete_raw['Classification'].value_counts()

**`dete_raw`**
- RangeIndex: 822 entries, 0 to 821
- Data columns (total 56 columns)
- Dtype: ID=int, others=object, bool
- Low non-null counts: Business Unit, Aboriginal, Torres Strait, South Sea, Disability, NESB

### TAFE

In [14]:
tafe_raw.head()

Unnamed: 0,Record ID,Institute,WorkArea,CESSATION YEAR,Reason for ceasing employment,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,...,Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?,Workplace. Topic:Does your workplace promote and practice the principles of employment equity?,Workplace. Topic:Does your workplace value the diversity of its employees?,Workplace. Topic:Would you recommend the Institute as an employer to others?,Gender. What is your Gender?,CurrentAge. Current Age,Employment Type. Employment Type,Classification. Classification,LengthofServiceOverall. Overall Length of Service at Institute (in years),LengthofServiceCurrent. Length of Service at current workplace (in years)
0,6.34133e+17,Southern Queensland Institute of TAFE,Non-Delivery (corporate),2010.0,Contract Expired,,,,,,...,Yes,Yes,Yes,Yes,Female,26 30,Temporary Full-time,Administration (AO),1-2,1-2
1,6.341337e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Retirement,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
2,6.341388e+17,Mount Isa Institute of TAFE,Delivery (teaching),2010.0,Retirement,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,...,Yes,Yes,Yes,Yes,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4


In [15]:
# tafe_raw.info()

In [16]:
# tafe_raw.columns

In [17]:
# tafe_raw.isnull()

In [18]:
# tafe_raw['Reason for ceasing employment'].value_counts()

In [19]:
# tafe_raw['Employment Type. Employment Type'].value_counts()

In [20]:
# tafe_raw['Classification. Classification'].value_counts()

**`tafe_raw`**
- Record ID in scientific notation
- Column names are long, descriptive, repetitive
- RangeIndex: 702 entries, 0 to 701
- Data columns (total 72 columns)
- Dtype: ID=int, others=object, cessation year=float
- Non-Null: roughly 400-500 of 702 rows per column

## 2. Identify Missing Values and Drop Unnecessary Columns

In [21]:
# dete_raw = pd.read_csv('/dete_survey.csv', na_values="Not Stated")
dete_raw = pd.read_csv('/Users/slp22/code/dataquest projects/dete_survey.csv')
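
The commented-out `na_values` option above can be illustrated with a minimal sketch. The in-memory CSV below is a made-up stand-in for `dete_survey.csv` (the rows are hypothetical, not real survey data); it only shows how `read_csv` would turn the literal string `Not Stated` into `NaN` at load time.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for dete_survey.csv (not real survey data)
sample_csv = io.StringIO(
    "ID,Cease Date,DETE Start Date\n"
    "1,08/2012,1984\n"
    "2,Not Stated,Not Stated\n"
    "3,2013,2011\n"
)

# na_values tells read_csv to parse the literal string "Not Stated" as NaN
sample = pd.read_csv(sample_csv, na_values="Not Stated")

# Count the missing values per column
print(sample.isnull().sum())
```

Reading the file this way would make the later `'Not Stated'` branch in the date clean-up unnecessary, at the cost of the start-year column being loaded as float.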

In [22]:
#dete_raw drop columns [28:49] axis=1
dete = dete_raw.drop(dete_raw.columns[28:49],axis=1)
# dete.head()

In [23]:
#tafe drop columns [17:66] axis=1
tafe = tafe_raw.drop(tafe_raw.columns[17:66], axis=1)
# tafe.head()

#### Dropped columns from `dete` [28:49] and `tafe` [17:66] that are not relevant to this analysis, which makes the data easier to work with.

## 3. Clean Column Names

In [24]:
dete_col = dete.columns
# dete_col

In [25]:
tafe_col = tafe.columns
# tafe_col

### 🧹 functions to clean up text

In [26]:
# function to make each column name lowercase
def lower(cols):
    lower_cols = []
    for c in cols:
        lower_cols.append(c.lower())
    return lower_cols

In [27]:
# function to remove trailing whitespace from end of strings
def spaceless(cols):
    spaceless_cols = []
    for c in cols:
        spaceless_cols.append(c.rstrip())
    return spaceless_cols

In [28]:
# function to replace spaces with underscores and remove periods and hyphens
def replace_punctuation(cols):
    underscore_cols = []
    for c in cols:
        new_c = c.replace(" ", "_").replace(".", "").replace("-", "")
        underscore_cols.append(new_c)
    return underscore_cols

### functions not used

#### x apply clean up functions #1 (nested functions)

In [29]:
# from types import new_class
def clean_up(col):
    new_cols = []

# function to make each column name lowercase

#   # function to remove trailing whitespace from end of strings
#     def spaceless(new_cols):
#         for c in new_cols:
#             new_cols.append(c.rstrip())
#         return new_cols

#   # function to replace space with underscore
#     def replace_punctuation(new_cols):
#         for c in new_cols:
#             new_c = c.replace(" ", "_").replace(".", "").replace("-", "")
#             new_cols.append(new_c)
#         return new_cols
    # return lower(new_cols


In [30]:
# # higher-order function lesson

# def generate_age_checker(min_age):
#     def check_age(age):
#         return age > min_age
#     return check_age

# check_min_18 = generate_age_checker(18)
# check_min_21 = generate_age_checker(21)

# print(check_min_18(20))
# print(check_min_21(20))
    

#### apply clean up functions #2 (sequential)

In [31]:
# # apply lower, spaceless, and replace_punctuation functions for tafe_col
# lower_tafe = lower(tafe_col)
# spaceless_tafe = spaceless(lower_tafe)
# clean_tafe_cols = replace_punctuation(spaceless_tafe)

In [32]:
# # apply lower, spaceless, and replace_punctuation functions for dete_col
# lower_dete = lower(dete_col)
# spaceless_dete = spaceless(lower_dete)
# clean_dete_cols = replace_punctuation(spaceless_dete)

#### apply clean up functions #3 (nest func)

In [33]:
# replace_punctuation(spaceless(lower(dete_col)))

In [34]:
# replace_punctuation(spaceless(lower(tafe_col)))

### ✅ apply clean up functions #4 (call func)

In [35]:
# best practice
def clean_up(col):
    lowercased = lower(col)
    without_spaces = spaceless(lowercased)
    without_punctuation = replace_punctuation(without_spaces)
    return without_punctuation
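
As an alternative to the three loop-based helpers, the same steps can be chained with pandas' vectorized string methods on the column index. This is a sketch, not the notebook's method: `clean_columns` and the example column names are hypothetical, and the order of operations (lowercase, strip trailing whitespace, replace punctuation) is assumed to mirror `clean_up()`.

```python
import pandas as pd

def clean_columns(columns):
    """Lowercase, strip trailing whitespace, replace spaces with '_', drop '.' and '-'."""
    return (columns.str.lower()
                   .str.rstrip()
                   .str.replace(" ", "_", regex=False)
                   .str.replace(".", "", regex=False)
                   .str.replace("-", "", regex=False))

# Hypothetical column names, chosen to exercise each rule
example = pd.DataFrame(columns=["Cease Date", "Work life balance ", "Maternity/family"])
example.columns = clean_columns(example.columns)
print(list(example.columns))
```

Passing `regex=False` keeps `.` from being interpreted as a regular-expression wildcard.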

In [36]:
new_dete_col = clean_up(dete_col)
dete.columns = new_dete_col
dete.columns

Index(['id', 'separationtype', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb'],
      dtype='object')

In [37]:
# tafe.columns

In [38]:
tafe.rename({'Record ID': 'id',
             'CESSATION YEAR': 'cease_date',
             'Reason for ceasing employment': 'separationtype',
             'Gender. What is your Gender?': 'gender',
             'CurrentAge. Current Age': 'age',
             'Employment Type. Employment Type': 'employment_status',
             'Classification. Classification': 'position',
             'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
             'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}, 
            axis='columns',
           inplace=True)

In [39]:
# tafe.head()

#### Renamed the `tafe` columns to short names that match `dete`, making them easier to reference.

## 4. Filter the Data

*Filter data to answer*
- Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? 
- What about employees who have been at the job longer?

In [40]:
tafe['separationtype'].unique() #'Resignation'

array(['Contract Expired', 'Retirement', 'Resignation',
       'Retrenchment/ Redundancy', 'Termination', 'Transfer', nan],
      dtype=object)

In [41]:
dete['separationtype'].unique() #'Resignation-Other reasons'
                                #'Resignation-Other employer'
                                #'Resignation-Move overseas/interstate'

array(['Ill Health Retirement', 'Voluntary Early Retirement (VER)',
       'Resignation-Other reasons', 'Age Retirement',
       'Resignation-Other employer',
       'Resignation-Move overseas/interstate', 'Other',
       'Contract Expired', 'Termination'], dtype=object)

### `separationtype`

- `dete`
    - `Resignation-Other reasons`
    - `Resignation-Other employer`
    - `Resignation-Move overseas/interstate`
- `tafe`
    - `Resignation`

In [42]:
# #copy datasets
# dete.copy()
# tafe.copy()

print('dete.shape', dete.shape)
print('tafe.shape', tafe.shape)

dete.shape (822, 35)
tafe.shape (702, 23)


In [43]:
# Filter for resignation types
dete_resignation = dete.loc[(dete['separationtype'] == 'Resignation-Other reasons') | 
                             (dete['separationtype'] =='Resignation-Other employer') | 
                             (dete['separationtype'] == 'Resignation-Move overseas/interstate')]
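
The chained `|` comparisons above can also be written with `Series.isin()`. The sketch below uses a small made-up frame (`toy`) rather than the real `dete` data; the three resignation labels are the ones listed in the `separationtype` output.

```python
import pandas as pd

# Hypothetical rows; only the separationtype labels come from the survey output
toy = pd.DataFrame({
    "separationtype": [
        "Resignation-Other reasons",
        "Age Retirement",
        "Resignation-Other employer",
        "Termination",
        "Resignation-Move overseas/interstate",
    ]
})

resignation_types = [
    "Resignation-Other reasons",
    "Resignation-Other employer",
    "Resignation-Move overseas/interstate",
]

# isin() builds one boolean mask that is True for any of the listed labels
toy_resignation = toy[toy["separationtype"].isin(resignation_types)]
print(toy_resignation)
```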

In [44]:
dete_resignation.head(2)

Unnamed: 0,id,separationtype,cease_date,dete_start_date,role_start_date,position,classification,region,business_unit,employment_status,...,work_life_balance,workload,none_of_the_above,gender,age,aboriginal,torres_strait,south_sea,disability,nesb
3,4,Resignation-Other reasons,05/2012,2005,2006,Teacher,Primary,Central Queensland,,Permanent Full-time,...,False,False,False,Female,36-40,,,,,
5,6,Resignation-Other reasons,05/2012,1994,1997,Guidance Officer,,Central Office,Education Queensland,Permanent Full-time,...,False,False,False,Female,41-45,,,,,


In [45]:
# Filter for resignation
tafe_resignation = tafe[tafe['separationtype'] == 'Resignation']

In [46]:
tafe_resignation.head(2)

Unnamed: 0,id,Institute,WorkArea,cease_date,separationtype,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,...,Contributing Factors. Study,Contributing Factors. Travel,Contributing Factors. Other,Contributing Factors. NONE,gender,age,employment_status,position,institute_service,role_service
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,...,-,Travel,-,-,,,,,,
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,...,-,-,-,-,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4


In [47]:
print('dete_resignation.shape', dete_resignation.shape)
print('tafe_resignation.shape', tafe_resignation.shape)

dete_resignation.shape (311, 35)
tafe_resignation.shape (340, 23)


#### Filtered for resignations only to answer the questions; kept 651 of 1,524 rows (311 DETE, 340 TAFE), dropping about 57% as irrelevant.

## 5. Verify the Data

In [48]:
# function to clean up date columns by extracting the year from MM/YYYY or replacing 'Not Stated' with 0. 
def extract_year(cease_date):
    years = []
    for i in cease_date:
        if i == 'Not Stated':
            years.append(0) 
        elif len(i) == 7:
            years.append(int(i[3:7]))
        else:
            years.append(int(i))

    return years
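
A vectorized variant of this clean-up is sketched below. It is an alternative under stated assumptions, not the function used in this notebook: `extract_year_nan` is a hypothetical name, it assumes every valid value ends in a four-digit year, and it returns `NaN` instead of 0 for values such as `'Not Stated'`.

```python
import pandas as pd

def extract_year_nan(dates):
    """Return the trailing four-digit year as a float Series; unparseable values become NaN."""
    last_four = dates.astype(str).str[-4:]
    # errors="coerce" converts anything non-numeric (e.g. "ated") to NaN
    return pd.to_numeric(last_four, errors="coerce")

# Hypothetical sample covering the three formats seen in cease_date
sample_dates = pd.Series(["05/2012", "2013", "Not Stated"])
years = extract_year_nan(sample_dates)
print(years)
```

Using `NaN` rather than 0 keeps missing dates out of later arithmetic, since any subtraction involving `NaN` stays `NaN`.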

### Dete Resignation Years

In [49]:
# before `cease_date`
dete_resignation['cease_date'].value_counts().sort_index(ascending=True)

01/2014        22
05/2012         2
05/2013         2
06/2013        14
07/2006         1
07/2012         1
07/2013         9
08/2013         4
09/2010         1
09/2013        11
10/2013         6
11/2013         9
12/2013        17
2010            1
2012          126
2013           74
Not Stated     11
Name: cease_date, dtype: int64

In [50]:
# run `cease_date` through extract_year() function
extract_year(dete_resignation['cease_date'])

[2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2013,
 2012,
 2013,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2013,
 2013,
 2012,
 2012,
 2012,
 2012,
 2012,
 2013,
 2012,
 2012,
 2012,
 2012,
 2012,
 2013,
 2013,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2013,
 2013,
 2013,
 2013,
 2012,
 2012,
 2012,
 2013,
 2012,
 2012,
 2012,
 2012,
 2012,
 2013,
 2012,
 2012,
 2012,
 2013,
 2013,
 2013,
 2013,
 2012,
 2013,
 2013,
 2012,
 2012,
 2012,
 2012,
 2012,
 2012,
 2013,
 2012,
 2013,
 2013,
 2013,
 2012,
 2012,
 2013,
 2012,
 2013,

In [51]:
dete_resignation['cease_date'] = extract_year(dete_resignation['cease_date'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dete_resignation['cease_date'] = extract_year(dete_resignation['cease_date'])
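
The warning above appears because `dete_resignation` was created by filtering another dataframe, and pandas cannot tell whether the assignment is meant to modify the original. One common way to avoid it is to take an explicit `.copy()` when filtering. The sketch below uses a made-up frame (`toy`) to show the pattern; it is an illustration, not a change to the cells above.

```python
import pandas as pd

# Hypothetical rows standing in for the survey data
toy = pd.DataFrame({
    "separationtype": ["Resignation", "Retirement", "Resignation"],
    "cease_date": ["05/2012", "2013", "2012"],
})

# .copy() makes the filtered frame independent of `toy`,
# so assigning to one of its columns is unambiguous
toy_resignation = toy[toy["separationtype"] == "Resignation"].copy()
toy_resignation["cease_date"] = toy_resignation["cease_date"].str[-4:].astype(int)

print(toy_resignation)
print(toy)  # the original frame is unchanged
```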


In [52]:
# after `cease_date`
dete_resignation['cease_date'].value_counts().sort_index(ascending=False)

2014     22
2013    146
2012    129
2010      2
2006      1
0        11
Name: cease_date, dtype: int64

In [53]:
# before `dete_start_date`
dete_resignation['dete_start_date'].value_counts().sort_index(ascending=True)

1963           1
1971           1
1972           1
1973           1
1974           2
1975           1
1976           2
1977           1
1980           5
1982           1
1983           2
1984           1
1985           3
1986           3
1987           1
1988           4
1989           4
1990           5
1991           4
1992           6
1993           5
1994           6
1995           4
1996           6
1997           5
1998           6
1999           8
2000           9
2001           3
2002           6
2003           6
2004          14
2005          15
2006          13
2007          21
2008          22
2009          13
2010          17
2011          24
2012          21
2013          10
Not Stated    28
Name: dete_start_date, dtype: int64

In [54]:
# run `dete_start_date` through extract_year() function
extract_year(dete_resignation['dete_start_date'])

[2005,
 1994,
 2009,
 1997,
 2009,
 1998,
 2007,
 0,
 1982,
 1980,
 1997,
 1973,
 1995,
 2005,
 2003,
 2006,
 2011,
 0,
 1977,
 1974,
 2011,
 1976,
 2009,
 2009,
 1993,
 2008,
 2003,
 2011,
 2006,
 2011,
 2007,
 1986,
 2002,
 2011,
 2006,
 2002,
 2004,
 0,
 2008,
 2004,
 2007,
 1997,
 1976,
 2010,
 2012,
 1980,
 2012,
 2007,
 1994,
 2004,
 0,
 2007,
 2003,
 2011,
 2003,
 2005,
 2012,
 1998,
 2005,
 2006,
 1995,
 1989,
 0,
 2005,
 2008,
 2006,
 2007,
 1986,
 1999,
 1996,
 2009,
 1994,
 2009,
 2007,
 2011,
 2006,
 2000,
 2008,
 2005,
 2012,
 2007,
 2008,
 2011,
 2009,
 2011,
 2011,
 2011,
 2010,
 1991,
 2011,
 1992,
 2007,
 2007,
 2012,
 2012,
 0,
 0,
 0,
 1980,
 2006,
 1996,
 0,
 2005,
 2009,
 2001,
 1999,
 2001,
 1989,
 2012,
 2011,
 2007,
 2011,
 2000,
 2008,
 2011,
 2012,
 2005,
 0,
 1988,
 2008,
 1980,
 2007,
 1992,
 2003,
 0,
 2010,
 2012,
 2012,
 1992,
 1999,
 2007,
 1990,
 0,
 2008,
 1994,
 2007,
 1985,
 2000,
 2007,
 1993,
 0,
 0,
 1991,
 0,
 2006,
 0,
 2008,
 2008,
 1997,
 2011

In [55]:
dete_resignation['dete_start_date'] = extract_year(dete_resignation['dete_start_date'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dete_resignation['dete_start_date'] = extract_year(dete_resignation['dete_start_date'])


In [56]:
# after `dete_start_date`
dete_resignation['dete_start_date'].value_counts().sort_index(ascending=True)

0       28
1963     1
1971     1
1972     1
1973     1
1974     2
1975     1
1976     2
1977     1
1980     5
1982     1
1983     2
1984     1
1985     3
1986     3
1987     1
1988     4
1989     4
1990     5
1991     4
1992     6
1993     5
1994     6
1995     4
1996     6
1997     5
1998     6
1999     8
2000     9
2001     3
2002     6
2003     6
2004    14
2005    15
2006    13
2007    21
2008    22
2009    13
2010    17
2011    24
2012    21
2013    10
Name: dete_start_date, dtype: int64

### Tafe Resignation Years

In [57]:
# `cease_date`
tafe_resignation['cease_date'].value_counts().sort_index(ascending=True)

2009.0      2
2010.0     68
2011.0    116
2012.0     94
2013.0     55
Name: cease_date, dtype: int64

#### Observations:
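- Two columns (`cease_date` and `dete_start_date` in `dete_resignation`) needed the date format cleaned up.
- The TAFE `cease_date` column already contains only years, stored as floats.
- Apart from the 0 placeholder for 'Not Stated', none of the years are before 1940 or after 2014.
- The two datasets overlap in only a few years, so the analysis may need to be limited to 2012-2014.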

## 6. Create a New Column

In [58]:
dete_resignation['institute_service'] = dete_resignation['cease_date'] - dete_resignation['dete_start_date']
dete_resignation['institute_service'].value_counts().sort_index()
# dete_resignation['institute_service'].head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dete_resignation['institute_service'] = dete_resignation['cease_date'] - dete_resignation['dete_start_date']


-2012     1
-2011     1
-2007     1
-2006     1
-2005     1
-2000     1
-1995     1
-1990     1
-1987     1
-1984     1
 0       21
 1       22
 2       14
 3       20
 4       16
 5       23
 6       17
 7       13
 8        8
 9       14
 10       6
 11       4
 12       6
 13       8
 14       6
 15       7
 16       5
 17       6
 18       5
 19       3
 20       7
 21       3
 22       6
 23       4
 24       4
 25       2
 26       2
 27       1
 28       2
 29       1
 30       2
 31       1
 32       3
 33       1
 34       1
 35       1
 36       2
 38       1
 39       3
 41       1
 42       1
 49       1
 2012    14
 2013    13
Name: institute_service, dtype: int64

In [59]:
tafe_resignation['institute_service'].value_counts().sort_index()

1-2                   64
11-20                 26
3-4                   63
5-6                   33
7-10                  21
Less than 1 year      73
More than 20 years    10
Name: institute_service, dtype: int64

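#### To calculate the length of service (employment), subtract the start year from the cease year. Rows where either year was 'Not Stated' (stored as 0) yield invalid values such as -2012 or 2013 and should be treated as missing.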
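
Because `extract_year()` stores 0 for 'Not Stated', the subtraction produces impossible service lengths whenever one of the two years is missing. A minimal sketch of one way to handle this, assuming 0 is only ever a placeholder, is to convert it to `NaN` before subtracting. The frame `toy` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: complete, missing cease year, missing start year, same-year resignation
toy = pd.DataFrame({
    "cease_date": [2012, 0, 2013, 2014],
    "dete_start_date": [2005, 1990, 0, 2014],
})

# Treat the 0 placeholder as missing before subtracting
cease = toy["cease_date"].replace(0, np.nan)
start = toy["dete_start_date"].replace(0, np.nan)
toy["institute_service"] = cease - start

print(toy)
```

A genuine service length of 0 years (start and cease in the same year) is preserved, while rows with a missing year end up as `NaN`.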

## 7. Identify Dissatisfied Employees

In [60]:
# function to update values for dissatisfied employees
def update_vals(x):
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True

In [61]:
dete_resignation_up = dete_resignation.copy()
dete_resignation_up['separationtype'].value_counts()

Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Name: separationtype, dtype: int64

In [62]:
# tafe_resignation['Contributing Factors. Dissatisfaction'].value_counts()
# # Contributing Factors. Dissatisfaction 55
# # - 277 

In [63]:
# tafe_resignation['Contributing Factors. Job Dissatisfaction'].value_counts()
# # Job Dissatisfaction 62
# # - 270

In [64]:
tafe_resignation['dissatisfied'] = tafe_resignation[['Contributing Factors. Dissatisfaction', 
                                                     'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(axis=1, skipna=False)
tafe_resignation_up = tafe_resignation.copy()
# tafe_resignation_up.head(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tafe_resignation['dissatisfied'] = tafe_resignation[['Contributing Factors. Dissatisfaction',


In [65]:
# Check the unique values after the updates
tafe_resignation_up['dissatisfied'].value_counts(dropna=False)

False    241
True      99
Name: dissatisfied, dtype: int64

#### Created a boolean `dissatisfied` column in `tafe_resignation` to identify employees who resigned because they were dissatisfied with their role. No equivalent column has been created for `dete_resignation` yet.
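
For the DETE data, a matching `dissatisfied` column could be built from the boolean factor columns. The sketch below is an assumption about how that might look: which columns count as "dissatisfaction" is a judgment call (the three used here are taken from the cleaned `dete` column names), and `toy_dete` is a made-up frame.

```python
import pandas as pd

# Assumption: these factor columns indicate dissatisfaction; extend or trim the list as needed
dete_factor_cols = [
    "job_dissatisfaction",
    "dissatisfaction_with_the_department",
    "workload",
]

# Hypothetical rows; in dete these columns are already boolean
toy_dete = pd.DataFrame({
    "job_dissatisfaction": [True, False, False],
    "dissatisfaction_with_the_department": [False, False, False],
    "workload": [False, False, True],
})

# True if any of the selected factors is True for that row
toy_dete["dissatisfied"] = toy_dete[dete_factor_cols].any(axis=1, skipna=False)
print(toy_dete)
```

Since the DETE factor columns are already `True`/`False`, no `update_vals` mapping is needed before calling `any()`.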

## 8. Combine the Data

In [66]:
# add column to each df to distinguish the two

dete_resignation_up['institute'] = 'DETE'
tafe_resignation_up['institute'] = 'TAFE'

In [67]:
dete_resignation_up.head(2)

Unnamed: 0,id,separationtype,cease_date,dete_start_date,role_start_date,position,classification,region,business_unit,employment_status,...,none_of_the_above,gender,age,aboriginal,torres_strait,south_sea,disability,nesb,institute_service,institute
3,4,Resignation-Other reasons,2012,2005,2006,Teacher,Primary,Central Queensland,,Permanent Full-time,...,False,Female,36-40,,,,,,7,DETE
5,6,Resignation-Other reasons,2012,1994,1997,Guidance Officer,,Central Office,Education Queensland,Permanent Full-time,...,False,Female,41-45,,,,,,18,DETE


In [68]:
tafe_resignation_up.head(2)

Unnamed: 0,id,Institute,WorkArea,cease_date,separationtype,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,...,Contributing Factors. Other,Contributing Factors. NONE,gender,age,employment_status,position,institute_service,role_service,dissatisfied,institute
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,...,-,-,,,,,,,False,TAFE
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,...,-,-,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4,False,TAFE


In [69]:
# combine dataframes
combined = pd.concat([dete_resignation_up,tafe_resignation_up], ignore_index=True)
combined.head(2)
combined.columns

Index(['id', 'separationtype', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb', 'institute_service', 'institute',
       'Institute', 'WorkArea',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Fa

In [70]:
# drop cols with fewer than 500 non-null values with df.dropna()
combined_updated = combined.dropna(thresh=500, axis=1).copy()
combined_updated.head(2)
combined_updated.columns


Index(['id', 'separationtype', 'cease_date', 'position', 'employment_status',
       'gender', 'age', 'institute_service', 'institute'],
      dtype='object')

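#### Added an `institute` column to identify each row's source dataframe, then dropped columns with fewer than 500 non-null values to narrow the dataset for analysis. This also dropped `dissatisfied`, which is populated only for the 340 TAFE rows.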
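
The `thresh` argument of `dropna()` is the minimum number of non-null values a column needs in order to be kept. The sketch below uses a made-up four-row frame and `thresh=3` to show the rule on a small scale.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: id has 4 non-null values, age has 3, dissatisfied has 2
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": ["36-40", "41-45", np.nan, "26-30"],
    "dissatisfied": [np.nan, np.nan, True, False],
})

# Keep only columns with at least 3 non-null values
kept = toy.dropna(thresh=3, axis=1)
print(list(kept.columns))
```

A column that is filled in for only one of the two source datasets can fall below the threshold and be dropped, so it is worth checking the resulting column list.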

## 9. Clean the Service Column

tafe_resignation['institute_service'].value_counts().sort_index()

1-2                   64
11-20                 26
3-4                   63
5-6                   33
7-10                  21
Less than 1 year      73
More than 20 years    10

#### `institute_service`
- New: Less than 3 years at a company
- Experienced: 3-6 years at a company
- Established: 7-10 years at a company
- Veteran: 11 or more years at a company


In [74]:
# Extract the years of service and convert the type to float
combined_updated['institute_service_up'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)')
combined_updated['institute_service_up'] = combined_updated['institute_service_up'].astype('float')

# Check the years extracted are correct
combined_updated['institute_service_up'].value_counts().sort_index()

0.0        21
1.0       159
2.0        14
3.0        83
4.0        16
5.0        56
6.0        17
7.0        34
8.0         8
9.0        14
10.0        6
11.0       30
12.0        6
13.0        8
14.0        6
15.0        7
16.0        5
17.0        6
18.0        5
19.0        3
20.0       17
21.0        3
22.0        6
23.0        4
24.0        4
25.0        2
26.0        2
27.0        1
28.0        2
29.0        1
30.0        2
31.0        1
32.0        3
33.0        1
34.0        1
35.0        1
36.0        2
38.0        1
39.0        3
41.0        1
42.0        1
49.0        1
1984.0      1
1987.0      1
1990.0      1
1995.0      1
2000.0      1
2005.0      1
2006.0      1
2007.0      1
2011.0      1
2012.0     15
2013.0     13
Name: institute_service_up, dtype: int64

Instructions  
1. First, we'll extract the years of service from each value in the institute_service column.
  - Use the Series.astype() method to change the type to 'str'.
  - Use vectorized string methods to extract the years of service from each pattern. The full list of vectorized string methods is in the pandas documentation.
  - Double check that you didn't miss extracting any digits.
  - Use the Series.astype() method to change the type to 'float'.
2. Next, we'll map each value to one of the career stage definitions above.
  - Create a function that maps each year value to one of the career stages above.
    - Remember that you'll have to handle missing values separately. You can use the following code to check if a value is NaN where val is the name of the value: pd.isnull(val).
  - Use the Series.apply() method to apply the function to the institute_service column. Assign the result to a new column named service_cat.
3. Write a markdown paragraph explaining the changes you made and why.
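
Step 2 can be sketched as follows, using the career stage definitions above. The function name `career_stage` is hypothetical, and the cut-off of 100 years is an assumption added here to discard the invalid values (such as 2012.0) that come from the 0 placeholder in the date columns.

```python
import numpy as np
import pandas as pd

def career_stage(years):
    """Map years of service to a career stage; missing or implausible values become NaN."""
    # Assumption: anything above 100 years is an artifact of the 0 placeholder
    if pd.isnull(years) or years > 100:
        return np.nan
    if years < 3:
        return "New"
    if years <= 6:
        return "Experienced"
    if years <= 10:
        return "Established"
    return "Veteran"

# Hypothetical values covering each branch
service = pd.Series([0.0, 3.0, 7.0, 11.0, 2012.0, np.nan])
service_cat = service.apply(career_stage)
print(service_cat)
```

In the notebook this would be applied to `combined_updated['institute_service_up']` and assigned to a new `service_cat` column.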

## 10. Perform Initial Analysis

Instructions
1. Use the Series.value_counts() method to confirm the number of True and False values in the dissatisfied column. Set the dropna parameter to False to also confirm the number of missing values.
2. Use the DataFrame.fillna() method to replace the missing values in the dissatisfied column with the value that occurs most frequently in this column, either True or False.
3. Use the DataFrame.pivot_table() method to calculate the percentage of dissatisfied employees in each service_cat group.
  - Since a True value is considered to be 1, calculating the mean will also calculate the percentage of dissatisfied employees. The default aggregation function is the mean, so you can exclude the aggfunc argument.
4. Use the DataFrame.plot() method to plot the results. Set the kind parameter equal to bar to create a bar chart.
  - Make sure to run %matplotlib inline beforehand to show your plots in the notebook.
5. Write a markdown paragraph briefly describing your observations.
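
The steps above can be sketched on a small made-up frame. The data in `toy` is hypothetical, and the `.astype(bool)` call is an addition made here so that the mean is computed on a boolean column rather than an object column.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including one missing dissatisfied value
toy = pd.DataFrame({
    "service_cat": ["New", "New", "Veteran", "Veteran", "Experienced", "New"],
    "dissatisfied": [True, False, True, np.nan, False, False],
})

# Step 1: count True, False and missing values
print(toy["dissatisfied"].value_counts(dropna=False))

# Step 2: fill missing values with the most frequent value
most_common = toy["dissatisfied"].value_counts().idxmax()
toy["dissatisfied"] = toy["dissatisfied"].fillna(most_common).astype(bool)

# Step 3: the default aggfunc is the mean, which for booleans is the share of True values
pct = toy.pivot_table(index="service_cat", values="dissatisfied")
print(pct)

# Step 4 (requires matplotlib): pct.plot(kind="bar")
```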

## 11. Next Steps

In this guided project, we saw that extracting any meaningful insights from our data required many data cleaning tasks. To create one visualization (and not even the final one), we completed the following tasks:

- Explored the data and figured out how to prepare it for analysis
- Corrected some of the missing values
- Dropped any data not needed for our analysis
- Renamed our columns
- Verified the quality of our data
- Created a new institute_service column
- Cleaned the Contributing Factors columns
- Created a new column indicating if an employee resigned because they were dissatisfied in some way
- Combined the data
- Cleaned the institute_service column
- Handled the missing values in the dissatisfied column
- Aggregated the data

Our work here is far from done! We recommend that you continue with the following steps:

- Decide how to handle the rest of the missing values. Then, aggregate the data according to the service_cat column again. How many people in each career stage resigned due to some kind of dissatisfaction?
- Clean the age column. How many people in each age group resigned due to some kind of dissatisfaction?
- Instead of analyzing the survey results together, analyze each survey separately. Did more employees in the DETE survey or TAFE survey end their employment because they were dissatisfied in some way?
- Format your project using Dataquest's project style guide.