---   

<h1 align="center">Introduction to Data Analyst and Data Science for beginners</h1>
<h1 align="center">Lecture no 2.17(Pandas-08)</h1>

---
<h3><div align="right">Ehtisham Sadiq</div></h3>    

<img align="right" width="400" height="400"  src="images/pandas-apps.png"  >

## _Handling Missing Data.ipynb_

## Learning agenda of this notebook

1. Have an insight about the Dataset
2. Identify the Columns having Null/Missing values using `df.isna()` method
3. Handle/Impute the Null/Missing Values under the `math` Column using `df.loc[mask,col]=value`
4. Handle/Impute the Null/Missing Values under the `group` Column using `df.loc[mask,col]=value`
5. Handle Missing values under a Numeric/Categorical Column using `fillna()`
6. Handle Repeating Values (for same information) under the `session` Column
7. Create a new Column by Modifying an Existing Column
8. Delete Rows Having NaN values using `df.dropna()` method
9. Convert Categorical Variables into Numerical

## 1. Have an Insight about the Dataset

In [1]:
! cat datasets/group-marks.csv

rollno,name,gender,group,session,age,scholarship,math,english,urdu
MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72,74
MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90,88
MS03,ARIFA,female,,EVENING,34,3500,,95,93
MS04,SAADIA,female,group A,MOR,44,2000,47,57,44
MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78,55
MS06,SAFIA,female,group B,AFT,23,3800,,83,78
MS07,SARA,female,group B,EVENING,47,3000,88,95,92
MS08,ABDULLAH,male,group B,EVE,33,2000,40,43,39
MS09,KHAN,male,group D,MORNING,27,2500,64,,67
MS10,HASEENA,female,group B,AFT,33,2800,38,60,50
MS11,MUSTJAB,male,group C,MOR,46,3000,58,54,52
MS12,ABRAR,male,group D,MORNING,53,3312,40,52,43
MS13,MAHOOR,female,,MOR,25,2345,65,81,73
MS14,USAMA,male,group A,AFTERNOON,26,2654,78,72,70
MS15,NAVAIRA,female,group A,AFT,25,2137,50,53,58
MS16,SAWAIRA,female,group C,EVENING,29,2567,69,75,78
MS17,NOFIL,male,group C,MOR,22,3500,88,89,86
MS18,SHUMAILA,female,group B,AFTERNOON,31,2500,18,,28
MS19,ABUBAKAR,

In [2]:
# import the pandas library
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55


In [3]:
df.shape

(50, 10)

- Whenever the **`pd.read_csv()`** method detects a missing value (nothing between two commas in a CSV file, or an empty cell in Excel), it flags it as NaN. There can be many reasons for these NaN values; for example, the data may have been gathered from people through a Google Form in which the field was optional and was skipped.
- It can also happen that a user enters some text in a numeric field because he/she does not have the requested information.
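
Both situations can be reproduced without the course file. The sketch below is only an illustration: `csv_text` and `sample` are made-up names, and the three rows are a hand-typed extract in the same layout as `group-marks.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row extract in the same layout as group-marks.csv
csv_text = """rollno,name,group,math,english
MS01,SAADIA,group B,No Idea,72
MS02,JUMAIMA,group C,69,90
MS03,ARIFA,,,95
"""

sample = pd.read_csv(io.StringIO(csv_text))

# The empty fields in the MS03 row are read as NaN
print(sample.isna().sum())

# The text 'No Idea' is NOT treated as missing, so math is not a numeric column
print(pd.api.types.is_numeric_dtype(sample['math']))     # False
print(pd.api.types.is_numeric_dtype(sample['english']))  # True
```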

## 2. Identify the Columns having Null/Missing values
- The **`df.isna()`** method is recommended over `df.isnull()` (the two are aliases and behave identically). It returns a boolean object of the same size that indicates whether each element is an NA value. Missing values get mapped to True. Everything else gets mapped to False. Remember, characters such as empty strings ``''`` or `numpy.inf` are not considered NA values.
- The **`df.notna()`** method is recommended over `df.notnull()`. It returns a boolean object of the same size that indicates whether each element is not an NA value. Non-missing values get mapped to True. 
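
A small check of what counts as missing, using made-up values (the Series `s` and `t` are illustrative and not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to show what counts as missing
s = pd.Series([1.5, np.nan, np.inf, None])
t = pd.Series(['group A', '', None])

# NaN and None are missing; infinity and the empty string are not
print(s.isna().tolist())   # [False, True, False, True]
print(t.isna().tolist())   # [False, False, True]

# notna() is the element-wise opposite of isna()
print(s.notna().tolist())  # [True, False, True, False]
```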

In [None]:
df.isna().head()

In [None]:
df.notna().head()

In [None]:
# Now we can use sum() on this dataframe object of Boolean values (True is mapped to 1)
df.isna().sum()

In [None]:
# Similarly, we can use sum() on this dataframe object of Boolean values (True is mapped to 1)
df.notna().sum()
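
Because True counts as 1, the same idea also gives the share of missing values per column if `mean()` is used in place of `sum()`. A minimal sketch on a made-up four-row frame (`demo` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame used only for illustration
demo = pd.DataFrame({
    'math': [69.0, np.nan, 47.0, np.nan],
    'english': [90.0, 95.0, 57.0, 78.0],
})

missing_count = demo.isna().sum()          # number of NaN values per column
missing_share = demo.isna().mean() * 100   # percentage of NaN values per column

print(missing_count)
print(missing_share)
```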

## 3. Handle/Impute the Null/Missing Values under the `math` Column

### a. Identify the Rows under the `math` Column having Null/Missing values
- The `isna()` method works equally well on Series objects

In [None]:
mask = df.math.isna()
mask

In [None]:
# This will return only those rows of dataframe having null values under the math column
df[mask]         # df[df.math.isna()]
df.loc[mask, :]  # df.loc[df.math.isna(), :]

### b. Replace the Null/Missing Values under the `math` Column
- After detecting the NaN values, the next question is what value we should write in the cells where we have Null/Missing values under the `math` column
- Suppose we want to put the average value in place of the missing values.

In [None]:
# Compute the mean of math column
# (uncomment the next line to see the error it raises)
# df.math.mean() 

> Judging by the error, it appears that the `math` column does not have the `int64` or `float64` type. Let us check this out

In [None]:
# Check out the data type of math column
df['math'].dtypes

In [None]:
# We can also use the `df.info()` method to display the column names, the count of non-null values in each column,
# their datatypes, and the memory usage of the dataframe.

df.info()

- **What can be the reason for this?**
- Let us check out the values under this column

In [None]:
df['math']

In [None]:
# We can replace all such values using the `replace()` method
import numpy as np
df.replace('No Idea', np.nan).head()

In [None]:
# Note the marks of Saadia in math are changed from string `No Idea` to `NaN`
# Since this seems working fine let us make inplace=True to make these changes in the original dataframe
df.replace('No Idea', np.nan, inplace=True)

In [None]:
df.head()

In [None]:
# Let us check the data type of math column
df['math'].dtypes

In [None]:
# It is still object, which is natural; however, we can change the datatype using the `astype()` method
df['math'] = df['math'].astype(float)

In [None]:
# Let us check the data type of math column
df['math'].dtypes
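
As an alternative to `replace()` followed by `astype()`, `pd.to_numeric()` with `errors='coerce'` does both steps at once. This is a different technique from the one used in the cells above, shown here on a made-up extract of the column (`math_raw` and `math_num` are illustrative names):

```python
import pandas as pd

# Hypothetical extract of the math column, including the text 'No Idea'
math_raw = pd.Series(['No Idea', '69', None, '47'], name='math', dtype=object)

# errors='coerce' turns anything that cannot be parsed as a number into NaN
math_num = pd.to_numeric(math_raw, errors='coerce')

print(math_num.dtype)   # float64
print(math_num.mean())  # 58.0 (NaN values are skipped by mean())
```

Note that this converts every unparseable entry to NaN, not only `No Idea`, so it is worth inspecting the column first.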

In [None]:
# Let us compute the average of math marks again 
df.math.mean() 

In [None]:
# List only those records under math column having Null values
mask = df.math.isna()
df.loc[mask, 'math']

In [None]:
# Let us replace these values with mean value of the math column
df.loc[(df.math.isna()),'math'] = df.math.mean()

In [None]:
# Confirm the result
df.isna().sum()
#df.info()

In [None]:
df.head()

## 4. Handle/Impute the Null/Missing Values under the `group` Column
- The `group` column contains categorical values, i.e., a value that can take on one of a limited, and usually fixed, number of possible values.

### a. Identify the Rows under the `group` Column having Null/Missing values

In [None]:
df.head()

In [None]:
mask = df.group.isna()
mask.head()

In [None]:
df[mask]          # df[df.group.isna()]
df.loc[mask, :]   # df.loc[df.group.isna()]

### b. Replace the Null/Missing Values under the `group` Column
- After detecting the NaN values, the next question is, what value we should write in the cells where we have Null/Missing values
- Since this is a categorical column having datatype object (group A, group B, group C, ...), let us replace the missing entries with the value that occurs most frequently in the column

In [None]:
# Use the value_counts() method, which returns a Series containing counts of unique values (in descending order)
# with the most frequently occurring element first. It excludes NA values by default.
df.group.value_counts()

In [None]:
# Another way of doing this is to use the mode() method on the column
df.group.mode() 

In [None]:
# List only those records under group column having Null values
mask = df.group.isna()
df.loc[mask, 'group']     # df.loc[(df.group.isna()), 'group']

In [None]:
# Let us replace these values with maximum occurring value in the `group` column
df.loc[(df.group.isna()),'group'] = 'group C'
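
The most frequent value does not have to be typed by hand. The sketch below, on a made-up five-row frame (`demo` and `most_frequent` are hypothetical names), takes it from `mode()` and writes it with the same `loc` pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical extract with two missing group entries
demo = pd.DataFrame({
    'name': ['SAADIA', 'JUMAIMA', 'ARIFA', 'SAFIA', 'MAHOOR'],
    'group': ['group B', 'group C', np.nan, 'group B', np.nan],
})

# mode() returns a Series because several values can tie; take the first one
most_frequent = demo['group'].mode()[0]

# Same loc pattern as above, with the computed value instead of a literal
demo.loc[demo['group'].isna(), 'group'] = most_frequent

print(most_frequent)               # group B
print(demo['group'].isna().sum())  # 0
```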

In [None]:
# Confirm the result
df.isna().sum()
#df.info()

In [None]:
df.head()

>Note that in the original dataframe Arifa's group information was missing, and now it is `group C` 

## 5. Handle Missing values under a Numeric/Categorical Column using `fillna()`

### a. Replace the Null/Missing Values under the math Column using `fillna()`
- This is the more recommended way of filling in the Null values within columns of your dataset, compared with the use of the `loc` method.
```
object.fillna(value, method, inplace=True)
```
- The only required argument is either the `value`, with which we want to replace the missing values OR the `method` to be used to replace the missing values
- Returns object with missing values filled or None if ``inplace=True``
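
The `value` argument can also be a dictionary that maps column names to fill values, so several columns can be handled in one call. A minimal sketch on a made-up frame (`demo`, `fill_values` and `filled` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
demo = pd.DataFrame({
    'group': ['group B', None, 'group B', 'group C'],
    'math': [40.0, np.nan, 60.0, 80.0],
})

# One fill value per column: mode for the categorical one, mean for the numeric one
fill_values = {'group': demo['group'].mode()[0], 'math': demo['math'].mean()}

# Without inplace=True a new dataframe is returned and demo is left unchanged
filled = demo.fillna(value=fill_values)

print(filled)
print(demo.isna().sum())
```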

In [None]:
# Let us read the dataset again with NA values under math column
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')

In [None]:
df.head()

>- Before proceeding, let us this time handle the string value `No Idea` under the math column while reading the csv file, instead of doing afterwards in the dataframe using the `replace()` method as we have done above.
>- For this we will use the `na_values` argument to the `pd.read_csv()` method, to which you can pass a single value or a list of values to be replaced with NaN

In [None]:
df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')

In [None]:
df.head()

In [None]:
df.isna().sum()

In [None]:
df.loc[df.math.isna()]

In [None]:
# This time instead of loc, use the fillna() method
# Assigning the result back to the column updates the original dataframe; calling
# fillna(..., inplace=True) on df.math is a chained operation that newer pandas versions warn about

df['math'] = df['math'].fillna(value=df['math'].mean())

In [None]:
# Confirm the result
df.isna().sum()
#df.info()

In [None]:
df.head()

### b. Replace the Null/Missing Values under the `group` Column using `fillna()`

In [None]:
# Let us read the dataset again with NA values
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')
df.head()

In [None]:
df.isna().sum()

In [None]:
# Once again, instead of loc, let us use the fillna() method and assign the result back to the column

df['group'] = df['group'].fillna('group C')

In [None]:
# Confirm the result
df.isna().sum()
#df.info()

In [None]:
# Let us fill the math, english and scholarship columns as well again
df['math'] = df['math'].fillna(df['math'].mean())
df['english'] = df['english'].fillna(df['english'].mean())
df['scholarship'] = df['scholarship'].fillna(df['scholarship'].mean())

In [None]:
# Confirm the result
df.isna().sum()


### c. Replace the Null/Missing Values under the `math` and `group` Column using `ffill` and `bfill` Arguments
- In above examples, we have used the mean value in case of numeric column and mode value in case of a categorical column as the filling value to the `fillna()` method
```
object.fillna(value, method, inplace=True)
```

- We can pass `ffill` or `bfill` as the `method` argument to the `fillna()` method. This will replace the null values with other values from the DataFrame
- `ffill` (Forward fill): It fills the NaN value with the previous value
- `bfill` (Back fill): It fills the NaN value with the Next/Upcoming value
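
The two directions can be compared on a short made-up Series. The sketch uses the shorthand methods `Series.ffill()` and `Series.bfill()`, which perform the same fills as `fillna(method='ffill')` and `fillna(method='bfill')`; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with gaps at the start, in the middle and at the end
s = pd.Series([np.nan, 69.0, np.nan, np.nan, 47.0, np.nan])

forward = s.ffill()    # nan, 69, 69, 69, 47, 47
backward = s.bfill()   # 69, 69, 47, 47, 47, nan

print(forward.tolist())
print(backward.tolist())
```

A forward fill cannot fill a gap in the first row and a backward fill cannot fill a gap in the last row, because there is no previous or next value to copy.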

<img align="right" width="490" height="100"  src="images/bfill.PNG"  >
<img align="left" width="490" height="100"  src="images/ffill.PNG"  >

In [None]:
# Let us read the dataset again with NA values
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')
df.head()

In [None]:
df.isna().sum()

In [None]:
# Forward fill (method='ffill'):
# if a cell has a NaN value, carry forward the previous value in that column
# Note: recent pandas releases deprecate fillna(method=...); df.ffill(inplace=True) is the equivalent call
df.fillna(method = 'ffill', inplace=True)
df.head()

>Is it working fine?

In [None]:
df.fillna(method = 'bfill', inplace=True)
df.head()

In [None]:
# Confirm the result
df.isna().sum()

## 6. Handle Repeating Values (for same information) under the `session` Column
- If you look at the values under the `session` column, you can see that it is a categorical column containing six different categories (as values).
    - Notice that the categories `MORNING` and `MOR` are the same
    - Similarly, `AFTERNOON` and `AFT` are the same
    - Similarly, `EVENING` and `EVE` are the same
- This happens when you have collected data from different sources, where the same information is written in different ways
- So the `session` column has six different categories (as values) but should have only three

In [None]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv' )
df

In [None]:
df.session

In [None]:
# Let us check out the counts of unique values inside the session Column
df.session.value_counts()

###  Handle  the Repeating Values under the session Column using `map()`
- To keep the data clean, we will map all these values to only three categories, `MOR`, `AFT` and `EVE`, using the `map()` method.
```
series.map(arg, na_action=None)
```
- The `map()` method is used for substituting each value in a Series with another value, which may be derived from a `dict`. It is called on a Series (here `df.session`) and returns a Series after performing the mapping
- You can pass `na_action='ignore'` as the second argument, which will propagate NaN values without passing them to the mapping correspondence.
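
Two details of `map()` are easy to miss and are sketched below on made-up values (`session`, `mapping`, `mapped` and `lowered` are illustrative names, and `NIGHT` is an invented label that does not occur in the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical session values, including one missing entry and one unexpected label
session = pd.Series(['MORNING', 'MOR', np.nan, 'NIGHT'])

mapping = {'MORNING': 'MOR', 'MOR': 'MOR'}

# Values that are not keys of the dictionary (here 'NIGHT') become NaN
mapped = session.map(mapping)
print(mapped.tolist())

# With a function instead of a dict, na_action='ignore' skips NaN entries,
# so the function is never called on a missing value
lowered = session.map(lambda value: value.lower(), na_action='ignore')
print(lowered.tolist())
```

For this reason the dictionary should list every category that occurs in the column, including the ones that map to themselves.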

In [None]:
# To do this, let us create a new mapping (dictionary) 
dict1 = {
    'MORNING' : 'MOR',
    'MOR' : 'MOR',
    'AFTERNOON' : 'AFT',
    'AFT': 'AFT',
    'EVENING' : 'EVE',
    'EVE': 'EVE'
}

In [None]:
# It returns a series with the same index as caller, the original series remains unchanged. 
# So in the next cell we assign the resulting series back to `df.session`
df.session.map(dict1)

In [None]:
df.session = df.session.map(dict1)

In [None]:
# Count of new categories in the column session
# Observe we have managed to properly manage the values inside the session column
df.session.value_counts()

In [None]:
# Let us verify the result
df.head()

## 7. Create a new Column by Modifying an Existing Column
- We have a column scholarship in the dataset, which is in Pak Rupees
- Suppose you want to have a new column which should represent the scholarship in US Dollars
- For that we need to add a new column by dividing each value of scholarship by 150

In [None]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv' )
df.head()

In [None]:
df.scholarship.apply(lambda x: x/150)

In [None]:
df['Scholarship_in_$'] = df.scholarship.apply(lambda x : x/150)

In [None]:
df.head()
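
For simple arithmetic like this, the division can also be written directly on the column, without `apply()`. A minimal sketch, assuming the same conversion rate of 150 used above (`demo`, `rate`, `via_apply` and `vectorized` are illustrative names):

```python
import pandas as pd

# Hypothetical extract of the scholarship column (Pak Rupees)
demo = pd.DataFrame({'scholarship': [2562, 2800, 3500]})

rate = 150  # rupees per dollar, the figure assumed in this section

via_apply = demo['scholarship'].apply(lambda x: x / rate)   # element by element
vectorized = demo['scholarship'] / rate                      # whole column at once

# Both approaches give the same values
print((via_apply - vectorized).abs().max())

demo['Scholarship_in_$'] = vectorized
print(demo)
```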

## 8. Delete Rows Having NaN values using `df.dropna()` method

In [None]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

In [None]:
df.shape

In [None]:
# You can use the dropna() method to drop every row that has any NA value
df1 = df.dropna()
df1.shape

In [None]:
df1.head()

In [None]:
# Default Arguments to dropna()
df2 = df.dropna(axis=0, how='any')
df2.shape

In [None]:
# If we set how='all', it means drop a row only if all of its values are NA
df2 = df.dropna(axis=0, how='all')
df2.shape

In [None]:
# Use of subset argument and pass it a list of columns based on whose values you want to drop a row
df2 = df.dropna(axis=0, how='any', subset=['math'])
df2.shape

In [None]:
# Use of subset argument
df2 = df.dropna(axis=0, how='any', subset=['session'])
df2.shape

In [None]:
# Having `how='all'` and `subset=listofcolumnnames`, then it will 
# drop a row only if both the columns have a NA value in that row
df2 = df.dropna(axis=0, how='all', subset=['math', 'session'])
df2.shape

In [None]:
# If we set axis=1 and how='all', it means drop a column if all the values under it are NA
df2 = df.dropna(axis=1, how='all')
df2.shape

In [None]:
# If we set axis=1 and how='any', it means drop a column if any value under it is NA
df2 = df.dropna(axis=1, how='any')
df2.shape

In [None]:
df2.head()
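
The effect of the different arguments is easier to see on a tiny frame. The sketch below uses made-up marks (`demo` is a hypothetical name) and also shows the `thresh` argument, which keeps a row only if it has at least that many non-NA values:

```python
import numpy as np
import pandas as pd

# Hypothetical marks: row 0 is complete, row 2 is completely empty
demo = pd.DataFrame({
    'math':    [69.0, np.nan, np.nan, 47.0],
    'english': [90.0, 95.0,   np.nan, np.nan],
    'urdu':    [88.0, 93.0,   np.nan, np.nan],
})

print(demo.dropna().shape)                 # (1, 3) only the complete row survives
print(demo.dropna(how='all').shape)        # (3, 3) only the empty row is dropped
print(demo.dropna(thresh=2).shape)         # (2, 3) rows with at least 2 non-NA values
print(demo.dropna(subset=['math']).shape)  # (2, 3) rows where math is present
```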

## 9. Convert Categorical Variables into Numerical
- Most of the machine learning algorithms do not take categorical variables so we need to convert them into numerical ones. 
- We can do this using Pandas function `pd.get_dummies()`, which will create a binary column for each of the categories. 
```
pd.get_dummies(data, drop_first=False)
```
- The only required argument is `data`, which can be a dataframe or a series
- The parameter `drop_first` (bool, default False) controls whether to get k-1 dummies out of k categorical levels by removing the first level.

**Note:** Making dummy variables will take all the `K` distinct values in one column and make `K` columns out of them
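
A minimal sketch of the `K` versus `K-1` behaviour, on a made-up Series with three distinct session values (`session`, `full` and `reduced` are illustrative names):

```python
import pandas as pd

# Hypothetical column with K = 3 distinct categories
session = pd.Series(['MOR', 'AFT', 'EVE', 'AFT'], name='session')

full = pd.get_dummies(session)                      # K columns
reduced = pd.get_dummies(session, drop_first=True)  # K-1 columns, first level removed

print(list(full.columns))     # ['AFT', 'EVE', 'MOR']
print(list(reduced.columns))  # ['EVE', 'MOR']
```

With `drop_first=True`, a row in which all remaining dummy columns are 0 stands for the removed level (`AFT` here).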

### a. Convert all categorical variables into dummy/indicator variables

In [None]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

In [None]:
# currently we have 10 columns in the data
df.shape

In [None]:
# Convert all categorical variables into dummy/indicator variables
df = pd.get_dummies(df)

In [None]:
# Let us view the dataframe, keep a note on the number of columns
df.head()

In [None]:
# The number of columns has grown considerably now
df.shape

- So we have 112 columns
- One-hot encoding is a good way to convert your categorical columns to numerical columns
- However, it adds a lot of dimensionality to your data, i.e., it increases the number of columns
- It also becomes difficult to deal with that many columns
- This is a trade-off, which is handled by a technique called dimensionality reduction
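
Most of the extra columns come from columns with many distinct values, such as identifiers. The sketch below, on a made-up four-row extract (`demo`, `everything` and `selected` are illustrative names), uses `nunique()` to spot them and the `columns` argument of `pd.get_dummies()` to encode only the chosen columns:

```python
import pandas as pd

# Hypothetical extract: rollno is unique per row, gender has two values
demo = pd.DataFrame({
    'rollno': ['MS01', 'MS02', 'MS04', 'MS05'],
    'gender': ['female', 'female', 'female', 'male'],
    'age': [28, 33, 44, 54],
})

# Number of distinct values per column: each one becomes a dummy column
print(demo.nunique())

everything = pd.get_dummies(demo)                    # encodes rollno and gender
selected = pd.get_dummies(demo, columns=['gender'])  # encodes gender only

print(everything.shape)  # (4, 7)
print(selected.shape)    # (4, 4)
```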

### b. Perform One-Hot Encoding for Categorical Column `gender` Only
- In our dataframe, the gender column is a categorical column having two values 'male' and 'female'
- `pd.get_dummies()` will create dummy binary columns for it.  
- This is also known as `One Hot Encoding`. You will learn more encoding techniques in the data pre-processing module.


In [None]:
import pandas as pd
df1 = pd.read_csv('datasets/group-marks.csv')
df1.head()

In [None]:
# Convert only gender variable into dummy/indicator variables
df2 = pd.get_dummies(df1[['gender']])
df2.head()

In [None]:
# Since we do not need two separate columns, simply use the `drop_first` argument of get_dummies to handle this
df2 = pd.get_dummies(df1[['gender']], drop_first=True)
df2.head()

In [None]:
# We will talk about join in the next session in detail.
df3 = df1.join(df2['gender_male'])
df3.head()

## Check Your Concepts:
- What is Pandas?

## Practice Questions
For the practice questions, we will use the following dataset

In [5]:
import pandas as pd
import numpy as np
dict1 ={
'ord_no':[70001,np.nan,70002,70004,np.nan,70005,np.nan,70010,70003,70012,np.nan,70013],
'purch_amt':[150.5,270.65,65.26,110.5,948.5,2400.6,5760,1983.43,2480.4,250.45, 75.29,3045.6],
'ord_date': ['2012-10-05','2012-09-10',np.nan,'2012-08-17','2012-09-10','2012-07-27','2012-09-10','2012-10-10','2012-10-10','2012-06-27','2012-08-17','2012-04-25'],
'customer_id':[3002,3001,3001,3003,3002,3001,3001,3004,3003,3002,3001,3001],
'salesman_id':[5002,5003,5001,np.nan,5002,5001,5001,np.nan,5003,5002,5003,np.nan]
}

### Write a Pandas program to detect missing values of a given DataFrame.(Hint : df.isna() or df.isnull())

In [9]:
df = pd.DataFrame(dict1)
df.isnull().sum()

ord_no         4
purch_amt      0
ord_date       1
customer_id    0
salesman_id    3
dtype: int64

### Write a Pandas program to identify the column(s) of a given DataFrame which have at least one missing value.(Hint : df.isna().sum() or df.isna().any())

In [19]:
mask = df.isnull().any()
mask

ord_no          True
purch_amt      False
ord_date        True
customer_id    False
salesman_id     True
dtype: bool

### Write a Pandas program to count the number of missing values in each column of a given DataFrame.(Hint: df.isna().sum())

### Write a Pandas program to find and replace the missing values in a given DataFrame which do not have any valuable information.(Hint : pd.read_csv(na_values) or df.replace())
For this question, use the following dataset

In [35]:
dict1 = {
'ord_no':[70001,np.nan,70002,70004,np.nan,70005,"--",70010,70003,70012,np.nan,70013],
'purch_amt':[150.5,270.65,65.26,110.5,948.5,2400.6,5760,"?",12.43,2480.4,250.45, 3045.6],
'ord_date': ['?','2012-09-10',np.nan,'2012-08-17','2012-09-10','2012-07-27','2012-09-10','2012-10-10','2012-10-10','2012-06-27','2012-08-17','2012-04-25'],
'customer_id':[3002,3001,3001,3003,3002,3001,3001,3004,"--",3002,3001,3001],
'salesman_id':[5002,5003,"?",5001,np.nan,5002,5001,"?",5003,5002,5003,"--"]}

In [38]:
df = pd.DataFrame(dict1)

### Write a Pandas program to drop the rows where at least one element is missing in a given DataFrame.(Hint : df.dropna())

### Write a Pandas program to drop the columns where at least one element is missing in a given DataFrame.(Hint : df.dropna())
For this question, use the following dataset

In [43]:
dict1 = {
'ord_no':[70001,np.nan,70002,70004,np.nan,70005,np.nan,70010,70003,70012,np.nan,70013],
'purch_amt':[150.5,270.65,65.26,110.5,948.5,2400.6,5760,1983.43,2480.4,250.45, 75.29,3045.6],
'ord_date': ['2012-10-05','2012-09-10',np.nan,'2012-08-17','2012-09-10','2012-07-27','2012-09-10','2012-10-10','2012-10-10','2012-06-27','2012-08-17','2012-04-25'],
'customer_id':[3002,3001,3001,3003,3002,3001,3001,3004,3003,3002,3001,3001],
'salesman_id':[5002,5003,5001,np.nan,5002,5001,5001,np.nan,5003,5002,5003,np.nan]}


### Write a Pandas program to drop the rows where all elements are missing in a given DataFrame.(Hint : df.dropna(how='all'))
For this question, we will use the following dataset

In [44]:
dict1 = {
'ord_no':[np.nan,np.nan,70002,70004,np.nan,70005,np.nan,70010,70003,70012,np.nan,70013],
'purch_amt':[np.nan,270.65,65.26,110.5,948.5,2400.6,5760,1983.43,2480.4,250.45, 75.29,3045.6],
'ord_date': [np.nan,'2012-09-10',np.nan,'2012-08-17','2012-09-10','2012-07-27','2012-09-10','2012-10-10','2012-10-10','2012-06-27','2012-08-17','2012-04-25'],
'customer_id':[np.nan,3001,3001,3003,3002,3001,3001,3004,3003,3002,3001,3001]}

### Write a Pandas program to keep the rows with at least 2 non-NaN values in a given DataFrame.(Hint: df.dropna(thresh=))

In [46]:
dict1 = {
'ord_no':[np.nan,np.nan,70002,np.nan,np.nan,70005,np.nan,70010,70003,70012,np.nan,np.nan],
'purch_amt':[np.nan,270.65,65.26,np.nan,948.5,2400.6,5760,1983.43,2480.4,250.45, 75.29,np.nan],
'ord_date': [np.nan,'2012-09-10',np.nan,np.nan,'2012-09-10','2012-07-27','2012-09-10','2012-10-10','2012-10-10','2012-06-27','2012-08-17',np.nan],
'customer_id':[np.nan,3001,3001,np.nan,3002,3001,3001,3004,3003,3002,3001,np.nan]}

### Write a Pandas program to drop those rows from a given DataFrame in which specific columns have missing values.(Hint : df.dropna(subset))

### Write a Pandas program to keep the valid entries of a given DataFrame.(Hint : df.dropna)

### Write a Pandas program to calculate the total number of missing values in a DataFrame.

### Write a Pandas program to replace NaNs with a single constant value in specified columns in a DataFrame.(Hint : df.fillna())

### Write a Pandas program to replace NaNs with the value from the previous row or the next row in a given DataFrame.(Hint : df.fillna())

### Write a Pandas program to replace NaNs with median or mean of the specified columns in a given DataFrame.(Hint : df.fillna())

### Write a Pandas program to find the Indexes of missing values in a given DataFrame.(Hint : df.isnull().to_numpy())

ord_no            70002.0
purch_amt           65.26
ord_date       2012-09-10
customer_id        3001.0
Name: 0, dtype: object

### Write a Pandas program to replace the missing values with the most frequent values present in each column of a given dataframe.(Hint : df.mode())

## Bonus

## Create a heatmap for more information about the distribution of missing values in a given DataFrame.

# Pandas - Assignment no 08
- Here is the link to [Pandas - Assignment no 08]()

### [Project : Clean And Analyze Employee Exit Surveys](https://github.com/AnshuTrivedi/Data-Scientist-In-Python/blob/master/Projects/step_2/Course_4/Guided%20Project_Clean%20And%20Analyze%20Employee%20Exit%20Surveys.ipynb)

**In this guided project, we'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. You can find the TAFE exit survey here and the survey for the DETE here. We've made some slight modifications to these datasets to make them easier to work with, including changing the encoding to UTF-8 (the original ones are encoded using cp1252).**



Our end goal is to answer the following question:

**Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been at the job longer?**


#### Import the libraries and load the dataset

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from warnings import filterwarnings
filterwarnings('ignore')
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)

In [2]:
tafe_survey = pd.read_csv('datasets/tafe_survey.csv')
dete_survey = pd.read_csv('datasets/dete_survey.csv')

In [3]:
# bird's-eye view of the dataset
tafe_survey.sample(5)

Unnamed: 0,Record ID,Institute,WorkArea,CESSATION YEAR,Reason for ceasing employment,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,Contributing Factors. Dissatisfaction,Contributing Factors. Job Dissatisfaction,Contributing Factors. Interpersonal Conflict,Contributing Factors. Study,Contributing Factors. Travel,Contributing Factors. Other,Contributing Factors. NONE,Main Factor. Which of these was the main factor for leaving?,InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction,InstituteViews. Topic:2. I was given access to skills training to help me do my job better,InstituteViews. Topic:3. I was given adequate opportunities for personal development,InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%,InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had,InstituteViews. Topic:6. The organisation recognised when staff did good work,InstituteViews. Topic:7. Management was generally supportive of me,InstituteViews. Topic:8. Management was generally supportive of my team,InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me,InstituteViews. Topic:10. Staff morale was positive within the Institute,InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly,InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently,InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly,WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit,WorkUnitViews. Topic:15. I worked well with my colleagues,WorkUnitViews. Topic:16. My job was challenging and interesting,WorkUnitViews. Topic:17. 
I was encouraged to use my initiative in the course of my work,WorkUnitViews. Topic:18. I had sufficient contact with other people in my job,WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job,WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job,WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT],WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job,WorkUnitViews. Topic:23. My job provided sufficient variety,WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job,WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction,WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance,"WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area",WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date,WorkUnitViews. Topic:29. There was adequate communication between staff in my unit,WorkUnitViews. Topic:30. Staff morale was positive within my work unit,Induction. Did you undertake Workplace Induction?,InductionInfo. Topic:Did you undertake a Corporate Induction?,InductionInfo. Topic:Did you undertake a Institute Induction?,InductionInfo. Topic: Did you undertake Team Induction?,InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted?,InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted?,InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction?,InductionInfo. Face to Face Topic:Did you undertake a Institute Induction?,InductionInfo. 
On-line Topic:Did you undertake a Institute Induction?,InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction?,InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category?,InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.],InductionInfo. Induction Manual Topic: Did you undertake Team Induction?,Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)?,Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?,Workplace. Topic:Does your workplace promote and practice the principles of employment equity?,Workplace. Topic:Does your workplace value the diversity of its employees?,Workplace. Topic:Would you recommend the Institute as an employer to others?,Gender. What is your Gender?,CurrentAge. Current Age,Employment Type. Employment Type,Classification. Classification,LengthofServiceOverall. Overall Length of Service at Institute (in years),LengthofServiceCurrent. Length of Service at current workplace (in years)
301,6.345708e+17,SkillsTech Australia,Non-Delivery (corporate),2011.0,Resignation,-,-,-,-,-,Contributing Factors. Dissatisfaction,Job Dissatisfaction,Interpersonal Conflict,-,-,Other,-,Other,Strongly Disagree,Neutral,Neutral,Strongly Disagree,Strongly Disagree,Strongly Disagree,Strongly Disagree,Neutral,Strongly Disagree,Strongly Disagree,Strongly Disagree,Strongly Disagree,Strongly Disagree,Strongly Disagree,Disagree,Neutral,Neutral,Neutral,Neutral,Strongly Disagree,Strongly Disagree,Strongly Disagree,Disagree,Neutral,Neutral,Strongly Disagree,Neutral,Strongly Disagree,Neutral,Disagree,Yes,No,,No,-,-,-,-,On-line,-,-,-,-,No,No,No,No,No,Female,46 50,Permanent Full-time,Executive (SES/SO),5-6,5-6
166,6.343721e+17,Brisbane North Institute of TAFE,Non-Delivery (corporate),2011.0,Retirement,-,-,-,-,-,Contributing Factors. Dissatisfaction,Job Dissatisfaction,-,-,-,-,-,Dissatisfaction with %[Institute]Q25LBL%,Disagree,Agree,Agree,Disagree,Disagree,Neutral,Disagree,Disagree,Disagree,Disagree,Neutral,Neutral,Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Agree,Agree,Strongly Agree,Agree,Agree,Strongly Agree,Strongly Agree,Yes,Yes,Yes,Yes,Face to Face,-,-,Face to Face,-,-,Face to Face,-,-,Yes,Yes,Yes,Yes,No,Male,56 or older,Permanent Full-time,Administration (AO),11-20,11-20
539,6.348187e+17,Southbank Institute of Technology,Non-Delivery (corporate),2012.0,Resignation,-,-,-,Ill Health,-,-,-,-,-,-,-,-,,Disagree,,,,,,,,,Disagree,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
253,6.345408e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2011.0,Contract Expired,,,,,,,,,,,,,,Neutral,Neutral,Neutral,Neutral,Neutral,Agree,Agree,Neutral,Neutral,Neutral,Not Applicable,Not Applicable,Agree,Agree,Agree,Agree,Agree,Agree,Agree,Neutral,Neutral,Neutral,Agree,Agree,Agree,Agree,Agree,Not Applicable,Agree,Agree,Yes,,,,-,-,-,-,-,-,-,-,-,,,,Yes,Yes,Female,41 45,Temporary Full-time,Teacher (including LVT),Less than 1 year,Less than 1 year
416,6.346686e+17,Southern Queensland Institute of TAFE,Non-Delivery (corporate),2012.0,Resignation,-,-,-,-,-,-,-,-,-,-,Other,-,,Agree,Agree,Agree,Agree,Agree,Agree,Agree,Agree,Agree,Disagree,Agree,Agree,Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Strongly Agree,Disagree,Yes,Yes,Yes,No,Face to Face,-,-,Face to Face,On-line,-,-,-,-,Yes,Yes,Yes,Yes,Yes,Female,31 35,Temporary Full-time,Administration (AO),3-4,3-4


In [4]:
# get basic information of dataset
tafe_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
 #   Column                                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                                         --------------  -----  
 0   Record ID                                                                                                                                                      702 non-null    float64
 1   Institute                                                                                                                                                      702 non-null    object 
 2   WorkArea                                                                                                                                  

In [5]:
dete_survey = pd.read_csv('datasets/dete_survey.csv')
dete_survey.sample(5)

Unnamed: 0,ID,SeparationType,Cease Date,DETE Start Date,Role Start Date,Position,Classification,Region,Business Unit,Employment Status,Career move to public sector,Career move to private sector,Interpersonal conflicts,Job dissatisfaction,Dissatisfaction with the department,Physical work environment,Lack of recognition,Lack of job security,Work location,Employment conditions,Maternity/family,Relocation,Study/Travel,Ill Health,Traumatic incident,Work life balance,Workload,None of the above,Professional Development,Opportunities for promotion,Staff morale,Workplace issue,Physical environment,Worklife balance,Stress and pressure support,Performance of supervisor,Peer support,Initiative,Skills,Coach,Career Aspirations,Feedback,Further PD,Communication,My say,Information,Kept informed,Wellness programs,Health & Safety,Gender,Age,Aboriginal,Torres Strait,South Sea,Disability,NESB
627,628,Ill Health Retirement,05/2013,2000,2000,Teacher Aide,,Not Stated,,Permanent Part-time,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,N,D,D,D,A,A,N,SD,N,A,D,,SD,D,D,D,SD,D,N,,N,Female,61 or older,,,,,
446,447,Age Retirement,2013,1987,2000,Teacher,Primary,Metropolitan,,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,SA,SA,SA,SA,SA,A,SA,SA,SA,SA,SA,SA,A,SA,SA,SA,SA,SA,SA,A,SA,Female,61 or older,,,,,
315,316,Other,2012,Not Stated,Not Stated,Cleaner,,South East,,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,A,N,N,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,D,M,Female,61 or older,,,,,
694,696,Resignation-Other reasons,Not Stated,2012,Not Stated,Teacher Aide,,Metropolitan,,Casual,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,,,,,,,,,,,,,,,,,,,,,,Female,46-50,,,,,
487,488,Ill Health Retirement,2012,1993,2010,Teacher Aide,,North Coast,,Permanent Part-time,False,False,True,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,SD,SD,SD,SD,A,A,SD,SD,D,D,SD,SD,SD,M,SD,SD,SD,SD,D,SD,D,Female,61 or older,,,,,


In [6]:
dete_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   ID                                   822 non-null    int64 
 1   SeparationType                       822 non-null    object
 2   Cease Date                           822 non-null    object
 3   DETE Start Date                      822 non-null    object
 4   Role Start Date                      822 non-null    object
 5   Position                             817 non-null    object
 6   Classification                       455 non-null    object
 7   Region                               822 non-null    object
 8   Business Unit                        126 non-null    object
 9   Employment Status                    817 non-null    object
 10  Career move to public sector         822 non-null    bool  
 11  Career move to private sector        822 non-

In [7]:
# check null/missing values in both dataframes

In [8]:
tafe_survey.isnull().sum()

Record ID                                                                                                                                                          0
Institute                                                                                                                                                          0
WorkArea                                                                                                                                                           0
CESSATION YEAR                                                                                                                                                     7
Reason for ceasing employment                                                                                                                                      1
Contributing Factors. Career Move - Public Sector                                                                                                                265
Contributi

In [9]:
dete_survey.isnull().sum()

ID                                       0
SeparationType                           0
Cease Date                               0
DETE Start Date                          0
Role Start Date                          0
Position                                 5
Classification                         367
Region                                   0
Business Unit                          696
Employment Status                        5
Career move to public sector             0
Career move to private sector            0
Interpersonal conflicts                  0
Job dissatisfaction                      0
Dissatisfaction with the department      0
Physical work environment                0
Lack of recognition                      0
Lack of job security                     0
Work location                            0
Employment conditions                    0
Maternity/family                         0
Relocation                               0
Study/Travel                             0
Ill Health 

We can make the following observations based on the work above:
- The `dete_survey` dataframe contains 'Not Stated' values that indicate missing data, but they are stored as plain strings rather than `NaN`, so `isnull()` does not count them.
- Both `dete_survey` and `tafe_survey` contain many columns that we don't need for our analysis.
- The two dataframes share many of the same columns, but under different names. Several columns/answers indicate that an employee resigned because they were dissatisfied.
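
The first observation is easy to verify on a small scale. The sketch below uses a hypothetical dataframe (`toy`, not the survey data) to show that `isnull()` only sees real `NaN` values, while an equality comparison reveals the placeholder strings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real survey: a few placeholder strings and one true NaN
toy = pd.DataFrame({
    'Cease Date': ['2012', 'Not Stated', '05/2013', np.nan],
    'Region': ['Metropolitan', 'Not Stated', 'Not Stated', 'South East'],
})

# isnull() only counts real NaN values
nan_counts = toy.isnull().sum()

# Comparing against the placeholder counts the "hidden" missing values
placeholder_counts = (toy == 'Not Stated').sum()

print(nan_counts)
print(placeholder_counts)
```

Running the same comparison on `dete_survey` should show which columns hide missing values behind the 'Not Stated' label.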

### Identify Missing Values and Drop Unnecessary Columns

In [10]:
# keep only the following columns from the DETE survey

dete_columns = ['ID', 'SeparationType', 'Cease Date', 'DETE Start Date',
       'Role Start Date', 'Position', 'Classification', 'Region',
       'Business Unit', 'Employment Status', 'Career move to public sector',
       'Career move to private sector', 'Interpersonal conflicts',
       'Job dissatisfaction', 'Dissatisfaction with the department',
       'Physical work environment', 'Lack of recognition',
       'Lack of job security', 'Work location', 'Employment conditions',
       'Maternity/family', 'Relocation', 'Study/Travel', 'Ill Health',
       'Traumatic incident', 'Work life balance', 'Workload',
       'None of the above', 'Gender', 'Age', 'Aboriginal', 'Torres Strait',
       'South Sea', 'Disability', 'NESB']
# Read in the data again, but this time read `Not Stated` values as `NaN`

dete_survey_updated = pd.read_csv('datasets/dete_survey.csv', usecols=dete_columns, na_values='Not Stated')
dete_survey_updated.head()

Unnamed: 0,ID,SeparationType,Cease Date,DETE Start Date,Role Start Date,Position,Classification,Region,Business Unit,Employment Status,Career move to public sector,Career move to private sector,Interpersonal conflicts,Job dissatisfaction,Dissatisfaction with the department,Physical work environment,Lack of recognition,Lack of job security,Work location,Employment conditions,Maternity/family,Relocation,Study/Travel,Ill Health,Traumatic incident,Work life balance,Workload,None of the above,Gender,Age,Aboriginal,Torres Strait,South Sea,Disability,NESB
0,1,Ill Health Retirement,08/2012,1984.0,2004.0,Public Servant,A01-A04,Central Office,Corporate Strategy and Peformance,Permanent Full-time,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,True,Male,56-60,,,,,Yes
1,2,Voluntary Early Retirement (VER),08/2012,,,Public Servant,AO5-AO7,Central Office,Corporate Strategy and Peformance,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,Male,56-60,,,,,
2,3,Voluntary Early Retirement (VER),05/2012,2011.0,2011.0,Schools Officer,,Central Office,Education Queensland,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,Male,61 or older,,,,,
3,4,Resignation-Other reasons,05/2012,2005.0,2006.0,Teacher,Primary,Central Queensland,,Permanent Full-time,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,Female,36-40,,,,,
4,5,Age Retirement,05/2012,1970.0,1989.0,Head of Curriculum/Head of Special Education,,South East,,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,Female,61 or older,,,,,
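
To see what `usecols` and `na_values` do in isolation, here is a minimal sketch on an in-memory CSV. The three rows in `csv_text` are made up for illustration and only loosely mimic `datasets/dete_survey.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real survey file
csv_text = """ID,Cease Date,DETE Start Date,Region,Workload
1,08/2012,1984,Central Office,False
2,Not Stated,Not Stated,Not Stated,True
3,2013,2005,South East,False
"""

# Without na_values, 'Not Stated' stays a string and the year column is read as text
raw = pd.read_csv(io.StringIO(csv_text))

# With usecols and na_values, unwanted columns are skipped and the placeholder becomes NaN
clean = pd.read_csv(io.StringIO(csv_text),
                    usecols=['ID', 'Cease Date', 'DETE Start Date', 'Region'],
                    na_values='Not Stated')

print(raw['DETE Start Date'].dtype)    # a text dtype, because of the placeholder string
print(clean['DETE Start Date'].dtype)  # float64, because NaN forces a float column
print(clean.isnull().sum())
```

This is also why a year column can switch from text to `float64` once the placeholder is read as `NaN`.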


In [11]:
dete_survey_updated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 35 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   ID                                   822 non-null    int64  
 1   SeparationType                       822 non-null    object 
 2   Cease Date                           788 non-null    object 
 3   DETE Start Date                      749 non-null    float64
 4   Role Start Date                      724 non-null    float64
 5   Position                             817 non-null    object 
 6   Classification                       455 non-null    object 
 7   Region                               717 non-null    object 
 8   Business Unit                        126 non-null    object 
 9   Employment Status                    817 non-null    object 
 10  Career move to public sector         822 non-null    bool   
 11  Career move to private sector   

In [12]:
# keep only the following columns from the TAFE survey

tafe_columns = ['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR',
       'Reason for ceasing employment',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Gender. What is your Gender?', 'CurrentAge. Current Age',
       'Employment Type. Employment Type', 'Classification. Classification',
       'LengthofServiceOverall. Overall Length of Service at Institute (in years)',
       'LengthofServiceCurrent. Length of Service at current workplace (in years)']

# Read in the data again, but this time read `Not Stated` values as `NaN`
tafe_survey_updated = pd.read_csv('datasets/tafe_survey.csv', usecols=tafe_columns, na_values='Not Stated')
tafe_survey_updated.head()

Unnamed: 0,Record ID,Institute,WorkArea,CESSATION YEAR,Reason for ceasing employment,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,Contributing Factors. Dissatisfaction,Contributing Factors. Job Dissatisfaction,Contributing Factors. Interpersonal Conflict,Contributing Factors. Study,Contributing Factors. Travel,Contributing Factors. Other,Contributing Factors. NONE,Gender. What is your Gender?,CurrentAge. Current Age,Employment Type. Employment Type,Classification. Classification,LengthofServiceOverall. Overall Length of Service at Institute (in years),LengthofServiceCurrent. Length of Service at current workplace (in years)
0,6.34133e+17,Southern Queensland Institute of TAFE,Non-Delivery (corporate),2010.0,Contract Expired,,,,,,,,,,,,,Female,26 30,Temporary Full-time,Administration (AO),1-2,1-2
1,6.341337e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Retirement,-,-,-,-,-,-,-,-,-,Travel,-,-,,,,,,
2,6.341388e+17,Mount Isa Institute of TAFE,Delivery (teaching),2010.0,Retirement,-,-,-,-,-,-,-,-,-,-,-,NONE,,,,,,
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,-,-,-,-,Travel,-,-,,,,,,
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,-,-,-,-,-,-,-,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4


In [13]:
tafe_survey_updated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 23 columns):
 #   Column                                                                     Non-Null Count  Dtype  
---  ------                                                                     --------------  -----  
 0   Record ID                                                                  702 non-null    float64
 1   Institute                                                                  702 non-null    object 
 2   WorkArea                                                                   702 non-null    object 
 3   CESSATION YEAR                                                             695 non-null    float64
 4   Reason for ceasing employment                                              701 non-null    object 
 5   Contributing Factors. Career Move - Public Sector                          437 non-null    object 
 6   Contributing Factors. Career Move - Private Sector        

#### Clean column names
To clean the column names, we apply the following steps:
- convert upper case to lower case
- strip leading and trailing whitespace
- replace the remaining spaces with `_`
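
The order of these steps matters. As a small sketch on a hypothetical `Index` (the labels are invented for illustration), stripping before replacing keeps padding from turning into stray underscores:

```python
import pandas as pd

# Hypothetical column labels with mixed case, padding, and inner spaces
cols = pd.Index([' Cease Date ', 'SeparationType', 'Work life balance'])

# lower -> strip -> replace: padding is removed before spaces become underscores
cleaned = cols.str.lower().str.strip().str.replace(' ', '_')
print(list(cleaned))
# ['cease_date', 'separationtype', 'work_life_balance']

# Replacing first would convert the padding into underscores that strip() cannot remove
wrong_order = cols.str.lower().str.replace(' ', '_').str.strip()
print(list(wrong_order))
# ['_cease_date_', 'separationtype', 'work_life_balance']
```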

In [14]:
dete_survey_updated.columns =  dete_survey_updated.columns.str.lower().str.strip().str.replace(' ','_')
dete_survey_updated.head()

Unnamed: 0,id,separationtype,cease_date,dete_start_date,role_start_date,position,classification,region,business_unit,employment_status,career_move_to_public_sector,career_move_to_private_sector,interpersonal_conflicts,job_dissatisfaction,dissatisfaction_with_the_department,physical_work_environment,lack_of_recognition,lack_of_job_security,work_location,employment_conditions,maternity/family,relocation,study/travel,ill_health,traumatic_incident,work_life_balance,workload,none_of_the_above,gender,age,aboriginal,torres_strait,south_sea,disability,nesb
0,1,Ill Health Retirement,08/2012,1984.0,2004.0,Public Servant,A01-A04,Central Office,Corporate Strategy and Peformance,Permanent Full-time,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,True,Male,56-60,,,,,Yes
1,2,Voluntary Early Retirement (VER),08/2012,,,Public Servant,AO5-AO7,Central Office,Corporate Strategy and Peformance,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,Male,56-60,,,,,
2,3,Voluntary Early Retirement (VER),05/2012,2011.0,2011.0,Schools Officer,,Central Office,Education Queensland,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,Male,61 or older,,,,,
3,4,Resignation-Other reasons,05/2012,2005.0,2006.0,Teacher,Primary,Central Queensland,,Permanent Full-time,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,Female,36-40,,,,,
4,5,Age Retirement,05/2012,1970.0,1989.0,Head of Curriculum/Head of Special Education,,South East,,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,Female,61 or older,,,,,


#### Update column names of `tafe_survey_updated` to match the names in `dete_survey_updated`


In [15]:
tafe_survey_updated.head()

Unnamed: 0,Record ID,Institute,WorkArea,CESSATION YEAR,Reason for ceasing employment,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,Contributing Factors. Dissatisfaction,Contributing Factors. Job Dissatisfaction,Contributing Factors. Interpersonal Conflict,Contributing Factors. Study,Contributing Factors. Travel,Contributing Factors. Other,Contributing Factors. NONE,Gender. What is your Gender?,CurrentAge. Current Age,Employment Type. Employment Type,Classification. Classification,LengthofServiceOverall. Overall Length of Service at Institute (in years),LengthofServiceCurrent. Length of Service at current workplace (in years)
0,6.34133e+17,Southern Queensland Institute of TAFE,Non-Delivery (corporate),2010.0,Contract Expired,,,,,,,,,,,,,Female,26 30,Temporary Full-time,Administration (AO),1-2,1-2
1,6.341337e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Retirement,-,-,-,-,-,-,-,-,-,Travel,-,-,,,,,,
2,6.341388e+17,Mount Isa Institute of TAFE,Delivery (teaching),2010.0,Retirement,-,-,-,-,-,-,-,-,-,-,-,NONE,,,,,,
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,-,-,-,-,Travel,-,-,,,,,,
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,-,-,-,-,-,-,-,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4


In [16]:
# map the long TAFE column labels to the short names used in `dete_survey_updated`
mapping = {'Record ID': 'id',
       'CESSATION YEAR': 'cease_date',
       'Reason for ceasing employment': 'separationtype',
       'Gender. What is your Gender?': 'gender',
       'CurrentAge. Current Age': 'age',
       'Employment Type. Employment Type': 'employment_status',
       'Classification. Classification': 'position',
       'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
       'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}
tafe_survey_updated = tafe_survey_updated.rename(columns=mapping)

# Check that the specified column names were updated correctly
tafe_survey_updated.columns


Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender',
       'age', 'employment_status', 'position', 'institute_service',
       'role_service'],
      dtype='object')
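
One thing to watch with long labels like these: by default `rename` silently ignores mapping keys that match no column, so a small typo leaves the old name in place. The sketch below shows two ways to catch that, using a hypothetical two-column frame with a deliberately misspelled key.

```python
import pandas as pd

# Hypothetical frame with two of the long TAFE-style labels
toy = pd.DataFrame({'Record ID': [1.0, 2.0], 'CESSATION YEAR': [2010.0, 2011.0]})

# 'Cessation Year' is a deliberate typo (wrong case) to show the failure mode
mapping = {'Record ID': 'id', 'Cessation Year': 'cease_date'}

# Default behaviour: the unmatched key is ignored without any warning
renamed = toy.rename(columns=mapping)
print(list(renamed.columns))  # ['id', 'CESSATION YEAR']

# Listing the unmatched keys makes the problem visible
missing = [key for key in mapping if key not in toy.columns]
print(missing)  # ['Cessation Year']

# errors='raise' turns the silent skip into a KeyError
try:
    toy.rename(columns=mapping, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)
```

Printing the columns after renaming, as the cell above does, serves the same purpose on a small frame.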

In [17]:
tafe_survey_updated.head()

Unnamed: 0,id,Institute,WorkArea,cease_date,separationtype,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,Contributing Factors. Dissatisfaction,Contributing Factors. Job Dissatisfaction,Contributing Factors. Interpersonal Conflict,Contributing Factors. Study,Contributing Factors. Travel,Contributing Factors. Other,Contributing Factors. NONE,gender,age,employment_status,position,institute_service,role_service
0,6.34133e+17,Southern Queensland Institute of TAFE,Non-Delivery (corporate),2010.0,Contract Expired,,,,,,,,,,,,,Female,26 30,Temporary Full-time,Administration (AO),1-2,1-2
1,6.341337e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Retirement,-,-,-,-,-,-,-,-,-,Travel,-,-,,,,,,
2,6.341388e+17,Mount Isa Institute of TAFE,Delivery (teaching),2010.0,Retirement,-,-,-,-,-,-,-,-,-,-,-,NONE,,,,,,
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,-,-,-,-,Travel,-,-,,,,,,
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,-,-,-,-,-,-,-,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4
