# Case Study: Do Employees with more Experience make more Money?

### Objectives
After this lesson you should be able to...

+ Add new columns to a DataFrame
+ Change the data type of a column
+ Create categorical data type with the **`cut`** function

### Prepare for this lesson by...
+ Reading the first section on [Categorical Data](http://pandas.pydata.org/pandas-docs/stable/categorical.html)

## Introduction
In this lesson we will first learn how to add new columns to our DataFrame and use a little bit of datetime functionality to calculate years of experience. We will then determine if more experience results in a larger salary.

### Adding new columns to the DataFrame
A very common task during analysis is to add new columns to the working DataFrame. Usually, some operation is performed on the existing columns and the outcome is added as a new column. The easiest way to add a new column is to put the new column name inside the indexing operator and set it equal to a scalar value such as an integer.

In [1]:
import pandas as pd
import numpy as np

In [2]:
employee = pd.read_csv('../../data/employee.csv')
employee.head()

Unnamed: 0,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13
1,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18
2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03
3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25
4,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22


In [3]:
# add a constant value as a new column
# the new column is always last

employee['NEW_CONSTANT_COLUMN'] = 5

In [4]:
# Take a look at the last few columns and there you see the new column
employee.iloc[:5, -5:]

Unnamed: 0,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,NEW_CONSTANT_COLUMN
0,Female,Active,2006-06-12,2012-10-13,5
1,Female,Active,2000-07-19,2010-09-18,5
2,Male,Active,2015-02-03,2015-02-03,5
3,Male,Active,1982-02-08,1991-05-25,5
4,Male,Active,1989-06-19,1994-10-22,5
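
Assignment with the indexing operator always appends the new column at the end. As a minimal sketch on made-up toy data (the `df` frame and the `FIRST_COLUMN` name are illustrative and not part of the employee dataset), the **`insert`** method can place a column at a chosen position instead:

```python
import pandas as pd

# Toy data standing in for the employee table (values are made up)
df = pd.DataFrame({'GENDER': ['Female', 'Male'],
                   'BASE_SALARY': [50000.0, 60000.0]})

# Indexing-operator assignment appends the new column at the end
df['NEW_CONSTANT_COLUMN'] = 5

# insert places a column at an integer position (0 is the first column)
df.insert(0, 'FIRST_COLUMN', 1)

column_order = list(df.columns)
print(column_order)
```

Note that `insert` modifies the DataFrame in place and raises an error if the column name already exists.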


### Adding two columns to create a new one
It's possible to add two numeric or string columns together and assign the result as a new column. Here we join **`RACE`** and **`GENDER`**, separated by a hyphen.

In [5]:
employee['RACE_GENDER'] = employee['RACE'] + '-' + employee['GENDER']

In [6]:
# inspect the source columns alongside the new column to check it is as intended
employee[['RACE', 'GENDER', 'RACE_GENDER']].head()

Unnamed: 0,RACE,GENDER,RACE_GENDER
0,Hispanic/Latino,Female,Hispanic/Latino-Female
1,Hispanic/Latino,Female,Hispanic/Latino-Female
2,White,Male,White-Male
3,White,Male,White-Male
4,White,Male,White-Male


### Taking the value count of this new column
Now we can get a more precise breakdown of gender and race. It's interesting to see that there are relatively few white females compared with white males.

In [7]:
employee['RACE_GENDER'].value_counts()

White-Male                                  557
Black or African American-Male              395
Hispanic/Latino-Male                        341
Black or African American-Female            305
Hispanic/Latino-Female                      139
White-Female                                108
Asian/Pacific Islander-Male                  77
Asian/Pacific Islander-Female                30
American Indian or Alaskan Native-Female      6
American Indian or Alaskan Native-Male        5
Others-Male                                   1
Others-Female                                 1
Name: RACE_GENDER, dtype: int64
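
The same kind of breakdown can be viewed as proportions or as a two-way table. The sketch below uses a small made-up frame (the four rows are illustrative, not taken from the employee data) with `value_counts(normalize=True)` and `pd.crosstab`:

```python
import pandas as pd

# Made-up rows; only the column names mirror the employee data
df = pd.DataFrame({
    'RACE': ['White', 'White', 'Asian', 'White'],
    'GENDER': ['Male', 'Female', 'Male', 'Male'],
})

# Combine the two string columns exactly as in the lesson
combined = df['RACE'] + '-' + df['GENDER']

# normalize=True returns fractions of the total instead of raw counts
proportions = combined.value_counts(normalize=True)

# crosstab counts every RACE/GENDER pair without building a combined column
table = pd.crosstab(df['RACE'], df['GENDER'])
print(proportions)
print(table)
```

The crosstab keeps race and gender on separate axes, which can be easier to scan than the concatenated labels.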

### Adding a random numeric column
Let's say the city of Houston wants to give everyone a random bonus ranging anywhere between 0 and 10% of their current base salary. To begin, we will make a column called **RANDOM_BONUS** and assign each employee a random number between 0 and 0.1. We use the NumPy random module function **`rand`** to create random numbers between 0 and 1 and multiply them by 0.1 to get the desired bonus fraction.

In [8]:
np.random.seed(123)
n = len(employee)
employee['RANDOM_BONUS'] = np.random.rand(n) * .1

In [9]:
# inspect last columns
employee.iloc[:3, -5:]

Unnamed: 0,HIRE_DATE,JOB_DATE,NEW_CONSTANT_COLUMN,RACE_GENDER,RANDOM_BONUS
0,2006-06-12,2012-10-13,5,Hispanic/Latino-Female,0.069647
1,2000-07-19,2010-09-18,5,Hispanic/Latino-Female,0.028614
2,2015-02-03,2015-02-03,5,White-Male,0.022685
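
As an alternative sketch, NumPy's newer `Generator` interface can draw the bonus fractions directly from the desired range. The variable names here are illustrative, and the numbers produced will differ from those generated with `np.random.seed(123)` and `rand` above:

```python
import numpy as np

# A Generator with its own seed; it does not touch the global random state
rng = np.random.default_rng(123)

n = 5  # stand-in for len(employee)

# uniform draws from the half-open interval [0, 0.1)
random_bonus = rng.uniform(0, 0.1, size=n)
print(random_bonus)
```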


### Data Types
In a previous notebook, all the pandas data types were listed in a table. We can use either the **`info`** method or the **`dtypes`** attribute to list the data types of each column.

In [10]:
# only look at the data types - returns a Series
employee.dtypes

POSITION_TITLE          object
DEPARTMENT              object
BASE_SALARY            float64
RACE                    object
EMPLOYMENT_TYPE         object
GENDER                  object
EMPLOYMENT_STATUS       object
HIRE_DATE               object
JOB_DATE                object
NEW_CONSTANT_COLUMN      int64
RACE_GENDER             object
RANDOM_BONUS           float64
dtype: object

### Changing Data Types
Occasionally, some columns will not be the desired type and need to be changed to a different type. A common situation occurs when dealing with dates that are read in as strings and defaulted to the **`object`** data type. The Series method **astype** will attempt to force a column to a different type.

The argument passed to **astype** must be the name of the new data type as either a string or a NumPy object. pandas now requires that you pass the precision of the datetime. For now we will use nanosecond precision and use the string **`datetime64[ns]`**. The **HIRE_DATE** column is currently a string and gets converted to a **datetime64** below along with other date columns.

The function **`pd.to_datetime`** also works and is more flexible than **`astype`**.

In [11]:
# using a string
employee['HIRE_DATE'] = employee['HIRE_DATE'].astype('datetime64[ns]')

# yet another way using a function with flexibility to do more things
employee['JOB_DATE'] = pd.to_datetime(employee['JOB_DATE'])

In [12]:
employee.dtypes

POSITION_TITLE                 object
DEPARTMENT                     object
BASE_SALARY                   float64
RACE                           object
EMPLOYMENT_TYPE                object
GENDER                         object
EMPLOYMENT_STATUS              object
HIRE_DATE              datetime64[ns]
JOB_DATE               datetime64[ns]
NEW_CONSTANT_COLUMN             int64
RACE_GENDER                    object
RANDOM_BONUS                  float64
dtype: object
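
To illustrate the extra flexibility of **`pd.to_datetime`**, here is a minimal sketch on a made-up Series (the `'not a date'` entry is invented for illustration). The `format` parameter states the expected layout and `errors='coerce'` turns unparseable entries into missing values (`NaT`) rather than raising an error:

```python
import pandas as pd

# Made-up strings; the last one cannot be parsed as a date
raw = pd.Series(['2006-06-12', '2000-07-19', 'not a date'])

# format gives the expected layout; errors='coerce' yields NaT on failure
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
print(parsed)
```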

# Case Study: Do people with more experience make more money?
To answer this question, the number of years of experience needs to be calculated from the column **HIRE_DATE**. **datetime64** columns can be subtracted from one another, and a single date can be subtracted from or by a whole column. As the reference date we will use the date on which the data was generated, which was around December 2016.

In [13]:
pull_date = pd.Timestamp('2016-12-1')
pull_date

Timestamp('2016-12-01 00:00:00')

In [14]:
# subtract the hire date from the pull date to get the number of days of experience
experience = pull_date - employee['HIRE_DATE']

# print out head of series
experience.head()

0    3825 days
1    5979 days
2     667 days
3   12715 days
4   10027 days
Name: HIRE_DATE, dtype: timedelta64[ns]

### Converting to years
Notice that the data type is now **timedelta64**, which represents a duration (displayed here in days). To convert this to years, divide it by a **`Timedelta`** that is one year long. [See here for more detail](http://pandas.pydata.org/pandas-docs/stable/timedeltas.html#frequency-conversion)

In [15]:
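# convert to years by dividing by the length of an average Gregorian year
# (recent pandas versions no longer accept the 'Y' unit in pd.Timedelta)
years_experience = experience / pd.Timedelta(days=365.2425)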

# inspect and check that it makes sense
years_experience.head()

0    10.472494
1    16.369946
2     1.826184
3    34.812488
4    27.452994
Name: HIRE_DATE, dtype: float64
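
The years shown above are consistent with a year length of 365.2425 days, the average Gregorian year. As a small self-contained check, the sketch below rebuilds two of the hire dates from the table and converts the day counts by hand using the `dt.days` accessor:

```python
import pandas as pd

pull_date = pd.Timestamp('2016-12-01')

# Two hire dates copied from the first rows of the employee table
hire_dates = pd.Series(pd.to_datetime(['2006-06-12', '2015-02-03']))

# Subtracting datetimes yields a timedelta Series
experience = pull_date - hire_dates

# dt.days extracts the whole number of days as integers
days = experience.dt.days

# Divide by the average Gregorian year length to get years
years = days / 365.2425
print(days.tolist(), years.round(6).tolist())
```

Any fixed year length is an approximation, since calendar years are either 365 or 366 days long.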

In [16]:
# Make a new column
employee['YEARS_EXPERIENCE'] = years_experience

### Creating categories for years of experience
It's possible to divide numerical columns into different categories based on their value. The pandas **cut** function accepts a Series or an array and a list of the edges of the **bins**. Each category can be given a **label** as well. The result is a Series of **categorical** type, a data type specific to pandas. [More on categorical data](http://pandas.pydata.org/pandas-docs/stable/categorical.html)

In [17]:
# create Series of categorical data
exp_categories = pd.cut(years_experience, bins=[0, 5, 15, 100], labels=['Novice', 'Experienced', 'Senior'])

In [18]:
# inspect Series
exp_categories.head(10)

0    Experienced
1         Senior
2         Novice
3         Senior
4         Senior
5         Senior
6         Novice
7         Novice
8         Senior
9         Novice
Name: HIRE_DATE, dtype: category
Categories (3, object): [Novice < Experienced < Senior]

In [19]:
# get some summary statistics
exp_categories.value_counts()

Senior         806
Experienced    663
Novice         531
Name: HIRE_DATE, dtype: int64
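
It helps to know how **cut** treats values that fall exactly on a bin edge or outside all bins. By default each bin includes its right edge but not its left edge, and values outside every bin become missing. The sketch below uses made-up values chosen to sit on and beyond the edges:

```python
import pandas as pd

# Made-up values placed on the bin edges and outside the bins
years = pd.Series([0.0, 5.0, 5.1, 15.0, 120.0])

# Default right=True gives the bins (0, 5], (5, 15], (15, 100]
levels = pd.cut(years, bins=[0, 5, 15, 100],
                labels=['Novice', 'Experienced', 'Senior'])
print(levels)
```

Here 0.0 and 120.0 fall outside every bin and become `NaN`, while 5.0 is `Novice` and 15.0 is `Experienced`. Passing `include_lowest=True` would keep a value of exactly 0.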

In [20]:
# Create new column
employee['EXPERIENCE_LEVEL'] = exp_categories

## Reading in Dates
It is possible to read in dates correctly using **`read_csv`**. Use a list of the column names you would like to be dates as the argument for the **`parse_dates`** parameter.

In [21]:
employee2 = pd.read_csv('../../data/employee.csv', parse_dates=['HIRE_DATE', 'JOB_DATE'])
employee2.dtypes

POSITION_TITLE               object
DEPARTMENT                   object
BASE_SALARY                 float64
RACE                         object
EMPLOYMENT_TYPE              object
GENDER                       object
EMPLOYMENT_STATUS            object
HIRE_DATE            datetime64[ns]
JOB_DATE             datetime64[ns]
dtype: object
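
The effect of **`parse_dates`** can be seen without the data file. This sketch reads a tiny made-up CSV from an in-memory string (the `NAME` column and its values are invented for illustration) with and without the parameter:

```python
import io
import pandas as pd

# A tiny made-up CSV held in memory instead of a file on disk
csv_text = "NAME,HIRE_DATE\nA,2006-06-12\nB,2000-07-19\n"

# Without parse_dates the date column is read as strings
as_strings = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the listed columns are converted to datetimes
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['HIRE_DATE'])

print(as_strings.dtypes)
print(as_dates.dtypes)
```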

# Exercises

In [23]:
# you can run this cell to get all the transformations from this notebook all at once
employee = pd.read_csv('../../data/employee.csv', parse_dates=['HIRE_DATE', 'JOB_DATE'])
employee['NEW_CONSTANT_COLUMN'] = 5
employee['RACE_GENDER'] = employee['RACE'] + '-' + employee['GENDER']

np.random.seed(123)
n = len(employee)
employee['RANDOM_BONUS'] = np.random.rand(n) * .1

# divide by an average Gregorian year; recent pandas versions reject pd.Timedelta(1, 'Y')
employee['YEARS_EXPERIENCE'] = (pd.Timestamp('2016-12-1') - employee['HIRE_DATE']) / pd.Timedelta(days=365.2425)
employee['EXPERIENCE_LEVEL'] =  pd.cut(employee['YEARS_EXPERIENCE'], 
                                       bins=[0, 5, 15, 100], 
                                       labels=['Novice', 'Experienced', 'Senior'])

In [24]:
employee.head()

Unnamed: 0,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,NEW_CONSTANT_COLUMN,RACE_GENDER,RANDOM_BONUS,YEARS_EXPERIENCE,EXPERIENCE_LEVEL
0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13,5,Hispanic/Latino-Female,0.069647,10.472494,Experienced
1,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18,5,Hispanic/Latino-Female,0.028614,16.369946,Senior
2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03,5,White-Male,0.022685,1.826184,Novice
3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25,5,White-Male,0.055131,34.812488,Senior
4,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22,5,White-Male,0.071947,27.452994,Senior


### Problem 1
<span  style="color:green; font-size:16px">Create new columns **`BONUS`** and **`TOTAL_COMP`**. Use column **`RANDOM_BONUS`** to calculate the bonus.</span>

In [25]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Use the **`EXPERIENCE_LEVEL`** column to determine if more experienced employees make more money.</span>

In [26]:
# your code here

# Solutions

### Problem 1
<span  style="color:green; font-size:16px">Create new columns **`BONUS`** and **`TOTAL_COMP`**. Use column **`RANDOM_BONUS`** to calculate the bonus.</span>

In [27]:
employee['BONUS'] = employee['RANDOM_BONUS'] * employee['BASE_SALARY']
employee['TOTAL_COMP'] = employee['BASE_SALARY'] + employee['BONUS']

employee.iloc[:3, -5:]

Unnamed: 0,RANDOM_BONUS,YEARS_EXPERIENCE,EXPERIENCE_LEVEL,BONUS,TOTAL_COMP
0,0.069647,10.472494,Experienced,8487.31279,130349.31279
1,0.028614,16.369946,Senior,747.539013,26872.539013
2,0.022685,1.826184,Novice,1027.160697,46306.160697


### Problem 2
<span  style="color:green; font-size:16px">Use the **`EXPERIENCE_LEVEL`** column to determine if more experienced employees make more money.</span>

In [28]:
novice = employee.loc[employee['EXPERIENCE_LEVEL'] == 'Novice', 'BASE_SALARY']
exper = employee.loc[employee['EXPERIENCE_LEVEL'] == 'Experienced', 'BASE_SALARY']
senior = employee.loc[employee['EXPERIENCE_LEVEL'] == 'Senior', 'BASE_SALARY']

novice.mean(), exper.mean(), senior.mean()

(44987.484, 55264.92867981791, 63638.224209078406)
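
The three separate selections can also be written as a single grouped aggregation. This is an alternative sketch rather than the solution above, run on made-up salaries and experience values so that it is self-contained:

```python
import pandas as pd

# Made-up salaries and experience; only the column names mirror the lesson
df = pd.DataFrame({
    'BASE_SALARY': [40000.0, 50000.0, 55000.0, 65000.0, 75000.0],
    'YEARS_EXPERIENCE': [1.0, 3.0, 10.0, 20.0, 30.0],
})

# Same bins and labels as in the lesson
df['EXPERIENCE_LEVEL'] = pd.cut(df['YEARS_EXPERIENCE'],
                                bins=[0, 5, 15, 100],
                                labels=['Novice', 'Experienced', 'Senior'])

# One mean per category; observed=True keeps only categories that appear
mean_by_level = df.groupby('EXPERIENCE_LEVEL', observed=True)['BASE_SALARY'].mean()
print(mean_by_level)
```

Because the categories are ordered, the result is listed from `Novice` to `Senior`, which makes the trend easy to read.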