# Introduction to Machine Learning

## 1. Defining the Question

### a) Specifying the Data Analysis Question

Can we predict whether an employee is eligible for promotion or not?

### b) Defining the Metric for Success

The machine learning model should predict whether a potential promotee at a checkpoint will be promoted or not after the evaluation process.
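
The statement above describes the goal but does not name a numeric metric. One hedged way to make it measurable is to compare predictions with known outcomes using accuracy, precision, recall, and F1. The labels below are made up for illustration and are not taken from the dataset.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical outcomes for ten employees: 1 = promoted, 0 = not promoted.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical model predictions for the same ten employees.
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
precision = precision_score(y_true, y_pred)  # share of predicted promotions that are real
recall = recall_score(y_true, y_pred)        # share of real promotions that were found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case accuracy is 0.80 while precision and recall are both 0.50, which shows why accuracy alone can look better than the model really is when one class is rare.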

### c) Understanding the context 

Human resources departments have used analytics for years. However, the collection, processing, and analysis of data have been largely manual, and given the nature of human resources dynamics and HR KPIs, this approach has constrained HR.

A large multinational corporation has nine broad verticals across the organization. This client struggles to identify the right people for promotion (only for the manager position and below) and to prepare them in time. The current process is as follows:
* They first identify a set of employees based on recommendations and past performance.
* Selected employees go through a separate training and evaluation program for each vertical.
* These programs are based on the skills each vertical requires. At the end of the program, employees are promoted based on factors such as training performance and KPI completion (only employees with KPIs completed greater than 60% are considered).

For the process mentioned above, the final promotions are only announced after the evaluation, and this leads to a delay in transition to their new roles. The company needs help in identifying the eligible candidates at a particular checkpoint so that they can expedite the entire promotion cycle.

### d) Recording the Experimental Design

1. Load libraries and the dataset.
2. Data exploration.
3. Data preparation.
4. Data modelling.
5. Summary and recommendations.

### e) Data Relevance

The given data set is relevant in answering the research question.

## 2. Reading the Data

In [1]:
# Importing our libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Load the data below
# --- 
hrdataset_url = "https://raw.githubusercontent.com/wambasisamuel/DE_Week04_Tuesday/main/hrdataset.csv"
df_hr = pd.read_csv(hrdataset_url) 

In [3]:
# Checking the first 5 rows of data
df_hr.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [4]:
# Checking the last 5 rows of data
# ---
df_hr.tail(5)

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
54803,3030,Technology,region_14,Bachelor's,m,sourcing,1,48,3.0,17,0,0,78,0
54804,74592,Operations,region_27,Master's & above,f,other,1,37,2.0,6,0,0,56,0
54805,13918,Analytics,region_1,Bachelor's,m,other,1,27,5.0,3,1,0,79,0
54806,13614,Sales & Marketing,region_9,,m,sourcing,1,29,1.0,2,0,0,45,0
54807,51526,HR,region_22,Bachelor's,m,other,1,27,1.0,5,0,0,49,0


In [5]:
# Checking number of rows and columns
df_hr.shape

(54808, 14)

In [6]:
# Checking datatypes
df_hr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           54808 non-null  int64  
 1   department            54808 non-null  object 
 2   region                54808 non-null  object 
 3   education             52399 non-null  object 
 4   gender                54808 non-null  object 
 5   recruitment_channel   54808 non-null  object 
 6   no_of_trainings       54808 non-null  int64  
 7   age                   54808 non-null  int64  
 8   previous_year_rating  50684 non-null  float64
 9   length_of_service     54808 non-null  int64  
 10  KPIs_met >80%         54808 non-null  int64  
 11  awards_won?           54808 non-null  int64  
 12  avg_training_score    54808 non-null  int64  
 13  is_promoted           54808 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 5.9+ MB


Observations:

*   There are 54808 observations in the dataset.
*   The dataset has 14 columns: 13 features plus the target, `is_promoted`.
*   There are 5 categorical (`object`) columns.
*   There are 9 numerical columns, including the target.
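
The counts of categorical and numerical columns can also be obtained programmatically rather than read off the `info()` output. This is a minimal sketch on a small made-up frame (`toy` is a stand-in, not `df_hr`); the same two `select_dtypes` calls should apply to the real data.

```python
import pandas as pd

# Toy frame standing in for df_hr; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "department": ["Sales & Marketing", "Operations", "Technology"],
    "education": ["Master's & above", "Bachelor's", None],
    "age": [35, 30, 45],
    "previous_year_rating": [5.0, None, 3.0],
    "is_promoted": [0, 0, 1],
})

# Numeric columns (int and float) versus everything else (here: text columns).
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(len(categorical_cols), "categorical:", categorical_cols)
print(len(numerical_cols), "numerical:", numerical_cols)
```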



## 3. External Data Source Validation

The provided dataset matches the one on Kaggle. It has enough features to help in developing a machine learning model that can predict employee promotions.

## 4. Data Preparation

### Data Standardisation

In [7]:
# Standardise column names
# ---
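# replace '>' with 'gt_'
# replace '%' with '_pct'
# replace whitespace with '_'
# remove '?' from column names
# regex=False treats each pattern as a literal character ('?' is a regex metacharacter)
df_hr.columns = (
    df_hr.columns.str.strip().str.lower()
    .str.replace('>', 'gt_', regex=False).str.replace('%', '_pct', regex=False)
    .str.replace(' ', '_', regex=False).str.replace('?', '', regex=False)
)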
df_hr.columns


Index(['employee_id', 'department', 'region', 'education', 'gender',
       'recruitment_channel', 'no_of_trainings', 'age', 'previous_year_rating',
       'length_of_service', 'kpis_met_gt_80_pct', 'awards_won',
       'avg_training_score', 'is_promoted'],
      dtype='object')
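
The same renaming rules can be wrapped in a small function, which makes them easy to check on a few names before applying them to the whole frame. This is a sketch: `standardise_column` is a hypothetical helper, not part of pandas, and it uses plain Python string methods so every pattern is treated literally.

```python
import pandas as pd

def standardise_column(name: str) -> str:
    """Lower-case a column name and replace characters that are awkward in code."""
    return (
        name.strip()
        .lower()
        .replace(">", "gt_")   # '>' becomes 'gt_'
        .replace("%", "_pct")  # '%' becomes '_pct'
        .replace(" ", "_")     # spaces become underscores
        .replace("?", "")      # question marks are removed
    )

# Empty frame with three of the original column names, for illustration only.
raw = pd.DataFrame(columns=["KPIs_met >80%", "awards_won?", "avg_training_score"])
clean = raw.rename(columns=standardise_column)

print(list(clean.columns))
```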

### Data Cleaning

#### Irrelevant Data

The columns *employee_id, region, gender, recruitment_channel* do not appear to be related to promotion, so I will drop them.

In [8]:
df_hr.drop(columns=['employee_id','region','gender','recruitment_channel'],inplace=True)
df_hr

Unnamed: 0,department,education,no_of_trainings,age,previous_year_rating,length_of_service,kpis_met_gt_80_pct,awards_won,avg_training_score,is_promoted
0,Sales & Marketing,Master's & above,1,35,5.0,8,1,0,49,0
1,Operations,Bachelor's,1,30,5.0,4,0,0,60,0
2,Sales & Marketing,Bachelor's,1,34,3.0,7,0,0,50,0
3,Sales & Marketing,Bachelor's,2,39,1.0,10,0,0,50,0
4,Technology,Bachelor's,1,45,3.0,2,0,0,73,0
...,...,...,...,...,...,...,...,...,...,...
54803,Technology,Bachelor's,1,48,3.0,17,0,0,78,0
54804,Operations,Master's & above,1,37,2.0,6,0,0,56,0
54805,Analytics,Bachelor's,1,27,5.0,3,1,0,79,0
54806,Sales & Marketing,,1,29,1.0,2,0,0,45,0


#### Duplicate data

In [9]:
# Find the total number of duplicate records
df_hr.duplicated().sum()

7245

In [10]:
# Drop duplicates
df_hr.drop_duplicates(keep='first',inplace=True)
df_hr.shape 

(47563, 10)
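
A possible reason for the large duplicate count is that rows which differed only in the dropped columns (such as the employee identifier) become identical once those columns are gone. I have not verified this on the dataset; the sketch below only illustrates the effect on made-up data.

```python
import pandas as pd

# Three made-up employees; the first two differ only in their identifier.
toy = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "age": [30, 30, 41],
    "avg_training_score": [60, 60, 73],
})

duplicates_with_id = int(toy.duplicated().sum())

# Dropping the identifier makes the first two rows indistinguishable.
without_id = toy.drop(columns=["employee_id"])
duplicates_without_id = int(without_id.duplicated().sum())

# keep="first" retains the first occurrence of each repeated row.
deduplicated = without_id.drop_duplicates(keep="first")

print(duplicates_with_id, duplicates_without_id, deduplicated.shape)
```

If that is what happened here, the dropped rows may be distinct employees with identical attributes, so it could be worth checking for duplicates before dropping the identifier.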

#### Missing Data

In [11]:
# Checking missing entries of all the variables
# ---
# 
df_hr.isnull().sum()

department                 0
education               2189
no_of_trainings            0
age                        0
previous_year_rating    2568
length_of_service          0
kpis_met_gt_80_pct         0
awards_won                 0
avg_training_score         0
is_promoted                0
dtype: int64

In [12]:
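# I will replace null values in the `previous_year_rating` column with the mean.
# Assigning the result back is safer than fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to df_hr.
df_hr['previous_year_rating'] = df_hr['previous_year_rating'].fillna(df_hr['previous_year_rating'].mean())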

# I will drop rows where the `education` column has null values
df_hr.dropna(subset=['education'], inplace=True)

df_hr.shape

(45374, 10)
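
Before choosing between imputing and dropping, it can help to look at the share of missing values per column. The sketch below does this on made-up data and then applies the same two steps as the cell above; `toy` and its values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's"],
    "previous_year_rating": [5.0, 3.0, None, 1.0],
})

# Fraction of missing entries in each column.
missing_share = toy.isnull().mean()
print(missing_share)

# Impute the numeric column with its mean (computed over non-missing values).
mean_rating = toy["previous_year_rating"].mean()
toy["previous_year_rating"] = toy["previous_year_rating"].fillna(mean_rating)

# Drop rows where the categorical column is missing.
toy = toy.dropna(subset=["education"])
print(toy)
```

Mean imputation can produce ratings that are not whole numbers; using the median or the most frequent value instead is a common alternative, though which is better for this dataset is an open question.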

In [13]:
# Checking missing entries of all the variables
# ---
# 
df_hr.isnull().sum()

department              0
education               0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
kpis_met_gt_80_pct      0
awards_won              0
avg_training_score      0
is_promoted             0
dtype: int64


## 5. Data Modelling

I will train a decision tree classifier and then use it to make predictions.

#### Training the model

In [14]:
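# Use the numeric columns as features; `department` and `education` are text
# columns, which DecisionTreeClassifier cannot use without encoding.
features = df_hr.drop(['is_promoted','department','education'], axis=1)
target = df_hr['is_promoted']

model = DecisionTreeClassifier()

# Fit the tree on the full cleaned dataset
model.fit(features, target)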

DecisionTreeClassifier()
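
The model above is fitted on all available rows, so there is no data left to check how well it generalises. A common approach is to hold out part of the data for testing. The sketch below shows the idea on synthetic data: the frame `toy`, the promotion rule, and the 80/20 split are all assumptions for illustration, not properties of the HR dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned data (values are random, not real).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "no_of_trainings": rng.integers(1, 5, n),
    "age": rng.integers(20, 60, n),
    "avg_training_score": rng.integers(40, 100, n),
})
# Made-up rule so that the target is learnable.
toy["is_promoted"] = (toy["avg_training_score"] > 85).astype(int)

X = toy.drop(columns=["is_promoted"])
y = toy["is_promoted"]

# Keep 20% of the rows aside; stratify preserves the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Evaluate only on rows the model has not seen during training.
test_predictions = clf.predict(X_test)
print(classification_report(y_test, test_predictions))
```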

In [15]:
df_hr.head()

Unnamed: 0,department,education,no_of_trainings,age,previous_year_rating,length_of_service,kpis_met_gt_80_pct,awards_won,avg_training_score,is_promoted
0,Sales & Marketing,Master's & above,1,35,5.0,8,1,0,49,0
1,Operations,Bachelor's,1,30,5.0,4,0,0,60,0
2,Sales & Marketing,Bachelor's,1,34,3.0,7,0,0,50,0
3,Sales & Marketing,Bachelor's,2,39,1.0,10,0,0,50,0
4,Technology,Bachelor's,1,45,3.0,2,0,0,73,0
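
The `department` and `education` columns are still in `df_hr` but were left out of the features because they contain text. One way to bring them into a model is one-hot encoding. This is a sketch using `pd.get_dummies` on made-up rows; whether these columns actually improve the predictions has not been tested here.

```python
import pandas as pd

toy = pd.DataFrame({
    "department": ["Sales & Marketing", "Operations", "Technology"],
    "education": ["Master's & above", "Bachelor's", "Bachelor's"],
    "age": [35, 30, 45],
})

# Each category becomes its own 0/1 column; numeric columns are left unchanged.
encoded = pd.get_dummies(toy, columns=["department", "education"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```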


#### Predicting using the model

I would like to predict the promotion status for the following two employees.

Attributes for the first employee are as follows:

* Number of trainings = 2
* Age = 37
* Rating for the previous year = 4.0
* Length of service = 8
* KPIs met above 80% = 0
* Awards won = 0
* Average training score = 48

Attributes for the second employee:

* Number of trainings = 1
* Age = 29
* Rating for the previous year = 5.0
* Length of service = 3
* KPIs met above 80% = 0
* Awards won = 0
* Average training score = 80

In [16]:
new_features = pd.DataFrame(
    [
        [2, 37, 4.0, 8, 0, 0, 48],
        [1, 29, 5.0, 3, 0, 0, 80],
    ],
    columns=features.columns,
)

answers = model.predict(new_features)
print(answers)

[0 1]
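
`predict` returns only the class label. If a measure of confidence is useful, `predict_proba` returns the estimated probability of each class. The sketch below uses a tiny made-up training set with two of the feature names, so its numbers say nothing about the real model.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set (not taken from the HR data).
train = pd.DataFrame({
    "previous_year_rating": [5.0, 4.0, 3.0, 1.0, 5.0, 2.0],
    "avg_training_score": [85, 80, 50, 45, 90, 55],
    "is_promoted": [1, 1, 0, 0, 1, 0],
})
X = train.drop(columns=["is_promoted"])
y = train["is_promoted"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# New rows must have the same columns, in the same order, as the training features.
new_rows = pd.DataFrame([[4.0, 48], [5.0, 80]], columns=X.columns)

# Columns of proba follow clf.classes_, so column 1 is the probability of class 1.
proba = clf.predict_proba(new_rows)
result = new_rows.assign(
    predicted=clf.predict(new_rows),
    p_promoted=proba[:, 1],
)
print(result)
```

A fully grown tree often reports probabilities of exactly 0 or 1, so these values are more informative when the tree depth is limited.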


## 6. Summary and Recommendations

Below are the findings:

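1. The decision tree classifier I trained is able to make predictions: for the two example employees it predicted `0` (not promoted) and `1` (promoted).
2. I recommend exploring other models apart from the decision tree classifier, and evaluating each one on data held out from training, since this notebook does not measure the model's accuracy.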
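
As a starting point for the second recommendation, several classifiers can be compared with cross-validation. The sketch below uses synthetic data from `make_classification`, and the choice of candidate models and of F1 as the score are my assumptions; the results do not carry over to the HR dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem with 7 features (a stand-in only).
X, y = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=0
)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# 5-fold cross-validation: each model is trained and scored five times
# on different train/validation splits, and the scores are averaged.
mean_f1 = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}

for name, score in mean_f1.items():
    print(f"{name}: mean F1 = {score:.3f}")
```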



