# Data Wrangling and Exploration

In this mini project (MP3), we are working with both supervised and unsupervised machine learning.

Since we need to build different models using the same dataset, we decided to split them into separate sections.

The first step in building our machine learning models is to load and clean the data. As we have already done this several times in MP1 and MP2, we will skip detailed explanations here and instead provide a single, clean code block that can be reused across all tasks in this project.

If you need a refresher on how to load and clean data, please refer to the relevant sections in MP1 and MP2.

## Objective
The objective of this mini project is to practice data analysis and prediction using regression,
classification, and clustering algorithms.


## Problem Statement
Attrition is the rate at which employees leave their jobs. When attrition reaches high levels, it becomes
a concern for the company. Therefore, it is important to find out why employees leave and which factors
contribute to such a significant decision.
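
To make the definition concrete, here is a minimal sketch of how an attrition rate could be computed as the share of employees marked as having left. It assumes the `Attrition` column holds the strings `"Yes"` and `"No"`; the helper name `attrition_rate` and the toy rows are hypothetical and only serve as illustration.

```python
import pandas as pd


def attrition_rate(df: pd.DataFrame, column: str = "Attrition", positive: str = "Yes") -> float:
    """Return the share of rows where `column` equals `positive`."""
    # Comparing to `positive` gives a boolean Series; its mean is the fraction of True values
    return float((df[column] == positive).mean())


# Toy data for illustration only (assumption: values are "Yes"/"No")
toy = pd.DataFrame({"Attrition": ["Yes", "No", "No", "No"]})
rate = attrition_rate(toy)
print(f"Attrition rate: {rate:.0%}")
```

The same helper could be called on `emp_data` once it is loaded, provided the column is encoded as assumed here.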

In [1]:
# Note: these imports can change from class to class in MP3

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sklearn tools
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, r2_score

# Plot style
sns.set(style="whitegrid")

In [None]:
# Load data 
emp_data = pd.read_csv('../Data/WA_Fn-UseC_-HR-Employee-Attrition.csv')

# Check for missing values and duplicate rows (the dataset has neither, so nothing needs to be removed)
print(emp_data.isnull().sum())
print("Duplicate values: ", emp_data.duplicated().sum())

emp_data.columns

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
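
Since the loading and cleaning step is meant to be reused across all tasks, it could be wrapped in a small function. The sketch below is one possible way to do that, not the code used in the notebook: the names `clean_frame` and `load_and_clean` are hypothetical, and dropping columns that hold only a single unique value is an extra step we assume is wanted, because such columns carry no information for a model. The toy values are made up for illustration.

```python
import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows and columns that hold a single unique value."""
    df = df.drop_duplicates()
    # nunique() counts distinct non-null values; <= 1 means the column is constant
    constant_cols = [col for col in df.columns if df[col].nunique() <= 1]
    return df.drop(columns=constant_cols).reset_index(drop=True)


def load_and_clean(path: str) -> pd.DataFrame:
    """Read a CSV file and apply the same cleaning steps (hypothetical helper)."""
    return clean_frame(pd.read_csv(path))


# Toy data for illustration only: one duplicate row and one constant column
toy = pd.DataFrame({
    "Age": [41, 49, 41],
    "Attrition": ["Yes", "No", "Yes"],
    "StandardHours": [80, 80, 80],
})
cleaned = clean_frame(toy)
print(cleaned.columns.tolist())
print(cleaned.shape)
```

With this in place, each section could start with a call such as `load_and_clean('../Data/WA_Fn-UseC_-HR-Employee-Attrition.csv')` instead of repeating the cell above.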