# Dataset: Employee Churn

Your first client is Pear Inc, a multinational company worried about its poor talent retention. Pear has a peculiar hiring strategy. They offer free classes and hire the best students. The strategy is working, but many new hires leave the company after a few months. This is a huge waste of time and money.

In the past few months, they collected a dataset with information about their employees and recorded whether each one churned. Because the churn period is so short, they are confident that each candidate's history is enough to predict churn and that the experience at Pear is not relevant.

Gabriele, the Head of Talent, is counting on you to put a plug in this problem. And Fabio, an ML Engineer and new colleague of yours, will be eyeing your work too!

**Challenge 1**: Pear Inc needs your help to figure out what makes an employee stick versus split. Craft a Jupyter notebook to answer their question. Keep it simple enough for Gabriele but detailed enough to dazzle Fabio.

## Library import

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import ppscore as pps

from ydata_profiling import ProfileReport
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

## Load data

In [37]:
data_raw = pd.read_csv("datasets/employee-churn/churn.csv")
data_raw.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


In [38]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

### Features

- enrollee_id: Unique ID for candidate 
- city: City code 
- city_development_index: Development index of the city (scaled)
- gender: Gender of candidate 
- relevent_experience: Relevant experience of candidate
- enrolled_university: Type of University course enrolled if any 
- education_level: Education level of candidate 
- major_discipline: Education discipline of candidate 
- experience: Candidate total experience in years 
- company_size: Number of employees in current employer's company 
- company_type: Type of current employer 
- last_new_job: Difference in years between previous job and current job 
- training_hours: training hours completed 
- target: 0 – Not looking for job change, 1 – Looking for a job change

## Manage NaN values

Let's look at the **percentage of missing values** in each column.

In [39]:
data_raw.isna().sum()/len(data_raw)*100

enrollee_id                0.000000
city                       0.000000
city_development_index     0.000000
gender                    23.530640
relevent_experience        0.000000
enrolled_university        2.014824
education_level            2.401086
major_discipline          14.683161
experience                 0.339284
company_size              30.994885
company_type              32.049274
last_new_job               2.207955
training_hours             0.000000
target                     0.000000
dtype: float64

### First approach: try to remove all the missing values

Let us evaluate the fraction of the data that would be lost by removing every row with at least one NaN value:

In [40]:
1 - (len(data_raw.dropna())/len(data_raw))

0.5325712496085186

This is not a good option: more than half of the dataset (about 53% of the rows) would be discarded.
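The fraction above can also be read as the share of rows that contain at least one NaN. A minimal sketch of that equivalence, assuming a made-up `toy` frame (its values are hypothetical and are not taken from the Pear dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only to illustrate the computation
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", "Male"],
    "company_size": ["50-99", "<10", np.nan, "100-500"],
})

# Fraction of rows lost by dropna(): 1 - kept / total
lost_by_dropna = 1 - len(toy.dropna()) / len(toy)

# Equivalent view: share of rows with at least one missing value
lost_by_mask = toy.isna().any(axis=1).mean()

# Number of gaps per row, useful to see how incomplete the dropped rows are
missing_per_row = toy.isna().sum(axis=1)
```

On the real data, looking at `missing_per_row` may help to tell whether most dropped rows miss a single field or several.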

### Second approach: kNN

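k-nearest-neighbors (kNN) imputation fills each missing value with the average of that feature over the `k` rows most similar to the incomplete one. Similarity is measured with a distance that only uses the features both rows have observed, so rows with gaps can still be compared.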
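As a small illustration of kNN imputation, the sketch below runs `KNNImputer` on a tiny numeric array. The array `X` is made up for the example, and `n_neighbors=2` is chosen only because the array has four rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric matrix with two missing entries (illustrative values only)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each gap is replaced by the mean of that column over the 2 nearest rows,
# with distances computed only on the coordinates both rows have observed
imputer = KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")
X_filled = imputer.fit_transform(X)

print(X_filled)
# The gap in row 0 becomes 4.0 (mean of 3 and 5),
# the gap in row 2 becomes 5.5 (mean of 3 and 8)
```

Because the distance is Euclidean, columns on very different scales can dominate the neighbor search, so scaling the features first is usually worth considering.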

To use kNN, the categorical data (i.e. words and phrases) must first be converted into numerical data. To do so, I decided to use both `sklearn.preprocessing.OrdinalEncoder` and `sklearn.preprocessing.OneHotEncoder`.

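`OrdinalEncoder` maps each category to an integer following an order that I specify, so it suits features with a natural ranking such as `education_level` or `experience`. `OneHotEncoder` creates one binary column per category and implies no order, so it suits nominal features such as `gender` or `major_discipline`.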
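To make the contrast between the two encoders concrete, here is a minimal sketch on two tiny columns. The sample values are illustrative, and calling `.toarray()` on the one-hot result is just one way to inspect the sparse output.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordered feature: the position in `categories` defines the rank
education = np.array([["Graduate"], ["Masters"], ["High School"]], dtype=object)
ordinal = OrdinalEncoder(categories=[["High School", "Graduate", "Masters"]])
education_codes = ordinal.fit_transform(education)
# education_codes -> [[1.], [2.], [0.]]

# Nominal feature: one binary column per category, no order implied
discipline = np.array([["STEM"], ["Arts"], ["STEM"]], dtype=object)
onehot = OneHotEncoder()
discipline_dummies = onehot.fit_transform(discipline).toarray()
# columns follow onehot.categories_[0], i.e. ['Arts', 'STEM']
# discipline_dummies -> [[0., 1.], [1., 0.], [0., 1.]]
```

The ordinal version keeps a single column but asserts that "Masters" is further from "High School" than "Graduate" is. The one-hot version makes no such claim, at the cost of one column per category.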

In [46]:
def data_reduction(df_raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df_raw with city, education_level and experience encoded as numbers."""
    df = df_raw.copy()

    # Keep only the numeric part of the city code, e.g. 'city_103' -> 103.0
    df['city'] = df['city'].str.split('_').str[1].astype('float')

    # Ordered categories become 0, 1, 2, ...; NaN and unlisted values stay NaN
    df['education_level'] = OrdinalEncoder(
        categories = [['Primary School', 'High School', 'Graduate', 'Masters', 'Phd']],
        handle_unknown = 'use_encoded_value',
        unknown_value = np.nan,
    ).fit_transform(df[['education_level']])

    # Experience in years, in order: '<1', '1', '2', ..., '20', '>20'
    experience_levels = ['<1'] + [str(years) for years in range(1, 21)] + ['>20']
    df['experience'] = OrdinalEncoder(
        categories = [experience_levels],
        handle_unknown = 'use_encoded_value',
        unknown_value = np.nan,
    ).fit_transform(df[['experience']])

    # TODO: ordinal encoding for company_size
    # TODO: ordinal encoding for last_new_job

    # TODO: OneHot/get_dummies for all the other columns

    return df

In [45]:
data = data_reduction(data_raw)
data.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,103.0,0.92,Male,Has relevent experience,no_enrollment,2.0,STEM,21.0,,,1,36,1.0
1,29725,40.0,0.776,Male,No relevent experience,no_enrollment,2.0,STEM,15.0,50-99,Pvt Ltd,>4,47,0.0
2,11561,21.0,0.624,,No relevent experience,Full time course,2.0,STEM,5.0,,,never,83,0.0
3,33241,115.0,0.789,,No relevent experience,,2.0,Business Degree,0.0,,Pvt Ltd,never,52,1.0
4,666,162.0,0.767,Male,Has relevent experience,no_enrollment,3.0,STEM,21.0,50-99,Funded Startup,4,8,0.0


In [None]:
#imputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
#imputer.fit(data_dummies)
#data_filled = imputer.transform(data_dummies)
#data_filled = pd.DataFrame(data_filled, columns=data_dummies.columns)
#data_filled.isnull().values.any()

In [None]:
#data_final = pd.from_dummies(data_filled)
#data_final.head()
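The commented-out cells above outline imputing on dummy-encoded data and then decoding it. Below is a minimal sketch of how that round trip might look on a toy frame. It is not the final pipeline: the `toy` values are made up, and it takes the largest imputed dummy per row to decode, a swapped-in alternative to `pd.from_dummies`, because imputed dummy columns may hold fractional values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame: one numeric column, one categorical column with a gap
toy = pd.DataFrame({
    "training_hours": [10.0, 12.0, 50.0, 55.0, 11.0],
    "gender": pd.Series(["Male", "Male", "Female", "Female", np.nan], dtype=object),
})

# One-hot encode; get_dummies turns a missing category into an all-zero row
dummies = pd.get_dummies(toy, columns=["gender"], dtype=float)
gender_cols = [c for c in dummies.columns if c.startswith("gender_")]

# Restore the NaN markers so the imputer knows these cells are missing
dummies.loc[toy["gender"].isna(), gender_cols] = np.nan

imputer = KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")
filled = pd.DataFrame(imputer.fit_transform(dummies), columns=dummies.columns)

# Decode: pick the category whose imputed dummy is largest in each row
toy_filled = toy.copy()
toy_filled["gender"] = (
    filled[gender_cols].idxmax(axis=1).str.replace("gender_", "", regex=False)
)
```

Here the incomplete row has `training_hours` close to the two "Male" rows, so its gender is filled with "Male". On the real data the numeric columns would probably need scaling first, otherwise `training_hours` would dominate the distance.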