# HA06.1_LR

Your company's HR department asks you, as a data scientist, to *build a model to predict the retention behavior* (Bindungsverhalten) of employees in different departments. To this end, they provide a data set (downloaded from [Kaggle](https://www.kaggle.com/giripujar/hr-analytics)) consisting of 15000 samples. The original notebook is forked from [here](https://github.com/codebasics/py/tree/master/ML/7_logistic_reg/Exercise).

In this home assignment we will cover the following topics: 

1. **Data exploration and investigation**
2. **Logistic regression**





In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/lnxdxC/DSAI/main/L06_Classification_and_Regression/HR.csv')
df.head()

# 1. Data exploration

### 1.1 How many employees have left the company?

$x = \frac{n_\text{left} }{n_\text{overall}}$

In [None]:
df.left[df.left==1].sum() / df.shape[0]

### 1.2 How many employees were retained?
$x = n_\text{overall} - n_\text{left}$

In [None]:
df.shape[0] - df.left[df.left==1].sum()

# Alternative: 
retained = df[df.left==0]
retained.shape[0]
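
The two counts above can also be read off in one step. The following is a minimal sketch on a toy series (the values are made up and not taken from `HR.csv`); it assumes, as in the data set, that `left` is encoded as 0/1, so that its mean equals the fraction of leavers.

```python
import pandas as pd

# Toy stand-in for df.left (hypothetical values, not from HR.csv)
left = pd.Series([1, 0, 0, 1, 0, 0, 0, 0], name="left")

# Absolute number of retained (0) and departed (1) employees
counts = left.value_counts()

# Relative shares of both groups
shares = left.value_counts(normalize=True)

# For a 0/1 column the mean is the fraction of ones
fraction_left = left.mean()

print(counts)
print(shares)
print(fraction_left)
```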

### 1.3 Average figures for all columns in terms of employee quits

In [None]:
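# Group the rows by the entries in "left", then average every numeric column per group.
# numeric_only=True skips the string columns (Department, salary), which recent pandas versions refuse to average.
df.groupby('left').mean(numeric_only=True)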

The table above shows:
- **Satisfaction Level**: The satisfaction level is noticeably lower among employees who left the firm (0.44) than among those who were retained (0.66).
- **Average Monthly Hours**: Average monthly hours are higher among employees who left the firm (207 vs 199).
- **Promotion Last 5 Years**: Employees who were promoted in the last five years are more likely to be retained.
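
To make the grouping step concrete, here is a small sketch on a toy frame. All values are hypothetical and only the column names mirror the data set. It shows how the per-group averages are formed and how string columns are left out.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from HR.csv)
toy = pd.DataFrame({
    "satisfaction_level": [0.2, 0.4, 0.8, 0.6],
    "average_montly_hours": [250, 210, 180, 200],
    "salary": ["low", "low", "high", "medium"],
    "left": [1, 1, 0, 0],
})

# One row per value of "left"; numeric_only=True drops the string column "salary"
group_means = toy.groupby("left").mean(numeric_only=True)
print(group_means)
```

Each row of `group_means` corresponds to one value of `left`, so the two groups can be compared column by column.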


### 1.4 Impact of salary on employee retention

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

pd.crosstab(df.salary, df.left).plot(kind='bar', ax=ax)
plt.show()  # plt.show() also works with non-GUI notebook backends

The bar chart above shows that employees with high salaries are less likely to leave the company.
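
Absolute counts can be hard to compare when the salary bands differ in size. As a possible complement, the sketch below normalizes a cross table per row so that each salary band shows the share of leavers. The toy values are hypothetical and not taken from `HR.csv`.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from HR.csv)
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "low", "medium", "medium", "high", "high"],
    "left":   [1, 1, 0, 0, 1, 0, 0, 0],
})

# normalize="index" divides every row by its row total,
# so each salary band sums to 1 across the "left" columns
shares = pd.crosstab(toy.salary, toy.left, normalize="index")
print(shares)
```

The column labelled `1` then holds the leaving rate per salary band, which could be plotted with `shares.plot(kind='bar')` in the same way as the counts.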

### 1.5 Department-wise employee retention

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
pd.crosstab(df.Department,df.left).plot(kind='bar', ax=ax)

### 1.6 Preparation for model generation

The chart above suggests that the department has some impact on employee retention, but the effect is not major, so we ignore the department in our analysis. Based on the data analysis so far, we use the following variables as independent variables in our model:
- **Satisfaction Level**
- **Average Monthly Hours**
- **Promotion Last 5 Years**
- **Salary**

In [None]:
# Create a new dataframe that holds only the columns used for model generation.
# It can then be handed over directly as the input feature space of the training pipeline.
subdf = df[['satisfaction_level','average_montly_hours','promotion_last_5years','salary']]
subdf.head()

#### 1.6.1. Tackle salary (cast from string to int/bool)

The field `salary` consists of strings. It needs to be converted to numbers, for which we use dummy variables. For further information about encoding, we refer to this [repository](https://github.com/codebasics/py).
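
To illustrate what dummy encoding produces, here is a minimal sketch on a toy column (hypothetical values). The `drop_first=True` variant is not used in this notebook. It is shown only as a common option that removes one redundant column, since the dummies of a row always sum to one.

```python
import pandas as pd

# Toy salary column (hypothetical values, not taken from HR.csv)
toy = pd.DataFrame({"salary": ["low", "high", "medium", "low"]})

# One indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(toy.salary, prefix="salary")
print(dummies.columns.tolist())

# Optional variant: drop the first category to avoid a redundant column
reduced = pd.get_dummies(toy.salary, prefix="salary", drop_first=True)
print(reduced.columns.tolist())
```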

In [None]:
salary_dummies = pd.get_dummies(subdf.salary, prefix="salary")
df_with_dummies = pd.concat([subdf,salary_dummies], axis='columns')
df_with_dummies.drop('salary',axis='columns',inplace=True)
X = df_with_dummies

In [None]:
# Finally, we get the feature space for our model
X.head()

In [None]:
y = df.left

#### 1.6.2. Split test and training data

In [None]:
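# Note: train_size=0.3 trains on 30 % of the samples and holds out the remaining 70 % for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3)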
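
The split is random, so the scores further down change from run to run. As a hedged variation (not what this notebook does), the sketch below fixes `random_state` for reproducibility and uses `stratify` to keep the share of leavers similar in both parts. The toy data and the names `X_toy`, `y_toy` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 25 % of them labelled as "left" (hypothetical)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature": rng.normal(size=100)})
y_toy = pd.Series([1] * 25 + [0] * 75, name="left")

# Hold out 30 % for testing; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```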

#### 1.6.3. Train the logistic regression

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

#### 1.6.4. Prediction
We can now predict the behavior of the employees. We can either use the test data to evaluate the predictions or apply the model to new, unseen data.
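
Besides hard class labels, a logistic regression also provides class probabilities. The sketch below illustrates this on synthetic data. The single feature, the labelling rule and the names `clf` and `new_employees` are assumptions for illustration and are unrelated to the model trained above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic feature standing in for satisfaction_level (hypothetical)
satisfaction = rng.uniform(0, 1, size=200)
# Purely synthetic rule: low satisfaction is labelled 1 ("left")
left = (satisfaction < 0.5).astype(int)

clf = LogisticRegression()
clf.fit(satisfaction.reshape(-1, 1), left)

# Two new, unseen samples: one dissatisfied, one satisfied employee
new_employees = np.array([[0.1], [0.9]])

# Column 1 of predict_proba is the estimated probability of class 1 ("left")
proba_left = clf.predict_proba(new_employees)[:, 1]
# predict returns the class with the higher estimated probability
labels = clf.predict(new_employees)

print(proba_left)
print(labels)
```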

In [None]:
model.predict(X_test)

#### 1.6.5. Accuracy of the model

In [None]:
model.score(X_test,y_test)
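
For a classifier, `score` returns the accuracy, i.e. the fraction of correctly predicted labels. Because far fewer employees leave than stay, it can help to compare this number with a trivial baseline that always predicts the majority class. The sketch below does so on hypothetical labels. `y_true` and `y_pred` are made-up arrays, not results of the model above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical ground truth and model output (not results from this notebook)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

# Baseline that ignores the features and always predicts the most frequent class
dummy_features = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(dummy_features, y_true)
baseline_acc = baseline.score(dummy_features, y_true)

# Accuracy of the hypothetical predictions and their confusion matrix
# (rows: true class, columns: predicted class)
model_acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print(baseline_acc, model_acc)
print(cm)
```

In this toy case the predictions beat the baseline only slightly, and the confusion matrix shows that most actual leavers are missed, which the accuracy alone does not reveal.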

In [None]:
model.predict(X_test[4:10])

In [None]:
y_test[4:10].values