<a href="https://colab.research.google.com/github/elinonga/logistic-regression-analysis/blob/main/Quiz_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Importing Data**


In [46]:
# Importing Pandas and NumPy
import pandas as pd
import numpy as np

# Importing Files in Google Colab
from google.colab import files
import io

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [47]:
# "Choose the Files" prompt
data_to_load = files.upload()

Saving hr_analytics.csv to hr_analytics (2).csv


In [49]:
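# Reading the uploaded file, as done in Google Colab.
# The key is taken from the upload dict in case Colab stored the file under a different name.
uploaded_name = next(iter(data_to_load))
analytics = pd.read_csv(io.BytesIO(data_to_load[uploaded_name]))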

### **Let's understand the structure of our dataframe**

In [50]:
analytics.head()

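| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |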


### **Dropping unnecessary variables**

In [52]:
analytics = analytics.drop(['Department'], axis=1)

### **Dummy Variable Creation**

In [53]:
salary = pd.get_dummies(analytics.salary)
salary

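| | high | low | medium |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |
| ... | ... | ... | ... |
| 14994 | 0 | 1 | 0 |
| 14995 | 0 | 1 | 0 |
| 14996 | 0 | 1 | 0 |
| 14997 | 0 | 1 | 0 |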


In [54]:
# Creating dummy variables for 'salary' and dropping the first level ('high') as the baseline
pm = pd.get_dummies(analytics['salary'],prefix='salary',drop_first=True)
# Adding the results to the master dataframe
analytics = pd.concat([analytics,pm],axis=1)
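
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy column. The values are illustrative and not taken from `hr_analytics.csv`; the name `toy` is hypothetical. With three salary levels, only two dummy columns are kept, and the dropped level (`high`, the first in sorted order) is represented by a row of zeros.

```python
import pandas as pd

# Toy data (illustrative only, not the real data set)
toy = pd.DataFrame({"salary": ["low", "medium", "high", "low"]})

# drop_first=True removes the first category in sorted order ("high"),
# so "high" becomes the baseline encoded as all zeros
dummies = pd.get_dummies(toy["salary"], prefix="salary", drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

Keeping all three columns would make them perfectly collinear (they always sum to 1), which is why one level is usually dropped for regression-style models.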


### **Dropping the repeated variables**

In [55]:
# We have created dummies for 'salary', so the original column can be dropped
analytics = analytics.drop(['salary'], axis=1)

### **Checking the new data**

In [56]:
analytics.head()

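| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | salary_low | salary_medium |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 1 | 0 |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 0 | 1 |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 1 | 0 |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 1 | 0 |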


In [57]:
analytics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   salary_low             14999 non-null  uint8  
 9   salary_medium          14999 non-null  uint8  
dtypes: float64(2), int64(6), uint8(2)
memory usage: 966.9 KB


In [58]:
analytics.describe()

| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | salary_low | salary_medium |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.14461 | 0.238083 | 0.021268 | 0.487766 | 0.429762 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 | 0.499867 | 0.495059 |
| min | 0.09 | 0.36 | 2.0 | 96.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.44 | 0.56 | 3.0 | 156.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.64 | 0.72 | 4.0 | 200.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.82 | 0.87 | 5.0 | 245.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 1.0 | 1.0 | 7.0 | 310.0 | 10.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### **Checking for Missing Values and Imputing Them**

In [59]:
# Adding up the missing values (column-wise)
analytics.isnull().sum()

satisfaction_level       0
last_evaluation          0
number_project           0
average_montly_hours     0
time_spend_company       0
Work_accident            0
left                     0
promotion_last_5years    0
salary_low               0
salary_medium            0
dtype: int64
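
Every column reports zero missing values, so nothing needs imputing in this data set. Had there been gaps, one common option is to fill each numeric column with its median. The sketch below shows that idea on toy data; the choice of the median and the names `toy` and `filled` are assumptions for illustration, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps (illustrative values only)
toy = pd.DataFrame({
    "satisfaction_level": [0.4, np.nan, 0.8],
    "number_project": [2, 5, np.nan],
})

# Count missing values per column, as in the cell above
print(toy.isnull().sum())

# Fill each numeric column with its own median
filled = toy.fillna(toy.median(numeric_only=True))
print(filled)
```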

### **Feature Scaling (Min-Max Normalisation)**

In [60]:
# Normalising continuous features
df = analytics[['number_project','average_montly_hours','time_spend_company']]

In [61]:
# Defining a min-max normalisation function (rescales each column to [0, 1])
def normalize(x):
    return (x - x.min()) / (x.max() - x.min())

# Applying normalize() to all columns
df_normalize = df.apply(normalize)

# Drop the original columns to avoid double-counting
analytics = analytics.drop(['number_project','average_montly_hours','time_spend_company'], axis=1)

# Adding the results to the master dataframe
analytics = pd.concat([analytics,df_normalize],axis=1)
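
The same min-max rescaling is available in scikit-learn as `MinMaxScaler`. The sketch below, on toy rows rather than the real data, fits the scaler on a training frame and reuses the learned minimum and maximum on a separate test frame. Fitting on the training rows only is a common practice to avoid leaking test information; it is offered here as an alternative, not as what this notebook does (the notebook scales the full data set before splitting). The names `train`, `test` and `scaler` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy rows (illustrative values, not the real data set)
train = pd.DataFrame({"number_project": [2, 5, 7],
                      "average_montly_hours": [96.0, 203.0, 310.0]})
test = pd.DataFrame({"number_project": [3],
                     "average_montly_hours": [149.5]})

scaler = MinMaxScaler()

# Learn min and max from the training rows, then rescale them to [0, 1]
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)

# Reuse the training min and max on the test rows
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns)

print(train_scaled)
print(test_scaled)
```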

In [62]:
analytics.head()

| | satisfaction_level | last_evaluation | Work_accident | left | promotion_last_5years | salary_low | salary_medium | number_project | average_montly_hours | time_spend_company |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 0 | 1 | 0 | 1 | 0 | 0.0 | 0.285047 | 0.125 |
| 1 | 0.8 | 0.86 | 0 | 1 | 0 | 0 | 1 | 0.6 | 0.775701 | 0.5 |
| 2 | 0.11 | 0.88 | 0 | 1 | 0 | 0 | 1 | 1.0 | 0.82243 | 0.25 |
| 3 | 0.72 | 0.87 | 0 | 1 | 0 | 1 | 0 | 0.6 | 0.593458 | 0.375 |
| 4 | 0.37 | 0.52 | 0 | 1 | 0 | 1 | 0 | 0.0 | 0.294393 | 0.125 |


## **Model Building**
Let's start by splitting our data into a training set and a test set.

### **Splitting Data into Training and Test Sets**

In [65]:
from sklearn.model_selection import train_test_split

In [66]:
# Assigning the feature variables to X
X = analytics[['satisfaction_level', 'last_evaluation', 'Work_accident', 'promotion_last_5years', 'salary_low',
             'salary_medium', 'number_project', 'average_montly_hours', 'time_spend_company']]

# Assigning the response variable to y
y = analytics['left']

In [67]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=100)
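
Because the `left` classes are imbalanced (the mean of `left` reported by `describe()` is about 0.24), it can help to keep the class proportions equal in both splits. `train_test_split` supports this through its `stratify` argument. The following is a minimal sketch on synthetic labels; the arrays `X_toy` and `y_toy` are made up for illustration, and stratification is a suggestion rather than something the split above uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 75 zeros and 25 ones
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 25% share of positives in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=100
)

print(y_tr.mean(), y_te.mean())
```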

### **Training Your First Model**

In [70]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [71]:
model.fit(X_train, y_train)

LogisticRegression()

### **Making Predictions**

In [72]:
model.predict(X_test)

array([0, 0, 1, ..., 0, 0, 0])
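
`predict` returns hard 0/1 labels. When the estimated probability of leaving is of interest, `predict_proba` returns one column per class. The sketch below uses synthetic data (the names `X_toy`, `y_toy` and `clf` are hypothetical) to show how the two relate for a binary logistic regression: the predicted label is 1 when the probability in the second column exceeds 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with a simple linear decision rule
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

# Column 0 holds P(class 0), column 1 holds P(class 1); each row sums to 1
proba = clf.predict_proba(X_toy[:5])
labels = clf.predict(X_toy[:5])

print(proba)
print(labels)
```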

### **Checking accuracy of the model**

In [73]:
model.score(X_test,y_test)

0.7824444444444445
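
Accuracy on its own can flatter a model when classes are imbalanced. The mean of `left` is about 0.238, so always predicting "stayed" would already score roughly 76% on the full data set, not far below the 0.78 reported above. A hedged way to check this is to compare against a majority-class baseline and inspect the confusion matrix. The sketch below does so on synthetic data; all names and values in it are illustrative, and its numbers say nothing about the HR data itself.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (roughly a quarter positives)
rng = np.random.default_rng(100)
X_toy = rng.normal(size=(1000, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=1000) > 0.7).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=100
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
clf = LogisticRegression().fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = clf.score(X_te, y_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, clf.predict(X_te))

print("baseline accuracy:", baseline_acc)
print("model accuracy:", model_acc)
print(cm)
```

In the notebook itself, the equivalent check would be `confusion_matrix(y_test, model.predict(X_test))` alongside a `DummyClassifier` fitted on `X_train, y_train`.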