Let's start by importing the libraries we need for data analysis and data visualization.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Let's now import our training data and store it in a DataFrame named train_data.

In [None]:
train_data = pd.read_csv('/kaggle/input/hranalysis/train.csv')
train_data.head()

Let's now do some exploratory data analysis by using the pairplot feature of seaborn to get a brief idea about our dataset and its features. (Warning: this can take some time.)

In [None]:
sns.pairplot(train_data)

Our main goal is to classify our data effectively. The employee_id feature contributes nothing to this, since an employee's ID is not related to whether the person gets promoted. So we simply remove it from our dataset.

In [None]:
train_data.drop('employee_id',inplace=True,axis='columns')

In [None]:
train_data.head()

Let's do some more analysis of our data using count plots. I have restricted myself to only two and instead spent more time on cleaning and improving our data.

In [None]:
sns.countplot(x='department',data=train_data,hue='is_promoted')

In [None]:
sns.countplot(x='gender',data=train_data,hue='is_promoted')

Let's now look for any missing values in our data and try to clean it.

In [None]:
sns.heatmap(train_data.isnull(),yticklabels=False,cbar=False,cmap='viridis')
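
The heatmap shows where values are missing, but not how many. A minimal sketch of a numeric check follows, using a small made-up frame (the name `toy` and its values are assumptions for illustration, not the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data
toy = pd.DataFrame({
    'education': ["Bachelor's", np.nan, "Master's & above", "Bachelor's"],
    'previous_year_rating': [3.0, 5.0, np.nan, np.nan],
    'age': [30, 41, 28, 35],
})

# Count missing entries per column, then keep only columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

On the real data, the same two lines could be applied to train_data instead of `toy`.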

Only 2 columns have missing data: education and previous_year_rating. For education, most of the entries are "Bachelor's", so it makes sense to replace all the missing values with "Bachelor's", since that is the most likely value. The function below does this.

In [None]:
def fill_edu(col):
    if(pd.isnull(col)):
        return "Bachelor's"
    else:
        return col

In [None]:
train_data['education'] = train_data['education'].apply(fill_edu)
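
The same imputation can also be written without a helper function. This is a sketch, assuming the most frequent value should be looked up from the data rather than hard-coded; the toy Series is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry
education = pd.Series(["Bachelor's", "Master's & above", np.nan, "Bachelor's"])

# mode() ignores NaN by default; [0] takes the first value if there is a tie
most_common = education.mode()[0]
filled = education.fillna(most_common)
print(filled.tolist())
```

The same pattern would work for a numeric column such as previous_year_rating.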

Now let's check our data.

In [None]:
sns.heatmap(train_data.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Great! Now that our data doesn't have any missing education values, let's do the same for the previous_year_rating column. Here we'll find the most common rating and assign it to all missing values.

In [None]:
sns.countplot(x='previous_year_rating',data=train_data)

We can see that 3.0 is the most common rating, so we will insert 3.0 for all missing values in previous_year_rating.

In [None]:
def fill_rating(col):
    if(pd.isnull(col)):
        return 3.0
    else:
        return col

In [None]:
train_data['previous_year_rating'] = train_data['previous_year_rating'].apply(fill_rating)

Let's check our dataset now.

In [None]:
sns.heatmap(train_data.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Great! Now our dataset has no missing values. We'll now proceed to convert the categorical columns to dummy (one-hot) variables, since the scikit-learn models used below need numeric input and cannot be fit on raw string columns. We will use the get_dummies() function of pandas for this purpose.

In [None]:
train_data.info()

In [None]:
dep = pd.get_dummies(train_data['department'],drop_first=True)
edu = pd.get_dummies(train_data['education'],drop_first=True)
reg = pd.get_dummies(train_data['region'],drop_first=True)
gen = pd.get_dummies(train_data['gender'],drop_first=True)
rec = pd.get_dummies(train_data['recruitment_channel'],drop_first=True)
dep.head()
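
To see what drop_first does, here is a sketch on a made-up column. The `prefix` argument is an addition that the cells above do not use; it is one way to keep column names distinguishable once several dummy frames are concatenated.

```python
import pandas as pd

# Hypothetical categorical column with three levels
toy = pd.DataFrame({'recruitment_channel': ['other', 'sourcing', 'referred', 'other']})

# drop_first=True drops the first level in sorted order ('other'),
# which then acts as the baseline encoded by all-zero rows
rec_toy = pd.get_dummies(toy['recruitment_channel'], prefix='channel', drop_first=True)
print(rec_toy.columns.tolist())
print(rec_toy)
```

Dropping one level avoids a redundant column: if every remaining dummy is zero, the row must belong to the dropped level.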

Now that we have all our dummies let's remove the object type columns and insert our dummies instead.

In [None]:
train_data.drop(['department','region','education','gender','recruitment_channel'],axis=1,inplace=True)

In [None]:
train_data = pd.concat([train_data,dep,reg,edu,gen,rec],axis=1)

In [None]:
train_data.head()

Now our data is ready to be fed into a model, so let's first do a train_test_split so that we can validate the model on held-out data.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_data.drop('is_promoted',axis=1), 
                                                    train_data['is_promoted'], test_size=0.25, 
                                                    random_state=101)
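
When one class is much rarer than the other, one option is a stratified split, which keeps the class ratio the same in both parts. The split above does not use it. This is a sketch on synthetic labels; the 90/10 ratio is made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 10% positives (assumed ratio, for illustration)
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 100 + [0] * 900)

# stratify=y keeps the positive rate equal in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=101, stratify=y)

print(y_tr.mean(), y_te.mean())
```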

Let us now use a simple logistic regression model to classify our data. We will check its accuracy later on.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

Let's predict on our test data and evaluate the precision of our model now.

In [None]:
predictions = logmodel.predict(X_test)

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test,predictions))

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
print(confusion_matrix(y_test,predictions))
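
The confusion matrix and the classification report carry the same information. This sketch uses made-up labels to show how precision and recall for class 1 follow from the matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# For binary labels, sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted promotions, how many were right
recall = tp / (tp + fn)     # of the actual promotions, how many were found
print(precision, recall)

# sklearn's helpers should give the same numbers
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```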

Let's try using a decision tree classifier and check its precision.

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dtree = DecisionTreeClassifier()

In [None]:
dtree.fit(X_train,y_train)

In [None]:
predictions = dtree.predict(X_test)

In [None]:
print(classification_report(y_test,predictions))

In [None]:
print(confusion_matrix(y_test,predictions))
print("Training Accuracy for Decision tree classifier :", dtree.score(X_train, y_train))
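
Before interpreting these reports, it can help to check how balanced the target is, because a model that always predicts the majority class already scores a high accuracy. This is a sketch with a made-up target column; the 90/10 split is an assumption, not the real ratio:

```python
import pandas as pd

# Made-up target column: 9 non-promoted employees for every promoted one
is_promoted = pd.Series([0] * 90 + [1] * 10, name='is_promoted')

# Share of each class in the target
class_share = is_promoted.value_counts(normalize=True)
print(class_share)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = class_share.max()
print(baseline_accuracy)
```

On the real data, the same check could be run on train_data['is_promoted'].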

As we can see, our models predict is_promoted=0 quite accurately but do much worse on is_promoted=1. This is probably because the dataset has far fewer is_promoted=1 cases than is_promoted=0 cases. Let's try standard scaling our data and then check the precision. (The training accuracy is good, but the validation results are not as good.)

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test  = sc.transform(X_test)
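
An alternative to transforming X_train and X_test by hand is to bundle the scaler and the model in a pipeline, so the scaler is always fitted on training data only. This is a sketch on synthetic data from make_classification; the data and the name `scaled_logit` are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the HR data
X, y = make_classification(n_samples=500, n_features=8, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=101)

# fit() scales with statistics from X_tr only; predict() reuses them on X_te
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logit.fit(X_tr, y_tr)

test_pred = scaled_logit.predict(X_te)
print(test_pred[:10])
```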

Let's first try the logistic regression model and test it.

In [None]:
logmodel2 = LogisticRegression()
logmodel2.fit(X_train,y_train)

In [None]:
predictions = logmodel2.predict(X_test)

In [None]:
print(classification_report(y_test,predictions))

We can see that the precision for is_promoted=1 is much better here. Now let's try to do the same for a random forest classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)

In [None]:
rfc_pred = rfc.predict(X_test)

In [None]:
print(classification_report(y_test,rfc_pred))

We can see that the random forest classifier is slightly better than logistic regression. Let us check their training accuracy to see which is better.

In [None]:
print("Training Accuracy for Random Forest classifier :", rfc.score(X_train, y_train))

In [None]:
print("Training Accuracy for logistic regression classifier :", logmodel2.score(X_train, y_train))
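
Training accuracy alone can flatter a flexible model such as a random forest, which can fit its training data almost perfectly. This is a sketch that compares training and held-out accuracy on synthetic data; the data, the roughly 90/10 class weights, and the name `rfc_toy` are assumptions, not the HR data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumed roughly 90/10 split)
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1],
                           random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=101)

rfc_toy = RandomForestClassifier(n_estimators=100, random_state=101)
rfc_toy.fit(X_tr, y_tr)

# Compare accuracy on data the model has seen with data it has not
train_acc = rfc_toy.score(X_tr, y_tr)
test_acc = rfc_toy.score(X_te, y_te)
print("train:", train_acc, "test:", test_acc)
```

A large gap between the two numbers would suggest overfitting, so the held-out score is the safer one to compare models on.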

We can see that the random forest classifier has the better training accuracy. Thanks for reading this far; if you liked my kernel, please leave an upvote.