This is an alternative version of Uplift Modeling VS Churn Prediction from my [previous experiment](https://www.kaggle.com/davinwijaya/why-you-should-start-using-uplift-modeling). The difference is that in this notebook I use Logistic Regression as the machine learning algorithm.

# 1. Setup
First, let's set up the environment and the dataset.

In [None]:
# Import the packages and libraries needed for this project
import matplotlib as mpl, matplotlib.pyplot as plt, \
pandas as pd, seaborn as sns, sklearn as sk
from sklearn.metrics import accuracy_score, \
confusion_matrix, multilabel_confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
# Checking for package version
print("Matplotlib Version", mpl.__version__)
print("Pandas Version", pd.__version__)
print("Seaborn Version", sns.__version__)
print("Sci-kit learn Version", sk.__version__)

In [None]:
# Import the dataset
df_data_2 = pd.read_csv('/kaggle/input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')
df_model_2 = df_data_2.copy()

# 2. Data Exploration

Now that the dataset is ready, let's take a first look at it and check for null data.

In [None]:
# Explore dataset 2
df_data_2.head(5)

In [None]:
# Check for null.
display(df_data_2.isnull().values.any())

Good, there is no null data.

# 3. Data preprocessing

In [None]:
# Declare unwanted features (i.e. columns) that will be dropped.
drop_2 = ['EmployeeCount', 'EmployeeNumber', 'StandardHours', 'Over18']
# drop the features.
df_model_2 = df_model_2.drop(drop_2,axis=1)

In [None]:
# Rename the target feature.
df_model_2 = df_model_2.rename(columns={'Attrition': 'churn'})

In [None]:
# Rename the treatment feature.
df_model_2 = df_model_2.rename(columns={'OverTime': 'treatment'})

In [None]:
# Declare features for label encoding.
string2 = ['churn',
          'treatment',
          'BusinessTravel']
# Explore the unique data for label encoding
for col in string2:
    display(col, df_model_2[col].unique())

In [None]:
# Manually label the churn status, Yes = 1, and No = 0
df_model_2.churn = df_model_2.churn.map({'Yes': 1, 'No': 0})

# Manually label the treatment status (Overtime), Yes = 0 (Overtime), No = 1 (Does not receive Overtime)
df_model_2.treatment = df_model_2.treatment.map({'Yes': 0, 'No': 1})

# Manually encode BusinessTravel as an ordinal feature (more travel = higher value)
df_model_2.BusinessTravel = df_model_2.BusinessTravel.map({'Non-Travel': 0,
                                                           'Travel_Rarely': 1,
                                                           'Travel_Frequently':2})
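One thing worth guarding against with manual mappings: `Series.map` with a dictionary returns `NaN` for any value that is not a key. The sketch below uses a small made-up series (the misspelled `travel_rarely` entry is hypothetical, added only to trigger the case) to show how such values can be detected before they reach the model.

```python
import pandas as pd

# Toy series; the last entry is a deliberately misspelled, hypothetical value
travel = pd.Series(["Non-Travel", "Travel_Rarely",
                    "Travel_Frequently", "travel_rarely"])
mapping = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}

# Values missing from the mapping become NaN
encoded = travel.map(mapping)

# List the original values that were not covered by the mapping
unmapped = travel[encoded.isna()].unique().tolist()
print("Unmapped values:", unmapped)
```

If the list is empty, every category was covered by the mapping.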

Next, let's turn the rest of the string/object columns into numeric indicator columns with the get_dummies function (one-hot encoding) from the pandas package, so we can feed the data into LogisticRegression. I also create a second dataframe, df_model_inverse_2, that will be useful later:

In [None]:
# One-Hot Encoding:
df_model_2, df_model_inverse_2 = pd.get_dummies(df_model_2), pd.get_dummies(df_model_2)
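As a quick illustration of what one-hot encoding does, here is a minimal sketch on a made-up three-row frame (the values are hypothetical, not taken from the dataset). Numeric columns pass through unchanged, while each string column is expanded into one indicator column per category.

```python
import pandas as pd

# Toy frame with one numeric and one string column (hypothetical values)
toy = pd.DataFrame({
    "Age": [29, 41, 35],
    "Department": ["Sales", "Research & Development", "Sales"],
})

# Numeric columns are kept as-is; each string/object column becomes
# one indicator column per distinct category
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```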

Let's check the treatment's correlation to employee turnover:

In [None]:
def correlation_treatment(df:pd.DataFrame):
    """Function to calculate the treatment's correlation
    """
    correlation = df[['treatment','churn']].corr(method ='pearson') 
    return(pd.DataFrame(round(correlation.loc['churn'] * 100,2)))

In [None]:
print("Dataset 2:", correlation_treatment(df_model_2).iloc[0,0])

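Good, the treatment is negatively correlated with churn. We will use the positively correlated version at the end of this project. Next, let's add the four uplift categories to the dataset: control non-responders (CN), control responders (CR), treatment non-responders (TN), and treatment responders (TR), where a responder is an employee who stays (churn = 0):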

In [None]:
def declare_target_class(df:pd.DataFrame):
    """Function for declare the target class
    """
    #CN:
    df['target_class'] = 0 
    #CR:
    df.loc[(df.treatment == 0) & (df.churn == 0),'target_class'] = 1 
    #TN:
    df.loc[(df.treatment == 1) & (df.churn == 1),'target_class'] = 2 
    #TR:
    df.loc[(df.treatment == 1) & (df.churn == 0),'target_class'] = 3 
    return df

In [None]:
# Add the four target classes
df_model_2 = declare_target_class(df_model_2)
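To make the four classes concrete, the sketch below enumerates every combination of `treatment` and `churn` on a toy frame. It uses a closed-form expression, `2 * treatment + (1 - churn)`, which is my own shorthand and assumes both columns are strictly 0/1; it produces the same codes as `declare_target_class` above.

```python
import pandas as pd

# All four combinations of treatment and churn
toy = pd.DataFrame({
    "treatment": [0, 0, 1, 1],
    "churn":     [1, 0, 1, 0],
})

# Shorthand (assumes 0/1 columns): the treatment bit selects control (0-1)
# or treated (2-3); staying (churn == 0) adds 1 to mark a responder
toy["target_class"] = 2 * toy["treatment"] + (1 - toy["churn"])

# Human-readable names for the class codes
names = {0: "CN", 1: "CR", 2: "TN", 3: "TR"}
toy["label"] = toy["target_class"].map(names)
print(toy)
```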

# 4. Machine Learning Modeling

Finally we're ready to start the machine learning process:

In [None]:
def split_data(df_model:pd.DataFrame):
    """Split data into training data and testing data
    """
    X = df_model.drop(['churn','target_class'],axis=1)
    y = df_model.churn
    z = df_model.target_class
    X_train, X_test, \
    y_train, y_test, \
    z_train, z_test = train_test_split(X,
                                       y,
                                       z,
                                       test_size=0.3,
                                       random_state=42,
                                       stratify=df_model['treatment'])
    return X_train,X_test, y_train, y_test, z_train, z_test


def machine_learning(X_train:pd.DataFrame,
                     X_test:pd.DataFrame,
                     y_train:pd.Series,
                     y_test:pd.Series,
                     z_train:pd.Series,
                     z_test:pd.Series):
    """Machine learning process consists of 
    data training, and data testing process (i.e. prediction) with Logistic Regression Algorithm
    """
    # prepare a new DataFrame
    prediction_results = pd.DataFrame(X_test).copy()
    
    # train the ETU model
    model_etu \
    = LogisticRegression().fit(X_train.drop('treatment', axis=1), z_train)
    # prediction Process for ETU model 
    prediction_etu \
    = model_etu.predict(X_test.drop('treatment', axis=1))
    probability__etu \
    = model_etu.predict_proba(X_test.drop('treatment', axis=1))
    prediction_results['prediction_target_class'] = prediction_etu
    prediction_results['proba_CN'] = probability__etu[:,0] 
    prediction_results['proba_CR'] = probability__etu[:,1] 
    prediction_results['proba_TN'] = probability__etu[:,2] 
    prediction_results['proba_TR'] = probability__etu[:,3]
    prediction_results['score_etu'] = prediction_results.eval('\
    proba_CN/(proba_CN+proba_CR) \
    + proba_TR/(proba_TN+proba_TR) \
    - proba_TN/(proba_TN+proba_TR) \
    - proba_CR/(proba_CN+proba_CR)')  
    
    # add the churn and target class into dataframe as validation data
    prediction_results['churn'] = y_test
    prediction_results['target_class'] = z_test
    return prediction_results


def predict(df_model:pd.DataFrame):
    """Combining data split and machine learning process with Logistic Regression
    """
    X_train, X_test, y_train, y_test, z_train, z_test = split_data(df_model)
    prediction_results = machine_learning(X_train,
                                          X_test,
                                          y_train,
                                          y_test,
                                          z_train,
                                          z_test)
    print("Prediction has succeeded")
    return prediction_results

In [None]:
# Machine Learning Modelling Process
print("predicting dataset 2 ...")
prediction_results_2 = predict(df_model_2)

The prediction results for dataset 2 are stored in prediction_results_2.
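A note on the `stratify` argument used in `split_data`: it keeps the share of treated employees the same in the training and test sets. The following sketch checks this on a synthetic frame (100 made-up rows, 70% treated), not on the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame: 70 treated rows and 30 control rows
toy = pd.DataFrame({"treatment": [1] * 70 + [0] * 30, "x": range(100)})

# Same split settings as split_data, stratified on the treatment column
train, test = train_test_split(toy,
                               test_size=0.3,
                               random_state=42,
                               stratify=toy["treatment"])

# The treated share should match in both parts
print("train treated share:", train["treatment"].mean())
print("test treated share: ", test["treatment"].mean())
```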

# 5. Evaluating predictive performance

Now let's evaluate the predictive performance:

In [None]:
def cm_evaluation(df:pd.DataFrame):
    """Confusion matrix evaluation
    """      
    print("-----------------------------------")
    
    print("ETU's confusion matrix result:")   
    # One 2x2 matrix per class, laid out by scikit-learn as [[TN, FP], [FN, TP]]:
    # rows are the actual membership, columns are the predicted membership
    confusion_etu = multilabel_confusion_matrix(df['target_class'],
                                                df['prediction_target_class'],
                                                labels=[0, 1, 2, 3])
    rows = ['Actual negative','Actual positive']
    cols = ['Predicted negative','Predicted positive']
    print("a. CN's confusion matrix:")  
    df_cn = pd.DataFrame(confusion_etu[0], columns = cols, index = rows)
    print(df_cn)
    print("b. CR's confusion matrix:") 
    df_cr = pd.DataFrame(confusion_etu[1], columns = cols, index = rows)
    print(df_cr) 
    print("c. TN's confusion matrix:")
    df_tn = pd.DataFrame(confusion_etu[2], columns = cols, index = rows)
    print(df_tn) 
    print("d. TR's confusion matrix:") 
    df_tr = pd.DataFrame(confusion_etu[3], columns = cols, index = rows)
    print(df_tr)
    
    print("===================================")

In a [confusion matrix](https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62), the true positives and true negatives count the correct predictions, while the false positives and false negatives count the incorrect ones. With that in mind, let's generate the confusion matrices:

In [None]:
# Confusion Matrix Evaluation
print("Dataset 2")
cm_evaluation(prediction_results_2)
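For reference, `multilabel_confusion_matrix` returns one 2x2 matrix per class in the layout `[[TN, FP], [FN, TP]]`. The sketch below checks this on a handful of made-up labels (not model output), reading off the four counts for class 3 (TR).

```python
from sklearn.metrics import multilabel_confusion_matrix

# Made-up true and predicted classes (0 = CN, 1 = CR, 2 = TN, 3 = TR)
y_true = [0, 1, 2, 3, 3, 3, 0]
y_pred = [0, 3, 2, 3, 3, 1, 0]

# One 2x2 matrix per class, in the order given by `labels`
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])

# Unpack the matrix for class 3 in scikit-learn's order: TN, FP, FN, TP
tn, fp, fn, tp = mcm[3].ravel()
print(f"class TR -> TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Here two of the three actual TR rows are predicted as TR (true positives), one is missed (false negative), and one CR row is wrongly predicted as TR (false positive).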

Now, let's calculate the accuracy result:

In [None]:
def accuracy_evaluation(df:pd.DataFrame):
    """Accuracy evaluation
    """    
    
    akurasi_uplift = accuracy_score(df['target_class'],
                                    df['prediction_target_class'])
    print('ETU model accuracy: %.2f%%' % (akurasi_uplift * 100.0))

In [None]:
# Accuracy Evaluation Process.
print("Dataset 2")
accuracy_evaluation(prediction_results_2)

It seems that ETP models are much better than ETU models in terms of prediction accuracy. That makes sense, because ETP models predict only two possible outcomes (the employee leaves or stays), whereas ETU models predict four (Persuadables, Sure Things, Lost Causes, and Sleeping Dogs/Do-not-disturbs). But will ETP also perform better at actually solving the employee turnover problem (prescriptive performance)? Let's find out.

# 6. Evaluating prescriptive performance

Now let's use the prediction results to solve the problem. As explained before, under the ETP model employees are ranked by their turnover probability, and those with the highest probability are targeted with a retention campaign (the treatment feature declared earlier). Under the ETU model, by contrast, employees are ranked by their uplift score, computed with LGWUM's formulation.
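To see what the `score_etu` expression in `machine_learning` computes, here is a small worked sketch on two made-up probability rows. It describes the expression exactly as written in this notebook's code and makes no claim about other published variants of the formula. Because `proba_CN/(proba_CN+proba_CR)` equals one minus `proba_CR/(proba_CN+proba_CR)` (and likewise for TN and TR), the four terms collapse to twice the difference between the estimated stay rate under treatment and under control.

```python
import numpy as np

# Made-up class probabilities per employee, columns: CN, CR, TN, TR
proba = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.4, 0.1, 0.1, 0.4],
])
cn, cr, tn, tr = proba.T

# Score exactly as written in the notebook's eval expression
score = cn / (cn + cr) + tr / (tn + tr) - tn / (tn + tr) - cr / (cn + cr)

# Algebraically equivalent form: 2 * (stay rate if treated - stay rate if control)
simplified = 2 * (tr / (tn + tr) - cr / (cn + cr))

print(score)
print(simplified)
```

A positive score suggests the employee is more likely to stay when treated than when not, which is why employees are ranked from the highest score down.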

In [None]:
def sorting_data(df:pd.DataFrame):
    """Function to sort data
    """
    # Set up a new DataFrame for the ETU model
    df_u = pd.DataFrame({'n':[], 'target_class':[]})
    df_u['target_class'] = df['target_class']
    
    
    # Add percentile ranks (the highest uplift score gets the smallest rank)
    df_u['n'] = df.score_etu.rank(pct=True, ascending=False)
    df_u['score'] = df['score_etu']
    
    
    # Sort the data by percentile rank
    df_u = df_u.sort_values(by='n').reset_index(drop=True)
    return df_u


def calculating_qini(df:pd.DataFrame):
    """Function to measure the Qini value
    """
    # Calculate the C, T, CR, and TR
    C, T = sum(df['target_class'] <= 1), sum(df['target_class'] >= 2)
    df['cr'] = 0
    df['tr'] = 0
    df.loc[df.target_class  == 1,'cr'] = 1
    df.loc[df.target_class  == 3,'tr'] = 1
    df['cr/c'] = df.cr.cumsum() / C
    df['tr/t'] = df.tr.cumsum() / T
    
    
    # Calculate & add the qini value into the Dataframe
    df['uplift'] = df['tr/t'] - df['cr/c']
    df['random'] = df['n'] * df['uplift'].iloc[-1]
    # Add q0 into the Dataframe
    q0 = pd.DataFrame({'n':0, 'uplift':0, 'target_class': None}, index =[0])
    qini = pd.concat([q0, df]).reset_index(drop = True)
    return qini


def merging_data(df_u:pd.DataFrame):
    """Function to add the 'Model' column and merge the dataframe into one
    """
    df_u['model'] = 'ETU'
    df = pd.concat([df_u]).sort_values(by='n').reset_index(drop = True)
    return df


def plot_qini(df:pd.DataFrame):
    """Function to plot qini
    """
    # Define the data that will be plotted
    order = ['ETU','ETP']
    ax = sns.lineplot(x='n', y=df.uplift, hue='model', data=df,
                      style='model', palette=['red','deepskyblue'],
                      style_order=order, hue_order = order)
    
    
    # Additional plot display settings
    handles, labels = ax.get_legend_handles_labels()
    plt.xlabel('Proportion targeted',fontsize=30)
    plt.ylabel('Uplift',fontsize=30)
    plt.subplots_adjust(right=1)
    plt.subplots_adjust(top=1)
    plt.legend(fontsize=30)
    ax.tick_params(labelsize=24)
    ax.legend(handles=handles[1:], labels=labels[1:])
    ax.plot([0,1], [0,df.loc[len(df) - 1,'uplift']],'--', color='grey')
    return ax


def evaluation_qini(prediction_results:pd.DataFrame):
    """Function to combine all qini evaluation processes
    """
    df_u = sorting_data(prediction_results)
    qini_u = calculating_qini(df_u)
    qini = merging_data(qini_u)
    ax = plot_qini(qini)
    return ax, qini

In [None]:
# Qini evaluation results for DataSet 2 with negative treatment correlation
ax, qini_2 = evaluation_qini(prediction_results_2)
plt.title('Qini Curve - Dataset 2',fontsize=20)


# save into pdf:
# plt.savefig('qini_2_n.pdf', bbox_inches='tight')
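The cumulative calculation behind the curve can be traced by hand on a tiny example. The sketch below mirrors the logic of `calculating_qini` on six made-up rows (hypothetical scores and classes): after sorting by score, the uplift at each row is the cumulative share of treated responders minus the cumulative share of control responders.

```python
import pandas as pd

# Six made-up employees, already scored; classes: 0=CN, 1=CR, 2=TN, 3=TR
toy = pd.DataFrame({
    "score": [0.9, 0.7, 0.5, 0.3, 0.2, 0.1],
    "target_class": [3, 3, 1, 2, 0, 1],
}).sort_values("score", ascending=False).reset_index(drop=True)

# Group sizes: treated rows are TN/TR, control rows are CN/CR
T = (toy["target_class"] >= 2).sum()
C = (toy["target_class"] <= 1).sum()

# Cumulative responder shares in each group, going down the ranking
tr_rate = (toy["target_class"] == 3).cumsum() / T
cr_rate = (toy["target_class"] == 1).cumsum() / C

# Uplift at each cut-off of the ranking
toy["uplift"] = tr_rate - cr_rate
print(toy)
```

In this toy ranking the treated responders sit at the top, so the uplift rises early and falls back as the control responders further down the list are included.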

The next step is to invert the treatment parameter, which also flips the treatment's correlation with churn from negative to positive. This is the opposite of the previous treatment: the treatment now targets employees with overtime.


In [None]:
# Invert the treatment: 1 now means the employee works overtime, 0 means no overtime
df_model_inverse_2.treatment = df_model_inverse_2.treatment.replace({0: 1, 1: 0})
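Swapping the 0 and 1 labels of a binary column flips the sign of its Pearson correlation with any other column while leaving the magnitude unchanged. A minimal sketch on made-up data (eight hypothetical rows):

```python
import pandas as pd

# Made-up binary treatment and churn columns
toy = pd.DataFrame({
    "treatment": [1, 1, 1, 0, 0, 0, 1, 0],
    "churn":     [0, 0, 1, 1, 1, 0, 0, 1],
})

# Correlation before the swap
before = toy["treatment"].corr(toy["churn"])

# Swap 0 and 1, the same effect as replace({0: 1, 1: 0})
toy["treatment"] = 1 - toy["treatment"]

# Correlation after the swap: same magnitude, opposite sign
after = toy["treatment"].corr(toy["churn"])
print(before, after)
```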

In [None]:
# Recalculate the treatment correlation
display(correlation_treatment(df_model_inverse_2).iloc[0,0])

Good, the treatment is now positively correlated with employee turnover. This suggests that targeting employees with this treatment is likely to increase the turnover rate, so it should be used carefully. Now let's repeat the prediction procedure:

In [None]:
# Add the target class feature to the inverted dataset
df_model_inverse_2= declare_target_class(df_model_inverse_2)

In [None]:
# Run the prediction process once more
prediction_results_inverse_2 = predict(df_model_inverse_2)

In [None]:
# qini evaluation results for DataSet 2 with positive treatment correlation
ax, qini_inverse_2 = evaluation_qini(prediction_results_inverse_2)
plt.title('Qini Curve - Dataset 2',fontsize=20)


# save as a JPEG image:
plt.savefig('qini_2_p.jpg', bbox_inches='tight')