1. Why is this project important? What problem are you trying to solve?

We want to help the company understand and predict employee attrition. A HubSpot report found that productivity lost to employee turnover costs U.S. businesses 1.8 trillion dollars every year. Some studies estimate that replacing a salaried employee costs 6 to 9 months of that employee's salary on average. For an employee making 60,000 dollars a year, that is 30,000 to 45,000 dollars in recruiting and training expenses. Predicting employee attrition matters for several reasons:

**Retention and Cost**: Employee attrition can be costly for a company. It involves expenses related to hiring and training new employees, as well as the potential loss of productivity during the transition period.

**Employee Satisfaction and Engagement**: High attrition rates can indicate underlying issues with employee satisfaction, engagement, and well-being. Investigating attrition helps identify why employees are leaving, so the company can address their concerns and improve the work environment.

**Talent Acquisition and Employer Branding**: High attrition rates can negatively impact a company's reputation as an employer. Prospective employees may view high turnover as a red flag, affecting the company's ability to attract and retain top talent.

**Lost institutional knowledge**: When highly skilled or long-tenured employees leave, the organization loses institutional knowledge, that is, the combined skills and experience of the business.

These are the main reasons to help the company study employee attrition. Understanding and predicting attrition helps the company design strategies to reduce turnover, improve employee satisfaction, and strengthen talent management.
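
The replacement-cost estimate quoted earlier (6 to 9 months of salary) can be checked with a short calculation. This is only a sketch: the function name `replacement_cost_range` is hypothetical, and the 60,000 salary is the example figure from the text.

```python
def replacement_cost_range(annual_salary, min_months=6, max_months=9):
    """Return the (low, high) cost of replacing one salaried employee.

    Assumes the cost equals 6 to 9 months of salary, as quoted in the text.
    """
    monthly_salary = annual_salary / 12
    return monthly_salary * min_months, monthly_salary * max_months


low, high = replacement_cost_range(60_000)
print(f"Estimated replacement cost: {low:,.0f} to {high:,.0f}")
```

Multiplying this range by the number of leavers per year gives a rough upper bound on what a retention program could save.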

2. How do you measure the model performance (metrics)? What is the benchmark?

**Business value:**
The business value of solving this problem is the ability to identify employees who are at risk of leaving and to take proactive measures to retain them. This leads to a better work environment, higher employee satisfaction, retention of talented employees, lower training costs for new hires, and ultimately higher revenue.

**Benchmark:**
We establish a benchmark by comparing our model's performance metrics (such as accuracy, precision, recall, or AUC-ROC) against industry standards (typical attrition rates for the company) or against results from similar studies.
A second metric is the attrition rate itself: the proportion of employees who have left the company, which we correlate with conditions such as employee satisfaction, hourly rate, and promotion history.
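
One way to make the benchmark concrete is to compare the model with a trivial baseline that predicts that nobody leaves. The sketch below computes the metrics named above from confusion-matrix counts. All counts are hypothetical and are not results from this project; `classification_metrics` is an illustrative helper, not a library function.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test set: 1,000 employees, 160 of whom left (16% attrition).
baseline_scores = classification_metrics(tp=0, fp=0, fn=160, tn=840)  # always predicts "stays"
model_scores = classification_metrics(tp=96, fp=24, fn=64, tn=816)    # hypothetical classifier

for name, scores in [("baseline", baseline_scores), ("model", model_scores)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

With imbalanced classes the baseline already reaches high accuracy while finding no leavers at all, which is why precision, recall, and AUC-ROC are more informative benchmarks than accuracy alone.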

3. How does your model perform? Have you achieved your goal? How do you evaluate the business value of your model?

Yes. The RandomForest and CatBoost models both reach a precision of about 0.9.

4. What insights have you gotten from your model? What actionable suggestions can you provide to your business partner?

We use the CatBoost model to predict churn, with a precision of 0.905. Its feature importances point to factors the company can act on to reduce churn, such as overtime and monthly income.

5. What is the most challenging part of the project? How did you solve it? How will you further improve your model if you get more resources and time?

The most challenging part of this project was the exploratory data analysis (EDA) phase, particularly because of the large number of features. Here are the strategies we used:

1) Feature selection: we plot histograms to check the distribution of each feature and remove irrelevant features.
2) Correlated features: we generate heatmaps to check the correlation between features and remove multicollinear features.
3) Feature engineering: we use both one-hot encoding and ordinal encoding for categorical features, and we also train a CatBoost model, which handles categorical features natively, as a comparison with the other models.
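
The difference between the two encodings in the last item can be sketched on a toy frame. The column names mirror the dataset, but the rows are made up for illustration.

```python
import pandas as pd

# Toy data; the column names mirror the dataset, the rows are made up.
toy = pd.DataFrame({
    "BusinessTravel": ["Non-Travel", "Travel_Rarely", "Travel_Frequently", "Travel_Rarely"],
    "Department": ["Sales", "Human Resources", "Sales", "Research & Development"],
})

# Ordinal encoding: the travel categories have a natural order, so map them to 0, 1, 2.
travel_order = ["Non-Travel", "Travel_Rarely", "Travel_Frequently"]
toy["BusinessTravel"] = toy["BusinessTravel"].map({c: i for i, c in enumerate(travel_order)})

# One-hot encoding: departments have no order, so each one becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["Department"])
print(encoded)
```

Ordinal encoding keeps a single column but implies an ordering, so it suits ordered categories; one-hot encoding makes no such assumption at the cost of extra columns.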

In [None]:
import pandas as pd
data=pd.read_csv("IBM_HR_Data.csv", low_memory=False)

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
data.dtypes
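def convertAttrition(s):
    # Map the raw Attrition label to a binary target (1 = left voluntarily, 0 = still employed).
    # Any other label becomes None so that the row can be dropped afterwards.
    if s=="Voluntary Resignation":
        return 1
    if s=="Current employee":
        return 0
    return None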
    

In [None]:
import matplotlib.pyplot as plt
data["Attrition"]=data.Attrition.map(convertAttrition)
data=data.dropna(subset=["Attrition"])
data.Attrition.value_counts()
data_temp=data[["Gender","MaritalStatus","Department","EducationField","Employee Source","BusinessTravel","Over18","OverTime","Education","EmployeeCount","EnvironmentSatisfaction","JobInvolvement","JobLevel","JobRole","JobSatisfaction","DistanceFromHome","NumCompaniesWorked","PercentSalaryHike","PerformanceRating","RelationshipSatisfaction","StandardHours","StockOptionLevel","TotalWorkingYears","TrainingTimesLastYear","WorkLifeBalance","YearsAtCompany","YearsInCurrentRole","YearsSinceLastPromotion","YearsWithCurrManager","Attrition"]]
data_temp.columns

In [None]:
import plotly.express as px
from plotly.subplots import make_subplots
features=["Gender","MaritalStatus","Department","EducationField","Employee Source","BusinessTravel","Over18","OverTime","Education","EmployeeCount","EnvironmentSatisfaction","JobInvolvement","JobLevel","JobRole","JobSatisfaction","DistanceFromHome","NumCompaniesWorked","PercentSalaryHike","PerformanceRating","RelationshipSatisfaction","StandardHours","StockOptionLevel","TotalWorkingYears","TrainingTimesLastYear","WorkLifeBalance","YearsAtCompany","YearsInCurrentRole","YearsSinceLastPromotion","YearsWithCurrManager"]
features
from IPython.display import display
for i in features:
    
    fig = px.histogram(data_temp, x="Attrition", color=i, barmode="stack", title=f"<b>{i} distribution</b>")
    fig.update_layout(width=500, height=350, bargap=0.1)
    
    fig.update_xaxes(
        type="category",
        tickvals=[0, 1],
        ticktext=["0", "1"]
    )
    
    display(fig)


In [None]:
check_features_values=[]
check_features=[]
for i in features:
    # Count how often each value occurs and keep the values that appear exactly once
    counts=data[i].value_counts()
    indices=counts[counts == 1].index.tolist()
    #print(indices)
    if len(indices)>0:
        for j in range(0,len(indices)):
            check_features_values.append(indices[j])
            check_features.append(i)  

In [None]:
data_cleanup=data
for i in range(0,len(check_features_values)):
    data_cleanup=data_cleanup[data_cleanup[check_features[i]] != check_features_values[i]]
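
The two cells above drop every row that contains a value seen only once in its column, presumably because such values are data-entry noise. The same idea can be sketched on a single made-up column with `value_counts` and `isin`; the misspelled value below is invented for illustration.

```python
import pandas as pd

# Made-up column: "Tset" is a typo that occurs only once.
toy = pd.DataFrame({"Department": ["Sales", "Sales", "HR", "HR", "Tset"]})

counts = toy["Department"].value_counts()
singletons = counts[counts == 1].index  # values seen exactly once
cleaned = toy[~toy["Department"].isin(singletons)]
print(cleaned["Department"].tolist())
```

Note that this rule also removes legitimate rare values, so it is worth inspecting the singleton list before dropping rows.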

In [None]:
from IPython.display import display
import seaborn as sns
sns.set(style="whitegrid")
sns.set_context("paper")
for i in features:
    
    fig = px.histogram(data_cleanup, x="Attrition", color=i, barmode="stack", title=f"<b>{i} distribution</b>")
    fig.update_layout(width=500, height=350, bargap=0.1)
    
    fig.update_xaxes(
        type="category",
        tickvals=[0, 1],
        ticktext=["0", "1"]
    )
    
    display(fig)

In [None]:
data_cleanup = data_cleanup.fillna(data_cleanup.mode().iloc[0])
data_cleanup["Application ID"].value_counts()
# Drop placeholder rows that were entered as test records
data_cleanup = data_cleanup[data_cleanup["Application ID"] != "Test"]
data_cleanup = data_cleanup[~data_cleanup["EmployeeNumber"].isin(["Test", "TEST", "TESTING"])]

In [None]:
# Cast the numeric columns that were read as strings to float
float_columns = ['DistanceFromHome', 'EmployeeCount', 'EmployeeNumber', 'HourlyRate',
                 'JobSatisfaction', 'MonthlyIncome', 'PercentSalaryHike']
data_cleanup = data_cleanup.astype({column: 'float64' for column in float_columns})

In [None]:
data_cleanup.columns.to_series().groupby(data_cleanup.dtypes).groups
sns.set(style="whitegrid")
sns.set_context("paper")
data_cleanup.hist(figsize=(20,20))
plt.show()

In [None]:
#data_cleanup = data_cleanup.drop(["EmployeeCount","StandardHours","Over18"], axis=1)
feature_numerical=["Age","DistanceFromHome",'Education', 'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
from IPython.display import display, HTML
display(HTML("<style>.output{max-height:1000px !important;}</style>"))


In [None]:
feature_cat=['BusinessTravel',
 'Department',
 'EducationField',
 'Gender',
 'JobRole',
 'MaritalStatus',
 'OverTime',
 'Employee Source',"DistanceFromHome"]
feature_cat2=[]
for column in data_cleanup.columns:
    if data_cleanup[column].nunique() < 50:
        feature_cat2.append(column)
(feature_cat2).remove("Attrition")

In [None]:
import math

%matplotlib inline

# Calculate the number of rows and columns for the subplots
num_features = len(feature_cat2)
num_rows = 10
num_cols = 3

# Calculate the total number of subplots
num_subplots = num_rows * num_cols

# Create the figure and subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(40, 70))

sns.set(style="whitegrid")
sns.set_context("paper")

# Flatten the axes array for easier indexing
axes = axes.flatten()

# Iterate over the features and create subplots
for i, feature in enumerate(feature_cat2):
    if i < num_subplots:
        ax = axes[i]
        df_EducationField = pd.DataFrame(columns=["Field", "% of Leavers"])
        j = 0
        for field in list(data_cleanup[feature].unique()):
            ratio = data_cleanup[(data_cleanup[feature] == field) & (data_cleanup['Attrition'] == 1)].shape[0] / data_cleanup[data_cleanup[feature] == field].shape[0]
            df_EducationField.loc[j] = (field, ratio * 100)
            j += 1
        df_EF = df_EducationField.groupby(by="Field").sum()
        x_labels = df_EF.index.tolist()  # Get the categorical values for x-axis labels
        ax.bar(x_labels, df_EF['% of Leavers'])
        ax.set_xlabel("", fontsize=30)
        ax.set_ylabel("% of Leavers", fontsize=30)
        ax.set_title(feature, fontsize=30)
        ax.tick_params(axis='x', labelsize=30)
        ax.tick_params(axis='y', labelsize=30)
      #  ax.set_xticks(range(len(x_labels)))
      #  ax.set_xticklabels(x_labels, rotation=90, ha='right')

        # Mark maximum value with annotation
        max_value = df_EF['% of Leavers'].max()
        max_index = df_EF['% of Leavers'].idxmax()
        ax.annotate(f"Max: {max_value:.2f}%", xy=(max_index, max_value), xytext=(0, 5),
                    textcoords='offset points', ha='center', fontsize=30, color='red')

# Hide the empty subplots if there are any
if num_features < num_subplots:
    for j in range(num_features, num_subplots):
        fig.delaxes(axes[j])

# Adjust the layout
plt.tight_layout()

# Show the plots
plt.show()
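
The per-category leaver percentage plotted above can also be computed without an explicit loop. This is a minimal sketch on made-up rows, assuming `Attrition` is already encoded as 0/1.

```python
import pandas as pd

# Toy data with made-up rows; "Attrition" is 1 for leavers and 0 for current employees.
toy = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "Attrition": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the share of leavers, so groupby + mean gives
# the percentage of leavers in each category directly.
leaver_pct = toy.groupby("OverTime")["Attrition"].mean() * 100
print(leaver_pct.round(1))
```

The resulting Series can be passed straight to `ax.bar(leaver_pct.index, leaver_pct.values)` in place of the row-by-row DataFrame construction.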


In [None]:
import pandas as pd
data=pd.read_csv("cleaned_forOneHot.csv", low_memory=False)
pd.set_option('display.max_columns', None)

In [None]:
data

In [None]:
data=data.dropna()

In [None]:
new_order=["Age","Gender","MaritalStatus","Education","EducationField","Employee Source","DistanceFromHome","Department","BusinessTravel","JobLevel","JobRole","OverTime","StandardHours","NumCompaniesWorked","EmployeeCount","HourlyRate","DailyRate","MonthlyRate","MonthlyIncome","PercentSalaryHike","StockOptionLevel","TotalWorkingYears","TrainingTimesLastYear","YearsAtCompany","YearsInCurrentRole","YearsSinceLastPromotion","YearsWithCurrManager","JobInvolvement","PerformanceRating","EnvironmentSatisfaction","JobSatisfaction","RelationshipSatisfaction","WorkLifeBalance","Attrition"]

In [None]:
data =data.reindex(columns=new_order)
data

In [None]:
encoded_data = pd.get_dummies(data, columns=['MaritalStatus'])

In [None]:
encoded_data = pd.get_dummies(encoded_data, columns=['EducationField'])
encoded_data = pd.get_dummies(encoded_data, columns=['Employee Source'])
encoded_data = pd.get_dummies(encoded_data, columns=['Department'])
encoded_data = pd.get_dummies(encoded_data, columns=['JobRole'])
encoded_data = pd.get_dummies(encoded_data, columns=['Gender'])


In [None]:
encoded_data.BusinessTravel.value_counts()

In [None]:
category_order = ['Non-Travel','Travel_Rarely','Travel_Frequently']
category_mapping = {category: index for index, category in enumerate(category_order)}
encoded_data['BusinessTravel'] = encoded_data['BusinessTravel'].map(category_mapping)

In [None]:
category_order = ['No','Yes']
category_mapping = {category: index for index, category in enumerate(category_order)}
encoded_data['OverTime'] = encoded_data['OverTime'].map(category_mapping)

In [None]:
def convertAttrition(s):
    if s=="Voluntary Resignation":
        return 1
    if s=="Current employee":
        return 0
    else:
        return None
encoded_data["Attrition"]=encoded_data.Attrition.map(convertAttrition)

In [None]:
encoded_data.columns

In [None]:
new_order=['Age', 'Gender_Female', 'Gender_Male','MaritalStatus_Divorced', 'MaritalStatus_Married',
       'MaritalStatus_Single','Education','EducationField_Human Resources',
       'EducationField_Life Sciences', 'EducationField_Marketing',
       'EducationField_Medical', 'EducationField_Other',
       'EducationField_Technical Degree', 'EducationField_Test','Employee Source_Adzuna', 'Employee Source_Company Website',
       'Employee Source_GlassDoor', 'Employee Source_Indeed',
       'Employee Source_Jora', 'Employee Source_LinkedIn',
       'Employee Source_Recruit.net', 'Employee Source_Referral',
       'Employee Source_Seek', 'Employee Source_Test', 'DistanceFromHome', 'Department_Human Resources', 'Department_Research & Development',
       'Department_Sales','BusinessTravel', 'JobLevel','JobRole_Healthcare Representative',
       'JobRole_Human Resources', 'JobRole_Laboratory Technician',
       'JobRole_Manager', 'JobRole_Manufacturing Director',
       'JobRole_Research Director', 'JobRole_Research Scientist',
       'JobRole_Sales Executive', 'JobRole_Sales Representative',
       'OverTime', 'StandardHours', 'NumCompaniesWorked', 'EmployeeCount',
       'HourlyRate', 'DailyRate', 'MonthlyRate', 'MonthlyIncome',
       'PercentSalaryHike', 'StockOptionLevel', 'TotalWorkingYears',
       'TrainingTimesLastYear', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager', 'JobInvolvement',
       'PerformanceRating', 'EnvironmentSatisfaction', 'JobSatisfaction',
       'RelationshipSatisfaction', 'WorkLifeBalance', 'Attrition'
       ]

In [None]:
encoded_data =encoded_data.reindex(columns=new_order)

encoded_data = encoded_data.drop('StandardHours', axis=1)
encoded_data = encoded_data.drop('EmployeeCount', axis=1)
encoded_data

In [None]:
correlation_cost_median = encoded_data.corr(method='pearson')
filtered_features = []

fig, axs = plt.subplots(nrows=11, ncols=4, figsize=(20, 35))

for feature in encoded_data.columns:
    # Keep the ten strongest positive correlations (the first entry is the feature itself)
    top_10_corr_features = correlation_cost_median[feature].sort_values(ascending=False).head(10)
    top_corr_features_filtered = top_10_corr_features[top_10_corr_features > 0.05]

    if len(top_corr_features_filtered) >= 2:
        filtered_features.append((feature, top_corr_features_filtered.iloc[0], top_corr_features_filtered.iloc[1]))

# Sort the filtered features based on the correlation values of the top two features
filtered_features = sorted(filtered_features, key=lambda x: abs(x[1]) + abs(x[2]), reverse=True)

for i, (feature, corr1, corr2) in enumerate(filtered_features):
    row = i // 4  # Changed row and col calculation
    col = i % 4

    top_10_corr_features = correlation_cost_median[feature].sort_values(ascending=False).head(10)
    top_corr_features_filtered = top_10_corr_features[top_10_corr_features > 0.05]

    heatmap = sns.heatmap(top_10_corr_features.values.reshape(-1, 1), annot=True, cmap='coolwarm', vmin=-1, vmax=1, cbar=False, ax=axs[row, col], annot_kws={"fontsize": 12})
    heatmap.set_title(f'{feature}', fontdict={'fontsize': 12}, pad=12)
    heatmap.set_yticklabels(top_10_corr_features.index, rotation=0, fontsize=12)

plt.tight_layout()
plt.show()


In [None]:
encoded_data

In [None]:
y=encoded_data["Attrition"]
encoded_data_new = encoded_data.drop('Attrition', axis=1)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(encoded_data_new, y, test_size=0.3, random_state=42)

### Modeling: We first fit a simple logistic regression as a baseline model

(1) LogisticRegression

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

steps = [('rescale', MinMaxScaler()),
         ('logr', LogisticRegression(max_iter=1000))]
model = Pipeline(steps)
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
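# ROC AUC is a ranking metric, so score it on predicted probabilities rather than hard labels
auc_roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])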

In [None]:
print("Accuracy: {:.3f}".format(accuracy))
print("Precision: {:.3f}".format(precision))
print("Recall: {:.3f}".format(recall))
print("F1-score: {:.3f}".format(f1))
print("AUC-ROC: {:.3f}".format(auc_roc))
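
ROC AUC measures how well a model ranks leavers above stayers, so it is normally computed from predicted probabilities rather than from the 0/1 output of `predict`. The sketch below shows the difference with a hand-rolled AUC (the Mann-Whitney formulation, which should agree with scikit-learn's `roc_auc_score`). The labels and probabilities are made up.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities for six employees.
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.10, 0.40, 0.45, 0.35, 0.60, 0.90]
hard = [int(p >= 0.5) for p in proba]  # what predict() would return at a 0.5 threshold

print("AUC from probabilities:", round(roc_auc(y_true, proba), 3))
print("AUC from hard labels:  ", round(roc_auc(y_true, hard), 3))
```

The two numbers differ because thresholding collapses the ranking into two groups. The value computed from hard labels describes a single operating point, not the full ROC curve.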

In [None]:
## Confusion Matrix
import numpy as np
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, model.predict(X_test))
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Calculate the predicted probabilities for the positive class
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Compute the false positive rate (FPR), true positive rate (TPR), and thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Calculate the AUC-ROC
auc = roc_auc_score(y_test, y_pred_proba)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='ROC curve (AUC = {:.3f})'.format(auc))
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

(2) RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Create a Random Forest classifier
model = RandomForestClassifier()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
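# ROC AUC is a ranking metric, so score it on predicted probabilities rather than hard labels
auc_roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])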

# Print the evaluation metrics
print("Accuracy: {:.3f}".format(accuracy))
print("Precision: {:.3f}".format(precision))
print("Recall: {:.3f}".format(recall))
print("F1-score: {:.3f}".format(f1))
print("AUC-ROC: {:.3f}".format(auc_roc))


In [None]:
## Confusion Matrix
cnf_matrix = metrics.confusion_matrix(y_test, model.predict(X_test))
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

In [None]:
from sklearn.model_selection import GridSearchCV

rf_classifier = RandomForestClassifier(class_weight = "balanced",random_state=7)
param_grid = {'n_estimators': [50, 75, 100, 125, 150, 175],
              'min_samples_split':[2,4,6,8,10],
              'min_samples_leaf': [1, 2, 3, 4],
              'max_depth': [5, 10, 15, 20, 25]}

grid_obj = GridSearchCV(rf_classifier,
                        return_train_score=True,
                        param_grid=param_grid,
                        scoring='roc_auc',
                        cv=10)

grid_fit = grid_obj.fit(X_train, y_train)
rf_opt = grid_fit.best_estimator_

print('='*20)
print("best estimator: " + str(grid_obj.best_estimator_))
print("best params: " + str(grid_obj.best_params_))
print('best score:', grid_obj.best_score_)
print('='*20)
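
Before running an exhaustive grid search it helps to estimate its cost. The sketch below counts the candidates in the same grid as above. It only does the arithmetic and does not train anything.

```python
from itertools import product

# Same grid as in the search above.
param_grid = {
    "n_estimators": [50, 75, 100, 125, 150, 175],
    "min_samples_split": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 2, 3, 4],
    "max_depth": [5, 10, 15, 20, 25],
}
cv_folds = 10

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = len(list(product(*param_grid.values())))
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} model fits")
```

If that many fits is too slow, `RandomizedSearchCV` with a fixed `n_iter` samples only part of the grid, and `n_jobs=-1` spreads the work across CPU cores.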

(3) XGBClassifier

In [None]:
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score


# Create an XGBoost classifier
model = xgb.XGBClassifier()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
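# ROC AUC is a ranking metric, so score it on predicted probabilities rather than hard labels
auc_roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])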

# Print the evaluation metrics
print("Accuracy: {:.3f}".format(accuracy))
print("Precision: {:.3f}".format(precision))
print("Recall: {:.3f}".format(recall))
print("F1-score: {:.3f}".format(f1))
print("AUC-ROC: {:.3f}".format(auc_roc))


In [None]:
## Confusion Matrix
cnf_matrix = metrics.confusion_matrix(y_test, model.predict(X_test))
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

In [None]:
data["Attrition"]=data.Attrition.map(convertAttrition)
y=data["Attrition"]
data_cat = data.drop('Attrition', axis=1)

In [None]:
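# Positions of the categorical columns in X: Gender, MaritalStatus, EducationField,
# Employee Source, Department, BusinessTravel, JobRole, OverTime
cat_features=[1,2,4,5,7,8,10,11]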
X = data.drop('Attrition', axis=1)

In [None]:
X

In [None]:
import catboost as cb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assuming you have your feature data X and target variable y

# Step 1: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Create and train the CatBoost model
model = cb.CatBoostClassifier(iterations=1000, learning_rate=0.1, random_seed=42)
model.fit(X_train, y_train, cat_features=cat_features)

# Step 3: Make predictions
y_pred = model.predict(X_test)

# Step 4: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

In [None]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
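# ROC AUC is a ranking metric, so score it on predicted probabilities rather than hard labels
auc_roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])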

# Print the evaluation metrics
print("Accuracy: {:.3f}".format(accuracy))
print("Precision: {:.3f}".format(precision))
print("Recall: {:.3f}".format(recall))
print("F1-score: {:.3f}".format(f1))
print("AUC-ROC: {:.3f}".format(auc_roc))


In [None]:
feature_importance = model.feature_importances_

# Print feature importance scores
for feature_name, importance in zip(X.columns, feature_importance):
    print(f"{feature_name}: {importance}")
    

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.cm as cm
from catboost import CatBoostClassifier, Pool

# Uses the importance scores stored in 'feature_importance' (previous cell)
# and takes the matching feature names from X.columns

# Sort the feature importance values and feature names together
sorted_indices = feature_importance.argsort()[::-1]
sorted_feature_importance = feature_importance[sorted_indices]
sorted_feature_names = [X.columns[i] for i in sorted_indices]

sns.set(style="whitegrid")
sns.set_context("paper")

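# Create a colormap with one color per feature
# (matplotlib.cm.get_cmap was removed in recent Matplotlib releases; plt.get_cmap is still available)
colormap = plt.get_cmap('viridis', len(sorted_feature_importance))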

# Plot the feature importance with color
plt.figure(figsize=(10, 6))
bars = plt.bar(range(len(sorted_feature_importance)), sorted_feature_importance, color=colormap(np.arange(len(sorted_feature_importance))))

plt.xticks(range(len(sorted_feature_importance)), sorted_feature_names, rotation='vertical')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importance')
plt.tight_layout()

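# Add a colorbar for reference; the colors encode the rank of each bar, not its raw importance
sm = cm.ScalarMappable(cmap=colormap)
sm.set_array([])  # dummy array for the colorbar
cbar = plt.colorbar(sm, ax=plt.gca())  # recent Matplotlib needs an explicit Axes here
cbar.set_label('Importance rank (0 = highest, 1 = lowest)', rotation=90)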

plt.show()

In [None]:
from sklearn.model_selection import train_test_split  # import 'train_test_split'
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Libraries for data modelling
from sklearn import svm, tree, linear_model, neighbors
from sklearn import naive_bayes, ensemble, discriminant_analysis, gaussian_process
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

# selection of algorithms to consider and set performance measure
models = []
models.append(('Logistic Regression', LogisticRegression(solver='liblinear', random_state=7,
                                                         class_weight='balanced')))
models.append(('Random Forest', RandomForestClassifier(
    n_estimators=100, random_state=7)))
models.append(('SVM', SVC(gamma='auto', random_state=7)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('Decision Tree Classifier',
               DecisionTreeClassifier(random_state=7)))
models.append(('Gaussian NB', GaussianNB()))

In [None]:
models

In [None]:
# Common sklearn Model Helpers
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

# sklearn modules for performance metrics
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve
from sklearn.metrics import auc, roc_auc_score, roc_curve, recall_score, log_loss
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score, make_scorer
from sklearn.metrics import average_precision_score

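# X_train/y_train were overwritten by the CatBoost split on un-encoded data,
# so rebuild the split from the encoded features before cross-validating
X_train, X_test, y_train, y_test = train_test_split(
    encoded_data_new, encoded_data["Attrition"], test_size=0.3, random_state=42)

acc_results = []
auc_results = []
names = []
# set up a table to populate with performance results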
col = ['Algorithm', 'ROC AUC Mean', 'ROC AUC STD', 
       'Accuracy Mean', 'Accuracy STD']
df_results = pd.DataFrame(columns=col)
i = 0
# evaluate each model using cross-validation
for name, model in models:
    kfold = model_selection.KFold(
        n_splits=10,   shuffle=False)  # 10-fold cross-validation

    cv_acc_results = model_selection.cross_val_score(  # accuracy scoring
        model, X_train, y_train, cv=kfold, scoring='accuracy')

    cv_auc_results = model_selection.cross_val_score(  # roc_auc scoring
        model, X_train, y_train, cv=kfold, scoring='roc_auc')

    acc_results.append(cv_acc_results)
    auc_results.append(cv_auc_results)
    names.append(name)
    df_results.loc[i] = [name,
                         round(cv_auc_results.mean()*100, 2),
                         round(cv_auc_results.std()*100, 2),
                         round(cv_acc_results.mean()*100, 2),
                         round(cv_acc_results.std()*100, 2)
                         ]
    i += 1
df_results.sort_values(by=['ROC AUC Mean'], ascending=False)

In [None]:
df_results.sort_values(by=["Algorithm"], ascending=False)

In [None]:
fig = plt.figure(figsize=(15, 7))
fig.suptitle('Algorithm Accuracy Comparison')
ax = fig.add_subplot(111)
plt.boxplot(acc_results)
ax.set_xticklabels(names)
plt.show()