# Telco Customer Churn - Supervised Learning Project

# 1. Project Topic

In this project we want to build a supervised machine learning model that is able to predict whether a customer of a telecommunication company is going to churn or not. This is a classification task with two potential outcomes. To achieve this, we are working with a public dataset from Kaggle that is not part of a competition which means there is no pre-specified evaluation metric. Since the optimal evaluation metric can depend on the structure of our dataset, we will first perform an exploratory data analysis and get to know our data. By leveraging our insights we can then decide which metric might be the best.

# 2. Project Goal

Churn, within a business context, refers to the loss of a previously acquired customer who has the potential to generate profit. The specific definition of churn may vary across industries. For instance, in the healthcare industry, customers who are deceased are considered churned, whereas in the finance sector, individuals with inactive credit cards are classified as churned.

Retaining an existing customer is more cost-effective than acquiring a new one. Therefore, preventing customer churn is crucial for maintaining a consistent revenue stream. By building a model that is able to predict churn, we could save our company a lot of money. We are also interested in which factors have the biggest impact on a customer's decision to churn. By knowing this, we might be able to implement new business strategies that help us retain our customers.

# 3. Data

The project is based on the following public Kaggle dataset: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

The dataset has a tabular format and has multiple columns:

* Churn - Customers who left within the last month
* Phone, internet, online security, online backup, etc. - Services that each customer has signed up for
* Contract information, payment method, monthly charges, total charges, etc. - Customer account information
* Gender, age, etc. - Demographic information about customers

# 4. Import Python Libraries

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
import scipy.stats as stats
import optuna

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')

# 5. Exploratory Data Analysis

## 5.1 Data Description

In [None]:
df = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
numerical_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
                        'InternetService','OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 
                        'StreamingTV', 'StreamingMovies', 'Contract','PaperlessBilling', 'PaymentMethod']

We are working with a dataset of 7043 rows and 21 columns. The last column contains the labeled output value 'Churn' that indicates whether a customer has churned or not. We can also see that there are a lot of categorical variables and only a few numerical variables. As the output of the above commands shows, the dataset occupies approximately 1.1MB, which is quite small, so we don't have to worry about special techniques for handling large files.

## 5.2 Data Cleaning

### 5.2.1 Missing Values

First, we check for NA or NULL values using the built-in functions of pandas dataframes.

In [None]:
df.isna().sum()

In [None]:
df.isnull().sum()

Finally, we can also search for empty (or just spaces) string values. 

In [None]:
np.where(df.applymap(lambda x: str(x).strip() == ''))

We observe no NA or NULL values but 11 empty strings or spaces in the 'TotalCharges' column. Such empty values make no sense inside a column where we would expect numerical values. In order to handle those empty values there are several commonly used strategies:

* Impute missing values based on their distribution
* Drop rows with empty values
* Use advanced imputation techniques (e.g. regression imputation)

We will start by looking at the distribution of the valid values.

In order to visualize the valid values of the 'TotalCharges' column we need to filter out the missing values first and convert the datatype to float64.

In [None]:
def visualize_totalcharges(df):
    # plot the dataframe passed in, not the global df_filtered
    plt.hist(df['TotalCharges'], bins=20)
    plt.xlabel('Value')
    plt.ylabel('Frequency')
    plt.title('Histogram of TotalCharges')
    plt.show()

In [None]:
# first filter out empty values (copy so we do not modify a view of df)
df_filtered = df[df['TotalCharges'].astype(str).str.strip() != ''].copy()
# then convert to numeric datatype
df_filtered['TotalCharges'] = pd.to_numeric(df_filtered['TotalCharges']).astype('float64')
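
As an alternative to filtering and converting in two steps, the blank entries can also be turned into proper missing values during conversion. The following is a minimal sketch on a made-up toy column (the values are illustrative, not taken from the dataset), assuming that `pd.to_numeric` with `errors='coerce'` is acceptable for this column:

```python
import pandas as pd

# Toy stand-in for the 'TotalCharges' column: numbers stored as strings,
# with one blank entry similar to the ones found in the real data.
toy = pd.DataFrame({'TotalCharges': ['29.85', ' ', '1889.5', '108.15']})

# errors='coerce' turns every value that cannot be parsed into NaN
toy['TotalCharges'] = pd.to_numeric(toy['TotalCharges'], errors='coerce')

# The blank entry now shows up in the usual missing-value checks
n_missing = int(toy['TotalCharges'].isna().sum())

# Dropping the rows is then a single call
toy_clean = toy.dropna(subset=['TotalCharges'])

print("Missing values:", n_missing)
print("Datatype after conversion:", toy_clean['TotalCharges'].dtype)
```

With this approach the missing entries become visible to `isna()`, so the same code path works for dropping or imputing them later.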

In [None]:
visualize_totalcharges(df_filtered)

The distribution is positively skewed, which means we cannot simply impute the mean value. The shape looks more like an exponential distribution. We will check this using the Kolmogorov-Smirnov test, a non-parametric test that can be used to compare a sample with a reference probability distribution, which is an exponential distribution in our case.

In [None]:
# perform the Kolmogorov-Smirnov test
data = df_filtered['TotalCharges'].tolist()
_, p_value = stats.kstest(data, 'expon')
print("P-value of Kolmogorov-Smirnov test: ", p_value)

In addition, we can also inspect visually if our data originates from an exponential distribution by creating a QQ-Plot.

In [None]:
stats.probplot(data, dist='expon', plot=plt)
plt.title('Q-Q Plot')
plt.show()

We observe a good match with the exponential distribution in the lower quantiles. However, for higher quantiles we don't have an exponential shape anymore. This is also reflected by the p-value of the Kolmogorov-Smirnov test. Assuming a typical significance level of 0.05, the p-value of 0 (which is rounded here) signals that the underlying distribution is indeed not an exponential one.
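
One detail worth keeping in mind when reading this p-value: when `stats.kstest` is called with `'expon'` and no further arguments, the reference is the standard exponential distribution (location 0, scale 1). The sketch below uses synthetic data (not the Telco data) and an assumed scale of 2000 to show how much the choice of reference parameters matters; the `args` tuple passes location and scale to the reference distribution.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)

# Synthetic sample that IS exponential, but with scale 2000 instead of 1
sample = rng.exponential(scale=2000, size=5000)

# Reference: standard exponential (loc=0, scale=1), the default for 'expon'
_, p_default = stats.kstest(sample, 'expon')

# Reference: exponential with matching location and scale, passed via args
_, p_scaled = stats.kstest(sample, 'expon', args=(0, 2000))

print("p-value against the standard exponential:", p_default)
print("p-value against an exponential with scale 2000:", p_scaled)
```

For real data the scale is unknown and would have to be estimated (for example from the sample mean), in which case the resulting p-value is only approximate.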

Since it is really hard to guess the exact distribution, we cannot easily fill the missing values with statistical measures like the mean or median at this point. That leaves two options for how to proceed from here.

* Drop rows with empty values
* Use advanced imputation techniques

We are dealing with only 11 empty values in contrast to over 7000 entries in our dataset. As a first step, we will simply drop them since they make up only such a small part of the whole data. But of course, we should keep in mind that there might be room for improvement here.

From this point on, we will work with the **df_filtered** dataset where we already removed the rows containing missing values and converted to the correct datatype.

### 5.2.2 Inspecting numerical features

Next, we will inspect the numerical features and check whether there are any outliers present in the data.

In [None]:
df_filtered.info()

First, we print out the dataset information and see that we correctly removed some entries. For our numerical features we plot histograms which allow us to easily recognize the distribution and detect any potential outliers. 

In [None]:
def visualize_numerical_features(df):
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 3))

    # plot the dataframe passed in, not the global df_filtered
    sns.histplot(df['tenure'].values, ax=axes[0]).set_title('tenure')
    sns.histplot(df['MonthlyCharges'].values, ax=axes[1]).set_title('MonthlyCharges')
    sns.histplot(df['TotalCharges'].values, ax=axes[2]).set_title('TotalCharges')

In [None]:
visualize_numerical_features(df_filtered)

Our numerical features do not show any obvious outliers. Most of our customers seem to have either very low or very high tenure values. That means there are many newly acquired customers and also many who have stayed with the company for at least 70 months. The monthly and total charges are both positively skewed, which implies that most customers are charged relatively small amounts. At this point it seems likely that monthly and total charges are correlated with each other, since the latter might be calculated from the former. Therefore, we will have a look at the correlation matrix.

In [None]:
# Calculate correlation matrix for numerical columns
corr_matrix = df_filtered[numerical_features].corr()

# Visualize correlation matrix using a heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

As the heatmap shows, tenure and total charges are highly positively correlated. As suspected earlier, there is also a noticeable correlation between monthly and total charges. Since these correlations can impact our model performance, we have to think about different strategies to handle them.

We could use techniques like Principal Component Analysis, Regularization or Feature Selection to fix this problem. But as a first step we will build our model without handling the correlation and see how it performs. Again, we should keep in mind that using one of those methods might yield an improvement in model performance.
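
To illustrate why these correlations are plausible, here is a small sketch on synthetic data. It assumes, purely for illustration, that total charges are roughly tenure multiplied by monthly charges plus some noise; the real billing logic may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic customers: tenure in months and a monthly fee (made-up ranges)
toy = pd.DataFrame({
    'tenure': rng.integers(1, 73, size=n),
    'MonthlyCharges': rng.uniform(20, 120, size=n),
})

# Assumption: total is approximately tenure * monthly fee, with 5% noise
toy['TotalCharges'] = toy['tenure'] * toy['MonthlyCharges'] * rng.normal(1.0, 0.05, size=n)

# Even though tenure and MonthlyCharges are independent here,
# both are clearly correlated with TotalCharges
corr = toy.corr()
print(corr.round(2))
```

Under this assumption the total charges carry little information beyond the other two columns, which is one reason to revisit the feature later.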

### 5.2.3 Inspecting categorical features

In this step, we will have a look at the categorical features by creating countplots using seaborn. We will see how the different categories are encoded and if we have to clean things up.

In [None]:
def visualize_categorical_features(df):
    
    fig,axes=plt.subplots(4,4, figsize=(15, 30))
    axes = axes.flatten()
    
    for i in range(len(categorical_features)):
        ax = sns.countplot(x=categorical_features[i], data=df, ax=axes[i], hue='Churn')

In [None]:
visualize_categorical_features(df_filtered)

There are many things to notice in the plots. For example, customers with month-to-month contracts churn more often than those with yearly contracts. The flexibility of these short contract durations likely attracts more people but also means that they stay with the company for a shorter time. Another insight is that customers with fiber optic internet service are more likely to churn than those who use DSL. As a company, we could try to convince more people to switch to DSL in order to reduce customer churn.

Overall, we observe quite an inconsistent encoding of the categorical values. For building our model it makes sense to have a consistent and standardized encoding, which we will apply now. We also recode the output column 'Churn' from 'Yes' and 'No' to 1 and 0.

In [None]:
df_final = df_filtered.copy()
le = LabelEncoder()

for f in categorical_features :
    df_final[f] = le.fit_transform(df_final[f])
    print(f,' : ',df_final[f].unique(),' = ',le.inverse_transform(df_final[f].unique()))
    
df_final['Churn'] = le.fit_transform(df_final['Churn'])
print('Churn',' : ',df_final['Churn'].unique(),' = ',le.inverse_transform(df_final['Churn'].unique()))

In [None]:
visualize_categorical_features(df_final)

This looks much cleaner. Finally, we perform a one-hot encoding creating dummy variables out of the categorical features and also drop the 'customerID' column since it adds no value to our model.

In [None]:
df_final = pd.get_dummies(df_final, columns = categorical_features, drop_first=True)
df_final.drop('customerID', axis=1, inplace=True)
df_final.head()
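
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame with a single label-encoded column. The column names mirror the dataset, but the values are illustrative only.

```python
import pandas as pd

# Toy frame: one numerical column and one label-encoded categorical column
toy = pd.DataFrame({'tenure': [1, 34, 2], 'Contract': [0, 1, 2]})

# One dummy column per category, except the first one (category 0),
# which becomes the implicit baseline when all dummies are zero
encoded = pd.get_dummies(toy, columns=['Contract'], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

A feature with k categories therefore contributes k - 1 columns, which is why names such as 'Contract_1' and 'Contract_2' appear later on but 'Contract_0' does not.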

## 5.3 Feature Importance

It might also be worth looking at the importance of different features for our model. This could help us decide which features could be dropped to further increase model performance. Since the plan is to work with tree-based models, we have the advantage that models like Random Forest or Gradient Boosting provide a built-in feature importance measure. We will have a look at this in a few moments. Maybe we will recognize that one of our highly correlated features (tenure, MonthlyCharges, TotalCharges) is less relevant and can be dropped, which would resolve the correlation problem as a side effect.

## 5.4 Investigating Imbalance

As a last step before building our model we have to investigate our dataset for potential imbalance since this plays a huge role in deciding which evaluation metric to choose.

In [None]:
def calculate_imbalance(df):
    
    churns = len(df.loc[df['Churn'] == 1])
    no_churns = len(df.loc[df['Churn'] == 0])
    
    # calculate the ratios
    imbalance_ratio = no_churns / churns
    churn_ratio = churns / (churns + no_churns)
    
    print("Imbalance ratio:", round(imbalance_ratio, 3))
    print("Ratio of churns:", round(churn_ratio, 3))
    

In [None]:
def visualize_imbalance(df):
    
    churns = len(df.loc[df['Churn'] == 1])
    no_churns = len(df.loc[df['Churn'] == 0])
    
    labels = ['Churn', 'No Churn']
    values = [churns, no_churns]
    
    # create a bar plot
    plt.bar(labels, values)
    plt.xlabel("Output")
    plt.ylabel("Count")
    plt.title("Imbalance of Dataset")
    plt.show()

In [None]:
calculate_imbalance(df_final)

In [None]:
visualize_imbalance(df_final)

We find that there are about 2.76 times as many customers who have not churned as customers who have churned. In other words, only 26.6% of the customers in the dataset churned. This means our dataset is imbalanced, and a metric like accuracy alone would be misleading for model evaluation.

If we did not deal with this problem, we would train our model to be biased and to predict 'No Churn' most of the time. We would obtain a model that cannot predict 'Churn' in a reliable way. There are multiple ways to deal with imbalanced data and prevent biased model performance. A few of them are:

* Resampling (Oversampling / Undersampling)
* Class Weighting during model building
* Focus on other evaluation metrics

Before we apply methods like resampling or class weighting we will build our model using the imbalanced data as a first step and focus on the alternative evaluation metrics.

A good choice might be the recall metric, which helps us avoid too many false negatives. As noted in the project goal, losing a customer is generally more expensive than retaining one. Therefore, false negatives will have a much worse impact on the business than a few false positives.
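
A small sketch shows why accuracy is misleading here. It uses a made-up label vector with the same churn share as our data (26.6%) and a trivial "model" that always predicts 'No Churn'.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1000 hypothetical customers, 266 of whom churned (label 1)
y_true = np.array([1] * 266 + [0] * 734)

# A useless classifier that predicts 'No Churn' (0) for everybody
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print("Accuracy:", acc)  # share of correct predictions
print("Recall:", rec)    # share of actual churners that were detected
```

The classifier reaches an accuracy of 73.4% without detecting a single churner, whereas its recall of 0 exposes the problem immediately.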

# 6. Model Building

For this classification task we will focus on logistic regression and tree-based models since they are very flexible yet still provide good explainability. We will try out different models, inspect how they perform on our dataset and compare them to each other. The models we will use are:

* Logistic Regression
* Random Forest
* Light GBM

Besides the first two commonly known models we will also try out a more sophisticated model called Light GBM which stands for light gradient-boosting machine, originally developed by Microsoft. This model is based on decision tree algorithms and used for ranking and classification tasks with a focus on performance and scalability.

## 6.1 Basic Models

First, we will split our preprocessed dataset into training and test set and prepare a data structure to store our evaluation metrics for model performance.

In [None]:
# split features and target variable
X = df_final.drop('Churn', axis=1)
y = df_final['Churn']

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# store resulting metrics
model_metrics = {'Model': [],
                 'Precision': [],
                 'Recall': [],
                 'F1': [],
                 'Accuracy': []}

In [None]:
# stores the model's evaluation metrics
def add_model_metrics(model_name, y_test, y_pred):
    model_metrics['Model'].append(model_name)
    model_metrics['Precision'].append(precision_score(y_test, y_pred))
    model_metrics['Recall'].append(recall_score(y_test, y_pred))
    model_metrics['F1'].append(f1_score(y_test, y_pred))
    model_metrics['Accuracy'].append(accuracy_score(y_test, y_pred))

# fit a logistic regression model
def fit_logreg(model_name, X_train, y_train, X_test, y_test):   
    logreg_classifier = LogisticRegression()
    logreg_classifier.fit(X_train, y_train)
    
    y_pred = logreg_classifier.predict(X_test)
    add_model_metrics(model_name, y_test, y_pred)
    return logreg_classifier

# fit a random forest model
def fit_randomforest(model_name, X_train, y_train, X_test, y_test):   
    rf_classifier = RandomForestClassifier()
    rf_classifier.fit(X_train, y_train)

    y_pred = rf_classifier.predict(X_test)
    add_model_metrics(model_name, y_test, y_pred)
    return rf_classifier
    
# fit a light GBM model
def fit_lightgbm(model_name, X_train, y_train, X_test, y_test):   
    gmb_classifier = lgb.LGBMClassifier()
    gmb_classifier.fit(X_train, y_train)

    y_pred = gmb_classifier.predict(X_test)
    add_model_metrics(model_name, y_test, y_pred)
    return gmb_classifier

In [None]:
# fit basic models
logreg_classifier = fit_logreg('LR', X_train, y_train, X_test, y_test)
rf_classifier = fit_randomforest('RF', X_train, y_train, X_test, y_test)
gmb_classifier = fit_lightgbm('GBM', X_train, y_train, X_test, y_test)

# show results
print(pd.DataFrame.from_dict(model_metrics))

As we can see from the output table, all three models perform poorly at predicting churn. This is probably the result of the imbalanced dataset. Our main interest lies in the recall value. In this respect, the Light GBM model does the best job with a recall of around 56%.

We want to make sure to detect as many customers as possible that are about to churn even if we have some customers in the same group that won't churn. Therefore the recall metric is a good choice for model evaluation. Looking at the values these basic models are not sufficient.

## 6.2 Oversampling

At this point we will address the problem of the imbalanced dataset by oversampling the underrepresented output class. We will use SMOTE (Synthetic Minority Oversampling Technique) as implemented in the imbalanced-learn library. The SMOTE class acts like a data transform object from scikit-learn in that it must be defined and configured, fit on a dataset, then applied to create a new transformed version of the dataset. It does not simply duplicate examples but rather synthesizes new ones from the minority class.

In detail, the process begins by randomly selecting an example from the minority class. Then, SMOTE identifies its k nearest neighbors from the same class where one neighbor is randomly chosen. Next, a synthetic example is created by combining the selected example and the chosen neighbor. It does this by drawing a line between the two examples in the feature space. Along this line, a new sample is generated at a randomly selected point. The whole process is repeated multiple times.
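
The steps above can be sketched in a few lines of NumPy. This is a simplified illustration of the interpolation idea, not the imbalanced-learn implementation; the function name `smote_sketch` and its parameters are made up for this example, and the pairwise distance matrix makes it suitable for small inputs only.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic rows from the minority samples X_min (2D array)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                    # 1. pick a minority example
        neigh = neighbors[base, rng.integers(k)]  # 2. pick one of its k neighbors
        gap = rng.random()                        # 3. random position on the line
        synthetic[i] = X_min[base] + gap * (X_min[neigh] - X_min[base])
    return synthetic

# Four minority points on the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(X_min, n_new=6, k=2)
print(new_points)
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples stay within the region already covered by the minority class.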

In [None]:
# create the SMOTE oversampler
oversampler = SMOTE(random_state=42)

# perform oversampling on the cleaned dataset
X_resampled, y_resampled = oversampler.fit_resample(X, y)

print("Class distribution before oversampling:")
print(y.value_counts())

print("Class distribution after oversampling:")
print(pd.Series(y_resampled).value_counts())

# split data into training and test sets again
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

Our class distribution is now balanced. Let's fit the models again using the new data.
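
One point to be aware of: because synthetic rows are derived from existing rows, oversampling before the train/test split means that test rows can be built from rows that end up in the training set, which may make the test scores look optimistic. A common alternative is to split first and oversample only the training part. The sketch below shows that order of operations on synthetic data; it uses plain random oversampling via `sklearn.utils.resample` as a stand-in, and a SMOTE `fit_resample` call on the training split would take the same place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 27% positives), not the Telco data
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.73, 0.27],
                                   random_state=42)

# 1. Split first, keeping the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          stratify=y_toy, random_state=42)

# 2. Oversample the minority class in the training part only
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority),
                       random_state=42)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.array([0] * len(majority) + [1] * len(minority_up))

print("Training class counts:", np.bincount(y_tr_bal))
print("Test class counts (untouched):", np.bincount(y_te))
```

The test set keeps the original class distribution, so the reported metrics reflect how the model would behave on real, imbalanced data.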

In [None]:
# fit oversampled models
logreg_classifier = fit_logreg('LR (Oversampled)', X_train, y_train, X_test, y_test)
rf_classifier = fit_randomforest('RF (Oversampled)', X_train, y_train, X_test, y_test)
gmb_classifier = fit_lightgbm('GBM (Oversampled)', X_train, y_train, X_test, y_test)

# show results
print(pd.DataFrame.from_dict(model_metrics))

Our evaluation metrics are much better now. All three models have a recall value of roughly 85% which can be seen as pretty good. Both tree models perform nearly the same considering the metrics. Interestingly, even the simple logistic regression model does a good job and even has the highest recall value at this point.

## 6.3 Feature Importance

Let's see if we can improve model performance even more by dropping some of most irrelevant features from our models. In order to do this, we will inspect the built-in feature importance property of our models that ranks all features according to their importance for the model.

### 6.3.1 Logistic Regression

In [None]:
# Get feature importance (absolute value of the coefficients)
feature_importance = abs(logreg_classifier.coef_[0])

# Sort in descending order
indices = feature_importance.argsort()[::-1]

# Print feature ranking, pairing each score with its own column name
for rank, idx in enumerate(indices, start=1):
    print(f"{rank}. Feature '{X_train.columns[idx]}' ({feature_importance[idx]})")

### 6.3.2 Random Forest

In [None]:
# Get the built-in feature importance of the random forest
feature_importance = rf_classifier.feature_importances_

# Sort feature importance in descending order
indices = feature_importance.argsort()[::-1]

# Print feature ranking, pairing each score with its own column name
for rank, idx in enumerate(indices, start=1):
    print(f"{rank}. Feature '{X_train.columns[idx]}' ({feature_importance[idx]})")

### 6.3.3 Light GBM

In [None]:
# Get the built-in feature importance of the Light GBM model
feature_importance = gmb_classifier.feature_importances_

# Sort feature importance in descending order
indices = feature_importance.argsort()[::-1]

# Print feature ranking, pairing each score with its own column name
for rank, idx in enumerate(indices, start=1):
    print(f"{rank}. Feature '{X_train.columns[idx]}' ({feature_importance[idx]})")

Even though the absolute value of feature importance differs from model to model, the ranking is basically the same. For all three models, the numerical features are the most important whereas the payment method for example does not seem to play an important role for predicting churn. This means all our numerical features that were correlated with each other are important for the model which is why we won't drop any of them. 

To investigate a potential model improvement and since the ranking is the same for all three models we will try including only the first 20 features for experimental purpose. Here, we will also include feature number 21 'TechSupport_2' because it originates from the same feature as 'TechSupport_1'. 

In [None]:
columns_to_drop = ['StreamingTV_1', 'StreamingTV_2', 'StreamingMovies_1', 'StreamingMovies_2', 
                   'Contract_1', 'Contract_2','PaperlessBilling_1', 'PaymentMethod_1', 'PaymentMethod_2']

X_feature_selected = X_resampled.drop(columns_to_drop, axis=1)
y_feature_selected = y_resampled

# split data into training and test sets again
X_train, X_test, y_train, y_test = train_test_split(X_feature_selected, y_resampled, test_size=0.2, random_state=42)

In [None]:
# fit feature selected models
logreg_classifier = fit_logreg('LR (Feature Sel. 10)', X_train, y_train, X_test, y_test)
rf_classifier = fit_randomforest('RF (Feature Sel. 10)', X_train, y_train, X_test, y_test)
gmb_classifier = fit_lightgbm('GBM (Feature Sel. 10)', X_train, y_train, X_test, y_test)

# show results
print(pd.DataFrame.from_dict(model_metrics))

Dropping these low-ranked features did not have the effect we hoped for. The recall for Logistic Regression and Random Forest is now even slightly worse than before. Only the Light GBM model improved a little. Feature selection is apparently not that simple: there may be hidden relationships between features that we should not remove, and each model reacts a little differently.

Since the noticed differences were very small overall, we will not follow up any further on this even if a more sophisticated approach might work better here, e.g. a k-fold cross-validation to find the optimal subset of features.
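
For reference, a cross-validated search for a feature subset could look roughly like the following sketch. It uses scikit-learn's `RFECV` (recursive feature elimination with cross-validation) on synthetic data; this is one possible approach, not the one used in this project, and the dataset parameters are made up.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only some of which are informative
X_toy, y_toy = make_classification(n_samples=500, n_features=10,
                                   n_informative=4, random_state=42)

# Remove one feature per step and score each subset with 5-fold CV on recall
selector = RFECV(estimator=LogisticRegression(max_iter=1000),
                 step=1, cv=5, scoring='recall')
selector.fit(X_toy, y_toy)

print("Number of selected features:", selector.n_features_)
print("Selected feature mask:", selector.support_)
```

The boolean mask in `support_` can then be used to select the corresponding columns before fitting the final model.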

Before we proceed, we have to roll back our training and test data to the resampled version.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

## 6.4 Hyperparameter Tuning

Instead of proceeding with feature selection we will try to improve our model by performing a hyperparameter tuning.

For our logistic regression model we will tune the 'C' parameter, which controls the regularization strength and therefore helps prevent overfitting. For the random forest model we will tune typical tree parameters like the maximum depth of the generated trees or the minimum number of samples required at a leaf node. Finally, our Light GBM model also has typical tree-based parameters like the maximum depth but also some special ones like the learning rate.

The best models are selected using grid search with 3-fold cross-validation on the training set, with recall as the scoring metric. Each selected model is then evaluated on the test set.
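
The selection step can be sketched on synthetic data as follows. The example relies on the default `refit=True` of `GridSearchCV`, which means `best_estimator_` has already been refit on the full training data with the best parameters; the data and the parameter grid are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, not the Telco data
X_toy, y_toy = make_classification(n_samples=600, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=42)

# Cross-validated search over C, scored by recall on the validation folds
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {'C': [0.01, 0.1, 1, 10]}, scoring='recall', cv=3)
grid.fit(X_tr, y_tr)

# Already refit on all of X_tr with the best parameters (refit=True)
best_model = grid.best_estimator_

# The test set is only used once, for the final evaluation
test_recall = recall_score(y_te, best_model.predict(X_te))

print("Best parameters:", grid.best_params_)
print("Mean CV recall of best parameters:", grid.best_score_)
print("Test recall:", test_recall)
```

Because the best estimator is refit automatically, it can be returned and used directly without training a new model from `best_params_`.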

In [None]:
def fit_logreg_tuning(model_name, X_train, y_train, X_test, y_test):   
    logreg_classifier = LogisticRegression()
    
    # define hyperparameter grid
    param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
    grid_search = GridSearchCV(logreg_classifier, param_grid, scoring='recall', cv=3)
    grid_search.fit(X_train, y_train)
    
    # get best hyperparameters and model
    best_params = grid_search.best_params_
    best_model = grid_search.best_estimator_
    print("Best parameters for Logistic Regression:", best_params)
    
    # predict
    y_pred = best_model.predict(X_test)
    add_model_metrics(model_name, y_test, y_pred)
    
    return best_model, best_params
    
def fit_randomforest_tuning(model_name, X_train, y_train, X_test, y_test):   
    rf_classifier = RandomForestClassifier()
    
    # define hyperparameter grid
    param_grid = {
        'n_estimators': [100,200,300,400],
        'max_depth': [10, 13, 17, 20],
        'min_samples_leaf': [2,3,4],
        'min_samples_split': [2,5,10],
        'max_features': ['sqrt']  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
    }
    grid_search = GridSearchCV(rf_classifier, param_grid, scoring='recall', cv=3)
    grid_search.fit(X_train, y_train)
    
    # get best hyperparameters and model
    best_params = grid_search.best_params_
    best_model = grid_search.best_estimator_
    print("Best parameters for Random Forest:", best_params)
    
    # predict
    y_pred = best_model.predict(X_test)
    add_model_metrics(model_name, y_test, y_pred)
    
    # return the fitted best model, not the unfitted base classifier
    return best_model, best_params
    
def fit_lightgbm_tuning(model_name, X_train, y_train, X_test, y_test):   
    gmb_classifier = lgb.LGBMClassifier()
    
    # define hyperparameter grid
    param_grid = {
        'n_estimators': [500,600,700,800],
        'max_depth': [5, 10, 15],
        'learning_rate': [0.10, 0.15, 0.20],
        'min_child_weight': [1],
        'colsample_bytree': [0.5]
    }
    grid_search = GridSearchCV(estimator=gmb_classifier, param_grid=param_grid, scoring='recall', cv=3)
    grid_search.fit(X_train, y_train)
    
    # get best hyperparameters and score
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    print("Best parameters for Light GBM:", best_params)
    
    # train the LightGBM classifier with the best parameters
    best_lgbm = lgb.LGBMClassifier(**best_params)
    best_lgbm.fit(X_train, y_train)
    
    # predict
    y_pred = best_lgbm.predict(X_test)
    add_model_metrics(model_name, y_test, y_pred)
        
    # return the fitted best model, not the unfitted base classifier
    return best_lgbm, best_params

In [None]:
# fit hyperparameter tuned models
logreg_classifier, lr_best_params = fit_logreg_tuning('LR (Hyp. tuned)', X_train, y_train, X_test, y_test)
rf_classifier, rf_best_params = fit_randomforest_tuning('RF (Hyp. tuned)', X_train, y_train, X_test, y_test)
gmb_classifier, gmb_best_params = fit_lightgbm_tuning('GBM (Hyp. tuned)', X_train, y_train, X_test, y_test)

# show results
print(pd.DataFrame.from_dict(model_metrics))

The output shows the best selected parameters for each model. Overall, the hyperparameter tuning did not have a huge impact on the model scores. Some of the metrics even got worse. We will discuss these final results in the next part.

# 7. Results and Analysis

## 7.1 Evaluation Metrics

As already mentioned, in the context of our business the highest priority is to detect as many customers as possible who are about to churn, even if we also classify some customers as 'likely to churn' when that is not true (false positives). Therefore the recall metric is a good choice for model evaluation. Other possible options would be the balanced accuracy, which accounts for the actual number of positive and negative samples, or the F1 score, which is the harmonic mean of precision and recall.
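
For reference, the metrics mentioned above can be written in terms of the confusion matrix counts, where $TP$, $FP$, $TN$ and $FN$ denote true positives, false positives, true negatives and false negatives, with 'Churn' as the positive class:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
$$

$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad
\text{Balanced Accuracy} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)
$$

Recall only depends on $TP$ and $FN$, so it directly measures how many of the actual churners are missed, while false positives only affect precision and, through it, the $F_1$ score.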

Let's print the evaluation metrics of all of our models in a clean looking table format.

In [None]:
model_metrics_df = pd.DataFrame.from_dict(model_metrics)
model_metrics_df

The basic models yielded poor results due to the imbalanced dataset. Oversampling the minority output class resulted in a large improvement across all models, with logistic regression performing best in terms of recall. Next, we inspected feature importance and tried removing the least important features. Here, our models behaved quite differently: the performance of logistic regression and random forest became worse, whereas Light GBM improved slightly. Because these effects were mixed and negligible, we rolled back to the oversampled dataset and performed a final hyperparameter tuning. Compared to the values after oversampling, the recall of logistic regression and Light GBM did not change much, whereas the recall of random forest made the most significant jump, up to a value of around 87%.

## 7.2 Model Performance and Visualization

We can now plot a ranking of all our models according to their recall value.

In [None]:
# sort by recall value
model_metrics_df_sorted = model_metrics_df.sort_values(by='Recall', ascending=False)

# plot the ranking
plt.figure(figsize=(8, 6))
plt.barh(model_metrics_df_sorted['Model'], model_metrics_df_sorted['Recall'], color='b')
plt.xlabel('Recall')
plt.ylabel('Model')
plt.title('Ranking of Recall-Name Pairs')
plt.grid(True)
plt.show()

And we see our final and best model considering the recall metric would be Random Forest with the following set of parameters.

In [None]:
pd.DataFrame.from_dict(rf_best_params, orient='index', columns=['Value'])

# 8. Conclusion

As a final step in this project let's talk about what did work out well and where things could be improved.

## 8.1 Learnings and Takeaways

Running through a machine learning project from start to finish really shows how crucial preparation steps like data cleaning and preprocessing are for model building. As we have seen, imputing missing values can be very difficult depending on the underlying data distribution. There can also be complex and hidden relationships between the data and the different models (and parameters) that prevent certain improvement techniques from working, as we have seen with feature selection based on feature importance. Finally, imbalanced datasets occur very often in real-world classification tasks like this one, and we have to account for them, for example by performing some form of resampling. Additionally, we have to be aware that not all evaluation metrics fit the problem when the data is imbalanced.

## 8.2 What did not work

Imputing missing values for the total charges column did not work as hoped for since our data did not originate from the suspected exponential distribution. Another aspect that did not work as expected was the feature selection process during model building where we tried to remove seemingly irrelevant features. By doing so, we could not achieve significant performance improvement but made some models even worse than before.

## 8.3 Possible Improvements

Using advanced imputation techniques like regression imputation for handling missing values could result in a (slightly) better model performance. We could also apply more systematic techniques for feature selection, such as a cross-validated search for the best feature subset, instead of just removing the lowest-ranked features as we did during model building. This might also help with the problem of correlated features. If not, we should try out other methods like PCA or regularization. Of course, we should also check for potential correlation between the categorical features in the process.