<h1><center>Telco Customer Churn - Exploratory Data Analysis</center></h1>

# 1. Introduction



**1.1 Goal**

The aim of this notebook is to examine which customer groups are affected by a high churn rate. The churn rate represents the ratio of lost customers to total customers in a specific period of time.


**1.2 Relevance**

According to experts, the cost of acquiring new customers is up to five times higher than keeping existing customers. Customer loyalty is therefore a central goal of a sustainable business strategy. An important element of this strategy is the prevention of customer churn. In the digital age, this is more true than ever, as offers can be compared very easily.

In the telecommunication sector, customer churn is one of the biggest problems. Vodafone, for example, had a churn rate of 12% in Germany, 24% in Italy, 26% in the UK and even 28% in Spain in the fourth quarter of the 2020/21 financial year (source: https://www.statista.com/statistics/972046/vodafone-churn-rate-european-countries/). It is therefore important for telecommunication companies to analyze relevant customer data (the goal of this notebook) and, based on this, to develop a robust churn prediction model (the goal of the following notebook) in order to retain customers and develop strategies to reduce the churn rate.


**1.3 Research Question**

Which customer groups show an above-average churn rate?

# 2. Imports

**2.1 Libraries**

To analyse the data, we first need to import the required libraries:

In [None]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import seaborn as sns
import warnings

**2.2 Colors**

Customized colors for plotting are defined:

In [None]:
# definition of colors
custom_colors=['#c14953','#d96548','#f2a553','#f3c969','#98e2c6', '#86c1b2', '#74a09e']
# sns.set_palette() returns None, so the palette object is built separately
customPalette = sns.color_palette(custom_colors)
sns.set_palette(customPalette)

**2.3 Data**

I merged the following two datasets into one to get additional features (such as satisfaction score, total revenue and CLTV):

https://www.kaggle.com/blastchar/telco-customer-churn ("../input/d/blastchar/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
https://www.kaggle.com/ylchang/telco-customer-churn-1113 ("../input/telco-customer-churn-1113")

In [None]:
# load data
df = pd.read_csv("../input/telco-customer-churn/Telco_customer_churn.csv")

We use the head() function to show the first 5 rows:

In [None]:
df.head()

...and the info() function to show data types:

In [None]:
df.info()


# 3. Data Preparation

**3.1 Data types, missing values and data cleaning**

Total charges are stored with data type "object" and have to be converted to "float" so that calculations can be made later on.

There are eleven customers with missing values for total charges. A closer look at the data reveals that these eleven customers also have a tenure of zero. So, it can be assumed that these are new customers who have not yet incurred any fees. Missing values are therefore imputed with zero. 

Satisfaction Score should be a categorical feature (5 categories). So data type "integer" is converted into data type "object".

Senior citizen is a categorical feature as well, so its values 0 (= no) and 1 (= yes) are converted to the strings "0" and "1", which gives this feature the data type "object" too.

In [None]:
# convert data types and impute missing values with zero
df["TotalCharges"] = df["TotalCharges"].replace(" ", 0).astype("float32")
df["Satisfaction Score"] = df["Satisfaction Score"].astype("object")
df["SeniorCitizen"] = df["SeniorCitizen"].replace(0, "0").replace(1, "1") 
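
The assumption that blank total charges belong to new customers can be checked before imputing. The following is a minimal sketch on made-up toy data (the frame `toy` and its values are hypothetical, not taken from the dataset); it uses `pd.to_numeric` with `errors="coerce"` as an alternative to replacing the blank string directly.

```python
import pandas as pd

# Toy data standing in for the real columns (values are made up)
toy = pd.DataFrame({
    "tenure": [0, 12, 0, 5],
    "TotalCharges": [" ", "350.5", " ", "99.9"],
})

# Blank strings cannot be parsed, so errors="coerce" turns them into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Check that every missing value belongs to a customer with tenure 0
missing = toy["TotalCharges"].isna()
all_new_customers = bool((toy.loc[missing, "tenure"] == 0).all())

# Impute with zero only after the check above has been inspected
toy["TotalCharges"] = toy["TotalCharges"].fillna(0.0)
print(all_new_customers)
print(toy)
```

Coercing first keeps the missing values visible as NaN, so the tenure check can be run before they are overwritten.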

Columns which are not needed are removed:

In [None]:
# remove columns which are not needed
df = df.drop("Churn Score.1", axis=1)
df = df.drop("Churn Score", axis=1)

We use the info() function again to show the remaining variables including their data types:

In [None]:
df.info()

We can save the cleaned data set as csv file (##### = file path):

In [None]:
#df.to_csv (r'/#####/Telco_customer_churn_cleaned.csv', index = None, header=True)

To simplify the code later on, columns/variables are written like this:

In [None]:
# variable definitions
customer_id = df["customerID"]
gender = df["gender"]
senior_citizen = df["SeniorCitizen"]
partner = df["Partner"]
dependents = df["Dependents"]
tenure = df["tenure"]
phone_service = df["PhoneService"]
multiple_lines = df["MultipleLines"]
internet_service = df["InternetService"]
online_security = df["OnlineSecurity"]
online_backup = df["OnlineBackup"]
device_protection = df["DeviceProtection"]
tech_support = df["TechSupport"]
streaming_tv = df["StreamingTV"]
streaming_movies = df["StreamingMovies"]
contract = df["Contract"]
paperless_billing = df["PaperlessBilling"]
payment_method = df["PaymentMethod"]
monthly_charges = df["MonthlyCharges"]
total_charges = df["TotalCharges"]
churn = df["Churn"] #churn yes/no
churn_rate = df["churn_rate"] #churn 1/0
cltv = df["CLTV"]
churn_reason = df["Churn Reason"]
country = df["Country"]
state = df["State"]
city = df["City"]
zip_code = df["Zip Code"]
lat_long = df["Lat Long"]
latitude = df["Latitude"]
longitude = df["Longitude"]
age = df["Age"]
married = df["Married"]
referred_a_friend = df["Referred a Friend"]
number_of_referrals = df["Number of Referrals"]
offer = df["Offer"]
avg_monthly_long_distance_charges = df["Avg Monthly Long Distance Charges"]
avg_monthly_gb_download = df["Avg Monthly GB Download"]
streaming_music = df["Streaming Music"]
premium_tech_support = df["Premium Tech Support"]
unlimited_data = df["Unlimited Data"]
total_refunds = df["Total Refunds"]
total_extra_data_charges = df["Total Extra Data Charges"]
total_long_distance_charges = df["Total Long Distance Charges"]
total_revenue = df["Total Revenue"]
satisfaction_score = df["Satisfaction Score"]
customer_status = df["Customer Status"]
churn_category = df["Churn Category"]

**3.2 Split Features**

Features are split into numeric and categorical features. Shape and head are shown for each of the two groups.

In [None]:
# numeric features
num_features = df[["tenure", "MonthlyCharges", "TotalCharges", 
                 "CLTV", "Total Revenue"]]

In [None]:
num_features.shape

In [None]:
num_features.head()

In [None]:
cat_features = df[["gender", "SeniorCitizen", "Partner", "Dependents",
                   "PhoneService", "MultipleLines", "InternetService",
                   "OnlineSecurity", "OnlineBackup", "DeviceProtection",
                   "TechSupport", "StreamingTV", "StreamingMovies", "Contract",
                   "PaperlessBilling", "PaymentMethod","Satisfaction Score"]]

In [None]:
cat_features.shape

In [None]:
cat_features.head()
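
As an alternative to listing the columns by hand, features can be split by data type. This is a minimal sketch on a hypothetical toy frame (column values are made up); it only works as intended once categorical columns such as SeniorCitizen have been converted to a non-numeric type.

```python
import pandas as pd

# Toy frame with two numeric and two categorical columns (made-up values)
toy = pd.DataFrame({
    "tenure": [1, 24, 60],
    "MonthlyCharges": [29.85, 56.95, 104.8],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "SeniorCitizen": ["0", "1", "0"],
})

# select_dtypes splits the columns by their data type
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(num_cols)
print(cat_cols)
```

Listing the columns explicitly, as done above in the notebook, remains the safer choice when identifier or geographic columns should be left out.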

To suppress warnings we run the following code:

In [None]:
warnings.filterwarnings("ignore")

**3.3 Outlier Detection**

There are different statistical methods for identifying outliers. The IQR rule is used here. The IQR (interquartile range) is defined as the difference between the upper quartile (Q3) and the lower quartile (Q1). With this common rule, a value is treated as an outlier if it lies more than 1.5 * IQR above the upper quartile (Q3) or more than 1.5 * IQR below the lower quartile (Q1). Lower outliers are thus below Q1 - 1.5 * IQR, upper outliers above Q3 + 1.5 * IQR. Box plots display outliers graphically.

In [None]:
# outlier detection
def boxplot(feature):
    plt.figure(figsize=(6,1))
    # pass the series explicitly as x to get a horizontal box plot
    ax = sns.boxplot(x=feature, width=0.3, whis=1.5, color="#f3c969")
    ax.xaxis.labelpad=10
boxplot(tenure)
boxplot(monthly_charges)
boxplot(total_charges)
boxplot(cltv)
boxplot(total_revenue) # contains outliers --> use Standard Scaler

For the feature "Total Revenue", 20 values lie above the upper fence (Q3 + 1.5 * IQR). Even though these outliers are not extreme, we have to be careful later on when building the churn prediction models (not part of this notebook). For feature engineering, the StandardScaler is the better choice here, since it is less distorted by outliers than the MinMaxScaler, whose range is set directly by the most extreme values.
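
The number of values outside the fences can also be checked numerically rather than read off the box plot. The following is a minimal sketch of the IQR rule; the helper `iqr_outliers` and the series `toy_revenue` are hypothetical and the values are made up.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the lower fence, the upper fence and the values outside them."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return lower, upper, values[mask]

# Toy series with one obviously extreme value (made-up numbers)
toy_revenue = pd.Series([10, 12, 13, 14, 15, 16, 18, 100], dtype="float64")
lower, upper, outliers = iqr_outliers(toy_revenue)
print(lower, upper, len(outliers))
```

Applied to the notebook's data, a call such as `iqr_outliers(df["Total Revenue"])` would return the fences and the affected rows.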

# 4. EDA

**4.1 Target Variable**

What is the overall churn rate?

In [None]:
# target variable - churn rate
plt.figure(figsize=(6,6))
plt.pie(df["Churn"].value_counts(),shadow=False,startangle=90,
        labels=df["Churn"].value_counts().index,autopct='%0.1f%%',
        explode=(0,0.05),colors=['#74a09e','#c14953'])
plt.title('Churn Rate')
plt.show()

The overall churn rate is 26.5%. 
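
The share shown in the pie chart can also be computed directly. This is a minimal sketch on a made-up toy series (`toy_churn` is hypothetical and does not reproduce the figure above).

```python
import pandas as pd

# Toy target column: 2 of 8 customers churned (made-up values)
toy_churn = pd.Series(["No", "Yes", "No", "No", "Yes", "No", "No", "No"])

# normalize=True returns shares instead of absolute counts
shares = toy_churn.value_counts(normalize=True)
churn_rate = shares["Yes"]

# Equivalent: the mean of a boolean mask is the share of True values
churn_rate_alt = (toy_churn == "Yes").mean()
print(f"{churn_rate:.1%}")
```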

So let's look at different customer groups. Which groups have a higher probability to churn?

**4.2 Numeric Features**

For the numeric features, kernel density estimate (KDE) functions are calculated and plotted.

In [None]:
# KDEplots for numeric features
def kdeplot(feature):
    plt.figure(figsize=(9,2))
    plt.title("KDE for {}".format(feature))
    # plt.tight_layout(pad=1.2)
    # fill=True replaces the deprecated shade argument (seaborn >= 0.11)
    ax_kde = sns.kdeplot(x=df[df['Churn'] == 'No'][feature].dropna(), color='#74a09e', label='Churn: No', fill=True)
    ax_kde = sns.kdeplot(x=df[df['Churn'] == 'Yes'][feature].dropna(), color='#c14953', label='Churn: Yes', fill=True)
    ax_kde.yaxis.labelpad=10
    ax_kde.xaxis.labelpad=10
    ax_kde.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.4f}'))
    ax_kde.legend()  # show which curve belongs to which churn group
    sns.despine(left=True)
kdeplot("tenure")
kdeplot("MonthlyCharges")
kdeplot("TotalCharges")
kdeplot("CLTV")
kdeplot("Total Revenue")

The following conclusions can be drawn:
* New customers with a short tenure are more likely to switch.
* Customers who pay higher monthly fees are also more likely to churn.
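
The visual impression from the density plots can be backed up with group-wise summary statistics. This is a minimal sketch on a hypothetical toy frame (all values are made up), comparing the medians of churned and retained customers.

```python
import pandas as pd

# Toy data: churned customers have short tenure and high fees (made-up values)
toy = pd.DataFrame({
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "tenure": [1, 3, 8, 24, 48, 60],
    "MonthlyCharges": [95.0, 80.0, 90.0, 40.0, 65.0, 55.0],
})

# Median of each numeric feature per churn group
medians = toy.groupby("Churn")[["tenure", "MonthlyCharges"]].median()
print(medians)
```

The median is used here instead of the mean because it is less affected by skewed distributions such as tenure.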

**4.3 Categorical Features**

For the categorical features barplots show the differences between target groups.

In [None]:
# barplots for categorical features

def barplot(feature, ylim=0.5, xticklabels=None, rotation=0):
    """Plot the mean churn rate per category of `feature`."""
    plt.figure()
    # errorbar=None hides the error bars (seaborn >= 0.12; use ci=None on older versions)
    ax_bar = sns.barplot(x=feature, y="churn_rate", data=df, errorbar=None,
                         palette=custom_colors, saturation=0.75)
    # annotate each bar with its churn rate in percent
    for p in ax_bar.patches:
        ax_bar.annotate(format(p.get_height(), '.1%'),
                        (p.get_x() + p.get_width() / 2., p.get_height()),
                        ha='center', va='center',
                        xytext=(0, 9),
                        textcoords='offset points')
    ax_bar.yaxis.labelpad = 10
    ax_bar.xaxis.labelpad = 10
    if xticklabels is not None:
        ax_bar.set_xticklabels(xticklabels)
    plt.setp(ax_bar.get_xticklabels(), rotation=rotation)
    sns.despine(left=True)
    ax_bar.set_ylim(0, ylim)
    ax_bar.yaxis.set_ticks([])  # the bar annotations replace the y-axis ticks

for feature in ["gender", "Partner", "Dependents", "PhoneService",
                "MultipleLines", "InternetService", "OnlineSecurity",
                "OnlineBackup", "DeviceProtection", "TechSupport",
                "StreamingTV", "StreamingMovies", "Contract",
                "PaperlessBilling"]:
    barplot(feature)

# senior citizen ("0"/"1" relabelled as No/Yes)
barplot("SeniorCitizen", xticklabels=["No", "Yes"])

# payment method (long labels are rotated)
barplot("PaymentMethod", rotation=30)

# satisfaction score (churn rates reach 100%, so the y-axis limit is raised)
barplot("Satisfaction Score", ylim=1.2, rotation=30)

The following customer groups have an above-average probability to churn:
* customers without partners, customers without dependents and senior citizens
* customers with fiber optic internet service
* customers with short-term contracts (month-to-month)
* customers with paperless billing
* customers who pay by electronic check
* customers with a low satisfaction score (1 or 2)

In contrast, customers with additional security services or additional technical support are less likely to churn.
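
The research question asks for groups with an above-average churn rate, which can be answered by comparing each group's rate with the overall rate. This is a minimal sketch on a hypothetical toy frame (values are made up); it assumes a 0/1 column named `churn_rate`, as used in the bar plots.

```python
import pandas as pd

# Toy data: contract type and a 0/1 churn indicator (made-up values)
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 2 + ["Two year"] * 2,
    "churn_rate": [1, 1, 0, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 indicator is the churn rate
overall = toy["churn_rate"].mean()
by_group = toy.groupby("Contract")["churn_rate"].mean()

# Keep only the groups whose rate is strictly above the overall rate
above_average = by_group[by_group > overall].index.tolist()
print(overall)
print(by_group)
print(above_average)
```

Running the same comparison in a loop over all categorical features would give a compact table of the groups listed above.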


# 5. Outlook

By taking appropriate measures for the identified target groups (e.g. offering additional services, optimized sales and resale processes, individualized approach, target group-specific advertising, etc.), customer loyalty can be improved and customer churn can be reduced. 

One of the most powerful tools to avoid churn is churn prediction. Based on historical data, a machine learning model can be developed to predict future churn. However, the exact calculation of the churn rate alone is not decisive.

Not all customers who are at risk of churning are equally worth the effort of retaining. To identify a company's best customers, customer segmentation is necessary.

The segmentation results can then be used to retain the best customers with offers that are tailored to their needs (high product market fit). 
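
One simple way to start such a segmentation is to bin customers by value. The following is only a sketch under assumptions: it uses quantile bins on CLTV, the column name `value_segment` and all values are hypothetical, and a real segmentation may well rely on a different method and more features.

```python
import pandas as pd

# Toy data: customer lifetime value and a 0/1 churn indicator (made-up values)
toy = pd.DataFrame({
    "CLTV": [2000, 2500, 3000, 4000, 4500, 5000, 5500, 6000, 6500],
    "churn_rate": [1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Split customers into three equally sized value segments
toy["value_segment"] = pd.qcut(toy["CLTV"], q=3, labels=["low", "mid", "high"])

# Churn rate within each segment
segment_churn = toy.groupby("value_segment", observed=True)["churn_rate"].mean()
print(segment_churn)
```

Combining such value segments with a predicted churn probability would show which at-risk customers are most worth a retention offer.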