# Import libraries

In [None]:
import pandas as pd
import numpy as np

import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("ticks")

In [None]:
from shutil import copyfile

copyfile(src = "../usr/lib/telco_data_cleaning_pipeline/telco_data_cleaning_pipeline.py",
         dst = "../working/telco_data_cleaning_pipeline.py")

In [None]:
from telco_data_cleaning_pipeline import *

# Load data

In [None]:
df = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
df.head()

# Clean data

The `customerID` column is useless for our analysis, so let's drop it.

In [None]:
cleaned_df = (
    df.pipe(start_pipeline)
    .pipe(drop_noisy_columns, cols=["customerID"])
    .pipe(replace_empty_strings_with_nan)
)
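
The helper functions above come from the `telco_data_cleaning_pipeline` module, whose source is not shown in this notebook. As a rough sketch only, helpers with these names could look like the following. The function bodies and the toy data are assumptions made for illustration, not the module's actual code.

```python
import numpy as np
import pandas as pd


def start_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    # Assumed behavior: work on a copy so the original frame is never mutated.
    return df.copy()


def drop_noisy_columns(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    # Assumed behavior: remove columns that carry no useful information.
    return df.drop(columns=cols)


def replace_empty_strings_with_nan(df: pd.DataFrame) -> pd.DataFrame:
    # Assumed behavior: treat empty or whitespace-only strings as missing.
    return df.replace(r"^\s*$", np.nan, regex=True)


# Made-up rows that mimic the raw file: the second TotalCharges is blank.
toy_df = pd.DataFrame({"customerID": ["a", "b"], "TotalCharges": ["29.85", " "]})

toy_cleaned = (
    toy_df.pipe(start_pipeline)
    .pipe(drop_noisy_columns, cols=["customerID"])
    .pipe(replace_empty_strings_with_nan)
)
```

Each step takes a DataFrame and returns a new one, which is what lets `DataFrame.pipe` chain them in reading order.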

In [None]:
cleaned_df.head()

Let's investigate the missing values:

In [None]:
cleaned_df.isna().sum() / len(cleaned_df)

We can see that the percentage of missing values in the `TotalCharges` column is less than 1%, so we can drop all rows with missing values.
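
As a small illustration of this check, the fraction of missing values per column can also be computed with `isna().mean()`, which is equivalent to dividing the count by the number of rows. The toy frame below is made up for the example and is not the Telco data.

```python
import numpy as np
import pandas as pd

# Made-up data: one of four TotalCharges values is missing.
toy_df = pd.DataFrame(
    {
        "tenure": [1, 0, 24, 60],
        "TotalCharges": ["29.85", np.nan, "1889.5", "4500.0"],
    }
)

# Fraction of missing values per column.
missing_fraction = toy_df.isna().mean()

# Dropping rows with any missing value, and counting how many were lost.
rows_before = len(toy_df)
toy_df = toy_df.dropna()
rows_dropped = rows_before - len(toy_df)
```

Checking `rows_dropped` after the drop is a cheap way to confirm that only a handful of rows were removed.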

In [None]:
cleaned_df = cleaned_df.pipe(drop_missing_values)

In [None]:
cleaned_df.head()

Now we can convert each column to the appropriate `dtype`.

In [None]:
cleaned_df = cleaned_df.pipe(
    convert_column_dtypes, {"SeniorCitizen": "str", "TotalCharges": float}
)
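
The source of `convert_column_dtypes` is not shown here. Assuming it is a thin wrapper around `DataFrame.astype` (an assumption), the conversion could be sketched as follows on made-up data.

```python
import pandas as pd


def convert_column_dtypes(df: pd.DataFrame, dtypes_mapping: dict) -> pd.DataFrame:
    # Hypothetical stand-in for the pipeline helper: delegate to DataFrame.astype.
    return df.astype(dtypes_mapping)


toy_df = pd.DataFrame({"SeniorCitizen": [0, 1], "TotalCharges": ["29.85", "1889.5"]})

# The built-in float is used on purpose: the old np.float alias was
# removed in NumPy 1.24, so passing it raises an AttributeError there.
toy_converted = toy_df.pipe(
    convert_column_dtypes, {"SeniorCitizen": "str", "TotalCharges": float}
)
```

Note that `astype(float)` only works once the blank strings have been replaced and dropped; a remaining `" "` value would raise a `ValueError`.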

In [None]:
cleaned_df.head()

In [None]:
cleaned_df.dtypes

Our final cleaning and processing **pipeline** should be:
- Drop the noisy column `customerID`.
- Replace empty strings with `NaN`.
- Drop all rows with missing values.
- Map the `SeniorCitizen` values `0` and `1` to `"No"` and `"Yes"`.
- Convert `TotalCharges` to `float`.

In [None]:
cleaned_df = (
    df.pipe(start_pipeline)
    .pipe(drop_noisy_columns, cols=["customerID"])
    .pipe(replace_empty_strings_with_nan)
    .pipe(drop_missing_values)
    .pipe(map_column_values, col="SeniorCitizen", mapping_dict={0: "No", 1: "Yes"})
    .pipe(convert_column_dtypes, dtypes_mapping={"TotalCharges": float})
)
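
The final pipeline maps `SeniorCitizen` to `"Yes"`/`"No"` labels instead of casting it to a string, which keeps it consistent with the other binary columns. The helper's source is not shown, so the version below is only a guess at what `map_column_values` might do, built on `Series.map`.

```python
import pandas as pd


def map_column_values(df: pd.DataFrame, col: str, mapping_dict: dict) -> pd.DataFrame:
    # Hypothetical stand-in: replace each value in `col` using the mapping.
    # Values missing from the mapping would become NaN with Series.map.
    out = df.copy()
    out[col] = out[col].map(mapping_dict)
    return out


toy_df = pd.DataFrame({"SeniorCitizen": [0, 1, 0]})

toy_mapped = toy_df.pipe(
    map_column_values, col="SeniorCitizen", mapping_dict={0: "No", 1: "Yes"}
)
```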

In [None]:
cleaned_df.head()

# Exploratory Data Analysis

In this section we'll explore the variables in this dataset to understand their types, how they interact with each other, and how the *predictor* variables relate to the *target* variable.

According to the dataset author, the variables can be divided into three categories:
- Customer demographic variables: demographic attributes of the customer.
- Customer account variables: variables related to the customer account, such as the payment method, contract type, etc.
- Customer services variables: information about the services the customer is using.

In [None]:
demographic_cols = [
    "gender",
    "SeniorCitizen",
    "Partner",
    "Dependents",
]

In [None]:
account_cols = [
    "tenure",
    "Contract",
    "PaymentMethod",
    "PaperlessBilling",
    "MonthlyCharges",
    "TotalCharges",
]

In [None]:
services_cols = [
    "PhoneService",
    "MultipleLines",
    "InternetService",
    "OnlineSecurity",
    "OnlineBackup",
    "DeviceProtection",
    "TechSupport",
    "StreamingTV",
    "StreamingMovies",
]

In [None]:
internet_service_cols = [
    "OnlineSecurity",
    "OnlineBackup",
    "DeviceProtection",
    "TechSupport",
    "StreamingTV",
    "StreamingMovies",
]

In [None]:
color_map = {"Yes": "#ef553b", "No": "#636efa"}

## Target variable

We'll start our analysis with the target variable, `Churn`, to see how common it is for customers to leave the company.

In [None]:
fig = px.histogram(
    data_frame=cleaned_df, x="Churn", color="Churn", color_discrete_map=color_map,
)

fig.show()

This chart shows that the data is *slightly* imbalanced: the number of *churned* customers is smaller than the number of customers who stayed.
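
To put a number on the imbalance, the class shares can be read from `value_counts(normalize=True)`. The series below is made up to keep the example self-contained; on the real data the same call would be applied to `cleaned_df.Churn`.

```python
import pandas as pd

# Made-up target column: three customers stayed, one churned.
toy_churn = pd.Series(["No", "No", "No", "Yes"], name="Churn")

# Share of each class, as fractions that sum to 1.
class_share = toy_churn.value_counts(normalize=True)
```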

## Demographic attributes

In this section, we'll focus on the demographic variables related to the customer, to understand how these variables relate to the customer leaving the company or not.

### `gender`

Does the customer's gender have any effect on churning?

In [None]:
fig = px.histogram(
    data_frame=cleaned_df, x="gender", color="Churn", color_discrete_map=color_map
)

fig.update_xaxes(categoryorder="total descending")

fig.show()

The customer's gender isn't predictive of churning: males and females are about equally likely to leave (or stay with) the company.

### `SeniorCitizen`

*Senior citizens* are typically retired people above the age of 60 or 65.

Let's see whether *older* customers tend to stay with the company or not.

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="SeniorCitizen",
    color="Churn",
    color_discrete_map=color_map,
)

fig.show()

We can see that *senior* customers represent a very small fraction of the total customers, and the raw counts give no strong indication that they would leave the company.
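
Because the bars show raw counts, groups of very different sizes are hard to compare by eye. A row-normalized cross-tabulation gives the churn rate within each group directly. The sketch below uses made-up values, not the Telco data, so its numbers say nothing about the real customers.

```python
import pandas as pd

# Made-up data: four non-senior and two senior customers.
toy_df = pd.DataFrame(
    {
        "SeniorCitizen": ["No", "No", "No", "No", "Yes", "Yes"],
        "Churn": ["No", "No", "No", "Yes", "No", "Yes"],
    }
)

# normalize="index" makes every row sum to 1, so the "Yes" column
# is the churn rate inside each SeniorCitizen group.
churn_rate = pd.crosstab(
    toy_df["SeniorCitizen"], toy_df["Churn"], normalize="index"
)
```

The same call with `cleaned_df` and any categorical column (for example `Partner` or `Dependents`) would give a rate that does not depend on group size.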

### `Partner`

The `Partner` variable states whether the customer has a partner or not. Does this variable affect the customer's decision to leave the company?

In [None]:
fig = px.histogram(
    data_frame=cleaned_df, x="Partner", color="Churn", color_discrete_map=color_map
)

fig.update_xaxes(categoryorder="total descending")

fig.show()

From the chart we can see that the `Partner` variable is not a predictive variable of churning.

### `Dependents`

The `Dependents` variable states whether the customer has dependents or not. A dependent is a person who relies on another as a primary source of income, for example, a child.

Are customers who have dependents (people who need their financial support) more likely to leave the company, perhaps to find another company with lower costs?

In [None]:
fig = px.histogram(
    data_frame=cleaned_df, x="Dependents", color="Churn", color_discrete_map=color_map
)

fig.show()

The chart shows that this variable has no influence on leaving the company.

## Account attributes

This section is devoted to exploring the customer's account information: how long they've been with the company, their monthly and total charges, contract type, and payment method.

### `tenure`

The `tenure` variable is a discrete numerical variable, representing the number of **months** the customer has stayed with the company.

This is an important variable, as it should give us insight into how the *customer churn rate* changes with *customer tenure*.

In [None]:
cleaned_df.tenure.head()

In [None]:
cleaned_df.tenure.describe()

In [None]:
cleaned_df.tenure.skew()

First, let's plot the `tenure` variable alone to see its distribution.

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="tenure",
    marginal="box",
    nbins=30,
    title="Customer tenure distribution",
)

fig.show()

The variable has a positive skew, and many points fall below the 10-month mark. So we can say that a considerable portion of the customers are *new* customers, who have been with this company for about a year or less.
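
A positive skew means the right tail is longer, which usually pulls the mean above the median. The sketch below illustrates this on a small made-up sample (not the actual `tenure` column).

```python
import pandas as pd

# Made-up tenures in months: many short ones and a few very long ones.
toy_tenure = pd.Series([1, 1, 2, 2, 3, 4, 6, 10, 24, 72])

skewness = toy_tenure.skew()          # sample skewness, positive for a long right tail
mean_tenure = toy_tenure.mean()       # pulled up by the long-tenure customers
median_tenure = toy_tenure.median()   # stays close to the bulk of the data
```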

Now, let's plot this variable again, but this time highlighting the `Churn` variable with `tenure`.

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="tenure",
    color="Churn",
    barmode="group",
    marginal="box",
    color_discrete_map=color_map,
    nbins=30,
    title="Customer tenure distribution for <b>churning</b> and <b>non-churning</b> customers",
)

fig.show()

We can draw some important insights from this plot:
- New customers with about 1 year of tenure are the most likely to churn.
- The chart also shows that the higher the customer's tenure, the less likely they are to churn.
- There are some extreme points among the *churning* customers, who have a very high tenure (almost 6 years) and still left the company. These might be *outliers*, or this could be related to other factors.
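
One way to check the first two points numerically is to bin `tenure` and compute the churn rate per bin. The bin edges and the data below are assumptions chosen for illustration only.

```python
import pandas as pd

# Made-up customers, not the Telco data.
toy_df = pd.DataFrame(
    {
        "tenure": [1, 5, 12, 13, 30, 48, 60, 72],
        "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"],
    }
)

# Bin edges in months; each bin is closed on the right, e.g. (0, 12].
tenure_bins = pd.cut(toy_df["tenure"], bins=[0, 12, 24, 48, 72])

# The mean of a boolean series is the fraction of True values,
# so this is the churn rate inside each tenure bin.
churn_rate_by_tenure = (
    toy_df["Churn"].eq("Yes").groupby(tenure_bins, observed=False).mean()
)
```

On the real data, a rate that falls from the first bin to the last would support the reading of the histogram above.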

### `MonthlyCharges`

The `MonthlyCharges` variable is a continuous numerical variable, which represents the amount charged to the customer every month.

Analysing this variable should give us a *clue* about how customer payments affect the overall churn rate. It also lets us see how much customers pay every month on average, whether the payments follow a *normal distribution*, and more.

**Note**: The unit of monthly charges wasn't mentioned in the dataset description.

In [None]:
cleaned_df.MonthlyCharges.describe()

In [None]:
cleaned_df.MonthlyCharges.skew()

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="MonthlyCharges",
    marginal="box",
    color="Churn",
    barmode="group",
    color_discrete_map=color_map,
    nbins=20,
    title="Monthly charges distribution for <b>churning</b> and <b>non-churning</b> customers",
)

fig.show()

We can see from the chart that there are two peaks: about 500 customers pay between 10 and 20, and nearly 1000 customers pay between 20 and 30.

This large number of customers paying low fees might reflect initial fees paid upon signing a contract with the company.

Other than these peaks, we can also observe that higher monthly charges are linked with a higher churn rate.

The high charges could be explained by customers purchasing more *premium* services. The chart above suggests that these premium services could be causing customer dissatisfaction, which leads to leaving the company.
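
The link between charges and churn can also be summarized with a single number per group, for example the median monthly charge of churned and retained customers. The values below are invented for the sketch.

```python
import pandas as pd

# Made-up monthly charges, not the Telco data.
toy_df = pd.DataFrame(
    {
        "MonthlyCharges": [20.0, 25.0, 55.0, 80.0, 95.0, 100.0],
        "Churn": ["No", "No", "No", "Yes", "Yes", "Yes"],
    }
)

# Median charge per churn group; the median is less sensitive to
# the skew of the charges than the mean.
median_charges = toy_df.groupby("Churn")["MonthlyCharges"].median()
```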

### `TotalCharges`

`TotalCharges` is similar to `MonthlyCharges`: it's a continuous numerical variable that represents the total amount charged to the customer.

**Note**: the unit of this variable also wasn't mentioned in the dataset description.

In [None]:
cleaned_df.TotalCharges.describe()

In [None]:
cleaned_df.TotalCharges.skew()

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="TotalCharges",
    marginal="box",
    color="Churn",
    barmode="group",
    color_discrete_map=color_map,
    nbins=30,
    title="Total charges distribution for <b>churning</b> and <b>non-churning</b> customers",
)

fig.show()

This variable represents the cumulative charges paid by customers since they joined the company. A time-series variable would've been more helpful for analysing time-based patterns, such as how total charges change from month to month or from year to year.
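
If `TotalCharges` really is cumulative, it should be roughly `tenure * MonthlyCharges` for customers whose monthly price did not change much. This is an assumption about how the column was built, not something the dataset description states, and the numbers below are made up.

```python
import pandas as pd

# Made-up customers, not the Telco data.
toy_df = pd.DataFrame(
    {
        "tenure": [1, 10, 24],
        "MonthlyCharges": [30.0, 50.0, 80.0],
        "TotalCharges": [30.0, 520.0, 1900.0],
    }
)

# What the total would be if the monthly charge had never changed.
expected_total = toy_df["tenure"] * toy_df["MonthlyCharges"]

# Relative difference between the recorded and the expected total.
relative_gap = (toy_df["TotalCharges"] - expected_total) / expected_total
```

Large gaps on the real data would hint at price changes or added services over time, which is exactly the information a time series would make explicit.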

### `Contract`

The `Contract` variable is a categorical variable representing the contract term of the customer.

Let's see whether the contract term has any influence on staying with or leaving the company.

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="Contract",
    color="Contract",
    title="Customer contract distribution",
)

fig.update_xaxes(categoryorder="total descending")

fig.show()

The majority of customers prefer *month-to-month* contracts, which probably require lower fees and might also be favored by new customers who are not sure whether this company will deliver what they expect.

Let's see the relation between contract type and whether customers churn or not:

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="Contract",
    color="Churn",
    barmode="group",
    color_discrete_map=color_map,
    title="Customer contract distribution for <b>churning</b> and <b>non-churning</b> customers",
)

fig.update_xaxes(categoryorder="total descending")

fig.show()

This chart suggests that customers on short-term contracts are far more likely to leave the company.

### `PaymentMethod`

This variable represents the customer's payment method.

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="PaymentMethod",
    color="PaymentMethod",
    title="Payment method distribution",
)

fig.update_xaxes(categoryorder="total descending")

fig.show()

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="PaymentMethod",
    color="Churn",
    color_discrete_map=color_map,
    title="Payment method distribution for <b>churning</b> and <b>non-churning</b> customers",
    histnorm="probability",
)

fig.update_xaxes(categoryorder="total descending")

fig.show()

Customers who use the *Electronic check* payment method are more likely to leave the company.
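
A note on reading the last chart: with `histnorm="probability"`, Plotly normalizes each color trace on its own, so the bars of each `Churn` group sum to 1. The chart therefore shows how payment methods are distributed *within* churned and retained customers. The same quantity can be sketched in pandas on made-up data:

```python
import pandas as pd

# Made-up data: four churned and four retained customers.
toy_df = pd.DataFrame(
    {
        "PaymentMethod": [
            "Electronic check", "Electronic check", "Electronic check", "Mailed check",
            "Electronic check", "Mailed check", "Mailed check", "Mailed check",
        ],
        "Churn": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    }
)

# Share of each payment method inside each Churn group
# (each group's shares sum to 1, like one normalized trace).
share_within_churn = toy_df.groupby("Churn")["PaymentMethod"].value_counts(
    normalize=True
)
```

A payment method whose share is much larger among churned customers than among retained ones is over-represented in churn, which is how the chart above should be read.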

### `PaperlessBilling`

Paperless billing is a way of receiving *bills* electronically, rather than with paper bills.

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="PaperlessBilling",
    color="Churn",
    color_discrete_map=color_map,
    title="Paperless billing distribution for <b>churning</b> and <b>non-churning</b> customers",
)

fig.show()

We can see from the chart that customers who use paperless billing have a higher churn rate than customers who don't.

## Services attributes

There are two main services the customer can have: phone service and internet service.

Phone service has only one additional service, the `MultipleLines` service.

Internet service, on the other hand, has several additional services:
- `OnlineSecurity`
- `OnlineBackup`
- `DeviceProtection`
- `TechSupport`
- `StreamingTV`
- `StreamingMovies`

In this section, we'll focus on studying how different services contribute to the overall **customer satisfaction**, which will lead to either staying or leaving the company.

### `PhoneService`

The `PhoneService` variable indicates whether the customer has a phone service or not.

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="PhoneService",
    color="Churn",
    color_discrete_map=color_map,
    title="Phone service distribution for <b>churning</b> and <b>non-churning</b> customers",
)

fig.update_xaxes(categoryorder="total descending")

fig.show()

We can see that *almost* all customers have the phone service, which is understandable, because this is the *very minimum* service.

### `MultipleLines`

`MultipleLines` is an additional phone service, which indicates whether the customer has multiple lines or not.

Let's see if having this service would have any impact on leaving the company:

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="MultipleLines",
    color="Churn",
    color_discrete_map=color_map,
    title="Multiple lines distribution for <b>churning</b> and <b>non-churning</b> customers",
)

fig.update_xaxes(categoryorder="total descending")

fig.show()

We can see that this service has no influence on churning, since customers who have the service and customers who don't are about equally likely to churn.

Now, let's show how `MonthlyCharges` changes across the different values of `MultipleLines`, which helps us understand how *costly* this service is.

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="MonthlyCharges",
    color="MultipleLines",
    barmode="group",
    nbins=20,
    facet_row="Churn",
    height=800,
    title="Monthly charges distribution for different values of <i>multiple lines</i>",
)

fig.show()

Customers who have the multiple lines service tend to have higher monthly charges, but it's not clear how much this service contributes to the customer's monthly charges.

### `InternetService`

The `InternetService` variable defines the type of internet service, which can be either `DSL`, `Fiber optic` or no service at all.

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="InternetService",
    color="Churn",
    color_discrete_map=color_map,
    title="Internet service distribution for <b>churning</b> and <b>non-churning</b> customers",
)

fig.update_xaxes(categoryorder="total descending")

fig.show()

The chart shows that a lot of customers choose the `Fiber optic` service, and it's also evident that the customers who use `Fiber optic` have a high churn rate. This might suggest *dissatisfaction* with this type of internet service.

Let's see how the monthly charges are distributed for the different types of internet service:

In [None]:
fig = px.histogram(
    data_frame=cleaned_df,
    x="MonthlyCharges",
    color="InternetService",
    barmode="group",
    facet_row="Churn",
    height=800,
    nbins=20,
    title="Monthly charges distribution for different values of <i>internet service</i>",
)

fig.show()

This chart shows that, in general, customers who use the `Fiber optic` internet service pay higher monthly charges than customers who use `DSL`, and of course more than customers who don't use internet service at all.

The chart also shows that among the customers who leave the company, fiber optic users are the most common. This might be linked to some characteristics of the fiber optic service: price, quality compared with other companies, etc.

### Other services

There are several additional services for the internet service, which include:

- `OnlineSecurity`
- `OnlineBackup`
- `DeviceProtection`
- `TechSupport`
- `StreamingTV`
- `StreamingMovies`

Let's show, for each service, the number of customers who have it and the number of customers who don't:

In [None]:
internet_services_df = pd.melt(
    frame=cleaned_df.loc[cleaned_df.InternetService != "No", internet_service_cols],
    var_name="service",
    value_name="HasService",
)
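
`pd.melt` reshapes the frame from wide to long: every service column becomes a value in the `service` column, and the original cell goes into `HasService`. A tiny made-up example shows the effect.

```python
import pandas as pd

# Made-up wide frame: one row per customer, one column per service.
toy_df = pd.DataFrame(
    {"OnlineSecurity": ["Yes", "No"], "OnlineBackup": ["No", "No"]}
)

# Long format: one row per (customer, service) pair.
toy_long = pd.melt(frame=toy_df, var_name="service", value_name="HasService")
```

With two customers and two services, the long frame has four rows, which is the shape a single histogram with `x="service"` and `color="HasService"` needs.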

In [None]:
fig = px.histogram(
    data_frame=internet_services_df,
    x="service",
    color="HasService",
    title="Distribution of additional internet services",
)

fig.show()

# Focusing on *recent* customers

So far, our analysis has included all customers, although a customer's behavior changes depending on how long they have been with the company (new vs. loyal).

As we saw earlier from the `tenure` distribution, new customers (the ones who have been using the company's services for a year or less) are more likely to churn.

In this section, we'll take a closer look at these customers (let's call them **recent customers**) to understand their behavior.

The `tenure` variable is measured in months, so let's keep only the customers who have stayed with the company for at most 12 months (a year):

In [None]:
one_year_customers_df = cleaned_df[cleaned_df.tenure <= 12]

In [None]:
len(one_year_customers_df) / len(cleaned_df)

About 30% of customers have been with the company for a year or less.

Now, we can ask different questions on this new subset of data.

## How many recent customers churned?

In [None]:
fig = px.histogram(
    data_frame=one_year_customers_df,
    x="Churn",
    color="Churn",
    color_discrete_map=color_map,
    title="Churn distribution for new customers",
)

fig.show()

Almost half of the recent customers have churned.
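
To compare recent customers against everyone else with numbers rather than bar heights, the 12-month filter can be turned into a label and used as a grouping key. The data and the `"recent"`/`"established"` labels below are made up for this sketch.

```python
import pandas as pd

# Made-up customers, not the Telco data.
toy_df = pd.DataFrame(
    {
        "tenure": [1, 6, 12, 13, 40, 72, 24, 60, 3, 36],
        "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No"],
    }
)

# Label each customer using the same 12-month threshold as the filter.
recency = (toy_df["tenure"] <= 12).map({True: "recent", False: "established"})

# Share of recent customers, and churn rate inside each group.
recent_share = (recency == "recent").mean()
churn_rate_by_recency = toy_df["Churn"].eq("Yes").groupby(recency).mean()
```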

## The monthly charges of recent customers:

In [None]:
fig = px.histogram(
    data_frame=one_year_customers_df,
    x="MonthlyCharges",
    color="Churn",
    color_discrete_map=color_map,
    barmode="group",
    marginal="box",
    nbins=20,
    title="Distribution of monthly charges for new customers",
)

fig.show()

The more new customers pay per month, the more likely they are to churn.

## What contract type did recent customers prefer?

In [None]:
fig = px.histogram(
    data_frame=one_year_customers_df,
    x="Contract",
    color="Churn",
    color_discrete_map=color_map,
    title="Contract type distribution for new customers",
)

fig.show()

It seems that the `Month-to-month` contract is the most preferred contract type for new customers.

We can also see that customers who use this type of contract are more likely to churn.

## What type of services did recent customers use?

In this section we'll investigate which services recent customers chose to use, and the relation between the different services and how much the customer pays per month.

This will help us identify any services that are more expensive than others, or that caused customers dissatisfaction leading them to churn.

### `PhoneService`

In [None]:
fig = px.histogram(
    data_frame=one_year_customers_df,
    x="PhoneService",
    color="Churn",
    color_discrete_map=color_map,
    title="Phone service distribution",
)

fig.update_xaxes(categoryorder="total descending")

fig.show()

Nearly all new customers have phone service.

### `MultipleLines`

In [None]:
fig = px.histogram(
    data_frame=one_year_customers_df,
    x="MultipleLines",
    color="Churn",
    color_discrete_map=color_map,
    title="Multiple lines distribution",
)

fig.update_xaxes(categoryorder="total descending")

fig.show()

In [None]:
fig = px.histogram(
    data_frame=one_year_customers_df,
    x="MonthlyCharges",
    color="MultipleLines",
    barmode="group",
    facet_row="Churn",
    nbins=20,
    height=800,
    title="Monthly charges distribution for different values of <i>multiple lines</i>",
)

fig.show()

The majority of new customers choose not to have the multiple lines service.

### `InternetService`

In [None]:
fig = px.histogram(
    data_frame=one_year_customers_df,
    x="InternetService",
    color="Churn",
    color_discrete_map=color_map,
    title="Internet service distribution",
)

fig.update_xaxes(categoryorder="total descending")

fig.show()

In [None]:
fig = px.histogram(
    data_frame=one_year_customers_df,
    x="MonthlyCharges",
    color="InternetService",
    barmode="group",
    facet_row="Churn",
    nbins=20,
    height=800,
    title="Monthly charges distribution for different values of <i>internet service</i>",
)

fig.show()

These two charts support our previous findings:
- The `Fiber optic` internet service is linked with higher monthly charges.
- The churn rate for customers who use the `Fiber optic` service is higher, compared to the `DSL` service.

### Other services

In [None]:
one_year_internet_services_df = pd.melt(
    frame=one_year_customers_df.loc[
        one_year_customers_df.InternetService != "No", internet_service_cols
    ],
    var_name="service",
    value_name="HasService",
)

In [None]:
fig = px.histogram(
    data_frame=one_year_internet_services_df, x="service", color="HasService"
)

fig.show()

# Conclusion

- Using this data, we were able to answer some basic questions about the customers and why they leave the company.

- We saw that customers who use the `Fiber optic` service are very likely to churn. This is a very important insight, and the reason behind it should be understood; maybe competing companies provide the same service with better offers.

- We also saw how the churn rate changes with customer tenure. New customers have many different companies to choose from; therefore, the company should invest heavily in keeping these customers and making sure they turn into loyal customers.

- With this data alone, it would be hard to actually understand what the customers like and how to make sure they don't churn. Some important information is needed in this context, such as:
    - Service price information.
    - Time series data for customers' monthly charges since they signed a contract with the company.
    - Information about competing companies' services.