### **Business Problem Identification**

In the early stages of a project, the business problem that requires action or a solution must be identified. This involves evaluating the business to find the areas that need improvement in order to reach the desired business goals.

##### **Market Research**

Telecommunications subscribers currently have many choices and can easily switch from one service provider to another. Many customers switch frequently because of the numerous promotions offered by competing providers. Providers that offer the best service at the most competitive prices tend to be the ones customers choose.

Marketing practitioners in this industry work hard to keep customers from switching to competitors. The reason is that acquiring new customers costs considerably more than retaining existing, loyal ones, so retention takes priority over acquisition. This point is made in *Reichheld, F.F. and Sasser, W.E. (1990) Zero Defections: Quality Comes to Services. Harvard Business Review, 68, 105-111*.

Retaining customers is not easy either. A common approach is to offer packages with special prices or bonuses so that customers are not tempted to switch to competitors. Extending such offers to every existing customer is expensive, however, because customers with a tendency to churn (unsubscribe) are generally only a small share of the customer base. Loyal customers do not need special offers or bonuses, since they will remain customers without them.

A better way is to ensure that special offers or bonuses are only given to certain customers who are known to have a tendency to churn. Because it is aimed specifically at certain customers, the costs required to provide promotions are lower.
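To make the cost argument concrete, here is a small back-of-the-envelope sketch. All of the numbers (customer count, churn share, cost per offer, and the size of the flagged group) are hypothetical values chosen for illustration and do not come from the dataset.

```python
# Hypothetical inputs, chosen only to illustrate the cost argument
n_customers = 10_000      # total existing customers
n_churners = 1_000        # customers who would actually churn (10%)
cost_per_offer = 5.0      # cost of giving one customer the special offer

# Blanket campaign: every customer receives the offer
blanket_cost = n_customers * cost_per_offer

# Targeted campaign: only customers flagged as likely churners receive it.
# Assume the flagged group is twice the size of the true churner group,
# because no prediction model is perfectly precise.
n_flagged = 2 * n_churners
targeted_cost = n_flagged * cost_per_offer

print(f"Blanket campaign cost : {blanket_cost:,.0f}")
print(f"Targeted campaign cost: {targeted_cost:,.0f}")
print(f"Savings               : {blanket_cost - targeted_cost:,.0f}")
```

Even with a deliberately imprecise targeting assumption, the targeted campaign reaches far fewer customers, which is where the savings come from.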

Churn prediction is a predictive modeling task widely used in industry to identify which customers are most likely to unsubscribe and which signs precede their departure. By watching for these signs, the company can contact subscribers with a high probability of leaving and offer them a specially priced package before they unsubscribe.

##### **Case Analysis and Goals**

In this case, a prediction model is built for a telecommunications company that sells internet services. Many of its customers have switched to competitors offering more attractive prices and services, which reduces the company's revenue and may point to customer dissatisfaction. Management is aware of the problem and plans to launch promotional programs to hold down the churn rate. These programs will be offered only to customer groups considered prone to churn, and machine learning is used to identify those groups effectively.

The purpose of the predictive model is to generate a churn score for each customer that indicates whether the customer is predicted to unsubscribe. The model uses as predictors the customer's patterns of internet service usage on the company's network. The predictions will then be translated into the actions described in the previous paragraph, with the aim of reducing the number of churned customers, increasing customer satisfaction, and raising the company's revenue and profitability.

How can customer usage patterns support such predictions? A person's past behavior can serve as a benchmark for their future behavior, and these behaviors can be analyzed from the existing data. By knowing which signs indicate that someone is about to unsubscribe, the company can act to prevent it.

Ultimately, the results of this model will be used by the company's marketing division to offer special packages to customers whose churn score is 'yes', with the aim of preventing them from switching to a competitor.
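As a rough illustration of how such a score could be handed to marketing, the sketch below turns predicted probabilities into a 'yes'/'no' churn score and lists the customers to contact. The customer labels, the probabilities, and the 0.5 threshold are all assumptions made for this example; the source does not specify how the score is derived.

```python
import pandas as pd

# Hypothetical model output: one predicted churn probability per customer
scores = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "churn_probability": [0.82, 0.15, 0.55, 0.30],
})

# Assumed decision threshold; in practice it would be tuned to business costs
threshold = 0.5
is_churn = scores["churn_probability"].ge(threshold)
scores["churn_score"] = is_churn.map({True: "yes", False: "no"})

# Customers the marketing division would contact with a special package
targets = scores.loc[scores["churn_score"] == "yes", "customer"].tolist()
print(targets)
```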

### **Data Understanding**

At this stage, Data Understanding is performed to gain a deeper understanding of the characteristics of the data before moving on to Exploratory Data Analysis. Two light cleaning steps are carried out here: data formatting and a check for duplicates. Further preparation takes place during Exploratory Data Analysis. The dataset can be downloaded from the [following link](https://github.com/reyharighy/Predictive_Machine_Learning_for_Telco_Customer_Churn/blob/main/data_telco_customer_churn.csv). It is imported with the **Pandas** library and stored in a variable named `data`.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv('data_telco_customer_churn.csv')
data.head()

| | Dependents | tenure | OnlineSecurity | OnlineBackup | InternetService | DeviceProtection | TechSupport | Contract | PaperlessBilling | MonthlyCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Yes | 9 | No | No | DSL | Yes | Yes | Month-to-month | Yes | 72.9 | Yes |
| 1 | No | 14 | No | Yes | Fiber optic | Yes | No | Month-to-month | Yes | 82.65 | No |
| 2 | No | 64 | Yes | No | DSL | Yes | Yes | Two year | No | 47.85 | Yes |
| 3 | No | 72 | Yes | Yes | DSL | Yes | Yes | Two year | No | 69.65 | No |
| 4 | No | 3 | No internet service | No internet service | No | No internet service | No internet service | Month-to-month | Yes | 23.6 | No |


##### **Context**

The dataset represents customer profiles, covering both customers who have left and those who are still subscribed. Past records of which services customers used, and whether they churned, can serve as indicators for predicting whether current customers are likely to churn.

##### **Attribute Information**

Each column is a feature or variable, either an independent variable or the response variable, as described in the following table.

| Attribute | Description |
| --- | --- |
| `Dependents` | Whether the customer has dependents |
| `tenure` | How long the customer has subscribed to the company's services |
| `OnlineSecurity` | Whether the customer uses the *Online Security* service |
| `OnlineBackup` | Whether the customer uses the *Online Backup* service |
| `InternetService` | The type of *Internet Service* the customer subscribes to, if any |
| `DeviceProtection` | Whether the customer uses the *Device Protection* service |
| `TechSupport` | Whether the customer uses the *Tech Support* service |
| `Contract` | The duration of the customer's contract |
| `PaperlessBilling` | Whether the bill is sent on a *paperless* basis |
| `MonthlyCharges` | The amount charged to the customer each month |
| `Churn` | Whether the customer has unsubscribed |

As a first step toward understanding the dataset, each column is inspected for its data type, number of unique values, a sample of unique values, and number of missing values. The unique values give a basic picture of what a column contains: how uniform the data is, whether values are written consistently, and whether the data type is appropriate. In some cases, the data format must be adjusted to make it more structured.

In [2]:
items = [[
    col, 
    data[col].dtype, 
    data[col].nunique(), 
    list(data[col].unique()[:3]), 
    data[col].isnull().sum()
] for col in data]

display(pd.DataFrame(data=items,columns=[
    'Attributes',
    'Data Type',
    'Total Unique',
    'Unique Sample',
    'Total Missing'
]))

data.info(verbose=False, memory_usage=True)

| | Attributes | Data Type | Total Unique | Unique Sample | Total Missing |
| --- | --- | --- | --- | --- | --- |
| 0 | Dependents | object | 2 | [Yes, No] | 0 |
| 1 | tenure | int64 | 73 | [9, 14, 64] | 0 |
| 2 | OnlineSecurity | object | 3 | [No, Yes, No internet service] | 0 |
| 3 | OnlineBackup | object | 3 | [No, Yes, No internet service] | 0 |
| 4 | InternetService | object | 3 | [DSL, Fiber optic, No] | 0 |
| 5 | DeviceProtection | object | 3 | [Yes, No internet service, No] | 0 |
| 6 | TechSupport | object | 3 | [Yes, No, No internet service] | 0 |
| 7 | Contract | object | 3 | [Month-to-month, Two year, One year] | 0 |
| 8 | PaperlessBilling | object | 2 | [Yes, No] | 0 |
| 9 | MonthlyCharges | float64 | 1422 | [72.9, 82.65, 47.85] | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4930 entries, 0 to 4929
Columns: 11 entries, Dependents to Churn
dtypes: float64(1), int64(1), object(9)
memory usage: 423.8+ KB


The attribute information above can be summarized as follows.

1. The dataframe consists of 4930 rows and 11 columns. Each row represents one customer, described by the 11 attributes.
1. Most attributes are nominal categorical, meaning their classes have no inherent order. They are currently stored as `object`, which is inefficient in terms of memory, so they will be converted to `category`.
1. The remaining attributes, `tenure` and `MonthlyCharges`, are numerical and stored as `int64` and `float64` respectively. Memory can be saved by using narrower types such as `int8` and `float32`, as long as the values fit within the narrower type.
1. Categorical values are written consistently, so data entry errors are assumed to be absent. This can be seen from the match between the number of unique values and the unique samples shown.
1. For numeric data, spelling consistency and unique values are not a concern, because the values are measurements rather than classes.
1. No column contains missing values, so no missing-value handling is needed later.
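Before narrowing a data type, it helps to confirm that the values actually fit. The following is a minimal sketch on a small stand-in frame named `sample` (a hypothetical variable whose values mimic the ranges seen in the dataset), not the conversion applied to `data` itself.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the values mimic the ranges seen in the dataset
sample = pd.DataFrame({
    "tenure": [0, 9, 14, 64, 72],
    "MonthlyCharges": [18.8, 72.9, 82.65, 47.85, 118.65],
    "Contract": ["Month-to-month", "Two year", "One year", "Two year", "Month-to-month"],
})

# int8 stores integers from -128 to 127, so tenure values up to 72 fit
int8_range = np.iinfo(np.int8)
fits_int8 = sample["tenure"].between(int8_range.min, int8_range.max).all()

# float32 keeps roughly 7 significant digits, enough for two-decimal charges
as_float32 = sample["MonthlyCharges"].astype("float32")
charges_preserved = np.allclose(sample["MonthlyCharges"], as_float32, atol=1e-4)

# category stores each distinct label once plus small integer codes
as_category = sample["Contract"].astype("category")

print(fits_int8, charges_preserved, list(as_category.cat.categories))
```

If a column held values outside the `int8` range, the same check would return `False`, signalling that a wider type such as `int16` is needed.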

##### **Duplicate Data**

Duplicate data refers to records that have identical values in all or most features of the dataset. Duplicates can lead to inaccurate or biased results if they are not handled properly, so the dataset needs to be checked for them.

In [3]:
print('Total duplicates is {} records.'.format(data.duplicated().sum()))
data[data.duplicated()].head()

Total duplicates is 77 records.


| | Dependents | tenure | OnlineSecurity | OnlineBackup | InternetService | DeviceProtection | TechSupport | Contract | PaperlessBilling | MonthlyCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 624 | No | 1 | No internet service | No internet service | No | No internet service | No internet service | Month-to-month | No | 19.65 | No |
| 701 | No | 41 | No internet service | No internet service | No | No internet service | No internet service | Two year | No | 20.65 | No |
| 786 | No | 1 | No | No | Fiber optic | No | No | Month-to-month | Yes | 69.65 | Yes |
| 951 | No | 1 | No internet service | No internet service | No | No internet service | No internet service | Month-to-month | No | 20.15 | Yes |
| 1266 | No | 1 | No internet service | No internet service | No | No internet service | No internet service | Month-to-month | No | 19.65 | No |


The check flags 77 rows as duplicates. As the five examples above suggest, `.duplicated()` flags a row when all of its feature values match those of an earlier row. Whether such rows are true duplicates, however, depends on whether the matching values include a unique identifier. This dataset has no unique identifier, so rows with identical values may well belong to different customers, and the rows flagged by `.duplicated()` are left as they are.
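The role of a unique identifier can be shown with a toy example. The `customers` frame and its `customer_id` column are hypothetical; the real dataset has no such column, which is exactly why the flagged rows cannot be confirmed as true duplicates.

```python
import pandas as pd

# Hypothetical records: two different customers with identical attributes
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "tenure": [1, 1, 41],
    "Contract": ["Month-to-month", "Month-to-month", "Two year"],
    "MonthlyCharges": [19.65, 19.65, 20.65],
})

# With the identifier present, no row is flagged as a duplicate
with_id = customers.duplicated().sum()

# Without it, the second customer is flagged as a duplicate of the first
without_id = customers.drop(columns="customer_id").duplicated().sum()

print(with_id, without_id)
```

Dropping the flagged row in this toy case would silently remove a real customer, which is the risk the text above avoids.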

##### **Descriptive Statistics**

Descriptive statistics summarize the characteristics of the dataframe. For numerical data, they report the count, central tendency, and dispersion of the values. For non-numeric data, they report the count, the number of unique values, the mode, and the frequency of the mode.

In [4]:
# Summarize numerical and categorical columns separately
for label, dtypes in [('Numerical', ['number']), ('Categorical', ['object'])]:
    print('Descriptive Statistics on {} Features'.format(label))
    display(data.describe(include=dtypes))

Descriptive Statistics on Numerical Features


| | tenure | MonthlyCharges |
| --- | --- | --- |
| count | 4930.0 | 4930.0 |
| mean | 32.401217 | 64.883032 |
| std | 24.501193 | 29.92396 |
| min | 0.0 | 18.8 |
| 25% | 9.0 | 37.05 |
| 50% | 29.0 | 70.35 |
| 75% | 55.0 | 89.85 |
| max | 72.0 | 118.65 |


Descriptive Statistics on Categorical Features


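| | Dependents | OnlineSecurity | OnlineBackup | InternetService | DeviceProtection | TechSupport | Contract | PaperlessBilling | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 4930 | 4930 | 4930 | 4930 | 4930 | 4930 | 4930 | 4930 | 4930 |
| unique | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 2 |
| top | No | No | No | Fiber optic | No | No | Month-to-month | Yes | No |
| freq | 3446 | 2445 | 2172 | 2172 | 2186 | 2467 | 2721 | 2957 | 3614 |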


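These statistics give an initial picture of the data. For example, `tenure` ranges from 0 to 72 with a median of 29, `MonthlyCharges` ranges from 18.8 to 118.65 with a median of 70.35, and the most common contract type is month-to-month (2721 of 4930 customers).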
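One reading of the categorical summary that matters for modeling is the class balance of the target. Since `Churn` has two unique values and the most frequent one, 'No', appears 3614 times, the remaining rows must be 'Yes'. The short calculation below uses only those reported figures.

```python
# Figures taken from the categorical summary above
total_customers = 4930
not_churned = 3614   # frequency of the most common Churn value, 'No'

# Churn has only two classes, so the rest of the rows are 'Yes'
churned = total_customers - not_churned
churn_share = churned / total_customers

print(f"Churned customers: {churned} ({churn_share:.1%})")
```

Roughly one customer in four has churned, so the two classes are imbalanced, which is worth keeping in mind when a model is evaluated later.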

##### **Data Formatting**

As previously explained, the dataframe can be optimized by converting each column to a more appropriate data type so that it uses memory more efficiently.

In [5]:
# Record the original dtypes so they can be compared after conversion
before = [data[col].dtype for col in data.columns]

for col in data.columns:
    if pd.api.types.is_float_dtype(data[col]):
        data[col] = data[col].astype('float32')
    elif pd.api.types.is_integer_dtype(data[col]):
        # Pick the smallest integer type that can hold the column's value range
        data[col] = pd.to_numeric(data[col], downcast='integer')
    elif data[col].nunique() <= 10:
        # Low-cardinality text columns are stored as categories
        data[col] = data[col].astype('category')

items = [[
    col, 
    before[k], 
    data[col].dtype
] for k, col in enumerate(data.columns)]

display(pd.DataFrame(data=items,columns=[
    'Features',
    'Before',
    'After'
]))

data.info(verbose=False, memory_usage=True)

| | Features | Before | After |
| --- | --- | --- | --- |
| 0 | Dependents | object | category |
| 1 | tenure | int64 | int8 |
| 2 | OnlineSecurity | object | category |
| 3 | OnlineBackup | object | category |
| 4 | InternetService | object | category |
| 5 | DeviceProtection | object | category |
| 6 | TechSupport | object | category |
| 7 | Contract | object | category |
| 8 | PaperlessBilling | object | category |
| 9 | MonthlyCharges | float64 | float32 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4930 entries, 0 to 4929
Columns: 11 entries, Dependents to Churn
dtypes: category(9), float32(1), int8(1)
memory usage: 68.7 KB


Memory usage drops to 68.7 KB, roughly 6 times smaller than the 423.8 KB reported for the original data types. The smaller footprint saves memory and can help later operations run faster.
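The `+` in the earlier `423.8+ KB` figure means that pandas reported a lower bound, because the strings held in `object` columns were not measured. The sketch below shows how `memory_usage(deep=True)` can measure them in full. It uses a synthetic `contract` series (an assumption for illustration) rather than the real dataset, so the printed byte counts will differ from the figures above.

```python
import pandas as pd

# Synthetic stand-in column with few distinct labels, repeated many times
contract = pd.Series(["Month-to-month", "One year", "Two year"] * 1000, dtype="object")

# deep=True inspects the Python strings themselves instead of only the pointers
object_bytes = contract.memory_usage(deep=True)
category_bytes = contract.astype("category").memory_usage(deep=True)

print(f"object  : {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The same call on a dataframe, `data.memory_usage(deep=True).sum()`, returns the total in bytes and can be run before and after the conversion for a like-for-like comparison.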