In [2]:
# Load data manipulation packages
import numpy as np
import pandas as pd

# Data dump store (serialization)
import joblib

# 1. Business Understanding
---

- Understanding customer churn is essential for an internet provider so that it can prevent churn effectively with appropriate marketing initiatives.
- The internet provider wants to offer targeted marketing initiatives to prevent customer churn.

## 1.1 Business Objective
---

- Reduce customer churn by 30% over the next one-year period.
- Identify the relationship between the predictors and the target variable.
- The company wants to avoid spending resources and effort on non-churn customers (it has set the maximum allowed false positives at 20%).

## 1.2 Business Questions
---

- How can the internet provider develop an effective marketing strategy to reduce churn by 30%?
- Which marketing initiatives are suitable for reducing customer churn by 30%?
- How can unnecessary budget allocation be prevented?

## 1.3 Modelling Task
---

- Output target: **customer status of churn (categorical)**


- The goal of this project is to predict whether a customer will churn or not based on various features.
Task: **Classification task**


- We need a model that can be easily interpreted so that we can understand how each feature contributes to the prediction. This can help us gain insights into the underlying factors that influence whether a customer will churn or not.
Model used: **Logistic regression**


- We will use ROC/AUC as our evaluation metric because it is less sensitive to an imbalanced target and because the ROC curve lets us choose a classification threshold. **Evaluation metric: ROC/AUC**
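
As a rough illustration of how the ROC curve can be used to pick a threshold, the sketch below computes the AUC and then selects the operating point with the highest recall whose false positive rate stays at or below 20%. It runs on synthetic scores, not the churn data, and it assumes that the "maximum allowed false positives" in the business objective means a false positive rate cap. The names `max_fpr` and `chosen_threshold` are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic labels and scores, used only for illustration (not the churn data).
rng = np.random.default_rng(31)
y_true = rng.integers(0, 2, size=1000)
# Positive cases (1) receive higher scores on average than negative cases (0).
y_score = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=1000), 0, 1)

# Threshold-independent summary of ranking quality.
auc = roc_auc_score(y_true, y_score)

# One (FPR, TPR) pair per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Assumption: the business constraint is read as "false positive rate <= 20%".
max_fpr = 0.20
allowed = fpr <= max_fpr

# Among the allowed operating points, keep the one with the highest recall.
best = np.argmax(np.where(allowed, tpr, -1.0))
chosen_threshold = thresholds[best]

print(f"AUC: {auc:.3f}")
print(f"threshold={chosen_threshold:.3f}, FPR={fpr[best]:.3f}, TPR={tpr[best]:.3f}")
```

If the 20% limit is instead meant as a share of all flagged customers (a precision-style constraint), the same idea applies, but the selection would need `precision_recall_curve` rather than `roc_curve`.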

# 2. Modelling Workflow
---

## **Machine Learning Workflow** (Simplified)

### 1. <font color='blue'> Importing Data to Python:
    * Data description, Importing data, Data splitting
    
### 2. <font color='blue'> Exploratory Data Analysis:
    * Descriptive statistics, Missing value checking, Data exploration
    
### 3. <font color='blue'> Preprocessing:
    * Missing value handling, Outliers handling

### 4. <font color='blue'> Modelling:
    * Model fitting, Evaluation
    
### 5. <font color='blue'> Lift Chart & Interpretation:
    * Targeting customer churn, Coefficient interpretation

# 3. Load Data
---

- Describe the features and the target variable.
- Load the data from the specified path.

## 3.1 Data Description
---

The potential **predictors** for the response variable are:
1. `is_tv_subscriber`
  - `is_tv_subscriber = 0` for the customers who didn't subscribe to the TV package or only subscribe to the internet package.
  - `is_tv_subscriber = 1` for the customers who subscribe to the TV package.


2. `is_movie_package_subscriber`
  - `is_movie_package_subscriber = 0` for the customers who didn't subscribe to the movie package or only subscribe to the internet package.
  - `is_movie_package_subscriber = 1` for the customers who subscribe to the movie package.


3. `subscription_age` is the number of years the customer has used the internet service.
4. `bill_avg` is the average bill over the last three months.
5. `remaining_contract` (spelled `reamining_contract` in the dataset) is the number of years remaining on the customer's subscription contract. If null, the customer does not have a contract.
6. `service_failure_count` is the number of calls to the call center about service failures in the last three months.
7. `download_avg` is the internet usage over the last three months, in GB.
8. `upload_avg` is the upload usage over the last three months, in GB.
9. `download_over_limit` is the number of times the customer's internet usage exceeded their limit.

**Target variable**:
- `churn`
  - `churn = 0` for the customers who are retained.
  - `churn = 1` for the customers who cancel their subscription before the contract ends or the customers who didn't renew their subscription after the contract ended.
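
The descriptions above can be turned into a few simple sanity checks. The sketch below is a minimal example run on five rows copied from the `data.head()` output shown later in this notebook. The helper `check_schema` is hypothetical, and the rule that numeric columns should be non-negative is an assumption drawn from the descriptions, not a documented property of the dataset.

```python
import numpy as np
import pandas as pd

# Five rows copied from the data.head() output in this notebook.
sample = pd.DataFrame({
    "id": [15, 18, 23, 27, 34],
    "is_tv_subscriber": [1, 0, 1, 0, 0],
    "is_movie_package_subscriber": [0, 0, 0, 0, 0],
    "subscription_age": [11.95, 8.22, 8.91, 6.87, 6.39],
    "bill_avg": [25, 0, 16, 21, 0],
    "reamining_contract": [0.14, np.nan, 0.0, np.nan, np.nan],
    "service_failure_count": [0, 0, 0, 1, 0],
    "download_avg": [8.4, 0.0, 13.7, 0.0, 0.0],
    "upload_avg": [2.3, 0.0, 0.9, 0.0, 0.0],
    "download_over_limit": [0, 0, 0, 0, 0],
    "churn": [0, 1, 1, 1, 1],
})

BINARY_COLUMNS = ["is_tv_subscriber", "is_movie_package_subscriber", "churn"]
# Assumption: durations, bills, counts and usage should never be negative.
NON_NEGATIVE_COLUMNS = [
    "subscription_age", "bill_avg", "reamining_contract",
    "service_failure_count", "download_avg", "upload_avg",
    "download_over_limit",
]


def check_schema(df):
    """Hypothetical helper: return a list of problems found in df."""
    problems = []
    for col in BINARY_COLUMNS:
        if not df[col].isin([0, 1]).all():
            problems.append(f"{col} has values outside {{0, 1}}")
    for col in NON_NEGATIVE_COLUMNS:
        # Missing values are skipped; a null contract means "no contract".
        if (df[col].dropna() < 0).any():
            problems.append(f"{col} has negative values")
    return problems


problems = check_schema(sample)
no_contract = sample["reamining_contract"].isna()

print("Problems found:", problems)
print("Customers without a contract:", int(no_contract.sum()))
```

On the full dataset the same helper could be called as `check_schema(data)` once the data has been loaded. Any reported problem would be a prompt for closer inspection, not proof of an error.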

## 3.2 Importing Data
---

In [25]:
# Read dataset function
def read_data(path):
    
    # 1. Read data
    data = pd.read_csv(path,
                      index_col = 0,
                      low_memory = False)
    
    # 2. Drop duplicates
    data = data.drop_duplicates()
    
    # 3. Reset index
    data = data.reset_index(drop=True)
    
    # 4. Print data shape
    print('Data shape :', data.shape)
    
    return data

In [5]:
path = 'data/raw/internet_service_churn.csv'

In [26]:
# Read data
data = read_data(path)

Data shape : (70124, 11)


In [27]:
data.head()

Unnamed: 0,id,is_tv_subscriber,is_movie_package_subscriber,subscription_age,bill_avg,reamining_contract,service_failure_count,download_avg,upload_avg,download_over_limit,churn
0,15,1,0,11.95,25,0.14,0,8.4,2.3,0,0
1,18,0,0,8.22,0,,0,0.0,0.0,0,1
2,23,1,0,8.91,16,0.0,0,13.7,0.9,0,1
3,27,0,0,6.87,21,,1,0.0,0.0,0,1
4,34,0,0,6.39,0,,0,0.0,0.0,0,1


In [28]:
data.dtypes

id                               int64
is_tv_subscriber                 int64
is_movie_package_subscriber      int64
subscription_age               float64
bill_avg                         int64
reamining_contract             float64
service_failure_count            int64
download_avg                   float64
upload_avg                     float64
download_over_limit              int64
churn                            int64
dtype: object

In [31]:
print(f"Number of duplicated data: {data.duplicated().sum()}")

Number of duplicated data: 0


# 4. Data Splitting
---

In [32]:
# function split input and output
def split_input_output(data, target_column):
    """
    Function to split input (X) and output (y)

    Parameters
    ----------
    data: <pandas dataframe>
          dataframe input

    target_column: <string>
                   output column name

    Return
    ------

    X: <pandas dataframe>
        input data

    y: <pandas series>
       output data

    """
    X = data.drop(columns = target_column)
    y = data[target_column]

    return X, y

In [33]:
# split input output
X, y = split_input_output(data = data, 
                          target_column = 'churn')

In [34]:
# Check data dimension
n_samples, n_features = X.shape

# Print number of samples and features
print(f'Number of samples  :    {n_samples}')
print(f'Number of features :    {n_features}')

Number of samples  :    70124
Number of features :    10


In [35]:
# Import train test split
from sklearn.model_selection import train_test_split



# Train and Test Split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.3,
                                                    random_state = 31,
                                                    stratify = y)


# Split the held-out 30% into validation and test sets (15% each)
X_valid, X_test, y_valid, y_test = train_test_split(X_test,
                                                    y_test,
                                                    test_size = 0.5,
                                                    random_state = 31,
                                                    stratify = y_test)

In [36]:
print(f"X_train shape : {X_train.shape}")
print(f"X_valid shape : {X_valid.shape}")
print(f"X_test shape : {X_test.shape}")
print(f"y_train shape : {y_train.shape}")
print(f"y_valid shape : {y_valid.shape}")
print(f"y_test shape : {y_test.shape}")

X_train shape : (49086, 10)
X_valid shape : (10519, 10)
X_test shape : (10519, 10)
y_train shape : (49086,)
y_valid shape : (10519,)
y_test shape : (10519,)
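
Since both splits pass `stratify`, the share of churners should be almost identical in the train, validation, and test sets. The sketch below shows one way to check this. It uses synthetic data with an arbitrary positive rate, so the printed rates say nothing about the real churn data, and the `*_demo` and short split names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (the positive rate here is arbitrary).
rng = np.random.default_rng(31)
n = 10_000
X_demo = pd.DataFrame({"feature": rng.normal(size=n)})
y_demo = pd.Series((rng.random(n) < 0.55).astype(int), name="churn")

# Same two-step 70/15/15 scheme as above.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=31, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=31, stratify=y_rest
)

# The mean of a 0/1 target is the share of positive cases.
rates = pd.Series({
    "full": y_demo.mean(),
    "train": y_tr.mean(),
    "valid": y_va.mean(),
    "test": y_te.mean(),
})
print(rates.round(3))
```

With the notebook's own variables, the equivalent check would compare `y.mean()` with `y_train.mean()`, `y_valid.mean()`, and `y_test.mean()`.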


In [37]:
# dump the data
joblib.dump(X_train, "data/raw/X_train.pkl")
joblib.dump(y_train, "data/raw/y_train.pkl")
joblib.dump(X_valid, "data/raw/X_valid.pkl")
joblib.dump(y_valid, "data/raw/y_valid.pkl")
joblib.dump(X_test, "data/raw/X_test.pkl")
joblib.dump(y_test, "data/raw/y_test.pkl")

['data/raw/y_test.pkl']
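
Later notebooks can read the saved splits back with `joblib.load`. The sketch below is a minimal round-trip check on a small made-up dataframe written to a temporary directory, so it does not touch the files under `data/raw/`. The names `X_demo` and `X_loaded` are illustrative only.

```python
import os
import tempfile

import joblib
import pandas as pd

# Small made-up dataframe standing in for one of the saved splits.
X_demo = pd.DataFrame({
    "bill_avg": [25, 0, 16],
    "download_avg": [8.4, 0.0, 13.7],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_demo.pkl")

    # joblib.dump returns the list of file names it wrote.
    saved_paths = joblib.dump(X_demo, path)

    # Load the object back while the temporary file still exists.
    X_loaded = joblib.load(path)

# The reloaded dataframe should match the original exactly.
print("Written files match path:", saved_paths == [path])
print("Round trip preserved data:", X_loaded.equals(X_demo))
```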